Sie sind auf Seite 1von 25

Reliability Engineering and System Safety 133 (2015) 275–299

Contents lists available at ScienceDirect

Reliability Engineering and System Safety


journal homepage: www.elsevier.com/locate/ress

Analysis of information security reliability: A tutorial


Suleyman Kondakci n
Faculty of Engineering & Computer Sciences, Izmir University of Economics, Sakarya Cad. No. 156, 35330 Balcova-Izmir, Turkey

art ic l e i nf o a b s t r a c t

Article history: This article presents a concise reliability analysis of network security abstracted from stochastic
Received 24 January 2014 modeling, reliability, and queuing theories. Network security analysis is composed of threats, their
Received in revised form impacts, and recovery of the failed systems. A unique framework with a collection of the key reliability
15 August 2014
models is presented here to guide the determination of the system reliability based on the strength of
Accepted 21 September 2014
Available online 30 September 2014
malicious acts and performance of the recovery processes. A unique model, called Attack-obstacle model,
is also proposed here for analyzing systems with immunity growth features. Most computer science
Keywords: curricula do not contain courses in reliability modeling applicable to different areas of computer
Availability modeling engineering. Hence, the topic of reliability analysis is often too diffuse to most computer engineers and
Reliability
researchers dealing with network security. This work is thus aimed at shedding some light on this issue,
Security
which can be useful in identifying models, their assumptions and practical parameters for estimating the
Risk assessment
reliability of threatened systems and for assessing the performance of recovery facilities. It can also be
useful for the classification of processes and states regarding the reliability of information systems.
Systems with stochastic behaviors undergoing queue operations and random state transitions can also
benefit from the approaches presented here.
& 2014 Elsevier Ltd. All rights reserved.

1. Introduction major parts: (i) attack and failure modeling, (ii) impact modeling,
and (iii) recovery modeling. Theoretical approaches alone are
One of the major reasons for developing concrete theories is to often difficult to satisfy these with rapidly evolving IT systems
enhance their practical applicability to some desired technology and emerging networking concepts. Simulations and empirical
and engineering fields. Our objective here is to find an efficient studies are only devoted to observe and assess system behaviors in
way of bringing parts from reliability theory into a practical order to substantiate the reliability in practical situations. Theore-
technique for the analysis of information security. As known, the tical models used for simulations and experiments need also be
main objective of information security relies on the provision of justified and matched to the well-established modes of operations.
the general triad called CIA, Confidentiality, Integrity, and Avail- Probabilistic approaches can be used to build impact models and
ability. Within this context, availability (as a measure of service estimate the loss due to system failures caused by the threats with
degradation) is the most critical factor that can cause immense predefined probability and hazard distributions. Additionally,
cost on the communication infrastructure and on general business queuing theory and stochastic processes (e.g., Markov chains)
outcomes. We usually have effective protection mechanisms for can be used to guide stationary analysis of system failures and
providing confidentiality and integrity. However, protecting from hazard functions together with the associated repair models.
threats causing unavailability is more complex and often requires The major problem for many information security engineers is
additional mechanisms (e.g., redundant or standby systems), as follows. Though theoretical frameworks are critical in guiding
which may also be under the same type of threats leading to research, however, in some contexts, they can be confusing and
additional losses. much cumbersome to apply. Especially, for complex network
Building reliable models for analyzing failures and impacts structures facing complicated threat types, matching a theoretical
caused by different threat types on information systems can be security model to an overall analysis need to be inductive,
extremely complicated, or models to describe these processes may tractable, explanatory, and well-thought to guide access to con-
not even exist. Therefore, in order to achieve at least an analytical cretely measurable results obtained from many interrelated con-
tractability, we need to separate the problem domain into three cepts and their influencing parameters. Reliability models dealing
with complex systems are numerous, and naturally, some of them
are too diffuse to some computer engineers to apply. As the
n
Tel.: þ 90 232 488 8256. complexity increases alongside with the drastically growing infor-
E-mail address: suleyman.kondakci@ieu.edu.tr mation systems’ diversity, adapting a concrete model becomes

http://dx.doi.org/10.1016/j.ress.2014.09.021
0951-8320/& 2014 Elsevier Ltd. All rights reserved.
276 S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299

more and more complicated. Therefore, obtaining practically 1.1. Outline of the paper
sound reliability functions that can address different paths of a
complex and redundant security structures are necessary. Following the introduction and the objective of this work
The objective of this article is thus to present and discuss in Section 1, a brief review about the related work is given in
several useful reliability models dealing with the availability Section 2. Section 3 describes the terminology used in this paper,
analysis of information systems. Two major reasons have triggered presents a concise overview of the key reliability models, describes
the development of this work: (1) lack of a holistic approach to the failure sources (threats) and associated failures, presents the
network reliability analyses in the current literature and (2) to method for constructing reliability structures from network struc-
enhance the knowledge of formal reliability analysis in the tures, and outlines the main steps of the reliability analysis used
computer engineering discipline. In fact, we find various work in here. Section 4 summarizes the threat categories and tabulates the
the literature that mainly consider the analysis of singular systems, candidate reliability models for modeling attacks and failures.
such as the analysis of a specific type of worm/virus propagation Assumptions and limitations regarding the discussed reliability
and analysis of software (SW) and hardware (HW) faults. With this models are presented in Section 5. Section 6 defines a new class of
work, we intend to bring front an umbrella framework that can the reliability patterns associated to the reliability of networks
facilitate the analysis of a broad spectrum of network structures, security. Section 7 presents a detailed discussion of the reliability
their components, interdependence among the interconnected models and their applications to the analysis of network security
structures, impacts (cost of service degradation) caused by the threats and failures. Impact analysis of the threats is presented in
associated threats, and performance analysis of the facilities used Section 8. Section 9 presents models for describing and analyzing
to repair the threatened systems. repair processes, service efficiency for recovery facilities, whereas
The second reason triggering this idea is such that the most Section 10 goes through a case study dealing with the recovery
of the universities and institutions around the world do not operations and service degradation caused by a virus infection
provide system reliability courses in their computer science scenario. A brief discussion on the presented material and the
departments. Thus, we have a good reason to provide the funda- feature extension of the work is given in Section 11. Finally, a
mental modeling approaches that are applicable to the analysis of detailed background of the model theories is presented in
various aspects of reliability and security of computer networks. Appendix A.
This paper will therefore emphasize this issue by presenting a set
of practical models derived from the reliability theory and queuing
systems. 2. Related work
Obviously, complexity of the reliability models increases in
parallel with the system complexity, which is then multiplied with There exist numerous work being considered within the gen-
the level of system redundancy, if implemented so. Communica- eral context of the reliability engineering. However, it is hard to
tion networks, Internet search engines, cloud computing environ- find approaches specific to the reliability analysis of network
ments, smart grid networks, and resources of the grid networks security. An earlier software failure analysis model was developed
are good examples of such a complexity that contain a high degree by [1], which presents a stochastic model for the software failure
of redundancy and discrepancy in the overall system structure and phenomenon based on a nonhomogeneous Poisson process.
the services provided by these globally distributed systems. A service reliability approach for a distributed software is con-
Defining a separate reliability function for each subset of such sidered in [2], where a distributed system was modeled as a single
conglomerate structures, and integrating them under a framework service system shared by some distributed clients. This could be
will always introduce additional complexities. Estimating the interpreted as a single service system shared among multiple
overall reliability of such a complex structure can be facilitated if customers using a control center that allows access for the client
we were able to get down to some concrete models from the machines. The system availability of the control center is deter-
theoretical reliability concepts. Accordingly, this paper is intended mined by the probability for itself to be available, which is also the
to provide a framework of models applicable to the reliability overall reliability measure representing its clients. As known,
analysis of computer networks. wireless communication networks often have degraded through-
Primarily, we customize a set of functions and models from the put of broadcast packets due to the nature of the transmission
reliability concept and lay down some model assumptions that are characteristics. Related to this, a composite reliability analysis is
specific to the analysis of information security. An appropriate given by [3], which illustrates three modeling approaches for
model can then be selected to determine system states as a metric composite performance and availability analysis. A high-level
reflecting the degree of the system availability. The results description language, stochastic reward net, and continuous time
obtained can then be used for risk assessment of systems under Markov chain are used to construct models for evaluating the
different situations. Related to this, as a special case, we present a performability measures of a channel allocation scheme in a
numerical analysis that determines the reliability measure of a wireless network. Service reliability in a grid system with star
network of susceptible computers, which are vulnerable to some topology is considered by [4], and a topological view on the
virus attacks and software failures. A reliability measure is reliability of a large-scale distributed system is presented by [5].
composed of a set of parameters, such as mean failure and One of the main objectives of our work is to facilitate the
recovery rates, total down times, service efficiency, and repairman quantitative reliability analysis of interconnected systems. Wher-
utilization. Some systems may experience alternating states, while ever applicable, a candidate model defined here will embody the
others experience increasing or decreasing failure states, depend- fundamental steps for assessing information security risks of a
ing on the cause of the failure or the efficiency of the recovery given network. The selection of the candidate model depends
operations. That is, we will analyze time-dependent states of some basically on the underlying assumptions for the applicability of the
suspected systems and determine expressions representing the model to the network under consideration.
failure and recovery rates of them as well as the stationary Reliability engineering covering various types of system safeties
characteristic describing the long-run reliability figures of these encompasses a wide spectrum of theoretical areas, each of which
systems. Accordingly, throughout the paper, we define an unreli- needs a closer look for accurate adaptation to more specific engineer-
able node as a repairable/renewable system since the node can be ing problems. Therefore, the discussion taken here can be considered
restored to operate after eventual failures. as a dedicated focus on the information security compared to that of
S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299 277

the topics we usually encounter in the recent literature on reliability availability, anticipate logistics and maintenance resource needs,
engineering. Computer networks are complex systems. There exist and provide long term savings in operations and maintenance costs
a number of theoretical approaches intended for both the computa- based on necessary optimizations.
tion and evaluation of reliability of complex systems, e.g., [6], in which In many cases we need to discuss several types of parameters in
a measure of complexity was defined by the number of paths found order to obtain more realistic and accurate results when analyzing
in the system, and reliability functions of the system are given. Using information systems that are either under dedicated attacks or
this definition, the reliabilities of some typical redundant systems vulnerable to frequent failures. As will be discussed soon, we have
with complexity were also obtained. Kondakci [7] suggests models for several classes of systems to consider regarding this issue, e.g.,
the analysis of complex attack patterns and their effects on informa- web-servers, mail-servers, firewalls, database systems, ERP, and
tion systems. A review on different aspects of the predictability project development systems. Each of these systems possesses
problem in dynamical systems complexity is given in [8]. Applying distinct vulnerability characteristics against threats that can cause
these approaches to modern networked systems is rather a compli- reductions in the system availability. Topics such as probability,
cated task. It could be more fruitful if we could bring the general statistics, reliability, stochastic processes, and queuing theory have
theories down to a practically sound ground, so that we can directly fundamental importance to be covered by a security engineer in
apply them to more specific engineering problems. Additionally, there order to successfully analyze the reliability of a given problem
exist numerous work dealing with availability analysis of computer domain. For perceiving a general knowledge, readers can refer to
networks, e.g., [9] presents a review on the software reliability in several sources among which the book of Trivedi, [21], can be a
distributed environments. Jian and Shaoping [10] suggests that the good choice to start with.
availability of interconnected networks is dependent on its topological Computer networks are in the class of complex systems whose
structure and network services performed simultaneously, and dis- analysis require further knowledge on the “complex system
cusses an integrated availability model, which considers both topo- analysis”, which can bear rather special assumptions. A discussion
logical connectivity and operational performance. Gupta and on the reliability of a system complexity is presented in [6]. Most
Dharmaraja [11] have studied the combined effects of resource networked systems are consecutively affected by each other via
degradation and security breaches on the Quality of Service of Voice propagated failures/errors that are triggered in a system element
over IP systems by a dependability model using the semi-Markov within a given structure. For example, a jitter-effect in a badly
process. designed software package can cause malfunctioning in other
It is hard to find a rigorous study in the most recent literature software packages if they interact with each other. Or, a worm
for determining the reliability of information systems that may entering a system can quickly spread to many interconnected
also incorporate the analysis of system security. Nevertheless, systems in a manner of a branching process.
some of them have been appropriately derived from the classical Approaches regarding the theoretical analysis of some conse-
reliability models, e.g., [12,13]. Our approach is based on a cutive system behaviors are discussed in [22–25]. Those mathe-
stochastic model delivering quantitative results, which can be matical models can be modified to match the reliability analysis of
easily applied to risk assessment of safety systems. A comprehen- many types of security attacks whose individuals reproduce in a
sive treatment of stochastic models related to system performance manner that is influenced by time and the degree of system
and reliability is presented in [14]. A Markov model for a multi- protection. Determining whether and how interactions quantita-
state repairable system is presented in [15]. Worm and malware tively impact operational choices, when dealing with risk assess-
attacks can cause degradation in system availability. These types of ments, can be a great challenge to many of us. A perspective of this
analyses are considered in depth by [16], where a concept of subject is considered by [26].
recurrent stochastic model was developed to define the states of a
recurrent epidemic model called REM. Accordingly, a concise cost
analysis of the Internet malware is presented in [17]. 3. Reliability structures and approaches
Mass-mailing can also be used as a typical denial of service
attack aiming at degrading the system availability and creating A threat is a circumstance or event bearing the potential that can
serious frustrations and annoyance for many users. cause harm to information systems through disclosure, destruction,
Solutions to complex problems alone would have little impact falsification, fabrication of data, and disruption of systems (denial of
unless we could provide additional optimizations. Instead of using service). An attack is an assault on system security that derives from
powerful algorithms and going on only with brute-force type an intelligent threat, that is, a deliberate attempt to the realization
approaches can be too computation-intensive. We can refer to a of a given threat or a set of threats on a system to evade security
wide range of sources and topics for discussions on the optimiza- services and to cause harms to the system. We generalize two sets
tion issue, e.g., a tutorial on multi-objective optimization by [18] of definitions here: (i) the terms threats, attacks, worms, viruses,
gives a closer look at an approach using the genetic algorithm. Trojans, pop-ups, and any other malware are interchangeably used
Other related source is given by [19]. to denote the cause (or source) of a failure, (ii) components, nodes,
Reliability analysis covers entire engineering community with hosts, machines, computers, and systems are also interchangeably
almost no boundaries. Hence, when dealing with risk analyses, we used to denote the system under analysis. Though there exist
may encounter several unforeseen surprises in many of the broad several aspects of Internet malware, such as malware intelligence,
fields [20]. Methods dealing with the analysis of reliability and risk discovery, deployment, and defense strategies, we consider mainly
estimation have been deeply considered by several researchers and the reliability analyses of IT systems without explicitly considering
institutions with high-availability concepts, e.g., NASA, where it has the defense factor. Some models may deal with systems containing
been stated that availability prediction and assessment methods combat (or defense) capabilities, which will be explicitly stated
can provide quantitative performance measures that may be used in when we consider these.
assessing a given design. We also agree that quantitative results Reliability is defined as the probability that a given system
obtained from the methods can lead us to collect accurate main- operates properly for a specified period of time. As a companion
tenance costs and help to develop alternatives to reduce life cycle definition to reliability, availability of a system for its users is
costs. Furthermore, analyses based on reliability predictions will defined as the relative frequency that the system works. Here, the
guide us to assess design options that can lead to precise definitions percentage of the successful up times of a system is considered as
of maintenance support concepts that can increase future system a measure of the system reliability. From another perspective,
278 S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299

unavailability is a probabilistic measure defined to be the prob-

A modified NHPP fault count model used for software reliability assessment, m(t) denotes the expected failure count observed by time t.
ability that a system fails during a specified period of time.
Unless otherwise stated, an entire network is termed as a
system, a node within the network is denoted as a subsystem, and
an element within a subsystem is termed as a component. That is,

.
b  α t  1
obeying the use of the reliability terminology, the term system is
sometimes interchangeably used to denote any system varying

e
from a single component to a variety of systems that are inter-

α
Also called Extreme value distribution: hazard rate is initially constant and then increases rapidly with time, v ¼
connected through any communication infrastructure.

failure rates, L-l avoids the strict bounding of IFR, DFR, and AFR. h(t) is DFR for κ r 1 and AFR for κ 4 1.
3.1. A brief overview of reliability models

Before proceeding it would be beneficial to briefly touch the

Appropriate for modeling systems with IFR, DFR, AFR, and systems n-stages of subsystems.
key modeling approaches. In order to analyze system states in
terms of probabilistic lifetime (survival time) distributions with
respect to non-repairable and repairable constraints several mod-
els have been proposed. A lifetime distribution model can be
represented by a probability distribution function (pdf), denoted

Attacks are branching, propagation rates are high but failure events are rare.

Weibull and gamma rely on strictly increasing, decreasing, and alternating


as f(t), which is also referred to as the density function. f(t) will
also represent time to failure distribution. From pdf we can obtain
the cumulative distribution function (cdf), denoted as F(t), which

Suitable for nonlinear hazard rates attributed to either IFR or DFR


Many distributions are modifications of the gamma distribution.
gives the probability that a randomly selected system will fail by
time t, that is, F(t) is the cumulative probability function of
unsuccessful lifetime. Since reliability, R(t), is defined as the
cumulative probability function of successful lifetime, then the
cumulative distribution function is the complement of R(t), i.e.,
RðtÞ þ FðtÞ ¼ 1:

and systems experiencing n-failure modes.


The reliability (or survival) function is thus RðtÞ ¼ 1 FðtÞ, and the
failure rate (or hazard) function is given as
f ðtÞ
hðtÞ ¼ :
RðtÞ

If exist, based on the stationary conditions, we classify failure rates


in terms of the failure growth as
Description

Increasing failure rate, IFR: A lifetime Ti is said to have increasing


failure rate if the failure rate function h(t) is increasing
for t 40.
Decreasing failure rate, DFR: A lifetime Ti is said to have decreasing
failure rate if the failure rate function h(t) is decreasing
for t 4 0. That is, a reliability improvement is observed
Pars

b; α
a; b
λ; α

λ; α

λ; κ
λ

during the test time.


Alternating failure rate, AFR: A lifetime Ti is said to have alternat-
ing or stochastic failure rate if the failure rate function h
ðλtÞk
k ¼ 0 k!

(t) is alternating for t 4 0.


α1

1 þ ðλ tÞκ
e  λt ∑

ae  bt

1
α

αt
e  λt

e  λt
R(t)

A reliability model will always make use of some parameters


be

that are specific to the system considered. Common to all models


Some of the key reliability models, Pars ¼ Parameters.

are the time t and the rate parameters often denoted by λ for
λκðλ tÞκ  1
1þ ðλ tÞκ
λα t ðα  1Þ

incoming events and μ for departing events, respectively. Further-


a b e  bt

more, models have also so-called shape, location, and scale


RðtÞ
h(t)

ev
f ðtÞ
λðtÞ

parameters. In order to explain the relations among the compo-


nents of a reliability model, we consider here the basic exponential
α

distribution and omit further discussions of the remaining dis-


λ αt ðα  1Þ e  αt

mðtÞy e  mðtÞ
λα t α  1  λt

½1 þ ðλ tÞκ 2

tributions, and rather summarize them in Table 1. It may already


λκðλ tÞκ  1

ev
e

y!

be noted that many useful models such as Poisson, Binomial,


ΓðαÞ
λe  λt

αt
f(t)

Erlang, Pareto, and some family of exponential distributions,


be

e.g., hyperexponential, are not included in the table. These models


will be discussed in Section 7 in accordance with their applications
Goel–Okumoto

to some special cases. Many systems exhibit exponentially


Log-logistic

distributed constant failure rate, λ, during their lifetime. Lifetime


Gompertz
Constant

Weibull
Gamma

analysis of such systems has the following formulation. A con-


Model
Table 1

tinues random variable (or event), X, is said to have an exponen-


tial distribution with the occurrence rate parameter λ, and its
S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299 279

probability distribution function (pdf) is 5. Human-related vulnerabilities:


( (a) security policy,
λ e  λx for x Z 0;
f ðxÞ ¼ ð1Þ (b) user vigilance,
0 for x o 1; (c) software maintenance operations: patches, upgrades, and
and its cumulative distribution function (cdf) is obtained as registry updates,
Z x ( (d) backup and recovery functions.
1  e  λx for x Z 0; 6. Hardware and hardware drivers.
FðxÞ ¼ f ðuÞ du ¼ ð2Þ
1 0 for x o 1: 7. Degree of network connectivity; the degree of openness to
external threats.
The reliability function, R(x), is
RðxÞ ¼ 1  FðxÞ ¼ e  λx : Some of the above items may require a specifically designed model
2 or a collection of approaches for assessing their reliabilities,
Mean and variance of X are E½X ¼ 1=λ and VarðXÞ ¼ 1=λ ,
whereas some others may use a common model with precisely
respectively.
distinguished assumptions. As will be discussed later, the selection
Furthermore, a random variable, X, is said to be memoryless if
of an appropriate model is mainly based on the threat type and the
each subsequent event is completely independent from the pre-
environment in which the system is being used.
vious events. For example,
PfX 4 τg ¼ PfX 4τ þt∣X 4 tg; ½τ; t Z 0: 3.3. Basic decomposition method
That is, assuming that X is the lifetime of an item, if the item is
alive at time t, then its remaining lifetime is at least τ þ t, and the Before considering the reliability analysis of a network we need
distribution of its remaining lifetime is the same as the initial to break down the problem domain into a set of individually
distribution, hence it has a constant failure (hazard) rate function manageable solution space. To do so, the following steps are
given by required to be successfully implemented.

f ðxÞ f ðxÞ f ðtÞ λe  λt 1. Build a topology of interconnected subsystems, e.g., Fig. 1.


hðtÞ ¼ ¼ ¼ λðtÞ ¼  λt ¼  λt ¼ λ: ð3Þ
RðxÞ 1  FðxÞ e e 2. Dissect each interconnected subsystem into components of
As can be noted, the time dependent reliability R(t) of a system interrelated parts, e.g., Fig. 2.
with the constant hazard rate λ is 3. Build a reliability diagram composed of all components, e.g.,
Fig. 3.
RðtÞ ¼ e  λt :
4. Identify/define reliability models/functions for each
component.
3.2. General threat context 5. Determine the component-wise reliability of each subsystem.
6. Identify/define recovery and performance models/functions.
Technically, functions used for IT security consist of a wide set 7. If required determine model/function for impact and risk
of independently developed software and hardware systems, analyses.
which always bear interdependence relations in some manner. 8. Compose and determine an overall reliability measure for the
Even though a networked system is working standalone it has entire network.
many interactions within its own subsystems and components. For
example, assuming a standalone PC having no connection to As an example, Fig. 1 shows a typical network, Fig. 2 represents a
external systems but it is made of many local components subsystem (e.g., a host from the host group G2), and Fig. 3
interacting with each other. The main hardware of a PC has no represents the decomposed reliability diagram of the subsystem
meaning without its operating system and software services shown in Fig. 2 (a).
(applications) provided to its users. On the other hand, an Inter- We have listed the fundamental threat sources above, each of
networked system has naturally interfaces to many other net- which can have impact in a different form on different network
works and systems, hence, it has a much more complex structure units. Though the impacts are varying, the most critical compo-
of interdependence both internally and externally. nent in this network is the router which is exposed to numerous
Although a model can seem to (intuitively) be appropriate for threats such as deliberate exposure, TCP reset, peer spoofing, session
a chosen target of evaluation, it may lead to imperfect results for hijacking, sniffing, falsification (route deaggregation, malicious rout
different environments each having different defense strategies injection, unknown route injection), resource draining, link cutting,
and usage profiles. Because the environments may suffer from and so on [27,28]. However, according to the reports from
different vulnerabilities leading to some associated exploits due to Symantec, viruses are on the top of the list having the highest
the density of the provided services (applications), the degree of spread via e-mails and files through the Internet. Worms and
connectivity to untrusted networks, and also due to flaws in the Trojan horses share the first place in malignancy. Malignancy of
protection policy. Generally, the analysis of information security the viruses take the second place. Viruses often leave back doors in
covers the following fundamental aspects. around the infected filesystems, which are extremely harmful, e.g.,
MyDoom virus and its back door exploits Doomjuice and Deadhat.
Worms can turn compromised computers into bogus e-mail
1. Operating system vulnerabilities. servers, which are able to send massive e-mails causing DoS
2. Application vulnerabilities. attacks on the victim. Application-specific intrusions mostly cause
3. Deficiency in network protocol implementations. buffer overflows leading to DoS on application servers and Web-
4. Protection systems ' concerns: servers. These attacks are becoming more sophisticated. SQL
(a) confidentiality of data, injection forces database servers to disclose secret information
(b) integrity of data, and passwords of database users and other principals to adver-
(c) access control and authorization, saries. Identity theft, or so-called Phishing, is most commonly
(d) worm/virus and malware protection, targeted at gathering private data from users surfing on the
(e) firewalls and packet filtering, Internet. The private data contain mostly bank account
280 S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299

Printing
Service

Host group 2 (G2) Diskless computers Confidential


Host group 1 (G1)
Internal switch

A A A A
E A E SA
dB Web Mail App File
Server Server Project dB
Server Server Server
server

Main Switch A = Authentication


SA = Secure authentication
E = Encryption
Internet Screening Router

Fig. 1. A topological diagram of a network under reliability analysis.

information and other authentication data used for accessing for encrypting the data. It has been shown that a dedicated
confidential areas. hardware with a cost of one million dollars can search all possible
As can be seen in Fig. 1, some components are tagged with A, DES keys in about a couple of hours. In fact the DES algorithm was
SA, and E. These specify the protection type for the tagged retired and replaced by the Triple-DES algorithm. Since the
component. Attacks to confidentiality and integrity of information computer resources get improved rapidly in these days, the
systems are far difficult to employ but nevertheless important. Triple-DES algorithm with a keylength of 168 bits is no longer
Confidentiality and integrity measures are mostly implemented to secure either. However, according to NIST, with the strongest
ensure the security of mission critical systems, and require rather hardware it will still take more than about 260 years to crack a
different approaches to analyze them. Reliability analysis of Triple-DES key.
cryptography is a completely different issue compared to that of Integrity of a message is provided by so-called cyclic redundancy
the availability analysis of IT systems. Hence, our main concern in checks or by employing some secure hash algorithms. That is, a set of
this paper is the determination of appropriate models that can be extra bits are added to the original message for both error detection
used to analyze the availability of IT systems. Authentication (A) is and correction. Although these mechanisms are quite robust, there still
the simplest form of a control mechanism that allows authenti- exist threats to the integrity of a message, e.g., eavesdropping, session
cated access of principals to a resource, often using simple PINs hijacking, and the denial of service attack.
and passwords. Such an authentication system can be easily
broken by use of password crackers. A system with secure
authentication (SA) is cryptographically protected against inten- 3.4. Constructing reliability structures
tional or unintentional attempts made to access the system.
As mentioned earlier, the security of a resource relies on the Computer networks contain complex system elements, where
provision of the grand triad called CIA, confidentiality, integrity, and each element consists of at least four groups of interacting
availability. Confidentiality of a system is provided by some subsystems: (1) hardware, (2) operating system, (3) communica-
cryptographic functions consisting of cryptic algorithms, encryp- tion unit (hardware and software), and (4) a set of service software
tion (E) and decryption (D), which disallow the disclosure of (so-called applications). Block diagrams of two typical systems, a
a resource to others than the authorized owners and principals. user computer (PC) and a network printer, are shown in Fig. 2.
Denoting simply, the function C ¼ EðM; K; AÞ encrypts the message Prior to the reliability analysis, these structures need to be
M to produce the ciphertext C using the key K and the algorithm A. converted to a reliability diagram consisting of only series (and
For decrypting the ciphertext C, function M ¼ DðC; K; AÞ is per- eventually some parallel) components together with their inter-
formed. As can be noted, the same algorithm and key are used dependence structures. To do so, we can apply the path-tracing
both for the encryption and decryption functions. There exist technique to identify all possible paths from the input end to the
several algorithms devoted to securing critical resources, e.g., RSA, output of the entire system. We can further apply reduction to
DES, and Triple-DES, of which DES uses 56-bit keys (256 different series element method and/or the minimal cut algorithm to simplify
combinations of a key). To break a cryptographic algorithm is the computation of the overall reliability structure. There exist
extremely hard, which requires intensive work performed by only several techniques and methods found in the literature for
intelligence communities or cryptanalysts. The most critical part of simplifying different reliability structures, e.g., [29], which uses a
a cryptographic system (or function) is the length of the key used formerly known recursive decomposition algorithm.
S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299 281

Comm. Unit App-1 Comm. Unit OJ-1 OJ-1 OJ-1

Spooler
Operating
(CU) (PCU) Micro Output queue
System
OS
(OS)
CHW CSW App-N PHW PSW
IJ-1 IJ-1 IJ-1
Main Hardware (MHW) Printer Hardware (PHW) Input queue

Fig. 2. Structures of a user computer (a) and a network printer (b).

NA NA

CHW CSW NA NA NA NA

OS = Operating system NA NA
MHW = Main hardware
CHW = Communication HW OS MHW
CHW = Communication SW
NA = Network application APP APP
APP= User application
APP APP APP APP

APP APP

Fig. 3. Reliability diagram of a user computer.

To illustrate the determination of the reliability diagram we 3.5. Main steps of the analysis process
consider here only the PC, whose structure is shown in Fig. 2 (a).
By referring to the roles and interdependency structure of the Most of the existing reliability models have been used for the
components of the PC, one can easily derive reliability diagrams of analysis of production reliability, life tests, and failure counts of
some other related systems, e.g., that of the network printer service facilities. Network security analysis is relatively a new-
shown in Fig. 2 (b). In contrast to the similarity in their HW comer into this area. Thus, applicability of these models to
structures, the major differences between the PC and the printer assessing the reliability of network (or information) security must
are such that the network printer has only one application (so- be substantially discussed together with the underlying model
called spooler) and additionally two queues representing input assumptions. Namely, it is important to take the key modeling
and output jobs. The reliability diagram of the user computer, approaches, analyze their underlying assumptions, find their
shown in Fig. 3, contains two sets of applications, network limitations, and justify their applicability to the analysis of
applications (NA) and user applications (APP) with some relations identifiable security threats.
to each other, while being independent of the common It is not easy to directly analyze system failures caused by
communication unit. unknown security threats. In order to easily identify system
Following the construction of the reliability diagram, we define failures their causes must be identified first. Thus, before diving
a reliability model (or function) together with its associated into the failure analysis, the identification and modeling of threats
parameters, and outline underlying assumptions. Furthermore, is more advantages. This can help determining the correlation
we combine the reliability functions of the components into a between the threats and the failures caused by them. Due to the
single structure in order to obtain an overall structure that large variations in the computer platforms, their applications,
expresses the reliability measure of the entire system. The relia- behaviors, and the threats to them, we have to consider several
bility measure together with some other system specific para- critical issues prior to choosing an appropriate model for the
meters (e.g., importance factor, threat and vulnerability measures) reliability analysis of a given system. Hence, we need to
can then be used to derive an impact metric for the risks
estimation regarding the system at hand.  identify threats/attacks to the system,
Overall reliability of a series system with n components is given by  unify test and observation models needed for experimenting
and model matching,
n  unify necessary models via generalized reliability concepts,
R ¼ ∏ Ri ; ð4Þ
i¼1  design a candidate model and identify the underlying
parameters,
and the reliability of a parallel system with n components is  determine the underlying assumptions and limitations of the
candidate model,
n  identify methods to obtain estimates of model parameters, e.g.,
R ¼ 1  ∏ ð1  Ri Þ: ð5Þ
i¼1 MLE method,
 identify and eliminate practical and theoretical difficulties in
Thus, it is trivial to determine the reliability of a serial–parallel or any parameter estimations,
combination of a system, which is left as an exercise to the reader, if  validate the applicability of the model to the analysis of the
desired so. problem domain at hand, e.g., goodness-of-fit test.
282 S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299

Although we are not going to detail these issues separately, a various DDoS attacks such as ICMP-flood, SYN-flood, Trin00,
sufficient framework covering most of the fundamental concepts Stacheldraht, TFN2K, UDP-flood, Smurf-attack, and so forth. A
will be considered here. TFN system is made up of a master server, clients, and
numerous daemons (slave servers) programs. Most of these
attacks have deterministic attack rates, which also can lead
4. Threat categories to constant failure rates. Stochastic models involving
repeated trials, e.g., Attack-obstacle, exponential, Poisson,
Generally, we have several categories of threats to networked Weibull, Gamma, and Log-logistic are the strongest candi-
systems that can effect a system in different ways. IT systems can be dates for modeling the variants of DoS attacks, depending
threatened by a specific type of threat or a combination of various on the way they can be realized, i.e., in random but
threat types, e.g., virus infections, Trojan injections, software failures, sequential phases or in parallel phases.
or service disruptions caused by some protocol attacks. Apparently, in Buffer overflow: This type of threats (e.g., Trojans) can be deliber-
order to analyze the reliability of a victim (target) system, or to ately inserted into applications or they may come from
measure the strength of eventual attacks, we need to select an misconfigurations, which when applied repeatedly may
appropriate reliability model to proceed with. In order to be consistent cause kernel area violations and excessive storage usage,
with the subject covered here, five major threat categories were and in turn, will result in denial of service. These threats
evaluated to consider here, where the reliability analysis of each can propagate to other interoperating software both
category can be satisfactorily bound to a realistic model. residing on the local and remote hosts. Threats that
randomly propagate among different applications on dif-
ferent hosts can be modeled as a Brownian motion. Buffer
Worms/viruses: Stochastic and deterministic worm propagation and overflow failures involving randomly repeated trials can be
extinction models usually give realistic results for rando- treated as a stochastic process, which can be modeled as a
mized mass worm attacks. In an earlier work [16], we have binomial process, whereas the density of the attacks
modeled an infection process as a binomial process, suscept- causing the failures can be modeled as a Poisson process.
ibility to infection as a Poisson process, however, arrivals of As of Assumption 1–3, a modified Gamma model can be
suspected hosts at a quarantine service were modeled as a used for systems that have immunity growth capability.
Poisson process, and the respective recovery operation was Variations of the Weibull and Pareto models can be a
modeled as a birth–death Markov process. Related to the candidate for tests dealing with strength-stress assess-
analysis of the propagation of malware (worms, viruses, ments of physical characteristics of memory systems and
Trojans, etc.), a variety of complex approaches mostly based page faults in operating systems.
on stochastic processes have been discussed in [16,30–33]. Remote malware injection: This category contains a broad spec-
It follows that branching processes for rapidly propagating trum of threats consisting of Trojans, SQL-injection,
worm and malware [16] and Poisson and Bernoulli pro- identity theft, spyware injection, key-loggers, phishing,
cesses for infrequently occurring incidents are good candi- unsolicited bulk messages (spams), cross-site scripting,
dates. pop-ups, and malicious e-mail attachments. Though the
Verification of the candidate models and parameter estima- attack density of these threats are specific to each class,
tions in this category are relatively easy compared to those however, incidents caused by them can be modeled as a
of the protocol-based threat analyses. Often, simulations and Bernoulli trial using the parameters generated from the
real-time experiments are used to collect failure data, protection profiles of the attacked systems and user
estimate the model parameters from the failure data, and behaviors (e.g., users' surf-pattern on the Internet and
perform goodness-of-fit test for checking the appropriate- security awareness of the user).
ness of the model. Experimenting with worms can be easily Human-related security: Users and administrators of systems may
performed by randomly “seeding” a certain number of also cause vulnerabilities to systems. User vigilance, informa-
worms in a network of computers each having a different tion leakage by social engineering, and unawareness of the
protection and usage profile. Observation of the failures is security threats can lead to exploits caused by all categories
then carried out within a given time duration while count- defined above. As an example, a spam trap is the inclusion of
ing the number of incidents. an option in an online web-form that is preselected by
Denial of Service (DoS): DoS attacks mostly make use of protocol default with the expectation that the user will fail to notice
deficiencies, which are aimed at degrading system avail- the option. If the form is submitted with the option selected,
ability. DoS attacks often have successful results reported as the user has given the adversary permission to send junk e-
to completely shutdown systems in several hours. A TCP- mails. Consequently, a number of infected e-mail attach-
SYN flood attack is an example of a typical DoS scenario, in ments, spear phishing by forged e-mails will continue to
which the adversary finds a hole through a deficiency show up in the user's mail box. Additionally, system main-
found in the TCP protocol implementation based on a loose tenance tasks including backup, restore, patch, upgrade, and
protocol description. Such a DoS attack launches a huge malconfigurations can introduce vulnerabilities while being
number of concurrent connection requests against its performed by the users and system administrators. Frequent
victim machine, which is being lured to believe that the software upgrades to newer releases for increasing the
requests are coming from all legitimate machines and functionality of a system can also be considered as a reliability
hence it tries to accept the connection requests from the issue. Reliability of test and debug operations, practical
attackers. While trying to accept the requests the victim implications of the company security policy, reliability of
machine quickly fills up the multiplexing buffer of its TCP protection system configurations, and flaws (e.g., zero-day
protocol, which then freezes the operating system, conse- attack) in emerging system installations are also considered
quently, it becomes unavailable. DoS attacks are imple- among the human-related security. It is hard to designate a
mented in a variety of forms, e.g., Tribe Flood Network known reliability model to this category, however, user and
(TFN) and Distributed Denial of Service (DDoS). The TFN system-administrator activities should be separately modeled
system is composed of a set of computer tools that conduct and verified.
S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299 283

Attacks are branching, propagation rates are high but failure events are rare. Since many nodes are involved k-out-of-n analysis

Massive attacks (e.g., by TFN) are hyperexponential, Due to protocol intelligence, models involving AFR, e.g., Attack-obstacle and

Propagation of threats are Brownian motions, failure rates are hypoexponential (serial system), Pareto for CPU- and memory-
Table 2 summarizes the threat categories and candidate models
for the analysis of attacks and failures. Common to each threat
category are Assumptions 1–3, if a related failure correction is
applied. If not explicitly stated, the remaining assumptions are
considered as default for all threat categories. It should be noted
that the candidate models shown are not customary, i.e., a given
model may be used for many types of analysis. Before choosing a
model, we need to perform a sensitivity analysis on the model

Threats are multifaceted each having own attack model, failure analysis is binomially distributed.
parameters. How the model behaves to changes in parameter
values and how it responds to changes in the structure of the
model are important issues in designing an experiment and
related models. Failure data and plots are useful in determining
the system behavior. Following the analysis of the data and plots,
an estimate of the model parameters can be obtained by use of a
method, e.g., maximum likelihood. Furthermore, an operation for
model matching need to be done by substituting the estimated
parameter values in the selected model. Finally, in order to
Difficult to model human behavior, mostly bear AFR characteristics.

eliminate eventual uncertainties, a goodness-of-fit test has to be


performed. If the chosen model does not fit, we have to gather
additional data and reassess the tests or try another model from
the candidate model list. Nevertheless, it may be hard to predict
the size of the additional data to collect or to find a better model.
In that case, carefully designed simulations may help to guide
sensitivity analysis and eliminate eventual uncertainties in the
chosen model and the underlying parameters.

5. Assumptions and limitations

As known, a reliability model can be represented as a prob-


ability distribution, life time distribution, a compound risk level, as
an instantaneous, or as a steady-state reliability (hazard) function.
usage analyses.

However, in practice neither testing nor proving of a model on a


Description

is required.

specific environment can guarantee a complete confidence in the


gamma.

appropriateness of the model. Because computer networks have


highly complex structures that require a through analysis of each
system in order to match a specific model. The models discussed
here can be considered as a complementary approach to assess the
Gamma, log-logistic, Erlang, Attack-

Binomial, Hypoexponential, Pareto

reliability of a specific part of a given environment rather than


being used as a competing tool. That is, due to the imperfectness of
the models in assuring accurate reliability assessments, a frame of
approaches is provided here to plan and conduct tests for collect-
ing data, experimenting, choosing a model, estimating model
Poisson, Binomial

parameters, and developing appropriate methods that can obtain


Failure model

a fitted model. This can further help developing analytical models


whose parameters will be estimated from the collected data and
Binomial
Binomial
obstacle

from the applied experiments. Reliability, impacts, risks, and


maintenance efficiency of failed systems can be determined by
referring to the approaches found in the suggested frame.
Hardware systems mostly suffer from component wearout.
Exponential, hyperexponential,

However, there is no such thing as wearout in software except a


Binomial, Brownian motion

continuously growing spectrum of software vulnerabilities/threats


that are taking place each epoch of time. That is, from the
Branching, Binomial
Threat categories and candidate models.

reliability point of view, threats to software systems are of a


Remote malware Binomial, Poisson

multitude size compared to that of hardware systems. Based on


Attack model

this assumption, reliability analysis of most hardware systems deal


Binomial

with relatively straightforward methods mostly applying classical


Poisson

fatigue analyses, which therefore are not considered here. Soft-


ware systems have, on the contrary, complex interdependence
structures that have various vulnerabilities and functions of
Buffer overflow

Human-related

interoperations under different usage profiles, which require


Worms and
viruses

complicated reliability models applying mostly times between


failures and failure count approaches. Many of the attacks and
Threat
Table 2

DoS

their impacts are often modeled as a stochastic process each


applying a different parameter set and different assumptions. For
284 S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299

example, as rigorously discussed and modeled by [17], most reliability distributions following the corrections are all different,
malware attacks taking advantage of the vulnerable applications R1 ðtÞ o R2 ðtÞ; …; o Rn ðtÞ, whose failure times (or incident counts)
on a computer occur mostly at random times. As extensively are also independent of each other.
discussed by [16,34–36], usage profiles basically affected by the
user vigilance are also of a stochastic character. Consequently, Assumption 3. Protocol intelligence: Some communication proto-
unless otherwise specified, attacks, exploits, vulnerabilities, risks, cols may have built-in mechanisms to observe the input activity
failures, and the overall impact are considered as random pro- and make the system to counteract accordingly. Upon the obser-
cesses, random functions, or variables as appropriate. vation of a threat pattern, the intelligent (or expert) mechanism
tries to thwart the attack by discarding the malicious packets. In
such realizations, the effect of the attacks is assumed to be higher
5.1. Significance of protection profiles during the observation (learning) period than that of the following
periods. Hence, the reliability figure of the intelligent protocols
Depending on the company policy, a network can be configured exhibits first increasing failure rates (IFR) followed by decreasing
to have some level of protection. A large network, if not strictly failure rates (DFR). This is due to the learned expertise gained by
applying necessary security mechanisms, will always have ran- the protocol.
domness in the overall vulnerability figure. This is due to the
diversity of operating systems and the large number of software Assumption 4. Randomness: Unless otherwise specified, threat
applications with several different vulnerability patterns. Some events and the number of vulnerabilities in an application as well
networks, e.g., many home networks, usually make use of freely as in a communication protocol is random. Some applications may
available anti-virus programs on their computers, and no other already contain discovered vulnerabilities denoted as determinis-
defense mechanisms at all. Some PCs and servers in some net- tic, however, system failures often occur at random times and
works may be individually protected by their users, while some under random circumstances, including the user vigilance and
other PCs have either none or some arbitrary level of protection. protection policies.
Obviously, a network with several different operating systems and
Assumption 5. Non-uniformity: Unless otherwise specified, all
hundreds of different software applications each developed by a
threats (attacks and vulnerabilities) contribute unequally to the
different vendor and through distinct design and production plans
(un)reliability of a system. The same target may react differently at
will always show stochastic behaviors against many undiscovered
different times regardless of the type of the threat.
security threats. Besides, attacks to the elements of a network take
random patterns and occur at random time epochs. Attacks Assumption 6. Reliability metric: If convenience calls so, we will
exploiting vulnerabilities inherent in an application may also explicitly specify the reliability measure as failure counts, working
result in nondeterministic failure patterns. times, attack counts, attack success rates, and/or attack durations.
For example, success rate of a DoS attack, the number of infected
5.2. General assumptions e-mail boxes, effect of spear phishing (e-mail spoofing), and the
success rate of a scan worm (e.g., how many nodes can be Trojan
At this point, it may serve quite well that we partition the injected by a worm in Δ t) are also typical counting processes.
model assumptions so that they specifically reflect two different
Assumption 7. Independent failure cases: Failures detected during
categories of analysis, responsive and nonresponsive, respectively.
the nonoverlapping time intervals are independent of each other.
The responsive analysis relies on an immunity growth factor in the
failed systems such that the failure rates decrease with the Assumption 8. Working time distribution: Unless otherwise spe-
increasing immunity factor while the causes of the failures are cified, working times between failures are independently distrib-
being eliminated. This means that intermediate failures caused by uted random variables.
attacks/vulnerabilities are being fixed so that the victim system
will have increased strength against further threats, consequently, Assumption 9. Rare events: Homogeneous Poisson processes
it will have decreasing failure rates. On the contrary, a nonrespon- (HPP) are used to model events with “rare” occurrences, hence,
sive analysis totally omits system fixes and focuses merely on the their rates fλ1 ; λ2 ; …; λn g are also exponentially distributed inde-
cumulative failure types, which in the most cases results in pendent random variables being defined by the nature of threats
increasing failure rates. We have to explicitly declare the category and the usage profile of the system under consideration. Three
of the applied method whether it will be responsive or nonre- main properties of a Poisson process are (1) events occurring in
sponsive. We consider here the following major assumptions different epochs are independent of each other, (2) events occur at
regarding both of these two categories. a constant rate, and (3) few events occur within a short interval
of time.
Assumption 1. Responsiveness: In order to capture the effect of
the immunity growth (responsiveness), vulnerabilities are fixed Assumption 10. Non-homogeneous Poisson: Events (attacks or
with certainty in a negligible time. Most software applications do failures) described by non-homogeneous Poisson processes
not contain automatic system corrections, thus, the necessary (NHPP) evolve as time dependent random functions or processes,
corrections are done manually by patches and/or by upgrades as as appropriate. Hence, as apposed to HPP, events of NHPP exhibit
appropriate. Depending on the system's reaction to a given randomly alternating rates. Time dependent probabilities of a
correction, a patch operation may result in increase, decrease, given number of events with the mean rate M(t) for the NHPP
alternating, or constant (towards the end) failure rates. But in most model can be calculated by
cases the cumulative result is assumed to be DFR.   MðtÞk  MðtÞ
P NðtÞ ¼ k ¼ e :
k!
Assumption 2. Responsive system reliability: Reliability is a func-
tion of all remaining failures. Assuming that all necessary correc-
tions have been made, the expected failure rate will ideally exhibit 6. Reliability patterns
a decreasing (DFR) behavior. In line with this, most of our
experiments show that the failure-time distributions exhibit We classify reliability models according to failure type and
alternating mean values after each correction. That is, the system complexity. The first class deals with three types of failure
S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299 285

rates: (1) increasing failure rate (IFR), (2) decreasing failure rate Commander (initiater)
(DFR), and (3) Alternating failure rate (AFR). The second class
considers the system complexity based on the diversity of the Slave server
systems and their components: (1) homogeneous and (2) hetero-
geneous systems, see Section 7.6. Common to both of the classes is Agent
the variety of threats.
Fitting a suitable model to any of the threat categories is usually
nontrivial. Because some threats (or attacks) and their ultimate
effects often take place in n interleaved stages. Related to this,
eventual system recovery and impact analyses following the fail-
ure phase will multiply the complexity of the model at hand. In
fact, in most cases, occurrence of the actual failure of a system is
Attack shower
such that the system fails when n independent failures have
occurred. The sequence of failures can be in tact with the strength
of their associated attacks. Therefore, it can be advantageous to
first analyze the attacks in order to facilitate the analysis of failures
and impacts. Causal and harmonic effects of the threats can also
arise due to the consecutive attacks emanating from different
sources [34–38]. Victim
Packet queues (buffers)

6.1. Increasing reliability patterns

As of Assumptions 1 and 2, a process with immunity growth


Fig. 4. Illustration of a massive attack.
results in decreasing system failures. The analysis is relatively
simple to perform, which often deals with initially increasing,
followed by a decreasing, and finally by a constant hazard rate, and the immunity growth function must be fitted to follow the
in all cases with constant recovery rates, if a steady-state exists. changes in the reliability pattern. Nevertheless, the model requires
We assume constant recovery rates, because we do often have a a (statistical) method to estimate this function together with the α
constant amount of resources and thus we have to apply the same parameter for justifying its applicability to a given problem.
techniques to correct a specific error. The formulation of DFR
(increasing reliability) is simple, because, suppose that the density 6.2. Decreasing reliability patterns
function for the nth failure of a reactive system for any time
t; ðt o sÞ is given by Depending on the protection level and usage profiles of the
Z s systems, decreasing reliability can be caused by almost all security
f ðnÞ ¼ F ðn  1Þ ðs tÞf n ðtÞdt; n 4 1; ð6Þ threats. Nonresponsive systems have higher probability to encoun-
0
ter increasing failures, hence decreasing reliability patterns. Some
and by Laplace transforming of Eq. (6), we obtain the first stage attacks, e.g., most of the DoS attacks such as TFN, SQL-slammer, and
cumulative function following the first correction operation as scan-worms intensify their pressure by either increasing the num-
F 1 ðsÞ ¼ 1  e  λ1 s ; ber of attack sequences or by harnessing additional nodes that
launch their attacks in parallel with each other. Consider the case of
 
s n λj a TFN attack as illustrated in Fig. 4 and suppose also that each event
F n ðsÞ ¼ ∑ αi F i ðsÞ; αi ¼ ∏ ; 8 〈i a j〉 ;
i¼n j¼1 λj  λi (attack burst) occurs independently with probability pi ; 1 ri r k.
These multiple parallel attacks are targeted against a single victim,
and thus, the overall cdf is whereas the response of the victim is sequential. Because the victim
"  #
n n λj has a single CPU system performing multitasking in a round-robin
F n ðsÞ ¼ ∑ ∏ ð1  eλi s Þ; i aj: ð7Þ fashion for processing the attacks. As illustrated in Fig. 4, the entire
i ¼ 1 j ¼ 1 λj  λi
process can be represented as a superposition of independent
However, by nature, some of the reliability patterns may evolve as Poisson processes. Since sum of n independent random Poisson
modified gamma distribution defined as variables is a Poisson variable, the superposition of n independent
  ½ξðiÞα λαi  1 e  ξðiÞλi random Poisson processes with respective average rates λ1 ; λ2 ; …; λn
P λi ∣α; ξðiÞ ¼ : ð8Þ is also a Poisson process with the average rate λ ¼ ∑ni¼ 1 λi . Assume
ΓðαÞ
that an attack (e.g., SYN-flood) is being carried out against an
Based on Assumptions 1 and 2, distribution functions of some intelligent TCP-protocol. The overall failure rate, during the learning
failures can contain an immunity growth parameter defined as an period by the attacked TCP-protocol, will grow to
increasing random function, ξðÞ. This function can be used instead n
of the scale parameter of the gamma distribution, which denotes S ¼ ∑ si λi :
i¼1
the quality of the responsiveness. Suppose an e-mail server having
a series of vulnerabilities each leading to a new infection. If an Because each attack bursts packets with rate λi (λi is the burst rate not
active anti-virus program performs effective and timely recoveries the individual packet rate), which in turn, introduces a respective
with the latest virus signatures, than it can help to increase the failure coefficient si at the victim node. The coefficient set fs1 ; s2 ; …; sn g
immunity of the e-mail server. Let us assume constant ξðÞ during denote the peak stresses, if they can be estimated to be the right
the failure period and focus on the changes in α. As known, parameters for that matter. Let each X packets of a burst result in x
α represents the shape parameter of the gamma distribution, if CPU-peaks, ξðtÞ denote the number of CPU-peaks at time t, and let N(t)
0 o α o1 then the distribution is DFR. Referring to Assumption 3, denote the total CPU-peaks occurring during ð0; t. Assume that we
during the learning period of a responsive protocol, the failure rate have NðtÞ ¼ n CPU-peak events at time t each lasting independently
becomes IFR for α 41. On the other hand, keeping α constant with probability p, then we have a sequence of n Bernoulli tries
286 S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299

t X packets
CPU usage (%)
Burst 1
Burst 2
s1 s3
Attack bursts
50 s2
s4
30
s1 s5
Burst n 10

t1 t2 t3 t4 t5 t1 t2 t3 t4 t5

Fig. 5. Correlation between the attack bursts and the CPU-utilization, 3λ ¼ 10% CPU usage.

describing the number of effective CPU-peaks at time t as (varying as k ¼ I 0 ; I 0 þ 1; …, I0), and γ denotes the mean
8  Poisson rate.
>
< n pk ð1  pÞn  k for 0 Z r n; Example: threat sampling: We have experimented with a fault-
PfξðtÞ ¼ k∣NðtÞ ¼ ng ¼ k ð9Þ seeding scenario, where a virus (Conficker) was randomly seeded
>
:
0 otherwise: via e-mail messages in a network of PCs with and without virus
protection tools. In order to justify the applicability of the
By the low of total probability we have
    predicted model, we have also implemented the scenario in a
  1 n ðλtÞn  λt simulation containing 100 randomly protected PCs. The unpro-
P ξðtÞ ¼ k ¼ ∑ pk ð1  pÞn  k e
n¼k k n! tected PCs are nonreactive, so that the same PC might be repeat-
edly infected by the same virus [16], hence leading to IFR system
ðλt pÞk  λt p
¼ e : ð10Þ behavior. The description of the chosen model (Poisson) is as
k!
follows. Let N(t) be the current (at time t) number of faults
The effect of attack bursts is illustrated in Fig. 5. detected and M the expected number of total virus incidents
As can be easily seen, the number of effective CPU-peaks at during the entire observation time. We assumed that the time to
time t has a Poisson distribution with parameter λt p, which detect the incidents, denoted as T 1 ; T 2 ; …; T M , are mutually inde-
corresponds to a certain percentage of CPU load. Depending on pendent and identically distributed random processes each
the system, some percentage of the CPU usage corresponds to a defined by a common function ξðtÞ. That is, the injection process
certain number of attack bursts. For example, the scale used in of the seeded virus is defined as a common Poisson process,
Fig. 5 is such that 3 parallel bursts produce a 10 percent of CPU because, a unique type of virus has been seeded, and the PCs were
load. But this is not always an accurate metric, since each all identical having the same operating system and the same set of
operating system has different paging and process scheduling applications. It follows that the probability of detecting incident i
schemes, the scale parameter must be empirically determined. at time t is ξi ðtÞ ¼ p. Thus, we have
Let n be the number of attack bursts and p be the probability of
 
successfully hitting the peak once, then the probability of exactly k M
PfNðtÞ ¼ k∣Mg ¼ ½ξðtÞk ½1  ξðtÞM  k : ð13Þ
attack errors is given by k
 
n
Pfe ¼ k∣ng ¼ pn  k ð1  pÞk ; ð11Þ Since M is Poisson distributed, its pdf would be given as
k

and the probability of a successful peak hit resulting from n attack   ðλtÞm e  λt
P MðtÞ ¼ m ¼ : ð14Þ
bursts is given by m!
Pfe ¼ 0∣ng ¼ sn ¼ pn ; 0 osn o 1; n ¼ 1; 2; …:
From the above assumptions, we can obtain the number of such
There exist several IFR attack classes aimed at degrading incidents at time t as
system availability. For instance, TFN attack and the massive warm
attack called SQL-slammer engage their attack-agents in the form   ½λξðtÞk e  λξðtÞ
P NðtÞ ¼ k∣λ; ξðtÞ ¼ : ð15Þ
of a branching process. Some scan worms develop extremely many k!
branches and grow exponentially by propagating their malware
via vulnerable applications. An extensive model developed by [16] As a corollary to this, [30] has achieved the following deduc-
shows that the scan-worm propagation evolves as a branching tions. If the number of attacks are large and the infection
process. There exist other deterministic and stochastic models probability of each attack is small then the process is Poisson
defined for the spread and detection of scan worms, e.g., [31]. See distributed. If, on the other hand, the infection probability of each
also Eqs. (20) and (21) for that matter. A similar method to that of attack is relatively high, which is often the case for insufficiently
presented by [16] has been suggested by [39], which determines protected small environments, then the process is binomially
the total progeny of the branching process as a Borel–Tanner distributed., i.e., the probability of n infections among m incoming
distribution, i.e., messages (both viral and clean) is binomially distributed and can
be determined by [16]
  I0
P I¼k ¼ ðkγÞðk  I0 Þ e  kγ ; ð12Þ
kðk  I 0 Þ
 
where I0 denotes the number of initially infected hosts, PfI ¼ kg m
pn ¼ pn ð1 pÞðm  nÞ :
denotes the probability that the total number of hosts infected is k n
S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299 287

6.3. Alternating reliability patterns where


∏nj ¼ 01 λj
Effect of the consecutive Poisson events from an identical AðiÞ
n ¼ :
ðλ0  λi Þðλ1  λi Þ…ðλi  1  λi Þðλi þ 1  λi Þ…ðλn  λi Þ
threat type has a constant rate (λ) regardless of the event history.
This assumption may not be appropriate for most of the counting For a pure birth process (no death), i.e., λn ¼ nλ, there is always a
processes. A more realistic counting process can be obtained population growth, which evolves as a Yule Process. In this process
by making λ to depend on the nature (state) of the process. Eq. (16) takes the form:
On the other hand, assume that an event has occurred n times
P 0n ðtÞ ¼  nλP n ðtÞ þ ðn  1ÞλP n  1 ðtÞ; n 4 0:
during ð0; t. Following this, the evolution of the events during
a sufficiently small time interval Δt is independent of the time By setting P 1 ð0Þ ¼ P i ð0Þ ¼ 0; i 41 and considering the same solu-
since the last occurrence. Thus, the rate of the events will exhibit tion as of Eq. (17), we get
either decreasing or increasing behavior, depending on the current P n ðtÞ ¼ p qn  1 ; ð18Þ
state, as a birth and death Markov process. For example, a worm
attack may first rise to an increasing infection count, which will where for the exponential event rate, Pn(t) is a geometric distribu-
actually proceed as long as the birth of n new infections take place tion with p ¼ e  λ t and q ¼ ð1  e  λ t Þn  1 [16].
during some time ð0; t (IFR). The process may revert to a count If we start with a population size i, then we can say that the
down process exhibiting a death period, (ðt þ Δt; t n ), due to the population size at time t is the sum of i geometric random
disinfection of the infected nodes leading to an immunity growth, variables. In this case, Eq. (18) becomes a negative binomial
hence, decrease in the number of infected cases (DFR). As detailed distribution defined as
 
by [16], the probabilities of the event evolutions are as follows: n1
P n ðtÞ ¼ e  λt ð1  eλt Þn  i : ð19Þ
ni
 a single event with probability p ¼ λΔt þ oðΔtÞ,
 zero event with probability q ¼ 1 λΔt þ oðΔtÞ,
 more than one event with probability oðΔtÞ. 6.3.2. Decrease in failure rates
Death process: As a counter part of the birth process, the death
These transitions can be generalized to express state transitions process can be used to model many physical phenomena in which
for many failure and recovery cases that possess a finite number of the population tends to decrease in size. Imagine an ideal disin-
states with Markovian property, see Appendix A for more details. fection process followed by a virus spreading process, or assume
The above definitions yield for 1-step transition probabilities, an attack-reactive system with immunity growth feature. The
however, we need to obtain the n-step transition probabilities in hazard function incorporating an adequate immunity growth
order to entirely analyze the case. This is done by first defining the feature can be applied to modeling both of these processes. The
first-step transition probabilities, and then defining a set of disinfection (or reaction) processes can be modeled as a death
differential equations in the form of the well-known Chapman– process starting with i 4 0 members, which are being recovered at
Kolmogorov equation [21,40]. The (forward) Chapman–Kolmogorov a rate μ and the recovery process proceed until the size of the
equations will then be used to build a generator matrix for failed population converges to zero. Here, “death” means the
calculating the rates for each single transition. One can also build “death of failures”. In an interval of ðt; t þΔt, analogous to the
a transition matrix for calculating the probabilities for each single birth process discussed above, we have the following probabilities:
transition of the states until a steady state is met [30].
 a single recovery with probability p ¼ μΔt þ oðΔtÞ,
 zero recovery with probability q ¼ 1  μΔt þ oðΔtÞ,
6.3.1. Increase in failure rates  more than one recoveries with probability oðΔtÞ.
Pure birth process: Assume an effective worm or DoS attack
resulting in some nonrecoverable situations. The failure count will Given that the initial population (recovery queue) size is n and let
naturally increase exponentially due to the attack intensity. Let μn be the recovery (or immunity growth) rate, we can obtain the
Pn(t) be the probability that n events occur in time ð0; t. By generator matrix by applying the same procedure as of the pure
referring to the Markov property of state transitions, we get the birth process. Let P 0n;n  1 ð0Þ ¼ μn and P 0n;n ð0Þ ¼  μn , thus
0 1
generator coefficients for this evolution as  μi μi 0 0 …
B  μi  1 μi  1 0 … C
P 0n;n þ 1 ð0Þ ¼ λn and P 0nn ð0Þ ¼ λn : A¼@ 0 A:
⋮ ⋮ ⋮ ⋱ ⋱
Thus, the generator matrix A describing the transition states of the
Again, by the forward Kolmogorov equations, we get
failure process will be obtained as
0 1 P 0i ðtÞ ¼  μi P i ðtÞ
 λ0 λ0 0 0 …
P 0n ðtÞ ¼  μn P n ðtÞ þ μn þ 1 P n þ 1 ðtÞ; i 4n Z 0: ð20Þ
B 0  λ1 λ1 0 … C
B C
A¼B C: Thus, assuming an exponential rate for this pure death case, i.e.,
@ 0 0  λ2 λ2 0 … A
μn ¼ nμ, solution of Eq. (20) leads us to
⋮ ⋮ ⋮ ⋱ ⋱  
i
By forward Kolmogorov equations we get P n ðtÞ ¼ e  nμt ð1  e  μt Þi  n ; n r i; ð21Þ
n
P 00 ðtÞ ¼ λ0 P 0 ðtÞ which is a binomial distribution with p ¼ e  nμt and q ¼ ð1  e  μt Þi  n .
A detailed discussion of this model dealing with a worm propagation
P 0n ðtÞ ¼ λn P n ðtÞ þ λn  1 P n  1 ðtÞ; n 4 0; ð16Þ
case is presented in [16].
and solving these differential equations we obtain
6.3.3. Alternating failure rates
n
P n ðtÞ ¼ ∑ AðiÞ
n e
 λi t
; n Z0; ð17Þ Birth and death process: In many physical situations both
i¼0 increase and decrease of states can occur in the same setting. That
288 S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299

is, combining the features of both birth and death Markov 7.1. Poisson and binomial distributions
processes, we can model many realistic problems related to worm
propagations, massive attacks, network traffic analyses, mainte- To validate failures involving the exponential distribution and
nance problems, and queuing systems dealing with recovery having the memoryless Markov property, we can make use of
problems. As a special case (considered in Section 7.5), we have properties inherent to Poisson events. As an example, the times
modeled the failures caused by a DDoS attack as a “birth” process between different types of attacks launched on different machines
and the reactiveness of the attacked system as a “death” process. within the same network can be modeled as a Poisson process
Availability of the victim system degrades during the birth period with the density parameter λi each. That is, for each machine
and increases during the death (reactive) period. The reactiveness receiving different attack densities, λ1 ; …; λm , we obtain the prob-
is an explicitly built-in intelligence within the victim system, ability of exactly n attacks being observed in t time units can be
which automatically obstacles the attack packets, hence, decreas- computed by
ing the cumulative failures following each obstacle period. In
½ðλ1 þ λ2 þ; ⋯; þ λn Þtn  ðλ1 þ λ2 þ ;⋯; þ λm Þt
addition to Appendix A, we can find the necessary material on pn ðtÞ ¼ e : ð26Þ
n!
the derivation of the related mathematical models in diverse
sources, e.g., [16]. In this model a process is a combination of both One can refer to the additive property of the Poisson process to
the birth and death processes whose transitions lead to the verify Eq. (26) as follows. Let X 1 ðtÞ and X 2 ðtÞ be two Poisson
generator matrix: arrivals with average rates λ1 and λ2, respectively. Now, let
0 1 XðtÞ ¼ X 1 ðtÞ þ X 2 ðtÞ, and for t Z 0 we have
λ0  λ0 0 0 0 …
Bμ   ðλ1 tÞn1
B 1  ðλ1 þ μ1 Þ λ1 0 0 …C C P X 1 ðtÞ ¼ n1 ¼ e  λ1
A¼B C: n1 !
@0 μ2  ðλ2 þ μ2 Þ λ2 0 … A
  ðλ 2 tÞ
n2
⋮ ⋮ ⋮ ⋱ ⋱ ⋱ P X 2 ðtÞ ¼ n2 ¼ e  λ2 ð27Þ
n2 !
Based on the generator matrix, we can see that the respective
Using these, we can also show that X(t) is also a Poisson process.
forward Kolmogorov equation describing the birth and death
Hence,
process is
  n    
P 0n ðtÞ ¼  ðλn þμn ÞP n ðtÞ þ λn  1 P n  1 ðtÞ þ μn þ 1 P n þ 1 ðtÞ: ð22Þ P XðtÞ ¼ n ¼ ∑ P X 1 ðtÞ ¼ n  n2 P X 2 ðtÞ ¼ n2
n2 ¼ 0
With the initial conditions: n ðλ1 tÞn  n2  λ2 t ðλ2 tÞn2
( ¼ ∑ e  λ1 t e
1 if n ¼ i; n2 ¼ 0 ðn  n2 Þ! n2 !
P n ð0Þ ¼
0 if n a i; n λn1  n2 λn22
¼ e  ðλ1 þ λ2 Þt t n ∑
n2 ¼ 0 ðn n2 Þ!n2 !
and, if both of the birth and death processes take place linearly !
dependent on the population size, i.e., λn ¼ nλ; μn ¼ nμ, and ρ ¼ λ=μ, ½ðλ1 þ λ2 Þtn n n
¼ e  ðλ1 þ λ2 Þt ∑
we get n! n2 ¼ 0 n 2
 n  n2  n2
1  e  ðλ  μÞt λ1 λ2
P 0 ðtÞ ¼ μ ; 
λ μ e  ðλ  μÞt λ1 þ λ2 λ1 þ λ2
P n ðtÞ ¼ ð1  P 0 ðtÞÞð1  ρ P 0 ðtÞÞðρ P 0 ðtÞÞn  1 : ð23Þ ½ðλ1 þ λ2 Þtn
¼ e  ðλ1 þ λ2 Þt ð28Þ
n!
Assuming that the only nonzero transition probabilities of this
process are [17] Example. Suppose a pair of independent DoS attackers, say A and
λ0 ¼ μ; λ1 ¼  ðλ þμÞ; and λ2 ¼ λ; B, bursting independently attack packets on a router with Border
Gateway Protocol (BGP). Attacker A performs a Resource Exhaustion
then we get a set of useful functions describing the probabilities of
attack and B carries on with Link Cutting attack, both of which eats
the first and the last states as
up the CPU time and storage resources of the BGP protocol
P 0 ðtÞ ¼ μγ; independently. The ultimate effect of the attacks is the service
P n ðtÞ ¼ ð1  λγÞð1  μγÞðλγÞn  1 ; 8 n Z 0; ð24Þ degradation. The number of attacks succeeded by A has a Poisson
distribution with mean 5 and number of attacks succeeded by B
where has mean 4 and has the Poisson distribution. Let us find the
8 probability that the total number of successes is 10. By referring to
>
> 1  eðλ  μÞt
>
< if λ a μ; Eq. (28), we obtain the result as
μ  λeðλ  μÞt
γ¼ ð25Þ
>
> t   ð5 þ 4Þ10
>
: 1 þ λt if λ ¼ μ: P XðtÞ ¼ 10 ¼ eð  5 þ 4Þ ¼ 0:119:
10!

7.2. Double-exponential distribution


7. Models for IT security analysis
Double-exponential distribution, or so-called Pareto distribu-
Below are the models and distribution functions that are tion, can be used to model the CPU usage consumed by a random
widely used for the reliability analysis of a broad spectrum of process. For example, time spent by a screening router's CPU for
systems. As necessity arises, we will discuss their use and present blocking malicious network traffic, frequency of the portscan visits
some modified versions of them in order to apply them to analysis on a server, or the number of attack packets in a DoS attack can be
of network security. Justifications and proofs of some of the modeled by a Pareto distribution. How appropriate can be the
presented models and underlying assumptions are kept at mini- model for a given problem depends on the location parameter, λ,
mum, while proofs of more general approaches are avoided. which is the random variable describing the process. The pdf of a
S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299 289

Pareto distribution is given by It should be reminded from the related theorem that if
α α1 X 1 ; X 2 ; …; X r are mutually independent and identically distributed
f ðxÞ ¼ αλ x x Z ½λ; α; λ 40; ð29Þ
random variables, where each Xi is exponentially distributed, then
1
the cdf: the random variable ∑ri ¼ 0 X i has an r-stage Erlang distribution
8  α with parameter λ. Hence, by this assumption Eq. (35) is an r-stage
< 1 λ
>
for x Z λ; Erlang distribution function. It follows that we can express this
FðxÞ ¼ x
>
:0 function in terms of an Erlang hazard rate as
for x o λ;
λr t r  1
the failure rate: hðtÞ ¼ ; ½t; λ 4 0; r ¼ 1; 2; …: ð36Þ
r1
8 ðr  1Þ! ∑ ðλtÞk
< α for x Z λ; k¼0
k!
hðxÞ ¼ x
: 0 for x o λ;
This implies that the victim's lifetime has an r-stage Erlang
distribution expressed by
and the reliability for x Zλ is
 α λt t r  1 e  λt
λ f ðtÞ ¼ ; ½t; λ 40; r ¼ 1; 2; …: ð37Þ
RðxÞ ¼ : ð30Þ ðr  1Þ!
x

7.3. Erlang distribution 7.4. Weibull distribution

It follows from Section 6.2 that the overall effect of super- Weibull belongs to one of the most widely used parametric
imposed attacks can be modeled as a hyper-Erlang distribution, category of distributions often used for hardware fatigue analyses
assuming that the distribution is discrete. An hyper-Erlang dis- [41]. A nonlinear hazard function is used when the failure rate
tributed random variable X has the following pdf: does not change linearly with time. We can obtain an IFR, DFR, or
n constant failure distribution by a proper choice of its shape
SðxÞ ¼ ∑ pi Eri ðxÞ; 0 o pi r 1; ð31Þ parameter, α. A DoS or a bufferoverflow attacked system will often
i¼0
go through a mortality phase of first IFR followed by a constant
where Eri denotes an r-stage Erlang distribution each stage having failure rate. The latter phase is thus the highest unreliable state of
the probability pi. As will be discussed later, for situations where the system that leads to restart during which a natural DFR phase
the attack times are exponentially distributed with different rates is active.
(λi a λi þ 1 ) with random time intervals (T i aT i þ 1 ), then the events
can be modeled by an r-stage hyperexponential distribution:
7.5. Attack-obstacle model: a case study
r
f ðtÞ ¼ ∑ αi λi e  λi t ; 0 o αi r 1; t 4 0: ð32Þ
i¼1 Attackers may take advantage of deficiencies found in a
Suppose a victim system having a maximum tolerable stress level, protocol implementation. Attacks are generally tunneled via these
say Sm, which when exceeded will cause random peak failures. protocols to gain access to specific applications. There are several
Such failures caused by massive attacks (bursts of attack packets) other protocol deficiencies that allow an attacker to build fraudu-
can be defined in terms of peak stresses, which are assumed to lent TCP/IP packets leading to disruptions. But these attacks can be
follow a Poisson distribution having the attack effect rate para- thwarted with the aid of secure protocol implementations, which
meter λs t, which is defined for time ðt; t þ sÞ. The effect of the have some built-in intelligence. Intelligent protocols first track the
attacks varies with the use of CPU and memory resources of the network traffic in a learning period, and in turn, try to obstacle
target system. The number of peak stresses (point unreliability), eventual attacks, all in real-time. For example, some firewalls and
St, during the time interval ð0; t is given by routers may have an intelligent combat module used as a real-time
attack blocker. Such systems with quasi-renewal capabilities can
e  λt ðλtÞr be approximated to an immunity growth model, Assumption 3. An
P fSt ¼ r∣λg ¼ ; λ 4 0; r ¼ 0; 1; 2; …: ð33Þ
r! immunity growth model composed of a failure period (birth)
As of the property of Poisson events, the number of attacks is very followed by a short recovery (death) period can be ideally
high while the probability of causing failures is relatively low [16]. modeled by a birth and death Markov process. Analysis of firewalls
The reliability of the system prior to the peak stress period can be and intrusion detection systems with real-time responsiveness can
expressed in terms of the victim's uptime X related to St satisfying benefit from this model.
Given the probability of the attack accumulation (or density) a
½X 4 t ¼ ½St or:
(t) at time t¼ 0, which is defined by the relation between the
Thus, the reliability of the system prior to the peak stress period attack rate λ and blocking rate μ of a massive attack by
can be expressed as
1 eðλ  μÞt
RðtÞ ¼ P fX 4 t g ¼ P fSt o r g aðtÞ ¼ ; ð38Þ
μ  λ eðλ  μÞt
r1
¼ ∑ P fSt ¼ r∣λg we get the probability of having n new offspring of DDoS attacks at
k¼0
time t leading to the peak stress on a single victim as
r 1ðλtÞk
¼ e  λt ∑ : ð34Þ
k ¼ 0 k!
P n ðtÞ ¼ ½1 λ aðtÞ½1 μ aðtÞ½λ aðtÞn  1 : ð39Þ

The cumulative distribution function, F(t), as the complement of As discussed in detail by [16], the above formulation assumes that
R(t) is defined as the attack intensity initially takes the form of a branching process,
see Eq. (24). This function can be further expressed in terms of the
r 1 ðλtÞk
FðtÞ ¼ 1  RðtÞ ¼ 1  e  λt ∑ ; ½t; λ 4 0; r ¼ 1; 2; …: ð35Þ CPU-usage of the victim, which also clarifies the relation between
k ¼ 0 k! the attack and blocking behaviors together with the attack density
290 S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299

Fig. 6. Observation of the peak stress on a DDoS attacked Windows system.

n leading to an ultimate reliability function: 7.6.1. Sequential phases


        n  1 Depending on the degree of diversity in the system elements,
1  eðλ  μÞt λ 1  eðλ  μÞt μ 1  eðλ  μÞt λ the analysis process can be split up into some phases. A process
An ðtÞ ¼ 1  1  :
μ  λ eðλ  μÞt μ  λ eðλ  μÞt μ λ eðλ  μÞt with sequential phases augment to a form of hypoexponential or
ð40Þ an Erlang distribution, depending on whether the phases have
identical distributions [21]. It should be noted that, the Erlang
The unavailability function, defined as 1 An ðtÞ, has been verified
distribution is the distribution of the sum of k independent and
by simulations and real-time experiments run on a Windows
s identically distributed random variables each having an exponen-
system having an Intel Core i7TM 64-bit CPU @2.4 GHz and 8 Gb
tial distribution. The Erlang distribution can be used to measure
of main memory. We have experimented with a DDoS scenario by
the time between incoming threats, where the failure events occur
simulating and testing it on a Windows system without any
independently with some average rate as a Poisson process. Some
known protocol intelligence, but with an active Windows defender.
aspects of the Erlang distribution is considered in Section 7.3. See
As obviously seen in Fig. 6, the peaks of the CPU-utilization initially
also for some model similarities found in phase-type distributions
increases exponentially and decays slowly, followed by
[43,44]. It has been shown that the entire time is hypoexponentially
a an abrupt high stress causing a very high CPU-utilization, which
distributed for a process spreading its time over different sequen-
lasts until the attacks seas. The abrupt increase in the stress level is
tial time slots (phases), where the time in each phase are
due to the fact that the TCP/IP implementation of the Windows
independent of the others and exponentially distributed. Suppose
system has no intelligence, i.e., the embedded firewall was turned
we have a 2-stage random process with the exponential para-
off, disabling the proactive defense. However, as can be seen in
meters λ1 and λ2 ; λ1 aλ2 . The pdf of the process is given as
Fig. 6 (a), the attack density has been initially significantly low, but
favoring some short growths at random epochs due to the random- λ1 λ2   λ1 t 
ness of the CPU-peaks. The peaks correspond to the learning f ðtÞ ¼ e  e  λ2 t ; λ2 4 λ1 ; t Z 1; ð41Þ
λ2  λ1
periods while decays (decrease in CPU usage) show that the system
is in the obstacle mode. its cdf is given by
λ2 λ1
FðtÞ ¼ 1  e  λ1 t þ e  λ2 t ; ð42Þ
7.6. Model complexity λ2  λ1 λ2 λ1

Sometimes the occurrence of attacks, their impacts resulting in and the related failure (hazard) rate is given by
 
system failures, and the process of system recovery evolve in λ1 λ2 e  λ1 t  e  λ2 t
hðtÞ ¼ : ð43Þ
multiple sequential stages. For example, the so-called Watering λ2 e  λ1 t þλ1 e  λ2 t
hole attack [42] uses three contiguous stages to compromise an
organizations’ resources. Some organized attacks may occur in It shows that a hypoexponential distribution exhibits IFR from
parallel or in some combined forms. Besides, a modern IT zero up to minðλ1 ; λ2 Þ.
environment consists of HW and SW components each with
different vulnerability characteristics and different protection
possibilities. Due to these, building an appropriate model is a
complex process, where different models can be combined to 7.6.2. Parallel phases
achieve more realistic results. An IT environment with mostly If a process consists of alternate phases, where each phase has
identical systems is treated as a homogeneous system, while the some incoming events each with the probability α, then the phases
one with a mixture of different systems is treated as a hetero- can be modeled to be exponentially distributed. It follows that the
geneous or nonhomogeneous system. Reliability analysis of hetero- overall distribution will rise to a hyperexponential distribution.
geneous systems is significantly complex compared to that of the The density function of a k-phase hyperexponential random event
homogeneous systems. Nevertheless, for both of the categories the chain is given by
decomposition steps of the analysis need to be handled with great k k
care. Below are the two most widely used mathematical models f ðtÞ ¼ ∑ αi λi e  λi t ; ½t; λi  4 0 and αi 4 0 with ∑ αi ¼ 1:
that can be applied to solving problems involving sequential and i¼1 i¼1

parallel phases of operations. ð44Þ


S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299 291

and its cumulative distribution function is stages with nonidentical behaviors. If the interconnected stages
k
behave identically against threats then the computation of the
FðtÞ ¼ ∑ αi ð1  e  λi t Þ; t Z 0: ð45Þ cumulative reliability can be computed by applying the Bernoulli
i¼1 process as
The hazard function is thus !
k n  jþ 1
Rðp; k; nÞ ¼ ∑ pn  j ð1  pÞj : ð48Þ
∑i f ðiÞ ∑i αi λi e  λi t j¼0 j
hðtÞ ¼ ¼ ; 1 r i r k; t Z 0: ð46Þ
∑i RðiÞ ∑ i α i e  λi t
A Bernoulli process is a sequence of independent identically
Here, αi denotes the probability of the ith event and λi denotes the distributed (iid) Bernoulli trials. Independence of the trials implies
occurrence rate of the ith event. The hazard function, Eq. (46), is that the process is memoryless. Given that the probability p is
a DFR starting from the upper range given by ∑i αi λi down to known; past outcomes provide no information about future out-
minðλ1 ; λ2 ; …; λk Þ. An application of the hyperexponential distribu- comes. However, the past informs about the future indirectly
tion dealing with the analysis of network performance models is through inferences about p while the process undergoes the trials.
presented in [45]. Since the hyperexponential distribution exhibits On the contrary, if the stages do not behave identically (variable
DFR it can be used with immunity growth models applied to IFR/DFR) then we need to analyze every possible path of the
systems having different random parameters for different events reliability structure in order to accurately determine the overall
each showing various vulnerabilities under different situations. Or, reliability of the system at hand. In most basic cases where only
it can be used to model an attack that completes its mission in qualitative results are required then a Fault Trees method can be
several consecutive phases. For example, a targeted attack takes used to simplify the solution. However, analysis of sequential and
place in four phases: incursion, discovery, capture, and exfiltration. functional interdependencies between components can be cum-
Each of these events has a different probability distribution for bersome to deal with the fault-tree method, instead one may
ðα1 ; α2 ; α3 ; α4 Þ and a different cause effect (λ) [46]. Another example choose a more deductive approach using Bayesian belief networks
of consecutive events is performed by a Trojan, which can [47]. There exist some other mathematical models that can be
propagate through many computers in a network whose impact modified and used further. For example, reliability analysis of
on the visited computers are random. consecutive k-out-of-n systems with non-identical components
Similar to the TFN attacks, illustrated in Fig. 4, an attacker is lifetimes considered by [22].
often interested in finding a way to (stealthily) access the resources It is also necessary to consider the hazard function of the
within a network of distributed resources. This is a so-called systems when attacked, whether the failure rate increases/
reconnoissance process often performed by portscanner tools, e.g., decreases linearly by time or the stages go through constant
nmap and Nessus, which launch their scans in parallel; often via a failure rates. This complexity can be modeled as k-out-of-n system,
single scanner (also pseudo-parallel). Since the scanned system where each stage having a different failure function. Many com-
believes that the scan is a benign connection request, it uses its CPU plex k-out-of-n systems having constant failure rate with expo-
and memory resources to blindly respond the reconnoissance nential distributions are modeled as
process. As a result, the service quality of the scanned systems will !
significantly degrade. Suppose that interarrival times of a portscan
k n
Rx ðtÞ ¼ ∑ ðe  λt Þj ð1 e  λt Þn  j : ð49Þ
burst are exponentially distributed, and exactly one portscan has j¼0 j
occurred in the interval ð0; t, i.e., there is no overlap between the
Consequently, due to numerous variety of systems with also
successive portscan times. By referring to the properties of the
immense complexity, analysis of complex system has always been
Poisson process, we can easily show that the conditional distribu-
an open problem. Though there exist methods dealing with
tion of scan time T1 is uniform over ð0; t. If Tk denotes the kth scan
conventional complex structures, it is quite difficult to calculate
event, Tk is a k-stage Erlang random variable, whose conditional
the collapse probability of a system having indeterminate struc-
distribution T k ð1 rk rnÞ can be determined as follows:
ture with many possible modes or paths to complete failure, which
  Pfð NðτÞ ¼ 1 Þ \ ð NðtÞ  NðτÞ ¼ 0 Þg may propagate the problem to remain unsolved for the repair case
P T 1 rτ∣NðtÞ ¼ 1 ¼
PfNðtÞ ¼ 1g too. We have always some parallels between the failure and
λτe  λτ e  λðt  τÞ τ recovery structures of a given system. Several models used for
¼ ¼ : ð47Þ
λte  λt t failure analysis are already discussed above. The most appropriate
approach to model the reliability of a majority of complex systems
This can be generalized to n cases to give the joint pdf of the arrival
with predictable failure pats can be a k-out-of-n parallel system
times as
described as [6]
  n!  
f t 1 ; t 2 ; …; t n ∣NðtÞ ¼ n ¼ n : n n
t Rs ¼ ∑ Rx Ri0 ð1  R0 Þn  i ; ð50Þ
i¼k i
The proof is based on the relation:
where Rx denotes the individual complexity of a failure path,
PðA \ BÞ
PðA∣BÞ ¼ ; which is defined as (a discrete geometric distribution):
PðBÞ
 
n
and left to the reader. 1
Rx ¼ q i ; q  e  α; q A ½0  1; α A ½0  1:
7.6.3. Consecutive system reliability Eq. (50) assumes identical reliability function, R0, for each of the
Let us consider a typical DoS attack, where the attacker may components, where the summation components are subscripted
issue more than a single attack by launching a series of consecu- by i starting at k. Thus, the step parameter ðniÞ denotes the number
tive sub-attacks. That is, first bypassing (fire-walking) a firewall, of paths of a k-out-of-n system with n components of which i at
than finding exact local IP-addresses, and than launching various step i are working. In a more general case, depending on the
other attacks (e.g., IP-masquerading combined with UDP-squelch) components' interaction structure with each other, we can model
in several stages till the victim is compromised. There are two the system as a series–parallel system. The overall risk is then
major cases to consider: (i) stages with identical behaviors and (ii) dominated by the cascade causality among the interacting
292 S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299

CPU−utilization: λ = 1.8, μ = 1.74 attack-obstacle model was verified to match a system with
0.35 responsiveness, see Assumptions 1–3.
Immunity growth
Cumulative degradations and recoveries

0.3
Failure
8. Impact analysis
0.25
The models given above are devoted to attack and failure
0.2
analyses, however, regarding the impact analysis, we need to
develop, or apply existing, e.g., [17], models to estimate an overall
0.15
loss involving system failure rates, system down-times, and the
entire cost from the start of the failure phase until the total
0.1
recovery, if repairable. Risks in a network describe the degree of
0.05 exposure of the system to failures. impact is defined as the “cost”
(or risk) of unavailability of a resource, which is a factor of the
0 resource weight (importance), the reliability measure of the
0 10 20 30 40 50
resource, and a defense factor (if the resource is protected). There
Time
can be multiple simultaneous attacks to a single or to a series of
Fig. 7. Plots of the results from the simulation and experiments (fluctuating systems in a network causing multiple failures. Accordingly, the
curves). overall network reliability should be reconsidered in order to
estimate the total cost as a measure of unreliability of all the
components [36]. Bayesian belief networks can be effectively used affected systems. Some systems may have defense strategies while
for the majority of cases, where the causal interdependence others are unprotected against threats. That is, the impact measure
among system components are definable. Some related models depends also on existing protection measures and, indirectly, on
can be found in [25,48]. the strength of the threats that lead to system failures. Therefore,
in order to estimate the overall impact for a network covering all
7.7. Model verification security threats, each component must have been assigned a

While analyzing the system we collect attack data (if observa- wi quantitative importance factor (weight), 0 o wi o 100,
ble) and failure data corresponding to the attacks. The first step Ti threat/attack list given as T i 〈t 1 ; …; t n 〉,
following the data aggregation is to create plots of observed data. di quantitative vulnerability factor associated to each
Since the plot can display the behavior of the system reliability threat/attack in the list, which can be represented either
during the observation period, it can guide us in finding a closer in terms of the strength of the threat/attack or alterna-
model to start with. We can now estimate values for the model tively in terms of the strength of the defense strategy,
parameters from the failure data. We have a variety of methods for 0 odi o 1,
the parameter estimation. Because of its good statistical properties hi quantitative hazard factor/function associated to each
one often uses the maximum likelihood estimation method (MLE) threat/attack in the list, 0 ohi o 1.
for estimating a small set of parameters. We can now use the The weight of a system is a subjective metric chosen by the
estimated parameter values to obtain a fitted model by substitut- evaluator, whereas the threat list and the remaining factors are
ing the estimated values of the parameters with the parameters of determined from the existing treat list and working environment.
the chosen model. Furthermore, as the last step, we need to verify Regarding the vulnerability factor, an empirical method for deter-
the model chosen. This process is called goodness-of-fit test, which mining the attack strength defined in terms of attack success ratio
evaluates the chosen model whether it fits the observed data. is presented in [7], where a vector was built to contain the attack
There are also various methods for checking a model, e.g., success (or hazard) data from a variety of attack types as

Kolmogorov–Smirnov or Anderson–Darling test can be used to !   s1 s2 sn
test whether two samples are drawn from identical distributions h ¼ h1 ; h2 ; …; hn ¼ ; ; …; ;
a1 a2 an
match or converge at some point. If the model chosen does not fit,
we have to collect more data or search for another model. Perhaps Here, parameter ai depicts the number of attacks of type i, si holds
the number of successes of that attack type. A total impact factor
simulation is the easiest method for performing goodness-of-fit
test. Besides, if the complexity of the model parameters is high, of m components each having n threats can be readily computed
by
which makes it difficult to guess a model, then we can use the
!
simulation method. m n
We applied the simulation method to the experiment pre- I¼ ∑ wj ∑ hi :
j¼1 i¼1
sented above. The experiment has produced a set of availability
(CPU-usage) data for 50 different DoS attacks. Since the target This formula is being kept simple and has no bound normalization,
system is a responsive system, the chosen model would rely on the
immunity growth exhibiting an AFR behavior. We have repeated
the experiment on the same target system at different times and Table 3
with different attack schemes, and tried to plot the observation Impact computation for some network units.

data and the simulation data. Both the experiment and the Component Threat list ! Weight Impact
∑h
simulation results are plotted in Fig. 7. Mean values of all
experiment data at different points can be compared to the Router T 1 〈⋯〉 0.7826 100 78.2600
corresponding simulation data in order to find a correlation E-mail server T 2 〈⋯〉 0.2250 75 16.8750
between the experiment and the simulation. By a further analysis PC T 3 〈⋯〉 0.5000 60 30.000
of the correlation data one can easily compute the estimated Printer T 3 〈⋯〉 0.4300 50 21.5000
dB-server T 4 〈⋯〉 0.4091 80 32.7280
model parameters. As mentioned earlier, the attack-obstacle model Web-server T 2 〈⋯〉 0.8958 90 80.6220
was designed to match the experiment discussed above, thus, the
S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299 293

 
but it can satisfactorily display the impacts in a quantitative λ
pn þ 1 ¼ ðm nÞ p n ¼ 1; 2; …; m  1: ð54Þ
manner. Indeed, the computation of the hazard factor for some μ n
components may also include defense factors. Since, generally
Solving Eq. (54) with the normalization condition of
defense strategies change from company to company, their relia-
∑mn ¼ 0 pn ¼ 1, we obtain the probability of exactly n machines
bility analysis can be generalized by eliminating the unknown
among m being down as
factor of the defense strategy. Due to this reason we consider here
 n
the defense factor as a constant parameter in the impact analysis. m! λ
pn ¼ p0 ; n ¼ 1; 2; …; m  1; ð55Þ
Some components (see Fig. 1) and their items used for the impact ðm  nÞ! μ
computation are shown in Table 3.
where p0 is the proportion of time the system is idle, i.e., no
recovery taking place.
9. Recovery models
9.2. Utilization of the recovery facility
Recovery process of a failed computer consists of a number of
operations such as patching, rebuilding, and reinstalling of a variety Recovery utilization (or traffic intensity), defined as the propor-
of corrupted software and/or hardware components. There may also tion of time the repair facility (or software) is busy with processing
be an entire recovery process dealing with disc recovery and a of the incoming machines, is expressed as 1  p0 . Let us assume a
thorough system rescue operation, which may require significantly stable state (the steady-state), where the arrival rate of the failed
high resources for recovery, and much longer time for all required machines is always less than or equal to the repair rate of the
operations. Suppose that we have n computers enqueued randomly repair facility, and the corresponding utilization factor of the
for repair, patch, and update operations, where the incoming traffic repair facility is expressed as ρ ¼ λ=μ.
is of a Poisson type, Eq. (15), with the arrival rate λ. Furthermore, The overall throughput ratio for a repair facility of a single-
λΔt þoðΔÞt is the probability of accepting at least one repair request repairman (or a single utility tool) and n repair requests of identical
in the small time interval Δt. Total repair (holding) times are failures is computed by
exponentially distributed with random time intervals, assuming  n
that each computer waits for T random time units and requires λ
ρ¼ ; ð56Þ
another random time unit t to be repaired, then from [16], μ

PfT Z tg ¼ e  μt ð51Þ
and therefore, from the queuing theory we have
can be used to determine the probability of the total time required
for the recovery (queuing þ repair) operation. The times between pn ¼ p0 ρn ; for n ¼ 1; 2; …; ð57Þ
recoveries, R, is also a Poisson process with the exponential where implicitly,
distribution parameter μ. Thus, the continuous random variable R  1  1
1 1 1
has the probability density function: p0 ¼ ¼ ∑ ρn ¼ ¼ ð1  ρÞ: ð58Þ
1 þ ∑1
n ¼ 1ρ
n
n¼0 1ρ
f R ðtÞ ¼ μe  μt ; ð52Þ
Thus, the steady-state probability for the state n (i.e., n
and the mean time between recoveries is
machines in the recovery service) is
1
EðRÞ ¼ : ð53Þ pn ¼ p0 ρn ¼ ð1  ρÞρn ; for n ¼ 1; 2; …: ð59Þ
μ
From this the expected number of machines in the recovery
system (queue þservice) can be easily computed by

9.1. Steady-state of the recovery process 1 ρ λ


L ¼ ∑ npn ¼ np0 ρn ¼ nð1  ρÞρn ¼ ¼ : ð60Þ
n¼0 1ρ μλ
Since the details are given in Appendix A, we will omit the
derivation of expressions and detailed discussions here. In order to The probability pn, as the number of repair requests in the system,
determine impacts of malicious attacks it is primarily required to can be interpreted as “the proportion of the time the system is in
compute the amount of repairs related to the total number of state n”. Again, from the queuing theory the expected queue
machine failures. That is, we need to determine the probability length waiting for repair is
distribution of m machine failures, the number of failures, and the
service utilization of the recovery facility. Depending on the company 1 λ2
Lq ¼ ∑ ðn  1Þpn ¼ : ð61Þ
policy, machines are inspected either at predefined periods or at n¼0 μðμ  λÞ
random times and then discovered (if failed) in negligible times.
For most of the cases of queuing systems the basic relation is given
Let pn denote the probability of having n machines among m
by the Little's formula:
being queued at the recovery system in the form of a birth–death
Markov process. Suppose the system settles down to steady-state L ¼ λW;
when the equilibrium state λpm  1 ¼ μpm has been reached. That is,
where W represents the average waiting time of a machine in
both input and output flow rates of the recovery queue go into an
system, which is expressed as
equilibrium state, where λ denotes the failure rate and μ denotes
the recovery rate of the failed nodes. We already know that mean
n Rj þ Dj
W ¼ lim ∑ :
time to failure (arrival rate for recovery) of a machine is 1=λ. n-1 j ¼ 1 n
Hence, the expression ðm  nÞpn λ yields for n machine failure rates
Here, Rj þ Dj represents the total waiting time in the queue, i.e.,
among m. Likewise, for the mean repair time, 1=μ will yield μpn þ 1
expected repair time plus processing delay in the queue.
repair rate for n þ 1 machines. Applying the recurrence approach
Regarding the case of multiple-repairmen with s number of
will lead us to so-called level crossing balance equation set:
repairmen available and again as assumed before λ o sμ, which is
ðm  nÞλpn ¼ μpn þ 1 ; n ¼ 1; 2; …; m 1; the most desired case, then the steady-state probability for the
294 S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299

state n is defined as [16] modeling the dynamic interaction of the mixed queues can be
8 n simply expressed by the following differential equation set:
> ρ
>
< n! p0 if 0 r n rs;
∂s
pn ¼ ¼ βsðtÞ  λf ðtÞ
>
> ρn
∂t
: p if n Z s:
s!ðsn  s Þ 0 ∂f
¼ λf ðtÞ βsðtÞ
∂t

s1 n 1 ∂r
ρ ρs ¼ μrðtÞ  λf ðtÞ: ð64Þ
p0 ¼ ∑ þ : ð62Þ ∂t
n ¼ 0 n! s!ð1  ρs Þ
As clearly seen, this equation set denotes the dynamic behavior for
Setting ρs ¼ λ=sμ as the throughput (or service rate), we get the discovering susceptible computers ∂s=∂t, their queuing at the
expected number of machines in the recovery system (queue þ recovery center ∂f =∂t, and the recovery operation ∂r=∂t, where
service) with a multiple-repair facility as the sum of expected the associated rates are β; λ, and μ, respectively. Given the failure
queue length waiting for repair and the throughput ratio of the distribution
queue:
f ðtÞ ¼ λe  λt ;
λ p0 ρs ρs
L ¼ Lq þ ¼ þ ρ: ð63Þ the reliability function for the continuous time analysis can be
μ s!ð1  ρs Þ2
derived from the complement expression, RðtÞ þFðtÞ ¼ 1, as
Z t
RðtÞ ¼ 1  FðtÞ ¼ 1  λe  λu du ¼ e  λt :
10. A case study: analysis of a recovery facility 0

Hence, the failure function is


With this concrete example, we will determine the most Z t
important reliability factors of failed computers and service
FðtÞ ¼ λe  λu du ¼ 1 e  λt ¼ p01 ðtÞ:
performance of the recovery center. The numerical analysis con- 0
siders a medium-sized computer network with 1620 nodes, where Note that, this function also denotes the transition probability,
some malicious network traffic must have diminished the avail- p01 ðtÞ, from state 0 (operative) to state 1 (non-operative), in other
ability of some vulnerable nodes. Some hosts in this sample words, the number of failures increases by 1. Fig. 8 shows the
network were susceptible to attacks and therefore fail to deliver dynamic behavior of the processes denoting the susceptibility
the required service capacity. Susceptibility (observed service (discovery of service-degraded computers), their queuing at the
degradation) is measured by a branch test of all active processes recovery center, and the performance of the recovery operation.
on each computer, where only excessive CPU and memory usages This model is rather more general than that of “watch and
are considered in the computation. A recovery center is modeled immediately repair” type systems. Hence, as a security engineer,
as an M/M/11 queuing system. The recovery center receives on we must also assume existence of various types of repair models
average 18 suspected computers randomly each 60 min, which that can be applied to similar cases, e.g., a typical type of a general
mainly performs system scanning, removal, and update operations approach is given in [49]. A repair-time distribution model directly
for worm/virus infection cases. In other cases, it performs updates, applicable to this problem is defined by Eq. (52) as
patches, and other recovery operations. It is assumed that 30%
(pf ¼0.30) of the incoming computers are infected and the recov- f R ðtÞ ¼ μe  μt ;
ery system uses an average of 1/20 h to scan and eventually repair where the mean time between recoveries is 1=μ. The recovery
a computer, where only one computer is treated at a time. That is, system goes into a free state with probability [16]
the recovery center receives λ ¼ 18 computers per hour, the
process time of the recovery service is μ ¼ 20 computers=h. p10 ðtÞ ¼ 1  eμt :

10.1. Input/output analysis of recovery process 10.2. Step-by-step problem solution

First, we consider the input dynamics of the recovery facility, so In order to give a clearer view, we present the analysis with a
that we can be able to analyze the time-dependent states of the series of related questions each leading to the solution of a specific
suspected computers to determine the failure and recovery func- part of the problem at hand.
tions, as well as the stationary states describing the long-run
reliability figure of the entire process. Susceptibility of a node is Q1: What portion of time is the recovery service busy with
identified by its expected service degradation rate, β. Time/ processing the suspected (slowed-down) computers?
resource usage for computing β is not considered in this example, Primarily, we compute the traffic intensity (input/output
however, for large networks this process can require significant flow ratio of the recovery facility) defined by Eq. (56) as
amount of resources that can be necessary to incorporate into the
computation. The probability distribution for having s susceptible λ 18
ρ¼ ¼ ¼ 0:900:
hosts among m is binomial and μ 20
  Thus, the proportion of the process times of failed, ρf, and
m s
pðβ; s; mÞ ¼ β ð1  βÞðm  sÞ ; s 4 1: non-failed (healthy), ρh, computers are computed as
s
ρf ¼ ρpf ¼ 0:900  0:300 ¼ 0:270; and
Since the output from the failure queue is directly cascaded to the
ρh ¼ ρph ð1  pf Þ ¼ 0:900  ð1  0:300Þ ¼ 0:900  0:700 ¼ 0:630:
recovery queue, the scenario becomes a mixed situation of both
input to the recovery and output from the recovery queue. Thus, Note that, pf denotes the failure (infection) probability,
and hence ph ¼ 1  pf .
1
An M/M/1 queue represents a stochastic queue system having a single server,
Q2: The probability of exactly one computer being in the
where arrivals are determined by a Poisson process and job service times have an recovery center can be obtained using the recurrence
exponential distribution. formula in a Markovian system, see Appendix A. First, we
S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299 295

determine the probability of k computers being in the computers is


system using Eq. (59) as 1 1
Eðwf Þ ¼ ¼ ¼ 0:068 h:
pk ¼ p0 ρk ¼ ð1  ρÞρk : μ  λf 20 5:400

Thus, the probability of having exactly one computer Average time (minutes) spent on each failed computer is
being in the system is 0:068  60 ¼ 4:080 min, which makes a total of 18.115 h
p1 ¼ ð1 ρÞρ ¼ ð1  0:900Þ  0:900 ¼ 0:090: per month to maintain 266 attacked computes. Conse-
quently, given the infection probability of 0.300, arrival
rate of 18 for incoming hosts, and the service rate of 20,
Q3: What is the probability that there are more than one the average time spent on 1260 computers for scanning
failed computers in the system? First, the arrival rate of and recovering 266 of them is 9  0.500 ¼4.5 h per day.
the failed (infectious) computers is determined as The impact of this threat is not only holding 266
λf ¼ pf λ ¼ 0:300  18 ¼ 5:400 computers=h; computers down in 4.5 hours per month, but also the
business loss caused by each of the failed computer
and the probability of having exactly one infected com- during 4.5 hours a month plus the labor time spent for
puter, thus scanning 1260 machines and recovering 266 of them,
     
λf μ λf λf 20  5:400 5:400 plus the extent of damage to the reputation of the
p〈f ;1〉 ¼ p〈f ;0〉 ¼ ¼ ¼ 0:197:
μ μ μ 20 20 company caused by this impact.
Hence, the probability of having more than one, p〈f ;k〉 ,
infected computers in the recovery center is
1
p〈f ;k〉 ¼ ∑ p/f ;jS ¼ 1  p〈f ;0〉  p〈f ;1〉 11. Concluding remarks
j¼2
 
20  5:400 The vast majority of the theorems and models were presented
¼ 1  0:197 ¼ 0:073:
20 date back at least some decades ago. However, the application of
The tendency of p〈f ;k〉 when limk-1 ¼ 0 is at some point them to the reliability analysis of computer networks has not been
near the equilibrium state, which verifies that the system clearly identified yet. It is important to determine the cost of
is stable within the range of steady-state. unavailability of information systems using quantitative
Q4: What is the average number of computers in the system, approaches that are theoretically and practically sound. We have
both infected and healthy? From Eq. (60) presented an umbrella framework with the hope that it can
facilitate the analysis of a broad spectrum of network structures,
ρ ðλ=μÞ λ 18 their components, interdependence among the interconnected
EðNÞ ¼ ¼ ¼ ¼ ¼ 9;
1  ρ 1  ðλ=μÞ μ λ 20 18 structures, impacts caused by the associated threats, and perfor-
and the average number of failed computers in the mance analysis of the recovery facilities. A model with clearly
system is identified underlying assumptions can significantly simplify the
application of the model to a network of systems, a subsystem
ρf ðλf =μÞ λf 5:400
Eðf Þ ¼ ¼ ¼ ¼ ¼ 0:370: within the network, or a component within the subsystem.
1 ρf 1  ðλf =μÞ μ λf 20  5:400
Reliability analysis of complex systems can be too diffuse to
The estimated number of failed hosts per day is 24 many of us, and also most computer security engineers have
0:370 ¼ 8:880, and per month is 30  8:880 ¼ 266:400. difficulties in having a clear overview due to lack of the scholar
Q5: What is the average time spent on each of these background in reliability engineering. Some research guided by
computers in the system, both infected and healthy? the theoretical context is thus diffuse and cumbersome for many
  computer security engineers to apply to real world problems.
ρ 1 1 1
EðwÞ ¼ ¼ ¼ ¼ 0:500 h; Therefore, in this paper we have presented a concise framework
1 ρ λ μ λ 20 9
for determining the reliability of systems that are under security
and the average time spent on each of the failed threats. The framework makes use of the most fundamental
reliability, stochastic process, and queuing theories to guide the
analysis. Choosing appropriate models for threat, failure, and
performance of recovery services play the fundamental role for
s u sc

ity designing effective protection mechanisms as well as defining


ab il
r el i configurable redundancy levels with a reasonable complexity. By
e p ta

modifying a reliability model one can easily control the dynamics


bili ty

of the system behavior of almost any networked system. Accu-


rately determined system dynamics can significantly facilitate the
computation of total system reliability and help mitigate costs
Re associated with lifecycle maintenance tasks.
co
ve Due to the complexity of the computer networks we encounter
ry
Fa q ue
i lu ue numerous difficulties in designing appropriate reliability models.
re Every reliability model makes use of its fundamental properties
qu
eu with respect to the constraints and assumptions that are unique to
e
the model itself and the system under consideration. Assumptions
for a given model need not always hold. Therefore, under many
situations, we need to analyze empirically observed data, and
hypothesize that the assumptions of the model hold. Furthermore,
Time
applicability of the model need to be justified on the basis of the
Fig. 8. Dynamic queue behavior of the recovery center. results obtained from its hypothesis test. That is, some models
296 S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299

may have proven applicability to some environments/systems will Markov processes can substantially facilitate the analysis of long-run
be considered as is. Other models bearing uncertainty should be characteristics of system states, stochastic failure changes, random
thoroughly verified for being an adequate model candidate of a attack changes, and stochastic recovery and queue operations.
specific system analysis. A Markov chain is a random process usually characterized as mem-
In practice neither testing nor proving of a model on a specific oryless, i.e., the next state depends only on the current state and not
environment can guarantee a complete confidence in the appro- on the events occurred earlier. Also systems that undergo transitions
priateness of the model. Because computer networks and threats forgetfully from one state to another can simply adapt this model.
to them have highly complex structures that require a through Shortly, Markov processes can be used to model recovery services
analysis of each system and threats to them in order to asympto- and several other reliability structures having IFR, DFR, and AFR
tically match a specific model. The models discussed here can be characteristics.
considered as a complementary approach to assess the reliability Immunity (reliability) growth is an important phenomenon in
of a specific part of a given environment rather than being used as network security, because, systems are often recovered and
a competing tool. Due to the imperfectness of the models in patched to thwart failure sources. Some systems (e.g., routers,
assuring accurate results, sometimes a set of approaches need to firewalls, and network protocols) may have built-in capability for
be evaluated for collecting data, experimenting, choosing a model, eliminating failures in real-time. This aspect requires a specific
estimating model parameters, and developing appropriate meth- approach to analyze the attacks, failures, and the correlation
ods that can obtain a fitted model. between the attacks and failures. Accordingly, a theoretically
Furthermore, how the model behaves to changes in parameter justified model, [16], called Attack-obstacle model, was also mod-
values and how it responds to changes in the structure of the ified and tested for DDoS analysis. Due to the limitation on the
model are important. Failure data and plots are useful in deter- scope of this article, the modification of this model was not
mining the system behavior. Following the analysis of the data and substantially verified, the verification was done by additional
plots, an estimate of the model parameters can be obtained by use simulations. We said also that threats can randomly propagate
of a method, e.g., MLE. A model matching need to be done by among different applications on different hosts can be modeled as
substituting the estimated parameter values in the selected model. a Brownian motion. But this was not substantiated either. Because
Finally, in order to eliminate eventual uncertainties, a goodness- of its complexity a feature work is planned to investigate the fault
of-fit test has to be performed. If the chosen model does not fit, we propagation within heterogeneous application environments. We
have to gather additional data and reassess the tests or try another did not spend much time for the verification of some of the
model from the candidate model list. It may be hard to predict the candidate models either. An extensive study is thus needed for
size of the additional data to collect or to find a better model. In testing and verifying them in order to validate their application to
that case, carefully designed simulations may help to guide a system analysis, especially, systems with immunity growth
sensitivity analysis and eliminate eventual uncertainties in the features.
chosen model and the underlying parameters and assumptions.
As already noticed, we have frequently used Poisson processes for
modeling both the attack success counts and failure counts for massive Acknowledgments
attacks, e.g., DoS and scan worm spreading. Error searching for
deterministic failures can be facilitated by matching a Bernoulli The author is grateful to the referees for their inspiring
process, where defects such as infections, back doors, Trojans are comments and suggestions, which have significantly improved
being discovered by security tools and the failures are binomially the presentation of this article. Their superior knowledge in the
counted. Overall reliability of complex systems mostly featuring serial- field and their challenging comments have contributed a lot to
parallel interoperational structures requires k-out-of-n type analysis. create a more fruitful paper that will hopefully guide many of
For example, having n machines in a network, each using a specific computer scientists, researchers, and engineers dealing with the
application that is vulnerable to a specific threat under a random reliability analysis of network security.
operational sequence, can be modeled as a k-out-of-n system.
Systems with immunity growth features require the determi-
Appendix A. Steady-state analysis
nation of appropriate growth functions and sensitivity analysis for
their parameter values. This can be extremely difficult, if we are
Assume a network of m machines, which are maintained by
not able to collect sufficient data for the parameter estimation and
a repair facility. The time to failure (up time) of a machine and its
goodness-of-fit tests. Variants of Gamma and Poisson distributions
repair time are exponentially distributed random variables with
can be modified to contain the immunity growth function
rates λ and μ, respectively. This can be considered as a classical
assumed that the underlying model assumptions and parameters
machine repairmen problem. Assume further that repairs occur as
are carefully chosen. That is, both the attack behavior and its
a Markovian process at rate μ  min½NðtÞ; RðtÞ, where N(t) is the
correlation to the observed system failures need to be studied in
number of machines failed and awaiting repair at time t, and R(t) is
detail for laying down the required parameters and assumptions.
the number of repairmen on duty. This process can be modeled as
Regarding more complex structures, consecutively occurring
a simple birth–death Markov process, wherein jumps in state, N(t),
attacks can be modeled by hypoexponential distributions, whereas
occur at exponentially distributed intervals defined as follows:
the threats occurring concurrently can be modeled using hyper-
failure probability of a machine before time t is
exponential and Erlang distributions. Again, a Gamma distribution
combined with any of these exponential distributions is almost the pf ðtÞ ¼ Pftime to failure otg ¼ 1  e  λt ; …t ¼ 0; 1; 2; …;
only choice if the system has the immunity growth feature.
and the probability that there will be no failure before t is given by
As a brief note, we have summarized the key reliability models,
suggested models for decomposing network structures into reliability pf ðtÞ ¼ Pftime to failure Ztg ¼ e  λt ; …t ¼ 0; 1; 2; …: ðA:1Þ
structures, modified and defined a set of reliability approaches
Time to repair expressions are analogous to the above definitions,
applicable to modeling attacks and failures. Hence, approaches cover-
i.e., the probability of the recovery time of a machine being less
ing attacks, failures, impact analysis, and recovery processes have been
than time t is
provided with some detail, but without much concern of justifying
them. As we have already considered in Sections 6, Appendix A, pr ðtÞ ¼ Pftime to repair o tg ¼ 1  e  μt ; …t ¼ 0; 1; 2; …;
S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299 297

convenient starting point would be to compute p1 in terms of p0


and recur the computation throughout all unknown variables. This
approach will than lead us to easily compute the probability of n
machines being in a specific state (e.g., repair). That is, from the
equation set (A.7) we get the following recurrence steps:
Fig. 9. Input/output representation of the recovery center.
λ0
p1 ¼ p ;
μ1 0
and the probability of the recovery time being greater than or λ1 λ1 ðμ p  λ0 p0 Þ λ1 λ0
equal to time t is p2 ¼ p1 ) p1 þ 1 1 ¼ p ;
μ2 μ2 μ2 μ2 μ1 0
pr ðtÞ ¼ Pftime to repair Z tg ¼ e  μt ; …t ¼ 0; 1; 2; …: ðA:2Þ λ2 λ2 ðμ p  λ1 p1 Þ λ2 λ1 λ0
p3 ¼ p2 ) p2 þ 2 2 ¼ p ;
μ3 μ3 μ3 μ3 μ2 μ1 0
The recovery system is modeled as a birth–death Markov process
with the input and output state transition probabilities defined as ⋮
λn  1 λn  1 ðμ p  λn  2 pn  2 Þ n λ
pn þ 1 ðtÞ ¼ λn Δt þoðΔÞt; pn ¼ pn  1 ) pn  1 þ n  1 n  1 ¼ ∏ i  1 p0 ;
μn μn μn i ¼ 1 μi
pn  1 ðtÞ ¼ μn Δt þ oðΔÞt; ðA:3Þ
λn λn ðμn pn  λn  1 pn  1 Þ n þ 1 λi
respectively. This states that there are n machines at the recovery pn þ 1 ¼ pn ) pn þ ¼ ∏ p0 : ðA:8Þ
system, and when a new failure arrives the state increments to μn þ 1 μn þ 1 μn þ 1 i ¼ 0 μi þ 1

n þ 1 (birth), when a machine departs from the recovery service The requirement of the sum of the probabilities ∑1
n ¼ 0 pn ¼ 1
the state decrements to n  1 (death) during ðt; t þ ΔtÞ. Fig. 9 implies
illustrates the input-output relation of the recovery center.
n nþ1 λi
As can be easily deduced from the one-step transition equita- pn ¼ 1 þ ∑ ∏ p0 ¼ 1;
tions, Eq. (A.3), the dynamics of the state transition probabilities 8 i Z 0 i ¼ 0 μi þ 1

can be expressed as which then yields


p010 ðtÞ ¼ λp0 ðtÞ μp1 ðtÞ; birth of a new failure " #1
p001 ðtÞ ¼ μp1 ðtÞ  λp0 ðtÞ; death of a failure ðrepair of a nodeÞ:
n nþ1 λi
p0 ¼ 1 þ ∑ ∏ p0 : ðA:9Þ
8 i Z 0 i ¼ 0 μi þ 1
ðA:4Þ
Hence, two base-cases of this Markov process will be dissected in
order to proceed further:
A.1. Machines on repair: steady-state results
① pðn; tÞ ¼ PfNðtÞ ¼ ng; n ¼ 0; 1; …; m; t Z 0;
The flow densities λ〈i;…;n〉 and μ〈j;…;n þ 1〉 must in the steady-state
② pn ¼ lim pðn; tÞ; settle down to a controllable level so that the corresponding
t-1
inflow rate becomes less than or equal to that of the outflow.
where pðn; tÞ is the probability that n machines are down at time t,
Otherwise the system will go into an uncontrollable state and
and pn is the probability that n machines are down under the
oscillate further indefinitely, up and down.
equilibrium state. Indeed, we count the number of machines that
Alternating flow densities (λi and μi) are representative gen-
enter down state, e.g., in a specific time unit, the system entering
erally for multiple recovery services each with random inflow and
state n ¼5 means that we have the birth of 5 corrupted machines.
outflow rates. However, simpler solutions to queuing systems with
The process of leaving a specific state is denoted by a death, i.e., the
identical queue properties usually require equal density rates. This
corrupted machines get recovered and leave the queue. We can
is especially true for a facility with a single repairman, where
track the evolution of the system behavior at times t þ Δt by
failures occur with constant rates, and a single repairman repairs
setting up a Markov transition diagram. Based on the diagram and
failed machines with equal rates immediately without a significant
the transition equation:
waiting delay. This assumption fits perfectly for a virus removal
p0 ðt þ ΔtÞ ¼ ½1  λΔtp0 ðtÞ þ ðμΔtÞp1 ðtÞ; ðA:5Þ operation done by a software tool, but can be difficult to consider
we can easily deduce future transitions for an arbitrary state n of a for a completely corrupted system with different components,
Markov chain as where each component has different failure and repair distribu-
tions of probability.
pðn; t þ ΔtÞ ¼ pðn; tÞ½1  ðm  nÞλΔt  μΔt Supporting the limits limt-1 PfNðtÞ ¼ ng ¼ pðn; tÞ; n ¼ 1; 2; …;
þ pðn  1; tÞðm  n þ 1ÞλΔt that the proportion of time there are n occurrences (failures,
þ pðn þ 1; tÞμΔt þ ðoΔtÞ; n ¼ 1; 2; …; m: ðA:6Þ repairs, etc) in the system, we have
Further, state transitions captured so far will balance in a steady- ðλ þμÞpn ) Transition out of state n;
state as derived below, equally for each pair of states (working and μpn þ 1 þ λpn  1 ) Transition into state n;
non-working): ðλ þμÞpn ¼ μpn þ 1 þ λpn  1 ) Balance about state n:
ðλn þ μn Þpn ¼ λn  1 pn  1 þμn þ 1 pn þ 1 ;
Knowingly, by setting ρ ¼ λ=μ, from Eq. (A.7), we rewrite
ðλn  1 þ μn  1 Þpn  1 ¼ λn  2 pn  2 þ μn pn ;
ðλ2 þ μ2 Þp2 ¼ λ1 p1 þ μ3 p3 ; p1 ¼ ρp0 :
ðλ1 þ μ1 Þp1 ¼ λ0 p0 þ μ2 p2 ; It is easy to recursively show that ðλ þ μÞp1 ¼ μp2 þ λp0 . Hence, we
⋮ go down to the ultimate state
λ0 p0 ¼ μ1 p1 : ðA:7Þ
pn ¼ ρpn  1 ¼ ρn p0 : ðA:10Þ
Now, a recurrence formula can be applied to solving one of the
Again, with ∑1
n ¼ 0 pn ¼ 1,
variables in terms of another. In some cases, you might have
1
different flow rates before the equilibrium state will come in p0 ∑ ρn ¼ 1; ρ o 1:
picture. Since we have enough setup to compute p0, the most n¼0
298 S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299

If this series converges (i.e., ρ o 1), we can have the solution for p0 Similarly, the average delay in the queue is
and for pn consequently, that is, Q 1 1 1
D¼ ¼W ¼  :
1 1 λ μ ðμ  λÞ μ
∑ ρn ¼ ; ρo 1:
n¼0 1 ρ

Since p0 ¼ 1  ρ, then from Eq. (A.10)


References
pn ¼ ρn ð1  ρÞ; n ¼ 0; 1; …: ðA:11Þ

Considering the single repairman model, the recurrence formula, [1] Goel A, Okumoto K. Time-dependent error-detection rate model for software
reliability and other performance measures. IEEE Trans Reliab 1979;R-28
Eq. (A.13), of moving from pn  1 to pn can be used to build the (3):206–11.
probabilities for the state transition space S ¼ fp1 ; p2 ; …; pn g, given [2] Dai Y, Xie M, Poh K, Liu G. A study of service reliability and availability for
that we have the equation set (A.7). As well, in a long run, the rates distributed systems. Reliab Eng Syst Saf 2003;79(1):103–12.
[3] Ma Y, Han J, Trivedi K. Composite performance and availability analysis of wireless
will converge to their average values, i.e., communication networks. IEEE Trans Veh Technol 2001;50(5):1216–23.
[4] Levitin G, Dai Y-S. Service reliability and performance in grid system with star
n nþ1
λ C ∏ λi and μC ∏ μj : topology. Reliab Eng Syst Saf 2007;92(1):40–6.
i¼0 j¼1 [5] He Y, Ren H, Liu Y, Yang B. On the reliability of large-scale distributed systems
—a topological view. Comput Netw 2009;53(12):2140–52.
Hence, the initial throughput during the idle period of the repair- [6] Nakagawa T, Yasui K. Note on reliability of a system complexity. Math Comput
Model 2003;38(11–13):1365–71.
man will become [7] Kondakci S. Intelligent network security assessment with modeling and
nþ1 λ
  analysis of attack patterns. Secur Commun Netw 2012;5(12):1471–86.
i λ
ρp0 ¼ ∏ p0 ¼ p0 ¼ ρp0 : ðA:12Þ [8] Boffetta G, Cencini M, Falcioni M, Vulpiani A. Predictability: a way to
μ
i ¼ 0 iþ1 μ characterize complexity. Phys Rep 2002;356(6):367–474.
[9] Ahmed W, Wu YW. A survey on reliability in distributed systems. J Comput
These assignments will recursively lead us to the steady-state Syst Sci 2013;79(8):1243–55.
solution for states 0  n, i.e., the set of steady-state probabilities [10] Jian S, Shaoping W. Integrated availability model based on performance of
computer networks. Reliab Eng Syst Saf 2007;92(3):341–50.
from state 0 to state n: [11] Gupta V, Dharmaraja S. Semi-Markov modeling of dependability of VOIP
(  "   2 #  3  n ) network in the presence of resource degradation and security attacks. Reliab
λ λ λ λ λ Eng Syst Saf 2011;96(12):1627–36.
S¼ p0 ; p1 ¼ p0 ; p0 ; …; p0 : ðA:13Þ
μ μ μ μ μ [12] Daneshkhah A, Bedford T. Probabilistic sensitivity analysis of system avail-
ability using Gaussian processes. Reliab Eng Syst Saf 2013;112(0):82–93.
[13] Ang C-W, Tham C-K. Analysis and optimization of service availability in a HA
It is obvious that the recurrence formula given above yields for cluster with load-dependent machine availability. IEEE Trans Parallel Distrib
equally constant mean rates, i.e., λi ¼ λi þ 1 ¼ λi þ 2 ¼ ; …; ¼ λn and Syst 2007;18(9):1307–19.
μi ¼ μi þ 1 ¼ μi þ 2 ¼ ; …; ¼ μn , which significantly simplifies the [14] Buzacott JA, Shanthikumar JG. Stochastic models of manufacturing systems.
Englewood Cliffs, NJ: Prentice Hall PTR; 1993.
steady-state solution. Since the steady-state probabilities must [15] Cui L, Li H, Li J. Markov repairable systems with history-dependent up and
sum to 1, we can easily obtain p0 for the general case as down states. Stoch Models 2007;23(October (4)):665–81.

1 1
 n  n [16] Kondakci S, Dincer C. Internet epidemiology: healthy, susceptible, infected,
λ 1 λ p quarantined, and recovered. Secur Commun Netw 2011;4(2):216–38.
1 ¼ ∑ pn ¼ ∑ p0 ¼ p0 ∑ ¼ 0 :
n¼0 μ n¼0 μ 1ρ [17] Kondakci S. A concise cost analysis of Internet malware. Comput Secur
n¼0
2009;28(7):648–59.
[18] Konak A, Coit DW, Smith AE. Multi-objective optimization using genetic
Thus,
algorithms: a tutorial. Reliab Eng Syst Saf 2006;91(9):992–1007.
p0 ¼ 1  ρ: ðA:14Þ [19] Levitin G. The universal generating function in reliability analysis and
optimization (Springer series in reliability engineering). NJ, USA: Springer-
The value of p0 gives the proportion of time the repair system is Verlag New York, Inc.; 2005.
[20] Aven T, Krohn BS. A new perspective on how to understand, assess and
idle (or unavailable), not necessarily being hacked down; a repair manage risk and the unforeseen. Reliab Eng Syst Saf 2014;121(0):1–10.
system can also be under repair or a casual maintenance. Hence, [21] Trivedi KS. Probability & statistics with reliability queuing and computer
by referring to the transition probability space (A.13), we obtain science applications. 2nd Edition. New York, US: John Wiley & Sons Inc.; 2001.
[22] Salehi E, Asadi M, Eryilmaz S. Reliability analysis of consecutive k-out-of-n
the probability, pn, that we have n machines on repair (queue þ systems with non-identical components lifetimes. J Stat Plan Inference
recovery) in the system as 2011;141(8):2920–32.
[23] Eryilmaz S. Reliability properties of consecutive k-out-of-n systems of arbi-
pn ¼ ρn p0 ¼ ρn ð1  ρÞ; for n 4 0: trarily dependent components. Reliab Eng Syst Saf 2009;94(2):350–6.
[24] Eryilmaz S, Navarro J. Failure rates of consecutive k-out-of-n systems. J Korean
Thus, for λ o μ (that is ρo 1), the expected number of jobs Stat Soc 2012;41(1):1–11.
(requests) in this Markov process under steady-state is the mean [25] Wilson AG, Huzurbazar AV. Bayesian networks for multilevel system relia-
bility. Reliab Eng Syst Saf 2007;92(10):1413–20.
of pn ¼ ρn ð1  ρÞ, i.e., [26] Borgonovo E, Smith CL. A study of interactions in the risk assessment of
1 1 complex engineering systems: an application to space PSA. Oper Res 2011;59
ρ ðλ=μÞ
L ¼ ∑ npn ¼ ∑ nð1 ρÞρn ¼ ¼ : ðA:15Þ (6):1461–76.
n¼0 n¼0 1  ρ 1  ðλ=μÞ [27] A.B. Notel, S.M. Sparta, Y. Yang, Generic threats to routing protocols, 2006. url:
〈http://www.rfc-base.org/txt/rfc-4593.txt〉.
The average number of requests currently under service is [28] Montgomery D, Murphy S. Toward secure routing infrastructures. Secur Priv
IEEE 2006;4(5):84–7. http://dx.doi.org/10.1109/MSP.2006.135.
λ
Ls ¼ ρ ¼ ¼ 1 p0 : [29] Kim Y, Kang W-H. Network reliability analysis of complex systems using a
μ non-simulation-based method. Reliab Eng Syst Saf 2013;110(0):80–8.
[30] Kondakci S. Epidemic state analysis of computers under malware attacks.
Then, the average number of requests in the queue is Simul Model Pract Theory 2008;16(5):571–84.
[31] Rohloff KR, Başar T. Deterministic and stochastic models for the detection of
1 ρn
Q ¼ L  Ls ¼ ∑ ðn  1Þpn ¼ : random constant scanning worms. ACM Trans Model Comput Simul 2008;18
n¼1 1ρ (2):1–24.
[32] Amador J, Artalejo JR. Stochastic modeling of computer virus spreading with
By referring to the Little's low L ¼ λW and combining it with warning signals. J Frankl Inst 2013;350(5):1112–38.
[33] Barthèlemy M, Barrat A, Pastor-Satorras R, Vespignani A. Dynamical patterns
Eq. (A.15), we can obtain the average waiting time, W, as of epidemic outbreaks in complex heterogeneous networks. J Theor Biol
2005;235(2):275–88.
L 1
W¼ ¼ : [34] Kondakci S. Dependency analysis of risks in information security. Int Rev
λ ðμ  λÞ Comput Softw 2008;3(1):11–9.
S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299 299

[35] Kondakci S. A new assessment and improvement model of risk propagation in [43] Horváth G, Reinecke P, Telek M, Wolter K. Efficient generation of PH-
information security. Int J Inf Comput Secur 2007;1(3):341–66. distributed random variates. In: Al-Begain K, Fiems D, Vincent J-M, editors.
[36] Kondakci S. A causal model for information security risk assessment. In: 2010 Analytical and stochastic modeling techniques and applications, lecture notes
Sixth International Conference on Information Assurance and Security (IAS); in computer science, vol. 7314. Berlin, Heidelberg: Springer; 2012.
2010, pp. 143–148. p. 271–85. http://dx.doi.org/10.1007/978-3-642-30782-9_19.
[37] Doguc O, Ramirez-Marquez JE. A generic method for estimating system [44] Fang Y. Hyper-erlang distribution model and its application in wireless mobile
reliability using Bayesian networks. Reliab Eng Syst Saf 2009;94(2):542–50. networks. Wirel. Netw 2001;7(3):211–9. http://dx.doi.org/10.1023/
[38] S. Kondakci, Network security risk assessment using bayesian belief networks, A:1016617904269.
in: 2010 IEEE second international conference on social computing (social- [45] Feldmann A, Whitt W. Fitting mixtures of exponentials to long-tail distribu-
com), 2010, pp. 952–960. tions to analyze network performance models. Perform Eval 1998;31(3–4):
[39] Sellke SH, Shroff NB, Bagchi S. Modeling and automated containment of 245–79.
worms. IEEE Trans Dependable Secur Comput 2008;5(2):71–86. http://dx.doi. [46] S. Corporation, Internet security threat report 2014 (April 2014). url: 〈http://
org/10.1109/TDSC.2007.70230. www.symantec.com/security_response/publications/〉.
[40] Walpole RE, Myers RH, Myers SL, Ye K., Probability & statistics for engineers [47] Boudali H, Dugan J. A discrete-time bayesian network reliability modeling and
and scientists, Prentice–Hall, Inc., USD, NJ, USA, 2002. analysis framework. Reliab Eng Syst Saf 2005;87(3):337–49.
[41] Jiang X, Yuan Y, Liu X. Bayesian inference method for stochastic damage [48] Langseth H, Portinale L. Bayesian networks in reliability. Reliab Eng & System
accumulation modeling. Reliab Eng Syst Saf 2013;111(0):126–38. Safety 2007;92(1):92–108.
[42] W. Gragido, Lions at the watering hole—the “VOHO” affair (July 2012). url: [49] Guo H, Liao H, Zhao W, Mettas A. A new stochastic model for systems under
〈https://blogs.rsa.com/lions-at-the-watering-hole-the-voho-affair/〉. general repairs. IEEE Trans Reliab 2007;56(1):40–9.

Das könnte Ihnen auch gefallen