
2016 3rd MEC International Conference on Big Data and Smart City

Study on Network Failure Prediction Based on Alarm Logs
Jiang ZHONG1, Weili GUO2*, Zhenhua WANG2
1. Key Laboratory of Dependable Service Computing in Cyber-Physical Society, Chongqing University, Chongqing city, China; zhongjiang@cqu.edu.cn
2. College of Computer Science, Chongqing University, Chongqing city, China; shidaiwu@cqu.edu.cn, qaswzh@163.com

Abstract—To avoid unpredictable losses caused by network failure, the reliability of the network needs to be evaluated in some application scenarios. This paper studies network failure prediction using 14 months of network alarm logs that we collected from one metropolitan area network. The research method is as follows: firstly, construct features that represent network characteristics by means of a feature construction method based on two-level time windows; secondly, select the optimal parameter combination for creating the feature files through multiple experiments; thirdly, design and build an adaptive failure prediction model using classification learning methods. Experiments show that the accuracy of predicting whether a network failure takes place within 6 hours is up to 70%, which is clearly better than the prediction result of the Weibull distribution model; the results of classification prediction for network equipment failure are slightly better than those of the prediction method based on the Weibull distribution. Preliminary research results show that most network failures can be predicted by analyzing previous network running logs, and the method proposed in this paper is verified to have a good prediction effect. In practical applications, this method can detect failures at an early stage and reduce unnecessary economic losses.

Keywords—network failure; network equipment; failure prediction; classification prediction; Weibull distribution; feature construction

I. INTRODUCTION

With the development of the Internet, numerous devices are connected to networks and increasingly complex network structures are formed, in which network failures become inevitable. In some circumstances, such as rocket launches and military exercises, a mission could fail because important information is lost as a result of network failures, which may cause immeasurable damage. Therefore, it is worthwhile to evaluate the status of the network and predict the probability of network failures before executing an important task.

Failure prediction is a method of determining whether a system will malfunction by analyzing the evolution of its historical status and its current behavior. Failure prediction is of great importance in reducing the burden of network management and maintenance and in minimizing the loss caused by network failure [1]. According to the object of analysis, failure prediction can be classified into two types: prediction based on failure logs and prediction based on system status [2]. The basic idea of prediction based on failure logs is to collect the failure logs, analyze their potential relationships with data mining or machine learning methods, and then use these relationships to predict failure. Liang [3] collected the IBM BlueGene/L event logs and then applied classifiers to the data to build a prediction model. Gainaru [4] merged signal analysis concepts with data mining techniques to extend the ELSA (Event Log Signal Analyzer) toolkit and offered an adaptive prediction module for failures. The Failure Trace Archive (FTA) was created by Javadi [5] as an online public repository of availability traces taken from diverse parallel and distributed systems to facilitate the design, validation, and comparison of fault-tolerant models and algorithms. Guan [6] presented an unsupervised failure detection method using an ensemble of Bayesian models; it characterizes normal execution states of the system and detects anomalous behaviors. The main idea of prediction based on state monitoring is that the occurrence of some system failures inevitably leads to changes in some state parameters of the system. Therefore, we can predict whether the system will fail by monitoring changes in the system state parameters. Vaidyanathan [7] proposed a measurement-based model to estimate the rate of exhaustion of operating system resources as a function of both time and the system workload state, and then constructed a semi-Markov reward model on the basis of workload to predict system failure. At present, there is little systematic research on network fault prediction. Hou [8] presented a network device failure prediction method based on a neural network; the method builds a failure prediction model from network device running state data and then runs simulation experiments. However, the model has not been verified in a large-scale, real network operating environment.

In practical applications, network failure prediction can be divided into failure prediction for a network system and failure prediction for network equipment. The former refers to whether a network equipment malfunction occurs within a future period of time. In this paper, we study these two types of network failure prediction using alarm logs provided by a real network environment and design an adaptive prediction method based on data mining. The main contributions of the paper are:
This paper is sponsored by National High-tech R&D Program of China
(NO.2015AA015308), Project CDJZR185502 supported by the Fundamental
Research Funds for the Central Universities, Fundamental Research Funds for
the Central Universities (NO.106112014CDJZR188801)
Corresponding author. Email: shidaiwu@cqu.edu.cn

978-1-4673-9584-7/16/$31.00 ©2016 IEEE



1. Due to the lack of network failure data from real network environments, we collected and analyzed 14 months of network alarm logs to start our prediction research on network failure.

2. For the two types of network failure prediction above, we designed a feature construction method based on two-level time windows, and through a series of comparative experiments we determined the key parameters for constructing the features of the prediction model.

3. We implemented a classifier-based adaptive prediction model for network system failure and for network equipment failure. Comparison with the traditional prediction model indicates that the classification prediction model performs better.

The contents of the paper are as follows: Section II presents the running state of the network and the characteristics of the log; Section III describes the method of failure prediction and feature construction; Section IV gives the results and analysis of the experiments; and Section V contains the summary and outlook.

II. NETWORK RUNNING CHARACTERISTIC

A. Running state of network and equipment

In real application scenarios, the running state of network equipment can be divided into the normal running state, the abnormal state, the fault state and the urgent failure state. In the normal running state, the network device operates normally and the logging system records information such as routing changes. The abnormal state is the state in which some exception occurs in the network equipment during operation, but the exception may not affect the normal operation of the device. The fault state is the state in which some failure in the network equipment influences the operation of the device but does not lead to any serious, global problem. The urgent failure state means that a global problem has arisen because some device malfunctioned during network operation; an urgent failure must be responded to immediately, otherwise it will lead to failure of the task. An example is the disconnection of a link to the core switch.

According to the running status of each piece of equipment in the network system, the running status of the network system can also be classified into the normal state, abnormal state, fault state, and urgent failure state. These four states mean, respectively, that all devices are in normal operation, that some devices are in the abnormal state, that some devices are in the fault state, and that some devices are in the urgent failure state.

By analyzing the network operation logs, we find:

1. Conversion is possible between any two types of network equipment status.

2. Conversion is possible between any two kinds of network system status.

The conversion relationships are shown in Fig. 1. The interactions among network devices and different faults are extremely complex, and a log that records long-term running data of the network system may contain these relationships. Unfortunately, it is difficult to describe the conversion relationships accurately with professional knowledge alone. Therefore, this paper establishes prediction models for network system urgent failure and network equipment urgent failure with data mining methods.

Fig. 1. The conversion of network states (diagram nodes: Normal, Abnormal, Fault, Urgent failure; edge label: Intervention & repair)

B. Characteristic of network alarm logs

The network alarm logs record all kinds of alarm incidents in the course of system operation, including the device that raised the alarm, the occurrence time and clearing time, the alarm name, alarm type, alarm level and other information. According to the urgency of alarm events in the real application, alarm events are classified as prompt alarms, minor alarms, important alarms, and urgent alarms, corresponding to the device states of normal running, abnormal, fault and urgent failure [2]. Table 1 shows part of the network alarm logs.

Table 1. Alarm logs sequence

Level | Name | Source | Start Time
prompt | Out of performance | Device36 | 01/16/2013 15:11:18
prompt | Out of performance | Device36 | 01/16/2013 15:11:18
minor | Interface CRC error | Device58 | 01/16/2013 15:11:20
important | Power module down | Device42 | 01/16/2013 15:11:27
urgent | Link disconnection | Device47 | 01/16/2013 15:11:30

Through analysis of the log, we find that network alarm logs have the following characteristics:

1. The alarm incidents of network devices can be divided into root alarms and derivative alarms; a derivative alarm is derived from a root alarm.

2. The alarm logs contain a lot of redundant data and flash alarms, which are soon resolved automatically by the system.

3. The number of alarms follows certain patterns, and the occurrence of alarm events is strongly correlated with time; for example, the number of alarms during working hours is significantly higher than in other periods.
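Characteristics 1 and 2 translate fairly directly into filtering rules. The following is a minimal sketch, not the authors' implementation. The record layout (`Alarm`), the `is_derivative` flag and the choice to identify repeats by (source, name) are assumptions made for illustration; the 22-second flash threshold and the per-level intervals of 15, 4, 3 and 2 minutes are the values used in the preprocessing step of this paper.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alarm:
    # Hypothetical record layout; the real log schema is not given in the paper.
    level: str                           # "prompt", "minor", "important" or "urgent"
    name: str
    source: str
    start: datetime
    cleared: Optional[datetime] = None   # None if the alarm was never cleared
    is_derivative: bool = False          # assumed to be labelled upstream


FLASH_THRESHOLD = timedelta(seconds=22)
REDUNDANCY_WINDOW = {
    "prompt": timedelta(minutes=15),
    "minor": timedelta(minutes=4),
    "important": timedelta(minutes=3),
    "urgent": timedelta(minutes=2),
}


def preprocess(alarms):
    """Drop derivative alarms, flash alarms and close repeats, in that order."""
    last_kept = {}   # (source, name) -> start time of the last alarm kept
    kept = []
    for a in sorted(alarms, key=lambda x: x.start):
        if a.is_derivative:
            continue
        if a.cleared is not None and a.cleared - a.start < FLASH_THRESHOLD:
            continue
        key = (a.source, a.name)
        prev = last_kept.get(key)
        # Assumption: a repeat of the same alarm on the same device within
        # the level's interval counts as redundant.
        if prev is not None and a.start - prev < REDUNDANCY_WINDOW[a.level]:
            continue
        last_kept[key] = a.start
        kept.append(a)
    return kept
```

Applied to the rows of Table 1, this sketch would keep only one of the two identical "Out of performance" records from Device36.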

Based on the above analysis, data preprocessing is divided into three steps: first, derivative alarm records are deleted; then, flash alarms whose clearing interval is less than 22 seconds are filtered out; finally, redundant alarms are screened with a time interval filtering method. The time thresholds for the prompt, minor, important and urgent levels are set to 15, 4, 3 and 2 minutes respectively. Applying this filter to the original alarm log data removed 881,196 redundant alarm records, accounting for 85% of the original alarm records.

C. Network alarm distribution

Alarm logs record each alarm generated by each network device. The daily alarm distribution of each device in October is shown in Fig. 2, and the daily distribution of the four alarm levels in October is shown in Fig. 3.

Fig. 2. Alarm distribution of each device in October

Fig. 3. Four-level alarm distribution in October

From Fig. 2 and Fig. 3, we can see that different types of alarms generated by different network devices occurred every day in the course of network operation. Therefore, we can use the statistics of network alarms over a period of time to characterize the network status in that period, and then construct feature files from these statistics.

III. NETWORK FAILURE PREDICTION BASED ON DATA MINING

This paper predicts network failure with a method similar to the classical failure prediction method for computer systems [9]. The basic idea is to predict network failure after a given moment by analyzing the network state before that moment. We therefore designed the prediction system shown in Fig. 4: firstly, collect the network running alarm logs; secondly, filter the dirty data; thirdly, construct feature files based on the log analysis; and finally, choose a classifier and train it to build the prediction model.

Fig. 4. Network failure prediction system (diagram blocks: network alarm logs, process log, filtered logs, feature construction method, construct feature, feature files, training datasets, testing datasets, classifier, classified prediction model, failure alert, feedback)

The prediction model is the key to network failure prediction, and the constructed features directly affect its performance. To construct effective features that can represent network failure, this paper designs a feature construction method based on two-level time windows.

A. Time window for failure prediction

This paper uses statistics on alarm events in two-level time windows to represent the network state, and these statistics also serve as the decision-making features for failure prediction. Assuming the current moment is t, in order to predict network failure in the next Δ hours, this paper gives three definitions of time windows [10,11], as shown in Fig. 5.

Definition 1. The Prediction Window Wp is the time window (t, t + Δ] for which we predict network failure.

Definition 2. The Observation Window Wo is the time window [t − n × Δ, t]; the size of a unit observation window is Δ. The alarm event set in unit observation window i is S_wo_i.

Definition 3. Sample Window Ws: each unit observation window Wo_u is divided into Δ/δ sample windows, where δ is the size of a sample window. The alarm event set in sample window j is S_ws_j.

Fig. 5. The partitioning strategy of timeline (timeline labels: observation window i, sample window j of size δ, current observation window of size Δ, current time t, observation size n×Δ, prediction window)

Predicting failure with a classification method requires a quantitative representation of the running state of the network system or of the network equipment. We use the statistics on alarm events in two-level time windows to characterize the network running state. The prediction model in this paper is built on the basis of these statistics.
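The two-level time window statistics can be sketched in code. The example below is an illustration under stated assumptions, not the authors' implementation: events are assumed to be (timestamp, level) pairs, counts are taken per alarm level only (the paper also counts per alarm type), the prediction window size is assumed to be a whole multiple of the sample window size, and the function name `window_features` and the feature keys are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean, pvariance

LEVELS = ("prompt", "minor", "important", "urgent")


def window_features(events, t, delta, n, sample):
    """Alarm-count features over the observation window [t - n*delta, t].

    events: iterable of (timestamp, level) pairs
    delta:  unit observation window size (same as the prediction window size)
    n:      number of unit observation windows
    sample: sample window size; delta is assumed to be a multiple of it
    """
    per_unit = delta // sample        # sample windows per unit observation window
    total = n * per_unit              # sample windows in the whole observation window
    start = t - n * delta

    # counts[level][j] = number of alarms of that level in sample window j
    counts = {lv: [0] * total for lv in LEVELS}
    for ts, lv in events:
        if start <= ts <= t:
            # An event exactly at t is assigned to the last sample window.
            j = min((ts - start) // sample, total - 1)
            counts[lv][j] += 1

    features = {}
    for lv in LEVELS:
        c = counts[lv]
        for i in range(n):            # counts per unit observation window
            features[f"{lv}_obs{i}"] = sum(c[i * per_unit:(i + 1) * per_unit])
        for j in range(total):        # counts per sample window
            features[f"{lv}_smp{j}"] = c[j]
        features[f"{lv}_mean"] = mean(c)        # mean over all sample windows
        features[f"{lv}_var"] = pvariance(c)    # variance over all sample windows
    return features
```

With a 1-hour prediction window, 10-minute sample windows and n = 3, this sketch yields 4 × (3 + 18 + 2) = 92 values. Whether the paper uses population or sample variance is not stated; `pvariance` is an arbitrary choice here.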

B. The construction of network system failure feature

The construction of the feature file plays an important role in building a classification prediction model for network system failure. To ensure the prediction effect, it is necessary to construct a feature file that accurately characterizes the network running state. By analyzing alarm logs, we find that conversion is possible between any two kinds of network system status. There are also certain correlations between network system failures, which can be found from the alarm events in the alarm logs. Therefore, we construct the feature file as follows:

1. Count the number of alarms of each type in each observation window as the first type of feature item.

2. To represent network faults more accurately, count the number of alarms of each type in each sample window as the second type of feature item.

3. Compute the means and variances of the number of alarms of each type over all sample windows as the third type of feature item.

4. If no failure has occurred for a long time, the probability that a failure is imminent is relatively large, so we compute the number of unit windows between the current window and the last failure window as the fourth type of feature item.

5. The log analysis suggests that the occurrence of urgent alarms may be correlated with time, so we choose the midpoint time of the prediction window as the fifth type of feature item.

In summary, in constructing features for the network failure prediction model, we obtain the following types of feature items:

Table 2. Features construction for network system failure prediction model

Type | Feature item | Count
Type 1 | the number of 4 levels, 33 types of alarms in each unit observation window | 37×n
Type 2 | the number of 4 levels, 33 types of alarms in each sample window | 37×(n×Δ/δ)
Type 3 | means and variances of the number of 4 levels, 33 types of alarms in all sample windows | 37×2
Type 4 | the number of unit windows between the current window and the last failure window | 1
Type 5 | the midpoint time of the prediction window (hour) | 1

After constructing 113+37×(n×(Δ/δ+1)) feature items according to the feature construction rules defined above, the feature items also need to be screened with information gain [12] to reduce the dimension of the feature space.

C. The construction of network equipment failure feature

Failure prediction for network equipment means predicting whether a device will malfunction in the prediction window. Conversion is possible between any two types of network equipment status, there are certain correlations between network equipment failures, and these relationships can be found from the alarm events in the alarm logs. We therefore construct the feature file for network equipment failure by following the feature construction method for network system failure. We count the number of alarms of each type for each piece of equipment as feature items (87 pieces of equipment, 4 levels and 32 types) and obtain the following types of feature items:

Table 3. Features construction for network equipment failure prediction model

Type | Feature item | Count
Type 1 | the number of each device's 4 levels, 32 types of alarms in each unit observation window | 87×(32+4)×n
Type 2 | the number of each device's 4 levels, 32 types of alarms in each sample window | 87×(32+4)×(n×Δ/δ)
Type 3 | means and variances of the number of each device's 4 levels, 32 types of alarms in all sample windows | 87×(32+4)×2
Type 4 | the number of unit windows between the current window and the last failure window of each device | 87
Type 5 | the midpoint time of the prediction window (hour) | 1

After constructing 113+37×(n×(Δ/δ+1)) feature items according to the feature construction rules defined above, the feature items also need to be screened with information gain to reduce the dimension of the feature space.

IV. EXPERIMENT AND PERFORMANCE EVALUATION

This article studies a total of 14 months of network equipment alarm logs from January 10, 2013, to April 10, 2013, provided by a metropolitan area network. First, we determine the evaluation indexes of the model. Then we study how to choose the optimal parameter combination for constructing features with reference to the evaluation indexes. Finally, we show the results and present our analysis. For comparison with the classification prediction model, we use the theory of the two-parameter Weibull distribution to build a prediction model from the viewpoint of probability statistics. Reliability features related to the Weibull distribution [13], such as the failure probability density function, are obtained by estimating the two parameters of the Weibull distribution; we can then predict failure for network equipment and the network system by studying the reliability of the network equipment.

A. Evaluation Index

The experiments use prediction accuracy, recall rate, and F-Measure as the indexes to evaluate the performance of the prediction model [2]. These indexes are calculated as shown in Eq. 1 to Eq. 3.

$$\text{accuracy} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{F-Measure} = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{3}$$

In the equations above, TP is the number of failures that are predicted as failures, FP is the number of non-failures that are predicted as failures, and FN is the number of failures that are predicted as non-failures.

B. Results and Analysis

1) Experiment for choosing parameter

When we construct features, the optimal values of the observation window number and the sample window size need to be selected. For this purpose, we ran multiple sets of comparative experiments. Fig. 6 shows how prediction accuracy changes with different combinations of observation window number and sample window size for different prediction window sizes, using the Bayes Net algorithm for failure prediction.

Fig. 6. Accuracy varies with observation window number and sample window size for different prediction windows

Fig. 6 shows that:

1. When predicting whether the network will malfunction in the next period of time (with the prediction window fixed), the prediction accuracy declines on the whole as the number of observation windows increases for a given sample window size, and rises modestly on the whole as the sample window size increases for a given number of observation windows.

2. When the number of observation windows and the sample window size are both fixed, the larger the prediction window, the higher the prediction accuracy.

The results for network equipment failure agree well with the conclusions above (for prediction window sizes of 6 hours and 12 hours). From this analysis, the number of observation windows and the sample window size are determined by the size of the prediction window; by default, the number of observation windows is 1-2 and the sample window size is moderate (generally 10-60 minutes).

Accordingly, the sample window size and the number of observation windows are chosen as in Table 4 when we construct features for the network system failure prediction model and the network equipment failure prediction model in this paper. The prediction window sizes are 1 hour, 4 hours, 6 hours and 12 hours.

Table 4. Choice of parameters to build prediction model

Prediction window size Δ (hours) | Sample window size δ (minutes) | Observation windows' number n
1 | 10 | 3
4 | 40 | 2
6 | 60 | 1
12 | 90 | 1

2) Prediction results for network system failure

This paper chose the RIPPER, Bayes Net and Random Forest algorithms [14] to build classification prediction models, and constructed four feature files from the processed log, according to the parameters determined above, to build a prediction model for network system failure. Each dataset is divided into a training set for the model's adaptive training and learning, and a testing set for testing the performance of the model. Further, this experiment uses ten rounds of cross-validation. The results are shown in Fig. 7.
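The evaluation indexes of Eq. 1 to Eq. 3, in which the results of this section are reported, can be computed directly from predicted and actual labels. The sketch below is illustrative only; the function name `evaluate` and the convention of returning 0.0 when a denominator is zero are assumptions. Note that the ratio this paper calls accuracy in Eq. 1 is the one commonly called precision.

```python
def evaluate(actual, predicted):
    """Return (accuracy, recall, f_measure) as defined in Eq. 1 to Eq. 3.

    actual, predicted: equal-length sequences of truthy/falsy labels,
    where a truthy value means "failure".
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)        # failures predicted as failure
    fp = sum(1 for a, p in pairs if not a and p)    # non-failures predicted as failure
    fn = sum(1 for a, p in pairs if a and not p)    # failures predicted as non-failure

    # Assumption: report 0.0 when an index is undefined (zero denominator).
    accuracy = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return accuracy, recall, f_measure
```

For example, with three failures predicted correctly, one false alarm and one missed failure, all three indexes come out at 0.75.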

Fig. 7. Performance of prediction model for network system failure: (a) accuracy, (b) recall, (c) F-Mean

The experimental results show the following. Firstly, the prediction model based on classification has better prediction performance, indicating that the classification-based method is more suitable for network failure prediction than the traditional forecasting method based on the Weibull distribution. Secondly, when the time window is greater than 6 hours, the failure prediction model performs better: the accuracy rate exceeds 70% and the recall rate is even over 80%. Thirdly, the stability and performance of the RIPPER algorithm are better overall than those of the other two classification algorithms.

3) Prediction result for network equipment failure

The fewer malfunctions a device has, the fewer log records it produces and the worse the classification prediction performs. Therefore, we chose device 83, which has the most failures, and device 0, which has the second most, as the research objects, and built urgent failure prediction models for device 83 and device 0 with both the classification method and the probability statistics method.

According to the parameters determined above, we construct four feature files from the processed log to build failure prediction models for device 83 and device 0 respectively. Each dataset is divided into a training set for the model's adaptive training and learning, and a testing set. Further, this experiment uses ten rounds of cross-validation. The results are shown in Fig. 8.

Fig. 8. Performance of prediction model for device 83/device 0: (a) accuracy for device 83, (b) recall for device 83, (c) F-Mean for device 83, (d) accuracy for device 0, (e) recall for device 0, (f) F-Mean for device 0

Fig. 8 shows that the effect of the Weibull prediction model based on probability statistics is similar to that of the model based on classification, which indicates a certain independence among the alarms of different devices. Compared with the classification prediction model, the performance of the prediction model based on the Weibull distribution fluctuates widely, which shows that the classification-based prediction model is more stable.

In summary, when the prediction window is large, the effect is better if the number of observation windows is set small (generally 1-3) or the sample window size is moderate (generally 10-60 minutes) during feature construction. The accuracy of the classification-based prediction model for network system failure can exceed 70%, clearly better than the prediction result of the probability model. The RIPPER algorithm performs best overall. The prediction effect for equipment failure is not very good; in particular, using data from devices that are independent of the predicted device as features may interfere with the classifiers. Of the three classification algorithms above, Bayes Net is more suitable for predicting failure of network equipment that malfunctions rarely, and Random Forest is more suitable for equipment that malfunctions often.
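The paper does not spell out how the two-parameter Weibull baseline is turned into a prediction for a given window. One common reliability formulation, shown here purely as an assumption, is the conditional probability that a failure occurs within the next Δ hours given that none has occurred in the t hours since the last failure. The parameters `shape` and `scale` are taken as already estimated, and the function names are hypothetical.

```python
import math


def weibull_cdf(t, shape, scale):
    """Two-parameter Weibull CDF: F(t) = 1 - exp(-(t / scale) ** shape), t >= 0."""
    return -math.expm1(-((t / scale) ** shape))


def failure_prob_in_window(t, delta, shape, scale):
    """P(failure in (t, t + delta] | no failure up to t).

    Equals 1 - S(t + delta) / S(t), where S(t) = exp(-(t / scale) ** shape)
    is the Weibull survival function.
    """
    h_now = (t / scale) ** shape
    h_end = ((t + delta) / scale) ** shape
    return -math.expm1(h_now - h_end)
```

A prediction could then be issued by comparing this probability with a chosen threshold; the threshold is a further assumption that the paper does not specify. With `shape` equal to 1 the model is memoryless, so the probability does not depend on t, while with `shape` greater than 1 it grows as the time since the last failure increases.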

V. SUMMARY AND OUTLOOK

This paper puts forward a feature construction method based on two-level time windows, and designs and builds classification-based prediction models for network system failure and equipment failure. The accuracy can exceed 70%, and it is verified that conversion is possible between any two types of network equipment status and between any two kinds of network system status. The research results also show that most network failures can be predicted by analyzing network logs with data mining methods. The comparison with the prediction model based on the Weibull distribution indicates that the classification-based prediction model performs better for both network failure and equipment failure.

In future work, we can improve the performance and availability of our prediction model in the following ways. Firstly, we can extract more effective data related to the alarm type, such as the CPU utilization of the equipment. Secondly, we can filter the original alarm log data more effectively with association rules. Thirdly, we can improve the feature construction rules by considering the characteristics of the network equipment. However, some modifications will be needed according to the characteristics of the network systems.

References
[1] Pecht M. Prognostics and health management of electronics[M]. John Wiley & Sons, Ltd, 2008.
[2] Salfner F, Lenk M, Malek M. A survey of online failure prediction methods[J]. ACM Computing Surveys (CSUR), 2010, 42(3): 10.
[3] Liang Y, Zhang Y, Jette M, et al. BlueGene/L failure analysis and prediction models[C]. In: Dependable Systems and Networks, 2006. DSN 2006. International Conference on. IEEE, 2006: 425-434.
[4] Gainaru A, Cappello F, Snir M, et al. Failure prediction for HPC systems and applications: Current situation and open issues[J]. International Journal of High Performance Computing Applications, 2013, 27(3): 273-282.
[5] Javadi B, Kondo D, Iosup A, et al. The Failure Trace Archive: Enabling the comparison of failure measurements and models of distributed systems[J]. Journal of Parallel and Distributed Computing, 2013, 73(8): 1208-1223.
[6] Guan Q, Zhang Z, Fu S. Ensemble of bayesian predictors and decision trees for proactive failure management in cloud computing systems[J]. Journal of Communications, 2012, 7(1): 52-61.
[7] Vaidyanathan K, Trivedi K S. A measurement-based model for estimation of resource exhaustion in operational software systems[C]. In: Software Reliability Engineering, 1999. Proceedings. 10th International Symposium on. IEEE, 1999: 84-93.
[8] Hou X K. Neural network fault prediction system in network equipment[J]. Journal of Shandong University of Technology (Natural Science Edition), 06(2014): 29-34.
[9] Liang Y, Zhang Y, Xiong H, Sahoo R. An adaptive semantic filter for Blue Gene/L failure log analysis[C]. In: Proceedings of the Third International Workshop on System Management Techniques, Processes, and Services (SMTPS), 2007.
[10] Oliner A J, Aiken A, Stearley J. Alert detection in system logs[C]. In: 2008 Eighth IEEE International Conference on Data Mining, 2008.
[11] Liang Y, Zhang Y, Xiong H, Sahoo R, Sivasubramaniam A. Failure prediction in IBM Blue Gene/L event logs[C]. In: Proceedings of Seventh IEEE International Conference on Data Mining. Washington, DC, USA: IEEE Computer Society, 2007: 583-588.
[12] Gao Z, Xu Y, Meng F, et al. Improved information gain-based feature selection for text categorization[C]. In: Wireless Communications, Vehicular Technology, Information Theory and Aerospace & Electronic Systems (VITAE), 2014 4th International Conference on. IEEE, 2014: 1-5.
[13] Almalki S J, Nadarajah S. Modifications of the Weibull distribution: A review[J]. Reliability Engineering & System Safety, 2014, 124: 32-55.
[14] Thornton C, Hutter F, Hoos H H, et al. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms[C]. In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013: 847-855.
