Abstract—To avoid unpredictable losses caused by network failure, the reliability of the network needs to be evaluated in some application scenarios. This paper bases its network failure prediction research on 14 months of network alarm logs that we collected from one metropolitan area network. The research method is as follows: first, construct features that represent network characteristics using a feature construction method based on two-level time windows; second, select the optimal parameter combination for creating the feature files through multiple experiments; third, design and build an adaptive failure prediction model using classification learning methods. Experiments show that the accuracy of predicting whether a network failure takes place within 6 hours is up to 70%, which is clearly better than the prediction result of the Weibull distribution model; the results of classification prediction for network equipment failure are slightly better than those of the prediction method based on the Weibull distribution. Preliminary research results show that most network failures can be predicted by analyzing previous network running logs, and the method proposed in this paper is verified to have a good prediction effect. This method can detect failures at an early stage in practical applications and reduce unnecessary economic losses.

Keywords—network failure; network equipment; failure prediction; classification prediction; Weibull distribution; feature construction

This paper is sponsored by National High-tech R&D Program of China (NO.2015AA015308), Project CDJZR185502 supported by the Fundamental Research Funds for the Central Universities, Fundamental Research Funds for the Central Universities (NO.106112014CDJZR188801)
Corresponding author. Email: shidaiwu@cqu.edu.cn

I. INTRODUCTION
With the development of the Internet, numerous devices are connected to networks, forming a more complex network structure in which network failures become inevitable. In some circumstances, such as rocket launches and military exercises, a mission could fail because important information is lost as a result of network failures, which may cause incalculable damage. Therefore, it is worthwhile to evaluate the status of the network and predict the probability of network failures before executing an important task.

Failure prediction is a method of determining whether a system will malfunction by analyzing the evolution of its historical status and its current behavior. The study of failure prediction is of great importance in reducing the burden of network management and maintenance and in minimizing the loss caused by network failure [1]. According to the object of analysis, failure prediction can be classified into two types: prediction based on failure logs and prediction based on system status [2]. The basic idea of prediction based on failure logs is to collect the failure logs, analyze their potential relationships with data mining or machine learning methods, and then use these relationships to predict failure. Liang [3] collected the IBM BlueGene/L event logs and then applied classifiers to the data to build the prediction model. Gainaru [4] merged signal analysis concepts with data mining techniques to extend the ELSA (Event Log Signal Analyzer) toolkit and offered an adaptive prediction module for failure. The Failure Trace Archive (FTA) was created by Javadi [5] as an online public repository of availability traces taken from diverse parallel and distributed systems to facilitate the design, validation, and comparison of fault-tolerant models and algorithms. Guan [6] presented an unsupervised failure detection method using an ensemble of Bayesian models; it characterizes the normal execution states of the system and detects anomalous behaviors. The main idea of prediction based on state monitoring is that the occurrence of some system failures will inevitably lead to changes in some state parameters of the system. Therefore, we can predict whether the system will fail by monitoring the changes in the system state parameters. Vaidyanathan [7] proposed a measurement-based model to estimate the rate of exhaustion of operating system resources as a function of both time and the system workload state, and then constructed a semi-Markov reward model on the basis of workload to predict system failure. At present, there is little systematic research on network fault prediction. Hou [8] presented a network device failure prediction method based on a neural network, which builds a failure prediction model from network device running state data and then runs a simulation experiment. However, the model has not been verified in a large-scale, real network operating environment.

In practical applications, network failure prediction can be divided into failure prediction for a network system and failure prediction for network equipment. The former refers to whether any network equipment malfunction occurs within a future period of time. In this paper, we study these two types of network failure prediction using the alarm logs provided by a real network environment and design an adaptive prediction method based on data mining. The main contributions of the paper are:
1. Because failure data from real network environments is scarce, we collected and analyzed 14 months of network alarm logs as the basis for our research on network failure prediction.

2. For the two types of network failure prediction above, we designed a feature construction method based on two-level time windows, and through a series of comparative experiments we determined the key parameters in the process of constructing features for the prediction model.

3. We implemented an adaptive classifier-based prediction model for network system failure and network equipment failure. Comparison with the traditional prediction model indicates that the classification prediction model performs better.

The rest of the paper is organized as follows: Section II presents the running state of the network and the characteristics of the log; Section III describes the method of failure prediction and feature construction; Section IV gives the results and analysis of the experiments; and Section V gives the summary and outlook.

[...] faults are extremely complex, and a log that records long-term running data of the network system may contain these relationships. Unfortunately, it is difficult to use professional knowledge to accurately describe the conversion relationships inside. Therefore, this paper establishes prediction models for network system urgent failure and network equipment urgent failure with data mining methods.

[Figure: network status diagram; recoverable labels: Normal, Abnormal, Fault, Urgent failure, Intervention & repair]
Based on the above analysis, data preprocessing is divided into three steps: first, derivative alarm records are deleted; then, flash alarms whose clear interval is less than 22 seconds are filtered out; finally, redundant alarms are screened out with a time-interval filtering method. The time thresholds for the prompt, secondary, important, and urgent levels are set to 15, 4, 3, and 2 minutes, respectively. Applying this filtering to the original alarm log data removed 881,196 redundant alarm records, accounting for 85% of the original alarm records.

C. Network alarm distribution
The alarm logs record each alarm generated by each network device. The daily alarm distribution of each device in October is shown in Fig. 2, and the daily distribution of the four alarm levels in October is shown in Fig. 3.

III. NETWORK FAILURE PREDICTION BASED ON DATA MINING
This paper predicts network failure using a method similar to the classical failure prediction method for computer systems [9]. The basic idea is to predict network failure after a given moment by analyzing the network state before that moment. We therefore designed the prediction system shown in Fig. 4: first, collect the network running alarm logs; second, filter out the dirty data; third, construct the feature file based on log analysis; and finally, choose a classifier to train and build the prediction model.

[Fig. 4: prediction system diagram; recoverable labels: Network, Network alarm logs, Process log, Filtered logs, Feature construction method, Construct feature, Feature files, Training datasets, Testing datasets, classifier, Classified prediction model, Failure alert, Feed back]

[Figure: time window diagram; recoverable labels: t, Time, Observation size n×Δ, Prediction]
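The flash-alarm and time-interval filtering steps described in the preprocessing above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Alarm` record layout, the level names used as dictionary keys, and the rule that an alarm counts as redundant when it repeats the same device, type, and level within the level's threshold of the last kept alarm are all assumptions. Deleting derivative alarms is not modeled, since it would need knowledge of how alarms derive from one another.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Alarm(NamedTuple):
    """Hypothetical alarm record; the field names are assumptions."""
    device: str
    alarm_type: str
    level: str                    # "prompt", "secondary", "important" or "urgent"
    raised: datetime
    cleared: Optional[datetime]   # None if the alarm was never cleared


# Thresholds taken from the text: flash alarms clear in under 22 seconds;
# redundancy windows are 15, 4, 3 and 2 minutes per level.
FLASH_LIMIT = timedelta(seconds=22)
REDUNDANCY_WINDOW = {
    "prompt": timedelta(minutes=15),
    "secondary": timedelta(minutes=4),
    "important": timedelta(minutes=3),
    "urgent": timedelta(minutes=2),
}


def filter_alarms(alarms):
    """Drop flash alarms, then drop repeats that fall inside the level's window."""
    kept = []
    last_kept = {}  # (device, alarm_type, level) -> raise time of last kept alarm
    for alarm in sorted(alarms, key=lambda a: a.raised):
        # Flash alarm: raised and cleared again within the flash limit.
        if alarm.cleared is not None and alarm.cleared - alarm.raised < FLASH_LIMIT:
            continue
        key = (alarm.device, alarm.alarm_type, alarm.level)
        previous = last_kept.get(key)
        # Redundant alarm (assumed rule): same key, too soon after the last kept one.
        if previous is not None and alarm.raised - previous < REDUNDANCY_WINDOW[alarm.level]:
            continue
        last_kept[key] = alarm.raised
        kept.append(alarm)
    return kept
```

Comparing against the last kept alarm rather than the last seen one means a long burst of repeats still yields one record per threshold interval; the opposite choice would collapse the whole burst into a single record.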
B. The construction of network system failure feature
The construction of the feature file plays an important role in building the classification prediction model for network system failure. To ensure the prediction effect, it is necessary to construct a feature file that accurately characterizes the network running state. By analyzing the alarm logs, we find that conversion is possible between any two kinds of network system status. There are also correlations between network system failures, which can be found from the alarm events in the alarm logs. Therefore, we construct the feature file with the following method:

1. Compute the number of alarms of each type in each observation window as the first type of feature item.

2. To represent the network fault more accurately, count the alarms of each type in each sample window as the second type of feature item.

3. Compute the means and variances of the number of alarms of each type over all sample windows as the third type of feature item.

4. If no failure has occurred for a long time, the probability that a failure is imminent is relatively large, so we compute the number of unit windows between the current window and the last failure window as the fourth type of feature item.

5. Log analysis suggests that the appearance of emergency alarms may be time-correlated, so we choose the midpoint time of the prediction time window as the fifth type of feature item.

In summary, the construction of features for the network failure prediction model yields the following types of feature items:

Table 2. Features construction for network system failure prediction model
Type | Feature item | Count
Type 1 | the number of alarms of 4 levels and 33 types in each unit observation window | 37*n
Type 2 | the number of alarms of 4 levels and 33 types in each sample window | 37 (n )
Type 3 | means and variances of the number of alarms of 4 levels and 33 types in all sample windows | 37*2
Type 4 | the number of unit windows between the current window and the last failure window | 1
Type 5 | the midpoint time of the prediction window (hour) | 1

[...] the alert logs, we construct the feature file for network equipment failure with reference to the feature construction method for network system failure. We count the number of alarms of each type for each piece of equipment as feature items (87 pieces of equipment, 4 levels, and 32 types), which gives the following types of feature items:

Table 3. Features construction for network equipment failure prediction model
Type | Feature item | Count
Type 1 | the number of each device's alarms of 4 levels and 32 types in each unit observation window | 87*(32+4)*n
Type 2 | the number of each device's alarms of 4 levels and 32 types in each sample window | 87 32+4 (n )
Type 3 | means and variances of the number of each device's alarms of 4 levels and 32 types in all sample windows | 87*(32+4)*2
Type 4 | the number of unit windows between the current window and the last failure window of each device | 87
Type 5 | the midpoint time of the prediction window (hour) | 1

After constructing 113+37×(n×( +1)) feature items according to the feature construction rules defined above, the feature items also need to be screened with information gain to reduce the dimension of the feature space.

IV. EXPERIMENT AND PERFORMANCE EVALUATION
This article studies a total of 14 months of network equipment alarm logs, from January 10, 2013, to April 10, 2013, provided by a metropolitan area network. First, we determine the evaluation indices of the model. We then study how to choose the optimal parameter combination for constructing features with reference to these indices. Finally, we show the results and present our analysis.

For comparison with the classification prediction model, we use the theory of the two-parameter Weibull distribution to build a prediction model from the viewpoint of probability statistics. Reliability characteristics related to the Weibull distribution [13], such as the failure probability density function, are obtained by estimating the two parameters of the Weibull distribution; failure prediction for network equipment and for the network system can then be made by studying the reliability of the network equipment.
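To make the Weibull baseline more concrete, the sketch below evaluates the standard two-parameter Weibull density and reliability functions and derives a failure probability for a prediction window. It assumes the shape and scale parameters have already been estimated (the text does not say which estimator is used), and the conditional-probability reading, in which the time elapsed since the last failure is the survival time so far, is one plausible way to turn reliability into a prediction, not a rule confirmed by the source. All function and parameter names are hypothetical.

```python
import math


def weibull_pdf(t, shape, scale):
    """Failure probability density f(t) of the two-parameter Weibull, for t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def weibull_reliability(t, shape, scale):
    """Reliability R(t) = exp(-(t/scale)^shape): probability of no failure up to t."""
    return math.exp(-((t / scale) ** shape))


def failure_probability(elapsed, window, shape, scale):
    """P(failure within `window` | no failure during `elapsed`).

    Computed as 1 - R(elapsed + window) / R(elapsed); this conditional form
    is an assumption about how the baseline is applied.
    """
    survive_now = weibull_reliability(elapsed, shape, scale)
    survive_later = weibull_reliability(elapsed + window, shape, scale)
    return 1.0 - survive_later / survive_now


# Illustrative values only (hours): 10 h since the last failure, 6 h window.
p = failure_probability(elapsed=10.0, window=6.0, shape=1.5, scale=100.0)
```

A baseline built this way would predict a failure whenever `p` exceeds a chosen cutoff; with `shape = 1` the distribution reduces to the exponential case and the result no longer depends on `elapsed`.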
Fig. 6. Accuracy varies with observation window number and sample window size under different prediction windows

Fig. 6 shows that:

1. When predicting whether the network will malfunction in the next period of time (with the prediction time window fixed), the prediction accuracy declines overall as the number of observation windows increases for a given sample window size, and rises modestly overall as the sample window size increases for a given number of observation windows.

2. When the number of observation windows and the sample window size are both fixed, the larger the prediction time window, the higher the prediction accuracy.

The results for network equipment failure agree well with the conclusions above (for prediction time window sizes of 6 hours and 12 hours). Based on this analysis, the number of observation windows and the sample window size are chosen according to the size of the prediction time window: the number of observation windows defaults to 1-2, and the sample window size is moderate (generally 10-60 minutes).

2016 3rd MEC International Conference on Big Data and Smart City

Fig. 7. Performance of prediction model for network system failure: (a) accuracy, (b) recall, (c) F-Mean
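The window parameters discussed above (number of observation windows, sample window size, prediction window size) drive the five feature types listed in Tables 2 and 3. The following sketch shows one way those features could be assembled for a single prediction time. It rests on several assumptions that the text does not confirm: alarms are given as hypothetical `(timestamp, category)` pairs, with one pair per level or type counter an alarm contributes to; the sample windows evenly divide the observation span; the variance is the population variance; and the distance to the last failure is counted in whole unit windows back from the prediction time.

```python
from datetime import datetime, timedelta
from statistics import mean, pvariance


def count_by_window(events, start, width, n_windows, categories):
    """Count events per category in n_windows consecutive windows of equal width."""
    index = {category: i for i, category in enumerate(categories)}
    counts = [[0] * len(categories) for _ in range(n_windows)]
    span = width * n_windows
    for when, category in events:
        offset = when - start
        if timedelta(0) <= offset < span and category in index:
            counts[offset // width][index[category]] += 1
    return counts


def build_features(events, now, n, unit, sample, prediction, categories, last_failure):
    """Assemble feature types 1-5 for the observation span [now - n*unit, now)."""
    start = now - unit * n
    # Type 1: counts per category in each of the n unit observation windows.
    per_unit = count_by_window(events, start, unit, n, categories)
    # Type 2: counts per category in each sample window inside the same span.
    n_samples = (unit * n) // sample
    per_sample = count_by_window(events, start, sample, n_samples, categories)

    features = [c for row in per_unit for c in row]
    features += [c for row in per_sample for c in row]
    # Type 3: mean and (population) variance per category over all sample windows.
    for j in range(len(categories)):
        column = [row[j] for row in per_sample]
        features += [mean(column), pvariance(column)]
    # Type 4: number of whole unit windows since the last failure.
    features.append((now - last_failure) // unit)
    # Type 5: midpoint of the prediction window, as an hour of the day.
    midpoint = now + prediction / 2
    features.append(midpoint.hour + midpoint.minute / 60)
    return features
```

With 37 categories this produces 37*n Type 1 items, 74 Type 3 items, and one item each for Types 4 and 5, matching the counts in Table 2; the Type 2 count depends on how many sample windows fit into the observation span. For example, `n=2`, a one-hour unit window, a 30-minute sample window, and a 6-hour prediction window fall within the ranges recommended above.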
V. SUMMARY AND OUTLOOK
This paper puts forward a feature construction method based on two-level time windows, and designs and builds a classification-based prediction model for network system failure and network equipment failure. The accuracy can reach more than 70%, and it is verified that conversion is possible between any two types of network equipment status or any two kinds of network system status. The research results also show that most network failures can be predicted by analyzing network logs with data mining methods. The comparison with the prediction model based on the Weibull distribution indicates that the classification-based prediction model performs better for both network failure and equipment failure.

In future work, we can improve the performance and availability of our prediction model in the following ways. First, we can extract more effective data in addition to the alarm type, such as the CPU utilization of the equipment. Second, we can filter the original alarm log data more effectively using association rules. Third, we can improve the feature construction rules by taking the characteristics of the network equipment into account. However, we have to make some modifications according to the characteristics of the network systems.

References
[1] Pecht M. Prognostics and health management of electronics[M]. John Wiley & Sons, Ltd, 2008.
[2] Salfner F, Lenk M, Malek M. A survey of online failure prediction methods[J]. ACM Computing Surveys (CSUR), 2010, 42(3): 10.
[3] Liang Y, Zhang Y, Jette M, et al. BlueGene/L failure analysis and prediction models[C]. In: Dependable Systems and Networks, 2006. DSN 2006. International Conference on. IEEE, 2006: 425-434.
[4] Gainaru A, Cappello F, Snir M, et al. Failure prediction for HPC systems and applications: Current situation and open issues[J]. International Journal of High Performance Computing Applications, 2013, 27(3): 273-282.
[5] Javadi B, Kondo D, Iosup A, et al. The Failure Trace Archive: Enabling the comparison of failure measurements and models of distributed systems[J]. Journal of Parallel and Distributed Computing, 2013, 73(8): 1208-1223.
[6] Guan Q, Zhang Z, Fu S. Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems[J]. Journal of Communications, 2012, 7(1): 52-61.
[7] Vaidyanathan K, Trivedi K S. A measurement-based model for estimation of resource exhaustion in operational software systems[C]. In: Software Reliability Engineering, 1999. Proceedings. 10th International Symposium on. IEEE, 1999: 84-93.
[8] Hou X K. Neural network fault prediction system in network equipment[J]. Journal of Shandong University of Technology (Natural Science Edition), 2014(06): 29-34.
[9] Liang Y, Zhang Y, Xiong H, Sahoo R. An adaptive semantic filter for Blue Gene/L failure log analysis[C]. In: Proceedings of the Third International Workshop on System Management Techniques, Processes, and Services (SMTPS), 2007.
[10] Oliner A J, Aiken A, Stearley J. Alert detection in system logs[C]. In: 2008 Eighth IEEE International Conference on Data Mining, 2008.
[11] Liang Y, Zhang Y, Xiong H, Sahoo R, Sivasubramaniam A. Failure prediction in IBM Blue Gene/L event logs[C]. In: Proceedings of Seventh IEEE International Conference on Data Mining. Washington, DC, USA: IEEE Computer Society, 2007: 583-588.
[12] Gao Z, Xu Y, Meng F, et al. Improved information gain-based feature selection for text categorization[C]. In: Wireless Communications, Vehicular Technology, Information Theory and Aerospace & Electronic Systems (VITAE), 2014 4th International Conference on. IEEE, 2014: 1-5.
[13] Almalki S J, Nadarajah S. Modifications of the Weibull distribution: A review[J]. Reliability Engineering & System Safety, 2014, 124: 32-55.
[14] Thornton C, Hutter F, Hoos H H, et al. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms[C]. In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013: 847-855.