
Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, 26-29 August 2004

USING STATISTICAL ANALYSIS AND SUPPORT VECTOR MACHINE
CLASSIFICATION TO DETECT COMPLICATED ATTACKS

MING TIAN 1,2, SONG-CAN CHEN 2, YI ZHUANG 2, JIA LIU 2

1 Department of Computer, Yancheng Institute of Technology, Yancheng, 224001, China
2 Department of Computing, Institute of Information, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China

E-MAIL: jsyctm@sian.com, sc.chen@nuaa.edu.cn, zhuangyi@263.net, nuaa-liujia@263.net

Abstract:
Anomaly detection systems can detect unknown attacks, but they have a high false alarm rate. This article introduces our prototype, which uses statistical analysis and a support vector machine classifier to detect complicated attacks. We study the sampling methods of statistical analysis techniques and propose a new statistical model named Smooth K-Windows. After analyzing why support vector machines make misclassifications, we propose an improved support vector machine classifier with higher accuracy. The experimental results show that the prototype system can detect complicated attacks in which the attackers conceal their behavior by changing it gradually.

Keywords:
Anomaly Detection; Statistical Analysis; Time Window; Support Vector Machine

1. Introduction

Intrusion detection techniques can be categorized into misuse detection and anomaly detection [1]. Misuse detection systems find intrusions by matching sample data to known intrusive patterns. Anomaly detection systems find intrusions by analyzing deviations from normal activity profiles that are derived from historical data. Anomaly detection can detect unknown intrusions, and this advantage has made it one of the research hotspots of the intrusion detection field [2]. However, it is difficult to decrease the false alarm rate and to make a proper trade-off between detection performance and detection efficiency in anomaly detection systems. Besides the system thresholds, which are difficult to determine, the crucial cause of a high false positive rate is that the normal activities of users or systems are complicated and do not stay unchanged. The main reason for a high false negative rate is that the attacking methods adopted by experienced intruders are more and more intricate, and the attacking activities are more and more covert. For example, attackers may conceal their behavior by changing it gradually.


Statistical analysis techniques are widely used in anomaly detection [3]. Compared with machine learning methods, statistical analysis techniques have the advantage that they can run in real time without offline learning and relearning from training data, but their detection performance is not good enough. Because the actual statistical distributions are difficult to establish and calculate, many traditional statistical analysis techniques adopt approximate processes that are sometimes quite rough. For example, random variables are often simply assumed to follow a normal distribution.

Support Vector Machines (SVMs) are universal machine learning methods based on the statistical learning theory proposed by Vapnik and co-workers [4]. Applying an SVM classifier to intrusion detection has some advantages, such as high speed and scalability. There are, however, some inherent difficulties in SVM. One is that it is liable to be disturbed by noisy data near the separating hyperplane. Another is that its classification accuracy is not high enough when dealing with complicated problems.

We construct a prototype of an anomaly network intrusion detection system using statistical analysis and SVM classification. Our system aims at detecting complicated attacks that are well disguised. We study how improper sampling methods injure detection performance in statistical analysis techniques. Based on the half-life windows of NIDES [5] and the layer windows of HIDE [6], we bring forth a new statistical model named Smooth K-Windows, which is used to describe and trace the user's behavior both in the short term and in the long term. Based on an analysis of the misclassifications made by the SVM classifier, we propose an improved SVM classifier combined with the k nearest neighbor (kNN) classifier, and give an algorithm named BDKSVM accordingly. The experimental results show that our system can detect complicated attacks.

0-7803-8403-2/04/$20.00 ©2004 IEEE



The rest of the paper is organized as follows. In Section 2, we analyze the sampling methods of statistical analysis techniques. Section 3 presents Smooth K-Windows in detail. Section 4 briefly introduces SVM and kNN and analyzes their characteristics. In Section 5, we present an improved algorithm named BDKSVM, which has high accuracy. Section 6 outlines the architecture of our system. The experimental results using our prototype are shown in Section 7. Section 8 summarizes the paper and sketches future work.

2. Sampling methods in statistical analysis

The first step in applying statistical analysis to anomaly detection is sampling. The analyzed objects in intrusion detection are usually audit events that are retrieved from original records and converted into proper formats. Each event has its own duration, varying from short to long. When sampling, it must be taken into account that improper sampling intervals will cause mistakes. The sampling interval is also called the sampling window.

There are three particular cases when comparing event lifetime with the sampling window:
a. The event duration is much shorter than the sampling window.
b. The event duration is far longer than the sampling window.
c. The event duration is as long as the sampling window.
In the first two cases, anomaly events will escape detection. In the third case, if an anomaly event does not glide through the sampling window in a smooth way, it may not be detected. We illustrate this with several special random processes, shown in Fig. 1 (event lifetime and sampling interval). In Fig. 1, T1 is a short sampling window and T2 is a long one. Event frequency can be calculated from the area of the shaded region, and the horizontal dashed lines correspond to system thresholds.
In Fig. 1-a, an anomaly event with a short duration appears in the short window T1, and its frequency exceeds the threshold, so it is recognized. If a long window T2 is adopted, as in Fig. 1-b, the anomaly event will escape detection because of the averaging effect. In Fig. 1-c, the duration of the event is long and a long window T2 is adopted; the frequency goes beyond the threshold, so the event can be detected. If a short window T1 is used, as in Fig. 1-d, then within a long time T2 (compared with T1) the user's behavior shows no obvious change, so the threshold will be adjusted by the statistical analysis engine, and the anomaly event will mistakenly be regarded as normal activity.

In Fig. 1-e to Fig. 1-h, the event duration is about the same as the sampling window T. In the first three of these figures, the frequency stays below the threshold, so the anomaly event cannot be found. In Fig. 1-h, the anomaly event falls properly within the sampling window, so it will be detected.
From the above analyses, we can draw the following conclusions (see the sketch below):
• There must be many sampling windows to deal with anomaly events of different durations; otherwise some anomaly events cannot be detected.
• Events must pass the sampling window smoothly; otherwise anomaly events may evade detection.
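To make the averaging effect concrete, here is a small illustrative Python sketch (our own construction, not part of the paper): a short anomalous burst crosses the threshold in a window comparable to its duration, but is averaged below the threshold by a much longer window. The trace and threshold values are invented for the example.

```python
# Sketch (invented data): a short anomalous burst measured with two window sizes.
# A short window resolves the burst; a long window averages it below threshold.

def window_frequency(events, window):
    """Mean event rate over each consecutive window of `window` time slots."""
    return [sum(events[i:i + window]) / window
            for i in range(0, len(events) - window + 1, window)]

trace = [0] * 20 + [9, 9, 9] + [0] * 20    # 3-slot burst in a quiet trace
THRESHOLD = 5                              # invented detection threshold

short = window_frequency(trace, 3)         # T1: comparable to burst duration
long_ = window_frequency(trace, 21)        # T2: much longer than the burst

print(any(f > THRESHOLD for f in short))   # True  -> burst detected
print(any(f > THRESHOLD for f in long_))   # False -> averaged away
```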

3. Smooth K-Window statistical model

3.1. Half-life window model


A weighting method is used in NIDES when computing the historical frequency distribution for the statistic Q according to the following formula [5]:

P_{m,k} = (1/N_k) Σ_{j=1}^{k} 2^(-b(k-j)) W_{j,k}    (1)

where b is the decay rate that corresponds to a half-life window of the data used to estimate the frequency P_{m,k}: the weighting coefficient falls to 0.5 after one half-life has elapsed. For the intensity measures, the half-life is recommended to be 10 seconds, 30 seconds, or 60 seconds. For other types of measures, a 30-day half-life is recommended in NIDES [5].
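A minimal Python sketch of computing Eq. (1), assuming the reconstruction above; the function and variable names are ours, not NIDES's.

```python
# Sketch of Eq. (1): exponentially aged frequency of a category,
# with decay rate b chosen so the weight halves after `half_life` records.

def decayed_frequency(indicators, half_life):
    """indicators[j] is 1 if the category occurred in audit record j, else 0.
    Returns the decayed relative frequency at the latest record k."""
    b = 1.0 / half_life                      # weight 2^(-b*half_life) = 0.5
    k = len(indicators)
    weights = [2 ** (-b * (k - j)) for j in range(1, k + 1)]
    n_k = sum(weights)                       # normalizer N_k
    return sum(w * x for w, x in zip(weights, indicators)) / n_k

print(decayed_frequency([1, 0, 0, 1, 1, 1], half_life=3))
```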

3.2. Layer-window model

HIDE adopts a nested layer-window model to detect network-based anomaly events [6]. The values of the time windows in all layers constitute a geometric progression, which typically has a factor of 4 [6]. Newly arrived events are first stored in the event buffer of observation-window layer 1. The stored events are analyzed with statistical processors, and the results are fed into a neural network classifier for further analysis. The event buffer is emptied once it becomes full, and the stored events are averaged and forwarded to the event buffer of layer 2. This process is repeated recursively until the top layer is reached.
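The nested recursion can be pictured with the following hedged sketch; it is our illustration of the buffering described above, with an invented buffer length, not HIDE's actual code.

```python
# Sketch of the nested layer-window recursion: when a layer's buffer fills,
# its average is pushed one layer up and the buffer is emptied.

BUF_LEN = 4   # invented; plays the role of the ratio between neighboring windows

def push(layers, value, level=0):
    if level == len(layers):
        layers.append([])                # grow a new top layer on demand
    layers[level].append(value)
    if len(layers[level]) == BUF_LEN:    # buffer full: average and promote
        avg = sum(layers[level]) / BUF_LEN
        layers[level].clear()
        push(layers, avg, level + 1)

layers = []
for event in range(1, 18):               # feed 17 basic events
    push(layers, float(event))
print(layers)                            # current buffer contents per layer
```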

3.3. K-window model

There are several defects in the half-life window model. First, the value of the decay rate b is difficult to define. Secondly, events with different durations appear only in several fixed windows, so anomaly events are difficult to find. Moreover, the system must store all historical statistical summaries, so it occupies huge memory space. The layer-window model also has some shortcomings. One is that the ratio of two neighboring windows equals the event buffer length, which is usually quite large, so there may be no suitable window for some events. Furthermore, the events do not pass the time windows in a smooth way.

This article brings forward a smooth K-window statistical model in which the former questions can be solved well. The K-window model, shown in Fig. 2, is similar to the layer-window model but with major modifications. It is also multilayer, and its details are as follows:

Figure 2. The K-window model (windows K1, K2, ... stacked in layers)

All values of the time windows constitute Fibonacci numbers:

K_1 = 1, K_2 = 2, K_i = K_{i-1} + K_{i-2}, i = 3, 4, ..., m    (2)

Each layer has an event buffer queue, and the length of the queue is determined by the following formula [7]:

N = p(1-p)[Φ^{-1}(Q/2)]^2 / E    (3)

where Φ^{-1} is the inverse Laplace function and Q is the confidence. The significance of N is the least volume of the sample set.

The basic events z_j are stored in the event buffer of layer 1 one by one. What is stored in the event buffers of the other layers are abstract events z_{i,j}, which are calculated according to the following formula:

z_{i,j} = (z_j + z_{j+1} + ... + z_{j+K_i-1}) / K_i, i = 2, 3, ..., m    (4)

That is to say, z_{i,j} is the average of K_i consecutive basic events. Events that leave the event buffers are simply dropped.

The K-window model has several advantages. First, it monitors network traffic with different time windows, so each anomaly event will appear in a certain window comparable to its duration. Secondly, unlike the layer-window model, the K-window model is not nested: each abstract event on a higher layer is calculated directly from basic events, and all events pass the corresponding time window smoothly. Furthermore, except for the event buffer of layer 1, the other buffers are virtual and need no extra memory space.
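As a hedged illustration of Eqs. (2) and (4) (our reading of the model, not the authors' code), the abstract events of layer i can be computed on demand as K_i-length moving averages over the layer-1 buffer, which is why the higher buffers can remain virtual:

```python
# Sketch of the K-window model: Fibonacci window sizes (Eq. 2) and abstract
# events computed directly from the basic-event buffer as K_i-length moving
# averages (Eq. 4); only layer 1 actually stores events.

def fibonacci_windows(m):
    k = [1, 2]
    while len(k) < m:
        k.append(k[-1] + k[-2])          # K_i = K_{i-1} + K_{i-2}
    return k[:m]

def abstract_events(basic, k_i):
    """Layer-i 'virtual buffer': moving averages over K_i basic events."""
    return [sum(basic[j:j + k_i]) / k_i
            for j in range(len(basic) - k_i + 1)]

basic = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]   # layer-1 event buffer
for i, k_i in enumerate(fibonacci_windows(5), start=1):
    print(f"layer {i} (K={k_i}):", abstract_events(basic, k_i))
```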
4. Brief introduction of SVM and kNN

4.1. SVM

Given a training set (x_i, y_i), i = 1, ..., n, x ∈ R^d, y ∈ {+1, -1}, let Φ: R^d → H be a nonlinear map which transforms x from the input space into the feature space H. A kernel is a function K for which, for all x_1, x_2 ∈ R^d, K(x_1, x_2) = <Φ(x_1), Φ(x_2)>. Then, for nonlinear classification, determining the optimal hyperplane is equivalent to solving the following constrained quadratic optimization problem: maximize the objective function

W(α) = Σ_{i=1}^{n} α_i - (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j K(x_i, x_j)    (5)

subject to the constraints:

Σ_{i=1}^{n} α_i y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, ..., n    (6)


where α_i is the Lagrange multiplier for each sample.

Let α_i* be the optimal solution of the problem. According to the Kuhn-Tucker conditions, it is easily deduced that only a small number of the α_i* are not equal to zero, and the corresponding samples are the support vectors. The resulting decision function can be written as

f(x) = sgn{ Σ_{i ∈ SV} α_i* y_i K(x_i, x) + b* }    (7)

That is the so-called SVM classifier.
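To make Eq. (7) concrete, here is a minimal numpy sketch of the decision function. The support vectors, multipliers α_i*, bias b* and the RBF kernel are invented placeholders; in practice they come from solving the dual problem (5)-(6).

```python
import numpy as np

# Sketch of Eq. (7): f(x) = sgn( sum_{i in SV} a_i* y_i K(x_i, x) + b* ).

def rbf_kernel(x1, x2, gamma=0.5):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

sv = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # support vectors (invented)
alpha = np.array([0.7, 0.7, 1.4])                    # a_i* > 0 (invented)
y = np.array([1, 1, -1])                             # labels of the SVs
b = 0.1                                              # bias b* (invented)

def decision(x):
    g = sum(a * yi * rbf_kernel(xi, x) for a, yi, xi in zip(alpha, y, sv)) + b
    return np.sign(g), g                             # label and margin value

print(decision(np.array([0.2, 0.9])))
```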


4.2. kNN

The idea of the kNN classifier is as follows. Taking all samples of the data set as representative points [8], the kNN classifier calculates the distances from the test sample x to all the given samples and finds the k nearest neighbors of x among them. The class label of x then follows the label of the class to which most of those nearest neighbors belong. The kNN classifier thus uses the information of all training samples to make discriminations.
5. An improved SVM classifier

5.1. The reasons why SVM makes misclassifications

The essential idea of SVM classification is as follows. The samples near the separating hyperplane play the prominent roles in the classification. The support vectors, which are chosen as the representative points of the two classes and support the optimal hyperplane, are determined by maximizing the margin between the two classes, so the VC dimension of the set of indicator functions {sgn(w·x + b)} is reduced.

Figure 3. SVM classification (training and test samples around the optimal hyperplane L, the margin hyperplanes L1 and L2, and the correct separating hyperplane L0)

In Fig. 3, the points (● and ○) are training samples, and the points (■ and □) are test samples near the optimal hyperplane L. The samples on L1 or L2 are support vectors. Suppose the training and test samples are generated independently and identically according to an unknown but fixed distribution. Let L0 be the correct separating hyperplane; that is, L0 can always separate all samples into the two classes correctly. If a test sample falls on the left side of L, the SVM classifier classifies it into the positive class; otherwise it classifies it into the negative class. So the SVM classifier misclassifies all the test samples (■) falling between L and L0.

The reason why SVM makes these mistakes is that the support vectors, which are chosen as the representative points, occupy only a small portion of the training set, and the information of the majority of samples far from the hyperplane is ignored.

5.2. BDKSVM Algorithm


As shown in Fig. 3, almost all errors made by the SVM classifier occur near the optimal hyperplane L. In the worst case, when L0 overlaps with L1 or L2, half of the test samples falling inside the region between L1 and L2 are misclassified. Therefore, we give an improved SVM classification algorithm named BDKSVM that absorbs the merits of the kNN classifier. The BDKSVM algorithm is shown in Fig. 4.

Input: training set Tr = {x_i | i = 1, 2, ..., m} and testing set Te = {x_{m+j} | j = 1, 2, ..., n}
Output: the class label of each testing sample x_{m+j}.
Procedure:
train SVM on Tr → SV = {x_i | x_i ∈ Tr, i = 1, 2, ..., k}
for (j = 1; j <= n; j++) {
    g(x_{m+j}) = Σ_{i ∈ SV} α_i* y_i K(x_i, x_{m+j}) + b*;
    if (|g(x_{m+j})| >= 1)
        printf(sgn(g(x_{m+j})));
    else {
        for (i = 1; i <= m; i++) calculate ||x_{m+j} - x_i||;
        find the pk nearest neighbors x_L of x_{m+j} according to ||x_{m+j} - x_L||,
            L = 1, 2, ..., pk, and let k+ be the number of positive samples among them;
        M+(x) = (1/k+) Σ_{L+=1}^{k+} (x_{L+} - x); compute M-(x) analogously over the negative neighbors;
        v = M+(x) - M-(x);
        for (L = 1; L <= pk; L++) calculate |v^T (x_{m+j} - x_L)|;
        find the k nearest neighbors x_L of x_{m+j} according to |v^T (x_{m+j} - x_L)|;
        if (the majority of the x_L are positive samples) printf(+1);
        else printf(-1);
    }
}

Figure 4. The BDKSVM algorithm
The usual SVM classifier is used when a test sample falls on the left side of L1 or the right side of L2. Otherwise, a kNN classifier is adopted that regards all training samples as representative points. Considering that the kNN classifier requires the k nearest neighbors to be close to the test sample, the value of k is set to the number of support vectors. As the training set is finite, a version of the kNN classifier based on the best distance measurement is used. Concretely, let p be a constant; the pk nearest neighbors are found first based on the general distance definition, and then the k nearest neighbors are determined among them with the following distance definition [8]:

D(x, x_L) = |v^T (x - x_L)|    (8)

where x_L is a sample near the sample x, v = M+(x) - M-(x), and M+(x) = (1/k+) Σ_{L+=1}^{k+} (x_{L+} - x), with k+ the number of positive samples in the vicinal area of x and N the number of all samples in the vicinal area of x (M-(x) is defined analogously over the N - k+ negative samples).

The added computing time is acceptable, because the test samples falling between L1 and L2 are a minority of the test set.
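Our reading of Fig. 4 can be sketched in Python as follows. This is a hedged re-implementation, not the authors' code: the function name bdksvm_predict and the use of scikit-learn's SVC for the trained machine are our assumptions, and ties in the majority vote are broken toward the positive class.

```python
import numpy as np
from sklearn.svm import SVC

# Hedged sketch of BDKSVM: trust the SVM outside the margin band |g(x)| >= 1;
# inside it, fall back to a kNN vote using the best-distance metric of Eq. (8).

def bdksvm_predict(svm, X_train, y_train, x, p=2):
    g = svm.decision_function(x.reshape(1, -1))[0]
    if abs(g) >= 1:                                # outside the band: use SVM
        return int(np.sign(g))
    k = len(svm.support_)                          # k := number of support vectors
    d = np.linalg.norm(X_train - x, axis=1)        # ordinary distances
    near = np.argsort(d)[:p * k]                   # pk nearest neighbors
    pos = near[y_train[near] == 1]
    neg = near[y_train[near] == -1]
    m_pos = np.mean(X_train[pos] - x, axis=0) if len(pos) else np.zeros_like(x)
    m_neg = np.mean(X_train[neg] - x, axis=0) if len(neg) else np.zeros_like(x)
    v = m_pos - m_neg                              # v = M+(x) - M-(x)
    proj = np.abs((X_train[near] - x) @ v)         # Eq. (8) distances
    best = near[np.argsort(proj)[:k]]              # k best-distance neighbors
    return 1 if np.sum(y_train[best] == 1) >= len(best) / 2 else -1

# Invented demo data: a separable cloud with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
svm = SVC(kernel="rbf").fit(X, y)
print(bdksvm_predict(svm, X, y, np.array([0.05, -0.03])))
```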


6. System architecture

Our system consists of a probe, an event preprocessor, a statistical processor, an SVM classifier and a user interface. Its architecture is shown in Fig. 5.


Figure 5. System architecture


• Probe: collects the network traffic of a host or a network.
• Event Preprocessor: receives data from the probe and converts them into the formats required by the statistical model and the SVM classifier.
• Statistical Processor: analyzes network activities and reports anomaly events to the console, or forms input vectors to feed into the SVM classifier. In our system, the Kolmogorov-Smirnov test is adopted to detect abnormal values of the random variables (see the sketch after this list).
• SVM Classifier: analyzes the input vectors received from the event preprocessor or the statistical processor, and reports anomaly events to the console.
• Console: receives reports from the statistical processor and the SVM classifier, decides whether the network traffic is normal or not, and gives commands such as adjusting thresholds or parameters to the other components.
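As a hedged illustration of the statistical processor's test (our sketch; the profile parameters, observation window and significance level are invented), the Kolmogorov-Smirnov test can flag a window of observations that drifts away from the learned profile:

```python
import numpy as np
from scipy import stats

# Sketch: compare a window of observed event frequencies against the profiled
# normal distribution with a K-S test; all numbers here are illustrative.

rng = np.random.default_rng(1)
profile_mean, profile_std = 10.0, 2.0          # learned "normal" profile
window = rng.normal(14.0, 2.0, size=50)        # drifted traffic sample

stat, p_value = stats.kstest(window, "norm", args=(profile_mean, profile_std))
if p_value < 0.01:                             # invented significance level
    print(f"anomaly reported to console (KS={stat:.3f}, p={p_value:.2e})")
```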

7. Experimental results and analysis

7.1. Results of the SVM classifier

We first evaluate and compare the performances of the SVM and BDKSVM algorithms. The experimental data are two-spirals (Fig. 6), a typical binary classification problem in pattern recognition. The bootstrap method is used, as follows: n samples are selected randomly n times with replacement and taken as the training set, and the rest are taken as the test set. We repeat each test 30 times and take the average value of the results (see the sketch at the end of this subsection).

Figure 6. A two-spirals data set

In this testing, several two-spirals with different numbers of circles are tested; the number of dots is 2000, and p equals 2 in the BDKSVM algorithm. The experimental results are shown in Table 1.

Table 1. The comparison of SVM and BDKSVM

              3 circles            7 circles            11 circles
Algorithm     Accuracy  CPU time   Accuracy  CPU time   Accuracy  CPU time
              (%)       (s)        (%)       (s)        (%)       (s)
SVM           92.17     1.83       83.86     1.91       76.20     1.92
BDKSVM        94.82     2.05       88.67     2.34       84.36     2.58

Table 1 indicates that the BDKSVM algorithm obviously improves the classification accuracy, and the more circles there are, the bigger the improvement; this can be verified by the accuracy comparison between the two algorithms. The reason is that on complicated problems more test samples lie near the hyperplane. These hard-to-distinguish samples are classified by the BDKSVM algorithm, which chooses all training samples as representative points, so the error rate is reduced accordingly. The results also show that the increased computing time is acceptable.
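The bootstrap protocol just described might look like the following sketch (our hedged illustration; train_and_score is a placeholder for training either classifier and returning its test accuracy):

```python
import numpy as np

def bootstrap_accuracy(X, y, train_and_score, repeats=30, seed=0):
    """Draw n samples with replacement as the training set, test on the
    unselected rest, repeat and average, as in Section 7.1."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = []
    for _ in range(repeats):
        picked = rng.integers(0, n, size=n)          # n random draws
        rest = np.setdiff1d(np.arange(n), picked)    # out-of-bag test set
        scores.append(train_and_score(X[picked], y[picked], X[rest], y[rest]))
    return float(np.mean(scores))

# Invented demo: a trivial classifier that thresholds the first feature.
demo = lambda Xtr, ytr, Xte, yte: float(np.mean(np.sign(Xte[:, 0]) == yte))
X = np.random.default_rng(2).normal(size=(100, 2))
y = np.sign(X[:, 0])
print(bootstrap_accuracy(X, y, demo))
```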

7.2. Testing the detection capability of the system


We used tcpdump to capture data from a secured subnet of the campus network of Nanjing University of Aeronautics and Astronautics. The subnet is isolated from the outer network by a strictly configured firewall.
First, we collected four weeks of network data as training data, to get the statistical profiles of normal events and to train the SVM classifier. Then we collected two weeks of data as test data, into which some simulated attack data were inserted. We used some tools to launch simulated attacks. In addition, we wrote a scan tool with variable scan speed: at the beginning the scan speed was quite low, and it was increased bit by bit, so the scan can be considered a covert attack.
The ROC curve is shown in Fig. 7. It shows that our system has powerful detection capability and can detect complicated attacks.

Figure 7. ROC curve from our detection engine on the simulated attack data

8. Conclusions and future works

This article analyzes the sampling methods of statistical analysis techniques. Improper sampling methods impair the detection capability of statistical analysis engines. Based on the half-life window model and the layer-window model, we bring forth a new statistical model named Smooth K-Window, which is used to describe and trace the user's behavior both in the short term and in the long term. After analyzing why SVM is apt to misclassify test samples near the optimal hyperplane, we propose an improved version of the SVM classifier, which is combined with the kNN classifier and has higher classification accuracy. Both statistical analysis and the SVM classifier are adopted in our prototype, which aims at detecting complicated attacks. The experimental results show that our system has powerful detection capability and can detect covert attacks that are well disguised.
Future work includes several aspects. In order to reduce computing time, we will research how to choose proper values of K in the K-window model by using a genetic algorithm. We will investigate other approximate processes that injure the detection capability of statistical analysis techniques. We will also study SVM classification further to give the BDKSVM algorithm strong scalability.

Acknowledgements

This work was supported in part by the Natural Science Foundation of Jiangsu Province under Grant BK2002091, China.

References

[1] Hervé Debar, Marc Dacier, and Andreas Wespi, "Towards a taxonomy of intrusion-detection systems," Computer Networks, Vol. 31, Issue 8, pp. 805-822, April 1999.
[2] Wenke Lee and Dong Xiang, "Information-theoretic measures for anomaly detection," Proceedings of the 2001 IEEE Symposium on Security and Privacy, Oakland, pp. 130-143, May 2001.
[3] Midori Asaka, Takefumi Onabuta, Tadashi Inoue, Shunji Okazawa, and Shigeki Goto, "A new intrusion detection method based on discriminant analysis," IEICE Transactions on Information and Systems, Vol. E84-D, No. 5, pp. 570-577, May 2001.
[4] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, UK, 2000.
[5] Harold S. Javitz and Alfonso Valdes, The NIDES Statistical Component Description and Justification, Annual Report, Computer Science Laboratory, SRI International, Menlo Park, California, March 1994.
[6] Zheng Zhang, Jun Li, C. N. Manikopoulos, Jay Jorgenson, and Jose Ucles, "HIDE: a hierarchical network intrusion detection system using statistical preprocessing and neural network classification," Proceedings of the 2001 IEEE Workshop on Information Assurance and Security, New York, pp. 85-90, June 2001.
[7] Lu Fengshan, Queue Theory and Its Application, Hunan Science and Technology Press, Changsha, China, 1984.
[8] Bian Zhaoqi and Zhang Xuegong, Pattern Recognition, Tsinghua University Press, Beijing, China, 2000.
