
NETWORK CODING AND ALGORITHM

Intrusion Detection Algorithm Based on Density, Cluster Centers, and Nearest Neighbors

Xiujuan Wang1, Chenxi Zhang1, Kangfeng Zheng2
1 Computer Sciences, Beijing University of Technology, Beijing, China
2 Computer Science and Technology, Beijing University of Posts and Telecommunications, Beijing, China

Abstract: Intrusion detection aims to detect intrusion behavior and serves as a complement to firewalls. It can detect the attack types of malicious network communications and computer usage that conventional firewalls cannot detect. Many intrusion detection methods are implemented through machine learning. Previous literature has shown that the performance of an intrusion detection method based on hybrid learning or an integration approach is superior to that of a single learning technique. However, almost no studies focus on how more representative and concise features can be extracted for effective intrusion detection among massive and complicated data. In this paper, a new hybrid learning method is proposed on the basis of features such as density, cluster centers, and nearest neighbors (DCNN). In this algorithm, data are represented by the local density of each sample point and the sum of distances from each sample point to the cluster centers and to its nearest neighbor. A k-NN classifier is adopted to classify the new feature vectors. Our experiment shows that DCNN, which combines k-means, clustering-based density, and a k-NN classifier, is effective in intrusion detection.

Keywords: intrusion detection; DCNN; density; cluster center; nearest neighbor
China Communications July 2016

I. INTRODUCTION

Information has become a valuable resource in modern society. However, the security of information systems is a critical problem because of the openness of networks. In network security, possessing a complete security system is not possible. Instead, it is more practical to establish a secure system that is easy to implement and, simultaneously, to construct a corresponding security assistance system according to the security policies. Intrusion detection is such a security system: it serves as a supplement to firewalls, which defend the computer system against attacks [1].
Intrusion detection detects external intrusions and supervises unauthorized activities of
internal users by identifying and responding to
malicious network communication and computer usage behavior. Intrusion detection aims
to detect intrusions by studying the process
and characteristics of intrusion behavior, thereby enabling a real-time response to intrusion
events and the invasion process. Two basic
intrusion detection technologies exist, namely,
anomaly detection and misuse detection [2].
Currently, most of the relevant literature focuses on intrusion detection based on machine learning and combines different criteria to improve detection performance, such as accuracy, detection rate, and false alarm rate. Although numerous advanced detection methods have been proposed, only a few studies focus on how simple and highly representative feature values can be used to represent a large amount of data.
In this paper, a new characteristic value
representation method is proposed for intrusion detection. This new feature vector uses
the local density of each sample point in the
dataset, the distance from the sample point to
the cluster center, and the distance to the nearest neighbor. Thus, the method is known as
DCNN.
This paper is organized as follows: In the
second section, we summarize the related
technologies in intrusion detection. The proposed DCNN algorithm is detailed in the third
section. The experimental setup and the results
are provided in the fourth section. Finally, a
conclusion is given.

II. LITERATURE REVIEW

At present, intrusion detection technologies are mainly based on rule matching, machine learning, and data mining, among others.

Devaraju S et al. used intrusion detection technology based on rule matching [3]. An association rule mining algorithm (ARMA) was used to detect the attack types on the KDD Cup 99 dataset.

Zhang Yi et al. used data mining technology [4]. Their paper described the use of association rules and an optimization algorithm for them. The design and implementation of an intrusion detection system were based on feature analysis and knowledge discovery from log files.

As mentioned above, intrusion detection based on rule matching fails to recognize unknown malicious behaviors, a limitation that can be addressed with machine learning. Many studies have been conducted on this subject [5][6][7].

A.S. Eesa et al. used the cuttlefish algorithm to obtain a new feature set and then used a decision tree for classification [5].

E. de la Hoz et al. used a multi-objective approach to select features for the DARPA/NSL-KDD dataset and then used growing hierarchical self-organizing maps to detect intrusions [6].
Table I lists several papers related to intrusion detection, mainly comparing the intrusion detection technique, dataset, problem domain, evaluation methods, and baseline classifiers.

Table I Comparison among related work

| Work | Technique | Dataset | Problem domain | Evaluation method | Baseline |
| --- | --- | --- | --- | --- | --- |
| Devaraju S et al. [3] | ARMA | KDD-Cup99 | Misuse detection | DR, FA | N/A |
| Tsai CF, Lin CY [7] | TANN | KDD-Cup99 | Anomaly detection | DR, Accuracy, FP, FN | k-NN/SVM |
| Lin WC et al. [8] | CANN | KDD-Cup99 | Anomaly detection | DR, Accuracy, FP, FN | k-NN/SVM |
| A.S. Eesa et al. [5] | DT | KDD-Cup99 | Anomaly detection | DR, FP, Accuracy, ROC curve | DT |
| E. de la Hoz et al. [6] | GHSOM | DARPA/NSL-KDD | Anomaly detection | DR, ROC curve | NB, RF, DT |
| Wang FN et al. [9] | KPCA+RVM | KDD-Cup99 | Anomaly detection | DR, FA, Accuracy | RVM |

Abbreviations: ARMA: association rule mining algorithm; FA: false alarm; TANN: triangle area-based nearest neighbors approach to intrusion detection; DR: detection rate; FP: false positive; FN: false negative; SVM: support vector machine; CANN: an intrusion detection system based on combining cluster centers and nearest neighbors; DT: decision tree; ROC curve: receiver operating characteristic curve; GHSOM: growing hierarchical self-organizing maps; NB: naïve Bayes; RF: random forest; KPCA: kernel principal component analysis; RVM: relevance vector machine.

Most studies used the KDD Cup 99 dataset. The most widely used evaluation methods are accuracy and detection rate, and the k-NN classifier is the most widely used baseline classifier. Most techniques in the literature combine two steps for intrusion detection: a clustering method is applied first, and then a classifier is used.
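This two-step, cluster-then-classify pattern can be sketched as follows. This is a minimal illustration assuming scikit-learn and synthetic two-class data; none of the names or values come from the cited papers:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for a labeled network dataset: two classes in 5-D.
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(3, 1, (200, 5))])
y = np.array([0] * 200 + [1] * 200)

# Step 1: cluster the data to obtain one center per class.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 2: re-represent each point by its distances to the cluster centers,
# giving a low-dimensional feature vector.
feat = km.transform(X)  # shape (n_samples, n_clusters)

# Step 3: run a conventional classifier on the new features.
clf = KNeighborsClassifier(n_neighbors=21).fit(feat, y)
print(clf.score(feat, y))
```

The same skeleton underlies TANN, CANN, and the DCNN method proposed here; they differ in how the new features are derived from the clustering result.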
Tsai CF and Lin CY used intrusion detection technology based on hybrid machine learning [7]. First, k-means clustering was performed to obtain cluster centers. Then, the triangle areas relating each data point from the given dataset to pairs of cluster centers were calculated, forming a new feature signature of the data. Finally, the k-NN classifier was used to classify attacks on the basis of the new feature represented by the triangle areas [7].
Lin WC, Ke SW, and Tsai CF also used intrusion detection technology based on hybrid
machine learning[8]. They first used k-means
clustering to obtain cluster centers and nearest
neighbors, and then they obtained a new feature by calculating the distance between data
points and cluster centers or nearest neighbors.
Finally, k-NN classifier and SVM were used
to classify based on the new feature. They
achieved a detection accuracy close to that of

the k-NN classification on original features with low cost.

In this paper, we propose an intrusion detection algorithm based on the study of Lin WC et al. [8]. Clustering can be conducted on the basis of measures such as distance and density, and which type of clustering is better depends on the characteristics being addressed. In intrusion detection, the measure that performs best is uncertain; in other words, distance alone cannot fully represent the feature. Besides distance, density is also a representative value of the feature. Therefore, we suggest taking density into account as a new representation of the feature.

III. DCNN ALGORITHM

3.1 DCNN framework

To illustrate the validity of density, the density distributions of the normal and abnormal classes of the KDD Cup 99 corpus are calculated, as shown in Figure 1. The definition of local density is given later.

[Fig. 1 Density distribution: panels (a)-(e) show the density distributions of the normal, PROBING, DOS, U2R, and R2L classes.]

Figure 1 indicates that the density distribution of each data type is different. The local density of normal data is mostly distributed at higher values of approximately 497-500, and the attack data tend to distribute at lower densities. Hence, density can be used as an effective distinguishing feature.
The frame of the intrusion detection algorithm based on DCNN is shown in Figure 2. Hybrid learning is applied to the network data. First, clustering is used to obtain the distances and density related to the network data, which form a new feature vector of low dimension. A k-NN classifier is then built on the new feature vectors, and it outputs the class label of the data.

Clustering on the training set T aims to obtain the cluster centers (Ci), the nearest neighbor of each sample point (Ni), and the local density of each sample point (ρi). Intrusion detection is a classification problem; therefore, the number of categories in classification should be determined first. Distance is adopted to measure the similarity between an unlabeled sample point and each type of attack data. Hence, the number of clusters should be equal to the number of classification categories.

New feature vector formation: The distance between the sample point and all cluster centers (d1) and the distance between the sample point and the nearest neighbor point in its cluster (d2) are calculated. Then, the two distances are added to obtain the new value Di. The training dataset T = (x1, x2, …, xn) is replaced by a new two-dimensional feature vector T′ = (D, ρ) formed by the distance Di and the local density ρi.
[Fig. 2 Frame of DCNN: the training dataset T is clustered to obtain the cluster centers Ci, the nearest neighbors Ni, and the local density ρi; the distance d1 between each point and the centers Ci and the distance d2 between each point and its nearest neighbor Ni form the new feature vector, on which a k-NN classifier labels each point of the unlabeled dataset S as normal or attack.]


Training and testing: The above steps are repeated on the test dataset S, yielding a new two-dimensional feature vector S′. Then, the k-NN classifier is used for intrusion detection on the T′ and S′ datasets.

3.2 Determining Ci, Ni, and ρi

In this paper, we use the k-means clustering algorithm to extract the cluster centers. Assume that the data comprise a total of N classes and let K = N. Then, a total of N cluster centers exist, namely C1, C2, …, CN.

Figure 3 shows an example that contains 15 data points clustered into 5 categories; each cluster has a cluster center C1, C2, C3, C4, C5. The green boxes represent the cluster centers.

The nearest neighbor of a data point in terms of distance is an index. Therefore, the nearest neighbor points in this paper are determined as follows: for each data point A, the distance between A and all other points in its cluster is calculated; the point at the shortest distance is the nearest neighbor point of A. The neighbor points of point A are denoted Neigh(A), and the nearest neighbor point NN(A) is defined as Formula (1):

NN(A) = argmin{d(A, X) : X ∈ Neigh(A)}    (1)

As shown in Figure 3, the distance between A and the other points in cluster 4 is calculated. dAE and dAB are obtained, with dAE being less than dAB; thus, the nearest neighbor of A is point E.

The local density calculation method used in this paper is based on the work of Rodriguez A et al. [10]. The local density is defined as Formula (2):

ρi = Σj χ(dij − dc),  where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise    (2)

Here, dc is a manually set cutoff (truncated) distance, and dij represents the distance from point i to point j. Given a cutoff distance dc, the distance dij between every two points is calculated, and then the local density ρi of each data point is obtained from dc and dij, as defined in Formula (2).

Figure 4 illustrates the process of calculating the local density. The dotted line represents the cutoff distance dc. To calculate the local density of point i, a circle of radius dc is drawn around i, and the number of points within this circle is counted. The density of the point i shown in the graph is equal to 5.
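The cutoff-density computation just described can be sketched in NumPy as follows; the cutoff dc and the toy points are illustrative values, not from the paper's experiments:

```python
import numpy as np

def local_density(X, dc):
    """Local density rho_i: number of points j (j != i) with d_ij < dc,
    i.e., the cutoff-kernel definition of Rodriguez and Laio."""
    # Pairwise Euclidean distances via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # Count neighbors strictly inside the cutoff radius; subtract 1
    # because each point's zero distance to itself is also counted.
    return (d < dc).sum(axis=1) - 1

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(local_density(X, dc=1.0))  # → [2 2 2 0]
```

The three nearby points each see two neighbors within the cutoff circle; the isolated point sees none, mirroring the counting procedure of Figure 4.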

With the local density ρi and the distance Di (defined in Section 3.3), a new two-dimensional feature vector (Di, ρi) is obtained to replace the original feature vector. Finally, the two-dimensional feature vectors of the training and testing sets are used to train and test the k-NN classifier.
3.3 New feature vectors

Before obtaining the new feature data, two distances need to be calculated in addition to the local density. One is the distance between sample point A and the cluster centers Ci. Suppose that N cluster centers are obtained in the clustering process. Then, the distances d(A, Ci) between point A and the N cluster centers Ci need to be calculated. The sum of these N distances synthetically describes the data point in terms of distance. We thus obtain a distance d1, which is defined as Formula (3):

d1 = Σ(i=1..N) d(A, Ci)    (3)

[Fig. 3 Extracting the cluster centers with the use of k-means and determining the nearest neighbor.]
The other distance is the distance between the sample point and its nearest neighbor, calculated in this paper as the Euclidean distance. Assuming that the original dataset is an M-dimensional vector, the distance d2 from point A = (a1, a2, …, aM) to its nearest neighbor point B = (b1, b2, …, bM) is defined as Formula (4):

d2 = sqrt(Σ(k=1..M) (ak − bk)²)    (4)

[Fig. 4 Calculating local density: the points within the circle of radius dc around point i are counted.]
Figure 5 shows an example of calculating these two distances in the case of 5 classes. The black solid lines represent the distances between data point A and the 5 cluster centers. The red dashed line represents the distance between data point A and its nearest neighbor B. After these two distances are obtained, d1 and d2 are added together, and a new distance Di is obtained for each sample point to act as the first new feature, as defined in Formula (5):

Di = d1 + d2    (5)
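Formulas (3)-(5) can be sketched as follows, assuming cluster centers and per-point cluster labels are already available (e.g., from k-means); the helper name and the toy data are illustrative:

```python
import numpy as np

def new_distance_feature(X, centers, labels):
    """Di = d1 + d2: sum of distances to all cluster centers (Formula 3)
    plus distance to the nearest neighbor in the point's own cluster (Formula 4)."""
    # d1: for each point, sum the Euclidean distances to every center (Formula 3).
    d1 = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).sum(axis=1)

    # d2: distance to the nearest neighbor within the same cluster (Formula 4).
    d2 = np.zeros(len(X))
    for i in range(len(X)):
        same = np.where(labels == labels[i])[0]
        same = same[same != i]
        if same.size:
            d2[i] = np.linalg.norm(X[same] - X[i], axis=1).min()
    return d1 + d2  # Formula (5)

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
centers = np.array([[0.5, 0.0], [10.5, 0.0]])
labels = np.array([0, 0, 1, 1])
print(new_distance_feature(X, centers, labels))  # → [12. 11. 11. 12.]
```

Pairing each Di with the local density ρi then yields the two-dimensional (Di, ρi) representation that replaces the original feature vector.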

[Fig. 5 Distance between point A and the cluster centers, and distance between point A and its nearest neighbor.]

IV. EXPERIMENTS

4.1 Experimental setup

4.1.1 Dataset

The training and testing datasets used in this paper are both drawn from the KDD Cup 99 corpus [11], the dataset used in the Knowledge Discovery and Data Mining (KDD) contest held in 1999. Although the data are old, they are widely recognized and used by researchers.

Each network connection in the KDD Cup 99 dataset is marked as normal or attack. The attacks are divided into 4 categories and 39 species. The four types of attack are DOS, R2L, U2R, and PROBING. Table II describes the classification tags for the five types of data.

Table II Tags for data classification

| Type of data | Normal | PROBING | DOS | U2R | R2L |
| --- | --- | --- | --- | --- | --- |
| Tag | | | | | |

The KDD Cup 99 dataset has 41 dimensions of feature descriptions and one dimension of category label, for a total of 42 dimensions. Similar to the work of Zhang et al. [11], 19 dimensional characteristics are selected. After the 19-dimensional data are extracted, the quantitative data need to be normalized. Afterwards, duplicate data are removed to obtain a deduplicated dataset. Figure 6 shows the composition of the remaining data. The training dataset has 119,845 records, and the training and testing datasets together have 177,463 records.

[Fig. 6 Classification of data: proportions of each class in the training dataset and in the combined training and testing dataset (Normal 66.65%/64.57%, DOS 28.77%/33.09%, PROBING 1.47%/2.41%, U2R 0.04%/0.14%, R2L 0.83%/2.03%).]

4.1.2 Evaluations

This paper uses three evaluation criteria to evaluate the performance of DCNN, i.e., accuracy, detection rate, and false alarm, as defined in Formulas (6), (7), and (8), respectively. Table III explains the variables used in the formulas.

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (6)
Detection Rate = TP / (TP + FN)    (7)
False Alarm = FP / (FP + TN)    (8)

Table III Measures of evaluation

| Actual \ Predicted | Normal | Attacks |
| --- | --- | --- |
| Normal | TN | FP |
| Attacks | FN | TP |

True positives (TP): the number of malicious samples correctly classified as malicious.
True negatives (TN): the number of benign samples correctly classified as benign.
False positives (FP): the number of benign samples falsely classified as malicious.
False negatives (FN): the number of malicious samples falsely classified as benign.

4.2 Experimental results

4.2.1 Original data classification

Intrusion detection using the k-NN classifier is performed on the original 19-dimensional KDD Cup dataset. Results for the five types of data are shown in Table IV. K is set to 21. The total accuracy is 84.36%.
4.2.2 CANN

Results for the five data types of CANN are shown in Table V. The K value of the baseline k-NN classifier is set to 21. The total accuracy is 89.79%, indicating that the performance is better than that of k-NN classification on the original features.


4.2.3 DCNN

Results for the five data types of DCNN are shown in Table VI. The K value of the baseline k-NN classifier is set to 21. The total accuracy is 96.74%, the best performance among the three methods.

4.2.4 Discussion

The three intrusion detection approaches are compared on the three evaluation criteria in Figure 7.

The three tables above and Figure 7 show that the k-NN classification process is simplified by the dimension-reduction processing of the data features in the DCNN algorithm. Our experiments indicate that DCNN performs better than the CANN and k-NN classifiers, with the highest accuracy and detection rate and the lowest false alarm rate. Thus, the new features in DCNN describe the characteristics of network data well. The processing time of the DCNN algorithm is also better than that of directly using the k-NN classifier for intrusion detection, because of the dimensionality-reduction processing. Tables IV, V, and VI indicate that U2R and R2L have low detection accuracy; even so, DCNN improves the detection accuracy of these two types of attacks.

Figure 1 shows that the local density of each data type differs: the density of the normal data type is much higher than that of the attack types, and the performance of DCNN is better than that of CANN. Thus, local density is a valuable feature. The above findings indicate that our method is successful.

V. CONCLUSION
In this paper, we proposed a new hybrid machine learning-based intrusion detection method called DCNN, which effectively reduces
the feature dimension of the original dataset
into a simple and representative two-dimensional vector. It saves time and improves the
accuracy in our experiment on the KDD Cup
99 dataset. Experimental results show that

Table IV Result of k-NN classifier (K=21)

| Actual \ Predicted | Normal | PROBING | DOS | U2R | R2L | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| Normal | 68741 | 1852 | 6124 | 593 | 79 | 88.83% |
| PROBING | 126 | 1594 | 21 | 11 | | 90.77% |
| DOS | 8147 | 1218 | 30287 | | | 76.38% |
| U2R | 19 | | | 27 | | 51.92% |
| R2L | 241 | 44 | 254 | | 452 | 45.38% |

Table V Result of CANN classifier (K=21)

| Actual \ Predicted | Normal | PROBING | DOS | U2R | R2L | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| Normal | 69630 | 240 | 7079 | 80 | 360 | 89.97% |
| PROBING | 94 | 1595 | 60 | | | 90.83% |
| DOS | 3056 | 793 | 35549 | 118 | 136 | 89.65% |
| U2R | 10 | | | 33 | | 63.16% |
| R2L | 162 | 31 | | | 801 | 80.42% |

Table VI Result of DCNN classifier (K=21)

| Actual \ Predicted | Normal | PROBING | DOS | U2R | R2L | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| Normal | 76375 | 896 | 82 | 11 | 25 | 98.69% |
| PROBING | 42 | 1711 | | | | 97.44% |
| DOS | 1992 | 502 | 36955 | 69 | 134 | 93.20% |
| U2R | 14 | | | 35 | | 67.31% |
| R2L | 67 | 15 | 48 | | 862 | 86.55% |

[Fig. 7 Performance comparison: Accuracy 84.36% (KNN), 89.79% (CANN), 96.74% (DCNN); Detection Rate 78.91% (KNN), 91.96% (CANN), 94.93% (DCNN); False Alarm 11.17% (KNN), 4.55% (CANN), 2.69% (DCNN).]

DCNN can successfully detect intrusions.


Additional work is needed in the future. For example, we used only the k-means clustering algorithm to obtain cluster centers; the selection method for the initial cluster centers could be changed to improve the clustering accuracy. In addition, the density calculation is not very accurate; in the future, the density could be replaced by the density within each point's cluster. Finally, only the k-NN classifier is used as the baseline classifier in this paper. Wider comparisons can be conducted with other baseline classifiers for intrusion detection.
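As one illustration of the first direction, the seeding of the initial cluster centers can be changed with scikit-learn's k-means++ initialization; this is a sketch on synthetic data, not part of the paper's experiments:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic clusters in 2-D.
X = np.vstack([rng.normal(c, 0.5, (100, 2)) for c in (0.0, 4.0, 8.0)])

# 'k-means++' spreads the initial centers apart before the Lloyd iterations,
# which typically lowers the final within-cluster sum of squares (inertia).
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(sorted(km.cluster_centers_[:, 0]))
```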

References
[1] Yao Lan, Wang Xinmei. Present situation and development trend of intrusion detection system. Telecommunications Science. 2002. (12): 31-35.
[2] Yao Jun-lan. Intrusion detection technology and its development trend. Information Technology. 2006. (4): 172-175.
[3] Devaraju S, Ramakrishnan S. Detection of attacks for IDS using association rule mining algorithm. IETE Journal of Research. 2015. 61(6): 624-633.
[4] Zhang Yi, Liu Yan-heng, et al. Intrusion detection system based on association rules. Journal of Jilin University. 2006. (2).
[5] A.S. Eesa, Z. Orman, A.M.A. Brifcani. A novel feature-selection approach based on the cuttlefish optimization algorithm for intrusion detection systems. Expert Systems with Applications. 2015. 42(5): 2670-2679.
[6] E. de la Hoz, E. de la Hoz, A. Ortiz, J. Ortega, A. Martinez-Alvarez. Feature selection by multi-objective optimization: application to network anomaly detection by hierarchical self-organising maps. Knowledge-Based Systems. 2014. 71(SI): 322-338.
[7] Tsai CF, Lin CY. A triangle area based nearest neighbors approach to intrusion detection. Pattern Recognition. 2010. 43(1): 222-229.
[8] Lin WC, Ke SW, Tsai CF. CANN: An intrusion detection system based on combining cluster centers and nearest neighbors. Knowledge-Based Systems. 2015. 78: 13-21.
[9] Wang FN, Wang SS. Solving the intrusion detection problem with KPCA-RVM. Design, Manufacturing and Mechatronics. 2016. 520-527.
[10] Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science. 2014. 344(6191): 1492-1496.
[11] X.-Q. Zhang, C.H. Gu, J.J. Lin. Intrusion detection system based on feature selection and support vector machine. International Conference on Communications and Networking in China. 2006. pp. 15.

Biographies
Xiujuan Wang received her PhD in Information and Signal Processing in July 2006 from the Beijing University of Posts and Telecommunications. She is currently a lecturer at the College of Computer Sciences, Beijing University of Technology. Her research interests include information and signal processing and network security. E-mail: xjwang@bjut.edu.cn
Chenxi Zhang is currently pursuing her Master's degree at the College of Computer Sciences, Beijing University of Technology. Her research interests include information and network security. She is the corresponding author. E-mail: 15110005031@163.com
Kangfeng Zheng received his PhD in Information and Signal Processing in July 2006 from the Beijing University of Posts and Telecommunications. He is currently an associate professor at the School of Computer Science and Technology, Beijing University of Posts and Telecommunications. His research interests include networking and system security, and network information processing. E-mail: kfzheng@bupt.edu.cn

