Dataset
Prasanta Gogoi, Monowar H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita
Abstract. With the exponential growth in the number of computer applications and the size of networks, the potential damage that attacks launched over the Internet can cause keeps increasing dramatically. A number of network intrusion detection methods have been developed, each with its own strengths and weaknesses. The majority of research in network intrusion detection is still based on simulated datasets because real datasets are not available. A simulated dataset cannot represent a real network intrusion scenario. It is important to generate real and timely datasets to ensure accurate and consistent evaluation of detection methods. We propose a new real dataset to ameliorate this crucial shortcoming. We have set up a testbed to launch both attack and normal network traffic using attack tools. We capture the network traffic in packet and flow formats. The captured traffic is filtered and preprocessed to generate a featured dataset, which is made available for research purposes.
Keywords: Testbed, Dataset, Packet, Netflow, Anomaly, NIDS
1 Introduction
With the tremendous growth in the size and use of computer networks and the enormous increase in the number of applications running on them, network security is becoming increasingly important. Intrusion detection (ID) is an important component of any infrastructure protection mechanism; it is a type of security management for computers and networks. An intrusion can be defined as a set of actions aimed at compromising computer security goals such as confidentiality, integrity and availability [9]. An intrusion detection system (IDS) gathers and analyzes information from various areas within a computer or a network to identify possible security breaches. Detection approaches fall into two types: misuse detection and anomaly detection. A misuse detection approach uses information about known attacks and detects intrusions based on matches with existing attack patterns or signatures. An anomaly detection approach, on the other hand, learns the normal behavior of the system or the network it monitors and reports when the monitored behavior deviates significantly from the normal profile. Various IDSs based on both misuse and anomaly detection exist; examples include Bro [10], Snort [11] and ADAM [4]. The effectiveness of an IDS is evaluated by its true detection rate of intrusions, so an intrusion dataset is essential to assess the effectiveness of a detection method. The KDD Cup 1999 intrusion dataset¹ is an internationally accepted benchmark intrusion dataset.
1.1 Motivation
The majority of the research in the field of network intrusion detection is based on synthetic datasets because better datasets are lacking. Given the known shortcomings of these data, it is both necessary and urgent to create new datasets to ensure consistent and accurate evaluation of intrusion detection systems.
1.2 Objective
The objective of this paper is to set up a network testbed for generating both normal and attack network traffic, and to capture the traffic in packet as well as flow modes in an isolated environment. The captured traffic is filtered, preprocessed and analyzed, and is ultimately used to produce two unbiased network intrusion datasets called the Packet Level and Flow Level TUIDS datasets.
1.3 Organization of Paper
Many intrusion detection systems have come into existence in the last three decades. The various techniques used in IDSs have their own strengths and weaknesses. A key aspect of any IDS is the nature of its input data. For a given set of input data, different IDS techniques face different challenges. The input is generally a collection of data instances (also referred to as objects, records, points, vectors, patterns, events, cases, samples, observations or entities) [12]. Each data instance may consist of multiple attributes (multivariate), and attributes can be of different types, such as binary, categorical or continuous. In the case of multivariate data instances, all attributes may be of the same type or may be a mixture of different data types. The nature of the attributes determines the applicability of an IDS technique.
¹ http://kdd.ics.uci.edu
2.1 Data Labels
The labels associated with a data instance denote whether that instance is normal or anomalous. It should be noted that obtaining labeled data which is accurate as well as representative of all types of behaviors is often prohibitively expensive. Labeling is often done manually by human experts and hence requires substantial effort. Typically, getting a labeled set of anomalous data instances covering all possible types of anomalous behavior is more difficult than getting labels for normal behavior. Moreover, anomalous behavior is often dynamic in nature; for example, new types of anomalies may arise for which there is no labeled training data. Based on the extent to which labels are available, anomaly detection techniques can operate in one of two modes: supervised and unsupervised.
Techniques [6] trained in supervised mode assume the availability of a training dataset which has labeled instances for both the normal and anomalous classes. The typical approach in such cases is to build a predictive model for normal vs. anomalous classes; any unseen data instance is compared against the model to determine which class it belongs to.
Techniques [2] that operate in unsupervised mode do not require training data and are thus the most widely applicable. Techniques in this category make the implicit assumption that normal instances are far more frequent than anomalies in the test data. If this assumption does not hold, such techniques suffer from high false alarm rates.
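As a toy illustration of this unsupervised assumption, the sketch below scores each point by its mean distance to its k nearest neighbours; the far-away point stands out only because normal instances dominate the data. This is an illustrative sketch of the general idea, not the specific technique of [2]:

```python
import math

def knn_outlier_scores(points, k=3):
    """Score each point by the mean distance to its k nearest
    neighbours; larger scores suggest anomalies. Works only if
    normal points are far more frequent than anomalies."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores

# Dense cluster of "normal" points plus one far-away "anomaly".
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = knn_outlier_scores(data)
print(scores.index(max(scores)))  # index of the outlying point
```

If anomalies instead dominated the data, the cluster points would no longer have small neighbour distances and the scores would stop separating the classes, which is exactly the false-alarm failure mode noted above.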
Datasets play an important role in the testing and validation of any intrusion detection method. The quality of the data not only allows us to identify a method's ability to detect anomalous behavior, but also indicates its potential effectiveness during deployment in real operating environments. Several datasets are publicly available for testing and evaluating intrusion detection. However, the most widely used evaluation datasets are the KDD Cup 1999 dataset and its modified version, the NSL-KDD dataset [13].
3.1 KDD Cup 1999 Dataset
The KDD Cup 1999 dataset is the benchmark dataset for intrusion detection. Each record of the dataset represents a connection between two network hosts according to an existing network protocol and is described by 41 attributes (38 continuous or discrete numerical attributes and 3 categorical attributes). Each record of the training data is labeled as either normal or a specific kind of attack.
The attacks fall into one of four categories: Denial of Service (DoS), User to Root (U2R), Remote to Local (R2L) and Probe.
Denial of Service (DoS): An attacker tries to prevent legitimate users from using a service, e.g., SYN flood, smurf and teardrop.
User to Root (U2R): An attacker has local access to the victim machine and tries to gain super-user privileges, e.g., buffer overflow attacks.
Remote to Local (R2L): An attacker tries to gain access to a victim machine without having an account on it, e.g., password guessing attacks.
Probe: An attacker tries to gain information about the target host, e.g., port-scan and ping-sweep.
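These four categories fold the many specific attack labels into coarse classes. The sketch below shows a partial, illustrative mapping built from attacks named in this section; real KDD Cup 1999 labels cover many more attacks:

```python
# Partial, illustrative mapping from specific attack labels (KDD Cup
# 1999 style) to the four coarse categories described in the text.
CATEGORY = {
    "smurf": "DoS", "teardrop": "DoS", "land": "DoS",
    "buffer_overflow": "U2R",
    "guess_passwd": "R2L",
    "portsweep": "Probe", "nmap": "Probe",
    "normal": "normal",
}

def categorize(label):
    """Fold a specific attack label into its coarse category."""
    return CATEGORY.get(label, "unknown")

print(categorize("smurf"), categorize("guess_passwd"))  # DoS R2L
```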
The dataset consists of two parts: training and testing data. The training data contain a total of 22 attack types; the test data contain an additional 15 attack types that appear only there. The numbers of samples of each category of attack in the Corrected KDD and 10-percent Corrected KDD training datasets are shown in Table 1.
Table 1. Attack distribution in the KDD Cup training datasets

Dataset                    DoS      U2R  R2L     Probe  Normal  Total
Corrected KDD              229,853  70   16,347  4,166  60,593  311,029
10-percent Corrected KDD   391,458  52   1,126   4,107  97,278  494,021
3.2 NSL-KDD Dataset
The distribution of records in the NSL-KDD test set (KDDTest+) is: DoS 7,458; U2R 67; R2L 2,887; Probe 2,422; Normal 9,710; Total 22,544.
The KDD Cup 1999 and NSL-KDD datasets are both evaluation datasets. The records in these datasets may be distinctly different from real network traffic data. Besides, the nature of attack and normal instances may change dynamically. One of the most important deficiencies of the KDD dataset is the very large number of redundant records, which causes learning algorithms to be biased towards frequent records and thus prevents them from learning from infrequent records, which may be more harmful to network health. In addition, the existence of these repeated records in the test set causes the evaluation results to be biased positively toward methods that have better detection rates on the frequent records.
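The deduplication that NSL-KDD applies to the KDD data amounts to dropping exact duplicate records. A minimal sketch, with toy records standing in for real KDD rows:

```python
def deduplicate(rows):
    """Drop exact duplicate records while preserving first-seen order --
    the core preprocessing step that NSL-KDD applies to the KDD data."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Toy records standing in for KDD rows: a frequent attack dominates.
raw = [["0", "tcp", "smurf"]] * 4 + [["0", "udp", "normal"]]
print(len(raw), len(deduplicate(raw)))  # 5 2
```

After deduplication the frequent smurf records no longer outnumber the rest, which is exactly why learners trained on the cleaned data are less biased towards frequent records.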
Among the approaches surveyed in [14], the most prevalent approach to evaluating intrusion detection systems is based on the KDD Cup 1999 dataset. This evaluation dataset was generated from simulated host and network normal traffic and manually generated network-based attacks. A list of some existing intrusion detection systems validated with the KDD Cup 1999 intrusion dataset is summarized in [5].
5 Our Dataset
Our method of dataset generation extracts various types of features from network packet and flow data captured in an isolated network. Using existing attack tools, we generate a group of attacks against a local network server and collect the produced traffic as known attack traffic. The attacks for which we capture data, along with the corresponding tools for their generation, are presented in Table 3². These attacks and tools are also used by Amini et al. [2].
5.1 Testbed Setup
The experimental setup of the testbed for network traffic capture includes one router, one L3 switch, two L2 switches, one server, two workstations and forty nodes. Six VLANs are created from the L3 and L2 switches, and the nodes and workstations are connected to separate VLANs. The L3 switch is connected to a router through an internal IP router, and the router is connected to the Internet through an external IP router. The server is connected to the L3 switch through a mirror port to observe the traffic activity on the switch. Another LAN of 350 nodes is connected to other VLANs through five L3 and L2 switches and three routers. Attacks are launched both within our testbed and from the other LAN through the Internet. To launch attacks within the testbed, nodes of one VLAN attack nodes of another VLAN as well as nodes of the same VLAN. Normal traffic is created within our testbed under restricted conditions after disconnecting the other LAN. Traffic activity in the testbed is observed on the computer connected to the mirror port. A diagram of the testbed for generation of the TUIDS intrusion detection datasets is shown in Fig. 1.
The various features are extracted using a distributed feature extraction architecture, as shown in Fig. 2. The framework is used for fast protocol-specific (e.g., TCP, UDP, ICMP) feature extraction from packet and flow data separately. Servers (S1 and S2) are used for initial storage of the captured and preprocessed data as well as for the final formatted packet and flow feature data. Workstations (WS1 and WS2) are dedicated to the various types of feature extraction, carried out in a distributed manner using multiple nodes (N1, N2, ..., N6).
² http://packetstormsecurity.nl/index.html
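The distributed extraction in Fig. 2 works because per-protocol features can be computed independently of one another. A minimal sketch using a process pool in place of the workstation nodes; the packet records and their field layout are hypothetical:

```python
from multiprocessing import Pool

# Hypothetical pre-parsed packet records: (protocol, src IP, dst IP, bytes).
PACKETS = [
    ("TCP", "10.0.0.1", "10.0.0.2", 1500),
    ("UDP", "10.0.0.3", "10.0.0.2", 512),
    ("TCP", "10.0.0.1", "10.0.0.4", 40),
    ("ICMP", "10.0.0.5", "10.0.0.2", 64),
]

def extract(proto):
    """Aggregate features for one protocol; each protocol is processed
    independently, so each call can run on a separate worker node."""
    pkts = [p for p in PACKETS if p[0] == proto]
    return proto, {"count": len(pkts), "bytes": sum(p[3] for p in pkts)}

if __name__ == "__main__":
    with Pool(3) as pool:  # three workers standing in for nodes N1..N3
        features = dict(pool.map(extract, ["TCP", "UDP", "ICMP"]))
    print(features["TCP"])  # {'count': 2, 'bytes': 1540}
```

In the real architecture the protocol-split data would be shipped from the storage servers to the workstations rather than shared in memory, but the partition-then-extract structure is the same.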
Table 3. Attacks and their generation tools

Attack     Generation tool    Attack      Generation tool
bonk       targa2.c           1234        targa2.c
jolt       targa2.c           saihyousen  targa2.c
land       targa2.c           oshare      targa2.c
nestea     targa2.c           window      targa2.c
newtear    targa2.c           syn         Nmap
syndrop    targa2.c           xmas        Nmap
teardrop   targa2.c           fraggle     fraggle.c
winnuke    targa2.c           smurf       smurf4.c
Table 4. TUIDS intrusion detection datasets

Connection type   Training dataset     Testing dataset
Packet level
  Normal          71,785   (58.87%)    47,895   (55.52%)
  Attack          50,142   (41.13%)    38,370   (44.48%)
  Total           121,927              86,265
Flow level
  Normal          23,120   (43.75%)    16,770   (41.17%)
  Attack          29,723   (56.25%)    23,955   (58.83%)
  Total           52,843               40,725
5.2 Packet Level Traffic Capture
The packet level network traffic is captured using the open source software tool gulp³. Gulp reads packets directly from the network and writes them to disk at high packet capture rates. The packets are analyzed using the open source packet analyzer wireshark⁴. The raw packet data is preprocessed and filtered before new features are extracted and constructed. From the packet level network traffic, 50 types of features are extracted. To extract these features we use the open source tool tcptrace⁵, C programs and Perl scripts. The features are classified as basic, content-based, time-based and connection-based. The list of features is given in Table 5.
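Time-based features such as count-fr-dst (feature 31 in Table 5) can be computed with a sliding window over frame timestamps. A minimal sketch, with a hypothetical window of T = 2 seconds:

```python
from collections import deque

class FrameCounter:
    """Sliding-window count of frames for a (src, dst) pair seen in the
    last T seconds -- a sketch of time-based features such as
    count-fr-dst in Table 5."""
    def __init__(self, window_sec):
        self.window = window_sec
        self.times = {}  # (src, dst) -> deque of frame timestamps

    def observe(self, ts, src, dst):
        q = self.times.setdefault((src, dst), deque())
        q.append(ts)
        # Evict timestamps that fell out of the T-second window.
        while q and ts - q[0] > self.window:
            q.popleft()
        return len(q)  # frames from src to dst in the last T sec

fc = FrameCounter(window_sec=2.0)
fc.observe(0.0, "A", "B")
fc.observe(1.0, "A", "B")
print(fc.observe(3.5, "A", "B"))  # earlier frames have aged out
```

The same window structure, keyed on ports instead of addresses, yields features such as count-serv-src and count-serv-dst.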
³ http://staff.washington.edu/corey/gulp/
⁴ http://www.wireshark.org/
⁵ http://www.tcptrace.org
⁶ http://www.ietf.org/rfc/rfc3917.txt, http://www.ietf.org/rfc/rfc3954.txt

5.3 Flow Level Traffic Capture
Table 5. Packet level features

Basic features:
 1. Duration
 2. Protocol
 3. Src IP                  Source IP address
 4. Dst IP                  Destination IP address
 5. Src port
 6. Dst port
 7. Service
 8. num-bytes-src-dst
 9. num-bytes-dst-src
10. Fr-no.                  Frame number
11. Fr-length
12. Cap-length
13. Head-len
14. Frag-offset
15. TTL                     Time to live
16. Seq-no.                 Sequence number
17. CWR
18. ECN
19. URG
20. ACK                     Ack flag
21. PSH
22. RST
23. SYN
24. FIN
25. Land

Content-based features:
26. Mss-src-dst-requested
27. Mss-dst-src-requested
28. Ttt-len-src-dst
29. Ttt-len-dst-src
30. Conn-status

Time-based features:
31. count-fr-dst            No. of frames received by unique dst in the last T sec from the same src
32. count-fr-src            No. of frames received by unique src in the last T sec to the same dst
33. count-serv-src          No. of frames from the src to the same dst port in the last T sec
34. count-serv-dst          No. of frames from dst to the same src port in the last T sec
35. num-pushed-src-dst
36. num-pushed-dst-src
37. num-SYN-FIN-src-dst
38. num-SYN-FIN-dst-src
39. num-FIN-src-dst
40. num-FIN-dst-src

Connection-based features:
41. count-dst-conn          No. of frames to unique dst in the last N packets from the same src
42. count-src-conn          No. of frames from unique src in the last N packets to the same dst
43. count-serv-src-conn     No. of frames from the src to the same dst port in the last N packets
44. count-serv-dst-conn     No. of frames from the dst to the same src port in the last N packets
45. num-packets-src-dst
46. num-packets-dst-src
47. num-acks-src-dst
48. num-acks-dst-src
49. num-retransmit-src-dst
50. num-retransmit-dst-src
http://nfdump.sourceforge.net/
http://www.cisco.com
Table 6. Flow level features

Basic features:
 1. Duration
 2. Protocol-type
 3. src IP
 4. dst IP                  Destination IP address
 5. src port                Source port
 6. dst port                Destination port
 7. ToS                     Type of service
 8. URG
 9. ACK                     Ack flag
10. PSH                     Push flag
11. RST                     Reset flag
12. SYN                     SYN flag
13. FIN                     FIN flag
16. Land

Time-window features:
17. count-dst               No. of flows to unique dst IP addr inside the network in the last T sec from the same src
18. count-src               No. of flows from unique src IP addr inside the network in the last T sec to the same dst
19. count-serv-src          No. of flows from the src IP to the same dst port in the last T sec
20. count-serv-dst          No. of flows to the dst IP using the same src port in the last T sec

Connection-based features:
21. count-dst-conn          No. of flows to unique dst IP in the last N flows from the same src
22. count-src-conn          No. of flows from unique src IP in the last N flows to the same dst
23. count-serv-src-conn     No. of flows from the src IP to the same dst port in the last N flows
24. count-serv-dst-conn     No. of flows to the dst IP to the same src port in the last N flows
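Connection-based flow features such as count-dst-conn (feature 21 above) look back over the last N flows rather than a time window. A minimal sketch on hypothetical (src, dst) flow records:

```python
def count_dst_conn(flows, i, n):
    """For flow i, count flows to the same dst among the previous n
    flows from the same src -- a sketch of feature 21 (count-dst-conn).
    Each flow is a hypothetical (src IP, dst IP) pair."""
    src, dst = flows[i]
    recent = [f for f in flows[max(0, i - n):i] if f[0] == src]
    return sum(1 for f in recent if f[1] == dst)

flows = [("A", "X"), ("A", "X"), ("B", "Y"), ("A", "X"), ("A", "Z")]
print(count_dst_conn(flows, 3, 3))  # earlier A->X flows in the window
```

Swapping the roles of src and dst, or keying on ports, gives the remaining connection-based features (22 through 24) in the same way.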
Confusion matrices over the Packet Level and Flow Level TUIDS datasets:

Packet level            Predicted class
                  Normal    Attack    Sum       Recall
Actual Normal     47,363    532       47,895    0.9889
Actual Attack     273       38,097    38,370    0.9929
Sum               47,636    38,629    86,265

Flow level              Predicted class
                  Normal    Attack    Sum       Recall
Actual Normal     16,620    150       16,770    0.9911
Actual Attack     113       23,842    23,955    0.9953
Sum               16,733    23,992    40,725
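The recall values in these matrices follow directly from the row counts (recall = diagonal entry / row sum). A small check using the packet level numbers:

```python
def recalls(matrix):
    """Per-class recall from a square confusion matrix laid out as
    rows = actual class, columns = predicted class."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

# Packet level matrix (actual Normal / Attack) from the table above.
packet = [[47363, 532], [273, 38097]]
norm_r, att_r = recalls(packet)
print(round(norm_r, 4), round(att_r, 4))  # 0.9889 0.9929
```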
Dataset          Total      Attack     Normal    Recall-attack (%)  Recall-normal (%)  FPR (%)
Corrected KDD    311,029    250,436    60,593    97.55              90.01              2.45
10% KDD          494,021    396,743    97,278    95.75              94.76              4.25
KDDTrain+        125,973    58,630     67,343    97.65              93.89              2.35
KDDTest+         22,544     12,834     9,710     98.88              96.55              1.12
Packet Level     86,265     38,370     47,895    99.29              98.89              0.71
Flow Level       40,725     23,955     16,770    99.53              99.11              0.47
against a local network server, and the produced traffic is collected and labeled as known attack traffic. Sixteen different types of attacks are generated. The network traffic data was captured at packet level and flow level through two separate port-mirroring machines. The captured data was preprocessed and filtered to extract various types of features. The numbers of records in the datasets are given in Table 4. We call the two datasets the Packet Level and Flow Level TUIDS datasets.
Detection of individual attacks in the Flow Level TUIDS test dataset:

Attack        Instances   Detected   Detection rate
bonk          2,680       2,589      96.63%
jolt          282         277        98.57%
nestea        19          19         100%
newtear       28          27         99.27%
syndrop       13          12         98.48%
teardrop      27          27         100%
winnuke       2,510       2,417      96.33%
1234          6,216       5,994      96.43%
oshare        2,500       2,306      92.27%
saihyousen    52          51         98.07%
smurf         6           6          100%
fraggle       2,500       2,246      89.87%
syn           1,650       1,567      94.98%
xmas          2,720       2,707      99.55%
window        2,766       2,679      96.89%
land          2           2          100%
All attacks   23,955      22,926     95.70%
normal        16,770      16,759     99.94%
Table 11. Distribution of normal and attack instances in the packet and flow level TUIDS intrusion datasets

Dataset type    Training dataset     Testing dataset
Packet level
  Normal        71,785   (58.87%)    47,895   (55.52%)
  DoS           42,592   (34.93%)    30,613   (35.49%)
  Probe         7,550    (6.19%)     7,757    (8.99%)
  Total         121,927              86,265
Flow level
  Normal        23,120   (43.75%)    16,770   (41.17%)
  DoS           21,441   (40.57%)    14,475   (35.54%)
  Probe         8,282    (15.67%)    9,480    (23.28%)
  Total         52,843               40,725
3. We have also used the TUIDS intrusion datasets to evaluate our method NADO [3] for its effectiveness in intrusion detection. Table 11 describes the distribution of the normal and attack instances in both the packet and flow level TUIDS intrusion datasets. Table 12 presents the confusion matrix for each category of attack class in terms of precision, recall and F-measure.
Table 12. The confusion matrix of the proposed scheme [3] over the packet and flow level TUIDS intrusion datasets

Packet level       Evaluation measures                 Confusion matrix
Connection type    Precision  Recall   F-measure       Normal    DoS       Probe    Total
Normal             0.9607     0.9813   0.9708          46,011    1,817     67       47,895
DoS                1.0000     0.9764   0.9764          720       29,893    0        30,613
Probe              0.9988     0.8918   0.9436          838       8         6,911    7,757
Average / Total    0.9865     0.9498   0.9636          47,569    31,718    6,978    86,265

Flow level         Evaluation measures                 Confusion matrix
Connection type    Precision  Recall   F-measure       Normal    DoS       Probe    Total
Normal             0.9745     0.9842   0.9793          16,342    421       7        16,770
DoS                0.9991     0.9938   0.9964          89        14,374    12       14,475
Probe              0.9995     0.9626   0.9806          354       5         9,121    9,480
Average / Total    0.9910     0.9802   0.9854          16,785    14,800    9,140    40,725
Conclusion
In this paper, we provide a high-level analysis of the KDD Cup 1999 and NSL-KDD datasets. The analysis shows that these datasets are simulated and quite old. When applying machine learning methods to intrusion detection, these datasets are not always suitable for current, dynamic network scenarios. To address these issues, we create two real-life network intrusion datasets: the Packet Level and Flow Level TUIDS datasets. To create the datasets, we set up an isolated testbed to launch attacks and capture the traffic in two modes, and generate the datasets after rigorous preprocessing of the raw data. To establish the effectiveness of the datasets, we used them to evaluate the performance of several intrusion detection methods, and the results have been reported.
A Distributed Denial of Service (DDoS) attack uses many computers to launch a coordinated DoS attack against one or more targets. Using client/server technology, the perpetrator is able to multiply the effectiveness of the denial of service significantly by harnessing the resources of multiple unwitting accomplice computers which serve as attack platforms. In the future, we plan to generate a new intrusion dataset focusing on DDoS attacks.
Acknowledgment
This work is supported by the Department of Information Technology, MCIT, Government of India. The authors are grateful to the anonymous reviewers and the funding agencies.
References
1. Adetunmbi, A.O., Falaki, S.O., Adewale, O.S., Alese, B.K.: Network intrusion detection based on rough set and k-nearest neighbour. International Journal of Computing and ICT Research 2, 60–66 (2008)
2. Amini, M., Jalili, R., Shahriari, H.R.: RT-UNNID: A practical solution to real-time network-based intrusion detection using unsupervised neural networks. Computers & Security 25(6), 459–468 (2006)
3. Bhuyan, M.H., Bhattacharyya, D.K., Kalita, J.K.: NADO: Network anomaly detection using outlier approach. In: Proceedings of the ACM International Conference on Communication, Computing & Security, pp. 531–536. New York, NY, USA (2011)
4. Daniel, B., Julia, C., Sushil, J., Ningning, W.: ADAM: A testbed for exploring the use of data mining in intrusion detection. SIGMOD Rec. 30(4), 15–24 (2001)
5. Gogoi, P., Borah, B., Bhattacharyya, D.K.: Anomaly detection analysis of intrusion data using supervised & unsupervised approach. Journal of Convergence Information Technology 5, 95–110 (2010)
6. Gogoi, P., Borah, B., Bhattacharyya, D.K.: Supervised anomaly detection using clustering based normal behaviour modeling. International Journal of Advances in Engineering Sciences 1, 12–17 (2011)
7. Gogoi, P., Borah, B., Bhattacharyya, D.K.: Network anomaly detection using unsupervised model. International Journal of Computer Applications (Special Issue on Network Security and Cryptography) NSC, 19–30 (2011)
8. Gogoi, P., Das, R., Borah, B., Bhattacharyya, D.K.: Efficient rule set generation using rough set theory for classification of high dimensional data. In: Proc. of Intl. Conf. on Communication and Network Security (ICCNS 2011), pp. 19–22. Bhubaneswar, India (Nov 13-14, 2011)
9. Heady, R., Luger, G., Maccabe, A., Servilla, M.: The architecture of a network level intrusion detection system. Tech. rep., Computer Science Department, University of New Mexico, New Mexico (1990)
10. Paxson, V.: Bro: A system for detecting network intruders in real-time. In: Proceedings of the 7th USENIX Security Symposium. San Antonio, Texas (Jan 1998)
11. Roesch, M.: Snort - lightweight intrusion detection for networks. In: Proceedings of the 13th USENIX Conference on System Administration, pp. 229–238. USENIX, Seattle, Washington (Nov 1999)
12. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Addison-Wesley (2005)
13. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A.: A detailed analysis of the KDD Cup 99 data set. Available at: http://nsl.cs.unb.ca/NSL-KDD/ (2009)
14. Tavallaee, M., Stakhanova, N., Ghorbani, A.A.: Toward credible evaluation of anomaly-based intrusion-detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C 40, 516–524 (2010)