CHAPTER 2

DARPA KDDCUP99 DATASET

2.1 THE DARPA INTRUSION-DETECTION EVALUATION PROGRAM

Intrusions found in computer and network audit data are plentiful as well as
ever-changing. They are also thoroughly scattered, and attempts to structure
or catalogue audit data are extremely effort-intensive. In order to create
effective detection models, model-building algorithms typically require a
large amount of labelled data. One major difficulty in deploying an IDS is
therefore the need to label system audit data for the algorithms.
Misuse-detection systems need the data to be accurately labelled as either
‘normal’ or ‘attack’, whereas for anomaly-detection systems the data must be
verified to ensure that it is exclusively ‘normal’, namely attack-free. Both
require comparable effort (Eskin et al 2000; Lee et al 2001), and preparing
the data in this manner is both time-consuming and costly.

A generous sponsor for the production of intrusion-detection audit data was
found in the US government agency DARPA (Defense Advanced Research Projects
Agency). An innovator and promoter of technology, this organization has
funded many projects in the last few decades. In 1969, one such research and
development project was subsidized ‘to create an experimental
packet-switched network’. This one venture saw the modest beginnings of what
grew into the omnipresent Internet known today. DARPA also supports the
evaluation of developing technologies: focusing effort, documenting existing
capabilities and guiding research.

The 1998 DARPA Off-line Intrusion-Detection Evaluation Program (Lippmann et
al 2000; http://www.ll.mit.edu/IST/ideval/data/data_index.html 1999; Lippmann
et al 2000; Haines et al 2001) was one such project. Aware of the lack of
suitable audit data sets for intrusion detection, DARPA set out (1) to
generate an intrusion-detection evaluation corpus which could be shared by
many researchers, (2) to evaluate many intrusion-detection systems, (3) to
include a wide variety of attacks and (4) to measure both attack-detection
rates and false-alarm rates for realistic normal traffic. To avoid
publicizing confidential information concerning any real network, and in
order not to disrupt the operation of a live network, an extensive test bed
was set up at MIT’s Lincoln Laboratory to synthesize the data. This test bed
simulated the operation of a typical US Air Force LAN for over two months,
allowing a considerable amount of audit data to be collected from it.
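
As an illustration of goal (4), the two measures can be sketched as simple
ratios over labelled records. This is only a minimal sketch: the function
name, the boolean encoding and the toy data are hypothetical, and the
false-alarm rate is assumed here to be the fraction of normal records that
are flagged, which may differ from the exact definition used in the DARPA
evaluation.

```python
def evaluate(actual, predicted):
    """Return (detection_rate, false_alarm_rate).

    Both arguments are equal-length sequences of booleans,
    where True means 'attack' and False means 'normal'.
    """
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    attacks = sum(1 for a in actual if a)
    normals = len(actual) - attacks
    # Attacks that were correctly flagged.
    detected = sum(1 for a, p in zip(actual, predicted) if a and p)
    # Normal records that were wrongly flagged (assumed definition).
    false_alarms = sum(1 for a, p in zip(actual, predicted) if not a and p)
    detection_rate = detected / attacks if attacks else 0.0
    false_alarm_rate = false_alarms / normals if normals else 0.0
    return detection_rate, false_alarm_rate


# Hypothetical toy data: 4 attacks followed by 6 normal connections.
actual = [True] * 4 + [False] * 6
predicted = [True, True, True, False, True, False, False, False, False, False]
dr, far = evaluate(actual, predicted)
```

With this toy input, three of the four attacks are flagged and one of the six
normal connections raises a false alarm.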

2.2 ATTACK TYPES IN THE 1999 DARPA DATA SET

Each attack type falls into one of the four following main
categories:

Denial-of-service (DOS) attacks have the goal of limiting or denying
service(s) provided to a user, computer or network. A common tactic is to
severely overload the targeted system, as in a SYN flood.

Probing or surveillance attacks have the goal of gaining knowledge of the
existence or configuration of a computer system or network. Port scans or
sweeps of a given IP-address range are typically used in this category, as
in IPsweep.

Remote-to-Local (R2L) attacks have the goal of gaining local access to a
computer or network to which the attacker previously had only remote access.
Examples of this are attempts to gain control of a user account, such as the
Dictionary attack.

User-to-Root (U2R) attacks have the goal of gaining root or super-user
access on a particular computer or system on which the attacker previously
had user-level access. These are attempts by a non-privileged user to gain
administrative privileges (e.g. Eject).

A total of 24 attack types were included in the training data, and a further
14 novel attacks were added to the test data in order to compare the
performance of IDS on ‘known’ and on ‘yet-unseen’ attacks. A further aim of
the evaluation was to determine whether systems could detect stealthy
attacks. These are variations of an attack that have been modified from the
standard form available on the Internet in an attempt to evade detection.
Methods of being stealthy vary, depending on the attack type (Kendall 1999).
The attacks are grouped according to category and type. The number of
occurrences is detailed, distinguishing between attacks launched in the
clear and those performed stealthily, and specifying whether they appeared
in the training or the test data. For example, there were 46 Eject attacks
in the simulation. Of these, 10 were stealthy and 36 were performed in the
clear. Of those in the clear category, 29 figured in the training data and 7
in the test data. In the DARPA programmes, detection rates for each attack
category were estimated for comparative purposes when evaluating the
performance of IDS.

2.2.1 Different Attack Types

The category of an attack is determined by its ultimate goal, so that within
a given category, attacks may closely resemble each other. The DOS attacks
are designed to disrupt a host or network service. Some DOS attacks (e.g.
smurf) excessively load a legitimate network service; others (e.g. teardrop,
Ping of Death) create malformed packets which are incorrectly handled by the
victim machine. Others still (e.g. apache2, back, syslogd) take advantage of
software bugs in network daemon programmes. Probe attacks are launched by
programmes which can automatically scan a network of computers to gather
information or find known vulnerabilities. Such probes are often precursors
to more dangerous attacks because they provide a map of machines and
services and pinpoint weak links in a network. Some of these scanning tools,
such as satan, saint and mscan, enable even an unskilled attacker to check
hundreds of machines on a network for known vulnerabilities.

In the R2L attacks, an attacker who does not have an account on a victim
machine sends packets to that machine and gains local access. Some R2L
attacks exploit buffer overflows in network server software (e.g. imap,
named, sendmail); others exploit weak or misconfigured security policies
(e.g. dictionary, ftp-write and guest), and one (xsnoop) is a Trojan
password-capture programme. The snmp-get R2L attack against the router is a
password-guessing attack in which the community password of the router is
guessed and the attacker then uses SNMP to monitor the router. During U2R
attacks, a local user on a machine tries to obtain privileges normally
reserved for the UNIX root or super-user. Some U2R attacks exploit
poorly-written system programmes which run at root level and are susceptible
to buffer overflows (e.g. eject, ffbconfig, fdformat). Others may exploit
weaknesses in path-name verification (e.g. loadmodule), bugs in some
versions of perl (e.g. suidperl) or other software weaknesses.

2.2.2 Attack Descriptions

back - Denial-of-service attack against the Apache web server, where a client
requests a URL containing many backslashes.
dict - Guess passwords for a valid user, using simple variants of the
account name over a telnet connection.
eject - Buffer overflow using eject program on Solaris. Leads to a user-
to-root transition if successful.
ffb - Buffer overflow using the ffbconfig UNIX system command leads
to root shell.
format - Buffer overflow using the fdformat UNIX system command leads
to root shell.
ftp-write - Remote FTP user creates .rhost file in world writable anonymous
FTP directory and obtains local login.
guest - Try to guess password via telnet for guest account.
ipsweep - Surveillance sweep performing either a port sweep or ping on
multiple host addresses.
land - Denial of service where a remote host is sent a spoofed TCP SYN packet
with the same source and destination.
loadmodule - Non-stealthy load module attack which resets IFS for a normal
user and creates a root shell.
multihop - Multi-day scenario in which a user first breaks into one machine.
neptune - SYN-flood denial-of-service on one or more ports.
nmap - Network mapping using the nmap tool. The mode of exploring the
network will vary; options include SYN.
perlmagic - Perl attack which sets the user id to root in a perl script and
creates a root shell.
phf - Exploitable CGI script which allows a client to execute arbitrary
commands on a machine with a misconfigured web server.
pod - Denial-of-service ping-of-death.
portsweep - Surveillance sweep through many ports to determine which
services are supported on a single host.
rootkit - Multi-day scenario where a user installs one or more components
of a rootkit.
satan - Network probing tool which looks for well-known weaknesses.
Operates at three different levels; level 0 is light.
smurf - Denial-of-service icmp-echo reply flood.
spy - Multi-day scenario in which a user breaks into a machine with the
purpose of finding important information where the user tries to
avoid detection. Uses several different exploit methods to gain
access.
syslog - Denial of service against the syslog service; connects to port 514
with an unresolvable source IP.
teardrop - Denial of service where mis-fragmented UDP packets cause some
systems to reboot.
warez - User logs into anonymous FTP site and creates a hidden directory.
warezclient - Users downloading illegal software which was previously
posted via anonymous FTP by the warezmaster.
warezmaster - Anonymous FTP upload of Warez (usually illegal copies of
copyrighted software) onto FTP server.

2.3 DATA-SET DESCRIPTION

The ‘KDDCUP99 Data’ (Irvine 1999) are the data sets which were issued for
use in the KDDCUP ’99 Classifier-Learning Competition. These sets of training
and test data were made available by Stolfo and Lee
(http://kdd.ics.uci.edu/databases/kddcup99/task.htm 1999) and consisted of a
pre-processed version of the 1998 DARPA Evaluation Data. This team’s IDS had
performed particularly well in the Intrusion-Detection Evaluation Program of
that year, using data mining as a ‘pre-processing’ stage to extract
characteristic intrusion features from raw TCP/IP audit data. The original
raw training data were about four gigabytes of compressed binary tcpdump
data obtained from the first seven weeks of network traffic at MIT. These
were pre-processed with the feature-construction framework MADAM ID (Mining
Audit Data for Automated Models for Intrusion Detection) to produce about
five million connection records. A connection is defined as a sequence of
TCP packets starting and ending at some well-defined times, between which
data flow between a source IP address and a destination IP address under
some well-defined protocol. Each connection is labelled as either ‘normal’
or with the name of its specific attack type. A connection record consists
of about 100 bytes. Ten percent of the complementary two weeks of test data
were likewise pre-processed to yield a further set of fewer than half a
million connection records. Contestants were informed that these test data
were not from the same probability distribution as the training data, and
that they included specific attack types which are not found in the training
data. The full labelled test data, with some two million records, were not
included in this data set.
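
A connection record of this kind can be split into its feature values and
its label with a few lines of code. The sketch below assumes that each
record is one comma-separated line with the label as the last field and a
trailing period on the label (as the class labels in Table 2.1 are written);
the function name and the shortened sample line are hypothetical and do not
reproduce a real record.

```python
def parse_record(line):
    """Split one connection record into (features, label).

    Assumes comma-separated fields with the class label last,
    written with a trailing period (e.g. 'smurf.').
    """
    fields = line.strip().split(",")
    *features, label = fields
    return features, label.rstrip(".")


def is_attack(label):
    """Every label other than 'normal' names a specific attack type."""
    return label != "normal"


# Hypothetical, heavily shortened record (a real one has 41 feature fields).
sample = "0,tcp,http,SF,181,5450,normal.\n"
features, label = parse_record(sample)
```

Keeping the features as strings at this stage leaves the choice of numeric
conversion and of encoding for the discrete fields to a later step.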

2.3.1 Set of Features used in the Connection Records

In the KDDCUP99 Data, the initial features extracted for a connection record
(Eskin 2002; Lee 1994-1999) include the basic features of an individual TCP
connection, such as its duration, protocol type, number of bytes transferred
and the flag indicating the normal or error status of the connection. These
‘intrinsic’ features provide information for general network-traffic
analysis purposes. Since most DOS and Probe attacks involve sending a lot of
connections to the same host(s) at the same time, they can have frequent
sequential patterns which differ from those of normal traffic. For these
patterns, a “same host” feature examines all other connections in the
previous 2 seconds which had the same destination as the current connection.
Similarly, a “same service” feature examines all other connections in the
previous 2 seconds which had the same service as the current connection.
These temporal and statistical characteristics are referred to as the
“time-based” traffic features. Several Probe attacks use a much longer
interval than 2 seconds (for example, one minute) when scanning the hosts or
ports. For these, a mirror set of “host-based” traffic features was
constructed, based on a ‘connection window’ of 100 connections. The R2L and
U2R attacks are embedded in the data portions of the TCP packets and may
involve only a single connection. To detect these, ‘content’ features of
individual connections were constructed using domain knowledge. These
features suggest whether the data contain suspicious behaviour, such as the
number of failed logins, whether successfully logged in or not, whether
logged in as root, whether a root shell is obtained, etc. In total, there
are 42 features (including the attack type) in each connection record, with
most of them taking on continuous values. The individual features are listed
and briefly described in Tables 2.2 to 2.5. Table 2.1 shows the different
types of attacks and their categories:

Table 2.1 Class Labels that Appear in the “10% KDDCUP99” Dataset

Attack              Number of Samples    Category
smurf.              280790               DOS
neptune.            107201               DOS
back.               2203                 DOS
teardrop.           979                  DOS
pod.                264                  DOS
land.               21                   DOS
normal.             97277                Normal
satan.              1589                 Probe
ipsweep.            1247                 Probe
portsweep.          1040                 Probe
nmap.               231                  Probe
warezclient.        1020                 R2L
guess_passwd.       53                   R2L
warezmaster.        20                   R2L
imap.               12                   R2L
ftp_write.          8                    R2L
multihop.           7                    R2L
phf.                4                    R2L
spy.                2                    R2L
buffer_overflow.    30                   U2R
rootkit.            10                   U2R
loadmodule.         9                    U2R
perl.               3                    U2R
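
The per-category totals implied by Table 2.1 can be tallied with a short
script. This is a minimal sketch: the counts are copied from the table as
printed (with the trailing periods dropped from the labels), and the
variable names are chosen here for illustration.

```python
from collections import Counter

# (number of samples, category) per class label, copied from Table 2.1.
SAMPLES = {
    "smurf": (280790, "DOS"), "neptune": (107201, "DOS"),
    "back": (2203, "DOS"), "teardrop": (979, "DOS"),
    "pod": (264, "DOS"), "land": (21, "DOS"),
    "normal": (97277, "Normal"),
    "satan": (1589, "Probe"), "ipsweep": (1247, "Probe"),
    "portsweep": (1040, "Probe"), "nmap": (231, "Probe"),
    "warezclient": (1020, "R2L"), "guess_passwd": (53, "R2L"),
    "warezmaster": (20, "R2L"), "imap": (12, "R2L"),
    "ftp_write": (8, "R2L"), "multihop": (7, "R2L"),
    "phf": (4, "R2L"), "spy": (2, "R2L"),
    "buffer_overflow": (30, "U2R"), "rootkit": (10, "U2R"),
    "loadmodule": (9, "U2R"), "perl": (3, "U2R"),
}

# Sum the sample counts per attack category.
totals = Counter()
for count, category in SAMPLES.values():
    totals[category] += count

grand_total = sum(totals.values())
# Share of each category in the whole table, as a fraction of 1.
shares = {category: n / grand_total for category, n in totals.items()}
```

On these figures the DOS category accounts for roughly 79% of the records,
while U2R contributes only 52 records, which illustrates how strongly
imbalanced the classes are.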

Connection Features, KDDCUP99


Table 2.2 Basic Features of Individual TCP Connections

Feature name        Description                                                   Type
duration            length (number of seconds) of the connection                  continuous
protocol_type       type of the protocol, e.g. tcp, udp, etc.                     discrete
service             network service on the destination, e.g. http, telnet, etc.   discrete
src_bytes           number of data bytes from source to destination               continuous
dst_bytes           number of data bytes from destination to source               continuous
flag                normal or error status of the connection                      discrete
land                1 if connection is from/to the same host/port; 0 otherwise    discrete
wrong_fragment      number of ‘wrong’ fragments                                   continuous
urgent              number of urgent packets                                      continuous

Table 2.3 Content Features Within a Connection Suggested by Domain Knowledge

Feature name          Description                                             Type
hot                   number of ‘hot’ indicators                              continuous
num_failed_logins     number of failed login attempts                         continuous
logged_in             1 if successfully logged in; 0 otherwise                discrete
num_compromised       number of ‘compromised’ conditions                      continuous
root_shell            1 if root shell is obtained; 0 otherwise                discrete
su_attempted          1 if ‘su root’ command attempted; 0 otherwise           discrete
num_root              number of ‘root’ accesses                               continuous
num_file_creations    number of file creation operations                      continuous
num_shells            number of shell prompts                                 continuous
num_access_files      number of operations on access control files            continuous
num_outbound_cmds     number of outbound commands in an ftp session           continuous
is_hot_login          1 if the login belongs to the ‘hot’ list; 0 otherwise   discrete
is_guest_login        1 if the login is a ‘guest’ login; 0 otherwise          discrete

Table 2.4 Traffic Features Computed Using a Two-Second Time Window

Feature name          Description                                              Type
count                 number of connections to the same host as the current    continuous
                      connection in the past two seconds
Note: The following features refer to these same-host connections.
serror_rate           % of connections that have ‘SYN’ errors                  continuous
rerror_rate           % of connections that have ‘REJ’ errors                  continuous
same_srv_rate         % of connections to the same service                     continuous
diff_srv_rate         % of connections to different services                   continuous
srv_count             number of connections to the same service as the         continuous
                      current connection in the past two seconds
Note: The following features refer to these same-service connections.
srv_serror_rate       % of connections that have ‘SYN’ errors                  continuous
srv_rerror_rate       % of connections that have ‘REJ’ errors                  continuous
srv_diff_host_rate    % of connections to different hosts                      continuous
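
How a few of the time-based features in Table 2.4 could be derived from
earlier connections is sketched below. This is an illustration under stated
assumptions, not the MADAM ID implementation: the `Conn` record, the
function name and the sample data are hypothetical, treating the flag value
"S0" as a ‘SYN’ error is an assumption, and rates are returned as fractions
between 0 and 1 rather than percentages.

```python
from collections import namedtuple

# Hypothetical minimal connection record; real records carry more fields.
Conn = namedtuple("Conn", "time dst_host service flag")


def time_based_features(history, current, window=2.0):
    """Derive count, srv_count, serror_rate and same_srv_rate for `current`
    from the other connections seen in the previous `window` seconds."""
    recent = [c for c in history if 0 <= current.time - c.time <= window]
    same_host = [c for c in recent if c.dst_host == current.dst_host]
    same_srv = [c for c in recent if c.service == current.service]
    count = len(same_host)
    # Assumption: flag "S0" marks a connection with a SYN error.
    syn_errors = sum(1 for c in same_host if c.flag == "S0")
    same_srv_on_host = sum(1 for c in same_host
                           if c.service == current.service)
    return {
        "count": count,
        "srv_count": len(same_srv),
        "serror_rate": syn_errors / count if count else 0.0,
        "same_srv_rate": same_srv_on_host / count if count else 0.0,
    }


# Hypothetical traffic: the first connection falls outside the window.
history = [
    Conn(0.0, "10.0.0.5", "http", "SF"),
    Conn(9.0, "10.0.0.5", "http", "S0"),
    Conn(9.5, "10.0.0.5", "telnet", "S0"),
    Conn(9.8, "10.0.0.7", "http", "SF"),
]
current = Conn(10.0, "10.0.0.5", "http", "S0")
feats = time_based_features(history, current)
```

Following the wording of the text, the current connection itself is not
counted; only the other connections inside the window contribute.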

Table 2.5 Traffic Features Computed Using a Hundred-Connection Window

Feature name                     Description                                         Type
dst_host_count*                  number of connections to the same host as the       continuous
                                 current connection in the past hundred connections
dst_host_serror_rate*            % of connections that have ‘SYN’ errors             continuous
dst_host_rerror_rate*            % of connections that have ‘REJ’ errors             continuous
dst_host_same_srv_rate*          % of connections to the same service                continuous
dst_host_diff_srv_rate*          % of connections to different services              continuous
dst_host_srv_count**             number of connections to the same service as the    continuous
                                 current connection in the past hundred connections
dst_host_srv_serror_rate**       % of connections that have ‘SYN’ errors             continuous
dst_host_srv_rerror_rate**       % of connections that have ‘REJ’ errors             continuous
dst_host_srv_diff_host_rate**    % of connections to different hosts                 continuous

* refers to same-host connections; ** refers to same-service connections.
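
The host-based features differ from the time-based ones only in the window:
it holds a fixed number of preceding connections instead of a time span. A
minimal sketch using a bounded queue follows; the class name, the `Conn`
record and the sample data are hypothetical, and the window size is reduced
to 3 in the example purely to make the eviction visible.

```python
from collections import deque, namedtuple

# Hypothetical minimal connection record.
Conn = namedtuple("Conn", "dst_host service")


class HostWindow:
    """Sliding window over the most recent `size` connections."""

    def __init__(self, size=100):
        # A deque with maxlen drops the oldest entry automatically.
        self.window = deque(maxlen=size)

    def features(self, current):
        """Compute features for `current` from the window, then add it."""
        same_host = [c for c in self.window
                     if c.dst_host == current.dst_host]
        count = len(same_host)
        srv_count = sum(1 for c in self.window
                        if c.service == current.service)
        same_srv = sum(1 for c in same_host
                       if c.service == current.service)
        same_rate = same_srv / count if count else 0.0
        diff_rate = 1.0 - same_rate if count else 0.0
        self.window.append(current)
        return {
            "dst_host_count": count,
            "dst_host_srv_count": srv_count,
            "dst_host_same_srv_rate": same_rate,
            "dst_host_diff_srv_rate": diff_rate,
        }


# Hypothetical slow scan of host h1 across several services.
w = HostWindow(size=3)
conns = [Conn("h1", "http"), Conn("h1", "ftp"), Conn("h2", "http"),
         Conn("h1", "smtp"), Conn("h1", "http")]
results = [w.features(c) for c in conns]
```

Because the window is defined by connection count, a scan that is spread
over a minute or more still accumulates in it, which is the motivation given
in the text for these features.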

Figure 2.1 U-matrix for KDDCUP99 Data (Features 1 to 10 are shown)

The U-matrix visualizes the distances between neighbouring map units, and
thus shows the cluster structure of the map: high values of the U-matrix
indicate a cluster border; uniform areas of low values indicate the clusters
themselves. Each component plane shows the values of one variable in each
map unit. On top of these visualizations, additional information can be
shown: labels, data histograms and trajectories. The U-matrix of the
KDDCUP99 data is shown in Figure 2.1.
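
The idea behind the U-matrix can be sketched in a few lines for a
self-organizing map, the kind of map for which a U-matrix is normally drawn.
The sketch below is a simplified variant and not the tool used to produce
Figure 2.1: it assumes a rectangular grid with 4-connected neighbours and
stores one value per map unit (the mean distance to its neighbours); the
function name and the toy weights are hypothetical.

```python
import math


def u_matrix(weights):
    """weights[r][c] is the weight vector of the map unit at row r, col c.

    Returns a grid of the same shape in which each cell holds the mean
    Euclidean distance between that unit and its 4-connected neighbours.
    """
    rows, cols = len(weights), len(weights[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            result[r][c] = sum(dists) / len(dists) if dists else 0.0
    return result


# Hypothetical 1 x 4 map with one-dimensional weights: two tight pairs
# of units separated by a large gap.
toy_weights = [[(0.0,), (1.0,), (10.0,), (11.0,)]]
u = u_matrix(toy_weights)
```

In this toy map the two middle units receive the highest values, marking the
border between the two clusters, while the outer units sit in low-valued
areas.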

Continued use of the KDDCUP99 Data in current research reported from
Columbia University (Pfahringer 2000; Elkan 2000; Levin 2000; Lee 1994-1999;
Chimphlee et al 2006) confirms the uniqueness of this data set in offering a
large volume of network audit data (originally from DARPA) with a wide
variety of labelled intrusions. For these reasons, the KDDCUP99 Data set was
used for the investigation carried out in this research work.
