C3I Division, DSTO, PO Box 1500, Edinburgh, South Australia 5111, Australia
Information Security Institute, QUT, Brisbane 4001, Australia
Article history: 3 February 2011.

Abstract

This paper reviews the data preprocessing techniques used by anomaly-based network intrusion detection systems (NIDS), concentrating on which aspects of the network traffic are analyzed, and what feature construction and selection methods have been used. Motivation for the paper comes from the large impact data preprocessing has on the accuracy and capability of anomaly-based NIDS. The review finds that many NIDS limit their view of network traffic to the TCP/IP packet headers. Time-based statistics can be derived from these headers to detect network scans, network worm behavior, and denial of service attacks. A number of other NIDS perform deeper inspection of request packets to detect attacks against network services and network applications. More recent approaches analyze full service responses to detect attacks targeting clients. The review covers a wide range of NIDS, highlighting which classes of attack are detectable by each of these approaches.

Data preprocessing is found to predominantly rely on expert domain knowledge for identifying the most relevant parts of network traffic and for constructing the initial candidate set of traffic features. On the other hand, automated methods have been widely used for feature extraction to reduce data dimensionality, and feature selection to find the most relevant subset of features from this candidate set. The review shows a trend toward deeper packet inspection to construct more relevant features through targeted content parsing. These context-sensitive features are required to detect current attacks.

Keywords: Data preprocessing; Network intrusion; Anomaly detection; Data mining; Feature construction; Feature selection

Crown Copyright © 2011 Published by Elsevier Ltd. All rights reserved.
1. Introduction
1.1.
Computers & Security 30 (2011) 353–375

1. Network traffic packet capture API. See http://www.tcpdump.org/.
2. For NetFlow version 9 see http://www.ietf.org/rfc/rfc3954.txt.
1.2. Data preprocessing
1.3. Related work
1.4.
2.
Fig. 1 – Number of published anomaly-based NIDS papers vs. feature type used. (Bar chart: y-axis "Papers", 0–20; x-axis "Feature Type" with categories Packet Header, Protocol Analysis and Payload; series Basic, SCD, MCD, KDD, Protocol, Server and Client.)
2.1.
Table 1 – NIDS using only parsed packet header fields, i.e. packet header basic features.

Data input: Ethernet, IP, TCP headers. Main algorithm: univariate anomaly detection. Detection: Probe, DoS.
Data input: packet headers. Main algorithm: entropy, mutual information, or Bayes network. Detection: Probes (network and portscans).
(Guennoun et al., 2008). Data input: 802.11 frame headers. Main algorithm: K-means classifier used to detect attacks. Detection: wireless network attacks.
Table 2 – NIDS using single connection derived (SCD) features from packet headers.

ANDSOM, (Ramadas et al., 2003). Data input: Tcpurify output. Detection: BIND attack and http_tunnel.
(Yamada et al., 2007). Data input: HTTPS traffic. Detection: web server attacks.
(Estevez-Tapiador et al., 2003). Data input: TCP sessions. Detection: Nmap scans; SSH, HTTP misuse.
Further rows (data inputs: TCP sessions, TCP/IP headers) report detection of FTP anomalies, stepping stones, and proxies, backdoors and protocols.

2.2.
3. http://www.mindrot.org/projects/softflowd/.
4. quad is shorthand for the 4 basic features: source IP address, source port, destination IP address and destination port.
Table 3 – NIDS using multiple connection derived (MCD) features from packet headers.

(Lakhina et al., 2005). Data input: NetFlow output. Main algorithm: multiway-subspace method finds variations/anomalies. Detection: DoS.
MINDS, (Ertoz et al., 2004). Data input: NetFlow records. Main algorithm: Local Outlier Factor (LOF), association mining.
(Lazarevic et al., 2003). Data input: Tcptrace output. Main algorithm: as above.
(Pokrajac et al., 2007). Data input: Tcptrace output. Main algorithm: incremental LOF.
ADAM, (Barbara et al., 2001). Data input: connection records. Main algorithm: clustering.
(Muraleedharan et al., 2010). Data input: IPFIX output. Main algorithm: construct profiles; use Chi-squared measure to detect anomalies.
Data input: full pcap, TCP/IP headers. Main algorithm: model normal traffic using wavelet approximation.
2.3.
An SSH Process Table Attack is a DoS attack where connections are continually made to the SSH service without completing authentication. It aims to force the victim machine to spawn SSH processes until the victim's resources are exhausted.
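This kind of attack lends itself to an MCD-style feature computed over connection records; the sketch below (hypothetical helper and field names, not from any reviewed system) flags sources with many SSH connections that opened but never completed authentication:

```python
from collections import Counter

def half_open_ssh_sources(events, threshold):
    """Flag source IPs whose count of SSH connections that never
    completed authentication meets the threshold.
    `events` is a list of (src_ip, authenticated) pairs."""
    opened = Counter()
    completed = Counter()
    for src, authenticated in events:
        opened[src] += 1
        if authenticated:
            completed[src] += 1
    return [s for s in opened if opened[s] - completed[s] >= threshold]

events = [("10.0.0.9", False)] * 50 + [("10.0.0.2", True)] * 3
print(half_open_ssh_sources(events, threshold=20))  # ['10.0.0.9']
```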
Table 4 – NIDS using protocol analysis: specification-based, parser-based, or application protocol keyword-based.

(Sekar et al., 2002). Data input: TCP/IP headers. Detection: Probe, DoS.
Snort, (Roesch, 1999). Data input: TCP sessions. Detection: All.
Bro, (Vallentin et al., 2007). Data input: TCP sessions. Detection: All.
Two further rows report detection of Probe, DoS, R2L and U2R attacks.
3.
3.1.
3.2.
3.3. Application protocol keyword-based anomaly detection
Mahoney and Chan (2002b) built on PHAD by creating a new
component called Application Layer Anomaly Detector (ALAD).
ALAD adds some SCD features from the headers within a session,
as well as keywords from the application layer protocol.
A data instance for ALAD is a complete TCP connection
with basic features: application protocol keywords, opening
and closing TCP flags, source address, destination address
and port. Use of application protocol keywords puts ALAD in the category of protocol-based anomaly detection for this review (although a mixture of feature types is used). The
keywords are defined as the first word on each new line
within the application protocol header. Training data was
used to build models of allowed keywords in text-based
application protocols such as SMTP, HTTP and FTP. In the
detection phase, the anomaly score increased when a rare
keyword was used for a particular service. Unusual
keywords can indicate an R2L attack against a network service.
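The keyword definition above is straightforward to implement; the sketch below (illustrative only, not the authors' code) extracts the first word on each line of a text-based protocol header:

```python
def extract_keywords(header_text):
    """ALAD-style keywords: the first word on each non-empty line
    of a text-based application protocol header."""
    keywords = []
    for line in header_text.splitlines():
        parts = line.split()
        if parts:
            keywords.append(parts[0])
    return keywords

request = "GET /index.html HTTP/1.0\r\nHost: www.example.com\r\nUser-Agent: test\r\n"
print(extract_keywords(request))  # ['GET', 'Host:', 'User-Agent:']
```

In a detection phase along the lines described, each keyword would be looked up in a per-service frequency model and rare keywords would raise the anomaly score.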
4.
Many NIDS papers make use of the KDD Cup 99 dataset (KDD,
1999) as labeled data for testing and comparing network
intrusion algorithms. While it has known limitations
(McHugh, 2000; Mahoney and Chan, 2003), its advantages
include being publicly available, labeled, and preprocessed
ready for machine learning. This opens the field to any
researcher wanting to test their IDS and make meaningful
comparisons with other intrusion detection algorithms.
Generating accurate labels for custom datasets is a very time-consuming process, so this dataset is still used, despite its age.
The dataset was generated from the DARPA 98 network traffic
(Lippmann et al., 2000). Each network connection was processed into a labeled vector of 41 features. These were constructed using data mining techniques and expert domain
knowledge when creating a machine learning misuse-based
NIDS (Lee and Stolfo, 2000, 1998). One of their stated goals
was to eliminate the manual and ad-hoc processes of building
an IDS. While their research was successful, it found the raw
network traffic needed a lot of iterative data preprocessing
and required significant domain knowledge to produce a good
feature set (making the process hard to automate). It also
found that adding temporal-statistic measures significantly
improved classification accuracy.
The data preprocessing produced:

- 9 basic and SCD header features for each connection (similar to NetFlow)
- 9 time-based MCD header features constructed over a 2 s window
- 10 host-based MCD header features constructed over a 100 connection window to detect slow probes
- 13 content-based features constructed from the traffic payloads using domain knowledge. Data mining algorithms could not be used since the payloads were unprocessed and therefore unstructured. They were designed specifically to detect U2R and R2L attacks.
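To make the time-based construction concrete, here is a minimal sketch (an illustrative record layout, not the original KDD tooling) of two of the 2 s window features, counting recent connections to the same destination host and to the same service:

```python
def two_second_features(connections, current):
    """Count connections in the preceding 2 s that share the current
    connection's destination host (`count`) or service (`srv_count`).
    Each connection is a (timestamp, dst_ip, service) tuple."""
    t, dst, service = current
    window = [c for c in connections if t - 2.0 <= c[0] <= t]
    count = sum(1 for (_, d, _) in window if d == dst)
    srv_count = sum(1 for (_, _, s) in window if s == service)
    return count, srv_count

conns = [(0.1, "10.0.0.5", "http"), (0.5, "10.0.0.5", "smtp"),
         (1.9, "10.0.0.7", "http")]
print(two_second_features(conns, (2.0, "10.0.0.5", "http")))  # (2, 2)
```

The host-based features follow the same pattern but slide over the last 100 connections rather than a time window, which is what makes slow probes visible.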
Table 5 – NIDS using the KDD Cup 1999 dataset features as data input.

(Laskov et al., 2005). Main algorithm: compare various supervised and unsupervised learning algorithms. Detection: KDD Cup: Probe, DoS, R2L, U2R.
(Laskov et al., 2004). Main algorithm: Quarter Sphere SVM. Detection: KDD Cup: Probe, DoS, R2L, U2R.
(Xu, 2006). Data preprocessing: data cleaning. Main algorithm: multi-class SVM. Detection: KDD Cup: Probe, DoS, R2L, U2R.
(Wang and Battiti, 2006). Main algorithm: separate models created for normal class and each intrusion class; Euclidean distance for classification. Detection: KDD Cup: Probe, DoS, R2L, U2R; 98% detection with 0.4% false positive rate.
(Shyu et al., 2003). Main algorithm: Principal Component Classifier, compared to LOF, Canberra and Euclidean distance. Detection: KDD Cup: Probe, DoS, R2L, U2R; 98% detection at 1% false positive rate.
(Bouzida et al., 2004). Main algorithm: Nearest Neighbor and Decision Tree classification methods compared. Detection: KDD Cup: Probe, DoS, R2L, U2R.
(Yeung and Chow, 2002). Main algorithm: Parzen Window Density Estimation. Detection: KDD Cup: Probe, DoS, R2L, U2R; good results.
(Hernandez-Pereira et al., 2009). Detection: data preprocessing improves detection rate.
(Jiong and Mohammad, 2006). Detection: KDD Cup; similar results to other unsupervised algorithms.
(Li and Guo, 2007). Detection: KDD Cup: Probe, DoS, R2L, U2R.
(Chebrolu et al., 2005). Main algorithm: various classifiers. Detection: KDD Cup: Probe, DoS, R2L, U2R.

4.1. Data transformation
4.2. Data cleaning
4.3. Data reduction
4.4. Discretization
4.5. Re-labeling
4.6.
5.
Table 6 – NIDS which analyze traffic payloads to servers and individual web applications.

PAYL, (Wang and Stolfo, 2004). Data input: network packet payload. Data preprocessing: 1-g used to compute byte-frequency distribution models for each network destination.
POSEIDON, (Bolzoni et al., 2006). Data input: network packet payload. Data preprocessing: SOM identifies similar payloads per network destination; similar payloads are grouped into a model. Detection: higher accuracy than PAYL.
ANAGRAM, Wang et al. (2006). Data input: network packet payload. Data preprocessing: N-grams from payload stored in normal and malicious bloom filters; N tested from 2 to 9. Detection: mimicry resistance added.
McPAD, (Perdisci et al., 2009). Data input: network packet payload. Detection: shellcode attacks to web servers.
(Kloft et al., 2008). Data input: HTTP request.
(Rieck and Laskov, 2007). Data input: network packet payload. Detection: anomaly detector.
(Zhang and White, 2007). Data input: network packet payload. Detection: executable files.
(Kruegel and Vigna, 2003). Data input: HTTP web requests. Detection: attacks to web applications.
(Kiani et al., 2008). Data input: HTTP web requests. Detection: SQL injection attacks.

5.1.
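The 1-g preprocessing step used by PAYL can be illustrated with a short sketch (simplified: PAYL conditions its models on destination port and payload length, and compares distributions with a simplified Mahalanobis distance, none of which is shown here):

```python
from collections import Counter

def byte_frequency_model(payload: bytes):
    """Relative frequency of each 1-gram (byte value 0-255) in a payload."""
    counts = Counter(payload)
    n = len(payload) or 1  # avoid division by zero on empty payloads
    return [counts.get(b, 0) / n for b in range(256)]

model = byte_frequency_model(b"GET /index.html HTTP/1.0")
print(abs(sum(model) - 1.0) < 1e-9)  # True: frequencies sum to 1
```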
5.2.
libAnomaly available at http://www.cs.ucsb.edu/seclab/projects/libanomaly/.
Table 7 – NIDS which analyze traffic payloads for attacks targeting clients.

(Chen et al., 2009). Detection: malicious client-side scripts used in XSS and drive-by-downloads.
JSAND, (Cova et al., 2010). Data preprocessing: use libAnomaly to build models from features; find pages with anomaly scores 20% greater than the training set. Detection: malicious client-side scripts used in XSS and drive-by-downloads.
Caffeine Monkey, (Feinstein et al., 2007). Data preprocessing: statistical analysis of JavaScript function calls to discriminate normal from malicious JavaScript. Detection: malicious client-side scripts used in XSS and drive-by-downloads.
Noxes, (Kirda et al., 2006). Data preprocessing: whitelisting of allowed sites to visit; any site not in the whitelist is blocked. Detection: avoids XSS attacks.

5.3.

5.4.

Common network architectures ensure client hosts (workstations) within an organization are not directly exposed to the Internet at the network layer. This protects the client hosts from external threats such as probes, DoS, network worms and other attacks against open ports (services). However, these clients face many other threats, particularly when they are exposed to untrusted code or data. Exposure occurs when performing standard client computing tasks such as browsing the web, using an email client, or instant messaging.
6.

In this review, anomaly-based NIDS papers have been grouped according to the types of network features they analyze (see Tables 1–7). NIDS within a single table have similar claimed detection capabilities. This suggests the choice of feature types is important in determining the possible capabilities, or coverage, of the detector: different feature types allow detection of different attack categories such as probe, DoS, R2L and U2R. Subsequent stages, such as the main data mining algorithm and correlation engine, then determine how accurate and effective the NIDS is.

7.
7.1.
7.2.
7.3.
8. Conclusions
A. List of features
Table A2 (continued)

Quad: combination of src ip, src port, dst ip, dst port; defines a single connection.
Service: service type (TCP, UDP or ICMP) and application protocol (HTTP, SMTP, SSH or FTP etc.) to group similar traffic.
Other features: RTT, Fingerprint, HTTPS session, 802.11 calculated, Ethernet headers, IP headers, TCP headers, UDP headers, ICMP headers, TCP states, Duration, Status, SCD timing, Land, wrong_fragment, Urgent.

Table A3. KDD features (2 second window).

count: count of connections to the same dest ip as the current connection.
serror_rate: % of connections with same dest ip that have SYN errors.
rerror_rate: % of connections with same dest ip that have REJ errors.
same_srv_rate: % of connections with same dest ip to the same service.
diff_srv_rate: % of connections with same dest ip to different services.
srv_count: count of connections with same dest port as the current connection.
srv_serror_rate: % of connections with same dest port that have SYN errors.
srv_rerror_rate: % of connections with same dest port that have REJ errors.
srv_diff_host_rate: % of connections with same dest port but to different hosts.
Table A3 (continued)

dst_host_same_srv_rate: % of connections within dst_host_count to the same dest port.
dst_host_diff_srv_rate: % of connections within dst_host_count to a different dest port.
dst_host_same_src_port_rate: % of connections within dst_host_srv_count with the same source port.
dst_host_srv_diff_host_rate: % of connections within dst_host_srv_count to a different dest ip.
dst_host_serror_rate: % of connections within dst_host_count that have SYN errors.
dst_host_srv_serror_rate: % of connections within dst_host_srv_count that have SYN errors.
dst_host_rerror_rate: % of connections within dst_host_count that have REJ errors.
dst_host_srv_rerror_rate: % of connections within dst_host_srv_count that have REJ errors.

Table A4 (continued)

Count variance: variance measure for the count of packets for each src-dest pair.
Wrong resent rate: count of bytes sent even after being acknowledged.
Duplicate ACK rate: count of duplicate acknowledgment packets.
Data bytes: count of data bytes exchanged per flow.
sdp statistics: source-destination pairs (sdps) are unique combinations of src ip, dest ip and dest port. Statistics include: number of unique sdps in the collection interval; number of new sdps in this data collection interval; number of new sdps which were not seen in the last month; number of well known ports used in the interval; variance of the count of packets seen against sdps; count of sdps which include hosts outside the local network domain; number of successfully established TCP connections in the time interval; total packets observed in the collection interval.
av_size: average packet size over time window.
av_packets: average packets per flow over time window.
count_single: number of single packet flows over time window.
ratio: ratio of number of flows to bytes per packet (TCP) over time window.
Table A5.

Entropy measures: P(srcIP, srcport, dstIP, dstport); P(srcIP, dstIP, dstport); P(dstIP, dstport).
Conditional probability: P(TCPflags | dstport) and P(keyword | dstport).
Other features: Association rules; Flow concentration; Data points per cluster; %Control, %Data; Av duration; Av duration dest; Max flows; %_same_service_host; %_same_host_service.
count-dest: flow count to unique dest IP in the last T seconds from the same src.
count-src: flow count from unique src IP in the last T seconds to the same dest.
count-serv-src: flow count from the src IP to the same dest port in the last T seconds.
count-serv-dest: flow count to the dest IP using same src port in the last T seconds.
count-dest-conn: flow count to unique dest IP in the last N flows from the same src IP.
count-src-conn: flow count from unique src IP in the last N flows to the same dest IP.
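The entropy measures listed above can be computed directly from flow records; a minimal sketch (field names are assumed, not taken from any reviewed system):

```python
from collections import Counter
from math import log2

def flow_entropy(flows, key):
    """Shannon entropy of the empirical distribution of key(flow) values,
    e.g. key=lambda f: (f["dstIP"], f["dstport"]).
    Assumes `flows` is non-empty."""
    counts = Counter(key(f) for f in flows)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

flows = [{"dstIP": "10.0.0.1", "dstport": 80},
         {"dstIP": "10.0.0.1", "dstport": 80},
         {"dstIP": "10.0.0.2", "dstport": 53},
         {"dstIP": "10.0.0.2", "dstport": 53}]
print(flow_entropy(flows, lambda f: (f["dstIP"], f["dstport"])))  # 1.0
```

A sudden rise in entropy of (dstIP, dstport) is the kind of distributional shift such features are designed to expose, e.g. during scans.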
Table A5 (continued)

count-serv-src-conn: flow count from the src IP to the same dest port in the last N flows.
count-serv-dest-conn: flow count to the dest IP using same source port in the last N flows.

Table A6 (continued). Volume-based features.

num_packets_src_dst/dst_src: count of packets flowing in each direction.
num_acks_src_dst/dst_src: count of acknowledgment packets flowing in each direction.
num_bytes_src_dst/dst_src: count of data bytes flowing in each direction.
num_retransmit_src_dst/dst_src: count of retransmitted packets flowing in each direction.
num_pushed_src_dst/dst_src: count of pushed packets flowing in each direction.
num_SYNs(FINs)_src_dst/dst_src: count of SYN (FIN) packets flowing in each direction.
connection_status: status of the connection (0 - completed; 1 - not completed; 2 - reset).
count_src: connection count from same source as the current record.
count_serv_src: count of different services from the same source as the current record.
count_serv_dest: count of different services to the same destination IP as the current record.
count_src_conn: connection count from this src IP in the last 100 connections.
count_dest_conn: connection count to this dest IP in the last 100 connections.
count_serv_src_conn: connection count with same dst port and src IP in the last 100 connections.
count_serv_dst_conn: connection count with same dst port and dst IP in the last 100 connections.
Data N-g features.

1-g: model normal traffic payload. For each destination port and flow length, create a feature vector with the relative frequency count of each 1-g (byte) in the payload.
N-g: using complete payloads, store higher order N-grams extracted from attack traffic to create an attack model, and from normal traffic to create a normal model.
2v-g: extract 2-g from HTTP requests where the 2-g is constructed from two bytes separated by n bytes. For each value of n, a separate model can be created.
3-g: extract 3-g from HTTP requests. Also calculate HTTP request features: string length histogram; string entropy; flags indicating existence of special characters or strings.
App N-g: generate N-g from application layer header bytes only.
App Keyword: extract first word on each new line within an application protocol header, e.g. HTTP or SMTP.
File 1-g: 1-g frequencies calculated for sample exe, pdf, jpg, etc. files. Detect exe in traffic.
Snort: custom Snort alerts written to match byte patterns in particular types of sessions.
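The 2v-g construction above can be sketched as follows (the helper name is illustrative; pairs are byte values separated by a gap of n bytes, so n = 0 reduces to ordinary adjacent 2-grams):

```python
def two_v_grams(payload: bytes, n: int):
    """All pairs of byte values in `payload` separated by exactly
    n intervening bytes."""
    return [(payload[i], payload[i + n + 1])
            for i in range(len(payload) - n - 1)]

print(two_v_grams(b"abcd", 1))  # [(97, 99), (98, 100)]
```

Building one model per value of n gives a set of views of the payload that is harder to evade than a single fixed-gap model.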
Web request parameter features (user-supplied data to web applications is extracted from request URIs as parameters).

Parameter length: flag long parameter strings which may contain a buffer overflow exploit.
Parameter character distribution: parameters generally have regular structure, are mostly human readable, and almost always use only printable chars.
Structural inference: infer parameter structure (its regular grammar) from legitimate training data.
Token finder: identify parameters which are enumerations or have a limited set of allowed values.
Presence or absence: a list of valid parameter sets for the web application is created from the training data.
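The parameter length model can be sketched with training statistics and a Chebyshev-style bound (an illustrative reading of length-model approaches, not the exact formulation used by any reviewed system):

```python
from statistics import mean, pvariance

class ParamLengthModel:
    """Learn mean/variance of legitimate parameter lengths; score a new
    length by a Chebyshev upper bound on how likely such a deviation is
    under the training distribution."""
    def fit(self, lengths):
        self.mu = mean(lengths)
        self.var = pvariance(lengths)
        return self

    def legit_prob(self, length):
        d = length - self.mu
        if d <= 0 or self.var == 0:
            return 1.0  # not longer than normal: treat as legitimate
        # Chebyshev: P(|X - mu| >= d) <= var / d^2, capped at 1
        return min(1.0, self.var / (d * d))

m = ParamLengthModel().fit([8, 10, 12, 9, 11])
print(m.legit_prob(400) < 0.01)  # True: overflow-sized parameter is flagged
```

The appeal of the bound is that it makes no distributional assumptions, so it stays valid however oddly legitimate parameter lengths are distributed.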
Web content features.

Redirection Rate: number of redirect pages, i.e. pages comprising only invisible iframe tags or JavaScript code.
Other features: depth of a tree of web page links (deeper depth is more suspicious); flag whether a module detects known malicious DHTML; number of times a web page is encoded (some malicious pages encode recursively); number of pages using suspicious coding styles, e.g. unreasonable use of eval() or document.write() methods.
Table A8 (continued)

Keywords Splitting: the number of pages splitting sensitive keywords such as iframe, presumably as an evasion technique.
Keywords Encoding: the number of pages with sensitive keywords encoded.
Splitting & Encoding: number of pages with both sensitive keyword splitting and encoding.
JS objects: number of object instantiations in deobfuscated JavaScript.
JS elements: number of element instantiations in deobfuscated JavaScript.
JS escape: number of uses of escape in deobfuscated JavaScript.
JS string: number of string instantiations in deobfuscated JavaScript.
JS write: number of uses of document.write in deobfuscated JavaScript.
Redirection targets: identify redirect chains to an unusually large number of domains.
Browser differences: detect differences in served web pages when different browser types are used. Malicious websites may attempt different attacks for each browser.
String ratio: detect deobfuscation code by finding a high ratio of string definition to string use.
Shellcode strings: analyze both static and dynamically allocated strings >256 bytes for shellcode.
Instantiated components: e.g. the number of ActiveX controls or plugins which are instantiated in a page.
Method call params: look for long strings or large integers passed to instantiated components.
Method call sequence: look for method call sequences indicative of an exploit.
Further features measure how many function calls are used to dynamically execute JavaScript (long strings passed to the eval function can be indicative of malicious code) and the number of bytes allocated through string operations (large amounts of memory can indicate heap-spraying).

Table A9 (continued)
hot: number of hot indicators, such as access to system directories or program execution.
num_failed_logins: number of failed login attempts.
logged_in: 1 if successfully logged in; 0 otherwise.
num_compromised: number of compromised conditions.
root_shell: 1 if root shell is obtained; 0 otherwise.
su_attempted: 1 if su root command attempted; 0 otherwise.
num_root: number of root accesses.
num_file_creations: number of file creation operations.
num_shells: number of shell prompts.
num_access_files: number of operations on access control files.
num_outbound_cmds: number of outbound commands in an ftp session.
is_hot_login: 1 if the login belongs to the hot list; 0 otherwise.
is_guest_login: 1 if the login is a guest login; 0 otherwise.
References