
Network Intrusion Detection: Data Mining Perspective
Supervised by: Dr. Nacer Assem
By: Marghabi Souhail
Mouni Mati
Fassi Fihri Ismail
Spring 2013
Motivation
The ability to detect and prevent network intrusions
has become critical for servers holding sensitive
data.
Nowadays, illegal actions in networks can outnumber
normal authorized accesses.
An intrusion is defined as any action compromising
the security of sensitive data and their owners.
Intrusion prevention alone is no longer sufficient;
intrusion detection is more crucial.
Intrusion detection: the process of monitoring
and analyzing events (sessions) within a network
system for harmful actions.



Misuse Detection Vs. Anomaly Detection
Misuse detection: uses knowledge of predefined attacks to
identify incoming intrusions
Classification based on known intrusions
Each instance in a data set is labeled as normal or intrusion and a
learning algorithm is trained over the labeled data.
High degree of accuracy in detecting known attacks and their
variations
Signature-based detection methods.
Drawback: inability to detect intrusion from unknown action
patterns
Anomaly detection: compares new data against a profile of normal,
legitimate access to identify traces of intrusion
Any significant deviations from the expected behavior are reported
as possible attacks.
Builds models of normal behavior, and automatically detects any
deviation from it, flagging it as suspect.
Drawback: high rate of erroneous detections (false alarms)
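The contrast between the two approaches can be sketched in a few lines. This is a minimal illustration, not the models built in this project: the signature list, the sample values, and the three-standard-deviation threshold are all hypothetical assumptions.

```python
from statistics import mean, stdev

# Misuse detection: match a connection against known attack signatures.
# The single signature below is a hypothetical example.
KNOWN_SIGNATURES = [{"protocol_type": "icmp", "wrong_fragment": 1}]

def matches_signature(conn):
    """Return True if every field of some known signature matches the connection."""
    return any(all(conn.get(k) == v for k, v in sig.items())
               for sig in KNOWN_SIGNATURES)

# Anomaly detection: model normal behavior, then flag large deviations.
def fit_normal_profile(values):
    """Summarize normal behavior as (mean, standard deviation)."""
    return mean(values), stdev(values)

def is_anomalous(value, profile, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the normal mean."""
    mu, sigma = profile
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical "count" values observed during normal operation.
normal_counts = [3, 5, 4, 6, 5, 4, 5, 3, 6, 4]
profile = fit_normal_profile(normal_counts)

known_attack = {"protocol_type": "icmp", "wrong_fragment": 1}
novel_attack = {"protocol_type": "tcp", "wrong_fragment": 0}

print(matches_signature(known_attack))  # a known pattern is caught
print(matches_signature(novel_attack))  # an unknown pattern is missed
print(is_anomalous(250, profile))       # but its unusual count can still be flagged
```

The signature matcher cannot flag anything it has not seen before, while the anomaly check flags any unusual value, including unusual but legitimate ones, which is where the erroneous detections come from.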
The use of Data Mining
Signature-based detection methods can only detect
previously known attacks that have a matching
signature
Given this limitation, there is rising interest in
the use of Data Mining for intrusion detection.
Two categories: Misuse detection, and Anomaly
detection. Each instance of the dataset should be
labeled as normal or intrusive.
This project focuses on building models for
the two major elements of the intrusion detection
domain: Misuse detection and Anomaly detection
Goal: classify each connection to a military LAN server
as legitimate or as belonging to an attack category.
Data Description
Collected from network traffic during a period of
10 weeks
Raw TCP dump data for a LAN simulating a
typical US Air Force LAN
The LAN was the subject of several attacks of
different types

Data Description
Connection data is defined as an aggregation
of raw packet data.
Thus, a connection is a sequence of TCP
packets starting and ending at some well-
defined times, between which data flows to
and from a source IP address to a target IP
address under some well-defined protocol.
Each connection is labeled as either normal,
or as an attack, with exactly one specific
attack type
Data Description
There are 38 different attack types, belonging to the following 4 main
categories:

1) Denial of Service Attack (DoS): an attack in which the attacker makes
some computing or memory resource too busy, and denies legitimate
users access to a machine.
2) User to Root Attack (U2R): a class of exploit in which the attacker
exploits some vulnerability to gain root access to the system.
3) Remote to Local Attack (R2L): occurs when an attacker who can send
packets to a machine over a network, but has no account on that machine,
exploits some vulnerability to gain local access as a user of that machine.
4) Probing Attack: an attempt to gather information about a network of
computers for the apparent purpose of circumventing its security controls.



Data Description
The 22 different types of network attacks
present in the training data are shown below:
Data Description
Lincoln Labs have provided the following
information about the size of data and how they
have split it and made it available to public:

8,050,290 records: 4,940,000 training records and
3,110,290 test records.
41 attributes per record; records unlabeled at first.
Using 10-fold sampling, we worked with a smaller
subset of the large dataset: 186305 records.

Data description: Features of TCP connection

Feature | Description | Type
count | number of connections to the same host as the current connection in the past two seconds | continuous
same_srv_rate | % of connections to the same service | continuous
diff_srv_rate | % of connections to different services | continuous
serror_rate | % of connections that have "SYN" errors | continuous
rerror_rate | % of connections that have "REJ" errors | continuous
srv_diff_host_rate | % of connections to different hosts | continuous
srv_serror_rate | % of connections in same service that have "SYN" errors | continuous
srv_rerror_rate | % of connections that have "REJ" errors in same service | continuous
srv_count | number of connections to the same service as the current connection in the past two seconds | continuous
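To make the two-second window concrete, here is a minimal sketch of how `count`, `srv_count`, and `same_srv_rate` could be derived from earlier connections. The `Connection` class and the `traffic_features` function are hypothetical names introduced for illustration. The sketch assumes that rates are expressed as fractions between 0 and 1 and that `same_srv_rate` is computed over the same-host connections.

```python
from dataclasses import dataclass

@dataclass
class Connection:
    time: float     # time of the connection, in seconds
    dst_host: str   # destination host
    service: str    # destination service, e.g. "http"

def traffic_features(history, current, window=2.0):
    """Compute time-window features for `current` from earlier connections."""
    # Keep only connections seen within the last `window` seconds.
    recent = [c for c in history if 0 <= current.time - c.time <= window]
    same_host = [c for c in recent if c.dst_host == current.dst_host]
    same_srv = [c for c in recent if c.service == current.service]
    count = len(same_host)
    srv_count = len(same_srv)
    # Fraction of same-host connections that also use the same service.
    matching = sum(c.service == current.service for c in same_host)
    same_srv_rate = matching / count if count else 0.0
    return {"count": count, "srv_count": srv_count, "same_srv_rate": same_srv_rate}

# Hypothetical history: three recent connections and one outside the window.
history = [
    Connection(0.0, "h1", "http"),
    Connection(1.0, "h1", "ftp"),
    Connection(1.5, "h2", "http"),
    Connection(-5.0, "h1", "http"),
]
features = traffic_features(history, Connection(2.0, "h1", "http"))
print(features)
```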
Content features within a connection

Feature name | Description | Type
hot | number of "hot" indicators | continuous
num_failed_logins | number of failed login attempts | continuous
logged_in | 1 if successfully logged in; 0 otherwise | discrete
num_compromised | number of "compromised" conditions | continuous
root_shell | 1 if root shell is obtained; 0 otherwise | discrete
su_attempted | 1 if "su root" command attempted; 0 otherwise | discrete
num_root | number of "root" accesses | continuous
num_file_creations | number of file creation operations | continuous
num_shells | number of shell prompts | continuous
num_access_files | number of operations on access control files | continuous
num_outbound_cmds | number of outbound commands in an ftp session | continuous
is_hot_login | 1 if the login belongs to the "hot" list; 0 otherwise | discrete
is_guest_login | 1 if the login is a "guest" login; 0 otherwise | discrete
Snapshot of the Data in Arff
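For readers unfamiliar with the format, an ARFF file consists of a `@relation` line, one `@attribute` line per column, and a `@data` section with comma-separated rows. The sketch below renders a few rows in that layout. The relation name, the choice of four attributes, and the row values are hypothetical and are not taken from the project's actual file.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    `attributes` is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ",".join(kind) + "}"   # nominal attribute
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)

# Hypothetical sample using a few attribute names from this dataset.
attributes = [
    ("protocol_type", ["tcp", "udp", "icmp"]),
    ("src_bytes", "numeric"),
    ("count", "numeric"),
    ("class", ["normal", "attack"]),
]
rows = [("tcp", 181, 8, "normal"), ("icmp", 1032, 511, "attack")]
arff_text = to_arff("kdd_sample", attributes, rows)
print(arff_text)
```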

Technology Tools Used

WEKA:
A collection of machine learning algorithms for data
mining tasks
The algorithms can either be applied directly to a
dataset or called from your own Java code
WEKA was very effective for the present
project: it supports data visualization and contains
tools for data pre-processing
WEKA contains tools for classification and clustering; we
used them to build the misuse detection model and the
anomaly detection model
Data Visualization & Preprocessing


Attack Types
Data Mining Techniques Used
J48 model:
J48 is a slightly modified C4.5 in WEKA.
The C4.5 algorithm generates a classification decision
tree for the given data set by recursive partitioning of
data.
The algorithm considers all the possible tests that can
split the data set and selects a test that gives the best
information gain.
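The split selection described above can be sketched as follows for a single numeric attribute. This is an illustrative sketch with hypothetical data, not WEKA's J48 code. It uses plain information gain as stated on the slide, whereas C4.5 normally normalizes the gain by the split information (gain ratio).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by the test `value <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the test does not actually split the data
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder

def best_threshold(values, labels):
    """Try every candidate threshold and keep the one with the highest gain."""
    candidates = sorted(set(values))[:-1]  # the largest value cannot split
    return max(candidates, key=lambda t: information_gain(values, labels, t))

# Hypothetical "count" values with their labels.
counts = [10, 20, 50, 80, 120, 300]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]
split = best_threshold(counts, labels)
print(split, information_gain(counts, labels, split))
```

A full tree builder would repeat this search over every attribute, split the data on the winning test, and recurse on each partition.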

Data Mining Techniques Used
Naive Bayes
In simple terms, a naive Bayes classifier assumes that the presence
(or absence) of a particular feature of a class is unrelated to the
presence (or absence) of any other feature, given the class variable.
A naive Bayes classifier, abbreviated as NBC, is a probabilistic
classifier that relies on the application of Bayes' theorem, with
the assumption of strong, also referred to as naive, independence. This
model is often described with the term independent feature model.

Example:
A fruit may be considered to be an apple if it is red, round, and
about 4" in diameter. Even if these features depend on each other or
upon the existence of the other features, a naive Bayes classifier
considers all of these properties to independently contribute to the
probability that this fruit is an apple.
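The same independence assumption can be sketched for connection records. The following is a minimal categorical naive Bayes with Laplace smoothing, written for illustration only: it is not WEKA's NaiveBayes implementation, and the six training rows are hypothetical.

```python
from collections import Counter, defaultdict
from math import log

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values_seen = defaultdict(set)         # feature index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
            values_seen[i].add(value)
    return class_counts, feature_counts, values_seen

def predict(model, row):
    """Return the class with the highest log-posterior for `row`."""
    class_counts, feature_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in class_counts.items():
        score = log(n / total)  # log prior
        for i, value in enumerate(row):
            # Each feature contributes independently (Laplace-smoothed).
            numerator = feature_counts[(label, i)][value] + 1
            denominator = n + len(values_seen[i])
            score += log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (protocol_type, logged_in) -> label.
rows = [("tcp", 1), ("tcp", 1), ("udp", 1), ("icmp", 0), ("icmp", 0), ("tcp", 0)]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("icmp", 0)))
print(predict(model, ("tcp", 1)))
```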
Misuse Detection
WEKA

Memory Management
We increased the Java heap size to 2000 MB (JVM option -Xmx) instead of the default value
WEKA knowledge Flow Environment

WEKA knowledge Flow Interface
K-means clustering
Cluster 0: attacks, 96238 instances
Cluster 1: normal connections, 90087 instances
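A two-cluster split like the one reported above can be illustrated with a minimal k-means sketch. This is not WEKA's SimpleKMeans: the points are hypothetical (count, serror_rate) pairs, the initial centroids are picked by hand, and no feature scaling is applied.

```python
def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        clusters[distances.index(min(distances))].append(p)
    return clusters

def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid update for a fixed number of rounds."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            # New centroid = mean of the cluster; keep the old one if empty.
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)

# Hypothetical (count, serror_rate) pairs.
points = [(2, 0.0), (5, 0.1), (3, 0.0), (200, 1.0), (250, 0.9), (230, 1.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

K-means only groups similar records; deciding which cluster corresponds to attacks still requires inspecting the clusters or comparing them with labeled data.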
NaiveBayes Flow
Naive Bayes Results
J48: Decision trees

Modified version of the C4.5 algorithm.
Uses a depth-first strategy: it considers all the possible splits and
uses the one that has the highest information gain.

TP (True Positive): classified as positive, and actually positive
TN (True Negative): classified as negative, and actually negative
FP (False Positive): classified as positive, while actually negative
FN (False Negative): classified as negative, while actually positive


Precision: $\frac{TP}{TP + FP}$
and Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
measure the quality of binary classifiers
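As a worked example, precision is the share of predicted positives that are real positives, TP / (TP + FP), and accuracy is the share of all predictions that are correct, (TP + TN) / (TP + TN + FP + FN). The confusion-matrix counts below are hypothetical and are not the results obtained in this project.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are real positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

# Hypothetical confusion-matrix counts for an attack detector.
tp, tn, fp, fn = 90, 85, 10, 15
print(precision(tp, fp))         # 0.9
print(accuracy(tp, tn, fp, fn))  # 0.875
```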

J48 Knowledge Flow
J48 Classifier
J48 Decision Tree

Partial Graph view


Workflow for comparing performance of J48 and Naive Bayes

ROC chart

Rule extraction


Some rules extracted from the decision tree built using J48

Rule 1: IF count <= 76 and num_compromised <= 0 and dst_host_serror_rate <=
0.89 and wrong_fragment <= 0 and dst_host_diff_srv_rate <= 0.93 and hot <= 0
and same_srv_rate <= 0.32 and src_bytes <= 9 THEN attack (255.0/2.0)
Rule 2: IF count <= 76 and num_compromised <= 0 and dst_host_serror_rate <=
0.89 and wrong_fragment <= 0 and dst_host_diff_srv_rate <= 0.93 and hot <= 0
and same_srv_rate <= 0.32 and src_bytes <= 9 THEN normal_connection
(234.0/1.0)
Rule 3: IF count <= 76 and num_compromised > 0 and dst_host_same_srv_rate >
0.49 and src_bytes <= 29200 and dst_host_srv_diff_host_rate > 0.01 and
service = http THEN normal_connection (5.0)
Rule 4: IF count > 76 and dst_bytes <= 1 THEN attack (59031.0)
Rule 5: IF count > 76 and dst_bytes > 1 and protocol_type = tcp and
diff_srv_rate <= 0.07 THEN normal_connection (8.0)
Rule 6: IF count <= 76 and num_compromised > 0 and dst_host_same_srv_rate
<= 0.49 THEN normal_connection (39.0/2.0)
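Extracted rules of this kind translate directly into code. The sketch below encodes only Rules 4, 5, and 6 as a hypothetical `classify` function over a dictionary of connection attributes; it returns `None` when none of the three rules fires, and the sample connections are invented for illustration.

```python
def classify(conn):
    """Apply Rules 4-6 to a connection; return None if no rule fires."""
    if conn["count"] > 76:
        if conn["dst_bytes"] <= 1:
            return "attack"                # Rule 4
        if conn["protocol_type"] == "tcp" and conn["diff_srv_rate"] <= 0.07:
            return "normal_connection"     # Rule 5
        return None
    if conn["num_compromised"] > 0 and conn["dst_host_same_srv_rate"] <= 0.49:
        return "normal_connection"         # Rule 6
    return None

# Hypothetical connections built from a common baseline.
base = {
    "count": 0, "dst_bytes": 0, "protocol_type": "tcp",
    "diff_srv_rate": 0.0, "num_compromised": 0, "dst_host_same_srv_rate": 1.0,
}
flood = {**base, "count": 200, "dst_bytes": 0}
busy_tcp = {**base, "count": 100, "dst_bytes": 500, "diff_srv_rate": 0.05}
compromised = {**base, "count": 10, "num_compromised": 2,
               "dst_host_same_srv_rate": 0.3}

print(classify(flood))        # Rule 4
print(classify(busy_tcp))     # Rule 5
print(classify(compromised))  # Rule 6
```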
