
A

PRELIMINARY PROJECT REPORT ON

An Automatically Tuning Intrusion Detection System

SUBMITTED

TO

PUNE UNIVERSITY, PUNE


FOR THE DEGREE

OF

BACHELOR OF COMPUTER ENGINEERING


BY

Ritesh Kumar Sinha, Ankush Verma, Manikant Ojha, Amit Kumar

UNDER THE GUIDANCE

OF

Prof. S.R.Patil
DEPARTMENT OF COMPUTER ENGINEERING
MAHARASHTRA ACADEMY OF ENGINEERING
ALANDI (D), PUNE-412105
2010-2011

Certificate
This is to certify that the Project Report entitled

“An Automatically Tuning Intrusion Detection System”

has been submitted by

Mr. Ritesh Kumar Sinha


Mr. Ankush Verma
Mr. Manikant Ojha
Mr. Amit Kumar
in partial fulfillment of the Bachelor’s Degree in
Computer Engineering awarded by
UNIVERSITY OF PUNE, PUNE
2010-11

Prof. S. R. Patil                                Dr. S. J. Wagh
Project Guide                                    Head of Department,
                                                 Computer Engineering

Principal
MAHARASHTRA ACADEMY OF ENGINEERING
ALANDI(D), PUNE-412105

Acknowledgement

We would like to thank our guide, Prof. S. R. Patil, for his complete
support of our project; without his help at every step this project would
not have been successful. We would also like to thank our HOD, Prof. S. J.
Wagh, for his support, and the MAE staff, including the library staff, for
their help during our research.

Mr. Ritesh Kumar Sinha,


Mr. Ankush Verma,
Mr. Manikant Ojha,
Mr. Amit Kumar
Abstract

THEME/PURPOSE:
An intrusion detection system (IDS) is a monitoring system used to identify
abnormal activities in a computer system. The IDS raises alarms to the
system operator when it detects any abnormal condition. An IDS works in a
dynamically changing environment, and traditionally its operation depends on
security experts performing manual tuning. Because the environment changes
dynamically, we develop a system that reduces this dependence by tuning
itself automatically. Basically, an IDS consists of a prediction engine that
analyzes data and outputs a prediction on each record. From these
predictions the system operator can tell whether a data record is normal or
affected by an attack. The prediction engine is therefore the heart of an
intrusion detection system.
In an automatically tuning intrusion detection system (ATIDS), the system
operator analyzes the predictions obtained from the detection model, and
only the false predictions are considered. ATIDS consists of three major
components: a prediction model, a prediction engine and a model tuner. First
we create the prediction model; the prediction engine then analyzes the data
according to that model. The system operator verifies the results and marks
the false predictions, and only these false predictions are fed back to the
model tuner, which tunes the model automatically.

METHODOLOGY:
Our project rests on two important aspects, and the whole procedure is
designed to implement these aspects correctly. The aspects are given below:

Attack detection model:


Here we use the SLIPPER learning algorithm, a rule-based learning system,
for detecting intrusions. The system is evaluated using the KDD Cup ’99
intrusion detection dataset.
Prediction engine:
A binary learning algorithm can only build binary classifiers. We group
attacks into categories such as denial-of-service, probing, remote-to-local
and user-to-root. Correspondingly, we construct five binary classifiers from
the training dataset. One binary classifier predicts whether the input data
record is normal; the other four predict whether the input data record
constitutes a particular category of attack.
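The combination of five binary classifiers described above can be sketched as follows. This is only an illustrative sketch: the interface, the class names and the simple "highest confidence wins" arbitration are our own assumptions, not the actual SLIPPER-based implementation.

```java
// Sketch: combining five binary classifiers (normal, DoS, probe, R2L, U2R)
// into one multi-class prediction. Each binary classifier returns a
// confidence that the record belongs to its class; the highest wins.
// All names here are illustrative, not from the actual system.
import java.util.Map;
import java.util.function.ToDoubleFunction;

public class MultiClassArbiter {
    static final String[] CLASSES = {"normal", "dos", "probe", "r2l", "u2r"};

    /** Run every binary classifier on the record and return the class
     *  whose classifier reports the highest confidence. */
    static String predict(Map<String, ToDoubleFunction<double[]>> classifiers,
                          double[] record) {
        String best = "normal";
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cls : CLASSES) {
            double score = classifiers.get(cls).applyAsDouble(record);
            if (score > bestScore) { bestScore = score; best = cls; }
        }
        return best;
    }
}
```

In the real system each classifier would be a learned SLIPPER rule set rather than a hand-written confidence function.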

HARDWARE AND SOFTWARE REQUIREMENT:

SOFTWARE REQUIRED
• Java 1.3 or higher.
• Java Swing.

HARDWARE REQUIREMENT
• Hard disk (40 GB).
• RAM (128 MB).
• Processor (Pentium or better).

APPLICATIONS: Basically, an IDS analyzes data and outputs a prediction on
the data. From these predictions the system operator can tell whether a data
record is normal or affected by an attack.
TABLE OF CONTENTS

CHAPTER NO. TITLE PAGE NO.

FRONTPAGE I
CERTIFICATE II
ACKNOWLEDGEMENT III
ABSTRACT IV
LIST OF FIGURES VII
LIST OF TABLES VIII

Chapter 1. INTRODUCTION 1
1.1 Introduction

Chapter 2. PLATFORM CHOICE 2


2.1 Java Swing

Chapter 3. LITERATURE SURVEY 3

3.1 Basic Structure of IDS 4


3.1.1 Data sampling 5
3.1.2 Data processing 5
3.1.3 Classifier System 6
3.1.4 Types of IDS 7
3.2 System Overview 8
3.2.1 Beginning 8
3.2.2 Types of IDS 9
3.2.2.1 Host based 9
3.2.2.2 Network based 10

Chapter 4. REQUIREMENT ANALYSIS 12

4.1 Data Set 12


4.1.1 KDD CUP 99 set description 12
4.2 Arbitral Strategy by neural network 14
4.3 Multi-Class SLIPPER 15
4.4 Hardware and software requirement 16
4.4.1 Hardware Requirement 16
4.4.2 Software Requirement 16
4.5 Project Plan 16

Chapter 5. SYSTEM DESIGN 17


5.1 UML Diagrams 17
5.1.1 Activity Diagram 17
5.1.2 Use Case Diagram 18
5.1.3 Component Diagram 19
5.1.4 Class Diagram 20
5.1.5 Deployment Diagram 21
5.1.6 Sequence Diagram 22
Chapter 6. CONCLUSION AND FUTURE SCOPE 23

REFERENCES IX

LIST OF FIGURES

Sr. No. Figure Number Name of figure Page Number


1. Fig 1 Basic architecture of IDS 5
2. Fig 2 A classifier system consists of four parts: 6
3. Fig 3 Multi-class SLIPPER 15
4. Fig 4 Optimized preprocess algorithm 15
5. Fig 5 Activity Diagram 17
6. Fig 6 Use case Diagram 18
7. Fig 7 Component Diagram 19
8. Fig 8 Class Diagram 20
9. Fig 9 Deployment Diagram 21
10. Fig 10 Sequence Diagram 22

LIST OF TABLES

Sr. No. Table Number Name of Table Page Number


1. Table: 1 PROJECT PLAN 16

Chapter 1.
Introduction

With the expansion of the Internet, the importance of information security
has been on the rise. There is no standard definition of intrusion detection
as such. Usually, intrusion detection is understood as the discovery of
network behaviors that abuse or endanger network security. Intrusion
detection can be treated as a pattern recognition problem that distinguishes
between network attacks and normal network behaviors, or further
distinguishes between different categories of attacks.
Any set of events that tries to compromise the availability, integrity or
confidentiality of resources is called an intrusion. An intruder is a person
or group of persons who initiates these events. The intruder can be from
within the system, that is, someone with permission to use the computer with
normal user privileges, or someone who uses a hole in some operating system
to escalate their privilege level; or the intruder can be from outside the
system, that is, someone on another network, perhaps even in another
country, who exploits a vulnerability or weakness in an unprotected network
service on the computer to gain unauthorized entry and control.
An intrusion detection system is in effect a security layer used to notice
ongoing intrusive activities in information systems. Conventionally,
intrusion detection depends heavily on the extensive knowledge of security
experts, in particular on their familiarity with the computer system that is
to be protected. To reduce this dependency, a variety of machine learning
and data mining techniques have been deployed for intrusion detection. An
IDS usually operates in a dynamically changing environment, which requires
continual tuning of the intrusion detection model to maintain sufficient
performance. The manual tuning process required by current systems depends
on the system operators working out the tuning solution and integrating it
into the detection model. Moreover, network intrusion detection aims at
separating attacks on the Internet from normal use of the Internet; it is a
very important and essential piece of the information security system. Due
to the diversity in network behaviors and the rapid evolution of attack
methods, it is of prime importance to develop fast machine-learning-based
intrusion detection algorithms with low false-alarm rates and high detection
rates.

Chapter 2.

Platform choice

Java Swing:

Swing is a GUI toolkit for Java. It is part of Sun Microsystems’ Java
Foundation Classes (JFC), an API for building graphical user interfaces.
Swing was developed to provide a more sophisticated set of GUI components
than the Abstract Window Toolkit (AWT), and it offers a pluggable look and
feel that can emulate the look and feel of several platforms.

Using Swing we will develop the user interface of our intrusion detection
system, which will expose all the functionalities of the system such as rule
creation, prediction and tuning.

The most important advantage of Java Swing is its cross-platform support,
which allows developers to build applications that execute on Windows, Mac
and Linux. Swing also provides a very rich set of components and features
that can easily satisfy the requirements of many different types of
applications, such as development tools, administration consoles and
business applications.
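The interface described above can be sketched in Swing as follows. The widget labels, layout and class name are our own illustration of one button per system function (create rule, prediction, tuning), not the actual interface of the project.

```java
// Minimal Swing sketch: a control panel with one entry point per
// functionality mentioned above. Labels and layout are illustrative only.
import javax.swing.JButton;
import javax.swing.JPanel;

public class IdsControls {
    /** Build the control panel; a JFrame would host it in the real UI. */
    static JPanel buildControls() {
        JPanel panel = new JPanel();
        panel.add(new JButton("Create Rule"));
        panel.add(new JButton("Prediction"));
        panel.add(new JButton("Tuning"));
        return panel;
    }
}
```

In the full application each button's action listener would invoke the corresponding subsystem (rule learner, prediction engine, model tuner).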

Chapter 3.

Literature Survey

Protection forms an important aspect of any computing system. Protection
encompasses the availability, integrity and confidentiality of the resources
provided by a computing system. Three aspects of network systems make these
systems more susceptible to attack than self-sufficient machines:
• Networks typically provide more resources than independent machines.
• Network systems are normally configured to facilitate resource sharing.
• Global protection policies that can be applied to all of the machines in a
network are rare.

As discussed earlier, in order to reduce the dependency on security experts
found in traditional systems, a lot of research effort was invested in
different projects, which led to the rise of different data mining and
machine learning methods that could be easily incorporated into intrusion
detection systems.
Audit data analysis and mining was one such technique; it combined the logic
of mining association rules with classification in order to identify and
detect intrusions in network traffic, whereas the ISA (Information Systems
Assurance) laboratory used a statistics-based technique, employing
chi-square tests and exponentially weighted moving averages for statistical
analysis of audit data.

Information security on the Internet consists of the following:


1) Protection: The information system is automatically protected to avoid security violations
that are called intrusions.
2) Detection: Security violations are detected as soon as they occur.
3) Reaction: Reactions, such as the pursuit of hackers or automatic alarms,
are performed when the system is intruded upon.
4) Recovery: The information system automatically repairs the damages caused by an
intrusion.
Intrusion detection forms a crucial part of information security. Only if intrusions are
correctly detected can the subsequent reaction and recovery be successfully implemented.

An intrusion detection system is based on the fact that an intrusion will be
revealed by a change in the ‘normal’ patterns of resource use. Intrusion
detection is a methodology by which any undesirable or abnormal activity can
be detected. An intrusion detection system is a monitoring system that
raises an alert to the system operator whenever its detection model infers
abnormal activity. An intrusion detection system (IDS) is software, hardware
or a mixture of both that helps to notice intruder activity. An IDS may have
different capabilities depending upon how sophisticated and complex its
mechanisms are. IDS appliances that are a mixture of software and hardware
are obtainable from many organizations. An IDS may apply anomaly-based
techniques, signatures, or both. Alerts are any kind of user notification of
an intruder action: when the IDS detects an intruder, it informs the
security supervisor by means of alerts, which may take the form of logging
to a console, pop-up windows, e-mail and so on. Intrusion detection is an
unrelenting, active attempt to discover the presence of intrusive
activities; as it relates to computers and network communications, it
encompasses a broad range of processes used to detect illegal uses of
network or computer devices. This is achieved by purpose-built software with
the sole aim of detecting abnormal or irregular activity. Depending upon the
network topology, we can place intrusion detection systems at one or more
locations. Placement also depends upon the type of intrusion behavior we
want to notice: internal, external or both. For instance, if we wish to
detect only external intrusion behavior, and we have only a single router
linking to the Internet, then the best position for an intrusion detection
system may be just inside the firewall or router. If numerous paths exist to
the Internet, then we want to position one IDS at every entry point. But if
we want to detect internal threats as well, then a box should be placed in
every network segment.

3.1 Basic architecture of IDS


One approach to developing network security is to describe network behavior
patterns that indicate offensive use of the network and then look for
occurrences of those patterns. While such an approach may be capable of
detecting various types of known intrusive actions, it would allow new or
undocumented types of attacks to go unnoticed. As a result, this leads to a
system which monitors and learns normal network behavior and then detects
deviations from it.

Fig 1: Basic architecture of IDS

3.1.1 Data Sampling:


The first step in collecting data is to decide exactly what type of data
should be collected. Because the objective of this project is intrusion
detection at the network level, a natural choice of data unit is the network
transmission packet. The network offers two types of information to study,
transport information and user information, but here only transport
information is selected. Transport information contains an ordered pair of
source and destination, and it also includes some type of checksum with
which the integrity of a packet is verified. Transport information is added
to the packet as part of the network transmission protocol. Transport
information that cannot be forged by a fraudulent user is called unbiased
data. The user information is the payload that is transferred from one
machine to another; this can easily be modified by a fraudulent user, and
hence we call it biased data. The next step in collecting data is to design
a device for monitoring network packets. Since finding an intrusion does not
rely on any particular method of capturing packets, any method capable of
obtaining a suitable data sample is acceptable. The last step is to process
the collected data so that it is transformed into a format acceptable to the
classifier system.
3.1.2 Data Preprocessing:
There are some values which are important to the classifier. These are:
• Packet size value,
• Timestamp value,
• Ethernet source–destination ordered pair.
There are two reasons for preprocessing the data:
1) In the case of packet sizes and source and destination addresses, the raw
data can be compacted without loss of relevant information. This yields data
that is easier for the classifier system to manipulate and that requires
less disk storage space.
2) In the case of timestamp information, the basic second count is expanded
to include day-of-week and hour-of-day information. This allows modeling of
network behavior that depends on human temporal patterns.
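The timestamp expansion in step 2 can be sketched as follows. This is a minimal illustration, assuming a raw Unix second count and a UTC clock; the class and field layout are our own, not the project's.

```java
// Sketch of timestamp preprocessing: expand a raw second count into
// day-of-week and hour-of-day fields so the classifier can learn
// human temporal patterns. Names and layout are illustrative.
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class TimestampPreprocessor {
    /** Returns {day of week (Monday=1 .. Sunday=7), hour of day (0..23)}. */
    static int[] expand(long epochSeconds) {
        ZonedDateTime t = Instant.ofEpochSecond(epochSeconds)
                                 .atZone(ZoneOffset.UTC);
        return new int[]{ t.getDayOfWeek().getValue(), t.getHour() };
    }
}
```

For example, epoch second 0 (1 January 1970, a Thursday) expands to day 4, hour 0.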

3.1.3 Classifier system:


The classifier system is a parallel, rule-based, message-passing system. All
rules are of condition–action form: the condition is the receipt of messages
and the action is the sending of messages when the rule is satisfied. Every
message carries a tag specifying its source and an extra information field.

Fig 2: A classifier system consists of four parts:

1) An input interface
In this case an input interface message contains information taken from a
4-tuple describing an individual packet.

2) The classifiers
The classifiers are the rules which describe how the system operates and
creates messages.
3) The message list
The message list is a list of all messages yet to be considered by the
classifier rules. The messages may come from satisfied rules or from the
input interface.
4) The output interface
An output interface message signifies whether recent network behavior is
believed to be normal or abnormal.
Consider a simple example of how the classifier system works. Suppose that
transmission of packets of size 100 is being considered as an indicator of
normal network behavior, and we are interested in the number of packets of
size 100 over a one-second period, evaluating 5, 50 and 150 as possible
thresholds of abnormality.
Then there are four classifier rules:
1) Rule 1 examines all messages from the input interface. It uses the size
and time values in those messages to maintain a count of packets of size 100
over a sliding time window of one second. After processing an input message
it places on the message list a message of its own with the updated count
for the last second.
2) Rule 2 observes all messages placed on the message list by Rule 1. If the
current count of packets of size 100 over the last second exceeds 5, then
Rule 2 in turn puts a message on the message list notifying that its
threshold has been crossed.
3) Rules 3 and 4 read all messages coming from Rule 1, and if the current
count exceeds their particular thresholds of 50 and 150 they too put
messages on the message list.
The output interface attends to all messages from Rules 2, 3 and 4. When any
of those rules has fired and put a message on the list indicating that its
threshold has been exceeded, the output interface informs the environment
that the rule is predicting the occurrence of abnormal behavior.
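The rules above can be sketched as a single class: a sliding one-second count of size-100 packets (Rule 1) checked against the three thresholds (Rules 2–4). The data structures and method names are our own illustration, not the message-passing machinery itself.

```java
// Sketch of the threshold rules above: Rule 1 keeps a one-second sliding
// count of size-100 packets; Rules 2-4 flag thresholds 5, 50 and 150.
// Names and structures are illustrative.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class PacketRules {
    // Arrival times of size-100 packets currently inside the window.
    private final Deque<Double> times = new ArrayDeque<>();
    private static final int[] THRESHOLDS = {5, 50, 150};

    /** Feed one packet; returns the thresholds crossed in the last second. */
    List<Integer> onPacket(double time, int size) {
        if (size == 100) times.addLast(time);
        // Slide the window: drop packets older than one second.
        while (!times.isEmpty() && times.peekFirst() <= time - 1.0)
            times.removeFirst();
        List<Integer> crossed = new ArrayList<>();
        for (int t : THRESHOLDS)
            if (times.size() > t) crossed.add(t);
        return crossed;
    }
}
```

Feeding seven size-100 packets within one second makes the count exceed the lowest threshold, so the method reports that threshold 5 has been crossed.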

3.1.4 Types of IDS:


Intrusion detection systems can be broken down into three major categories:
1. Host-based systems: an IDS is host-based when it examines data that comes
directly from individual systems, or computers (hosts). Examples of data
sources include event logs for the operating system and applications (Web
servers, database products, etc.).
2. Network-based systems: when an IDS observes data as it moves across the
network, such as TCP/IP traffic, it is called network-based.
3. Hybrid systems: a hybrid system is simply an IDS that has features of
both network-based and host-based systems.

3.2 System Overview


Since the introduction of the Internet, intrusion attempts on network
systems have increased to a great extent. With the increase in security
measures have come cleverer attacks by much more sophisticated attackers.
Because of this, Network Intrusion Detection Systems (NIDS) have become
increasingly necessary in today’s scenario: if you have an Internet
connection, then a firewall as well as a network intrusion detection system
is essential. There are already a number of “ready to run” software options
available which try to provide some measure of network security. An
intrusion, in computer networking terms, is defined as someone (a hacker or
cracker) trying to bypass security protocols and infiltrate a network
system. The impulse behind this could be something as small as misusing
e-mail for spam, stealing confidential data, or any number of things for
which a system administrator could be held responsible. Evidence has shown
that these attacks are becoming more intelligent, subversive and harmful. It
has become clear that anyone accountable for a network with an Internet
presence is now a potential target, and intrusion detection systems are
quickly becoming an essential necessity.

3.2.1 Beginning
A USAF paper published in October 1972 by James P. Anderson explained that
the USAF had “become ever more aware of computer security problems. This
problem was felt virtually in every aspect of USAF operations and
administration.” During that period, the USAF had to perform the daunting
task of providing shared use of their computer systems, which held various
levels of classification in a need-to-know environment, with a user base
holding various levels of security clearance.
Thirty years ago this created a serious problem that still exists today: how
to safely protect separate classification domains on the same network
without any compromise in security? The first task was to find and define
the threats that existed. Before designing an IDS, it was necessary to
understand the types of threats and attacks that could be mounted against
computer systems and how to recognize them in audit data. In effect, this
called for a risk evaluation plan to understand the threat (what the risks
and vulnerabilities are, what the attacks and means of penetration might
be), followed by the creation of a security policy to protect the systems in
place. Between 1984 and 1986, Dorothy Denning and Peter Neumann examined and
designed the first model of a real-time IDS. This prototype was named the
Intrusion Detection Expert System (IDES). IDES was originally a rule-based
expert system trained to detect known malicious activity. This same system
has been developed and improved into what is known today as the
Next-Generation Intrusion Detection Expert System (NIDES). The report
published by James P. Anderson and the work on IDES were the start of much
of the research on IDS throughout the 1980s and 1990s. An intrusion
detection system (IDS) is a system designed to systematically detect attacks
on hosts in a network. These systems provide a secondary, passive level of
security by supplying the administrator with critical information about
intrusion attempts. Datagrams are simply the packet bundles of information
that computer systems use to communicate with each other over the network.
Typically an IDS is not intended to block or actively counter attacks,
though some newer systems have an active capacity for dealing with threats.
Indeed, a very knowledgeable human being should be watching and making value
judgments on the ‘alerts’ that the IDS presents. While firewalls can be
thought of as a border or security perimeter, an IDS should detect whether
that border has been breached. Under no circumstances does an IDS guarantee
security, but with proper policies, authentication and access control, some
measure of security can be attained.

3.2.2 Types of IDS


3.2.2.1 Host-Based
Host-based approaches detect intrusions using audit data collected from the
target host machine.
As the information given by the audit data can be extremely comprehensive
and detailed, host-based approaches can achieve high detection rates and low
false alarm rates.
However, there are disadvantages of host-based approaches, which include the
following:
1) Host-based approaches cannot easily prevent attacks: when an intrusion is
detected, the attack has already partially occurred.
2) Audit data may be altered by attackers, undermining the reliability of
the audit data.
The data from a single host is used to notice signs of intrusion as packets
enter or exit that host. Host-based systems are becoming more and more
popular due to their effectiveness at handling insider misuse. This is
mostly due to the IDS assembling data (log files) from each critical machine
within the network, while network-based systems can only analyze the data
that passes a particular network node.
Host-based systems stand out at stopping the following:
• Data access/modification: the makeup of mission-critical data is different
for every organization, but includes things like the Web site, customer or
member databases, proposal information, and personnel records. By keeping an
eye on access to this data and taking note of changes, host-based IDSs are
good at recognizing when something changed that should not have.
• Abuse of privilege: this is probably one of the most serious problems in
most organizations, and an area where host-based IDSs excel. By keeping
track of changes to permissions, a host-based system can inform security
personnel when the doors are swinging too wide. In addition, most host-based
systems allow security administrators to get a rapid view of the privileges
that exist across their organization, and can ensure that people such as
past employees are removed from all systems.

3.2.2.2 Network-Based
Network-based approaches detect intrusions using the IP packet information
collected by network hardware such as switches and routers. Such information
is not as plentiful as the audit data of the target host machine.
Nevertheless, there are advantages of network-based approaches, which
include the following:
1) Network-based approaches can detect so-called “distributed” intrusions
over the whole network and thus lighten the burden on each individual host
machine for detecting intrusions.
2) Network-based approaches can defend a machine against attack, as
detection occurs before the data arrives at the machine.

The information from the network is compared against a database, and traffic
that looks suspicious is flagged. Audit data from one or more hosts may be
used as well to detect signs of intrusions. Network-based systems focus on
observing network packets by sniffing them, which means that they record
traffic as it goes by. Some IDSs of this type can be installed in more than
one location, which is usually referred to as a distributed IDS.
Network-based IDSs tend to be less expensive than their host-based cousins,
as they typically only need to be installed near the entry/exit point of the
network.
Network-based systems do extremely well at outsider attacks, and focus on
catching people before they are authenticated. Areas where they are good
include stopping the following:
• DoS and packet manipulation: a denial-of-service (DoS) attack is when
someone sends an overload of network packets to a single resource, causing
it to either crash or become so slow as to be unresponsive. A more advanced
version is the distributed denial-of-service attack, in which multiple
computers all attack the resource simultaneously. Many network attacks
involve sending network packets of incorrect size or configuration, which
often causes the targeted resource to crash. Network-based IDSs, because
they can process huge amounts of network traffic and sit in an optimal
location, are excellent for blocking such attacks. However, note that they
can also be a prime target for these attacks.
• Unauthorized use: this is the most common attack type that people think of
when they hear about IT security. Network-based IDSs are ideal for tracking
unauthorized access, meaning intruders that are attempting to log in to a
machine without the proper credentials, compromise a machine to create a
jump-off point, or grab passwords or data.

Chapter 4
Requirement Analysis
4.1 Data Set

With the enormous growth of computer network usage and the huge increase in
the number of applications running on top of it, network security is
becoming increasingly important. As shown in [1], all computer systems
suffer from security vulnerabilities
which are both technically difficult and economically costly to be solved by the
manufacturers. Therefore, the role of Intrusion Detection Systems (IDSs), as special-purpose
devices to detect anomalies and attacks in the network, is becoming more important. The
research in the intrusion detection field has been mostly focused on anomaly-based and
misuse-based detection techniques for a long time. While misuse-based detection is generally
favoured in commercial products due to its predictability and high accuracy, in academic
research anomaly detection is typically conceived as a more powerful method due to its
theoretical potential for addressing novel attacks.
Conducting a thorough analysis of the recent research trend in anomaly detection, one
will encounter several machine learning methods reported to have a very high detection rate
of 98% while keeping the false alarm rate at 1% [2]. However, when we look at the state of
the art IDS solutions and commercial tools, there are few products using
anomaly detection approaches, and practitioners still consider it an
immature technology. To find the reason for this contrast, we studied the
details of the research done in anomaly detection and
considered various aspects such as learning and detection approaches, training data sets,
testing data sets, and evaluation methods. Our study shows that there are some inherent
problems in the KDDCUP’99 data set [3], which is widely used as one of the few publicly
available data sets for network-based anomaly detection systems.
4.1.1 KDD CUP 99 data set description

Since 1999, KDD’99 [3] has been the most widely used data set for the evaluation of
anomaly detection methods. This data set is prepared by Stolfo et al. [5] and is built based on
the data captured in DARPA’98 IDS evaluation program [6]. DARPA’98 is about 4
gigabytes of compressed raw (binary) tcp dump data of 7 weeks of network traffic, which can
be processed into about 5 million connection records, each with about 100 bytes. The two
weeks of test data have around 2 million connection records. KDD training dataset consists of
approximately 4,900,000 single connection vectors, each of which contains 41
features and is labelled as either normal or an attack, with exactly one
specific attack type. The simulated
attacks fall in one of the following four categories:
1) Denial of Service Attack (DoS): is an attack in which the attacker makes some
computing or memory resource too busy or too full to handle legitimate requests, or
denies legitimate users access to a machine.
2) User to Root Attack (U2R): is a class of exploit in which the attacker starts out
with access to a normal user account on the system (perhaps gained by sniffing
passwords, a dictionary attack, or social engineering) and is able to exploit some
vulnerability to gain root access to the system.
3) Remote to Local Attack (R2L): occurs when an attacker who has the ability to send
packets to a machine over a network but who does not have an account on that
machine exploits some vulnerability to gain local access as a user of that machine.
4) Probing Attack: is an attempt to gather information about a network of computers
for the apparent purpose of circumventing its security controls.
It is important to note that the test data is not from the same probability distribution as the
training data, and it includes specific attack types not in the training data, which makes the
task more realistic. Some intrusion experts believe that most novel attacks are variants of
known attacks and the signature of known attacks can be sufficient to catch novel variants.
The datasets contain a total number of 24 training attack types, with an additional 14 types in
the test data only. The name and detail description of the training attack types are
listed in [7].
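The four attack categories above can be sketched as a simple lookup from attack label to category. The subset of KDD'99 labels shown here is only illustrative; the full label list is given in [7].

```java
// Sketch: mapping a few well-known KDD'99 attack labels to the four
// categories described above. This subset is illustrative, not complete.
import java.util.Map;

public class AttackCategories {
    static final Map<String, String> CATEGORY = Map.ofEntries(
        Map.entry("smurf", "dos"), Map.entry("neptune", "dos"),
        Map.entry("back", "dos"),
        Map.entry("buffer_overflow", "u2r"), Map.entry("rootkit", "u2r"),
        Map.entry("guess_passwd", "r2l"), Map.entry("warezclient", "r2l"),
        Map.entry("portsweep", "probe"), Map.entry("ipsweep", "probe"),
        Map.entry("normal", "normal"));

    /** Returns the category for a label, or "unknown" for unseen labels
     *  (the test data contains attack types absent from training data). */
    static String categorize(String label) {
        return CATEGORY.getOrDefault(label, "unknown");
    }
}
```

Returning "unknown" for unseen labels mirrors the point above that the test data contains attack types that do not appear in the training data.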
KDD’99 features can be classified into three groups:
1) Basic features: this category encapsulates all the attributes that can be extracted
from a TCP/IP connection. Most of these features lead to an implicit delay
in detection.
2) Traffic features: this category includes features that are computed with respect to a window interval, and it is divided into two groups:
a) “same host” features: examine only the connections in the past 2 seconds that have the same destination host as the current connection, and calculate statistics related to protocol behaviour, service, etc.
b) “same service” features: examine only the connections in the past 2 seconds that have the same service as the current connection.
The two aforementioned types of “traffic” features are called time-based. However, several slow probing attacks scan hosts (or ports) using a much larger time interval than 2 seconds, for example one probe every minute. As a result, these attacks do not produce intrusion patterns within a time window of 2 seconds. To solve this problem, the “same host” and “same service” features are re-calculated, but over a window of the last 100 connections rather than a time window of 2 seconds. These features are called connection-based traffic features.
3) Content features: unlike most of the DoS and Probing attacks, the R2L and U2R attacks do not produce frequent sequential intrusion patterns. This is because DoS and Probing attacks involve many connections to some host(s) in a very short period of time, whereas R2L and U2R attacks are embedded in the data portions of the packets and normally involve only a single connection. To detect these kinds of attacks, we need features that can look for suspicious behaviour in the data portion, e.g., the number of failed login attempts. These features are called content features.
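The connection-based variant of the “same host” traffic feature described above can be sketched as a sliding window over the last 100 connections. The class below is an illustrative sketch, not the dataset's actual feature-extraction code; all names are our own:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative connection-based "same host" traffic feature: count how
// many of the last 100 connections share the current destination host.
public class TrafficFeatures {
    private static final int WINDOW = 100;          // connection window, per KDD'99
    private final Deque<String> recentHosts = new ArrayDeque<>();

    // Record a new connection and return how many of the last WINDOW
    // connections (including this one) share its destination host.
    public int sameHostCount(String destHost) {
        recentHosts.addLast(destHost);
        if (recentHosts.size() > WINDOW) {
            recentHosts.removeFirst();              // slide the window forward
        }
        int count = 0;
        for (String h : recentHosts) {
            if (h.equals(destHost)) {
                count++;
            }
        }
        return count;
    }
}
```

Because the window is defined over a fixed number of connections rather than a fixed time span, a probe arriving once per minute still accumulates a visible count, which is exactly what the time-based 2-second window misses.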

4.2 Arbitral Strategy by Neural Network

An artificial neural network is a powerful tool for solving complex classification problems.

We do not need to impose many assumptions on the problem; we only need to prepare a set of inputs and targets to train it, and let the neural network learn a model. The most popular neural network is the error back-propagation (BP) neural network. A conventional BP network is a three-layer feed-forward network. We choose to build a conventional BP network as our final arbiter because of its simplicity and popularity. The inputs of the BP network are the prediction confidence ratios from each binary classifier. The output with the maximal value is interpreted as the final class.
The number of nodes in the input layer and the output layer equals the number of binary classifiers in our MC-SLIPPER. However, it is difficult to choose the best number of nodes for the hidden layer, because it depends on many factors, such as the number of nodes in the input and output layers, the number of training examples, the type of hidden-node activation function, and so on. We therefore choose the number of hidden nodes according to some rules of thumb. This completes the description of our arbitral strategy.
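The arbiter's forward pass can be sketched as below. This is only an illustrative sketch: the weights are placeholders supplied by the caller (in the real system they would be learned by back-propagation), and the class name is our own:

```java
// Sketch of the three-layer BP arbiter's forward pass: the inputs are
// the confidence ratios from each binary classifier, and the index of
// the largest output value is taken as the final class.
public class BpArbiter {
    private final double[][] wHidden;   // hidden-layer weights [hidden][inputs]
    private final double[][] wOutput;   // output-layer weights [outputs][hidden]

    public BpArbiter(double[][] wHidden, double[][] wOutput) {
        this.wHidden = wHidden;
        this.wOutput = wOutput;
    }

    private static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // One fully connected layer with sigmoid activation.
    private static double[] layer(double[][] w, double[] in) {
        double[] out = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < in.length; j++) {
                sum += w[i][j] * in[j];
            }
            out[i] = sigmoid(sum);
        }
        return out;
    }

    // Returns the index of the winning class (maximal output).
    public int predict(double[] confidenceRatios) {
        double[] out = layer(wOutput, layer(wHidden, confidenceRatios));
        int best = 0;
        for (int i = 1; i < out.length; i++) {
            if (out[i] > out[best]) best = i;
        }
        return best;
    }
}
```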
4.3 Framework for Multi-Class SLIPPER

The current version of SLIPPER is a binary classifier. However, the intrusion detection problem is a five-class classification problem. To handle the multi-class problem, we build a framework (Fig. 3) using the binary SLIPPER as a basic module. The basic idea is to translate the multi-class problem into multiple binary classification problems, with a final arbiter adopting a certain strategy to make the final decision. Below we give the details of this framework.

Fig 3: Multi-class SLIPPER

4.3.1 Training Multiple Binary Classifiers


For a multi-class problem, the training dataset contains examples with multiple class labels. However, the binary SLIPPER classifier accepts only training examples with two class labels. To build a binary classifier for each class, we pre-process the training data to generate proper training data for that class. An optimized pre-processing procedure that reduces disk reads is shown in Fig. 4. For each training example, if the label is not the target class name, we change it to an unused class name such as “other”; otherwise we keep the label unchanged. While pre-processing the training dataset, we also obtain the frequency of the target class, which is used to ensure that the positive class is the target class of each binary classifier.

Fig 4: Optimized preprocess algorithm


Once we have a binary classifier model for each class, we can predict an unseen data example using all of the models. Each classifier outputs its predicted class together with a confidence. Obviously, these results may conflict. To resolve conflicting outputs, we propose different arbitral strategies.
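The one-vs-rest relabeling step described above can be sketched as follows. This is an illustrative sketch: the record format (the class label as the last comma-separated field, in KDD'99 style) and all names are our assumptions, not the project's actual code:

```java
// Illustrative one-vs-rest relabeling: keep the target class label
// and rewrite every other label to the unused class name "other".
public class Relabel {
    public static String[] forTarget(String[] records, String target) {
        String[] out = new String[records.length];
        for (int i = 0; i < records.length; i++) {
            int comma = records[i].lastIndexOf(',');       // label is the last field
            String label = records[i].substring(comma + 1);
            out[i] = label.equals(target)
                    ? records[i]                                    // keep target label
                    : records[i].substring(0, comma + 1) + "other"; // relabel as "other"
        }
        return out;
    }
}
```

Running this once per class over the training set yields one binary training set per class, which is what each binary SLIPPER module consumes.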
4.4 HARDWARE AND SOFTWARE REQUIREMENTS

SOFTWARE REQUIRED
• Java 1.3 or later.
• Java Swing.

HARDWARE REQUIRED
• Hard disk: 40 GB.
• RAM: 128 MB.
• Processor: Pentium.

4.5 PROJECT PLAN

Task                                       Effort       Deliverables
Analysis of existing systems &             4 weeks
comparison with the proposed one
Literature survey                          1 week
Designing & planning                       1+2 weeks
  o System flow                            1 week
  o Designing modules &                    2 weeks      Module design
    their deliverables                                  document
Implementation                             9 weeks      Primary system
Testing                                    3 weeks      Test reports
Documentation                              1 week       Complete
                                                        project report

Table 1: Project plan


Chapter 5.
System Design
[Activity diagram: the training dataset is preprocessed and labelled; SLIPPER learns a rule set from it; prediction engines for Normal, DoS, R2L, U2R and Probe each output a prediction with confidence (PC); the sign of ∑PC selects a positive or negative prediction; predictions whose PC falls in the false-positive or false-negative range are marked as false predictions and fed to the tuning step.]

Fig 5: Activity Diagram

[Use case diagram: the system operator supplies the standard KDD Cup '99 dataset and initial rules; the preprocessed data yields the training dataset and rule set used by the prediction engine; tuning extends the prediction engine and produces user-modified tuned rules; detected attacks are reported to the user.]

Fig 6: Use case Diagram

[Component diagram: Atids.java is the main component, built from Filedescriptor.java, DOSAttack.java, R2LAttack.java, U2RAttack.java, ProbeAttack.java, PredictConfidence.java and Tunning.java.]

Fig 7: Component Diagram
[Class diagram: Filedescriptor (KDDinput, readfile, featuresextraction) feeds Atids (AnalyseDataset, CreateRules, Calculateconfidence, calculateweights); Atids has one attack class per category — DOSAttack, R2LAttack, U2RAttack and ProbeAttack, each with CalculateZt, Calculategrowrule and Calculateprunerule; Predictorconfidence (getconfidence, calculateconfidence, falsepositivepredict, falsenegativepredict) passes predictions to Tunning (falseprediction).]

Fig 8: Class Diagram
[Deployment diagram: machines 1..n connect through the Internet to the system running ATIDS.]

Fig 9: Deployment Diagram
[Sequence diagram: the system operator supplies the standard KDD dataset and initial rules to the MainUI; input packets from the network are preprocessed into the training dataset; the labelled dataset is passed to machine learning & SLIPPER, which returns the trained rule set and weights; the dataset and rule set go to the prediction engine, which reports false predictions and confidences back for tuning.]

Fig 10: Sequence Diagram
Chapter 6.
Conclusion and Future Scope

Because computer networks are continuously changing, it is difficult to collect high-quality training data to build intrusion detection models. In this project, rather than focusing on building a highly effective initial detection model, we propose to improve a detection model dynamically after it is deployed and exposed to new data. In our approach, the detection performance is fed back into the detection model, and the model is adaptively tuned. To simplify the tuning procedure, we represent the detection model in the form of rule sets, which are easily understood and controlled; tuning amounts to adjusting the confidence value associated with each rule. This approach is simple yet effective. Our experimental results show that the TMC of ATIDS with full and instant tuning drops about 35% from the cost of the MC-SLIPPER system with a fixed detection model. If only 10% of false predictions are used to tune the model, the system still achieves about 30% performance improvement. When tuning is delayed by only a short time, the system achieves 20% improvement even when only 1.3% of false predictions are used to tune the model. ATIDS imposes a relatively small burden on the system operator: operators need only mark false alarms after they identify them. These results are encouraging. We plan to extend this system by tuning each rule independently. Another direction is to adopt more flexible rule adjustments beyond the constant factors relied on in these experiments. We have further noticed that if system behaviour changes drastically, or if tuning is delayed too long, the benefit of model tuning may be diminished or even negative. In the former case, new rules could be trained and added to the detection model. If it takes too much time to identify a false prediction, tuning on that particular false prediction is easily prevented, as long as the prediction result is not fed back to the model tuner.
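The constant-factor tuning step described above can be sketched as a single confidence adjustment per marked false prediction. This is a minimal sketch under our own assumptions: the factor value and the method names are ours, not taken from the ATIDS code:

```java
// Sketch of constant-factor rule tuning: when the operator marks a
// prediction as false, the confidence of each rule that fired is
// scaled down (false positive) or up (false negative).
public class RuleTuner {
    static final double ALPHA = 0.1;   // constant tuning factor (assumed value)

    // Returns the adjusted confidence for one fired rule.
    public static double tune(double confidence, boolean falsePositive) {
        return falsePositive
                ? confidence * (1.0 - ALPHA)   // rule fired wrongly: weaken it
                : confidence * (1.0 + ALPHA);  // rule missed an attack: strengthen it
    }
}
```

The future work mentioned above, i.e. moving beyond constant factors, would amount to replacing ALPHA with an adjustment that depends on the rule or on the magnitude of the error.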

References

[1] W. Cohen and Y. Singer, “A simple, fast, and effective rule learner,” in Proc. Annu. Conf. Amer. Assoc. Artif. Intell., 1999, pp. 335–342.
[2] W. Lee and S. Stolfo, “A framework for constructing features and models for intrusion detection systems,” ACM Trans. Inf. Syst. Secur., vol. 3, no. 4, pp. 227–261, Nov. 2000.
[3] L. Ertoz, E. Eilertson, A. Lazarevic, P. Tan, J. Srivastava, V. Kumar, and P. Dokas, The MINDS—Minnesota Intrusion Detection System: Next Generation Data Mining. Cambridge, MA: MIT Press, 2004.
[4] K. Julisch, “Data mining for intrusion detection: A critical review,” IBM, Kluwer, Boston, MA, Res. Rep. RZ 3398, No. 93450, Feb. 2002.
[5] I. Dubrawsky and R. Saville, SAFE: IDS Deployment, Tuning, and Logging in Depth, Cisco SAFE White Paper. [Online]. Available: http://www.cisco.com/go/safe
[6] W. Lee, S. Stolfo, and P. Chan, “Real time data mining-based intrusion detection,” in Proc. DISCEX II, Jun. 2001, pp. 89–100.
[7] E. Eskin, M. Miller, Z. Zhong, G. Yi, W. Lee, and S. Stolfo, “Adaptive model generation for intrusion detection systems,” in Proc. 7th ACM Conf. Comput. Security Workshop on Intrusion Detection and Prevention, Nov. 2000. [Online]. Available: http://www1.cs.columbia.edu/ids/publications/adaptive-ccsids00.pdf
[8] A. Honig, A. Howard, E. Eskin, and S. Stolfo, “Adaptive model generation: An architecture for the deployment of data mining-based intrusion detection systems,” in Data Mining for Security Applications. Norwell, MA: Kluwer, 2002.
[9] M. Hossain and S. Bridges, “A framework for an adaptive intrusion detection system with data mining,” in Proc. 13th Annu. CITSS, Jun. 2001. [Online]. Available: http://www.cs.msstate.edu/~bridges/papers/citss-2001.pdf
[10] X. Li and N. Ye, “Decision tree classifiers for computer intrusion detection,” J. Parallel Distrib. Comput. Pract., vol. 4, no. 2, pp. 179–180, 2003.
[11] J. Ryan, M. Lin, and R. Miikkulainen, “Intrusion detection with neural networks,” in Proc. Advances in NIPS 10, Denver, CO, 1997, pp. 943–949.
[12] S. Kumar and E. Spafford, “A pattern matching model for misuse intrusion detection,” in Proc. 17th Nat. Comput. Security Conf., 1994, pp. 11–21.
[13] Z. Yu and J. Tsai, “A multi-class SLIPPER system for intrusion detection,” in Proc. 28th IEEE Annu. Int. COMPSAC, Sep. 2004, pp. 212–217.
[14] W. Cohen and Y. Singer, “A simple, fast, and effective rule learner,” in Proc. Annu. Conf. Amer. Assoc. Artif. Intell., 1999, pp. 335–342.
[15] R. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Mach. Learn., vol. 37, no. 3, pp. 297–336, Dec. 1999.
[16] L. Fausett, Fundamentals of Neural Networks: Architectures, Algorithms, and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1994.