You are on page 1of 7

Seminar Report

Classification, Clustering and Application in Intrusion Detection


System
Kaushal Mittal 04329024
M.Tech I Year
Under the Guidance of Prof. Sunita Sarawagi
KReSIT, Indian Institute of Technology Bombay

November 15, 2004

Abstract tion techniques are examples of supervised learn-


ing and clustering techniques are examples of un-
Classification and clustering techniques in data supervised learning.
mining are useful for a wide variety of real time
Intrusion detection systems are softwares used
applications dealing with large amount of data.
for identifying the intentional or unintentional
Some of the applications of data mining are text
use of the system resources by unauthorized
classification, selective marketing, medical di-
users. They can be categorized into misuse de-
agnosis, intrusion detection systems. Intrusion
tection systems and anomaly detection systems.
detection system are software system for iden-
Misuse detection systems model attacks as a
tifying the deviations from the normal behav-
specific pattern and are more useful in detect-
ior and usage of the system. They detect at-
ing known attack patterns. Anomaly detection
tacks using the data mining techniques - clas-
systems are adaptive systems that distinguish
sification and clustering algorithms. In this re-
the behavior of the normal users from the other
port, I discuss approaches based on classification
users. The misuse detection systems can detect
techniques like naive bayesian classifiers, neural
specific types of attacks but are not generalized.
networks and WINNOW based algorithm. Ap-
They cannot detect new attacks until trained for
proaches based on clustering techniques like hier-
them. On the other hand, anomaly detection
archical and density based clustering have been
systems are adaptive in nature, they can deal
discussed to emphasize the use of clustering tech-
with new attacks, but they cannot identify the
niques in intrusion detection.
specific type of attacks. If the intrusion occurs
during learning, then the anomaly detection sys-
tem may learn the intruders behavior and hence
1 Introduction may fail. Being more generalized and having
a wider scope as compared to misuse detection
Classification techniques analyze and categorize
systems, most of the current research focus on
the data into known classes. Each data sample
anomaly detection systems.
is labeled with a known class label. Clustering is
a process of grouping objects resulting into set of Data mining approaches can be applied for
clusters such that similar objects are members of both anomaly and misuse detection. The data
the same cluster and dissimilar objects belongs sample are a set of system properties, represent-
to different clusters. In classification, the classes ing the behavior of the system/user. Classifi-
and number of classes is predefined. Training ex- cation techniques are used to learn a model us-
amples are used to create a model, where each ing the training set of data samples. The model
training sample is assigned a predefined label. is used to classify the data samples as anoma-
This is not the case with clustering. Classifica- lous behavior instance or the normal behavior

1
instance. Clustering techniques can be used to belief networks, neural networks etc. are used in
form clusters of data samples corresponding to data mining based applications. In this section,
the normal use of the system. Any data sam- I discuss the naive bayesian classfiers and neural
ple with characteristics different from the formed networks.
clusters is considered to be an instance of anoma-
lous behavior. Clustering based techniques can 2.1 Naive Bayesian Classifiers
detect new attacks as compared to the classifica-
tion based techniques. Naive bayesian classifiers use the bayes theorem
A number of classification and clustering al- to classify the new instances of data. Each in-
gorithms can be used for anomaly detection. [?] stance is a set of attribute values described by
proposes the use of bayesian classifiers to learn a vector, X = (x1 , x2 . . . , xn ). Considering m
a model that distinguishes the behavior of in- classes, the sample X is assigned to the class Ci
truder from the normal users behavior. [?] pro- if and only if
poses hierarchical clustering based algorithm for P (X|Ci )P (Ci ) > P (X|Cj )P (Cj )
anomaly detection on network. [?] proposes the
WINNOW based algorithm for anomaly detec- for all j in (1, m) such that j 6= i.
tion. [?] proposes the use of neural networks and The sample belongs to the class with maximum
[?] proposes the use of density based clustering posterior probability for the sample. For cate-
for anomaly detection. gorical data P (Xk |Ci ) is calculated as the ratio
Rest of the report is organized as follows: Sec-of frequency of value xk for attribute Ak and the
tion 2 discusses the bayesian classifiers and neu- total number of samples in the training set. For
ral network based classification. Section 3 dis- continuous valued attributes a guassian distribu-
cusses the hierarchical and density based cluster- tion can be assumed without any loss of gener-
ing. Section 4 discusses the anomaly detection ality.
approach based on WINNOW based algorithm In naive bayesian approach the attributes are
and the use of the classification and clustering assumed to be conditionally independent. In-
algorithms discussed in section 2 and section 3, spite of this assumption, naive bayesian classi-
for anomaly detection. Section 5 gives the con- fiers give satisfactory results because focus is on
clusion. identifying the classes for the instances, not the
exact probabilities. Application like spam mail
classification, text classification can use naive
2 Classification Techniques bayesian classifiers. Theoretically, bayesian clas-
sifiers are least prone to errors. The limitation is
In Classification, training examples are used to
the requirement of the prior probabilities. The
learn a model that can classify the data samples
amount of probability information required is ex-
into known classes. The Classification process
ponential in terms of number of attribute, num-
involves following steps:
ber of classes and the maximum cardinality of
1. Create training data set. attributes. With increase in number of classes
or attributes, the space and computational com-
2. Identify class attribute and classes. plexity of bayesian classifiers increases exponen-
tially.
3. Identify useful attributes for classification
(relevance analysis).
2.2 Neural Networks
4. Learn a model using training examples in
An artificial neural network consists of connected
training set.
set of processing units. The connections have
5. Use the model to classify the unknown data weights that determines how one unit will af-
samples. fect other. Subset of such units act as input
nodes, output nodes and remaining nodes con-
A variety of classification techniques viz. decision stitute the hidden layer. By assigning activation
tree induction, bayesian classification, bayesian to each of the input node, and allowing them to

2
propagate through the hidden layer nodes to the rameter for restricting the level of clustering.
output nodes, neural network performs a func- Clustering stops when the required number of
tional mapping from input values to output val- clusters have been formed or depth of the clus-
ues. The mapping is stored in terms of weights tering tree has reached to a specified value. Hier-
over connection. archical clustering algorithms can be categorized
Backpropagation network are simple feed for- into:
ward neural networks. Input is submitted to
the network and the activation at each level are • Agglomerative algorithms - based on bot-
cascaded forward, ending up with activations at tom up approach.
output nodes. During training backpropagation
• Divisive algorithms - based on top down ap-
algorithm [?] is used to tune the values of weights
proach.
over connection. The error at the output layer
is calculated and backpropagated. This feed-
back is used at intermediate level to readjust 3.1.1 Agglomerative Algorithms
the weights. Performance of training phase de- These algorithms initially assign each sample
pends on the learning rate used for adjusting to be a separate cluster. The clusters with the
the weights. Too small value of learning rate least distance are merged to get larger clusters
makes learning very slow. Conversely, too large till the termination condition is satisfied or a
value may result in oscillation of weights between single cluster is left.
wrong values and network may take long time to
learn. The training stops when weights tend to
converge or the network is able to classify the 1. BIRCH - Balanced Iterative Reducing
samples correctly. After training the backpropa- and Clustering using Hierarchies.
gation network can be used as a model for clas- In BIRCH, summary of statistics of the clus-
sification of new instances. ter, called as cluster feature, is calculated for
They are adaptive in nature, tolerant to noisy each sub-cluster consisting on n samples. A
data and can classify instances for which they height balanced CF(clustering feature) tree
are not trained. However, training may take a is dynamically constructed with samples as
long time and is an irreversible process. Also the leaf nodes CF of the children as non leaf
the knowledge representation in neural networks nodes. A sample is kept in the closest leaf
is not directly interpretable by humans. node. When the size of leaf node becomes
larger than the threshold, the node splits
and the CF is recalculated for the individ-
3 Clustering Techniques ual nodes and updated in the tree. The com-
plexity of the algorithm is O(n) where n is
Clustering involves unsupervised learning - num- the number of objects to be clustered. It is
ber of classes and the classes are not known in known to generate the best clusters with the
prior. In this section I discuss the hierarchical available resources but do not work well if
and density based approaches for clustering. the clusters are not spherical in shape (be-
cause it use the notion of radius to consider
3.1 Hierarchical Clustering Algo- the boundary of the cluster).
rithms
2. CURE - Clustering using Representa-
These algorithms group the data into a tree of tives.
clusters forming a hierarchical structure. The This algorithm works well for non-spherical
clusters are merged or split based on some dis- shaped clusters also. In CURE cluster is
tance measurement that accounts for the simi- represented by a set of representative points
larity or difference between the samples respec- generated by randomly selecting the scat-
tively. The distance can be euclidean distance, tered points in the cluster and shrinking
mean distance, maximum distance, average, cen- them by fractions to reach the cluster cen-
troid etc. The number of clusters acts as a pa- ter. The two clusters with the closest pair

3
systems can be used to collect this data. For win-
of representative points are merged. More
dows operating system application like perfmon
than one representative points allows non-
and netstat are used whereas for Linux systems
spherical shaped clusters. However, the
top, tcpdump, strace, etc are used.
aggregate interconnectivity is ignored and
hence the categorical attributes cannot be
Rest of the section discusses the approaches
handled. for anomaly detection based on the classification
and clustering techniques described in section 2
3.2 Density based Clustering Algo- and section 3.
rithms
The set of samples forming a dense region are
4.1 Naive Bayesian Approach
treated as clusters. Since the clusters are based [?] proposes the use of naive bayesian approach
on density, not on distance, they need not nec- for anomaly detection in systems with windows
essarily be spherical. operating system. [?] measure around 1500 fea-
tures every second. Features are specific system
3.2.1 DBSCAN - Density based Spa- properties like average CPU utilization, average
tial Clustering of Application with of last 10 values of data transfer rate, memory
Noise. utilisation, number of processes etc. A data sam-
ple comprises of values for each of these 1500 fea-
This algorithm identify regions with sufficiently
tures. The problem is defined as to classify each
high density as clusters. All the samples within
data sample into anomalous or normal category.
the radius  form -neighbourhood of the sample.
[?] assume that the features are conditionally
-neighbourhood for each sample points is called
independent given the category. The training set
a core group i.e. initial cluster. All the objects
is used to calculate the prior probabilities. The
that are density reachable, connected or directly
prior probabilities are used to calculate the prob-
density reachable are merged to form larger clus-
ability of obtaining the current measurement,
ters. This continues till no more merging of clus-
given the possible categories. The current sam-
ters is possible. If spatial indexes are used, the
ples is assigned a label for which the probability
complexity of DBSCAN algorithm is O(nlogn),
calculated is maximum. [?] conducted experi-
otherwise it is O(n2 ). The algorithm is useful for
ments over test data and found the detection rate
applications with spatial databases and noise.
of this approach to be 57.8%. The low detection
rate is on account of the assumption that the
4 Intrusion Detection Systems features are conditionally independent.
This section briefly discusses the data mining
4.2 Neural Network
approaches proposed for intrusion detection sys-
tem. The characteristics of a good intrusion de- [?] proposes the use of neural networks for
tection system are: anomaly detection. The approach consists of
maintaining a database of a sequence of system
1. High detection rate.
calls made by each program to the operating sys-
2. Less false alarms. tem, used as the signature for the normal behav-
ior. If online sequence of system calls for a pro-
3. Less CPU cycles. gram differ from the sequence in the database
anomalous behavior is registered. If significant
4. Quick detection of intrusion.
percentage of sequences do not match then alarm
The user profile, system behavior comprising of for intrusion is raised.
the statistics related to the network, CPU, mem- Backpropagation network is trained with
ory, processes, softwares and applications used training set of sequences of system calls. A leaky
by the users constitute the test data for the in- bucket algorithm is used to capture the tempo-
trusion detection system. A large number of sys- ral locality of the anomalous sequences. When
tem tools and utilities plugged in with operating closely related anomalous sequences are faced,

4
counter gets a large value and when a normal se- user and detect deviation from that behaviour
quence is obtained the counter gradually drops as anomalous. For systems with multiple valid
down to zero. This leads to intrusion detection users, the requirement is to consider the be-
only when a lot of similar anomalous sequences haviour of each of the N valid users as normal
are obtained, thereby representing the behavior and the remaining users as anomalous. For such
of intruder. requirements this approach is useful. It creates
density based clusters corresponding to the be-
4.3 Hierarchical Clustering havior of the N users. Any sample not resem-
bling the behaviour of any of the N users will
[?] proposes the use of graph clustering for in- lie outside the cluster and will be considered as
trusion detection over networks. The approach outlier.
consists of using agglomerative clustering to
form clusters of nodes communicating exten-
4.5 WINNOW based Algorithm
sively with each other. The nodes or systems
on the network can be represented by a graph. [?] proposes a WINNOW based algorithm for
In the graph, nodes represent the systems and anomaly detection. Most of the approaches dis-
edges with weight represent the amount of data cussed above did not achieved high detection
exchange between the system corresponding the rates. Experimental results proves that this ap-
nodes linked by the edge. This graph is decom- proach can achieve detection rate of about 95%
posed into the number of clusters such that nodes with less than one false alarm per day. The data
within each cluster exchange data with other collected is same as described in subsection ??.
nodes in the cluster extensively. Perfmon is used to measure around 200 differ-
Feature vector consisting of values for features ent properties and 1500 different features corre-
like degree of nodes, average outgoing traffic etc. sponding to these properties. Each data sample
is calculated. A neural network is used to learn is a vector containing the value of these 1500 fea-
the mapping from these feature values to the nor- tures.
mal behavior or anomalous behavior. If the in- The algorithm consists of three phases.
truder use the system, the traffic over the net-
work changes, resulting in the change of the fea-
4.5.1 Training Phase
ture values, leading to the detection by the neu-
ral network as anomalous behavior. The data samples are collected corresponding to
the normal user and an equal number of samples
4.4 Density based Local Outliers corresponding to the intruder (Any user other
than the normal user). The values for most
[?] proposes a density based clustering approach of the features are continuous. They are dis-
for anomaly detection. Data sample correspond- cretized into ten bins. The values of a feature
ing to the anomalous behavior is considered as are classified into ten bins by fitting the stan-
an outlier. This approach assigns a LOF (Local dard distribution functions like uniform, guas-
outlier factor) to each data sample. Greater the sian, exponential and erlang. The distribution
LOF, greater is the probability of sample being function for which the root mean square error is
an outlier. The k-distance is computed for the minimum represents the probability distribution
k th nearest neighbor for each sample. DBSCAN for the feature. The probability distribution is
algorithm is used to find k-neighbourhood for the obtained by normalizing the frequency in each
data sample and form clusters of samples corre- bin with the total count of samples. [?] propose
sponding to the normal behavior. Large value of to use the WINNOW based algorithm to assign
LOF for a data sample indicates it is distant from weights to each feature to model the normal be-
the clusters of samples corresponding to normal haviour.
behaviour. Hence the sample is an outlier. The WINNOW based algorithm is as follows
This approach does not requires tuning and
is adaptive in nature. Most of the techniques 1. Initialize the weights for each feature, wf ,
discussed above learn the behavior of a specific as 1.

5
2. For each training sample user. Even during this phase WINNOW algo-
rithm has to be used to adjust to the chang-
(a) Initialize votes for and votes against to ing behavior of the user, otherwise false alarm
be 0. rate will increase and intrusion detection rate
(b) For all features, if relative probability will gradually drop.
of the feature is less than the constant
r, add the weight wf to votes for oth-
4.5.4 Experimental Evaluation and
erwise add it to votes against.
Analysis
(c) If votes f or > votes against then the
measurement is anomalous. [?] have conducted experiments to analyze the
performance of the WINNOW based algorithm
(d) If sample corresponds to normal user
proposed. The analysis show that if tuning pa-
and is considered anomalous, the
rameters are carefully selected, the intrusion de-
weights of all features that voted for
tection rate reaches 95% with less than one false
raising alarm are reduced to half of
alarm per day. The tuning parameter W is to
their current values. Conversely, if
be selected carefully. Too small the value of W
anomalous sample is treated as normal,
larger the number of false alarms, due to over-
the weights of all features that voted
lapping of the samples with the samples that
against raising the alarm are reduced
have already raised an alarm. As W increases,
to half of their current values.
the false alarm rate decreases, but intruder gets
more time to use the system before being de-
4.5.2 Tuning Phase tected. The system can adapt itself to learn the
Tuning data consists of data samples from the changing behavior of the normal user. If the tun-
normal user and the intruders(other users). This ing parameters are wrongly selected, the system
phase involves calculation of three system pa- may learn the intruder’s behavior also.
rameters W - the window size, T hreshmini and
T hreshf ull . For different combinations of these
parameters, following steps are executed: 5 Conclusion
1. The feature values of test sample collected Intrusion detection systems are one of the key
each second, vote for mini alarms. If the ra- areas of application of data mining techniques.
tio of votes for and votes against is greater Naive bayesian classifiers though performs well
than T hreshmini , then a mini alarm is for most of the applications inspite of the as-
raised. sumption of conditional independency, does not
provide good results for intrusion detection sys-
2. If number of mini alarms in last W sec- tems. Clustering techniques can be used for in-
onds is greater than T hreshf ull then raise trusion detection, as they can detect unknown
an alarm signaling intrusion. After each attacks also. They are useful for misuse detec-
such alarm wait for W seconds to avoid the tion as well as anomaly detection systems.
samples from overlapping. WINNOW based algorithm provides higher
The goal is to select the value of these parame- detection rates and lower false alarm rates as
ters to maximize the intrusion detection rate and compared to the other approaches discussed.
minimize the false alarms. The system involves less CPU cost. The only
costly phase is the tuning phase. Oftenly, intru-
sion or misuse of system can be best described
4.5.3 Operation Phase
by excessive usage of resources and events that
In this phase the learned statistical model, along do not occur frequently. For eg. too many print
with the values of the parameters W, T hreshmini jobs, downloads etc. The system assumes the
and T hreshf ull are used to detect anomalous be- samples collected from all other users, except the
haviour. The system can retrain and retune to normal user as samples of anomalous behavior.
adjust to the changing behavior of the normal In practice, this may not be a good representa-

6
tive set for anomalous behavior. It is necessary
to test the system with data consisting of real
data samples corresponding to intrusive behav-
ior.

References
[1] Jude Shavlik and Mark Shavlik, Selec-
tion, Combination, and Evaluation of
Effective Software Sensors for Detect-
ing Abnormal computer Usage, KDD
2004, Seattle, Washington, USA., 2004.

[2] A. Lazarevic, L. Ertoz, A. Ozgur, J.


Srivastava and V. Kumar, A Com-
parative Study of Anomaly Detection
Schemes in Network Intrusion Detec-
tion, Proc. SIAM Conf. Data Mining,
2003.

[3] A. Ghosh, A. Schwartzbard and


M. Schatz, Learning Program Behav-
ior Profiles for Intrusion Detection,
USENIX Workshop on Intrusion De-
tection and Network Monitoring, April
1999.

[4] J. Tolle and O. Niggemann, Supporting


Intrusion Detection by Graph Cluster-
ing and Graph Drawing, RAID 2000.

[5] Tom M. Mitchell, Machine Learn-


ing, McGraw-Hill, International Edi-
tion 1997.

[6] Jiawei Han and Micheline Kamber,


Data Mining Concepts and Techniques,
Morgan Kaufmann, 2001.