
A comparison of clustering methods for unsupervised anomaly detection in network traffic

Koffi Bruno Yao (koffi@diku.dk)

February 28, 2006


Contents

0.1 Preface

1 Introduction
1.1 Motivation
1.2 Goal of the thesis
1.3 Related works
1.4 Thesis organization

2 Background
2.1 Introduction to computer network security
2.1.1 Network security
2.1.2 Network intrusion detection systems
2.1.3 Network anomaly detection
2.1.4 Computer attacks
2.2 Introduction to clustering
2.2.1 Notation and definitions
2.2.2 The clustering problem
2.2.3 The clustering process
2.2.4 Feature selection
2.2.5 Choice of clustering algorithm
2.2.6 Cluster validity
2.2.7 Clustering tendency
2.2.8 Clustering of network traffic data
2.3 Summary

3 Clustering methods and algorithms
3.1 Hierarchical clustering methods
3.2 Partitioning clustering methods
3.2.1 Squared-error clustering
3.2.2 Model-based clustering
3.2.3 Density-based clustering
3.2.4 Grid-based clustering
3.2.5 Online clustering
3.2.6 Fuzzy clustering
3.3 Discussion of the classical clustering methods
3.4 Combining clustering methods
3.4.1 Two-level clustering with kmeans
3.4.2 Initialisation of clustering algorithms with the results of leader clustering
3.5 Summary

4 Experiments
4.1 Design of the experiments
4.2 Data set
4.2.1 Choice of data set
4.2.2 Description of the feature set
4.3 Implementation issues
4.4 Summary

5 Evaluation of clustering methods
5.1 Evaluation methodology
5.2 Evaluation measures
5.2.1 Evaluation measure requirements
5.2.2 Choice of evaluation measures
5.3 k-fold cross validation
5.4 Discussion and analysis of the experiment results
5.4.1 Results of the experiments
5.4.2 Analysis of the experiment results

6 Conclusion
6.1 Resume
6.2 Achievements
6.3 Limitations
6.4 Future work

A Definitions
A.1 Acronyms
A.2 Definitions

B Feature set
B.1 The feature set of the KDD Cup 99 data set

C Computer attacks
C.1 Probe attacks
C.2 Denial of service attacks
C.3 User to root attacks
C.4 Remote to local attacks
C.5 Other attack scenarios

D Theorems
D.1 Algorithm: Hill climbing
D.2 Theorem: Jensen's inequality
D.3 Theorem: The Lagrange method

E Results of the experiments


List of Figures

3.1 A dendrogram corresponding to the distance matrix in table 3.1
3.2 Variation of the sum of squared-errors in kmeans
3.3 Variation of the log-likelihood with the iterations of the classification maximum likelihood
3.4 A 3x3 Kohonen network map
3.5 Querying recursively a multi-resolution grid with STING
3.6 Variation of the (fuzzy) sum of squared-errors in the fuzzy kmeans algorithm
3.7 Variation of classification accuracy with the number of basic clusters
3.8 Variation of the sum of squared-errors (SSE) with the number of clusters in kmeans
5.1 The classification accuracy of the clustering algorithms in tables E.1 and E.2. L+kmeans refers to leader + kmeans and fuzzy K refers to fuzzy kmeans. The number of clusters is 23.
5.2 The number of different cluster categories found by the algorithms when the number of clusters is 23. The total number of labels contained in the data set is 23.
5.3 The cluster entropies when the number of clusters is 23. The cluster entropy measures the homogeneity of the clusters. The lower the cluster entropy, the more homogeneous the clusters.
5.4 The classification accuracy of the clustering algorithms in tables E.3 and E.4. The number of clusters is 49.
5.5 The number of different cluster categories found by the algorithms when the number of clusters is 49. The total number of labels contained in the data set is 23.
5.6 The cluster entropies when the number of clusters is 49.


List of Tables

3.1 Example of distance matrix used for hierarchical clustering
4.1 Distribution of labels in the data set
B.1 Basic features of the KDD Cup 99 data set
B.2 Content-based features
B.3 Traffic-based features
E.1 Random initialisation
E.2 Experimental results of various classical algorithms and combinations of those algorithms run on a slightly modified KDD Cup 1999 data set. The number of clusters is set to the number of attack and normal labels in the data set, which is 23. The results in table E.1 are obtained with random initialisation of the algorithms; those of table E.2 correspond to initialisation of the algorithms with leader clustering.
E.3 Random initialisation
E.4 Experiment results when the number of clusters is 49
Abstract

Network anomaly detection aims at detecting malicious activities in com-


puter network traffic data. In this approach, the normal profile of the network
traffic is modelled and any significant deviation from this normal profile is
interpreted as malicious. While supervised anomaly detection models the
normal traffic behaviour on the basis of an attack free data set, unsuper-
vised anomaly detection works on a data set which contains both normal
and attack data. Clustering has recently been investigated as one way of
approaching the issues of unsupervised anomaly detection.

This thesis investigates the cluster-based approach to off-line anomaly detection. Our goal is to study the purity of the clusters created by different
clustering methods. Ideally each cluster should contain a single type of data:
either normal or a specific attack type. The result of such a clustering can
assist a security expert in understanding different attack types and in la-
belling the data set. One of the main challenges in clustering network traffic
data for anomaly detection is related to the skewed distribution of the attack
categories. Generally, a very large proportion of the network traffic data is
normal and only a small percentage constitutes anomalies.
Six classical clustering algorithms (kmeans, SOM, EM-based clustering, classification EM clustering, fuzzy kmeans and leader clustering) and different combination scenarios of these algorithms are discussed, implemented and experimentally compared. The experiments are performed on the KDD Cup 99 data set, which is widely used for evaluating intrusion detection systems. The evaluation of the clustering is done on the basis of the purity of the clusters produced by the clustering algorithms. Two of the indices used for quantifying the purity of clusters are classification accuracy and cluster homogeneity. The classification accuracy is measured by the proportion of items successfully classified, and cluster homogeneity is measured by the cluster entropy. We have also investigated a technique that combines different clustering methods. This technique has given promising results.
Keywords:
Off-line network anomaly detection, unsupervised anomaly detection, clus-
tering methods, external assessment of clustering methods.

0.1 Preface
This thesis has been written by Koffi Bruno Yao at the Department of Computer Science of the University of Copenhagen (DIKU). The thesis was written in the period 19/04/2005 to 01/03/2006 and was supervised by Peter Johansen, professor at DIKU. I would like to thank my supervisor for his support. The primary audience of this thesis is researchers in anomaly detection; however, any reader with an interest in clustering will find the thesis useful. The reader is expected to have a basic understanding of computer networks and some basic mathematical knowledge.
Chapter 1

Introduction

1.1 Motivation
It is important for companies to keep their computer systems secure because their economic activities rely on them. Despite the existence of attack prevention mechanisms such as firewalls, most company computer networks are still victims of attacks. According to the statistics of CERT [44], the number of reported incidents against computer networks increased from 252 in 1990 to 21756 in 2000, and to 137529 in 2003. This happens because firewalls are misconfigured or because malicious activities are cleverly designed to circumvent firewall policies. It is therefore crucial to have another line of defence in order to detect and stop malicious activities. This line of defence is intrusion detection systems (IDS).

During the last decades, different approaches to intrusion detection have been explored. The two most common approaches are misuse detection and
anomaly detection. In misuse detection, attacks are detected by matching
the current traffic pattern with the signature of known attacks. Anomaly
detection keeps a profile of normal system behaviour and interprets any sig-
nificant deviation from this normal profile as malicious activity. One of the
strengths of anomaly detection is the ability to detect new attacks. Anomaly
detection’s most serious weakness is that it generates too many false alarms.
Anomaly detection falls into two categories: supervised anomaly detection
and unsupervised anomaly detection. In supervised anomaly detection, the
instances of the data set used for training the system are labelled either as normal or as a specific attack type. The problem with this approach is that la-
belling the data is time consuming. Unsupervised anomaly detection, on the
other hand, operates on unlabeled data. The advantage of using unlabeled
data is that the unlabeled data is easy and inexpensive to obtain. The main
challenge in performing unsupervised anomaly detection is distinguishing the
normal data patterns from attack data patterns.

Recently, clustering has been investigated as one approach to solving this problem. As attack data patterns are assumed to differ from normal data
patterns, clustering can be used to distinguish attack data patterns from
normal data patterns. Clustering network traffic data is difficult because:

1. the size of the data is large,

2. the dimension of the data is high,

3. the distribution of attack and normal classes is skewed,

4. the data is a mixture of categorical and continuous data,

5. the data needs to be pre-processed.

1.2 Goal of the thesis


Although different clustering algorithms have been studied for this pur-
pose, to our knowledge not much has been done in the direction of comparing
and combining different clustering approaches. We believe that such a study
could help in designing the most appropriate clustering approaches for the
purpose of unsupervised anomaly detection.
The main goals of this thesis are:

1. to provide a comprehensive study of the clustering problem and the different methods used to solve it,

2. to implement and compare experimentally some classical clustering algorithms,

3. and to combine different clustering approaches.



1.3 Related works


Clustering has been studied in many scientific disciplines. A wide variety
of algorithms are found in clustering literature. [2] gives a good review of the
main classical clustering concepts and algorithms. [3, 20] provide an excellent
mathematical approach to the clustering problem. [23] is also an excellent
source; it covers all the main steps in clustering. It discusses clustering
from a statistical perspective. [10, 18] present recently developed clustering
algorithms for clustering large data sets.

There are in the literature many examples of experimental comparisons of clustering algorithms. Some examples of recent works are found in [35, 12]. In [35], dynamical clustering, kmeans, SOM, hierarchical agglomerative clustering and CLICK have been compared on gene expression data. [12] compares kmeans, SOM and ART-C on text documents. These comparisons differ in the selection of clustering algorithms, the data set, the evaluation criteria used for assessing the algorithms, and the evaluation methodology. Some of these experiments compare clusterings on the basis of internal criteria such as the number of clusters, compactness and separability of clusters, while other works compare clustering algorithms on the basis of external indices. An external index measures how well a partition created by a given clustering algorithm matches an a priori partitioning of the data set. Our choice of clustering algorithms, data set, evaluation criterion and evaluation methodology distinguishes our work from these works.

[9] provides a good review of data mining approaches for intrusion detection. Much work has been done in the area of unsupervised anomaly detection [7, 4, 6]. In [4], Eskin uses clustering to group normal data; intrusions are considered to be outliers. Eskin follows a probability-based approach to outlier detection. In this approach, the data space has an unknown probability distribution; anomalies are located in sparse regions of this space while normal data are found in dense regions.

1.4 Thesis organization


This thesis is composed of two main parts:

• A theoretical part, in which the clustering problem and the different clustering methods are studied. This part consists of chapters 2 and 3. Chapter 2 is an introduction to anomaly detection and clustering. Chapter 3 discusses the different clustering methods. In conclusion, different combinations of these methods are proposed.

• An experimental part, which consists of chapters 4 and 5. In chapter 4, the data set and the design of the experiments are discussed. Chapter 5 discusses the evaluation of the clustering methods. Chapter 6 concludes the thesis.

Chapter 2

Background

This chapter provides background in network security and clustering relevant for understanding the thesis.

• Section 2.1 gives an introduction to network security. The definitions of network terminologies, frequently used in this thesis, are found in appendix A.

• Section 2.2 gives an introduction to clustering.

• Section 2.3 summarizes this chapter.

2.1 Introduction to computer network security

Computer networks interconnect multiple computers and make it easy and fast to share resources between these computers. The most popular example of such a network is the global Internet. In this thesis, the term computer networks mainly refers to private computer networks, geographically limited and connected to the outside world. Such networks face a number of security threats. This section gives a brief discussion of some of the main issues pertaining to network security.

2.1.1 Network security


Computer network security aims at preventing, detecting and stopping any
activity that has the potential of compromising the confidentiality and integrity of communication on the network as well as the availability of the network's resources and services. Another goal of security is to recover from
such malicious activities when they take place. Attack prevention is generally
implemented by security mechanisms such as authentication, cryptography
and firewalls. Although attack prevention mechanisms are crucial, they are
not enough for assuring the security of the network. Firewalls, for example,
can prevent malicious activities from penetrating into the internal network,
but they are not able to prevent malicious activities that are initiated from
inside the network. Firewalls can also be subject to attacks and prevented
from working by for example denial of service (DOS) attacks. Attacks can
also pass through firewalls successfully because the firewalls have been mis-
configured. Because of these weaknesses in prevention mechanisms, computer
networks will always be vulnerable to malicious activities.
Attack detection and recovery mechanisms complement attack prevention
mechanisms. This function of detection and recovery is mainly implemented
by intrusion detection systems. A distinction is made between host-based
intrusion detection systems (HIDS) and network based intrusion detection
systems (NIDS). HIDS detect intrusions directed against a single host. NIDS
detect intrusions directed against the entire network. In this thesis, we will
focus on network intrusion detection. In the next section, we will present
systems for network intrusion detection. We will discuss their architecture,
and the different steps followed when dealing with intrusion.

2.1.2 Network intrusion detection systems


In this thesis, we cluster data for network intrusion detection. This section
discusses how the input data and the clustering result fit into the architecture
of network intrusion detection systems. Network intrusion detection systems
are designed to detect the presence of malicious activities on the network.
The architecture of network intrusion detection systems generally consists
of three parts: agent, detector and notifier. Agents gather network traffic data; detectors analyse the information gathered by agents to determine the presence of attacks. The notifier decides whether a notification about the presence of an intrusion should be sent. The same software can perform all these tasks in a simple network. In more complex networks, these
functions are distributed over the network for reasons of security, efficiency,
scalability and robustness.
In the context of this thesis, only agents and detectors are relevant.

Agents generally gather network traffic data by sniffing the network. Sniffing the network requires the agent to have access to all the network traffic.
In an Ethernet-based network one computer can play the role of an agent.
Agents generally process the gathered data into a format that is easy for the
detector to use. The detector can use different techniques for the detection
of intrusions. The two main techniques are misuse detection and anomaly
detection. Misuse detection detects attacks by matching the current network
traffic against a database of known attack signatures. Anomaly detection,
on the other hand, finds attacks by identifying traffic patterns that deviate
significantly from the normal traffic.
The data set used in this thesis is an example of data obtained from
network intrusion detection agents. The output of the clustering serves to
define or enrich models used by the detector.
In the next section, we will look at network anomaly detection, which is
the type of detection technique we are interested in in this thesis.

2.1.3 Network anomaly detection


As we explained earlier, detectors need models or rules for detecting intru-
sions. These models can be built off-line on the basis of earlier network traffic
data gathered by agents. Once the model has been built, the task of detect-
ing and stopping intrusions can be performed online. One of the weaknesses
of this approach is that it is not adaptive. This is because small changes in
traffic affect the model globally. Some approaches to anomaly detection per-
form the model construction and anomaly detection simultaneously on-line.
In some of these approaches clustering has been used. One of the advan-
tages of online modelling is that it is less time consuming because it does
not require a separate training phase. Furthermore, the model reflects the
current nature of network traffic. The problem with this approach is that it
can lead to inaccurate models. This happens because this approach fails to
detect attacks performed systematically over a long period of time. These
types of attacks can only be detected by analysing network traffic gathered
over a long period of time.
The clusters obtained by clustering network traffic data off-line can be
used for either anomaly detection or misuse detection. For anomaly detec-
tion, it is the clusters formed by the normal data that are relevant for model
construction. For misuse detection, it is the different attack clusters that are
used for model construction.

This section has described mechanisms for detecting attacks against a computer network. The next section is a discussion of computer attacks.

2.1.4 Computer attacks


A computer attack is any activity that aims at compromising the confiden-
tiality, the integrity or the availability of a computer system. Compromising
the confidentiality consists in gaining unauthorized access to resources and
services on the computer system. Compromising the integrity consists in
unauthorized modification of information on the computer system. Finally,
compromising the availability of the computer system makes the computer
system unavailable to legal users. These attacks can be performed at the
physical level, by damaging computer hardware, or they can be performed
at a software level. In this thesis, the term computer attacks refers to attacks performed at the software level.
The computer attacks considered in this thesis fall into four main categories: probe attacks, denial of service (DOS) attacks, user to root (U2R) attacks and remote to local (R2L) attacks. Probe attacks are attacks that
probe computers or computer networks in order to detect the services that
are available on the computer system. This information can then be used to
attack the computer system in a specific way. Denial of service attacks are
attacks that aim to make the computer systems unavailable for legal users.
This is done, for instance, by keeping computers busy dealing with tasks
submitted by the attacker. User to root attacks aim at gaining unauthorized
access to system resources. The attacker tries to obtain root privileges in
order to perform malicious activities. Examples of such attacks are buffer
overflow attacks. In buffer overflow attacks, the attacker gets root-privileges
by overwriting memory locations containing security sensitive information.
In remote to local attacks, the attacker exploits misconfigurations or weak-
nesses on a server host to gain remote access to the computer system with
the same level of privileges as an authorized user. For example, exploiting a misconfiguration on an FTP-server could make it possible for the attacker to remotely add files to the FTP-server.
Attackers perform computer attacks for intellectual, economical or politi-
cal reasons or just for fun. Computer attacks performed for economic reasons
are a growing problem. According to [45], IT criminality was more profitable than drug trading in 2005. Two examples of economic IT criminality that are on the rise are blackmailing organizations and phishing. In phishing, the attacker sends emails to the victims. In these emails, he presents himself as being from an organization the victim knows and trusts, for example the victim's
bank. The goal of the attacker is to collect the victim’s bank account in-
formation and misuse it. Blackmailing an organization consists in launching
attacks against the organization if that organization refuses to satisfy the attacker's requests. Attacks against computer systems are possible because of:

• social engineering: Legal users of the computer systems can deliberately lend their password to unauthorized users. Most legal users
have difficulty in following strict security policies. This can result in
passwords being made available to attackers.

• misuse of features: The denial of service attack named smurf is an example of the misuse of features. This attack is based on misusing the
ping tool. The normal purpose of the ping tool is to make it possible for
one host to test if it has a connection to another host. Smurf abuses
this facility; an attacker makes a false ping-request to a large number
of hosts simultaneously on behalf of the victim host. As a consequence,
all the receivers of the ping-request will send a response back to the
victim host. This large volume of traffic will eventually put the victim
host out of normal function.

• misconfiguration of computer systems: Correct configuration of computer systems is not easy. Generally, there is a large number of parameter values to select from. An example of a computer attack that takes advantage of a misconfigured computer system is the ftp-write attack.
This attack exploits a misconfiguration concerning write privileges of
an anonymous account on a FTP-server. This misconfiguration can
lead to a situation where any FTP-user can add an arbitrary file to the
FTP-server.

• flaws in software implementation: As software gets more and more complex, the chance that flaws exist in software also increases. Ac-
cording to the statistics of CERT [44], the number of reported vulnera-
bilities in widely used software has increased from 171 in 1995 to 1090
in 2000 and to 5990 in 2005. The buffer overflow attack is an example of
an attack that exploits flaws in software implementation. This attack
works by overflowing the input buffer in order to overwrite memory
locations that contain security relevant information. This is possible
because some software fails to check the size of the inputs entered by
users.

• usurpation or masquerade: The attacker steals the identity of a legal user. The attacker can also steal a TCP-connection successfully established by a legal user and then act as if he were that legal user.

It is practically impossible to protect a computer network totally from all these vulnerability factors. Therefore computer networks will always be
vulnerable to some forms of attack. A short description of the computer
attacks considered in this thesis is found in appendix C. [13] provides a
complete description of these attacks.

2.2 Introduction to clustering


Clustering, also known as cluster analysis, is used in scientific disciplines such
as psychology, biology, machine learning, data mining and statistics.
The term clustering was coined in the thirties in psychology. Numerical taxonomy in biology and pattern recognition in machine learning then played an important role in the development of the concept of clustering in the sixties.

2.2.1 Notation and definitions


• Notation: |A|: Given a set A, the notation |A| refers to the size of A.

• Definition: Partition of a set: Let S be a set and {Si, i ∈ {1, ..., N}} a family of N non-empty subsets of S.
The family of subsets {Si, i ∈ {1, ..., N}} is a partition of the set S if and only if:
∀(i, j) ∈ {1, ..., N} × {1, ..., N} with i ≠ j, Si ∩ Sj = ∅, and S1 ∪ S2 ∪ ... ∪ SN = S.
(A small sketch checking this definition is given after this list.)

• Note: In this thesis, the terms data points, data patterns, data items and data instances all refer to the instances of a data set.
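
The following short Python sketch (illustrative only, not part of the thesis implementation) checks whether a family of subsets is a partition of a set S in the sense of the definition above:

```python
def is_partition(S, subsets):
    """Check that the subsets are non-empty, pairwise disjoint and cover S."""
    if any(len(Si) == 0 for Si in subsets):
        return False                      # a partition contains no empty subset
    union = set()
    for Si in subsets:
        if union & Si:                    # a shared element violates disjointness
            return False
        union |= Si
    return union == S

S = {1, 2, 3, 4, 5}
print(is_partition(S, [{1, 2}, {3}, {4, 5}]))     # True
print(is_partition(S, [{1, 2}, {2, 3}, {4, 5}]))  # False: element 2 appears twice
```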

2.2.2 The clustering problem


Clustering is the process of grouping data into clusters, so that pairs of data
in the same clusters have a higher similarity than pairs of data from different
clusters. It provides a tool for exploring the structure of the data. Formally,
the clustering problem can be expressed as follows:

The clustering problem


Given a data set D = {v1, ..., vn} of tuples and a similarity measure S : D × D → R, the clustering problem is defined as the mapping of each vector vi ∈ D to some class L. The mapping is performed under the constraint that: ∀vs, vt ∈ L and vq ∉ L, S(vs, vt) > S(vs, vq).

Another problem related to the clustering problem is the classification problem. The difference between these two problems is that in classification
the class labels are known a priori and the goal of the classification is to
assign instances to the class they belong to. In clustering, on the other hand,
no a priori class structure is known. The goal of the clustering is then to
define the class structure, that is how many categories the data set contains,
and to assign instances to a category in a meaningful way.
Clustering can be performed in various ways, depending on how the sim-
ilarity between pairs of data items is defined. In the next chapter, different
methods for performing clustering will be discussed.

2.2.3 The clustering process


The main steps in clustering are: feature selection, the choice of clustering
algorithm, and the validation of the clustering results.

2.2.4 Feature selection


Feature selection aims at selecting an optimal subset of relevant features for
representing the data. The definition of an optimal subset of features depends
on the specific application at hand. An optimal subset may be defined as
a subset that provides the best classification accuracy. The classification
accuracy measures the proportion of items that are correctly classified, in a
classification task. In the context of anomaly detection, we are interested
in a feature set, which efficiently discriminates normal data patterns from
attack data patterns.

2.2.5 Choice of clustering algorithm


Finding the optimal set of clusters, the one that maximizes the intra-cluster similarity and minimizes the inter-cluster similarity, is an NP-hard problem because all the possible partitions of the data set need to be examined. Generally, we want a clustering algorithm that can provide an acceptable solution, not necessarily the optimal one.
A clustering algorithm is mainly characterized by the type of similarity mea-
sure it uses and by how it proceeds in finding clusters. Many clustering algo-
rithms approach the clustering problem as an optimisation problem. These
algorithms find clusters by optimising a specified function, called objective
function. For this class of algorithms, the objective function is also a main
characteristic of the algorithm. Similarity measures and objective functions
will be discussed below.

Similarity measures
The definition of the similarity between data items depends on the type
of the data. Two main types of data exist: continuous data and categorical
data. Examples of similarity measures for each of these types of data will
be presented in the following.

Distance measures in continuous data: For continuous data, distance measures are used for quantifying the degree of similarity or dissimilarity of two data instances. The lower the distance between two instances, the more similar the instances are; and the higher the distance, the more dissimilar they are. A distance measure is a non-negative function δ : D × D → R+ with the following properties:

δ(x, y) = 0 ⇐⇒ x = y, ∀x, y ∈ D (2.1)


δ(x, y) = δ(y, x), ∀x, y ∈ D (2.2)
δ(x, y) ≤ δ(x, z) + δ(z, y), ∀x, y, z ∈ D (2.3)
Here are some examples of distance measures:

• Minkowski distance: d(x, y) = (Σ_{i=1}^{n} |xi − yi|^p)^{1/p}, p > 0. If p = 1, it is the Hamming distance; if p = 2, it is the euclidean distance.

• Tchebyschev distance: d(x, y) = max_{i=1..n} |xi − yi|
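
As a small illustration (not taken from the thesis implementation), these two distances can be computed in Python with NumPy as follows; the vectors x and y are arbitrary continuous feature vectors:

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance; p=1 gives the distance called Hamming above,
    p=2 gives the euclidean distance."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def tchebyschev(x, y):
    """Tchebyschev distance: the largest coordinate-wise difference."""
    return np.max(np.abs(x - y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])
print(minkowski(x, y, p=1))   # 3.5
print(minkowski(x, y, p=2))   # about 2.29
print(tchebyschev(x, y))      # 2.0
```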

Similarity measures in categorical data: Given a data set D, an index of similarity is a function S : D × D → [0, 1] satisfying the following properties:

S(x, x) = 1, ∀x ∈ D (2.4)
S(x, y) = S(y, x), ∀x, y ∈ D (2.5)
Similarity indices can, in principle, be used on arbitrary data types. However, they are generally used for measuring similarity in categorical data. They are seldom applied to continuous data because distance measures are more suitable for continuous data than similarity indices are. (Binary data, which is essentially categorical data with two categories, is sometimes considered as a separate category; in this thesis, no distinction is made between categorical data and binary data.) Different similarity indices for binary or categorical data are found in the literature. Here are three examples of similarity indices. In the following expressions, a is the number of positive matches, d is the number of negative matches and b and c are the number of mismatches between two instances A and B.

• The matching coefficient: (a + d) / (a + b + c + d)

• The Russel and Rao measure of similarity: a / (a + b + c + d)

• The Jaccard index: a / (a + b + c)
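
For two binary instances, the counts a, b, c, d and the three indices above can be computed as in the following sketch (an illustrative example; the handling of the degenerate a + b + c = 0 case is a choice made here, not prescribed by the text):

```python
def binary_similarities(A, B):
    """Return the matching coefficient, the Russel and Rao measure and the
    Jaccard index for two binary vectors A and B of equal length."""
    a = sum(1 for x, y in zip(A, B) if x == 1 and y == 1)   # positive matches
    d = sum(1 for x, y in zip(A, B) if x == 0 and y == 0)   # negative matches
    b = sum(1 for x, y in zip(A, B) if x == 1 and y == 0)   # mismatches
    c = sum(1 for x, y in zip(A, B) if x == 0 and y == 1)   # mismatches
    n = a + b + c + d
    matching = (a + d) / n
    russel_rao = a / n
    jaccard = a / (a + b + c) if (a + b + c) > 0 else 0.0
    return matching, russel_rao, jaccard

print(binary_similarities([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (0.6, 0.4, 0.5)
```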
The choice of similarity measure depends on the type of data at hand:
categorical or continuous data. Depending on the intent of the investigator,
continuous data can be converted to binary data, by fixing some thresholds.
Alternatively categorical data can be converted to continuous data. As the
feature set is selected to provide an optimal description of the data, con-
verting from one data type to another may result in a loss of information
about the data. This will affect the quality of the analysis being conducted
on the data set. The method of analysis to be conducted on the data also
influences the choice of similarity measure. For example, euclidean distance is appropriate for methods that are easily explained geometrically.

Objective functions
Objective functions are used by clustering methods that approach the clus-
tering problem as an optimization problem. An objective function defines
the criterion to be optimised by a clustering algorithm in order to obtain an
optimal clustering of the data set. Different objective functions are found in
cluster literature. Each of them is based on implicit or explicit assumptions
about the data set. A good choice of the objective function helps reveal a
meaningful structure in the data set. The most widely used objective func-
tion is the sum of squared-errors. Given a data set D = {x1 , x2 , ..., xn } and
a partition P = {C1, C2, ..., CK}, the sum of squared-errors of P is:

SSE(P) = Σ_{k=1}^{K} Σ_{x∈Ck} ||x − µk||²,

where µk is the mean of cluster Ck: µk = (1/|Ck|) Σ_{x∈Ck} x, and |Ck| is the size of cluster Ck. The popularity of the sum of squared-errors objective function is partly related to its simplicity.
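
As an illustration of this objective function (a sketch with NumPy, not the thesis code), the sum of squared-errors of a partition represented as a list of clusters can be computed as follows:

```python
import numpy as np

def sum_of_squared_errors(partition):
    """partition: list of 2-D arrays, each array holding the members of one cluster."""
    sse = 0.0
    for cluster in partition:
        mu = cluster.mean(axis=0)                          # cluster mean (centre)
        sse += np.sum(((cluster - mu) ** 2).sum(axis=1))   # squared distances to the centre
    return sse

C1 = np.array([[1.0, 1.0], [1.0, 2.0]])
C2 = np.array([[5.0, 5.0], [6.0, 5.0], [5.5, 6.0]])
print(sum_of_squared_errors([C1, C2]))   # about 1.67
```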

2.2.6 Cluster validity


The assessment of the quality of clustering results is important. It helps in
identifying meaningful partitioning of the data set. This assessment is im-
portant because the data set can be partitioned in different ways. Generally,
the same clustering algorithm, executed with different initial values, will pro-
duce different partitions of the data set. Some of these partitions are more
meaningful than others. What is considered as a meaningful partitioning is
application specific. It depends on the kind of information or structure the
investigator is looking for.
Cluster validity can be performed at different levels: hierarchical, individ-
ual, and partition levels. The validity of the hierarchical structure of clusters
is only relevant for hierarchical clustering. Hierarchical clustering creates a
hierarchy of clusters. The study of the validity of the hierarchical structure
aims at judging the quality of that hierarchical structure. The validity of individual clusters measures the compactness and the isolation of the cluster. A good cluster is expected to be compact and well separated from the rest of the data. The validity of the partition structure evaluates the quality of the partition produced by a clustering algorithm. For example, it may be used to determine whether the correct number of clusters has been found or whether the clusters found by the algorithm match an a priori partitioning of the data.
In this thesis, only the validity of the partition’s structure is considered
because we evaluate the clustering algorithms against an a priori partition of
the data set. So in the rest of this thesis, when we refer to cluster validity,
we mean validity of partition’s structure. The assessment of the partition’s
structure can be performed at different levels: external, internal and relative
levels.
External validity: In external validity, the partition produced by a
clustering algorithm is compared with an a priori partition of the data set.
Some of the most common examples of external indices found in the cluster literature [44] are the Jaccard and Rand indices. These indices quantify the degree of agreement between a partition produced by a clustering algorithm and an a priori partition of the data set.

Internal validity: Internal validity only makes use of the data involved in the clustering to assess the quality of the clustering result. Examples of such data include the proximity matrix. The proximity matrix is an N × N matrix whose entry (i, j) represents the similarity between data patterns i and j.
Relative validity:
The purpose of relative clustering validity is to evaluate the partition pro-
duced by a clustering algorithm by comparing it with other partitions pro-
duced by the same algorithm, initialised with different parameters.
External validity is independent of the clustering algorithms used. It is
therefore appropriate for the comparison of different clustering algorithms.
Cluster validation by visualization:
This cluster validation is carried out by evaluating the quality of the cluster-
ing’s result with the human eye. This requires an appropriate representation
of the clusters so that they are easy to visualize. This approach is impractical for large data sets and when the dimension of the data is high. It only works in 2 or 3 dimensions because the human eye cannot visualize higher dimensions. For visualizing high-dimensional data, the dimension of the data has to be reduced to 2 or 3. SOM, which is one clustering algorithm we will
study later, is often used as a tool for reducing the dimensions of the data
for visualization. Cluster validation by visualization will not be considered
in this thesis. There are two reasons for this: first because the size of the
data set is large and the dimension of the data is high and second because
the visualization cannot be quantified. We need to be able to quantify the
quality of the partitions in order to compare the algorithms on this basis.

2.2.7 Clustering tendency


Clustering tendency evaluates whether the data set is suitable for clustering.
It determines whether the data set contains any structure. This study should
be performed before using clustering as a tool for exploring the structure of
the data. Despite its importance, this step is most often omitted, probably because it is time consuming. An example of an algorithm for studying the
presence or absence of structure in the data set and one that also identifies
the optimal number of clusters in the data, is the model explorer algorithm.
This algorithm has been presented by Ben-Hur et al. [32].
Here is the description of the model explorer algorithm:

1. Choose a number of clusters K, the number of subsamples L, the similarity measure between two partitions and the proportion α of the data set to be sampled (without replacement).

2. Generate two subsamples s and t of the data set, each of size α times the size of the data set.

3. Cluster both subsamples using the same clustering algorithm.

4. Compute the similarity between the two partitions. Only elements common to s and t are involved in this computation.

5. Repeat steps 2 to 4 L times.

The model explorer algorithm is based on the following assumption: if the data set has a structure, this structure will remain stable under small perturbations of the data set such as removing or adding values. So the model explorer algorithm gives an indication of the presence or the absence of structure in the data. If structure is present, the model explorer algorithm finds the optimal number of clusters in the data.
The main problem with the model explorer algorithm is that it is computationally expensive.
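
A minimal sketch of the model explorer procedure is given below (illustrative only; it uses a deliberately simple base clusterer in place of the clustering algorithm of step 3, and a pairwise Jaccard-style agreement as the similarity between the two partitions; both are parameters of the algorithm as described above):

```python
import numpy as np

def seed_clustering(X, K, rng):
    """A very simple stand-in for 'the same clustering algorithm' of step 3:
    assign each point to the nearest of K randomly chosen seed points."""
    seeds = X[rng.choice(len(X), K, replace=False)]
    return np.argmin(((X[:, None, :] - seeds[None, :, :]) ** 2).sum(-1), axis=1)

def partition_similarity(labels_s, labels_t):
    """Pairwise Jaccard-style agreement between two labelings of the same items."""
    same_s = labels_s[:, None] == labels_s[None, :]
    same_t = labels_t[:, None] == labels_t[None, :]
    off_diag = ~np.eye(len(labels_s), dtype=bool)
    both = np.sum(same_s & same_t & off_diag)
    either = np.sum((same_s | same_t) & off_diag)
    return both / either if either else 1.0

def model_explorer(X, K, L=10, alpha=0.7, seed=0):
    rng = np.random.default_rng(seed)
    n = int(alpha * len(X))
    scores = []
    for _ in range(L):
        s = rng.choice(len(X), n, replace=False)          # steps 2-3: two subsamples
        t = rng.choice(len(X), n, replace=False)
        labels_s = dict(zip(s, seed_clustering(X[s], K, rng)))
        labels_t = dict(zip(t, seed_clustering(X[t], K, rng)))
        common = [i for i in s if i in labels_t]          # step 4: only common elements
        ls = np.array([labels_s[i] for i in common])
        lt = np.array([labels_t[i] for i in common])
        scores.append(partition_similarity(ls, lt))
    return scores

X = np.vstack([np.random.default_rng(1).normal(0.0, 0.3, (50, 2)),
               np.random.default_rng(2).normal(3.0, 0.3, (50, 2))])
print(np.mean(model_explorer(X, K=2)))   # stable, high scores suggest structure for K=2
```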

2.2.8 Clustering of network traffic data


The efficiency of the clustering algorithms depends on the nature of the data.
Some of the main difficulties in clustering network traffic data are:

• the size of the data is large,

• the dimension of the data is high,

• the distribution of the class is skewed,

• the data is a mixture of categorical and continuous data,

• the data needs to be pre-processed.



2.3 Summary
In this chapter, aspects of network security and clustering relevant for the
rest of the thesis have been introduced. Network intrusion detection has been
briefly presented. Because of the sophistication of network attack techniques
and the weaknesses in attack prevention mechanisms, network intrusion de-
tection systems are important for ensuring the security of computer networks.
The clustering problem has been defined and steps of the clustering process
have been presented. The main steps of the clustering process are: feature
selection, choice of clustering algorithms and cluster validity. In the next
chapter, clustering methods will be discussed more deeply.
Chapter 3

Clustering methods and algorithms

In this chapter, different clustering methods will be discussed. For each of the methods, examples of clustering algorithms will be presented.
• Section 3.1 discusses hierarchical clustering.

• Section 3.2 discusses partitioning methods. It is one of the most important sections of this chapter as the discussion of partitioning clustering provides the basis for the implementation of the algorithms used for the experiments. The main classes of partitioning clustering methods are: squared-error clustering, model-based clustering, density-based clustering and grid-based clustering. Online clustering and fuzzy clustering methods are also discussed. The main concepts of the algorithms not used for the experiments will be presented, while the algorithms that are part of the experiments will be discussed in more detail.

• Section 3.3 compares the clustering methods and algorithms theoretically.

• Section 3.4 studies how to combine clustering methods. In this section, we propose a clustering technique appropriate to the clustering of network traffic data.
A clustering method defines the general strategy for grouping the data instances into clusters. It specifies, for example, the objective criterion. It also defines the basic theory or concept the clustering is based on. A clustering algorithm, on the other hand, is a particular implementation of a clustering method. For example, the clustering method defined by the sum of squared-errors objective function can be implemented in different ways. An example of such an implementation is the kmeans algorithm.

Element  a  b  c  d  e  f
a        0  1  1  3  2  5
b        1  0  2  2  1  4
c        1  2  0  3  2  5
d        3  2  3  0  1  4
e        2  1  2  1  0  3
f        5  4  5  4  3  0

Table 3.1: Example of distance matrix used for hierarchical clustering
Clustering methods can be categorized in different ways. At a higher
level, one can distinguish between two main clustering strategies: hierarchical
methods and partitioning methods. Hierarchical clustering organizes the
data instances into a tree of clusters. Each hierarchical level of the tree
corresponds to a partition of the data set. Partitioning methods, on the
other hand, create a single partition of the data set. Both categories will be
discussed in the following sections.

3.1 Hierarchical clustering methods


As mentioned earlier, hierarchical clustering methods organize the data in-
stances into a hierarchy of clusters. This organization follows a tree structure
known as a dendrogram. The root of the dendrogram represents the entire
data set. The clusters located at the leaves contain exactly one data instance.
Figure 3.1 shows an example of a dendrogram corresponding to the dis-
tance matrix in table 3.1. Cutting the dendrogram at each level of the tree
hierarchy gives a different partition of the data set. Hierarchical clustering
methods can be divided into two main categories: hierarchical agglomerative clustering (HAC) and hierarchical divisive clustering (HDC).
Hierarchical agglomerative clustering constructs clusters by moving step
by step from the leaves to the root of the dendrogram. HAC starts with
clusters consisting of a single element and iteratively merges them to form
the clusters of the next level of the tree hierarchy. This process continues
until the entire data set falls into a single cluster. At this point the root of the dendrogram is known.

Figure 3.1: A dendrogram corresponding to the distance matrix in table 3.1
Hierarchical divisive clustering traverses the dendrogram from the root to the leaves. HDC starts with a single cluster representing the entire data set. Then it proceeds by iteratively dividing large clusters at the current level i into smaller clusters at level i + 1. This process stops when each of the current clusters consists of a single element. At this point the leaves of the
dendrogram are known.
The following are the main steps by which HAC organizes the data in-
stances into a hierarchy of clusters. How HDC proceeds can trivially be
deduced from the steps of HAC.
1. Compute the distance between all the items and store them in a dis-
tance matrix.
2. Identify and merge the two most similar clusters.
3. Update the distance matrix by computing the distance between the new cluster and all the other clusters.

4. Repeat steps 2 and 3 until the desired number of clusters is obtained or until all the items fall into a single cluster.
In order to merge clusters, the distance between pairs of clusters needs to be
computed. Below are some examples of inter-cluster distances.

Inter-cluster distances for hierarchical clustering


Four distances are frequently used for measuring the similarity of two clusters
in hierarchical clustering. Let C1 and C2 be two clusters, the four inter-cluster
distances are:

• The maximum distance between C1 and C2 :

distmax (C1 , C2 ) = max(p1 ∈C1, p2 ∈C2 ) dist(p1 , p2 ) (3.1)

• The minimum distance between C1 and C2 :

distmin (C1 , C2 ) = min(p1 ∈C1, p2 ∈C2 ) dist(p1 , p2 ) (3.2)

• Average distance: the average of the distances over all pairs of elements (p1 ∈ C1, p2 ∈ C2)

• The distance between the mean µ1 of C1 and the mean µ2 of C2 :

distmean (C1 , C2 ) = dist(µ1 , µ2 ) (3.3)

In these expressions, dist is the distance measure used between pairs of elements. Generally, euclidean distance is used.
In [3], the authors illustrate the difference between these distances. They
show that if the clusters are compact and non-overlapping, these distances
are similar. But in the case where the clusters overlap or do not have hyperspherical shapes, they give results that differ significantly. The distmean is less computationally expensive than the other three distance measures because
it does not compute the distance between all pairs of instances of the two
clusters C1 and C2 . These inter-cluster distance measures correspond to dif-
ferent strategies for merging clusters. When distmin is used, the algorithm
is known as the nearest-neighbor algorithm and when distmax is used, the
algorithm is called the farthest-neighbor algorithm.
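
As an illustration of these steps (a sketch, not the thesis implementation), the following Python code performs agglomerative clustering with the minimum (nearest-neighbour) inter-cluster distance, starting from the distance matrix of table 3.1:

```python
def single_link_hac(dist, n_clusters=1):
    """Agglomerative clustering using dist_min; dist is a symmetric distance matrix."""
    clusters = [[i] for i in range(len(dist))]        # start with singleton clusters
    while len(clusters) > n_clusters:
        best = (0, 1, float("inf"))
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[p][q] for p in clusters[i] for q in clusters[j])
                if d < best[2]:
                    best = (i, j, d)                  # closest pair of clusters so far
        i, j, _ = best
        clusters[i] = clusters[i] + clusters[j]       # merge the two closest clusters
        del clusters[j]
    return clusters

# Distance matrix of table 3.1; indices 0..5 correspond to elements a..f.
D = [[0, 1, 1, 3, 2, 5],
     [1, 0, 2, 2, 1, 4],
     [1, 2, 0, 3, 2, 5],
     [3, 2, 3, 0, 1, 4],
     [2, 1, 2, 1, 0, 3],
     [5, 4, 5, 4, 3, 0]]
print(single_link_hac(D, n_clusters=2))   # element f (index 5) is merged last
```

Recording the successive merges, instead of only the final clusters, yields the dendrogram of figure 3.1.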
The problem with hierarchical clustering is that it is computationally
expensive both in time and space because the distances between all pairs
of instances of the data set need to be computed and stored. The time
complexity of HAC is at least O(N 2 log N ), where N is the size of the data set.
This is because there are at least log N levels in the dendrogram and each of them requires O(N²) for creating a partition. Because of its high computation time, hierarchical clustering is not suitable for clustering large data sets.
Hierarchical clustering algorithms do not aim at maximizing a global
objective function. At each step of the clustering process, they make local
decisions in order to find the best way of clustering the data.
In this section, hierarchical clustering has been briefly discussed. Hierar-
chical clustering is impractical for large data sets. The next section is about
partitioning clustering.

3.2 Partitioning clustering methods


Partitioning clustering methods, as opposed to hierarchical clustering methods, create a single partition of the data set. The main categories of parti-
tioning clustering methods are described in the following.

3.2.1 Squared-error clustering


The objective of squared-error clustering is to find the partition of the data
set with the minimal sum of squared-errors. The squared-error of a cluster
is defined as the sum of the squared euclidean distance of each of the cluster
members to the cluster’s centre. And the sum of squared-errors of a parti-
tion P = {C1 , .., CK } is defined as the sum of the squared-errors of all the
clusters. In other words:
SSE(P) = Σ_{k=1}^{K} Σ_{x∈Ck} ||x − µk||²,   (3.4)

where µk is the mean of cluster Ck: µk = (1/|Ck|) Σ_{x∈Ck} x.

The general form of a squared-error clustering is:


Given a data set D,

1. Initialisation

(a) Specify the number of clusters and assign arbitrarily each instance
of the data set to a cluster
(b) Compute the centre for each cluster

2. Iterations: Repeat steps 2a, 2b and 2c until the change between two consecutive iterations is below a specified threshold.

(a) Assign each instance of D to the cluster whose centre it is closest to
(b) Compute the new centre for each cluster
(c) Compute the squared-error.

Why does the squared-error clustering algorithm converge?

Sum of squared-errors clustering is an example of an optimisation algorithm based on local iterative optimisation steps. Here follows the description of a local search algorithm and the proof of its convergence.

Local search algorithm: Let P be a finite set of possible solutions (in partitioning clustering, P is the set of all partitions), and let f : P → R be a function to be minimized (in sum of squared-errors clustering, f is the sum of squared-errors). The algorithm starts from an initial solution x0 ∈ P. It then finds a minimizer x1 ∈ P of f in a neighbourhood of x0. If x1 ≠ x0, a minimizer x2 ∈ P of f is found in a neighbourhood of x1. A sequence of minimizers x0, x1, ..., xt ∈ P is constructed in this way. The iterations stop when xt gets very close to xt−1.

Proof: It is clear that f(x0) ≥ f(x1) ≥ ... ≥ f(xt). The stopping criterion xt = xt−1 is satisfied at the point where f(xt) = f(xt−1). This means that the inequalities that hold before the stopping criterion is met are all strict, so the algorithm progresses. It stops at some point in time because P is a finite set. The convergence is local and not optimal because the algorithm operates locally; only a subset of the solution space is investigated.
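
A generic version of such a local search can be sketched as follows (illustrative; the objective f and the neighbourhood function used here are placeholders, not the clustering objective itself):

```python
def local_search(x0, f, neighbours):
    """Move to the best neighbour as long as it strictly decreases f."""
    x = x0
    while True:
        best = min(neighbours(x), key=f, default=x)
        if f(best) >= f(x):          # no strictly better neighbour: local minimum
            return x
        x = best

# Toy example: minimize f(x) = (x - 7)^2 over the integers, moving by +/- 1.
f = lambda x: (x - 7) ** 2
neighbours = lambda x: [x - 1, x + 1]
print(local_search(0, f, neighbours))   # 7
```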

More precisely, squared-error clustering is based on a version of a local search algorithm called alternating minimization [20]. Alternating minimization is appropriate in situations where:

• the variables of the function to be optimised fall into two or more groups,

• and optimising the function while keeping some of the variables constant is easier than performing the optimisation over all the variables at the same time.
The alternating minimization proceeds in the following way:
Let xt = (ct, sset) be the two groups of variables. In the case of squared-error clustering, these variables are respectively the centres of the clusters and the sum of squared-errors. At each iteration t, the minimization occurs by keeping sset constant; ct+1 is then found as the value of c that minimizes the function f(c, sset). The value of sset+1 is then the value of sse that minimizes f(ct+1, sse).
The main strengths of squared-error clustering are its simplicity and effi-
ciency. Some of its limitations are:
• The sum of squared-errors criterion is only appropriate in situations where the clusters are compact and non-overlapping.

• The partition with the lowest SSE is not always the one that reveals the true structure of the data. Sometimes a partition consisting of large clusters has a smaller sum of squared-errors than the partition that reflects the true structure of the data. This situation often occurs when the data contains outliers.
One of the most popular examples of squared-error clustering is the
kmeans-algorithm.

The kmeans-algorithm
Kmeans is an iterative clustering algorithm which moves items among clus-
ters until a specified convergence criterion is met. Convergence is reached
when only very small changes are observed between two consecutive itera-
tions. The convergence criterion can be expressed in terms of the sum of
squared-errors but it does not need to be so expressed.
Algorithm: kmeans-algorithm
Input: A data set D of size N and the number of clusters K,
Output: a set of K clusters with minimal sum of squared-error.
1. Randomly choose K instances from D as the initial cluster centres;

Repeat steps 2 and 3 until no change occurs.



2. Assign each instance to the cluster whose centre the instance is closest to;

3. Recompute the cluster centres. The centre of the cluster Ck is given by:
µk = (1/|Ck|) Σ_{xj∈Ck} xj, where |Ck| is the size of Ck.
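
A compact NumPy sketch of the algorithm above (illustrative, not the implementation used for the experiments) could look as follows:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Return cluster labels and centres for the data matrix X."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), K, replace=False)]    # step 1: random initial centres
    for _ in range(max_iter):
        # step 2: assign each instance to the closest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centre as the mean of its cluster
        new_centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centres[k] for k in range(K)])
        if np.allclose(new_centres, centres):            # no change: convergence
            break
        centres = new_centres
    return labels, centres

X = np.vstack([np.random.default_rng(1).normal(0.0, 0.5, (100, 2)),
               np.random.default_rng(2).normal(5.0, 0.5, (100, 2))])
labels, centres = kmeans(X, K=2)
print(np.round(centres, 2))   # two centres, near (0, 0) and (5, 5)
```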

Kmeans is simple and efficient. Because of these qualities, kmeans is widely used as a clustering tool.
The problems associated with kmeans are mainly those common to squared-
error clustering: the clustering’s result is not optimal and the sum of squared-
errors is not always a good indicator of the quality of the clustering. The
number of clusters needs to be specified by the user and the quality of the
clustering is dependent on the initial values. It is appropriate when the clus-
ters are compact, well separated, spherical and approximately of similar size.
The algorithm does not explicitly handle outliers and the presence of outliers
can degrade the quality of the clustering. The time complexity of kmeans is
O(I ∗K ∗N ), where N is the size of the data set, I is the number of iterations
and K is the number of clusters. Generally, the maximum number of itera-
tions is specified. In these cases, the time complexity is O(K ∗ N ). Figure
3.2 illustrates how the sum of squared-errors varies during the iterations of
the kmeans algorithm.

Figure 3.2 shows that the sum of squared-errors decreases very slowly
after the 10th iteration. This indicates that convergence of kmeans is
reached around the 10th iteration.

3.2.2 Model-based clustering


Model-based clustering methods assume that the data set has an underly-
ing mathematical model, and they aim at uncovering this unknown model.
Generally, the model is specified in advance and what remains is the compu-
tation of its parameters. Two main classes of model-based clustering exist.
The first class is based on a probabilistic approach and the second is based
on an artificial neural network approach.

• Probabilistic clustering

[Figure 3.2: Variation of the sum of squared-errors (SSE) in kmeans, plotted against the iteration number.]

In the probabilistic approach, a mixture of gaussians is often used to
describe the underlying model of the data set. The model parameters
can be learned by two different approaches: the mixture likelihood approach
and the classification likelihood approach. The main difference
between these two approaches is that the former assumes overlapping
clusters while the latter does not. The expectation maximization (EM)
algorithm [5] is generally used for learning the model parameters under
the mixture likelihood approach, and the classification EM algorithm
[24] is used under the classification likelihood approach. An example of each
of these approaches is presented in the following.

– Clustering under the mixture likelihood approach or EM-based clustering

The maximum likelihood parameter estimation is at the heart of this clustering approach.
The maximum likelihood parameter estimation:
Given a density function p(x|Θ), where Θ is a parameter set, and a data set
D = {x_1, ..., x_N}, the maximum-likelihood parameter estimation consists in
finding the value Θ_max of the parameter Θ that maximizes the likelihood
function λ defined as:

M(D|Θ) = Π_{n=1}^{N} p(x_n|Θ) = λ(Θ|D),    (3.5)

For the purpose of identifying clusters in a data set, the density
function used is a mixture of density functions. Each component
of the mixture represents a cluster.
The mixture of density functions is defined as:

∀x ∈ D,  p(x|Θ) = Σ_{k=1}^{K} α_k p(x|Θ_k)    (3.6)

where Θ = (Θ_1, ..., Θ_K)^t is a set of parameters and Σ_{k=1}^{K} α_k = 1.
p(x|Θ_k) and α_k are respectively the density function and the mixture
proportion of the k-th mixture component.

The maximum likelihood parameter estimation approach is based
on two assumptions: for a specified value of the parameter Θ,
- the instances x_i of the data set D are statistically independent,
- the selection of instances from a mixture component is done
independently of the other components.
An intuitive way of explaining the selection of each instance x_i in
the mixture model is that it happens in two steps:
- firstly, a component k is selected with probability α_k,
- secondly, x_i is selected from the component k with the probability p(x_i|Θ_k).

For the experiments in this thesis, the model used is the mixture
of isotropic gaussians. This model is also known as the mixture
of spherical gaussians: each component of the mixture is a spherical
gaussian. The mixture of isotropic gaussians has been chosen because
of its simplicity, efficiency and scalability to higher dimensions.
The EM algorithm is a general method, used for estimating the
parameters of the mixture model. It is an iterative procedure that
consists in two steps: the expectation step and the maximization

step. The expectation step is commonly called the E-step and the
maximization step is called the M-step. The E-step estimates the
extent to which instances belong to clusters. The M-step computes
the new parameters of the model on the basis of the estimates of
the E-step. In the case of the mixture of isotropic gaussians, the
model parameters are the means, standard deviations and the
weights of the clusters. This step is called the maximization step
because it finds the values of the parameters that maximize the
likelihood function.
The E and M steps are repeated until convergence of the parame-
ters is reached. Convergence is reached when the parameter values
of two consecutive iterations get very close. At the end of the it-
erations, a partitioning of the data set is obtained by assigning
each data instance to the cluster to which the instance has high-
est membership degree. This way of assigning instances to clusters
is called the maximum a posteriori (MAP) assignment. MAP as-
signment gives a crisp or hard clustering of the data set. A soft
clustering -also called fuzzy clustering- can be obtained by using
the cluster membership degrees computed in the E-step.

Algorithm: Learning a mixture of isotropic gaussians with the EM algorithm
Input: the data set of size N, a set of parameters Θ for the mixture of
gaussians: Θ = {α_k, µ_k, σ_k}_{k=1,...,K}
Output: A partition of the data set into K clusters: {C_1, ..., C_K}

1. Random initialisation of the parameter set Θ

2. Repeat steps 3 and 4 until the log-likelihood function log(λ(Θ|D)) converges

3. E-step: Estimation of the posterior probabilities of the k-th component:
   P(k|x_n, Θ) = α_k γ(x_n|θ_k) / Σ_j α_j γ(x_n|θ_j),
   where γ(x_n|θ_k) = (1/(√(2π) σ_k)^d) exp( −‖x_n − µ_k‖² / (2σ_k²) )

4. M-step: Re-estimation of the parameter set Θ of the model:
   µ_k^(new) = Σ_n P(k|x_n, Θ) x_n / Σ_n P(k|x_n, Θ),
   σ_k^(new) = sqrt( (1/d) Σ_n P(k|x_n, Θ) ‖x_n − µ_k^(new)‖² / Σ_n P(k|x_n, Θ) ),
   α_k^(new) = (1/N) Σ_n P(k|x_n, Θ)

5. MAP assignment of instances to clusters
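
The following C++ sketch shows one iteration of this algorithm for the mixture of
isotropic gaussians, following the E-step and M-step formulas above. It is a
simplified illustration rather than the implementation used for our experiments: it
works with raw densities instead of log-space computations, and the names logGauss,
emIteration and post are choices made for this sketch.

#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Instance;

static const double PI = 3.14159265358979323846;

// Log-density of a d-dimensional isotropic gaussian with mean mu and standard deviation sigma.
static double logGauss(const Instance& x, const Instance& mu, double sigma) {
    const std::size_t d = x.size();
    double sq = 0.0;
    for (std::size_t j = 0; j < d; ++j) {
        double diff = x[j] - mu[j];
        sq += diff * diff;
    }
    return -0.5 * sq / (sigma * sigma)
           - static_cast<double>(d) * std::log(std::sqrt(2.0 * PI) * sigma);
}

// One EM iteration for a mixture of K isotropic gaussians. The parameters alpha,
// mu and sigma are updated in place; post[n][k] receives the posterior P(k|x_n, Theta)
// computed in the E-step (post must already have size N x K).
void emIteration(const std::vector<Instance>& data,
                 std::vector<double>& alpha,
                 std::vector<Instance>& mu,
                 std::vector<double>& sigma,
                 std::vector< std::vector<double> >& post) {
    const std::size_t N = data.size(), d = data[0].size(), K = alpha.size();
    // E-step: posterior membership degrees (a robust implementation would work in log space).
    for (std::size_t n = 0; n < N; ++n) {
        double norm = 0.0;
        for (std::size_t k = 0; k < K; ++k) {
            post[n][k] = alpha[k] * std::exp(logGauss(data[n], mu[k], sigma[k]));
            norm += post[n][k];
        }
        for (std::size_t k = 0; k < K; ++k) post[n][k] /= norm;
    }
    // M-step: re-estimate means, standard deviations and mixture proportions.
    for (std::size_t k = 0; k < K; ++k) {
        double w = 0.0;
        Instance newMu(d, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            w += post[n][k];
            for (std::size_t j = 0; j < d; ++j) newMu[j] += post[n][k] * data[n][j];
        }
        for (std::size_t j = 0; j < d; ++j) newMu[j] /= w;
        double sq = 0.0;
        for (std::size_t n = 0; n < N; ++n) {
            double dd = 0.0;
            for (std::size_t j = 0; j < d; ++j) {
                double diff = data[n][j] - newMu[j];
                dd += diff * diff;
            }
            sq += post[n][k] * dd;
        }
        mu[k] = newMu;
        sigma[k] = std::sqrt(sq / (static_cast<double>(d) * w));
        alpha[k] = w / static_cast<double>(N);
    }
}

The iteration is repeated until the log-likelihood converges, after which the MAP
assignment of step 5 produces the final partition.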


In the rest of this section about EM-based clustering, we will ex-
plain how the expressions of the model parameters used in the
iterations of the EM algorithm are obtained.

How does the EM algorithm work?


This section constitutes preparatory remarks concerning the EM
algorithm. Estimating the parameters of the mixture model using
the maximum likelihood approach can be difficult or easy depend-
ing on the expression of the likelihood function. In simple cases, the
problem can be solved by computing the derivative of the likeli-
hood function with respect to the model parameters. The values of the
parameters that maximize the likelihood function are then
found by setting the derivative to zero. In most cases, the problem


is not easy and special techniques, such as the EM algorithm, are
needed to solve it. The EM algorithm can be explained in various ways.
In the following, we interpret the EM approach as a lower
bound maximization problem. This approach has been presented
in [26]. In this approach, the maximization of the complex log-likelihood
expression is replaced by the maximization of a simpler lower bound function.
Here follows a brief explanation of why maximizing the bound
function helps to maximize the log-likelihood function. One of the
constraints the lower bound function must satisfy is that it must
touch the log-likelihood function at the current estimate of the
maximizer. Given that constraint, consider two functions g and h,
and let y = arg max_x g(x). Suppose that g(x) ≤ h(x) for all x, and that for
some z, g(z) = h(z). Then, if g(y) > g(z), it follows that
h(y) ≥ g(y) > g(z) = h(z), so the maximizer of g improves h. (Here, z
is the current estimate of the maximizer of h, y is its new estimate, g plays
the role of the lower bound and h the role of the log-likelihood function.)

Computation of the model parameters

As mentioned earlier, the mixture model of interest is the mixture
of isotropic gaussians (MIXIG) and its parameters are {α_k, µ_k, σ_k}_{1≤k≤K}.
The parameters α_k, µ_k and σ_k are respectively the mixture proportion,
the mean and the standard deviation of the k-th mixture
component. In this section, the expressions used for computing
the new estimates of the parameters at each iteration of the EM process
will be derived.
In the E-step, the posterior probabilities for the t-th iteration are
computed. The posterior probabilities express the membership
degree of instances to clusters. The membership degree of instance
x_n to the k-th cluster, given the current parameters Θ^(t) = (Θ_1^(t), ..., Θ_K^(t)), is:

P^(t)(k|x_n, Θ^(t)) = α_k^(t) γ(x_n|Θ_k^(t)) / Σ_{j=1}^{K} α_j^(t) γ(x_n|Θ_j^(t))

The denominator of this fraction ensures that the posteriors of each instance
sum to 1 in every iteration. The numerator expresses how the
instance x_n is selected from the data set: first a cluster is chosen
with probability α_k, and then x_n is selected from the chosen
cluster according to the density function governing it; this gives the
value α_k^(t) γ(x_n|Θ_k^(t)).

The density function for each of the clusters is an isotropic gaussian.
Its expression is:

γ(x_n|Θ_k^(t)) = (1/(√(2π) σ_k^(t))^d) exp( −‖x_n − µ_k^(t)‖² / (2(σ_k^(t))²) ),
where Θ_k^(t) = (µ_k^(t), σ_k^(t))

In the following, we find a lower bound function of the log-likelihood function.
Let us recall the likelihood function; it is given by:
λ(Θ|D) = Π_{n=1}^{N} h(x_n|Θ), where the function h is the gaussian mixture
density function. By definition, h(x|Θ) = Σ_{k=1}^{K} α_k γ(x|Θ_k),
where the function γ is the isotropic gaussian.
Putting it all together we get:

λ(Θ|D) = Π_{n=1}^{N} Σ_{k=1}^{K} α_k γ(x_n|Θ_k)    (3.7)

The logarithm of λ is easier to manipulate than λ. Because the
function log λ varies the same way as the function λ does, maximizing
log λ is the same as maximizing λ. The logarithm of λ,
called the log-likelihood function, is:

δ(Θ|D) = log λ(Θ|D) = Σ_{n=1}^{N} log( Σ_{k=1}^{K} α_k γ(x_n|Θ_k) )

This expression is complex and difficult to maximize because of
the logarithm of a sum it contains. Therefore, a lower bound
function of this function will be found and maximized instead. In
order to make the manipulation of symbols easier, some notation
is introduced here: s(k, n) = α_k γ(x_n|Θ_k); that gives:

δ(Θ|D) = Σ_{n=1}^{N} log( Σ_{k=1}^{K} s(k, n) )

δ(Θ|D) = Σ_{n=1}^{N} log( Σ_{k=1}^{K} P^(t)(k|x_n, Θ) · s(k, n) / P^(t)(k|x_n, Θ) )

And using Jensen's inequality [appendix D.2], this gives

δ(Θ|D) ≥ Σ_n Σ_k P^(t)(k|x_n, Θ) log( s(k, n) / P^(t)(k|x_n, Θ) ) = B_t(Θ).    (3.8)

By rewriting B_t(Θ), we get:

B_t(Θ) = Σ_n Σ_k P^(t)(k|x_n, Θ) log(s(k, n)) − Σ_n Σ_k P^(t)(k|x_n, Θ) log(P^(t)(k|x_n, Θ))    (3.9)

From the E-step, the second term on the right side of this expression is known;
therefore the maximization of B_t(Θ) reduces to the maximization of

b_t(Θ) = Σ_n Σ_k P^(t)(k|x_n, Θ) log(s(k, n))    (3.10)

The update formulas of the model parameters are obtained by computing the
derivative of the lower bound function with respect to each of the parameters of
the model and by setting each of these derivatives to zero.
Here are the details of the derivation of the formulas of the model parameters.
Formula of the mean:

∂b_t(Θ)/∂µ_k = Σ_{n=1}^{N} P^(t)(k|x_n, Θ) (µ_k − x_n)/σ_k² = 0    (3.11)

which gives

µ_k^(t+1) = Σ_n P^(t)(k|x_n, Θ) x_n / Σ_n P^(t)(k|x_n, Θ)    (3.12)

The standard deviation is obtained in the same way. First, µ_k^(t+1) is inserted
in b_t(Θ), and then b_t(Θ) is differentiated with respect to σ_k. This gives
the expression:

σ_k^(t+1) = sqrt( (1/d) Σ_n P^(t)(k|x_n, Θ) ‖x_n − µ_k^(t+1)‖² / Σ_n P^(t)(k|x_n, Θ) )    (3.13)

For the derivation of the expression of the mixture proportions, the constraint
Σ_k α_k = 1 must be taken into account. In order to do this, the Lagrange method
[D.3] is used. The expression b_t(Θ) is extended by including the constraint
Σ_k α_k = 1. This results in a new function:

f_t(Θ) = b_t(Θ) + λ( Σ_{k=1}^{K} α_k − 1 ),    (3.14)

where λ is the Lagrange multiplier. By inserting the expression of b_t(Θ), this gives:

f_t(Θ) = Σ_n Σ_k P^(t)(k|x_n) log(s(k, n)) + λ( Σ_{k=1}^{K} α_k − 1 ),    (3.15)

where

s(k, n) = α_k γ(x_n|Θ_k^(t)) = α_k (1/(√(2π) σ_k^(t))^d) exp( −‖x_n − µ_k^(t)‖² / (2(σ_k^(t))²) )    (3.16)

Setting the derivative of f_t(Θ) with respect to α_k to zero gives:

∂f_t(Θ)/∂α_k = Σ_{n=1}^{N} P^(t)(k|x_n, Θ) (1/α_k) + λ = 0    (3.17)

which gives:

α_k = − Σ_{n=1}^{N} P^(t)(k|x_n, Θ) / λ    (3.18)

By taking into account the constraint Σ_{k=1}^{K} α_k = 1, we get:

1 = Σ_{k=1}^{K} α_k = − Σ_{k=1}^{K} Σ_{n=1}^{N} P^(t)(k|x_n, Θ) / λ    (3.19)

This is equivalent to:

1 = − Σ_{n=1}^{N} Σ_{k=1}^{K} P^(t)(k|x_n, Θ) / λ = −N/λ,    (3.20)

because Σ_{k=1}^{K} P^(t)(k|x_n, Θ) = 1. This means

λ = −N    (3.21)

Replacing λ by its value in equation 3.18 gives the estimate of the mixing probability:

α_k^(t+1) = (1/N) Σ_n P^(t)(k|x_n, Θ).    (3.22)

In this section, we have discussed one example of probability-based clustering
that uses the mixture likelihood approach. This approach assumes that clusters
may overlap. The next section presents another example of probabilistic
clustering, based on the classification likelihood approach, which assumes
that the clusters are non-overlapping.

– Clustering under the classification likelihood approach or CEM-based clustering

The objective of clustering under the classification likelihood approach is to
find a partition of the data set that maximizes the classification likelihood
criterion κ defined as:

κ(Θ|D) = Σ_{k=1}^{K} Σ_{x_ik ∈ C_k} log( α_k p(x_ik|µ_k, σ_k) )    (3.23)

C_k is the k-th cluster and µ_k, σ_k, α_k are respectively its mean,
standard deviation and mixture proportion.
While the EM algorithm is a general method for estimating the
model parameters under the mixture approach, the classification
EM is a method for estimating the model parameters under the
classification approach. The classification EM algorithm has been
proposed by G. Celeux and G. Govaert in [24].

The classification likelihood objective criterion κ is a special case


of the mixture likelihood criterion. In this special case, each in-
stance belongs exclusively to a single cluster. Like the EM al-
gorithm, the classification EM algorithm has an expectation step
and a maximization step. During the expectation step, the ex-
pected membership degree of each instance to each of the clusters
is computed. Using the cluster membership degrees, computed in
the E-step, the maximization step computes the values of the pa-
rameters that maximize the log-likelihood function. In addition,
the classification EM algorithm has a classification step, called
C-step. The C-step takes place between the E-step and the M-
step. In the classification step, instances are assigned to clusters
according to the maximum a posteriori (MAP) principle. Below
is a description of the classification EM algorithm:

Algorithm: learning model parameters via the CEM algorithm

Input: a data set D of size N, the desired number of clusters K
Output: a partition of D into K clusters.

1. Initialisation
   Start from an initial partition P^0 of the data set.

Repeat the E, C and M steps until convergence is reached.

2. E-step: Computation of the posterior probabilities:
   For i = 1, ..., N and k = 1, ..., K, the posterior probability z_ik for data
   instance x_i belonging to cluster C_k is given by
   z_ik^(t+1) = α_k^(t) f(x_i, Θ_k^(t)) / Σ_{r=1}^{K} α_r^(t) f(x_i, Θ_r^(t)),
   where α_k^(t) and Θ_k^(t) are the values of the parameters of the model at
   the t-th iteration and f is a density distribution function.

3. C-step: MAP assignment of items to clusters

4. M-step: Computation of the parameter values
   For k = 1, ..., K, α_k^(t+1) = N_k/N, where N_k is the size of C_k.
   The formula for the computation of the parameter Θ_k depends on the exact
   expression of f. In this thesis, the model is a mixture of isotropic gaussians,
   so f is the isotropic gaussian density; the mean of cluster k is µ_k and its
   standard deviation is σ_k. This gives the expressions:
   µ_k^(t+1) = (1/N_k) Σ_{x_i ∈ C_k^(t)} x_i,  ∀k = 1, ..., K
   and
   σ_k^(t+1) = sqrt( (1/(N_k d)) Σ_{x_i ∈ C_k^(t)} ‖x_i − µ_k^(t+1)‖² ),
   where d is the dimension of the data space and N_k is the size of cluster C_k.

These formulas are intuitive. As a special case of the mixture likelihood, they
can be derived from those of the EM-based approach by replacing P^(t)(k|x_n, Θ)
with 1 if x_n ∈ C_k and with 0 if x_n ∉ C_k in the formulas obtained
with the EM algorithm.
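
The only structural difference from the EM sketch given earlier is the C-step. The
following fragment, again only an illustration and not our experimental implementation,
shows how the posteriors of the E-step are turned into a crisp partition by MAP
assignment; with this hard partition the M-step reduces to computing per-cluster
statistics (α_k = N_k/N, the cluster mean and the standard deviation given above).

#include <cstddef>
#include <vector>

// Classification step (C-step): turn the posterior membership degrees computed in
// the E-step into a crisp partition by MAP assignment. post[i][k] is the posterior
// probability of instance i for cluster k.
std::vector<int> cStep(const std::vector< std::vector<double> >& post) {
    std::vector<int> label(post.size(), 0);
    for (std::size_t i = 0; i < post.size(); ++i)
        for (std::size_t k = 1; k < post[i].size(); ++k)
            if (post[i][k] > post[i][label[i]]) label[i] = static_cast<int>(k);
    return label;
}

Inserting this step between the E-step and the M-step of the EM iteration, and
computing the M-step statistics over the resulting crisp clusters, yields the CEM
iteration described above.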
[Figure 3.3: Variation of the log-likelihood with the iterations of the classification maximum likelihood.]

Figure 3.3 shows how the log-likelihood of the data increases with the iterations
of the classification maximum likelihood. From this figure, it appears that the
log-likelihood converges after about 10 iterations.
One of the drawbacks of CEM-based clustering is that it is computationally
expensive, especially for a high number of clusters.

The two previous examples are examples of model-based clustering that
use a probabilistic approach. The next method, which is also an example
of model-based clustering, uses an artificial neural network approach.

• Artificial neural network based methods


Artificial neural networks (ANN) are inspired by the way the human
brain works. An ANN consists of many interconnected processing units,
called neurons, and is generally modelled as a directed graph. The
sources of the graph form the input layer and the sinks form the output layer.
Sometimes, hidden layers are located between the input layer and the
output layer.

ANN are used for both classification and clustering. They can be com-
petitive or non-competitive. In competitive learning, the output nodes
compete and only one of them wins. A commonly used competitive
approach for clustering is self-organizing maps (SOM). The term self-
organizing refers to the ability of the nodes of the networks to organize
themselves into clusters.
SOM are represented by a single layered neural network in which each
output node is connected to all input nodes. This is illustrated in figure
3.4. When an input vector is presented to the input layer, only a sin-
gle output node is activated. This activated node is called the winner.
When the winner has been identified its weights are adjusted. At the
end of the learning process, similar items get associated to the same
output node. The most popular examples of SOM are the Kohonen
self-organizing maps [47].

Kohonen Self-Organizing Maps(SOM) algorithm

Kohonen self-organizing maps were developed by Teuvo Kohonen around


1982. They have two layers: an input layer and a competitive layer, as
illustrated in figure 3.4. The competitive layer is a grid of nodes. Each
input node is connected to all the nodes in the competitive layer. The
links between the input nodes and the nodes of the competitive layer
have a weight, and each node in the competitive layer has an activa-
tion function. The network learns in the following way: initially, the
weights of the network are randomly initialised. Then for each input
vector presented to the input layer, each of the competitive nodes pro-
duces an output value. The node that produces the best output value
is the winner of the competition. As a result, the weights of the winner
node as well as those of the nodes in its neighbourhood are adjusted.

[Figure 3.4: A 3x3 Kohonen network map: the input nodes X1, X2, X3 of the input layer are each connected to all nodes of the Kohonen layer.]

Description of the Kohonen SOM algorithm

1. Initialisation
   The weights of the network are randomly chosen and the neighbourhoods
   of the output nodes are specified.

Iterations: Repeat steps 2, 3 and 4 until convergence. Convergence is reached
when the variation in the weights between two consecutive iterations becomes
very small.

2. Find the winner node
   For a given input X, the node of the Kohonen layer most similar to X is
   chosen as the winner.

3. Update the weights
   The weights of the winner node as well as those of the nodes in its
   neighbourhood are updated:
   W_is^(t+1) = W_is^(t) + α^(t) (X_i − W_is^(t))

4. Decrease the learning rate and reduce the size of the neighbourhood of the
   output nodes.
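
The following C++ sketch shows a single training step of a Kohonen map, restricted
for brevity to a one-dimensional grid of output nodes. It is illustrative only; the
function name somStep and the simple integer neighbourhood are choices made for this
sketch, and the decrease of the learning rate and neighbourhood radius (step 4) is
left to the caller.

#include <cstddef>
#include <vector>

typedef std::vector<double> Vec;

// Squared euclidean distance between an input vector and a weight vector.
static double sqDist(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        s += diff * diff;
    }
    return s;
}

// One training step of a one-dimensional Kohonen map (steps 2 and 3 above):
// find the winner node for the input x and pull the weights of the winner and of
// its neighbours towards x. weights[s] is the weight vector of output node s,
// rate is the current learning rate and radius the current neighbourhood size.
void somStep(std::vector<Vec>& weights, const Vec& x, double rate, std::size_t radius) {
    // Step 2: the winner is the node whose weight vector is closest to x.
    std::size_t winner = 0;
    for (std::size_t s = 1; s < weights.size(); ++s)
        if (sqDist(x, weights[s]) < sqDist(x, weights[winner])) winner = s;
    // Step 3: update the winner and the nodes within the neighbourhood radius.
    for (std::size_t s = 0; s < weights.size(); ++s) {
        std::size_t gridDist = (s > winner) ? s - winner : winner - s;
        if (gridDist > radius) continue;
        for (std::size_t i = 0; i < x.size(); ++i)
            weights[s][i] += rate * (x[i] - weights[s][i]);
    }
}

A full training run presents the instances repeatedly, calling somStep with a
decreasing learning rate and radius, and finally assigns each instance to its
winner node.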
Initialisation of SOM algorithm: The weights of the network can be initialised
randomly. But with random initialisation, some of the output nodes may never win
the competition. This problem can be avoided by randomly choosing instances of
the data set as initial values of the weights.
Choice of distance measure: The dot product and the euclidean
distance are commonly used as distance measures. The dot product is
used in situations where the input patterns and the network weights
are normalized.
Learning rate: The learning rate controls the amount by which the
weights of the winner node and that of its neighbours are adjusted. The
initial learning rate is specified at the initialisation. Then it decreases
as the number of iterations increases. Decreasing the learning rate
ensures that the learning process stops at some point in time. This
is important because usually the convergence criterion is defined in
terms of very small changes in the weights of two consecutive iterations.
Competitive learning does not give any guarantee that this convergence
criterion will eventually be satisfied.
Defining the neighbourhood: Initially, the neighbourhood is set to
a large value which then decreases with the iterations. This corresponds
to assigning instances to nodes with more precision as the number of
iterations increases.
The time complexity of SOM is O(M ∗ N), where M is the size of the
grid and N is the size of the data set. The justification of this time
complexity is the following: during the training, a number of operations
(finding the winner and updating the neighbourhood) that is at most twice
the size of the grid takes place for each step, and the maximum number of
iterations is equal to the size of the data set. So the time complexity of
the training is O(M ∗ N). As the assignment only takes O(N), this gives a
total complexity of O(M ∗ N).
One of the main strengths of SOM is its ability to preserve the topology
of the input data: items that are close to each other in the input space
remain close in the output space. This makes SOM a valuable tool for
visualizing high-dimensional data in low dimensions. SOM also supports
parallel processing; this can speed up the learning process.
Some of the limitations of Kohonen SOM are: it is most appropriate
for detecting hyperspherical clusters. The choice of initial parameter
values - the initial weights of connections, the learning rate, and the

size of the neighbourhood - is difficult. The quality of the clustering


depends on the choice of the initial values of these parameters and also
on the order in which items are processed.
This subsection has discussed model-based clustering. This approach
assumes that the data can be described by a mathematical model and
aims at uncovering this model. The two main approaches to model-based
clustering are the probabilistic approach and the artificial neural
network approach. The next subsection approaches the clustering problem
differently: it views clusters as dense regions in the data space.

3.2.3 Density-based clustering


In the density-based approach, a cluster is defined as a region of the
data space with high density. This dense region is bordered by low-
density regions that separate the cluster from other points of the data
space. There are two main types of density-based clustering: the ap-
proach based on connectivity and the approach based on density func-
tions. An example of a clustering algorithm based on connectivity is
DBSCAN [46] and an example based on density functions is DENCLUE
[27]. These two algorithms are popular for clustering large spatial data
sets.
The two algorithms will be briefly presented in the following.

1. DBSCAN: Density-Based Spatial Clustering of Applications with Noise
DBSCAN finds clusters by first identifying points in dense regions
and then growing the regions around these points until the bor-
ders of these regions are met. To be more specific, DBSCAN finds
clusters by:
- First, identifying the core points of the data set. A core point is a
point whose neighbourhood contains a minimum number of points.
The size of the neighbourhood and the minimum number of points are
the two parameters of the algorithm (a small code sketch of this
core-point test is given after this list).
- Next, DBSCAN iteratively merges core points that are directly
density-reachable. A point p is directly density-reachable from a
point q if q is a core point and p belongs to the neighbourhood of
q. The iterations stop when it is no longer possible to add new

points to any of the clusters. DBSCAN is designed for spatial data
sets. A spatial data set captures the spatial relationship between
the instances of the data set. Examples of spatial data sets are
geographical data sets or image databases. The time complexity
of DBSCAN is O(N log N ) when the spatial index R∗ -tree is used.
Some of the difficulties in using DBSCAN are related to the choice
of appropriate values of the neighbourhood and the minimum
number of points that characterizes a core point.
Some of the advantages of density-based clustering are its ability to
detect clusters of arbitrary shapes and its scalability to large data sets.

2. DENCLUE: DENsity-based CLUstEring


DENCLUE is a clustering algorithm that uses density distribu-
tion functions to identify clusters. Clusters are identified as local
maxima of the overall density function. The overall density func-
tion is the sum of all the influence functions of each point of the
data space. Given a data set D, the influence function of a point
y ∈ D is a function f^y : D → R_0^+, which models the impact of
y within a neighbourhood. The gaussian distribution function is
an example of an influence function that is commonly used. The
gaussian function is defined as:
f_Gauss^y(x) = exp( −d(x, y)² / (2σ²) ), where d is a distance measure.
Clusters are generated by density attractors. A density attractor
is a local maximum of the (overall) density function.
There are two types of clusters: center-defined clusters and arbi-
trary shaped clusters.
Center-Defined Cluster: Given a threshold Γ, a center-defined cluster
for a density attractor x_max is the subset C of the data set D defined by:
C = {y ∈ D | y is density-attracted by x_max and f_β^D(x_max) ≥ Γ}

Arbitrary-shaped Cluster: Given Γ, an arbitrary-shaped cluster for the
set of density attractors A is a subset C of the data set D, such that:
- ∀x ∈ C, ∃ x_max ∈ A with f_β^D(x_max) ≥ Γ and x is density-attracted by x_max,
- for all density attractors x_1^max and x_2^max there is a path from
x_1^max to x_2^max such that for all points y on this path, f_β^D(y) ≥ Γ.

The notion of density-attracted points used in these two definitions of
clusters is defined as follows:
Density-attracted points: Given ε ∈ R+, a point x ∈ D is density-attracted
to a density attractor x_max if and only if there exists a chain of points
x_0, x_1, ..., x_k such that x_0 = x and d(x_k, x_max) ≤ ε, where
x_i = x_{i−1} + δ ∇f_β^D(x_{i−1}) / ‖∇f_β^D(x_{i−1})‖ for 0 < i < k.
For continuous and differentiable influence functions, such as the
gaussian influence function, a hill-climbing algorithm guided by
the gradient can be used to find density-attracted points. The de-
scription of the hill-climbing algorithm can be found in appendix D.1.

DENCLUE considers outliers as noise and removes them. Some of the
strengths of DENCLUE are:
- it has a solid mathematical foundation,
- it is good for clustering high-dimensional data,
- it detects clusters of arbitrary shapes.
The main limitations of these algorithms are that they have been designed
for spatial data only and that they remove outliers, which makes them
unsuitable for identifying attack clusters that are small. In addition, the
quality of the clustering result depends on the choice of the density
parameter and the noise threshold.
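
As announced in the description of DBSCAN above, here is a small illustrative C++
sketch of the eps-neighbourhood query and the core-point test on which DBSCAN is
built. It is not a full DBSCAN implementation: the region-growing step and the
R*-tree index mentioned above are omitted, and the linear neighbourhood scan is only
meant to make the definitions concrete.

#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Point;

// Euclidean distance between two points.
static double dist(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        s += diff * diff;
    }
    return std::sqrt(s);
}

// Indices of the points lying within radius eps of point p (its eps-neighbourhood).
std::vector<std::size_t> neighbourhood(const std::vector<Point>& data,
                                       std::size_t p, double eps) {
    std::vector<std::size_t> result;
    for (std::size_t q = 0; q < data.size(); ++q)
        if (dist(data[p], data[q]) <= eps) result.push_back(q);
    return result;
}

// p is a core point if its eps-neighbourhood contains at least minPts points.
// DBSCAN then grows clusters by repeatedly absorbing points that are directly
// density-reachable from a core point already in the cluster.
bool isCorePoint(const std::vector<Point>& data, std::size_t p,
                 double eps, std::size_t minPts) {
    return neighbourhood(data, p, eps).size() >= minPts;
}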

This subsection has presented the density-based approach to cluster-


ing. In this approach, clusters are defined as dense regions of the data
space. Two examples of algorithms have been presented: DBSCAN
and DENCLUE. These algorithms are designed for spatial data. The
next subsection discusses the grid-based approach to clustering. This
approach is also designed for spatial data. It views the data space as a
grid.

[Figure 3.5: Querying a multi-resolution grid recursively with STING, from level 1 down to level 3 of the cell hierarchy.]

3.2.4 Grid-based clustering


In grid-based clustering, the data space is partitioned into grid cells. A
summary of the information about each cell is computed and kept in a grid
data structure. Cells that contain a number of points above a specified
threshold are considered dense. Dense cells are connected to form clusters.
One popular example of a grid-based clustering algorithm is STING
[28]. It makes use of statistical information about the grid cells.
In the following we will explore STING.
STING: STatistical INformation Grid

STING was proposed by Wang et al. in [28]. It is a multi-resolution


grid. The data space is divided into rectangular cells. The cells are or-
ganized in different hierarchical levels corresponding to different levels
of resolutions. A cell at a hierarchical level i is partitioned into cells at
the next hierarchical level i+1.
Statistical information about each cell is pre-computed and stored.
Some of the statistical information stored is:
- count: the number of items in the cell,
- mean: the mean of the cell,

- s: the standard deviation,


- min: the minimum value of the cell,
- max: the maximum value of the cell,
- the distribution: the distribution of the cell if it is known.
Statistical information about the cells is computed in a bottom-up fash-
ion. For example, the distribution at the hierarchical level i can be
estimated to be the distribution of the majority of cells at the hierar-
chical level i − 1.
The statistical information about the cells is used in a top-down fash-
ion to answer spatial queries. A query asks for the selection of cells
satisfying certain conditions on density for example.
The query-answering is performed in the following way:
First the level of the cell hierarchy where the answering is to begin is
found. Generally, it contains a small number of cells. Then for each
cell at this level, the relevancy of the cell in answering the query is
estimated. Only the relevant cells are submitted to further processing
in the next hierarchical level down.
This process is repeated until the lowest level of the hierarchy is reached.
At this stage the cells satisfying the query are returned. Usually this
ends the clustering process. In cases where very accurate results are
desired, the relevant cells are submitted to further processing. Only
cell members that satisfy the query are returned.

Figure 3.5 illustrates a top-down querying of the grid in STING. Start-


ing at level 1, the possible candidates satisfying the query are localized.
These initial solutions are refined at levels 2 and 3. The desired cells
are returned at level 3. As it appears from this figure, the borders of
the cluster are either horizontal or vertical.

Some of the strengths of STING, and of grid-based clustering in general, are:
- it is scalable to large data sets; the query processing time is linear with
respect to the number of cells,
- the grid structure supports parallel processing and incremental updating.
One of the weaknesses of STING is that the borders of the clusters are either
vertical or horizontal. Grid-based clustering algorithms are designed for
clustering spatial data. They will not be considered for the experiments, as
the data to be used does not capture the spatial relationship between items.

The following two clustering methods correspond to two other ways


of categorizing clustering methods. In the first of these categorizations,
a distinction is made between online and off-line clustering algorithms.
All the methods discussed up to now, except SOM, are off-line methods.
The second categorization distinguishes between crisp clustering and
fuzzy clustering. Crisp clustering creates distinct clusters, while in
fuzzy clustering, items belong to more than one cluster.

3.2.5 Online clustering


One of the main differences between off-line clustering and online clus-
tering is that the former requires that the entire data set is available
at each step of the clustering. That is so because off-line clustering
algorithms generally aim at finding the global optimiser of an objective
function. The latter - online clustering algorithms - generate clusters as
the data is produced. Online clustering algorithms are appropriate for
clustering in a data flow environment. Network traffic is an example of
such a type of data.
Online clustering algorithms do not aim at optimising a global crite-
rion, rather they proceed by making local decisions. Optimisation of
a global criterion often leads to a stability problem, in that the clus-
ters produced by these methods are sensitive to small changes in the
data. The advantage of online clustering is that it leads to adaptable
and stable cluster structure. An example of online clustering is leader
clustering [31].

The leader clustering algorithm


Leader clustering starts by selecting a representative of a cluster. This
representative is called the leader of the cluster. When assigning in-
stances to clusters, the distances of the instance to each of the current
clusters are computed. The instance is assigned to the closest cluster
if its distance to that cluster is below a specified threshold ε. If the distance
of the instance to each of the existing clusters is greater than the
threshold, a new cluster, consisting of that single instance, is created.


This process is repeated for each of the instances. Generally, euclidean
distance is used.

Description of the algorithm

1. Initialisation
   Choose a threshold ε and initialise the first cluster centre µ_0;
   generally the first item of the data set is chosen.

For each of the remaining items x, repeat the following steps:

2. Identify the closest cluster C_closest.

3. If ‖x − µ_closest‖ < ε, update the centre µ_closest; otherwise create a
   new cluster with x as its leader.
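
A minimal C++ sketch of this single-pass algorithm follows. It is an illustration
only, not the implementation used for our experiments: the running-mean update of the
centre is our reading of "update the centre µ_closest" in step 3, and the names
leaderClustering and threshold are choices made for this sketch.

#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Instance;

// Euclidean distance between two instances.
static double dist(const Instance& a, const Instance& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        s += diff * diff;
    }
    return std::sqrt(s);
}

// Single-pass leader clustering (steps 1-3 above). Every instance either joins the
// closest existing cluster - when its distance to that cluster's centre is below the
// threshold, in which case the centre is updated as a running mean - or starts a new
// cluster with itself as leader. Returns the cluster index of each instance.
std::vector<int> leaderClustering(const std::vector<Instance>& data, double threshold) {
    std::vector<Instance> centres;
    std::vector<int> counts;
    std::vector<int> label(data.size(), 0);
    for (std::size_t n = 0; n < data.size(); ++n) {
        int closest = -1;
        double best = 0.0;
        for (std::size_t k = 0; k < centres.size(); ++k) {
            double d = dist(data[n], centres[k]);
            if (closest < 0 || d < best) { best = d; closest = static_cast<int>(k); }
        }
        if (closest < 0 || best >= threshold) {   // too far from every centre:
            centres.push_back(data[n]);           // the instance becomes a new leader
            counts.push_back(1);
            closest = static_cast<int>(centres.size()) - 1;
        } else {                                  // update the centre as a running mean
            ++counts[closest];
            for (std::size_t j = 0; j < data[n].size(); ++j)
                centres[closest][j] += (data[n][j] - centres[closest][j]) / counts[closest];
        }
        label[n] = closest;
    }
    return label;
}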

One of the main drawbacks of leader clustering, which is common to online
clustering algorithms, is that the clustering result depends on the order in
which instances are processed. When leader clustering is used for off-line
clustering, this problem can be mitigated by selecting instances in a random order.
Some of the strengths of leader clustering are: it is fast, robust to
outliers and does not require the number of clusters to be specified ex-
plicitly. Its robustness in the presence of outliers indicates that it may
have some potential for clustering network traffic data for anomaly de-
tection.
The time complexity of the leader clustering is O(K ∗ N ), where K
is the number of clusters and N is the size of the data set. A single
scan of the data set is required and a constant number of operations is
performed during the processing of each instance.

3.2.6 Fuzzy clustering


Another way of categorizing clustering methods is to consider the de-
gree of membership of data instances to clusters. A distinction is made
between crisp clustering and fuzzy clustering. In crisp clustering, each
data instance is assigned to only one cluster. In fuzzy clustering, on the

other hand, each instance belongs to more than one cluster with some
degree of membership. The degree of membership of a data instance
X_i to a cluster C_k is a real value z_ik ∈ [0, 1], where Σ_k z_ik = 1.

Crisp clustering can be considered as a special case of fuzzy clustering,


where zik = 1 if Xi belongs to Ck and zik = 0 otherwise.
Fuzzy clustering aims at minimizing a fuzzy objective criterion. An
example of fuzzy clustering is the EM-based clustering studied earlier.
Another example is fuzzy kmeans discussed below.

The fuzzy kmeans algorithm

Fuzzy kmeans, also known as fuzzy cmeans, was proposed by Dunn


in 1974 and improved by Bezdek in 1981. The algorithm aims at min-
imizing the following objective function:
Q = Σ_{k=1}^{K} Σ_{i=1}^{N} z_ik^b ‖x_i − µ_k‖².    (3.24)

b is called the fuzzifier and it controls the degree of fuzziness. When the
fuzzifier b is close to 1, the clustering tends to be crisp, and when the
fuzzifier b becomes very large, the degree of membership approaches
1/K; that means the data instance is a member of all the clusters to
the same degree. Generally, the value of the fuzzifier b is chosen to be
2.

Description of fuzzy kmeans algorithm


1. Initialisation: choose the number of clusters K, the initial cluster centres,
   the fuzzifier b, a threshold ε and the cluster membership degrees z_ik
   (where i = 1, ..., N and k = 1, ..., K).

2. Normalize the z_ik so that Σ_{k=1}^{K} z_ik = 1, ∀i = 1, ..., N.

Iterations: Repeat steps 3 and 4 until (Q(t−1) − Q(t)) ≤ ε.

3. Recompute the cluster means:
   µ_k = Σ_{i=1}^{N} (z_ik)^b x_i / Σ_{i=1}^{N} (z_ik)^b    (3.25)

4. Recompute the degrees of cluster membership:
   z_ik = 1 / Σ_{j=1}^{K} ( ‖x_i − µ_k‖ / ‖x_i − µ_j‖ )^{2/(b−1)}    (3.26)
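
The two update steps can be written down directly from equations 3.25 and 3.26. The
following C++ sketch performs one iteration; it is illustrative only (the membership
matrix z is assumed to be initialised and normalized as in steps 1 and 2, the case of
an instance coinciding with a cluster mean is not handled, and the function name is
a choice made for this sketch).

#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Instance;

// Euclidean distance between an instance and a cluster mean.
static double dist(const Instance& a, const Instance& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        s += diff * diff;
    }
    return std::sqrt(s);
}

// One iteration of fuzzy kmeans with fuzzifier b (typically b = 2): first the
// cluster means are recomputed (equation 3.25), then the membership degrees
// (equation 3.26). z[i][k] holds the membership of instance i in cluster k.
void fuzzyKmeansIteration(const std::vector<Instance>& data,
                          std::vector<Instance>& mu,
                          std::vector< std::vector<double> >& z,
                          double b) {
    const std::size_t N = data.size(), d = data[0].size(), K = mu.size();
    for (std::size_t k = 0; k < K; ++k) {             // means, equation 3.25
        Instance num(d, 0.0);
        double den = 0.0;
        for (std::size_t i = 0; i < N; ++i) {
            double w = std::pow(z[i][k], b);
            den += w;
            for (std::size_t j = 0; j < d; ++j) num[j] += w * data[i][j];
        }
        for (std::size_t j = 0; j < d; ++j) mu[k][j] = num[j] / den;
    }
    for (std::size_t i = 0; i < N; ++i) {             // memberships, equation 3.26
        for (std::size_t k = 0; k < K; ++k) {
            double sum = 0.0;
            for (std::size_t j = 0; j < K; ++j)
                sum += std::pow(dist(data[i], mu[k]) / dist(data[i], mu[j]), 2.0 / (b - 1.0));
            z[i][k] = 1.0 / sum;
        }
    }
}

In an actual run, the iteration is repeated until the decrease of the fuzzy objective
function Q falls below the threshold ε of step 1.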

Derivation of the formulas: The above formulas are obtained as follows.
Let us first derive the expression of the cluster membership z_ik. We are looking
for an extremum of the fuzzy objective function Q under the constraint that
Σ_{k=1}^{K} z_ik = 1. In order to include that constraint, we use the Lagrange
method. Let

P = Q − λ( Σ_{k=1}^{K} z_ik − 1 ),    (3.27)

where λ is the Lagrange multiplier. By setting the derivative of P with respect
to z_ik to zero, we get:

∂P/∂z_ik = ∂Q/∂z_ik − λ = 0    (3.28)

Using the expression of the derivative of Q gives:

b z_ik^{b−1} d_ik² = λ,    (3.29)

where d_ik = ‖x_i − µ_k‖. This is equivalent to:

z_ik = (λ/b)^{1/(b−1)} · 1/(d_ik²)^{1/(b−1)}    (3.30)

Using the constraint Σ_{j=1}^{K} z_ij = 1, we get:

(λ/b)^{1/(b−1)} Σ_{j=1}^{K} 1/(d_ij²)^{1/(b−1)} = 1    (3.31)

which is equivalent to:

(λ/b)^{1/(b−1)} = 1 / Σ_{j=1}^{K} 1/(d_ij²)^{1/(b−1)}    (3.32)

[Figure 3.6: Variation of the fuzzy sum of squared errors with the iterations of the fuzzy kmeans algorithm.]

When inserting the value of λ into equation 3.30, we get the cluster membership
formula:

z_ik = 1 / Σ_{j=1}^{K} ( d_ik / d_ij )^{2/(b−1)}    (3.33)

Finding the formula for the means of the clusters is simpler because no
constraints have to be satisfied. It is obtained by differentiating Q with
respect to µ_k and setting the derivative to zero:

∂Q/∂µ_k = −2 Σ_{i=1}^{N} z_ik^b (x_i − µ_k) = 0    (3.34)

This gives:

µ_k = Σ_{i=1}^{N} z_ik^b x_i / Σ_{i=1}^{N} z_ik^b    (3.35)

Figure 3.6 shows how the fuzzy sum of squared-errors varies with the number
of iterations of fuzzy kmeans. In this figure, it appears that the fuzzy sum
of squared-errors decreases very slowly after the 11th iteration. That
indicates that convergence of the fuzzy kmeans is reached around the 11th
iteration.

A limitation of the fuzzy kmeans is that it is more computationally


expensive when compared to the standard kmeans.
Fuzzy clustering is appropriate in situations where the clusters overlap.
In this thesis, we are looking for partitions of the data set. So our
interest in fuzzy clustering is limited to studying the effects that fuzzy
concepts have on clustering results. At the end of the clustering pro-
cess, a partition -non-overlapping clusters- is returned using a MAP
assignment, for example.

3.3 Discussion of the classical clustering


methods
The clustering algorithms discussed in the previous sections of this
chapter fall into two groups: the traditional ones and the most recent
ones. The traditional algorithms are HAC, kmeans, EM-based cluster-
ing, CEM-based clustering, SOM, fuzzy kmeans and leader clustering.
The most recent ones are examples of density-based clustering such
as DBSCAN and DENCLUE and examples of grid-based clustering
such as STING. The categorization of each of the clustering algorithms
as instances of a specific clustering method provides a framework for
understanding and discussing properties of the algorithms. Although
these algorithms belong to different methods, some of them can be
easily related. Kmeans is a special case of classification EM-based
clustering which, in turn, is a special case of EM. Kmeans can also be
seen as a special case of fuzzy kmeans clustering. A one-dimensional
SOM, in which only the winner node’s weights are updated during the
competitive learning, is equivalent to the online version of the kmeans
algorithm. The difference between online kmeans and kmeans is that
the former updates the cluster centres as items are assigned to clusters.
Only the centre of the cluster to which a new instance is assigned is
recomputed. The latter assigns all the instances to the clusters be-
fore re-computing the centres of the clusters. The relation between

those algorithms will be helpful in explaining the performance of the


algorithms.
All the discussed clustering algorithms have their strengths and limita-
tions. Generally, each of these algorithms will produce a good clustering
result if the assumptions and ideas the algorithm is based on match that
of the data set. A major difference between these algorithms are their
running times. Model-based clustering algorithms, such as EM-based
clustering and CEM-based clustering, and hierarchical clustering are
computationally expensive. Online clustering is fast and squared-error
clustering has an acceptable running time. So clustering algorithms
such as EM-based, CEM-based clustering and HAC are impractical for
clustering large data sets. The computation time of EM-based clustering
increases drastically with the number of desired clusters. The execution
time of SOM increases only slightly with the number of clusters - the size
of the SOM grid.
Of the partitioning clustering methods discussed, only the examples
of density based clustering and grid based clustering are useful in the
detection of clusters of arbitrary shapes and sizes. In both approaches,
identifying clusters is achieved by merging small dense clusters. The
main difference in these approaches is how they define and identify the
small clusters. DENCLUE, which is an example of density based clus-
tering, uses density distribution functions and identifies dense regions
by finding the local maxima of the overall density function. DBSCAN,
which is another example of a density-based clustering algorithm, localizes
points whose neighbourhood contains a number of items above a specified threshold.
STING, an example of grid based clustering, uses sufficient statistics
about grid cells for identifying the dense cells. These algorithms are de-
signed for spatial databases. They use efficient spatial data structures,
such as the R*-tree, for merging dense clusters. This makes them scalable
to large data sets. Hierarchical agglomerative clusters are also con-
structed by merging small clusters. But it is impractical to use HAC
for clustering large data sets because HAC does not use an efficient
data structure.
In the next section, we will study the issue of combining clustering
methods. We will specifically study how the merging of small clusters
can be efficiently adapted to the data set at hand.

3.4 Combining clustering methods


Clustering methods can be combined using two main approaches.

1. The first approach combines the clustering results produced by


pairs of clustering algorithms. It deduces new partitions of the
data set by studying the agreement in the clustering results pro-
duced by different clustering methods. In [32], various algorithms and
techniques for measuring the agreement between the partitions provided by
different clustering methods are studied. This approach will not be
considered in this thesis because it is computationally expensive.
2. The second approach combines ideas and techniques from different
clustering methods to derive new clustering techniques. The goal
is to use different ideas and techniques from different clustering
methods as building blocks for new clustering techniques to solve
the problem at hand. Two different architectures will be explored.
– The first involves initialising a clustering algorithm with the
partition produced by another clustering algorithm.
– The second clustering architecture consists in two levels. The
first level creates a large number of small clusters using one of
the studied clustering algorithms and the second level merges
the clusters created at the first level. This clustering archi-
tecture will be called two-level clustering.

3.4.1 Two-level clustering with kmeans


We use the two-level architecture in order to detect clusters of arbitrary
shapes and sizes. Because the distribution of the attacks is skewed,
producing a high number of small clusters will help us to identify small
attack clusters. Large clusters, consisting for example of normal
data, can be constructed by merging small clusters.
In this study, kmeans is used for the creation of the first level clusters.
In principle, the choice of clustering algorithm for the creation of the
clusters at the first level does not make a significant difference as long
as the clusters created are of high purity. Kmeans has been chosen

because it is fast compared to most of the other algorithms, and because


it has some properties that are essential for the success of the proposed
method. In the rest of this section, the first level clusters will be referred
to as basic clusters.
Merging basic clusters degrades the purity of the clustering. Our aim is
to merge clusters in such a way that the purity of the clusters degrades
as little as possible. As the attack labels are not known during the
clustering process, we do not have a way of directly measuring the
purity of the clustering. Other characteristics of the data will be used
to approximate the purity of clusters.
A cluster is said to be 100% pure if it contains attacks of exactly one
kind. Merging two 100% pure clusters that contain the same attack type
will not degrade the purity of the clustering. It will be assumed
that two basic clusters are of the same type and therefore can be merged
if the following two conditions are satisfied:
- the two clusters are close to each other,
- the two clusters have approximately the same density.
The first of these conditions is based on the assumption that data in-
stances of the same attack type are close to each other. The second
condition is based on the assumption that clusters of the same attack
type have approximately the same density. The density of a cluster is
defined as the average number of items in a specified radius ρ.

Estimation of the density of basic clusters:


Because kmeans is used for the creation of basic clusters, a basic clus-
ter size can be used as an approximation of the cluster density. This
is possible because kmeans is based on the implicit assumption that
clusters are spheres of identical radius δ. So by choosing ρ equal to δ
the cluster size can be used to estimate the cluster density.

Proof of the assumption regarding the shape and size of kmeans


clusters:
The estimate of the density of basic clusters produced by kmeans is
based on our assertion that kmeans assumes that clusters are spheres
of the same size. The goal of this section is to prove this assertion.
As we explained earlier, kmeans aims at minimizing the sum of squared-

errors criterion. Let us recall the expression of the sum of squared-errors
of a partition P:
SSE(P) = Σ_{k=1}^{K} Σ_{x∈C_k} ‖x − µ_k‖², where µ_k is the centre of
cluster C_k: µ_k = (1/|C_k|) Σ_{x∈C_k} x.

The purpose of the proof is to show that the minimization of the SSE is
equivalent to the maximization of a special case of the classification
likelihood criterion. This special case corresponds to the situation where
the model is a mixture of isotropic gaussians with identical standard
deviations and identical mixture proportions. In this special case, CEM aims
at finding clusters that are spheres of the same size. The expression of the
classification likelihood criterion, shown earlier, is:

κ(Θ|D) = Σ_{k=1}^{K} Σ_{x_ik∈C_k} log( α_k p(x_ik|µ_k, σ_k) ),

where C_k is the k-th cluster and µ_k, σ_k, α_k are respectively its mean,
standard deviation and mixture proportion.

In the case where the mixture proportions and standard deviations are
identical for all the clusters, we have α_k = 1/K and σ_k = σ for all k,
1 ≤ k ≤ K. So,

κ(Θ|D) = Σ_{k=1}^{K} Σ_{x_ik∈C_k} log( (1/K) p(x_ik|µ_k, σ) ),    (3.36)

which is equivalent to:

κ(Θ|D) = Σ_{k=1}^{K} Σ_{x_ik∈C_k} log p(x_ik|µ_k, σ) + R,    (3.37)

where R is a constant.
Using the expression of the isotropic gaussian, that is
p(x_ik|µ_k, σ) = (1/(√(2π)σ)^d) exp( −‖x_ik − µ_k‖²/(2σ²) ), we get:

κ(Θ|D) = Σ_{k=1}^{K} Σ_{x_ik∈C_k} ( −(1/(2σ²)) ‖x_ik − µ_k‖² − d log(√(2π)σ) ) + R    (3.38)

µ_k is the centre of the k-th cluster, because the maximum likelihood estimate
of the mean of a cluster is the centre of the cluster, as shown in the formula
of the M-step of the CEM algorithm. So,

κ(Θ|D) = −(1/(2σ²)) SSE(P) − N d log(√(2π)σ) + R,    (3.39)

where N is the size of the data set D and d is the dimension of D.
This last equation proves that minimizing the SSE is equivalent to maximizing
the classification likelihood criterion for a mixture of isotropic gaussians
with identical mixture proportions and identical standard deviations.

Merging basic clusters


In order to produce clusters of arbitrary shapes, basic clusters are linked
instead of being fused. The linking of basic clusters results in multi-
centered clusters. The fusion of basic clusters into one center-based
cluster would have produced spherical clusters. The distance between two
multi-centered clusters is defined as the distance between their closest
basic clusters. The distance between two basic clusters is defined as
the euclidean distance between their means. This distance measure has
been chosen because its computation is fast.

Selecting an optimal number of basic clusters :


The parameters that influence the quality of clustering with the two-
level approach are: the purity of the basic clusters and the number
of times basic clusters are linked. These two conditions are mutually
antagonistic. A high purity of basic clusters requires a large number of
basic clusters. But with a large number of basic clusters a high number
of linking operations are required. We need a mechanism for choosing
an optimal number of basic clusters.
Figure 3.7 illustrates how the classification accuracy obtained with two-
level clustering varies with the number of basic clusters. Figure 3.7
shows that the classification accuracy is highest when the number of
basic clusters is 200.
In order to choose the appropriate number of basic clusters, we study
how the SSE is related to the classification accuracy for kmeans clus-
tering. This study shows that SSE and classification accuracy vary in
a similar way with the number of clusters of kmeans. So we use SSE
for identifying the optimal number of basic clusters. As SSE measures

[Figure 3.7: Variation of the classification accuracy with the number of basic clusters.]

the compactness of the clusters, it makes sense to use it as a measure


of the homogeneity of the clusters. The identification of an optimal
number of clusters is achieved by plotting the variation of the SSE
with the number of clusters. The optimal number of basic clusters is chosen
in the region of the graph where the SSE begins to decrease very slowly. Let
us call this region of the graph ℜ. Selecting a point within region ℜ is
reasonable because the purity of the clusters does not vary significantly
from ℜ onwards, and because merging a high number of clusters decreases the
purity of the final clusters.
In short, the optimal number of basic clusters is found experimentally by
studying the variation of the SSE with respect to the number of clusters. If
the difference between the SSE values of two consecutive numbers of clusters,
say α and β, is below a specified threshold, either α or β is selected as a
reasonable number of basic clusters.
Figure 3.8 shows how the SSE varies with the number of clusters in kmeans.
Selecting the number of basic clusters within the interval [150, 250] is
reasonable.
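
This selection rule can be expressed as a small helper, sketched below in C++. It is
only an illustration of the rule just described; candidates and sse are assumed to
hold the numbers of clusters that were tried and the corresponding SSE values, and
threshold is the one mentioned above.

#include <cstddef>
#include <vector>

// Select a number of basic clusters from an SSE curve: return the first candidate
// number of clusters for which the decrease in SSE relative to the next candidate
// falls below the given threshold. candidates[i] is a number of clusters that was
// tried and sse[i] the SSE obtained with it (candidates sorted in increasing order).
int selectNumberOfBasicClusters(const std::vector<int>& candidates,
                                const std::vector<double>& sse,
                                double threshold) {
    for (std::size_t i = 1; i < sse.size(); ++i)
        if (sse[i - 1] - sse[i] < threshold)
            return candidates[i - 1];      // either candidate of the pair is reasonable
    return candidates.back();              // the SSE never flattened: keep the largest value tried
}

Applied to the curve of figure 3.8, such a rule points to the interval [150, 250]
discussed above.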
Some of the main strengths of two-level clustering are:

[Figure 3.8: Variation of the sum of squared-errors (SSE) with the number of clusters in kmeans.]

– It detects clusters of arbitrary shapes and sizes


– It is possible to adjust the quality of the clustering: by varying
the number of basic clusters

Some of the weaknesses are:

– When the number of basic clusters is high, the computation time may also be
high; however, it is not worse than that of most of the other clustering
algorithms considered in this study.
– Finding the optimal number of basic clusters is difficult. It may
require experimentation and this is time consuming.

Two-level clustering can be seen as a combination of kmeans and HAC. It also
makes use of ideas from density-based clustering when merging basic clusters.
In the following, we summarize the steps used for performing the two-level
clustering. As it combines kmeans, HAC and density-based clustering, we call
this algorithm KHADENS (Kmeans, HAc and DENSity).

1. Initialisation: Specify the number of basic clusters β (this is done through
   experimentation). Specify the minimum distance minDist and the minimum size
   ratio minDens that two basic clusters must satisfy in order to be merged.

2. Creation of the basic clusters: Create β clusters using the kmeans algorithm.

Iteration: Repeat step 3 until no change occurs.

3. Merging clusters: Start with the basic clusters. For each pair of clusters
   MC_1 and MC_2, merge them if there is a basic cluster bc_1 ∈ MC_1 and a basic
   cluster bc_2 ∈ MC_2 such that d(bc_1, bc_2) ≤ minDist and |bc_1|/|bc_2| ≥ minDens.
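
The merge test of step 3 can be sketched as follows in C++. This is an illustration
and not our experimental implementation: a multi-centered cluster is represented
simply as a vector of basic clusters (centre and size), and the size-ratio test is
written symmetrically (smaller size over larger size), which is one reading of the
minDens condition above.

#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Instance;

// A basic cluster produced by kmeans at the first level: its centre and its size,
// the latter being used as an approximation of the cluster density.
struct BasicCluster {
    Instance centre;
    std::size_t size;
};

// Euclidean distance between two basic cluster centres.
static double dist(const Instance& a, const Instance& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        s += diff * diff;
    }
    return std::sqrt(s);
}

// Merge test of step 3: two multi-centered clusters are linked if some pair of their
// basic clusters is closer than minDist and has comparable sizes.
bool shouldMerge(const std::vector<BasicCluster>& mc1,
                 const std::vector<BasicCluster>& mc2,
                 double minDist, double minDens) {
    for (std::size_t i = 0; i < mc1.size(); ++i)
        for (std::size_t j = 0; j < mc2.size(); ++j) {
            double a = static_cast<double>(mc1[i].size);
            double b = static_cast<double>(mc2[j].size);
            double ratio = (a < b) ? a / b : b / a;
            if (dist(mc1[i].centre, mc2[j].centre) <= minDist && ratio >= minDens)
                return true;
        }
    return false;
}

With this test, the iteration of step 3 repeatedly scans all pairs of multi-centered
clusters and links any pair for which shouldMerge returns true, until no more pairs
qualify.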

Another variation of this algorithm has been explored. In this variation,
which we call KHAC, the closest clusters are iteratively merged until the
desired number of clusters is reached. The main difference between
these two algorithms is that the size of basic clusters is not considered
when merging clusters in KHAC.
The running time of KHADENS and KHAC is mainly the time used
for the creation of the basic clusters. The merging of the basic clusters
is fast as it generally involves a small number of clusters.

3.4.2 Initialisation of clustering algorithms with


the results of leader clustering
Leader clustering is very fast and robust to outliers. It can, therefore, be
used for the identification of better initial cluster centres to be used in
each of the other algorithms. The procedure for initialising a clustering
algorithm CA with the leader clustering algorithm is the following: Let
K be the number of clusters desired by CA. The leader clustering is
used to cluster the data set into M clusters, where M ≥ K. Then the
centres of K of the clusters created by the leader algorithm are used as
initial centres for the clustering algorithm CA.
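
A sketch of this initialisation procedure is given below. It assumes that the leader
algorithm has already produced M ≥ K cluster centres (for example with the
leaderClustering sketch shown earlier); which K of the M leaders to keep is not
specified in the procedure above, so the sketch simply takes the first K.

#include <cstddef>
#include <vector>

typedef std::vector<double> Instance;

// Initial centres for a clustering algorithm that needs K centres, taken from the
// M >= K cluster centres produced by leader clustering (requires leaders.size() >= K).
std::vector<Instance> initialCentresFromLeaders(const std::vector<Instance>& leaders,
                                                std::size_t K) {
    return std::vector<Instance>(leaders.begin(),
                                 leaders.begin() + static_cast<std::ptrdiff_t>(K));
}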

3.5 Summary
In this chapter, different clustering methods have been discussed. A distinction has been made between clustering methods and clustering algorithms: a clustering method defines the general concept and theory the clustering is based on, while a clustering algorithm is a particular implementation of a clustering method. Examples of each of the considered clustering methods have been discussed. Most of the classical clustering algorithms considered in this thesis approach the clustering problem as an optimisation problem: they aim at optimising a global objective function and make use of an iterative process to solve the problem. Another group of algorithms does not approach the clustering problem as an optimisation problem. These algorithms view clusters as dense regions in the data space and identify clusters by merging small units of dense regions. This is the case for density-based clustering and grid-based clustering. A clustering architecture inspired by properties of both the optimisation-based clustering algorithms and the clustering algorithms that construct clusters by making local decisions has been proposed. This architecture takes into consideration the characteristics of the data set at hand. The discussed clustering algorithms, with the exception of DBSCAN, DENCLUE and STING, are used for our experiments, which are discussed in the next chapter.
Chapter 4

Experiments

This chapter describes and discusses the design and the execution of the experiments. This discussion is important in order to understand and explain aspects of the experiments that have an impact on the performance of the clustering algorithms.

– Section 4.1 discusses the design of the experiments.


– Section 4.2 discusses the data set and its feature set.
– Section 4.3 discusses implementation issues.
– Section 4.4 summarizes the chapter.

The experiments are conducted on a Pentium 4 processor with 1.5 GB of memory. The operating system is FreeBSD release 5.4.

4.1 Design of the experiments


The design and the implementation are modular. The programming language used for the implementation of the clustering algorithms and the experiments is C++. As an object-oriented programming language, C++ supports modularity. Furthermore, C++ is an efficient and flexible programming language. Efficiency is an important issue in our experiments because of the large size of the data set. The architecture of the system used for implementing the clustering algorithms and performing the experiments is composed of four modules: the data preparation module, the clustering algorithms module, the experiment module and the evaluation module.

– Data preparation module: this module puts the data in a form


that is easily used by the clustering algorithms. It transforms
non-numeric feature values into numeric feature values and it nor-
malizes the feature values.
– Clustering algorithm module: This module implements the distance measures and the clustering algorithms. Common and important clustering concepts have been encapsulated into classes so that they can be easily shared by the different clustering algorithms.
– Experiment module: This module implements operations related
to the execution of the experiment. It implements for example the
execution of the ten-fold cross validation. It also implements the
different indices to be used for evaluation of the algorithms.
– Evaluation module: On the basis of the evaluation indices com-
puted in the experiment module, the evaluation module compares
the clustering algorithms. It computes the means and standard
deviations of the indices and makes a paired t-test comparison.

4.2 Data set


The performance of clustering algorithms partly depends on the char-
acteristics of the data set. This section describes and discusses the data
set selected for the experiments.

4.2.1 Choice of data set


The data set chosen for the experiment is the KDD Cup 99 data set.
This data set is available at [16]. The KDD Cup 99 data set is a
processed version of a data set, developed under the sponsorship of the
Defense Advanced Research Projects Agency (DARPA) -of the USA- in
1998 and 1999, for the off-line evaluation of intrusion detection systems

(IDS) [17, 41]. Currently the DARPA data set is the most widely used
data set for testing IDS.
This DARPA project was prepared and executed by the Massachusetts
Institute of Technology (MIT) Lincoln Laboratory. MIT Lincoln Labs
set up an environment on a local area network that simulated a mili-
tary network under intensive attacks. The simulated network consists
of hundreds of users on thousands of hosts. Working in a simulated
environment made it possible for the experimenters to have complete
control of the data generation process. The experiment was carried out over nine weeks, and raw network traffic data, also called raw tcpdump data, was collected during this period.
The raw tcpdump data has then been processed into connection records
used in the KDD Cup 99 data set. The KDD Cup 99 data set contains
a rich variety of computer attacks. The full size of the KDD Cup 99 is
about five million network connection records. Each connection record
is described by 41 features and is labelled either as normal or as a
specific network attack. One of the reasons for choosing this data set is that the data set is standard. This will make it easy to compare the results of our work with other similar works. Another reason is that it is difficult to get another data set which contains as rich a variety of attacks as the one used here.
Some criticisms have been made about the generation of the DARPA data set. One of the strongest criticisms was made by J. McHugh in [42]. The network traffic generated in the DARPA data set has two components: the background traffic data, which consists of network traffic generated during normal usage of the network, and the attack data. According to McHugh, the generation process of the background traffic data has not been described explicitly by the experimenters. Therefore, there is no direct evidence that the background traffic matches the normal usage pattern of the network to be simulated. He made similar criticisms about the generation of the attack data: the intensive attacks the network has been subjected to do not reflect a real-world attack scenario.
Although some of these criticisms are important and can be useful for the generation of future off-line intrusion detection evaluation data sets, the DARPA data set has many strengths which still make it the best publicly available data set for the evaluation of intrusion detection systems.

4.2.2 Description of the feature set


The choice of the feature set is crucial for the success of clustering.
The goal of this section is to describe the feature set, and its ability to
discriminate normal patterns and attack patterns.

Generally, the construction of efficient features for intrusion detection is done either manually or by a semi-automated process. The manual construction of features uses only security domain knowledge, while semi-automated feature construction automates part of the feature construction process. To our knowledge, none of the existing feature construction methods for network attack detection fully automates the feature construction process.
Stolfo et al. in [14] used a semi-automated approach to identify useful features for discriminating normal patterns from attack patterns. Their approach is based on data mining concepts and techniques like link analysis and sequence analysis. Link analysis determines the relation between fields of a data record; sequence analysis identifies the sequential patterns between records. Their work has led to the feature set describing the data set used in this thesis.
In the following, we will first describe the feature set, then we will give
a brief explanation of the approach used in [14] to derive them and
finally a discussion of the discriminative capability of the feature set
will be presented. A short description of the feature set is available in
appendix B.1. The full description can be found in [16, 14].
There are 41 features, which fall into different categories: basic features
and derived features.

1. The basic features describe single network connections.
2. The derived features can be divided into content-based features and traffic-based features.
(a) The content-based features are derived using domain knowledge.
(b) The traffic-based features are obtained by studying the sequential patterns between the connection records as well as the correlation between basic features.

In order to construct the feature set, the raw tcpdump data has been
pre-processed into connection records. The basic features are directly
obtained from the connection records. The derived features fall into
two groups: the content-based features and the traffic based features.
Content-based features are used for the description of attacks that are
embedded in the data portion of the IP packet. The description of
these types of attacks requires some domain knowledge and cannot be
done only on the basis of information available in the packet header.
Most of these attacks are R2L and U2R attacks. Traffic based features
have been computed automatically; they are effective for the detection
of DOS and probe attacks. The different types of attack contained in
the data set are described in appendix C.
In order to derive the traffic features, Stolfo et al. made use of an
algorithm that identifies frequent sequential patterns. The algorithm
takes the network connection records, described by the current basic
features, as input and computes the frequent sequential patterns. The
frequent episodes algorithm is executed on two different data sets: an
intrusion-free data set and a data set with intrusions. Then these two
results are compared in order to identify intrusion-patterns.

The derived features are constructed on the basis of patterns that only appear in intrusion data records. Therefore, they are able to discriminate between normal and intrusion connection records. Although experience shows that the feature set considered here discriminates well between normal and intrusive patterns, it has some limitations when it is used for anomaly detection. Because the feature set has been derived on the basis of the intrusions in the training data set, the derived feature set cannot describe attacks not included in the training data set. The feature set is, therefore, more suitable for misuse detection than for anomaly detection. Another limitation of the feature set is that it may not discriminate well between normal data and attacks embedded in the data portion of the packet. The reason for this is that the feature set has been constructed primarily on the basis of information available in the header of the packet. The content-based features may not correctly describe attacks embedded in the data portion of the IP packet. These features have been derived from indices that characterize the session between two communicating hosts, and these indices may not be sufficient to capture the full nature of an attack embedded in the data portion.

Scaling and normalization of the feature values

The purpose of scaling the feature values is to avoid a situation where features with large and infrequent values dominate the small and frequent values during the computation of distances. Normalization scales all the feature values into the range [0, 1]. Some examples of scaling schemes are the linear scale, the logarithmic scale, and scaling using the mean and standard deviation of the feature.
The linear scale of a value x of feature $j$ is
$$\mathrm{Norm}(x) = \frac{x - \min_j}{\max_j - \min_j},$$
where $\min_j$ and $\max_j$ are respectively the minimum and the maximum value of feature $j$.
The logarithmic scale is $\mathrm{NormL}(x) = \mathrm{Norm}(\log(x) + 1)$.
And the third scaling scheme, based on the mean and standard deviation of feature $j$, is defined as
$$\mathrm{NormD}(x) = \frac{x - \mathrm{mean}_j}{\mathrm{stddev}_j}.$$
The advantage of the linear scale compared to the other two scaling schemes is its simplicity. Furthermore, the linear scale normalizes the feature values. For these reasons, the linear scale has been used for scaling the feature values.
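A small sketch of this linear scaling applied to one feature column is given below; the function name linearScale is illustrative, and mapping a constant feature to 0 is an assumption, not something specified in the thesis.

#include <algorithm>
#include <vector>

// Linearly scale one feature column into [0, 1]:
// Norm(x) = (x - min_j) / (max_j - min_j).
// If the feature is constant, all values are mapped to 0 (assumption).
void linearScale(std::vector<double>& column) {
    if (column.empty()) return;
    auto [mn, mx] = std::minmax_element(column.begin(), column.end());
    double minV = *mn, maxV = *mx;
    double range = maxV - minV;
    for (double& x : column) {
        x = (range > 0.0) ? (x - minV) / range : 0.0;
    }
}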

Handling categorical feature values

All the clustering algorithms considered in this thesis are appropriate


for numerical feature values. As the feature set of the KDD Cup 99
data set is a mixture of continuous and categorical values, we need
a mechanism for converting the categorical feature values to numeric
values. Converting from one feature type to another must be done with
care because it may result in the loss of information about the data.
This loss of information may affect the discriminative capacity of the
resulting feature set.

One way of quantifying a categorical feature value is to replace it by its frequency. For example, consider the feature that describes the transport protocol used for communication; two of its possible values are TCP and UDP. The categorical feature value TCP is converted to 0.6 if 60% of the connection records use the TCP protocol.
This conversion scheme has been used earlier in the implementation of CLAD [30], which is a cluster-based anomaly detection system. It is reasonable to use the frequency for quantifying categorical feature values because values that appear more frequently are less likely to be anomalous. The frequency can thus help us to separate normal connection records from attack connection records.
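A minimal sketch of this frequency encoding, assuming the categorical values of one feature are given as strings (the function name is illustrative):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Replace each categorical value by its relative frequency in the column.
// E.g. if 60% of the records use "tcp", every "tcp" becomes 0.6.
std::vector<double> frequencyEncode(const std::vector<std::string>& column) {
    std::unordered_map<std::string, std::size_t> counts;
    for (const std::string& v : column) {
        ++counts[v];
    }
    std::vector<double> encoded;
    encoded.reserve(column.size());
    for (const std::string& v : column) {
        encoded.push_back(static_cast<double>(counts[v]) / column.size());
    }
    return encoded;
}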
Another method for encoding categorical feature values is the so-called 1-to-N encoding scheme. In this scheme, each categorical feature is extended to N features, where N is the number of different values this feature can take. The value 1 (or 1/N in the normalized form) is set in the column corresponding to the observed category in the extended feature space, and the other columns of that feature are set to 0 to mark the absence of the other categories.
One of the problems with this encoding scheme is that it increases the dimension of the data space. How serious this problem is depends on the number of categorical features and on the number of different values each of them can take.
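A minimal sketch of this 1-to-N encoding of a single categorical feature, using the normalized form 1/N mentioned above (the names are illustrative):

#include <cstddef>
#include <string>
#include <vector>

// 1-to-N encoding of one categorical column: each of the N possible values
// gets its own column; the column of the observed value is set to 1/N and
// all other columns are set to 0.
std::vector<std::vector<double>> oneToNEncode(
        const std::vector<std::string>& column,
        const std::vector<std::string>& categories) {
    const std::size_t N = categories.size();
    std::vector<std::vector<double>> rows;
    rows.reserve(column.size());
    for (const std::string& v : column) {
        std::vector<double> row(N, 0.0);
        for (std::size_t j = 0; j < N; ++j) {
            if (categories[j] == v) {
                row[j] = 1.0 / N;
                break;
            }
        }
        rows.push_back(row);
    }
    return rows;
}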
Once the categorical feature values have been converted to numerical values and the feature values have been normalized, the Euclidean distance is used as the similarity measure between instances.

Usage of the data set

This section describes how the data set is used for the experiments.
A 10% version of the KDD Cup data set is also available at [16], and it is this version we use. The 10% version contains the same attack labels as the full version. It has been constructed by selecting, from the original data set, 10% of each of the most frequent attack categories and by keeping the smaller attack categories unchanged. The advantage of using this version of the data set is that it is smaller and therefore faster to process. Working with the original data set would have made the execution of the experiment impossible on the computation resources at our disposal.

About eighty percent of the data are attacks. Most of these attacks are DOS attacks: neptune and smurf. A large percentage of the data consists of duplicates. In order to reduce the size of the data set, we select only a low percentage of the smurf and neptune data. This new distribution of attack and normal labels is closer to a real-life scenario. Most research in unsupervised anomaly detection makes some assumptions about the data set; without such assumptions the task of unsupervised anomaly detection is not possible. The subset selected for the experiments consists of 10% attacks and 90% normal data. Table 4.1 shows the distribution of attack categories for this data set.
For each of the 10 phases in the ten-fold cross validation, each clustering algorithm is run 3 times. We proceed in this way because most of the algorithms are randomly initialised and the result of the clustering depends on the initial values. As mentioned above, the instances of the data set are labelled either as normal or as a specific attack category. The labels are not used during clustering; they are only used during the evaluation of the clustering algorithms.

4.3 Implementation issues


The implementations of the clustering algorithms have been kept simple because we have focused on highlighting the basic ideas in each of the clustering algorithms. We have avoided optimisation techniques that could possibly influence the clustering result.
In the implementation of two-level clustering, no significant difference has been observed between successively linking the two closest clusters until the desired number of clusters is reached (KHAC) and the merging approach described for KHADENS. So, for simplicity, KHAC has been used in our experiments. One possible reason why the two merging strategies produce similar results may be that the closest clusters have almost similar sizes.

attack type number percentage


normal 107011 89.94
back 2424 2.04
buffer overflow 33 0.03
ftp write 8 0.007
guess password 59 0.05
imap 13 0.011
ipsweep 1370 1.15
land 21 0.018
loadmodule 10 0.008
multihop 7 0.006
neptune 649 0.54
nmap 253 0.21
perl 3 0.002
phf 4 0.003
ping of death(pod) 290 0.24
portsweep 1146 0.96
rootkit 11 0.009
satan 1077 0.90
smurf 1693 1.42
spy 2 0.002
teardrop 1077 0.90
warezclient 1120 0.94
warezmaster 22 0.002
TOTAL 118980 100

Table 4.1: Distribution of labels in the data set



For each of the clustering algorithms, various tests have been performed
in order to select the best parameter values. The experiments have been
performed with the best parameter values identified.

4.4 Summary
This chapter has covered the design and execution of our experiments.
Special attention has been paid to the data set and feature set used.

– The data set used is a slightly modified version of the KDD Cup 99 data set. The feature values have been scaled and normalized using a linear scale. The categorical feature values have been transformed to numeric values using a frequency encoding.
– For each of the clustering algorithms, different tests have been performed in order to choose the best set of parameters.
– The limitations of the feature set for unsupervised anomaly detection have been discussed. Firstly, the algorithm used for the construction of the features relies on the existence of an attack-free data set, but the fact that it is difficult to obtain an attack-free data set is the main motivation for performing unsupervised anomaly detection. So, for the purpose of unsupervised anomaly detection, we need some other method to compute the feature set. Secondly, for the purpose of anomaly detection it is the normal traffic patterns we want to describe and not the attacks, so it would be more appropriate to construct features that describe the normal patterns rather than the attacks.

In the next chapter, we evaluate the clustering algorithms.


Chapter 5

Evaluation of clustering
methods

In this chapter the studied clustering algorithms are compared experi-


mentally.

– Section 5.1 describes the evaluation methodology used.


– Section 5.2 discusses the evaluation measures.
– Section 5.3 discusses the usage of the k-fold cross validation method.
– Section 5.4 presents and analyses the results of the experiments.
– Section 5.5 summarizes the chapter.

5.1 Evaluation methodology


The clustering algorithms are evaluated on the basis of external in-
dices. External evaluation is possible because data labels are available.
Because the considered clustering algorithms are instances of different
clustering methods, external evaluation is the correct method for evalu-
ating the algorithms. Evaluating the algorithms on the basis of internal
indices, such as the sum of squared-errors, is not appropriate. This is
because internal indices are generally based on assumptions about the
clustering methods used or about the data set. For example, using
the sum of squared-errors as a measure of compactness and evaluating


the clustering algorithms using it will provide favorable conditions for


squared-errors clustering algorithms such as kmeans.
The methodology used for comparing the clustering algorithms is described below.
We use ten-fold cross validation. Different experiments in the clustering literature, such as [36], have shown that ten folds are appropriate when performing k-fold cross validation. The clusters produced during the training phase are used as a classifier for the classification of the test data. The same assignment method and measure used for assigning instances to clusters during training are used for assignment during the test. The idea in using cross validation is to measure the generality of the clusters produced during the training phase. The k-fold cross validation method is described in the next section. As all of the studied clustering algorithms depend on the initialisation values, the clustering algorithms are run three times for each pair of training and test data sets of the ten-fold cross validation. Running the experiments thus produces 30 values for each of the evaluation indices and for each of the clustering algorithms. Then, for each of the clustering algorithms, the average and the standard deviation of the 30 indices are computed. A paired t-test is used to compare each pair of clustering algorithms; it estimates the statistical significance of the difference in performance between the two algorithms. In order to evaluate the performance of each of the studied clustering algorithms individually, each of them is also compared to the result of a random clustering of the data set. The random clustering is done by assigning instances to clusters randomly.
The experiments are run for two different numbers of clusters: 23 and 49. 23 is the number of categories in the data set. We chose 49 arbitrarily in order to study how the algorithms perform with another number of clusters.
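As a sketch of the paired t-test used in this comparison, the following computes only the t statistic on the paired differences of two algorithms' index values; the significance would then be looked up in a t-table with n − 1 degrees of freedom. The function name is illustrative, and n ≥ 2 is assumed (here n = 30).

#include <cmath>
#include <cstddef>
#include <vector>

// Paired t statistic for two algorithms evaluated on the same n runs:
// t = mean(d) / (sd(d) / sqrt(n)), where d_i = a_i - b_i.
// The result is compared against a t-table with n - 1 degrees of freedom.
double pairedTStatistic(const std::vector<double>& a,
                        const std::vector<double>& b) {
    const std::size_t n = a.size();          // assumed: n >= 2 and a.size() == b.size()
    double meanDiff = 0.0;
    for (std::size_t i = 0; i < n; ++i) meanDiff += a[i] - b[i];
    meanDiff /= n;

    double var = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double d = (a[i] - b[i]) - meanDiff;
        var += d * d;
    }
    var /= (n - 1);                          // sample variance of the differences
    return meanDiff / std::sqrt(var / n);    // t statistic
}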

5.2 Evaluation measures


This section discusses the choice of evaluation measures.

5.2.1 Evaluation measure requirements


The goal in clustering network traffic data for anomaly detection is to create clusters that ideally consist of a single category. The category is either a specific attack, such as smurf, or normal.
So the different types of attack/normal categories identified by a clustering algorithm are a good indication of how well the algorithm performs this task. We do not expect the clustering algorithms to produce clusters that are 100% pure, so the attack category of a cluster is defined to be the label that occurs most often in the cluster. The percentage represented by the majority label in a cluster is another indication of how pure that cluster is.
It is also useful to measure whether a cluster contains few attack categories or several attack categories.
These requirements lead to the choice of three evaluation measures that will be studied in the next section. Each of these evaluation measures covers one of the requirements. The first one is the number of different categories: it counts the number of different attack or normal categories found by the clustering algorithms. The second is the classification accuracy: it computes the proportion of the majority label in the cluster. And the third measure is the cluster entropy, which estimates the homogeneity of the clusters.

5.2.2 Choice of evaluation measures


Some of the classical external validation measures found in the literature [23] are the Jaccard, Hubert, Rand and Corrected Rand indices. But these measures do not match our requirement that they should measure the purity of clusters. The measures used in this thesis are: the count of cluster categories, the classification accuracy and the cluster entropy.
The count of cluster categories is the number of different cluster categories found by the clustering algorithm. The category of a cluster is defined as the label that occurs most often in the cluster.
The classification accuracy of a cluster is defined as the proportion represented by the majority label in this cluster. The overall classification accuracy of the clustering is defined as the weighted mean of the classification accuracies of the clusters produced by this clustering.
The cluster entropy has been introduced in [37]. This measure captures the homogeneity of the clusters. Clusters which contain data from several different attack classes have a high entropy, and clusters which contain only a few attack classes have a low entropy, close to zero. The overall cluster entropy is the weighted mean of the cluster entropies.

Classification accuracy

The classification accuracy of a cluster is the proportion of the label most often found in that cluster, that is,
$$\mathrm{clusterAccuracy} = \frac{\text{size of majority label}}{\text{size of cluster}}. \qquad \text{(5.1)}$$
And the overall classification accuracy of the clustering is the weighted mean of the classification accuracies of the clusters. The weight of a cluster is its size divided by the total number of instances.

Cluster entropy

The entropy at the cluster level captures the homogeneity of the cluster. The entropy of a cluster is defined as
$$E_{\mathrm{cluster}_i} = -\sum_j \frac{N_{ji}}{n_i} \log\!\left(\frac{N_{ji}}{n_i}\right),$$
where $n_i$ is the size of the $i$-th cluster and $N_{ji}$ is the number of instances of cluster $i$ which belong to class label $j$.
And the overall cluster entropy is the weighted sum of the cluster entropies:
$$E_{\mathrm{cluster}} = \sum_i \frac{n_i}{N} E_{\mathrm{cluster}_i},$$
where $N$ is the total size of the data set and $n_i$ is the number of instances in cluster $i$.

The cluster entropy is lowest when the clusters consist of a single data type and it is highest when the proportion of each data category in the clusters is the same.
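A minimal sketch of these two per-cluster measures, computed from the label counts $N_{ji}$ of a single non-empty cluster (names are illustrative); the overall values reported in the tables are then the size-weighted means of these per-cluster scores.

#include <cmath>
#include <cstddef>
#include <vector>

// Per-cluster measures computed from the label counts N_ji of one cluster i.
struct ClusterScores {
    double accuracy; // size of majority label / size of cluster
    double entropy;  // -sum_j (N_ji / n_i) * log(N_ji / n_i)
};

// Assumes the cluster is non-empty (total > 0).
ClusterScores scoreCluster(const std::vector<std::size_t>& labelCounts) {
    std::size_t total = 0, majority = 0;
    for (std::size_t c : labelCounts) {
        total += c;
        if (c > majority) majority = c;
    }
    double entropy = 0.0;
    for (std::size_t c : labelCounts) {
        if (c == 0) continue;              // 0 * log(0) is taken as 0
        double p = static_cast<double>(c) / total;
        entropy -= p * std::log(p);
    }
    return {static_cast<double>(majority) / total, entropy};
}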

5.3 k-fold cross validation


K-fold cross validation is used in classification to evaluate the accuracy of classifiers. It consists in randomly dividing the data set into k disjoint subsets of approximately equal size. The classifier is trained and tested k times. Each training uses k-1 of the subsets and the subset left out is used for testing. K-fold cross validation can be adapted to clustering.
As with classification, the system is trained and tested k times. The training consists in clustering k-1 of the subsets. The subset left out is used for the test. The test is done by assigning the instances of the test data set to the clusters produced during training. The same assignment method and measure used for assigning instances to clusters during training are also used for assignment during testing. During the assignment in the test phase, the characteristics of the clusters are not updated; for example, the means of the clusters are not recomputed. So the performance of a clustering indicates how well the algorithm performed during both the training and the test phases.
Using an independent test data set makes it possible to evaluate the robustness and the generality of the clusters produced by the clustering algorithms. After all, the goal of the off-line clustering we perform is to create clusters that will be used for classifying new data. Therefore it is important that the clustering algorithms are able to correctly classify an independent data set.
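A small sketch of the fold construction used here, assuming instances are identified by their index; the shuffling seed and the function name are illustrative.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Randomly partition n instance indices into k folds of roughly equal size.
// Each fold is used once as the test set while the remaining k-1 folds are
// clustered (the training phase).
std::vector<std::vector<std::size_t>> makeFolds(std::size_t n, std::size_t k,
                                                unsigned seed = 42) {
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), std::mt19937(seed));

    std::vector<std::vector<std::size_t>> folds(k);
    for (std::size_t i = 0; i < n; ++i) {
        folds[i % k].push_back(idx[i]);
    }
    return folds;
}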

5.4 Discussion and analysis of the experiment results

This section analyses the results of the experiments.

5.4.1 Results of the experiments

Figures 5.1, 5.2, 5.3, 5.4, 5.5 and 5.6, located from page 77 to page 85, show the experiment results. The data used for the generation of the histograms in these figures are found in appendix E. In these histograms, the notation L + clustAlgo, where clustAlgo is one of the clustering algorithms, refers to initializing clustAlgo with the centres of the clusters produced by the leader clustering algorithm. And the notation fuzzy K refers to the fuzzy kmeans algorithm.

Figure 5.1: The classification accuracy of the clustering algorithms in tables E.1 and E.2. L+kmeans refers to leader + kmeans and fuzzy K refers to fuzzy kmeans. The number of clusters is 23.

When the number of clusters is 23, figures 5.1, 5.3 and 5.2 show that the two-level clustering (KHAC), and SOM or kmeans initialised with the clustering results of the leader clustering algorithm, give the best classification accuracies. The clusters identified by these clustering approaches are homogeneous: the majority of the items in each of these clusters are from the same attack/normal category. The clusters produced by these algorithms also represent a wider variety of attack categories. When initialised with the results of the leader clustering algorithm, the performances of kmeans and SOM are very similar.

When the number of clusters is 49, figures 5.4, 5.6 and 5.5 show that KHAC, leader clustering and the combination of leader clustering with any of the other algorithms, except EM, have the best classification accuracies, and that the clusters found by these algorithms represent a larger variety of attack categories. Although initialising any of the other algorithms with the leader clustering improves the performance of that algorithm, these combinations do not perform significantly better than the leader clustering alone. This is true for the classification accuracies, the cluster entropies and the number of cluster categories. Because most of the studied clustering algorithms, except the leader clustering, are slow, using only the leader clustering seems more appropriate than using any of the other algorithms, either alone or in combination with the leader clustering algorithm.
The homogeneity of the clusters produced by kmeans is slightly better than that of any of the other algorithms. The homogeneity of the clusters produced by fuzzy kmeans, EM-based clustering and CEM clustering is poorer than that of the other algorithms.
For both 23 and 49 clusters, each of the clustering algorithms outperforms random clustering.
The performance of the EM-based clustering algorithm is not so impressive.

Some of the conclusions that can be drawn from these results are:

– The performance of the clustering algorithms depends on the number of clusters to be found. The difference in the performance of the clustering algorithms decreases as the number of clusters increases. This indicates that, for a large number of clusters, other criteria such as the running time can be used to guide the selection of a clustering algorithm.

– The leader clustering is of significance in clustering network traffic data for anomaly detection: it achieves good performance for each of the evaluation measures considered, independently of the number of clusters to be found. Furthermore, it is very fast, and using it for initializing the other algorithms improves the performance of these algorithms significantly. This improvement is especially impressive in the case of the SOM algorithm.
– When the desired number of clusters is small, the two-level clustering seems to be a good choice of algorithm.

These conclusions can be reformulated as follows:

– KHAC is a good choice of clustering algorithm when the desired


number of clusters is small.
– Leader clustering is more appropriate for a high number of clusters.

5.4.2 Analysis of the experiment results


It seems that the clustering algorithms that create clusters one at a time, e.g. leader clustering and KHAC, perform better than the others. One possible explanation is related to the skewed distribution of attack categories. The group of algorithms that performs poorly consists of algorithms that are randomly initialised, and their performance depends on the initial choice of cluster centres. When the initial cluster centres are selected randomly from the data set, the chance that representatives from different categories will be picked out is not equal for each category: the categories that are in the majority are more likely to be selected. This explains why initializing with leader clustering improves the performance of those algorithms. KHAC and leader clustering are not initialised in this way, so this problem does not affect them. It may be preferable to initialise the clustering algorithms by choosing totally random values rather than by randomly choosing items from the data set. Another observation which tends to confirm the above explanation is the fact that the performances of most of the studied algorithms are similar for a high number of clusters. That is because, with a high number of clusters, the chance of selecting representatives of different attack categories as initial centres is high.

The good performance of KHAC can also be related to the fact that it is the only one of the studied algorithms that is able to detect clusters of arbitrary shape. The penalty for approximating the shape of clusters incorrectly is higher for large clusters than for small clusters. This could explain why KHAC has a good performance when the number of clusters is low.
The EM-based algorithm did not produce good results compared to most of the other algorithms. This was surprising, because most of the other algorithms can be explained as special cases of EM-based clustering. One possible explanation for the poor performance of EM-based clustering may be that the mixture of isotropic Gaussians does not match the underlying model of the data. But this explanation does not seem to hold, because the classification EM clustering, which also assumes that the components of the model are non-overlapping isotropic Gaussians, gives better results. Nor could we relate the poor performance of EM-based clustering to the fact that it assumes overlapping clusters, because the fuzzy kmeans algorithm, which also makes an assumption of overlapping clusters, has a much better performance.
We conclude that the poor performance of EM-based clustering is related to some parameters of the EM-based clustering algorithm that may not have been chosen correctly. For example, the numbers of clusters considered in our experiments may not be optimal for the EM-based clustering. Alternatively, it may simply be that this clustering algorithm is not appropriate for this task. The EM-based clustering is also less attractive for this task because of its high computation time.

Figure 5.2: The number of different cluster categories found by the algorithms
when the number of clusters is 23. The total number of labels contained in
the data set is 23.

Figure 5.3: The cluster entropies when the number of clusters is 23. The
cluster entropy measures the homogeneity of the clusters. The lower the
cluster entropy is the more homogeneous the clusters are.

Figure 5.4: The classification accuracy of the clustering algorithms in tables E.3 and E.4. The number of clusters is 49.

Figure 5.5: The number of different cluster categories found by the algorithms
when the number of clusters is 49. The total number of labels contained in
the data set is 23.

Figure 5.6: The cluster entropies when the number of clusters is 49.
Chapter 6

Conclusion

6.1 Resume
In this thesis, we have:

– discussed issues of network security and in particular issues con-


cerning unsupervised anomaly detection. We have discussed how
clustering can be used to solve this problem.
– discussed the clustering problem and the most common clustering methods. Examples of clustering algorithms have also been discussed.
– implemented and compared classical clustering methods. The classical clustering algorithms considered for this study are: standard kmeans, fuzzy kmeans, Expectation Maximization (EM) based clustering, Classification Expectation Maximization based clustering, Kohonen self-organizing feature maps and leader clustering.
– investigated two combinations of clustering methods.
∗ The first one uses the results of the leader clustering algorithm
for the initialization of each of the other studied algorithms.
This method improves significantly the performance of these
algorithms.
∗ The second combination is a technique we have proposed. Essentially, this technique is a combination of kmeans and Hierarchical Agglomerative Clustering. We call this combination KHAC. The purpose of KHAC is to create a large number of small clusters using kmeans and then to merge these small clusters in a similar fashion to hierarchical agglomerative clustering. The advantage of this clustering technique is its ability to detect arbitrarily shaped clusters. We found that KHAC gives better results than most of the other studied algorithms. The performance of KHAC is especially impressive for small numbers of final clusters.

On the basis of our results, we can say that clustering can be successfully used for unsupervised anomaly detection. Some of the clustering algorithms are more appropriate for this task than others. We investigated the potential of the leader clustering algorithm. This algorithm is very simple and fast and produces good clustering results compared to most of the other studied algorithms. When leader clustering is used for initializing the other clustering algorithms included in this thesis, the clustering results of these algorithms improve significantly.

6.2 Achievements
The main goal of the thesis has been to investigate the efficiency of dif-
ferent classical clustering algorithms in clustering network traffic data
for unsupervised anomaly detection. The clusters obtained by cluster-
ing the network traffic data set are intended to be used by a security
expert for manual labelling. A second goal has been to study some
possible ways of combining these algorithms in order to improve their
performance. We can say that these goals have been achieved. The
results of our experiments have given us an indication of which cluster-
ing algorithms are good for this task and which ones are less suitable
for this task. Furthermore, we have studied ways of combining cluster-
ing ideas in order to efficiently solve the problem. We have found out
that, when the number of clusters is low, KHAC, which is a combination of clustering concepts we have proposed, produces better results than most of the other studied algorithms. Our data shows the potential of the leader clustering algorithm in performing this task. Clustering algorithms similar to the leader clustering algorithm have been successfully used in some earlier works [6, 30] for clustering network traffic data. The reasons for using this particular algorithm have not been explicitly stated in these works. In conclusion, we can say that leader clustering is to be preferred, not only because it is fast but also because it performs better than most of the other clustering algorithms. So leader-like clustering algorithms could be investigated further in future research on unsupervised anomaly detection. What makes them especially attractive is their scalability to large data sets. And KHAC seems attractive when the number of clusters is low.

6.3 Limitations
One of the limitations of this thesis is that it has not been possible to validate the conclusions of the experiments against a real-life data set. This has not been possible because of the difficulty of acquiring such a data set.

6.4 Future work


This work will serve as a first step in building a complete cluster-based
anomaly detection system.
Bibliography

[1] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur and J. Srivastava. A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection.
[2] A.K. Jain, M.N. Murty and P.J. Flynn. Data clustering: A Review. ACM Computing Surveys, Vol. 31, No. 3, September 1999.
[3] Richard O. Duda, Peter E. Hart and David G. Stork. Pattern Classification. John Wiley & Sons, second edition, 2001.
[4] Eleazar Eskin. Anomaly Detection over noisy data using learned probabil-
ity distribution, located at: http://citeseer.ist.psu.edu/eskin00anomaly.html,
2000.
[5] Arthur Dempster, Nan Laird, and Donald Rubin. Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical So-
ciety B, 39, 1-38, 1977.
[6] Leonid Portnoy. Intrusion detection with unlabeled data using clustering.
located at: http://citeseer.ist.psu.edu/574910.html, 2001
[7] E. Eskin, A. Arnold, M. Prerau, L. Portnoy and S. Stolfo. A geometric Frame-
work for Unsupervised Anomaly Detection: Detecting Intrusion in Unlabeled
Data. available at: http://www.cs.cmu.edu/˜ aarnold/ids/uad-dmsa02.pdf,
2002
[8] Dorothy E. Denning. An intrusion detection model. IEEE Transactions on Software Engineering, Vol. SE-13, No. 2, February 1987, pages 222-232. Also located at: http://www.cs.georgetown.edu/ denning/infosec/ids-model.rtf.
[9] S. T. Brugger. Data mining methods for network intrusion detection,
University of California, Davis, appeared in ACM and available at:
http://www.bruggerink.com/ zow/papers/bruggerd mnids urvey.pdf, 2004.


[10] S.B. Kotsiantis and P.E. Pintelas. Recent Advance in Clustering: a brief review.
[11] W. Lee, S. Stolfo and K. Mok. Mining in a data-flow environment: Experience in network intrusion detection. In Proc. 5th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 114-124, 1999.
[12] J. He, A.H. Tan, C.L. Tan, and S.Y. Sung. On Quantitative Evaluation
of Clustering Systems. In W.Wu, H. Xiong and S. Shekhar Clustering and
Information retrieval(pp. 105-133), Kluwer Academic Publishers, 2004.
[13] K. Kendall. A database of Computer Attacks for the Evaluation of Intrusion
Detection Systems, Master thesis, Massachusetts Institute of Technology,
1999.
[14] W. Lee and S.J. Stolfo, A Framework for constructing features and models for
intrusion detection systems, ACM Transactions on Information and System
Security, Vol.3 No.4, November 2000, pages 227-261.
[15] The internet traffic archive ( 2000): http://ita.ee.lbl.gov
[16] KDD cup 99. Located at: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[17] DARPA. Located at: http://www.ll.mit.edu/IST/ideval/
[18] D.P. Mercer. Clustering large datasets, October 2003.
[19] I. Costa, F. de Carvalho and Marcilio C.P. de Souto. Comparative analysis of clustering methods for gene expression time course data.
[20] Boris Mirkin. Mathematical Classification and Clustering. Kluwer Academic Publishers, 1996.
[21] Robert Rosenthal and Ralph L. Rosnow. Essentials of Behavioral Research: Methods and Data Analysis, second edition, 1991.
[22] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
[23] Anil K. Jain, Richard C. Dubes. Algorithms for clustering data, Prentice
Hall, 1988.
[24] Gilles Celeux and Gerard Govaert. A classification EM algorithm for clustering and two stochastic versions. INRIA, 1991.
[25] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach, second edition, Prentice Hall, 2003.

[26] Thomas P. Minka. Expectation maximization as


lower bound maximization, 1998, tutorial located at:
http://research.microsoft.com/˜ minka/papers/em.html
[27] Alexander Hinneburg and Daniel A. Keim. An efficient approach to clustering in large multimedia databases with noise. American Association for Artificial Intelligence (www.aaai.org), 1998.
[28] Wei Wang, Jiong Yang and Richard Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. In Twenty-Third International Conference on Very Large Data Bases, pp. 186-195, Athens, Greece. Morgan Kaufmann, 1997.
[29] Ben Krose and Patrick van der Smagt. Artificial Neural Networks, eighth edition, November 1996.
[30] Philipp K. Chan, Matthew V. Mahoney and Muhammad H. Arshad. Learning rules and clusters for anomaly detection in network traffic. Located at: http://www.cs.fit.edu/ pkc/papers/cyber.pdf, Florida Institute of Technology and Massachusetts Institute of Technology.
[31] Sushmita Mitra and Tinku Acharya. Data Mining: Multimedia, Soft Computing and Bioinformatics. Wiley Interscience, 2003.
[32] Ludmila I. Kuncheva. Combining pattern classifiers, methods and algorithms.
Wiley Interscience, 2004.
[33] A. Ben-Hur, A. Elisseeff and I. Guyon. A stability based method for discovering structure in clustered data. In Proc. Pacific Symposium on Biocomputing, 2002, pp. 6-17.
[34] Wenke Lee, Salvatore J. Stolfo and Kui W. Mok. Mining in a data-flow environment: Experience in network intrusion detection, March 1999.
[35] Costa et al. Comparative analysis of clustering methods for gene expression time course data, August 2004.
[36] Ron Kohavi. A study of Cross-validation and Bootstrap for Accuracy Estima-
tion and Model Selection, International Conference on Artificial Intelligence,
1995.
[37] Daniel Boley. Principal direction divisive partitioning. Data Mining and
Knowledge Discovery, 2(4) :325-344, 1998,
[38] Michalis Vazirgiannis, Maria Halkidi, Dimitrios Gunopulos. Uncertainty han-
dling and quality assessment in data mining, Springer 2003.

[39] Wenke Lee and S. J. Stolfo. Data Mining Approaches for Intrusion Detection,
1998.
[40] Stefano Zanero and Sergio M. Savaresi. Unsupervised learning techniques for
an intrusion detection system, ACM March 2004.
[41] R. Lippmann, J.W. Haines, D.J. Fried, J. Korba and K. Das. The 1999 DARPA Off-line Intrusion Detection Evaluation. Lincoln Laboratory, MIT, 2000.
[42] John McHugh. Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Laboratory. ACM Transactions on Information and System Security, Vol. 3, No. 4, November 2000, pages 262-294.
[43] E. Eskin,M. Miller,Z. Zhong,G. Yi, W. Lee, and S. Stolfo. Adaptive model
generation for intrusion detection systems.
[44] http://www.cert.org/stats/cert stats.html#incidents
[45] https://www.cert.dk/artikler/artikler/CW30122005.shtml
[46] Martin Ester, Hans-Peter Kriegel,Jorg Sander,Xiaowei Xu. A density-based
clustering algorithm for discovering Clusters in Large Spatial databases with
noise. Proceedings of 2nd international Conference on Knowledge Discovery
and Data Mining, 1996.
[47] Teuvo Kohonen. Self-organizing maps, 2nd edition Springer, 1997.
[48] A. Ultsch and C. Vetter. Self-organizing feature maps versus statistical clustering methods: A benchmark. University of Marburg, Research Report 0994. Located at: http://www.mathematik.uni-marburg.de/d̃atabionics/de//downloads/papers/ultsch94benchmark.pdf [accessed 15/02/2006].
[49] Ross J. Anderson. Security Engineering: A guide to building dependable
distributed systems. John Wiley & Sons, 2001.
[50] A. Wespi, G. Vigna and L.Deri. Recent Advances in Intrusion Detection.
5th International Symposium, Raid 2002 Zurich, Switzerland, October 2002
Proceedings. Springer.
[51] D. Gollmann. Computer Security. John Wiley & Sons, 1999.
[52] P. Giudici. Applied Data Mining: Statistical Methods for Business and Industry. Wiley, 2003.
[53] Bjarne Stroustrup. The C++ programming language, third edition, Addison-
Wesley, 1997.

[54] http://www-iepm.slac.stanford.edu/monitoring/passive/tcpdump.html
Appendix A

Definitions

A.1 Acronyms
DOS: Denial of service attacks.
OS: Operating systems.
IDS: Intrusion detection systems.
NIDS: Network intrusion detection systems.
pod: Ping of Death.
IP: Internet Protocol.
TCP: Transmission Control Protocol.
UDP: User Datagram Protocol.
ICMP: Internet Control Message Protocol.
HTTP: Hypertext Transfer Protocol.
FTP: File Transfer Protocol.

A.2 Definitions
Network Traffic
In this thesis, network traffic refers to the transfer of IP packets through network communication channels.

Firewalls
Firewalls are security systems protecting the boundary of an internal network.

To broadcast a message
To broadcast a message consists in delivering that message to every host on a given network.

Ping
A program that is used to test if a connection can be established to a
remote host.

Protocol
A protocol specifies how modules running on different hosts should communicate with each other.

Host
A host is a synonym for computer.

CGI scripts
A CGI (Common Gateway Interface) script is a program running on a server which can be invoked by a client through the CGI interface.

Client and Server


On a network, a client is the host that requests service from another
host. And the host delivering the service is called the server.

TCP connection
A TCP connection is a sequence of IP packets flowing from the packet
sender to the packet receiver under the control of a specific protocol.
The duration of the connection is limited in time.

Tcpdump
A tcpdump is a log obtained by monitoring network traffic. Different tools exist for sniffing network traffic. One such tool, which has been used for collecting the network traffic data used in this thesis, is the program called TCPDUMP [54].

Data mining
Data mining is the process of extracting useful models from large volumes of data.
Appendix B

Feature set

B.1 The feature set of the KDD Cup 99 data set

Tables B.1, B.2 and B.3 respectively describe the basic features, the content-based features and the traffic-based features of the KDD Cup 99 data set.


name of feature description feature type


duration the length of the connection in seconds continuous
protocol-type the type of -transport- protocol used symbolic
service the network service e.g. http symbolic
src bytes the number of bytes sent from source to destination continuous
dst bytes number of bytes from destination to source continuous
flag indicate a normal or error status of the connection symbolic
land check if source and destination are the same symbolic
urgent number of urgent packets continuous
wrong fragments the number of wrong fragments continuous

Table B.1: Basic features of the KDD Cup 99 data set

name of feature description feature type


hot number of hot indicators continuous
num failed logins number of unsuccessful logins continuous
logged in indicates whether logged in successfully or not symbolic
num compromised number of compromised conditions continuous
root shell indicate whether a root shell is obtained or not symbolic
su attempted set to 1 if attempt to switch to root else 0 symbolic
num roots number of root accesses continuous
num file creation number of file creation actions continuous
num shells number of shell prompts continuous
num access files number of operations on access control files continuous
num outbound cmds number of outbound commands in an ftp session continuous
is hot login indicate whether the login is hot or not symbolic
is guest login indicate whether it is a guest login or not symbolic

Table B.2: Content-based features



name of feature description feature type

count number of connections to same host continuous
serror rate %con. to same host with SYN errors continuous
rerror rate %con. to same host with REJ continuous
same srv rate %con. to same host with the same service continuous
diff srv rate %con. to same host with different services continuous
srv count number of con. to the same service continuous
srv serror rate %con. to same service with SYN errors continuous
srv rerror rate %con. to same service with REJ errors continuous
srv diff host rate %con. to same service on different hosts continuous
dst host count number of connections to same host continuous
dst host serror rate %con. from dst. to same host with SYN errors continuous
dst host rerror rate %con. from dst. to same host with REJ continuous
dst host same srv rate %con. from dst. to same host with the same service continuous
dst host diff srv rate %con. from dst. to same host with different services continuous
dst host srv count number of con. from dst. to the same service continuous
dst host srv serror rate %con. from dst. to same service with SYN errors continuous
dst host srv rerror rate %con. from dst. to same service with REJ errors continuous
dst host srv diff host rate %con. from dst. to same service on different hosts continuous
dst host same src port rate %con. from dst. to the same source port continuous

Table B.3: Traffic-based features


Appendix C

Computer attacks

Here is a list of the computer attacks considered in this thesis:

C.1 Probe attacks


– Ipsweep
probes the network to discover available services on the network.

– Portsweep
probes a host to find available services on that host.

– Nmap
is a complete and flexible tool for scanning a network either ran-
domly or sequentially.

– Satan
is an administration tool; it gathers information about the net-
work. This information can be used by an attacker.


C.2 Denial of service attacks


– Ping of death (pod) makes the victim host unavailable by sending
it oversized ICMP packets as ping requests.

– Back
is a denial of service attack against Apache web servers. The attacker sends requests containing many front slashes, the processing of which is time consuming.

– Land:
A spoofed SYN packet is sent to the victim host, resulting in that host repeatedly synchronizing with itself.

– Smurf
A broadcast of ping requests with a spoofed sender address, which results in the victim being bombarded with a huge number of ping responses.

– Neptune:
The attacker half opens a number of TCP connections to the vic-
tim host making it impossible for the victim host to accept new
TCP connections from other hosts.

– Teardrop:
Confuses the victim host by sending it overlapping IP fragments:
overlapping IP fragments are incorrectly dealt with by some older
operating systems.

C.3 User to root attacks


– Loadmodule
This attack exploits a flaw in how SUNOS 4.1 dynamically loads modules. This flaw makes it possible for any user of the system to get root privileges.

– Perl:
Exploits a bug in some PERL implementations on some earlier
systems. This bug consists in these PERL implementations im-
properly handling their root privileges. This leads to a situation
where any user can obtain root privileges.

– Buffer overflow
Consists in overflowing input buffers in order to overwrite memory
locations containing security relevant information.

C.4 Remote to local attacks


– Imap
Imap causes a buffer overflow by exploiting a bug in the authenti-
cation procedure of the imap server on some versions of LINUX.
The attacker gets root privileges and can execute an arbitrary se-
quence of commands.

– Ftp write
This attack exploits a misconfiguration affecting write privileges
of anonymous accounts on an FTP server.
This allows any ftp user to add arbitrary files to the FTP server.

– Phf
Phf is an example of a badly written CGI script that is distributed with the Apache server. Exploiting this flaw allows the attacker to execute code with the privileges of the HTTP server.

– Warezmaster
The warezmaster attack is possible in a situation where write permissions are improperly assigned on an FTP server. When this is the case, the attacker can upload copies of illegal software that can then be downloaded by other users.

– Warezclient
The warezclient attack consists in downloading illegal software previously uploaded during a warezmaster attack.

C.5 Other attack scenarios


The four categories of attacks described above usually take place during a single session.
In most realistic attack scenarios, the attacker performs his attack over a certain period of time in order to minimize the chances of detection and in order to perform more precise and successful attacks.
These attack scenarios are performed by combining some of the basic attack categories described above.

Here are some of these attacks scenarios:

– Guessing passwords

– Making use of spy programs


A spy program monitors the activity on the victim host and makes
information available to the attacker.

– Making use of rootkit


A rootkit is a program that hides the presence of other -malicious-
programs or data files. Spyware programs often make use of rootk-
its in order to avoid detection by anti-spyware programs.

– Multihop attack
This attack first affects a host on a network and then uses that
host to attack other hosts on the network.
Appendix D

Theorems

D.1 Algorithm: Hill climbing


The hill-climbing algorithm is a local optimisation algorithm.

• Hill climbing algorithm: Let $g(x)$ be the gradient of a function $f(x)$. In searching for the maximizer of $f(x)$, the algorithm proceeds as follows:
- It starts with an arbitrary solution $s_0 \in S$, where $S$ is the solution space.
- Then a sequence $\{s_t, t \geq 0\}$ of solutions that approaches the maximizer of $f(x)$ is constructed. The sequence is defined as $s_{t+1} = \alpha g(s_t) + (1 - \alpha)s_t$, where $\alpha > 0$.
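A minimal sketch of one step of this update rule, with the gradient supplied by the caller (the function name is illustrative):

#include <cstddef>
#include <vector>

// One hill-climbing step as defined above:
// s_{t+1} = alpha * g(s_t) + (1 - alpha) * s_t, with alpha > 0.
// grad is the gradient g of the objective f, evaluated at s.
std::vector<double> hillClimbStep(const std::vector<double>& s,
                                  const std::vector<double>& grad,
                                  double alpha) {
    std::vector<double> next(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        next[i] = alpha * grad[i] + (1.0 - alpha) * s[i];
    }
    return next;
}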

D.2 Theorem: Jensen’s inequality


Let $f$ be a convex function defined on an interval $I$. If $x_1, \ldots, x_n \in I$ and $\alpha_1, \ldots, \alpha_n \geq 0$ with $\sum_{i=1}^{n} \alpha_i = 1$, then
$$f\left(\sum_{i=1}^{n} \alpha_i x_i\right) \leq \sum_{i=1}^{n} \alpha_i f(x_i). \qquad \text{(D.1)}$$


D.3 Theorem: The Lagrange method


Let $f : \mathbb{R}^n \to \mathbb{R}$ and $g : \mathbb{R}^n \to \mathbb{R}$ be $C^1$, that is, $f$ and $g$ are differentiable and their respective derivatives are continuous. Let $\alpha \in \mathbb{R}^n$ be such that $\nabla g(\alpha) \neq 0$ ($\nabla g$ is the gradient of $g$). If $\alpha$ is an extremum of $f$ under the constraint $g(x_1, \ldots, x_n) = 0$, then there exists $\lambda \in \mathbb{R}$ such that
$$\nabla f(\alpha) = \lambda \nabla g(\alpha). \qquad \text{(D.2)}$$


Appendix E

Results of the experiments


Algorithms classification accuracy cluster entropy nb of categories


random 0.899±0.0 0.548±0.0 1.0±0.0
kmeans 0.929±0.001 0.209±0.005 4.5±0.2
leader 0.937±0.001 0.240±0.004 7.3±0.1
EM 0.907±0.001 0.274±0.006 3.6±0.2
CEM 0.916±0.002 0.276±0.006 2.6±0.2
som 0.919±0.001 0.252±0.004 3.2±0.1
fuzzy kmeans 0.915±0.001 0.243±0.003 3.3±0.2
KHAC 0.954±0.001 0.204±0.003 9.4±0.1

Table E.1: Random initialisation

Algorithms classification accuracy cluster entropy nb of categories


leader + kmeans 0.941±0.001 0.194±0.004 8.0±0.2
leader + EM 0.909±0.0 0.268±0.003 3.3±0.1
leader + CEM 0.935±0.002 0.219±0.005 7.4±0.2
leader + som 0.944±0.001 0.187±0.004 8.0±0.2
leader + fuzzy kmeans 0.937±0.001 0.196±0.002 5.8±0.1

Table E.2: Experimental results of various classical algorithms and combinations of those algorithms, run on a slightly modified KDD Cup 1999 data set. The number of clusters is set to the number of attack and normal labels in the data set, which is 23. The results in table E.1 are obtained with random initialisation of the algorithms, and those of table E.2 correspond to initialisation of the algorithms with leader clustering.

Algorithms classification accuracy cluster entropy nb of categories


random 0.899±0.0 0.546±0.0 1.0±0.0
kmeans 0.954±0.003 0.123±0.005 7.9±0.3
leader 0.951±0.001 0.151±0.001 12.8±0.3
EM 0.927±0.002 0.204±0.008 5.8±0.3
CEM 0.930±0.004 0.253±0.026 5.6±0.5
som 0.929±0.003 0.198±0.008 4.8±0.2
fuzzy kmeans 0.935±0.002 0.184±0.006 6.1±0.4
KHAC 0.962±0.001 0.146±0.003 9.6±0.3

Table E.3: Random initialisation



Algorithms classification accuracy cluster entropy nb of categories


leader + kmeans 0.954±0.0 0.138±0.002 13.6±0.1
leader + EM 0.938±0.0 0.165±0.003 7.0±0.1
leader + CEM 0.952±0.0 0.150±0.002 12.7±0.2
leader + som 0.951±0.001 0.147±0.002 13.1±0.3
leader + fuzzy kmeans 0.953±0.001 0.146±0.002 10.0±0.1

Table E.4: Experiment results when the number of clusters is 49
