
Distributed Architecture of an Intrusion Detection System Based on Cloud Computing and Big Data Techniques

Rim Ben Fekih¹ and Farah Jemili²

¹ Higher Institute of Computer Science and Telecom (ISITCom), University of Sousse, Sousse, Tunisia
Rimbenfekih@outlook.com
² Modeling of Automated Reasoning Systems (MARS) Research Lab LR17ES05, Higher Institute of Computer Science and Telecom (ISITCom), University of Sousse, Sousse, Tunisia
Jmili_farah@yahoo.fr

Abstract. An efficient intrusion detection system (IDS) requires more than just a good machine learning (ML) classifier. Current IDSs, however, take a limited view of alert databases: these databases must be local and structured to be referenced by the IDS, an approach that is ill-suited to advanced attacks and intrusions on distributed systems.
With the emergence of big data, cyber-attacks have become a worldwide concern. Where data security is paramount, fast processing and analysis become mandatory. In that respect, cloud computing services can deal efficiently with big data issues: they offer the storage and distributed analysis featured in this paper.
To handle alert data at large scale, we propose a new distributed IDS model that solves data storage problems, combines multiple heterogeneous sources of alert data, and processes data much faster than local IDSs.
To this end, this paper presents an IDS approach that uses Databricks as a cloud environment and Spark as a big data analysis tool.

Keywords: Intrusion detection · Spark · Cloud · Databricks · DBFS · Machine learning · Naïve Bayes

1 Introduction

Since the appearance of computer networks, hackers have endeavored to penetrate them to steal valuable information or to disrupt computer resources. Intrusion detection systems (IDSs) have accordingly played a critical role in ensuring the safety of networks for all users, but the nature of this role has changed in recent history [1]. Well-known past events have shown that IDSs would be more effective in real time, and that the analysis of cyber threats could be improved by correlating security events from numerous heterogeneous sources, as mentioned by Zuech et al. [2]. However, big data challenges and the exponential growth of data lead to heavy storage and analysis limitations. To face those limitations, previous researchers proposed the power of cloud computing technology as a solution to speed up computation and to provide access to an overwhelming capacity of data storage. In recent years, from a computation-tools perspective, Spark, an open-source software framework, has been used to perform advanced analysis. Moreover, Spark orchestrated in a cloud environment can achieve even better results in terms of fast analysis, with storage in a cloud-based distributed file system that provides scalable and reliable storage for managing the large quantities of alert data needed to determine the presence of attacks or malicious activities. For this purpose, we propose an IDS approach based on cloud computing and big data techniques. We use three intrusion detection datasets, NSL-KDD, MAWILab and DARPA'99, which we combine to obtain a variety of intrusions in a single homogeneous dataset and, ultimately, a realistic intrusion detection rate. We choose Databricks as a unified cloud environment and the Databricks File System (DBFS) to load our datasets. We follow an ML pipeline model: load and process the data, then train the model with a Naïve Bayes machine-learning algorithm to obtain the classification rate of each attack type.

© Springer Nature Switzerland AG 2020
M. S. Bouhlel and S. Rovetta (Eds.): SETIT 2018, SIST 146, pp. 192–201, 2020.
https://doi.org/10.1007/978-3-030-21005-2_19
The remainder of this paper is organized as follows. Section 2 discusses related
work and background. Section 3 describes Databricks and its functionalities, while
Sect. 4 describes intrusion detection datasets used in our research. In Sect. 5, the
proposed approach is elaborated. The experimental results are presented in Sect. 6.
Finally, Sect. 7 provides the conclusion of the paper and offers perspectives for future
research.

2 Related Work and Background

The purpose of this section is to present a brief background on IDSs and an insight into the big data challenges facing intrusion detection, and to show how big data technologies and cloud computing can be utilized to address those challenges.
An IDS is a mechanism for detecting abnormal or suspicious activities on an analyzed target (a network or host). As early as 1994, a study by Frank [3] showed that big data is a major challenge for intrusion detection; he also focused on improving detection accuracy by adopting data mining techniques and feature selection to achieve real-time detection. Several researchers have since utilized big data technologies to treat analysis problems. For example, Akbar et al. [4] proposed a system based on big data analytics to maintain security across heterogeneous data, correlating it from different sources using a hybrid strategy. Reghunath [5] designed a real-time intrusion detection system based on anomaly detection, which evaluates data and issues alert messages on abnormal behavior; the idea is to automatically store logs and check them against an existing intrusion dictionary when a real-time cyber-attack occurs. Essid and Jemili [6] combined alert datasets (KDD99, DARPA), removed their redundancy, and applied MapReduce operations with Hadoop to obtain a single dataset; their main goal was to improve the detection rate and decrease false negatives. On the other hand, Elayni and Jemili [7] added a third dataset and worked in a local environment with MapReduce under MongoDB.

They aimed to merge unstructured, heterogeneous datasets and to improve the intrusion detection rate. As discussed above regarding big data challenges, the approaches in [4–7] are limited by their local architectures. Among these limits we note restricted storage capacity, processing-speed problems and slow access to stored data (e.g. data stored on local disks). Consequently, the cloud becomes an inevitable alternative for unlocking the potential of big data. Cloud computing influences data management and processing [1]: not only does it move infrastructure and computation to the network, it also supplies management software and hardware resources at reduced cost. Furthermore, it has fostered the emergence of programming frameworks such as Hadoop, Spark and Hive for complex and large datasets. Using these tools, numerous studies have been performed in cloud environments. Esteves et al. [8] used cloud computing for distributed K-means clustering: the authors chose a large dataset to simulate big data challenges, and tests were executed using Mahout and Hadoop to solve data-intensive problems while running on Amazon EC2. In our work, we chose Databricks as the cloud environment to process and manage data. More details about Databricks are given in Sect. 3.

3 About Databricks

Databricks [9] was founded by the team that created Apache Spark™, the most active open-source project in today's big data ecosystem. Databricks is a cloud-based data platform designed to ease the creation and deployment of advanced analysis solutions; it also provides the free Databricks Community Edition. Its most important feature is a unified ecosystem with orchestrated Apache Spark [10] for implementation, development and scaling. In addition, it provides easy and swift data access, with ingestion of non-traditional, cloud-based data storage. Databricks integrates with Amazon S3 for storage: S3 buckets can be mounted into the Databricks File System (DBFS), and a Spark application can then read the data as if it were on a local disk [11].
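As a sketch, mounting an S3 bucket into DBFS from a Databricks notebook might look like the following. The bucket name, secret scope and mount point are hypothetical placeholders, and `dbutils` and `spark` are only available inside a Databricks notebook, so this is a configuration fragment rather than standalone runnable code.

```python
# Databricks-notebook configuration sketch (not runnable outside Databricks).
# Bucket name, secret scope and mount point below are hypothetical.
ACCESS_KEY = dbutils.secrets.get(scope="aws", key="access-key")
SECRET_KEY = dbutils.secrets.get(scope="aws", key="secret-key")

dbutils.fs.mount(
    source="s3a://alert-datasets",        # hypothetical S3 bucket
    mount_point="/mnt/alerts",
    extra_configs={"fs.s3a.access.key": ACCESS_KEY,
                   "fs.s3a.secret.key": SECRET_KEY},
)

# Afterwards the data can be read as if it were on local disk:
df = spark.read.csv("/mnt/alerts/nsl_kdd.csv", header=True)
```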

3.1 Apache Spark


Apache Spark was developed by the AMP Lab at the University of California, Berkeley, and is today an open-source big data framework of the Apache Foundation. We chose to program with Spark instead of Hadoop because [12, 13]:
• Spark is faster than Hadoop (up to 100× in memory and 10× for disk access), because it reduces read/write iterations to disk and stores intermediate data in memory.
• Spark is easier to program.
• Spark can process, with low latency, real-time streams coming from sources that generate millions of events per second (Twitter, Facebook), unlike Hadoop, which is designed for batch processing.
• Spark acts as its own task scheduler (thanks to in-memory computation).
• Spark provides MLlib, Apache Spark's machine-learning library, whose goal is to make ML practical, scalable and easy.

3.2 Distributed Processing with Fast Data Access


Thanks to parallel distributed processing, Spark simplifies big data implementation and analytics. MapReduce is a great solution for one-pass computations, but it is less efficient for use cases that require multi-pass computations and algorithms (slow due to replication and disk storage). Spark presents several advantages over technologies such as Hadoop and Storm; for example, it enhances MapReduce with in-memory data storage, making the processing less costly and much faster [14].

3.3 ML Pipeline
Machine-learning pipelines are mainly inspired by the scikit-learn project [15]. The basic concepts of a pipeline are:
• DataFrame: DataFrames are used as learning datasets and can contain heterogeneous data types.
• Transformer: an algorithm that can transform attributes into predictions.
• Estimator: an algorithm that produces a Transformer from a DataFrame (e.g. a learning algorithm).
An ML application can chain several steps into a workflow, or Pipeline (e.g. a word-count application: divide the text of each document into words, convert the words of each document into a numeric vector, and make predictions after learning a model on the dataset). These steps (Fig. 1) are executed in order, transforming an input DataFrame at each stage.
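The word-count pipeline described above can be sketched with the scikit-learn API that inspired Spark's ML pipelines. The documents and labels are toy values chosen for illustration, not data from the paper.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy documents and labels (illustrative only).
docs = ["normal connection traffic", "flood flood flood attack", "normal traffic"]
labels = ["normal", "dos", "normal"]

pipe = Pipeline([
    ("vectorize", CountVectorizer()),  # words -> numeric vectors (a Transformer)
    ("classify", BernoulliNB()),       # learning algorithm (an Estimator)
])
pipe.fit(docs, labels)                 # each stage runs in order on the data
print(pipe.predict(["flood attack"]))  # -> ['dos']
```

Calling `fit` pushes the DataFrame-like input through each stage in order, exactly as in Fig. 1.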

Fig. 1. Machine learning pipeline stages



4 Intrusion Detection Datasets

To build our intrusion detection system, we chose the NSL-KDD [16], DARPA'99 [17] and MAWILab [19] datasets.

4.1 NSL-KDD
NSL-KDD contains four attack classes: DoS, Probe, U2R and R2L:
• Denial of service (DoS): the goal of this attack is to render a service unavailable and thereby prevent legitimate users from using it. Examples include flooding a network to prevent its operation, disrupting connections between two machines, and blocking access to a particular service.
• Probing attack: the attacker scans a machine or network device to determine weaknesses or vulnerabilities that can be exploited later to compromise the system. This information-gathering technique is also common in data mining.
• User to Root (U2R): an exploit class in which the attacker gains access to a system and exploits a vulnerability to obtain root access.
• Remote to Local (R2L): the attacker can send packets to a machine on a network but has no account on it, and exploits a vulnerability to obtain local access as a user of that machine.
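In practice, individual NSL-KDD attack names are grouped into these four classes before training. A minimal sketch of such a mapping follows; only a few representative attack names are shown, and the function name is illustrative.

```python
# Partial, illustrative mapping of NSL-KDD attack names to the four classes.
ATTACK_CLASS = {
    "neptune": "dos", "smurf": "dos",
    "ipsweep": "probe", "portsweep": "probe",
    "buffer_overflow": "u2r", "rootkit": "u2r",
    "guess_passwd": "r2l", "ftp_write": "r2l",
}

def classify_attack(name: str) -> str:
    """Return the attack class for a record label, or 'unknown'."""
    return ATTACK_CLASS.get(name, "unknown")

print(classify_attack("smurf"))  # -> dos
```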

4.2 DARPA Dataset


DARPA'99 traces were generated by MIT Lincoln Labs [17]. This dataset is among the most frequently used in intrusion detection research. DARPA'99 is grouped into files recorded over five weeks: three weeks for training and two weeks for testing. We chose to train our model with records from the first week of the DARPA'99 dataset, particularly because it contains SSH and NOTSSH connections [18].

4.3 MAWILab Dataset


MAWILab [19] is a publicly available database for anomaly detection. It classifies anomalies according to a taxonomy [20] with 11 labels: DoS, network scan ICMP, network scan UDP, network scan TCP, multi-point, HTTP, alpha flow, IPv6 tunneling, port scan, unknown and other. The database is updated daily, so we chose to train our model on the latest recordings: those from December 2016 and all recordings available during 2017, merged into a single dataset.

5 Proposed Approach

In this section, we present our proposed approach (Fig. 2). We use the Pandas and scikit-learn APIs for the data transformations described below.

Fig. 2. ML architecture as a distributed system

Our first objective is to load intrusion detection datasets NSLKDD, MAWILab and
DARPA’99 in DBFS. Then, our work is divided into three major steps:
• Extract and transform data.
• Normalize features and data.
• Train and evaluate model.

5.1 Extract and Transform Data


In this phase, we will eliminate the redundancy and join alert datasets.

Remove Redundancy
We remove duplicates from the alert datasets so that the classifiers will not be biased towards more frequent records, which improves detection-rate performance. This step is realized with the DataFrame.drop_duplicates method, which returns a DataFrame without redundant rows.
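A minimal sketch of this step with Pandas; the column names and values are toy placeholders, not the real dataset schema.

```python
import pandas as pd

# Toy alert records with one exact duplicate row.
alerts = pd.DataFrame({
    "src":   ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "label": ["dos", "dos", "normal"],
})

deduped = alerts.drop_duplicates()  # returns a frame without redundant rows
print(len(deduped))                 # -> 2
```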

Join Datasets
The merge API provides four join types: left, right, outer and inner, with inner as the default. We use a full outer join to merge the datasets completely: left.merge(right, how='outer') merges two alert databases, where left and right are the DataFrames to merge and the how parameter specifies the join type, in our case a full outer join. This function merges two tables, performing the join on one or more common columns.
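The full outer join described above can be sketched as follows; the DataFrames and column names are toy stand-ins for the real alert datasets.

```python
import pandas as pd

# Two toy alert tables sharing the columns "duration" and "label".
nsl  = pd.DataFrame({"duration": [0, 2], "label": ["dos", "normal"]})
mawi = pd.DataFrame({"duration": [2, 5], "label": ["normal", "probe"]})

# Full outer join: rows from both sides are kept, shared rows appear once.
merged = nsl.merge(mawi, how="outer", on=["duration", "label"])
print(len(merged))  # -> 3
```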

5.2 Features and Data Normalization


Before the normalization step, some of our features are textual, and we need them to be numerical in order to train our model. Therefore, we convert the features to numerical values, in an order consistent with the feature meanings. Then we split our data into two DataFrames: the first contains the class (attack type), and the second contains the other features. Finally, we normalize each feature to have unit variance.
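The steps above can be sketched as follows. The feature names and values are illustrative, not the real dataset schema, and factorization order would in practice be chosen to match the feature meanings.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy alert records (illustrative schema).
df = pd.DataFrame({
    "protocol": ["tcp", "udp", "tcp", "icmp"],   # textual feature
    "duration": [0.0, 2.0, 4.0, 6.0],            # numeric feature
    "label":    ["normal", "dos", "normal", "probe"],
})

df["protocol"] = pd.factorize(df["protocol"])[0]  # text -> integer codes
y = df.pop("label")                               # first frame: attack types
X = StandardScaler().fit_transform(df)            # remaining features, unit variance
print(X.std(axis=0))                              # each column now has std 1
```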

5.3 Train and Evaluate Model


We use the training dataset to build and evaluate our model; the test dataset is then used to make predictions. This step gives us an idea of the performance and robustness of the model. We chose to train our model with the Naïve Bayes algorithm, specifically Bernoulli Naïve Bayes.
Before evaluating our model and retrieving intrusion detection error and accuracy values, we can tune parameters to improve results. Parameter tuning means adjusting the parameters of a learning or prediction system: the idea is to split the data into k folds, train multiple models with different parameters, then test them and keep the parameters that give the best results.
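The k-fold tune-and-select loop described above is what scikit-learn's GridSearchCV automates. In this sketch the data is synthetic and the alpha grid is illustrative, not the values used in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary features; the label depends on feature 0 so tuning has signal.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(200, 10))
y = X[:, 0]

# 5-fold cross-validated search over an illustrative alpha grid.
search = GridSearchCV(BernoulliNB(), {"alpha": [0.01, 1.0, 100.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)  # the alpha that scored best across folds
```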

6 Experimentation and Results

As stated above, we use Databricks as a cloud environment to upload and analyze the datasets with the Naïve Bayes algorithm. The experimentation environment is the Databricks Community Edition, which gives access to Amazon EC2 with Spark nodes already configured but provides only 6 GB of storage, limiting the size of the available cluster and hence the scope of the experiment.

6.1 The Naïve Bayes Classifier


The Naïve Bayes classifier is a simple probabilistic method based on Bayes' theorem with a strong (naive) independence hypothesis: the probability of one attribute does not affect the probability of another. Given a series of n attributes, Naïve Bayes makes 2n! independent assumptions.
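As a small illustration of this classifier on binary feature vectors (toy values, not the merged dataset), scikit-learn's BernoulliNB applies exactly this per-attribute independence assumption:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary vectors, e.g. presence/absence of alert attributes.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array(["attack", "attack", "normal", "normal"])

# Each feature contributes an independent per-class probability factor.
clf = BernoulliNB(alpha=1.0).fit(X, y)
print(clf.predict([[1, 0, 1]]))  # -> ['attack']
```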

6.2 Evaluation
After merging our intrusion detection datasets, we obtain a voluminous database. Table 1 shows the number of records in the training and testing datasets.

Table 1. Number of records


                     Train    Test
Number of instances  845 721  351 296

Before computing the intrusion detection rate, we tune our parameters to increase detection accuracy. This operation consists of tuning the parameters of a learning or prediction system to improve results; it is commonly done by training multiple models with different parameters on one set of data and then testing those models on another held-out set. Table 2 shows the results before and after tuning.

Table 2. Tuning parameters


          Parameter (alpha)  Training  Test
Original  0                  0.82      0.85
Tuned     10 000             0.96      0.97

Table 3 shows the detection rate of our approach. The lower rates for some attack types may be explained by their low proportions in the data.

Table 3. Detection rate


Connection type  Detection rate  False positive rate
Normal           95%             –
NOTSSH           97%             0%
Suspicious       100%            0%
Anomalous        100%            0%
Probe            61%             1%
DoS              76%             0.09%
U2R              3%              0.07%
R2L              83%             0.01%
SSH              100%            2%

It should be noted that DoS and Probe attacks are well classified by most machine-learning algorithms. The U2R category, however, presents poor detection rates, as this type of attack is embedded within the data packets themselves, which makes its detection difficult. In addition, Table 3 presents the false positive rates of our model.
It is true that the datasets analyzed by Essid and Jemili [6] are combined with MapReduce, which can run on a distributed architecture. The analysis itself, however, is done with Weka on a local system, which suffers from slow execution times, as shown in the next paragraph (Table 4).

Table 4. Weka and Spark


           Weka (sec)  Our system (sec)
KDD        165         9.36
DARPA’99   26          5
MAWILab    –           4.54
Total      191         18.9

To be effective with Weka, one still has to check for intrusions on each available dataset separately, one execution at a time. We avoid this issue by fusing the datasets, since Spark can handle them together, also with small execution times (Table 5).

Table 5. Runtime per operation


Operations                          Spark in cloud (sec)
Eliminate redundancy of 3 datasets  1.21
Join datasets                       1.35
Train model                         2.25
Inference                           2.80
Total                               7.61

7 Conclusion

In this paper, we successfully combined cloud computing, Spark and intrusion detection datasets to build a distributed IDS, reaching several benefits. After merging the NSL-KDD, MAWILab and DARPA'99 datasets, we implemented the Naïve Bayes algorithm to train our model. The main achievements of our work are the storage of the datasets in the cloud, which gives us a distributed system, and the use of Spark's power to join and analyze large, heterogeneous intrusion datasets. The Naïve Bayes classifier shows good performance, especially on intrusion types with many records. A remaining limitation of our approach is the speed of data analysis with Spark.
The proposed IDS architecture uses only one cluster; in future work, we will perform our dataset analysis with several clusters to achieve faster results. In addition, we will develop our approach with other classifiers to obtain better results.

References
1. Keegan, N., Ji, S.-Y., Chaudhary, A., Concolato, C., Yu, B., Jeong, D.H.: A survey of cloud-
based network intrusion detection analysis. Hum.-Centric Comput. Inf. Sci. 6(1), 19 (2016)
2. Zuech, R., Khoshgoftaar, T.M., Wald, R.: Intrusion detection and big heterogeneous data: a
survey. J. Big Data 2(1), 3 (2015)
3. Frank, J.: Artificial intelligence and intrusion detection: current and future directions. In:
Proceedings of the 17th National Computer Security Conference, vol. 10, pp. 1–12 (1994)
4. Akbar, S., Srinivasa Rao, T., Ali Hussain, M.: A hybrid scheme based on big data analytics
using intrusion detection system. Indian J. Sci. Technol. 9, 33 (2016)
5. Reghunath, K.: Real-time intrusion detection system for big data. Int. J. Peer to Peer Netw.
(IJP2P) 8(1) (2017)

6. Essid, M., Jemili, F.: Combining intrusion detection datasets using MapReduce. In: Proceedings of the 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2016), Budapest, Hungary (2016)
7. Elayni, M., Jemili, F.: Using MongoDB databases for training and combining intrusion detection datasets. In: Lee, R. (ed.) Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, pp. 17–29. Springer International Publishing. ISBN: 978-3-319-62048-0. https://doi.org/10.1007/978-3-319-62048-0_2
8. Esteves, R.M., Pais, R., Rong, C.: K-means clustering in the cloud – a Mahout test. In: Proceedings of the 2011 IEEE Workshops of International Conference on Advanced Information Networking and Applications (WAINA '11), pp. 514–519. IEEE Computer Society (2011)
9. Ghodsi, A.: The Databricks unified analytics platform (2017). https://databricks.com/
10. Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., et al.: Apache Spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016)
11. Brugier, R.: A tour of Databricks Community Edition: a hosted Spark service (2016). https://web.cs.dal.ca/~riyad/Site/Download.html
12. Stockinger, K.: Brave New World: Hadoop vs. Spark. Datalab Seminar (2015)
13. Pan, S.: The Performance Comparison of Hadoop and Spark. St. Cloud State University, St. Cloud (2016)
14. Nathon: Apache Spark setup. nathontech (2015). https://nathontech.wordpress.com/2015/11/16/apache-spark-setup/
15. Cournapeau, D.: scikit-learn (2017). http://scikit-learn.org/stable/
16. University of New Brunswick: UNB datasets (2017). http://www.unb.ca/cic/datasets/nsl.html
17. Lincoln Laboratory: DARPA 99 (2017). https://web.cs.dal.ca/~riyad/Site/Download.html
18. DARPA 99 Homepage. https://web.cs.dal.ca/~riyad/Site/Download.html
19. Fontugne, R., Borgnat, P., Abry, P., Fukuda, K.: MAWILab: combining diverse anomaly detectors for automated anomaly labeling and performance benchmarking. In: ACM CoNEXT '10, p. 12, Philadelphia, PA (2010)
20. Mazel, J., Fontugne, R., Fukuda, K.: A taxonomy of anomalies in backbone network traffic. In: Proceedings of the 5th International Workshop on Traffic Analysis and Characterization (TRAC 2014), pp. 30–36. http://www.fukuda-lab
