Abstract. An efficient intrusion detection system (IDS) requires more than just
a good machine learning (ML) classifier. However, current IDSs offer only a limited
perspective on handling alert databases. These databases must be local and
structured in order to be referenced by the IDS, an obsolete approach for
countering advanced attacks and intrusions on distributed systems.
With the emergence of big data, cyber-attacks have become a worldwide
concern. Where data security is paramount, speed becomes an obligation in
processing and analytic operations. In that respect, cloud computing services
can deal efficiently with big data issues: they offer the storage and
distributed analysis capabilities featured in this paper.
To handle alert data at large scale, we propose a new distributed IDS model
that solves data storage problems, combines multiple heterogeneous sources of
alert data, and processes data much faster than local IDSs.
For this purpose, this paper presents an IDS approach using Databricks as a
cloud environment and Spark as a big data analysis tool.
1 Introduction
These works aimed to merge unstructured, heterogeneous datasets and to improve the
intrusion detection rate. As discussed, big data approaches to intrusion detection [4–7] are
limited by their local architecture. Among these limits, we note restricted storage
capacity, slow processing, and constrained access to stored data (e.g. data stored on
local disks). Consequently, the cloud becomes an inevitable alternative for unlocking the
potential of big data. Cloud computing influences data management and processing
[1]: not only does it move infrastructure and computation to the network, it also supplies
managed software and hardware resources at reduced cost. It has, furthermore, driven
the emergence of programming frameworks such as Hadoop, Spark, and
Hive for complex and large datasets. Using these tools, numerous studies have been
performed in cloud environments. Esteves et al. [8] used cloud computing for
distributed K-means clustering.
The authors chose a large dataset to simulate big data challenges. Tests were
executed using Mahout and Hadoop to solve data-intensive problems while running on
Amazon EC2 for the computation tasks. In our work, we chose Databricks as the cloud
environment to process and manage data. More details about Databricks are given in
Sect. 3.
3 About Databricks
Databricks [9] was founded by the team that created Apache Spark™, the most active
open source project in today’s big data ecosystem. Databricks is a cloud-based data
platform designed to ease the creation and deployment of advanced analytics
solutions. It also offers the Databricks Community Edition as a free version.
Its most important feature is a unified ecosystem with orchestrated Apache
Spark [10] for development, implementation and scaling. In addition, it provides
easy and swift access to data, with ingestion from non-traditional, cloud-based
data storage. Databricks integrates with Amazon S3 for storage: S3 buckets can be
mounted into the Databricks File System (DBFS) and read into a Spark application
as if the data were on a local disk [11].
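As an illustration, mounting an S3 bucket into DBFS can be sketched as below. This is a notebook configuration fragment that runs only inside a Databricks workspace (where `dbutils` and `spark` are predefined); the bucket name, mount point, secret scope, key names, and file name are all hypothetical placeholders, not values from the paper.

```python
# Databricks notebook fragment -- runs only inside a Databricks workspace,
# where `dbutils` and `spark` are predefined objects.
# Bucket name, mount point, secret scope/keys and file name below are
# hypothetical placeholders.
ACCESS_KEY = dbutils.secrets.get(scope="aws", key="access-key")
SECRET_KEY = dbutils.secrets.get(scope="aws", key="secret-key")

# Mount the S3 bucket into the Databricks File System (DBFS)
dbutils.fs.mount(
    source="s3a://%s:%s@my-alert-datasets" % (ACCESS_KEY, SECRET_KEY),
    mount_point="/mnt/alerts",
)

# Spark can now read the data as if it were on a local disk
df = spark.read.csv("/mnt/alerts/KDDTrain.csv", header=True, inferSchema=True)
```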
3.3 ML Pipeline
Machine learning pipelines are mainly inspired by the scikit-learn project [15]. The
basic concepts of a pipeline are:
• DataFrame: DataFrames are used as learning datasets and can contain heterogeneous
data types.
• Transformer: an algorithm that transforms one DataFrame into another, for instance
turning attribute columns into predictions.
• Estimator: an algorithm that produces a Transformer from a DataFrame (e.g. a
learning algorithm).
An ML application can chain several steps into a workflow, or Pipeline (e.g. a
word-count application: divide the text of each document into words, convert each
document’s words into a numeric vector, and make predictions with a model learned
from the dataset). These steps (Fig. 1) are executed in order, transforming the input
DataFrame at each step.
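The same Transformer/Estimator chaining can be sketched with scikit-learn, which inspired this pipeline design. The toy documents and labels below are hypothetical stand-ins for real alert data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy documents and labels (0 = normal, 1 = attack)
docs = ["normal traffic log", "attack detected flood",
        "normal user login", "attack probe scan"]
labels = [0, 1, 0, 1]

# Step 1 is a Transformer (text -> numeric word-count vectors);
# step 2 is an Estimator (a learning algorithm that yields a fitted model)
pipe = Pipeline([
    ("vectorize", CountVectorizer()),
    ("classify", LogisticRegression()),
])

# fit() runs the steps in order, transforming the input at each stage
pipe.fit(docs, labels)
print(pipe.predict(["attack flood detected"]))
```

Each stage only sees the output of the previous one, which is exactly the in-order execution described above.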
To build an intrusion detection system, we chose the NSL-KDD [16], DARPA’99 [17] and
MAWILab [19] datasets.
4.1 NSL-KDD
NSL-KDD contains four attack classes: DoS, Probe, U2R, and R2L:
• Denial of Service attack (DoS): the goal of this attack is to render a service
unavailable and thereby prevent legitimate users from accessing it. Examples
include flooding a network to prevent its operation, disrupting connections
between two machines, and blocking access to a particular service.
• Probing attack: the attacker scans a machine or network device to determine
weaknesses or vulnerabilities that can be exploited later to compromise the
system. This technique is commonly used for information gathering.
• User to Root attack (U2R): an exploit class in which the attacker gains access to a
system as a normal user and exploits a vulnerability to obtain root access.
• Remote to Local attack (R2L): occurs when an attacker who can send packets to a
machine over the network, but has no account on that machine, exploits a
vulnerability to gain local access as a user of that machine.
5 Proposed Approach
In this section we present our proposed approach (Fig. 2). We use the Pandas and
scikit-learn APIs for these transformations.
Distributed Architecture of an Intrusion Detection System 197
Our first objective is to load the intrusion detection datasets NSL-KDD, MAWILab and
DARPA’99 into DBFS. Then, our work is divided into three major steps:
• Extract and transform data.
• Normalize features and data.
• Train and evaluate model.
Remove Redundancy
We remove duplicates from the alert datasets so that the classifiers are not biased
towards the most frequent records, which improves detection-rate performance.
This step is realized with the DataFrame.drop_duplicates method, which returns a
DataFrame without redundant rows.
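A minimal sketch of this step; the records and column names below are hypothetical, not the actual dataset schema:

```python
import pandas as pd

# Hypothetical alert records; real feature names come from the merged datasets
alerts = pd.DataFrame({
    "protocol":  ["tcp", "tcp", "udp"],
    "src_bytes": [181, 181, 239],
    "label":     ["normal", "normal", "dos"],
})

# drop_duplicates returns a new DataFrame without redundant rows,
# so frequent records no longer bias the classifier
deduplicated = alerts.drop_duplicates()
print(len(deduplicated))  # 2
```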
Join Datasets
The merge API provides four join methods: left, right, outer, and inner; the default
is inner. We use a full outer join to merge the datasets completely, calling
left.merge(right, how=‘outer’), where left and right are the two DataFrames to merge
and the how parameter specifies the join type, in our case a full outer join. This
function merges two tables on one or more columns they have in common.
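The full outer join can be sketched as follows; the two tables and the `conn_id` key column are hypothetical examples, not the paper's actual schemas:

```python
import pandas as pd

# Two hypothetical alert tables sharing a "conn_id" key column
left = pd.DataFrame({"conn_id": [1, 2], "duration": [0.5, 1.2]})
right = pd.DataFrame({"conn_id": [2, 3], "label": ["dos", "probe"]})

# how="outer" keeps every row from both sides (a full outer join);
# rows without a match on the other side are filled with NaN
merged = left.merge(right, on="conn_id", how="outer")
print(merged)
```

Only `conn_id` 2 appears in both tables, so the merged result has three rows, with NaN filling the columns that have no counterpart.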
values, in the correct order based on the feature meanings. Then, we split our data into
two DataFrames: the first contains the class or attack type, and the second contains
the remaining features. Finally, we normalize each feature to have unit variance.
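The split and normalization can be sketched as below; we use scikit-learn's StandardScaler as one way to obtain unit variance, and the data and column names are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical merged dataset: one class column plus numeric features
data = pd.DataFrame({
    "attack_type": ["normal", "dos", "probe", "normal"],
    "duration":    [0.0, 2.0, 1.0, 3.0],
    "src_bytes":   [181.0, 239.0, 235.0, 219.0],
})

# Split into a label DataFrame and a feature DataFrame
labels = data[["attack_type"]]
features = data.drop(columns=["attack_type"])

# StandardScaler rescales each feature column to unit variance
scaled = StandardScaler().fit_transform(features)
print(scaled.std(axis=0))  # each column now has standard deviation 1.0
```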
For our experimental results, as stated above, we use Databricks as a cloud
environment to upload and analyze the dataset with the Naïve Bayes algorithm.
The experimentation environment is the Databricks Community Edition,
which gives access to Amazon EC2 with Spark nodes already configured but provides
only 6 GB of storage, limiting the size of the cluster and hence the scope of
the experiment.
6.2 Evaluation
After merging our intrusion detection datasets, we obtain a voluminous database. Table 1
gives the number of records in the training and testing datasets.
Before measuring the intrusion detection rate, we tune our parameters to increase
detection accuracy. This operation consists of adjusting the parameters of a learning
or prediction system in order to improve its results. It is commonly done by training
multiple models with different parameters on one set of data and then testing them on
a held-out set. Table 2 shows the results before and after tuning.
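This train-then-validate tuning loop can be sketched as follows. Since the classifier here is Naïve Bayes, the sketch tunes the smoothing parameter of scikit-learn's MultinomialNB on synthetic stand-in data; all names and values are illustrative, not the paper's actual configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the merged alert features; MultinomialNB
# requires non-negative values, hence the abs()
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X = np.abs(X)

# Hold out part of the data so tuned models are tested on unseen records
X_train, X_held_out, y_train, y_held_out = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train multiple models with different smoothing parameters on one split,
# then evaluate the selected model on the held-out split
search = GridSearchCV(MultinomialNB(), {"alpha": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_held_out, y_held_out))
```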
Table 3 shows the detection rate of our approach. The low performance of our
intrusion detection system may be explained by the low proportion of attacks.
It should be noted that DoS and Probe attacks are well classified by most
machine learning algorithms. However, the U2R attack category presents poor detection
rates, as this type of attack is embedded in the packet data itself; consequently, its
detection becomes difficult. In addition, Table 3 presents the false positive rates of
our model.
It is true that the datasets analyzed by Essid and Jemili [6] are combined with
MapReduce, which can be set up within a distributed architecture. However, the analysis
itself is done with Weka on a local system, which suffers from slow execution
times, as we show in the next paragraph (Table 4).
To be effective with Weka, one still has to check for intrusions on each available
dataset, one execution at a time. We solve this issue by fusing the datasets, since
Spark can handle them this way, with short execution times as well (Table 5).
7 Conclusion
References
1. Keegan, N., Ji, S.-Y., Chaudhary, A., Concolato, C., Yu, B., Jeong, D.H.: A survey of cloud-
based network intrusion detection analysis. Hum.-Centric Comput. Inf. Sci. 6(1), 19 (2016)
2. Zuech, R., Khoshgoftaar, T.M., Wald, R.: Intrusion detection and big heterogeneous data: a
survey. J. Big Data 2(1), 3 (2015)
3. Frank, J.: Artificial intelligence and intrusion detection: current and future directions. In:
Proceedings of the 17th National Computer Security Conference, vol. 10, pp. 1–12 (1994)
4. Akbar, S., Srinivasa Rao, T., Ali Hussain, M.: A hybrid scheme based on big data analytics
using intrusion detection system. Indian J. Sci. Technol. 9, 33 (2016)
5. Reghunath, K.: Real-time intrusion detection system for big data. Int. J. Peer to Peer Netw.
(IJP2P) 8(1) (2017)
6. Essid, M., Jemili, F.: Combining intrusion detection datasets using MapReduce. In:
Proceedings of the 2016 IEEE International Conference on Systems, Man, and Cybernetics
(SMC 2016), Budapest, Hungary (2016)
7. Elayni, M., Jemili, F.: Using MongoDB databases for training and combining intrusion
detection datasets. In: Lee, R. (ed.) Software Engineering, Artificial Intelligence, Networking
and Parallel/Distributed Computing, pp. 17–29. Springer International Publishing. ISBN:
978-3-319-62048-0. https://doi.org/10.1007/978-3-319-62048-0_2
8. Esteves, R.M., Pais, R., Rong, C.: K-means clustering in the cloud—a mahout test. In:
Proceedings of the 2011 IEEE Workshops of International Conference on Advanced
Information Networking and Applications, WAINA ’11. IEEE Computer Society, pp. 514–
519 (2011)
9. Ghodsi, A.: The databricks unified analytics platform (2017). https://databricks.com/
10. Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J.,
Venkataraman, S., Franklin, M.J., et al.: Apache Spark: a unified engine for big data
processing. Commun. ACM 59(11), 56–65 (2016)
11. Brugier, R.: A tour of Databricks Community Edition: a hosted Spark service (2016). https://
web.cs.dal.ca/~riyad/Site/Download.html
12. Stockinger, K.: Brave New World: Hadoop vs. Spark. Datalab Seminar (2015)
13. Pan, S.: The Performance Comparison of Hadoop and Spark. St. Cloud State University,
St. Cloud (2016)
14. Nathon: Apache Spark setup. nathontech (2015). https://nathontech.wordpress.com/2015/11/
16/apache-spark-setup/
15. Cournapeau, D.: scikit-learn (2017). http://scikit-learn.org/stable/
16. University of New Brunswick: UNB datasets (2017). http://www.unb.ca/cic/datasets/nsl.
html
17. Lincoln Laboratory: DARPA 99 (2017). https://web.cs.dal.ca/~riyad/Site/Download.html
18. DARPA 99 Homepage. https://web.cs.dal.ca/~riyad/Site/Download.html
19. Fontugne, R., Borgnat, P., Abry, P., Fukuda, K.: MAWILab: combining diverse anomaly
detectors for automated anomaly labeling and performance benchmarking. In: ACM
CoNEXT ’10, p. 12. Philadelphia, PA (2010)
20. Mazel, J., Fontugne, R., Fukuda, K.: A taxonomy of anomalies in backbone network traffic.
In: Proceedings of 5th International Workshop on TRaffic Analysis and Characterization
(TRAC 2014), pp. 30–36. http://www.fukuda-lab