
Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 148 (2019) 37–44

www.elsevier.com/locate/procedia
Second International Conference on Intelligent Computing in Data Sciences (ICDS 2018)

Log Files Analysis Using MapReduce to Improve Security

Yassine AZIZI*, Mostafa AZIZI, Mohamed ELBOUKHARI

Lab. MATSI, ESTO, University Mohammed 1st, Oujda, Morocco

Abstract
Log files are a very useful source of information to diagnose system security and to detect problems that occur in a system, and they are often very large and can have a complex structure. In this paper, we provide a methodology of security analysis that aims to apply Big Data techniques, such as MapReduce, over several system log files in order to locate and extract data probably related to attacks made by malicious users who intend to compromise a system. These data will lead, through a process of learning, to identifying and predicting attacks or detecting intrusions. We have clarified this approach with a concrete case study on exploiting access log files of Apache web servers to predict and detect SQLI and DDOS attacks. The obtained results are promising: we are able to extract malicious indicators and events that characterize the intrusions, which help us to make an accurate diagnosis of the security state of the system, to supervise it, and subsequently to feed the learning process.
© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the Second International Conference on Intelligent Computing in Data Sciences (ICDS 2018).
Keywords: Big Data; Security; Attacks; Log files; MapReduce; SQL Injection; DDOS.

1. Introduction
The world has experienced a data revolution in all digital domains due to the exponential use of connected tools and objects. According to statistics developed by IBM [1], we generate 2.5 quintillion bytes of data each day; these data come from different sources, namely social networks, climate information, GPS signals, sensors and log files.
The log files are a very important source of information; they retrace all the events that occur during the activity of the system. They are often of great volume and come from everywhere: operating systems, application servers, data servers …

* Corresponding author. Tel.: +212618728634; fax: +21236500223.
E-mail address: azizi.yass@gmail.com

The larger the network perimeter is, the more we need an effective solution to analyze and exploit the thousands of events generated by devices on heterogeneous networks, in order to facilitate the detection of anomalies at the application, system and network levels.
This paper is organized as follows: Section 2 presents the main related works that have used log files for extracting useful information. Section 3 then illustrates the methodology that we use to deal with log files for extracting data on potential attacks. Before concluding, we show in Section 4 a case study on Apache web servers.

2. Related Work

A log file takes the form of a classical text file, chronologically recording all the events that have affected a system. Log files are generated in many different formats according to the diversity of devices and software. This difference appears in the number of parameters and fields recorded in each log file and in the format of these parameters.
In the literature, several research studies consider log files as a very useful data source in several areas. Authors in [5, 6] exploit log files in the field of e-commerce to predict the behavior of customers and improve the income of a business. In [7], the work was devoted to an in-depth analysis of log file data from NASA's website to identify very important information about a web server: the behaviors of users, the main mistakes, and the potential visitors of the site, all in order to help the system administrator and web designer improve the system. In [8], the authors used the log files of routers for error diagnosis and troubleshooting in home networks, because the information contained in the log file helps to clarify the causes of network problems, such as misconfigurations or hardware failures.
The log files have a vital interest in computer security because they present an overview of everything that has happened on the whole system, making it possible, for example, to explain an error or to understand how a system detects attacks and anomalies. In [9], the researchers propose a diagnostic approach in a cloud computing architecture; this approach is based on exploiting the log files of the different systems of that architecture to find wrong uses and detect anomalies, which improves system security.
In [10], the authors designed a new algorithm that detects on-line port scan attacks. The proposed method is mainly based on the sliding HyperLogLog algorithm: they used sliding HyperLogLog to analyze traffic and perform an on-line counting, which they completed with a decisional mechanism that identifies port scan attacks.

3. Proposed methodology

We are interested in exploiting Big Data techniques for the security analysis of systems and networks. In this sense, we have proposed a methodology that consists of four stages:

•	Data collection: This phase refers to the data acquisition process. In our research case, the data come from log files of Apache web servers connected to a network; we recover these files, which retrace all the events executed by the system. Using a Java program, we read the log files line by line and collect only the useful information.
•	Data processing: The log files are generated in many different formats depending on the diversity of the devices and software used. To facilitate and consolidate the exploitation and analysis of the collected data, we propose a pretreatment that transforms and segments these data into a standard XML structure composed of common key elements. For this, we used an open source ETL (Pentaho Data Integration) that allowed us to design and execute the necessary data manipulation and transformation operations.
•	Data storage: In this phase, we provide real-time backup of log messages in a remote platform to build a point of contact that centralizes all system events. Centralizing the log files on a remote machine guarantees that the log files survive a deletion on the local machine and gives an overview of the elements crucial to the good exploitation of the collected data.
•	Data analysis: After the collection, storage and formatting of the data comes the stage of analysis, the primordial phase of the proposed process. It focuses on the analysis and processing of large volumes of collected data in the hope of generating primitives useful for improving intrusion detection and consolidating security rules. We propose the use of the MapReduce model of Hadoop [11], in which we perform parallel and distributed treatments; a minimal sketch of such a job is given after this list.
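
As an illustration of the kind of parallel and distributed treatment we target, the following minimal Hadoop MapReduce sketch counts the events generated by each client IP over a set of access log files; the class names and tokenization details are assumptions for illustration, not the actual program used in our case study.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EventsPerIp {

  // Map: emit (clientIp, 1) for every access log line; in the common log
  // format the client IP is the first space-separated token.
  public static class IpMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(" ");
      if (fields.length > 0 && !fields[0].isEmpty()) {
        ctx.write(new Text(fields[0]), ONE);
      }
    }
  }

  // Reduce: sum the per-IP counts emitted by the mappers.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text ip, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) total += c.get();
      ctx.write(ip, new IntWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "events per ip");
    job.setJarByClass(EventsPerIp.class);
    job.setMapperClass(IpMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same map/shuffle/reduce skeleton is specialized below for the SQLI and DDOS analyses of Section 4.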

Fig. 1. Proposed architecture

4. Case study

Nowadays, there are more than 3.81 billion users connected to the Internet and more than a billion websites; 60% of these websites are hosted on Apache web servers. The web server provides different mechanisms for logging anything that may occur in the server, from the initial request, through the URL mapping process, to the final connection, including any errors that may happen during processing. Most web servers provide the ability to store log files in the common log format (NCSA Common Log Format) [12]. Among the main types of web server log files:
•	Error log: records the errors encountered when processing requests;
•	Access log: records all requests processed by the server.

Each line of these files provides a record of the following information (a parsing sketch follows the list):

•	The domain name or Internet Protocol (IP) address of the calling machine;
•	The name and the HTTP login of the user (in case of access by password);
•	The date and time of the request;
•	The method used in the request (GET, POST ...) and the name of the requested web resource (the URL of the requested page);
•	The status of the request, i.e. the result of the request (success, failure, error, etc.);
•	The size of the requested page in bytes;
•	The browser and operating system used by the client.
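
To make these fields concrete, here is a minimal Java sketch, assuming the NCSA Common Log Format, that extracts them from one line with a regular expression; the pattern is illustrative and not the exact one used in our programs.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccessLogLine {
  // ip identity user [date] "method url protocol" status size
  private static final Pattern CLF = Pattern.compile(
      "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\S+)");

  public static void main(String[] args) {
    String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
        + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
    Matcher m = CLF.matcher(line);
    if (m.find()) {
      System.out.println("ip="     + m.group(1)); // calling machine
      System.out.println("user="   + m.group(3)); // HTTP login, '-' if none
      System.out.println("date="   + m.group(4)); // date and time of the request
      System.out.println("method=" + m.group(5)); // GET, POST ...
      System.out.println("url="    + m.group(6)); // requested web resource
      System.out.println("status=" + m.group(7)); // result of the request
      System.out.println("size="   + m.group(8)); // size in bytes, '-' if none
    }
  }
}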

Fig. 2. Access log format

In our case study, we are working on access log files from three Apache web servers. To apply the proposed methodology, we started by defining and determining the usable data in the "access log" file of the web server. Through a Java program, we retrieve the indicators of each event and save them in a database; then we use the ETL "Pentaho Data Integration" to transform the collected data into a standard XML format.

Fig. 3. Standard XML format generated

These preprocessing and data formatting steps ensure the transition from unstructured data to well-structured, consolidated data, which facilitates the subsequent analysis and exploitation. In short, we performed two main steps: data collection and preprocessing, which aim to recover and save all the actions performed by a web server, followed by a transformation step into a standard XML structure.
Here, we analyze the log files of web servers in order to trace attacks such as SQL injection (SQLi) and distributed denial of service (DDOS). The approach aims to analyze and correlate several events recorded in access log files over time and to extract useful security information.
We store all generated log files in a common platform to make the analysis of these files more efficient. Then we
use MapReduce to perform parallel and distributed processing.

Fig. 4. MapReduce processing

4.1. SQL Injection

SQL injection is an attack that exploits a security vulnerability of an application interacting with a database; it happens when an SQL query not planned by the system is inserted [13]. It consists of injecting SQL code that will be interpreted by the database engine. The attack involves entering specific characters in a variable that will be used in an SQL query. These characters cause the original query to deviate from its purpose and open the way to malicious users [14]. They could, for example, authenticate themselves without knowing the password, create a new administrator user whose password they will know, destroy a table, corrupt the data, and so on.
Three injection mechanisms can execute malicious SQL code on the databases of a web application: injection into user inputs, injection into cookies, and injection into server variables, which consists of injecting values into the HTTP header. In all cases, the mechanism of the attack is to inject special characters that divert the original request from its purpose, as the hypothetical example below shows.
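
As a classic textbook illustration of this deviation (the table and column names are hypothetical, not taken from our case study), consider a login check built by string concatenation:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class LoginCheck {
  // Vulnerable: the user input is concatenated directly into the SQL string.
  // With user = "' OR '1'='1" the WHERE clause becomes always true, so the
  // attacker authenticates without knowing any password.
  static String vulnerableQuery(String user, String pass) {
    return "SELECT * FROM users WHERE name='" + user + "' AND pwd='" + pass + "'";
  }

  // Safe: a parameterized query keeps the input out of the SQL syntax tree,
  // which is exactly why the attack cannot work without special characters.
  static PreparedStatement safeQuery(Connection c, String user, String pass)
      throws SQLException {
    PreparedStatement ps = c.prepareStatement(
        "SELECT * FROM users WHERE name=? AND pwd=?");
    ps.setString(1, user);
    ps.setString(2, pass);
    return ps;
  }

  public static void main(String[] args) {
    System.out.println(vulnerableQuery("' OR '1'='1", "x"));
    // prints: SELECT * FROM users WHERE name='' OR '1'='1' AND pwd='x'
  }
}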

Table 1. Some indicators of SQL Injection.

Indicators                                                     Signification
(\')|(\%27)                                                    The single quote and its URL-encoded version
(\-\-)|(%20--%20)                                              The double dash, a one-line comment
(;)|(%20;%20)                                                  The semicolon, request delimiter
(%20UNION%20),(%20SELECT%20),(%20OR%20),(%20INSERT%20)        Structured Query Language keywords

Here, to detect the SQL injection attack in the log files, we parse the access log file line by line and look for SQL keywords or suspicious characters, in order to identify deviations in the behavior of the monitored events and to extract the IP addresses that make SQL injection attempts. It is impossible to carry out such an attack without injecting dangerous characters into the input parameters, since this is the only way to change the structure or the syntax tree of an SQL query at run time. As a result, we obtain the IP addresses that launched the malicious requests, the number of attempts and the index of the attack.
After running our MapReduce program, which contains the sets of SQL injection tracking and detection instructions, we get the result of this analysis in a file named part-r-00000, and we can clearly deduce the malicious users who attempted to attack the system in question, with the number of attempts and the detection indicator, in order to take the necessary countermeasures. A hedged sketch of the map phase is given below.
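
The following sketch of the map phase combines the indicators of Table 1 into one regular expression; the class name and exact patterns are our assumptions, not the code whose output appears in Fig. 5. The reducer sums the attempts per IP exactly as in the generic job of Section 3.

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map phase of the SQLI scan: emit (clientIp, 1) whenever a request line
// contains one of the Table 1 indicators, raw or URL-encoded.
public class SqliMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final Pattern SQLI = Pattern.compile(
      "(')|(%27)|(--)|(%20--%20)|(;)|(%20;%20)|(%20(UNION|SELECT|OR|INSERT)%20)",
      Pattern.CASE_INSENSITIVE);
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String record = line.toString();
    String ip = record.split(" ")[0];   // first CLF field: the client IP
    if (SQLI.matcher(record).find()) {
      ctx.write(new Text(ip), ONE);     // one injection attempt for this IP
    }
  }
}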

Fig. 5. The result of the SQLI attack detection approach

Fig. 6. The SQL injection attempts

4.2. DDOS Attack

Distributed Denial of Service (DDoS) is a malicious attempt to disrupt the normal traffic of a targeted server, service or network by saturating the target or its surrounding infrastructure with a flood of Internet traffic [15]. DDOS attacks owe their effectiveness to the use of multiple compromised computer systems as sources of attack traffic; concretely, hackers send a large number of requests to a device (host, server, web application, etc.) in order to saturate it and cause a total interruption of service.
There are three strategies for implementing a DDOS attack:

•	Bandwidth: an attack category that saturates the network capacity of the server, thus rendering it unreachable.
•	Resources: an attack category that exhausts the machine's system resources, thus preventing it from responding to legitimate queries.
•	Exploitation of a software flaw: an attack category that targets a particular software flaw in order to make the machine unavailable.
In this work, we are interested in detecting DDOS attacks that aim to exhaust the processing capabilities of a target. For example, an attacker can try to reach the limit of the number of concurrent connections that a web server can handle. In this case, the attacker constantly sends a large number of HTTP GET or POST requests to the targeted server. A single HTTP request is not expensive to execute on the client side, but it can be expensive for the target server to respond to, since the server must often load multiple files and execute database queries to build a web page.
Our approach is to scan the access log file to detect users or machines that attempt to send massive numbers of queries for particular resources in a very short time interval, in the hope of making the service unavailable. For this, we have developed a MapReduce program that computes the number of requests sent by each user in a time interval of one second, and we set a threshold that determines the maximum number of requests a particular IP address may generate in that interval, to distinguish legitimate traffic from attack traffic. The threshold value (Tmax) is set to 20 [16]: only 20 HTTP requests are allowed for a particular IP in one second. If the frequency value for an IP address exceeds Tmax, this IP address is considered a source of attack, and countermeasures are triggered to block the progression of the attack and preserve the functioning of the system. A hedged sketch of this counting job is given below.
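
A minimal sketch, assuming CLF timestamps, of the counting job just described; the class names and parsing details are ours, not those of the program whose output is shown in Fig. 7. The job driver is omitted, being identical in shape to the sketch of Section 3.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DdosDetector {
  // Map: key each request by (clientIp, timestamp truncated to the second),
  // so the reducer sees together all requests of an IP in one second.
  public static class SecondMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String s = line.toString();
      String ip = s.split(" ")[0];
      int open = s.indexOf('('==0?0:'['), close = s.indexOf(']');
      if (open >= 0 && close > open) {
        // CLF timestamp, e.g. [10/Oct/2018:13:55:36 +0000]: the first 20
        // characters already give second-level granularity.
        String second = s.substring(open + 1, Math.min(open + 21, close));
        ctx.write(new Text(ip + "\t" + second), ONE);
      }
    }
  }

  // Reduce: flag any (ip, second) pair whose request count exceeds Tmax = 20 [16].
  public static class ThresholdReducer extends Reducer<Text, IntWritable, Text, Text> {
    private static final int TMAX = 20;
    @Override
    protected void reduce(Text ipSecond, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) total += c.get();
      if (total > TMAX) {
        ctx.write(ipSecond, new Text(total + " requests in one second -> attack source"));
      }
    }
  }
}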

Fig. 7. The result of the DDOS attack detection approach

Fig. 8. The DDOS attempts

The analysis of the log files allowed us to obtain significant results and to extract indicators that characterize attacks like SQLI and DDOS, in order to anticipate these threats and to take a certain number of technical and organizational measures to protect system security. Our proposed methodology is based on the analysis of the events recorded in the access log files; it offers an anomaly detection mechanism and traces the data related to attacks. Regarding the SQL injection attack, our methodology allows us to scan all attempts to access the database and to identify the users who try to inject dangerous SQL characters through HTTP requests, providing an index of attack; this helps us to clearly identify the malicious IP addresses. For the detection of the DDOS attack, our proposal allows us to compute the number of requests of each IP in one second and to identify the IP addresses that have exceeded the fixed threshold, in order to put them on the blacklist and block the sources of the attack.
These results also have some limits: on the one hand, the difficulty of confirming whether an event is a potential attack or not, which can generate false alarms; on the other hand, the difficulty of knowing in advance all the dangerous characters and behaviours, which evolve rapidly.

5. Conclusion

In this work, we presented a methodology that aims to exploit log files in the domain of computer security, in order to improve anomaly detection and increase the level of security. This methodology is made of four stages: data collection, data processing, data storage, and data analysis. On the implementation side, we collected and saved the events of a web server and structured the data into a common format. Concerning data analysis and knowledge extraction, we proposed an approach for detecting SQL injection and DDOS attacks based on parallel and distributed processing using MapReduce. Our future work will focus on the machine learning part, in order to apply appropriate learning techniques to the machine logs.

References

[1] Miranda, S. Big Brother au Big Data. Conférence de Big Data, Université Sophia Antipolis, 2015.
[2] Chen, M., Mao, S., & Liu, Y. Big data: A survey. Mobile Networks and Applications, 2014, pp. 171–204.
[3] Gandomi, A., & Haider, M. Beyond the hype: Big Data concepts, methods, and analytics. International Journal of Information Management, 2015, pp. 137–144.
[4] Patgiri, R., & Ahmed, A. Big Data: The V's of the Game Changer Paradigm. IEEE 2nd International Conference on Data Science and Systems, 2016, pp. 17–24.
[5] Savitha, K., & Vijaya, M. S. Mining of Web Server Logs in a Distributed Cluster Using Big Data Technologies. IJACSA, 2014.
[6] Salama, S. E., Marie, M. I., El-Fangary, L. M., & Helmy, Y. K. Web Server Logs Preprocessing for Web Intrusion Detection. Computer and Information Science, 2011, pp. 123–133.
[7] S., S., & Uma Maheswari, B. Analyzing Large Web Log Files in a Hadoop Distributed Cluster Environment. International Journal of Computer Technology and Applications (IJCTA), 2014, vol. 5, no. 5.
[8] Müller, A., Münz, G., & Carle, G. Collecting router information for error diagnosis and troubleshooting in home networks. In Local Computer Networks (LCN), 2011 IEEE 36th Conference on, 2011, pp. 764–769. IEEE.
[9] Amar, M., Lemoudden, M., & El Ouahidi, B. Log file's centralization to improve cloud security. 2016 International Conference on Cloud Computing Technologies and Applications (CloudTech 2016), 2016, pp. 178–183.
[10] Chabchoub, Y., Chiky, R., & Dogan, B. How can sliding HyperLogLog and EWMA detect port scan attacks in IP traffic? EURASIP Journal on Information Security, 2014(1), 5.
[11] Matsuzaki, K. Functional Models of Hadoop MapReduce with Application to Scan. International Journal of Parallel Programming, 2017, pp. 362–381.
[12] Santhi, J. U., Bellamkonda, S., & Rao, N. G. Web server log files using Hadoop MapReduce to preprocess the log files and to explore the session identification and network anomalies. 2016.
[13] Halfond, W. G., Viegas, J., & Orso, A. A classification of SQL-injection attacks and countermeasures. In Proceedings of the IEEE International Symposium on Secure Software Engineering, 2006, Vol. 1, pp. 13–15. IEEE.
[14] Alwan, Z. S., & Younis, M. F. Detection and Prevention of SQL Injection Attack: A Survey. 2017.
[15] Balakrishnan, H. P., & Moses, J. C. A Survey on Defense Mechanism against DDOS Attacks. International Journal, 2014, 4(3).
[16] AK, M. I., George, L., Govind, K., & Selvakumar, S. Threshold based kernel level HTTP filter (TBHF) for DDoS mitigation. International Journal of Computer Network and Information Security, 2012, 4(12), 31.
