
A Project report on

TREND ANALYSIS BASED ON ACCESS PATTERN OVER


WEB LOGS USING HADOOP

In partial fulfillment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY

IN

COMPUTER SCIENCE & ENGINEERING

Submitted By

I.HARIPRIYA (12JD1A0518)
P. NAGA SAI APARNA (13JD5A0506)
N. PRAJNA (12JD1A0546)
S. SOUMIKA (12JD1A0558)
Under the Guidance of
V.RAJESH BABU
M.TECH
Assistant Professor

Department of Computer Science & Engineering


ELURU COLLEGE OF ENGINEERING AND TECHNOLOGY
DUGGIRALA(V), PEDAVEGI(M), ELURU-534004

(Affiliated to Jawaharlal Nehru Technological University)


2012-2016

ELURU COLLEGE OF ENGINEERING AND TECHNOLOGY

(Affiliated to JNTUK, Approved by AICTE-NEW DELHI)

Department of Computer Science and Engineering

This is to certify that the project report titled “TREND ANALYSIS BASED ON
ACCESS PATTERN OVER WEB LOGS USING HADOOP” is being submitted by
I. HARIPRIYA (12JD1A0518), P. NAGA SAI APARNA (13JD5A0506), N. PRAJNA
(12JD1A0546) and S. SOUMIKA (12JD1A0558) in B.Tech IV year II semester,
Computer Science & Engineering, and is a record of bonafide work carried out by them.
The results embodied in this report have not been submitted to any other University for
the award of any degree.

PROJECT GUIDE HEAD OF THE DEPARTMENT


Mr.V.RAJESH BABU
M.Tech(CSE)
Assistant Professor

EXTERNAL EXAMINER

ACKNOWLEDGEMENT

The present project work is the result of several days of study of various aspects of
project development. During this effort we have received a great amount of help from
our Chairman Sri V.RAGHAVENDRARAO and Secretary V.RAMA KRISHNARAO
of Padmavathi Group of Institutions, which we wish to acknowledge and thank from the
depth of our hearts.

We are thankful to our Principal Dr. P. BALAKRISHNA PRASAD for
permitting and encouraging us to do this project.

We are deeply indebted to Dr. G. GURUKESAVADAS, Professor & Head of
the Department, whose motivation and constant encouragement have led us to pursue a
project in the field of software development.

We are very much obliged and thankful to our project guide Mr. V.RAJESH
BABU, Assistant Professor, for providing this opportunity and for the constant
encouragement given by him during the course. We are grateful for his valuable
guidance and suggestions during our project work.
Our parents have put us ahead of themselves. Because of their hard work
and dedication, we have had opportunities beyond our wildest dreams. Our heartfelt
thanks to them for giving us all we ever needed to be successful students and individuals.
Finally, we express our thanks to all our faculty members, classmates, friends and
neighbors who helped us in the completion of our project; without their infinite love and
patience this would never have been possible.
I. HARIPRIYA (12JD1A0518)
P. NAGA SAI APARNA (13JD5A0506)
N. PRAJNA (12JD1A0546)
S. SOUMIKA (12JD1A0558)

ABSTRACT

The first part covers some fundamental theory and summarizes the basic goals and
techniques of log file analysis. It reveals that log file analysis is a neglected field of
computer science: available papers mostly describe specific log analyzers, and only a few
contain any general methodology. The second part contains three case studies that
illustrate different applications of log file analysis. The examples were selected to show
quite different approaches and goals of analysis, and thus they set up different
requirements. The analysis of requirements then follows in the next part, which discusses
various criteria put on a general analysis tool and also proposes some design suggestions.
Finally, the last part outlines the design and implementation of a universal analyzer.
Some features are presented in detail, while others are just intentions or suggestions.

TABLE OF CONTENTS
CHAPTER NO NAME PAGE NO
ACKNOWLEDGEMENT iii
ABSTRACT iv
TABLE OF CONTENTS v
LIST OF FIGURES vii
1 INTRODUCTION 1
1.1 About Log Files 1
1.2 Motivation 6
1.3 Problem Definition 6
1.4 Current State of Technology 7
1.5 Current Practice 8
2 HISTORY OF HADOOP 10
3 LITERATURE SURVEY 22
4 SYSTEM ANALYSIS 27
4.1 Existing System 27
4.2 Proposed System 27
4.3 System Specification 29
4.3.1 Software Requirements 29
4.3.2 Hardware Requirements 29
5 SYSTEM DESIGN 30
5.1 System Architecture 30
5.2 Use Case Diagram 31
5.3 Behavioral Model 34
5.3.1 Data flow Diagram 34
5.3.2 Sequence Diagram 35
6 SYSTEM IMPLEMENTATION 36
6.1 Language of Metadata 36
6.1.1 Log Description 36
6.1.2 Examples 37
6.2 Language of Analysis 38

6.2.1 Principle of operation 38
6.2.2 Data Filtering and Cleaning 39
6.2.3 Analytical Tasks 41
6.3 Analyzer Layout 44
6.3.1 Operation 44
6.3.2 File types 45
6.3.3 Modules 46
7 SOFTWARE TESTING 61
7.1 System Testing 61
7.2 Architecture Testing 64
7.3 Performance Testing 65
8 RESULTS 70
9 CONCLUSIONS AND FUTURE WORK 74
9.1 Conclusion 74
9.2 Future Enhancements 74
10 REFERENCES 75

LIST OF FIGURES
CHAPTER NO FIGURE NAME PAGE NO
1 Big data Applications 3

3 System description 16

3 Pig processing steps 12

4 Hadoop frame work 18

4 The log analysis (use case diagram) 15

4 Searching the web 15

4 Flow of data 16

4 The log analysis (Sequence diagram) 17

5 Analyzer layout 26

5 Components of Hadoop Frame Work 30

5 Flow of Data in Map Reduce Frame Work 32

5 Format of Execution 33

5 Log Analysis Solution Architecture 37

CHAPTER 1

INTRODUCTION

1.1 About Log Files


Current software applications often produce (or can be configured to produce) auxiliary
text files known as log files. Such files are used during various stages of software
development, mainly for debugging and profiling purposes. The use of log files helps
testing by making debugging easier: it allows following the logic of the program, at a
high level, without having to run it in debug mode.
Nowadays, log files are also commonly used at customers’ installations for the purpose of
permanent software monitoring and/or fine-tuning. Log files have become a standard part
of large applications and are essential in operating systems, computer networks and
distributed systems.
Log files are often the only way to identify and locate an error in software, because
log file analysis is not affected by any time-based issues known as the probe effect. This
is in contrast to the analysis of a running program, where the analytical process can
interfere with time-critical or resource-critical conditions within the analyzed program.
Log files are often very large and can have a complex structure. Although the process of
generating log files is quite simple and straightforward, log file analysis can be a
tremendous task that requires enormous computational resources, a long time and
sophisticated procedures. This often leads to a common situation in which log files are
continuously generated and occupy valuable space on storage devices, but nobody uses
them or utilizes the enclosed information.

Big data is a collection of large datasets that cannot be processed using traditional
computing techniques. Big data is not merely data; rather, it has become a complete
subject, which involves various tools, techniques and frameworks.

What Comes Under Big Data?


Big data involves the data produced by different devices and applications. Given below are
some of the fields that come under the umbrella of Big Data.

 Black Box Data: It is a component of helicopters, airplanes, jets, etc. It captures
voices of the flight crew, recordings of microphones and earphones, and the
performance information of the aircraft.
 Social Media Data: Social media such as Facebook and Twitter hold information
and the views posted by millions of people across the globe.
 Stock Exchange Data: The stock exchange data holds information about the ‘buy’
and ‘sell’ decisions made by customers on shares of different companies.
 Power Grid Data: The power grid data holds information about the power consumed
by a particular node with respect to a base station.
 Transport Data: Transport data includes model, capacity, distance and availability
of a vehicle.
 Search Engine Data: Search engines retrieve lots of data from different databases.

Thus Big Data includes huge volume, high velocity, and extensible variety of data. The
data in it will be of three types.

 Structured data: Relational data.
 Semi Structured data: XML data.
 Unstructured data: Word, PDF, Text, Media Logs.

Benefits of Big Data

Big data is really critical to our lives and is emerging as one of the most important
technologies in the modern world. Following are just a few benefits which are well
known to all of us:

 Using the information kept in the social network like Facebook, the marketing
agencies are learning about the response for their campaigns, promotions, and other
advertising mediums.
 Using the information in the social media like preferences and product perception
of their consumers, product companies and retail organizations are planning their
production.
 Using the data regarding the previous medical history of patients, hospitals are
providing better and quick service.

Big Data Technologies

Big data technologies are important in providing more accurate analysis, which may lead
to more concrete decision-making resulting in greater operational efficiencies, cost
reductions, and reduced risks for the business.

To harness the power of big data, you would require an infrastructure that can manage and
process huge volumes of structured and unstructured data in realtime and can protect data
privacy and security.

There are various technologies in the market from different vendors including Amazon,
IBM, Microsoft, etc., to handle big data. While looking into the technologies that handle
big data, we examine the following two classes of technology:

Operational Big Data

This includes systems like MongoDB that provide operational capabilities for real-time,
interactive workloads where data is primarily captured and stored.

NoSQL Big Data systems are designed to take advantage of new cloud computing
architectures that have emerged over the past decade to allow massive computations to be
run inexpensively and efficiently. This makes operational big data workloads much easier
to manage, cheaper, and faster to implement.

Some NoSQL systems can provide insights into patterns and trends based on real-time data
with minimal coding and without the need for data scientists and additional infrastructure.

Analytical Big Data

This includes systems like Massively Parallel Processing (MPP) database systems and
MapReduce that provide analytical capabilities for retrospective and complex analysis that
may touch most or all of the data.

MapReduce provides a new method of analyzing data that is complementary to the
capabilities provided by SQL; a system based on MapReduce can be scaled up from
single servers to thousands of high-end and low-end machines.

These two classes of technology are complementary and frequently deployed together.

Operational vs. Analytical Systems

                 Operational          Analytical
Latency          1 ms - 100 ms        1 min - 100 min
Concurrency      1,000 - 100,000      1 - 10
Access Pattern   Writes and Reads     Reads
Queries          Selective            Unselective
Data Scope       Operational          Retrospective
End User         Customer             Data Scientist
Technology       NoSQL                MapReduce, MPP Database

Big Data Challenges

The major challenges associated with big data are as follows:

 Capturing data
 Curation
 Storage
 Searching
 Sharing
 Transfer
 Analysis
 Presentation

To fulfill the above challenges, organizations normally take the help of enterprise servers.

1.2 Motivation
There are various applications (known as log file analyzers or log file visualization tools)
that can digest a log file of a specific vendor or structure and produce easily human-readable
summary reports. Such tools are undoubtedly useful, but their usage is limited to log
files of a certain structure. Although such products have configuration options, they can
answer only built-in questions and create built-in reports. The initial motivation of this
work was the lack of a Cisco NetFlow analyzer that could be used to monitor and analyze
large computer networks, like the metropolitan area network of the University of West
Bohemia (WEBNET) or the country-wide backbone of the Czech Academic Network
(CESNET), using Cisco NetFlow data exports. Because of the amount of log data (every
packet is logged!), the evolution of the NetFlow log format over time and the wide
spectrum of monitoring goals/questions, it seems that the introduction of a new,
systematic, efficient and open approach to log analysis is necessary.
There is also a belief that it is useful to do research in the field of log file analysis and to
design an open, very flexible modular tool that would be capable of analyzing almost any
log file and answering any questions, including very complex ones. Such an analyzer
should be programmable, extendable, efficient (because of the volume of log files) and
easy to use for end users. It should not be limited to analyzing just log files of a specific
structure or type, and the type of question should not be restricted either.

1.3 Problem Definition


The overall goal of this research is to invent and design a good model for generic
processing of log files. The problem covers areas of formal languages and grammars,
finite state machines, lexical and syntax analysis, data-driven programming techniques
and data mining/warehousing techniques.
The following list sums up the areas involved in this research.
 Formal definition of a log file
 Formal description of the structure and syntax of a log file (metadata)
 Lexical and syntax analysis of log files and metadata information
 Formal specification of a programming language for easy and efficient log analysis
 Design of internal data types and structures (includes RDBMS)
 Design of such programming language
 Design of a basic library/API and functions or operators for easy handling of logs
within the programming language
 Deployment of data mining/warehousing techniques if applicable
 Design of a user interface
The expected results are both theoretical and practical. The main theoretical results are
(a) the design of the framework and (b) the formal language for log analysis description.
From a practical point of view, the results of the research and the design of the analyzer
should be verified and evaluated by a prototype implementation of an experimental
Cisco NetFlow analyzer.

1.4 Current State of Technology
In the past decades, surprisingly little attention was paid to the problem of getting useful
information from log files. There seem to be two main streams of research. The first one
concentrates on validating program runs by checking the conformity of log files to a state
machine. Records in a log file are interpreted as transitions of a given state machine. If
some illegal transition occurs, then there is certainly a problem, either in the software
under test, in the state machine specification, or in the testing software itself.
The second branch of research is represented by articles that just describe various
ways of producing statistical output.
The following items summarize current possible usage of log files:
 Generic program debugging and profiling
 Tests whether program conforms to a given state machine
 Various usage statistics, top ten, etc.
 Security monitoring
According to the available scientific papers, it seems that the most evolving and
developed area of log file analysis is the WWW industry. Log files of HTTP servers are
nowadays used not only for system load statistics but also offer a very valuable and cheap
source of feedback. Providers of web content were the first who lacked more detailed
and sophisticated reports based on server logs. They require detecting behavioral patterns,
paths, trends, etc. Simple statistical methods do not satisfy these needs, so an advanced
approach must be used. There are over 30 commercially available applications for web log
analysis and many more freely available on the Internet. Regardless of their price, they are
disliked by their users and considered too slow, inflexible and difficult to maintain.
Some log files, especially small and simple ones, can also be analyzed using common
spreadsheet or database programs. In such cases, the logs are imported into a worksheet
or database and then analyzed using the available functions and tools.
The remaining, but probably the most promising, way of log file processing is
represented by data-driven tools like AWK. In connection with regular expressions, such
tools are very efficient and flexible. On the other hand, they are too low-level, i.e. their
usability is limited to text files and one-way, single-pass operation. For higher-level tasks
they lack mainly advanced data structures.
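As a small illustration of this data-driven, single-pass style, the snippet below mimics the classic awk '{count[$7]++}' idiom in Python; the assumption that the requested URL sits in the seventh whitespace-separated field of an HTTP access log line is ours, made only for the example.

```python
from collections import Counter

def top_urls(lines, n=3):
    """Single-pass, one-way aggregation over text lines, in the
    spirit of AWK: count the 7th field (assumed to hold the
    requested URL) and return the n most frequent values."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 7:          # skip malformed records
            counts[fields[6]] += 1
    return counts.most_common(n)
```

The sketch also shows the limitation mentioned above: one pass, one direction, and a flat counter is about as much data structure as this style comfortably supports.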

1.5 Current Practice
Prior to a more formal definition, let us simply describe log files and their usage.
Typically, log files are used by programs in the following way:
 The log file is an auxiliary output file, distinct from other outputs of the program.
Almost all log files are plain text files.
 On startup of the program, the log file is either empty, or contains whatever was
left from previous runs of the program.
 During program operation, lines (or groups of lines) are gradually appended to the
log file, never deleting or changing any previously stored information.
 Each record (i.e. a line or a group of lines) in a log file is caused by a given event
in the program, like user interaction, function call, input or output procedure etc.
 Records in log files are often parameterized, i.e. they show current values of
variables, return values of function calls or any other state information.
 The information reported in log files is the information that programmers consider
important or useful for program monitoring and/or locating faults.
Syntax and format of log files can vary a lot. There are very brief log files that
contain just sets of numbers, and there are log files containing whole essays. Nevertheless,
an average log file is a compromise between these two approaches; it contains minimum
unnecessary information while it is still easily human-readable. Such files, for example,
contain variable names, variable values, comprehensive delimiters, smart text formatting,
hint keywords, comments, etc.
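A record of this kind can be made concrete. The snippet below parses one line of the widely used Common Log Format of HTTP servers; the field labels and the helper name parse_clf are our own choices for illustration, not part of any tool discussed in this report.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf(line):
    """Parse one Common Log Format record into a dict, or None if malformed."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None

record = parse_clf(
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
```

Such a record is a good example of the compromise described above: it is compact, yet its delimiters (brackets, quotes) keep it both machine-parseable and human-readable.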

CHAPTER 2

HISTORY OF HADOOP

2.1 About Apache Hadoop

Apache Hadoop is an open-source software framework written in Java for distributed


storage and distributed processing of very large data sets on computer clusters built from
commodity hardware. All the modules in Hadoop are designed with a fundamental
assumption that hardware failures are common and should be automatically handled by
the framework.

The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed
File System (HDFS), and a processing part called MapReduce. Hadoop splits files into
large blocks and distributes them across nodes in a cluster. To process data, Hadoop
transfers packaged code to nodes to process in parallel, based on the data that needs to be
processed. This approach takes advantage of data locality (nodes manipulating the data
they have access to) to allow the dataset to be processed faster and more efficiently than
it would be in a more conventional supercomputer architecture that relies on a parallel
file system where computation and data are distributed via high-speed networking.

The base Apache Hadoop framework is composed of the following modules:

 Hadoop Common – contains libraries and utilities needed by other Hadoop
modules;
 Hadoop Distributed File System (HDFS) – a distributed file-system that stores
data on commodity machines, providing very high aggregate bandwidth across
the cluster;
 Hadoop YARN – a resource-management platform responsible for managing
computing resources in clusters and using them for scheduling of users'
applications;[6][7] and
 Hadoop MapReduce – an implementation of the MapReduce programming model
for large scale data processing.

The term Hadoop has come to refer not just to the base modules above, but also to the
ecosystem,[8] or collection of additional software packages that can be installed on top of
or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache
Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache
Sqoop, Apache Oozie, Apache Storm.[9]

Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on
their MapReduce and Google File System.[10]

The Hadoop framework itself is mostly written in the Java programming language, with
some native code in C and command line utilities written as shell scripts. Though
MapReduce Java code is common, any programming language can be used with "Hadoop
Streaming" to implement the "map" and "reduce" parts of the user's program. Other
projects in the Hadoop ecosystem expose richer user interfaces.
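Because Hadoop Streaming only requires executables that read standard input and write tab-separated key/value pairs to standard output, the "map" and "reduce" parts of a simple hit counter can be sketched in Python. The functions below are a minimal sketch under that convention (the sorting and grouping by key between the two phases is what Hadoop itself guarantees); the function names and the log field position are our assumptions.

```python
def mapper(lines):
    """Map phase: emit (url, 1) for each access-log line; under
    Hadoop Streaming these pairs would be printed tab-separated."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 7:              # 7th field assumed to hold the URL
            yield fields[6], 1

def reducer(pairs):
    """Reduce phase: pairs arrive sorted/grouped by key, as the
    framework guarantees between map and reduce; sum hits per URL."""
    current, total = None, 0
    for key, value in pairs:
        if key != current:
            if current is not None:
                yield current, total
            current, total = key, 0
        total += value
    if current is not None:
        yield current, total
```

In an actual job, the two functions would live in two small scripts passed to the streaming jar as the -mapper and -reducer options.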

2.2 History

The genesis of Hadoop came from the Google File System paper that was published in
October 2003. This paper spawned another research paper from Google - MapReduce:
Simplified Data Processing on Large Clusters.[13] Development started in the Apache
Nutch project, but was moved to the new Hadoop subproject in January 2006. Doug
Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.

The initial code that was factored out of Nutch consisted of 5k lines of code for NDFS
and 6k lines of code for MapReduce.

The first committer added to the Hadoop project was Owen O’Malley in March 2006.
Hadoop 0.1.0 was released in April 2006 and continues to evolve by the many
contributors to the Apache Hadoop project.

2.3 Architecture

See also: Hadoop Distributed File System, Apache HBase and MapReduce

Hadoop consists of the Hadoop Common package, which provides file system and OS
level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2) and
the Hadoop Distributed File System (HDFS). The Hadoop Common package contains the
necessary Java Archive (JAR) files and scripts needed to start Hadoop.

For effective scheduling of work, every Hadoop-compatible file system should provide
location awareness: the name of the rack (more precisely, of the network switch) where a
worker node is. Hadoop applications can use this information to execute code on the node
where the data is, and, failing that, on the same rack/switch to reduce backbone traffic.
HDFS uses this method when replicating data for data redundancy across multiple racks.
This approach reduces the impact of a rack power outage or switch failure; if one of these
hardware failures occurs, the data will remain available.

A multi-node Hadoop cluster

A small Hadoop cluster includes a single master and multiple worker nodes. The master
node consists of a Job Tracker, Task Tracker, Name Node, and Data Node. A slave or
worker node acts as both a Data Node and Task Tracker, though it is possible to have
data-only worker nodes and compute-only worker nodes. These are normally used only in
nonstandard applications.[62]

Hadoop requires Java Runtime Environment (JRE) 1.6 or higher. The standard startup
and shutdown scripts require that Secure Shell (ssh) be set up between nodes in the
cluster.[63]

In a larger cluster, HDFS nodes are managed through a dedicated Name Node server to
host the file system index, and a secondary Name Node that can generate snapshots of the
name node's memory structures, thereby preventing file-system corruption and loss of
data. Similarly, a standalone Job Tracker server can manage job scheduling across nodes.
When Hadoop MapReduce is used with an alternate file system, the Name Node,
secondary Name Node, and Data Node architecture of HDFS are replaced by the
file-system-specific equivalents.

2.4 File systems

Hadoop distributed file system

The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file
system written in Java for the Hadoop framework. A Hadoop cluster has nominally a
single name node plus a cluster of data nodes, although redundancy options are available
for the name node due to its criticality. Each data node serves up blocks of data over the
network using a block protocol specific to HDFS. The file system uses TCP/IP sockets
for communication. Clients use remote procedure call (RPC) to communicate between
each other.

HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple
machines. It achieves reliability by replicating the data across multiple hosts, and hence
theoretically does not require RAID storage on hosts (but to increase I/O performance
some RAID configurations are still useful). With the default replication value, 3, data is
stored on three nodes: two on the same rack, and one on a different rack. Data nodes can
talk to each other to rebalance data, to move copies around, and to keep the replication of
data high. HDFS is not fully POSIX-compliant, because the requirements for a POSIX
file-system differ from the target goals for a Hadoop application. The trade-off of not
having a fully POSIX-compliant file-system is increased performance for data throughput
and support for non-POSIX operations such as Append.[65]
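The default placement just described (replication factor 3: two replicas on one rack, one on another) can be made concrete with a toy placement function; the data structures and node names are invented for the example and do not mirror the real NameNode code.

```python
def place_replicas(racks, writer_rack):
    """Toy sketch of the default HDFS placement: one replica on the
    writer's rack, the other two together on a different rack.
    `racks` maps rack name -> list of node names (invented layout)."""
    first = racks[writer_rack][0]
    other_rack = next(r for r in racks if r != writer_rack)
    second, third = racks[other_rack][0], racks[other_rack][1]
    return [first, second, third]
```

Losing a whole rack under this scheme still leaves at least one live replica, which is the point of the policy.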

HDFS added the high-availability capabilities, as announced for release 2.0 in May
2012,[66] letting the main metadata server (the Name Node) fail over manually to a
backup. The project has also started developing automatic fail-over.

The HDFS file system includes a so-called secondary name node, a misleading name that
some might incorrectly interpret as a backup name node for when the primary name node
goes offline. In fact, the secondary name node regularly connects with the primary name
node and builds snapshots of the primary name node's directory information, which the
system then saves to local or remote directories. These checkpointed images can be used
to restart a failed primary name node without having to replay the entire journal of
file-system actions, then to edit the log to create an up-to-date directory structure. Because the
name node is the single point for storage and management of metadata, it can become a
bottleneck for supporting a huge number of files, especially a large number of small files.
HDFS Federation, a new addition, aims to tackle this problem to a certain extent by
allowing multiple namespaces served by separate name nodes.

An advantage of using HDFS is data awareness between the job tracker and task tracker.
The job tracker schedules map or reduce jobs to task trackers with an awareness of the
data location. For example: if node A contains data (x,y,z) and node B contains data
(a,b,c), the job tracker schedules node B to perform map or reduce tasks on (a,b,c) and
node A would be scheduled to perform map or reduce tasks on (x,y,z). This reduces the
amount of traffic that goes over the network and prevents unnecessary data transfer.
When Hadoop is used with other file systems, this advantage is not always available. This
can have a significant impact on job-completion times, which has been demonstrated
when running data-intensive jobs.
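The node A/node B example above can be written out as a toy scheduler that always prefers a tracker already holding the block it needs; the structures are hypothetical and not the real JobTracker API.

```python
def schedule(blocks, node_data):
    """Assign each map task (named after the block it reads) to a
    node that already stores that block, so no block travels over
    the network. `node_data` maps node -> set of local blocks."""
    assignment = {}
    for block in blocks:
        for node, held in node_data.items():
            if block in held:
                assignment[block] = node
                break
    return assignment
```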

HDFS was designed for mostly immutable files and may not be suitable for systems
requiring concurrent write-operations.

HDFS can be mounted directly with a Filesystem in Userspace (FUSE) virtual file
system on Linux and some other Unix systems.

File access can be achieved through the native Java application programming interface
(API), the Thrift API to generate a client in the language of the users' choosing (C++,
Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the
command-line interface, browsed through the HDFS-UI Web application (web app) over
HTTP, or via 3rd-party network client libraries.[68]

2.5 Other file systems

Hadoop works directly with any distributed file system that can be mounted by the
underlying operating system simply by using a file:// URL; however, this comes at a
price: the loss of locality. To reduce network traffic, Hadoop needs to know which
servers are closest to the data; this is information that Hadoop-specific file system bridges
can provide.

In May 2011, the list of supported file systems bundled with Apache Hadoop was:

 HDFS: Hadoop's own rack-aware file system.[69] This is designed to scale to tens
of peta bytes of storage and runs on top of the file systems of the underlying
operating systems.
 FTP File system: this stores all its data on remotely accessible FTP servers.
 Amazon S3 (Simple Storage Service) file system. This is targeted at clusters
hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure.
There is no rack-awareness in this file system, as it is all remote.
 Windows Azure Storage Blobs (WASB) file system. WASB, an extension on top
of HDFS, allows distributions of Hadoop to access data in Azure blob stores
without moving the data permanently into the cluster.

A number of third-party file system bridges have also been written, none of which are
currently in Hadoop distributions. However, some commercial distributions of Hadoop
ship with an alternative file system as the default; specifically, IBM and MapR.

 In 2009, IBM discussed running Hadoop over the IBM General Parallel File
System.[70] The source code was published in October 2009.[71]
 In April 2010, Parascale published the source code to run Hadoop against the
Parascale file system.[72]
 In April 2010, Appistry released a Hadoop file system driver for use with its own
CloudIQ Storage product.[73]
 In June 2010, HP discussed a location-aware IBRIX Fusion file system driver.[74]
 In May 2011, MapR Technologies, Inc. announced the availability of an
alternative file system for Hadoop, which replaced the HDFS file system with a
full random-access read/write file system.

Job Tracker and Task Tracker: the MapReduce engine

Main article: MapReduce

Above the file systems comes the MapReduce Engine, which consists of one Job
Tracker, to which client applications submit MapReduce jobs. The Job Tracker pushes
work out to available Task Tracker nodes in the cluster, striving to keep the work as close
to the data as possible. With a rack-aware file system, the Job Tracker knows which node
contains the data, and which other machines are nearby. If the work cannot be hosted on
the actual node where the data resides, priority is given to nodes in the same rack. This
reduces network traffic on the main backbone network. If a Task Tracker fails or times
out, that part of the job is rescheduled. The Task Tracker on each node spawns a separate
Java Virtual Machine process to prevent the Task Tracker itself from failing if the
running job crashes its JVM. A heartbeat is sent from the Task Tracker to the Job Tracker
every few minutes to check its status. The Job Tracker and Task Tracker status and
information is exposed by Jetty and can be viewed from a web browser.

Known limitations of this approach are:

 The allocation of work to Task Trackers is very simple. Every Task Tracker has a
number of available slots (such as "4 slots"). Every active map or reduce task
takes up one slot. The Job Tracker allocates work to the tracker nearest to the data
with an available slot. There is no consideration of the current system load of the
allocated machine, and hence its actual availability.
 If one Task Tracker is very slow, it can delay the entire MapReduce job, especially
towards the end of a job, where everything can end up waiting for the slowest
task. With speculative execution enabled, however, a single task can be executed
on multiple slave nodes.

Scheduling

By default Hadoop uses FIFO scheduling, and optionally 5 scheduling priorities to
schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of
the Job Tracker, while adding the ability to use an alternate scheduler (such as the Fair
scheduler or the Capacity scheduler, described next).

Fair scheduler

The fair scheduler was developed by Facebook. The goal of the fair scheduler is to
provide fast response times for small jobs and Quality of Service for production jobs. The
fair scheduler has three basic concepts.

1. Jobs are grouped into pools.
2. Each pool is assigned a guaranteed minimum share.
3. Excess capacity is split between jobs.

By default, jobs that are uncategorized go into a default pool. Pools have to specify the
minimum number of map slots, reduce slots, and a limit on the number of running jobs.

Capacity scheduler

The capacity scheduler was developed by Yahoo. The capacity scheduler supports several
features that are similar to the fair scheduler.[79]

 Queues are allocated a fraction of the total resource capacity.
 Free resources are allocated to queues beyond their total capacity.
 Within a queue a job with a high level of priority has access to the queue's
resources.

There is no preemption once a job is running.

2.6 Other applications

The HDFS file system is not restricted to MapReduce jobs. It can be used for other
applications, many of which are under development at Apache. The list includes the
HBase database, the Apache Mahout machine learning system, and the Apache Hive Data
Warehouse system. Hadoop can in theory be used for any sort of work that is batch-
oriented rather than real-time, is very data-intensive, and benefits from parallel
processing of data. It can also be used to complement a real-time system, such as lambda
architecture.

As of October 2009, commercial applications of Hadoop[80] included:

 Log and/or click stream analysis of various kinds


 Marketing analytics
 Machine learning and/or sophisticated data mining
 Image processing
 Processing of XML messages
 Web crawling and/or text processing
 General archiving, including of relational/tabular data, e.g. for compliance

Prominent users

On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest
Hadoop production application. The Yahoo! Search Web map is a Hadoop application
that runs on a Linux cluster with more than 10,000 cores and produced data that was used
in every Yahoo! web search query.[81] There are multiple Hadoop clusters at Yahoo! and
no HDFS file systems or MapReduce jobs are split across multiple datacenters. Every
Hadoop cluster node bootstraps the Linux image, including the Hadoop distribution.
Work that the clusters perform is known to include the index calculations for the Yahoo!
search engine. In June 2009, Yahoo! made the source code of the Hadoop version it runs
available to the public via the open-source community.[82]

In 2010, Facebook claimed that they had the largest Hadoop cluster in the world with 21
PB of storage.[83] In June 2012, they announced the data had grown to 100 PB[84] and later
that year they announced that the data was growing by roughly half a PB per day.[85]

As of 2013, Hadoop adoption had become widespread: more than half of the Fortune 50
used Hadoop.[86]

Hadoop hosting in the Cloud

Hadoop can be deployed in a traditional onsite datacenter as well as in the cloud. The
cloud allows organizations to deploy Hadoop without hardware to acquire or specific
setup expertise. Vendors who currently have an offer for the cloud include Microsoft,
Amazon, IBM and Google.

On Microsoft Azure

Azure HDInsight is a service that deploys Hadoop on Microsoft Azure. HDInsight uses
Hortonworks HDP and was jointly developed for HDI with Hortonworks. HDI allows
programming extensions with .NET (in addition to Java). HDInsight also supports
creation of Hadoop clusters using Linux with Ubuntu. By deploying HDInsight in the
cloud, organizations can spin up the number of nodes they want and only get charged for
the compute and storage that is used.[90] Hortonworks implementations can also move
data from the on-premises datacenter to the cloud for backup, development/test, and
bursting scenarios.[90] It is also possible to run Cloudera or Hortonworks Hadoop clusters
on Azure Virtual Machines.

On Amazon EC2/S3 services

It is possible to run Hadoop on Amazon Elastic Compute Cloud (EC2) and Amazon
Simple Storage Service (S3). As an example, The New York Times used 100 Amazon
EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored
in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of
about $240 (not including bandwidth).

There is support for the S3 object store in the Apache Hadoop releases, though this is
below what one expects from a traditional POSIX file system. Specifically, operations
such as rename() and delete() on directories are not atomic, and can take time
proportional to the number of entries and the amount of data in them.
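To illustrate why this matters, the copy-and-delete behaviour can be sketched in miniature. The class and method names below are our own illustration, not Hadoop's actual S3 connector code:

```python
# Illustrative sketch: a "directory rename" on an object store is really one
# operation per key under the prefix, so it is not atomic and its cost grows
# with the number of entries (unlike a single POSIX rename).
class FakeObjectStore:
    def __init__(self):
        self.objects = {}  # key -> data

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, src, dst):
        """Emulate a directory rename: move each key individually."""
        moved = 0
        for key in list(self.objects):
            if key.startswith(src):
                self.objects[dst + key[len(src):]] = self.objects.pop(key)
                moved += 1
        return moved  # number of per-object operations performed

store = FakeObjectStore()
for i in range(3):
    store.put(f"logs/day1/part-{i}", b"...")
count = store.rename_prefix("logs/day1/", "archive/day1/")
print(count)  # 3 per-object operations for a single "rename"
```

If another reader lists the store midway through the loop, it sees a half-renamed directory, which is exactly the non-atomicity the text describes.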

Amazon Elastic MapReduce

Elastic MapReduce (EMR) was introduced by Amazon.com in April 2009. Provisioning
of the Hadoop cluster, running and terminating jobs, and handling data transfer between
EC2 (VM) and S3 (Object Storage) are automated by Elastic MapReduce. Apache Hive,
which is built on top of Hadoop for providing data warehouse services, is also offered in
Elastic MapReduce.

Support for using Spot Instances was later added in August 2011. Elastic MapReduce is
fault tolerant for slave failures, and it is recommended to only run the Task Instance
Group on spot instances to take advantage of the lower cost while maintaining
availability.

Commercial support

A number of companies offer commercial implementations or support for Hadoop.

ASF's view on the use of "Hadoop" in product names

The Apache Software Foundation has stated that only software officially released by the
Apache Hadoop Project can be called Apache Hadoop or Distributions of Apache

Hadoop. The naming of products and derivative works from other vendors and the term
"compatible" are somewhat controversial within the Hadoop developer community.

CHAPTER 3
LITERATURE SURVEY
3.1 BIG DATA
“Every day, we create 2500 PB of data — so much that 90% of the data in the world
today has been created in the last two years alone.”

Modern systems have to deal with far more data than was the case in the past.
Organizations are generating huge amounts of data; that data has inherent value and
cannot simply be deleted.

We live in the data age! Web has been growing rapidly in size as well as scale
during the last 10 years and shows no signs of slowing down. Statistics show that every
passing year more data gets generated than all the previous years combined. Moore's law
not only holds true for hardware but for the data being generated too. Without wasting
time coining a new phrase for such vast amounts of data, the computing industry decided
to just call it, plain and simple, Big Data.

More than structured information stored neatly in rows and columns, Big Data
actually comes in complex, unstructured formats, everything from web sites, social media
and email, to videos, presentations, etc. This is a critical distinction, because, in order to
extract valuable business intelligence from Big Data, any organization will need to rely on
technologies that enable a scalable, accurate, and powerful analysis of these formats.

Big data dimensions (IBM’s 4 V’s): Volume, Velocity, Variety and Veracity

Volume

Volume is the amount of data to be processed. In many business areas data is uploaded
and processed at very high rates, on the order of terabytes to zettabytes. Present statistics
suggest that every day around 25 terabytes of data are uploaded to Facebook, 12 terabytes
to Twitter, and around 10 terabytes come from various devices such as RFID readers,
telecommunications and networking equipment. Storing such huge amounts of data
requires large clusters and servers with high bandwidth. The problem is not only storing
the information but also processing it at much higher speed, which has become a major
issue for most companies.

Velocity

Velocity is defined as the speed at which data is created, modified, retrieved and
processed. This is one of the major challenges of big data because the time needed for
processing and performing different operations matters greatly when dealing with
massive amounts of data. Traditional databases are of little use for such data and do not
fully satisfy the customers in organizations and industries, so the processing rate must be
taken into account. Hence velocity is one of the big challenges in dealing with big data.

Variety

In a distributed environment many different types of data may be present; this is known
as the variety of data. Data can be categorized as structured, semi-structured and
unstructured, and the process of analyzing and operating on each varies from one to
another. Social media such as Facebook posts or Tweets can give different insights, such
as sentiment analysis on your brand, while sensory data will give you information about
how a product is used and what the mistakes are. Processing information from such
different sets of data is therefore a major issue.

Veracity

Big data veracity refers to the biases, noise and abnormality in data: whether the data
being stored and mined is meaningful to the problem being analyzed. In other words,
veracity can be treated as the uncertainty of data due to data inconsistency and
incompleteness, ambiguities, latency, and model approximations in the process of
analyzing data.

[Figure: YARN architecture]

3.2 Technical Challenges

 Scalable storage
 Massive Parallel Processing and System Integration
 Data Analysis

3.3 Hadoop as a solution

The next logical question arises – How do we efficiently process such large data
sets? One of the pioneers in this field was Google, which designed scalable frameworks
like MapReduce and Google File System. Inspired by these designs, an Apache open
source initiative was started under the name Hadoop. Apache Hadoop is a framework that
allows for the distributed processing of such large data sets across clusters of machines.

Apache Hadoop, at its core, consists of 2 sub-projects – Hadoop MapReduce and
Hadoop Distributed File System. Hadoop MapReduce is a programming model and
software framework for writing applications that rapidly process vast amounts of data in
parallel on large clusters of compute nodes. HDFS is the primary storage system used by
Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them
on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
Other Hadoop-related projects at Apache include PIG, Hive, HBase, Mahout, Sqoop,
Flume, Oozie and ZooKeeper.
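As a rough sketch of the MapReduce model described above (written here as a Hadoop Streaming-style mapper and reducer in Python rather than the usual Java API; the function names are ours, not part of Hadoop):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.strip().split():
        yield (word.lower(), 1)

def reducer(pairs):
    """Reduce phase: sum counts per key; input is sorted by key,
    mimicking the shuffle/sort step between map and reduce."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

lines = ["Hadoop stores logs", "Hadoop processes logs"]
intermediate = [pair for line in lines for pair in mapper(line)]
print(dict(reducer(intermediate)))
# {'hadoop': 2, 'logs': 2, 'processes': 1, 'stores': 1}
```

On a real cluster the mapper runs on many nodes in parallel, each over one HDFS block, and the framework performs the sort and grouping between the two phases.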

The Motivation for Hadoop - Traditional Large-Scale Computation

Traditionally, computation has been processor-bound: relatively small amounts of data
with a significant amount of complex processing performed on that data. For years, the
primary concern was to increase the computing power of a single machine: a faster
processor, more RAM.

Programming for traditional distributed systems is complex. Data exchange requires
synchronization, temporal dependencies are complicated, and it is difficult to deal with
partial failures of the system.

Data storage: data for a distributed system is stored on a SAN. At compute time, data is
copied to the compute nodes (data moves to code), which is only good for relatively
small amounts of data.

What is Hadoop?

 Open source project
 Written in Java
 Optimized to handle:
- Massive amounts of data through parallelism
- A variety of data (structured, unstructured and semi-structured)
 Uses inexpensive commodity hardware and gives great performance
 Reliability provided through replication
 Apache Hadoop 2.7.1 is a stable release in the 2.7.x release line, building upon the
previous stable release 2.7.0.

Hadoop core concepts

 Computation happens where the data is stored, wherever possible.

 Nodes talk to each other as little as possible via TCP Handshake.

 If a node fails, the master will detect that failure and re-assign the work to a different
node on the system

 Restarting a task does not require communication with nodes working on other
portions of the data

 When data is loaded into the system, it is split into ‘blocks’, typically 64MB or
128MB

 Map tasks (the first part of the MapReduce system) work on relatively small
portions of data typically a single block.

 A master program allocates work to nodes such that a Map task will work on a
block of data stored locally on that node

 Many nodes work in parallel, each on their own part of the overall dataset

CHAPTER 4
SYSTEM ANALYSIS

4.1 EXISTING SYSTEM

Our Existing System is a Traditional Database. A DBMS makes it possible for end users
to create, read, update and delete data in a database. The DBMS essentially serves as an
interface between the database and end users or application programs, ensuring that data is
consistently organized and remains easily accessible.

DISADVANTAGES

 Traditionally, computation has been processor-bound
- Small amounts of data
- Significant amount of complex processing performed on that data
 For years, the primary concern was to increase the computing power of a single
machine
- Faster processor, more RAM
 Programming for traditional distributed systems is complex
- Data exchange requires synchronization
- Temporal dependencies are complicated
- Difficult to deal with partial failures of the system
 Data Storage
- Data for a distributed system is stored on a SAN
- At compute time, data is copied to the compute nodes
- Only good for relatively small amounts of data

4.2PROPOSED SYSTEM

‘BIG DATA’ has been gaining importance in different industries over the last year or
two, at a scale that generates vast amounts of data every day. Big Data is a term applied
to data sets of such large size that traditional databases are unable to process their
operations in a reasonable amount of time. It has tremendous potential to transform
business in several ways. The challenge is not only storing the data, but also accessing
and analyzing the required data in a specified amount of time. One popular
implementation that addresses these challenges is Hadoop. Hadoop is a well-known
open-source implementation of the MapReduce programming model for processing big
data in parallel, running data-intensive jobs on clusters of commodity servers. It is a
highly scalable compute platform. Hadoop enables users to store and process bulk
amounts of data, which is not possible with less scalable techniques.

SYSTEM ARCHITECTURE:

SYSTEM PROCESSING STEPS:

Step 1:

Streaming data through flume from webservers and storing the data sets on Hadoop

Step 2:

Processing (doing ETL jobs) Log files with PIG

Step 3:

Storing the resultant data set on Hadoop or once again exporting it into Hive (Data
Warehouse system)
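As a hedged illustration of the ETL in Step 2 (a Python stand-in for the actual PIG script, not the project's code), one web server log line in the Common Log Format can be parsed into named fields like this:

```python
import re

# Common Log Format: host ident user [time] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse(line):
    """Return a dict of named fields, or None for a malformed line."""
    m = CLF.match(line)
    return m.groupdict() if m else None

line = ('192.168.1.10 - - [10/Oct/2015:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326')
rec = parse(line)
print(rec["host"], rec["url"], rec["status"])  # 192.168.1.10 /index.html 200
```

Grouping the parsed records by `url` or by hour of `time` then yields the access-pattern counts that the trend analysis is built on; PIG's LOAD/FILTER/GROUP operators perform the same steps at cluster scale.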

4.3 SYSTEM SPECIFICATIONS:

4.3.1 SOFTWARE SPECIFICATIONS:

Operating System : Windows7/Windows8, centos, Ubuntu

VMware Player/Studio /Oracle Virtual Box

4.3.2 MINIMUM HARDWARE SPECIFICATIONS


Hard Disk : 40 GB
RAM : 4GB
Processor : i3/i5 Processor which supports virtualization

CHAPTER 5
SYSTEM DESIGN
5.1 SYSTEM ARCHITECTURE

[Figure: Hadoop framework]
[Figure: Architecture of the data warehouse]

5.2 USECASE DIAGRAM


A Use Case diagram identifies the functionality provided by the system (use cases), the
users who interact with the system (actors), and the associations between the users and
the functionality. Use Cases are used in the analysis phase of software development to
articulate the high-level requirements of the system. The primary goals of Use Case
diagrams include:
 Providing a high-level view of what the system does
 Identifying the users (“actors”) of the system
 Determining areas needing human-computer interfaces
Use Cases extend beyond pictorial diagrams. In fact, text-based use case descriptions are
often used to supplement diagrams and explore use case functionality in more detail.
Graphical Notation
The basic components of Use Case diagrams are the actor, the Use Case, and the
Association.
Actor
An Actor, as mentioned, is depicted using a stick figure. The role of the user is written
beneath the icon. Actors are not limited to humans. If a system communicates with another

application, and expects input or delivers output, then that application can also be
considered an actor

Use Case A Use Case is functionality provided by the system. Use Cases are depicted
with an ellipse; the name of the use case is written within the ellipse.

Association Associations are used to link Actors with Use Cases in some form.
Associations are depicted by a line connecting the actor and the use case.

Behind each Use case is a series of actions to achieve the proper functionality, as
well as alternate paths for instances where validation fails, or errors occur. These actions
can be further defined in a Use Case description. Because this is not addressed in UML,
there are no standards for Use case descriptions. However, there are some common
templates you can follow, and whole books on the subject writing of Use Case descriptions.
Common methods of writing Use Case descriptions include:
 Write a paragraph describing the sequence of activities in the Use Case
 List two columns, with the activities of the actor and the responses by the system
 Use a template (such as those from the Rational Unified Process or Alistair
Cockburn’s book, Writing Effective Use Cases) identifying actors, preconditions,
postconditions, main success scenarios and extensions.

Remember, the goal of the process is to be able to communicate the requirements of the
system, so use whichever method is best for your team and your organization.

USECASE DIAGRAMS

The Log Analysis

Searching the web

5.3 BEHAVIOURAL DIAGRAMS

5.3.1 Data Flow Diagram

A Data Flow Diagram (DFD) is a graphical representation of the “flow” of data through
an information system, modeling its process aspects. A DFD is often used as a
preliminary step to create an overview of the system, which can later be elaborated.
DFDs can also be used for the visualization of data processing (structured design).

The Flow of Data

5.3.2 Sequence Diagram

Web Server Data Analyst Manager
Administrator

1 : login()

2 : gives access to log files()

3 : Manages the server logs()

4 : Analyses the webserver log files()

5 : creates a report of processed log files()

6 : gives the report()

7 : Decision making()

8 : Decision sending()

9 : Final Report()

10 : Terminate()

The Log Analysis

CHAPTER 6

SYSTEM IMPLEMENTATION

6.1 Language of Metadata

This section describes some design issues in more detail and explains decisions that have
been made. Although this paper was written during the early stages of design and
implementation, it is believed to reflect the key design ideas.
The first stage of log file analysis is the process of understanding log content. This is
done by lexically scanning log entries and parsing them into internal data structures
according to a given syntax and semantics. The semantics can be either explicitly
specified in the log file or it must be provided by a human (preferably by the designer of
the involved software product) in another way, i.e. by a description in a corresponding
metadata file.
6.1.1 Log Description
Unfortunately, the common practice is that log files do not contain any explicit semantic
information. In such a case the user has to supply the missing information, i.e. to create
the metadata in a separate file that contains a log file description.
The format of the metadata information (the language of metadata) is hard-wired into the
analyzer. The exact notation is not determined yet, but there is a suggestion in the next
paragraph.
Generally the format of a log file is given and we cannot change the internal file
structure directly by adding other information. Instead, we must provide some external
information. For example, we can simply list all possible records in the log file and
explain their meaning, or write a grammar that would correspond to the given log format.
A complete list of all possibilities or a list of all grammar rules (transcriptions) is
exhausting; therefore we should seek a more efficient notation. One possible solution to
this problem is based on matching log records against a set of patterns and searching for
the best fit. This idea is illustrated by an example in the next paragraph.

6.1.2 Example
For illustration, let us assume a fake log file that contains a mixture of several types of
events with different parameters and different delimiters, like this:
 Structured reports from various programs:
12:52:11 alpha kernel: Loaded 113 symbols in 8 modules
12:52:11 alpha inet: inetd startup succeeded.
12:52:11 alpha sshd: Connection from 192.168.1.10 port 64187
12:52:11 beta login: root login on tty1
 Periodic reports of a temperature:
16/05/01 temp=21
17/05/01 temp=23
18/05/01 temp=22
 Reports about switching a heater on and off:
heater on
heater off
Then we can construct a description of the log structure in the following way. There are
four rules that identify parts of data separated by delimiters that are listed in parentheses.
Each description of a data element consists of a name (or an asterisk if the name is not
further used) and of an optional type description. There are macros that recognize
common built-in types of information like time, date, IP address etc. in several notations.
time:$TIME( )host( )program(: )message
time:$TIME( )*( sshd: )*( )address:$IP( port )port
:$DATE( temp=)temperature
(heater )status
The first line fits any record that begins with a timestamp (the macro $TIME should fit
various notations of time) followed by a space, a host name, a program name followed by
a colon, one more space and a message. The second line fits records of the sshd program
from any host; this rule also extracts the corresponding IP address and port number. The
third line fits dated temperature records and, finally, the fourth fits the heater status.
Users would later use the names defined herein (i.e. "host", "program", "address" etc.)
directly in their programs as identifiers of variables.
Note: this is just an example of how the meta file can look. The exact specification is the
subject of further research in the PhD thesis.
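One plausible way to realize this pattern-matching idea is to compile each metadata rule into a regular expression with named groups and try the rules in order, most specific first. The handling of the $TIME, $DATE and $IP macros below is only a rough approximation of what the analyzer's macros would accept:

```python
import re

# Each metadata rule from the example is approximated by one regex whose
# named groups play the role of the rule's field names.
RULES = [
    # time:$TIME( )*( sshd: )*( )address:$IP( port )port -- most specific first
    re.compile(r'(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) sshd: Connection from '
               r'(?P<address>\d+\.\d+\.\d+\.\d+) port (?P<port>\d+)'),
    # time:$TIME( )host( )program(: )message
    re.compile(r'(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<program>\w+): '
               r'(?P<message>.*)'),
    # :$DATE( temp=)temperature
    re.compile(r'(?P<date>\d{2}/\d{2}/\d{2}) temp=(?P<temperature>\d+)'),
    # (heater )status
    re.compile(r'heater (?P<status>\w+)'),
]

def interpret(record):
    """Return named fields from the first matching rule, or None."""
    for rule in RULES:
        m = rule.match(record)
        if m:
            return m.groupdict()
    return None

print(interpret("12:52:11 alpha sshd: Connection from 192.168.1.10 port 64187")["port"])
print(interpret("16/05/01 temp=21")["temperature"])
```

Note that rule order matters here: the sshd rule must precede the generic program rule, otherwise the connection record would be swallowed as an ordinary message. A "best fit" search, as the text suggests, would remove that ordering burden from the metadata author.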
6.2 Language of Analysis

The language of analysis is a cardinal element of the framework. It determines
fundamental features of any analyzer and therefore requires careful design. Let us repeat
its main purpose: the language of analysis is a notation for efficiently finding or
describing various data relations within log files or, more generally, finding unknown
relations in heterogeneous tables. This is the language of user-written filter and analytical
plug-ins. A well-designed language should be powerful enough to express easily all
required relations using an abstract notation. Simultaneously, it must allow feasible
implementation in available programming languages. The following paragraphs present a
draft of such a language. They sequentially describe the principle of operation, basic
features and structures. A complete formal description will be developed during further
work.

6.2.1 Principle of Operation and Basic Features


The language of analysis follows a data-driven programming style that eliminates the
use of frequent program loops. Its basic operation is roughly similar to AWK, as is
further explained. Programs are sets of conditions and actions that are applied to each
record of a given log file. Conditions are Boolean expressions and can use abstract data
types. Actions are imperative procedures without direct input/output but with the ability
to control the flow of data or execution. Unlike AWK, there is no single-pass processing
of the stream of log records. The analyzer takes conditions one by one and transforms
them into SQL commands that select the necessary data into a temporary table. The table
is then processed record by record: each time, data from one record are used as variable
values in the respective action and the action is executed. The type system is weak; a
variable's type is determined by assigned values. The language of analysis defines the
following sections:

1. An optional FILTER section that contains a set of conditions that allow selective
filtration of log content. The FILTER section is stored separately in a filter plug-in. Filter
plug-ins use a different syntax to analytical plug-ins. A more detailed description is
available later in this chapter.
2. An optional DEFINE section, where output variables and user-defined functions are
declared or defined. The log content is already accessible in a pseudo-array using
identifiers declared in the metafile.
3. An optional BEGIN section that is executed once before 'main' program execution.
The BEGIN section can be used for initialization of some variables or for analysis of
'command line' arguments passed to the plug-in.
4. The 'MAIN' analytic program that consists of ordered pairs [condition, action]. Rules
are written in the form of Boolean expressions; actions are written in the form of
imperative procedures. All log entries are step-by-step tested against all conditions; if a
condition is satisfied then the action is executed. In contrast to AWK, the conditions and
actions manipulate data at a high abstraction level, i.e. they use variables and not lexical
elements as AWK does. In addition, both conditions and actions can use a variety of
useful operators and functions, including but not limited to string and Boolean functions
as in AWK. Actions are written in a pseudo-C imperative language.
5. An optional END part that is executed once after the 'main' program execution. Here a
plug-in can produce summary information or perform final, close-up computation.
The DEFINE, BEGIN, 'MAIN' and END sections together make an analytical plug-in
module.
6.2.2 Data Filtering and Cleaning
Data filtering and cleaning is a simple transformation of a log file to reduce its size; the
reduced information is of the same type and structure. During this phase, an analyzer has
to determine what log entries should be filtered out and what information should be
passed on for further processing if the examined record successfully passes filtration.
The filtration is performed by testing each log record against a set of conditions.
A set of conditions must be declared to have one of the following policies (in all cases
there are two types of conditions, 'negative' and 'positive'):
 Order pass, fail means that all pass conditions are evaluated before fail conditions;
the initial state is FAIL, all conditions are evaluated, there is no shortcut, and the
order of conditions is not significant.
 Order fail, pass means that all fail conditions are evaluated before pass conditions;
the initial state is PASS, all conditions are evaluated, there is no shortcut, and the
order of conditions is not significant.
 Order require, satisfy means that a log record must pass all require conditions
before and including the first satisfy condition (if there is any); the record fails
filtration when it is stopped by the first unsatisfied condition; the order of
conditions is significant.
The conditions are written as common Boolean expressions and use identifiers from the
metadata. They can also use functions and operators from the API library. Here are some
examples of fragments of filter plug-ins (they assume the sample log file described
earlier):
#Example 1: messages from alba not older 5 days
order pass,fail
pass host=="alba"
pass (time-NOW)[days]<5
#Example 2: problems with hackers and secure shell
order fail, pass
fail all
pass (host=="beta")&&(program=="sshd")
pass (program=="sshd")&&(address.dnslookup IN "badboys.com")
#Example 3: heater test
order require,satisfy
satisfy temperature>30 # if temperature is too high -> critical
require temperature>25 # if temperature is high and ...
require status=="on" # heater is on -> heater malfunction ???
There must also be a way to specify what information should be excluded from further
processing. This can be done by 'removal' of selected variables from internal data
structures. For example, the following command indicates that the variables "time" and
"host" are not further used:
omit time,host
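A minimal sketch of the "order pass, fail" policy, under one plausible reading of the semantics described above (initial state FAIL, every condition evaluated with no shortcut, each satisfied condition moving the state toward its verdict); the helper name `filter_record` is ours:

```python
def filter_record(record, pass_conds, fail_conds, order="pass,fail"):
    """One plausible reading of the filter policies: start in the policy's
    initial state and evaluate every condition (no shortcut); a satisfied
    condition sets the state to PASS or FAIL respectively."""
    state = order.endswith("pass")  # initial: FAIL for pass,fail; PASS for fail,pass
    groups = [(pass_conds, True), (fail_conds, False)]
    if order == "fail,pass":
        groups.reverse()  # evaluate fail conditions first
    for conds, verdict in groups:
        for cond in conds:
            if cond(record):
                state = verdict
    return state

# Mirrors Example 1: messages from host "alba" not older than 5 days.
rec = {"host": "alba", "age_days": 3}
passed = filter_record(
    rec,
    pass_conds=[lambda r: r["host"] == "alba", lambda r: r["age_days"] < 5],
    fail_conds=[],
)
print(passed)  # True
```

Since all conditions are evaluated without a shortcut, the last satisfied condition in evaluation order decides the verdict, which is exactly what makes the order of the two groups (but not of the conditions within a group) significant.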

6.2.3 Analytical Tasks


This paragraph further describes the language of analysis, especially the features and
expressions used in analytical tasks. Some features are illustrated in subsequent
examples. A program consists of an unordered list of conditions and actions.
Conditions are Boolean expressions in C-like notation and they are evaluated from left to
right. Conditions can be preceded by a label; in such a case they act like procedures and
can be referenced in actions. Conditions use the same syntax as actions, see below for
more details.
Actions are imperative procedures that are executed if the respective condition is true.
Actions have no file or console input and output. All identifiers are case-sensitive.
Here is a list of basic features:
 There are common control structures like the sequence command, if-then-else
conditions, three program loops (for, while, do-while), function invocation etc.
There are also special control commands like abort or rewind.
 An expression is a common composition of identifiers, constants, operators and
parentheses. Priority of operators is defined but can be overridden by parentheses.
 There are unary and binary operators with prefix, infix and postfix notation. Unary
postfix operators are mainly type-converters.
 Variables should be declared prior to their usage. At the beginning they are empty;
they have no initial value. A variable's type is determined by the first assignment
but it can change with subsequent assignments. There are the following basic
types: number, string, and array. Strings are enclosed in quotes; arrays use
brackets for the index.
 Variables are referred to directly by their names. If a variable is used inside a
string, the name is preceded by a $ sign.
 Variable scope and visibility span the whole program.
 Arrays are hybrid (also referred to as hash arrays); they are associative arrays with
possible indexed access. Internally, they are represented by a list of key-value
pairs.
 Multidimensional arrays are allowed; in fact they are array(s) of arrays. Fields are
accessed via index or using an internal pointer.
 An index can be of type number or string. Arrays can be heterogeneous.
 A new field in an array can be created by assignment without an index or by
assignment with a not-yet-existing key.
 Common array operators like first, last, next etc. are available.
 Structures (record types) are internally treated as arrays; therefore there is no
special type. Sample 'structures' are date, time, IP address and similar.
 Procedures and functions are written as named pairs (condition, action).

 Conditions are optional. Arguments are referred to within a procedure by their
order; $1 is the first one. Procedure invocation is done by its identifier followed by
parentheses, optionally with passed arguments. Recursion is not allowed.
 There is also a set of internal variables that reflect the internal state of the
analyzer: current log file, current line number, current log record etc.
 As in the C language, there is a core language design and useful functions are
separated into a library. There are 'system' and 'user' libraries. Some system
functions are 'primitive', which means they cannot be re-written using this
language.
Below are several simple examples that illustrate basic features of the suggested
language of analysis. The examples are believed to be self-explanatory and they use the
same sample log file from section 6.1.2.
#Example 1: counting and statistics
DEFINE {average, count}
BEGIN {count=0; sum=0}
(temperature<>0) {count++; sum+=temperature}
END {average=sum/count}
#Example 2: security report - ssh sessions
DEFINE {output tbl}
BEGIN {user=""; msg=[]}
(program=="sshd"){msg=FINDNEXT(30,5m,(program==login));
user=msg[message].first;
output_tbl[address.dnslookup][user]++
}
#Example 3: heater validation by a state machine
DEFINE {result}
BEGIN {state="off"; tempr=0; min=argv[1]; max=argv[2]}
(temperature<>0) {tempr=temperature}
(status=="on") {if (state=="off" &&tempr<min) state="on"
else {result="Error at $time"; abort}
}
(status=="off") {if (state=="on" &&tempr>max) state="off"

else {result="Error at $time"; abort}
}
END {result="OK"}
#Example 4: finding relations
DEFINE {same_addr,same_port}
(program=="sshd"){foo(address); bar(port)}
foo:(address==$1){same_addr[]=message}
bar:(port==$1) {same_port[]=message}
#Example 5: finding causes
DEFINE {causes}
BEGIN {msg=[]}
(program=="sshd"){msg=FINDMSG(100,10m,3);
sort(msg);
for (i=0;i<msg.length; i++)
causes[]="$i: $msg[i][time] $msg[i][program]"
}
Again, the examples above just illustrate how things can work. A formal specification is
the subject of further work.
6.2.4 Discussion
The proposed system of operation is believed to take advantage of both AWK and
imperative languages while avoiding the limitations of AWK. In fact, AWK makes it
possible to write incredibly short and efficient programs for text processing, but it is
nearly useless when it should work with high-level data types and terms like variable
values and semantics.
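The condition–action pairs of the proposed language follow the same pattern–action model as AWK. As a minimal sketch (not the proposed analyzer itself), the counting logic of Example 1 can be expressed over a stream of parsed records; the record layout and the `temperature` field are assumptions borrowed from that example:

```python
# Sketch of the condition/action evaluation model: each rule is a
# (condition, action) pair applied to every parsed log record.
# Sample records and the "temperature" field are illustrative only.

records = [
    {"temperature": 21.5},
    {"temperature": 0.0},    # skipped: fails the condition
    {"temperature": 18.5},
]

state = {"count": 0, "sum": 0.0}   # corresponds to the BEGIN block

def condition(rec):
    # (temperature<>0) from Example 1
    return rec["temperature"] != 0

def action(rec, st):
    # {count++; sum+=temperature} from Example 1
    st["count"] += 1
    st["sum"] += rec["temperature"]

for rec in records:
    if condition(rec):
        action(rec, state)

average = state["sum"] / state["count"]   # corresponds to the END block
print(state["count"], average)
```

The BEGIN/END blocks map naturally onto setup code before the loop and summary code after it, which is exactly how AWK evaluates its programs.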
6.3 Analyzer Layout
The previous chapters imply that a universal log analyzer is a fairly complicated piece of
software that must fulfill many requirements. For the purpose of easy and open
implementation, the analyzer is therefore divided into several functional modules, as
shown in the figure below:

The figure proposes one possible modular decomposition of an analyzer. It shows the
main modules, the used files, the user, basic information flows (solid arrows) and
basic control flows (outlined arrows).
6.3.1 Operation
Before a detailed description of the analyzer's components, let us enumerate the basic stages
of log processing from the user's point of view and thus explain how the analyzer operates.
1. Job setup. In a typical session, the user specifies via a user interface one or more log
files, a metadata file and selects modules that should participate in the job, each module
for one analytical task.
2. Pre-processing. The analyzer then reads metadata information (using lexical and syntax
analyzers) and thus learns the structure of the given log file.
3. Filtration, cleaning, database construction. After learning the log structure, the analyzer
starts reading the log file and, according to the metadata, stores pieces of information into the
database. During this process some unwanted records can be filtered out, or some selected
parts of each record can be omitted to reduce the database size.

4. The analysis. When all logs are read and stored in the database, the analyzer executes
one-after-one all user-written modules that perform analytical tasks. Each program is
expected to produce some output in the form of a set of variables, or the output may be
missing if the log does not contain the sought-after pattern etc.
5. Processing of results. Output of each module (if any) is passed to the visualization
module that uses yet another type of files (visual plug-ins) to learn how to present obtained
results. For example, it can transform the obtained variables into pie charts, bar graphs,
nice looking tables, lists, sentences etc. Finally, the transformed results are either displayed
to user via the user interface module or they are stored in an output file.
6.3.2 File types
There are several types of files used to describe data and to control program behavior
during the analysis. The following list describes each type of file:
 Control file. This file can be optionally used as a configuration file and thus
eliminates the need for any user interface. It contains a description of the log source
(i.e. local filename or network socket), the metadata location, and the list of applied
filtering plug-ins, analytical plug-ins and visual plug-ins.
 Metadata file. The file contains information about the structure and format of a
given log file. First, there are parameters used for lexical and syntax analysis of the
log. Second, the file gives semantics to various pieces of information in the log, i.e.
it declares names of variables used later in analytical plug-ins and describes fixed
and variable parts of each record (timestamp, hostname, IP address etc.).
 Log file(s). The file or files that should be analyzed in the job. Instead of a local
file also a network socket can be used.
 Filtering plug-in(s). Contains a set of rules that are applied on each log record to
determine whether it should be passed for further processing or thrown away. A
filter plug-in can also remove some unnecessary information from each record.
 Analytical plug-ins. These plug-ins do the real job. They are programs that search
for patterns, count occurrences and do most of the work; they contain the `business
logic' of the analysis. The result of execution of an analytical plug-in is a set of
variables filled with the required information. The plug-ins should not be too difficult
to write because the log is already pre-processed. Some details about this type of
plug-in are discussed later in this chapter.

 Visual plug-in(s). Such files contain instructions and parameters that tell the
visualization module how to transform the outputs of analytical modules. In other
words, a visual plug-in transforms the content of several variables (the output of
analytical plug-ins) into a sequence of graphic primitives (or commands) for a
visualization module.
 Output file. A place where the final results of the analytical job can be stored;
under normal operation, however, the final results are displayed to the user and not
stored in any file.
6.3.3 Modules
Finally, here is a list of the proposed modules together with a brief description of their
function and usage:
 User interface. It allows the user to set up the job and also handles and displays the
final results of the analysis. The user has to specify where to find the log file(s) and
which plug-ins to use. The user interface can be replaced by control and output files.
 Main executive. This module controls the operation of the analyzer. On behalf of
user interaction or a control file it executes subordinate modules and passes correct
parameters to them.
 Generic lexical analyzer. At the beginning a metadata file and log files are
processed by this module to identify lexical segments. The module is configured
and used by the metadata parser and log file parser.
 Generic syntax analyzer. Like the lexical analyzer, this module is also used by
both the metadata parser and the log file parser. It performs syntax analysis of
metadata information and log files.
 Metadata parser. This module reads a metadata file and gathers information about
the structure of the corresponding log file. The information is subsequently used to
configure the log file parser.
 Log file parser. This module uses the information gained by the metadata parser.
As it reads the log content line by line, pieces of information from each record
(with correct syntax) are stored in corresponding variables according to their
semantics.

 Filtering, cleaning: This module reads one or more filter plug-ins and examines the
output of the log file parser. All records that pass filtration are stored into a
database.
Software environment

Apache Hadoop: The Apache Hadoop project develops open-source software for reliable,
scalable, distributed computing. The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets across clusters of computers using
simple programming models. It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage. Rather than rely on hardware to
deliver high availability, the library itself is designed to detect and handle failures at the
application layer, thus delivering a highly available service on top of a cluster of computers,
each of which may be prone to failures.

The project includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.

Hadoop Distributed File System (HDFS™): A distributed file system that provides high-
throughput access to application data.

Hadoop YARN: A framework for job scheduling and cluster resource management.

Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

HBase™: A scalable, distributed database that supports structured data storage for large
tables.

Hive™: A data warehouse infrastructure that provides data summarization and ad hoc
querying.

Pig™: A high-level data-flow language and execution framework for parallel computation.

ZooKeeper™: A high-performance coordination service for distributed applications.

Components of Hadoop Frame Work

MapReduce

MapReduce is a framework with which we can write applications to process huge amounts
of data, in parallel, on large clusters of commodity hardware in a reliable manner.
MapReduce is a processing technique and a programming model for distributed computing
based on Java. The MapReduce algorithm contains two important tasks, namely Map and
Reduce. Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs). Second is the reduce task, which
takes the output from a map as an input and combines those data tuples into a smaller set of
tuples. As the name MapReduce implies, the reduce task is always performed after the map
job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and
reducers is sometimes nontrivial. But, once we write an application in the MapReduce
form, scaling the application to run over hundreds, thousands, or even tens of thousands of
machines in a cluster is merely a configuration change. This simple scalability is what has
attracted many programmers to use the MapReduce model.

The Algorithm

Generally, the MapReduce paradigm is based on sending the computation to where the data
resides. A MapReduce program executes in three stages, namely the map stage, the shuffle
stage, and the reduce stage.

Map stage: The map or mapper’s job is to process the input data. Generally the input data
is in the form of a file or directory and is stored in the Hadoop Distributed File System
(HDFS). The input file is passed to the mapper function line by line. The mapper processes
the data and creates several small chunks of data.

Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes from the mapper. After processing, it
produces a new set of output, which will be stored in the HDFS.

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.

The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes. Most of the computing
takes place on nodes with data on local disks that reduces the network traffic.

After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
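The three stages above can be simulated in a few lines of plain Python (no Hadoop involved); this sketch counts page hits per IP address, with made-up sample input, to make the map → shuffle → reduce hand-off concrete:

```python
# Illustrative simulation of the map -> shuffle -> reduce stages.
# The log lines and the hit-counting task are assumptions for the demo.
from collections import defaultdict

log_lines = [
    "10.0.0.1 /index.html",
    "10.0.0.2 /about.html",
    "10.0.0.1 /contact.html",
]

# Map stage: each input line is turned into (key, value) pairs.
def mapper(line):
    ip, _page = line.split()
    yield (ip, 1)

mapped = [kv for line in log_lines for kv in mapper(line)]

# Shuffle stage: group all values by key (Hadoop does this automatically
# between the map and reduce tasks).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce stage: combine the value list of each key into a final result.
def reducer(key, values):
    return (key, sum(values))

result = dict(reducer(k, v) for k, v in groups.items())
print(result)   # {'10.0.0.1': 2, '10.0.0.2': 1}
```

In a real cluster the mapped pairs are partitioned across many nodes and the shuffle moves them over the network, but the data flow is exactly the one shown here.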

Flow of Data in Map Reduce Frame Work

          Input              Output
Map       <k1, v1>           list(<k2, v2>)
Reduce    <k2, list(v2)>     list(<k3, v3>)

Terminology

Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.

MasterNode - Node where JobTracker runs and which accepts job requests from clients.

SlaveNode - Node where the Map and Reduce programs run.

NameNode - Node that manages the Hadoop Distributed File System (HDFS).

DataNode - Node where data is presented in advance before any processing takes place.

JobTracker - Schedules jobs and tracks the jobs assigned to the TaskTracker.

Task Tracker - Tracks the task and reports status to JobTracker.

Job - An execution of a Mapper and a Reducer across a dataset.

Task - An execution of a Mapper or a Reducer on a slice of data.

Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.

Format of Execution

The World Wide Web has an estimated 2 billion users and contains anywhere from 15 to
45 billion Web pages, with around 10 million pages added each day. With such large
numbers, almost every website owner and developer who has a decent presence on the
Internet faces a complex problem: how to make sense of their web pages and all the users
who visit their websites.

Every Web server worth its salt logs the user activities for the websites it supports and the
Web pages it serves up to the virtual world. These Web logs are mostly used for debugging
issues or to get insight into details that are interesting from a business or performance
point of view. Over time, the size of the logs keeps increasing until it becomes very difficult
to manually extract any important information out of them, particularly for busy websites.
The Hadoop framework does a good job of tackling this challenge in a timely, reliable, and
cost-efficient manner.

We propose a solution based on the Pig framework that aggregates data at an hourly, daily
or yearly granularity. The proposed architecture features a data-collection and a database
layer as an end-to-end solution, but we focus on the analysis layer, which is implemented
in the Pig Latin language.

Analyzing Logs Generated by Web Servers

The challenge for the proposed solution is to analyze Web logs generated by Apache Web
Server. Apache Web logs follow a standard pattern that is customizable. Their description
and sample Web logs can be found easily on the Web. These logs contain information in
various fields, such as timestamp, IP address, page visited, referrer, and browser, among
others. Each row in the Web log corresponds to a visit or event on a Web page.

The size of Web logs can range anywhere from a few KB to hundreds of GB. We have to
design solutions that, based on different dimensions such as timestamp, browser and
country, can extract patterns and information out of these logs and provide us vital bits of
information, such as the number of hits for a particular website or Web page, the number
of unique users, and so on. Each potential problem can be divided into a particular use case
and can then be solved.

The technologies used are the Apache Hadoop framework, Apache Pig, the Java
programming language, and regular expressions (regex).

Hadoop Solution Architecture

The proposed architecture is a layered architecture, and each layer has components. It
scales according to the number of logs generated by your Web servers, enables you to
harness the data to get key insights, and is based on an economical, scalable platform.

The Log Analysis Software Stack

Hadoop is an open source framework that allows users to process very large data sets in
parallel. It is based on the framework that supports the Google search engine. The Hadoop
core is mainly divided into two modules:

HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data
using multiple commodity servers connected in a cluster.

Map-Reduce (MR) is a framework for parallel processing of large data sets. The default
implementation is bundled with HDFS.

The database can be a NoSQL database such as HBase. The advantage of a NoSQL
database is that it provides scalability for the reporting module as well, as we can keep
historical processed data for reporting purposes. HBase is an open source columnar
(NoSQL) database that uses HDFS. It can also use MR jobs to process data. It gives
real-time, random read/write access to very large data sets; HBase can store very large
tables having millions of rows. It is a distributed database and can also keep multiple
versions of a single row.

The Pig framework is an open source platform for analyzing large data sets and is
implemented as a layered language over the Hadoop Map-Reduce framework. It is built to
ease the work of developers who write code in the Map-Reduce format, since code in Map-
Reduce format needs to be written in Java. In contrast, Pig enables users to write code in a
scripting language.

Flume is a distributed, reliable and available service for collecting, aggregating and moving
large amounts of log data (src: flume-wiki). It was built to push large logs into Hadoop-
HDFS for further processing. It is a data-flow solution, where there is an originator and a
destination for each node, and it is divided into Agent and Collector tiers for collecting logs
and pushing them to destination storage.

Data Flow and Components

Content will be created by multiple Web servers and logged on local hard disks. This
content will then be pushed to HDFS using the Flume framework. Flume has agents
running on the Web servers; collectors gather the data from these agents intermediately
and finally push that data to HDFS.

Pig scripts are scheduled to run using a job scheduler (this could be cron or any sophisticated
batch job solution). These scripts actually analyze the logs on various dimensions and
extract the results. Results from Pig are by default inserted into HDFS, but we can also use
storage implementations for other repositories such as HBase, MongoDB, etc. We have
also tried the solution with HBase (please see the implementation section). Pig scripts can
either push this data to HDFS, in which case MR jobs will be required to read it and push it
into HBase, or Pig scripts can push this data into HBase directly.

The database HBase will have the data processed by Pig scripts ready for reporting and
further slicing and dicing. The data-access Web service is a REST-based service that eases
the access and integrations with data clients. The client can be in any language to access
REST-based API. These clients could be BI- or UI-based clients.

Log Analysis Solution Architecture

Apache Tomcat 7 Server UML Composite Structure Diagram

Implementation steps

Analyzing log files, churning them and extracting meaningful information is a potential
use case in Hadoop. Let us consider Pig for Apache log analysis. Pig has some built-in
libraries that help us load Apache log files into Pig and also perform some cleanup
operations on string values from raw log files.
All the functionalities are available in the piggybank.jar, mostly available
under the pig/contrib/piggybank/java/ directory.

Log Format
“%h %l %u %t \”%r\” %>s %b \”%{Referer}i\” \”%{User-agent}i\””

%h
The IP address of the client (remote host) which made the request. If hostname lookup is
enabled, it will try to log the hostname instead of the IP address.
%l
A `-' indicates the requested information is not available. This field is the RFC 1413
identity of the client.
%u
This is the userid of the person requesting the document, as determined by HTTP
authentication. Likewise, a `-' means the requested information is not available.
%t
The time that the server finished processing the request. The format is
[day/month/year:hour:minute:secondzone]
day=2*digit
month=3*letter
year=4*digit
hour=2*digit
minute=2*digit
second=2*digit
zone = (`+’ | `-‘) 4*digit
%r
The request line from the client, given in double quotes. It contains the method used by the
client (e.g. GET) and the requested resource, i.e. /apachepb.gif.
%m %U%q %H
This will provide the method, path, query string and protocol.
%s
The status code that the server sends back to the client:
2xx is a successful response, 3xx is a redirection,
4xx is a client error,
5xx is a server error.
%b
Size of object returned to the client. Measured in bytes.
%{Referer}i
The referrer: the site that the client reports having been referred from.
%{User-agent}i
The User-Agent HTTP request header. Identifying information that the client browser
reports about itself.
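A log line in this format can be decomposed field by field with a single regular expression. The following Python sketch parses one Combined Log Format line; the sample line and the group names are illustrative, not part of the project's code:

```python
# Sketch: parse one Combined Log Format line with a regular expression.
# The sample line below is made up for illustration.
import re

LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '   # %h %l %u
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '      # %t "%r"
    r'(?P<status>\d{3}) (?P<bytes>\S+) '               # %>s %b
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'         # referer, user agent
)

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apachepb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08"')

m = LOG_RE.match(line)
fields = m.groupdict()
print(fields["host"], fields["status"], fields["request"])
```

The `\S+` pattern for the byte count also accepts the `-` placeholder described above, which a simple `\d+` would reject.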
As the first step we need to register this jar file with our Pig session; only then can we use
its functionality in our Pig Latin.
1. Register the Piggybank jar
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
Once we have registered the jar file we need to define a few functions to be used in our
Pig Latin. For any basic Apache log analysis we need a loader to load the log files in a
column-oriented format in Pig; we can create an Apache log loader as
2. Define a log loader
DEFINE ApacheCommonLogLoader
org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();

(Piggy Bank has other log loaders as well)


In Apache log files the default format of the date is ‘dd/MMM/yyyy:HH:mm:ss Z’. But such
a date won’t help us much; in log analysis we may have to extract the date without the
timestamp. For that we use DateExtractor().
3. Define Date Extractor
DEFINE DayExtractor
org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');

Once we have the required functions with us, we need to first load the log file into Pig.
4. Load the Apache log file into Pig
--load the log files from hdfs into pig using CommonLogLoader
logs = LOAD '/userdata/access.log' USING ApacheCommonLogLoader AS (ip_address,
rfc, userId, dt, request, serverstatus, returnobject, referersite, clientbrowser);

Now we are ready to dive in for the actual log analysis. There are multiple pieces of
information you may need to extract out of a log; we will see a few of the common
requirements here.

Note: you need to first register the jar, define the classes to be used and load the log files
into Pig before trying out any of the Pig Latin below.

Example 1: Find unique hits per day


PIG Latin
--Extracting the day alone and grouping records based on days
grpd = GROUP logs BY DayExtractor(dt);
--looping through each group to get the unique no of userIds
cntd = FOREACH grpd
{
tempId = logs.userId;
uniqueUserId = DISTINCT tempId;
GENERATE group AS day, COUNT(uniqueUserId) AS cnt;
}
--sorting the processed records based on no of unique user ids in descending order
srtd = ORDER cntd BY cnt DESC;
--storing the final result into a hdfs directory
STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult1';
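To make the GROUP/DISTINCT/ORDER pipeline above concrete, here is a plain-Python equivalent of the same computation (count distinct userIds per day, then sort days by that count in descending order); the sample day/user pairs are made up for illustration:

```python
# Plain-Python equivalent of the Pig script: unique hits per day.
# Sample (day, userId) pairs are illustrative, as if emitted by the loader.
from collections import defaultdict

events = [
    ("2000-10-10", "frank"),
    ("2000-10-10", "alice"),
    ("2000-10-10", "frank"),   # duplicate user on the same day
    ("2000-10-11", "bob"),
]

# GROUP logs BY day, keeping only DISTINCT userIds per group.
users_per_day = defaultdict(set)
for day, user in events:
    users_per_day[day].add(user)

# COUNT the distinct users, then ORDER BY count DESC.
counts = {day: len(users) for day, users in users_per_day.items()}
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)   # [('2000-10-10', 2), ('2000-10-11', 1)]
```

The set per group plays the role of Pig's DISTINCT, and the final sort corresponds to the ORDER ... DESC statement.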

Example 2: Find unique hits to websites (IPs) per day

PIG Latin
--Extracting the day alone and grouping records based on days and ip address
grpd = GROUP logs BY (DayExtractor(dt), ip_address);
--looping through each group to get the unique no of userIds
cntd = FOREACH grpd
{
tempId = logs.userId;
uniqueUserId = DISTINCT tempId;
GENERATE group AS day, COUNT(uniqueUserId) AS cnt;
}
--sorting the processed records based on no of unique user ids in descending order
srtd = ORDER cntd BY cnt DESC;
--storing the final result into a hdfs directory
STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult2';
BUSINESS LOGIC
REGISTER '/piggybank.jar';

DEFINE ApacheCommonLogLoader
org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();

logs = LOAD '/common_access_log' USING ApacheCommonLogLoader AS (addr:


chararray, logname: chararray, user: chararray, time: chararray, method: chararray, uri:
chararray, proto: chararray, status: int, bytes: int);

addrs = GROUP logs BY addr;

counts = FOREACH addrs GENERATE flatten($0), COUNT($1) as count;

DUMP counts;
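The GROUP-then-COUNT pattern of this script (one count per distinct address) is equivalent to a frequency count, which can be sketched in plain Python with a Counter; the addresses are illustrative:

```python
# Plain-Python sketch of the GROUP logs BY addr / COUNT pattern above.
# Sample addresses are made up for illustration.
from collections import Counter

addrs = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1"]

# One (addr, count) row per distinct address, like the DUMPed relation.
counts = Counter(addrs)
print(counts["10.0.0.1"], counts["10.0.0.2"])   # 3 1
```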

CHAPTER 7

SOFTWARE TESTING

7.1 System testing

Testing Approach

Big data is still emerging and there is a lot of onus on testers to identify innovative ideas
to test the implementation. Testers can create small utility tools using Excel macros for data
comparison, which can help in deriving a dynamic structure from the various data sources
during the pre-Hadoop processing stage.

For instance, Aspire designed a test automation framework for one of our large retail
customers’ big data implementation in the BDT (Behavior Driven Testing) model using
Cucumber and Ruby. The framework helped in performing count and data validation
during the data processing stage by comparing the record count between the Hive and SQL
tables and confirmed that the data is properly loaded without any truncation by verifying
the data between Hive and SQL tables.
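The count-validation step described above can be sketched in a few lines. The example below uses Python with sqlite3 purely as a stand-in for the Hive and SQL sources (the table and column names are made up); the essential check, comparing row counts between source and loaded tables, is the same:

```python
# Sketch of record-count validation between a source table and a loaded
# table; sqlite3 stands in for the Hive/SQL sources, and all names are
# illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_tbl (id INTEGER)")
conn.execute("CREATE TABLE loaded_tbl (id INTEGER)")
conn.executemany("INSERT INTO source_tbl VALUES (?)", [(i,) for i in range(100)])
conn.executemany("INSERT INTO loaded_tbl VALUES (?)", [(i,) for i in range(100)])

src = conn.execute("SELECT COUNT(*) FROM source_tbl").fetchone()[0]
dst = conn.execute("SELECT COUNT(*) FROM loaded_tbl").fetchone()[0]

# The validation step: counts must match, otherwise data was truncated
# during the load.
assert src == dst, f"truncation detected: {src} source rows vs {dst} loaded"
print("counts match:", src)
```

A fuller framework would also compare the row contents (e.g. checksums per key), not only the counts.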

Similarly, when it comes to validation of the map-reduce process stage, it definitely helps
if the tester has good experience with programming languages. The reason is that, unlike
SQL, where queries can be constructed to work through the data, the MapReduce framework
transforms a list of key/value pairs into a list of values. A good unit testing framework like
JUnit or PyUnit can help validate the individual parts of the MapReduce job, but such tests
do not cover the job as a whole.

Building a test automation framework using a programming language like Java can help
here. The automation framework can focus on the bigger picture pertaining to MapReduce
jobs while encompassing the unit tests as well. Setting up the automation framework on a
continuous integration server like Jenkins can be even more helpful. However, building the
right framework for big data applications relies on how the test environment is set up, as the
processing happens in a distributed manner here. There could be a cluster of machines on
the QA server where testing of MapReduce jobs should happen. Testing a Big Data
application is more a verification of its data processing than testing the individual
features of the software product. When it comes to Big Data testing, performance and
functional testing are the key. In Big Data testing, QA engineers verify the successful
processing of terabytes of data using a commodity cluster and other supportive components.
It demands a high level of testing skill, as the processing is very fast. Processing may be
of three types: batch, real-time, or interactive.

Along with this, data quality is also an important factor in big data testing. Before testing
the application, it is necessary to check the quality of the data; this should be considered a
part of database testing. It involves checking various characteristics like conformity,
accuracy, duplication, consistency, validity, data completeness, etc.

The following figure gives a high-level overview of the phases in testing Big Data
applications.

Big Data testing can be broadly divided into three steps.

Step 1: Data Staging Validation

 The first step of big data testing, also referred to as the pre-Hadoop stage, involves
process validation.
 Data from various sources like RDBMS, weblogs, social media, etc. should be
validated to make sure that correct data is pulled into the system.
 Compare the source data with the data pushed into the Hadoop system to make sure
they match.
 Verify the right data is extracted and loaded into the correct HDFS location.
 Tools like Talend and Datameer can be used for data staging validation.
Step 2: "MapReduce" Validation

 The second step is validation of "MapReduce". In this stage, the tester verifies the
business logic on every node and then validates it after running against multiple
nodes, ensuring that the MapReduce process works correctly
 Data aggregation or segregation rules are implemented on the data
 Key value pairs are generated
 Validating the data after Map Reduce process
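In practice, "MapReduce validation" often boils down to feeding a mapper a known input and asserting on the key/value pairs it emits. A minimal sketch (the mapper and sample records are made up for illustration):

```python
# Sketch of MapReduce validation as unit-style assertions on a mapper's
# key/value output; the mapper and sample lines are illustrative only.
def status_mapper(log_line):
    # Emit (status_code, 1) per record, e.g. for an error-rate report.
    parts = log_line.split()
    yield (parts[-1], 1)

# Known input with a known expected output: the essence of this step.
sample = ["GET /index.html 200", "GET /missing 404", "GET /index.html 200"]
pairs = [kv for line in sample for kv in status_mapper(line)]

assert pairs == [("200", 1), ("404", 1), ("200", 1)]
print("mapper emits the expected key/value pairs")
```

The same technique extends to the reducer (assert on the combined value per key) and, run against a multi-node cluster, to the end-to-end job.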

Step 3: Output Validation Phase

 The final or third stage of Big Data testing is the output validation process. The
output data files are generated and ready to be moved to an EDW (Enterprise Data
Warehouse) or any other system based on the requirement.
 Activities in the third stage include:
 To check the transformation rules are correctly applied
 To check the data integrity and successful data load into the target system
 To check that there is no data corruption by comparing the target data with the
HDFS file system data

Architecture Testing

Hadoop processes very large volumes of data and is highly resource intensive. Hence,
architectural testing is crucial to ensure the success of your Big Data project. A poorly or
improperly designed system may lead to performance degradation, and the system could
fail to meet the requirements. At the least, performance and failover test services should be
done in a Hadoop environment.
a Hadoop environment.

Performance testing includes testing of job completion time, memory utilization, data
throughput and similar system metrics, while the motive of the failover test service is to
verify that data processing occurs seamlessly in case of failure of data nodes.

Performance Testing

Performance testing for Big Data includes two main actions.

Data ingestion and throughput: In this stage, the tester verifies how fast the system can
consume data from various data sources. Testing involves identifying the number of
messages the queue can process in a given time frame. It also includes how quickly data can
be inserted into the underlying data store, for example the insertion rate into a MongoDB or
Cassandra database.

Data processing: It involves verifying the speed with which the queries or map-reduce
jobs are executed. It also includes testing the data processing in isolation when the
underlying data store is populated with the data sets, for example running MapReduce
jobs on the underlying HDFS.

Sub-component performance: These systems are made up of multiple components, and
it is essential to test each of these components in isolation; for example, how quickly
messages are indexed and consumed, MapReduce jobs, query performance, search, etc.

Performance Testing Approach

Performance testing for big data application involves testing of huge volumes of structured
and unstructured data, and it requires a specific testing approach to test such massive data.

Performance Testing is executed in this order

 The process begins with the setting up of the Big Data cluster which is to be tested
for performance
 Identify and design the corresponding workloads
 Prepare individual clients (custom scripts are created)
 Execute the test and analyze the results (if objectives are not met, tune the
component and re-execute)
 Determine the optimum configuration

Parameters for Performance Testing

Various parameters to be verified for performance testing are

 Data Storage: How data is stored in different nodes


 Commit logs: How large the commit log is allowed to grow
 Concurrency: How many threads can perform write and read operation
 Caching: Tune the cache setting "row cache" and "key cache."
 Timeouts: Values for connection timeout, query timeout, etc.
 JVM Parameters: Heap size, GC collection algorithms, etc.
 Map reduce performance: Sorts, merge, etc.
 Message queue: Message rate, size, etc.
Test Environment Needs

Test environment needs depend on the type of application you are testing. For Big Data
testing, the test environment should encompass:
 Enough space to store and process a large amount of data
 A cluster with distributed nodes and data
 Minimum CPU and memory utilization to keep performance high

Challenges in Big Data Testing

Automation

Automation testing for Big Data requires someone with technical expertise. Also,
automated tools are not equipped to handle unexpected problems that arise during testing.

Virtualization

It is one of the integral phases of testing. Virtual machine latency creates timing problems
in real-time big data testing. Also, managing virtual machine images in Big Data testing is
a hassle.

Large Dataset

 Need to verify more data and need to do it faster
 Need to automate the testing effort
 Need to be able to test across different platforms

Performance testing challenges

 Diverse set of technologies: Each sub-component belongs to a different technology
and requires testing in isolation
 Unavailability of specific tools: No single tool can perform the end-to-end testing;
for example, a NoSQL tool might not fit message queues
 Test scripting: A high degree of scripting is needed to design test scenarios and test
cases
 Test environment: A special test environment is needed due to the large data size
 Monitoring solution: Limited solutions exist that can monitor the entire
environment
 Diagnostic solution: A custom solution needs to be developed to drill down into the
performance bottleneck areas

As data engineering and data analytics advance to the next level, Big Data testing is
inevitable.

Big data processing could be Batch, Real-Time, or Interactive

The three stages of testing Big data applications are:

 Data staging validation
 "MapReduce" validation
 Output validation phase
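The output validation phase can be sketched by independently re-aggregating the raw input and comparing it with the job's output. The log lines and counts below are made-up examples; in practice both sides would be read from HDFS.

```python
from collections import Counter

# Hypothetical source records and MapReduce output for a page-hit count job.
source_log_lines = [
    "10.0.0.1 /index.html",
    "10.0.0.2 /products.html",
    "10.0.0.1 /index.html",
]
mapreduce_output = {"/index.html": 2, "/products.html": 1}

def expected_counts(lines):
    """Independently re-aggregate the raw data to validate the job's output."""
    return Counter(line.split()[1] for line in lines)

def validate(lines, job_output):
    expected = expected_counts(lines)
    # Output validation: no records dropped, none double-counted.
    return dict(expected) == dict(job_output)

print(validate(source_log_lines, mapreduce_output))  # True when counts match
```

The same pattern, comparing an independent aggregation against the job output, also covers the data-staging check that no records were lost when loading data into HDFS.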

Architecture testing is an important phase of Big data testing, as a poorly designed system may lead to unexpected errors and performance degradation.

Performance testing for big data includes verifying

 Data throughput
 Data processing
 Sub-component performance
 Big data testing is very different from traditional data testing in terms of data, infrastructure, and validation tools
 Big data testing challenges include virtualization, test automation, and dealing with large datasets; performance testing of Big data applications is also an issue
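Verifying data throughput amounts to timing a batch of writes and converting bytes moved into MB/s. A minimal sketch, where the list-based sink and the payload size are arbitrary stand-ins for a real target such as HDFS or a message queue:

```python
import time

def measure_throughput(write_fn, payload, repetitions):
    """Return MB/s for repeatedly writing `payload` via `write_fn`."""
    start = time.perf_counter()
    for _ in range(repetitions):
        write_fn(payload)
    elapsed = time.perf_counter() - start
    total_mb = len(payload) * repetitions / (1024 * 1024)
    return total_mb / elapsed

# Stand-in sink; a real test would write to HDFS or a message queue.
sink = []
rate = measure_throughput(sink.append, b"x" * 1024 * 64, repetitions=2000)
print(f"throughput: {rate:.1f} MB/s")
```

Swapping `sink.append` for an HDFS or queue client call turns the same harness into a real throughput probe for sub-component performance testing.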

In summary, test automation can be a good approach to testing big data implementations. Identifying the requirements and building a robust automation framework helps in doing comprehensive testing. However, a lot depends on the skills of the tester and on how the big data environment is set up. In addition to functional testing of big data applications through approaches such as test automation, the sheer size of the data means there is also a definite need for performance and load testing of big data implementations.

Tools used in Big Data Scenarios

Big Data Cluster     Big Data Tools

NoSQL databases      CouchDB, MongoDB, Cassandra, Redis, ZooKeeper, HBase

MapReduce            Hadoop, Hive, Pig, Cascading, Oozie, Kafka, S4, MapR, Flume

Storage              S3, HDFS (Hadoop Distributed File System)

Servers              EC2, Elastic, Heroku, Google App Engine

Processing           R, Yahoo! Pipes, Mechanical Turk, BigSheets, Datameer

Big data Testing vs. Traditional database Testing

Data
- Traditional database testing: The tester works with structured data. The testing approach is well defined and time-tested, and the tester has the option of a "Sampling" strategy done manually or an "Exhaustive verification" strategy done by an automation tool.
- Big data testing: The tester works with both structured and unstructured data. The testing approach requires focused R&D efforts, and the "Sampling" strategy itself is a challenge.

Infrastructure
- Traditional database testing: Does not require a special test environment, as the file size is limited.
- Big data testing: Requires a special test environment due to the large data size and files (HDFS).

Validation Tools
- Traditional database testing: The tester uses either Excel-based macros or UI-based automation tools, which can be operated with basic knowledge and little training.
- Big data testing: No defined tools; the range is vast, from programming tools like MapReduce to HiveQL. Operating the testing tools requires a specific set of skills and training, and the tools are still in their nascent stage, so new features may appear over time.

CHAPTER 8

RESULTS

CHAPTER 9

CONCLUSION AND FUTURE WORK

9.1 Conclusion

This project aims to show how large amounts of unstructured data can be easily analyzed
and how important information can be extracted from the data using the Apache Hadoop
framework and its related technologies. The approach described here can be used to
effectively address data analysis challenges such as traffic log analysis, user consumption
patterns, best-selling products, and so on. Information gleaned from the analysis can have
important economic and societal value, and can help organizations and people to operate
more efficiently. The limit to implementations is set only by human imagination.
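The log-analysis approach described here boils down to a map/reduce pair over raw access logs. A minimal sketch in the Hadoop Streaming style, counting hits per requested URL with the shuffle simulated locally; the Apache-style log format and field positions are assumptions for illustration:

```python
import re
from collections import defaultdict

# Matches the requested path in an Apache-style access log line (assumed format).
LOG_PATTERN = re.compile(r'"(?:GET|POST) (\S+)')

def mapper(line):
    """Emit (url, 1) for every request found in a raw log line."""
    match = LOG_PATTERN.search(line)
    if match:
        yield match.group(1), 1

def reducer(pairs):
    """Sum the counts per URL (the shuffle phase has already grouped the keys)."""
    totals = defaultdict(int)
    for url, count in pairs:
        totals[url] += count
    return dict(totals)

logs = [
    '10.0.0.1 - - [01/Jan/2016] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2016] "GET /catalog.html HTTP/1.1" 200 128',
    '10.0.0.3 - - [01/Jan/2016] "GET /index.html HTTP/1.1" 200 512',
]
pairs = [kv for line in logs for kv in mapper(line)]
print(reducer(pairs))  # {'/index.html': 2, '/catalog.html': 1}
```

Run as two Hadoop Streaming scripts over logs in HDFS, the same pair scales to the full datasets the project targets, and extending the emitted key (e.g. to a date/URL pair) yields the trend-over-time analysis.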

9.2 Future enhancements


For now, Hadoop does a good job of pushing the limits on the size of data that can be analyzed and of extracting valuable information from seemingly arbitrary data. However, a lot of work still needs to be done to improve latency in Hadoop solutions, so that Hadoop becomes applicable to scenarios where near-real-time responses are required.

CHAPTER 10
REFERENCES
[1] J. H. Andrews: "Theory and practice of log file analysis." Technical Report 524, Department of Computer Science, University of Western Ontario, May 1998.
[2] J. H. Andrews: "Testing using log file analysis: tools, methods, and issues." Proc. 13th IEEE International Conference on Automated Software Engineering, Oct. 1998, pp. 157-166.
[3] J. H. Andrews: "A Framework for Log File Analysis." http://citeseer.nj.nec.com/159829.html
[4] J. H. Andrews, Y. Zhang: "Broad-spectrum studies of log file analysis." International Conference on Software Engineering, pages 105-114, 2000.
[5] J. H. Andrews: "Testing using Log File Analysis: Tools, Methods, and Issues." Available at http://citeseer.nj.nec.com
[6] L. Lamport: "Time, Clocks, and the Ordering of Events in a Distributed System." Communications of the ACM, Vol. 21, No. 7, July 1978.
[7] K. M. Chandy, L. Lamport: "Distributed Snapshots: Determining Global States of Distributed Systems." ACM Transactions on Computer Systems, Vol. 3, No. 1, February 1985.
[8] F. Cristian: "Probabilistic Clock Synchronization." 9th Int. Conference on Distributed Computing Systems, June 1989.
[9] M. Guzdial, P. Santos, A. Badre, S. Hudson, M. Gray: "Analyzing and visualizing log files: A computational science of usability." Presented at HCI Consortium Workshop, 1994.
[10] M. J. Guzdial: "Deriving software usage patterns from log files." Georgia Institute of Technology, GVU Center Technical Report #93-41, 1993.
[11] Tec-Ed, Inc.: "Assessing Web Site Usability from Server Log Files White Paper." http://citeseer.nj.nec.com/290488.html
[12] Osmar R. Zaïane, Man Xin, and Jiawei Han: "Discovering web access patterns and trends by applying OLAP and data mining technology on web logs." In Proc. Advances in Digital Libraries ADL'98, pages 19-29, Santa Barbara, CA, USA, April 1998.
[13] Cisco Systems: "NetFlow Services and Application." White paper. Available at http://www.cisco.com
[14] Cisco Systems: "NetFlow FlowCollector Installation and User Guide." Available at http://www.cisco.com
[15] D. Gunter, B. Tierney, B. Crowley, M. Holding, J. Lee: "NetLogger: A Toolkit for Distributed System Performance Analysis." Proceedings of the IEEE Mascots 2000 Conference, August 2000, LBNL-46269.
[16] B. Tierney, W. Johnston, B. Crowley, G. Hoo, C. Brooks, D. Gunter: "The NetLogger Methodology for High Performance Distributed Systems Performance Analysis." Proceedings of the IEEE High Performance Distributed Computing Conference (HPDC-7), July 1998, LBNL-42611.
[17] Google Web Directory: "A list of HTTP log analysis tools." http://directory.google.com/Top/Computers/Software/Internet/Site Management/Log Analysis
[18] J. Abela, T. Debeaupuis: "Universal Format for Logger Messages." Expired IETF draft <draft-abela-ulm-05.txt>
[19] C. J. Calabrese: "Requirements for a Network Event Logging Protocol." IETF draft <draft-calabrese-requir-logprot-04.txt>
[20] SOFA group at Charles University, Prague. http://nenya.ms.mff.cuni.cz/thegroup/SOFA/sofa.html
