
Big Data Assignment

Big Data Life Cycle

Big Data analysis differs from traditional data analysis primarily due to the volume, velocity and
variety characteristics of the data being processed. To address the distinct requirements for
performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and
tasks involved with acquiring, processing, analysing and repurposing data. The upcoming sections
explore a specific data analytics lifecycle that organizes and manages the tasks and activities
associated with the analysis of Big Data. From a Big Data adoption and planning perspective, it is
important that in addition to the lifecycle, consideration be made for issues of training, education,
tooling and staffing of a data analytics team.

The Big Data analytics lifecycle can be divided into the following nine stages, as shown in the figure:

1. Business Case Evaluation

2. Data Identification
3. Data Acquisition & Filtering
4. Data Extraction
5. Data Validation & Cleansing
6. Data Aggregation & Representation
7. Data Analysis
8. Data Visualization
9. Utilization of Analysis Results
1. Business Case Evaluation

The Big Data Lifecycle starts with a sound evaluation of the business case.
Before any Big Data project can be started, it needs to be clear what the business objectives
and results of the data analysis should be. Begin with the end in mind and clearly define the
objectives and desired results of the project. Many different forms of data analysis could be
conducted, but what exactly is the reason for investing time and effort in data analysis? As
with any good business case, the proposal should be backed up by financial data.

2. Data Identification

The Data Identification stage determines the origin of data. Before data can be analysed, it is
important to know what the sources of the data will be. Especially if data is procured from
external suppliers, it is necessary to clearly identify what the original source of the data is and
how reliable (frequently referred to as the veracity of the data) the dataset is. This second
stage of the Big Data Lifecycle is very important, because if the input data is unreliable, the
output will inevitably be unreliable as well.
3. Data Acquisition and Filtering

The Data Acquisition and Filtering stage builds upon the previous stage of the Big Data
Lifecycle. In this stage, the data is gathered from different sources, both from within the
company and outside of the company. After the acquisition, a first step of filtering is
conducted to filter out corrupt data. Additionally, data that is not necessary for the analysis
will be filtered out as well. The filtering step will be applied on each data source individually,
so before the data is aggregated into the data warehouse.
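As a sketch, this per-source filtering step might look like the following Python snippet. The field names ("user_id", "age") and the records are purely hypothetical; a real pipeline would use a dedicated ingestion tool, but the per-source ordering of the filter is the point being illustrated:

```python
def filter_source(records):
    """Drop corrupt records and discard fields not needed for the analysis."""
    clean = []
    for rec in records:
        # Filter out corrupt data: required fields must be present.
        if rec.get("user_id") is None or rec.get("age") is None:
            continue
        # Keep only the fields the analysis actually needs.
        clean.append({"user_id": rec["user_id"], "age": rec["age"]})
    return clean

# Hypothetical internal and external sources.
internal = [{"user_id": 1, "age": 34, "notes": "x"}, {"user_id": None, "age": 20}]
external = [{"user_id": 2, "age": 28}]

# Filtering is applied to each data source individually, before aggregation.
filtered = [filter_source(src) for src in (internal, external)]
```

Note that each source is filtered on its own, matching the stage description: corrupt and unnecessary data never reaches the data warehouse.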

4. Data Extraction

Some of the data identified in the two previous stages may be incompatible with the Big Data
tool that will perform the actual analysis. In order to deal with this problem, the Data
Extraction stage is dedicated to extracting different data formats from data sets (e.g. the data
source) and transforming these into a format the Big Data tool is able to process and analyse.
The complexity of the transformation, and the extent to which it is necessary to transform data,
depends greatly on the Big Data tool that has been selected. Most modern Big Data tools can
read industry-standard data formats of relational and non-relational data.
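A minimal illustration of this extraction step, assuming two hypothetical sources (one JSON, one CSV) that must be normalized into a single tabular form the analysis tool can process, using only the Python standard library:

```python
import csv
import io
import json

def extract(raw, fmt):
    """Normalize a raw payload into a list of row dicts, regardless of source format."""
    if fmt == "json":
        return json.loads(raw)
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(raw)))
    raise ValueError("unsupported format: " + fmt)

# Hypothetical payloads in two different industry-standard formats.
rows_a = extract('[{"id": "1", "amount": "10"}]', "json")
rows_b = extract("id,amount\n2,20\n", "csv")
```

Both calls yield the same row-dict structure, so the downstream tool only has to handle one format.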

5. Data Validation and Cleansing

Data that is invalid leads to invalid results. In order to ensure only the appropriate data is
analysed, the Data Validation and Cleansing stage of the Big Data Lifecycle is required. During
this stage, data is validated against a set of predetermined conditions and rules in order to
ensure the data is not corrupt. An example of a validation rule would be to exclude all persons
that are older than 100 years old, since it is very unlikely that data about these persons would
be correct due to physical constraints.
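The age rule mentioned above can be sketched as a small set of predicate rules applied to each record; the field names and records here are illustrative:

```python
RULES = [
    lambda p: p.get("age") is not None,  # the field must be present (not corrupt)
    lambda p: 0 <= p["age"] <= 100,      # exclude persons older than 100 years
]

def validate(persons):
    """Keep only records that satisfy every predetermined validation rule."""
    return [p for p in persons if all(rule(p) for rule in RULES)]

people = [
    {"name": "a", "age": 42},
    {"name": "b", "age": 130},   # fails the age rule
    {"name": "c", "age": None},  # corrupt record
]
valid = validate(people)
```

Because `all()` short-circuits, the age-range rule is never evaluated on records that already failed the presence check.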

6. Data Aggregation and Representation

Data may be spread across multiple datasets, requiring that those datasets be joined together to
conduct the actual analysis. In order to ensure only the correct data will be analysed in the
next stage, it might be necessary to integrate multiple datasets. The Data Aggregation and
Representation stage is dedicated to integrating multiple datasets to arrive at a unified view.
Additionally, data aggregation can greatly speed up the analysis process, because the Big Data
tool will no longer be required to join tables from different datasets at analysis time.
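As a toy sketch of this integration step, two hypothetical datasets (customers and orders) are joined once into a unified view, so the analysis tool never has to perform the join itself:

```python
customers = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [
    {"customer_id": 1, "total": 250},
    {"customer_id": 1, "total": 100},
    {"customer_id": 2, "total": 80},
]

# Index one dataset by its key, then join the other against it.
by_id = {c["id"]: c for c in customers}
unified = [
    {"name": by_id[o["customer_id"]]["name"], "total": o["total"]}
    for o in orders
]
```

In a real system this aggregation would happen in a data warehouse or Big Data tool, but the pattern is the same: pay the join cost once, then analyse the unified view.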

7. Data Analysis

The Data Analysis stage of the Big Data Lifecycle is dedicated to carrying out the actual
analysis task. It runs the code or algorithm that makes the calculations that will lead to the
actual result. Data Analysis can be simple or really complex, depending on the required
analysis type. In this stage the ‘actual value’ of the Big Data project will be generated. If all
previous stages have been executed carefully, the results will be factual and correct.
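As a minimal example of an analysis step, the following computes the average order total per customer from hypothetical, already cleansed and aggregated data:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical unified data from the previous stage: (customer, order total).
orders = [("Alice", 250), ("Alice", 100), ("Bob", 80)]

# Group order totals per customer, then compute the average for each.
per_customer = defaultdict(list)
for name, total in orders:
    per_customer[name].append(total)
avg_order = {name: mean(totals) for name, totals in per_customer.items()}
```

Real analyses are usually far more complex, but the principle holds: the calculation itself is only as good as the data the earlier stages delivered.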

8. Data Visualization

The ability to analyse massive amounts of data and find useful insights is one thing;
communicating the results in a way that everybody can understand is something completely
different. The Data Visualization stage is dedicated to using data visualization techniques
and tools to graphically communicate the analysis results for effective interpretation by
business users. Frequently this requires plotting data points in charts, graphs or heat maps.

9. Utilization of Analysis Results

After the data analysis has been performed and the results have been presented, the final step
of the Big Data Lifecycle is to use the results in practice. The Utilization of Analysis Results stage is
dedicated to determining how and where the processed data can be further utilised to
leverage the result of the Big Data Project.

Hadoop Architecture
Apache Hadoop 2.x and later versions use the following architecture. This is the Hadoop 2.x
high-level architecture; the low-level architecture is discussed in detail below.

 The Hadoop Common module is the Hadoop base API (a JAR file) for all Hadoop components. All
other components work on top of this module.
 HDFS stands for Hadoop Distributed File System. It is also known as HDFS V2, as it is part of
Hadoop 2.x with some enhanced features. It is used as the distributed storage system in the
Hadoop architecture.
 YARN stands for Yet Another Resource Negotiator. It is a new component in the Hadoop 2.x
architecture. It is also known as “MR V2”.
 MapReduce is a batch processing or distributed data processing module. It is also known as
“MR V1”, as it was part of Hadoop 1.x and has received some updated features.
 All remaining Hadoop Ecosystem components work on top of these three major components:
HDFS, YARN and MapReduce.
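To make the MapReduce processing model concrete, here is a toy word-count sketch that simulates the map, shuffle and reduce phases in plain Python. A real job would run distributed across the cluster, reading from HDFS and scheduled by YARN; this single-process version only illustrates the data flow:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

lines = ["big data big", "data"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The same map/shuffle/reduce structure underlies batch processing in Hadoop, just spread over many nodes.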

When compared to Hadoop 1.x, the Hadoop 2.x architecture is designed quite differently. It adds
one new component, YARN, and also updates the HDFS and MapReduce components.

Hadoop 2.x Major Components

 HDFS
 YARN
 MapReduce

These three are also known as the Three Pillars of Hadoop 2.x. The major key component change is
YARN. It is a real game-changing component in the Big Data Hadoop system.

Hadoop 2.x components follow this architecture to interact with each other and to work in parallel
in a reliable, highly available and fault-tolerant manner.

Hadoop 2.x Components High-Level Architecture

 All Master Nodes and Slave Nodes contain both MapReduce and HDFS components.
 One Master Node has two components:
 Resource Manager (YARN or MapReduce v2)
 NameNode (HDFS)

Its HDFS component is known as the NameNode, which is used to store metadata.

 In Hadoop 2.x, additional nodes act as Master Nodes, as shown in the above diagram. Each of
these second-level Master Nodes has three components:
 Node Manager
 Application Master
 Data Node
 Each of these second-level Master Nodes again contains one or more Slave Nodes, as shown in
the above diagram.
 These Slave Nodes have two components:
 Node Manager
 Data Node

The Slave Node's HDFS component is known as the Data Node; it is used to store our application's
actual Big Data. These nodes do not contain an Application Master component.

Hadoop 2.x Architecture Description

Resource Manager:

 Resource Manager is a Per-Cluster Level Component.

 Resource Manager is divided into two components:
 Scheduler
 Application Manager
 Resource Manager's Scheduler:
 Is responsible for scheduling the required resources to Applications (that is, it schedules
per Application).
 Does only scheduling.
 Does not take care of monitoring or tracking of those Applications.

Application Master:

 Application Master is a per-application-level component. It is responsible for:

 Managing the assigned Application's life cycle.
 It interacts with both the Resource Manager's Scheduler and the Node Manager.
 It interacts with the Scheduler to acquire the required resources.
 It interacts with the Node Manager to execute the assigned tasks and to monitor those tasks'
status.

Node Manager:

 Node Manager is a per-node-level component.

 It is responsible for:
 Managing the life cycle of each Container.
 Monitoring each Container's resource utilization.

 Each Master Node or Slave Node contains a set of Containers. In this diagram, the Main Node's
NameNode does not show its Containers; however, it also contains a set of Containers.
 A Container is a portion of memory reserved on a node (either a NameNode or a Data Node).
 In Hadoop 2.x, Containers play a role similar to the task slots of Hadoop 1.x.