
A Brief History of Hadoop

Hadoop was created specifically to manage Big Data, and its story begins with Google, probably the most popular search engine in the online world.

TRADITIONAL APPROACH

In this approach, an enterprise has a single computer to store and process big data. The data is stored in an RDBMS such as Oracle Database, MS SQL Server, or DB2, and sophisticated software is written to interact with the database, process the required data, and present it to the users for analysis purposes.

Limitation
This approach works well when the volume of data is small enough to be accommodated by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to dealing with huge amounts of data, processing it through a traditional database server is a really tedious task.

Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides a task into small parts, assigns those parts to many computers connected over the network, and collects their results to form the final result dataset. The commodity hardware involved can range from single-CPU machines to servers with higher capacity.

Hadoop
In 2005, Doug Cutting, Mike Cafarella, and their team took the solution described by Google and started an open-source project called Hadoop, which Doug named after his son's toy elephant. Apache Hadoop is now a registered
trademark of the Apache Software Foundation.

Hadoop runs applications using the MapReduce algorithm, in which data is processed in parallel on different CPU nodes. In short, the Hadoop framework is capable of supporting applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data.

To provide search results to its users, Google had to store huge amounts of data. In the 1990s, Google started searching for ways to store and process such data, and finally, in 2003, it presented the world with an innovative Big Data storage idea called GFS, or Google File System, a technique for storing data, especially very large amounts of it.
In 2004, Google presented another technique, called MapReduce, for processing the data stored in GFS. As can be seen, it took Google years to come up with these ideas for storing and processing Big Data and to fine-tune them.

However, these techniques were presented to the world only as descriptions in white papers; no working model or code was provided. Then, in 2006-07, Yahoo, another major search engine company, developed HDFS and MapReduce based on the white papers published by Google. HDFS and MapReduce are thus the two core concepts that make up Hadoop.

HADOOP DISTRIBUTED FILE SYSTEM


The Hadoop File System was developed using a distributed file system design. Unlike some other distributed systems, HDFS is highly fault tolerant and is designed to run on low-cost hardware.

HDFS holds very large amounts of data and provides easy access to it. To store such huge data, the files are stored across multiple machines. They are stored in a redundant fashion to protect the system from possible data loss in case of failure.

HDFS Architecture
HDFS follows a master-slave architecture and has the following elements.

Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. The namenode software can be run on ordinary commodity hardware. The system hosting the namenode acts as the master server and performs the following tasks:

 Manages the file system namespace.

 Regulates clients' access to files.

 Executes file system operations such as renaming, closing, and opening files and directories.

Datanode
The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there is a datanode. These nodes manage the data storage of their system.
 Datanodes perform read-write operations on the file system, as per client request.

 They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.

Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored in individual data nodes. These file segments are called blocks. In other words, a block is the minimum amount of data that HDFS can read or write. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
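
As a rough illustration of how files map onto blocks, the sketch below computes how many HDFS blocks a file of a given size would occupy, assuming the 64 MB default mentioned above (the 200 MB example file is hypothetical):

import math

BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size in bytes (64 MB)

def block_count(file_size_bytes):
    """Number of HDFS blocks a file of the given size occupies."""
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A hypothetical 200 MB file spans 4 blocks: three full 64 MB blocks plus one 8 MB block.
print(block_count(200 * 1024 * 1024))  # prints 4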

Goals of HDFS
 Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.
 Huge datasets: HDFS should scale to hundreds of nodes per cluster to manage applications with huge datasets.
 Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.

HOW TO ANALYZE BIG DATA WITH HADOOP TECHNOLOGIES
With rapid innovation, frequently evolving technologies, and a rapidly growing internet population, systems and enterprises are generating huge amounts of data, to the tune of terabytes and even petabytes of information. Since data is being generated in very large volumes, with great velocity, and in many multi-structured formats (images, videos, weblogs, sensor data, etc.) from many different sources, there is a huge demand to efficiently store, process, and analyze this data to make it usable.

Hadoop is undoubtedly the preferred choice for such a requirement due to its key characteristics of being reliable, flexible, economical, and scalable. While Hadoop provides the ability to store this large-scale data on HDFS (the Hadoop Distributed File System), there are multiple solutions available for analyzing it, such as MapReduce, Pig, and Hive. With the advancement of these different data analysis technologies, there are many different schools of thought about which Hadoop data analysis technology should be used when, and which is most efficient.

A well-executed big data analysis makes it possible to uncover hidden markets, discover unfulfilled customer demands and cost-reduction opportunities, and drive game-changing improvements in everything from telecommunication efficiencies and surgical or medical treatments to social media campaigns and related digital marketing promotions.

What is Big Data Analysis?

Big data is mostly generated from social media websites, sensors, devices, video/audio, networks, log files, and the web, and much of it is generated in real time and on a very large scale. Big data analytics is the process of examining these large volumes of different data types, or big data, in an effort to uncover hidden patterns, unknown correlations, and other useful information.

Advantages of Big Data Analysis

Big data analysis allows market analysts, researchers, and business users to develop deep insights from the available data, resulting in numerous business advantages. Business users can make a precise analysis of the data, and the key early indicators from this analysis can mean fortunes for the business. Some exemplary use cases are as follows:

 Whenever users browse travel portals or shopping sites, search for flights or hotels, or add a particular item to their cart, ad-targeting companies can analyze this wide variety of data and activity and provide better recommendations regarding offers, discounts, and deals based on the user's browsing and purchase history.
 In the telecommunications space, if customers are moving from one service provider to another, analyzing huge volumes of call data records can unearth the various issues customers face. Issues could be as wide-ranging as a significant increase in call drops or network congestion problems. Based on this analysis, it can be determined whether a telecom company needs to place a new tower in a particular urban area or needs to revise its marketing strategy for a particular region where a new player has come up. In that way, customer churn can be proactively minimized.
Case Study – Stock market data

Now let’s look at a case study for analyzing stock market data. We will evaluate various big data technologies for analyzing a sample ‘New York Stock Exchange’ dataset, calculate the covariance for this stock data, and aim to solve both the storage and processing problems related to a huge volume of data.

Covariance is a financial term that represents the degree or amount that two
stocks or financial instruments move together or apart from each other. With
covariance, investors have the opportunity to seek out different investment
options based upon their respective risk profile. It is a statistical measure of how
one investment moves in relation to the other.

A positive covariance means that asset returns move together: if investment instruments or stocks tend to be up or down during the same time periods, they have positive covariance.

A negative covariance means returns move inversely: if one investment instrument tends to be up while the other is down, they have negative covariance.

This will help a stock broker in recommending the stocks to his customers.
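
In statistical terms, the sample covariance of two price (or return) series X and Y over n trading days is:

cov(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

where \bar{x} and \bar{y} are the means of the two series. A value above zero indicates the positive co-movement described above; a value below zero indicates the inverse relationship.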

Dataset: The sample dataset provided is a comma-separated (CSV) file named ‘NYSE_daily_prices_Q.csv’ that contains stock information such as daily quotes, stock opening price, stock highest price, etc. on the New York Stock Exchange.

The dataset provided is just a small sample of around 3,500 records, but in a real production environment there could be huge stock data running into GBs or TBs, so our solution must also hold up in a real production environment.
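
To make the goal concrete before introducing Hadoop, here is a small plain-Python sketch of the covariance computation on such a CSV extract. The column names ('stock_symbol', 'date', 'stock_price_close') and the ticker symbols are assumptions about the sample file's layout, so they may need adjusting:

import csv
from collections import defaultdict

def load_closing_prices(path):
    """Read closing prices per stock symbol, keyed by trading date.
    Column names are assumed; adjust them to the actual CSV header."""
    prices = defaultdict(dict)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            prices[row["stock_symbol"]][row["date"]] = float(row["stock_price_close"])
    return prices

def covariance(xs, ys):
    """Sample covariance of two equally long series (needs at least two points)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

prices = load_closing_prices("NYSE_daily_prices_Q.csv")
a, b = "QTM", "QRR"  # hypothetical ticker symbols assumed to be present in the file
# Align the two stocks on the trading days they have in common.
common_days = sorted(set(prices[a]) & set(prices[b]))
cov = covariance([prices[a][d] for d in common_days],
                 [prices[b][d] for d in common_days])
print(f"Covariance of {a} and {b} closing prices: {cov:.4f}")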

Hadoop Data Analysis Technologies

Let's take a look at the existing open-source Hadoop data analysis technologies for analyzing the huge stock data that is generated so frequently.
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.

Map takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map task.

Under the MapReduce model, the data processing primitives are called mappers and reducers.

Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But once we write an application in the MapReduce form, scaling it to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to the MapReduce model.

An example of MapReduce
Let's look at a simple example. Assume you have five files, and each file contains two columns (a key and a value in Hadoop terms) that represent a city and the temperature recorded in that city on various measurement days. In this example, city is the key and temperature is the value.

Toronto, 20 
Whitby, 25 
New York, 22 
Rome, 32 
Toronto, 4 
Rome, 33 
New York, 18
Out of all the data we have collected, we want to find the maximum temperature for each city across all of the data files (note that each file might have the same city represented multiple times). Using the MapReduce framework, we can break this down into five map tasks, where each mapper works on one of the five files; each mapper task goes through its data and returns the maximum temperature for each city. For example, the results produced by one mapper task for the data above would look like this:
(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)
Let’s assume the other four mapper tasks (working on the other four files not shown
here) produced the following intermediate results:
(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
All five of these output streams would be fed into the reduce tasks, which combine the
input results and output a single value for each city, producing a final result set as
follows:
(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
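
To make the flow concrete, here is a small plain-Python simulation of this job (not Hadoop itself, just the map and reduce logic run locally); the first file's records and the four intermediate result sets are taken from the text above:

def map_task(records):
    """Map task: return the maximum temperature seen for each city in one file."""
    maxima = {}
    for city, temp in records:
        if city not in maxima or temp > maxima[city]:
            maxima[city] = temp
    return maxima

def reduce_task(intermediate_results):
    """Reduce task: combine the per-file maxima into a global maximum per city."""
    final = {}
    for result in intermediate_results:
        for city, temp in result.items():
            if city not in final or temp > final[city]:
                final[city] = temp
    return final

# The first file's records, as listed above.
file1 = [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 32),
         ("Toronto", 4), ("Rome", 33), ("New York", 18)]

# The five intermediate results: the output of map_task(file1) plus the four
# intermediate result sets quoted above for the other (unseen) files.
intermediate = [
    map_task(file1),  # {'Toronto': 20, 'Whitby': 25, 'New York': 22, 'Rome': 33}
    {"Toronto": 18, "Whitby": 27, "New York": 32, "Rome": 37},
    {"Toronto": 32, "Whitby": 20, "New York": 33, "Rome": 38},
    {"Toronto": 22, "Whitby": 19, "New York": 20, "Rome": 31},
    {"Toronto": 31, "Whitby": 22, "New York": 19, "Rome": 30},
]

print(reduce_task(intermediate))
# -> {'Toronto': 32, 'Whitby': 27, 'New York': 33, 'Rome': 38}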

COMPONENTS OF HADOOP

The Hadoop framework includes the following four modules:

 Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
 Hadoop YARN: This is a framework for job scheduling and cluster resource management.
 Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
 Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.

These four components make up the core of the Hadoop framework. Since 2012, the term "Hadoop" has often referred not just to the base modules mentioned above but also to the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, and Apache Spark.

MapReduce
Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

The term MapReduce actually refers to the following two different tasks that
Hadoop programs perform:

 The Map Task: This is the first task, which takes input data and converts it into
a set of data, where individual elements are broken down into tuples (key/value
pairs).
 The Reduce Task: This task takes the output from a map task as input and
combines those data tuples into a smaller set of tuples. The reduce task is
always performed after the map task.

Typically, both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for resource management, tracking resource consumption and availability, and scheduling the job's component tasks on the slaves, monitoring them and re-executing failed tasks. The slave TaskTrackers execute the tasks as directed by the master and periodically provide task-status information to the master.

The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if the JobTracker goes down, all running jobs are halted.

Hadoop Distributed File System


Hadoop can work directly with any mountable distributed file system such
as Local FS, HFTP FS, S3 FS, and others, but the most common file system
used by Hadoop is the Hadoop Distributed File System (HDFS).

The Hadoop Distributed File System (HDFS) is based on the Google File
System (GFS) and provides a distributed file system that is designed to run
on large clusters (thousands of computers) of small computer machines in a
reliable, fault-tolerant manner.

HDFS uses a master/slave architecture in which the master consists of a single NameNode that manages the file system metadata, and one or more slave DataNodes store the actual data.

A file in an HDFS namespace is split into several blocks, and those blocks are stored in a set of DataNodes. The NameNode determines the mapping of blocks to the DataNodes. The DataNodes take care of read and write operations with the file system. They also take care of block creation, deletion, and replication based on instructions given by the NameNode.
HDFS provides a shell like any other file system, and a list of commands is available to interact with the file system. These shell commands will be covered in a separate chapter along with appropriate examples.

How Does Hadoop Work?


Stage 1
A user/application can submit a job to Hadoop (via a Hadoop job client) for the required processing by specifying the following items:

1. The location of the input and output files in the distributed file system.
2. The Java classes, in the form of a jar file, containing the implementation of the map and reduce functions.
3. The job configuration, set through different parameters specific to the job.

Stage 2
The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.

Stage 3
The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and the output of the reduce function is stored in the output files on the file system.
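
As a purely illustrative sketch of Stage 1, a job packaged as a jar could be submitted from a script as follows; the jar name, main class, and HDFS paths are placeholders, not values from this text:

import subprocess

# Submit a hypothetical MapReduce job (wordcount.jar / WordCount are placeholders).
# The JobTracker then handles distribution, scheduling, and monitoring (Stages 2 and 3).
subprocess.run(
    ["hadoop", "jar", "wordcount.jar", "WordCount",
     "/user/demo/input",     # input directory in HDFS (placeholder)
     "/user/demo/output"],   # output directory in HDFS (placeholder)
    check=True,
)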

Advantages of Hadoop
 The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, in turn utilizing the underlying parallelism of the CPU cores.
 Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.
 Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without interruption.
 Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java based.

HADOOP STREAMING

Hadoop Streaming is a utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer. Suppose, for example, that we have a word-count problem to solve in Python.

In this example, both the mapper and the reducer are Python scripts that read input from standard input and emit output to standard output. The utility creates a Map/Reduce job, submits the job to an appropriate cluster, and monitors the progress of the job until it completes.

When a script is specified for mappers, each mapper task launches the script as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the mapper collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized as per one's needs.

When a script is specified for reducers, each reducer task launches the script as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the reducer collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized as per specific requirements.
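
As a minimal sketch of such a word-count job under the conventions just described (tab-separated key/value pairs on standard output), the two scripts might look like this; the file names are illustrative:

# mapper.py: emit each word with a count of 1
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py: sum the counts for each word (input arrives sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The job could then be submitted with the Hadoop streaming utility, for example: hadoop jar hadoop-streaming.jar -input /user/demo/words -output /user/demo/wordcount -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the jar location and the HDFS paths here are placeholders).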
