

A Brief History of Hadoop-

Named after a toy elephant belonging to developer Doug Cutting’s son, over the
past decade Hadoop has proven to be the little platform that could.
From its humble beginnings as an open source search engine project created by
Cutting and Mike Cafarella, Hadoop has evolved into a robust platform for Big
Data storage and analysis. It manages the deluge of user data for giant social
services like Facebook and Twitter, supports sophisticated medical and scientific
research, and increasingly addresses the storage and predictive analytics demands
of the Enterprise.

In the early days, big data required a lot of raw computing power, storage, and
parallelism, which meant that organizations had to spend a lot of money to build
the infrastructure needed to support big data analytics. Given the large price tag,
only the largest Fortune 500 organizations could afford such an infrastructure.

The Birth of MapReduce: The only way to get around this problem was to break
down big data into manageable chunks and run smaller jobs in parallel, using low
cost hardware, where fault tolerance and self-healing would be managed in the
software. This was the primary goal of the Hadoop Distributed File System
(HDFS). And to fully capitalize on big data, MapReduce came on the scene. This
programming paradigm made massive scalability possible across hundreds or
thousands of servers in a Hadoop cluster.

YARN Comes on the Scene: The first generation of Hadoop provided affordable
scalability and a flexible data structure, but it was really only the first step in the
journey. Its batch-oriented job processing and consolidated resource management
were limitations that drove the development of Yet Another Resource Negotiator
(YARN). YARN essentially became the architectural center of Hadoop, since it
allowed multiple data processing engines to handle data stored in one platform.

This new, modern data architecture made it possible for Hadoop to become a true
data operating system and platform. YARN separated the data persistence
functions from the different execution models to unify data for multiple workloads.
Hadoop Version 2 provides the foundation for today’s data lake strategy, which
is basically a large object-based storage repository that holds data in its native
format until it is needed. However, using the data lake only as a consolidated data
repository is shortsighted; Hadoop is really meant to be used as an interactive,
multiple workload and operational data platform.

Apache Hadoop is an open-source software framework for distributed storage and
distributed processing of extremely large data sets.
There are three core components of Hadoop:

1. For computational processing, MapReduce: MapReduce is the data
processing layer of Hadoop. It is a software framework for easily writing
applications that process the vast amounts of structured and unstructured data
stored in the Hadoop Distributed File System (HDFS). It processes huge amounts
of data in parallel by dividing the submitted job into a set of independent sub-tasks.
In Hadoop, MapReduce works by breaking the processing into two phases: Map
and Reduce. Map is the first phase of processing, where all the complex logic,
business rules, and costly code are specified. Reduce is the second phase of
processing, where lightweight processing such as aggregation and summation is
specified.
2. For storage, HDFS: the Hadoop Distributed File System is the storage layer of
Hadoop. It also follows the master-slave pattern: the NameNode acts as the master,
storing the metadata for the DataNodes, while each DataNode acts as a slave,
storing the actual data on its local disks in parallel.
3. YARN, for resource allocation: YARN is the resource management framework
in Hadoop. It allows multiple data processing engines, such as real-time streaming,
data science, and batch processing, to handle data stored on a single platform.
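The Map and Reduce phases described in component 1 can be illustrated with a minimal, single-process sketch of the classic word-count job. Plain Python stands in for Hadoop's Java API here; the function names are illustrative, and the shuffle step shown in between is what the framework performs for you between the two phases.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (key, value) pair per word."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: lightweight aggregation, here a summation."""
    return (key, sum(values))

docs = ["Hadoop stores Big Data", "Hadoop processes Big Data in parallel"]
intermediate = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts["hadoop"], counts["big"])  # 2 2
```

In real Hadoop, the map tasks run in parallel on the nodes that hold each input split, and the reduce tasks run after the shuffle; the division of logic between the two phases is the same.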

The Hadoop Ecosystem-

Hadoop is the standard answer for processing Big Data. The Hadoop ecosystem is
a combination of technologies that have a proficient advantage in solving business
problems. Let us understand the components in the Hadoop ecosystem to build the
right solution for a given business problem.
Hadoop Ecosystem:
Data Access:
Pig: Apache Pig is a high-level language built on top of MapReduce for analyzing
large datasets with simple ad hoc data analysis programs. Pig is also known as a
data flow language, and it is very well integrated with Python. It was initially
developed by Yahoo and made open source.
Salient features of Pig:
• Ease of programming
• Optimization opportunities
• Extensibility.
Pig scripts internally will be converted to map reduce programs.
Hive: Apache Hive is another high-level query language and data warehouse
infrastructure built on top of Hadoop, providing data summarization, query, and
analysis. It was initially developed by Facebook and made open source.
Salient features of hive:
• SQL like query language called HQL.
• Partitioning and bucketing for faster data processing.
• Integration with visualization tools like Tableau.
Hive queries internally will be converted to map reduce programs.

Data Storage:
HBase: Apache HBase is a NoSQL database built for hosting large tables with
billions of rows and millions of columns on top of Hadoop commodity hardware
machines. Use Apache Hbase when you need random, realtime read/write access to
your Big Data.
• Strictly consistent reads and writes; in-memory operations.
• Easy to use Java API for client access.
• Well integrated with pig, hive and sqoop.
• A consistent and partition-tolerant (CP) system in terms of the CAP theorem.
Cassandra: Cassandra is a NoSQL database designed for linear scalability and
high availability. Cassandra is based on a key-value model. It was originally
developed at Facebook and is known for fast query responses.
• Column indexes
• Support for de-normalization
• Materialized views
• Powerful built-in caching.

Interaction, Visualization, Execution, Development:

HCatalog: HCatalog is a table management layer which provides integration of
hive metadata for other Hadoop applications. It enables users with different data
processing tools like Apache pig, Apache MapReduce and Apache Hive to more
easily read and write data.
• Tabular view for different formats.
• Notifications of data availability.
• REST API’s for external systems to access metadata.
Lucene: Apache Lucene™ is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for nearly any
application that requires full-text search, especially cross-platform.
• Scalable, High – Performance indexing.
• Powerful, Accurate and Efficient search algorithms.
• Cross-platform solution.
Hama: Apache Hama is a distributed framework based on the Bulk Synchronous
Parallel (BSP) computing model. It is well known for massive scientific
computations such as matrix, graph, and network algorithms.
• Simple programming model
• Well suited for iterative algorithms
• YARN supported
• Collaborative filtering and unsupervised machine learning.
• K-Means clustering.
Crunch: Apache Crunch is built for creating simple and efficient MapReduce
pipelines. This framework is used for writing, testing, and running MapReduce
pipelines.
• Developer focused.
• Minimal abstractions
• Flexible data model.

Data Serialization:
Avro: Apache Avro is a language-neutral data serialization framework. It is
designed for language portability, allowing data to potentially outlive the language
used to read and write it.
Thrift: Thrift is a language developed to build interfaces to interact with
technologies built on Hadoop. It is used to define and create services for numerous
languages.

Data Intelligence:
Drill: Apache Drill is a low-latency SQL query engine for Hadoop and NoSQL.
• Agility
• Flexibility
• Familiarity
Mahout: Apache Mahout is a scalable machine learning library designed for
building predictive analytics on Big Data. Mahout now has implementations on
Apache Spark for faster in-memory computing.
• Collaborative filtering.
• Classification
• Clustering
• Dimensionality reduction

Data Integration:
Apache Sqoop: Apache Sqoop is a tool designed for bulk data transfers between
relational databases and Hadoop.
• Import and export to and from HDFS.
• Import and export to and from Hive.
• Import and export to HBase.
Apache Flume: Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data.
• Robust
• Fault tolerant
• Simple and flexible Architecture based on streaming data flows.
Apache Chukwa: A scalable log collector used for monitoring large distributed
systems.
• Scales to thousands of nodes.
• Reliable delivery.
• Can store collected data indefinitely.

Management, Monitoring and Orchestration:

Apache Ambari: Ambari is designed to make Hadoop management simpler by
providing an interface for provisioning, managing, and monitoring Apache Hadoop
clusters.
• Provision a Hadoop Cluster.
• Manage a Hadoop Cluster.
• Monitor a Hadoop Cluster.
Apache Zookeeper: Zookeeper is a centralized service designed for maintaining
configuration information, naming, providing distributed synchronization, and
providing group services.
• Serialization
• Atomicity
• Reliability
• Simple API
Apache Oozie: Oozie is a workflow scheduler system to manage Apache Hadoop
jobs.
• Scalable, reliable, and extensible system.
• Supports several types of Hadoop jobs, such as MapReduce, Hive, Pig, and
Sqoop.
• Simple and easy to use.

The Building Blocks of Hadoop-

Hadoop employs a master/slave architecture for both distributed storage and
distributed computation. The distributed storage system is called the Hadoop
Distributed File System (HDFS).

On a fully configured cluster, "running Hadoop" means running a set of daemons,

or resident programs, on the different servers in your network. These daemons have
specific roles; some exist only on one server, some exist across multiple servers.
The daemons include:
• NameNode
• DataNode
• Secondary NameNode
• JobTracker
• TaskTracker
NameNode: The NameNode is the master of HDFS that directs the slave
DataNode daemons to perform the low-level I/O tasks. It is the bookkeeper of
HDFS; it keeps track of how your files are broken down into file blocks, which
nodes store those blocks and the overall health of the distributed filesystem.

The server hosting the NameNode typically doesn't store any user data or perform
any computations for a MapReduce program; this keeps the workload on the
machine low, since the NameNode's role is memory- and I/O-intensive.

There is unfortunately a negative aspect to the importance of the NameNode - it's a

single point of failure of your Hadoop cluster. For any of the other daemons, if
their host fails for software or hardware reasons, the Hadoop cluster will likely
continue to function smoothly, or you can quickly restart it. Not so for the
NameNode.

DataNode: Each slave machine in your cluster will host a DataNode daemon to
perform the grunt work of the distributed filesystem - reading and writing HDFS
blocks to actual files on the local file system.

When you want to read or write an HDFS file, the file is broken into blocks and the
NameNode will tell your client which DataNode each block resides in. Your client
communicates directly with the DataNode daemons to process the local files
corresponding to the blocks.

Furthermore, a DataNode may communicate with other DataNodes to replicate its

data blocks for redundancy. This ensures that if any one DataNode crashes or
becomes inaccessible over the network, you'll still be able to read the files.
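This division of labor, metadata lookups at the NameNode and byte transfer directly from the DataNodes, can be sketched as a toy model. The class and method names below are illustrative, not the real HDFS client API, and the "blocks" are just dictionary entries.

```python
class NameNode:
    """Holds only metadata: which DataNodes store each block of a file."""
    def __init__(self):
        self.block_locations = {}  # (filename, block_index) -> [DataNode, ...]

    def get_block_locations(self, filename, num_blocks):
        return [self.block_locations[(filename, i)] for i in range(num_blocks)]

class DataNode:
    """Holds the actual block bytes on its local disk (a dict in this sketch)."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

# A 2-block file, each block replicated on two DataNodes.
dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")
dn1.blocks[("f.txt", 0)] = b"hello "
dn2.blocks[("f.txt", 0)] = b"hello "
dn2.blocks[("f.txt", 1)] = b"world"
dn3.blocks[("f.txt", 1)] = b"world"

nn = NameNode()
nn.block_locations[("f.txt", 0)] = [dn1, dn2]
nn.block_locations[("f.txt", 1)] = [dn2, dn3]

# Client: ask the NameNode where each block lives, then read from a DataNode.
data = b""
for block_index, replicas in enumerate(nn.get_block_locations("f.txt", 2)):
    # Any replica will do; if dn1 crashed, block 0 could still come from dn2.
    data += replicas[0].blocks[("f.txt", block_index)]
print(data.decode())  # hello world
```

Note that the file bytes never pass through the NameNode; it only answers the "where" question, which is what keeps it lightweight enough to hold all metadata in memory.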

DataNodes are constantly reporting to the NameNode. Upon initialization, each of

the DataNodes informs the NameNode of the blocks it's currently storing. After
this mapping is complete, the DataNodes continually poll the NameNode to
provide information regarding local changes, as well as receive instructions to
create, move, or delete blocks from the local disk.

Secondary NameNode (SNN): The SNN is an assistant daemon for monitoring the
state of the cluster's HDFS. Like the NameNode, each cluster has one SNN, and it
typically resides on its own machine as well. No other DataNode or TaskTracker
daemons run on the same server. The SNN differs from the NameNode in that this
process doesn't receive or record any real-time changes to HDFS. Instead, it
communicates with the NameNode to take snapshots of the HDFS metadata at
intervals defined by the cluster configuration.
As mentioned earlier, the NameNode is a single point of failure for a Hadoop
cluster, and the SNN snapshots help minimize the downtime and loss of data.

Two further concepts belong under the SNN: the fsimage (filesystem image) file
and the edits file:

The HDFS namespace is stored by the NameNode. The NameNode uses a

transaction log called the EditLog to persistently record every change that occurs
to file system metadata. For example, creating a new file in HDFS causes the
NameNode to insert a record into the EditLog indicating this. Similarly, changing
the replication factor of a file causes a new record to be inserted into the EditLog.
The NameNode uses a file in its local host OS file system to store the EditLog. The
entire file system namespace, including the mapping of blocks to files and file
system properties, is stored in a file called the FsImage. The FsImage is stored as a
file in the NameNode’s local file system too.
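The interplay of the two files can be sketched as a write-ahead log plus a periodic checkpoint. This is a toy model for illustration only, not the real on-disk formats; the class and record layouts are invented for the sketch.

```python
import json

class ToyNameNode:
    """Toy model: FsImage = full namespace snapshot, EditLog = changes since."""
    def __init__(self):
        self.namespace = {}   # path -> {"replication": n}
        self.edit_log = []    # ordered records of every metadata change

    def create_file(self, path, replication=3):
        # Every change is recorded in the EditLog before the in-memory state.
        self.edit_log.append({"op": "create", "path": path, "replication": replication})
        self.namespace[path] = {"replication": replication}

    def set_replication(self, path, replication):
        self.edit_log.append({"op": "setrep", "path": path, "replication": replication})
        self.namespace[path]["replication"] = replication

    def checkpoint(self):
        """What the Secondary NameNode drives: fold the accumulated changes
        into a new FsImage and start a fresh, empty EditLog."""
        fsimage = json.dumps(self.namespace)
        self.edit_log = []
        return fsimage

nn = ToyNameNode()
nn.create_file("/logs/a.txt")
nn.set_replication("/logs/a.txt", 2)
print(len(nn.edit_log))   # 2 records before the checkpoint
image = nn.checkpoint()
print(len(nn.edit_log))   # 0 records after: the image now carries the state
```

On restart, a real NameNode loads the latest FsImage and then replays the EditLog on top of it; keeping the EditLog short via checkpoints is exactly what makes that restart fast.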
JobTracker: Once you submit your code to your cluster, the JobTracker determines
the execution plan by determining which files to process, assigns nodes to different
tasks, and monitors all tasks as they're running. Should a task fail, the JobTracker
will automatically relaunch the task, possibly on a different node, up to a
predefined limit of retries.

There is only one JobTracker daemon per Hadoop cluster. It's typically run on a
server as a master node of the cluster.

TaskTracker: As with the storage daemons, the computing daemons also follow a
master/slave architecture: the JobTracker is the master overseeing the overall
execution of a MapReduce job, and the TaskTrackers manage the execution of
individual tasks on each slave node.

Each TaskTracker is responsible for executing the individual tasks that the
JobTracker assigns. Although there is a single TaskTracker per slave node, each
TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in
parallel.
One responsibility of the TaskTracker is to constantly communicate with the

JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker
within a specified amount of time, it will assume the TaskTracker has crashed and
will resubmit the corresponding tasks to other nodes in the cluster.
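The heartbeat check itself is simple; here is a sketch of the JobTracker's side of it. The timeout value and names are illustrative (in real Hadoop the expiry interval is set in configuration), and real trackers report much more than a timestamp.

```python
HEARTBEAT_TIMEOUT = 600.0  # seconds; illustrative, configurable in real Hadoop

def find_dead_trackers(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the TaskTrackers whose last heartbeat is older than the timeout;
    their tasks would be resubmitted to other nodes in the cluster."""
    return sorted(name for name, t in last_heartbeat.items() if now - t > timeout)

# tracker-b last reported 610 seconds ago and is assumed to have crashed.
last_heartbeat = {"tracker-a": 1000.0, "tracker-b": 390.0, "tracker-c": 995.0}
print(find_dead_trackers(last_heartbeat, now=1000.0))  # ['tracker-b']
```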
The Design of HDFS-
HDFS is a filesystem designed for storing very large files with streaming data
access patterns, running on clusters of commodity hardware. Let’s examine this
statement in more detail:
Very large files:“Very large” in this context means files that are hundreds of
megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running
today that store petabytes of data.
Streaming data access: HDFS is built around the idea that the most efficient data
processing pattern is a write-once, read-many-times pattern. A dataset is typically
generated or copied from a source, and then various analyses are performed on it
over time. Each analysis will involve a large proportion, if not all, of the dataset,
so the time to read the whole dataset is more important than the latency in reading
the first record.
Commodity hardware: Hadoop doesn’t require expensive, highly reliable
hardware to run on. It’s designed to run on clusters of commodity hardware
(commonly available hardware available from multiple vendors) for which the
chance of node failure across the cluster is high, at least for large clusters. HDFS is
designed to carry on working without a noticeable interruption to the user in the
face of such failure.
It is also worth examining the applications for which using HDFS does not work so
well. While this may change in the future, these are areas where HDFS is not a
good fit today:
Low-latency data access: Applications that require low-latency access to data, in
the tens of milliseconds range, will not work well with HDFS. Remember, HDFS
is optimized for delivering a high throughput of data, and this may be at the
expense of latency. HBase is currently a better choice for low-latency access.
Lots of small files: Since the namenode holds filesystem metadata in memory, the
limit to the number of files in a filesystem is governed by the amount of memory
on the namenode. As a rule of thumb, each file, directory, and block takes about
150 bytes. So, for example, if you had one million files, each taking one block, you
would need at least 300 MB of memory. While storing millions of files is feasible,
billions is beyond the capability of current hardware.
Multiple writers, arbitrary file modifications: Files in HDFS may be written to by
a single writer. Writes are always made at the end of the file. There is no support
for multiple writers, or for modifications at arbitrary offsets in the file. (These
might be supported in the future, but they are likely to be relatively inefficient.)
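The 150-byte rule of thumb makes the small-files problem easy to quantify. The sketch below is a back-of-the-envelope estimate, not an exact accounting of namenode heap usage.

```python
BYTES_PER_OBJECT = 150  # rule of thumb: per file, directory, or block

def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Estimate namenode memory: each file contributes one file object
    plus one object per block it occupies."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# One million single-block files -> the ~300 MB figure quoted above.
print(namenode_memory_bytes(1_000_000) / 1e6)      # 300.0 (MB)
# A billion such files would need on the order of 300 GB of namenode heap.
print(namenode_memory_bytes(1_000_000_000) / 1e9)  # 300.0 (GB)
```

This is why many small files hurt HDFS far more than a few very large ones: the metadata cost scales with object count, not with bytes stored.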
HDFS Concepts-
Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS
blocks are 128 MB by default, and this is configurable.
Files in HDFS are broken into block-sized chunks, which are stored as independent
units. Unlike in an ordinary filesystem, a file in HDFS that is smaller than the block
size does not occupy a full block's worth of storage; e.g., a 5 MB file stored in HDFS
with a 128 MB block size takes only 5 MB of space.
The HDFS block size is large in order to minimize the cost of seeks.
Name Node: HDFS works in a master-worker pattern where the name node acts as
master. The name node is the controller and manager of HDFS, as it knows the
status and the metadata of all the files in HDFS; the metadata includes file
permissions, names, and the location of each block. The metadata is small, so it is
stored in the memory of the name node, allowing faster access to data. Moreover,
although the HDFS cluster is accessed by multiple clients concurrently, all this
information is handled by a single machine.
The filesystem operations like opening, closing, renaming etc. are executed by it.
Data Node: Data nodes store and retrieve blocks when they are told to, by the client
or the name node. They report back to the name node periodically, with the list of
blocks that they are storing. Being commodity hardware, the data node also does
the work of block creation, deletion, and replication as directed by the name node.
Since all the metadata is stored in the name node, it is very important. If it fails,
the filesystem cannot be used, as there would be no way of knowing how to
reconstruct the files from the blocks present in the data nodes. To overcome this,
the concept of the secondary name node arises.
Secondary Name Node: It is a separate physical machine which acts as a helper to
the name node. It performs periodic checkpoints: it communicates with the name
node and takes snapshots of the metadata, which helps minimize downtime and
loss of data.

Hadoop File Systems-

Hadoop stores petabytes of data using the HDFS technology. Using HDFS it is
possible to connect commodity hardware or personal computers, also known as
nodes in Hadoop parlance. These nodes are connected over a cluster on which the
data files are stored in a distributed manner. Using the power of HDFS, the whole
cluster and the nodes can be easily accessed for data storage and processing.
Access to the data is strictly in a streaming manner using the MapReduce process.
Key features of HDFS:
 HDFS is highly resilient, since upon failure the workload is immediately
transferred to another node
 It provides an extremely good amount of throughput even for gigantic
volumes of data sets
 It is unlike other distributed file systems, since it is based on a write-once-
read-many model
 It allows high data coherence, removes concurrency control issues, and
speeds up data access
 HDFS moves computation to the place where data exists instead of the other
way around
 Thus, applications are moved closer to the point where data resides, which is
much cheaper and faster, and improves the overall throughput.
The reasons why HDFS works so well with Big Data:
• HDFS uses the method of MapReduce for access to data which is very fast
• It follows a data coherency model that is simple yet highly robust and scalable
• Compatible with any commodity hardware and operating system
• Achieves economy by distributing data and processing on clusters in parallel
• Data is always safe as it is automatically saved in multiple locations in a
foolproof way
• It provides a Java API and even a C language wrapper on top
• It is easily accessible using a web browser, making it highly practical.

File Read and Write

An application adds data to HDFS by creating a new file and writing the data to it.
After the file is closed, the bytes written cannot be altered or removed except that
new data can be added to the file by reopening the file for append. HDFS
implements a single-writer, multiple-reader model.
The HDFS client that opens a file for writing is granted a lease for the file; no other
client can write to the file. The writing client periodically renews the lease by
sending a heartbeat to the NameNode. When the file is closed, the lease is revoked.
The lease duration is bound by a soft limit and a hard limit. Until the soft limit
expires, the writer is certain of exclusive access to the file. If the soft limit expires
and the client fails to close the file or renew the lease, another client can preempt
the lease. If the hard limit (one hour) expires and the client has failed to renew
the lease, HDFS assumes that the client has quit and will automatically close the
file on behalf of the writer, and recover the lease. The writer's lease does not
prevent other clients from reading the file; a file may have many concurrent
readers.
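The soft/hard limit logic amounts to a simple classification of the writer's lease by its age. In the sketch below, the one-hour hard limit comes from the text; the soft-limit value is illustrative only, since real HDFS defaults vary by version.

```python
SOFT_LIMIT = 60.0     # seconds; illustrative value
HARD_LIMIT = 3600.0   # one hour, as described above

def lease_state(elapsed_since_renewal):
    """Classify a writer's lease by how long ago it was last renewed."""
    if elapsed_since_renewal <= SOFT_LIMIT:
        return "exclusive"     # writer is certain of exclusive access
    if elapsed_since_renewal <= HARD_LIMIT:
        return "preemptible"   # another client may preempt the lease
    return "expired"           # HDFS closes the file and recovers the lease

print(lease_state(30.0), lease_state(600.0), lease_state(4000.0))
# exclusive preemptible expired
```

As long as the writer's heartbeats keep arriving, the elapsed time resets and the lease stays in the exclusive state; the other two states only arise when the client stalls or dies.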
An HDFS file consists of blocks. When there is a need for a new block, the
NameNode allocates a block with a unique block ID and determines a list of
DataNodes to host replicas of the block. The DataNodes form a pipeline, the order
of which minimizes the total network distance from the client to the last DataNode.
Bytes are pushed to the pipeline as a sequence of packets. The bytes that an
application writes first buffer at the client side. After a packet buffer is filled
(typically 64 KB), the data are pushed to the pipeline. The next packet can be
pushed to the pipeline before receiving the acknowledgment for the previous
packets. The number of outstanding packets is limited by the outstanding packets
window size of the client.
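The client-side flow can be sketched as chunking the write buffer into 64 KB packets and bounding how many may be unacknowledged at once. This is a toy model of the bookkeeping, not the real HDFS client; the window size shown is illustrative.

```python
PACKET_SIZE = 64 * 1024  # typical packet buffer size from the text
WINDOW_SIZE = 5          # max outstanding (unacknowledged) packets; illustrative

def packetize(data, packet_size=PACKET_SIZE):
    """Split the client's write buffer into a sequence of packets."""
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]

def send(packets, window=WINDOW_SIZE):
    """Count how many times the client must pause for an acknowledgment
    when at most `window` packets may be in flight at once."""
    in_flight, pauses = 0, 0
    for _ in packets:
        if in_flight == window:  # window full: wait for one ack first
            pauses += 1
            in_flight -= 1
        in_flight += 1
    return pauses

packets = packetize(b"x" * (1024 * 1024))  # a 1 MB write
print(len(packets), len(packets[0]))       # 16 packets of 64 KB each
print(send(packets))                       # pauses needed with a window of 5
```

The window is what lets the client keep pushing packets down the pipeline without waiting for each acknowledgment individually, trading a little memory for much better throughput.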
After data are written to an HDFS file, HDFS does not provide any guarantee that
data are visible to a new reader until the file is closed. If a user application needs
the visibility guarantee, it can explicitly call the hflush operation. Then the current
packet is immediately pushed to the pipeline, and the hflush operation will wait
until all DataNodes in the pipeline acknowledge the successful transmission of the
packet. All data written before the hflush operation are then certain to be visible to
the reader.
Replication Management
The NameNode endeavors to ensure that each block always has the intended
number of replicas. The NameNode detects that a block has become under- or
over-replicated when a block report from a DataNode arrives. When a block
becomes over-replicated, the NameNode chooses a replica to remove. The
NameNode will prefer not to reduce the number of racks that host replicas, and
secondly prefer to remove a replica from the DataNode with the least amount of
available disk space. The goal is to balance storage utilization across DataNodes
without reducing the block's availability.
When a block becomes under-replicated, it is put in the replication priority queue.
A block with only one replica has the highest priority, while a block with a number
of replicas that is greater than two thirds of its replication factor has the lowest
priority. A background thread periodically scans the head of the replication queue
to decide where to place new replicas. Block replication follows a similar policy as
that of new block placement. If the number of existing replicas is one, HDFS
places the next replica on a different rack. In case that the block has two existing
replicas, if the two existing replicas are on the same rack, the third replica is placed
on a different rack; otherwise, the third replica is placed on a different node in the
same rack as an existing replica. Here the goal is to reduce the cost of creating new
replicas.
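The placement rules in this paragraph translate directly into code. The sketch below captures only the rack-selection policy as described above, not the real BlockPlacementPolicy implementation; node and rack names are illustrative.

```python
def pick_next_replica_rack(existing, all_racks):
    """Given the racks of the existing replicas of a block, decide where the
    next replica goes, following the policy described above.

    existing:  list of rack names hosting current replicas
    all_racks: all racks in the cluster
    Returns a (rack, reason) pair."""
    if len(existing) == 1:
        # One replica: place the next one on a different rack.
        rack = next(r for r in all_racks if r != existing[0])
        return rack, "different rack"
    if len(existing) == 2:
        if existing[0] == existing[1]:
            # Both replicas on the same rack: the third goes to another rack.
            rack = next(r for r in all_racks if r != existing[0])
            return rack, "different rack"
        # Already spread over two racks: third goes to a different node
        # on one of the existing racks (cheaper: no cross-rack transfer).
        return existing[0], "different node, same rack"
    raise ValueError("sketch only handles 1 or 2 existing replicas")

racks = ["rack1", "rack2", "rack3"]
print(pick_next_replica_rack(["rack1"], racks))           # ('rack2', 'different rack')
print(pick_next_replica_rack(["rack1", "rack1"], racks))  # ('rack2', 'different rack')
print(pick_next_replica_rack(["rack1", "rack2"], racks))  # ('rack1', 'different node, same rack')
```

The asymmetry in the last case is deliberate: once two racks already hold the block, rack-level fault tolerance is satisfied, so the cheaper in-rack copy is preferred.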