
Introduction to Hadoop, MapReduce and HDFS for Big Data Applications

Hadoop History and General Information

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

All the modules in Hadoop are designed with a fundamental assumption that
hardware failures are common and should be automatically handled by the
framework.

Hadoop Main Components

Hadoop consists of MapReduce, the Hadoop Distributed File System (HDFS) and a
number of related projects such as Apache Hive, HBase and ZooKeeper.
MapReduce and HDFS are the main components of Hadoop.

Hadoop Architecture
The Hadoop framework includes the following four modules:

Hadoop Common: These are the Java libraries and utilities required by other Hadoop
modules. These libraries provide filesystem and OS-level abstractions and contain the
necessary Java files and scripts required to start Hadoop.

Hadoop YARN: This is a framework for job scheduling and cluster resource management.

Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.

Hadoop MapReduce: This is a YARN-based system for parallel processing of large data
sets.

Hadoop Cluster

Any set of loosely or tightly connected computers that work together as a single
system is called a cluster. In simple words, a computer cluster used for Hadoop is
called a Hadoop cluster.

A Hadoop cluster is a special type of computational cluster designed for storing and
analyzing vast amounts of unstructured data in a distributed computing environment.
These clusters run on low-cost commodity computers.

Hadoop clusters are often referred to as "shared nothing" systems because the only
thing that is shared between nodes is the network that connects them.

Large Hadoop clusters are arranged in several racks. Network traffic between nodes
in the same rack is much more desirable than network traffic across racks, since
intra-rack bandwidth is higher and latency is lower.

Core Components of Hadoop Cluster


A Hadoop cluster has three components:

Client

Master

Slave

Task Tracker

1. Each Task Tracker is responsible for executing and managing the individual tasks assigned by the Job Tracker.

2. The Task Tracker also handles data motion between the map and reduce phases.

3. One prime responsibility of the Task Tracker is to constantly communicate the status of its tasks to the Job Tracker.

4. If the Job Tracker fails to receive a heartbeat from a Task Tracker within a specified amount of time, it will assume the Task Tracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.

Hadoop Heart - MapReduce

MapReduce is a programming model used to process large data sets in a batch-processing
manner.
A MapReduce program is composed of

a Map() procedure that performs filtering and sorting (such as sorting students by first name
into queues, one queue for each name)

and a Reduce() procedure that performs a summary operation (such as counting the
number of students in each queue, yielding name frequencies).
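To make the Map()/Reduce() pairing concrete, here is a minimal word-count sketch in Java against the standard org.apache.hadoop.mapreduce API: the mapper emits a (word, 1) pair for every token, and the reducer sums the counts per word. The class names (WordCountMapper, WordCountReducer) are illustrative, not part of the slides; in practice the two classes would usually live in separate files or as static nested classes of a driver.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): tokenize each input line and emit (word, 1) for every token.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // one entry per word occurrence
        }
    }
}

// Reduce(): sum the counts collected for each word (the "summary operation").
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));  // (word, total frequency)
    }
}
```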

Facts about MapReduce

Apache Hadoop MapReduce is an open-source implementation of Google's MapReduce framework.

Although several MapReduce implementations have been developed for distributed systems,
such as Dryad from Microsoft and Disco from Nokia, Hadoop is the most popular among them,
offering an open-source implementation of the MapReduce framework.

The Hadoop MapReduce framework works on a master/slave architecture.

MapReduce Architecture
Hadoop MapReduce is composed of two components:

Job Tracker, which plays the role of master and runs on the master node (NameNode)

Task Tracker, which plays the role of slave and runs on each data node

Job Tracker

The Job Tracker is the component to which client applications submit MapReduce programs (jobs); a minimal driver that builds and submits such a job is sketched after this list.

The Job Tracker schedules client jobs and allocates tasks to the slave Task Trackers that
run on the individual worker machines (data nodes).

The Job Tracker manages the overall execution of a MapReduce job.

The Job Tracker manages the resources of the cluster:

Manage the data nodes, i.e. the Task Trackers.
Keep track of consumed and available resources.
Keep track of already running tasks, provide fault tolerance for tasks, etc.
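As a sketch of the client side of this interaction, the driver class below configures a job and submits it to the cluster; it reuses the illustrative WordCountMapper and WordCountReducer classes from the earlier sketch, and the input/output paths come in as command-line arguments. Under classic MapReduce the submission goes to the Job Tracker; under YARN the ResourceManager plays that role.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: the client-side program that packages a MapReduce job and
// submits it to the cluster for scheduling onto the Task Trackers.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // Map() from the earlier sketch
        job.setReducerClass(WordCountReducer.class);  // Reduce() from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory

        // Submit the job and wait; progress is reported while tasks run on the cluster.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```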

Hadoop HDFS

The Hadoop Distributed File System was developed using a distributed file system design.
It runs on commodity hardware. Unlike many other distributed systems, HDFS is highly
fault-tolerant and designed to run on low-cost hardware.

HDFS holds very large amounts of data and provides easy access to it. To store such huge
volumes of data, files are spread across multiple machines. These files are stored in a
redundant fashion to protect the system from possible data loss in case of failure. HDFS
also makes applications available for parallel processing.

Features of HDFS

It is suitable for distributed storage and processing.

Hadoop provides a command interface to interact with HDFS (a short sketch follows this list).

The built-in servers of the NameNode and DataNodes help users easily check the status of
the cluster.

Streaming access to file system data.

HDFS provides file permissions and authentication.
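As a minimal sketch of interacting with HDFS programmatically (the class name and paths are hypothetical), the Java FileSystem API below creates a directory, uploads a local file and streams it back; the equivalent hdfs dfs shell commands are noted in the comments.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client using the FileSystem API.
// Shell equivalents: "hdfs dfs -mkdir", "hdfs dfs -put", "hdfs dfs -cat".
public class HdfsQuickstart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/demo");          // hypothetical paths, for illustration only
        Path file = new Path("/user/demo/sample.txt");

        fs.mkdirs(dir);                                      // hdfs dfs -mkdir -p /user/demo
        fs.copyFromLocalFile(new Path("sample.txt"), file);  // hdfs dfs -put sample.txt /user/demo

        // Stream the file back (hdfs dfs -cat /user/demo/sample.txt)
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```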

Hadoop - Big Data Solutions

In the traditional approach, an enterprise has a computer to store and process big data.
Here the data is stored in an RDBMS such as Oracle Database, MS SQL Server or DB2, and
sophisticated software is written to interact with the database, process the required
data and present it to the users for analysis.

Thank you!
REBECCA THO, HADOOP DEVELOPER AT KYVOS INSIGHTS
HTTP://WWW.KYVOSINSIGHTS.COM
