
Hadoop

By Apache Software Foundation

Welcome to Apache Hadoop!


What Is Apache Hadoop?
Hadoop for Cluster Architecture
Demystifying Cluster Architecture
Why Hadoop?
HDFS and MapReduce Overview
Hadoop Distributed File System
Hadoop MapReduce
Advantages of Hadoop
Why is Hadoop a serious player?
Conclusion
References

What Is Apache Hadoop?


Apache Hadoop is an open-source framework for the distributed storage and processing of large data sets on clusters of commodity hardware.

What is in a name? Doug Cutting named his new project after his son's favorite toy, a stuffed yellow elephant.

Hadoop for Cluster Architecture


Hadoop was not designed with the typical enterprise environment in mind; it was designed for a Cluster Architecture built out of commodity hardware. Cluster Architecture is based on very simple, very basic components. These components are available by the hundreds of thousands and can easily be assembled together.

Demystifying Cluster Architecture


A Node consists of a set of
commodity processing cores,
main memory, and
commodity disks.

A stack of nodes forms a Rack.
A group of racks forms a Cluster.
All are connected via a high-speed network to enable fast exchange of information.

Why Hadoop?
Hadoop grew out of the need for a way to continuously index massive amounts of data. The initial work was done by a little company called Google,
which needed a way to index the entire World Wide Web every day!

Why Hadoop? (2)


As a result of the breakthrough by Google, many others joined later, including:
Yahoo! (originator and major contributor)
Apache
Facebook
IBM
Twitter
American Airlines
LinkedIn
The New York Times
and many more

Why Hadoop? (3)


Facebook uses Hadoop to process and analyze data across its 845 million users.

Why Hadoop? (4)


Hadoop was designed to enable applications to make the most of cluster architecture by addressing two key points:
Distributed Data
Data Locality

And that brings us to Hadoop's two main mechanisms:

The Hadoop Distributed File System (HDFS)
Hadoop MapReduce

HDFS and MapReduce Overview


The Hadoop Distributed File System is a file system that
splits, scatters, replicates, and manages data across the nodes, as the sketch below illustrates.
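A minimal write sketch using the standard HDFS Java API; the path /data/BigData.txt is the hypothetical file used in the example later in this deck. The application just writes a stream, and HDFS does the splitting, scattering, and replication behind the scenes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);      // connects to the cluster's file system
            // Hypothetical file name, matching the example used later in this deck
            try (FSDataOutputStream out = fs.create(new Path("/data/BigData.txt"))) {
                out.writeBytes("a blog post about big data\n");
            }
            // HDFS transparently splits the file into blocks and replicates each block
        }
    }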

MapReduce is a computational mechanism to execute an application in parallel by:

dividing it into tasks
collocating these tasks with parts of the data
collecting and redistributing intermediate results
managing failures
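To make that data flow concrete, here is a minimal single-process sketch in plain Java (no Hadoop involved), with each string standing in for a file block on a different node:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MapReduceSketch {
        public static void main(String[] args) {
            // Each element stands in for one file block stored on a different node
            List<String> blocks = Arrays.asList("Hadoop scales Hadoop", "Hadoop stores big data");

            // Map phase: every "node" emits a (word, 1) pair per word in its own block
            List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
            for (String block : blocks)
                for (String word : block.split(" "))
                    intermediate.add(Map.entry(word, 1));

            // Shuffle: intermediate pairs are grouped by key
            Map<String, List<Integer>> grouped = new HashMap<>();
            for (Map.Entry<String, Integer> pair : intermediate)
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());

            // Reduce phase: the values for each key are summed into a single result
            Map<String, Integer> result = new HashMap<>();
            grouped.forEach((word, ones) ->
                    result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));

            System.out.println(result);  // e.g. {Hadoop=3, scales=1, stores=1, big=1, data=1}
        }
    }

In real Hadoop the map tasks run on many nodes at once and the grouping happens over the network; the logic, however, is exactly this.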

Hadoop Distributed File System (HDFS)

HDFS: File Blocks, Storage Blocks


A simple and uniform design principle: a file consists of equal-sized File Blocks, and each File Block consists of multiple Storage Blocks.
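The block layout of a file can be inspected through the HDFS Java API; a sketch, assuming the hypothetical /data/BigData.txt already exists in the cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/BigData.txt"));
            // One BlockLocation per file block; each lists the nodes holding a copy
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(block);  // offset, length, and the hosts storing it
            }
        }
    }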

HDFS: Basic Unit


Hadoop uses the File Block as the unit for distributing parts of a file across the disks in use. Since the nodes in a Rack can fail,
the same file block is stored on multiple nodes across the cluster. The number of copies is set to 3 by default.
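The default can be changed cluster-wide via the dfs.replication property in hdfs-site.xml, or per file through the Java API, as in this minimal sketch (file name again hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Ask HDFS to keep 5 copies of this file's blocks instead of the default 3
            fs.setReplication(new Path("/data/BigData.txt"), (short) 5);
        }
    }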

HDFS: Data Nodes and Name Nodes


Hadoop treats all nodes as Data Nodes, meaning that they can store data, but designates at least one node to be the Name Node.

HDFS: Name Node


The Name Node decides
where on each disk each copy of each file block will reside, and
keeps track of all that information in tables stored on its local disks.

HDFS: Failure Detection and Restore


When a node fails, the Name Node
identifies copies of all the affected file blocks on other healthy nodes,
finds new nodes to store another copy of them,
restores the copies there, and
updates this information in its tables.

HDFS: Reading a File


When the application needs to read a file,
it first connects to the Name Node to get the addresses of the disk blocks. The application can then read these blocks directly, without going through the Name Node again.
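A minimal read sketch using the HDFS Java API, again assuming the hypothetical /data/BigData.txt: open() contacts the Name Node once for the block addresses, and the returned stream then pulls the data from the Data Nodes directly.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // open() asks the Name Node for the block locations once;
            // subsequent reads go straight to the Data Nodes
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/BigData.txt"))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }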

HDFS: Single Point of Failure


One common concern about HDFS is that the Name Node can become a single point of failure: if it fails, all the mapping information between filenames and the addresses of their respective file blocks may be lost.

HDFS: Single Point of Failure(2)


A new node then needs to be designated as the Name Node, with the same IP address as the Name Node that failed. To make this possible, Hadoop saves copies of the tables created by the Name Node on other nodes in the cluster.

HDFS: Backup
In addition, some versions of Hadoop have other nodes serving as backup for the Name Node.

Hadoop MapReduce
MapReduce sees the computational task as consisting of two phases:
the Map phase and the Reduce phase.

MapReduce: Mapper Phase


All nodes do the same computation, but each against the part of the data set that is collocated on that node (the Data Locality principle). The mapper computation can be divided equally across the nodes because all the file blocks are of equal size.

MapReduce: Example
Consider a file that aggregates the blog posts related to big data from the last 24 hours. The file has been stored in a cluster using HDFS under the name BigData.txt. The file is split into File Blocks, with 3 copies of each block spread across the nodes of a 50-node cluster.

MapReduce: Example (2)
Analysis: count the number of times Hadoop, Big Data, and Green Plum appear.
Mapper function: its input is the address of a file block; its output is the number of times each of these terms appears there. Each participating node receives a pointer to the Mapper function and the address of a file block located on that node. A sketch of such a Mapper follows below.
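A sketch of such a Mapper using Hadoop's Java API. The class name TermCountMapper is illustrative; with the standard text input format, the framework feeds the Mapper one line of the file block at a time.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (term, count) for every occurrence of the three terms in its file block
    public class TermCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final String[] TERMS = {"Hadoop", "Big Data", "Green Plum"};

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String text = line.toString();
            for (String term : TERMS) {
                // Count occurrences within this line; in this sketch, an
                // occurrence that spans a line break would not be counted
                int count = 0;
                int index = text.indexOf(term);
                while (index >= 0) {
                    count++;
                    index = text.indexOf(term, index + term.length());
                }
                if (count > 0) {
                    context.write(new Text(term), new IntWritable(count));
                }
            }
        }
    }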

MapReduce: Example (3)
After executing the Mapper function, each node produces a list of key-value pairs: the key is Hadoop, Big Data, or Green Plum, and the value is the number of times that term appears in the node's file block.

MapReduce: Example (4)

MapReduce: Example (5)
We can then use the Reduce phase to sum the results obtained by each node, reducing the output of the Map function to a single list of key-value pairs. A node is selected to execute the Reduce function; a sketch follows below.
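A matching Reducer sketch (again with an illustrative class name). By the time reduce() is called, the framework has already grouped the intermediate pairs by key, so the Reducer only has to sum the counts for each term.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Receives all counts for one term (grouped by key) and sums them
    public class TermCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(term, new IntWritable(sum));
        }
    }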

MapReduce: Example (6)
After sorting, key-value pairs with the same key are grouped together. The Reduce function then adds up the values for each key, producing the final result.

MapReduce: Job Tracker


The Map and Reduce operations are considered Tasks; tasks together form a Job.
Job Tracker: coordinates all the jobs run on the system by dividing each job into tasks and scheduling them to run on nodes. It keeps track of all nodes, monitors their status, orchestrates data flow, and handles node failures. A driver sketch that submits such a job follows below.
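A sketch of the driver that wires the example's Mapper and Reducer together and submits the Job. Class and path names are illustrative; the sketch uses the current org.apache.hadoop.mapreduce API (in newer Hadoop releases, YARN takes over the Job Tracker's scheduling role, but job submission looks the same).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TermCountJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "term count");
            job.setJarByClass(TermCountJob.class);
            job.setMapperClass(TermCountMapper.class);
            job.setReducerClass(TermCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/BigData.txt"));
            FileOutputFormat.setOutputPath(job, new Path("/data/term-counts"));
            // The framework splits the job into tasks and schedules each task
            // on a node holding a copy of its file block (data locality)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }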

MapReduce: Task Tracker


Task Trackers: run tasks and send progress reports to the Job Tracker. Each Task Tracker keeps track of all the tasks running on its node, be they Map or Reduce tasks.

Advantages of Hadoop
Leverages cluster architecture to organize big data across hundreds of thousands of nodes
Tasks can be executed in parallel across the nodes
Computational tasks run against big data, with total freedom for data scientists in the type of analytics performed
Data locality keeps the network from becoming a bottleneck

Why is Hadoop a serious player?


According to predictions by Yahoo!,
by 2015, 50% of enterprise data will be processed and stored using Hadoop.

Big Data was the term du jour for 2011. In 2012, we'll move away from this term, as ANY DATA will be of paramount importance for enterprises.

Conclusion

References
http://hadoop.apache.org/
http://en.wikipedia.org/wiki/Apache_Hadoop
http://www.facebook.com/note.php?note_id=16121578919
http://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920
http://developer.yahoo.com/hadoop/

Thank You
