
Hadoop

By Apache Software Foundation

Welcome to Apache Hadoop!


What Is Apache Hadoop?
Hadoop for Cluster Architecture
Demystifying Cluster Architecture
Why Hadoop?
HDFS and MapReduce Overview
Hadoop Distributed File System
Hadoop MapReduce
Advantages of Hadoop
Why is Hadoop a serious player?
Conclusion
References

What Is Apache Hadoop?


Apache Hadoop is an open-source framework for the distributed storage and processing of large data sets on clusters of commodity hardware.

What is in a name? Doug Cutting named his new project after his son's favorite toy, a stuffed yellow elephant.

Hadoop for Cluster Architecture


Hadoop was not designed with the typical enterprise environment in mind; it was designed for a Cluster Architecture built out of commodity hardware. Cluster Architecture is based on very simple, very basic components. These components are available by the hundreds of thousands and can easily be assembled together.

Demystifying Cluster Architecture


A Node consists of a set of
commodity processing cores,
main memory, and
commodity disks.

A stack of nodes forms a Rack.
A group of racks forms a Cluster.
All are connected via a high-speed network to enable fast exchange of information.

Why Hadoop?
Hadoop grew out of the need for a way to continuously index massive amounts of data. The initial work was done by a little company called Google,
which needed a way to index the entire World Wide Web every day!

Why Hadoop? (2)


As a result of the breakthrough by Google, many others joined later, including:
Yahoo! (originator and major contributor)
Apache
Facebook
IBM
Twitter
American Airlines
LinkedIn
The New York Times
and many more

Why Hadoop? (3)


Facebook uses Hadoop to process and analyze data across its 845 million users.

Why Hadoop? (4)


Hadoop was designed to enable applications to make the most of cluster architecture by addressing two key points:
Distributed Data
Data Locality

And that brings us to Hadoop's two main mechanisms:

The Hadoop Distributed File System (HDFS)
Hadoop MapReduce

HDFS and MapReduce Overview


The Hadoop Distributed File System is a file system that
splits, scatters, replicates, and manages data across the nodes, as the sketch below illustrates.
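A minimal write sketch using the standard HDFS Java API; the path /data/BigData.txt is the hypothetical file used in the example later in this deck. The application just writes a stream, and HDFS does the splitting, scattering, and replication behind the scenes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);      // connects to the cluster's file system
            // Hypothetical file name, matching the example used later in this deck
            try (FSDataOutputStream out = fs.create(new Path("/data/BigData.txt"))) {
                out.writeBytes("a blog post about big data\n");
            }
            // HDFS transparently splits the file into blocks and replicates each block
        }
    }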

MapReduce is a computational mechanism to execute an application in parallel by:

dividing it into tasks
collocating these tasks with parts of the data
collecting and redistributing intermediate results
managing failures
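To make that data flow concrete, here is a minimal single-process sketch in plain Java (no Hadoop involved), with each string standing in for a file block on a different node:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MapReduceSketch {
        public static void main(String[] args) {
            // Each element stands in for one file block stored on a different node
            List<String> blocks = Arrays.asList("Hadoop scales Hadoop", "Hadoop stores big data");

            // Map phase: every "node" emits a (word, 1) pair per word in its own block
            List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
            for (String block : blocks)
                for (String word : block.split(" "))
                    intermediate.add(Map.entry(word, 1));

            // Shuffle: intermediate pairs are grouped by key
            Map<String, List<Integer>> grouped = new HashMap<>();
            for (Map.Entry<String, Integer> pair : intermediate)
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());

            // Reduce phase: the values for each key are summed into a single result
            Map<String, Integer> result = new HashMap<>();
            grouped.forEach((word, ones) ->
                    result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));

            System.out.println(result);  // e.g. {Hadoop=3, scales=1, stores=1, big=1, data=1}
        }
    }

In real Hadoop the map tasks run on many nodes at once and the grouping happens over the network; the logic, however, is exactly this.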

Hadoop Distributed File System (HDFS)

HDFS: File Blocks, Storage Blocks


A simple and uniform design principle: a file consists of equal-sized File Blocks, and each File Block consists of multiple Storage Blocks.
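The block layout of a file can be inspected through the HDFS Java API; a sketch, assuming the hypothetical /data/BigData.txt already exists in the cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/BigData.txt"));
            // One BlockLocation per file block; each lists the nodes holding a copy
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(block);  // offset, length, and the hosts storing it
            }
        }
    }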

HDFS: Basic Unit


Hadoop uses the File Block as the unit for distributing parts of a file across the disks in use. Since the nodes in a Rack can fail,
the same file block is stored on multiple nodes across the cluster. The number of copies is set to 3 by default.
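The default can be changed cluster-wide via the dfs.replication property in hdfs-site.xml, or per file through the Java API, as in this minimal sketch (file name again hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Ask HDFS to keep 5 copies of this file's blocks instead of the default 3
            fs.setReplication(new Path("/data/BigData.txt"), (short) 5);
        }
    }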

HDFS: Data Nodes and Name Nodes


Hadoop treats all nodes as Data Nodes, meaning that they can store data, but designates at least one node to be the Name Node.

HDFS: Name Node


The Name Node decides
where on each disk each copy of each file block will reside, and
keeps track of all that information in tables stored on its local disks.

HDFS: Failure Detection and Restore


When a node fails, the Name Node
identifies copies of all the affected file blocks on other healthy nodes,
finds new nodes to store another copy of them,
restores the copies there, and
updates this information in its tables.

HDFS: Reading a File


When the application needs to read a file,
it first connects to the Name Node to get the addresses of the disk blocks. The application can then read these blocks directly, without going through the Name Node again.
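A minimal read sketch using the HDFS Java API, again assuming the hypothetical /data/BigData.txt: open() contacts the Name Node once for the block addresses, and the returned stream then pulls the data from the Data Nodes directly.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // open() asks the Name Node for the block locations once;
            // subsequent reads go straight to the Data Nodes
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/BigData.txt"))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }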

HDFS: Single Point of Failure


One common concern about HDFS is that the Name Node can become a single point of failure: if it fails, all the mapping information between filenames and the addresses of their respective file blocks may be lost.

HDFS: Single Point of Failure(2)


A new node then needs to be designated as the Name Node, with the same IP address as the Name Node that failed. To make this possible, Hadoop saves copies of the tables created by the Name Node on other nodes in the cluster.

HDFS: Backup
In addition, some versions of Hadoop have other nodes serving as backup for the Name Node.

Hadoop MapReduce
MapReduce sees the computational task as consisting of two phases:
the Map phase and the Reduce phase.

MapReduce: Mapper Phase


All nodes do the same computation, but each against the part of the data set that is collocated on that node (the Data Locality principle). The mapper computation can be divided equally across the nodes because all the file blocks are of equal size.

MapReduce: Example
Consider a file that aggregates the blog posts related to big data from the last 24 hours. The file has been stored in a cluster using HDFS under the name BigData.txt. The file is split into File Blocks, with 3 copies of each block spread across the nodes of a 50-node cluster.

MapReduce: Example (2)
Analysis: count the number of times Hadoop, Big Data, and Green Plum appear.
Mapper function: its input is the address of a file block; its output is the number of times each of these terms appears there. Each participating node receives a pointer to the Mapper function and the address of a file block located on that node. A sketch of such a Mapper follows below.
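A sketch of such a Mapper using Hadoop's Java API. The class name TermCountMapper is illustrative; with the standard text input format, the framework feeds the Mapper one line of the file block at a time.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (term, count) for every occurrence of the three terms in its file block
    public class TermCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final String[] TERMS = {"Hadoop", "Big Data", "Green Plum"};

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String text = line.toString();
            for (String term : TERMS) {
                // Count occurrences within this line; in this sketch, an
                // occurrence that spans a line break would not be counted
                int count = 0;
                int index = text.indexOf(term);
                while (index >= 0) {
                    count++;
                    index = text.indexOf(term, index + term.length());
                }
                if (count > 0) {
                    context.write(new Text(term), new IntWritable(count));
                }
            }
        }
    }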

MapReduce: Example (3)
After executing the Mapper function, each node produces a list of key-value pairs: the key is Hadoop, Big Data, or Green Plum, and the value is the number of times that term appears in the node's file block.

MapReduce: Example (4)

MapReduce: Example (5)
We can then use the Reduce phase to sum the results obtained by each node, reducing the output of the Map function to a single list of key-value pairs. A node is selected to execute the Reduce function; a sketch follows below.
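A matching Reducer sketch (again with an illustrative class name). By the time reduce() is called, the framework has already grouped the intermediate pairs by key, so the Reducer only has to sum the counts for each term.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Receives all counts for one term (grouped by key) and sums them
    public class TermCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(term, new IntWritable(sum));
        }
    }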

MapReduce: Example (6)
After sorting, key-value pairs with the same key are grouped together. The Reduce function then adds up the values for each key, producing the final result.

MapReduce: Job Tracker


The Map and Reduce operations are considered Tasks; tasks together form a Job.
Job Tracker: coordinates all the jobs run on the system by dividing each job into tasks and scheduling them to run on nodes. It keeps track of all nodes, monitors their status, orchestrates data flow, and handles node failures. A driver sketch that submits such a job follows below.
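A sketch of the driver that wires the example's Mapper and Reducer together and submits the Job. Class and path names are illustrative; the sketch uses the current org.apache.hadoop.mapreduce API (in newer Hadoop releases, YARN takes over the Job Tracker's scheduling role, but job submission looks the same).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TermCountJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "term count");
            job.setJarByClass(TermCountJob.class);
            job.setMapperClass(TermCountMapper.class);
            job.setReducerClass(TermCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/BigData.txt"));
            FileOutputFormat.setOutputPath(job, new Path("/data/term-counts"));
            // The framework splits the job into tasks and schedules each task
            // on a node holding a copy of its file block (data locality)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }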

MapReduce: Task Tracker


Task Trackers: run tasks and send progress reports to the Job Tracker. Each Task Tracker keeps track of all the tasks running on its node, be they Map or Reduce tasks.

Advantages of Hadoop
Leverages cluster architecture to organize big data across hundreds of thousands of nodes
Tasks can be executed in parallel across the nodes
Computational tasks run against big data, with total freedom for data scientists in the type of analytics performed
Data locality keeps the network from becoming a bottleneck

Why is Hadoop a serious player?


According to predictions by Yahoo!,
by 2015, 50% of enterprise data will be processed and stored using Hadoop.

Big Data was the term du jour for 2011. In 2012, we'll move away from this term, as ANY DATA will be of paramount importance for enterprises.

Conclusion

References
http://hadoop.apache.org/
http://en.wikipedia.org/wiki/Apache_Hadoop
http://www.facebook.com/note.php?note_id=16121578919
http://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920
http://developer.yahoo.com/hadoop/

Thank You
