
HADOOP

IMPLEMENTATION AND USES

Nilaf T 1RV08CS068

Why Hadoop
Large-scale data processing is difficult!
Managing hundreds or thousands of processors
Managing parallelization and distribution
I/O scheduling
Status and monitoring
Fault/crash tolerance

What is Hadoop
It's a framework for running applications on large clusters of commodity hardware, used to store and process huge amounts of data.
An Apache Software Foundation project.
Open source; the 1.0.1 stable release is available for download.

Hadoop History
Google used MapReduce internally to index web pages and to process petabytes of data.
Google used its own filesystem, the Google File System (GFS).
In 2004 Google published its paper on MapReduce; based on it, Hadoop was started as an Apache project with contributions from Yahoo! and Amazon.

Hadoop Contains
Hadoop consists of two key services:
Reliable data storage using the Hadoop Distributed File System (HDFS)
High-performance parallel data processing using a technique called MapReduce

Hadoop DFS (HDFS)


Why HDFS
As the number of requests increases, the current file system becomes inadequate for retrieving large amounts of data simultaneously, and a single large storage device becomes a bottleneck.
To overcome this, we move the file system from single-disk storage to a clustered file system.

Hadoop DFS (HDFS)


HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework.
The filesystem keeps checksums of data for corruption detection and recovery.
Data is organized into files and directories.
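As an illustration of working with files and directories on HDFS, here is a minimal sketch using the Hadoop FileSystem Java API. It assumes a cluster whose address is picked up from the usual core-site.xml configuration; the path /user/demo/hello.txt and the class name are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/demo");           // hypothetical directory
        fs.mkdirs(dir);                              // data is organized into directories

        Path file = new Path(dir, "hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS");             // contents are checksummed on write
        }
        System.out.println("File exists: " + fs.exists(file));
    }
}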

Hadoop DFS (HDFS)


Files are broken into large blocks, typically 128 MB in size.
Blocks are replicated for reliability: one replica on the local node, another replica on a remote rack, a third replica on the local rack, and any additional replicas placed randomly.
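To make the replication setting concrete, here is a minimal sketch, assuming an HDFS file already exists at the hypothetical path /user/demo/hello.txt, that requests a replication factor of three and prints the block size reported by the name node.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/user/demo/hello.txt");   // hypothetical file
        fs.setReplication(file, (short) 3);             // ask the name node for three replicas

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size : " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());
    }
}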

Hadoop DFS (HDFS)


The major advantage of HDFS is that it provides very high input and output throughput.
Hadoop DFS stores each file as a sequence of blocks; every block in a file except the last is the same size.
It is inspired by the Google File System.
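The block layout of a file can be inspected through the same FileSystem API. This sketch (the file path is hypothetical) lists each block of a file together with the data nodes holding its replicas.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/user/demo/big-input.dat");   // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // One entry per block; every block except possibly the last has the same length.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}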

Map Reduce

Started by Google; processes 20 PB of data per day.
Popularized by the open-source Hadoop project.
Used by Yahoo!, Facebook, Amazon, and others.

MapReduce
It is the style in which most programs running on Hadoop are written.
Input is broken into small pieces that are processed independently.
MapReduce is a programming model for processing and generating large data sets.

MapReduce
It consists of two steps: map and reduce.
The map step takes a key/value pair and produces intermediate key/value pairs.
The reduce step takes a key and the list of that key's values and outputs the final key/value pairs.
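As a sketch of the two steps, here is the classic word count written against the Hadoop Java MapReduce API; the class names WordCountMapper and WordCountReducer are illustrative, not from the original slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: (byte offset, line of text) -> (word, 1) for every word in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce step: (word, list of counts) -> (word, total count).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}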

MapReduce Benefits
Map-reduce jobs are automatically parallelized.
Partial failure of the processing cluster is expected and tolerable.
Redundancy and fault tolerance are built in, so the programmer doesn't have to worry about them.
It scales very well.
Many jobs are naturally expressible in the map/reduce paradigm.

MapReduce Benefits
In any MapReduce framework there are two key elements:
1. A Mapper, which applies a processing function to a given key/value pair and produces a list of key/value pairs.
2. A Reducer, which:
   1. fetches a key and the list of all values that were created for that key by all the Mappers,
   2. sorts the fetched key/values into buckets, and
   3. reduces the list by processing the key/values and outputting new key/value pairs.
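To show how a Mapper and Reducer are wired together and submitted, here is a minimal driver sketch in the org.apache.hadoop.mapreduce API (Hadoop 2.x style), reusing the illustrative WordCountMapper and WordCountReducer classes from the earlier example; input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // The framework splits the input, runs mappers in parallel, then sorts and
        // groups intermediate pairs so each key reaches exactly one reduce call.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}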

HDFS Architecture

Name and Data Nodes


A typical installation cluster has a dedicated machine that runs the name node and possibly one data node; each of the other machines in the cluster runs one data node.
Data nodes continuously loop, asking the name node for instructions.
A name node never connects directly to a data node; it simply returns values from functions invoked by a data node.

Data storage reliability


The objective of HDFS is to store data reliably, even when failures occur within name nodes, data nodes, or network partitions.
HDFS heartbeats: detecting loss of connectivity between name and data nodes is crucial.
Each data node sends periodic heartbeat messages to its name node, so the latter can detect lost connections.

Data storage reliability


The name node marks data nodes that do not respond to heartbeats as dead and refrains from sending them further requests.
Data stored on a dead node is no longer available to an HDFS client from that node, which is effectively removed from the system.
If the death of a node causes the replication factor of data blocks to drop below their minimum value, the name node initiates additional replication to bring the replication factor back to a normal state.

Data storage reliability


Data block rebalancing

Space on one or more data nodes can be underutilized, so HDFS supports rebalancing data blocks using various models.
One model might automatically move data blocks from one data node to another if the free space on a data node falls too low.
Another model might dynamically create additional replicas and rebalance other data blocks in a cluster if a sudden increase in demand for a given file occurs.
HDFS also provides the hadoop balancer command for rebalancing blocks across data nodes manually.

Hadoop is not
Apache Hadoop is not a substitute for a database; it is not a database, nor does it need to replace any existing data systems you may have.
Hadoop is a massively scalable storage and batch data processing system.
Hadoop stores data in files and does not index them.

Hadoop is not
Hadoop works where the data is too big for a database.
Hadoop does not store indexes and relations the way databases do, and hence saves space.

Hadoop is not
MapReduce is not always the best algorithm.
Hadoop works best when jobs can run in parallel; it is not a good fit when one job depends on the output of another.
Hadoop is not a good choice for building systems that carry out intense calculations with little or no data.

Advantages of Hadoop
Hadoop compute clusters are built on cheap commodity hardware.
Hadoop automatically handles node failures and data replication.
Hadoop is a good framework for building batch data processing systems.
Hadoop provides an API and framework implementation for working with MapReduce.
The Hadoop job infrastructure can manage and process huge amounts of data, in the range of petabytes.

Uses
Article clustering for Google News: crawl news sources and cluster related articles together.
Website indexing.
Spam filtering.
Data analysis.

Who uses Hadoop


Google
Index building for Google Search: web crawlers need to crawl each website and index it based on its keywords and contents.
Article clustering for Google News: crawl news sources and cluster related articles together.

Who uses Hadoop


Yahoo!
Index building for Yahoo! Search.
Spam detection for Yahoo! Mail: with the huge volume of mail received every day, the Yahoo! spam filter needs to detect spam at scale.

Who uses Hadoop


Facebook
Data mining, e.g. finding relations between two users based on profile information.
Ad optimization: optimizing ads based on the contents of the user's profile.
Spam detection.

Hadoop
How Hadoop saved money and time: The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240.

END
