OREIN IT Technologies
Agenda
Architecture of HDFS and MapReduce
Hadoop Streams
Hadoop Pipes
Basics of Hbase and Zookeeper
Design of HDFS
Very large files
Very large in this context means files that are hundreds of megabytes,
gigabytes, or terabytes in size. There are Hadoop clusters running today that
store petabytes of data.
Commodity hardware
Hadoop doesn't require expensive, highly reliable hardware to run on. It's
designed to run on clusters of commodity hardware for which the chance of
node failure across the cluster is high, at least for large clusters. HDFS is
designed to carry on working without a noticeable interruption to the user in the
face of such failure.
By making a block large enough, the time to transfer the data from the disk can be
made to be significantly larger than the time to seek to the start of the block.
Thus a large file made of multiple blocks can be transferred at close to the disk transfer rate.
A quick calculation shows that if the seek time is around 10 ms, and the transfer rate
is 100 MB/s, then to make the seek time 1% of the transfer time, we need to make the
block size around 100 MB.
The default is actually 64 MB, although many HDFS installations use 128 MB blocks.
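The arithmetic behind that figure, as a small sketch (the 10 ms seek time and 100 MB/s transfer rate are the illustrative numbers above, not measurements):

public class BlockSizeEstimate {
    public static void main(String[] args) {
        double seekTimeSec = 0.010;       // assumed seek time: 10 ms
        double transferRateMBps = 100.0;  // assumed transfer rate: 100 MB/s
        double seekFraction = 0.01;       // goal: seek time is 1% of transfer time
        // seekTimeSec = seekFraction * (blockSizeMB / transferRateMBps)
        // => blockSizeMB = seekTimeSec * transferRateMBps / seekFraction
        double blockSizeMB = seekTimeSec * transferRateMBps / seekFraction;
        System.out.println("block size ~ " + blockSizeMB + " MB");  // prints ~100 MB
    }
}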
Advantage of HDFS?
Moving Computation is Cheaper than Moving Data
A computation requested by an application is much more efficient if it
is executed near the data it operates on. This is especially true when
the size of the data set is huge.
The assumption is that it is often better to migrate the computation
closer to where the data is located rather than moving the data to
where the application is running.
HDFS provides interfaces for applications to move themselves closer
to where the data is located.
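One such interface is FileSystem#getFileBlockLocations(), which lets an application ask the namenode where a file's blocks live. A minimal sketch, assuming a reachable cluster in the default configuration and a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical input file; replace with a real HDFS path
        Path file = new Path("/user/hadoop/input/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // Ask the namenode which datanodes hold each block of the file
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            for (String host : block.getHosts()) {
                System.out.println("block at offset " + block.getOffset()
                    + " is on host " + host);
            }
        }
    }
}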
HDFS Architecture
HDFS has a master/slave architecture.
An HDFS cluster consists of a single NameNode, a master server that
manages the file system namespace and regulates access to files by
clients.
The cluster also has a Secondary NameNode, which periodically merges the namespace image with the edit log.
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (DFS).
DFS calls the namenode, using RPC, to determine the locations of the first few blocks in the file.
For each block, the namenode returns the addresses of the datanodes that have a copy of that block, and the datanodes are sorted according to their proximity to the client.
The DFS returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from.
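A minimal sketch of this read path using the Java FileSystem API (the file URI is passed on the command line and is illustrative):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];  // e.g. hdfs://namenode/user/hadoop/sample.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        InputStream in = null;
        try {
            // open() returns an FSDataInputStream, which supports seek()
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}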
Replication
Example 1: Problem statement
What's the highest recorded global temperature for each year in the dataset?
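One way to express this in Java, as a sketch: it assumes a simplified input format of tab-separated year and temperature per line (not the raw weather records) and uses the old org.apache.hadoop.mapred API to match the JobConf/JobTracker terminology used later in this deck.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperature {

    // Mapper: emits (year, temperature) for every input record
    public static class MaxTemperatureMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // assumed record format: "year<TAB>temperature"
            String[] fields = value.toString().split("\t");
            output.collect(new Text(fields[0]),
                new IntWritable(Integer.parseInt(fields[1])));
        }
    }

    // Reducer: keeps the maximum temperature seen for each year
    public static class MaxTemperatureReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int max = Integer.MIN_VALUE;
            while (values.hasNext()) {
                max = Math.max(max, values.next().get());
            }
            output.collect(key, new IntWritable(max));
        }
    }
}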
Data Flow
A MapReduce job is a unit of work that the client wants to be performed. It consists of:
- The input data
- The MapReduce program
- Configuration information
Hadoop runs the job by dividing it into tasks:
- map tasks
- reduce tasks
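A minimal driver sketch tying these three pieces together with the old JobConf API (the mapper and reducer classes refer to the MaxTemperature sketch above; input and output paths come from the command line):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperatureDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MaxTemperatureDriver.class);
        conf.setJobName("Max temperature");

        // The input data and where the output should go
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // The MapReduce program: mapper and reducer classes
        conf.setMapperClass(MaxTemperature.MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperature.MaxTemperatureReducer.class);

        // Configuration information: output key/value types
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);  // submits the job and waits for completion
    }
}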
Input splits
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.
Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
Reduce Task
Reduce tasks don't have the advantage of data locality: the input to a single reduce task is normally the output from all mappers.
Combiner function
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks.
Hadoop allows the user to specify a combiner function to be run on the map output.
The output of the combiner function is the input to the reducer.
If we use a combiner
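Because taking a maximum is associative and commutative, the reducer from the earlier sketch can also act as the combiner. A hedged sketch: the helper class below is hypothetical, while setCombinerClass() is the real JobConf method.

import org.apache.hadoop.mapred.JobConf;

public class CombinerSetup {
    // Hypothetical helper: adds a combiner to the job configuration from
    // the MaxTemperatureDriver sketch above, so each mapper's output is
    // reduced locally before being sent across the network.
    public static void addCombiner(JobConf conf) {
        conf.setCombinerClass(MaxTemperature.MaxTemperatureReducer.class);
    }
}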
Job initialization
When the JobTracker receives a call to its submitJob() method, it puts the job into an internal queue from where the job scheduler will pick it up and initialize it.
Initialization involves:
- creating an object to represent the job being run, which encapsulates its tasks and bookkeeping information to keep track of the tasks' status and progress (step 5)
- creating the list of tasks to run: the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem (step 6)
It then creates one map task for each split.
The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the JobConf; the scheduler simply creates this number of reduce tasks to be run.
Tasks are given IDs at this point.
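For illustration, the same property can be set programmatically: setNumReduceTasks() on JobConf is the API equivalent of mapred.reduce.tasks (the helper class and the value below are illustrative).

import org.apache.hadoop.mapred.JobConf;

public class ReduceTaskCount {
    // Hypothetical helper: the job scheduler will create exactly this many
    // reduce tasks for the job.
    public static void configure(JobConf conf) {
        conf.setNumReduceTasks(2);  // illustrative value
    }
}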
Task Execution
Once the tasktracker has been assigned a task, it localizes the job JAR by copying it from the shared filesystem to the tasktracker's filesystem. It also copies any files needed by the application from the distributed cache to the local disk (step 8).
It creates a local working directory for the task, and un-jars the
contents of the JAR into this directory.
It creates an instance of TaskRunner to run the task.
Task Runner
TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10).
Hadoop Streaming
Hadoop provides an API to MapReduce that allows you to write your
map and reduce functions in languages other than Java.
Hadoop Streaming uses Unix standard streams as the interface
between Hadoop and your program, so you can use any language
that can read standard input and write to standard output to write
your MapReduce program.
Ruby, Python
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*streaming.jar \
-input input/sample.txt \
-output output \
-mapper src/main/ruby/mapper.rb \
-reducer src/main/ruby/reducer.rb
Hadoop Pipes
Hadoop Pipes is the name of the C++ interface to Hadoop
MapReduce.
Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. JNI is not used.
hadoop fs -put max_temperature bin/max_temperature
hadoop fs -put input/sample.txt sample.txt
hadoop pipes \
-D hadoop.pipes.java.recordreader=true \
-D hadoop.pipes.java.recordwriter=true \
-input sample.txt \
-output output \
-program bin/max_temperature
HBase
HBase is a distributed column-oriented database
built on top of HDFS
HBase is the Hadoop application to use when you require real-time
read/write random-access to very large datasets.
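A minimal sketch of that random-access pattern with the HBase Java client (the table and column names are hypothetical, and the table would need to be created first, e.g. from the HBase shell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Hypothetical table "temperatures" with a column family "data"
        HTable table = new HTable(conf, "temperatures");

        // Random-access write: one cell keyed by station id and year
        Put put = new Put(Bytes.toBytes("station-011990_1950"));
        put.add(Bytes.toBytes("data"), Bytes.toBytes("max"), Bytes.toBytes("22"));
        table.put(put);

        // Random-access read of the same row
        Get get = new Get(Bytes.toBytes("station-011990_1950"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("data"), Bytes.toBytes("max"))));

        table.close();
    }
}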
Zookeeper
Links
http://hadoop.apache.org/common/docs/current/hdfs_design.html
http://hadoop.apache.org/mapreduce/
http://www.cloudera.com/hadoop/
http://www.cloudera.com/
http://www.cloudera.com/hadoop-training/
http://www.cloudera.com/resources/?type=Training
http://blog.adku.com/2011/02/hbase-vs-cassandra.html
http://www.google.co.in/search?q=hbase+tutorial&ie=utf-8&oe=utf-8&aq=t&rls=or