
Introduction to Hadoop

Evangelos Vazaios
vagvaz at gmail.com
Technical University of Crete

What is Hadoop?

Open Source Implementation of MapReduce

Distributed processing of large data sets

Simple Programming Model

Flat Scalability

MapReduce
List Processing

MapReduce programs transform lists of input data elements into lists of output data elements.

A MapReduce program does this twice, using two different list processing idioms:
map and reduce

The names are borrowed from functional programming languages (Lisp, Scheme, ML)

Hadoop MapReduce

Mappers take key/value pairs as input and output key/value pairs

Reducers take a key and a list of values and output key/value pairs

(Figure: different colors represent different keys.)
All values with the same key are sent to a single reduce task.

Example: Find the Word with the Maximum Number of Appearances

Problem
For each letter of the alphabet, find the word that appears most often in a set of files.
For example, if we had the files:
foo.txt: Sweet, this is the foo file, the text
bar.txt: This is the bar file
We would expect the output to be:
b bar, f file, i is, s sweet, t the

Example: Pseudocoded Mapper & Reducer

mapper (filename, filecontents):
  for each word in filecontents:
    emit (word.charAt(0), word)

reducer (char, values):
  dict = new dictionary<string,int>()
  for each value in values:
    count = dict.get(value) or 0   // default to 0 if the word is not yet in dict
    dict.put(value, count + 1)
  word_max = findWordWithMaxCount(dict)
  emit(char, word_max)

Example: Mapper's Code


public static class TokenizerMapper
    extends Mapper<Object, Text, Text, Text> {
  private final Text letter = new Text();
  private final Text word = new Text();
  public void map(Object key, Text value, Context context) throws
      IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      String token = itr.nextToken();
      letter.set(token.substring(0, 1).toLowerCase()); // key: the word's first letter, lower-cased so "Sweet" and "sweet" share a key
      word.set(token);
      context.write(letter, word); } } }

Example: Reducer's Code


public static class MaxWordReducer extends Reducer<Text,Text,Text,Text> {
  private Text result = new Text();
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Map<String,Integer> dict = new HashMap<String,Integer>();
    for (Text val : values) {
      String word = val.toString();          // copy the value: Hadoop reuses the Text instance
      Integer count = dict.get(word);
      dict.put(word, count == null ? 1 : count + 1);
    }
    // iterate over the entry set of dict to find the word with the maximum count
    String maxWord = null; int maxCount = -1;
    for (Map.Entry<String,Integer> e : dict.entrySet())
      if (e.getValue() > maxCount) { maxCount = e.getValue(); maxWord = e.getKey(); }
    result.set(maxWord);
    context.write(key, result); } }

Example: Job's Code


public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
Job job = new Job(conf, "max word");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(MaxWordReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1); } }
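
To run the job (a minimal sketch: the jar name and HDFS paths below are placeholders, not from the slides), package the classes into a jar and submit it with the Hadoop launcher:

bin/hadoop jar maxword.jar WordCount /user/input /user/output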

Custom Data Types

Writable Interface for custom data types


public interface Writable {
  void readFields(DataInput in) throws IOException;
  void write(DataOutput out) throws IOException; }

WritableComparable for custom key types


public interface WritableComparable<T> extends Writable,
    Comparable<T> { }

Also implement hashCode(): the default HashPartitioner uses it to determine
where to send each <key,value> pair

Writable Example
class ComplexNum implements Writable {
  private float real;
  private float im;
  public void write(DataOutput out) throws IOException {
    out.writeFloat(real);
    out.writeFloat(im); }
  public void readFields(DataInput in) throws IOException {
    real = in.readFloat();
    im = in.readFloat(); } }
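
As a hedged sketch (not from the original slides), the same type could be made usable as a key by implementing WritableComparable together with compareTo and hashCode:

class ComplexNumKey implements WritableComparable<ComplexNumKey> {
  private float real;
  private float im;
  public void write(DataOutput out) throws IOException {
    out.writeFloat(real);
    out.writeFloat(im); }
  public void readFields(DataInput in) throws IOException {
    real = in.readFloat();
    im = in.readFloat(); }
  public int compareTo(ComplexNumKey other) {     // order keys by squared magnitude
    return Float.compare(real*real + im*im,
                         other.real*other.real + other.im*other.im); }
  @Override public int hashCode() {               // used by the default HashPartitioner
    return Float.floatToIntBits(real) * 31 + Float.floatToIntBits(im); } }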

Advanced Features

Combiner: this class runs after the mapper and before the reducer.
A combiner is a mini-reducer which operates only on data generated by
one machine
Input & Output Format: describes the input/output specs of the job
Input Split: represents the data to be processed by an individual
Mapper
RecordReader: converts the byte-oriented view to <key,value> pairs
Chaining jobs: Not every problem can be solved with a single MapReduce
program. Many problems can be solved by writing several
MapReduce steps which run in sequence (see the sketch after this list):
<Stage 1> -> <Stage 2> -> <Stage 3> -> ...

Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3 ...
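
A hedged sketch of chaining two jobs (class names and HDFS paths are placeholders, not from the slides): the first job's output directory becomes the second job's input.

// Mapper/Reducer classes omitted for brevity; Driver is a placeholder driver class
Job first = new Job(conf, "stage 1");
first.setJarByClass(Driver.class);
FileInputFormat.addInputPath(first, new Path("/data/raw"));
FileOutputFormat.setOutputPath(first, new Path("/data/stage1"));
if (!first.waitForCompletion(true)) System.exit(1);   // stop if stage 1 fails

Job second = new Job(conf, "stage 2");
second.setJarByClass(Driver.class);
FileInputFormat.addInputPath(second, new Path("/data/stage1"));
FileOutputFormat.setOutputPath(second, new Path("/data/final"));
System.exit(second.waitForCompletion(true) ? 0 : 1);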

MapReduce DataFlow

Zoom In

HDFS: A distributed Filesystem

Stores large amounts of data

Reliability

Scalability

Integrates with MapReduce

Streaming reads

Write once, read many (the most annoying restriction)

More on HDFS

Block size: 64 MB (traditional filesystems use 4-8 KB)

Replication (per file)

Files are not part of the OS's filesystem

Metadata handled by the NameNode

Useful HDFS Commands

bin/hadoop dfs -mkdir /foodir
Create a directory named /foodir

bin/hadoop dfs -ls /foodir
View the contents of a directory named /foodir

bin/hadoop dfs -rmr /foodir
Remove a directory named /foodir

bin/hadoop dfs -cat /foodir/myfile.txt
View the contents of a file named /foodir/myfile.txt

Useful Hadoop options

io.sort.mb
memory size for sorting map outputs

io.sort.factor
how many streams to merge at once when sorting
(each stream gets io.sort.mb / io.sort.factor memory)

fs.inmemorysize.mb
size of the reduce-side buffer used for sorting & merging map outputs
before spilling to disk
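
As a hedged illustration (the values are arbitrary examples, not recommendations), these options can be set programmatically on the job configuration:

Configuration conf = new Configuration();
conf.set("io.sort.mb", "256");       // 256 MB sort buffer per map task
conf.set("io.sort.factor", "64");    // merge up to 64 streams at a time
Job job = new Job(conf, "tuned job"); // the Job copies the configuration at creation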

Useful Tools

HBase: A NoSQL column store on top of HDFS

Distributed

Integrates with Hadoop/HDFS

Use when you have hundreds of millions of rows

Implements Google's BigTable model (rows, column families, columns)

Basic API:

Get: retrieve a certain row

Put: add new rows or update existing rows

Scan: iterate over a range of rows

Delete: delete a row
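
A minimal client sketch using the classic HBase Java API (table, column family, qualifier and row names are illustrative placeholders; imports from org.apache.hadoop.hbase.* are omitted, as in the other code slides):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "mytable");

Put put = new Put(Bytes.toBytes("row1"));                 // add or update a row
put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
table.put(put);

Get get = new Get(Bytes.toBytes("row1"));                 // read the row back
Result result = table.get(get);
byte[] val = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));

Scan scan = new Scan();                                   // iterate over rows
ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) { /* process each row */ }
scanner.close();
table.close();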

Useful Tools cont.

GraphBuilder: A library published by Intel to ease the building of graphs from raw data

Write application-specific parsers & extraction routines

Call library-supplied APIs to generate an edge list;
the library supplies a set of built-in functions
(add, mul, div, term frequency) to tabulate vertex
values and edge weights

Graph transformation and checking

Graph partitioning and optional dictionary compression

Useful Tools cont.

Giraph: open-source implementation of Google's Pregel

Hive: query and manage large datasets

Uses a SQL-like language called HiveQL

Plug in custom mappers and reducers when it is
inconvenient or inefficient to express the logic in
HiveQL

How to Hadoop...

Try to find how your algorithm could be split

Independent computations

Independent operations on input splits

Try to use Mappers to filter and reduce the input size

Try to minimize intermediate data exchange

Think of ways to exploit the fact that keys are presented to reducers in sorted order

Process data sequentially

How to Hadoop cont.

Create a HadoopJobWrapper class that is responsible for configuration and submission

Use Java's Properties class to pass arguments to the HadoopJobWrapper class

If possible, create your own Mapper and Reducer base classes and add must-have
functionality (periodically calling setStatus, or the use of in-mapper combiners);
a sketch of the in-mapper combining pattern follows

Use Counters to validate and monitor your job
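
A minimal sketch of in-mapper combining for a counting job (not from the original slides): counts are accumulated in a local map and emitted once in cleanup(), reducing intermediate data.

public static class InMapperCountingMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private Map<String, Integer> counts = new HashMap<String, Integer>();
  public void map(Object key, Text value, Context context) {
    for (StringTokenizer itr = new StringTokenizer(value.toString()); itr.hasMoreTokens(); ) {
      String word = itr.nextToken();
      Integer c = counts.get(word);
      counts.put(word, c == null ? 1 : c + 1);   // aggregate locally instead of emitting
    }
  }
  protected void cleanup(Context context) throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet())   // emit once per distinct word
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
  }
}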

Implementing Graph Algorithms

Partition your graph (e.g., simply vertex.key % #partitions)

Map: read partitions and calculate the vertex update function, emit only messages
(see the sketch below)

Reduce: read partitions and use the idea of a parallel merge join to update vertices
with incoming messages
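
As a hedged illustration of the map side (a PageRank-style example, not from the original slides), each mapper reads a vertex line, computes its contribution, and emits one message per outgoing edge keyed by the destination vertex. The input line format is an assumption for this sketch.

public static class VertexUpdateMapper
    extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
  public void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // assumed line format: "vertexId <TAB> rank <TAB> neighbor1,neighbor2,..."
    String[] parts = line.toString().split("\t");
    double rank = Double.parseDouble(parts[1]);
    String[] neighbors = parts.length > 2 ? parts[2].split(",") : new String[0];
    for (String n : neighbors)                              // one message per outgoing edge
      context.write(new LongWritable(Long.parseLong(n)),
                    new DoubleWritable(rank / neighbors.length));
  }
}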

And a glimpse of the Future

Split resource management and job scheduling into separate daemons

Have a global ResourceManager (RM) and a per-application ApplicationMaster;
applications are either a single job or a DAG of jobs

The RM and the per-node NodeManager form the data-computation framework

And a glimpse of the Future cont.

The ApplicationsManager accepts jobs, allocates the first
container, and can restart the ApplicationMaster on failure
The NodeManager is a per-machine framework agent responsible for
containers, monitoring, and reporting to the RM/scheduler
The ApplicationMaster negotiates resource containers and tracks & monitors
their status
