
Introduction to Hadoop

Evangelos Vazaios
vagvaz at gmail.com
Technical University of Crete

What is Hadoop?

Open Source Implementation of MapReduce

Distributed processing of large data sets

Simple Programming Model

Flat Scalability

MapReduce
List Processing

MapReduce programs transform lists of input data elements into lists of output data elements.

A MapReduce program does this twice, using two different list processing idioms:
map and reduce

The names are borrowed from functional programming languages (Lisp, Scheme, ML)

Hadoop MapReduce

Mappers take key/value pairs as input and output key/value pairs

Reducers take a key and a list of values and output key/value pairs

(Figure: different colors represent different keys.)
All values with the same key are sent to a single reduce task.

Example: Find the Word with the Maximum Number of Appearances

Problem
For each letter of the alphabet, find the word that appears most often in a set of files.
For example, if we had the files:
foo.txt: Sweet, this is the foo file, the text
bar.txt: This is the bar file
We would expect the output to be:
b bar, f file, i is, s sweet, t the

Example: Pseudocoded Mapper & Reducer

mapper (filename, filecontents):
  for each word in filecontents:
    emit (word.charAt(0), word)

reducer (char, values):
  dict = new dictionary<string,int>()
  for each value in values:
    count = dict.get(value) or 0   // default to 0 if the word is not yet in dict
    dict.put(value, count + 1)
  word_max = findWordWithMaxCount(dict)
  emit(char, word_max)

Example: Mapper's Code


public static class TokenizerMapper
    extends Mapper<Object, Text, Text, Text> {
  private final Text letter = new Text();
  private final Text word = new Text();
  public void map(Object key, Text value, Context context) throws
      IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      String token = itr.nextToken();
      letter.set(token.substring(0, 1).toLowerCase()); // key: the word's first letter, lower-cased so "Sweet" and "sweet" share a key
      word.set(token);
      context.write(letter, word); } } }

Example: Reducer's Code


public static class MaxWordReducer extends Reducer<Text,Text,Text,Text> {
  private Text result = new Text();
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Map<String,Integer> dict = new HashMap<String,Integer>();
    for (Text val : values) {
      String word = val.toString();          // copy the value: Hadoop reuses the Text instance
      Integer count = dict.get(word);
      dict.put(word, count == null ? 1 : count + 1);
    }
    // iterate over the entry set of dict to find the word with the maximum count
    String maxWord = null; int maxCount = -1;
    for (Map.Entry<String,Integer> e : dict.entrySet())
      if (e.getValue() > maxCount) { maxCount = e.getValue(); maxWord = e.getKey(); }
    result.set(maxWord);
    context.write(key, result); } }

Example: Job's Code


public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
Job job = new Job(conf, "max word");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(MaxWordReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1); } }
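
To run the job (a minimal sketch: the jar name and HDFS paths below are placeholders, not from the slides), package the classes into a jar and submit it with the Hadoop launcher:

bin/hadoop jar maxword.jar WordCount /user/input /user/output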

Custom Data Types

Writable Interface for custom data types


public interface Writable {
  void readFields(DataInput in) throws IOException;
  void write(DataOutput out) throws IOException; }

WritableComparable for custom key types


public interface WritableComparable<T> extends Writable,
    Comparable<T> { }

Also implement hashCode(): the default HashPartitioner uses it to determine
where to send each <key,value> pair

Writable Example
class ComplexNum implements Writable {
  private float real;
  private float im;
  public void write(DataOutput out) throws IOException {
    out.writeFloat(real);
    out.writeFloat(im); }
  public void readFields(DataInput in) throws IOException {
    real = in.readFloat();
    im = in.readFloat(); } }
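
As a hedged sketch (not from the original slides), the same type could be made usable as a key by implementing WritableComparable together with compareTo and hashCode:

class ComplexNumKey implements WritableComparable<ComplexNumKey> {
  private float real;
  private float im;
  public void write(DataOutput out) throws IOException {
    out.writeFloat(real);
    out.writeFloat(im); }
  public void readFields(DataInput in) throws IOException {
    real = in.readFloat();
    im = in.readFloat(); }
  public int compareTo(ComplexNumKey other) {     // order keys by squared magnitude
    return Float.compare(real*real + im*im,
                         other.real*other.real + other.im*other.im); }
  @Override public int hashCode() {               // used by the default HashPartitioner
    return Float.floatToIntBits(real) * 31 + Float.floatToIntBits(im); } }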

Advanced Features

Combiner: this class runs after the mapper and before the reducer.
A combiner is a mini-reducer which operates only on data generated by
one machine
Input & Output Format: describes the input/output specs of the job
Input Split: represents the data to be processed by an individual
Mapper
RecordReader: converts the byte-oriented view to <key,value> pairs
Chaining jobs: Not every problem can be solved with a single MapReduce
program. Many problems can be solved by writing several
MapReduce steps which run in sequence (see the sketch after this list):
<Stage 1> -> <Stage 2> -> <Stage 3> -> ...

Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3 ...
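
A hedged sketch of chaining two jobs (class names and HDFS paths are placeholders, not from the slides): the first job's output directory becomes the second job's input.

// Mapper/Reducer classes omitted for brevity; Driver is a placeholder driver class
Job first = new Job(conf, "stage 1");
first.setJarByClass(Driver.class);
FileInputFormat.addInputPath(first, new Path("/data/raw"));
FileOutputFormat.setOutputPath(first, new Path("/data/stage1"));
if (!first.waitForCompletion(true)) System.exit(1);   // stop if stage 1 fails

Job second = new Job(conf, "stage 2");
second.setJarByClass(Driver.class);
FileInputFormat.addInputPath(second, new Path("/data/stage1"));
FileOutputFormat.setOutputPath(second, new Path("/data/final"));
System.exit(second.waitForCompletion(true) ? 0 : 1);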

MapReduce DataFlow

Zoom In

HDFS: A distributed Filesystem

Stores large amounts of data

Reliability

Scalability

Integrates with MapReduce

Streaming reads

Write once, read many (the most annoying restriction)

More on HDFS

Block size: 64 MB (traditional filesystems use 4-8 KB)

Replication (per file)

Files are not part of the OS's filesystem

Metadata handled by the NameNode

Useful HDFS Commands

bin/hadoop dfs -mkdir /foodir
Create a directory named /foodir

bin/hadoop dfs -ls /foodir
View the contents of a directory named /foodir

bin/hadoop dfs -rmr /foodir
Remove a directory named /foodir

bin/hadoop dfs -cat /foodir/myfile.txt
View the contents of a file named /foodir/myfile.txt

Useful Hadoop options

io.sort.mb
memory size for sorting map outputs

io.sort.factor
how many streams to merge at once when sorting
(each stream gets io.sort.mb / io.sort.factor memory)

fs.inmemorysize.mb
size of the reduce-side buffer used for sorting & merging map outputs
before spilling to disk
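
As a hedged illustration (the values are arbitrary examples, not recommendations), these options can be set programmatically on the job configuration:

Configuration conf = new Configuration();
conf.set("io.sort.mb", "256");       // 256 MB sort buffer per map task
conf.set("io.sort.factor", "64");    // merge up to 64 streams at a time
Job job = new Job(conf, "tuned job"); // the Job copies the configuration at creation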

Useful Tools

HBase: A NoSQL column store on top of HDFS

Distributed

Integrates with Hadoop/HDFS

Use when you have hundreds of millions of rows

Implements Google's BigTable model (rows, column families, columns)

Basic API:

Get: retrieve a certain row

Put: add new rows or update existing rows

Scan: iterate over a range of rows

Delete: delete a row
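
A minimal client sketch using the classic HBase Java API (table, column family, qualifier and row names are illustrative placeholders; imports from org.apache.hadoop.hbase.* are omitted, as in the other code slides):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "mytable");

Put put = new Put(Bytes.toBytes("row1"));                 // add or update a row
put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
table.put(put);

Get get = new Get(Bytes.toBytes("row1"));                 // read the row back
Result result = table.get(get);
byte[] val = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));

Scan scan = new Scan();                                   // iterate over rows
ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) { /* process each row */ }
scanner.close();
table.close();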

Useful Tools cont.

GraphBuilder: A library published by Intel to ease the building of graphs from raw data

Write application-specific parsers & extraction routines

Call library-supplied APIs to generate an edge list;
the library supplies a set of built-in functions
(add, mul, div, term frequency) to tabulate vertex
values and edge weights

Graph transformation and checking

Graph partitioning and optional dictionary compression

Useful Tools cont.

Giraph: open-source implementation of Google's Pregel

Hive: query and manage large datasets

Uses a SQL-like language called HiveQL

Plug in custom mappers and reducers when it is
inconvenient or inefficient to express the logic in
HiveQL

How to Hadoop...

Try to find how your algorithm could be split

Independent computations

Independent operations on input splits

Try to use Mappers to filter and reduce the input size

Try to minimize intermediate data exchange

Think of ways to exploit the fact that keys are presented to reducers in sorted order

Process data sequentially

How to Hadoop cont.

Create a HadoopJobWrapper class that is responsible for configuration and submission

Use Java's Properties class to pass arguments to the HadoopJobWrapper class

If possible, create your own Mapper and Reducer base classes and add must-have
functionality (periodically calling setStatus, or the use of in-mapper combiners);
a sketch of the in-mapper combining pattern follows

Use Counters to validate and monitor your job
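
A minimal sketch of in-mapper combining for a counting job (not from the original slides): counts are accumulated in a local map and emitted once in cleanup(), reducing intermediate data.

public static class InMapperCountingMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private Map<String, Integer> counts = new HashMap<String, Integer>();
  public void map(Object key, Text value, Context context) {
    for (StringTokenizer itr = new StringTokenizer(value.toString()); itr.hasMoreTokens(); ) {
      String word = itr.nextToken();
      Integer c = counts.get(word);
      counts.put(word, c == null ? 1 : c + 1);   // aggregate locally instead of emitting
    }
  }
  protected void cleanup(Context context) throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet())   // emit once per distinct word
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
  }
}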

Implementing Graph Algorithms

Partition your graph (e.g., simply vertex.key % #partitions)

Map: read partitions and calculate the vertex update function, emit only messages
(see the sketch below)

Reduce: read partitions and use the idea of a parallel merge join to update vertices
with incoming messages
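
As a hedged illustration of the map side (a PageRank-style example, not from the original slides), each mapper reads a vertex line, computes its contribution, and emits one message per outgoing edge keyed by the destination vertex. The input line format is an assumption for this sketch.

public static class VertexUpdateMapper
    extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
  public void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // assumed line format: "vertexId <TAB> rank <TAB> neighbor1,neighbor2,..."
    String[] parts = line.toString().split("\t");
    double rank = Double.parseDouble(parts[1]);
    String[] neighbors = parts.length > 2 ? parts[2].split(",") : new String[0];
    for (String n : neighbors)                              // one message per outgoing edge
      context.write(new LongWritable(Long.parseLong(n)),
                    new DoubleWritable(rank / neighbors.length));
  }
}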

And a glimpse of the Future

Split resource management and job scheduling into separate daemons

Have a global ResourceManager (RM) and a per-application ApplicationMaster;
applications are either a single job or a DAG of jobs

The RM and the per-node NodeManager form the data-computation framework

And a glimpse of the Future cont.

The ApplicationsManager accepts jobs, allocates the first
container, and can restart the ApplicationMaster on failure
The NodeManager is a per-machine framework agent responsible for
containers, monitoring, and reporting to the RM/scheduler
The ApplicationMaster negotiates resource containers and tracks & monitors
their status
