Hadoop API
Dimitrios Michail
2012-2013
Hadoop

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
Hadoop

Serializable
The key and value classes need to be serializable by the framework, so they need to implement the Writable interface.

Comparable
To facilitate sorting by the framework, the key class also needs to be comparable, so it needs to implement the WritableComparable interface.

There are implementations for all basic types, e.g. DoubleWritable.
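A key or value type only has to know how to write itself to a DataOutput and read itself back from a DataInput; a key additionally orders itself for sorting. The following standalone sketch mimics that contract with plain java.io (no Hadoop on the classpath; the class name PairWritable is made up for illustration, not a Hadoop class):

```java
import java.io.*;

// Sketch of the Writable/WritableComparable contract: serialize with
// write(DataOutput), deserialize in place with readFields(DataInput),
// and order keys with compareTo. Hypothetical class, not from Hadoop.
public class PairWritable implements Comparable<PairWritable> {
    private String word;
    private int count;

    public PairWritable() {}  // the framework needs a no-arg constructor
    public PairWritable(String w, int c) { word = w; count = c; }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readInt();
    }

    @Override public int compareTo(PairWritable o) {  // sorting for the shuffle
        return word.compareTo(o.word);
    }

    public static void main(String[] args) throws IOException {
        // Round-trip: serialize to a byte buffer, read back into a fresh object.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new PairWritable("hadoop", 3).write(new DataOutputStream(buf));

        PairWritable back = new PairWritable();
        back.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(back.word + " " + back.count);  // prints: hadoop 3
    }
}
```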
WordCount Example
public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }
WordCount Example
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
WordCount Example
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
} // end of class WordCount
Input
There are three major interfaces for providing input to a map-reduce job:

InputFormat<K,V>
InputSplit
RecordReader
InputFormat<K,V> Interface
InputFormat<K,V> describes the input-specification for a Map-Reduce job. The Map-Reduce framework relies on the InputFormat of the job to:

1. Validate the input-specification of the job.
2. Split up the input file(s) into logical InputSplit(s), each of which is then assigned to an individual Mapper.
3. Provide the RecordReader implementation to be used to produce input records from the logical InputSplit for processing by the Mapper.
InputSplit Interface
InputSplit represents the data to be processed by an individual Mapper. It presents a byte-oriented view of the input; it is the responsibility of the job's RecordReader to process this and present a record-oriented view.
RecordReader Interface
RecordReader reads <key, value> pairs from an InputSplit. It converts the byte-oriented view of the input, provided by the InputSplit, into a record-oriented view for the Mapper & Reducer tasks. It thus assumes the responsibility of processing record boundaries and presenting the tasks with keys and values.
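To make the byte-to-record conversion concrete, here is a standalone sketch, assuming the split is simply a byte array in memory (plain Java, no Hadoop; this LineRecordReader is a hypothetical stand-in, not the real class), that emits (byte offset, line) pairs the way a line-oriented reader does:

```java
import java.util.*;

// Sketch of a RecordReader: turns a byte-oriented "split" into
// record-oriented (byte offset, line) pairs.
public class LineRecordReader {
    private final byte[] split;   // the byte-oriented view handed to us
    private int pos = 0;          // current position inside the split

    public LineRecordReader(byte[] split) { this.split = split; }

    // Mimics RecordReader.next(key, value): returns null when exhausted.
    public Map.Entry<Long, String> next() {
        if (pos >= split.length) return null;
        long key = pos;                               // key = position in file
        StringBuilder line = new StringBuilder();
        while (pos < split.length && split[pos] != '\n' && split[pos] != '\r') {
            line.append((char) split[pos++]);
        }
        if (pos < split.length) pos++;                // skip the line terminator
        return Map.entry(key, line.toString());
    }

    public static void main(String[] args) {
        LineRecordReader rr = new LineRecordReader("to be\nor not\nto be\n".getBytes());
        for (Map.Entry<Long, String> rec; (rec = rr.next()) != null; ) {
            System.out.println(rec.getKey() + " -> " + rec.getValue());
        }
        // 0 -> to be
        // 6 -> or not
        // 13 -> to be
    }
}
```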
Default InputFormat
The default InputFormat<K,V> is the TextInputFormat.

TextInputFormat
An InputFormat<LongWritable, Text> for plain text files; files are broken into lines. Either linefeed or carriage-return is used to signal end of line. Keys are the position (byte offset) in the file, and values are the line of text.
Output
The major interfaces for the output of a map-reduce job are:

OutputCollector<K,V>
OutputFormat<K,V>
RecordWriter
FileSystem
OutputCollector Interface
Collects the <key, value> pairs output by Mappers and Reducers. OutputCollector<K,V> is the generalization of the facility provided by the Map-Reduce framework to collect data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job.
OutputFormat<K,V> Interface
OutputFormat<K,V> describes the output-specification for a Map-Reduce job. The Map-Reduce framework relies on the OutputFormat<K,V> of the job to:

1. Validate the output-specification of the job, e.g. check that the output directory doesn't already exist.
2. Provide the RecordWriter implementation to be used to write out the output files of the job. Output files are stored in a FileSystem.
RecordWriter Interface
RecordWriter writes the output <key, value> pairs to an output file. RecordWriter implementations write the job outputs to the FileSystem.
FileSystem
An abstract base class for a fairly generic filesystem. Implemented either as:

local: reflects the locally-connected disk.
Distributed File System: the Hadoop DFS, a multi-machine system that appears as a single disk, offering fault tolerance and potentially very large capacity.
Default OutputFormat
The default OutputFormat<K,V> is the TextOutputFormat.

TextOutputFormat
An OutputFormat<K,V> that writes plain text files.
JobConf
map/reduce job configuration

JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution. JobConf typically specifies the Mapper, combiner (if any), Partitioner, Reducer, InputFormat<K,V> and OutputFormat<K,V> implementations to be used, etc.
Mapper Interface
Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks which transform input records into intermediate records. The user implements the following interface:
public interface Mapper<K1, V1, K2, V2>
    extends JobConfigurable, Closeable {
  void map(K1 key, V1 value,
           OutputCollector<K2, V2> output, Reporter reporter);
}
The framework spawns one map task for each InputSplit generated by the InputFormat<K,V> for the job, and calls map(K1, V1, OutputCollector<K2,V2>, Reporter) for each key/value pair in the InputSplit for that task.
Grouping
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a Reducer to determine the final output. Users can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).
Partitioning
The grouped Mapper outputs are partitioned per Reducer. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
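When no custom Partitioner is set, Hadoop's default is hash partitioning: a key goes to reducer (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A standalone sketch of that logic (plain Java, outside the framework; the class name is made up):

```java
// Sketch of the default hash-partitioning logic: every occurrence of a key
// lands on the same reducer, chosen by its hash modulo the reducer count.
public class HashPartitionerSketch {
    // Mirrors Partitioner.getPartition(key, value, numPartitions).
    static int getPartition(String key, int numReduceTasks) {
        // Mask off the sign bit so negative hashCodes still map to [0, n).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = { "to", "be", "or", "not", "to", "be" };
        for (String k : keys) {
            System.out.println(k + " -> reducer " + getPartition(k, 3));
        }
        // Both occurrences of "to" print the same reducer number,
        // so all values for a key meet at a single Reducer.
    }
}
```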
Combiners
Users can optionally specify a combiner, via JobConf.setCombinerClass(Class). Combiners perform local aggregation of the intermediate outputs. This helps to cut down the amount of data transferred from the Mapper to the Reducer.
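In the WordCount example above, the Reduce class doubles as the combiner (conf.setCombinerClass(Reduce.class)): summing partial counts on the map side first is safe because addition is associative and commutative. A small standalone simulation of the saving (plain Java, no Hadoop; the class and toy input are just for illustration):

```java
import java.util.*;

// Simulates what a combiner does: aggregate the map output locally,
// so fewer <word, count> pairs cross the network to the Reducer.
public class CombinerSketch {
    // The same aggregation a Reducer does, run on the map side.
    static Map<String, Integer> combine(String[] tokens) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String t : tokens) {
            combined.merge(t, 1, Integer::sum);  // sum partial counts per word
        }
        return combined;
    }

    public static void main(String[] args) {
        // Raw map output: one ("word", 1) pair per token, 6 pairs in total.
        String[] tokens = "to be or not to be".split(" ");
        Map<String, Integer> out = combine(tokens);
        System.out.println("pairs without combiner: " + tokens.length);  // 6
        System.out.println("pairs with combiner:    " + out.size());     // 4
        System.out.println(out);  // {be=2, not=1, or=1, to=2}
    }
}
```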
The intermediate, grouped outputs are stored in SequenceFiles, which are flat files consisting of binary key/value pairs. There are three SequenceFile Writers:

1. Writer: uncompressed records.
2. RecordCompressWriter: record-compressed files; only values are compressed.
3. BlockCompressWriter: block-compressed files; both keys & values are collected in blocks separately and compressed. The size of the block is configurable.
The actual compression algorithm used to compress keys and/or values can be specified by using the appropriate CompressionCodec. There are also Readers and Sorters for each SequenceFile type.
Reducer Interface
Reducer reduces a set of intermediate values which share a key to a smaller set of values. The user implements the following interface:
public interface Reducer<K2, V2, K3, V3>
    extends JobConfigurable, Closeable {
  void reduce(K2 key, Iterator<V2> values,
              OutputCollector<K3, V3> output, Reporter reporter);
}
The number of Reducers for the job is set by the user via JobConf.setNumReduceTasks(int).
Phase 1: Shuffle
The framework, for each reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.
Phase 2: Sort
The framework groups Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.
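The merging can be pictured as follows: each Mapper's (already combined) output is folded into one sorted, grouped structure whose entries are exactly the <key, (list of values)> pairs handed to reduce(). A standalone simulation (plain Java, no Hadoop; mergeAndGroup is a made-up helper name):

```java
import java.util.*;

// Simulates the sort phase: outputs fetched from several Mappers are
// merged so that all values for a key arrive at reduce() together.
public class SortPhaseSketch {
    static Map<String, List<Integer>> mergeAndGroup(List<Map<String, Integer>> mapOutputs) {
        // TreeMap keeps keys sorted, as the framework's merge does.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> out : mapOutputs) {
            for (Map.Entry<String, Integer> e : out.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                       .add(e.getValue());
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Combined outputs of two map tasks; note both emitted "be".
        Map<String, Integer> m1 = Map.of("to", 2, "be", 1);
        Map<String, Integer> m2 = Map.of("be", 1, "not", 1);
        Map<String, List<Integer>> grouped = mergeAndGroup(List.of(m1, m2));
        // Each entry is one <key, (list of values)> pair handed to reduce():
        grouped.forEach((k, vs) -> System.out.println(k + " -> " + vs));
        // be -> [1, 1]
        // not -> [1]
        // to -> [2]
    }
}
```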
Phase 3: Reduce
The reduce(K2, Iterator<V2>, OutputCollector<K3,V3>, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(K3, V3).
Reporter
A facility for Map-Reduce applications to report progress and update counters, status information, etc.

Is Alive?
Mapper and Reducer can use the provided Reporter to report progress or just indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial, since the framework might otherwise assume that the task has timed out and kill it.
Material