
Beyond map/reduce functions

partitioner, combiner and parameter configuration

Gang Luo Sept. 9, 2010

Partitioner
Determines which reducer/partition a record should go to. Given the key, the value, and the number of partitions, it returns an integer:
Partition: (K2, V2, #Partitions) → integer

Partitioner
Interface:

    public interface Partitioner<K2, V2> extends JobConfigurable {
      int getPartition(K2 key, V2 value, int numPartitions);
    }

Implementation:

    public class myPartitioner<K2, V2> implements Partitioner<K2, V2> {
      public int getPartition(K2 key, V2 value, int numPartitions) {
        // your logic!
      }
    }

Partitioner
Example:

    public class myPartitioner implements Partitioner<Text, Text> {
      public int getPartition(Text key, Text value, int numPartitions) {
        int hashCode = key.hashCode();
        // use % (Java has no "mod" operator) and mask the sign bit
        // so a negative hashCode cannot yield a negative index
        int partitionIndex = (hashCode & Integer.MAX_VALUE) % numPartitions;
        return partitionIndex;
      }
    }
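One subtlety in the example above: Java's % operator returns a negative result when key.hashCode() is negative, which would produce an invalid partition index. Masking off the sign bit (as Hadoop's own HashPartitioner does) keeps the index in range. A standalone sketch of just that arithmetic (class name and sample values are illustrative only):

```java
// Demonstrates why the sign bit must be masked before taking the
// remainder: a negative hashCode would otherwise give a negative index.
public class PartitionIndexDemo {
    // Same arithmetic as in getPartition above.
    static int partitionFor(int hashCode, int numPartitions) {
        return (hashCode & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(-17, 10));  // plain -17 % 10 would be -7
        System.out.println(partitionFor("hadoop".hashCode(), 10));
    }
}
```

A custom partitioner is registered in the driver with conf.setPartitionerClass(myPartitioner.class).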

Combiner
Reduces the amount of intermediate data before it is sent to the reducers (pre-aggregation). The interface is exactly the same as the reducer's.

Combiner
Example:

    public class myCombiner extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output,
                         Reporter reporter) throws IOException {
        // your logic
      }
    }

The combiner's input key/value types should be the same as the map output key/value types, and its output key/value types the same as the reducer input key/value types.
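To see what pre-aggregation buys you, here is a small plain-Java sketch (not the Hadoop API; class name and data are made up) that collapses repeated map-output keys by summing their counts, the way a word-count combiner would:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of what a combiner does: local pre-aggregation shrinks the
// number of (key, value) pairs shipped across the network to reducers.
public class CombinerDemo {
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            out.merge(p.getKey(), p.getValue(), Integer::sum);  // sum per key
        return out;
    }

    public static void main(String[] args) {
        // hypothetical map output: five pairs collapse to two
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("hadoop", 1), Map.entry("map", 1),
            Map.entry("hadoop", 1), Map.entry("map", 1),
            Map.entry("hadoop", 1));
        System.out.println(combine(mapOutput));
    }
}
```

In the driver, a combiner is registered with conf.setCombinerClass(myCombiner.class).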

Parameter
Cluster-level parameters (e.g. HDFS block size)
Job-specific parameters (e.g. number of reducers, map output buffer size)
These are configurable and important for job performance.

User-defined parameters
Used to pass information from the driver to the mapper/reducer. They help make your mapper/reducer more generic.

Parameter
    JobConf conf = new JobConf(Driver.class);
    conf.setNumReduceTasks(10);              // set the number of reducers via a built-in method
    conf.set("io.sort.mb", "200");           // set the map output buffer size by parameter name
    conf.set("delimiter", "\t");             // set a user-defined parameter
    int numReducers = conf.getNumReduceTasks();       // read a value via a built-in method
    String buffSize = conf.get("io.sort.mb", "200");  // read a value by name (second argument is the default)
    String delimiter = conf.get("delimiter", "\t");   // read a user-defined parameter
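On the task side, a mapper built on the old API can read such parameters by overriding configure(JobConf). A sketch, assuming a user-defined "delimiter" parameter like the one set in the driver, with illustrative input/output types:

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Sketch (old API): pick up a user-defined parameter in configure(),
// then use it in map(). The "delimiter" name is our own choice.
public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private String delimiter;

    @Override
    public void configure(JobConf job) {
        delimiter = job.get("delimiter", "\t");  // default to tab if unset
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // split each line into a key field and a value field
        String[] fields = value.toString().split(delimiter, 2);
        if (fields.length == 2)
            output.collect(new Text(fields[0]), new Text(fields[1]));
    }
}
```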

Parameter
There are some built-in parameters managed by Hadoop. We are not supposed to change them, but we can read them.

    String inputFile = jobConf.get("map.input.file");

Gets the path to the current input file; useful when joining datasets.
*For the new API, use: FileSplit split = (FileSplit) context.getInputSplit();

More about Hadoop


Identity Mapper/Reducer
Output == input. No modification

Why do we need map/reduce functions without any logic in them?


Sorting! More generally, whenever you only want the basic functionality Hadoop already provides (e.g. sorting/grouping).
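A sorting job can therefore be wired up with Hadoop's stock identity classes and no user code at all: the shuffle phase sorts by key, and the identity classes pass records through unchanged. A minimal driver fragment (old API; SortDriver is a hypothetical class name):

```java
// Configuration fragment (sketch): a pure sort job with no user logic.
JobConf conf = new JobConf(SortDriver.class);   // SortDriver: hypothetical driver class
conf.setMapperClass(IdentityMapper.class);      // org.apache.hadoop.mapred.lib.IdentityMapper
conf.setReducerClass(IdentityReducer.class);    // org.apache.hadoop.mapred.lib.IdentityReducer
```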

More about Hadoop


How is the number of splits determined?
If a file is large enough and splittable, it is split into multiple pieces (split size = block size). If a file is not splittable, there is only one split. If a file is smaller than a block, there is one split for that file, unless...
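The arithmetic behind this is easy to check. FileInputFormat clamps the split size between a configured minimum and maximum, which with default settings collapses to the block size. A self-contained sketch (parameter names simplified; the 64 MB block and 200 MB file are illustrative):

```java
// Sketch of FileInputFormat-style split sizing: with defaults the split
// size equals the block size, so a 200 MB file on 64 MB blocks → 4 splits.
public class SplitDemo {
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;   // ceiling division
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024;
        long file = 200L * 1024 * 1024;
        // defaults: tiny minimum, effectively unbounded maximum
        System.out.println(numSplits(file, splitSize(1, Long.MAX_VALUE, block))); // 4
    }
}
```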

More about Hadoop


CombineFileInputFormat
Merges multiple small files into one split, which is processed by one mapper. This saves mapper slots and reduces overhead.
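In the old API, CombineFileInputFormat is an abstract class that you subclass, and the maximum size of a combined split is commonly capped through the mapred.max.split.size parameter. A hedged driver fragment (the 128 MB cap and the subclass name are illustrative assumptions):

```java
// Configuration fragment (sketch): cap combined splits at ~128 MB so one
// mapper handles many small files without creating an oversized split.
JobConf conf = new JobConf(Driver.class);
conf.setLong("mapred.max.split.size", 128L * 1024 * 1024);
conf.setInputFormat(MyCombineFileInputFormat.class);  // hypothetical CombineFileInputFormat subclass
```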

Other options to handle small files?


hadoop fs -getmerge src dest
Merges the files under src on HDFS into a single file dest on the local filesystem.