
Beyond map/reduce functions

partitioner, combiner and parameter configuration

Gang Luo Sept. 9, 2010

Determines which reducer/partition a record goes to. Given the key, the value, and the number of partitions, it returns an integer:
Partition: (K2, V2, #Partitions) → integer

Interface:
public interface Partitioner<K2, V2> extends JobConfigurable {
    int getPartition(K2 key, V2 value, int numPartitions);
}
Implementation:
public class myPartitioner<K2, V2> implements Partitioner<K2, V2> {
    public int getPartition(K2 key, V2 value, int numPartitions) {
        // your logic!
    }
}

public class myPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) { }   // required by JobConfigurable

    public int getPartition(Text key, Text value, int numPartitions) {
        int hashCode = key.hashCode();
        int partitionIndex = hashCode % numPartitions;   // Java's modulo operator is %
        return partitionIndex;
    }
}
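One caveat with the hash-based scheme above: `hashCode()` can be negative in Java ("hadoop".hashCode() is, in fact), and a negative `%` result is an invalid partition index. Hadoop's built-in HashPartitioner clears the sign bit before taking the modulo. A self-contained sketch of that computation, with no Hadoop dependency:

```java
// Sketch of the partition-index math used by Hadoop's HashPartitioner.
// Clearing the sign bit (& Integer.MAX_VALUE) keeps the index non-negative
// even when the key's hashCode is negative.
public class PartitionIndexDemo {
    public static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // "hadoop".hashCode() is negative, yet the index stays in [0, 10)
        System.out.println(getPartition("hadoop", 10));
        System.out.println(getPartition("some other key", 10));
    }
}
```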

Reduces the amount of intermediate data before it is sent to the reducers (pre-aggregation). The interface is exactly the same as the Reducer interface.

Example:
public class myCombiner extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter) {
        // your logic
    }
}
The combiner's input key/value types should be the same as the map output key/value types; its output key/value types should be the same as the reducer input key/value types.
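The combiner is registered in the driver, not picked up automatically. A minimal sketch, assuming the old (org.apache.hadoop.mapred) API and hypothetical Driver/myReducer classes:

```java
// Driver-side wiring (org.apache.hadoop.mapred API).
// Driver and myReducer are assumed names, not part of the slide's code.
JobConf conf = new JobConf(Driver.class);
conf.setCombinerClass(myCombiner.class);   // pre-aggregate map output on the map side
conf.setReducerClass(myReducer.class);     // the job's actual reducer
```

Because Hadoop may run the combiner zero, one, or several times per map task, its logic must be safe to apply repeatedly (e.g. summing counts is fine; computing an average directly is not).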

Cluster-level parameters (e.g. HDFS block size) Job-specific parameters (e.g. number of reducers, map output buffer size)
Configurable, and important for job performance

User-defined parameters
Used to pass information from the driver to the mapper/reducer; they help make your mapper/reducer more generic

JobConf conf = new JobConf(Driver.class);
conf.setNumReduceTasks(10);          // set the number of reducers via a built-in setter
conf.set("io.sort.mb", "200");       // set the map output buffer size by parameter name
conf.set("delimiter", "\t");         // set a user-defined parameter
int numReducers = conf.getNumReduceTasks();        // read a parameter via a built-in getter
String buffSize = conf.get("io.sort.mb", "200");   // read a parameter by name (with a default)
String delimiter = conf.get("delimiter", "\t");    // read a user-defined parameter (with a default)

There are some built-in parameters managed by Hadoop. We are not supposed to change them, but we can read them:
String inputFile = jobConf.get("map.input.file"); // path to the current input file
Useful when joining datasets.
*For the new API, use: FileSplit split = (FileSplit) context.getInputSplit();
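Parameters are typically read in the mapper's configure() method, which Hadoop calls once per task before any map() calls. A sketch, assuming the old (org.apache.hadoop.mapred) API; the "delimiter" parameter name and the JoinMapper class are illustrative, not from the slides:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class JoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private String inputFile;   // which input file this map task is reading
    private String delimiter;   // user-defined parameter (assumed name)

    @Override
    public void configure(JobConf job) {
        inputFile = job.get("map.input.file");    // built-in, read-only
        delimiter = job.get("delimiter", "\t");   // user-defined, with a default
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Tag each record with its source file, e.g. for a reduce-side join.
        String[] fields = value.toString().split(delimiter, 2);
        if (fields.length == 2) {
            output.collect(new Text(fields[0]),
                           new Text(inputFile + ":" + fields[1]));
        }
    }
}
```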

More about Hadoop

Identity Mapper/Reducer
Output == input. No modification

Why do we need map/reduce function without any logic in them?

Sorting! More generally, whenever you only want the basic functionality Hadoop already provides (e.g. sorting/grouping)

More about Hadoop

How to determine the number of splits?
If a file is large enough and splittable, it will be split into multiple pieces (split size = block size). If a file is non-splittable, it gets only one split. If a file is small (smaller than a block), it gets one split per file, unless...
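The rule above can be sketched as plain arithmetic. This is a simplification: the real FileInputFormat also honors min/max split-size settings and a small slop factor, which this demo ignores:

```java
// Simplified split count for one input file.
public class SplitCountDemo {
    public static long numSplits(long fileSize, long blockSize, boolean splittable) {
        if (fileSize == 0) return 0;
        if (!splittable) return 1;                       // e.g. a gzip-compressed file
        return (fileSize + blockSize - 1) / blockSize;   // ceil(fileSize / blockSize)
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024;                  // 64 MB, a common HDFS default at the time
        System.out.println(numSplits(200L * 1024 * 1024, block, true));   // large file: 4 splits
        System.out.println(numSplits(200L * 1024 * 1024, block, false));  // non-splittable: 1 split
        System.out.println(numSplits(1024, block, true));                 // small file: 1 split
    }
}
```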

More about Hadoop

Merge multiple small files into one split, which will be processed by one mapper. This saves mapper slots and reduces per-mapper overhead.

Other options to handle small files?

hadoop fs -getmerge src dest (concatenates the files under HDFS path src into a single local file dest)