
Hadoop API differences

Disclaimer: This material is protected under copyright, AnalytixLabs, 2011. Unauthorized use and/or duplication of this material or any part of this material, including data, in any form without explicit and written permission from AnalytixLabs is strictly prohibited. Any violation of this copyright will attract legal action.

API differences
Hadoop 2 uses the new Java API for MapReduce.
The new API is functionally equivalent to the old API.
Hadoop ships with both the old and new MapReduce APIs.
They can be used independently, but they are not compatible with each other.
There are many differences between the two APIs.

Packages
The new API is in the org.apache.hadoop.mapreduce package and its subpackages.
The old API is in the org.apache.hadoop.mapred package.

Interface vs abstract classes
The new API favors abstract classes over interfaces.
Abstract classes are easier to evolve: for example, you can add a method (with a default implementation) to an abstract class without breaking old implementations of the class.
Examples of abstract classes in the new API are Mapper and Reducer, as sketched below.
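A minimal sketch of extending the new-API Mapper abstract class (the class and package names are from the Hadoop distribution; the tokenizing logic is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Extends the abstract Mapper class; only map() is overridden, while the
    // inherited setup(), cleanup() and run() defaults are used as-is.
    public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          word.set(token);
          context.write(word, ONE);  // emit (word, 1)
        }
      }
    }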

Context
The new API makes extensive use of context objects.
Contexts allow the user code to communicate with the MapReduce system.
The new Context unifies the roles of JobConf, OutputCollector and Reporter from the old API, as the signatures below illustrate.
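For comparison, the two map() signatures (both are from the public Hadoop API; the concrete type parameters are illustrative):

    // Old API (org.apache.hadoop.mapred): output and progress reporting go
    // through separate OutputCollector and Reporter arguments, and
    // configuration comes from a JobConf passed to configure().
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException;

    // New API (org.apache.hadoop.mapreduce): a single Context argument
    // carries output collection, progress reporting and configuration.
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException;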

Context
The Context object is used to write output key-value pairs as well as to get the configuration, counters and cache files.
A Mapper<LongWritable, Text, Text, IntWritable> takes only LongWritable keys and Text values as input.
Using the context you can interact with the framework easily, as shown below.
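A hedged sketch of those interactions inside a mapper like the one above (the configuration key and counter names are invented for illustration):

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Read a job configuration property (the key name is hypothetical).
      String separator = context.getConfiguration().get("my.field.separator", "\t");

      // Increment a user-defined counter (group and name are hypothetical).
      context.getCounter("MyApp", "RecordsSeen").increment(1);

      // URIs of files placed in the distributed cache for this job.
      java.net.URI[] cacheFiles = context.getCacheFiles();

      // Write an output key-value pair.
      context.write(new Text(value.toString().split(separator)[0]),
          new IntWritable(1));
    }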

More control
In both APIs, key-value record pairs are pushed to the mapper and reducer.
The new API also allows both mappers and reducers to control the execution flow by overriding the run() method.
Records can be processed in batches, and the execution can be terminated before all the records have been processed, as in the sketch below.
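A minimal sketch of overriding Mapper.run() to stop early (run(), nextKeyValue(), getCurrentKey() and getCurrentValue() are the real new-API methods; the record limit is an invented illustration):

    // run() drives the mapper; the default implementation loops over every
    // input record. Overriding it lets the mapper stop after a fixed count.
    @Override
    public void run(Context context) throws IOException, InterruptedException {
      setup(context);
      try {
        long processed = 0;
        while (processed < 1000 && context.nextKeyValue()) {
          map(context.getCurrentKey(), context.getCurrentValue(), context);
          processed++;
        }
      } finally {
        cleanup(context);
      }
    }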

Job control
Job control is performed through the Job class in the new API; see the driver sketch below.
The old API uses JobClient, which does not exist in the new API.
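A hedged sketch of a new-API driver (WordCountMapper refers to the earlier sketch; the input and output paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // New API: Job replaces the old JobClient/JobConf pair.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion() submits the job and polls until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }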

Configuration
Configuration has been unified in the new API.
The old API uses a JobConf object for job configuration; JobConf is an extension of Hadoop's Configuration object.
In the new API, job configuration is done through a plain Configuration, via some of the helper methods on Job.
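For contrast, a sketch of the two configuration styles (the property key is hypothetical):

    // Old API: a dedicated JobConf, which extends Configuration.
    JobConf jobConf = new JobConf(WordCountDriver.class);
    jobConf.set("my.field.separator", "\t");

    // New API: a plain Configuration, plus helper methods on Job.
    Configuration conf = new Configuration();
    conf.set("my.field.separator", "\t");
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);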

Output file names
In the old API, output files for both map and reduce outputs are named part-nnnnn.
In the new API, map outputs are named part-m-nnnnn and reduce outputs are named part-r-nnnnn.
nnnnn is an integer designating the part number, starting from 00000.

Interruptible
User-overridable methods in the new API are declared to throw java.lang.InterruptedException.
This enables the code to be responsive to interrupts so that the framework can gracefully cancel long-running operations if required.
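The declarations in question, as a compilable skeleton (the hook signatures match the new-API Mapper; the class name is invented):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Every user-overridable hook declares InterruptedException, so a
    // blocking call inside it can be cancelled by the framework.
    public class InterruptAwareMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {
      @Override
      protected void setup(Context context)
          throws IOException, InterruptedException { }
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException { }
      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException { }
    }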

Iterator
In the new API, the reduce() method passes values as a java.lang.Iterable.
In the old API it is a java.util.Iterator.
This makes it easier to iterate over the values using Java's for-each loop construct:
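A sketch of a new-API reducer using for-each over the Iterable (a sum reducer, in the style of the standard word-count example):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values,
          Context context) throws IOException, InterruptedException {
        int sum = 0;
        // Iterable, not Iterator: values can be consumed with for-each.
        for (IntWritable value : values) {
          sum += value.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }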
