
Apache Hadoop: A course for undergraduates


Lecture 4


Delving Deeper into the Hadoop API


Chapter 4.1


Delving Deeper into the Hadoop API


Using the ToolRunner class
Decreasing the amount of intermediate data with Combiners
Setting up and tearing down Mappers and Reducers using the setup and cleanup methods
How to access HDFS programmatically
How to use the distributed cache
How to use the Hadoop API's library of Mappers, Reducers, and Partitioners


Chapter Topics
Delving Deeper into the Hadoop API
Using the ToolRunner Class
Setting Up and Tearing Down Mappers and Reducers
Decreasing the Amount of Intermediate Data with Combiners
Accessing HDFS Programmatically
Using the Distributed Cache
Using the Hadoop API's Library of Mappers, Reducers and Partitioners


Why Use ToolRunner?


You can use ToolRunner in MapReduce driver classes
This is not required, but is a best practice
ToolRunner uses the GenericOptionsParser class internally
Allows you to specify configuration options on the command line
Also allows you to specify items for the Distributed Cache on the command line (see later)


How to Implement ToolRunner: Complete Driver


import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(exitCode);
  }

  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf(
          "Usage: %s [generic options] <input dir> <output dir>\n",
          getClass().getSimpleName());
      return -1;
    }

    Job job = new Job(getConf());
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }
}


How to Implement ToolRunner: Imports


import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {
  // remainder of the driver as shown on the previous page
}

Import the relevant classes. We omit the import statements in future slides for brevity.


How to Implement ToolRunner: Driver Class Definition


The driver class implements the Tool interface and extends the Configured class.

public class WordCount extends Configured implements Tool {

  public static void main(String[] args) throws Exception { ... }

  public int run(String[] args) throws Exception { ... }
}


How to Implement ToolRunner: Main Method


The driver main method calls ToolRunner.run.

public class WordCount extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(),
                                  new WordCount(), args);
    System.exit(exitCode);
  }

  public int run(String[] args) throws Exception { ... }
}


How to Implement ToolRunner: Run Method


The driver run method creates, configures, and submits the job.

public class WordCount extends Configured implements Tool {

  public static void main(String[] args) throws Exception { ... }

  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf(
          "Usage: %s [generic options] <input dir> <output dir>\n",
          getClass().getSimpleName());
      return -1;
    }

    Job job = new Job(getConf());
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    ...
  }
}


ToolRunner Command Line Options


ToolRunner allows the user to specify configuration options on the command line
Commonly used to specify Hadoop properties using the -D flag
Will override any default or site properties in the configuration
But will not override those set in the driver code

$ hadoop jar myjar.jar MyDriver \
    -D mapred.reduce.tasks=10 myinputdir myoutputdir

Note that -D options must appear before any additional program arguments
Can specify an XML configuration file with -conf
Can specify the default filesystem with -fs uri
Shortcut for -D fs.default.name=uri
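
For example, the same job could be pointed at a different configuration file or default filesystem; the file name and URI below are placeholders, shown only to illustrate the flags:

$ hadoop jar myjar.jar MyDriver -conf myconf.xml myinputdir myoutputdir
$ hadoop jar myjar.jar MyDriver -fs hdfs://namenode/ myinputdir myoutputdir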


Chapter Topics
Delving Deeper into the Hadoop API
Using the ToolRunner Class
Setting Up and Tearing Down Mappers and Reducers
Decreasing the Amount of Intermediate Data with Combiners
Accessing HDFS Programmatically
Using the Distributed Cache
Using the Hadoop API's Library of Mappers, Reducers and Partitioners


The setup Method


It is common to want your Mapper or Reducer to execute some code before the map or reduce method is called for the first time
Initialize data structures
Read data from an external file
Set parameters
The setup method is run before the map or reduce method is called for the first time

public void setup(Context context)
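
For example, a Mapper might build a lookup structure once per task rather than once per record. A minimal sketch (the class name StopWordMapper and the stop-word list are purely illustrative):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StopWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private Set<String> stopWords;

  @Override
  public void setup(Context context) {
    // Runs once per task, before the first call to map()
    stopWords = new HashSet<String>();
    stopWords.add("the");
    stopWords.add("a");
    stopWords.add("an");
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String word : value.toString().toLowerCase().split("\\W+")) {
      if (!word.isEmpty() && !stopWords.contains(word)) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}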


The cleanup Method


Similarly, you may wish to perform some action(s) after all the records have been processed by your Mapper or Reducer
The cleanup method is called before the Mapper or Reducer terminates

public void cleanup(Context context) throws IOException, InterruptedException
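
For example, a Mapper could accumulate counts in memory during map and emit them only once, in cleanup. A sketch under that assumption (the class name InMapperCountMapper is illustrative):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private Map<String, Integer> counts = new HashMap<String, Integer>();

  @Override
  public void map(LongWritable key, Text value, Context context) {
    // Accumulate counts in memory instead of writing one pair per word
    for (String word : value.toString().split("\\W+")) {
      if (!word.isEmpty()) {
        Integer n = counts.get(word);
        counts.put(word, n == null ? 1 : n + 1);
      }
    }
  }

  @Override
  public void cleanup(Context context) throws IOException, InterruptedException {
    // Runs once, after the last call to map(); emit the accumulated totals
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
    }
  }
}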


Passing Parameters
public class MyDriverClass {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("paramname", value);
    Job job = new Job(conf);
    ...
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}

public class MyMapper extends Mapper {

  public void setup(Context context) {
    Configuration conf = context.getConfiguration();
    int myParam = conf.getInt("paramname", 0);
    ...
  }

  public void map...
}


Chapter Topics
Delving Deeper into the Hadoop API
Using the ToolRunner Class
Setting Up and Tearing Down Mappers and Reducers
Decreasing the Amount of Intermediate Data with Combiners
Accessing HDFS Programmatically
Using the Distributed Cache
Using the Hadoop API's Library of Mappers, Reducers and Partitioners


The Combiner
Often, Mappers produce large amounts of intermediate data
That data must be passed to the Reducers
This can result in a lot of network traffic
It is often possible to specify a Combiner
Like a mini-Reducer
Runs locally on a single Mapper's output
Output from the Combiner is sent to the Reducers
Combiner and Reducer code are often identical
Technically, this is possible if the operation performed is commutative and associative
Input and output data types for the Combiner/Reducer must be identical


The Combiner
Combiners run as part of the Map phase
Output from the Combiners is passed to the Reducers

[Diagram: each input block is read by an Input Format and processed by a Mapper; the Mapper's output passes through a Combiner and then a Partitioner, and the Shuffle and Sort phase delivers the combined, partitioned output to the Reducers]

WordCount Revisited
[Diagram: Node 1 reads "the cat sat on the mat" and Node 2 reads "the aardvark sat on the sofa"; each Mapper emits one (word, 1) pair per word, so "the" is emitted twice on each node; every individual (word, 1) pair is sent across the network to the Reducers, which sum the values for aardvark, cat, mat, on, sat, sofa, and the]

WordCount With Combiner


[Diagram: the same two nodes, but each Mapper's output now passes through a Combiner before the shuffle; on each node the two (the, 1) pairs are combined into a single (the, 2) pair, so the Reducers receive one pre-summed value per word per node instead of every individual (word, 1) pair]

Writing a Combiner
The Combiner uses the same signature as the Reducer
Takes in a key and a list of values
Outputs zero or more (key, value) pairs
The actual method called is the reduce method in the class

reduce(inter_key, [v1, v2, ...]) → (result_key, result_value)
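
For example, the SumReducer used with WordCount has exactly this shape, which is why it can double as a Combiner. A minimal sketch of such a class:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum the counts for this key; summing is commutative and associative,
    // so running this class as a Combiner does not change the final result
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    context.write(key, new IntWritable(wordCount));
  }
}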


Combiners and Reducers


Some Reducers may be used as Combiners
If the operation is associative and commutative, e.g., SumReducer

(foo, [2, 3, 5, 6]):
Combiners compute Sum(2,3) = 5 and Sum(5,6) = 11; the Reducer then computes Sum(5,11) = 16,
which is the same as Sum(2,3,5,6) = 16

Some Reducers cannot be used as a Combiner, e.g., AverageReducer

(bar, [5, 5, 5, 8, 10]):
Combiners compute Avg(5,5,5) = 5 and Avg(8,10) = 9; the Reducer then computes Avg(5,9) = 7,
which is not the same as Avg(5,5,5,8,10) = 6.6

Specifying a Combiner
Specify the Combiner class to be used in your MapReduce code in the driver
Use the setCombinerClass method, e.g.:

job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setCombinerClass(SumReducer.class);

Input and output data types for the Combiner and the Reducer for a job must be identical
VERY IMPORTANT: The Combiner may run once, or more than once, on the output from any given Mapper
Do not put code in the Combiner which could influence your results if it runs more than once

Chapter Topics
Delving Deeper into the Hadoop API
Using the ToolRunner Class
Setting Up and Tearing Down Mappers and Reducers
Decreasing the Amount of Intermediate Data with Combiners
Accessing HDFS Programmatically
Using the Distributed Cache
Using the Hadoop API's Library of Mappers, Reducers and Partitioners


Accessing HDFS Programmatically


In addition to using the command-line shell, you can access HDFS programmatically
Useful if your code needs to read or write side data in addition to the standard MapReduce inputs and outputs
Or for programs outside of Hadoop which need to read the results of MapReduce jobs
Beware: HDFS is not a general-purpose filesystem!
Files cannot be modified once they have been written, for example
Hadoop provides the FileSystem abstract base class
Provides an API to generic file systems
Could be HDFS
Could be your local file system
Could even be, for example, Amazon S3


The FileSystem API (1)


In order to use the FileSystem API, retrieve an instance of it

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

The conf object has read in the Hadoop configuration files, and therefore knows the address of the NameNode
A file in HDFS is represented by a Path object

Path p = new Path("/path/to/my/file");


The FileSystem API (2)


Some useful API methods:
FSDataOutputStream create(...)
Extends java.io.DataOutputStream
Provides methods for writing primitives, raw bytes, etc.
FSDataInputStream open(...)
Extends java.io.DataInputStream
Provides methods for reading primitives, raw bytes, etc.
boolean delete(...)
boolean mkdirs(...)
void copyFromLocalFile(...)
void copyToLocalFile(...)
FileStatus[] listStatus(...)
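
For example, reading a file back through open() might look like this (a minimal sketch; the path is illustrative):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FSDataInputStream in = fs.open(new Path("/my/path/foo"));
// read some raw bytes
byte[] buffer = new byte[4096];
int bytesRead = in.read(buffer);
...
in.close();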


The FileSystem API: Directory Listing

Get a directory listing:
Path p = new Path("/my/path");
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileStatus[] fileStats = fs.listStatus(p);
for (int i = 0; i < fileStats.length; i++) {
Path f = fileStats[i].getPath();
// do something interesting
}


The FileSystem API: Writing Data

Write data to a file:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path p = new Path("/my/path/foo");
FSDataOutputStream out = fs.create(p, false);
// write some raw bytes
out.write(getBytes());
// write an int
out.writeInt(getInt());
...
out.close();


Chapter Topics
Delving Deeper into the Hadoop API
Using the ToolRunner Class
Setting Up and Tearing Down Mappers and Reducers
Decreasing the Amount of Intermediate Data with Combiners
Accessing HDFS Programmatically
Using the Distributed Cache
Using the Hadoop API's Library of Mappers, Reducers and Partitioners


The Distributed Cache: Motivation


A common requirement is for a Mapper or Reducer to need access to some side data
Lookup tables
Dictionaries
Standard configuration values
One option: read directly from HDFS in the setup method
Using the API seen in the previous section
Works, but is not scalable
The Distributed Cache provides an API to push data to all slave nodes
Transfer happens behind the scenes before any task is executed
Data is only transferred once to each node, rather than once per task
Note: Distributed Cache is read-only
Files in the Distributed Cache are automatically deleted from slave nodes when the job finishes

Using the Distributed Cache: The Difficult Way


Place the files into HDFS
Configure the Distributed Cache in your driver code

Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat"), conf);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), conf);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), conf);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), conf);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), conf);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), conf);

.jar files added with addFileToClassPath will be added to your Mapper's or Reducer's classpath
Files added with addCacheArchive will automatically be dearchived/decompressed


Using the Distributed Cache: The Easy Way

If you are using ToolRunner, you can add files to the Distributed Cache directly from the command line when you run the job
No need to copy the files to HDFS first
Use the -files option to add files

hadoop jar myjar.jar MyDriver -files file1,file2,file3,...

The -archives flag adds archived files, and automatically unarchives them on the destination machines
The -libjars flag adds JAR files to the classpath


Accessing Files in the Distributed Cache


Files added to the Distributed Cache are made available in your task's local working directory
Access them from your Mapper or Reducer the way you would read any ordinary local file

File f = new File("file_name_here");
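
For example, a Mapper could load a lookup file that was pushed out with -files into a map during setup. A minimal sketch, assuming a tab-separated file named lookup.dat (the class name and file name are illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

  private Map<String, String> lookup = new HashMap<String, String>();

  @Override
  public void setup(Context context) throws IOException {
    // The cached file appears in the task's local working directory
    BufferedReader reader = new BufferedReader(new FileReader("lookup.dat"));
    String line;
    while ((line = reader.readLine()) != null) {
      String[] fields = line.split("\t");
      lookup.put(fields[0], fields[1]);
    }
    reader.close();
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String match = lookup.get(value.toString().trim());
    if (match != null) {
      context.write(new Text(value.toString().trim()), new Text(match));
    }
  }
}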


Chapter Topics
Delving Deeper into the Hadoop API
Using the ToolRunner Class
Setting Up and Tearing Down Mappers and Reducers
Decreasing the Amount of Intermediate Data with Combiners
Accessing HDFS Programmatically
Using the Distributed Cache
Using the Hadoop API's Library of Mappers, Reducers and Partitioners


Reusable Classes for the New API


The org.apache.hadoop.mapreduce.lib.* packages contain a library of Mappers, Reducers, and Partitioners supporting the new API
Example classes:
InverseMapper: swaps keys and values
RegexMapper: extracts text based on a regular expression
IntSumReducer, LongSumReducer: add up all values for a key
TotalOrderPartitioner: reads a previously-created partition file and partitions based on the data from that file
Sample the data first to create the partition file
Allows you to partition your data into n partitions without hard-coding the partitioning information
Refer to the Javadoc for classes available in your version of CDH
Available classes vary greatly from version to version
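
For example, the library IntSumReducer can stand in for a hand-written sum Reducer (and Combiner) in a word-count job. A sketch of the relevant driver lines, in a run method like the one shown earlier (it assumes your Mapper emits Text keys and IntWritable values):

import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

...
job.setMapperClass(WordMapper.class);        // your own Mapper, emitting (Text, IntWritable)
job.setCombinerClass(IntSumReducer.class);   // library Reducer reused as a Combiner
job.setReducerClass(IntSumReducer.class);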


Key Points
Use the ToolRunner class to build drivers
Parses job options and configuration variables automatically
Override Mapper and Reducer setup and cleanup methods
Set up and tear down, e.g., reading configuration parameters
Combiners are mini-Reducers
Run locally on Mapper output to reduce data sent to Reducers
The FileSystem API lets you read and write HDFS files programmatically
The Distributed Cache lets you copy local files to worker nodes
Mappers and Reducers can access them directly as regular files
Hadoop includes a library of predefined Mappers, Reducers, and Partitioners


Bibliography
The following offer more information on topics discussed in this chapter
Combiners are discussed in TDG 3e on pages 33-36.
A table describing available filesystems in Hadoop is on pages 52-53 of TDG 3e.
The HDFS API is described in TDG 3e on pages 55-67.
Distributed Cache: see pages 289-295 of TDG 3e for more details.

