Hadoop in Practice
By Alex Holmes

Working with simple data formats such as log files is straightforward and supported in MapReduce. In this article, based on Chapter 3 of Hadoop in Practice, author Alex Holmes shows you how to work with ubiquitous data serialization formats such as XML and JSON.

Processing Common Serialization Formats


XML and JSON are industry-standard data interchange formats. Their ubiquity in our industry is evidenced by their heavy adoption in data storage and exchange. XML has existed since 1998 as a mechanism to represent data that is readable by machines and humans alike. It became a universal language for data exchange between systems and is employed by many standards today, such as SOAP and RSS, and is used as an open data format for products such as Microsoft Office.

Technique 1: MapReduce and XML


Our goal is to be able to use XML as a data source for a MapReduce job. We're going to assume that the XML documents that need to be processed are large and, as a result, we want to be able to process them in parallel with multiple mappers working on the same input file.


Problem


Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. Therefore, how do we work with a file format that's not inherently splittable, like XML?


Solution
MapReduce doesn't contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat. To showcase the XML InputFormat, let's write a MapReduce job that uses Mahout's XML InputFormat to read property names and values from Hadoop's configuration files. Our first step is to set up our job configuration.
conf.set("xmlinput.start", "<property>");          #1
conf.set("xmlinput.end", "</property>");           #2
job.setInputFormatClass(XmlInputFormat.class);     #3

#1 Defines the string form of the XML start tag. Our job is to take Hadoop config files as input, where each configuration entry uses the "property" tag.
#2 Defines the string form of the XML end tag.
#3 Sets the Mahout XML input format class.
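
For context, here is a rough sketch of where that configuration sits inside a complete driver. This is our own illustration, not the book's listing; the map-only setup and the input/output paths are assumptions, and imports are omitted as in the listings that follow.

// Hypothetical driver sketch (not from the book); assumes a map-only job
// whose mapper is the Map class shown later in this article.
Configuration conf = new Configuration();
conf.set("xmlinput.start", "<property>");
conf.set("xmlinput.end", "</property>");

Job job = new Job(conf);
job.setJarByClass(HadoopPropertyXMLMapReduce.class);
job.setMapperClass(Map.class);
job.setNumReduceTasks(0);                      // assumption: map-only for simplicity
job.setInputFormatClass(XmlInputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

FileInputFormat.addInputPath(job, new Path("core-site.xml"));
FileOutputFormat.setOutputPath(job, new Path("output"));
job.waitForCompletion(true);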

It quickly becomes apparent by looking at the code that Mahout's XML InputFormat is rudimentary; you need to tell it the exact sequence of start and end XML tags that will be searched for in the file. Looking at the source of the InputFormat confirms this:
private boolean next(LongWritable key, Text value)
    throws IOException {
  if (fsin.getPos() < end && readUntilMatch(startTag, false)) {
    try {
      buffer.write(startTag);
      if (readUntilMatch(endTag, true)) {
        key.set(fsin.getPos());
        value.set(buffer.getData(), 0, buffer.getLength());
        return true;
      }
    } finally {
      buffer.reset();
    }
  }
  return false;
}

Next, we need to write a Mapper to consume Mahout's XML input format. We're being supplied the XML element in Text form, so we'll need to use an XML parser to extract content from the XML.


public static class Map extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Mapper.Context context)
      throws IOException, InterruptedException {
    String document = value.toString();
    System.out.println("" + document + "");
    try {
      XMLStreamReader reader =
          XMLInputFactory.newInstance().createXMLStreamReader(
              new ByteArrayInputStream(document.getBytes()));
      String propertyName = "";
      String propertyValue = "";
      String currentElement = "";
      while (reader.hasNext()) {
        int code = reader.next();
        switch (code) {
          case START_ELEMENT:
            currentElement = reader.getLocalName();
            break;
          case CHARACTERS:
            if (currentElement.equalsIgnoreCase("name")) {
              propertyName += reader.getText();
            } else if (currentElement.equalsIgnoreCase("value")) {
              propertyValue += reader.getText();
            }
            break;
        }
      }
      reader.close();
      context.write(propertyName.trim(), propertyValue.trim());
    } catch (Exception e) {
      log.error("Error processing " + document + "", e);
    }
  }
}

Our Map is given a Text instance, which contains a String representation of the data between the start and end tags. In our code, we're using Java's built-in Streaming API for XML (StAX) parser to extract the key and value for each property and output them. If we run our MapReduce job against Cloudera's core-site.xml and cat the output, we'll see the output below.
$ hadoop fs -put $HADOOP_HOME/conf/core-site.xml core-site.xml
$ bin/run.sh com.manning.hip.ch3.xml.HadoopPropertyXMLMapReduce \
    core-site.xml output
$ hadoop fs -cat output/part*
fs.default.name hdfs://localhost:8020
hadoop.tmp.dir /var/lib/hadoop-0.20/cache/${user.name}
hadoop.proxyuser.oozie.hosts *
hadoop.proxyuser.oozie.groups *

This output shows that we have successfully worked with XML as an input serialization format with MapReduce! Not only that, we can support huge XML files, since the InputFormat supports splitting XML.

WRITING XML


Having successfully read XML, the next question is: how do we write XML? In our Reducer, we have callbacks that occur before and after our main reduce method is called, which we can use to emit the start and end tags.
public static class Reduce
    extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    context.write(new Text("<configuration>"), null);     #1
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    context.write(new Text("</configuration>"), null);    #2
  }

  private Text outputKey = new Text();

  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      outputKey.set(constructPropertyXml(key, value));     #3
      context.write(outputKey, null);                      #4
    }
  }

  public static String constructPropertyXml(Text name, Text value) {
    StringBuilder sb = new StringBuilder();
    sb.append("<property><name>").append(name)
      .append("</name><value>").append(value)
      .append("</value></property>");
    return sb.toString();
  }
}

#1 Uses the setup method to write the root element start tag.
#2 Uses the cleanup method to write the root element end tag.
#3 Constructs a child XML element for each key/value combination we get in the Reducer.
#4 Emits the XML element.
This could also be embedded in an OutputFormat.
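
As a rough illustration of that idea, the sketch below moves the root-tag handling into a custom OutputFormat. The XmlOutputFormat class and its details are our own assumption, not code from the book (imports omitted, as in the listings above).

// Hypothetical OutputFormat that writes the root element around the records,
// so the Reducer no longer needs the setup/cleanup callbacks above.
public class XmlOutputFormat extends FileOutputFormat<Text, NullWritable> {

  @Override
  public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job)
      throws IOException, InterruptedException {
    Path file = getDefaultWorkFile(job, ".xml");
    FileSystem fs = file.getFileSystem(job.getConfiguration());
    final FSDataOutputStream out = fs.create(file, false);
    out.writeBytes("<configuration>\n");                 // root start tag
    return new RecordWriter<Text, NullWritable>() {
      @Override
      public void write(Text key, NullWritable value) throws IOException {
        out.writeBytes(key.toString() + "\n");           // one <property> element per record
      }

      @Override
      public void close(TaskAttemptContext context) throws IOException {
        out.writeBytes("</configuration>\n");            // root end tag
        out.close();
      }
    };
  }
}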

PIG
If you want to work with XML in Pig, the Piggybank library (a user-contributed library of useful Pig code) contains an XMLLoader. It works in a similar way to our technique: it captures all of the content between a start and end tag and supplies it as a single bytearray field in a Pig tuple.

HIVE
Currently, there doesn't seem to be a way to work with XML in Hive. You would have to write a custom SerDe[1].

Discussion
Mahout's XML InputFormat certainly helps you work with XML. However, it's very sensitive to an exact string match of both the start and end element names. If the element tag can contain attributes with variable values, or the generation of the element can't be controlled and could result in XML namespace qualifiers being used, then this approach may not work for you. Also problematic are situations where the element name you specify is used as a descendant child element.
If you have control over how the XML is laid out in the input, this exercise can be simplified by having a single XML element per line. This will let you use the built-in MapReduce text-based InputFormats (such as TextInputFormat), which treat each line as a record and split accordingly to preserve that demarcation.
Another option worth considering is a preprocessing step, where you could convert the original XML into a separate line per XML element, or convert it into an altogether different data format, such as a SequenceFile or Avro, both of which solve the splitting problem for you.
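
For example, a small standalone preprocessing step could copy XML records into a SequenceFile. The sketch below is our own illustration, not the book's code; it assumes the input already has one XML element per line, the file names are made up, and imports are omitted as in the listings above.

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Write each line of XML (one element per line) as the value of a SequenceFile record.
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
    new Path("xml-records.seq"), LongWritable.class, Text.class);
BufferedReader reader = new BufferedReader(new FileReader("input.xml"));
try {
  String line;
  long recordNum = 0;
  while ((line = reader.readLine()) != null) {
    writer.append(new LongWritable(recordNum++), new Text(line));
  }
} finally {
  reader.close();
  writer.close();
}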
There's a streaming class, StreamXmlRecordReader, that allows you to work with XML in your streaming code.
We have a handle on how to work with XML, so let's move on to tackle another popular serialization format, JSON. JSON shares the machine- and human-readable traits of XML and has existed since the early 2000s. It is less verbose than XML and doesn't have the rich typing and validation features available in XML.

Technique 2: MapReduce and JSON


Our technique covers how you can work with JSON in MapReduce. We'll also cover a method by which a JSON file can be partitioned for concurrent reads.

Problem
Figure 1 shows us the problem with using JSON in MapReduce. If you are working with large JSON files, you need to be able to split them. But, given a random offset in a file, how do we determine the start of the next JSON element, especially when working with JSON that has multiple hierarchies, such as in the example below?

Figure 1 Example of issue with JSON and multiple input splits

Solution
JSON is harder to partition into distinct segments than a format such as XML, because JSON doesn't have a token (like an end tag in XML) to denote the start or end of a record.
ElephantBird[2], an open-source project that contains some useful utilities for working with LZO compression, has an LzoJsonInputFormat, which can read JSON, but it requires that the input file be LZOP-compressed. We'll use this code as a template for our own JSON InputFormat, which doesn't have the LZOP compression requirement.

We're cheating with our solution because we're assuming that each JSON record is on a separate line. Our JsonRecordFormat is simple and does nothing other than construct and return a JsonRecordReader, so we'll skip over that code. The JsonRecordReader emits LongWritable, MapWritable key/value pairs to the Mapper, where the MapWritable is a map of JSON element names and their values. Let's take a look at how this RecordReader works. It leverages the LineRecordReader, which is a built-in MapReduce reader that emits a record for each line. To convert the line to a MapWritable, it uses the following method.
public static boolean decodeLineToJson(JSONParser parser, Text line,
                                       MapWritable value) {
  try {
    JSONObject jsonObj = (JSONObject) parser.parse(line.toString());
    for (Object key : jsonObj.keySet()) {
      Text mapKey = new Text(key.toString());
      Text mapValue = new Text();
      if (jsonObj.get(key) != null) {
        mapValue.set(jsonObj.get(key).toString());
      }

      value.put(mapKey, mapValue);
    }
    return true;
  } catch (ParseException e) {
    LOG.warn("Could not json-decode string: " + line, e);
    return false;
  } catch (NumberFormatException e) {
    LOG.warn("Could not parse field into number: " + line, e);
    return false;
  }
}

It uses the json-simple[3] parser to parse the line into a JSON object and then iterates over the keys, putting the keys and values into a MapWritable. The Mapper is given the JSON data in LongWritable, MapWritable pairs and can process the data accordingly. The code for the MapReduce job is very basic. We're going to demonstrate the code using the JSON below.
{
  "results" :
    [
      {
        "created_at" : "Thu, 29 Dec 2011 21:46:01 +0000",
        "from_user" : "grep_alex",
        "text" : "RT @kevinweil: After a lot of hard work by ..."
      },
      {
        "created_at" : "Mon, 26 Dec 2011 21:18:37 +0000",
        "from_user" : "grep_alex",
        "text" : "@miguno pull request has been merged, thanks again!"
      }
    ]
}

Since our technique assumes a JSON object per line, the actual JSON file we'll work with is shown below.


1. {"created_at" : "Thu, 29 Dec 2011 21:46:01 +0000","from_user" :


...
2. {"created_at" : "Mon, 26 Dec 2011 21:18:37 +0000","from_user" :
...

We'll copy the JSON file into HDFS and run our MapReduce code. Our MapReduce code simply writes each JSON key/value to the output.
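
That job isn't listed in this excerpt; a Mapper along the following lines would do the echoing, though the class name here is our own and not necessarily what the book's JsonMapReduce uses (imports omitted, as in the listings above).

// Hypothetical mapper: emit every JSON field name and value handed to us
// by the JsonRecordReader as a MapWritable.
public static class JsonMap
    extends Mapper<LongWritable, MapWritable, Text, Text> {
  @Override
  protected void map(LongWritable key, MapWritable value, Context context)
      throws IOException, InterruptedException {
    for (java.util.Map.Entry<Writable, Writable> entry : value.entrySet()) {
      context.write((Text) entry.getKey(), (Text) entry.getValue());
    }
  }
}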
$ hadoop fs -put test-data/ch3/singleline-tweets.json \
    singleline-tweets.json
$ bin/run.sh com.manning.hip.ch3.json.JsonMapReduce \
    singleline-tweets.json output
$ hadoop fs -cat output/part*
text RT @kevinweil: After a lot of hard work by ...
from_user grep_alex
created_at Thu, 29 Dec 2011 21:46:01 +0000
text @miguno pull request has been merged, thanks again!
from_user grep_alex
created_at Mon, 26 Dec 2011 21:18:37 +0000

WRITING JSON
An approach similar to what we looked at for writing XML could also be used to write JSON.
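
For instance, a Reducer could build a json-simple JSONObject per record and emit it as a single line of text. This sketch is our own, the field names are illustrative, and imports are omitted as in the listings above.

// Hypothetical reducer that writes one JSON object per line using json-simple.
public static class JsonReduce
    extends Reducer<Text, Text, Text, NullWritable> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      JSONObject json = new JSONObject();
      json.put("key", key.toString());          // illustrative field names
      json.put("value", value.toString());
      context.write(new Text(json.toJSONString()), NullWritable.get());
    }
  }
}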

PIG
ElephantBird contains a JsonLoader and an LzoJsonLoader, which can be used to work with JSON in Pig. These loaders also work with line-based JSON. Each Pig tuple contains a field for each JSON element in the line as a chararray.

HIVE
Hive contains a DelimitedJSONSerDe, which can serialize JSON but, unfortunately, not deserialize it, so you can't load data into Hive using this SerDe.

Discussion
Our solution works with the assumption that the JSON input is structured with one JSON object per line. How would we work with JSON objects that span multiple lines? The author has an experimental project on GitHub[4] that works with multiple input splits over a single JSON file. The key to this approach is searching for a specific JSON member and retrieving the containing object.
There's a Google Code project called hive-json-serde[5], which can support both serialization and deserialization.

Summary
As you can see, using XML and JSON in MapReduce is kludgy and comes with rigid requirements about how your data is laid out. Supporting them in MapReduce is complex and error-prone, since they don't naturally lend themselves to splitting. Alternative file formats, such as Avro and SequenceFiles, have built-in support for splittability.


If you would like to purchase Hadoop in Practice, DZone members can receive a 38% discount
by entering the Promotional Code: dzone38 during checkout at Manning.com.

[1] SerDe is a shortened form of Serializer/Deserializer, the mechanism that allows Hive to read and
write data in HDFS.
[2] https://github.com/kevinweil/elephant-bird
[3] http://code.google.com/p/json-simple/
[4] A multiline JSON InputFormat. https://github.com/alexholmes/json-mapreduce.
[5] http://code.google.com/p/hive-json-serde/

Source: http://www.manning.com/holmes/

Tags: Apache, big data, Hadoop, Open Source
