Hadoop in Practice
02.28.2012
Hadoop in Practice
By Alex Holmes
Working with simple data formats such as log files is straightforward and supported in MapReduce. In this article, based on Chapter 3 of Hadoop in Practice, author Alex Holmes shows you how to work with ubiquitous data serialization formats such as XML and JSON.
java.dzone.com/articles/hadooppractice
Problem
Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. Therefore, how do we work with a file format that's not inherently splittable, like XML?
Solution
MapReduce doesn't contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat. To showcase the XML InputFormat, let's write a MapReduce job that uses Mahout's XML InputFormat to read property names and values from Hadoop's configuration files. Our first step is to set up our job configuration.
conf.set("xmlinput.start", "<property>");         #1
conf.set("xmlinput.end", "</property>");          #2
job.setInputFormatClass(XmlInputFormat.class);    #3

#1 Defines the string form of the XML start tag. Our job is to take Hadoop config files as input, where each configuration entry uses the "property" tag.
#2 Defines the string form of the XML end tag.
#3 Sets the Mahout XML input format class.
It quickly becomes apparent by looking at the code that Mahout's XML InputFormat is rudimentary; you need to tell it an exact sequence of start and end XML tags that will be searched for in the file. Looking at the source of the InputFormat confirms this:
private boolean next(LongWritable key, Text value)
    throws IOException {
  if (fsin.getPos() < end && readUntilMatch(startTag, false)) {
    try {
      buffer.write(startTag);
      if (readUntilMatch(endTag, true)) {
        key.set(fsin.getPos());
        value.set(buffer.getData(), 0, buffer.getLength());
        return true;
      }
    } finally {
      buffer.reset();
    }
  }
  return false;
}
Next, we need to write a Mapper to consume Mahout's XML input format. We're being supplied the XML element in Text form, so we'll need to use an XML parser to extract content from the XML.
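The mapper listing itself isn't reproduced in this excerpt, but the StAX extraction it performs can be sketched with the JDK's built-in javax.xml.stream API. The class and method names below are illustrative placeholders, not from the book:

```java
import java.io.StringReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PropertyParser {

    // Pulls the <name> and <value> text out of a single
    // <property>...</property> fragment, returning {name, value}.
    public static String[] parseProperty(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            String name = null, value = null, current = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    current = reader.getLocalName();
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    if ("name".equals(current)) {
                        name = reader.getText();
                    } else if ("value".equals(current)) {
                        value = reader.getText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    current = null;
                }
            }
            return new String[] { name, value };
        } catch (XMLStreamException e) {
            throw new RuntimeException("Bad property fragment: " + xml, e);
        }
    }
}
```

Inside the real Mapper, this is the piece that turns the Text value handed over by the InputFormat into a property name and value to emit.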
Our Map is given a Text instance, which contains a String representation of the data between the start and end tags. In our code we're using Java's built-in Streaming API for XML (StAX) parser to extract the key and value for each property and output them. If we run our MapReduce job against Cloudera's core-site.xml and cat the output, we'll see the output below.
(output listing omitted)
This output shows that we have successfully worked with XML as an input serialization format with MapReduce! Not only that, we can support huge XML files, since the InputFormat supports splitting XML.
WRITING XML
java.dzone.com/articles/hadooppractice
3/9
10/17/12
Having successfully read XML, the next question is: how do we write XML? In our Reducer, we have callbacks that occur before and after our main reduce method is called, which we can use to emit the start and end tags.
public static class Reduce
    extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void setup(
      Context context)
      throws IOException, InterruptedException {
    context.write(new Text("<configuration>"), null);    #1
  }

  @Override
  protected void cleanup(
      Context context)
      throws IOException, InterruptedException {
    context.write(new Text("</configuration>"), null);   #2
  }

  private Text outputKey = new Text();

  public void reduce(Text key, Iterable<Text> values,
      Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      outputKey.set(constructPropertyXml(key, value));    #3
      context.write(outputKey, null);                     #4
    }
  }

  public static String constructPropertyXml(Text name, Text value) {
    StringBuilder sb = new StringBuilder();
    sb.append("<property><name>").append(name)
      .append("</name><value>").append(value)
      .append("</value></property>");
    return sb.toString();
  }
}

#1 Uses the setup method to write the root element start tag.
#2 Uses the cleanup method to write the root element end tag.
#3 Constructs a child XML element for each key/value combination we get in the Reducer.
#4 Emits the XML element.
This could also be embedded in an OutputFormat.
PIG
If you want to work with XML in Pig, the Piggybank library (a user-contributed library of useful Pig code) contains an XMLLoader. It works in a similar way to our technique and captures all of the content between a start and end tag, supplying it as a single bytearray field in a Pig tuple.
HIVE
Currently, there doesn't seem to be a way to work with XML in Hive. You would have to write a custom SerDe[1].
Discussion
Mahout's XML InputFormat certainly helps you work with XML. However, it's very sensitive to an exact string match of both the start and end element names. If the element tag can contain attributes with variable values, or the generation of the element can't be controlled and could result in XML namespace qualifiers being used, then this approach may not work for you. Also problematic will be situations where the element name you specify is used as a descendant child element.
If you have control over how the XML is laid out in the input, this exercise can be simplified by having a single XML element per line. This will let you use the built-in MapReduce text-based InputFormats (such as TextInputFormat), which treat each line as a record and split accordingly to preserve that demarcation.
Another option worth considering is a preprocessing step, where you could convert the original XML into a separate line per XML element, or convert it into an altogether different data format, such as a SequenceFile or Avro, both of which solve the splitting problem for you.
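As a rough sketch of that preprocessing step (a hypothetical helper, not part of Hadoop or the book), plain string scanning is enough to pull each matching element onto a single line:

```java
import java.util.ArrayList;
import java.util.List;

class XmlFlattener {

    // Extracts every startTag...endTag span from raw XML and collapses
    // internal line breaks, so each element can be written as one line
    // and consumed later by TextInputFormat.
    public static List<String> flatten(String xml, String startTag, String endTag) {
        List<String> records = new ArrayList<String>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(startTag, from);
            if (start < 0) {
                break;
            }
            int end = xml.indexOf(endTag, start);
            if (end < 0) {
                break;
            }
            String record = xml.substring(start, end + endTag.length());
            records.add(record.replaceAll("\\s*\\n\\s*", " "));
            from = end + endTag.length();
        }
        return records;
    }
}
```

Like the Mahout InputFormat, this matches tags by exact string, so the same caveats about attributes and namespaces apply.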
There's a streaming class, StreamXmlRecordReader, that allows you to work with XML in your streaming code.
We have a handle on how to work with XML, so let's move on to tackle another popular serialization format, JSON. JSON shares the machine- and human-readable traits of XML and has existed since the early 2000s. It is less verbose than XML, but it doesn't have the rich typing and validation features available in XML.
Problem
Figure 1 shows us the problem with using JSON in MapReduce. If you are working with large JSON files, you need to be able to split them. But, given a random offset in a file, how do we determine the start of the next JSON element, especially when working with JSON that has multiple hierarchies, such as in the example below?
Solution
JSON is harder to partition into distinct segments than a format such as XML, because JSON doesn't have a token (like an end tag in XML) to denote the start or end of a record.
ElephantBird[2], an open-source project that contains some useful utilities for working with LZO compression, has a LzoJsonInputFormat, which can read JSON, but it requires that the input file be LZOP compressed. We'll use this code as a template for our own JSON InputFormat, which doesn't have the LZOP compression requirement.
We're cheating with our solution because we're assuming that each JSON record is on a separate line. Our JsonRecordFormat is simple and does nothing other than construct and return a JsonRecordReader, so we'll skip over that code. The JsonRecordReader emits LongWritable, MapWritable key/value pairs to the Mapper, where the Map is a map of JSON element names and their values. Let's take a look at how this RecordReader works. It leverages the LineRecordReader, which is a built-in MapReduce reader that emits a record for each line. To convert the line to a MapWritable, it uses the following method.
public static boolean decodeLineToJson(JSONParser parser, Text line,
    MapWritable value) {
  try {
    JSONObject jsonObj = (JSONObject) parser.parse(line.toString());
    for (Object key : jsonObj.keySet()) {
      Text mapKey = new Text(key.toString());
      Text mapValue = new Text();
      if (jsonObj.get(key) != null) {
        mapValue.set(jsonObj.get(key).toString());
      }

      value.put(mapKey, mapValue);
    }
    return true;
  } catch (ParseException e) {
    LOG.warn("Could not json-decode string: " + line, e);
    return false;
  } catch (NumberFormatException e) {
    LOG.warn("Could not parse field into number: " + line, e);
    return false;
  }
}
It uses the json-simple[3] parser to parse the line into a JSON object and then iterates over the keys, putting the keys and values into a MapWritable. The Mapper is given the JSON data in LongWritable, MapWritable pairs and can process the data accordingly. The code for the MapReduce job is very basic. We're going to demonstrate the code using the JSON below.
{
  "results" :
  [
    {
      "created_at" : "Thu, 29 Dec 2011 21:46:01 +0000",
      "from_user" : "grep_alex",
      "text" : "RT @kevinweil: After a lot of hard work by ..."
    },
    {
      "created_at" : "Mon, 26 Dec 2011 21:18:37 +0000",
      "from_user" : "grep_alex",
      "text" : "@miguno pull request has been merged, thanks again!"
    }
  ]
}
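To try the per-line decoding without pulling in json-simple, here is a simplified stdlib-only stand-in (an illustrative sketch, not from ElephantBird or the book; it handles only flat JSON objects whose values are plain strings, like the individual records above):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlatJsonDecoder {

    // Matches "key" : "value" pairs; good enough for flat objects
    // whose values are plain strings with no embedded quotes.
    private static final Pattern PAIR =
            Pattern.compile("\"([^\"]+)\"\\s*:\\s*\"([^\"]*)\"");

    // Decodes one JSON object (on a single line) into an ordered map,
    // mimicking what decodeLineToJson builds in the MapWritable.
    public static Map<String, String> decodeLine(String line) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }
}
```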
Since our technique assumes a JSON object per line, the actual JSON file we'll work with puts each record from the results array on its own line.
We'll copy the JSON file into HDFS and run our MapReduce code. Our MapReduce code simply writes each JSON key/value to the output.
(output listing omitted)
WRITING JSON
An approach similar to what we looked at for writing XML could also be used to write JSON.
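As a sketch, mirroring the constructPropertyXml helper from the XML Reducer (the method name and output shape here are illustrative, not from the book), each reduce call could build one JSON object per line:

```java
class JsonPropertyWriter {

    // Builds one JSON object per key/value pair; the reducer would emit
    // the returned string as a single output line.
    public static String constructPropertyJson(String name, String value) {
        return "{\"name\":\"" + escape(name) + "\",\"value\":\"" + escape(value) + "\"}";
    }

    // Minimal escaping: backslashes and double quotes only.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Because each record stays on its own line, the resulting file can be re-read later with the same line-based JSON technique described above.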
PIG
ElephantBird contains a JsonLoader and an LzoJsonLoader, which can be used to work with JSON in Pig. They also work with line-based JSON. Each Pig tuple contains a field for each JSON element in the line, as a chararray.
HIVE
Hive contains a DelimitedJSONSerDe, which can serialize JSON but unfortunately not deserialize it, so you can't load data into Hive using this SerDe.
Discussion
Our solution works with the assumption that the JSON input is structured with a line per JSON object. How would we work with JSON objects that span multiple lines? The author has an experimental project on GitHub[4], which works with multiple input splits over a single JSON file. The key to this approach is searching for a specific JSON member and retrieving the containing object.
There's a Google Code project called hive-json-serde[5], which can support both serialization and deserialization.
Summary
As you can see, using XML and JSON in MapReduce is kludgy and has rigid requirements about how your data is laid out. Supporting them in MapReduce is complex and error-prone, as they don't naturally lend themselves to splitting. Alternative file formats, such as Avro and SequenceFiles, have built-in support for splittability.
If you would like to purchase Hadoop in Practice, DZone members can receive a 38% discount
by entering the Promotional Code: dzone38 during checkout at Manning.com.
[1] SerDe is a shortened form of Serializer/Deserializer, the mechanism that allows Hive to read and
write data in HDFS.
[2] https://github.com/kevinweil/elephant-bird
[3] http://code.google.com/p/json-simple/
[4] A multiline JSON InputFormat. https://github.com/alexholmes/json-mapreduce.
[5] http://code.google.com/p/hive-json-serde/
Source: http://www.manning.com/holmes/
Tags: Apache, big data, Hadoop, Open Source