
UNLOCKING THE BUSINESS BENEFITS IN BIG DATA

THE INS AND OUTS OF HARNESSING HADOOP
Hadoop clusters make it easier for organizations to process and analyze streams of big data. But there are limits to what the open source technology can do, plus implementation and management challenges.
BY JACK VAUGHAN AND ED BURNS

YARN SPINS NEW FLEXIBILITY
AVOID DASHED EXPECTATIONS
MORE TO IT THAN MORE NODES
NEED FOR ANALYTICS SPEED

INTRODUCTION

THE HADOOP DISTRIBUTED PROCESSING FRAMEWORK PRESENTS IT, DATA MANAGEMENT AND ANALYTICS TEAMS WITH NEW OPPORTUNITIES FOR PROCESSING, STORING AND USING DATA, PARTICULARLY IN BIG DATA APPLICATIONS.


But it also confronts them with new challenges as they look to deploy and work with Hadoop systems. And because Hadoop and the large number of open source technologies surrounding it are evolving quickly, organizations must be prepared for frequent changes, most immediately in the Hadoop 2 release.

The new version, which the Apache Software Foundation released in October 2013, will eventually take Hadoop far beyond its current core configuration, which combines the Hadoop Distributed File System (HDFS) with Java-based MapReduce programs. Early-adopter companies are using that pairing to help them deal with large amounts of transaction data, as well as server and network log files, sensor data, social media feeds, text documents, image files and other types of unstructured and semi-structured data.
Hadoop typically runs on clusters of commodity servers, resulting in relatively low data processing and storage costs. And because of its ability to handle data with very light structure, Hadoop applications can take advantage of new information sources that don't lend themselves to traditional databases, said Tony Cosentino, an analyst at Ventana Research.

But Cosentino added in an email that implementations of the existing Hadoop architecture are restricted by its batch-processing orientation, which makes it more akin to a truck than a sports car on performance. "Hadoop is ideally suited where time latency is not an issue and where significant amounts of data need to be processed," he said.
In its HDFS-MapReduce configuration, Hadoop is very good at analysis of very large, static unstructured data sets consisting of many terabytes or even petabytes of information, said William Bain, CEO of ScaleOut Software Inc., a vendor of data grid software. As an example, he cited a sentiment analysis application on a huge chunk of Twitter data aimed at discerning what customers are thinking (and tweeting) about a company or its products.
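For readers who haven't seen one, a minimal MapReduce job in Java gives a sense of the batch pattern Bain describes: a map phase that emits a count for every term in a tweet, and a reduce phase that totals the counts across the whole data set, a simpler cousin of full sentiment scoring. The class names and the /data/tweets and /data/mentions paths below are illustrative assumptions, not code from any company quoted here.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MentionCount {

    // Map phase: emit (term, 1) for every whitespace-separated token in a tweet.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text term = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                term.set(tokens.nextToken().toLowerCase());
                context.write(term, ONE);
            }
        }
    }

    // Reduce phase: total the counts for each term across the whole data set.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mention count");
        job.setJarByClass(MentionCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/tweets"));      // illustrative input
        FileOutputFormat.setOutputPath(job, new Path("/data/mentions"));  // illustrative output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The whole job runs to completion before anyone sees a result, which is exactly the batch quality both analysts point to.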
Like Cosentino, Bain emphasized that because of its batch nature and large startup overhead on processing jobs, Hadoop generally hasn't been useful in real-time analysis of live data sets, at least not as it's currently constituted. But some vendors have recently introduced query engines designed to support ad hoc analysis of Hadoop data.
Data warehousing applications involving large volumes of data are good targets for Hadoop uses, according to Sanjay Sharma, a principal architect at software development services provider Impetus Technologies Inc. How large? It varies, he said: Tens of terabytes is a sweet spot for Hadoop, but if there is great complexity to the unstructured data, it could be tens of gigabytes.

Some users, such as car-shopping information website operator Edmunds.com Inc., have deployed Hadoop and related technologies to replace their traditional data warehouses. But Hadoop clusters often are being positioned as landing pads and staging areas for the data gushing into organizations. In such cases, data can be pared down by MapReduce, transformed into or summarized in a relational structure and moved along to an enterprise data warehouse or data marts for analysis by business users and analytics professionals. That approach also provides increased flexibility: The raw data can be kept in a Hadoop system and modeled for analysis as needed, using extract, load and transform processes.
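One hedged sketch of that keep-the-raw-data, summarize-on-demand pattern uses Hive's JDBC interface (HiveServer2), which lets a Java program run a SQL-style aggregation over files stored in Hadoop and hand the much smaller result to a warehouse or data mart. The server address and the raw_clicks table and column names are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClickSummary {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hadoop-gateway:10000/default", "etl", "");
             Statement stmt = conn.createStatement()) {
            // Boil raw click logs down to a daily summary small enough
            // to load into a relational warehouse or data mart.
            ResultSet rs = stmt.executeQuery(
                "SELECT to_date(event_time) AS day, page, COUNT(*) AS hits " +
                "FROM raw_clicks GROUP BY to_date(event_time), page");
            while (rs.next()) {
                System.out.printf("%s\t%s\t%d%n",
                        rs.getString("day"), rs.getString("page"), rs.getLong("hits"));
            }
        }
    }
}

The raw logs never leave HDFS; only the aggregate moves downstream, and the query can be rewritten whenever the analysis changes.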

Sharma describes such implementations as a "data lake" for downstream processing. Colin White, president of consultancy BI Research, uses the term "business refinery." In a report released in February 2013, Gartner Inc. analysts Mark Beyer and Ted Friedman wrote that using Hadoop to collect and prepare data for analysis in a data warehouse was the most-cited strategy for supporting big data analytics applications in a survey conducted by the research and consulting company. An even 50% of the 272 respondents said their organizations planned to do so during the next 12 months.


The vibrancy of the open source ecosystem that surrounds Hadoop can hardly be overstated. From its earliest days, Hadoop has attracted software developers looking to create add-on tools to fill in gaps in its functionality. For example, there are HBase, Hive and Pig: respectively, a distributed database, a SQL-style data warehouse and a high-level language for developing data analysis programs in MapReduce. Other supporting actors that have become Hadoop subprojects or Apache projects in their own right include Ambari, for provisioning, managing and monitoring Hadoop clusters; Cassandra, a NoSQL database; and ZooKeeper, which maintains configuration data and synchronizes distributed operations across clusters.
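As a taste of those add-on tools, here is a minimal sketch of writing and reading a row through HBase's Java client, using the 0.9x-era API that was current when this article appeared; the table name, column family, qualifier and row key are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath for cluster addresses.
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "metrics");

        // Write one cell: row key, column family, qualifier, value.
        Put put = new Put(Bytes.toBytes("sensor-42"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
        table.put(put);

        // Read the cell back by row key.
        Result row = table.get(new Get(Bytes.toBytes("sensor-42")));
        System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"))));

        table.close();
    }
}

Unlike a raw MapReduce job, reads and writes like these return in milliseconds, which is why HBase so often sits beside Hadoop in the architectures described later in this article.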

YARN SPINS NEW FLEXIBILITY

And now Hadoop 2 is entering the picture. Central to the update is YARN, an overhauled resource manager that enables applications other than MapReduce programs to work with HDFS. By doing so, YARN (short, good-naturedly, for Yet Another Resource Negotiator) is meant to free Hadoop from its reliance on batch processing while still providing backward compatibility with existing application programming interfaces.
FIGURE 1: MINORITY GROUP
Hadoop and MapReduce might be all the buzz, but the percentage of organizations choosing not to deploy the software might surprise you.
In use now: 9%
Plan to add: 29%
No current plans to add: 62%
BASED ON RESPONSES FROM 387 IT, BUSINESS INTELLIGENCE, ANALYTICS AND BUSINESS PROFESSIONALS IN ORGANIZATIONS WITH DATA WAREHOUSES INSTALLED OR UNDER DEVELOPMENT; SOURCE: TECHTARGET'S 2013 ANALYTICS & DATA WAREHOUSING READER SURVEY

"YARN is the key difference for Hadoop 2.0," Cosentino said, using the release's original name. Instead of letting a MapReduce job see itself as the only tenant on HDFS, he added, it allows for multiple workloads to run concurrently. One early example comes from Yahoo, which has implemented the Storm complex event processing software on top of YARN to aid in funneling data about the activities of website users into a Hadoop cluster.
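A rough sketch of what that opens up: with Hadoop 2's yarn-client library, any Java program can ask the ResourceManager to host an application master for it, MapReduce or not. The application name, launch command and memory figure below are illustrative placeholders, not the mechanics of Yahoo's Storm deployment.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for a new application slot.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("non-mapreduce-demo");

        // Describe the application master container: the command it runs
        // ("my-app-master.sh" is a placeholder) and the memory it needs.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("./my-app-master.sh"));
        ctx.setAMContainerSpec(amContainer);

        Resource capability = Records.newRecord(Resource.class);
        capability.setMemory(512);
        ctx.setResource(capability);

        // Hand the application to YARN; from here the scheduler takes over.
        yarnClient.submitApplication(ctx);
    }
}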
Hadoop 2 also is due to bring high availability improvements, through a new feature that enables users to create a federated name (or master) node architecture in HDFS instead of relying on a single node to control an entire cluster. Meanwhile, commercial vendors are brewing up additional management-tool elixirs (new job schedulers and cluster provisioning software, for example) in an effort to further boost Hadoop's enterprise readiness.
Hadoop use still isn't widespread. In a 2013 survey of TechTarget readers on business intelligence, analytics and data warehousing technologies, the percentage of active Hadoop and MapReduce users was still in the single digits, and nearly two-thirds of respondents said their organizations had no current plans to deploy the technologies (see Figure 1). Even in companies with big data programs in place or planned, Hadoop ranked fourth on the list of technologies being used or eyed to help underpin the initiatives (see Figure 2).

Because Hadoop is novel to most users, deploying it can present unfamiliar challenges to project teams, especially if they don't have experience with open source software or parallel processing on distributed clusters. Even seasoned IT hands may find surprises in working with Hadoop, for much assembly typically is required.

FIGURE 2: DOWN THE LIST
Hadoop isn't the first choice among the top technologies being used by organizations to support their big data environments.
Mainstream relational databases or data warehouses: 55%
Specialized analytical databases: 52%
Data warehouse appliances: 46%
Hadoop clusters: 41%
BASED ON RESPONSES FROM 222 IT, BUSINESS INTELLIGENCE, ANALYTICS AND BUSINESS PROFESSIONALS IN ORGANIZATIONS WITH ACTIVE OR PLANNED BIG DATA MANAGEMENT AND ANALYTICS PROGRAMS; RESPONDENTS WERE ASKED TO CHOOSE ALL TECHNOLOGIES THAT APPLIED; SOURCE: TECHTARGET'S 2013 ANALYTICS & DATA WAREHOUSING READER SURVEY

AVOID DASHED EXPECTATIONS

IT managers and corporate executives might look at what the large Internet companies that first honed Hadoop are doing with it and see a chance to do bigger systems at less cost, said Ofir Manor, a product manager and technical architect at Gene by Gene Ltd., a Houston-based genetic testing services company. But Manor, who also writes a blog on data technologies, added that those expectations can be difficult to meet.

"It's relatively easy to do a small Hadoop implementation and try it out," he said. "Playing with the technology can be fun. But to move it to the infrastructure level is hard." In addition to the technical challenges, another issue Manor cited is that IT operations often work in silos, with separate teams handling systems administration, database administration, storage, networking, security, application development and so on. That approach can lead to problems in managing Hadoop clusters, he warned: Hadoop requires more teamwork than usual, and enterprises may fall into a "which team owns the platform?" debate.

Navigating the open source software culture can be a hurdle for some companies, too. The commercial distributions of Hadoop offered by a variety of IT vendors do help simplify the process of rolling out and supporting the software. But Manor said organizations have to ask themselves if they're ready and willing to commit their own developers to involvement in the Hadoop community, which can aid in efforts to take full advantage of the technology.
Successfully implementing Hadoop requires first coming to terms with the process of setting up the computer cluster that will run the software. And while clusters are usually built around low-cost and easy-to-use servers, there are numerous configuration settings and issues to work through up front.

"Hadoop is a very complex environment. There are a lot of moving parts," said Douglas Moore, a consultant at Think Big Analytics, a consulting and development services provider that focuses on big data deployments. Moore said a Hadoop implementation team needs to make sure the size and overall design of its system are sufficient to handle the pipeline of data that will be fed into the cluster. Job scheduling routines and the performance of disk drives and other hardware components can also factor into the Hadoop cluster performance equation.

For example, RAID Level 0 striping of data across a disk array, typically turned on by default in Hadoop systems, can shackle I/O speeds to the rate of the slowest drive in the array. In addition, a single disk failure can take down an entire array and temporarily knock all of a cluster node's data offline. As a result, various Hadoop vendors and consultants recommend configuring the disks in a cluster as separate devices or limiting RAID striping to pairs of disks.
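In Hadoop 2 terms, the separate-devices approach means listing each physical drive as its own data directory in hdfs-site.xml rather than presenting one striped volume; a hedged sketch, with illustrative mount points:

<!-- A sketch of a JBOD-style layout: each disk is its own data
     directory, so one failed drive costs one directory, not the node.
     The mount points below are illustrative assumptions. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data,/mnt/disk3/dfs/data</value>
</property>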

HADOOP BREAKS FREE

Hadoop initially was the province of the large Internet companies that created it, and the likes of eBay, Facebook, LinkedIn, Twitter and Yahoo remain marquee users of the technology. But the number of other types of organizations that are looking to ride the Hadoop surge is growing.

NASA is using Hadoop to make climate data available via a cloud-based service to researchers outside its walls. To help predict and improve crop yields, agricultural and chemical company Monsanto is loading geospatial data from internal and external sources into Hadoop for processing and then moving the files to HBase, its companion NoSQL database, for analysis. Data storage technology vendor NetApp uses Hadoop to cull log data from sensors to monitor the performance of its equipment at customer sites. Telecom service provider China Mobile Group Guangdong built a Hadoop-based system to support online bill payments and provide new data analytics capabilities internally.

Marketing and advertising analysis is another common application for Hadoop and related big data technologies. Edmunds.com, which publishes automobile pricing data and vehicle reviews online, deployed a combination of Hadoop and HBase to help business analysts fine-tune its paid-search marketing and keyword bidding processes. Retailer Kohl's plans to use Hadoop to enable business users to analyze store and website data. And Luminar, a company that analyzes data about Hispanic consumers in the U.S. for retailers, manufacturers and other clients, replaced a traditional data warehouse with a Hadoop system to power its analytical modeling.

But Colin White, president of consultancy BI Research, thinks Hadoop has the potential to spark new and innovative applications, not just to step in and take the place of traditional systems. "My concern is in seeing Hadoop used for a bunch of workloads in which it is reinventing the wheel," he said. "I'd rather see it moving in the direction of solving problems we haven't solved."

JACK VAUGHAN


Also, because Hadoop is so often combined with supporting software such as HBase and Hive, pinpointing the sources of performance problems can be, well, problematic. In working with clients to optimize cluster performance, Moore and his fellow consultants find that in many cases the first suspect isn't necessarily the culprit.

"We've been brought in for technology assessments by people who think they had an issue with HBase failing," he said. "But the fact is, the problem could be with how their workflow is set up, how they're rolling jobs into a cluster."


MORE TO IT THAN MORE NODES

The use of commodity servers makes it relatively inexpensive to add more nodes to a cluster. And with the fast-paced growth of Google, Twitter and other Web powerhouses, and the corresponding expansion of their data processing requirements, scaling out clusters as needed to boost performance became a common strategy.

But that approach isn't likely to fly in more traditional organizations, said Vin Sharma, director of product marketing for Hadoop at Intel Corp. "It's true that 'throw another node at it' may have become a mantra at fast-growing Web monsters, but it won't be repeated in the typical enterprise," Sharma said. Instead, he expects to see a focus on troubleshooting performance problems. Doing so in a Hadoop cluster, though, is more complicated than in the average system, he said, and it requires expertise that not every organization has in-house.
The first order of business once a cluster is set up, according to Sharma, is to deploy performance monitoring tools to help identify bottlenecks. He also recommends checking MapReduce applications to ensure that they've been designed for optimal performance on a cluster. "If [an application] requires a lot of network communication, it may not be a good fit."
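One concrete version of that advice is adding a combiner, which pre-aggregates map output on each node so far less data crosses the network during the shuffle. The sketch below retunes the illustrative MentionCount job shown earlier and assumes its mapper and reducer classes are on the classpath; a combiner must be safe to apply repeatedly, which a pure summing reducer is.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MentionCountTuned {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mention count, combined");
        job.setJarByClass(MentionCountTuned.class);
        job.setMapperClass(MentionCount.TokenMapper.class);
        // The combiner sums counts locally on each map node, shrinking
        // the data shuffled across the network to the reducers.
        job.setCombinerClass(MentionCount.SumReducer.class);
        job.setReducerClass(MentionCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/tweets"));      // illustrative input
        FileOutputFormat.setOutputPath(job, new Path("/data/mentions"));  // illustrative output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}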
Hadoop itself might not be the right choice to begin with: The high-fevered interest in the technology shouldn't obscure the fact that it's not the best option for every application, Cosentino cautioned. "Don't think about technology first," he said. "Think first about the business problem you're trying to solve, because you may not even need a Hadoop cluster."
And while it's tempting, the inclination to follow the lead of the Internet giants down the Hadoop path shouldn't be an absolute, Manor said, noting that the needs of those companies and other types of businesses are often different. "The tools to solve the [online] scalability issue are not always a good fit for enterprise challenges," he said. One particular case in point: real-time analytics applications involving ad hoc querying of Hadoop data. Hadoop is optimized to crunch through large data sets, but its batch-processing power doesn't equate to data analysis speed.
NEED FOR ANALYTICS SPEED

And Jan Gelin, vice president of technical operations at Rubicon Project, said analytics speed is something that the online advertising broker needs, badly. The company, based in Playa Vista, Calif., offers a platform for advertisers to use in bidding for ad space on webpages as Internet users visit the pages. The system allows the advertisers to see information about website visitors before making bids, in order to ensure that ads will only be seen by interested consumers. Gelin said the process involves a lot of analytics, and it all has to happen in fractions of a second.
Rubicon leans heavily on Hadoop to help power the ad-bidding platform. The key, Gelin said, is to pair it with other technologies that can handle true real-time analysis. Like Yahoo, Rubicon uses the Storm processing engine to capture and quickly analyze large amounts of data as part of the ad bidding process. Storm then sends the data into a cluster running MapR Technologies Inc.'s Hadoop distribution. The Hadoop cluster is primarily used to transform the data to prepare it for more traditional analytical applications, such as business intelligence reporting. Even for that stage, though, much of the information is loaded into a Greenplum analytical database for access by users.
Gelin said the sheer volume of data that Rubicon produces on a daily basis meant it would need a system capable of processing all the information. That's where Hadoop comes in. But, he added, "you can't take away the fact that Hadoop is a batch-processing system. There's other things on top of Hadoop you can play around with that are actually like real real-time."
Several Hadoop vendors are trying to eliminate the real-time analytics restrictions. Cloudera Inc. got the ball rolling in April 2013 by releasing its Impala query engine, promising the ability to run interactive SQL queries against Hadoop data in near real time. Pivotal, a data management and analytics spinoff from EMC Corp. and its VMware subsidiary, followed three months later with a similar query engine named Hawq. Also looking to get in the game is Splunk Inc., which focuses on capturing streams of machine-generated data; it began beta-testing a Hadoop data analysis tool called Hunk in June 2013.
Hadoop 2 also aids the cause by opening up Hadoop systems to non-MapReduce applications. With all the new tools and capabilities, Hadoop may soon be up to the real-time challenge, said Mike Gualtieri, an analyst at Forrester Research Inc. One big factor working in its favor, he added, is that vendors as well as Hadoop users are determined to make the technology function in real or near real time for analytics applications.

"Hadoop is fundamentally a batch operation environment," Gualtieri said. "However, because of the distributed architecture and because a lot of use cases have to do with putting data into Hadoop, a lot of vendors or even the end users are saying, 'Hey, why can't we do more real-time or ad hoc queries against Hadoop,' and it's a good question."
Gualtieri sees two main challenges. First, he said, most of the new Hadoop query engines still aren't as fast as running queries against mainstream relational databases. Tools like Impala and Hawq provide interfaces that enable end users to write queries in the SQL programming language. The queries then get translated into MapReduce for execution on a Hadoop cluster, but that process is inherently slower than running a SQL query directly against a relational database, according to Gualtieri.
The second challenge that Gualtieri sees is that Hadoop currently is a read-only system once data has been written into HDFS. Users can't easily insert, delete or modify individual pieces of data stored in the file system like they can in a relational database, he said. While the challenges are real, Gualtieri thinks they can be overcome. For example, Hadoop 2 includes a capability for appending data to HDFS files.
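A hedged sketch of that append capability through the HDFS FileSystem API, assuming a Hadoop 2 cluster with append support enabled; the file path and record are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Append one record to a file that already exists in HDFS; the
        // call fails on clusters where append support is disabled.
        try (FSDataOutputStream out = fs.append(new Path("/data/events.log"))) {
            out.writeBytes("new-event,2013-10-15\n");
        }
    }
}

Appending is still a far cry from the row-level inserts, updates and deletes of a relational database, which is Gualtieri's larger point.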
Gartner analyst Nick Heudecker wrote in an email that even though the new query engines don't support true real-time analytics functionality, they do enable end users with less technical expertise to access and analyze data stored in Hadoop. That can decrease the cycle time and cost associated with running Hadoop analytics jobs because MapReduce developers no longer have to write queries, he said.

Organizations will have to decide for themselves whether that's enough of a justification for deploying such tools. Despite all the hype, Hadoop isn't a magic bullet, said Patricia Gorla, a consultant at IT services provider OpenSource Connections LLC. What's important, Gorla said, is finding the best fit for the technology, and not trying to force-fit it into a systems architecture where it doesn't belong. "Hadoop is good at what it's good at and not at what it's not," she said.


ABOUT THE AUTHORS

JACK VAUGHAN is site editor of SearchDataManagement.com. He covers topics such as big data management, data warehousing, databases and data integration. Vaughan previously was an editor for TechTarget's SearchSOA.com, SearchVB.com, TheServerSide.net and SearchDomino.com websites. Email him at jvaughan@techtarget.com.

ED BURNS is site editor of SearchBusinessAnalytics.com; in that position, he covers business intelligence, analytics and data visualization technologies and topics. He previously was a news writer for TechTarget's SearchHealthIT.com website, and he has also written for a variety of daily and weekly newspapers in eastern Massachusetts. Email him at eburns@techtarget.com.

The Ins and Outs of Harnessing Hadoop is a SearchBusinessAnalytics.com e-publication.

Scot Petersen, Editorial Director
Jason Sparapani, Managing Editor, E-Publications
Joe Hebert, Associate Managing Editor, E-Publications
Craig Stedman, Executive Editor
Melanie Luna, Managing Editor
Mark Brunelli, News Director
Linda Koury, Director of Online Design
Neva Maniscalco, Graphic Designer
Doug Olender, Publisher, dolender@techtarget.com
Annie Matthews, Director of Sales, amatthews@techtarget.com

TechTarget Inc.
275 Grove Street, Newton, MA 02466
www.techtarget.com

© 2013 TechTarget Inc. No part of this publication may be transmitted or reproduced in any form or by any means without written permission from the publisher. TechTarget reprints are available through The YGS Group.

About TechTarget: TechTarget publishes media for information technology professionals. More than 100 focused websites enable quick access to a deep store of news, advice and analysis about the technologies, products and processes crucial to your job. Our live and virtual events give you direct access to independent expert commentary and advice. At IT Knowledge Exchange, our social community, you can get advice and share solutions with peers and experts.
