LARGE-SCALE INFRASTRUCTURES
How Adobe Systems achieved breakthrough results in
Big Data analytics with Hadoop-as-a-Service
ABSTRACT
Large-scale Apache Hadoop analytics have long eluded the industry, especially in virtualized
environments. In a ground-breaking proof of concept (POC), Adobe Systems demonstrated that Hadoop-as-a-Service (HDaaS) running on a virtualized and centralized infrastructure could handle large-scale data
analytics workloads. This white paper documents the POC's infrastructure design, initial obstacles, and
successful completion, as well as sizing and configuration details, and best practices. Importantly, the
paper also underscores how HDaaS built on an integrated and virtualized infrastructure delivers
outstanding performance, scalability, and efficiency, paving the path toward larger-scale Big Data
analytics in Hadoop environments.
December 2014
To learn more about how EMC products, services, and solutions can help solve your business and IT challenges, contact your local representative or authorized
reseller, visit www.emc.com, or explore and compare products in the EMC Store
Copyright © 2014 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.
The information in this publication is provided "as is." EMC Corporation makes no representations or warranties of any kind with respect to the information in
this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
VMware and vSphere are registered trademarks or trademarks of VMware, Inc. in the United States and/or other jurisdictions. All other trademarks used herein
are the property of their respective owners.
TABLE OF CONTENTS
EXECUTIVE SUMMARY
INTRODUCTION
BOLD ARCHITECTURE FOR HDAAS
NAVIGATING TOWARD LARGE-SCALE HDAAS
  A Few Surprises
  Diving in Deeper
  Relooking at Memory Settings
  Modifying Settings Properly with BDE
  Bigger Is Not Always Better
  Storage Sizing Proved Successful
H13856
EXECUTIVE SUMMARY
Apache Hadoop has become a prime tool for analyzing Big Data and achieving greater insights that help organizations improve strategic decision making.
Traditional Hadoop clusters have proved inefficient for handling large-scale analytics jobs sized at hundreds of terabytes or even petabytes. Adobe's Digital
Marketing organization, which operates data analytics jobs on this scale, was encountering increasing internal demand to use Hadoop for analysis of the
company's existing eight-petabyte data repository.
To address this need, Adobe explored an innovative approach to Hadoop. Rather than running traditional Hadoop clusters on commodity servers with locally
attached storage, Adobe virtualized the Hadoop computing environment and used its existing EMC Isilon storage, where the eight-petabyte data repository
resides, as a central location for Hadoop data.
Adobe enlisted the resources, technologies, and expertise of EMC, VMware, and Cisco to build a reference architecture for virtualized Hadoop-as-a-Service
(HDaaS) and perform a comprehensive proof of concept. While the five-month POC encountered some challenges, the project also yielded a wealth of insights
and understanding relating to how Hadoop operates and its infrastructure requirements.
After meticulous configuring, refining, and testing, Adobe successfully ran a 65-terabyte Hadoop job, one of the industry's largest to date in a virtualized
environment. This white paper details the process that Adobe and the POC team followed that led to this accomplishment.
The paper includes specific configurations of the virtual HDaaS environment used in the POC. It covers the initial obstacles and how the POC team overcame
them, and documents how the team adjusted settings, sized systems, and reconfigured the environment to support large-scale Hadoop analytics in a virtual
environment with centralized storage.
Most importantly, the paper presents the POC's results, along with valuable best practices for other organizations interested in pursuing similar projects. The last
section describes Adobe's plans to bring virtual HDaaS to production for its business users and data scientists.
INTRODUCTION
Organizations across the world increasingly view Big Data as a prime source of competitive differentiation, and analytics as the means to tap this source.
Specifically, Hadoop enables data scientists to perform sophisticated queries against massive volumes of data to gain insights, discover trends, and predict
outcomes. In fact, a GE and Accenture study reported that 84 percent of survey respondents believe that using Big Data analytics "has the power to shift the
competitive landscape for my industry" in the next year. 1
Apache Hadoop, an increasingly popular environment for running analytics jobs, is an open source framework for storing and processing large data sets.
Traditionally running on clusters of commodity servers with local storage, Hadoop comprises multiple components, primarily the Hadoop Distributed File System
(HDFS) for data storage, Yet Another Resource Negotiator (YARN) for managing system resources like memory and CPUs, and MapReduce for processing
massive jobs by splitting up input data into small subtasks and collating results.
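The split-and-collate pattern described above can be sketched in miniature. The following pure-Python word count simulates the three MapReduce phases (map, shuffle, reduce) in one process; it does not use Hadoop itself, and all names are illustrative:

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit an intermediate (word, 1) pair for each word in each input record
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all intermediate values by key before reduction
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: collate each key's values into a final result (here, a sum)
    return {key: sum(values) for key, values in groups.items()}

records = ["big data analytics", "big data jobs"]
counts = reduce_phase(shuffle(map_phase(records)))
# counts == {"big": 2, "data": 2, "analytics": 1, "jobs": 1}
```

In a real cluster, the mapper and reducer run as separate JVM tasks on many nodes, and YARN schedules them against available memory and CPU.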
At Adobe, a global leader in digital marketing and digital media solutions, the Technical Operations team uses traditional Hadoop clusters to deliver Hadoop as a
Service (HDaaS) in a private cloud for several application teams. These teams run Big Data jobs such as log and statistical analysis of application layers to
uncover trends that help guide product enhancements.
Elsewhere, Adobe's Digital Marketing organization tracks and analyzes customers' website statistics, which are stored in an eight-petabyte data repository on
EMC Isilon storage. Adobe Digital Marketing would like to use HDaaS for more in-depth analysis that would help its clients improve website effectiveness,
correlate site visits to revenue, and guide strategic business decisions. Rather than moving data from a large data repository to the Hadoop clusters, a time-consuming task, Technical Operations determined it would be most efficient to simply use Hadoop to access data sets on the existing Isilon-based data
repository.
Adobe has a goal of running analytics jobs against data sets that are hundreds of terabytes in size. Simply adding commodity servers to Hadoop clusters
would become highly inefficient, especially since traditional Hadoop clusters require three copies of the data to ensure availability. Adobe was also concerned
that current Hadoop versions lack high-availability features. For example, Hadoop has only two NameNodes, which track where data resides in Hadoop
environments. If both NameNodes fail, the entire Hadoop cluster collapses.
Technical Operations proposed separating the Hadoop elements and placing them where they can scale more efficiently and reliably. This meant using Isilon,
where Adobe's file-based data repository is stored, for centralized Hadoop storage and virtualizing the Hadoop cluster nodes to enable more flexible scalability
and lower compute costs. (Figures 1 and 2)
Figure 1.
Figure 2.
Despite internal skepticism about a virtualized infrastructure handling Hadoop's complexity, Technical Operations recognized a compelling upside: improving
efficiency and increasing scalability to a level that had not been achieved for single-job data sets in a virtualized Hadoop environment with Isilon. This is
enticing, especially as data analytics jobs continue to grow in size across all environments.
"People think that by virtualizing Hadoop, you're going to take a performance hit. But we
showed that's not the case. Instead you get added flexibility that actually unencumbers
your infrastructure."
Chris Mutchler, Compute Platform Engineer, Adobe Systems
To explore the possibilities, Adobe Technical Operations embarked on a virtual HDaaS POC for Adobe's Digital Marketing organization. The infrastructure comprised
EMC, VMware, and Cisco solutions and was designed to test the outer limits of Big Data analytics on Isilon and VMware using Hadoop.
Key objectives of the POC included:
- Building a virtualized HDaaS environment to deliver analytics through a self-service catalog to internal Adobe customers
- Decoupling storage from compute by using EMC Isilon to provide HDFS, ultimately enabling access to the entire data repository for analytics
- Understanding sizing and security requirements of the integrated EMC Isilon, EMC VNX, VMware, and Cisco UCS infrastructure to support larger-scale HDaaS
- Proving an attractive return on investment and total cost of ownership in virtualized HDaaS environments compared to physical in-house solutions or public cloud services such as Amazon Web Services
BOLD ARCHITECTURE FOR HDAAS
At the compute layer, Adobe was particularly interested in Cisco UCS for its firmware management and centralized configuration capabilities. Plus, UCS
provides a converged compute and network environment when deployed with Nexus.
VNX provides block storage for VMware ESX hosts and virtual machines (VMs) that comprise the Hadoop cluster. Adobe's focus was learning the VNX sizing
and performance requirements to support virtualized HDaaS.
An existing Isilon customer, Adobe especially liked Isilon's data lake concept, which enables access to one source of data through multiple protocols, such as
NFS, FTP, Object, and HDFS. In the POC, data was loaded onto Isilon via NFS and accessed via HDFS by virtual machines in the Hadoop compute cluster. The
goal was to prove that Isilon delivered sufficient performance to support large Hadoop workloads.
To deploy, run, and manage Hadoop on a common virtual infrastructure, Adobe relied on VMware Big Data Extensions (BDE), an essential software component
of the overall environment. Adobe already used BDE in its private cloud HDaaS deployment and wanted to apply it to the new infrastructure.
BDE enabled Adobe to automate and simplify deployment of hundreds of virtualized Hadoop compute nodes that were tied directly to Isilon for HDFS. During
testing, Adobe also used BDE to deploy, reclaim, and redeploy the Hadoop cluster more than 30 times to evaluate different cluster configurations. Without the
automation and flexibility of BDE, Adobe would not have been able to conduct such a wide range and high volume of tests within such a short timeframe.
In this POC, Adobe used Pivotal HD as an enhanced Hadoop distribution framework but designed the infrastructure to run any Hadoop distribution.
The following tools assisted Adobe with monitoring, collecting and reporting on metrics generated by the POC:
NAVIGATING TOWARD LARGE-SCALE HDAAS
A FEW SURPRISES
Adobe ran its first Hadoop MapReduce job in the virtual HDaaS environment within three days of initial set-up. Smaller data sets of 60 to 450 gigabytes
performed well, but the team hit a wall beyond 450 gigabytes.
The team focused on the Hadoop job definition to determine whether it was written correctly and was using memory efficiently. In researching the
industry at large, Adobe learned that most enterprise Hadoop environments were testing data on a small scale. In fact, Adobe did not find another Hadoop POC
or implementation that exceeded 10 terabytes for single-job data sets in a virtualized Hadoop environment with Isilon.
"When we talked to other people in the industry, we realized we were on the forefront of
scaling Hadoop at levels possibly never seen before."
Jason Farnsworth, Senior Storage Engineer, Adobe Systems
After four weeks of tweaking the Hadoop job definition and adjusting memory settings, the team successfully ran a six-terabyte job. Pushing beyond six
terabytes, the team sought to run larger data sets upwards of 60 terabytes. The larger jobs again proved difficult to complete successfully.
DIVING IN DEEPER
The next phase involved Adobe Technical Operations enlisting help from storage services, compute platforms, research scientists, data center operations, and
network engineering. Technical Operations also reached out to the POC's key partners: EMC (including Isilon and Pivotal), VMware, Cisco, and Trace3, an EMC
value-added reseller and IT systems integrator.
The team, which included several Hadoop experts, dissected nearly every element of the HDaaS environment. This included Hadoop job definitions, memory
settings, Java memory allocations, command line options, physical and virtual infrastructure configurations, and HDFS options.
"We had several excellent meetings with Hadoop experts from EMC and VMware. We
learned an enormous amount that helped us solve our initial problems and tweak the
infrastructure to scale the way we wanted."
Jason Farnsworth, Senior Storage Engineer, Adobe Systems
YARN Settings
- Amount of physical memory, in megabytes, that can be allocated for containers: yarn.nodemanager.resource.memory-mb=x, where x = memory in megabytes. BDE has a base calculation for this value according to how much RAM to allocate to the workers on deployment. Default value is 8192.
- Minimum container memory for YARN; the minimum allocation for every container request at the ResourceManager, in megabytes: yarn.scheduler.minimum-allocation-mb=x, where x = memory in megabytes. Default value is 1024.
- Java options for the application master (JVM heap size): yarn.app.mapreduce.am.command-opts=x, where x = memory in megabytes passed as a Java option (e.g., -Xmx7000m). Default value is -Xmx1024m.
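As a hedged illustration, the settings above would appear as follows in yarn-site.xml. The values shown are the stock defaults, not the tuned values Adobe used in the POC:

```xml
<!-- yarn-site.xml: illustrative values only, not the POC's tuned settings -->
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value> <!-- RAM per NodeManager available for containers, in MB -->
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value> <!-- smallest container the ResourceManager will grant -->
  </property>
  <property>
    <name>yarn.app.mapreduce.am.command-opts</name>
    <value>-Xmx1024m</value> <!-- JVM heap for the MapReduce ApplicationMaster -->
  </property>
</configuration>
```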
Mapred Settings
- Mapper Java options (JVM heap size); heap size for child JVMs of maps: mapreduce.map.java.opts=x, where x = memory passed as a Java option (e.g., -Xmx2000m). Default value is -Xmx1024m.
- Reducer Java options (JVM heap size); heap size for child JVMs of reduces: mapreduce.reduce.java.opts=x, where x = memory passed as a Java option (e.g., -Xmx4000m). Default value is -Xmx2560m.
- Maximum size of the split metainfo file: mapreduce.jobtracker.split.metainfo.maxsize=x; x = 10000000 by default. The POC team set this to -1, which disables the limit and permits any size.
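For illustration, these Mapred settings would appear as follows in mapred-site.xml. Heap values are the stock defaults plus the POC team's -1 metainfo override; actual values should be tuned per workload:

```xml
<!-- mapred-site.xml: stock defaults shown for heap sizes; tune per workload -->
<configuration>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024m</value> <!-- heap for each map-task child JVM -->
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx2560m</value> <!-- heap for each reduce-task child JVM -->
  </property>
  <property>
    <name>mapreduce.jobtracker.split.metainfo.maxsize</name>
    <value>-1</value> <!-- -1 removes the split-metainfo size cap, as the POC team did -->
  </property>
</configuration>
```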
For guidance on baseline values to use in these memory settings, the POC team recommends the following documents:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.1/bk_installing_manually_book/content/rpm-chap1-11.html
http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
https://support.pivotal.io/hc/en-us/articles/201462036-Mapreduce-YARN-Memory-Parameters
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://hadoop.apache.org/docs/r2.5.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
"Our results proved that having Isilon act as the HDFS layer was not adverse. In fact, we
got better results with Isilon than we would have in a traditional cluster."
Chris Mutchler, Compute Platform Engineer, Adobe Systems
The team concluded that Hadoop performs better in a scale-out rather than scale-up configuration. That is, jobs complete more quickly when run on a greater
number of compute nodes, so having more cores is more important than having faster processors. In fact, performance improved as the number of workers
increased.
Tests were run with the following cluster configurations:
- Devote ample time upfront to sizing storage layers based on workload and scalability requirements. Sizing for Hadoop intermediate space also deserves careful consideration.
- Consider setting large HDFS block sizes, 256 to 1024 megabytes, to ensure sufficient performance. On Isilon, HDFS block size is configured as a protocol setting in the OneFS operating system.
- In the compute environment, deploy a large number of hosts using processors with as many cores as possible and align the VMs to those cores. In general, having more cores is more important than having faster processors and results in better performance and scalability.
- Configure all physical hosts in the VMware cluster identically. For example, mixing eight-core and ten-core systems will make CPU alignment challenging when using BDE. Different RAM amounts will also cause unwanted overhead when VMware's distributed resource scheduling moves virtual guests.
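On Isilon, block size is set once in OneFS as described above; on a native HDFS cluster, the closest equivalent is the standard client-side dfs.blocksize property. The following hdfs-site.xml fragment is a general Hadoop example, not part of the POC configuration:

```xml
<!-- hdfs-site.xml (native HDFS clusters): a 256 MB block size, per the guidance above -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value> <!-- 256 * 1024 * 1024 bytes; recent Hadoop also accepts "256m" -->
  </property>
</configuration>
```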
For more information, see the EMC Isilon Best Practices for Hadoop Data Storage white paper.
When building and configuring the virtual HDaaS infrastructure, companies should select vendors with extensive expertise in Hadoop, especially in large-scale Hadoop environments. EMC, VMware, and solution integrators with Big Data experience can help accelerate a Hadoop deployment and ensure success.
Because of the interdependencies among the many components in a virtual HDaaS infrastructure, internal and external team members will need broad
knowledge of the technology stack, including compute, storage, virtualization, and networking, with deep understanding of how each performs separately and
together. While IT as a whole is still evolving toward developing integrated skill sets, EMC has been on the forefront of this trend and can provide insights and
guidance.