LARGE-SCALE INFRASTRUCTURES
How Adobe Systems achieved breakthrough results in
Big Data analytics with Hadoop-as-a-Service
ABSTRACT
Large-scale Apache Hadoop analytics have long eluded the industry, especially in virtualized
environments. In a ground-breaking proof of concept (POC), Adobe Systems demonstrated that Hadoop-as-a-Service (HDaaS) running on a virtualized and centralized infrastructure could handle large-scale data
analytics workloads. This white paper documents the POC's infrastructure design, initial obstacles, and
successful completion, as well as sizing and configuration details, and best practices. Importantly, the
paper also underscores how HDaaS built on an integrated and virtualized infrastructure delivers
outstanding performance, scalability, and efficiency, paving the path toward larger-scale Big Data
analytics in Hadoop environments.
December 2014
To learn more about how EMC products, services, and solutions can help solve your business and IT challenges, contact your local representative or authorized
reseller, visit www.emc.com, or explore and compare products in the EMC Store
Copyright © 2014 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.
The information in this publication is provided "as is." EMC Corporation makes no representations or warranties of any kind with respect to the information in
this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
VMware and vSphere are registered trademarks or trademarks of VMware, Inc. in the United States and/or other jurisdictions. All other trademarks used herein
are the property of their respective owners.
TABLE OF CONTENTS
EXECUTIVE SUMMARY
INTRODUCTION
BOLD ARCHITECTURE FOR HDAAS
NAVIGATING TOWARD LARGE-SCALE HDAAS
  A Few Surprises
  Diving in Deeper
  Relooking at Memory Settings
  Modifying Settings Properly with BDE
  Bigger Is Not Always Better
  Storage Sizing Proved Successful
H13856
EXECUTIVE SUMMARY
Apache Hadoop has become a prime tool for analyzing Big Data and achieving greater insights that help organizations improve strategic decision making.
Traditional Hadoop clusters have proved inefficient for handling large-scale analytics jobs sized at hundreds of terabytes or even petabytes. Adobe's Digital
Marketing organization, which operates data analytics jobs on this scale, was encountering increasing internal demand to use Hadoop for analysis of the
company's existing eight-petabyte data repository.
To address this need, Adobe explored an innovative approach to Hadoop. Rather than running traditional Hadoop clusters on commodity servers with locally
attached storage, Adobe virtualized the Hadoop computing environment and used its existing EMC Isilon storage, where the eight-petabyte data repository
resides, as a central location for Hadoop data.
Adobe enlisted the resources, technologies, and expertise of EMC, VMware, and Cisco to build a reference architecture for virtualized Hadoop-as-a-Service
(HDaaS) and perform a comprehensive proof of concept. While the five-month POC encountered some challenges, the project also yielded a wealth of insights
and understanding relating to how Hadoop operates and its infrastructure requirements.
After meticulous configuring, refining, and testing, Adobe successfully ran a 65-terabyte Hadoop job, one of the industry's largest to date in a virtualized
environment. This white paper details the process that Adobe and the POC team followed that led to this accomplishment.
The paper includes specific configurations of the virtual HDaaS environment used in the POC. It covers the initial obstacles and how the POC team overcame
them, and documents how the team adjusted settings, sized systems, and reconfigured the environment to support large-scale Hadoop analytics in a virtual
environment with centralized storage.
Most importantly, the paper presents the POC's results, along with valuable best practices for other organizations interested in pursuing similar projects. The last
section describes Adobe's plans to bring virtual HDaaS to production for its business users and data scientists.
INTRODUCTION
Organizations across the world increasingly view Big Data as a prime source of competitive differentiation, and analytics as the means to tap this source.
Specifically, Hadoop enables data scientists to perform sophisticated queries against massive volumes of data to gain insights, discover trends, and predict
outcomes. In fact, a GE and Accenture study reported that 84 percent of survey respondents believe that using Big Data analytics "has the power to shift the
competitive landscape for my industry" in the next year. 1
Apache Hadoop, an increasingly popular environment for running analytics jobs, is an open source framework for storing and processing large data sets.
Traditionally running on clusters of commodity servers with local storage, Hadoop comprises multiple components, primarily the Hadoop Distributed File System
(HDFS) for data storage, Yet Another Resource Negotiator (YARN) for managing system resources like memory and CPUs, and MapReduce for processing
massive jobs by splitting up input data into small subtasks and collating results.
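The split-and-collate pattern described above can be sketched in miniature. The following pure-Python word count simulates the three MapReduce phases (map, shuffle, reduce) in one process; it does not use Hadoop itself, and all names are illustrative:

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit an intermediate (word, 1) pair for each word in each input record
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all intermediate values by key before reduction
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: collate each key's values into a final result (here, a sum)
    return {key: sum(values) for key, values in groups.items()}

records = ["big data analytics", "big data jobs"]
counts = reduce_phase(shuffle(map_phase(records)))
# counts == {"big": 2, "data": 2, "analytics": 1, "jobs": 1}
```

In a real cluster, the mapper and reducer run as separate JVM tasks on many nodes, and YARN schedules them against available memory and CPU.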
At Adobe, a global leader in digital marketing and digital media solutions, the Technical Operations team uses traditional Hadoop clusters to deliver Hadoop as a
Service (HDaaS) in a private cloud for several application teams. These teams run Big Data jobs such as log and statistical analysis of application layers to
uncover trends that help guide product enhancements.
Elsewhere, Adobe's Digital Marketing organization tracks and analyzes customers' website statistics, which are stored in an eight-petabyte data repository on
EMC Isilon storage. Adobe Digital Marketing would like to use HDaaS for more in-depth analysis that would help its clients improve website effectiveness,
correlate site visits to revenue, and guide strategic business decisions. Rather than moving data from a large data repository to the Hadoop clusters, a time-consuming task, Technical Operations determined it would be most efficient to simply use Hadoop to access data sets on the existing Isilon-based data
repository.
Adobe has a goal of running analytics jobs against data sets that are hundreds of terabytes in size. Simply adding commodity servers to Hadoop clusters
would become highly inefficient, especially since traditional Hadoop clusters require three copies of the data to ensure availability. Adobe was also concerned
that current Hadoop versions lack high-availability features. For example, Hadoop has only two NameNodes, which track where data resides in Hadoop
environments. If both NameNodes fail, the entire Hadoop cluster collapses.
Technical Operations proposed separating the Hadoop elements and placing them where they can scale more efficiently and reliably. This meant using Isilon,
where Adobe's file-based data repository is stored, for centralized Hadoop storage and virtualizing the Hadoop cluster nodes to enable more flexible scalability
and lower compute costs. (Figures 1 and 2)
Figure 1.
Figure 2.
Despite internal skepticism about a virtualized infrastructure handling Hadoop's complexity, Technical Operations recognized a compelling upside: improving
efficiency and increasing scalability to a level that had not been achieved for single-job data sets in a virtualized Hadoop environment with Isilon. This is
enticing, especially as data analytics jobs continue to grow in size across all environments.
"People think that by virtualizing Hadoop, you're going to take a performance hit. But we
showed that's not the case. Instead you get added flexibility that actually unencumbers
your infrastructure."
Chris Mutchler, Compute Platform Engineer, Adobe Systems
To explore the possibilities, Adobe Technical Operations embarked on a virtual HDaaS POC for Adobe's Digital Marketing organization. The infrastructure comprised
EMC, VMware, and Cisco solutions and was designed to test the outer limits of Big Data analytics on Isilon and VMware using Hadoop.
Key objectives of the POC included:
- Building a virtualized HDaaS environment to deliver analytics through a self-service catalog to internal Adobe customers
- Decoupling storage from compute by using EMC Isilon to provide HDFS, ultimately enabling access to the entire data repository for analytics
- Understanding sizing and security requirements of the integrated EMC Isilon, EMC VNX, VMware, and Cisco UCS infrastructure to support larger-scale HDaaS
- Proving an attractive return on investment and total cost of ownership in virtualized HDaaS environments compared to physical in-house solutions or public cloud services such as Amazon Web Services
BOLD ARCHITECTURE FOR HDAAS
At the compute layer, Adobe was particularly interested in Cisco UCS for its firmware management and centralized configuration capabilities. Plus, UCS
provides a converged compute and network environment when deployed with Nexus.
VNX provides block storage for VMware ESX hosts and virtual machines (VMs) that comprise the Hadoop cluster. Adobe's focus was learning the VNX sizing
and performance requirements to support virtualized HDaaS.
An existing Isilon customer, Adobe especially liked Isilon's data lake concept, which enables access to one source of data through multiple protocols, such as
NFS, FTP, Object, and HDFS. In the POC, data was loaded onto Isilon via NFS and accessed via HDFS by virtual machines in the Hadoop compute cluster. The
goal was to prove that Isilon delivered sufficient performance to support large Hadoop workloads.
To deploy, run, and manage Hadoop on a common virtual infrastructure, Adobe relied on VMware Big Data Extensions (BDE), an essential software component
of the overall environment. Adobe already used BDE in its private cloud HDaaS deployment and wanted to apply it to the new infrastructure.
BDE enabled Adobe to automate and simplify deployment of hundreds of virtualized Hadoop compute nodes that were tied directly to Isilon for HDFS. During
testing, Adobe also used BDE to deploy, reclaim, and redeploy the Hadoop cluster more than 30 times to evaluate different cluster configurations. Without the
automation and flexibility of BDE, Adobe would not have been able to conduct such a wide range and high volume of tests within such a short timeframe.
In this POC, Adobe used Pivotal HD as an enhanced Hadoop distribution framework but designed the infrastructure to run any Hadoop distribution.
The following tools assisted Adobe with monitoring, collecting and reporting on metrics generated by the POC:
NAVIGATING TOWARD LARGE-SCALE HDAAS
A FEW SURPRISES
Adobe ran its first Hadoop MapReduce job in the virtual HDaaS environment within three days of initial set-up. Smaller data sets of 60 to 450 gigabytes
performed well, but the team hit a wall beyond 450 gigabytes.
The team focused on the Hadoop job definition to determine whether it was written correctly and was using memory efficiently. In researching the
industry at large, Adobe learned that most enterprise Hadoop environments were testing data on a small scale. In fact, Adobe did not find another Hadoop POC
or implementation that exceeded 10 terabytes for single-job data sets in a virtualized Hadoop environment with Isilon.
"When we talked to other people in the industry, we realized we were on the forefront of
scaling Hadoop at levels possibly never seen before."
Jason Farnsworth, Senior Storage Engineer, Adobe Systems
After four weeks of tweaking the Hadoop job definition and adjusting memory settings, the team successfully ran a six-terabyte job. Pushing beyond six
terabytes, the team sought to run larger data sets upwards of 60 terabytes. The larger jobs again proved difficult to complete successfully.
DIVING IN DEEPER
The next phase involved Adobe Technical Operations enlisting help from storage services, compute platforms, research scientists, data center operations, and
network engineering. Technical Operations also reached out to the POC's key partners: EMC (including Isilon and Pivotal), VMware, Cisco, and Trace3, an EMC
value-added reseller and IT systems integrator.
The team, which included several Hadoop experts, dissected nearly every element of the HDaaS environment. This included Hadoop job definitions, memory
settings, Java memory allocations, command line options, physical and virtual infrastructure configurations, and HDFS options.
"We had several excellent meetings with Hadoop experts from EMC and VMware. We
learned an enormous amount that helped us solve our initial problems and tweak the
infrastructure to scale the way we wanted."
Jason Farnsworth, Senior Storage Engineer, Adobe Systems
YARN Settings
- Amount of physical memory, in megabytes, that can be allocated for containers: yarn.nodemanager.resource.memory-mb=x, where x = memory in megabytes. BDE has a base calculation for this value according to how much RAM to allocate to the workers on deployment. Default value is 8192.
- Minimum container memory for YARN; the minimum allocation for every container request at the ResourceManager, in megabytes: yarn.scheduler.minimum-allocation-mb=x, where x = memory in megabytes. Default value is 1024.
- Java options for the application master (JVM heap size): yarn.app.mapreduce.am.command-opts=x, where x = memory in megabytes passed as a Java option (e.g., -Xmx7000m). Default value is -Xmx1024m.
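As a hedged illustration, the settings above would appear as follows in yarn-site.xml. The values shown are the stock defaults, not the tuned values Adobe used in the POC:

```xml
<!-- yarn-site.xml: illustrative values only, not the POC's tuned settings -->
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value> <!-- RAM per NodeManager available for containers, in MB -->
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value> <!-- smallest container the ResourceManager will grant -->
  </property>
  <property>
    <name>yarn.app.mapreduce.am.command-opts</name>
    <value>-Xmx1024m</value> <!-- JVM heap for the MapReduce ApplicationMaster -->
  </property>
</configuration>
```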
Mapred Settings
- Mapper Java options (JVM heap size); heap size for child JVMs of maps: mapreduce.map.java.opts=x, where x = memory passed as a Java option (e.g., -Xmx2000m). Default value is -Xmx1024m.
- Reducer Java options (JVM heap size); heap size for child JVMs of reduces: mapreduce.reduce.java.opts=x, where x = memory passed as a Java option (e.g., -Xmx4000m). Default value is -Xmx2560m.
- Maximum size of the split metainfo file: mapreduce.jobtracker.split.metainfo.maxsize=x; x = 10000000 by default. The POC team set this to -1, which disables the limit and permits any size.
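For illustration, these Mapred settings would appear as follows in mapred-site.xml. Heap values are the stock defaults plus the POC team's -1 metainfo override; actual values should be tuned per workload:

```xml
<!-- mapred-site.xml: stock defaults shown for heap sizes; tune per workload -->
<configuration>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024m</value> <!-- heap for each map-task child JVM -->
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx2560m</value> <!-- heap for each reduce-task child JVM -->
  </property>
  <property>
    <name>mapreduce.jobtracker.split.metainfo.maxsize</name>
    <value>-1</value> <!-- -1 removes the split-metainfo size cap, as the POC team did -->
  </property>
</configuration>
```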
For guidance on baseline values to use in these memory settings, the POC team recommends the following documents:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.1/bk_installing_manually_book/content/rpm-chap1-11.html
http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
https://support.pivotal.io/hc/en-us/articles/201462036-Mapreduce-YARN-Memory-Parameters
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://hadoop.apache.org/docs/r2.5.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
"Our results proved that having Isilon act as the HDFS layer was not adverse. In fact, we
got better results with Isilon than we would have in a traditional cluster."
Chris Mutchler, Compute Platform Engineer, Adobe Systems
The team concluded that Hadoop performs better in a scale-out rather than scale-up configuration. That is, jobs complete more quickly when run on a greater
number of compute nodes, so having more cores is more important than having faster processors. In fact, performance improved as the number of workers
increased.
Tests were run with the following cluster configurations:
- Devote ample time upfront to sizing storage layers based on workload and scalability requirements. Sizing for Hadoop intermediate space also deserves careful consideration.
- Consider setting large HDFS block sizes, 256 to 1024 megabytes, to ensure sufficient performance. On Isilon, HDFS block size is configured as a protocol setting in the OneFS operating system.
- In the compute environment, deploy a large number of hosts using processors with as many cores as possible and align the VMs to those cores. In general, having more cores is more important than having faster processors and results in better performance and scalability.
- Configure all physical hosts in the VMware cluster identically. For example, mixing eight-core and ten-core systems will make CPU alignment challenging when using BDE. Different RAM amounts will also cause unwanted overhead when VMware's distributed resource scheduling moves virtual guests.
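On Isilon, block size is set once in OneFS as described above; on a native HDFS cluster, the closest equivalent is the standard client-side dfs.blocksize property. The following hdfs-site.xml fragment is a general Hadoop example, not part of the POC configuration:

```xml
<!-- hdfs-site.xml (native HDFS clusters): a 256 MB block size, per the guidance above -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value> <!-- 256 * 1024 * 1024 bytes; recent Hadoop also accepts "256m" -->
  </property>
</configuration>
```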
For more information, see the EMC Isilon Best Practices for Hadoop Data Storage white paper.
When building and configuring the virtual HDaaS infrastructure, companies should select vendors with extensive expertise in Hadoop, especially in large-scale Hadoop environments. EMC, VMware, and solution integrators with Big Data experience can help accelerate a Hadoop deployment and ensure success.
Because of the interdependencies among the many components in a virtual HDaaS infrastructure, internal and external team members will need broad
knowledge of the technology stack, including compute, storage, virtualization, and networking, with deep understanding of how each performs separately and
together. While IT as a whole is still evolving toward developing integrated skill sets, EMC has been on the forefront of this trend and can provide insights and
guidance.