
Dell Next Generation Compute Solutions

Dell | Cloudera
Solution for Apache Hadoop
Reference Architecture

A Dell Reference Architecture Guide


Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0
Table of Contents
Tables
Figures
Overview
Summary
Abbreviations
Dell | Cloudera Hadoop Solution
Solution Overview
Dell | Cloudera Hadoop Solution Network Architecture
Dell | Cloudera Network Segmentation
Dell | Cloudera Network Design Overview
Dell Certified Switch Solution
Dell Open Switch Solution
Dell | Cloudera NIC Teaming
IPv6 Capabilities
Dell | Cloudera Hadoop Solution Hardware Architecture
High-level Architecture
High-level Network Architecture
Hadoop Network Cable Scheme and Connections
Dell | Cloudera Hadoop Solution Compute: PowerEdge C2100
Dell | Cloudera Hadoop Solution Software Architecture
Linux File System Configuration Definition
Disk Partitioning Recommendation for the Name Node
Hadoop Ecosystem Services and Utilities Mapping
Dell | Cloudera Hadoop Solution Software Configuration
Dell | Cloudera Configuration Parameters Recommended Values
Dell | Cloudera Hadoop Solution Deployment Methodology
Site Preparation Needed for the Deployment
Dell | Cloudera Hadoop Solution Hardware Monitoring and Alerting
Nagios
Ganglia
Cloudera Enterprise
Dell | Cloudera Hadoop Solution Security Design
What is available in CDH3
Implementing Secure Hadoop
Appendix A: Bill of Materials
Name Node Bill of Materials
Slave Node Bill of Materials
Edge Node Bill of Materials
Network Connectivity Bill of Materials: Top of Rack
Software
Appendix B: Dell | Cloudera Hadoop Solution Components Decoder Ring
Appendix C: External References
To Learn More


Tables
Table 1: Dell | Cloudera Hadoop Solution Use Cases
Table 2: Dell | Cloudera Hardware Configurations
Table 3: Dell | Cloudera Software Locations
Table 4: Dell | Cloudera Support Matrix
Table 5: Dell | Cloudera Network Cabling
Table 6: Hadoop Ecosystem Utilities Mapping
Table 7: hdfs-site.xml
Table 8: mapred-site.xml
Table 9: default.xml
Table 10: hadoop-env.sh
Table 11: /etc/fstab
Table 12: core-site.xml
Table 13: /etc/security/limits.conf

Figures
Figure 1: Dell | Cloudera Hadoop Solution Taxonomy
Figure 2: Dell | Cloudera Network Switch Connectivity
Figure 3: Hadoop Top of Rack Network Connectivity
Figure 4: Dell | Cloudera Node-Level Network Connectivity
Figure 5: Dell | Cloudera Hardware Architecture
Figure 6: Dell | Cloudera Compute Node Network Interconnects
Figure 7: PowerEdge C2100
Figure 8: Kerberos Authentication in Hadoop




THIS PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS
PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND.

© 2011 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden.
For more information, contact Dell.

Dell, the DELL logo, the DELL badge, PowerConnect, and PowerEdge are trademarks of Dell Inc. Cloudera, CDH, and Cloudera Enterprise are trademarks of
Cloudera and its affiliates in the US and other countries. Intel and Xeon are registered trademarks of Intel Corporation in the U.S. and other countries. Other
trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell Inc. disclaims any
proprietary interest in trademarks and trade names other than its own.

August 2011

Overview
Summary
The document presents the reference architecture of the Dell | Cloudera Hadoop Solution that Dell
designed jointly with Cloudera.
The reference architecture introduces all the high-level components, hardware, and software that are included
in the stack. Each high-level component is then described individually.
Abbreviations
Abbreviation | Definition
BMC | Baseboard management controller
DBMS | Database management system
EDW | Enterprise data warehouse
EoR | End-of-row switch/router
HDFS | Hadoop Distributed File System
IPMI | Intelligent Platform Management Interface
LAG | Link aggregation group
NIC | Network interface card
ToR | Top-of-rack switch/router

Dell | Cloudera Hadoop Solution
Solution Overview
Hadoop is an Apache project being built and used by a global community of contributors, using the Java
programming language. Yahoo! has been the largest contributor to the project, and uses Hadoop extensively
across its businesses. Other contributors and users include Facebook, LinkedIn, eHarmony, and eBay. Cloudera
has created a quality controlled distribution of Hadoop and offers commercial management software, support,
and consulting services.
Dell developed a solution for Hadoop that includes optimized hardware, software, and services to streamline
deployment and improve the customer experience.
The Dell | Cloudera Hadoop Solution is based on the Cloudera CDH 3 Enterprise distribution of Hadoop. Dell's solution
includes:
• Reference architecture and best practices
• Optimized hardware and network infrastructure
• Cloudera CDH Enterprise software (CDH Community provided for customer-deployed solutions)
• Hadoop infrastructure management tools
• Dell Crowbar software
This solution provides Dell a foundation to offer additional solutions as the Hadoop environment evolves and
expands.
The solution is designed to address the following use cases:

Table 1: Dell | Cloudera Hadoop Solution Use Cases
Use case | Description
Data storage | The user would like to collect and store unstructured and semi-structured data in a fault-resilient, scalable data store that can be organized and sorted for indexing and analysis.
Batch processing of unstructured data | The user would like to batch-process (index, analyze, etc.) large quantities of unstructured and semi-structured data.
Data archive | The user would like medium-term (12 to 36 months) archival of data from EDW/DBMS to increase the length of time that data is retained or to meet data retention policies/compliance.
Integration with data warehouse | The user would like to transfer data stored in Hadoop into a separate DBMS for advanced analytics. The user may also want to transfer the data from the DBMS back to Hadoop.

Aside from the Hadoop core technology (HDFS, MapReduce, etc.), Dell has designed additional capabilities meant to address
specific customer needs:
• Monitoring, reporting, and alerting of the hardware and software components
• Infrastructure configuration automation
The Dell | Cloudera Hadoop Solution lowers the barrier to adoption for organizations looking to use Hadoop in
production. Dell's customer-centered approach is to create rapidly deployable and highly optimized end-to-end
Hadoop solutions running on commodity hardware. Dell provides all the hardware and software
components and resources to meet your requirements, and no other supplier need be involved. Cloudera will
provide support and software updates for the Hadoop software components within the solution.
The hardware platform for the Dell | Cloudera Hadoop Solution (Figure 1) is the Dell PowerEdge C-series.
Dell PowerEdge C-series servers are focused on hyperscale and cloud capabilities. Rather than emphasizing
gigahertz and gigabytes, these servers deliver maximum density, memory, and serviceability while minimizing
total cost of ownership. It's all about getting the processing you need in the least amount of space and in an
energy-efficient package that slashes operational costs.
Dell recommends Red Hat Enterprise Linux 5.6 for use in Cloudera deployments. You can choose to install
CentOS 5.6 for user-deployed solutions.
The recommended Java Virtual Machine (JVM) is the Oracle Sun JVM 1.6u25 or above.
The hardware platforms, the operating system, and the Java Virtual Machine make up the foundation on which
the Hadoop software stack runs.


Figure 1: Dell | Cloudera Hadoop Solution Taxonomy

The dark blue layer, depicting the Cloudera CDH components (Figure 1) comprises two frameworks:
1. The Data Storage Framework (HDFS) is the file system that Hadoop uses to store data on the cluster nodes.
Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system.
2. The Data Processing Framework (MapReduce) is a massively parallel compute framework inspired by
Google's MapReduce papers.
The next layer of the stack in the Dell | Cloudera Hadoop Solution design is the network layer. Dell recommends implementing
the Hadoop cluster on a dedicated network for two reasons:
1. Dell provides network design blueprints that have been tested and qualified.
2. Network performance predictability: sharing the network with other applications may have a detrimental
impact on the performance of the Hadoop jobs.
The next two frameworks, the Data Access Framework and the Data Orchestration Framework, comprise
utilities that are part of the Hadoop ecosystem.
Dell listened to its customers and designed a Hadoop solution that is unique in the marketplace. Dell's end-to-end
solution approach means that you can be in production with Hadoop in a shorter time than is traditionally
possible with homegrown solutions. The Dell | Cloudera Hadoop Solution embodies all the software functions
and services needed to run Hadoop in a production environment. One of Dell's chief contributions to Hadoop
is a method to rapidly deploy and integrate Hadoop in production. These complementary functions are
designed and implemented side-by-side with Hadoop core technology.
Installing and configuring Hadoop is non-trivial. There are different roles and configurations that need to be
deployed on various nodes. Designing, deploying, and optimizing the network layer to match Hadoop's
scalability requires consideration of the type of workloads that will be running on the Hadoop cluster. The
deployment mechanism that Dell designed for Hadoop automates the deployment of the cluster from bare metal
(no operating system installed) all the way to installing and configuring the Cloudera software
components to your specific requirements. Intermediary steps include system BIOS update and configuration,
RAID/SAS configuration, operating system deployment, Hadoop software deployment, Hadoop software
configuration, and integration with your data center applications (e.g., monitoring and alerting).
Data backup and recovery is another topic that was brought up during customer roundtables. As Hadoop
becomes the de facto platform for business-critical applications, the data that is stored in Hadoop is crucial for
ensuring business continuity. Dell's approach is to offer several enterprise-grade backup solutions and let the
customer choose, while providing reference architectures and deployment guides for streamlined, consistent,
low-risk implementations.
Lastly, Dell's open, integrated approach to enterprise-wide systems management enables you to build a
comprehensive system management solution based on open standards and integrated with industry-leading
partners. Instead of building a patchwork of solutions leading to systems management sprawl, Dell integrates
the management of the Dell hardware running the Hadoop cluster with the traditional Hadoop management
consoles (Ganglia, Nagios).
To summarize, Dell has added Hadoop to its data analytics solutions portfolio. Dell's end-to-end solution
approach means that Dell will provide readily available software interfaces for integration between the
solutions in the portfolio.
In the current design, the Dell | Cloudera Hadoop Solution contains the core components of a typical Hadoop
deployment (HDFS, MapReduce, etc.) and auxiliary services (monitoring, reporting, security, etc.) that span the
entire solution stack.

Dell | Cloudera Hadoop Solution Network Architecture
The implementation is a multi-node, multi-rack design that uses Dell PowerEdge C series machines. The
network infrastructure is implemented with Dell PowerConnect switches.
Dell | Cloudera Network Segmentation
The Dell | Cloudera solution implements at minimum three distinct, separate VLANs:
• Cluster Access LAN: connects the compute node NICs into a fabric used for sharing data and distributing work
tasks among compute nodes
• Cluster Management LAN: connects all the BMCs in the cluster nodes
• Cluster Edge LAN: connects the cluster to the outside world
Dell | Cloudera Network Design Overview
The network design for the Dell | Cloudera Hadoop Solution is based on two assumptions:
1. The majority of the traffic will be between the nodes within the cluster (on the Cluster Access LAN).
2. Since Hadoop implements logical proximity between nodes at the application level, there is no need to segment
the network at the rack (or group of racks) level; thus the network within the cluster can be flat.
The easiest way to build a flat network is by stacking the switches. A potential problem with stacking a large
number of switches is that traffic may need to traverse the stacking ring: as the ring grows, traffic
must pass through more and more switches to reach its destination. If the top-of-rack switches are not
stacked, traffic between racks must travel up to the core router and then back down to the
destination rack. Although this is only one Layer 2 hop away, the uplinks must share this traffic with traffic that
originates outside the environment.
In this design, the benefits of the switch stack are retained while the drawbacks are mitigated. The benefits include:
• Improved manageability: All switches in the stack are managed as a single unit.
• Efficient Spanning Tree: The stack is viewed as a single switch by the Spanning Tree Protocol.
• Link aggregation: Stacking multiple switches in a chassis allows a link aggregation group (LAG) across ports on
different switches in the stack.
• Reduced network traffic: Traffic between the individual switches in a stack is passed across the stacking cable,
reducing the amount of traffic passed upstream to network distribution switches.
• Higher speed: The stacking module supports a higher data rate than the 10GbE uplink module (12Gb per
stack port, offering 24Gb between switches).
• Lower cost: Uplink ports are shared by all switches in the stack, reducing the number of distribution switch ports
necessary to connect modular servers to the network.
• Simplified updates: The basic firmware management commands will propagate new firmware versions and boot
image settings to all switch stack members.
Figure 2 depicts the connections required to scale above a single stack of switches for the Hadoop cluster
interconnect. The top-of-rack switches depicted are connected redundantly to the six stacked Gigabit
Ethernet switches that provide network connectivity to the Hadoop cluster nodes.


Figure 2: Dell | Cloudera Network Switch Connectivity
Dell recommends Dell PowerConnect 6248 switches. The six stacked switches provide Ethernet
connectivity for three racks of servers. The customer can install up to 20 servers in each rack, which means
that the six stacked switches can connect up to 60 servers. Large-scale implementations that require more
than 60 machines need multiple six-switch stacks, each stack connecting no more than 60 machines.
These large-scale implementations also require a pair of core routers (or end-of-row routers) to connect
all stacks together. Each switch in each stack needs to uplink to the core routers using 10GbE links, as
represented in Figure 2. There is no need for core routers for implementations smaller than 60 nodes.
Dell's recommendation for core routers includes Arista Networks 7100 Series or 7500 Series 10Gb switches.
Figure 3 shows the connectivity between the Hadoop Name Nodes, the Edge Node, and the switches. The Edge
Node acts as a gateway between data stored in Hadoop and users connected to the corporate network.
For this reason, the Edge Node is connected to both the cluster network (via the stacked switches) and the corporate
network. You have the option of removing the Edge Nodes and directly connecting the Hadoop network to
your organization's network.


Figure 3: Hadoop Top of Rack Network Connectivity
The BMC of each node connects to the Cluster Management LAN using a dedicated (non-shared) NIC.
VLAN Number Use Connectivity
Untagged BMC Ports
10 Hadoop Data Network Slave Node LAG groups

Dell Certified Switch Solution
This Reference Architecture includes the use of Dell PowerConnect 6248 Gigabit Ethernet switches as the
top-of-rack connectivity to all Hadoop related nodes. The reference architecture is used to ensure consistency
in rapid deployments through the minimal differences in the network configuration.
Dell Open Switch Solution
In addition to the Dell switch-based reference architecture, Dell provides an open standard that allows you to
choose other brands and configurations of switches for your Hadoop environment. The following list of
requirements will enable other brands of switches to properly operate with the tools and configurations in the
Dell Hadoop Reference Architecture.
• Support for IEEE 802.1Q VLAN traffic and port tagging
• Ability to provide a minimum of 170 Gigabit Ethernet ports in a non-blocking configuration within VLAN 10
o Configuration can be a single switch or a combination of stacked switches to meet the additional
requirements
o The ability to create link aggregation groups (LAGs) with a minimum of two physical links in each LAG

• If multiple switches are stacked:
o The ability to create a LAG across stacked switches is required
o Full-bisection bandwidth
o Support for VLANs to be available across all switches in the stack
• The ability to provide a minimum of 65 10/100 Ethernet ports on the untagged VLAN
• 250,000 packets-per-second capability per switch
• The ability to provide 12 10Gb ports for redundant uplinks contained in VLAN 10
• A managed switch that supports SSH and serial-line configuration
• SNMP v3 support
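
The open switch requirements listed above can be checked mechanically when evaluating a third-party switch. The following Python sketch is illustrative only: the SwitchFabric class, its field names, and the unmet_requirements function are hypothetical and are not part of any Dell or Cloudera tool. Only the thresholds (170 Gigabit ports, 65 10/100 ports, 12 10Gb uplinks, 250,000 packets per second, two links per LAG) come from the list.

```python
from dataclasses import dataclass


@dataclass
class SwitchFabric:
    """Hypothetical summary of a candidate switch, or stack of switches, for one pod."""
    vlan_8021q: bool                   # IEEE 802.1Q VLAN traffic and port tagging
    gigabit_ports_vlan10: int          # non-blocking Gigabit Ethernet ports in VLAN 10
    fast_ethernet_ports_untagged: int  # 10/100 ports on the untagged VLAN
    ten_gig_uplink_ports: int          # 10Gb uplink ports in VLAN 10
    packets_per_second: int            # forwarding capability per switch
    links_per_lag: int                 # physical links supported in each LAG
    ssh_and_serial: bool               # managed via SSH and serial line
    snmp_v3: bool
    stacked: bool = False
    cross_stack_lag: bool = False
    full_bisection: bool = False
    vlans_span_stack: bool = False


def unmet_requirements(fabric):
    """Return the labels of the listed requirements that the fabric does not meet."""
    checks = [
        (fabric.vlan_8021q, "IEEE 802.1Q VLAN support"),
        (fabric.gigabit_ports_vlan10 >= 170, "at least 170 Gigabit Ethernet ports in VLAN 10"),
        (fabric.links_per_lag >= 2, "LAGs with at least two physical links"),
        (fabric.fast_ethernet_ports_untagged >= 65, "at least 65 10/100 ports on the untagged VLAN"),
        (fabric.packets_per_second >= 250_000, "250,000 packets per second per switch"),
        (fabric.ten_gig_uplink_ports >= 12, "12 10Gb uplink ports in VLAN 10"),
        (fabric.ssh_and_serial, "SSH and serial-line management"),
        (fabric.snmp_v3, "SNMP v3 support"),
    ]
    if fabric.stacked:
        # These three items apply only when several switches are stacked.
        checks += [
            (fabric.cross_stack_lag, "LAG across stacked switches"),
            (fabric.full_bisection, "full-bisection bandwidth"),
            (fabric.vlans_span_stack, "VLANs available across the stack"),
        ]
    return [label for ok, label in checks if not ok]
```

An empty list from unmet_requirements means the candidate satisfies every listed item; any returned label names a requirement to take up with the switch vendor.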


Figure 4: Dell | Cloudera Node-Level Network Connectivity

Dell | Cloudera NIC Teaming
Since the network must be able to handle the ever-increasing volume of data, Dell recommends using
network teaming to increase the available bandwidth to each node in the Hadoop cluster. Teaming the NICs
for bandwidth allows both NICs to be active and the operating system to balance the network traffic across all
NICs in the team.
The teaming configuration that Dell recommends is balance-alb (mode = 6). This configuration setting is
explained in greater detail in the Hadoop Solution Deployment Guide. The Dell Crowbar deployment software
automatically configures this setting for Hadoop environments.
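
As an illustration of what a mode 6 team looks like on a Red Hat-style system, the Python sketch below renders the text of the interface files for one bond. It is a minimal sketch under stated assumptions, not the configuration that Crowbar generates: the function name render_bond_config, the device names bond0, eth0, and eth1, and the miimon value of 100 are all assumptions, and IP addressing is deliberately left out.

```python
def render_bond_config(bond="bond0", slaves=("eth0", "eth1"),
                       mode="balance-alb", miimon=100):
    """Return {file name: file content} for one bonded interface and its member NICs.

    The device names and the miimon link-monitoring interval are illustrative
    assumptions; only the bonding mode (balance-alb, mode 6) comes from the guide.
    """
    files = {
        f"ifcfg-{bond}": "\n".join([
            f"DEVICE={bond}",
            "ONBOOT=yes",
            "BOOTPROTO=none",
            f'BONDING_OPTS="mode={mode} miimon={miimon}"',
        ]) + "\n"
    }
    for nic in slaves:
        # Each physical NIC is enslaved to the bond and carries no address itself.
        files[f"ifcfg-{nic}"] = "\n".join([
            f"DEVICE={nic}",
            f"MASTER={bond}",
            "SLAVE=yes",
            "ONBOOT=yes",
            "BOOTPROTO=none",
        ]) + "\n"
    return files


if __name__ == "__main__":
    for name, content in render_bond_config().items():
        print(f"# {name}\n{content}")
```

Generating the files from one function keeps the bond and its member NICs consistent, which matters when the same settings are repeated across every node in a rack.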
IPv6 Capabilities
At this time, the Dell | Cloudera Hadoop Solution does not support or allow for the use of IPv6 for network
connectivity. All deployments are based on IPv4 with IPv6 explicitly disabled on all nodes within the Hadoop
environment.

Dell | Cloudera Hadoop Solution Hardware Architecture
The Dell | Cloudera hardware architecture contains recommended configurations for the following services within a Hadoop cluster:
• Master Node: sometimes called the Name Node, it runs all the services needed to manage HDFS data storage and
MapReduce task distribution and tracking.
• Slave Node: runs all the services required to store blocks of data on the local hard drives and execute processing
tasks against that data.
• Edge Node: provides the interface between the data and processing capacity available in the Hadoop cluster and a user
of that capacity.
• Admin Node: provides cluster deployment and management capabilities.


Figure 5: Dell | Cloudera Hardware Architecture
High-level Architecture
Dell | Cloudera Sizing Terms
The Dell | Cloudera Reference Architecture is organized into three components for sizing as the Hadoop
environment grows. The smallest designation is rack, moving to a pod, and then into a cluster. Each has
specific characteristics and sizing considerations documented in this reference architecture. The design goal
for the Hadoop environment is to enable you to scale the environment by adding the additional capacity as
needed, without the need to replace any existing components.
Rack
A rack is the smallest designation of a Hadoop environment. A rack consists of all the necessary power,
network cabling, and two Ethernet switches necessary to support up to 20 data nodes. These nodes should
utilize their own power connectivity and space within the data center, separate from other racks, and be
treated as a fault zone.
Pod
A pod is a single set of stacked Ethernet switches. In this reference architecture, a pod always contains exactly
six switches (the minimum and maximum are both six). A pod consists of the administration and operation infrastructure to support three racks.
Cluster
A cluster consists of more than one pod, up to a maximum of 12 pods. A cluster is a set of Hadoop nodes that
share the same Name Node and management tools for operating the Hadoop environment.
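
The rack, pod, and cluster limits above reduce to simple arithmetic. The following Python sketch is one way to express it. The function name size_cluster and the returned field names are illustrative, and the rule that core routers are needed as soon as a second pod is added is an interpretation of the network design section rather than a stated formula.

```python
import math

NODES_PER_RACK = 20        # a rack supports up to 20 data nodes
RACKS_PER_POD = 3          # a pod supports three racks
SWITCHES_PER_POD = 6       # a pod is one stack of six switches
MAX_PODS_PER_CLUSTER = 12  # a cluster has at most 12 pods


def size_cluster(data_nodes):
    """Return the racks, pods, and stacked switches needed for a number of data nodes."""
    if data_nodes < 1:
        raise ValueError("at least one data node is required")
    racks = math.ceil(data_nodes / NODES_PER_RACK)
    pods = math.ceil(racks / RACKS_PER_POD)
    if pods > MAX_PODS_PER_CLUSTER:
        raise ValueError("exceeds the 12-pod (720 data node) cluster limit")
    return {
        "racks": racks,
        "pods": pods,
        "stacked_switches": pods * SWITCHES_PER_POD,
        # Assumption: a pair of core (end-of-row) routers is needed once
        # more than one switch stack has to be interconnected.
        "needs_core_routers": pods > 1,
    }


if __name__ == "__main__":
    for nodes in (20, 60, 61, 720):
        print(nodes, size_cluster(nodes))
```

For example, 61 data nodes no longer fit in a single pod, so the result calls for four racks, two pods, and twelve stacked switches.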
Table 2: Dell | Cloudera Hardware Configurations
Machine Function | Master Node (Admin Node) | Data Node (Slave Node) | Edge Node
Platform | PowerEdge C2100 | PowerEdge C2100 | PowerEdge C2100
CPU | 2x E5645 (6-core) | 2x E5606 (4-core) (optional 2x E5645 6-core) | 2x E5645 (6-core)
RAM (Minimum) | 96GB | 24GB | 48GB
Add-in NIC | One dual-port Intel 1GigE | None | One dual-port Intel 10GigE
Disk | 6x 600GB SAS NL 3.5" | 12x 1TB SATA 7.2K 3.5" | 6x 600GB SAS NL 3.5"
Storage Controller | PERC H700 | LSI2008 | PERC H700
RAID | RAID 10 | JBOD | RAID 10
Min per Rack | | 1 |
Max per Rack | | 20 |
Min per Pod | 2 | 3* | 1
Max per Pod | 2 | 60 |
Min per Cluster | 2 | 36 | 1
Max per Cluster | 2 | 720 |
* A minimum of five Data Nodes is needed if ZooKeeper will be used in the environment.

Table 3: Dell | Cloudera Hadoop Solution Software Locations
Daemon | Primary Location | Secondary Location
JobTracker | MasterNode02 | MasterNode01
TaskTracker | SlaveNode(x) |
DataNode | SlaveNode(x) |
NameNode | MasterNode01 | MasterNode02
Operating System Provisioning | MasterNode02 | MasterNode01
Chef | MasterNode02 | MasterNode01
Yum Repositories | MasterNode02 | MasterNode01

Table 4: Dell | Cloudera Hadoop Solution Support Matrix
RA Version | OS Version | Hadoop Version | Available Support | Supported JVM
1.0 | Red Hat Enterprise Linux 5.6 | Cloudera CDH3 Enterprise | Dell Hardware Support; Cloudera Hadoop Support; Red Hat Linux Support | Sun Oracle JVM 1.6u20, u21, u23, u24, u25
1.0 | CentOS 5.6 | Cloudera CDH3 Community | Dell Hardware Support | Sun Oracle JVM 1.6u20, u21, u23, u24, u25
High-level Network Architecture
The network interconnects between the various hardware components of the solution are depicted in Figure 6.

Figure 6: Dell | Cloudera Compute Node Network Interconnects

The network cabling within the Dell | Cloudera Hadoop Solution is described in the following table.
Table 5: Dell | Cloudera Hadoop Solution Network Cabling
Component
NICs to Switch Port
LOM1 LOM2 PCI-NIC1 PCI-NIC2 BMC
Master Node
Data Node N/A N/A
Edge Node

Legend
Cluster Production LAN
Cluster Management LAN
Cluster Edge LAN


Hadoop Network Cable Scheme and Connections


Node | Connection | Switch | Port
rNN-n01 | LOM1 | rNN-sw01 | 1
rNN-n01 | BMC | rNN-sw01 | 25
rNN-n01 | LOM2 | rNN-sw02 | 1
rNN-n02 | LOM1 | rNN-sw01 | 2
rNN-n02 | BMC | rNN-sw01 | 26
rNN-n02 | LOM2 | rNN-sw02 | 2
rNN-n03 | LOM1 | rNN-sw01 | 3
rNN-n03 | BMC | rNN-sw01 | 27
rNN-n03 | LOM2 | rNN-sw02 | 3
rNN-n04 | LOM1 | rNN-sw01 | 4
rNN-n04 | BMC | rNN-sw01 | 28
rNN-n04 | LOM2 | rNN-sw02 | 4
rNN-n05 | LOM1 | rNN-sw01 | 5
rNN-n05 | BMC | rNN-sw01 | 29
rNN-n05 | LOM2 | rNN-sw02 | 5
rNN-n06 | LOM1 | rNN-sw01 | 6
rNN-n06 | BMC | rNN-sw01 | 30
rNN-n06 | LOM2 | rNN-sw02 | 6
rNN-n07 | LOM1 | rNN-sw01 | 7
rNN-n07 | BMC | rNN-sw01 | 31
rNN-n07 | LOM2 | rNN-sw02 | 7
rNN-n08 | LOM1 | rNN-sw01 | 8
rNN-n08 | BMC | rNN-sw01 | 32
rNN-n08 | LOM2 | rNN-sw02 | 8
rNN-n09 | LOM1 | rNN-sw01 | 9
rNN-n09 | BMC | rNN-sw01 | 33
rNN-n09 | LOM2 | rNN-sw02 | 9
rNN-n10 | LOM1 | rNN-sw01 | 10
rNN-n10 | BMC | rNN-sw01 | 34
rNN-n10 | LOM2 | rNN-sw02 | 10
rNN-n11 | LOM1 | rNN-sw01 | 11
rNN-n11 | BMC | rNN-sw01 | 35
rNN-n11 | LOM2 | rNN-sw02 | 11
rNN-n12 | LOM1 | rNN-sw01 | 12
rNN-n12 | BMC | rNN-sw01 | 36
rNN-n12 | LOM2 | rNN-sw02 | 12
rNN-n13 | LOM1 | rNN-sw01 | 13
rNN-n13 | BMC | rNN-sw01 | 37
rNN-n13 | LOM2 | rNN-sw02 | 13
rNN-n14 | LOM1 | rNN-sw01 | 14
rNN-n14 | BMC | rNN-sw01 | 38
rNN-n14 | LOM2 | rNN-sw02 | 14
rNN-n15 | LOM1 | rNN-sw01 | 15
rNN-n15 | BMC | rNN-sw01 | 39
rNN-n15 | LOM2 | rNN-sw02 | 15
rNN-n16 | LOM1 | rNN-sw01 | 16
rNN-n16 | BMC | rNN-sw01 | 40
rNN-n16 | LOM2 | rNN-sw02 | 16
rNN-n17 | LOM1 | rNN-sw01 | 17
rNN-n17 | BMC | rNN-sw01 | 41
rNN-n17 | LOM2 | rNN-sw02 | 17
rNN-n18 | LOM1 | rNN-sw01 | 18
rNN-n18 | BMC | rNN-sw01 | 42
rNN-n18 | LOM2 | rNN-sw02 | 18
rNN-n19 | LOM1 | rNN-sw01 | 19
rNN-n19 | BMC | rNN-sw01 | 43
rNN-n19 | LOM2 | rNN-sw02 | 19
rNN-n20 | LOM1 | rNN-sw01 | 20
rNN-n20 | BMC | rNN-sw01 | 44
rNN-n20 | LOM2 | rNN-sw02 | 20
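
The cabling table above follows a regular pattern: node n uses port n on both rack switches for its two LOM ports, and port 24 + n on the first switch for its BMC. The Python sketch below reproduces that pattern. The function name node_cabling is illustrative, and treating "NN" as a two-digit rack number is an assumption based on the naming used in the end-of-row table.

```python
def node_cabling(rack, node):
    """Return (node, connection, switch, port) rows for one node in one rack.

    Assumes the "NN" in rNN-nXX stands for a zero-padded rack number.
    """
    if not 1 <= node <= 20:
        raise ValueError("a rack holds nodes 1 through 20")
    prefix = f"r{rack:02d}"
    name = f"{prefix}-n{node:02d}"
    return [
        (name, "LOM1", f"{prefix}-sw01", node),       # first LOM port, first switch
        (name, "BMC", f"{prefix}-sw01", 24 + node),   # BMC ports start at port 25
        (name, "LOM2", f"{prefix}-sw02", node),       # second LOM port, second switch
    ]


def rack_cabling(rack):
    """Return the full cabling plan for a 20-node rack."""
    rows = []
    for node in range(1, 21):
        rows.extend(node_cabling(rack, node))
    return rows


if __name__ == "__main__":
    for row in rack_cabling(1)[:6]:
        print(*row)
```

Generating the plan rather than typing it can help when producing cable labels or checking a rack against the scheme during installation.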

End-of-Row Switch Port Connectivity
POD Number ToR Switch ToR Switch Port EoR Switch EoR Switch Port
1 r01-s01 10GbE1 Eor-row01-sw01 1
1 r01-s01 10GbE2 Eor-row01-sw02 1
1 r01-s02 10GbE1 Eor-row01-sw01 2
1 r01-s02 10GbE2 Eor-row01-sw02 2
1 r02-s01 10GbE1 Eor-row01-sw01 3
1 r02-s01 10GbE2 Eor-row01-sw02 3
1 r02-s02 10GbE1 Eor-row01-sw01 4
1 r02-s02 10GbE2 Eor-row01-sw02 4
1 r03-s01 10GbE1 Eor-row01-sw01 5
1 r03-s01 10GbE2 Eor-row01-sw02 5
1 r03-s02 10GbE1 Eor-row01-sw01 6
1 r03-s02 10GbE2 Eor-row01-sw02 6
2 r01-s01 10GbE1 Eor-row01-sw01 7
2 r01-s01 10GbE2 Eor-row01-sw02 7
2 r01-s02 10GbE1 Eor-row01-sw01 8
2 r01-s02 10GbE2 Eor-row01-sw02 8
2 r02-s01 10GbE1 Eor-row01-sw01 9
2 r02-s01 10GbE2 Eor-row01-sw02 9
2 r02-s02 10GbE1 Eor-row01-sw01 10
2 r02-s02 10GbE2 Eor-row01-sw02 10
2 r03-s01 10GbE1 Eor-row01-sw01 11
2 r03-s01 10GbE2 Eor-row01-sw02 11
2 r03-s02 10GbE1 Eor-row01-sw01 12
2 r03-s02 10GbE2 Eor-row01-sw02 12
3 r01-s01 10GbE1 Eor-row01-sw01 13
3 r01-s01 10GbE2 Eor-row01-sw02 13
3 r01-s02 10GbE1 Eor-row01-sw01 14
3 r01-s02 10GbE2 Eor-row01-sw02 14
3 r02-s01 10GbE1 Eor-row01-sw01 15
3 r02-s01 10GbE2 Eor-row01-sw02 15
3 r02-s02 10GbE1 Eor-row01-sw01 16
3 r02-s02 10GbE2 Eor-row01-sw02 16
3 r03-s01 10GbE1 Eor-row01-sw01 17
3 r03-s01 10GbE2 Eor-row01-sw02 17
3 r03-s02 10GbE1 Eor-row01-sw01 18
3 r03-s02 10GbE2 Eor-row01-sw02 18
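
The end-of-row port assignments in the table above also follow a formula: each pod consumes six consecutive ports on both end-of-row switches, two per rack, and each top-of-rack switch sends its first 10GbE uplink to Eor-row01-sw01 and its second to Eor-row01-sw02 on the same port number. The Python sketch below encodes this. The function name eor_uplinks is illustrative, and applying the same formula beyond the three pods shown in the table is an assumption.

```python
def eor_uplinks(pod, rack, switch):
    """Return the two uplink rows for one top-of-rack switch.

    pod    -- pod number, starting at 1
    rack   -- rack number within the pod, 1 through 3
    switch -- switch number within the rack, 1 or 2
    """
    if pod < 1 or not 1 <= rack <= 3 or switch not in (1, 2):
        raise ValueError("expected pod >= 1, rack in 1..3, switch in 1..2")
    # Six ports per pod, two per rack, one per switch.
    port = (pod - 1) * 6 + (rack - 1) * 2 + switch
    tor = f"r{rack:02d}-s{switch:02d}"
    return [
        (pod, tor, "10GbE1", "Eor-row01-sw01", port),
        (pod, tor, "10GbE2", "Eor-row01-sw02", port),
    ]


if __name__ == "__main__":
    for pod in (1, 2, 3):
        for rack in (1, 2, 3):
            for switch in (1, 2):
                for row in eor_uplinks(pod, rack, switch):
                    print(*row)
```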

Dell | Cloudera Hadoop Solution Compute: PowerEdge C2100
The Dell PowerEdge C2100 server is focused on Hadoop, MapReduce, and high-performance Web
applications where high spindle count and high memory density are key. The PowerEdge C2100 offers
multiple backplane and drive configurations to provide you with the flexibility to meet your application needs.
The Dell Hadoop solution includes hardware and software support using Dell's world-class, worldwide services
organization. When you use a Dell | Cloudera Hadoop Solution, all the components of your solution are tested,
validated, and supported by Dell.

Figure 7: PowerEdge C2100
PowerEdge C2100 feature summary:
• Two-socket Intel Xeon 5600 series processors for ultimate performance
• 18 DDR3 memory slots that enable larger memory footprints while utilizing smaller memory modules
• 12 3.5-inch SAS/SATA and two 2.5-inch internal SATA/SSD drives
• Dual redundant or single 750W high-efficiency power supply options
• Intel 82576 dual-port embedded Gigabit Ethernet NIC with industry-leading virtualization performance
• IPMI 2.0-compliant BMC management
• 6Gb SAS hard drive and controller support
• Two x8 PCI-E Gen2 slots and two x4 PCI-E Gen2 dedicated mezzanine slots
• Rack installation, Dell Basic and ProSupport options, next-business-day support only

Dell | Cloudera Hadoop Solution Software Architecture
Linux File System Configuration Definition
Dell recommends and supports the use of ext3 for all HDFS disks.
Disk Partitioning Recommendation for the Name Node
There are two types of services running on the Name Node:
• JobTracker (supports MapReduce job distribution)
• NameNode (supports HDFS data storage)
There are two types of services running on the Slave Node:
• TaskTracker daemon (supports MapReduce job execution)
• DataNode daemon (supports HDFS data storage)
All disk configuration parameters are documented in the Dell | Cloudera Hadoop Solution Deployment Guide,
as well as Linux Kickstart scripts for proper configuration at the time of operating system installation.
Hadoop Ecosystem Services and Utilities Mapping

Table 6: Hadoop Ecosystem Utilities Mapping
Component Master Node Slave Node Edge Node Utilize From Administer From
Pig X X X Edge Node Edge Node
Hive Edge Node Edge Node
Sqoop X Edge Node Edge Node
Zookeeper X (5) Edge Node Edge Node


Dell | Cloudera Hadoop Solution Software Configuration
Dell | Cloudera Hadoop Solution Configuration Parameters Recommended Values

Table 7: hdfs-site.xml
Property | Description | Value
dfs.block.size | A lower value offers more parallelism. | 134217728 (128 MB)
dfs.data.dir | Comma-separated list of folders (no spaces) where a DataNode stores its blocks. | Cluster specific
dfs.datanode.handler.count | Number of handlers dedicated to serving data block requests in Hadoop DataNodes. | 16 (start with 2 x CORE_COUNT in each DataNode)
dfs.namenode.handler.count | More NameNode server threads to handle RPCs from a large number of DataNodes. | Start with 10; increase for large clusters (a higher count drives higher CPU, RAM, and network utilization)
dfs.datanode.du.reserved | The amount of space on each storage volume that HDFS should not use. | 10M
dfs.replication | Data replication factor. Default is 3. | 3 (default)
fs.trash.interval | Time interval between HDFS space reclaiming. | 1440 (minutes)
dfs.permissions | | true (default)
dfs.datanode.handler.count | | 8
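
Hadoop site files are XML documents made of name/value property elements inside a configuration element. As an illustration, the Python sketch below renders a subset of the Table 7 values into that layout. The helper name build_site_xml and the choice of which properties to include are illustrative. Cluster-specific directories are omitted, and the value 16 is used for dfs.datanode.handler.count purely as an example.

```python
import xml.etree.ElementTree as ET

# Subset of the Table 7 recommendations; directory settings are cluster specific
# and therefore left out of this example.
HDFS_SITE = {
    "dfs.block.size": "134217728",        # 128 MB
    "dfs.datanode.handler.count": "16",
    "dfs.namenode.handler.count": "10",
    "dfs.replication": "3",
    "dfs.permissions": "true",
}


def build_site_xml(properties):
    """Render a {name: value} mapping as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_site_xml(HDFS_SITE))
```

Keeping the recommended values in one mapping makes it easier to review them against the table and to generate identical files for every node.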

Table 8: mapred-site.xml
Property | Description | Value
mapred.child.java.opts | Larger heap size for child JVMs of maps/reduces. | -Xmx1024M
mapred.job.tracker | Hostname or IP address and port of the JobTracker. | TBD
mapred.job.tracker.handler.count | More JobTracker server threads to handle RPCs from a large number of TaskTrackers. | Start with 32; increase for large clusters (a higher count drives higher CPU, RAM, and network utilization)
mapred.reduce.tasks | The number of reduce tasks per job. | Set to a prime close to the number of available hosts
mapred.local.dir | Comma-separated list of folders (no spaces) where a TaskTracker stores runtime information. | Cluster-specific
mapred.tasktracker.map.tasks.maximum | Maximum number of map tasks to run on the node. | 2 + (2/3) * number of cores per node
mapred.tasktracker.reduce.tasks.maximum | Maximum number of reduce tasks to run per node. | 2 + (1/3) * number of cores per node
mapred.child.ulimit | | 2097152
mapred.map.tasks.speculative.execution | | FALSE
mapred.reduce.tasks.speculative.execution | | FALSE
mapred.job.reuse.jvm.num.tasks | | -1

Table 9: default.xml
Property | Description | Value
SCAN_IPC_CACHE_LIMIT | Number of rows cached in the search engine for each scanner next call over the wire. Caching 300 rows in each trip reduces network round trips by a factor of 300. | 100
LOCAL_JOB_HANDLER_COUNT | Number of parallel queries executed at one time. Query requests above this limit are queued. | 30
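
Several of the mapred-site.xml values are formulas rather than constants. The Python sketch below works them through for a given core count and host count. The function names are illustrative. Rounding the fractional part down and preferring the lower prime when two primes are equally close are assumptions, since the guide does not specify either.

```python
def max_map_tasks(cores):
    """2 + (2/3) * cores, with the fraction rounded down (rounding is an assumption)."""
    return 2 + (2 * cores) // 3


def max_reduce_tasks(cores):
    """2 + (1/3) * cores, with the fraction rounded down (rounding is an assumption)."""
    return 2 + cores // 3


def is_prime(n):
    """Trial-division primality test, adequate for host counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def nearest_prime(hosts):
    """Return the prime closest to the host count, preferring the lower one on a tie."""
    if hosts < 2:
        return 2
    for delta in range(hosts):
        for candidate in (hosts - delta, hosts + delta):
            if is_prime(candidate):
                return candidate
    return 2  # not reached for hosts >= 2; kept as a safe default


if __name__ == "__main__":
    cores = 12   # for example, two 6-core processors
    hosts = 60   # for example, one full pod of data nodes
    print("mapred.tasktracker.map.tasks.maximum =", max_map_tasks(cores))
    print("mapred.tasktracker.reduce.tasks.maximum =", max_reduce_tasks(cores))
    print("mapred.reduce.tasks =", nearest_prime(hosts))
```

With 12 cores per node the formulas give 10 map slots and 6 reduce slots, and for 60 hosts the nearest prime under the tie rule above is 59.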




Table 10: hadoop-env.sh
Property | Description | Value
java.net.preferIPv4Stack | | true
JAVA_HOME | |
HADOOP_*_OPTS | | -Xmx2048m

Table 11: /etc/fstab
Property | Description | Value
File system mount options | | data=writeback,nodiratime,noatime

Table 12: core-site.xml
Property | Description | Value
io.file.buffer.size | The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations. | 65536 (64 KB)
fs.default.name | The name of the default file system. A URI whose scheme and authority determine the file system implementation. | Example: hdfs://someserver.example.com:8020/
fs.checkpoint.dir | Comma-separated list of directories on the local file system of the Secondary Master Node where its checkpoint images are stored. | TBD
io.sort.factor | | 80
io.sort.mb | | 512
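
Hadoop site files hold each setting as a property element, with name and value children, inside a single configuration root. As a minimal sketch of how the Table 12 values could be turned into that layout, the helper below (a hypothetical name, not part of the solution) renders a dictionary of settings; fs.checkpoint.dir is left out because its value is still to be determined.

```python
import xml.etree.ElementTree as ET


def render_site_xml(props: dict) -> str:
    """Render settings as <configuration><property>... XML text."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Values from Table 12; the host name is the table's own example.
core_site = {
    "io.file.buffer.size": 65536,
    "fs.default.name": "hdfs://someserver.example.com:8020/",
    "io.sort.factor": 80,
    "io.sort.mb": 512,
}
xml_text = render_site_xml(core_site)
print(xml_text)
```

The same helper would serve for the other *-site.xml tables, since they share the file layout.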

Table 13: /etc/security/limits.conf
Property | Description | Value
mapred nofile | | 32768
hdfs nofile | | 32768
hbase nofile | | 32768
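
Entries in /etc/security/limits.conf take the form "domain type item value". Table 13 does not state the limit type, so the sketch below assumes "-", which sets both the soft and the hard limit; the function name is hypothetical.

```python
def limits_conf_lines(users, item="nofile", value=32768, limit_type="-"):
    """Build limits.conf entries; limit_type '-' (soft and hard) is an assumption."""
    return [f"{user} {limit_type} {item} {value}" for user in users]


# Service accounts and open-file limit from Table 13.
for line in limits_conf_lines(["mapred", "hdfs", "hbase"]):
    print(line)
```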

Dell | Cloudera Hadoop Solution Deployment Methodology
Site Preparation Needed for the Deployment
The heating, ventilation, and air conditioning (HVAC) and power requirements can be estimated using the Dell
Energy Smart Solution Advisor at:
http://www.dell.com/content/topics/topic.aspx/global/products/pedge/topics/en/config_calculator?c=us&cs=555&l=en&s=biz
Using this tool, you can plan the needs for your solution, order the correct PDUs, and ensure that the proper
HVAC is ready for the installation.
Detailed deployment instructions are documented in the Dell | Cloudera Hadoop Solution Deployment Guide.

Dell | Cloudera Hadoop Solution Hardware Monitoring and Alerting
To automate alerting and response for unexpected events and failures within the Dell | Cloudera Hadoop Solution, the software
stack includes Nagios and Ganglia. The Dell | Cloudera Hadoop Solution includes capabilities for three primary components of
the monitoring environment:
• Monitoring of cluster activities: The Dell | Cloudera Hadoop Solution utilizes Nagios to monitor the cluster,
including hardware, software, and users. The Nagios deployment that is part of the Dell | Cloudera Hadoop Solution
keeps historical information regarding system availability, maintenance, and failure events.
• Alerting of unexpected events: The Dell | Cloudera Hadoop Solution utilizes Nagios to alert system operations staff
to events that deviate from normal operation and that the administrator has designated for
notification.
• Debugging of cluster runtime operations: The Dell | Cloudera Hadoop Solution utilizes Cloudera Enterprise to
provide the users and administrators of the Hadoop environment with the necessary tools for tracking, debugging,
and monitoring job performance and characteristics.
The Dell | Cloudera Hadoop Solution is designed to include the necessary components to monitor and
respond to events in your Hadoop environment, while being flexible enough to allow integration with existing
operations management frameworks in your environment.
The monitoring components of the Dell | Cloudera Hadoop Solution Reference Architecture are designed to
be proactive in nature, alerting the IT operations team when failures occur in the environment, but before they
cause an outage that affects production workloads and users.
Nagios
Nagios is an open source solution for enterprise monitoring. It utilizes a pluggable architecture to allow for
consistent event handling, while supporting a wide variety of sensors, plug-ins, applications, servers, and
hardware platforms.
The Dell | Cloudera Hadoop Solution includes Nagios as part of all default installations. The Dell | Cloudera
Hadoop Solution will automatically install the Nagios console and the necessary Nagios plug-ins for
monitoring the Hadoop cluster, including processes, operating systems, and physical servers.
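
Nagios plug-ins report status through their exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and a single line of output. The following is a minimal sketch of a custom check written to that convention, not one of the plug-ins installed by the solution; the function names and the 80%/90% thresholds are assumptions chosen for illustration.

```python
import shutil

# Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_percent, warn=80.0, crit=90.0):
    """Map a disk-usage percentage to (exit code, status line)."""
    if used_percent >= crit:
        code = CRITICAL
    elif used_percent >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"DISK {LABELS[code]} - {used_percent:.1f}% used"


def check_path(path="/"):
    """Measure usage of the file system holding `path` and evaluate it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - file system reports zero size"
    return evaluate(100.0 * usage.used / usage.total)


code, message = check_path("/")
print(message)
# A real plug-in script would finish with: sys.exit(code)
```

Keeping the threshold logic in a pure function makes the check easy to test without touching a real disk.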
Ganglia
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters
and grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used
technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for
data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low
per-node overheads and high concurrency. The implementation is robust, has been ported to an extensive set
of operating systems and processor architectures, and is currently in use on thousands of clusters around the
world. It has been used to link clusters across university campuses and around the world and can scale to
handle clusters with 2,000 nodes.
The Dell | Cloudera Hadoop Solution automates the installation and configuration of Ganglia within the
Hadoop cluster, giving IT operations staff detailed reporting on the status and utilization of all Hadoop
nodes.
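
Because Ganglia represents its data as XML, per-node metrics can be extracted with a few lines of parsing. The sketch below works on a hand-written, heavily trimmed sample in the shape of a gmond XML dump (HOST elements containing METRIC elements with NAME and VAL attributes); the host names, addresses, and values are invented for illustration, and the function name is hypothetical. In a live cluster the XML would typically be read from the gmond TCP port (commonly 8649) instead of a string.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; a real dump carries many more attributes and metrics.
SAMPLE = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
  <CLUSTER NAME="hadoop">
    <HOST NAME="slave01" IP="10.0.0.11">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      <METRIC NAME="disk_free" VAL="1523.4" TYPE="double" UNITS="GB"/>
    </HOST>
    <HOST NAME="slave02" IP="10.0.0.12">
      <METRIC NAME="load_one" VAL="3.10" TYPE="float" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def metric_by_host(xml_text: str, metric: str) -> dict:
    """Return {host name: value} for one metric across all hosts."""
    root = ET.fromstring(xml_text)
    values = {}
    for host in root.iter("HOST"):
        for element in host.iter("METRIC"):
            if element.get("NAME") == metric:
                values[host.get("NAME")] = float(element.get("VAL"))
    return values


print(metric_by_host(SAMPLE, "load_one"))
```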
Cloudera Enterprise
Cloudera Enterprise is a subscription service that comprises Cloudera support and a portfolio of software,
including Cloudera Management Suite, which enables data-driven enterprises to run Apache Hadoop
environments in production cost-effectively and with repeatable success. By combining expert support with a
software layer that delivers deep visibility into and across Hadoop clusters, Cloudera Enterprise gives you an
efficient way to precisely provision and manage cluster resources. It also allows your IT shop to apply familiar
business metrics, such as measurable SLAs and chargebacks, to your Hadoop environment so it can run at
optimal utilization.
Cloudera Enterprise consists of two core components:
Cloudera Management Suite (CMS)
As Hadoop is increasingly used as a shared platform for multiple applications, organizations require the same manageability
characteristics that they expect from other popular enterprise technologies. Cloudera meets this need with a management
suite that helps your organization improve quality of service, increase compliance, and reduce administrative costs.

The Cloudera Management Suite includes:

• Service and Configuration Manager (SCM)
  o Deploy Hadoop in minutes.
  o Manage system services, automate changes, and validate settings.
• Activity Monitor
  o Consolidate all user activities into a single, real-time view.
  o Diagnose user performance and track activity metrics.
• Resource Manager
  o Report system resource usage.
  o Plan for capacity expansion.
• Authorization Manager
  o Centralize management of all users, groups, and privileges.
  o Manage permissions via delegated administration.
Cloudera Support
As the use of Hadoop grows and an increasing number of groups and applications move into production, your
internal customers will expect greater levels of performance and consistency. Cloudera's proactive
production-level support gives your administrators the expertise and responsiveness they need.
Cloudera Support includes:
• Flexible Support Windows
  o Choose 8x5 or 24x7 to meet SLA requirements.
• Configuration Checks
  o Verify that your Hadoop cluster is fine-tuned for your environment.
• Escalation and Issue Resolution
  o Resolve support cases with maximum efficiency.
• Comprehensive Knowledgebase
  o Expand your Hadoop knowledge with hundreds of articles and tech notes.
• Support for Certified Integration
  o Connect your Hadoop cluster to your existing data analysis tools.
• Proactive Notification
  o Stay up to speed with new developments and events.
With Cloudera Enterprise, you can leverage your existing team's experience and Cloudera's expertise to
operationalize your Hadoop system with ease. Built-in predictive capabilities anticipate shifts in the Hadoop
infrastructure, helping to ensure reliable operation.
Cloudera Enterprise makes it easy to run open source Hadoop in production:
• Simplify and accelerate Hadoop deployment.
• Reduce the costs and risks of adopting Hadoop in production.
• Reliably operate Hadoop in production with repeatable success.
• Apply SLAs to Hadoop.
• Increase control over Hadoop cluster provisioning and management.
Dell | Cloudera Hadoop Solution Security Design
What is available in CDH3
Cloudera's CDH3 release offers the following security features:
• There are two levels of authentication:
  o Cluster: SlaveNode to NameNode, TaskTracker to JobTracker
  o User: Unix-style file permissions
  NOTE: Access to the cluster can be restricted to Kerberos-authenticated users.
• Sqoop and Pig support security with no configuration required.
• ZooKeeper will operate normally in an unsecured mode with a secure Hadoop cluster.
• All cluster nodes need direct access to the Kerberos server.
Implementing Secure Hadoop
For implementation details, see the CDH3 Security Guide:
https://ccp.cloudera.com/display/CDHDOC/CDH3+Security+Guide

Figure 8: Kerberos Authentication in Hadoop

Appendix A: Bill of Materials
Name Node Bill of Materials
Component | Description | SKU
PowerEdge C2100 | PowerEdge C2100 Expander Backplane support for 3.5-inch Hard Drives, Redundant Power Supplies | 224-8350
Processor | 2x Intel Xeon E5640 2.66GHz, 12M Cache, Turbo, HT, 1066MHz Max Mem | 317-4110
Memory | 96GB Memory (12x8GB), 1333MHz Dual Ranked RDIMMs for 2 Processors, Optimized | 317-6324
Operating System | No Factory Installed Operating System |
OS Media | None |
Documentation/Disks | C2100 Documentation | 330-8774
Rails | C2100 Sliding Rail Kit | 330-8520
Power Cords | 2x C13 to C14, PDU Style, 12 AMP, 2 Feet (.6m), Power Cord | 330-6870
Hard Drive Controller and Configuration | Add-in LSI 9260-8i controllers for up to 12 HP Drives total | 342-1529, 342-0993
Hardware Support Services | 3 Year ProSupport and NBD On-site Service | 908-3264, 908-3304, 909-1668, 909-1677, 926-4080
Installation Services | CUSTOM INSTALLATION SOLUTION REQUIRED |
Keep Your Hard Drive | None |
Dell Recycling | None |
2nd Controller | None |
BIOS | None |
Hard Drives - DellSta | 600GB 15K RPM Serial-Attach SCSI 6Gbps 3.5in Hot Plug Hard Drive | 342-1544
Network Card | Intel Gigabit ET Dual Port 1GbE, PCIe x4 | 430-0937

Slave Node Bill of Materials
Component | Description | SKU
PowerEdge C2100 | PowerEdge C2100 Expander Backplane support for 3.5-inch Hard Drives, Single Power Supply | 224-8331
Processor | 2x Intel Xeon E5640 2.66GHz, 12M Cache, Turbo, HT, 1066MHz Max Mem | 317-4110
Memory | 24GB Memory (6x4GB), 1333MHz Dual Ranked RDIMMs for 2 Processors, Optimized | 317-6324
Operating System | No Factory Installed Operating System |
OS Media | None |
Documentation/Disks | C2100 Documentation | 330-8774
Rails | C2100 Sliding Rail Kit | 330-8520
Power Cords | C13 to C14, PDU Style, 12 AMP, 2 Feet (.6m), Power Cord | 330-6870
Hard Drive Controller and Configuration | Add-in 6Gb SAS Mezzanine controllers for up to 12 HP Drives total | 342-0989, 342-0990
Hardware Support Services | 3 Year ProSupport and NBD On-site Service | 908-3264, 908-3304, 909-1668, 909-1677, 926-4080
Installation Services | CUSTOM INSTALLATION SOLUTION REQUIRED |
Keep Your Hard Drive | None |
Dell Recycling | None |
2nd Controller | None |
BIOS | None |
Hard Drives - DellSta | 2TB 7.2K RPM SATA 3.5in Hot Plug Hard Drive | 342-0935
Network Card | None |
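
For capacity planning, the slave node drive size can be combined with the HDFS replication factor (dfs.replication, recommended value 3) to estimate usable storage. The following is a rough sketch under stated assumptions: the number of drives populated per node, the node count, and the fraction of disk reserved for non-HDFS use (temporary and intermediate data) are all hypothetical inputs, not values specified by this guide.

```python
def usable_hdfs_tb(nodes, drives_per_node, drive_tb=2.0,
                   replication=3, non_dfs_fraction=0.25):
    """Estimate usable HDFS capacity in TB.

    non_dfs_fraction is an assumed share of raw disk kept for
    non-HDFS data; adjust it to match the actual workload.
    """
    raw_tb = nodes * drives_per_node * drive_tb
    return raw_tb * (1.0 - non_dfs_fraction) / replication


# Hypothetical cluster: 20 slave nodes, each fully populated with 12 x 2TB drives.
print(usable_hdfs_tb(nodes=20, drives_per_node=12))
```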

Edge Node Bill of Materials
Component | Description | SKU
PowerEdge C2100 | PowerEdge C2100 Expander Backplane support for 3.5-inch Hard Drives, Redundant Power Supplies | 224-8350
Processor | 2x Intel Xeon E5640 2.66GHz, 12M Cache, Turbo, HT, 1066MHz Max Mem | 317-4110
Memory | 24GB Memory (6x4GB), 1333MHz Dual Ranked RDIMMs for 2 Processors, Optimized | 317-6324
Operating System | No Factory Installed Operating System |
OS Media | None |
Documentation/Disks | C2100 Documentation | 330-8774
Rails | C2100 Sliding Rail Kit | 330-8520
Power Cords | 2x C13 to C14, PDU Style, 12 AMP, 2 Feet (.6m), Power Cord | 330-6870
Hard Drive Controller and Configuration | Add-in LSI 9260-8i controllers for up to 12 HP Drives total | 342-1529, 342-0993
Hardware Support Services | 3 Year ProSupport and NBD On-site Service | 908-3264, 908-3304, 909-1668, 909-1677, 926-4080
Installation Services | CUSTOM INSTALLATION SOLUTION REQUIRED |
Keep Your Hard Drive | None |
Dell Recycling | None |
2nd Controller | None |
BIOS | None |
Hard Drives - DellSta | 600GB 15K RPM Serial-Attach SCSI 6Gbps 3.5in Hot Plug Hard Drive | 342-1544
Network Card | Intel 82599 Dual Port 10GE Mezzanine Card | 342-0727
Network Card | Intel Gigabit ET Dual Port 1GbE, PCIe x4 | 430-0937

Network Connectivity Bill of Materials: Top of Rack
Component | Description | SKU
PowerConnect 6248P | PowerConnect 6248P, 48 GbE Ports, Managed Switch, 10GbE and Stacking Capable |
Front-end SFP Fiber Transceivers | None |
Modular Upgrade Bay 1: Modules | Stacking Module, 48Gbps, Includes 1m Stacking Cable |
Modular Upgrade Bay 1: Optics | None |
Modular Upgrade Bay 2: Modules | None |
Modular Upgrade Bay 2: Optics | None |
Cables | Stacking Cable, 3m |
External Redundant Power Supply | None |
Hardware Support Services | 3 Year ProSupport and NBD On-site Service |
Installation Services | No Installation Services Selected |
Asset Recovery Services | None |

Software
Package | Description | SKU
RHEL 5.6 | Red Hat Enterprise Linux 5.6, 1yr Subscription |
CDH Community | | 331-4370
CDH Enterprise 5x8 Per Node | | 331-4371
CDH Enterprise 24x7 Per Node | | 331-4372
Cloudera Developer Training and Certification | | 331-4373
Cloudera System Administrator Training and Certification | | 331-4374
Cloudera HBase Training | | 331-4375
Cloudera Hive & Pig Training | | 331-4376
Cloudera Essentials for Managers Training | | 331-4377
Hadoop Informational SKU | | 331-3282
Appendix B: Dell | Cloudera Hadoop Solution Components Decoder Ring
(source: http://hadoop.apache.org/)
1. Hadoop: http://en.wikipedia.org/wiki/Hadoop
2. Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to
application data
(http://en.wikipedia.org/wiki/Hadoop_Distributed_Filesystem#Hadoop_Distributed_File_System)
3. MapReduce: a software framework for distributed processing of large data sets on compute clusters.
(http://en.wikipedia.org/wiki/MapReduce)
4. Avro: a data serialization system
5. Chukwa: a data collection system for managing large distributed systems
6. HBase: a scalable, distributed database that supports structured data storage for large tables
7. Hive: a data warehouse infrastructure that provides data summarization and ad-hoc querying
8. ZooKeeper: a high-performance coordination service for distributed applications
9. Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis
programs, coupled with infrastructure for evaluating these programs.
10. Sqoop (from Cloudera): a tool designed to import data from relational databases into Hadoop. Sqoop uses
JDBC to connect to a database.
11. Flume (from Cloudera): a distributed service for collecting, aggregating and moving large amounts of log
data. Its architecture is based on streaming data flows.
12. Crowbar: a Dell-provided, supported, and maintained toolset for system deployment and configuration
automation. Crowbar supports the bare-metal bring-up of new hardware and configuration management of
existing hardware.
Appendix C: External References
Nagios: http://www.nagios.org
Ganglia: http://ganglia.sourceforge.net/
Cloudera: http://www.cloudera.com







To Learn More
For more information on the Dell | Cloudera Hadoop Solution, visit:
www.Dell.com/Hadoop





© 2011 Dell Inc. All rights reserved. Trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Specifications are
correct at date of publication but are subject to availability or change without notice at any time. Dell and its affiliates cannot be responsible for errors or omissions in typography or
photography. Dell's Terms and Conditions of Sales and Service apply and are available on request. Dell service offerings do not affect consumers' statutory rights.

Dell, the DELL logo, and the DELL badge, PowerConnect, and PowerVault are trademarks of Dell Inc.
