
The Hadoop Ecosystem
Chapter 2.1

Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
The Hadoop Ecosystem

What other projects exist around core Hadoop?
When should you use HBase?
How does Spark compare to MapReduce?
What are the differences between Hive, Pig, and Impala?
How is Flume typically deployed?

Chapter Topics

The Hadoop Ecosystem

Introduction
Data Storage: HBase
Data Integration: Flume and Sqoop
Data Processing: Spark
Data Analysis: Hive, Pig, and Impala
Workflow Engine: Oozie
Machine Learning: Mahout

The Hadoop Ecosystem (1)

[Diagram: the layers of CDH]
  Hadoop ecosystem: Sqoop, Impala, Hive, Pig, HBase, Flume, Oozie
  Hadoop core components: MapReduce, Hadoop Distributed File System

The Hadoop Ecosystem (2)

[Diagram: the Hadoop ecosystem layer: Sqoop, Impala, Hive, Pig, HBase, Flume, Oozie]

Next, a discussion of the key Hadoop ecosystem components

The Hadoop Ecosystem (3)

Ecosystem projects may be
  Built on HDFS and MapReduce
  Built on just HDFS
  Designed to integrate with or support Hadoop
Most are Apache projects or Apache Incubator projects
Some others are not managed by the Apache Software Foundation
  These are often hosted on GitHub or a similar repository
Following is an introduction to some of the most significant projects


HBase

HBase is the Hadoop database
  A NoSQL datastore
Can store massive amounts of data
  Petabytes+
High write throughput
  Scales to hundreds of thousands of inserts per second
Handles sparse data well
  No wasted space for empty columns in a row
Limited access model
  Optimized for lookup of a row by key rather than full queries
  No transactions: single row operations only
  Only one column (the row key) is indexed

HBase vs Traditional RDBMSs

                        RDBMS                          HBase
Data layout             Row-oriented                   Column-oriented
Transactions            Yes                            Single row only
Query language          SQL                            get/put/scan (or use Hive or Impala)
Security                Authentication/Authorization   Kerberos
Indexes                 Any column                     Row-key only
Max data size           TBs                            PB+
Read/write throughput   Thousands                      Millions
(queries per second)

When To Use HBase

Use plain HDFS if
  You only append to your dataset (no random write)
  You usually read the whole dataset (no random read)
Use HBase if
  You need random write and/or read
  You do thousands of operations per second on TB+ of data
Use an RDBMS if
  Your data fits on one big node
  You need full transaction support
  You need real-time query capabilities


Flume: Real-time Data Import

What is Flume?
  A service to move large amounts of data in real time
  Example: storing log files in HDFS
Flume imports data into HDFS as it is generated
  Instead of batch-processing it later
  For example, log files from a Web server
Flume is
  Distributed
  Reliable and available
  Horizontally scalable
  Extensible

Flume: High-Level Overview

Collect data as it is produced
  Files, syslogs, stdout, or custom source
Process in place
  e.g., encrypt, compress
Pre-process data before storing
  e.g., transform, scrub, enrich
Write in parallel
  Scalable throughput
Store in any format
  Text, compressed, binary, or custom sink

[Diagram: a tier of five agents collects the data, a tier of two agents encrypts and compresses it, and one or more agents write it to HDFS]

Sqoop: Exchanging Data With RDBMSs

Sqoop transfers data between RDBMSs and HDFS
  Does this very efficiently via a Map-only MapReduce job
  Supports JDBC, ODBC, and several specific databases
  "Sqoop" = "SQL to Hadoop"

[Diagram: RDBMS <-> Sqoop <-> HDFS]

Sqoop Custom Connectors

Custom connectors for
  MySQL
  Postgres
  Netezza
  Teradata
  Oracle (partnered with Quest Software)
Not open source, but free to use


Apache Spark

Apache Spark is a fast, general engine for large-scale data processing on a cluster
  Originally developed at UC Berkeley's AMPLab
  Open source Apache project
Provides several benefits over MapReduce
  Faster
  Better suited for iterative algorithms
  Can hold intermediate data in RAM, resulting in much better performance
  Easier API
  Supports Python, Scala, Java
  Supports real-time streaming data processing

Spark vs Hadoop MapReduce

MapReduce
  Widely used, huge investment already made
  Supports and supported by many complementary tools
  Mature, well-tested
Spark
  Flexible
  Elegant
  Fast
  Supports real-time streaming data processing
Over time, Spark is expected to supplant MapReduce as the general processing framework used by most organizations


Hive and Pig: High-Level Data Languages

The motivation: MapReduce is powerful but hard to master
The solution: Hive and Pig
  Languages for querying and manipulating data
  Leverage existing skillsets
    Data analysts who use SQL
    Programmers who use scripting languages
  Open source Apache projects
    Hive initially developed at Facebook
    Pig initially developed at Yahoo!
Interpreter runs on a client machine
  Turns queries into MapReduce jobs
  Submits jobs to the cluster

Hive

What is Hive?
  HiveQL: An SQL-like interface to Hadoop

SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid;

Pig

What is Pig?
  Pig Latin: A dataflow language for transforming large data sets

purchases = LOAD '/user/dave/purchases' AS (itemID, price, storeID, purchaserID);
bigticket = FILTER purchases BY price > 10000;
...

Hive vs. Pig

                      Hive                                      Pig
Language              HiveQL (SQL-like)                         Pig Latin (dataflow language)
Schema                Table definitions stored in a metastore   Schema optionally defined at runtime
Programmatic access   JDBC, ODBC                                PigServer (Java API)

JDBC: Java Database Connectivity
ODBC: Open Database Connectivity

Impala: High Performance Queries

High-performance SQL engine for vast amounts of data
  Similar query language to HiveQL
  10 to 50+ times faster than Hive, Pig, or MapReduce
Impala runs on Hadoop clusters
  Data stored in HDFS
  Does not use MapReduce
Developed by Cloudera
  100% open source, released under the Apache software license

Which to Choose?

Use Impala when
  You need near real-time responses to ad hoc queries
  You have structured data with a defined schema
Use Hive or Pig when
  You need support for custom file types, complex data types, or external functions
Use Pig when
  You have developers experienced with writing scripts
  Your data is unstructured/multi-structured
Use Hive when
  You have analysts familiar with SQL
  You are integrating with BI or reporting tools via ODBC/JDBC


Oozie

Oozie
  Workflow engine for MapReduce jobs
  Defines dependencies between jobs
The Oozie server submits the jobs to the cluster in the correct sequence
We will investigate Oozie later in the course


Mahout

Mahout is a Machine Learning library written in Java
Used for
  Collaborative filtering (recommendations)
  Clustering (finding naturally occurring groupings in data)
  Classification (determining whether new data fits a category)
Why use Hadoop for Machine Learning?
  "It's not who has the best algorithms that wins. It's who has the most data."

Key Points

Hadoop Ecosystem
  Many projects built on, and supporting, Hadoop
  Several will be covered in detail later in the course

Bibliography

The following offer more information on topics discussed in this chapter
  HBase was inspired by Google's BigTable paper presented at OSDI in 2006
    http://research.google.com/archive/bigtable-osdi06.pdf
  An interesting link comparing NoSQL databases
    http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html
  AMPLab
    https://amplab.cs.berkeley.edu
  Databricks
    http://databricks.com

Bibliography (cont'd)

The following offer more information on topics discussed in this chapter
  Spark
    http://databricks.com/blog/2013/10/28/databricks-and-cloudera-partner-to-support-spark.html
  Dremel is a distributed system for interactive ad-hoc queries that was created by Google
    http://research.google.com/pubs/archive/36632.pdf
  Impala, developed by Cloudera, supports the same inner, outer, and semi-joins that Hive does
    http://tiny.cloudera.com/dac15b

Managing Your Hadoop Solution
Chapter 2.2

Managing Your Hadoop Solution

What is the typical architecture of a data center with Hadoop?
What are typical hardware requirements for a Hadoop cluster?

Chapter Topics

Managing Your Hadoop Solution

Hadoop in the Data Center
Cluster Hardware

A Typical Data Center With Hadoop

Technology Strengths and Weaknesses

Technology               Strengths                  Drawbacks
Hadoop                   Scales to petabytes        Query speed
                         Import/export tools        No transaction support
                         Flexible structure
                         Commodity hardware
Relational databases     Complex transactions       Data must fit into rows and columns
                         1000s of queries/second    Schema is costly to change
                         SQL                        Full table scans are slow
Data warehouses          Reporting capabilities     Dimensions require pre-materialization
                         Up to 100s of terabytes
File servers (SAN/NAS)   Serving individual files   Cost
                         Write caches               Reading large portions of the data saturates the network
Backup systems (tape)    Cheap                      Expensive to retrieve the data

SAN: Storage Area Network
NAS: Network-Attached Storage

Example Data Flow

[Diagram: example data flow]
  Data sources: Orders, Web server logs
  Transfer tools: Sqoop (nightly), Flume (real time), Sqoop
  Hadoop: HDFS, HBase, ETL, Recommendations
  Targets: Enterprise Data Warehouse, Site Content/Recommendations


Cluster Hardware

[Diagram: a cluster made up of master nodes (NameNode; JobTracker or ResourceManager), slave nodes, and client nodes]

Slave Nodes: Recommended Configurations (1)

Processors
  Mid-grade processors (e.g., 2 x 6-core 2.9 GHz)
Memory
  48-96GB RAM
Network
  1Gb Ethernet (mid-range)
  10Gb Ethernet (high-end)
Disk Drives
  6 x 2TB drives per machine (mid-range)
  12 x 3TB drives per machine (high-end)
  Non-RAID

RAID: Redundant Array of Inexpensive Disks

Slave Nodes: Recommended Configurations (2)

Switch
  Dedicated switching infrastructure required because Hadoop can saturate the network
  All nodes talking to all nodes
Cost
  Per slave node cost should be around $4,000 to $10,000 (2014 estimate)

Master Nodes Are More Important

Four master nodes
  NameNode (active and standby)
  JobTracker (MRv1) or ResourceManager (MRv2) (active and standby)
    Some installations just use a single JobTracker or ResourceManager
  Each typically runs on a separate machine
Master node machines are high-quality servers
  Carrier-class hardware
  Dual power supplies, Ethernet cards
  RAIDed hard drives
  24GB RAM for clusters of 20 nodes or fewer
  48GB RAM for clusters of up to 300 nodes
  96GB RAM for larger clusters

Capacity Planning (1)

Basing your cluster growth on storage capacity is often a good method to use
Example:
  Data grows by approximately 3TB per week/40TB per quarter
  Hadoop replicates 3 times = 120TB
  Extra space required for temporary data while running jobs (~30%) = 160TB
  Assuming machines with 12 x 3TB hard drives
    4-5 new machines per quarter
  Two years of data = 1.3PB
    Requires approximately 36 machines
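The arithmetic in the capacity example above can be checked with a short sketch. The function names and default parameters are illustrative assumptions, not part of any Hadoop tool; the figures are the ones on the slide. Note that 120TB plus 30% is 156TB, which the slide rounds up to 160TB.

```python
import math

def quarterly_raw_storage_tb(new_data_tb, replication=3, temp_overhead=0.30):
    """Raw disk needed per quarter: replicated data plus temporary job space."""
    replicated = new_data_tb * replication       # 40TB x 3 = 120TB
    return replicated * (1 + temp_overhead)      # +30% = 156TB (slide: ~160TB)

def machines_needed(raw_tb, drives_per_machine=12, drive_tb=3):
    """Whole machines needed to hold raw_tb, at 12 x 3TB = 36TB per machine."""
    return math.ceil(raw_tb / (drives_per_machine * drive_tb))

per_quarter = quarterly_raw_storage_tb(40)
print(per_quarter)                      # raw TB per quarter before rounding
print(machines_needed(160))             # machines per quarter, using the slide's 160TB

two_years_tb = 8 * 160                  # eight quarters at the rounded figure
print(two_years_tb / 1000)              # in PB (slide: ~1.3PB)
print(machines_needed(two_years_tb))    # machines for two years of data
```

Rounding up with `math.ceil` reflects that machines are bought whole; the slide's "4-5 machines" comes from 160 / 36, which is about 4.4.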

Capacity Planning (2)

New nodes are automatically used by Hadoop
Many clusters start small (fewer than 10 nodes) and scale up as data and processing demands grow
Hadoop clusters can grow to thousands of nodes

Bibliography

The following offer more information on topics discussed in this chapter
  Slave and master node configuration recommendations: Hadoop Operations, first edition (1e), by Eric Sammer, pp. 46 and 50.

Introduction to MapReduce
Chapter 2.3

Introduction to MapReduce

Concepts behind MapReduce
How does data flow through MapReduce stages?
Typical uses of Mappers
Typical uses of Reducers

Chapter Topics

Introduction to MapReduce

MapReduce Overview
Example: WordCount
Mappers
Reducers

Review - Features of MapReduce

Automatic parallelization and distribution
Fault-tolerance
A clean abstraction for programmers
  MapReduce programs are usually written in Java
    Can be written in any language using Hadoop Streaming
    All of Hadoop is written in Java
  MapReduce abstracts all the "housekeeping" away from the developer
    Developer can simply concentrate on writing the Map and Reduce functions

Review - Key MapReduce Stages

The Mapper
  Each Map task (typically) operates on a single HDFS block
  Map tasks (usually) run on the node where the block is stored
Shuffle and Sort
  Sorts and consolidates intermediate data from all mappers
  Happens after all Map tasks are complete and before Reduce tasks start
The Reducer
  Operates on shuffled/sorted intermediate data (Map task output)
  Produces final output

The MapReduce Flow

[Diagram: the MapReduce flow, built up stage by stage]
  Input File(s)
  Input Format
  Input Split 1, Input Split 2, Input Split 3
  Record Reader (one per input split)
  Mapper (one per input split)
  Partitioner (one per Mapper)
  Shuffle and Sort
  Reducer (two in this example)
  Output Format (one per Reducer)
  Output File (one per Reducer)


Example: Word Count

Input Data
  the cat sat on the mat
  the aardvark sat on the sofa

Map -> Reduce

Result
  aardvark 1
  cat 1
  mat 1
  on 2
  sat 2
  sofa 1
  the 4
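The word count example above can be simulated on a single machine. This is a minimal sketch of the map, shuffle-and-sort, and reduce steps in plain Python, not Hadoop code; `map_words`, `sum_reducer`, and `run_job` are names chosen here for illustration, and the line index stands in for the input key.

```python
from collections import defaultdict

def map_words(key, line):
    """Mapper: emit (word, 1) for every word in the line; the input key is ignored."""
    for word in line.split():
        yield word, 1

def sum_reducer(key, values):
    """Reducer: emit the key with the sum of its values."""
    yield key, sum(values)

def run_job(lines):
    # Map phase: call the mapper once per input record.
    intermediate = []
    for index, line in enumerate(lines):
        intermediate.extend(map_words(index, line))
    # Shuffle and sort: group the values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: call the reducer once per key, in sorted key order.
    result = []
    for key in sorted(groups):
        result.extend(sum_reducer(key, groups[key]))
    return result

counts = run_job(["the cat sat on the mat", "the aardvark sat on the sofa"])
for word, count in counts:
    print(word, count)
```

On a real cluster the three phases run on different nodes and the grouping happens over the network; the sketch only shows the data transformations.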

Example: The WordCount Mapper (1)

Input Data (HDFS file)
  the cat sat on the mat
  the aardvark sat on the sofa

Record Reader output passed to the Mapper

  Key   Value
  0     the cat sat on the mat
  23    the aardvark sat on the sofa
  52
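The keys 0, 23, and 52 above are byte offsets: each key is the position in the file at which the line starts. A minimal sketch of a line-oriented record reader follows, assuming newline-terminated UTF-8 text; `read_records` is a hypothetical helper, not a Hadoop class.

```python
def read_records(data: bytes):
    """Yield (byte_offset, line) pairs, one per line of the input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line without its newline.
        yield offset, raw.rstrip(b"\n").decode("utf-8")
        offset += len(raw)

data = b"the cat sat on the mat\nthe aardvark sat on the sofa\n"
records = list(read_records(data))
for key, value in records:
    print(key, value)
print(len(data))   # offset at which a third line would start
```

The first line is 22 characters plus a newline, so the second line starts at byte 23; the second is 28 characters plus a newline, so the next record would start at byte 52.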

Example: The WordCount Mapper (2)

Input Data (HDFS file)
  the cat sat on the mat
  the aardvark sat on the sofa

The Mapper calls map() once for each record from the Record Reader

  Record (key, value)                 map() output
  0   the cat sat on the mat          the 1, cat 1, sat 1, on 1, the 1, mat 1
  23  the aardvark sat on the sofa    the 1, aardvark 1, sat 1, on 1, the 1, sofa 1
  52

Example: WordCount Shuffle and Sort

Node 1
  Mapper output: the 1, cat 1, sat 1, on 1, the 1, mat 1, the 1, aardvark 1, sat 1, on 1, the 1, sofa 1
  Sorted and grouped by key: aardvark 1; cat 1; mat 1; on 1,1; sat 1,1; sofa 1; the 1,1,1,1
Node 2
  Mapper output: the 1, mahout 1, drove 1, the 1, elephant 1
  Sorted and grouped by key: drove 1; elephant 1; mahout 1; the 1,1
Reducer input after the shuffle
  aardvark 1; cat 1; mat 1
  elephant 1; mahout 1; sat 1,1
  drove 1; on 1,1; sofa 1; the 1,1,1,1,1,1
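The slide shows every occurrence of a key, from both nodes, arriving at the same reducer. One common way to get that property is to hash the key and take the result modulo the number of reducers. The sketch below assumes that scheme and uses `zlib.crc32` as a stand-in hash so results are stable between runs; it is not the assignment shown on the slide, and `partition` is a name chosen here.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index for a key; equal keys always get the same index."""
    # crc32 is an illustrative hash choice, not what Hadoop itself uses.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Intermediate pairs as they might arrive from two different mappers.
pairs = [("the", 1), ("cat", 1), ("the", 1), ("mahout", 1), ("the", 1)]

buckets = {reducer: [] for reducer in range(3)}
for key, value in pairs:
    buckets[partition(key, 3)].append((key, value))

for reducer, items in buckets.items():
    print(reducer, items)
```

Because the reducer index depends only on the key, all three ("the", 1) pairs land in the same bucket regardless of which mapper produced them.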

Example: SumReducer (1)

  Reducer     Input                                      Final output file   Output
  Reducer 0   aardvark 1; cat 1; mat 1                   part-r-00000        aardvark 1; cat 1; mat 1
  Reducer 1   elephant 1; mahout 1; sat 1,1              part-r-00001        elephant 1; mahout 1; sat 2
  Reducer 2   drove 1; on 1,1; sofa 1; the 1,1,1,1,1,1   part-r-00002        drove 1; on 2; sofa 1; the 6

Example: SumReducer (2)

Reducer 2 calls reduce() once for each key and writes the final output to the HDFS file part-r-00002

  reduce() input       Final output
  drove 1              drove 1
  on 1,1               on 2
  sofa 1               sofa 1
  the 1,1,1,1,1,1      the 6


MapReduce: The Mapper (1)

The Mapper
  Input: key/value pair
  Output: A list of zero or more key/value pairs

map(in_key, in_value) -> (inter_key, inter_value) list

[Diagram: map takes one (input key, input value) pair and produces (intermediate key 1, value 1), (intermediate key 2, value 2), (intermediate key 3, value 3)]

MapReduce: The Mapper (2)

The Mapper may use or completely ignore the input key
  For example, a standard pattern is to read one line of a file at a time
    The key is the byte offset into the file at which the line starts
    The value is the contents of the line itself
    Typically the key is considered irrelevant
If the Mapper writes anything out, the output must be in the form of key/value pairs

  map() input:  23  the aardvark sat on the sofa
  map() output: the 1, aardvark 1, sat 1, on 1, the 1, sofa 1

Example Mapper: Upper Case Mapper

Turn input into upper case (pseudo-code):

let map(k, v) =
  emit(k.toUpper(), v.toUpper())

  Input (key, value)                        map() output (key, value)
  bugaboo      an object of fear or alarm   BUGABOO      AN OBJECT OF FEAR OR ALARM
  mahout       an elephant driver           MAHOUT       AN ELEPHANT DRIVER
  bumbershoot  umbrella                     BUMBERSHOOT  UMBRELLA

Example Mapper: Explode Mapper

Output each input character separately (pseudo-code):

let map(k, v) =
  foreach char c in v:
    emit(k, c)

  Input (key, value)   map() output (key, value)
  pi   3.14            pi 3; pi .; pi 1; pi 4
  145  kale            145 k; 145 a; 145 l; 145 e

Example Mapper: Filter Mapper

Only output key/value pairs where the input value is a prime number (pseudo-code):

let map(k, v) =
  if (isPrime(v)) then emit(k, v)

  Input (key, value)   map() output (key, value)
  48   7               48 7
  pi   3.14            (no output)
  5    12              (no output)
  foo  13              foo 13

Example Mapper: Changing Keyspaces

The key output by the Mapper does not need to be identical to the input key
Example: output the word length as the key (pseudo-code):

let map(k, v) =
  emit(v.length(), v)

  Input (key, value)   map() output (key, value)
  001  hadoop          6   hadoop
  002  aim             3   aim
  003  ridiculous      10  ridiculous

Example Mapper: Identity Mapper

Emit the key/value pair unchanged (pseudo-code):

let map(k, v) =
  emit(k, v)

  Input (key, value)                        map() output (key, value)
  bugaboo      an object of fear or alarm   bugaboo      an object of fear or alarm
  mahout       an elephant driver           mahout       an elephant driver
  bumbershoot  umbrella                     bumbershoot  umbrella
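The five example Mappers above translate almost line for line into Python generators, where `yield` plays the role of emit. This is a sketch for experimenting with the pseudo-code, not the Hadoop Mapper API; `run_mapper` and `is_prime` are helpers introduced here.

```python
def upper_case_mapper(key, value):
    yield key.upper(), value.upper()

def explode_mapper(key, value):
    for char in value:
        yield key, char

def is_prime(n):
    """True for prime integers; non-integers such as 3.14 are never prime."""
    if isinstance(n, bool) or not isinstance(n, int) or n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def filter_mapper(key, value):
    if is_prime(value):
        yield key, value

def word_length_mapper(key, value):
    yield len(value), value          # the output key differs from the input key

def identity_mapper(key, value):
    yield key, value

def run_mapper(mapper, records):
    """Call the mapper once per record and collect everything it emits."""
    output = []
    for key, value in records:
        output.extend(mapper(key, value))
    return output

print(run_mapper(upper_case_mapper, [("mahout", "an elephant driver")]))
print(run_mapper(explode_mapper, [("pi", "3.14")]))
print(run_mapper(filter_mapper, [(48, 7), ("pi", 3.14), (5, 12), ("foo", 13)]))
print(run_mapper(word_length_mapper, [("001", "hadoop"), ("002", "aim"), ("003", "ridiculous")]))
```

A generator fits the Mapper contract well because it can yield zero, one, or many pairs per input record, as the filter and explode examples show.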


Shuffle and Sort

After the Map phase is over, all intermediate values for a given intermediate key are grouped together
Each key and value list is passed to a Reducer
  All values for a particular intermediate key go to the same Reducer
  The intermediate keys/value lists are passed in sorted key order

[Diagram]
  Mapper output: gif 1231, jpg 3992, html 891, gif 3997, html 788, html 344
  Grouped by key: gif [1231, 3997]; html [344, 891, 788]; jpg [3992]
  One Reducer receives gif [1231, 3997] and jpg [3992]; another Reducer receives html [344, 891, 788]
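The grouping step described above can be sketched with a sort followed by `itertools.groupby`. This is an in-memory illustration under the assumption that all intermediate pairs fit on one machine; `shuffle_and_sort` is a name chosen here.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(pairs):
    """Group intermediate values by key and return the groups in sorted key order."""
    # Sorting by key only; Python's sort is stable, so values keep their
    # arrival order here. The order of values within a key is not significant.
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

map_output = [("gif", 1231), ("jpg", 3992), ("html", 891),
              ("gif", 3997), ("html", 788), ("html", 344)]

grouped = shuffle_and_sort(map_output)
for key, values in grouped:
    print(key, values)
```

Each (key, value list) pair in `grouped` is what a single reduce() call would receive.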

The Reducer

The Reducer outputs zero or more final key/value pairs
  In practice, usually emits a single key/value pair for each input key
  These are written to HDFS

reduce(inter_key, [v1, v2, ...]) -> (result_key, result_value)

  reduce() input           reduce() output
  gif   [1231, 3997]       gif 2614
  html  [344, 891, 788]    html 1498

Example Reducer: Sum Reducer

Add up all the values associated with each intermediate key (pseudo-code):

let reduce(k, vals) =
  sum = 0
  foreach int i in vals:
    sum += i
  emit(k, sum)

  reduce() input            reduce() output
  the      [1, 1, 1, 1]     the 4
  SKU0021  [34, 8, 19]      SKU0021 61

Example Reducer: Average Reducer

Find the mean of all the values associated with each intermediate key (pseudo-code):

let reduce(k, vals) =
  sum = 0; counter = 0;
  foreach int i in vals:
    sum += i; counter += 1;
  emit(k, sum/counter)

  reduce() input            reduce() output
  the      [1, 1, 1, 1]     the 1
  SKU0021  [34, 8, 19]      SKU0021 20.33

Example Reducer: Identity Reducer

The Identity Reducer is very common (pseudo-code):

let reduce(k, vals) =
  foreach v in vals:
    emit(k, v)

  reduce() input
    bow  [a knot with two loops and two loose ends, a weapon for shooting arrows, a bending of the head or body in respect]
  reduce() output
    bow  a knot with two loops and two loose ends
    bow  a weapon for shooting arrows
    bow  a bending of the head or body in respect

  reduce() input
    28  [2, 2, 7]
  reduce() output
    28 2; 28 2; 28 7
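The three example Reducers above can be sketched as Python generators in the same style as the pseudo-code. These are illustrations, not the Hadoop Reducer API. One assumption worth noting: the average uses floating-point division, which matches the 20.33 shown for SKU0021; with integer division the result would be truncated.

```python
def sum_reducer(key, values):
    yield key, sum(values)

def average_reducer(key, values):
    values = list(values)                  # allow any iterable of numbers
    yield key, sum(values) / len(values)   # floating-point mean

def identity_reducer(key, values):
    for value in values:
        yield key, value

print(list(sum_reducer("the", [1, 1, 1, 1])))
print(list(sum_reducer("SKU0021", [34, 8, 19])))
print(list(average_reducer("the", [1, 1, 1, 1])))
print(list(average_reducer("SKU0021", [34, 8, 19])))
print(list(identity_reducer(28, [2, 2, 7])))
```

The sum and average reducers emit one pair per key, while the identity reducer emits one pair per value.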

Key Points

A MapReduce program has two major developer-created components: a Mapper and a Reducer
Mappers map input data to intermediate key/value pairs
  Often parse, filter, or transform the data
Reducers process Mapper output into final key/value pairs
  Often aggregate data using statistical functions

Hadoop Clusters
Chapter 2.4

Hadoop Clusters

Components of a Hadoop cluster
How do Hadoop jobs and tasks run on a cluster?
How does a job's data flow in a Hadoop cluster?

Chapter Topics

Hadoop Clusters

Hadoop Cluster Overview
Hadoop Jobs and Tasks

Installing A Hadoop Cluster (1)

After testing a Hadoop solution on a small sample dataset, the solution can be run on a Hadoop cluster to process all data
Clusters are typically installed and maintained by a system administrator
Developers benefit by understanding how the components of a Hadoop cluster work together

Installing A Hadoop Cluster (2)

Difficult
  Download, install, and integrate individual Hadoop components directly from Apache
Easier: CDH
  Cloudera's Distribution for Apache Hadoop
  Vanilla Hadoop plus many patches, backports, bug fixes
  Includes many other components from the Hadoop ecosystem
Easiest: Cloudera Manager
  Wizard-based UI to install, configure, and manage a Hadoop cluster
  Included with Cloudera Standard (free) or Cloudera Enterprise

Hadoop Cluster Terminology

A Hadoop cluster is a group of computers working together
  Usually runs HDFS and MapReduce
A node is an individual computer in the cluster
  Master nodes manage distribution of work and data to slave nodes
A daemon is a program running on a node
  Each performs different functions in the cluster

[Diagram: HDFS master nodes and MapReduce master nodes, both connected to the same four slave nodes]

Hadoop Daemons: HDFS

HDFS daemons
  NameNode - holds the metadata for HDFS
    Typically two on a production cluster: one active, one standby
  DataNode - holds the actual HDFS data
    One per slave node

[Diagram: a NameNode (active + standby) and four DataNodes]

MapReduce v1 and v2 (1)

MapReduce v1 (MRv1 or "Classic MapReduce")
  Uses a JobTracker/TaskTracker architecture
  One JobTracker per cluster limits cluster size to about 4000 nodes
  Slots on slave nodes designated for Map or Reduce tasks
MapReduce v2 (MRv2)
  Built on top of YARN (Yet Another Resource Negotiator)
  Uses ResourceManager/NodeManager architecture
    Increases scalability of cluster
  Node resources can be used for any type of task
    Improves cluster utilization
  Support for non-MR jobs

Hadoop Daemons: MapReduce v1

MRv1 daemons
  JobTracker - one per cluster
    Manages MapReduce jobs, distributes individual tasks to TaskTrackers
  TaskTracker - one per slave node
    Starts and monitors individual Map and Reduce tasks

[Diagram: a JobTracker (active + standby) and three TaskTrackers]

Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior wri>en consent. 2-89
Basic Cluster Configuration: HDFS

[Figure: the HDFS master nodes run the NameNode (active + standby), which manages data storage and holds metadata; four slave nodes each run a DataNode]

Basic Cluster Configuration: HDFS + MapReduce v1

[Figure: the HDFS master nodes run the NameNode (active + standby), which manages data storage and holds metadata; four slave nodes each run a DataNode and a TaskTracker, and the TaskTrackers run tasks; the MapReduce master nodes run the JobTracker (active + standby), which manages MR jobs and distributes tasks to slave nodes; an MR job client connects to the JobTracker]

Hadoop Daemons: MapReduce v2

MRv2 daemons
ResourceManager: one per cluster
Starts ApplicationMasters, allocates resources on slave nodes
ApplicationMaster: one per job
Requests resources, manages individual Map and Reduce tasks
NodeManager: one per slave node
Manages resources on individual slave nodes
JobHistory Server: one per cluster
Archives job metrics and metadata
[Figure: ResourceManager (active + standby) and Job History Server connected to three NodeManagers, with an ApplicationMaster running on a slave node]

Basic Cluster Configuration: HDFS

[Figure: the HDFS master nodes run the NameNode (active + standby), which manages data storage and holds metadata; four slave nodes each run a DataNode]

Note: This slide is the same as the earlier HDFS slide; there is no change in the HDFS
design for YARN/MapReduce v2 because HDFS is the storage side of Hadoop. YARN/
MapReduce v2 implement the compute side of Hadoop.

Basic Cluster Configuration: HDFS + MapReduce v2

[Figure: the HDFS master nodes run the NameNode (active + standby), which manages data storage and holds metadata; four slave nodes each run a DataNode and a NodeManager, and the NodeManagers host the tasks and the ApplicationMaster of an MR job; the MapReduce master nodes run the ResourceManager (active + standby), which allocates resources and launches ApplicationMasters; an MR job client connects to the ResourceManager]

Chapter Topics

Hadoop Clusters

Hadoop Cluster Overview


Hadoop Jobs and Tasks

Review - MapReduce Terminology

A job is a full program


A complete execution of Mappers and Reducers over a dataset
A task is the execution of a single Mapper or Reducer over a slice of data
A task attempt is a particular instance of an attempt to execute a task
There will be at least as many task attempts as there are tasks
If a task attempt fails, another will be started by the JobTracker or ApplicationMaster
Speculative execution (covered later) can also result in more task attempts than completed tasks
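
The relationship between jobs, tasks, and task attempts can be sketched as a small data model. This is an illustrative sketch only: the `Task` class, the task names, and the outcome strings are hypothetical and do not correspond to Hadoop classes.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    attempts: list = field(default_factory=list)  # outcome of each attempt

    def run(self, outcomes):
        """Record attempts in order until one of them succeeds."""
        for outcome in outcomes:
            self.attempts.append(outcome)
            if outcome == "succeeded":
                break
        return self


# A job is the full set of Map and Reduce tasks over a dataset.
job = [
    Task("map_0").run(["succeeded"]),
    Task("map_1").run(["failed", "succeeded"]),  # first attempt failed, retried
    Task("reduce_0").run(["succeeded"]),
]

total_tasks = len(job)
total_attempts = sum(len(task.attempts) for task in job)
print(total_tasks, total_attempts)  # 3 4
```

Because the failed attempt of `map_1` is retried, the job ends with more task attempts than tasks.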

Submitting a Job

[Figure: the job client sends the job's .jar and XML files to the MR master node]

A MapReduce v1 Cluster
[Figure: four slave nodes, each running a TaskTracker and a DataNode, alongside the JobTracker(s) and NameNode(s)]

Running a Job on a MapReduce v1 Cluster (1)
$ hadoop fs -put mydata

[Figure: the MapReduce v1 cluster after mydata is uploaded; HDFS stores the file as Block1 and Block2 on DataNodes in the cluster]

Running a Job on a MapReduce v1 Cluster (2)
[Figure: the client submits a job to the JobTracker; Map Task 1 and Map Task 2 run next to Block1 and Block2 of mydata, and Reduce Task 1 and Reduce Task 2 run on other slave nodes]

A MapReduce v2 Cluster
[Figure: four slave nodes, each running a NodeManager and a DataNode, alongside the ResourceManager(s), NameNode(s), and Job History Server]

Running a Job on a MapReduce v2 Cluster (1)
[Figure: the client submits a job to the ResourceManager; mydata is stored in HDFS as Block1 and Block2, and a MapReduce ApplicationMaster is started on one of the slave nodes]

Running a Job on a MapReduce v2 Cluster (2)
[Figure: Map Task 1 and Map Task 2 run next to Block1 and Block2 of mydata, Reduce Task 1 and Reduce Task 2 run on other slave nodes, and the MapReduce ApplicationMaster runs on one of the slave nodes]

Job Data: Mapper Data Locality

When possible, Map tasks run on a node where the block of data to be processed is stored locally
Otherwise, the Map task will transfer the data across the network and then process that data

[Figure: Map Task 1 and Map Task 2 reading Block1 and Block2 from HDFS]
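
The locality preference described on this slide can be sketched as a simple placement function. This is a minimal illustration under stated assumptions: the function, the node names, and the block-to-node mapping are all hypothetical, and rack awareness and other scheduling details are omitted.

```python
def assign_map_task(block, block_locations, free_nodes):
    """Return (node, is_local) for the Map task that processes `block`."""
    for node in sorted(block_locations[block]):
        if node in free_nodes:
            return node, True  # data-local: the block is on this node's disk
    # No free node holds the block: run elsewhere and read over the network.
    return sorted(free_nodes)[0], False


# Hypothetical cluster state: which nodes store each block, and which are free.
block_locations = {"Block1": {"node1", "node3"}, "Block2": {"node2"}}
free_nodes = {"node1", "node4"}

first = assign_map_task("Block1", block_locations, free_nodes)
free_nodes = free_nodes - {first[0]}  # that node is now busy
second = assign_map_task("Block2", block_locations, free_nodes)

print(first)   # ('node1', True)
print(second)  # ('node4', False)
```

The first task runs where its block is stored; the second has no free local node, so it runs on another node and transfers its block across the network.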

Job Data: Intermediate Data

Map task intermediate data is stored on the local disk (not HDFS)

[Figure: Map Task 1 and Map Task 2 each read a block from HDFS and write intermediate data]

Job Data: Shuffle and Sort

There is no concept of data locality for Reducers
Intermediate data is transferred across the network to the Reducers
Reducers write their output to HDFS

[Figure: intermediate data from Map Task 1 and Map Task 2 flows to Reduce Task 1 and Reduce Task 2]
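
The shuffle and sort step can be approximated in memory as follows. This is a sketch, not Hadoop's implementation: the function names are hypothetical, `zlib.crc32` stands in for a hash partitioner, and the word-count data is invented for illustration.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    # Stand-in for a hash partitioner; crc32 is deterministic across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_outputs, num_reducers):
    """map_outputs holds one list of (key, value) pairs per Map task."""
    buckets = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:  # intermediate data leaves each Map task
        for key, value in pairs:
            buckets[partition(key, num_reducers)].append((key, value))
    reducer_inputs = []
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))  # sort each Reducer's input by key
        grouped = [(k, [v for _, v in group])
                   for k, group in groupby(bucket, key=itemgetter(0))]
        reducer_inputs.append(grouped)
    return reducer_inputs


map_outputs = [
    [("the", 1), ("cat", 1), ("the", 1)],  # intermediate data, Map Task 1
    [("cat", 1), ("sat", 1)],              # intermediate data, Map Task 2
]
reducer_inputs = shuffle_and_sort(map_outputs, num_reducers=2)

# Each Reducer sums the values for the keys it received.
counts = {key: sum(values) for part in reducer_inputs for key, values in part}
print(counts)
```

All values for a given key arrive at the same Reducer, regardless of which Map task produced them, which is why the data must cross the network.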

Is Shuffle and Sort a Bottleneck?

It appears that the shuffle and sort phase is a bottleneck
The reduce method in the Reducers cannot start until all Mappers have finished
In practice, Hadoop will start to transfer data from Mappers to Reducers as soon as the Mappers finish work
This avoids a huge amount of data transfer starting as soon as the last Mapper finishes
The reduce method still does not start until all intermediate data has been transferred and sorted

Is a Slow Mapper a Bottleneck?

It is possible for one Map task to run more slowly than the others
Perhaps due to faulty hardware, or just a very slow machine
It would appear that this would create a bottleneck
The reduce method in the Reducer cannot start until every Mapper has finished
Hadoop uses speculative execution to mitigate this
If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data
A new task attempt for the same task
The results of the first Mapper to finish will be used
Hadoop will kill off the Mapper that is still running
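
The idea behind speculative execution can be sketched with a deterministic toy model. Everything here is an assumption made for illustration: the function names, the 0.5 slowdown threshold, and the timings are hypothetical, and Hadoop's actual heuristics for detecting slow tasks differ.

```python
from statistics import median


def is_straggler(progress, all_progress, slowdown=0.5):
    """Flag a task whose progress is well below the median of its peers."""
    return progress < slowdown * median(all_progress)


def run_with_speculation(original_duration, backup_duration, launch_time):
    """Return (winner, finish_time, killed) for one Map task.

    The original attempt starts at t=0. A backup attempt for the same task
    starts at `launch_time` on another machine and reads the same data.
    """
    original_finish = original_duration
    backup_finish = launch_time + backup_duration
    if backup_finish < original_finish:
        return "backup", backup_finish, "original"
    return "original", original_finish, "backup"


progress = [0.90, 0.95, 0.20]  # fraction complete for three Map tasks
flags = [is_straggler(p, progress) for p in progress]
print(flags)  # [False, False, True]

result = run_with_speculation(original_duration=100, backup_duration=20,
                              launch_time=30)
print(result)  # ('backup', 50, 'original')
```

In this example the backup attempt finishes at t=50 instead of t=100, so its output is used and the original attempt is killed.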

Creating and Running a MapReduce Job

Write the Mapper and Reducer classes
Write a Driver class that configures the job and submits it to the cluster
Driver classes are covered in the Writing MapReduce chapter
Compile the Mapper, Reducer, and Driver classes
$ javac -classpath `hadoop classpath` MyMapper.java \
    MyReducer.java MyDriver.java

Create a JAR file with the Mapper, Reducer, and Driver classes
$ jar cf MyMR.jar MyMapper.class MyReducer.class \
    MyDriver.class

Run the hadoop jar command to submit the job to the Hadoop cluster

$ hadoop jar MyMR.jar MyDriver in_file out_dir

Bibliography

The following offer more information on topics discussed in this chapter

Full Cloudera documentation is available at
http://cloudera.com/
See Hadoop: The Definitive Guide, third edition (TDG 3e), for details on job submission: Chapter 6, How MapReduce Works, starting on page 189.
