
Hadoop MapReduce

Fundamentals
@LynnLangit

A five-part series: Part 1 of 5

Course Outline

What is Hadoop?
Open-source data storage and processing API
Massively scalable, automatically parallelizable

Based on work from Google

GFS + MapReduce + BigTable

Current Distributions based on Open Source and Vendor Work

Apache Hadoop
Cloudera CDH4 w/ Impala
Hortonworks
MapR
AWS
Windows Azure HDInsight

Why Use Hadoop?

Cheaper

Scales to petabytes or more

Faster

Parallel data processing

Better

Suited for particular types of Big Data problems

What types of business problems is Hadoop suited for?

Source: Cloudera Ten Common Hadoopable Problems

Companies Using Hadoop

Facebook
Yahoo
Amazon
eBay
American Airlines
The New York Times
Federal Reserve Board
IBM
Orbitz

Forecast growth of Hadoop Job Market

Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html

Hadoop is a set of Apache Frameworks and more
Data storage (HDFS)

Runs on commodity hardware (usually Linux)


Horizontally scalable

Processing (MapReduce)

Parallelized (scalable) processing


Fault Tolerant

Other Tools / Frameworks

Data Access

HBase, Hive, Pig, Mahout


Tools

Hue, Sqoop
Monitoring
Greenplum, Cloudera

Monitoring & Alerting


Tools & Libraries
Data Access
MapReduce API
Hadoop Core - HDFS

What are the core parts of a Hadoop distribution?

Hadoop Cluster: HDFS (Physical) Storage

MapReduce Job: Logical View

Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png

Hadoop Ecosystem

Common Hadoop Distributions

Open Source

Apache

Commercial

Cloudera
Hortonworks
MapR
AWS MapReduce
Microsoft HDInsight (Beta)

A View of Hadoop (from Hortonworks)

Source: Intro to Map Reduce -- http://www.youtube.com/watch?v=ht3dNvdNDzI

Setting up Hadoop Development

Demo: Setting up Cloudera Hadoop

Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs

Hadoop MapReduce
Fundamentals
@LynnLangit

A five-part series: Part 2 of 5

So, what's the problem?
"I can just use some SQL-like language to query Hadoop, right?"
"Yeah, SQL-on-Hadoop... that's what I want."
"I don't want to learn a new query language, and..."
"I want massive scale for my shiny, new Big Data."

Ways to MapReduce

Libraries

Languages

Note: Java is most common, but other languages can be used

Demo: Using HiveQL on CDH4

What is Hive?
A data warehouse system for Hadoop that

facilitates easy data summarization
supports ad-hoc queries (still batch, though)
was created by Facebook

A mechanism to project structure onto this data and query the data using a SQL-like language (HiveQL)

Interactive console or script execution
Kicks off one or more MapReduce jobs in the background

An ability to use indexes and built-in user-defined functions

Is HQL == ANSI SQL? NO!

-- non-equality joins ARE allowed in ANSI SQL
-- but are NOT allowed in Hive (HQL)
SELECT a.*
FROM a
JOIN b ON (a.id <> b.id)

Note: Joins are quite different in MapReduce; more on that coming up
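To make the join note above more concrete: in a MapReduce-style join, rows from both tables are typically grouped by the join key during the shuffle, so matching rows meet in the same reduce call. The sketch below is a minimal in-memory illustration in Python, not Hive's actual implementation; the sample tables and all function names are made up for this example. With a condition such as a.id <> b.id there is no single key to group on, which is one reason non-equality joins are harder to express this way.

```python
from collections import defaultdict

# Hypothetical sample tables: (id, value) rows.
table_a = [(1, "a1"), (2, "a2"), (3, "a3")]
table_b = [(1, "b1"), (3, "b3"), (3, "b3x")]

def map_rows(tag, rows):
    """Map step: emit (join_key, (source_tag, value)) for every row."""
    for row_id, value in rows:
        yield row_id, (tag, value)

def shuffle(pairs):
    """Shuffle step: group all tagged values by join key."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups

def reduce_join(key, tagged_values):
    """Reduce step: pair every 'a' value with every 'b' value for one key."""
    a_vals = [v for tag, v in tagged_values if tag == "a"]
    b_vals = [v for tag, v in tagged_values if tag == "b"]
    return [(key, a, b) for a in a_vals for b in b_vals]

def equi_join(rows_a, rows_b):
    """Run map, shuffle and reduce in sequence for an equality join on id."""
    pairs = list(map_rows("a", rows_a)) + list(map_rows("b", rows_b))
    joined = []
    for key, tagged in sorted(shuffle(pairs).items()):
        joined.extend(reduce_join(key, tagged))
    return joined

if __name__ == "__main__":
    for row in equi_join(table_a, table_b):
        print(row)
```

Keys that appear in only one table (id 2 here) produce no output, which matches inner-join behaviour.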

Preparing for
MapReduce

Common Hadoop Shell Commands

hadoop fs -cat file:///file2
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -copyFromLocal <fromDir> <toDir>
hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
hadoop fs -ls /user/hadoop/dir1
hadoop fs -cat hdfs://nn1.example.com/file1
hadoop fs -get /user/hadoop/file <localfile>

Tips
-- sudo means run as administrator (super user)
-- some Hadoop configurations use hadoop dfs rather than hadoop fs; file paths to Hadoop differ for the former (see the link included for more detail)

Demo: Working with Files and HDFS

Thinking in MapReduce
Hint: It's Functional

Understanding MapReduce

Map >> Shuffle/Sort >> Reduce

Map: (K1, V1) -> list(K2, V2)
Input: info in the input split
Output: key / value pairs (intermediate values)
One list per local node
Can implement a local Reducer (or Combiner)

Shuffle/Sort: list(K2, V2) -> (K2, list(V2))
Shuffle / Sort phase precedes Reduce phase
Combines Map output into a list

Reduce: (K2, list(V2)) -> list(K3, V3)
Usually aggregates intermediate values

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
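The map, combine, shuffle/sort and reduce flow above can be sketched in a few lines of plain Python. This is an in-memory illustration only, assuming each input split is a list of (offset, line) pairs and using word counting as the job; none of these function names come from the Hadoop API.

```python
from collections import Counter, defaultdict

def map_phase(split):
    """Map: (K1, V1) -> list(K2, V2). K1 is a line offset, V1 the line text."""
    pairs = []
    for _offset, line in split:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs

def combine(pairs):
    """Combiner: local reduce on one node's map output to shrink the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def shuffle_sort(all_pairs):
    """Shuffle/Sort: list(K2, V2) -> sorted list of (K2, list(V2))."""
    grouped = defaultdict(list)
    for word, count in all_pairs:
        grouped[word].append(count)
    return sorted(grouped.items())

def reduce_phase(word, counts):
    """Reduce: (K2, list(V2)) -> (K3, V3)."""
    return word, sum(counts)

def word_count(splits):
    """Run every split through map and combine, then shuffle and reduce."""
    intermediate = []
    for split in splits:
        intermediate.extend(combine(map_phase(split)))
    return [reduce_phase(word, counts) for word, counts in shuffle_sort(intermediate)]

if __name__ == "__main__":
    splits = [
        [(0, "deer bear river"), (16, "car car river")],
        [(0, "deer car bear")],
    ]
    print(word_count(splits))
```

The combiner is optional: removing the call to combine gives the same final counts, only with more intermediate pairs passing through the shuffle.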

MapReduce Example: WordCount

Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png

MapReduce Objects

Each daemon spawns a new JVM

Demo: Running MapReduce WordCount

Hadoop MapReduce
Fundamentals
@LynnLangit

A five-part series: Part 3 of 5

Ways to run MapReduce Jobs


Configure JobConf options
From Development Environment (IDE)
From a GUI utility

Cloudera Hue
Microsoft Azure HDInsight console

From the command line

hadoop jar <filename.jar> input output

Setting up Hadoop On Windows Azure


About HDInsight

Demo: MapReduce in the Cloud
WordCount MapReduce using HDInsight

MapReduce (WordCount) with JavaScript

Note: JavaScript is part of the Azure Hadoop distribution

Common Data Sources for MapReduce Jobs

Where is your data coming from?
On premises

Local file system


Local HDFS instance

Private Cloud

Cloud storage

Public Cloud

Input Storage buckets


Script / Code buckets
Output buckets

Common Data Jobs for MapReduce

Demo: Other Types of MapReduce

Tip: Review the Java MapReduce code in these samples as well.

Methods to write MapReduce Jobs

Typical: usually written in Java

MapReduce 2.0 API
MapReduce 1.0 API

Streaming

Uses stdin and stdout
Can use any language to write Map and Reduce functions
C#, Python, JavaScript, etc.

Pipes

Often used with C++

Abstraction libraries

Hive, Pig, etc.: write in a higher-level language, which generates one or more MapReduce jobs
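As an illustration of the Streaming method listed above, here is a minimal word-count sketch in Python that reads lines from stdin and writes tab-separated key/value lines to stdout. It assumes the framework sorts the mapper output by key before the reducer sees it. Choosing the role through the first command-line argument is a convention invented for this example, not something Hadoop requires.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts per word. Input must already be sorted by key,
    which is what the shuffle/sort step provides between map and reduce."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _word, count in group)
        yield f"{word}\t{total}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: "map" or "reduce" as the first argument.
    step = mapper if sys.argv[1] == "map" else reducer
    for out_line in step(sys.stdin):
        print(out_line)
```

Saved under a name such as wc_stream.py (hypothetical), the script can be tried locally without a cluster by piping a text file through `python wc_stream.py map | sort | python wc_stream.py reduce`, where `sort` stands in for the shuffle/sort phase.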

Demo: MapReduce via C# & PowerShell

Using AWS MapReduce

Note: You can select Apache or MapR Hadoop Distributions to run your
MapReduce job on the AWS Cloud

What is Pig?
ETL Library for HDFS developed at Yahoo

Pig Runtime
Pig Language
Generates MapReduce Jobs

ETL steps

LOAD <file>
FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT
DUMP {to screen for testing}
STORE <newFile>
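The ETL steps above describe a dataflow: load, filter, group, count, then dump or store. The Python sketch below mirrors that shape on a tiny in-memory data set so the order of the steps is easier to follow. It is an analogy rather than Pig Latin, and the sample log records and function names are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical input: tab-separated (user, url, status) log records.
RAW = "alice\t/home\t200\nbob\t/home\t404\nalice\t/cart\t200\ncarol\t/home\t200\n"

def load(text):
    """LOAD: parse tab-separated text into tuples."""
    return [tuple(row) for row in csv.reader(io.StringIO(text), delimiter="\t")]

def filter_ok(records):
    """FILTER: keep only records whose status is 200."""
    return [record for record in records if record[2] == "200"]

def group_count(records):
    """GROUP BY url, then FOREACH group GENERATE url, COUNT(records)."""
    return sorted(Counter(url for _user, url, _status in records).items())

def store(rows):
    """STORE: serialize the result rows back to tab-separated text."""
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()

if __name__ == "__main__":
    # Printing the stored text plays the role of DUMP (to screen for testing).
    print(store(group_count(filter_ok(load(RAW)))), end="")
```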

MapReduce Python Sample


Remember that white space matters in Python!

Demo: Using AWS MapReduce with Pig

Note: You can select Apache or MapR Hadoop Distributions to run your
MapReduce job on the AWS Cloud

AWS Data Pipeline with Hive

Hadoop MapReduce
Fundamentals
@LynnLangit

A five-part series: Part 4 of 5

Better MapReduce - Optimizations

Optimization BEFORE running a MapReduce Job

More about Input File Compression

From Cloudera
Their version of LZO is splittable

Type    File       Size (GB)    Compress    Decompress
None    Log        8.0          -           -
Gzip    Log.gz     1.3          241         72
LZO     Log.lzo    2.0          55          35
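A quick calculation on the figures above shows the trade-off: Gzip produces the smaller file, while LZO is cheaper to compress and decompress. The sketch below only re-derives ratios from the numbers on the slide. The source does not state the unit of the Compress and Decompress columns, so they are treated as relative costs, and the function name is made up for this example.

```python
# Figures copied from the table above. The unit of the Compress and
# Decompress columns is not stated in the source (assumption: a time cost).
TABLE = [
    # (type, size_gb, compress, decompress)
    ("None", 8.0, None, None),
    ("Gzip", 1.3, 241, 72),
    ("LZO", 2.0, 55, 35),
]

def summarize(table):
    """Return the compression ratio and codec costs relative to the raw file."""
    raw_size = next(size for name, size, _c, _d in table if name == "None")
    summary = {}
    for name, size, compress, decompress in table:
        if compress is None:
            continue  # the uncompressed baseline has no codec cost
        summary[name] = {
            "ratio": round(raw_size / size, 2),
            "compress": compress,
            "decompress": decompress,
        }
    return summary

if __name__ == "__main__":
    for name, stats in summarize(TABLE).items():
        print(name, stats)
```

On these numbers Gzip shrinks the log to roughly a sixth of its size and LZO to a quarter, while LZO's compress cost is well under a quarter of Gzip's.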

Optimization WITHIN a MapReduce Job

Mapper Task Optimization

Data Types

Writable

Text (String)
IntWritable
LongWritable
FloatWritable
BooleanWritable

WritableComparable for keys

Custom types supported: write a RawComparator
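Hadoop's Writable types are Java classes, but the idea behind a custom key type and a RawComparator can be shown with a small Python analogy: a key knows how to serialize itself to bytes, and a raw comparator orders two keys by looking only at those bytes, which avoids deserializing during the sort. Everything in this sketch is hypothetical, including the class, its fields and the choice of unsigned big-endian encoding, which is what makes byte order agree with numeric order here.

```python
import struct
from functools import total_ordering

@total_ordering
class YearTempKey:
    """Hypothetical composite key, loosely analogous to a custom WritableComparable."""

    _FORMAT = ">II"  # two unsigned 32-bit big-endian integers

    def __init__(self, year, temp):
        self.year = year
        self.temp = temp

    def write(self):
        """Serialize the key to bytes."""
        return struct.pack(self._FORMAT, self.year, self.temp)

    @classmethod
    def read_fields(cls, data):
        """Rebuild a key from its serialized bytes."""
        return cls(*struct.unpack(cls._FORMAT, data))

    def __eq__(self, other):
        return (self.year, self.temp) == (other.year, other.temp)

    def __lt__(self, other):
        return (self.year, self.temp) < (other.year, other.temp)

def raw_compare(a_bytes, b_bytes):
    """Order two serialized keys without deserializing them.
    Returns -1, 0 or 1, like a comparator."""
    return (a_bytes > b_bytes) - (a_bytes < b_bytes)

if __name__ == "__main__":
    k1, k2 = YearTempKey(2012, 30), YearTempKey(2013, 5)
    print(raw_compare(k1.write(), k2.write()))
```

The shortcut only holds for non-negative values in this encoding; a signed or variable-length format would need a comparator that understands the byte layout.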

Reducer Task Optimization

MapReduce Job Optimization

Demo: Unit Testing MapReduce

Using MRUnit + Asserts
Optionally using ApprovalTests
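MRUnit is a Java library, so the following is only an analogy of the same testing idea in Python: keep the map and reduce logic in plain functions, feed them small known inputs, and assert on the pairs they emit. The word-count functions and the test names are assumptions made for this sketch.

```python
import unittest

def map_words(_key, line):
    """Map function under test: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.lower().split()]

def reduce_counts(word, counts):
    """Reduce function under test: sum the counts for one word."""
    return word, sum(counts)

class WordCountTest(unittest.TestCase):
    def test_mapper_emits_one_pair_per_word(self):
        self.assertEqual(
            map_words(0, "Cat cat dog"),
            [("cat", 1), ("cat", 1), ("dog", 1)],
        )

    def test_mapper_handles_empty_line(self):
        self.assertEqual(map_words(0, ""), [])

    def test_reducer_sums_counts(self):
        self.assertEqual(reduce_counts("cat", [1, 1]), ("cat", 2))

def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordCountTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)

if __name__ == "__main__":
    run_tests()
```

Testing the two functions separately keeps each test independent of the framework, so no cluster or job submission is needed.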

Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png

A note about MapReduce 2.0

Splits the existing JobTracker's roles

resource management
job lifecycle management

MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as

better scalability through distributed job lifecycle management
support for multiple Hadoop MapReduce API versions in a single cluster

What is Mahout?

Library with common machine learning algorithms

Over 20 algorithms

Recommendation (likelihood, e.g. Pandora)
Classification (known data and new data, e.g. spam identification)
Clustering (new groups of similar data, e.g. Google News)

Can non-statisticians find value using this library?

Mahout Algorithms

Setting up Hadoop on Windows

For local development

Install binaries from Web Platform Installer
Install .NET Azure SDK (for Azure BLOB storage)
Install other tools

Neudesic Azure Storage Viewer

Demo: Mahout Using HDInsight

What about the output?

Clients (Visualizations) for HDFS

Many clients use Hive

Often included in GUI console tools for Hadoop distributions as well

Microsoft includes clients in Office (Excel 2013)

Direct Hive client
Connect using ODBC
PowerPivot: data mashups and presentation
Data Explorer: connect, transform, mashup and filter
Hadoop SDK on Codeplex

Other popular clients

Qlikview
Tableau
Karmasphere

Demo: Executing Hive Queries

Demo: Using HDFS output in Excel 2013

To download Data Explorer: http://www.microsoft.com/enus/download/details.aspx?id=36803

About Visualization

Demo: New Visualizations with D3

Hadoop MapReduce
Fundamentals
@LynnLangit

A five-part series: Part 5 of 5

Limitations of MapReduce

Comparing: RDBMS vs. Hadoop

                       Traditional RDBMS          Hadoop / MapReduce
Data Size              Gigabytes (Terabytes)      Petabytes (Exabytes)
Access                 Interactive and Batch      Batch, NOT Interactive
Updates                Read / Write many times    Write once, Read many times
Structure              Static Schema              Dynamic Schema
Integrity              High (ACID)                Low
Scaling                Nonlinear                  Linear
Query Response Time    Can be near immediate      Has latency (due to batch processing)

Microsoft alternatives to MapReduce

Use existing relational system

Scale via cloud or edition (i.e. Enterprise or PDW)

Use in-memory OLAP

SQL Server Analysis Services Tabular Models

Use productized Dremel

Microsoft Polybase (status = beta?)

Looking Forward - Dremel or Apache Drill

Based on original research from Google

Apache Drill Architecture

In-market MapReduce Alternatives

Cloudera Impala

Google BigQuery

Demo: Google's BigQuery

Dremel for the rest of us

Hadoop MapReduce: Call to Action

More MapReduce Developer Resources

Based on the distribution: on premises

Apache

MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html

Cloudera

Cloudera University - http://university.cloudera.com/
Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html

Hortonworks
MapR

Based on the distribution: cloud

AWS MapReduce

Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs

Windows Azure HDInsight

Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-withhdinsight/
More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-tohadoop/

The Changing Data Landscape
