No SQL Data With Hadoop

Hadoop MapReduce
Fundamentals
@LynnLangit
A five-part series: Part 1 of 5
Course Outline
What is Hadoop?
Open-source data storage and processing API
Massively scalable, automatically parallelizable
Based on work from Google
GFS + MapReduce + BigTable
Current Distributions based on Open Source and Vendor Work
Apache Hadoop
Cloudera CDH4 w/ Impala
Hortonworks
MapR
AWS
Windows Azure HDInsight
Why Use Hadoop?
Cheaper: scales to petabytes or more
Faster: parallel data processing
Better: suited for particular types of BigData problems
What types of business problems for Hadoop?
Source: Cloudera, "Ten Common Hadoopable Problems"
Companies Using Hadoop
Facebook
Yahoo
Amazon
eBay
American Airlines
The New York Times
Federal Reserve Board
IBM
Orbitz
Forecast growth of Hadoop Job Market
Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html
Hadoop is a set of Apache Frameworks and more
Data storage (HDFS): runs on commodity hardware (usually Linux); horizontally scalable
Processing (MapReduce): parallelized (scalable) processing; fault tolerant
Other Tools / Frameworks
Data Access: HBase, Hive, Pig, Mahout
Tools: Hue, Sqoop
Monitoring: Greenplum, Cloudera
Monitoring & Alerting
Tools & Libraries
Data Access
MapReduce API
Hadoop Core - HDFS
What are the core parts of a Hadoop distribution?
Hadoop Cluster HDFS (Physical) Storage
MapReduce Job: Logical View
Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png
Hadoop Ecosystem
Common Hadoop Distributions
Open Source: Apache
Commercial: Cloudera, Hortonworks, MapR, AWS MapReduce, Microsoft HDInsight (Beta)
A View of Hadoop (from Hortonworks)
Source: Intro to Map Reduce -- http://www.youtube.com/watch?v=ht3dNvdNDzI
Setting up Hadoop Development
Demo: Setting up Cloudera Hadoop
Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs
Hadoop MapReduce
Fundamentals
A five-part series: Part 2 of 5
So, what's the problem?
"I can just use some SQL-like language to query Hadoop, right?"
"Yeah, SQL-on-Hadoop... that's what I want."
"I don't want to learn a new query language, and..."
"I want massive scale for my shiny, new BigData."
Ways to MapReduce
Libraries
Languages
Note: Java is most common, but other languages can be used
Demo: Using HiveQL on CDH4
What is Hive?
a data warehouse system for Hadoop that
facilitates easy data summarization
supports ad-hoc queries (still batch though)
created by Facebook
a mechanism to project structure onto this data and query the data using a SQL-like language (HiveQL)
Interactive console or execute scripts
Kicks off one or more MapReduce jobs in the background
an ability to use indexes, built-in and user-defined functions
Is HQL == ANSI SQL? NO!
-- non-equality joins ARE allowed in ANSI SQL
-- but are NOT allowed in Hive (HQL)
SELECT a.*
FROM a
JOIN b ON (a.id <> b.id)
Note: Joins are quite different in MapReduce; more on that coming up
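To illustrate the ANSI SQL half of that comparison, here is a minimal sketch that runs the same non-equality join through Python's standard sqlite3 module. SQLite only stands in for a generic relational engine; the contents of tables a and b are made up for the example, and nothing here talks to Hive.

```python
import sqlite3

# In-memory database; tables a and b hold hypothetical sample ids.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER)")
conn.execute("CREATE TABLE b (id INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])

# The non-equality join from the slide, accepted by a relational engine.
rows = conn.execute(
    "SELECT a.* FROM a JOIN b ON (a.id <> b.id) ORDER BY a.id"
).fetchall()
conn.close()
print(rows)
```

Each row of a is paired with every row of b whose id differs, which is why this kind of join does not map neatly onto key-based grouping.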
Preparing for MapReduce
Common Hadoop Shell Commands
hadoop fs -cat file:///file2
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -copyFromLocal <fromDir> <toDir>
hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
hadoop fs -ls /user/hadoop/dir1
hadoop fs -cat hdfs://nn1.example.com/file1
hadoop fs -get /user/hadoop/file <localfile>
Tips
-- sudo means run as administrator (super user)
-- some Hadoop configurations use hadoop dfs rather than hadoop fs; file paths to Hadoop differ for the former, see the link included for more detail
Demo: Working with Files and HDFS
Thinking in MapReduce
Hint: It's Functional
Understanding MapReduce
Map >> Shuffle/Sort >> Reduce
Map
(K1, V1) in: info from an input split
list(K2, V2) out: key / value pairs (intermediate values)
One list per local node
Can implement a local Reducer (or Combiner)
Shuffle/Sort
(K2, list(V2))
Shuffle / Sort phase precedes Reduce phase
Combines Map output into a list
Reduce
list(K3, V3)
Usually aggregates intermediate values
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
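The key/value flow above can be sketched in plain Python, with no Hadoop involved. This is only an in-memory illustration of the phases: the function names, the two input splits and the choice of word counting are assumptions made for the example.

```python
from collections import Counter, defaultdict

def map_phase(split):
    """Map: (K1, V1) -> list(K2, V2). K1 is a line number, V1 the line text."""
    pairs = []
    for _line_no, line in split:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs

def combine(pairs):
    """Combiner: a local reduce over the output of a single mapper."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def shuffle_sort(mapper_outputs):
    """Shuffle/Sort: group values by key into (K2, list(V2)), sorted by key."""
    grouped = defaultdict(list)
    for pairs in mapper_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return sorted(grouped.items())

def reduce_phase(grouped):
    """Reduce: (K2, list(V2)) -> list(K3, V3), here by summing the counts."""
    return [(word, sum(counts)) for word, counts in grouped]

# Two hypothetical input splits, as if they lived on two different nodes.
splits = [
    [(0, "the quick brown fox"), (1, "the lazy dog")],
    [(0, "the fox")],
]
mapped = [combine(map_phase(split)) for split in splits]
word_counts = reduce_phase(shuffle_sort(mapped))
print(word_counts)
```

The combiner is optional: dropping it changes how many intermediate pairs are shuffled, not the final counts.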
MapReduce Example: WordCount
Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
MapReduce Objects
Each daemon spawns a new JVM
Demo: Running MapReduce WordCount
Hadoop MapReduce
Fundamentals
A five-part series: Part 3 of 5
Ways to run MapReduce Jobs

Configure JobConf options
From Development Environment (IDE)
From a GUI utility
Cloudera Hue
Microsoft Azure HDInsight console
From the command line
hadoop jar <filename.jar> input output
Setting up Hadoop On Windows Azure
About HDInsight
Demo: MapReduce in the Cloud
WordCount MapReduce using HDInsight
MapReduce (WordCount) with JavaScript
Note: JavaScript is part of the Azure Hadoop distribution
Common Data Sources for MapReduce Jobs
Where is your Data coming from?
On premises: local file system; local HDFS instance
Private Cloud: cloud storage
Public Cloud: input storage buckets; script / code buckets; output buckets
Common Data Jobs for MapReduce
Demo: Other Types of MapReduce
Tip: Review the Java MapReduce code in these samples as well.
Methods to write MapReduce Jobs
Typical: usually written in Java
MapReduce 2.0 API
MapReduce 1.0 API
Streaming
Uses stdin and stdout
Can use any language to write Map and Reduce functions
C#, Python, JavaScript, etc.
Pipes
Often used with C++
Abstraction libraries
Hive, Pig, etc.: write in a higher-level language, generate one or more MapReduce jobs
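As a rough sketch of the Streaming approach, the mapper and reducer below exchange tab-separated text lines, which is the convention Hadoop Streaming uses over stdin and stdout. To keep the example runnable without a cluster, both are written as functions over lists of lines and chained with a local sort; the names mapper, reducer and sample are illustrative only.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation of map | sort | reduce on hypothetical input lines.
sample = ["big data big cluster", "data node"]
streamed = list(reducer(sorted(mapper(sample))))
print(streamed)
```

In an actual streaming job each function would live in its own script, read sys.stdin and print its lines, and the framework would take care of the sort between them.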
Demo: MapReduce via C# & PowerShell
Using AWS MapReduce
Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
What is Pig?
ETL Library for HDFS developed at Yahoo
Pig Runtime
Pig Language
Generates MapReduce Jobs
ETL steps
LOAD <file>
FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT
DUMP {to screen for testing} or STORE <newFile>
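The ETL steps listed above can be mirrored in plain Python to show what each one does to the data. This is an analogue, not Pig Latin: the log records, the "200" filter and the variable names are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# LOAD: hypothetical tab-separated log records (user, status).
raw = "alice\t200\nbob\t404\nalice\t200\ncarol\t200\nbob\t200\n"
records = list(csv.reader(io.StringIO(raw), delimiter="\t"))

# FILTER: keep only the successful requests.
ok = [record for record in records if record[1] == "200"]

# GROUP BY user.
groups = defaultdict(list)
for user, status in ok:
    groups[user].append(status)

# FOREACH group GENERATE user, COUNT(...).
counts = sorted((user, len(rows)) for user, rows in groups.items())

# STORE: write tab-separated text (DUMP would simply print the tuples).
out = io.StringIO()
csv.writer(out, delimiter="\t", lineterminator="\n").writerows(counts)
print(out.getvalue())
```

In Pig the same pipeline would be a handful of statements, and the runtime would turn them into one or more MapReduce jobs.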
MapReduce Python Sample

Remember that white space matters in Python!
Demo: Using AWS MapReduce with Pig
AWS Data Pipeline with HIVE
Hadoop MapReduce
Fundamentals
A five-part series: Part 4 of 5
Better MapReduce - Optimizations
Optimization BEFORE running a MapReduce Job
More about Input File Compression
From Cloudera
Their version of LZO is splittable
Type | File | Size GB | Compress | Decompress
None | Log | 8.0 | - | -
Gzip | Log.gz | 1.3 | 241 | 72
LZO | Log.lzo | 2.0 | 55 | 35
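A quick way to read the compression figures above is to turn them into ratios. The sketch below copies the numbers as given and compares only relative values, so it makes no assumption about the unit of the Compress and Decompress columns; the variable names are illustrative.

```python
# Figures copied from the table: (size in GB, compress, decompress).
# The time columns are used only as ratios, so their unit does not matter.
UNCOMPRESSED_GB = 8.0
codecs = {
    "Gzip": (1.3, 241, 72),
    "LZO": (2.0, 55, 35),
}

for name, (size_gb, _compress, _decompress) in codecs.items():
    ratio = UNCOMPRESSED_GB / size_gb
    print(f"{name}: {ratio:.1f}x smaller than the raw log")

gzip_size, gzip_c, gzip_d = codecs["Gzip"]
lzo_size, lzo_c, lzo_d = codecs["LZO"]
compress_speedup = gzip_c / lzo_c      # how much faster LZO compresses
decompress_speedup = gzip_d / lzo_d    # how much faster LZO decompresses
size_penalty = lzo_size / gzip_size    # how much larger the LZO file is
print(f"LZO vs Gzip: {compress_speedup:.1f}x faster to compress, "
      f"{decompress_speedup:.1f}x faster to decompress, "
      f"{size_penalty:.2f}x larger on disk")
```

The trade-off is a somewhat larger file in exchange for noticeably faster compression and decompression, on top of LZO being splittable.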
Optimization WITHIN a MapReduce Job
Mapper Task Optimization
Writable Data Types
Text (String)
IntWritable
LongWritable
FloatWritable
BooleanWritable
WritableComparable for keys
Custom types supported: write a RawComparator
Reducer Task Optimization
MapReduce Job Optimization
Demo: Unit Testing MapReduce
Using MRUnit + Asserts
Optionally using ApprovalTests
Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
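MRUnit is a Java library, so the sketch below is only a Python analogue of the same idea: feed the map and reduce functions small known inputs and assert on what they emit, with no cluster involved. The functions map_words and reduce_counts are hypothetical stand-ins for a job's mapper and reducer.

```python
import unittest

def map_words(line):
    """Hypothetical mapper: emit (word, 1) for each word in a line."""
    return [(word, 1) for word in line.lower().split()]

def reduce_counts(word, counts):
    """Hypothetical reducer: sum the counts collected for one word."""
    return (word, sum(counts))

class WordCountTest(unittest.TestCase):
    def test_mapper_emits_one_pair_per_word(self):
        self.assertEqual(map_words("Cat cat dog"),
                         [("cat", 1), ("cat", 1), ("dog", 1)])

    def test_reducer_sums_counts(self):
        self.assertEqual(reduce_counts("cat", [1, 1]), ("cat", 2))

# Run the suite in-process so the example works from a script or a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordCountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Testing the two functions in isolation keeps failures easy to locate before a job is ever submitted.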
A note about MapReduce 2.0
Splits the existing JobTracker's roles
resource management
job lifecycle management
MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as
better scalability through distributed job lifecycle management
support for multiple Hadoop MapReduce API versions in a single cluster
What is Mahout?
Library with common machine learning algorithms
Over 20 algorithms
Recommendation (likelihood, e.g. Pandora)
Classification (known data and new data, e.g. spam identification)
Clustering (new groups of similar data, e.g. Google News)
Can non-statisticians find value using this library?
Mahout Algorithms
Setting up Hadoop on Windows
For local development
Install binaries from Web Platform Installer
Install .NET Azure SDK (for Azure BLOB storage)
Install other tools
Neudesic Azure Storage Viewer
Demo: Mahout Using HDInsight
What about the output?
Clients (Visualizations) for HDFS
Many clients use Hive
Often included in GUI console tools for Hadoop distributions as well
Microsoft includes clients in Office (Excel 2013)
Direct Hive client
Connect using ODBC
PowerPivot: data mashups and presentation
Data Explorer: connect, transform, mashup and filter
Hadoop SDK on Codeplex
Other popular clients
Qlikview
Tableau
Karmasphere
Demo: Executing Hive Queries
Demo: Using HDFS output in Excel 2013
To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803
About Visualization
Demo: New Visualizations (D3)
Hadoop MapReduce
Fundamentals
A five-part series: Part 5 of 5
Limitations of MapReduce
Comparing: RDBMS vs. Hadoop
 | Traditional RDBMS | Hadoop / MapReduce
Data Size | Gigabytes (Terabytes) | Petabytes (Exabytes)
Access | Interactive and Batch | Batch, NOT Interactive
Updates | Read / Write many times | Write once, Read many times
Structure | Static Schema | Dynamic Schema
Integrity | High (ACID) | Low
Scaling | Nonlinear | Linear
Query Response Time | Can be near immediate | Has latency (due to batch processing)
Microsoft alternatives to MapReduce
Use existing relational system
Scale via cloud or edition (e.g. Enterprise or PDW)
Use in-memory OLAP
SQL Server Analysis Services Tabular Models
Use productized Dremel
Microsoft Polybase (status = beta?)
Looking Forward - Dremel or Apache Drill
Based on original research from Google
Apache Drill Architecture
In-market MapReduce Alternatives
Cloudera Impala
Google BigQuery
Demo: Google's BigQuery
Dremel for the rest of us
Hadoop MapReduce Call to Action
More MapReduce Developer Resources
Based on the distribution: on premises
Apache
MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
Cloudera
Cloudera University - http://university.cloudera.com/
Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html
Hortonworks
MapR
Based on the distribution: cloud
AWS MapReduce
Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
Windows Azure HDInsight
Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
The Changing Data Landscape
