Hadoop MapReduce Fundamentals
@LynnLangit
Course Outline
What is Hadoop?
Open-source data storage and processing API
Massively scalable, automatically parallelizable
Apache Hadoop
Cloudera CDH4 w/ Impala
Hortonworks
MapR
AWS
Windows Azure HDInsight
Why Use Hadoop?
Cheaper: scales to petabytes or more
Faster: parallel data processing
Better
Companies Using Hadoop
Facebook
Yahoo
Amazon
eBay
American Airlines
The New York Times
Federal Reserve Board
IBM
Orbitz
Processing (MapReduce)
Data Access
Hue, Sqoop
Monitoring
Greenplum, Cloudera
MapReduce Job Logical View
Hadoop Ecosystem
Common Hadoop Distributions
Open Source
Apache
Commercial
Cloudera
Hortonworks
MapR
AWS MapReduce
Microsoft HDInsight (Beta)
Setting up Hadoop Development
Hadoop MapReduce
Fundamentals
@LynnLangit
Ways to MapReduce
Libraries
Languages
What is Hive?
a data warehouse system for Hadoop that
Preparing for MapReduce
Tips
-- sudo means run as administrator (super user)
-- Some Hadoop configurations use hadoop dfs rather than hadoop fs. File paths to Hadoop differ for the former; see the link included for more detail.
Thinking in MapReduce
Hint: It's functional
Understanding MapReduce
Map >> input is (K1, V1): info in the input split
Shuffle/Sort >> groups map output into (K2, list(V2))
Reduce >> usually aggregates intermediate values
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
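The map, combine, shuffle/sort, and reduce steps outlined above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the data flow for a word count, not Hadoop code: the function names (`map_phase`, `combine`, `shuffle_sort`, `reduce_phase`, `run_job`) and the sample splits are assumptions made up for this example.

```python
from collections import defaultdict


def map_phase(key, value):
    """Map: (k1, v1) -> list of (k2, v2). Here k1 is a line number, v1 a line of text."""
    return [(word.lower(), 1) for word in value.split()]


def combine(pairs):
    """Combine: pre-aggregate the (k2, v2) pairs produced by a single map task."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def shuffle_sort(pairs):
    """Shuffle/sort: group values by key, giving a sorted list of (k2, list(v2))."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: (k2, list(v2)) -> (k3, v3). Here it sums the intermediate counts."""
    return key, sum(values)


def run_job(splits):
    """Run the whole pipeline in memory; each split stands in for one map task."""
    mapped = []
    for split in splits:
        pairs = []
        for line_number, line in enumerate(split):
            pairs.extend(map_phase(line_number, line))
        mapped.extend(combine(pairs))
    return [reduce_phase(key, values) for key, values in shuffle_sort(mapped)]


if __name__ == "__main__":
    sample_splits = [["the quick brown fox", "the lazy dog"], ["the end"]]
    print(run_job(sample_splits))
```

In a real cluster the map tasks run in parallel on different nodes and the shuffle moves data over the network; the sketch only shows which shape of data each step receives and emits.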
MapReduce Objects
Cloudera Hue
Microsoft Azure HDInsight console
Demo MapReduce in the Cloud
WordCount MapReduce using HDInsight
Note: JavaScript is part of the Azure Hadoop distribution
Private Cloud
Cloud storage
Public Cloud
Streaming
Pipes
Abstraction libraries
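As one illustration of the Streaming option listed above: Hadoop Streaming lets any executable act as mapper or reducer by reading lines on standard input and writing tab-separated key/value lines on standard output. The sketch below is a minimal word count in that style. The function names and the `map`/`reduce` command-line argument are assumptions for this example, and the reducer relies on its input arriving sorted by key, which is what the framework's shuffle/sort provides.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each run of identical keys; input must be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical entry point: 'script.py map' or 'script.py reduce'.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for output_line in step(sys.stdin):
        print(output_line)
```

Locally, the same flow can be imitated by piping the mapper output through `sort` into the reducer. On a cluster the script would be passed to the streaming jar with its `-input`, `-output`, `-mapper`, and `-reducer` options; the jar's location depends on the distribution.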
Note: You can select Apache or MapR Hadoop distributions to run your MapReduce job on the AWS Cloud
What is Pig?
ETL Library for HDFS developed at Yahoo
Pig Runtime
Pig Language
Generates MapReduce Jobs
ETL steps
LOAD <file>
FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT
DUMP {to screen for testing} STORE <newFile>
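To make the shape of those ETL steps concrete, here is a rough Python analogue of a LOAD, FILTER, GROUP BY, COUNT, DUMP sequence. It is not Pig Latin and does not generate MapReduce jobs; the sample data, its column names (`host`, `level`, `message`), and the function `run_etl` are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample input; the columns are assumptions for this sketch.
SAMPLE = """host,level,message
web1,ERROR,disk full
web2,INFO,started
web1,ERROR,timeout
web2,ERROR,timeout
"""


def run_etl(raw_text):
    # LOAD: parse the raw text into records.
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    # FILTER: keep only the error records.
    errors = [row for row in rows if row["level"] == "ERROR"]
    # GROUP BY host, then FOREACH group GENERATE COUNT.
    counts = Counter(row["host"] for row in errors)
    # The sorted result is what a DUMP or STORE step would write out.
    return sorted(counts.items())


if __name__ == "__main__":
    # DUMP: print to screen for testing.
    for host, count in run_etl(SAMPLE):
        print(host, count)
```

The point of Pig is that each of these steps is declared once in the script and the runtime decides how to turn them into MapReduce jobs over files in HDFS.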
Type   File      Size GB   Compress   Decompress
None   Log       8.0
Gzip   Log.gz    1.3       241        72
LZO    Log.lzo   2.0       55         35
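The trade-off in the compression figures above is easier to see as ratios. The short sketch below computes how much each codec shrinks the 8.0 GB log, and how the Gzip and LZO figures in the Compress and Decompress columns compare with each other. The source does not state the units of those two columns, so the code only compares them as unitless ratios; the names `ROWS`, `size_ratios`, and `figure_ratio` are made up for this example.

```python
# Rows copied from the table: (type, file, size in GB, compress, decompress).
# The units of the last two columns are not given in the source.
ROWS = [
    ("None", "Log", 8.0, None, None),
    ("Gzip", "Log.gz", 1.3, 241, 72),
    ("LZO", "Log.lzo", 2.0, 55, 35),
]


def size_ratios(rows):
    """Return uncompressed size divided by compressed size for each codec."""
    baseline = next(size for codec, _, size, _, _ in rows if codec == "None")
    return {
        codec: round(baseline / size, 2)
        for codec, _, size, _, _ in rows
        if codec != "None"
    }


def figure_ratio(rows, first, second, column):
    """Return the ratio of one codec's figure to another's for a given column index."""
    by_codec = {row[0]: row for row in rows}
    return round(by_codec[first][column] / by_codec[second][column], 2)


if __name__ == "__main__":
    print(size_ratios(ROWS))
    print(figure_ratio(ROWS, "Gzip", "LZO", 3))  # Compress column
    print(figure_ratio(ROWS, "Gzip", "LZO", 4))  # Decompress column
```

Gzip produces the smaller file, while the LZO figures in both the Compress and Decompress columns are several times lower, which is the trade-off the slide is pointing at.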
Optimization WITHIN a MapReduce Job
Mapper Task Optimization
Writable
Text (String)
IntWritable
LongWritable
FloatWritable
BooleanWritable
Data Types
Reducer Task Optimization
MapReduce Job Optimization
resource management
job lifecycle management
What is Mahout?
Mahout Algorithms
Demo
Mahout
Using HDInsight
Qlikview
Tableau
Karmasphere
About Visualization
Limitations of MapReduce
Hadoop / MapReduce
Data Size
Gigabytes (Terabytes)
Petabytes (Exabytes)
Access
Updates
Structure
Static Schema
Dynamic Schema
Integrity
High (ACID)
Low
Scaling
Nonlinear
Linear
Query
Response Time
Microsoft alternatives to MapReduce
Use existing relational system
In-market MapReduce Alternatives
Cloudera
Impala
Google
BigQuery
Apache
Cloudera
Hortonworks
MapR
AWS MapReduce
Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-withhdinsight/
More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-tohadoop/