Moving Data To Compute Doesn’t Scale
Instrumentation
©2011 Cloudera, Inc. All Rights Reserved.
So What is Apache Hadoop?
• A scalable fault-tolerant distributed system for
data storage and processing (open source
under the Apache license).
The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
• Schema must be created before any data can be loaded.
• An explicit load operation has to take place which transforms data to DB internal structure.
• New columns must be added explicitly before new data for such columns can be loaded into the database.
Schema-on-Read (Hadoop):
• Data is simply copied to the file store, no transformation is needed.
• A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).
• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
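The late binding of schema-on-read can be sketched in a few lines of Python. This is only an illustration under assumptions, not Hive's SerDe API: the pipe-delimited record format, the `RAW_STORE` list, and the `serde_v1`/`serde_v2` functions are all hypothetical.

```python
# Raw records are kept exactly as they arrived (no load-time transformation).
RAW_STORE = [
    "2011-06-01|alice|200",
    "2011-06-02|bob|404|mobile",   # a new fourth field started flowing later
]

def serde_v1(line):
    """Hypothetical first parser: knows only three columns."""
    date, user, status = line.split("|")[:3]
    return {"date": date, "user": user, "status": int(status)}

def serde_v2(line):
    """Hypothetical updated parser: also extracts the optional 'device' field."""
    parts = line.split("|")
    row = serde_v1(line)
    row["device"] = parts[3] if len(parts) > 3 else None
    return row

def read_table(serde):
    # The schema is applied here, at read time, not when the data was stored.
    return [serde(line) for line in RAW_STORE]

rows_v1 = read_table(serde_v1)   # the extra field is ignored, not rejected
rows_v2 = read_table(serde_v2)   # the same stored data now exposes 'device'
```

Swapping `serde_v1` for `serde_v2` makes the new column visible for records that were stored before the parser was updated, which is the "appear retroactively" behavior.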
Innovation: Explore Original Raw Data
Flexibility: Complex Data Processing
1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop).
2. Streaming MapReduce: Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce. (Hadoop Pipes is a separate C++ interface.)
3. Crunch: A library for multi-stage MapReduce pipelines in Java (modeled after Google’s FlumeJava).
4. Pig Latin: A high-level language out of Yahoo, suitable for batch data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a metastore mapping files to their schemas and associated SerDes.
6. Oozie: A PDL XML workflow engine that enables creating a workflow of jobs composed of any of the above.
Scalability: Scalable Software Development
AUTO SCALE
Scalability: Data Beats Algorithm
A. Halevy et al., “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems, March 2009
Scalability: Keep All Data Alive Forever
Archive to Tape and Never See It Again
Extract Value From All Your Data
Use The Right Tool For The Right Job
Relational Databases: Hadoop:
HDFS: Hadoop Distributed File System
A given file is broken down into blocks (default=64MB), then the blocks are replicated across the cluster (default=3).
Optimized for:
• Throughput
• Put/Get/Delete
• Appends
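The block and replication defaults above translate into simple arithmetic. The sketch below is only a worked example; the function name `hdfs_footprint` and the 1000 MB file are hypothetical, and the two constants are the defaults quoted on the slide.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, total block replicas, raw MB stored)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION
    # The final block only occupies its actual size, so raw storage is
    # the file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_mb

blocks, replicas, raw_mb = hdfs_footprint(1000)
```

For the hypothetical 1000 MB file this gives 16 blocks (15 full blocks plus one partial), 48 block replicas spread over the cluster, and 3000 MB of raw disk usage.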
MapReduce: Computational Framework
cat *.txt | mapper.pl | sort | reducer.pl > out.txt
[Figure: word-count data flow. Splits 1..N of (docid, text), e.g. “To Be Or Not To Be?”, feed Map 1..M, which emit (words, counts). The shuffle delivers (sorted words, counts) to Reduce 1..R, which write (sorted words, sum of counts) to output files 1..R. In the example, the intermediate counts for “Be” (5, 12, 7, 6) are summed to the output (Be, 30).]
Map(in_key, in_value) => list of (out_key, intermediate_value)
Reduce(out_key, list of intermediate_values) => out_value(s)
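The Map and Reduce signatures can be exercised with a small in-memory word count. This is a single-process sketch of the data flow, not Hadoop's API; the function names `map_fn`, `shuffle`, `reduce_fn`, and `run_job` are hypothetical.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map(in_key, in_value) => list of (out_key, intermediate_value)."""
    words = text.lower().replace("?", " ").split()
    return [(word, 1) for word in words]

def shuffle(pairs):
    """Group intermediate values by key and sort the keys (the `sort` step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(word, counts):
    """Reduce(out_key, list of intermediate_values) => out_value."""
    return word, sum(counts)

def run_job(documents):
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_fn(doc_id, text))
    return [reduce_fn(word, counts) for word, counts in shuffle(intermediate)]

result = run_job({"doc1": "To Be Or Not To Be?"})
# result == [("be", 2), ("not", 1), ("or", 1), ("to", 2)]
```

On a cluster the map calls run in parallel on separate splits and the shuffle moves data between machines; here all three phases run in one process purely to show the shape of the keys and values.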
MapReduce: Resource Manager / Scheduler
A given job is broken down into tasks,
then tasks are scheduled to be as
close to data as possible.
Optimized for:
• Batch Processing
• Failure Recovery
The system detects laggard tasks and speculatively executes duplicate copies of them in parallel on the same slice of data.
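The idea of scheduling tasks "as close to data as possible" can be sketched as follows. This is a deliberately simplified illustration, not the JobTracker's actual algorithm: the function `pick_node` and the node names are hypothetical, and intermediate levels such as rack locality are left out.

```python
def pick_node(block_replicas, free_nodes):
    """Choose a node for a task whose input block lives on `block_replicas`.

    block_replicas: set of node names holding a replica of the block.
    free_nodes: list of node names that currently have a free task slot.
    """
    # Prefer a free node that already holds a replica of the block.
    for node in free_nodes:
        if node in block_replicas:
            return node, "data-local"
    # Otherwise fall back to any free node; the block is read over the network.
    if free_nodes:
        return free_nodes[0], "remote"
    # No free slot anywhere: the task has to wait.
    return None, "wait"

choice = pick_node({"n1", "n3"}, ["n2", "n3"])
```

With three replicas per block, a task has three candidate nodes for a data-local run, which is one reason replication helps scheduling as well as fault tolerance.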
But Networks Are Faster Than Disks!
Hadoop High-Level Architecture
Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs.
Changes for Better Availability/Scalability
Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs.
Federation partitions out the name space, High Availability via an Active Standby.
Each job has its own Application Manager, Resource Manager is decoupled from MR.
CDH: Cloudera’s Distribution Including Apache Hadoop
Books
Conclusion
• The Key Benefits of Apache Hadoop:
– Agility/Flexibility (Quickest Time to Insight).
– Complex Data Processing (Any Language, Any Problem).
– Scalability of Storage/Compute (Freedom to Grow).
– Economical Storage (Keep All Your Data Alive Forever).
Appendix
BACKUP SLIDES
Unstructured Data is Exploding
Complex, Unstructured
Relational
Hadoop in the Enterprise Data Stack
[Figure: Hadoop in the enterprise data stack. Recoverable labels: Data Warehouse, Sqoop, Data Architects, Customers, Low-Latency Serving Systems, Web Application, Flume, Relational Databases, Logs, Files, Web Data.]
MapReduce Next Gen
Main idea is to split up the JobTracker functions:
• Cluster resource management (for tracking and
allocating nodes)
• Application life-cycle management (for MapReduce
scheduling and execution)
Enables:
• High Availability
• Better Scalability
• Efficient Slot Allocation
• Rolling Upgrades
• Non-MapReduce Apps
Two Core Use Cases Common Across Many Industries
[Figure: the two use cases, ADVANCED ANALYTICS and DATA PROCESSING, by industry. Surviving labels: Web, Media, Content Optimization, Clickstream Sessionization.]
What is Cloudera Enterprise?
Cloudera Enterprise makes open source Apache Hadoop enterprise-easy.
CLOUDERA ENTERPRISE COMPONENTS
EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments
EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production
Hive vs Pig Latin (count distinct values > 0)
• Hive Syntax:
SELECT COUNT(DISTINCT col1)
FROM mytable
WHERE col1 > 0;
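For readers less familiar with SQL, the query's semantics can be restated in plain Python. This is only an equivalent computation on in-memory data, not how Hive executes the query; the function name and the sample values are hypothetical.

```python
def count_distinct_positive(values):
    """Same result as SELECT COUNT(DISTINCT col1) FROM mytable WHERE col1 > 0."""
    # None stands in for SQL NULL; a NULL never satisfies col1 > 0.
    return len({v for v in values if v is not None and v > 0})

col1 = [3, 3, -1, 0, 7, None, 7, 2]   # hypothetical column contents
n = count_distinct_positive(col1)     # distinct positives are {2, 3, 7}
```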
Apache Hive Key Features
Hive Agile Data Types
• STRUCTS:
– SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes):
– SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS:
– SELECT mytable.mycolumn[5] FROM …
• JSON:
– SELECT get_json_object(mycolumn, objpath) FROM …
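The four access patterns above map closely onto ordinary nested data structures. The Python sketch below mirrors them on a single hypothetical row; the column names and values are made up for illustration, and `get_json_path` is a tiny stand-in that handles only simple `$.a.b` paths, not the full path syntax of Hive's `get_json_object`.

```python
import json

# One hypothetical row with a struct, a map, an array, and a JSON string column.
row = {
    "mystruct": {"myfield": "x"},
    "mymap": {"mykey": 42},
    "myarray": [10, 11, 12, 13, 14, 15],
    "myjson": '{"user": {"name": "alice"}}',
}

def get_json_path(json_text, path):
    """Stand-in for get_json_object supporting only '$.a.b' style paths."""
    value = json.loads(json_text)
    for key in path.lstrip("$").strip(".").split("."):
        if not isinstance(value, dict) or key not in value:
            return None   # this sketch returns None when a key is missing
        value = value[key]
    return value

struct_field = row["mystruct"]["myfield"]                 # mycolumn.myfield
map_value = row["mymap"]["mykey"]                         # mycolumn[mykey]
sixth_item = row["myarray"][5]                            # mycolumn[5], zero-based
json_value = get_json_path(row["myjson"], "$.user.name")  # get_json_object(...)
```

Hive array indexes are zero-based like Python's, so `mycolumn[5]` selects the sixth element.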