Slides PDF

11/16/2011, Stanford EE380 Computer Systems Colloquium
Introducing Apache Hadoop: The Modern Data Operating System
Dr. Amr Awadallah | Founder, CTO, VP of Engineering

aaa@cloudera.com, twitter: @awadallah
Limitations of Existing Data Analytics Architecture
Data flow: Instrumentation → Collection → Storage Only Grid (original raw data, mostly append) → ETL Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps
• Can’t Explore Original High Fidelity Raw Data
• Moving Data To Compute Doesn’t Scale
• Archiving = Premature Data Death
2
©2011 Cloudera, Inc. All Rights Reserved.
So What Is Apache Hadoop?
• A scalable fault-tolerant distributed system for
data storage and processing (open source
under the Apache license).
• Core Hadoop has two main systems:

– Hadoop Distributed File System: self-healing
high-bandwidth clustered storage.
– MapReduce: distributed fault-tolerant resource
management and scheduling coupled with a
scalable data programming abstraction.
3
The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
• Schema must be created before any data can be loaded.
• An explicit load operation has to take place which transforms data to DB internal structure.
• New columns must be added explicitly before new data for such columns can be loaded into the database.
• Pros: Read is Fast, Standards/Governance
Schema-on-Read (Hadoop):
• Data is simply copied to the file store, no transformation is needed.
• A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).
• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
• Pros: Load is Fast, Flexibility/Agility
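The late-binding idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not Hive's SerDe API: the log format, the `serde_v1` and `serde_v2` functions, and the field names are all hypothetical.

```python
# Raw records are stored exactly as they arrived; nothing is parsed at load time.
RAW_STORE = [
    "2011-11-16 alice /home",
    "2011-11-16 bob /search ref=ad42",  # a new trailing field started flowing later
]

def serde_v1(line):
    """Hypothetical deserializer that only knows the first three fields."""
    date, user, path = line.split()[:3]
    return {"date": date, "user": user, "path": path}

def serde_v2(line):
    """Updated deserializer: also extracts the optional ref= field."""
    row = serde_v1(line)
    extras = line.split()[3:]
    row["ref"] = next((e[4:] for e in extras if e.startswith("ref=")), None)
    return row

def read(store, serde, columns):
    """Apply the schema at read time and project only the requested columns."""
    return [tuple(serde(line)[c] for c in columns) for line in store]

before = read(RAW_STORE, serde_v1, ["user", "path"])
after = read(RAW_STORE, serde_v2, ["user", "ref"])
print(before)
print(after)
```

The stored lines never change. Only the deserializer does, which is why the `ref` column shows up retroactively for data that was loaded before the parser knew about it.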
4
Innovation: Explore Original Raw Data
Data Committee Data Scientist
5
Flexibility: Complex Data Processing
1. Java MapReduce: Most flexibility and performance, but tedious
development cycle (the assembly language of Hadoop).
2. Streaming MapReduce (or Pipes for C++): Allows you to develop in
any programming language of your choice, but slightly lower
performance and less flexibility than native Java MapReduce.
3. Crunch: A library for multi-stage MapReduce pipelines in Java
(modeled after Google’s FlumeJava).
4. Pig Latin: A high-level language out of Yahoo, suitable for batch
data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a metastore
mapping files to their schemas and associated SerDes.
6. Oozie: A PDL XML workflow engine that enables creating a
workflow of jobs composed of any of the above.
6
Scalability: Scalable Software Development
Grows without requiring developers to re-architect their algorithms/application.
AUTO SCALE
7
Scalability: Data Beats Algorithm
Smarter Algos More Data
A. Halevy et al, “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems, March 2009
8
Scalability: Keep All Data Alive Forever
Archive to Tape and Never See It Again vs. Extract Value From All Your Data
9
Use The Right Tool For The Right Job
Relational Databases:
Use when:
• Interactive OLAP Analytics (<1sec)
• Multistep ACID Transactions
• 100% SQL Compliance
Hadoop:
Use when:
• Structured or Not (Flexibility)
• Scalability of Storage/Compute
• Complex Data Processing
10
HDFS: Hadoop Distributed File System
A given file is broken down into blocks
(default=64MB), then blocks are
replicated across cluster (default=3).
Optimized for:
• Throughput
• Put/Get/Delete
• Appends
Block Replication for:

• Durability
• Availability
• Throughput
Block Replicas are distributed across servers and racks.
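Block splitting and replication can be sketched as follows. The placement function is a toy round-robin over racks chosen for illustration; it is an assumption, not the actual HDFS placement policy, and the server and rack names are hypothetical.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # 64MB default block size, as stated on the slide
REPLICATION = 3                # default replication factor, as stated on the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, servers_by_rack, replication=REPLICATION):
    """Toy placement: walk servers rack by rack so the replicas of one block
    land on different servers and on more than one rack."""
    # Interleave servers across racks: r1s1, r2s1, r1s2, r2s2, ...
    interleaved = [s for group in itertools.zip_longest(*servers_by_rack.values())
                   for s in group if s is not None]
    ring = itertools.cycle(interleaved)
    return [[next(ring) for _ in range(replication)] for _ in range(num_blocks)]

racks = {"rack1": ["r1s1", "r1s2"], "rack2": ["r2s1", "r2s2"]}
blocks = split_into_blocks(200 * 1024 * 1024)   # a 200MB file
placement = place_replicas(len(blocks), racks)
for size, servers in zip(blocks, placement):
    print(size // (1024 * 1024), "MB ->", servers)
```

A 200MB file yields three full 64MB blocks plus one 8MB tail block. Losing any single server, or a whole rack, still leaves at least one replica of every block reachable.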
11
MapReduce: Computational Framework
cat *.txt | mapper.pl | sort | reducer.pl > out.txt
[Figure: word-count data flow. Splits 1..N of (docid, text), e.g. “To Be Or Not To Be?”, feed Map 1..M, which emit (words, counts). The shuffle delivers (sorted words, counts) to Reduce 1..R, e.g. (Be, 5), (Be, 12), (Be, 7), (Be, 6) are summed to (Be, 30). Each reducer writes an output file of (sorted words, sum of counts).]
Map(in_key, in_value) => list of (out_key, intermediate_value)
Reduce(out_key, list of intermediate_values) => out_value(s)
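The Map and Reduce signatures can be illustrated with a single-process word count. This is a sketch of the programming model only, not the Hadoop API: the names `map_fn`, `shuffle`, `reduce_fn`, and `run_job` are assumptions made for this example.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map(in_key, in_value) => list of (out_key, intermediate_value)."""
    words = text.lower().replace("?", "").split()
    return [(word, 1) for word in words]

def shuffle(mapped_pairs):
    """Group intermediate values by key and sort by key, like `sort` in the pipe."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(word, counts):
    """Reduce(out_key, list of intermediate_values) => out_value."""
    return (word, sum(counts))

def run_job(splits):
    """Run map over every split, shuffle, then reduce each key group."""
    mapped = [pair for doc_id, text in splits for pair in map_fn(doc_id, text)]
    return [reduce_fn(word, counts) for word, counts in shuffle(mapped)]

result = run_job([("doc1", "To Be Or Not To Be?")])
print(result)
```

In a real cluster the map calls run in parallel on different splits, and the shuffle moves each key to one of R reducers. The per-key logic stays the same.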
12
MapReduce: Resource Manager / Scheduler
A given job is broken down into tasks,
then tasks are scheduled to be as
close to data as possible.
Three levels of data locality:

• Same server as data (local disk)
• Same rack as data (rack/leaf switch)
• Wherever there is a free slot (cross rack)
Optimized for:
• Batch Processing
• Failure Recovery
System detects laggard tasks and
speculatively executes parallel tasks
on the same slice of data.
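The three locality levels amount to a preference order when a slot frees up. The sketch below assumes a simple in-memory view of pending tasks; `pick_task`, the server names, and the data structures are hypothetical and much simpler than the real scheduler.

```python
def pick_task(free_server, pending_tasks, rack_of):
    """Choose a task for a free slot, preferring data locality.

    pending_tasks maps task id -> list of servers holding a replica of its input.
    rack_of maps server -> rack name.
    Returns (task_id, locality_level), or None if nothing is pending.
    """
    # 1. Same server as data (local disk).
    for task, replicas in pending_tasks.items():
        if free_server in replicas:
            return task, "node-local"
    # 2. Same rack as data (traffic stays on the rack switch).
    for task, replicas in pending_tasks.items():
        if any(rack_of[r] == rack_of[free_server] for r in replicas):
            return task, "rack-local"
    # 3. Wherever there is a free slot (data crosses racks).
    for task in pending_tasks:
        return task, "off-rack"
    return None

rack_of = {"s1": "A", "s2": "A", "s3": "B"}
pending = {"t1": ["s3"], "t2": ["s2"]}
print(pick_task("s2", pending, rack_of))         # s2 holds t2's data
print(pick_task("s1", pending, rack_of))         # s1 shares a rack with s2
print(pick_task("s3", {"t2": ["s2"]}, rack_of))  # only a cross-rack option is left
```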
13
But Networks Are Faster Than Disks!
Yes, however, core and disk density per server are going up very quickly:
• 1 Hard Disk = 100MB/sec (~1Gbps)

• Server = 12 Hard Disks = 1.2GB/sec (~12Gbps)
• Rack = 20 Servers = 24GB/sec (~240Gbps)
• Avg. Cluster = 6 Racks = 144GB/sec (~1.4Tbps)
• Large Cluster = 200 Racks = 4.8TB/sec (~48Tbps)
• Scanning 4.8TB at 100MB/sec takes 13 hours.
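The arithmetic behind these figures can be reproduced directly. The sketch assumes decimal units (1TB = 1,000,000MB) and uses the per-disk, per-server, and per-rack counts from the slide; the function names are illustrative.

```python
DISK_MB_PER_SEC = 100    # one hard disk, from the slide
DISKS_PER_SERVER = 12
SERVERS_PER_RACK = 20

def aggregate_mb_per_sec(racks):
    """Aggregate sequential disk bandwidth of a cluster with the given rack count."""
    return DISK_MB_PER_SEC * DISKS_PER_SERVER * SERVERS_PER_RACK * racks

def scan_hours(data_mb, mb_per_sec):
    """Hours needed to scan data_mb megabytes at mb_per_sec."""
    return data_mb / mb_per_sec / 3600

server = DISK_MB_PER_SEC * DISKS_PER_SERVER   # MB/sec per server
rack = aggregate_mb_per_sec(1)                # MB/sec per rack
avg_cluster = aggregate_mb_per_sec(6)         # MB/sec for 6 racks
large_cluster = aggregate_mb_per_sec(200)     # MB/sec for 200 racks

data_mb = 4_800_000                           # 4.8TB expressed in MB
single_disk_hours = scan_hours(data_mb, DISK_MB_PER_SEC)
cluster_seconds = data_mb / large_cluster

print(server, rack, avg_cluster, large_cluster)
print(round(single_disk_hours, 1), "hours on one disk")
print(cluster_seconds, "second(s) across the large cluster")
```

The same 4.8TB that takes roughly 13 hours to read from one disk can be read in about a second when every disk in the large cluster scans its local share. That is the argument for moving compute to the data.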
14
Hadoop High-Level Architecture
• Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs.
• Name Node: Maintains mapping of file names to blocks to data node slaves.
• Job Tracker: Tracks resources and schedules jobs across task tracker slaves.
• Data Node: Stores and serves blocks of data.
• Task Tracker: Runs tasks (work units) within a job.
• Data Node and Task Tracker share a physical node.
15
Changes for Better Availability/Scalability
• Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs.
• Name Node: Federation partitions out the name space, High Availability via an Active Standby.
• Job Tracker: Each job has its own Application Manager, Resource Manager is decoupled from MR.
• Data Node: Stores and serves blocks of data.
• Task Tracker: Runs tasks (work units) within a job.
• Data Node and Task Tracker share a physical node.
16
CDH: Cloudera’s Distribution Including Apache Hadoop
• Build/Test: APACHE BIGTOP
• File System Mount: FUSE-DFS
• UI Framework/SDK: HUE
• Data Mining: APACHE MAHOUT
• Workflow: APACHE OOZIE
• Scheduling: APACHE OOZIE
• Metadata: APACHE HIVE
• Languages / Compilers: APACHE PIG, APACHE HIVE
• Data Integration: APACHE FLUME, APACHE SQOOP
• Fast Read/Write Access: APACHE HBASE
• Coordination: APACHE ZOOKEEPER
• SCM Express (Installation Wizard for CDH)
17
Books
18
Conclusion
• The Key Benefits of Apache Hadoop:
– Agility/Flexibility (Quickest Time to Insight).
– Complex Data Processing (Any Language, Any Problem).
– Scalability of Storage/Compute (Freedom to Grow).
– Economical Storage (Keep All Your Data Alive Forever).
• The Key Systems for Apache Hadoop are:

– Hadoop Distributed File System: self-healing high-
bandwidth clustered storage.
– MapReduce: distributed fault-tolerant resource
management coupled with scalable data processing.
19
Appendix
BACKUP SLIDES
20
Unstructured Data is Exploding
Complex, Unstructured
Relational
• 2,500 exabytes of new information in 2012 with Internet as primary driver

• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2
“zettabytes” this year
Source: IDC White Paper - sponsored by EMC.
As the Economy Contracts, the Digital Universe Expands. May 2009.
21
Hadoop Creation History
• Fastest sort of a TB, 62 secs over 1,460 nodes
• Sorted a PB in 16.25 hours over 3,658 nodes
22
Hadoop in the Enterprise Data Stack
[Figure: Hadoop in the enterprise data stack.
Users: Data Scientists, Analysts, Business Users, System Operators, Data Architects, Customers.
Tools: IDEs (Development Tools); BI, Analytics, Enterprise Reporting (Business Intelligence Tools); Cloudera Mgmt Suite; ETL Tools.
Connected systems: Enterprise Data Warehouse, Low-Latency Serving Systems, Web Application, Relational Databases.
Interfaces and connectors: ODBC, JDBC, NFS, Native; Sqoop; Flume.
Data sources: Logs, Files, Web Data.]
23
MapReduce Next Gen
Main idea is to split up the JobTracker functions:
• Cluster resource management (for tracking and
allocating nodes)
• Application life-cycle management (for MapReduce
scheduling and execution)
Enables:
• High Availability
• Better Scalability
• Efficient Slot Allocation
• Rolling Upgrades
• Non-MapReduce Apps
24
Two Core Use Cases Common Across Many Industries
ADVANCED ANALYTICS (application) | Industry | DATA PROCESSING (application)
Social Network Analysis | Web | Clickstream Sessionization
Content Optimization | Media | Clickstream Sessionization
Network Analytics | Telco | Mediation
Loyalty & Promotions | Retail | Data Factory
Fraud Analysis | Financial | Trade Reconciliation
Entity Analysis | Federal | SIGINT
Sequencing Analysis | Bioinformatics | Genome Mapping
Product Quality | Manufacturing | Mfg Process Tracking
25
What is Cloudera Enterprise?
Cloudera Enterprise makes open source Apache Hadoop enterprise-easy
• Simplify and Accelerate Hadoop Deployment
• Reduce Adoption Costs and Risks
• Lower the Cost of Administration
• Increase the Transparency & Control of Hadoop
• Leverage the Experience of Our Experts
CLOUDERA ENTERPRISE COMPONENTS
• Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration
• Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs
3 of the top 5 telecommunications, mobile services, defense & intelligence, banking, media and retail organizations depend on Cloudera
EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments
EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production
26
Hive vs Pig Latin (count distinct values > 0)
• Hive Syntax:
SELECT COUNT(DISTINCT col1)
FROM mytable
WHERE col1 > 0;
• Pig Latin Syntax:

mytable = LOAD 'myfile' AS (col1, col2, col3);
mytable = FOREACH mytable GENERATE col1;
mytable = FILTER mytable BY col1 > 0;
-- DISTINCT takes a relation, not a column
mytable = DISTINCT mytable;
-- GROUP ALL yields a single group, so COUNT returns one total
mytable = GROUP mytable ALL;
mytable = FOREACH mytable GENERATE COUNT(mytable);
DUMP mytable;
27
Apache Hive Key Features
• A subset of SQL covering the most common statements

• JDBC/ODBC support
• Agile data types: Array, Map, Struct, and JSON objects
• Pluggable SerDe system to work on unstructured files directly
• User Defined Functions and Aggregates
• Regular Expression support
• MapReduce support
• Partitions and Buckets (for performance optimization)
• MicroStrategy/Tableau Compatibility (through ODBC)
• In The Works: Indices, Columnar Storage, Views, Explode/Collect
• More details: http://wiki.apache.org/hadoop/Hive
28
Hive Agile Data Types
• STRUCTS:
– SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes):
– SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS:
– SELECT mytable.mycolumn[5] FROM …
• JSON:
– SELECT get_json_object(mycolumn, objpath) FROM …
29
