Beruflich Dokumente
Kultur Dokumente
Alex Gorbachev
Chief Technology Officer at Pythian
Blogger
OakTable Network member
Oracle ACE Director
Founder of BattleAgainstAnyGuess.com
Founder of Sydney Oracle Meetup
IOUG Director of Communities
EVP, Ottawa Oracle User Group
2012 Pythian
Pythian
2012
Work
Expertise:
One
2012 Pythian
Pythian
2012
Agenda
What is Big Data?
What is Hadoop?
Hadoop use cases
Moving data in and out of
Hadoop
Avoiding major pitfalls
2012 Pythian
Doesnt Matter.
We are here to discuss data architecture and use cases.
Not define market segments.
2012 Pythian
2012 Pythian
2012 Pythian
Excel files
Sensor
data
2012 Pythian
2012 Pythian
2012 Pythian
quality
And
generally messy.
Technical Challenges
Storage capacity
Storage throughput
Pipeline throughput
Processing power
Parallel processing
System Integration
Data Analysis
Scalable storage
Massive Parallel Processing
Ready to use tools
2012 Pythian
Massively scalable
Throughput oriented
Sacrifice efficiency for scale
Hadoop is most
industry accepted
standard / tool
2012 Pythian
What is Hadoop?
Hadoop Principles
Bring Code to Data
Share Nothing
2012 Pythian
Hadoop in a Nutshell
Replicated Distributed BigData File System
2012 Pythian
HDFS architecture
simplified view
Files are split in large blocks
Each block is replicated on write
Files can be only created and
deleted by one client
Uploading
Update
2012 Pythian
2012 Pythian
Fault tolerant
Pitfalls
Low efficiency
Lots
of intermediate data
Lots
Complex manipulation
requires pipeline of multiple
jobs
No high-level language
Only mappers leverage local
reads on HDFS
2012 Pythian
MapReduce jobs
2012 Pythian
(Cloudera)
(MapR)
Phoenix
Hadapt
(Salesforce.com)
(commercial)
2012 Pythian
Hadoop Benefits
Reliable solution based on unreliable hardware
Designed for large files
Load data first, structure later
Designed to maximize throughput of large scans
Designed to leverage parallelism
Designed to scale
Flexible development platform
Solution Ecosystem
2012 Pythian
Hadoop Limitations
Hadoop is scalable but not fast
Some assembly required
Batteries
not included
2012 Pythian
2012 Pythian
customer behavior
Optimize ad placements
Recommendation
Improve
systems
Bottom-line contributors
Cheap
ETL
archives storage
easily
Inexpensive
Any
processing capacity
Data Landfill
Stop
Dont
Hadoop
2012 Pythian
Map-Reduce
Map-Reduce
Other
Root access
Merge
Logs
Web servers,
app server,
clickstreams
Flume
Hadoop
Cleanup,
aggregation
Longterm storage
DWH
BI,
batch reports
2012 Pythian
OLTP
Sqoop,
Perl
Oracle,
MySQL,
Informix
Hadoop
Transformation
aggregation
Longterm storage
DWH
BI,
batch reports
2012 Pythian
2012 Pythian
2012 Pythian
2012 Pythian
2012 Pythian
Sqoop
Queries
2012 Pythian
column
Parallel
Incremental
File
formats
2012 Pythian
2012 Pythian
export
vs. Insert
Example:
sqoop
export
--connect
jdbc:mysql://db.example.com/foo
--table
bar
--export-dir
/results/bar_data
2012 Pythian
FUSE-DFS
Mount HDFS on Oracle server:
sudo
hadoop-fuse-dfs
<mount_point>
dfs://<name_node_hostname>:<namenode_port>
2012 Pythian
for Avro
Support
2012 Pythian
2012 Pythian
2012 Pythian
2012 Pythian
2012 Pythian
storage or network?
Written
to disk in triplicate.
Eliminate unbalanced
workloads
Offload work to RDBMS
Fine-tune optimization with
Map-Reduce
2012 Pythian
Your Data
is
as
NOT
BIG
as you think
2012 Pythian
Getting started
Pick a business problem
Acquire data
Get the tools: Hadoop, R,
Hive, Pig, Tableau
Get platform: can start cheap
Analyze data
Need
store
ETL
2012 Pythian
Communities
Knowledge
Saring
Education
www.collaborate13.ioug.org
2012 Pythian
To follow us
http://www.pythian.com/news/
http://www.facebook.com/pages/The-Pythian-Group/
http://twitter.com/pythian
http://www.linkedin.com/company/pythian
2012 Pythian