Featured in Hadoop Weekly #109
Author
- Sujee Maniyam
- http://sujee.net/
- Contact: sujee@elephantscale.com
Hadoop in 20 Seconds
[Diagram: Batch vs. Real Time workloads]
ElephantScale.com, 2014
Hadoop Ecosystem
- HDFS
- Map Reduce
- Hive
- Pig
- HBase
Spark in 20 Seconds
Spark Eco-System
- Spark SQL: schema / SQL
- Spark Streaming: real time
- MLlib: machine learning
- GraphX: graph processing
- Spark Core, running on cluster managers: Standalone, YARN, Mesos
Hype-o-meter :-)

Spark Benchmarks
[Chart: benchmark results. Source: stratio.com]
Hadoop vs. Spark
[Image: http://www.kwigger.com/mit-skifte-til-mac/]
Spark vs. MapReduce
- MapReduce: framework, on disk, batch process
- Spark: generalized computation, on disk / in memory
[Diagram: the Hadoop 2.0 stack]
- Applications: Batch (MapReduce), Streaming (Storm, S4), In-memory (Spark)
- Cluster Management: YARN
- Storage: HDFS
Complementary
- Spark (compute engine) runs over many storage systems: HDFS, Amazon S3, Cassandra, others
Hadoop vs. Spark, feature by feature:
- Batch processing: Hadoop's MapReduce (Java, Pig, Hive) vs. Spark RDDs (Java / Scala / Python)
- SQL querying: Hive vs. Spark SQL
- Stream processing / real-time processing: Storm, Kafka vs. Spark Streaming
- Machine learning: Mahout vs. Spark MLlib
- NoSQL (HBase, Cassandra, etc.): no Spark component, but Spark can query data in NoSQL stores
- Interactive shell: Spark only
  - fast development cycles
  - ad-hoc exploration
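As a sketch of what ad-hoc exploration looks like in the Spark shell (the file path and log format here are illustrative assumptions):

```scala
// Launched via `spark-shell`; `sc` (the SparkContext) is pre-created by the shell.
val lines = sc.textFile("hdfs:///data/events.log") // hypothetical path
val errors = lines.filter(_.contains("ERROR"))     // lazily defined, nothing runs yet
errors.count()                                     // action: triggers the computation
```

Each line gets immediate feedback in the REPL, which is what makes the development cycle fast compared to compile-and-submit MapReduce jobs.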
Hadoop component → Spark equivalent:
- Distributed storage: HDFS (used by both)
- SQL querying: Hive → Spark SQL
- Pig → Spork (Pig on Spark), or a mix of Spark SQL etc.
- Machine learning: Mahout → MLlib
- NoSQL DB: HBase → ???
Hadoop To Spark
Spark can help with:
1. Data size
2. File system
3. SQL
4. ETL
5. Machine learning
[Diagram: Batch vs. Real Time workloads]
Big Data
[Chart: data-size scale from < few GB through 10 GB+, 100 GB+, 1 TB+, 100 TB+, to PB+; Hadoop sits at the upper end]
1) Data Size
- Good fit: data might fit in memory!
- Applications: streaming
2) File System

                 HDFS                     NFS             Amazon S3
Data locality    High (best)              Local enough    None (ok)
Throughput       High (best)              Medium (good)   Low (ok)
Latency          Low (best)               Low             High
Reliability      Very high (replicated)   Low             Very high
Cost             Varies                   Varies          $30 / TB / month
HDFS vs. S3
On a cluster with the same data on HDFS & S3:

  // count # records
  hdfs.count()
  s3.count()

[Chart: record-count timings, HDFS vs. S3]
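The count comparison above is shorthand; in a Spark shell it might look like the following (bucket name and paths are illustrative, and `s3n://` is the 2014-era S3 connector scheme):

```scala
// Same data set stored in both places; count from each and compare wall-clock times.
val hdfsRdd = sc.textFile("hdfs:///data/billing.csv")     // data-local reads
val s3Rdd   = sc.textFile("s3n://my-bucket/billing.csv")  // remote reads over the network

hdfsRdd.count()  // benefits from HDFS data locality
s3Rdd.count()    // pays S3's higher latency and lower throughput
```

The table on the previous slide predicts the outcome: locality and throughput favor HDFS, while S3 trades performance for durability and elastic cost.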
Next : 3) SQL
3) SQL: Hive vs. Spark SQL
- Engine: Hive vs. Spark SQL
- Language: HiveQL vs. HiveQL plus RDD programming in Java / Python / Scala
- Scale: petabytes vs. terabytes?
- Interoperability / formats: Spark SQL can read Hive tables or stand-alone data
Billing records: schema and sample rows

Timestamp (milliseconds) | Customer_id (string) | Resource_id (int) | Qty (int) | Cost (int)

1000 | Phone | 10 | 10c
1003 | SMS   |    | 4c
1005 | Data  | 3M | 5c
Data Set
- 10 GB+ of data
- 400 million records
- CSV format

Hive Table:
import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)
val top10 = hiveCtx.sql(
  "select customer_id, SUM(cost) as total from billing " +
  "group by customer_id order by total DESC LIMIT 10")
top10.collect()
- Fast on the same HDFS data!
- Speed!
- Great for exploring / trying out CSV
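Spark SQL can also query stand-alone data with no Hive metastore at all. A sketch against the Spark 1.x-era API, where a case class supplies the schema (the file path and column layout here are assumptions, mirroring the billing example):

```scala
import org.apache.spark.sql.SQLContext

// Schema inferred from the case class field names.
case class Billing(customer_id: String, resource_id: String, qty: Int, cost: Int)

val sqlCtx = new SQLContext(sc)
import sqlCtx.createSchemaRDD  // implicit conversion: RDD[Billing] -> SchemaRDD

val billing = sc.textFile("hdfs:///data/billing.csv")  // hypothetical path
  .map(_.split(","))
  .map(f => Billing(f(0), f(1), f(2).toInt, f(3).toInt))

billing.registerTempTable("billing")
sqlCtx.sql("select customer_id, SUM(cost) as total from billing " +
           "group by customer_id order by total DESC LIMIT 10").collect()
```

The same query then works whether the table lives in Hive or was just registered from a raw CSV.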
4) ETL?
[Diagram: Data 1 + Data 2 → Spark (clean) → Data 3, Data 4]

ETL tools on Spark:
- Native RDD programming (Scala, Java, Python)
- High level: Pig (Spork: Pig on Spark), Cascading (spark-scalding)
Pig and Cascading on Spark:
- Code re-use: not re-writing from scratch
- More flexible
- Multiple language support: Scala / Java / Python
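For the native-RDD option, a typical clean-and-join pipeline is only a few lines. This is a hypothetical sketch (paths, field layout, and the two-column format are all assumptions):

```scala
// Two raw sources in, one cleaned joined data set out.
val data1 = sc.textFile("hdfs:///raw/data1.csv")   // illustrative path
  .map(_.split(","))
  .filter(_.length == 2)                           // clean: drop malformed rows
  .map(f => (f(0), f(1).trim))                     // (key, value)

val data2 = sc.textFile("hdfs:///raw/data2.csv")
  .map(_.split(","))
  .map(f => (f(0), f(1)))

val joined = data1.join(data2)                     // (key, (value1, value2))
joined.saveAsTextFile("hdfs:///clean/output")      // the cleaned output data set
```

Unlike a Pig script, the whole pipeline is ordinary Scala, so it can be unit-tested and composed like any other code.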
5) Machine Learning: Mahout vs. Spark MLlib
- API: Mahout is Java; MLlib is Scala / Java / Python
- Iterative algorithms: Mahout is slower; MLlib is very fast (in memory)
- In-memory processing: Mahout no; MLlib YES
- MLlib has lots of momentum!
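Iterative algorithms are where in-memory processing pays off: k-means, for example, re-reads the data set every iteration. A minimal MLlib sketch (the input path and k / iteration counts are illustrative):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse one point per line, comma-separated doubles.
val points = sc.textFile("hdfs:///data/points.csv")  // hypothetical path
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()  // iterative algorithm: keep the data in memory across passes

val model = KMeans.train(points, 3, 20)  // k = 3 clusters, 20 max iterations
model.clusterCenters.foreach(println)
```

With Mahout on MapReduce, each of those 20 iterations would re-read the data from disk; the `cache()` call is what makes MLlib "very fast" here.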
Spark Caching!
- Caching demo
[Chart: caching results, before vs. after the RDD is cached]
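The caching demo boils down to running the same action twice (file path illustrative):

```scala
val rdd = sc.textFile("hdfs:///data/billing.csv")
rdd.cache()   // mark for in-memory caching; materialized on the first action
rdd.count()   // first pass: reads from disk and populates the cache
rdd.count()   // second pass: served from memory, much faster
```

Note that `cache()` is lazy: the speedup only appears from the second action onward.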
Spark Caching: sharing cached RDDs
1) Spark Job Server
- Multiplexer: all requests are executed through the same context
- https://github.com/spark-jobserver/spark-jobserver
2) Tachyon
- Sub-second latency!

Sharing a cached RDD between apps (pseudocode):
  App1: ctx.saveRDD("my cached rdd", rdd1)
  App2: RDD rdd2 = ctx.loadRDD("my cached rdd")
Tachyon + Spark
- Load a data set (gigabytes) from S3 and cache it (one time)
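Because Tachyon exposes an HDFS-compatible filesystem, sharing the cached data between independent Spark applications can be sketched like this (host name and paths are assumptions; 19998 is Tachyon's default master port):

```scala
// App 1: load from S3 once, then park the prepared data in Tachyon.
val billing = sc.textFile("s3n://my-bucket/billing.csv")  // one-time remote load
billing.saveAsTextFile("tachyon://master:19998/shared/billing")

// App 2 (a separate SparkContext, even a separate JVM): read it back at memory speed.
val shared = sc.textFile("tachyon://master:19998/shared/billing")
shared.count()
```

Unlike `cache()`, which dies with its SparkContext, data in Tachyon survives across applications, which is what enables the sub-second latency claim above.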
Lessons Learned
- In-depth analytics
Final Thoughts
- Already on Hadoop?
- Contemplating Hadoop?
- Iterative loads
- Spark Job servers
- Tachyon
Thanks!
Sujee Maniyam
sujee@elephantscale.com
http://elephantscale.com
Expert consulting & training in Big Data
(Now offering Spark training)