Amir H. Payberah
amir@sics.se
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 1 / 71
What is Big Data?
Big Data
... everyone talks about it, nobody really knows how to do it, everyone
thinks everyone else is doing it, so everyone claims they are doing it.
- Dan Ariely
Big data is the data characterized by 4 key attributes: volume,
variety, velocity and value.
- Oracle
(Overlay on the slide: Buzzwords)
Big Data In Simple Words
The Four Dimensions of Big Data
How To Store and Process Big Data?
Scale Up vs. Scale Out
The Big Data Stack
Data Analysis
Programming Languages
Platform - Data Processing
Platform - Data Storage
Resource Management
Spark Processing Engine
Why Spark?
Motivation (1/4)
- Benefits of data flow: the runtime can decide where to run tasks and
  can automatically recover from failures.
- E.g., MapReduce
Motivation (2/4)
Motivation (3/4)
Motivation (4/4)
Spark vs. MapReduce (1/2)
Spark vs. MapReduce (2/2)
Challenge
Solution
Resilient Distributed Datasets (RDD)
Resilient Distributed Datasets (RDD) (1/2)
Resilient Distributed Datasets (RDD) (2/2)
Programming Model
Spark Programming Model
RDD Operators (1/2)
RDD Operators (2/2)
RDD Transformations - Map
RDD Transformations - Reduce
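// Assumed input, reconstructed from the results shown below.
val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
pets.groupByKey()
// {(cat, (1, 2)), (dog, (1))}
pets.reduceByKey((x, y) => x + y)
// or, equivalently:
pets.reduceByKey(_ + _)
// {(cat, 3), (dog, 1)}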
RDD Transformations - Join
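// Assumed inputs, reconstructed from the results shown below.
val visits = sc.parallelize(Seq(("h", "1.2.3.4"), ("a", "3.4.5.6"), ("h", "1.3.3.1")))
val pageNames = sc.parallelize(Seq(("h", "Home"), ("a", "About")))
visits.join(pageNames)
// ("h", ("1.2.3.4", "Home"))
// ("h", ("1.3.3.1", "Home"))
// ("a", ("3.4.5.6", "About"))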
RDD Transformations - CoGroup
visits.cogroup(pageNames)
// ("h", (("1.2.3.4", "1.3.3.1"), ("Home")))
// ("a", (("3.4.5.6"), ("About")))
RDD Transformations - Union and Sample
- Union: merges two RDDs and returns a single RDD using bag semantics,
  i.e., duplicates are not removed.
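A minimal sketch of the bag semantics described above, assuming a local Spark installation. The names `left`, `right`, and `merged` are illustrative and not from the slides. The sketch also calls `sample`, which the slide title mentions, with an arbitrary fraction and seed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local run; getOrCreate reuses a context if one already exists.
val conf = new SparkConf().setAppName("union-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val left = sc.parallelize(Seq(1, 2, 3))
val right = sc.parallelize(Seq(3, 4))

// union keeps duplicates: the value 3 appears twice in the result.
val merged = left.union(right)
val mergedValues = merged.collect().sorted

// distinct() is a separate, explicit step if set semantics are wanted.
val distinctValues = merged.distinct().collect().sorted

// sample draws a random subset; fraction is an expected proportion, not an exact count.
val sampled = merged.sample(withReplacement = false, fraction = 0.5, seed = 42L)
```

Because `union` never compares elements, removing duplicates is left to an explicit `distinct()` call.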
Basic RDD Actions (1/2)
val nums = sc.parallelize(Array(1, 2, 3)) // assumed input, consistent with the results shown
nums.take(2) // Array(1, 2)
Basic RDD Actions (2/2)
nums.reduce((x, y) => x + y)
// or, equivalently:
nums.reduce(_ + _) // 6
SparkContext
new SparkContext(conf)
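The slide shows only the constructor call. Below is a minimal sketch of how `conf` might be built, assuming a local run. The application name and master URL are placeholders, and `SparkContext.getOrCreate` is used in place of the constructor so the snippet also works where a context already exists.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder settings: "local[2]" runs Spark in-process with two worker threads.
val conf = new SparkConf()
  .setAppName("spark-context-sketch")
  .setMaster("local[2]")

// Only one SparkContext may be active per JVM, so reuse one if it is already there.
val sc = SparkContext.getOrCreate(conf)

println(sc.appName)
println(sc.master)
```

A standalone application would normally call `sc.stop()` once its jobs have finished.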
Creating RDDs
val a = sc.textFile("file.txt")
val b = sc.textFile("directory/*.txt")
val c = sc.textFile("hdfs://namenode:9000/path/file")
Example 1
counts.saveAsTextFile("hdfs://...")
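Only the final save step of this example survives in the text. As a hedged illustration, here is one pipeline that could produce an RDD named `counts`, assuming the example is a word count (an assumption; the earlier lines of the slide are not available). It uses in-memory data so it runs without HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("wordcount-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// In-memory stand-in for a text file (hypothetical data).
val lines = sc.parallelize(Seq("to be or not", "to be"))

val counts = lines
  .flatMap(_.split(" "))      // split each line into words
  .map(word => (word, 1))     // pair each word with a count of 1
  .reduceByKey(_ + _)         // sum the counts per word

// collect() is an action; it triggers the computation and returns the pairs to the driver.
val result = counts.collect().toMap
```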
Example 2
Execution Engine
Spark Programming Interface
Lineage
RDD Dependencies (1/3)
RDD Dependencies: Narrow (2/3)
RDD Dependencies: Wide (3/3)
Job Scheduling (1/2)
Job Scheduling (2/2)
RDD Fault Tolerance
- No replication.
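Without replication, Spark rebuilds a lost RDD partition by recomputing it from its lineage. The sketch below prints the lineage Spark records for a small chain of transformations, using `toDebugString`. The RDD names are illustrative and not from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("lineage-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val nums = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// toDebugString describes the chain of parent RDDs (the lineage) that Spark
// would replay to rebuild a lost partition of `squares`.
val lineage = squares.toDebugString
println(lineage)

// An action forces the chain to be evaluated.
val total = squares.reduce(_ + _)
```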
Spark SQL
Spark and Spark SQL
DataFrame
- Homogeneous schema.
Adding Schema to RDDs
Creating DataFrames
val df = sqlContext.read.json(...)
DataFrame Operations (1/2)
- Domain-specific language for structured data manipulation.
DataFrame Operations (2/2)
- Domain-specific language for structured data manipulation.
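A minimal sketch of what the DataFrame domain-specific language can look like. It assumes Spark 2.0 or later, where `SparkSession` takes the place of the `sqlContext` used in these slides. The sample data and the names `people`, `adults`, and `ageGroups` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: Spark 2.0+ entry point, running locally.
val spark = SparkSession.builder()
  .appName("dataframe-dsl-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data with two named columns.
val people = Seq(("Michael", 29), ("Andy", 30), ("Justin", 19)).toDF("name", "age")

// filter and select only build a query plan; collect() is the action that runs it.
val adults = people.filter(col("age") > 21).select("name")
val adultNames = adults.collect().map(_.getString(0)).sorted

// groupBy plus count yields one row per distinct age.
val ageGroups = people.groupBy("age").count()
```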
Running SQL Queries Programmatically
Converting RDDs into DataFrames
// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext
.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
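The query above needs a table named `people` to exist. Below is a hedged sketch of one way to get there from an RDD, assuming Spark 2.0 or later (`SparkSession` and `createOrReplaceTempView` rather than the `sqlContext` shown in the slides). The data and the name `peopleRdd` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.0+ entry point, running locally.
val spark = SparkSession.builder()
  .appName("sql-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical RDD of (name, age) pairs.
val peopleRdd = spark.sparkContext.parallelize(
  Seq(("Michael", 29), ("Andy", 30), ("Justin", 19)))

// toDF attaches column names, turning the RDD into a DataFrame.
val people = peopleRdd.toDF("name", "age")

// Register the DataFrame under the table name used in the query.
people.createOrReplaceTempView("people")

val teenagers = spark.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
val teenagerNames = teenagers.collect().map(_.getAs[String]("name"))
```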
Data Sources
Advanced Programming
Shared Variables
Accumulators (1/2)
- The driver program can access the accumulated value by reading the
  value property of the accumulator.
scala> accum.value
res2: Int = 10
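A minimal sketch of how a value like the one above could be accumulated. It assumes Spark 2.0 or later and therefore uses `longAccumulator`; the `Int` result in the transcript suggests the older `sc.accumulator(0)` call, which Spark 2.0 deprecated. The accumulator name is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("accumulator-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Assumption: Spark 2.0+ accumulator API.
val accum = sc.longAccumulator("sum-sketch")

// Tasks can only add to the accumulator; they do not read it.
sc.parallelize(Seq(1, 2, 3, 4)).foreach(x => accum.add(x))

// The driver reads the merged result after the action has completed.
val total: Long = accum.value
println(total)
```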
Accumulators (2/2)
line.split(" ")
})
Broadcast Variables (1/4)
- The broadcast values are sent to each node only once, and should
  be treated as read-only variables.
- A process that uses a broadcast variable can access its value
  through the value property.
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
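A minimal sketch of a broadcast variable used as a read-only lookup table inside a transformation, assuming a local run. The table contents and the names `pageNames`, `visits`, and `named` are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical lookup table, shipped to each node once instead of with every task.
val pageNames = sc.broadcast(Map("h" -> "Home", "a" -> "About"))

val visits = sc.parallelize(Seq(("h", "1.2.3.4"), ("a", "3.4.5.6"), ("h", "1.3.3.1")))

// Tasks read the broadcast through .value and must not modify it.
val named = visits.map { case (page, ip) =>
  (pageNames.value.getOrElse(page, "unknown"), ip)
}
val result = named.collect().sorted
```

For a small table, this map-side lookup avoids the shuffle that joining two RDDs would require.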
Broadcast Variables (2/4)
Broadcast Variables (3/4)
Broadcast Variables (4/4)
Summary
- Dataflow programming
- Spark: RDD
Questions?