APACHE SPARK
Sugimiyanto Suma
Yasir Arfat
Supervisor:
Prof. Rashid Mehmood
HPC Saudi 2018
Outline
Big Data
Big Data Analytics
Problem
Basics of Apache Spark
Practice basic examples of Spark
Building and Running the Spark Applications
Spark’s Libraries
Practice Data Analytics using applications.
Performance tuning of Apache Spark: Cluster Resource Utilization
Advanced features of Apache Spark: when and how to use
Tutorial: Big data analytics using Apache Spark - Sugimiyanto Suma et al.-
Tutorial objective
Big data
Source: https://www.slideshare.net/AndersQuitzauIbm/big-data-analyticsin-energy-utilities
Source: sas.com
Big data analytics (cont.)
Problem
Data is growing faster than processing speeds.
A practical solution is to parallelize computation over the data on large clusters.
Cluster computing is now widely used in both enterprise and web industries.
Source: https://www.extremetech.com/wp-content/uploads/2012/03/ibm-supercomputer-p690-cluster.jpg
Apache Spark
History of Spark
Source: https://spark.apache.org/docs/latest/
Spark Community
Most active open-source community in big data: 200+ developers from 50+ companies contributing.
Source: https://training.databricks.com/
What is Spark?
Source: https://spark.apache.org/docs/latest/
Sequential vs. Parallel
Source: http://www.itrelease.com/2017/11/difference-serial-parallel-processing/
Why Spark?
Source: http://www.wiziq.com/blog/hype-around-apache-spark/
Spark Stack and what you can do
Spark SQL
For SQL and structured data processing
MLlib
Machine Learning Algorithms
GraphX
Graph Processing
Source: http://spark.apache.org
Source: http://spark.apache.org/docs/latest/cluster-overview.html
Terminology
Spark Context
Represents the connection to a Spark cluster; can be used to create RDDs, accumulators, and broadcast variables on the cluster.
Driver Program
The process running the main() function of the application and creating the SparkContext.
Cluster Manager
An external service to manage resources on the cluster (standalone manager, YARN, Apache Mesos).
Deploy Mode
In "cluster" mode, the framework launches the driver inside of the cluster.
In "client" mode, the submitter launches the driver outside of the cluster.
Source: http://spark.apache.org/docs/
Terminology (cont.)
Source: http://spark.apache.org/docs/
RDD (Resilient Distributed Dataset): a read-only, partitioned collection of records.
Source: http://spark.apache.org/docs/
A list of partitions
Optionally, a partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
Source: http://spark.apache.org/docs/
3 main steps:
▪ Create RDD
▪ Transformations
▪ Actions
Source: http://www.sachinpbuzz.com/2015/11/apache-spark.html
Type of Operations
Transformations
map
filter
flatMap
union, etc.
Actions
collect
count
take
foreach, etc.
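A short sketch of the distinction, assuming a live SparkContext `sc` (for example inside spark-shell):

```scala
val nums = sc.parallelize(1 to 10)   // create RDD
val odds = nums.filter(_ % 2 == 1)   // transformation: lazy, nothing runs yet
val doubled = odds.map(_ * 2)        // another lazy transformation
doubled.count()                      // action: triggers computation, returns 5
doubled.collect()                    // action: Array(2, 6, 10, 14, 18)
```

Transformations only build up the lineage; work happens when an action such as count or collect is called.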
Source: http://10minbasics.com/what-is-apache-spark/
Creating RDDs
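The code shown on this slide did not survive extraction; the two standard ways to create an RDD, assuming a SparkContext `sc` (the file path below is a placeholder), are:

```scala
// 1. Parallelize an existing collection in the driver program
val data = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 2)

// 2. Reference a dataset in external storage (local file, HDFS, S3, ...)
val lines = sc.textFile("file:///tmp/input.txt")  // placeholder path
```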
[Diagram: an RDD of the elements 1, 2, 3, 4 split across two cores, each element mapped with +1]
[Diagram: the same RDD split across four cores, each element mapped with +1 in parallel]
Spark Web UI
Stage details event timeline showing task execution. Notice that tasks are executed sequentially: there is only one task running at a time.
Stage details event timeline showing task execution. Notice that two tasks are executed in parallel at a time.
Stage details event timeline over four cores: parallelism increases as more cores are used. Four tasks run in parallel at a time.
Cluster Deployment
Local mode
Standalone Deploy Mode
Apache Mesos
Hadoop YARN
Amazon EC2
Cluster Deployment (cont.)
Source: spark.apache.org/docs/latest/submitting-applications.html
Interactively (Spark-shell)
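The screenshot of the interactive session was lost in extraction; a typical spark-shell exchange looks roughly like this (the shell creates `sc` for you; the RDD below is illustrative):

```
$ spark-shell --master local[2]
scala> val rdd = sc.parallelize(1 to 100)
scala> rdd.filter(_ > 90).count()
res0: Long = 10
```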
spark-submit \
--class company.division.yourClass \
--master spark://<host>:7077 \
--name "countingCows" \
CountingCows-1.0.jar
import org.apache.spark._
// ... set up the SparkContext sc and build the RDD rddAdd (slide content lost) ...
rddAdd.collect()  // Action: returns the RDD's elements to the driver
sc.stop()
Building and Running a Filter Application
cd /home/hduser/Tutorial/SimpleApp/Filter
1. Create the main class directory: mkdir -p src/main/scala
2. Create the script file: vim src/main/scala/Filter.scala
2.1 Use this code:

import org.apache.spark._

object Filter {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Filter")
    val sc = new SparkContext(conf)
    val x = sc.parallelize(1 to 100000, 2)  // RDD with 2 partitions
    val y = x.filter(e => e % 2 == 1)       // transformation: keep odd numbers
    y.foreach(println(_))                   // action
    println("Done.")
    sc.stop()
  }
}
Source: http://backtobazics.com/big-data/spark/apache-spark-filter-example/
Building and Running the Filter Application (cont.)
3. Create an sbt file: vim build.sbt
4. Use these application properties:
name := "Filter"
version := "1.0"
scalaVersion := "2.11.8"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
import org.apache.spark._

object Toy {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("ToyApp")
    val sc = new SparkContext(conf)
    val start = System.nanoTime
    // Create an RDD
    val rdd = sc.textFile("file:///home/hduser/Tutorial/SimpleApp/WordCount big.txt")
    // Transformation (elided on the slide); assumed here to lower-case each line
    val rddLower = rdd.map(_.toLowerCase)
    // Action
    rddLower.coalesce(1).saveAsTextFile("file:///home/hduser/Tutorial/big_clean.txt")
    val duration = (System.nanoTime - start) / 1e9d
    println("Done.")
    println("Execution time: " + duration)
    sc.stop()
  }
}
Building and Running the Word Count Application
3. Create an sbt file: vim build.sbt
4. Use these application properties:
name := "WordCount"
version := "1.0"
scalaVersion := "2.11.8"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
Questions?
Break
30 minutes
Welcome Back
Sugimiyanto Suma
Yasir Arfat
Supervisor:
Prof. Rashid Mehmood
Outline
Big Data
Big Data Analytics
Problem
Basics of Apache Spark
Practice basic examples of Spark
Building and Running the Spark Applications
Spark’s Libraries
Practice Data Analytics using applications.
Performance tuning of Apache Spark: Cluster Resource Utilization
Advanced features of Apache Spark: when and how to use
Analysis Techniques
Machine Learning
"The field of study that gives computers the ability to learn without being explicitly programmed." (Arthur Samuel)
MLlib
Spark SQL
Source: spark.apache.org/mllib/
MLlib (cont.)
MLlib contains many algorithms and utilities.
ML algorithms include:
Classification: logistic regression, naive Bayes, ...
Regression: generalized linear regression, survival regression, ...
Clustering: K-means, Gaussian mixture models (GMMs), ...
ML workflow utilities include:
Feature transformations: standardization, normalization, hashing, ...
Model evaluation and hyper-parameter tuning
Other utilities include:
Distributed linear algebra: SVD, PCA, ...
Statistics: summary statistics, hypothesis testing, ...
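As a concrete taste of the list above, a minimal K-means sketch using the RDD-based MLlib API, assuming a SparkContext `sc` (the toy points and k = 2 are made up for illustration):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Toy 2-D points; in practice these would be parsed from a file.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
))
// Cluster into k = 2 groups, up to 20 iterations.
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)
```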
MLlib
Spark SQL
Spark SQL
Source: spark.apache.org/sql/
Standard Connectivity
Connect through JDBC or ODBC.
Source: https://spark.apache.org/sql/
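A minimal Spark SQL sketch, assuming Spark 2.x's SparkSession entry point (the people.json input and the people schema are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SQLDemo").getOrCreate()
// people.json is a placeholder input with name/age fields
val df = spark.read.json("file:///tmp/people.json")
df.createOrReplaceTempView("people")   // register for SQL queries
spark.sql("SELECT name FROM people WHERE age > 21").show()
```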
cd /home/hduser/Tutorial/Analytic/EventDetection/App
Performance Tuning of Apache Spark
Spark Web UI
Partitioning
Profiler
Source: blog.cloudera.com
Spark Web UI
Partitioning
Profiler
Spark Web UI
Source: (Malak & East 2016)
Performance Tuning of Apache Spark
Spark Web UI
Partitioning
Profiler
In fact, only one task can work on a single partition at a time, so if the number of partitions is less than the number of available executor cores, the stage won't take advantage of the full resources available.
Call rdd.partitions.size to find out how many partitions your RDD has. The default parallelism can be retrieved using sc.defaultParallelism.
One way to resolve this problem is to use the repartition method on the RDD to supply a recommended new number of partitions:

val rdd = …
rdd.repartition(10)
Source: (Malak & East 2016)
Performance Tuning of Apache Spark
Spark Web UI
Partitioning
Profiler
YourKit Profiler
YourKit Profiler (cont.)
Advanced Features of Apache Spark
Advanced Features of Apache Spark
Shared Variables
Accumulators
Broadcast Variable
RDD Persistence
Shared Variables
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function.
These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program.
Source: https://www.tutorialspoint.com/apache_spark/advanced_spark_programming.htm
Accumulators
Accumulators are variables that are only “added” to through an associative and
commutative operation and can therefore be efficiently supported in parallel.
Source: https://www.tutorialspoint.com/apache_spark/advanced_spark_programming.htm
Accumulators (cont.)
scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))

scala> accum.value
res2: Long = 10
Source: https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#accumulators
Accumulators (cont.)
Source: (Malak & East 2016)
Advanced Features of Apache Spark
Shared Variables
Accumulators
Broadcast Variable
RDD Persistence
Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
Source: https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
Source: https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables
Advanced Features of Apache Spark
Shared Variables
Accumulators
Broadcast Variable
RDD Persistence
RDD Persistence
One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations.
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).
This allows future actions to be much faster (often by more than 10x).
Caching is a key tool for iterative algorithms and fast interactive use.
Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
Source: https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-persistence
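A short sketch of caching an RDD that is reused across actions, assuming a SparkContext `sc` (the input path is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

val words = sc.textFile("file:///tmp/input.txt")  // placeholder path
              .flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_ONLY)  // equivalent to words.cache()
words.count()             // first action: computes and caches the partitions
words.distinct().count()  // reuses the cached data instead of re-reading the file
```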
References
blog.cloudera.com
Gandomi, A. & Haider, M., 2015. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), pp.137–144. Available at: http://dx.doi.org/10.1016/j.ijinfomgt.2014.10.007.
http://www.itrelease.com/2017/11/difference-serial-parallel-processing/
https://www.tutorialspoint.com/apache_spark/advanced_spark_programming.htm
https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-persistence
https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables
Malak, M.S. & East, R., 2016. Spark GraphX in Action, Manning Publications.
Rajdeep, D., Ghotra, M.S. & Pentreath, N., 2017. Machine Learning with Spark, Birmingham, Mumbai: Packt Publishing.
Useful Resources
Contact