
DSK-3576

Apache Spark:
What's Under the Hood?
Adarsh Pannu
Senior Technical Staff Member
IBM Analytics Platform

adarshrp@us.ibm.com

© 2015 IBM Corporation


Abstract
This session covers Spark's architectural components through the life of simple Spark jobs. Those already familiar with the basic Spark API will gain deeper knowledge of how Spark works, with the potential to become advanced Spark users, administrators or contributors.

1
Outline

Why Understand Spark Internals? To Write Better Applications.

Simple Spark Application


Resilient Distributed Datasets
Cluster architecture
Job Execution through Spark components
Tasks & Scheduling
Shuffle
Memory management
Writing better Spark applications: Tips and Tricks
First Spark Application

On-Time Arrival Performance Dataset


Record of every US airline flight since the 1980s.
Fields:
Year, Month, DayofMonth
UniqueCarrier,FlightNum
DepTime, ArrTime, ActualElapsedTime
ArrDelay, DepDelay
Origin, Dest, Distance
Cancelled, ...

Where, When, How Long? ...

3
First Spark Application (contd.)

Sample record (selected fields labeled):

2004,2,5,4,1521,1530,1837,1838,CO,65,N67158,376,368,326,-1,-9,EWR,LAX,2454,...

Year=2004, Month=2, DayofMonth=5, DepTime=1521, UniqueCarrier=CO, FlightNum=65, ActualElapsedTime=376, Origin=EWR, Dest=LAX, Distance=2454

Full schema:

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay

4
First Spark Application (contd.)

Which airports handle the most airline carriers?


! Small airports (e.g., Ithaca) are served by only a few airlines
! Larger airports (e.g., Newark) handle dozens of carriers

In SQL, this translates to:


SELECT Origin, count(distinct UniqueCarrier)
FROM flights
GROUP BY Origin

5
// Read data rows
//   2004,2,5...,AA,..,EWR,...   2004,3,22,...,UA,...SFO,...   2004,3,22,...,AA,...EWR,..
sc.textFile("hdfs:///.../flights").

// Extract Airport (key) and Airline (value)
//   (EWR, AA)   (SFO, UA)   (EWR, UA)
map(row => (row.split(",")(16), row.split(",")(8))).

// Group by Airport
//   (EWR, [AA, UA])   (SFO, [UA])
groupByKey.

// Discard duplicate pairs, and compute group size
//   (EWR, 2)   (SFO, 1)
mapValues(values => values.toSet.size).

// Return results to client
//   [ (EWR, 2), (SFO, 1) ]
collect
// Base RDD
sc.textFile("hdfs:///.../flights").

// Transformed RDDs
map(row => (row.split(",")(16), row.split(",")(8))).
groupByKey.
mapValues(values => values.toSet.size).

// Action
collect
What's an RDD?
Resilient Distributed Datasets

(Illustration: flight records such as "CO780, IAH, MCI", "DL282, ATL, CVG" and "UA620, SJC, ORD" partitioned across machines.)

Key abstraction in Spark
Immutable collection of objects
Distributed across machines
Can be operated on in parallel
Can hold any kind of data:
! Hadoop datasets
! Parallelized Scala collections
! RDBMS or No-SQL, ...
Can recover from failures, be cached, ...
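
For illustration, a minimal sketch of creating RDDs from the kinds of sources listed above (the HDFS path and the carrier list are hypothetical):

// RDD backed by a Hadoop dataset: each line of the file becomes one element
val flights = sc.textFile("hdfs:///data/flights")          // hypothetical path

// RDD backed by a parallelized Scala collection
val carriers = sc.parallelize(Seq("AA", "UA", "DL", "CO"))

// Both are RDDs and support the same parallel operations
flights.count()
carriers.filter(_ != "CO").collect()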

8
Resilient Distributed Datasets (RDD)

1. Set of partitions ("splits" in Hadoop)
2. List of dependencies on parent RDDs
3. Function to compute a partition given its parent(s)
4. (Optional) Partitioner (hash, range)
5. (Optional) Preferred location(s)

Items 1-3 capture the RDD's lineage; items 4-5 enable optimized execution.
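
These five pieces correspond closely to the interface of Spark's RDD abstract class; a simplified sketch (signatures abbreviated from the actual Scala API):

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

abstract class RDD[T] {
  protected def getPartitions: Array[Partition]                      // 1. set of partitions
  protected def getDependencies: Seq[Dependency[_]]                  // 2. dependencies on parents (lineage)
  def compute(split: Partition, context: TaskContext): Iterator[T]   // 3. compute one partition
  val partitioner: Option[Partitioner] = None                        // 4. optional partitioner
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil  // 5. optional preferred locations
}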

9
We've written the code, what next?

Spark supports four different cluster managers:

Local: useful only for development
Standalone: bundled with Spark; doesn't play well with other applications
YARN
Mesos

Each mode has a similar logical architecture, although physical details differ in terms of which processes and threads are launched, and where.
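
The cluster manager is chosen through the master URL the application supplies; a sketch in Scala, with hypothetical host names:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().
  setAppName("FlightAnalysis").
  setMaster("local[4]")                  // Local: in-process, 4 threads (development only)
// Alternatives, depending on the cluster manager:
//   .setMaster("spark://master:7077")   // Standalone (bundled with Spark)
//   .setMaster("yarn-client")           // YARN (Spark 1.x syntax)
//   .setMaster("mesos://master:5050")   // Mesos
val sc = new SparkContext(conf)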

10
Spark Cluster Architecture: Logical View

(Diagram: the Driver Program, holding the SparkContext, talks to a Cluster Manager, which launches Executors; each Executor runs Tasks and holds a Cache.)

Driver represents the application. It runs the main() function.

SparkContext is the main entry point for Spark functionality. It represents the connection to a Spark cluster.

Executor runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.

11
What's Inside an Executor?

(Diagram: a single JVM containing running tasks and free task slots, cached RDD partitions (e.g. RDD1-1, RDD2-3), broadcast variables (Broadcast-1, Broadcast-2), other global memory, and internal threads for shuffle, transport, GC, ...)
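
The cached RDD partitions and broadcast variables shown above come from application code; a minimal sketch (the path and lookup table are hypothetical):

// Cache RDD partitions in executor memory (the "cached RDD partitions" region)
val pairs = sc.textFile("hdfs:///.../flights").
  map(row => (row.split(",")(16), row.split(",")(8)))
pairs.cache()                        // or persist(StorageLevel.MEMORY_AND_DISK)

// Ship a read-only lookup table to every executor once (the "broadcast" region)
val airportNames = sc.broadcast(Map("EWR" -> "Newark", "SFO" -> "San Francisco"))
pairs.map { case (origin, carrier) =>
  (airportNames.value.getOrElse(origin, origin), carrier)
}.take(5)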

12
Spark Cluster Architecture: Physical View

(Diagram: three nodes, each running an Executor JVM that holds several RDD partitions, runs multiple tasks, and has its own internal threads; the Driver runs separately and coordinates all executors.)

13
Ok, can we get back to the code?

14
Spark Execution Model

sc.textFile("...").
map(row => ...).
groupByKey.
mapValues(...).
collect

Application Code -> RDD DAG -> DAG and Task Schedulers -> Executor(s) running Tasks

15
Spark builds DAGs

sc.textFile("hdfs:///.../flightdata").
map(row => (row.split(",")(16), row.split(",")(8))).
groupByKey.
mapValues(values => values.toSet.size).
collect

Spark applications are written in a functional style. Internally, Spark turns a functional pipeline into a graph of RDD objects: a Directed (arrows), Acyclic (no loops) Graph.

16
Spark builds DAGs (contd.)
Data-set level view: HadoopRDD -> MapPartitionsRDD -> ShuffleRDD -> MapPartitionsRDD (the partition-level view shows the same chain for each partition)

Spark has a rich collection of operations that generally map to RDD classes: HadoopRDD, FilteredRDD, JoinRDD, MapPartitionsRDD, etc.
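
You can see the RDD graph Spark builds for this pipeline with toDebugString; a sketch (the output format varies by Spark version):

val perAirport = sc.textFile("hdfs:///.../flights").
  map(row => (row.split(",")(16), row.split(",")(8))).
  groupByKey.
  mapValues(values => values.toSet.size)

// Prints the lineage, e.g. MapPartitionsRDD <- ShuffledRDD <- MapPartitionsRDD <- HadoopRDD
println(perAirport.toDebugString)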

17
DAG Scheduler

Stage 1: textFile -> map
Stage 2: groupByKey -> mapValues -> collect

Split the DAG into stages
A stage marks the boundaries of a pipeline
One stage is completed before starting the next stage

18
DAG Scheduler: Pipelining

Raw rows:           2004,2,5...,AA,..,EWR,...   2004,3,22,...,UA,...SFO,...   2004,3,22,...,AA,...EWR,..
After map:          (EWR, AA)   (SFO, UA)   (EWR, UA)
After groupByKey:   (EWR, [AA, UA])   (SFO, [UA])

Within a stage (here, textFile and map), each record flows through the whole pipeline of operations without materializing intermediate datasets.


Task Scheduler (contd.)

Stage 1: textFile -> map

Turn the Stages into Tasks
Task = Data + Computation
Ship tasks across the cluster
All RDDs in the stage have the same number of partitions
One task computes each pipeline (see the sketch below)
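
Since one task is launched per partition, you can predict the task count by inspecting partition counts; a minimal sketch (the count shown is illustrative):

val flights = sc.textFile("hdfs:///.../flights")
val pairs = flights.map(row => (row.split(",")(16), row.split(",")(8)))

// Pipelined stage: textFile and map share the same partitioning,
// so one task per partition computes both operations
println(flights.partitions.length)   // e.g. 4, driven by the HDFS input splits
println(pairs.partitions.length)     // same number as above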

20
Task Scheduler (contd.)
Stage 1: textFile -> map (the Computation)

hdfs://.../flights (partition 1) -> Task 1
hdfs://.../flights (partition 2) -> Task 2
hdfs://.../flights (partition 3) -> Task 3
hdfs://.../flights (partition 4) -> Task 4    (the Data)

21
Task Scheduler (contd.)
(Diagram: 4 HDFS partitions scheduled onto 3 cores over time. Tasks 1-3, each a pipelined HadoopRDD -> MapPartitionsRDD computation, start immediately; task 4 runs as soon as a core frees up.)

22
Task Scheduler: The Shuffle

Stage 1: textFile -> map
(all-to-all data movement at groupByKey)
Stage 2: groupByKey -> mapValues -> collect

23
Task Scheduler: The Shuffle (contd.)

(Diagram: the four Stage 1 partitions each send data to both Stage 2 partitions.)

Redistributes data among partitions


Typically hash-partitioned but can have user-defined partitioner

Avoided when possible, if data is already properly partitioned


Partial aggregation reduces data movement
! Similar to Map-Side Combine in MapReduce
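
As a sketch of both points (partitioning reuse and partial aggregation), assuming the flight pairs from earlier:

import org.apache.spark.HashPartitioner

val pairs = sc.textFile("hdfs:///.../flights").
  map(row => (row.split(",")(16), row.split(",")(8)))

// Hash-partition once and cache; later key-based operations that use the
// same partitioner find the data already in place and avoid another shuffle
val partitioned = pairs.partitionBy(new HashPartitioner(16)).cache()

// reduceByKey reuses the existing partitioning here, and it also combines
// values on the map side (partial aggregation) when a shuffle is needed
val flightsPerAirport = partitioned.mapValues(_ => 1).reduceByKey(_ + _)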

24
Task Scheduler: The Shuffle (contd.)

(Diagram: each of the four Stage 1 tasks writes one intermediate file per Stage 2 partition; the two Stage 2 tasks then pull their files from every Stage 1 output.)

Shuffle writes intermediate files to disk


These files are pulled by the next stage
Two algorithms: sort-based (new/default) and hash-based (older)

25
groupByKey

After the shuffle, each reduce-side (Stage 2) partition contains all the values for its keys.
Within each partition, groupByKey builds an in-memory hash map:

EWR -> [UA, AA, ...]
SFO -> [UA, ...]
JFK -> [DL, UA, AA, DL, ...]

Each single key-values pair must fit in memory; the map as a whole can be spilled to disk in its entirety.
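
One way to relieve that memory pressure (not the rewrite used later in this deck) is aggregateByKey, which builds each per-key result incrementally instead of collecting all values first; a sketch:

val pairs = sc.textFile("hdfs:///.../flights").
  map(row => (row.split(",")(16), row.split(",")(8)))

// Build the distinct-carrier set per airport incrementally, instead of
// first materializing every value for a key as one in-memory group
val carriersPerAirport = pairs.aggregateByKey(Set.empty[String])(
  (set, carrier) => set + carrier,   // fold one value into a partition-local set
  (s1, s2) => s1 ++ s2               // merge sets produced by different partitions
).mapValues(_.size)
carriersPerAirport.collect()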

26
Task Execution
Spark jobs can have any number of stages (1, 2, ... N).
There's a shuffle between stages.
The last stage ends with an action: sending results back to the client, writing files to disk, etc.

27
Writing better Spark applications

How can we optimize our application? Key considerations:

Partitioning: How many tasks are in each stage?

Shuffle: How much data is moved across nodes?

Memory pressure: usually a byproduct of partitioning and shuffling

Code path length: Is your code optimal? Are you using the best available Spark API for the job?

31
Writing better Spark applications: Partitioning
Too few partitions?
! Less concurrency
! More susceptible to data skew
! Increased memory pressure
Too many partitions?
! Over-partitioning leads to very short-running tasks
! Administrative inefficiencies outweigh the benefits of parallelism
Need a reasonable number of partitions
! Usually a function of the number of cores in the cluster (~2x the core count is a good rule of thumb)
! Ensure task execution time > task serialization time
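
The partition count can be set at several points; a sketch of the common knobs (the numbers are illustrative, not recommendations):

// At read time: request a minimum number of input partitions
val flights = sc.textFile("hdfs:///.../flights", minPartitions = 16)

// After the fact: repartition (full shuffle) or coalesce (narrow, shrink only)
val wider = flights.repartition(64)
val fewer = wider.coalesce(8)

// For shuffle operations: pass the partition count explicitly
val counts = flights.map(row => (row.split(",")(16), 1)).reduceByKey(_ + _, 32)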

32
Writing better Spark applications: Memory

Symptoms
! Bad performance
! Executor failures (OutOfMemory errors)

Resolution
! Use GC and other traces to track memory usage
! Give Spark more memory
! Tweak the memory distribution between user memory and Spark memory
! Increase number of partitions
! Look at your code!
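
A sketch of the first two suggestions expressed as configuration (Spark 1.x property names; values are illustrative):

val conf = new org.apache.spark.SparkConf().
  set("spark.executor.memory", "4g").            // give Spark more executor memory
  set("spark.executor.extraJavaOptions",         // emit GC traces in the executor logs
      "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")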

33
Rewriting our application

sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
map(row => (row.split(",")(16), row.split(",")(8))).
groupByKey.
mapValues(values => values.toSet.size).
collect

Partitioning
Shuffle size
Memory pressure
Code path length

34
Rewriting our application (contd.)

sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
map(row => (row.split(",")(16), row.split(",")(8))).
repartition(16).
groupByKey.
mapValues(values => values.toSet.size).
collect

How many stages do you see?
Answer: 3

Partitioning
Shuffle size
Memory pressure
Code path length

35
Rewriting our application (contd.)

sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
map(row => (row.split(",")(16), row.split(",")(8))).
repartition(16).
distinct.
groupByKey.
mapValues(values => values.toSet.size).
collect

How many stages do you see now?
Answer: 4

Partitioning
Shuffle size
Memory pressure
Code path length

37
Rewriting our application (contd.)

sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
map(row => (row.split(",")(16), row.split(",")(8))).
distinct(numPartitions = 16).
groupByKey.
mapValues(values => values.size).
collect

Partitioning
Shuffle size
Memory pressure
Code path length

39
Rewriting our application (contd.)

sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
map(row => {
val cols = row.split(",")
(cols(16), cols(8))
}).
distinct(numPartitions = 16).
groupByKey.
mapValues(values => values.size).
collect

Partitioning
Shuffle size
Memory pressure
Code path length

41
Rewriting our application (contd.)

sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
map(row => {
val cols = row.split(",")
(cols(16), cols(8))
}).
distinct(numPartitions = 16).
map(e => (e._1, 1)).
reduceByKey(_ + _).
collect

Partitioning
Shuffle size
Memory pressure
Code path length

42
Original

sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
map(row => (row.split(",")(16), row.split(",")(8))).
groupByKey.
mapValues(values => values.toSet.size).
collect

OutOfMemory error after running for several minutes on my laptop.

Revised

sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
map(row => {
  val cols = row.split(",")
  (cols(16), cols(8))
}).
distinct(numPartitions = 16).
map(e => (e._1, 1)).
reduceByKey(_ + _).
collect

Completed in seconds.

43
Writing better Spark applications: Configuration
How many Executors per Node?
--num-executors OR spark.executor.instances

How many tasks can each Executor run simultaneously?


--executor-cores OR spark.executor.cores

How much memory does an Executor have?


--executor-memory OR spark.executor.memory

How is the memory divided inside an Executor?


spark.storage.memoryFraction and spark.shuffle.memoryFraction

How is the data stored? Partitioned? Compressed?
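
The same settings can also be supplied programmatically; a sketch using Spark 1.x property names (values are illustrative only):

val conf = new org.apache.spark.SparkConf().
  setAppName("FlightAnalysis").
  set("spark.executor.instances", "4").          // executors (YARN)
  set("spark.executor.cores", "2").              // simultaneous tasks per executor
  set("spark.executor.memory", "4g").            // memory per executor
  set("spark.storage.memoryFraction", "0.5").    // share reserved for cached RDDs
  set("spark.shuffle.memoryFraction", "0.3")     // share reserved for shuffle buffers
val sc = new org.apache.spark.SparkContext(conf)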

44
As you can see, writing Spark jobs is easy. However, doing so in
an efficient manner takes some know-how.

Want to learn more about Spark?


______________________________________________

Reference slides are at the end of this slide deck.

45
Notices and Disclaimers
Copyright 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form
without written permission from IBM.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.

Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for
accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to
update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO
EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO,
LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted
according to the terms and conditions of the agreements under which they are provided.

Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.

Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as
illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other
results in other operating environments may vary.

References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or
services available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the
views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or
other guidance or advice to any individual participant or their specific situation.

It is the customer's responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the
identification and interpretation of any relevant laws and regulatory requirements that may affect the customer's business and any actions the
customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will
ensure that the customer is in compliance with any law.

46
Notices and Disclaimers (cont)

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly
available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM's products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights,
trademarks or other intellectual property right.

IBM, the IBM logo, ibm.com, Aspera, Bluemix, Blueworks Live, CICS, Clearcase, Cognos, DOORS, Emptoris, Enterprise Document
Management System, FASP, FileNet, Global Business Services , Global Technology Services , IBM ExperienceOne, IBM
SmartCloud, IBM Social Business, Information on Demand, ILOG, Maximo, MQIntegrator, MQSeries, Netcool, OMEGAMON,
OpenPower, PureAnalytics, PureApplication, pureCluster, PureCoverage, PureData, PureExperience, PureFlex, pureQuery,
pureScale, PureSystems, QRadar, Rational, Rhapsody, Smarter Commerce, SoDA, SPSS, Sterling Commerce, StoredIQ,
Tealeaf, Tivoli, Trusteer, Unica, urban{code}, Watson, WebSphere, Worklight, X-Force and System z Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:
www.ibm.com/legal/copytrade.shtml.

47
Thank You

© 2015 IBM Corporation


We Value Your Feedback!

Don't forget to submit your Insight session and speaker feedback! Your feedback is very important to us; we use it to continually improve the conference.

Access your surveys at insight2015survey.com to quickly


submit your surveys from your smartphone, laptop or
conference kiosk.

49
Acknowledgements

Some of the material in this session was informed and inspired


by presentations done by:

Matei Zaharia (Creator of Apache Spark)


Reynold Xin
Aaron Davidson
Patrick Wendell
... and dozens of other Spark contributors.

50
Reference Slides

51
How deep do you want to go?

Intro: What is Spark? How does it relate to Hadoop? When would you use it? (1-2 hours)

Basic: Understand the basic technology and write simple programs. (1-2 days)

Intermediate: Start writing complex Spark programs even as you understand operational aspects. (5-15 days, to weeks and months)

Expert: Become a Spark Black Belt! Know Spark inside out. (Months to years)
Intro Spark
Go through these additional presentations to understand the value of Spark. These
speakers also attempt to differentiate Spark from Hadoop, and enumerate its comparative
strengths. (Not much code here)

" Turning Data into Value, Ion Stoica, Spark Summit 2013 Video & Slides 25 mins
https://spark-summit.org/2013/talk/turning-data-into-value

" Spark: Whats in it your your business? Adarsh Pannu 60 mins


IBM Insight Conference 2015

" How Companies are Using Spark, and Where the Edge in Big Data Will Be, Matei
Zaharia, Video & Slides 12 mins
http://conferences.oreilly.com/strata/strata2014/public/schedule/detail/33057

" Spark Fundamentals I (Lesson 1 only), Big Data University <20 mins
https://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/
Basic Spark

"Pick up some Scala through this article co-authored


by Scalas creator, Martin Odersky. Link

http://www.artima.com/scalazine/articles/steps.html

Estimated time: 2 hours


Basic Spark (contd.)

" Do these two courses. They cover Spark basics and include a
certification. You can use the supplied Docker images for all other
labs.

7 hours
Basic Spark (contd.)

" Go to spark.apache.org and study the Overview and the


Spark Programming Guide. Many online courses borrow
liberally from this material. Information on this site is
updated with every new Spark release.

Estimated 7-8 hours.


Intermediate Spark

" Stay at spark.apache.org. Go through the component specific


Programming Guides as well as the sections on Deploying and More.
Browse the Spark API as needed.

Estimated time 3-5 days and more.


Intermediate Spark (contd.)
Learn about the operational aspects of Spark:
" Advanced Apache Spark (DevOps) 6 hours # EXCELLENT!
Video https://www.youtube.com/watch?v=7ooZ4S7Ay6Y
Slides https://www.youtube.com/watch?v=7ooZ4S7Ay6Y

" Tuning and Debugging Spark Slides 48 mins


Video https://www.youtube.com/watch?v=kkOG_aJ9KjQ

Gain a high-level understanding of Spark architecture:


" Introduction to AmpLab Spark Internals, Matei Zaharia, 1 hr 15 mins
Video https://www.youtube.com/watch?v=49Hr5xZyTEA
" A Deeper Understanding of Spark Internals, Aaron Davidson, 44 mins
Video https://www.youtube.com/watch?v=dmL0N3qfSc8
PDF https://spark-summit.org/2014/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf
Intermediate Spark (contd.)
Experiment, experiment, experiment ...

" Setup your personal 3-4 node cluster

" Download some open data. E.g. airline data on


stat-computing.org/dataexpo/2009/

" Write some code, make it run, see how it performs, tune it, trouble-shoot it

" Experiment with different deployment modes (Standalone + YARN)

" Play with different configuration knobs, check out dashboards, etc.

" Explore all subcomponents (especially Core, SQL, MLLib)


Advanced Spark: Original Papers

Read the original academic papers:

" Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Matei Zaharia, et al.

" Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters, Matei Zaharia, et al.

" GraphX: A Resilient Distributed Graph System on Spark, Reynold S. Xin, et al.

" Spark SQL: Relational Data Processing in Spark, Michael Armbrust, et al.
Advanced Spark: Enhance your Scala skills

" Use this as your This book by " Excellent MooC by Odersky. Some of
primary Scala text Odersky is excellent the material is meant for CS majors.
but it isnt meant to Highly recommended for STC
give you a quick developers.
start. Its deep stuff. 35+ hours
Advanced Spark: Browse Conference Proceedings

Spark Summits cover technology and use cases. Technology is also covered in various other places, so you could consider skipping those tracks. Don't forget to check out the customer stories. That is how we learn about enablement opportunities and challenges, and in some cases, we can see through the Spark hype.

100+ hours of FREE videos and associated PDFs are available on spark-summit.org. You don't even have to pay the conference fee! Go back in time and attend these conferences!
Advanced Spark: Browse YouTube Videos

YouTube is full of training videos, some good, others not so much. These are the only channels you need to watch, though. There is a lot of repetition in the material, and some of the videos are from the conferences mentioned earlier.
Advanced Spark: Check out these books

" Provides a good overview of Spark; much of the material is also available through other sources previously mentioned.

" Covers concrete statistical analysis / machine learning use cases. Covers the Spark APIs and MLLib. Highly recommended for data scientists.
Advanced Spark: Yes ... read the code
Even if you don't intend to contribute to Spark, there are a ton of valuable comments in the code that provide insights into Spark's design, and these will help you write better Spark applications. Don't be shy! Go to github.com/apache/spark and check it out.
