Apache Spark:
What's Under the Hood?
Adarsh Pannu
Senior Technical Staff Member
IBM Analytics Platform
adarshrp@us.ibm.com
Outline
First Spark Application (contd.)
Sample record (fields called out on the slide: Year, Month, DayofMonth, DepTime, UniqueCarrier, FlightNum, ActualElapsedTime, Origin, Dest, Distance):

2004,2,5,4,1521,1530,1837,1838,CO,65,N67158,376,368,326,-1,-9,EWR,LAX,2454,...

Full list of fields:

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
First Spark Application (contd.)
sc.textFile("hdfs:///.../flights").                      // Read data rows
// 2004,2,5...,AA,..,EWR,...   2004,3,22,...,UA,...SFO,...   2004,3,22,...,AA,...EWR,..

  map(row => (row.split(",")(16), row.split(",")(8))).   // Extract Airport (key) and Airline (value)
// (EWR, AA)   (SFO, UA)   (EWR, UA)

  groupByKey.                                            // Group by Airport
// (EWR, [AA, UA])   (SFO, [UA])

  mapValues(values => values.toSet.size).                // Discard duplicate pairs, and compute group size
// (EWR, 2)   (SFO, 1)

  collect                                                // Return results to client
// [ (EWR, 2), (SFO, 1) ]
Base RDD:         sc.textFile("hdfs:///.../flights")
Transformed RDDs: map, groupByKey, mapValues
Action:           collect
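Put together as a complete, runnable program, the pipeline might look like the sketch below. This is my reconstruction, not a slide: the object name, the local[*] master, and the final println are illustrative additions, and the elided HDFS path is kept exactly as the slides show it.

import org.apache.spark.{SparkConf, SparkContext}

object DistinctAirlines {
  def main(args: Array[String]): Unit = {
    // "local[*]" is for a laptop run; under spark-submit the master is
    // normally supplied on the command line instead.
    val conf = new SparkConf().setAppName("DistinctAirlines").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val results = sc.textFile("hdfs:///.../flights").       // base RDD
      map(row => (row.split(",")(16), row.split(",")(8))).  // (Origin, UniqueCarrier)
      groupByKey.                                           // group airlines by airport
      mapValues(values => values.toSet.size).               // count distinct airlines
      collect                                               // action: fetch results to the driver

    results.foreach(println)
    sc.stop()
  }
}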
What's an RDD?
Resilient Distributed Datasets: the key abstraction in Spark

- Immutable collection of objects
- Distributed across machines
- Can be operated on in parallel
- Can hold any kind of data: Hadoop datasets, parallelized Scala collections, RDBMS or NoSQL sources, ...

[Diagram: flight records (e.g. "CO780, IAH, MCI", "DL282, ATL, CVG", "UA620, SJC, ORD") split into partitions held on different machines]
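A minimal sketch of the two most common ways to build an RDD, assuming an existing SparkContext named sc (the sample records are taken from the diagram above):

// From a parallelized Scala collection
val flights = sc.parallelize(Seq(
  ("CO780", "IAH", "MCI"),
  ("DL282", "ATL", "CVG"),
  ("UA620", "SJC", "ORD")))

// From a Hadoop dataset (path elided as in the slides)
val lines = sc.textFile("hdfs:///.../flights")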
We've written the code, what next?
Spark Cluster Architecture: Logical View
[Diagram: the Driver Program hosts the SparkContext and talks to a Cluster Manager; the Cluster Manager allocates Executors, each holding a Cache and running Tasks]
What's Inside an Executor?

[Diagram: an Executor is a single JVM with a fixed number of task slots; some slots hold running tasks, the remaining slots are free]
Spark Cluster Architecture: Physical View
[Diagram: the Driver and the Executors deployed as processes on the physical nodes of the cluster]
OK, can we get back to the code?
Spark Execution Model
sc.textFile("...").
  map(row => ...).
  groupByKey.
  mapValues(...).
  collect

[Diagram: each operation is carried out by tasks, one task per partition]
Spark builds DAGs
- Directed (arrows)
- Acyclic (no loops)
- Graph

[Diagram: the job's operations drawn as a DAG ending in collect]
Spark builds DAGs (contd.)
[Diagram: the same lineage shown at the data-set level and at the partition level: HadoopRDD -> MapPartitionsRDD -> ShuffledRDD -> MapPartitionsRDD]

Spark has a rich collection of operations that generally map to RDD classes: HadoopRDD, FilteredRDD, JoinRDD, MapPartitionsRDD, etc.
DAG Scheduler
Stage 1: textFile -> map
Stage 2: groupByKey -> mapValues -> collect

- Split the DAG into stages
- A stage marks the boundaries of a pipeline
- One stage is completed before starting the next stage
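You can see these stage boundaries yourself with RDD.toDebugString, which prints the lineage. A hedged sketch, assuming an existing SparkContext sc; the exact output format varies across Spark versions:

val perAirport = sc.textFile("hdfs:///.../flights").
  map(row => (row.split(",")(16), row.split(",")(8))).
  groupByKey

// Prints the lineage; each indentation level marks a shuffle, i.e. a stage boundary.
println(perAirport.toDebugString)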
DAG Scheduler: Pipelining
Stage 1: textFile -> map

[Diagram: within a stage, operators are pipelined: each record flows through textFile and map without materializing an intermediate data set]
Task Scheduler (contd.)
Stage 1: textFile -> map

[Diagram: the stage's computation runs as four tasks, numbered 1 through 4, one per partition]
Task Scheduler (contd.)
[Diagram: 4 HDFS partitions but only 3 cores. Tasks 1-3, each running the HadoopRDD -> MapPartitionsRDD pipeline over its own partition, start in parallel; task 4 runs on whichever core frees up first, so the stage finishes in a second wave]
Task Scheduler: The Shuffle
Stage 1: textFile -> map

  (all-to-all data movement)

Stage 2: groupByKey -> mapValues -> collect
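On the map side, a partitioner decides which reduce-side partition each record is routed to; groupByKey uses hash partitioning by default. A small illustrative sketch (mine, not from the slides):

import org.apache.spark.HashPartitioner

// With 2 reduce-side partitions, a key always hashes to the same partition,
// so all records for that key meet in one place after the shuffle.
val partitioner = new HashPartitioner(2)
println(partitioner.getPartition("EWR"))  // 0 or 1, but stable for this key
println(partitioner.getPartition("SFO"))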
Task Scheduler: The Shuffle (contd.)
[Diagram: Stage 1's four map-side partitions feed Stage 2's two reduce-side partitions]
Task Scheduler: The Shuffle (contd.)
[Diagram: each of the four map tasks writes one output bucket per reduce-side partition (1 and 2); each of the two reduce tasks then fetches its bucket from every map task]
groupByKey
- After the shuffle, each reduce-side partition contains complete groups: all values for a given key land in the same partition
- Within each partition, groupByKey builds an in-memory hash map:

  EWR -> [UA, AA, ...]
  SFO -> [UA, ...]
  JFK -> [DL, UA, AA, DL, ...]
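Because that hash map materializes every value of every group, a heavily skewed key can exhaust executor memory. One alternative (my sketch, not from the slides) folds each partition's values into a set of distinct airlines as they arrive, using aggregateByKey; assumes an existing SparkContext sc:

val distinctCounts = sc.textFile("hdfs:///.../flights").
  map(row => (row.split(",")(16), row.split(",")(8))).
  aggregateByKey(Set.empty[String])(
    (set, airline) => set + airline,  // fold one value into the partial set
    (a, b) => a ++ b).                // merge partial sets across partitions
  mapValues(_.size).
  collect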
Task Execution
[Diagram: tasks for Stage 1 through Stage N executing across the cluster]
Writing better Spark applications
Code path length: Is your code optimal? Are you using the best available Spark API for the job?
Writing better Spark applications: Partitioning
Too few partitions?
- Less concurrency
- More susceptible to data skew
- Increased memory pressure

Too many partitions?
- Over-partitioning leads to very short-running tasks
- Administrative inefficiencies outweigh the benefits of parallelism

Need a reasonable number of partitions (see the sketch below)
- Usually a function of the number of cores in the cluster (~2x the core count is a good rule of thumb)
- Ensure task execution time > task serialization time
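A quick sketch for inspecting and adjusting partition counts, assuming an existing SparkContext sc (the counts are placeholders):

val rdd = sc.textFile("hdfs:///.../flights")
println(rdd.partitions.length)    // how many partitions did we actually get?

val wider = rdd.repartition(16)   // full shuffle; raises parallelism
val fewer = rdd.coalesce(4)       // narrows partitions, usually without a shuffle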
Writing better Spark applications: Memory
Symptoms
- Bad performance
- Executor failures (OutOfMemory errors)

Resolution
- Use GC and other traces to track memory usage
- Give Spark more memory
- Tweak the memory distribution between user memory and Spark memory
- Increase the number of partitions
- Look at your code!
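These knobs are typically set through SparkConf or at submit time. A hedged sketch: the property names and values are illustrative and vary by Spark version (for example, spark.memory.fraction exists only in Spark 1.6+; older releases used spark.storage.memoryFraction and spark.shuffle.memoryFraction):

import org.apache.spark.SparkConf

val conf = new SparkConf().
  setAppName("Flights").
  set("spark.executor.memory", "4g").    // give each executor more heap
  set("spark.memory.fraction", "0.6").   // Spark vs. user memory split (Spark 1.6+)
  set("spark.default.parallelism", "32") // default partition count for shuffles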
Rewriting our application
sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
map(row => (row.split(",")(16), row.split(",")(8))).
groupByKey.
mapValues(values => values.toSet.size).
collect
Partitioning
Shuffle size
Memory pressure
Code path length
Rewriting our application (contd.)
sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
map(row => (row.split(",")(16), row.split(",")(8))).
repartition(16).
groupByKey.
mapValues(values => values.toSet.size).
collect
Partitioning
Shuffle size
Memory pressure
Code path length

How many stages do you see? Answer: 3 (repartition adds a second shuffle boundary on top of groupByKey's)
Rewriting our application (contd.)
sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
map(row => (row.split(",")(16), row.split(",")(8))).
repartition(16).
distinct.
groupByKey.
mapValues(values => values.toSet.size).
collect
Rewriting our application (contd.)
sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
map(row => (row.split(",")(16), row.split(",")(8))).
distinct(numPartitions = 16).
groupByKey.
mapValues(values => values.size).
collect
Partitioning
Shuffle size
Memory pressure
Code path length
Rewriting our application (contd.)
sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
map(row => {
  val cols = row.split(",")
  (cols(16), cols(8))
}).
distinct(numPartitions = 16).
groupByKey.
mapValues(values => values.size).
collect
Partitioning
Shuffle size
Memory pressure
Code path length
Rewriting our application (contd.)
sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
map(row => {
  val cols = row.split(",")
  (cols(16), cols(8))
}).
distinct(numPartitions = 16).
map(e => (e._1, 1)).
reduceByKey(_ + _).
collect
Partitioning
Shuffle size
Memory pressure
Code path length
Original (OutOfMemory Error after running for several minutes on my laptop):

sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
  map(row => (row.split(",")(16), row.split(",")(8))).
  groupByKey.
  mapValues(values => values.toSet.size).
  collect

Revised (completed in seconds):

sc.textFile("hdfs://localhost:9000/user/Adarsh/flights").
  map(row => {
    val cols = row.split(",")
    (cols(16), cols(8))
  }).
  distinct(numPartitions = 16).
  map(e => (e._1, 1)).
  reduceByKey(_ + _).
  collect
Writing better Spark applications: Configuration
How many Executors per Node?
--num-executors OR spark.executor.instances
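A sketch of the property-based form, with illustrative values (spark.executor.instances corresponds to the --num-executors flag, which applies when running on YARN):

import org.apache.spark.SparkConf

val conf = new SparkConf().
  set("spark.executor.instances", "4").  // executors for this application
  set("spark.executor.cores", "3").      // task slots per executor
  set("spark.executor.memory", "4g")     // heap per executor JVM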
As you can see, writing Spark jobs is easy. However, doing so in
an efficient manner takes some know-how.
Thank You
Acknowledgements
Reference Slides
How deep do you want to go?
" Turning Data into Value, Ion Stoica, Spark Summit 2013 Video & Slides 25 mins
https://spark-summit.org/2013/talk/turning-data-into-value
" How Companies are Using Spark, and Where the Edge in Big Data Will Be, Matei
Zaharia, Video & Slides 12 mins
http://conferences.oreilly.com/strata/strata2014/public/schedule/detail/33057
" Spark Fundamentals I (Lesson 1 only), Big Data University <20 mins
https://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/
Basic Spark
- Scala primer: http://www.artima.com/scalazine/articles/steps.html
- Do these two courses. They cover Spark basics and include a certification. You can use the supplied Docker images for all other labs. About 7 hours.
Basic Spark (contd.)
" Write some code, make it run, see how it performs, tune it, trouble-shoot it
" Play with different configuration knobs, check out dashboards, etc.
" Use this as your This book by " Excellent MooC by Odersky. Some of
primary Scala text Odersky is excellent the material is meant for CS majors.
but it isnt meant to Highly recommended for STC
give you a quick developers.
start. Its deep stuff. 35+ hours
Advanced Spark: Browse Conference Proceedings
Spark Summits cover technology and use cases. Technology is also covered in various other places, so you could consider skipping those tracks. Don't forget to check out the customer stories. That is how we learn about enablement opportunities and challenges, and in some cases, we can see through the Spark hype.