From: mastering-apache-spark
Architecture
Spark uses a master/worker architecture. There is a driver that talks to a single
coordinator called master that manages workers in which executors run.
The driver and the executors run in their own Java processes. You can run them all on the same
machine (horizontal cluster), on separate machines (vertical cluster), or in a mixed machine configuration.
Driver
A Spark driver is a JVM process that hosts SparkContext for a Spark application. It is the master node in
a Spark application. It is the cockpit of job and task execution (using DAGScheduler and
TaskScheduler). It hosts the Web UI for the environment.
It splits a Spark application into tasks and schedules them to run on executors.
A driver is where the task scheduler lives and spawns tasks across workers.
A driver coordinates workers and overall execution of tasks.
Executors
Executors are distributed agents that execute tasks. They typically run for the entire lifetime of a Spark
application, which is called static allocation of executors (but you could also opt in for dynamic allocation).
Dynamic Allocation (of Executors) (aka Elastic Scaling) is a Spark feature that allows for
adding or removing Spark executors dynamically to match the workload.
Unlike in the "traditional" static allocation, where a Spark application reserves CPU and
memory resources upfront irrespective of how much it really uses at a time, in dynamic
allocation you get as much as needed and no more. It lets you scale the number of
executors up and down based on workload, i.e. idle executors are removed, and if you need
more executors for pending tasks, you simply request them.
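As a minimal sketch of how dynamic allocation might be switched on, the configuration below sets the relevant properties on a SparkConf. The property names are real Spark settings; the application name, executor bounds and idle timeout are illustrative assumptions, not recommendations. It also assumes spark-core is on the classpath and that the cluster provides an external shuffle service.

```scala
import org.apache.spark.SparkConf

object DynamicAllocationConf {
  // Illustrative values only; tune the bounds and timeout for your workload.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("dynamic-allocation-sketch")                  // hypothetical name
      .set("spark.dynamicAllocation.enabled", "true")           // turn the feature on
      .set("spark.dynamicAllocation.minExecutors", "1")         // lower bound
      .set("spark.dynamicAllocation.maxExecutors", "10")        // upper bound
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s") // remove idle executors after this
      .set("spark.shuffle.service.enabled", "true")             // keeps shuffle files available
}
```

The same properties can also be passed on the command line with --conf when submitting the application.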
Executors send active task metrics to the driver and inform it about task status updates.
Executors provide in-memory storage for RDDs that are cached in Spark applications.
Executors can run multiple tasks over their lifetime, both in parallel and sequentially. They
track running tasks. Executors use a thread pool for launching tasks and sending metrics.
It is recommended to have as many executors as data nodes and as many cores as you can
get from the cluster.
Creating Executor Instance
Executor takes the following when created:
Executor ID
Executor’s host name
SparkEnv
User-defined JARs (to add to tasks' class path). Empty by default
Flag that says whether the executor runs in local or cluster mode (default: false, i.e. cluster mode is
preferred)
When created, the executor initializes its internal registries and counters and starts sending
heartbeats with active task metrics.
Internally, launchTask creates a TaskRunner, registers it in the runningTasks internal registry, and finally
executes it on a thread pool.
For each TaskRunner in the runningTasks internal registry, the task's metrics are computed (i.e.
mergeShuffleReadMetrics and setJvmGCTime) and become part of the heartbeat (with accumulators).
TaskRunner
TaskRunner manages execution of a single task. It can be run or killed, which boils down to running or
killing the task. It is an internal class of Executor.
A TaskRunner object is created when an executor is requested to launch a task.
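The launchTask flow described in this section (create a runner, register it, run it on a thread pool) can be sketched with a toy model in plain Scala. This is not Spark's Executor class: MiniExecutor, MiniTaskRunner, the Long task IDs and the fixed pool size are hypothetical names and choices made for illustration.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.concurrent.TrieMap

// Toy model only; names and structure are assumptions, not Spark source code.
class MiniExecutor {
  private val threadPool = Executors.newFixedThreadPool(4)
  // Registry of tasks that are currently running, keyed by task ID.
  val runningTasks = TrieMap.empty[Long, MiniTaskRunner]
  // Results of finished tasks, keyed by task ID.
  val results = TrieMap.empty[Long, Int]

  // Manages a single task: runs it, stores the result, then deregisters itself.
  class MiniTaskRunner(val taskId: Long, body: () => Int) extends Runnable {
    override def run(): Unit =
      try results.update(taskId, body())
      finally runningTasks.remove(taskId)
  }

  // Create a runner, register it, and hand it to the thread pool.
  def launchTask(taskId: Long)(body: () => Int): Unit = {
    val runner = new MiniTaskRunner(taskId, body)
    runningTasks.update(taskId, runner)
    threadPool.execute(runner)
  }

  // Stop accepting tasks and wait for the submitted ones to finish.
  def shutdownAndWait(): Unit = {
    threadPool.shutdown()
    threadPool.awaitTermination(10, TimeUnit.SECONDS)
  }
}
```

A caller would launch tasks with, for example, executor.launchTask(1L)(() => 21 * 2) and read executor.results after shutdownAndWait() returns.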
Master
A master is a running Spark instance that connects to a cluster manager for resources. The master
acquires cluster nodes to run executors.
Workers
Workers (aka slaves) are running Spark instances where executors live to execute tasks. They are the
compute nodes in Spark.
A worker receives serialized tasks that it runs in a thread pool. It hosts a local Block Manager that serves
blocks to other workers in a Spark cluster. Workers communicate among themselves using their Block
Manager instances.
Block Manager is a key-value store for blocks of data. It acts as a local cache that runs on every "node" in a Spark application
(driver and executors).
When you create a SparkContext, each worker starts an executor (a separate JVM process that loads
your jar). The executors connect back to your driver program. Now the driver can send them commands,
like flatMap, map and reduceByKey. When the driver quits, the executors shut down.
The executor deserializes the command and executes it on a partition.
Based on the RDD lineage graph of a job, Spark creates stages; for the WordCount example, two stages
are created. The stage creation rule is based on the idea of
pipelining as many narrow transformations as possible. RDD operations with "narrow"
dependencies, like map() and filter(), are pipelined together into one set of tasks in
each stage.
In the WordCount example, the narrow transformation finishes at per-word count. Therefore,
you get two stages:
file → lines → words → per-word count
global word count → output
Once stages are defined, Spark generates tasks from them. The first stage creates
ShuffleMapTasks and the last stage creates ResultTasks, because the last stage includes an
action that produces results.
The number of tasks to be generated depends on how your files are distributed. Suppose
that you have three different files on three different nodes: the first stage will generate 3
tasks, one task per partition. The number of tasks generated in each stage equals the
number of partitions.
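The two WordCount stages can be illustrated with a small sketch. It assumes spark-core on the classpath and a local master; the object name, application name and the use of an in-memory sequence of lines (standing in for a file read with sc.textFile) are assumptions made to keep the example self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountStages {
  def run(lines: Seq[String]): Map[String, Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("wordcount-sketch")
    val sc = new SparkContext(conf)
    try {
      val counts = sc.parallelize(lines, 3) // 3 partitions, so 3 tasks in the first stage
        .flatMap(_.split("\\s+"))           // narrow: lines to words
        .filter(_.nonEmpty)                 // narrow: drop empty tokens
        .map(word => (word, 1))             // narrow: word to (word, 1)
        .reduceByKey(_ + _)                 // wide: shuffle boundary between the two stages
      counts.collect().toMap                // action: runs the job and returns the result
    } finally sc.stop()
  }
}
```

The narrow transformations before reduceByKey are pipelined into the first stage; the shuffle introduced by reduceByKey starts the second stage, which ends with the collect action.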
When a Spark application is started with spark-submit, it connects to the Spark master described by the
master URL. Your Spark application can run locally or on a cluster, depending on the cluster manager and
the deploy mode (client or cluster). You can then create RDDs, transform them to other RDDs and
ultimately execute actions. You can also cache interim RDDs to speed up data processing. A Spark
application finishes by stopping the Spark context.
Spark context sets up internal services and establishes a connection to a Spark execution environment.
Once a SparkContext instance is created you can use it to create RDDs, accumulators and broadcast
variables, access Spark services and run jobs.
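A minimal sketch of that lifecycle follows: create the context, build an RDD with a broadcast variable and an accumulator, cache it, run actions, and stop the context. It assumes spark-core on the classpath and a local master; the object name, application name and values are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ContextLifecycle {
  // Returns (sum of tripled values, elements seen after the first action, element count).
  def run(): (Long, Long, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lifecycle-sketch")
    val sc = new SparkContext(conf)           // sets up internal services
    try {
      val factor = sc.broadcast(3)            // read-only value shipped to executors
      val seen = sc.longAccumulator("seen")   // counter updated by tasks
      val tripled = sc.parallelize(1 to 10)
        .map { n => seen.add(1); n * factor.value }
        .cache()                              // keep the interim RDD for reuse
      val total = tripled.sum().toLong        // first action computes and caches the RDD
      val seenAfterFirstAction = seen.value.toLong
      val count = tripled.count()             // second action can reuse the cached data
      (total, seenAfterFirstAction, count)
    } finally sc.stop()                       // the application finishes here
  }
}
```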
SparkContext functions
Getting current configuration
SparkConf
deployment environment (as master URL)
application name
deploy mode
default level of parallelism
Spark user
the time (in ms) when SparkContext was created
Spark version
Setting Configuration
master URL
Local Properties — Creating Logical Job Groups
Default Logging Level
Accessing services
TaskScheduler
LiveListenerBus
BlockManager
SchedulerBackends
ShuffleManager
ContextCleaner
Running jobs
Cancelling job
Setting up custom Scheduler Backend, TaskScheduler and DAGScheduler
Closure Cleaning
Submitting Jobs Asynchronously
Unpersisting RDDs, i.e. marking RDDs as non-persistent
Registering SparkListener
Programmable Dynamic Allocation
HeartbeatReceiver keeps track of executors and informs TaskScheduler and SparkContext about lost
executors.
Types of RDDs
ParallelCollectionRDD - result of SparkContext.parallelize and SparkContext.makeRDD
CoGroupedRDD - RDD that cogroups its pair RDD parents. For each key k in parent RDDs, the
resulting RDD contains a tuple with the list of values for that key.
MapPartitionsRDD - a result of calling operations like map, flatMap, filter, mapPartitions, etc.
CoalescedRDD - a result of repartition or coalesce transformations.
ShuffledRDD - a result of shuffling, e.g. after repartition or coalesce with shuffle enabled (created
for RDD transformations that trigger data shuffling).
PipedRDD - an RDD created by piping elements to a forked external process.
PairRDD - an RDD of key-value pairs that is a result of groupByKey and join operations.
DoubleRDD - an RDD of Double type.
SequenceFileRDD - an RDD that can be saved as a SequenceFile.
When you execute a Spark job, e.g. sc.parallelize(1 to 100).count, the Spark UI shows its tasks.
On an 8-core laptop there are 8 tasks in total, because by default the number of partitions is the
number of all available cores.
You can set the number of partitions explicitly with the second parameter (numSlices) of parallelize.
Ex: sc.parallelize(1 to 100, 2).count
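The default and explicit partition counts can be compared with a short sketch. It assumes spark-core on the classpath, a local master with 4 cores (local[4]) and no spark.default.parallelism override; the object and application names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCounts {
  // Returns (default partition count, explicitly requested partition count).
  def run(): (Int, Int) = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("partition-count-sketch")
    val sc = new SparkContext(conf)
    try {
      val byDefault = sc.parallelize(1 to 100).getNumPartitions   // follows the core count
      val explicit = sc.parallelize(1 to 100, 2).getNumPartitions // second argument wins
      (byDefault, explicit)
    } finally sc.stop()
  }
}
```

Each partition becomes one task, so the first RDD would run 4 tasks for count and the second would run 2.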
Increasing the partition count gives each partition less data. Spark can only run 1 concurrent
task for every partition of an RDD, up to the number of cores in your cluster. So if you have a cluster
with 50 cores, you want your RDDs to have at least 50 partitions (and probably 2 or 3 times that). A
"good" number of partitions is at least as many as the number of executors, for parallelism.
The number of partitions determines how many files get generated by actions that save RDDs to files.
The maximum size of a partition is ultimately limited by the available memory of an executor.
Partitions get redistributed among nodes whenever a shuffle occurs. Repartitioning may
cause a shuffle in some situations, but not in all cases.
Because transformations are lazy, the shuffle usually runs only when an action triggers the job.
coalesce transformation
coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]
The coalesce transformation is used to change the number of partitions. It can trigger RDD
shuffling depending on the shuffle flag (disabled by default).
For example:
val x = (1 to 10).toList
val numbersDf = x.toDF("number") // assumes import spark.implicits._ is in scope
On a 4-core machine, numbersDf is split into four partitions and is written to disk as 4 files:
numbersDf.rdd.partitions.size // => 4
Partition A: 1, 2
Partition B: 3, 4, 5
Partition C: 6, 7
Partition D: 8, 9, 10
val numbersDf2 = numbersDf.coalesce(2)
coalesce has created a new DataFrame with only two partitions, which is written to disk as 2 files:
numbersDf2.rdd.partitions.size // => 2
The partitions in numbersDf2 have the following data:
Partition A: 1, 2, 3, 4, 5
Partition C: 6, 7, 8, 9, 10
coalesce avoids a full shuffle. When the number of partitions is decreasing, executors can keep the data
of some partitions in place and move only the remaining partitions onto them.
You can try to increase the number of partitions with coalesce (with shuffle disabled), but it won't work!
The coalesce algorithm changes the number of partitions by moving data from some partitions to existing
partitions. So, this algorithm cannot increase the number of partitions.
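Why coalesce without a shuffle can shrink but not grow the partition count can be seen in a toy model in plain Scala. This is a simplification that merges neighbouring partitions; it is not Spark's actual coalescing algorithm, and CoalesceModel is a hypothetical name.

```scala
object CoalesceModel {
  // Toy model: merge neighbouring partitions into at most `target` groups.
  // Whole partitions are moved; no records are created or split.
  def coalesce[A](partitions: Vector[Vector[A]], target: Int): Vector[Vector[A]] = {
    require(target > 0, "target must be positive")
    if (target >= partitions.size) partitions // nothing to merge, so the count cannot grow
    else {
      val groupSize = math.ceil(partitions.size.toDouble / target).toInt
      partitions.grouped(groupSize).map(_.flatten).toVector
    }
  }
}
```

With the four partitions (1, 2), (3, 4, 5), (6, 7), (8, 9, 10) and a target of 2, the model yields (1, 2, 3, 4, 5) and (6, 7, 8, 9, 10); with a target of 8 it returns the four partitions unchanged.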
repartition Transformation
repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
repartition is coalesce with numPartitions and shuffle enabled. The repartition method can be used to
either increase or decrease the number of partitions.
Ex:
val homerDf = numbersDf.repartition(2)
homerDf.rdd.partitions.size // => 2
The repartition algorithm does a full data shuffle and distributes the data roughly evenly among the
partitions. It does not attempt to minimize data movement like the coalesce algorithm.
Because the data is fully shuffled, the number of partitions can be increased as well as decreased.
Partitioner captures data distribution at the output. A scheduler can optimize future operations based
on this.
val partitioner: Option[Partitioner] specifies how the RDD is partitioned.
Hash Partitioner
Attempts to spread the data evenly across partitions based on the key. The key's hashCode method
determines the partition: partition = key.hashCode() % numPartitions, adjusted to be non-negative
when the hash code is negative.
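The hash partitioning rule can be sketched in plain Scala. One detail worth noting is that the JVM remainder operator can return a negative value for a negative hash code, so the result has to be shifted into the range 0 until numPartitions, and a null key is mapped to partition 0. The object and function names below are chosen for illustration.

```scala
object HashPartitioning {
  // Keeps the result in [0, mod) even when x is negative.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Partition index for a key; null keys go to partition 0.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)
}
```

For example, with 4 partitions the integer key 7 maps to partition 3, and the key -1 also maps to partition 3 rather than to the invalid index -1.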
Range Partitioner
Some Spark RDDs have keys that follow a particular ordering. For such RDDs, range partitioning is an
efficient partitioning technique. In range partitioning, tuples having keys within the same range
appear on the same machine. A range partitioner assigns keys to partitions based on a sorted set of
key ranges and the ordering of the keys.
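The lookup step of range partitioning can be sketched in plain Scala: given sorted upper bounds, a key goes to the first range whose bound is not smaller than the key. This is a simplified model under the assumption that the bounds are already known; Spark's RangePartitioner derives its bounds by sampling the data, which is not shown here. The names are hypothetical.

```scala
object RangePartitioning {
  // `upperBounds` must be sorted ascending; n bounds define n + 1 partitions.
  def partitionFor[K](key: K, upperBounds: Vector[K])(implicit ord: Ordering[K]): Int =
    upperBounds.indexWhere(bound => ord.lteq(key, bound)) match {
      case -1 => upperBounds.length // larger than every bound: last partition
      case i  => i
    }
}
```

With bounds 10 and 20, keys up to 10 land in partition 0, keys from 11 to 20 in partition 1, and larger keys in partition 2, so neighbouring keys stay together.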
Two kinds of partitioning should be distinguished:
Partitioning by key, where data is distributed between partitions depending on the key (limited to
pair RDDs). This creates a relationship between partitions and sets of keys.
Partitioning of input, where the input is split into multiple partitions, each holding a chunk
of consecutive records.
So by default an RDD has no partitioner, because key-based partitioning is not applicable to all RDDs.
For operations that require partitioning on a pair RDD (aggregateByKey, reduceByKey, etc.), the default
is hash partitioning.
RDD shuffling
Shuffling is the process of redistributing data across partitions, which transfers data
between stages. By default, shuffling doesn't change the number of partitions, only their content.
Checkpointing
Checkpointing is a process of truncating the RDD lineage graph and saving the RDD to a reliable
distributed file system (e.g. HDFS) or a local file system.
There are two types of checkpointing:
reliable - in Spark (core), RDD checkpointing that saves the actual intermediate RDD data to a
reliable distributed file system, e.g. HDFS.
local - in Spark Streaming or GraphX, RDD checkpointing that truncates the RDD lineage graph and keeps
the data in executor-local storage, trading fault tolerance for speed.
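Reliable checkpointing might look like the sketch below. It assumes spark-core on the classpath and a local master, and it uses a temporary local directory in place of a reliable store such as HDFS; the object and application names are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  // Returns true when the RDD has been checkpointed.
  def run(): Boolean = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-sketch")
    val sc = new SparkContext(conf)
    try {
      // A temp directory stands in for a reliable file system such as HDFS.
      sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)
      val doubled = sc.parallelize(1 to 100).map(_ * 2)
      doubled.checkpoint() // only marks the RDD; nothing is written yet
      doubled.count()      // the first action materializes the checkpoint
      doubled.isCheckpointed
    } finally sc.stop()
  }
}
```

For the local variant, RDD.localCheckpoint() can be called instead of checkpoint(), and no checkpoint directory is needed.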
The most common challenge is memory pressure, due to improper configurations (particularly wrong-
sized executors), long-running operations, and tasks that result in Cartesian operations. You can speed
up jobs with appropriate caching, and by allowing for data skew.
2. Use optimal data format – Spark supports data formats such as CSV, JSON, XML,
Parquet, ORC, and Avro. The best format for performance is Parquet with Snappy compression.
Parquet stores data in columnar format, and is highly optimized in Spark.
3. Select default storage - When you create a new Spark cluster, you have the option to select
Azure Blob Storage or Azure Data Lake Storage as your cluster's default storage. Both suit
transient clusters, since the data outlives the cluster. Data Lake Storage is faster than Blob
Storage. Local HDFS on a non-transient cluster is fastest.
4. Use the cache – Use Spark's native caching mechanisms such as .persist(), .cache(), and CACHE TABLE
to cache intermediate results. Native Spark caching works well for small datasets. For larger
datasets, caching at the storage level (Data Lake or HDFS) is recommended.
5. Use memory efficiently - Spark operates by placing data in memory, so managing memory
resources is a key aspect.
6. Optimize data serialization - There are two serialization options. Java serialization is the default.
Kryo serialization is a newer format and can result in faster and more compact serialization than
Java.
7. Use bucketing - Bucketing is like data partitioning, but each bucket can hold a set of column
values. A bucket is determined by hashing the bucket key of the row. You can use partitioning
and bucketing at the same time.
8. Optimize joins and shuffles - If you have slow jobs on a join or shuffle, the cause is probably
data skew, which is asymmetry in your job data. To fix data skew, you can salt the entire key,
or use an isolated salt for only some subset of keys. Another option is to introduce a bucket
column and pre-aggregate in buckets first.
Another factor could be the join type. SortMerge is used by default and is best for large datasets,
but it is expensive because it must sort both the left-hand and right-hand side data before the join.
Broadcast is best suited for small datasets. This type of join broadcasts the small side to all
executors, and so requires more memory for broadcasts.
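The salting idea from item 8 can be sketched in plain Scala, without Spark, as a two-step aggregation: a random suffix spreads one hot key over several salted keys, partial sums are computed per salted key, and the salt is then stripped for the final sum. The '#' separator, the bucket count and all names are assumptions made for illustration; in a Spark job the two groupBy steps would correspond to two reduceByKey steps.

```scala
import scala.util.Random

object KeySalting {
  // Append a random suffix in [0, buckets) so one hot key spreads over several keys.
  def salt(key: String, buckets: Int, rng: Random): String =
    s"$key#${rng.nextInt(buckets)}"

  // Strip the suffix added by `salt`.
  def unsalt(salted: String): String =
    salted.substring(0, salted.lastIndexOf('#'))

  // Two-step aggregation: sum per salted key, then strip the salt and sum again.
  def saltedSum(records: Seq[(String, Int)], buckets: Int, seed: Long): Map[String, Int] = {
    val rng = new Random(seed)
    val partial = records
      .map { case (k, v) => (salt(k, buckets, rng), v) }
      .groupBy(_._1)
      .map { case (k, vs) => (k, vs.map(_._2).sum) }
    partial.toSeq
      .map { case (k, v) => (unsalt(k), v) }
      .groupBy(_._1)
      .map { case (k, vs) => (k, vs.map(_._2).sum) }
  }
}
```

The final totals do not depend on the random salt values; only the intermediate grouping does, which is what spreads the work for a skewed key.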