
Spark Architecture

From: mastering-apache-spark

Architecture
Spark uses a master/worker architecture. There is a driver that talks to a single
coordinator called the master, which manages workers in which executors run.
The driver and the executors run in their own Java processes. You can run them all on the same machine
(horizontal cluster), on separate machines (vertical cluster), or in a mixed machine configuration.

Driver
A Spark driver is a JVM process that hosts the SparkContext for a Spark application. It is the master node in
a Spark application. It is the cockpit of job and task execution (using DAGScheduler and TaskScheduler).
It hosts the Web UI for the environment.
It splits a Spark application into tasks and schedules them to run on executors.
A driver is where the task scheduler lives and spawns tasks across workers.
A driver coordinates workers and the overall execution of tasks.

Executors

Executors are distributed agents that execute tasks. They typically run for the entire lifetime of a Spark
application, which is called static allocation of executors (but you could also opt in for dynamic allocation).

Dynamic Allocation (of Executors) (aka Elastic Scaling) is a Spark feature that allows for
adding or removing Spark executors dynamically to match the workload.
Unlike in the "traditional" static allocation, where a Spark application reserves CPU and
memory resources upfront irrespective of how much it really uses at a time, in dynamic
allocation you get as much as needed and no more. It allows scaling the number of
executors up and down based on workload, i.e. idle executors are removed, and if you need
more executors for pending tasks, you simply request them.
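
A minimal configuration sketch of enabling dynamic allocation; the application name and executor bounds
below are illustrative values, not from the source, and assume an external shuffle service is available:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; tune the executor bounds for your workload.
val conf = new SparkConf()
  .setAppName("dynamic-allocation-demo")
  .set("spark.dynamicAllocation.enabled", "true")     // opt in to elastic scaling
  .set("spark.dynamicAllocation.minExecutors", "1")   // lower bound on executors
  .set("spark.dynamicAllocation.maxExecutors", "20")  // upper bound on executors
  .set("spark.shuffle.service.enabled", "true")       // keep shuffle files when executors are removed
val sc = new SparkContext(conf)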

Executors send active task metrics to the driver and inform it about task status updates.
Executors provide in-memory storage for RDDs that are cached in Spark applications.
Executors can run multiple tasks over their lifetime, both in parallel and sequentially, and they
track running tasks. Executors use a thread pool for launching tasks and sending metrics.
It is recommended to have as many executors as data nodes and as many cores as you can
get from the cluster.
Creating Executor Instance
Executor takes the following when created:
 Executor ID
 Executor’s host name
 SparkEnv
 User-defined JARs (to add to tasks' class path). Empty by default
 Flag that says whether the executor runs in local or cluster mode (default: false, i.e. cluster mode is
preferred)
The executor starts sending heartbeats and active task metrics, and initializes its internal registries
and counters in the meantime.

Launching Tasks — launchTask Method


launchTask(context: ExecutorBackend, taskId: Long, attemptNumber: Int, taskName: String, serializedTask: ByteBuffer): Unit

Internally, launchTask creates a TaskRunner, registers it in runningTasks internal registry, and finally
executes it on thread pool.

For each task in TaskRunner (in the runningTasks internal registry), the task’s metrics are computed (i.e.
mergeShuffleReadMetrics and setJvmGCTime) and become part of the heartbeat (with accumulators).

TaskRunner
TaskRunner manages the execution of a single task. It can be run or killed, which boils down to running or
killing the task. It is an internal class of Executor.
A TaskRunner object is created when an executor is requested to launch a task.
Master
A master is a running Spark instance that connects to a cluster manager for resources. The master
acquires cluster nodes to run executors.

Workers
Workers (aka slaves) are running Spark instances where executors live to execute tasks. They are the
compute nodes in Spark.
A worker receives serialized tasks that it runs in a thread pool. It hosts a local Block Manager that serves
blocks to other workers in a Spark cluster. Workers communicate among themselves using their Block
Manager instances.

Block Manager is a key-value store for blocks of data. It acts as a local cache that runs on every "node" in a Spark application
(driver and executors).

Task Execution in Spark — Understanding Spark’s Underlying Execution Model

When you create a SparkContext, each worker starts an executor (a separate JVM process that loads
your jar). The executors connect back to your driver program. Now the driver can send them commands,
like flatMap, map and reduceByKey. When the driver quits, the executors shut down.
Each executor deserializes a command and executes it on a partition.

An application in Spark is executed in three steps:


1. Create RDD graph (DAG)
A DAG (directed acyclic graph) of RDDs represents the entire computation.
2. Create stage graph
Stages are created by breaking the RDD graph at shuffle boundaries. It is a logical execution plan
based on the RDD graph.
3. Based on the plan, schedule and execute tasks on workers.

WordCount RDD graph sample


file → lines → words → per-word count → global word count → output
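
A minimal WordCount sketch matching the graph above, assuming an existing SparkContext sc; the input
and output paths are hypothetical:

val file = sc.textFile("hdfs:///data/input.txt")           // file
val words = file.flatMap(line => line.split(" "))          // lines → words
val perWordCount = words.map(word => (word, 1))            // per-word count
val globalWordCount = perWordCount.reduceByKey(_ + _)      // global word count (shuffle boundary)
globalWordCount.saveAsTextFile("hdfs:///data/wordcounts")  // output (action)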

Based on this graph, two stages are created. The stage creation rule is based on the idea of
pipelining as many narrow transformations as possible. RDD operations with "narrow"
dependencies, like map() and filter() , are pipelined together into one set of tasks in
each stage.
In the WordCount example, the narrow transformation finishes at per-word count. Therefore,
you get two stages:
file → lines → words → per-word count
global word count → output

Once stages are defined, Spark will generate tasks from stages. The first stage will create
ShuffleMapTasks, and the last stage creates ResultTasks because the last stage includes an
action operation to produce results.
The number of tasks to be generated depends on how your files are distributed. Suppose
that you have three different files on three different nodes; the first stage will then generate 3
tasks: one task per partition. The number of tasks generated in each stage equals the
number of partitions.

Anatomy of Spark Application


Every Spark application starts by instantiating a Spark context. A Spark application is an instance of
SparkContext. Creating one involves three steps:

1. Specify the master URL to connect the application to
2. Create a Spark configuration
3. Create a Spark context

When a Spark application is started using spark-submit, it connects to the Spark master as described by the
master URL. Your Spark application can run locally or on the cluster, based on the cluster manager and
the deploy mode (client or cluster). You can then create RDDs, transform them into other RDDs and
ultimately execute actions. You can also cache interim RDDs to speed up data processing. A Spark
application finishes by stopping the Spark context.
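
A minimal sketch of this anatomy; the master URL and application name below are example values:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")             // 1. master URL to connect the application to
  .setAppName("AnatomyExample")      // 2. Spark configuration
val sc = new SparkContext(conf)      // 3. Spark context

val rdd = sc.parallelize(1 to 100)   // create RDDs, transform them ...
println(rdd.count())                 // ... and execute actions

sc.stop()                            // finish by stopping the Spark context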

SparkContext — Entry Point to Spark (Core)

Spark context sets up internal services and establishes a connection to a Spark execution environment.
Once a SparkContext instance is created you can use it to create RDDs, accumulators and broadcast
variables, access Spark services and run jobs.

SparkContext functions
Getting current configuration
 SparkConf
 deployment environment (as master URL)
 application name
 deploy mode
 default level of parallelism
 Spark user
 the time (in ms) when SparkContext was created
 Spark version

Setting Configuration
 master URL
 Local Properties — Creating Logical Job Groups
 Default Logging Level

Creating Distributed Entities


 RDDs
 Accumulators
 Broadcast variables

Accessing services
 TaskScheduler
 LiveListenerBus
 BlockManager
 SchedulerBackends
 ShuffleManager
 ContextCleaner

Running jobs
Cancelling job
Setting up custom Scheduler Backend, TaskScheduler and DAGScheduler
Closure Cleaning
Submitting Jobs Asynchronously
Unpersisting RDDs, i.e. marking RDDs as non-persistent
Registering SparkListener
Programmable Dynamic Allocation

HeartbeatReceiver keeps track of executors and informs TaskScheduler and SparkContext about lost
executors.

RDD — Resilient Distributed Dataset


An RDD is a resilient and distributed collection of records. You can compare it to a Scala collection, with
the RDD being its distributed variant. Resilient Distributed Datasets (RDDs) are a distributed memory
abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
 Resilient, i.e. fault-tolerant with the help of RDD DAG and thus can recompute missing or
damaged partitions due to node failures in Spark
 Distributed with data residing on multiple nodes
 Dataset is a collection of partitioned data with primitive values or values of values, e.g. tuples or
other objects
RDD has the following additional traits
 In-memory
 Immutable or Read-only
 Lazy evaluated (transformed only when an action is triggered)
 Cacheable
 Parallel (process data in parallel)
 Typed (e.g. RDD[Long] or RDD[(Int, String)])
 Partitioned, i.e. the data inside a RDD is partitioned (split into partitions) and then distributed
across nodes in a cluster
Partitions are the units of parallelism. You can control the number of partitions of an RDD using the
repartition or coalesce transformations. Spark tries to be as close to the data as possible without wasting
time sending data across the network (by means of RDD shuffling), and creates as many partitions as
required to follow the storage layout and thus optimize data access.
Spark does jobs in parallel, and RDDs are split into partitions to be processed and written in parallel.
Inside a partition, data is processed sequentially.

Types of RDDs
 ParallelCollectionRDD - result of SparkContext.parallelize and SparkContext.makeRDD
 CoGroupedRDD - RDD that cogroups its pair RDD parents. For each key k in parent RDDs, the
resulting RDD contains a tuple with the list of values for that key.
 MapPartitionsRDD - a result of calling operations like map, flatMap, filter, mapPartitions, etc.
 CoalescedRDD - a result of repartition or coalesce transformations.
 ShuffledRDD - a result of shuffling, e.g. after repartition or coalesce transformations. (created
for RDD transformations that trigger a data shuffling)
 PipedRDD - an RDD created by piping elements to a forked external process.
 PairRDD - an RDD of key-value pairs, the result of groupByKey and join operations.
 DoubleRDD - an RDD of Double values.
 SequenceFileRDD - an RDD that can be saved as a SequenceFile.

RDD Lineage — Logical Execution Plan


RDD Lineage is a graph of all the parent RDDs of an RDD. It is built by applying transformations to an RDD
and represents the logical execution plan.
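
You can inspect the lineage with the RDD’s toDebugString method; a small sketch, assuming an existing
SparkContext sc and a hypothetical input file:

val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)   // prints the graph of parent RDDs, i.e. the lineage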
Partitions and Partitioning
An RDD is defined not only by its content but also by how that content is spread out over a cluster
(which affects performance), i.e. how many partitions the RDD represents.
A partition (split) is a logical chunk of a large distributed data set. By default, a partition is created for
each HDFS block, which by default is 64/128 MB. RDDs get partitioned automatically without
programmer intervention (though you can adjust the size, number of partitions or partitioning scheme as well).

When you execute a Spark job, e.g. sc.parallelize(1 to 100).count, the Spark UI shows 8 tasks in total.

The reason for 8 tasks in total is that the job ran on an 8-core laptop, and by default the number of
partitions is the number of available cores.
You can request a specific number of partitions using the second input parameter.
Ex: sc.parallelize(1 to 100, 2).count
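
You can verify the partition counts directly in spark-shell; the first value depends on how many cores are
available:

sc.parallelize(1 to 100).partitions.size     // e.g. 8 on an 8-core laptop
sc.parallelize(1 to 100, 2).partitions.size  // => 2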

Increasing the partition count makes each partition hold less data. Spark can only run 1 concurrent
task for every partition of an RDD, up to the number of cores in your cluster. So if you have a cluster
with 50 cores, you want your RDDs to have at least 50 partitions (probably 2 or 3 times that). A
"good" number of partitions is at least as many as the number of executors, for parallelism.
The number of partitions determines how many files get generated by actions that save RDDs to files.

The maximum size of a partition is ultimately limited by the available memory of an executor.

Partitions get redistributed among nodes whenever a shuffle occurs. Repartitioning may
cause a shuffle in some situations, but it is not guaranteed to occur in all cases.
It usually happens during the action stage.

coalesce transformation
coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]

The coalesce transformation is used to change the number of partitions. It can trigger RDD
shuffling depending on the shuffle flag (disabled by default).
For Ex:
val x = (1 to 10).toList
val numbersDf = x.toDF("number")
On a 4-core machine, numbersDf is split into four partitions and will be written to disk as 4 files on write:
numbersDf.rdd.partitions.size // => 4

Partition A: 1, 2
Partition B: 3, 4, 5
Partition C: 6, 7
Partition D: 8, 9, 10
val numbersDf2 = numbersDf.coalesce(2)
coalesce has created a new DataFrame with only two partitions, which will be written to disk as 2 files on write:
numbersDf2.rdd.partitions.size // => 2
The partitions in numbersDf2 have the following data:
Partition A: 1, 2, 3, 4, 5
Partition C: 6, 7, 8, 9, 10

coalesce avoids a full shuffle. If it is known that the number of partitions is decreasing, the executors can
safely keep data on the minimum number of partitions, only moving data off the extra partitions.

You can try to increase the number of partitions with coalesce, but it won’t work!
The coalesce algorithm changes the number of partitions by moving data from some partitions into existing
partitions. This algorithm obviously cannot increase the number of partitions.

repartition Transformation
repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

repartition is coalesce with numPartitions and shuffle enabled. The repartition method can be used to
either increase or decrease the number of partitions.
Ex:
val homerDf = numbersDf.repartition(2)
homerDf.rdd.partitions.size // => 2

Let’s examine the data on each partition in homerDf:


Partition ABC: 1, 3, 5, 6, 8, 10
Partition XYZ: 2, 4, 7, 9

The repartition algorithm does a full data shuffle and equally distributes the data among the partitions.
It does not attempt to minimize data movement like the coalesce algorithm.

The repartition method can thus be used to increase the number of partitions as well, because it does a
full shuffle of the data.

Differences between coalesce and repartition


The repartition algorithm does a full shuffle of the data and creates equal-sized partitions of data.
coalesce combines existing partitions to avoid a full shuffle. coalesce results in partitions with different
amounts of data (sometimes partitions of very different sizes), whereas repartition results in roughly
equal-sized partitions.
coalesce may run faster than repartition, but unequal-sized partitions are generally slower to work with
than equal-sized partitions. You'll usually need to repartition datasets after filtering a large data set, and
repartition tends to be faster overall because Spark is built to work with equal-sized partitions.
Partitioner

Partitioner captures data distribution at the output. A scheduler can optimize future operations based
on this.
val partitioner: Option[Partitioner] specifies how the RDD is partitioned.

Types of Partitioning in Apache Spark

Hash Partitioner
Attempts to spread the data evenly across partitions based on the key. The Object.hashCode method
is used to determine the partition, as partition = key.hashCode() % numPartitions.

Range Partitioner
Some Spark RDDs have keys that follow a particular ordering; for such RDDs, range partitioning is an
efficient partitioning technique. In the range partitioning method, tuples having keys within the same range
appear on the same machine. Keys in a range partitioner are partitioned based on a set of sorted
ranges of keys and the ordering of keys.
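
A small sketch applying both partitioners to a pair RDD, assuming an existing SparkContext sc; the sample
data is made up:

import org.apache.spark.{HashPartitioner, RangePartitioner}

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)))

// Hash partitioning: partition chosen from the key's hashCode modulo numPartitions
val hashed = pairs.partitionBy(new HashPartitioner(4))

// Range partitioning: keys are sampled to build sorted key ranges, one range per partition
val ranged = pairs.partitionBy(new RangePartitioner(4, pairs))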

Set partitioning for data in Apache Spark


RDDs can be created with specific partitioning in two ways –

 Providing explicit partitioner by calling partitionBy method on an RDD,


 Applying transformations that return RDDs with specific partitioners. Some operations on RDDs
that hold on to and propagate a partitioner are:
 join
 leftOuterJoin
 rightOuterJoin
 groupByKey
 reduceByKey
 foldByKey
 sort
 partitionBy

Default Partitioning Scheme in Spark


Consider an example – 10 partitions are created with HashPartitioner:
scala> import org.apache.spark.HashPartitioner
scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4).partitionBy(new HashPartitioner(10))
scala> rdd.partitioner.isDefined
res10: Boolean = true
scala> rdd.partitioner.get
res11: org.apache.spark.Partitioner = org.apache.spark.HashPartitioner@a

Without partitionBy – 4 partitions are created (one per core) with no partitioner:


scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4)
scala> rdd.partitioner.isDefined
res8: Boolean = false

The difference is as below:
 Partitioning done by distributing data between partitions depending on the key (limited to
PairRDDs). This creates a relationship between partitions and sets of keys.
 Partitioning done by splitting the input into multiple partitions, where data is simply split into
chunks of consecutive records.

So the default partitioning scheme is none, because partitioning is not applicable to all RDDs. For operations
which require partitioning on a PairwiseRDD (aggregateByKey, reduceByKey, etc.), the default is hash
partitioning.

RDD shuffling
Shuffling is the process of redistributing data across partitions, i.e. the process of data transfer
between stages. By default, shuffling doesn’t change the number of partitions, only their content.
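
For example, a key-based aggregation redistributes records so that all values for a key land in the same
partition (the data here is illustrative, assuming an existing SparkContext sc):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = pairs.reduceByKey(_ + _)   // shuffle boundary: data is transferred between stages
summed.collect()                        // Array((a,4), (b,2)) (ordering may vary)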

Checkpointing
Checkpointing is the process of truncating an RDD lineage graph and saving the data to a reliable
distributed (e.g. HDFS) or local file system.
There are two types of checkpointing:
 reliable - in Spark (core), RDD checkpointing that saves the actual intermediate RDD data to a
reliable distributed file system, e.g. HDFS.
 local - in Spark Streaming or GraphX - RDD checkpointing that truncates RDD lineage graph.
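
A minimal reliable-checkpointing sketch, assuming an existing SparkContext sc; the checkpoint directory is
an example path:

sc.setCheckpointDir("hdfs:///tmp/checkpoints")     // reliable storage for checkpoint files
val doubled = sc.parallelize(1 to 1000).map(_ * 2)
doubled.checkpoint()   // marks the RDD for checkpointing; data is written on the next action
doubled.count()        // materializes the RDD, saves it, and truncates its lineage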

Optimize Apache Spark jobs


https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-perf - reference

The most common challenge is memory pressure, due to improper configurations (particularly wrong-
sized executors), long-running operations, and tasks that result in Cartesian operations. You can speed
up jobs with appropriate caching, and by allowing for data skew.

Spark job optimizations and recommendations


1. Choose the data abstraction
 Dataframes
 Datasets
 RDDs

2. Use the optimal data format – Spark supports many data formats, such as csv, json, xml,
parquet, orc, and avro. The best format for performance is parquet with snappy compression.
Parquet stores data in a columnar format and is highly optimized in Spark.
3. Select default storage - When you create a new Spark cluster, you can select
Azure Blob Storage or Azure Data Lake Storage as your cluster's default storage. Both are
transient; Data Lake Storage is faster than Blob Storage, and local HDFS on a non-transient cluster is fastest.
4. Use the cache – Use Spark's native caching mechanisms, such as .persist(), .cache(), and CACHE TABLE,
to cache intermediate results. Native Spark caching is good for small datasets; storage-level
(Data Lake or HDFS) caching is recommended otherwise.
5. Use memory efficiently - Spark operates by placing data in memory, so managing memory
resources is a key aspect.
6. Optimize data serialization - There are two serialization options. Java serialization is the default.
Kryo serialization is a newer format and can result in faster and more compact serialization than
Java.
7. Use bucketing - Bucketing is like data partitioning, but each bucket can hold a set of column
values. A bucket is determined by hashing the bucket key of the row. You can use partitioning
and bucketing at the same time.
8. Optimize joins and shuffles - If you have slow jobs on a join or shuffle, the cause is probably
data skew, which is asymmetry in your job data. To fix data skew, you can salt the entire key,
or use an isolated salt for only some subset of keys. Another option is to introduce a bucket
column and pre-aggregate in buckets first.
Another factor could be the join type. SortMerge is used by default and is best for large datasets,
but it is expensive because it must sort both the left and right sides before joining.
Broadcast joins are best suited for small datasets: this type of join broadcasts the small side to all
executors, and so requires more memory for broadcasts.
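
A sketch of the broadcast case using the DataFrame broadcast hint; the table contents and column names
here are made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-demo").master("local[*]").getOrCreate()
import spark.implicits._

val largeDf = spark.range(0, 1000000).withColumnRenamed("id", "customerId")   // big fact side
val smallDf = Seq((0L, "gold"), (1L, "silver")).toDF("customerId", "tier")    // small dimension side

// Hint Spark to ship smallDf to every executor instead of sort-merge joining both sides
val joined = largeDf.join(broadcast(smallDf), "customerId")
joined.explain()   // the physical plan should show a BroadcastHashJoin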
