
Tuning Spark Streaming for Throughput

By Gerard Maas

22/12/2014


Spark Streaming is an abstraction that brings streaming capabilities to Spark. It works by creating
micro-batches of data that are handed to Spark for further processing, and it offers a rich set of stream
operations consistent with the Spark API.
While migrating some jobs to Spark Streaming, we faced a series of performance challenges. This
article summarizes our findings in the form of a tuning guide. We hope it is useful to other Spark
Streaming adopters, and we welcome an open discussion on the topic.
After we present the context of our system and background, we cover how Spark Streaming works,
going into the level of detail needed to explain the parameters involved in the performance of a
streaming job.
We have divided this article into the following sections:

Our Context
Understanding Spark Streaming
Going One Level Deeper
A Tuning Guide
    Scaling up consumers
    Parallelism
    Partitions
    Data Locality
    Caching
    Logging
Closing Words

Under the motto "to measure is to know", performance measurements are an essential part of any
performance improvement process. In an upcoming article, we will cover how to measure these
improvements using the tools provided by Spark.

Our Context
At Virdata, we collect, store, transform and analyse data produced by a large number of devices or
"things". We have been migrating our data ingestion pipelines from a well-known streaming
framework to Spark Streaming. Our rationale has been two-fold: the micro-batching model is a great
match for inserting records into Cassandra, and having a single programming model for both our
Spark analytics and the data ingestion layer makes it easier to grow and maintain a coherent and
reusable knowledge and code base.
Our initial tests with Spark Streaming were really promising and we went ahead with the migration,
but as we implemented the full spectrum of business requirements on the data ingestion pipeline, we
observed our Spark Streaming jobs increasingly lagging behind in terms of performance.
To address that situation, we did an in-depth analysis to spot the issues and find a workable
solution, which led to a throughput improvement of about 61x within the limits of the same
micro-batch interval.

The tuning aspects discussed here are based on our system, which consists of Kafka 0.8.1.1,
Spark Streaming 1.1.0, Spark Cassandra Connector 1.0.0 and Cassandra 2.x. Spark runs on top of
Mesos 0.20, separate from the Cassandra 2.0.6 ring.

In this article, our focus will be mainly on the aspects and parameters related to Spark Streaming.

How Spark Streaming works


Spark Streaming combines one or more stream consumers with a Spark transformation process that
must be materialized by an action (like saveAs) in order to get scheduled for execution.
Intuitively, it's very easy to understand: as data comes in, it's collected and packaged in blocks during
a given time interval, also known as the batch interval. Once the interval is over, the collected data
blocks are handed to Spark for processing.
What's generally less known is the timing of how this sequence of events takes place:
The batch interval provided to the Spark Streaming constructor determines the duration of
each interval (10 seconds in this example):

val ssc = new StreamingContext(sc, Seconds(10))

On start of Spark Streaming (t0), the consumers are instantiated and start consuming events
from the streaming source. In our case, that means several Kafka consumers start fetching
messages from Kafka topics.
The received data is placed in blocks, implemented as arrays, and delivered to the Block
Manager. These blocks become the partitions of the RDD that Spark will work on.
On the next batch interval (t1), the data collected at t0 is handed to Spark, and Spark Streaming
starts filling the next bucket of data (#1).
At any point in time, Spark is processing the previous batch of data while the streaming
consumers are collecting data for the current interval. In this chart, Spark Streaming is
processing interval #1 while collecting data for t2 (which becomes block #2).
Once a batch has been processed by Spark (like #0 in the above illustration), it can be cleaned
up. When that RDD is cleaned up is determined by the spark.cleaner.ttl setting.
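
The relevant settings come together when the streaming context is created. The following is a
minimal sketch with illustrative values only (the application name is ours, and a one-hour TTL is
just an example, not our production setting):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative values: 10-second batches, old blocks/RDD metadata eligible for cleanup after 1 hour
val conf = new SparkConf()
  .setAppName("streaming-ingest")          // hypothetical application name
  .set("spark.cleaner.ttl", "3600")        // seconds
val ssc = new StreamingContext(conf, Seconds(10))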

Going one level deeper

Summarizing the previous section, Spark Streaming consists of two processes:

Fetching the data, done by the streaming consumer (in our case, the Kafka consumer)
Processing the data, done by Spark

These two processes are connected by the timely delivery of the collected data blocks from Spark
Streaming to Spark. This also gives us the main performance guideline for Spark Streaming:

The time to process the data of a batch interval must be less than the batch interval
time.
Given an unbounded consumer like the Kafka consumer, this implies that our Spark job must be able
to timely process the incoming data of a batch interval. In order to achieve our goal of maximizing
the throughput of the system, we want to consume data as fast as we can and tune our Spark job to
process that data within the time interval restriction.
As the Kafka consumer has proven quite capable of delivering data, up to overwhelming volumes,
the tuning efforts described in the rest of this article focus on the Spark side of Spark
Streaming.
Let's do a quick Spark recap:
A Spark job consists of transformations and actions. It is broken down into stages of operations
that can be inlined.
An RDD is a distributed collection of data, broken down into partitions spread over the nodes.
A task applies a stage to a data partition on an executor. Scheduling a task has a certain fixed
cost.

In Spark Streaming, at each batch interval we will apply the same job to a new batch of data, so the
batch processing time will be informally determined by:

processing time ~= #tasks * scheduling cost + #tasks * time complexity per task / parallelism

where

#tasks = #stages x #partitions

From these two statements, we can infer that to minimize the processing time of a batch we need to
minimize the stages and partitions and maximize parallelism. Note how interval time is not explicit in
this set of equations.

Note:

Although this model is a simplification of the performance characterization of a Spark
Streaming job, it provides a sufficient framework to reason about the streaming job in terms
of tasks, stages, partitions and executors.
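
To get a feel for the orders of magnitude involved, here is a back-of-the-envelope illustration of
the model; every number below is an assumption made up for the example, not a measurement from our
system:

// Illustrative only: plug assumed numbers into the model above
val stages       = 3                        // stages in the streaming job
val partitions   = 16                       // partitions per RDD
val tasks        = stages * partitions      // #tasks = #stages x #partitions = 48
val schedulingMs = 5.0                      // assumed fixed scheduling cost per task
val taskWorkMs   = 400.0                    // assumed processing time per task
val parallelism  = 8                        // Spark cores available for processing
val batchMs = tasks * schedulingMs + tasks * taskWorkMs / parallelism
// = 48 * 5 + 48 * 400 / 8 = 240 + 2400 = 2640 ms, which must stay below the batch interval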

A tuning guide
Now that we have identified the elements that determine the performance of a Spark Streaming job,
let's see what knobs we can turn in order to optimize it.

Scaling up consumers
To increase the number of messages consumed by the system, we can create multiple
consumers that fetch data in parallel. Each consumer is assigned one core on an executor.
This is a common pattern:

@transient val inKafkaList: List[DStream[(K, V)]] = List.fill(kafkaParallelism) {
  KafkaUtils.createStream[K, V, KDecoder, VDecoder](ssc, kafkaConfig, topics, StorageLevel.MEMORY_AND_DISK_SER)
}
@transient val inKafka = inKafkaList.tail.foldLeft(inKafkaList.head){ _.union(_) }

The union of the created DStreams is important, as it reduces the number of transformation
pipelines on the input DStream to one. Not doing this multiplies the number of stages by the
number of consumers.

Notes:
kafkaParallelism is the (configurable) number of consumers to create.
Storage level MEMORY_AND_DISK_SER allows Spark Streaming to spill serialized data to disk in
case of overload, when the available memory is not sufficient to hold the incoming data.
Declaring the DStream references as @transient is often necessary to avoid them being
serialized with the job, which would result in a serialization exception, as DStreams are not
supposed to be serialized.

Parallelism
As we explained before, Spark Streaming is in fact two processes running concurrently: the data
consumers and Spark. The parallelism level of the consumer side is defined by the number of
consumers created (see the previous section on how to create consumers). The parallelism of the
Spark processing cluster is determined by the total number of cores configured for the job minus the
number of consumers.

Given the total number of cores, controlled by the configuration parameter spark.cores.max:

Consumer parallelism = #consumers created (kafkaParallelism in the previous example)


Spark parallelism = spark.cores.max - #consumers

Tuning guide for spark.cores.max:

To maximize the chances of data locality and an even parallel execution, spark.cores.max should be a
multiple of #consumers. For example, if you create 4 Kafka consumers, you could assign
spark.cores.max = 8 or spark.cores.max = 12, effectively configuring 1 or 2 Spark processing cores
per consumer, respectively.
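
As a minimal sketch of that example (the application name and values are illustrative):

import org.apache.spark.SparkConf

// Illustrative: 4 Kafka consumers + 8 Spark processing cores = 12 cores in total
val kafkaParallelism = 4
val conf = new SparkConf()
  .setAppName("kafka-ingest")               // hypothetical application name
  .set("spark.cores.max", "12")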

Notes:
It is something of an urban Internet legend that a Spark Streaming application needs n+1
cores, where n is the number of consumers. This is ONLY correct in test cases and very
small deployments. For a throughput-sensitive application, provision the Spark side of
the job with enough resources, as outlined in this section.
There are no hard guarantees on the even distribution of consumers and Spark cores
across executors, which can result in less-than-ideal cluster topologies. For network-intensive
applications, an even deployment across physical nodes would be ideal. At the moment
of writing, there is no way to express that constraint in Mesos.

Partitions
As discussed previously, reducing the number of partitions is important in order to reduce the
overall processing time, as it leads to fewer tasks, and therefore bigger chunks of data to operate on at
once and less scheduling overhead.
How many partitions do we have for each RDD in a DStream?
Each receiver fetches data and hands it to a ReceiverSupervisor on its executor, which takes care of
managing the blocks. Each block becomes a partition of the RDD produced during each batch interval.
The size of these blocks is time-bound, defined by the following configuration parameter (shown with
its default value):

spark.streaming.blockInterval = 200

Interval (milliseconds) at which data received by Spark Streaming receivers is coalesced into blocks
of data before storing them in Spark. (see docs)
Given that each consumer will produce the same amount of blocks, it follows that the number of
partitions in an RDD for a given interval is:

#partitions = #consumers * batchIntervalMillis / blockInterval

Tuning guide for spark.streaming.blockInterval:

Increasing spark.streaming.blockInterval reduces the number of partitions in the RDD and
therefore the number of tasks per batch. blockInterval must be an integer divisor of the batch
interval. Following the Spark guideline of having roughly 2x-3x as many partitions as available
cores, we have been successfully applying the following guideline:
Given:

batchIntervalMillis = configured batch interval in milliseconds
spark.cores.max = total cores
#consumers = created streaming consumers
sparkCores = spark.cores.max - #consumers
partitionFactor = number of partitions per core (1, 2, 3, ..., ideally a multiple of k, the number of Spark cores per consumer)

Then:

spark.streaming.blockInterval = batchIntervalMillis * #consumers / (partitionFactor * sparkCores)

E.g., in our configuration we have assigned spark.cores.max = 12 cores and created 4
consumers, leaving sparkCores = 8. We have defined a batch interval of 6 seconds and
consider a partitionFactor = 2 acceptable.
Then:

spark.streaming.blockInterval = 6000 * 4 / (2 * 8) = 24000 / 16 = 1500 ms

Let's check:

#partitions = 4 * 6000 / 1500 = 16 partitions = (partitionFactor) 2 x 8 cores [QED]
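
The calculation above is easy to wrap in a small helper; the function name below is our own and
simply restates the formula:

// Sketch: spark.streaming.blockInterval (in ms) from the quantities defined above
def blockIntervalMs(batchIntervalMs: Long, consumers: Int, sparkCores: Int, partitionFactor: Int): Long =
  batchIntervalMs * consumers / (partitionFactor * sparkCores)

val interval = blockIntervalMs(6000, 4, 8, 2)                     // = 1500, as in the example
// conf.set("spark.streaming.blockInterval", interval.toString)   // value interpreted as milliseconds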

Data Locality
Big blocks of data, created with a large configured value for spark.streaming.blockInterval, are
great when they can be processed on the same node where they reside, using data locality level
NODE_LOCAL. But they can be heavy to transport over the network if another node has idle
processing capacity.
We try to improve our data locality odds by allocating k Spark nodes per consumer, so that the
collected data can be evenly processed.
Nevertheless, depending on the complexity of the Spark job defined over the DStream, the scheduler
might decide to launch some tasks at a lower locality level.


The time Spark will wait for locality is controlled by the configuration parameter:
spark.locality.wait, with a default value of 3000ms.

Tuning guide for spark.locality.wait:


The default value of 3000 ms is too high for jobs expected to execute within the 5-10 s range, as in
many cases it sets a floor on the job execution time whenever a task falls below the NODE_LOCAL
locality level. We have observed that setting this parameter to a value between 500 and 1000 ms
helps lower the total processing time of a Spark Streaming job.

Notes:
We need to find a balance between spark.streaming.blockInterval and spark.locality.wait.
If we observe that tasks are being taken by a non-local executor, lowering
spark.streaming.blockInterval will reduce the network transfer time, while increasing
spark.locality.wait will increase the chance of that task executing with data locality
NODE_LOCAL.
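
As a minimal illustration of the spark.locality.wait setting (the value is within the 500-1000 ms
range mentioned above, not a universal recommendation):

import org.apache.spark.SparkConf

// Spark 1.x interprets this value as milliseconds; the default is 3000
val conf = new SparkConf().set("spark.locality.wait", "750")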

Caching
In our data ingestion use case, we route data to different Cassandra keyspaces. This seems to be a
reasonably common pattern, as processed data streams usually need to be persisted somewhere
for further use (HDFS, Cassandra, local disk, ...) and sorting them by some common denominator
(keyspace, date, customer, folder) helps with retrieval afterwards.
To implement that routing function, we iterate over each RDD as many times as there are different
routes, each time creating a filtered version of the original RDD that gets persisted.
In code, this process looks roughly like this (routeOf stands for whatever function extracts the
routing key from an element):

dstream.foreachRDD { rdd =>
  rdd.cache()                              // cache the RDD before iterating!
  keys.foreach { key =>
    // routeOf is a placeholder for the function that extracts the routing key from an element
    rdd.filter(elem => routeOf(elem) == key).saveAsFooBar(...)
  }
  rdd.unpersist()
}

We enclose the iterative section in a cache/unpersist pair, so the data stays cached only for the
time we need it.
Using rdd.cache speeds up the process considerably. As in core Spark, the first cycle takes the
same time as the uncached version, but each subsequent iteration takes only a fraction of it.
This discovery took us by surprise: given that Spark Streaming data is per se already in memory, it
was counter-intuitive that rdd.cache would have any beneficial effect.
Our testing shows a different reality: if a DStream or the corresponding RDD is used multiple times,
caching it significantly speeds up the process. In our case, it was the change that delivered the
largest performance improvement.

Note:
In case of window operations, DStreams are implicitly cached as the RDDs are preserved
beyond the limits of a single batch interval.

Tuning guide for .cache:


Use it if the streaming job iterates over the DStream or its RDDs more than once.

Logging
One of the popular rules of thumb regarding logging is to never place logging calls within a loop. As
a Spark Streaming job is basically a long-running loop of the same job over incoming new data, this
advice is quite relevant.
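
A hedged sketch of what this looks like in practice, reusing the inKafka stream from the consumer
example (the logger name is illustrative, and note that rdd.count() is itself an action that triggers
a small job):

import org.apache.log4j.Logger

lazy val log = Logger.getLogger("ingest")   // illustrative logger name

inKafka.foreachRDD { rdd =>
  // Avoid log.info(...) inside rdd.map / rdd.filter: it runs once per element, on the executors.
  // Prefer one aggregated log line per batch, emitted on the driver:
  log.info(s"received ${rdd.count()} events in this batch")
}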

On a simple ETL test job we measured the effect of two logInfo(...) lines in the scope of a
DStream closure. This chart illustrates a comparison in performance for that case:

Tuning guide for logging:


Avoid logging calls within the DStream and RDD transformations of a Spark Streaming job.
Spark itself is quite chatty in the logs; set the right log levels for your application.

Enable Kryo
This one is still on our TODO list: after we enabled Kryo, we had issues with some of the data being
silently nullified.
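
For reference, enabling Kryo in Spark 1.x comes down to the following settings; the registrator class
name is a placeholder, and, as noted above, we have not yet validated this setup in our own pipeline:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.IngestKryoRegistrator")   // placeholder registrator class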

To Measure is to Know
To tune a Spark Streaming application, we need a means of determining whether our changes are
delivering a beneficial effect. We have been using the Spark UI, in particular the Streaming tab, and
the Spark metrics subsystem to gather performance data. We will cover these tools in detail in a
follow-up article.

Closing Words
Spark Streaming offers a micro-batch based streaming model that brings the same rich and
expressive capabilities of Spark to streaming data. Given that the processing time is constrained to
the batch interval, special care must be taken to ensure that Spark Streaming applications are tuned
to consistently deliver results within that given time interval, for every interval.

In this article we have visited the code changes, settings and parameters that have helped us
improve the throughput of our Spark Streaming applications roughly 60-fold. We don't claim that
these are the only parameters affecting the performance of a Spark Streaming job, but we have seen
consistent performance improvements after applying this tuning guide, backed up by extensive
testing in development and production environments.
In a follow-up article we will further explain how to use the tools delivered by Spark to help with the
tuning process.
All feedback, corrections and discussions are welcome.

By: Gerard Maas (twitter: @maasg)
