
Memo from Analytix.Marketing

Spark Streaming versus Storm: Comparing Systems for Processing Fast and Large Streams of Data in Real Time

One of the most popular topics at the recent Spark Summit in San
Francisco was Spark Streaming, a system for processing fast and large
streams of data in real time. This blog post highlights Spark Streaming's
core capabilities and architectural design. We conclude with advice for
readers who need to select a streaming system, contrasting the
capabilities of Spark Streaming with those of Storm.
Core Capabilities of Spark Streaming
Spark Streaming is an extension of the core Spark API that enables
high-throughput, fault-tolerant processing of live data streams. It ingests
data from many sources, including Kafka, Flume, Twitter, ZeroMQ, and plain
old TCP sockets. Spark Streaming then processes that data using complex
algorithms expressed with high-level functions. Finally, the processed data
can be stored in file systems (including HDFS) and databases (including
HBase), or pushed to live dashboards.
The core innovation behind Spark Streaming is to treat streaming
computations as a series of deterministic batch computations on small time
intervals. The input data received during each interval is stored reliably
across the cluster to form an input dataset (also called "micro batch") for
that interval. Once the time interval completes, this dataset is processed
via deterministic parallel operations, such as map, reduce, join, window
and group by, to produce new datasets representing program outputs or
intermediate states.
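The micro-batch idea can be illustrated with a small, hypothetical sketch (plain Python, not the Spark API): a stream is treated as a series of small input batches, and each batch is processed by the same deterministic map/reduce-style computation.

```python
from collections import defaultdict

# Hypothetical sketch of micro-batch processing (not the Spark API):
# streaming is treated as a series of deterministic batch computations,
# one per time interval.

def process_interval(batch):
    """Deterministic word count over one micro batch."""
    counts = defaultdict(int)
    for line in batch:              # map: split each record into words
        for word in line.split():
            counts[word] += 1       # reduce: sum the counts per word
    return dict(counts)

# Simulated stream: records grouped by the interval in which they arrived.
intervals = [
    ["spark streaming", "spark storm"],   # interval 0
    ["storm storm"],                      # interval 1
]

outputs = [process_interval(b) for b in intervals]

# Because each pass is deterministic, re-running an interval on the same
# input batch reproduces the same output (the basis for fault recovery).
assert outputs[0] == process_interval(intervals[0])
```

The key property the sketch shows is determinism: given the same input batch, the computation always produces the same result, which is what makes the recovery strategy described below possible.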
Architecture
The micro-batch model, called a D-Stream in the original paper [1], provides an
elegant solution to three challenges that arise in large-scale distributed
computing environments: fault tolerance, consistency, and a unified
programming model across batch and real time.


[1] http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf


Unified programming model. Spark provides a unified programming model
across all its processing engines. This yields four benefits:
o Faster learning curve: It allows users to write one analytic job,
which then executes equally well on both batch and streaming
data. This obviates the need to learn about the different
interfaces and specific APIs of batch versus streaming
systems.
o Higher developer productivity: On a related note, machine
learning libraries, statistical functions, and complex algorithms
such as graph processing that are available in Spark can be
put to use on streaming data as well, saving developers time.
o Better decisions: Moreover, the unified programming model
also makes it much easier to combine arriving real-time data
with historical data in one analysis, for instance to make a
decision on the basis of comparing new data with old data.
o Ease of operations: Spark provides a unified run time across
different processing engines. Therefore, one physical cluster
and one set of operational processes can cover the full gamut
of use cases.
Consistency / statefulness. Spark nodes hold immutable state in a
cluster-wide in-memory cache. This guarantees exactly-once semantics
and supports use cases where statefulness is important.
Fault tolerance. Input batches are replicated in-memory across
worker nodes. If a worker node fails, the batches on failed nodes
are recomputed in parallel across several nodes to ensure a fast
recovery. When used in conjunction with ZooKeeper, fault tolerance also
extends to the master node.
It is this innovative design that has given rise to the broad interest in
Spark Streaming that we see today.
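The recovery mechanism can be sketched in a few lines of hypothetical Python (a conceptual illustration, not Spark's implementation): because input batches are replicated and the per-batch computation is deterministic, the output of a failed worker can be recomputed from a surviving replica of its input.

```python
# Hypothetical sketch of micro-batch fault recovery (not Spark's code):
# input batches are replicated across workers, so a lost worker's results
# can be recomputed deterministically from a surviving replica.

def transform(batch):
    """Deterministic per-batch computation (here: squaring each value)."""
    return [x * x for x in batch]

# Two workers hold replicas of the same input batch.
replicas = {"worker_a": [1, 2, 3], "worker_b": [1, 2, 3]}

result = transform(replicas["worker_a"])   # normal processing on worker_a

# worker_a fails: recompute from worker_b's replica of the input batch.
recovered = transform(replicas["worker_b"])

assert recovered == result   # determinism => identical output after recovery
```

In the real system the recomputation is parallelized across many nodes, which is why recovery is fast, but the principle is the same: replicated inputs plus deterministic computation make lost results reproducible.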
Advice for Selecting a Streaming System
A short blog post cannot do justice to the large variety of use cases that
call for streaming capabilities. However, we can offer some guidance on
where Spark Streaming is a good fit:
Ease of use is important, manifested by a quick learning curve for
developers, data scientists, analysts and IT operations. Users who
look at streaming from an application or business perspective find


the higher abstraction level available in Spark's declarative APIs
particularly compelling. These APIs allow users to work at the level
of the actual business logic and data pipeline, specifying what has
to happen. Spark then figures out how it has to happen, coordinating
tasks such as data movement and recovery. Users are spared having to
worry about the details of which nodes execute which computations as
part of a specific job.
Real-time decisioning for the business is important. Spark
combines statefulness and persistence with high throughput. Many
organizations have evolved from exploratory, discovery type use
cases of big data to use cases that require reasoning on the data as
it arrives in order to make a decision near real time that is pushed to
the front line of the organization, for instance in a sales or service or
production context. Users need certainty on questions such as the
exact number of frauds, emergencies, or outages occurring today,
and data loss is not acceptable. These business-critical use cases
call for the "exactly once" semantics that Spark Streaming provides.
Storm provides exactly-once processing only in conjunction with
Trident. Trident achieves this via a transaction ID, which limits
the throughput that can be achieved.
Your big data vendor of choice supports Spark Streaming.
Currently, Hortonworks, Cloudera, Pivotal and MapR provide
commercial support for Spark, but the vendor ecosystem that
supports Spark is expanding quickly.
Storm enjoys more awareness in the market as of this writing, which
explains why some believe that Storm is more mature. Reference
implementations include the original sponsor, Twitter. Storm's low-level
programming model might be an advantage for highly advanced users
implementing highly specialized, unusual processing logic.
Low latency is often stated as one of the biggest benefits of Storm. While
it is true that Storm achieves latency as low as milliseconds to tens of
milliseconds, the difference is immaterial to the vast majority of
commercially relevant use cases (exceptions include algorithmic trading).
In the telecommunications industry, for instance, data streams from network
probes arrive with an intrinsic latency of 15 minutes. For most users,
therefore, latency will not outweigh the other benefits of Spark Streaming.
An apples-to-apples comparison with Spark Streaming would also have to
consider Trident. Trident is an extension of Storm that provides
higher-level declarative/functional APIs similar to Pig or Cascading:
joins, aggregations, grouping, functions, filters, etc. It allows
state-versioning information to be persisted to an external database,
which is then used to ensure exactly-once semantics. Relying on
transaction IDs to update state has to be implemented by the user (it is
not a matter of just pressing a button) and degrades performance and
throughput.
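The transaction-ID technique can be sketched as follows (a hypothetical, simplified illustration in plain Python, not Trident's actual API): the state store remembers the last transaction ID applied per key, so a replayed batch is detected and its update is not applied twice.

```python
# Hypothetical sketch of exactly-once state updates via transaction IDs,
# in the spirit of Trident's approach: the store records the last
# transaction ID applied to each key, making replayed updates idempotent.

state = {}   # key -> (last_txid, value)

def apply_update(key, txid, delta):
    last_txid, value = state.get(key, (None, 0))
    if txid == last_txid:        # replayed batch: already applied, skip
        return value
    state[key] = (txid, value + delta)
    return value + delta

apply_update("orders", txid=1, delta=5)
apply_update("orders", txid=2, delta=3)

# A worker failure causes the batch with txid=2 to be replayed;
# the duplicate update is detected and not applied a second time.
apply_update("orders", txid=2, delta=3)
assert state["orders"] == (2, 8)
```

The throughput cost mentioned above comes from the bookkeeping this implies in a real system: each state update must consult and persist the transaction ID, typically against an external database, which serializes and slows the write path.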
As technologies are evolving very quickly, readers might find our approach
to comparing different streaming systems helpful:

Dimension                Criteria

Market traction          Speed of innovation
                         Partner ecosystem
                         Enterprise adoption

Developer productivity   Programming model & APIs
                         Integration of batch & RT

Data integration         Data ingestion
                         Data persistence

Data processing          Processing framework
                         State management
                         Throughput & latency

Operations               Native management
                         Choice of resource managers
                         Fault tolerance (FT)
                         Multi-tenancy
