One of the most popular topics at the recent Spark Summit in San
Francisco was Spark Streaming, which is a system for processing
fast, large streams of data in real time. This blog post highlights Spark
Streaming's core capabilities and architectural design. We conclude
by contrasting the capabilities of Spark Streaming with those of Storm, as
guidance for readers who need to select a streaming system.
Core Capabilities of Spark Streaming
Spark Streaming is an extension of the core Spark API that enables
high-throughput, fault-tolerant processing of live data streams. It ingests data
from many sources, including Kafka, Flume, Twitter, ZeroMQ and plain old
TCP sockets. Spark Streaming then processes that data using complex
algorithms expressed with high-level functions. Finally, the
processed data can be pushed out to file systems (including HDFS), databases
(including HBase), and live dashboards.
The core innovation behind Spark Streaming is to treat streaming
computations as a series of deterministic batch computations on small time
intervals. The input data received during each interval is stored reliably
across the cluster to form an input dataset (also called a "micro batch") for
that interval. Once the time interval completes, this dataset is processed
via deterministic parallel operations, such as map, reduce, join, window
and group by, to produce new datasets representing program outputs or
intermediate states.
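The idea can be illustrated without Spark at all. The following is a minimal pure-Python sketch, not the Spark Streaming API: the helper names (micro_batches, word_count), the sample events and the one-second interval are all assumptions made for illustration. It groups timestamped records into per-interval datasets, runs the same deterministic computation on each one, and folds the per-batch outputs into a running state.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into batches, one per time interval.

    Hypothetical helper: a real system would build these batches as data
    arrives, rather than from a finished list.
    """
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval)].append(value)
    # Return batches in time order; intervals without data are skipped.
    return [batches[k] for k in sorted(batches)]


def word_count(batch):
    """Deterministic batch computation: split lines into words, count them."""
    return Counter(word for line in batch for word in line.split())


# Sample input: (arrival time in seconds, line of text). Illustrative only.
events = [(0.2, "a b"), (0.7, "b"), (1.1, "a"), (2.5, "c c")]

batches = micro_batches(events, interval=1.0)
results = [word_count(batch) for batch in batches]  # one output per interval

# Intermediate state carried across intervals: a running total.
running = Counter()
for result in results:
    running.update(result)
```

Each entry in results depends only on the input of its own interval, which is the property the rest of the design relies on.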
Architecture
The micro batch abstraction, called a D-Stream (discretized stream) in the original paper [1], provides an
elegant solution to three challenges that arise in large-scale distributed
computing environments: fault tolerance, consistency and a unified
programming model across batch and real time.
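One way to see why deterministic batches help with fault tolerance and a unified programming model is the small pure-Python sketch below. It is an illustration under stated assumptions, not how Spark implements recovery: the stored_input dictionary stands in for reliably stored input batches, the "failure" is simulated by deleting one output, and count_words is a hypothetical function.

```python
from collections import Counter


def count_words(lines):
    """Deterministic: the same input lines always yield the same counts."""
    return Counter(word for line in lines for word in line.split())


# Input micro batches by interval index, assumed to be stored reliably.
stored_input = {0: ["a b", "b"], 1: ["a"], 2: ["c c"]}

# Streaming path: process each interval's batch independently.
outputs = {i: count_words(batch) for i, batch in stored_input.items()}

# Simulate losing the output of interval 1 (for example, a worker failure).
lost = outputs.pop(1)

# Recovery: recompute from the stored input. Because the computation is
# deterministic, the recomputed result is identical to the lost one.
outputs[1] = count_words(stored_input[1])
recovered_matches = outputs[1] == lost

# Unified model: the same function also runs over all data as one batch job.
all_lines = [line for i in sorted(stored_input) for line in stored_input[i]]
batch_total = count_words(all_lines)
stream_total = sum((outputs[i] for i in sorted(outputs)), Counter())
```

In this sketch the recomputed output equals the lost one, and the totals from the streaming path and the single batch job agree, because both paths apply the same deterministic function to the same input.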
[1] http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf
Dimension          Criteria
Market traction    Speed of innovation
                   Partner ecosystem
                   Enterprise adoption
                   Developer productivity
Data integration   Data ingestion
                   Data persistence
Data processing    Processing framework
                   State management
                   Throughput & latency
Operations         Native management
                   Choice of resource managers
                   Fault tolerance (FT)
                   Multi-tenancy