This blog is a summary of Spark Streaming: it introduces the basic concepts and internals of Spark Streaming, compares it to Storm, and covers in-production usage and important JIRAs in the history of Spark Streaming's evolution.
What is Spark Streaming
Spark Streaming is a component built on Spark for doing large-scale stream processing. It extends Spark with the ability to:

- Scale to hundreds of nodes and achieve second-scale latencies
- Do efficient and fault-tolerant stateful stream processing
- Offer a simple, batch-like API for implementing complex algorithms
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs.
Spark Streaming chops the live stream into batches of X seconds; Spark treats each batch of data as an RDD and processes it using RDD operations.
Essentially, Spark Streaming is a micro-batching system: by chopping the live stream into a series of very small batches, it achieves behavior similar to that of a real-time streaming framework. You could roughly simulate Spark Streaming yourself by writing a Spark batch job that processes an amount of live data and adding that job to a crontab to be invoked periodically.
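The micro-batch idea can be sketched in plain Scala, with no Spark involved; `Event`, `microBatch`, and the 1000 ms interval below are all illustrative names, not Spark APIs:

```scala
// Chop a stream of timestamped events into fixed-interval batches and run
// the same batch function (here, a word count) on each batch in time order.
case class Event(timeMs: Long, word: String)

def microBatch(events: Seq[Event], batchIntervalMs: Long): Seq[Map[String, Int]] =
  events
    .groupBy(e => e.timeMs / batchIntervalMs) // assign each event to a batch
    .toSeq.sortBy(_._1)                       // process batches in time order
    .map { case (_, batch) =>                 // the per-batch "job": word count
      batch.groupBy(_.word).map { case (w, es) => w -> es.size }
    }

val events = Seq(Event(0, "a"), Event(500, "b"), Event(1200, "a"))
val counts = microBatch(events, batchIntervalMs = 1000)
// counts == Seq(Map("a" -> 1, "b" -> 1), Map("a" -> 1))
```

Each element of the result is the output of one "batch job", which is exactly how Spark Streaming treats each batch of the live stream.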
Similar to the RDD in Spark, Spark Streaming provides a high-level abstraction called a DStream, which represents a continuous stream of data. Internally, a DStream is represented as a sequence of RDDs.
Like an RDD, a DStream can be obtained from an input source such as Kafka or Flume, and transformations can be applied to an existing DStream to derive a new DStream.
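The "sequence of RDDs" idea can also be sketched without Spark; `MiniDStream` below is an illustrative stand-in (not the real Spark class), with each RDD modeled as a plain Seq:

```scala
// A DStream transformation applies the same RDD-level operation to every batch.
final case class MiniDStream[A](batches: Seq[Seq[A]]) {
  def map[B](f: A => B): MiniDStream[B] =
    MiniDStream(batches.map(_.map(f)))
  def flatMap[B](f: A => Seq[B]): MiniDStream[B] =
    MiniDStream(batches.map(_.flatMap(f)))
}

val lines = MiniDStream(Seq(Seq("hello world"), Seq("hello spark")))
val words = lines.flatMap(_.split(" ").toSeq)
// words.batches == Seq(Seq("hello", "world"), Seq("hello", "spark"))
```

This is why the DStream API feels so much like the RDD API: a DStream transformation is just the corresponding RDD transformation applied to every batch.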
Here is a code snippet of streaming word count:

```scala
// Read lines from Kafka; each record is a (key, message) pair, keep the message.
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val words = lines.flatMap(_.split(" "))
// Count words over a sliding 10-minute window that advances every 2 seconds,
// using the invertible (add/subtract) form of reduceByKeyAndWindow.
val wordCounts = words.map(x => (x, 1L))
  .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
wordCounts.print()
```
- Get the input DStream from the Kafka input source.
- Apply high-level transformations on the input DStream to get new DStreams.
- Trigger the action print() on the DStream to finalize the job DAG.
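The invertible (add/subtract) form of reduceByKeyAndWindow used in the snippet can be illustrated without Spark. This pure-Scala sketch (addBatch and subtractBatch are illustrative names, not Spark APIs) shows why it is efficient: each slide only touches the batches entering and leaving the window, rather than re-reducing the whole window.

```scala
// Maintain windowed word counts incrementally across batch boundaries.
def addBatch(window: Map[String, Long], batch: Seq[String]): Map[String, Long] =
  batch.foldLeft(window)((w, k) => w.updated(k, w.getOrElse(k, 0L) + 1L))

def subtractBatch(window: Map[String, Long], batch: Seq[String]): Map[String, Long] =
  batch.foldLeft(window) { (w, k) =>
    val n = w(k) - 1L
    if (n == 0L) w - k else w.updated(k, n)
  }

val w0 = addBatch(Map.empty, Seq("a", "a", "b")) // window holds batch 1
val w1 = addBatch(w0, Seq("b"))                  // batch 2 enters the window
val w2 = subtractBatch(w1, Seq("a", "a", "b"))   // batch 1 leaves the window
// w2 == Map("b" -> 1L)
```

Note that in real Spark Streaming this invertible form requires checkpointing to be enabled, since the window state must survive failures.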
So Spark Streaming's programming pattern is quite similar to Spark's, which makes the learning curve low for users who already know Spark.
What Spark Streaming Additionally Provides
So you may wonder: why not directly write a Spark program and invoke it periodically to simulate a streaming program? You could, but you would also have to handle several problems that Spark Streaming has already solved for you.
- Connectors. Spark Streaming offers various connectors for Kafka, Flume, Kinesis, and so on. Input connectors are essential for a streaming program, and Spark Streaming supports them well.
- Fault tolerance. Fault tolerance is crucial for a distributed program. Spark Streaming provides several mechanisms to guarantee it: keeping two copies of received data, a write-ahead log (WAL), and metadata checkpointing and recovery.
- Instrumentation. Spark Streaming provides several instrumentation tools that help you understand its run-time internals.
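As a minimal sketch of how the fault-tolerance features above are turned on (the app name, batch interval, and checkpoint path are all placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("my-streaming-app") // placeholder app name
  // Persist received blocks to a write-ahead log before acknowledging them.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(2))
// Metadata checkpointing (and the WAL) need a reliable directory, e.g. on HDFS.
ssc.checkpoint("hdfs:///checkpoints/my-streaming-app") // placeholder path
```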
Here are some materials to help you better understand the internals of Spark Streaming:
- Deep dive into Spark Streaming
- Improved Fault-tolerance and Zero Data Loss in Spark Streaming
- Improvements to Kafka integration of Spark Streaming
- Integrating Kafka and Spark Streaming: Code Examples and State of the Game
Pros and Cons of Spark Streaming
Pros:

- Ease of use and a low learning curve; no additional effort if you already understand Spark.
- Highly integrated into the Spark ecosystem, with interoperability between different Spark components.
- High throughput with fast fault recovery.
- Batch-like, high-level abstracted API that lets you focus on the processing logic.

Cons:

- The micro-batch processing model makes latency relatively higher than in record-at-a-time systems.
- The checkpoint mechanism is not very robust, and checkpoints do not survive application upgrades.
What's the Difference Compared to Storm
The major difference between Spark Streaming and Storm is the processing engine. Spark Streaming uses Spark internally as its processing engine, which restricts it to a micro-batching model, whereas Storm processes data one record at a time, as a continuous flow.
This difference in model makes Spark Streaming a micro-batching system and Storm an event-driven system.
Another difference is fault recovery. As mentioned before, Spark Streaming is driven by Spark internally, so fault mitigation relies on Spark's mechanisms such as straggler mitigation and task rerun. Storm, on the other hand, uses an upstream recovery mechanism: an anchoring system tracks each message, detects data loss, and notifies the upstream component to replay it.
The anchoring system requires more CPU and network resources to track each message, which affects total throughput, so the throughput of Storm is generally lower than that of Spark Streaming.
From the user's point of view, Storm provides low-level primitives, whereas Spark Streaming offers high-level abstractions. A high-level API is easier to use, but low-level primitives offer stronger control.
Here are some materials to compare Spark Streaming with Storm:
- Apache Storm and Spark Streaming Compared
- Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
History and Key Improvements of Spark Streaming
Spark Streaming was first brought into Spark in version 0.7, along with the paper Discretized Streams: Fault-Tolerant Streaming Computation at Scale. At that time, Spark Streaming was quite rudimentary, with many functionalities missing, and was not stable enough for in-production use.
In version 1.0, Spark Streaming refactored the receiver and input stream parts to make them more stable and extensible for future use and user extension.
Added a UI for Spark Streaming to better monitor its running status.
Added Python API support for Spark Streaming.
Spark Streaming's original implementation suffered from data loss when the driver or an executor was lost. This patch introduced a write-ahead log (WAL) mechanism to make sure no data is lost when the driver fails. The basic implementation writes received data to HDFS, as reliable storage, as soon as it arrives from the external source.
The WAL mechanism brought additional overhead for the Kafka input stream: since the data is already persisted in Kafka, there is no need to write it to HDFS again. This patch provides another, receiver-less way of fetching data from Kafka in Spark Streaming.
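A sketch of the direct (receiver-less) Kafka API introduced in Spark 1.3; the broker list and topic name are placeholders, and ssc is an existing StreamingContext:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// No receiver and no WAL: each batch reads an exact offset range directly
// from the Kafka brokers, and offsets are tracked by Spark Streaming itself.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("pageviews") // placeholder topic
val lines = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
  .map(_._2)
```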
Provided a better UI visualization for Spark Streaming.
Added a back pressure mechanism to Spark Streaming for better flow control.
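Back pressure is enabled through configuration; a minimal sketch (the initial rate value is illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Let the ingestion rate adapt to the observed processing rate.
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional cap on the rate before the controller has seen enough batches.
  .set("spark.streaming.backpressure.initialRate", "1000")
```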
Adoption of Spark Streaming
Spark Streaming in Netflix: Spark Streaming Resiliency.
Spark Streaming in Alibaba: Dynamic Community Detection for Large-scale e-Commerce data with Spark Streaming and GraphX
Compared to its early versions, Spark Streaming's stability and maturity have improved a lot. Now more and more companies use Spark Streaming in their production environments as a replacement for Storm in some scenarios.