Streaming and Batch Data Pipelines

Building Big Data Solutions requires significant investigative work before implementation. This due diligence lets businesses implement these solutions effectively and efficiently: the proper build for the application at the right cost. One decision buyers face is whether a given application needs a streaming or a batch data pipeline, and what trade-offs each brings. Most businesses that use Big Data Solutions to augment decision making in more than one process will end up with both types, so the choice should not be framed as one versus the other.

Streaming and Batch Analytics in Brief

Differentiating between streaming and batch data pipelines isn’t particularly difficult. I typically liken it to floodgates on a dam, with the water being your data and the dam itself your ETL system. In “streaming” analytics the gates close for very short intervals, if at all. (Hence the quotation marks: most “streaming” analytics platforms actually process very small batches of data, although a few pure-streaming ones exist as well.) This means that the time between data arriving in your data warehouse, getting processed, and being digested by end users is short enough to count as near real-time.

Alternatively, batch analytics allows the water/data to build up behind the floodgates/ETL system, with the gates opening intermittently to let data flow into the solution platform. Data is processed in “buckets” of water rather than as a continuous flow from a hose. This means the outcomes of the Big Data Solution typically lag behind the data rather than arriving in real time.
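To make the floodgate analogy concrete, here is a minimal Python sketch. The function names, the sample readings, and the averaging step are all hypothetical and chosen only for illustration; the point is that the same code behaves like "streaming" or like batch depending on how often the gate opens.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def open_gate(records: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield consecutive groups of at most `size` records (one gate opening each)."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process(chunk: List[float]) -> float:
    """Hypothetical processing step: average the values in one chunk."""
    return sum(chunk) / len(chunk)


# Made-up sensor readings, in arrival order.
readings = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# "Streaming" as micro-batches: a result is available after every 2 records.
streaming_results = [process(c) for c in open_gate(readings, size=2)]

# Batch: the gate opens once, after all 6 records have built up.
batch_results = [process(c) for c in open_gate(readings, size=len(readings))]

print(streaming_results)
print(batch_results)
```

The micro-batch run produces three results along the way, while the batch run produces a single result only after all of the data has accumulated. That difference in waiting time is the lag described above.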

This is a very simplified explanation of how the two differ, but it covers the main concepts. A variety of systems and tools exist to process each kind of data stream, but their application can cause confusion even among Big Data Solution practitioners. Apache Spark and Hadoop are sometimes seen as “streaming” and “batch” processing respectively, but this isn’t entirely the case. Hadoop is a full-fledged environment with MapReduce as its batch computation system, while Spark (more specifically Spark Streaming) is really a micro-batch, near real-time computation system that can run within a Hadoop environment. True real-time data processing is the promise of the Apache Storm computation system, while Flink and Beam (also Apache projects) offer configurations for both types of computation needs.

Considerations

While a single article can’t cover every possible business application when deciding between streaming and batch computation, here are a few of the more important considerations, each with far-reaching implications:

  • Data Resiliency - Data has a shelf life before it becomes less valuable (or even becomes “bad”). How long that shelf life lasts is highly specific to application, industry, and business model. For example, apps that recommend public transit routes or analyze your current fitness routine need data processed in the moment, or at least close to it. Data collected in these applications becomes old quickly, which undermines the value of your offering to the end user. Clearly, these examples are more likely to need streaming computation to be effective, processing individual data points as they arrive.
  • Model Resiliency - Big Data Solutions rely on statistical and/or predictive models to derive value from the data they ingest. However, today’s model won’t necessarily be the best tomorrow, as models need maintenance to perform acceptably over time. How long a model keeps performing has direct implications for the question of streaming or batch computation. Batch computation allows larger “buckets” of data to be collected for model prototyping prior to deployment. Streaming computation, on the other hand, is complicated by the fact that data is still being collected at the same moment it is being used as model input. Lambda Architecture provides a means of processing in both batch and streaming (see both Apache Spark and Beam applications), but ultimately it comes down to domain knowledge (if not a “test period”) to determine how long a model will perform appropriately.
  • Decision Horizon - You’ll recognize a common factor linking these three considerations: time. This shouldn’t be surprising, as the very nature of batch and streaming computation is based on time. Big Data Solutions can serve internal decision makers and external consumers, and understanding how long their decision making process takes influences the streaming/batch decision. Using the previous example of a public transit app, consumers’ decisions on whether to take “Bus #5” or “Train #12” to get to their destination must happen in the moment (streaming). On the other hand, a Big Data Solution for a market segmentation model is not held to the same quick decision time span: models can be prototyped and tested without the need to “get out the door” immediately.

Manufacturing: Case Study

To highlight these considerations, consider a manufacturing client whose defect detection process was designed to measure each unit at a certain stage in assembly. These measurements were ultimately performed by engineers who manually reported incidents to management located off-site. Approval was needed for re-calibration, and this caused delays, which in manufacturing means dollars lost.

While something could be said of the approval process itself, two of the three considerations above came into play: Data Resiliency and Decision Horizon. Because reporting was manual, the data was no longer current by the time defects were reported, and the scale of the detected defect(s) could change in the meantime, requiring additional reporting to management before the initial issue had been resolved.

To address this, a combination of streaming and batch processing was implemented. An initial statistical model for defect detection was constructed from a point-in-time cross-section of measurement data. Units passing through the line were scored against the model to monitor violations of the upper/lower control limits and flagged in near real-time (streaming). The same measurements were then added to a (batch) process for updating the model in future iterations.
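The blend of streaming scoring and batch model updates can be sketched roughly as follows. This is not the client's implementation: the measurement values, the mean ± 3 standard deviation limits, and the choice to leave flagged units out of the refit are all assumptions made for the sake of a small, runnable example.

```python
from statistics import mean, stdev
from typing import List, Tuple

Limits = Tuple[float, float]


def fit_limits(measurements: List[float], k: float = 3.0) -> Limits:
    """Batch step: derive lower/upper control limits as mean +/- k standard deviations.

    k = 3.0 is an assumed convention, not a value taken from the case study.
    """
    center = mean(measurements)
    spread = stdev(measurements)
    return center - k * spread, center + k * spread


def is_defect(value: float, limits: Limits) -> bool:
    """Streaming step: score a single unit against the current limits."""
    lower, upper = limits
    return value < lower or value > upper


# Point-in-time cross-section used to build the initial model (made-up values).
history = [10.0, 10.1, 9.9, 10.2, 9.8]
limits = fit_limits(history)

# Units arriving on the line are scored one at a time, in near real-time.
incoming = [10.05, 10.9, 9.95]
flags = [is_defect(value, limits) for value in incoming]

# Afterwards, a batch job folds the new measurements into the history and refits.
# Assumption: flagged units are excluded so that defects do not widen the limits.
history.extend(v for v, flagged in zip(incoming, flags) if not flagged)
limits = fit_limits(history)

print(flags)
print(limits)
```

How often the refit runs is the batch window. Scoring happens per unit regardless, so a defect is flagged as soon as the measurement arrives rather than when the next report is filed.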

How did this improve their operations? By blending streaming and batch processing they were able to take advantage of data as quickly as it became available: with systems now connecting the process to internal data storage, management could monitor for defects on their own rather than relying on engineers. They could also adjust the batch processing window for different stages in the manufacturing line, since some stages required lengthier calibration periods than others. This, in turn, optimized the Decision Horizon time frame so that downtime was minimized each time defects were measured.

The Right Tool for the Right Situation

These considerations, again, are not collectively exhaustive. In the above case study, while the client improved its defect detection capability, the solution also added complexity “behind the scenes”: it required monitoring not just for manufacturing defects but also for system faults.

Determining the right situation for streaming and batch processing is unique to each business and, further, to the specific application. Most businesses implementing Big Data Solutions in more than one operational area will need to assess these individually and as they fit within the greater ecosystem of their Big Data Solutions.

Ultimately, businesses need to consider ROI and this comes down to the most important aspect of Big Data Solution implementation: understanding the business need.

John Sukup