Learning Apache Apex
上QQ阅读APP看书,第一时间看更新

Stream processing systems

The first open source stream processing framework in the big data ecosystem was Apache Storm. Since then, several other Apache projects for stream processing have emerged. Next-generation streaming first architectures such as Apache Apex and Apache Flink come with stronger capabilities and are more broadly applicable. They are not only able to process data with low latency, but also provide for state management (for data that an operation may require across inpidual events), strong processing guarantees (correctness), fault tolerance, scalability, and high performance.

Users can now also expect such frameworks to come with comprehensive libraries of connectors, other building blocks and APIs that make development of non-trivial streaming applications productive and allow for predictable project implementation cycles. Equally importantly, next-generation frameworks should cater to aspects such as operability, security, and the ability to run on shared infrastructure (multi-tenancy) to satisfy DevOps requirements for successful production launch and uptime.

Streaming can do it all!

Limitations of early stream processing systems lead to the so-called Lambda Architecture, essentially a parallel setup of stream and batch processing path to obtain fast but potentially unreliable results through the stream processor and, in parallel, correct but slow results through a batch processing system like Apache Hadoop MapReduce:

The fast processing path in the preceding diagram can potentially produce incorrect results, hence the need to re-compute the same results in an alternate batch processing path. Correctness issues are caused by previous technical limitations of stream processing, not by the paradigm itself. For example, if events are processed multiple times or lost, it leads to double or under counting, which would be a problem for an application that relies on accurate results, for example, in the financial sector.

This setup requires the same functionality to be implemented with two different frameworks, as well as extra infrastructure and operational skills, and therefore, results in longer time to production and higher Total Cost of Ownership (TOC). With recent advances in stream processing, Lambda Architecture is no longer necessary. Instead, a unified streaming architecture can be used for reliable processing in a much more TOC effective solution.

This approach based on a single system was outlined in 2014 as Kappa Architecture, and today there are several stream processing technology options, including Apache Apex, that support batch as a special case of streaming.

To know more about the Kappa Architecture, please refer to following link: https://www.oreilly.com/ideas/questioning-the-lambda-architecture.

These newer systems are fault-tolerant, produce correct results, can achieve low latency as well as high throughput, and provide options for enterprise-grade operability and support. Potential users are no longer confronted with the shortcomings that previously justified a parallel batch processing system. We will later see how Apache Apex ensures correct processing, including its support for exactly-once processing.