Data By the Bay is the first "Data Grid" conference: a matrix of six vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.
In this talk, we will start with the internals of Spark Streaming and explain how user code is translated and executed by the Spark Streaming engine. Building on these internals, we will then walk through best practices for managing state efficiently, joining streams with historical datasets, and achieving high throughput while receiving, processing, and writing data. This should help you develop and tune your streaming applications properly and avoid the common pitfalls.