Data By the Bay is the first Data Grid conference matrix with 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.
In the past year, Google has open sourced or released two products, Apache Beam and Cloud Dataflow, that promise to change how we write data pipelines.
Beam is an open source, portable job description framework incubating at the Apache Foundation. It unifies batch and stream processing in a single model available in Java, Python, and Scala. It supports running on popular execution engines like Spark, Flink, and Google Cloud Dataflow, giving users flexibility in where they run and eliminating the need to rewrite pipelines. One of these execution engines, Dataflow, is a fully managed cloud service that (like BigQuery) lets users simply submit code and get results; Google handles autoscaling, straggler avoidance, and monitoring.
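To make that portability concrete, here is a minimal sketch of a Beam pipeline in Java. The class name and bucket paths are hypothetical, and the input is assumed to be a bounded (batch) source; the point is that the pipeline code itself is engine-agnostic, with the runner chosen only at launch time.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.ToString;

public class LineCount {
  public static void main(String[] args) {
    // The execution engine is selected on the command line, e.g.
    //   --runner=DirectRunner | SparkRunner | FlinkRunner | DataflowRunner
    // No pipeline code changes when switching engines.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("gs://example-bucket/input-*.txt"))
     .apply("CountLines", Count.globally())
     .apply("Format", ToString.elements())
     .apply("WriteResult", TextIO.write().to("gs://example-bucket/line-count"));

    p.run().waitUntilFinish();
  }
}
```

The same binary could be pointed at an on-prem Spark or Flink cluster or at the Dataflow service purely by changing the launch flags.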
In this talk we'll explore Beam's event-time semantics, including windows, sessions, and triggers. Eric will also demonstrate a single Beam job running in both batch and streaming modes, deployed both on an on-prem cluster and in the cloud, with no code changes.
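As a taste of those event-time semantics, here is a hedged sketch, not code from the talk: the SessionScores class and its input collection are assumptions. It groups per-user scores into session windows closed by a 10-minute gap of inactivity, emits speculative early results every minute, fires again when the watermark passes the end of the window, and re-fires for each late element arriving up to 30 minutes behind the watermark.

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class SessionScores {
  // Sums scores per user within session windows, with early, on-time,
  // and late trigger firings. `events` is assumed to carry event-time
  // timestamps (e.g. assigned by the source or WithTimestamps).
  static PCollection<KV<String, Integer>> sessionTotals(
      PCollection<KV<String, Integer>> events) {
    return events
        .apply(Window.<KV<String, Integer>>into(
                Sessions.withGapDuration(Duration.standardMinutes(10)))
            .triggering(AfterWatermark.pastEndOfWindow()
                // Speculative results one minute after the first element.
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                // Re-fire for every late element within allowed lateness.
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardMinutes(30))
            // Each firing includes all data seen so far, not just the delta.
            .accumulatingFiredPanes())
        .apply(Sum.integersPerKey());
  }
}
```

Because windowing, triggering, and lateness are expressed in the model rather than in any one engine's API, this same logic applies whether the job runs in batch or streaming mode.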