Data By the Bay is the first Data Grid conference matrix with 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.
Thursday, May 19 • 10:40am - 11:00am
Analyzing Massive Time Series Data with Spark

Want to build models over data every second from millions of sensors? Dig into the histories of millions of financial instruments? In this talk, we'll discuss the unique challenges in time series data, and how to work with it at scale:

* What distinguishes time series data from other datasets?
* What are the common operations that we wish to apply to it?
* What are the different ways to lay out time series data in memory, and which analysis tasks is each layout well suited for?
* What are popular applications for time series analysis?

We'll then introduce the open source Spark-TS library. Built atop Apache Spark, the library provides an intuitive Scala and Python API for munging, manipulating, and modeling time series data in a massively parallel manner.
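To make the question about memory layouts concrete, here is a minimal sketch in plain Python. It does not use Spark or the Spark-TS API. The sample data, the function names `to_instants` and `to_series`, and the layout names are assumptions chosen for illustration, not taken from the talk. It starts from a flat list of (timestamp, key, value) observations and regroups it two ways: one record per instant, and one record per series.

```python
from collections import defaultdict

# Hypothetical sample data: flat (timestamp, series key, value) observations.
observations = [
    ("2016-05-19T10:40:00", "sensor_a", 1.0),
    ("2016-05-19T10:40:00", "sensor_b", 5.0),
    ("2016-05-19T10:40:01", "sensor_a", 1.5),
    ("2016-05-19T10:40:01", "sensor_b", 4.0),
]


def to_instants(obs):
    """Group by timestamp: one record per instant, holding every key's value."""
    instants = defaultdict(dict)
    for ts, key, value in obs:
        instants[ts][key] = value
    # ISO 8601 timestamps sort chronologically as plain strings.
    return dict(sorted(instants.items()))


def to_series(obs):
    """Group by key: one record per series, holding its values in time order."""
    series = defaultdict(list)
    for ts, key, value in sorted(obs):
        series[key].append(value)
    return dict(series)


instants = to_instants(observations)
series = to_series(observations)
```

Under these assumptions, the per-series layout keeps each key's full history together, which tends to suit fitting a model to each series independently, while the per-instant layout keeps all values for one timestamp together, which tends to suit cross-sectional operations across keys.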

Sandy Ryza

Senior Data Scientist, Clover Health
Data Science, Apache Spark, Time Series Data, Distributed Computation

