Data By the Bay is the first Data Grid conference: a matrix of 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.
80-90% of data science work is data cleaning and feature engineering. However, if we counted data science tools by what they are for, we would find that most innovation happens in data infrastructure and modeling. We want to change that and make data scientists much more productive while also improving the quality of their work. In this talk I will describe the machine learning platform we wrote on top of Spark to modularize the data cleaning and feature engineering steps. Modular components are easy to reuse, which simplifies building models and changing them later. The framework simplifies the data preparation and feature building stages with reusable classes for each data source, so that subsequent feature generation takes only a few lines of code.
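
As a rough illustration of the "reusable class per data source" idea, here is a minimal sketch in plain Python rather than Spark. The names `DataSource`, `OrdersSource`, and `total_spend` are hypothetical and are not taken from the platform described in the talk; the hard-coded records stand in for a real table.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


class DataSource:
    """Hypothetical base class: one subclass per raw data source."""

    def load(self) -> List[Row]:
        """Return raw records; subclasses must implement this."""
        raise NotImplementedError

    def clean(self, rows: List[Row]) -> List[Row]:
        """Default cleaning step is a no-op; subclasses may override it."""
        return rows

    def rows(self) -> List[Row]:
        """Load, then clean, so callers only ever see cleaned records."""
        return self.clean(self.load())


class OrdersSource(DataSource):
    """Illustrative source with in-memory records (an assumption for this sketch)."""

    def load(self) -> List[Row]:
        return [
            {"user": "a", "amount": 10.0},
            {"user": "a", "amount": None},  # dirty record with a missing amount
            {"user": "b", "amount": 5.0},
            {"user": "a", "amount": 2.5},
        ]

    def clean(self, rows: List[Row]) -> List[Row]:
        # Drop records whose amount is missing.
        return [r for r in rows if r["amount"] is not None]


def total_spend(source: DataSource) -> Dict[str, float]:
    """Feature: total amount per user, computed from any cleaned source."""
    totals: Dict[str, float] = {}
    for r in source.rows():
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals


# With the source class in place, building the feature is one line.
features = total_spend(OrdersSource())
print(features)
```

The point of the design is that loading and cleaning live with the source, so each new feature only has to describe its own aggregation.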