Data By the Bay is the first Data Grid conference matrix with 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.
Processing genomics data efficiently today means working at scale, using advanced machine learning methods, and developing models interactively. The convergence of technologies this requires is now a reality, and we present it here. The foundation is ADAM, a Spark library for genomics developed at the AMPLab, which provides the data representation and APIs needed to apply distributed computing to genomics data. The development tool is the Spark Notebook, which offers an interactive interface for running code. Its integration with scalable machine learning libraries and ADAM lets us work interactively on data from a single environment, at scale, with advanced modelling methods. We demonstrate examples of genomics data processing, for instance on 1000 Genomes data, going from simple data manipulation to descriptive statistics and on to more complex population stratification with deep learning.
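As a rough illustration of the kind of descriptive statistic mentioned above, the sketch below computes per-variant alternate allele frequencies from a handful of genotype calls. It is plain Python on made-up toy records, not ADAM's API or the code shown in the talk: the record layout, the variant and sample names, and the function `alt_allele_frequencies` are all assumptions for illustration, and samples are assumed to be diploid. On Spark the same computation would typically be a group-by aggregation over a distributed collection of genotypes.

```python
from collections import defaultdict

# Hypothetical toy records: (variant_id, sample_id, alt_allele_count).
# alt_allele_count is 0, 1, or 2 for a diploid sample (assumption).
genotypes = [
    ("rs1", "s1", 0), ("rs1", "s2", 1), ("rs1", "s3", 2),
    ("rs2", "s1", 0), ("rs2", "s2", 0), ("rs2", "s3", 1),
]


def alt_allele_frequencies(records):
    """Return {variant_id: alternate allele frequency} for diploid calls."""
    alt_alleles = defaultdict(int)    # alternate alleles seen per variant
    total_alleles = defaultdict(int)  # all alleles seen per variant
    for variant_id, _sample_id, alt_count in records:
        alt_alleles[variant_id] += alt_count
        total_alleles[variant_id] += 2  # two alleles per diploid sample
    return {v: alt_alleles[v] / total_alleles[v] for v in total_alleles}


freqs = alt_allele_frequencies(genotypes)
for variant_id, freq in sorted(freqs.items()):
    print(f"{variant_id}: {freq:.3f}")
```

Grouping by variant and summing allele counts is the part that parallelises naturally, which is why this kind of statistic is a common first step before heavier analyses such as population stratification.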