Data By the Bay is the first Data Grid conference matrix with 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.
Genomics and health data come in large volumes, are usually distributed across remote data centers, and are subject to many constraints related to privacy and confidentiality. Scalability is required at two levels. Within a single data center, distributed computing technologies such as Apache Spark, scalable machine learning libraries, and distributed databases are a good match. Across data centers, the scheme for sharing data and data processing methods must be guided by interoperability standards. The Global Alliance for Genomics and Health (GA4GH) is defining such a standard. We present an implementation of a GA4GH server that uses distributed computing and databases as its back-end engine, thereby providing a scalable reference implementation. We also show how to extend the GA4GH server with new functionality, such as requesting the estimation of machine learning models and obtaining predictions from these models. Finally, using the Spark Notebook as an interactive tool, we show how to generate a client for the GA4GH server and how to execute methods on the server.
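To give a rough idea of what a client call against such a server involves, here is a minimal sketch of a paginated variant search. It is not the implementation presented in the talk. The path `/variants/search` and the field names (`variantSetIds`, `referenceName`, `pageToken`, `nextPageToken`) are assumed from early GA4GH API drafts and may differ between versions; `search_variants`, `fake_post`, and the sample data are hypothetical names invented for illustration, and the transport is an in-memory stand-in so the example runs without a server.

```python
from typing import Callable, Dict, Iterator

# Hypothetical transport: takes a request path and a JSON-like body,
# returns the decoded JSON response. A real client would POST over HTTP.
Post = Callable[[str, Dict], Dict]


def search_variants(post: Post, variant_set_id: str, reference_name: str,
                    start: int, end: int, page_size: int = 100) -> Iterator[Dict]:
    """Yield variants page by page, following the server's page tokens."""
    page_token = None
    while True:
        # Field names are assumed from early GA4GH API drafts.
        body = {
            "variantSetIds": [variant_set_id],
            "referenceName": reference_name,
            "start": start,
            "end": end,
            "pageSize": page_size,
            "pageToken": page_token,
        }
        response = post("/variants/search", body)
        for variant in response.get("variants", []):
            yield variant
        page_token = response.get("nextPageToken")
        if not page_token:
            break  # no further pages


def fake_post(path: str, body: Dict) -> Dict:
    """In-memory stand-in for a server, returning two made-up pages."""
    pages = {
        None: {"variants": [{"id": "v1"}, {"id": "v2"}], "nextPageToken": "p2"},
        "p2": {"variants": [{"id": "v3"}], "nextPageToken": None},
    }
    return pages[body["pageToken"]]


ids = [v["id"] for v in search_variants(fake_post, "vs1", "1", 0, 1000)]
print(ids)
```

Passing the transport in as a function keeps the paging logic independent of any particular HTTP library, so the same loop could be pointed at a real endpoint by swapping `fake_post` for a function that performs the actual POST.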