Data By the Bay has ended
Data By the Bay is the first Data Grid conference: a matrix of six vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.


Life
Friday, May 20
 

9:00am

Why you can't afford to ignore deep learning
Few involved in life sciences have a background in the neural network fundamentals that underpin deep learning, and that can make it quite overwhelming when trying to learn about this topic for the first time. And it doesn't help that there's so much breathless hype around, which makes it hard to know what opportunities are real, and what is just marketing.

I came to deep learning applications in the life sciences from the other direction - although I've been working with neural networks for around 25 years, I only started looking at life sciences applications in the last couple of years. The data in life sciences tends to be challenging to work with - generally unstructured (genome sequences, natural language text, imaging, sound, etc.) and often quite large. This kind of data turns out to be where deep learning really shines. In fact, many classic approaches to data analysis in the life sciences are either in the process of being, or are about to be, totally transformed by deep learning.

In this talk, I'll describe what deep learning can do, and give some examples of how it can be applied, with a particular focus on medical applications. I'll also provide some suggested places to learn more, so if I'm successful in my goal to convince you that you can't afford to ignore deep learning, you'll know where to look next.

Speakers

Jeremy Howard

CEO, Enlitic
Jeremy Howard is a serial entrepreneur, business strategist, developer, and educator. He is the CEO of Enlitic, a startup he founded to use recent advances in machine learning to transform the practice of medicine, and bring modern medical diagnostics to billions of people in the... Read More →


Friday May 20, 2016 9:00am - 9:40am
Gardner

10:40am

Driver: How Data Science and Engineering Can Extend the Lives of Cancer Patients
TBA

Speakers

Petros Giannikopoulos

President & Cofounder, Driver
I am one of the co-founders of Driver, a start-up in San Francisco that is empowering cancer patients to gain access to new therapies through an innovative, patient-facing, genomics platform. Our team's mission is simple: to accelerate drug development through radical patient engagement... Read More →


Friday May 20, 2016 10:40am - 11:00am
Gardner

11:10am

A Scalable GA4GH Server Implementation
Genomics and health-related data means large amounts of data, usually distributed across remote data centers, with many constraints related to privacy and confidentiality. Scalability is required at two levels. First, within a single data center - and here, distributed computing technologies like Apache Spark, scalable machine learning libraries, and distributed databases are a good match. At the inter-data-center level, the scheme for sharing data and data processing methods must be guided by interoperability standards. The Global Alliance for Genomics and Health (GA4GH) is defining such a standard. We present here an implementation of a GA4GH server using distributed computing and databases as the back-end engine, providing a scalable reference implementation. We also show how to extend the GA4GH server with new functionality, such as requesting model estimation (machine learning) and predictions from these models. We then show, with the Spark Notebook as an interactive tool, how to generate a client for the GA4GH server and how to execute methods on the server.
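To make the interoperability point concrete, a GA4GH-style variant lookup is a JSON POST against a /variants/search endpoint. The sketch below only assembles the request body; the field names follow the GA4GH schemas, while the variant-set ID is a hypothetical placeholder:

```python
import json

def build_variant_search(reference_name, start, end, variant_set_id, page_size=100):
    """Build the JSON body for a GA4GH-style POST /variants/search request."""
    return {
        "referenceName": reference_name,
        "start": start,
        "end": end,
        "variantSetIds": [variant_set_id],
        "pageSize": page_size,
    }

# Request variants on chromosome 1 in the 10-20 kbp window.
body = build_variant_search("1", 10_000, 20_000, "1kg-phase3")
print(json.dumps(body))
```

A real client would POST this body to the server's /variants/search URL and page through results using the returned page token.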

Speakers

Andy Petrella

Cofounder, Data Fellas
Creator of Spark Notebook


Friday May 20, 2016 11:10am - 11:30am
Ada

11:10am

What Healthcare Can Learn from Netflix: Personalizing and Optimizing Preventive Care
There is widespread agreement that the most effective way to combat conditions like diabetes and heart disease is through a preventive approach centered on lifestyle and behavior change. But behavior change is hard, and to date, even the most effective in-person programs have included little (or no) personalization. Omada is changing that. Our team leverages open data to predict chronic health risk, and uses the power of analytics, machine learning, and experimentation to customize our interactive Prevent program for each individual. We’ve built experimentation directly into the product, using vast amounts of observational data to create a system that is continually improving to maximize participant outcomes and chronic disease risk reduction. Omada is reinventing the approach to preventive behavioral health research, its efficacy, and its application. In the process, we are unlocking a scalable, clinically-effective, and customizable approach to combating the chronic disease epidemic. Attendees will learn how analysis, machine learning, and experimentation can determine the best approach to motivate lasting behavior change in hard-to-reach populations – and how secondary data collected can be mined for unexpected value and insight. Additionally, they will learn, through examples, the mistakes made and lessons learned in implementing an in-house clinical trial management system at a quickly growing company. Finally, attendees will learn how a customized, data-driven approach to chronic disease prevention has implications for the biggest public health challenge in America today.

Speakers

Eric Williams

Director of Data Science, Omada Health
Eric Williams is Director of Data Science at Omada Health. As an undergraduate Eric studied physics at UC Berkeley, later receiving his PhD from Columbia University in Particle Physics. He was a Researcher at CERN for several years and a Postdoctoral Scholar at Memorial Sloan-Kettering... Read More →



Friday May 20, 2016 11:10am - 11:30am
Gardner

11:40am

Interactive Machine Learning on Genomics Data with the Spark Notebook
Processing genomics data efficiently nowadays implies being able to work at scale, to use advanced machine learning methods, and to develop models interactively. The required convergence of technologies is a reality and is presented here. The edifice builds from ADAM, a Spark library for genomics developed at the AMPLab, providing the right data representation and APIs for applying distributed computing to genomics data. The development tool is the Spark Notebook, giving an interactive interface for code execution. Its integration with scalable machine learning libraries and ADAM allows us to work interactively on data from a single environment, at scale, with advanced modelling methods. We demonstrate some examples of genomics data processing, e.g. on 1000 Genomes data, going from simple data manipulation to descriptive statistics and more complex population stratification with deep learning.
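The descriptive statistics mentioned above come down, per variant site, to computations like alternate-allele frequency. A plain-Python sketch of that computation, assuming the usual VCF-style 0/1 allele coding (the talk computes this at scale with ADAM on Spark rather than in pure Python):

```python
from collections import Counter

# Genotypes per sample for one variant site, coded as allele pairs
# (0 = reference allele, 1 = alternate allele), as in VCF-style data.
genotypes = [(0, 0), (0, 1), (1, 1), (0, 0), (0, 1)]

def alt_allele_frequency(genotypes):
    """Fraction of observed alleles that are the alternate allele."""
    counts = Counter(a for pair in genotypes for a in pair)
    total = sum(counts.values())
    return counts[1] / total

print(alt_allele_frequency(genotypes))  # 0.4
```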

Speakers

Andy Petrella

Cofounder, Data Fellas
Creator of Spark Notebook


Friday May 20, 2016 11:40am - 12:00pm
Ada

11:40am

Real-time diagnostics for the masses
At Echo Labs, we have incorporated the full monitoring capabilities of a modern emergency room into a small, noninvasive, optical sensor. With this technology we have captured real-time physiological data from hundreds of patients during their everyday lives. Why is this data important? How can we analyze these vast amounts of data to predict which of the monitored patients needs medical attention? In this lecture I will cover the challenges of applying techniques such as deep learning to physiological data, and of incorporating centuries of medical information and know-how into automatic diagnostic algorithms. I will show examples and results from some of our research over the last three years.

Speakers

Elad Ferber

Co-founder and CTO, Echo Labs


Friday May 20, 2016 11:40am - 12:00pm
Gardner

11:40am

Translating a Trillion Points of Data into Therapies, Diagnostics, and New Insights into Disease
There is an urgent need to take what we have learned in our new “genome era” and use it to create a new system of precision medicine, delivering the best preventative or therapeutic intervention at the right time, for the right patients. Dr. Butte's lab at the University of California, San Francisco builds and applies tools that convert trillions of points of molecular, clinical, and epidemiological data -- measured by researchers and clinicians over the past decade and now commonly termed “big data” -- into diagnostics, therapeutics, and new insights into disease. Several of these methods or findings have been spun out into new biotechnology companies. Dr. Butte, a computer scientist and pediatrician, will highlight his lab’s recent work, including the use of publicly-available molecular measurements to find new uses for drugs, including new therapies for autoimmune diseases and cancer; the discovery of new diagnostics, including blood tests for complications during pregnancy; and how the next generation of biotech companies might even start in your garage.

Speakers

Atul Butte

Director, Institute for Computational Health Sciences, UCSF
Atul Butte, MD, PhD is the inaugural Director of the Institute of Computational Health Sciences (ICHS) at the University of California, San Francisco, and a Professor of Pediatrics.  Dr. Butte is also the Executive Director for Clinical Informatics across the six University of California... Read More →


Friday May 20, 2016 11:40am - 12:00pm
Markov

12:00pm

Challenges in 3D Volumetric Data Sets
3Scan is an automated histology company. Our core technical offering is a 3D imaging microscope, which generates volumetric image data. Each microscope can create several terabyte-scale datasets per day; we run three in production and will have close to ten by the end of the year. In this talk I will address some of the challenges and issues surrounding data collection and analysis at this scale. Relevant tools include Python, Apache Spark, EC2, and Meteor.

Speakers

Todd Huffman

CEO, 3Scan



Friday May 20, 2016 12:00pm - 12:20pm
Ada

12:00pm

Ginger.io: a Data-Driven Mental Health Care Provider
Speakers

Sai Moturu

Head of Data Science, Ginger.io
Data science, machine learning, healthcare, mental health



Friday May 20, 2016 12:00pm - 12:20pm
Gardner

1:10pm

Cancer Screening using Deep Learning
Sad but true: most of radiology is mind-numbing tedium. Radiologists review hundreds of mammograms searching for tiny lesions; they meticulously draw contours around the heart in cardiac MRIs to measure volumes; they rely on manual checklists and decision trees to characterize liver disease. These are repetitive, boring tasks. They take a huge amount of time, leading to large medical bills, and they cause radiologist fatigue, resulting in frequent errors or inconsistencies. Automated decision support, in which all of these tedious tasks are automated by computerized algorithms, is the holy grail of radiological interpretation. Using the latest deep learning technology in an intelligent cloud platform, Arterys is bringing radiological decision support to hospitals worldwide. We describe one example of our technology, detecting lung nodules in the openly available LIDC lung cancer imaging data set, including our data processing and deep learning strategies.

Speakers

Daniel Golden

Senior Image Scientist, Arterys, Inc.
Dan is the Director of Machine Learning at Arterys, a startup focused on streamlining the practice of medical image interpretation and post-processing. After receiving a PhD in Electrical Engineering from Stanford, he stuck around for a postdoc, focusing on using machine learning... Read More →


Friday May 20, 2016 1:10pm - 1:30pm
Gardner

1:40pm

With Big Data Comes Big Responsibility
As the cost of genetic sequencing has fallen, the rate of data generation is outpacing our resources to analyze it. While a future in which we use low-level biological data about ourselves to inform our medical choices is inevitable, getting there will not happen by default. Current attitudes towards software in the life science and medical space are rooted in academia, but need to shift if we are to make true precision medicine a reality. The job will fall on us software engineers to develop high-quality open-source tools, build communities to support them, and transition organizations from siloed datacenters to cloud environments. We will compare bioinformatics to cryptography, a field that has successfully leveraged open-source technology to make the Internet a safer place. We will also explore specific examples of APIs and libraries that are beginning to enable this shift and are already providing benefits to their users.

Speakers

Nish Bhat

Founding Engineer, Color Genomics



Friday May 20, 2016 1:40pm - 2:00pm
Gardner

2:10pm

The UMLS - Authoritative biomedical concept names in context

[Revised 05/16/16]  A concept is a unit of thought. Chances are, any biomedical concept that is represented in your data has been named by some authority. Your tax dollars pay for these names to be collected, maintained, and represented in a homogeneous, tool-supported context called the UMLS (Unified Medical Language System).

The latter consists of three knowledge sources - the Metathesaurus, a Semantic Network, and a Lexicon - and accompanying tools. The UMLS was created and is maintained by the U.S. National Library of Medicine (NLM), part of the National Institutes of Health (NIH). The 2016AA release of the Metathesaurus contains more than 3.25 million concepts and 13 million unique concept names from over 197 source vocabularies expressed in 25 different languages. Many of these vocabularies include translations into the world's major languages. Because it contains a mixture of public and proprietary content, use of the UMLS requires a license, available free of charge from the NLM.

Tools are included to assist with browsing, downloading, subsetting, and representing the UMLS in existing databases. Additional tools support inter-source linking, and finding concepts in text. While not for the faint of heart, these resources are widely used around the world. Tutorial videos are available on the NLM UMLS web site.

Important and widely used vocabularies in the UMLS include those naming diseases, lab tests, procedures, medications, chemicals, organisms, anatomic structures, and genes, collected from both research and care. Several of these vocabularies are part of the standards specified for use in U.S. Electronic Health Records. Internet connectivity permitting, audience members will be challenged to "stump the Metathesaurus" - that is, to name an important biomedical concept that cannot be found there. This exercise will illustrate why the UMLS should not be re-invented.
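For programmatic access, the NLM also exposes the Metathesaurus through the UTS REST API. The sketch below only constructs a search URL; the base URL and parameter names follow the API's documented form, but check the current NLM documentation before relying on them, and the API key shown is a placeholder:

```python
from urllib.parse import urlencode

# Base search endpoint of the NLM's UTS REST API (verify against the
# current NLM documentation; this follows its documented form).
BASE = "https://uts-ws.nlm.nih.gov/rest/search/current"

def umls_search_url(term, api_key):
    """Assemble a Metathesaurus concept-search URL for `term`."""
    return BASE + "?" + urlencode({"string": term, "apiKey": api_key})

print(umls_search_url("atrial fibrillation", "YOUR-UMLS-API-KEY"))
```

Fetching that URL (with a valid key from a free NLM license) returns matching concepts with their CUIs as JSON.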


Speakers

Brian Carlsen

Sr. Informatics Consultant, West Coast Informatics
I've spent most of my career developing enterprise terminology maintenance solutions for the healthcare industry. I've worked with governments, non-profits, and private-sector businesses to develop, maintain, publish, and implement healthcare terminologies and information models to solve... Read More →

Mark Samuel Tuttle

Board of Directors, Apelon
Taught computer science at UC Berkeley, and then Medical Information Science at UCSF. Co-founder of Lexical Technology, later merged with Onyx to form Apelon, from a UCSF project. Was initial external architect of National Library of Medicine Unified Medical Language System (UMLS... Read More →


Friday May 20, 2016 2:10pm - 2:50pm
Ada

3:00pm

Distributed Visualization for Genomic Analysis
Current genomics visualization tools are intended for a single-node environment and lack the computational resources to provide interactive speeds. The 1000 Genomes Project provides 1.6 terabytes of variant data and over 14 terabytes of alignment data. However, typical genomic visualizations materialize less than 10 kbp, approximately 3.3e-7% of the genome. Mango is a visualization browser that selectively materializes and organizes genomic data to provide fast in-memory queries. Mango materializes data from persistent storage as the user requests different regions of the genome. This data is efficiently partitioned and organized in memory using interval trees, which enables quick range queries over genomic data.
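The range-query idea can be sketched in a few lines. Mango uses interval trees; for brevity this toy store uses a sorted start index instead, which gives the same overlap semantics on a small example (the names are illustrative, not Mango's API):

```python
import bisect

class IntervalStore:
    """Toy in-memory store of genomic features, queried by range."""

    def __init__(self, features):
        # features: (start, end, payload) triples, half-open [start, end)
        self.features = sorted(features)
        self.starts = [f[0] for f in self.features]

    def query(self, start, end):
        """Return all features overlapping [start, end)."""
        # Any overlapping feature starts before `end`; scan those and
        # keep the ones that end after `start`.
        hi = bisect.bisect_left(self.starts, end)
        return [f for f in self.features[:hi] if f[1] > start]

store = IntervalStore([(100, 200, "geneA"), (150, 400, "geneB"), (500, 600, "geneC")])
print(store.query(210, 300))  # [(150, 400, 'geneB')]
```

An interval tree replaces the linear scan with logarithmic-time lookups, which is what makes ad hoc queries over terabytes of indexed features interactive.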

Speakers

Alyssa Morrow

Student Researcher, University of California-Berkeley
At UC Berkeley, the BDGenomics team is working to create scalable genomics preprocessing and analysis on top of Spark. I am currently working on a distributed genomic visualization tool that allows ad hoc querying on TB of genomic data.

Eric Tu

Graduate Student, UC Berkeley AMPLab
I'm a graduate student at UC Berkeley in the AMPLab, working on genomic visualizations built on top of Spark.



Friday May 20, 2016 3:00pm - 3:20pm
Gardner

4:00pm

Deep Learning, Heart Rate Sensors, and Strokes
Deep learning has shown breakthrough results in computer vision, speech recognition, and natural language processing, but as yet, the applications to medicine are few. Now that we've thoroughly honed our techniques to detect cats in YouTube videos, can we apply that same technology to save lives?

We'll report on a collaboration between a team of cardiologists at UCSF and machine learning engineers at Cardiogram, using sensor data from Apple Watch and other wearables to prevent strokes. About a quarter of strokes are caused by atrial fibrillation, the most common heart arrhythmia. In atrial fibrillation, electrical conduction in the heart becomes disorganized. The upper chambers may beat 300-600 times per minute. The lower chambers may beat at a normal rate, but irregularly. AF is treatable, but often asymptomatic—many people don't realize they have it—and if you can develop an algorithm to detect when a person has entered an episode of atrial fibrillation using heart rate time series, you can potentially prevent a stroke.

The talk will include a brief introduction to cardiac electrophysiology, a review of key techniques in deep learning, and then dive into how we're using convolutional autoencoders and semi-supervised sequence learning to detect anomalous patterns of heart rate variability. We'll include lots of example data to build up intuition, and code examples using TensorFlow. Depending on the length of the talk, the algorithmic techniques covered will likely include: convolutional autoencoders for dimensionality reduction, long short-term memory (LSTM), and semi-supervised sequence learning. (If this is a 20-minute talk, we'll focus on just one of those techniques.)
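The anomaly-scoring idea behind the autoencoder approach can be illustrated without a deep learning framework: fit a compressive reconstruction on normal windows, then score new windows by reconstruction error. The toy below uses a linear reconstruction via SVD as a stand-in for the convolutional autoencoder described in the talk, on synthetic heart-rate windows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy heart-rate data: 200 "normal" windows of 32 samples around 70 bpm.
normal = 70 + 3 * rng.standard_normal((200, 32))

# Linear reconstruction via SVD (a stand-in for a trained convolutional
# autoencoder): keep k components, reconstruct, and score each window
# by its reconstruction error.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:4]  # k = 4 retained components

def anomaly_score(window):
    centered = window - mean
    recon = centered @ components.T @ components
    return float(np.linalg.norm(centered - recon))

# An erratic window, loosely mimicking the irregular rates seen in AF.
irregular = 70 + 40 * rng.standard_normal(32)
print(anomaly_score(irregular) > anomaly_score(normal[0]))  # irregular scores higher
```

Thresholding that score is the basic detection step; the talk extends it with convolutional architectures and semi-supervised sequence learning over real wearable data.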

We'll conclude with some broader thoughts on the intersection of artificial intelligence and medicine, including lessons learned while bridging the cultures of machine learning research and clinical research, as well as some thoughts on how artificial intelligence may drive the future of healthcare.

Speakers

Brandon Ballinger

Co-Founder, Cardiogram
Brandon currently applies machine learning to cardiology at Cardiogram. Previously, he helped fix healthcare.gov, co-founded Sift Science, and worked as an engineer at Google on Android speech recognition.

Johnson Hsieh

Co-Founder, Cardiogram
Johnson is applying machine learning to cardiology starting with data from the Apple Watch. Previously, Tech Lead on Google Voice Actions ("ok, Google"), interest-based user modeling, and Search.



Friday May 20, 2016 4:00pm - 4:40pm
Ada

4:00pm

Extracting Medical Attributes from Clinical Trials
Understanding the relationships between drugs and diseases, side effects, and dosages is an important part of drug discovery and clinical trial design. Some of these relationships have been studied and curated in different formats such as the UMLS, BioPortal, SNOMED, etc. Typically this data is incomplete and distributed across various sources. I will address the different stages of drug-disease, drug-side-effect, and drug-dosage relationship extraction. As a first step, I will discuss extracting medical attributes (diseases, dosages, side effects) from FDA drug labels and clinical trials. Next, I will use simple machine learning techniques to improve the precision and recall of this sample. I will also discuss bootstrapping a training sample from a smaller training set. Then I will use DeepDive, a dark data extraction framework, to extract relationships between medical attributes and derive conclusive evidence on facts about them. The advantage of using DeepDive is that it masks the complexities of the machine learning techniques and forces the user to think more about features in the data set. At the end of these steps we will have structured (queryable) data that answers questions such as: what is the dosage of 'digoxin' for controlling 'ventricular response rate' in a male adult at age 60 weighing 160 lbs?
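To illustrate the pattern-based first pass, here is a sketch that pulls (amount, unit) dosage mentions out of label text with a regular expression; the snippet and pattern are illustrative, not the talk's actual pipeline:

```python
import re

# Hypothetical drug-label snippet, invented for illustration.
text = ("DIGOXIN tablets: the usual maintenance dose is 0.125 mg once daily. "
        "Reduce the dose to 0.0625 mg in patients with renal impairment.")

# Match a decimal amount followed by a common mass unit.
DOSE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mcg|g)\b", re.IGNORECASE)

def extract_dosages(text):
    """First-pass, pattern-based extraction of (amount, unit) mentions,
    of the kind later refined with ML and DeepDive."""
    return [(float(amount), unit.lower()) for amount, unit in DOSE_RE.findall(text)]

print(extract_dosages(text))  # [(0.125, 'mg'), (0.0625, 'mg')]
```

High-recall, low-precision candidates like these are exactly what the later machine learning and DeepDive stages are meant to filter and link to drugs and conditions.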




Speakers

Sanghamitra Deb

Data Scientist, Accenture Technology Laboratory


Friday May 20, 2016 4:00pm - 4:40pm
Gardner

4:50pm

Protecting data scientists in healthcare with type safety
Healthcare is a veritable zoo of datatypes, implementations, and formats, providing an unrivaled challenge in data integration. Historically, quality control across disparate data has been done by data scientists, in an ad hoc manner, on their analysis platform of choice: Python or R. At Wellframe, we were looking for a concerted solution to handle both complex integrations and the more general problem of connecting data analysis and feature development. We will share our experiences where, as the volume and diversity of our data sources exploded, we realized that Python did not provide the guarantees we required. Toward this end, we have moved all data QC further upstream to our Scala-based infrastructure to let the type system help manage more of the complexity. To accelerate translating insights into features, we have utilized Spark to provide the DataFrames our data scientists know and love, while still being able to take advantage of our hardware. This has turned out to be a mixed blessing: it has increased our pace, but the loss of type safety during analysis allows bugs to be propagated through our system. We will discuss the approaches we are pursuing to improve this from both sides, by migrating from RDDs to Datasets, and moving our analysis from Python to Scala.
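The underlying idea, parsing raw records into a typed structure once, at the boundary, so malformed rows fail fast instead of propagating into analysis, can be sketched even in Python with dataclasses (Wellframe does this with Scala's type system and Spark Datasets; the record shape below is invented for illustration):

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Measurement:
    """A validated clinical record; fields are hypothetical."""
    patient_id: str
    taken_on: date
    systolic_bp: int

def parse_measurement(raw: dict) -> Measurement:
    """Validate and convert a raw row at the ingestion boundary."""
    return Measurement(
        patient_id=str(raw["patient_id"]),
        taken_on=date.fromisoformat(raw["taken_on"]),
        systolic_bp=int(raw["systolic_bp"]),
    )

good = parse_measurement({"patient_id": "p1", "taken_on": "2016-05-20", "systolic_bp": "120"})
print(good.systolic_bp)  # 120

try:
    parse_measurement({"patient_id": "p2", "taken_on": "yesterday", "systolic_bp": "120"})
except ValueError:
    print("rejected at the boundary")  # bad rows fail fast, not mid-analysis
```

Scala's compiler enforces this statically across the whole pipeline, which is the guarantee the talk argues Python alone cannot provide.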

Speakers

Gopal Ramachandran

Head of Technology, Wellframe
Gopal is the Head of Technology at Wellframe. Previously, he worked at Massachusetts General Hospital, where he was part of the team that worked with Apple on the development of ResearchKit. Gopal received his M.D. from Harvard Medical School, and his Ph.D. from MIT.



Friday May 20, 2016 4:50pm - 5:10pm
Markov