Data By the Bay has ended
Data By the Bay is the first Data Grid conference matrix with 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.


Monday, May 16
 

9:50am

Time-series Feature Engineering Done Right
Workday is building a data platform that allows their ML engineers to rapidly create and iterate on predictive applications. We will present how we do feature engineering and model validation over time-series data and the challenges we've had along the way. You will learn why temporal validation is critical for certain types of problems and how to make sure you're doing it right.
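As a rough illustration of the temporal validation idea mentioned in this abstract, the sketch below splits time-stamped rows so that every training row strictly precedes the rows it is evaluated on. It is not the speaker's or Workday's method; the function name `rolling_origin_splits`, the `day` field and the sample data are all hypothetical.

```python
from datetime import date


def rolling_origin_splits(rows, cutoffs):
    """Yield (train, test) pairs for each cutoff date.

    Training rows are strictly earlier than the cutoff; test rows fall between
    this cutoff and the next one, so no future data leaks into training.
    """
    rows = sorted(rows, key=lambda r: r["day"])
    for i, cutoff in enumerate(cutoffs):
        upper = cutoffs[i + 1] if i + 1 < len(cutoffs) else None
        train = [r for r in rows if r["day"] < cutoff]
        test = [
            r for r in rows
            if r["day"] >= cutoff and (upper is None or r["day"] < upper)
        ]
        yield train, test


# Hypothetical monthly observations, January through June 2016.
rows = [{"day": date(2016, m, 1), "value": m * 10} for m in range(1, 7)]
splits = list(rolling_origin_splits(rows, [date(2016, 4, 1), date(2016, 5, 1)]))

for train, test in splits:
    print(len(train), "training rows,", len(test), "test rows")
```

A random shuffle-and-split would mix later observations into the training set, which is the kind of leakage a time-ordered split avoids.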

Speakers

Parag Namjoshi

Principal Software Development Engineer, Workday
Parag works with machine learning technologies at Workday. He has extensive experience in applying ML/AI techniques to diverse domains including pay-as-you-drive insurance, vehicle health prognostics, and NLP. Parag holds a PhD in Computer Science.



Monday May 16, 2016 9:50am - 10:30am
Gardner

10:40am

Deploying Image Recognition on a Budget
Recent years have seen a surge in image recognition technology thanks to deep learning. This has led more and more applications and companies to use the technology to understand and organize their images better. It has also been helped by the introduction of several tools and libraries, such as Caffe, Torch and DL4J, that make it easy to train deep learning systems for image recognition out of the box. Recent times have also seen the introduction of public APIs from Microsoft, Google and various startups that make it really easy for developers to leverage the technology in their products. While this allows developers to get started in image recognition with minimal knowledge and infrastructure investment, it might be infeasible in cases where the content volume and recurring costs are high or the domain is too specific to the application. In this talk we will show how to deploy image recognition in production on a budget using open source tools. We will illustrate this using Trulia as an example and show how we develop an end-to-end image recognition pipeline, from images to predictions to applications, using open source tools including Caffe, Celery, Django, Hadoop, Flume and Redis.

Speakers

Shourabh Rawat

Senior Data Scientist, Zillow Group



Monday May 16, 2016 10:40am - 11:00am
Ada

11:40am

Functional Programming for Machine Learning and Data Pipelines

It is easy to make a fast, incorrect algorithm. In machine learning, we are often faced with two challenges: correctly implement a complicated, math-heavy algorithm and make it run fast! In this talk, we present the marrying of functional programming and machine learning as a solution to this central problem dogging machine learning practitioners. As a demonstration of combining these ideas together, we present code that implements common learning algorithms using core principles of functional programming.


Speakers

Malcolm Greaves

Research Engineer, Nitro Software, Inc.
I'm a computer scientist and software engineer who adores functional programming, supports open source development, works on machine learning (ML) and natural language processing (NLP) algorithms, and never stops innovating and growing. My personal software development style...


Monday May 16, 2016 11:40am - 12:20pm
Gardner

2:10pm

The Lego Model for Machine Learning
80-90% of data science is data cleaning and feature engineering. However, if we were to plot a count of what all the data science tools are for, we would find that most innovation happens in data infrastructure and modeling. We want to change that and make data scientists much more productive while also improving the quality of their work. In this talk I will describe the machine learning platform we wrote on top of Spark to modularize these steps. This allows easy reuse of components, simplifying model building and changes. The framework simplifies the data preparation and feature building stages with reusable classes for each data source, making subsequent feature generation a matter of a few lines of code.

Speakers

Vitaly Gordon

VP, Data Science and Engineering, Salesforce Einstein



Monday May 16, 2016 2:10pm - 2:50pm
Gardner

4:00pm

Quantitative Trading with Machine Learning
Quantitative Trading is the methodical way of trading. It's a $300B industry, and Quantitative Hedge Funds are considered to be the elite of Hedge Funds. Today, with more and better data and software than ever, the application of machine learning methods to financial data is becoming increasingly popular in the industry. In this talk we introduce basic concepts of quantitative trading and showcase a simple example of applying machine learning to create a quantitative trading algorithm for Futures.

Speakers

Martin Froehler

CEO, Quantiacs
Starting with US stocks in high school, continuing through his mathematics education in Europe, and culminating as head of a private quantitative research firm in Zurich, Martin has over 20 years' experience with the markets. As a manager and teacher of new Quants, Martin realized...



Monday May 16, 2016 4:00pm - 4:40pm
Ada
 
Wednesday, May 18
 

9:50am

Synthesizing human and machine capabilities
Machine learning and artificial intelligence have made tremendous advances in the last several years. Machines are now better than humans at many tasks - facial recognition, medical diagnosis, parole decisions, driving cars - and the list is rapidly growing. As these tasks are ceded to machines, new opportunities arise for humans to think less like machines and more like … humans. Skills like the ability to empathize, the ability to leverage ambient information, and the ability to grasp broad context are tremendously valuable and in the purview of humans. As machine algorithms become ubiquitous, it is these human capabilities that become a means of differentiation. With today’s access economy it is now possible to harness the unique abilities of humans into products and services. New software systems are emerging that distribute their workloads across varied processors - be they machine or human. This synthesis enables new capabilities that go far beyond what is possible using either resource alone.

Speakers

Eric Colson

Chief Algorithms Officer, Stitch Fix
Chief Algorithms Officer @ Stitch Fix. Former Netflix Vice President of Data Science and Engineering. Talk to me about social algorithms, human computation, leading large data science teams.



Wednesday May 18, 2016 9:50am - 10:10am
Ada

9:50am

Black Magic: How to apply Machine Learning to real-world problems
Surprising progress has been made on Machine Learning algorithms and infrastructures recently. These techniques are being used to solve a lot of previously unsolvable problems. However, not all problems are directly ML-applicable. Some need to be divided and transformed properly before powerful ML techniques can be applied to them. In this session, I’ll talk about what kinds of “black magic” we applied to the data problems we are facing daily at Mattermark, so that we can solve them in a graceful and scalable way using Machine Learning.

Speakers

Evion Kim

Lead Machine Learning Engineer, Mattermark
Mattermark is driven to uncover all of the world's business information. We are using machine learning and big data techniques to collect, organize and analyze this data. Evion Kim is Lead Machine Learning Engineer at Mattermark, where he focuses on building tools...


Wednesday May 18, 2016 9:50am - 10:30am
Markov

2:30pm

Building AI that Searches and Sells
Big data is transforming every aspect of society – what impact will it have on the world of sales?

At LeadGenius we've been working on the problem of automating sales activities using big data, machine learning and crowd computing. I'll share how we're attacking three problems in data science as they relate to building artificial intelligence for sales. First, we'll talk about the problem of predicting purchase decisions – sometimes, before they ever happen. Second, we'll talk about the problem of tracking and understanding every person and company in the United States and what they're going to need to buy next. Last, we'll talk about our latest challenge – building machine intelligence that can communicate naturally over email in a way that's indistinguishable from a salesperson.

Speakers

Anand Kulkarni

Chief Scientist & Co-Founder, LeadGenius
Anand is founder and Chief Scientist of LeadGenius, a Y Combinator, Sierra Ventures, and Andreessen-Horowitz-backed startup using human computation to automate sales at scale. Built on the MobileWorks crowd architecture, LeadGenius applies fair and ethical principles to...


Wednesday May 18, 2016 2:30pm - 2:50pm
Ada

4:00pm

Building semantic search using Deep Learning
Many search appliances exist today to make full-text search fairly simple: Elasticsearch, Solr, Algolia; the list goes on. However, all of these services implement n-gram or token level analysis of the text, and can only really do search on exact or partial matches of text in the base corpus. They are not able to match on overlapping concepts. For example, if a document is about the programming language Java, and you search for "Computer Science", unless computer science is explicitly mentioned in the document, it won't be scored highly. Recent innovations in general purpose word vectors, and the ability to compose them to create general purpose document vectors, provide a way to create conceptual, semantic search products. This talk will demonstrate creating a semantic search engine on a few non-trivial corpora.
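To make the idea of composing word vectors into document vectors concrete, here is a minimal sketch that averages word vectors and ranks documents by cosine similarity. It is not the speaker's implementation: the three-dimensional vectors in `WORD_VECTORS` are made up for illustration (a real system would load pretrained embeddings), and averaging is only one simple way to compose them.

```python
import math

# Hypothetical toy word vectors, invented for this example only.
WORD_VECTORS = {
    "java": [0.9, 0.1, 0.0],
    "programming": [0.8, 0.2, 0.0],
    "language": [0.6, 0.3, 0.1],
    "computer": [0.85, 0.15, 0.0],
    "science": [0.7, 0.2, 0.1],
    "recipe": [0.0, 0.1, 0.9],
    "pasta": [0.05, 0.0, 0.95],
    "sauce": [0.0, 0.2, 0.8],
}


def embed(text):
    """Compose a document vector by averaging the vectors of known words."""
    vecs = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(component) / len(vecs) for component in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs):
    """Rank documents by similarity between query and document vectors."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)


docs = ["java programming language", "pasta sauce recipe"]
ranked = search("computer science", docs)
print(ranked[0])
```

The query shares no token with either document, so a purely token-based match would score both at zero; the vector comparison still ranks the Java document first because its words point in a similar direction.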

Speakers

Samiur Rahman

Head of Data Engineering, Mattermark


Wednesday May 18, 2016 4:00pm - 4:40pm
Gardner
 
Thursday, May 19
 

9:00am

Keynote: How Can We Trust Machine Learning? Exploration, Evaluation and Explanation for ML Models
Machine learning technologies are at the core of a new generation of intelligent applications that differentiate disruptive businesses from established players. Today, business tasks like product recommendation, image tagging, sentiment analysis, churn prediction, fraud detection and lead scoring can only be achieved using machine learning (ML). To build these applications at scale, companies are fast adopting tools such as Dato’s GraphLab Create and Predictive Services, enabling developers to accelerate the innovation cycle, and quickly take their ideas from inspiration to production. Industry practitioners understand that in order to secure adoption of intelligent applications, they must build trust in their models and predictions - that is, gain confidence that their models are achieving their desired outcomes and a good understanding of how predictions are made. In this talk, I'll describe both: a) Recent research done at the University of Washington to provide a formal framework that explains why a machine learning model makes a particular prediction, and how even non-experts can use these explanations to improve the performance of a model. b) New tools introduced by Dato to help industry practitioners build trust and confidence in machine learning by making it easy to evaluate, explore, and explain models and predictions. With these techniques, companies can start to have the means to gain trust and confidence in the models and predictions behind their core business applications.

Speakers

Carlos Guestrin

Director of Machine Learning, Apple
Director, Machine Learning, Apple and Amazon Professor of Machine Learning, University of Washington


Thursday May 19, 2016 9:00am - 9:40am
Gardner

9:50am

Building an A.I. Cloudy Sky: What We Learned at PredictionIO
Building a successful A.I. cloud platform on top of an open-source Machine Learning project is more complicated than one would imagine. Simply offering a hosted version of the project is hardly the answer. The secret to success is to differentiate the needs between the open-source users and the potential SaaS users. Oftentimes, they are of different species. We will walk through how PredictionIO navigates the roadmap.

Building a software-as-a-service business on top of an open-source project involves more than just providing a hosted version of it. It is a little-known fact that people in the existing open-source user segment are unlikely to become the potential cloud product customers. To productize an open-source project as a cloud platform, understanding who the customers are and what they are willing to pay for are the keys. Business techniques like customer discovery and lean product development can be applied to increase the chance of success. At the same time, keeping the open-source project and the cloud product under one umbrella while developing them separately for different user segments is non-trivial. We'll dive into some real scenarios, both successful and not so successful ones, of how we navigate the cloud product roadmap based on the popular open-source Machine Learning server project -- PredictionIO.

Specifically, we will walk through these topics:
• Evaluating Different Cloud Approaches
• Understanding Existing and Potential Users
• Finding the Unique Values of the Cloud
• Contributing Back From the Cloud to the Open-Source

Speakers

Simon Chan

CEO, PredictionIO
Simon Chan is a co-founder of PredictionIO, with years of experience in the tech industry in London, Hong Kong, Mainland China and Silicon Valley. His doctoral research work at University College London was on machine learning techniques for large-scale user preference prediction...



Thursday May 19, 2016 9:50am - 10:30am
Markov

9:50am

What Kaggle has learned from 2MM machine learning models
Speakers

Anthony Goldbloom

CEO, Kaggle
Anthony is the founder and CEO of Kaggle. Before founding Kaggle, Anthony worked in the macroeconomic modeling areas of the Reserve Bank of Australia and before that the Australian Treasury. He holds a first class honours degree in economics and econometrics from the University of...


Thursday May 19, 2016 9:50am - 10:30am
Gardner

10:40am

Predicting Hacker News with Beam and TensorFlow
In the last year Google has open sourced or publicly released two products, Apache Beam and Cloud Dataflow, that promise to change how we write data pipelines.

Beam is an open source, portable job description framework incubating at the Apache Foundation. It unifies batch and stream processing in a single model available in Java, Python and Scala. It supports running on popular execution engines like Spark, Flink and Google Cloud Dataflow, giving users flexibility in where they run and eliminating the need to re-write pipelines. One of these execution frameworks, Dataflow, is a cloud-based fully managed service that (like BigQuery) allows users to just submit code and get results. Google provides autoscaling, straggler avoidance and monitoring.

In this talk we'll explore Beam's event time semantics like windows, sessions, and triggers. Eric will also demonstrate running a single Beam job in both batch and stream modes, deployed on an on-prem cluster and in the cloud with no code changes.

Speakers

Eric Anderson

Product Manager, Google
Work on Google Cloud Dataflow.




Thursday May 19, 2016 10:40am - 11:00am
Ada

11:40am

Inside Pandora: Practical Application of Big Data in Music
Pandora began with The Music Genome Project, the most sophisticated taxonomy of musicological data ever collected and an extremely effective content-based approach to music recommendation. Its foundation is based on human music cognition, and how an expert describes and perceives the complex world of a music piece.

But what happens when you have a decade of additional data points, given off by more than 250 million registered users who have created 8+ billion personalized radio stations and given 60+ billion thumbs? As opposed to other traditional recommender systems, such as Netflix or Amazon, which need to recommend a single item or static set, Pandora provides an evolving set of sequential items, and needs to react in just a few milliseconds when the user is unhappy with the proposed songs. Furthermore, a variety of factors (e.g., musicological, social, geographical, or generational) play a critical role in deciding what music to play to a user, and these factors vary dramatically across each individual listener.

In this talk I will present a dynamic ensemble learning system that combines curational data and machine learning models to provide a truly personalized experience. This approach allows us to switch from a lean-back experience (exploitation) to a more exploratory mode for discovering new music tailored specifically to users' individual tastes. I will also discuss how Pandora, a data-driven company, makes informed decisions about the features that are added to the core product based on the results of extensive online A/B testing.

Following this session the audience will have an in-depth understanding of how Pandora uses Big Data Science to determine the perfect balance of familiarity, discovery, repetition and relevance for each individual listener, measures and evaluates user satisfaction, and how our online and offline architecture stack plays a critical role in our success.

Speakers

Oscar Celma

Director of Research, PANDORA
Dr. Òscar Celma is Director of Research at Pandora, where he leads a team of 60 (25 scientists and 35 musicologists) to provide the best personalized music discovery experience. Òscar has published a book named "Music Recommendation and Discovery" (Springer, 2010).



Thursday May 19, 2016 11:40am - 12:20pm
Gardner

11:40am

Recommendations for Building Machine Learning Software
Building a real system that uses machine learning can be difficult in terms of both the algorithmic and the engineering challenges involved. In this talk, I will focus on the engineering side and discuss some of the practical lessons we’ve learned from years of developing the machine learning systems that power Netflix. I will go over what it takes to get machine learning working in a real-life feedback loop with our users and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. This involves lessons around challenges such as where to place algorithmic components, how to handle distribution and parallelism, what kinds of modularity are useful, how to support both production and experimentation, and how to test machine learning systems.

Speakers

Justin Basilico

Research/Engineering Manager, Netflix
Justin Basilico is a Research/Engineering Manager for Page Algorithms Engineering at Netflix. He leads an applied research team focused on developing the next generation of algorithms used to generate the Netflix homepage through machine learning, ranking, recommendation, and large-scale...



Thursday May 19, 2016 11:40am - 12:20pm
Ada

1:10pm

Fast deep recurrent net training
Deep recurrent nets are the extension of deep neural nets to process and output sequential data. They have exploded onto the deep learning scene over the past few years, are no longer considered hard to train, and have enabled us to make progress on everything from speech recognition and language modeling to image captioning. In this talk, we will look at what recurrent nets can do for you, and go over some tips and tricks we've learnt from building Deep Speech for training seriously deep recurrent networks on your own.

Some knowledge of recurrent nets is expected, like having read http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Speakers

Sanjeev Satheesh

Deep learning researcher, Baidu USA
Sanjeev works as a Deep learning researcher at the Silicon Valley AI Lab at Baidu USA. SVAIL has been focused on the mission of using hard AI technologies to impact hundreds of millions of users.


Thursday May 19, 2016 1:10pm - 1:30pm
Gardner

2:10pm

Scalable Training of RNNs for Speech Recognition
One really good way to improve the accuracy of a deep learning based speech recognition system is to throw a lot of diverse data at it. This is one trick we use at Baidu's Silicon Valley AI Lab (SVAIL), but it means the network can take a very long time to train. I will talk about some of the hardware, software and algorithmic tricks we use to enable training on tens of thousands of hours of raw speech data. These techniques are broadly applicable to a wide variety of sequence based machine learning tasks. We use synchronous SGD while scaling to multiple nodes (8 GPUs per node). Achieving this requires paying careful attention to the performance of the gradient synchronization between nodes - we have written our own version of MPI's all_reduce primitive. It also requires maintaining high performance on each individual node as you scale, which means paying close attention to the performance of your BLAS library. Combined with other tricks such as training in reduced precision, this means we can sustain 3 TFlops per node while weak scaling to 128 GPUs.
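The synchronous SGD pattern described in this abstract can be illustrated with a single-process simulation: each worker computes a gradient on its own data shard, an all-reduce step averages the gradients, and every replica applies the same update. This is only a toy sketch in plain Python, not SVAIL's implementation and not MPI; `all_reduce_mean`, `local_gradient`, the shards and the learning rate are all hypothetical.

```python
def all_reduce_mean(worker_grads):
    """Simulated all-reduce: average gradients element-wise and hand every
    worker an identical copy of the result."""
    n = len(worker_grads)
    mean = [sum(components) / n for components in zip(*worker_grads)]
    return [list(mean) for _ in worker_grads]


def local_gradient(w, shard):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return [sum(2 * (w[0] * x - y) * x for x, y in shard) / len(shard)]


# Two simulated workers, each with its own shard of (x, y) pairs drawn from y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
weights = [[0.0], [0.0]]  # one model replica per worker
learning_rate = 0.01

for step in range(200):
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    synced = all_reduce_mean(grads)  # synchronization point between workers
    weights = [
        [w[0] - learning_rate * g[0]] for w, g in zip(weights, synced)
    ]

print(weights)
```

Because every replica applies the same averaged gradient at each step, the replicas stay identical, which is the property that makes the cost of the synchronization step matter so much at scale.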

Speakers

Erich Elsen

Research Scientist, Baidu



Thursday May 19, 2016 2:10pm - 2:50pm
Gardner

3:00pm

Active Learning and Human-in-the-Loop Machine Learning
Speakers

Lukas Biewald

Founder & Chief Data Scientist, CrowdFlower
Lukas Biewald: CEO/Co-founder of CrowdFlower. He has worked as a Senior Scientist and Manager within the Ranking and Management Team at Powerset, Inc., a natural language search technology company later acquired by Microsoft, and also led the Search Relevance Team for Yahoo! Japan...


Thursday May 19, 2016 3:00pm - 3:40pm
Gardner

3:00pm

Hidden GEMMs: How Optimized Math Libraries Work
Fast linear algebra is the foundation of all machine learning, including deep learning. Have you ever wondered how CUBLAS, MKL and similar libraries work? If so, this talk is for you!

Speakers

Marek Kolodziej

Principal Research Engineer, Nitro
Marek Kolodziej is a Principal Research Engineer at Nitro, Inc. He's been working on a diverse set of machine learning, distributed computing and big data problems for the past 6 years, and statistics and econometrics for the past 11. His current passion is deep learning and GPU computing...



Thursday May 19, 2016 3:00pm - 3:40pm
Ada

4:00pm

Multimodal Question Answering for Language and Vision
Deep Learning has made tremendous breakthroughs possible in visual understanding and speech recognition. Ostensibly, this is not the case in natural language processing (NLP) and higher level reasoning. However, it only appears that way because there are so many different tasks in NLP and no single one of them, by itself, captures the complexity of language. I will talk about dynamic memory networks for question answering. This model architecture and task combination can solve a wide variety of visual and NLP problems, including those that require reasoning.

Speakers

Richard Socher

Founder & CEO, MetaMind
Richard Socher is the CTO and founder of MetaMind, a startup that seeks to improve artificial intelligence and make it widely accessible. He obtained his PhD from Stanford working on deep learning with Chris Manning and Andrew Ng. He is interested in developing new AI models that...


Thursday May 19, 2016 4:00pm - 4:40pm
Gardner

4:00pm

Optimizing Machine Learning Models
In this talk we briefly introduce Bayesian Global Optimization as an efficient way to optimize machine learning model parameters, especially when evaluating different parameters is time-consuming or expensive. We will motivate the problem and give example applications. We will also talk about our development of a robust benchmark suite for our algorithms including test selection, metric design, infrastructure architecture, visualization, and comparison to other standard and open source methods. We will discuss how this evaluation framework empowers our research engineers to confidently and quickly make changes to our core optimization engine. We will end with an in-depth example of using these methods to tune the features and hyperparameters of a real world problem and give several real world applications.

Speakers

Scott Clark

Co-founder and CEO, SigOpt
Scott is the Co-founder and CEO of SigOpt, an optimization software as a service company helping firms tune their machine learning models and complex simulations. Scott has been applying optimal learning techniques in industry and academia for years, from bioinformatics to production...


Thursday May 19, 2016 4:00pm - 4:40pm
Markov

5:00pm

Panel: Artificial Intelligence: Who is Winning?
AlphaGo crushing humans, Tay making them blush -- finally, everybody thinks the singularity is near. Is it? This panel will weigh what actual technical grounds there are to claim advantage over humans or other AI in certain fields. Where do open-source data science and engineering add the most impact, and how should modern state of the art AI be implemented in a startup that needs it to win? Come and join the discussion!

Moderators

Alexy Khrabrov

Chief Scientist and Founder, By the Bay

Speakers

SriSatish Ambati

CEO and co-founder, H2O.ai
Sri is co-founder and CEO of H2O (@h2oai), the builders of H2O. H2O democratizes big data science and makes Hadoop do math for better predictions. Before H2O, Sri spent time scaling R over big data with researchers at Purdue and Stanford. Prior to that Sri co-founded Platfora and...

Pete Skomoroch

Head of Data Products, Workday
Peter is Co-Founder and CEO of SkipFlag, which was acquired by Workday in 2018. SkipFlag's technology uses your existing conversations, support tickets, and other communication to automatically build and update an enterprise knowledge base. It understands the people, topics, and facts...

Richard Socher

Founder & CEO, MetaMind
Richard Socher is the CTO and founder of MetaMind, a startup that seeks to improve artificial intelligence and make it widely accessible. He obtained his PhD from Stanford working on deep learning with Chris Manning and Andrew Ng. He is interested in developing new AI models that...

Shivon Zillis

VC, Bloomberg
Shivon is a partner and founding member of Bloomberg Beta, a $75 million venture fund backed by Bloomberg L.P. that invests in startups transforming the future of work. Bloomberg Beta has an unusual model for a corporate-backed venture fund. It invests for financial return and strives...


Thursday May 19, 2016 5:00pm - 6:00pm
Gardner
 
Friday, May 20
 

12:00pm

Build Smarter, Don't Just Build
Product Data Science at Salesforce creates prescriptive, data-science-based insights to drive growth across each cloud. We algorithmically inform product strategy and design, product management, engineering quality, capacity planning and, above all, adoption across our substantial customer base. We identify the drivers of behaviors such as churn, adoption, propensity to buy, readiness for multi-clouds and predictive journeys to maximize adoption and customer success. This will be a thrilling data.science.insights.action presentation.

Speakers

Hernan Asorey

VP, Product Data Science, Salesforce
Hernán is responsible for transforming product ideation, design and development, as well as growth into an evidence-driven culture where intelligence becomes an intrinsic part of the product to ship. He manages, grows, and strives to inspire a world-class team of data experts across...

Robin Glinton

Sr. Director, Data Science Applications, Salesforce.com
Robin Glinton is the Director of Data Science Applications (DSA) at Salesforce.com. Robin leads a team dedicated to understanding adoption of Salesforce products as well as sprinkling machine learning into the Salesforce CRM offerings. Robin has held a number of positions across...



Friday May 20, 2016 12:00pm - 12:20pm
Markov

1:40pm

How to Use Big Data to Inspire Consumer Confidence
Credit Karma uses the power of technology to simplify financial decision-making. Analyzing over 50 million members’ finances, Credit Karma researches and recommends credit cards, loans and insurance based on each individual member’s specific credit profile. In some cases, the platform can also show members if they’re pre-approved for a particular product, giving them the confidence to apply without the risk of hurting their credit score. Credit Karma uses machine learning to deliver personalized and tailored recommendations so consumers can save across all their financial products. In this panel, Credit Karma shares how to use data to enrich user experiences. We will also share best practices for collecting, sharing and maximizing the efficacy of processing this data.

Speakers

Daniel Doerr

Data Science Manager, Credit Karma
Daniel leverages data at Credit Karma to provide personalized financial solutions and recommendations. He leads a team of data scientists and analysts who use the machine to make recommendations more personalized and human. When Daniel isn’t crunching data, he is using data to find...



Friday May 20, 2016 1:40pm - 2:00pm
Ada