Data By the Bay is the first Data Grid conference matrix, with 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.


Monday, May 16
 

8:45am

Grand Welcome & Opening Remarks
Speakers
avatar for Alexy Khrabrov

Alexy Khrabrov

Chief Scientist, Nitro/By the Bay
Chief Scientist at Nitro, founder and organizer, SF {Scala, Text, Spark, Reactive}, {Scala, Big Data Scala, Text, Data, ...} By the Bay.



Monday May 16, 2016 8:45am - 8:55am
Gardner

8:55am

Host Sponsor Welcome: Nitro
Speakers
avatar for Roland Tritsch

Roland Tritsch

VP Engineering, Nitro Inc.
I like to make NLP- and ML-based PDF productivity and document processing available to everybody, anywhere, anytime, and to build products and solutions that make users smile, help engineers grow, and get it done (fast & right).


Monday May 16, 2016 8:55am - 9:05am
Gardner

9:05am

Keynote: Building a Real-time Streaming Platform Using Kafka Streams and Kafka Connect
Modern businesses have data at their core, and this data is changing continuously. How can we harness this torrent of continuously changing data in real time? The answer is stream processing, and one system that has become a core hub for streaming data is Apache Kafka.
This presentation will give a brief introduction to Apache Kafka and describe its use as a platform for streaming data. It will explain how Kafka serves as a foundation for both streaming data pipelines and applications that consume and process real-time data streams. It will introduce some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library. Finally, it will describe lessons learned by companies like LinkedIn in building massive streaming data architectures.
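For readers new to these APIs, here is a minimal sketch in Scala against the Java Kafka Streams API of a recent Kafka release (not the exact code shown in the talk) of the kind of continuous transformation Kafka Streams makes easy; topic names and configuration values are placeholders.

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.ValueMapper
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object EventCleaner extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-cleaner")          // hypothetical app id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  val builder = new StreamsBuilder()
  // Read raw events (e.g. landed by a Kafka Connect source), normalize each value,
  // and write the cleaned stream back to Kafka for downstream consumers.
  builder
    .stream[String, String]("raw-events")                                  // hypothetical input topic
    .mapValues(new ValueMapper[String, String] {
      override def apply(value: String): String = value.trim.toLowerCase
    })
    .to("clean-events")                                                    // hypothetical output topic

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```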

Speakers
avatar for Jay Kreps

Jay Kreps

Co-founder and CEO, Confluent
Jay Kreps is the co-founder and CEO of Confluent, a company backing the popular Apache Kafka messaging system. Prior to founding Confluent, he was the lead architect for data infrastructure at LinkedIn. He is among the original authors of several open source projects including Project Voldemort (a key-value store), Apache Kafka (a distributed messaging system) and Apache Samza (a stream processing system).


Monday May 16, 2016 9:05am - 9:45am
Gardner

9:50am

Time-series Feature Engineering Done Right
Workday is building a data platform that allows its ML engineers to rapidly create and iterate on predictive applications. We will present how we do feature engineering and model validation over time-series data and the challenges we've had along the way. You will learn why temporal validation is critical for certain types of problems and how to make sure you're doing it right.
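As a rough illustration of the temporal-validation point (a generic sketch, not Workday's platform code), the key discipline is to split and evaluate along the time axis rather than by random shuffling, so no future information leaks into training.

```scala
// Minimal temporal-validation sketch: train on records strictly before a cutoff,
// evaluate on records at or after it; repeat over several cutoffs ("rolling origin").
case class Record(timestamp: Long, features: Vector[Double], label: Double)

object TemporalSplit {
  /** Split a time series at `cutoff`: train strictly before, test at/after. */
  def split(data: Seq[Record], cutoff: Long): (Seq[Record], Seq[Record]) = {
    val sorted = data.sortBy(_.timestamp)
    sorted.partition(_.timestamp < cutoff)
  }

  /** Rolling-origin evaluation: several cutoffs, each training only on the past. */
  def rollingFolds(data: Seq[Record], cutoffs: Seq[Long]): Seq[(Seq[Record], Seq[Record])] =
    cutoffs.map(split(data, _))
}
```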

Speakers
avatar for Parag Namjoshi

Parag Namjoshi

Principal Software Development Engineer, Workday
Parag works on predictive analytics at Workday, leveraging machine learning and data mining technologies. He has extensive experience in applying machine learning techniques to diverse domains including pay-as-you-drive insurance, vehicle health prognostics, and NLP. Parag holds a PhD in Computer Science.



Monday May 16, 2016 9:50am - 10:30am
Gardner

9:50am

Building a Realtime Receiver with Spark Streaming
This talk will demystify Spark Streaming Receivers by showing live how to enable streaming consumption from new and untapped data sources.  We'll play around with a publicly available financial data API, while dipping our toes into a Twitter data stream and a queue source.  Starting with a simple single-node receiver, we will build up to distributed, reliable receivers.
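A minimal single-node receiver, in the spirit of the talk's starting point, might look like the sketch below; the polling URL and interval are placeholders, and a production receiver would add reliability features such as write-ahead logs.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.io.Source

// Hypothetical receiver that polls a public quote endpoint and pushes each response into Spark.
class QuoteReceiver(url: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  override def onStart(): Unit =
    new Thread("quote-poller") { override def run(): Unit = poll() }.start()

  override def onStop(): Unit = ()  // polling loop exits via isStopped()

  private def poll(): Unit =
    while (!isStopped()) {
      store(Source.fromURL(url).mkString)  // hand one record to Spark
      Thread.sleep(5000)
    }
}

object QuoteStreamApp extends App {
  val conf = new SparkConf().setAppName("quote-stream").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(10))
  ssc.receiverStream(new QuoteReceiver("https://example.com/api/quote"))  // hypothetical API URL
    .print()
  ssc.start()
  ssc.awaitTermination()
}
```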

Speakers
avatar for Sal Uryasev

Sal Uryasev

Data Werewolf, GoFundMe
Having recently started to build out data infrastructure at GoFundMe, Sal is a veteran of the LinkedIn and Salesforce data science teams. A Scala fanatic, he's done a lot of work on streaming processors to build realtime data recommendation engines.


Monday May 16, 2016 9:50am - 10:30am
Markov

9:50am

Elastic Data Pipelines
Consumers want applications that make driving a seamless experience. The global automotive software market is expected to reach $10.1 billion by 2020, according to a 2014 report. By that same time, there will be 200 million connected cars on the road generating valuable information that can be utilized to improve safety, enhance quality and build new revenue streams. In this talk, we will discuss how to build real-world data pipelines that can deliver on these consumers' demands and help companies tap into new revenue streams. We’ll show how to build and use data pipelines in a scalable and elastic way, and demonstrate their usage for the types of spatio-temporal data that vehicles generate as they travel. Attendees will come away understanding how leading-edge connected-car applications are delivered to market. The presentation will include the following technology components:
  • Esri-developed Scala/Play app for highly interactive/dynamic map visualization
  • Azure IoT Hub for event ingestion
  • Mesosphere Infinity for event processing

Speakers
avatar for Claudio Caldato

Claudio Caldato

Program Manager, Microsoft
Claudio is a Program Manager in the Azure IoT Team. His team is building an IoT solution that leverages OSS technologies such as Mesos/DC/OS, Spark, Cassandra, Kafka, Akka and many others. Before joining the Azure team he worked on Node.js, Redis and other OSS technologies along with key Microsoft products such as .NET Common Language Runtime, Office and Visual Studio. In his offline time Claudio likes to exercise his Sommelier skills by sipping... Read More →
avatar for Adam Mollenkopf

Adam Mollenkopf

Real-Time GIS Capability Lead, Esri
Adam Mollenkopf is responsible for the strategic direction Esri takes towards enabling real-time GIS capabilities in the ArcGIS platform. This includes having the ability to ingest real-time data streams from a wide variety of sources, performing continuous processing and analysis on real-time data as it is received, and disseminating analytic results to communities of interest. He leads a team of experienced individuals in the area of geospatial... Read More →
avatar for Sunil Shah

Sunil Shah

Engineering Manager, Mesosphere
Sunil Shah is an Engineering Manager at Mesosphere, working on tools and services around the Apache Mesos project to make the lives of developers easier. Before joining Mesosphere, Sunil worked at music recommendations service Last.fm and completed a Master's program at UC Berkeley in EECS, working on real-time processing of images collected from drones. When he's not flying drones around, Sunil likes to cycle, camp, hike, ski and play a large... Read More →



Monday May 16, 2016 9:50am - 10:30am
Ada

10:40am

Deploying Image Recognition on a Budget
Recent years have seen a surge in image recognition technology thanks to deep learning. This has led to more and more applications and companies using this technology to understand and organize their images better. This has also been helped by the introduction of several tools and libraries like Caffe, Torch, and DL4J that make it easy to train deep learning systems for image recognition out of the box. Furthermore, recent times have also seen the introduction of public APIs from companies like Microsoft, Google and other startups that make it really easy for developers to leverage the technology in their products. While this allows developers to get started in image recognition with minimal knowledge and infrastructure investment, it might be infeasible in cases where the content volume and recurring costs are high or the domain is too specific to the application. In this talk we will show how to deploy image recognition in production on a budget using open source tools. We will illustrate this using Trulia as an example and show how we develop an end-to-end image recognition pipeline, from images to predictions to applications, using open source tools including Caffe, Celery, Django, Hadoop, Flume, and Redis.

Speakers
avatar for Shourabh Rawat

Shourabh Rawat

Senior Data Scientist, Zillow Group



Monday May 16, 2016 10:40am - 11:00am
Ada

10:40am

Google BigQuery: a fully-managed data analytics service in the cloud
Google built Dremel to make internal data analysis simple, then made it available to the world as the BigQuery service: a fully-managed data analytics service in the cloud, which allows you to focus on insight, not infrastructure.

Learn what makes BigQuery unique and how you can immediately start using its "nearly magical abilities". This talk will cover the basics of BigQuery, describe use cases and applications, and provide insights on how to use it efficiently.

Speakers
avatar for Michael Entin

Michael Entin

Senior Software Engineer, Google
Senior Software Engineer on the Google Dremel / BigQuery team. Before joining the Dremel team, he worked on various data processing projects at Microsoft: SQL Server Integration Services, Analysis Services, a distributed platform for AdCenter Business Intelligence, etc.



Monday May 16, 2016 10:40am - 11:20am
Gardner

10:40am

Taming JSON with SQL: From Raw to Results
The flexibility and simplicity of JSON have made it one of the most common formats for data. Data engines need to be able to load, process, and query JSON and nested data types quickly and efficiently. There are multiple approaches to processing JSON data, each with trade-offs. In this session we’ll discuss the reasons and ways that developers want to use flexible schema options and the challenges that creates for processing and querying that data. We’ll dive into the approaches taken by different technologies such as Hive, Drill, BigQuery, Spark, and others, and the performance and complexity trade-offs of each. The attendee will leave with an understanding of how to assess which system is best for their use case.
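As one concrete flavor of the approaches discussed, a short Spark sketch (illustrative only; the file path and fields are hypothetical) shows schema-on-read: infer a nested schema from raw JSON and query it with SQL dot notation.

```scala
import org.apache.spark.sql.SparkSession

object JsonWithSql extends App {
  val spark = SparkSession.builder.appName("json-sql").master("local[*]").getOrCreate()

  // Hypothetical input: one JSON object per line, e.g.
  // {"user": {"id": 42, "country": "US"}, "events": [{"type": "click"}]}
  val df = spark.read.json("data/events.jsonl")
  df.printSchema()                              // shows the inferred nested schema

  df.createOrReplaceTempView("events")
  spark.sql(
    """SELECT user.country, COUNT(*) AS n
      |FROM events
      |GROUP BY user.country
      |ORDER BY n DESC""".stripMargin).show()
}
```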

Speakers
avatar for Greg Rahn

Greg Rahn

Director of Product Management, Snowflake Computing


Monday May 16, 2016 10:40am - 11:20am
Markov

11:10am

Investing in Data Analytics: Risks and Opportunities in the Zettabyte Era
Industry analysts estimate that the amount of data created and replicated will be 10 ZB in 2016 and grow to 40 ZB by 2020. This rapid growth in data fuels a new era of computing – the Data & Analytics era – where data combined with algorithms and cloud-based software leads to improved insights, thereby creating business value. This talk will describe the drivers, trends and implications of this new era for Enterprise IT, applications and product delivery across industrial and consumer sectors, illustrated with several case studies (focusing on Industrial IoT), and why this era results in a fundamental change in opportunities for investments.

Speakers
avatar for Kanu Gulati

Kanu Gulati

Investor, Zetta Venture Partners
Kanu Gulati is an investor at Zetta Venture Partners, a fund that invests in intelligent enterprise software - i.e. companies building software that learns from data to analyze, predict and prescribe outcomes in enterprise applications. Kanu has over 10 years of operating experience as an engineer, scientist and strategist. She led due diligence and provided deal support for early stage investments at Intel Capital and Khosla Ventures... Read More →


Monday May 16, 2016 11:10am - 11:30am
Ada

11:40am

Functional Programming for Machine Learning and Data Pipelines

It is easy to make a fast, incorrect algorithm. In machine learning, we are often faced with two challenges: correctly implementing a complicated, math-heavy algorithm and making it run fast! In this talk, we present the marriage of functional programming and machine learning as a solution to this central problem dogging machine learning practitioners. As a demonstration of combining these ideas, we present code that implements common learning algorithms using core principles of functional programming.


Speakers
avatar for Malcolm Greaves

Malcolm Greaves

Research Engineer, Nitro Software, Inc.
I'm a computer scientist and software engineer who adores functional programming, supports open source development, works on machine learning (ML) and natural language processing (NLP) algorithms, and never stops innovating and growing. My personal software development style focuses strongly on functional programming. Because much of my ML and NLP based work deals with big data, I'm always concerned about program performance. And in... Read More →


Monday May 16, 2016 11:40am - 12:20pm
Gardner

11:40am

Apache Flink: A Very Quick Guide
Apache Flink is an exciting newcomer in the Big Data space that integrates batch and stream processing in a novel way. Join us in exploring Apache Flink: its architecture, internals, and data and programming models. We point out its unique features and differences in comparison to Apache Spark. You will see Flink’s Scala APIs in action as we code and run a selected set of examples that nicely illustrate its features.

Speakers
avatar for Vladimir Bacvanski

Vladimir Bacvanski

Founder, SciSpike
Dr. Vladimir Bacvanski's interest is in better and more productive ways to develop Big Data applications. He is a founder of SciSpike, a company doing custom development, consulting and training, and he engages clients on both Big Data and Scala topics. His recent projects include Big Data and Internet of Things in healthcare, reactive Big Data and Web Scale systems, and introducing Scala, Akka, and Spark in a large financial organization. Vladimir is... Read More →


Monday May 16, 2016 11:40am - 12:20pm
Ada

11:40am

Characterizing and measuring the performance of Big Data processing platforms
There are several Big Data platforms, architectures and frameworks already out there and more are coming out each day, figuratively speaking! In such an ecosystem it is difficult to truly measure or characterize the performance of a data processing infrastructure using these frameworks. We abstract out the frameworks into three categories - Batch, Query and Streaming. In this paper, we identify characteristics for each kind of framework and present the results of running heterogeneous workloads for batch frameworks such as Hadoop, stream frameworks such as Spark and query frameworks such as Impala on target cloud-based infrastructure. In our experiments, we have seen performance variations given the multi-tenant nature of the infrastructure and have accounted for these temporal conditions by running our experiments at different times.

Speakers
avatar for Manish Singh

Manish Singh

CTO, Co-founder, MityLytics
Manish is first and foremost a systems professional who loves to get his hands dirty, deploying, maintaining and tuning massively parallel and distributed systems. At MityLytics he and his partners help customers make the transition to Big Data, painlessly. Once deployed, the team at MityLytics helps customers scale and tune their deployments using MityLytics software. Manish has over 18 years of product development and... Read More →


Monday May 16, 2016 11:40am - 12:20pm
Markov

1:10pm

Akka Streams for Large Scale Data Processing
With over 50 million members, Credit Karma is the most utilized and trusted personal finance platform in the U.S. To handle tens of millions of Americans’ credit information, we use Akka Streams for high-throughput data transfer. We will discuss how we quickly built a solution using Akka Actors to help us parallelize, parse and send data to our ingestion service. We then used Akka Streams to pull data through our actors based on demand, allowing us to easily control our memory buffers and prevent out-of-memory issues. Akka Streams also allowed us to apply parsing and simple business logic. In this session, Credit Karma shares best practices on how to implement Akka Streams at scale.
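A stripped-down sketch of this demand-driven pattern (generic Akka Streams code, not Credit Karma's implementation; the parsing and ingestion calls are stand-ins) looks like the following, where elements are pulled upstream only as fast as the ingestion stage can absorb them, so buffers stay bounded.

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

import scala.concurrent.Future

object IngestSketch extends App {
  implicit val system: ActorSystem = ActorSystem("ingest")
  implicit val mat: ActorMaterializer = ActorMaterializer()
  import system.dispatcher

  // Hypothetical record parser and ingestion-service call.
  def parse(line: String): Map[String, String] =
    line.split(",").map { field =>
      val Array(k, v) = field.split("=", 2)
      k -> v
    }.toMap

  def sendToIngestion(record: Map[String, String]): Future[Unit] =
    Future { println(s"ingested $record") }

  Source(1 to 100000)
    .map(i => s"id=$i,score=${i % 100}")          // stand-in for raw upstream records
    .map(parse)                                   // parsing stage
    .mapAsync(parallelism = 8)(sendToIngestion)   // bounded in-flight requests, backpressured
    .runWith(Sink.ignore)
    .onComplete(_ => system.terminate())
}
```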

Speakers
avatar for Zack Loebel-Begelman

Zack Loebel-Begelman

Senior Software Engineer, Credit Karma
As a senior software engineer on the data and analytics pipeline, Zack’s work allows Credit Karma to provide tailored recommendations for each individual member’s specific financial situation. Zack joined Credit Karma after two years designing and launching data engines to support consumer services and hardware. He has a bachelor’s degree in computer science from the University of California, Davis.
avatar for Dustin Lyons

Dustin Lyons

Senior Software Engineer, Credit Karma
Dustin is the technical lead of Credit Karma’s data services, which allows over 60 million members access to a personalized and seamless user experience. Before joining Credit Karma, Dustin worked in product development and infrastructure operations for over nine years. He holds an MBA from the University of Louisville and a bachelor’s in computer science from the University of Kentucky.



Monday May 16, 2016 1:10pm - 1:30pm
Ada

1:10pm

Building domain specific databases with a distributed commit log [Kafka/Kubernetes]
Smyte is building a platform to analyze all of the traffic running through busy consumer websites and mobile apps. In this talk I'm going to describe our solution to one tricky problem — counting. Specifically: accurately counting ludicrous amounts of events over sliding windows, while keeping costs as low as possible. Oh, and… let's get something working in an hour or two and improve it later.
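For intuition, a toy bucketed sliding-window counter (plain Scala, far simpler than the Kafka-backed system the talk describes) captures the core trade: approximate window edges in exchange for bounded memory.

```scala
// Counts are kept in fixed-size time buckets; a query sums only the buckets
// that fall inside the last `windowMillis`. Old buckets are dropped eagerly.
class SlidingWindowCounter(windowMillis: Long, bucketMillis: Long) {
  private val buckets = scala.collection.mutable.Map.empty[Long, Long]

  private def bucketOf(ts: Long): Long = ts - (ts % bucketMillis)

  def record(ts: Long, count: Long = 1L): Unit = {
    val b = bucketOf(ts)
    buckets(b) = buckets.getOrElse(b, 0L) + count
    // Expire buckets that can no longer contribute to any query.
    val expired = b - windowMillis - bucketMillis
    buckets.keys.toSeq.filter(_ < expired).foreach(buckets.remove)
  }

  def countSince(now: Long): Long =
    buckets.collect { case (b, c) if b > bucketOf(now) - windowMillis => c }.sum
}
```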

Speakers
avatar for Yunjing Xu

Yunjing Xu

Infrastructure Engineer, Smyte
Yunjing is a software engineer building server and database infrastructure at Smyte. Before Smyte, Yunjing worked on the data science and infrastructure team at Square and received his Ph.D. from the University of Michigan for research on the performance and security problems of public cloud infrastructure.


Monday May 16, 2016 1:10pm - 1:30pm
Markov

1:10pm

Netflix Keystone - Streaming Data Pipeline @Scale in the Cloud
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. We will explore in detail how we leverage Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. We will also share our plans on offering Stream Processing as a Service for all of Netflix.

Speakers
avatar for Monal Daxini

Monal Daxini

Senior Software Engineer, Netflix, Inc.
Monal Daxini is a Senior Software Engineer at Netflix building a scalable and multi-tenant event processing pipeline, and infrastructure for Stream Processing as a Service. He has over 15 years of experience building scalable distributed systems at organizations like Netflix, NFL.com, and Cisco.



Monday May 16, 2016 1:10pm - 1:30pm
Gardner

2:10pm

The Lego Model for Machine Learning
80-90% of data science is data cleaning and feature engineering. However, if we were to plot a count of what all the data science tools are for, we would find that most innovation happens in data infrastructure and modeling. We want to change that and make data scientists much more productive while also improving the quality of their work. In this talk I will describe the machine learning platform we wrote on top of Spark to modularize these steps. This allows easy reuse of components, simplifying model building and changes. The framework simplifies the data preparation and feature building stages with reusable classes for each data source, making subsequent feature generation a matter of a few lines of code.
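The general shape of such a modular feature framework (a hypothetical illustration, not Salesforce's actual API) can be as simple as a shared trait per data source plus a composition step.

```scala
import org.apache.spark.sql.DataFrame

// Each data source gets a small, reusable feature module with a shared interface;
// a model's feature set is just a composition of modules.
trait FeatureModule {
  def name: String
  /** Adds this module's feature columns to the input DataFrame. */
  def transform(df: DataFrame): DataFrame
}

// Hypothetical module: assumes the input has "logins" and "weeks_active" columns.
object ActivityFeatures extends FeatureModule {
  val name = "activity"
  def transform(df: DataFrame): DataFrame =
    df.withColumn("logins_per_week", df("logins") / df("weeks_active"))
}

object FeaturePipeline {
  /** Compose any set of modules into one feature-generation step. */
  def build(modules: Seq[FeatureModule])(df: DataFrame): DataFrame =
    modules.foldLeft(df)((acc, m) => m.transform(acc))
}
```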

Speakers
avatar for Vitaly Gordon

Vitaly Gordon

VP, Data Science and Engineering, Salesforce Einstein
VP, Data Science and Data Engineering, Salesforce Einstein



Monday May 16, 2016 2:10pm - 2:50pm
Gardner

2:10pm

Real-time, Streaming Advanced Analytics, Approximations, and Recommendations using Apache Spark ML/GraphX, Kafka Stanford CoreNLP, and Twitter Algebird BONUS: Netflix Recommendations: Then and Now
Agenda:
  • Intro
  • Live, Interactive Recommendations Demo: Spark ML, GraphX, Streaming, Kafka, Cassandra, Docker
  • Types of Similarity: Euclidean vs. Non-Euclidean Similarity
  • User-to-User Similarity
  • Content-based, Item-to-Item Similarity (Amazon)
  • Collaborative-based, User-to-Item Similarity (Netflix)
  • Graph-based, Item-to-Item Similarity Pathway (Spotify)
  • Similarity Approximations at Scale: Twitter Algebird, MinHash and Bucketing, Locality Sensitive Hashing (LSH) (see the sketch below)
  • BONUS: Netflix Recommendation Algorithms: From Ratings to Real-Time (DVD-ratings-based $1M Netflix Prize, 2009; streaming-based "Trending Now", 2016)
  • Wrap Up, Q & A
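As a taste of the approximation techniques on the agenda, a back-of-the-envelope MinHash estimate of Jaccard similarity (plain Scala, not the Algebird API used in the talk) looks like this: each item set is reduced to k minimum hash values, and the fraction of matching positions estimates the similarity between the original sets.

```scala
import scala.util.hashing.MurmurHash3

object MinHashSketch {
  val k = 128
  private val seeds = (1 to k).toArray

  // Signature: for each seed, keep the minimum hash over the (non-empty) item set.
  def signature(items: Set[String]): Array[Int] =
    seeds.map(seed => items.map(x => MurmurHash3.stringHash(x, seed)).min)

  // Fraction of agreeing positions approximates Jaccard similarity.
  def estimatedJaccard(a: Array[Int], b: Array[Int]): Double =
    a.zip(b).count { case (x, y) => x == y }.toDouble / k

  def main(args: Array[String]): Unit = {
    val u1 = Set("item1", "item2", "item3", "item4")
    val u2 = Set("item2", "item3", "item4", "item5")
    println(estimatedJaccard(signature(u1), signature(u2)))  // close to true Jaccard = 3/5
  }
}
```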

Speakers
avatar for Chris Fregly

Chris Fregly

Research Scientist, PipelineIO
Chris Fregly is Founder and Research Scientist at PipelineIO - a Streaming Machine Learning and Artificial Intelligence Startup in San Francisco. Chris is a regular speaker at many conferences and Meetups throughout the world. He’s also an Apache Spark Contributor, Netflix Open Source Committer, and Founder of the Global Advanced Spark and TensorFlow Meetup, and Author of the upcoming O'Reilly Video Series on Deploying and... Read More →


Monday May 16, 2016 2:10pm - 2:50pm
Markov

2:10pm

Twitter Heron in Practice

Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Twitter designed and deployed a new streaming system called Heron. Heron has been in production for nearly 2 years and is widely used by several teams for diverse use cases. In this talk, I will describe Heron in detail and share our operating experiences and the challenges of running Heron at scale.

 


Speakers
avatar for Karthik Ramasamy

Karthik Ramasamy

Engineering Manager, Twitter
Karthik is the engineering manager for Real Time Compute at Twitter and co-creator of Heron. He has two decades of experience working in parallel databases, big data infrastructure and networking. He cofounded Locomatix, a company that specializes in realtime streaming processing on Hadoop and Cassandra using SQL that was acquired by Twitter. Before Locomatix, Karthik was at Juniper Networks where he designed and delivered platforms, protocols... Read More →



Monday May 16, 2016 2:10pm - 2:50pm
Ada

3:00pm

A Real Time Analytics Framework
We'll talk through the elements of a framework that was built to easily construct multiple real-time analytics applications to address the needs of multiple teams at Yahoo's Publishing Products group. In this talk you'll learn about the architecture of what makes up a real-time analytics stack and its use cases. We'll cover the supporting services and libraries that had to be built to support the various use cases. We'll also cover the optimizations we made to address resource and network I/O considerations. Our focus will mainly be on the data processing, not the analytics.

Speakers
avatar for Hiral Patel

Hiral Patel

Technologist, Yahoo Inc
Hiral's been working with Scala for the past 6 years and Big Data for the past 12 years. He's built data platforms, data-intensive applications, and real-time analytics frameworks. Hiral is currently a Senior Principal Architect/Engineer at Yahoo Inc.



Monday May 16, 2016 3:00pm - 3:20pm
Ada

3:00pm

Concord: Simple & Flexible Stream Processing on Apache Mesos
If you’re trying to process financial market data, monitor IoT sensor metrics or run real-time fraud detection, you’ll be thinking of stream processing. Stream processing sounds wonderful in concept, but scaling and debugging stream processing frameworks on distributed systems can be a nightmare. In clustered environments, your logs are scattered across many different computers, making errors and strange behaviors hard to trace. On frameworks like Apache Storm, the many layers of abstraction make it difficult to predict performance and do capacity planning. In micro-batching frameworks like Spark Streaming, stateful aggregations can be a hassle. Moreover, in most of the existing frameworks, changing a single line of code requires a full topology redeploy, causing operational strain. Concord strives to solve all the challenges above. In this talk, you’ll learn how Concord differs from other stream processing frameworks and how Concord can provide flexibility, simplicity, and predictable performance with help from Apache Mesos.

Speakers
avatar for Shinji Kim

Shinji Kim

Co-founder & CEO, Concord


Monday May 16, 2016 3:00pm - 3:20pm
Markov

3:00pm

Speed up app development with prefabricated, extensible, open-source backends
Building modern apps requires a lot of boilerplate backend code -- setting up server endpoints, forwarding requests to the database, and performing authentication are examples of code developers have to write over and over again. In this talk you'll learn how to dramatically cut down development time by using prefabricated, open-source backends like loopback.io and deepstream.io, and how to extend these backends with custom code once your application outgrows the functionality available out of the box. We'll also talk about how prefabricated backends are changing application architectures, and the impact end-to-end event driven application development is making on end-user experience. We’ll talk about our journey through the process of solving these problems in RethinkDB and Horizon, and how we see the future of web development unfold. RethinkDB Horizon is an open-source developer platform for building realtime, scalable web apps. It is built on top of RethinkDB, and allows app developers to get started with building modern, engaging apps without writing any backend code.

Speakers
avatar for Slava Akhmechet

Slava Akhmechet

cofounder, RethinkDB
Slava Akhmechet is the founder of RethinkDB, a database company dedicated to helping developers build realtime web applications. Prior to RethinkDB he was a systems engineer in the financial industry, working on scaling custom database systems. Slava is a frequent speaker and a blogger. He blogs about his interests in open source, developer tools, building delightful user experiences, and distributed systems on defmacro.org. He is currently on... Read More →


Monday May 16, 2016 3:00pm - 3:20pm
Gardner

4:00pm

Quantitative Trading with Machine Learning
Quantitative Trading is the methodical way of trading. It's a $300b industry, and Quantitative Hedge Funds are considered to be the elite of Hedge Funds. Today, with more and better data and software than ever, the application of machine learning methods on financial data is becoming increasingly popular in the industry. In this talk we introduce basic concepts of quantitative trading and showcase a simple example of an application of machine learning to create a quantitative trading algorithm for Futures.

Speakers
avatar for Martin Froehler

Martin Froehler

CEO, Quantiacs
Starting with US stocks in high school, continuing through his mathematics education in Europe, and culminating as head of a private quantitative research firm in Zurich, Martin has over 20 years experience with the markets. As a manager and teacher of new Quants, Martin realized that anybody could become a great Quant and has set out to make that possible.



Monday May 16, 2016 4:00pm - 4:40pm
Ada

4:00pm

Parallel and distributed big joins in H2O
Matt has taken the radix join as implemented in R's data.table and parallelized and distributed it in H2O. He will describe how the algorithm works, provide benchmarks and highlight advantages/disadvantages. H2O is open source on GitHub and is accessible from R and Python using the h2o package on CRAN and PyPI.

Speakers
avatar for Matt Dowle

Matt Dowle

Hacker, H2O.ai
Matt is the main author of R's data.table package, the 2nd most asked about R package on Stack Overflow. He has worked for some of the world’s largest financial organizations: Lehman Brothers, Salomon Brothers, Citigroup, Concordia Advisors and Winton Capital. He is particularly pleased that data.table is also used outside Finance, for example Genomics where large and ordered datasets are also researched.


Monday May 16, 2016 4:00pm - 4:40pm
Markov

4:00pm

The Engineer's Guide to Streaming -- How you should really compare Storm, Flink, and Apex
It feels like every week there's a new open-source streaming platform out there. Yet, if you only look at the descriptions, performance metrics, or even the architecture, they all start to look exactly the same! In short, nothing really differentiates itself - whether it be Storm, Flink, Apex, GearPump, Samza, KafkaStreams, AkkaStreams, or any of the other myriad technologies. 

So if they all look the same, how do you really pick a streaming platform to solve the problem that YOU have? This talk is about how to really compare these platforms, and it turns out that they do have their key differences, they're just not the ones you usually think about. The way that you need to compare these systems if you're building something to last, a well-engineered system, is to look at how they handle durability, availability, how easy they are to install and use, and how they deal with failures. 

This is a relatively accessible technical talk. Newcomers to the streaming realm are welcome!

Speakers
avatar for Ilya Ganelin

Ilya Ganelin

Senior Data Engineer, Capital One Data Innovation Lab
Ilya is a roboticist turned data engineer. At the University of Michigan he built self-discovering robots and then worked on embedded DSP software with cell phone radios at Boeing. Today, he drives innovation at Capital One. Ilya is a contributor to the core components of Apache Spark and a PMC of Apache Apex with the goal of learning what it takes to build a next-generation distributed computing platform. He has presented at the Spark Summit and... Read More →



Monday May 16, 2016 4:00pm - 4:40pm
Gardner

5:00pm

Panel: Future of Data Services
When data pipelines meet data science, multiple directions emerge:
  • Microservices
  • Event Sourcing
  • The SMACK Stack
What are the core principles that stay, and what changes in data pipelines? How does Open Source make it easier to build end-to-end solutions? Come and join the discussion!

Moderators
avatar for Alexy Khrabrov

Alexy Khrabrov

Chief Scientist, Nitro/By the Bay
Chief Scientist at Nitro, founder and organizer, SF {Scala, Text, Spark, Reactive}, {Scala, Big Data Scala, Text, Data, ...} By the Bay.

Speakers
avatar for Marius Eriksen

Marius Eriksen

Principal Engineer, Twitter Inc
Marius Eriksen is a Principal Engineer in Twitter's systems infrastructure group. He works on all aspects of distributed systems and server software, and is currently working on data management and integration systems; he also chairs Twitter’s architecture group. You can reach him at marius@twitter.com or @marius on Twitter.
avatar for Joel Horwitz

Joel Horwitz

Director of Strategy & Business Development, IBM Analytics, IBM
Joel Horwitz is the Director of Strategy & Business Development for IBM Analytics. He graduated from the University of Washington in Seattle with a Masters in Nanotechnology with a focus in Molecular Electronics. He also hails from the University of Pittsburgh with an International MBA in Product Marketing and Financial Management. Joel designed, built, and launched new products at Intel and Datameer resulting in breakthrough innovations. He... Read More →
avatar for Haoyuan Li

Haoyuan Li

CEO, Alluxio, Inc.
Haoyuan Li is founder and CEO of Alluxio (formerly Tachyon Nexus). He is also a computer science PhD candidate at AMPLab, UC Berkeley, where he co-created Alluxio, a memory speed virtual distributed storage system. He is a founding committer of Apache Spark. Before the AMPLab, he worked at Conviva and Google. Haoyuan has an MS from Cornell University and a BS from Peking University.
avatar for William Morgan

William Morgan

CEO, Buoyant, Inc.
Builder of dreams @ Buoyant. @wm / buoyant.io
avatar for Neha Narkhede

Neha Narkhede

Chief Technology Officer, Confluent
Neha Narkhede is co-founder and CTO at Confluent, a company backing the popular Apache Kafka messaging system. Prior to founding Confluent, Neha led streams infrastructure at LinkedIn, where she was responsible for LinkedIn’s streaming infrastructure built on top of Apache Kafka and Apache Samza. She is one of the initial authors of Apache Kafka and a committer and PMC member on the project.
avatar for Roman Shaposhnik

Roman Shaposhnik

Director of Open Source, Pivotal Inc.
Roman Shaposhnik is a Director of Open Source @Pivotal. He is a member of the Apache Software Foundation, a committer on Apache Hadoop, the founder of Apache Bigtop and the man behind the ODPi curtain. He has been involved in Open Source for more than a decade and has hacked on projects ranging from the Linux kernel to the flagship multimedia library FFmpeg. He grew up at Sun Microsystems and was part of the open sourcing of Solaris and Java. At Pivotal he was leading an... Read More →


Monday May 16, 2016 5:00pm - 6:00pm
Gardner

6:00pm

Happy Hour
Join us for a community gathering after the full day of talks. Full bar, fine food, and the company of the speakers, panelists, and the best minds in data.

Monday May 16, 2016 6:00pm - 8:00pm
The Galvanize Lobby
 
Tuesday, May 17
 

8:45am

Opening Remarks
Speakers
avatar for Alexy Khrabrov

Alexy Khrabrov

Chief Scientist, Nitro/By the Bay
Chief Scientist at Nitro, founder and organizer, SF {Scala, Text, Spark, Reactive}, {Scala, Big Data Scala, Text, Data, ...} By the Bay.



Tuesday May 17, 2016 8:45am - 8:55am
Gardner

8:55am

Host Sponsor Welcome: IBM Analytics
Speakers
avatar for Joel Horwitz

Joel Horwitz

Director of Strategy & Business Development, IBM Analytics, IBM
Joel Horwitz is the Director of Strategy & Business Development for IBM Analytics. He graduated from the University of Washington in Seattle with a Masters in Nanotechnology with a focus in Molecular Electronics. He also hails from the University of Pittsburgh with an International MBA in Product Marketing and Financial Management. Joel designed, built, and launched new products at Intel and Datameer resulting in breakthrough innovations. He... Read More →


Tuesday May 17, 2016 8:55am - 9:05am
Gardner

9:05am

Data and Algorithmic Bias in the Web
The Web is the largest public big data repository that humankind has created. In this overwhelming data ocean, we need to be aware of the quality and, in particular, of the biases that exist in this data. In the Web, biases also come from redundancy and spam, as well as from algorithms that we design to improve the user experience. This problem is further exacerbated by biases that are added by these algorithms, especially in the context of search and recommendation systems. They include selection and presentation bias in many forms, interaction bias, etc. We give several examples and their relation to sparsity, novelty, and privacy, stressing the importance of the user context to avoid these biases.

Speakers
avatar for Ricardo Baeza-Yates

Ricardo Baeza-Yates

WWW
Ricardo Baeza-Yates' areas of expertise are information retrieval, web search and data mining, as well as data science and algorithms in general. He was VP of Research at Yahoo Labs, based in Sunnyvale, California, from August 2014 to March 2016. Before that, he founded and led the Yahoo labs in Barcelona and Santiago de Chile from 2006 to 2015. Between 2008 and 2012 he also oversaw the Haifa lab, and he started the London lab in 2012. He is... Read More →


Tuesday May 17, 2016 9:05am - 9:40am
Gardner

9:50am

Building Word2Vec Models with Text Data
It is always amazing when someone is able to take a very hard problem and translate it into one that has been studied for centuries. This is the case with Word2Vec, which transforms words into vectors. Text is unstructured data and has been explored mathematically far less than vectors. Newton (1642-1726) may have been the first one to study vectors, while text mining started its studies a few decades ago. Word2Vec maps text to a vector space that can be utilized in a variety of ways, such as measuring the distance between words. Therefore, given a word of interest, the aforementioned vector space can be used to compute the top N closest words. In this talk, I will explain how to build Word2Vec models with Twitter data stored in Hadoop using Spark and MLlib. I will describe how to choose the most important parameters to accurately train a Word2Vec matrix. In addition, I will show examples of how these models are used in practice in data products.
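A compressed sketch of that workflow using Spark's DataFrame-based Word2Vec estimator (paths, hyperparameters, and the query term are illustrative, not the values used in the talk):

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

object TweetWord2Vec extends App {
  val spark = SparkSession.builder.appName("tweet-word2vec").master("local[*]").getOrCreate()
  import spark.implicits._

  // Hypothetical location of tweet text in Hadoop, one tweet per line.
  val tweets = spark.read.textFile("hdfs:///data/tweets/part-*")
    .map(_.toLowerCase.split("\\s+").toSeq)   // naive whitespace tokenization
    .toDF("words")

  val word2vec = new Word2Vec()
    .setInputCol("words")
    .setOutputCol("vector")
    .setVectorSize(100)   // dimensionality of the embedding space
    .setMinCount(5)       // drop very rare tokens
    .setWindowSize(5)

  val model = word2vec.fit(tweets)
  // Top-10 nearest words to a query term (assumed to be in the vocabulary).
  model.findSynonyms("data", 10).show()
}
```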

Speakers
avatar for Jorge Castanon

Jorge Castanon

Lead Data Scientist, IBM
Jorge Castañón hails from Mexico City and received his Ph.D. in Computational and Applied Mathematics from Rice University. He has a genuine passion for data science and machine learning applications of any kind. Since 2007, he has been developing numerical optimization models and algorithms for regularization and inverse problems. At IBM, Jorge joined the Big Data Analytics team at Silicon Valley Laboratory where he is building the future of... Read More →


Tuesday May 17, 2016 9:50am - 10:30am
Ada

9:50am

Identifying Actionable Messages on Social Media
Text actionability detection is the problem of classifying user-authored natural language text according to whether it can be acted upon by a responding agent. In this paper, we propose a supervised learning framework for domain-aware, large-scale actionability classification of social media messages. We derive lexicons, perform an in-depth analysis of over 25 text-based features, and explore strategies to handle domains that have limited training data. We apply these methods to over 46 million messages spanning 75 companies and 35 languages, from both Facebook and Twitter. The models achieve an aggregate population-weighted F measure of 0.78 and accuracy of 0.74, with values of over 0.9 in some cases.

Speakers
avatar for Nemanja Spasojevic

Nemanja Spasojevic

Director Of Data Science, Lithium Technologies | Klout
Nemanja Spasojevic is the Director of Data Science at Lithium Technologies. He graduated from Massachusetts Institute of Technology and previously worked on the Google Books project, making all of the world’s knowledge accessible online.



Tuesday May 17, 2016 9:50am - 10:30am
Markov

9:50am

The practice of acquiring good labels
Engineers and researchers use human computation as a mechanism to produce labeled data sets for product development, research and experimentation. In a data-driven world, good labels are key. To gather useful results, a successful labeling task relies on many different elements: from clear instructions and user interface design to algorithms for quality control. In this talk, I will present a perspective for collecting high quality labels with an emphasis on practical implementations and scalability. I will focus on three main topics: programming crowds, debugging tasks with low agreement, and algorithms for quality control. I plan to show many examples and code along the way.

Speakers
avatar for Omar Alonso

Omar Alonso

Principal Data Scientist, Microsoft
Omar is a Principal Data Scientist Lead at Microsoft in Silicon Valley where he works on the intersection of social media, temporal information, knowledge graphs, and human computation for the Bing search engine. He holds a PhD from the University of California at Davis. @elunca



Tuesday May 17, 2016 9:50am - 10:30am
Gardner

10:40am

Hunting Criminals with Hybrid Analytics
Fraud detection is a classic adversarial analytics challenge: As soon as an automated system successfully learns to stop one scheme, fraudsters move on to attack another way. Each scheme requires looking for different signals (i.e. features) to catch; is relatively rare (one in millions for finance or e-commerce); and may take months to investigate a single case (in healthcare or tax, for example) – making quality training data scarce. This talk will cover, via live demo and code walk-through, the key lessons we've learned while building such real-world software systems over the past few years. We'll be looking for fraud signals in public email datasets, using IPython and popular open-source libraries (scikit-learn, statsmodel, nltk, etc.) for data science and Apache Spark as the compute engine for scalable parallel processing. The model is an ensemble using a combination of natural language, graph analysis and time series analysis features, and is re-trained using an automated pipeline to learn from feedback on the fly.

Speakers
avatar for David Talby

David Talby

CTO, Atigeo
David Talby is Atigeo’s chief technology officer, working to evolve its big data analytics platform to solve real-world problems in healthcare, energy, and cyber-security. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, agile, distributed teams. Previously, he was with Microsoft’s Bing group, where he led business operations for Bing Shopping in the US... Read More →


Tuesday May 17, 2016 10:40am - 11:00am
Markov

10:40am

lda2vec
Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems need to identify a medical professional when she writes that she 'used to wear scrubs to work', and distill 'taking a trip' into a Fix for vacation clothing. Applied appropriately, word vectors are dramatically more meaningful and more flexible than current techniques and let computers peer into text in a fundamentally new way. I'll try to convince you that word vectors give us a simple and flexible platform for understanding text while speaking about word2vec, LDA, and introduce our hybrid algorithm lda2vec.

Speakers
avatar for Christopher Erick Moody

Christopher Erick Moody

Data Scientist, Stitch Fix
Caltech - Astrostats - PhD supercomputing. Now data labs @stitchfix coding up word2vec, Gaussian Processes, t-SNE, tensors, Factorization Machines, RNNs, & VI



Tuesday May 17, 2016 10:40am - 11:20am
Ada

10:40am

Smarter Search with Spark, Solr and Machine Learning
The modern day search engine has significantly evolved from its keyword matching days to its current form which leverages a wide variety of data inputs and user feedback loops to help users find out what’s most important in their data. At Lucidworks, we leverage Apache Spark and Solr, together with a variety of open source machine learning and NLP approaches, to build smarter, richer search and data applications. This talk will explore several motivating use cases (customer 360, knowledge management, ecommerce) for our integrations as well as technical approaches and key lessons learned in real world implementations.

Speakers
avatar for Grant Ingersoll

Grant Ingersoll

CTO, Lucidworks
I'm the CTO and co-founder of Lucidworks, a long time Lucene and Solr hacker, co-creator of Apache Mahout and lead author of Taming Text.



Tuesday May 17, 2016 10:40am - 11:20am
Gardner

11:10am

Image Retrieval using Short Texts
Image retrieval in response to keyword-based queries is a well studied problem. Web services such as Google Image Search are used daily by users all around the world. The typical use case for these services is using a short piece of text made up of a few individual tokens as the search phrase. The services, therefore, are designed to work with such queries and generally do not work well when a longer search string is used (for example a sentence). This does not align well with the recent push towards a more visual web as evidenced by the popularity of applications such as Instagram, and the rise in the popularity of microblogging services which resulted in an abundance of short text snippets that users may want to be able to retrieve accompanying images for automatically. In this paper we introduce a novel approach, called ImageSuggest, which sits between the user and the traditional image retrieval systems and allows the users to enter longer search strings. Our approach extracts and ranks search terms from the input strings and feeds the resulting keywords to the image retrieval systems. We evaluate our approach on a dataset of short texts from the anonymous social network Whisper and show that we are able to outperform standard keyword extraction and query generation techniques on image retrieval tasks.

Speakers
avatar for Maarten Bosma

Maarten Bosma

Machine Learning Engineer, Whisper



Tuesday May 17, 2016 11:10am - 11:30am
Markov

11:40am

Automatic Links for a Web of Text
We tend to think of text as a linear representation of speech. Perhaps it is, some of the time (we'll show some historical examples). Documents are typically how we encounter text, however -- with structure and relationships beyond the linear. Layout in particular defines critical meaning that would be difficult to convey by other means. Historically, advances to textual & meta-textual representation have been slow to change and be adopted. The codex, table of contents, page numbers, index, standard typefaces -- all of these are crucial inventions that make text more meaningful. These have taken centuries to develop, although we now take them for granted. The rapid growth of web text over the past quarter century gives us a new set of text properties to add new dimensions of meaning. The *link* is a device that allows us to layer meaning and relationships in entirely new ways. Textual links provide cross-document references, and pointers to authoritative sources and indexes. Links are often layered around the central text in lists and indexes to allow navigation at various scales. We will try to put the link, along with linking services such as search engines, into perspective with traditional print-based non-linear text features, to show how the link expands and redefines how text is consumed and construed. In addition to providing a historical perspective on the impact of links on text, we will demonstrate novel varieties of *dynamic links*. In combination with an active platform for reading (e.g. the browser), dynamic link construction provides a new way to increase the reach of texts, connecting them with resources and documents that may not even exist when the text was created, to create a qualitatively new reading experience.

Speakers
avatar for Scott (TS) Waterman

Scott (TS) Waterman

connector, founder, back-quote


Tuesday May 17, 2016 11:40am - 12:20pm
Ada

11:40am

Google Translate - how machines do it
The talk will focus on how Google Translate uses Machine Learning at enormous scale to translate 100B words/day between 103 languages. Google Translate is one of the largest machine learning projects in the world in terms of data set (hundreds of billions of translated phrases) and usage (more than 500M people use it every month).

Speakers
avatar for Barak Turovsky

Barak Turovsky

Head of Product, Google Translate, Google
Barak Turovsky is responsible for product management and user experience for Google Translate. Barak focuses on applying advanced machine learning techniques to deliver a magical experience that breaks language barriers across web, mobile applications, Search, Chrome and other products. Previously, Barak spent 2 years as a product leader within the Google Wallet team. Prior to joining Google in 2011, Barak was Director of Product in Microsoft’s... Read More →



Tuesday May 17, 2016 11:40am - 12:20pm
Gardner

11:40am

Using Spark MLlib for NLP

Apache Spark is most often used as a means of processing large amounts of data efficiently, but is also useful for the processing of individual predictions common to many NLP applications. The algorithms inside MLlib are useful in and of themselves, independent of the core Spark framework. IdiML is an open source tool that enables incredibly fast predictions on textual data by using various components within MLlib. It acts as a standalone tool for performing core machine learning functionality that can easily be integrated into production systems to provide low-latency continuous streaming predictions. This talk explores the functionality inside IdiML, how it uses MLlib, and why that makes such a big difference.
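By way of illustration, the kind of MLlib building blocks such a tool can wrap (a generic text-classification pipeline sketch, not IdiML's internals) looks like this:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object TextPipelineSketch extends App {
  val spark = SparkSession.builder.appName("text-pipeline").master("local[*]").getOrCreate()
  import spark.implicits._

  // Hypothetical labeled examples.
  val training = Seq(
    ("great product, would buy again", 1.0),
    ("terrible support and slow shipping", 0.0)
  ).toDF("text", "label")

  // Tokenize, hash tokens into term-frequency vectors, then fit a classifier.
  val pipeline = new Pipeline().setStages(Array(
    new Tokenizer().setInputCol("text").setOutputCol("tokens"),
    new HashingTF().setInputCol("tokens").setOutputCol("features").setNumFeatures(1 << 18),
    new LogisticRegression().setMaxIter(20)
  ))

  val model = pipeline.fit(training)
  // The fitted model can then score individual documents outside any batch job.
  model.transform(Seq("fast shipping and great support").toDF("text"))
    .select("text", "prediction")
    .show(false)
}
```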


Speakers
avatar for Michelle Casbon

Michelle Casbon

Senior Data Science Engineer, Qordoba
Michelle Casbon is Director of Data Science at Qordoba. Previously, she was a Senior Data Science Engineer at Idibon, where she contributed to the goal of bringing language technologies to all the world’s languages. Michelle's development experience spans more than a decade across various industries, including media, investment banking, healthcare, retail, and geospatial services. Michelle completed a Masters at the University of... Read More →



Tuesday May 17, 2016 11:40am - 12:20pm
Markov

1:10pm

An Innovative Approach to Labeling Ground Truth in Speech
Supervised machine learning algorithms require accurate and consistent data labels. However, complicated datasets may introduce ambiguity, resulting in irregular ground truths and challenging machine learning algorithm development. Consider the following truthing tasks for natural household speech:
  • Labeling what was said: think about how often people mispronounce words, talk over others, or simply mumble their speech.
  • Segmenting when a given utterance/thought begins and ends: how many complete thoughts are in a spoken segment? What happens if speech is fragmented? How close to the start- and end-points of speech can we segment without cutting out essential data?
  • Labeling sounds: often there are non-human sounds in the background that we may or may not recognize. Additionally, people often make non-speech sounds that need to be considered.
If that wasn't hard enough, now consider audio collected from households containing babies. Babies not only introduce more chaotic speech, but they have a language all their own that requires truth labels. Although many of the aforementioned categories don't have a right or wrong way of being labeled, they do have the potential to introduce inconsistencies. To decrease the number of ground truth discrepancies, we created data tagging software called VersaTag. VersaTag is a GUI-based labeling system that can be distributed to volunteers to tag large quantities of audio. We are developing this software through an iterative process, decreasing truthing inconsistencies with each new improvement. VersaTag has already dramatically reduced the irregularities in our audio labels, and through the iterative development process, we are excited to continue improving!

Speakers
avatar for Jill Desmond

Jill Desmond

Senior Data Scientist, VersaMe
Jill is the Senior Data Scientist at VersaMe. She is currently collecting data and developing algorithms to provide feedback to parents regarding the audio environment that their child is exposed to. Jill has a Ph.D. in Electrical Engineering from Duke University, where she researched reverberation detection and mitigation in cochlear implants.



Tuesday May 17, 2016 1:10pm - 1:30pm
Ada

1:10pm

Data & Metadata at the Internet Archive
The Internet Archive has many petabytes of archived webpages, books, videos, and images. Recently we've been making a big effort to make our data and metadata more accessible to outside users. I'll show off some of the methods to download stuff from the Archive, and then I'll show some example projects using this data.

Speakers
avatar for Greg Lindahl

Greg Lindahl

Engineer, Internet Archive
I'm currently working on adding search to the Internet Archive's "Wayback Machine" web archive, but I'm interested in all kinds of data topics.



Tuesday May 17, 2016 1:10pm - 1:50pm
Gardner

1:10pm

Scalably Internationalizing Millions of Latent Semantic Labels
We've built a classification system that can map "Software Developer", "MTS", and "Code Monkey" as well as millions of other English language entities into a common semantic space with just a few thousand labels, which we use to understand people's job titles, skills, majors, and degrees. We're now working on internationalizing this system in a scalable way. The original method was labor intensive, so we have come up with an approach that leverages our English language work to provide good quality results in other languages with a small fraction of the effort.

Speakers
avatar for Xiao Fan

Xiao Fan

Dev Manager, Workday
My team is currently working on automated internationalization for a tool that provides semantic labels for plain English job titles.



Tuesday May 17, 2016 1:10pm - 1:50pm
Markov

1:40pm

Of Rules and Probabilities: Computational Linguistics Methods for Improving Machine Learning Models
Supervised machine learning models are extremely powerful and highly useful for processing vast amounts of text. Their applications include sentiment analysis, text classification, topic mining, part of speech tagging, and named entity recognition, among many others. However, supervised models rely heavily on large amounts of annotated data and furthermore require that the annotations be consistent and accurate. In practice, obtaining high quality annotated data, especially with strong inter-annotator agreement, is not always possible for legal and privacy reasons: there are some data that organizations may not be allowed to crowd source. In this talk I propose several methods to help machine learning models get over the hurdle of insufficient labeled data by leveraging a number of computational linguistics techniques. Specifically, focusing on CRF (conditional random field) model for Named Entity Recognition, I discuss how the use of language feature engineering, artificial dataset generation, and post-processing rules can significantly improve model performance, which otherwise suffers from the bottle-neck of insufficient training data. I propose a number of scalable and practical methods that machine learning practitioners can use in situations where obtaining more training data via crowdsourcing is not a viable option.

Speakers
avatar for Vita Markman

Vita Markman

Staff Software Engineer, Computational Linguist, LinkedIn
As a Staff Software Engineer at LinkedIn, I work on various natural language processing applications such as query understanding, sentiment analysis, and member /job data standardization. Before joining LinkedIn, I was a Staff Research Engineer at Samsung Research America, where among other projects, I worked on extracting topic-indicative phrases from a stream of closed caption news data in real-time and text-mining customer support chat-logs... Read More →



Tuesday May 17, 2016 1:40pm - 2:00pm
Ada

2:10pm

Aspect Based Sentiment Analysis in 20+ Languages
We benchmark a number of statistical approaches for ABSA (via SemEval public data) against a linguistic approach. We discuss parsing, used in most benchmarked systems, and its two main branches, probabilistic and symbolic parsing. Finally, we propose an alternative approach which combines the best of both paradigms: linguistic/symbolic processing for topic and polarity detection and Machine Learning for aspect categorization.

We present a grammar-based approach to Aspect-Based Sentiment Analysis (also known as Topic-Based) which is currently available in 20+ languages. When we say it is available we mean that these 20+ languages are in production in numerous commercial projects mainly in the area of VoC and survey coding projects. We describe the typical ingredients of a linguistic platform: 
  • on the software side: a language-independent lexical analyzer and a PDA-based non-deterministic GLR parser
  • on the data side: corpus-based lexicons (with up to 300 million entries for morphologically complex languages like Finnish); and unification grammars (with anything from 500 to 1000 rules per language); 
  • on the customization side: sentiment rules (around 1000 per language); domain-specific categorization rules; etc.

Probably the main advantage of the engine is that adding support for a new language is a matter of changing the data side (grammar and dictionaries), which can be done quickly and efficiently.

The system achieves 70% accuracy out-of-the-box on most domains and types of texts, and up to 90% accuracy when adapted to specific domains. Domain adaptation is carried out by adding a small number of domain-dependent rules; the process is incremental and predictable. The grammars can be efficiently adapted to match the peculiarities of different types of text, from social media to news. No manually-annotated corpora are needed to train the system, since it does not require any sort of training.

Speakers
avatar for Antonio Valderrabanos

Antonio Valderrabanos

CEO and Founder, Bitext
Antonio Valderrábanos, CEO & Founder at Bitext | I have long experience using Deep Linguistic Analysis to solve business problems, particularly in the area of Text Analytics. I started working for large R&D labs, at IBM and Novell. I developed the first grammar checker released with a commercial word processor (Microsoft Word). Then, I founded Bitext, a company focused on Deep Linguistic Analysis. | | Bitext is specialized... Read More →


Tuesday May 17, 2016 2:10pm - 2:50pm
Ada

2:10pm

SAMEntics : Tools for paraphrase detection and paraphrase generation
Sparse ground truth, mediocre quality of training data, limited representation of novel queries, heavy biases due to human intervention, and large time overheads associated with manual cluster creation are inconveniences that both partners and the Watson Ecosystem technical team face on a day-to-day basis. Enriching ground truth, boosting the quality of training data, accounting for novel queries, and minimizing the biases and time overheads caused by human intervention therefore emerge as preprocessing requirements that are crucial to a more seamless transition when adopting a cognitive service powered by Watson. SAMEntics (Same + Semantics) has been conceptualized for this exact purpose and provides an efficient alternative for handling large volumes of text across domains at scale. It comprises tools for paraphrase detection and paraphrase generation and is directed at:
  • discovering rewording in sentences across domains
  • bucketing hierarchical categories within domains by capturing intent
  • expediting question(s)-answer(s) mapping
  • rendering syntactically correct phrasal variations of sentences while retaining semantic meaning
all with the aim of enriching partner ground truth, boosting training data quality, and minimizing the biases and time overheads caused by human intervention. SAMEntics thus provides an intelligent alternative for handling large volumes of text efficiently: it automatically renders clusters based on user intent in a hierarchical manner and generates rewordings of user queries in the case of sparse and/or poor-quality training data. Join us as we go over the current and emerging state of the art in this space. Reflect on what is changing the world in this era of cognition. Dive deep into the pipeline and the core algorithmic paradigms that power a paraphrase detection and paraphrase generation engine. And leave with an understanding of what it takes to build a product that provides data science-as-a-service.
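
For readers unfamiliar with paraphrase detection, the generic idea can be sketched in a few lines of Python: represent sentences as vectors and treat highly similar pairs as candidate rewordings. This toy sketch (TF-IDF vectors, invented queries, an illustrative threshold) only shows the concept; it is not the SAMEntics implementation.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    queries = [
        "how do I reset my password",
        "forgot my password, how can I change it",
        "what are your office opening hours",
    ]
    vectors = TfidfVectorizer().fit_transform(queries)
    sims = cosine_similarity(vectors)

    # Pairs above an (illustrative) similarity threshold are candidate paraphrases.
    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            if sims[i, j] > 0.2:
                print("paraphrase candidates:", queries[i], "<->", queries[j])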

Speakers
avatar for Niyati Parameswaran

Niyati Parameswaran

Data Scientist, IBM Watson
Niyati works as a data scientist for the Watson Ecosystem team. A dream of being able to provide a machine with intelligence that is unique, that can augment our own distinctive intelligence and that ultimately hopes to answer Alan Turing's question of 'Can machines think?' motivates her research. She holds a Bachelors in Computer Science from The Birla Institute of Science & Technology in India, and a Masters in Computer Science with a... Read More →



Tuesday May 17, 2016 2:10pm - 2:50pm
Markov

2:10pm

Sparse data alternatives with neural network embeddings
The advent of continuous word representation technologies such as Word2Vec and GloVe has transformed how data scientists and machine learning experts work with natural language data. One reason these algorithms are so successful is that they offer an efficient, information-preserving way to compress native features (word frequencies) down to the dimensions of the embedded vector space. This is particularly effective in the sparse data context of word count frequencies. Recently, word embedding algorithms have been generalized to generic graph contexts. In this talk we review results of applying this generalization to alternative sparse data contexts such as user-based and item-based recommender algorithms.
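
A minimal sketch of the generalization mentioned above, in the DeepWalk style: random walks over a graph are treated as "sentences" and fed to a skip-gram model, so that nodes (users or items) get embeddings. It assumes networkx and gensim (version 4 or later); the graph and hyperparameters are placeholders, not the speakers' production setup.

    import random
    import networkx as nx
    from gensim.models import Word2Vec

    g = nx.karate_club_graph()  # stand-in for a user-item or item co-purchase graph

    def random_walk(graph, start, length=10):
        """Uniform random walk starting at a node, returned as string tokens."""
        walk = [start]
        for _ in range(length - 1):
            neighbors = list(graph.neighbors(walk[-1]))
            if not neighbors:
                break
            walk.append(random.choice(neighbors))
        return [str(n) for n in walk]

    walks = [random_walk(g, n) for n in g.nodes() for _ in range(5)]

    # Treat the walks as sentences and train a skip-gram (sg=1) model to embed nodes.
    model = Word2Vec(walks, vector_size=32, window=3, min_count=1, sg=1, epochs=10)
    print(model.wv.most_similar("0", topn=3))  # nodes most similar to node 0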

Speakers
avatar for Marvin Bertin

Marvin Bertin

Machine Learning Scientist, Skymind
MACHINE LEARNING SCIENTIST. I build intelligent applications with Machine Learning and Deep Learning for large-scale applications. | Developed like2vec = product co-purchase graph + DeepWalk + Recommender System.
avatar for David Ott

David Ott

Student, Galvanize
avatar for Mike Tamir

Mike Tamir

Chief Data Scientist, InterTrust


Tuesday May 17, 2016 2:10pm - 2:50pm
Gardner

3:00pm

News Analytics in Finance
In this talk we will discuss the evolution of the news analytics landscape from the perspective of participants in the global financial industry. We will discuss the development of several current key ML/NLP projects at Bloomberg, such as sentiment analysis of financial news, prediction of market impact, novelty detection, social media monitoring, question answering and topic clustering. These interdisciplinary problems lie at the intersection of linguistics, finance, computer science and mathematics, requiring methods from signal processing, machine vision and other fields. We will cover methods and problem formulation and, throughout, discuss the practicalities of delivering machine learning solutions to problems in finance, highlighting issues such as the importance of appropriate problem decomposition, validation and interpretability. We will also summarize the current state of the art and discuss possible future directions for applications of natural language processing methods in finance. The talk will end with a Q&A session.

Speakers
avatar for Gary Kazantsev

Gary Kazantsev

Head of Machine Learning R&D, Bloomberg



Tuesday May 17, 2016 3:00pm - 3:40pm
Gardner

3:00pm

PhrazIt : Tool for automatic text summarization
Cognition is in virtually everything that humans do, such as language understanding, perception, judgment, learning, spatial processing and social behavior; and given that IBM Watson represents a first step toward truly cognitive systems, it becomes crucial to constantly sharpen its abilities in processing natural language, evaluating hypotheses and learning dynamically across domains. The project we will go over in this talk is aimed at augmenting these very behaviors of IBM Watson, as PhrazIt is focused on enriching raw data and essentially transforming information into insights. PhrazIt’s technologically differentiated core, powered by an augmented extraction-based text summarization algorithm, utilizes a novel contextualized indexing framework, making it a tremendous value-add when deploying cognitive services powered by Watson. Join us as we go over the current and emerging state of the art in the space of text summarization. Reflect on what is changing the world in this era of cognition. Dive deep into the pipeline and the core algorithmic paradigms that power a content extraction engine. And leave with an understanding of what it takes to build a product that provides data science-as-a-service.
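
PhrazIt's own algorithm is not public, but a plain frequency-based extractive summarizer shows what "extraction-based summarization" means in practice: score existing sentences and keep the highest-scoring ones. The sketch below is generic Python with an invented example document.

    import re
    from collections import Counter

    def summarize(text, n_sentences=2):
        """Pick the n highest-scoring sentences, scored by average word frequency."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(re.findall(r"[a-z']+", text.lower()))
        scores = {
            s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())) / (len(s.split()) or 1)
            for s in sentences
        }
        top = sorted(sentences, key=scores.get, reverse=True)[:n_sentences]
        return [s for s in sentences if s in top]  # preserve original order

    doc = ("Watson processes natural language at scale. "
           "Summarization condenses long documents into key sentences. "
           "Extractive methods select existing sentences rather than generating new ones.")
    print(summarize(doc))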

Speakers
avatar for Niyati Parameswaran

Niyati Parameswaran

Data Scientist, IBM Watson
Niyati works as a data scientist for the Watson Ecosystem team. A dream of being able to provide a machine with intelligence that is unique, that can augment our own distinctive intelligence and that ultimately hopes to answer Alan Turing's question of 'Can machines think?' motivates her research. She holds a Bachelors in Computer Science from The Birla Institute of Science & Technology in India, and a Masters in Computer Science with a... Read More →



Tuesday May 17, 2016 3:00pm - 3:40pm
Markov

4:00pm

Time series analytics for Big Data and IoT with Kx
Trying to solve the data riddle purely through the lens of architecture misses a vital point: the unifying factor across all data is a dependency on time. The ability to capture and factor in time is the key to unlocking real cost efficiencies. Whether it’s streaming sensor data, financial market data, chat logs, emails, SMS or the P&L, each piece of data exists and changes in real time, earlier today or further in the past. Unless these are linked together in a way that firms can analyze, there is no way of providing a meaningful overview of the business at any point in time. This talk will demonstrate, using live coding, how kdb+, a columnar relational time-series database with a tightly integrated query language called q, can do aggregations and consolidations on billions of streaming, real-time and historical records for complex analytics.

Speakers
avatar for Fintan Quill

Fintan Quill

Global Head of Sales Engineering, Kx Systems
Fintan Quill is the global head of sales engineering for Kx Systems. An expert in developing database analytic systems, Fintan joined Kx in 2012 after having worked extensively with quantitative teams at a variety of Wall Street investment banks, hedge funds, and trading shops building high-performance Big Data applications. After beginning his career with First Derivatives, a global financial technology consultancy based in Northern Ireland... Read More →


Tuesday May 17, 2016 4:00pm - 4:20pm
Markov

4:00pm

Airbnb and Marketplace Matching
Marketplace matching is about matching both sides of the ecosystem; in Airbnb's case, we're matching guests with hosts and understanding their unique preferences. Airbnb's inventory is extremely diverse, and different trip planners come to the platform with different goals in mind. In 2015, the Search team launched several experiments to help trip planners understand our inventory, narrow in on a subset of results relevant to them and build confidence in their booking decision. In this talk, we'll discuss these experiments and what we've learned from them, successes we've had with Machine Learning and some directions for future work.

Speakers
avatar for Surabhi Gupta

Surabhi Gupta

Engineering Manager, Airbnb
Surabhi is an engineering manager leading the Search and Application Infrastructure teams at Airbnb. Prior to Airbnb, she was a software engineer at Google where she worked on web search ranking and the Google Now team on predictive search. She holds a Masters degree in Computer Science from Stanford University. She spends her free time planning her travels from all the inspiration she gets at work.


Tuesday May 17, 2016 4:00pm - 4:40pm
Gardner

4:00pm

Mining Noisy Transaction Data with Neural Nets
Extracting relevant information from unstructured transaction data presents a challenge for those who may want to use such data for making business decisions, such as underwriting loans or monitoring creditworthiness. Most of our transaction data is in the form of transaction text describing the transaction, often using abbreviations or unknown proper nouns. A common approach for text documents is to encode the words or documents into vectors using one or more neural net layers. These features may then be used in a classification algorithm or other models for predicting an outcome. To this end, we encoded transaction data of small 'sentences', often of only a few words, using skip-gram word2vec models along with RBMs and Deep Belief Nets, utilizing other features such as the credit or debit value of the transaction and institution information. The goal of this discussion is to describe the performance of the model and also considerations for training a neural network in a large-data distributed framework like Spark. Tools used are Deeplearning4j, Spark, and Scala.
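
The talk's stack is Deeplearning4j, Spark, and Scala; purely as an illustration of the approach, the Python/gensim sketch below trains skip-gram embeddings on short transaction "sentences" and concatenates the averaged text vector with structured features such as the transaction amount. The transaction strings and parameters are invented.

    import numpy as np
    from gensim.models import Word2Vec

    transactions = [
        "POS DEBIT STARBUCKS 1234 SAN FRANCISCO",
        "ACH CREDIT PAYROLL ACME CORP",
        "POS DEBIT SQ COFFEE ROASTERS",
    ]
    sentences = [t.lower().split() for t in transactions]

    # Skip-gram (sg=1) embeddings over very short "sentences" of transaction tokens.
    w2v = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1, epochs=50)

    def featurize(text, amount, is_credit):
        """Concatenate the averaged token embedding with structured features."""
        tokens = [t for t in text.lower().split() if t in w2v.wv]
        vec = np.mean([w2v.wv[t] for t in tokens], axis=0)
        return np.concatenate([vec, [amount, float(is_credit)]])

    # The resulting vector would feed a downstream classifier (e.g. a credit outcome model).
    print(featurize("POS DEBIT STARBUCKS 1234 SAN FRANCISCO", -4.50, False).shape)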

Speakers
avatar for Frank Taylor

Frank Taylor

Data Scientist, Earnest, Inc.
I have a background in Physics specializing in statistical modeling of particle decays and later in optical signal processing. I am passionate about Big Data and its potential to gather insight into so many facets of humanity. As our tools get better and more scalable, we have the ability to answer greater questions and build more meaningful products that enrich our lives. Recently I have focused on deep learning and neural nets for the purpose... Read More →


Tuesday May 17, 2016 4:00pm - 4:40pm
Ada

4:20pm

byte2vec: a flexible embedding model constructed from bytes
In today's fragmented, globalized world, supporting multiple languages in NLU and NLP applications is more important than ever. The inherent language dependence in classical Machine Learning and rule-based NLP systems has traditionally been a barrier to scaling said systems to new languages. This dependence typically manifests itself in feature extraction, as well as in pre-processing steps. In this talk, we present byte2vec as an extension to the well-known word2vec embedding model to facilitate dealing with multiple languages and unknown words. We explore its efficacy in a multilingual setting for tasks such as Twitter Sentiment Analysis and ABSA. Byte2vec is an embedding model that is constructed directly from the rawest forms of input: bytes, and is: i. truly language-independent; ii. particularly apt for synthetic languages through the use of morphological information; iii. intrinsically able to deal with unknown words; and iv. directly pluggable into state-of-the-art NN architectures. Pre-trained embeddings generated with byte2vec can be fed into state-of-the-art models; byte2vec can also be directly integrated and fine-tuned as a general-purpose feature extractor, similar to VGGNet's current role for computer vision.
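
byte2vec itself is AYLIEN's model and its exact architecture is not reproduced here, but the underlying "embeddings from bytes" idea can be sketched: treat each word as its sequence of UTF-8 bytes, train a skip-gram model over those byte tokens, and compose word vectors from byte vectors so that unseen words still get a representation. gensim (version 4 or later) and the toy corpus are assumptions.

    import numpy as np
    from gensim.models import Word2Vec

    corpus = ["cats chase mice", "los gatos persiguen ratones"]  # any language, any script

    # Represent every word as its sequence of UTF-8 bytes (as string tokens).
    byte_sentences = [[str(b) for b in w.encode("utf-8")] for s in corpus for w in s.split()]
    byte_model = Word2Vec(byte_sentences, vector_size=16, window=2, min_count=1, sg=1, epochs=100)

    def word_vector(word):
        """Compose a word vector from its byte vectors; unseen words still get a vector."""
        vecs = [byte_model.wv[str(b)] for b in word.encode("utf-8") if str(b) in byte_model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(byte_model.wv.vector_size)

    print(word_vector("gatos")[:4])   # in-vocabulary word
    print(word_vector("Katzen")[:4])  # out-of-vocabulary word still gets a composed vector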

Speakers
avatar for Parsa Ghaffari

Parsa Ghaffari

CEO & Founder, AYLIEN
Parsa Ghaffari is an engineer and entrepreneur working in the field of Artificial Intelligence and Machine Learning. He currently runs AYLIEN, a leading NLP API provider focused on building and offering easy to use technologies for analyzing and understanding textual content at scale.



Tuesday May 17, 2016 4:20pm - 4:40pm
Markov

5:00pm

Panel: Human Intent and Text Understanding
Since Text By the Bay 2015, Machine Learning and NLP have been growing even faster. What are the main trends in understanding human intent via textual interactions? Come and join the discussion!

Moderators
avatar for Alexy Khrabrov

Alexy Khrabrov

Chief Scientist, Nitro/By the Bay
Chief Scientist at Nitro, founder and organizer, SF {Scala, Text, Spark, Reactive}, {Scala, Big Data Scala, Text, Data, ...} By the Bay.

Speakers
avatar for Tihomir Bajić

Tihomir Bajić

VP Engineering
VP Engineering, LTSE.com
avatar for Jenny Finkel

Jenny Finkel

Machine Learning Engineer / Manager, Mixpanel
avatar for Surabhi Gupta

Surabhi Gupta

Engineering Manager, Airbnb
Surabhi is an engineering manager leading the Search and Application Infrastructure teams at Airbnb. Prior to Airbnb, she was a software engineer at Google where she worked on web search ranking and the Google Now team on predictive search. She holds a Masters degree in Computer Science from Stanford University. She spends her free time planning her travels from all the inspiration she gets at work.
avatar for Robert Munro

Robert Munro

CEO, Idibon
Robert is the CEO of Idibon, a company formed with the goal of bringing language technology to all the world's languages. He has worked in diverse environments, from Sierra Leone, Haiti and the Amazon to London, Sydney and San Francisco. He completed a PhD in Computational Linguistics as a Graduate Fellow at Stanford University. Outside of work, he learned about the world’s diversity by cycling more than 20,000 kilometers across 20 countries.



Tuesday May 17, 2016 5:00pm - 6:00pm
Gardner

6:00pm

Happy Hour
Join us for a community gathering after the full day of talks. Full bar, fine food, and the company of the speakers, panelists, and the best minds in data.

Tuesday May 17, 2016 6:00pm - 8:00pm
The Galvanize Lobby
 
Wednesday, May 18
 

8:45am

Opening Remarks
Speakers
avatar for Alexy Khrabrov

Alexy Khrabrov

Chief Scientist, Nitro/By the Bay
Chief Scientist at Nitro, founder and organizer, SF {Scala, Text, Spark, Reactive}, {Scala, Big Data Scala, Text, Data, ...} By the Bay.


Wednesday May 18, 2016 8:45am - 9:00am
Gardner

9:00am

Let's Build Civic Tech to Scale Democracy for a Better Future
I'm planning to outline some of the major challenges facing our political system (especially as they pertain to citizen engagement, empowerment, and faith in the process), suggest some ways that civic applications (largely made possible by open data) are beginning to address these challenges (somewhat of a quick survey of the field), and then close with a call to action.

Speakers
avatar for Matt Mahan

Matt Mahan

CEO, Brigade
Matt Mahan is CEO and cofounder of Brigade. He previously served as CEO of Causes, the world’s largest online campaigning platform. Matt grew up in Watsonville, CA, is a Teach for America alum and former Harvard student body president.


Wednesday May 18, 2016 9:00am - 9:40am
Gardner

9:50am

Synthesizing human and machine capabilities
Machine learning and artificial intelligence have made tremendous advances in the last several years. Machines are now better than humans at many tasks - facial recognition, medical diagnosis, parole decisions, driving cars - and the list is rapidly growing. As these tasks are ceded to machines, it creates new opportunities for humans to think less like machines and more like … humans. Skills like the ability to empathize, the ability to leverage ambient information, and the ability to grasp broad context are tremendously valuable and in the purview of humans. As machine algorithms become ubiquitous it is these human capabilities that become a means of differentiation. With today’s access economy it is now possible to harness the unique abilities of humans into products and services. New software systems are emerging that distribute their workloads across varied processors - be they machine or human. This synthesis enables new capabilities that go far beyond what is possible using either one resource alone.

Speakers
avatar for Eric Colson

Eric Colson

Chief Algorithms Officer, Stitch Fix
Chief Algorithms Officer @ Stitch Fix. Former Netflix Vice President of Data Science and Engineering. Talk to me about social algorithms, human computation, leading large data science teams.



Wednesday May 18, 2016 9:50am - 10:10am
Ada

9:50am

Black Magic: How to apply Machine Learning to real-world problems
Surprising progress has been made on Machine Learning algorithms and infrastructure recently. These techniques are being used to solve a lot of previously unsolvable problems. However, not all problems are directly ML-applicable. Some need to be divided and transformed properly before the powerful ML techniques can be applied. In this session, I’ll talk about what kinds of “black magic” we apply to the data problems we face daily at Mattermark, so that we can solve them in a graceful and scalable way using Machine Learning.

Speakers
avatar for Evion Kim

Evion Kim

Lead Machine Learning Engineer, Mattermark
Mattermark is driven to uncover all of the world's business information. We are using machine learning and big data techniques to collect, organize and analyze these data. Evion Kim is Lead Machine Learning Engineer at Mattermark, where he focuses on building tools to collect new data signals and find insightful derived data using machine learning and statistical techniques. He is also building and maintaining the scalable... Read More →


Wednesday May 18, 2016 9:50am - 10:30am
Markov

9:50am

Hot Legal Issues
Overview of recent legal developments affecting Big Data, including the recent Federal Trade Commission Report on Big Data

Speakers
avatar for Francoise Gilbert

Francoise Gilbert

Partner, Greenberg Traurig Attorneys at Law
Françoise Gilbert is a partner at Greenberg Traurig, and practices in the firm’s Silicon Valley office, located in East Palo Alto, California, where she advises public companies, emerging technology businesses and non-profit organizations on the entire spectrum of domestic and international privacy and cyber security legal issues. Francoise has focused on information privacy and security for more than 25 years; she regularly deals with... Read More →



Wednesday May 18, 2016 9:50am - 10:30am
Gardner

10:10am

Data From Click To Settlement: A Full-Funnel UX Approach
For young law firms in particular, the challenge of finding and retaining new clients is won by responding to inbound calls and intakes more quickly than the competition. However, an effective lawyer cannot spend his/her entire day on mobile messaging.  Furthermore, the high cost and diversity of channels available for reaching the online audience continues to add complexity.

We present our system for systematically acquiring new clients, which focuses on integrating data across the "full funnel" User Experience. Using precision-based advertising and 24/7 mobile messaging APIs, we show how we bridge online and offline interactions to create a near-frictionless and ethically sound connection between law firms and potential new clients. We show how these techniques may be generalized to small and mid-sized businesses looking to take advantage of data to fuel their growth.

Speakers
avatar for Michael Terry

Michael Terry

CTO, Lawfty
I've spent the last 8 years as a data engineer and CTO, bringing data pipelines to numerous verticals. At Raytheon Space & Airborne Systems, I developed a radar data and signal processing pipeline for the military. At my first startup SeeTheScene.TV, we built a streaming video system for bars. Now, as technical Co-Founder at Lawfty, my team and I use precision advertising to connect people with lawyers in their local region. | | I'm... Read More →



Wednesday May 18, 2016 10:10am - 10:30am
Ada

10:40am

End-2-End Monitoring and Troubleshooting a Real-Time Data Pipeline
Real-time streaming pipelines are composed of a combination of application code, data frameworks and the underlying infrastructure, which has increasingly become containerized. The application code and the underlying data frameworks are closely intertwined, resulting in a blurred line between the application and data processing tiers. The highly complex, distributed and interconnected nature of these services makes monitoring and troubleshooting these pipelines very challenging. In this talk, we will:
  • Examine the different components used to build a typical real-time streaming pipeline
  • Evaluate the importance of modeling the “pipeline” as a first-class object that should be monitored
  • Discuss the challenges of monitoring and troubleshooting a real-time streaming pipeline
  • Review how to capture overall pipeline metrics that map to specific metrics from each component, such as throughput, latency, backpressure and error rate
  • Provide a set of best practices for organizing information to begin troubleshooting your data processing frameworks when things go wrong
  • Present a simple way to build a "Pipeline View" that captures the health of each component in the pipeline, as well as the dependencies between components, and gives an indication of any issues in the pipeline at a glance (a toy sketch of this idea follows below)
  • Demonstrate how to visually correlate pipeline metrics and pipeline health with underlying infrastructure issues, so that problems can be quickly analyzed and resolved
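
The toy Python sketch below illustrates the "Pipeline View" idea only: the pipeline is modeled as an ordered list of components whose overall status is rolled up from per-component metrics. The component names, metrics and thresholds are invented; this is not OpsClarity's product.

    # Toy "pipeline view": roll up per-component metrics into an overall health status.
    PIPELINE = ["collector", "kafka", "spark_streaming", "cassandra"]  # upstream -> downstream

    metrics = {
        "collector":       {"error_rate": 0.001, "lag_seconds": 2},
        "kafka":           {"error_rate": 0.000, "lag_seconds": 30},
        "spark_streaming": {"error_rate": 0.020, "lag_seconds": 120},
        "cassandra":       {"error_rate": 0.000, "lag_seconds": 1},
    }

    def component_health(m):
        """Classify a component from its metrics using illustrative thresholds."""
        if m["error_rate"] > 0.01 or m["lag_seconds"] > 60:
            return "critical"
        if m["lag_seconds"] > 20:
            return "warning"
        return "ok"

    def pipeline_view(components, metrics):
        """Health of each component plus an overall status (worst component wins)."""
        order = {"ok": 0, "warning": 1, "critical": 2}
        statuses = {c: component_health(metrics[c]) for c in components}
        overall = max(statuses.values(), key=order.get)
        return statuses, overall

    print(pipeline_view(PIPELINE, metrics))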

Speakers
avatar for Alan Ngai

Alan Ngai

VP of Engineering, OpsClarity
As a co-founder and the VP of Engineering at OpsClarity, Alan brings over 15 years of experience building systems and engineering teams from the ground up. Prior to OpsClarity, he led teams in solving large scale, complex problems at companies such as eBay, Yahoo and Telenav. Over the years, Alan has worked on building cloud platforms, search, GIS services and navigation, SE automation platforms, and more. He has a Bachelor of Science degree... Read More →


Wednesday May 18, 2016 10:40am - 11:00am
Ada

10:40am

How Data Science is Evolving Technology Education
Galvanize is at the forefront of educating a technology-trained workforce in web development, data science, and data engineering, while also helping to create and launch the companies of tomorrow. At every one of our campuses, we offer curriculum designed and taught by a full-time faculty of experts, who are active in industry and know what startups and companies need. Along with our curriculum, each campus is a base for startups, innovation labs, and established companies, where we provide access to training, talent, and community. In this session, learn about how Galvanize is using data science to revolutionize technology education through adaptive learning, automated assessments, and predictive student interventions.

Speakers
avatar for Ryan Orban

Ryan Orban

VP of Business Operations & Expansion, Galvanize/ZipFian
Ryan was the co-founder & CEO of Zipfian Academy, the leading provider of immersive data science education focused on solving practical, real-world problems. Graduates of the 12-week program work at companies such as Facebook, Twitter, Airbnb, Uber, and Square. After joining forces with Galvanize, Ryan focuses on scaling and expanding immersive education worldwide.


Wednesday May 18, 2016 10:40am - 11:00am
Gardner

10:40am

An approach to internal search query analysis
Course Hero is an education technology company that provides subscription services for accessing crowd-sourced study materials. Our business is driven by SEO and by having the most selective student-generated materials related to a course. We show a small preview of the content to the search engines, which gets indexed. When users search long-tail education-related queries for which our content is relevant, our content link shows up among the top results. This is how our product gets visibility. We have three main types of content on our website: student-generated study documents, Q&A, and flashcard sets. Our internal search functionality is the method by which our customers discover content on our website. Content consumption and engagement metrics provide insightful information about the relevancy of our internal search algorithm and the quality of our content repository. Data mining these metrics helps us understand what our customers' demands are and how well our product is catering to them. Using unstructured search query data, as well as structured consumption and engagement metrics, we mined a meaningful list of high-value content categories that yielded a sizeable traffic increase. As part of the talk, we will go over the analytical methodology for mining and identifying these high-value categories.

Speakers
avatar for Max Ho

Max Ho

Business Analyst, Course Hero
Max Ho is the first Business Analyst at Course Hero, an online learning platform for students to access study resources like course materials, flashcards, and tutors. He is passionate about applying data science to practical business applications. At Course Hero, Max is working very closely with product, marketing, and community strategy teams to help build successful products and grow Course Hero's business.
avatar for Dhruv Sampat

Dhruv Sampat

VP of Analytics and Business Analyst, Course Hero
Dhruv Sampat is the VP of Analytics at Course Hero, an online learning platform for students to access study resources like course materials, flashcards, and tutors. He joined the company in 2011 and has helped it grow into a leading platform for online study help. Dhruv has been in the analytics domain for 7 years. Prior to joining Course Hero he was the first business analyst at Chegg where he worked very closely with the Product and... Read More →



Wednesday May 18, 2016 10:40am - 11:00am
Markov

11:10am

Design for Local Government: Data, Services, and Transparency
This talk by Steve Pepple & Morgan Keys will cover how the design team at OpenGov helps local governments better use their data, collaborate on public services, and communicate with citizens.

We'll discuss how we work to understand our users and design products for a group of people who have been underserved for too long. Through user research we’ve found smart and dedicated public servants doing astonishing analysis without the latest technology, often with nothing more than spreadsheets.

We'll share how we explore and solve the same problems with data science, data visualization, and interaction design.

Designing more efficient and intuitive processes for local governments reduces data silos and error-prone administrative work. It gives local leaders time to be strategic and better communicate their decisions to the public.

Speakers
avatar for Steve Pepple

Steve Pepple

Product Designer and Developer, OpenGov
Steve Pepple is a Bay Area designer and software developer who works to improve city streets and civic information systems. He is a product designer at OpenGov, where he designs software that improves how governments spend money, make decisions, and communicate with citizens. | | His recent art visualizes urban activity and environment in cyberspace as a reflection of people’s activity in physical places and is currently on display at the... Read More →




Wednesday May 18, 2016 11:10am - 11:30am
Gardner

11:10am

Creating Value by Turning Law Into Data
The law has traditionally considered itself as “art not science”, but what value can be created by rethinking that art/science divide, and what would law as part science look like? Industries like sports, politics, and journalism provide recent and powerful comparisons, “moneyball” and Obama’s data-driven 2012 election campaign being the most famous examples of transformation. In the law, questions that can be answered with better data abound: how has a judge ruled on a defined type of case before? What is the likely outcome of filing a specific motion? How successful has a lawyer or firm been in representing certain kinds of cases, or before certain judges? This presentation will focus on the technology that Ravel Law uses to extract and classify such information, and on how it has built a business that harnesses the results.

Speakers
avatar for Daniel Lewis

Daniel Lewis

CEO, Ravel Law


Wednesday May 18, 2016 11:10am - 11:30am
Ada

11:10am

Text Analytics Simplified
In this talk, we will give an introduction to the Data Ninja services that greatly simplify your text analytics needs. We will briefly demonstrate the core functionalities of the services and showcase how an end-to-end application can be built using these services. Our goals are to enable app developers to build content-intelligent applications with unstructured data and to enable data scientists to explore the rich semantics from big data. We will provide a live demo on how to build a text analytics pipeline from scratch using the Data Ninja services. We will show you the steps starting from signing up to the services to producing actionable insights using machine-learning techniques on top of the semantic contents obtained from the Data Ninja services.

Speakers
avatar for Trung Diep

Trung Diep

Architect, Docomo Innovations
Trung heads the engineering team responsible for delivering the quality and performance of the Data Ninja services. Prior to joining Docomo Innovations, Trung has previously worked at Intel, Mercury Interactive, Rambus, and Broadcom. He received his B.S. and B.A. degrees in Electrical Engineering and Computer Science, respectively, from Rice University and M.S. and Ph.D. degrees in Computer Engineering from Carnegie Mellon University. His... Read More →
avatar for Ronald Sujithan

Ronald Sujithan

Founder and Principal, Sujithan Consulting, LLC


Wednesday May 18, 2016 11:10am - 11:30am
Markov

11:40am

Unsupervised NLP Classification with Clustering
In the world of local government finances, training data is sparse. Language-based training data is almost non-existent. Furthermore, fiscal language in government has a high domain-knowledge requirement for building training data and developing strong intuitions. This makes traditional supervised methods difficult to use successfully, as the training data you generate is always lagging raw data growth. To help tackle these challenges in performing NLP analysis, we'll show techniques around relationship extraction and clustering to perform data understanding on domain-heavy topics. We'll explore these techniques on published local government budget PDFs to extract topics and gain insights into the purpose of domain-specific text. The format of the talk will follow each key point with code examples. First we’ll talk about data challenges in local government, and the lack of established knowledge bases around that data. Specifically we’ll explore the unknown-number-of-classes problem and how unsupervised algorithms can garner insights. Then we’ll focus on the families of clustering algorithms available and how they allow you to focus on edge associations rather than holistic state spaces. Following that we’ll explore some useful techniques for optimizing computation and how missing or skipped data points can be linked by association. Finally we’ll combine the pieces we’ve shown to perform topic extraction and understanding from public financial budgets.
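
As a deliberately tiny illustration of the approach, the sketch below vectorizes budget-style line items with TF-IDF and clusters them with DBSCAN, which does not require the number of classes up front. The line items, the eps value, and the use of scikit-learn are illustrative assumptions, not the speaker's actual pipeline.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import DBSCAN

    lines = [
        "police department overtime salaries",
        "fire department overtime salaries",
        "road resurfacing and pothole repair",
        "street paving capital project",
        "public library book acquisitions",
    ]
    X = TfidfVectorizer(stop_words="english").fit_transform(lines)

    # Cosine distance suits sparse TF-IDF vectors; eps is corpus-dependent and illustrative.
    labels = DBSCAN(eps=0.9, min_samples=1, metric="cosine").fit_predict(X)
    for label, text in zip(labels, lines):
        print(label, text)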

Speakers
avatar for Matthew Seal

Matthew Seal

Data Scientist, OpenGov
I'm an early employee of OpenGov who has a passion for data models, and data understanding. I've had a broad exposure to software development of various types from front-end code, to db architecture, to machine learning. I graduated from Stanford University with a BS in Electrical Engineering and a Masters in Computer Science, focused in robotics and AI. | | Feel free to ask me about anything software related. I've recently been diving deep... Read More →




Wednesday May 18, 2016 11:40am - 12:20pm
Gardner

11:40am

Hidden in plain sight: Using law to summarize the law
Summarization remains one of the most conspicuously unsolved problems in NLP today. There is plenty of active research, and the best available is reasonably capable of capturing the information contained in a body of text, but the output is often clumsy compared to what might be written by a human. More importantly for those of us working in legal informatics, the law is a notoriously conservative profession. Lawyers will justifiably be more comfortable relying on summaries written by fellow members of the bar. Fortunately, judges provide high quality, detailed summarization of the cases they cite to in their opinions, and a rich body of this data exists throughout the body of historical case law. We present a technique for enriching the display of judicial opinions with high quality summary data extracted from subsequent opinions, using a variety of state of the art open-source software tools. FOSS tools used include Antlr, for recognition of deterministic sequences using formal grammars, and Apache UIMA for construction of multi-layered indexes of recognized entities, such that their various juxtapositions can be used for further inference.

Speakers
avatar for Richard Downe

Richard Downe

VP of Data Science, Casetext
I enjoy working on interesting problems, and have tried to work on a wide variety thereof. These have included FPGA design at IBM's Watson labs in Yorktown Heights, NY, research into the progression of coronary artery disease (focusing on image segmentation and ML prediction of disease changes) in grad school at the University of Iowa, and NLP analysis, first of medical literature at IBM's Almaden lab in California, and subsequently on the law... Read More →




Wednesday May 18, 2016 11:40am - 12:20pm
Ada

11:40am

From text to knowledge via ML algorithms - the Quora answer
Q&A sites like Quora aim to grow the world’s knowledge. In order to do this, they need not only to get the right questions to the right people so they can answer them, but also to get existing answers to the people who are interested in them. To accomplish this we need to build a complex ecosystem that takes text as the main data source, but also takes into account issues such as content quality, engagement, demand, interests, and reputation. Using high-quality data you can build machine learning solutions that help address all of those requirements. In this talk I will describe some interesting uses of machine learning that range from different recommendation approaches such as personalized ranking to classifiers built to detect duplicate questions or spam. I will describe some of the modeling and feature engineering approaches that go into building these systems. I will also share some of the challenges faced when building such a large-scale knowledge base of human-generated content. I will use my experience at Quora as the main driving example. Quora is a Q&A site that, despite having over 80 million unique visitors a month, is known for maintaining high-quality knowledge and content.
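
To give a flavor of what a duplicate-question classifier looks like, here is a generic sketch with two hand-rolled pairwise features and a logistic regression; the question pairs and features are invented, and this is not Quora's production system.

    from sklearn.linear_model import LogisticRegression

    def pair_features(q1, q2):
        """Two simple pairwise features: token Jaccard overlap and length difference."""
        a, b = set(q1.lower().split()), set(q2.lower().split())
        return [len(a & b) / len(a | b), abs(len(a) - len(b))]

    pairs = [
        ("how do I learn python", "what is the best way to learn python", 1),
        ("how do I learn python", "how tall is mount everest", 0),
        ("is coffee healthy", "does coffee have health benefits", 1),
        ("is coffee healthy", "how do I brew espresso", 0),
    ]
    X = [pair_features(q1, q2) for q1, q2, _ in pairs]
    y = [label for _, _, label in pairs]
    clf = LogisticRegression().fit(X, y)

    # Probability that a new pair is a duplicate.
    print(clf.predict_proba([pair_features("how can I learn python quickly",
                                           "best way to learn python")])[:, 1])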

Speakers
avatar for Xavier Amatriain

Xavier Amatriain

VP Engineering, Quora
VP of Engineering, Quora



Wednesday May 18, 2016 11:40am - 12:20pm
Markov

1:10pm

Open Collaboration for Civic Impact
Americans consistently rank government dissatisfaction as the “most important problem facing the U.S.” It has become increasingly clear that a host of different actors–public and private, formal and informal, citizens and communities–will need to work together to successfully address the challenges we face as a society. While civic innovation is a bright spot of hope, the sector is still nascent. Key to catalyzing this new wave of civic solutions will be knowledge sharing and storytelling around success and impact, which is not happening in any standardized, systemic, aggregated way across the sector. We believe that cataloging success (and failure) via an open collaboration project lifecycle will lead to new waves of inspired action directed towards more effective ends.

Speakers
avatar for Lawrence Grodeska

Lawrence Grodeska

Co-founder & CEO, CivicMakers
Lawrence Grodeska is a maker, communicator and civic geek who uses technology to help civic leaders in the public and private sector engage key audiences. Over 15 years, Lawrence has built programs and products to transform how citizens interact with their communities and governments. Lawrence regularly speaks about civic innovation at events such as SXSW Interactive, The Nonprofit Technology Conference, and Code for America Summit.



Wednesday May 18, 2016 1:10pm - 1:30pm
Gardner

1:10pm

Case Law and ML on Spark
Apache Spark powers Ravel’s case law dissecting backend. This talk will cover the motivation for migrating to Spark, the benefits, pain points and experiences running on a legal corpus. Over a year ago, Ravel Law’s batch processing moved from Apache Pig to Apache Spark. Spark eliminated many Pig and Hadoop related pain points and has enabled rapid development of our case law processing pipeline to include running many NER, clustering and machine learning models, building search indexes and constructing the legal citation graph with case law, statutes and judges as nodes. Spark helps accelerate Ravel Law’s processing, development and integration of machine learning systems. Learn how we use Spark to prep our data and some of the building blocks we use to create our legal research and analytics products.

Speakers
avatar for Jeremy Corbett

Jeremy Corbett

Senior Lead Backend Engineer, Ravel Law



Wednesday May 18, 2016 1:10pm - 1:30pm
Ada

1:10pm

Single Customer View
A Single Customer View is a key enabler of growth for companies. It provides a deep understanding of their customers/users, allowing them to drive better targeting for Product, Marketing and Sales. As a Single Customer View is all about understanding customer behaviour based on past activities and predicting future ones, the practical problem is the amount of data to be processed in near real time. But a Single Customer View is not a monstrous beast; it is in fact a really powerful tool which can be built, maintained and exploited more easily than expected, for the benefit of the whole company. We’ll dive into the key concepts and how we can leverage modern data processing infrastructure to build one.

Speakers
avatar for Thomas Trolez

Thomas Trolez

Nitro, Inc.
Data-driven by nature.



Wednesday May 18, 2016 1:10pm - 1:30pm
Markov

1:40pm

Making Predictions Under Lending Regulations
One particular difficulty when working in lending is being subject to a variety of different regulatory requirements. This has many implications when working with data - the requirements often affect the types of models you can build or restrict which predictors you can include when modelling. In this talk we present how the Fair Credit Reporting Act (FCRA) impacted our loan application accept / reject modelling. In particular, the FCRA requires explicit reasons for a declined application, which we satisfied through a thoughtful use of ensemble methods.
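
One simple way to produce explicit reasons, shown below as a hedged sketch, is to rank each predictor by its contribution to a linear score relative to a baseline applicant. The feature names, coefficients and values are invented, and this is not Earnest's actual ensemble approach; it only illustrates the general idea of deriving adverse-action reasons from a model.

    import numpy as np

    feature_names = ["debt_to_income", "credit_history_years", "recent_delinquencies"]
    coefs = np.array([-3.0, 0.8, -2.0])        # fitted coefficients (invented)
    baseline = np.array([0.25, 8.0, 0.3])      # typical approved applicant (invented)
    applicant = np.array([0.55, 2.0, 2.0])

    # Each feature's contribution to the score, relative to the baseline applicant.
    contributions = coefs * (applicant - baseline)
    reasons = sorted(zip(feature_names, contributions), key=lambda kv: kv[1])

    # The most negative contributions become the stated decline reasons.
    print("top decline reasons:", [name for name, c in reasons[:2] if c < 0])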

Speakers
avatar for Yujay Huoh

Yujay Huoh

Data Scientist, Earnest, Inc.
Data Scientist at Earnest. Before Earnest, I worked for a private bank in San Francisco doing credit modelling and enterprise risk. Statistics Ph.D from UC Berkeley. Interested in spatial statistics, numerical programming, and all things Bayesian.


Wednesday May 18, 2016 1:40pm - 2:00pm
Ada

1:40pm

Litigation Mitigation: Using NLP to Minimize Errors in Contracts
While work in natural language processing in the legal domain has primarily focused on case law and e-discovery, the domain of contracts and agreements has received comparatively little attention. Our work focuses on the potential for errors and oversights in such documents, a common problem with consequences ranging from professional embarrassment to litigation in the worst case. The American Bar Association estimates that administrative and substantive errors account for some 75% of all legal malpractice claims; our own small-scale analysis suggests that some one in five civil cases involve disputes over ambiguous contract language. We present our work applying computational linguistics techniques to automatically detecting such errors, focusing particularly on inconsistencies, ambiguities and style issues.  Our evolving pipeline for contract analysis builds on augmenting corpora of contracts with manual annotations including judgments of ambiguity.

Speakers
avatar for Shipra Dingare

Shipra Dingare

Lead Engineer, LitIQ
avatar for Gurinder Sangha

Gurinder Sangha

Founder and CEO, Lit IQ
I am the founder of Lit IQ, which is using advances in computational linguistics to help lawyers minimize litigation risk. I also teach at the University of Pennsylvania Law School and serve as a Fellow at the Stanford Center for Legal Informatics. | | Prior to Lit IQ, I founded Intelligize, an information services company that helps business professionals more easily research regulatory filings. Intelligize is one of the fastest growing... Read More →



Wednesday May 18, 2016 1:40pm - 2:00pm
Gardner

1:40pm

Privacy Issues in Big Data Processing in light of Data Breaches
Big Data processing is dominating several business applications. Businesses are realizing the benefits of Big Data and analyzing large volumes of data to gain insights about their customers. This causes more data to be collected and stored centrally, and much of that storage occurs in the cloud. This attracts hackers, since they could gain access to large volumes of data about people. Accessing this type of data results in disastrous consequences for individuals whose privacy is violated. In this talk we will look at five major data breaches from the recent past and consider ways to protect people's privacy through a set of best practices. Many in the IT field have realized that it would be very difficult to secure all data that is accessible from the cloud, so better mechanisms should be developed to protect such public data.

Speakers
avatar for S. Srinivasan

S. Srinivasan

Associate Dean & Distinguished Professor, Texas Southern University
Professor and researcher in Information Security, Cloud Computing and Big Data Applications



Wednesday May 18, 2016 1:40pm - 2:00pm
Markov

2:10pm

Building a Graph of all US businesses using Spark technologies
Radius Intelligence (www.radius.com) empowers data science to deliver a unique marketing intelligence platform used by over a hundred US companies. This presentation will explain how Radius is using Spark along with GraphX, MLlib and Scala to create a comprehensive and accurate index of US businesses from dozens of different sources. In particular, I will address problems related to clustering records together based on a graph approach and how to resolve the graph into a set of US businesses. I will discuss some of the models related to cleaning out the noise, how to rank best values and impute missing values, and provide some best practices.
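
The graph-based clustering idea can be sketched compactly: add an edge between any two records that look like the same business, then take connected components as resolved entities. The sketch uses networkx as a single-machine stand-in for Spark/GraphX, with invented records and a deliberately crude matcher.

    import networkx as nx

    records = {
        1: "Joe's Pizza, 123 Main St, SF",
        2: "Joes Pizza 123 Main Street San Francisco",
        3: "Blue Bottle Coffee, 66 Mint St, SF",
        4: "Blue Bottle Coffee Co, 66 Mint Street",
    }

    def match(a, b):
        """Crude pairwise matcher: token overlap above an illustrative threshold."""
        ta = set(a.lower().replace(",", "").split())
        tb = set(b.lower().replace(",", "").split())
        return len(ta & tb) / len(ta | tb) > 0.25

    g = nx.Graph()
    g.add_nodes_from(records)
    for i in records:
        for j in records:
            if i < j and match(records[i], records[j]):
                g.add_edge(i, j)

    # Each connected component is treated as one resolved business.
    print([sorted(c) for c in nx.connected_components(g)])  # -> [[1, 2], [3, 4]]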

Speakers
avatar for Alexis Roos

Alexis Roos

Engineering manager, Radius Intelligence
Alexis has over 20 years of software engineering experience with emphasis in large scale data science and engineering and application infrastructure. | Currently an Engineering Manager at Radius Intelligence, Alexis is leading a team of data scientists and data engineers building Radius business graph modeling over 20 million businesses in the US, created from over 7 billion records from dozens of sources using Spark, GraphX, MLLib and Scala... Read More →



Wednesday May 18, 2016 2:10pm - 2:30pm
Ada

2:10pm

Breaking Down Paywalls for Online Health
Approximately one-quarter of people searching for health information online hit a paywall. Medical knowledge is locked up in non-open access scientific research papers which have copyright licenses that prevent free distribution. However, facts cannot be copyrighted* and may pass through paywalls unencumbered by copyright license restrictions. We have developed a framework to enable access to scientific knowledge. Academic readers with access to papers can locally install and run our freely available Fact Extractor software. After a local PDF paper is identified and approved by the user, Fact Extractor identifies and extracts facts from the scientific paper. The software then distributes the extracted facts to our public Wiki-based server http://factpub.org for everyone to access. Client-side processing for fact extraction means no copies of the paper are distributed. Large-scale adoption of this fact-publishing framework will empower accessibility to health and other scientific research. * Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991)

Speakers
avatar for Pauline Ng

Pauline Ng

Group Leader, Genome Institute of Singapore



Wednesday May 18, 2016 2:10pm - 2:50pm
Markov

2:10pm

Why The Best Minds Of Our Generation Are Thinking About How To Get People to Click on Ads
It'll be Josh Wills, talking about stuff. What's not to like?

Speakers
avatar for Josh Wills

Josh Wills

Head of Data Engineering, Slack



Wednesday May 18, 2016 2:10pm - 2:50pm
Gardner

2:30pm

Building AI that Searches and Sells
Big data is transforming every aspect of society – what impact will it have on the world of sales?

At LeadGenius we've been working on the problem of automating sales activities using big data, machine learning and crowd computing. I'll share how we're attacking three problems in data science as they relate to building artificial intelligence for sales. First, we'll talk about the problem of predicting purchase decisions – sometimes, before they ever happen. Second, we'll talk about the problem of tracking and understanding every person and company in the United States and what they're going to need to buy next. Last, we'll talk about our latest challenge – building machine intelligence that can communicate naturally over email in a way that's indistinguishable from a salesperson.

Speakers
avatar for Anand Kulkarni

Anand Kulkarni

Chief Scientist & Co-Founder, LeadGenius
Anand is founder and Chief Scientist of LeadGenius, a Y Combinator, Sierra Ventures, and Andreessen-Horowitz-backed startup using human computation to automate sales at scale. | | Built on the MobileWorks crowd architecture, LeadGenius applies fair and ethical principles to help crowds of workers find work in the online economy while letting sales team grow their businesses. | | Anand was named as one of Forbes Magazine's "30 under... Read More →


Wednesday May 18, 2016 2:30pm - 2:50pm
Ada

3:00pm

Gathering around the data table
Open data is not just an end but a means to broadening participation in our ongoing acts of self-government. While certainly important, voting in elections is just one act we take in our collective pursuit toward a more perfect union. Data can have a broader impact on the individual acts we take day to day. It can even redefine what it means to participate in a modern democracy. At DataSF, we seek to empower the use of data from the City and County of San Francisco across a spectrum of uses. I’ll discuss ways we are encouraging people to “gather around the data table” to collectively understand and address some of our most pressing challenges. And I’ll highlight projects and initiatives that demonstrate this, including:
  • How we’re putting our data users and data publishers at the center of our program to make sure we’re driving the program around actual needs
  • How we participate in our local San Francisco civic hacking group nearly every week
  • How we raised a digital barn called the Housing Data Hub to provide more context and insights on housing data in San Francisco
  • How we’re beginning to work with existing community institutions to make use of the City’s many open data assets

Speakers
avatar for Jason Lally

Jason Lally

Open Data Program Manager, City and County of San Francisco
Jason Lally is the Open Data Program Manager, working with the City’s Chief Data Officer, Joy Bonaguro, to help operationalize the City’s data strategy. Jason comes to the City by way of a Mayor’s Innovation Fellowship that wrapped up in August 2014. Before that, he worked at the intersection of technology and urban planning as Director of the Decision Lab at PlaceMatters in Denver, CO. He came to open government and open data through his... Read More →



Wednesday May 18, 2016 3:00pm - 3:20pm
Markov

3:00pm

Analyzing Time Interval Data
Analyzing huge amounts of time interval data is a task arising more and more frequently in domains like resource utilization and scheduling, real-time disposition, and health care. Analyzing this type of data using established, reliable, and proven technologies is desirable and required. However, commonly used tools and multidimensional models are not sufficient, because of modeling, querying, and processing limitations. In this talk, I present a tool (TIDAIS) helpful for analyzing large amounts of time interval data. I present a query language and demonstrate some API examples. I will also share how time interval data plays a role in generating real-time user behavior predictions around interests and intent.
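
As a small illustration of the kind of question interval analytics answers ("how many intervals are active at any point in time?"), here is a pandas sketch using a start/end event sweep. The intervals are invented, and this is not TIDAIS or its query language.

    import pandas as pd

    intervals = pd.DataFrame({
        "resource": ["gate_a", "gate_b", "gate_a", "gate_c"],
        "start": pd.to_datetime(["2016-05-18 08:00", "2016-05-18 08:30",
                                 "2016-05-18 09:15", "2016-05-18 08:45"]),
        "end":   pd.to_datetime(["2016-05-18 09:00", "2016-05-18 10:00",
                                 "2016-05-18 10:30", "2016-05-18 09:30"]),
    })

    # Sweep line: +1 at each interval start, -1 at each end, then cumulatively sum.
    events = pd.concat([
        pd.Series(1, index=intervals["start"]),
        pd.Series(-1, index=intervals["end"]),
    ]).sort_index()
    concurrency = events.cumsum()
    print(concurrency)  # number of simultaneously active intervals over time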

Speakers
avatar for Philipp Meisen

Philipp Meisen

CTO & Cofounder, Breinify Inc.
I'm a workaholic visionary always searching for the next adventure to be mastered until success. My passion is bringing innovative ideas to life finding new ways on how to make the impossible possible. | | I've developed software and analyzed data for more than 15 years; different scales, technologies, and requirements. I worked for different companies in different scaled projects, among others Audi, American Airlines, Delta, DHL, dnata... Read More →


Wednesday May 18, 2016 3:00pm - 3:40pm
Ada

3:00pm

Insider Text
"Insider Threat" is a major area of risk for many organisations, in both the government and commercial spheres. Employees, contract staff and suppliers are often in a strong position to perpetrate fraud, steal secret information or intellectual property, or sabotage computer systems, and effectively evade detection for long periods. After-the-fact investigation is often challenging and labour intensive, and early prediction and positive mitigation, which may be far more effective, is even more difficult to automate. To trained eyes, text sources like chat and email often contain signals that insiders are on a path to hostile action. However, making use of these sources in ways that respect individual rights and improve trust is as much of a challenge as the technical one of extracting the signal. Mr. Stewart outlines the role and limitations of text analysis in the automation of predicting insider threat by discussing key results in the area of intent detection, and argues that (given the projected state of the art) many organizations might achieve greater harm reduction by developing or adopting what IBM has called "Systems of Engagement".

Speakers
avatar for Gregor Stewart

Gregor Stewart

VP Product Management, Text Analytics, Basis Technology
As Vice President of Product Management at Basis Technology, Mr. Stewart helps to drive company strategy and ensure that the company’s offerings stay ahead of the curve. Previously Mr. Stewart was the CTO of a storage services startup and a strategy consultant. He holds a Masters in Natural Language Processing from the University of Edinburgh, a BA in PPE from the University of Oxford, and a Masters from the London School of Economics.



Wednesday May 18, 2016 3:00pm - 3:40pm
Gardner

3:20pm

Quantifying Democracy and Freedom with Human Rights and Fertility Metrics
Having revisited the known phenomenon of negative correlation between girls' education and fertility analyzed by Jeffrey Sachs in "The End of Poverty", we proceed to take into account such quantitative ratings of democracy and freedom as the Democracy Index compiled by the Economist Intelligence Unit and Freedom in the World scores published annually by Freedom House. Are all democracies doomed to act as behavioral sinks, in John B. Calhoun's terminology? We will approach the problem by looking at the structure of fundamental and derived human rights. The structure will incorporate the right to have children and the right not to have children, neither of which the founding fathers needed to address directly, given a very different historical situation. The former right is explicitly restricted in China nowadays, while the latter is under attacks elsewhere. Preliminary conclusions will be drawn and future research directions proposed.

Speakers
avatar for Dmitri Gusev

Dmitri Gusev

Associate Professor, Purdue University
Dmitri A. Gusev is an Associate Professor of Computer and Information Technology (CIT). His primary research interests include imaging, game development, visualization, and computational linguistics. Dmitri A. Gusev received his Ph.D. in Computer Science from Indiana University in 1999. After graduation, Dmitri worked for Eastman Kodak Company as image processing scientist in 1999-2007. Prior to joining Purdue, he taught computer science at... Read More →


Wednesday May 18, 2016 3:20pm - 3:40pm
Markov

4:00pm

Analytics as Code with Juttle
Software developers rely on operational analytics to track health and performance of their apps/services, as we do at Quid. You can go the DIY route with Kibana, Grafana and other open source tools, or rely on analytics service providers, but many choices will limit you to a UI-based experience. I'm a believer in analytics as code, and will present reasons and possible choices for those who want their analytics managed as code with source control and code reviews. Juttle (http://juttle.github.io/) is one such developer-oriented analytics platform that bridges querying data from different storage backends, processing, joining and visualizing, all in a single line of code.

Speakers
avatar for Daria Mehra

Daria Mehra

Director of Quality Engineering, Quid
My fields of interest are data storage, data analytics, and quality (my unofficial title is "Bug Huntress"). I'm excited to be working on quality initiatives for the Quid intelligence platform. My favorite hammer of a programming language / analytics platform is Juttle, I'm one of the maintainers of its open source project (http://github.com/juttle/juttle). My single favorite activity is debugging, and a second close is reading books on paper.




Wednesday May 18, 2016 4:00pm - 4:20pm
Markov

4:00pm

Building semantic search using Deep Learning
Many search appliances exist today that make full-text search fairly simple: Elasticsearch, Solr, Algolia; the list goes on. However, all of these services implement n-gram or token-level analysis of the text, and can only really match exact or partial occurrences of the query text in the base corpus; they are not able to match on overlapping concepts. For example, if a document is about the programming language Java and you search for "Computer Science", the document won't be scored highly unless computer science is explicitly mentioned in it. Recent innovations in general-purpose word vectors, and the ability to compose them into general-purpose document vectors, provide a way to create conceptual, semantic search products. This talk will demonstrate creating a semantic search engine on a few non-trivial corpora.
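
The composition step can be sketched very simply: average the word vectors of a document, average the word vectors of the query, and rank documents by cosine similarity. The tiny hand-written vectors below stand in for real pre-trained embeddings (word2vec, GloVe or similar) purely to show why a query like "computer science" can retrieve a Java document that never mentions it.

    import numpy as np

    emb = {  # toy 3-dimensional "word vectors"; real systems use large pre-trained embeddings
        "java": np.array([0.9, 0.1, 0.0]),
        "programming": np.array([0.7, 0.3, 0.1]),
        "computer": np.array([0.6, 0.4, 0.1]),
        "science": np.array([0.5, 0.5, 0.2]),
        "coffee": np.array([0.0, 0.1, 0.9]),
        "roasting": np.array([0.0, 0.2, 0.8]),
    }

    def doc_vector(text):
        """Compose a document vector by mean-pooling its word vectors."""
        return np.mean([emb[w] for w in text.lower().split() if w in emb], axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    docs = ["java programming", "coffee roasting"]
    query = "computer science"
    ranked = sorted(docs, key=lambda d: cosine(doc_vector(query), doc_vector(d)), reverse=True)
    print(ranked)  # "java programming" ranks first even though the query words never appear in it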

Speakers
avatar for Samiur Rahman

Samiur Rahman

Head of Data Engineering, Mattermark


Wednesday May 18, 2016 4:00pm - 4:40pm
Gardner

4:00pm

Deep Dive: Spark Memory Management
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
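For context, a hedged sketch of the knobs that govern the unified execution/storage split in Spark 1.6 and later, set from PySpark. The values shown are illustrative defaults, not tuning recommendations.

```python
# Sketch: configuring the unified memory manager from PySpark.
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("memory-demo")
    # Fraction of (heap - reserved memory) shared by execution and storage.
    .set("spark.memory.fraction", "0.75")
    # Portion of that shared region protected for cached (storage) blocks;
    # execution can borrow the rest and evict cached blocks beyond it.
    .set("spark.memory.storageFraction", "0.5")
)
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(1000000))
rdd.cache()                              # uses storage memory
print(rdd.map(lambda x: x * 2).sum())    # uses execution memory (aggregation)
sc.stop()
```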

Speakers
avatar for Andrew Or

Andrew Or

Software Engineer, Databricks
Anything about Spark.


Wednesday May 18, 2016 4:00pm - 4:40pm
Ada

4:30pm

Data governance and distribution with Dat
We will (LIVE!) take a public dataset and build a streaming, versioned, content-addressable, peer-to-peer distribution endpoint. Attendees will also learn how to build an application using Dat as the real-time, content-addressable storage using Node.js.

Speakers
avatar for Karissa McKelvey

Karissa McKelvey

Software Developer, Dat Project


Wednesday May 18, 2016 4:30pm - 4:50pm
Markov

5:20pm

Panel: Data and Society
Moderators
avatar for Alexy Khrabrov

Alexy Khrabrov

Chief Scientist, Nitro/By the Bay
Chief Scientist at Nitro, founder and organizer, SF {Scala, Text, Spark, Reactive}, {Scala, Big Data Scala, Text, Data, ...} By the Bay.

Speakers
avatar for Elizaveta Malashenko

Elizaveta Malashenko

Director, Safety and Enforcement, California Public Utilities Commission
I lead programs that ensure that CPUC regulated services are delivered in a safe, reliable manner. The Safety and Enforcement Division has safety oversight in a number of industries, including electric, natural gas, and telecommunications infrastructure; railroads, rail crossings, and light rail transit systems; passenger carriers, such as limousines, charter buses, and transportation network companies. The Safety and Enforcement Division has... Read More →
avatar for Brad Newman

Brad Newman

Practice Innovation Manager / Staff Attorney, Cooley LLP
At Cooley, I drive innovation by working with legal professionals, business analysts, application developers and other stakeholders to identify, scope, design, iterate and implement practice and client-focused products, processes and services that enhance the gathering, processing and harnessing of actionable knowledge and data-centric intelligence to support the Firm's delivery of legal services to clients.
avatar for Gurinder Sangha

Gurinder Sangha

Founder and CEO, Lit IQ
I am the founder of Lit IQ, which is using advances in computational linguistics to help lawyers minimize litigation risk. I also teach at the University of Pennsylvania Law School and serve as a Fellow at the Stanford Center for Legal Informatics. Prior to Lit IQ, I founded Intelligize, an information services company that helps business professionals more easily research regulatory filings. Intelligize is one of the fastest growing... Read More →
avatar for Nicole Shanahan

Nicole Shanahan

Residential Fellow, CodeX, The Stanford Center for Legal Informatics
Nicole Shanahan is an attorney in California and a residential fellow at CodeX, the Stanford Center of Legal Informatics, a joint center between Stanford Law School and Computer Science. She is the founder and CEO of ClearAccessIP an integrated patent management technology, and a legal technologist who specializes in the utilization of structured databases, APIs, UI/UX, automation and SaaS.


Wednesday May 18, 2016 5:20pm - 6:20pm
Gardner

6:00pm

Happy Hour
Join us for a community gathering after the full day of talks. Full bar, fine food, and the company of the speakers, panelists, and the best minds in data.

Wednesday May 18, 2016 6:00pm - 8:00pm
The Galvanize Lobby
 
Thursday, May 19
 

8:45am

Opening Remarks
Speakers
avatar for Alexy Khrabrov

Alexy Khrabrov

Chief Scientist, Nitro/By the Bay
Chief Scientist at Nitro, founder and organizer, SF {Scala, Text, Spark, Reactive}, {Scala, Big Data Scala, Text, Data, ...} By the Bay.


Thursday May 19, 2016 8:45am - 9:00am
Gardner

9:00am

Keynote: How Can We Trust Machine Learning? Exploration, Evaluation and Explanation for ML Models
Machine learning technologies are at the core of a new generation of intelligent applications that differentiate disruptive businesses from established players. Today, business tasks like product recommendation, image tagging, sentiment analysis, churn prediction, fraud detection and lead scoring can only be achieved using machine learning (ML). To build these applications at scale, companies are fast adopting tools such as Dato's GraphLab Create and Predictive Services, enabling developers to accelerate the innovation cycle and quickly take their ideas from inspiration to production. Industry practitioners understand that in order to secure adoption of intelligent applications, they must build trust in their models and predictions - that is, gain confidence that their models are achieving the desired outcomes and develop a good understanding of how predictions are made. In this talk, I'll describe both: a) recent research done at the University of Washington that provides a formal framework for explaining why a machine learning model makes a particular prediction, and how even non-experts can use these explanations to improve the performance of a model; and b) new tools introduced by Dato that help industry practitioners build trust and confidence in machine learning by making it easy to evaluate, explore, and explain models and predictions. With these techniques, companies gain the means to build trust and confidence in the models and predictions behind their core business applications.
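The UW research referenced here is in the spirit of local, model-agnostic explanations: perturb one input, watch how the black-box prediction changes, and fit a simple surrogate whose weights serve as the explanation. Below is a minimal sketch of that general idea (not Dato's tooling or the exact published method), assuming a scikit-learn-style binary classifier that exposes predict_proba.

```python
# Minimal sketch of a perturbation-based local explanation.
import numpy as np
from sklearn.linear_model import Ridge

def explain_instance(predict_proba, x, n_samples=500, scale=0.1, seed=0):
    rng = np.random.RandomState(seed)
    # Perturb the instance and query the black box.
    X_pert = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    y = predict_proba(X_pert)[:, 1]
    # Weight perturbed samples by proximity to the original point.
    w = np.exp(-np.linalg.norm(X_pert - x, axis=1) ** 2)
    # Fit a simple surrogate; its coefficients are the local "explanation".
    surrogate = Ridge(alpha=1.0).fit(X_pert, y, sample_weight=w)
    return surrogate.coef_

# Usage with any fitted classifier `clf` (hypothetical):
# weights = explain_instance(clf.predict_proba, X_test[0])
```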

Speakers
avatar for Carlos Guestrin

Carlos Guestrin

CEO, Dato
Carlos is the CEO and co-founder of Dato (formerly GraphLab, Inc.) and the Amazon Professor of Machine Learning in Computer Science & Engineering at the University of Washington. A world-recognized leader in the field of Machine Learning, Carlos was named one of the 2008 “Brilliant 10″ by Popular Science Magazine, received the 2009 IJCAI Computers and Thought Award for his contributions to Artificial Intelligence, and a... Read More →


Thursday May 19, 2016 9:00am - 9:40am
Gardner

9:50am

Building an A.I. Cloudy Sky: What We Learned at PredictionIO
Building a successful A.I. cloud platform on top of an open-source Machine Learning project is more complicated than one would imagine. Simply offering a hosted version of the project is hardly the answer. The secret to success is to differentiate the needs of open-source users from those of potential SaaS users; oftentimes, they are different species. It is a little-known fact that people in the existing open-source user segment are unlikely to become customers of the cloud product. To productize an open-source project as a cloud platform, understanding who the customers are and what they are willing to pay for is key. Business techniques like customer discovery and lean product development can be applied to increase the chance of success. At the same time, keeping the open-source project and the cloud product under one umbrella while developing them separately for different user segments is non-trivial. We'll dive into some real scenarios, both successful and not so successful, of how we navigate the cloud product roadmap based on the popular open-source Machine Learning server project, PredictionIO. Specifically, we will walk through these topics:
• Evaluating Different Cloud Approaches
• Understanding Existing and Potential Users
• Finding the Unique Values of the Cloud
• Contributing Back From the Cloud to the Open-Source

Speakers
avatar for Simon Chan

Simon Chan

CEO, PredictionIO
Simon Chan is a co-founder of PredictionIO, with years of experience in the tech industry in London, Hong Kong, Mainland China and Silicon Valley. His doctoral research work at University College London was on machine learning techniques for large-scale user preference prediction in noisy non-experimental environments.



Thursday May 19, 2016 9:50am - 10:30am
Markov

9:50am

What Kaggle has learned from 2MM machine learning models
Speakers
avatar for Anthony  Goldbloom

Anthony Goldbloom

CEO, Kaggle
Anthony is the founder and CEO of Kaggle. Before founding Kaggle, Anthony worked in the macroeconomic modeling areas of the Reserve Bank of Australia and before that the Australian Treasury. He holds a first class honours degree in economics and econometrics from the University of Melbourne and has published in The Economist magazine and the Australian Economic Review. In 2011, Forbes Magazine cited Anthony as one of the 30 under 30 in technology... Read More →


Thursday May 19, 2016 9:50am - 10:30am
Gardner

9:50am

Beyond 1M Acres of Drone Imagery
DroneDeploy is making the sky productive and accessible to everyone. Our software platform flies unmanned aircraft all over the world and processes the data they collect. We'll show some of the image processing techniques we use to solve key problems in Agriculture, Mining and Construction, as well as give a look at what future problems we are tackling in the drone space.

Speakers
avatar for Nicholas Pilkington

Nicholas Pilkington

CTO, DroneDeploy



Thursday May 19, 2016 9:50am - 10:30am
Ada

10:40am

Predicting Hacker News with Beam and TensorFlow
In the last year, Google has open sourced or released two products, Apache Beam and Cloud Dataflow, that promise to change how we write data pipelines.

Beam is an open source, portable job description framework incubating at the Apache Foundation. It unifies batch and stream processing in a single model available in Java, Python and Scala. It supports running on popular execution engines like Spark, Flink and Google Cloud Dataflow, giving users flexibility in where they run and eliminating the need to rewrite pipelines. One of these execution frameworks, Dataflow, is a cloud-based, fully managed service that (like BigQuery) allows users to just submit code and get results. Google provides autoscaling, straggler avoidance and monitoring.

In this talk we'll explore Beam's event-time semantics, including windows, sessions, and triggers. Eric will also demonstrate a single Beam job running in both batch and stream modes, deployed on an on-prem cluster and in the cloud, with no code changes.
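As a rough illustration (not the talk's exact pipeline), here is a minimal sketch in the apache_beam Python SDK that assigns event-time timestamps, applies fixed windows, and counts per key; the input file name and "timestamp,story_id" record format are assumptions.

```python
# Hedged sketch: windowed event-time counting with the Beam Python SDK.
# The runner (DirectRunner, DataflowRunner, ...) is a pipeline option, not a code change.
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("events.txt")       # lines like "timestamp,story_id"
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "Timestamp" >> beam.Map(
            lambda rec: window.TimestampedValue((rec[1], 1), float(rec[0])))
        | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 1-minute event-time windows
        | "Count" >> beam.CombinePerKey(sum)
        | "Write" >> beam.io.WriteToText("counts")
    )
```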

Speakers
avatar for Eric Anderson

Eric Anderson

Product Manager, Google
Work on Google Cloud Dataflow.




Thursday May 19, 2016 10:40am - 11:00am
Ada

10:40am

Analyzing Massive Time Series Data with Spark
Want to build models over data arriving every second from millions of sensors? Dig into the histories of millions of financial instruments? In this talk, we'll discuss the unique challenges in time series data and how to work with it at scale:
• What distinguishes time series data from other datasets?
• What are the common operations that we wish to apply to it?
• What are the different ways to lay out time series data in memory, and which analysis tasks is each layout well suited for?
• What are popular applications for time series analysis?
We'll then introduce the open source Spark-TS library. Built atop Apache Spark, the library provides an intuitive Scala and Python API for munging, manipulating, and modeling time series data in a massively parallel manner.
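To make the layout question concrete, here is a hedged sketch in plain PySpark (the Spark-TS API itself is not shown) contrasting an "observations" layout of (instrument, time, value) rows with an "instrument" layout that holds one full, time-sorted series per key, which suits per-series modeling. The tickers and values are toy data.

```python
# Sketch: from an observations layout to an instrument (series-per-key) layout.
from pyspark import SparkContext

sc = SparkContext(appName="ts-layouts")

observations = sc.parallelize([
    ("AAPL", 3, 101.2), ("AAPL", 1, 100.0), ("AAPL", 2, 100.7),
    ("GOOG", 1, 700.1), ("GOOG", 2, 701.4), ("GOOG", 3, 699.8),
])

# Group all observations of a key and sort each group by time.
series_per_instrument = (
    observations
    .map(lambda r: (r[0], (r[1], r[2])))
    .groupByKey()
    .mapValues(lambda obs: [v for _, v in sorted(obs)])
)

print(series_per_instrument.collect())
# e.g. [('AAPL', [100.0, 100.7, 101.2]), ('GOOG', [700.1, 701.4, 699.8])]
sc.stop()
```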

Speakers
avatar for Sandy Ryza

Sandy Ryza

Senior Data Scientist, Clover Health
Data Science, Apache Spark, Time Series Data, Distributed Computation


Thursday May 19, 2016 10:40am - 11:00am
Gardner

10:40am

The Promise of Heterogeneous Computing
In general, application performance demands in all fields are now outpacing Moore's Law, and in AI they're certainly increasing exponentially. To keep up, we're already beginning to rely on specialized processors like GPUs and DSPs, beyond just CPUs, for specific use cases. In the future, most applications will depend on their ability to make the most of multiple types of processors in the most efficient way possible. This talk will go through the different types of heterogeneous hardware available today and coming up in the future, their applicability to solving AI problems, and how normal people (not just big companies and those with access to supercomputers) can access them, with some examples.

Speakers
avatar for Subbu Rama

Subbu Rama

CEO, Bitfusion.io
Co-founder and CEO of Bitfusion, a Software Defined Supercomputing company, with the mission of bringing HPC/Supercomputing to the masses. Held engineering and leadership roles in hardware and software divisions, while building CPUs, micro-servers, SoCs and cloud infrastructures, at companies like Intel and Dell. As a Founding Member, built Dell’s first Cloud Infrastructure Marketplace.



Thursday May 19, 2016 10:40am - 11:00am
Markov

11:10am

Finding Your Audience In The Internet of Things
Understanding the nature of the expressive and diverse audiences of applications can be transformative in the creation of powerful data products. For many industries, user interaction is the most accurate signal for audience segmentation, but the devices in the Internet of Things are often quiet, and natural audience segments can be lost. In analyzing Automatic's driving data we have learned that there are many kinds of cars, but drivers themselves often act very similarly. In this study we correlate driving style with the physical models of the cars, with the aim of segmenting our audience using unsupervised learning techniques such as NMF.
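As a hedged sketch of the segmentation approach named here (not Automatic's actual pipeline): factor a nonnegative matrix of per-driver behavioral features with NMF and read off each driver's dominant component as a segment. The feature names and random data are made up for illustration.

```python
# Toy NMF-based audience segmentation sketch.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(42)
# rows = drivers, columns = nonnegative behavioral features
# (e.g. hard-brake rate, night-driving share, mean trip length, speeding share)
X = rng.rand(500, 4)

model = NMF(n_components=3, init="nndsvd", random_state=0)
W = model.fit_transform(X)    # driver -> segment loadings
H = model.components_         # segment -> feature profile

segments = W.argmax(axis=1)   # assign each driver to their dominant segment
print(np.bincount(segments))  # segment sizes
print(H.round(2))             # what driving style each segment represents
```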

Speakers
avatar for Dhruv Choudhary

Dhruv Choudhary

Data Scientist, Automatic Labs
Dhruv is a Data Scientist at Automatic Labs where he is currently building data products for the connected car space. Automatic brings to life a wealth of data about driving behavior of users and their interactions with their cars. He also works on building scalable infrastructure and data pipelines to analyze and deploy data products. Dhruv received his Masters from Georgia Tech in 2011 and has a background in control theory and... Read More →



Thursday May 19, 2016 11:10am - 11:30am
Markov

11:10am

Know the air you are breathing
This talk will demonstrate how to take a publicly available dataset of air quality sensor readings, clean and query the data, and visualize it, enabling the government and the public to take appropriate action. I will use publicly available datasets from epa.gov and the U.S. Department of State Mission China air quality monitoring program. The set consists of data from devices sending measurements for San Francisco and Beijing. The air quality measurements are enriched with weather data from weather.gov to give them additional context. We will then visualize the enriched data and see how it relates to air quality. The technology stack comprises Mesos, Zookeeper, Marathon, Docker, Riak TS, Kafka, Spark, and Zeppelin.
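As a small sketch of the enrichment step only (not the full Mesos/Riak TS/Spark stack), here is a pandas join of hourly PM2.5 readings with hourly weather observations. The file names and columns are assumptions, not the talk's actual schema.

```python
# Minimal enrichment sketch: align air quality and weather feeds on city + hour.
import pandas as pd

air = pd.read_csv("air_quality.csv", parse_dates=["time"])   # city, time, pm25
weather = pd.read_csv("weather.csv", parse_dates=["time"])   # city, time, temp_c, wind_kph

# Bucket both feeds to the hour before joining.
air["hour"] = air["time"].dt.floor("H")
weather["hour"] = weather["time"].dt.floor("H")

enriched = air.merge(
    weather.drop(columns=["time"]), on=["city", "hour"], how="left"
)

# Quick look at how PM2.5 relates to wind speed per city.
print(enriched.groupby("city")[["pm25", "wind_kph"]].corr())
```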

Speakers
avatar for Seema Jethani

Seema Jethani

Director of Product Management, Basho Technologies
Hello! I currently lead Product Management at Basho Technologies for Basho's flagship products Riak KV and Riak TS, distributed NoSQL databases.Prior to joining Basho, I held Product Management and Strategy positions at Dell, Enstratius and IBM. I hold an MBA degree from Duke University’s Fuqua School of Business, an MS in computer science from North Carolina State University, and a BE in computer engineering from the University of Mumbai. I... Read More →


Thursday May 19, 2016 11:10am - 11:30am
Ada

11:10am

MacroBase: Analytic Monitoring for the Internet of Things
An increasing proportion of data today is generated by automated processes, sensors, and systems---collectively, the Internet of Things (IoT). A core challenge in IoT and an increasingly popular value proposition of many IoT applications in domains including industrial diagnostics, predictive maintenance, and urban observability is in identifying and highlighting unusual and surprising data (e.g., poor driving behavior, equipment failures, gunshots). We call this task---which is often statistical in nature and time-sensitive---analytic monitoring. To facilitate rapid development and scalable deployment of analytic monitoring queries, we have developed MacroBase, a new kind of data analytics engine that provides turn-key analytic monitoring of IoT data streams. MacroBase implements a customizable pipeline of outlier detection, summarization, and ranking operators. To facilitate efficient and accurate operation, MacroBase implements several cross-layer optimizations across robust estimation, pattern mining, and sketching procedures. As a result, MacroBase can analyze several million events per second on a single server. MacroBase has already uncovered several unexpected behaviors (and corresponding bugs) in production in a medium-scale IoT deployment.
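MacroBase itself is a JVM system; purely as an illustration of the detect-then-summarize pattern described here, this toy Python sketch flags outliers with a robust MAD score and then summarizes which attribute values are over-represented among the flagged events. The event fields are invented.

```python
# Toy detect-then-summarize sketch (not MacroBase's code).
from collections import Counter
import numpy as np

def mad_outliers(values, threshold=3.5):
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median)) or 1e-9
    scores = 0.6745 * (values - median) / mad   # robust z-scores
    return np.abs(scores) > threshold

events = [
    {"device": "a", "firmware": "1.2", "temp": 70.1},
    {"device": "b", "firmware": "1.2", "temp": 69.8},
    {"device": "c", "firmware": "1.3", "temp": 120.5},   # anomalous
    {"device": "d", "firmware": "1.3", "temp": 118.9},   # anomalous
    {"device": "e", "firmware": "1.2", "temp": 70.4},
]

flags = mad_outliers([e["temp"] for e in events])
outlier_attrs = Counter(e["firmware"] for e, f in zip(events, flags) if f)
print(outlier_attrs.most_common())   # firmware 1.3 dominates the flagged events
```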

Speakers
avatar for Peter Bailis

Peter Bailis

Professor, Stanford University
Peter Bailis is an assistant professor of computer science at Stanford University. Peter's research in the Future Data Systems group (http://futuredata.stanford.edu/) focuses on the design and implementation of next-generation data-intensive systems. His work spans large-scale data management, distributed protocol design, and architectures for high-volume complex decision support. He is the recipient of an NSF Graduate Research Fellowship, a... Read More →


Thursday May 19, 2016 11:10am - 11:30am
Gardner

11:40am

Inside Pandora: Practical Application of Big Data in Music
Pandora began with The Music Genome Project, the most sophisticated taxonomy of musicological data ever collected and an extremely effective content-based approach to music recommendation. Its foundation is based on human music cognition, and how an expert describes and perceives the complex world of a music piece.

But what happens when you have a decade of additional data points, given off by more than 250 million registered users who have created 8+ billion personalized radio stations and given 60+ billion thumbs? As opposed to other traditional recommender systems, such as Netflix or Amazon, which need to recommend a single item or static set, Pandora provides an evolving set of sequential items, and needs to react in just a few milliseconds when the user is unhappy with the proposed songs. Furthermore, a variety of factors (e.g., musicological, social, geographical, or generational) play a critical role in deciding what music to play to a user, and these factors vary dramatically across each individual listener.

In this talk I will present a dynamic ensemble learning system that combines curation data and machine learning models to provide a truly personalized experience. This approach allows us to switch from a lean-back experience (exploitation) to a more exploratory mode to discover new music tailored specifically to each user's individual tastes. I will also discuss how Pandora, a data-driven company, makes informed decisions about the features that are added to the core product based on the results of extensive online A/B testing.

Following this session, the audience will have an in-depth understanding of how Pandora uses big data science to determine the perfect balance of familiarity, discovery, repetition and relevance for each individual listener, how it measures and evaluates user satisfaction, and how our online and offline architecture stack plays a critical role in our success.

Speakers
avatar for Oscar Celma

Oscar Celma

Director of Research, PANDORA



Thursday May 19, 2016 11:40am - 12:20pm
Gardner

11:40am

Recommendations for Building Machine Learning Software
Building a real system that uses machine learning can be difficult, both in terms of the algorithmic and the engineering challenges involved. In this talk, I will focus on the engineering side and discuss some of the practical lessons we've learned from years of developing the machine learning systems that power Netflix. I will go over what it takes to get machine learning working in a real-life feedback loop with our users, and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. This involves lessons around challenges such as where to place algorithmic components, how to handle distribution and parallelism, what kinds of modularity are useful, how to support production experimentation, and how to test machine learning systems.

Speakers
avatar for Justin Basilico

Justin Basilico

Research/Engineering Manager, Netflix
Justin Basilico is a Research/Engineering Manager for Page Algorithms Engineering at Netflix. He leads an applied research team focused on developing the next generation of algorithms used to generate the Netflix homepage through machine learning, ranking, recommendation, and large-scale software engineering. Prior to Netflix, he worked on machine learning in the Cognitive Systems group at Sandia National Laboratories. He is also the co-creator... Read More →



Thursday May 19, 2016 11:40am - 12:20pm
Ada

11:40am

The Security Wolf of Wall Street: Fighting Crime with High-Frequency Classification and Natural Language Processing
In a world where threat actors move fast and the Internet evolves in a non-deterministic fashion, turning threat intelligence into automated protection has proven to be a challenge for the information security industry. While traditional threat research methods will never go away, there is an increasing need for powerful decision models that can process data in real time and scale to incorporate increasingly rich sources of threat intel. This talk will focus on one way to build a scalable machine learning infrastructure in real time on a massive amount of DNS data (approximately 80B queries per day). We will offer a sneak peek into how OpenDNS does scalable data science, touching on two core components, Big Data engineering and Big Data science, and specifically how they are used to implement a real-time threat detection system for large-scale network traffic.

To begin, we will detail Avalanche, a stream processing framework that helps OpenDNS data scientists create their own data processing pipelines using a modular graph-oriented representation. Each node acts as a data stream processor running as a process, thread or EC2 instance. In this graph database, the edges represent streaming channels connecting the different inputs and outputs of the nodes. The whole data pipeline can then easily be scaled and deployed to hundreds of instances in an AWS cloud. The Avalanche project's paradigm is to translate the approach that the finance world has been using for decades in high-frequency and quantitative trading and apply it to traffic analysis. Applying intelligent detection models as close as possible to the data source holds the key to building a truly predictive security system, one where requests are classified and filtered on the fly. In our particular case at OpenDNS, we see a strong interest in integrating such a detection pipeline at the resolver level.

We will next discuss how we integrate our statistical model NLP-Rank (a model that does large-scale phishing detection) with Avalanche, and show some benchmarks. At its core, NLP-Rank is a fraud detection system that applies machine learning to the HTML content of a domain's web page to extract relevant terms and identify whether the content is potentially malicious. In this sense we are automating the security analyst's decision-making process in judging whether a website is legitimate or not. Typically, when an analyst reviews a domain or URL in question, the analyst visits the site in a TOR browser, analyzes the content, identifies the themes and summarizes the page before deciding whether it's a fake or a false positive. In this talk, we will describe how we have automated this process at OpenDNS.

We will also discuss the unique characteristics of NLP-Rank, including its machine learning techniques, and the design and implementation of our phishing classification system. We will provide an overview of the data preprocessing techniques and the information retrieval and natural language processing techniques used by our classifier. We will then discuss how Avalanche manages the results of NLP-Rank, how we add those results to our blocklists and our corpus, and Avalanche's overall performance.
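As a minimal sketch of the content-classification idea behind a system like NLP-Rank (this is not OpenDNS's actual model or feature set), one can vectorize page text with TF-IDF and train a classifier to separate phishing-like from legitimate content. The toy pages and labels below are illustrative only.

```python
# Minimal page-content phishing classifier sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pages = [
    "verify your account password immediately or it will be suspended",
    "enter your credit card number to unlock your banking profile",
    "quarterly earnings report and shareholder meeting schedule",
    "open source documentation for the streaming api client",
]
labels = [1, 1, 0, 0]   # 1 = phishing-like, 0 = legitimate (toy labels)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(),
)
clf.fit(pages, labels)

# Probability that a new page looks like phishing content.
print(clf.predict_proba(["please verify your password to restore account access"])[:, 1])
```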

Speakers
avatar for Jeremiah O'Connor and Thibault Reuille

Jeremiah O'Connor and Thibault Reuille

Research Engineer (Jeremiah O'Connor), Manager of Threat Research Development (Thibault Reuille), OpenDNS (now part of Cisco)
Jeremiah O'Connor is a research engineer at OpenDNS where he focuses on building scalable threat detection models and writing software to solve real-world security problems. His current interests are in machine learning, natural language processing, distributed systems, and big data engineering. Prior to joining OpenDNS, he worked at Evernote, and at Mandiant/Fireeye. Jeremiah earned a Master's Degree in Computer Science from University of... Read More →



Thursday May 19, 2016 11:40am - 12:20pm
Markov

1:10pm

Fast deep recurrent net training
Deep recurrent nets are the extension of deep neural nets to processing and producing sequential data. They have exploded onto the deep learning scene over the past few years, are no longer considered hard to train, and have enabled us to make progress on everything from speech recognition and language modeling to image captioning. In this talk, we will look at what recurrent nets can do for you, and go over some tips and tricks we've learned from building Deep Speech, so you can train seriously deep recurrent networks on your own.

Some knowledge of recurrent nets is expected, like having read http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Speakers
avatar for Sanjeev Satheesh

Sanjeev Satheesh

Deep learning researcher, Baidu USA
Sanjeev works as a Deep learning researcher at the Silicon Valley AI Lab at Baidu USA. SVAIL has been focused on the mission of using hard AI technologies to impact hundreds of millions of users.


Thursday May 19, 2016 1:10pm - 1:30pm
Gardner

1:10pm

Building realtime efficient queries and batch jobs in an IOT platform
SAMI needs to ingest, serve and make real-time decisions at large scale. In this talk we briefly show:
* How to build scalable queries using Cassandra, Redis and Elasticsearch.
* How to run high-performance batch jobs using the Apache Parquet columnar storage format.
* Trade-offs between idempotent writes and real-time streaming counters.
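As a hedged sketch of the Parquet piece only (not SAMI's code), raw device messages can be landed as a columnar table partitioned by day so batch jobs scan only the columns and partitions they need. The paths and fields (device_id, day) are assumptions.

```python
# Sketch: columnar batch storage with Parquet via the Spark DataFrame API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iot-batch").getOrCreate()

# Assumes the raw JSON records carry device_id and day fields.
raw = spark.read.json("s3://bucket/raw-messages/")             # hypothetical path
raw.write.mode("append").partitionBy("day").parquet("s3://bucket/messages-parquet/")

# A downstream batch job reads only the columns it needs from one day's partition.
daily = spark.read.parquet("s3://bucket/messages-parquet/").where("day = '2016-05-19'")
print(daily.groupBy("device_id").count().limit(10).collect())
```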

Speakers
avatar for Dinesh Narayanan

Dinesh Narayanan

Staff Engineer, Samsung SSIC
Dinesh Narayanan is a Staff Engineer at Samsung SSIC. He is passionate in building Low latency distributed applications and functional programming.



Thursday May 19, 2016 1:10pm - 1:30pm
Ada

1:10pm

IoT security using data analysis
The interconnected world is upon us, and so are the hackers! Security risks are multiplying, with even the most basic sensor devices now becoming network-aware. Traditional security solutions often fail to protect such diverse Internet-of-Things (IoT) infrastructure. In this talk, we will present our solution of analyzing large amounts of IoT data and using behavior analytics in the cloud to detect anomalies. The patent-pending solution from ZingBox fingerprints the behavior of IoT devices and generates a detailed ‘behavior profile’ via machine learning. ZingBox's solution is designed to protect IoT devices without any footprint on the devices themselves.

Speakers
avatar for Dr. May Wang

Dr. May Wang

Co-founder & CTO, ZingBox
Dr. May Wang is Co-founder and CTO of ZingBox, an Internet of Things (IoT) security company in Silicon Valley, well funded by CEOs of leading public security companies, partners of top VC PE firms, and Stanford professors. May is also a Venture Partner of SAIF (a $4B PE firm), an advisor for several VC firms and startups in Silicon Valley, as well as a member of Stanford Angels. Before ZingBox, May was the Head of Asia Pac... Read More →



Thursday May 19, 2016 1:10pm - 1:30pm
Markov

2:10pm

Scalable Training of RNNs for Speech Recognition
One really good way to improve the accuracy of a deep-learning-based speech recognition system is to throw a lot of diverse data at it. That is one trick we use at Baidu's Silicon Valley AI Lab (SVAIL), but it means it can take a very long time to train the network. I will talk about some of the hardware, software and algorithmic tricks we use to enable training on tens of thousands of hours of raw speech data. These techniques are broadly applicable to a wide variety of sequence-based machine learning tasks. We use synchronous SGD while scaling to multiple nodes (8 GPUs per node). Achieving this requires paying careful attention to the performance of the gradient synchronization between nodes - we have written our own version of MPI's all_reduce primitive. It also requires maintaining high performance on each individual node as you scale, which means paying close attention to the performance of your BLAS library. Combined with other tricks such as training in reduced precision, this means we can sustain 3 TFlops per node while weak scaling to 128 GPUs.
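For intuition, here is a toy sketch of synchronous data-parallel SGD with an allreduce over gradients; mpi4py's built-in Allreduce stands in for the custom all_reduce described in the talk, and the quadratic objective is invented purely for illustration. Run with something like `mpirun -np 4 python sync_sgd.py`.

```python
# Toy synchronous data-parallel SGD sketch (not SVAIL's training code).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

dim = 1000
weights = np.zeros(dim)
rng = np.random.RandomState(rank)          # each worker sees different data

for step in range(100):
    # Local gradient on this worker's mini-batch (toy least-squares objective).
    X = rng.randn(32, dim)
    grad_local = X.T @ (X @ weights - 1.0) / 32.0

    # Synchronous step: sum gradients across workers, then average.
    grad_global = np.empty_like(grad_local)
    comm.Allreduce(grad_local, grad_global, op=MPI.SUM)
    grad_global /= size

    weights -= 0.01 * grad_global          # identical update on every worker

if rank == 0:
    print("final loss proxy:", float(np.mean((X @ weights - 1.0) ** 2)))
```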

Speakers
avatar for Erich Elsen

Erich Elsen

Research Scientist, Baidu



Thursday May 19, 2016 2:10pm - 2:50pm
Gardner

2:10pm

Detecting Anomalies in Streaming Data – Real-time Algorithms for Real-world Applications
There's no question that we are seeing an increase in the availability of streaming, time-series data. Largely driven by the rise of the Internet of Things (IoT) and connected real-time data sources, we now have an enormous number of applications with sensors that produce important data that changes over time. This data presents a challenge and opportunity for businesses across every industry. How do they handle the onslaught of streaming data? How can they exploit it to make decisions in real time? One way is to detect, in real time, when something unusual occurs. Early anomaly detection in streaming data has significant implications, yet can be very difficult to execute. It requires detectors to process data in real time, not batches, and learn while simultaneously making predictions. In this talk, we'll look at algorithms designed for such data and analyze the components that lead to optimal performance. We'll also discuss a new benchmark with a labeled, real-world data set, designed to provide a controlled and repeatable environment of open-source tools to test and measure anomaly detection algorithms on streaming data. How do we score in a way that rewards algorithms that detect all anomalies as soon as possible, trigger no false alarms, work with real-world time-series data across a variety of domains, and automatically adapt to changing statistics?
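To make the streaming constraints concrete, here is a deliberately simple baseline detector: an online rolling z-score that sees one point at a time and keeps learning while it predicts. Real entries in a benchmark like the one described are far more sophisticated; this is only a reference point.

```python
# Simple streaming anomaly-detection baseline: online rolling z-score.
from collections import deque
import math
import random

class RollingZScoreDetector:
    def __init__(self, window=100, threshold=4.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def handle(self, value):
        anomalous = False
        if len(self.window) >= 10:                      # wait for a minimal history
            mean = sum(self.window) / len(self.window)
            var = sum((v - mean) ** 2 for v in self.window) / len(self.window)
            std = math.sqrt(var) or 1e-9
            anomalous = abs(value - mean) / std > self.threshold
        self.window.append(value)                       # learn while predicting
        return anomalous

detector = RollingZScoreDetector()
stream = [random.gauss(0, 1) for _ in range(500)] + [12.0]   # spike at the end
flags = [detector.handle(x) for x in stream]
print("anomalies at indices:", [i for i, f in enumerate(flags) if f])
```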

Speakers
avatar for Subutai Ahmad

Subutai Ahmad

VP Research, Numenta
Numenta has a broad, long-term research agenda: we want to advance our understanding of cortical principles, and build systems based on those principles. We are currently focusing our research on new application areas, developing a theory of neurons, sequence memory in cortex, sensorimotor inference, and expanding our mathematical understanding of sparse neural representations.


Thursday May 19, 2016 2:10pm - 2:50pm
Markov

2:10pm

Deep dive and best practices of Spark streaming
In this talk, we will start with the internals of how Spark Streaming works and explain how user code is translated and executed by the Spark Streaming engine. Based on these internals, we will then walk through some of the best practices for efficient state management, efficient joining of streams with historic datasets, and achieving high throughput while receiving, processing and writing data. This should help you develop and tune your streaming applications properly by avoiding the common pitfalls.
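As a hedged sketch of one of the state-management patterns discussed, here is a running count per key maintained across micro-batches with updateStateByKey, which requires checkpointing. The host, port, and one-key-per-line input format are assumptions.

```python
# Sketch: stateful per-key counting in Spark Streaming.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stateful-counts")
ssc = StreamingContext(sc, batchDuration=5)
ssc.checkpoint("/tmp/streaming-checkpoint")    # required for stateful operations

def update_count(new_values, running):
    # Merge this batch's values into the running state for the key.
    return (running or 0) + sum(new_values)

lines = ssc.socketTextStream("localhost", 9999)   # e.g. one event key per line
counts = lines.map(lambda key: (key, 1)).updateStateByKey(update_count)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```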

Speakers
avatar for Prakash Chockalingam

Prakash Chockalingam

Solutions Architect, Databricks Inc
Spark


Thursday May 19, 2016 2:10pm - 2:50pm
Ada

3:00pm

Automating Data Science for the IOT
The big challenge with machine data is the weak signal-to-noise ratio. The patterns are tiny and numerous, spread over a large amount of data, and they change frequently. All this variability makes it extremely challenging to dig insights and signals out with the traditional approach of generating models manually. It requires applying machine automation and machine intelligence even to finding the patterns, on a continuous basis. In this talk we will demonstrate how we have applied this to solve problems of predictive maintenance for Fortune 500s.

Speakers
avatar for Ruban Phukan

Ruban Phukan

Chief Product and Analytics Officer, DataRPM
Ruban is a serial entrepreneur and technologist with rich and diverse experience in data science, product, technology and business. As a data scientist in Yahoo, Ruban’s role involved data mining and analyzing several big data sets of Yahoo and coming up with strategic business insights. His projects influenced several products & business strategies and led to tens of millions of dollars of positive revenue impact. He co-founded Bixee, a... Read More →


Thursday May 19, 2016 3:00pm - 3:20pm
Markov

3:00pm

Active Learning and Human-in-the-Loop Machine Learning
Speakers
avatar for Lukas Biewald

Lukas Biewald

CEO, CrowdFlower
Lukas Biewald: CEO/Co-founder of CrowdFlower. He has worked as a Senior Scientist and Manager within the Ranking and Management Team at Powerset, Inc., a natural language search technology company later acquired by Microsoft, and also led the Search Relevance Team for Yahoo! Japan. He graduated from Stanford University with a BS in Mathematics and a MS in Computer Science.


Thursday May 19, 2016 3:00pm - 3:40pm
Gardner

3:00pm

Hidden GEMMs: How Optimized Math Libraries Work
Fast linear algebra is the foundation of all machine learning, including deep learning. Have you ever wondered how CUBLAS, MKL and similar libraries work? If so, this talk is for you!
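As a preview of the core trick inside an optimized GEMM, here is an illustrative blocked (tiled) matrix multiply in NumPy: sub-matrices are reused while they are still in cache. Real libraries add hand-tuned kernels, vectorization, and careful packing on top of this idea; the sketch is only for intuition.

```python
# Blocked GEMM sketch: same result as A @ B, computed tile by tile.
import numpy as np

def blocked_gemm(A, B, block=64):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                # Each small multiply reuses blocks that fit in cache.
                C[i:i + block, j:j + block] += (
                    A[i:i + block, p:p + block] @ B[p:p + block, j:j + block]
                )
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
print(np.allclose(blocked_gemm(A, B), A @ B))   # True
```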

Speakers
avatar for Marek Kolodziej

Marek Kolodziej

Principal Research Engineer, Nitro
Marek Kolodziej is a Principal Research Engineer at Nitro, Inc. He's been working on a diverse set of machine learning, distributed computing and big data problems for the past 6 years, and statistics and econometrics for the past 11. His current passion is deep learning and GPU computing. Marek got his PhD in Energy and Environmental Economics from Boston University.



Thursday May 19, 2016 3:00pm - 3:40pm
Ada

3:30pm

A Match Made at Upwork
Upwork is the largest online freelancing marketplace, where freelancers from around the world earn over a billion dollars a year for work ranging from software development to sales to translation. Our data science team helps make this possible by developing algorithms to help clients and freelancers find the right fit for their needs and talents. We are in the privileged position of having data about the full life cycle of a job, from the moment the client posts it, to how freelancers find it, the interview and hiring process, and the final success of the project.

In this talk, I will focus on how we decide which jobs to recommend to freelancers. The job recommendation algorithms power a daily job digest e-mail, and are used in the Upwork Job Search and other parts of the site. I will share data and insights about our two-way marketplace, and discuss some of the machine learning models I’ve been developing. I will also discuss my approach to a key question for many ML practitioners — figuring out what to model in the first place.

I think about job recommendations as a two-part problem — will the freelancer apply to the job if they see it, and would that application be useful? I will describe the Job Interest model I have developed to answer the first question, which is a logistic regression model that uses features derived from the freelancer’s profile and application history to predict which jobs the freelancer will choose to apply to. A key modeling insight is that including features that capture main effects (the attractiveness of a job and each freelancer’s propensity for applying to jobs) vastly improves model accuracy.

Complex marketplace considerations come into play when we think about where to direct freelancers’ applications. One more application is hardly useful if a job already has several great applicants, or if the client has changed their mind about hiring. It is more important to help clients find the right person for large contracts than for small one-off tasks. We can also take advantage of our models that predict application and job success for applicants.
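As a hedged sketch of a Job Interest-style model (illustrative only, not Upwork's actual features or data): a logistic regression over job/freelancer features plus explicit main-effect features for overall job attractiveness and the freelancer's base propensity to apply. The feature names and synthetic data are assumptions.

```python
# Toy "Job Interest"-style logistic regression with main-effect features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(7)
n = 5000

# Interaction-style features (e.g. skill match score, rate fit).
skill_match = rng.rand(n)
rate_fit = rng.rand(n)
# Main effects: how attractive the job is to everyone, and how often this
# freelancer applies to anything at all.
job_attractiveness = rng.rand(n)
freelancer_propensity = rng.rand(n)

# Synthetic labels: main effects genuinely drive whether an application happens.
logit = 3 * skill_match + 2 * job_attractiveness + 2 * freelancer_propensity - 4
applied = (rng.rand(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([skill_match, rate_fit, job_attractiveness, freelancer_propensity])
model = LogisticRegression().fit(X, applied)
print(dict(zip(
    ["skill_match", "rate_fit", "job_attractiveness", "freelancer_propensity"],
    model.coef_[0].round(2),
)))
```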

Speakers
avatar for Alya Abbott

Alya Abbott

Senior Data Scientist, Upwork


Thursday May 19, 2016 3:30pm - 3:50pm
Markov

4:00pm

Apache Beam (incubating): Unified batch and streaming data processing
This talk traces the evolution of ideas in Google's data processing tools over the past 13 years - from classic MapReduce, to strongly consistent stream processing with Millwheel, to the unified batch and streaming programming model of Apache Beam.

Originally based on years of experience developing Big Data infrastructure within Google (such as MapReduce, FlumeJava, and MillWheel), it has now been donated to the OSS community at large.

Beam cleanly separates the different aspects of temporal data processing: what computation to apply, where in event time to apply it, when in processing time to produce results, and how to refine the results as late data arrives. By decoupling semantics from the underlying execution environment, Beam provides portability across multiple runners, both proprietary (e.g., Google Cloud Dataflow) and open-source (e.g., Flink, Spark, et al).

I will give an overview of the programming model and current status of the project and invite you to participate in its rapidly developing ecosystem.

Speakers
avatar for Eugene Kirpichov

Eugene Kirpichov

Senior Software Engineer, Google
I'm an engineer on the Google Cloud Dataflow team. I'm interested in some programming- and math-related topics, equality-related issues, cognitive psychology, and a bunch of other things.


Thursday May 19, 2016 4:00pm - 4:20pm
Ada

4:00pm

Multimodal Question Answering for Language and Vision
Deep Learning has made tremendous breakthroughs possible in visual understanding and speech recognition. Ostensibly, this is not the case in natural language processing (NLP) and higher level reasoning. However, it only appears that way because there are so many different tasks in NLP and no single one of them, by itself, captures the complexity of language. I will talk about dynamic memory networks for question answering. This model architecture and task combination can solve a wide variety of visual and NLP problems, including those that require reasoning.

Speakers
avatar for Richard Socher

Richard Socher

Founder & CEO, MetaMind
Richard Socher is the CTO and founder of MetaMind, a startup that seeks to improve artificial intelligence and make it widely accessible. He obtained his PhD from Stanford working on deep learning with Chris Manning and Andrew Ng. He is interested in developing new AI models that perform well across multiple different tasks in natural language processing and computer vision. He was awarded the 2011 Yahoo! Key Scientific Challenges Award, the... Read More →


Thursday May 19, 2016 4:00pm - 4:40pm
Gardner

4:00pm

Optimizing Machine Learning Models
In this talk we briefly introduce Bayesian Global Optimization as an efficient way to optimize machine learning model parameters, especially when evaluating different parameters is time-consuming or expensive. We will motivate the problem and give example applications. We will also talk about our development of a robust benchmark suite for our algorithms including test selection, metric design, infrastructure architecture, visualization, and comparison to other standard and open source methods. We will discuss how this evaluation framework empowers our research engineers to confidently and quickly make changes to our core optimization engine. We will end with an in-depth example of using these methods to tune the features and hyperparameters of a real world problem and give several real world applications.
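To illustrate the sequential model-based idea (SigOpt's own product is a service with its own API; this sketch just uses the open-source scikit-optimize library instead), here is hyperparameter tuning of an SVM with Gaussian-process-based Bayesian optimization.

```python
# Hedged sketch: Bayesian optimization of SVM hyperparameters with scikit-optimize.
from skopt import gp_minimize
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

def objective(params):
    C, gamma = params
    score = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()
    return -score   # gp_minimize minimizes, so negate accuracy

result = gp_minimize(
    objective,
    dimensions=[(1e-3, 1e3, "log-uniform"),   # C
                (1e-6, 1e0, "log-uniform")],  # gamma
    n_calls=25,
    random_state=0,
)
print("best params:", result.x, "best CV accuracy:", -result.fun)
```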

Speakers
avatar for Scott Clark

Scott Clark

Co-founder and CEO, SigOpt
Scott is the Co-founder and CEO of SigOpt, an optimization software as a service company helping firms tune their machine learning models and complex simulations. Scott has been applying optimal learning techniques in industry and academia for years, from bioinformatics to production advertising systems. Before SigOpt, Scott worked on the Ad Targeting team at Yelp leading the charge on academic research and outreach with projects like the Yelp... Read More →


Thursday May 19, 2016 4:00pm - 4:40pm
Markov

4:20pm

Data in the Apache "big data" ecosystem
Big Data, Small Data, it's all the same.

Speakers
avatar for Konstantin Boudnik

Konstantin Boudnik

CEO, Memcore
Dr.Konstantin Boudnik, co-founder and CEO of Memcore Inc, is one of the early developers of Hadoop and a co-author of Apache BigTop, the open source framework and the community around creation of software stacks for data processing projects. With more than 20 years of experience in software development, big- and fast-data analytic, Git, distributed systems and more, Dr. Boudnik has authored 16 US patents in distributed computing. Dr. Boudnik... Read More →


Thursday May 19, 2016 4:20pm - 4:40pm
Ada

5:00pm

Panel: Artificial Intelligence: Who is Winning?
AlphaGo crushing humans, Tay making them blush -- finally, everybody thinks the singularity is near. Is it? This panel will weigh what actual technical grounds there are to claim advantage over humans or other AI in certain fields. Where do open-source data science and engineering add the most impact, and how should modern state-of-the-art AI be implemented in a startup that needs it to win? Come and join the discussion!

Moderators
avatar for Alexy Khrabrov

Alexy Khrabrov

Chief Scientist, Nitro/By the Bay
Chief Scientist at Nitro, founder and organizer, SF {Scala, Text, Spark, Reactive}, {Scala, Big Data Scala, Text, Data, ...} By the Bay.

Speakers
avatar for SriSatish Ambati

SriSatish Ambati

CEO and co-founder, H2O.ai
Sri is co-founder and ceo of H2O (@h2oai), the builders of H2O. H2O democratizes bigdata science and makes hadoop do math for better predictions. Before H2O, Sri spent time scaling R over bigdata with researchers at Purdue and Stanford. Prior to that Sri co-founded Platfora and was the Director of Engineering at DataStax. Before that Sri was Partner & Performance engineer at java multi-core startup, Azul Systems, tinkering with the... Read More →
avatar for Pete Skomoroch

Pete Skomoroch

Co-Founder & CEO, Stealth.ai
I'm a data scientist and entrepreneur focused on building intelligent systems to collect information and enable better decisions. I specialize in solving hard algorithmic problems, leading cross-functional teams, and developing engaging products powered by data and machine learning. I'm currently working on a new startup based in San Francisco. Previously, I applied my skills to the consumer internet space at LinkedIn, the world's... Read More →
avatar for Richard Socher

Richard Socher

Founder & CEO, MetaMind
Richard Socher is the CTO and founder of MetaMind, a startup that seeks to improve artificial intelligence and make it widely accessible. He obtained his PhD from Stanford working on deep learning with Chris Manning and Andrew Ng. He is interested in developing new AI models that perform well across multiple different tasks in natural language processing and computer vision. He was awarded the 2011 Yahoo! Key Scientific Challenges Award, the... Read More →
avatar for Shivon Zillis

Shivon Zillis

VC, Bloomberg
Shivon is a partner and founding member of Bloomberg Beta, a $75 million venture fund backed by Bloomberg L.P. that invests in startups transforming the future of work. Bloomberg Beta has an unusual model for a corporate-backed venture fund. It invests for financial return and strives to work in the same new ways as startups -- transparent (with its full operating manual open sourced) and driven by data (with a program to statistically predict... Read More →


Thursday May 19, 2016 5:00pm - 6:00pm
Gardner

6:00pm

Happy Hour
Join us for a community gathering after the full day of talks. Full bar, fine food, and the company of the speakers, panelists, and the best minds in data.

Thursday May 19, 2016 6:00pm - 8:00pm
The Galvanize Lobby
 
Friday, May 20
 

8:45am

Opening Remarks
Speakers
avatar for Alexy Khrabrov

Alexy Khrabrov

Chief Scientist, Nitro/By the Bay
Chief Scientist at Nitro, founder and organizer, SF {Scala, Text, Spark, Reactive}, {Scala, Big Data Scala, Text, Data, ...} By the Bay.


Friday May 20, 2016 8:45am - 9:00am
Gardner

9:00am

Why you can't afford to ignore deep learning
Few involved in life sciences have a background in the neural network fundamentals that underpin deep learning, and that can make it quite overwhelming when trying to learn about this topic for the first time. And it doesn't help that there's so much breathless hype around, which makes it hard to know what opportunities are real, and what is just marketing.

I came to deep learning applications in the life sciences from the other direction - although I've been working with neural networks for around 25 years, I only started looking at life sciences applications in the last couple of years. The data in life sciences tends to be challenging to work with - generally unstructured (genome sequences, natural language text, imaging, sound, etc.) - and data sets are often quite large. This kind of data turns out to be where deep learning really shines. In fact, many classic approaches to data analysis in the life sciences are either in the process of being, or are about to be, totally transformed by deep learning.

In this talk, I'll describe what deep learning can do, and give some examples of how it can be applied, with a particular focus on medical applications. I'll also provide some suggested places to learn more, so if I'm successful in my goal to convince you that you can't afford to ignore deep learning, you'll know where to look next.

Speakers
avatar for Jeremy Howard

Jeremy Howard

CEO, Enlitic
Jeremy Howard is a serial entrepreneur, business strategist, developer, and educator. He is the CEO of Enlitic, a startup he founded to use recent advances in machine learning to transform the practice of medicine, and bring modern medical diagnostics to billions of people in the developing world for the first time. He is the youngest faculty member at Singularity University, where he teaches data science, and is also a Young Global Leader with... Read More →


Friday May 20, 2016 9:00am - 9:40am
Gardner

9:50am

On the path to data Nirvana: Supporting OLTP and OLAP on Hadoop
With increased competition, companies need to make faster decisions based on real-time data. This requires a single database that can deliver good performance for both OLTP and OLAP workloads. This talk will discuss an innovative database architecture that uses both HBase and Spark engines to support simultaneous OLTP and OLAP workloads. In this talk, you will:

- Discover which use cases require simultaneous OLTP and OLAP workloads
- Learn how the optimizer automatically routes OLTP queries to HBase/Hadoop, and OLAP queries to Spark
- Learn how the optimizer uses advanced resource management to ensure that OLAP queries do not overwhelm OLTP queries

Speakers
avatar for Monte Zweben

Monte Zweben

Co-Founder and CEO, Splice Machine
Monte is the CEO and co-founder of Splice Machine. Monte worked as the Deputy Chief of the Artificial Intelligence Branch at NASA Ames Research Center, where he won the Space Act Award. He then founded and was CEO of Red Pepper Software, which merged with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit. Monte then founded and led as CEO Blue Martini Software, which went public in one of the most successful IPOs of... Read More →



Friday May 20, 2016 9:50am - 10:30am
Ada

9:50am

Challenges Applying Traditional UI/UX Principles to Machine Learning Products
Software UI and UX design have a few decades of established principles that guide designers in their tasks -- fundamental ideas like predictability, affordance, and graceful recovery from user errors. I have spent almost ten years trying to merge these ideas into the design of novel ML (and ML-assisted) products. Many of the traditional ideas become difficult to apply when you're intrinsically working with poorly understood data. What does it mean for an ML system to be predictable to the user, if the user has a poor understanding of the underlying data? If the ML system is not at least a little bit surprising, it's not doing its job. But how can a user feel confident if a surprising bit of feedback from the ML system is consistent with their own understanding of the world? How can they feel trust in the process when the process is itself usually beyond their understanding? Beyond that, while we have many years of experience on the "psychophysics" of visual perception, we have much less when it comes to user perception of probabilistic or statistical events. What little research we have is often contradictory, e.g. whether people better understand raw probabilities vs odds ratios vs qualitative statements like "very likely / likely / unlikely / etc.". I have spent many years studying people's in-the-field understanding and perceptions of probability in an applied setting, and this has come to shape much of how I design systems today. In many cases, I have found that these concerns motivated changes not just in a product's UX design, but changes in the underlying algorithms themselves. I'll be talking about highlights and best practices from this experience, spanning work from tiny startups and multiple Fortune 50 companies.

Speakers
avatar for Demetri Spanos

Demetri Spanos

CEO / ML Product Design, Marft, Inc.
I've been working on ML-driven products, systems, and research for over 10 years, at every scale: from tiny garage startups to Fortune 500s, premier research universities, and Federal Research labs. I'm especially interested in "flipping" the direction of ML design: currently we take the mathematics as given immutable truths, and then try to design human experiences around that. What if, instead, we take the human factors/needs as fixed goals... Read More →


Friday May 20, 2016 9:50am - 10:30am
Markov

9:50am

Coevolution: UX and the Changing Relationship Between Mind and Machine
We are in the midst of a massive experiment in which the relationship between human intelligence and its computer-based analogs is changing at an accelerating pace. The process of both designing technologies such as machine learning (ML) to best meet people's needs and adapting ourselves to the new possibilities opened up by these tools is not entirely clear or predictable. The concept of coevolution from biology offers one framework for thinking about these changes. The essential idea is that two, or more, interacting species reciprocally affect each other's evolution. These relationships can range from symbiotic to predator versus prey. Humans have been coevolving with machines for a long time, but in certain areas, the balance of control is fundamentally shifting with advances in areas such as artificial intelligence. For example, we have literally been in the driver's seat with cars, but that relationship is in the process of a role reversal. In the environments we're creating, the ability of technology to evolve, in many cases, seems to be outstripping our own. There are many unknowns about how people evolve and adapt along with their processor-based counterparts, including:

  • How might new generations of data analysis tools transform the ways people think about, and solve, problems?
  • As ML and related technologies advance, how will that reshape, or remove, the role of human analysts?
  • How will human-computer interactions and interfaces change as machines become better at mimicking human behavior? 
  • Coevolution can take many forms from adversarial to symbiotic. Will machines eat the proverbial lunch of many human analysts, propel them to a higher level of ability, or some combination of both?
  • Some pairs of species form highly specialized relationships with each other that can be both a real advantage and a tremendous vulnerability. What are the risks of dependence and overspecialization? 
The talk will examine the idea of coevolution in this context and what UX design and data science can do to help people adapt effectively to changing environments.


Speakers
avatar for Hunter Whitney

Hunter Whitney

Sr. Consultant and Author; UX Design and Data Visualization, Hunter Whitney and Associates, Inc.
Hunter Whitney is a consultant, author, and instructor who brings a user experience (UX) design perspective to data visualization. He has advised corporations, start-ups, government agencies, and NGOs to achieve their goals through a strategic design approach to digital products and services. His experience includes leading the designs of data analysis interfaces for uses ranging from biomedical research to cyber security. Hunter is the... Read More →


Friday May 20, 2016 9:50am - 10:30am
Gardner

10:40am

Driver: How Data Science and Engineering Can Extend the Lives of Cancer Patients
TBA

Speakers
avatar for Petros Giannikopoulos

Petros Giannikopoulos

President & Cofounder, Driver
I am one of the co-founders of Driver, a start-up in San Francisco that is empowering cancer patients to gain access to new therapies through an innovative, patient-facing, genomics platform. Our team's mission is simple: to accelerate drug development through radical patient engagement. At Driver I serve as the company's President and Laboratory Director, but I spend most of my time looking for and recruiting talented engineers and data... Read More →


Friday May 20, 2016 10:40am - 11:00am
Gardner

10:40am

Designing Visualizations of Health that Resonate With Users
Despite being a powerful tool in service of insight, data visualizations often fail mobile users who have come to expect transient and utilitarian mobile experiences. While traditional visualization patterns encourage exploration, mobile users often expect answers to be front and center. This talk will cover how we've adapted our visualization process to meet these platform challenges and address the broad needs of our users who have questions about their health and fitness. I'll use a recent product launch, Fitbit's Reminders to Move as a case study to demonstrate this process.

Speakers
avatar for Alan McLean

Alan McLean

Senior Data Visualization Designer, Fitbit


Friday May 20, 2016 10:40am - 11:00am
Markov

10:40am

Real Time Machine Learning Visualization with Spark
Training models on massive datasets, even on Spark, can be a lengthy process, during which the data scientist has no visibility into how the model is shaping up. The only way to monitor progress is to view the status of the Spark jobs, which provides no information about convergence or other statistics of interest. In this talk, we will discuss how to visualize and monitor the training of machine learning models in real time with Spark. With this capability, you can monitor machine learning training from one iteration to the next, observe how the model converges during each iteration, visualize the characteristics of the model in real time, and decide whether you wish to continue to train the model. In this talk you will learn:
- How machine learning algorithms are monitored by adding callbacks to K-Means and other algorithms
- The Spark task communication infrastructure that has been built, using Akka to deliver messages from the Spark driver to the job submitter
- How HTML5 SSE helps to generate real-time progress visualizations
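As a hedged sketch of the SSE delivery piece only (not the Spark/Akka callback plumbing described in the talk), a tiny Flask endpoint can stream training metrics to the browser as Server-Sent Events, which a page consumes with EventSource. The metric generator below is a stand-in for messages forwarded from the Spark driver.

```python
# Sketch: streaming training progress to the browser with HTML5 SSE.
import json
import time
from flask import Flask, Response

app = Flask(__name__)

def fake_training_metrics():
    # Stand-in for metrics forwarded from the Spark driver via callbacks.
    cost = 100.0
    for iteration in range(1, 51):
        cost *= 0.9
        yield {"iteration": iteration, "cost": round(cost, 3)}
        time.sleep(0.2)

@app.route("/training-progress")
def training_progress():
    def stream():
        for metric in fake_training_metrics():
            yield "data: {}\n\n".format(json.dumps(metric))   # SSE wire format
    return Response(stream(), mimetype="text/event-stream")

if __name__ == "__main__":
    app.run(port=5000, threaded=True)
```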

Speakers
avatar for Chester Chen

Chester Chen

Director of Engineering, Alpine Data
Chester Chen is the Director of Engineering and hands on architect at Alpine Data Labs. He manages the analytics platform development as well as contribute to some of the major developments. He has been working with scala on and off since Scala 2.7. He is the founder and organizer of SF Big Analytics Meetup, as well as the main co-organizer of the SF machine learning meetup. Before joining Alpine Data Labs, he had played many roles as Technical... Read More →


Friday May 20, 2016 10:40am - 11:00am
Ada

11:10am

A Scalable GA4GH Server Implementation
Genomics and health-related data imply large volumes of data, usually distributed across remote data centers, with many constraints related to privacy and confidentiality. Scalability is required at two levels. First, within a single data center, distributed computing technologies like Apache Spark, scalable machine learning libraries, and distributed databases are a good match. At the inter-data-center level, the scheme for sharing data and data processing methods must be guided by interoperability standards; the Global Alliance for Genomics and Health (GA4GH) is defining such a standard. We present here an implementation of a GA4GH server, using distributed computing and databases as the back-end engine, thus providing a scalable reference implementation. We also show how to extend the GA4GH server with new functionality, such as requesting model estimation (machine learning) and predictions from those models. We then show, using the Spark Notebook as an interactive tool, how to generate a client for the GA4GH server and how to execute methods on the server.
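
As a rough illustration of the request pattern at the heart of such a server, here is a toy, self-contained sketch. The case classes and field names are invented for illustration and are far simpler than the real GA4GH schemas, and the in-memory Seq stands in for the distributed back end described above.

    object VariantSearchSketch {
      // Hypothetical, simplified stand-ins for GA4GH-style messages (not the real schema).
      case class Variant(referenceName: String, start: Long, end: Long, alt: String)
      case class SearchVariantsRequest(referenceName: String, start: Long, end: Long)

      // A real server would run this filter against a distributed store (e.g. via Spark);
      // an in-memory Seq keeps the example self-contained.
      def searchVariants(db: Seq[Variant], req: SearchVariantsRequest): Seq[Variant] =
        db.filter(v =>
          v.referenceName == req.referenceName &&
          v.start < req.end && v.end > req.start)       // half-open interval overlap

      def main(args: Array[String]): Unit = {
        val db = Seq(
          Variant("chr1", 100, 101, "A"),
          Variant("chr1", 500, 501, "T"),
          Variant("chr2", 100, 101, "G"))
        println(searchVariants(db, SearchVariantsRequest("chr1", 0, 200)))  // -> only the chr1:100 variant
      }
    }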

Speakers
avatar for Andy Petrella

Andy Petrella

Cofounder, Data Fellas
Creator of Spark Notebook


Friday May 20, 2016 11:10am - 11:30am
Ada

11:10am

What Healthcare Can Learn from Netflix: Personalizing and Optimizing Preventive Care
There is widespread agreement that the most effective way to combat conditions like diabetes and heart disease is through a preventive approach centered on lifestyle and behavior change. But behavior change is hard, and to date, even the most effective in-person programs have included little (or no) personalization. Omada is changing that. Our team leverages open data to predict chronic health risk, and uses the power of analytics, machine learning, and experimentation to customize our interactive Prevent program for each individual. We’ve built experimentation directly into the product, using vast amounts of observational data to create a system that is continually improving to maximize participant outcomes and chronic disease risk reduction. Omada is reinventing the approach to preventive behavioral health research, its efficacy, and its application. In the process, we are unlocking a scalable, clinically-effective, and customizable approach to combating the chronic disease epidemic. Attendees will learn how analysis, machine learning, and experimentation can determine the best approach to motivate lasting behavior change in hard-to-reach populations – and how secondary data collected can be mined for unexpected value and insight. Additionally, they will learn, through examples, the mistakes and learnings of implementing an in-house clinical trial management system in a quickly growing company. Finally, attendees will learn how a customized, data-driven approach to chronic disease prevention has implications for the biggest public health challenge in America today.

Speakers
avatar for Eric Williams

Eric Williams

Director of Data Science, Omada Health
Eric Williams is Director of Data Science at Omada Health. As an undergraduate Eric studied physics at UC Berkeley, later receiving his PhD from Columbia University in Particle Physics. He was a Researcher at CERN for several years and a Postdoctoral Scholar at Memorial Sloan-Kettering Cancer Center before joining Omada Health in 2014. | At Omada, Eric's team focuses on the data science of behavior change - leveraging analysis, machine... Read More →



Friday May 20, 2016 11:10am - 11:30am
Gardner

11:10am

Visualizing the silent force of the Bay
Why does data play a vital role in the life of a kiteboarder surfing beneath the Golden Gate Bridge?

The San Francisco Bay's waters carry a silent force that affects every object within them. Over the past nine years, I have gained significant experience struggling to navigate, understand, and predict tidal current conditions while sailing out of Crissy Field. This talk explores user-centered tidal data visualization on mobile devices, focusing on the San Francisco Bay. Come learn about next-level archaic and contemporary ways of visualizing environmental data, and how technology can be used to help users safely traverse the bay.

Speakers
avatar for Boriana Viljoen

Boriana Viljoen

Product Designer, Castlight Health
A product designer interested in data visualization, healthcare tech, tidal currents and various watersports :)


Friday May 20, 2016 11:10am - 11:30am
Markov

11:40am

Interactive Machine Learning on Genomics Data with the Spark Notebook
Processing genomics data efficiently nowadays implies being able to work at scale, to use advanced machine learning methods, and to develop models interactively. The required convergence of technologies is a reality and is presented here. The edifice builds on ADAM, a Spark library for genomics developed at the AMPLab, providing the right data representation and APIs for applying distributed computing to genomics data. The development tool is the Spark Notebook, giving an interactive interface for requesting code execution. Its integration with scalable machine learning libraries and ADAM allows us to work interactively on data from a single environment, at scale, with advanced modelling methods. We demonstrate some examples of genomics data processing, e.g. on 1000 Genomes data, going from simple data manipulation to descriptive statistics and more complex population stratification with deep learning.
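
For a flavor of the kind of descriptive statistic mentioned above, here is a minimal Spark sketch. It deliberately uses a toy genotype case class and plain Spark SQL rather than ADAM's actual API and schemas, so treat it as an illustration of the interactive, at-scale workflow rather than of ADAM itself.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{count, sum}

    // Toy genotype record; ADAM's real schema (bdg-formats) is much richer.
    case class ToyGenotype(sampleId: String, contig: String, position: Long, altCopies: Int)

    object GenotypeStatsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("genotype-stats").master("local[*]").getOrCreate()
        import spark.implicits._

        val gts = Seq(
          ToyGenotype("HG00096", "1", 10177L, 1),
          ToyGenotype("HG00097", "1", 10177L, 2),
          ToyGenotype("HG00096", "1", 10352L, 0),
          ToyGenotype("HG00097", "1", 10352L, 1)).toDS()

        // Descriptive statistic per site: alternate-allele frequency, assuming diploid samples.
        gts.groupBy($"contig", $"position")
           .agg((sum($"altCopies") / (count($"sampleId") * 2)).as("altAlleleFreq"))
           .orderBy($"position")
           .show()

        spark.stop()
      }
    }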

Speakers
avatar for Andy Petrella

Andy Petrella

Cofounder, Data Fellas
Creator of Spark Notebook


Friday May 20, 2016 11:40am - 12:00pm
Ada

11:40am

Real-time diagnostics for the masses
At Echo Labs, we have packed the full monitoring capabilities of a modern emergency room into a small, noninvasive optical sensor. With this technology we have captured real-time physiological data from hundreds of patients during their everyday lives. Why is this data important? How can we analyze these vast amounts of data to predict which of the monitored patients needs medical attention? In this talk I will cover the challenges of using techniques such as deep learning with physiological data, and of incorporating centuries of medical information and know-how into automatic diagnostic algorithms. I will show examples and results from some of our research over the last three years.

Speakers
avatar for Elad Ferber

Elad Ferber

Co-founder and CTO, Echo Labs


Friday May 20, 2016 11:40am - 12:00pm
Gardner

11:40am

Translating a Trillion Points of Data into Therapies, Diagnostics, and New Insights into Disease
There is an urgent need to take what we have learned in our new “genome era” and use it to create a new system of precision medicine, delivering the best preventative or therapeutic intervention at the right time, for the right patients. Dr. Butte's lab at the University of California, San Francisco builds and applies tools that convert trillions of points of molecular, clinical, and epidemiological data -- measured by researchers and clinicians over the past decade and now commonly termed “big data” -- into diagnostics, therapeutics, and new insights into disease. Several of these methods or findings have been spun out into new biotechnology companies. Dr. Butte, a computer scientist and pediatrician, will highlight his lab’s recent work, including the use of publicly-available molecular measurements to find new uses for drugs including new therapies for autoimmune diseases and cancer, discovering new diagnostics including blood tests for complications during pregnancy, and how the next generation of biotech companies might even start in your garage.

Speakers
avatar for Atul Butte

Atul Butte

Director, Institute for Computational Health Sciences, UCSF
Atul Butte, MD, PhD is the inaugural Director of the Institute of Computational Health Sciences (ICHS) at the University of California, San Francisco, and a Professor of Pediatrics.  Dr. Butte is also the Executive Director for Clinical Informatics across the six University of California Medical Schools and Medical Centers. Dr. Butte trained in Computer Science at Brown University, worked as a software engineer at Apple and Microsoft... Read More →


Friday May 20, 2016 11:40am - 12:00pm
Markov

12:00pm

Build Smarter, Don't Just Build
Product Data Science at Salesforce creates prescriptive insights, grounded in data science, to drive growth for each cloud. We algorithmically inform product strategy and design, product management, engineering quality, capacity planning and, above all, adoption across our substantial customer base. We enable identification of drivers for behaviors such as churn, adoption, propensity to buy, readiness for multi-cloud, and predictive journeys to maximize adoption and customer success. This will be a thrilling data.science.insights.action presentation.

Speakers
avatar for Hernan Asorey

Hernan Asorey

VP, Product Data Science, Salesforce
Hernán is responsible for transforming product ideation, design and development, as well as growth, into an evidence-driven culture where intelligence becomes an intrinsic part of the product to ship. He manages, grows, and strives to inspire a world-class team of data experts across the full information spectrum: information engines (ingest&digest), analytics engines (visualize), data science (apply), experimentation and optimization (test and... Read More →
avatar for Robin Glinton

Robin Glinton

Sr. Director, Data Science Applications, Salesforce.com
Robin Glinton is the Director of Data Science Applications (DSA) at Salesforce.com. Robin leads a team dedicated to understanding adoption of Salesforce product as well as sprinkling machine learning into the Salesforce CRM offerings. Robin has held a number of positions across Startups, Industry and Academia including support of omni-channel marketing across Sears, Kmart and the 50M member ShopYourWay loyalty program. kWantera a startup... Read More →



Friday May 20, 2016 12:00pm - 12:20pm
Markov

12:00pm

Challenges in 3D Volumetric Data Sets
3Scan is an automated histology company. Our core technical offering is a 3D imaging microscope, which generates volumetric image data. Each microscope can create several terabyte-scale datasets per day; we run three in production and will have close to ten by the end of the year. In this talk I will address some of the challenges and issues surrounding data collection and analysis at this scale. Relevant tools include Python, Apache Spark, EC2, and Meteor.

Speakers
avatar for Todd Huffman

Todd Huffman

CEO, 3Scan



Friday May 20, 2016 12:00pm - 12:20pm
Ada

12:00pm

Ginger.io: a Data-Driven Mental Health Care Provider
Speakers
avatar for Sai Moturu

Sai Moturu

Head of Data Science, Ginger.io
Data science, machine learning, healthcare, mental health



Friday May 20, 2016 12:00pm - 12:20pm
Gardner

1:10pm

Cancer Screening using Deep Learning
Sad but true: most of radiology is mind-numbing tedium. Radiologists review hundreds of mammograms searching for tiny lesions; they meticulously draw contours around the heart in cardiac MRIs to measure volumes; they rely on manual checklists and decision trees to characterize liver disease. These are repetitive, boring tasks. They take a huge amount of time, leading to large medical bills, and they cause radiologist fatigue, resulting in frequent errors or inconsistencies. Automated decision support, in which all of the tedious tasks are automated by computerized algorithms, is the holy grail of radiological interpretation. Using the latest deep learning technology in an intelligent cloud platform, Arterys is bringing radiological decision support to hospitals worldwide. We describe one example of our technology, detecting lung nodules in the openly available LIDC lung cancer imaging data set, including our data processing and deep learning strategies.
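
As a small, self-contained illustration of the kind of data-processing step such a pipeline needs before any deep learning happens, here is a sketch of extracting a fixed-size cube of voxels around a candidate location in a 3-D volume, with zero padding at the scan boundary. It is a generic example, not Arterys's pipeline, and all names and dimensions are invented.

    object PatchExtractionSketch {
      // volume(z)(y)(x): a CT scan as a 3-D array of intensities (toy dimensions here).
      type Volume = Array[Array[Array[Double]]]

      /** Extract a size^3 patch centered at (cz, cy, cx); out-of-bounds voxels are zero-padded. */
      def extractPatch(volume: Volume, cz: Int, cy: Int, cx: Int, size: Int): Volume = {
        val half = size / 2
        Array.tabulate(size, size, size) { (dz, dy, dx) =>
          val z = cz - half + dz; val y = cy - half + dy; val x = cx - half + dx
          if (z >= 0 && z < volume.length &&
              y >= 0 && y < volume(0).length &&
              x >= 0 && x < volume(0)(0).length) volume(z)(y)(x)
          else 0.0                                   // zero padding outside the scan
        }
      }

      def main(args: Array[String]): Unit = {
        val volume: Volume = Array.tabulate(16, 16, 16)((z, y, x) => (z + y + x).toDouble)
        val patch = extractPatch(volume, cz = 2, cy = 2, cx = 2, size = 5)
        println(s"patch is ${patch.length} x ${patch(0).length} x ${patch(0)(0).length}")
      }
    }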

Speakers
avatar for Daniel Golden

Daniel Golden

Senior Image Scientist, Arterys, Inc.
Dan is a Senior Image Scientist at Arterys, a San Francisco-based startup specializing in medical imaging visualization and quantification. He leads machine learning efforts at Arterys, working on using deep learning and convolutional neural networks for automatic detection, segmentation and classification of structures in MRI and CT images to provide decision support to clinical radiologists. He graduated from Stanford University with a PhD in... Read More →


Friday May 20, 2016 1:10pm - 1:30pm
Gardner

1:10pm

Designing for Inconsistency
A primary goal of designing any interface is to establish consistent, usable patterns that provide familiar, rapid paths to and from information. Nowhere is this more true than in the world of data products, where the express intent of our work is to deliver the best information, at the right moment, with the greatest possible speed. But data isn’t perfect. Sometimes it isn’t even there. Especially in an early stage startup. How do we get to market while our dataset is still under construction? This talk will share strategies for accommodating variability in data when designing interfaces for data-driven products, highlighting their application in the design process at Ravel Law. We'll take a look at the nuances of legal data, discuss the notion of Primary Task and share lessons learned from the design of a brand new analytics product.

Speakers
avatar for Brian Studwell

Brian Studwell

Product Designer, Ravel Law
I'm a product designer carrying a Master of Human-Computer Interaction + Design from the University of Washington. I practice at the intersection of human behavior, complex information and emerging technologies imagining new ways people can work, play and learn. I bring experience spanning medicine, hospitality and design to bear on human problems in technological contexts. When I'm not at Ravel you can find me perfecting my bolognese recipe... Read More →



Friday May 20, 2016 1:10pm - 1:30pm
Ada

1:10pm

Visualizing big data: live coding an interactive dataviz app with opensource tools
Ever considered making your dataviz interactive by linking it to a database? Yeah, but then you’re entering a world of pain: authenticating users, writing the SQL code, making your widgets interact with each other… In this talk, I build a new dataviz app from scratch. Using source data representing a couple hundred million rows in an Amazon Redshift database, I show which open source tools are used to create an interactive, secure Javascript app using D3js and OpenBouquet. I demonstrate how much value is provided by connecting a dataviz to raw source data – even if it’s big. I discuss the choice of tools and, by the end, you’ll feel confident about achieving the same results even without any database skills.

Speakers
avatar for Olivier Balbous

Olivier Balbous

Software Architect, OpenBouquet
Olivier is Chief Software Architect at Squid Solutions. Olivier is also the key driver of R&D of the OpenBouquet API and Javascript SDK. Before joining Squid, he spent 15 years as Engineering Manager designing front and back-end web platforms for companies such as the Banque Populaire, Alsthom and IRCAM. Olivier studied software engineering at EPITA in Paris.



Friday May 20, 2016 1:10pm - 1:50pm
Markov

1:40pm

How to Use Big Data to Inspire Consumer Confidence
Credit Karma uses the power of technology to simplify financial decision-making. Analyzing over 50 million members’ finances, Credit Karma researches and recommends credit cards, loans and insurance based on each individual member’s specific credit profile. In some cases, the platform can also show members if they’re pre-approved for a particular product, giving them the confidence to apply without the risk of hurting their credit score. Credit Karma uses machine learning to deliver personalized and tailored recommendations so consumers can save across all their financial products. In this panel, Credit Karma shares how to use data to enrich user experiences. We will also share best practices for collecting, sharing and maximizing the efficacy of processing this data.

Speakers
avatar for Daniel Doerr

Daniel Doerr

Data Science Manager, Credit Karma
Daniel leverages data at Credit Karma to provide personalized financial solutions and recommendations. He leads a team of data scientists and analysts who use the machine to make recommendations more personalized and human. When Daniel isn’t crunching data, he is using data to find the perfect line through Sears Point Raceway in Sonoma.



Friday May 20, 2016 1:40pm - 2:00pm
Ada

1:40pm

With Big Data Comes Big Responsibility
As the cost of genetic sequencing has fallen, the rate of data generation is outpacing our resources to analyze it. While a future in which we use low-level biological data about ourselves to inform our medical choices is inevitable, getting there will not happen by default. Current attitudes towards software in the life science and medical space are rooted in academia, but need to shift if we are to make true precision medicine a reality. The job will fall on us software engineers to develop high-quality open-source tools, build communities to support them, and transition organizations from siloed datacenters to cloud environments. We will compare bioinformatics to cryptography, a field that has successfully leveraged open-source technology to make the Internet a safer place. We will also explore specific examples of APIs and libraries that are beginning to enable this shift and are already providing benefits to their users.

Speakers
avatar for Nish Bhat

Nish Bhat

Founding Engineer, Color Genomics



Friday May 20, 2016 1:40pm - 2:00pm
Gardner

2:10pm

The UMLS - Authoritative biomedical concept names in context

[Revised 05/16/16]  A concept is a unit of thought. Chances are any biomedical concept that is represented in your data has been named by some authority. Your tax $ pay for these names to be collected, maintained, and represented in a homogeneous, tool-supported context called the UMLS (Unified Medical Language System).

The latter consists of three knowledge sources - the Metathesaurus, a Semantic Network, and a Lexicon and accompanying tools. The UMLS was created and is maintained by the U.S. National Library of Medicine (NLM), part of the National Institutes of Health (NIH). The 2016AA release of the Metathesaurus contains more than 3.25 million concepts and 13.00 million unique concept names from over 197 source vocabularies expressed using 25 different languages. Many of these vocabularies include translations into the world's major languages. Because it contains a mixture of public and proprietary content, use of the UMLS requires a license, available free of charge from the NLM.

Tools are included to assist with browsing, downloading, subsetting, and representing the UMLS in existing databases. Additional tools support inter-source linking, and finding concepts in text. While not for the faint of heart, these resources are widely used around the world. Tutorial videos are available on the NLM UMLS web site.

Important, and widely used vocabularies in the UMLS include those naming diseases, lab tests, procedures, medications, chemicals, organisms, anatomic structures and genes, collected from both research and care.  Several of these vocabularies are part of the standards specified for use in U.S. Electronic Health Records. Internet connectivity permitting, audience members will be challenged to "stump the Metathesaurus" - that is, name an important biomedical concept that cannot be found there. This exercise will illustrate why the UMLS should not be re-invented.
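
For readers who want to poke at the Metathesaurus programmatically after obtaining a license, here is a minimal sketch of reading concept names out of the pipe-delimited MRCONSO.RRF file. The column positions follow the documented MRCONSO layout (CUI first, language second, source vocabulary and concept string further along), but verify them against the release documentation before relying on them; the file path below is a placeholder.

    import scala.io.Source

    object MetathesaurusLookupSketch {
      // MRCONSO.RRF is pipe-delimited; per the documented layout, column 0 is the CUI,
      // column 1 the language (LAT), column 11 the source vocabulary (SAB) and column 14
      // the concept name string (STR). Double-check these indices against the UMLS docs.
      case class Atom(cui: String, lang: String, source: String, name: String)

      def readAtoms(path: String): Iterator[Atom] =
        Source.fromFile(path, "UTF-8").getLines().map(_.split('|')).collect {
          case f if f.length > 14 => Atom(f(0), f(1), f(11), f(14))
        }

      def main(args: Array[String]): Unit = {
        // "Stump the Metathesaurus": list English names containing a search term.
        val query = "myocardial infarction"
        val hits = readAtoms("MRCONSO.RRF")           // placeholder path to a licensed UMLS release
          .filter(a => a.lang == "ENG" && a.name.toLowerCase.contains(query))
          .take(10)
        hits.foreach(a => println(s"${a.cui}\t${a.source}\t${a.name}"))
      }
    }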


Speakers
avatar for Brian Carlsen

Brian Carlsen

Sr. Informatics Consultant, West Coast Informatics
I've spent most of my career developing enterprise terminology maintenance solutions for the healthcare industry. I've worked with governments, non-profits, and private-sector businesses to develop, maintain, publish, and implement healthcare terminologies and information models to solve complex problems with unstructured data.
avatar for Mark Samuel Tuttle

Mark Samuel Tuttle

Board of Directors, Apelon
Taught computer science at UC Berkeley, and then Medical Information Science at UCSF. Co-founder of Lexical Technology, later merged with Onyx to form Apelon, from a UCSF project. Was initial external architect of National Library of Medicine Unified Medical Language System (UMLS) Metathesaurus. Now working as a data scientist, and engaged in data mining software development toward re-inventing statistics assuming unbounded computing... Read More →


Friday May 20, 2016 2:10pm - 2:50pm
Ada

2:10pm

Discoveries in using behavioral sensor data for authentication
There are many ways to prove your identity other than typing in your password or holding up your government issued ID. Human behaviors translated into data points highlight the fact that everyone is inherently unique. By utilizing the overabundance of sensor data coming from the growing number of connected devices used today, we are able to gain a deeper understanding of both how sensor data differs from person to person, and how we can use these unique data points to authenticate users both online and in the real world. As technology is rapidly advancing, so too must security. Join us as we take a look at human behavior from a data POV, and highlight some very interesting trends that we’ve discovered along the way.
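
To make the idea of behavior-as-signal concrete, here is a toy sketch, in no way UnifyID's method: it reduces a window of accelerometer magnitude samples to a tiny feature vector and accepts or rejects an authentication attempt by comparing it to the user's enrolled template with a distance threshold. All numbers are made up for illustration.

    object BehavioralAuthSketch {
      // Turn a window of accelerometer magnitude samples into a tiny feature vector:
      // mean, standard deviation, and mean absolute change between consecutive samples.
      def features(window: Seq[Double]): Vector[Double] = {
        val mean = window.sum / window.size
        val std  = math.sqrt(window.map(v => (v - mean) * (v - mean)).sum / window.size)
        val jerk = window.sliding(2).map { case Seq(a, b) => math.abs(b - a) }.sum / (window.size - 1)
        Vector(mean, std, jerk)
      }

      def distance(a: Vector[Double], b: Vector[Double]): Double =
        math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

      def main(args: Array[String]): Unit = {
        val enrolled  = features(Seq(9.7, 9.9, 10.4, 10.1, 9.6, 9.8, 10.3, 10.0)) // user's enrolled walk
        val sameUser  = features(Seq(9.8, 10.0, 10.3, 10.2, 9.7, 9.9, 10.2, 9.9))
        val otherUser = features(Seq(9.0, 12.5, 8.2, 13.1, 7.9, 12.8, 8.5, 13.0))
        val threshold = 0.5                                   // tuned on labeled data in practice
        println(s"same user accepted:  ${distance(enrolled, sameUser)  < threshold}")
        println(s"other user accepted: ${distance(enrolled, otherUser) < threshold}")
      }
    }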

Speakers
avatar for John Whaley

John Whaley

Founder/CEO, UnifyID
I work in the broad area of computer systems, especially operating systems, virtualization, computer security, finding and avoiding software defects, algorithms, performance, parallelization, concurrency, scalability, mobility, compilers, program analysis, programming languages, APIs, networking, protocols, system architecture, kernel development, IO performance, and data structures. | | Arthur L. Samuel Thesis Award for Best Thesis at... Read More →



Friday May 20, 2016 2:10pm - 2:50pm
Gardner

2:10pm

Pro-Active: Designing a genuinely helpful SQL interface, that even power-users love
Microsoft’s Clippy showed the world that a computer can use context clues to determine when a person is writing a letter, but that it probably should keep its digital mouth shut. In this talk, we’ll discuss designing a query tool that offers contextualized suggestions (and even warnings) to help users write accurate and performant queries quickly… without making them want to force quit in fury—whether they're less-techy Excel folks or committed command line coders.

Over the course of this fast-paced and fun session we’ll 
  • incorporate learnings from Don Norman, Cliff Nass, and other HCI luminaries (using case studies from self-driving cars and a variety of software examples, both cutting-edge and totally familiar), as well as lessons from user research on beta releases
  • geek out about predictive text (incorporating syntactic, semantic, and social clues; see the toy sketch after this list) 
  • explore how to provide suggestions or interventions at just the right time and in just the right way to maximize utility and minimize frustration
  • and much more…
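
As a toy companion to the predictive-text bullet above, here is a frequency-based sketch that suggests the next SQL token from a log of past queries. Real assistants like the one described blend syntax, schema metadata, and usage signals, so treat this only as the bare skeleton of the idea; the names and query log are invented.

    object NextKeywordSketch {
      // Count which token most often follows each token in a log of past queries,
      // then suggest that token while the user types.
      def buildBigrams(queries: Seq[String]): Map[String, String] = {
        val pairs = queries.flatMap { q =>
          q.toUpperCase.split("\\s+").toSeq.sliding(2).collect { case Seq(a, b) => (a, b) }
        }
        pairs.groupBy(_._1).map { case (prev, followers) =>
          prev -> followers.groupBy(_._2).maxBy(_._2.size)._1   // most frequent next token
        }
      }

      def main(args: Array[String]): Unit = {
        val log = Seq(
          "SELECT name FROM users WHERE age > 21",
          "SELECT count(*) FROM orders GROUP BY region",
          "SELECT name FROM users ORDER BY name")
        val model = buildBigrams(log)
        println(model.get("FROM"))    // most common token seen after FROM in this log
        println(model.get("GROUP"))   // -> Some(BY)
      }
    }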

Speakers
avatar for Aaron Kalb

Aaron Kalb

Head of Product, Alation
Aaron has spent his career crafting delightful and empowering human-computer interactions, especially through natural language interfaces. After leaving Stanford with a BS and an MS in Symbolic Systems and working at Apple on iOS and Siri (doing engineering, research, and design in the Advanced Development Group), Aaron leads the design team and guides the product vision at Alation. In his spare time, he enjoys backpacking, board games, and... Read More →


Friday May 20, 2016 2:10pm - 2:50pm
Markov

3:00pm

Distributed Visualization for Genomic Analysis
Current genomics visualization tools are intended for a single node environment and lack computational resources to provide interactive speeds. Data from the 1000 Genomes Project provides 1.6 terabytes of variant data and over 14 terabytes of alignment data. However, typical genomic visualizations materialize less than 10 kbp, approximately 3.3e-7% of the genome. Mango is a visualization browser that selectively materializes and organizes genomic data to provide fast in memory queries. Mango materializes data from persistent storage as the user requests different regions of the genome. This data is efficiently partitioned and organized in memory using interval trees, which enables quick range queries over genomic data.
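
Since the abstract calls out interval trees, here is a compact, self-contained sketch of a centered interval tree supporting overlap queries on half-open genomic intervals. It is a generic illustration, not Mango's code, and assumes non-degenerate intervals with start < end.

    object IntervalTreeSketch {
      case class Interval(start: Long, end: Long, name: String) {           // [start, end), start < end
        def overlaps(qs: Long, qe: Long): Boolean = start < qe && end > qs
      }

      // A classic centered interval tree: each node keeps the intervals that cross its
      // center point; everything entirely to the left or right goes into a subtree.
      sealed trait Tree { def query(qs: Long, qe: Long): Seq[Interval] }
      case object Empty extends Tree { def query(qs: Long, qe: Long): Seq[Interval] = Nil }
      case class Node(center: Long, crossing: Seq[Interval], left: Tree, right: Tree) extends Tree {
        def query(qs: Long, qe: Long): Seq[Interval] = {
          val here = crossing.filter(_.overlaps(qs, qe))
          val lhs  = if (qs < center) left.query(qs, qe) else Nil
          val rhs  = if (qe > center) right.query(qs, qe) else Nil
          lhs ++ here ++ rhs
        }
      }

      def build(ivs: Seq[Interval]): Tree =
        if (ivs.isEmpty) Empty
        else {
          val center   = ivs.map(i => (i.start + i.end) / 2).sorted.apply(ivs.size / 2)
          val crossing = ivs.filter(i => i.start <= center && i.end > center)
          Node(center,
               crossing,
               build(ivs.filter(_.end <= center)),
               build(ivs.filter(_.start > center)))
        }

      def main(args: Array[String]): Unit = {
        val tree = build(Seq(
          Interval(100, 200, "readA"), Interval(150, 400, "readB"),
          Interval(500, 600, "readC"), Interval(950, 1200, "readD")))
        println(tree.query(180, 520).map(_.name))   // -> overlapping reads: readA, readB, readC
      }
    }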

Speakers
avatar for Alyssa Morrow

Alyssa Morrow

Student Researcher, University of California-Berkeley
At UC Berkeley, the BDGenomics team is working to create scalable genomics preprocessing and analysis on top of Spark. I am currently working on a distributed genomic visualization tool that allows ad hoc querying on TB of genomic data.
avatar for Eric Tu

Eric Tu

Graduate Student, UC Berkeley AMPLab
I'm a graduate student at UC Berkeley in the AMPLab, working on genomic visualizations built on top of Spark.



Friday May 20, 2016 3:00pm - 3:20pm
Gardner

3:00pm

Data Design Challenges for Enterprise IoT Applications: Semantic Sensor Network Ontologies
This presentation will outline the problem of designing for data-intensive applications and suggest some possible solutions.  Enterprise IoT presents a series of non-trivial challenges for designers. Unlike many consumer applications, enterprise IoT systems tend to be unintuitive to non-expert users and designers alike. Designers of these systems are challenged to represent the data that underlies these experiences. Taking a data-first approach to designing these systems will result in better applications and less design angst.    

Speakers
avatar for Zachary Taschdjian

Zachary Taschdjian

Interaction/Product Lead, General Electric Digital
Zac makes tools for interacting with data and using data to drive business and user value, typically for enterprise platforms/products. His specialty is time series and graph data visualization for enterprise IoT applications. His skill set blends interaction design/HCI, visualization research and machine learning. Prior to GE, he researched and built data/info vis applications with IBM.



Friday May 20, 2016 3:00pm - 3:20pm
Ada

3:00pm

Remixing Design with Data
Software designers and data scientists have not traditionally been situated together within companies to team up to create great products. But when data scientists do join the fold, designers have an opportunity to tap into insights to create well-informed, data-driven designs. I’ll discuss a number of recent intersections between design and data, including a number of projects at Platfora where we’ve remixed traditional design methods to create a more data-driven process. The talk will cover the role of data in a designer’s process, advice for building a data-driven product culture, and an overview of Platfora research projects that offer a new spin on the tried-and-true research methods:
  • Prioritizing product and usability improvements using the Kano model
  • Understanding usage patterns to gauge feature adoption
  • Validating, challenging, and revising product personas using archetypal behavior derived from telemetry clickstreams

Speakers
avatar for James Mulholland

James Mulholland

Manager, UX Design and Research, Platfora
James loves to bring design and user experience to the world of big data and connected information. His background includes human-computer interaction, data visualization, organizational behavior, branding, and even technical illustration. | | Before joining Platfora as their first designer, he developed visualization solutions for big data researchers and mobile payment systems for Bank of America. | | At Platfora he leads the UX... Read More →



Friday May 20, 2016 3:00pm - 3:20pm
Markov

3:30pm

How Data Helps UX Evolve at realtor.com
realtor.com, a News Corp company and the fastest-growing online real estate service provider in the US, invites you to take a closer look at the data-driven design process behind a real-time, informed, and interactive online real estate application. Through case studies, this presentation will give you an overview of how insight into user activity is analyzed systematically to drive product renovation. Key learning objectives also include how the user experience of realtor.com's data-centric products is enhanced by data visualization techniques.

Speakers
avatar for Ian Lin

Ian Lin

Lead Data Visualization Designer, realtor.com
Ian Lin is the Lead Data Visualization Designer at Move, Inc., a News Corp company and the operator of realtor.com. He is a hybrid designer/developer in UX, Front-End Engineering, and Data Viz. He helps visualize insight from Data Science/Analytics, streamlines UI workflows with Data Viz techniques, and creates BI dashboards for C-level executives. His design philosophy is to create delightful UI to streamline collaboration between people and... Read More →


Friday May 20, 2016 3:30pm - 3:50pm
Ada

3:30pm

Urban Heartbeat: Data Experiments with Place
In this talk, I will discuss how I collected, analyzed, and visualized real-time civic data from open data sources and APIs. As part of a fellowship with Stamen Design and Gray Area Arts Foundation last year, I created a project called Urban Heartbeat. My work explored civic, social, and environmental data at the neighborhood level. I collected data from June to August 2015 and analyzed it in a series of experiments. I performed spatial and content analysis of social media to discover the location of people’s activities in the neighborhood. I used data from DataSF.org, Instagram, Twitter, Foursquare, NextBus, Waze, Factual, Weather Underground, Craigslist, and other sources. The technology used in this project includes D3.js, Firebase, CartoDB, and Node.js (including Node libraries for color analysis and image quantization; geospatial analysis, network analysis, natural language processing, sentiment analysis, and machine learning). The resulting artwork was a generative data installation at the Grand Theater in San Francisco’s Mission District. The art allows passersby to explore their neighborhood via visualizations at the urban scale. My project work has been exhibited in Geneva, Bangalore, and Pittsburgh, and is currently on display in San Francisco. In 2016, my work has continued and I’m partnering with architects and urban planners in the Bay Area to analyze urban space, make planning decisions, and engage with local communities.

Speakers
avatar for Steve Pepple

Steve Pepple

Product Designer and Developer, OpenGov
Steve Pepple is a Bay Area designer and software developer who works to improve city streets and civic information systems. He is a product designer at OpenGov, where he designs software that improves how governments spend money, make decisions, and communicate with citizens. | | His recent art visualizes urban activity and environment in cyberspace as a reflection of people’s activity in physical places and is currently on display at the... Read More →




Friday May 20, 2016 3:30pm - 3:50pm
Gardner

3:30pm

When Visualization Best Practices Fall On Deaf Ears
Data Visualization enthusiasts have, by now, listened to a lot of experts and read a million books which teach the importance of using the right visual encodings for effective user perception. These techniques or best practices are often backed by scientific research. But what if your customer asks for something that's exactly from the "don'ts" section of your visualization rulebook? What often goes unaccounted for is that the audience of these dashboards are neither necessarily trained in data visualization, nor are they even aware of the close-knit data visualization community active on social media. And that's very natural. The audience of your dashboards are often business domain experts, who have much broader problems to solve than to educate themselves about good data visualization techniques. So, what happens when your "Visualization Gyan" falls on deaf ears? Do you build something that your "audience wants"? Or do you decide to use your engineering excellence to give them "what's right"?

Speakers
avatar for Akash Mukherjee

Akash Mukherjee

Data Products for People Insights, Facebook




Friday May 20, 2016 3:30pm - 3:50pm
Markov

4:00pm

Deep Learning, Heart Rate Sensors, and Strokes
Deep learning has shown breakthrough results in computer vision, speech recognition, and natural language processing, but as yet, the applications to medicine are few. Now that we've thoroughly honed our techniques to detect cats in YouTube videos, can we apply that same technology to save lives?

We'll report on a collaboration between a team of cardiologists at UCSF and machine learning engineers at Cardiogram, using sensor data from Apple Watch and other wearables to prevent strokes. About a quarter of strokes are caused by atrial fibrillation, the most common heart arrhythmia. In atrial fibrillation, electrical conduction in the heart becomes disorganized. The upper chambers may beat 300-600 times per minute. The lower chambers may beat at a normal rate, but irregularly. AF is treatable, but asymptomatic—many people don't realize they have it—and if you can develop an algorithm to detect when a person has entered an episode of atrial fibrillation using heart rate time series, you can potentially prevent a stroke.

The talk will include a brief introduction to cardiac electrophysiology, a review of key techniques in deep learning, and then dive into how we're using convolutional autoencoders and semi-supervised sequence learning to detect anomalous patterns of heart rate variability. We'll include lots of example data to build up intuition, and code examples using TensorFlow. Depending on the length of the talk, the algorithmic techniques covered will likely include: convolutional autoencoders for dimensionality reduction, long short-term memory, and semi-supervised sequence learning. (If this is a 20 minute talk, we'll focus on just one of those techniques.)
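
The talk itself promises TensorFlow examples; as a much simpler stand-in for building intuition about the input signal, here is a sketch computing two classic heart-rate-variability statistics, SDNN and RMSSD, over beat-to-beat (RR) intervals. Highly irregular RR sequences, as in atrial fibrillation, push both statistics up. The numbers below are made up for illustration and this is not Cardiogram's model.

    object HrvSketch {
      /** Standard deviation of RR intervals (SDNN), in milliseconds. */
      def sdnn(rr: Seq[Double]): Double = {
        val mean = rr.sum / rr.size
        math.sqrt(rr.map(x => (x - mean) * (x - mean)).sum / rr.size)
      }

      /** Root mean square of successive differences (RMSSD), in milliseconds. */
      def rmssd(rr: Seq[Double]): Double = {
        val diffs = rr.sliding(2).map { case Seq(a, b) => (b - a) * (b - a) }.toSeq
        math.sqrt(diffs.sum / diffs.size)
      }

      def main(args: Array[String]): Unit = {
        val regular   = Seq(800.0, 810.0, 795.0, 805.0, 800.0, 790.0, 805.0, 800.0)  // steady sinus rhythm
        val irregular = Seq(620.0, 940.0, 710.0, 1050.0, 580.0, 880.0, 660.0, 990.0) // AF-like irregularity
        println(f"regular:   SDNN=${sdnn(regular)}%.1f ms  RMSSD=${rmssd(regular)}%.1f ms")
        println(f"irregular: SDNN=${sdnn(irregular)}%.1f ms  RMSSD=${rmssd(irregular)}%.1f ms")
      }
    }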

We'll conclude with some broader thoughts on the intersection of artificial intelligence and medicine, including lessons learned while bridging the cultures of machine learning research and clinical research, as well as some thoughts on how artificial intelligence may drive the future of healthcare.

Speakers
avatar for Brandon Ballinger

Brandon Ballinger

Co-Founder, Cardiogram
Brandon currently applies machine learning to cardiology at Cardiogram. Previously, he helped fix healthcare.gov, co-founded Sift Science, and worked as an engineer at Google on Android speech recognition.
avatar for Johnson Hsieh

Johnson Hsieh

Co-Founder, Cardiogram
Johnson is applying machine learning to cardiology starting with data from the Apple Watch. Previously, Tech Lead on Google Voice Actions ("ok, Google"), interest-based user modeling, and Search.



Friday May 20, 2016 4:00pm - 4:40pm
Ada

4:00pm

Extracting Medical Attributes from Clinical Trials
Understanding the relationships between drugs and diseases, side effects, and dosages is an important part of drug discovery and clinical trial design. Some of these relationships have been studied and curated in different formats such as the UMLS, BioPortal, SNOMED, etc. Typically this data is incomplete and distributed across various sources. I will address the different stages of drug-disease, drug-side effect, and drug-dosage relationship extraction. As a first step, I will discuss extracting medical attributes (diseases, dosages, side effects) from FDA drug labels and clinical trials. Next, I will use simple machine learning techniques to improve the precision and recall of this sample. I will also discuss bootstrapping a training sample from a smaller training set. Then I will use DeepDive, a dark data extraction framework, to extract relationships between medical attributes and derive conclusive evidence about them. The advantage of using DeepDive is that it masks the complexities of the machine learning techniques and forces the user to think more about features in the data set. At the end of these steps we will have structured (queryable) data that answers questions such as: what is the dosage of 'digoxin' for controlling 'ventricular response rate' in a male adult aged 60 weighing 160 lbs?
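
As a taste of the candidate-generation step described above, here is a toy regular-expression extractor that pulls "number plus unit" dosage mentions out of free text. It is only a sketch of the high-recall first pass; the names, regex, and example text are invented, and the feature engineering and statistical inference are left to a framework like DeepDive.

    object DosageExtractionSketch {
      // A toy candidate extractor: find "<number> <unit>" dosage mentions in text that
      // mentions a given drug. Real pipelines add features, distant supervision, and
      // statistical inference on top of candidates like these.
      case class DosageMention(drug: String, amount: Double, unit: String)

      private val dosage = """(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml)""".r

      def extract(drug: String, text: String): Seq[DosageMention] =
        dosage.findAllMatchIn(text.toLowerCase)
          .filter(_ => text.toLowerCase.contains(drug.toLowerCase))   // crude drug co-occurrence check
          .map(m => DosageMention(drug, m.group(1).toDouble, m.group(2)))
          .toSeq

      def main(args: Array[String]): Unit = {
        val label = "Digoxin tablets: the usual maintenance dose is 0.25 mg once daily; " +
                    "reduce to 0.125 mg in patients with renal impairment."
        extract("digoxin", label).foreach(println)
        // -> DosageMention(digoxin,0.25,mg) and DosageMention(digoxin,0.125,mg)
      }
    }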




Speakers
avatar for Sanghamitra Deb

Sanghamitra Deb

Data Scientist, Accenture Technology Laboratory


Friday May 20, 2016 4:00pm - 4:40pm
Gardner

4:00pm

UX Techniques Supporting Varying Levels of Aggregation in Data Selection and Visualization
Some technologies for building data visualizations lend themselves to dynamic applications and interactivity (D3, HighCharts). Other technologies offer a lot of flexibility and precision (SQL), ease-of-use (SAP, Excel), or breadth of visualization types (Tableau, Stata). Doing data exploration at varying levels of aggregation is still a challenge for all of these tools. This talk will explore use cases involving visualizations which require varying levels of aggregation in the same visualization, and some tools, techniques, and technologies to support those visualizations. Examples will include selection techniques in SQL, data preparation scripts to prepare data for D3 visualizations, and using Excel for prototyping and checking conclusions. ClearStory Data has used a combination of Spark, D3, and React to create a web-based application which makes data combination and exploration clear, interactive, and maintainable even for the largest data sets. This talk will also discuss findings specifically relevant to supporting interactivity and clarity in data exploration of varying aggregation levels.
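
As a tiny, tool-agnostic illustration of "varying levels of aggregation in the same visualization", here is a sketch that aggregates the same toy records at two granularities: per (day, region) for the detailed series and per region for a coarser reference. It uses plain Scala collections and invented data, not ClearStory's Spark/D3/React stack.

    object MultiLevelAggregationSketch {
      case class Sale(day: String, region: String, amount: Double)

      def main(args: Array[String]): Unit = {
        val sales = Seq(
          Sale("2016-05-16", "west", 120.0), Sale("2016-05-16", "east", 80.0),
          Sale("2016-05-17", "west", 200.0), Sale("2016-05-17", "east", 40.0))

        // Fine-grained level: one value per (day, region) for the detailed chart.
        val byDayRegion = sales.groupBy(s => (s.day, s.region)).map { case (k, v) => k -> v.map(_.amount).sum }

        // Coarse level: one value per region, e.g. plotted as a reference line on the same chart.
        val byRegion = sales.groupBy(_.region).map { case (k, v) => k -> v.map(_.amount).sum }

        byDayRegion.toSeq.sortBy(_._1).foreach { case ((day, region), total) => println(s"$day $region $total") }
        byRegion.toSeq.sortBy(_._1).foreach { case (region, total) => println(s"$region total $total") }
      }
    }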

Speakers
avatar for Katherine Ahern

Katherine Ahern

Manager, Analysis and Visualizations, ClearStory Data
Katherine Ahern manages the Analysis and Visualizations group at ClearStory Data, where she focuses on usability for complex analytic workflows, including getting accurate results combining diverse data sources. Before coming to ClearStory she worked on a web-based analytics tool for departments of transportation (including Caltrans), did analysis and reporting for emergency room operations at MedAmerica, and was a research and development... Read More →



Friday May 20, 2016 4:00pm - 4:40pm
Markov

4:50pm

Protecting data scientists in healthcare with type safety
Healthcare is a veritable zoo of datatypes, implementations, and formats, providing an unrivaled challenge in data integration. Historically, quality control across disparate data has been done by data scientists, in an ad hoc manner, on their analysis platform of choice: Python or R. At Wellframe, we were looking for a concerted solution to handle both complex integrations and the more general problem of connecting data analysis and feature development. We will share our experiences where, as the volume and diversity of our data sources exploded, we realized that Python did not provide the guarantees we required. Toward this end, we have moved all data QC further upstream to our Scala-based infrastructure to let the type system help manage more of the complexity. To accelerate translating insights into features, we have utilized Spark to provide the DataFrames our data scientists know and love, while still being able to take advantage of our hardware. This has turned out to be a mixed blessing: it has increased our pace, but the loss of type safety during analysis allows bugs to be propagated through our system. We will discuss the approaches we are pursuing to improve this from both sides, by migrating from RDDs to Datasets, and moving our analysis from Python to Scala.
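
A minimal sketch of the DataFrame-to-Dataset point made above, assuming invented class and column names: with an untyped DataFrame a misspelled column only fails at runtime, while the typed Dataset version of the same mistake is rejected by the compiler before the job ever runs.

    import org.apache.spark.sql.SparkSession

    case class Measurement(patientId: String, systolic: Double)

    object TypedPipelineSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("typed-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        val ds = Seq(Measurement("p1", 118.0), Measurement("p2", 141.0)).toDS()

        // Untyped: a typo like $"systollic" compiles fine and only blows up when the job runs.
        // ds.toDF().filter($"systollic" > 140).show()

        // Typed: the same typo (m.systollic) is rejected by the compiler, so the mistake
        // never reaches the cluster.
        val hypertensive = ds.filter(m => m.systolic > 140)
        hypertensive.show()

        spark.stop()
      }
    }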

Speakers
GR

Gopal Ramachandran

Head of Technology, Wellframe
Gopal is the Head of Technology at Wellframe. Previously, he worked at Massachusetts General Hospital, wherein he was part of the team that worked with Apple on the development of ResearchKit. | | Gopal received his M.D. from Harvard Medical School, and his Ph.D. from MIT.



Friday May 20, 2016 4:50pm - 5:10pm
Markov

4:50pm

Importance of rethinking data visualization
Visualizations of quantitative data typically just consist of multiple line, bar, and pie charts in a dashboard, leaving the viewer to aggregate and correlate the data in order to synthesize a meaningful story. The effects can vary from somewhat informative to disastrously misleading. The industry is misled by products with standardized interfaces that aim to present performance information effectively through visualization but often miss the mark due to a lack of domain expertise and creativity. This presentation shows how much more powerful visualizations of quantitative data can be with the application of domain expertise. The presentation will kick off with a review of Edward Tufte's analysis of the Challenger space shuttle disaster and move on to typical industry monitoring challenges such as displaying system load, analyzing the efficiency of database requests, and monitoring latency response.

Speakers
avatar for Kyle Hailey

Kyle Hailey

Technical Evangelist, Delphix
Kyle has worked on IT performance for over 20 years. He was a principal designer at Oracle on the Enterprise Manager performance monitoring interface which was implemented under waterfall methods. Following that he was the designer and product manager of DB Optimizer at Embarcadero using agile methodologies to deliver every version on schedule. Now he works at Delphix bringing agility to one of the biggest hurdles to application development... Read More →


Friday May 20, 2016 4:50pm - 5:10pm
Gardner

4:50pm

Towards a virtual reality meta-Earth
We live in an era abundant in data, and data science thrives. While there is sufficient metadata to reconstruct a representation of the real world, with digital maps being a prime example, data as abstract entities are effectively invisible to the everyday person. In this talk, we show how we use metadata to build a virtual meta-Earth:
  • the data sources we employ as reality's blueprint
  • the visualization technologies that bring these data to life
  • the use of scientific models to simulate daylight cycles and pollution levels
  • integration with virtual reality, as a prelude to the virtual worlds described in science fiction

Speakers
avatar for Bo Huang

Bo Huang

Principal Software Engineer, SenseEarth.com
Bo Huang has been a game programmer, empowering mobile devices all over the world to enjoy Pacman and Time Crisis, and an engineering scientist, simulating photons refracting through paint to produce a wide range of glittering appearances. | | He combines realistic rendering technologies and real-world data to work towards the metaverses described by science fiction such as Snow Crash. In his talk, he will guide you through the data, technology, and... Read More →



Friday May 20, 2016 4:50pm - 5:10pm
Ada

5:20pm

Panel: Living Data
The closing panel of Data By the Bay brings together the topics of the last day -- Life Sciences and UX, DataViz -- making connections to all the themes of the conference. How can we re-engineer our lives with data? What can we see to make our lives better, and how should we see it so we can act upon what we see? Come join the discussion!

Moderators
avatar for Alexy Khrabrov

Alexy Khrabrov

Chief Scientist, Nitro/By the Bay
Chief Scientist at Nitro, founder and organizer, SF {Scala, Text, Spark, Reactive}, {Scala, Big Data Scala, Text, Data, ...} By the Bay.

Speakers
avatar for John St. John

John St. John

Co-founder, Bioinformatics Lead, Driver Group, L.L.C.
I am a co-founder, and bioinformatics lead engineer at Driver Group. We are leveraging an individual cancer patient's genomic data to provide that patient with the best therapeutic recommendations available. Additionally we use the data that we collect to bring the most promising new cancer therapies to market.
avatar for priya joseph

priya joseph

Data Czar, NuMedii
avatar for Jeff Lerman

Jeff Lerman

Staff Ontology Engineer, QIAGEN Silicon Valley (formerly Ingenuity Systems)
Biomedical ontology developer for eight years, focusing on knowledge models for diseases, gene products, and genetic variants. NLP projects include development of word-sense disambiguation approaches to identify genes discussed in biomedical publications, and ontology-leveraged scientific article classification by topic. Before QIAGEN, Jeff studied molecular biology at Princeton (Ph.D.), and did a postdoc in protein structure/function at U.C... Read More →
avatar for Daniella Perlroth

Daniella Perlroth

Chief Data Scientist, Lyra Health
Using data and technology in health care to improve access, quality and care. | Predicting quality, treatments, outcomes in health care. | Algorithms to recommend care options or providers based on a person's symptoms, conditions, needs.


Friday May 20, 2016 5:20pm - 6:20pm
Gardner

6:00pm

Happy Hour
Join us for a community gathering after the full day of talks. Full bar, fine food, and the company of the speakers, panelists, and the best minds in data.

Friday May 20, 2016 6:00pm - 8:00pm
The Galvanize Lobby