Data By the Bay is the first Data Grid conference: a matrix of six vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.


Pipelines
Monday, May 16
 

9:05am

Keynote: Building a Real-time Streaming Platform Using Kafka Streams and Kafka Connect
Modern businesses have data at their core, and this data is changing continuously. How can we harness this torrent of continuously changing data in real time? The answer is stream processing, and one system that has become a core hub for streaming data is Apache Kafka.
This presentation will give a brief introduction to Apache Kafka and describe its use as a platform for streaming data. It will explain how Kafka serves as a foundation for both streaming data pipelines and applications that consume and process real-time data streams. It will introduce some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library. Finally, it will describe the lessons companies like LinkedIn have learned building massive streaming data architectures.
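The division the abstract describes — Connect moving streams in and out, Streams transforming them record by record — can be reduced to a tiny sketch: a stateful word count over an unbounded sequence of records that emits a changelog of updated counts. This is illustrative stdlib Python, not the Kafka Streams API:

```python
from collections import defaultdict

def word_count_stream(records):
    """Consume an unbounded sequence of text records, emitting the
    updated count for each word as it arrives (a changelog stream)."""
    counts = defaultdict(int)
    for record in records:
        for word in record.lower().split():
            counts[word] += 1
            yield (word, counts[word])

# Feeding a small stream produces a stream of count updates:
updates = list(word_count_stream(["hello world", "hello kafka"]))
```

In the real system the input would be a Kafka topic and the changelog would be written back to another topic; the per-key running state is the part Kafka Streams manages for you.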

Speakers
Jay Kreps

Co-founder and CEO, Confluent
Jay Kreps is the co-founder and CEO of Confluent, a company backing the popular Apache Kafka messaging system. Prior to founding Confluent, he was the lead architect for data infrastructure at LinkedIn. He is among the original authors of several open source projects including Project...


Monday May 16, 2016 9:05am - 9:45am
Gardner

9:50am

Building a Realtime Receiver with Spark Streaming
This talk will demystify Spark Streaming receivers by showing, live, how to enable streaming consumption from new and untapped data sources. We'll play around with a publicly available financial data API, while dipping our toes into a Twitter data stream and a queue source. Starting with a simple single-node receiver, we will build up to distributed, reliable receivers.
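The single-node receiver the talk starts from is, at its core, a loop that polls a source and pushes records into a buffer that the processing engine later drains. A minimal stdlib sketch of that shape (the `fake_poll` source is a hypothetical stand-in for the financial data API, not a real feed):

```python
import queue
import threading
import time

def run_receiver(poll_fn, buffer, stop_event, interval=0.005):
    """Single-node receiver loop: poll an external source and push
    records into a shared buffer until asked to stop."""
    while not stop_event.is_set():
        for record in poll_fn():
            buffer.put(record)
        time.sleep(interval)

# A fake "API" standing in for a real data feed: each poll returns
# the next batch of records, then nothing.
batches = iter([[1, 2], [3]])
def fake_poll():
    return next(batches, [])

buffer = queue.Queue(maxsize=100)
stop = threading.Event()
receiver = threading.Thread(target=run_receiver, args=(fake_poll, buffer, stop))
receiver.start()
time.sleep(0.2)          # let the receiver drain the fake source
stop.set()
receiver.join()

received = []
while not buffer.empty():
    received.append(buffer.get())
```

A Spark Streaming receiver adds the pieces this sketch omits: distributing receiver instances across executors and, for reliable receivers, acknowledging the source only after records are safely stored.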

Speakers
Sal Uryasev

Data Werewolf, GoFundMe
Sal recently started building out data infrastructure at GoFundMe and is a veteran of the LinkedIn and Salesforce data science teams. A Scala fanatic, he's done a lot of work on streaming processors to build realtime data recommendation engines.


Monday May 16, 2016 9:50am - 10:30am
Markov

9:50am

Elastic Data Pipelines
Consumers want applications that make driving a seamless experience. The global automotive software market is expected to reach $10.1 billion by 2020, according to a 2014 report. By that same time, there will be 200 million connected cars on the road generating valuable information that can be used to improve safety, enhance quality and build new revenue streams. In this talk, we will discuss how to build real-world data pipelines that can deliver on these consumers' demands and help companies tap into new revenue streams. We'll show how to build and use data pipelines in a scalable and elastic way, and demonstrate their usage for the types of spatio-temporal data that vehicles generate as they travel. Attendees will come away understanding how leading-edge connected-car applications are delivered to market. The presentation will include the following technology components:

- Esri-developed Scala/Play app for highly interactive/dynamic map visualization
- Azure IoT Hub for event ingestion
- Mesosphere Infinity for event processing

Speakers
Claudio Caldato

Program Manager, Microsoft
Claudio is a Program Manager on the Azure IoT team. His team is building an IoT solution that leverages OSS technologies such as Mesos/DC/OS, Spark, Cassandra, Kafka, Akka and many others. Before joining the Azure team he worked on Node.js, Redis and other OSS technologies along...

Adam Mollenkopf

Real-Time & Big Data GIS Capability Lead, Esri
Adam Mollenkopf is responsible for the strategic direction Esri takes towards enabling real-time and big data capabilities in the ArcGIS platform. This includes having the ability to ingest real-time data streams from a wide variety of sources, performing continuous and recurring...

Sunil Shah

Engineering Manager, Yelp
Sunil Shah is an engineering manager at Yelp. His team builds and maintains the Mesos-based platform that powers workloads ranging from the web services behind Yelp.com to the batch and machine learning jobs used for processing large amounts of data and providing user recommendations...



Monday May 16, 2016 9:50am - 10:30am
Ada

10:40am

Google BigQuery: a fully-managed data analytics service in the cloud
Google built Dremel to make internal data analysis simple, then made it available to the world as the BigQuery service: a fully-managed data analytics service in the cloud, which allows you to focus on insight, not infrastructure.

Learn what makes BigQuery unique and how you can immediately start using its "nearly magical abilities". This talk will cover the basics of BigQuery, describe use cases and applications, and provide insights on how to use it efficiently.

Speakers
Michael Entin

Senior Software Engineer, Google
Senior Software Engineer on the Google Dremel/BigQuery team. Before joining the Dremel team, he worked on various data processing projects at Microsoft: SQL Server Integration Services, Analysis Services, a distributed platform for AdCenter Business Intelligence, and more.



Monday May 16, 2016 10:40am - 11:20am
Gardner

10:40am

Taming JSON with SQL: From Raw to Results
The flexibility and simplicity of JSON have made it one of the most common formats for data. Data engines need to be able to load, process, and query JSON and nested data types quickly and efficiently. There are multiple approaches to processing JSON data, each with trade-offs. In this session we'll discuss the reasons and ways that developers want to use flexible schema options, and the challenges that creates for processing and querying that data. We'll dive into the approaches taken by different technologies such as Hive, Drill, BigQuery, Spark, and others, and the performance and complexity trade-offs of each. Attendees will leave with an understanding of how to assess which system is best for their use case.
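One of the approaches the session compares is exposing nested JSON fields to SQL as dotted column names, which several engines do under the hood. A minimal stdlib sketch of that flattening step (illustrative only; each engine's actual rules for arrays and nulls differ):

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested JSON into dotted column names, the way SQL-on-JSON
    engines expose nested fields (e.g. user.name) to queries."""
    row = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, name + "."))
        else:
            row[name] = value
    return row

doc = json.loads('{"user": {"name": "ada", "age": 36}, "active": true}')
row = flatten(doc)
```

After flattening, `row` can be treated like a relational tuple with columns `user.name`, `user.age` and `active`, which is what makes predicate pushdown and columnar storage possible for nested data.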

Speakers

Greg Rahn

Director of Product Management, Snowflake Computing


Monday May 16, 2016 10:40am - 11:20am
Markov

11:10am

Investing in Data Analytics: Risks and Opportunities in the Zettabyte Era
Industry analysts estimate that the amount of data created and replicated will be 10 ZB in 2016 and grow to 40 ZB by 2020. This rapid growth in data fuels a new era of computing, the Data & Analytics era, where data combined with algorithms and cloud-based software leads to improved insights, thereby creating business value. This talk will describe the drivers, trends and implications of this new era for enterprise IT, applications and product delivery across industrial and consumer sectors, illustrated with several case studies (focusing on industrial IoT), and explain why this era results in a fundamental change in investment opportunities.

Speakers
Kanu Gulati

Investor, Zetta Venture Partners
Kanu Gulati is an investor at Zetta Venture Partners, a fund that invests in intelligent enterprise software, i.e. companies building software that learns from data to analyze, predict and prescribe outcomes in enterprise applications. Kanu has over 10 years of operating...


Monday May 16, 2016 11:10am - 11:30am
Ada

11:40am

Apache Flink: A Very Quick Guide
Apache Flink is the exciting newcomer in the Big Data space, which integrates batch and stream processing in a novel way. Join us in exploring Apache Flink and its architecture, internals, and its data and programming models. We point out the unique features and differences in comparison to Apache Spark. You will see Flink’s Scala APIs in action as we code and run a selected set of examples that nicely illustrate its features.
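The "integrates batch and stream processing" claim can be illustrated in miniature: write one dataflow against an iterable and it runs unchanged over a bounded (batch) or unbounded (stream) source. A stdlib sketch of the idea, not Flink's API:

```python
def pipeline(source):
    """One dataflow definition: filter then map. Written against an
    iterable, it runs over bounded and unbounded sources alike."""
    return (x * x for x in source if x % 2 == 0)

batch_result = list(pipeline([1, 2, 3, 4]))   # bounded (batch) input

def unbounded():
    # an endless source standing in for a live stream
    n = 0
    while True:
        n += 1
        yield n

stream = pipeline(unbounded())                 # unbounded (stream) input
first_two = [next(stream), next(stream)]
```

Flink goes much further — it treats batch as a special case of streaming at the runtime level, with event time, state and checkpointing — but the shared-program-shape idea is the starting point.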

Speakers
Vladimir Bacvanski

Principal Architect, Strategic Architecture, PayPal
Dr. Vladimir Bacvanski's interest is in better and more productive ways to develop highly scalable and reliable software systems. Before joining PayPal, he was the CTO and founder of SciSpike, a company doing custom development and consulting. His recent projects include Big Data...


Monday May 16, 2016 11:40am - 12:20pm
Ada

11:40am

Characterizing and measuring the performance of Big Data processing platforms
There are several Big Data platforms, architectures and frameworks already out there and more are coming out each day, figuratively speaking! In such an ecosystem it is difficult to truly measure or characterize the performance of a data processing infrastructure built on these frameworks. We abstract the frameworks into three categories: batch, query and streaming. In this talk, we identify characteristics of each kind of framework and present the results of running heterogeneous workloads for batch frameworks such as Hadoop, stream frameworks such as Spark and query frameworks such as Impala on target cloud-based infrastructure. In our experiments, we have seen performance variations given the multi-tenant nature of the infrastructure, and have accounted for these temporal conditions by running our experiments at different times.

Speakers
Manish Singh

CTO, Co-founder, MityLytics
Manish is CTO and co-founder of MityLytics, which develops products to help customers make the transition to Big Data platforms and to continue to grow and tune their Big Data analytics platforms and apps using MityLytics software. He has built, deployed and maintained massively distributed...


Monday May 16, 2016 11:40am - 12:20pm
Markov

1:10pm

Akka Streams for Large Scale Data Processing
With over 50 million members, Credit Karma is the most utilized and trusted personal finance platform in the U.S. To handle tens of millions of Americans’ credit information, we use Akka Streams for high throughput data transfer. We will discuss how we quickly built a solution using Akka Actors to help us parallelize, parse and send data to our ingestion service. We then used Akka Streams to pull data through our actors based on demand, allowing us to easily control our memory buffers and prevent Out of Memory issues. Akka Streams also allowed us to apply parsing and simple business logic. In this panel, Credit Karma shares best practices on how to implement Akka Streams at scale.
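The memory-control point in the abstract — pulling data through the pipeline based on demand so buffers stay bounded — is backpressure, and the essence can be shown with a bounded queue: a fast producer blocks instead of growing memory without limit. A stdlib sketch of the principle, not Akka Streams itself:

```python
import queue
import threading

# A bounded buffer gives backpressure: when the consumer is slow, the
# producer blocks on put() instead of buffering unboundedly (the OOM
# failure mode the abstract mentions).
buffer = queue.Queue(maxsize=2)
consumed = []

def producer():
    for i in range(10):
        buffer.put(i)        # blocks once 2 items are in flight
    buffer.put(None)         # end-of-stream marker

def consumer():
    while True:
        item = buffer.get()
        if item is None:
            break
        consumed.append(item)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
```

Akka Streams formalizes this as demand signaling between asynchronous stages (Reactive Streams), so the bound propagates through an entire graph of parsing and business-logic steps rather than a single queue.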

Speakers
Zack Loebel-Begelman

Senior Software Engineer, Credit Karma
As a senior software engineer on the data and analytics pipeline, Zack’s work allows Credit Karma to provide tailored recommendations for each individual member’s specific financial situation. Zack joined Credit Karma after two years designing and launching data engines to support...

Dustin Lyons

Senior Software Engineer, Credit Karma
Dustin is the technical lead of Credit Karma’s data services, which give over 60 million members access to a personalized and seamless user experience. Before joining Credit Karma, Dustin worked in product development and infrastructure operations for over nine years. He holds...



Monday May 16, 2016 1:10pm - 1:30pm
Ada

1:10pm

Building domain specific databases with a distributed commit log [Kafka/Kubernetes]
Smyte is building a platform to analyze all of the traffic running through busy consumer websites and mobile apps. In this talk I'm going to describe our solution to one tricky problem: counting. Specifically: accurately counting ludicrous amounts of events over sliding windows, while keeping costs as low as possible. Oh, and… let's get something working in an hour or two and improve it later.
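The "working in an hour or two" version of sliding-window counting is usually a ring of time-sliced buckets: increment the current bucket, drop the oldest as time advances, and sum the ring for the window total. A stdlib sketch of that idea (Smyte's production design, with Kafka as the commit log, is considerably more involved):

```python
from collections import deque

class SlidingWindowCounter:
    """Approximate sliding-window counter: one bucket per time slice,
    with the oldest bucket falling out as the window advances."""
    def __init__(self, window_slices):
        self.buckets = deque([0] * window_slices, maxlen=window_slices)

    def increment(self, n=1):
        self.buckets[-1] += n          # count into the current slice

    def tick(self):
        self.buckets.append(0)         # advance time by one slice

    def count(self):
        return sum(self.buckets)

c = SlidingWindowCounter(window_slices=3)
c.increment(5); c.tick()
c.increment(2); c.tick()
c.increment(1)
total_now = c.count()      # 5 + 2 + 1, all inside the 3-slice window
c.tick()
after_tick = c.count()     # the bucket holding 5 has aged out
```

The trade-off is granularity: the window slides in slice-sized steps, which is why cost-conscious systems pick the coarsest slice their accuracy budget allows.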

Speakers
Yunjing Xu

Infrastructure Engineer, Smyte
Yunjing is a software engineer building server and database infrastructure at Smyte. Before Smyte, Yunjing worked on the data science and infrastructure team at Square, and received his Ph.D. from the University of Michigan for research on performance and security problems of public cloud...


Monday May 16, 2016 1:10pm - 1:30pm
Markov

1:10pm

Netflix Keystone - Streaming Data Pipeline @Scale in the Cloud
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. We will explore in detail how we leverage Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. We will also share our plans for offering Stream Processing as a Service for all of Netflix.
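At-least-once semantics, as claimed above, usually comes down to one ordering rule: process the event first, commit the offset after, so a crash between the two causes redelivery rather than loss. A hypothetical stdlib sketch of that rule (not Keystone's code; names are illustrative):

```python
def consume(events, start_offset, process, committed):
    """At-least-once loop: process each event, then commit its offset.
    A crash between the two steps means the event is replayed later."""
    offset = start_offset
    for event in events[start_offset:]:
        process(event)
        offset += 1
        committed["offset"] = offset
    return offset

events = ["a", "b", "c"]
seen = []
committed = {"offset": 0}

class Crash(Exception):
    pass

def flaky(event):
    seen.append(event)
    if event == "b" and seen.count("b") == 1:
        raise Crash  # simulated failure after processing, before commit

try:
    consume(events, committed["offset"], flaky, committed)
except Crash:
    pass
# Resume from the last committed offset: "b" is processed a second time,
# which is exactly the duplicate that "at-least-once" permits.
consume(events, committed["offset"], flaky, committed)
```

Downstream consumers therefore need idempotent handling (or deduplication) to tolerate the replays this delivery mode allows.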

Speakers
Monal Daxini

Senior Software Engineer, Netflix, Inc.
Monal Daxini is a Senior Software Engineer at Netflix building a scalable and multi-tenant event processing pipeline, and infrastructure for Stream Processing as a Service. He has over 15 years of experience building scalable distributed systems at organizations like Netflix, NFL.com...



Monday May 16, 2016 1:10pm - 1:30pm
Gardner

2:10pm

Real-time, Streaming Advanced Analytics, Approximations, and Recommendations using Apache Spark ML/GraphX, Kafka, Stanford CoreNLP, and Twitter Algebird. BONUS: Netflix Recommendations: Then and Now
Agenda:

- Intro
- Live, interactive recommendations demo: Spark ML, GraphX, Streaming, Kafka, Cassandra, Docker
- Types of similarity: Euclidean vs. non-Euclidean
- User-to-user similarity
- Content-based, item-to-item similarity (Amazon)
- Collaborative-based, user-to-item similarity (Netflix)
- Graph-based, item-to-item similarity pathway (Spotify)
- Similarity approximations at scale: Twitter Algebird, MinHash and bucketing, locality-sensitive hashing (LSH)
- BONUS: Netflix recommendation algorithms, from ratings to real-time: the DVD-ratings-based $1M Netflix Prize (2009) and the streaming-based "Trending Now" (2016)
- Wrap-up and Q&A
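MinHash, mentioned in the agenda, approximates Jaccard similarity between sets: hash every element with k seeded hash functions, keep each minimum, and the fraction of matching minima between two signatures estimates the overlap. A small stdlib sketch (illustrative; Algebird's implementation is a monoid over bit-packed signatures):

```python
import hashlib

def minhash(tokens, num_hashes=16):
    """MinHash signature: for each seeded hash function, keep the
    minimum hash value over the token set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens))
    return sig

def estimated_jaccard(a, b):
    # fraction of signature positions that agree
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

identical = estimated_jaccard({"spark", "kafka"}, {"spark", "kafka"})
disjoint = estimated_jaccard({"spark", "kafka"}, {"flink", "beam"})
```

LSH then buckets items by bands of their signatures so that only probably-similar pairs are ever compared, which is what makes similarity search tractable at catalog scale.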

Speakers
Chris Fregly

Founder, Research Engineer, PipelineIO
Chris Fregly is Founder and Research Engineer at PipelineIO, a streaming machine learning and artificial intelligence startup based in San Francisco. He is also an Apache Spark contributor, a Netflix Open Source committer, and founder of the Global Advanced Spark and TensorFlow Meetup...


Monday May 16, 2016 2:10pm - 2:50pm
Markov

2:10pm

Twitter Heron in Practice

Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Twitter designed and deployed a new streaming system called Heron. Heron has been in production for nearly two years and is widely used by several teams for diverse use cases. In this talk, I will describe Heron in detail and share our operating experiences and the challenges of running Heron at scale.

 


Speakers
Karthik Ramasamy

Engineering Manager, Twitter
Karthik is the engineering manager for Real Time Compute at Twitter and co-creator of Heron. He has two decades of experience working in parallel databases, big data infrastructure and networking. He co-founded Locomatix, a company that specializes in real-time stream processing...



Monday May 16, 2016 2:10pm - 2:50pm
Ada

3:00pm

A Real Time Analytics Framework
We'll talk through the elements of a framework built to easily construct multiple real-time analytics applications, addressing the needs of multiple teams in Yahoo's Publishing Products group. In this talk you'll learn about the architecture of a real-time analytics stack and its use cases. We'll cover the supporting services and libraries that had to be built to support the various use cases, as well as the optimizations we made to address resource and network I/O considerations. Our focus will mainly be on the data processing, not the analytics.

Speakers
Hiral Patel

Technologist, Yahoo Inc
Hiral has been working with Scala for the past 6 years and Big Data for the past 12 years. He's built data platforms, data-intensive applications, and real-time analytics frameworks. Hiral is currently a Senior Principal Architect/Engineer at Yahoo Inc.



Monday May 16, 2016 3:00pm - 3:20pm
Ada

3:00pm

Concord: Simple & Flexible Stream Processing on Apache Mesos
If you’re trying to process financial market data, monitor IoT sensor metrics or run real-time fraud detection, you’ll be thinking of stream processing. Stream processing sounds wonderful in concept, but scaling and debugging stream processing frameworks on distributed systems can be a nightmare. In clustered environments, your logs are scattered across many different computers, making errors and strange behaviors hard to trace. On frameworks like Apache Storm, the many layers of abstraction make it difficult to predict performance and do capacity planning. In micro-batching frameworks like Spark Streaming, stateful aggregations can be a hassle. Moreover, in most of the existing frameworks, changing a single line of code requires a full topology redeploy, causing operational strain. Concord strives to solve all the challenges above. In this talk, you’ll learn how Concord differs from other stream processing frameworks and how Concord can provide flexibility, simplicity, and predictable performance with help from Apache Mesos.

Speakers

Shinji Kim

Co-founder & CEO, Concord


Monday May 16, 2016 3:00pm - 3:20pm
Markov

3:00pm

Speed up app development with prefabricated, extensible, open-source backends
Building modern apps requires a lot of boilerplate backend code -- setting up server endpoints, forwarding requests to the database, and performing authentication are examples of code developers have to write over and over again. In this talk you'll learn how to dramatically cut down development time by using prefabricated, open-source backends like loopback.io and deepstream.io, and how to extend these backends with custom code once your application outgrows the functionality available out of the box. We'll also talk about how prefabricated backends are changing application architectures, and the impact end-to-end event driven application development is making on end-user experience. We’ll talk about our journey through the process of solving these problems in RethinkDB and Horizon, and how we see the future of web development unfold. RethinkDB Horizon is an open-source developer platform for building realtime, scalable web apps. It is built on top of RethinkDB, and allows app developers to get started with building modern, engaging apps without writing any backend code.

Speakers
Slava Akhmechet

Co-founder, RethinkDB
Slava Akhmechet is the founder of RethinkDB, a database company dedicated to helping developers build realtime web applications. Prior to RethinkDB he was a systems engineer in the financial industry, working on scaling custom database systems. Slava is a frequent speaker and a blogger...


Monday May 16, 2016 3:00pm - 3:20pm
Gardner

4:00pm

Parallel and distributed big joins in H2O
Matt has taken the radix join as implemented in R's data.table and parallelized and distributed it in H2O. He will describe how the algorithm works, provide benchmarks and highlight advantages/disadvantages. H2O is open source on GitHub and is accessible from R and Python using the h2o package on CRAN and PyPI.
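The core move in a radix join is to partition both tables by low-order bits of the key so that matching rows are guaranteed to land in the same partition, then join each partition independently — the partitions are what get parallelized and distributed. A stdlib sketch of the idea (illustrative; data.table's and H2O's implementations use radix ordering over multiple passes, not a per-partition hash join):

```python
from collections import defaultdict

def radix_join(left, right, num_partitions=16):
    """Partition both tables by the key's low-order bits, then join
    within each partition. Each partition could run on its own node."""
    lparts, rparts = defaultdict(list), defaultdict(list)
    for key, value in left:
        lparts[key % num_partitions].append((key, value))
    for key, value in right:
        rparts[key % num_partitions].append((key, value))
    out = []
    for p in lparts:
        index = defaultdict(list)
        for key, value in rparts.get(p, []):
            index[key].append(value)
        for key, value in lparts[p]:
            for rvalue in index[key]:
                out.append((key, value, rvalue))
    return sorted(out)

joined = radix_join([(1, "a"), (2, "b"), (17, "c")],
                    [(1, "x"), (17, "y"), (3, "z")])
```

Because the partitioning is by key bits rather than a full sort, each pass is cache-friendly and embarrassingly parallel, which is the advantage over a single global hash table.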

Speakers
Matt Dowle

Hacker, H2O.ai
Matt is the main author of R's data.table package, the 2nd most asked about R package on Stack Overflow. He has worked for some of the world’s largest financial organizations: Lehman Brothers, Salomon Brothers, Citigroup, Concordia Advisors and Winton Capital. He is particularly...


Monday May 16, 2016 4:00pm - 4:40pm
Markov

4:00pm

The Engineer's Guide to Streaming -- How you should really compare Storm, Flink, and Apex
It feels like every week there's a new open-source streaming platform out there. Yet, if you only look at the descriptions, performance metrics, or even the architecture, they all start to look exactly the same! In short, nothing really differentiates itself - whether it be Storm, Flink, Apex, GearPump, Samza, Kafka Streams, Akka Streams, or any of the other myriad technologies.

So if they all look the same, how do you really pick a streaming platform to solve the problem that YOU have? This talk is about how to really compare these platforms, and it turns out that they do have their key differences, they're just not the ones you usually think about. The way that you need to compare these systems if you're building something to last, a well-engineered system, is to look at how they handle durability, availability, how easy they are to install and use, and how they deal with failures. 

This is a relatively accessible technical talk. Newcomers to the streaming realm are welcome!

Speakers
Ilya Ganelin

Senior Data Engineer, Capital One Data Innovation Lab
Ilya is a roboticist turned data engineer. At the University of Michigan he built self-discovering robots and then worked on embedded DSP software with cell phone radios at Boeing. Today, he drives innovation at Capital One. Ilya is a contributor to the core components of Apache Spark...



Monday May 16, 2016 4:00pm - 4:40pm
Gardner
 
Tuesday, May 17
 

3:00pm

Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage System
Speakers
Gene Pang

Software Engineer, Berkeley
Gene Pang is one of the PMC members and maintainers of the Alluxio open source project and a founding member at Alluxio, Inc. He recently graduated with a Ph.D. from the AMPLab at UC Berkeley, working on distributed database systems. Before starting at Berkeley, he worked at Google and has an...


Tuesday May 17, 2016 3:00pm - 3:40pm
Ada
 
Wednesday, May 18
 

10:40am

End-2-End Monitoring and Troubleshooting a Real-Time Data Pipeline
Real-time streaming pipelines combine application code, data frameworks and the underlying infrastructure, which has increasingly become containerized. The application code and the underlying data frameworks are closely intertwined, blurring the line between the application and data processing tiers. The highly complex, distributed and interconnected nature of these services makes monitoring and troubleshooting these pipelines very challenging. In this talk, we will:

- Examine the different components used to build a typical real-time streaming pipeline
- Evaluate the importance of modeling the "pipeline" as a first-class object that should be monitored
- Discuss the challenges of monitoring and troubleshooting a real-time streaming pipeline
- Review capturing overall metrics for the pipeline that map to specific metrics from each component, such as throughput, latency, backpressure and error rate
- Provide a set of best practices for organizing information to begin troubleshooting your data processing frameworks when things go wrong
- Present a simple way to build a "Pipeline View" that captures the health of each component in the pipeline, as well as the dependencies between components, and gives an indication of any issues in the pipeline at a glance
- Demonstrate how to visually correlate pipeline metrics and pipeline health to underlying infrastructure issues, so that problems can be quickly analyzed and resolved

Speakers
Alan Ngai

VP of Engineering, OpsClarity
As a co-founder and the VP of Engineering at OpsClarity, Alan brings over 15 years of experience building systems and engineering teams from the ground up. Prior to OpsClarity, he led teams solving large-scale, complex problems at companies such as eBay, Yahoo and Telenav. Over...


Wednesday May 18, 2016 10:40am - 11:00am
Ada

10:40am

How Data Science is Evolving Technology Education
Galvanize is at the forefront of educating a technology-trained workforce in web development, data science, and data engineering, while also helping to create and launch the companies of tomorrow. At every one of our campuses, we offer curriculum designed and taught by a full-time faculty of experts who are active in industry and know what startups and companies need. Along with our curriculum, each campus is a base for startups, innovation labs, and established companies, where we provide access to training, talent, and community. In this session, learn how Galvanize is using data science to revolutionize technology education through adaptive learning, automated assessments, and predictive student interventions.

Speakers
Ryan Orban

VP of Business Operations & Expansion, Galvanize/Zipfian
Ryan was the co-founder & CEO of Zipfian Academy, the leading provider of immersive data science education focused on solving practical, real-world problems. Graduates of the 12-week program work at companies such as Facebook, Twitter, Airbnb, Uber, and Square. After joining forces...


Wednesday May 18, 2016 10:40am - 11:00am
Gardner

1:10pm

Single Customer View
A Single Customer View is a key enabler of growth for companies. It provides a deep understanding of their customers and users, allowing them to drive better targeting for product, marketing and sales. Since a Single Customer View is all about understanding customer behaviour from past activities and predicting future ones, the practical challenge is the amount of data to be processed in near real time. But a Single Customer View is not a monstrous beast; it is in fact a really powerful tool which can be built, maintained and exploited more easily than expected, for the benefit of the whole company. We'll dive into the key concepts and how we can leverage modern data processing infrastructure to build one.

Speakers

Thomas Trolez

Nitro, Inc.
Data-driven by nature.



Wednesday May 18, 2016 1:10pm - 1:30pm
Markov

4:00pm

Analytics as Code with Juttle
Software developers rely on operational analytics to track health and performance of their apps/services, as we do at Quid. You can go the DIY route with Kibana, Grafana and other open source tools, or rely on analytics service providers, but many choices will limit you to a UI-based experience. I'm a believer in analytics as code, and will present reasons and possible choices for those who want their analytics managed as code with source control and code reviews. Juttle (http://juttle.github.io/) is one such developer-oriented analytics platform that bridges querying data from different storage backends, processing, joining and visualizing, all in a single line of code.

Speakers
Daria Mehra

Director of Quality Engineering, Quid
My fields of interest are data storage, data analytics, and quality (my unofficial title is "Bug Huntress"). I'm excited to be working on quality initiatives for the Quid intelligence platform. My favorite hammer of a programming language / analytics platform is Juttle; I'm one of...




Wednesday May 18, 2016 4:00pm - 4:20pm
Markov

4:00pm

Deep Dive: Spark Memory Management
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
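The arbitration described above — execution and storage sharing one pool, with execution able to evict cached blocks but not vice versa — is the essence of Spark's unified memory model (1.6+). A simplified stdlib sketch of that policy, not Spark's actual `MemoryManager`:

```python
class UnifiedMemoryManager:
    """Sketch of the unified model: execution and storage share one
    pool; execution may evict cached blocks to make room, but storage
    never evicts memory that execution already holds."""
    def __init__(self, total):
        self.total = total
        self.execution = 0
        self.storage = 0

    def acquire_execution(self, amount):
        free = self.total - self.execution - self.storage
        if amount > free:
            # spill cached blocks until the request fits (or cache is empty)
            evict = min(self.storage, amount - free)
            self.storage -= evict
        if self.execution + self.storage + amount <= self.total:
            self.execution += amount
            return True
        return False

    def acquire_storage(self, amount):
        # storage only uses whatever execution has left free
        if self.execution + self.storage + amount <= self.total:
            self.storage += amount
            return True
        return False

m = UnifiedMemoryManager(total=100)
m.acquire_storage(80)          # the cache fills most of the pool
ok = m.acquire_execution(50)   # eviction makes room: storage shrinks
```

The asymmetry exists because evicting a cached block only costs recomputation, while evicting an operator's working memory would fail the running task.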

Speakers

Andrew Or

Software Engineer, Databricks
Anything about Spark.


Wednesday May 18, 2016 4:00pm - 4:40pm
Ada

4:30pm

Data governance and distribution with Dat
We will (LIVE!) take a public dataset and build a streaming, versioned, content-addressable, peer-to-peer distribution endpoint. Attendees will also learn how to build an application using Dat as the real-time, content-addressable storage using Node.js.
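Content-addressability, the property at the center of Dat's design, means a blob's key is the hash of its bytes: identical content deduplicates automatically, any change changes the address, and every read can be verified against its key. A minimal stdlib sketch of the idea (not Dat's actual storage format):

```python
import hashlib

class ContentStore:
    """Content-addressable store: the key of a blob is the SHA-256 of
    its bytes, so content is deduplicated and self-verifying."""
    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self.blobs[key]
        # integrity check: the data must hash back to its own address
        assert hashlib.sha256(data).hexdigest() == key
        return data

store = ContentStore()
k1 = store.put(b"population,2016\n7400000000")
k2 = store.put(b"population,2016\n7400000000")   # same content, same address
```

This self-verifying property is what makes peer-to-peer distribution safe: any peer can serve the bytes, and the hash proves they are untampered.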

Speakers

Karissa McKelvey

Software Developer, Dat Project


Wednesday May 18, 2016 4:30pm - 4:50pm
Markov
 
Thursday, May 19
 

10:40am

Analyzing Massive Time Series Data with Spark
Want to build models over data arriving every second from millions of sensors? Dig into the histories of millions of financial instruments? In this talk, we'll discuss the unique challenges in time series data, and how to work with it at scale:

- What distinguishes time series data from other datasets?
- What are the common operations that we wish to apply to it?
- What are the different ways to lay out time series data in memory, and what analysis tasks is each layout well suited for?
- What are popular applications for time series analysis?

We'll then introduce the open source Spark-TS library. Built atop Apache Spark, the library provides an intuitive Scala and Python API for munging, manipulating, and modeling time series data in a massively parallel manner.
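The layout question above usually comes down to two shapes for the same observations: instant-wise rows (one row per timestamp, good for cross-sectional queries) versus series-wise vectors (one array per key, good for per-series modeling, and the shape a library like Spark-TS distributes across a cluster). A tiny stdlib sketch of converting between them (the tickers and prices are made-up examples):

```python
# Instant-wise layout: one row per timestamp, all keys observed together.
instants = [
    ("09:00", {"AAPL": 100.0, "GOOG": 700.0}),
    ("09:01", {"AAPL": 101.0, "GOOG": 698.0}),
]

def to_series(instants):
    """Pivot to the series-wise layout: one contiguous vector per key,
    which per-series operations (returns, fitting a model) want."""
    keys = sorted(instants[0][1])
    return {k: [obs[k] for _, obs in instants] for k in keys}

series = to_series(instants)
aapl_return = series["AAPL"][-1] / series["AAPL"][0] - 1
```

Distributed time series libraries generally key the series-wise layout so each series (or chunk of series) lives on one node, keeping per-series operations communication-free.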

Speakers

Sandy Ryza

Senior Data Scientist, Clover Health
Data Science, Apache Spark, Time Series Data, Distributed Computation


Thursday May 19, 2016 10:40am - 11:00am
Gardner

10:40am

The Promise of Heterogeneous Computing
In general, application performance demands in all fields are now outpacing Moore's Law, and in AI, they're certainly increasing exponentially. To keep up, we're already beginning to rely on specialized processors like GPUs and DSPs beyond just CPUs for specific use cases. In the future, most applications will depend on their ability to use multiple types of processors in the most efficient way possible. This talk will go through the different types of heterogeneous hardware available today and coming in the future, their applicability to AI problems, and how normal people (not just big companies and those with access to supercomputers) can access them, with some examples.

Speakers
Subbu Rama

CEO, Bitfusion.io
Co-founder and CEO of Bitfusion, a Software Defined Supercomputing company with the mission of bringing HPC/supercomputing to the masses. He has held engineering and leadership roles in hardware and software divisions, building CPUs, micro-servers, SoCs and cloud infrastructures...



Thursday May 19, 2016 10:40am - 11:00am
Markov

2:10pm

Deep dive and best practices of Spark streaming
In this talk, we will start with the internals of how Spark Streaming works and explain how user code is translated and executed by the Spark Streaming engine. Based on these internals, we will then walk through some best practices for efficient state management, efficient joining of streams with historic datasets, and achieving high throughput while receiving, processing and writing data. This should help you develop and tune your streaming applications properly by avoiding the common pitfalls.
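The state-management pattern the talk covers boils down to folding each micro-batch into a running per-key state, in the spirit of Spark Streaming's `updateStateByKey`. A stdlib sketch of that step (illustrative names, not the Spark API):

```python
def update_state(state, batch):
    """One micro-batch step: fold the batch's records for each key into
    the running state, returning the new state."""
    new_state = dict(state)
    for key, value in batch:
        new_state[key] = new_state.get(key, 0) + value
    return new_state

# Three micro-batches of (key, count) records arriving over time.
batches = [
    [("clicks", 3), ("views", 10)],
    [("clicks", 2)],
    [("views", 5), ("clicks", 1)],
]
state = {}
for batch in batches:
    state = update_state(state, batch)
```

In production the state must also be checkpointed and expired (otherwise it grows without bound), which is exactly the kind of pitfall the best practices address.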

Speakers

Prakash Chockalingam

Solutions Architect, Databricks Inc
Spark


Thursday May 19, 2016 2:10pm - 2:50pm
Ada

4:00pm

Apache Beam (incubating): Unified batch and streaming data processing
This talk traces the evolution of ideas in Google's data processing tools over the past 13 years - from classic MapReduce, to strongly consistent stream processing with MillWheel, to the unified batch and streaming programming model of Apache Beam.

Originally based on years of experience developing Big Data infrastructure within Google (such as MapReduce, FlumeJava, and MillWheel), it has now been donated to the OSS community at large.

Beam cleanly separates the different aspects of temporal data processing: what computation to apply, where in event time to apply it, when in processing time to produce results, and how to refine the results as late data arrives. By decoupling semantics from the underlying execution environment, Beam provides portability across multiple runners, both proprietary (e.g., Google Cloud Dataflow) and open-source (e.g., Flink, Spark, et al).

I will give an overview of the programming model and current status of the project and invite you to participate in its rapidly developing ecosystem.
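The "where in event time" axis of the model can be illustrated with fixed windows: each record is assigned to a window by its timestamp, so out-of-order and late arrivals still count in the window they belong to. A stdlib sketch of just that axis (not the Beam API; triggers and refinements are omitted):

```python
from collections import defaultdict

WINDOW = 60  # fixed 60-second event-time windows

def window_counts(events):
    """Assign each (event_time, value) record to a fixed event-time
    window and count per window. Windowing follows the record's own
    timestamp, never its arrival order."""
    counts = defaultdict(int)
    for event_time, _ in events:
        window_start = event_time // WINDOW * WINDOW
        counts[window_start] += 1
    return dict(counts)

# The third record arrives out of order, but lands in the 0-60s window
# because its event time (not its arrival time) decides the window.
events = [(10, "a"), (75, "b"), (42, "c")]
counts = window_counts(events)
```

Beam's remaining axes — when in processing time to emit results (triggers, driven by watermarks) and how to refine them as late data arrives — sit on top of this assignment, which is what stays portable across runners.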

Speakers

Eugene Kirpichov

Senior Software Engineer, Google
I'm an engineer on the Google Cloud Dataflow team. I'm interested in some programming- and math-related topics, equality-related issues, cognitive psychology, and a bunch of other things.


Thursday May 19, 2016 4:00pm - 4:20pm
Ada

4:20pm

Data in the Apache "big data" ecosystem
Big Data, Small Data, it's all the same.

Speakers
Konstantin Boudnik

CEO, Memcore
Dr. Konstantin Boudnik, co-founder and CEO of Memcore Inc, is one of the early developers of Hadoop and a co-author of Apache Bigtop, the open source framework and community around the creation of software stacks for data processing projects. With more than 20 years of experience in...


Thursday May 19, 2016 4:20pm - 4:40pm
Ada
 
Friday, May 20
 

9:50am

On the path to data Nirvana: Supporting OLTP and OLAP on Hadoop
With increased competition, companies need to make faster decisions based on real-time data. This requires a database that can deliver good performance for both OLTP and OLAP workloads. This talk will discuss an innovative database architecture that uses both HBase and Spark engines to support simultaneous OLTP and OLAP workloads. In this talk, you will:

- Discover which use cases require simultaneous OLTP and OLAP workloads
- Learn how the optimizer automatically routes OLTP queries to HBase/Hadoop, and OLAP queries to Spark
- Learn how the optimizer uses advanced resource management to ensure that OLAP queries do not overwhelm OLTP queries

Speakers
Monte Zweben

Co-Founder and CEO, Splice Machine
Monte is the CEO and co-founder of Splice Machine. Monte worked as the Deputy Chief of the Artificial Intelligence Branch at NASA Ames Research Center, where he won the Space Act Award. He then founded and was CEO of Red Pepper Software, which merged with PeopleSoft, where he was...



Friday May 20, 2016 9:50am - 10:30am
Ada