Data By the Bay is the first Data Grid conference: a matrix of six vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.


Text
Tuesday, May 17
 

9:05am

Data and Algorithmic Bias in the Web
The Web is the largest public big data repository that humankind has created. In this overwhelming data ocean, we need to be aware of the quality and, in particular, of the biases that exist in this data. In the Web, biases also come from redundancy and spam, as well as from algorithms that we design to improve the user experience. This problem is further exacerbated by the biases that these algorithms add, especially in the context of search and recommendation systems. They include selection and presentation bias in many forms, interaction bias, etc. We give several examples and their relation to sparsity, novelty, and privacy, stressing the importance of the user context to avoid these biases.

Speakers
Ricardo Baeza-Yates

WWW
Ricardo Baeza-Yates's areas of expertise are information retrieval, web search and data mining, as well as data science and algorithms in general. He was VP of Research at Yahoo Labs, based in Sunnyvale, California, from August 2014 to March 2016. Before that, he founded and led, from 2006...


Tuesday May 17, 2016 9:05am - 9:40am
Gardner

9:50am

Building Word2Vec Models with Text Data
It is always amazing when someone is able to take a very hard problem and translate it into one that has been studied for centuries. This is the case with Word2Vec, which transforms words into vectors. Text is unstructured data and has been explored mathematically far less than vectors: Newton (1642-1726) may have been the first to study vectors, while text mining began only a few decades ago. Word2Vec maps text to a vector space that can be utilized in a variety of ways, such as measuring the distance between words. Therefore, given a word of interest, this vector space can be used to compute the top N closest words. In this talk, I will explain how to build Word2Vec models with Twitter data stored in Hadoop using Spark and MLlib. I will describe how to choose the most important parameters to accurately train a Word2Vec model. In addition, I will show examples of how these models are used in practice in data products.
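
A minimal sketch of what this looks like in practice, assuming tokenized tweets in a Spark DataFrame; parameter values and column names are illustrative, not taken from the talk:

```python
# Hypothetical sketch: training Word2Vec on tokenized tweets with Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("w2v-demo").getOrCreate()

# Assume each row holds a pre-tokenized tweet.
tweets = spark.createDataFrame([
    (["spark", "makes", "big", "data", "simple"],),
    (["word", "vectors", "capture", "meaning"],),
], ["words"])

w2v = Word2Vec(vectorSize=100, minCount=0, inputCol="words", outputCol="vector")
model = w2v.fit(tweets)

# Top-N closest words to a word of interest, as described in the abstract.
model.findSynonyms("data", 2).show()
```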

Speakers
Jorge Castanon

Lead Data Scientist, IBM
Jorge Castañón hails from Mexico City and received his Ph.D. in Computational and Applied Mathematics from Rice University. He has a genuine passion for data science and machine learning applications of any kind. Since 2007, he has been developing numerical optimization models and...


Tuesday May 17, 2016 9:50am - 10:30am
Ada

9:50am

Identifying Actionable Messages on Social Media
Text actionability detection is the problem of classifying user authored natural language text, according to whether it can be acted upon by a responding agent. In this paper, we propose a supervised learning framework for domain-aware, large-scale actionability classification of social media messages. We derive lexicons, perform an in-depth analysis for over 25 text based features, and explore strategies to handle domains that have limited training data. We apply these methods to over 46 million messages spanning 75 companies and 35 languages, from both Facebook and Twitter. The models achieve an aggregate population-weighted F measure of 0.78 and accuracy of 0.74, with values of over 0.9 in some cases.
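
For readers unfamiliar with the metric, a toy illustration of an aggregate population-weighted F measure; the per-company numbers are invented:

```python
# Sketch (assumed): weight each company's F1 by its message population.
import numpy as np

# (messages, f1) per company; numbers are invented.
companies = [(1_000_000, 0.81), (250_000, 0.73), (50_000, 0.66)]

weights = np.array([n for n, _ in companies], dtype=float)
f1s = np.array([f for _, f in companies])
print(np.average(f1s, weights=weights))   # population-weighted F measure
```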

Speakers

Nemanja Spasojevic

Director Of Data Science, Lithium Technologies | Klout
Nemanja Spasojevic is the Director of Data Science at Lithium Technologies. He graduated from Massachusetts Institute of Technology and previously worked on the Google Books project, making all of the world’s knowledge accessible online.



Tuesday May 17, 2016 9:50am - 10:30am
Markov

9:50am

The practice of acquiring good labels
Engineers and researchers use human computation as a mechanism to produce labeled data sets for product development, research and experimentation. In a data-driven world, good labels are key. To gather useful results, a successful labeling task relies on many different elements: from clear instructions and user interface design to algorithms for quality control. In this talk, I will present a perspective for collecting high quality labels with an emphasis on practical implementations and scalability. I will focus on three main topics: programming crowds, debugging tasks with low agreement, and algorithms for quality control. I plan to show many examples and code along the way.
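
A minimal sketch of two of these quality-control basics, aggregating redundant judgments by majority vote and checking inter-annotator agreement; the labels and the choice of Cohen's kappa are illustrative, not from the talk:

```python
from collections import Counter
from sklearn.metrics import cohen_kappa_score

def majority_label(labels):
    """Resolve redundant judgments for one item; ties fall to the first mode."""
    return Counter(labels).most_common(1)[0][0]

# Two workers label the same five items.
worker_a = ["spam", "ok", "ok", "spam", "ok"]
worker_b = ["spam", "ok", "spam", "spam", "ok"]

# Low kappa flags a task that needs debugging (unclear instructions, bad UI).
print(cohen_kappa_score(worker_a, worker_b))
print(majority_label(["spam", "ok", "spam"]))
```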

Speakers

Omar Alonso

Principal Data Scientist, Microsoft
Omar is a Principal Data Scientist Lead at Microsoft in Silicon Valley where he works on the intersection of social media, temporal information, knowledge graphs, and human computation for the Bing search engine. He holds a PhD from the University of California at Davis. @elunca



Tuesday May 17, 2016 9:50am - 10:30am
Gardner

10:40am

Hunting Criminals with Hybrid Analytics
Fraud detection is a classic adversarial analytics challenge: As soon as an automated system successfully learns to stop one scheme, fraudsters move on to attack another way. Each scheme requires looking for different signals (i.e. features) to catch; is relatively rare (one in millions for finance or e-commerce); and may take months to investigate a single case (in healthcare or tax, for example), making quality training data scarce. This talk will cover, via live demo and code walk-through, the key lessons we've learned while building such real-world software systems over the past few years. We'll be looking for fraud signals in public email datasets, using IPython and popular open-source libraries (scikit-learn, statsmodels, nltk, etc.) for data science and Apache Spark as the compute engine for scalable parallel processing. The model is an ensemble using a combination of natural language, graph analysis and time series analysis features, and is re-trained using an automated pipeline to learn from feedback on the fly.
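
A hedged sketch of the general idea of combining natural-language features with other signal types before classification; the features, data, and choice of random forest are invented for illustration:

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

emails = ["wire the funds today", "agenda for tomorrow's meeting"]
hour_sent = np.array([[3], [14]])   # e.g., odd sending hours as a weak signal
labels = [1, 0]

# Stack text features next to a numeric time feature.
text_features = TfidfVectorizer().fit_transform(emails)
X = hstack([text_features, csr_matrix(hour_sent)])

clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
print(clf.predict(X))
```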

Speakers
David Talby

CTO, Atigeo
David Talby is Atigeo's chief technology officer, working to evolve its big data analytics platform to solve real-world problems in healthcare, energy, and cyber-security. David has extensive experience in building and operating web-scale data science and business platforms, as...


Tuesday May 17, 2016 10:40am - 11:00am
Markov

10:40am

lda2vec
Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems need to identify a medical professional when she writes that she 'used to wear scrubs to work', and distill 'taking a trip' into a Fix for vacation clothing. Applied appropriately, word vectors are dramatically more meaningful and more flexible than current techniques and let computers peer into text in a fundamentally new way. I'll try to convince you that word vectors give us a simple and flexible platform for understanding text while speaking about word2vec, LDA, and introduce our hybrid algorithm lda2vec.

Speakers

Christopher Erick Moody

Data Scientist, Stitch Fix
Caltech - Astrostats - PhD supercomputing. Now data labs @stitchfix coding up word2vec, Gaussian Processes, t-SNE, tensors, Factorization Machines, RNNs, & VI



Tuesday May 17, 2016 10:40am - 11:20am
Ada

10:40am

Smarter Search with Spark, Solr and Machine Learning
The modern day search engine has significantly evolved from its keyword matching days to its current form which leverages a wide variety of data inputs and user feedback loops to help users find out what’s most important in their data. At Lucidworks, we leverage Apache Spark and Solr, together with a variety of open source machine learning and NLP approaches, to build smarter, richer search and data applications. This talk will explore several motivating use cases (customer 360, knowledge management, ecommerce) for our integrations as well as technical approaches and key lessons learned in real world implementations.

Speakers

Grant Ingersoll

CTO, Lucidworks
I'm the CTO and co-founder of Lucidworks, a long time Lucene and Solr hacker, co-creator of Apache Mahout and lead author of Taming Text.



Tuesday May 17, 2016 10:40am - 11:20am
Gardner

11:10am

Image Retrieval using Short Texts
Image retrieval in response to keyword-based queries is a well-studied problem. Web services such as Google Image Search are used daily by users all around the world. The typical use case for these services is using a short piece of text made up of a few individual tokens as the search phrase. The services, therefore, are designed to work with such queries and generally do not work well when a longer search string is used (for example a sentence). This does not align well with the recent push towards a more visual web, as evidenced by the popularity of applications such as Instagram, and the rise in popularity of microblogging services, which has produced an abundance of short text snippets for which users may want to retrieve accompanying images automatically. In this paper we introduce a novel approach, called ImageSuggest, which sits between the user and the traditional image retrieval systems and allows the users to enter longer search strings. Our approach extracts and ranks search terms from the input strings and feeds the resulting keywords to the image retrieval systems. We evaluate our approach on a dataset of short texts from the anonymous social network Whisper and show that we are able to outperform standard keyword extraction and query generation techniques on image retrieval tasks.
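
A toy sketch of the core idea as described, using TF-IDF as a stand-in for ImageSuggest's actual term-ranking method; the corpus and function names are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "i finally booked that sunset sailing trip around the bay",
    "my cat knocked over the coffee again this morning",
    "studying for finals at the library all night long",
]

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(corpus)
terms = vec.get_feature_names_out()

def image_query(doc_index, top_n=3):
    """Keep only the top-ranked terms of a long snippet as the image query."""
    row = tfidf[doc_index].toarray().ravel()
    best = row.argsort()[::-1][:top_n]
    return " ".join(terms[i] for i in best)

print(image_query(0))   # short keyword query to feed an image search engine
```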

Speakers

Maarten Bosma

Machine Learning Engineer, Whisper



Tuesday May 17, 2016 11:10am - 11:30am
Markov

11:40am

Automatic Links for a Web of Text
We tend to think of text as a linear representation of speech. Perhaps it is, some of the time (we'll show some historical examples). Documents are typically how we encounter text, however -- with structure and relationships beyond the linear. Layout in particular defines critical meaning that would be difficult to convey by other means. Historically, advances to textual & meta-textual representation have been slow to change and be adopted. The codex, table of contents, page numbers, index, standard typefaces -- all of these are crucial inventions that make text more meaningful. These took centuries to develop, although we now take them for granted. The rapid growth of web text over the past quarter century gives us a new set of text properties to add new dimensions of meaning. The *link* is a device that allows us to layer meaning and relationships in entirely new ways. Textual links provide cross-document references, and pointers to authoritative sources and indexes. Links are often layered around the central text in lists and indexes to allow navigation at various scales. We will try to put the link, along with linking services such as search engines, into perspective with traditional print-based non-linear text features, to show how the link expands and redefines how text is consumed and construed. In addition to providing a historical perspective on the impact of links on text, we will demonstrate novel varieties of *dynamic links*. In combination with an active platform for reading (e.g. the browser), dynamic link construction provides a new way to increase the reach of texts, connecting them with resources and documents that may not even exist when the text was created, to create a qualitatively new reading experience.

Speakers

Scott (TS) Waterman

connector, founder, back-quote


Tuesday May 17, 2016 11:40am - 12:20pm
Ada

11:40am

Google Translate - how machines do it
The talk will focus on how Google Translate uses machine learning at enormous scale to translate 100B words/day across 103 languages. Google Translate is one of the largest machine learning projects in the world in terms of data set (hundreds of billions of translated phrases) and usage (more than 500M people use it every month).

Speakers
Barak Turovsky

Head of Product, Google Translate, Google
Barak Turovsky is responsible for product management and user experience for Google Translate. Barak focuses on applying advanced machine learning techniques to deliver a magical experience to break language barriers across web, mobile applications, Search, Chrome and other products...



Tuesday May 17, 2016 11:40am - 12:20pm
Gardner

11:40am

Using Spark MLlib for NLP

Apache Spark is most often used as a means of processing large amounts of data efficiently, but is also useful for the processing of individual predictions common to many NLP applications. The algorithms inside MLlib are useful in and of themselves, independent of the core Spark framework. IdiML is an open source tool that enables incredibly fast predictions on textual data by using various components within MLlib. It acts as a standalone tool for performing core machine learning functionality that can easily be integrated into production systems to provide low-latency continuous streaming predictions. This talk explores the functionality inside IdiML, how it uses MLlib, and why that makes such a big difference.
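
A minimal sketch of the kind of MLlib text pipeline the talk describes (not IdiML itself); columns and parameters are illustrative:

```python
# Assumed sketch: tokenize, hash term frequencies, weight by IDF, fit a classifier.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-nlp").getOrCreate()
train = spark.createDataFrame([
    ("great product fast shipping", 1.0),
    ("terrible support never again", 0.0),
], ["text", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    HashingTF(inputCol="tokens", outputCol="tf", numFeatures=1 << 18),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(train)
model.transform(train).select("text", "prediction").show()
```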


Speakers
Michelle Casbon

Senior Engineer, Google
Michelle Casbon is a Senior Engineer at Google, where she focuses on open source for machine learning and big data tools. Prior to joining Google, she was at Qordoba as Director of Data Science and Idibon as a Senior Data Science Engineer. Within these roles, she built and shipped...



Tuesday May 17, 2016 11:40am - 12:20pm
Markov

1:10pm

An Innovative Approach to Labeling Ground Truth in Speech
Supervised machine learning algorithms require accurate and consistent data labels. However, complicated datasets may introduce ambiguity, resulting in irregular ground truths and challenging machine learning algorithm development. Consider the following truthing tasks for natural household speech:
  • Labeling what was said: think about how often people mispronounce words, talk over others, or simply mumble their speech.
  • Segmenting when a given utterance/thought begins and ends: how many complete thoughts are in a spoken segment? What happens if speech is fragmented? How close to the start- and end-point of speech can we segment without cutting out essential data?
  • Labeling sounds: often there are non-human sounds in the background that we may or may not recognize. Additionally, people often make non-speech sounds that need to be considered.
If that wasn't hard enough, now consider audio collected from households containing babies. Babies not only introduce more chaotic speech, but they have a language all their own that requires truth labels. Although many of the aforementioned categories don't have a right or wrong way of being labeled, they do have the potential to introduce inconsistencies. To decrease the number of ground truth discrepancies, we created data tagging software called VersaTag. VersaTag is a GUI-based labeling system that can be distributed to volunteers to tag large quantities of audio. We are developing this software through an iterative process, decreasing truthing inconsistencies with each new improvement. VersaTag has already dramatically reduced the irregularities in our audio labels, and through the iterative development process, we are excited to continue improving!

Speakers
Jill Desmond

Senior Data Scientist, VersaMe
Jill is the Senior Data Scientist at VersaMe. She is currently collecting data and developing algorithms to provide feedback to parents regarding the audio environment that their child is exposed to. Jill has a Ph.D. in Electrical Engineering from Duke University, where she...



Tuesday May 17, 2016 1:10pm - 1:30pm
Ada

1:10pm

Data & Metadata at the Internet Archive
The Internet Archive has many petabytes of archived webpages, books, videos, and images. Recently we've been making a big effort to make our data and metadata more accessible to outside users. I'll show off some of the methods to download stuff from the Archive, and then I'll show some example projects using this data.

Speakers

Greg Lindahl

CTO, Presearch Labs
I'm currently working on adding search to the Internet Archive's "Wayback Machine" web archive, but I'm interested in all kinds of data topics.



Tuesday May 17, 2016 1:10pm - 1:50pm
Gardner

1:10pm

Scalably Internationalizing Millions of Latent Semantic Labels
We've built a classification system that can map "Software Developer", "MTS", and "Code Monkey" as well as millions of other English language entities into a common semantic space with just a few thousand labels, which we use to understand people's job titles, skills, majors, and degrees. We're now working on internationalizing this system in a scalable way. The original method was labor intensive, so we have come up with an approach that leverages our English language work to provide good quality results in other languages with a small fraction of the effort.

Speakers

Xiao Fan

Dev Manager, Workday
My team is currently working on automated internationalization for a tool that provides semantic labels for plain English job titles.



Tuesday May 17, 2016 1:10pm - 1:50pm
Markov

1:40pm

Of Rules and Probabilities: Computational Linguistics Methods for Improving Machine Learning Models
Supervised machine learning models are extremely powerful and highly useful for processing vast amounts of text. Their applications include sentiment analysis, text classification, topic mining, part of speech tagging, and named entity recognition, among many others. However, supervised models rely heavily on large amounts of annotated data and furthermore require that the annotations be consistent and accurate. In practice, obtaining high quality annotated data, especially with strong inter-annotator agreement, is not always possible for legal and privacy reasons: there are some data that organizations may not be allowed to crowdsource. In this talk I propose several methods to help machine learning models get over the hurdle of insufficient labeled data by leveraging a number of computational linguistics techniques. Specifically, focusing on a CRF (conditional random field) model for Named Entity Recognition, I discuss how the use of language feature engineering, artificial dataset generation, and post-processing rules can significantly improve model performance, which otherwise suffers from the bottleneck of insufficient training data. I propose a number of scalable and practical methods that machine learning practitioners can use in situations where obtaining more training data via crowdsourcing is not a viable option.
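
To make the feature-engineering point concrete, here is a small hand-rolled feature function for a CRF-based NER tagger, using the sklearn-crfsuite package; the features and tags are invented examples, not the speaker's:

```python
import sklearn_crfsuite

def token_features(sent, i):
    """Hand-built features for one token; the CRF learns their weights."""
    word = sent[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),      # capitalization is a strong NER cue
        "is_digit": word.isdigit(),
        "suffix3": word[-3:],
        "prev": sent[i - 1].lower() if i > 0 else "<s>",
    }

X = [[token_features(s, i) for i in range(len(s))]
     for s in [["Ada", "Lovelace", "lived", "in", "London"]]]
y = [["B-PER", "I-PER", "O", "O", "B-LOC"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```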

Speakers
Vita Markman

Staff Software Engineer, Computational Linguist, LinkedIn
As a Staff Software Engineer at LinkedIn, I work on various natural language processing applications such as query understanding, sentiment analysis, and member/job data standardization. Before joining LinkedIn, I was a Staff Research Engineer at Samsung Research America, where...



Tuesday May 17, 2016 1:40pm - 2:00pm
Ada

2:10pm

Aspect Based Sentiment Analysis in 20+ Languages
We benchmark a number of statistical approaches for ABSA (via SemEval public data) against a linguistic approach. We discuss parsing, used in most benchmarked systems, and its two main branches, probabilistic and symbolic parsing. Finally, we propose an alternative approach which combines the best of both paradigms: linguistic/symbolic processing for topic and polarity detection and machine learning for aspect categorization.

We present a grammar-based approach to Aspect-Based Sentiment Analysis (also known as Topic-Based) which is currently available in 20+ languages. When we say it is available we mean that these 20+ languages are in production in numerous commercial projects mainly in the area of VoC and survey coding projects. We describe the typical ingredients of a linguistic platform: 
  • on the software side: a language-independent lexical analyzer and a PDA-based non-deterministic GLR parser
  • on the data side: corpus-based lexicons (with up to 300 million entries for morphologically complex languages like Finnish); and unification grammars (with anything from 500 to 1000 rules per language); 
  • on the customization side: sentiment rules (around 1000 per language); domain-specific categorization rules; etc.

Probably the main advantage of the engine is that adding support for a new language is a matter of changing the data side (grammar and dictionaries), which can be done quickly and efficiently.

The system achieves 70% accuracy out-of-the-box on most domains and types of texts, and up to 90% accuracy when adapted to specific domains. Domain adaptation is carried out by adding a small number of domain-dependent rules; the process is incremental and predictable. The grammars can be efficiently adapted to match the peculiarities of different types of text, from social media to news. No manually-annotated corpora are needed to train the system, since it does not require any sort of training.

Speakers
Antonio Valderrabanos

CEO and Founder, Bitext
Antonio Valderrábanos, CEO & Founder at Bitext. I have long experience in using Deep Linguistic Analysis to solve business problems, particularly in the area of Text Analytics. I started working for large R&D labs, at IBM and Novell. I developed the first grammar checker...


Tuesday May 17, 2016 2:10pm - 2:50pm
Ada

2:10pm

SAMEntics : Tools for paraphrase detection and paraphrase generation
Sparse ground truth, mediocre quality of training data, limited representation of novel queries, heavy biases due to human intervention, and large time overheads associated with manual cluster creation are inconveniences that both partners and the Watson Ecosystem technical team face on a day-to-day basis. Enriching ground truth, boosting the quality of training data, accounting for novel queries, and minimizing biases and time sinks due to human intervention therefore emerge as preprocessing requirements crucial to a seamless transition when utilizing a cognitive service powered by Watson. SAMEntics (Same + Semantics) has been conceptualized to match this exact purpose and provides an efficient alternative for handling large volumes of text across domains at scale. It comprises tools for paraphrase detection and paraphrase generation and is directed at:
  • discovering rewording in sentences across domains;
  • bucketing hierarchical categories within domains by capturing intent;
  • expediting question(s)-answer(s) mapping;
  • rendering syntactically correct phrasal variations of sentences while retaining semantic meaning, to enrich partner ground truth, boost training data quality, and minimize biases and time sinks due to human intervention.
SAMEntics thus provides an intelligent alternative for handling large volumes of text efficiently, not only by automatically rendering clusters based on user intent in a hierarchical manner but also by generating rewordings of user queries in the case of sparse and/or poor-quality training data. Join us as we go over the current and emerging state of the art in this space. Reflect on what is changing the world in this era of cognition. Dive deep into the pipeline and the core algorithmic paradigms that power a paraphrase detection and paraphrase generation engine. And leave with an understanding of what it takes to build a product that provides data science as a service.
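
As a point of reference, a toy paraphrase-detection baseline (emphatically not SAMEntics itself): TF-IDF cosine similarity with a threshold, which real systems extend with semantic features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def is_paraphrase(a, b, threshold=0.5):
    """Flag a sentence pair as a likely rewording; threshold is arbitrary."""
    tfidf = TfidfVectorizer().fit_transform([a, b])
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0] >= threshold

print(is_paraphrase("how do I reset my password",
                    "what are the steps to reset a password"))
```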

Speakers
Niyati Parameswaran

Data Scientist, IBM Watson
Niyati works as a data scientist for the Watson Ecosystem team. A dream of being able to provide a machine with intelligence that is unique, that can augment our own distinctive intelligence and that ultimately hopes to answer Alan Turing's question of 'Can machines think?' motivates...



Tuesday May 17, 2016 2:10pm - 2:50pm
Markov

2:10pm

Sparse data alternatives with neural network embeddings
The advent of continuous word representation technologies such as Word2Vec and GloVe has transformed how data scientists and machine learning experts work with natural language data. One reason these algorithms are so successful is that they offer an efficient, information-preserving methodology to highly compress native features (word frequencies) to the dimensions of the embedded vector space. This is particularly effective in the sparse data context of word count frequencies. Recently, word embedding algorithms have been generalized to generic graph-network contexts. In this talk we review results of applying this generalization to alternative sparse data contexts such as user-based and item-based recommender algorithms.
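
A sketch of this generalization in the DeepWalk style: random walks over a graph are treated as sentences and fed to a skip-gram model. The graph and parameters are illustrative:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

g = nx.karate_club_graph()

def random_walk(graph, start, length=10):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(graph.neighbors(walk[-1]))))
    return [str(n) for n in walk]

walks = [random_walk(g, n) for n in g.nodes() for _ in range(10)]

# Skip-gram over walks: frequently co-visited nodes get nearby embeddings.
model = Word2Vec(walks, vector_size=32, window=5, min_count=0, sg=1)
print(model.wv.most_similar("0", topn=3))
```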

Speakers
Marvin Bertin

Machine Learning Scientist, Skymind
I build intelligent applications with Machine Learning and Deep Learning for large-scale applications. Developed like2vec = product co-purchase graph + DeepWalk + Recommender System.

David Ott

Student, Galvanize

Mike Tamir

Chief Data Science Officer, Uber ATG


Tuesday May 17, 2016 2:10pm - 2:50pm
Gardner

3:00pm

News Analytics in Finance
In this talk we will discuss the evolution of the news analytics landscape from the perspective of the participants in the global financial industry. We will discuss the development route of several current key ML/NLP projects at Bloomberg, such as sentiment analysis of financial news, prediction of market impact, novelty detection, social media monitoring, question answering and topic clustering. These interdisciplinary problems lie at the intersection of linguistics, finance, computer science and mathematics, requiring methods from signal processing, machine vision and other fields. We will talk about the methods, problem formulation, and throughout, talk about practicalities of delivering machine learning solutions to problems of finance, highlighting issues such as importance of appropriate problem decomposition, validation and interpretability. We will also summarize the current state of the art and discuss possible future directions for the applications of natural language processing methods in finance. The talk will end with a Q&A session.

Speakers
Gary Kazantsev

Head of Machine Learning Engineering, Bloomberg
Gary Kazantsev is the Head of Machine Learning Engineering at Bloomberg, leading projects at the intersection of computational linguistics, machine learning and finance such as sentiment analysis, market impact indicators, statistical text classification, social media analytics, question...



Tuesday May 17, 2016 3:00pm - 3:40pm
Gardner

3:00pm

PhrazIt : Tool for automatic text summarization
Cognition is in virtually everything that humans do, such as language understanding, perception, judgment, learning, spatial processing and social behavior; and given that IBM Watson represents a first step toward truly cognitive systems, it becomes crucial to constantly harness its abilities in processing natural language, evaluating hypotheses and learning dynamically across domains. The project we will go over in this talk is aimed at augmenting these very behaviors of IBM Watson, as PhrazIt is focused on enriching raw data and essentially transforming information into insights. PhrazIt's technologically differentiated core, powered by an augmented extraction-based text summarization algorithm, utilizes a novel contextualized indexing framework, making it a tremendous value-add when deploying cognitive services powered by Watson. Join us as we go over the current and emerging state of the art in the space of text summarization. Reflect on what is changing the world in this era of cognition. Dive deep into the pipeline and the core algorithmic paradigms that power a content extraction engine. And leave with an understanding of what it takes to build a product that provides data science as a service.
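
For context, a bare-bones extractive summarizer; PhrazIt's actual algorithm is not public, so this simply scores sentences by average TF-IDF weight and keeps the top k:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize(sentences, k=2):
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.mean(axis=1)).ravel()
    top = sorted(np.argsort(scores)[::-1][:k])   # keep original sentence order
    return [sentences[i] for i in top]

doc = [
    "Watson processes natural language at scale.",
    "The weather was pleasant that day.",
    "Extractive summarization selects the most informative sentences.",
]
print(summarize(doc))
```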

Speakers
Niyati Parameswaran

Data Scientist, IBM Watson
Niyati works as a data scientist for the Watson Ecosystem team. A dream of being able to provide a machine with intelligence that is unique, that can augment our own distinctive intelligence and that ultimately hopes to answer Alan Turing's question of 'Can machines think?' motivates...



Tuesday May 17, 2016 3:00pm - 3:40pm
Markov

4:00pm

Airbnb and Marketplace Matching
Marketplace matching is about matching both sides of the ecosystem; in Airbnb's case, we're matching guests with hosts and understanding their unique preferences. Airbnb's inventory is extremely diverse, and different trip planners come to the platform with different goals in mind. In 2015, the Search team launched several experiments to help trip planners understand our inventory, narrow in on a subset of results relevant to them and build confidence in their booking decision. In this talk, we'll discuss these experiments and what we've learned from them, successes we've had with machine learning and some directions for future work.

Speakers
Surabhi Gupta

Engineering Manager, Airbnb
Surabhi is an engineering manager leading the Search and Application Infrastructure teams at Airbnb. Prior to Airbnb, she was a software engineer at Google where she worked on web search ranking and the Google Now team on predictive search. She holds a Masters degree in Computer Science...


Tuesday May 17, 2016 4:00pm - 4:40pm
Gardner

4:00pm

Mining Noisy Transaction Data with Neural Nets
Extracting relevant information from unstructured transaction data presents a challenge for those who may want to use such data for making business decisions such as underwriting loans or for monitoring credit worthiness. Most of our transaction data is in the form of transaction text describing the transaction, often using abbreviations or unknown proper nouns. A common approach for text documents is to encode the words or documents into vectors using a neural net layer or multiple layers. These features may then be used in a classification algorithm or other models for predicting an outcome. To this end, we encoded transaction data of small 'sentences', often of only a few words, using skip-gram word2vec models along with RBMs and Deep Belief Nets, utilizing other features such as the credit or debit value of the transaction and institution information. The goal of this discussion is to describe the performance of the model and also considerations for training a neural network in a large-data distributed framework like Spark. Tools used are Deeplearning4j, Spark, and Scala.

Speakers
Frank Taylor

Data Scientist, Earnest, Inc.
I have a background in Physics, specializing in statistical modeling of particle decays and later in optical signal processing. I am passionate about Big Data and its potential to gather insight into so many facets of humanity. As our tools get better and more scalable, we have the...


Tuesday May 17, 2016 4:00pm - 4:40pm
Ada

4:20pm

byte2vec: a flexible embedding model constructed from bytes
In today's fragmented, globalized world, supporting multiple languages in NLU and NLP applications is more important than ever. The inherent language dependence of classical machine learning and rule-based NLP systems has traditionally been a barrier to scaling said systems to new languages. This dependence typically manifests itself in feature extraction, as well as in pre-processing steps. In this talk, we present byte2vec as an extension to the well-known word2vec embedding model that facilitates dealing with multiple languages and unknown words. We explore its efficacy in a multilingual setting for tasks such as Twitter sentiment analysis and ABSA. Byte2vec is an embedding model constructed directly from the rawest form of input, bytes, and is:
  • truly language-independent;
  • particularly apt for synthetic languages through the use of morphological information;
  • intrinsically able to deal with unknown words;
  • directly pluggable into state-of-the-art NN architectures.
Pre-trained embeddings generated with byte2vec can be fed into state-of-the-art models; byte2vec can also be directly integrated and fine-tuned as a general-purpose feature extractor, similar to VGGNet's current role in computer vision.
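
A hedged sketch of the byte-level idea: represent words as byte (here character) n-grams and train skip-gram over them, so unknown words can be composed from known sub-word units. The details are invented and are not byte2vec's:

```python
import numpy as np
from gensim.models import Word2Vec

def byte_ngrams(word, n=3):
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

corpus = ["language independent models scale across languages".split(),
          "unknown words decompose into known byte ngrams".split()]
ngram_sentences = [[g for w in sent for g in byte_ngrams(w)] for sent in corpus]

model = Word2Vec(ngram_sentences, vector_size=32, window=5, min_count=1, sg=1)

# An out-of-vocabulary word still gets a vector: average its n-gram vectors.
oov = np.mean([model.wv[g] for g in byte_ngrams("models") if g in model.wv],
              axis=0)
print(oov.shape)
```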

Speakers
Parsa Ghaffari

Founder, Aylien
Parsa Ghaffari is an engineer and entrepreneur working in the field of Artificial Intelligence. He currently runs AYLIEN, a leading AI startup focused on creating technologies for analyzing and understanding unstructured content (text and images). Parsa will explain how Aylien is...



Tuesday May 17, 2016 4:20pm - 4:40pm
Markov
 
Wednesday, May 18
 

10:40am

An approach to internal search query analysis
Course Hero is an education technology company that provides subscription services for accessing crowd-sourced study materials. Our business is driven by SEO and by having the most selective student-generated materials related to a course. We show a small preview of the content to the search engines, which gets indexed. When users search long-tail education-related queries for which our content is relevant, our content link shows up among the top results. This is how our product gets visibility. We have three main types of content on our website: student-generated study documents, Q&A, and flashcard sets. Our internal search functionality is the method by which our customers discover content on our website. Content consumption and engagement metrics provide insightful information about the relevancy of our internal search algorithm and the quality of our content repository. Data mining these metrics helps us understand what our customers' demands are and how well our product is catering to them. Using unstructured search query data, as well as structured consumption and engagement metrics, we mined a meaningful list of high-value content categories that yielded a sizeable traffic increase. As part of the talk, we will go over the analytical methodology for mining and identifying these high-value categories.

Speakers
Max Ho

Business Analyst, Course Hero
Max Ho is the first Business Analyst at Course Hero, an online learning platform for students to access study resources like course materials, flashcards, and tutors. He is passionate about applying data science to practical business applications. At Course Hero, Max is working very...

Dhruv Sampat

VP of Analytics, Course Hero
Dhruv Sampat is the VP of Analytics at Course Hero, an online learning platform for students to access study resources like course materials, flashcards, and tutors. He joined the company in 2011 and has helped it grow into a leading platform for online study help. Dhruv has been...



Wednesday May 18, 2016 10:40am - 11:00am
Markov

11:10am

Text Analytics Simplified
In this talk, we will give an introduction to the Data Ninja services that greatly simplify your text analytics needs. We will briefly demonstrate the core functionalities of the services and showcase how an end-to-end application can be built using these services. Our goals are to enable app developers to build content-intelligent applications with unstructured data and to enable data scientists to explore the rich semantics from big data. We will provide a live demo on how to build a text analytics pipeline from scratch using the Data Ninja services. We will show you the steps starting from signing up to the services to producing actionable insights using machine-learning techniques on top of the semantic contents obtained from the Data Ninja services.

Speakers
Trung Diep

Architect, Docomo Innovations
Trung heads the engineering team responsible for delivering the quality and performance of the Data Ninja services. Prior to joining Docomo Innovations, Trung worked at Intel, Mercury Interactive, Rambus, and Broadcom. He received his B.S. and B.A. degrees in Electrical...

Ronald Sujithan

Founder and Principal, Sujithan Consulting, LLC


Wednesday May 18, 2016 11:10am - 11:30am
Markov

11:40am

From text to knowledge via ML algorithms - the Quora answer
Q&A sites like Quora aim to grow the world's knowledge. To do this, they need not only to get the right questions to the right people so they can answer them, but also to get the existing answers to people who are interested in them. To accomplish this we need to build a complex ecosystem taking text as the main data source, but also taking into account issues such as content quality, engagement, demand, interests, and reputation. Using high-quality data, you can build machine learning solutions that help address all of those requirements. In this talk I will describe some interesting uses of machine learning that range from different recommendation approaches, such as personalized ranking, to classifiers built to detect duplicate questions or spam. I will describe some of the modeling and feature engineering approaches that go into building these systems. I will also share some of the challenges faced when building such a large-scale base of human-generated knowledge. I will use my experience at Quora as the main driving example. Quora is a Q&A site that, despite having over 80 million unique visitors a month, is known for maintaining a high quality of knowledge and content in general.
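
An illustrative (and much simplified) duplicate-question classifier built from a few hand-engineered pair features, in the spirit of the feature engineering mentioned above; this is not Quora's actual system:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression

pairs = [("how to learn python", "best way to learn python", 1),
         ("how to learn python", "how tall is mount everest", 0)]

vec = TfidfVectorizer().fit([q for p in pairs for q in p[:2]])

def pair_features(q1, q2):
    """Similarity, word overlap, and length difference for a question pair."""
    a, b = vec.transform([q1]), vec.transform([q2])
    overlap = len(set(q1.split()) & set(q2.split()))
    return [cosine_similarity(a, b)[0, 0], overlap, abs(len(q1) - len(q2))]

X = np.array([pair_features(q1, q2) for q1, q2, _ in pairs])
y = [label for _, _, label in pairs]
print(LogisticRegression().fit(X, y).predict(X))
```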

Speakers
Xavier Amatriain

VP of Engineering, Quora



Wednesday May 18, 2016 11:40am - 12:20pm
Markov

2:10pm

Building a Graph of all US businesses using Spark technologies
Radius Intelligence (www.radius.com) empowers data science to deliver a unique marketing intelligence platform used by over a hundred US companies. This presentation will explain how Radius uses Spark along with GraphX, MLlib and Scala to create a comprehensive and accurate index of US businesses from dozens of different sources. In particular, I will address problems related to clustering records together based on a graph approach and how to resolve the graph into a set of US businesses. I will discuss some of the models related to cleaning out the noise, how to rank best values and impute missing values, and provide some best practices.
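
A small sketch of the graph-clustering step described, with networkx standing in for GraphX and a deliberately naive matching rule; the records and fields are invented:

```python
import networkx as nx

records = {
    1: {"name": "Acme Corp", "phone": "415-555-0100"},
    2: {"name": "ACME Corporation", "phone": "415-555-0100"},
    3: {"name": "Blue Bottle", "phone": "510-555-0199"},
}

g = nx.Graph()
g.add_nodes_from(records)
for i in records:
    for j in records:
        if i < j and records[i]["phone"] == records[j]["phone"]:
            g.add_edge(i, j)   # a shared phone number links the records

# Each connected component resolves to one US business.
print([sorted(c) for c in nx.connected_components(g)])
```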

Speakers
Alexis Roos

Engineering Manager, Radius Intelligence
Alexis has over 20 years of software engineering experience with emphasis on large-scale data science and engineering and application infrastructure. Currently an Engineering Manager at Radius Intelligence, Alexis is leading a team of data scientists and data engineers building...



Wednesday May 18, 2016 2:10pm - 2:30pm
Ada

2:10pm

Why The Best Minds Of Our Generation Are Thinking About How To Get People to Click on Ads
It'll be Josh Wills, talking about stuff. What's not to like?

Speakers

Josh Wills

Head of Data Engineering, Slack



Wednesday May 18, 2016 2:10pm - 2:50pm
Gardner

3:00pm

Analyzing Time Interval Data
Analyzing huge amounts of time interval data is a task arising more and more frequently in different domains like resource utilization and scheduling, real-time disposition, and health care. Analyzing this type of data using established, reliable, and proven technologies is desirable and required. However, utilizing commonly used tools and multidimensional models is not sufficient, because of modeling, querying, and processing limitations. In this talk, I present TIDAIS, a tool helpful for analyzing large amounts of time interval data. I present a query language and demonstrate some API examples. I will also share how time interval data plays a role in generating real-time user behavior predictions around interests and intent.
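
A toy example of the kind of question such a tool answers (TIDAIS's actual query language is not shown here): counting how many intervals overlap each point on a timeline, e.g. concurrent resource usage per hour:

```python
from collections import Counter

intervals = [(9, 12), (10, 14), (13, 15)]   # (start_hour, end_hour)

load = Counter()
for start, end in intervals:
    for hour in range(start, end):
        load[hour] += 1

for hour in sorted(load):
    print(f"{hour}:00  concurrent={load[hour]}")
```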

Speakers
Philipp Meisen

CTO & Cofounder, Breinify Inc.
I'm a workaholic visionary always searching for the next adventure to be mastered until success. My passion is bringing innovative ideas to life, finding new ways to make the impossible possible. I've developed software and analyzed data for more than 15 years; different...


Wednesday May 18, 2016 3:00pm - 3:40pm
Ada

3:00pm

Insider Text
"Insider Threat" is a major area of risk for many organisations, in both the government and commercial spheres. Employees, contract staff and suppliers are often in a strong position to perpetrate fraud, steal secret information or intellectual property, or sabotage computer systems, and effectively evade detection for long periods. After-the-fact investigation is often challenging and labour intensive, and early prediction and positive mitigation, which may be far more effective, is even more difficult to automate. To trained eyes, text sources like chat and email often contain signals that insiders are on a path to hostile action. However, making use of these sources in ways that respect individual rights and improve trust is as much of a challenge as the technical one of extracting the signal. Mr. Stewart outlines the role and limitations of text analysis in the automation of predicting insider threat by discussing key results in the area of intent detection, and argues that (given the projected state of the art) many organizations might achieve greater harm reduction by developing or adopting what IBM has called "Systems of Engagement".

Speakers
Gregor Stewart

VP Product Management, Text Analytics, Basis Technology
As Vice President of Product Management at Basis Technology, Mr. Stewart helps to drive company strategy and ensure that the company's offerings stay ahead of the curve. Previously Mr. Stewart was the CTO of a storage services startup and a strategy consultant. He holds a Masters...



Wednesday May 18, 2016 3:00pm - 3:40pm
Gardner
 
Thursday, May 19
 

11:40am

The Security Wolf of Wall Street: Fighting Crime with High-Frequency Classification and Natural Language Processing
In a world where threat actors move fast and the Internet evolves in a non-deterministic fashion, turning threat intelligence into automated protection has proven to be a challenge for the information security industry. While traditional threat research methods will never go away, there is an increasing need for powerful decision models that can process data in real time and scale to incorporate increasingly rich sources of threat intel. This talk will focus on one way to build a scalable machine learning infrastructure operating in real time on a massive amount of DNS data (approximately 80B queries per day), offering a sneak peek into how OpenDNS does scalable data science. We will touch on two core components, Big Data engineering and Big Data science, and specifically how they are used to implement a real-time threat detection system for large-scale network traffic.

To begin, we will detail Avalanche, a stream processing framework that helps OpenDNS data scientists create their own data processing pipelines using a modular graph-oriented representation. Each node acts as a data stream processor running as a process, thread or EC2 instance. In this graph database, the edges represent streaming channels connecting the different inputs and outputs of the nodes. The whole data pipeline can then easily be scaled and deployed to hundreds of instances in an AWS cloud. The Avalanche project's paradigm is to take the approach that the finance world has been using for decades in high-frequency and quantitative trading and apply it to traffic analysis. Applying intelligent detection models as close as possible to the data source holds the key to building a truly predictive security system, one where requests are classified and filtered on the fly. In our particular case at OpenDNS, we see a strong interest in integrating such a detection pipeline at the resolver level.

We will next discuss how we integrate our statistical model NLP-Rank (a model that does large-scale phishing detection) with Avalanche, and show some benchmarks. At its core, NLP-Rank is a fraud detection system that applies machine learning to the HTML content of a domain's web page to extract relevant terms and identify whether the content is potentially malicious. In this sense we are automating the security analyst's decision-making process in judging whether a website is legitimate. Typically, when an analyst reviews a domain or URL in question, the analyst visits the site in a Tor browser, analyzes the content, and identifies the themes and summarizes the page before deciding whether it's a fake or a false positive. In this talk, we will describe how we have automated this process at OpenDNS and discuss the unique characteristics of NLP-Rank, including its machine learning techniques.

Additionally, we will discuss the design and implementation of our phishing classification system. We will provide an overview of the data preprocessing techniques and the information retrieval/natural language processing techniques used by our classifier. We will then discuss how Avalanche manages the results of NLP-Rank, how we add those results to our blocklists and our corpus, and Avalanche's overall performance.
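
A loose sketch of the NLP-Rank idea as described (strip a page to text, extract terms, classify); the model, data, and labels below are invented:

```python
from html.parser import HTMLParser
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

class TextExtractor(HTMLParser):
    """Collect the visible text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def page_text(html):
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.chunks)

pages = ["<html><body>verify your account password now</body></html>",
         "<html><body>quarterly engineering blog update</body></html>"]
labels = [1, 0]   # 1 = phishing

X = TfidfVectorizer().fit_transform([page_text(h) for h in pages])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```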

Speakers
Jeremiah O'Connor and Thibault Reuille

Research Engineer (Jeremiah O'Connor), Manager of Threat Research Development (Thibault Reuille), OpenDNS (now part of Cisco)
Jeremiah O'Connor is a research engineer at OpenDNS where he focuses on building scalable threat detection models and writing software to solve real-world security problems. His current interests are in machine learning, natural language processing, distributed systems, and big data...



Thursday May 19, 2016 11:40am - 12:20pm
Markov

3:30pm

A Match Made at Upwork
Upwork is the largest online freelancing marketplace, where freelancers from around the world earn over a billion dollars a year for work ranging from software development to sales to translation. Our data science team helps make this possible by developing algorithms to help clients and freelancers find the right fit for their needs and talents. We are in the privileged position of having data about the full life cycle of a job, from the moment the client posts it, to how freelancers find it, the interview and hiring process, and the final success of the project.

In this talk, I will focus on how we decide which jobs to recommend to freelancers. The job recommendation algorithms power a daily job digest e-mail, and are used in the Upwork Job Search and other parts of the site. I will share data and insights about our two-way marketplace, and discuss some of the machine learning models I’ve been developing. I will also discuss my approach to a key question for many ML practitioners — figuring out what to model in the first place.

I think about job recommendations as a two-part problem — will the freelancer apply to the job if they see it, and would that application be useful? I will describe the Job Interest model I have developed to answer the first question, which is a logistic regression model that uses features derived from the freelancer’s profile and application history to predict which jobs the freelancer will choose to apply to. A key modeling insight is that including features that capture main effects (the attractiveness of a job and each freelancer’s propensity for applying to jobs) vastly improves model accuracy.

Complex marketplace considerations come into play when we think about where to direct freelancers’ applications. One more application is hardly useful if a job already has several great applicants, or if the client has changed their mind about hiring. It is more important to help clients find the right person for large contracts than for small one-off tasks. We can also take advantage of our models that predict application and job success for applicants.
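
A hedged sketch of the main-effects insight described above: alongside an affinity feature, include a per-job attractiveness feature and a per-freelancer propensity feature in a logistic regression; all numbers are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: profile/job match score, the job's historical apply rate (job main
# effect), the freelancer's overall apply propensity (freelancer main effect)
X = np.array([
    [0.9, 0.30, 0.20],
    [0.2, 0.30, 0.20],
    [0.7, 0.05, 0.60],
    [0.1, 0.05, 0.02],
])
y = [1, 0, 1, 0]   # did the freelancer apply after seeing the job?

model = LogisticRegression().fit(X, y)
print(model.predict_proba(X)[:, 1])   # P(apply | impression)
```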

Speakers

Alya Abbott

Senior Data Scientist, Upwork


Thursday May 19, 2016 3:30pm - 3:50pm
Markov