Data By the Bay is the first Data Grid conference matrix with 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.
Tuesday, May 17 • 1:40pm - 2:00pm
Of Rules and Probabilities: Computational Linguistics Methods for Improving Machine Learning Models

Supervised machine learning models are powerful tools for processing vast amounts of text. Their applications include sentiment analysis, text classification, topic mining, part-of-speech tagging, and named entity recognition, among many others. However, supervised models rely heavily on large amounts of annotated data, and they require the annotations to be consistent and accurate. In practice, obtaining high-quality annotated data, especially with strong inter-annotator agreement, is not always possible: for legal and privacy reasons, organizations may not be allowed to crowdsource some of their data. In this talk I propose several methods that help machine learning models get over the hurdle of insufficient labeled data by leveraging computational linguistics techniques. Specifically, focusing on a conditional random field (CRF) model for named entity recognition, I discuss how linguistic feature engineering, artificial dataset generation, and post-processing rules can significantly improve the performance of a model that otherwise suffers from the bottleneck of insufficient training data. These methods are scalable and practical, and machine learning practitioners can use them in situations where obtaining more training data via crowdsourcing is not a viable option.
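The abstract names three techniques without giving details, so the following Python sketch is only an illustration of what each could look like for a BIO-tagged NER task, not the speaker's implementation. All function names, the feature set, the template format, and the sample names ("Ada Lovelace", "Acme Corp") are hypothetical. The feature dictionaries are in the per-token form that CRF toolkits commonly accept, but no CRF is trained here.

```python
import random
import re


def word_shape(token):
    """Collapse a token into a coarse shape, e.g. 'LinkedIn' -> 'XxXx'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)  # squeeze repeated characters


def token_features(tokens, i):
    """Language feature engineering: hand-built features for token i (assumed set)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "shape": word_shape(tok),
        "suffix3": tok[-3:].lower(),
        "is_title": tok.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def generate_examples(templates, gazetteer, n, seed=0):
    """Artificial dataset generation: fill {LABEL} slots with gazetteer entries."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        tokens, tags = [], []
        for piece in rng.choice(templates).split():
            if piece.startswith("{") and piece.endswith("}"):
                label = piece[1:-1]
                entity = rng.choice(gazetteer[label]).split()
                tokens.extend(entity)
                tags.extend(["B-" + label] + ["I-" + label] * (len(entity) - 1))
            else:
                tokens.append(piece)
                tags.append("O")
        examples.append((tokens, tags))
    return examples


def repair_bio(tags):
    """Post-processing rule: an I- tag that does not continue an entity becomes B-."""
    fixed, prev = [], "O"
    for tag in tags:
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            tag = "B-" + tag[2:]
        fixed.append(tag)
        prev = tag
    return fixed


# Hypothetical template and gazetteer, used only for illustration.
templates = ["{PER} works at {ORG}"]
gazetteer = {"PER": ["Ada Lovelace"], "ORG": ["Acme Corp"]}
synthetic = generate_examples(templates, gazetteer, n=2)
features = [token_features(synthetic[0][0], i) for i in range(len(synthetic[0][0]))]
```

In a setup like this, the synthetic sentences would supplement a small hand-labeled set, the feature dictionaries would be passed to the sequence model, and `repair_bio` would run over the model's predicted tags.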

Vita Markman

Staff Software Engineer, Computational Linguist, LinkedIn
As a Staff Software Engineer at LinkedIn, I work on various natural language processing applications such as query understanding, sentiment analysis, and member/job data standardization. Before joining LinkedIn, I was a Staff Research Engineer at Samsung Research America, where, among other projects, I worked on extracting topic-indicative phrases from a stream of closed-caption news data in real time and text-mining customer support chat logs...

Tuesday May 17, 2016 1:40pm - 2:00pm
