Data By the Bay has ended
Data By the Bay is the first Data Grid conference matrix with 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.
Wednesday, May 18 • 11:40am - 12:20pm
Unsupervised NLP Classification with Clustering

In the world of local government finance, training data is sparse, and language-based training data is almost non-existent. Furthermore, fiscal language in government requires substantial domain knowledge, both to build training data and to develop strong intuitions. This makes traditional supervised methods difficult to use successfully, because the training data you generate always lags behind the growth of raw data. To help tackle these challenges in NLP analysis, we'll show relationship extraction and clustering techniques for understanding data on domain-heavy topics. We'll apply these techniques to published local government budget PDFs to extract topics and gain insight into the purpose of domain-specific text. The talk will follow each key point with code examples. First, we'll talk about data challenges in local government and the lack of established knowledge bases around that data. Specifically, we'll explore the unknown-number-of-classes problem and how unsupervised algorithms can yield insights. Then we'll focus on the families of clustering algorithms available and how they let you focus on edge associations rather than holistic state spaces. Following that, we'll explore some useful techniques for optimizing computation and show how missing or skipped data points can be linked by association. Finally, we'll combine the pieces we've shown to perform topic extraction and understanding on public financial budgets.
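As a rough illustration of the ideas in the abstract (not the speaker's actual code, which is not reproduced here), the sketch below groups short budget line items without fixing the number of classes in advance. It assumes a simple bag-of-words cosine similarity and links any two lines whose similarity reaches a threshold; clusters are the connected components of that graph. The function names, the threshold value, and the sample line items are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a line and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_by_edges(lines, threshold=0.3):
    """Group lines into connected components of a similarity graph.

    An edge joins two lines when their cosine similarity is at least
    `threshold` (an assumed, hand-picked value). The number of clusters
    is an outcome of the data, not an input.
    """
    vectors = [Counter(tokenize(line)) for line in lines]
    parent = list(range(len(lines)))  # union-find forest

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two components

    clusters = {}
    for i, line in enumerate(lines):
        clusters.setdefault(find(i), []).append(line)
    return list(clusters.values())


# Made-up budget line items, for illustration only.
budget_lines = [
    "Police department salaries and overtime",
    "Police department vehicle maintenance",
    "Vehicle maintenance for fire engines",
    "Library book purchases",
    "Library building repairs",
]
clusters = cluster_by_edges(budget_lines)
for group in clusters:
    print(group)
```

In this toy data the first and third lines share no words, yet they fall into the same cluster because each is linked to the second line. That is one simple way points can be "linked by association" through edges rather than by comparing every item against a global model.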

Matthew Seal

Data Scientist, OpenGov
I'm an early employee of OpenGov with a passion for data models and data understanding. I've had broad exposure to software development of various types, from front-end code to database architecture to machine learning. I graduated from Stanford University with a BS in Electrical...

Wednesday May 18, 2016 11:40am - 12:20pm
