Data By the Bay has ended
Data By the Bay is the first Data Grid conference matrix with 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.
Tuesday, May 17 • 9:50am - 10:30am
Building Word2Vec Models with Text Data


It is always amazing when someone is able to take a very hard problem and translate it into one that has been studied for centuries. This is the case with Word2Vec, which transforms words into vectors. Text is unstructured data and has been explored mathematically far less than vectors. Newton (1642-1726) may have been the first to study vectors, whereas text mining has been studied for only a few decades. Word2Vec maps text to a vector space that can be used in a variety of ways, such as measuring the distance between words. Given a word of interest, this vector space can therefore be used to compute the top N closest words. In this talk, I will explain how to build Word2Vec models with Twitter data stored in Hadoop using Spark and MLlib. I will describe how to choose the most important parameters to accurately train a Word2Vec model. In addition, I will show examples of how these models are used in practice in data products.
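
As a rough illustration of the "top N closest words" idea in the abstract, the sketch below ranks words by cosine similarity to a query word. It is not the speaker's code: the vectors are made-up three-dimensional toy values rather than the output of a trained model, and the names VECTORS, cosine_similarity, and top_n_closest are hypothetical helpers introduced only for this example. Cosine similarity is one common choice of closeness measure; the talk may use a different one.

```python
import math

# Toy word vectors, invented for illustration only. A trained Word2Vec
# model would learn these from text and typically use far more dimensions.
VECTORS = {
    "coffee": [0.9, 0.1, 0.0],
    "espresso": [0.8, 0.2, 0.1],
    "tea": [0.7, 0.3, 0.0],
    "laptop": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n_closest(word, vectors, n=2):
    """Return the n words most similar to `word`, best match first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is not a neighbor
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


if __name__ == "__main__":
    for neighbor, score in top_n_closest("coffee", VECTORS, n=2):
        print(neighbor, round(score, 3))
```

With a model trained in Spark MLlib, the same kind of query is available through the model's findSynonyms(word, num) method, so a hand-written ranking like this is mainly useful for seeing what that lookup does.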

Speakers

Jorge Castanon

Lead Data Scientist, IBM
Jorge Castañón hails from Mexico City and received his Ph.D. in Computational and Applied Mathematics from Rice University. He has a genuine passion for data science and machine learning applications of any kind. Since 2007, he has been developing numerical optimization models and...


Tuesday May 17, 2016 9:50am - 10:30am PDT
Ada