It is easy to write a fast, incorrect algorithm. In machine learning, we often face two challenges at once: implementing a complicated, math-heavy algorithm correctly and making it run fast. In this talk, we present the marriage of functional programming and machine learning as a solution to this central problem dogging machine learning practitioners. As a demonstration of combining these ideas, we present code that implements common learning algorithms using core principles of functional programming.
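As a rough illustration of the idea (not the speakers' code), gradient descent for least-squares linear regression can be written as a fold of a pure update function over the iteration count. Every name below (`predict`, `gradient`, `step`, `fit`), the toy data, and the learning rate are assumptions made for this sketch.

```python
from functools import reduce

def predict(w, x):
    """Dot product of a weight tuple and a feature tuple."""
    return sum(wi * xi for wi, xi in zip(w, x))

def gradient(w, data):
    """Gradient of the mean squared error over (features, target) pairs."""
    n = len(data)
    errors = [(predict(w, x) - y, x) for x, y in data]
    return tuple(2.0 / n * sum(e * x[j] for e, x in errors)
                 for j in range(len(w)))

def step(w, data, lr):
    """One pure update: returns new weights, never mutates the old ones."""
    return tuple(wi - lr * gi for wi, gi in zip(w, gradient(w, data)))

def fit(data, w0, lr=0.05, iterations=2000):
    """Training is a left fold of `step` over the iteration count."""
    return reduce(lambda w, _: step(w, data, lr), range(iterations), w0)

# Toy data generated from y = 1 + 2x; the first feature is a constant bias term.
DATA = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0),
        ((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)]

weights = fit(DATA, w0=(0.0, 0.0))
```

Because `step` has no side effects, it can be tested in isolation against a hand-computed gradient, which is one way the functional style helps with the correctness half of the problem.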
Twitter generates billions of events per day. Analyzing these events in real time presents a massive challenge. To meet it, Twitter designed and deployed a new streaming system called Heron. Heron has been in production for nearly two years and is widely used by several teams for diverse use cases. In this talk, I will describe Heron in detail and share our experiences and the challenges of operating Heron at scale.
Apache Spark is most often used to process large amounts of data efficiently, but it is also useful for the individual predictions common to many NLP applications. The algorithms inside MLlib are useful in their own right, independent of the core Spark framework. IdiML is an open source tool that enables very fast predictions on textual data by using various components within MLlib. It acts as a standalone tool for core machine learning functionality and can easily be integrated into production systems to provide low-latency, continuous streaming predictions. This talk explores the functionality inside IdiML, how it uses MLlib, and why that makes such a big difference.
[Revised 05/16/16] A concept is a unit of thought. Chances are that any biomedical concept represented in your data has already been named by some authority. Your tax dollars pay for these names to be collected, maintained, and represented in a homogeneous, tool-supported context called the UMLS (Unified Medical Language System).
The UMLS consists of three knowledge sources: the Metathesaurus, a Semantic Network, and a Lexicon with accompanying tools. The UMLS was created and is maintained by the U.S. National Library of Medicine (NLM), part of the National Institutes of Health (NIH). The 2016AA release of the Metathesaurus contains more than 3.25 million concepts and 13 million unique concept names from over 197 source vocabularies, expressed in 25 different languages. Many of these vocabularies include translations into the world's major languages. Because it contains a mixture of public and proprietary content, use of the UMLS requires a license, available free of charge from the NLM.
Tools are included to assist with browsing, downloading, subsetting, and representing the UMLS in existing databases. Additional tools support inter-source linking and finding concepts in text. While not for the faint of heart, these resources are widely used around the world. Tutorial videos are available on the NLM UMLS web site.
Important and widely used vocabularies in the UMLS include those naming diseases, lab tests, procedures, medications, chemicals, organisms, anatomic structures, and genes, drawn from both research and clinical care. Several of these vocabularies are part of the standards specified for use in U.S. Electronic Health Records. Internet connectivity permitting, audience members will be challenged to "stump the Metathesaurus" - that is, to name an important biomedical concept that cannot be found there. This exercise will illustrate why the UMLS should not be re-invented.
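For readers who want to try the "stump the Metathesaurus" exercise offline, here is a minimal sketch of a name lookup over rows in the pipe-delimited layout of the Metathesaurus concept-names file (MRCONSO.RRF). The column positions are assumed from that file's documented layout and should be checked against the release in use. The sample rows, identifiers, and function names are synthetic placeholders, not real UMLS content, and the NLM's own lexical tools normalize names far more thoroughly than the case folding used here.

```python
from collections import defaultdict

# Assumed zero-based column positions in an MRCONSO.RRF row:
# concept identifier, language of the term, and the term string itself.
CUI, LAT, STR = 0, 1, 14

# Synthetic placeholder rows (NOT real UMLS content), shaped like MRCONSO rows.
SAMPLE_ROWS = [
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC_A|PT|X1|Example disease|0|N||",
    "C0000001|ENG|S|L0000002|PF|S0000002|N|A0000002||||SRC_B|SY|Y7|Example disorder|0|N||",
    "C0000001|SPA|P|L0000003|PF|S0000003|Y|A0000003||||SRC_C|PT|Z9|Enfermedad de ejemplo|0|N||",
]

def build_index(rows, language="ENG"):
    """Map each case-folded concept name to the set of concept identifiers."""
    index = defaultdict(set)
    for row in rows:
        fields = row.rstrip("\n").split("|")
        if len(fields) <= STR or fields[LAT] != language:
            continue  # skip malformed rows and names in other languages
        index[fields[STR].casefold()].add(fields[CUI])
    return index

def lookup(index, name):
    """Return the sorted concept identifiers for a name, or [] if it is absent."""
    return sorted(index.get(name.casefold(), ()))

index = build_index(SAMPLE_ROWS)
```

An empty result from `lookup` for an exact name match does not mean the concept is absent, since a synonym or spelling variant may still be present.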