Data By the Bay is the first Data Grid conference matrix with 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.
Understanding the relationships between drugs and diseases, side effects, and dosages is an important part of drug discovery and clinical trial design. Some of these relationships have been studied and curated in different formats, such as the UMLS, BioPortal, and SNOMED. Typically this data is incomplete and spread across various sources. I will address the different stages of extracting drug-disease, drug-side effect, and drug-dosage relationships. As a first step, I will discuss extracting medical attributes (diseases, dosages, side effects) from FDA drug labels and clinical trials. Next, I will use simple machine learning techniques to improve the precision and recall of this extracted sample. I will also discuss bootstrapping a larger training sample from a smaller training set. Finally, I will use DeepDive, a dark data extraction framework, to extract relationships between medical attributes and derive conclusive evidence for facts about them. The advantage of using DeepDive is that it masks the complexities of the machine learning techniques and forces the user to think more about the features in the data set. At the end of these steps we will have structured (queryable) data that answers questions such as: what is the dosage of 'digoxin' for controlling 'ventricular response rate' in a male adult at 'age 60' with weight '160 lbs'?
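To make the first two steps concrete, here is a minimal Python sketch of rule-based dosage extraction followed by a precision and recall check against a hand-labeled sample. It is an illustration only, not the pipeline used in the talk: the regular expression, the function names `extract_dosages` and `precision_recall`, and the example sentence are all assumptions made up for this sketch.

```python
import re

# Hypothetical pattern: a number followed by a common dosage unit.
# Real drug labels need far richer rules (ranges, frequencies, routes).
DOSAGE_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg|mcg|g|mL)\b", re.IGNORECASE)


def extract_dosages(text):
    """Return (value, unit) pairs found in free text."""
    return [(float(value), unit.lower()) for value, unit in DOSAGE_RE.findall(text)]


def precision_recall(predicted, gold):
    """Compare extracted items with a hand-labeled gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented sentence, not taken from any real drug label.
    sentence = "Take 10 mg twice daily; do not exceed 40 mg."
    found = extract_dosages(sentence)
    # Invented gold labels, chosen so that one extraction is wrong.
    gold = [(10.0, "mg"), (20.0, "mg")]
    print(found, precision_recall(found, gold))
```

Measuring the rule-based extractor this way gives a baseline that a learned model, or a bootstrapped training set, would then be expected to improve on.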