Publication


Featured research published by Alexander Ratner.


international conference on management of data | 2016

DeepDive: Declarative Knowledge Base Construction

Christopher De Sa; Alexander Ratner; Christopher Ré; Jaeho Shin; Feiran Wang; Sen Wu; Ce Zhang

The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and PDF reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.
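The declarative interface described above can be sketched in plain Python. Everything here is hypothetical and illustrative: real DeepDive rules feed a statistical inference engine, whereas this toy skips inference entirely and only shows the shape of the interaction, where the user supplies domain rules and the system, not the user, populates the SQL database.

```python
# Hypothetical sketch of a declarative KBC interface (not DeepDive's actual
# language). The user writes rules about the domain; the system handles
# populating the SQL database. Statistical inference is omitted here.
import re
import sqlite3

RULES = {
    # relation name -> extraction pattern (illustrative only)
    "has_spouse": re.compile(r"(\w+) married (\w+)"),
}

def run_kbc(docs):
    """Apply every rule to every document and fill a SQL table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE has_spouse (p1 TEXT, p2 TEXT)")
    for doc in docs:
        for rel, pat in RULES.items():
            for m in pat.finditer(doc):
                db.execute(f"INSERT INTO {rel} VALUES (?, ?)", m.groups())
    return db.execute("SELECT * FROM has_spouse").fetchall()

print(run_kbc(["Alice married Bob in 2010."]))  # [('Alice', 'Bob')]
```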


very large data bases | 2017

Snorkel: rapid training data creation with weak supervision

Alexander Ratner; Stephen H. Bach; Henry R. Ehrenberg; Jason Alan Fries; Sen Wu; Christopher Ré

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.
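The labeling-function idea can be sketched in a few lines of plain Python. All names and heuristics below are hypothetical, and the majority vote is only a crude stand-in for Snorkel's actual denoising step, which learns per-function accuracies without ground truth.

```python
# Minimal sketch of weak supervision via labeling functions (hypothetical
# heuristics for a spouse-relation task). Snorkel learns each function's
# accuracy; here a simple majority vote stands in for that denoising step.
from collections import Counter

ABSTAIN, NEG, POS = -1, 0, 1

def lf_contains_spouse(x):
    # Heuristic: the word "spouse" suggests a positive spouse relation.
    return POS if "spouse" in x.lower() else ABSTAIN

def lf_contains_wife_or_husband(x):
    return POS if ("wife" in x.lower() or "husband" in x.lower()) else ABSTAIN

def lf_sibling_words(x):
    # Crude negative heuristic: sibling words suggest a non-spouse relation.
    return NEG if ("brother" in x.lower() or "sister" in x.lower()) else ABSTAIN

LFS = [lf_contains_spouse, lf_contains_wife_or_husband, lf_sibling_words]

def weak_label(x):
    """Combine noisy labeling-function votes into one training label."""
    votes = [lf(x) for lf in LFS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

print(weak_label("His wife, a doctor, was there"))  # 1
print(weak_label("Her brother lives nearby"))       # 0
```

The point of the design is that each function may be individually unreliable; value comes from combining many of them.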


international conference on management of data | 2016

Data programming with DDLite: putting humans in a different part of the loop

Henry R. Ehrenberg; Jaeho Shin; Alexander Ratner; Jason Alan Fries; Christopher Ré

Populating large-scale structured databases from unstructured sources is a critical and challenging task in data analytics. As automated feature engineering methods grow increasingly prevalent, constructing sufficiently large labeled training sets has become the primary hurdle in building machine learning information extraction systems. In light of this, we have taken a new approach called data programming [7]. Rather than hand-labeling data, in the data programming paradigm, users generate large amounts of noisy training labels by programmatically encoding domain heuristics as simple rules. Using this approach over more traditional distant supervision methods and fully supervised approaches using labeled data, we have been able to construct knowledge base systems more rapidly and with higher quality. Since the ability to quickly prototype, evaluate, and debug these rules is a key component of this paradigm, we introduce DDLite, an interactive development framework for data programming. This paper reports feedback collected from DDLite users across a diverse set of entity extraction tasks. We share observations from several DDLite hackathons in which 10 biomedical researchers prototyped information extraction pipelines for chemicals, diseases, and anatomical named entities. Initial results were promising, with the disease tagging team obtaining an F1 score within 10 points of the state-of-the-art in only a single day-long hackathon's work. Our key insights concern the challenges of writing diverse rule sets for generating labels, and exploring training data. These findings motivate several areas of active data programming research.
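The prototype-evaluate-debug loop described above can be sketched as follows. This is a hypothetical illustration, not DDLite's interface: it computes the kind of coverage and conflict feedback a rule author would iterate against.

```python
# Hypothetical sketch of rule diagnostics in a data programming workflow:
# coverage (how many examples get at least one label) and conflict
# (how many examples receive disagreeing labels) guide rule debugging.
ABSTAIN = None

def rule_mentions_disease(s):
    return 1 if "disease" in s else ABSTAIN

def rule_mentions_healthy(s):
    return 0 if "healthy" in s else ABSTAIN

RULES = [rule_mentions_disease, rule_mentions_healthy]

def rule_stats(rules, examples):
    n = len(examples)
    votes = [[r(x) for r in rules] for x in examples]
    coverage = sum(any(v is not ABSTAIN for v in row) for row in votes) / n
    conflict = sum(
        len({v for v in row if v is not ABSTAIN}) > 1 for row in votes
    ) / n
    return {"coverage": coverage, "conflict": conflict}

docs = [
    "patient shows disease markers",
    "subject is healthy",
    "healthy tissue near disease site",  # the two rules disagree here
    "no relevant findings",
]
print(rule_stats(RULES, docs))  # {'coverage': 0.75, 'conflict': 0.25}
```

Low coverage suggests writing more rules; high conflict suggests refining or splitting existing ones.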


international conference on management of data | 2017

Snorkel: Fast Training Set Generation for Information Extraction

Alexander Ratner; Stephen H. Bach; Henry R. Ehrenberg; Christopher Ré

State-of-the-art machine learning methods such as deep learning rely on large sets of hand-labeled training data. Collecting training data is prohibitively slow and expensive, especially when technical domain expertise is required; even the largest technology companies struggle with this challenge. We address this critical bottleneck with Snorkel, a new system for quickly creating, managing, and modeling training sets. Snorkel enables users to generate large volumes of training data by writing labeling functions, which are simple functions that express heuristics and other weak supervision strategies. These user-authored labeling functions may have low accuracies and may overlap and conflict, but Snorkel automatically learns their accuracies and synthesizes their output labels. Experiments and theory show that surprisingly, by modeling the labeling process in this way, we can train high-accuracy machine learning models even using potentially lower-accuracy inputs. Snorkel is currently used in production at top technology and consulting companies, and used by researchers to extract information from electronic health records, after-action combat reports, and the scientific literature. In this demonstration, we focus on the challenging task of information extraction, a common application of Snorkel in practice. Using the task of extracting corporate employment relationships from news articles, we will demonstrate and build intuition for a radically different way of developing machine learning systems which allows us to effectively bypass the bottleneck of hand-labeling training data.
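The phrase "automatically learns their accuracies" can be illustrated with a deliberately simplified sketch. This is not Snorkel's actual estimator (which fits a generative model): here, each function's accuracy is approximated by how often it agrees with the per-example majority vote, with no ground truth involved.

```python
# Simplified, hypothetical sketch of estimating labeling-function accuracies
# without ground truth: agreement with the per-example majority vote serves
# as a crude accuracy proxy (Snorkel's real model is a generative model).
from collections import Counter

ABSTAIN = None

def estimate_accuracies(vote_matrix):
    """vote_matrix[i][j]: label from function j on example i (or ABSTAIN)."""
    n_lfs = len(vote_matrix[0])
    agree = [0] * n_lfs
    total = [0] * n_lfs
    for row in vote_matrix:
        votes = [v for v in row if v is not ABSTAIN]
        if not votes:
            continue
        majority = Counter(votes).most_common(1)[0][0]
        for j, v in enumerate(row):
            if v is not ABSTAIN:
                total[j] += 1
                agree[j] += (v == majority)
    return [a / t if t else 0.0 for a, t in zip(agree, total)]

votes = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
]
print(estimate_accuracies(votes))  # [1.0, 1.0, 0.5] -- third LF is noisiest
```

The estimated weights can then downweight the noisier functions when synthesizing final training labels.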


international conference on management of data | 2018

Snorkel MeTaL: Weak Supervision for Multi-Task Learning

Alexander Ratner; Braden Hancock; Jared Dunnmon; Roger E. Goldman; Christopher Ré

Many real-world machine learning problems are challenging to tackle for two reasons: (i) they involve multiple sub-tasks at different levels of granularity; and (ii) they require large volumes of labeled training data. We propose Snorkel MeTaL, an end-to-end system for multi-task learning that leverages weak supervision provided at multiple levels of granularity by domain expert users. In MeTaL, a user specifies a problem consisting of multiple, hierarchically-related sub-tasks---for example, classifying a document at multiple levels of granularity---and then provides labeling functions for each sub-task as weak supervision. MeTaL learns a re-weighted model of these labeling functions, and uses the combined signal to train a hierarchical multi-task network which is automatically compiled from the structure of the sub-tasks. Using MeTaL on a radiology report triage task and a fine-grained news classification task, we achieve average gains of 11.2 accuracy points over a baseline supervised approach and 9.5 accuracy points over the predictions of the user-provided labeling functions.
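The hierarchical sub-task structure described above can be sketched minimally. The taxonomy below is hypothetical (loosely inspired by the radiology triage example): a coarse task constrains a fine-grained task, so predictions at the two granularities must be mutually consistent.

```python
# Hypothetical sketch of hierarchically-related sub-tasks: each fine-grained
# label has a parent coarse label, and a (coarse, fine) prediction pair is
# valid only when the fine label's parent matches the coarse prediction.
COARSE_OF = {          # fine label -> parent coarse label (assumed taxonomy)
    "normal": "normal",
    "fracture": "abnormal",
    "lesion": "abnormal",
}

def consistent(coarse_label, fine_label):
    """Check that a fine-grained prediction agrees with the coarse one."""
    return COARSE_OF[fine_label] == coarse_label

print(consistent("abnormal", "fracture"))  # True
print(consistent("normal", "lesion"))      # False
```

In MeTaL-style systems this task structure also determines how the multi-task network is compiled, with shared layers feeding one head per sub-task.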


bioRxiv | 2017

AMELIE accelerates Mendelian patient diagnosis directly from the primary literature

Johannes Birgmeier; Maximilian Haeussler; Cole A. Deisseroth; Karthik A. Jagadeesh; Alexander Ratner; Harendra Guturu; Aaron M. Wenger; Peter D. Stenson; David Neil Cooper; Christopher Ré; Jonathan A. Bernstein; Gill Bejerano

The diagnosis of Mendelian disorders requires labor-intensive literature research. Our software system AMELIE (Automatic Mendelian Literature Evaluation) greatly automates this process. AMELIE parses hundreds of thousands of full text articles to find an underlying diagnosis to explain a patient’s phenotypes given the patient’s exome. AMELIE prioritizes patient candidate genes for their likelihood of causing the patient’s phenotypes. Diagnosis of singleton patients (without relatives’ exomes) is the most time-consuming scenario. AMELIE’s gene ranking method was tested on 215 singleton Mendelian patients with a clinical diagnosis. AMELIE ranked the causal gene among the top 2 in the majority (63%) of cases. Examining AMELIE’s top 10 genes, amounting to 8% of 124 candidate genes with rare functional variants per patient, results in diagnosis for 95% of cases. Strikingly, training only on gene pathogenicity knowledge from 2011 leads to identical performance compared to training on current data. An accompanying analysis web portal has launched at AMELIE.stanford.edu.


neural information processing systems | 2016

Data Programming: Creating Large Training Sets, Quickly

Alexander Ratner; Christopher De Sa; Sen Wu; Daniel Selsam; Christopher Ré


international conference on machine learning | 2017

Learning the Structure of Generative Models without Labeled Data

Stephen H. Bach; Bryan D. He; Alexander Ratner; Christopher Ré


neural information processing systems | 2017

Learning to Compose Domain-Specific Transformations for Data Augmentation

Alexander Ratner; Henry R. Ehrenberg; Zeshan Hussain; Jared Dunnmon; Christopher Ré


arXiv: Computation and Language | 2017

SwellShark: A Generative Model for Biomedical Named Entity Recognition without Labeled Data.

Jason Alan Fries; Sen Wu; Alexander Ratner; Christopher Ré

Collaboration


Dive into Alexander Ratner's collaborations.

Top Co-Authors

Sen Wu

Stanford University
