
Publications

Featured research published by David Sontag.


Conference on Information and Knowledge Management (CIKM) | 2011

Personalizing web search results by reading level

Kevyn Collins-Thompson; Paul N. Bennett; Ryen W. White; Sebastian de la Chica; David Sontag

Traditionally, search engines have ignored the reading difficulty of documents and the reading proficiency of users in computing a document ranking. This is one reason why Web search engines do a poor job of serving an important segment of the population: children. While there are many important problems in interface design, content filtering, and results presentation related to addressing children's search needs, perhaps the most fundamental challenge is simply that of providing relevant results at the right level of reading difficulty. At the opposite end of the proficiency spectrum, it may also be valuable for technical users to find more advanced material or to filter out material at lower levels of difficulty, such as tutorials and introductory texts. We show how reading level can provide a valuable new relevance signal for both general and personalized Web search. We describe models and algorithms to address the three key problems in improving relevance for search using reading difficulty: estimating user proficiency, estimating result difficulty, and re-ranking based on the difference between user and result reading level profiles. We evaluate our methods on a large volume of Web query traffic and provide a large-scale log analysis that highlights the importance of finding results at an appropriate reading level for the user.
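
As a toy illustration of the re-ranking step, the sketch below penalizes the gap between a result's estimated reading level and the user's proficiency profile. The scoring function and the weight alpha are hypothetical, not the paper's actual model.

```python
# Hypothetical re-ranking by reading-level match; alpha and the
# linear penalty are illustrative choices, not the paper's model.

def rerank(results, user_level, alpha=0.2):
    """Re-rank results by combining the base relevance score with a
    penalty for the gap between each result's estimated reading level
    and the user's proficiency (both on a common scale, e.g. grades).

    results: list of (doc_id, base_score, doc_level) tuples.
    """
    def adjusted(item):
        doc_id, base_score, doc_level = item
        return base_score - alpha * abs(doc_level - user_level)
    return sorted(results, key=adjusted, reverse=True)

# Example: a proficient user (grade 11) vs. results at mixed levels.
results = [("tutorial", 0.9, 4), ("survey", 0.85, 11), ("intro", 0.8, 6)]
print(rerank(results, user_level=11))
```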


Web Search and Data Mining (WSDM) | 2012

Probabilistic models for personalizing web search

David Sontag; Kevyn Collins-Thompson; Paul N. Bennett; Ryen W. White; Susan T. Dumais; Bodo Billerbeck

We present a new approach for personalizing Web search results to a specific user. Ranking functions for Web search engines are typically trained by machine learning algorithms using either direct human relevance judgments or indirect judgments obtained from click-through data from millions of users. The rankings are thus optimized to this generic population of users, not to any specific user. We propose a generative model of relevance which can be used to infer the relevance of a document to a specific user for a search query. The user-specific parameters of this generative model constitute a compact user profile. We show how to learn these profiles from a user's long-term search history. Our algorithm for computing the personalized ranking is simple and has little computational overhead. We evaluate our personalization approach using historical search data from thousands of users of a major Web search engine. Our findings demonstrate gains in retrieval performance for queries with high ambiguity, with particularly large improvements for acronym queries.
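
A minimal sketch of the personalization idea, under the assumption that the user profile can be summarized as per-topic affinities added to a generic ranker's score; the paper's actual generative model is more detailed.

```python
import numpy as np

# Illustrative only: a compact "user profile" as a vector of per-topic
# affinities combined with a generic ranker's score. Values are random
# stand-ins for quantities the model would learn from search history.

rng = np.random.default_rng(0)
n_topics = 5

user_profile = rng.normal(size=n_topics)                # from long-term history
doc_topics = rng.dirichlet(np.ones(n_topics), size=3)   # topic mix per document
generic_scores = np.array([1.2, 1.0, 0.8])              # population-trained ranker

# Personalized score = generic relevance + user-specific adjustment.
personal_scores = generic_scores + doc_topics @ user_profile
print(np.argsort(-personal_scores))  # personalized ranking of the 3 docs
```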


Knowledge Discovery and Data Mining (KDD) | 2014

Unsupervised learning of disease progression models

Xiang Wang; David Sontag; Fei Wang

Chronic diseases, such as Alzheimer's Disease, Diabetes, and Chronic Obstructive Pulmonary Disease, usually progress slowly over a long period of time, causing increasing burden to the patients, their families, and the healthcare system. A better understanding of their progression is instrumental in early diagnosis and personalized care. Modeling disease progression based on real-world evidence is a very challenging task due to the incompleteness and irregularity of the observations, as well as the heterogeneity of the patient conditions. In this paper, we propose a probabilistic disease progression model that addresses these challenges. As compared to existing disease progression models, the advantage of our model is three-fold: 1) it learns a continuous-time progression model from discrete-time observations with non-equal intervals; 2) it learns the full progression trajectory from a set of incomplete records that only cover short segments of the progression; 3) it learns a compact set of medical concepts as the bridge between the hidden progression process and the observed medical evidence, which are usually extremely sparse and noisy. We demonstrate the capabilities of our model by applying it to a real-world COPD patient cohort and deriving some interesting clinical insights.
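
The continuous-time ingredient can be illustrated with a standard continuous-time Markov chain: given a rate matrix Q, the transition probabilities over an arbitrary gap t are expm(Q*t), which is what lets irregularly spaced visits be scored directly. The rate values below are invented for illustration.

```python
import numpy as np
from scipy.linalg import expm

# Continuous-time Markov chain over three illustrative disease stages.
# Rows of the rate matrix Q sum to zero; off-diagonals are transition
# rates. All numbers are made up for this sketch.

Q = np.array([[-0.10, 0.08, 0.02],   # mild -> {moderate, severe}
              [ 0.00, -0.05, 0.05],  # moderate -> severe
              [ 0.00, 0.00, 0.00]])  # severe is absorbing

# Transition probabilities adapt to whatever gap separates two visits:
for t in (0.5, 2.0, 10.0):           # gaps between visits, in years
    P = expm(Q * t)                  # P(t) = matrix exponential of Q*t
    print(f"t={t}: P(mild -> severe) = {P[0, 2]:.3f}")
```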


Scientific Reports | 2017

Recurrent Neural Networks for Multivariate Time Series with Missing Values

Zhengping Che; Sanjay Purushotham; Kyunghyun Cho; David Sontag; Yan Liu

Multivariate time series data in practical applications, such as health care, geoscience, and biology, are characterized by a variety of missing values. In time series prediction and other related tasks, it has been noted that missing values and their missing patterns are often correlated with the target labels, a.k.a., informative missingness. There is very limited work on exploiting the missing patterns for effective imputation and improving prediction performance. In this paper, we develop novel deep learning models, namely GRU-D, as one of the early attempts. GRU-D is based on Gated Recurrent Unit (GRU), a state-of-the-art recurrent neural network. It takes two representations of missing patterns, i.e., masking and time interval, and effectively incorporates them into a deep model architecture so that it not only captures the long-term temporal dependencies in time series, but also utilizes the missing patterns to achieve better prediction results. Experiments on time series classification tasks on real-world clinical datasets (MIMIC-III, PhysioNet) and synthetic datasets demonstrate that our models achieve state-of-the-art performance and provide useful insights for better understanding and utilization of missing values in time series analysis.
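
The masking and time-interval inputs drive GRU-D's trainable input decay. Below is a minimal numpy sketch of that decay for a single feature, with placeholder weights; in the model they are learned jointly with the GRU.

```python
import numpy as np

# Minimal sketch of GRU-D's trainable input decay for one feature.
# The decay parameters (w, b here) would be learned jointly with the
# GRU; the values below are placeholders.

def decay_impute(x, m, delta, x_mean, w=0.5, b=0.0):
    """x: current value if observed (m=1), else the last observed value;
    delta: time since this feature was last observed;
    x_mean: empirical mean of the feature over the training data."""
    gamma = np.exp(-max(0.0, w * delta + b))     # decay factor in (0, 1]
    x_hat = gamma * x + (1.0 - gamma) * x_mean   # fade stale values to the mean
    return m * x + (1 - m) * x_hat               # keep real observations as-is

# A heart-rate reading that is missing now, last seen 6 hours ago:
print(decay_impute(x=110.0, m=0, delta=6.0, x_mean=80.0))
```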


European Conference on Computer Vision (ECCV) | 2014

Instance Segmentation of Indoor Scenes Using a Coverage Loss

Nathan Silberman; David Sontag; Rob Fergus

A major limitation of existing models for semantic segmentation is the inability to identify individual instances of the same class: when labeling pixels with only semantic classes, a set of pixels with the same label could represent a single object or ten. In this work, we introduce a model to perform both semantic and instance segmentation simultaneously. We introduce a new higher-order loss function that directly minimizes the coverage metric and evaluate a variety of region features, including those from a convolutional network. We apply our model to the NYU Depth V2 dataset, obtaining state-of-the-art results.
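
The coverage metric the loss targets can be sketched as follows: each ground-truth instance is matched to its best-overlapping predicted region, weighted by instance size. This is a simplified reading; the paper's formulation differs in its details.

```python
import numpy as np

# Simplified (weighted) coverage metric over per-pixel instance labels.

def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def coverage(gt, pred):
    """gt, pred: integer instance-label maps of the same shape."""
    total, score = 0, 0.0
    for g in np.unique(gt):
        g_mask = gt == g
        best = max(iou(g_mask, pred == p) for p in np.unique(pred))
        score += g_mask.sum() * best     # weight by instance size
        total += g_mask.sum()
    return score / total

gt = np.array([[1, 1, 2], [1, 1, 2]])
pred = np.array([[1, 1, 1], [1, 1, 2]])
print(f"coverage = {coverage(gt, pred):.2f}")
```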


Journal of the American Medical Informatics Association | 2016

Electronic medical record phenotyping using the anchor and learn framework

Yoni Halpern; Steven Horng; Youngduck Choi; David Sontag

BACKGROUND Electronic medical records (EMRs) hold a tremendous amount of information about patients that is relevant to determining the optimal approach to patient care. As medicine becomes increasingly precise, a patient's electronic medical record phenotype will play an important role in triggering clinical decision support systems that can deliver personalized recommendations in real time. Learning with anchors presents a method of efficiently learning statistically driven phenotypes with minimal manual intervention. MATERIALS AND METHODS We developed a phenotype library that uses both structured and unstructured data from the EMR to represent patients for real-time clinical decision support. Eight of the phenotypes were evaluated using retrospective EMR data on emergency department patients using a set of prospectively gathered gold standard labels. RESULTS We built a phenotype library with 42 publicly available phenotype definitions. Using information from triage time, the phenotype classifiers have an area under the ROC curve (AUC) of infection 0.89, cancer 0.88, immunosuppressed 0.85, septic shock 0.93, nursing home 0.87, anticoagulated 0.83, cardiac etiology 0.89, and pneumonia 0.90. Using information available at the time of disposition from the emergency department, the AUC values are infection 0.91, cancer 0.95, immunosuppressed 0.90, septic shock 0.97, nursing home 0.91, anticoagulated 0.94, cardiac etiology 0.92, and pneumonia 0.97. DISCUSSION The resulting phenotypes are interpretable and fast to build, and perform comparably to statistically learned phenotypes developed with 5000 manually labeled patients. CONCLUSION Learning with anchors is an attractive option for building a large public repository of phenotype definitions that can be used for a range of health IT applications, including real-time decision support.
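
A minimal sketch of the anchor idea on synthetic data, using Elkan and Noto-style calibration: a highly specific anchor feature stands in for the unobserved phenotype label, a classifier learns to predict the anchor from the other EMR features, and scores are rescaled by the anchor's estimated recall. This is an illustration of the technique, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for EMR data: the true phenotype y is hidden, and
# the "anchor" fires only for a specific but incomplete subset of cases.

rng = np.random.default_rng(0)
n, d = 2000, 20
X = rng.normal(size=(n, d))                       # non-anchor EMR features
y = (X[:, 0] + X[:, 1] + rng.normal(size=n)) > 0  # true (hidden) phenotype
anchor = y & (rng.random(n) < 0.3)                # specific but incomplete

# Learn to predict the anchor from the other features, then calibrate
# by the mean score among anchored cases (an estimate of P(anchor|y=1)).
clf = LogisticRegression().fit(X, anchor)
c = clf.predict_proba(X[anchor])[:, 1].mean()
phenotype_prob = np.minimum(clf.predict_proba(X)[:, 1] / c, 1.0)
print(phenotype_prob[:5].round(2))
```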


JAMA Cardiology | 2016

Comparison of Approaches for Heart Failure Case Identification From Electronic Health Record Data.

Saul Blecker; Stuart D. Katz; Leora I. Horwitz; Gilad J. Kuperman; Hannah Park; Alex Gold; David Sontag

Importance Accurate, real-time case identification is needed to target interventions to improve quality and outcomes for hospitalized patients with heart failure. Problem lists may be useful for case identification but are often inaccurate or incomplete. Machine-learning approaches may improve accuracy of identification but can be limited by complexity of implementation. Objective To develop algorithms that use readily available clinical data to identify patients with heart failure while in the hospital. Design, Setting, and Participants We performed a retrospective study of hospitalizations at an academic medical center. Hospitalizations for patients 18 years or older who were admitted after January 1, 2013, and discharged before February 28, 2015, were included. From a random 75% sample of hospitalizations, we developed 5 algorithms for heart failure identification using electronic health record data: (1) heart failure on problem list; (2) presence of at least 1 of 3 characteristics: heart failure on problem list, inpatient loop diuretic, or brain natriuretic peptide level of 500 pg/mL or higher; (3) logistic regression of 30 clinically relevant structured data elements; (4) machine-learning approach using unstructured notes; and (5) machine-learning approach using structured and unstructured data. Main Outcomes and Measures Heart failure diagnosis based on discharge diagnosis and physician review of sampled medical records. Results A total of 47,119 hospitalizations were included in this study (mean [SD] age, 60.9 [18.15] years; 23,952 female [50.8%], 5258 black/African American [11.2%], and 3667 Hispanic/Latino [7.8%] patients). Of these hospitalizations, 6549 (13.9%) had a discharge diagnosis of heart failure. Inclusion of heart failure on the problem list (algorithm 1) had a sensitivity of 0.40 and a positive predictive value (PPV) of 0.96 for heart failure identification. Algorithm 2 improved sensitivity to 0.77 at the expense of a PPV of 0.64. Algorithms 3, 4, and 5 had areas under the receiver operating characteristic curves of 0.953, 0.969, and 0.974, respectively. With a PPV of 0.9, these algorithms had associated sensitivities of 0.68, 0.77, and 0.83, respectively. Conclusions and Relevance The problem list is insufficient for real-time identification of hospitalized patients with heart failure. The high predictive accuracy of machine learning using free text demonstrates that support of such analytics in future electronic health record systems can improve cohort identification.
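
Algorithm 2 is simple enough to state as code; the field names on the encounter record below are hypothetical.

```python
# Algorithm 2 from the paper: flag heart failure when any of three
# readily available signals is present. Field names are hypothetical.

def flag_heart_failure(encounter):
    return (
        encounter.get("hf_on_problem_list", False)
        or encounter.get("inpatient_loop_diuretic", False)
        or encounter.get("bnp_pg_ml", 0) >= 500
    )

print(flag_heart_failure({"bnp_pg_ml": 650}))  # True
```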


Conference on Emerging Networking Experiments and Technologies (CoNEXT) | 2009

Scaling all-pairs overlay routing

David Sontag; Yang Zhang; Amar Phanishayee; David G. Andersen; David R. Karger

This paper presents and experimentally evaluates a new algorithm for efficient one-hop link-state routing in full-mesh networks. Prior techniques for this setting scale poorly, as each node incurs quadratic (n^2) communication overhead to broadcast its link state to all other nodes. In contrast, in our algorithm each node exchanges routing state with only a small subset of overlay nodes determined by using a quorum system. Using a two round protocol, each node can find an optimal one-hop path to any other node using only n^1.5 per-node communication. Our algorithm can also be used to find the optimal shortest path of arbitrary length using only n^1.5 log n per-node communication. The algorithm is designed to be resilient to both node and link failures. We apply this algorithm to a Resilient Overlay Network (RON) system, and evaluate the results using a large-scale, globally distributed set of Internet hosts. The reduced communication overhead from using our improved full-mesh algorithm allows the creation of all-pairs routing overlays that scale to hundreds of nodes, without reducing the system's ability to rapidly find optimal routes.
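
One standard quorum construction consistent with the stated bound is a grid quorum, sketched below as an assumption rather than the paper's exact scheme: with n nodes placed in a sqrt(n) x sqrt(n) grid, each node's quorum is its row plus its column, so any two quorums intersect and can relay link state between any pair.

```python
import math

# Grid quorum sketch behind the n^1.5 intuition: each quorum has
# 2*sqrt(n) - 1 members, and any two quorums share at least one node.

def quorum(i, n):
    k = math.isqrt(n)                 # assumes n is a perfect square
    row, col = divmod(i, k)
    row_nodes = {row * k + c for c in range(k)}
    col_nodes = {r * k + col for r in range(k)}
    return row_nodes | col_nodes

n = 16
q_a, q_b = quorum(1, n), quorum(14, n)
print(sorted(q_a & q_b))              # non-empty: the rendezvous nodes
```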


PLOS ONE | 2017

Creating an automated trigger for sepsis clinical decision support at emergency department triage using machine learning

Steven Horng; David Sontag; Yoni Halpern; Yacine Jernite; Nathan I. Shapiro; Larry A. Nathanson

Objective To demonstrate the incremental benefit of using free text data in addition to vital sign and demographic data to identify patients with suspected infection in the emergency department. Methods This was a retrospective, observational cohort study performed at a tertiary academic teaching hospital. All consecutive ED patient visits between 12/17/08 and 2/17/13 were included. No patients were excluded. The primary outcome measure was infection diagnosed in the emergency department, defined as a patient having an infection-related ED ICD-9-CM discharge diagnosis. Patients were randomly allocated to train (64%), validate (20%), and test (16%) data sets. After preprocessing the free text using bigram and negation detection, we built four models to predict infection, incrementally adding vital signs, chief complaint, and free text nursing assessment. We used two different methods to represent free text: a bag-of-words model and a topic model. We then used a support vector machine to build the prediction model. We calculated the area under the receiver operating characteristic curve to compare the discriminatory power of each model. Results A total of 230,936 patient visits were included in the study. Approximately 14% of patients had the primary outcome of diagnosed infection. The area under the ROC curve (AUC) for the vitals model, which used only vital signs and demographic data, was 0.67 for the training data set, 0.67 for the validation data set, and 0.67 (95% CI 0.65–0.69) for the test data set. The AUC for the chief complaint model, which also included demographic and vital sign data, was 0.84 for the training data set, 0.83 for the validation data set, and 0.83 (95% CI 0.81–0.84) for the test data set. The best-performing methods made use of all of the free text. In particular, the AUC for the bag-of-words model was 0.89 for the training data set, 0.86 for the validation data set, and 0.86 (95% CI 0.85–0.87) for the test data set. The AUC for the topic model was 0.86 for the training data set, 0.86 for the validation data set, and 0.85 (95% CI 0.84–0.86) for the test data set. Conclusion Compared to previous work that only used structured data such as vital signs and demographic information, utilizing free text drastically improves the discriminatory ability (increase in AUC from 0.67 to 0.86) of identifying infection.
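
The bag-of-words variant can be approximated with a standard sklearn pipeline. This sketch omits the paper's negation detection and uses toy notes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy free-text infection classifier: unigram/bigram bag-of-words
# features fed to a linear SVM. Notes and labels are invented.

notes = ["pt febrile, productive cough, r/o pneumonia",
         "left ankle pain after fall, no fever",
         "dysuria and flank pain, possible uti",
         "chest pain, exertional, no infectious sx"]
infected = [1, 0, 1, 0]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # unigrams + bigrams
    LinearSVC(),                           # linear SVM classifier
)
model.fit(notes, infected)
print(model.predict(["fever and cough x3 days"]))
```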


Pacific Symposium on Biocomputing | 2006

Probabilistic modeling of systematic errors in two-hybrid experiments.

David Sontag; Rohit Singh; Bonnie Berger

UNLABELLED We describe a novel probabilistic approach to estimating errors in two-hybrid (2H) experiments. Such experiments are frequently used to elucidate protein-protein interaction networks in a high-throughput fashion; however, a significant challenge with these is their relatively high error rate, specifically, a high false-positive rate. We describe a comprehensive error model for 2H data, accounting for both random and systematic errors. The latter arise from limitations of the 2H experimental protocol: in theory, the reporting mechanism of a 2H experiment should be activated if and only if the two proteins being tested truly interact; in practice, even in the absence of a true interaction, it may be activated by some proteins - either by themselves or through promiscuous interaction with other proteins. We describe a probabilistic relational model that explicitly models the above phenomenon and use Markov Chain Monte Carlo (MCMC) algorithms to compute both the probability of an observed 2H interaction being true as well as the probability of individual proteins being self-activating/promiscuous. This is the first approach that explicitly models systematic errors in protein-protein interaction data; in contrast, previous work on this topic has modeled errors as being independent and random. By explicitly modeling the sources of noise in 2H systems, we find that we are better able to make use of the available experimental data. In comparison with Bader et al.'s method for estimating confidence in 2H predicted interactions, the proposed method performed 5-10% better overall, and in particular regimes improved prediction accuracy by as much as 76%. SUPPLEMENTARY INFORMATION http://theory.csail.mit.edu/probmod2H
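
A toy version of the error model's core idea, treating an observed positive as a noisy-OR of a true edge and promiscuity of either protein. All probabilities below are illustrative; the paper infers these quantities with MCMC over the full interaction network.

```python
# Toy noisy-OR error model for a single observed 2H positive.

p_true = 0.05      # prior that a random pair truly interacts
p_promisc = 0.10   # prior that a given protein is promiscuous
p_detect = 0.80    # chance a true interaction activates the reporter
p_spur = 0.50      # chance a promiscuous protein activates it anyway

def p_positive(edge, prom_a, prom_b):
    """P(reporter fires | edge, promiscuity indicators)."""
    return 1 - (1 - edge * p_detect) * (1 - prom_a * p_spur) * (1 - prom_b * p_spur)

# Posterior P(true edge | observed positive), summing out promiscuity.
num = den = 0.0
for e in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            prior = ((p_true if e else 1 - p_true)
                     * (p_promisc if a else 1 - p_promisc)
                     * (p_promisc if b else 1 - p_promisc))
            den += prior * p_positive(e, a, b)
            num += prior * p_positive(e, a, b) * e
print(f"P(true interaction | 2H positive) = {num / den:.2f}")
```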

Collaboration


Dive into David Sontag's collaborations.

Top Co-Authors

Tommi S. Jaakkola (Massachusetts Institute of Technology)
Steven Horng (Beth Israel Deaconess Medical Center)
Amir Globerson (Hebrew University of Jerusalem)
Uri Shalit (Hebrew University of Jerusalem)
Fredrik D. Johansson (Chalmers University of Technology)