David Kauchak
Pomona College
Publications
Featured research published by David Kauchak.
International Conference on Machine Learning | 2005
Rasmus Elsborg Madsen; David Kauchak; Charles Elkan
Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, measured by perplexity. We also show using three standard document collections that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model.
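The burstiness that the DCM captures can be illustrated with a Pólya urn: every word drawn is returned to the urn with an extra copy, so a word that appears once becomes more likely to appear again. A minimal sketch (the function name and toy parameters are illustrative, not from the paper):

```python
import random

def polya_urn_sample(alphas, n_words, seed=0):
    """Draw one document from a Dirichlet compound multinomial via the
    Polya urn scheme. `alphas` are the Dirichlet pseudo-counts; each
    observed word adds a count of 1, reinforcing itself (burstiness)."""
    rng = random.Random(seed)
    counts = list(alphas)  # the urn starts with the pseudo-counts
    doc = []
    for _ in range(n_words):
        word = rng.choices(range(len(counts)), weights=counts)[0]
        doc.append(word)
        counts[word] += 1  # reinforcement: this word is now more likely
    return doc
```

With small pseudo-counts the sampled document is dominated by a few bursty words; as the pseudo-counts grow large, the process approaches an ordinary multinomial.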
Language and Technology Conference | 2006
David Kauchak; Regina Barzilay
This paper studies the impact of paraphrases on the accuracy of automatic evaluation. Given a reference sentence and a machine-generated sentence, we seek to find a paraphrase of the reference sentence that is closer in wording to the machine output than the original reference. We apply our paraphrasing method in the context of machine translation evaluation. Our experiments show that the use of a paraphrased synthetic reference refines the accuracy of automatic evaluation. We also found a strong connection between the quality of automatic paraphrases as judged by humans and their contribution to automatic evaluation.
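The selection step can be sketched as picking, among candidate paraphrases of the reference, the one closest in wording to the machine output. Word-level edit distance stands in here for the paper's own comparison method (an assumption for illustration, not the authors' exact algorithm):

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def closest_reference(machine_output, paraphrases):
    """Pick the candidate paraphrase closest in wording to the
    machine-generated sentence."""
    return min(paraphrases,
               key=lambda p: edit_distance(machine_output.split(), p.split()))
```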
Journal of Medical Internet Research | 2013
Gondy Leroy; James E. Endicott; David Kauchak; Obay Mouradi; Melissa Just
Background: Adequate health literacy is important for people to maintain good health and manage diseases and injuries. Educational text, either retrieved from the Internet or provided by a doctor’s office, is a popular method to communicate health-related information. Unfortunately, it is difficult to write text that is easy to understand, and existing approaches, mostly the application of readability formulas, have not convincingly been shown to reduce the difficulty of text.
Objective: To develop an evidence-based writer support tool to improve perceived and actual text difficulty. To this end, we are developing and testing algorithms that automatically identify difficult sections in text and provide appropriate, easier alternatives; algorithms that effectively reduce text difficulty will be included in the support tool. This work describes the user evaluation with an independent writer of an automated simplification algorithm using term familiarity.
Methods: Term familiarity indicates how easy words are for readers and is estimated using term frequencies in the Google Web Corpus. Unfamiliar words are algorithmically identified and tagged for potential replacement. Easier alternatives consisting of synonyms, hypernyms, definitions, and semantic types are extracted from WordNet, the Unified Medical Language System (UMLS), and Wiktionary and ranked for a writer to choose from to simplify the text. We conducted a controlled user study with a representative writer who used our simplification algorithm to simplify texts. We tested the impact with representative consumers. The key independent variable of our study is lexical simplification, and we measured its effect on both perceived and actual text difficulty. Participants were recruited from Amazon’s Mechanical Turk website. Perceived difficulty was measured with 1 metric, a 5-point Likert scale. Actual difficulty was measured with 3 metrics: 5 multiple-choice questions alongside each text to measure understanding, 7 multiple-choice questions without the text for learning, and 2 free recall questions for information retention.
Results: Ninety-nine participants completed the study. We found strong beneficial effects on both perceived and actual difficulty. After simplification, the text was perceived as simpler (P<.001), with simplified text scoring 2.3 and original text 3.2 on the 5-point Likert scale (score 1: easiest). It also led to better understanding of the text (P<.001), with 11% more correct answers for simplified text (63% correct) compared to the original (52% correct). There was more learning, with 18% more correct answers after reading simplified text compared to 9% more correct answers after reading the original text (P=.003). There was no significant effect on free recall.
Conclusions: Term familiarity is a valuable feature in simplifying text. Although the topic of the text influences the effect size, the results were convincing and consistent.
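The core tagging step, flagging unfamiliar words by corpus frequency, can be sketched as follows. A toy frequency table stands in for counts from the Google Web Corpus, and the function name and threshold are illustrative:

```python
def tag_unfamiliar(text, frequency, threshold=1000):
    """Mark each word whose corpus frequency falls below a familiarity
    threshold; flagged words are candidates for replacement with
    easier alternatives (synonyms, hypernyms, definitions)."""
    return [(word, frequency.get(word.lower(), 0) < threshold)
            for word in text.split()]
```

In the tool, each flagged word would then be paired with ranked alternatives drawn from WordNet, the UMLS, and Wiktionary for the writer to choose from.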
Meeting of the Association for Computational Linguistics | 2014
Colby Horn; Cathryn Manduca; David Kauchak
In this paper we introduce a new lexical simplification approach. We extract over 30K candidate lexical simplifications by identifying aligned words in a sentence-aligned corpus of English Wikipedia and Simple English Wikipedia. To apply these rules, we learn a feature-based ranker using SVMrank, trained on a set of labeled simplifications collected using Amazon’s Mechanical Turk. Using human simplifications for evaluation, we achieve a precision of 76% with changes in 86% of the examples.
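The ranking step can be illustrated with a single-feature stand-in: ordering candidate replacements by corpus frequency. This is an assumption for illustration; the paper's actual ranker is a multi-feature model trained with SVMrank:

```python
def rank_simplifications(candidates, frequency):
    """Order candidate replacements from most to least familiar, using
    corpus frequency as the lone ranking feature (a stand-in for the
    learned feature-based ranker)."""
    return sorted(candidates, key=lambda w: frequency.get(w, 0), reverse=True)
```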
Meeting of the Association for Computational Linguistics | 2005
David Kauchak; Francine Chen
In this paper we examine topic segmentation of narrative documents, which are characterized by long passages of text with few headings. We first present results suggesting that previous topic segmentation approaches are not appropriate for narrative text. We then present a feature-based method that combines features from diverse sources as well as learned features. Applied to narrative books and encyclopedia articles, our method shows results that are significantly better than previous segmentation approaches. An analysis of individual features is also provided and the benefit of generalization using outside resources is shown.
Journal of the American Medical Informatics Association | 2014
Gondy Leroy; David Kauchak
There is little evidence that readability formula outcomes relate to text understanding. The potential cause may lie in their strong reliance on word and sentence length. We evaluated word familiarity rather than word length as a stand-in for word difficulty. Word familiarity represents how well known a word is and is estimated using word frequency in a large text corpus, in this work the Google Web Corpus. We conducted a study with 239 people, who provided 50 evaluations for each of 275 words. Ours is the first study to focus on actual difficulty, measured with a multiple-choice task, in addition to perceived difficulty, measured with a Likert scale. Actual difficulty was correlated with word familiarity (r=0.219, p<0.001) but not with word length (r=-0.075, p=0.107). Perceived difficulty was correlated with both word familiarity (r=-0.397, p<0.001) and word length (r=0.254, p<0.001).
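The reported r values are ordinary Pearson correlations between a difficulty measure and a word property (familiarity or length). A plain implementation, for reference:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length
    sequences, as used to relate familiarity and length to difficulty."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```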
Hawaii International Conference on System Sciences | 2014
David Kauchak; Obay Mouradi; Christopher Pentoney; Gondy Leroy
Although providing understandable information is a critical component of healthcare, few tools exist to help clinicians identify difficult sections in text. We systematically examine sixteen features for predicting the difficulty of health texts using six different machine learning algorithms. Three are new features not previously examined: medical concept density; specificity, calculated using word-level depth in MeSH; and ambiguity, calculated using the number of UMLS Metathesaurus concepts associated with a word. We examine these features on a binary prediction task over 118,000 simple and difficult sentences from a sentence-aligned corpus. Using all features, random forests is the most accurate with 84% accuracy. Analysis of the six models and a complementary ablation study show that the specificity and ambiguity features are the strongest predictors (24% combined impact on accuracy). Notably, a training-size study showed that even with a 1% sample (1,062 sentences) an accuracy of 80% can be achieved.
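Two of the sentence-level features can be sketched in simplified form. Membership in a small hand-made term set stands in for a real UMLS/MeSH lookup, so the function and its inputs are illustrative only:

```python
def sentence_features(sentence, medical_terms):
    """Toy versions of two difficulty features for one sentence:
    medical concept density (fraction of tokens found in a medical
    term set, standing in for a UMLS lookup) and average word length."""
    tokens = sentence.lower().split()
    density = sum(t in medical_terms for t in tokens) / len(tokens)
    avg_len = sum(len(t) for t in tokens) / len(tokens)
    return {"concept_density": density, "avg_word_length": avg_len}
```

A classifier such as a random forest would then be trained on vectors of these features, one per sentence, with the simple/difficult label as the target.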
IT Professional | 2016
David Kauchak; Gondy Leroy
Limited health literacy is a barrier to understanding health information. Simplifying text can reduce this barrier and possibly address other known health disparities. Unfortunately, few tools exist to simplify text with a demonstrated impact on comprehension. By leveraging modern data sources integrated with natural language processing algorithms, the authors have developed a semi-automated text-simplification tool. They introduce their evidence-based development strategy for designing effective text-simplification software and summarize initial, promising results. They also present a new study examining existing readability formulas, which are the most commonly used tools for text simplification in healthcare. They compare syllable count, the proxy for word difficulty used by most readability formulas, with their new metric, term familiarity, and determine that syllable count measures how difficult words appear to be, but not their actual difficulty. In contrast, term familiarity can be used to measure actual difficulty.
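The syllable-count proxy that readability formulas rely on is typically approximated by counting vowel groups. A crude heuristic counter (illustrative only; real readability tools use more careful rules and exception lists):

```python
def count_syllables(word):
    """Estimate syllables by counting maximal runs of vowels, the kind
    of proxy for word difficulty that readability formulas use."""
    word = word.lower()
    groups = 0
    prev_vowel = False
    for ch in word:
        is_vowel = ch in "aeiouy"
        if is_vowel and not prev_vowel:
            groups += 1
        prev_vowel = is_vowel
    return max(groups, 1)  # every word has at least one syllable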
Hawaii International Conference on System Sciences | 2013
Obay Mouradi; Gondy Leroy; David Kauchak; James E. Endicott
Because patients customarily receive medical text that is difficult to understand, we are developing a simplification algorithm to support simpler writing by medical professionals. Our algorithm relies on term familiarity and automatically suggests alternative wordings from different sources. We conducted a user study (N=17) to evaluate its effectiveness at reducing perceived and actual difficulty. Perceived difficulty was measured using sentences and a Likert scale. Actual difficulty was measured using documents and multiple-choice and Cloze tests. We found a strong significant simplification effect for perceived difficulty (p=.002), but no effect for actual difficulty: only 6.2% improvement on the Cloze test. Evaluating participant characteristics showed that reading more newspapers or magazines correlated with lower multiple-choice (r=-.386, p=.016) and Cloze test (r=.340, p=.025) scores. STOFHLA scores, a health literacy measure, correlated with the Cloze test scores (r=.461, p=.002).
European Conference on Machine Learning | 2003
David Kauchak; Charles Elkan
In this paper we show how to learn rules to improve the performance of a machine translation system. Given a system consisting of two translation functions (one from language A to language B and one from B to A), training text is translated from A to B and back again to A. Using these two translations, differences in knowledge between the two translation functions are identified, and rules are learned to improve the functions. Context-independent rules are learned where the information suggests only a single possible translation for a word. When there are multiple alternate translations for a word, a likelihood ratio test is used to identify words that co-occur significantly with each case. These words are then used as context in context-dependent rules. Applied to the Pan American Health Organization corpus of 20,084 sentences, the learned rules improve the understandability of the translation produced by the SDL International engine on 78% of sentences, with high precision.
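The likelihood ratio test for significant co-occurrence is commonly computed as Dunning's log-likelihood statistic (G^2) over a 2x2 contingency table. A sketch, assuming that is the test intended (the paper does not spell out the exact form):

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 co-occurrence table:
    how strongly a context word is associated with one translation
    choice (row 1) versus the other (row 2)."""
    def ll(k, n, p):
        # binomial log-likelihood, guarded against log(0)
        p = min(max(p, 1e-12), 1 - 1e-12)
        return k * math.log(p) + (n - k) * math.log(1 - p)
    n1, n2 = k11 + k12, k21 + k22
    p1, p2 = k11 / n1, k21 / n2       # per-row rates
    p = (k11 + k21) / (n1 + n2)       # pooled rate under the null
    return 2 * (ll(k11, n1, p1) + ll(k21, n2, p2)
                - ll(k11, n1, p) - ll(k21, n2, p))
```

Context words whose statistic exceeds a significance threshold would be kept as conditions in context-dependent rules.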