Richard Wicentowski
Swarthmore College
Publications
Featured research published by Richard Wicentowski.
International Conference on Human Language Technology Research | 2001
David Yarowsky; Grace Ngai; Richard Wicentowski
This paper describes a system and set of algorithms for automatically inducing stand-alone monolingual part-of-speech taggers, base noun-phrase bracketers, named-entity taggers and morphological analyzers for an arbitrary foreign language. Case studies include French, Chinese, Czech and Spanish. Existing text analysis tools for English are applied to bilingual text corpora and their output projected onto the second language via statistically derived word alignments. Simple direct annotation projection is quite noisy, however, even with optimal alignments. Thus this paper presents noise-robust tagger, bracketer and lemmatizer training procedures capable of accurate system bootstrapping from noisy and incomplete initial projections. Performance of the induced stand-alone part-of-speech tagger applied to French achieves 96% core part-of-speech (POS) tag accuracy, and the corresponding induced noun-phrase bracketer exceeds 91% F-measure. The induced morphological analyzer achieves over 99% lemmatization accuracy on the complete French verbal system. This achievement is particularly noteworthy in that it required absolutely no hand-annotated training data in the given language, and virtually no language-specific knowledge or resources beyond raw text. Performance also significantly exceeds that obtained by direct annotation projection.
Meeting of the Association for Computational Linguistics | 2000
David Yarowsky; Richard Wicentowski
This paper presents a corpus-based algorithm capable of inducing inflectional morphological analyses of both regular and highly irregular forms (such as brought→bring) from distributional patterns in large monolingual text with no direct supervision. The algorithm combines four original alignment models based on relative corpus frequency, contextual similarity, weighted string similarity and incrementally retrained inflectional transduction probabilities. Starting with no paired examples for training and no prior seeding of legal morphological transformations, accuracy of the induced analyses of 3888 past-tense test cases in English exceeds 99.2% for the set, with currently over 80% accuracy on the most highly irregular forms and 99.7% accuracy on forms exhibiting non-concatenative suffixation.
Meeting of the Association for Computational Linguistics | 2007
Phil Katz; Matthew Singleton; Richard Wicentowski
In this paper, we describe our two SemEval-2007 entries. Our first entry, for Task 5: Multilingual Chinese-English Lexical Sample Task, is a supervised system that decides the most appropriate English translation of a Chinese target word. This system uses a combination of Naive Bayes, nearest neighbor cosine, decision lists, and latent semantic analysis. Our second entry, for Task 14: Affective Text, is a supervised system that annotates headlines using a predefined list of emotions. This system uses synonym expansion and matches lemmatized unigrams in the test headlines against a corpus of hand-annotated headlines.
Meeting of the Association for Computational Linguistics | 2004
Richard Wicentowski
This paper presents the WordFrame model, a noise-robust supervised algorithm capable of inducing morphological analyses for languages which exhibit prefixation, suffixation, and internal vowel shifts. In combination with a naive approach to suffix-based morphology, this algorithm is shown to be remarkably effective across a broad range of languages, including those exhibiting infixation and partial reduplication. Results are presented for over 30 languages with a median accuracy of 97.5% on test sets including both regular and irregular verbal inflections. Because the proposed method trains extremely well under conditions of high noise, it is an ideal candidate for use in co-training with unsupervised algorithms.
Journal of the American Medical Informatics Association | 2008
Richard Wicentowski; Matthew R. Sydes
As part of the 2006 i2b2 NLP Shared Task, we explored two methods for determining the smoking status of patients from their hospital discharge summaries when explicit smoking terms were present and when those same terms were removed. We developed a simple keyword-based classifier to determine smoking status from de-identified hospital discharge summaries. We then developed a Naïve Bayes classifier to determine smoking status from the same records after all smoking-related words had been manually removed (the smoke-blind dataset). The performance of the Naïve Bayes classifier was compared with the performance of three human annotators on a subset of the same training dataset (n = 54) and against the evaluation dataset (n = 104 records). The rule-based classifier was able to accurately extract smoking status from hospital discharge summaries when they contained explicit smoking words. On the smoke-blind dataset, where explicit smoking cues are not available, two Naïve Bayes systems performed less well than the rule-based classifier, but similarly to three expert human annotators.
Technical Symposium on Computer Science Education | 2014
Tia Newhall; Lisa Meeden; Andrew Danner; Ameet Soni; Frances Ruiz; Richard Wicentowski
In line with institutions across the United States, the Computer Science Department at Swarthmore College has faced the challenge of maintaining a demographic composition of students that matches the student body as a whole. To combat this trend, our department has made a concerted effort to revamp our introductory course sequence to both attract and retain more women and minority students. The focus of this paper is the changes instituted in our Introduction to Computer Science course (i.e., CS1) intended for both majors and non-majors. In addition to changing the content of the course, we introduced a new student mentoring program that is managed by a full-time coordinator and consists of undergraduate students who have recently completed the course. This paper describes these efforts in detail, including the extension of these changes to our CS2 course and the associated costs required to maintain these efforts. We measure the impact of these changes by tracking student enrollment and performance over 13 academic years. We show that, unlike national trends, enrollment from underrepresented groups has increased dramatically over this time period. Additionally, we show that the student mentoring program has increased both performance and retention of students, particularly from underrepresented groups, at statistically significant levels.
Meeting of the Association for Computational Linguistics | 2007
George Dahl; Anne-Marie Frassica; Richard Wicentowski
We present two systems that pick the ten most appropriate substitutes for a marked word in a test sentence. The first system scores candidates based on how frequently their local contexts match that of the marked word. The second system, an enhancement to the first, incorporates cosine similarity using unigram features. The core of both systems bypasses intermediate sense selection. Our results show that a knowledge-light, direct method for scoring potential replacements is viable.
North American Chapter of the Association for Computational Linguistics | 2015
Ruth Talbot; Chloe Acheampong; Richard Wicentowski
This paper describes a sentiment classification system designed for SemEval-2015, Task 10, Subtask B. The system employs a constrained, supervised text categorization approach. Firstly, since thorough preprocessing of tweet data was shown to be effective in previous SemEval sentiment classification tasks, various preprocessing steps were introduced to enhance the quality of lexical information. Secondly, a Naive Bayes classifier is used to detect tweet sentiment. The classifier is trained only on the training data provided by the task organizers. The system makes use of external human-generated lists of positive and negative words at several steps throughout classification. The system produced an overall F-score of 59.26 on the official test set.
Biomedical Informatics Insights | 2012
Richard Wicentowski; Matthew R. Sydes
An ensemble of supervised maximum entropy classifiers can accurately detect and identify sentiments expressed in suicide notes. Using lexical and syntactic features extracted from a training set of externally annotated suicide notes, we trained separate classifiers for each of fifteen pre-specified emotions. This formed part of the 2011 i2b2 NLP Shared Task, Track 2. The precision and recall of these classifiers were strongly related to the number of occurrences of each emotion in the training data. Evaluating on previously unseen test data, our best system achieved an F1 score of 0.534.
Language Resources and Evaluation | 2009
Eneko Agirre; Lluís Màrquez; Richard Wicentowski
SemEval-2007, the Fourth International Workshop on Semantic Evaluations (Agirre et al. 2007), took place on June 23-24, 2007, as a co-located event with the 45th Annual Meeting of the ACL. It was the fourth semantic evaluation exercise, continuing on from the series of successful Senseval workshops. SemEval-2007 took place over a period of about six months, including the evaluation exercise itself and the summary workshop. The exercise attracted considerable attention from the semantic processing community: 18 different evaluation tasks were organized, and more than 100 research teams and 123 systems participated in them. As a result, despite the huge effort carried out by task organizers and participant teams, time and material constraints made it virtually impossible to present thorough analyses of tasks, systems and results in the workshop proceedings. Therefore, in order to present the work and results of SemEval-2007, we assembled extended papers from the workshop as well as other contributors into this special issue of Language Resources and Evaluation, entitled