Gregory A. Sanders
National Institute of Standards and Technology
Publications
Featured research published by Gregory A. Sanders.
Machine Translation | 2009
Mark A. Przybocki; Kay Peterson; Sebastien Bronsart; Gregory A. Sanders
This paper discusses the evaluation of automated metrics developed for the purpose of evaluating machine translation (MT) technology. A general discussion of the usefulness of automated metrics is offered. The NIST MetricsMATR evaluation of MT metrology is described, including its objectives, protocols, participants, and test data. The methodology employed to evaluate the submitted metrics is reviewed. A summary is provided for the general classes of evaluated metrics. Overall results of this evaluation are presented, primarily by means of correlation statistics, showing the degree of agreement between the automated metric scores and the scores of human judgments. Metrics are analyzed at the sentence, document, and system level with results conditioned by various properties of the test data. This paper concludes with some perspective on the improvements that should be incorporated into future evaluations of metrics for MT evaluation.
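The correlation analysis summarized above can be illustrated with a short sketch. It computes segment-level Pearson and Spearman correlations between automated metric scores and human judgments; the example scores are invented placeholders, not MetricsMATR data or tooling.

```python
# Minimal sketch: quantifying agreement between an automated MT metric and
# human judgments with correlation statistics. The scores below are invented
# placeholders, not data from the MetricsMATR evaluation.
from scipy.stats import pearsonr, spearmanr

# One automated metric score and one human adequacy judgment per segment.
metric_scores = [0.62, 0.48, 0.71, 0.35, 0.80, 0.55]
human_judgments = [5, 3, 6, 2, 7, 4]   # hypothetical 7-point adequacy scale

pearson_r, _ = pearsonr(metric_scores, human_judgments)
spearman_rho, _ = spearmanr(metric_scores, human_judgments)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

The same computation can be repeated at the document and system level by first averaging scores over the corresponding groups of segments.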
Performance Metrics for Intelligent Systems | 2009
Craig I. Schlenoff; Gregory A. Sanders; Brian A. Weiss; Frederick M. Proctor; Michelle Potts Steves; Ann M. Virts
The Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) program is a Defense Advanced Research Projects Agency (DARPA) advanced technology research and development program. The goal of the TRANSTAC program is to demonstrate capabilities to rapidly develop and field free-form, two-way translation systems that enable speakers of different languages to communicate with one another in real-world tactical situations without an interpreter. The National Institute of Standards and Technology (NIST), with support from MITRE and Appen Pty Ltd., has been funded to serve as the Independent Evaluation Team (IET) for the TRANSTAC program. The IET is responsible for analyzing the performance of the TRANSTAC systems by designing and executing multiple TRANSTAC evaluations and analyzing their results. To accomplish this, NIST has applied the SCORE (System, Component, and Operationally Relevant Evaluations) Framework. SCORE is a unified set of criteria and software tools for defining a performance evaluation approach for complex intelligent systems. It provides a comprehensive evaluation blueprint that assesses the technical performance of a system and its components by isolating variables, as well as capturing the end-user utility of the system in realistic use-case environments. This document describes the TRANSTAC program and explains how the SCORE framework was applied to assess the technical and utility performance of the TRANSTAC systems.
Machine Translation | 2012
Sherri L. Condon; Mark Arehart; Dan Parvaz; Gregory A. Sanders; Christy Doran; John S. Aberdeen
The Defense Advanced Research Projects Agency (DARPA) Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) program (http://1.usa.gov/transtac) faced many challenges in applying automated measures of translation quality to Iraqi Arabic–English speech translation dialogues. Features of speech data in general and of Iraqi Arabic data in particular undermine basic assumptions of automated measures that depend on matching system outputs to reference translations. These features are described along with the challenges they present for evaluating machine translation quality using automated metrics. We show that scores for translation into Iraqi Arabic exhibit higher correlations with human judgments when they are computed from normalized system outputs and reference translations. Orthographic normalization, lexical normalization, and operations involving light stemming resulted in higher correlations with human judgments.
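As a rough illustration of the normalization discussed above, the sketch below lowercases, strips punctuation, and removes a few common Arabic prefixes as a crude stand-in for light stemming. The prefix list and rules are assumptions made for illustration; they are not the normalization pipeline used in the study.

```python
# Rough sketch of normalizing system outputs and reference translations
# before scoring with an automated metric. The prefix list and rules are
# illustrative assumptions, not the TRANSTAC normalization pipeline.
import re
import string

ARABIC_PREFIXES = ("وال", "بال", "فال", "ال", "و", "ب", "ل")  # illustrative only

def light_stem(token: str) -> str:
    """Strip one common prefix if a plausible stem remains."""
    for prefix in ARABIC_PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 2:
            return token[len(prefix):]
    return token

def normalize(text: str) -> list[str]:
    """Lowercase, remove ASCII punctuation, collapse whitespace, lightly stem."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    return [light_stem(tok) for tok in text.split()]
```

Both the system output and the reference translation would be passed through the same normalization before computing a matching-based metric such as BLEU or TER.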
Performance Metrics for Intelligent Systems | 2009
Gregory A. Sanders; Sherri L. Condon
In this paper, we present one of the important metrics used to measure the quality of machine translation in the DARPA TRANSTAC program. The metric is stated as either the probability or the odds of a machine translation system successfully transferring the meaning of content words (nouns, verbs, adjectives, adverbs, plus the most important quantifiers and prepositions). We present the rationale for the metric, explain its implementation, and examine its performance. To characterize the performance of the metric, we compare it to utterance-level (or sentence-level) human judgments of the semantic adequacy of the translations, obtained from a panel of bilingual judges who compare the source-language input to the target-language (translated) output. Language pairs examined in this paper include English-to-Arabic, Arabic-to-English, English-to-Dari, and Dari-to-English.
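The probability and odds formulations are related in the usual way: if k of n scored content words are judged to have been transferred, the probability is p = k/n and the odds are p/(1-p). A minimal sketch with invented counts:

```python
# Minimal sketch of the probability and odds formulations of content-word
# transfer. The counts are invented; the actual metric was computed from
# bilingual judges' annotations of system output.
def transfer_probability(successes: int, total: int) -> float:
    """Probability that a content word's meaning was transferred."""
    return successes / total

def transfer_odds(successes: int, total: int) -> float:
    """The same quantity expressed as odds: p / (1 - p)."""
    p = transfer_probability(successes, total)
    return p / (1.0 - p)

k, n = 430, 500   # e.g., 430 of 500 scored content words judged transferred
print(f"probability = {transfer_probability(k, n):.3f}")   # 0.860
print(f"odds        = {transfer_odds(k, n):.3f}")          # 6.143
```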
International Journal of Speech Technology (ISSN 1381-2416) | 2004
Gregory A. Sanders; Audrey N. Le
The DARPA Communicator program explored ways to construct better spoken-dialogue systems, with which users interact via speech alone to perform relatively complex tasks such as travel planning. During 2000 and 2001, two large data sets were collected from sessions in which paid users did travel planning using the Communicator systems built by eight research groups. The research groups improved their systems intensively during the ten months between the two data collections. In this paper, we analyze these data sets to estimate the effects of speech recognition accuracy, as measured by Word Error Rate (WER), on other metrics. The effects that we found were linear. We found a correlation between WER and Task Completion, and that correlation unexpectedly remained more or less linear even for high values of WER. The picture for User Satisfaction metrics is more complex: we found little effect of WER on User Satisfaction for WER below about 35 to 40% in the 2001 data. The effect of WER on Task Completion was smaller in 2001 than in 2000, and we believe this difference is due to improved strategies for accomplishing tasks despite speech recognition errors, an important accomplishment of the research groups who built the Communicator implementations. We show that additional factors must account for much of the variability in task success, and we present multivariate linear regression models for task success on the 2001 data. We also discuss apparent gaps in the coverage of our metrics for spoken dialogue systems.
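The multivariate regression mentioned above can be sketched with ordinary least squares. The predictors and numbers below are placeholders chosen for illustration, not values from the Communicator session data.

```python
# Minimal sketch: ordinary least squares regression of task success on WER
# and another session-level predictor. All numbers are placeholders, not
# values from the 2000/2001 Communicator data collections.
import numpy as np

# Columns: intercept, WER (%), session duration (s); predictor choice is illustrative.
X = np.array([
    [1.0, 12.0, 180.0],
    [1.0, 25.0, 240.0],
    [1.0, 38.0, 300.0],
    [1.0, 51.0, 420.0],
    [1.0,  8.0, 150.0],
])
# Task Completion coded 1 (completed) / 0 (not completed).
y = np.array([1.0, 1.0, 0.0, 0.0, 1.0])

coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print("intercept, WER coefficient, duration coefficient:", coef)
```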
Conference of the International Speech Communication Association | 2001
Marilyn A. Walker; John S. Aberdeen; Julie E. Boland; Elizabeth Owen Bratt; John S. Garofolo; Lynette Hirschman; Audrey N. Le; Sungbok Lee; Shrikanth Narayanan; Kishore Papineni; Bryan L. Pellom; Joseph Polifroni; Alexandros Potamianos; P. Prabhu; Alexander I. Rudnicky; Gregory A. Sanders; Stephanie Seneff; David Stallard; Steve Whittaker
Conference of the International Speech Communication Association | 2002
Marilyn A. Walker; Alexander I. Rudnicky; Rashmi Prasad; John S. Aberdeen; Elizabeth Owen Bratt; John S. Garofolo; Helen Wright Hastie; Audrey N. Le; Bryan L. Pellom; Alexandros Potamianos; Rebecca J. Passonneau; Salim Roukos; Gregory A. Sanders; Stephanie Seneff; David Stallard
Conference of the International Speech Communication Association | 2002
Marilyn A. Walker; Alexander I. Rudnicky; John S. Aberdeen; Elizabeth Owen Bratt; John S. Garofolo; Helen Wright Hastie; Audrey N. Le; Bryan L. Pellom; Alexandros Potamianos; Rebecca J. Passonneau; Rashmi Prasad; Salim Roukos; Gregory A. Sanders; Stephanie Seneff; David Stallard
Language Resources and Evaluation | 2006
Mark A. Przybocki; Gregory A. Sanders; Audrey N. Le
Language Resources and Evaluation | 2008
Gregory A. Sanders; Sebastien Bronsart; Sherri L. Condon; Craig I. Schlenoff