Erica Cooper
Massachusetts Institute of Technology
Publications
Featured research published by Erica Cooper.
international conference on acoustics, speech, and signal processing | 2009
Dogan Can; Erica Cooper; Abhinav Sethy; Christopher M. White; Bhuvana Ramabhadran; Murat Saraclar
The spoken term detection (STD) task aims to return relevant segments from a spoken archive that contain the query terms, whether or not those terms are in the system vocabulary. This paper focuses on pronunciation modeling for Out-of-Vocabulary (OOV) terms, which frequently occur in STD queries. The STD system described in this paper indexes word-level and sub-word-level lattices or confusion networks produced by an LVCSR system using Weighted Finite State Transducers (WFSTs). We investigate the inclusion of n-best pronunciation variants for OOV terms (obtained from letter-to-sound rules) into the search and present results obtained by indexing confusion networks as well as lattices. Two observations are worth noting: phone indexes generated from sub-words represent OOVs well, and too many variants for an OOV term degrade performance if the pronunciations are not weighted.
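The weighted-variant search idea can be sketched in a few lines. This is a toy illustration, not the paper's WFST index: the confusion network, phone inventory, and variant weights below are all hypothetical.

```python
# Toy sketch: scoring weighted n-best pronunciation variants for an OOV query
# against a phone confusion network (a list of bins, each phone -> posterior).
from math import prod

confusion_network = [
    {"k": 0.7, "g": 0.3},
    {"uw": 0.6, "ow": 0.4},
    {"p": 0.8, "b": 0.2},
    {"er": 0.9, "ah": 0.1},
]

# n-best pronunciations from letter-to-sound rules, each with a weight.
variants = [
    (["k", "uw", "p", "er"], 0.6),
    (["k", "ow", "p", "er"], 0.3),
    (["g", "uw", "b", "ah"], 0.1),
]

def variant_score(phones, start, cn):
    """Posterior of one pronunciation aligned at a given start bin."""
    if start + len(phones) > len(cn):
        return 0.0
    return prod(cn[start + i].get(p, 0.0) for i, p in enumerate(phones))

def detect(cn, variants):
    """Best weighted detection score over all variants and start positions."""
    best = 0.0
    for phones, weight in variants:
        for start in range(len(cn)):
            best = max(best, weight * variant_score(phones, start, cn))
    return best

score = detect(confusion_network, variants)
```

Without the variant weights, a low-quality letter-to-sound variant can match spurious phone sequences as strongly as the best variant, which is one way too many unweighted variants degrade performance.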
Archive | 2012
Andrew Rosenberg; Erica Cooper; Rivka Levitan; Julia Hirschberg
We explore the ability to perform automatic prosodic analysis in one language using models trained on another. If we are successful, we should be able to identify prosodic elements in a language for which little or no prosodically labeled training data is available, using models trained on a language for which such training data exists. Given the laborious nature of manual prosodic annotation, such a process would vastly improve our ability to identify prosodic events in many languages and therefore to make use of such information in downstream processing tasks. The task we address here is the detection of intonational prominence, performing experiments using material from four languages: American English, Italian, French, and German. While we do find that cross-language prominence detection is possible, we also find significant language-dependent differences. While we hypothesized that language family might serve as a reliable predictor of cross-language prosodic event detection accuracy, in our experiments this did not prove to be the case. Based upon our results, we suggest some directions that may improve our cross-language approach.
international acm sigir conference on research and development in information retrieval | 2009
Dogan Can; Erica Cooper; Arnab Ghoshal; Martin Jansche; Sanjeev Khudanpur; Bhuvana Ramabhadran; Michael Riley; Murat Saraclar; Abhinav Sethy; Morgan Ulinski; Christopher M. White
Indexing and retrieval of speech content in various forms such as broadcast news, customer care data and on-line media has gained a lot of interest for a wide range of applications, from customer analytics to on-line media search. For most retrieval applications, the speech content is typically first converted to a lexical or phonetic representation using automatic speech recognition (ASR). The first step in searching through indexes built on these representations is the generation of pronunciations for named entities and foreign language query terms. This paper summarizes the results of the work conducted during the 2008 JHU Summer Workshop by the Multilingual Spoken Term Detection team, on mining the web for pronunciations and analyzing their impact on spoken term detection. We first present methods to use the vast amount of pronunciation information available on the Web, in the form of IPA and ad-hoc transcriptions. We describe techniques for extracting candidate pronunciations from Web pages and associating them with orthographic words, filtering out poorly extracted pronunciations, normalizing IPA pronunciations to better conform to a common transcription standard, and generating phonemic representations from ad-hoc transcriptions. We then present an analysis of the effectiveness of using these pronunciations to represent Out-Of-Vocabulary (OOV) query terms on the performance of a spoken term detection (STD) system. We provide comparisons of Web pronunciations against automated techniques for pronunciation generation as well as pronunciations generated by human experts. Our results cover a range of speech indexes based on lattices, confusion networks and one-best transcriptions at both the word and word-fragment levels.
international conference on acoustics, speech, and signal processing | 2014
Victor Soto; Erica Cooper; Lidia Mangu; Andrew Rosenberg; Julia Hirschberg
We introduce a two-stage cascaded scheme to rescore Confusion Networks (CNs) for Keyword Search in the context of Low-Resource Languages. In the first stage we rescore the CN to improve the error rate of the 1-best hypothesis using a large number of lexical, phonetic, false-alarm, and structural features. Using a rank-learning Support Vector Machine classifier, we obtain WER gains between 0.54% and 2.84% on Cantonese, Tagalog, Turkish, Pashto and Vietnamese. In the second stage we generate keyword hits from the rescored CN and use logistic regression to separate true hits from false alarms. We compare these to hits generated from the unrescored CN and obtain gains between 0.45% and 0.9% on the MTWV metric by using the same features plus acoustic and prosodic features on Tagalog, Turkish and Pashto.
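The MTWV metric reported here (Maximum Term Weighted Value) can be sketched as follows. This is a simplified illustration, not the paper's evaluation code: it assumes one non-target trial per second of audio and the standard NIST beta of 999.9, and the toy hit lists are hypothetical.

```python
# Sketch of Term Weighted Value: TWV = 1 - mean(Pmiss) - beta * mean(Pfa),
# averaged over keywords at a global detection threshold. MTWV is the best
# TWV over a sweep of thresholds.

def twv(hits, n_true, audio_seconds, threshold, beta=999.9):
    """hits: keyword -> list of (score, is_correct) detections."""
    pmiss, pfa = [], []
    for kw, kw_hits in hits.items():
        kept = [h for h in kw_hits if h[0] >= threshold]
        correct = sum(1 for _, ok in kept if ok)
        false_alarms = len(kept) - correct
        pmiss.append(1.0 - correct / n_true[kw])
        # Approximate non-target trials as one per second of audio.
        pfa.append(false_alarms / (audio_seconds - n_true[kw]))
    n = len(hits)
    return 1.0 - sum(pmiss) / n - beta * sum(pfa) / n

def mtwv(hits, n_true, audio_seconds, thresholds):
    """Maximum TWV over a grid of candidate thresholds."""
    return max(twv(hits, n_true, audio_seconds, t) for t in thresholds)
```

Because beta is so large, even a handful of extra false alarms can wipe out a gain in recall, which is why the second-stage false-alarm classifier matters for MTWV.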
international conference on acoustics, speech, and signal processing | 2013
Victor Soto; Erica Cooper; Andrew Rosenberg; Julia Hirschberg
We describe models of prosodic phrasing trained on multiple languages to identify boundaries in an unseen language. Our goal is to create models from High Resource languages, in which hand-annotated prosodic phrase boundaries are available, to use in identifying boundaries in a Low Resource language, with little or no training material. We train models on American English, Italian, Mandarin, and German and test on each of these languages. We find that, while pause is the most important feature for phrase boundary prediction in all languages examined, the role of pause in boundary identification varies by annotator and the relative importance of other features varies significantly by language. We also find that different acoustic correlates of prosodic boundaries characterize different languages. In some, the relative importance of features is silence > pitch > intensity > duration, while for other languages intensity is more important than pitch. These differences do not appear to be attributable to language family, since, e.g. English and German display different patterns.
meeting of the association for computational linguistics | 2016
Gideon Mendels; Erica Cooper; Julia Hirschberg
We describe a system to collect web data for Low Resource Languages, to augment language model training data for Automatic Speech Recognition (ASR) and keyword search by reducing Out-of-Vocabulary (OOV) rates, i.e. the proportion of words in the test set that did not appear in the ASR training set. We test this system on seven Low Resource Languages from the IARPA Babel Program: Paraguayan Guarani, Igbo, Amharic, Halh Mongolian, Javanese, Pashto, and Dholuo. The success of our system compared with other web collection systems is due to its targeted collection sources (blogs, Twitter, forums) and the inclusion of a separate language identification component in its pipeline, which filters the collected data before it is saved. Our results show a major reduction in OOV rates relative to those calculated from the training corpora alone, and major reductions in OOV rates calculated in terms of keywords in the training development set. We also describe differences among genres in this reduction, which vary by language but show a pronounced benefit from augmentation with Twitter data for most languages.
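The OOV-rate reduction being measured can be illustrated with a toy example; the tiny corpora below are placeholders, not Babel data.

```python
# Sketch: OOV rate = fraction of test tokens absent from the training
# vocabulary, before and after augmenting the vocabulary with web text.

def oov_rate(vocab, test_tokens):
    """Fraction of test tokens not covered by the vocabulary."""
    misses = sum(1 for tok in test_tokens if tok not in vocab)
    return misses / len(test_tokens)

train = "the market opened early today".split()
web = "the market closed late traders sold shares today".split()
test = "traders sold shares when the market closed".split()

base = oov_rate(set(train), test)                   # training vocab only
augmented = oov_rate(set(train) | set(web), test)   # after web augmentation
```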
conference of the international speech communication association | 2016
Erica Cooper; Alison Chang; Yocheved Levitan; Julia Hirschberg
We describe experiments in building HMM text-to-speech voices on professional broadcast news data from multiple speakers. We build on earlier work comparing techniques for selecting utterances from the corpus and for voice adaptation to produce the most natural-sounding voices. While our ultimate goal is to develop intelligible and natural-sounding synthetic voices in low-resource languages rapidly and without the expense of collecting and annotating data specifically for text-to-speech, we focus on English initially in order to develop and evaluate our methods. We evaluate our approaches using crowdsourced listening tests for naturalness. We have found that removing utterances that are outliers with respect to hyper-articulation, combined with selecting hypo-articulated utterances and utterances with low mean f0, produces the most natural-sounding voices.
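The selection strategy can be sketched roughly as follows. The per-utterance articulation score, the z-score cutoff, and the fraction kept are hypothetical stand-ins for the paper's actual features and thresholds.

```python
# Sketch: drop articulation outliers, then keep the lowest-mean-f0 subset.
from statistics import mean, stdev

def select_utterances(utts, z_cutoff=2.0, f0_fraction=0.5):
    """utts: list of (utt_id, mean_f0_hz, articulation_score)."""
    arts = [a for _, _, a in utts]
    mu, sigma = mean(arts), stdev(arts)
    # 1) Remove utterances whose articulation score is an outlier.
    kept = [u for u in utts if abs((u[2] - mu) / sigma) <= z_cutoff]
    # 2) Keep the lowest-mean-f0 fraction of what remains.
    kept.sort(key=lambda u: u[1])
    return kept[: max(1, int(len(kept) * f0_fraction))]
```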
international conference on acoustics, speech, and signal processing | 2009
Christopher M. White; Abhinav Sethy; Bhuvana Ramabhadran; Patrick J. Wolfe; Erica Cooper; Murat Saraclar; James K. Baker
This paper addresses selecting between candidate pronunciations for out-of-vocabulary words in speech processing tasks. We introduce a simple, unsupervised method that outperforms the conventional supervised method of forced alignment with a reference. The success of this method is independently demonstrated using three metrics from large-scale speech tasks: word error rates for large vocabulary continuous speech recognition, decision error tradeoff curves for spoken term detection, and phone error rates compared to a handcrafted pronunciation lexicon. The experiments were conducted using state-of-the-art recognition, indexing, and retrieval systems. The results were compared across many terms, hundreds of hours of speech, and well known data sets.
9th International Conference on Speech Prosody 2018 | 2018
Erica Cooper; Emily Li; Julia Hirschberg
Extensive TTS corpora exist for commercial systems created for high-resource languages such as Mandarin, English, and Japanese. Speakers recorded for these corpora are typically instructed to maintain constant f0, energy, and speaking rate and are recorded in ideal acoustic environments, producing clean, consistent audio. We have been developing TTS systems from “found” data collected for other purposes (e.g. training ASR systems) or available on the web (e.g. news broadcasts, audiobooks) to produce TTS systems for low-resource languages (LRLs) which do not currently have expensive, commercial systems. This study investigates whether traditional TTS speakers exhibit significantly less variation and better speaking characteristics than speakers in found genres. By examining characteristics of f0, energy, speaking rate, articulation, NHR, jitter, and shimmer in found genres and comparing these to traditional TTS corpora, we found that TTS recordings are indeed characterized by low mean pitch, standard deviation of energy, speaking rate, and level of articulation, and low mean and standard deviations of shimmer and NHR; in a number of respects these are quite similar to some found genres. By identifying similarities and differences, we are able to identify objective methods for selecting found data to build TTS systems for LRLs.
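Two of the voice-quality measures examined, local jitter and shimmer, follow standard definitions: the mean absolute difference between consecutive pitch periods (or cycle peak amplitudes), relative to the mean. A minimal sketch, with hypothetical toy tracks:

```python
# Sketch: local jitter and shimmer from per-cycle measurements.

def local_jitter(periods):
    """Mean absolute difference of consecutive pitch periods / mean period."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def local_shimmer(amplitudes):
    """Same ratio, computed on peak amplitudes of consecutive cycles."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))
```

A perfectly steady voice gives zero for both; the more the period or amplitude wobbles cycle to cycle, the higher the value, which is why low jitter and shimmer indicate the clean, consistent phonation sought in TTS recordings.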
conference of the international speech communication association | 2015
Gideon Mendels; Erica Cooper; Victor Soto; Julia Hirschberg; Mark J. F. Gales; Kate Knill; Anton Ragni; Haipeng Wang