Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Christopher M. White is active.

Publication


Featured research published by Christopher M. White.


ieee automatic speech recognition and understanding workshop | 2009

Query-by-example spoken term detection using phonetic posteriorgram templates

Timothy J. Hazen; Wade Shen; Christopher M. White

This paper examines a query-by-example approach to spoken term detection in audio files. The approach is designed for low-resource situations in which limited or no in-domain training material is available and accurate word-based speech recognition capability is unavailable. Instead of using word or phone strings as search terms, the user presents the system with audio snippets of the desired search terms to act as queries. Query and test materials are represented using phonetic posteriorgrams obtained from a phonetic recognition system. Query matches in the test data are located using a modified dynamic time warping search between query templates and test utterances. Experiments with this approach are presented on data from the Fisher corpus.
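The core of this approach is a dynamic time warping search in which each frame is a vector of phone posteriors. Below is a minimal sketch of such a search, assuming query and test posteriorgrams are NumPy arrays and using the negative log of the posterior dot product as the local distance; this is a common choice for posteriorgram matching, not necessarily the paper's exact modification.

```python
import numpy as np

def posteriorgram_dtw(query, test):
    """DTW between a query posteriorgram (Tq, P) and a test
    posteriorgram (Tt, P); rows are per-frame phone posteriors.
    Returns the best query-length-normalized alignment cost; the
    query may start and end anywhere in the test utterance."""
    eps = 1e-10
    # Local distance: -log of posterior inner product (illustrative choice).
    dist = -np.log(np.maximum(query @ test.T, eps))
    Tq, Tt = dist.shape
    acc = np.full((Tq, Tt), np.inf)
    acc[0, :] = dist[0, :]                    # free starting point in test
    for i in range(1, Tq):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, Tt):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],
                                         acc[i, j - 1],
                                         acc[i - 1, j - 1])
    return acc[-1, :].min() / Tq              # free ending point in test

# Toy usage: random posteriors over 10 phone classes.
rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(10), size=20)       # 20-frame query
t = rng.dirichlet(np.ones(10), size=200)      # 200-frame utterance
print(posteriorgram_dtw(q, t))
```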


international conference on acoustics, speech, and signal processing | 2008

Combination of strongly and weakly constrained recognizers for reliable detection of OOVs

Lukas Burget; Petr Schwarz; Pavel Matejka; Mirko Hannemann; Ariya Rastrow; Christopher M. White; Sanjeev Khudanpur; Hynek Hermansky; Jan Cernocky

This paper addresses the detection of OOV segments in the output of a large vocabulary continuous speech recognition (LVCSR) system. First, standard confidence measures from frame-based word and phone posteriors are investigated. Substantial improvement is obtained when posteriors from two systems, a strongly constrained one (LVCSR) and a weakly constrained one (a phone posterior estimator), are combined. We show that this approach is also suitable for detection of general recognition errors. All results are presented on the WSJ task with a reduced recognition vocabulary.
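One way to realize such a combination is to compare the two posterior streams frame by frame; where the strongly constrained system diverges from the weakly constrained one, an OOV segment or error is likely. The sketch below uses per-frame KL divergence as the comparison, which is an illustrative choice rather than the paper's exact combination.

```python
import numpy as np

def frame_divergence(strong_post, weak_post, eps=1e-10):
    """Per-frame KL divergence between phone posteriors derived from a
    strongly constrained LVCSR system and those from a weakly
    constrained phone recognizer, both shaped (T, P). High divergence
    marks frames where the LVCSR output disagrees with the acoustics,
    a symptom of OOV regions or recognition errors."""
    p = np.maximum(weak_post, eps)
    q = np.maximum(strong_post, eps)
    return np.sum(p * np.log(p / q), axis=-1)  # one score per frame

# A word-level confidence can then be, e.g., the mean divergence over
# the frames the LVCSR system assigned to that word.
```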


international conference on acoustics, speech, and signal processing | 2009

Effect of pronunciations on OOV queries in spoken term detection

Dogan Can; Erica Cooper; Abhinav Sethy; Christopher M. White; Bhuvana Ramabhadran; Murat Saraclar

The spoken term detection (STD) task aims to return relevant segments from a spoken archive that contain the query terms, whether or not they are in the system vocabulary. This paper focuses on pronunciation modeling for out-of-vocabulary (OOV) terms, which frequently occur in STD queries. The STD system described in this paper indexes word-level and sub-word-level lattices or confusion networks produced by an LVCSR system using weighted finite state transducers (WFSTs). We investigate the inclusion of n-best pronunciation variants for OOV terms (obtained from letter-to-sound rules) into the search and present the results obtained by indexing confusion networks as well as lattices. Two observations are worth highlighting: phone indexes generated from sub-words represent OOVs well, and too many variants for the OOV terms degrade performance if the pronunciations are not weighted.
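The weighting observation can be made concrete: detection scores from the phone index are accumulated per hit, scaled by each variant's letter-to-sound probability, so unlikely pronunciations cannot flood the result list. A sketch with a hypothetical index interface (the `variants` format and `index_lookup` callable are assumptions, not the paper's API):

```python
def score_oov_query(variants, index_lookup):
    """Combine detection scores over weighted pronunciation variants.

    variants: list of (phone_sequence, l2s_prob) pairs from an n-best
    letter-to-sound model (hypothetical input format).
    index_lookup: callable returning (utterance_id, time, posterior)
    hits for a phone sequence (hypothetical index interface).

    Each hit is down-weighted by its variant's letter-to-sound
    probability; without such weighting, the paper observes that
    adding many variants degrades performance."""
    scores = {}
    for phones, l2s_prob in variants:
        for utt, time, post in index_lookup(phones):
            key = (utt, time)
            scores[key] = scores.get(key, 0.0) + l2s_prob * post
    # Rank candidate detections by combined score.
    return sorted(scores.items(), key=lambda kv: -kv[1])
```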


international conference on acoustics, speech, and signal processing | 2007

Maximum Entropy Confidence Estimation for Speech Recognition

Christopher M. White; Jasha Droppo; Alex Acero; Julian J. Odell

For many automatic speech recognition (ASR) applications, it is useful to predict the likelihood that the recognized string contains an error. This paper explores two modifications of a classic design. First, it replaces the standard maximum likelihood classifier with a maximum entropy classifier. The maximum entropy framework carries the dual advantages of discriminative training and reasonable generalization. Second, it includes a number of alternative features. Our ASR system is heavily pruned and often produces recognition lattices with only a single path. These alternate features are meant to serve as a surrogate for the typical features that can be computed from a rich lattice. We show that the maximum entropy classifier easily outperforms the standard baseline system, and the alternative features provide consistent gains on all of our test sets.
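For binary error prediction, a maximum entropy classifier reduces to regularized logistic regression, so the design can be sketched with standard tooling. The features and labels below are random placeholders, not the paper's feature set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder per-utterance features (e.g., acoustic score, LM score,
# duration, lattice-surrogate features); real features come from ASR.
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500) > 0).astype(int)

# L2-regularized logistic regression = a binary maximum entropy model;
# the Gaussian prior (controlled by C) supplies the "reasonable
# generalization" noted in the abstract.
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)
p_error = clf.predict_proba(X[:5])[:, 1]   # P(recognition error) per utterance
print(p_error)
```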


international conference on acoustics, speech, and signal processing | 2008

Confidence estimation, OOV detection and language ID using phone-to-word transduction and phone-level alignments

Christopher M. White; Geoffrey Zweig; Lukas Burget; Petr Schwarz; Hynek Hermansky

Automatic speech recognition (ASR) systems continue to make errors during search when handling various phenomena, including noise, pronunciation variation, and out of vocabulary (OOV) words. Predicting the probability that a word is incorrect can prevent the error from propagating and perhaps allow the system to recover. This paper addresses the problem of detecting errors and OOVs for read Wall Street Journal speech when the word error rate (WER) is very low. It augments a traditional confidence estimate by introducing two novel methods: phone-level comparison using multi-string alignment (MSA) and word-level comparison using phone-to-word transduction. We show that features from phone and word string comparisons can be added to a standard maximum entropy framework, thereby substantially improving performance in detecting both errors and OOVs. Additionally, we show an extension to detecting English and accented English for the language identification (LID) task.
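The string-comparison features can be illustrated with a simpler stand-in: a normalized edit distance between the phone sequence implied by the LVCSR word output and the output of an independent phone recognizer. The paper's multi-string alignment and phone-to-word transduction are richer than this sketch:

```python
def phone_disagreement(lvcsr_phones, recognizer_phones):
    """Normalized Levenshtein distance between two phone sequences
    (lists of phone symbols). High disagreement between the LVCSR
    phone string and an independent phone recognizer suggests an
    error or OOV region; the score can serve as one feature in the
    maximum entropy confidence model."""
    m, n = len(lvcsr_phones), len(recognizer_phones)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if lvcsr_phones[i - 1] == recognizer_phones[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n] / max(m, n, 1)

print(phone_disagreement("k ae t".split(), "k ah t".split()))  # ~0.33
```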


international conference on acoustics, speech, and signal processing | 2006

Discriminative Classifiers for Language Recognition

Christopher M. White; Izhak Shafran; Jean-Luc Gauvain

Most language recognition systems consist of a cascade of three stages: (1) tokenizers that produce parallel phone streams, (2) phonotactic models that score the match between each phone stream and the phonotactic constraints in the target language, and (3) a final stage that combines the scores from the parallel streams appropriately (M.A. Zissman, 1996). This paper reports a series of contrastive experiments to assess the impact of replacing the second and third stages with large-margin discriminative classifiers. In addition, it investigates how sounds that are not represented in the tokenizers of the first stage can be approximated with composite units that utilize cross-stream dependencies obtained via multi-string alignments. This leads to a discriminative framework that can potentially incorporate a richer set of features such as prosodic and lexical cues. Experiments are reported on the NIST LRE 1996 and 2003 tasks, and the results show that the new techniques give substantial gains over a competitive PPRLM baseline.
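Replacing the phonotactic models and fusion backend with a large-margin classifier can be sketched directly: phone n-gram counts from a tokenizer stream become a feature vector, and a linear SVM scores the target languages. The phone strings and labels below are toy placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Each training example: one utterance's phone string from a tokenizer.
utterances = ["ae b k ae t s", "k ah t ah b s", "s ih t s ih n", "n ih s ih t"]
languages = ["en", "en", "es", "es"]

model = make_pipeline(
    # Phone n-gram counts stand in for the phonotactic model's statistics.
    CountVectorizer(token_pattern=r"\S+", ngram_range=(1, 3)),
    LinearSVC(C=1.0),  # large-margin classifier replaces per-language scoring
)
model.fit(utterances, languages)
print(model.predict(["k ae t s ih t"]))
```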


international acm sigir conference on research and development in information retrieval | 2009

Web derived pronunciations for spoken term detection

Dogan Can; Erica Cooper; Arnab Ghoshal; Martin Jansche; Sanjeev Khudanpur; Bhuvana Ramabhadran; Michael Riley; Murat Saraclar; Abhinav Sethy; Morgan Ulinski; Christopher M. White

Indexing and retrieval of speech content in various forms, such as broadcast news, customer care data, and on-line media, has gained a lot of interest for a wide range of applications, from customer analytics to on-line media search. For most retrieval applications, the speech content is typically first converted to a lexical or phonetic representation using automatic speech recognition (ASR). The first step in searching through indexes built on these representations is the generation of pronunciations for named entities and foreign-language query terms. This paper summarizes the results of the work conducted during the 2008 JHU Summer Workshop by the Multilingual Spoken Term Detection team on mining the web for pronunciations and analyzing their impact on spoken term detection. We first present methods to use the vast amount of pronunciation information available on the Web in the form of IPA and ad-hoc transcriptions. We describe techniques for extracting candidate pronunciations from Web pages and associating them with orthographic words, filtering out poorly extracted pronunciations, normalizing IPA pronunciations to better conform to a common transcription standard, and generating phonemic representations from ad-hoc transcriptions. We then present an analysis of the effectiveness of using these pronunciations to represent out-of-vocabulary (OOV) query terms on the performance of a spoken term detection (STD) system. We provide comparisons of Web pronunciations against automated techniques for pronunciation generation as well as pronunciations generated by human experts. Our results cover a range of speech indexes based on lattices, confusion networks, and one-best transcriptions at both word and word-fragment levels.
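The extraction step can be illustrated in miniature: pull slash-delimited spans (a common web convention for IPA) and keep those that look phonetic. The actual pipeline's filtering, word association, and normalization are far more involved; this heuristic only shows the general shape:

```python
import re
import unicodedata

IPA_SPAN = re.compile(r"/([^/\s][^/]{1,40})/")  # slash-delimited candidates

def extract_ipa_candidates(page_text):
    """Return slash-delimited spans that are mostly letters or
    combining/modifier marks, as rough candidate pronunciations.
    Illustrative heuristic only; the paper also associates candidates
    with orthographic words and normalizes transcription conventions."""
    candidates = []
    for span in IPA_SPAN.findall(page_text):
        s = unicodedata.normalize("NFC", span)
        letterish = sum(1 for ch in s
                        if unicodedata.category(ch).startswith(("L", "M")))
        if letterish / len(s) > 0.7:
            candidates.append(s)
    return candidates

print(extract_ipa_candidates("nucleus /ˈnjuːkliəs/ or, ad hoc, /new-klee-us/"))
```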


international conference on acoustics, speech, and signal processing | 1999

Interfacing a CDG parser with an HMM word recognizer using word graphs

Mary P. Harper; Michael T. Johnson; Leah H. Jamieson; Stephen A. Hockema; Christopher M. White

We describe a prototype spoken language system that loosely integrates a speech recognition component based on hidden Markov models with a constraint dependency grammar (CDG) parser using a word graph to pass sentence candidates between the two modules. This loosely coupled system was able to improve the sentence selection accuracy and concept accuracy over the level achieved by the acoustic module with a stochastic grammar. Timing profiles suggest that a tighter coupling of the modules could reduce parsing times of the system, as could the development of better acoustic models and tighter parsing constraints for conjunctions.
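A word graph in this sense is a scored lattice of word hypotheses that the parser can search instead of a flat N-best list. A minimal, hypothetical encoding and candidate enumeration, purely for illustration:

```python
# Nodes are time points; edges carry (word, next_node, log score).
# Toy graph; a real recognizer emits this structure from its search.
word_graph = {
    0: [("the", 1, -1.2), ("a", 1, -2.0)],
    1: [("cat", 2, -0.8), ("cap", 2, -1.5)],
    2: [],  # final node
}

def sentence_candidates(node=0, prefix=(), score=0.0):
    """Enumerate all paths through the graph as (sentence, total score);
    the CDG parser would instead prune paths that violate constraints."""
    if not word_graph[node]:
        yield " ".join(prefix), score
    for word, nxt, s in word_graph[node]:
        yield from sentence_candidates(nxt, prefix + (word,), score + s)

for sent, sc in sorted(sentence_candidates(), key=lambda x: -x[1]):
    print(f"{sc:6.1f}  {sent}")
```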


international conference on acoustics, speech, and signal processing | 2008

Sample selection for automatic language identification

David Farris; Christopher M. White; Sanjeev Khudanpur

Current approaches to automatic spoken language identification (LID) assume the availability of a large corpus of manually language-labeled speech samples for training statistical classifiers. We investigate two methods of active learning to significantly reduce the amount of labeled speech needed for training LID systems. Starting with a small training set, an automated method is used to select samples from a corpus of unlabeled speech, which are then labeled and added to the training pool; one selection method is based on a previously known entropy criterion, and another on a novel likelihood-ratio criterion. We demonstrate LID performance comparable to that obtained with the full training corpus using only a tenth of the training data. A further 40% improvement in LID performance is obtained using a third of the training data. Finally, we show that our novel selection method is more robust to variance in the unlabeled pool than the entropy-based method.
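The previously known entropy criterion is straightforward to sketch: score each unlabeled sample by the entropy of the current classifier's language posteriors and pick the most uncertain ones for labeling. The paper's novel likelihood-ratio criterion is a different scoring and is not shown here:

```python
import numpy as np

def entropy_select(posteriors, k):
    """Select the k unlabeled samples whose language-posterior
    distributions have the highest entropy, i.e. where the current
    LID classifier is least certain.

    posteriors: (N, L) array, one row of language posteriors per sample.
    Returns indices of the k samples to send for manual labeling."""
    eps = 1e-12
    h = -np.sum(posteriors * np.log(posteriors + eps), axis=1)
    return np.argsort(h)[-k:]

# Toy usage: 5 samples over 3 languages; the flattest rows are chosen.
post = np.array([[0.90, 0.05, 0.05],
                 [0.34, 0.33, 0.33],
                 [0.60, 0.30, 0.10],
                 [0.40, 0.35, 0.25],
                 [0.98, 0.01, 0.01]])
print(entropy_select(post, 2))   # -> indices of the two most uncertain
```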


Behavior Research Methods, Instruments, & Computers | 1999

Familiarity and pronounceability of nouns and names.

Aimée M. Surprenant; Susan L. Hura; Mary P. Harper; Leah H. Jamieson; Glenis R. Long; Scott M. Thede; Ayasakanta Rout; Tsung-Hsiang Hsueh; Stephen A. Hockema; Michael T. Johnson; Pramila Srinivasan; Christopher M. White; J. Brandon Laflen

Ratings of familiarity and pronounceability were obtained for a random sample of 199 surnames (selected from over 80,000 entries in the Purdue University phone book) and 199 nouns (from the Kučera and Francis, 1967, word database). The distributions of ratings for nouns versus names were substantially different: nouns were rated as more familiar and easier to pronounce than surnames. Frequency and familiarity were more closely related in the proper-name pool than in the word pool, although both correlations were modest. Ratings of familiarity and pronounceability were highly related for both groups. A production experiment showed that rated pronounceability was highly related to the time taken to produce a name. These data confirm the common belief that there are differences in the statistical and distributional properties of words as compared to proper names. The value of using frequency and ratings of familiarity and pronounceability to predict variations in the actual pronunciations of words and names is discussed.

Collaboration


Dive into Christopher M. White's collaborations.

Top Co-Authors

Erica Cooper

Massachusetts Institute of Technology

Andrei Zagrai

New Mexico Institute of Mining and Technology
