
Publications


Featured research published by Ellen Eide.


International Conference on Acoustics, Speech, and Signal Processing | 1996

A parametric approach to vocal tract length normalization

Ellen Eide; Herbert Gish

Differences in vocal tract size among individual speakers contribute to the variability of speech waveforms. The first-order effect of a difference in vocal tract length is a scaling of the frequency axis; a female speaker, for example, exhibits formants roughly 20% higher than those of a male speaker, with the differences most severe in open vocal tract configurations. We describe a parametric method of normalization which counteracts the effect of varied vocal tract length. The method is shown to be effective across a wide range of recognition systems and paradigms, and is particularly helpful when the amount of training data is small.
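As a rough illustration of the first-order effect described above (not the paper's parametric method), the following minimal sketch warps the frequency axis of a magnitude spectrum by a scalar factor; the warp factor value and function name are assumptions for illustration.

```python
import numpy as np

def warp_spectrum(magnitudes, alpha):
    """Rescale the frequency axis of a magnitude spectrum by a factor alpha.

    alpha > 1 reads from higher frequencies, compressing formants downward
    (e.g. normalizing a shorter vocal tract toward a canonical one);
    alpha < 1 stretches them upward.
    """
    n = len(magnitudes)
    source_bins = np.arange(n) * alpha  # where each output bin reads from
    return np.interp(source_bins, np.arange(n), magnitudes, right=0.0)

# Example: formants sitting ~20% above a reference speaker's would be
# counteracted by warping with alpha ~= 1.2 before feature extraction.
spectrum = np.abs(np.fft.rfft(np.random.randn(512)))
normalized = warp_spectrum(spectrum, alpha=1.2)
```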


Journal of the Acoustical Society of America | 2011

System and method for rescoring N-best hypotheses of an automatic speech recognition system

Raimo Bakis; Ellen Eide

A system and method for rescoring the N-best hypotheses from an automatic speech recognition system by comparing an original speech waveform to synthetic speech waveforms that are generated for each text sequence of the N-best hypotheses. A distance is calculated from the original speech waveform to each of the synthesized waveforms, and the text associated with the synthesized waveform that is determined to be closest to the original waveform is selected as the final hypothesis. The original waveform and each synthesized waveform are aligned to a corresponding text sequence on a phoneme level. The mean of the feature vectors which align to each phoneme is computed for the original waveform as well as for each of the synthesized hypotheses. The distance of a synthesized hypothesis to the original speech signal is then computed as the sum over all phonemes in the hypothesis of the Euclidean distance between the means of the feature vectors of the frames aligning to that phoneme for the original and the synthesized signals. The text of the hypothesis which is closest under the above metric to the original waveform is chosen as the final system output.
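A minimal sketch of the distance metric as the abstract describes it: per-phoneme means of aligned feature vectors, summed Euclidean distances, and an argmin over hypotheses. The data structures (ordered lists of phoneme/frame pairs) are assumptions; the synthesizer and the phoneme-level aligner are outside the sketch.

```python
import numpy as np

def phoneme_means(alignment):
    """alignment: ordered list of (phoneme, frames) pairs, where `frames`
    is an (n_frames x dim) array of feature vectors aligned to that phoneme."""
    return [(ph, np.mean(frames, axis=0)) for ph, frames in alignment]

def distance_to_original(orig_align, synth_align):
    """Sum over phonemes of the Euclidean distance between the mean feature
    vectors of the original and synthesized signals. Both alignments follow
    the same phoneme sequence, since they share the hypothesis text."""
    return sum(np.linalg.norm(mo - ms)
               for (_, mo), (_, ms) in zip(phoneme_means(orig_align),
                                           phoneme_means(synth_align)))

def rescore(orig_align_per_hyp, synth_align_per_hyp, hypotheses):
    """Choose the hypothesis whose synthesized waveform is closest
    to the original under the metric above."""
    return min(hypotheses,
               key=lambda h: distance_to_original(orig_align_per_hyp[h],
                                                  synth_align_per_hyp[h]))
```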


International Conference on Acoustics, Speech, and Signal Processing | 1997

Word-based confidence measures as a guide for stack search in speech recognition

Chalapathy Neti; Salim Roukos; Ellen Eide

The maximum a posteriori hypothesis is treated as the decoded truth in speech recognition. However, since word recognition accuracy is not 100%, for some applications it is desirable to have an independent confidence measure of how good the maximum a posteriori hypothesis is relative to the spoken truth. Efforts are in progress to develop such confidence measures, with the intent of applying them to assessing the confidence of whole utterances, rescoring N-best lists, etc. In this paper, we explore the use of word-based confidence measures to adaptively modify the hypothesis score during the search in continuous speech recognition: specifically, the weight given to the prediction made by the current sequence of hypothesized words is changed as a function of the confidence in that sequence. Experimental results are described for the ATIS and Switchboard tasks. About an 8% relative reduction in word error is obtained for ATIS.
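The score modification might look like the sketch below, assuming a per-word-history confidence in [0, 1]; the linear scaling of the language-model weight is an illustrative choice, not necessarily the paper's exact scheme.

```python
def adaptive_hypothesis_score(acoustic_logprob, lm_logprob, confidence,
                              base_lm_weight=0.5):
    """Down-weight the language model's prediction when confidence in the
    current hypothesized word sequence is low: a shaky history should
    predict the next word less strongly during the stack search."""
    return acoustic_logprob + base_lm_weight * confidence * lm_logprob
```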


International Conference on Acoustics, Speech, and Signal Processing | 1995

Understanding and improving speech recognition performance through the use of diagnostic tools

Ellen Eide; Herbert Gish; Philippe Jeanrenaud; Angela Mielke

The goal of this work is to highlight aspects of an experiment other than the word error rate. When a speech recognition experiment is performed, the word error rate provides no insight into the factors responsible for the recognition errors. We begin this paper by describing an experiment which contrasts the language of conversational speech with that of text from the Wall Street Journal. The remainder of the paper is devoted to the description of a more general approach to performance diagnosis which identifies significant sources of error in a given experiment. The technique is based on the use of binary classification trees; we refer to the results of our analyses as diagnostic trees. Beyond providing understanding, diagnostic trees allow for improvements in the performance of a recognizer through the use of feedback provided by quantifying confidence in the recognition.
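A diagnostic tree of this kind can be grown with any binary classification tree learner; the sketch below uses scikit-learn with made-up word-level attributes to predict whether a word was misrecognized, then prints the learned splits.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical word-level attributes: duration (ms), speaking rate
# (syllables/sec), phone count, and language-model log probability.
X = [[120, 5.1, 3, -4.2],
     [310, 3.2, 7, -2.0],
     [ 90, 6.0, 2, -5.5],
     [250, 3.8, 6, -1.8]]
y = [1, 0, 1, 0]  # 1 = misrecognized, 0 = correct

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["duration_ms", "rate",
                                       "n_phones", "lm_logprob"]))
```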


International Conference on Acoustics, Speech, and Signal Processing | 2003

Recent improvements to the IBM trainable speech synthesis system

Ellen Eide; Andrew Aaron; Raimo Bakis; R. Cohen; Robert E. Donovan; Wael Hamza; T. Mathes; Michael Picheny; M. Polkosky; M. Smith; M. Viswanathan

In this paper we describe the current status of the trainable text-to-speech system at IBM. Recent algorithmic and database changes to the system have led to significant gains in output quality. On the algorithmic side, we have introduced statistical models for predicting pitch and duration targets, which replace the rule-based target generation previously employed. Additionally, we have changed the cost function and the search strategy, introduced a post-search pitch-smoothing algorithm, and improved our method of preselection. Through the combined data and algorithmic contributions, we have significantly improved (p < 0.0001) the mean opinion score (MOS) of our female voice, from 3.68 to 4.85 when heard over loudspeakers and to 5.42 when heard over the telephone (seven-point scale).
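The abstract does not give the IBM cost function, but target-driven unit selection of this kind typically trades off closeness to the statistically predicted pitch/duration targets against smoothness at concatenation points; the sketch below is a hedged illustration with assumed weights and features.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    pitch: float        # mean F0 of the unit (Hz)
    duration: float     # unit duration (ms)
    start_pitch: float  # F0 at the unit's left edge (Hz)
    end_pitch: float    # F0 at the unit's right edge (Hz)

def unit_cost(unit, target_pitch, target_duration, prev_unit=None,
              w_target=1.0, w_join=0.7):
    """Cost of selecting `unit` to realize a segment with predicted
    pitch/duration targets. Weights and features are illustrative only."""
    target_cost = (abs(unit.pitch - target_pitch)
                   + abs(unit.duration - target_duration))
    join_cost = (abs(unit.start_pitch - prev_unit.end_pitch)
                 if prev_unit is not None else 0.0)
    return w_target * target_cost + w_join * join_cost
```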


Speech Communication | 2002

Automatic transcription of Broadcast News

Scott Saobing Chen; Ellen Eide; Mark J. F. Gales; Ramesh A. Gopinath; Dimitri Kanevsky; Peder A. Olsen

This paper describes the IBM approach to Broadcast News (BN) transcription. Typical problems in the BN transcription task are segmentation, clustering, acoustic modeling, language modeling and acoustic model adaptation. This paper presents new algorithms for each of these focus problems. Some key ideas include the Bayesian information criterion (BIC) (for segmentation, clustering and acoustic modeling) and speaker/cluster adapted training (SAT/CAT).
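A common formulation of BIC-based change detection for segmentation looks like the following sketch; the tunable penalty weight and the full-covariance Gaussian assumption are the usual choices in the Broadcast News literature, not necessarily the exact variant used here.

```python
import numpy as np

def delta_bic(x, t, lam=1.0):
    """Delta-BIC for a candidate change point at frame t in the feature
    sequence x (N x d). A positive value favors modeling x[:t] and x[t:]
    with two separate full-covariance Gaussians rather than one.
    Segments must be long enough for full-rank covariances."""
    n, d = x.shape

    def logdet(seg):
        return np.linalg.slogdet(np.cov(seg, rowvar=False))[1]

    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet(x)
            - 0.5 * t * logdet(x[:t])
            - 0.5 * (n - t) * logdet(x[t:])
            - lam * penalty)
```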


International Conference on Acoustics, Speech, and Signal Processing | 1999

Recent improvements to IBM's speech recognition system for automatic transcription of broadcast news

Scott Saobing Chen; Ellen Eide; Mark J. F. Gales; Ramesh A. Gopinath; Dimitri Kanevsky; Peder A. Olsen

We describe extensions and improvements to IBM's system for automatic transcription of broadcast news. The speech recognizer uses a total of 160 hours of acoustic training data, 80 hours more than the system described in Chen et al. (1998). In addition to the improvements obtained in 1997, we made a number of changes and algorithmic enhancements. Among these were changing the acoustic vocabulary, reducing the number of phonemes, inserting short pauses, mixture models consisting of non-Gaussian components, pronunciation networks, factor analysis (FACILT), and the Bayesian information criterion (BIC) applied to choosing the number of components in a Gaussian mixture model. The models were combined into a single system using NIST's script voting machine known as ROVER (Fiscus, 1997).
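ROVER's final step reduces to choosing the most frequent word in each slot of the aligned word transition network; the simplified sketch below assumes the alignment (which real ROVER builds by dynamic programming) is already given.

```python
from collections import Counter

def rover_vote(aligned_slots):
    """Majority vote over pre-aligned word slots from several systems.
    Each slot lists one candidate per system ('' marks a deletion)."""
    winners = (Counter(slot).most_common(1)[0][0] for slot in aligned_slots)
    return [w for w in winners if w]

# Three systems, three aligned slots:
print(rover_vote([["the", "the", "a"],
                  ["news", "news", "news"],
                  ["", "today", ""]]))
# -> ['the', 'news']
```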


International Conference on Acoustics, Speech, and Signal Processing | 1995

Reducing word error rate on conversational speech from the Switchboard corpus

Philippe Jeanrenaud; Ellen Eide; Upendra V. Chaudhari; John W. McDonough; Kenney Ng; Man-Hung Siu; Herbert Gish

Speech recognition of conversational speech is a difficult task. Performance levels on the Switchboard corpus had been in the vicinity of 70% word error rate. In this paper, we describe the results of applying a variety of modifications to our speech recognition system and show their impact on performance on conversational speech. These modifications include the use of more complex models, trigram language models, and cross-word triphone models. We also show the effect of additional acoustic training data on recognition performance. Finally, we present an approach to dealing with the abundance of short words, and examine how the variable speaking rate found in conversational speech impacts performance. Currently, the level of performance is in the vicinity of 50% word error rate, a significant improvement over recent levels.


International Conference on Acoustics, Speech, and Signal Processing | 1998

Speech recognition performance on a voicemail transcription task

Mukund Padmanabhan; Ellen Eide; Bhuvana Ramabhadran; Ganesh N. Ramaswamy; Lalit R. Bahl

We describe a new testbed for developing speech recognition algorithms: the ARPA-sponsored voicemail transcription task, analogous to other tasks such as Switchboard, CallHome and Hub 4. The task involves the transcription of voicemail messages. Voicemail represents a very large volume of real-world speech data, which is however not particularly well represented in existing databases. For instance, the Switchboard and CallHome databases contain telephone conversations between two humans, representing telephone-bandwidth spontaneous speech; the Hub 4 database contains radio broadcasts, which represent different kinds of speech data such as spontaneous speech from a well-trained speaker, conversations between two humans possibly over the telephone, etc. The voicemail database also represents telephone-bandwidth spontaneous speech; however, the difference with respect to the Switchboard and CallHome tasks is that the interaction is not between two humans but between a human and a machine. Consequently, the speech is expected to be a little more formal in nature, without the problems of crosstalk, barge-in, etc. This eliminates some of the variables and provides more controlled conditions, enabling one to concentrate on the aspects of spontaneous speech and the effects of the telephone channel. We describe how the speech data were collected, along with some algorithmic techniques that were devised based on this data. We also describe initial transcription performance on this task.


Journal of the Acoustical Society of America | 2006

Speech and signal digitization by using recognition metrics to select from multiple techniques

Ellen Eide; Ramesh A. Gopinath; Dimitri Kanevsky; Peder A. Olsen

A characteristic-specific digitization method and apparatus are disclosed that reduce the error rate in converting input information into a computer-readable format. The input information is analyzed and subsets of the input information are classified according to whether the input information exhibits a specific physical parameter affecting recognition accuracy. If the input information exhibits the specific physical parameter affecting recognition accuracy, the characteristic-specific digitization system recognizes the input information using a characteristic-specific recognizer that demonstrates improved performance for the given physical parameter. If the input information does not exhibit the specific physical parameter affecting recognition accuracy, the characteristic-specific digitization system recognizes the input information using a general recognizer that performs well for typical input information. In one implementation, input speech having very low recognition accuracy as a result of a physical speech characteristic is automatically identified and recognized using a characteristic-specific speech recognizer.
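The routing logic the abstract describes amounts to a simple dispatch; in this sketch the detector and the two recognizers are hypothetical placeholders, not the patent's implementation.

```python
def recognize(signal, has_characteristic, specific_recognizer,
              general_recognizer):
    """Dispatch to a characteristic-specific recognizer when the input
    exhibits the physical parameter affecting accuracy (e.g. very fast
    speech), and to the general recognizer otherwise. All three
    callables are hypothetical placeholders."""
    if has_characteristic(signal):
        return specific_recognizer(signal)
    return general_recognizer(signal)
```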
