Publication


Featured research published by David Nahamoo.


IEEE Automatic Speech Recognition and Understanding Workshop | 2013

Speaker adaptation of neural network acoustic models using i-vectors

George Saon; Hagen Soltau; David Nahamoo; Michael Picheny

We propose to adapt deep neural network (DNN) acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR. For both training and test, the i-vector for a given speaker is concatenated to every frame belonging to that speaker and changes across different speakers. Experimental results on a 300-hour Switchboard corpus show that DNNs trained on speaker-independent features and i-vectors achieve a 10% relative improvement in word error rate (WER) over networks trained on speaker-independent features only. These networks are comparable in performance to DNNs trained on speaker-adapted features (with VTLN and FMLLR) with the advantage that only one decoding pass is needed. Furthermore, networks trained on speaker-adapted features and i-vectors achieve a 5-6% relative improvement in WER after Hessian-free sequence training over networks trained on speaker-adapted features only.
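The input augmentation the abstract describes is mechanically simple: the same per-speaker i-vector is appended to every frame of that speaker. A minimal sketch (feature and i-vector dimensions here are illustrative, not taken from the paper):

```python
import numpy as np

def append_ivector(frames, ivector):
    """Concatenate a speaker's i-vector to every acoustic frame.

    frames:  (T, d) array of per-frame acoustic features for one speaker
    ivector: (k,) speaker identity vector, constant across the speaker's frames
    returns: (T, d + k) augmented features fed to the DNN input layer
    """
    T = frames.shape[0]
    tiled = np.tile(ivector, (T, 1))                 # repeat the i-vector for each frame
    return np.concatenate([frames, tiled], axis=1)

# Hypothetical sizes: 40-dim acoustic features, 100-dim i-vector
x = append_ivector(np.zeros((7, 40)), np.ones(100))
```

At test time the same concatenation is applied, so the network sees a consistent input layout in both conditions.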


IEEE Transactions on Acoustics, Speech, and Signal Processing | 1990

Tied mixture continuous parameter modeling for speech recognition

Jerome R. Bellegarda; David Nahamoo

The acoustic-modeling problem in automatic speech recognition is examined with the goal of unifying discrete and continuous parameter approaches. To model a sequence of information-bearing acoustic feature vectors which has been extracted from the speech waveform via some appropriate front-end signal processing, a speech recognizer basically faces two alternatives: (1) assign a multivariate probability distribution directly to the stream of vectors, or (2) use a time-synchronous labeling acoustic processor to perform vector quantization on this stream, and assign a multinomial probability distribution to the output of the vector quantizer. With a few exceptions, these two methods have traditionally been given separate treatment. A class of very general hidden Markov models which can accommodate feature vector sequences lying either in a discrete or in a continuous space is considered; the new class allows one to represent the prototypes in an assumption-limited, yet convenient way, as tied mixtures of simple multivariate densities. Speech recognition experiments, reported for two (5000- and 20000-word vocabulary) office correspondence tasks, demonstrate some of the benefits associated with this technique.
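The key computational property of a tied mixture is that all HMM states share one codebook of component densities and differ only in their mixture weights, so the component likelihoods can be computed once per frame and reused by every state. A minimal sketch with diagonal Gaussians (all names and sizes are illustrative):

```python
import numpy as np

def tied_mixture_loglik(x, means, variances, state_weights):
    """Log-likelihood of frame x under one HMM state's tied mixture.

    means, variances: (M, d) shared diagonal-Gaussian codebook, tied across states
    state_weights:    (M,) state-specific mixture weights (sum to 1)
    """
    diff = x - means                                              # (M, d)
    log_comp = -0.5 * np.sum(diff**2 / variances
                             + np.log(2 * np.pi * variances), axis=1)
    # log-sum-exp of weights * component densities for this state
    a = np.log(state_weights) + log_comp
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# Two shared 1-d components at means 0 and 1, one state weighting them equally
ll = tied_mixture_loglik(np.array([0.0]),
                         np.array([[0.0], [1.0]]),
                         np.ones((2, 1)),
                         np.array([0.5, 0.5]))
```

In a full system, `log_comp` would be computed once per frame and combined with many different weight vectors.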


IEEE Transactions on Information Theory | 1991

An inequality for rational functions with applications to some statistical estimation problems

Ponani S. Gopalakrishnan; Dimitri Kanevsky; Arthur Nádas; David Nahamoo

The well-known Baum-Eagon inequality (1967) provides an effective iterative scheme for finding a local maximum for homogeneous polynomials with positive coefficients over a domain of probability values. However, in many applications the goal is to maximize a general rational function. In view of this, the Baum-Eagon inequality is extended to rational functions. Some of the applications of this inequality to statistical estimation problems are briefly described.
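The shape of the resulting update can be sketched as follows (this is a hedged reconstruction of the standard growth-transform form of the result, not a quotation of the paper's notation):

```latex
% To maximize a rational function R(x) = p(x)/q(x) over probability
% vectors x, form at the current point x^{(t)} the auxiliary polynomial
%   P(x) = p(x) - R(x^{(t)})\, q(x),
% and, with a constant C large enough to make the relevant coefficients
% nonnegative, apply the Baum-Eagon-style growth transform:
\[
  x_{ij}^{(t+1)} =
  \frac{x_{ij}^{(t)}\left(\left.\frac{\partial P}{\partial x_{ij}}\right|_{x^{(t)}} + C\right)}
       {\sum_{j'} x_{ij'}^{(t)}\left(\left.\frac{\partial P}{\partial x_{ij'}}\right|_{x^{(t)}} + C\right)}
\]
% Each iteration is guaranteed not to decrease R, which is what makes the
% scheme usable for estimation criteria that are ratios of polynomials.
```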


IEEE Transactions on Acoustics, Speech, and Signal Processing | 1989

Speech recognition using noise-adaptive prototypes

Arthur Nádas; David Nahamoo; Michael Picheny

A probabilistic mixture model is described for a frame (the short-term spectrum) of speech to be used in speech recognition. Each component of the mixture is regarded as a prototype for the labeling phase of a hidden Markov model based speech recognition system. Since the ambient noise during recognition can differ from that present in the training data, the model is designed for convenient updating in changing noise. Based on the observation that the energy in a frequency band is at any fixed time dominated either by signal energy or by noise energy, the energy is modeled as the larger of the separate energies of signal and noise in the band. Statistical algorithms are given for training this as a hidden variables model. The hidden variables are the prototype identities and the separate signal and noise components. Speech recognition experiments that successfully utilize this model are described.
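Modeling the observed band energy as the larger of two independent quantities has a closed-form density: f(y) = f_s(y)F_n(y) + F_s(y)f_n(y), where f and F are the densities and CDFs of the signal and noise energies. A minimal sketch with Gaussian per-band log-energies (the Gaussian choice here is illustrative):

```python
import math

def gauss_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gauss_cdf(y, mu, sigma):
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2))))

def max_density(y, sig, noise):
    """Density of y = max(signal, noise) for independent Gaussians.

    sig, noise: (mean, std) of per-band log-energy for signal and noise.
    The observed band energy is whichever of the two is larger, so
    f(y) = f_s(y) F_n(y) + F_s(y) f_n(y).
    """
    (ms, ss), (mn, sn) = sig, noise
    return (gauss_pdf(y, ms, ss) * gauss_cdf(y, mn, sn)
            + gauss_cdf(y, ms, ss) * gauss_pdf(y, mn, sn))
```

Because the noise parameters enter only through `noise`, they can be re-estimated in changing noise without retraining the signal model, which is the "convenient updating" the abstract refers to.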


International Conference on Acoustics, Speech, and Signal Processing | 1995

Performance of the IBM large vocabulary continuous speech recognition system on the ARPA Wall Street Journal task

Lalit R. Bahl; S. Balakrishnan-Aiyer; Jerome R. Bellegarda; Martin Franz; Ponani S. Gopalakrishnan; David Nahamoo; Miroslav Novak; Mukund Padmanabhan; Michael Picheny; Salim Roukos

In this paper we discuss various experimental results using our continuous speech recognition system on the Wall Street Journal task. Experiments with different feature extraction methods, varying amounts and type of training data, and different vocabulary sizes are reported.


International Conference on Acoustics, Speech, and Signal Processing | 1991

Decision trees for phonological rules in continuous speech

Lalit R. Bahl; Peter Vincent Desouza; Ponani S. Gopalakrishnan; David Nahamoo; Michael Picheny

The authors present an automatic method for modeling phonological variation using decision trees. For each phone they construct a decision tree that specifies the acoustic realization of the phone as a function of the context in which it appears. Several thousand sentences from a natural language corpus spoken by several speakers are used to construct these decision trees. Experimental results on a 5000-word vocabulary natural language speech recognition task are presented.
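The structure such a tree learns can be illustrated with a hand-written miniature: each internal node asks a set-membership question about a neighboring phone, and each leaf names an acoustic realization. The questions, phone sets, and leaf names below are hypothetical, not the paper's:

```python
# A toy context tree for /t/: questions about the left and right phones
# select an allophone model (in the paper, trees like this are induced
# automatically from data rather than written by hand).
VOWELS = {"aa", "iy", "uw", "eh", "ah", "er"}

def realize_t(left, right):
    """Pick an allophone model for /t/ from its immediate context."""
    if left in VOWELS and right in VOWELS:
        return "t_flap"        # intervocalic /t/, as in 'butter'
    if right is None:
        return "t_unreleased"  # utterance-final /t/
    return "t_released"

allophone = realize_t("ah", "er")
```

An induced tree generalizes this by choosing, at each node, the context question that best splits the acoustic training data.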


IEEE Transactions on Speech and Audio Processing | 1994

The metamorphic algorithm: a speaker mapping approach to data augmentation

Jerome R. Bellegarda; P. V. de Souza; Arthur Nádas; David Nahamoo; Michael Picheny; Lalit R. Bahl

Large vocabulary speaker-dependent speech recognition systems adjust to the acoustic peculiarities of each new speaker based on some enrolment data provided by this speaker. As the amount of data required increases with the sophistication of the underlying acoustic models, the enrolment may get lengthy. To streamline it, it is therefore desirable to make use of previously acquired speech data. The authors describe a data augmentation strategy based on a piecewise linear mapping between the feature space of a new speaker and that of a reference speaker. This speaker-normalizing mapping is used to transform the previously acquired data of the reference speaker onto the space of the new speaker. The performance of the resulting procedure, dubbed the metamorphic algorithm, is illustrated on an isolated utterance speech recognition task with a vocabulary of 20000 words. Results show that the metamorphic algorithm can substantially reduce the word error rate when only a limited amount of enrolment data is available. Alternatively, it leads to a level of performance comparable to that obtained when a much greater amount of enrolment data is required from the new speaker. In addition, it can also be used for tracking spectral evolution over time, thus providing a possible means for robust speaker self-adaptation.
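One piece of such a piecewise linear mapping can be estimated by least squares from time-aligned frames of the two speakers; the sketch below fits a single affine map (the paper fits one per region of acoustic space, and the alignment is assumed given):

```python
import numpy as np

def fit_linear_map(ref_feats, new_feats):
    """Least-squares affine map from a reference speaker's feature space
    to a new speaker's (one piece of a piecewise-linear mapping).

    ref_feats, new_feats: (T, d) time-aligned frames of the same material
    spoken by the reference and the new speaker.
    """
    T, _ = ref_feats.shape
    X = np.hstack([ref_feats, np.ones((T, 1))])        # append a bias column
    W, *_ = np.linalg.lstsq(X, new_feats, rcond=None)  # (d+1, d)
    return W

def apply_map(W, feats):
    """Transform reference-speaker data into the new speaker's space."""
    X = np.hstack([feats, np.ones((feats.shape[0], 1))])
    return X @ W

# Synthetic check: a "new speaker" generated by a known affine distortion
rng = np.random.default_rng(0)
ref = rng.normal(size=(50, 3))
new = ref @ np.diag([1.1, 0.9, 1.0]) + np.array([0.2, -0.1, 0.0])
W = fit_linear_map(ref, new)
mapped = apply_map(W, ref)
```

The mapped reference data then augments the new speaker's small enrolment set when training that speaker's models.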


IEEE Transactions on Speech and Audio Processing | 1993

Multonic Markov word models for large vocabulary continuous speech recognition

Lalit R. Bahl; Jerome R. Bellegarda; P. V. de Souza; Ponani S. Gopalakrishnan; David Nahamoo; Michael Picheny

A new class of hidden Markov models is proposed for the acoustic representation of words in an automatic speech recognition system. The models, built from combinations of acoustically based sub-word units called fenones, are derived automatically from one or more sample utterances of a word. Because they are more flexible than previously reported fenone-based word models, they lead to an improved capability of modeling variations in pronunciation. They are therefore particularly useful in the recognition of continuous speech. In addition, their construction is relatively simple, because it can be done using the well-known forward-backward algorithm for parameter estimation of hidden Markov models. Appropriate reestimation formulas are derived for this purpose. Experimental results obtained on a 5000-word vocabulary natural language continuous speech recognition task are presented to illustrate the enhanced power of discrimination of the new models.
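The estimation machinery the abstract reuses is the standard forward-backward algorithm; its forward pass for a discrete-emission HMM (fenone-style models emit label IDs produced by a vector quantizer) can be sketched as follows, with all sizes illustrative:

```python
import numpy as np

def hmm_forward(pi, A, B, obs):
    """Forward pass of the forward-backward algorithm: total likelihood
    of an observed label sequence under a discrete-emission HMM.

    pi:  (S,) initial state probabilities
    A:   (S, S) state transition probabilities
    B:   (S, V) emission probabilities over a discrete label alphabet
    obs: sequence of label IDs
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
    return float(alpha.sum())

# One-state toy model emitting two labels with equal probability
p = hmm_forward(np.array([1.0]),
                np.array([[1.0]]),
                np.array([[0.5, 0.5]]),
                [0, 1])
```

The backward pass and the reestimation formulas follow the same recursion pattern in reverse; a production implementation would also rescale `alpha` at each step to avoid underflow on long sequences.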


International Conference on Acoustics, Speech, and Signal Processing | 1995

Experiments using data augmentation for speaker adaptation

Jerome R. Bellegarda; P. V. de Souza; David Nahamoo; Mukund Padmanabhan; Michael Picheny; Lalit R. Bahl

Speaker adaptation typically involves customizing some existing (reference) models in order to account for the characteristics of a new speaker. This work considers the slightly different paradigm of customizing some reference data for the purpose of populating the new speaker's space, and then using the resulting (augmented) data to derive the customized models. The data augmentation technique is based on the metamorphic algorithm first proposed in Bellegarda et al. [1992], assuming that a relatively modest amount of data (100 sentences) is available from each new speaker. This constraint requires that reference speakers be selected with some care. The performance of this method is illustrated on a portion of the Wall Street Journal task.


IEEE Transactions on Speech and Audio Processing | 1998

Speaker clustering and transformation for speaker adaptation in speech recognition systems

Mukund Padmanabhan; Lalit R. Bahl; David Nahamoo; Michael Picheny

A speaker adaptation strategy is described that is based on finding a subset of speakers, from the training set, who are acoustically close to the test speaker, and using only the data from these speakers (rather than the complete training corpus) to reestimate the system parameters. Further, a linear transformation is computed for every one of the selected training speakers to better map the training speakers' data to the test speaker's acoustic space. Finally, the system parameters (Gaussian means) are reestimated specifically for the test speaker using the transformed data from the selected training speakers. Experiments showed that this scheme is capable of providing an 18% relative improvement in the error rate on a large-vocabulary task with the use of as little as three sentences of adaptation data.
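The first step, picking the training speakers closest to the test speaker, can be sketched with a simple distance between per-speaker mean feature vectors (a stand-in for a likelihood-based closeness measure; names and sizes below are hypothetical):

```python
import numpy as np

def closest_speakers(test_mean, train_means, k):
    """Return the k training speakers whose mean feature vectors are
    nearest (Euclidean) to the test speaker's, estimated from a few
    adaptation sentences.

    test_mean:   (d,) mean feature vector of the test speaker
    train_means: dict name -> (d,) per-speaker training means
    """
    dists = {name: float(np.linalg.norm(mu - test_mean))
             for name, mu in train_means.items()}
    return sorted(dists, key=dists.get)[:k]

train_means = {"a": np.zeros(2), "b": np.ones(2), "c": np.full(2, 5.0)}
picked = closest_speakers(np.array([0.9, 0.9]), train_means, 2)
```

Only the data of the selected speakers, after per-speaker linear transformation toward the test speaker, is then used to reestimate the Gaussian means.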
