Ariya Rastrow
Johns Hopkins University
Publications
Featured research published by Ariya Rastrow.
Computer Speech & Language | 2011
Daniel Povey; Lukas Burget; Mohit Agarwal; Pinar Akyazi; Kai Feng; Arnab Ghoshal; Ondřej Glembek; Nagendra Goel; Martin Karafiát; Ariya Rastrow; Richard C. Rose; Petr Schwarz; Samuel Thomas
We describe a new approach to speech recognition, in which all Hidden Markov Model (HMM) states share the same Gaussian Mixture Model (GMM) structure with the same number of Gaussians in each state. The model is defined by vectors associated with each state with a dimension of, say, 50, together with a global mapping from this vector space to the space of parameters of the GMM. This model appears to give better results than a conventional model, and the extra structure offers many new opportunities for modeling innovations while maintaining compatibility with most standard techniques.
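The core construction (a per-state vector projected by globally shared matrices into GMM parameters) can be illustrated with a minimal numpy sketch. This is not the authors' implementation; all names, sizes, and the random initialization below are purely illustrative.

```python
# Minimal sketch of the subspace idea: each HMM state j carries a
# low-dimensional vector v[j]; globally shared matrices M[m] map it to the
# mean of shared Gaussian m in that state: mu[j, m] = M[m] @ v[j].
import numpy as np

rng = np.random.default_rng(0)

feat_dim, subspace_dim = 39, 50      # e.g. MFCC features, 50-dim state vectors
num_gaussians, num_states = 4, 3     # tiny toy sizes for illustration

M = rng.standard_normal((num_gaussians, feat_dim, subspace_dim))  # shared
v = rng.standard_normal((num_states, subspace_dim))               # per state

means = np.einsum('mfs,js->jmf', M, v)
print(means.shape)  # (num_states, num_gaussians, feat_dim)
```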
international conference on acoustics, speech, and signal processing | 2010
Daniel Povey; Lukáš Burget; Mohit Agarwal; Pinar Akyazi; Kai Feng; Arnab Ghoshal; Ondřej Glembek; Nagendra Goel; Martin Karafiát; Ariya Rastrow; Richard C. Rose; Petr Schwarz; Samuel Thomas
We describe an acoustic modeling approach in which all phonetic states share a common Gaussian Mixture Model structure, and the means and mixture weights vary in a subspace of the total parameter space. We call this a Subspace Gaussian Mixture Model (SGMM). Globally shared parameters define the subspace. This style of acoustic model allows for a much more compact representation and gives better results than a conventional modeling approach, particularly with smaller amounts of training data.
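In the SGMM, the state-specific mixture weights are also derived from the same state vectors, via globally shared weight-projection vectors and a softmax. The short sketch below shows that part of the parameterization; sizes and values are illustrative only.

```python
# Sketch of SGMM mixture weights: weight of shared Gaussian m in state j is
# softmax over w[m] . v[j], so each state's weights form a distribution.
import numpy as np

rng = np.random.default_rng(1)

subspace_dim, num_gaussians, num_states = 50, 4, 3
w = rng.standard_normal((num_gaussians, subspace_dim))   # shared projections
v = rng.standard_normal((num_states, subspace_dim))      # per-state vectors

logits = v @ w.T                                         # (states, gaussians)
weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(weights.sum(axis=1))  # each row sums to 1
```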
international conference on acoustics, speech, and signal processing | 2008
Lukas Burget; Petr Schwarz; Pavel Matejka; Mirko Hannemann; Ariya Rastrow; Christopher M. White; Sanjeev Khudanpur; Hynek Hermansky; Jan Cernocky
This paper addresses the detection of out-of-vocabulary (OOV) segments in the output of a large vocabulary continuous speech recognition (LVCSR) system. First, standard confidence measures from frame-based word and phone posteriors are investigated. Substantial improvement is obtained when posteriors from two systems, a strongly constrained one (the LVCSR system) and a weakly constrained one (a phone posterior estimator), are combined. We show that this approach is also suitable for detecting general recognition errors. All results are presented on the WSJ task with a reduced recognition vocabulary.
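A hedged sketch of the combination idea follows: per frame, compare the posterior of the phone hypothesized by the strongly constrained LVCSR system with the posterior assigned by the weakly constrained phone recognizer; large disagreement over a word's frames suggests an OOV or a recognition error. The specific score below (a mean log-ratio) is an illustrative choice, not the paper's exact measure.

```python
# Frame-level mismatch between two posterior streams, averaged over a word.
import numpy as np

def mismatch_score(lvcsr_post, phone_post, eps=1e-10):
    """lvcsr_post, phone_post: per-frame posteriors of the hypothesized phone
    under the two systems, shape (num_frames,). Higher -> more disagreement."""
    lvcsr_post = np.clip(lvcsr_post, eps, 1.0)
    phone_post = np.clip(phone_post, eps, 1.0)
    return float(np.mean(np.log(lvcsr_post) - np.log(phone_post)))

# Toy example: the phone recognizer disagrees with the LVCSR hypothesis.
print(mismatch_score(np.array([0.9, 0.8, 0.85]), np.array([0.1, 0.2, 0.15])))
```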
international conference on acoustics, speech, and signal processing | 2009
Ariya Rastrow; Abhinav Sethy; Bhuvana Ramabhadran
In this paper, we propose a new method for detecting regions with out-of-vocabulary (OOV) words in the output of a large vocabulary continuous speech recognition (LVCSR) system. The proposed method uses a hybrid system combining words with data-driven, variable-length sub-word units. Using a single feature, the posterior probability of sub-word units, this method outperforms existing methods published in the literature. We also present a recipe for discriminatively training a hybrid language model to improve the OOV detection rate. Results are presented on the RT04 broadcast news task.
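The single-feature detector can be sketched as follows, under the assumption that decoding with the hybrid (word plus sub-word) system yields, for each time region, a posterior mass assigned to sub-word units; regions where that mass exceeds a threshold are flagged as likely OOV. The data layout and threshold here are illustrative, not the paper's configuration.

```python
# Flag time regions whose sub-word posterior mass exceeds a threshold.
def flag_oov_regions(regions, threshold=0.5):
    """regions: list of dicts like {'start': t0, 'end': t1,
    'subword_posterior': p}; returns the regions flagged as OOV."""
    return [r for r in regions if r['subword_posterior'] > threshold]

regions = [
    {'start': 0.0, 'end': 0.4, 'subword_posterior': 0.05},  # in-vocabulary
    {'start': 0.4, 'end': 0.9, 'subword_posterior': 0.82},  # likely OOV
]
print(flag_oov_regions(regions))
```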
international conference on acoustics, speech, and signal processing | 2010
Nagendra Goel; Samuel Thomas; Mohit Agarwal; Pinar Akyazi; Lukas Burget; Kai Feng; Arnab Ghoshal; Ondřej Glembek; Martin Karafiát; Daniel Povey; Ariya Rastrow; Richard C. Rose; Petr Schwarz
Preparation of a lexicon for speech recognition systems can be a significant effort in languages where the written form is not exactly phonetic. On the other hand, in languages where the written form is quite phonetic, some common words are often mispronounced. In this paper, we use a combination of lexicon learning techniques to explore whether a lexicon can be learned when only a small lexicon is available for bootstrapping. We find that for a phonetic language such as Spanish, the learned lexicon outperforms generic rules and hand-crafted pronunciations. For a more complex language such as English, learning is still possible, but with some loss of accuracy.
international conference on acoustics, speech, and signal processing | 2010
Arnab Ghoshal; Daniel Povey; Mohit Agarwal; Pinar Akyazi; Lukas Burget; Kai Feng; Ondřej Glembek; Nagendra Goel; Martin Karafiát; Ariya Rastrow; Richard C. Rose; Petr Schwarz; Samuel Thomas
In this paper we present a novel approach for estimating feature-space maximum likelihood linear regression (fMLLR) transforms for full-covariance Gaussian models by directly maximizing the likelihood function by repeated line search in the direction of the gradient. We do this in a pre-transformed parameter space such that an approximation to the expected Hessian is proportional to the unit matrix. The proposed algorithm is as efficient or more efficient than standard approaches, and is more flexible because it can naturally be combined with sets of basis transforms and with full covariance and subspace precision and mean (SPAM) models.
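The optimization style described above (repeated line search along the gradient, in a space where the expected Hessian is close to the identity) is sketched below on a stand-in objective. The actual fMLLR auxiliary function over the transform matrix is not reproduced; `objective` and `gradient` are placeholders for it, and the step-size grid is an arbitrary choice.

```python
# Gradient ascent with a scalar line search over candidate step sizes.
import numpy as np

def line_search_ascent(objective, gradient, x0, num_iters=20, step_grid=None):
    """Maximize `objective` by stepping along the gradient, picking the best
    step size from a grid at each iteration."""
    if step_grid is None:
        step_grid = [2.0 ** k for k in range(-8, 3)]
    x = x0.copy()
    for _ in range(num_iters):
        g = gradient(x)
        best = max(step_grid, key=lambda s: objective(x + s * g))
        if objective(x + best * g) <= objective(x):
            break  # no improving step found
        x = x + best * g
    return x

# Toy check: a concave quadratic with optimum at [1, -2].
opt = np.array([1.0, -2.0])
f = lambda x: -np.sum((x - opt) ** 2)
df = lambda x: -2.0 * (x - opt)
print(line_search_ascent(f, df, np.zeros(2)))  # ~ [1, -2]
```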
international conference on acoustics, speech, and signal processing | 2011
Ariya Rastrow; Markus Dreyer; Abhinav Sethy; Sanjeev Khudanpur; Bhuvana Ramabhadran; Mark Dredze
We describe a new approach for rescoring speech lattices (with long-span language models or wide-context acoustic models) that does not entail computationally intensive lattice expansion or limited rescoring of only an N-best list. We view the set of word-sequences in a lattice as a discrete space equipped with the edit-distance metric, and develop a hill climbing technique that starts with, say, the 1-best hypothesis under the lattice-generating model(s) and iteratively searches a local neighborhood for the highest-scoring hypothesis under the rescoring model(s); such neighborhoods are efficiently constructed via finite state techniques. We demonstrate empirically that, to achieve the same reduction in error rate with a better estimated, higher order language model, our technique evaluates two orders of magnitude fewer utterance-length hypotheses than conventional N-best rescoring. For the same number of hypotheses evaluated, our technique results in a significantly lower error rate.
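The hill-climbing loop can be sketched compactly: start from the first-pass 1-best, generate a local neighborhood of word sequences (in the paper these are edit-distance neighbors constrained to the lattice, built with finite-state tools), and move to the best neighbor under the rescoring model until no neighbor improves. The neighborhood generator and the "rescoring model" below are toy stand-ins, not the paper's components.

```python
# Generic hill climbing over hypotheses with a pluggable neighborhood/scorer.
def hill_climb(initial_hyp, neighborhood_fn, rescore_fn, max_iters=100):
    """Iteratively move to the highest-scoring neighbor under rescore_fn."""
    current, current_score = initial_hyp, rescore_fn(initial_hyp)
    for _ in range(max_iters):
        neighbors = neighborhood_fn(current)
        if not neighbors:
            break
        best = max(neighbors, key=rescore_fn)
        if rescore_fn(best) <= current_score:
            break  # local optimum under the rescoring model
        current, current_score = best, rescore_fn(best)
    return current, current_score

# Toy usage: neighbors substitute one word from a small confusion set.
confusions = {'wreck': ['recognize'], 'nice': ['speech']}
def neighbors(hyp):
    out = []
    for i, w in enumerate(hyp):
        for alt in confusions.get(w, []):
            out.append(hyp[:i] + (alt,) + hyp[i + 1:])
    return out

rescore = lambda h: sum(len(w) for w in h)  # stand-in "better model" score
print(hill_climb(('wreck', 'a', 'nice', 'beach'), neighbors, rescore))
```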
Speech Communication | 2012
Gakuto Kurata; Abhinav Sethy; Bhuvana Ramabhadran; Ariya Rastrow; Nobuyasu Itoh; Masafumi Nishimura
Recently proposed methods for discriminative language modeling require alternate hypotheses in the form of lattices or N-best lists. These are usually generated by an Automatic Speech Recognition (ASR) system on the same speech data used to train the system. This requirement restricts the scope of these methods to corpora where both the acoustic material and the corresponding true transcripts are available. Typically, the text data available for language model (LM) training is an order of magnitude larger than manually transcribed speech. This paper provides a general framework to take advantage of this volume of textual data in the discriminative training of language models. We propose to generate probable N-best lists directly from the text material, which resemble the N-best lists produced by an ASR system by incorporating phonetic confusability estimated from the acoustic model of the ASR system. We present experiments with Japanese spontaneous lecture speech data, which demonstrate that discriminative LM training with the proposed framework is effective and provides modest gains in ASR accuracy.
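A hedged sketch of the idea: starting from text-only transcripts, synthesize competitor hypotheses by injecting substitutions drawn from a phonetic confusion table (in the paper this confusability is estimated from the ASR system's acoustic model). The confusion table and enumeration scheme below are purely illustrative.

```python
# Build a pseudo N-best list from text by applying confusable substitutions.
import itertools

# Hypothetical confusion table: word -> acoustically similar alternatives.
confusion_table = {'their': ['there'], 'two': ['to', 'too'], 'ship': ['sheep']}

def pseudo_nbest(transcript, max_hyps=8):
    """Enumerate hypotheses differing from the transcript by confusable
    substitutions, mimicking an ASR N-best list without decoding audio."""
    options = [[w] + confusion_table.get(w, []) for w in transcript]
    return [list(h) for h in itertools.product(*options)][:max_hyps]

for hyp in pseudo_nbest(['their', 'two', 'ship']):
    print(' '.join(hyp))
```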
ieee automatic speech recognition and understanding workshop | 2009
Ariya Rastrow; Abhinav Sethy; Bhuvana Ramabhadran
In this paper, we present a novel version of discriminative training for N-gram language models. Language models impose language-specific constraints on the acoustic hypothesis and are crucial in discriminating between competing acoustic hypotheses. As reported in the literature, discriminative training of acoustic models has yielded significant improvements in the performance of a speech recognition system; however, discriminative training of N-gram language models (LMs) has not yielded the same impact. In this paper, we present three techniques to improve the discriminative training of LMs: updating the back-off probability of unseen events, normalizing the N-gram updates to ensure a valid probability distribution, and applying a relative-entropy-based global constraint on the N-gram probability updates. We also present a framework for discriminative adaptation of LMs to a new domain and compare it to existing linear interpolation methods. Results are reported on the Broadcast News and MIT lecture corpora. A modest improvement of 0.2% absolute (on Broadcast News) and 0.3% absolute (on MIT lectures) was observed with discriminatively trained LMs over state-of-the-art systems.
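One of the three ingredients, renormalizing the N-gram probabilities after discriminative updates so the model remains a proper distribution, can be sketched as below. The update magnitudes and the toy bigram table are illustrative; back-off handling and the relative-entropy constraint from the paper are omitted.

```python
# Apply additive updates to selected bigram probabilities, then renormalize
# over each history so every conditional distribution sums to 1.
def apply_updates_and_normalize(bigram_probs, updates):
    """bigram_probs: {history: {word: prob}}; updates: {(history, word): delta}."""
    for (hist, word), delta in updates.items():
        bigram_probs[hist][word] = max(bigram_probs[hist][word] + delta, 1e-8)
    for dist in bigram_probs.values():
        total = sum(dist.values())
        for word in dist:
            dist[word] /= total
    return bigram_probs

lm = {'the': {'cat': 0.6, 'hat': 0.4}}
print(apply_updates_and_normalize(lm, {('the', 'cat'): 0.1, ('the', 'hat'): -0.05}))
```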
conference of the international speech communication association | 2016
Faisal Ladhak; Ankur Gandhe; Markus Dreyer; Lambert Mathias; Ariya Rastrow; Björn Hoffmeister
We present a new model called LATTICERNN, which generalizes recurrent neural networks (RNNs) to process weighted lattices as input instead of sequences. A LATTICERNN can encode the complete structure of a lattice into a dense representation, which makes it suitable for a variety of problems, including rescoring, classifying, parsing, or translating lattices using deep neural networks (DNNs). In this paper, we use LATTICERNNs for a classification task: each lattice represents the output from an automatic speech recognition (ASR) component of a spoken language understanding (SLU) system, and we classify the intent of the spoken utterance based on the lattice embedding computed by a LATTICERNN. We show that making decisions based on the full ASR output lattice, as opposed to 1-best or n-best hypotheses, makes SLU systems more robust to ASR errors. Our experiments yield improvements of 13% over a baseline RNN system trained on transcriptions and 10% over an n-best list rescoring system for intent classification.
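A minimal numpy sketch of the forward pass follows, under the assumption that the lattice is given as arcs (source node, target node, word id, posterior weight) over topologically ordered nodes: each arc applies a simple recurrent cell to its source-node state, each node pools its incoming arc states weighted by the arc posteriors, and the final node's state serves as the lattice embedding. Cell type, pooling rule, and sizes are illustrative choices, not the paper's architecture.

```python
# Toy lattice encoder: propagate states through arcs, pool at each node.
import numpy as np

rng = np.random.default_rng(2)
hidden, vocab = 16, 10
E = rng.standard_normal((vocab, hidden)) * 0.1   # word embeddings
Wh = rng.standard_normal((hidden, hidden)) * 0.1 # recurrent weights
Wx = rng.standard_normal((hidden, hidden)) * 0.1 # input weights

def cell(h_prev, word_id):
    return np.tanh(h_prev @ Wh + E[word_id] @ Wx)

def lattice_rnn_embed(num_nodes, arcs):
    """arcs: list of (src, dst, word_id, posterior); nodes 0..num_nodes-1 are
    assumed topologically ordered; node 0 is the start, the last is final."""
    node_state = [np.zeros(hidden) for _ in range(num_nodes)]
    incoming = [[] for _ in range(num_nodes)]
    for arc in arcs:
        incoming[arc[1]].append(arc)
    for n in range(1, num_nodes):
        if not incoming[n]:
            continue
        # Each incoming arc runs the cell on its (already finalized) source
        # state; the node state is the posterior-weighted mean of arc states.
        states = [(post, cell(node_state[src], wid))
                  for src, _, wid, post in incoming[n]]
        total = sum(w for w, _ in states)
        node_state[n] = sum(w * h for w, h in states) / total
    return node_state[-1]

# Two parallel (confusable) arcs merging before the final node.
arcs = [(0, 1, 3, 0.7), (0, 1, 4, 0.3), (1, 2, 7, 1.0)]
print(lattice_rnn_embed(3, arcs).shape)  # (16,)
```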