Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Vijayaditya Peddinti is active.

Publication


Featured research published by Vijayaditya Peddinti.


Conference of the International Speech Communication Association | 2016

Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI.

Daniel Povey; Vijayaditya Peddinti; Daniel Galvez; Pegah Ghahremani; Vimal Manohar; Xingyu Na; Yiming Wang; Sanjeev Khudanpur

In this paper we describe a method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training. We use the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI. To make its computation feasible we use a phone n-gram language model in place of the word language model. To further reduce its space and time complexity we compute the objective function using neural network outputs at one third the standard frame rate. These changes enable us to perform the computation for the forward-backward algorithm on GPUs. Further, the reduced output frame rate also provides a significant speed-up during decoding. We present results on 5 different LVCSR tasks with training data ranging from 100 to 2100 hours. Models trained with LF-MMI provide a relative word error rate reduction of ∼11.5% over those trained with the cross-entropy objective function, and ∼8% over those trained with cross-entropy and sMBR objective functions. A further relative reduction of ∼2.5% can be obtained by fine-tuning these models with the word-lattice based sMBR objective function.
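
The objective itself is straightforward to state: for each utterance, LF-MMI maximizes the log-probability of a numerator graph constrained to the reference transcript minus the log-probability of a denominator graph built from the phone n-gram language model. A minimal Python sketch of that difference is below; `forward_backward_logprob`, the graph objects, and the array shapes are hypothetical placeholders, not the Kaldi implementation the paper describes.

```python
def lf_mmi_objective(log_likes, num_graph, den_graph, forward_backward_logprob):
    """Toy view of the LF-MMI objective for one utterance.

    log_likes : (T, P) array of neural-network log-likelihoods at the
                reduced (one-third) output frame rate.
    num_graph : numerator graph encoding the reference transcript.
    den_graph : denominator graph built from a phone n-gram LM.
    forward_backward_logprob : hypothetical helper that runs the
                forward-backward algorithm over a graph and returns
                the total log-probability.
    """
    num_logprob = forward_backward_logprob(log_likes, num_graph)
    den_logprob = forward_backward_logprob(log_likes, den_graph)
    # MMI favors the reference transcript relative to all competing
    # sequences allowed by the denominator graph.
    return num_logprob - den_logprob
```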


International Conference on Acoustics, Speech, and Signal Processing | 2013

A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition

Aren Jansen; Emmanuel Dupoux; Sharon Goldwater; Mark Johnson; Sanjeev Khudanpur; Kenneth Church; Naomi H. Feldman; Hynek Hermansky; Florian Metze; Richard C. Rose; Michael L. Seltzer; Pascal Clark; Ian McGraw; Balakrishnan Varadarajan; Erin Bennett; Benjamin Börschinger; Justin Chiu; Ewan Dunbar; Abdellah Fourtassi; David F. Harwath; Chia-ying Lee; Keith Levin; Atta Norouzian; Vijayaditya Peddinti; Rachael Richardson; Thomas Schatz; Samuel Thomas

We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies.


International Conference on Acoustics, Speech, and Signal Processing | 2013

Mean temporal distance: Predicting ASR error from temporal properties of speech signal

Hynek Hermansky; Ehsan Variani; Vijayaditya Peddinti

Extending previous work on predicting phoneme recognition error from unlabeled data corrupted by unpredictable factors, the current work investigates a simple but effective method of estimating ASR performance by computing a function M(Δt), which represents the mean distance between speech feature vectors evaluated over a certain finite time interval, as a function of the temporal distance Δt between the vectors. It is shown that M(Δt) is a function of the signal-to-noise ratio of the speech signal. Comparing M(Δt) curves derived on the data used for training the classifier and on test utterances allows for predicting the error on the test data. Another interesting observation is that M(Δt) remains approximately constant as the temporal separation Δt exceeds a certain critical interval (about 200 ms), indicating the extent of coarticulation in speech sounds.
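
As an illustration, the sketch below computes an M(Δt) curve from a matrix of frame-level feature vectors. The Euclidean distance is an assumption for this example (the paper's feature representation and distance measure may differ), and the function name is hypothetical.

```python
import numpy as np

def mean_temporal_distance(feats, max_lag):
    """M(dt): mean distance between feature vectors separated by dt
    frames, for dt = 1 .. max_lag.

    feats : (T, D) array of per-frame speech features.
    Returns an array of length max_lag.
    """
    T = feats.shape[0]
    m = np.zeros(max_lag)
    for dt in range(1, max_lag + 1):
        diffs = feats[dt:] - feats[:T - dt]           # pairs dt frames apart
        m[dt - 1] = np.mean(np.linalg.norm(diffs, axis=1))
    return m
```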


International Conference on Acoustics, Speech, and Signal Processing | 2017

A study on data augmentation of reverberant speech for robust speech recognition

Tom Ko; Vijayaditya Peddinti; Daniel Povey; Michael L. Seltzer; Sanjeev Khudanpur

The environmental robustness of DNN-based acoustic models can be significantly improved by using multi-condition training data. However, as data collection is a costly proposition, simulation of the desired conditions is a frequently adopted strategy. In this paper we detail a data augmentation approach for far-field ASR. We examine the impact of using simulated room impulse responses (RIRs), as real RIRs can be difficult to acquire, and also the effect of adding point-source noises. We find that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added. Further we show that the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario. We evaluate our approach on several LVCSR tasks which can adequately represent both scenarios.
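
A minimal sketch of this style of simulation, assuming NumPy/SciPy and single-channel signals, is shown below. In the full setup the point-source noise would itself be placed in the room (convolved with an RIR from the noise position); this simplified version omits that step.

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate_and_add_noise(speech, rir, noise, snr_db):
    """Simulate a far-field recording: convolve clean speech with a room
    impulse response and add a point-source noise at a target SNR.

    speech, rir, noise : 1-D float arrays at the same sample rate.
    snr_db             : desired speech-to-noise ratio in dB.
    """
    # Reverberant speech: linear convolution with the RIR.
    reverberant = fftconvolve(speech, rir)[: len(speech)]

    # Tile/truncate the noise and scale it to reach the requested SNR.
    noise = np.resize(noise, reverberant.shape)
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return reverberant + scale * noise
```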


IEEE Signal Processing Letters | 2018

Low Latency Acoustic Modeling Using Temporal Convolution and LSTMs

Vijayaditya Peddinti; Yiming Wang; Daniel Povey; Sanjeev Khudanpur

Bidirectional long short-term memory (BLSTM) acoustic models provide a significant word error rate reduction compared to their unidirectional counterpart, as they model both the past and future temporal contexts. However, it is nontrivial to deploy bidirectional acoustic models for online speech recognition due to an increase in latency. In this letter, we propose the use of temporal convolution, in the form of time-delay neural network (TDNN) layers, along with unidirectional LSTM layers to limit the latency to 200 ms. This architecture has been shown to outperform the state-of-the-art low frame rate (LFR) BLSTM models. We further improve these LFR BLSTM acoustic models by operating them at higher frame rates at lower layers and show that the proposed model performs similar to these mixed frame rate BLSTMs. We present results on the Switchboard 300 h LVCSR task and the AMI LVCSR task, in the three microphone conditions.
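
A minimal PyTorch sketch of the overall layer pattern follows; the layer sizes, dilations, and target count are illustrative placeholders, not the configuration reported in the letter.

```python
import torch.nn as nn

class TDNNLSTM(nn.Module):
    """Sketch of a TDNN-LSTM acoustic model: dilated 1-D convolutions
    (time-delay layers) with limited temporal context, followed by
    unidirectional LSTM layers, so total right context, and hence
    latency, stays bounded."""

    def __init__(self, feat_dim=40, hidden_dim=512, num_targets=6000):
        super().__init__()
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=3, dilation=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, dilation=3),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=2,
                            batch_first=True)
        self.output = nn.Linear(hidden_dim, num_targets)

    def forward(self, feats):
        # feats: (batch, time, feat_dim)
        x = self.tdnn(feats.transpose(1, 2))    # (batch, hidden, time')
        x, _ = self.lstm(x.transpose(1, 2))     # (batch, time', hidden)
        return self.output(x)                   # per-frame target scores
```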


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs

Vijayaditya Peddinti; Guoguo Chen; Vimal Manohar; Tom Ko; Daniel Povey; Sanjeev Khudanpur

Multi-style training, using data which emulates a variety of possible test scenarios, is a popular approach towards robust acoustic modeling. However, acoustic models capable of exploiting large amounts of training data in a comparatively short amount of training time are essential. In this paper we tackle the problem of reverberant speech recognition using 5500 hours of simulated reverberant data. We use a time-delay neural network (TDNN) architecture, which is capable of tackling long-term interactions between speech and corrupting sources in reverberant environments. By sub-sampling the outputs at TDNN layers across time steps, training time is substantially reduced. Combining this with distributed optimization, we show that the TDNN can be trained in 3 days using up to 32 GPUs. Further, iVectors are used as an input to the neural network to perform instantaneous speaker and environment adaptation. Finally, recurrent neural network language models are applied to the lattices to further improve the performance. Our system is shown to provide state-of-the-art results in the IARPA ASpIRE challenge, with 26.5% WER on the dev test set.
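
The sub-sampling trick can be illustrated in a few lines of Python: when a layer splices only a sparse set of temporal offsets, only a small subset of lower-layer frames ever needs to be computed. The offsets below are illustrative, not the exact ASpIRE configuration.

```python
def required_frames(output_frames, layer_contexts):
    """Propagate frame requirements from the output layer downwards.

    output_frames  : frames at which the top layer must produce output.
    layer_contexts : per-layer splicing offsets, listed from the top
                     layer down (illustrative values below).
    """
    needed = set(output_frames)
    for offsets in layer_contexts:
        needed = {t + o for t in needed for o in offsets}
    return sorted(needed)

# With sparse splicing, one output frame touches only 12 input frames
# instead of a full, densely spliced window.
print(required_frames([0], [[-7, 2], [-3, 3], [-1, 0, 1]]))
```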


International Conference on Acoustics, Speech, and Signal Processing | 2014

Deep Scattering Spectrum with deep neural networks

Vijayaditya Peddinti; Tara N. Sainath; Shay Maymon; Bhuvana Ramabhadran; David Nahamoo; Vaibhava Goel

State-of-the-art convolutional neural networks (CNNs) typically use a log-mel spectral representation of the speech signal. However, this representation is limited by the spectro-temporal resolution afforded by log-mel filter-banks. A novel technique known as Deep Scattering Spectrum (DSS) addresses this limitation and preserves higher resolution information, while ensuring time warp stability, through the cascaded application of the wavelet-modulus operator. The first order scatter is equivalent to log-mel features and standard CNN modeling techniques can directly be used with these features. However the higher order scatter, which preserves the higher resolution information, presents new challenges in modelling. This paper explores how to effectively use DSS features with CNN acoustic models. Specifically, we identify the effective normalization, neural network topology and regularization techniques to effectively model higher order scatter. The use of these higher order scatter features, in conjunction with CNNs, results in relative improvement of 7% compared to log-mel features on TIMIT, providing a phonetic error rate (PER) of 17.4%, one of the lowest reported PERs to date on this task.
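
For readers who want to inspect such features, the snippet below computes first- and second-order scattering coefficients of a toy signal with the open-source kymatio library. This is an independent illustration of the transform, not the toolkit or settings used in the paper, and the J/Q values are arbitrary.

```python
import numpy as np
from kymatio.numpy import Scattering1D

N = 2 ** 14                                   # signal length in samples
x = np.random.randn(N)                        # stand-in for a speech segment

scattering = Scattering1D(J=8, shape=N, Q=8)  # J: max scale, Q: wavelets/octave
Sx = scattering(x)
meta = scattering.meta()

order1 = Sx[meta['order'] == 1]  # roughly comparable to log-mel energies
order2 = Sx[meta['order'] == 2]  # higher-resolution modulation detail
print(order1.shape, order2.shape)
```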


International Conference on Acoustics, Speech, and Signal Processing | 2015

Towards machines that know when they do not know: Summary of work done at 2014 Frederick Jelinek Memorial Workshop

Hynek Hermansky; Lukas Burget; Jordan Cohen; Emmanuel Dupoux; Naomi H. Feldman; John J. Godfrey; Sanjeev Khudanpur; Matthew Maciejewski; Sri Harish Mallidi; Anjali Menon; Tetsuji Ogawa; Vijayaditya Peddinti; Richard C. Rose; Richard M. Stern; Matthew Wiesner; Karel Vesely

A group of junior and senior researchers gathered as a part of the 2014 Frederick Jelinek Memorial Workshop in Prague to address the problem of predicting the accuracy of a nonlinear Deep Neural Network probability estimator for unknown data in a different application domain from the domain in which the estimator was trained. The paper describes the problem and summarizes the approaches taken by the group.


Conference of the International Speech Communication Association | 2016

Far-Field ASR Without Parallel Data.

Vijayaditya Peddinti; Vimal Manohar; Yiming Wang; Daniel Povey; Sanjeev Khudanpur

In far-field speech recognition systems, training acoustic models with alignments generated from parallel close-talk microphone data provides significant improvements. However, it is not practical to assume the availability of large corpora of parallel close-talk microphone data for training. In this paper we explore methods to reduce the performance gap between far-field ASR systems trained with alignments from distant microphone data and those trained with alignments from parallel close-talk microphone data. These methods include the use of a lattice-free sequence objective function which tolerates minor mis-alignment errors, and the use of data selection techniques to discard badly aligned data. We present results on single distant microphone and multiple distant microphone scenarios of the AMI LVCSR task. We identify prominent causes of alignment errors in AMI data.
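
The data-selection step can be as simple as thresholding a per-utterance alignment-quality score; the sketch below is a hypothetical illustration (the score, field names, and threshold are not taken from the paper).

```python
def select_well_aligned(utterances, min_avg_loglike=-0.5):
    """Keep utterances whose average per-frame alignment log-likelihood
    exceeds a threshold; badly aligned utterances are discarded.

    utterances : iterable of (utt_id, avg_frame_loglike) pairs.
    """
    return [utt_id for utt_id, avg_loglike in utterances
            if avg_loglike > min_avg_loglike]
```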


International Conference on Acoustics, Speech, and Signal Processing | 2013

Filter-bank optimization for Frequency Domain Linear Prediction

Vijayaditya Peddinti; Hynek Hermansky

The sub-band Frequency Domain Linear Prediction (FDLP) technique estimates autoregressive models of the Hilbert envelopes of sub-band signals from windowed segments of the discrete cosine transform (DCT) of a speech signal. The shapes of the windows and their positions on the cosine transform of the signal determine the implied filtering of the signal. Thus, the choice of shape, position and number of these windows can be critical for the performance of the FDLP technique. So far, Gaussian or rectangular windows have been used. In this paper, asymmetric cochlear-like filters are studied. Further, a frequency differentiation operation, which introduces an additional set of parameters describing the local spectral slope in each frequency sub-band, is introduced to increase the robustness of the sub-band envelopes in noise. The performance gains achieved by these changes are reported in a variety of additive noise conditions, with an average relative improvement of 8.04% in phoneme recognition accuracy.
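
The basic FDLP recipe, before the cochlear-like windows and frequency differentiation studied here, can be sketched as follows with NumPy/SciPy; the band boundaries, model order, and rectangular window are illustrative choices.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def fdlp_subband_envelope(signal, band, model_order=20, n_points=512):
    """Rough FDLP sketch: window a sub-band of the DCT of the signal,
    fit an autoregressive (linear prediction) model to it, and read the
    sub-band temporal (Hilbert) envelope estimate off the AR model's
    power response.  band = (start, end) indices into the DCT.
    """
    coeffs = dct(signal, type=2, norm='ortho')
    sub = coeffs[band[0]:band[1]]                 # rectangular window

    # Autocorrelation-method linear prediction on the DCT segment.
    r = np.correlate(sub, sub, mode='full')[len(sub) - 1:]
    a = solve_toeplitz(r[:model_order], -r[1:model_order + 1])
    a = np.concatenate(([1.0], a))
    gain = r[0] + np.dot(a[1:], r[1:model_order + 1])

    # The AR model's power response approximates the temporal envelope.
    freqs = np.arange(n_points) / (2.0 * n_points)
    A = np.exp(-2j * np.pi * np.outer(freqs, np.arange(model_order + 1))) @ a
    return gain / np.abs(A) ** 2
```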

Collaboration


Dive into Vijayaditya Peddinti's collaborations.

Top Co-Authors

Daniel Povey (Johns Hopkins University)
Vimal Manohar (Johns Hopkins University)
Yiming Wang (Johns Hopkins University)
Emmanuel Dupoux (École Normale Supérieure)
Thomas Schatz (Centre national de la recherche scientifique)