
Publication


Featured research published by Vimal Manohar.


Conference of the International Speech Communication Association | 2016

Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI.

Daniel Povey; Vijayaditya Peddinti; Daniel Galvez; Pegah Ghahremani; Vimal Manohar; Xingyu Na; Yiming Wang; Sanjeev Khudanpur

In this paper we describe a method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training. We use the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI. To make its computation feasible, we use a phone n-gram language model in place of the word language model. To further reduce its space and time complexity, we compute the objective function using neural network outputs at one third the standard frame rate. These changes enable us to perform the computation for the forward-backward algorithm on GPUs. Further, the reduced output frame rate also provides a significant speed-up during decoding. We present results on 5 different LVCSR tasks with training data ranging from 100 to 2100 hours. Models trained with LF-MMI provide a relative word error rate reduction of ∼11.5% over those trained with the cross-entropy objective function, and ∼8% over those trained with cross-entropy and sMBR objective functions. A further relative reduction of ∼2.5% can be obtained by fine-tuning these models with the word-lattice-based sMBR objective function.
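
For reference, the MMI criterion mentioned above can be written in its usual form (notation is mine, not taken from the paper; acoustic scaling is omitted for simplicity). In the lattice-free variant, the denominator sum runs over a graph built from the phone n-gram language model rather than over word lattices:

\[
  \mathcal{F}_{\mathrm{MMI}}(\theta) \;=\; \sum_{u} \log
  \frac{p_\theta(\mathbf{O}_u \mid \mathcal{M}_{w_u})\, P(w_u)}
       {\sum_{w} p_\theta(\mathbf{O}_u \mid \mathcal{M}_{w})\, P(w)},
\]

where \(\mathbf{O}_u\) are the acoustic frames of utterance \(u\), \(w_u\) is its reference transcript, \(\mathcal{M}_w\) is the HMM corresponding to sequence \(w\), and \(P(w)\) is the language-model probability.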


Spoken Language Technology Workshop | 2014

A keyword search system using open source software

Jan Trmal; Guoguo Chen; Daniel Povey; Sanjeev Khudanpur; Pegah Ghahremani; Xiaohui Zhang; Vimal Manohar; Chunxi Liu; Aren Jansen; Dietrich Klakow; David Yarowsky; Florian Metze

This paper provides an overview of a speech-to-text (STT) and keyword search (KWS) system architecture built primarily on top of the Kaldi toolkit and expands on a few highlights. The system was developed as part of the research efforts of the Radical team while participating in the IARPA Babel program. Our aim was to develop a general system pipeline which could be easily and rapidly deployed in any language, independently of the language script and of the phonological and linguistic features of the language.
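
As a rough illustration of the keyword-search side of such a pipeline (a minimal sketch with invented names, not the Kaldi/Babel implementation), the KWS stage can be thought of as a lookup into an index of word hypotheses extracted from the STT lattices:

from collections import defaultdict

def build_index(lattice_hits):
    """lattice_hits: iterable of (word, utt_id, start_sec, dur_sec, posterior)."""
    index = defaultdict(list)
    for word, utt_id, start, dur, post in lattice_hits:
        index[word].append((utt_id, start, dur, post))
    return index

def search(index, keyword, threshold=0.1):
    """Return all hits for `keyword` whose lattice posterior clears the threshold."""
    return [hit for hit in index.get(keyword, []) if hit[3] >= threshold]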


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs

Vijayaditya Peddinti; Guoguo Chen; Vimal Manohar; Tom Ko; Daniel Povey; Sanjeev Khudanpur

Multi-style training, using data which emulates a variety of possible test scenarios, is a popular approach towards robust acoustic modeling. However, acoustic models capable of exploiting large amounts of training data in a comparatively short amount of training time are essential. In this paper we tackle the problem of reverberant speech recognition using 5500 hours of simulated reverberant data. We use a time-delay neural network (TDNN) architecture, which is capable of tackling long-term interactions between speech and corrupting sources in reverberant environments. By sub-sampling the outputs at TDNN layers across time steps, training time is substantially reduced. Combining this with distributed optimization, we show that the TDNN can be trained in 3 days using up to 32 GPUs. Further, iVectors are used as an input to the neural network to perform instantaneous speaker and environment adaptation. Finally, recurrent neural network language models are applied to the lattices to further improve the performance. Our system is shown to provide state-of-the-art results in the IARPA ASpIRE challenge, with 26.5% WER on the dev test set.
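
The sub-sampled TDNN idea can be approximated with dilated 1-D convolutions, as in the PyTorch sketch below (my own simplification with assumed layer sizes, not the Kaldi nnet3 implementation). The dilated taps stand in for splicing frames at spaced offsets; in the actual system, activations between those offsets are simply never computed, which is where the training-time saving comes from.

import torch.nn as nn

class TDNNSketch(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, num_targets=4000):
        super().__init__()
        self.net = nn.Sequential(
            # layer 1: a +/-2 frame context window, computed at every frame
            nn.Conv1d(feat_dim, hidden, kernel_size=5), nn.ReLU(),
            # layer 2: two taps spaced 3 frames apart (loosely mimicking a {-1, 2} splice)
            nn.Conv1d(hidden, hidden, kernel_size=2, dilation=3), nn.ReLU(),
            # layer 3: two taps spaced 6 frames apart, widening the temporal context
            nn.Conv1d(hidden, hidden, kernel_size=2, dilation=6), nn.ReLU(),
        )
        # per-frame acoustic-state scores
        self.output = nn.Conv1d(hidden, num_targets, kernel_size=1)

    def forward(self, feats):
        # feats: (batch, feat_dim, time) -> (batch, num_targets, reduced time)
        return self.output(self.net(feats))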


International Conference on Acoustics, Speech, and Signal Processing | 2016

Adapting ASR for under-resourced languages using mismatched transcriptions

Chunxi Liu; Preethi Jyothi; Hao Tang; Vimal Manohar; Rose Sloan; Tyler Kekona; Mark Hasegawa-Johnson; Sanjeev Khudanpur

Mismatched transcriptions of speech in a target language refer to transcriptions provided by people unfamiliar with the language, using English letter sequences. In this work, we demonstrate the value of such transcriptions in building an ASR system for the target language. For different languages, we use less than an hour of mismatched transcriptions to successfully adapt baseline multilingual models built with no access to native transcriptions in the target language. The adapted models provide up to 25% relative improvement in phone error rates on an unseen evaluation set.


Conference of the International Speech Communication Association | 2016

Acoustic Modelling from the Signal Domain Using CNNs.

Pegah Ghahremani; Vimal Manohar; Daniel Povey; Sanjeev Khudanpur

Most speech recognition systems use spectral features based on fixed filters, such as MFCC and PLP. In this paper, we show that it is possible to achieve state-of-the-art results by making the feature extractor a part of the network and jointly optimizing it with the rest of the network. The basic approach is to start with a convolutional layer that operates on the signal (say, with a step size of 1.25 milliseconds), aggregate the filter outputs over a portion of the time axis using a network-in-network architecture, and then down-sample to every 10 milliseconds for use by the rest of the network. We find that, unlike some previous work on learned feature extractors, the objective function converges as fast as for a network based on traditional features. Because we found that iVector adaptation is less effective in this framework, we also experiment with a different adaptation method that is part of the network, where activation statistics over a medium time span (around a second) are computed at intermediate layers. We find that the resulting ‘direct-from-signal’ network is competitive with our state-of-the-art networks based on conventional features with iVector adaptation.
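
To make the front-end concrete: at a 16 kHz sampling rate, a 1.25 ms step corresponds to 20 samples, and 8 such steps make one 10 ms frame. The sketch below illustrates this with assumed parameter values; the paper's network-in-network aggregation is replaced here by plain average pooling, so it is only a simplification of the idea, not the paper's model.

import torch
import torch.nn as nn

class WaveFrontEnd(nn.Module):
    def __init__(self, num_filters=100, win_samples=400, hop_samples=20):
        super().__init__()
        # convolutional "filterbank" applied directly to the raw signal, 1.25 ms hop
        self.filters = nn.Conv1d(1, num_filters, kernel_size=win_samples,
                                 stride=hop_samples)
        # down-sample the 1.25 ms outputs to one feature vector every 10 ms
        self.pool = nn.AvgPool1d(kernel_size=8, stride=8)

    def forward(self, wav):
        # wav: (batch, 1, samples) -> (batch, num_filters, frames at 10 ms)
        return self.pool(torch.relu(self.filters(wav)))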


IEEE Transactions on Audio, Speech, and Language Processing | 2017

ASR for Under-Resourced Languages From Probabilistic Transcription

Mark Hasegawa-Johnson; Preethi Jyothi; Daniel McCloy; Majid Mirbagheri; Giovanni M. Di Liberto; Amit Das; Bradley Ekin; Chunxi Liu; Vimal Manohar; Hao Tang; Edmund C. Lalor; Nancy F. Chen; Paul Hager; Tyler Kekona; Rose Sloan; Adrian Lee

In many under-resourced languages it is possible to find text, and it is possible to find speech, but transcribed speech suitable for training automatic speech recognition (ASR) is unavailable. In the absence of native transcripts, this paper proposes the use of a probabilistic transcript: a probability mass function over possible phonetic transcripts of the waveform. Three sources of probabilistic transcripts are demonstrated. First, self-training is a well-established semi-supervised learning technique, in which a cross-lingual ASR first labels unlabeled speech, and is then adapted using the same labels. Second, mismatched crowdsourcing is a recent technique in which non-speakers of the language are asked to write what they hear, and their nonsense transcripts are decoded using noisy-channel models of second-language speech perception. Third, EEG distribution coding is a new technique in which non-speakers of the language listen to it, and their electrocortical response signals are interpreted to indicate probabilities. ASR was trained in four languages without native transcripts. Adaptation using mismatched crowdsourcing significantly outperformed self-training, and both significantly outperformed a cross-lingual baseline. Both EEG distribution coding and text-derived phone language models were shown to improve the quality of probabilistic transcripts derived from mismatched crowdsourcing.
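
The noisy-channel decoding used for mismatched crowdsourcing can be summarized as follows (my notation, not the paper's):

\[
  P(\pi \mid T) \;\propto\; P(T \mid \pi)\, P(\pi),
\]

where \(\pi\) is a phone sequence in the target language, \(T\) is the nonsense transcript written by a non-speaker, \(P(T \mid \pi)\) models second-language perception of the target-language phones, and \(P(\pi)\) is a phone language model, for example derived from text as described above.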


Conference of the International Speech Communication Association | 2016

Far-Field ASR Without Parallel Data.

Vijayaditya Peddinti; Vimal Manohar; Yiming Wang; Daniel Povey; Sanjeev Khudanpur

In far-field speech recognition systems, training acoustic models with alignments generated from parallel close-talk microphone data provides significant improvements. However, it is not practical to assume the availability of large corpora of parallel close-talk microphone data for training. In this paper we explore methods to reduce the performance gap between far-field ASR systems trained with alignments from distant microphone data and those trained with alignments from parallel close-talk microphone data. These methods include the use of a lattice-free sequence objective function which tolerates minor misalignment errors, and the use of data-selection techniques to discard badly aligned data. We present results on the single-distant-microphone and multiple-distant-microphone scenarios of the AMI LVCSR task. We identify prominent causes of alignment errors in AMI data.
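
One simple way to realize the data-selection idea described above is to threshold an average per-frame alignment score per utterance. The sketch below is illustrative only; the function name and threshold are assumptions, not the paper's recipe.

def select_utterances(total_loglik, num_frames, min_avg_loglik=-8.0):
    """total_loglik: {utt_id: alignment log-likelihood summed over frames};
       num_frames:  {utt_id: number of frames in the utterance}."""
    kept = []
    for utt_id, total in total_loglik.items():
        # keep only utterances whose average per-frame score is high enough
        if total / max(num_frames[utt_id], 1) >= min_avg_loglik:
            kept.append(utt_id)
    return kept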


Conference of the International Speech Communication Association | 2015

Semi-supervised maximum mutual information training of deep neural network acoustic models.

Vimal Manohar; Daniel Povey; Sanjeev Khudanpur


Conference of the International Speech Communication Association | 2017

An Exploration of Dropout with LSTMs.

Gaofeng Cheng; Vijayaditya Peddinti; Daniel Povey; Vimal Manohar; Sanjeev Khudanpur; Yonghong Yan


IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2017

JHU Kaldi system for Arabic MGB-3 ASR challenge using diarization, audio-transcript alignment and transfer learning

Vimal Manohar; Daniel Povey; Sanjeev Khudanpur

Collaboration


Dive into Vimal Manohar's collaborations.

Top Co-Authors

Daniel Povey (Johns Hopkins University)

Chunxi Liu (Johns Hopkins University)

Jan Trmal (University of West Bohemia)

Najim Dehak (Massachusetts Institute of Technology)

Xiaohui Zhang (Johns Hopkins University)

Yiming Wang (Johns Hopkins University)