

Publication


Featured research published by Kanishka Rao.


International Conference on Acoustics, Speech, and Signal Processing | 2015

Learning acoustic frame labeling for speech recognition with recurrent neural networks

Hasim Sak; Andrew W. Senior; Kanishka Rao; Ozan Irsoy; Alex Graves; Francoise Beaufays; Johan Schalkwyk

We explore alternative acoustic modeling techniques for large vocabulary speech recognition using Long Short-Term Memory recurrent neural networks. For an acoustic frame labeling task, we compare the conventional approach of cross-entropy (CE) training using fixed forced alignments of frames and labels with the Connectionist Temporal Classification (CTC) method proposed for labeling unsegmented sequence data. We demonstrate that the latter can be implemented with finite state transducers. We experiment with phones and context-dependent HMM states as acoustic modeling units. We also investigate the effect of context in acoustic input by training unidirectional and bidirectional LSTM RNN models. We show that a bidirectional LSTM RNN CTC model using phone units can perform as well as an LSTM RNN model trained with CE using HMM state alignments. Finally, we show the effect of sequence discriminative training on these models and present the first results for sMBR training of CTC models.
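
The contrast above is between frame-level CE training on forced alignments and CTC training on unsegmented label sequences. As a rough, hedged illustration only (not the authors' implementation; feature dimension, phone inventory, and layer sizes are invented for the example), the sketch below shows a bidirectional LSTM acoustic model trained with a CTC loss in PyTorch.

```python
# A minimal, hypothetical sketch of CTC training for acoustic frame labeling
# with a bidirectional LSTM (PyTorch). Sizes and features are illustrative,
# not the configuration used in the paper.
import torch
import torch.nn as nn

NUM_FEATS = 40     # e.g. log-mel filterbank features per frame (assumed)
NUM_LABELS = 42    # phone units plus the CTC blank symbol (assumed)

class BLSTMAcousticModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(NUM_FEATS, 320, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * 320, NUM_LABELS)

    def forward(self, frames):                    # frames: (batch, time, feats)
        hidden, _ = self.lstm(frames)
        return self.proj(hidden).log_softmax(-1)  # per-frame label log-probs

model = BLSTMAcousticModel()
ctc_loss = nn.CTCLoss(blank=0)                    # no frame-level alignment needed

frames = torch.randn(8, 200, NUM_FEATS)           # dummy batch of 8 utterances
labels = torch.randint(1, NUM_LABELS, (8, 30))    # unsegmented phone targets
log_probs = model(frames).transpose(0, 1)         # CTCLoss expects (time, batch, labels)
loss = ctc_loss(log_probs, labels,
                input_lengths=torch.full((8,), 200, dtype=torch.long),
                target_lengths=torch.full((8,), 30, dtype=torch.long))
loss.backward()                                   # CE training would instead need a fixed alignment
```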


International Conference on Acoustics, Speech, and Signal Processing | 2016

Personalized speech recognition on mobile devices

Ian McGraw; Rohit Prabhavalkar; Raziel Alvarez; Montse Gonzalez Arenas; Kanishka Rao; David Rybach; Ouais Alsharif; Hasim Sak; Alexander H. Gruenstein; Francoise Beaufays; Carolina Parada

We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.
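
One of the footprint reductions described above is SVD-based compression of the model's weight matrices. As a hedged illustration of the general idea only (not the paper's exact recipe, ranks, or matrix sizes), the sketch below factors a single weight matrix into two low-rank factors with NumPy.

```python
# A minimal sketch of SVD-based weight compression: replace an m x n matrix W
# with low-rank factors A (m x r) and B (r x n), cutting parameters from m*n
# to r*(m + n). Matrix size and rank below are illustrative only.
import numpy as np

def svd_compress(W, rank):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]        # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

W = np.random.randn(1024, 640)        # a hypothetical recurrent weight matrix
A, B = svd_compress(W, rank=128)
print(W.size, "->", A.size + B.size)  # 655360 -> 212992 parameters
rel_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_error:.3f}")
```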


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

Acoustic modelling with CD-CTC-sMBR LSTM RNNs

Andrew W. Senior; Hasim Sak; Félix de Chaumont Quitry; Tara N. Sainath; Kanishka Rao

This paper describes a series of experiments to extend the application of Context-Dependent (CD) long short-term memory (LSTM) recurrent neural networks (RNNs) trained with Connectionist Temporal Classification (CTC) and sMBR loss. Our experiments, on a noisy, reverberant voice search task, include training with alternative pronunciations, application to child speech recognition, combination of multiple models, and convolutional input layers. We also investigate the latency of CTC models and show that constraining the forward-backward alignment in training can reduce the delay for a real-time streaming speech recognition system. Finally, we investigate transferring knowledge from one network to another through alignments.
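
Among the experiments listed above is combination of multiple models. One simple frame-level combination scheme (an assumption for illustration; the paper's actual combination method may differ) is a weighted average of the models' per-frame log posteriors, renormalized before decoding.

```python
# A hypothetical sketch of frame-level combination of several acoustic models:
# log-linear interpolation of their per-frame log posteriors, renormalized so
# each frame is again a distribution.
import numpy as np

def combine_log_posteriors(model_outputs, weights=None):
    """model_outputs: list of (num_frames, num_labels) log-posterior arrays."""
    stacked = np.stack(model_outputs)                  # (num_models, T, L)
    if weights is None:
        weights = np.full(len(model_outputs), 1.0 / len(model_outputs))
    combined = np.tensordot(weights, stacked, axes=1)  # weighted sum over models
    # renormalize each frame in the log domain
    return combined - np.logaddexp.reduce(combined, axis=-1, keepdims=True)

# usage with two dummy models over 5 frames and 4 labels
a = np.log(np.random.dirichlet(np.ones(4), size=5))
b = np.log(np.random.dirichlet(np.ones(4), size=5))
fused = combine_log_posteriors([a, b], weights=np.array([0.7, 0.3]))
```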


International Conference on Acoustics, Speech, and Signal Processing | 2017

Multi-accent speech recognition with hierarchical grapheme based models

Kanishka Rao; Hasim Sak

We train grapheme-based acoustic models for speech recognition using a hierarchical recurrent neural network architecture with connectionist temporal classification (CTC) loss. The models learn to align utterances with phonetic transcriptions in a lower layer and graphemic transcriptions in the final layer in a multi-task learning setting. Using the grapheme predictions from a hierarchical model trained on 3 million US English utterances results in a 6.7% relative word error rate (WER) increase compared to using the phoneme-based acoustic model trained on the same data. However, we show that hierarchical grapheme-based models trained on larger acoustic data (12 million utterances) jointly for the grapheme and phoneme prediction tasks outperform the phoneme-only model by 6.9% relative WER. We train a single multi-dialect model using a combined US, British, Indian and Australian English data set and then adapt the model using US English data only. This adapted multi-accent model outperforms a model exclusively trained on US English. This process is repeated for phoneme-based and grapheme-based acoustic models for all four dialects, and larger improvements are obtained with grapheme models. Additionally, using a multi-accent grapheme model, we observe large recognition accuracy improvements for Indian-accented utterances in Google VoiceSearch US traffic, with a 40% relative WER reduction.
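
The hierarchical multi-task setup described above supervises a lower layer with phonemes and the final layer with graphemes, both via CTC. The sketch below (assumed layer sizes, inventories, and framework choice, not the paper's configuration) shows one way to wire such a model in PyTorch.

```python
# A hypothetical sketch of hierarchical multi-task CTC: a lower LSTM stack is
# supervised with a phoneme CTC loss and an upper stack with a grapheme CTC
# loss. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalCTCModel(nn.Module):
    def __init__(self, num_feats=40, num_phonemes=42, num_graphemes=30):
        super().__init__()
        self.lower = nn.LSTM(num_feats, 512, num_layers=3, batch_first=True)
        self.upper = nn.LSTM(512, 512, num_layers=2, batch_first=True)
        self.phone_head = nn.Linear(512, num_phonemes)   # lower-level task
        self.graph_head = nn.Linear(512, num_graphemes)  # final task

    def forward(self, frames):                           # frames: (batch, time, feats)
        low, _ = self.lower(frames)
        high, _ = self.upper(low)
        return (self.phone_head(low).log_softmax(-1),
                self.graph_head(high).log_softmax(-1))

model = HierarchicalCTCModel()
ctc = nn.CTCLoss(blank=0)
frames = torch.randn(4, 300, 40)                         # dummy batch
phones = torch.randint(1, 42, (4, 40))
graphemes = torch.randint(1, 30, (4, 25))
phone_lp, graph_lp = model(frames)
frame_lens = torch.full((4,), 300, dtype=torch.long)
loss = (ctc(phone_lp.transpose(0, 1), phones, frame_lens,
            torch.full((4,), 40, dtype=torch.long)) +
        ctc(graph_lp.transpose(0, 1), graphemes, frame_lens,
            torch.full((4,), 25, dtype=torch.long)))
loss.backward()
```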


International Conference on Acoustics, Speech, and Signal Processing | 2016

Flat start training of CD-CTC-sMBR LSTM RNN acoustic models

Kanishka Rao; Andrew W. Senior; Hasim Sak

We present a recipe for training acoustic models with context dependent (CD) phones from scratch using recurrent neural networks (RNNs). First, we use the connectionist temporal classification (CTC) technique to train a model with context independent (CI) phones directly from the written-domain word transcripts by aligning with all possible phonetic verbalizations. Then, we devise a mechanism to generate a set of CD phones using the CTC CI phone model alignments and train a CD phone model to improve the accuracy. This end-to-end training recipe does not require any previously trained GMM-HMM or DNN model for CD phone generation or alignment, and thus drastically reduces the overall model building time. We show that using this procedure does not degrade the performance of models and allows us to improve models more quickly by updates to pronunciations or training data.
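
The flat-start recipe above first trains a CI phone CTC model and then derives CD phone units from its alignments. As a loose, assumed simplification only (the paper's CD phone generation mechanism is more involved), the sketch below builds a CD inventory by keeping the most frequent triphone contexts seen in the CI alignments and backing off to CI phones otherwise.

```python
# A hypothetical simplification of CD phone generation from CI alignments:
# count triphone contexts, keep the most frequent as CD units, and back off to
# the CI phone for the rest. Not the paper's actual procedure.
from collections import Counter

def build_cd_inventory(ci_alignments, max_units=2000):
    """ci_alignments: iterable of phone-label sequences from the CI CTC model."""
    counts = Counter()
    for phones in ci_alignments:
        padded = ["<sil>"] + list(phones) + ["<sil>"]
        for i in range(1, len(padded) - 1):
            counts[tuple(padded[i - 1:i + 2])] += 1
    return {tri for tri, _ in counts.most_common(max_units)}

def to_cd_labels(phones, cd_units):
    padded = ["<sil>"] + list(phones) + ["<sil>"]
    return [tuple(padded[i - 1:i + 2]) if tuple(padded[i - 1:i + 2]) in cd_units
            else padded[i]                         # back off to the CI phone
            for i in range(1, len(padded) - 1)]

alignments = [["h", "e", "l", "ou"], ["h", "e", "l", "p"]]  # toy CI alignments
cd_units = build_cd_inventory(alignments, max_units=5)
print(to_cd_labels(["h", "e", "l", "ou"], cd_units))
```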


International Conference on Acoustics, Speech, and Signal Processing | 2015

Automatic pronunciation verification for speech recognition

Kanishka Rao; Fuchun Peng; Francoise Beaufays

Pronunciations for words are a critical component of an automatic speech recognition (ASR) system, as misrecognitions may be caused by missing or inaccurate pronunciations. The need for high-quality pronunciations has recently motivated data-driven techniques to generate them [1]. We propose a data-driven and language-independent framework for verification of such pronunciations to further improve the lexicon quality in ASR. New candidate pronunciations are verified by re-recognizing historical audio logs and examining the associated recognition costs. We build an additional pronunciation quality feature from word and pronunciation frequencies in logs. A machine-learned classifier trained on these features achieves nearly 90% accuracy in labeling good vs. bad pronunciations across all languages we tested. New pronunciations verified as good may be added to a dictionary, while bad pronunciations may be discarded or sent to experts for further evaluation. We simultaneously verify 5,000 to 30,000 new pronunciations within a few hours and show improvements in ASR performance as a result of including pronunciations verified by this system.
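
The pipeline above reduces each candidate pronunciation to recognition-cost and frequency features and then classifies it as good or bad. A minimal, hedged sketch of that last step follows; the feature set and classifier choice here are assumptions for illustration, not the paper's exact setup.

```python
# A hypothetical sketch of the verification step: featurize each candidate
# pronunciation (illustrative stand-ins for the paper's features) and train a
# binary good/bad classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: [recognition-cost delta from re-recognition, word log-frequency,
#           pronunciation log-frequency]  -- all values invented for the example
X_train = np.array([[-0.8, 5.1, 3.2],
                    [ 1.9, 2.0, 0.1],
                    [-0.3, 6.4, 4.0],
                    [ 2.5, 1.2, 0.0]])
y_train = np.array([1, 0, 1, 0])        # 1 = good pronunciation, 0 = bad

clf = LogisticRegression().fit(X_train, y_train)
candidate = np.array([[-0.5, 4.8, 2.9]])
verdict = clf.predict(candidate)[0]
print("add to dictionary" if verdict else "discard / send to expert review")
```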


Conference of the International Speech Communication Association | 2016

Predicting Pronunciations with Syllabification and Stress with Recurrent Neural Networks

Daan van Esch; Mason Chua; Kanishka Rao

Word pronunciations, consisting of phoneme sequences and the associated syllabification and stress patterns, are vital for both speech recognition and text-to-speech (TTS) systems. For speech recognition, phoneme sequences for words may be learned from audio data. We train recurrent neural network (RNN) based models to predict the syllabification and stress pattern for such pronunciations, making them usable for TTS. We find these RNN models significantly outperform naive rule-based models for almost all languages we tested. Further, we find additional improvements to the stress prediction model by using the spelling as features in addition to the phoneme sequence. Finally, we train a single RNN model to predict the phoneme sequence, syllabification and stress for a given word. For several languages, this single RNN outperforms similar models trained specifically for either phoneme sequence or stress prediction. We report an exhaustive comparison of these approaches for twenty languages.
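
One way to read the model above is as a sequence tagger: the RNN consumes the phoneme sequence and emits a per-phoneme tag encoding syllable boundaries and stress. The sketch below (assumed inventories, tag set, and framework, not the paper's architecture) shows that formulation in PyTorch.

```python
# A hypothetical sketch of stress/syllabification prediction as sequence
# tagging: an LSTM reads the phoneme sequence and emits one tag per phoneme
# (e.g. syllable boundary, primary stress, secondary stress, none).
import torch
import torch.nn as nn

NUM_PHONEMES, NUM_TAGS = 45, 4            # illustrative inventory sizes

class PronunciationTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_PHONEMES, 64)
        self.lstm = nn.LSTM(64, 128, bidirectional=True, batch_first=True)
        self.out = nn.Linear(256, NUM_TAGS)

    def forward(self, phonemes):                  # phonemes: (batch, time) ids
        hidden, _ = self.lstm(self.embed(phonemes))
        return self.out(hidden)                   # per-phoneme tag logits

model = PronunciationTagger()
phonemes = torch.randint(0, NUM_PHONEMES, (2, 9))   # dummy phoneme sequences
tags = torch.randint(0, NUM_TAGS, (2, 9))           # dummy gold tags
loss = nn.CrossEntropyLoss()(model(phonemes).flatten(0, 1), tags.flatten())
loss.backward()
```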


Conference of the International Speech Communication Association | 2015

Fast and accurate recurrent neural network acoustic models for speech recognition

Hasim Sak; Andrew W. Senior; Kanishka Rao; Françoise Beaufays


International Conference on Acoustics, Speech, and Signal Processing | 2015

Grapheme-to-phoneme conversion using Long Short-Term Memory recurrent neural networks

Kanishka Rao; Fuchun Peng; Hasim Sak; Francoise Beaufays


International Conference on Acoustics, Speech, and Signal Processing | 2018

State-of-the-art Speech Recognition With Sequence-to-Sequence Models

Chung-Cheng Chiu; Tara N. Sainath; Yonghui Wu; Rohit Prabhavalkar; Patrick Nguyen; Zhifeng Chen; Anjuli Kannan; Ron Weiss; Kanishka Rao; Katya Gonina; Navdeep Jaitly; Bo Li; Jan Chorowski; Michiel Bacchiani
