Pranay Dighe
Idiap Research Institute
Publications
Featured research published by Pranay Dighe.
international conference on acoustics, speech, and signal processing | 2012
Anurag Kumar; Pranay Dighe; Rita Singh; Sourish Chaudhuri; Bhiksha Raj
In most real-world audio recordings, we encounter several types of audio events. In this paper, we develop a technique for detecting signature audio events that is based on identifying patterns of occurrences of automatically learned atomic units of sound, which we call Acoustic Unit Descriptors or AUDs. Experiments show that the methodology works well for detecting both individual events and their boundaries in complex recordings.
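A minimal sketch of the general idea, not the authors' actual pipeline: frame-level features (stubbed here with random data in place of real MFCC frames) are clustered into acoustic unit descriptors, a recording is decoded into a sequence of AUD labels, and n-gram counts of those labels act as a signature for event detection. All names, sizes, and the overlap score are illustrative assumptions.

```python
# Sketch: AUDs as k-means clusters over frame features; events detected by
# matching patterns (here, bigram histograms) of AUD occurrences.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for frame-level acoustic features (n_frames x n_dims);
# in practice these would be MFCCs or similar.
train_frames = rng.normal(size=(5000, 13))

# Learn K atomic units of sound by clustering the feature frames.
K = 32
auds = KMeans(n_clusters=K, n_init=10, random_state=0).fit(train_frames)

def aud_sequence(frames):
    """Decode a recording into its sequence of AUD labels."""
    return auds.predict(frames)

def bigram_signature(labels):
    """Histogram of AUD bigrams -- a simple 'pattern of occurrences'."""
    return Counter(zip(labels[:-1], labels[1:]))

# Signature of a known event exemplar vs. a longer test recording.
event_sig = bigram_signature(aud_sequence(rng.normal(size=(200, 13))))
test_sig = bigram_signature(aud_sequence(rng.normal(size=(1000, 13))))

# Crude detection score: overlap between the two bigram histograms.
overlap = sum(min(test_sig[g], c) for g, c in event_sig.items())
print("signature overlap score:", overlap)
```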
Speech Communication | 2016
Pranay Dighe; Afsaneh Asaei; Hervé Bourlard
Highlights:
- Automatic speech recognition can be cast as a realization of compressive sensing.
- Posterior probabilities are suitable features for exemplar-based sparse modeling.
- Posterior-based sparse representation meets the statistical speech recognition formalism.
- Dictionary learning reduces the required exemplar collection size and improves performance.
- Collaborative hierarchical sparsity exploits temporal information in continuous speech.

In this paper, a compressive sensing (CS) perspective on exemplar-based speech processing is proposed. Relying on an analytical relationship between the CS formulation and statistical speech recognition (hidden Markov models, HMM), the automatic speech recognition (ASR) problem is cast as recovery of a high-dimensional sparse word representation from the observed low-dimensional acoustic features. The acoustic features are exemplars obtained from (deep) neural network sub-word conditional posterior probabilities. Low-dimensional word manifolds are learned using these sub-word posterior exemplars and exploited to construct a linguistic dictionary for sparse representation of word posteriors. Dictionary learning is found to be a principled way to alleviate the need for the huge collection of exemplars required in conventional exemplar-based approaches, while still improving performance. Context appending and collaborative hierarchical sparsity are used to exploit the sequential and group structure underlying the word sparse representation. This formulation leads to a posterior-based sparse modeling approach to speech recognition. The potential of the proposed approach is demonstrated on isolated word (Phonebook corpus) and continuous speech (Numbers corpus) recognition tasks.
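A hedged sketch of the exemplar-based sparse coding step, not the paper's exact setup: the word names, exemplar counts, and synthetic posteriors below are illustrative assumptions. Sub-word posterior exemplars of each word form the atoms of a dictionary, a test posterior is sparse-coded over that dictionary, and the coefficient mass per word gives a crude word-level score.

```python
# Sketch: sparse recovery of word identity from sub-word posterior features.
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
n_subword = 40            # dimension of sub-word posterior vectors
exemplars_per_word = 20
words = ["yes", "no", "stop"]          # hypothetical vocabulary

# Hypothetical exemplar collection: dictionary atoms grouped by word.
dictionary, word_of_atom = [], []
for w in words:
    E = rng.dirichlet(np.ones(n_subword), size=exemplars_per_word)
    dictionary.append(E)
    word_of_atom += [w] * exemplars_per_word
D = np.vstack(dictionary)              # (n_atoms, n_subword)

# A test posterior frame (synthetic here; in practice a DNN output).
x = rng.dirichlet(np.ones(n_subword)).reshape(1, -1)

# Sparse representation of the test posterior over the exemplar dictionary.
code = sparse_encode(x, D, algorithm="lasso_lars", alpha=0.05)[0]

# Aggregate sparse-coefficient mass per word as a word-level score.
scores = {w: code[[i for i, a in enumerate(word_of_atom) if a == w]].sum()
          for w in words}
print(scores)
```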
international conference on acoustics, speech, and signal processing | 2016
Pranay Dighe; Gil Luyet; Afsaneh Asaei
We propose to model the acoustic space of deep neural network (DNN) class-conditional posterior probabilities as a union of low-dimensional subspaces. To that end, the training posteriors are used for dictionary learning and sparse coding. Sparse representation of the test posteriors using this dictionary enables projection to the space of training data. Relying on the fact that the intrinsic dimensions of the posterior subspaces are indeed very small and the matrix of all posteriors belonging to a class has a very low rank, we demonstrate how low-dimensional structures enable further enhancement of the posteriors and rectify the spurious errors due to mismatch conditions. The enhanced acoustic modeling method leads to improvements in a continuous speech recognition task using the hybrid DNN-HMM (hidden Markov model) framework in both clean and noisy conditions, where up to 15.4% relative reduction in word error rate (WER) is achieved.
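An illustrative sketch of the dictionary-learning step on synthetic posteriors; the dimensions, hyperparameters, and data are assumptions, not the paper's configuration. A dictionary is learned from training posteriors, and test posteriors are "enhanced" by sparse coding followed by reconstruction over that dictionary.

```python
# Sketch: enhance DNN posteriors via dictionary learning + sparse coding.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
n_classes = 50

# Stand-ins for DNN class-conditional posteriors (rows sum to 1).
train_post = rng.dirichlet(np.ones(n_classes) * 0.2, size=2000)
test_post = rng.dirichlet(np.ones(n_classes) * 0.2, size=10)

dl = DictionaryLearning(n_components=64, transform_algorithm="lasso_lars",
                        transform_alpha=0.1, max_iter=20, random_state=0)
dl.fit(train_post)

# Sparse codes of the test posteriors over the learned dictionary ...
codes = dl.transform(test_post)
# ... and their reconstruction, i.e. the "enhanced" posteriors.
enhanced = codes @ dl.components_
enhanced = np.clip(enhanced, 0, None)
enhanced /= enhanced.sum(axis=1, keepdims=True)   # renormalise onto the simplex
print(enhanced.shape)
```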
international conference on multimedia and expo | 2013
Pranay Dighe; Parul Agrawal; Harish Karnick; Siddartha Thota; Bhiksha Raj
In Indian classical music, a raga describes the constituent structure of notes in a musical piece. In this work, we investigate the problem of scale-independent automatic raga identification and achieve state-of-the-art results using GMM-based hidden Markov models over a collection of features consisting of chromagram patterns, mel-cepstrum coefficients, and timbre features. We also perform the above task using 1) discrete HMMs and 2) classification trees over swara-based features created from chromagrams using the concept of the vadi of a raga. On a dataset of four ragas (darbari, khamaj, malhar, and sohini), we achieve an average accuracy of ~97%. This is an improvement over previous works, which rely on knowledge of the scale used in the raga performance. We believe that with a more careful selection of features and by fusing results from multiple classifiers we should be able to improve results further.
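A hedged sketch of the GMM-HMM classification step only, using synthetic stand-ins for the chromagram, mel-cepstrum, and timbre features; feature extraction, the swara-based features, and the classification trees are omitted, and all sizes are assumptions. One GMM-HMM is trained per raga and a test piece is assigned to the highest-scoring model.

```python
# Sketch: per-raga GMM-HMMs scored by log-likelihood (hmmlearn).
import numpy as np
from hmmlearn.hmm import GMMHMM

rng = np.random.default_rng(0)
ragas = ["darbari", "khamaj", "malhar", "sohini"]

def fake_features(n_frames, dim=12, shift=0.0):
    """Stand-in for frame-level chroma-like features of one performance."""
    return rng.normal(loc=shift, scale=1.0, size=(n_frames, dim))

# Train one GMM-HMM per raga on its training performances.
models = {}
for k, raga in enumerate(ragas):
    X = np.vstack([fake_features(300, shift=k) for _ in range(3)])
    lengths = [300, 300, 300]                  # frames per performance
    m = GMMHMM(n_components=5, n_mix=3, covariance_type="diag",
               n_iter=20, random_state=0)
    m.fit(X, lengths)
    models[raga] = m

# Classify a test piece by maximum log-likelihood over the raga models.
test = fake_features(300, shift=2)             # synthesised to resemble "malhar"
scores = {raga: m.score(test) for raga, m in models.items()}
print(max(scores, key=scores.get))
```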
conference of the international speech communication association | 2016
Gil Luyet; Pranay Dighe; Afsaneh Asaei
We hypothesize that optimal deep neural network (DNN) class-conditional posterior probabilities live in a union of low-dimensional subspaces. In real test conditions, DNN posteriors encode uncertainties which can be regarded as a superposition of unstructured sparse noise over the optimal posteriors. We aim to investigate different ways to structure the DNN outputs by exploiting low-rank representation (LRR) techniques. Using a large number of training posterior vectors, the underlying low-dimensional subspace of a test posterior is identified through nearest neighbor analysis, and low-rank decomposition enables separation of the "optimal" posteriors from the spurious uncertainties at the DNN output. Experiments demonstrate that by processing subsets of posteriors which possess strong subspace similarity, low-rank representation enables enhancement of posterior probabilities, and leads to higher speech recognition accuracy based on the hybrid DNN-hidden Markov model (HMM) system.
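A rough sketch of the underlying idea rather than the paper's exact LRR algorithm; the data, neighborhood size, and rank below are assumptions. The nearest training posteriors of a test frame are stacked with it, a truncated SVD keeps only the low-rank structure, and the enhanced posterior is read off the reconstruction.

```python
# Sketch: nearest-neighbour subspace selection + low-rank reconstruction.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n_classes, rank, k_nn = 50, 5, 30

train_post = rng.dirichlet(np.ones(n_classes) * 0.2, size=5000)
test_post = rng.dirichlet(np.ones(n_classes) * 0.2)

# Identify the underlying subspace via nearest-neighbour training posteriors.
nn = NearestNeighbors(n_neighbors=k_nn).fit(train_post)
_, idx = nn.kneighbors(test_post.reshape(1, -1))
block = np.vstack([train_post[idx[0]], test_post])    # (k_nn + 1, n_classes)

# Low-rank approximation separates structured content from sparse noise.
U, s, Vt = np.linalg.svd(block, full_matrices=False)
low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]

# The last row is the test frame; its reconstruction is the enhanced posterior.
enhanced = np.clip(low_rank[-1], 0, None)
enhanced /= enhanced.sum()                            # back onto the simplex
print(enhanced.argmax())
```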
international conference on acoustics, speech, and signal processing | 2017
Pranay Dighe; Afsaneh Asaei
Conventional deep neural networks (DNN) for speech acoustic modeling rely on Gaussian mixture models (GMM) and hidden Markov models (HMM) to obtain binary class labels as the targets for DNN training. Subword classes in speech recognition systems correspond to context-dependent tied states or senones. The present work addresses some limitations of GMM-HMM senone alignments for DNN training. We hypothesize that the senone probabilities obtained from a DNN trained with binary labels can provide more accurate targets to learn better acoustic models. However, DNN outputs bear inaccuracies which are exhibited as high-dimensional unstructured noise, whereas the informative components are structured and low-dimensional. We exploit principal component analysis (PCA) and sparse coding to characterize the senone subspaces. Enhanced probabilities obtained from low-rank and sparse reconstructions are used as soft targets for DNN acoustic modeling, which also enables training with untranscribed data. Experiments conducted on the AMI corpus show a 4.6% relative reduction in word error rate.
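A minimal sketch of a PCA-based variant of the enhancement step on synthetic posteriors; the sparse coding alternative and the DNN retraining with soft targets are omitted, and all sizes are assumptions. Posteriors of frames aligned to one senone are projected onto their top principal components, and the reconstructions serve as soft targets in place of one-hot labels.

```python
# Sketch: low-rank (PCA) reconstruction of senone posteriors as soft targets.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_dims = 200   # number of senone classes; one senone subspace shown below

# Stand-in for DNN output posteriors of frames aligned to this senone.
frames = rng.dirichlet(np.ones(n_dims) * 0.1, size=500)

# Keep only the low-dimensional structured part of the senone subspace.
pca = PCA(n_components=10)
low_dim = pca.fit_transform(frames)
soft_targets = pca.inverse_transform(low_dim)

# Clip and renormalise so each soft target is a valid probability distribution.
soft_targets = np.clip(soft_targets, 1e-8, None)
soft_targets /= soft_targets.sum(axis=1, keepdims=True)
print(soft_targets.shape)   # these replace the one-hot senone labels
```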
international symposium/conference on music information retrieval | 2013
Pranay Dighe; Harish Karnick; Bhiksha Raj
conference of the international speech communication association | 2015
Dhananjay Ram; Afsaneh Asaei; Pranay Dighe
conference of the international speech communication association | 2016
Gil Luyet; Pranay Dighe; Afsaneh Asaei
conference of the international speech communication association | 2014
Pranay Dighe; Marc Ferras