Publication


Featured research published by Sri Harish Reddy Mallidi.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Developing a speaker identification system for the DARPA RATS project

Oldrich Plchot; Spyros Matsoukas; Pavel Matejka; Najim Dehak; Jeff Z. Ma; Sandro Cumani; Ondrej Glembek; Hynek Hermansky; Sri Harish Reddy Mallidi; Nima Mesgarani; Richard M. Schwartz; Mehdi Soufifar; Zheng-Hua Tan; Samuel Thomas; Bing Zhang; Xinhui Zhou

This paper describes the speaker identification (SID) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded communication channels. We present results using multiple SID systems differing mainly in the algorithm used for voice activity detection (VAD) and feature extraction. We show that (a) unsupervised VAD performs as well as supervised methods in terms of downstream SID performance, (b) noise-robust feature extraction methods such as CFCCs out-perform MFCC front-ends on noisy audio, and (c) fusion of multiple systems provides a 24% relative improvement in EER compared to the single best system when using a novel SVM-based fusion algorithm that uses side information such as gender, language, and channel ID.
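To make the fusion idea concrete, the following is a minimal synthetic sketch of SVM-based score fusion that appends side information to per-system scores. The data, dimensions, and linear kernel are placeholder assumptions for illustration only, not the Patrol team's actual fusion system.

```python
# Hedged sketch: fuse scores from several SID systems with an SVM that also
# sees side information (here just a gender flag). All data is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_trials = 200

scores = rng.normal(size=(n_trials, 3))           # scores from 3 hypothetical SID systems
gender = rng.integers(0, 2, size=(n_trials, 1))   # side information: 0 = female, 1 = male
labels = (scores.mean(axis=1) > 0).astype(int)    # synthetic target / non-target labels

# Stack per-system scores with the side information and train the fusion SVM.
features = np.hstack([scores, gender])
fusion = SVC(kernel="linear").fit(features, labels)
fused_scores = fusion.decision_function(features)  # fused detection scores per trial
```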


IEEE Transactions on Audio, Speech, and Language Processing | 2014

Robust feature extraction using modulation filtering of autoregressive models

Sriram Ganapathy; Sri Harish Reddy Mallidi; Hynek Hermansky

Speaker and language recognition in noisy and degraded channel conditions continues to be a challenging problem, mainly due to the mismatch between clean training and noisy test conditions. In the presence of noise, the most reliable portions of the signal are the high-energy regions, which can be used for robust feature extraction. In this paper, we propose a front-end processing scheme based on autoregressive (AR) models that represent the high-energy regions with good accuracy, followed by a modulation filtering process. The AR model of the spectrogram is derived using two separable time and frequency AR transforms. The first AR model (temporal AR model) of the sub-band Hilbert envelopes is derived using frequency domain linear prediction (FDLP). This is followed by a spectral AR model applied on the FDLP envelopes. The output 2-D AR model represents a low-pass modulation filtered spectrogram of the speech signal. Band-pass modulation filtered spectrograms can further be derived by dividing two AR models with different model orders (cut-off frequencies). The modulation filtered spectrograms are converted to cepstral coefficients and are used for a speaker recognition task in noisy and reverberant conditions. Various speaker recognition experiments are performed with clean and noisy versions of the NIST-2010 speaker recognition evaluation (SRE) database using a state-of-the-art speaker recognition system. In these experiments, the proposed front-end analysis provides substantial improvements (relative improvements of up to 25%) compared to baseline techniques. Furthermore, we also illustrate the generalizability of the proposed methods using language identification (LID) experiments on highly degraded high-frequency (HF) radio channels and speech recognition experiments on noisy data.
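As a rough illustration of the temporal AR stage, here is a minimal sketch of frequency domain linear prediction on a synthetic full-band signal. The model order, the test signal, and the omission of the sub-band decomposition and of the second (spectral) AR stage are simplifying assumptions, not the configuration used in the paper.

```python
# Minimal FDLP sketch: linear prediction on the DCT of a signal yields an
# all-pole approximation of its temporal (Hilbert) envelope.
import numpy as np
from scipy.fftpack import dct
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    """Autocorrelation-method linear prediction; returns [1, -a1, ..., -ap]."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    return np.concatenate(([1.0], -a))

fs = 8000
t = np.arange(fs) / fs
# Tone with a slow 4 Hz amplitude modulation, i.e. a smooth temporal envelope.
signal = np.sin(2 * np.pi * 300 * t) * (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t))

# AR model fitted in the DCT (frequency) domain; its power response over the
# transform axis is a smooth, low-pass-modulation estimate of the envelope.
coeffs = lpc(dct(signal, norm="ortho"), order=40)
response = np.abs(np.fft.rfft(coeffs, n=2 * len(signal)))
envelope = 1.0 / np.maximum(response, 1e-12) ** 2
```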


IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2015

Robust speech recognition in unknown reverberant and noisy conditions

Roger Hsiao; Jeff Z. Ma; William Hartmann; Martin Karafiát; Frantisek Grezl; Lukas Burget; Igor Szöke; Jan Cernocky; Shinji Watanabe; Zhuo Chen; Sri Harish Reddy Mallidi; Hynek Hermansky; Stavros Tsakalidis; Richard M. Schwartz

In this paper, we describe our work on the ASpIRE (Automatic Speech recognition In Reverberant Environments) challenge, which aims to assess the robustness of automatic speech recognition (ASR) systems. The main characteristic of the challenge is developing a high-performance system without access to matched training and development data: while the evaluation data are recorded with far-field microphones in noisy and reverberant rooms, the training data are close-talking telephone speech. Our approach to this challenge includes speech enhancement, neural network methods, and acoustic model adaptation. We show that these techniques can successfully alleviate the performance degradation due to noisy audio and data mismatch.


IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2015

Uncertainty estimation of DNN classifiers

Sri Harish Reddy Mallidi; Tetsuji Ogawa; Hynek Hermansky

New efficient measures for estimating the uncertainty of deep neural network (DNN) classifiers are proposed and successfully applied to multistream-based unsupervised adaptation of ASR systems, addressing uncertainty caused by noise. The first proposed measure is the error of associative memory models trained on the outputs of a DNN; in the present study, autoencoders are used to remember the properties of the data. The second measure is an extension of the M-measure, which computes the divergences of probability estimates spaced at specific time intervals; the extension improves reliability by taking the latent information of phoneme duration into account. Experimental comparisons carried out in a multistream-based ASR paradigm demonstrate that the proposed measures yield improvements over the multistyle-trained system and over systems selected based on existing measures. Fusion of the proposed measures achieved almost the same performance as oracle system selection.
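The autoencoder-based measure can be sketched as follows: an autoencoder ("associative memory") is fit on posteriors from matched data, and the per-frame reconstruction error serves as the uncertainty score. The posteriorgrams below are random placeholders, not real DNN outputs, and the network size is an arbitrary assumption.

```python
# Hedged sketch of reconstruction-error-based uncertainty estimation.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_frames, n_phones = 2000, 40

# Placeholder posteriorgrams: peaky rows mimic clean speech, flat rows mimic noise.
clean_post = rng.dirichlet(np.ones(n_phones) * 0.2, size=n_frames)
noisy_post = rng.dirichlet(np.ones(n_phones) * 2.0, size=n_frames)

# Autoencoder trained to reconstruct the matched ("clean") posteriors.
ae = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
ae.fit(clean_post, clean_post)

def uncertainty(posteriors):
    """Mean squared reconstruction error per frame; higher means less reliable."""
    return np.mean((ae.predict(posteriors) - posteriors) ** 2, axis=1)

# Mismatched posteriors should reconstruct worse, i.e. receive a higher score.
print(uncertainty(clean_post).mean(), uncertainty(noisy_post).mean())
```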


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2016

Novel neural network based fusion for multistream ASR

Sri Harish Reddy Mallidi; Hynek Hermansky

Robustness of automatic speech recognition (ASR) to acoustic mismatches can be improved by a multistream framework. A frequently used approach to combining decisions from individual streams involves training a large number of neural networks, one for each possible stream combination. In this work, we propose to simplify the fusion by replacing this large number of fusion networks with a single fusion network. During training of the proposed fusion network, features from a stream are randomly dropped out. At test time, corrupted streams are identified and dropped out to improve robustness. Using the proposed approach, we achieve a significant reduction in the number of parameters while staying within 2.5% relative degradation of the conventional fusion technique. Furthermore, the proposed fusion network is also applied in a multistream ASR system to improve noise robustness on the Aurora4 speech recognition task. Noticeable improvements were observed over baseline systems (relative improvements of 9.2% in microphone mismatch and 3.2% in additive noise conditions).
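The core mechanism, a single fusion network trained with whole-stream dropout, can be sketched as below. The layer sizes, number of streams, and keep probability are illustrative assumptions, not the configuration reported in the paper.

```python
# Hypothetical PyTorch sketch of a fusion network with stream-level dropout.
import torch
import torch.nn as nn

class StreamDropoutFusion(nn.Module):
    def __init__(self, n_streams=4, stream_dim=40, n_classes=42, p_keep=0.8):
        super().__init__()
        self.n_streams, self.p_keep = n_streams, p_keep
        self.net = nn.Sequential(
            nn.Linear(n_streams * stream_dim, 512), nn.ReLU(),
            nn.Linear(512, n_classes),
        )

    def forward(self, streams, drop_mask=None):
        # streams: (batch, n_streams, stream_dim)
        if drop_mask is None and self.training:
            # Randomly drop entire streams during training.
            drop_mask = (torch.rand(streams.size(0), self.n_streams, 1) < self.p_keep).float()
        if drop_mask is not None:
            streams = streams * drop_mask
        return self.net(streams.flatten(1))

model = StreamDropoutFusion()
x = torch.randn(8, 4, 40)        # a batch of 4-stream features
logits = model(x)                # training-time forward with random stream dropout

# At test time, streams identified as corrupted can be zeroed explicitly:
model.eval()
mask = torch.tensor([1.0, 1.0, 0.0, 1.0]).view(1, 4, 1).expand(8, 4, 1)
logits_masked = model(x, drop_mask=mask)
```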


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2012

The UMD-JHU 2011 speaker recognition system

Daniel Garcia-Romero; Xinhui Zhou; Dmitry N. Zotkin; Balaji Vasan Srinivasan; Yuancheng Luo; Sriram Ganapathy; Samuel Thomas; Sridhar Krishna Nemala; Garimella S. V. S. Sivaram; Majid Mirbagheri; Sri Harish Reddy Mallidi; Thomas Janu; Padmanabhan Rajan; Nima Mesgarani; Mounya Elhilali; Hynek Hermansky; Shihab A. Shamma; Ramani Duraiswami

In recent years, there have been significant advances in the field of speaker recognition that have resulted in very robust recognition systems. The primary focus of many recent developments has shifted to the problem of recognizing speakers in adverse conditions, e.g., in the presence of noise or reverberation. In this paper, we present the UMD-JHU speaker recognition system applied to the NIST 2010 SRE task. The novel aspects of our system are: 1) improved performance on trials involving different vocal effort via the use of linear-scale features; 2) expected improved recognition performance in the presence of reverberation and noise via the use of frequency-domain perceptual linear prediction and cortical features; 3) a new discriminative kernel partial least squares (KPLS) framework that complements state-of-the-art back-end systems (JFA and PLDA) to aid better overall recognition; and 4) acceleration of the JFA, PLDA and KPLS back-ends via distributed computing. The individual components of the system and the fused system are compared against a baseline JFA system and against results reported by SRI and MIT-LL on SRE 2010.


IEEE Signal Processing Letters | 2012

Regularized Auto-Associative Neural Networks for Speaker Verification

Sri Garimella; Sri Harish Reddy Mallidi; Hynek Hermansky

An Auto-Associative Neural Network (AANN) is a fully connected feed-forward neural network trained to reconstruct its input at its output through a hidden compression layer. AANNs are used to model speakers in speaker verification, where a speaker-specific AANN model is obtained by adapting (or retraining) the Universal Background Model (UBM) AANN, an AANN trained on multiple held-out speakers, using the corresponding speaker data. When the amount of speaker data is limited, this adaptation procedure leads to overfitting, and the resulting speaker-specific parameters become noisy due to outliers in the data. We therefore propose to regularize the parameters of an AANN during speaker adaptation, and derive a closed-form expression for updating the parameters. Further, these speaker-specific AANN parameters are used directly as features in a linear discriminant analysis (LDA)/probabilistic linear discriminant analysis (PLDA) based speaker verification system. The proposed speaker verification system outperforms the previously proposed weighted least squares (WLS) based AANN speaker verification system on the NIST-08 speaker recognition evaluation (SRE). Moreover, the proposed system obviates the need for an intermediate dimensionality reduction (or i-vector extraction) step.
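A rough sketch of the adaptation idea follows: a speaker-specific AANN is initialized from the UBM-AANN and its parameters are penalized for drifting away from it. The architecture sizes, regularization weight, and the use of gradient descent rather than the paper's closed-form update are illustrative assumptions.

```python
# Hedged sketch of regularized AANN speaker adaptation (not the paper's exact update).
import copy
import torch
import torch.nn as nn

feat_dim, hidden, bottleneck = 39, 128, 13
ubm_aann = nn.Sequential(
    nn.Linear(feat_dim, hidden), nn.Tanh(),
    nn.Linear(hidden, bottleneck), nn.Tanh(),   # compression layer
    nn.Linear(bottleneck, hidden), nn.Tanh(),
    nn.Linear(hidden, feat_dim),
)

speaker_aann = copy.deepcopy(ubm_aann)          # initialize from the UBM-AANN
ubm_params = [p.detach().clone() for p in ubm_aann.parameters()]

x = torch.randn(500, feat_dim)                  # placeholder speaker features
opt = torch.optim.Adam(speaker_aann.parameters(), lr=1e-3)
lam = 0.1                                        # regularization strength (assumed)

for _ in range(200):
    opt.zero_grad()
    recon_loss = ((speaker_aann(x) - x) ** 2).mean()
    reg = sum(((p - p0) ** 2).sum() for p, p0 in zip(speaker_aann.parameters(), ubm_params))
    (recon_loss + lam * reg).backward()
    opt.step()

# The adapted parameters can then be flattened into a speaker representation
# for LDA/PLDA scoring.
speaker_vector = torch.cat([p.detach().flatten() for p in speaker_aann.parameters()])
```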


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Frequency offset correction in speech without detecting pitch

Pascal Clark; Sri Harish Reddy Mallidi; Aren Jansen; Hynek Hermansky

Radio-transmitted speech sometimes contains a residual frequency shift or offset, resulting from incorrect demodulation in single-sideband channels. Frequency-shifted speech can mask speaker identity and reduce intelligibility. Therefore, frequency offset will degrade the performance of downstream speech technologies. Existing offset correction methods require a pitch estimate of the speech signal, which is difficult in noisy radio channels. We present a new, automatic algorithm for detecting and correcting frequency offset, based on third-order modulation spectral analysis. Our method is remarkably simple and does not require pitch estimation. We provide derivations, examples, and a pilot study demonstrating how offset correction improves speaker verification for radio-transmitted speech.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2017

Predicting error rates for unknown data in automatic speech recognition

Bernd T. Meyer; Sri Harish Reddy Mallidi; Hendrik Kayser; Hynek Hermansky

In this paper we investigate methods to predict word error rates in automatic speech recognition in the presence of unknown noise types that have not been seen during training. The performance measures operate on phoneme posteriorgrams obtained from neural nets. We compare average frame-wise entropy as a baseline measure to the mean temporal distance (M-Measure) and to the number of phonetic events. The latter is obtained by learning typical phoneme activations from clean training data, which are later applied as phoneme-specific matched filters to posteriorgrams (MaP); when the filtered output exceeds a threshold, we register this as a phonetic event. For test sets using 10 unknown noise types and a wide range of signal-to-noise ratios, we find M-Measure and MaP to produce predictions twice as accurate as the baseline measure. When excluding noise types that contain speech segments, a prediction error of 3.1% is achieved, compared to 15.0% for the baseline measure.
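The M-Measure can be illustrated with a small sketch: it averages the divergence between posterior vectors that are a fixed number of frames apart, over several separations. The lag range, the choice of symmetric KL divergence, and the synthetic posteriorgrams below are assumptions for illustration.

```python
# Hedged sketch of the mean temporal distance (M-Measure) on a posteriorgram.
import numpy as np

def m_measure(post, lags=range(5, 80, 5), eps=1e-12):
    """post: (n_frames, n_phones) posteriorgram with rows summing to 1."""
    post = np.clip(post, eps, None)
    values = []
    for d in lags:
        p, q = post[:-d], post[d:]
        # Symmetric KL divergence between frames separated by d.
        skl = np.sum(p * np.log(p / q) + q * np.log(q / p), axis=1)
        values.append(skl.mean())
    return float(np.mean(values))

# Confident, rapidly changing posteriors (clean speech) give a high M-Measure;
# flat, uncertain posteriors (heavy noise) give a low one.
rng = np.random.default_rng(0)
sharp = rng.dirichlet(np.ones(40) * 0.1, size=500)
flat = rng.dirichlet(np.ones(40) * 10.0, size=500)
print(m_measure(sharp), m_measure(flat))
```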


Computer Speech & Language | 2017

On the relevance of auditory-based Gabor features for deep learning in robust speech recognition

Angel Mario Castro Martinez; Sri Harish Reddy Mallidi; Bernd T. Meyer

DNN-based speech recognition greatly benefits from spectro-temporal Gabor features. Gabor filters with high temporal modulation encode the most relevant information. A measure of phoneme similarity is proposed to quantify class separability, and this metric is used to explain the improved results at the phoneme level.

Previous studies support the idea of merging auditory-based Gabor features with deep learning architectures to achieve robust automatic speech recognition; however, the cause of the gain from such a combination is still unknown. We believe these representations provide the deep learning decoder with more discriminable cues. Our aim with this paper is to validate this hypothesis by performing experiments with three different recognition tasks (Aurora4, CHiME2 and CHiME3) and assessing the discriminability of the information encoded by Gabor filterbank features. Additionally, to identify the contribution of low, medium and high temporal modulation frequencies, subsets of the Gabor filterbank were used as features (dubbed LTM, MTM and HTM, respectively). With temporal modulation frequencies between 16 and 25 Hz, HTM consistently outperformed the remaining ones in every condition, highlighting the robustness of these representations against channel distortions, low signal-to-noise ratios and acoustically challenging real-life scenarios, with relative improvements from 11 to 56% against a Mel-filterbank-DNN baseline. To explain the results, a measure of similarity between phoneme classes derived from DNN activations is proposed and linked to their acoustic properties. We find this measure to be consistent with the observed error rates and highlight specific differences at the phoneme level to pinpoint the benefit of the proposed features.
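As a rough illustration, a single 2-D spectro-temporal Gabor filter can be built as a plane-wave carrier under a Hann envelope and convolved with a log-mel spectrogram. The window sizes, modulation frequencies, and the placeholder spectrogram below are assumptions; the paper's actual filterbank covers many such filters.

```python
# Minimal sketch of one spectro-temporal Gabor filter applied to a spectrogram.
import numpy as np
from scipy.signal import convolve2d

def gabor_filter(omega_t, omega_f, size_t=25, size_f=9):
    """Complex 2-D Gabor: Hann envelope times a plane-wave carrier."""
    t = np.arange(size_t) - size_t // 2
    f = np.arange(size_f) - size_f // 2
    hann_t = 0.5 - 0.5 * np.cos(2 * np.pi * (t - t.min()) / (size_t - 1))
    hann_f = 0.5 - 0.5 * np.cos(2 * np.pi * (f - f.min()) / (size_f - 1))
    envelope = np.outer(hann_f, hann_t)
    carrier = np.exp(1j * (omega_f * f[:, None] + omega_t * t[None, :]))
    return envelope * carrier

# Placeholder log-mel spectrogram: (n_mel_bands, n_frames).
log_mel = np.random.default_rng(0).normal(size=(40, 300))

# High temporal modulation filter: ~20 Hz at a 100 Hz frame rate is roughly
# 0.4*pi rad/frame; the filtered magnitude serves as an HTM-style feature map.
g = gabor_filter(omega_t=0.4 * np.pi, omega_f=0.0)
htm_feature = np.abs(convolve2d(log_mel, np.real(g), mode="same"))
```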

Collaboration


Dive into Sri Harish Reddy Mallidi's collaboration.

Top Co-Authors

Feipeng Li

Johns Hopkins University
