Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Kshitiz Kumar is active.

Publication


Featured research published by Kshitiz Kumar.


International Conference on Acoustics, Speech, and Signal Processing | 2007

Profile View Lip Reading

Kshitiz Kumar; Tsuhan Chen; Richard M. Stern

In this paper, we introduce profile view (PV) lip reading, a scheme for speaker-dependent isolated-word speech recognition. We provide historical motivation for PV from the importance of profile images in facial animation for lip reading, and we present feature extraction schemes for PV as well as for the traditional frontal view (FV) approach. We compare lip reading results for PV and FV, which demonstrate a significant improvement for PV over FV. We show improvement in speech recognition with the integration of audio and visual features. We also found it advantageous to process the visual features over a longer duration than the duration marked by the endpoints of the speech utterance.


International Conference on Acoustics, Speech, and Signal Processing | 2011

Delta-spectral cepstral coefficients for robust speech recognition

Kshitiz Kumar; Chanwoo Kim; Richard M. Stern

Almost all current automatic speech recognition (ASR) systems conventionally append delta and double-delta cepstral features to static cepstral features. In this work we describe a modified feature-extraction procedure in which the time-difference operation is performed in the spectral domain, rather than in the cepstral domain as is generally done at present. We argue that this approach based on “delta-spectral” features is needed because even though delta-cepstral features capture dynamic speech information and generally greatly improve ASR recognition accuracy, they are not robust to noise and reverberation. We support the validity of the delta-spectral approach both with observations about the modulation spectrum of speech and noise, and with objective experiments that document the benefit that the delta-spectral approach brings to a variety of currently popular feature extraction algorithms. We found that the use of delta-spectral features, rather than the more traditional delta-cepstral features, improves the effective SNR by between 5 and 8 dB for background music and white noise, and recognition accuracy in reverberant environments is improved as well.
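As an illustration of the pipeline change described above, here is a minimal sketch (an editor's addition, not the paper's code) that takes the time difference in the log-spectral domain and only afterwards moves to cepstra via the DCT; the regression width and cepstral order are assumed values:

```python
import numpy as np
from scipy.fftpack import dct

def delta(feat, width=2):
    # Regression-based time difference along the frame axis (frames x bands).
    T = len(feat)
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    num = sum(k * (padded[width + k:width + k + T] - padded[width - k:width - k + T])
              for k in range(1, width + 1))
    return num / (2 * sum(k * k for k in range(1, width + 1)))

def delta_spectral_cepstra(log_spec, n_ceps=13):
    # Difference first in the (log) spectral domain, then DCT to cepstra:
    # the reverse of the usual "cepstra first, then delta" ordering.
    d_spec = delta(log_spec)
    return dct(d_spec, type=2, axis=1, norm="ortho")[:, :n_ceps]
```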


International Conference on Acoustics, Speech, and Signal Processing | 2011

Gammatone sub-band magnitude-domain dereverberation for ASR

Kshitiz Kumar; Rita Singh; Bhiksha Raj; Richard M. Stern

We present an algorithm for dereverberation of speech signals for automatic speech recognition (ASR) applications. ASR systems are often presented with speech that has been recorded in environments that include noise and reverberation. The performance of ASR systems degrades with increasing levels of noise and reverberation. While many algorithms have been proposed for robust ASR in noisy environments, reverberation is still a challenging problem. In this paper, we present an approach for dereverberation that models reverberation as a convolution operation in the speech spectral domain. Using a least-squares error criterion, we decompose reverberated spectra into clean spectra convolved with a filter. We incorporate non-negativity and sparsity of the speech spectra as constraints within a non-negative matrix factorization (NMF) framework to achieve the decomposition. In ASR experiments in which the system is trained with unreverberated and reverberated speech, we show that the proposed approach provides up to 40% and 19% relative improvement, respectively.
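A minimal sketch of the kind of non-negative spectral deconvolution described above (an editor's illustration under assumed filter length, iteration count, and a simple L1 sparsity penalty; not the paper's implementation):

```python
import numpy as np

def nmf_dereverb(Y, L=8, n_iter=200, sparsity=0.0, eps=1e-12):
    # Y: (T x F) non-negative magnitude spectrogram of reverberated speech.
    # Model per band f: Y[:, f] ~= conv(H[:, f], X[:, f])[:T] with X, H >= 0.
    # Multiplicative updates for the least-squares objective; `sparsity`
    # adds an optional L1 penalty on X.
    T, F = Y.shape
    X = Y.copy() + eps
    H = np.random.default_rng(0).random((L, F)) + eps
    for _ in range(n_iter):
        for f in range(F):
            y, x, h = Y[:, f], X[:, f], H[:, f]
            yhat = np.convolve(h, x)[:T]
            # update X: correlation with h == convolution with h reversed
            num = np.convolve(h[::-1], y)[L - 1:L - 1 + T]
            den = np.convolve(h[::-1], yhat)[L - 1:L - 1 + T] + sparsity + eps
            x = x * num / den
            yhat = np.convolve(h, x)[:T]
            # update H: correlation with x
            num = np.convolve(x[::-1], y)[T - 1:T - 1 + L]
            den = np.convolve(x[::-1], yhat)[T - 1:T - 1 + L] + eps
            X[:, f], H[:, f] = x, h * num / den
    return X, H
```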


International Conference on Acoustics, Speech, and Signal Processing | 2011

Binaural sound source separation motivated by auditory processing

Chanwoo Kim; Kshitiz Kumar; Richard M. Stern

In this paper we present a new method of signal processing for robust speech recognition using two microphones. The method, loosely based on the human binaural hearing system, begins by passing the speech signals detected by the two microphones through bandpass filtering. We develop a spatial masking function based on normalized cross-correlation, which provides rejection of off-axis interfering signals. To obtain improvements in reverberant environments, we add a temporal masking component that is closely related to our previously described dereverberation technique known as SSF. We demonstrate that this approach provides substantially better recognition accuracy than conventional binaural sound-source separation algorithms.
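A rough sketch of a cross-correlation-based spatial mask of the sort described above (assumed Gaussian weighting of the interaural phase difference and illustrative constants; the paper's exact masking function may differ):

```python
import numpy as np

def spatial_mask(stft_l, stft_r, width=0.3, floor=0.05):
    # stft_l, stft_r: complex STFTs (freq x frames) from the two microphones.
    # An on-axis target arrives with near-zero interaural phase difference
    # (IPD), so bins with large |IPD| are attenuated toward `floor`.
    ipd = np.angle(stft_l * np.conj(stft_r))
    mask = np.maximum(np.exp(-(ipd / width) ** 2), floor)
    return mask * stft_l  # enhanced target estimate (resynthesize via iSTFT)
```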


International Conference on Acoustics, Speech, and Signal Processing | 2010

Maximum-likelihood-based cepstral inverse filtering for blind speech dereverberation

Kshitiz Kumar; Richard M. Stern

Current state-of-the-art speech recognition systems work quite well in controlled environments, but their performance degrades severely under the realistic acoustical conditions found in reverberant environments. In this paper we build on recent developments that represent reverberation in the cepstral feature domain as a filtering operation, and we formulate a maximum-likelihood objective to obtain an inverse reverberation filter. We show analytically that the optimal inverse filter can be approximately obtained under certain assumptions about the corresponding clean speech signal. We demonstrate that our approach reduces the relative gap in word error rate by 30 percent for large as well as small reverberation times.
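One classical way to realize such an inverse filter, sketched below as an editor's illustration: if clean cepstra are modeled as i.i.d. Gaussian, maximizing the likelihood of the filtered features reduces to linear prediction applied per cepstral track, with the prediction residual serving as the dereverberated estimate. This is an assumed simplification, not the paper's exact derivation:

```python
import numpy as np

def lp_inverse_filter(ceps, order=4):
    # ceps: (frames x dims) reverberant cepstral features. Per cepstral
    # track, fit a linear predictor (autocorrelation method) and output the
    # prediction residual as the inverse-filtered estimate.
    T = len(ceps)
    out = np.empty_like(ceps)
    for d in range(ceps.shape[1]):
        x = ceps[:, d] - ceps[:, d].mean()
        r = np.correlate(x, x, mode="full")[T - 1:T + order]       # r(0)..r(order)
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        a = np.linalg.solve(R, r[1:order + 1])                     # LP coefficients
        out[:, d] = x - np.convolve(np.r_[0.0, a], x)[:T]          # residual
    return out
```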


International Conference on Acoustics, Speech, and Signal Processing | 2011

An iterative least-squares technique for dereverberation

Kshitiz Kumar; Bhiksha Raj; Rita Singh; Richard M. Stern

Some recent dereverberation approaches that have been effective for automatic speech recognition (ASR) applications model reverberation as a linear convolution operation in the spectral domain and derive a factorization to decompose the spectra of reverberated speech into those of clean speech and a room-response filter. Typically, a general non-negative matrix factorization (NMF) framework is employed for this. In this work we present an alternative to NMF and propose an iterative least-squares deconvolution technique for spectral factorization. We propose an efficient algorithm for this and experimentally demonstrate its effectiveness in improving ASR performance. The new method results in 40–50% relative reduction in word error rates over standard baselines on artificially reverberated speech.
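An editor's sketch of one band of such an alternating least-squares deconvolution (Toeplitz construction, initialization, and non-negativity by clipping are illustrative assumptions, not the paper's algorithm):

```python
import numpy as np

def ils_deconv(y, L=8, n_iter=50, eps=1e-8):
    # y: (T,) one spectral band of reverberated speech magnitude. Alternate
    # closed-form least-squares solves for the clean band x and the filter h,
    # clipping negative values to keep both non-negative.
    T = len(y)
    x = y.copy()
    h = np.exp(-np.arange(L)); h /= h.sum()          # decaying initial filter
    for _ in range(n_iter):
        A = np.zeros((T, T))                         # convolution matrix of h
        for k in range(L):
            A += np.diag(np.full(T - k, h[k]), -k)
        x = np.maximum(np.linalg.lstsq(A, y, rcond=None)[0], eps)
        B = np.zeros((T, L))                         # convolution matrix of x
        for k in range(L):
            B[k:, k] = x[:T - k]
        h = np.maximum(np.linalg.lstsq(B, y, rcond=None)[0], eps)
    return x, h
```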


2008 Hands-Free Speech Communication and Microphone Arrays | 2008

Binaural and Multiple-Microphone Signal Processing Motivated by Auditory Perception

Richard M. Stern; Evandro Gouvea; Chanwoo Kim; Kshitiz Kumar; Hyung-Min Park

It is well known that binaural processing is very useful for separating incoming sound sources as well as for improving the intelligibility of speech in reverberant environments. This paper describes and compares a number of ways in which the classic model of interaural cross-correlation proposed by Jeffress, quantified by Colburn, and further elaborated by Blauert, Lindemann, and others, can be applied to improving the accuracy of automatic speech recognition systems operating in cluttered, noisy, and reverberant environments. Typical implementations begin with an abstraction of cross-correlation of the incoming signals after nonlinear monaural bandpass processing, but there are many alternative implementation choices that can be considered. Typical implementations differ in the ways in which an enhanced version of the desired signal is developed using binaural principles, in the extent to which specific processing mechanisms are used to impose suppression motivated by the precedence effect, and in the precise mechanism used to extract interaural time differences.
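For concreteness, a minimal sketch of the Jeffress-style computation these systems build on: the interaural time difference (ITD) for one band-passed channel pair is estimated as the lag maximizing the normalized interaural cross-correlation (an editor's illustration; real systems operate per band and per frame):

```python
import numpy as np

def itd_estimate(left, right, fs, max_itd=1e-3):
    # left, right: band-passed signals from the two ears/microphones.
    # Return the lag (in seconds) maximizing normalized cross-correlation.
    n, max_lag = len(left), int(max_itd * fs)
    denom = np.sqrt(np.dot(left, left) * np.dot(right, right)) + 1e-12
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = [np.dot(left[max(0, -l):n - max(0, l)],
                    right[max(0, l):n - max(0, -l)]) / denom for l in lags]
    return lags[int(np.argmax(xcorr))] / fs
```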


IEEE Automatic Speech Recognition and Understanding Workshop | 2009

Robust speech recognition using a Small Power Boosting algorithm

Chanwoo Kim; Kshitiz Kumar; Richard M. Stern

In this paper, we present a noise-robustness algorithm called Small Power Boosting (SPB). We observe that in the spectral domain, time-frequency bins with smaller power are more affected by additive noise. The conventional way of handling this problem is to estimate the noise from the test utterance and perform normalization or subtraction. In our work, in contrast, we intentionally boost the power of time-frequency bins with small energy for both the training and testing datasets. Since time-frequency bins with small power no longer exist after this power boosting, the spectral distortion between the clean and corrupted test sets is reduced. This type of small power boosting is also highly related to physiological nonlinearity. We observe that when small power boosting is done, suitable weight smoothing becomes highly important. Our experimental results indicate that this simple idea is very helpful in very difficult noisy environments, such as corruption by background music.
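The core operation can be sketched in a few lines (an editor's illustration; the boosting ratio is an assumed constant, and the paper's weight-smoothing step is not shown):

```python
import numpy as np

def small_power_boost(power_spec, boost_ratio=0.02):
    # power_spec: (frames x bins) power spectrogram, processed identically in
    # training and test. Bins below a fraction of the utterance peak are
    # raised to that floor, so very small-power bins no longer exist.
    floor = boost_ratio * power_spec.max()
    return np.maximum(power_spec, floor)
```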


Computer Vision and Pattern Recognition | 2009

Audio-visual speech synchronization detection using a bimodal linear prediction model

Kshitiz Kumar; Jiri Navratil; Etienne Marcheret; Vit Libal; Ganesh N. Ramaswamy; Gerasimos Potamianos

In this work, we study the problem of detecting audio-visual (AV) synchronization in video segments containing a speaker in frontal head pose. The problem has important applications in biometrics, for example spoofing detection, and it constitutes an important step in the AV segmentation necessary for deriving AV fingerprints in multimodal speaker recognition. To attack the problem, we propose a time-evolution model for AV features and derive an analytical approach to capture the notion of synchronization between them. We report results on an appropriate AV database, using two types of visual features extracted from the speaker's facial area: geometric ones and features based on the discrete cosine image transform. Our results demonstrate that the proposed approach provides substantially better AV synchrony detection than a baseline method that employs mutual information, with the geometric visual features outperforming the image-transform ones.
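A minimal sketch of the time-evolution idea (an editor's illustration: each visual frame is regressed on the current and past audio frames, and synchrony is scored by explained variance; the paper's bimodal linear prediction model may differ in detail):

```python
import numpy as np

def av_sync_score(audio, visual, order=3):
    # audio: (T x Da), visual: (T x Dv) frame-synchronous features.
    # Regress each visual frame on the current and order-1 past audio frames;
    # score synchrony by the fraction of visual variance the model explains.
    T = audio.shape[0]
    X = np.hstack([audio[order - 1 - k:T - k] for k in range(order)])  # lagged stack
    Y = visual[order - 1:]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ W
    return 1.0 - resid.var() / Y.var()  # higher => more synchronous
```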


International Conference on Acoustics, Speech, and Signal Processing | 2008

Environment-invariant compensation for reverberation using linear post-filtering for minimum distortion

Kshitiz Kumar; Richard M. Stern

Speaker identification systems work quite well in controlled environments but their performance degrades severely in the presence of the reverberation that is frequently encountered in realistic acoustical environments. In this paper we develop an algorithm to make speaker identification systems more robust to reverberation by passing sequences of cepstral features through a short FIR filter. The coefficients of the filter are chosen to minimize the mean square differences between compensated features in the training and testing environments. Surprisingly, the resulting filter coefficients are relatively invariant to the actual nature of the reverberation. The use of the post-filtering approach is shown to improve speaker identification accuracy, especially when reverberation times are relatively long.
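A sketch of how such a minimum-distortion post-filter could be fit by least squares, given feature sequences of the same utterances in the training and testing environments (tap count and causal alignment are editor's assumptions):

```python
import numpy as np

def fit_compensation_filter(train_ceps, test_ceps, taps=5):
    # train_ceps, test_ceps: (frames x dims) cepstral sequences of the same
    # utterance in the clean and reverberant environments. Solve for FIR
    # coefficients w minimizing ||conv(w, test) - train||^2, with the filter
    # shared across all cepstral dimensions.
    T, D = test_ceps.shape
    blocks, targets = [], []
    for d in range(D):
        x = test_ceps[:, d]
        lagged = np.column_stack([np.r_[np.zeros(k), x[:T - k]] for k in range(taps)])
        blocks.append(lagged)
        targets.append(train_ceps[:, d])
    w, *_ = np.linalg.lstsq(np.vstack(blocks), np.concatenate(targets), rcond=None)
    return w  # apply per dimension: np.convolve(w, feature)[:T]
```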

Collaboration


Dive into Kshitiz Kumar's collaborations.

Top Co-Authors

Richard M. Stern, Carnegie Mellon University
Chanwoo Kim, Carnegie Mellon University
Bhiksha Raj, Carnegie Mellon University
Evandro Gouvea, Carnegie Mellon University
Qi Wu, Carnegie Mellon University
Rita Singh, Carnegie Mellon University
Yiming Wang, Carnegie Mellon University