Hemant Kumar Kathania
National Institute of Technology Sikkim
Publication
Featured research published by Hemant Kumar Kathania.
IEEE Signal Processing Letters | 2017
Syed Shahnawazuddin; Nagaraj Adiga; Hemant Kumar Kathania
Transcribing children's speech using acoustic models trained on adults' speech is very challenging. In such conditions, highly degraded recognition performance is reported due to the large mismatch in the acoustic/linguistic attributes of the training and test data. The difference in pitch (or fundamental frequency) between the two groups of speakers is one among several mismatch factors. Another important mismatch factor is the difference in speaking rates. To overcome these two sources of mismatch, prosody modification is explored in this letter. Prosody modification is done by using glottal closure instants (GCIs) as anchoring points. The GCIs, in turn, are determined using zero-frequency filtering (ZFF). The ZFF-GCI-based prosody modification is fast and results in highly accurate scaling of pitch and speaking rate. The experimental evaluations studying the effect of prosody modification resulted in a relative improvement of 50% over the baseline.
Digital Signal Processing | 2018
Syed Shahnawazuddin; Nagaraj Adiga; Hemant Kumar Kathania; Gayadhar Pradhan; Rohit Sinha
Circuits Systems and Signal Processing | 2018
Hemant Kumar Kathania; Waquar Ahmad; Syed Shahnawazuddin; A. B. Samaddar
Circuits Systems and Signal Processing | 2018
Syed Shahnawazuddin; Chaman Singh; Hemant Kumar Kathania; Waquar Ahmad; Gayadhar Pradhan
In the context of automatic speech recognition (ASR) systems, the front-end acoustic features should not be affected by signal periodicity (pitch period). Motivated by this fact, we have studied the role of a pitch-synchronous spectrum estimation approach, referred to as TANDEM STRAIGHT, in this paper. TANDEM STRAIGHT results in a smoother spectrum that is largely devoid of pitch harmonics. Consequently, the acoustic features derived using the smoothed spectra outperform the conventional Mel-frequency cepstral coefficients (MFCC). The experimental evaluations reported in this paper are performed on speech data from a wide range of speakers belonging to different age groups, including children. The proposed features are found to be effective for all groups of speakers. To further improve the recognition of children's speech, the effect of vocal-tract length normalization (VTLN) is studied. The inclusion of VTLN further improves the recognition performance. We have also performed a detailed study on the effect of speaking-rate normalization (SRN) in the context of children's speech recognition. An SRN technique based on the anchoring of glottal closure instants estimated using zero-frequency filtering is explored in this regard. SRN is observed to be highly effective for child speakers belonging to different age groups. Finally, all the studied techniques are combined for effective mismatch reduction. In the case of the children's speech test set, the use of the proposed features results in a relative improvement of 21.6% over the MFCC features even after combining VTLN and SRN.
IEEE Region 10 Conference | 2015
Syed Shahnawazuddin; Hemant Kumar Kathania; Rohit Sinha
Recognizing children’s speech on automatic speech recognition (ASR) systems developed using adults’ speech is a very challenging task. As reported by several earlier works, severely degraded recognition performance is observed in such ASR tasks. This is mainly due to the gross mismatch in the acoustic and linguistic attributes between those two groups of speakers. One among the various identified sources of mismatch is that the vocal organs of adult and child speakers are of significantly different dimensions. Feature-space normalization techniques are noted to effectively address the ill-effects arising from those differences. The two most commonly used approaches are vocal-tract length normalization and feature-space maximum-likelihood linear regression. Another important mismatch factor is the large variation in the average pitch values across adult and child speakers. Addressing the ill-effects introduced by the pitch differences is the primary focus of the presented study. In this regard, we have explored the feasibility of explicitly changing the pitch of the children’s speech so that the observed pitch differences between the two groups of speakers are reduced. In general, speech data from children is high-pitched in comparison with that from adults. Consequently, in this study, the pitch of the adults’ speech used for training the ASR system is kept unchanged while that of the children’s test speech data is reduced. Significant improvement in the recognition performance is noted with this explicit reduction of pitch. To conserve the critical spectral information and to avoid introducing perceptual artifacts, we have exploited timescale modification techniques for explicit pitch mapping. Furthermore, we also present two schemes to automatically determine the factor by which the pitch of the given test data should be varied.
Automatically determining the compensation factor is critical since an ASR system is expected to be accessed by both adult and child speakers. The effectiveness of the proposed techniques is evaluated on ASR systems trained on adult data and employing different acoustic modeling approaches, viz. Gaussian mixture modeling (GMM), subspace GMM and deep neural networks (DNN). The proposed techniques are found to be highly effective in all the explored modeling paradigms. To further study the effectiveness of the proposed approaches, another DNN-based ASR system is developed on a mix of speech data from adult as well as child speakers. The use of pitch reduction is observed to be effective even in this case.
International Conference on Signal Processing | 2014
Hemant Kumar Kathania; Syed Shahnawazuddin; Rohit Sinha
It is well known that the recognition performance of an automatic speech recognition (ASR) system is affected by intra-speaker as well as inter-speaker variability. The differences in the geometry of vocal organs, pitch and speaking rate among speakers are some of the inter-speaker variabilities affecting the recognition performance. A mismatch between the training and test data with respect to any of those factors leads to increased error rates. An example of acoustically mismatched ASR is the task of transcribing children’s speech on a system trained on adult data. A large number of earlier studies present a myriad of techniques for addressing the acoustic mismatch arising from differences in pitch and dimensions of vocal organs. At the same time, only a few works on speaking-rate adaptation employing timescale modification have been reported. Furthermore, those studies were performed on ASR systems developed using Gaussian mixture models. Motivated by these facts, speaking-rate adaptation is explored in this work in the context of a children’s ASR system employing deep neural network-based acoustic modeling. Speaking-rate adaptation is performed by changing the frame length and overlap during the front-end feature extraction process. Significant reductions in errors are noted with speaking-rate adaptation. In addition, we have also studied the effect of combining speaking-rate adaptation with vocal-tract length normalization and explicit pitch modification. In both cases, additive improvements are obtained. To summarize, relative improvements of 15–20% over the baselines are obtained by varying the frame length and frame overlap.
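As a rough illustration of the frame-length and overlap adjustment described in the abstract above, the following Python sketch frames the same signal with two different settings. It is not the authors' implementation: the function name `frame_signal`, the 16 kHz sampling rate, the Hamming window, and the specific frame lengths and shifts (25 ms/10 ms and 20 ms/8 ms) are all assumptions chosen only to show the mechanism.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a 1-D signal into overlapping, Hamming-windowed frames.

    frame_ms and shift_ms are hypothetical knobs: changing them alters how
    many feature vectors cover the same stretch of speech.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    # Number of full frames that fit without padding.
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

sample_rate = 16000  # assumed sampling rate
# One second of noise as a stand-in for a speech waveform.
signal = np.random.default_rng(0).standard_normal(sample_rate)

# Illustrative "default" front-end setting.
default_frames = frame_signal(signal, sample_rate, frame_ms=25.0, shift_ms=10.0)
# Illustrative "adapted" setting with a shorter frame and a shorter shift.
adapted_frames = frame_signal(signal, sample_rate, frame_ms=20.0, shift_ms=8.0)

print(default_frames.shape, adapted_frames.shape)
```

With a shorter frame shift, the same utterance is represented by more frames, so from the acoustic model's point of view it appears to be spoken more slowly; a longer shift has the opposite effect. Which direction and which values suit a given test set is an empirical question that this sketch does not address.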
International Conference on Acoustics, Speech, and Signal Processing | 2018
Hemant Kumar Kathania; Syed Shahnawazuddin; Nagaraj Adiga; Waquar Ahmad
National Conference on Communications | 2017
Hemant Kumar Kathania; Syed Shahnawazuddin; Rohit Sinha
Conference of the International Speech Communication Association | 2017
Waquar Ahmad; Syed Shahnawazuddin; Hemant Kumar Kathania; Gayadhar Pradhan; A. B. Samaddar
IEEE Region 10 Conference | 2016
Hemant Kumar Kathania; Syed Shahnawazuddin; Gayadhar Pradhan; A. B. Samaddar