
Publications


Featured research published by Syed Shahnawazuddin.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2017

Enhancing noise and pitch robustness of children's ASR

Syed Shahnawazuddin; K T Deepak; Gayadhar Pradhan; Rohit Sinha

It is well known that, when noisy speech is transcribed using automatic speech recognition (ASR) systems trained on clean data, a highly degraded recognition performance is obtained. The problem gets further aggravated when the targeted group happens to be child speakers. For children's speech, the acoustic correlates such as pitch and formant frequencies vary significantly with age. This makes the recognition of children's speech very challenging. In this paper, we have explored ways to enhance the noise robustness of ASR systems for children's speech. Towards addressing the same, recently developed front-end acoustic features based on spectral moments (SMAC) are explored. The SMAC features are reported to be more noise robust than conventional features like the mel-frequency cepstral coefficients. At the same time, the SMAC features are also noted to be sensitive to variations in pitch. To reduce the pitch sensitivity, a spectral smoothing approach based on adaptive-liftering is proposed. Spectral smoothing prior to the computation of spectral moments results in a significant improvement in the robustness to pitch without affecting the noise immunity. To further enhance noise robustness, a foreground speech segmentation and enhancement module is also included in the proposed front-end speech parameterization technique.
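
The adaptive-liftering idea mentioned above can be sketched in a few lines of numpy: the frame's real cepstrum is truncated just below the pitch period so that the harmonic ripple is removed before any spectral moments are computed. The function below only illustrates that smoothing step, not the paper's implementation; the lifter fraction, FFT size and window are assumed values.

```python
import numpy as np

def pitch_adaptive_smooth_spectrum(frame, fs, f0, lifter_frac=0.7, n_fft=1024):
    """Smooth a frame's log-magnitude spectrum by pitch-adaptive low-time liftering.

    The cepstral cutoff is placed just below the pitch period, so the harmonic
    ripple (a cepstral peak near 1/f0) is removed while the envelope survives.
    Illustrative sketch only; parameter values are assumptions.
    """
    spec = np.fft.rfft(frame * np.hanning(len(frame)), n_fft)
    log_mag = np.log(np.abs(spec) + 1e-10)
    # Real cepstrum of the frame (inverse DFT of the log-magnitude spectrum).
    cep = np.fft.irfft(log_mag, n_fft)
    # Adaptive lifter cutoff: a fraction of the pitch period in samples.
    cutoff = int(lifter_frac * fs / f0)
    lifter = np.zeros(n_fft)
    lifter[:cutoff] = 1.0
    lifter[-cutoff + 1:] = 1.0          # keep the symmetric negative-quefrency part
    smooth_log_mag = np.fft.rfft(cep * lifter, n_fft).real
    return np.exp(smooth_log_mag)        # smoothed magnitude spectrum
```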


IEEE Signal Processing Letters | 2017

Effect of Prosody Modification on Children's ASR

Syed Shahnawazuddin; Nagaraj Adiga; Hemant Kumar Kathania

Transcribing children's speech using acoustic models trained on adults’ speech is very challenging. In such conditions, a highly degraded recognition performance is reported due to the large mismatch in the acoustic/linguistic attributes of the training and test data. The difference in pitch (or fundamental frequency) between the two groups of speakers is one among several mismatch factors. Another important mismatch factor is the difference in speaking rates. To overcome these two sources of mismatch, prosody modification is explored in this letter. Prosody modification is done by using glottal closure instants (GCIs) as anchoring points. The GCIs, in turn, are determined using zero-frequency filtering (ZFF). The ZFF-GCI-based prosody modification is fast and results in highly accurate scaling of pitch and speaking rate. The experimental evaluations studying the effect of prosody modification resulted in a relative improvement of 50% over the baseline.
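
As a rough illustration of the ZFF step mentioned in the abstract, the sketch below locates GCIs as the negative-to-positive zero crossings of a twice-integrated, trend-removed signal. The trend-removal window of about 1.5 average pitch periods and the number of passes follow common ZFF practice and are assumptions here, not necessarily the letter's exact settings; the prosody-modification (anchoring and resynthesis) stage is not shown.

```python
import numpy as np
from scipy.signal import lfilter

def zff_gci(speech, fs, avg_pitch_hz=120.0, trend_passes=3):
    """Locate glottal closure instants with zero-frequency filtering (sketch).

    The differenced signal is passed twice through a zero-frequency resonator
    (a double integrator); the polynomial trend is then removed by repeated
    mean subtraction over ~1.5 pitch periods.  GCIs are taken as the
    negative-to-positive zero crossings of the resulting ZFF signal.
    """
    x = np.diff(speech, prepend=speech[0])          # remove any DC bias
    y = x
    for _ in range(2):                              # cascade of two 0-Hz resonators
        y = lfilter([1.0], [1.0, -2.0, 1.0], y)
    win = int(round(1.5 * fs / avg_pitch_hz))       # trend-removal window (samples)
    for _ in range(trend_passes):
        y = y - np.convolve(y, np.ones(win) / win, mode="same")
    # GCIs: negative-to-positive zero crossings of the ZFF output.
    gci = np.where((y[:-1] < 0) & (y[1:] >= 0))[0] + 1
    return gci / fs                                  # GCI locations in seconds
```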


IEEE Signal Processing Letters | 2017

Pitch-Normalized Acoustic Features for Robust Children's Speech Recognition

Syed Shahnawazuddin; Rohit Sinha; Gayadhar Pradhan

In this letter, the effectiveness of the recently reported SMAC (Spectral Moment time–frequency distribution Augmented by low-order Cepstral) features has been evaluated for robust automatic speech recognition (ASR). The SMAC features consist of normalized first central spectral moments appended with low-order cepstral coefficients. These features have been designed to achieve robustness to both additive noise and pitch variations. We have explored the SMAC features in a severely pitch-mismatched ASR task, i.e., decoding of children's speech on an ASR system trained on adults’ speech. In this task, the SMAC features are still observed to be sensitive to pitch variations. Toward addressing the same, a simple spectral smoothing approach employing adaptive cepstral truncation is explored prior to the computation of spectral moments. With the proposed modification, the SMAC features are noted to achieve enhanced pitch robustness without affecting their noise immunity. Furthermore, the effectiveness of the proposed features is explored in three dominant acoustic modeling paradigms and varying data conditions. In all the cases, the proposed features are observed to significantly outperform the existing ones.
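
A very loose sketch of the feature recipe quoted above (subband first spectral moments plus low-order cepstra) is given below. The band layout, the normalization by band energy and bandwidth, and the feature dimensions are assumptions for illustration only and almost certainly differ from the published SMAC definition.

```python
import numpy as np

def subband_moment_features(frame, fs, n_bands=20, n_cep=5, n_fft=512):
    """Subband first spectral moments + low-order cepstra (illustrative sketch).

    Loosely follows the description quoted in the abstract: a first moment of
    the power spectrum is computed in each subband, normalised by band energy
    and bandwidth, and a few low-order cepstral coefficients are appended.
    """
    pow_spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    edges = np.linspace(0, fs / 2, n_bands + 1)      # assumed linear band layout
    moments = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = (freqs >= lo) & (freqs < hi)
        band = pow_spec[idx]
        centre = 0.5 * (lo + hi)
        # First moment about the band centre, normalised by energy and bandwidth.
        m1 = np.sum((freqs[idx] - centre) * band) / ((hi - lo) * (np.sum(band) + 1e-10))
        moments.append(m1)
    # Low-order cepstral coefficients from the log power spectrum.
    cep = np.fft.irfft(np.log(pow_spec + 1e-10), n_fft)[:n_cep]
    return np.concatenate([np.asarray(moments), cep])
```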


Circuits Systems and Signal Processing | 2017

Improvements in the Detection of Vowel Onset and Offset Points in a Speech Sequence

Avinash Kumar; Syed Shahnawazuddin; Gayadhar Pradhan

Detecting the vowel regions in a given speech signal has been a challenging area of research for a long time. A number of works have been reported over the years to accurately detect the vowel regions and the corresponding vowel onset points (VOPs) and vowel end points (VEPs). Effectiveness of the statistical acoustic modeling techniques and the front-end signal processing approaches has been explored in this regard. The work presented in this paper aims at improving the detection of vowel regions as well as the VOPs and VEPs. A number of statistical modeling approaches developed over the years have been employed in this work for the aforementioned task. To do the same, three-class classifiers (vowel, nonvowel and silence) are developed on the TIMIT database employing the different acoustic modeling techniques and the classification performances are studied. Using any particular three-class classifier, a given speech sample is then forced-aligned against the trained acoustic model under the constraints of first-pass transcription to detect the vowel regions. The correctly detected and spurious vowel regions are analyzed in detail to find the impact of semivowel and nasal sound units on the detection of vowel regions as well as on the determination of VOPs and VEPs. In addition to that, a novel front-end feature extraction technique exploiting the temporal and spectral characteristics of the excitation source information in the speech signal is also proposed. The use of the proposed excitation source feature results in the detection of vowel regions that are quite different from those obtained through the mel-frequency cepstral coefficients. Exploiting those differences in the obtained evidences by using the two kinds of features, a technique to combine the evidences is also proposed in order to get a better estimate of the VOPs and VEPs. When the proposed techniques are evaluated on the vowel–nonvowel classification systems developed using the TIMIT database, significant improvements are noted. Moreover, the improvements are noted to hold across all the acoustic modeling paradigms explored in the presented work.
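
Once a three-class (vowel/non-vowel/silence) classifier has produced frame-level decisions, turning them into vowel onset and end points is a simple post-processing step. The sketch below shows only that step, with a hypothetical label convention and frame shift; it is not the paper's detection system.

```python
import numpy as np

def vowel_boundaries(frame_labels, hop_s=0.010):
    """Turn per-frame vowel/non-vowel/silence decisions into (VOP, VEP) pairs.

    `frame_labels` is a sequence like ['sil', 'nonvowel', 'vowel', ...], e.g.
    obtained from forced alignment with a three-class model.  Each maximal run
    of 'vowel' frames yields one (VOP, VEP) pair in seconds.
    """
    is_vowel = np.asarray([lab == 'vowel' for lab in frame_labels], dtype=int)
    change = np.diff(np.concatenate(([0], is_vowel, [0])))
    onsets = np.where(change == 1)[0]      # first frame of each vowel run
    offsets = np.where(change == -1)[0]    # frame just after each vowel run
    return [(on * hop_s, off * hop_s) for on, off in zip(onsets, offsets)]
```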


Circuits Systems and Signal Processing | 2018

An Efficient ECG Denoising Technique Based on Non-local Means Estimation and Modified Empirical Mode Decomposition

Pratik Singh; Syed Shahnawazuddin; Gayadhar Pradhan

The noninvasive nature of the electrocardiogram (ECG) signal makes it widely accepted for cardiac diagnosis. During data acquisition, the ECG signal is generally corrupted by various types of noise. Further, during ambulatory monitoring and wireless recording, the ECG signal gets corrupted by additive white Gaussian noise. Without affecting the morphological structure, denoising of the ECG signal is essential for proper diagnosis. This paper presents an ECG denoising method based on an effective combination of non-local means (NLM) estimation and empirical mode decomposition (EMD). Earlier works have shown that the patch-based NLM approach is insufficient for denoising the under-averaged region near the high-amplitude QRS complex. To address this issue, the denoised signal obtained by NLM is decomposed into intrinsic mode functions (IMFs) using EMD in this work. Next, thresholding of the IMFs is done using the instantaneous half-period criterion and soft-thresholding to obtain the final denoised output. Furthermore, the modified empirical mode decomposition (M-EMD) is used in place of standard EMD to reduce the computational cost. Performance of the proposed method is tested on a number of ECG signals from the MIT-BIH database. The experimental results presented in this paper show that the aforementioned shortcoming of the NLM method is addressed to a large extent. Moreover, the proposed approach provides improved performance when compared to different state-of-the-art ECG denoising methods.
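
For readers unfamiliar with patch-based non-local means on 1-D signals, a minimal (and deliberately slow) sketch is given below. The patch and search-window sizes and the smoothing parameter h are illustrative, and the EMD/M-EMD refinement stage described in the abstract is omitted.

```python
import numpy as np

def nlm_denoise_1d(sig, patch_half=5, search_half=50, h=0.1):
    """Patch-based non-local means estimation for a 1-D signal (sketch).

    Each sample is replaced by a weighted average of samples inside a search
    window, where the weights depend on the similarity of the surrounding
    patches.  `h` controls the smoothing strength.
    """
    n = len(sig)
    padded = np.pad(sig, patch_half, mode="edge")
    out = np.zeros(n)
    for i in range(n):
        p_i = padded[i : i + 2 * patch_half + 1]
        lo, hi = max(0, i - search_half), min(n, i + search_half + 1)
        weights = np.empty(hi - lo)
        for k, j in enumerate(range(lo, hi)):
            p_j = padded[j : j + 2 * patch_half + 1]
            d2 = np.mean((p_i - p_j) ** 2)           # patch distance
            weights[k] = np.exp(-d2 / (h * h))
        out[i] = np.dot(weights, sig[lo:hi]) / np.sum(weights)
    return out
```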


Digital Signal Processing | 2018

Studying the role of pitch-adaptive spectral estimation and speaking-rate normalization in automatic speech recognition

Syed Shahnawazuddin; Nagaraj Adiga; Hemant Kumar Kathania; Gayadhar Pradhan; Rohit Sinha

In the context of automatic speech recognition (ASR) systems, the front-end acoustic features should not be affected by signal periodicity (pitch period). Motivated by this fact, we have studied the role of a pitch-synchronous spectrum estimation approach, referred to as TANDEM STRAIGHT, in this paper. TANDEM STRAIGHT results in a smoother spectrum devoid of pitch harmonics to a large extent. Consequently, the acoustic features derived using the smoothed spectra outperform the conventional mel-frequency cepstral coefficients (MFCC). The experimental evaluations reported in this paper are performed on speech data from a wide range of speakers belonging to different age groups, including children. The proposed features are found to be effective for all groups of speakers. To further improve the recognition of children's speech, the effect of vocal-tract length normalization (VTLN) is studied. The inclusion of VTLN further improves the recognition performance. We have also performed a detailed study on the effect of speaking-rate normalization (SRN) in the context of children's speech recognition. An SRN technique based on the anchoring of glottal closure instants estimated using zero-frequency filtering is explored in this regard. SRN is observed to be highly effective for child speakers belonging to different age groups. Finally, all the studied techniques are combined for effective mismatch reduction. In the case of the children's speech test set, the use of the proposed features results in a relative improvement of 21.6% over the MFCC features even after combining VTLN and SRN.
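
The pitch-synchronous spectrum estimation mentioned above rests on the TANDEM idea of averaging two power spectra taken half a pitch period apart, which stabilizes the spectrum against where the analysis window falls within a glottal cycle. The sketch below shows only that averaging step, with assumed window length and FFT size; the frequency-domain smoothing performed by TANDEM STRAIGHT proper is not reproduced.

```python
import numpy as np

def tandem_power_spectrum(speech, fs, pos, f0, win_ms=40.0, n_fft=1024):
    """Temporally stable power spectrum via TANDEM-style averaging (sketch).

    Two windowed power spectra taken half a pitch period apart are averaged,
    which largely cancels the dependence of the spectrum on where the window
    lands within a glottal cycle.
    """
    half_t0 = int(round(0.5 * fs / f0))              # half pitch period, samples
    win_len = int(round(win_ms * 1e-3 * fs))
    window = np.hanning(win_len)

    def power_spec(start):
        seg = speech[start : start + win_len]
        if len(seg) < win_len:                       # zero-pad near the signal end
            seg = np.pad(seg, (0, win_len - len(seg)))
        return np.abs(np.fft.rfft(seg * window, n_fft)) ** 2

    return 0.5 * (power_spec(pos) + power_spec(pos + half_t0))
```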


Circuits Systems and Signal Processing | 2018

Explicit Pitch Mapping for Improved Children’s Speech Recognition

Hemant Kumar Kathania; Waquar Ahmad; Syed Shahnawazuddin; A. B. Samaddar

Recognizing children’s speech on automatic speech recognition (ASR) systems developed using adults’ speech is a very challenging task. As reported by several earlier works, a severely degraded recognition performance is observed in such ASR tasks. This is mainly due to the gross mismatch in the acoustic and linguistic attributes between those two groups of speakers. One among the various identified sources of mismatch is that the vocal organs of adult and child speakers are of significantly different dimensions. Feature-space normalization techniques are noted to effectively address the ill-effects arising from those differences. The two most commonly used approaches are vocal-tract length normalization and feature-space maximum-likelihood linear regression. Another important mismatch factor is the large variation in the average pitch values across adult and child speakers. Addressing the ill-effects introduced by the pitch differences is the primary focus of the presented study. In this regard, we have explored the feasibility of explicitly changing the pitch of the children’s speech so that the observed pitch differences between the two groups of speakers are reduced. In general, speech data from children is high-pitched in comparison with that from adults. Consequently, in this study, the pitch of the adults’ speech used for training the ASR system is kept unchanged while that of the children’s test speech data is reduced. Significant improvement in the recognition performance is noted with this explicit reduction of pitch. To conserve the critical spectral information and to avoid introducing perceptual artifacts, we have exploited timescale modification techniques for explicit pitch mapping. Furthermore, we also present two schemes to automatically determine the factor by which the pitch of the given test data should be varied. Automatically determining the compensation factor is critical since an ASR system is expected to be accessed by both adult and child speakers. The effectiveness of the proposed techniques is evaluated on adult-data-trained ASR systems employing different acoustic modeling approaches, viz. Gaussian mixture modeling (GMM), subspace GMM and deep neural networks (DNN). The proposed techniques are found to be highly effective in all the explored modeling paradigms. To further study the effectiveness of the proposed approaches, another DNN-based ASR system is developed on a mix of speech data from adult as well as child speakers. The use of pitch reduction is observed to be effective even in this case.
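
A minimal sketch of duration-preserving pitch scaling via time-scale modification plus resampling is shown below, assuming librosa is available. The phase-vocoder stretch stands in for whatever TSM method the paper actually employs, and the scaling factor alpha would in practice come from the automatic estimation schemes mentioned in the abstract (for example, the ratio of adult to child average pitch).

```python
import librosa

def scale_pitch(speech, fs, alpha):
    """Scale the pitch of an utterance by `alpha` (alpha < 1 lowers it) while
    keeping its duration roughly unchanged, via time-scale modification plus
    resampling.  Illustrative sketch, not the paper's exact pipeline.
    """
    # 1) Shorten the signal to alpha times its length; TSM keeps the pitch.
    stretched = librosa.effects.time_stretch(speech, rate=1.0 / alpha)
    # 2) Interpolate onto a grid that is 1/alpha times denser.  Interpreting
    #    the result at the original rate `fs` restores the original duration
    #    and scales every frequency, including the pitch, by alpha.
    return librosa.resample(stretched, orig_sr=fs, target_sr=fs / alpha)
```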


Circuits Systems and Signal Processing | 2018

An Experimental Study on the Significance of Variable Frame-Length and Overlap in the Context of Children’s Speech Recognition

Syed Shahnawazuddin; Chaman Singh; Hemant Kumar Kathania; Waquar Ahmad; Gayadhar Pradhan

It is well known that the recognition performance of an automatic speech recognition (ASR) system is affected by intra-speaker as well as inter-speaker variability. The differences in the geometry of the vocal organs, pitch and speaking rate among the speakers are some such inter-speaker variabilities affecting the recognition performance. A mismatch between the training and test data with respect to any of those aforementioned factors leads to increased error rates. An example of an acoustically mismatched ASR task is transcribing children’s speech on an adult-data-trained system. A large number of studies have been reported earlier that present a myriad of techniques for addressing acoustic mismatch arising from differences in pitch and dimensions of vocal organs. At the same time, only a few works on speaking-rate adaptation employing timescale modification have been reported. Furthermore, those studies were performed on ASR systems developed using Gaussian mixture models. Motivated by these facts, speaking-rate adaptation is explored in this work in the context of children’s ASR systems employing deep neural network-based acoustic modeling. Speaking-rate adaptation is performed by changing the frame length and overlap during the front-end feature extraction process. Significant reductions in errors are noted by speaking-rate adaptation. In addition to that, we have also studied the effect of combining speaking-rate adaptation with vocal-tract length normalization and explicit pitch modification. In both cases, additive improvements are obtained. To summarize, relative improvements of 15–20% over the baselines are obtained by varying the frame length and frame overlap.
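
Since the speaking-rate adaptation above is done purely at the front end, it amounts to scaling the analysis frame length and frame shift before computing features. The sketch below, assuming librosa, shows one way to do that; the 25 ms/10 ms baseline and the stretch factor are illustrative, and in practice the factor would be chosen to reduce the training/test speaking-rate mismatch.

```python
import librosa

def mfcc_with_rate_adaptation(speech, fs, stretch=1.2, n_mfcc=13):
    """MFCCs with the frame length and hop scaled by a speaking-rate factor.

    Instead of resynthesising the speech at a different tempo, the analysis
    frame length and frame shift are both scaled by `stretch`, which changes
    the number of frames spanning each sound at the feature level.
    """
    base_win, base_hop = int(0.025 * fs), int(0.010 * fs)   # 25 ms / 10 ms baseline
    win = int(base_win * stretch)
    hop = int(base_hop * stretch)
    n_fft = 1 << (win - 1).bit_length()                     # next power of two >= win
    return librosa.feature.mfcc(y=speech, sr=fs, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop, win_length=win)
```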


IEEE Region 10 Conference (TENCON) | 2016

Noise robustness of different front-end features for detection of vowels in speech signals

Avinash Kumar; Syed Shahnawazuddin; Gayadhar Pradhan



Biocybernetics and Biomedical Engineering | 2017

Denoising of ECG signal by non-local estimation of approximation coefficients in DWT

Pratik Singh; Gayadhar Pradhan; Syed Shahnawazuddin


Collaboration


Dive into Syed Shahnawazuddin's collaborations.

Top Co-Authors


Hemant Kumar Kathania

National Institute of Technology Sikkim


Rohit Sinha

Indian Institute of Technology Guwahati


A. B. Samaddar

National Institute of Technology Sikkim


Waquar Ahmad

National Institute of Technology Sikkim
