Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Sunil K. Gupta is active.

Publication


Featured research published by Sunil K. Gupta.


Journal of the Acoustical Society of America | 1993

Pitch‐synchronous frame‐by‐frame and segment‐based articulatory analysis by synthesis

Sunil K. Gupta; Juergen Schroeter

This paper presents a pitch‐synchronous analysis‐by‐synthesis procedure for estimating model parameters for voiced speech. These model parameters describe the vocal‐tract shape and the time derivative of the glottal area function. The excitation waveform is derived from the glottal area function by incorporating source‐tract interaction using the current vocal‐tract input impedance. The corresponding analysis procedure for estimating the model parameters once every pitch period is outlined. A significant improvement in quality was obtained for the new pitch‐synchronous analysis/synthesis procedure relative to the fixed‐frame‐length‐based scheme used previously. It was also found that the new pitch‐synchronous articulatory analysis/synthesis scheme achieves lower rms spectral distortion values than the 2.4 kb/s Federal Standard LPC‐10E algorithm. A segment‐based procedure for estimating the vocal‐tract model parameters at a rate much lower than the current pitch is described. In this segment‐based analysi...
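
The abstract describes an analysis-by-synthesis loop: for each pitch period, model parameters are adjusted until the synthesized frame matches the observed one. Below is a minimal sketch of that loop, assuming a caller-supplied articulatory synthesizer; `synthesize(params, n)` is a hypothetical placeholder, not the paper's model, and RMS log-spectral distortion stands in as the fit criterion.

```python
# Minimal analysis-by-synthesis sketch; synthesize() is a hypothetical
# articulatory synthesizer supplied by the caller, not the paper's model.
import numpy as np
from scipy.optimize import minimize

def spectral_distortion(target, synthetic, n_fft=512):
    """RMS log-spectral distortion (dB) between two waveform frames."""
    T = 20 * np.log10(np.abs(np.fft.rfft(target, n_fft)) + 1e-10)
    S = 20 * np.log10(np.abs(np.fft.rfft(synthetic, n_fft)) + 1e-10)
    return np.sqrt(np.mean((T - S) ** 2))

def analyze_pitch_period(frame, synthesize, init_params):
    """Adjust model parameters until the synthetic frame matches `frame`."""
    cost = lambda p: spectral_distortion(frame, synthesize(p, len(frame)))
    return minimize(cost, init_params, method="Nelder-Mead").x
```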


International Conference on Acoustics, Speech, and Signal Processing | 1996

High-accuracy connected digit recognition for mobile applications

Sunil K. Gupta; Frank K. Soong; Raziel Haimi-Cohen

We present a connected digit recognition system with low storage and computational complexity that achieves good performance in car noise. Our system uses the TI-DIGITS database with additive car noise for training whole-word digit and background models. A digit accuracy of 96.1% is obtained on a 15-speaker database collected in a car using an open microphone with an average SNR of approximately 2 dB. There is a further error reduction of almost 35% if the top two candidate strings are considered using a traceback-based N-best algorithm. The system can be implemented on a currently available fixed-point DSP chip. We show that significant performance improvements are obtained by using two-level cepstral mean subtraction (CMS), gender-dependent models, and a decoding grammar constraining the possible lengths of digit strings.
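
The two-level CMS mentioned above is not spelled out in the abstract; one plausible reading, sketched below, estimates separate cepstral means for speech and background frames and subtracts each from its own class. The function and the frame mask are illustrative assumptions.

```python
# Hedged sketch of two-level cepstral mean subtraction: separate means for
# speech and background frames (an assumed reading of "two-level").
import numpy as np

def two_level_cms(cepstra, is_speech):
    """cepstra: (frames, coeffs) array; is_speech: boolean NumPy mask per frame."""
    out = cepstra.copy()
    for mask in (is_speech, ~is_speech):
        if mask.any():
            out[mask] -= cepstra[mask].mean(axis=0)
    return out
```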


International Conference on Spoken Language Processing | 1996

Quantizing mixture-weights in a tied-mixture HMM

Sunil K. Gupta; Frank K. Soong; Raziel Haimi-Cohen

We describe new techniques to significantly reduce the computational, storage, and memory-access requirements of a tied-mixture-HMM-based speech recognition system. Although continuous-mixture HMMs offer improved recognition performance, we show that tied-mixture HMMs may offer a significant advantage in complexity reduction for low-cost implementations. In particular, we consider two tasks: (a) connected digit recognition in car noise; and (b) subword modeling for command word recognition in a noisy office environment. We show that quantization of mixture weights can provide an almost threefold reduction in mixture-weight storage requirements without any significant loss in recognition performance. Furthermore, we show that by combining mixture-weight quantization with techniques such as VQ-Assist, the computational and memory-access requirements can be reduced by almost 60-80% without any degradation in recognition performance.
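
A minimal sketch of the mixture-weight quantization idea: pool all tied-mixture weights, train a small codebook, and store one short index per weight instead of a full-width float. The codebook size and the k-means quantizer are illustrative assumptions; the paper's scheme may differ.

```python
# Sketch of quantizing tied-mixture weights with a small k-means codebook.
import numpy as np
from scipy.cluster.vq import kmeans2

def quantize_mixture_weights(weights, codebook_size=256):
    """weights: (states, mixtures) rows summing to 1. Returns (codebook, indices)."""
    flat = weights.reshape(-1, 1).astype(np.float64)
    codebook, idx = kmeans2(flat, codebook_size, minit="++")
    return codebook.ravel(), idx.reshape(weights.shape)

def dequantize(codebook, idx):
    w = codebook[idx]
    return w / w.sum(axis=1, keepdims=True)  # restore rows summing to 1
```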


International Conference on Spoken Language Processing | 1996

Durational characteristics of Hindi consonant clusters

Nisheeth Shrotriya; Rajesh Verma; Sunil K. Gupta; Shyam S. Agrawal

Various durations (closure, preceding vowel, etc.) have been studied in meaningful Hindi two-consonant cluster words with stop consonants, such as /shptah/ (week) and /ʃʌbd/ (word). The data included the 80 most frequently occurring clusters of the Hindi language. All these words were recorded by five male speakers and analysed using the Sensimetrics speech station software package. The analysis showed some very interesting features of the clusters: the closure duration is found to play a very important role for the different categories of stop consonants. Further, the duration of the voice bar and of the vowel preceding the cluster (C₁C₂) is shortened in cluster words as compared to non-cluster words.
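
By way of illustration only (the study used the Sensimetrics package, not this code), segment durations such as closure length follow directly from labeled interval boundaries:

```python
# Illustrative sketch: durations from hand-labeled boundaries, in seconds.
def durations(labels):
    """labels: list of (segment_name, start, end) tuples in temporal order."""
    return [(name, end - start) for name, start, end in labels]

# hypothetical C1C2 cluster word: preceding vowel, closure, then the cluster
word = [("a", 0.00, 0.12), ("closure", 0.12, 0.19), ("b", 0.19, 0.22), ("d", 0.22, 0.30)]
print(durations(word))  # closure duration here: 0.07 s
```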


Archive | 1993

Efficient Frequency-Domain Representation of LPC Excitation

Sunil K. Gupta; Bishnu S. Atal

Efficient representation of the LPC excitation signal is of utmost importance in predictive coding systems for achieving high-quality speech at low bit rates. In this paper, we present a method for obtaining an efficient parametric representation of the LPC excitation signal for voiced speech in the frequency domain that takes advantage of the nonuniform spacing of critical bands [1] in the auditory system. In current analysis/synthesis systems [2,3], a significant portion of the available bits is used to represent the excitation signal in order to reproduce its detailed, highly complex structure. The method presented in this paper aims to preserve only those details in the LPC excitation signal that are necessary to produce synthetic speech without audible distortion.
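
The critical-band spacing referenced in [1] can be approximated with the Zwicker & Terhardt Bark-scale formula; the sketch below is an illustration rather than the paper's procedure, placing component frequencies uniformly on the Bark axis so that fewer components fall at high frequencies.

```python
# Illustration of nonuniform critical-band spacing via the Bark scale.
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker & Terhardt approximation of the Bark (critical-band) scale."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

# 15 centers spaced uniformly on the Bark axis over a 4 kHz band: the Hz
# spacing widens with frequency, so fewer components sit at high frequencies.
grid_hz = np.arange(0.0, 4001.0)
centers_hz = np.interp(np.linspace(0.1, hz_to_bark(4000.0), 15),
                       hz_to_bark(grid_hz), grid_hz)
```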


Digital Signal Processing | 1992

Text-independent speaker verification based on broad phonetic segmentation of speech

Sunil K. Gupta; Michael Savic

Speaker verification involves the determination of whether or not a test utterance belongs to a specific reference speaker. The utterance is either accepted as belonging to the reference speaker or rejected as belonging to an imposter. Speaker verification has great potential for security applications, such as physical access control, computer data access control, and automatic telephone transaction control. The main components of a general speaker verification system are shown in Fig. 1.

A speaker verification task consists of two phases. In the training phase, reference templates are created for particular speakers using the signal processor shown in Fig. 1a. During the verification phase (Fig. 1b), the identity claimed by a speaker is verified using a test utterance from that speaker. The inputs to the system consist of a test utterance (sampled at 10 kHz in our experiments) and the claimed identity of the reference speaker. The signal processor can be further subdivided into three steps: normalization, parameterization, and feature extraction. These steps involve preprocessing and information reduction (or elimination of redundancies) in the input data sequence to obtain speaker templates during the training and verification phases. The normalization step consists of noise reduction, signal amplitude level control, and time warping to reduce the effect of different speaking rates. This is followed by the parameterization step to reduce the amount of data with minimal information loss about the speaker characteristics. An optional feature extraction step can be used to further reduce the data.

Next, the test template is compared to the reference template. The accept/reject decision is usually based on the computation of a distance function which quantifies the degree of dissimilarity between the test template and the reference template [1]. If the distance exceeds a threshold, the system rejects the match.

Comparing the test and the training templates in the verification phase is much simpler if the underlying texts of the utterances are the same. Normally, this text-dependent mode is possible only for cooperative speakers. In forensic work, for example, speakers are often uncooperative, and the test and the training texts are often not the same. This mode is called text-independent speaker verification, and the required information stored in the templates is different in this case. In general, the templates contain long-term statistical data. Error rates for text-independent recognition are considerably higher than the rates for a comparable text-dependent case. In this paper, we investigate text-independent speaker verification.

A speaker verification system produces two types of errors. A Type I error is caused when a true speaker is rejected as being an imposter. A Type II error results when an imposter is accepted by the system as the correct speaker. Naturally, the objective in a verification task is to minimize both errors.

Previous work [1, 2] on automatic text-independent speaker verification suggests that the important features for speaker discrimination are the spectral envelope parameters. Generally, in speaker verification systems, the reference speaker templates are obtained by averaging short-time spectral parameters over the complete speech utterance. In other words, an average vocal-tract shape is assumed for the duration of the utterance. However, this does not hold in practice, since it is well known that different sounds are produced by vocal-tract shapes that vary widely.
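
A minimal sketch of the accept/reject rule described above, assuming a long-term average cepstral template and a Euclidean distance (both simplifications; the paper's features and distance may differ):

```python
# Minimal template-comparison sketch for speaker verification.
import numpy as np

def make_template(cepstra):
    """Long-term statistics over all frames of an utterance."""
    return cepstra.mean(axis=0)

def verify(test_cepstra, reference_template, threshold):
    """Raising `threshold` trades Type I errors (false rejection) for
    Type II errors (false acceptance)."""
    distance = np.linalg.norm(make_template(test_cepstra) - reference_template)
    return distance <= threshold  # True: accept the claimed identity
```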


International Conference on Acoustics, Speech, and Signal Processing | 1991

Low update rate articulatory analysis/synthesis of speech

Sunil K. Gupta; Juergen Schroeter

A pitch-synchronous scheme for analysis/synthesis of speech is presented that uses a parametric model of the time derivative of the glottal area function. Improved articulatory codebooks are used to start up an analysis-by-synthesis procedure to adapt parameters representing the vocal tract shape and the glottal geometry. The authors also present results of multiframe optimization of articulatory parameters for potential application to low-bit-rate speech coding. In this scheme, they optimize the parameters once for each appropriately selected segment of speech and interpolate to obtain parameter values at intermediate time instants. They study three different vocal-tract representations with linear and arctan interpolation functions for multiframe parameter optimization. Experiments indicate that multiframe parameter optimization results in some degradation in speech quality as compared to single-frame parameter optimization.
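
A sketch of the segment-based interpolation the authors evaluate: optimize parameters only at segment endpoints, then interpolate per pitch period. The linear form is standard; the arctan form below is an assumed S-shaped variant, not necessarily the paper's exact function.

```python
# Interpolating articulatory parameters between segment endpoints.
import numpy as np

def interp_linear(p0, p1, t):
    """t in [0, 1]: fraction of the way through the segment."""
    return p0 + (p1 - p0) * t

def interp_arctan(p0, p1, t, slope=5.0):
    """Smooth S-shaped transition between endpoint parameter vectors
    (assumed form; maps t=0 to p0 and t=1 to p1)."""
    s = np.arctan(slope * (2.0 * t - 1.0)) / (2.0 * np.arctan(slope)) + 0.5
    return p0 + (p1 - p0) * s
```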


Journal of the Acoustical Society of America | 1992

Efficient representation of LPC excitation using nonuniform frequency‐domain sampling.

Sunil K. Gupta; Bishnu S. Atal

An efficient representation of the LPC excitation is essential in predictive coding systems for synthesizing high‐quality speech at low bit rates. In this paper, a method is presented that takes advantage of the nonuniform spacing of auditory critical bands to achieve an efficient frequency‐domain representation of LPC excitation. A segment of LPC excitation with a duration of N samples, represented as a Fourier series, requires N/2 sinusoidal components uniformly spaced along the frequency axis for exact reproduction of the excitation. Thus, for speech bandlimited to 4 kHz, a 10‐ms segment requires 40 frequency components for exact reproduction. It was found that, by using uniform frequency spacing below 1 kHz and logarithmic spacing above 1 kHz, the number of sinusoidal components can be reduced to 15 without introducing any audible distortion in the synthetic speech signal. Subjective tests were conducted to determine the effective signal‐to‐noise ratio of synthetic speech for different numbers of sinu...
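
The abstract's counts pin this down fairly tightly: a 10-ms frame of 4-kHz-bandlimited speech needs 40 components at a uniform 100-Hz spacing, and keeping that uniform grid below 1 kHz (10 components) while switching to logarithmic spacing above (5 components) yields the stated 15. A sketch of one such grid, with the exact split an assumption consistent with those counts:

```python
# One frequency grid consistent with the abstract's counts: uniform 100 Hz
# spacing kept below 1 kHz, logarithmic spacing above, 15 components total.
import numpy as np

uniform_part = np.arange(100.0, 1000.0 + 1.0, 100.0)  # 10 components up to 1 kHz
log_part = np.geomspace(1000.0, 4000.0, 6)[1:]        # 5 log-spaced above 1 kHz
freqs_hz = np.concatenate([uniform_part, log_part])   # 15 components total
assert freqs_hz.size == 15
```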


Archive | 2003

Adaptation of speech models in speech recognition

Sunil K. Gupta; Prabhu Raghavan


Archive | 2003

Method and apparatus for providing an interactive language tutor

Sunil K. Gupta; Ziyi Lu; Prabhu Raghavan; Zulfiquar Sayeed; Aravind Sethuraman; Chetan Vinchhi

Collaboration


Dive into Sunil K. Gupta's collaborations.

Top Co-Authors

Michael Savic

Rensselaer Polytechnic Institute
