Ganesh Sivaraman
University of Maryland, College Park
Publications
Featured research published by Ganesh Sivaraman.
Speech Communication | 2017
Vikramjit Mitra; Ganesh Sivaraman; Hosung Nam; Carol Y. Espy-Wilson; Elliot Saltzman; Mark Tiede
Studies have shown that articulatory information helps model speech variability and, consequently, improves speech recognition performance. But learning speaker-invariant articulatory models is challenging, as speaker-specific signatures in both the articulatory and acoustic spaces increase the complexity of the speech-to-articulatory mapping, which is already an ill-posed problem due to its inherent nonlinearity and non-unique nature. This work explores using deep neural networks (DNNs) and convolutional neural networks (CNNs) for mapping speech data into its corresponding articulatory space. Our speech-inversion results indicate that the CNN models perform better than their DNN counterparts. In addition, we use these inversion models to generate articulatory information from speech for two continuous speech recognition tasks: WSJ1 and Aurora-4. This work proposes a hybrid convolutional neural network (HCNN), in which two parallel layers are used to jointly model the acoustic and articulatory spaces, and the decisions from the parallel layers are fused at the output context-dependent (CD) state level. The acoustic model performs time-frequency convolution on filterbank-energy-level features, whereas the articulatory model performs time convolution on the articulatory features. The performance of the proposed architecture is compared to that of CNN- and DNN-based systems using gammatone filterbank energies as acoustic features, and the results indicate that the HCNN-based model achieves lower word error rates than the CNN/DNN baseline systems.
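As a rough illustration of the fused architecture described above, the sketch below builds a two-branch network in PyTorch: a 2-D time-frequency convolution over a window of filterbank energies, a 1-D time convolution over TV trajectories, and a shared classifier over CD states. Layer sizes, context width, and the number of CD states are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch (not the authors' code) of a two-branch "hybrid CNN" acoustic model:
# one branch applies time-frequency convolution to filterbank energies, the other applies
# time convolution to articulatory (TV) trajectories, and the branch outputs are fused
# before the context-dependent (CD) state classification layer. All sizes are assumptions.
import torch
import torch.nn as nn

class HybridCNN(nn.Module):
    def __init__(self, n_filterbank=40, n_tv=6, context=11, n_cd_states=2000):
        super().__init__()
        # Acoustic branch: 2-D (time x frequency) convolution over filterbank energies.
        self.acoustic = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(5, 8)), nn.ReLU(),
            nn.MaxPool2d((1, 3)),
            nn.Flatten(),
        )
        # Articulatory branch: 1-D (time) convolution over TV trajectories.
        self.articulatory = nn.Sequential(
            nn.Conv1d(n_tv, 32, kernel_size=5), nn.ReLU(),
            nn.Flatten(),
        )
        acoustic_dim = 32 * (context - 4) * ((n_filterbank - 7) // 3)
        artic_dim = 32 * (context - 4)
        # Fusion of the two branches at the CD-state output layer.
        self.classifier = nn.Sequential(
            nn.Linear(acoustic_dim + artic_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_cd_states),
        )

    def forward(self, fbank, tvs):
        # fbank: (batch, 1, context, n_filterbank); tvs: (batch, n_tv, context)
        fused = torch.cat([self.acoustic(fbank), self.articulatory(tvs)], dim=1)
        return self.classifier(fused)

model = HybridCNN()
logits = model(torch.randn(4, 1, 11, 40), torch.randn(4, 6, 11))
print(logits.shape)  # torch.Size([4, 2000])
```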
Conference of the International Speech Communication Association | 2016
Ganesh Sivaraman; Vikramjit Mitra; Hosung Nam; Mark Tiede; Carol Y. Espy-Wilson
Speech inversion is a well-known ill-posed problem, and the addition of speaker differences typically makes it even harder. This paper investigates a vocal tract length normalization (VTLN) technique to transform the acoustic space of different speakers to a target speaker's space such that speaker-specific details are minimized. The speaker-normalized features are then used to train a feed-forward neural network based acoustic-to-articulatory speech inversion system. The acoustic features are parameterized as time-contextualized mel-frequency cepstral coefficients, and the articulatory features are represented by six tract-variable (TV) trajectories. Experiments are performed with ten speakers from the U. Wisc. X-ray microbeam database. Speaker-dependent speech inversion systems are trained for each speaker as baselines against which to compare the performance of the speaker-independent approach. For each target speaker, data from the remaining nine speakers are transformed using the proposed approach, and the transformed features are used to train a speech inversion system. The performances of the individual systems are compared using the correlation between the estimated and the actual TVs on the target speaker's test set. Results show that the proposed speaker normalization approach provides a 7% absolute improvement in correlation compared to a system where speaker normalization was not performed.
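The core of the evaluation, estimating TVs from contextualized MFCCs and scoring them by correlation against the measured TVs, can be sketched as follows. The network shape, context width, and the toy data are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch (illustrative only): a feed-forward network maps context windows of MFCCs
# to six TV trajectories, and performance is scored as the per-TV Pearson correlation
# between estimated and measured TVs on a held-out speaker.
import numpy as np
import torch
import torch.nn as nn

N_MFCC, CONTEXT, N_TV = 13, 17, 6   # assumed feature and context dimensions

inversion_net = nn.Sequential(
    nn.Linear(N_MFCC * CONTEXT, 512), nn.Tanh(),
    nn.Linear(512, 512), nn.Tanh(),
    nn.Linear(512, N_TV),
)

def tv_correlations(estimated: np.ndarray, measured: np.ndarray) -> np.ndarray:
    """Pearson correlation per TV; inputs are (frames, N_TV) arrays."""
    return np.array([np.corrcoef(estimated[:, k], measured[:, k])[0, 1]
                     for k in range(measured.shape[1])])

# Toy data standing in for VTLN-normalized acoustic features and measured TVs.
mfcc_windows = torch.randn(1000, N_MFCC * CONTEXT)
measured_tvs = np.random.randn(1000, N_TV)
estimated_tvs = inversion_net(mfcc_windows).detach().numpy()
print(tv_correlations(estimated_tvs, measured_tvs))
```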
Journal of the Acoustical Society of America | 2017
Mark Tiede; Carol Y. Espy-Wilson; Dolly Goldenberg; Vikramjit Mitra; Hosung Nam; Ganesh Sivaraman
Electromagnetic articulometry (EMA) was used to record the 720 phonetically balanced Harvard sentences (IEEE, 1969) from multiple speakers at normal and fast production rates. Participants produced each sentence twice, first at their preferred “normal” speaking rate followed by a “fast” production (for a subset of the sentences, two normal-rate productions were elicited). They were instructed to produce the “fast” repetition as quickly as possible without making errors. EMA trajectories were obtained at 100 Hz from sensors placed on the tongue, lips, and mandible, corrected for head movement and aligned to the occlusal plane. Synchronized audio was recorded at 22050 Hz. Comparison of normal to fast acoustic durations for paired utterances showed a mean 67% length reduction and, assessed using Mermelstein’s method (1975), an average of two fewer syllables. A comparison of inflections in vertical jaw movement between paired utterances showed an average of 2.3 fewer syllables. Cross-recurrence analysis of distan...
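The jaw-inflection syllable count can be approximated by peak-picking on the vertical jaw trace. The sketch below (not the study's analysis code) counts opening peaks with scipy; the smoothing window and prominence threshold are arbitrary assumptions.

```python
# Illustrative sketch: local maxima in jaw lowering are treated as syllable nuclei.
import numpy as np
from scipy.signal import find_peaks

def count_jaw_inflections(jaw_y: np.ndarray, fs: float = 100.0) -> int:
    """Count opening peaks in a vertical jaw-position trace sampled at fs Hz."""
    lowering = -jaw_y                                              # jaw lowering = opening
    lowering = np.convolve(lowering, np.ones(5) / 5, mode="same")  # light smoothing
    peaks, _ = find_peaks(lowering, prominence=0.5, distance=int(0.1 * fs))
    return len(peaks)

# Synthetic trace: roughly 4 opening-closing cycles over 2 s at 100 Hz.
t = np.linspace(0, 2, 200)
jaw_y = np.cos(2 * np.pi * 2 * t) + 0.05 * np.random.randn(t.size)
print(count_jaw_inflections(jaw_y))  # ~4
```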
Journal of the Acoustical Society of America | 2015
Ganesh Sivaraman; Vikramjit Mitra; Hosung Nam; Elliot Saltzman; Carol Y. Espy-Wilson
In articulatory phonetics, a phoneme’s identity is specified by its articulator-free (manner) and articulator-bound (place) features. Previous studies have shown that acoustic-phonetic features (APs) can be used to segment speech into broad classes determined by the manner of articulation of speech sounds; compared to MFCCs, however, APs perform poorly in determining place of articulation. This study explores the combination of APs with vocal tract constriction variables (TVs) to distinguish phonemes according to their place of articulation for stops, fricatives, and nasals. TVs were estimated from acoustics using speech inversion systems trained on the XRMB database with pellet trajectories converted into TVs. TIMIT corpus sentences were first segmented into broad classes using a landmark-based broad-class segmentation algorithm. Each stop, fricative, and nasal speech segment was further classified according to its place of articulation: stops were classified as bilabial (/P/, /B/), alveolar (/T/, /D/) or ...
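In essence, the place-classification stage fuses segment-level APs with estimated TVs and trains a classifier over place labels within each broad class. The sketch below illustrates this with a logistic-regression classifier on random stand-in data; the feature dimensions, the label set, and the classifier choice are assumptions, not the study's setup.

```python
# Hedged sketch of place classification from fused AP + TV segment features.
import numpy as np
from sklearn.linear_model import LogisticRegression

N_AP, N_TV = 12, 6
PLACES = ["bilabial", "alveolar", "velar"]       # assumed place labels for stops

rng = np.random.default_rng(0)
ap_feats = rng.normal(size=(300, N_AP))          # stand-in segment-level APs
tv_feats = rng.normal(size=(300, N_TV))          # stand-in segment-level estimated TVs
labels = rng.integers(0, len(PLACES), size=300)  # stand-in place labels

features = np.hstack([ap_feats, tv_feats])       # AP + TV fusion
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(PLACES[clf.predict(features[:1])[0]])
```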
Journal of the Acoustical Society of America | 2015
Vikramjit Mitra; Ganesh Sivaraman; Hosung Nam; Carol Y. Espy-Wilson; Elliot Saltzman
Articulatory features (AFs) are known to provide an invariant representation of speech, which is expected to be robust against channel and noise degradations. This work presents a deep neural network-hidden Markov model (DNN-HMM) based acoustic model in which articulatory features are used in addition to mel-frequency cepstral coefficients (MFCCs) for the Aurora-4 speech recognition task. AFs were generated using a DNN trained layer-by-layer on synthetic speech data. Comparison of baseline mel-filterbank energy (MFB) features, MFCCs, and the fusion of articulatory features with MFCCs shows that articulatory features helped to increase the noise and channel robustness of the DNN-HMM acoustic model, indicating that the articulatory representation does provide an invariant representation of speech.
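At the feature level, the fusion described here amounts to concatenating per-frame MFCCs with estimated articulatory features and splicing in neighbouring context frames before the DNN. A minimal sketch, with assumed dimensions and context width:

```python
# Sketch (assumed, not the paper's pipeline) of MFCC + AF fusion with context splicing.
import numpy as np

def splice(frames: np.ndarray, context: int = 5) -> np.ndarray:
    """Stack each frame with +/- context neighbours (edges padded by repetition)."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * context + 1)])

mfcc = np.random.randn(200, 13)   # stand-in MFCC frames
afs = np.random.randn(200, 6)     # stand-in articulatory features from the inversion DNN
fused = splice(np.hstack([mfcc, afs]), context=5)
print(fused.shape)                # (200, (13 + 6) * 11)
```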
Journal of the Acoustical Society of America | 2014
Ganesh Sivaraman; Carol Y. Espy-Wilson; Vikramjit Mitra; Hosung Nam; Elliot Saltzman
Speech inversion is a technique for estimating vocal tract configurations from speech acoustics. We constructed two such systems using feed-forward neural networks. One was trained using natural speech data from the XRMB database and the second using synthetic data generated by the Haskins Laboratories TADA model that approximated the XRMB data. XRMB pellet trajectories were first converted into vocal tract constriction variables (TVs), providing a relative measure of constriction kinematics (location and degree), and synthetic TV data were obtained directly from TADA. The natural and synthetic speech inversion systems were trained as TV estimators using these respective sets of acoustic and TV data. The TV estimators were first tested using previously collected acoustic data on the utterance “perfect memory” spoken at slow, normal, and fast rates. The TV estimator trained on XRMB data (but not the one trained on TADA data) was able to recover the tongue tip gesture for /t/ in the fast utterance despite the gesture occurring part...
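Converting pellet trajectories into TVs means reducing sensor positions to constriction location and degree measures. The sketch below shows this for the lip TVs only, with lip aperture taken as the distance between lip pellets and protrusion as a horizontal position; the paper's exact TV definitions, in particular for tongue-tip and tongue-body constrictions, are not reproduced here.

```python
# Illustrative sketch (assumptions throughout) of deriving lip TVs from pellet positions.
import numpy as np

def lip_tvs(upper_lip: np.ndarray, lower_lip: np.ndarray) -> dict:
    """upper_lip / lower_lip: (frames, 2) arrays of (x, y) pellet positions in mm."""
    la = np.linalg.norm(upper_lip - lower_lip, axis=1)  # lip aperture: constriction degree
    lp = lower_lip[:, 0]                                # lip protrusion: horizontal position
    return {"LA": la, "LP": lp}

# Toy pellet trajectories standing in for XRMB data.
frames = 100
upper = np.cumsum(np.random.randn(frames, 2) * 0.1, axis=0)
lower = upper + np.array([0.0, -10.0]) + np.random.randn(frames, 2) * 0.5
tvs = lip_tvs(upper, lower)
print(tvs["LA"][:5])
```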
International Conference on Acoustics, Speech, and Signal Processing | 2014
Vikramjit Mitra; Ganesh Sivaraman; Hosung Nam; Carol Y. Espy-Wilson; Elliot Saltzman
Archive | 2014
Vikramjit Mitra; Wen Wang; Yun Lei; Andreas Kathol; Ganesh Sivaraman; Carol Y. Espy-Wilson
Conference of the International Speech Communication Association | 2015
Ganesh Sivaraman; Vikramjit Mitra; Mark Tiede; Elliot Saltzman; Louis Goldstein; Carol Y. Espy-Wilson
Archive | 2013
Ganesh Sivaraman; Vikramjit Mitra; Carol Y. Espy-Wilson