Anand Venkataraman
SRI International
Publications
Featured research published by Anand Venkataraman.
Speech Communication | 2005
Elizabeth Shriberg; Luciana Ferrer; Sachin S. Kajarekar; Anand Venkataraman; Andreas Stolcke
We describe a novel approach to modeling idiosyncratic prosodic behavior for automatic speaker recognition. The approach computes various duration, pitch, and energy features for each estimated syllable in speech recognition output, quantizes the features, forms N-grams of the quantized values, and models normalized counts for each feature N-gram using support vector machines (SVMs). We refer to these features as “SNERF-grams” (N-grams of Syllable-based Nonuniform Extraction Region Features). Evaluation of SNERF-gram performance is conducted on two-party spontaneous English conversational telephone data from the Fisher corpus, using one conversation side in both training and testing. Results show that SNERF-grams provide significant performance gains when combined with a state-of-the-art baseline system, as well as with two highly successful long-range feature systems that capture word usage and lexically constrained duration patterns. Further experiments examine the relative contributions of features by quantization resolution, N-gram length, and feature type. Results show that the optimal number of bins depends on both feature type and N-gram length, but is roughly in the range of 5–10 bins. We find that longer N-grams are better than shorter ones, and that pitch features are most useful, followed by duration and energy features. The most important pitch features are those capturing pitch level, whereas the most important energy features reflect patterns of rising and falling. For duration features, nucleus duration is more important for speaker recognition than are durations from the onset or coda of a syllable. Overall, we find that SVM modeling of prosodic feature sequences yields valuable information for automatic speaker recognition. It also offers rich new opportunities for exploring how speakers differ from each other in voluntary but habitual ways.
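The quantize, N-gram, and normalized-count steps named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the duration values, the bin edges, the function names, and the choice of relative frequency as the normalization are all assumptions made for illustration, and the SVM modeling step is left out.

```python
from bisect import bisect_right
from collections import Counter


def quantize(values, edges):
    """Map each value to a bin index, given sorted bin edges (assumed thresholds)."""
    return [bisect_right(edges, v) for v in values]


def ngram_relative_freqs(symbols, n):
    """Relative frequency of each length-n window in the symbol sequence.

    Relative frequency is an assumed normalization; the abstract only says
    that counts are normalized.
    """
    grams = [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


# Hypothetical per-syllable nucleus durations in seconds (made-up values).
durations = [0.08, 0.21, 0.09, 0.22, 0.07]
edges = [0.10, 0.20]  # assumed thresholds, giving 3 bins

bins = quantize(durations, edges)
features = ngram_relative_freqs(bins, 2)
print(bins)
print(features)
```

In a full system, each conversation side would yield one such sparse feature vector per feature type and N-gram length, and those vectors would be passed to a classifier such as an SVM.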
Computational Linguistics | 2001
Anand Venkataraman
A statistical model for segmentation and word discovery in continuous speech is presented. An incremental unsupervised learning algorithm to infer word boundaries based on this model is described. Results are also presented of empirical tests showing that the algorithm is competitive with other models that have been used for similar tasks.
IEEE Transactions on Audio, Speech, and Language Processing | 2006
Andreas Stolcke; Barry Y. Chen; H. Franco; Venkata Ramana Rao Gadde; Martin Graciarena; Mei-Yuh Hwang; Katrin Kirchhoff; Arindam Mandal; Nelson Morgan; Xin Lei; Tim Ng; Mari Ostendorf; M. Kemal Sönmez; Anand Venkataraman; Dimitra Vergyri; Wen Wang; Jing Zheng; Qifeng Zhu
We summarize recent progress in automatic speech-to-text transcription at SRI, ICSI, and the University of Washington. The work encompasses all components of speech modeling found in a state-of-the-art recognition system, from acoustic features, to acoustic modeling and adaptation, to language modeling. In the front end, we experimented with nonstandard features, including various measures of voicing, discriminative phone posterior features estimated by multilayer perceptrons, and a novel phone-level macro-averaging for cepstral normalization. Acoustic modeling was improved with combinations of front ends operating at multiple frame rates, as well as by modifications to the standard methods for discriminative Gaussian estimation. We show that acoustic adaptation can be improved by predicting the optimal regression class complexity for a given speaker. Language modeling innovations include the use of a syntax-motivated almost-parsing language model, as well as principled vocabulary-selection techniques. Finally, we address portability issues, such as the use of imperfect training transcripts, and language-specific adjustments required for recognition of Arabic and Mandarin.
International Conference on Acoustics, Speech, and Signal Processing | 2005
Sachin S. Kajarekar; Luciana Ferrer; Elizabeth Shriberg; M. Kemal Sönmez; Andreas Stolcke; Anand Venkataraman; Jing Zheng
The paper describes our recent efforts in exploring longer-range features and their statistical modeling techniques for speaker recognition. In particular, we describe a system that uses discriminant features from cepstral coefficients, and systems that use discriminant models from word n-grams and syllable-based NERF n-grams. These systems together with a cepstral baseline system are evaluated on the 2004 NIST speaker recognition evaluation dataset. The effect of the development set is measured using two different datasets, one from Switchboard databases and another from the FISHER database. Results show that the difference between the development and evaluation sets affects the performance of the systems only when more training data is available. Results also show that systems using longer-range features combined with the baseline result in about a 31% improvement with 1-side training over the baseline system and about a 61% improvement with 8-side training over the baseline system.
International Conference on Acoustics, Speech, and Signal Processing | 2003
Anand Venkataraman; Luciana Ferrer; Andreas Stolcke; Elizabeth Shriberg
Dialog act tagging is an important step toward speech understanding, yet training such taggers usually requires large amounts of data labeled by linguistic experts. Here we investigate the use of unlabeled data for training HMM-based dialog act taggers. Three techniques are shown to be effective for bootstrapping a tagger from very small amounts of labeled data: iterative relabeling and retraining on unlabeled data; a dialog grammar to model dialog act context; and a model of the prosodic correlates of dialog acts. On the SPINE dialog corpus, the combined use of prosodic information and unlabeled data reduces the tagging error by between 12% and 16%, compared to baseline systems using word information and various amounts of labeled data only.
International Conference on Acoustics, Speech, and Signal Processing | 2006
Luciana Ferrer; Elizabeth Shriberg; Sachin S. Kajarekar; Andreas Stolcke; M. Kemal Sönmez; Anand Venkataraman; Harry Bratt
Recent work in speaker recognition has demonstrated the advantage of modeling stylistic features in addition to traditional cepstral features, but to date there has been little study of the relative contributions of these different feature types to a state-of-the-art system. In this paper we provide such an analysis, based on SRI's submission to the NIST 2005 speaker recognition evaluation. The system consists of 7 subsystems (3 cepstral, 4 stylistic). By running independent N-way subsystem combinations for increasing values of N, we find that (1) a monotonic pattern in the choice of the best N systems allows for the inference of subsystem importance; (2) the ordering of subsystems alternates between cepstral and stylistic; (3) syllable-based prosodic features are the strongest stylistic features; and (4) overall subsystem ordering depends crucially on the amount of training data (1 versus 8 conversation sides). Improvements over the baseline cepstral system, when all systems are combined, range from 47% to 67%, with larger improvements for the 8-side condition. These results provide direct evidence of the complementary contributions of cepstral and stylistic features to speaker discrimination.
IEEE Automatic Speech Recognition and Understanding Workshop | 2003
Anand Venkataraman; Horacio Franco; Greg Myers
We describe an iterative recognition strategy that can be used to vastly improve the performance of a speech recognition system when the speech pertains to structured information that can be looked up in a database. The framework that we present is designed to extract specific fields of interest from the speech signal during each iteration, query a database using these fields, and thereby construct the hypothesis space for searching during the next iteration. The architecture has been found to be particularly useful in applications such as spoken address recognition, where a proof of concept and a demonstration system have been developed. We also present results on a small test set to compare the performance of the described system with the more common baseline approach.
Conference of the International Speech Communication Association | 2005
Andreas Stolcke; Luciana Ferrer; Sachin S. Kajarekar; Elizabeth Shriberg; Anand Venkataraman
Archive | 2004
Anand Venkataraman; Horacio Franco; Douglas A. Bercow
Archive | 2008
Babak Hodjat; Horacio Franco; Harry Bratt; Kristin Precoda; Andreas Stolcke; Anand Venkataraman; Dimitra Vergyri; Jing Zheng
