Publication


Featured research published by Ganesh N. Ramaswamy.


International Conference on Acoustics, Speech, and Signal Processing | 2002

Short-time Gaussianization for robust speaker verification

Bing Xiang; Upendra V. Chaudhari; Jiří Navrátil; Ganesh N. Ramaswamy; Ramesh A. Gopinath

In this paper, a novel approach for robust speaker verification, namely short-time Gaussianization, is proposed. Short-time Gaussianization is initiated by a global linear transformation of the features, followed by short-time windowed cumulative distribution function (CDF) matching. First, the linear transformation in the feature space leads to local independence or decorrelation. Then the CDF matching is applied to segments of speech localized in time, warping each feature so that its CDF matches a normal distribution. It is shown that one of the recent techniques used for speaker recognition, feature warping [1], can be formulated within the framework of Gaussianization. Compared to the baseline system with cepstral mean subtraction (CMS), around 20% relative improvement in both equal error rate (EER) and minimum detection cost function (DCF) is obtained on the NIST 2001 cellular phone evaluation data.
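
The short-time CDF-matching step lends itself to a compact illustration. Below is a minimal sketch, assuming per-dimension rank statistics over a sliding window of roughly 300 frames (a hypothetical value) and omitting the paper's global linear decorrelating transform:

```python
import numpy as np
from scipy.stats import norm

def short_time_gaussianize(features: np.ndarray, window: int = 300) -> np.ndarray:
    """Warp each feature dimension so its windowed CDF matches N(0, 1).

    features: (n_frames, n_dims) array of already-decorrelated features.
    """
    n_frames, _ = features.shape
    half = window // 2
    out = np.empty_like(features, dtype=float)
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        seg = features[lo:hi]  # speech segment localized in time
        # Empirical CDF value of the current frame within the window,
        # per dimension (the +0.5 keeps it strictly inside (0, 1)).
        rank = (seg < features[t]).sum(axis=0) + 0.5
        # The inverse normal CDF maps the empirical CDF onto N(0, 1).
        out[t] = norm.ppf(rank / (hi - lo))
    return out
```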


Journal of the Acoustical Society of America | 2007

Method for determining and maintaining dialog focus in a conversational speech system

Daniel M. Coffman; Ponani S. Gopalakrishnan; Ganesh N. Ramaswamy; Jan Kleindienst

A system and method of the present invention for determining and maintaining dialog focus in a conversational speech system includes presenting a command associated with an application to a dialog manager. The application associated with the command is unknown to the dialog manager at the time the command is presented. The dialog manager determines the current context of the command by reviewing a multi-modal history of events. At least one method is determined responsive to the command based on the current context, and that method is executed responsive to the command associated with the application.
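
As a rough sketch of the idea (all names below are illustrative, not taken from the patent), a dialog manager might resolve a command whose application is unknown by scanning its multi-modal event history most-recent-first:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    app: str    # application that produced the event
    kind: str   # e.g. "speech", "click", "window_focus"

@dataclass
class DialogManager:
    history: list = field(default_factory=list)
    handlers: dict = field(default_factory=dict)  # app -> {command -> method}

    def record(self, event: Event) -> None:
        self.history.append(event)

    def dispatch(self, command: str):
        # Current context: the most recently active application whose
        # registered methods are responsive to the command.
        for event in reversed(self.history):
            methods = self.handlers.get(event.app, {})
            if command in methods:
                return methods[command]()  # execute the responsive method
        raise LookupError(f"no application in focus handles {command!r}")
```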


International Conference on Acoustics, Speech, and Signal Processing | 1998

Compression of acoustic features for speech recognition in network environments

Ganesh N. Ramaswamy; Ponani S. Gopalakrishnan

In this paper, we describe a new compression algorithm for encoding acoustic features used in typical speech recognition systems. The proposed algorithm uses a combination of simple techniques, such as linear prediction and multi-stage vector quantization, and the current version of the algorithm encodes the acoustic features at a fixed rate of 4.0 kbit/s. The compression algorithm can be used very effectively for speech recognition in network environments, such as those employing a client-server model, or to reduce storage in general speech recognition applications. The algorithm has also been tuned for practical implementations, so that the computational complexity and memory requirements are modest. We have successfully tested the compression algorithm against many test sets from several different languages, and the algorithm performed very well, with no significant change in the recognition accuracy due to compression.
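
The encoder structure (linear prediction followed by multi-stage vector quantization of the residual) can be sketched as follows; the first-order predictor coefficient and the codebook layout are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

def encode(frames: np.ndarray, codebooks: list, a: float = 0.9) -> list:
    """frames: (T, D) acoustic features; codebooks: list of (K, D) arrays."""
    prev = np.zeros(frames.shape[1])
    indices = []
    for x in frames:
        residual = x - a * prev              # first-order linear prediction
        recon = np.zeros_like(residual)
        stage_idx = []
        for cb in codebooks:                 # multi-stage VQ: each stage
            err = residual - recon           # quantizes what is left over
            i = int(np.argmin(((cb - err) ** 2).sum(axis=1)))
            stage_idx.append(i)
            recon = recon + cb[i]
        indices.append(stage_idx)            # transmitted codebook indices
        prev = a * prev + recon              # track the decoder's reconstruction
    return indices
```

At a typical 100 frames per second (an assumption here), the stated 4.0 kbit/s budget works out to 40 bits of codebook indices per frame.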


IEEE Transactions on Audio, Speech, and Language Processing | 2006

Pseudo Pitch Synchronous Analysis of Speech With Applications to Speaker Recognition

Ran D. Zilca; Brian Kingsbury; Jiří Navrátil; Ganesh N. Ramaswamy

The fine spectral structure related to pitch information is conveyed in Mel cepstral features, with variations in pitch causing variations in the features. For speaker recognition systems, this phenomenon, known as “pitch mismatch” between training and testing, can increase error rates. Likewise, pitch-related variability may potentially increase error rates in speech recognition systems for languages such as English in which pitch does not carry phonetic information. In addition, for both speech recognition and speaker recognition systems, the parsing of the raw speech signal into frames is traditionally performed using a constant frame size and a constant frame offset, without aligning the frames to the natural pitch cycles. As a result, the power spectral estimation that is done as part of the Mel cepstral computation may include artifacts. Pitch synchronous methods have addressed this problem in the past, at the expense of adding some complexity by using a variable frame size and/or offset. This paper introduces Pseudo Pitch Synchronous (PPS) signal processing procedures that attempt to align each individual frame to its natural cycle and avoid truncation of pitch cycles while still using a constant frame size and frame offset, in an effort to address the above problems. Text-independent speaker recognition experiments performed on NIST speaker recognition tasks demonstrate a performance improvement when the scores produced by systems using PPS are fused with traditional speaker recognition scores. In addition, a better distribution of errors across trials may be obtained for similar error rates, and some insight regarding the role of the fundamental frequency in speaker recognition is revealed. Speech recognition experiments run on the Aurora-2 noisy digits task also show improved robustness and better accuracy for extremely low signal-to-noise ratio (SNR) data.
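
A minimal sketch of the frame-alignment idea follows, with a crude local-peak search standing in for a real pitch-cycle onset detector; the frame size, offset, and search radius are illustrative values for 16 kHz speech:

```python
import numpy as np

def pps_frame_starts(signal: np.ndarray, frame_size: int = 400,
                     offset: int = 160, radius: int = 80) -> list:
    """Constant frame size and nominal offset; each start is nudged to a
    nearby waveform peak standing in for a natural pitch-cycle anchor."""
    starts = []
    for nominal in range(0, len(signal) - frame_size, offset):
        lo = max(0, nominal - radius)
        hi = min(len(signal) - frame_size, nominal + radius)
        # Snap to the strongest local sample within the search radius.
        anchor = lo + int(np.argmax(signal[lo:hi + 1]))
        starts.append(anchor)  # frame is signal[anchor : anchor + frame_size]
    return starts
```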


International Conference on Acoustics, Speech, and Signal Processing | 2003

The IBM system for the NIST-2002 cellular speaker verification evaluation

Ganesh N. Ramaswamy; Jiří Navrátil; Upendra V. Chaudhari; Ran D. Zilca

This paper presents an overview of the architecture and algorithms implemented in IBM's text-independent speaker verification system developed for the 2002 NIST speaker recognition evaluation, particularly for the 1-speaker detection task using cellular test data. We describe the individual components, including a Gaussianization front-end, cellular-codec post-processing, modeling, discriminative optimization, and scoring steps. A combination of multiple, data-perturbed systems, discriminatively optimized for the low false-alarm operating region, obtained the top performance in the NIST 2002 1-speaker detection task.
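
A hedged sketch of the combination step: subsystem scores are fused with weights grid-searched to minimize a NIST-style detection cost. The cost constants and the crude fixed-quantile threshold below are assumptions, standing in for the paper's discriminative optimization:

```python
import itertools
import numpy as np

def dcf(scores, labels, threshold, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """NIST-style detection cost at a given threshold."""
    miss = np.mean(scores[labels == 1] < threshold)
    fa = np.mean(scores[labels == 0] >= threshold)
    return c_miss * p_target * miss + c_fa * (1 - p_target) * fa

def fuse_weights(subsystem_scores, labels, grid=np.linspace(0, 1, 11)):
    """subsystem_scores: (n_systems, n_trials); labels: 1 target, 0 impostor."""
    best = (np.inf, None)
    for w in itertools.product(grid, repeat=len(subsystem_scores)):
        fused = np.dot(w, subsystem_scores)
        t = np.quantile(fused, 0.95)  # operate in the low false-alarm region
        best = min(best, (dcf(fused, labels, t), np.array(w)),
                   key=lambda b: b[0])
    return best[1]  # weight vector with the lowest detection cost
```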


International Conference on Multimedia and Expo | 2003

Information fusion and decision cascading for audio-visual speaker recognition based on time-varying stream reliability prediction

Upendra V. Chaudhari; Ganesh N. Ramaswamy; Gerasimos Potamianos; Chalapathy Neti

We examine techniques for multi-modal biometric information fusion for verification and identification of speakers, where the reliability of each data stream, either audio or video, is modeled with parameters that are time-varying and depend on the context created by its local behavior. The complementary nature and the time-dependent relative reliability of audio and video data are studied in the context of verification and identification, on data collected during a user's interaction with an automated system. Significantly, this data is not artificially corrupted. Particular focus is directed to verification and its ability to refine identification decisions by indicating a level of confidence in the system's decisions. Results show more striking effects for verification, when using time-dependent fusion, than for identification.
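
The decision-cascading aspect, where verification refines an identification decision by attaching a confidence to it, might look like the following sketch; the margin statistic and the logistic confidence mapping are hypothetical:

```python
import numpy as np

def cascade(scores: dict, verify_threshold: float = 0.5):
    """scores: fused audio-visual score per enrolled speaker."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_id, best_score = ranked[0]
    # Margin over the runner-up: an ambiguous winner earns low confidence.
    margin = best_score - ranked[1][1] if len(ranked) > 1 else best_score
    confidence = 1.0 / (1.0 + np.exp(-(margin + best_score - verify_threshold)))
    accepted = best_score >= verify_threshold  # verification gate
    return best_id, accepted, confidence
```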


International Conference on Acoustics, Speech, and Signal Processing | 2003

Audio-visual speaker recognition using time-varying stream reliability prediction

Upendra V. Chaudhari; Ganesh N. Ramaswamy; Gerasimos Potamianos; Chalapathy Neti

We examine a time-varying, context-dependent information fusion methodology for multi-stream authentication based on audio and video data collected simultaneously during a user's interaction with a system. Scores obtained from the two data streams are combined based on the relative local richness of each stream, as compared to the training data or derived model, and on its stability. The results show that the proposed technique outperforms the use of video or audio data alone, as well as the use of fused data streams (via concatenation). Of particular note is that the performance improvements are achieved for clean, high-quality speech, whereas previous efforts focused on degraded speech conditions.
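
A minimal sketch of time-varying fusion, with inverse short-time score variance standing in for the paper's richness and stability measures (the window length and the variance proxy are assumptions of this example):

```python
import numpy as np

def fuse_streams(audio_scores, video_scores, window=20, eps=1e-6):
    """Per-frame scores for one claimed identity, fused with local weights."""
    fused = np.empty(len(audio_scores), dtype=float)
    for t in range(len(audio_scores)):
        lo = max(0, t - window)
        # A stream whose recent scores are stable (low variance) is treated
        # as more reliable at this instant and receives a larger weight.
        wa = 1.0 / (np.var(audio_scores[lo:t + 1]) + eps)
        wv = 1.0 / (np.var(video_scores[lo:t + 1]) + eps)
        fused[t] = (wa * audio_scores[t] + wv * video_scores[t]) / (wa + wv)
    return fused
```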


International Conference on Acoustics, Speech, and Signal Processing | 2005

Blind change detection for audio segmentation

Mohamed Kamal Omar; Upendra V. Chaudhari; Ganesh N. Ramaswamy

Automatic segmentation of audio streams according to speaker identities and environmental and channel conditions has become an important preprocessing step for speech recognition, speaker recognition, and audio data mining. In most previous approaches, the automatic segmentation was evaluated in terms of the performance of the final system, such as the word error rate for speech recognition systems. In many applications, such as online audio indexing and information retrieval systems, the actual boundaries of the segments are required. We present an approach based on the cumulative sum (CuSum) algorithm for automatic segmentation which minimizes the miss probability for a given false alarm rate. We compare the CuSum algorithm to the Bayesian information criterion (BIC) algorithm and a generalization of the Kolmogorov-Smirnov test for automatic segmentation of audio streams. We present a two-step variation of the three algorithms which improves performance significantly. We also present a novel approach that combines hypothesized boundaries from the three algorithms to achieve the final segmentation of the audio stream. Our experiments, on the 1998 Hub 4 broadcast news data, show that a variation of the CuSum algorithm significantly outperforms the other two approaches, and that combining the three approaches using a voting scheme improves performance slightly compared to using the two-step variation of the CuSum algorithm alone.
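
The CuSum core is compact enough to sketch. Assuming a per-frame statistic such as a log-likelihood ratio between "before" and "after" models (the threshold and drift constants below are illustrative):

```python
def cusum_change_points(llr, threshold=8.0, drift=0.05):
    """One-sided CuSum: return indices where accumulated evidence for a
    distribution change crosses the threshold."""
    changes, s = [], 0.0
    for t, x in enumerate(llr):
        s = max(0.0, s + x - drift)  # accumulate evidence, clipped at zero
        if s > threshold:            # alarm: hypothesize a segment boundary
            changes.append(t)
            s = 0.0                  # restart detection after the boundary
    return changes
```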


International Conference on Acoustics, Speech, and Signal Processing | 2001

Very large population text-independent speaker identification using transformation enhanced multi-grained models

Upendra V. Chaudhari; Jiří Navrátil; Ganesh N. Ramaswamy; Stephane Herman Maes

We present results on speaker identification with a population of over 10,000 speakers. Speaker modeling is accomplished via our transformation-enhanced multi-grained models. We pursue two goals: the first is to study the performance of a number of different systems within the modeling framework of multi-grained models; the second is to analyze performance as a function of population size. We show that the most complex models within the framework perform the best, and demonstrate that, approximately, the identification error rate scales linearly with the log of the population size for the described system. Further, based on our analysis of the system performance, we develop a candidate rejection technique that flags low confidence in the chosen identity.
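
The candidate rejection idea can be sketched as a margin test over the ranked model scores; the margin statistic and threshold below are illustrative, not the paper's exact criterion:

```python
import numpy as np

def identify_with_rejection(scores: np.ndarray, speaker_ids: list,
                            margin_thresh: float = 0.3):
    """scores: one model score per enrolled speaker (higher is better)."""
    order = np.argsort(scores)[::-1]  # assumes at least two enrolled speakers
    margin = float(scores[order[0]] - scores[order[1]])
    if margin < margin_thresh:
        return None, margin           # reject: low confidence in the choice
    return speaker_ids[order[0]], margin
```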


International Conference on Acoustics, Speech, and Signal Processing | 1998

Speech recognition performance on a voicemail transcription task

Mukund Padmanabhan; Ellen Eide; Bhuvana Ramabhadran; Ganesh N. Ramaswamy; Lalit R. Bahl

We describe a new testbed for developing speech recognition algorithms: the ARPA-sponsored voicemail transcription task, analogous to other tasks such as Switchboard, CallHome, and Hub 4. The task involves the transcription of voicemail messages. Voicemail represents a very large volume of real-world speech data which is, however, not particularly well represented in existing databases. For instance, the Switchboard and CallHome databases contain telephone conversations between two humans, representing telephone-bandwidth spontaneous speech; the Hub 4 database contains radio broadcasts, which represent different kinds of speech data, such as spontaneous speech from a well-trained speaker, conversations between two humans possibly over the telephone, and so on. The voicemail database, on the other hand, also represents telephone-bandwidth spontaneous speech; the difference with respect to the Switchboard and CallHome tasks is that the interaction is not between two humans but between a human and a machine. Consequently, the speech is expected to be a little more formal in nature, without the problems of crosstalk, barge-in, etc. This eliminates some of the variables and provides more controlled conditions, enabling one to concentrate on the aspects of spontaneous speech and the effects of the telephone channel. We describe how the speech data was collected, along with some algorithmic techniques that were devised based on this data, and report initial transcription results on this task.
