Publication


Featured research published by Jitendra Ajmera.


international conference on acoustics, speech, and signal processing | 2007

Comparison of Four Approaches to Age and Gender Recognition for Telephone Applications

Florian Metze; Jitendra Ajmera; Roman Englert; Udo Bub; Felix Burkhardt; Joachim Stegmann; Christian A. Müller; Richard Huber; Bernt Andrassy; Josef Bauer; Bernhard Dipl Ing Littel

This paper presents a comparative study of four different approaches to automatic age and gender classification using seven classes on a telephony speech task and also compares the results with human performance on the same data. The automatic approaches compared are based on (1) a parallel phone recognizer, derived from an automatic language identification system; (2) a system using dynamic Bayesian networks to combine several prosodic features; (3) a system based solely on linear prediction analysis; and (4) Gaussian mixture models based on MFCCs for separate recognition of age and gender. On average, the parallel phone recognizer performs as well as human listeners do, while losing performance on short utterances. The system based on prosodic features, however, shows very little dependence on the length of the utterance.


IEEE Signal Processing Letters | 2004

Robust speaker change detection

Jitendra Ajmera; Iain A. McCowan

Most commonly used criteria for speaker change detection like log likelihood ratio (LLR) and Bayesian information criterion (BIC) have an adjustable threshold/penalty parameter to make speaker change decisions. These parameters are not always robust to different acoustic conditions and have to be tuned. In this letter, we present a criterion which can be used to identify speaker changes in an audio stream without such tuning. The criterion consists of calculating the LLR of two models with the same number of parameters. Results on the Hub4 1997 evaluation set indicate that we achieve a performance comparable to using BIC with optimal penalty term.
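
To make the tunable penalty mentioned in this abstract concrete, the following is a minimal sketch of the conventional BIC baseline, not the criterion proposed in the letter. It assumes one-dimensional features modelled by a single Gaussian per segment; the function name `delta_bic` and the parameter name `penalty` are illustrative choices, not taken from the paper.

```python
import math
from statistics import pvariance


def delta_bic(left, right, penalty=1.0):
    """Delta-BIC for a candidate change point between two 1-D segments.

    Positive values favour modelling the segments with two separate
    Gaussians (a speaker change). `penalty` is the weight that usually
    has to be tuned for each acoustic condition.
    """
    both = list(left) + list(right)
    n, n1, n2 = len(both), len(left), len(right)
    # Log-likelihood ratio of "two Gaussians" versus "one Gaussian",
    # written in terms of maximum-likelihood variances.
    llr = 0.5 * (n * math.log(pvariance(both))
                 - n1 * math.log(pvariance(left))
                 - n2 * math.log(pvariance(right)))
    # The two-Gaussian hypothesis has 2 extra parameters (mean, variance).
    extra_params = 2
    return llr - penalty * 0.5 * extra_params * math.log(n)


# Hypothetical data: one stationary segment, and one with a shifted mean.
same = [math.sin(i) for i in range(100)]
shifted = [5.0 + math.sin(i) for i in range(100)]
print(delta_bic(same[:50], same[50:]) > 0)  # no change expected
print(delta_bic(same, shifted) > 0)         # change expected
```

Raising `penalty` makes the detector more conservative, which is the kind of per-condition tuning the abstract says the proposed criterion avoids.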


international conference on acoustics, speech, and signal processing | 2004

Clustering and segmenting speakers and their locations in meetings

Jitendra Ajmera; Guillaume Lathoud; L. McCowan

The paper presents a new approach toward automatic annotation of meetings in terms of speaker identities and their locations. This is achieved by segmenting the audio recordings using two independent sources of information: magnitude spectrum analysis and sound source localization. We combine the two in an appropriate HMM framework. There are three main advantages of this approach. First, it is completely unsupervised, i.e. speaker identities and number of speakers and locations are automatically inferred. Second, it is threshold-free, i.e. the decisions are made without the need of a threshold value which generally requires an additional development dataset. The third advantage is that the joint segmentation improves over the speaker segmentation derived using only acoustic features. Experiments on a series of meetings recorded in the IDIAP smart meeting room demonstrate the effectiveness of this approach.


international conference on acoustics, speech, and signal processing | 2002

Robust HMM-based speech/music segmentation

Jitendra Ajmera; Iain A. McCowan

In this paper we present a new approach towards high performance speech/music segmentation on realistic tasks related to the automatic transcription of broadcast news. In the approach presented here, the local probability density function (PDF) estimators trained on clean microphone speech are used as a channel model at the output of which the entropy and “dynamism” will be measured and integrated over time through a 2-state (speech and non-speech) hidden Markov model (HMM) with minimum duration constraints. The parameters of the HMM are trained using the EM algorithm in a completely unsupervised manner. Different experiments, including a variety of speech and music styles, as well as different segment durations of speech and music signals (real data distribution, mostly speech, or mostly music), will illustrate the robustness of the approach, which in each case achieves a frame-level accuracy greater than 94%.
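
The two features named in this abstract can be sketched as follows. The abstract does not give their formulas, so this is an assumption: the estimator is taken to output a posterior probability vector per frame, entropy is the Shannon entropy of that vector, and "dynamism" is taken to be the squared difference between consecutive posterior vectors. The function names are hypothetical.

```python
import math


def frame_entropy(posteriors):
    """Shannon entropy (in nats) of one frame's posterior vector."""
    return -sum(p * math.log(p) for p in posteriors if p > 0.0)


def dynamism(current, following):
    """Assumed definition: squared change between consecutive frames."""
    return sum((a - b) ** 2 for a, b in zip(current, following))


# Hypothetical posteriors over four classes.
confident = [1.0, 0.0, 0.0, 0.0]   # estimator is sure of one class
uncertain = [0.25, 0.25, 0.25, 0.25]

print(frame_entropy(confident))          # low entropy
print(frame_entropy(uncertain))          # high entropy, equals log(4)
print(dynamism(confident, [0.0, 1.0, 0.0, 0.0]))  # large frame-to-frame change
print(dynamism(uncertain, uncertain))             # no change
```

Under these assumptions, both values would be averaged over a window of frames before being passed to the 2-state HMM described above.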


2006 IEEE Odyssey - The Speaker and Language Recognition Workshop | 2006

Effect of age and gender on LP smoothed spectral envelope

Jitendra Ajmera

It is well known that linear prediction (LP) analysis suffers from drawbacks that are especially manifested during voiced segments of speech. The first drawback is in the form of its inherent error cancellation property and the second drawback is that the poles, as estimated by LP analysis, generally move in the direction of pitch harmonics. These two drawbacks make this analysis sensitive to the fundamental frequency of the speaker. In this paper, this sensitivity is analyzed by computing a distance measure between the signal power spectrum and the spectral envelope estimated by LP analysis at spectral peaks. It is shown that this distance shows the same trend as the fundamental frequency of the speaker, i.e. the higher the pitch frequency, the greater the distance between the two spectra. This, combined with the observation that the child, adult male and adult female speaker classes have fundamental frequencies in significantly different ranges, makes this distance an obvious choice for discrimination among these three classes. An experimental framework is set up for this classification task and the results (about 85% and 93% age and gender classification rates) clearly validate this hypothesis.


international conference on acoustics, speech, and signal processing | 2007

Spotting using Durational Entropy

Jitendra Ajmera; Florian Metze

This paper deals with the task of detecting a given keyword in continuous speech. We build upon a previously proposed algorithm where a modified Viterbi search algorithm is used to detect keywords, without requiring any explicit garbage or filler models. In this work, the concept of durational entropy is used to further discard a large fraction of false alarm errors. Durational entropy is defined as the entropy of the distribution of state occupancies. A method to recursively compute it for all Viterbi paths is also presented in this paper. Experimental results on one hour of broadcast news data suggest that durational entropy constraints can indeed be used to avoid a large number of false alarm errors at a minimal cost of degradation in keyword detection accuracy.
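
The definition of durational entropy given in this abstract can be illustrated with a short sketch. It computes the entropy for one completed state path and does not reproduce the recursive computation over all Viterbi paths that the paper presents. The function name and the example paths are hypothetical.

```python
import math
from collections import Counter


def durational_entropy(state_path):
    """Entropy (in nats) of the distribution of state occupancies.

    `state_path` is a sequence of state labels, one per frame.
    """
    total = len(state_path)
    occupancy = Counter(state_path)  # frames spent in each state
    return -sum((c / total) * math.log(c / total)
                for c in occupancy.values())


# Hypothetical 3-state keyword model aligned to 12 frames.
balanced = [0] * 4 + [1] * 4 + [2] * 4   # frames spread evenly
lopsided = [0] * 10 + [1] * 1 + [2] * 1  # one state absorbs most frames

print(durational_entropy(balanced))  # equals log(3), the maximum for 3 states
print(durational_entropy(lopsided))  # noticeably lower
```

A path in which one state absorbs most of the frames yields a low value; a detector built on this idea might reject hypotheses below some threshold, though the threshold itself is an assumption here and is not stated in the abstract.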


ieee automatic speech recognition and understanding workshop | 2003

A robust speaker clustering algorithm

Jitendra Ajmera; Chuck Wooters


IEEE Signal Processing Letters (to appear) | 2003

Robust Speaker Change Detection

Jitendra Ajmera; Iain A. McCowan


conference of the international speech communication association | 2002

Unknown-Multiple Speaker clustering using HMM

Jitendra Ajmera; Itshak Lapidot; Iain A. McCowan


Archive | 2002

Improved Unknown-Multiple Speaker clustering using HMM

Jitendra Ajmera; Itshak Lapidot

Collaboration


Dive into Jitendra Ajmera's collaborations.

Top Co-Authors

Florian Metze

Carnegie Mellon University

Alexander Raake

Technische Universität Ilmenau
