Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Mohamed Kamal Omar is active.

Publication


Featured research published by Mohamed Kamal Omar.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2011

Feature normalization for speaker verification in room reverberation

Sriram Ganapathy; Jason W. Pelecanos; Mohamed Kamal Omar

The performance of a typical speaker verification system degrades significantly in reverberant environments. This degradation is partly due to conventional feature extraction/compensation techniques that use analysis windows much shorter than typical room impulse responses. In this paper, we present a feature extraction technique which estimates long-term envelopes of speech in narrow sub-bands using frequency domain linear prediction (FDLP). When speech is corrupted by reverberation, the long-term sub-band envelopes are convolved in time with those of the room impulse response function. To a first-order approximation, gain normalization of these envelopes in the FDLP model suppresses the room reverberation artifacts. Experiments are performed on the 8 core conditions of the NIST 2008 speaker recognition evaluation (SRE). In these experiments, the FDLP features provide significant improvements on the interview microphone conditions (relative improvements of 20–30%) over the corresponding baseline system with MFCC features.
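
As a rough illustration of the method, the Python sketch below estimates an FDLP envelope for a single sub-band and applies gain normalization by discarding the linear-prediction gain term. The function name, model order, and evaluation grid are illustrative choices, not taken from the paper.

import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def fdlp_envelope(subband, order=40):
    """Model the long-term Hilbert envelope of one sub-band by linear
    prediction applied in the frequency (DCT) domain."""
    c = dct(subband, norm='ortho')                  # frequency-domain signal
    r = np.correlate(c, c, 'full')[len(c) - 1:]     # autocorrelation of DCT
    a = solve_toeplitz(r[:order], -r[1:order + 1])  # Yule-Walker LP coefficients
    # All-pole envelope |1 + sum_k a_k exp(-jwk)|^(-2); dropping the LP gain
    # term here is the first-order gain normalization step.
    w = np.linspace(0, np.pi, len(subband))
    A = 1 + np.exp(-1j * np.outer(w, np.arange(1, order + 1))) @ a
    return 1.0 / np.abs(A) ** 2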


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2005

Blind change detection for audio segmentation

Mohamed Kamal Omar; Upendra V. Chaudhari; Ganesh N. Ramaswamy

Automatic segmentation of audio streams according to speaker identities and environmental and channel conditions has become an important preprocessing step for speech recognition, speaker recognition, and audio data mining. In most previous approaches, the automatic segmentation was evaluated in terms of the performance of the final system, like the word error rate for speech recognition systems. In many applications, like online audio indexing and information retrieval systems, the actual boundaries of the segments are required. We present an approach based on the cumulative sum (CuSum) algorithm for automatic segmentation which minimizes the miss probability for a given false alarm rate. We compare the CuSum algorithm to the Bayesian information criterion (BIC) algorithm and a generalization of the Kolmogorov-Smirnov test for automatic segmentation of audio streams. We present a two-step variation of the three algorithms which improves the performance significantly. We also present a novel approach that combines hypothesized boundaries from the three algorithms to achieve the final segmentation of the audio stream. Our experiments, on the 1998 Hub4 broadcast news corpus, show that a variation of the CuSum algorithm significantly outperforms the other two approaches and that combining the three approaches using a voting scheme improves the performance slightly compared to using the two-step variation of the CuSum algorithm alone.
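
At the heart of the approach is Page's classical CUSUM recursion; a minimal Python sketch is given below, assuming the caller supplies a stream of per-frame log-likelihood ratios (e.g., from models fitted to the two hypothesized segments). The threshold trades the miss probability against the false alarm rate.

def cusum_detect(llr_stream, threshold):
    """Page's CUSUM test: accumulate log-likelihood ratios, clamp the running
    sum at zero, and declare a change when it crosses the threshold."""
    g = 0.0
    for t, llr in enumerate(llr_stream):
        g = max(0.0, g + llr)  # reset whenever evidence favors "no change"
        if g > threshold:
            return t           # hypothesized change point (detection frame)
    return None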


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2010

A novel approach to detecting non-native speakers and their native language

Mohamed Kamal Omar; Jason W. Pelecanos

Speech contains valuable information regarding the traits of speakers. This paper investigates two aspects of this information. The first is automatic detection of non-native speakers and their native language on relatively large data sets. We present several experiments which show how our system outperforms the best published results on both the Fisher database and the foreign-accented English (FAE) database for detecting non-native speakers and their native language respectively. Such performance is achieved by using an SVM-based classifier with ASR-based features integrated with a novel universal background model (UBM) obtained by clustering the Gaussian components of an ASR acoustic model. The second aspect of this work is to utilize the detected speaker characteristics within a speaker recognition system to improve its performance.
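
The UBM construction described above can be pictured roughly as follows: pool the Gaussian components of an ASR acoustic model, cluster them, and merge each cluster into a single Gaussian by moment matching. This is a hypothetical sketch; the clustering method, mixture size, and names are illustrative, not the paper's.

import numpy as np
from sklearn.cluster import KMeans

def components_to_ubm(weights, means, variances, n_mix=512):
    """Merge the Gaussian components of an acoustic model into an
    n_mix-component UBM: k-means on component means, then a
    moment-matched merge within each cluster."""
    labels = KMeans(n_clusters=n_mix, n_init=4).fit_predict(means)
    ubm = []
    for k in range(n_mix):
        idx = labels == k
        w = weights[idx] / weights[idx].sum()   # within-cluster weights
        mu = w @ means[idx]
        # E[x^2] - mu^2, pooling each component's variance and mean
        var = w @ (variances[idx] + means[idx] ** 2) - mu ** 2
        ubm.append((weights[idx].sum(), mu, var))
    return ubm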


IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2001

Gaussian mixture models of phonetic boundaries for speech recognition

Mohamed Kamal Omar; Mark Hasegawa-Johnson; Stephen E. Levinson

A new approach to representing temporal correlation in an automatic speech recognition system is described. It introduces an acoustic feature set that captures the dynamics of the speech signal at phoneme boundaries, in combination with the traditional acoustic feature set representing the periods of speech that are assumed to be quasi-stationary. This newly introduced feature set represents an observed random vector associated with the state transition in the HMM. For the same complexity and number of parameters, this approach improves phoneme recognition accuracy by 3.5% compared to context-independent HMM models. Stop consonant recognition accuracy is increased by 40%.
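
A rough sketch of the boundary-feature idea, assuming boundary observations are built by stacking frames from either side of each hypothesized phoneme boundary and modeled with a GMM per boundary class (the context width, mixture size, and names below are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

def boundary_vector(frames, boundary, ctx=2):
    """Capture dynamics across a boundary: concatenate a frame from each
    side of the hypothesized phoneme transition."""
    return np.concatenate([frames[boundary - ctx], frames[boundary + ctx]])

# One GMM per boundary class; its log-likelihood scores the observation
# attached to the corresponding HMM state transition.
boundary_gmm = GaussianMixture(n_components=8, covariance_type='diag')
# boundary_gmm.fit(np.stack([boundary_vector(f, b) for f, b in train_examples]))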


IEEE Transactions on Signal Processing | 2004

Model enforcement: a unified feature transformation framework for classification and recognition

Mohamed Kamal Omar; Mark Hasegawa-Johnson

Bayesian classifiers rely on models of the a priori and class-conditional feature distributions; the classifier is trained by optimizing these models to best represent the features observed in a training corpus according to a certain criterion. In many problems of interest, the true class-conditional feature probability density function (PDF) is not a member of the set of PDFs the classifier can represent. Previous research has shown that the effect of this problem may be reduced either by improving the models or by transforming the features used in the classifier. This paper addresses this model mismatch problem in statistical identification, classification, and recognition systems. We formulate the problem as minimizing the relative entropy, also known as the Kullback-Leibler distance, between the true conditional PDF and the hypothesized probabilistic model. Based on this formulation, we provide a computationally efficient solution to the problem based on volume-preserving maps; existing linear transform designs are shown to be special cases of the proposed solution. Using this result, we propose the symplectic maximum likelihood transform (SMLT), which is a nonlinear volume-preserving extension of the maximum likelihood linear transform (MLLT). This approach has many applications in statistical modeling, classification, and recognition. We apply it to the maximum likelihood estimation (MLE) of the joint PDF of order statistics and show a significant increase in the likelihood for the same number of parameters. We also provide phoneme recognition experiments that show recognition accuracy improvement compared with using the baseline Mel-frequency cepstrum coefficient (MFCC) features or using MLLT. We present an iterative algorithm to jointly estimate the parameters of the symplectic map and the probabilistic model for both applications.
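
The core equivalence behind the formulation is standard and worth spelling out: minimizing relative entropy to the true PDF is the same as maximizing expected log-likelihood, and a volume-preserving feature map contributes no Jacobian term to the likelihood. Schematically (notation ours, not the paper's),

D\bigl(p \,\|\, q_\theta\bigr) = \int p(x)\,\log\frac{p(x)}{q_\theta(x)}\,dx = -H(p) - \mathbb{E}_{p}\bigl[\log q_\theta(x)\bigr],

so \arg\min_\theta D(p \,\|\, q_\theta) = \arg\max_\theta \mathbb{E}_{p}[\log q_\theta(x)]; and for a map y = f(x) with \lvert\det \partial f/\partial x\rvert = 1,

\log p_X(x) = \log p_Y\bigl(f(x)\bigr) + \log\Bigl\lvert\det \frac{\partial f}{\partial x}\Bigr\rvert = \log p_Y\bigl(f(x)\bigr).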


IEEE Signal Processing Workshop on Statistical Signal Processing | 2003

Strong-sense class-dependent features for statistical recognition

Mohamed Kamal Omar; Mark Hasegawa-Johnson

In statistical classification and recognition problems with many classes, it is common for different classes to exhibit wildly different properties. In this case it is unreasonable to expect to summarize these properties using features designed to represent all the classes. Instead, features should be designed to represent subsets of classes that exhibit common properties, without regard to any class outside the subset. The value of these features for classes outside the subset may be meaningless, or simply undefined. The main problem, due to the statistical nature of the recognizer, is how to compare likelihoods conditioned on different sets of features when decoding an input pattern. This paper introduces a class-dependent feature design approach that can be integrated with any probabilistic model. This approach avoids the need for a conditional probabilistic model for each class and feature type pair, and therefore decreases the computational and storage requirements of using heterogeneous features. The paper presents an algorithm to calculate the class-dependent features that minimize an estimate of the relative entropy between the conditional probabilistic model and the actual conditional probability density function (PDF) of the features of each class. The approach is applied to a hidden Markov model (HMM) automatic speech recognition (ASR) system, using a nonlinear class-dependent volume-preserving transformation of the features to minimize the objective function. Using this approach, a 2% improvement in phoneme recognition accuracy is achieved compared to the baseline system; the approach also improves recognition accuracy compared to previous class-dependent linear feature transformations.
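
To see why volume preservation makes class-dependent likelihoods comparable, consider this toy decoding sketch: each class scores the input under its own feature transform, and because every transform has unit Jacobian determinant the resulting log-likelihoods live on a common scale. All names here are illustrative.

def decode(x, class_models):
    """class_models maps class -> (transform, log_pdf), where each transform
    is volume-preserving (unit Jacobian determinant), so the transformed
    log-likelihoods can be compared directly across classes."""
    scores = {c: log_pdf(f(x)) for c, (f, log_pdf) in class_models.items()}
    return max(scores, key=scores.get)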


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2011

Forensically inspired approaches to automatic speaker recognition

Kyu Jeong Han; Mohamed Kamal Omar; Jason W. Pelecanos; Cezar Pendus; Sibel Yaman; Weizhong Zhu

This paper presents ongoing research leveraging forensic methods for automatic speaker recognition. Some of the methods forensic scientists employ include identifying speaker-distinctive audio segments and comparing these segments using features such as pitch, formants, and other information. Other approaches have involved performing a phonetic analysis to recognize idiolectal attributes, and an implicit analysis of the demographics of speakers. Inspired by these forensic phonetic approaches, we target three threads of work: hot-spot analysis, speaker style and pronunciation modelling, and demographics analysis. As a result of this work, we show that a phonetic analysis conditioned on select speech events (or hot-spots) can outperform a phonetic analysis performed over all speech without conditioning. In the area of pronunciation modelling, one set of results demonstrates significantly improved robustness by exploiting phonetic structure in an automatic speech recognition system. For demographics analysis, we present state-of-the-art results of systems capable of detecting dialect, non-nativeness, and native language.


IEEE Spoken Language Technology Workshop (SLT) | 2012

Noisy channel adaptation in language identification

Sriram Ganapathy; Mohamed Kamal Omar; Jason W. Pelecanos

Language identification (LID) of speech data recorded over noisy communication channels is a challenging problem, especially when the LID system is tested on speech data from an unseen communication channel (one not seen in training). In this paper, we consider the scenario in which a small amount of adaptation data is available from a new communication channel. Various approaches are investigated for efficient utilization of the adaptation data in both supervised and unsupervised settings. In a supervised adaptation framework, we show that support vector machines (SVMs) with higher-order polynomial kernels (HO-SVM), trained on lower-dimensional representations of the Gaussian mixture model supervectors (GSVs), provide significant performance improvements over the baseline SVM-GSV system. In these LID experiments, we obtain a 30% reduction in error rate with 6 hours of adaptation data for a new channel. For unsupervised adaptation, we develop an iterative procedure for re-labeling the development data using a co-training framework. In these experiments, we obtain considerable improvements (relative improvements of 13%) over a self-training framework with the HO-SVM models.
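
As an illustration of the supervised branch, a pipeline in the spirit of the HO-SVM system might project GMM supervectors to a lower dimension and train a polynomial-kernel SVM on them; the projection method, dimensionality, and kernel degree below are placeholder choices, not the paper's.

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Lower-dimensional representation of GMM supervectors (GSVs), followed by
# an SVM with a higher-order polynomial kernel for language identification.
ho_svm = make_pipeline(
    PCA(n_components=300),
    SVC(kernel='poly', degree=3),
)
# ho_svm.fit(gsv_train_plus_adaptation, language_labels)
# ho_svm.predict(gsv_from_new_channel)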


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2009

Maximum margin linear kernel optimization for speaker verification

Mohamed Kamal Omar; Jason W. Pelecanos; Ganesh N. Ramaswamy

This paper describes a novel approach for discriminative modeling and its application to automatic text-independent speaker verification. The approach maximizes the margin between the model scores for pairs of utterances belonging to the same speaker and for pairs of utterances belonging to different speakers, by estimating a low-dimensional linear kernel that maximizes this margin. This emphasizes features which better discriminate between scores belonging to pairs of utterances of the same target speaker and those of different speakers. In this paper, we apply the approach to the NIST 2005 speaker verification task. Compared to the Gaussian mixture model (GMM) baseline system, a 17.7% relative improvement in the minimum detection cost function (DCF) and an 11.7% relative improvement in equal error rate (EER) are obtained. We also achieve a 5.7% relative improvement in EER and a 2.3% relative improvement in DCF by using our approach on top of a nuisance attribute projection (NAP) compensated GMM-based kernel baseline system.
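
A toy rendering of the objective: learn a per-dimension weighting so that same-speaker pair scores exceed different-speaker pair scores by a margin, updated with hinge-loss subgradient steps. This is a simplified sketch with illustrative names, not the paper's optimizer.

import numpy as np

def train_diag_kernel(same_pairs, diff_pairs, dim, epochs=10, lr=1e-3):
    """score(a, b) = sum_d w_d * a_d * b_d; push same-speaker scores above
    different-speaker scores by a unit margin."""
    w = np.ones(dim)
    for _ in range(epochs):
        for (a, b), (c, d) in zip(same_pairs, diff_pairs):
            if w @ (a * b) - w @ (c * d) < 1.0:  # margin violated
                w += lr * (a * b - c * d)        # hinge subgradient step
    return w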


IEEE Transactions on Speech and Audio Processing | 2003

Approximately independent factors of speech using nonlinear symplectic transformation

Mohamed Kamal Omar; Mark Hasegawa-Johnson

This paper addresses the problem of representing the speech signal using a set of features that are approximately statistically independent. This statistical independence simplifies building probabilistic models based on these features that can be used in applications like speech recognition. Since there is no evidence that the speech signal is a linear combination of separate factors or sources, we use a more general nonlinear transformation of the speech signal to achieve our approximately statistically independent feature set. We choose the transformation to be symplectic to maximize the likelihood of the generated feature set. In this paper, we describe applying this nonlinear transformation both to the time-domain speech data directly and to the Mel-frequency cepstrum coefficients (MFCC). We also discuss experiments in which the generated feature set is transformed into a more compact set using a maximum mutual information linear transformation; this linear transformation is used to generate the acoustic features that represent the distinctions among the phonemes. The features resulting from this transformation are used in phoneme recognition experiments. The best results achieved show about a 2% improvement in recognition accuracy compared to results based on MFCC features.
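
One way to picture a nonlinear volume-preserving (symplectic-style) transform is as a composition of alternating shears: each half-step perturbs one half of the feature vector by a nonlinear function of the other half, giving a triangular Jacobian with ones on the diagonal and hence unit determinant, so no likelihood correction term appears. A minimal sketch with illustrative nonlinearities:

import numpy as np

def volume_preserving_map(x, g, h):
    """Split x (even-dimensional) into halves (q, p) and apply shears
    q += g(p), p += h(q). Each shear has a triangular Jacobian with unit
    diagonal, so the composed map preserves volume exactly."""
    q, p = np.split(x, 2)
    q = q + g(p)
    p = p + h(q)
    return np.concatenate([q, p])

# e.g., y = volume_preserving_map(x, np.tanh, np.tanh)  # assumed nonlinearities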

Collaboration


Dive into Mohamed Kamal Omar's collaborations.

Top Co-Authors

Kyu Jeong Han

University of Southern California

Shrikanth Narayanan

University of Southern California

Sibel Yaman

Georgia Institute of Technology
