Publication


Featured research published by Jeff Z. Ma.


Journal of the Acoustical Society of America | 2000

Spontaneous speech recognition using a statistical coarticulatory model for the vocal-tract-resonance dynamics

Li Deng; Jeff Z. Ma

A statistical coarticulatory model is presented for spontaneous speech recognition, where knowledge of the dynamic, target-directed behavior of the vocal tract resonances is incorporated into the model design, training, and likelihood computation. The principal advantage of the new model over the conventional HMM is the use of a compact, internal structure that parsimoniously represents long-span context dependence in the observable domain of speech acoustics without using additional, context-dependent model parameters. The new model is formulated mathematically as a constrained, nonstationary, and nonlinear dynamic system, for which a version of the generalized EM algorithm is developed and implemented for automatically learning the compact set of model parameters. A series of experiments on speech recognition and model synthesis using spontaneous speech data from the Switchboard corpus is reported. The promise of the new model is demonstrated by its consistently superior performance over a state-of-the-art benchmark HMM system under controlled experimental conditions. Experiments on model synthesis and analysis shed insight into the mechanism underlying this superiority in terms of the target-directed behavior and the long-span context-dependence property, both inherent in the designed structure of the new dynamic model of speech.
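A minimal sketch of the kind of target-directed hidden dynamic model the abstract describes, written in our own illustrative notation (the symbols z_t, T_s, Phi_s, and h are not the paper's): the hidden vocal-tract-resonance vector is pulled toward a phone-dependent target, and a nonlinear mapping relates it to the acoustic observation.

```latex
% Illustrative target-directed hidden-dynamic formulation (our notation, not the paper's)
\begin{align}
  z_{t+1} &= \Phi_s\, z_t + (I - \Phi_s)\, T_s + w_t, & w_t &\sim \mathcal{N}(0, Q_s)\\
  o_t     &= h(z_t) + v_t,                            & v_t &\sim \mathcal{N}(0, R)
\end{align}
```

Because the dynamics matrix and target are tied to the phone unit rather than to its context, long-span context dependence arises from the continuity of the hidden state across segment boundaries instead of from additional context-dependent parameters.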


IEEE Transactions on Speech and Audio Processing | 2004

Target-directed mixture dynamic models for spontaneous speech recognition

Jeff Z. Ma; Li Deng

In this paper, a novel mixture linear dynamic model (MLDM) for speech recognition is developed and evaluated, where several linear dynamic models are combined (mixed) to represent different vocal-tract-resonance (VTR) dynamic behaviors and the mapping relationships between the VTRs and the acoustic observations. Each linear dynamic model is formulated as a set of state-space equations, where the target-directed property of the VTRs is incorporated into the state equation and a linear regression function is used for the observation equation to approximate the nonlinear mapping relationship. A version of the generalized EM algorithm is developed for learning the model parameters, where the constraint that the VTR targets change at the segmental level (rather than at the frame level) is imposed in the parameter learning and model scoring algorithms. Speech recognition experiments are carried out to evaluate the new model using the N-best re-scoring paradigm on a Switchboard task. Compared with a baseline recognizer using a triphone HMM acoustic model, the new recognizer demonstrated improved performance under several experimental conditions. The performance was shown to increase with an increased number of mixture components in the model.
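The state-space form described above could look roughly as follows for the m-th mixture component; the symbols are illustrative choices, not the paper's notation.

```latex
% Illustrative m-th mixture component of the MLDM (our notation):
% target-directed state equation plus linear-regression observation equation.
\begin{align}
  x_t &= A_m\, x_{t-1} + (I - A_m)\, T_m + w_t, & w_t &\sim \mathcal{N}(0, Q_m)\\
  o_t &= H_m\, x_t + b_m + v_t,                 & v_t &\sim \mathcal{N}(0, R_m)
\end{align}
```

Here the target T_m is held fixed within a segment, matching the segment-level constraint imposed in the learning and scoring algorithms, while the linear regression H_m x_t + b_m stands in for the nonlinear VTR-to-acoustics mapping.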


International Conference on Acoustics, Speech, and Signal Processing | 2009

Unsupervised acoustic and language model training with small amounts of labelled data

Scott Novotney; Richard M. Schwartz; Jeff Z. Ma

We measure the effects of a weak language model, estimated from as little as 100k words of text, on unsupervised acoustic model training, and then explore the best method of using word confidences to estimate n-gram counts for unsupervised language model training. Even with 100k words of text and 10 hours of training data, unsupervised acoustic modeling is robust, recovering 50% of the gain obtained with supervised training. For language model training, multiplying the word confidences together to obtain weighted counts produces the best results, reducing WER by 2% over the baseline language model and by 0.5% absolute over using unweighted transcripts. Oracle experiments show that a larger gain is possible, but better confidence estimation techniques are needed to identify correct n-grams.
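A minimal sketch of the confidence-weighted counting idea described above, assuming per-word confidences come from the recognizer's output; the function name and data layout here are illustrative, not the paper's implementation.

```python
from collections import defaultdict

def weighted_ngram_counts(hypotheses, order=3):
    """Accumulate n-gram counts from automatic transcripts, weighting each
    n-gram by the product of the confidences of the words it contains.

    `hypotheses` is an iterable of sentences, each a list of
    (word, confidence) pairs produced by the recognizer.
    """
    counts = defaultdict(float)
    for sent in hypotheses:
        words = [w for w, _ in sent]
        confs = [c for _, c in sent]
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                ngram = tuple(words[i:i + n])
                weight = 1.0
                for c in confs[i:i + n]:
                    weight *= c          # multiply word confidences together
                counts[ngram] += weight  # fractional count for this n-gram
    return counts

# Example: one decoded sentence with per-word confidences
hyps = [[("the", 0.9), ("cat", 0.6), ("sat", 0.8)]]
print(weighted_ngram_counts(hyps)[("the", "cat")])  # 0.9 * 0.6 (up to float rounding)
```

The weighted counts can then be fed to standard n-gram estimation in place of integer counts from verbatim transcripts.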


IEEE Transactions on Audio, Speech, and Language Processing | 2006

Advances in transcription of broadcast news and conversational telephone speech within the combined EARS BBN/LIMSI system

Spyridon Matsoukas; Jean-Luc Gauvain; Gilles Adda; Thomas Colthurst; Chia-Lin Kao; Owen Kimball; Lori Lamel; Fabrice Lefèvre; Jeff Z. Ma; John Makhoul; Long Nguyen; Rohit Prasad; Richard M. Schwartz; Holger Schwenk; Bing Xiang

This paper describes the progress made in the transcription of broadcast news (BN) and conversational telephone speech (CTS) within the combined BBN/LIMSI system from May 2002 to September 2004. During that period, BBN and LIMSI collaborated in an effort to produce significant reductions in the word error rate (WER), as directed by the aggressive goals of the DARPA Effective, Affordable, Reusable Speech-to-text (EARS) program. The paper focuses on general modeling techniques that led to recognition accuracy improvements, as well as engineering approaches that enabled efficient use of large amounts of training data and fast decoding architectures. Special attention is given to efforts to integrate components of the BBN and LIMSI systems, discussing the tradeoff between speed and accuracy for various system combination strategies. Results on the EARS progress test sets show that the combined BBN/LIMSI system achieved relative WER reductions of 47% and 51% on the BN and CTS domains, respectively.


International Conference on Acoustics, Speech, and Signal Processing | 2004

Speech recognition in multiple languages and domains: the 2003 BBN/LIMSI EARS system

Richard M. Schwartz; Thomas Colthurst; Nicolae Duta; Herbert Gish; Rukmini Iyer; Chia-Lin Kao; Daben Liu; Owen Kimball; Jeff Z. Ma; John Makhoul; Spyros Matsoukas; Long Nguyen; Mohammed Noamany; Rohit Prasad; Bing Xiang; Dongxin Xu; Jean-Luc Gauvain; Lori Lamel; Holger Schwenk; Gilles Adda; Langzhou Chen

We report on the results of the first evaluations for the BBN/LIMSI system under the new DARPA EARS program. The evaluations were carried out for conversational telephone speech (CTS) and broadcast news (BN) for three languages: English, Mandarin, and Arabic. In addition to providing system descriptions and evaluation results, the paper highlights methods that worked well across the two domains and those few that worked well on one domain but not the other. For the BN evaluations, which had to run in under 10 times real time, we demonstrated that a joint BBN/LIMSI system with a time constraint achieved better results than either system alone.


Computer Speech & Language | 2000

A path-stack algorithm for optimizing dynamic regimes in a statistical hidden dynamic model of speech

Jeff Z. Ma; Li Deng

In this paper we report our recent research whose goal is to improve the performance of a novel speech recognizer based on an underlying statistical hidden dynamic model of phonetic reduction in the production of conversational speech. We have developed a path-stack search algorithm which efficiently computes the likelihood of any observation utterance while optimizing the dynamic regimes in the speech model. The effectiveness of the algorithm is tested on the speech data in the Switchboard corpus, in which the optimized dynamic regimes computed from the algorithm are compared with those from exhaustive search. We also present speech recognition results on the Switchboard corpus that demonstrate improvements of the recognizer’s performance compared with the use of the dynamic regimes heuristically set from the phone segmentation by a state-of-the-art hidden Markov model (HMM) system.
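To make the search problem concrete, here is a hedged sketch of a stack-based search over dynamic-regime boundaries, under the assumption that a regime is characterized by the frame interval a phone unit spans and that a helper `segment_loglike(unit, start, end)` scores one regime. It is an illustrative simplification (no pruning), not the paper's path-stack algorithm.

```python
import heapq

def path_stack_search(num_frames, units, segment_loglike, min_dur=3):
    """Search over dynamic-regime boundaries with a stack of partial paths.

    A partial path fixes the end frames of the first k units.  Paths are kept
    on a priority queue keyed by accumulated log-likelihood and expanded
    best-first; the best complete path seen is returned.  Illustrative sketch
    only: the real algorithm adds pruning of the stack.
    """
    best_score, best_bounds = float("-inf"), None
    # Stack entries: (negative accumulated score, frames consumed, units consumed, boundaries)
    stack = [(0.0, 0, 0, ())]
    while stack:
        neg_score, frame, k, bounds = heapq.heappop(stack)
        if k == len(units):
            if frame == num_frames and -neg_score > best_score:
                best_score, best_bounds = -neg_score, bounds
            continue
        remaining = len(units) - k - 1
        # Admissible end frames must leave room for the remaining units
        for end in range(frame + min_dur, num_frames - remaining * min_dur + 1):
            ll = segment_loglike(units[k], frame, end)
            heapq.heappush(stack, (neg_score - ll, end, k + 1, bounds + (end,)))
    return best_score, best_bounds
```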


International Conference on Acoustics, Speech, and Signal Processing | 2006

Unsupervised Training on Large Amounts of Broadcast News Data

Jeff Z. Ma; Spyros Matsoukas; Owen Kimball; Richard M. Schwartz

This paper presents our recent effort to improve our Arabic broadcast news (BN) recognition system by using thousands of hours of untranscribed Arabic audio for unsupervised training. Unsupervised training is first carried out on the 1,900-hour English topic detection and tracking (TDT) data and is compared with the lightly supervised training method that we have used for the DARPA EARS evaluations. The comparison shows that unsupervised training produces a 21.7% relative reduction in word error rate (WER), which is comparable to the gain obtained with light supervision methods. The same unsupervised training strategy carried out on a similar amount of Arabic BN data produces an 11.6% relative gain. The gain, though considerable, is substantially smaller than what is observed on the English data. Our initial work towards understanding the reasons for this difference is also described.
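A minimal sketch of a self-training loop of the kind the abstract refers to, assuming helper functions `decode` and `train` that are not part of the paper: decode untranscribed audio with the current model, keep confident hypotheses, and retrain on them. The confidence threshold and iteration count are illustrative.

```python
def unsupervised_training(seed_am, untranscribed_audio, decode, train,
                          conf_threshold=0.7, iterations=2):
    """Illustrative self-training loop for acoustic models (not BBN's exact recipe).

    `decode(model, utt)` -> list of (word, confidence);
    `train(pairs)` -> new acoustic model, where `pairs` is a list of
    (utterance, word list) built from the automatic transcripts.
    """
    model = seed_am
    for _ in range(iterations):
        selected = []
        for utt in untranscribed_audio:
            hyp = decode(model, utt)
            if not hyp:
                continue
            avg_conf = sum(c for _, c in hyp) / len(hyp)
            if avg_conf >= conf_threshold:            # keep confident utterances only
                selected.append((utt, [w for w, _ in hyp]))
        model = train(selected)                       # retrain on automatic transcripts
    return model
```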


International Conference on Acoustics, Speech, and Signal Processing | 2014

Domain adaptation via within-class covariance correction in I-vector based speaker recognition systems

Ondrej Glembek; Jeff Z. Ma; Pavel Matejka; Bing Zhang; Oldrich Plchot; Lukas Burget; Spyros Matsoukas

In this paper we propose a technique of Within-Class Covariance Correction (WCC) for Linear Discriminant Analysis (LDA) in speaker recognition, which performs an unsupervised adaptation of LDA to an unseen data domain and/or compensates for speaker population differences among different portions of the LDA training dataset. The paper follows on from studies of source-normalization and inter-database variability compensation techniques which deal with the multimodal distribution of i-vectors. On the DARPA RATS (Robust Automatic Transcription of Speech) task, we show that, with two hours of unsupervised data, we improve the Equal-Error Rate (EER) by 17.5% and 36% relative on the unmatched and semi-matched conditions, respectively. On the Domain Adaptation Challenge we show up to 70% relative EER reduction, and we propose a data clustering procedure to identify the directions of the domain-based variability in the adaptation data.
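A hedged numpy sketch of the general idea: augment the within-class scatter used by LDA with a covariance term estimated from clusters of unlabeled adaptation-domain i-vectors. The function name, the scaling parameter `alpha`, and the way the correction term is built here are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def lda_with_wcc(ivecs, labels, adapt_ivecs, adapt_clusters, alpha=1.0, dim=150):
    """Illustrative Within-Class Covariance Correction for LDA (our notation)."""
    classes = np.unique(labels)
    mu = ivecs.mean(axis=0)
    d = ivecs.shape[1]
    S_w = np.zeros((d, d))   # within-class scatter from labeled training i-vectors
    S_b = np.zeros((d, d))   # between-class scatter
    for c in classes:
        x = ivecs[labels == c]
        mc = x.mean(axis=0)
        if len(x) > 1:
            S_w += np.cov(x, rowvar=False) * (len(x) - 1)
        S_b += len(x) * np.outer(mc - mu, mc - mu)
    # Correction term: scatter of adaptation-cluster means around the adaptation mean,
    # capturing domain-based variability directions to be suppressed by LDA
    adapt_mu = adapt_ivecs.mean(axis=0)
    S_c = np.zeros((d, d))
    for c in np.unique(adapt_clusters):
        mc = adapt_ivecs[adapt_clusters == c].mean(axis=0)
        S_c += np.outer(mc - adapt_mu, mc - adapt_mu)
    S_w_corrected = S_w + alpha * S_c
    # Standard LDA eigenproblem with the corrected within-class matrix
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w_corrected, S_b))
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:dim]].real
```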


Computer Speech & Language | 2004

A mixed-level switching dynamic system for continuous speech recognition

Jeff Z. Ma; Li Deng

A two-level mixture linear dynamic system model, with frame-level switching parameters in the observation equation and segment-level switching parameters in the target-directed state equation, is developed and evaluated. The main contributions of this work are: (1) a new framework for dealing with mixed-level switching in the dynamic system and (2) the novel use of piecewise linear functions, enabled by the introduction of frame-level switching, to approximate the nonlinear function between the hidden vocal-tract-resonance space and the observable acoustic space. The approximation is accomplished by the frame-dependent switching parameters in the observation equation. In this paper, in a self-contained manner, we highlight the key algorithmic differences from the earlier model having only single segment-level switching that is synchronous between the state and observation equations. A series of speech recognition experiments is carried out to evaluate this new model using a subset of Switchboard conversational speech data. The experimental results show that the approximation accuracy improves with an increased number of switching-parameter values. The speech recognizer built from the new mixed-level switching dynamic system model, evaluated with an N-best re-scoring paradigm, shows a moderate word error rate reduction compared with using either single-level switching or no switching parameters.
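A rough sketch of the mixed-level switching idea in our own illustrative notation: the state equation switches with a segment-level index s, while the observation equation switches with a frame-level index m_t.

```latex
% Illustrative mixed-level switching equations (our notation, not the paper's)
\begin{align}
  x_t &= A_s\, x_{t-1} + (I - A_s)\, T_s + w_t
        && \text{($s$ fixed over a segment)}\\
  o_t &= H_{m_t}\, x_t + b_{m_t} + v_t
        && \text{($m_t$ may change every frame)}
\end{align}
```

The frame-level index selects one of several linear pieces, so the family of regressions acts as a piecewise linear approximation to the nonlinear mapping from the hidden VTR space to the acoustic space.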


International Conference on Acoustics, Speech, and Signal Processing | 2013

Developing a speaker identification system for the DARPA RATS project

Oldrich Plchot; Spyros Matsoukas; Pavel Matejka; Najim Dehak; Jeff Z. Ma; Sandro Cumani; Ondrej Glembek; Hynek Hermansky; Sri Harish Reddy Mallidi; Nima Mesgarani; Richard M. Schwartz; Mehdi Soufifar; Zheng-Hua Tan; Samuel Thomas; Bing Zhang; Xinhui Zhou

This paper describes the speaker identification (SID) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded communication channels. We present results using multiple SID systems differing mainly in the algorithm used for voice activity detection (VAD) and feature extraction. We show that (a) unsupervised VAD performs as well as supervised methods in terms of downstream SID performance, (b) noise-robust feature extraction methods such as CFCCs outperform MFCC front-ends on noisy audio, and (c) fusion of multiple systems provides a 24% relative improvement in EER compared to the single best system when using a novel SVM-based fusion algorithm that uses side information such as gender, language, and channel ID.
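A minimal sketch of score-level fusion with an SVM that also sees side information, under the assumption that per-trial subsystem scores and encoded side-information features are already available; the helper names and feature layout are illustrative, not the Patrol team's implementation.

```python
import numpy as np
from sklearn.svm import SVC

def train_fusion(subsystem_scores, side_info, labels):
    """Illustrative SVM-based fusion of multiple SID subsystems.

    Each trial is represented by its subsystem scores concatenated with encoded
    side information (e.g. gender / language / channel indicators); an SVM is
    trained to separate target from non-target trials.
    """
    X = np.hstack([subsystem_scores, side_info])   # shape: [n_trials, n_systems + n_side]
    clf = SVC(kernel="linear", probability=True)
    clf.fit(X, labels)                             # labels: 1 = target, 0 = non-target
    return clf

def fuse(clf, subsystem_scores, side_info):
    """Return a single fused score per trial (posterior of the target class)."""
    X = np.hstack([subsystem_scores, side_info])
    return clf.predict_proba(X)[:, 1]
```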

Collaboration


Dive into Jeff Z. Ma's collaborations.

Top Co-Authors

Gilles Adda

Centre national de la recherche scientifique
