Xiaodong Cui
University of California, Los Angeles
Publications
Featured research published by Xiaodong Cui.
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2006
Li Deng; Xiaodong Cui; Robert Pruvenok; Yanyi Chen; Safiyy Momen; Abeer Alwan
While vocal tract resonances (VTRs, or formants that are defined as such resonances) are known to play a critical role in human speech perception and in computer speech processing, there has been a lack of standard databases needed for the quantitative evaluation of automatic VTR extraction techniques. We report in this paper on our recent effort to create a publicly available database of the first three VTR frequency trajectories. The database contains a representative subset of the TIMIT corpus with respect to speaker, gender, dialect, and phonetic context, with a total of 538 sentences. A Matlab-based labeling tool was developed, with high-resolution wideband spectrograms displayed to assist in the visual identification of VTR frequency values, which are then recorded via mouse clicks and local spline interpolation. Special attention is paid to VTR values during consonant-to-vowel (CV) and vowel-to-consonant (VC) transitions, and to speech segments with vocal tract anti-resonances. Using this database, we quantitatively assess two common automatic VTR tracking techniques in terms of their average tracking errors, analyzed within each of the six major broad phonetic classes as well as during CV and VC transitions. The potential use of the VTR database for research in several areas of speech processing is discussed.
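To illustrate the labeling tool's interpolation step, the following Python sketch densifies a handful of clicked (time, frequency) anchor points into a frame-rate formant trajectory with a cubic spline. The anchor values, frame shift, and the use of SciPy's CubicSpline are illustrative assumptions; the paper's tool is Matlab-based and uses local spline interpolation.

```python
# A minimal sketch, assuming hand-clicked anchors for one formant track;
# the anchor values below are illustrative, not taken from the database.
import numpy as np
from scipy.interpolate import CubicSpline

# (time in seconds, frequency in Hz) pairs recorded via mouse clicks (F1)
anchors_t = np.array([0.00, 0.05, 0.12, 0.20, 0.28])
anchors_f = np.array([520.0, 560.0, 610.0, 580.0, 540.0])

spline = CubicSpline(anchors_t, anchors_f)

frame_shift = 0.010  # 10 ms frame shift, a common analysis setting
frames_t = np.arange(anchors_t[0], anchors_t[-1] + 1e-9, frame_shift)
f1_track = spline(frames_t)  # dense F1 trajectory in Hz

for t, f in zip(frames_t, f1_track):
    print(f"{t:.3f} s  {f:6.1f} Hz")
```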
IEEE Transactions on Speech and Audio Processing | 2005
Xiaodong Cui; Abeer Alwan
A feature compensation (FC) algorithm based on polynomial regression of utterance signal-to-noise ratio (SNR) for noise-robust automatic speech recognition (ASR) is proposed. In this algorithm, the bias between clean and noisy speech features is approximated by a set of polynomials which are estimated from adaptation data from the new environment by the expectation-maximization (EM) algorithm under the maximum likelihood (ML) criterion. In ASR, the utterance SNR of the speech signal is first estimated, and the noisy speech features are then compensated by the regression polynomials. The compensated speech features are decoded via acoustic HMMs trained with clean data. Comparative experiments on the Aurora 2 (English) and the German part of the Aurora 3 databases are performed between FC and maximum likelihood linear regression (MLLR). In the Aurora 2 experiments, two MLLR implementations are used: pooling adaptation data across all SNRs, and using three distinct SNR clusters. For each type of noise, FC achieves, on average, a word error rate reduction of 16.7% and 16.5% for Set A, and 20.5% and 14.6% for Set B, compared to the first and second MLLR implementations, respectively. For each SNR condition, FC achieves, on average, a word error rate reduction of 33.1% and 34.5% for Set A, and 23.6% and 21.4% for Set B. Results on the Aurora 3 database show that the best FC performance outperforms MLLR by 15.9%, 3.0%, and 14.6% for the well-matched, medium-mismatched, and high-mismatched conditions, respectively.
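To make the compensation step concrete, here is a minimal Python sketch of applying an SNR-dependent polynomial bias to noisy features. It assumes the per-dimension polynomial coefficients have already been estimated via EM on adaptation data, as described above; the dimensionality, polynomial order, and sign convention for the bias are illustrative assumptions.

```python
# Minimal sketch of the compensation step only; the EM estimation of the
# polynomial coefficients on adaptation data is not shown.
import numpy as np

def compensate(noisy_feats, utterance_snr_db, coeffs):
    """noisy_feats: (T, D) feature frames; coeffs: (D, K+1) polynomial
    coefficients per dimension, highest order first (np.polyval order).
    Returns the bias-removed features."""
    bias = np.array([np.polyval(c, utterance_snr_db) for c in coeffs])
    return noisy_feats - bias  # bias broadcasts across the T frames

# Illustrative values: 13-dimensional features, 2nd-order polynomials
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 13))
coeffs = rng.normal(scale=0.01, size=(13, 3))
clean_estimate = compensate(feats, utterance_snr_db=10.0, coeffs=coeffs)
```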
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Xiaodong Cui; Yifan Gong
To improve recognition performance in noisy environments, multicondition training is usually applied, in which speech signals corrupted by a variety of noise are used in acoustic model training. Published hidden Markov modeling of speech uses multiple Gaussian distributions to cover the spread of the speech distribution caused by noise, which distracts from the modeling of the speech event itself and possibly sacrifices performance on clean speech. In this paper, we propose a novel approach which extends the conventional Gaussian mixture hidden Markov model (GMHMM) by modeling the state emission parameters (mean and variance) as a polynomial function of a continuous environment-dependent variable. At recognition time, a set of HMMs specific to the given value of the environment variable is instantiated and used for recognition. The maximum-likelihood (ML) estimation of the polynomial functions of the proposed variable-parameter GMHMM is given within the expectation-maximization (EM) framework. Experiments on the Aurora 2 database show significant improvements of the variable-parameter Gaussian mixture HMMs compared to the conventional GMHMMs.
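A minimal sketch of the core idea follows, assuming a diagonal-covariance Gaussian whose mean is a polynomial in the environment variable (e.g., utterance SNR) while the variance stays fixed; the class name, polynomial order, and SNR value are illustrative, and the EM training of the coefficients is not shown.

```python
# Sketch of a variable-parameter Gaussian: the mean polynomial is
# evaluated at the current environment value at recognition time.
import numpy as np

class VariableParamGaussian:
    def __init__(self, mean_coeffs, var):
        self.mean_coeffs = mean_coeffs  # (D, K+1), highest order first
        self.var = var                  # (D,) fixed diagonal variance

    def instantiate(self, v):
        """Evaluate the mean polynomial at environment value v."""
        return np.array([np.polyval(c, v) for c in self.mean_coeffs])

    def log_prob(self, x, v):
        mu = self.instantiate(v)
        return -0.5 * np.sum(np.log(2.0 * np.pi * self.var)
                             + (x - mu) ** 2 / self.var)

# Illustrative use: a 2-dimensional Gaussian with linear mean trajectories
g = VariableParamGaussian(mean_coeffs=np.array([[0.05, 1.0], [-0.02, 0.5]]),
                          var=np.array([1.0, 1.0]))
print(g.log_prob(np.array([1.3, 0.4]), v=10.0))  # v = utterance SNR in dB
```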
Computer Speech & Language | 2006
Xiaodong Cui; Abeer Alwan
Automatic recognition of children's speech using acoustic models trained on adult speech results in poor performance due to differences in speech acoustics. These acoustical differences are a consequence of children having shorter vocal tracts and smaller vocal cords than adults. Hence, speaker adaptation needs to be performed. However, in real-world applications, the amount of adaptation data available may be less than what is needed by common speaker adaptation techniques to yield reasonable performance. In this paper, we first study, in the discrete frequency domain, the relationship between frequency warping in the front-end and corresponding transformations in the back-end. Three common feature extraction schemes are investigated and their transformation linearity in the back-end is discussed. In particular, we show that under certain approximations, frequency warping of MFCC features with Mel-warped triangular filter banks equals a linear transformation in the cepstral space. Based on that linear transformation, a formant-like peak alignment algorithm is proposed to adapt adult acoustic models to children's speech. The peaks are estimated by Gaussian mixtures using the expectation-maximization (EM) algorithm [Zolfaghari, P., Robinson, T., 1996. Formant analysis using mixtures of Gaussians. Proceedings of the International Conference on Spoken Language Processing, 1229–1232]. For limited adaptation data, the algorithm outperforms traditional vocal tract length normalization (VTLN) and maximum likelihood linear regression (MLLR) techniques.
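The key observation, that frequency warping acts on MFCCs approximately as a linear map c' = Ac, can be illustrated numerically. The sketch below builds A from a truncated DCT, a resampling matrix in the log filter-bank domain, and the transposed (pseudo-inverse) DCT; the piecewise-linear warp, filter-bank size, and warp factor are illustrative assumptions, not the paper's analytical derivation.

```python
# Minimal numerical sketch: frequency warping as a linear map on MFCCs.
import numpy as np
from scipy.fft import dct

n_filt, n_ceps, alpha = 26, 13, 1.1  # illustrative sizes and warp factor

# Resampling matrix in the log Mel filter-bank domain: warped channel i is
# linearly interpolated from original channel alpha * i.
P = np.zeros((n_filt, n_filt))
for i in range(n_filt):
    j = min(alpha * i, n_filt - 1.0)
    lo, w = int(np.floor(j)), j - np.floor(j)
    P[i, lo] = 1.0 - w
    if lo + 1 < n_filt:
        P[i, lo + 1] = w

C_full = dct(np.eye(n_filt), axis=0, norm='ortho')  # C_full @ v = DCT-II(v)
C = C_full[:n_ceps]          # truncated DCT used in MFCC extraction
A = C @ P @ C.T              # (n_ceps, n_ceps) linear map on the cepstra

c = np.random.default_rng(0).normal(size=n_ceps)  # an MFCC vector
c_warped = A @ c             # warped MFCCs without touching the waveform
```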
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2003
Xiaodong Cui; Yifan Gong
To improve recognition, speech signals corrupted by a variety of noises can be used in speech model training. Published hidden Markov modeling of speech uses multiple Gaussian distributions to cover the spread of the speech distribution caused by the noises, which distracts from the modeling of the speech event itself and possibly sacrifices performance on clean speech. We extend the GMHMM by allowing state emission parameters to change as a function of an environment-dependent continuous variable. At recognition time, a set of HMMs specific to the given environment is instantiated and used for recognition. A variable-parameter (VP) HMM with parameters modeled as a polynomial function of the environment variable is developed, and parameter estimation based on the EM algorithm is given. With the same number of mixtures, VPHMM reduces the WER by 40% compared to conventional multicondition training.
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Shizhen Wang; Xiaodong Cui; Abeer Alwan
Spectral mismatch between training and testing utterances can cause significant degradation in the performance of automatic speech recognition (ASR) systems. Speaker adaptation and speaker normalization techniques are usually applied to address this issue. One way to reduce spectral mismatch is to reshape the spectrum by aligning corresponding formant peaks. There are various levels of mismatch in formant structures. In this paper, regression-tree-based phoneme- and state-level spectral peak alignment is proposed for rapid speaker adaptation using a linearization of the vocal tract length normalization (VTLN) technique. This method is investigated in a maximum-likelihood linear regression (MLLR)-like framework, taking advantage of both the efficiency of frequency warping (VTLN) and the reliability of statistical estimation (MLLR). Two different regression classes are investigated: one based on phonetic classes (using combined knowledge and data-driven techniques) and the other based on Gaussian mixture classes. Compared to MLLR, VTLN, and global peak alignment, improved performance is obtained for both supervised and unsupervised adaptation on both medium-vocabulary (the RM1 database) and connected-digit recognition (the TIDIGITS database) tasks. Performance improvements are largest with limited adaptation data, which is often the case for ASR applications, and these improvements are shown to be statistically significant.
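As one concrete building block, a piecewise-linear warp that moves a test speaker's spectral peak onto a reference peak can be written as follows; the peak frequencies, band edge, and two-segment form are illustrative assumptions, and the paper's regression-tree clustering and MLLR-like estimation are not shown.

```python
# Minimal sketch of a two-segment piecewise-linear peak-alignment warp.
import numpy as np

def peak_align_warp(f, f_peak_test, f_peak_ref, f_max=8000.0):
    """Map frequencies f so that f_peak_test lands on f_peak_ref, linearly
    below the peak and linearly above it up to the band edge f_max."""
    f = np.asarray(f, dtype=float)
    below = f <= f_peak_test
    out = np.empty_like(f)
    out[below] = f[below] * (f_peak_ref / f_peak_test)
    out[~below] = f_peak_ref + (f[~below] - f_peak_test) * (
        (f_max - f_peak_ref) / (f_max - f_peak_test))
    return out

# Illustrative use: align a test peak at 2800 Hz to a reference at 2500 Hz
freqs = np.linspace(0.0, 8000.0, 9)
print(peak_align_warp(freqs, f_peak_test=2800.0, f_peak_ref=2500.0))
```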
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2002
Xiaodong Cui; Abeer Alwan
This paper proposes an efficient algorithm for the automatic selection of sentences given a desired phoneme distribution. The algorithm is based on the Kullback-Leibler measure under the criterion of minimum cross-entropy. One application of this algorithm is the design of adaptation text for automatic speech recognition with a particular phoneme distribution. The algorithm is efficient and flexible, especially in the case of limited text size. Experimental results verify the advantage of this approach.
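A greedy variant of such a selection procedure is easy to sketch: repeatedly pick the sentence whose phoneme counts bring the running distribution closest to the target under a KL criterion. The toy corpus, the greedy strategy, and the direction of the KL measure are illustrative assumptions rather than the paper's exact algorithm.

```python
# Minimal sketch: greedy sentence selection toward a target phoneme
# distribution under a Kullback-Leibler (cross-entropy) criterion.
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def select_sentences(sent_counts, target, k):
    """sent_counts: (N, P) phoneme counts per sentence; target: (P,)
    desired phoneme distribution. Returns indices of k sentences."""
    chosen, total = [], np.zeros(sent_counts.shape[1])
    for _ in range(k):
        remaining = (i for i in range(len(sent_counts)) if i not in chosen)
        best = min(remaining, key=lambda i: kl(target, total + sent_counts[i]))
        chosen.append(best)
        total += sent_counts[best]
    return chosen

rng = np.random.default_rng(1)
counts = rng.integers(0, 5, size=(50, 10)).astype(float)  # toy corpus
target = np.ones(10) / 10.0                               # uniform target
print(select_sentences(counts, target, k=5))
```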
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2004
Xiaodong Cui; Abeer Alwan
Acoustic models trained with clean speech signals suffer in the presence of background noise. In some situations, only a limited amount of noisy data from the new environment is available, based on which the clean models could be adapted. A feature compensation approach employing polynomial regression of the signal-to-noise ratio (SNR) is proposed in this paper. While the clean acoustic models remain unchanged, a bias which is a polynomial function of utterance SNR is estimated and removed from the noisy features. Depending on the amount of noisy data available, the algorithm can be flexibly carried out at different levels of granularity. Based on the Euclidean distance, the similarity between the residual distribution and the clean models is estimated and used as the confidence factor in a back-end weighted Viterbi decoding (WVD) algorithm. With limited amounts of noisy data, the feature compensation algorithm outperforms maximum likelihood linear regression (MLLR) on the Aurora 2 database. Weighted Viterbi decoding further improves recognition accuracy.
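The decoding side of this scheme can be sketched as a Viterbi recursion in which each frame's acoustic log-likelihood is scaled by a confidence weight. The state space, transition scores, and per-frame weights below are illustrative; the paper derives the confidence factor from a Euclidean-distance similarity between the residual distribution and the clean models.

```python
# Minimal sketch of weighted Viterbi decoding (WVD): acoustic scores are
# multiplied by a per-frame confidence gamma[t] before the DP recursion.
import numpy as np

def weighted_viterbi(log_obs, log_trans, log_init, gamma):
    """log_obs: (T, S) frame log-likelihoods; log_trans: (S, S) transition
    log-probs; log_init: (S,) initial log-probs; gamma: (T,) weights."""
    T, S = log_obs.shape
    delta = log_init + gamma[0] * log_obs[0]
    psi = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans       # (S, S) path scores
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + gamma[t] * log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Illustrative use with gamma = 1 everywhere (reduces to standard Viterbi)
rng = np.random.default_rng(2)
print(weighted_viterbi(rng.normal(size=(8, 3)),
                       np.log(np.full((3, 3), 1.0 / 3.0)),
                       np.log(np.full(3, 1.0 / 3.0)),
                       np.ones(8)))
```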
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2004
Alexis P. Bernard; Yifan Gong; Xiaodong Cui
We present a back-end solution developed at Texas Instruments for noise-robust speech recognition. The solution consists of three techniques: 1) a joint additive and convolutive noise compensation (JAC), which adapts the speech acoustic models; 2) an enhanced channel estimation procedure, which extends JAC performance toward lower SNR ranges; and 3) an N-pass decoding algorithm. The performance of the proposed back-end is evaluated on the Aurora 2 database. With 20% fewer model parameters and without the need for the second-order derivative of the recognition features, the proposed solution achieves 91.86% accuracy, outperforming the ETSI advanced front-end standard (88.19%) by more than 30% relative word error rate reduction.
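The mean-adaptation step at the heart of JAC-style compensation follows the standard log-domain combination of a clean model mean with channel and noise estimates. The sketch below shows that standard approximation only; the values are illustrative, and the paper's enhanced channel estimation and N-pass decoding are not reproduced.

```python
# Minimal sketch of the standard log-Mel mean compensation used in
# JAC-style model adaptation: mu_y = mu_x + h + log(1 + exp(n - mu_x - h)).
import numpy as np

def jac_compensate_mean(mu_clean, h, n):
    """mu_clean, h, n: log-Mel-spectral vectors (clean mean, channel
    estimate, additive-noise estimate). Returns the noisy-domain mean."""
    return mu_clean + h + np.log1p(np.exp(n - mu_clean - h))

mu = np.array([2.0, 3.0, 4.0])    # illustrative clean log-Mel mean
h = np.array([0.1, -0.2, 0.0])    # convolutive (channel) estimate
n = np.array([1.5, 1.0, 2.5])     # additive-noise estimate
print(jac_compensate_mean(mu, h, n))
```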
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2006
Xiaodong Cui; Yifan Gong
Variance variation with respect to a continuous environment-dependent variable is investigated in this paper in a variable-parameter Gaussian mixture HMM (VP-GMHMM) for noisy speech recognition. The variation is modeled by a scaling polynomial applied to the variances of the conventional hidden Markov acoustic models. The maximum likelihood estimation of the scaling polynomial is performed under an SNR quantization approximation. Experiments on the Aurora 2 database show significant improvements from incorporating the variance scaling scheme into the previous VP-GMHMM, where only mean variation is considered.
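A minimal sketch of the variance-scaling idea follows, assuming the scaling polynomial is evaluated at the utterance SNR and applied multiplicatively to trained diagonal variances; the coefficients, polynomial order, and positivity floor are illustrative assumptions.

```python
# Minimal sketch: SNR-dependent variance scaling in a VP-GMHMM.
import numpy as np

def scaled_variance(base_var, scale_coeffs, snr_db):
    """base_var: (D,) trained diagonal variances; scale_coeffs: (D, K+1)
    polynomial coefficients per dimension, highest order first."""
    scale = np.array([np.polyval(c, snr_db) for c in scale_coeffs])
    return base_var * np.maximum(scale, 1e-6)  # keep variances positive

base = np.array([1.0, 0.5, 2.0])             # illustrative variances
coeffs = np.array([[-0.02, 1.2]] * 3)        # scale = 1.2 - 0.02 * SNR(dB)
print(scaled_variance(base, coeffs, snr_db=10.0))
```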