Haitian Xu
Aalborg University
Publication
Featured research published by Haitian Xu.
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Haitian Xu; Paul Dalsgaard; Zheng-Hua Tan; Børge Lindberg
A condition-dependent training strategy divides a training database into a number of clusters, each corresponding to a noise condition, and subsequently trains a hidden Markov model (HMM) set for each cluster. This paper investigates and compares a number of condition-dependent training strategies in order to achieve a better understanding of the effects on automatic speech recognition (ASR) performance caused by splitting the training database. The effect of signal-to-noise ratio (SNR) mismatch between training and test data is also analyzed. The results show that splitting the training material in terms of both noise type and SNR value is advantageous compared to previously used methods, and that training only a limited number of HMM sets per noise type is sufficient for robust handling of SNR mismatches. This leads to the introduction of an SNR and noise classification-based training strategy (SNT-SNC). Better ASR performance is obtained on test material containing data from known noise types than with either multicondition training or noise-type-dependent training strategies. The computational complexity of the SNT-SNC framework is kept low by choosing only one HMM set for recognition, selected on the basis of noise classification and SNR estimation. However, compared to other strategies, the SNT-SNC framework shows lower performance for unknown noise types. This problem is partly overcome by introducing a number of model- and feature-domain techniques. Experiments using both artificially corrupted and real-world noisy speech databases demonstrate the effectiveness of these methods.
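The selection step described above — estimate the SNR, classify the noise type, and pick one condition-dependent HMM set — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the SNR band edges (5 and 15 dB), the model-set names, and the dictionary lookup are all hypothetical.

```python
import numpy as np

def estimate_snr_db(speech_energy, noise_energy):
    """Estimate SNR in dB from average speech- and noise-frame energies."""
    return 10.0 * np.log10(speech_energy / noise_energy)

def select_hmm_set(noise_type, snr_db, model_sets, snr_bins=(5.0, 15.0)):
    """Pick one condition-dependent model set by noise type and quantised SNR.

    `model_sets` maps (noise_type, snr_band) to a model identifier; the
    band edges in `snr_bins` are illustrative, not the paper's values.
    """
    band = int(np.digitize(snr_db, snr_bins))  # 0 = low, 1 = mid, 2 = high
    return model_sets[(noise_type, band)]

# Hypothetical model-set registry for one known noise type.
model_sets = {("car", 0): "hmm_car_lowSNR",
              ("car", 1): "hmm_car_midSNR",
              ("car", 2): "hmm_car_highSNR"}

snr = estimate_snr_db(speech_energy=4.0, noise_energy=0.04)  # 20 dB
print(select_hmm_set("car", snr, model_sets))  # hmm_car_highSNR
```

Only the selected set is used for decoding, which is what keeps the recognition-time cost at that of a single-model system.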
Biennial on DSP for in-Vehicle and Mobile Systems | 2007
Haitian Xu; Zheng-Hua Tan; Paul Dalsgaard; Ralf Mattethat; Børge Lindberg
The growth in wireless communication and mobile devices has supported the development of distributed speech recognition (DSR) technology. During the last decade this has led to the establishment of ETSI-DSR standards and an increased interest in research aimed at systems exploiting DSR. So far, however, DSR-based systems executing on mobile devices are only in their infancy. One of the reasons is the lack of easy-to-use software development packages. This chapter presents a prototype version of a configurable DSR system for the development of speech-enabled applications on mobile devices.
IEEE Transactions on Audio, Speech, and Language Processing | 2011
Haitian Xu; Mark J. F. Gales; K. K. Chin
Model-based noise compensation techniques are a powerful approach to improve speech recognition performance in noisy environments. However, one of the major issues with these schemes is that they are computationally expensive. Though techniques have been proposed to address this problem, they often result in degraded performance. This paper proposes a new, highly flexible approach which allows the computational load required for noise compensation to be controlled while maintaining good performance. The scheme combines improved joint uncertainty decoding with the predictive linear transform framework. The final compensation is implemented as a set of linear transforms of the features, decoupling the computational cost of compensation from the complexity of the recognition system's acoustic models. Furthermore, by using linear transforms, changes in the correlations in the feature vector can also be efficiently modeled. The proposed methods can easily be applied in an adaptive training scheme, including discriminative adaptive training. The performance of the approach is compared to a number of standard schemes on Aurora 2 as well as in-car speech recognition tasks. Results indicate that the proposed scheme is an attractive alternative to existing approaches.
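The key efficiency idea — compensation implemented as a small set of per-regression-class linear transforms of the features, independent of the number of acoustic-model Gaussians — can be sketched as below. This is a minimal, hypothetical illustration: the class assignments and the transform parameters (A, b) stand in for those that would be estimated from the noise model, and the function names are not from the paper.

```python
import numpy as np

def compensate_features(feats, classes, transforms):
    """Apply a per-regression-class affine transform y = A x + b to features.

    feats:      (T, D) matrix of feature vectors.
    classes:    length-T array assigning each frame to a regression class.
    transforms: dict mapping class -> (A, b), with A of shape (D, D)
                and b of shape (D,).
    """
    out = np.empty_like(feats)
    for c, (A, b) in transforms.items():
        mask = classes == c
        out[mask] = feats[mask] @ A.T + b  # affine transform per class
    return out

# Hypothetical two-class example with 2-D features.
feats = np.array([[1.0, 2.0], [3.0, 4.0]])
classes = np.array([0, 1])
transforms = {0: (np.eye(2), np.zeros(2)),          # identity for class 0
              1: (2.0 * np.eye(2), np.ones(2))}     # scale + shift for class 1
print(compensate_features(feats, classes, transforms))  # [[1. 2.] [7. 9.]]
```

Because the number of regression classes is typically far smaller than the number of Gaussians, the cost of estimating and applying these transforms stays flat as the acoustic model grows.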
IEEE Signal Processing Letters | 2008
Haitian Xu; Zheng-Hua Tan; Paul Dalsgaard; Børge Lindberg
The nonlocal means (NL-means) algorithm, recently proposed for image denoising, has proved highly effective at removing additive noise while largely preserving image details. The algorithm denoises by averaging each pixel with other pixels in the image that have similar characteristics. This letter treats the real and imaginary parts of the complex speech spectrogram each as a separate image and applies a modified NL-means algorithm to them for denoising, in order to improve the noise robustness of speech recognition. Recognition results on a noisy speech database show that the proposed method is superior to classical methods such as spectral subtraction.
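For reference, the standard NL-means idea on a single real-valued 2-D array (as each spectrogram part is treated here) looks roughly like this. This is a naive textbook sketch, not the letter's modified variant; the patch size, search window, and filtering parameter `h` are illustrative.

```python
import numpy as np

def nl_means(img, patch=1, search=3, h=0.5):
    """Naive NL-means: replace each pixel by a weighted average of pixels
    whose surrounding patches look similar.

    patch, search: half-widths of the comparison patch and search window.
    h:             controls how fast similarity weights decay.
    """
    H, W = img.shape
    padded = np.pad(img, patch, mode="reflect")
    out = np.zeros_like(img)
    for i in range(H):
        for j in range(W):
            ref = padded[i:i + 2 * patch + 1, j:j + 2 * patch + 1]
            num = den = 0.0
            for di in range(max(0, i - search), min(H, i + search + 1)):
                for dj in range(max(0, j - search), min(W, j + search + 1)):
                    cand = padded[di:di + 2 * patch + 1, dj:dj + 2 * patch + 1]
                    # Weight decays with the mean squared patch difference.
                    w = np.exp(-np.mean((ref - cand) ** 2) / h ** 2)
                    num += w * img[di, dj]
                    den += w
            out[i, j] = num / den
    return out
```

Applied separately to the real and imaginary parts of the complex spectrogram, the denoised parts can then be recombined before feature extraction.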
international conference on acoustics, speech, and signal processing | 2009
Haitian Xu; K. K. Chin
Joint uncertainty decoding has recently achieved promising results by integrating front-end uncertainty into the back-end in a mathematically consistent framework. In this paper, joint uncertainty decoding is compared with the widely used vector Taylor series (VTS) approach. We show that the two methods are identical except that joint uncertainty decoding applies the Taylor expansion to each regression class whereas VTS applies it to each HMM mixture. The coarser expansion points used in joint uncertainty decoding make it computationally cheaper than VTS but inevitably worse in recognition accuracy. To overcome this drawback, this paper proposes an improved joint uncertainty decoding algorithm which employs a second-order Taylor expansion on each regression class in order to reduce the expansion errors. Special consideration is further given to limiting the overall computational cost by adopting different numbers of regression classes for the different orders of the Taylor expansion. Experiments on the Aurora 2 database show that the proposed method outperforms VTS in both recognition accuracy and computational cost, with relative improvements of up to 6% and 60%, respectively.
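The Taylor expansion at the heart of both schemes can be illustrated with the standard first-order VTS compensation of a Gaussian in the log-spectral domain. This is a simplified sketch under common assumptions (additive noise only, no channel term, diagonal per-dimension statistics), not the paper's second-order formulation; the mismatch function is y = x + log(1 + exp(n − x)), expanded at the means.

```python
import numpy as np

def vts_compensate(mu_x, var_x, mu_n, var_n):
    """First-order VTS compensation of a clean-speech Gaussian.

    mu_x, var_x: clean-speech mean and variance (per dimension).
    mu_n, var_n: noise mean and variance (per dimension).
    Returns the compensated (noisy-speech) mean and variance.
    """
    # Jacobian dy/dx of the mismatch function at the expansion point.
    g = 1.0 / (1.0 + np.exp(mu_n - mu_x))
    mu_y = mu_x + np.log1p(np.exp(mu_n - mu_x))
    # First-order propagation of the variances through the linearisation.
    var_y = g ** 2 * var_x + (1.0 - g) ** 2 * var_n
    return mu_y, var_y
```

Applying this per HMM mixture gives VTS; applying it once per regression class (a much coarser expansion point) gives the cheaper joint-uncertainty-decoding variant the paper improves upon.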
international conference on acoustics, speech, and signal processing | 2011
K. K. Chin; Haitian Xu; Mark J. F. Gales; Catherine Breslin; Katherine Mary Knill
For speech recognition, mismatches between training and testing conditions for speaker and noise are normally handled separately. The work presented in this paper jointly applies speaker adaptation and model-based noise compensation by embedding speaker adaptation in the noise mismatch function. The proposed method gives faster and better adaptation than compensating for these two factors separately, and is more consistent with the basic assumptions of speaker and noise adaptation. Experimental results show significant and consistent gains from the proposed method.
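The composition order implied by the embedding — apply the speaker transform to the clean-speech statistics first, then pass the result through the noise mismatch function — can be sketched as below. This is a hypothetical, diagonal-transform illustration in the log-spectral domain, not the paper's formulation.

```python
import numpy as np

def joint_mismatch(mu_x, a, b, mu_n):
    """Speaker-then-noise composition for a clean-speech mean.

    mu_x:  clean-speech mean (per dimension).
    a, b:  diagonal speaker transform, s = a * x + b (CMLLR-style sketch).
    mu_n:  additive-noise mean.
    The noise mismatch is y = s + log(1 + exp(n - s)).
    """
    s = a * mu_x + b                        # speaker-adapted clean mean
    return s + np.log1p(np.exp(mu_n - s))   # then noise compensation
```

With the identity speaker transform (a = 1, b = 0) this reduces to plain noise compensation, which is the consistency property the joint scheme preserves.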
ieee automatic speech recognition and understanding workshop | 2009
Haitian Xu; Mark J. F. Gales; K. K. Chin
Model-based noise compensation techniques, such as vector Taylor series (VTS) compensation, have been applied to a range of noise robustness tasks. However, one of the issues with these approaches is that they are computationally expensive for large speech recognition systems. To address this problem, schemes such as joint uncertainty decoding (JUD) have been proposed. Though computationally more efficient, JUD typically degrades performance. This paper proposes an alternative scheme, related to JUD but making fewer approximations, VTS-JUD. Unfortunately, this approach also removes some of the computational advantages of JUD. To address this, rather than using VTS-JUD directly, it is used to obtain statistics for estimating a predictive linear transform, PCMLLR. This is both computationally efficient and limits some of the issues associated with the diagonal covariance matrices typically used with schemes such as VTS. PCMLLR can also be used straightforwardly within an adaptive training framework (PAT). The performance of the VTS-JUD, PCMLLR, and PAT systems was compared to a number of standard approaches on an in-car speech recognition task. The proposed scheme is an attractive alternative to existing approaches.
international conference on acoustics, speech, and signal processing | 2006
Haitian Xu; Zheng-Hua Tan; Paul Dalsgaard; Børge Lindberg
Compared to multi-condition training (MTR), condition-dependent training generates multiple acoustic hidden Markov model sets, each associated with a noise environment, and is known to perform substantially better for known noise types (included in training) but worse for unknown (untrained) noise types. This paper attempts to bridge the performance gap between known and unknown noise types by introducing a minimum mean-square error (MMSE) noise-type-based compensation algorithm. On the basis of a modified vector Taylor series and measurements of feature reliability and noise similarity, the MMSE estimation maps test features corrupted by an unknown noise type to the corresponding features corrupted by a known noise type. This method significantly improves recognition performance for unknown noise types while maintaining good performance for known noise types. Furthermore, in order to benefit directly from MTR, a model interpolation strategy is investigated which combines the MTR and condition-dependent model sets. Both good performance and low computational cost are achieved by interpolating the mixtures of each condition-dependent model state with only the least-weighted mixture of the corresponding MTR model state. The overall system gives promising results.
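The interpolation step — augment each condition-dependent state with just the least-weighted mixture of the matching MTR state, rather than merging the full mixture sets — can be sketched on the mixture weights alone. This is an illustrative sketch; the interpolation weight `alpha` and the weight-only view (ignoring the Gaussian parameters being carried over) are assumptions, not the paper's exact scheme.

```python
import numpy as np

def interpolate_state(cd_weights, mtr_weights, alpha=0.2):
    """Augment a condition-dependent (CD) state with the least-weighted
    mixture of the corresponding multi-condition (MTR) state.

    cd_weights:  mixture weights of the CD state.
    mtr_weights: mixture weights of the matching MTR state.
    alpha:       weight given to the borrowed MTR mixture (hypothetical).
    Returns the re-normalised weights and the index of the MTR mixture used.
    """
    k = int(np.argmin(mtr_weights))  # least-weighted MTR mixture
    # Scale the CD weights down and append the borrowed mixture's weight.
    new_weights = np.append((1.0 - alpha) * np.asarray(cd_weights), alpha)
    return new_weights, k

weights, k = interpolate_state([0.5, 0.5], [0.7, 0.1, 0.2])
print(weights, k)  # [0.4 0.4 0.2] 1
```

Borrowing a single mixture per state keeps the enlarged model only marginally more expensive to evaluate than the original condition-dependent set.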
conference of the international speech communication association | 2005
Haitian Xu; Zheng-Hua Tan; Paul Dalsgaard; Børge Lindberg
conference of the international speech communication association | 2010
Catherine Breslin; K. K. Chin; Mark J. F. Gales; Kate Knill; Haitian Xu