Khe Chai Sim
National University of Singapore
Publication
Featured research published by Khe Chai Sim.
International Conference on Acoustics, Speech, and Signal Processing | 2007
Khe Chai Sim; William Byrne; Mark J. F. Gales; Hichem Sahbi; Philip C. Woodland
This paper presents a simple and robust consensus decoding approach for combining multiple machine translation (MT) system outputs. A consensus network is constructed from an N-best list by aligning the hypotheses against an alignment reference, where the alignment is based on minimising the translation edit rate (TER). The minimum Bayes risk (MBR) decoding technique is investigated for the selection of an appropriate alignment reference. Several alternative decoding strategies are proposed to retain coherent phrases from the original translations. Experimental results are presented primarily for three-way combination of Chinese-English translation outputs, with additional results for six-way system combination. It is shown that worthwhile improvements in translation performance can be obtained using the methods discussed.
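For reference, the MBR selection of the alignment reference can be sketched as follows (a minimal formulation; the symbols are chosen here for illustration and are not taken from the paper). Given the N-best list \(\mathcal{E}\) for a source sentence \(F\), the backbone is

\[ \hat{E} = \arg\min_{E' \in \mathcal{E}} \sum_{E \in \mathcal{E}} P(E \mid F)\,\mathrm{TER}(E', E), \]

i.e. the hypothesis with the lowest expected translation edit rate under the posterior, against which the remaining hypotheses are aligned to form the consensus network.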
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2008
Tee Kiah Chia; Khe Chai Sim; Haizhou Li; Hwee Tou Ng
Recent efforts on the task of spoken document retrieval (SDR) have made use of speech lattices: speech lattices contain information about alternative transcription hypotheses beyond the 1-best transcripts, and this information can improve retrieval accuracy by overcoming recognition errors present in the 1-best transcription. In this paper, we look at using lattices for the query-by-example spoken document retrieval task: retrieving documents from a speech corpus, where the queries are themselves in the form of complete spoken documents (query exemplars). We extend a previously proposed method for SDR with short queries to the query-by-example task. Specifically, we use a retrieval method based on statistical modeling: we compute expected word counts from document and query lattices, estimate statistical models from these counts, and compute relevance scores as divergences between these models. Experimental results on a speech corpus of conversational English show that the use of statistics from lattices for both documents and query exemplars results in better retrieval accuracy than using only 1-best transcripts for either documents, queries, or both. In addition, we investigate the effect of stop word removal, which further improves retrieval accuracy. To our knowledge, our work is the first to have used a lattice-based approach to query-by-example spoken document retrieval.
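The retrieval recipe described in this abstract can be summarised in a short sketch, assuming expected word counts have already been accumulated from the document and query lattices; the function names and the linear-interpolation smoothing below are illustrative, not taken from the paper:

```python
import math

def unigram_model(expected_counts, background, lam=0.5):
    """Estimate a smoothed unigram model from expected (posterior-weighted)
    word counts, interpolated with a background collection model."""
    total = sum(expected_counts.values())
    vocab = set(expected_counts) | set(background)
    return {w: lam * expected_counts.get(w, 0.0) / total
               + (1.0 - lam) * background.get(w, 1e-9)
            for w in vocab}

def relevance_score(query_counts, doc_counts, background):
    """Score a document by the negative KL divergence between the query
    exemplar model and the document model."""
    q = unigram_model(query_counts, background)
    d = unigram_model(doc_counts, background)
    return -sum(p * math.log(p / d.get(w, 1e-12)) for w, p in q.items() if p > 0)
```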
ACM Transactions on Information Systems | 2010
Tee Kiah Chia; Khe Chai Sim; Haizhou Li; Hwee Tou Ng
Recent research efforts on spoken document retrieval have tried to overcome the low quality of 1-best automatic speech recognition transcripts, especially in the case of conversational speech, by using statistics derived from speech lattices containing multiple transcription hypotheses as output by a speech recognizer. We present a method for lattice-based spoken document retrieval based on a statistical n-gram modeling approach to information retrieval. In this statistical lattice-based retrieval (SLBR) method, a smoothed statistical model is estimated for each document from the expected counts of words given the information in a lattice, and the relevance of each document to a query is measured as a probability under such a model. We investigate the efficacy of our method under various parameter settings of the speech recognition and lattice processing engines, using the Fisher English Corpus of conversational telephone speech. Experimental results show that our method consistently achieves better retrieval performance than using only the 1-best transcripts in statistical retrieval, outperforms a recently proposed lattice-based vector space retrieval method, and also compares favorably with a lattice-based retrieval method based on the Okapi BM25 model.
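Concretely, the query-likelihood view of SLBR can be sketched as follows (Dirichlet-prior smoothing is used here purely for illustration; the paper's exact smoothing scheme may differ). Each document \(D\) is scored against a query \(Q\) by

\[ P(Q \mid D) = \prod_{w \in Q} \frac{E_D[c(w)] + \mu\,P(w \mid \mathcal{C})}{E_D[|D|] + \mu}, \]

where \(E_D[c(w)]\) is the expected count of word \(w\) derived from the document lattice, \(E_D[|D|]\) is the expected document length, and \(P(w \mid \mathcal{C})\) is a background collection model.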
International Conference on Acoustics, Speech, and Signal Processing | 2006
Rohit Sinha; Mark J. F. Gales; Do Yeong Kim; Xunying Liu; Khe Chai Sim; Philip C. Woodland
This paper discusses the development of the CU-HTK Mandarin broadcast news (BN) transcription system. The Mandarin BN task includes a significant amount of English data, so techniques have been investigated to allow the same system to handle both Mandarin and English by augmenting the Mandarin training sets with English acoustic and language model training data. A range of acoustic models was built, including models based on Gaussianised features, speaker adaptive training and feature-space MPE. A multi-branch system architecture is described in which multiple acoustic model types, alternate phone sets and segmentations can be used in a system combination framework to generate the final output. The final system shows state-of-the-art performance over a range of test sets.
International Conference on Acoustics, Speech, and Signal Processing | 2005
Khe Chai Sim; Mark J. F. Gales
Recently, structured precision matrix models were found to outperform conventional diagonal covariance matrix models. Minimum phone error (MPE) discriminative training of these models gave very good unadapted performance on large vocabulary continuous speech recognition systems. To obtain state-of-the-art performance, it is important to apply adaptation techniques efficiently to these models. In this paper, simple row-by-row iterative formulae are described for both MLLR mean and constrained MLLR transform estimation of these models. These update formulae are derived within the standard expectation maximisation framework and are guaranteed to increase the likelihood of the adaptation data. Efficient approximate schemes for these adaptation methods are also investigated to further reduce the computational cost. Experimental results are presented based on the MPE-trained subspace precision and mean models, evaluated on both broadcast news and conversational telephone speech English tasks.
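As background, the standard MLLR mean transform that the row-by-row formulae estimate takes the form

\[ \hat{\mu}_m = A\mu_m + b = W\xi_m, \qquad \xi_m = [1,\ \mu_m^{\mathsf{T}}]^{\mathsf{T}}, \]

where \(W = [\,b \;\; A\,]\) is shared across a set of Gaussian components \(m\) and is chosen to maximise the likelihood of the adaptation data; for structured precision matrix models the update is performed one row of \(W\) at a time, as described above.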
IEEE Transactions on Audio, Speech, and Language Processing | 2014
Bo Li; Khe Chai Sim
Improving the noise robustness of automatic speech recognition systems has been a challenging task for many years. Recently, it was found that Deep Neural Networks (DNNs) yield large performance gains over conventional GMM-HMM systems when used in both hybrid and tandem systems. However, they still fall short of human expectations, especially under adverse environments. Motivated by the separation-prior-to-recognition process of the human auditory system, we propose a robust spectral masking system in which power spectral domain masks are predicted using a DNN trained on the same filter-bank features used for acoustic modeling. To further improve performance, Linear Input Network (LIN) adaptation is applied to both the mask estimator and the acoustic model DNNs. Since the estimation of LINs for the mask estimator requires stereo data, which is not available during testing, we propose using the LINs estimated for the acoustic model DNNs to adapt the mask estimators. Furthermore, we use the same set of weights obtained from pre-training for the input layers of both the mask estimator and the acoustic model DNNs to ensure better consistency when sharing LINs. Experimental results on the benchmark Aurora2 and Aurora4 tasks demonstrate the effectiveness of our system, which yields Word Error Rates (WERs) of 4.6% and 11.8%, respectively. Furthermore, simple averaging of posteriors from systems with and without spectral masking further reduces the WERs to 4.3% on Aurora2 and 11.4% on Aurora4.
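A minimal sketch of the masking and LIN steps described above (NumPy is used here for illustration; array shapes and function names are assumptions, not the paper's implementation):

```python
import numpy as np

def apply_spectral_mask(noisy_power_spec, mask):
    """Element-wise masking of the noisy power spectrum; the mask (values in
    [0, 1]) is predicted by a DNN from the same filter-bank features that are
    fed to the acoustic model."""
    return np.clip(mask, 0.0, 1.0) * noisy_power_spec

def linear_input_network(features, W, b):
    """LIN adaptation: a speaker-dependent affine transform applied to the
    input features before the (fixed) DNN; here the same LIN estimated for
    the acoustic model DNN is reused to adapt the mask estimator."""
    return features @ W + b
```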
International Conference on Acoustics, Speech, and Signal Processing | 2013
Bo Li; Khe Chai Sim
Deep Neural Networks (DNNs) have been successfully applied to various speech tasks in recent years. In this paper, we investigate the use of DNNs for noise-robust speech recognition and demonstrate their superior ability to model acoustic variations compared with conventional Gaussian Mixture Models (GMMs). We then propose to compensate the normalization front-end of the DNNs using the GMM-based Vector Taylor Series (VTS) model compensation technique, which has been successfully applied in GMM-based ASR systems to handle noisy speech. To fully benefit from both the powerful modeling capability of the DNN and the effective noise compensation of VTS, an adaptive training algorithm is further developed. Preliminary experimental results on the AURORA 2 task demonstrate the effectiveness of our approach: the adaptively trained system outperforms GMM-based VTS adaptive training by a relative 18.8% using MFCC features and 21.9% using FBank features.
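For context, the standard VTS mismatch function underlying GMM-based model compensation relates the noisy cepstral observation \(y\) to the clean speech \(x\), additive noise \(n\) and convolutive channel \(h\) as

\[ y \approx x + h + C \log\!\big(1 + \exp\big(C^{-1}(n - x - h)\big)\big), \]

where \(C\) is the DCT matrix; compensation proceeds by linearising this expression around the current estimates of \(x\), \(n\) and \(h\).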
International Conference on Acoustics, Speech, and Signal Processing | 2016
Tian Tan; Yanmin Qian; Dong Yu; Souvik Kundu; Liang Lu; Khe Chai Sim; Xiong Xiao; Yu Zhang
Long Short-Term Memory (LSTM) is a particular type of recurrent neural network (RNN) that can model long-term temporal dynamics. Recently it has been shown that LSTM-RNNs can achieve higher recognition accuracy than deep feed-forward neural networks (DNNs) in acoustic modelling. However, speaker adaptation for LSTM-RNN based acoustic models has not been well investigated. In this paper, we study speaker-aware training of LSTM-RNNs, which incorporates speaker information during model training to normalise the speaker variability. We first present several speaker-aware training architectures, and then empirically evaluate three types of speaker representation: i-vectors, bottleneck speaker vectors and speaking rate. Furthermore, to factorise the variability in the acoustic signals caused by speakers and phonemes respectively, we investigate speaker-aware and phone-aware joint training under the framework of multi-task learning. On the AMI meeting speech transcription task, speaker-aware training of LSTM-RNNs reduces word error rates by 6.5% relative to a very strong LSTM-RNN baseline that uses FMLLR features.
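One common way of realising the speaker-aware training described above is to append a fixed per-speaker vector to every acoustic frame; a minimal sketch (NumPy, with illustrative names) is:

```python
import numpy as np

def speaker_aware_input(frames, speaker_vector):
    """Concatenate a per-speaker representation (e.g. an i-vector, a
    bottleneck speaker vector, or a speaking-rate feature) to every frame
    of the utterance before it is fed to the LSTM-RNN acoustic model."""
    tiled = np.tile(speaker_vector, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)
```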
International Conference on Acoustics, Speech, and Signal Processing | 2015
Hengguan Huang; Khe Chai Sim
The conventional short-term features used by Deep Neural Networks (DNNs) lack the ability to capture longer-term information. This poses a challenge for training a speaker-independent (SI) DNN, since the short-term features do not provide sufficient information for the DNN to estimate robust factors of speaker-level variation. The key to this problem is to obtain a sufficiently robust and informative speaker representation. This paper compares several speaker representations. Firstly, a DNN speaker classifier is used to extract bottleneck features as the speaker representation, called the Bottleneck Speaker Vector (BSV). To further improve the robustness of this representation, a first-order Bottleneck Speaker Super Vector (BSSV) is also proposed, where the BSV is expanded into a super vector space by incorporating the phoneme posterior probabilities. Finally, a more fine-grained speaker representation based on the FMLLR-shifted features is examined. Experimental results on the WSJ0 and WSJ1 datasets show that the proposed speaker representations are useful in normalising the speaker effects for robust DNN-based automatic speech recognition. The best performance is achieved by augmenting both the BSSV and the FMLLR-shifted representations, yielding relative performance gains of 10.0% to 15.3% over the SI DNN baseline.
International Conference on Acoustics, Speech, and Signal Processing | 2009
Haizhou Li; Bin Ma; Kong-Aik Lee; Hanwu Sun; Donglai Zhu; Khe Chai Sim; Changhuai You; Rong Tong; Ismo Kärkkäinen; Chien-Lin Huang; Vladimir Pervouchine; Wu Guo; Yijie Li; Li-Rong Dai; Mohaddeseh Nosratighods; Thiruvaran Tharmarajah; Julien Epps; Eliathamby Ambikairajah; Eng Siong Chng; Tanja Schultz; Qin Jin
This paper describes the performance of the I4U speaker recognition system in the NIST 2008 Speaker Recognition Evaluation. The system consists of seven subsystems, each with different cepstral features and classifiers. We describe the I4U Primary system and report on its core test results as they were submitted, which were among the best-performing submissions. The I4U effort was led by the Institute for Infocomm Research, Singapore (IIR), with contributions from the University of Science and Technology of China (USTC), the University of New South Wales, Australia (UNSW), Nanyang Technological University, Singapore (NTU) and Carnegie Mellon University, USA (CMU).