Xunying Liu
University of Cambridge
Publications
Featured research published by Xunying Liu.
International Conference on Acoustics, Speech, and Signal Processing | 2014
Xunying Liu; Yongqiang Wang; Xie Chen; Mark J. F. Gales; Philip C. Woodland
Recurrent neural network language models (RNNLM) have become an increasingly popular choice for state-of-the-art speech recognition systems due to their inherently strong generalization performance. As these models use a vector representation of complete history contexts, RNNLMs are normally used to rescore N-best lists. Motivated by their intrinsic characteristics, two novel lattice rescoring methods for RNNLMs are investigated in this paper. The first uses an n-gram style clustering of history contexts. The second approach directly exploits the distance measure between hidden history vectors. Both methods produced 1-best performance comparable with a 10k-best rescoring baseline RNNLM system on a large vocabulary conversational telephone speech recognition task. Significant lattice size compression of over 70% and consistent improvements after confusion network (CN) decoding were also obtained over the N-best rescoring approach.
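As an illustration of the first rescoring method, the sketch below (Python, with hypothetical data structures; not the authors' implementation) merges RNNLM states reaching a lattice node whenever their most recent n-1 words agree, so the number of distinct states grows like an n-gram model rather than with the number of full paths.

```python
# Sketch of n-gram style history clustering for RNNLM lattice rescoring.
# Hypothetical data structures and helper names, for illustration only.
from collections import namedtuple

# Full word history of a partial path plus the RNNLM hidden vector it produced.
State = namedtuple("State", ["history", "hidden"])

def cluster_key(history, n=4):
    """Two histories fall in the same cluster if their last n-1 words agree."""
    return tuple(history[-(n - 1):])

def merge_states(incoming_states, n=4):
    """Keep one representative RNNLM state per truncated history at a lattice node.

    incoming_states: list of State objects for partial paths reaching the node.
    Returns {cluster key: representative State}; a real rescorer would keep the
    best-scoring path per cluster rather than the first one seen.
    """
    clusters = {}
    for state in incoming_states:
        clusters.setdefault(cluster_key(state.history, n), state)
    return clusters
```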
International Conference on Acoustics, Speech, and Signal Processing | 2016
Xie Chen; Xunying Liu; Yanmin Qian; Mark J. F. Gales; Philip C. Woodland
In recent years, recurrent neural network language models (RNNLMs) have become increasingly popular for a range of applications including speech recognition. However, the training of RNNLMs is computationally expensive, which limits the quantity of data, and the size of network, that can be used. In order to fully exploit the power of RNNLMs, efficient training implementations are required. This paper introduces an open-source toolkit, the CUED-RNNLM toolkit, which supports efficient GPU-based training of RNNLMs. RNNLM training with a large number of word-level output targets is supported, in contrast to existing tools which use class-based output targets. Support for N-best and lattice-based rescoring of both HTK and Kaldi format lattices is included. An example of building and evaluating RNNLMs with this toolkit is presented for a Kaldi-based speech recognition system using the AMI corpus. All necessary resources, including the source code, documentation and recipe, are available online.
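For context, here is a minimal PyTorch sketch of an RNNLM with a full word-level output layer rather than a class-based factorisation. It only illustrates the model family the toolkit trains; it is not the CUED-RNNLM code itself.

```python
# Minimal RNNLM with a full word-level softmax output layer (illustration only;
# not the CUED-RNNLM implementation).
import torch
import torch.nn as nn

class WordRNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        # One output unit per vocabulary word: no class-based factorisation.
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, words, hidden=None):
        emb = self.embed(words)                  # (batch, time, embed_dim)
        rnn_out, hidden = self.rnn(emb, hidden)  # (batch, time, hidden_dim)
        return self.out(rnn_out), hidden         # unnormalised word-level logits

# Usage: cross-entropy over the full vocabulary at every time step.
model = WordRNNLM(vocab_size=20000)
logits, _ = model(torch.randint(0, 20000, (8, 35)))  # batch of 8 sequences, 35 words each
```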
International Conference on Acoustics, Speech, and Signal Processing | 2015
Xie Chen; Xunying Liu; Mark J. F. Gales; Philip C. Woodland
In recent years recurrent neural network language models (RNNLMs) have been successfully applied to a range of tasks including speech recognition. However, an important issue that limits the quantity of data used, and their possible application areas, is the computational cost in training. A significant part of this cost is associated with the softmax function at the output layer, as this requires a normalization term to be explicitly calculated. This impacts both the training and testing speed, especially when a large output vocabulary is used. To address this problem, noise contrastive estimation (NCE) is explored in RNNLM training. NCE does not require the above normalization during both training and testing. It is insensitive to the output layer size. On a large vocabulary conversational telephone speech recognition task, a doubling in training speed on a GPU and a 56 times speed-up in test time evaluation on a CPU were obtained.
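A rough sketch of the NCE loss for one predicted word, with the softmax normaliser treated as a constant (here 1) so that only the target word and the k sampled noise words need to be scored; the argument names and the unigram noise distribution are illustrative assumptions, not taken from the paper.

```python
# Sketch of noise contrastive estimation (NCE) for RNNLM training: the model is
# trained to separate true next words from k noise samples, with the softmax
# normalisation term fixed, so no sum over the output vocabulary is needed.
import math
import torch
import torch.nn.functional as F

def nce_loss(score_target, scores_noise, logq_target, logq_noise, k):
    """score_target: unnormalised model score s(w, h) of the true next word (tensor).
    scores_noise:   scores of the k sampled noise words (tensor of shape (k,)).
    logq_*:         log-probabilities of those words under the noise distribution q,
                    typically a unigram model.
    """
    log_ratio_target = score_target - (math.log(k) + logq_target)
    log_ratio_noise = scores_noise - (math.log(k) + logq_noise)
    # The true word should be classified as data, the sampled words as noise.
    return -(F.logsigmoid(log_ratio_target) + F.logsigmoid(-log_ratio_noise).sum())
```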
International Conference on Acoustics, Speech, and Signal Processing | 2006
Rohit Sinha; Mark J. F. Gales; Do Yeong Kim; Xunying Liu; Khe Chai Sim; Philip C. Woodland
This paper discusses the development of the CU-HTK Mandarin broadcast news (BN) transcription system. The Mandarin BN task includes a significant amount of English data. Hence, techniques have been investigated to allow the same system to handle both Mandarin and English by augmenting the Mandarin training sets with English acoustic and language model training data. A range of acoustic models were built, including models based on Gaussianised features, speaker adaptive training and feature-space MPE. A multi-branch system architecture is described in which multiple acoustic model types, alternate phone sets and segmentations can be used in a system combination framework to generate the final output. The final system shows state-of-the-art performance over a range of test sets.
International Conference on Acoustics, Speech, and Signal Processing | 2003
Xunying Liu; Mark J. F. Gales; Philip C. Woodland
Designing a state-of-the-art large vocabulary speech recognition system is a highly complex problem. A wide range of techniques are available that affect the performance and the number of free parameters. Selecting the appropriate system complexity is time-consuming, and only a limited number of possible systems can be examined. This paper presents initial results on automatic system selection when both the number of dimensions and the number of components vary. Various complexity control schemes are discussed and evaluated. Limitations of schemes based on predicting held-out data log-likelihoods are described. In addition, problems of standard approximations for this task are detailed.
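To make the selection problem concrete, a sketch of an automatic complexity-control loop over candidate (dimensions, components) configurations is given below; BIC is used purely as a stand-in criterion, whereas the paper itself studies schemes based on predicted held-out log-likelihoods and their limitations.

```python
# Illustrative complexity-control loop: score candidate systems that vary the
# number of feature dimensions and Gaussian components, then keep the
# configuration preferred by the criterion.  BIC here is only a stand-in.
import math

def bic(log_likelihood, num_params, num_frames):
    """Bayesian information criterion (larger is better in this form)."""
    return log_likelihood - 0.5 * num_params * math.log(num_frames)

def select_complexity(candidates, num_frames):
    """candidates: iterable of (dims, components, log_likelihood, num_params) tuples."""
    best = max(candidates, key=lambda c: bic(c[2], c[3], num_frames))
    return best[0], best[1]   # (dims, components) of the preferred system
```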
Spoken Language Technology Workshop | 2012
Peter Bell; Mark J. F. Gales; Pierre Lanchantin; Xunying Liu; Yanhua Long; Steve Renals; Pawel Swietojanski; Philip C. Woodland
We describe our work on developing a speech recognition system for multi-genre media archives. The high diversity of the data makes this a challenging recognition task, which may benefit from systems trained on a combination of in-domain and out-of-domain data. Working with tandem HMMs, we present Multi-level Adaptive Networks (MLAN), a novel technique for incorporating information from out-of-domain posterior features using deep neural networks. We show that it provides a substantial reduction in WER over other systems, with relative WER reductions of 15% over a PLP baseline, 9% over in-domain tandem features and 8% over the best out-of-domain tandem features.
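A schematic of the MLAN-style tandem feature pipeline, under the assumption that posteriors from a network trained on out-of-domain data are appended to the acoustic features before an in-domain network is trained on the concatenation; function names and feature choices are illustrative.

```python
# Sketch of a two-level (MLAN-style) feature pipeline; illustrative only.
import numpy as np

def mlan_features(plp, ood_dnn, in_domain_dnn):
    """plp: (frames, dim) acoustic features for one utterance.
    ood_dnn / in_domain_dnn: callables mapping a feature matrix to per-frame
    posterior (or bottleneck) outputs.
    """
    first_level = ood_dnn(plp)                 # out-of-domain posterior features
    augmented = np.hstack([plp, first_level])  # input to the second-level network
    second_level = in_domain_dnn(augmented)    # in-domain posteriors
    return np.hstack([plp, second_level])      # tandem features for HMM training
```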
International Conference on Acoustics, Speech, and Signal Processing | 2010
Xunying Liu; Mark J. F. Gales; Jim L. Hieronymus; Philip C. Woodland
In speech recognition systems, language models (LMs) are often constructed by training and combining multiple n-gram models. They can either be used to represent different genres or tasks found in diverse text sources, or to capture stochastic properties of different linguistic symbol sequences, for example syllables and words. Unsupervised LM adaptation may also be used to further improve robustness to varying styles or tasks. When using these techniques, extensive software changes are often required. In this paper an alternative and more general approach based on weighted finite state transducers (WFSTs) is investigated for LM combination and adaptation. As it is entirely based on well-defined WFST operations, minimal change to decoding tools is needed. A wide range of LM combination configurations can be flexibly supported. An efficient on-the-fly WFST decoding algorithm is also proposed. Significant error rate gains of 7.3% relative were obtained on a state-of-the-art broadcast audio recognition task using a history-dependently adapted multi-level LM modelling both syllable and word sequences.
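The combination itself comes down to weighting component LM scores; a plain-Python sketch of linear and log-linear combination is shown below. In the paper this is realised through WFST operations rather than explicit per-word loops, so the helper objects and weights here are purely illustrative.

```python
# Sketch of combining component LM scores, as would happen implicitly during
# on-the-fly WFST composition.  The lm objects and weights are illustrative.
import math

def linear_logprob(word, history, component_lms, weights):
    """Linear interpolation: P(w | h) = sum_i lambda_i * P_i(w | h)."""
    p = sum(lam * lm.prob(word, history) for lam, lm in zip(weights, component_lms))
    return math.log(p)

def loglinear_score(word, history, component_lms, weights):
    """Log-linear combination: score(w | h) = sum_i lambda_i * log P_i(w | h),
    left unnormalised as in rescoring use."""
    return sum(lam * math.log(lm.prob(word, history))
               for lam, lm in zip(weights, component_lms))
```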
IEEE Transactions on Speech and Audio Processing | 2005
Thomas Hain; Philip C. Woodland; Gunnar Evermann; Mark J. F. Gales; Xunying Liu; Gareth L. Moore; Daniel Povey; Lan Wang
This paper discusses the Cambridge University HTK (CU-HTK) system for the automatic transcription of conversational telephone speech. A detailed discussion of the most important techniques in front-end processing, acoustic modeling and model training, language and pronunciation modeling are presented. These include the use of conversation side based cepstral normalization, vocal tract length normalization, heteroscedastic linear discriminant analysis for feature projection, minimum phone error training and speaker adaptive training, lattice-based model adaptation, confusion network based decoding and confidence score estimation, pronunciation selection, language model interpolation, and class based language models. The transcription system developed for participation in the 2002 NIST Rich Transcription evaluations of English conversational telephone speech data is presented in detail. In this evaluation the CU-HTK system gave an overall word error rate of 23.9%, which was the best performance by a statistically significant margin. Further details on the derivation of faster systems with moderate performance degradation are discussed in the context of the 2002 CU-HTK 10×RT conversational speech transcription system.
International Conference on Acoustics, Speech, and Signal Processing | 2005
Mark J. F. Gales; B. Jia; Xunying Liu; Khe Chai Sim; Philip C. Woodland; K. Yu
The paper details all aspects of the CU-HTK 2004 Mandarin conversational telephone speech transcription system, but concentrates on the development of the acoustic models. As there are significant differences between the available training corpora, both in terms of topics of conversation and accents, forms of data normalisation and adaptive training techniques are investigated. The baseline discriminatively trained acoustic models are compared to a system built with a Gaussianisation front-end, a speaker adaptively trained system and an adaptively trained structured precision matrix system. The models are finally evaluated within a multi-pass, multi-branch system combination framework.
Journal of the Acoustical Society of America | 2013
Xunying Liu; James L. Hieronymus; Mark J. F. Gales; Philip C. Woodland
Mandarin Chinese is based on characters which are syllabic in nature and morphological in meaning. All spoken languages have syllabotactic rules which govern the construction of syllables and their allowed sequences. These constraints are not as restrictive as those learned from word sequences, but they can provide additional useful linguistic information. Hence, it is possible to improve speech recognition performance by appropriately combining these two types of constraints. For the Chinese language considered in this paper, character level language models (LMs) can be used as a first level approximation to allowed syllable sequences. To test this idea, word and character level n-gram LMs were trained on 2.8 billion words (equivalent to 4.3 billion characters) of texts from a wide collection of text sources. Both hypothesis and model based combination techniques were investigated to combine word and character level LMs. Significant character error rate reductions of up to 7.3% relative were obtained on a state-of-the-art Mandarin Chinese broadcast audio recognition task using an adapted history dependent multi-level LM that performs a log-linear combination of character and word level LMs. This supports the hypothesis that character or syllable sequence models are useful for improving Mandarin speech recognition performance.
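As a toy illustration of the multi-level idea, the sketch below scores a hypothesis with a word-level LM and, after expanding each word into its characters, with a character-level LM, then combines the two log-linearly; the interfaces and the single weight are assumptions made for illustration.

```python
# Toy multi-level LM score: log-linear combination of word-level and
# character-level LM scores for one hypothesis.  Interfaces are hypothetical.
def multi_level_score(word_seq, word_lm, char_lm, word_to_chars, lam=0.5):
    """Return lam * log P_word(W) + (1 - lam) * log P_char(C(W))."""
    char_seq = [c for w in word_seq for c in word_to_chars(w)]
    return lam * word_lm.logprob(word_seq) + (1.0 - lam) * char_lm.logprob(char_seq)
```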