Publication


Featured research published by Raymond W. M. Ng.


International Conference on Acoustics, Speech, and Signal Processing | 2010

Prosodic attribute model for spoken language identification

Raymond W. M. Ng; Cheung-Chi Leung; Tan Lee; Bin Ma; Haizhou Li

Prosodic information is believed to carry language-specific cues useful for spoken language recognition. Modeling prosodic features is a challenging problem, on which a wide diversity of approaches has been investigated. In this paper, a novel prosodic attribute model (PAM) is proposed to capture prosodic features with compact models. It models the language-specific co-occurrence statistics of a comprehensive set of prosodic features. When the prosodic LID system with PAM is evaluated on the NIST Language Recognition Evaluations (LRE) of 2007 and 2009, it demonstrates relative EER reductions of 21% and 11% respectively, compared to a phonotactic LID system. The contributions of prosodic features to detecting some of the target languages, including tonal languages, are even more substantial. It is also noted that most prosodic attributes in the comprehensive set make positive contributions.
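
The core of PAM is modelling language-specific co-occurrence statistics of discretised prosodic attributes. A minimal sketch of that idea follows, assuming a toy attribute inventory and toy training sequences; the attribute names, smoothing and data are illustrative, not the paper's actual model.

```python
from collections import Counter
from itertools import product
import math

# Toy inventory of discretised prosodic attributes (illustrative only).
ATTRS = ["F0_rise", "F0_fall", "long", "short"]

def bigram_model(sequences, smoothing=1.0):
    """Estimate smoothed bigram co-occurrence probabilities per language."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    total = sum(counts.values()) + smoothing * len(ATTRS) ** 2
    return {bg: (counts[bg] + smoothing) / total
            for bg in product(ATTRS, repeat=2)}

def score(model, seq):
    """Log-likelihood of an utterance's attribute sequence under one language."""
    return sum(math.log(model[bg]) for bg in zip(seq, seq[1:]))

# Usage: classify by the best-fitting language-specific co-occurrence model.
models = {
    "mandarin": bigram_model([["F0_rise", "F0_fall", "F0_rise", "short"]]),
    "english":  bigram_model([["long", "short", "long", "F0_fall"]]),
}
test = ["F0_rise", "F0_fall", "short"]
print(max(models, key=lambda lang: score(models[lang], test)))
```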


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

The 2015 Sheffield system for transcription of Multi-Genre Broadcast media

Oscar Saz; Mortaza Doulaty; Salil Deena; Rosanna Milner; Raymond W. M. Ng; Madina Hasan; Yulan Liu; Thomas Hain

We describe the University of Sheffield system for participation in the 2015 Multi-Genre Broadcast (MGB) challenge task of transcribing multi-genre broadcast shows. Transcription was one of four tasks proposed in the MGB challenge, with the aim of advancing the state of the art of automatic speech recognition, speaker diarisation and automatic alignment of subtitles for broadcast media. Four topics are investigated in this work: Data selection techniques for training with unreliable data, automatic speech segmentation of broadcast media shows, acoustic modelling and adaptation in highly variable environments, and language modelling of multi-genre shows. The final system operates in multiple passes, using an initial unadapted decoding stage to refine segmentation, followed by three adapted passes: a hybrid DNN pass with input features normalised by speaker-based cepstral normalisation, another hybrid stage with input features normalised by speaker feature-MLLR transformations, and finally a bottleneck-based tandem stage with noise and speaker factorisation. The combination of these three system outputs provides a final error rate of 27.5% on the official development set, consisting of 47 multi-genre shows.
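
One concrete ingredient named above is the speaker-based cepstral normalisation used in the first adapted pass. The sketch below shows per-speaker cepstral mean and variance normalisation under assumed data shapes; it illustrates the general technique, not the Sheffield system's code.

```python
import numpy as np

def speaker_cmvn(features_by_speaker):
    """Normalise each speaker's frames to zero mean and unit variance.

    features_by_speaker: dict of speaker id -> (num_frames, dim) array.
    """
    normalised = {}
    for spk, feats in features_by_speaker.items():
        mean = feats.mean(axis=0)
        std = feats.std(axis=0) + 1e-8  # guard against zero variance
        normalised[spk] = (feats - mean) / std
    return normalised

# Toy usage with random "MFCC-like" frames for two speakers.
rng = np.random.default_rng(0)
feats = {"spk1": rng.normal(5.0, 2.0, (100, 13)),
         "spk2": rng.normal(-1.0, 0.5, (80, 13))}
norm = speaker_cmvn(feats)
print(norm["spk1"].mean(axis=0).round(6))  # approximately zero per dimension
```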


International Conference on Asian Language Processing | 2009

Analysis and Selection of Prosodic Features for Language Identification

Raymond W. M. Ng; Tan Lee; Cheung-Chi Leung; Bin Ma; Haizhou Li

Prosodic features are relatively simple in structure and are believed to be effective in some speech recognition tasks. However, these features are subject to undesirable bias factors, such as speaking style. To cope with this, researchers have proposed various normalization and measurement methods for the features, which makes the feature inventory very large. In this paper, we use a mutual information criterion to analyze and select a number of prosody-related features in a language identification (LID) task. Among the twelve optimal features, eight are elaborated in this paper. The feature analysis metric, the z-score, is shown to have a moderate to high correlation with LID accuracy. The feature selection proposed in this paper yields the best performance among all prosodic LID systems known to us. A further experiment in system fusion shows that the prosodic LID system brings a 13% relative improvement to the conventional phonotactic approach to LID.
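
A minimal sketch of the mutual information criterion for feature ranking, assuming features are discretised into quantile bins before their information about the language label is measured; the binning scheme and toy data are assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def rank_features(X, labels, n_bins=8):
    """Rank columns of X by mutual information with the language labels."""
    scores = []
    for j in range(X.shape[1]):
        col = X[:, j]
        edges = np.quantile(col, np.linspace(0, 1, n_bins)[1:-1])
        scores.append((mutual_info_score(labels, np.digitize(col, edges)), j))
    return sorted(scores, reverse=True)  # highest-MI feature first

rng = np.random.default_rng(1)
labels = rng.integers(0, 3, 500)    # three toy languages
X = rng.normal(size=(500, 5))
X[:, 2] += labels                   # make feature 2 carry language information
print(rank_features(X, labels)[0])  # feature 2 should rank first
```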


International Conference on Acoustics, Speech, and Signal Processing | 2015

Quality estimation for ASR k-best list rescoring in spoken language translation

Raymond W. M. Ng; Kashif Shah; Wilker Aziz; Lucia Specia; Thomas Hain

Spoken language translation (SLT) combines automatic speech recognition (ASR) and machine translation (MT). During the decoding stage, the best hypothesis produced by the ASR system may not be the best input candidate for the MT system, but making use of multiple sub-optimal ASR results in SLT has been shown to be computationally too complex. This paper presents a method to rescore the k-best ASR output so as to improve translation quality. A translation quality estimation model is trained on a large number of features which aim to capture complementary information from both ASR and MT on translation difficulty and adequacy, as well as syntactic properties of the SLT inputs and outputs. Based on the predicted quality score, the ASR hypotheses are rescored before they are fed to the MT system. ASR confidence is found to be crucial in guiding the rescoring step. On an English-to-French speech-to-text translation task, the coupling of ASR and MT systems led to an increase of 0.5 BLEU points in translation quality.
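
The rescoring step can be pictured as follows: a regression model predicts a quality score per ASR hypothesis, and the k-best list is reordered before translation. The regressor choice and the three stand-in features are illustrative assumptions; the paper's QE model uses a much richer feature set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in features per hypothesis, e.g. [asr_confidence, length, lm_score].
rng = np.random.default_rng(2)
X_train = rng.normal(size=(200, 3))
y_train = 0.8 * X_train[:, 0] + rng.normal(0, 0.1, 200)  # quality ~ confidence

qe_model = GradientBoostingRegressor().fit(X_train, y_train)

def rescore_kbest(hypotheses, features):
    """Reorder ASR k-best hypotheses by predicted translation quality."""
    order = np.argsort(-qe_model.predict(features))  # best quality first
    return [hypotheses[i] for i in order]

kbest = ["hyp A", "hyp B", "hyp C"]
print(rescore_kbest(kbest, rng.normal(size=(3, 3))))
```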


International Conference on Acoustics, Speech, and Signal Processing | 2013

Adaptation of lecture speech recognition system with machine translation output

Raymond W. M. Ng; Thomas Hain; Trevor Cohn

In spoken language translation, integration of the ASR and MT components is critical for good performance. In this paper, we consider the recognition setting where a text translation of each utterance is also available. We present experiments with different ASR system adaptation techniques to exploit MT system outputs. In particular, N-best MT outputs are represented as an utterance-specific language model, which is then used to rescore ASR lattices. We show that this method improves significantly over ASR alone, resulting in an absolute WER reduction of more than 6% for both in-domain and out-of-domain acoustic models.
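
A minimal sketch of the adaptation idea, with a smoothed unigram model standing in for the utterance-specific language model and simple N-best rescoring standing in for lattice rescoring; the interpolation weight and data are illustrative assumptions.

```python
from collections import Counter
import math

def utterance_lm(mt_nbest, smoothing=0.5):
    """Smoothed unigram LM over words in the MT N-best outputs."""
    counts = Counter(w for sent in mt_nbest for w in sent.split())
    total, vocab = sum(counts.values()), len(counts) + 1
    return lambda w: math.log((counts[w] + smoothing) / (total + smoothing * vocab))

def rescore(asr_hyps, lm, weight=0.5):
    """Interpolate each hypothesis's ASR score with the utterance LM score."""
    rescored = [(score + weight * sum(lm(w) for w in hyp.split()), hyp)
                for hyp, score in asr_hyps]
    return max(rescored)[1]

# The MT outputs pull the decision towards the translated reference wording.
lm = utterance_lm(["the cat sat", "a cat sat down"])
print(rescore([("the cat sad", -4.0), ("the cat sat", -4.2)], lm))
```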


IEEE Transactions on Audio, Speech, and Language Processing | 2013

Spoken Language Recognition With Prosodic Features

Raymond W. M. Ng; Tan Lee; Cheung-Chi Leung; Bin Ma; Haizhou Li

Speech prosody is believed to carry much language-specific information that can be used for spoken language recognition (SLR). In the past, the use of prosodic features for SLR was studied only sporadically, and the reported performances were considered unsatisfactory. In this paper, we exploit a wide range of prosodic attributes for large-scale SLR tasks. These attributes describe the multifaceted variations of F0, intensity and duration in different spoken languages. Prosodic attributes are modeled by the bag-of-n-grams approach with support vector machines (SVM), as in conventional phonotactic SLR systems. Experimental results on OGI and NIST-LRE tasks showed that the proposed attributes give significantly better SLR performance than previously reported. The full feature set includes 87 prosodic attributes, and redundancy may exist among them. Attributes are broken down into particular bigrams called bins. Four entropy-based feature selection metrics with different selection criteria are derived. Attributes can be selected by individual bins, or as batches of bins per attribute, and selection can be done in a language-dependent or language-independent manner. By comparing different selection sizes and criteria, an optimal attribute subset comprising 5,000 bins is found using a bin-level language-independent criterion. Feature selection reduces the model size by a factor of 2.5 and shortens the runtime by a factor of 6. The optimal subset of bins gives the lowest EER of 20.18% on the NIST-LRE 2007 SLR task in a prosodic attribute model (PAM) system which exclusively models prosodic attributes. In a phonotactic-prosodic fusion SLR system, the detection cost, Cavg, is 2.09%, a relative detection cost reduction of 23%.
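
The bag-of-n-grams plus SVM modelling can be sketched in a few lines: each utterance's attribute sequence is vectorised into bigram bins and fed to a linear SVM. The toy attribute tokens and data are assumptions; the 87-attribute inventory and the entropy-based bin selection of the paper are not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Each "document" is one utterance's prosodic attribute sequence.
utterances = ["rise fall rise short", "long short long fall",
              "rise rise fall short", "long long short fall"]
labels = ["mandarin", "english", "mandarin", "english"]

# ngram_range=(2, 2): every adjacent attribute pair becomes one bin.
vectoriser = CountVectorizer(ngram_range=(2, 2))
X = vectoriser.fit_transform(utterances)
clf = LinearSVC().fit(X, labels)

print(clf.predict(vectoriser.transform(["rise fall short short"])))
```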


International Conference on Acoustics, Speech, and Signal Processing | 2016

Groupwise learning for ASR k-best list reranking in spoken language translation

Raymond W. M. Ng; Kashif Shah; Lucia Specia; Thomas Hain

Quality estimation models are used to predict the quality of the output from a spoken language translation (SLT) system. When these scores are used to rerank a k-best list, the rank order of the scores matters more than their absolute values. This paper proposes groupwise learning to model this rank. Groupwise features were constructed by grouping pairs, triplets or M-plets among the ASR k-best outputs of the same sentence. Regression and classification models were learnt, and a score combination strategy was used to predict the rank within the k-best list. Regression models with pairwise features give a bigger gain than other model and feature constructions. Groupwise learning is robust to sentences with different ASR confidence, and the technique is complementary to linear discriminant analysis feature projection. An overall BLEU score improvement of 0.80 was achieved on an in-domain English-to-French SLT task.
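
A sketch of the pairwise case of groupwise learning: training examples are feature differences between two hypotheses of the same sentence, labelled by which one has higher quality, and the final rank score of a hypothesis is its number of pairwise wins. The classifier choice and toy data are assumptions.

```python
import numpy as np
from itertools import permutations
from sklearn.linear_model import LogisticRegression

def pairwise_data(feats, quality):
    """Difference features and better/worse labels for one k-best list."""
    X, y = [], []
    for i, j in permutations(range(len(feats)), 2):
        X.append(feats[i] - feats[j])
        y.append(int(quality[i] > quality[j]))
    return np.array(X), np.array(y)

rng = np.random.default_rng(3)
feats = rng.normal(size=(5, 4))                # 5 hypotheses, 4 features each
quality = feats[:, 0] + rng.normal(0, 0.1, 5)  # toy quality scores
clf = LogisticRegression().fit(*pairwise_data(feats, quality))

# Score combination: rank hypotheses by the number of pairwise wins.
wins = [sum(clf.predict((feats[i] - feats[j]).reshape(1, -1))[0]
            for j in range(5) if j != i) for i in range(5)]
print(int(np.argmax(wins)))  # index of the top-ranked hypothesis
```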


International Symposium on Chinese Spoken Language Processing | 2008

Entropy-Based Analysis of the Prosodic Features of Chinese Dialects

Raymond W. M. Ng; Tan Lee

In this paper, a novel approach is proposed to analyze the prosodic features of four Chinese dialects: Wu, Cantonese, Min and Mandarin. The ultimate goal is to exploit these features in the task of automatic spoken language identification. Two entropy-based evaluation metrics are formulated to address the problems of data sparseness and the lack of speakers. Different prosody-related acoustic features and their combinations are evaluated. F0, F0 gradient and intensity are found to contain the most language-related information. The maximum language-related information is observed in multi-dimensional N-gram features combining F0, F0 gradient and syllable position in the sentence. There are also some inconclusive results that reveal the limitations of the proposed metrics.
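
One entropy-based metric in the spirit of the paper can be written as the information a discretised prosodic feature carries about the dialect label, H(dialect) - H(dialect | feature), which equals their mutual information. The toy co-occurrence counts below are assumptions for illustration.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def info_gain(joint_counts):
    """joint_counts[i, j] = count of (dialect i, feature bin j)."""
    joint = joint_counts / joint_counts.sum()
    h_dialect = entropy(joint.sum(axis=1))
    p_feat = joint.sum(axis=0)
    h_cond = sum(p_feat[j] * entropy(joint[:, j] / p_feat[j])
                 for j in range(joint.shape[1]) if p_feat[j] > 0)
    return h_dialect - h_cond

# Four dialects x three F0-gradient bins (toy counts).
counts = np.array([[30, 5, 5], [5, 30, 5], [5, 5, 30], [10, 10, 20]], float)
print(round(info_gain(counts), 3))  # bits of dialect information in the feature
```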


Odyssey 2016 | 2016

The Sheffield language recognition system in NIST LRE 2015

Raymond W. M. Ng; Mauro Nicolao; Oscar Saz; Madina Hasan; Bhusan Chettri; Mortaza Doulaty; Tan Lee; Thomas Hain

The Speech and Hearing Research Group of the University of Sheffield submitted a fusion language recognition system to NIST LRE 2015. It combines three language classifiers. Two are acoustic-based, using i-vectors and a tandem DNN language recogniser respectively. The third is a phonotactic language recogniser. Two sets of training data, with durations of approximately 170 and 300 hours, were composed for LR training. Using the larger set of training data, the primary Sheffield LR system gives a min DCF of 32.44 on the official LRE 2015 eval data. A post-evaluation system enhancement was carried out in which i-vectors were extracted from the bottleneck features of an English DNN. The min DCF was reduced to 29.20.
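
The score-level fusion of the three classifiers can be sketched as a learned linear combination of per-language subsystem scores; logistic regression stands in for whatever calibration and fusion backend the actual submission used, and the scores are simulated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n_trials = 300
truth = rng.integers(0, 2, n_trials)  # is the trial the target language?
# Simulated scores from three subsystems: i-vector, tandem DNN, phonotactic.
scores = np.column_stack([truth + rng.normal(0, s, n_trials)
                          for s in (0.8, 0.9, 1.1)])

fusion = LogisticRegression().fit(scores, truth)
fused = fusion.decision_function(scores)  # one fused score per trial
print(fusion.coef_.round(2))              # learned weight per subsystem
```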


Conference of the International Speech Communication Association | 2016

Combining weak tokenisers for phonotactic language recognition in a resource-constrained setting

Raymond W. M. Ng; Bhusan Chettri; Thomas Hain

In the phonotactic approach to language recognition, a phone tokeniser is normally used to transform the audio signal into acoustic tokens. The language identity of the speech is modelled by the occurrence statistics of the decoded tokens. The performance of this approach depends heavily on the quality of the tokeniser, and a high-quality tokeniser matched to the test conditions is not always available for a language recognition task. This study investigated the performance of a phonotactic language recogniser in a resource-constrained setting, following the NIST LRE 2015 specification. An ensemble of phone tokenisers was constructed by applying unsupervised sequence training on different target languages, followed by score-based fusion. This method gave a 5-7% relative performance improvement over the baseline system on the LRE 2015 eval set. This gain was retained when the ensemble phonotactic system was further fused with an acoustic i-vector system.
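
A minimal sketch of the ensemble idea, assuming mocked tokenisers: each weak tokeniser decodes the same utterance into its own token stream, each stream is scored against per-language token bigram statistics, and the per-language scores are fused across tokenisers. All token streams and probabilities here are toy values.

```python
import numpy as np

def phonotactic_score(tokens, lang_bigrams):
    """Mean log-probability of the decoded token bigrams under one language."""
    bigrams = list(zip(tokens, tokens[1:]))
    return np.mean([np.log(lang_bigrams.get(bg, 1e-4)) for bg in bigrams])

# Two mock tokenisers trained on different target languages decode the
# same utterance into different token streams.
decoded = {"tok_A": ["p", "a", "t", "a"], "tok_B": ["b", "a", "d", "a"]}
models = {  # per-tokeniser, per-language bigram probabilities (toy values)
    "tok_A": {"lang1": {("p", "a"): 0.3, ("a", "t"): 0.2, ("t", "a"): 0.2},
              "lang2": {("p", "a"): 0.05, ("a", "t"): 0.05, ("t", "a"): 0.05}},
    "tok_B": {"lang1": {("b", "a"): 0.25, ("a", "d"): 0.2, ("d", "a"): 0.2},
              "lang2": {("b", "a"): 0.05, ("a", "d"): 0.05, ("d", "a"): 0.05}},
}

fused = {lang: np.mean([phonotactic_score(decoded[t], models[t][lang])
                        for t in decoded]) for lang in ["lang1", "lang2"]}
print(max(fused, key=fused.get))  # fused language decision
```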

Collaboration


Dive into Raymond W. M. Ng's collaborations.

Top Co-Authors

Thomas Hain, University of Sheffield
Tan Lee, The Chinese University of Hong Kong
Lucia Specia, University of Sheffield
Haizhou Li, National University of Singapore
Kashif Shah, University of Sheffield
Madina Hasan, University of Sheffield
Salil Deena, University of Sheffield