Van Hai Do
Nanyang Technological University
Publications
Featured research published by Van Hai Do.
International Conference on Acoustics, Speech, and Signal Processing | 2016
Nancy F. Chen; Van Tung Pham; Haihua Xu; Xiong Xiao; Van Hai Do; Chongjia Ni; I-Fan Chen; Sunil Sivadas; Chin-Hui Lee; Eng Siong Chng; Bin Ma; Haizhou Li
We present exemplar-inspired low-resource spoken keyword search strategies for acoustic modeling, keyword verification, and system combination. This state-of-the-art system was developed by the SINGA team in the context of the 2015 NIST Open Keyword Search Evaluation (OpenKWS15) using conversational Swahili provided by the IARPA Babel program. In this work, we elaborate on the following: (1) exploiting exemplar training samples to construct a non-parametric acoustic model using kernel density estimation at test time; (2) rescoring hypothesized keyword detections through quantifying their acoustic similarity with exemplar training samples; (3) extending our previously proposed system combination approach to incorporate prosody features of exemplar keyword samples.
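Point (1) describes a non-parametric acoustic model built from exemplar frames. As an illustration only (the paper's exact kernel, bandwidth, and senone inventory are not given here), a Gaussian-kernel density estimate over the exemplar frames of each senone could stand in for GMM state likelihoods at test time; the function names, the per-senone dictionary, and the bandwidth are hypothetical:

```python
import numpy as np

def kde_log_likelihood(frame, exemplars, bandwidth=1.0):
    """Score one test frame against the exemplar frames of one senone
    with a Gaussian-kernel density estimate.

    frame     : (d,) feature vector of the test frame
    exemplars : (n, d) training frames labeled with this senone
    bandwidth : kernel width h (a tunable smoothing parameter)
    """
    n, d = exemplars.shape
    diff = exemplars - frame                     # (n, d)
    sq_dist = np.sum(diff * diff, axis=1)        # squared Euclidean distances
    log_kernels = -0.5 * sq_dist / bandwidth**2  # log unnormalized Gaussian kernels
    log_norm = -0.5 * d * np.log(2 * np.pi * bandwidth**2)
    # log-sum-exp over exemplars, normalized by n and the Gaussian constant
    return log_norm + np.logaddexp.reduce(log_kernels) - np.log(n)

def kde_acoustic_scores(frame, exemplars_by_senone, bandwidth=1.0):
    """One log-likelihood per senone for a frame, usable in place of
    GMM state likelihoods during decoding."""
    return {s: kde_log_likelihood(frame, X, bandwidth)
            for s, X in exemplars_by_senone.items()}
```

Since every test frame is scored against all stored exemplars, this trades training-time parameter estimation for test-time computation, which is workable when the training data amount to only minutes of speech.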
International Conference on Acoustics, Speech, and Signal Processing | 2016
Haihua Xu; Jingyong Hou; Xiong Xiao; Van Tung Pham; Cheung-Chi Leung; Lei Wang; Van Hai Do; Hang Lv; Lei Xie; Bin Ma; Eng Siong Chng; Haizhou Li
Dynamic Time Warping (DTW) is widely used in language-independent query-by-example (QbE) spoken term detection (STD) tasks due to its high performance. However, DTW-based template matching has two limitations: (1) it is not straightforward to perform approximate matching of audio queries; (2) DTW is sensitive to mismatched signal conditions between the query and the speech search data. To allow approximate search, we propose a partial template matching strategy using phone time boundary information generated by a phone recognizer. To obtain a more invariant representation of the audio signals, we use bottleneck features (BNF) as the input to DTW. The BNF network is trained on augmented data, generated by adding reverberation and additive noise to the clean training data. Experimental results on the QUESST 2015 task show the effectiveness of the proposed methods for QbE-STD when the queries and search data are both distorted by reverberation and noise.
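For reference, the core of DTW template matching is a dynamic program over frame-level local distances. A minimal sketch, assuming cosine distance over bottleneck features and length normalization by the query (the paper's exact local distance and path constraints may differ):

```python
import numpy as np

def cosine_dist(a, b):
    """Frame-level cosine distance, a common local cost for BNF-based DTW."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def dtw_cost(query, segment):
    """DTW alignment cost between a query (m, d) and a search segment
    (n, d), normalized by the query length; lower cost = better match."""
    m, n = len(query), len(segment)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = cosine_dist(query[i - 1], segment[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n] / m
```

Sliding this over candidate windows of the search data yields detection scores; the proposed partial matching would additionally restrict the alignment to phone-bounded subsequences of the query.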
International Symposium on Chinese Spoken Language Processing | 2012
Van Hai Do; Xiong Xiao; Eng Siong Chng; Haizhou Li
This paper presents a novel method for acoustic modeling with limited training data. The idea is to leverage a well-trained acoustic model of a source language. In this paper, a conventional HMM/GMM triphone acoustic model of the source language is used to derive likelihood scores for each feature vector of the target language. These scores are then mapped to triphones of the target language using neural networks. We conduct a case study where Malay is the source language and English (the Aurora-4 task) is the target language. Experimental results on the Aurora-4 clean test set show that using only 7, 16, and 55 minutes of English training data, we achieve word error rates of 21.58%, 17.97%, and 12.93%, respectively. These results significantly outperform the conventional HMM/GMM and hybrid systems.
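A minimal sketch of the mapping step, assuming the per-frame source-language likelihood scores are stacked into a feature vector and a small feedforward network predicts target triphone states. The file names, hidden-layer size, and use of scikit-learn are illustrative assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# X: (num_frames, num_source_senones) likelihood scores produced by the
#    well-trained source-language (Malay) HMM/GMM for each target frame.
# y: (num_frames,) target-language (English) triphone state labels from a
#    forced alignment of the limited training data. File names hypothetical.
X = np.load("source_loglik_scores.npy")
y = np.load("target_triphone_labels.npy")

mapper = MLPClassifier(hidden_layer_sizes=(512,), max_iter=50)
mapper.fit(X, y)

# At test time the class posteriors act as target-language state scores
# (divided by state priors, they can replace GMM likelihoods in decoding).
posteriors = mapper.predict_proba(X[:10])
```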
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2015
Van Tung Pham; Haihua Xu; Van Hai Do; Tze Yuang Chong; Xiong Xiao; Eng Siong Chng; Haizhou Li
In this paper we report our approaches to the very limited resource keyword search (KWS) task in the NIST Open Keyword Search 2015 (OpenKWS15) Evaluation. We devised methods, first, to attain better acoustic modeling through multilingual, semi-supervised, and exemplar-based acoustic model training, and second, to address the overwhelming out-of-vocabulary (OOV) KWS issue. Finally, we proposed a neural network (NN) framework to fuse diversified component systems, yielding improved combination results. Experimental results demonstrated the effectiveness of these approaches.
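The abstract does not detail the fusion network. As a generic score-level illustration only, per-detection confidence scores from the component systems could feed a small classifier whose posterior becomes the combined score; every name and setting below is hypothetical:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Each row is one hypothesized keyword detection; columns are the
# confidence scores the component KWS systems assigned to it (a missing
# detection can be encoded as a floor score). Labels mark true hits on
# development data. File names are hypothetical.
dev_scores = np.load("component_scores_dev.npy")
dev_labels = np.load("hit_labels_dev.npy")

fusion = MLPClassifier(hidden_layer_sizes=(32,), max_iter=200)
fusion.fit(dev_scores, dev_labels)

# The fused posterior of being a true hit serves as the combined score.
combined = fusion.predict_proba(np.load("component_scores_eval.npy"))[:, 1]
```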
International Conference on Asian Language Processing | 2012
Van Hai Do; Xiong Xiao; Eng Siong Chng; Haizhou Li
This paper presents a novel method for acoustic modeling of a new language with a limited amount of training data. In this approach, we use well-trained acoustic models of a foreign language to generate acoustic scores for each feature vector of the target language. These scores are then mapped to context-dependent triphones of the target language using a limited amount of training data. With this approach, we do not need to modify the foreign acoustic models or impose any special requirements on them. In this paper, English is used as the foreign language and Malay as the target language. Experiments on a Malay large vocabulary continuous speech recognition (LVCSR) task show that, using only a few minutes of training data, we achieve a low word error rate that significantly outperforms the best monolingual baseline acoustic model.
Conference of the International Speech Communication Association | 2016
Van Hai Do; Nancy F. Chen; Boon Pang Lim; Mark Hasegawa-Johnson
When speech data with native transcriptions are scarce in an under-resourced language, automatic speech recognition (ASR) must be trained using other methods. Semi-supervised learning first labels the speech using ASR from other languages, then re-trains the ASR using the generated labels. Mismatched crowdsourcing asks crowd-workers unfamiliar with the language to transcribe it. In this paper, self-training and mismatched crowdsourcing are compared under exactly matched conditions. Specifically, speech data of the target language are decoded by the source language ASR systems into source language phone/word sequences. We find that (1) human mismatched crowdsourcing and cross-lingual ASR have similar error patterns, but different specific errors. (2) These two sources of information can be usefully combined in order to train a better target-language ASR. (3) The differences between the error patterns of non-native human listeners and non-native ASR are small, but when differences are observed, they provide information about the relationship between the phoneme systems of the annotator/source language (Mandarin) and the target language (Vietnamese).
International Symposium on Chinese Spoken Language Processing | 2014
Mirco Ravanelli; Van Hai Do; Adam Janin
To improve speech recognition performance, a combination of TANDEM and bottleneck Deep Neural Networks (DNNs) is investigated. In particular, exploiting a feature combination performed by means of multi-stream hierarchical processing, we show a performance improvement by combining the same input features processed by different neural networks. The experiments are based on the spontaneous telephone recordings of the Cantonese IARPA Babel corpus, using both standard MFCCs and Gabor features as input.
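As an illustration of the multi-stream idea (the dimensions and stream choices below are assumptions, not the paper's exact configuration), TANDEM log-posteriors and bottleneck activations can simply be appended to the base feature stream frame by frame:

```python
import numpy as np

def tandem_features(phone_post, eps=1e-8):
    """TANDEM-style features: log phone posteriors from a neural
    classifier (often decorrelated with PCA before GMM modeling)."""
    return np.log(phone_post + eps)

def combine_streams(mfcc, bnf, phone_post):
    """Multi-stream combination: append bottleneck activations and
    TANDEM log-posteriors to the base stream, frame by frame.

    mfcc       : (T, 39) standard cepstral features
    bnf        : (T, 42) bottleneck activations of a second network
    phone_post : (T, P) phone posteriors from another network
    """
    return np.concatenate([mfcc, bnf, tandem_features(phone_post)], axis=1)
```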
International Conference on Asian Language Processing | 2016
Van Hai Do; Nancy F. Chen; Boon Pang Lim; Mark Hasegawa-Johnson
Mismatched crowdsourcing is a technique to derive speech transcriptions using crowd-workers unfamiliar with the language being spoken. This technique is especially useful for under-resourced languages, since it is hard to hire native transcribers. In this paper, we demonstrate that using mismatched transcription for adaptation improves the performance of speech recognition under limited matched training data conditions. In addition, we show that data augmentation not only improves the performance of the monolingual system but also makes mismatched transcription adaptation more effective.
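A minimal sketch of the kind of audio-level augmentation commonly used in such setups, mixing noise at a chosen SNR and convolving with a room impulse response; the paper's actual augmentation recipe may differ:

```python
import numpy as np
from scipy.signal import fftconvolve

def add_noise(speech, noise, snr_db):
    """Mix a noise recording into clean speech at a target SNR (dB)."""
    noise = np.resize(noise, speech.shape)   # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve clean speech with a room impulse response."""
    return fftconvolve(speech, rir)[: len(speech)]
```

Each clean utterance can be passed through several noise/SNR/RIR combinations, multiplying the effective amount of training data without new transcriptions.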
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2015
Van Hai Do; Xiong Xiao; Eng Siong Chng; Haizhou Li
Kernel density models work well for acoustic modeling with limited training data. In this paper, we improve the kernel density-based acoustic model for low-resource language speech recognition. In our previous study, we demonstrated the effectiveness of the kernel density-based acoustic model on discriminative features such as cross-lingual bottleneck features. In this paper, we propose to learn a Mahalanobis-based distance, equivalent to a full-rank linear feature transformation, that minimizes the frame classification error on the training data. Experimental results on the Wall Street Journal (WSJ) task show that the proposed Mahalanobis-based distance learning yields significant improvements over the Euclidean distance. The kernel density acoustic model with the Mahalanobis-based distance also significantly outperforms a deep neural network acoustic model in limited training data cases.
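The key identity is that a Mahalanobis distance with full-rank matrix M = AᵀA equals a Euclidean distance after the global linear transform A, so the learned distance can be applied by transforming all features once. A sketch of that equivalence (the training criterion that learns A, minimum frame classification error, is not shown):

```python
import numpy as np

def mahalanobis_sq(x, y, A):
    """Squared distance d(x, y) = ||A x - A y||^2. With M = A.T @ A this
    equals (x - y).T @ M @ (x - y), so learning a full-rank A is the same
    as learning one global linear feature transform."""
    d = A @ (x - y)
    return float(d @ d)

def transform_features(frames, A):
    """Apply the transform once; plain Euclidean distance (and hence a
    Euclidean-kernel density model) in the new space then realizes the
    learned Mahalanobis distance."""
    return frames @ A.T
```

Combined with a kernel density scorer like the one sketched earlier, this leaves the KDE code unchanged and only preprocesses the exemplar and test features.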
IEEE Transactions on Audio, Speech, and Language Processing | 2018
Van Hai Do; Nancy F. Chen; Boon Pang Lim; Mark Hasegawa-Johnson
It is challenging to obtain large amounts of native (matched) labels for speech audio in under-resourced languages. This challenge is often due to a lack of literate speakers of the language or, in extreme cases, a lack of universally acknowledged orthography. One solution is to increase the amount of labeled data by using mismatched transcription, which employs transcribers who do not speak the under-resourced language of interest, called the target language (in place of native speakers), to transcribe what they hear as nonsense speech in their own annotation language (