Van Tung Pham
Nanyang Technological University
Publication
Featured research published by Van Tung Pham.
international conference on acoustics, speech, and signal processing | 2014
Nancy F. Chen; Sunil Sivadas; Boon Pang Lim; Hoang Gia Ngo; Haihua Xu; Van Tung Pham; Bin Ma; Haizhou Li
We propose strategies for a state-of-the-art Vietnamese keyword search (KWS) system developed at the Institute for Infocomm Research (I2R). The KWS system exploits acoustic features characterizing creaky voice quality peculiar to lexical tones in Vietnamese, a minimal-resource transliteration framework to alleviate out-of-vocabulary issues from foreign loan words, and a proposed system combination scheme, FusionX. We show that the proposed creaky voice quality features complement pitch-related features, reaching fusion gains of 17.7% relative (6.9% absolute). To the best of our knowledge, the proposed transliteration framework is the first reported rule-based system for Vietnamese; it outperforms statistical-approach baselines by 14.93-36.73% relative on foreign loan word search tasks. Using FusionX to combine 3 sub-systems, the actual term-weighted value (ATWV) reaches 0.4742, exceeding the ATWV=0.3 benchmark for IARPA Babel participants in the NIST OpenKWS13 Evaluation.
international conference on acoustics, speech, and signal processing | 2015
Nancy F. Chen; Chongjia Ni; I-Fan Chen; Sunil Sivadas; Van Tung Pham; Haihua Xu; Xiong Xiao; Tze Siong Lau; Su Jun Leow; Boon Pang Lim; Cheung-Chi Leung; Lei Wang; Chin-Hui Lee; Alvina Goh; Eng Siong Chng; Bin Ma; Haizhou Li
We propose strategies for a state-of-the-art keyword search (KWS) system developed by the SINGA team in the context of the 2014 NIST Open Keyword Search Evaluation (OpenKWS14) using conversational Tamil provided by the IARPA Babel program. To tackle low-resource challenges and the rich morphological nature of Tamil, we present highlights of our current KWS system, including: (1) Submodular optimization data selection to maximize acoustic diversity through Gaussian component indexed N-grams; (2) Keyword-aware language modeling; (3) Subword modeling of morphemes and homophones.
international conference on acoustics, speech, and signal processing | 2014
Van Tung Pham; Haihua Xu; Nancy F. Chen; Sunil Sivadas; Boon Pang Lim; Eng Siong Chng; Haizhou Li
Many keyword search (KWS) systems make “hit/false alarm (FA)” decisions based on the lattice-based posterior probability, which is not comparable across keywords. Therefore, score normalization is essential for a KWS system. In this paper, we investigate the integration of two novel features, ranking-score and relative-to-max, into a discriminative score normalization method. These features are extracted by considering all competing hypotheses of a putative detection. A metric-based normalization method is also applied as a post-processing step to further optimize the term-weighted value (TWV) evaluation metric. We report empirical improvements over standard baselines using the Vietnamese data from IARPA's Babel program in the NIST OpenKWS13 Evaluation setup.
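For readers unfamiliar with the TWV metric named in this abstract, the sketch below computes it from per-keyword counts. It assumes the usual NIST formulation (one non-target trial per second of speech, beta = 999.9, keywords without reference occurrences excluded); the class and function names are illustrative and do not come from the paper.

```python
from dataclasses import dataclass

BETA = 999.9  # assumed weight, as commonly used in the NIST KWS evaluations


@dataclass
class KeywordCounts:
    n_true: int         # reference occurrences of the keyword
    n_correct: int      # detections that match a reference occurrence
    n_false_alarm: int  # detections that match nothing


def term_weighted_value(counts, speech_seconds, beta=BETA):
    """Return 1 - mean over keywords of (P_miss + beta * P_FA)."""
    # Keywords with no reference occurrence have an undefined miss rate
    # and are left out of the average.
    scored = [c for c in counts if c.n_true > 0]
    if not scored:
        raise ValueError("no keyword has a reference occurrence")
    total = 0.0
    for c in scored:
        p_miss = 1.0 - c.n_correct / c.n_true
        # Non-target trials: one per second of speech, minus the true hits.
        p_fa = c.n_false_alarm / (speech_seconds - c.n_true)
        total += p_miss + beta * p_fa
    return 1.0 - total / len(scored)
```

Because beta is large, a single false alarm on a rare keyword costs far more than a single miss, which is one reason raw posteriors need per-keyword normalization before a global threshold is applied.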
spoken language technology workshop | 2014
Van Tung Pham; Nancy F. Chen; Sunil Sivadas; Haihua Xu; I-Fan Chen; Chongjia Ni; Eng Siong Chng; Haizhou Li
System combination (or data fusion) is known to provide significant improvement for spoken term detection (STD). The key issue in system combination is how to effectively fuse the various scores of the participant systems. Currently, most system combination methods are system and keyword independent, i.e. they use the same arithmetic functions to combine scores for all keywords. Although such strategies improve keyword search performance, the improvement is limited. In this paper we first propose an arithmetic-based system combination method that incorporates system and keyword characteristics into the fusion procedure to enhance the effectiveness of system combination. The method incorporates a system-keyword dependent property, which is the number of acceptances in this paper, into the combination procedure. We then introduce a discriminative model to combine various useful system and keyword characteristics into a general framework. Improvements over standard baselines are observed on the Vietnamese data from the IARPA Babel program with the NIST OpenKWS13 Evaluation setup.
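To make "system and keyword independent" arithmetic fusion concrete, here is a minimal sketch of two classic rules from the data fusion literature, CombSUM and CombMNZ. They are generic baselines of the kind the abstract contrasts against, not the method proposed in the paper. The input layout (one dict per system, mapping a hypothetical detection identifier to a score) is an assumption, and aligning detections across systems is taken as already done.

```python
def comb_sum(system_scores):
    """Sum the scores that each system assigns to a detection."""
    fused = {}
    for scores in system_scores:
        for detection, score in scores.items():
            fused[detection] = fused.get(detection, 0.0) + score
    return fused


def comb_mnz(system_scores):
    """CombSUM multiplied by the number of systems that returned the detection."""
    fused = comb_sum(system_scores)
    hits = {}
    for scores in system_scores:
        for detection in scores:
            hits[detection] = hits.get(detection, 0) + 1
    return {detection: fused[detection] * hits[detection] for detection in fused}


# Two systems; only the first one returns detection "b".
systems = [{"a": 0.5, "b": 0.2}, {"a": 0.25}]
print(comb_sum(systems))
print(comb_mnz(systems))
```

Both rules apply the same function to every keyword and every system, which is exactly the limitation the abstract describes.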
international conference on acoustics, speech, and signal processing | 2016
Haihua Xu; Jingyong Hou; Xiong Xiao; Van Tung Pham; Cheung-Chi Leung; Lei Wang; Van Hai Do; Hang Lv; Lei Xie; Bin Ma; Eng Siong Chng; Haizhou Li
Dynamic Time Warping (DTW) is widely used in language-independent query-by-example (QbE) spoken term detection (STD) tasks due to its high performance. However, DTW-based template matching has two limitations: 1) it is not straightforward to perform approximate matching of audio queries; 2) DTW is sensitive to the mismatch of signal conditions between the query and the speech search data. To allow approximate search, we propose a partial template matching strategy using phone time boundary information generated by a phone recognizer. To obtain a more invariant representation of audio signals, we use bottleneck features (BNF) as the input to DTW. The BNF network is trained on augmented data, which is generated by adding reverberation and additive noise to the clean training data. Experimental results on the QUESST 2015 task show the effectiveness of the proposed methods for QbE-STD when the queries and search data are both distorted by reverberation and noise.
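As background for the template matching discussed in this abstract, the following is a minimal sketch of textbook DTW between two feature sequences. It aligns the whole query against the whole segment with a Euclidean frame distance. It is not the partial matching strategy proposed in the paper, nor the subsequence search a full QbE-STD system would need. The function name and the length normalization are illustrative choices.

```python
import math


def dtw_distance(query, segment):
    """Length-normalized DTW cost between two sequences of feature vectors."""
    n, m = len(query), len(segment)
    # acc[i][j] = cheapest cost of aligning the first i query frames
    # with the first j segment frames.
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(query[i - 1], segment[j - 1])
            acc[i][j] = cost + min(
                acc[i - 1][j],      # query frame repeats a segment frame
                acc[i][j - 1],      # segment frame repeats a query frame
                acc[i - 1][j - 1],  # both sequences advance
            )
    return acc[n][m] / (n + m)
```

In this sketch each frame would be a bottleneck feature vector. A segment that is merely a time-stretched copy of the query scores 0, whereas any difference in the features themselves raises the cost, which is why more invariant features matter under noise and reverberation.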
conference of the international speech communication association | 2016
Cheung-Chi Leung; Lei Wang; Haihua Xu; Jingyong Hou; Van Tung Pham; Hang Lv; Lei Xie; Xiong Xiao; Chongjia Ni; Bin Ma; Eng Siong Chng; Haizhou Li
This paper documents the significant components of a state-of-the-art language-independent query-by-example spoken term detection system designed for the Query by Example Search on Speech Task (QUESST) in MediaEval 2015. We developed exact and partial matching DTW systems, and WFST-based symbolic search systems to handle different types of search queries. To handle the noisy and reverberant speech in the task, we trained tokenizers using data augmented with different noise and reverberation conditions. Our post-evaluation analysis showed that the phone boundary label provided by the improved tokenizers brings more accurate speech activity detection in DTW systems. We argue that acoustic condition mismatch is possibly a more important factor than language mismatch for obtaining consistent gain from stacked bottleneck features. Our post-evaluation system, involving a smaller number of component systems, can outperform our submitted systems, which performed the best for the task.
international conference on acoustics, speech, and signal processing | 2015
Hang Su; Van Tung Pham; Yanzhang He; James Hieronymus
This paper investigates a weighted finite state transducer (WFST) based syllable decoding and transduction method for keyword search (KWS), and compares it with sub-word search and phone confusion methods in detail. Acoustic context dependent phone models are trained from word forced alignments and then used for syllable decoding and lattice generation. Out-of-vocabulary (OOV) keyword pronunciations are produced using a grapheme-to-syllable (G2S) system and then used to construct a lexical transducer. The lexical transducer is then composed with a keyword-boosted language model (LM) to transduce the syllable lattices to word lattices for final KWS. Word Error Rates (WER) and KWS results are reported for 5 different languages. It is shown that the syllable transduction method gives comparable KWS results to the syllable search and phone confusion methods. Combination of these three methods further improves OOV KWS performance.
international conference on acoustics, speech, and signal processing | 2016
Van Tung Pham; Haihua Xu; Xiong Xiao; Nancy F. Chen; Eng Siong Chng; Haizhou Li
In this work, we propose a novel framework for rescoring keyword search (KWS) detections using acoustic samples extracted from the training data. We view the keyword rescoring task as an information retrieval task and adopt the idea of query expansion. We expand a textual keyword with multiple speech keyword samples extracted from the training data. In this way, the hypothesized detections are compared with the multiple keyword samples using non-parametric approaches such as dynamic time warping (DTW). The obtained similarity scores are used in a graph-based method to re-rank the original confidence scores estimated by the automatic speech recognition (ASR) systems. Experimental results on the NIST OpenKWS15 Evaluation show that our rescoring method is effective, especially for the subword system. For subword experiments, the graph-based rescoring with training samples obtains 5.1% and 1.5% absolute improvement over two baseline systems. One is a standard parametric ASR system, while the other is the graph-based rescoring without training samples.
asia pacific signal and information processing association annual summit and conference | 2015
Van Tung Pham; Haihua Xu; Van Hai Do; Tze Yuang Chong; Xiong Xiao; Eng Siong Chng; Haizhou Li
In this paper we report our approaches to the very limited resource keyword search (KWS) task in the NIST Open Keyword Search 2015 (OpenKWS15) Evaluation. We devised methods, first, to attain better acoustic modeling through multilingual and semi-supervised acoustic model training as well as exemplar-based acoustic model training; and second, to address the overwhelming out-of-vocabulary (OOV) KWS issue. Finally, we proposed a neural network (NN) framework to fuse diversified component systems, yielding improved combination results. Experimental results demonstrated the effectiveness of these approaches.
conference of the international speech communication association | 2016
Van Tung Pham; Haihua Xu; Xiong Xiao; Nancy F. Chen; Eng Siong Chng; Haizhou Li
Rescoring hypothesized detections using a keyword’s audio samples extracted from training data is an effective way to improve the performance of a Keyword Search (KWS) system. Unfortunately, such a rescoring framework cannot be applied directly to Out-of-Vocabulary (OOV) keywords, since there are no samples of them in the training data. To address this limitation, we propose two techniques for OOV keywords in this work. The first technique generates samples for an OOV keyword by concatenating samples of its constituent subwords. The second technique splits hypothesized detections into segments, then estimates the acoustic similarities between detections and subword samples according to the similarities between segments and these samples. The similarity scores from these two techniques are used to rescore and re-rank the list of detections returned by the automatic speech recognition (ASR) systems. The experiments show that incorporating the proposed similarity scores results in a better separation between the correct and false alarm detections than using the ASR scores alone. Furthermore, experimental results on the NIST OpenKWS15 Evaluation show that rescoring with the proposed similarity scores significantly outperforms the raw ASR scores, and other methods that do not use the similarity scores, in both Maximum Term Weighted Value (MTWV) and Mean Average Precision (MAP) metrics.