Cheng-Tao Chung
National Taiwan University
Publication
Featured research published by Cheng-Tao Chung.
International Conference on Acoustics, Speech, and Signal Processing | 2013
Cheng-Tao Chung; Chun-an Chan; Lin-Shan Lee
Techniques for unsupervised discovery of acoustic patterns are becoming increasingly attractive, because huge quantities of speech data are becoming available but manual annotations remain hard to acquire. In this paper, we propose an approach for unsupervised discovery of linguistic structure for the target spoken language given raw speech data. This linguistic structure includes two-level (subword-like and word-like) acoustic patterns, the lexicon of word-like patterns in terms of subword-like patterns, and the N-gram language model based on word-like patterns. All patterns, models, and parameters can be automatically learned from the unlabelled speech corpus. This is achieved by an initialization step followed by three cascaded stages for acoustic, linguistic, and lexical iterative optimization. The lexicon of word-like patterns defines the allowed consecutive sequences of HMMs for subword-like patterns. In each iteration, model training and decoding produce updated labels from which the lexicon and HMMs can be further updated. In this way, model parameters and decoded labels are respectively optimized in each iteration, and the knowledge about the linguistic structure is learned gradually layer after layer. The proposed approach was tested in preliminary experiments on a corpus of Mandarin broadcast news, including a task of spoken term detection with performance compared to a parallel test using models trained in a supervised way. Results show that the proposed system not only yields reasonable performance on its own, but is also complementary to existing large vocabulary ASR systems.
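The following is a minimal sketch of the "decode, then update lexicon and models, then decode again" loop described in this abstract, not the authors' implementation: k-means frame clustering stands in for subword-like HMM training, frequent subword pairs stand in for word-like patterns, and their counts for the language model. All names and the toy corpus are illustrative assumptions.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
toy_corpus = [rng.normal(size=(100, 13)) for _ in range(5)]   # 5 utterances of MFCC-like frames
all_frames = np.vstack(toy_corpus)

def train_subword_models(frames, init=None, n_subword=8, n_iter=5):
    """Toy 'acoustic' stage: k-means centroids stand in for subword-like HMMs."""
    centroids = (init.copy() if init is not None
                 else frames[rng.choice(len(frames), n_subword, replace=False)])
    for _ in range(n_iter):
        labels = np.argmin(((frames[:, None] - centroids) ** 2).sum(-1), axis=1)
        for k in range(n_subword):
            if np.any(labels == k):
                centroids[k] = frames[labels == k].mean(axis=0)
    return centroids

def decode(utt, centroids):
    """Toy decoding: label each frame, then collapse repeats into a subword-like sequence."""
    labels = np.argmin(((utt[:, None] - centroids) ** 2).sum(-1), axis=1)
    return [int(l) for i, l in enumerate(labels) if i == 0 or l != labels[i - 1]]

def update_lexicon_and_lm(decoded, word_len=2):
    """Toy 'lexical/linguistic' stages: frequent subword pairs become word-like
    patterns, and their counts act as a crude language model."""
    pair_counts = Counter(tuple(seq[i:i + word_len])
                          for seq in decoded for i in range(len(seq) - word_len + 1))
    lexicon = [p for p, _ in pair_counts.most_common(10)]
    return lexicon, pair_counts

centroids = None
for iteration in range(3):                       # cascaded iterative optimization
    centroids = train_subword_models(all_frames, init=centroids)   # update model parameters
    decoded = [decode(u, centroids) for u in toy_corpus]           # update decoded labels
    lexicon, lm = update_lexicon_and_lm(decoded)                   # update lexicon and LM
    print(f"iteration {iteration}: lexicon size {len(lexicon)}")
```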
International Conference on Acoustics, Speech, and Signal Processing | 2014
Cheng-Tao Chung; Chun-an Chan; Lin-Shan Lee
This paper presents a new approach for unsupervised spoken term detection with spoken queries using multiple sets of acoustic patterns automatically discovered from the target corpus. The different pattern HMM configurations (number of states per model, number of distinct models, number of Gaussians per state) form a three-dimensional model granularity space. Different sets of acoustic patterns automatically discovered at different points properly distributed over this three-dimensional space are complementary to one another, and can thus jointly capture the characteristics of the spoken terms. By representing the spoken content and spoken query as sequences of acoustic patterns, a series of approaches for matching the pattern index sequences while considering the signal variations are developed. In this way, not only can the on-line computation load be reduced, but the signal distributions caused by different speakers and acoustic conditions can also be reasonably taken care of. The results indicate that this approach significantly outperformed the unsupervised feature-based DTW baseline by 16.16% in mean average precision on the TIMIT corpus.
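A minimal sketch, not the paper's system, of how multiple pattern sets indexed by a (states, models, Gaussians) granularity grid could be used jointly: the corpus and query are decoded under every configuration and a simple sequence-matching score is averaged over the sets. Pattern decoding is faked with random labels here; every name below (granularity_grid, decode_with_config, match_score) is an assumption.

```python
import itertools, random
random.seed(0)

granularity_grid = list(itertools.product([3, 5, 7],       # states per model
                                          [50, 100, 300],  # distinct models
                                          [2, 4]))         # Gaussians per state

def decode_with_config(n_frames, config):
    """Stand-in for unsupervised pattern decoding under one HMM configuration."""
    _, n_models, _ = config
    return [random.randrange(n_models) for _ in range(n_frames // 10)]

def match_score(query_seq, doc_seq):
    """Toy matcher: longest common subsequence length, normalized by query length."""
    m, n = len(query_seq), len(doc_seq)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if query_seq[i] == doc_seq[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, 1)

query_frames, doc_frames = 80, 400
scores = []
for config in granularity_grid:                  # one pattern set per grid point
    q = decode_with_config(query_frames, config)
    d = decode_with_config(doc_frames, config)
    scores.append(match_score(q, d))
print("fused relevance score:", sum(scores) / len(scores))
```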
International Conference on Acoustics, Speech, and Signal Processing | 2013
Hung-yi Lee; Yun-Chiao Li; Cheng-Tao Chung; Lin-Shan Lee
Query expansion techniques were originally developed for text information retrieval in order to retrieve the documents not containing the query terms but semantically related to the query. This is achieved by assuming the terms frequently occurring in the top-ranked documents in the first-pass retrieval results to be query-related and using them to expand the query to do the second-pass retrieval. However, when this approach was used for spoken content retrieval, the inevitable recognition errors and the OOV problems in ASR make it difficult for many query-related terms to be included in the expanded query, and much of the information carried by the speech signal is lost during recognition and not recoverable. In this paper, we propose to use a second ASR engine based on acoustic patterns automatically discovered from the spoken archive used for retrieval. These acoustic patterns are discovered directly based on the signal characteristics, and therefore can compensate for the information lost during recognition to a good extent. When a text query is entered, the system generates the first-pass retrieval results based on the transcriptions of the spoken segments obtained via the conventional ASR. The acoustic patterns frequently occurring in the spoken segments ranked on top of the first-pass results are considered as query-related, and the spoken segments containing these query-related acoustic patterns are retrieved. In this way, even though some query-related terms are OOV or incorrectly recognized, the segments including these terms can still be retrieved by acoustic patterns corresponding to these terms. Preliminary experiments performed on Mandarin broadcast news offered very encouraging results.
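A minimal sketch of the pattern-based pseudo-relevance feedback idea in this abstract, using toy data: segment transcripts and acoustic-pattern index sequences below are made up, and the scoring is plain counting rather than the paper's retrieval model.

```python
from collections import Counter

segments = {                                     # hypothetical spoken archive
    "seg1": {"asr": "economy growth report", "patterns": [3, 7, 7, 12, 5]},
    "seg2": {"asr": "weather forecast sunny", "patterns": [9, 1, 4, 4, 2]},
    "seg3": {"asr": "economy policy debate",  "patterns": [3, 7, 12, 12, 8]},
    "seg4": {"asr": "sports final score",     "patterns": [6, 6, 10, 11, 0]},
}

def first_pass(query):
    """Rank segments by how many query terms the (error-prone) ASR transcript contains."""
    return sorted(segments, key=lambda s: -sum(t in segments[s]["asr"].split()
                                               for t in query.split()))

def expand_with_patterns(ranked, top_n=2, top_k=3):
    """Treat patterns frequent in the top-N first-pass segments as query-related."""
    counts = Counter(p for s in ranked[:top_n] for p in segments[s]["patterns"])
    return {p for p, _ in counts.most_common(top_k)}

def second_pass(query_patterns):
    """Re-rank by overlap with the query-related acoustic patterns."""
    return sorted(segments, key=lambda s: -len(query_patterns &
                                               set(segments[s]["patterns"])))

ranked = first_pass("economy report")
expanded = expand_with_patterns(ranked)
print(second_pass(expanded))   # segments sharing the query-related patterns rise to the top
```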
International Conference on Acoustics, Speech, and Signal Processing | 2013
Chun-an Chan; Cheng-Tao Chung; Yu-Hsin Kuo; Lin-Shan Lee
We present a two-stage model-based approach for unsupervised query-by-example spoken term detection (STD) without any annotated data. Compared to the prevailing DTW approaches for the unsupervised STD task, the HMMs used by model-based approaches can better capture the signal distributions and time trajectories of speech with a more global view of the spoken archive; matching with model states also significantly reduces the computational load. The utterances in the spoken archive are first decoded offline into acoustic patterns automatically discovered in an unsupervised way from the spoken archive. In the first stage, we propose a document state matching (DSM) approach, where query frames are matched to the HMM state sequences for the spoken documents. In this process, a novel duration-constrained Viterbi (DC-Vite) algorithm is proposed to avoid unrealistic speaking rate distortion. In the second stage, pseudo relevant/irrelevant examples retrieved from the first stage are respectively used to construct query/anti-query HMMs. Each spoken term hypothesis is then rescored with the likelihood ratio to these two HMMs. Experimental results show an absolute improvement of 11.8% in mean average precision with a more than 50% reduction in computation time compared to the segmental DTW approach on a Mandarin broadcast news corpus.
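A minimal sketch of the second-stage idea only: fit a "query" model on pseudo-relevant examples and an "anti-query" model on pseudo-irrelevant ones, then rescore each hypothesis by a likelihood ratio. Single diagonal Gaussians stand in for the query/anti-query HMMs, and all data is toy.

```python
import numpy as np
rng = np.random.default_rng(1)

pseudo_relevant   = [rng.normal(loc=1.0, size=(60, 13)) for _ in range(5)]
pseudo_irrelevant = [rng.normal(loc=0.0, size=(60, 13)) for _ in range(5)]
hypotheses        = [rng.normal(loc=m, size=(60, 13)) for m in (0.9, 0.1, 1.1)]

def fit_gaussian(examples):
    """Fit a single diagonal Gaussian to all frames of the given examples."""
    x = np.vstack(examples)
    return x.mean(axis=0), x.var(axis=0) + 1e-6

def loglik(frames, model):
    """Total diagonal-Gaussian log-likelihood of a frame sequence."""
    mean, var = model
    return float((-0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)).sum())

query_model = fit_gaussian(pseudo_relevant)      # plays the role of the query HMM
anti_model  = fit_gaussian(pseudo_irrelevant)    # plays the role of the anti-query HMM

for i, hyp in enumerate(hypotheses):
    ratio = loglik(hyp, query_model) - loglik(hyp, anti_model)
    print(f"hypothesis {i}: log-likelihood ratio {ratio:.1f}")   # higher = more query-like
```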
International Conference on Acoustics, Speech, and Signal Processing | 2015
Cheng-Tao Chung; Wei-Ning Hsu; Cheng-Yi Lee; Lin-Shan Lee
This paper presents a novel approach for enhancing the multiple sets of acoustic patterns automatically discovered from a given corpus. In a previous work it was proposed that different HMM configurations (number of states per model, number of distinct models) for the acoustic patterns form a two-dimensional space. Multiple sets of acoustic patterns automatically discovered with HMM configurations properly located at different points over this two-dimensional space were shown to be complementary to one another, jointly capturing the characteristics of the given corpus. By representing the given corpus as sequences of acoustic patterns over the different HMM sets, the pattern indices in these sequences can be relabeled considering the context consistency across the different sequences. Good improvements were observed in preliminary spoken term detection (STD) experiments performed with such enhanced patterns on both TIMIT and Mandarin Broadcast News.
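The snippet below is a toy illustration, not the paper's relabeling algorithm: it shows the general idea that pattern labels from one token set can be relabeled according to the labels from another set they most consistently overlap with on the same frames, so the two sequences agree more often. The frame-level label sequences are made up.

```python
from collections import Counter, defaultdict

# frame-level labels of one utterance under two different HMM configurations
labels_set_a = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
labels_set_b = [5, 5, 5, 8, 8, 8, 9, 9, 4, 4]

cooc = defaultdict(Counter)
for a, b in zip(labels_set_a, labels_set_b):
    cooc[b][a] += 1        # how often each set-B label overlaps each set-A label

# map every set-B label to the set-A label it co-occurs with most often
relabel_b = {b: counts.most_common(1)[0][0] for b, counts in cooc.items()}
relabeled = [relabel_b[b] for b in labels_set_b]
print("set B relabeled into set A's index space:", relabeled)
```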
IEEE Automatic Speech Recognition and Understanding Workshop | 2015
Cheng-Tao Chung; Cheng-Yu Tsai; Hsiang-Hung Lu; Chia-Hsiang Liu; Hung-yi Lee; Lin-Shan Lee
In this work, we aim to discover high-quality speech features and linguistic units directly from unlabeled speech data in a zero-resource scenario. The results are evaluated using the metrics and corpora proposed in the Zero Resource Speech Challenge organized at Interspeech 2015. A Multi-layered Acoustic Tokenizer (MAT) was proposed for automatic discovery of multiple sets of acoustic tokens from the given corpus. Each acoustic token set is specified by a set of hyperparameters that describe the model configuration. These sets of acoustic tokens carry different characteristics of the given corpus and the language behind it, and thus can be mutually reinforced. The multiple sets of token labels are then used as the targets of a Multi-target Deep Neural Network (MDNN) trained on low-level acoustic features. Bottleneck features extracted from the MDNN are then used as the feedback input to the MAT and the MDNN itself in the next iteration. We call this iterative deep learning framework the Multi-layered Acoustic Tokenizing Deep Neural Network (MAT-DNN), which generates both high-quality speech features for Track 1 of the Challenge and acoustic tokens for Track 2 of the Challenge. In addition, we performed extra experiments on the same corpora on the application of query-by-example spoken term detection. The experimental results showed that the iterative deep learning framework of the MAT-DNN improved the detection performance due to better underlying speech features and acoustic tokens.
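A heavily simplified sketch of the MAT-DNN control loop described above: k-means clusterings stand in for the Multi-layered Acoustic Tokenizer's token sets, and a linear map fitted to the concatenated one-hot token targets stands in for the Multi-target DNN whose outputs are fed back as new features. This is purely illustrative of the iteration structure; all names and data are assumptions.

```python
import numpy as np
rng = np.random.default_rng(0)

frames = rng.normal(size=(1000, 39))             # toy frame-level acoustic features

def tokenize(feats, n_tokens):
    """Toy 'MAT': frame clustering as a stand-in for acoustic token decoding."""
    centroids = feats[rng.choice(len(feats), n_tokens, replace=False)]
    for _ in range(5):
        labels = np.argmin(((feats[:, None] - centroids) ** 2).sum(-1), axis=1)
        for k in range(n_tokens):
            if np.any(labels == k):
                centroids[k] = feats[labels == k].mean(axis=0)
    return labels

def one_hot(labels, n):
    out = np.zeros((len(labels), n))
    out[np.arange(len(labels)), labels] = 1.0
    return out

feats = frames
for iteration in range(3):
    token_sets = [tokenize(feats, n) for n in (10, 20, 40)]    # multiple granularities
    targets = np.hstack([one_hot(lab, n) for lab, n in zip(token_sets, (10, 20, 40))])
    w, *_ = np.linalg.lstsq(feats, targets, rcond=None)        # toy 'MDNN' training
    bottleneck = feats @ w                                     # toy 'bottleneck' features
    feats = np.hstack([frames, bottleneck])                    # feed back for next iteration
    print(f"iteration {iteration}: feature dim {feats.shape[1]}")
```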
IEEE Automatic Speech Recognition and Understanding Workshop | 2013
Yun-Chiao Li; Hung-yi Lee; Cheng-Tao Chung; Chun-an Chan; Lin-Shan Lee
This paper presents an initial effort to retrieve semantically related spoken content in a completely unsupervised way. Unsupervised approaches to spoken content retrieval are attractive because the need for annotated data reasonably matched to the spoken content for training acoustic and language models can be bypassed. However, almost all such unsupervised approaches focus on spoken term detection, i.e., returning the spoken segments containing the query, using either template-matching techniques such as dynamic time warping (DTW) or model-based approaches. In practice, users usually prefer to retrieve all objects semantically related to the query, not necessarily only those including the query terms. This paper proposes a different approach. We transcribe the spoken segments in the archive to be searched through into sequences of acoustic patterns automatically discovered in an unsupervised manner. For an input query in spoken form, the top-N spoken segments from the archive obtained in the first-pass retrieval with DTW are taken as pseudo-relevant. The acoustic patterns frequently occurring in these segments are therefore considered query-related and are used for query expansion. Preliminary experiments performed on Mandarin broadcast news offered very encouraging results.
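A minimal sketch of the pipeline in this abstract with toy data: a plain DTW first pass over feature sequences, the top-N results taken as pseudo-relevant, and their frequent acoustic-pattern indices used to re-rank the archive. The feature sequences and pattern sequences are random placeholders, not real speech.

```python
import numpy as np
from collections import Counter
rng = np.random.default_rng(2)

archive = [{"feats": rng.normal(size=(50, 13)),
            "patterns": list(rng.integers(0, 20, size=10))} for _ in range(6)]
query = rng.normal(size=(30, 13))                # spoken query in feature form

def dtw_cost(a, b):
    """Standard DTW with Euclidean local distance, normalized by path length bound."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])
            d[i, j] = local + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m] / (n + m)

first_pass = sorted(range(len(archive)), key=lambda i: dtw_cost(query, archive[i]["feats"]))
pseudo_relevant = first_pass[:2]                 # top-N segments assumed relevant
expansion = {p for p, _ in Counter(p for i in pseudo_relevant
                                   for p in archive[i]["patterns"]).most_common(5)}
second_pass = sorted(range(len(archive)),
                     key=lambda i: -len(expansion & set(archive[i]["patterns"])))
print("re-ranked segment indices:", second_pass)
```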
IEEE Transactions on Audio, Speech, and Language Processing | 2018
Cheng-Tao Chung; Lin-Shan Lee
In this paper, we compare two paradigms for unsupervised discovery of structured acoustic tokens directly from speech corpora without any human annotation. The multigranular paradigm seeks to capture all available information in the corpora with multiple sets of tokens for different model granularities. The hierarchical paradigm attempts to jointly learn several levels of signal representations in a hierarchical structure. The two paradigms are unified within a theoretical framework in this paper. Query-by-example spoken term detection (QbE-STD) experiments on the Query by Example Search on Speech Task dataset of MediaEval 2015 verify the competitiveness of the acoustic tokens. The enhanced relevance score proposed in this work improves both paradigms for the task of QbE-STD. We also list results on the ABX evaluation task of the Zero Resource Speech Challenge 2015 for comparison of the paradigms.
International Conference on Acoustics, Speech, and Signal Processing | 2017
Cheng-Kuang Wei; Cheng-Tao Chung; Hung-yi Lee; Lin-Shan Lee
It is well known that recognizers personalized to each user are much more effective than user-independent recognizers. With the popularity of smartphones today, although it is not difficult to collect a large set of audio data for each user, it is difficult to transcribe it. However, it is now possible to automatically discover acoustic tokens from unlabeled personal data in an unsupervised way. We therefore propose a multi-task deep learning framework called the phoneme-token deep neural network (PTDNN), jointly trained from unsupervised acoustic tokens discovered from unlabeled data and very limited transcribed data for personalized acoustic modeling. We term this scenario “weakly supervised”. The underlying intuition is that the high degree of similarity between the HMM states of acoustic token models and phoneme models may help them learn from each other in this multi-task learning framework. Initial experiments performed on a personalized audio data set recorded from Facebook posts demonstrated that very good improvements can be achieved in both frame accuracy and word accuracy over widely used baselines such as fDLR, speaker codes, and lightly supervised adaptation. This approach complements existing speaker adaptation approaches and can be used jointly with such techniques to yield improved results.
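A minimal sketch of a multi-task network in the spirit of the PTDNN described above: shared hidden layers feed one softmax head for phoneme targets (from the small transcribed set) and one for unsupervised acoustic-token targets (from the large unlabeled set). The data is random and the architecture and sizes are assumptions; PyTorch is used only for illustration, not as the authors' implementation.

```python
import torch
import torch.nn as nn

N_PHONES, N_TOKENS, FEAT_DIM = 40, 100, 39

shared = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(),
                       nn.Linear(256, 256), nn.ReLU())
phone_head = nn.Linear(256, N_PHONES)     # supervised by the limited transcriptions
token_head = nn.Linear(256, N_TOKENS)     # supervised by discovered acoustic tokens

opt = torch.optim.Adam(list(shared.parameters()) +
                       list(phone_head.parameters()) +
                       list(token_head.parameters()), lr=1e-3)

# toy batches: a small transcribed batch and a large untranscribed batch
x_small, y_phone = torch.randn(32, FEAT_DIM), torch.randint(0, N_PHONES, (32,))
x_large, y_token = torch.randn(256, FEAT_DIM), torch.randint(0, N_TOKENS, (256,))

for step in range(100):
    opt.zero_grad()
    # the two tasks share the hidden layers, so each loss updates the shared weights
    loss = (nn.functional.cross_entropy(phone_head(shared(x_small)), y_phone) +
            nn.functional.cross_entropy(token_head(shared(x_large)), y_token))
    loss.backward()
    opt.step()
print("final joint loss:", float(loss))
```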
IEEE Transactions on Audio, Speech, and Language Processing | 2017
Cheng-Tao Chung; Cheng-Yu Tsai; Chia-Hsiang Liu; Lin-Shan Lee
In this paper, we aim to automatically discover high-quality frame-level speech features and acoustic tokens directly from unlabeled speech data. A multigranular acoustic tokenizer (MAT) was proposed for automatic discovery of multiple sets of acoustic tokens from the given corpus. Each acoustic token set is specified by a set of hyperparameters describing the model configuration. These different sets of acoustic tokens carry different characteristics of the given corpus and the language behind it, and thus can be mutually reinforced. The multiple sets of token labels are then used as the targets of a multitarget deep neural network (MDNN) trained on frame-level acoustic features. Bottleneck features extracted from the MDNN are then used as the feedback input to the MAT and the MDNN itself in the next iteration. The multigranular acoustic token sets and the frame-level speech features can be iteratively optimized in this iterative deep learning framework, which we call the MAT deep neural network (MAT-DNN). The results were evaluated using the metrics and corpora defined in the Zero Resource Speech Challenge organized at Interspeech 2015, and improved performance was obtained in a set of query-by-example spoken term detection experiments on the same corpora. A visualization of the discovered tokens against the English phonemes was also shown.
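To complement the loop sketch above, the following is a minimal sketch, with toy data, of extracting bottleneck features from a multitarget network such as the MDNN described in this abstract: the network has a narrow hidden layer, one classification head per token set, and the activations of the narrow layer are taken as frame-level speech features. The sizes and token-set granularities are assumptions, and PyTorch is used only for illustration.

```python
import torch
import torch.nn as nn

FEAT_DIM, BOTTLENECK_DIM = 39, 40
token_set_sizes = [50, 100, 300]               # hypothetical token-set granularities

encoder = nn.Sequential(nn.Linear(FEAT_DIM, 512), nn.ReLU(),
                        nn.Linear(512, BOTTLENECK_DIM), nn.ReLU())   # narrow layer
heads = nn.ModuleList(nn.Linear(BOTTLENECK_DIM, n) for n in token_set_sizes)

def multitarget_loss(x, targets):
    """Sum of cross-entropies against the labels from every token set."""
    z = encoder(x)
    return sum(nn.functional.cross_entropy(head(z), t)
               for head, t in zip(heads, targets))

# toy training step on random frames and random token labels
x = torch.randn(128, FEAT_DIM)
targets = [torch.randint(0, n, (128,)) for n in token_set_sizes]
opt = torch.optim.Adam(list(encoder.parameters()) + list(heads.parameters()), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    loss = multitarget_loss(x, targets)
    loss.backward()
    opt.step()

bottleneck_features = encoder(x).detach()      # these would be fed back to the tokenizer
print("bottleneck feature shape:", tuple(bottleneck_features.shape))
```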