Yih-Ru Wang
National Chiao Tung University
Publication
Featured research published by Yih-Ru Wang.
IEEE Transactions on Audio, Speech, and Language Processing | 2014
Sin-Horng Chen; Chiao-Hua Hsieh; Chen-Yu Chiang; Hsi-Chun Hsiao; Yih-Ru Wang; Yuan-Fu Liao; Hsiu-Min Yu
This paper proposes a new data-driven approach to building a speaking rate-dependent hierarchical prosodic model (SR-HPM) directly from a large prosody-unlabeled speech database containing utterances of various speaking rates, in order to describe the influences of speaking rate on Mandarin speech prosody. It extends the existing HPM, which contains 12 sub-models describing the relationships among prosodic-acoustic features of the speech signal, linguistic features of the associated text, and prosodic tags representing the prosodic structure of speech. Two main modifications are proposed. One is to design proper normalization functions from the statistics of the whole database to compensate for the influences of speaking rate on all prosodic-acoustic features. The other is to modify the HPM training so that its parameters are speaking-rate dependent. Experimental results on a large Mandarin read speech corpus showed that the parameters of the SR-HPM, together with these feature normalization functions, interpreted the effects of speaking rate on Mandarin speech prosody very well. An application of the SR-HPM to the design and implementation of a speaking rate-controlled Mandarin TTS system is demonstrated. The system can generate natural synthetic speech for any given speaking rate in a wide range of 3.4-6.8 syllables/sec. Two subjective tests, an MOS test and a preference test, were conducted to compare the proposed system with the popular HTS system. The MOS scores of the proposed system were in the range of 3.58-3.83 for eight different speaking rates, while those of HTS were in the range of 3.09-3.43. In addition, the proposed system had higher preference scores (49.8%-79.6%) than HTS (9.8%-30.7%). This confirmed the effectiveness of the speaking rate control method of the proposed TTS system.
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Sin-Horng Chen; Jyh-Her Yang; Chen-Yu Chiang; Ming-Chieh Liu; Yih-Ru Wang
This paper presents a new prosody-assisted automatic speech recognition (ASR) system for Mandarin speech. It differs from the conventional approach of using simple prosodic cues in that it employs a sophisticated prosody modeling approach based on a four-layer prosody-hierarchy structure, in which 12 prosodic models are automatically generated from a large unlabeled speech database by the previously proposed joint prosody labeling and modeling (PLM) algorithm. By incorporating these 12 prosodic models into a two-stage ASR system to rescore the word lattice generated in the first stage by a conventional hidden Markov model (HMM) recognizer, a better recognized word string can be obtained. In addition, other information can also be decoded, including part of speech (POS), punctuation marks (PM), and two types of prosodic tags that can be used to construct the prosody-hierarchy structure of the test speech. Experimental results on the TCC300 database, which consists of long paragraphic utterances, showed that the proposed system significantly outperformed the baseline scheme, an HMM recognizer with a factored language model that models word, POS, and PM. Word, character, and base-syllable error rates of 20.7%, 14.4%, and 9.6% were obtained, corresponding to absolute error reductions of 3.7%, 3.7%, and 2.4% (or relative reductions of 15.2%, 20.4%, and 20%). An error analysis showed that many word segmentation errors and tone recognition errors were corrected.
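The relation between the absolute and relative error reductions quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `relative_reduction` is hypothetical, and the baseline error rates are inferred as the final error rate plus the absolute reduction.

```python
def relative_reduction(error_after: float, absolute_reduction: float) -> float:
    """Relative error reduction (%) from the final error rate and the absolute drop."""
    # Assumption: baseline error = final error + absolute reduction.
    baseline = error_after + absolute_reduction
    return 100.0 * absolute_reduction / baseline


# Figures quoted in the abstract: (final error rate %, absolute reduction %)
reported = {
    "word": (20.7, 3.7),
    "character": (14.4, 3.7),
    "base-syllable": (9.6, 2.4),
}

for unit, (after, drop) in reported.items():
    print(
        f"{unit}: baseline {after + drop:.1f}% -> {after:.1f}%, "
        f"relative reduction {relative_reduction(after, drop):.1f}%"
    )
```

Under that assumption, the computed relative reductions round to 15.2%, 20.4%, and 20.0%, which agrees with the figures in the abstract.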
international symposium on chinese spoken language processing | 2012
Chen-Yu Chiang; Sabato Marco Siniscalchi; Yih-Ru Wang; Sin-Horng Chen; Chin-Hui Lee
We present a cross-language knowledge integration framework to improve the performance of large vocabulary continuous speech recognition. Two types of knowledge sources, manner attribute and prosodic structure, are incorporated. For manner of articulation, cross-lingual attribute detectors trained with an American English corpus (WSJ0) are used to verify and rescore hypothesized Mandarin syllables in word lattices obtained with state-of-the-art systems. For the prosodic structure, models trained with an unsupervised joint prosody labeling and modeling technique using a Mandarin corpus (TCC300) are used in lattice rescoring. Experimental results on Mandarin syllable, character, and word recognition with the TCC300 corpus show that the proposed approach significantly outperforms the baseline system, which does not use articulatory and prosodic information. The results also demonstrate the potential of using the outputs of cross-lingual attribute detectors as a language-universal front end for automatic speech recognition.
international conference on acoustics, speech, and signal processing | 2011
Jyh-Her Yang; Ming-Chieh Liu; Hao-Hsiang Chang; Chen-Yu Chiang; Yih-Ru Wang; Sin-Horng Chen
This paper presents a new probabilistic framework for Mandarin speech recognition that incorporates a sophisticated hierarchical prosody model into the conventional HMM-based system. The prosody model describes the relations among linguistic cues of various levels, break types and prosodic states that represent the hierarchical prosodic structure, and prosody-related acoustic features. Aside from producing the recognized word sequences, the system also decodes other information, including the parts of speech of words, punctuation marks, inter-syllable break types, and prosodic states of syllables. Experimental results on the TCC300 corpus, which consists of paragraphic utterances, showed that the proposed system significantly outperformed the baseline system. The word and character error rates decreased from 24.4% and 18.1% to 20.7% and 14.4% (or 15.2% and 20.4% relative improvements), respectively.
international conference on acoustics, speech, and signal processing | 2002
Yih-Ru Wang; I-Je Wong; Teng-Chun Tsao
A new statistical pitch detection algorithm is proposed in this paper. It first defines frame-based voiced/unvoiced probabilistic measures and between-frame pitch transition probabilities, and then tightly integrates them into an ML search to find the best pitch contour of an utterance. Compared with the pitch detection method of the well-known ESPS software package, the proposed method performs better in both U/V decision and pitch frequency detection accuracy.
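The general idea of combining per-frame voiced/unvoiced scores with between-frame pitch transition scores in a single best-path search can be sketched with a standard Viterbi dynamic program. This is not the paper's algorithm: the function `viterbi_pitch`, the transition scoring, and the parameters `jump_penalty` and `uv_switch_cost` are all hypothetical choices made for illustration, and the frame scores are assumed to be log-likelihoods supplied by some front end.

```python
import math


def viterbi_pitch(frame_scores, candidates, jump_penalty=0.005, uv_switch_cost=1.0):
    """Find the best-scoring pitch contour by dynamic programming.

    frame_scores[t][k] is the log-score of state k at frame t, where state 0
    means "unvoiced" and state k >= 1 means pitch candidates[k - 1] in Hz.
    Returns one entry per frame: None for unvoiced, otherwise the pitch in Hz.
    """
    n_states = len(candidates) + 1

    def trans(i, j):
        # Hypothetical (unnormalized) transition log-score between states.
        if i == 0 and j == 0:
            return 0.0
        if i == 0 or j == 0:
            return -uv_switch_cost  # switching between voiced and unvoiced
        # Penalize pitch jumps in proportion to their size in cents.
        cents = 1200.0 * math.log2(candidates[j - 1] / candidates[i - 1])
        return -jump_penalty * abs(cents)

    best = list(frame_scores[0])
    back = []  # back[t - 1][j] = best predecessor of state j at frame t
    for t in range(1, len(frame_scores)):
        new_best, ptr = [], []
        for j in range(n_states):
            i_star = max(range(n_states), key=lambda i: best[i] + trans(i, j))
            new_best.append(best[i_star] + trans(i_star, j) + frame_scores[t][j])
            ptr.append(i_star)
        best = new_best
        back.append(ptr)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: best[k])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [None if k == 0 else candidates[k - 1] for k in path]


# Toy example: the middle frame slightly prefers 200 Hz (an octave error),
# but the transition penalty keeps the contour at 100 Hz.
candidates = [100.0, 200.0]
frame_scores = [
    [-5.0, 0.0, -3.0],
    [-5.0, -1.0, -0.5],
    [-5.0, 0.0, -3.0],
]
print(viterbi_pitch(frame_scores, candidates))  # [100.0, 100.0, 100.0]
```

A frame-by-frame decision would pick 200 Hz for the middle frame in this toy input; the joint search avoids that because an octave jump and back costs more than the small per-frame gain.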
international symposium on chinese spoken language processing | 2014
Po-Chun Wang; I-Bin Liao; Chen-Yu Chiang; Yih-Ru Wang; Sin-Horng Chen
This paper proposes a speaker adaptation method that adapts an existing speaking rate-dependent hierarchical prosodic model (SR-HPM) of an SR-controlled Mandarin TTS system to a new speaker's data in order to realize a new voice. Two main problems are addressed: data sparseness, because the few adaptation utterances exist only in a small range of normal speaking rates, and the lack of adaptation data in both the fast and slow speaking-rate ranges. The proposed method follows the idea of SR-HPM training: it first normalizes the prosodic-acoustic features of the new speaker's speech data, then trains an HPM by the prosody labeling and modeling algorithm, and finally refines the HPM into an SR-dependent model. The MAP adaptation method with model parameter extrapolation is applied to cope with the two problems above. Experimental results on a male speaker's adaptation data confirmed that the resulting adaptive SR-HPM has reasonable parameters covering a wide range of speaking rates and hence can be used in the TTS system to generate prosodic-acoustic features for synthesizing the new speaker's voice at any given SR.
international conference on acoustics, speech, and signal processing | 2008
Yih-Ru Wang
This paper proposes a supervised neural network-based signal change-point detector. The detector uses high-order statistics of log-likelihood difference functions as input features to improve detection performance. These high-order statistics can be easily calculated from the CCGMM coefficients of the signals. The detector was evaluated on a five-hour database of TV broadcast news. Experimental results showed that the equal error rate (EER) improved from 16.6%, achieved by the baseline method using the CCGMM-based divergence measure, to 14.4% with the proposed method.
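For readers unfamiliar with the metric, the equal error rate is the operating point at which the miss rate and the false-alarm rate coincide. The sketch below shows one simple way to approximate it from detector scores; the function name `equal_error_rate`, the threshold sweep, and the toy scores are illustrative assumptions and are not taken from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: detector scores at true change points.
    nontarget_scores: detector scores at positions with no change.
    A position is flagged as a change point when its score >= threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for th in thresholds:
        # Miss: a true change point scored below the threshold.
        miss = sum(s < th for s in target_scores) / len(target_scores)
        # False alarm: a non-change position scored at or above the threshold.
        fa = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2
    return eer


# Toy scores (made up for illustration only).
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.1, 0.2, 0.4, 0.6]
print(equal_error_rate(targets, nontargets))  # 0.25
```

With finite data the two error rates rarely cross exactly, so this sketch reports the average of the two rates at the threshold where they are closest.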
international conference on acoustics, speech, and signal processing | 2004
Wei-Chih Kuo; Yih-Ru Wang; Sin-Horng Chen
A model-based tone labeling method for Min-Nan/Taiwanese speech is proposed. It takes the mean and shape of syllable pitch contours as two modeling units and considers some major factors that control their variations. By using the EM algorithm to estimate all parameters of the pitch mean and shape models from a speech database, the best tone sequences pronounced in all utterances of the database can be determined. Experimental results show that the method outperforms the VQ classification method, which suffers from interference from neighboring syllables and from global prosodic phrase patterns.
international conference on acoustics, speech, and signal processing | 1998
Yih-Ru Wang; Sin-Horng Chen
This paper discusses an HMM-based Mandarin telephone speech recognition method for implementing a prototype system of an automatic telephone number directory service. The GPD/MCE training algorithm was adopted to train the HMMs for 100 final-dependent syllable initials and 40 syllable finals. The SBR method was used to compensate for speaker and channel effects. In addition, a recurrent neural network (RNN)-based pre-classification scheme was employed to speed up the recognition search. A syllable recognition rate of 53.7% was achieved. The method was then used to implement an isolated-word recognizer for the prototype system to discriminate among 1922 names of banks and insurance companies. Word recognition rates of 94.8% for top-1 and 97.9% for top-3 were achieved.
international conference on acoustics, speech, and signal processing | 2012
Chen-Yu Chiang; Yih-Ru Wang; Sin-Horng Chen
A novel statistical linguistic feature, called punctuation confidence, is proposed in this paper to assist prosodic break prediction in Mandarin text-to-speech. The punctuation confidence, calculated from the input text, is a measure of the likelihood of inserting a major punctuation mark (PM) at a word boundary. Since a punctuation mark in text tends to be pronounced as a break, the punctuation confidence associated with a punctuation estimate should provide useful information for break prediction from text. The idea is realized in this study by first employing a conditional random field (CRF)-based model to generate a predicted punctuation mark and its associated punctuation confidence for each word boundary. The predicted punctuation mark and its punctuation confidence are then combined with contextual linguistic features to predict the break type of the word boundary with a multi-layer perceptron (MLP). Experiments on the Treebank speech corpus confirmed the effectiveness of the proposed approach.