Publication


Featured research published by Sin-Horng Chen.


IEEE Transactions on Speech and Audio Processing | 1998

An RNN-based prosodic information synthesizer for Mandarin text-to-speech

Sin-Horng Chen; Shaw-Hwa Hwang; Yih-Ru Wang

A new RNN-based prosodic information synthesizer for Mandarin Chinese text-to-speech (TTS) is proposed in this paper. Its four-layer recurrent neural network (RNN) generates prosodic information such as syllable pitch contours, syllable energy levels, syllable initial and final durations, and intersyllable pause durations. The input layer and first hidden layer operate with a word-synchronized clock to represent current-word phonologic states within the prosodic structure of the text to be synthesized. The second hidden layer and output layer operate on a syllable-synchronized clock and use outputs from the preceding layers, along with additional syllable-level inputs fed directly to the second hidden layer, to generate the desired prosodic parameters. The RNN was trained on a large set of actual utterances accompanied by their associated texts, and can automatically learn many phonologic rules of human prosody, including the well-known Sandhi Tone 3 F0-change rule. Experimental results show that the synthesized prosodic parameter sequences match their original counterparts quite well, and a pitch-synchronous overlap-add (PSOLA) based Mandarin TTS system was used to test the approach. Although formal subjective tests are difficult to perform and remain future work, informal listening tests by a significant number of native Chinese speakers confirmed that the synthesized speech sounded quite natural.
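To make the two-clock architecture described above concrete, the following minimal numpy sketch implements a multi-rate recurrent network in the same spirit: a lower recurrent layer driven by a word-synchronized clock and an upper recurrent layer driven by a syllable-synchronized clock that also receives direct syllable-level inputs. The layer sizes, feature dimensions, and random weights are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def elman_step(x, h_prev, W_in, W_rec, b):
    """One step of a simple Elman recurrent layer: h = tanh(W_in x + W_rec h_prev + b)."""
    return np.tanh(W_in @ x + W_rec @ h_prev + b)

# Illustrative dimensions (assumptions, not taken from the paper).
D_WORD_FEAT = 20       # word-level linguistic features (e.g. POS, word length)
D_SYL_FEAT = 12        # syllable-level features (e.g. tone, initial/final class)
H_WORD, H_SYL = 16, 16
D_OUT = 6              # e.g. pitch-contour coefficients, energy, durations, pause

rng = np.random.default_rng(0)
Ww_in  = rng.normal(0, 0.1, (H_WORD, D_WORD_FEAT))
Ww_rec = rng.normal(0, 0.1, (H_WORD, H_WORD))
bw     = np.zeros(H_WORD)
Ws_in  = rng.normal(0, 0.1, (H_SYL, H_WORD + D_SYL_FEAT))
Ws_rec = rng.normal(0, 0.1, (H_SYL, H_SYL))
bs     = np.zeros(H_SYL)
W_out  = rng.normal(0, 0.1, (D_OUT, H_SYL))
b_out  = np.zeros(D_OUT)

def synthesize_prosody(words, syllables):
    """words: list of word-feature vectors (word-synchronized clock);
    syllables: list of (word_index, syllable-feature vector) pairs in
    utterance order (syllable-synchronized clock)."""
    h_word = np.zeros(H_WORD)
    word_states = []
    for w in words:                       # lower layers: one step per word
        h_word = elman_step(w, h_word, Ww_in, Ww_rec, bw)
        word_states.append(h_word)
    h_syl = np.zeros(H_SYL)
    outputs = []
    for word_idx, s in syllables:         # upper layers: one step per syllable
        x = np.concatenate([word_states[word_idx], s])
        h_syl = elman_step(x, h_syl, Ws_in, Ws_rec, bs)
        outputs.append(W_out @ h_syl + b_out)
    return np.array(outputs)

# Toy usage: a 2-word, 3-syllable "utterance" with random features.
words = [rng.normal(size=D_WORD_FEAT) for _ in range(2)]
syllables = [(0, rng.normal(size=D_SYL_FEAT)),
             (0, rng.normal(size=D_SYL_FEAT)),
             (1, rng.normal(size=D_SYL_FEAT))]
print(synthesize_prosody(words, syllables).shape)   # (3, 6)
```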


IEEE Transactions on Speech and Audio Processing | 2003

A new duration modeling approach for Mandarin speech

Sin-Horng Chen; Wen-Hsing Lai; Yih-Ru Wang

A new duration modeling approach for Mandarin speech is proposed. It explicitly accounts for several major affecting factors, modeling their influences as multiplicative companding factors (CFs), and estimates all model parameters by an EM algorithm. The three basic Tone 3 patterns (i.e., full tone, half tone, and sandhi tone) are also properly considered, using three different CFs to separate their effects on syllable duration. Experimental results show that the variance of syllable duration is greatly reduced from 180.17 to 2.52 frame² (1 frame = 5 ms) by the proposed duration modeling, which eliminates the effects of those affecting factors. Moreover, the estimated CFs of those affecting factors agree well with prior linguistic knowledge. Two extensions of the duration modeling method are also presented: one applies the same technique to model initial and final durations, and the other replaces the multiplicative model with an additive one. Lastly, a preliminary study of applying the proposed model to predict syllable duration for text-to-speech (TTS) is also performed. Experimental results show that it outperforms the conventional regression-based prediction method.
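The core idea of the multiplicative model can be illustrated with a small numpy sketch: each syllable duration is treated as a base duration scaled by the companding factors of its affecting factors, and the factors are recovered by alternating re-estimation (a simplification of the paper's EM procedure). The factor inventory and data below are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: syllable durations (in 5-ms frames) influenced by two affecting
# factors, labeled here "tone" and "final class" (the real model uses more).
N_TONE, N_FINAL = 5, 8
true_tone_cf  = rng.uniform(0.8, 1.2, N_TONE)
true_final_cf = rng.uniform(0.7, 1.3, N_FINAL)
tone  = rng.integers(0, N_TONE, 2000)
final = rng.integers(0, N_FINAL, 2000)
dur = 40.0 * true_tone_cf[tone] * true_final_cf[final] * rng.normal(1.0, 0.05, 2000)

# Alternating re-estimation of the multiplicative companding factors (CFs):
# divide out the current estimates of the other factors, then re-estimate each
# CF as a mean ratio.  This stands in for the paper's EM procedure.
tone_cf, final_cf = np.ones(N_TONE), np.ones(N_FINAL)
for _ in range(20):
    base = np.mean(dur / (tone_cf[tone] * final_cf[final]))   # base duration
    for t in range(N_TONE):
        m = tone == t
        tone_cf[t] = np.mean(dur[m] / (base * final_cf[final[m]]))
    for f in range(N_FINAL):
        m = final == f
        final_cf[f] = np.mean(dur[m] / (base * tone_cf[tone[m]]))

# Removing the modeled factor effects leaves a much smaller residual variance.
residual = dur - base * tone_cf[tone] * final_cf[final]
print("residual variance (frames^2):", residual.var())
```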


International Conference on Spoken Language Processing | 1996

A Mandarin text-to-speech system

Shaw-Hwa Hwang; Sin-Horng Chen; Yih-Ru Wang

The implementation of a high-performance Mandarin TTS system is presented. The system is composed of four main parts: text analysis (TA), prosodic information generation (PIG), a waveform table (WT) of 411 base-syllables, and PSOLA-based waveform synthesis (PSOLA). In TA, a statistical-model-based method is first employed to automatically tag the input text to obtain the word sequence and the associated part-of-speech (POS) sequence. A lexicon containing about 80,000 words is used in the tagging process. The corresponding base-syllable sequence is then found and used to retrieve the basic waveform sequence from the WT. Some linguistic features used in PIG are also extracted in TA. In PIG, a four-layer recurrent neural network (RNN) is employed to generate prosodic information, including the pitch contour, energy level, and initial and final durations of each syllable, as well as inter-syllable pause durations. Finally, in PSOLA, the basic waveform sequence is modified using the prosodic information to generate the output synthetic speech. The whole system is implemented in software on a PC/AT 486 with a 16-bit Sound Blaster add-on card, and only 3.2 Mbytes of memory are required. It can synthesize speech in real time for any input Chinese text. Informal listening tests by many native Chinese speakers living in Taiwan confirmed that the synthetic speech sounded very fluent and natural.
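The following skeleton sketches the four-stage data flow described above (TA, PIG, waveform-table lookup, PSOLA). The data types and placeholder function bodies are illustrative assumptions intended only to show how the stages connect, not the system's actual code.

```python
from dataclasses import dataclass
from typing import List

# Skeleton of the four-stage pipeline (TA -> PIG -> waveform table -> PSOLA).
# The data types and the placeholder bodies are illustrative assumptions.

@dataclass
class Syllable:
    base_syllable: str             # one of the 411 Mandarin base-syllables
    tone: int
    linguistic_feats: List[float]  # features extracted by text analysis

def text_analysis(text: str) -> List[Syllable]:
    """TA: segment and POS-tag the text with a lexicon and a statistical
    tagger, then map words to base-syllables and linguistic features."""
    ...

def generate_prosody(syllables):
    """PIG: an RNN maps linguistic features to pitch contours, energy levels,
    initial/final durations, and inter-syllable pauses."""
    ...

def lookup_waveforms(syllables):
    """WT: fetch the stored waveform of each base-syllable from the table."""
    ...

def psola_synthesis(waveforms, prosody):
    """PSOLA: adjust pitch and duration of each unit and concatenate."""
    ...

def synthesize(text: str):
    syls = text_analysis(text)
    return psola_synthesis(lookup_waveforms(syls), generate_prosody(syls))
```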


Journal of the Acoustical Society of America | 2005

A statistics-based pitch contour model for Mandarin speech

Sin-Horng Chen; Wen-Hsing Lai; Yih-Ru Wang

A statistics-based syllable pitch contour model for Mandarin speech is proposed. The approach takes the mean and the shape of a syllable log-pitch contour as two basic modeling units and considers several affecting factors that contribute to their variations. The affecting factors include the speaker, the prosodic state (which essentially represents the high-level linguistic components of F0), the tone, and the initial and final syllable classes. The parameters of the two modeling units were automatically estimated using the expectation-maximization (EM) algorithm. Experimental results showed that the root mean squared errors (RMSEs) of the reconstructed pitch period were 0.362 and 0.373 ms in the closed and open tests, respectively. The model provides a way to separate the effects of several major factors, and all of the inferred values of the affecting factors were in close agreement with prior linguistic knowledge. It also gives a quantitative and more complete description of the coarticulation effect of neighboring tones, rather than the conventional qualitative descriptions of the tone sandhi rules. In addition, the model can provide useful cues for determining prosodic phrase boundaries, including those occurring at intersyllable locations with or without punctuation marks.
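The additive structure of the model can be illustrated as follows: the syllable log-F0 mean is predicted as a sum of contributions from the affecting factors, and the residual is what remains unexplained. All numeric values and the reduced factor set in this sketch are toy assumptions, not estimates from the paper.

```python
import numpy as np

# Toy additive decomposition of a syllable's log-F0 mean; factor values are
# illustrative assumptions, not estimates from the paper.
global_mean    = 4.98    # corpus-wide mean of log-F0 (log Hz)
speaker_shift  = -0.10   # this speaker speaks slightly lower than average
prosodic_state = +0.15   # e.g. a phrase-initial "high" prosodic state
tone_effect    = {1: +0.12, 2: -0.02, 3: -0.15, 4: +0.05, 5: -0.08}  # tones 1-4 + neutral
final_effect   = -0.03   # effect of the syllable's final class

def predicted_logf0_mean(tone):
    """Sum of the modeled factor contributions for one syllable."""
    return global_mean + speaker_shift + prosodic_state + tone_effect[tone] + final_effect

observed = 5.05                                  # observed syllable log-F0 mean (toy)
residual = observed - predicted_logf0_mean(2)    # what the factors leave unexplained
print(np.round(residual, 3))
```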


IEEE Transactions on Speech and Audio Processing | 1998

An RNN-based preclassification method for fast continuous Mandarin speech recognition

Sin-Horng Chen; Yuan-Fu Liao; Song-Mao Chiang; Saga Chang

A novel recurrent neural network-based (RNN-based) front-end preclassification scheme for fast continuous Mandarin speech recognition is proposed. First, an RNN is employed to discriminate each input frame among the three broad classes of initial, final, and silence. A finite state machine (FSM) is then used to classify the input frame into one of four states: the three stable states of initial (I), final (F), and silence (S), and a transient (T) state. The decision is based on examining whether the RNN discriminates well between classes. The search space for the three stable states is then restricted in the following DP search to speed up the recognition process. The efficiency of the proposed scheme was examined by simulations in which it was incorporated into a hidden Markov model-based (HMM-based) continuous Mandarin 411 base-syllable recognizer. The experimental results showed that the scheme can be used in conjunction with beam search to greatly reduce the computational complexity of the HMM recognizer while keeping the recognition rate almost undegraded.
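A minimal sketch of the preclassification idea, assuming the RNN outputs per-frame scores for the three broad classes: a frame receives a stable label (I, F, or S) only when the RNN discriminates clearly, and is otherwise marked transient (T). This margin rule is a simplification of the paper's FSM-based decision; the threshold and scores are illustrative.

```python
import numpy as np

def preclassify(frame_scores, margin=0.3):
    """Assign each frame one of the four states I, F, S, T from its RNN scores
    for (initial, final, silence).  A stable label is given only when the top
    class wins by a clear margin; otherwise the frame is marked transient (T).
    The margin rule simplifies the paper's FSM; the threshold is illustrative."""
    names = "IFS"
    labels = []
    for scores in frame_scores:
        order = np.argsort(scores)[::-1]
        if scores[order[0]] - scores[order[1]] >= margin:
            labels.append(names[order[0]])
        else:
            labels.append("T")
    return labels

# Toy usage: three confident frames and one ambiguous frame.
scores = np.array([[0.90, 0.05, 0.05],   # clearly initial  -> I
                   [0.10, 0.80, 0.10],   # clearly final    -> F
                   [0.05, 0.10, 0.85],   # clearly silence  -> S
                   [0.40, 0.35, 0.25]])  # ambiguous        -> T
print(preclassify(scores))               # ['I', 'F', 'S', 'T']
```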


International Conference on Acoustics, Speech, and Signal Processing | 1992

A first study on neural net based generation of prosodic and spectral information for Mandarin text-to-speech

Sin-Horng Chen; Shaw-Hwa Hwang; Chun-Yu Tsai

A neural-network-based approach to generating prosodic and spectral information of syllables for Mandarin text-to-speech synthesis is studied. Contextual features are first extracted from the given input text by text analysis and taken as input signals for synthesis. Then, six multilayer perceptrons are employed to generate the pause duration, the syllable duration, and the pitch mean and shape of one- and two-syllable synthesis units. For spectral information, several reproduction templates of proper size are first generated for each synthesis unit. The objective is to generate spectral patterns of the syllable that can be directly concatenated to synthesize natural speech without further modification. The validity of this novel approach was examined by simulation using a database of sentential utterances recorded from TV news, reported by a single female announcer. Experimental results confirmed that this is a promising approach for Mandarin text-to-speech synthesis.


Pattern Recognition | 1995

Speech recognition with hierarchical recurrent neural networks

Wen-Yuan Chen; Yuan-Fu Liao; Sin-Horng Chen

A hierarchical recurrent neural network (HRNN) for speech recognition is presented. The HRNN is trained by a generalized probabilistic descent (GPD) algorithm; consequently, the difficulty of empirically selecting an appropriate target function for training RNNs can be avoided. Results obtained in this study indicate that the proposed HRNN is capable of absorbing the temporal variation of speech patterns while possessing effective discrimination capability. The scaling problem of RNNs is also greatly reduced. Additionally, a realization of the system using initial/final sub-syllable models for isolated Mandarin syllable recognition is undertaken, and the experimental results confirm the effectiveness of the proposed HRNN.
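For readers unfamiliar with GPD training, the sketch below shows a generic smoothed minimum-classification-error loss of the kind minimized by GPD-style algorithms; the exact formulation used in the paper may differ.

```python
import numpy as np

def mce_loss(scores, correct, eta=4.0, gamma=1.0):
    """Smoothed minimum-classification-error loss of the kind minimized by
    GPD-style training: a misclassification measure d compares the correct
    class score with a soft maximum over competitors, and a sigmoid maps d
    to a differentiable 0-1-like loss.  Generic form, not the paper's exact one."""
    scores = np.asarray(scores, dtype=float)
    competitors = np.delete(scores, correct)
    soft_max = np.log(np.mean(np.exp(eta * competitors))) / eta
    d = -scores[correct] + soft_max          # > 0 means a competitor wins
    return 1.0 / (1.0 + np.exp(-gamma * d))

# Toy usage: the loss is near 0 when the correct class clearly wins,
# and near 1 when a competitor wins.
print(mce_loss([5.0, 1.0, 0.5], correct=0))  # ~0.02
print(mce_loss([1.0, 5.0, 0.5], correct=0))  # ~0.98
```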


IEEE Transactions on Audio, Speech, and Language Processing | 2014

Modeling of speaking rate influences on Mandarin speech prosody and its application to speaking rate-controlled TTS

Sin-Horng Chen; Chiao-Hua Hsieh; Chen-Yu Chiang; Hsi-Chun Hsiao; Yih-Ru Wang; Yuan-Fu Liao; Hsiu-Min Yu

A new data-driven approach is proposed for building a speaking rate-dependent hierarchical prosodic model (SR-HPM) directly from a large, prosody-unlabeled speech database containing utterances at various speaking rates, in order to describe the influences of speaking rate on Mandarin speech prosody. It is an extended version of the existing HPM, which contains 12 sub-models to describe various relationships among the prosodic-acoustic features of the speech signal, the linguistic features of the associated text, and the prosodic tags representing the prosodic structure of the speech. Two main modifications are proposed. One is designing proper normalization functions from the statistics of the whole database to compensate for the influences of speaking rate on all prosodic-acoustic features. The other is modifying the HPM training to make its parameters speaking-rate dependent. Experimental results on a large Mandarin read-speech corpus showed that the parameters of the SR-HPM, together with these feature normalization functions, interpreted the effects of speaking rate on Mandarin speech prosody very well. An application of the SR-HPM to the design and implementation of a speaking rate-controlled Mandarin TTS system is demonstrated. The system can generate natural synthetic speech for any given speaking rate in a wide range of 3.4-6.8 syllables/sec. Two subjective tests, a MOS test and a preference test, were conducted to compare the proposed system with the popular HTS system. The MOS scores of the proposed system were in the range of 3.58-3.83 for eight different speaking rates, while those of HTS were in the range of 3.09-3.43. Moreover, the proposed system had higher preference scores (49.8%-79.6%) than HTS (9.8%-30.7%). These results confirm the effectiveness of the speaking rate control method of the proposed TTS system.
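A minimal sketch of speaking-rate normalization under simple assumptions: the mean and standard deviation of a prosodic-acoustic feature are fitted as smooth functions of speaking rate over a (toy) corpus, and each feature value is then z-normalized with the statistics predicted for its utterance's speaking rate. The quadratic fit and toy data are assumptions; the paper derives its normalization functions from corpus statistics.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy corpus: speaking rates (syllables/sec) and syllable durations (sec).
sr  = rng.uniform(3.4, 6.8, 4000)
dur = 1.0 / sr + rng.normal(0, 0.02, 4000)

# Fit mean(SR) and a rough std(SR) of the feature as quadratic functions of
# speaking rate (the functional form is an assumption for illustration).
mean_coef = np.polyfit(sr, dur, 2)
resid = dur - np.polyval(mean_coef, sr)
std_coef = np.polyfit(sr, np.abs(resid) * np.sqrt(np.pi / 2), 2)

def normalize(feature, speaking_rate):
    """Express a feature value relative to the statistics predicted
    for its utterance's speaking rate."""
    mu = np.polyval(mean_coef, speaking_rate)
    sd = np.polyval(std_coef, speaking_rate)
    return (feature - mu) / sd

print(normalize(0.22, 4.5))   # a 0.22-s syllable at 4.5 syl/sec, SR-normalized
```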


International Conference on Acoustics, Speech, and Signal Processing | 1995

A prosodic model of Mandarin speech and its application to pitch level generation for text-to-speech

Shaw-Hwa Hwang; Sin-Horng Chen

A prosodic model of Mandarin speech is proposed to simulate the human pronunciation mechanism by exploring the hidden pronunciation states embedded in the input text. Parameters representing these pronunciation states are then used to assist prosodic information generation. A multirate recurrent neural network (MRNN) is employed to realize the prosodic model. Two learning methods are proposed to train the MRNN. One is an indirect method that first uses an additional SRNN to track the dynamics of the prosodic information of the utterance and then takes the outputs of its hidden layer as desired targets to train the MRNN. The other is a direct training method that integrates the MRNN with the following MLP prosody synthesizers to directly learn the relation between the input linguistic features and the output prosodic information. Simulation results confirmed the effectiveness of the approach: most synthesized prosodic parameter sequences match their original counterparts quite well.


IEEE Transactions on Audio, Speech, and Language Processing | 2012

A New Prosody-Assisted Mandarin ASR System

Sin-Horng Chen; Jyh-Her Yang; Chen-Yu Chiang; Ming-Chieh Liu; Yih-Ru Wang

This paper presents a new prosody-assisted automatic speech recognition (ASR) system for Mandarin speech. It differs from the conventional approach of using simple prosodic cues by employing a sophisticated prosody modeling approach based on a four-layer prosody-hierarchy structure to automatically generate 12 prosodic models from a large unlabeled speech database with the previously proposed joint prosody labeling and modeling (PLM) algorithm. By incorporating these 12 prosodic models into a two-stage ASR system to rescore the word lattice generated in the first stage by a conventional hidden Markov model (HMM) recognizer, a better recognized word string can be obtained. In addition, other information can also be decoded, including part of speech (POS), punctuation marks (PMs), and two types of prosodic tags which can be used to construct the prosody-hierarchy structure of the test speech. Experimental results on the TCC300 database, which consists of long paragraphic utterances, showed that the proposed system significantly outperformed the baseline scheme using an HMM recognizer with a factored language model over words, POS, and PMs. Word, character, and base-syllable error rates of 20.7%, 14.4%, and 9.6% were obtained, corresponding to 3.7%, 3.7%, and 2.4% absolute (or 15.2%, 20.4%, and 20% relative) error reductions. An error analysis showed that many word segmentation errors and tone recognition errors were corrected.
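A minimal sketch of the second-stage rescoring idea: each candidate from the first-pass word lattice carries an HMM (acoustic plus language model) log score, a prosodic-model log score is added with an interpolation weight, and the best rescored candidate is selected. The candidate structure, the single weight, and the toy Mandarin strings are illustrative assumptions, not the paper's formulation.

```python
def rescore(candidates, weight=0.3):
    """Pick the candidate maximizing the combined first-pass (HMM) log score
    and the weighted prosodic-model log score."""
    best = max(candidates, key=lambda c: c["hmm_score"] + weight * c["prosody_score"])
    return best["words"]

# Toy usage: the prosodic score reverses the first-pass ranking, correcting a
# word segmentation error (both candidates are invented for illustration).
candidates = [
    {"words": "台北 市長 選舉", "hmm_score": -120.0, "prosody_score": -40.0},
    {"words": "台北 市 長選 舉", "hmm_score": -119.0, "prosody_score": -55.0},
]
print(rescore(candidates))   # 台北 市長 選舉
```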

Collaboration


Dive into Sin-Horng Chen's collaborations.

Top Co-Authors

Yih-Ru Wang (National Chiao Tung University)
Chen-Yu Chiang (National Chiao Tung University)
Yuan-Fu Liao (National Taipei University of Technology)
Wen-Hsing Lai (National Chiao Tung University)
Jyh-Her Yang (National Chiao Tung University)
Shaw-Hwa Hwang (National Chiao Tung University)
Wei-Chih Kuo (National Chiao Tung University)
Chiao-Hua Hsieh (National Chiao Tung University)