Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Hsin-Te Hwang is active.

Publication


Featured research published by Hsin-Te Hwang.


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2013

Incorporating global variance in the training phase of GMM-based voice conversion

Hsin-Te Hwang; Yu Tsao; Hsin-Min Wang; Yih-Ru Wang; Sin-Horng Chen

Maximum likelihood-based trajectory mapping considering global variance (MLGV-based trajectory mapping) has been proposed for improving the quality of the converted speech of Gaussian mixture model-based voice conversion (GMM-based VC). Although the quality of the converted speech is significantly improved, the computational cost of the online conversion process is also increased, because there is no closed-form solution for parameter generation in MLGV-based trajectory mapping and an iterative process is generally required. To reduce the online computational cost, we propose to incorporate GV in the training phase of GMM-based VC. The conversion process can then simply adopt ML-based trajectory mapping (without considering GV in the conversion phase), which has a closed-form solution. In this way, the quality of the converted speech is expected to improve without increasing the online computational cost. Our experimental results demonstrate that the proposed method yields a significant improvement in the quality of the converted speech compared to the conventional GMM-based VC method. Meanwhile, compared to MLGV-based trajectory mapping, the proposed method provides comparable converted speech quality with reduced computational cost in the conversion process.
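
The closed-form step the proposed method falls back on at conversion time is standard ML parameter generation. As a rough illustration, here is a minimal numpy sketch of that per-dimension solve; the window matrix W and the per-frame means and precisions are assumed to come from the trained GMM, and all names are illustrative rather than the paper's implementation.

```python
import numpy as np

def mlpg_closed_form(W, mean, precision):
    """Closed-form ML trajectory generation for one feature dimension.

    W         : (3T, T) window matrix stacking static/delta/delta-delta rows
    mean      : (3T,)   concatenated per-frame means from the GMM
    precision : (3T,)   diagonal precisions (inverse variances), same ordering

    Solves (W^T P W) y = W^T P mean for the static trajectory y.
    """
    P = np.diag(precision)        # diagonal precision matrix
    A = W.T @ P @ W               # (T, T) normal-equation matrix
    b = W.T @ P @ mean            # (T,) right-hand side
    return np.linalg.solve(A, b)  # static feature trajectory, length T
```

Because the system is linear, a single solve per feature dimension replaces the iterative optimization that GV-aware generation would otherwise require online.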


Conference of the International Speech Communication Association | 2016

Locally Linear Embedding for Exemplar-Based Spectral Conversion

Yi-Chiao Wu; Hsin-Te Hwang; Chin-Cheng Hsu; Yu Tsao; Hsin-Min Wang

This paper describes a novel exemplar-based spectral conversion (SC) system developed by the AST (Academia Sinica, Taipei) team for the 2016 voice conversion challenge (vcc2016). The key feature of our system is that it integrates the locally linear embedding (LLE) algorithm, a manifold learning algorithm that has been successfully applied to the super-resolution task in image processing, with the conventional exemplar-based SC method. To further improve the quality of the converted speech, our system also incorporates (1) the maximum likelihood parameter generation (MLPG) algorithm, (2) the postfiltering-based global variance (GV) compensation method, and (3) a high-resolution feature extraction process. The results of the subjective evaluation conducted by the vcc2016 organizer show that our LLE-based exemplar SC system notably outperforms the baseline GMM-based system (implemented by the vcc2016 organizer). Moreover, our own internal evaluation results confirm that the core LLE-based exemplar SC method and the three additional techniques each contribute to improved speech quality.
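
To make the exemplar-based conversion step concrete, the following is a minimal numpy sketch of how LLE reconstruction weights can be solved over the nearest source exemplars and reused on the paired target exemplars. The dictionary shapes, neighborhood size, and regularizer are illustrative assumptions, not the system's actual settings.

```python
import numpy as np

def lle_convert(x, src_dict, tgt_dict, k=8, reg=1e-5):
    """Convert one source frame x using paired exemplar dictionaries.

    src_dict, tgt_dict : (N, D) aligned source/target exemplars.
    1) find the k nearest source exemplars,
    2) solve for weights that best reconstruct x from them,
    3) apply the same weights to the paired target exemplars.
    """
    d2 = np.sum((src_dict - x) ** 2, axis=1)
    nn = np.argsort(d2)[:k]                    # k nearest neighbours
    Z = src_dict[nn] - x                       # shift neighbours to origin
    G = Z @ Z.T                                # local Gram matrix (k, k)
    G += reg * np.trace(G) * np.eye(k)         # regularize for stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                               # weights sum to one
    return w @ tgt_dict[nn]                    # converted frame
```

The locality assumption is that source and target spectral spaces share a local linear structure, so weights solved on the source side transfer to the target side.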


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2015

Improving denoising auto-encoder based speech enhancement with the speech parameter generation algorithm

Syu-Siang Wang; Hsin-Te Hwang; Ying-Hui Lai; Yu Tsao; Xugang Lu; Hsin-Min Wang; Borching Su

This paper investigates the use of the speech parameter generation (SPG) algorithm, which has been successfully adopted in deep neural network (DNN)-based voice conversion (VC) and speech synthesis (SS), for incorporating temporal information to improve deep denoising auto-encoder (DDAE)-based speech enhancement. In our previous studies, we confirmed that the DDAE could effectively suppress noise components in noise-corrupted speech. However, because the DDAE converts speech in a frame-by-frame manner, the enhanced speech exhibits some discontinuity even when context features are used as input to the DDAE. To handle this issue, this study proposes using the SPG algorithm as a post-processor to transform the DDAE-processed feature sequence into one with a smoothed trajectory. Two types of temporal information for SPG are investigated in this study: static-dynamic features and context features. Experimental results show that SPG with context features outperforms SPG with static-dynamic features and the baseline system, which considers context features without SPG, in terms of standardized objective tests across different noise types and SNR levels.
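
SPG-style smoothing relies on a window matrix that relates static features to their dynamic counterparts. As a sketch under common assumptions (the usual [-0.5, 0, 0.5] delta and [1, -2, 1] delta-delta windows, with clipped boundaries), one illustrative construction is:

```python
import numpy as np

def build_window_matrix(T):
    """Stack static, delta, and delta-delta regression rows into W (3T x T).

    Rows are interleaved per frame (static, delta, delta-delta for frame t),
    so the mean/precision vectors fed to the generation solve must follow
    the same ordering. Boundary frames are simply clipped.
    """
    windows = [np.array([1.0]),
               np.array([-0.5, 0.0, 0.5]),
               np.array([1.0, -2.0, 1.0])]
    W = np.zeros((3 * T, T))
    for t in range(T):
        for k, win in enumerate(windows):
            offset = len(win) // 2
            for j, c in enumerate(win):
                col = min(max(t + j - offset, 0), T - 1)  # clip at edges
                W[3 * t + k, col] += c
    return W
```

With W in hand, the post-processing step is the same closed-form trajectory solve sketched for the GMM-based VC paper above, applied to the DDAE's output sequence.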


International Symposium on Chinese Spoken Language Processing | 2012

Exploring mutual information for GMM-based spectral conversion

Hsin-Te Hwang; Yu Tsao; Hsin-Min Wang; Yih-Ru Wang; Sin-Horng Chen

In this paper, we propose a maximum mutual information (MMI) training criterion to refine the parameters of the joint density GMM (JDGMM) set to tackle the over-smoothing issue in voice conversion (VC). Conventionally, the maximum likelihood (ML) criterion is used to train a JDGMM set, which characterizes the joint distribution of the source and target feature vectors. The MMI training criterion, on the other hand, updates the parameters of the JDGMM set to increase its capability of modeling the dependency between the source and target feature vectors, and thus to make the converted sounds closer to natural ones. Subjective listening tests demonstrate that the quality and individuality of the speech converted by the proposed ML followed by MMI (ML+MMI) training method are better than those of speech converted by the ML training method.
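
As a conceptual illustration only, the quantity that MMI-style training pushes up can be written as the conditional log-likelihood log p(y|x) = log p(x,y) - log p(x) under the JDGMM. The sketch below evaluates it with scipy; the paper's actual parameter-update procedure is not reproduced here, and all shapes are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mmi_objective(Z, weights, means, covs, dx):
    """Conceptual MMI score for a joint-density GMM (JDGMM).

    Z       : (N, dx+dy) stacked source/target frames z = [x; y]
    weights : (M,) mixture weights
    means   : (M, dx+dy) joint means
    covs    : (M, dx+dy, dx+dy) joint covariances
    Returns sum_n [ log p(x,y) - log p(x) ], i.e. the conditional
    log-likelihood that MMI-style training increases.
    """
    def gmm_ll(X, mu, sig):
        comp = np.stack([multivariate_normal.logpdf(X, m, s)
                         for m, s in zip(mu, sig)], axis=1)
        return np.logaddexp.reduce(comp + np.log(weights), axis=1)

    joint = gmm_ll(Z, means, covs)                               # log p(x, y)
    marg = gmm_ll(Z[:, :dx], means[:, :dx], covs[:, :dx, :dx])   # log p(x)
    return np.sum(joint - marg)
```

Maximizing this score rewards components that model how the target depends on the source, rather than just fitting the joint density, which is the intuition behind the MMI refinement.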


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2015

A probabilistic interpretation for artificial neural network-based voice conversion

Hsin-Te Hwang; Yu Tsao; Hsin-Min Wang; Yih-Ru Wang; Sin-Horng Chen

Voice conversion (VC) using artificial neural networks (ANNs) has been shown to produce converted speech of better sound quality than VC using a Gaussian mixture model (GMM). Although ANN-based VC works reasonably well, there is still room for further improvement. One of the promising directions is to adopt the successful techniques in statistical model-based parameter generation (SMPG), such as the trajectory-based mapping approaches originally designed for GMM-based VC and hidden Markov model (HMM)-based speech synthesis. This study presents a probabilistic interpretation for ANN-based VC, through which ANN-based VC can easily incorporate the successful SMPG techniques. Experimental results demonstrate that the performance of ANN-based VC can be effectively improved by two trajectory-based mapping techniques, the maximum likelihood parameter generation (MLPG) algorithm and maximum likelihood-based trajectory mapping considering global variance (referred to as MLGV), compared to the conventional ANN-based VC with frame-based mapping and GMM-based VC with the MLPG algorithm. Moreover, ANN-based VC with the trajectory-based mapping techniques achieves performance comparable to the state-of-the-art GMM-based VC with the MLGV algorithm.
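
One simple reading of the probabilistic interpretation is to treat the ANN's frame-wise outputs as Gaussian means with a globally estimated variance, which is all that trajectory-based mapping such as MLPG needs. The sketch below is a guess at that recipe, not the paper's exact formulation; predict, the feature layout, and the single global variance are all assumptions.

```python
import numpy as np

def ann_to_gaussian_stats(predict, X_train, Y_train):
    """Turn a trained frame-wise ANN into per-frame Gaussians.

    predict : function mapping (N, Din) -> (N, Dout) static+dynamic features.
    Treats each ANN output as a Gaussian mean and estimates one global
    diagonal variance from the training residuals, which is enough to run
    MLPG-style trajectory generation on top of the network.
    """
    residual = Y_train - predict(X_train)
    var = residual.var(axis=0) + 1e-8              # global diagonal variance

    def frame_gaussians(X):
        means = predict(X)                         # (N, Dout) per-frame means
        precisions = np.broadcast_to(1.0 / var, means.shape)
        return means, precisions                   # feed these into MLPG

    return frame_gaussians
```

Once the network's outputs carry means and precisions, the same generation machinery used for GMM-based VC applies unchanged.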


International Symposium on Chinese Spoken Language Processing | 2014

Acoustic feature conversion using a polynomial based feature transferring algorithm

Syu-Siang Wang; Payton Lin; Dau-Cheng Lyu; Yu Tsao; Hsin-Te Hwang; Borching Su

This study proposes a polynomial based feature transferring (PFT) algorithm for acoustic feature conversion. The PFT process consists of an estimation phase and a conversion phase. The estimation phase computes a polynomial based transfer function using only a small set of parallel source and target features. With the estimated transfer function, the conversion phase converts large sets of source features to target ones. This study evaluates the proposed PFT algorithm on a robust automatic speech recognition (ASR) task using the Aurora-2 database. The source features were MFCCs with cepstral mean and variance normalization (CMVN), and the target features were advanced front-end (AFE) features. Compared to CMVN, AFE provides better robust speech recognition performance but incurs a more complicated and computationally expensive feature extraction process. With PFT, we intend to use a simple transfer function to obtain AFE-like acoustic features from the source CMVN features. Experimental results on Aurora-2 demonstrate that PFT generates AFE-like features that notably improve the CMVN performance and approach the results achieved by AFE. Furthermore, the recognition accuracy of PFT was better than that of histogram equalization (HEQ) and polynomial based histogram equalization (PHEQ). These results confirm the effectiveness of PFT with just a few sets of parallel features.
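
A minimal sketch of what a polynomial transfer function could look like is given below, fitting one least-squares polynomial per feature dimension on the small parallel set and applying it to new features. The dimension-wise form and the polynomial order are assumptions for illustration; the paper's exact transfer function may differ.

```python
import numpy as np

def fit_pft(src, tgt, order=3):
    """Fit one polynomial transfer function per feature dimension.

    src, tgt : (N, D) parallel source (CMVN) and target (AFE) features,
               with N deliberately small.
    Returns a list of per-dimension polynomial coefficients.
    """
    return [np.polyfit(src[:, d], tgt[:, d], order)
            for d in range(src.shape[1])]

def apply_pft(coeffs, src):
    """Convert a (possibly large) set of source features dimension-wise."""
    return np.stack([np.polyval(c, src[:, d])
                     for d, c in enumerate(coeffs)], axis=1)
```

The appeal of this form is that estimation needs only a handful of parallel frames, while conversion is a cheap per-dimension polynomial evaluation.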


International Symposium on Chinese Spoken Language Processing | 2006

A Hakka text-to-speech system

Hsiu-Min Yu; Hsin-Te Hwang; Dong-Yi Lin; Sin-Horng Chen

In this paper, the implementation of a Hakka text-to-speech (TTS) system is presented. The system is designed based on the same principles as the Mandarin and Min-Nan TTS systems proposed previously. It takes 671 base syllables as basic synthesis units and uses a recurrent neural network (RNN)-based prosody generator to generate proper prosodic parameters for synthesizing natural output speech. The whole system is implemented in software and runs in real time on a PC. An informal subjective listening test confirmed that the system performs well: the synthesized speech sounded good for well-tokenized texts and fair for texts with automatic tokenization.
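
For illustration, an Elman-style recurrent forward pass mapping per-syllable linguistic features to prosodic parameters might look like the sketch below. The topology, feature set, and shapes are assumptions; the paper's RNN prosody generator is not specified here.

```python
import numpy as np

def rnn_prosody_forward(X, Wxh, Whh, Why, bh, by):
    """Forward pass of a simple recurrent prosody generator.

    X : (T, Din) per-syllable linguistic features (tone, position, ...)
    Returns (T, Dout) prosodic parameters (e.g. pitch, duration, energy).
    """
    h = np.zeros(Whh.shape[0])
    out = []
    for x in X:
        h = np.tanh(Wxh @ x + Whh @ h + bh)  # recurrent hidden state
        out.append(Why @ h + by)             # prosodic parameters per syllable
    return np.stack(out)
```

The recurrence lets prosody for each syllable depend on its predecessors, which is why an RNN suits contour-like targets such as pitch.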


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2016

Voice conversion from non-parallel corpora using variational auto-encoder

Chin-Cheng Hsu; Hsin-Te Hwang; Yi-Chiao Wu; Yu Tsao; Hsin-Min Wang

We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora. Many SC frameworks require parallel corpora, phonetic alignments, or explicit frame-wise correspondence for learning conversion functions or for synthesizing a target spectrum with the aid of alignments. However, these requirements severely limit the scope of practical applications of SC due to the scarcity or even unavailability of parallel corpora. We propose an SC framework based on a variational auto-encoder that enables us to exploit non-parallel corpora. The framework comprises an encoder that learns speaker-independent phonetic representations and a decoder that learns to reconstruct spectra for the designated speaker. It removes the requirement of parallel corpora or phonetic alignments for training a spectral conversion system. We report objective and subjective evaluations to validate the proposed method and compare it to SC methods that have access to aligned corpora.
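
A minimal PyTorch sketch of the encoder-decoder idea is given below: the encoder sees only spectra, the decoder is conditioned on a speaker code, and conversion amounts to decoding the source latent with the target speaker's code. Layer sizes, the embedding table, and the feature dimension are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SpectralVAE(nn.Module):
    """Illustrative conditional VAE for spectral conversion.

    The encoder sees only spectra, so the latent code tends toward
    speaker-independent phonetic content; the decoder is conditioned on
    a speaker embedding, enabling conversion without parallel data.
    """
    def __init__(self, dim=513, latent=64, n_speakers=10, emb=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent)
        self.logvar = nn.Linear(256, latent)
        self.spk = nn.Embedding(n_speakers, emb)
        self.dec = nn.Sequential(nn.Linear(latent + emb, 256), nn.ReLU(),
                                 nn.Linear(256, dim))

    def forward(self, x, speaker_id):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        x_hat = self.dec(torch.cat([z, self.spk(speaker_id)], dim=-1))
        return x_hat, mu, logvar

# Training minimizes reconstruction loss plus the KL term; conversion
# encodes source frames and decodes with the target speaker's id.
```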


International Conference on Acoustics, Speech, and Signal Processing | 2017

A locally linear embedding based postfiltering approach for speech enhancement

Yi-Chiao Wu; Hsin-Te Hwang; Syu-Siang Wang; Chin-Cheng Hsu; Ying-Hui Lai; Yu Tsao; Hsin-Min Wang

This paper presents a novel postfiltering approach based on the locally linear embedding (LLE) algorithm for speech enhancement (SE). The aim of the proposed LLE-based postfiltering approach is to further remove residual noise components from the SE-processed speech signals through a spectral conversion process, thereby increasing the signal-to-noise ratio (SNR) and speech quality. The proposed postfiltering approach consists of two phases. In the offline phase, paired SE-processed and clean speech exemplars are prepared for dictionary construction. In the online phase, the LLE algorithm is adopted to convert the SE-processed speech signals toward their clean counterparts. The present study integrates the LLE-based postfiltering approach with a deep denoising autoencoder (DDAE)-based SE method, which has been confirmed to provide outstanding noise reduction capability. Experimental results show that the proposed postfiltering approach notably enhances the DDAE-processed speech signals across different noise types and SNR levels.
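
Reusing the LLE weight solve sketched earlier for the exemplar-based SC system (lle_convert), the two phases of the postfilter might be wired together as below. Here ddae_enhance is a hypothetical stand-in for the trained DDAE, and the frame-level pairing of the dictionaries is assumed.

```python
import numpy as np

# Offline phase: build paired dictionaries from a training set whose clean
# frames are known, passing their noisy versions through the DDAE first.
enh_dict = ddae_enhance(noisy_train_frames)   # (N, D) SE-processed exemplars
clean_dict = clean_train_frames               # (N, D) paired clean exemplars

# Online phase: move each enhanced test frame toward the clean space by
# solving LLE weights over its nearest enhanced exemplars and applying
# them to the paired clean exemplars (see lle_convert above).
post = np.stack([lle_convert(f, enh_dict, clean_dict, k=8)
                 for f in ddae_enhance(noisy_test_frames)])
```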


International Symposium on Chinese Spoken Language Processing | 2016

Dictionary update for NMF-based voice conversion using an encoder-decoder network

Chin-Cheng Hsu; Hsin-Te Hwang; Yi-Chiao Wu; Yu Tsao; Hsin-Min Wang

In this paper, we propose a dictionary update method for non-negative matrix factorization (NMF) with high-dimensional data in a spectral conversion (SC) task. Voice conversion has been widely studied due to its potential applications such as personalized speech synthesis and speech enhancement. Exemplar-based NMF (ENMF) emerges as an effective and probably the simplest choice among all techniques for SC, as long as a source-target parallel speech corpus is given. ENMF-based SC systems usually need a large number of bases (exemplars) to ensure the quality of the converted speech. However, a small and effective dictionary is desirable but hard to obtain via dictionary update, in particular when high-dimensional features such as STRAIGHT spectra are used. Therefore, we propose a dictionary update framework for NMF by means of an encoder-decoder reformulation. Regarding NMF as an encoder-decoder network makes it possible to exploit the whole parallel corpus more effectively and efficiently when applied to SC. Our experiments demonstrate significant gains of the proposed system with small dictionaries over conventional ENMF-based systems with dictionaries of the same or much larger size.
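
One way to picture the encoder-decoder reformulation is the sketch below: an encoder predicts non-negative activations for every frame, and a learned non-negative dictionary decodes them, so the whole corpus jointly updates a small dictionary. All sizes and the ReLU/softplus parameterizations are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn as nn

class NMFAsEncoderDecoder(nn.Module):
    """Illustrative encoder-decoder reformulation of NMF.

    Instead of solving for activations per example, an encoder network
    predicts non-negative activations H, and a learned non-negative
    dictionary W decodes them: X_hat = H @ W. Training over the whole
    corpus updates the (small) dictionary jointly with the encoder.
    """
    def __init__(self, dim=513, n_bases=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                     nn.Linear(256, n_bases))
        self.W = nn.Parameter(torch.rand(n_bases, dim))  # pre-activation bases

    def forward(self, x):
        H = torch.relu(self.encoder(x))             # non-negative activations
        W = torch.nn.functional.softplus(self.W)    # non-negative dictionary
        return H @ W                                # reconstruction

# A training step would minimize a reconstruction loss between x and the
# output; for SC, paired source/target dictionaries share the activations H.
```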

Collaboration


Dive into Hsin-Te Hwang's collaborations.

Top Co-Authors

Yu Tsao
Center for Information Technology

Sin-Horng Chen
National Chiao Tung University

Yih-Ru Wang
National Chiao Tung University

Syu-Siang Wang
Center for Information Technology

Borching Su
National Taiwan University

Ying-Hui Lai
Center for Information Technology