Publication


Featured research published by Chia-Ping Chen.


IEEE Transactions on Audio, Speech, and Language Processing | 2007

MVA Processing of Speech Features

Chia-Ping Chen; Jeff A. Bilmes

In this paper, we investigate a technique consisting of mean subtraction, variance normalization, and time-sequence filtering. Unlike other techniques, it applies auto-regressive moving-average (ARMA) filtering directly in the cepstral domain. We call this technique mean subtraction, variance normalization, and ARMA filtering (MVA) post-processing, and speech features with MVA post-processing are called MVA features. Overall, compared to raw features without post-processing, MVA features achieve an error rate reduction of 45% on matched tasks and 65% on mismatched tasks on the Aurora 2.0 noisy speech database, and an average 57% error reduction on the Aurora 3.0 database. These improvements are comparable to the results of much more complicated techniques, even though MVA is relatively simple and requires practically no additional computational cost. In addition to describing MVA processing, we also present a novel analysis of the distortion of mel-frequency cepstral coefficients and the log energy in the presence of different types of noise. The effectiveness of MVA is extensively investigated with respect to several variations: the configuration used to extract the raw features and their type, the domain in which MVA is applied, the filters that are used, the ARMA filter order, and the causality of the normalization process. Specifically, it is argued and demonstrated that MVA works better when applied to the zeroth-order cepstral coefficient than to log energy, that MVA works better in the cepstral domain, that an ARMA filter is better than either a designed finite impulse response filter or a data-driven filter, and that a five-tap ARMA filter is sufficient to achieve good performance in a variety of settings. We also investigate and evaluate a multi-domain MVA generalization.
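
As a rough illustration of the MVA idea, the sketch below applies per-utterance mean subtraction, variance normalization, and a low-order ARMA smoother to a cepstral feature matrix. The function name, the filter order, and the exact smoothing recursion are illustrative assumptions, not the paper's reference implementation.

```python
# Illustrative MVA-style post-processing: per-utterance mean/variance normalization
# followed by a low-order ARMA smoother (assumed form; details may differ from the paper).
import numpy as np

def mva(features, order=2):
    """features: (num_frames, num_coeffs) cepstral matrix of one utterance."""
    # Mean subtraction and variance normalization, per coefficient over the utterance.
    x = features - features.mean(axis=0)
    x = x / (x.std(axis=0) + 1e-8)

    # ARMA smoothing along time:
    # y[t] = (y[t-order] + ... + y[t-1] + x[t] + ... + x[t+order]) / (2*order + 1)
    y = x.copy()
    for t in range(order, x.shape[0] - order):
        y[t] = (y[t - order:t].sum(axis=0) + x[t:t + order + 1].sum(axis=0)) / (2 * order + 1)
    return y

# Example: post-process a random 13-dimensional MFCC-like sequence of 200 frames.
mfcc = np.random.randn(200, 13)
print(mva(mfcc).shape)  # (200, 13)
```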


International Conference on Acoustics, Speech, and Signal Processing | 2005

Speech feature smoothing for robust ASR

Chia-Ping Chen; Jeff A. Bilmes; Daniel P. W. Ellis

We evaluate smoothing within the context of the MVA (mean subtraction, variance normalization, and ARMA filtering) post-processing scheme for noise-robust automatic speech recognition. MVA has shown great success in the past on the Aurora 2.0 and 3.0 corpora, even though it is computationally inexpensive. Here, MVA is applied to many acoustic feature extraction methods and is evaluated on Aurora 2.0. We evaluate MVA post-processing on MFCCs, LPCs, PLPs, RASTA, Tandem, modulation-filtered spectrogram, and modulation cross-correlogram features. We conclude that, while effectiveness does depend on the extraction method, the majority of features benefit significantly from MVA, and the smoothing ARMA filter is an important component. It appears that the effectiveness of normalization and smoothing depends on the domain in which it is applied, and that it is most fruitfully applied just before the features are scored by a probabilistic model. Moreover, since it is both effective and simple, our ARMA filter should be considered a candidate method in most noise-robust speech recognition tasks.


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2013

Feature space dimension reduction in speech emotion recognition using support vector machine

Bo-Chang Chiou; Chia-Ping Chen

In this paper, we report implementations of automatic speech emotion recognition systems based on support vector machines. While common systems often extract a very large feature set per utterance for emotion classification, we conjecture that the dimension of the feature space can be greatly reduced without severe degradation of accuracy. Consequently, we systematically reduce the number of features via feature selection and principal component analysis. The evaluation is carried out on the Berlin Database of Emotional Speech, also known as EMO-DB, which consists of 10 speakers and 7 emotions. The results show that we can trim the feature set to 37 features and still maintain an accuracy of 80%. This means a reduction of more than 99% compared to the baseline system, which uses more than 6,000 features.
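
As a hedged illustration of the kind of reduction described above, the sketch below chains univariate feature selection, PCA, and an SVM in scikit-learn. The selector, the kernel, the component counts, and the placeholder data are assumptions for illustration; they are not the exact pipeline or features used in the paper.

```python
# Hypothetical sketch: shrink a large per-utterance feature vector with
# univariate selection plus PCA before an SVM classifier (scikit-learn).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder data: 500 utterances, 6000-dimensional feature vectors, 7 emotion labels.
X = np.random.randn(500, 6000)
y = np.random.randint(0, 7, size=500)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=200)),  # keep the 200 most discriminative features
    ("pca", PCA(n_components=37)),              # project to a 37-dimensional space
    ("svm", SVC(kernel="rbf", C=1.0)),
])

print(cross_val_score(pipeline, X, y, cv=5).mean())
```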


EURASIP Journal on Audio, Speech, and Music Processing | 2011

Noise-robust speech feature processing with empirical mode decomposition

Kuo-Hau Wu; Chia-Ping Chen; Bing-Feng Yeh

In this article, a novel technique based on the empirical mode decomposition methodology for processing speech features is proposed and investigated. The empirical mode decomposition generalizes Fourier analysis. It decomposes a signal into a sum of intrinsic mode functions. In this study, we implement an iterative algorithm to find the intrinsic mode functions for any given signal. We design a novel speech feature post-processing method based on the extracted intrinsic mode functions to achieve noise robustness for automatic speech recognition. Evaluation results on the noisy-digit Aurora 2.0 database show that our method leads to significant performance improvement. The relative improvement over the baseline features increases from 24.0% to 41.1% when the proposed post-processing method is applied to mean-variance-normalized speech features. The proposed method also improves over the performance achieved by a very noise-robust front-end when the test speech data are highly mismatched.
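
The sketch below shows a heavily simplified version of the sifting procedure that extracts intrinsic mode functions: fit spline envelopes through the maxima and minima, subtract the mean envelope, and iterate. Boundary handling and the stopping criteria are simplified assumptions, so this is not the article's algorithm in detail.

```python
# Simplified EMD sifting (illustrative stopping rules; boundary effects ignored).
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def _envelope(x, idx):
    # Cubic-spline envelope through the given extrema indices.
    return CubicSpline(idx, x[idx])(np.arange(len(x)))

def sift_imf(x, max_iter=50):
    # Repeatedly subtract the mean of the upper and lower envelopes.
    h = x.copy()
    for _ in range(max_iter):
        maxima = argrelextrema(h, np.greater)[0]
        minima = argrelextrema(h, np.less)[0]
        if len(maxima) < 4 or len(minima) < 4:
            break
        h = h - 0.5 * (_envelope(h, maxima) + _envelope(h, minima))
    return h

def emd(x, num_imfs=4):
    # Peel off intrinsic mode functions one by one; what remains is the residue.
    imfs, residue = [], np.asarray(x, dtype=float).copy()
    for _ in range(num_imfs):
        imf = sift_imf(residue)
        imfs.append(imf)
        residue = residue - imf
    return imfs, residue

# Example: decompose a toy two-tone signal.
t = np.linspace(0.0, 1.0, 1000)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
imfs, residue = emd(signal)
print(len(imfs), residue.shape)
```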


IEEE Transactions on Audio, Speech, and Language Processing | 2014

Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features

Chia-Ping Chen; Yi Chin Huang; Chung-Hsien Wu; Kuan De Lee

In this paper, an approach for polyglot speech synthesis based on cross-lingual frame selection is proposed. This method requires only mono-lingual speech data of different speakers in different languages for building a polyglot synthesis system, thus reducing the burden of data collection. Essentially, a set of artificial utterances in the second language for a target speaker is constructed based on the proposed cross-lingual frame-selection process, and this data set is used to adapt a synthesis model in the second language to the speaker. In the cross-lingual frame-selection process, we propose to use auditory and articulatory features to improve the quality of the synthesized polyglot speech. For evaluation, a Mandarin-English polyglot system is implemented where the target speaker only speaks Mandarin. The results show that decent performance regarding voice identity and speech quality can be achieved with the proposed method.


EURASIP Journal on Audio, Speech, and Music Processing | 2012

Robust dialogue act detection based on partial sentence tree, derivation rule, and spectral clustering algorithm

Chia-Ping Chen; Chung-Hsien Wu; Wei-Bin Liang

A novel approach for robust dialogue act detection in a spoken dialogue system is proposed. Shallow representations named partial sentence trees are employed to represent automatic speech recognition outputs. Parsing results of partial sentences can be decomposed into derivation rules, which turn out to be salient features for dialogue act detection. Data-driven dialogue acts are learned via an unsupervised learning algorithm called spectral clustering, in a vector space whose axes correspond to derivation rules. The proposed method is evaluated in a Mandarin spoken dialogue system for tourist-information services. Combined with information obtained from the automatic speech recognition module and from a Markov model on the dialogue act sequence, the proposed method achieves a detection accuracy of 85.1%, which is significantly better than the baseline performance of 62.3% using a naïve Bayes classifier. Furthermore, the average number of turns per dialogue session also decreases significantly with the improved detection accuracy.
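
To make the clustering step concrete, the sketch below represents each utterance as a bag of derivation-rule counts and groups the resulting vectors with spectral clustering in scikit-learn. The rule names, counts, and cluster count are made-up illustrations, not data or settings from the paper.

```python
# Hypothetical sketch: utterances as bags of derivation rules, grouped into
# data-driven dialogue acts with spectral clustering (scikit-learn).
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import SpectralClustering

# Toy derivation-rule counts extracted from parsed partial sentences
# (rule names are invented for illustration).
utterances = [
    {"S->NP VP": 1, "NP->PRON": 1, "VP->V NP": 1},
    {"S->NP VP": 1, "NP->DET N": 2, "VP->V PP": 1},
    {"S->VP": 1, "VP->V NP": 1, "NP->N": 1},
    {"S->NP VP": 1, "VP->V": 1, "NP->PRON": 1},
]

# Each axis of the vector space corresponds to one derivation rule.
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(utterances).toarray()

# Each cluster is treated as one data-driven dialogue act.
clustering = SpectralClustering(n_clusters=2, affinity="rbf", random_state=0)
print(clustering.fit_predict(X))
```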


EURASIP Journal on Audio, Speech, and Music Processing | 2012

Speaker-dependent model interpolation for statistical emotional speech synthesis

Chih-Yu Hsu; Chia-Ping Chen

In this article, we propose a speaker-dependent model interpolation method for statistical emotional speech synthesis. The basic idea is to combine the neutral model set of the target speaker and an emotional model set selected from a pool of speakers. For model selection and interpolation weight determination, we propose to use a novel monophone-based Mahalanobis distance, which is a proper distance measure between two hidden Markov model sets. We design a Latin-square evaluation to reduce systematic bias in the subjective listening tests. The proposed interpolation method achieves good performance in emotional expressiveness, naturalness, and target-speaker similarity. Moreover, such performance is achieved without the need to collect the emotional speech of the target speaker, saving the cost of data collection and labeling.
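
The sketch below illustrates the two ingredients in a heavily simplified form: a Mahalanobis-style distance between the Gaussian means of corresponding monophone models, and linear interpolation of the mean vectors. Real HMM-based synthesis model sets contain many more streams and parameters, so the dictionary structure and weights here are purely illustrative assumptions.

```python
# Simplified sketch: Mahalanobis-style distance between monophone Gaussians and
# weighted interpolation of mean vectors (real HMM model sets are far richer).
import numpy as np

def mahalanobis(mu_a, mu_b, var):
    # Distance between two monophone mean vectors under a shared diagonal variance.
    diff = mu_a - mu_b
    return float(np.sqrt(np.sum(diff * diff / var)))

def model_set_distance(set_a, set_b):
    # Average the per-monophone distances over the phones shared by both sets.
    phones = set(set_a) & set(set_b)
    return np.mean([mahalanobis(set_a[p]["mean"], set_b[p]["mean"], set_a[p]["var"])
                    for p in phones])

def interpolate(neutral, emotional, weight):
    # Linear interpolation of the mean vectors, phone by phone.
    return {p: {"mean": weight * neutral[p]["mean"] + (1 - weight) * emotional[p]["mean"],
                "var": neutral[p]["var"]}
            for p in neutral if p in emotional}

# Toy model sets with two monophones and 3-dimensional means.
rng = np.random.default_rng(0)
neutral = {p: {"mean": rng.normal(size=3), "var": np.ones(3)} for p in ("a", "i")}
emotional = {p: {"mean": rng.normal(size=3), "var": np.ones(3)} for p in ("a", "i")}

print(model_set_distance(neutral, emotional))
print(interpolate(neutral, emotional, weight=0.5)["a"]["mean"])
```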


International Conference on Acoustics, Speech, and Signal Processing | 2012

Cross-lingual frame selection method for polyglot speech synthesis

Chia-Ping Chen; Yi-Chin Huang; Chung-Hsien Wu; Kuan-De Lee

A novel approach is proposed for creating a polyglot speech synthesis system without the need to collect speech data from a bilingual (or multilingual) speaker, which is often expensive or even infeasible. Given a target speaker with data in the first language (Mandarin in this study), the basic idea is to construct artificial utterances in the second language (English) via selection of speech sample frames of the given speaker in the first language. As the speaker need not be polyglot, this method is generally applicable to any speaker and any pair of languages. In the search for the optimal frame sequence, the candidate set is constrained by a decision tree for phone segments in the speech data of both languages, and the cost function depends on context-dependent articulatory and auditory features. Evaluation results show that good performance regarding similarity (speaker identity) and naturalness (speech quality) can be achieved with the proposed method.
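
A minimal sketch of the core selection step is given below: for each target frame, choose the candidate frame from the speaker's first-language data that minimizes a weighted sum of auditory and articulatory feature distances. The decision-tree candidate pruning and any concatenation costs from the paper are omitted, and the weights and feature dimensions are assumptions.

```python
# Simplified per-frame selection with a combined auditory/articulatory cost
# (decision-tree pruning and concatenation costs from the paper are omitted).
import numpy as np

def select_frames(target_aud, target_art, cand_aud, cand_art, w_aud=0.5, w_art=0.5):
    """For each target frame, return the index of the best candidate frame."""
    chosen = []
    for a, r in zip(target_aud, target_art):
        cost = (w_aud * np.linalg.norm(cand_aud - a, axis=1)
                + w_art * np.linalg.norm(cand_art - r, axis=1))
        chosen.append(int(np.argmin(cost)))
    return chosen

# Toy data: 20 target frames, 500 candidate frames from the speaker's first-language data.
rng = np.random.default_rng(1)
target_aud, target_art = rng.normal(size=(20, 13)), rng.normal(size=(20, 6))
cand_aud, cand_art = rng.normal(size=(500, 13)), rng.normal(size=(500, 6))
print(select_frames(target_aud, target_art, cand_aud, cand_art)[:5])
```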


International Symposium on Chinese Spoken Language Processing | 2010

Auditory front-ends for noise-robust automatic speech recognition

Ja-Zang Yeh; Chia-Ping Chen

In this paper, we investigate a noise-robust feature extraction method, based on the auditory masking effect, for automatic speech recognition systems. We physically model the basilar membrane as a cascade system of simple harmonic oscillators, and mathematically analyze the motion of the basilar membrane due to speech signals. Based on the analysis, we identify a correlational factor for the coupled motion of the oscillators, which can be used to partially explain the masking effect. Accordingly, we insert an auditory module into the speech feature extraction process. The proposed methodology is evaluated on the Aurora 2.0 noisy-digit speech database, and it achieves significant improvements.
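
As a loose illustration of modeling basilar-membrane sections as driven harmonic oscillators, the sketch below realizes each section as a second-order resonant filter and runs a signal through a small bank of them. The center frequencies, Q factor, and filter design are illustrative assumptions and do not reproduce the coupled-oscillator analysis of the paper.

```python
# Sketch: a bank of second-order resonators as a crude stand-in for basilar-membrane
# sections modeled as damped harmonic oscillators (parameters are illustrative).
import numpy as np
from scipy.signal import iirpeak, lfilter

def resonator_bank(x, fs, center_freqs, q=4.0):
    """Return one band-passed output per oscillator/center frequency."""
    outputs = []
    for f0 in center_freqs:
        b, a = iirpeak(f0, q, fs=fs)   # second-order resonant filter tuned to f0
        outputs.append(lfilter(b, a, x))
    return np.stack(outputs)

fs = 8000
t = np.arange(0, 0.1, 1.0 / fs)
speech_like = np.sin(2 * np.pi * 300 * t) + 0.3 * np.random.randn(t.size)
bands = resonator_bank(speech_like, fs, center_freqs=[250, 500, 1000, 2000])
print(bands.shape)  # (4, 800)
```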


International Conference on Acoustics, Speech, and Signal Processing | 2017

Speech emotion recognition with ensemble learning methods

Po Yuan Shih; Chia-Ping Chen; Chung-Hsien Wu

In this paper, we propose to apply ensemble learning methods to neural networks to improve the performance of speech emotion recognition tasks. The basic idea is to first divide an unbalanced data set into balanced subsets and then combine the predictions of the models trained on these subsets. Several methods regarding the decomposition of the data and the exploitation of model predictions are investigated in this study. On the public-domain FAU-Aibo database, which was used in the Interspeech Emotion Challenge evaluation, the best performance we achieve is an unweighted average (UA) recall rate of 45.5% for the 5-class classification task. Furthermore, such performance is achieved with a 40-dimensional feature space. Compared to the baseline system, which uses a 384-dimensional feature vector per example and achieves a UA of 38.9%, this performance is very impressive. Indeed, this is one of the best performances on FAU-Aibo within the static modeling framework.
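
The sketch below illustrates the basic idea in a hedged form: undersample the majority classes to form several balanced subsets, train one small network per subset, and average the predicted class probabilities. The classifier, subset count, and toy data are assumptions; the paper's networks, features, and combination rules may differ.

```python
# Hypothetical sketch: balanced subsets + small neural nets + averaged predictions
# (scikit-learn MLPs stand in for the networks used in the paper).
import numpy as np
from sklearn.neural_network import MLPClassifier

def balanced_subsets(X, y, num_subsets, rng):
    """Yield class-balanced (X, y) subsets by undersampling the majority classes."""
    classes = np.unique(y)
    n_per_class = min(np.sum(y == c) for c in classes)
    for _ in range(num_subsets):
        idx = np.concatenate([rng.choice(np.flatnonzero(y == c), n_per_class, replace=False)
                              for c in classes])
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
# Imbalanced toy data: 40-dimensional features, 5 emotion classes.
X = rng.normal(size=(1000, 40))
y = rng.choice(5, size=1000, p=[0.5, 0.2, 0.15, 0.1, 0.05])

models = []
for X_sub, y_sub in balanced_subsets(X, y, num_subsets=5, rng=rng):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)
    models.append(clf.fit(X_sub, y_sub))

# Combine the ensemble by averaging class probabilities.
X_test = rng.normal(size=(10, 40))
avg_proba = np.mean([m.predict_proba(X_test) for m in models], axis=0)
print(avg_proba.argmax(axis=1))
```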

Collaboration


Dive into Chia-Ping Chen's collaborations.

Top Co-Authors

Jeff A. Bilmes (University of Washington)
Chung-Hsien Wu (National Cheng Kung University)
Ja-Zang Yeh (National Sun Yat-sen University)
Tzu-Hsuan Tseng (National Sun Yat-sen University)
Tzu-Hsuan Yang (National Sun Yat-sen University)
Wei-Bin Liang (National Cheng Kung University)
Bing-Feng Yeh (National Sun Yat-sen University)
Bo-Chang Chiou (National Sun Yat-sen University)
Chun-Han Tseng (National Sun Yat-sen University)