Hsiao-Chuan Wang
National Tsing Hua University
Publication
Featured research published by Hsiao-Chuan Wang.
international conference on acoustics, speech, and signal processing | 2005
Chi-Yueh Lin; Hsiao-Chuan Wang
An approach to automatic language identification (LID) using pitch contour information is proposed. A segment of pitch contour is approximated by a set of Legendre polynomials, so that the coefficients of the polynomials form a feature vector representing this pitch contour. A Gaussian mixture model (GMM) method based on feature vectors extracted from pitch contours is suggested for the LID. Our experiments show that only two or three coefficients are necessary to obtain reasonably good identification rates. We also find that the length of the segmented pitch contour is another useful feature for LID, so it is included to improve the performance further. Pair-wise language identification experiments on the OGI-TS corpus show that our proposed approach is very promising. We also find that tonal languages and pitch accent languages achieve better performance in our system.
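The feature extraction described above can be sketched in a few lines: fit a Legendre series to the voiced segment and keep the low-order coefficients, appending the segment length as the extra feature the abstract mentions. This is a minimal illustration, not the authors' implementation; the function name and default order are assumptions.

```python
import numpy as np

def pitch_contour_features(f0_segment, order=3):
    """Approximate a voiced pitch-contour segment with Legendre
    polynomials; the fitted coefficients form the LID feature vector."""
    f0_segment = np.asarray(f0_segment, dtype=float)
    # Map the segment's time axis onto [-1, 1], the natural domain
    # of the Legendre polynomials.
    t = np.linspace(-1.0, 1.0, len(f0_segment))
    # Least-squares fit; coefficient 0 tracks mean pitch, 1 the slope,
    # 2 the curvature of the contour.
    coeffs = np.polynomial.legendre.legfit(t, f0_segment, deg=order)
    # The paper also found segment length to be a useful extra feature.
    return np.append(coeffs, len(f0_segment))
```

Feature vectors produced this way would then be pooled per language and modeled by a GMM, as in the paper.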
Speech Communication | 1999
Kuo-Hwei Yuo; Hsiao-Chuan Wang
This paper introduces a new representation of speech that is invariant to noise. The idea is to filter the temporal trajectories of short-time one-sided autocorrelation sequences of speech such that the noise effect is removed. The filtered sequences are denoted by the relative autocorrelation sequences (RASs), and the mel-scale frequency cepstral coefficients (MFCC) are extracted from RAS instead of the original speech. This new speech feature set is denoted as RAS-MFCC. Experiments were conducted on a task of multispeaker isolated Mandarin digit recognition to demonstrate the effectiveness of RAS-MFCC features in the presence of white noise and colored noise. The proposed features are also shown to be superior to other robust representations and compensation techniques.
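The removal principle behind RAS can be sketched as follows: for roughly stationary noise, the noise's contribution to each frame's autocorrelation is nearly constant over time, so filtering the temporal trajectory of the autocorrelation sequences cancels it. A first-order difference is used here as the trajectory filter purely for illustration; the paper's actual filter may differ, and the MFCC step is omitted.

```python
import numpy as np

def relative_autocorrelation(frames, lag_max=16):
    """Per-frame one-sided autocorrelation sequences, then a
    first-order difference applied along the time (frame) axis.
    `frames` is a 2-D array of shape (num_frames, frame_len)."""
    frames = np.asarray(frames, dtype=float)
    acs = np.empty((frames.shape[0], lag_max))
    for i, x in enumerate(frames):
        full = np.correlate(x, x, mode="full")          # all lags
        acs[i] = full[len(x) - 1:len(x) - 1 + lag_max]  # lags 0..lag_max-1
    # Stationary-noise contribution is nearly constant across frames,
    # so differencing along time suppresses it.
    ras = np.diff(acs, axis=0)
    return ras  # MFCCs would then be computed from these sequences
```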
Speech Communication | 2003
Ching-Ta Lu; Hsiao-Chuan Wang
Wavelet packet transform has been widely applied in removing additive white Gaussian noise. By using a soft thresholding function, it performs well in enhancing corrupted speech. However, it suffers from serious residual noise and speech distortion. In this paper, we propose a method based on critical-band decomposition, which converts a noisy signal into wavelet coefficients (WCs) and enhances the WCs by subtracting a threshold from the noisy WCs in each subband. The threshold of each subband is adapted according to the segmental SNR (SegSNR) and the noise masking threshold. Thus residual noise can be efficiently suppressed in a speech-dominated frame. In a noise-dominated frame, the background noise can be almost entirely removed by adjusting the wavelet coefficient threshold (WCT) according to the SegSNR. Speech distortion can be reduced by decreasing the WCT in speech-dominated subbands. The proposed method can effectively enhance noisy speech corrupted by colored noise. Its performance is better than other wavelet-based speech enhancement methods in our experiments.
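The per-subband operation described above can be sketched for a single subband: subtract an adapted threshold from the noisy wavelet coefficients, with a larger threshold when the SegSNR is low (noise-dominated frame) and a smaller one when it is high (speech-dominated frame). The SegSNR-to-threshold mapping below is a simplified stand-in, not the paper's exact adaptation rule, and the masking-threshold term is omitted.

```python
import numpy as np

def enhance_subband(wcs, noise_std, seg_snr_db):
    """Subtract an SegSNR-adapted threshold from the noisy wavelet
    coefficients (WCs) of one subband (soft thresholding)."""
    wcs = np.asarray(wcs, dtype=float)
    # Map SegSNR to a scale factor in [0.5, 2]: low SNR -> larger
    # threshold (aggressive suppression), high SNR -> smaller.
    alpha = np.clip(1.0 - seg_snr_db / 20.0, 0.5, 2.0)
    wct = alpha * noise_std * np.sqrt(2.0)   # adapted threshold (WCT)
    # Soft thresholding: shrink coefficient magnitudes by the threshold.
    return np.sign(wcs) * np.maximum(np.abs(wcs) - wct, 0.0)
```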
IEEE Signal Processing Letters | 1997
Jen-Tzung Chien; Chin-Hui Lee; Hsiao-Chuan Wang
We present a hybrid algorithm for adapting a set of speaker-independent hidden Markov models (HMMs) to a new speaker based on a combination of maximum a posteriori (MAP) parameter transformation and adaptation. The algorithm is developed by first transforming clusters of HMM parameters through a class of transformation functions. Then, the transformed HMM parameters are further smoothed via Bayesian adaptation. The proposed transformation/adaptation process can be iterated for any given amount of adaptation data, and it converges rapidly in terms of likelihood improvement. The algorithm also gives a better speech recognition performance than that obtained using transformation or adaptation alone for almost any practical amount of adaptation data.
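The Bayesian-smoothing stage of the scheme above can be illustrated for a single Gaussian mean: the prior (here standing in for the transformed speaker-independent mean) is interpolated with the adaptation-data statistics, so with little data the estimate stays near the prior and with more data it converges to the data mean. The prior weight `tau` and the function shape are assumptions for illustration only.

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP smoothing of a Gaussian mean: interpolate the prior mean
    with the posterior-weighted mean of the adaptation frames.
    `posteriors` are per-frame occupation probabilities for this
    Gaussian; `tau` controls how strongly the prior is trusted."""
    frames = np.asarray(frames, dtype=float)
    gamma = np.asarray(posteriors, dtype=float)
    occ = gamma.sum()                                    # occupation count
    data_mean = (gamma[:, None] * frames).sum(axis=0) / occ
    # Small occ -> result near prior; large occ -> result near data mean.
    return (tau * np.asarray(prior_mean) + occ * data_mean) / (tau + occ)
```

Iterating transformation followed by this kind of smoothing is what gives the hybrid scheme its rapid likelihood convergence.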
IEEE Transactions on Speech and Audio Processing | 1995
Chih-Chung Kuo; Fu-Rong Jean; Hsiao-Chuan Wang
This correspondence proposes a new CELP coding method which embeds speech classification in the adaptive codebook search. This approach can retain the synthesized speech quality at bit rates below 4 kb/s. A pitch analyzer is designed to classify each frame by its periodicity, and with a finite-state machine, one of four states is determined. Then the adaptive codebook search scheme is switched according to the state. Simulation results show that higher SEGSNR and lower computational complexity can be achieved, and the pitch contour of the synthesized speech is smoother than that produced by conventional CELP coders.
international conference on acoustics, speech, and signal processing | 1992
Chih-Chung Kuo; Fu-Rong Jean; Hsiao-Chuan Wang
A novel spectral coding method, two-dimensional differential line spectrum pair coding (2DdLSP), is proposed. Taking advantage of the strong inter-frame and intra-frame correlation of LSP parameters, a two-dimensional linear prediction technique is used to reduce the variance of the parameters to be quantized. One scalar quantization and two vector quantization schemes are designed to quantize the 2-D prediction residuals. Without further buffering delay, a spectral distortion of 1 dB² can be achieved at 19 b/frame when the frame period is 10 ms. Both within- and out-of-training tests show the robustness of the method to speech data variance.
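The two-dimensional prediction idea can be sketched as follows: predict each LSP parameter from the same parameter in the previous frame (inter-frame) and from the previous parameter in the same frame (intra-frame), and quantize only the residual, whose variance is smaller. The predictor weights `a` and `b` are illustrative, not the paper's trained values, and prediction here runs on unquantized values rather than the decoder's reconstruction.

```python
import numpy as np

def two_d_lsp_residual(lsp, a=0.5, b=0.5):
    """2-D differential prediction of LSP parameters.
    `lsp` has shape (num_frames, lpc_order); the returned residual
    is what a 2DdLSP-style coder would quantize."""
    lsp = np.asarray(lsp, dtype=float)
    pred = np.zeros_like(lsp)
    pred[1:, :] += a * lsp[:-1, :]    # inter-frame prediction
    pred[:, 1:] += b * lsp[:, :-1]    # intra-frame prediction
    return lsp - pred                  # low-variance residual
```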
Digital Signal Processing | 2007
Ching-Ta Lu; Hsiao-Chuan Wang
Noise masking threshold (NMT) has been extensively applied in wavelet-based speech enhancement. A gain factor is typically derived according to the NMT to suppress the additive noise. This investigation proposes a gain factor in each wavelet subband subject to a perceptual constraint. This perceptual constraint preserves the wavelet coefficients (WCs) of noisy speech when the level of residual noise is smaller than the NMT. If the residual noise level exceeds the NMT, then the wavelet coefficients of noisy speech are suppressed to reduce the corrupting noise. A lower bound on the gain factor is also proposed to prevent low-SNR regions, such as unvoiced signal, from being over-attenuated. Experimental results show that the proposed approach improves the naturalness, and does not cause annoying residual noise in the enhanced speech.
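The constrained gain described above reduces to a simple rule per subband: keep the noisy WCs when the residual-noise level is already below the NMT, otherwise attenuate, but never below a lower bound that protects low-SNR regions such as unvoiced sounds. The specific attenuation curve and default `g_min` below are simplified stand-ins for illustration.

```python
def perceptual_gain(noise_level, nmt, g_min=0.1):
    """Gain factor for one wavelet subband under the perceptual
    constraint: gain 1 when residual noise is masked (below the NMT),
    otherwise attenuate, floored at g_min to avoid over-attenuating
    unvoiced, low-SNR regions."""
    if noise_level <= nmt:
        return 1.0                      # noise already masked: keep WCs
    return max(nmt / noise_level, g_min)
```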
IEEE Signal Processing Letters | 2001
Wei-Wen Hung; Hsiao-Chuan Wang
In this paper, we discuss the use of weighted filter bank analysis (WFBA) to increase the discriminating ability of mel frequency cepstral coefficients (MFCCs). The WFBA emphasizes the peak structure of the log filter bank energies (LFBEs) obtained from filter bank analysis while attenuating the components with lower energy in a simple, direct, and effective way. Experimental results for recognition of continuous Mandarin telephone speech indicate that the WFBA-based cepstral features are more robust than those derived by employing the standard filter bank analysis and some widely used cepstral liftering and frequency filtering schemes both in channel-distorted and noisy conditions.
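The weighting step can be sketched as follows: emphasize the peak structure of the LFBEs and attenuate low-energy channels before the cepstral transform. The normalized-energy power weighting below is an assumption chosen for illustration, not necessarily the paper's exact weighting function.

```python
import numpy as np

def weighted_lfbe(lfbe, p=2.0):
    """Weighted filter bank analysis sketch: scale each log filter
    bank energy (LFBE) by a weight near 1 at spectral peaks and
    near 0 in low-energy channels, sharpening the peak structure."""
    lfbe = np.asarray(lfbe, dtype=float)
    e = np.exp(lfbe)                   # back to linear energies
    w = (e / e.max()) ** p             # peaks -> ~1, valleys -> ~0
    return w * lfbe                    # weighted LFBEs, then DCT -> MFCC
```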
international conference on acoustics, speech, and signal processing | 1997
Jen-Tzung Chien; Chin-Hui Lee; Hsiao-Chuan Wang
We propose an improved maximum a posteriori (MAP) learning algorithm of continuous-density hidden Markov model (CDHMM) parameters for speaker adaptation. The algorithm is developed by sequentially combining three adaptation approaches. First, the clusters of speaker-independent HMM parameters are locally transformed through a group of transformation functions. Then, the transformed HMM parameters are globally smoothed via MAP adaptation. Within the MAP adaptation, the parameters of units unseen in the adaptation data are further adapted by employing a transfer vector interpolation scheme. Experiments show that the combined algorithm converges rapidly and outperforms the other adaptation methods.
international conference on acoustics, speech, and signal processing | 2006
Chi-Yueh Lin; Hsiao-Chuan Wang
It has been shown that representing a segment of pitch contour by a set of Legendre polynomial coefficients is effective for the pair-wise language identification task. Feature vectors comprising these polynomial coefficients were formerly modeled by a Gaussian mixture model (GMM) for each language. However, a static model such as the GMM does not take advantage of the temporal information across several pitch contours. It is intuitive that the temporal information of prosodic features should be used for capturing the characteristics of a specific language. In this paper, a novel dynamic model with an ergodic topology is proposed. The experiments show that the proposed method significantly improves the identification rate, even for stress-timed and syllable-timed languages.