Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Yu Tsao is active.

Publication


Featured research published by Yu Tsao.


IEEE Transactions on Audio, Speech, and Language Processing | 2009

An Ensemble Speaker and Speaking Environment Modeling Approach to Robust Speech Recognition

Yu Tsao; Chin-Hui Lee

We propose an ensemble speaker and speaking environment modeling (ESSEM) approach to characterizing environments in order to enhance the performance robustness of automatic speech recognition systems under adverse conditions. The ESSEM process comprises two phases: an offline phase and an online phase. In the offline phase, we prepare an ensemble speaker and speaking environment space formed by a collection of super-vectors. Each super-vector consists of the entire set of means from all the Gaussian mixture components of a set of hidden Markov models that characterizes a particular environment. In the online phase, with the ensemble environment space prepared in the offline phase, we estimate the super-vector for a new testing environment based on a stochastic matching criterion. In this paper, we focus on methods for enhancing the construction and coverage of the environment space in the offline phase. We first demonstrate environment clustering and partitioning algorithms to structure the environment space well; then, we propose a minimum classification error training algorithm to enhance discrimination across environment super-vectors and thereby broaden the coverage of the ensemble environment space. We evaluate the proposed ESSEM framework on the Aurora2 connected digit recognition task. Experimental results verify that ESSEM provides a clear improvement over a baseline system without environmental compensation. Moreover, the performance of ESSEM can be further enhanced by using well-structured environment spaces. Finally, we confirm that ESSEM gives the best overall performance with an environment space refined by an integration of all techniques.
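The following minimal sketch illustrates the super-vector idea behind ESSEM: the offline phase stacks all Gaussian means of an environment-dependent HMM set into one super-vector, and the online phase combines the ensemble super-vectors with weights fitted to statistics from the test environment. The least-squares matching used here is a simplified stand-in for the paper's maximum-likelihood stochastic-matching criterion, and all function names and shapes are illustrative assumptions.

```python
import numpy as np

def build_supervector(hmm_means):
    """Stack all Gaussian mixture means of an HMM set into one super-vector.

    hmm_means: list of (n_mixtures, feat_dim) arrays, one per HMM state/model.
    """
    return np.concatenate([m.ravel() for m in hmm_means])

def estimate_environment(ensemble, test_stats, n_iters=100, lr=1e-2):
    """Approximate the online phase: find interpolation weights over the ensemble
    super-vectors so that the combined super-vector matches first-order statistics
    gathered from the new testing environment (a least-squares stand-in for the
    paper's stochastic-matching criterion).

    ensemble:   (n_envs, dim) matrix, one super-vector per training environment.
    test_stats: (dim,) rough mean-statistics vector estimated from test data.
    """
    weights = np.full(ensemble.shape[0], 1.0 / ensemble.shape[0])
    for _ in range(n_iters):
        combined = weights @ ensemble                  # current environment estimate
        grad = ensemble @ (combined - test_stats)      # gradient of the squared error
        weights -= lr * grad
    return weights @ ensemble                          # adapted super-vector
```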


International Conference on Acoustics, Speech, and Signal Processing | 2005

A study on knowledge source integration for candidate rescoring in automatic speech recognition

Jinyu Li; Yu Tsao; Chin-Hui Lee

We propose a rescoring framework for speech recognition that incorporates acoustic-phonetic knowledge sources. The scores corresponding to all knowledge sources are generated from a collection of neural-network-based classifiers. Rescoring is then performed by combining the different knowledge scores and using them to reorder candidate strings provided by state-of-the-art HMM-based speech recognizers. We report on continuous phone recognition experiments using the TIMIT database. Our results indicate that classifying manners and places of articulation provides additional information in rescoring, and improved accuracies over our best baseline speech recognizers are achieved using both context-independent and context-dependent phone models. The same technique can be extended to lattice rescoring and large-vocabulary continuous speech recognition.
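As a rough illustration of the rescoring step, the sketch below adds weighted scores from several knowledge sources to the original HMM recognizer score and reorders an N-best list. The function name, score conventions, and interpolation weights are assumptions for illustration, not the paper's implementation.

```python
from typing import Dict, List, Tuple

def rescore_nbest(
    candidates: List[Tuple[str, float]],
    knowledge_scores: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Reorder N-best candidate strings by adding weighted knowledge scores
    (e.g., manner/place-of-articulation classifier log-probabilities) to the
    original HMM recognizer score.

    candidates:       [(hypothesis_string, hmm_log_score), ...]
    knowledge_scores: {source_name: [score for each candidate]}
    weights:          {source_name: interpolation weight}
    """
    rescored = []
    for i, (hyp, hmm_score) in enumerate(candidates):
        total = hmm_score + sum(
            weights[src] * scores[i] for src, scores in knowledge_scores.items()
        )
        rescored.append((hyp, total))
    return sorted(rescored, key=lambda x: x[1], reverse=True)
```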


International Conference on Acoustics, Speech, and Signal Processing | 2014

Speech enhancement using segmental nonnegative matrix factorization

Hao-Teng Fan; Jeih-weih Hung; Xugang Lu; Syu-Siang Wang; Yu Tsao

The conventional NMF-based speech enhancement algorithm analyzes the magnitude spectrograms of both clean speech and noise in the training data via NMF and estimates a set of spectral basis vectors. These basis vectors span a space used to approximate the magnitude spectrogram of the noise-corrupted testing utterances. Finally, the components associated with the clean-speech spectral basis vectors are used to construct the updated magnitude spectrogram, producing an enhanced speech utterance. Considering that rich spectral-temporal structure can be exploited in local frequency- and time-varying spectral patches, this study proposes a segmental NMF (SNMF) speech enhancement scheme to improve the conventional frame-wise NMF-based method. Two algorithms are derived to decompose the nonnegative matrix associated with the magnitude spectrogram; the first operates in the spectral domain and the second in the temporal domain. With these decomposition processes, noisy speech signals can be modeled more precisely, and the speech-related spectrogram can be reconstructed more accurately than with the conventional NMF-based method. Objective evaluations using the perceptual evaluation of speech quality (PESQ) measure indicate that the proposed SNMF strategy improves sound quality in noisy conditions and outperforms the well-known MMSE log-spectral amplitude (LSA) estimator.
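A minimal sketch of the conventional frame-wise NMF baseline that SNMF improves on is given below, assuming the speech and noise basis matrices have already been learned from clean-speech and noise training spectrograms. The Wiener-like mask at the end is a common reconstruction choice in NMF enhancement and is an assumption here, not necessarily the exact reconstruction used in the paper.

```python
import numpy as np

def nmf_activations(V, W, n_iters=200, eps=1e-10):
    """Estimate activations H >= 0 so that V ~= W @ H with the basis W held fixed,
    using the standard multiplicative update for the Euclidean NMF objective."""
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def enhance(noisy_mag, W_speech, W_noise):
    """Frame-wise NMF enhancement: decompose the noisy magnitude spectrogram over
    the concatenated speech/noise bases, then keep only the part spanned by the
    speech basis via a Wiener-like mask applied to the noisy magnitude.

    noisy_mag: (n_freq, n_frames) magnitude spectrogram of the noisy utterance.
    W_speech:  (n_freq, k_speech) clean-speech basis learned offline.
    W_noise:   (n_freq, k_noise) noise basis learned offline.
    """
    W = np.hstack([W_speech, W_noise])
    H = nmf_activations(noisy_mag, W)
    n_s = W_speech.shape[1]
    speech_part = W_speech @ H[:n_s]
    noise_part = W_noise @ H[n_s:]
    mask = speech_part / (speech_part + noise_part + 1e-10)
    return mask * noisy_mag
```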


Speech Communication | 2016

Generalized maximum a posteriori spectral amplitude estimation for speech enhancement

Yu Tsao; Ying-Hui Lai

Highlights: GMAPA specifies the weight of the prior density based on the SNR of the testing speech signals. GMAPA is capable of performing environment-aware speech enhancement. When the SNR is high, GMAPA adopts a small weight to prevent overcompensation. When the SNR is low, GMAPA uses a large weight to avoid disturbing the restoration. Results show that GMAPA outperforms related approaches in objective and subjective evaluations.

Spectral restoration methods for speech enhancement aim to remove noise components from noisy speech signals by using a gain function in the spectral domain. Designing the gain function is one of the most important steps in obtaining enhanced speech with good quality. In most studies, the gain function is designed by optimizing a criterion based on assumptions about the noise and speech distributions, such as the minimum mean square error (MMSE), maximum likelihood (ML), and maximum a posteriori (MAP) criteria. The MAP criterion has the advantage of yielding a more reliable gain function by incorporating a suitable prior density. However, as several studies have shown, although the MAP-based estimator effectively reduces noise components when the signal-to-noise ratio (SNR) is low, it introduces large speech distortion when the SNR is high. To address this problem, we previously proposed a generalized maximum a posteriori spectral amplitude (GMAPA) algorithm for designing a gain function for speech enhancement. The GMAPA algorithm dynamically specifies the weight of the prior density of speech spectra according to the SNR of the testing speech signals when calculating the optimal gain function. When the SNR is high, GMAPA adopts a small weight to prevent overcompensation that may result in speech distortions. On the other hand, when the SNR is low, GMAPA uses a large weight to avoid disturbance of the restoration caused by measurement noise. Our previous study showed that the weight of the prior density plays a crucial role in GMAPA performance, with the weight determined from the SNR at the utterance level. In this paper, we propose to compute the weight with consideration of time-frequency correlations, which results in a more accurate estimation of the gain function. Experiments were carried out to evaluate the proposed algorithm with both objective and subjective tests. The objective results indicate that GMAPA is promising compared to several well-known algorithms at both high and low SNRs. The subjective listening tests indicate that GMAPA provides significantly higher sound quality than other speech enhancement algorithms.
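The sketch below illustrates only the SNR-dependent weighting idea: estimate the SNR of the test signal and map it to a prior weight that is large at low SNR and small at high SNR. The linear mapping and its parameters are hypothetical placeholders; the paper designs the actual mapping and, in this work, also exploits time-frequency correlations rather than a single utterance-level value.

```python
import numpy as np

def estimate_snr_db(noisy_power, noise_power, eps=1e-10):
    """Rough SNR estimate (in dB) from noisy-speech power and a noise-power
    estimate (e.g., obtained from non-speech frames)."""
    speech_power = np.maximum(noisy_power - noise_power, eps)
    return 10.0 * np.log10(speech_power.mean() / (noise_power.mean() + eps))

def prior_weight(snr_db, w_min=0.1, w_max=1.0, snr_low=0.0, snr_high=20.0):
    """Hypothetical mapping from estimated SNR to the GMAPA prior weight:
    a large weight at low SNR (lean on the prior to suppress noise) and a
    small weight at high SNR (trust the observation to avoid over-compensation).
    The parameters here are illustrative, not the mapping designed in the paper."""
    t = np.clip((snr_db - snr_low) / (snr_high - snr_low), 0.0, 1.0)
    return w_max - t * (w_max - w_min)
```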


International Conference on Acoustics, Speech, and Signal Processing | 2014

Sparse representation based on a bag of spectral exemplars for acoustic event detection

Xugang Lu; Yu Tsao; Shigeki Matsuda; Chiori Hori

Acoustic event detection is an important step in audio content analysis and retrieval. Traditional detection techniques model acoustic events with frame-based spectral features. Considering that the temporal-frequency structures of acoustic events may span time scales beyond individual frames, we propose to represent those structures as a bag of spectral patch exemplars. To learn representative exemplars, k-means-clustering-based vector quantization (VQ) is applied to whitened spectral patches, which makes the learned exemplars focus on high-order statistical structure. With the learned spectral exemplars, a sparse feature representation is extracted based on similarity to the exemplars. A support vector machine (SVM) classifier is then built on the sparse representation for acoustic event detection. Our experimental results show that the sparse representation based on patch-based exemplars significantly improves performance compared with traditional frame-based representations.
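A hedged sketch of the pipeline described above, using scikit-learn: PCA whitening of spectral patches, k-means vector quantization to learn exemplars, cosine-similarity features sparsified and max-pooled per clip, and a linear SVM on top. Patch length, number of exemplars, the sparsification threshold, and the pooling choice are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC

def extract_patches(spectrogram, patch_len=8):
    """Slice a (freq, time) magnitude spectrogram into flattened spectral patches."""
    n = spectrogram.shape[1] // patch_len
    return np.stack([spectrogram[:, i * patch_len:(i + 1) * patch_len].ravel()
                     for i in range(n)])

def learn_exemplars(train_spectrograms, patch_len=8, n_whiten=64, n_exemplars=256):
    """Whiten the pooled training patches, then learn exemplars by k-means VQ."""
    patches = np.vstack([extract_patches(s, patch_len) for s in train_spectrograms])
    whitener = PCA(n_components=n_whiten, whiten=True).fit(patches)
    kmeans = KMeans(n_clusters=n_exemplars, n_init=10).fit(whitener.transform(patches))
    return whitener, kmeans

def sparse_features(spectrogram, whitener, kmeans, patch_len=8, keep=0.1):
    """Similarity of each patch to every exemplar, sparsified by keeping only the
    strongest responses, then max-pooled over patches into one vector per clip."""
    patches = whitener.transform(extract_patches(spectrogram, patch_len))
    sim = cosine_similarity(patches, kmeans.cluster_centers_)
    sim[sim < np.quantile(sim, 1.0 - keep)] = 0.0
    return sim.max(axis=0)

def train_detector(train_spectrograms, labels):
    """Fit the exemplar space and an SVM event classifier on the sparse features."""
    whitener, kmeans = learn_exemplars(train_spectrograms)
    X = np.stack([sparse_features(s, whitener, kmeans) for s in train_spectrograms])
    return whitener, kmeans, SVC(kernel="linear").fit(X, labels)
```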


IEEE Transactions on Speech and Audio Processing | 2005

Segmental eigenvoice with delicate eigenspace for improved speaker adaptation

Yu Tsao; Shang-Ming Lee; Lin-Shan Lee

Eigenvoice techniques have been proposed to provide rapid speaker adaptation with very limited adaptation data, but performance may saturate when more adaptation data become available. This is because these techniques establish an eigenspace with reduced dimensionality by utilizing a priori knowledge from a large quantity of training data. The reduced dimensionality of the eigenspace requires less adaptation data to estimate the model parameters for the new speaker, but it also makes it more difficult to obtain more precise models when more adaptation data are available. In this paper, a new segmental eigenvoice approach is proposed, in which the eigenspace is further segmented into N sub-eigenspaces by properly classifying the model parameters into N clusters. These N sub-eigenspaces help construct a more delicate eigenspace and more precise models when more adaptation data are available. It is shown that there are at least mixture-based, model-based, and feature-based segmental eigenvoice approaches. Not only can improved performance be obtained, but these different approaches can also be properly integrated to offer even better performance. Two further approaches leading to improved segmental eigenvoice techniques with even better performance are also proposed. The experiments were performed on both large-vocabulary and small-vocabulary recognition tasks.
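The sketch below gives one possible reading of the segmental eigenvoice idea: split the super-vector dimensions into N clusters, build a PCA sub-eigenspace per cluster from training-speaker super-vectors, and adapt by projecting a rough estimate of the new speaker's super-vector onto each sub-eigenspace. Estimating the eigenvoice weights by maximizing the likelihood of the adaptation data, as the actual method does, is replaced here by a simple projection; all names and dimensions are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def train_segmental_eigenvoices(speaker_supervectors, clusters, n_eigen=10):
    """Build one sub-eigenspace per parameter cluster.

    speaker_supervectors: (n_speakers, dim) matrix of training-speaker super-vectors
                          (stacked HMM Gaussian means).
    clusters:             (dim,) integer array assigning each super-vector dimension
                          to one of N clusters (mixture-, model-, or feature-based
                          classification of the model parameters).
    """
    spaces = {}
    for c in np.unique(clusters):
        idx = np.where(clusters == c)[0]
        spaces[c] = (idx, PCA(n_components=n_eigen).fit(speaker_supervectors[:, idx]))
    return spaces

def adapt(spaces, rough_supervector):
    """Project a rough estimate of the new speaker's super-vector (e.g., from
    statistics of the adaptation data) onto each sub-eigenspace and reassemble
    the adapted super-vector."""
    adapted = np.empty_like(rough_supervector)
    for idx, pca in spaces.values():
        coords = pca.transform(rough_supervector[idx][None, :])
        adapted[idx] = pca.inverse_transform(coords)[0]
    return adapted
```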


International Conference on Acoustics, Speech, and Signal Processing | 2013

Speech enhancement using generalized maximum a posteriori spectral amplitude estimator

Yu-Cheng Su; Yu Tsao; Jung-En Wu; Fu-Rong Jean

This paper proposes a generalized maximum a posteriori spectral amplitude (GMAPA) algorithm for spectral restoration in speech enhancement. The proposed GMAPA algorithm dynamically adjusts the scale of the prior information used to calculate the gain function for spectral restoration. In higher signal-to-noise ratio (SNR) conditions, GMAPA adopts a smaller scale to prevent overcompensation that may result in speech distortions. On the other hand, in lower SNR conditions, GMAPA uses a larger scale to enable the gain function to remove noise components from noisy speech more effectively. We also develop a mapping function to optimally determine the prior-information scale according to the SNR of speech utterances. Two standardized speech databases, Aurora-4 and Aurora-2, are used to conduct objective and recognition evaluations, respectively, to test the proposed GMAPA algorithm. For comparison, three conventional spectral restoration algorithms are also evaluated: the minimum mean-square error spectral estimator (MMSE), the maximum likelihood spectral amplitude estimator (MLSA), and the maximum a posteriori spectral amplitude estimator (MAPA). The experimental results first confirm that GMAPA provides better objective evaluation scores than MMSE, MLSA, and MAPA in lower SNR conditions, with scores comparable to MLSA in higher SNR conditions. Moreover, our recognition results indicate that GMAPA consistently outperforms the three conventional algorithms over different testing conditions.


IEEE Transactions on Biomedical Engineering | 2017

S1 and S2 Heart Sound Recognition Using Deep Neural Networks

Tien-En Chen; Shih-I Yang; Li-Ting Ho; Kun-Hsi Tsai; Yu-Hsuan Chen; Yun-Fan Chang; Ying-Hui Lai; Syu-Siang Wang; Yu Tsao; Chau-Chung Wu

OBJECTIVE This study focuses on recognition of the first (S1) and second (S2) heart sounds based only on acoustic characteristics; assumptions about the individual durations of S1 and S2 and the time intervals of S1-S2 and S2-S1 are not involved in the recognition process. The main objective is to investigate whether reliable S1 and S2 recognition performance can still be attained in situations where duration and interval information is not accessible. METHODS A deep neural network (DNN) method is proposed for recognizing S1 and S2 heart sounds. In the proposed method, heart sound signals are first converted into a sequence of Mel-frequency cepstral coefficients (MFCCs). The k-means algorithm is applied to cluster the MFCC features into two groups to refine their representation and discriminative capability. The refined features are then fed to a DNN classifier to perform S1 and S2 recognition. We conducted experiments using actual heart sound signals recorded with an electronic stethoscope. Precision, recall, F-measure, and accuracy are used as the evaluation metrics. RESULTS The proposed DNN-based method achieves high precision, recall, and F-measure scores with an accuracy rate of more than 91%. CONCLUSION The DNN classifier provides higher evaluation scores than other well-known pattern classification methods. SIGNIFICANCE The proposed DNN-based method achieves reliable S1 and S2 recognition performance based on acoustic characteristics, without using an ECG reference or incorporating assumptions about the individual durations of S1 and S2 and the time intervals of S1-S2 and S2-S1.
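A small sketch of the described pipeline, assuming librosa and scikit-learn: MFCC extraction, a k-means stage, and a neural-network classifier. How exactly the two k-means clusters refine the MFCC representation is not spelled out above, so appending each frame's distances to the two cluster centres is an assumption made for illustration, as is the use of MLPClassifier in place of the paper's DNN.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

def mfcc_frames(signal, sr, n_mfcc=13):
    """Frame-wise MFCCs of a heart-sound recording, shaped (n_frames, n_mfcc)."""
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T

def train_recognizer(frame_features, frame_labels):
    """frame_features: (n_frames, n_mfcc) MFCCs pooled over training recordings.
    frame_labels: 0 for S1 frames, 1 for S2 frames (labelled beforehand)."""
    # Hypothetical realisation of the k-means refinement: append each frame's
    # distances to the two cluster centres as extra features.
    km = KMeans(n_clusters=2, n_init=10).fit(frame_features)
    X = np.hstack([frame_features, km.transform(frame_features)])
    dnn = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=300).fit(X, frame_labels)
    return km, dnn

def recognize(km, dnn, frame_features):
    """Predict per-frame labels: 0 -> S1, 1 -> S2."""
    X = np.hstack([frame_features, km.transform(frame_features)])
    return dnn.predict(X)
```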


Conference of the International Speech Communication Association | 2016

SNR-Aware Convolutional Neural Network Modeling for Speech Enhancement.

Szu-Wei Fu; Yu Tsao; Xugang Lu

This paper proposes a signal-to-noise-ratio (SNR) aware convolutional neural network (CNN) model for speech enhancement (SE). Because the CNN model can deal with local temporal-spectral structures of speech signals, it can effectively disentangle the speech and noise signals given noisy speech. To enhance generalization capability and accuracy, we propose two SNR-aware algorithms for CNN modeling. The first algorithm employs a multi-task learning (MTL) framework, in which restoring clean speech and estimating the SNR level are formulated as the main and secondary tasks, respectively, given the noisy speech input. The second algorithm is SNR-adaptive denoising, in which the SNR level is explicitly predicted in a first step, and an SNR-dependent CNN model is then selected for denoising. Experiments were carried out to test the two SNR-aware algorithms for CNN modeling. Results demonstrate that the CNN with the two proposed SNR-aware algorithms outperforms its deep neural network counterpart in terms of standardized objective evaluations when using the same number of layers and nodes. Moreover, the SNR-aware algorithms improve denoising performance at unseen SNR levels, suggesting promising generalization capability for real-world applications.
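The multi-task variant can be sketched with a small Keras model: shared convolutional layers over a noisy log-magnitude patch feed a main regression head that restores the clean centre frame and a secondary softmax head that classifies the SNR level. Layer sizes, patch dimensions, and the loss weights are illustrative assumptions rather than the paper's configuration.

```python
import tensorflow as tf
from tensorflow.keras import Model, layers

def build_snr_aware_cnn(n_freq=257, context=11, n_snr_classes=5):
    """Multi-task CNN sketch: restoring the clean frame is the main task,
    classifying the SNR level is the secondary task."""
    x_in = layers.Input(shape=(context, n_freq, 1), name="noisy_patch")
    h = layers.Conv2D(32, (3, 5), padding="same", activation="relu")(x_in)
    h = layers.Conv2D(32, (3, 5), padding="same", activation="relu")(h)
    h = layers.Flatten()(h)
    h = layers.Dense(1024, activation="relu")(h)

    clean_frame = layers.Dense(n_freq, name="clean_frame")(h)               # main task
    snr_class = layers.Dense(n_snr_classes, activation="softmax",
                             name="snr_class")(h)                           # secondary task

    model = Model(x_in, [clean_frame, snr_class])
    model.compile(
        optimizer="adam",
        loss={"clean_frame": "mse", "snr_class": "sparse_categorical_crossentropy"},
        loss_weights={"clean_frame": 1.0, "snr_class": 0.1},
    )
    return model
```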


International Conference on Acoustics, Speech, and Signal Processing | 2010

An acoustic segment model approach to incorporating temporal information into speaker modeling for text-independent speaker recognition

Yu Tsao; Hanwu Sun; Haizhou Li; Chin-Hui Lee

We propose an acoustic segment model (ASM) approach to incorporating temporal information into speaker modeling for text-independent speaker recognition. In training, the proposed framework first estimates a collection of ASM-based universal background models (UBMs). Multiple sets of speaker-specific ASMs are then obtained by adapting the ASM-based UBMs with speaker-specific enrollment data. A novel use of language models over the ASM units is also proposed to characterize transitions among ASMs. In the testing phase, the ASM sets for the claimed speaker and the UBMs, along with a bigram ASM language model, are used to calculate detection scores for each test utterance. We report on speaker recognition experiments using the NIST 2001 SRE database. The results clearly indicate that the proposed ASM-based method achieves a notable improvement over GMM-based speaker modeling, in which no temporal modeling is considered. Moreover, a further error reduction is obtained by integrating the language model, another inclusion of temporal properties made possible by ASM-based speaker modeling.
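The scoring step can be illustrated with a short sketch: the usual acoustic log-likelihood ratio between the claimed speaker's adapted ASMs and the ASM-based UBMs is combined with the bigram language-model score of the decoded ASM-unit sequence. The log-domain interpolation and its weight are assumptions; the abstract only states that the bigram LM contributes a further error reduction.

```python
from typing import Dict, List, Tuple

def bigram_logprob(asm_sequence: List[str],
                   bigram_lm: Dict[Tuple[str, str], float],
                   backoff: float = -10.0) -> float:
    """Log-probability of a decoded ASM-unit sequence under a bigram language
    model of ASM units (characterising transitions among segments)."""
    return sum(bigram_lm.get((u, v), backoff)
               for u, v in zip(asm_sequence[:-1], asm_sequence[1:]))

def detection_score(ll_speaker_asm: float,
                    ll_ubm_asm: float,
                    asm_sequence: List[str],
                    speaker_bigram_lm: Dict[Tuple[str, str], float],
                    lm_weight: float = 0.1) -> float:
    """Combine the acoustic log-likelihood ratio (speaker-adapted ASMs vs.
    ASM-based UBMs) with the bigram LM score of the decoded ASM sequence.
    The interpolation weight is an assumed free parameter."""
    return (ll_speaker_asm - ll_ubm_asm) + lm_weight * bigram_logprob(
        asm_sequence, speaker_bigram_lm)
```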

Collaboration


Dive into Yu Tsao's collaborations.

Top Co-Authors

Syu-Siang Wang
Center for Information Technology

Ying-Hui Lai
Center for Information Technology

Xugang Lu
National Institute of Information and Communications Technology

Chin-Hui Lee
Georgia Institute of Technology

Hisashi Kawai
National Institute of Information and Communications Technology

Chiori Hori
National Institute of Information and Communications Technology

Payton Lin
Center for Information Technology