Publication


Featured research published by Chengzhu Yu.


international conference on acoustics, speech, and signal processing | 2014

Uncertainty propagation in front end factor analysis for noise robust speaker recognition

Chengzhu Yu; Gang Liu; Seongjun Hahm; John H. L. Hansen

In this study, we explore the propagation of uncertainty in a state-of-the-art speaker recognition system. Specifically, we incorporate the uncertainty associated with the observation features into the i-Vector extraction framework. To prove the concept, both oracle and practically estimated uncertainties are used for evaluation. The oracle uncertainty is calculated assuming knowledge of the clean speech features, while the estimated uncertainties are obtained using SPLICE and joint-GMM based methods. We evaluate the proposed framework on both the YOHO and NIST 2010 Speaker Recognition Evaluation (SRE) corpora by artificially introducing noise at different SNRs. In the speaker verification experiments, we confirm that the proposed uncertainty-based i-Vector extraction framework shows significant robustness against noise.
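To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of how observation uncertainty could be folded into i-Vector extraction: the feature uncertainty inflates the UBM covariances used when computing the i-Vector point estimate, so unreliable feature dimensions are de-weighted. All function and variable names are hypothetical, and a single utterance-level uncertainty is used for simplicity, whereas the paper propagates frame-level uncertainty.

```python
import numpy as np

def extract_ivector(N, F, T, sigma, R):
    """MAP point estimate of the i-Vector from Baum-Welch statistics.

    N     : (C,)      zero-order statistics per UBM component
    F     : (C, D)    centered first-order statistics
    T     : (C, D, R) total variability matrix, split per component
    sigma : (C, D)    diagonal UBM covariances
    R     : i-Vector dimensionality
    """
    L = np.eye(R)                                      # posterior precision
    b = np.zeros(R)
    for c in range(len(N)):
        T_over_sigma = T[c] / sigma[c][:, None]        # (D, R)
        L += N[c] * T[c].T @ T_over_sigma
        b += T_over_sigma.T @ F[c]
    return np.linalg.solve(L, b)

def extract_ivector_with_uncertainty(N, F, T, sigma, obs_uncertainty, R):
    """Hypothetical uncertainty-aware variant: the (diagonal) observation
    uncertainty is added to the UBM covariances, so noisy feature dimensions
    contribute less to the i-Vector estimate."""
    inflated_sigma = sigma + obs_uncertainty[None, :]  # obs_uncertainty: (D,)
    return extract_ivector(N, F, T, inflated_sigma, R)
```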


Journal of the Acoustical Society of America | 2014

Evaluation of the importance of time-frequency contributions to speech intelligibility in noise

Chengzhu Yu; Kamil K. Wójcicki; Philipos C. Loizou; John H. L. Hansen; Michael T. Johnson

Recent studies on binary masking techniques make the assumption that each time-frequency (T-F) unit contributes an equal amount to the overall intelligibility of speech. The present study demonstrates that the importance of each T-F unit to speech intelligibility varies in accordance with speech content. Specifically, T-F units are categorized into two classes: speech-present T-F units and speech-absent T-F units. Results indicate that the importance of each speech-present T-F unit to speech intelligibility is highly related to the loudness of its target component, while the importance of each speech-absent T-F unit varies according to the loudness of its masker component. Two types of mask errors are also considered: miss errors and false alarm errors. Consistent with previous work, false alarm errors are shown to be more harmful to speech intelligibility than miss errors when the mixture signal-to-noise ratio (SNR) is below 0 dB. However, the relative importance of the two types of error is conditioned on the SNR level of the input speech signal. Based on these observations, a mask-based objective measure, the loudness-weighted hit-false, is proposed for predicting speech intelligibility. The proposed objective measure shows significantly higher correlation with intelligibility compared to two existing mask-based objective measures.
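As a rough illustration of the kind of measure described above, the sketch below computes a loudness-weighted HIT-minus-FA style score: speech-present T-F units are weighted by their target loudness when counting hits, and speech-absent units by their masker loudness when counting false alarms. The exact weighting and normalization used in the paper may differ; names here are illustrative.

```python
import numpy as np

def loudness_weighted_hit_false(ideal_mask, estimated_mask, target_loudness, masker_loudness):
    """All inputs are (time, freq) numpy arrays; masks are binary (1 = retain unit)."""
    speech_present = ideal_mask == 1
    speech_absent = ideal_mask == 0

    hits = speech_present & (estimated_mask == 1)
    false_alarms = speech_absent & (estimated_mask == 1)

    # Hits weighted by how loud the target is in each correctly retained unit.
    weighted_hit = target_loudness[hits].sum() / max(target_loudness[speech_present].sum(), 1e-12)
    # False alarms weighted by how loud the masker is in each wrongly retained unit.
    weighted_fa = masker_loudness[false_alarms].sum() / max(masker_loudness[speech_absent].sum(), 1e-12)

    return weighted_hit - weighted_fa
```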


international conference on acoustics, speech, and signal processing | 2016

Context adaptive deep neural networks for fast acoustic model adaptation in noisy conditions

Marc Delcroix; Keisuke Kinoshita; Chengzhu Yu; Atsunori Ogawa; Takuya Yoshioka; Tomohiro Nakatani

Deep neural network (DNN) based acoustic models have greatly improved the performance of automatic speech recognition (ASR) for various tasks. Further performance improvements have been reported when making DNNs aware of the acoustic context (e.g. speaker or environment), for example by adding auxiliary features such as noise estimates or speaker i-vectors to the input. We have recently proposed the context adaptive DNN (CA-DNN), which is another approach to exploit acoustic context information within a DNN. A CA-DNN is a DNN with one or several factorized layers, i.e. layers that use a different set of parameters to process each acoustic context class. The output of a factorized layer is obtained as a weighted sum over the contributions of the different context classes, given a set of weights over those classes. In our previous work, the class weights were computed independently of the recognizer. In this paper, we extend that work by introducing joint training of the CA-DNN parameters and the class-weight computation. Consequently, the class weights and the associated class definitions can be optimized for ASR. We report experimental results on the AURORA4 noisy speech recognition task showing the potential of our approach for fast unsupervised adaptation.
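A minimal PyTorch sketch of the factorized, context-adaptive layer described above follows: one set of parameters per acoustic context class, combined by a weighted sum using the class weights. In the paper the class weights come from a jointly trained sub-network; here they are simply passed in, and all names and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class ContextAdaptiveLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_context_classes):
        super().__init__()
        # One linear transform per acoustic context class (e.g. noise condition).
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_context_classes)]
        )

    def forward(self, x, class_weights):
        # x: (batch, in_dim); class_weights: (batch, num_context_classes), rows sum to 1
        outputs = torch.stack([expert(x) for expert in self.experts], dim=1)  # (batch, K, out_dim)
        return (class_weights.unsqueeze(-1) * outputs).sum(dim=1)             # weighted sum over classes
```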


international conference on acoustics, speech, and signal processing | 2016

UTD-CRSS system for the NIST 2015 language recognition i-vector machine learning challenge

Chengzhu Yu; Chunlei Zhang; Shivesh Ranjan; Qian Zhang; Abhinav Misra; Finnian Kelly; John H. L. Hansen

In this paper, we present the system developed by the Center for Robust Speech Systems (CRSS), University of Texas at Dallas, for the NIST 2015 language recognition i-vector machine learning challenge. Our system includes several subsystems based on Linear Discriminant Analysis - Support Vector Machine (LDA-SVM) and deep neural network (DNN) approaches. An important feature of this challenge is its emphasis on out-of-set language detection. As a result, our system development focuses mainly on the evaluation and comparison of two different out-of-set language detection strategies: direct out-of-set detection and indirect out-of-set detection. These strategies differ mainly in whether the unlabeled development data are used. The experimental results indicate that the indirect out-of-set detection strategies used in our system can efficiently exploit the unlabeled development data, and therefore consistently outperform the direct out-of-set detection approach. Finally, by fusing four variants of indirect out-of-set detection based subsystems, our system achieves a relative performance gain of up to 45% compared to the baseline cosine distance scoring (CDS) system provided by the organizers.
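The sketch below contrasts the two strategies in simplified form, using scikit-learn classifiers on pre-extracted i-vectors: direct detection thresholds the best in-set score, while indirect detection pseudo-labels low-confidence unlabeled development i-vectors as an explicit out-of-set class and retrains. Thresholds, classifier choices, and helper names are assumptions, not the exact CRSS recipe.

```python
import numpy as np
from sklearn.svm import SVC

def direct_oos(train_x, train_y, test_x, threshold=0.5):
    """Direct detection: train on the labeled languages only and call a test
    i-vector out-of-set when its best in-set posterior falls below a threshold."""
    clf = SVC(kernel="linear", probability=True).fit(train_x, train_y)
    probs = clf.predict_proba(test_x)
    labels = clf.classes_[probs.argmax(axis=1)].astype(object)
    labels[probs.max(axis=1) < threshold] = "out_of_set"
    return labels

def indirect_oos(train_x, train_y, unlabeled_dev_x, test_x, threshold=0.5):
    """Indirect detection: dev i-vectors that look unlike any in-set language are
    pseudo-labeled as an explicit 'out_of_set' class and added to the training set."""
    base = SVC(kernel="linear", probability=True).fit(train_x, train_y)
    dev_conf = base.predict_proba(unlabeled_dev_x).max(axis=1)
    pseudo_oos = unlabeled_dev_x[dev_conf < threshold]

    aug_x = np.vstack([train_x, pseudo_oos])
    aug_y = np.concatenate([train_y, ["out_of_set"] * len(pseudo_oos)])
    clf = SVC(kernel="linear", probability=True).fit(aug_x, aug_y)
    return clf.classes_[clf.predict_proba(test_x).argmax(axis=1)]
```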


IEEE Journal of Selected Topics in Signal Processing | 2017

An Investigation of Deep-Learning Frameworks for Speaker Verification Antispoofing

Chunlei Zhang; Chengzhu Yu; John H. L. Hansen

In this study, we explore the use of deep-learning approaches for spoofing detection in speaker verification. Most spoofing detection systems that have achieved recent success employ hand-crafted features built on specific spoofing prior knowledge, which may limit their generalization to unseen spoofing attacks. We aim to investigate the genuine-spoofing discriminative ability from the back-end stage, utilizing recent advancements in deep-learning research. In this paper, alternative network architectures are exploited to target spoofed speech. Based on this analysis, a novel spoofing detection system, which simultaneously employs convolutional neural networks (CNNs) and recurrent neural networks (RNNs), is proposed. In this framework, the CNN is treated as a convolutional feature extractor applied to the speech input, and recurrent networks are employed on top of the CNN output to capture long-term dependencies across the time domain. Novel features, including the Teager energy operator critical band autocorrelation envelope, perceptual minimum variance distortionless response, and a more general spectrogram, are also investigated as inputs to the proposed deep-learning frameworks. Experiments using the ASVspoof 2015 Corpus show that the integrated CNN-RNN framework achieves state-of-the-art single-system performance. The addition of score-level fusion further improves system robustness. A detailed analysis shows that the proposed approach can potentially compensate for the problem of short-duration test utterances, which is also present in the evaluation corpus.
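The following is a compact PyTorch sketch of the kind of CNN plus RNN stack described above: convolutional layers act as a feature extractor over the time-frequency input, a GRU models long-term temporal structure, and a linear layer scores genuine versus spoofed. Layer counts and sizes are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class CnnRnnAntiSpoof(nn.Module):
    def __init__(self, n_freq_bins=256, rnn_hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),                       # pool along frequency only
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        self.rnn = nn.GRU(input_size=32 * (n_freq_bins // 4),
                          hidden_size=rnn_hidden, batch_first=True)
        self.classifier = nn.Linear(rnn_hidden, 2)      # genuine vs. spoofed

    def forward(self, spec):
        # spec: (batch, time, freq) spectrogram-like input
        x = self.cnn(spec.unsqueeze(1))                 # (batch, 32, time, freq // 4)
        x = x.permute(0, 2, 1, 3).flatten(2)            # (batch, time, 32 * freq // 4)
        _, h = self.rnn(x)                              # h: (1, batch, rnn_hidden)
        return self.classifier(h.squeeze(0))            # utterance-level logits
```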


spoken language technology workshop | 2014

Utilization of unlabeled development data for speaker verification

Gang Liu; Chengzhu Yu; Navid Shokouhi; Abhinav Misra; Hua Xing; John H. L. Hansen

State-of-the-art speaker verification systems model speaker identity by mapping i-Vectors onto a probabilistic linear discriminant analysis (PLDA) space. Compared to other modeling approaches (such as cosine distance scoring), PLDA provides a more efficient mechanism to separate speaker information from other sources of undesired variability and offers superior speaker verification performance. Unfortunately, this efficiency is obtained at the cost of a large corpus of labeled development data, which is too expensive or unrealistic to collect in many cases. This study investigates a potential solution to this challenge by effectively utilizing unlabeled development data through universal imposter clustering. The proposed method offers +21.9% and +34.6% relative gains over the baseline system on two publicly available corpora, respectively. This significant improvement demonstrates the effectiveness of the proposed method.
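A hedged sketch of the general idea follows: cluster the unlabeled development i-Vectors into pseudo-speakers and use the cluster labels as speaker labels for backend (e.g. PLDA) training. Plain k-means is used here purely for illustration and is a simplification of the universal imposter clustering described in the paper; names and cluster counts are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def pseudo_label_dev_ivectors(dev_ivectors, n_pseudo_speakers=500, seed=0):
    """Length-normalize the unlabeled dev i-Vectors and assign each one to a
    pseudo-speaker cluster; the returned labels can then be used as speaker
    labels for PLDA training alongside any genuinely labeled data."""
    x = normalize(dev_ivectors)                       # length normalization
    km = KMeans(n_clusters=n_pseudo_speakers, random_state=seed, n_init=10)
    pseudo_labels = km.fit_predict(x)
    return x, pseudo_labels
```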


IEEE Transactions on Audio, Speech, and Language Processing | 2017

Active Learning Based Constrained Clustering For Speaker Diarization

Chengzhu Yu; John H. L. Hansen

Most speaker diarization research has focused on unsupervised scenarios, where no human supervision is available. However, in many real-world applications a certain amount of human input can be expected, especially when minimal human supervision brings significant performance improvement. In this study, we propose an active learning based bottom-up speaker clustering algorithm to effectively improve speaker diarization performance with limited human input. Specifically, the proposed active learning based speaker clustering has two stages: explore and constrained clustering. The explore stage quickly discovers at least one sample for each speaker, boosting the speaker clustering process with reliable initial speaker clusters. After discovering all, or a majority of, the involved speakers during the explore stage, constrained clustering is performed. Constrained clustering is similar to the traditional bottom-up clustering process, with the important difference that the clusters created during the explore stage are restricted from merging with each other. Constrained clustering continues until only the clusters generated from the explore stage are left. Since the objective of the active learning based speaker clustering algorithm is to provide good initial speaker models, performance saturates as soon as sufficient examples are ensured for each cluster. To further improve diarization performance with increasing human input, we propose a second method which actively selects the speech segments that account for the largest expected speaker error in the existing cluster assignments for human evaluation and reassignment. The algorithms are evaluated on our recently created Apollo Mission Control Center dataset as well as the Augmented Multi-party Interaction (AMI) meeting corpus. The results indicate that the proposed active learning algorithms are able to reduce the diarization error rate significantly with a relatively small amount of human supervision.
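The simplified sketch below illustrates the two stages: seed clusters obtained from the explore stage (via human labeling) are never merged with one another during the subsequent bottom-up clustering. The distance function, the human oracle, and the stopping rule are placeholders rather than the authors' actual procedure.

```python
def constrained_bottom_up(segments, seed_labels, distance):
    """segments   : list of segment embeddings
       seed_labels: dict {segment_index: speaker_id} from the explore stage
       distance   : callable(list_of_embeddings_a, list_of_embeddings_b) -> float
    Returns the final clusters as (set_of_segment_indices, speaker_id_or_None) pairs."""
    clusters = [({i}, seed_labels.get(i)) for i in range(len(segments))]

    def mergeable(a, b):
        # Two explore-stage clusters (both with known speakers) may never merge.
        return a[1] is None or b[1] is None

    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if not mergeable(clusters[i], clusters[j]):
                    continue
                d = distance([segments[k] for k in clusters[i][0]],
                             [segments[k] for k in clusters[j][0]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None:            # only mutually constrained seed clusters remain
            break
        _, i, j = best
        label = clusters[i][1] if clusters[i][1] is not None else clusters[j][1]
        merged = (clusters[i][0] | clusters[j][0], label)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```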


international conference on acoustics, speech, and signal processing | 2016

Language recognition using deep neural networks with very limited training data

Shivesh Ranjan; Chengzhu Yu; Chunlei Zhang; Finnian Kelly; John H. L. Hansen

This study proposes a novel deep neural network (DNN) based approach to language identification (LID) for the NIST 2015 Language Recognition (LRE) i-Vector Machine Learning Challenge. State-of-the-art DNN based LID systems utilize large amounts of labeled training data. The 2015 LRE i-Vector Machine Learning Challenge limits access to ready-to-use i-Vectors for LID system training and testing. This poses unique challenges in designing DNN based LID systems, since optimized front-ends and network architectures can no longer be used. We propose to use the training i-Vectors to train an initial DNN for LID. Next, we present a novel strategy to use this initial DNN to estimate out-of-set language labels from the development data. The final DNN for LID is trained using the original training data and the estimated out-of-set language data. We show that augmenting the training set with out-of-set labels leads to significant improvement in LID performance. Our approach obtains very competitive costs (as defined by NIST) of 26.56 and 25.98 on the progress and evaluation subsets of the challenge, respectively. Since the amount of training data is very limited (300 i-Vectors per language), this study outlines a successful recipe for DNN based LID using very limited resources.


conference of the international speech communication association | 2016

Text-available speaker recognition system for forensic applications

Chengzhu Yu; Chunlei Zhang; Finnian Kelly; Abhijeet Sangwan; John H. L. Hansen

This paper examines a text-available speaker recognition approach targeting scenarios where the transcripts of test utterances are either available or obtainable through manual transcription. Forensic speaker recognition is one such application where human supervision can be expected. In our study, we extend an existing Deep Neural Network (DNN) i-vector based speaker recognition system to effectively incorporate text information associated with the test utterances. We first show experimentally that speaker recognition performance drops significantly if the DNN output posteriors are directly replaced with the target senones obtained from forced alignment. The cause of this performance drop can be attributed to the fact that forced alignment selects only the single most probable senone for each frame, which is not desirable in the current speaker recognition framework. To resolve this problem, we propose a posterior mapping approach in which the relationship between forced-aligned senones and their corresponding DNN posteriors is modeled. By replacing the DNN output posteriors with senone-mapped posteriors, a robust text-available speaker recognition system can be obtained in mismatched environments. Experiments using the proposed approach are performed on the Aurora-4 dataset.
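A hedged sketch of the posterior-mapping idea follows: rather than replacing the DNN posterior with a one-hot vector at the forced-aligned senone, each senone is mapped to the average DNN posterior observed for frames aligned to it, and that soft vector is used in place of the network output. Function names and the averaging recipe are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def learn_senone_to_posterior_map(dnn_posteriors, aligned_senones, n_senones):
    """dnn_posteriors : (T, n_senones) DNN outputs on some training data
       aligned_senones: (T,) integer forced-alignment senone index per frame
       Returns an (n_senones, n_senones) table: row s is the mapped soft posterior for senone s."""
    mapping = np.zeros((n_senones, n_senones))
    counts = np.zeros(n_senones)
    for post, s in zip(dnn_posteriors, aligned_senones):
        mapping[s] += post
        counts[s] += 1
    counts[counts == 0] = 1                      # leave unseen senones as all-zero rows
    return mapping / counts[:, None]

def mapped_posteriors_for_utterance(aligned_senones, mapping):
    """Replace each frame's forced-aligned senone with its mapped soft posterior."""
    return mapping[aligned_senones]              # (T, n_senones)
```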


international conference on acoustics, speech, and signal processing | 2013

A new mask-based objective measure for predicting the intelligibility of binary masked speech

Chengzhu Yu; Kamil K. Wójcicki; Philipos C. Loizou; John H. L. Hansen

Mask-based objective speech-intelligibility measures have been proposed and used successfully for evaluating the performance of binary masking algorithms. These objective measures are computed directly by comparing the estimated binary mask against the ground truth ideal binary mask (IdBM). Most of these objective measures, however, assign equal weight to all time-frequency (T-F) units. In this study, we propose to improve the existing mask-based objective measures by weighting each T-F unit according to its target or masker loudness. The proposed objective measure shows significantly better performance than two other existing mask-based objective measures.

Collaboration


Dive into Chengzhu Yu's collaborations.

Top Co-Authors

John H. L. Hansen (University of Texas at Dallas)
Chunlei Zhang (University of Texas at Dallas)
Abhijeet Sangwan (University of Texas at Dallas)
Gang Liu (University of Texas at Dallas)
Lakshmish Kaushik (University of Texas at Dallas)
Abhinav Misra (University of Texas at Dallas)
Finnian Kelly (University of Texas at Dallas)
Navid Shokouhi (University of Texas at Dallas)
Shivesh Ranjan (University of Texas at Dallas)
Kamil K. Wójcicki (University of Texas at Dallas)