
Publication


Featured research published by Yao Tian.


International Conference on Acoustics, Speech, and Signal Processing | 2017

Deep neural networks based speaker modeling at different levels of phonetic granularity

Yao Tian; Liang He; Meng Cai; Wei-Qiang Zhang; Jia Liu

Recently, a hybrid deep neural network (DNN)/i-vector framework has proven effective for speaker verification, where a DNN trained to predict tied-triphone states (senones) is used to produce frame alignments for sufficient statistics extraction. In this work, in order to better understand the impact of phonetic precision on speaker verification tasks, three levels of phonetic granularity are evaluated for frame alignment: tied-triphone state, monophone state, and monophone. The distribution of the features associated with a given phonetic unit is further modeled with multiple Gaussians rather than a single Gaussian. We also propose a fast and efficient way to generate phonetic units of different granularity by tying the DNN's outputs according to clustering results based on DNN-derived senone embeddings. Experiments are carried out on the NIST SRE 2008 female tasks. Results show that using DNNs with less precise phonetic units and more Gaussians per phonetic unit for speaker modeling generalizes better across speaker verification tasks.
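The alignment-and-statistics step described above can be sketched as follows (a minimal NumPy illustration; the function and variable names are assumptions, not taken from the paper):

```python
import numpy as np

def sufficient_stats(posteriors, features):
    """Zeroth- and first-order Baum-Welch statistics from soft alignments.

    posteriors: (T, C) per-frame alignment posteriors over C phonetic units
                (e.g. senones, monophone states, or monophones from a DNN).
    features:   (T, D) acoustic feature vectors.
    Returns N of shape (C,) and F of shape (C, D).
    """
    N = posteriors.sum(axis=0)     # zeroth-order: soft frame counts per unit
    F = posteriors.T @ features    # first-order: posterior-weighted feature sums
    return N, F

# toy example: 4 frames, 3 phonetic units, 2-dim features
gamma = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
x = np.arange(8, dtype=float).reshape(4, 2)
N, F = sufficient_stats(gamma, x)
# N -> [1., 2., 1.]; F[1] -> x[1] + x[2] = [6., 8.]
```

With coarser phonetic units, C shrinks and each unit accumulates more frames, which is the granularity trade-off the abstract studies.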


Conference of the International Speech Communication Association | 2016

Investigating Various Diarization Algorithms for Speaker in the Wild (SITW) Speaker Recognition Challenge

Yi Liu; Yao Tian; Liang He; Jia Liu

Collecting training data for real-world text-independent speaker recognition is challenging. In practice, utterances for a specific speaker are often mixed with many other acoustic signals. To guarantee recognition performance, the segments spoken by target speakers should be precisely picked out, and automatic detection can reduce the cost of expensive hand-made annotations. One way to achieve this is to use speaker diarization as a pre-processing step in the speaker enrollment phase. To this end, three speaker diarization algorithms, based on the Bayesian information criterion (BIC), agglomerative information bottleneck (aIB), and i-vectors, are investigated in this paper, along with their impact on the results of the speaker recognition system. Experiments conducted on the Speaker in the Wild (SITW) Speaker Recognition Challenge (SRC) 2016 show that proper speaker diarization improves overall performance. We also explore combining these methods.
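As one example of the investigated families, a BIC-based speaker-change test between two segments can be sketched as follows (the standard full-covariance Gaussian formulation; the paper's exact configuration and penalty weight are assumptions):

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """Delta-BIC for merging two segments vs. keeping them separate.

    x1, x2: (n_i, d) feature matrices for the two segments.
    A positive value suggests two distinct speakers (do not merge).
    lam is the usual tunable penalty weight.
    """
    x = np.vstack([x1, x2])
    n, d = x.shape
    n1, n2 = len(x1), len(x2)

    def logdet_cov(m):
        cov = np.cov(m, rowvar=False, bias=True)
        return np.linalg.slogdet(np.atleast_2d(cov))[1]

    # model-complexity penalty for splitting into two full-cov Gaussians
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(x)
            - 0.5 * n1 * logdet_cov(x1)
            - 0.5 * n2 * logdet_cov(x2)
            - penalty)
```

Agglomerative diarization repeatedly merges the segment pair with the lowest delta-BIC until no pair falls below zero.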


Odyssey 2016 | 2016

Investigation of Senone-based Long-Short Term Memory RNNs for Spoken Language Recognition

Yao Tian; Liang He; Yi Liu; Jia Liu

Recently, the integration of deep neural networks (DNNs) trained to predict senone posteriors with conventional language modeling methods has proven effective for spoken language recognition. This work extends several senone-based DNN frameworks by replacing the DNN with an LSTM RNN. Two of the approaches use the LSTM RNN to generate features: these are extracted from the recurrent projection layer either as frame-level acoustic features or as utterance-level features, and are then processed in different ways to produce scores for each target language. In the third approach, the conventional i-vector model is modified to use the LSTM RNN to produce frame alignments for sufficient statistics extraction. Experiments on NIST LRE 2015 demonstrate the effectiveness of the proposed methods.


International Symposium on Chinese Spoken Language Processing | 2014

Speaker verification using Fisher vector

Yao Tian; Liang He; Zhi-Yi Li; Wei-lan Wu; Wei-Qiang Zhang; Jia Liu

This paper introduces an approach based on the Fisher vector feature representation for speaker verification. The Fisher vector originates from the Fisher kernel and represents each utterance as a high-dimensional vector by encoding the derivatives of the log-likelihood of the UBM with respect to its means and variances. This representation captures the average first- and second-order differences between the utterance and each of the Gaussian centers of the UBM. The Fisher vector is further projected to a low-dimensional space using PPCA, conducted in a manner similar to factor analysis. We compare the proposed method with the state-of-the-art i-vector approach on the telephone-telephone condition of the NIST SRE 2010 female and male core tasks. The experimental results indicate that the proposed Fisher vector based method is competitive with the i-vector. It also provides complementary information: fusing the two approaches yields relative improvements over the i-vector alone of 11.8% (female) and 14.7% (male) in EER, and 9.2% and 2.7% in minDCF.
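The mean- and variance-gradient encoding described above can be sketched as follows (a minimal diagonal-covariance version; the normalization follows the common Perronnin-style formulation, which is an assumption, as the paper's exact choices are not stated here):

```python
import numpy as np

def fisher_vector(X, weights, means, sigmas):
    """Fisher vector of an utterance w.r.t. a diagonal-covariance UBM.

    X: (T, D) frames; weights: (K,); means, sigmas: (K, D) (std devs).
    Concatenates the normalized gradients of the average log-likelihood
    with respect to each Gaussian's mean and variance.
    """
    T, D = X.shape
    K = len(weights)
    # per-frame responsibilities gamma (T, K) under the diagonal GMM
    log_p = np.stack([
        -0.5 * (((X - means[k]) / sigmas[k]) ** 2).sum(1)
        - np.log(sigmas[k]).sum() + np.log(weights[k])
        for k in range(K)], axis=1)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)

    parts = []
    for k in range(K):
        z = (X - means[k]) / sigmas[k]          # whitened deviations (T, D)
        g = gamma[:, k:k + 1]
        d_mu = (g * z).sum(0) / (T * np.sqrt(weights[k]))
        d_sig = (g * (z ** 2 - 1)).sum(0) / (T * np.sqrt(2 * weights[k]))
        parts.extend([d_mu, d_sig])
    return np.concatenate(parts)                # shape (2 * K * D,)
```

The resulting 2KD-dimensional vector is what the paper then compresses with PPCA before back-end scoring.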


Conference of the International Speech Communication Association | 2016

Improving Deep Neural Networks Based Speaker Verification Using Unlabeled Data

Yao Tian; Meng Cai; Liang He; Wei-Qiang Zhang; Jia Liu

Recently, deep neural networks (DNNs) trained to predict senones have been incorporated into conventional i-vector based speaker verification systems to provide soft frame alignments, showing promising results. However, a data mismatch problem may degrade performance, since the DNN requires transcribed data (out-of-domain data) while the data sets used for i-vector training and extraction (in-domain data) are mostly untranscribed. In this paper, we address this problem by exploiting the unlabeled in-domain data during DNN training, so that the DNN can provide a more robust basis for the in-domain data. We first explore the impact of using in-domain data during the unsupervised DNN pre-training process. In addition, we decode the in-domain data with a hybrid DNN-HMM system to obtain transcriptions, and then retrain the DNN model with this “labeled” in-domain data. Experimental results on the NIST SRE 2008 and NIST SRE 2010 databases demonstrate the effectiveness of the proposed methods.


International Symposium on Chinese Spoken Language Processing | 2016

A study of variational method for text-independent speaker recognition

Liang He; Yao Tian; Yi Liu; Fang Dong; Wei-Qiang Zhang; Jia Liu

The i-vector has become the state-of-the-art approach for text-independent speaker recognition. Most related works treat i-vector extraction as a black box, using open-source software (e.g., Kaldi, ALIZE), and focus on vector-based back-end algorithms such as length normalization, WCCN, or PLDA. In this paper, we study the variational method and present a concise derivation of the i-vector. Based on the proposed method, three criteria for the derivation are compared: maximum likelihood (ML), maximum a posteriori (MAP), and maximum marginal likelihood (MML). Experimental results on the NIST SRE08 tel-tel English condition task validate our work.
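For reference, the quantity such derivations target is the standard MAP point estimate computed by an i-vector extractor (notation assumed here, not taken from the paper):

```latex
\hat{w}(u) \;=\; \bigl( I + T^{\top} \Sigma^{-1} N(u)\, T \bigr)^{-1}\, T^{\top} \Sigma^{-1} \tilde{F}(u),
```

where $T$ is the total variability matrix, $\Sigma$ the UBM covariance, $N(u)$ the diagonal matrix built from the zeroth-order Baum-Welch statistics of utterance $u$, and $\tilde{F}(u)$ its mean-centered first-order statistics.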


International Symposium on Signal Processing and Information Technology | 2015

THUEE language modeling method for the OpenKWS 2015 evaluation

Zhuo Zhang; Wei-Qiang Zhang; Kai-Xiang Shen; Xu-Kui Yang; Yao Tian; Meng Cai; Jia Liu

In this paper, we describe the THUEE (Department of Electronic Engineering, Tsinghua University) team's method of building language models (LMs) for the OpenKWS 2015 Evaluation held by the National Institute of Standards and Technology (NIST). Because NIST provided very limited in-domain data, most of our time and effort went into making good use of the out-of-domain data. Our work involves three main steps. First, the out-of-domain data is cleaned. Second, by comparing cross-entropy differences between the in-domain and out-of-domain data, the part of the out-of-domain corpus that is well matched to the in-domain data is selected as the training corpus. Third, the final n-gram LM is obtained by interpolating individual n-gram LMs trained on the different corpora, and all the training data is further combined to train one feed-forward neural network LM (FNNLM). In this way, we reduce perplexity on the development test data by 8.3% for the n-gram LM and 1.7% for the FNNLM, and the Actual Term-Weighted Value (ATWV) of the final result is 0.5391.
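The cross-entropy-difference selection in the second step can be sketched as follows (a toy unigram version of Moore-Lewis style selection; a real system would use stronger LMs, and all names here are assumptions):

```python
import math
from collections import Counter

def unigram_logprob_model(corpus):
    """Add-one-smoothed unigram log-probability from tokenized sentences."""
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 for the unseen-token bucket
    def logprob(tok):
        return math.log((counts.get(tok, 0) + 1) / (total + vocab))
    return logprob

def cross_entropy(sent, logprob):
    """Per-token cross-entropy of one sentence under a model."""
    return -sum(logprob(t) for t in sent) / max(len(sent), 1)

def select_matched(out_domain, in_domain, threshold=0.0):
    """Keep out-of-domain sentences whose cross-entropy under the
    in-domain LM minus that under the out-of-domain LM is below the
    threshold (lower difference = better matched to the in-domain data)."""
    lp_in = unigram_logprob_model(in_domain)
    lp_out = unigram_logprob_model(out_domain)
    return [s for s in out_domain
            if cross_entropy(s, lp_in) - cross_entropy(s, lp_out) < threshold]
```

Sentences that look more probable under the in-domain model than under the general out-of-domain model survive the filter and join the LM training corpus.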


International Conference on Signal and Information Processing | 2015

Stacked bottleneck features for speaker verification

Yao Tian; Liang He; Jia Liu

i-Vector modeling has been shown to be effective for text-independent speaker verification. It represents each utterance as a low-dimensional vector using factor analysis on a GMM supervector. In order to capture more complex speaker statistics, this paper proposes a new feature representation, other than i-vectors, for speaker verification using neural networks. Stacked bottleneck features are extracted from cascaded neural networks built on GMM supervectors, and dropout is integrated into the model to reduce generalization error. We compare the proposed method with the i-vector approach on the NIST SRE 2008 female short2-short3 telephone-telephone task. Experimental results demonstrate the efficacy of the proposed method.


International Symposium on Chinese Spoken Language Processing | 2014

A new fast and memory effective i-vector extraction based on factor analysis of KLD derived GMM supervector

Zhi-Yi Li; Wei-Qiang Zhang; Yao Tian; Jia Liu

At present, the i-vector model has become the state-of-the-art technology for speaker recognition. It represents a speech utterance as a low-dimensional, fixed-length, compact i-vector. For some real applications, the i-vector extraction procedure is relatively slow and requires too much memory. Fast extraction methods based on numerical approximation have been proposed to speed up the computation and save memory, but they all incur some degree of performance degradation. From a novel model-approximation viewpoint, we first propose a fast i-vector extraction method based on subspace factor analysis of a Kullback-Leibler divergence derived Gaussian mixture model supervector. Experimental results on NIST SRE datasets demonstrate that the proposed method is faster and performs better than existing methods at a similar run-time ratio. In addition, owing to the different modeling viewpoint, we propose a combination with the factorized-subspace-based extraction. This combination avoids the accuracy degradation and can even outperform the standard method, while its extraction speed is up to 10 times faster.


Conference of the International Speech Communication Association | 2015

Simultaneous Utilization of Spectral Magnitude and Phase Information to Extract Supervectors for Speaker Verification Anti-spoofing

Yi Liu; Yao Tian; Liang He; Jia Liu; Michael T. Johnson

Collaboration


Dive into Yao Tian's collaborations.

Top Co-Authors
Yi Liu

Tsinghua University


Fang Dong

Zhejiang University City College
