Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Lantian Li is active.

Publication


Featured research published by Lantian Li.


IEEE Transactions on Audio, Speech, and Language Processing | 2016

Improving short utterance speaker recognition by modeling speech unit classes

Lantian Li; Dong Wang; Chenhao Zhang; Thomas Fang Zheng

Short utterance speaker recognition (SUSR) is highly challenging due to the limited enrollment and/or test data. We argue that the difficulty can be largely attributed to the mismatched prior distributions of the speech data used to train the universal background model (UBM) and those for enrollment and test. This paper presents a novel solution that distributes speech signals into a multitude of acoustic subregions that are defined by speech units, and models speakers within the subregions. To avoid data sparsity, a data-driven approach is proposed to cluster speech units into speech unit classes, based on which robust subregion models can be constructed. Furthermore, we propose a model synthesis approach based on maximum likelihood linear regression (MLLR) to deal with speech unit classes for which no data are available. The experiments were conducted on a publicly available database, SUD12. The results demonstrate that on a text-independent speaker recognition task where the test utterances are no longer than 2 seconds and mostly shorter than 0.5 seconds, the proposed subregion modeling offers a 21.51% relative reduction in equal error rate (EER) compared with the standard GMM-UBM baseline. In addition, with the model synthesis approach, performance can be greatly improved in scenarios where no enrollment data are available for some speech unit classes.
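A minimal sketch of the subregion idea, assuming frame features have already been labeled with speech-unit classes by an external recognizer (the class count, mixture sizes, and score averaging below are illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_subregion_models(frames, classes, n_classes, n_mix=4):
    """Fit one small GMM per speech-unit class (sizes are illustrative)."""
    models = {}
    for c in range(n_classes):
        X = frames[classes == c]
        if len(X) >= n_mix:  # skip sparse classes; the paper instead
            models[c] = GaussianMixture(n_mix, covariance_type="diag").fit(X)
    return models            # synthesizes missing models via MLLR

def score(frames, classes, spk_models, ubm_models):
    """Average per-frame log-likelihood ratio, routed through class models."""
    llr = []
    for x, c in zip(frames, classes):
        if c in spk_models and c in ubm_models:
            x = x[None, :]
            llr.append(spk_models[c].score(x) - ubm_models[c].score(x))
    return float(np.mean(llr)) if llr else 0.0
```

Classes with too little enrollment data are simply skipped in this sketch; the paper handles them with the MLLR-based model synthesis instead.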


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2015

Improved deep speaker feature learning for text-dependent speaker recognition

Lantian Li; Yiye Lin; Zhiyong Zhang; Dong Wang

A deep learning approach has recently been proposed to derive speaker identities (d-vectors) with a deep neural network (DNN). This approach has been applied to text-dependent speaker recognition tasks and shows reasonable performance gains when combined with the conventional i-vector approach. Although promising, the existing d-vector implementation still cannot compete with the i-vector baseline. This paper presents two improvements to the deep learning approach: a phone-dependent DNN structure to normalize phone variation, and a new scoring approach based on dynamic time warping (DTW). Experiments on a text-dependent speaker recognition task demonstrate that the proposed methods provide considerable performance improvement over the existing d-vector implementation.
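The DTW scoring component can be sketched as follows, assuming frame-level d-vectors have already been extracted from the DNN (the cosine local distance and the path-length normalization are plausible choices, not necessarily the paper's):

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance between two d-vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def dtw_score(enroll, test):
    """DTW between two d-vector sequences of shapes (T1, D) and (T2, D)."""
    T1, T2 = len(enroll), len(test)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = cosine_dist(enroll[i - 1], test[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Lower accumulated distance means more similar; negate to use as a score.
    return -D[T1, T2] / (T1 + T2)
```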


IEEE Transactions on Audio, Speech, and Language Processing | 2017

Collaborative Joint Training With Multitask Recurrent Model for Speech and Speaker Recognition

Zhiyuan Tang; Lantian Li; Dong Wang; Ravichander Vipperla

Automatic speech and speaker recognition are traditionally treated as two independent tasks and are studied separately. The human brain, in contrast, deciphers the linguistic content and the speaker traits from speech in a collaborative manner. This key observation motivates the work presented in this paper. A collaborative joint training approach based on multitask recurrent neural network models is proposed, where the output of one task is backpropagated to the other tasks. This is a general framework for learning collaborative tasks and fits well with the goal of joint learning of automatic speech and speaker recognition. Through a comprehensive study, it is shown that the multitask recurrent neural network models deliver improved performance on both automatic speech and speaker recognition tasks as compared to single-task systems. The strength of such multitask collaborative learning is analyzed, and the impact of various training configurations is investigated.
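A hedged PyTorch sketch of the cross-feeding idea, in which each task's recurrent tower receives the other tower's hidden state alongside the acoustic frame (the layer sizes, output targets, and single-layer LSTM cells are illustrative assumptions):

```python
import torch
import torch.nn as nn

class MultiTaskRNN(nn.Module):
    """Two recurrent towers; at each step the output (hidden state) of one
    task is fed into the input of the other. All sizes are illustrative."""
    def __init__(self, feat_dim=40, hid=256, n_phones=100, n_spks=500):
        super().__init__()
        self.hid = hid
        self.asr_cell = nn.LSTMCell(feat_dim + hid, hid)
        self.spk_cell = nn.LSTMCell(feat_dim + hid, hid)
        self.asr_out = nn.Linear(hid, n_phones)
        self.spk_out = nn.Linear(hid, n_spks)

    def forward(self, x):  # x: (T, B, feat_dim)
        T, B, _ = x.shape
        h_a = c_a = h_s = c_s = x.new_zeros(B, self.hid)
        asr_logits, spk_logits = [], []
        for t in range(T):
            # Cross-feed: each tower sees the other tower's hidden state.
            h_a, c_a = self.asr_cell(torch.cat([x[t], h_s], 1), (h_a, c_a))
            h_s, c_s = self.spk_cell(torch.cat([x[t], h_a], 1), (h_s, c_s))
            asr_logits.append(self.asr_out(h_a))
            spk_logits.append(self.spk_out(h_s))
        return torch.stack(asr_logits), torch.stack(spk_logits)
```

Training would then simply sum a phone-classification loss and a speaker-classification loss over the two output streams.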


Speech Communication | 2016

Improving speaker verification performance against long-term speaker variability

Linlin Wang; Jun Wang; Lantian Li; Thomas Fang Zheng; Frank K. Soong

Highlights: a specially collected speech database reflecting long-term speaker variability; F-ratio as a criterion to determine the importance of speaker- and session-specific information; frequency warping and filter-bank output weighting strategies for feature extraction.

Speaker verification performance degrades when input speech is tested in different sessions over a long period of time. Common ways to alleviate this long-term impact on performance are enrollment data augmentation, speaker model adaptation, and adapted verification thresholds. From the feature perspective of a pattern recognition system, robust features that are speaker-specific and invariant to time and acoustic environment are preferred for dealing with this long-term variability. In this paper, using a newly created speech database, CSLT-Chronos, specially collected to reflect long-term speaker variability, we investigate the issue in the frequency domain by emphasizing higher discrimination for speaker-specific information and lower sensitivity to time-related, session-specific information. F-ratio is employed as the criterion to judge these two kinds of information and to find a compromise between them. Inspired by the feature extraction procedure of traditional MFCC calculation, two emphasis strategies are explored for generating modified acoustic features: pre-filtering frequency warping and post-filtering filter-bank output weighting. Experiments show that the two proposed features outperform traditional MFCCs on CSLT-Chronos. The proposed approach is also studied on the NIST SRE 2008 database in a state-of-the-art, i-vector based architecture. Experimental results demonstrate the advantage of the proposed features over MFCCs in LDA- and PLDA-based i-vector systems.
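The F-ratio criterion itself is straightforward to sketch. Assuming frame features (e.g., filter-bank outputs) grouped by speaker label, a high per-channel F-ratio marks a speaker-discriminative channel; the normalization into weights shown here is a simple illustration, not the paper's exact weighting strategy:

```python
import numpy as np

def f_ratio(feats, speakers):
    """Per-dimension F-ratio: between-speaker variance over within-speaker
    variance of (e.g.) filter-bank outputs. feats: (N, D), speakers: (N,)."""
    dims = feats.shape[1]
    mu = feats.mean(axis=0)
    between = np.zeros(dims)
    within = np.zeros(dims)
    for s in np.unique(speakers):
        X = feats[speakers == s]
        between += len(X) * (X.mean(axis=0) - mu) ** 2
        within += ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    return between / (within + 1e-8)

# Channels with a high speaker F-ratio (and, per the paper, a low
# session F-ratio) would receive larger weights before the DCT step.
def channel_weights(fr):
    return fr / fr.sum()
```

Computing the same ratio with session labels in place of speaker labels gives the sensitivity to session-specific information that the paper trades off against.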


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2014

An overview of robustness related issues in speaker recognition

Thomas Fang Zheng; Qin Jin; Lantian Li; Jun Wang; Fanhu Bie

Speaker recognition technologies have improved rapidly in recent years. However, critical robustness issues need to be addressed when they are applied in practical situations. This paper provides an overview of technologies dealing with robustness-related issues in automatic speaker recognition. We first categorize the robustness issues into three categories: environment-related, speaker-related, and application-oriented issues. For each category, we then describe the current hot topics, existing technologies, and potential future research directions.


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2016

AP16-OL7: A multilingual database for oriental languages and a language recognition baseline

Dong Wang; Lantian Li; Difei Tang; Qing Chen

We present the AP16-OL7 database, which was released as the training and test data for the oriental language recognition (OLR) challenge at APSIPA 2016. Based on this database, a baseline system was constructed using the i-vector model. We report the baseline results evaluated with the metrics defined by the AP16-OLR evaluation plan, and demonstrate that AP16-OL7 is a reasonable data resource for multilingual research.
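As a hedged illustration of the kind of i-vector scoring such a baseline might use (the actual AP16-OLR baseline follows the evaluation plan; the cosine-against-language-mean scoring below is a deliberate simplification):

```python
import numpy as np

def language_means(ivectors, labels):
    """Length-normalized mean i-vector per language. ivectors: (N, D)."""
    def norm(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)
    ivectors = norm(ivectors)
    return {l: norm(ivectors[labels == l].mean(axis=0)) for l in set(labels)}

def classify(ivec, means):
    """Cosine score against each language model; return best and all scores."""
    ivec = ivec / (np.linalg.norm(ivec) + 1e-8)
    scores = {l: float(m @ ivec) for l, m in means.items()}
    return max(scores, key=scores.get), scores
```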


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2016

Multi-task recurrent model for speech and speaker recognition

Zhiyuan Tang; Lantian Li; Dong Wang

Although highly correlated, speech and speaker recognition have been regarded as two independent tasks and studied by two separate communities. This is certainly not how people behave: we decipher both speech content and speaker traits at the same time. This paper presents a unified model that performs speech and speaker recognition simultaneously. The model is based on a unified neural network in which the output of one task is fed to the input of the other, leading to a multi-task recurrent network. Experiments show that the joint model outperforms the task-specific models on both tasks.


International Conference on Acoustics, Speech, and Signal Processing | 2017

Speaker segmentation using deep speaker vectors for fast speaker change scenarios

Renyu Wang; Mingliang Gu; Lantian Li; Mingxing Xu; Thomas Fang Zheng

A novel speaker segmentation approach based on deep neural networks is proposed and investigated. This approach uses deep speaker vectors (d-vectors) to represent speaker characteristics and to find speaker change points. The d-vector is a frame-level speaker-discriminative feature whose discriminative training process matches the goal of distinguishing a speaker change point from a single-speaker speech segment within a short time window. Following traditional metric-based segmentation, each analysis window contains two sub-windows and is shifted along the audio stream to detect speaker change points, where the speaker characteristics of each sub-window are represented by the mean of the deep speaker vectors over all of its frames. Experimental investigations conducted in fast speaker change scenarios show that the proposed method can detect speaker change points more quickly and more effectively than commonly used segmentation methods.
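A compact sketch of this metric-based detection loop, assuming frame-level d-vectors are given (the sub-window length and threshold are illustrative values, not the paper's tuned settings):

```python
import numpy as np

def change_points(dvecs, win=50, thr=0.3):
    """Slide an analysis window of two sub-windows over frame-level
    d-vectors of shape (T, D); flag frames where the sub-window means
    diverge. Returns candidate change-point frame indices."""
    def cos_dist(a, b):
        return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    dist = np.zeros(len(dvecs))
    for t in range(win, len(dvecs) - win):
        left = dvecs[t - win:t].mean(axis=0)    # mean d-vector, left half
        right = dvecs[t:t + win].mean(axis=0)   # mean d-vector, right half
        dist[t] = cos_dist(left, right)
    # A change point is a local maximum of the distance curve above thr.
    return [t for t in range(1, len(dvecs) - 1)
            if dist[t] > thr and dist[t] >= dist[t - 1] and dist[t] >= dist[t + 1]]
```

Because d-vectors are frame-level features, the sub-windows can be made much shorter than those of classic BIC-style segmentation, which is what helps in fast speaker change scenarios.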


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2016

System combination for short utterance speaker recognition

Lantian Li; Dong Wang; Xiaodong Zhang; Thomas Fang Zheng; Panshi Jin

Performance often degrades dramatically on text-independent short-utterance speaker recognition (SUSR) tasks. This paper presents a combination approach to the SUSR task with two phonetic-aware systems: one is the DNN-based i-vector system, and the other is our recently proposed subregion-based GMM-UBM system. The former employs phone posteriors to construct an i-vector model in which the shared statistics offer stronger robustness against limited test data, while the latter establishes a phone-dependent GMM-UBM system that represents speaker characteristics in more detail. A score-level fusion is implemented to integrate the respective advantages of the two systems. Experimental results show that on the text-independent SUSR task, both the DNN-based i-vector system and the subregion-based GMM-UBM system outperform their respective baselines, and the score-level system combination delivers further performance improvement.
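Score-level fusion of two systems can be as simple as the following sketch (the z-normalization over the trial list and the equal interpolation weight are illustrative choices, not values from the paper):

```python
import numpy as np

def fuse(scores_a, scores_b, w=0.5):
    """Score-level fusion: z-normalize each system's scores over the
    trial list, then linearly interpolate them."""
    def z(s):
        s = np.asarray(s, dtype=float)
        return (s - s.mean()) / (s.std() + 1e-8)
    return w * z(scores_a) + (1 - w) * z(scores_b)
```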


International Symposium on Chinese Spoken Language Processing | 2016

Max-margin metric learning for speaker recognition

Lantian Li; Dong Wang; Chao Xing; Thomas Fang Zheng

Probabilistic linear discriminant analysis (PLDA) is a popular normalization approach for the i-vector model and has delivered state-of-the-art performance in speaker recognition. A potential problem of the PLDA model, however, is that it essentially assumes Gaussian distributions over speaker vectors, which is not always true in practice. Additionally, its objective function is not directly related to the goal of the task, e.g., discriminating true speakers from imposters. In this paper, we propose a max-margin metric learning approach to address these problems. It learns a linear transform with the criterion that the margin between target and imposter trials is maximized. Experiments conducted on the SRE08 core test show that, compared to PLDA, the new approach obtains comparable or even better performance, even though the scoring is simply a cosine computation.
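A toy version of the max-margin update, assuming triplets of i-vectors (enrollment, target-trial test, imposter-trial test) and plain dot-product scoring on pre-normalized vectors in place of cosine (the loss form, learning rate, and initialization are illustrative assumptions):

```python
import numpy as np

def train_mmml(anchors, targets, imposters, dim, lr=0.01, margin=0.5, epochs=10):
    """Hinge loss on triplets: after the learned linear map M, a target
    trial should out-score an imposter trial by at least `margin`."""
    rng = np.random.default_rng(0)
    M = np.eye(dim) + 0.01 * rng.standard_normal((dim, dim))
    for _ in range(epochs):
        for a, t, i in zip(anchors, targets, imposters):
            s_t = (M @ a) @ (M @ t)       # target-trial score
            s_i = (M @ a) @ (M @ i)       # imposter-trial score
            if margin - s_t + s_i > 0:    # margin violated: update M
                # gradient of a'M'Mx with respect to M is M(a x' + x a')
                grad = M @ (np.outer(a, i) + np.outer(i, a)
                            - np.outer(a, t) - np.outer(t, a))
                M -= lr * grad
    return M
```

At test time, scoring reduces to a cosine (here, dot product) between the transformed enrollment and test vectors, which is the simplicity the abstract highlights.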

Collaboration


Dive into Lantian Li's collaborations.

Top Co-Authors

Andrew Abel

Xi'an Jiaotong-Liverpool University


Renyu Wang

Jiangsu Normal University
