Network


Latest external collaboration at the country level.

Hotspot


Dive into the research topics where Donglai Zhu is active.

Publication


Featured research published by Donglai Zhu.


International Conference on Acoustics, Speech, and Signal Processing | 2006

Integrating Acoustic, Prosodic and Phonotactic Features for Spoken Language Identification

Rong Tong; Bin Ma; Donglai Zhu; Haizhou Li; Eng Siong Chng

The fundamental issue in automatic language identification is to find effective discriminative cues for languages. This paper studies the fusion of five features at different levels of abstraction for language identification, including spectral, duration, pitch, n-gram phonotactic, and bag-of-sounds features. We build a system and report test results on the NIST 1996 and 2003 LRE datasets. The system was also built to participate in the NIST 2005 LRE. The experimental results show that different levels of information provide complementary language cues. The prosodic features are more effective for shorter utterances, while the phonotactic features work better for longer utterances. For the 12-language task, the system fusing all five features achieved a 2.38% EER for 30-second speech segments on the NIST 1996 dataset.
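
As a rough, hypothetical sketch of the general idea (not the authors' implementation), the Python snippet below fuses per-language scores from several subsystems with a weighted sum; the language list, subsystem names, scores, and weights are all made-up placeholders.

```python
# Minimal sketch of score-level fusion for language identification.
# The subsystem names, weights, and scores are hypothetical placeholders;
# the paper fuses spectral, duration, pitch, phonotactic and bag-of-sounds cues.
import numpy as np

LANGUAGES = ["en", "zh", "ja"]          # example target languages

def fuse_scores(subsystem_scores, weights):
    """Weighted sum of per-language score vectors from each subsystem."""
    fused = np.zeros(len(LANGUAGES))
    for name, scores in subsystem_scores.items():
        fused += weights[name] * np.asarray(scores)
    return fused

subsystem_scores = {
    "spectral":    [0.2, 1.1, -0.3],    # one score per language
    "phonotactic": [0.5, 0.9, -0.1],
    "prosodic":    [0.1, 0.4,  0.0],
}
weights = {"spectral": 0.5, "phonotactic": 0.3, "prosodic": 0.2}

fused = fuse_scores(subsystem_scores, weights)
print(LANGUAGES[int(np.argmax(fused))])  # hypothesized language
```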


International Conference on Acoustics, Speech, and Signal Processing | 2009

The I4U system in NIST 2008 speaker recognition evaluation

Haizhou Li; Bin Ma; Kong-Aik Lee; Hanwu Sun; Donglai Zhu; Khe Chai Sim; Changhuai You; Rong Tong; Ismo Kärkkäinen; Chien-Lin Huang; Vladimir Pervouchine; Wu Guo; Yijie Li; Li-Rong Dai; Mohaddeseh Nosratighods; Thiruvaran Tharmarajah; Julien Epps; Eliathamby Ambikairajah; Eng Siong Chng; Tanja Schultz; Qin Jin

This paper describes the performance of the I4U speaker recognition system in the NIST 2008 Speaker Recognition Evaluation. The system consists of seven subsystems, each with different cepstral features and classifiers. We describe the I4U Primary system and report on its core test results as they were submitted, which were among the best-performing submissions. The I4U effort was led by the Institute for Infocomm Research, Singapore (IIR), with contributions from the University of Science and Technology of China (USTC), the University of New South Wales, Australia (UNSW), Nanyang Technological University, Singapore (NTU) and Carnegie Mellon University, USA (CMU).


International Conference on Acoustics, Speech, and Signal Processing | 2006

Chinese Dialect Identification Using Tone Features Based on Pitch Flux

Bin Ma; Donglai Zhu; Rong Tong

This paper presents a method to extract tone-relevant features based on pitch flux from continuous speech signals. The autocorrelations of two adjacent frames are calculated, and the covariance between them is estimated to extract multi-dimensional pitch flux features. These features, together with MFCCs, are modeled in a two-stream GMM and tested in a three-dialect identification task for Chinese. The pitch flux features have been shown to be very effective in identifying tonal languages from short speech segments. For test speech segments of 3 seconds, the two-stream model achieves more than 30% error reduction over the MFCC-based model.
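
The following Python sketch loosely illustrates the pitch-flux concept described above: it compares the autocorrelation sequences of adjacent frames to capture frame-to-frame pitch movement. The frame sizes, lag range, and the exact quantities extracted are assumptions for illustration, not the authors' formulation.

```python
# Loose sketch of the pitch-flux idea: compare the autocorrelation sequences
# of adjacent frames to capture frame-to-frame pitch movement. This is an
# illustration of the general concept, not the authors' exact formulation.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def autocorr(frame, max_lag=200):
    f = frame - frame.mean()
    r = np.correlate(f, f, mode="full")[len(f) - 1: len(f) - 1 + max_lag]
    return r / (r[0] + 1e-9)                     # normalized autocorrelation

def pitch_flux(x):
    frames = frame_signal(x)
    acs = np.stack([autocorr(fr) for fr in frames])
    flux = []
    for prev, curr in zip(acs[:-1], acs[1:]):
        cov = np.cov(prev, curr)                 # 2x2 covariance of adjacent autocorrelations
        lag_prev = 20 + prev[20:].argmax()       # crude pitch-lag estimate (skip low lags)
        lag_curr = 20 + curr[20:].argmax()
        flux.append([cov[0, 1], lag_curr - lag_prev])
    return np.asarray(flux)                      # one feature vector per frame pair

speech = np.random.randn(16000)                  # stand-in for 1 s of 16 kHz audio
print(pitch_flux(speech).shape)
```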


International Conference on Acoustics, Speech, and Signal Processing | 2007

A Generalized Feature Transformation Approach for Channel Robust Speaker Verification

Donglai Zhu; Bin Ma; Haizhou Li; Qiang Huo

In this paper we propose a generalized feature transformation approach to compensating for channel variation in speaker verification (SV) applications. Channel-dependent (CD) piecewise linear transformations are used for feature compensation. CD transformation parameters are estimated together with a channel-independent (CI) root Gaussian mixture model (GMM) from training data with a variety of channel conditions by using a maximum likelihood criterion. Experiments are conducted on the 2005 NIST Speaker Recognition Evaluation (SRE) corpus for several text-independent GMM-based SV systems. Experimental results show that the proposed approach achieves relative equal error rate (EER) reductions of 8.19% and 26.24% in comparison with a traditional feature mapping approach and a baseline system, respectively.
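
As a hedged illustration of channel-dependent feature compensation in general (not the paper's generalized transformation or its maximum-likelihood estimation procedure), the sketch below detects the most likely channel and applies a per-channel affine transform; the channel labels, transforms, and likelihood functions are hypothetical stand-ins.

```python
# Minimal sketch of channel-dependent affine feature compensation, in the
# spirit of feature mapping. The channel labels, transforms and likelihoods
# here are hypothetical; the paper estimates its transforms jointly with a
# channel-independent root GMM under a maximum-likelihood criterion.
import numpy as np

# Hypothetical per-channel affine transforms: y = A @ x + b
TRANSFORMS = {
    "landline": (np.eye(13) * 0.95, np.zeros(13)),
    "cellular": (np.eye(13) * 1.05, np.full(13, 0.1)),
}

def detect_channel(features, channel_loglik):
    """Pick the channel whose model gives the highest log-likelihood."""
    return max(channel_loglik, key=lambda ch: channel_loglik[ch](features))

def compensate(features, channel):
    A, b = TRANSFORMS[channel]
    return features @ A.T + b

# Stand-in likelihood functions (a real system would use per-channel GMMs).
channel_loglik = {
    "landline": lambda f: -np.mean((f - 0.0) ** 2),
    "cellular": lambda f: -np.mean((f - 0.1) ** 2),
}

feats = np.random.randn(300, 13)                  # 300 frames of 13-dim cepstra
ch = detect_channel(feats, channel_loglik)
print(ch, compensate(feats, ch).shape)
```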


International Conference on Acoustics, Speech, and Signal Processing | 2009

Joint map adaptation of feature transformation and Gaussian Mixture Model for speaker recognition

Donglai Zhu; Bin Ma; Haizhou Li

This paper extends our previous work on feature transformation-based support vector machines for speaker recognition by proposing a joint MAP adaptation of feature transformation (FT) and Gaussian mixture model (GMM) parameters. In the new approach, the prior probability density functions (PDFs) of the FT and GMM parameters are jointly estimated from the background data under the maximum likelihood criterion. In this way, we derive a generic prior GMM that is more compact than the Universal Background Model due to the reduction of speaker variations. With the prior PDFs, we construct a supervector that characterizes a speaker using the FT and GMM parameters. We conducted experiments on the NIST 2006 Speaker Recognition Evaluation (SRE06) dataset. The results validate the effectiveness of the joint MAP adaptation approach.
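
The sketch below illustrates, under simplified assumptions, how a speaker supervector might be formed by stacking MAP-adapted GMM means together with feature-transformation parameters; the model sizes, relevance factor, and statistics are toy placeholders rather than the paper's actual configuration.

```python
# Rough sketch of forming a speaker supervector by stacking adapted GMM means
# together with feature-transformation parameters, as the abstract describes.
# The dimensions and the adaptation step here are simplified placeholders.
import numpy as np

n_mix, dim = 4, 13                                # toy model sizes
prior_means = np.zeros((n_mix, dim))              # prior (background) GMM means
ft_matrix = np.eye(dim)                           # prior feature-transform parameters

def map_adapt_means(prior, frame_stats, relevance=16.0):
    """Simplified MAP update of GMM means from per-mixture sufficient statistics."""
    counts, first_order = frame_stats             # (n_mix,), (n_mix, dim)
    alpha = counts / (counts + relevance)
    return (alpha[:, None] * (first_order / np.maximum(counts, 1e-6)[:, None])
            + (1.0 - alpha[:, None]) * prior)

counts = np.array([50.0, 10.0, 0.0, 120.0])
first_order = np.random.randn(n_mix, dim) * counts[:, None]
adapted_means = map_adapt_means(prior_means, (counts, first_order))

# Supervector: adapted means and (here, unadapted) transform parameters stacked.
supervector = np.concatenate([adapted_means.ravel(), ft_matrix.ravel()])
print(supervector.shape)                          # (4*13 + 13*13,) = (221,)
```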


International Symposium on Chinese Spoken Language Processing | 2006

Fusion of acoustic and tokenization features for speaker recognition

Rong Tong; Bin Ma; Kong-Aik Lee; Changhuai You; Donglai Zhu; Tomi Kinnunen; Hanwu Sun; Minghui Dong; Eng Siong Chng; Haizhou Li

This paper describes our recent efforts in exploring effective discriminative features for speaker recognition. Recent research has indicated that the appropriate fusion of features is critical to improving the performance of speaker recognition systems. In this paper we describe our approaches for the NIST 2006 Speaker Recognition Evaluation. Our system integrates cepstral GMM modeling, cepstral SVM modeling, and tokenization at both the phone level and the frame level. Experimental results on both the NIST 2005 SRE and NIST 2006 SRE corpora are presented. The fused system achieved an 8.14% equal error rate on the 1conv4w-1conv4w test condition of the NIST 2006 SRE.


2006 IEEE Odyssey - The Speaker and Language Recognition Workshop | 2006

Language Recognition Based on Score Distribution Feature Vectors and Discriminative Classifier Fusion

Jinyu Li; Sibel Yaman; Chin-Hui Lee; Bin Ma; Rong Tong; Donglai Zhu; Haizhou Li

We present the GT-IIR language recognition system submitted to the 2005 NIST Language Recognition Evaluation. Different from conventional frame-based feature extraction, our system adopts a collection of output scores from different language recognition systems to form utterance-level score distribution feature vectors over all competing languages, and builds vector-based spoken language recognizers by fusing two distinct verifiers, one based on a simple linear discriminant function (LDF) and the other on a more complex artificial neural network (ANN), to make the final language recognition decisions. The diverse error patterns exhibited by the individual LDF and ANN systems lead to smaller overall verification errors in the combined system than those obtained by the separate systems.
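
Purely as an illustration of the score-distribution-vector idea (with made-up weights and scores rather than anything trained as in the paper), the snippet below concatenates each recognizer's scores over the competing languages and fuses a linear discriminant with a small neural network.

```python
# Illustrative sketch only: build an utterance-level score-distribution vector
# over the competing languages from several front-end recognizers, then fuse a
# simple linear discriminant with a small neural network. All weights below
# are hypothetical; the paper trains both verifiers on development data.
import numpy as np

def score_distribution_vector(per_recognizer_scores):
    """Concatenate each recognizer's scores over all competing languages."""
    return np.concatenate([np.asarray(s) for s in per_recognizer_scores])

def linear_discriminant(v, w, b):
    return float(v @ w + b)

def tiny_ann(v, W1, b1, w2, b2):
    h = np.tanh(v @ W1 + b1)                      # one hidden layer
    return float(h @ w2 + b2)

rng = np.random.default_rng(0)
v = score_distribution_vector([[0.2, 1.3, -0.4], [0.6, 0.8, -0.2]])  # 2 recognizers x 3 languages
w, b = rng.normal(size=v.size), 0.0
W1, b1 = rng.normal(size=(v.size, 8)), np.zeros(8)
w2, b2 = rng.normal(size=8), 0.0

# Fuse the two verifiers; equal weights are an arbitrary choice here.
fused = 0.5 * linear_discriminant(v, w, b) + 0.5 * tiny_ann(v, W1, b1, w2, b2)
print(fused > 0.0)                                # accept/reject decision for the target language
```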


International Conference on Multimedia and Expo | 2009

Acoustic segment modeling for speaker recognition

Bin Ma; Donglai Zhu; Haizhou Li

We propose a speaker recognition system based on the acoustic segment modeling technique. It is assumed that the overall sound characteristics of speakers can be covered by a set of acoustic segment models (ASMs), where the ASMs are acoustically motivated, self-organized sound units that do not impose any phonetic definitions. These ASMs decode a spoken utterance into a string of segment units, and the mean vectors of the ASMs, obtained by unsupervised MAP adaptation, are concatenated to represent the characteristics of a specific speaker. Support vector machines are then applied to these high-dimensional feature vectors for speaker recognition. We evaluate the proposed approach on the 2006 NIST Speaker Recognition Evaluation core-condition test trials.
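
A toy sketch of the ASM supervector idea follows: an utterance is represented by concatenating per-unit mean vectors (smoothed toward prior ASM means as a crude stand-in for MAP adaptation) and classified with a linear SVM. The unit inventory, feature dimensions, and synthetic data are assumptions for illustration only.

```python
# Sketch of the ASM supervector idea: represent an utterance by concatenating
# the mean vectors of its acoustic segment models and classify the resulting
# high-dimensional vector with an SVM. The unit inventory, "adaptation" step
# and data below are toy placeholders, not the authors' actual setup.
import numpy as np
from sklearn.svm import LinearSVC

N_UNITS, DIM = 8, 13                              # toy ASM inventory and feature dimension

def asm_supervector(frames, unit_labels, prior_means, relevance=16.0):
    """Concatenate per-unit means, smoothed toward the prior ASM means."""
    sv = []
    for u in range(N_UNITS):
        seg = frames[unit_labels == u]
        alpha = len(seg) / (len(seg) + relevance)
        mean = alpha * (seg.mean(axis=0) if len(seg) else 0.0) + (1 - alpha) * prior_means[u]
        sv.append(mean)
    return np.concatenate(sv)

rng = np.random.default_rng(1)
prior = rng.normal(size=(N_UNITS, DIM))

def random_utterance(shift):
    frames = rng.normal(size=(200, DIM)) + shift  # stand-in for decoded cepstral frames
    labels = rng.integers(0, N_UNITS, size=200)   # stand-in for the ASM decoding
    return asm_supervector(frames, labels, prior)

# Two "speakers" distinguished only by a synthetic offset, for illustration.
X = np.stack([random_utterance(0.0) for _ in range(20)] +
             [random_utterance(0.5) for _ in range(20)])
y = np.array([0] * 20 + [1] * 20)
clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)
print(clf.score(X, y))
```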


IEEE Transactions on Audio, Speech, and Language Processing | 2008

Optimizing the Performance of Spoken Language Recognition With Discriminative Training

Donglai Zhu; Haizhou Li; Bin Ma; Chin-Hui Lee

The performance of a spoken language recognition system is typically formulated to reflect the detection cost and the strategic decision points along the detection-error-tradeoff (DET) curve. We propose a performance metrics optimization (PMO) approach to optimizing the detection performance of Gaussian mixture model classifiers. We design objective functions that directly relate the model parameters to the performance metrics of interest, i.e., the detection cost function and the area under the DET curve. Both metrics are approximated by differentiable functions of the model parameters. In this way, the model parameters can be optimized with the generalized probabilistic descent algorithm, a typical discriminative training technique. We conduct experiments on the NIST 2003 and 2005 Language Recognition Evaluation corpora. The experimental results show that the PMO approach effectively improves the performance over the maximum-likelihood training approach.
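
A generic smoothed detection-cost objective consistent with this description (though not necessarily the paper's exact formulation) replaces the 0/1 miss and false-alarm counts with sigmoids; here the smoothing constant, scores, and threshold symbols are introduced for illustration.

```latex
% Generic smoothed detection-cost objective, consistent with the description
% above but not necessarily the paper's exact formulation. \sigma is the
% sigmoid, \gamma a smoothing constant, s_i a model score, \theta the threshold.
\[
  \sigma(x) = \frac{1}{1 + e^{-\gamma x}}, \qquad
  \widehat{P}_{\text{miss}} = \frac{1}{N_{\text{tgt}}}\sum_{i \in \text{tgt}} \sigma(\theta - s_i), \qquad
  \widehat{P}_{\text{fa}} = \frac{1}{N_{\text{non}}}\sum_{j \in \text{non}} \sigma(s_j - \theta)
\]
\[
  \widehat{\mathrm{DCF}} = C_{\text{miss}}\, P_{\text{tgt}}\, \widehat{P}_{\text{miss}}
                         + C_{\text{fa}}\,(1 - P_{\text{tgt}})\, \widehat{P}_{\text{fa}}
\]
```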


International Conference on Acoustics, Speech, and Signal Processing | 2008

Discriminative learning for optimizing detection performance in spoken language recognition

Donglai Zhu; Haizhou Li; Bin Ma; Chin-Hui Lee

We propose novel approaches for optimizing the detection performance in spoken language recognition. Two objective functions are designed to directly relate model parameters to two performance metrics of interest, the detection cost function and the area under the detection-error-tradeoff curve, respectively. Both metrics are approximated with differentiable functions of model parameters by using a smoothing function based on a class misclassification measure. The model parameters are optimized by using the generalized probabilistic descent algorithm. We conduct experiments on the NIST 2003 and 2005 Language Recognition Evaluation corpora. Results show that the proposed approaches effectively improve the performance over the maximum likelihood training approach.
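
To make the descent step concrete, the toy Python sketch below runs a GPD-style update (with a decreasing step size) that minimizes a sigmoid-smoothed detection cost with respect to a single decision threshold on synthetic scores; the paper itself optimizes the GMM parameters, so this is only an illustration of the iterative procedure.

```python
# Toy GPD-style update that minimizes the smoothed detection cost with respect
# to a single decision threshold. This only illustrates the iterative descent
# on a sigmoid-smoothed metric; the paper optimizes the GMM parameters themselves.
import numpy as np

C_MISS, C_FA, P_TGT, GAMMA = 1.0, 1.0, 0.5, 4.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smoothed_dcf_and_grad(theta, tgt_scores, non_scores):
    p_miss = sigmoid(GAMMA * (theta - tgt_scores))          # soft miss indicators
    p_fa = sigmoid(GAMMA * (non_scores - theta))            # soft false-alarm indicators
    dcf = C_MISS * P_TGT * p_miss.mean() + C_FA * (1 - P_TGT) * p_fa.mean()
    grad = (C_MISS * P_TGT * (GAMMA * p_miss * (1 - p_miss)).mean()
            - C_FA * (1 - P_TGT) * (GAMMA * p_fa * (1 - p_fa)).mean())
    return dcf, grad

rng = np.random.default_rng(2)
tgt = rng.normal(1.0, 1.0, 500)                             # synthetic target-trial scores
non = rng.normal(-1.0, 1.0, 5000)                           # synthetic non-target scores

theta = 0.5
for t in range(1, 201):
    dcf, grad = smoothed_dcf_and_grad(theta, tgt, non)
    theta -= (1.0 / t) * grad                               # GPD: decreasing step size
print(round(theta, 3), round(float(dcf), 4))
```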

Collaboration


Dive into Donglai Zhu's collaborations.

Top Co-Authors

Haizhou Li (National University of Singapore)
Eng Siong Chng (Nanyang Technological University)
Chin-Hui Lee (Georgia Institute of Technology)
Tomi Kinnunen (University of Eastern Finland)