Kong-Aik Lee
Agency for Science, Technology and Research
Publication
Featured research published by Kong-Aik Lee.
international conference on acoustics, speech, and signal processing | 2009
Haizhou Li; Bin Ma; Kong-Aik Lee; Hanwu Sun; Donglai Zhu; Khe Chai Sim; Changhuai You; Rong Tong; Ismo Kärkkäinen; Chien-Lin Huang; Vladimir Pervouchine; Wu Guo; Yijie Li; Li-Rong Dai; Mohaddeseh Nosratighods; Thiruvaran Tharmarajah; Julien Epps; Eliathamby Ambikairajah; Eng Siong Chng; Tanja Schultz; Qin Jin
This paper describes the performance of the I4U speaker recognition system in the NIST 2008 Speaker Recognition Evaluation. The system consists of seven subsystems, each with different cepstral features and classifiers. We describe the I4U Primary system and report on its core test results as they were submitted, which were among the best-performing submissions. The I4U effort was led by the Institute for Infocomm Research, Singapore (IIR), with contributions from the University of Science and Technology of China (USTC), the University of New South Wales, Australia (UNSW), Nanyang Technological University, Singapore (NTU) and Carnegie Mellon University, USA (CMU).
asilomar conference on signals, systems and computers | 2006
Kong-Aik Lee; Woon-Seng Gan; Sen-Maw Kuo
This paper presents the mean-square performance analysis of a class of normalized subband adaptive filters (NSAF) using the concept of energy conservation. The NSAF has a unique weight-control mechanism whereby subband error signals are used to update a fullband tap-weight vector. We show that an energy conservation relation can be established for the weight adaptation. Subsequently, an expression for the steady-state excess mean-square error (MSE) and the necessary condition on the step size for mean-square stability can then be derived by manipulating various error quantities of the energy conservation relation. Simulation results are presented to corroborate the mathematical analysis.
international symposium on chinese spoken language processing | 2006
Rong Tong; Bin Ma; Kong-Aik Lee; Changhuai You; Donglai Zhu; Tomi Kinnunen; Hanwu Sun; Minghui Dong; Eng Siong Chng; Haizhou Li
This paper describes our recent efforts in exploring effective discriminative features for speaker recognition. Recent research has indicated that the appropriate fusion of features is critical to improving the performance of speaker recognition systems. In this paper we describe our approaches for the NIST 2006 Speaker Recognition Evaluation. Our system integrated cepstral GMM modeling, cepstral SVM modeling and tokenization at both the phone level and the frame level. Experimental results on both the NIST 2005 SRE corpus and the NIST 2006 SRE corpus are presented. The fused system achieved an 8.14% equal error rate on the 1conv4w-1conv4w test condition of the NIST 2006 SRE.
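For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of one common way to estimate it from trial scores; the function name `equal_error_rate` and the simple threshold sweep are illustrative assumptions, not the official NIST scoring procedure.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    Illustrative helper (hypothetical name): a trial is accepted when its
    score is greater than or equal to the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a target trial that scores below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial that scores at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

With only a few trials the two error curves may not cross exactly, which is why the sketch reports the average of the two rates at the closest point.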
international conference on acoustics, speech, and signal processing | 2015
Wei Rao; Man-Wai Mak; Kong-Aik Lee
Gaussian PLDA with uncertainty propagation is effective for i-vector based speaker verification. The idea is to propagate the uncertainty of i-vectors caused by the duration variability of utterances to the PLDA model. However, a limitation of the method is the difficulty of performing length normalization on the posterior covariance matrix of an i-vector. This paper proposes a method to avoid performing length normalization on i-vectors in Gaussian PLDA modeling so that uncertainty propagation can be directly applied without transforming the posterior covariance matrices of i-vectors. Instead of performing length normalization on i-vectors independently, the proposed method normalizes the column vectors of the total variability matrix. Because the i-vectors of all utterances are derived from the same normalized total variability matrix, they will be subject to the same degree of normalization, thereby avoiding the undesirable distortion introduced by the utterance-dependent length-normalization process. Experimental results on both NIST 2010 and 2012 SREs demonstrate that the proposed method achieves a performance similar to (and in some situations better than) that of Gaussian PLDA with length normalization. The method has the potential of improving the performance of uncertainty propagation for i-vector/PLDA speaker verification.
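The contrast between the two normalization schemes can be sketched as follows. This is a minimal illustration that assumes "normalizing the column vectors" means scaling each column of the total variability matrix to unit Euclidean norm; the abstract does not state the exact norm, and the function names are hypothetical.

```python
import math


def length_normalize(ivector):
    """Conventional, utterance-dependent step: scale one i-vector to unit norm."""
    norm = math.sqrt(sum(v * v for v in ivector))
    return [v / norm for v in ivector]


def normalize_columns(t_matrix):
    """Utterance-independent alternative: scale each column of the matrix.

    `t_matrix` is a list of rows (supervector dimension x i-vector dimension).
    Unit Euclidean norm per column is an assumption made for this sketch.
    """
    n_cols = len(t_matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in t_matrix)) for j in range(n_cols)]
    return [[row[j] / norms[j] for j in range(n_cols)] for row in t_matrix]
```

The first function rescales every utterance by a different factor, whereas the second is applied once to the shared matrix, so every i-vector extracted afterwards is subject to the same scaling.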
international conference on acoustics, speech, and signal processing | 2011
Filip Sedlak; Tomi Kinnunen; Ville Hautamäki; Kong-Aik Lee; Haizhou Li
State-of-the-art speaker verification systems consist of a number of complementary subsystems whose outputs are fused to arrive at a more accurate and reliable verification decision. In speaker verification, fusion is typically implemented as a linear combination of the subsystem scores. Parameters of the linear model are commonly estimated using the logistic regression method, as implemented in the popular FoCal toolkit. In this paper, we study the simultaneous use of classifier selection and fusion. We study four alternative fusion strategies and three score warping techniques, and provide interesting experimental bounds on optimal classifier subset selection. Detailed experiments are carried out on the NIST 2008 and 2010 SRE corpora.
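Linear score fusion trained by logistic regression, as described above, can be sketched in a few lines. This is an illustrative toy using plain batch gradient descent; it is not the FoCal implementation, and the function names, learning rate, and epoch count are assumptions chosen for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuse(scores, weights, bias):
    """Fused score: bias plus a weighted sum of the subsystem scores."""
    return bias + sum(w * s for w, s in zip(weights, scores))


def train_fusion(trials, labels, lr=0.1, epochs=300):
    """Fit fusion weights by minimizing the mean logistic (cross-entropy) loss.

    `trials` is a list of per-trial subsystem score lists; `labels` holds
    1 for target trials and 0 for non-target trials.
    """
    n_sys, n = len(trials[0]), len(trials)
    weights, bias = [0.0] * n_sys, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_sys, 0.0
        for scores, y in zip(trials, labels):
            err = sigmoid(fuse(scores, weights, bias)) - y
            for k in range(n_sys):
                grad_w[k] += err * scores[k]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias
```

A fused score above zero corresponds to a posterior above 0.5 for the target hypothesis under the training priors; classifier selection would amount to constraining some of the weights to zero.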
international conference on audio, language and image processing | 2008
XinYou See; Kong-Aik Lee; Woon-Seng Gan; Haizhou Li
This paper explores the use of proportionate adaptation and subband adaptive filtering techniques for applications involving colored excitation and the modeling of sparse impulse responses. It has been shown that the proportionate adaptation technique is capable of dealing with sparse environments, while the subband adaptive filtering technique provides good convergence performance under colored excitation. This paper integrates both crucial functionalities into a single algorithm for applications like network and acoustic echo cancellation, and feedback cancellation in hearing aids. The efficacy of the proposed algorithm is examined and validated via simulations.
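To illustrate what proportionate adaptation means, the sketch below implements a plain fullband proportionate NLMS (PNLMS) update, in which each tap receives a step size roughly proportional to its current magnitude. It is not the subband algorithm proposed in the paper; the function name, the parameter values (`mu`, `rho`, `delta`), and the noise-free white-input setup are assumptions made for the example.

```python
import random


def pnlms_identify(unknown, n_samples=5000, mu=0.5, rho=0.01,
                   delta=0.01, eps=1e-8, seed=0):
    """Identify a (sparse) impulse response with a fullband PNLMS filter.

    `unknown` is the hypothetical system to identify; the input is white
    Gaussian noise and the desired signal is noise-free.
    """
    rng = random.Random(seed)
    n_taps = len(unknown)
    w = [0.0] * n_taps
    x_buf = [0.0] * n_taps
    for _ in range(n_samples):
        # Shift the newest input sample into the tapped delay line.
        x_buf = [rng.gauss(0.0, 1.0)] + x_buf[:-1]
        d = sum(h * x for h, x in zip(unknown, x_buf))
        e = d - sum(wi * x for wi, x in zip(w, x_buf))
        # Proportionate gains: large taps get large steps, with a floor
        # so that small or zero taps can still adapt.
        floor = rho * max(delta, max(abs(wi) for wi in w))
        gamma = [max(floor, abs(wi)) for wi in w]
        mean_gamma = sum(gamma) / n_taps
        g = [gi / mean_gamma for gi in gamma]
        # Normalize by the gain-weighted input energy.
        norm = sum(gi * x * x for gi, x in zip(g, x_buf)) + eps
        w = [wi + mu * gi * x * e / norm for wi, gi, x in zip(w, g, x_buf)]
    return w
```

Calling `pnlms_identify([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, -0.5, 0.0])` should return weights close to the sparse response, with the two active taps converging fastest.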
international conference on acoustics, speech, and signal processing | 2008
Kong-Aik Lee; Changhuai You; Haizhou Li
This paper introduces a spoken language recognition system with a generative front-end and a discriminative back-end. The generative front-end is built upon an ensemble of Gaussian densities. These Gaussian densities are trained to represent elementary speech sound units characterizing a wide variety of languages. We formulate the generative front-end in the form of a sequence kernel. This sequence kernel transforms a spoken utterance into a feature vector whose attributes represent the occurrence statistics of the speech sound units. A discriminative support vector machine (SVM) then operates on the feature vectors to make the classification decision. The proposed language recognition system demonstrates competitive performance on the NIST 1996, 2003 and 2005 LRE corpora.
international conference on multimedia and expo | 2007
Kong-Aik Lee; Woon-Seng Gan
Delayless architecture for the recently proposed normalized subband adaptive filter (NSAF) is described and analyzed in this paper. The NSAF has a unique weight-control mechanism whereby error signals estimated in subbands are used to adapt a fullband filter. In the delayless architecture, we implement the subband weight adaptation in an auxiliary loop, and place only the fullband filter along the input signal path. By so doing, delay due to the filter banks is moved to the auxiliary loop out of the signal path, thereby making the algorithm attractive for applications where excessive signal path delay is intolerable. Simulation results demonstrate that the proposed delayless NSAF outperforms other delayless approaches in terms of convergence rate.
international symposium on chinese spoken language processing | 2010
Eryu Wang; Wu Guo; Li-Rong Dai; Kong-Aik Lee; Bin Ma; Haizhou Li
Gaussian mixture models (GMMs) are commonly used in text-independent speaker verification for modeling the spectral distribution of speech. Recent studies have shown the effectiveness of characterizing speaker information using the mean super-vector obtained by concatenating the mean vectors of the GMM. This paper proposes to use the spatial correlation captured by the covariance matrix of the mean super-vector for speaker verification. A factor analysis method is adopted to estimate the covariance of the super-vector. For measuring the similarity between speech utterances in terms of the spatial correlation, we propose two kernel metrics, namely, the log-Euclidean inner product and the Frobenius angle. For computational simplicity, we introduce an inner product classifier (IPC) whose performance is equivalent to that of the commonly used support vector machine (SVM). Experiments conducted on the 2006 NIST speaker recognition evaluation (SRE) dataset confirm the efficacy of the proposed factor analysis based spatial modeling technique.
international conference on acoustics, speech, and signal processing | 2010
Khe Chai Sim; Kong-Aik Lee
State-of-the-art spoken language recognition systems typically consist of a combination of sub-systems. These sub-systems generate language detection scores for each speech segment, which are then fused (combined) to yield the overall detection scores. Typically, score fusion is achieved using a linear model, and Logistic Linear Regression (LLR) is commonly used to estimate the model parameters. This paper proposes an extension to the LLR model, known as the Weighted LLR (WLLR). WLLR is obtained using a weighted combination of multiple LLRs, where the weights are obtained as a nonlinear function of the speech segments. Although the resultant score is still linear with respect to the scores of the individual sub-systems, the linear function depends on the speech segment. Hence, the overall score fusion model can be regarded as an adaptive model. Experimental results show that WLLR outperforms LLR by approximately 10% relative for PPRLM system fusion on the NIST 2003 and 2005 language recognition evaluation sets.