Xiao-Lei Zhang | Researchain

Publication

Featured research published by Xiao-Lei Zhang.

IEEE Transactions on Audio, Speech, and Language Processing | 2013

Deep Belief Networks Based Voice Activity Detection

Xiao-Lei Zhang; Ji Wu

Fusing the advantages of multiple acoustic features is important for the robustness of voice activity detection (VAD). Recently, machine-learning-based VADs have shown superiority over traditional VADs on multiple-feature fusion tasks. However, existing machine-learning-based VADs only utilize shallow models, which cannot explore the underlying manifold of the features. In this paper, we propose to fuse multiple features via a deep model called the deep belief network (DBN). The DBN is a powerful hierarchical generative model for feature extraction. It can describe highly variant functions and discover the manifold of the features. We take the multiple serially concatenated features as the input layer of the DBN, and then extract a new feature by passing these features through multiple nonlinear hidden layers. Finally, we predict the class of the new feature with a linear classifier. Our analysis further indicates that even a belief network with a single hidden layer is as powerful as the state-of-the-art models in machine-learning-based VADs. In our empirical comparison, ten common features are used for performance analysis. Extensive experimental results on the AURORA2 corpus show that the DBN-based VAD not only outperforms eleven reference VADs, but can also meet the real-time detection demand of VAD. The results also show that the DBN-based VAD can fuse the advantages of multiple features effectively.

IEEE Signal Processing Letters | 2011

Efficient Multiple Kernel Support Vector Machine Based Voice Activity Detection

Ji Wu; Xiao-Lei Zhang

In this letter, we propose a multiple kernel support vector machine (MK-SVM) method for multiple-feature-based voice activity detection (VAD). To make the MK-SVM-based VAD practical, we adapt the multiple kernel learning (MKL) idea to an efficient cutting-plane structural SVM solver. We further discuss the performance of the MK-SVM with two different optimization objectives: minimum classification error (MCE) and improvement of the receiver operating characteristic (ROC) curve. Our experimental results show that the proposed method not only achieves better overall performance by exploiting the advantages of multiple features but also has low computational complexity.

International Conference on Acoustics, Speech, and Signal Processing | 2013

Denoising deep neural networks based voice activity detection

Xiao-Lei Zhang; Ji Wu

Recently, deep-belief-network (DBN) based voice activity detection (VAD) has been proposed. It is powerful in fusing the advantages of multiple features and achieves state-of-the-art performance. However, the deep layers of the DBN-based VAD do not show an apparent superiority over the shallower layers. In this paper, we propose a denoising-deep-neural-network (DDNN) based VAD to address this problem. Specifically, we pre-train a deep neural network in a special unsupervised denoising greedy layer-wise mode, and then fine-tune the whole network in a supervised way with the common back-propagation algorithm. In the pre-training phase, we take the noisy speech signals as the visible layer and try to extract a new feature that minimizes the reconstruction cross-entropy loss between the noisy speech signals and their corresponding clean speech signals. Experimental results show that the proposed DDNN-based VAD not only outperforms the DBN-based VAD but also shows an apparent performance improvement of the deep layers over the shallower layers.

IEEE Signal Processing Letters | 2011

Maximum Margin Clustering Based Statistical VAD With Multiple Observation Compound Feature

Ji Wu; Xiao-Lei Zhang

In this letter, we propose a new robust feature and an unsupervised learning approach for statistical voice activity detection (VAD). Maximum margin clustering (MMC), as an unsupervised classifier, can improve the robustness of support vector machine (SVM) based VAD while requiring no data labeling for model training. In the MMC framework, the multiple observation compound feature (MO-CF) is proposed to improve accuracy. MO-CF is composed of two subfeatures: the multiple observation signal-to-noise ratio (MO-SNR) and the multiple observation maximum probability (MO-MP). The contributions of the two subfeatures are balanced by a factor that is chosen to yield the largest area under the ROC curve (AUC). The proposed approach obtains improved performance over seven commonly used VAD techniques in experiments covering various noisy scenarios with low SNRs.
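
The AUC-driven choice of the balancing factor can be illustrated with a small sketch. The abstract does not say how the two subfeatures are combined, so the convex combination `alpha * a + (1 - alpha) * b`, the grid search, and the names `auc` and `best_balance_factor` are all assumptions made for illustration, not the authors' formulation.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def best_balance_factor(feat_a, feat_b, labels, grid):
    """Return (alpha, auc) for the grid value that maximizes the AUC.

    Assumption: the compound score is alpha * a + (1 - alpha) * b.
    """
    best = None
    for alpha in grid:
        combined = [alpha * a + (1.0 - alpha) * b
                    for a, b in zip(feat_a, feat_b)]
        value = auc(combined, labels)
        if best is None or value > best[1]:
            best = (alpha, value)
    return best
```

Calling `best_balance_factor(snr_like, prob_like, labels, [0.0, 0.25, 0.5, 0.75, 1.0])` on labeled development frames would return the grid point with the highest AUC; the input names here are placeholders.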

EURASIP Journal on Advances in Signal Processing | 2011

An efficient voice activity detection algorithm by combining statistical model and energy detection

Ji Wu; Xiao-Lei Zhang

In this article, we present a new voice activity detection (VAD) algorithm that is based on statistical models and an empirical rule-based energy detection algorithm. Specifically, it separates speech segments from background noise in two steps. In the first step, the VAD detects possible speech endpoints efficiently using the empirical rule-based energy detection algorithm. However, the possible endpoints are not accurate enough when the signal-to-noise ratio is low. Therefore, in the second step, we propose a new Gaussian mixture model-based multiple-observation log-likelihood ratio algorithm to align the endpoints to their optimal positions. Several experiments are conducted to evaluate the proposed VAD on both accuracy and efficiency. The results show that it achieves better performance than the six reference VADs in various noise scenarios.
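
A rough sketch of the first step, rule-based energy detection, is given below. The abstract does not list the actual rules, so the noise-floor estimate from the leading frames, the 6 dB margin, the minimum run length, and the function names are hypothetical choices. The second step (likelihood-ratio endpoint alignment) is not shown.

```python
import math


def frame_energies(samples, frame_len):
    """Log energy in dB of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # avoid log(0)
    return energies


def energy_endpoints(energies, noise_frames=5, margin_db=6.0, min_run=3):
    """Return (start, end) frame indices of candidate speech segments.

    Assumed rules: the noise floor is the mean energy of the first
    `noise_frames` frames, a frame is active when it exceeds the floor by
    `margin_db`, and active runs shorter than `min_run` frames are dropped.
    """
    floor = sum(energies[:noise_frames]) / noise_frames
    active = [e > floor + margin_db for e in energies]
    segments, start = [], None
    for i, flag in enumerate(active + [False]):  # sentinel closes last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_run:
                segments.append((start, i - 1))
            start = None
    return segments
```

A detector of this kind is cheap but, as the abstract notes, its endpoints become unreliable at low signal-to-noise ratios, which is what motivates the refinement step.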

Systems, Man and Cybernetics | 2012

Linearithmic Time Sparse and Convex Maximum Margin Clustering

Xiao-Lei Zhang; Ji Wu

Recently, a new clustering method called maximum margin clustering (MMC) was proposed and has shown promising performance. It was originally formulated as a difficult nonconvex integer problem. To make the MMC problem practical, researchers either relaxed the original MMC problem to inefficient convex optimization problems or reformulated it as nonconvex optimization problems, which sacrifice convexity for efficiency. However, no existing approach is both convex and efficient. In this paper, a new linearithmic time sparse and convex MMC algorithm, called support-vector-regression-based MMC (SVR-MMC), is proposed. In outline, it first uses the SVR as the core of the MMC. Then, it is relaxed to a convex optimization problem, which is iteratively solved by the cutting-plane algorithm. Each cutting-plane subproblem is further decomposed into a series of supervised SVR problems by a new global extended-level method (GELM). Finally, each supervised SVR problem is solved in linear time by a new sparse-kernel SVR (SKSVR) algorithm. We further extend the SVR-MMC algorithm to the multiple-kernel clustering (MKC) problem and the multiclass MMC (M3C) problem, which are denoted as SVR-MKC and SVR-M3C, respectively. One key point of the algorithms is the use of the SVR, which prevents the MMC and its extensions from encountering an integer matrix programming problem. Another key point is the new SKSVR. It provides a linear time interface to nonlinear kernel scenarios, so that the SVR-MMC and its extensions can keep a linearithmic time complexity in nonlinear kernel scenarios. Our experimental results on various real-world data sets demonstrate the effectiveness and the efficiency of the SVR-MMC and its two extensions. Moreover, the unsupervised application of the SVR-MKC to voice activity detection (VAD) shows that the SVR-MKC can achieve good performance that is close to that of its supervised counterpart, meet the real-time demand of the VAD, and need no labeling for model training.

International Conference on Acoustics, Speech, and Signal Processing | 2014

Unsupervised domain adaptation for deep neural network based voice activity detection

Xiao-Lei Zhang

The mismatch between the training and test speech corpora hinders the practical use of machine-learning-based voice activity detection (VAD). In this paper, we try to address this problem with unsupervised domain adaptation techniques, which try to find a shared feature subspace between the mismatched corpora. The denoising deep neural network is used as the learning machine. Three domain adaptation techniques are used for analysis. Experimental results show that unsupervised domain adaptation is a promising approach to the mismatch problem of VAD.

International Conference on Acoustics, Speech, and Signal Processing | 2012

Optimized weighted decoding for error-correcting output codes

Xiao-Lei Zhang; Ji Wu; Zhipeng Chen; Ping Lv

A common method to solve a multiclass classification problem is to reduce the problem to a series of binary classification problems and combine them via Error-Correcting Output Codes (ECOC). The ECOC contains three parts: coding design, decoding algorithm, and base dichotomizer. Recently, the Loss-Weighted (LW) decoding algorithm (Escalera et al., PAMI2010), which introduces a weight matrix to the Loss-Based (LB) decoding (Allwein et al., JMLR2001), has achieved improved performance over traditional decoding methods. However, the weight matrix is assigned empirically. In this paper, we present a theoretical global optimization method for the weight matrix, so as to achieve the minimal training risk. Although the experimental results on real-world image, audio and text classification tasks show that the proposed decoding method leads to only slightly better performance than others in the case of discrete outputs of the dichotomizers, the proposed method provides a new perspective on the decoding methods of the ECOC.
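
For orientation, weighted loss-based decoding can be sketched as follows: each class has a codeword and a row of weights, and the predicted class minimizes the weighted sum of losses on the dichotomizer margins. The exponential loss, the function name, and the example matrices are assumptions for illustration. The sketch only decodes with a given weight matrix and does not implement the weight optimization proposed in the paper.

```python
import math


def loss_weighted_decode(code_matrix, weights, outputs,
                         loss=lambda margin: math.exp(-margin)):
    """Return the index of the class with the smallest weighted loss.

    code_matrix[i][j] is the code bit (-1, 0, or +1) of class i for
    dichotomizer j, weights[i][j] is its weight, and outputs[j] is the
    real-valued output of dichotomizer j for the sample being decoded.
    """
    best_class, best_cost = None, float("inf")
    for cls, (codeword, w_row) in enumerate(zip(code_matrix, weights)):
        cost = sum(w * loss(bit * f)
                   for bit, w, f in zip(codeword, w_row, outputs))
        if cost < best_cost:
            best_class, best_cost = cls, cost
    return best_class


# Toy one-vs-all code for three classes (illustrative values only).
ONE_VS_ALL = [[1, -1, -1],
              [-1, 1, -1],
              [-1, -1, 1]]
UNIFORM = [[1.0 / 3.0] * 3 for _ in range(3)]
```

With uniform weights this reduces to plain loss-based decoding; a non-uniform weight matrix lets some dichotomizers count more than others for a given class, and that matrix is the quantity the paper optimizes.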

IEEE Transactions on Systems, Man, and Cybernetics | 2015

Heuristic Ternary Error-Correcting Output Codes Via Weight Optimization and Layered Clustering-Based Approach

Xiao-Lei Zhang

One important classifier ensemble for multiclass classification problems is error-correcting output codes (ECOC). ECOC bridges multiclass problems and binary-class classifiers by decomposing a multiclass problem into a series of binary-class problems. In this paper, we present a heuristic ternary code, named weight optimization and layered clustering-based ECOC (WOLC-ECOC). It starts with an arbitrary valid ECOC and iterates the following two steps until the training risk converges. The first step, named layered clustering-based ECOC (LC-ECOC), constructs multiple strong classifiers on the most confusing binary-class problem. The second step adds the new classifiers to the ECOC by a novel optimized weighted (OW) decoding algorithm, where the optimization problem of the decoding is solved by the cutting-plane algorithm. Technically, LC-ECOC keeps the heuristic training process from being blocked by a difficult binary-class problem. OW decoding guarantees that the training risk does not increase, which helps keep the code length small. Results on 14 UCI datasets and a music genre classification problem demonstrate the effectiveness of WOLC-ECOC.

International Symposium on Chinese Spoken Language Processing | 2012

Perceptual similarity between audio clips and feature selection for its measurement

Qinghua Wu; Xiao-Lei Zhang; Ping Lv; Ji Wu

In this paper, we explore the retrieval of perceptually similar audio, which focuses on finding sounds according to human perception. Such retrieval is thus more “human-centered” [1] than previous audio retrieval methods, which aim to find homologous sounds. We make comprehensive use of various acoustic features to measure the perceptual similarity. Since some acoustic features may be redundant or even adverse to the similarity measurement, we propose to find a complementary and effective combination of acoustic features via the sequential floating forward selection (SFFS) method. Experimental results show that LSP, MFCC, and PLP are the three most effective acoustic features. Moreover, the optimal combination of features can improve the accuracy of similarity classification by about 2% compared with the best performance of a single acoustic feature.
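
The SFFS procedure named above can be sketched generically: a forward step adds the feature that most improves a score, and a floating backward step removes features as long as the reduced subset beats the best subset of that size seen so far. The `score` callback, the feature names, and the toy scoring function are placeholders; the paper's criterion is similarity-classification accuracy, which is not reproduced here.

```python
def sffs(features, score, max_size):
    """Sequential floating forward selection.

    Returns a dict mapping each subset size up to `max_size` to the best
    subset of that size found during the search.
    """
    selected = []
    best_by_size = {}  # size -> (score, subset)
    while len(selected) < max_size:
        # Forward step: add the single most helpful remaining feature.
        candidates = [f for f in features if f not in selected]
        add = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [add]
        s = score(selected)
        k = len(selected)
        if k not in best_by_size or s > best_by_size[k][0]:
            best_by_size[k] = (s, list(selected))
        # Floating backward step: drop features while that strictly
        # improves on the best known subset of the smaller size.
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            s_red = score(reduced)
            if s_red > best_by_size[len(reduced)][0]:
                selected = reduced
                best_by_size[len(reduced)] = (s_red, list(reduced))
            else:
                break
    return {k: subset for k, (_, subset) in best_by_size.items()}


# Toy score: additive weights plus a bonus when "b" and "c" appear together.
TOY_WEIGHTS = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.5}


def toy_score(subset):
    total = sum(TOY_WEIGHTS[f] for f in subset)
    if "b" in subset and "c" in subset:
        total += 5.0
    return total
```

On the toy score, plain forward selection would keep `["a", "b"]` as the two-feature subset, while the floating step revises it to `["b", "c"]` after the interaction between "b" and "c" is discovered at size three.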
