Jichen Yang
South China University of Technology
Publication
Featured research published by Jichen Yang.
Signal Processing | 2009
Yanxiong Li; Qianhua He; Sam Kwong; Tao Li; Jichen Yang
Applause frequently occurs in multi-participant meeting speech, and detecting it is important for meeting speech recognition, semantic inference, highlight extraction, etc. In this paper, we first study the characteristic differences between applause and speech, such as duration, pitch, spectrogram, and occurrence locations. We then propose an effective algorithm based on these characteristics for detecting applause in a meeting speech stream. In the algorithm, non-silence signal segments are first extracted using voice activity detection. Applause segments are then detected within the non-silence segments based on the characteristic differences between applause and speech, without using any complex statistical models such as hidden Markov models. The proposed algorithm accurately determines the boundaries of applause in the meeting speech stream and is computationally efficient. In addition, it can extract applause sub-segments from mixed segments. Experimental evaluations show that the proposed algorithm achieves satisfactory results in detecting applause in meeting speech: precision rate, recall rate, and F1-measure are 94.34%, 98.04%, and 96.15%, respectively. Compared with the traditional algorithm under the same experimental conditions, a 3.62% improvement in F1-measure is achieved and about 35.78% of the computational time is saved.
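A minimal sketch of the two-stage pipeline the abstract describes: an energy-based voice activity detector followed by a rule-based applause check. The frame sizes, thresholds, and the spectral-flatness rule are illustrative assumptions, not the authors' actual values.

```python
# Illustrative two-stage pipeline: energy VAD, then a rule-based check.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a waveform into overlapping frames (sizes are assumptions)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def vad(x, energy_thresh=1e-4):
    """Mark frames whose short-time energy exceeds a threshold."""
    frames = frame_signal(x)
    return (frames ** 2).mean(axis=1) > energy_thresh

def is_applause(segment):
    """Applause is noise-like: flat spectrum, no stable pitch."""
    spec = np.abs(np.fft.rfft(segment * np.hanning(len(segment))))
    # Spectral flatness = geometric mean / arithmetic mean, near 1 for noise.
    flatness = np.exp(np.mean(np.log(spec + 1e-12))) / (spec.mean() + 1e-12)
    return flatness > 0.5  # hypothetical threshold
```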
International Journal of Speech Technology | 2010
Yanxiong Li; Sam Kwong; Qianhua He; Jun He; Jichen Yang
Feature subsets and hidden Markov model (HMM) parameters are the two major factors that affect the classification accuracy (CA) of an HMM-based classifier. This paper proposes a genetic algorithm based approach for simultaneously optimizing both feature subsets and HMM parameters, with the aim of obtaining the best HMM-based classifier. Experimental data extracted from three spontaneous speech corpora were used to evaluate the proposed approach against three approaches adopted in previous work on discriminating between speech and non-speech events (e.g. filled pause, laughter, applause): optimizing only feature subsets, optimizing only HMM parameters, and optimizing neither. The experimental results show that the proposed approach obtains a CA of 91.05%, while the three other approaches obtain CAs of 86.11%, 87.05%, and 83.16%, respectively. The results suggest that the proposed approach is superior to the previous approaches.
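A sketch of how a chromosome might jointly encode a feature-subset mask and HMM hyper-parameters, with a plain genetic-algorithm loop. The feature count, search ranges, and the `train_and_score` stub (which in the paper would train an HMM classifier and return its classification accuracy) are all hypothetical.

```python
# Illustrative chromosome and GA loop; `train_and_score` is a stub
# standing in for HMM training plus accuracy evaluation.
import random

N_FEATURES = 39  # e.g. MFCCs plus deltas (an assumption)

def random_chromosome():
    mask = [random.random() < 0.5 for _ in range(N_FEATURES)]  # feature subset
    n_states = random.randint(3, 8)   # HMM states (assumed search range)
    n_mix = random.randint(1, 8)      # Gaussians per state (assumed range)
    return (mask, n_states, n_mix)

def crossover(a, b):
    mask = [x if random.random() < 0.5 else y for x, y in zip(a[0], b[0])]
    return (mask, random.choice([a[1], b[1]]), random.choice([a[2], b[2]]))

def mutate(c, rate=0.05):
    mask = [(not g) if random.random() < rate else g for g in c[0]]
    return (mask, c[1], c[2])

def train_and_score(chrom):
    # Placeholder fitness: the paper would train an HMM-based classifier
    # on the selected features and return its classification accuracy.
    return random.random()

def evolve(pop_size=30, generations=50):
    pop = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=train_and_score, reverse=True)[:pop_size // 2]
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=train_and_score)
```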
International Conference on Audio, Language and Image Processing | 2014
Xueyuan Zhang; Zhuosheng Su; Pei Lin; Qianhua He; Jichen Yang
This paper presents an audio feature extraction scheme based on spectral decomposition, where the decomposition is performed iteratively by matching pursuit in the frequency domain. Motivated by psychoacoustic studies, a set of spectral basis vectors is constructed to extract pitch, timbre, and residual inharmonic components from the spectrum. The audio feature is represented by the scales of the basis vectors after dimension reduction. The proposed feature is evaluated on a 13-category audio effect classification task, and the experimental results show that it outperforms other spectral and cepstral features.
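A minimal sketch of matching pursuit in the frequency domain as the abstract outlines: iteratively pick the spectral basis vector best correlated with the residual spectrum and accumulate its scale. The construction of the basis itself (e.g. harmonic combs for pitch, smooth envelopes for timbre) is an assumption and not shown.

```python
import numpy as np

def matching_pursuit(spectrum, basis, n_iter=20):
    """spectrum: (n_bins,) magnitude spectrum; basis: (n_atoms, n_bins)
    rows with unit L2 norm. Returns atom scales and the residual."""
    residual = spectrum.astype(float).copy()
    scales = np.zeros(len(basis))
    for _ in range(n_iter):
        corr = basis @ residual             # correlation with every atom
        k = int(np.argmax(np.abs(corr)))    # best-matching atom
        scales[k] += corr[k]
        residual -= corr[k] * basis[k]      # remove the explained part
    return scales, residual  # scales become the feature after reduction
```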
Computer Speech & Language | 2017
Yanxiong Li; Qin Wang; Xue Zhang; Wei Li; Xinchao Li; Jichen Yang; Xiaohui Feng; Qian Huang; Qianhua He
This paper proposes an unsupervised method for analyzing speaker roles in multi-participant conversational speech. First, features characterizing the differences between roles are extracted from the outputs of speaker diarization. Then, a role clustering algorithm based on the criterion of maximizing the inter-cluster distance, without using any convergence threshold, is proposed to obtain the number of roles and to merge utterances belonging to the same role into one cluster. The contributions of different combinations of individual feature subsets are compared on the outputs of speaker diarization; the combined feature subsets obtain higher F scores than the individual ones for clustering speaker roles. The impacts of both speaker diarization errors and feature dimensions on the performance of the proposed method are also discussed. Experiments are conducted on the outputs of both manual annotation and automatic speaker diarization to compare the proposed method with a state-of-the-art clustering method and a supervised method. Evaluations show that the proposed method is superior to the previous clustering method and close to the conventional supervised method in terms of F scores under two different experimental conditions.
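One plausible reading of the clustering criterion, sketched with scikit-learn: try several cluster counts and keep the one that maximizes the minimum inter-cluster centroid distance. The paper's exact criterion, features, and search range may differ.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_roles(X, k_max=6):
    """X: (n_speakers, n_features) role features from diarization output."""
    best_k, best_score, best_labels = 2, -np.inf, None
    for k in range(2, min(k_max, len(X)) + 1):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
        centroids = np.stack([X[labels == c].mean(axis=0) for c in range(k)])
        # Minimum pairwise centroid distance as the inter-cluster score.
        d = np.linalg.norm(centroids[:, None] - centroids[None], axis=-1)
        score = d[np.triu_indices(k, 1)].min()
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```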
International Conference on Acoustics, Speech, and Signal Processing | 2016
Ling Zou; Qianhua He; Jichen Yang; Yanxiong Li
Source recording device matching from two speech recordings is a new and important problem in digital media forensics. It aims to answer the question of whether two speech recordings were made by the same recording device. In this study we propose a source cell phone matching scheme. A Gaussian supervector (GSV) based on Mel-frequency cepstral coefficients (MFCCs) is extracted from each speech recording and sparsely represented over a dictionary learned by the K-SVD algorithm. The reduced-dimensional sparse representation coefficients are used to characterize the intrinsic fingerprint of the recording device. Then, KISS metric learning based similarity matching is conducted on the pair of fingerprints extracted from the two speech recordings. Evaluation experiments were conducted on a database of speech recordings from 14 cell phones, and the results demonstrate the feasibility of the proposed scheme.
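A sketch of the matching stage under stated assumptions: sparse-code two supervector fingerprints against an already-learned dictionary (scikit-learn's orthogonal matching pursuit stands in for the sparse coder), then score the pair with a KISS-style log-likelihood ratio. The inverse covariances would be estimated from training pairs of same-device and different-device fingerprints; K-SVD dictionary learning and GSV extraction are assumed done beforehand.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def sparse_fingerprint(gsv, dictionary, k=10):
    """k-sparse code of a GSV; dictionary columns are the learned atoms."""
    return orthogonal_mp(dictionary, gsv, n_nonzero_coefs=k)

def kiss_similarity(x, y, cov_same_inv, cov_diff_inv):
    """KISS-style score on a fingerprint pair: log-likelihood ratio of
    'same device' vs 'different device' Gaussians on the difference."""
    d = x - y
    return d @ cov_diff_inv @ d - d @ cov_same_inv @ d  # larger = more similar
```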
IEEE Transactions on Information Forensics and Security | 2018
Yanxiong Li; Xue Zhang; Xianku Li; Yuhan Zhang; Jichen Yang; Qianhua He
Considerable attention has been paid to acquisition device recognition over the past decade in the forensic community, especially in digital image forensics. In contrast, acquisition device clustering from speech recordings is a new problem that aims to merge the recordings acquired by the same device into a single cluster without having prior information about the recordings and training classifiers in advance. In this paper, we propose a method for mobile phone clustering from speech recordings by using a new feature of deep representation and a spectral clustering algorithm. The new feature is learned by a deep auto-encoder network for representing the intrinsic trace left behind by each phone in the recordings, and spectral clustering is used to merge recordings acquired by the same phone into a single cluster. The impacts of the structures of the deep auto-encoder network on the performance of the new feature are discussed. Different features are compared with one another. The proposed method is compared with others and evaluated under special conditions. The results show that the proposed method is effective under these conditions and the new feature outperforms other features.
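A minimal sketch of the two stages the abstract names, assuming PyTorch for the auto-encoder and scikit-learn for spectral clustering; layer sizes, the code dimension, and the training recipe are assumptions.

```python
import torch
import torch.nn as nn
from sklearn.cluster import SpectralClustering

class AutoEncoder(nn.Module):
    def __init__(self, dim_in, dim_code=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(),
                                 nn.Linear(256, dim_code))
        self.dec = nn.Sequential(nn.Linear(dim_code, 256), nn.ReLU(),
                                 nn.Linear(256, dim_in))
    def forward(self, x):
        return self.dec(self.enc(x))

def cluster_recordings(X, n_phones, epochs=100):
    """X: (n_recordings, dim) float tensor of per-recording features."""
    model = AutoEncoder(X.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):                  # plain reconstruction training
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), X)
        loss.backward()
        opt.step()
    codes = model.enc(X).detach().numpy()    # the deep representation
    return SpectralClustering(n_clusters=n_phones,
                              affinity="nearest_neighbors").fit_predict(codes)
```

The nearest-neighbors affinity avoids committing to a global kernel width on the learned codes; a precomputed affinity would work equally well here.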
International Conference on Acoustics, Speech, and Signal Processing | 2017
Yanxiong Li; Xue Zhang; Xianku Li; Xiaohui Feng; Jichen Yang; Aiwu Chen; Qianhua He
Acquisition device clustering from speech recordings is a new and critical problem in the field of speech forensics. It aims at merging speech recordings acquired by the same device into one cluster, without prior information about the processed data and without pre-training a classifier. We propose a mobile phone clustering method in which a deep Gaussian supervector, learned by a deep neural network, represents the intrinsic trace left behind by a mobile phone in speech recordings; a spectral clustering technique is then adopted to merge speech recordings acquired by the same mobile phone into one cluster. The performance of the proposed method is evaluated on a public corpus of speech recordings acquired by mobile phones. The results show that the proposed method is effective for mobile phone clustering from acquired speech recordings.
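For the Gaussian-supervector half of the pipeline, a sketch of MAP-adapting the means of a universal background model (UBM) to one recording's frames and stacking them. The relevance factor and the use of scikit-learn's GaussianMixture as the UBM are illustrative assumptions; the deep network that refines this supervector is not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gaussian_supervector(frames, ubm: GaussianMixture, relevance=16.0):
    """frames: (n_frames, dim) MFCCs from one recording; ubm: fitted GMM."""
    post = ubm.predict_proba(frames)            # (n_frames, n_components)
    n_k = post.sum(axis=0)                      # soft counts per component
    f_k = post.T @ frames                       # first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]  # adaptation coefficients
    means = alpha * (f_k / np.maximum(n_k, 1e-8)[:, None]) \
            + (1 - alpha) * ubm.means_          # MAP-adapted means
    return means.ravel()                        # stacked supervector
```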
International Conference on Audio, Language and Image Processing | 2014
Jichen Yang; Qianhua He; Min Cai; Yanxiong Li
Most short time-frequency feature (TFF) extraction methods in the literature consider only the scale and frequency of the selected atoms, neglecting the effect of the expansion coefficients and time positions of those atoms. To classify movie audio signals better, this work investigates an effective and flexible time-frequency feature extraction method that uses the expansion coefficient, scale, time, and frequency of the selected atoms. It consists of four stages: signal decomposition, Wigner-Ville distribution, principal component extraction, and clustering. The experimental results show that the proposed TFF outperforms the traditional TFF, improving accuracy by 6% when classifying twenty kinds of movie audio signals. The best dimensionality for the proposed TFF is 25.
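A sketch of two of the four stages (Wigner-Ville distribution and principal-component extraction); the matching-pursuit decomposition and clustering stages are omitted, and the final 25-dimensional summary is only illustrative, assuming the signal is long enough for that many components.

```python
import numpy as np
from sklearn.decomposition import PCA

def wigner_ville(x):
    """Naive discrete Wigner-Ville distribution; columns index time."""
    N = len(x)
    W = np.zeros((N, N))
    for n in range(N):
        m = min(n, N - 1 - n)
        tau = np.arange(-m, m + 1)
        kernel = np.zeros(N, dtype=complex)
        kernel[tau % N] = x[n + tau] * np.conj(x[n - tau])
        W[:, n] = np.real(np.fft.fft(kernel))
    return W

def tf_feature(x, n_dims=25):
    """Crude n_dims-dimensional summary of the TF plane (illustrative)."""
    W = wigner_ville(np.asarray(x, dtype=complex))
    return PCA(n_components=n_dims).fit(W).explained_variance_
```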
Journal of Multimedia | 2014
Wei Li; Qianhua He; Yanxiong Li; Jichen Yang
An algorithm based on Model Distance (MD) for spectral speaker clustering is proposed to address the shortcoming of the general spectral clustering algorithm in describing the distribution of the signal source. First, a Universal Background Model (UBM) is created from a large number of independent speakers. Then, a Gaussian Mixture Model (GMM) is trained from the UBM for every speech segment. Finally, the probability distance between the GMMs of the speech segments is used to build an affinity matrix, and spectral speaker clustering is performed on the affinity matrix. Experimental results on news and conference data sets show an average improvement of 6.38% in F-measure compared with the algorithm based on feature-vector distance. In addition, the proposed algorithm is 11.72 times faster.
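A sketch of the model-distance affinity under one simplifying assumption: because every segment GMM derives from the same UBM, the divergence between two GMMs can be approximated component-wise from their adapted means (diagonal covariances). The Gaussian kernel width `sigma` is hypothetical.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def gmm_distance(means_a, means_b, ubm_weights, ubm_covars):
    """Matched-component approximation of the divergence between two
    GMMs derived from one UBM (diagonal covariances assumed)."""
    d2 = ((means_a - means_b) ** 2 / ubm_covars).sum(axis=1)
    return float((ubm_weights * d2).sum())

def cluster_speakers(seg_means, ubm_weights, ubm_covars, n_speakers, sigma=1.0):
    """seg_means: list of (n_components, dim) adapted mean matrices."""
    n = len(seg_means)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = gmm_distance(seg_means[i], seg_means[j],
                             ubm_weights, ubm_covars)
            A[i, j] = A[j, i] = np.exp(-d / (2 * sigma ** 2))  # affinity
    return SpectralClustering(n_clusters=n_speakers,
                              affinity="precomputed").fit_predict(A)
```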
IET Signal Processing | 2017
Jichen Yang; Qianhua He; Yanxiong Li; Leian Liu; Jianhong Li; Xiaohui Feng
The currently popular dictionary learning algorithms for sparse representation of signals are K-means Singular Value Decomposition (K-SVD) and K-SVD-extended. Both use only a rank-1 approximation to update one atom at a time and cannot cope efficiently with large dictionaries. To tackle these two problems, this study proposes M-Principal Component Analysis-N (M-PCA-N), an algorithm for dictionary learning and sparse representation. First, M-Principal Component Analysis (M-PCA) uses information from the top M ranks of the SVD decomposition to update M atoms at a time. Then, to further exploit the remaining ranks, M-PCA-N is built on M-PCA by transferring information from the following N non-principal ranks onto the top M principal ranks. The mathematical formulation indicates that M-PCA may be seen as a generalisation of K-SVD. Experimental results on the BBC Sound Effects Library show that M-PCA-N not only lowers the mean squared error between the original and approximated signals in audio signal sparse representation, but also obtains higher audio signal classification precision than K-SVD.
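A sketch of the M-atom update that appears to be the core of M-PCA: where K-SVD takes a rank-1 SVD of the restricted residual to update one atom, this takes the top-M singular pairs to update M atoms at once. Atom grouping and the M-PCA-N transfer of the following N ranks are omitted.

```python
import numpy as np

def mpca_update(X, D, C, atom_ids):
    """X: (dim, n_signals) data; D: (dim, n_atoms) dictionary;
    C: (n_atoms, n_signals) sparse codes; atom_ids: the M atoms to update."""
    used = np.abs(C[atom_ids]).sum(axis=0) > 0   # signals that use the group
    if not used.any():
        return D, C
    others = [k for k in range(D.shape[1]) if k not in set(atom_ids)]
    E = X[:, used] - D[:, others] @ C[others][:, used]   # group residual
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    M = min(len(atom_ids), len(s))
    ids = list(atom_ids)[:M]
    D[:, ids] = U[:, :M]                          # M new (orthonormal) atoms
    C[np.ix_(ids, np.where(used)[0])] = s[:M, None] * Vt[:M]
    return D, C
```

With M = 1 this reduces exactly to the K-SVD atom update, which is consistent with the abstract's claim that M-PCA generalises K-SVD.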