Publication


Featured research published by Yanxiong Li.


Signal Processing | 2009

Characteristics-based effective applause detection for meeting speech

Yanxiong Li; Qianhua He; Sam Kwong; Tao Li; Jichen Yang

Applause frequently occurs in multi-participant meeting speech, and detecting it is important for meeting speech recognition, semantic inference, highlight extraction, etc. In this paper, we first study the characteristic differences between applause and speech, such as duration, pitch, spectrogram, and occurrence locations. Then, an effective algorithm based on these characteristics is proposed for detecting applause in a meeting speech stream. In the algorithm, non-silence signal segments are first extracted using voice activity detection. Afterward, applause segments are detected from the non-silence segments based on the characteristic differences between applause and speech, without using any complex statistical models such as hidden Markov models. The proposed algorithm accurately determines the boundaries of applause in the meeting speech stream and is computationally efficient. In addition, it can extract applause sub-segments from mixed segments. Experimental evaluations show that the proposed algorithm achieves satisfactory results in detecting applause in meeting speech: precision, recall, and F1-measure are 94.34%, 98.04%, and 96.15%, respectively. Compared with the traditional algorithm under the same experimental conditions, a 3.62% improvement in F1-measure is achieved and about 35.78% of the computational time is saved.
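To make the characteristic-based decision stage concrete, here is a minimal Python sketch of how a non-silence segment might be labelled using only duration and the proportion of voiced (pitched) frames; the threshold values and the frame-level pitch decisions are illustrative assumptions, not the paper's tuned configuration.

```python
import numpy as np

def classify_segment(frames_pitched, duration_s,
                     min_dur=1.0, max_voiced_ratio=0.2):
    """Label a non-silence segment as applause or speech using simple
    characteristic differences (duration and fraction of voiced frames).
    Thresholds here are illustrative placeholders."""
    voiced_ratio = np.mean(frames_pitched)  # fraction of frames with detected pitch
    # Applause is broadband and largely unvoiced, and tends to last longer
    # than a single speech syllable.
    if duration_s >= min_dur and voiced_ratio <= max_voiced_ratio:
        return "applause"
    return "speech"

# Example: a 2.5 s segment in which only about 10% of frames carry pitch
frames_pitched = np.random.rand(250) < 0.10
print(classify_segment(frames_pitched, duration_s=2.5))  # -> "applause"
```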


Annual ACIS International Conference on Computer and Information Science | 2008

A Novel Detection Method of Filled Pause in Mandarin Spontaneous Speech

Yanxiong Li; Qianhua He; Tao Li

Filled pauses are one of the hesitation phenomena that current speech recognizers cannot effectively handle. Detecting filled pauses is important in spontaneous speech dialogue systems because they play valuable roles in oral communication, such as helping a speaker keep a conversational turn. In this paper, a novel detection method for filled pauses is proposed based on a two-level decision strategy. First, hypothetical filled pauses are extracted from non-silence signal segments. Then, HMMs are trained and employed to recognize filled pauses among these hypothetical candidates. Experimental results show that the average precision and recall rates for filled pauses are 80.66% and 92.59%, respectively. Moreover, this filled pause detector can distinguish filled pauses from elongated words, which was not achieved in previous work.
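A rough sketch of the two-level strategy, assuming a simple rule for the first-level candidate extraction and using the hmmlearn library as a stand-in for the paper's HMM implementation; the candidate thresholds and model sizes are placeholders, not the paper's configuration.

```python
import numpy as np
from hmmlearn import hmm

def extract_candidates(segments):
    """Level 1: keep non-silence segments that look like filled pauses
    (relatively long, with nearly flat pitch). Thresholds are placeholders."""
    return [s for s in segments
            if s["duration"] > 0.3 and s["pitch_std"] < 10.0]

def train_hmm(feature_seqs, n_states=3):
    """Train a Gaussian HMM on a list of per-segment feature matrices."""
    X = np.vstack(feature_seqs)
    lengths = [len(f) for f in feature_seqs]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
    model.fit(X, lengths)
    return model

def is_filled_pause(mfcc, fp_model, other_model):
    """Level 2: accept a candidate if the filled-pause HMM scores it higher
    than the competing model."""
    return fp_model.score(mfcc) > other_model.score(mfcc)
```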


International Journal of Speech Technology | 2010

Genetic algorithm based simultaneous optimization of feature subsets and hidden Markov model parameters for discrimination between speech and non-speech events

Yanxiong Li; Sam Kwong; Qianhua He; Jun He; Jichen Yang

Feature subsets and hidden Markov model (HMM) parameters are the two major factors that affect the classification accuracy (CA) of an HMM-based classifier. This paper proposes a genetic algorithm based approach for simultaneously optimizing both feature subsets and HMM parameters, with the aim of obtaining the best HMM-based classifier. Experimental data extracted from three spontaneous speech corpora were used to evaluate the effectiveness of the proposed approach against three other approaches adopted in previous work (single optimization of feature subsets, single optimization of HMM parameters, and no optimization of either) for discrimination between speech and non-speech events (e.g. filled pause, laughter, applause). The experimental results show that the proposed approach obtains a CA of 91.05%, while the three other approaches obtain CAs of 86.11%, 87.05%, and 83.16%, respectively. The results suggest that the proposed approach is superior to the previous approaches.
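A toy sketch of the simultaneous optimization idea: each chromosome encodes a binary feature mask together with an HMM hyperparameter (here, the number of states), and fitness is the classification accuracy returned by a caller-supplied `evaluate` function. The encoding, operators, and parameter ranges are assumptions for illustration, not the paper's actual GA design.

```python
import random

def evolve(n_features, evaluate, pop_size=20, generations=30,
           max_states=8, mutation_rate=0.05):
    """Toy GA: each chromosome is (feature_mask, n_states); `evaluate`
    must return the classification accuracy for that configuration,
    e.g. by training an HMM classifier and scoring a validation set."""
    def random_chrom():
        mask = [random.random() < 0.5 for _ in range(n_features)]
        return (mask, random.randint(2, max_states))

    def mutate(chrom):
        mask, n_states = chrom
        mask = [(not b) if random.random() < mutation_rate else b for b in mask]
        if random.random() < mutation_rate:
            n_states = random.randint(2, max_states)
        return (mask, n_states)

    def crossover(a, b):
        cut = random.randrange(1, n_features)
        return (a[0][:cut] + b[0][cut:], random.choice([a[1], b[1]]))

    pop = [random_chrom() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=evaluate, reverse=True)
        parents = ranked[:pop_size // 2]              # truncation selection
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=evaluate)                     # best (mask, n_states) found
```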


International Journal of Speech Technology | 2011

Detecting laughter in spontaneous speech by constructing laughter bouts

Yanxiong Li; Qianhua He

Laughter frequently occurs in spontaneous speech (e.g. conversational speech, meeting speech), and detecting it is important for semantic analysis, highlight extraction, spontaneous speech recognition, etc. In this paper, we first analyze the characteristic differences between speech and laughter, and then propose an approach for detecting laughter in spontaneous speech. In the proposed approach, non-silence signal segments are first extracted from spontaneous speech using voice activity detection, and then split into syllables. Afterward, possible laughter bouts are constructed by merging adjacent syllables (using a symmetrical Itakura distance measure and a duration threshold) instead of using a sliding fixed-length window. Finally, hidden Markov models (HMMs) are used to classify the possible laughter bouts as laughs, speech sounds, or other sounds. Experimental evaluations show that the proposed approach achieves satisfactory results in detecting two types of audible laughs (solo and group laughs): precision, recall, and F1-measure (the harmonic mean of precision and recall) are 83.4%, 86.1%, and 84.7%, respectively. Compared with the sliding-window-based approach, a 4.9% absolute improvement in F1-measure is obtained. In addition, the laughter boundary errors of the proposed approach are smaller than those of the sliding-window-based approach.
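A minimal sketch of the bout-construction step, assuming syllables are given as dictionaries with start/end times and LPC coefficients and that a `distance` callable implements the symmetrical Itakura measure; the thresholds are illustrative placeholders, not the paper's values.

```python
def build_bouts(syllables, distance, dist_thresh=1.5, gap_thresh=0.25):
    """Merge adjacent syllables into candidate laughter bouts when they are
    acoustically similar (distance below dist_thresh) and temporally close
    (gap below gap_thresh seconds). Each bout is later scored by HMMs."""
    if not syllables:
        return []
    bouts, current = [], [syllables[0]]
    for prev, cur in zip(syllables, syllables[1:]):
        gap = cur["start"] - prev["end"]
        if gap <= gap_thresh and distance(prev["lpc"], cur["lpc"]) <= dist_thresh:
            current.append(cur)          # extend the current bout
        else:
            bouts.append(current)        # close the bout and start a new one
            current = [cur]
    bouts.append(current)
    return bouts
```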


International Conference on Audio, Language and Image Processing | 2010

Two-level approach for detecting non-lexical audio events in spontaneous speech

Yanxiong Li; Qianhua He; Wei Li; Zhi-Feng Wang

Based on analyses of the characteristic differences between various audio events, a two-level approach is proposed for detecting three non-lexical audio events (filled pause, laugh, and applause) in spontaneous speech, in which candidate events are first extracted and then confirmed by a model-based decision. The experiments give an average precision of 87.3%, recall of 93.77%, and F-measure of 90.42%. Compared with the sliding-window-based approach, the average F-measure is improved by 7.52%. Moreover, the approach can more accurately determine the boundaries of non-lexical audio events in spontaneous speech.


Multimedia Tools and Applications | 2018

Using multi-stream hierarchical deep neural network to extract deep audio feature for acoustic event detection

Yanxiong Li; Xue Zhang; Hai Jin; Xianku Li; Qin Wang; Qianhua He; Qian Huang

Extraction of effective audio features from acoustic events strongly influences the performance of an Acoustic Event Detection (AED) system, especially in adverse audio conditions. In this study, we propose a framework for extracting a Deep Audio Feature (DAF) using a multi-stream hierarchical Deep Neural Network (DNN). The DAF output by the proposed framework fuses the potentially complementary information of multiple input feature streams and can thus be more discriminative than those input features for AED. We take two input feature streams and a two-stage hierarchical DNN as an example to show the extraction of the DAF. The effectiveness of different audio features for AED is evaluated on two audio corpora, a BBC (British Broadcasting Corporation) audio dataset and a TV audio dataset, with different signal-to-noise ratios. Experimental results show that the DAF outperforms other features for AED under several experimental conditions.
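A small PyTorch sketch of what a two-stream hierarchical network of this kind could look like, with the bottleneck activation taken as the DAF; the layer sizes, stream dimensions, and number of event classes are arbitrary assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TwoStreamDAF(nn.Module):
    """Two-stream hierarchical DNN sketch: each input feature stream passes
    through its own first-stage network; the outputs are concatenated and fed
    to a second stage whose bottleneck activation serves as the deep audio
    feature (DAF). All sizes are illustrative."""
    def __init__(self, dim_a=40, dim_b=60, bottleneck=64, n_events=10):
        super().__init__()
        self.stream_a = nn.Sequential(nn.Linear(dim_a, 128), nn.ReLU(),
                                      nn.Linear(128, 64), nn.ReLU())
        self.stream_b = nn.Sequential(nn.Linear(dim_b, 128), nn.ReLU(),
                                      nn.Linear(128, 64), nn.ReLU())
        self.bottleneck = nn.Sequential(nn.Linear(128, bottleneck), nn.ReLU())
        self.classifier = nn.Linear(bottleneck, n_events)

    def forward(self, xa, xb):
        fused = torch.cat([self.stream_a(xa), self.stream_b(xb)], dim=-1)
        daf = self.bottleneck(fused)          # deep audio feature
        return self.classifier(daf), daf

# Example: a batch of 8 frames with 40-d and 60-d input feature streams
logits, daf = TwoStreamDAF()(torch.randn(8, 40), torch.randn(8, 60))
```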


IET Signal Processing | 2014

Fast speaker clustering using distance of feature matrix mean and adaptive convergence threshold

Yanxiong Li; Hai Jin; Wei Li; Qianhua He; Zhengyu Zhu; Xiaohui Feng

The authors propose a method for fast speaker clustering in which a distance, the distance of feature matrix mean (DFMM), is first defined to characterize the similarity between any two clusters, and an adaptive convergence threshold is then introduced to terminate the clustering procedure. If the minimum of the DFMMs between any two clusters is smaller than the threshold, those clusters are merged. This merging is repeated until the minimum DFMM between any two clusters is larger than the threshold. The authors conduct experiments on both shorter voice segments (≤ 3 s) and longer voice segments (> 3 s) to compare their method with state-of-the-art methods: agglomerative hierarchical clustering with the Bayesian information criterion (AHC + BIC) and vector quantisation with spectral clustering. Experiments show that their method achieves the best results for clustering shorter voice segments and also obtains satisfactory results for clustering longer voice segments compared with the other two methods. Moreover, their method is faster than the other methods in all experimental cases. Initial results show that hybrid methods combining their method with AHC + BIC obtain further improvement in terms of F score.
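A greedy-merging sketch of the clustering loop, approximating the DFMM as the Euclidean distance between the mean vectors of the pooled feature matrices of two clusters and leaving the adaptive threshold to the caller; this illustrates the stopping rule only, not the paper's exact distance definition or threshold adaptation.

```python
import numpy as np

def cluster_segments(features, threshold):
    """Greedy speaker clustering sketch. `features` is a list of per-segment
    feature matrices (frames x dims); each cluster pools its members and is
    represented by the mean vector of the pooled matrix. The closest pair of
    clusters is merged until the smallest distance exceeds `threshold`."""
    clusters = [[f] for f in features]

    def mean_of(cluster):
        return np.vstack(cluster).mean(axis=0)

    while len(clusters) > 1:
        means = [mean_of(c) for c in clusters]
        best, best_pair = np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(means[i] - means[j])
                if d < best:
                    best, best_pair = d, (i, j)
        if best > threshold:              # convergence threshold reached: stop
            break
        i, j = best_pair
        clusters[i].extend(clusters[j])   # merge the closest pair
        del clusters[j]
    return clusters
```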


International Conference on Machine Learning and Cybernetics | 2013

Liveness detection using time drift between lip movement and voice

Zhengyu Zhu; Qianhua He; Xiaohui Feng; Yanxiong Li; Zhi-Feng Wang

Liveness checks that measure the synchrony between lip movement and voice in a video sequence have been shown to be a feasible way to prevent spoofing attacks. Generic methods use the degree of audiovisual correlation to judge consistency; however, little attention has been paid to the difference in audiovisual time drift between live recordings and fraudulent attacks. In this paper, the time shift difference and the delay range are determined by delay estimation experiments on VidTIMIT. Based on these findings, we propose an improved 'liveness' score evaluation algorithm for audio-video identification systems. Experimental results on the same database show that the proposed algorithm achieves better performance: compared with traditional methods, the EER (Equal Error Rate) decreases by about 5% on average.
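As a simple illustration of audio-visual delay estimation (a generic stand-in, not the paper's exact procedure), the time drift can be approximated by the lag that maximizes the cross-correlation between a lip-opening trajectory and the audio energy envelope, both sampled at the video frame rate; the frame rate here is an assumption.

```python
import numpy as np

def estimate_delay(lip_openness, audio_energy, frame_rate=25.0):
    """Estimate the audio-visual time drift (in seconds) as the lag that
    maximizes the cross-correlation between the mean-removed lip-opening
    trajectory and audio energy envelope."""
    a = lip_openness - np.mean(lip_openness)
    b = audio_energy - np.mean(audio_energy)
    corr = np.correlate(a, b, mode="full")
    lag = np.argmax(corr) - (len(b) - 1)   # positive: lip signal lags the audio
    return lag / frame_rate
```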


Computer Speech & Language | 2017

Unsupervised classification of speaker roles in multi-participant conversational speech

Yanxiong Li; Qin Wang; Xue Zhang; Wei Li; Xinchao Li; Jichen Yang; Xiaohui Feng; Qian Huang; Qianhua He

This paper proposes an unsupervised method for analyzing speaker roles in multi-participant conversational speech. First, features for characterizing the differences between roles are extracted from the outputs of speaker diarization. Then, a role clustering algorithm based on the criterion of maximizing the inter-cluster distance, without using any convergence threshold, is proposed to determine the number of roles and to merge the utterances belonging to the same role into one cluster. The contributions of different combinations of individual feature subsets are compared for the proposed method on the outputs of speaker diarization; the combined feature subsets obtain higher F scores than the individual ones for clustering speaker roles. The impacts of both speaker diarization errors and feature dimensions on the performance of the proposed method are also discussed. Experiments are conducted on the outputs of both manual annotation and automatic speaker diarization to compare the proposed method with a state-of-the-art clustering method and a supervised method. Evaluations show that the proposed method is superior to the previous clustering method and close to the conventional supervised method in terms of F scores under two different experimental conditions.


International Conference on Acoustics, Speech, and Signal Processing | 2016

Source cell phone matching from speech recordings by sparse representation and KISS metric

Ling Zou; Qianhua He; Jichen Yang; Yanxiong Li

Source recording device matching from two speech recordings is a new and important problem in digital media forensics: it aims to determine whether two speech recordings were made by the same recording device. In this study we propose a source cell phone matching scheme. A Gaussian supervector (GSV) based on Mel-frequency cepstral coefficients (MFCCs) is extracted from each speech recording and sparsely represented with respect to a dictionary learned by the K-SVD algorithm. The reduced-dimensional sparse representation coefficients are used to characterize the intrinsic fingerprint of the recording device. Then, KISS metric learning based similarity matching is conducted on the pair of fingerprints extracted from the two speech recordings. Evaluation experiments were conducted on a database of speech recordings from 14 cell phones, and the results demonstrate the feasibility of the proposed scheme.
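A numpy sketch of the KISS metric matching stage, assuming the reduced-dimensional sparse-coefficient fingerprints have already been extracted; the GSV extraction and K-SVD dictionary learning steps are omitted, and the regularization constant is an assumption.

```python
import numpy as np

def kiss_metric(diff_same, diff_diff, eps=1e-6):
    """Learn the KISS metric matrix from pairwise difference vectors of
    matched (same device) and mismatched (different device) pairs:
    M = inv(Cov_same) - inv(Cov_diff), with a small ridge for stability."""
    d = diff_same.shape[1]
    cov_s = diff_same.T @ diff_same / len(diff_same) + eps * np.eye(d)
    cov_d = diff_diff.T @ diff_diff / len(diff_diff) + eps * np.eye(d)
    return np.linalg.inv(cov_s) - np.linalg.inv(cov_d)

def match_score(x, y, M):
    """Lower score means the two fingerprints are more likely to come
    from the same cell phone."""
    diff = x - y
    return diff @ M @ diff
```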

Collaboration


Dive into Yanxiong Li's collaborations.

Top Co-Authors

Qianhua He (South China University of Technology)
Jichen Yang (South China University of Technology)
Xiaohui Feng (South China University of Technology)
Xianku Li (South China University of Technology)
Jun He (South China University of Technology)
Wei Li (South China University of Technology)
Xue Zhang (South China University of Technology)
Xue-yuan Zhang (South China University of Technology)
Aiwu Chen (South China University of Technology)
Hai Jin (South China University of Technology)