Qingju Liu
University of Surrey
Publications
Featured research published by Qingju Liu.
IEEE Transactions on Audio, Speech, and Language Processing | 2014
Atiyeh Alinaghi; Philip J. B. Jackson; Qingju Liu; Wenwu Wang
In this paper, the mixing vector (MV) in the statistical mixing model is compared with the binaural cues represented by interaural level and phase differences (ILD and IPD). It is shown that the MV distributions remain quite distinct even when the sources are close to each other, whereas the binaural models overlap; on the other hand, the binaural cues are more robust to high reverberation than the MV models. Exploiting this complementary behavior, we introduce a new robust algorithm for stereo speech separation which considers both additive and convolutive noise signals, models the MV and binaural cues in parallel, and estimates probabilistic time-frequency masks. The contribution of each cue to the final decision is adjusted by empirically weighting the log-likelihoods of the cues. Furthermore, the permutation problem of frequency-domain blind source separation (BSS) is addressed by initializing the MVs based on the binaural cues. Experiments are performed systematically on determined and underdetermined speech mixtures in five rooms with various acoustic properties, including anechoic, highly reverberant, and spatially diffuse noise conditions. The results in terms of signal-to-distortion ratio (SDR) confirm the benefits of integrating the MV and binaural cues, as compared with two state-of-the-art baseline algorithms which use only the MV or the binaural cues.
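A minimal sketch of the cue-fusion idea described above (the variable names, Gaussian-style log-likelihood inputs, and the single weighting parameter are illustrative assumptions, not the paper's exact formulation): per-source log-likelihoods from the MV model and the binaural (ILD/IPD) model are combined with an empirical weight and converted into a soft time-frequency mask.

    import numpy as np

    def fuse_cue_masks(loglik_mv, loglik_binaural, alpha=0.5):
        """Combine per-source log-likelihoods of the mixing-vector (MV) and
        binaural (ILD/IPD) models into probabilistic time-frequency masks.

        loglik_mv, loglik_binaural : arrays of shape (n_sources, n_freq, n_frames)
        alpha : empirical weight on the MV cue (1 - alpha on the binaural cue)
        """
        # Weighted sum of log-likelihoods, standing in for the empirically
        # adjusted contribution of each cue to the final decision.
        loglik = alpha * loglik_mv + (1.0 - alpha) * loglik_binaural
        loglik -= loglik.max(axis=0, keepdims=True)   # numerical stability
        post = np.exp(loglik)
        # Softmax over sources gives a probabilistic mask per TF point.
        return post / post.sum(axis=0, keepdims=True)

    # Toy example: 2 sources, 257 frequency bins, 100 frames.
    rng = np.random.default_rng(0)
    masks = fuse_cue_masks(rng.normal(size=(2, 257, 100)),
                           rng.normal(size=(2, 257, 100)))
    print(masks.shape, masks.sum(axis=0).round(3).min())  # masks sum to 1 per TF point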
IEEE Transactions on Signal Processing | 2013
Qingju Liu; Wenwu Wang; Philip J. B. Jackson; Mark Barnard; Josef Kittler; Jonathon A. Chambers
In existing audio-visual blind source separation (AV-BSS) algorithms, the AV coherence is usually established through statistical modelling, using e.g. Gaussian mixture models (GMMs). These methods often operate in a low-dimensional feature space, rendering an effective global representation of the data. The local information, which is important for capturing the temporal structure of the data, has however not been explicitly exploited. In this paper, we propose a new method for capturing such local information, based on audio-visual dictionary learning (AVDL). We address several challenges associated with AVDL, including cross-modality differences in size, dimension and sampling rate, as well as the issues of scalability and computational complexity. Following a commonly employed bootstrap coding-learning process, we have developed a new AVDL algorithm which features a bimodality-balanced and scalable matching criterion, a size- and dimension-adaptive dictionary, a fast search index for efficient coding, and cross-modality diverse sparsity. We also show how the proposed AVDL can be incorporated into a BSS algorithm. As an example, we consider binaural mixtures, mimicking aspects of human binaural hearing, and derive a new noise-robust AV-BSS algorithm by combining the proposed AVDL algorithm with Mandel's BSS method, a state-of-the-art audio-domain method using time-frequency masking. We have systematically evaluated the proposed AVDL and AV-BSS algorithms and show their advantages over the corresponding baseline methods, using both synthetic data and visual speech data from the multimodal LILiR Twotalk corpus.
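A hypothetical sketch of a bimodality-balanced matching step, loosely in the spirit of the matching criterion mentioned above (the function, normalisation, and weighting are assumptions, not the paper's AVDL algorithm): each modality's correlation with the dictionary atoms is normalised before the two are combined, so that neither the audio nor the visual part dominates the atom selection.

    import numpy as np

    def balanced_match(x_audio, x_video, D_audio, D_video, beta=0.5):
        """Select the joint atom that best matches an audio-visual observation,
        balancing the two modalities.

        x_audio : (da,) audio feature vector
        x_video : (dv,) visual feature vector
        D_audio : (da, K) audio part of the dictionary
        D_video : (dv, K) visual part of the dictionary (same K atoms)
        beta    : balance between audio and visual scores
        """
        def normed_corr(x, D):
            D = D / (np.linalg.norm(D, axis=0, keepdims=True) + 1e-12)
            x = x / (np.linalg.norm(x) + 1e-12)
            return D.T @ x                      # (K,) correlation per atom

        score = beta * normed_corr(x_audio, D_audio) \
                + (1 - beta) * normed_corr(x_video, D_video)
        return int(np.argmax(score)), float(score.max())

    rng = np.random.default_rng(1)
    k, s = balanced_match(rng.normal(size=64), rng.normal(size=20),
                          rng.normal(size=(64, 8)), rng.normal(size=(20, 8)))
    print("best atom:", k, "score:", round(s, 3))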
IEEE Transactions on Multimedia | 2014
Qingju Liu; Andrew J. Aubrey; Wenwu Wang
The visual modality, deemed complementary to the audio modality, has recently been exploited to improve the performance of blind source separation (BSS) of speech mixtures, especially in adverse environments where the performance of audio-domain methods deteriorates. In this paper, we present a method to enhance audio-domain BSS by integrating voice activity information obtained via a visual voice activity detection (VAD) algorithm. Mimicking aspects of human hearing, binaural speech mixtures are considered in our two-stage system. First, in the off-line training stage, a speaker-independent voice activity detector is formed from the visual stimuli via the AdaBoost algorithm. In the on-line separation stage, interaural phase difference (IPD) and interaural level difference (ILD) cues are statistically analyzed to probabilistically assign each time-frequency (TF) point of the audio mixtures to the source signals. The detected voice activity cues (found via the visual VAD) are then integrated to reduce the interference residual. Detection of the interference residual takes place gradually, with two layers of boundaries in the correlation and energy-ratio maps. We have tested our algorithm on speech mixtures generated using room impulse responses at different reverberation times and noise levels. Simulation results show performance improvement of the proposed method for target speech extraction in noisy and reverberant environments, in terms of signal-to-interference ratio (SIR) and perceptual evaluation of speech quality (PESQ).
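An illustrative sketch of the residual-suppression idea (the thresholds, feature definitions, and attenuation factor are assumptions): TF points are flagged as interference residual using two layers of boundaries on a correlation map and an energy-ratio map, with the relaxed boundary applied only where the visual VAD indicates the target speaker is silent.

    import numpy as np

    def suppress_residual(mask, corr_map, energy_ratio, vad,
                          corr_hi=0.7, corr_lo=0.4, er_hi=0.6, er_lo=0.3,
                          atten=0.1):
        """Attenuate TF points likely dominated by interference residual.

        mask         : (F, T) probabilistic mask for the target source
        corr_map     : (F, T) correlation with an interference template
        energy_ratio : (F, T) interference-to-target energy ratio
        vad          : (T,)  boolean visual voice-activity decision for the target
        """
        strict = (corr_map > corr_hi) & (energy_ratio > er_hi)
        relaxed = (corr_map > corr_lo) & (energy_ratio > er_lo)
        # Relaxed boundary only counts when the visual VAD says the target is silent.
        residual = strict | (relaxed & ~vad[np.newaxis, :])
        out = mask.copy()
        out[residual] *= atten
        return out

    rng = np.random.default_rng(2)
    m = suppress_residual(rng.random((257, 50)), rng.random((257, 50)),
                          rng.random((257, 50)), rng.random(50) > 0.5)
    print(m.shape)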
Signal Processing | 2012
Qingju Liu; Wenwu Wang; Philip J. B. Jackson
Recent studies show that facial information contained in visual speech can help enhance the performance of audio-only blind source separation (BSS) algorithms. Such information is exploited through the statistical characterization of the coherence between the audio and visual speech using, e.g., a Gaussian mixture model (GMM). In this paper, we present three contributions. First, with the synchronized features, we propose an adapted expectation maximization (AEM) algorithm to model the audio-visual coherence in the off-line training process. Second, to improve the accuracy of this coherence model, we use a frame selection scheme to discard nonstationary features. Third, with the coherence maximization technique, we develop a new sorting method to solve the permutation problem in the frequency domain. We test our algorithm on a multimodal speech database composed of different combinations of vowels and consonants. The experimental results show that our proposed algorithm outperforms traditional audio-only BSS, which confirms the benefit of using visual speech to assist in the separation of the audio.
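A simplified sketch of permutation sorting in the frequency domain (the correlation criterion here is an assumption standing in for the paper's coherence-maximization model): in each frequency bin, the separated component envelopes are permuted so that they best match a set of reference activity profiles, e.g. ones predicted from the visual stream.

    import numpy as np
    from itertools import permutations

    def sort_permutations(envelopes, reference):
        """Resolve the frequency-domain permutation ambiguity.

        envelopes : (F, N, T) magnitude envelopes of N separated components per bin
        reference : (N, T)    reference activity profiles (e.g., predicted from video)
        Returns the permutation chosen for each frequency bin.
        """
        F, N, _ = envelopes.shape
        chosen = []
        for f in range(F):
            best, best_score = None, -np.inf
            for perm in permutations(range(N)):
                score = sum(np.corrcoef(envelopes[f, p], reference[n])[0, 1]
                            for n, p in enumerate(perm))
                if score > best_score:
                    best, best_score = perm, score
            chosen.append(best)
        return chosen

    rng = np.random.default_rng(3)
    print(sort_permutations(rng.random((4, 2, 30)), rng.random((2, 30)))[:2])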
International Conference on Signal and Information Processing | 2013
Qingju Liu; Wenwu Wang
Scans of double-sided documents often suffer from show-through distortions, where content from the reverse side (verso) appears in the front-side page (recto). Several algorithms employed for show-through removal from scanned images are based on linear mixing models, including blind source separation (BSS), non-negative matrix factorization (NMF), and adaptive filtering. However, a recent study shows that a non-linear model may provide better performance for resolving the overlapping front-reverse contents, especially in grayscale scans. In this paper, we propose a new non-linear NMF algorithm based on projected gradient adaptation. An adaptive filtering process is also incorporated to further eliminate the blurring effect caused by imperfect calibration of the scans. Our numerical tests show that the proposed algorithm offers better results than the baseline methods.
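A minimal projected-gradient NMF sketch for a plain linear mixing model (the paper's method is non-linear and adds adaptive filtering; this only illustrates the projected-gradient update with a non-negativity projection, and all parameter values are assumptions).

    import numpy as np

    def nmf_projected_gradient(V, rank=2, steps=500, lr=1e-2, seed=0):
        """Factorise a non-negative matrix V (pixels x scans) as W @ H using
        projected gradient descent on the squared reconstruction error."""
        rng = np.random.default_rng(seed)
        m, n = V.shape
        W = rng.random((m, rank))
        H = rng.random((rank, n))
        for _ in range(steps):
            R = W @ H - V                       # reconstruction residual
            gW, gH = R @ H.T, W.T @ R           # gradients of 0.5 * ||R||^2
            W = np.maximum(W - lr * gW, 0.0)    # project onto the non-negative orthant
            H = np.maximum(H - lr * gH, 0.0)
        return W, H

    # Toy data: 100 pixels from two scanned sides (recto and verso).
    V = np.abs(np.random.default_rng(4).normal(size=(100, 2)))
    W, H = nmf_projected_gradient(V)
    print("relative error:", round(np.linalg.norm(V - W @ H) / np.linalg.norm(V), 3))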
European Signal Processing Conference | 2015
Qingju Liu; Wenwu Wang; Philip J. B. Jackson; Trevor J. Cox
Representing a complex acoustic scene with audio objects is desirable but challenging in object-based spatial audio production and reproduction, especially when concurrent sound signals are present in the scene. Source separation (SS) provides a potentially useful and enabling tool for audio object extraction. These extracted objects are often remixed to reconstruct a sound field in the reproduction stage. A suitable SS method is expected to produce audio objects that ultimately deliver high quality audio after remix. The performance of these SS algorithms therefore needs to be evaluated in this context. Existing metrics for SS performance evaluation, however, do not take into account the essential sound field reconstruction process. To address this problem, here we propose a new SS evaluation method which employs a remixing strategy similar to the panning law, and provides a framework to incorporate the conventional SS metrics. We have tested our proposed method on real-room recordings processed with four SS methods, including two state-of-the-art blind source separation (BSS) methods and two classic beamforming algorithms. The evaluation results based on three conventional SS metrics are analysed.
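An illustrative sketch of the remix-then-evaluate idea (the pan angles, gain law, and the simple SDR-style metric are assumptions, not the paper's framework): separated objects are re-panned into a stereo scene with a constant-power (sine/cosine) panning law, and the remix is compared against the same remix built from the ground-truth sources.

    import numpy as np

    def constant_power_remix(sources, angles_deg):
        """Remix mono sources into a stereo scene with the constant-power panning law.
        sources    : (N, T) separated (or reference) source signals
        angles_deg : (N,)   pan angles in [0, 90]; 0 = full left, 90 = full right
        """
        theta = np.deg2rad(np.asarray(angles_deg, dtype=float))
        left = (np.cos(theta)[:, None] * sources).sum(axis=0)
        right = (np.sin(theta)[:, None] * sources).sum(axis=0)
        return np.stack([left, right])

    def remix_sdr(ref_sources, est_sources, angles_deg):
        """Score separation quality after the sound-field reconstruction step."""
        ref = constant_power_remix(ref_sources, angles_deg)
        est = constant_power_remix(est_sources, angles_deg)
        err = est - ref
        return 10 * np.log10((ref ** 2).sum() / ((err ** 2).sum() + 1e-12))

    rng = np.random.default_rng(5)
    s = rng.normal(size=(2, 16000))
    print(round(remix_sdr(s, s + 0.1 * rng.normal(size=s.shape), [20, 70]), 2), "dB")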
International Conference on Acoustics, Speech, and Signal Processing | 2016
Qingju Liu; Teofilo de Campos; Wenwu Wang; Adrian Hilton
Work on 3D human pose estimation has made significant progress in recent years, particularly due to the widespread availability of commodity depth sensors. However, most pose estimation methods follow a tracking-by-detection approach that does not explicitly handle occlusions, thus introducing outliers and identity-association issues when multiple targets are involved. To address these issues, we propose a new method based on the Probability Hypothesis Density (PHD) filter. In this method, the PHD filter with a novel clutter intensity model is used to remove outliers from the 3D head detection results, followed by an identity association scheme with occlusion detection for the targets. Experimental results show that our proposed method greatly mitigates the outliers and correctly associates identities to individual detections, with low computational cost.
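A heavily simplified, hypothetical sketch of the outlier-removal intuition only (this is not a PHD filter, and the measurement model, clutter intensity, and parameter values are assumptions): a 3D head detection is kept when its likelihood under the predicted target positions outweighs an assumed clutter intensity.

    import numpy as np

    def gate_detections(detections, predicted, sigma=0.15, p_detect=0.9, clutter=0.5):
        """Keep detections better explained by a predicted target than by clutter.

        detections : (M, 3) candidate 3D head positions in metres
        predicted  : (N, 3) predicted target positions from the previous frame
        clutter    : assumed clutter intensity (false alarms per unit volume)
        """
        keep = []
        norm = p_detect / ((2 * np.pi * sigma ** 2) ** 1.5)
        for z in detections:
            # Gaussian measurement likelihood against every predicted target.
            d2 = ((predicted - z) ** 2).sum(axis=1)
            target_lik = (norm * np.exp(-0.5 * d2 / sigma ** 2)).sum()
            if target_lik > clutter:          # more plausible than a false alarm
                keep.append(z)
        return np.array(keep)

    pred = np.array([[0.0, 1.6, 2.0]])
    dets = np.array([[0.05, 1.62, 2.1], [1.5, 0.2, 4.0]])
    print(gate_detections(dets, pred))       # the second detection is rejected as clutter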
European Signal Processing Conference | 2017
Qingju Liu; Wenwu Wang; Philip J. B. Jackson; Yan Tang
Deep neural networks (DNNs) have recently been shown to give state-of-the-art performance in monaural speech enhancement. However, in the DNN training process the perceptual differences between components of the DNN output are not fully exploited, and equal importance is often assumed. To address this limitation, we propose a new perceptually weighted objective function within a feedforward DNN framework, aiming to minimize the perceptual difference between the enhanced speech and the target speech. A perceptual weight is integrated into the proposed objective function, which has been tested on two types of output features: spectra and ideal ratio masks. Objective evaluations of both speech quality and speech intelligibility have been performed. Integrating our perceptual weight shows consistent improvement across several noise levels and a variety of noise types.
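A minimal numpy sketch of a perceptually weighted objective (the weighting curve below is an illustrative assumption, not the paper's perceptual model): the per-frequency squared error between the enhanced and target spectra is scaled by a weight before averaging, so perceptually salient bins contribute more to the loss.

    import numpy as np

    def perceptually_weighted_mse(enhanced, target, weights):
        """Weighted MSE between enhanced and target spectra.

        enhanced, target : (F, T) magnitude spectra (or masks) output by the network
        weights          : (F,)  per-frequency perceptual weights
        """
        err = (enhanced - target) ** 2
        return float((weights[:, None] * err).mean())

    # Illustrative weight: emphasise the 300 Hz - 4 kHz band (assumed, not the paper's curve).
    freqs = np.linspace(0, 8000, 257)
    w = np.where((freqs > 300) & (freqs < 4000), 2.0, 1.0)

    rng = np.random.default_rng(6)
    tgt = rng.random((257, 100))
    print(perceptually_weighted_mse(tgt + 0.05 * rng.normal(size=tgt.shape), tgt, w))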
International Conference on Computer Vision | 2015
Qingju Liu; Teofilo de Campos; Wenwu Wang; Philip J. B. Jackson; Adrian Hilton
In this paper, a novel probabilistic Bayesian tracking scheme is proposed and applied to bimodal measurements consisting of tracking results from a depth sensor and audio recordings collected using binaural microphones. We use random finite sets to cope with a varying number of tracking targets. A measurement-driven birth process is integrated to quickly localize any emerging person. A new bimodal fusion method that prioritizes the most confident modality is employed. The approach was tested on real room recordings, and experimental results show that the proposed combination of audio and depth outperforms the individual modalities, particularly when multiple people talk simultaneously and occlusions are frequent.
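A toy sketch of the prioritised fusion step only (the random-finite-set tracking machinery is omitted, and the names and confidence rule are assumptions): each target's position estimate is taken from the more confident modality, with the other modality used as a fallback when the preferred one has no measurement.

    import numpy as np

    def fuse_modalities(depth_pos, depth_conf, audio_pos, audio_conf):
        """Fuse per-target depth and audio position estimates, preferring the
        more confident modality and falling back to the other when it is missing.

        depth_pos, audio_pos   : (N, 3) position estimates per target (NaN if missing)
        depth_conf, audio_conf : (N,)   confidence scores in [0, 1]
        """
        prefer_depth = (depth_conf >= audio_conf)[:, None]
        fused = np.where(prefer_depth, depth_pos, audio_pos)
        # Fall back to the other modality where the preferred one is missing.
        missing = np.isnan(fused).any(axis=1)
        fallback = np.where(prefer_depth, audio_pos, depth_pos)
        fused[missing] = fallback[missing]
        return fused

    d = np.array([[0.1, 1.6, 2.0], [np.nan, np.nan, np.nan]])
    a = np.array([[0.2, 1.5, 2.1], [1.0, 1.7, 3.0]])
    print(fuse_modalities(d, np.array([0.9, 0.8]), a, np.array([0.6, 0.5])))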
IEEE Signal Processing Workshop on Statistical Signal Processing | 2012
Qingju Liu; Wenwu Wang; Philip J. B. Jackson; Mark Barnard
Probabilistic models of binaural cues, such as the interaural phase difference (IPD) and the interaural level difference (ILD), can be used to obtain an audio mask in the time-frequency (TF) domain for source separation of binaural mixtures. These models are, however, often degraded by acoustic noise. In contrast, the video stream contains relevant information about the synchronous audio stream that is not affected by acoustic noise. In this paper, we present a novel method for modeling the audio-visual (AV) coherence based on dictionary learning. A visual mask is constructed from the video signal based on the learnt AV dictionary and combined with the audio mask to obtain a noise-robust audio-visual mask, which is then applied to the binaural signal for source separation. We tested our algorithm on the XM2VTS database and observed considerable performance improvement for noise-corrupted signals.
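A small illustrative sketch of combining the audio and visual masks (the SNR-dependent weight is an assumption, not the paper's rule): the noisier the audio mixture, the more the combined mask leans on the visually derived term.

    import numpy as np

    def combine_masks(audio_mask, visual_mask, snr_db):
        """Blend audio and visual time-frequency masks into a noise-robust mask.

        audio_mask, visual_mask : (F, T) masks in [0, 1]
        snr_db : estimated SNR of the audio mixture; lower SNR shifts trust
                 towards the visual mask.
        """
        w = 1.0 / (1.0 + np.exp(-snr_db / 5.0))   # audio weight in (0, 1)
        return w * audio_mask + (1.0 - w) * visual_mask

    rng = np.random.default_rng(7)
    m = combine_masks(rng.random((257, 60)), rng.random((257, 60)), snr_db=-5.0)
    print(m.shape, float(m.min()) >= 0.0, float(m.max()) <= 1.0)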