Zhijian Ou
Tsinghua University
Publication
Featured research published by Zhijian Ou.
International Conference on Acoustics, Speech, and Signal Processing | 2010
Nan Ding; Zhijian Ou
The Hidden Markov Model (HMM) has been widely used in many applications such as speech recognition. A common challenge in applying the classical HMM is determining the structure of the hidden state space. Based on the Dirichlet process, a nonparametric Bayesian hidden Markov model is proposed, which allows an infinite number of hidden states and uses an infinite number of Gaussian components to support continuous observations. An efficient variational inference method is also proposed and applied to the model. Our experiments demonstrate that variational Bayesian inference on the new model can discover the hidden structure of the HMM for both synthetic data and real-world applications.
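The infinite hidden-state prior behind this model can be illustrated with the Dirichlet process's stick-breaking construction. The following is a minimal, hypothetical sketch (not the paper's variational inference method): it draws a truncated set of state weights, where the concentration parameter alpha controls how many states receive appreciable probability mass.

```python
import random

def stick_breaking(alpha, num_sticks, seed=0):
    """Truncated stick-breaking construction of Dirichlet-process weights.

    Weight k is v_k * prod_{j<k}(1 - v_j) with v_k ~ Beta(1, alpha);
    larger alpha spreads probability mass over more hidden states.
    """
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(num_sticks):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights

# 20-state truncation of the (conceptually infinite) state-weight prior
weights = stick_breaking(alpha=2.0, num_sticks=20)
```

In an infinite HMM, each row of the transition matrix would be drawn from such a prior, so the number of effectively used states is learned from the data rather than fixed in advance.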
International Conference on Acoustics, Speech, and Signal Processing | 2005
Xianyu Zhao; Zhijian Ou; Minhua Chen; Zuoying Wang
In this paper, a new microphone array speech recognition system in which the array processor and the speech recognizer are closely coupled is studied. The system includes a generalized sidelobe canceller (GSC) beamformer followed by a recognizer with vector Taylor series (VTS) compensation. The GSC beamformer provides two outputs, allowing more information to be used in the recognizer: one is the enhanced target speech output; the other is the reference noise output. VTS is used to compensate for the effect of residual noise in the GSC speech output, utilizing the GSC reference noise output. The compensation is done in the minimum mean square error (MMSE) sense. Moreover, an iterative procedure using the expectation-maximization (EM) algorithm is developed to refine the compensation parameters. Experimental results on the MONC database showed that the new system significantly improved speech recognition performance in overlapping speech situations.
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Xianyu Zhao; Zhijian Ou
In conventional microphone array speech recognition, the array processor and the speech recognizer are loosely coupled. The only connection between the two modules is the enhanced target signal output from the array processor, which is then treated as a single input to the recognizer. In this approach, useful environmental information, which the array processor can provide and the recognizer needs to exploit, is ignored. Inherently, the array processor can generate multiple outputs of spatially filtered signals, acting as a multi-input-multi-output (MIMO) module. In this paper, a closely coupled approach is proposed, in which a recognizer with model-based noise compensation exploits the reference noise outputs from a MIMO array processor. Specifically, a multichannel model-based noise compensation is presented, including the compensation procedure using the vector Taylor series (VTS) expansion and parameter estimation using the expectation-maximization (EM) algorithm. It is also shown how to construct MIMO array processors from conventional beamformers. A number of practical implementations of the conventional loosely coupled approach and the proposed closely coupled approach were tested on a publicly available database, the Multichannel Overlapping Number Corpus (MONC). Experimental results showed that the proposed closely coupled approach significantly improved speech recognition performance in overlapping speech situations.
IEEE Signal Processing Letters | 2006
Hui Lin; Zhijian Ou; Xi Xiao
In this letter, a new audio fingerprinting approach is presented. We improve robustness through more precise statistical fingerprint modeling with common component Gaussian mixture models (CCGMMs) and the Kullback-Leibler (KL) distance, which is better suited to measuring the dissimilarity between two probabilistic models. To address the resulting complexity, generalized time-series active search is proposed, which supports a wide variety of distance measures between two CCGMMs, including L1, L2, and KL. Experiments show that the new approach with KL distance increases robustness to distortions (including low-quality MP3 compression, small room echo, and play-and-record) while achieving efficient search.
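Part of the complexity the letter addresses is that the KL distance between two Gaussian mixtures has no closed form. As a minimal illustration (a plain Monte Carlo estimate on hypothetical one-dimensional mixtures, not the letter's generalized active search), KL(p || q) can be approximated by sampling from p and averaging the log-density ratio:

```python
import math
import random

def gmm_logpdf(x, weights, means, variances):
    """Log-density of a one-dimensional Gaussian mixture at x."""
    density = sum(
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    )
    return math.log(density)

def gmm_kl_mc(p, q, n=20000, seed=0):
    """Monte Carlo estimate of KL(p || q): sample from p, average log(p/q)."""
    rng = random.Random(seed)
    weights, means, variances = p
    total = 0.0
    for _ in range(n):
        # draw a mixture component, then a sample from that Gaussian
        k = rng.choices(range(len(weights)), weights=weights)[0]
        x = rng.gauss(means[k], math.sqrt(variances[k]))
        total += gmm_logpdf(x, *p) - gmm_logpdf(x, *q)
    return total / n

# two hypothetical 2-component mixtures: (weights, means, variances)
p = ([0.5, 0.5], [-1.0, 1.0], [0.5, 0.5])
q = ([0.5, 0.5], [-1.2, 1.2], [0.5, 0.5])
kl = gmm_kl_mc(p, q)  # small but positive: the mixtures nearly coincide
```

Unlike the L1 and L2 distances, the KL distance is asymmetric, which is why it is described as a dissimilarity between probabilistic models rather than a metric.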
International Symposium on Chinese Spoken Language Processing | 2010
Yimin Tan; Zhijian Ou
Latent Dirichlet allocation (LDA) has been widely used for analyzing large text corpora. In this paper we propose the topic-weak-correlated LDA (TWC-LDA) for topic modeling, which constrains different topics to be weakly correlated. This is achieved technically by placing a special prior over the topic-word distributions. Reducing the overlap between the topic-word distributions makes the learned topics more interpretable, in the sense that each topic-word distribution can be clearly associated with a distinctive semantic meaning. Experimental results on both synthetic and real-world corpora show the superiority of TWC-LDA over basic LDA for semantically meaningful topic discovery and document classification.
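The overlap between topic-word distributions can be made concrete with a small sketch (hypothetical vocabulary size and concentration values; this uses standard symmetric-Dirichlet sampling, not the paper's special prior). The inner product between two topic vectors is one simple measure of the kind of correlation TWC-LDA seeks to reduce:

```python
import random

def sample_topic(vocab_size, concentration, rng):
    """Draw one topic-word distribution from a symmetric Dirichlet prior
    (normalized Gamma draws)."""
    gammas = [rng.gammavariate(concentration, 1.0) for _ in range(vocab_size)]
    total = sum(gammas)
    return [g / total for g in gammas]

def topic_overlap(t1, t2):
    """Inner product of two topic-word distributions: a simple overlap score."""
    return sum(a * b for a, b in zip(t1, t2))

rng = random.Random(0)
V = 50  # hypothetical vocabulary size
peaked = [sample_topic(V, 0.1, rng) for _ in range(2)]    # sparse, peaked topics
diffuse = [sample_topic(V, 10.0, rng) for _ in range(2)]  # near-uniform topics
# near-uniform topics all overlap at roughly 1/V and are hard to tell apart;
# a prior that penalizes such overlap yields more distinctive topics
```

Each learned topic is then interpretable through its few high-probability words, which is exactly what low mutual overlap buys.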
International Conference on Audio, Language and Image Processing | 2012
Xin He; Zhijian Ou; Jiasong Sun
The state-of-the-art language models (LMs) are n-gram models, which, for Chinese, are word-based n-grams. To construct Chinese word-based n-gram LMs, we need a lexicon and a Chinese word segmentation (CWS) step. However, there is no standard definition of a word in Chinese, and it is always possible to construct new words by combining multiple characters, which causes out-of-vocabulary (OOV) problems. These issues make lexicon definition and CWS difficult and ill-defined, which deteriorates the quality of Chinese LMs. Recently, conditional random fields (CRFs) have been shown to perform robust and accurate CWS, especially in recalling OOV words. However, they are in essence not Chinese language models, but conditional models of the position-of-character (POC) tag sequence given the character sequence. In this paper, we propose a new Chinese language model, the joint n-gram, which incorporates the POC tags so that no lexicon is needed. It is a truly generative model of Chinese sentences. The effectiveness of the new LM is shown in terms of perplexities and CWS performance.
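The idea of modeling characters jointly with POC tags can be sketched as follows (a toy pre-segmented corpus and plain bigram counting, purely illustrative; the paper's joint n-gram model and its estimation are not reproduced here). Each token becomes a (character, tag) pair, so word boundaries are encoded in the tags rather than in a lexicon:

```python
from collections import Counter

def poc_tags(word):
    """Position-of-character tags: B(egin), M(iddle), E(nd), S(ingle)."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]

# a tiny hypothetical pre-segmented corpus: sentences as lists of words
corpus = [["我们", "喜欢", "音乐"], ["音乐", "很", "好"]]

# count bigrams over joint (character, POC-tag) tokens; a joint n-gram LM
# would be estimated from such counts without any lexicon lookup
bigrams = Counter()
for sentence in corpus:
    tokens = [(ch, tag)
              for word in sentence
              for ch, tag in zip(word, poc_tags(word))]
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[(prev, cur)] += 1
```

Because the tag sequence determines a segmentation, a generative model over these joint tokens scores Chinese sentences and segments them at the same time, which is why both perplexity and CWS performance are meaningful evaluations.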
International Conference on Acoustics, Speech, and Signal Processing | 2008
Cong Li; Zhijian Ou; Wei Hu; Tao Wang; Yimin Zhang
This paper presents a novel audio-visual fusion method for speech detection, an important front-end for content-based video processing. The approach aims to extract homogeneous speech segments from the accompanying audio stream in real-world movie/TV videos with the help of video captions. Note that captions are mainly created to help viewers follow the dialog, rather than to accurately locate the speech regions. We propose a caption-aided speech detection approach, which makes use of both caption information and audio information. The inaccurate positions of the captions are refined using audio features (pitch and MFCCs) and BIC-based acoustic change detection. Comparison experiments against several other traditional speech detection approaches show that the proposed approach greatly improves speech detection performance.
IEEE Signal Processing Letters | 2007
Hui Lin; Zhijian Ou
This letter investigates the problem of incorporating auxiliary information, e.g., pitch, zero crossing rate (ZCR), and rate-of-speech (ROS), for speech recognition using dynamic Bayesian networks. We propose switching auxiliary chains for exploiting different auxiliary information tailored to different phonetic states. The switching function can be specified by a priori knowledge or, more flexibly, be learned from data with information-theoretic dependency selection. Experiments on the OGI Numbers database show that the new model achieves a 7% relative word-error-rate reduction by jointly exploiting pitch, ZCR, and ROS, while keeping almost the same parameter size as the standard HMM.
International Conference on Acoustics, Speech, and Signal Processing | 2011
Yun Wang; Zhijian Ou
Modern monaural voice and accompaniment separation systems usually consist of two main modules: melody extraction and time-frequency masking. A main distinction between different separation systems lies in which approaches are used for the two modules. Popular techniques for melody extraction include hidden Markov models (HMMs) and non-negative matrix factorization (NMF), while masking may be either hard or soft. This paper investigates a flaw of NMF-based melody extraction and proposes the combination of HMM-based melody extraction (equipped with a newly defined feature) and NMF-based soft masking. Evaluations on two publicly available databases show that the proposed system reaches state-of-the-art performance and outperforms several other combinations.
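The soft-masking module can be illustrated with a minimal sketch (toy hand-picked power values, not the paper's NMF-derived estimates): given per-bin power estimates for voice and accompaniment, a Wiener-style soft mask assigns each time-frequency bin a fraction in [0, 1] rather than the all-or-nothing decision of a hard mask.

```python
def soft_mask(voice_power, accomp_power):
    """Wiener-style soft time-frequency mask: the fraction of each bin's
    energy attributed to the voice, always in [0, 1]."""
    return [[v / (v + a) if (v + a) > 0.0 else 0.0
             for v, a in zip(v_row, a_row)]
            for v_row, a_row in zip(voice_power, accomp_power)]

# toy 2x3 power spectrograms (hypothetical values: rows = frequency bins,
# columns = time frames)
voice = [[4.0, 1.0, 0.0], [9.0, 0.0, 1.0]]
accomp = [[1.0, 4.0, 2.0], [1.0, 3.0, 1.0]]
mask = soft_mask(voice, accomp)
# the separated voice is recovered by multiplying the mixture spectrogram
# by the mask, bin by bin; a hard mask would round each value to 0 or 1
```

Soft masking degrades more gracefully than hard masking when the power estimates are imperfect, since estimation errors change the mask continuously instead of flipping binary decisions.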
IEEE Signal Processing Workshop on Statistical Signal Processing | 2016
Martin J. Zhang; Zhijian Ou
Most studies of change-point detection (CPD) focus on developing similarity metrics that quantify how likely a time point is to be a change point. In contrast, the subsequent process of selecting true change points among high-scoring candidates is less well studied. This paper proposes a new CPD method that uses determinantal point processes (DPPs) to model the process of change-point selection. Specifically, this work explores the particular kernel structure that arises in such modelling, the almost-block-diagonal structure. It shows that the maximum a posteriori (MAP) task, which requires at least O(N^2.4) time in general, can be accomplished in O(N) time under such structure. The resulting algorithms, BwDPP-MAP and BwDppCpd, are empirically validated through simulation and five real-world data experiments.
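Why a DPP suits change-point selection can be shown with a minimal sketch (a hypothetical 3x3 L-ensemble kernel, not the paper's BwDPP algorithms): the unnormalized probability of a subset is the determinant of the kernel restricted to that subset, so redundant candidates suppress each other, and a block-diagonal kernel lets the determinant factor over blocks, which is the structural fact the O(N)-time MAP algorithm exploits.

```python
def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, d = len(m), 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(m[r][i]))
        if abs(m[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            m[i], m[pivot] = m[pivot], m[i]
            d = -d
        d *= m[i][i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n):
                m[r][c] -= f * m[i][c]
    return d

def dpp_weight(L, subset):
    """Unnormalized DPP probability of a subset: determinant of the kernel
    restricted to those indices. Correlated items shrink the determinant."""
    return det([[L[i][j] for j in subset] for i in subset])

# candidates 0 and 1 are highly similar (redundant); candidate 2 is independent
L = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
# the diverse pair {0, 2} is far more probable than the redundant pair {0, 1},
# and the zero off-diagonal block lets the full determinant factor over blocks
```

In CPD, nearby high-scoring time points give strongly correlated kernel entries, so the DPP naturally keeps one representative per true change rather than a cluster of near-duplicates.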