
Publication


Featured research published by Zuoying Wang.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2004

Voice activity detection using visual information

Peng Liu; Zuoying Wang

In traditional voice activity detection (VAD) approaches, features of the audio stream, such as frame-energy features, are used for the voice decision. In this paper, we present a general framework for a visual-information-based VAD approach in a multi-modal system. First, Gaussian mixture visual models of voice and non-voice are designed, and the decision rule is discussed in detail. Subsequently, the visual feature extraction method for VAD is investigated; the best visual feature structure and the best mixture number are selected experimentally. Our experiments show that, compared to the frame-energy-based approach in the clean-audio case, visual-information-based VAD achieves a substantial reduction in frame error rate (31.1% relative), and the audio-visual stream can be segmented into sentences for recognition much more precisely (98.4% relative reduction in sentence break error rate). Furthermore, the performance of visual VAD is independent of background noise.
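The decision rule described above can be sketched as a likelihood-ratio test between two Gaussian mixture models. The models and the one-dimensional "feature" below are hypothetical stand-ins, not the paper's actual visual features or trained parameters:

```python
import math

def gauss_logpdf(x, mean, var):
    # log density of a diagonal-covariance Gaussian
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_loglik(x, weights, means, variances):
    # log-sum-exp over mixture components
    logs = [math.log(w) + gauss_logpdf(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

def is_voice(x, voice_gmm, nonvoice_gmm, threshold=0.0):
    # likelihood-ratio test: voice if log p(x|voice) - log p(x|non-voice) > threshold
    return gmm_loglik(x, *voice_gmm) - gmm_loglik(x, *nonvoice_gmm) > threshold

# hypothetical 1-D visual feature with a two-component voice model
voice = ([0.5, 0.5], [[2.0], [3.0]], [[0.5], [0.5]])
nonvoice = ([1.0], [[0.0]], [[0.5]])
```

In the paper the mixture count and feature structure are chosen experimentally; here they are fixed for illustration only.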


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2003

Fuzzy clustering and Bayesian information criterion based threshold estimation for robust voice activity detection

Ye Tian; Ji Wu; Zuoying Wang; Dajin Lu

Previous voice activity detection (VAD) approaches that use thresholds cannot achieve consistent accuracy, since the mean-value-based and histogram-based threshold estimation algorithms are not robust: they depend strongly on the percentage of voice and background noise in the estimation interval. In this paper, fuzzy clustering and the Bayesian information criterion are proposed to estimate the thresholds for VAD. Compared to previous algorithms, the new algorithm is more robust and free of heuristic rules. It is insensitive to the estimation interval and can maintain fast tracking of environment changes when combined with online updating. Experiments show that it works well with energy features in both stationary and non-stationary environments.
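The clustering step can be illustrated with a minimal two-cluster fuzzy c-means on scalar energy features, placing the threshold between the two cluster centres. This is a sketch of the general idea only; the paper's BIC-based validation and online update are omitted:

```python
def fuzzy_cmeans_1d(xs, m=2.0, iters=50):
    # two-cluster fuzzy c-means on scalar energy features
    centers = [min(xs), max(xs)]
    for _ in range(iters):
        num = [0.0, 0.0]
        den = [0.0, 0.0]
        for x in xs:
            d = [abs(x - c) + 1e-12 for c in centers]
            # fuzzy membership of x in each cluster
            u = [1.0 / sum((d[i] / d[j]) ** (2 / (m - 1)) for j in range(2))
                 for i in range(2)]
            for i in range(2):
                w = u[i] ** m
                num[i] += w * x
                den[i] += w
        centers = [num[i] / den[i] for i in range(2)]
    return sorted(centers)

def vad_threshold(energies):
    # place the decision threshold between the noise and voice cluster centres
    c_noise, c_voice = fuzzy_cmeans_1d(energies)
    return 0.5 * (c_noise + c_voice)

energies = [0.1, 0.2, 0.15, 0.12, 5.0, 5.2, 4.8, 5.1]
t = vad_threshold(energies)
```

Because the centres come from clustering rather than a global mean or histogram, the threshold is largely insensitive to the voice/noise ratio in the interval, which is the robustness property the abstract claims.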


IEEE Signal Processing Letters | 2002

Nonspeech segment rejection based on prosodic information for robust speech recognition

Ye Tian; Zuoying Wang; Dajin Lu

A new scheme for nonspeech rejection is proposed by considering that most nonspeech segments do not have well-defined prosodic structures as speech segments do. Certain parameters characterizing the smoothness of the peak index series and of the peak amplitude series of the normalized autocorrelation function are used to make nonspeech segment rejection decisions. The receiver-operating-characteristics curve and recognition word-error-rate reduction measures show that our approach is more effective than garbage-model-based schemes when used in telephone speech recognition.
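The prosodic-smoothness idea can be sketched as follows: compute the normalized autocorrelation, find its peaks, and measure the regularity of the peak index series. The smoothness measure below (standard deviation of successive peak-index differences) is a hypothetical simplification of the paper's parameters:

```python
import math

def norm_autocorr(x, max_lag):
    # normalized autocorrelation r(k)/r(0) for lags 1..max_lag
    r0 = sum(v * v for v in x)
    return [sum(x[n] * x[n + k] for n in range(len(x) - k)) / r0
            for k in range(1, max_lag + 1)]

def peak_indices(r):
    # indices of local maxima of the autocorrelation sequence
    return [k for k in range(1, len(r) - 1) if r[k - 1] < r[k] >= r[k + 1]]

def smoothness(series):
    # std of successive differences; small values indicate regular (speech-like) structure
    d = [b - a for a, b in zip(series, series[1:])]
    if not d:
        return float("inf")
    mu = sum(d) / len(d)
    return math.sqrt(sum((v - mu) ** 2 for v in d) / len(d))

# a periodic (voiced-speech-like) signal has evenly spaced autocorrelation peaks
sine = [math.sin(2 * math.pi * n / 20) for n in range(200)]
peaks = peak_indices(norm_autocorr(sine, 100))
```

A nonspeech segment (e.g. a noise burst) would yield irregularly spaced peaks and hence a large smoothness value, which is the rejection cue.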


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2005

Closely coupled array processing and model-based compensation for microphone array speech recognition

Xianyu Zhao; Zhijian Ou; Minhua Chen; Zuoying Wang

In this paper, a new microphone array speech recognition system in which the array processor and the speech recognizer are closely coupled is studied. The system consists of a generalized sidelobe canceller (GSC) beamformer followed by a recognizer with vector Taylor series (VTS) compensation. The GSC beamformer provides two outputs, allowing more information to be used in the recognizer: one is the enhanced target speech output; the other is the reference noise output. VTS is used to compensate for the effect of the residual noise in the GSC speech output, utilizing the GSC reference noise output. The compensation is done in a minimum mean square error (MMSE) sense. Moreover, an iterative procedure using an expectation-maximization (EM) algorithm is developed to refine the compensation parameters. Experimental results on the MONC database show that the new system significantly improves speech recognition performance in overlapping-speech situations.
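The two-output GSC structure can be illustrated in its simplest two-microphone form, assuming the channels are already time-aligned on the target. The adaptive noise canceller and the VTS compensation stage are omitted from this sketch:

```python
def gsc_two_mic(x1, x2):
    # fixed beamformer: average of the two time-aligned channels (enhanced speech)
    speech = [(a + b) / 2 for a, b in zip(x1, x2)]
    # blocking matrix: channel difference cancels the target, leaving a noise reference
    noise_ref = [(a - b) / 2 for a, b in zip(x1, x2)]
    return speech, noise_ref
```

The point of the coupling described above is that both outputs are passed on: the recognizer compensates its models using the noise reference instead of discarding it.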


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2006

Subspace Tracking in Colored Noise Based on Oblique Projection

Minhua Chen; Zuoying Wang

The projection approximation subspace tracking (PAST) algorithm gives biased subspace estimates when the received signal is corrupted by colored noise. In this paper, an unbiased version of PAST is proposed for the colored noise scenario. First, a maximum likelihood (ML) and minimum variance unbiased (MVUB) estimator for the clean signal is derived using simultaneous diagonalization and oblique projection. Then, we provide a recursive algorithm, named oblique PAST (obPAST), to track the signal subspace and update the estimator in colored noise. Experimental results show the effectiveness of the obPAST algorithm.
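For context, the baseline PAST recursion that the paper modifies can be written down compactly in the rank-1 case (tracking a single dominant direction). This is the standard PAST update, not the paper's obPAST variant, and the signal model below is a made-up example:

```python
import math
import random

def past_rank1(samples, beta=0.98):
    # rank-1 PAST recursion: track the dominant signal direction w
    dim = len(samples[0])
    w = [1.0] + [0.0] * (dim - 1)
    d = 1.0
    for x in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))  # project x onto current estimate
        d = beta * d + y * y                      # exponentially weighted projected power
        g = y / d
        # w <- w + (x - w*y) * g : correct w toward the residual direction
        w = [wi + (xi - wi * y) * g for wi, xi in zip(w, x)]
    return w

random.seed(0)
v = [1 / 3, 2 / 3, 2 / 3]  # true (unit-norm) signal direction, white observation noise
samples = [[3 * s * vi + random.gauss(0, 0.1) for vi in v]
           for s in (random.uniform(-1, 1) for _ in range(2000))]
w = past_rank1(samples)
norm = math.sqrt(sum(wi * wi for wi in w))
cosine = abs(sum(wi * vi for wi, vi in zip(w, v))) / norm
```

Under white noise this recursion converges toward the principal direction; the paper's point is that under colored noise the same recursion is biased, motivating the oblique-projection correction.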


International Conference on Multimodal Interfaces (ICMI) | 2002

Robust noisy speech recognition with adaptive frequency bank selection

Ye Tian; Ji Wu; Zuoying Wang; Dajin Lu

With the development of automatic speech recognition technology, the robustness of speech recognition systems is becoming more and more important. This paper addresses the problem of speech recognition in an additive background noise environment. Since the frequency energy of different types of noise is concentrated in different frequency banks, the effect of additive noise on each frequency bank differs. The seriously obscured frequency banks have little speech information left and are harmful for subsequent speech processing. Wu and Lin (2000) applied the frequency bank selection theory to robust word boundary detection in a noisy environment and obtained good detection results. In this paper, this theory is extended to noisy speech recognition. Unlike standard MFCC, which uses all frequency banks for the cepstral coefficients, we use only the frequency banks that are slightly corrupted and discard the seriously obscured ones; cepstral coefficients are calculated only on the selected frequency banks. Moreover, the acoustic model is also adapted to match the modification of the acoustic feature. Experiments on continuous digit speech recognition show that the proposed algorithm outperforms spectral subtraction and cepstral mean normalization at low SNRs.
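The bank-selection step can be sketched as follows: retain only filter banks whose log energy exceeds a noise estimate by some margin, then take the DCT over the retained banks. The SNR floor and bank layout here are illustrative assumptions, not the paper's settings:

```python
import math

def select_banks(log_energies, noise_log_energies, snr_floor=3.0):
    # keep banks whose (log) energy exceeds the noise estimate by snr_floor
    return [i for i, (s, n) in enumerate(zip(log_energies, noise_log_energies))
            if s - n > snr_floor]

def cepstra_on_banks(log_energies, banks, n_ceps=4):
    # DCT-II over the retained banks only (instead of all banks, as in standard MFCC)
    vals = [log_energies[i] for i in banks]
    m_total = len(vals)
    return [sum(v * math.cos(math.pi * k * (m + 0.5) / m_total)
                for m, v in enumerate(vals))
            for k in range(n_ceps)]
```

The matching acoustic-model adaptation mentioned in the abstract is needed because cepstra over a variable bank subset live in a different feature space from the training features.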


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 1999

Speaker adaptation using maximum likelihood model interpolation

Zuoying Wang; Feng Liu

A speaker adaptation scheme named maximum likelihood model interpolation (MLMI) is proposed. The basic idea of MLMI is to compute the speaker adapted (SA) model of a test speaker by a linear convex combination of a set of speaker dependent (SD) models. Given a set of training speakers, we first calculate the corresponding SD models for each training speaker as well as the speaker-independent (SI) models. Then, the mean vector of the SA model is computed as the weighted sum of the set of the SD mean vectors, while the covariance matrix is the same as that of the SI model. An algorithm to estimate the weight parameters is given which maximizes the likelihood of the SA model given the adaptation data. Experiments show that 3 adaptation sentences can give a significant performance improvement. As the number of SD models increases, further improvement can be obtained.
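The core of MLMI, interpolating the adapted mean as a convex combination of speaker-dependent means, is easy to sketch. For the weight estimation, the simple grid search over two SD models below stands in for the paper's ML estimation algorithm, and the shared unit variance is an assumption for illustration:

```python
import math

def interp_mean(sd_means, weights):
    # speaker-adapted mean = convex combination of speaker-dependent means
    return [sum(w * m[d] for w, m in zip(weights, sd_means))
            for d in range(len(sd_means[0]))]

def loglik(data, mean, var):
    # Gaussian log likelihood of the adaptation data (covariance kept fixed, as in MLMI)
    return sum(sum(-0.5 * (math.log(2 * math.pi * var) + (x[d] - mean[d]) ** 2 / var)
                   for d in range(len(mean)))
               for x in data)

def best_weight_2(sd_means, data, var=1.0, steps=100):
    # ML weight for two SD models: grid search over the 1-simplex [w, 1-w]
    best = max((loglik(data, interp_mean(sd_means, [w, 1 - w]), var), w)
               for w in (i / steps for i in range(steps + 1)))
    return best[1]

# toy adaptation: data near [1, 1] between SD means at [0, 0] and [4, 4]
w = best_weight_2([[0.0, 0.0], [4.0, 4.0]], [[1.0, 1.0]])
```

With more SD models, the weights live on a higher-dimensional simplex and the paper's dedicated estimation algorithm replaces this brute-force search.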


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2005

Robust speech recognition based on spectral adjusting and warping

Rui Zhao; Zuoying Wang

In this paper, we first propose a new channel adaptation method named spectral adjusting (SA) which adjusts the amplitude spectrum of the channel distorted speech with an adjusting function to reduce the channel distortion. Then, we combine vocal tract length normalization (VTLN), which warps the frequency scale of the speech spectrum to do speaker normalization, with SA to adjust and warp the speech spectrum. So the channel and speaker variations can be compensated for together. We call the combined method spectral adjusting and warping (SAW). In the SA method, the adjusting function is approximated by a piece-wise linear function, and the parameters of the piece-wise linear function are estimated by a gradient projection algorithm with short adaptation utterances based on the ML rule. The evaluating experiments were carried out on telephone speech recognition in a duration distribution based HMM (DDBHMM) system. Experimental results showed that SA yielded a relative error rate reduction of 10.44% over the baseline, and SAW led to a greater reduction of 14.6%.
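The two spectral operations described above can be sketched separately: a piece-wise linear adjusting function applied to the spectrum, and a linear frequency warping that resamples the spectrum along the frequency axis. The knot placement and warping convention below are illustrative assumptions, not the paper's estimated parameters:

```python
def pw_linear(x, knots_x, knots_y):
    # evaluate a piece-wise linear adjusting function defined by knot points
    pts = list(zip(knots_x, knots_y))
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return knots_y[-1]

def warp_spectrum(spec, alpha):
    # VTLN-style linear warping: resample the amplitude spectrum at f/alpha
    n = len(spec)
    out = []
    for i in range(n):
        f = i / alpha
        lo = int(f)
        if lo >= n - 1:
            out.append(spec[-1])
        else:
            frac = f - lo
            out.append((1 - frac) * spec[lo] + frac * spec[lo + 1])
    return out
```

In SAW, the adjusting-function knots and the warping factor would be estimated jointly from short adaptation utterances under the ML rule; here both are left as free parameters.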


IEEE International Conference on Systems, Man and Cybernetics | 2003

A Chinese spoken dialogue system for train information

Junyan Chen; Ji Wu; Zuoying Wang

In this paper, a Chinese spoken dialogue system developed for train information retrieval is presented. After a brief description of the system architecture and the individual modules, a dialogue manager that integrates user plan inference with a topic tree model is proposed. Dialogue strategies based on this mechanism, including consistent information sharing across multiple topics, reliable user response expectation, and proper system prompt design, are presented and explained in detail. Experiments show that the sentence meaning understanding error rate decreased by 23.5% with the guidance of user plan inference. A preliminary subjective evaluation shows that users are interested in and willing to talk with the system, although there is still much to be improved.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2002

A new combined model of statics-dynamics of speech

Zhijian Ou; Zuoying Wang

The linear prediction (LP) HMM does not make the independent and identical distribution (IID) assumption of the traditional HMM; however, it often produces unsatisfactory results. In this paper, a new combined model of the statics and dynamics of speech is proposed, based on a new analysis of both HMMs' modeling strengths and weaknesses. The new model uses the LP-HMM as the dynamic part and the traditional IID-based HMM as the static part; in addition, easy implementation and low cost are preserved. A new, effective re-estimation solution with parameter tying is suggested to achieve better discrimination. Our experiments on speaker-independent continuous speech recognition demonstrate that the combined model achieves a 7.5% error rate reduction relative to the traditional HMM.
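The static/dynamic combination can be illustrated on a single Gaussian state: the static part scores frames independently (IID), the dynamic part scores first-order linear-prediction residuals, and the two log scores are interpolated. The scalar features and the interpolation weight are illustrative assumptions:

```python
import math

def gauss_ll(x, mean, var):
    # scalar Gaussian log likelihood
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def static_ll(frames, mean, var):
    # static (IID) part: each frame scored independently
    return sum(gauss_ll(x, mean, var) for x in frames)

def dynamic_ll(frames, a, var):
    # dynamic (LP) part: score the residual of the prediction x[t] ~ a * x[t-1]
    return sum(gauss_ll(frames[t] - a * frames[t - 1], 0.0, var)
               for t in range(1, len(frames)))

def combined_ll(frames, mean, var_s, a, var_d, lam=0.5):
    # log-linear interpolation of the static and dynamic model scores
    return lam * static_ll(frames, mean, var_s) + (1 - lam) * dynamic_ll(frames, a, var_d)
```

The paper's actual model couples the two parts inside the HMM with tied, re-estimated parameters; this interpolation only conveys why the static score recovers information the prediction residual discards.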

Collaboration


Dive into Zuoying Wang's collaboration.

Top Co-Authors

Ji Wu

Tsinghua University