Veronique Stouten
Katholieke Universiteit Leuven
Publications
Featured research published by Veronique Stouten.
IEEE Signal Processing Letters | 2008
Veronique Stouten; Kris Demuynck; H. Van Hamme
We present a technique to automatically discover the (word-sized) phone patterns that are present in speech utterances. These patterns are learnt from a set of phone lattices generated from the utterances. Just like children acquiring language, our system has no prior information on what the meaningful patterns are. By applying the non-negative matrix factorization algorithm to a fixed-length, high-dimensional vector representation of the speech utterances, a decomposition in terms of additive units is obtained. We illustrate that these units correspond to words in the case of a small-vocabulary task. Our result also raises the question of whether explicit segmentation and clustering are needed in an unsupervised learning context.
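The core mechanism described above can be sketched in a few lines. This is a generic NMF with multiplicative updates on a toy random matrix, not the paper's actual system; the dimensions, rank, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(V, rank, n_iter=200, eps=1e-9):
    """Factorise V ~ W @ H with all factors non-negative
    (multiplicative updates for the Frobenius cost)."""
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update the additive units
    return W, H

# toy stand-in for the fixed-length non-negative utterance vectors (one per column)
V = rng.random((50, 30))
W, H = nmf(V, rank=5)
print(np.linalg.norm(V - W @ H))  # reconstruction error
```

Because both factors stay non-negative, each utterance column is reconstructed as a purely additive combination of the columns of `W`, which is what lets the learned units be read as word-like patterns.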
Speech Communication | 2006
Veronique Stouten; Hugo Van hamme; Patrick Wambacq
In this paper, several techniques are proposed to incorporate the uncertainty of the clean speech estimate into the decoding process of the backend recogniser, in the context of model-based feature enhancement (MBFE) for noise-robust speech recognition. Usually, the Gaussians in the acoustic space are evaluated at a single point estimate, which means that the backend recogniser treats its input as a noise-free utterance. However, in this way the variance of the estimator is neglected. To solve this problem, it has already been argued that the acoustic space should be evaluated with a probability density function, e.g. a Gaussian observation pdf. We illustrate that this Gaussian observation pdf can be replaced by a computationally more tractable discrete pdf, consisting of a weighted sum of delta functions. We also show how improved posterior state probabilities can be obtained by calculating their maximum likelihood estimates or by using the pdf of clean speech conditioned on both the noisy speech and the backend Gaussian. Another simple and efficient technique is to replace these posterior probabilities by M Kronecker deltas, which results in M front-end feature vector candidates, and to take the maximum over their backend scores. Experimental results on the Aurora2 and Aurora4 databases compare the proposed techniques; a significant decrease in the word error rate of the resulting speech recognition system is obtained.
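The key contrast between point-estimate decoding and uncertainty decoding with a Gaussian observation pdf can be illustrated numerically. For diagonal Gaussians, integrating the observation pdf against a backend Gaussian has a closed form: the backend variance is inflated by the enhancement uncertainty. All variable names and numbers below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def log_gauss(x, mu, var):
    # log of a diagonal-covariance Gaussian density at x
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def point_score(mu_e, mu_g, var_g):
    # conventional decoding: treat the clean-speech estimate mu_e as exact
    return log_gauss(mu_e, mu_g, var_g)

def uncertainty_score(mu_e, var_e, mu_g, var_g):
    # integral of N(x; mu_e, var_e) * N(x; mu_g, var_g) dx
    # = N(mu_e; mu_g, var_e + var_g): estimator variance inflates the backend variance
    return log_gauss(mu_e, mu_g, var_e + var_g)

mu_e = np.array([0.5, -1.0])   # enhanced feature (posterior mean)
var_e = np.array([0.2, 0.4])   # enhancement uncertainty
mu_g = np.array([0.0, -0.8])   # backend Gaussian mean
var_g = np.array([1.0, 0.5])   # backend Gaussian variance
print(point_score(mu_e, mu_g, var_g))
print(uncertainty_score(mu_e, var_e, mu_g, var_g))
```

The discrete-pdf variant in the abstract replaces the integral by a weighted sum of such `log_gauss` evaluations at a few sample points; the M-candidates variant keeps only the maximum of the M scores.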
Speech Communication | 2009
Veronique Stouten; Hugo Van hamme
We describe an algorithm to automatically estimate the voice onset time (VOT) of plosives. The VOT is the time delay between the burst onset and the start of periodicity when the plosive is followed by a voiced sound. Since the VOT is affected by factors such as place of articulation and voicing, it can be used for inference of these factors. The algorithm uses the reassignment spectrum of the speech signal, a high-resolution time-frequency representation that simplifies the detection of the acoustic events in a plosive. The performance of our algorithm is evaluated on a subset of the TIMIT database by comparison with manual VOT measurements. On average, the difference is smaller than 10 ms for 76.1% and smaller than 20 ms for 91.4% of the plosive segments. We also provide analysis statistics of the VOT of /b/, /d/, /g/, /p/, /t/ and /k/ and experimentally verify some sources of variability. Finally, to illustrate possible applications, we integrate the automatic VOT estimates as an additional feature in an HMM-based speech recognition system and show a small but statistically significant improvement in phone recognition rate.
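The underlying idea, burst onset plus voicing onset, can be sketched on a synthetic signal. The paper uses the reassignment spectrum; this toy version uses plain frame energy and autocorrelation, so the frame sizes, thresholds, and signal are all assumptions for illustration only.

```python
import numpy as np

fs = 16000
t = np.arange(int(0.2 * fs)) / fs
sig = np.zeros_like(t)
burst_idx = int(0.05 * fs)                       # synthetic burst at 50 ms
voice_idx = int(0.09 * fs)                       # synthetic voicing at 90 ms
rng = np.random.default_rng(1)
sig[burst_idx:burst_idx + 80] = rng.standard_normal(80)          # noise burst
sig[voice_idx:] = 0.5 * np.sin(2 * np.pi * 200 * t[voice_idx:])  # voiced tail

frame, hop = 320, 160  # 20 ms frames, 10 ms hop
F = np.array([sig[i:i + frame] for i in range(0, len(sig) - frame, hop)])

# burst onset: frame with the strongest (broadband) energy
energy = (F ** 2).sum(axis=1)
burst_frame = int(np.argmax(energy))

def periodicity(f):
    # normalised autocorrelation peak in a 60-400 Hz pitch-lag range
    ac = np.correlate(f, f, mode="full")[len(f) - 1:]
    lo, hi = fs // 400, fs // 60
    return ac[lo:hi].max() / (ac[0] + 1e-9)

# voicing onset: first clearly periodic frame after the burst
voiced_frame = next(i for i, f in enumerate(F)
                    if i > burst_frame and periodicity(f) > 0.5)
vot_ms = (voiced_frame - burst_frame) * hop / fs * 1000
print(f"estimated VOT = {vot_ms:.0f} ms")
```

A time-frequency representation with sharper localisation, such as the reassignment spectrum used in the paper, makes both onset events far easier to pin down than frame-level energy does.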
international conference on acoustics, speech, and signal processing | 2004
Veronique Stouten; H. Van Hamme; Patrick Wambacq
In this paper we describe how we successfully extended the model-based feature enhancement (MBFE) algorithm to jointly remove additive and convolutional noise from corrupted speech. Although a model of the clean speech can incorporate prior knowledge into the feature enhancement process, this model no longer yields an accurate fit if a different microphone is used. To cure the resulting performance degradation, we combine a new iterative EM algorithm, which estimates the channel, with the MBFE algorithm, which removes nonstationary additive noise. In the latter, the parameters of a shifted clean speech HMM and a noise HMM are first combined by a vector Taylor series approximation, and then the state-conditional MMSE estimates of the clean speech are calculated. Recognition experiments confirmed the superior performance on the Aurora4 recognition task: an average relative reduction in WER of 12% and 2.8% on the clean and multi-condition training sets, respectively, was obtained compared to the Advanced Front-End standard.
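The vector Taylor series (VTS) combination step mentioned above can be illustrated in the scalar, diagonal case. In the log-spectral domain the noisy feature follows y = log(exp(x) + exp(n)); linearising this around the clean and noise means gives a Gaussian for the noisy speech. This is a generic first-order VTS sketch with made-up numbers, not the paper's implementation.

```python
import numpy as np

def vts_combine(mu_x, var_x, mu_n, var_n):
    # y = log(exp(x) + exp(n)), expanded to first order around (mu_x, mu_n)
    g = np.log(np.exp(mu_x) + np.exp(mu_n))    # mismatch function at the expansion point
    G = 1.0 / (1.0 + np.exp(mu_n - mu_x))      # dy/dx at the expansion point
    mu_y = g                                   # first-order noisy-speech mean
    var_y = G**2 * var_x + (1 - G)**2 * var_n  # first-order noisy-speech variance
    return mu_y, var_y

mu_y, var_y = vts_combine(mu_x=2.0, var_x=0.5, mu_n=1.0, var_n=0.3)
print(mu_y, var_y)

# sanity check against Monte Carlo simulation of y = log(e^x + e^n)
rng = np.random.default_rng(0)
x = rng.normal(2.0, np.sqrt(0.5), 100_000)
n = rng.normal(1.0, np.sqrt(0.3), 100_000)
y = np.log(np.exp(x) + np.exp(n))
print(y.mean(), y.var())
```

A channel shift h enters the same formula by replacing mu_x with mu_x + h, which is why re-estimating h with EM and combining models with VTS fit together naturally.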
international conference on acoustics, speech, and signal processing | 2005
Veronique Stouten; H. Van Hamme; Patrick Wambacq
Model-based techniques for robust speech recognition often require the statistics of noisy speech. In this paper, we propose two modifications to obtain more accurate versions of the statistics of the combined HMM (starting from a clean speech and a noise model). Usually, the phase difference between speech and noise is neglected in the acoustic environment model. However, we show how a phase-sensitive environment model can be efficiently integrated in the context of multi-stream model-based feature enhancement and gives rise to more accurate covariance matrices for the noisy speech. Also, by expanding the vector Taylor series up to the second order term, an improved noisy speech mean can be obtained. Finally, we explain how the front-end clean speech model itself can be improved by a preprocessing of the training data. Recognition results on the Aurora4 database illustrate the effect on the noise robustness for each of these modifications.
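The second-order Taylor term for the noisy-speech mean can be made concrete in the scalar case. For the mismatch function y = log(exp(x) + exp(n)), the second derivative with respect to x (and to n) is G(1-G) with G = 1/(1 + exp(mu_n - mu_x)), so the second-order mean adds a curvature correction that the first-order expansion misses. The numbers below are illustrative assumptions.

```python
import numpy as np

def noisy_mean(mu_x, var_x, mu_n, var_n, order=2):
    g = np.log(np.exp(mu_x) + np.exp(mu_n))   # zeroth/first-order mean
    if order == 1:
        return g
    G = 1.0 / (1.0 + np.exp(mu_n - mu_x))
    # second-order correction: 0.5 * y'' * variance, for both x and n
    return g + 0.5 * G * (1 - G) * (var_x + var_n)

mu_x, var_x, mu_n, var_n = 2.0, 0.5, 1.0, 0.3
m1 = noisy_mean(mu_x, var_x, mu_n, var_n, order=1)
m2 = noisy_mean(mu_x, var_x, mu_n, var_n, order=2)
print(m1, m2)

# Monte Carlo reference for the true mean of y = log(e^x + e^n)
rng = np.random.default_rng(0)
y = np.log(np.exp(rng.normal(mu_x, var_x**0.5, 100_000)) +
           np.exp(rng.normal(mu_n, var_n**0.5, 100_000)))
print(y.mean())
```

On this example the second-order mean lands much closer to the simulated mean, which is the effect the abstract exploits; the phase-sensitive covariance correction works in the same spirit but on the variances.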
international conference on acoustics, speech, and signal processing | 2006
Veronique Stouten; H. Van Hamme; Patrick Wambacq
Many compensation techniques, both in the model and the feature domain, require an estimate of the noise statistics to compensate for the degradation of clean speech in adverse environments. We explore how two spectral noise estimation approaches can be applied in the context of model-based feature enhancement. The minimum statistics (MS) method and the improved minima controlled recursive averaging (IMCRA) method are used to estimate the noise power spectrum based only on the noisy speech. The noise mean and variance estimates are nonlinearly transformed to the cepstral domain and used in the Gaussian noise model of MBFE. We show that the resulting system achieves an accuracy on the Aurora2 task that is comparable to MBFE with prior knowledge of the noise. Finally, this performance can be significantly improved when the MS or IMCRA noise mean is re-estimated based on a clean speech model.
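The minimum-statistics idea can be sketched in a few lines: the noise power in each frequency bin is taken as the minimum of the smoothed noisy-speech periodogram over a window long enough to bridge speech activity. The smoothing constant, window length, and synthetic data below are assumptions; the full method also applies a bias compensation factor that is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bins = 300, 8
P = rng.exponential(1.0, (n_frames, n_bins))             # noise-only periodogram, power 1
P[100:150] += 20.0 * rng.exponential(1.0, (50, n_bins))  # a loud speech burst

alpha, win = 0.9, 80
smoothed = np.empty_like(P)
s = P[0]
for t in range(n_frames):
    s = alpha * s + (1 - alpha) * P[t]   # recursive smoothing per bin
    smoothed[t] = s

# sliding minimum over the last `win` frames: speech raises the periodogram
# but rarely lowers it, so the minimum tracks the noise floor
noise_est = np.array([smoothed[max(0, t - win):t + 1].min(axis=0)
                      for t in range(n_frames)])
print(noise_est[140].mean(), noise_est[200].mean())
```

Even for the frames inside the speech burst, the estimate stays near the true noise floor of 1, because the window still covers noise-only frames; this is what makes such estimators usable without explicit voice-activity detection.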
international conference on acoustics, speech, and signal processing | 2008
Alexander Bertrand; Kris Demuynck; Veronique Stouten; H. Van Hamme
Non-negative matrix factorisation (NMF) is an unsupervised learning technique that decomposes a non-negative data matrix into a product of two lower-rank non-negative matrices. The non-negativity constraint results in a parts-based and often sparse representation of the data. We use NMF to factorise a matrix of spectral slices of continuous speech to automatically find a feature set for speech recognition. The resulting decomposition yields a filter bank design with remarkable similarities to perceptually motivated designs, supporting the hypothesis that human hearing and speech production are well matched to each other. We point out that the divergence cost criterion used by NMF is linearly dependent on energy, which may influence the design; we argue, however, that this does not significantly affect the interpretation of our results. Furthermore, we compare our filter bank with several hearing models found in the literature. Evaluating the filter bank for speech recognition shows that the same recognition performance is achieved as with classical mel-based features.
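The claim that the divergence cost is linear in energy is easy to verify numerically: for the generalised Kullback-Leibler divergence commonly used as an NMF cost, scaling both the data and its reconstruction by a factor c scales the divergence by the same factor. The matrices below are random stand-ins for spectral data and a reconstruction W @ H.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_div(V, R):
    # generalised Kullback-Leibler divergence D(V || R), the usual NMF cost
    return np.sum(V * np.log(V / R) - V + R)

V = rng.random((4, 5)) + 0.1   # toy non-negative "spectra"
R = rng.random((4, 5)) + 0.1   # toy reconstruction (stand-in for W @ H)
for c in (1.0, 3.0, 10.0):
    print(c, kl_div(c * V, c * R) / kl_div(V, R))  # ratio equals c
```

This follows directly from D(cV || cR) = c D(V || R) for this cost, so louder spectral slices pull proportionally harder on the factorisation, which is the energy dependence the abstract flags.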
conference of the international speech communication association | 2003
Veronique Stouten; Hugo Van hamme; Kris Demuynck; Patrick Wambacq
conference of the international speech communication association | 2004
Hugo Van hamme; Patrick Wambacq; Veronique Stouten
conference of the international speech communication association | 2007
Veronique Stouten; Kris Demuynck; Hugo Van hamme