Johan de Veth
Radboud University Nijmegen
Publications
Featured research published by Johan de Veth.
Speech Communication | 1998
Johan de Veth; L.W.J. Boves
In this paper we aim to identify the underlying causes that can explain the performance of different channel normalization techniques. To this end, we compared four different channel normalization techniques within the context of connected digit recognition over telephone lines: cepstrum mean subtraction, the dynamic cepstrum representation, RASTA filtering and phase-corrected RASTA. We used context-dependent and context-independent hidden Markov models that were trained using a wide range of different model complexities. The results of our recognition experiments indicate that a channel normalization technique should preserve the modulation frequencies in the range between 2 and 16 Hz in the spectrum of the speech signals. At the same time, DC components in the modulation spectrum should be effectively removed. With context-independent models the channel normalization filter should have a flat phase response. Finally, for our connected digit recognition task it appeared that cepstrum mean subtraction and phase-corrected RASTA performed equally well for context-dependent and context-independent models when equal numbers of model parameters were used.
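As a concrete illustration of the simplest of the four techniques, here is a minimal numpy sketch of cepstrum mean subtraction. The function name, array shapes and the simulated channel offset are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def cepstrum_mean_subtraction(cepstra):
    """Remove the per-utterance mean from each cepstral coefficient.

    cepstra: array of shape (n_frames, n_coeffs).
    A fixed channel filter adds a constant offset to every cepstral
    frame, so subtracting the time average removes the channel bias
    (the DC component of the modulation spectrum).
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Hypothetical usage: 200 frames of 13 cepstra with a simulated channel offset.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 13))
channel_bias = rng.normal(size=(1, 13))       # fixed telephone-channel offset
observed = clean + channel_bias
normalized = cepstrum_mean_subtraction(observed)
print(np.abs(normalized.mean(axis=0)).max())  # ~0: channel bias removed
```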
Speech Communication | 2001
Johan de Veth; Bert Cranen; L.W.J. Boves
In this paper, we discuss acoustic backing-off as a method to improve automatic speech recognition robustness. Acoustic backing-off aims to achieve the same objective as the marginalization approach of missing feature theory: the detrimental influence of outlier values is effectively removed from the local distance computation in the Viterbi algorithm. The proposed method is based on one of the principles of robust statistical pattern matching: during recognition the local distance function (LDF) is modeled using a mixture of the distribution observed during training and a distribution describing observations not previously seen. In order to assess the effectiveness of the new method, we used artificial distortions of the acoustic vectors in connected digit recognition over telephone lines. We found that acoustic backing-off is capable of restoring recognition performance almost to the level observed for the undisturbed features, even in cases where a conventional LDF completely fails. These results show that recognition robustness can be improved using a marginalization approach, where making the distinction between reliable and corrupted feature values is wired into the recognition process. In addition, the results show that application of acoustic backing-off is not limited to feature representations based on filter bank outputs. Finally, the results indicate that acoustic backing-off is much less effective when local distortions are smeared over all vector elements. Therefore, the acoustic pre-processing steps should be chosen with care, so that the dispersion of distortions over all acoustic vector elements as a result of within-vector feature transformations is minimal.
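A minimal sketch of the backing-off idea, under our own assumptions about the emission densities: the trained Gaussian is mixed with a broad uniform "not seen before" density, so a single corrupted coefficient saturates the local distance instead of dominating it. The parameter values are illustrative, not the paper's.

```python
import numpy as np

def backing_off_distance(x, mu, sigma, eps=0.01, outlier_range=100.0):
    """Robust local distance over the feature dimensions (hedged sketch).

    The trained per-dimension Gaussian is mixed with a broad uniform
    density, bounding the contribution of any outlier value to the
    accumulated Viterbi distance.
    """
    gauss = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    uniform = 1.0 / outlier_range
    return -np.sum(np.log((1.0 - eps) * gauss + eps * uniform))

# A clean vector yields a small distance; an outlier saturates rather than explodes.
mu, sigma = np.zeros(13), np.ones(13)
clean = np.zeros(13)
corrupted = clean.copy()
corrupted[3] = 50.0  # one wildly distorted coefficient
print(backing_off_distance(clean, mu, sigma))
print(backing_off_distance(corrupted, mu, sigma))  # bounded by -log(eps * uniform)
```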
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Yan Han; Johan de Veth; Lou Boves
In this paper, we introduce a novel method for clustering speech gestures, represented as continuous trajectories in acoustic parameter space. Trajectory Clustering allows us to avoid the conditional independence assumption that makes it difficult to account for the fact that successive measurements of an articulatory gesture are correlated. We apply the trajectory clustering method for developing multiple parallel hidden Markov models (HMMs) for a continuous digits recognition task. We compare the performance obtained with data-driven clustering to the recognition performance obtained with conventional head-body-tail models, which use knowledge-based criteria for building multiple HMMs in order to obviate the trajectory folding problem. The results show that trajectory clustering is able to discover structure in the training database that is different from the structure assumed by the knowledge-based approach. In addition, the data-derived structure gives rise to significantly better recognition performance, and results in a 10% word error rate reduction.
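The paper's trajectory clustering method is more elaborate than this, but a rough resample-and-cluster stand-in conveys the key point: whole trajectories, not individual frames, are the objects being grouped. Everything below (fixed-length resampling, k-means, the toy two-dimensional "gestures") is our simplification.

```python
import numpy as np
from sklearn.cluster import KMeans

def resample_trajectory(traj, n_points=20):
    """Linearly resample a (n_frames, n_dims) trajectory to a fixed length,
    so trajectories of different durations become comparable vectors."""
    t_old = np.linspace(0.0, 1.0, len(traj))
    t_new = np.linspace(0.0, 1.0, n_points)
    return np.column_stack(
        [np.interp(t_new, t_old, traj[:, d]) for d in range(traj.shape[1])]
    )

def cluster_trajectories(trajs, n_clusters=2, n_points=20):
    """Cluster whole trajectories; each cluster could then seed its own HMM."""
    X = np.stack([resample_trajectory(t, n_points).ravel() for t in trajs])
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

# Hypothetical usage: rising vs. falling 2-D "gestures" of varying length.
rng = np.random.default_rng(1)
trajs = [np.cumsum(rng.normal(loc=s, size=(rng.integers(30, 60), 2)), axis=0)
         for s in (+0.5, -0.5) for _ in range(10)]
print(cluster_trajectories(trajs))  # two groups matching the two drift directions
```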
Speech Communication | 2001
Johan de Veth; Febe de Wet; Bert Cranen; L.W.J. Boves
For improved recognition robustness in mismatched training–test conditions, the application of key ideas from missing feature theory and robust statistical pattern recognition in the framework of an otherwise conventional automatic speech recognition (ASR) system was investigated. To this end, both the type of features used to represent the speech signals and the algorithm used to compute the distance measure between an observed feature vector and a previously trained parametric model were studied. Two different types of feature representations were used: a type in which spectrally local distortions are smeared over the entire feature vector and a type in which distortions are only smeared over part of the feature vector. In addition, two different distance measures were investigated, viz., a conventional distance measure and a robust local distance function in the form of acoustic backing-off. The effects on recognition performance were studied for artificially created, band-limited noise and NOISEX noise added to the speech signals. The results for artificial band-limited noise indicate that a partially smearing feature transform is to be preferred over a fully smearing transform. In addition, for artificial, band-limited noise, a robust local distance function is to be preferred over the conventional distance measure as long as the distorted feature values are outliers with respect to the feature distribution observed during training. The experiments with NOISEX noise show that the combination of feature type and distance measure that is optimal for artificial, band-limited noise is also capable of improving recognition robustness for NOISEX noise, provided that it is band-limited.
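The smearing effect of a within-vector transform can be seen in a few lines: a DCT taken over the whole filter-bank vector spreads a single-band distortion across every output coefficient, while a block-wise (sub-band) DCT keeps it confined. This toy example is ours; the paper's actual feature types differ.

```python
import numpy as np
from scipy.fftpack import dct

# One log-energy filter-bank frame with a distortion in a single band.
bands = np.zeros(16)
distorted = bands.copy()
distorted[5] = 10.0  # band-limited noise hit in one filter-bank channel

# Full DCT: the local distortion leaks into every cepstral coefficient.
full = dct(distorted, norm='ortho') - dct(bands, norm='ortho')
print(np.count_nonzero(np.abs(full) > 1e-9))  # 16: smeared over all coefficients

# Sub-band transform: DCT each half separately; the hit stays in one block.
half = np.concatenate([dct(distorted[:8], norm='ortho'),
                       dct(distorted[8:], norm='ortho')])
half -= np.concatenate([dct(bands[:8], norm='ortho'),
                        dct(bands[8:], norm='ortho')])
print(np.count_nonzero(np.abs(half) > 1e-9))  # 8: confined to one sub-band block
```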
Computer Speech & Language | 2005
Febe de Wet; Johan de Veth; Loe Boves; Bert Cranen
The aim of this investigation is to determine to what extent automatic speech recognition may be enhanced if, in addition to the linear compensation accomplished by mean and variance normalisation, a non-linear mismatch reduction technique is applied to the cepstral and energy features. An additional goal is to determine whether the degree of mismatch between the feature distributions of the training and test data that is associated with acoustic mismatch differs for the cepstral and energy features. Towards these aims, two non-linear mismatch reduction techniques – time domain noise reduction and histogram normalisation – were evaluated on the Aurora2 digit recognition task as well as on a continuous speech recognition task with noisy test conditions similar to those in the Aurora2 experiments. The experimental results show that recognition performance is enhanced by the application of both non-linear mismatch reduction techniques. The best results are obtained when the two techniques are applied simultaneously. The results also reveal that the mismatch in the energy features is quantitatively and qualitatively much larger than the corresponding mismatch associated with the cepstral coefficients. The most substantial gains in average recognition rate are therefore accomplished by reducing training-test mismatch for the energy features.
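A hedged sketch of the two compensation stages discussed above: linear mean-and-variance normalisation, and a quantile-mapping form of histogram normalisation. Quantile mapping is one common way to match feature histograms and is our assumption here, not necessarily the paper's exact algorithm.

```python
import numpy as np

def mean_variance_normalise(feats):
    """Linear compensation: zero mean, unit variance per feature dimension."""
    return (feats - feats.mean(axis=0)) / feats.std(axis=0)

def histogram_normalise(test_feats, train_feats):
    """Non-linear compensation (hedged sketch): map each test value through
    its empirical rank onto the quantiles of the training data, so the test
    histogram matches the training histogram dimension by dimension."""
    out = np.empty_like(test_feats)
    for d in range(test_feats.shape[1]):
        ranks = np.argsort(np.argsort(test_feats[:, d]))
        quantiles = (ranks + 0.5) / len(test_feats)
        out[:, d] = np.quantile(train_feats[:, d], quantiles)
    return out

# Hypothetical usage: noisy test energies with a non-linearly warped distribution.
rng = np.random.default_rng(2)
train = rng.normal(size=(1000, 1))
test = 0.5 * rng.normal(size=(500, 1)) ** 3 + 1.0        # non-linear mismatch
matched = histogram_normalise(test, train)
print(np.quantile(matched, [0.25, 0.5, 0.75]).round(2))  # ~= training quantiles
```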
Speech Communication | 2001
Johan de Veth; Bert Cranen; L.W.J. Boves
Saying that late 20th century automatic speech recognition (ASR) is pattern recognition is something of a truism, but perhaps one whose fundamental implications are not always fully appreciated. Essentially, a pattern recognition task boils down to measuring the distance between a physical representation of a new, as yet unknown token and all elements of a set of preexisting patterns, of course in the same physical representation. On the one hand, the ‘patterns’ that can be recognized are, implicitly or explicitly, separate and invariable entities. For example, the command open in a Windows control application always has the same invariable and unique meaning. On the other hand, the unknown input tokens are continuous signals that typically show a high degree of variability. ASR research has centered around the problem of how to map continuous, variable acoustic representations onto discrete, invariable patterns. In ASR the physical representation of the speech tokens is some kind of dynamic power spectrum, for reasons which date back to the days of Ohm and von Helmholtz, who showed that the power spectrum explains most of the perceptual phenomena in human speech processing. Since the inception of digital signal processing, dynamic spectra have been approximated by a sequence of short-time spectra (Rabiner & Schafer, 1978). Consequently, the pattern match in ASR is invariably implemented as the accumulation of some distance measure between the acoustic features derived from a sequence of short-time spectra of the input token and the corresponding representation of the active patterns (see Fig. 2.1). Therefore, anything which adds to the variability of the short-time spectrum of a speech signal will, as it were by definition, complicate pattern matching, and consequently complicate ASR.
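The "accumulation of some distance measure" over short-time spectra can be made concrete with a minimal dynamic-time-warping sketch. Real ASR systems accumulate HMM log-likelihoods in a Viterbi search instead, so this is an illustration of the principle only, with hypothetical data.

```python
import numpy as np

def accumulate_distance(frames, template):
    """Minimal DTW sketch of the ASR pattern match: accumulate a local
    spectral distance between input frames and a stored pattern, while
    allowing the time axis to stretch or compress."""
    T, N = len(frames), len(template)
    D = np.full((T + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for t in range(1, T + 1):
        for n in range(1, N + 1):
            local = np.sum((frames[t - 1] - template[n - 1]) ** 2)
            D[t, n] = local + min(D[t - 1, n], D[t, n - 1], D[t - 1, n - 1])
    return D[T, N]

# Added spectral variability inflates the accumulated distance, as argued above.
rng = np.random.default_rng(3)
template = rng.normal(size=(40, 13))
clean = template + 0.1 * rng.normal(size=template.shape)
noisy = template + 1.0 * rng.normal(size=template.shape)
print(accumulate_distance(clean, template) < accumulate_distance(noisy, template))  # True
```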
Eurasip Journal on Audio, Speech, and Music Processing | 2007
Annika Hämäläinen; Lou Boves; Johan de Veth; Louis ten Bosch
Recent research on the TIMIT corpus suggests that longer-length acoustic models are more appropriate for pronunciation variation modelling than the context-dependent phones that conventional automatic speech recognisers use. However, the impressive speech recognition results obtained with longer-length models on TIMIT remain to be reproduced on other corpora. To understand the conditions in which longer-length acoustic models result in considerable improvements in recognition performance, we carry out recognition experiments on both TIMIT and the Spoken Dutch Corpus and analyse the differences between the two sets of results. We establish that the details of the procedure used for initialising the longer-length models have a substantial effect on the speech recognition results. When initialised appropriately, longer-length acoustic models that borrow their topology from a sequence of triphones cannot capture the pronunciation variation phenomena that hinder recognition performance the most.
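The initialisation step at issue can be pictured as concatenating the states of the triphones that span a longer unit. The dict-of-lists model store below is our own simplification for illustration, not the paper's actual training procedure or toolkit.

```python
# Hedged sketch: initialise a longer-length (e.g. word-level) HMM by
# concatenating the states of the triphones covering its phone sequence.
def init_long_model(phones, triphone_states):
    states = []
    padded = ['sil'] + phones + ['sil']
    for left, mid, right in zip(padded, padded[1:], padded[2:]):
        states.extend(triphone_states[f"{left}-{mid}+{right}"])
    return states

triphone_states = {          # hypothetical 3-state triphones (state means only)
    'sil-w+a': [0.1, 0.2, 0.3],
    'w-a+n':   [0.4, 0.5, 0.6],
    'a-n+sil': [0.7, 0.8, 0.9],
}
print(init_long_model(['w', 'a', 'n'], triphone_states))  # 9-state word model
```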
Speech Communication | 2003
Johan de Veth; L.W.J. Boves
The efficiency of classical RASTA filtering for channel normalisation was investigated for continuous speech recognition based on context-independent and context-dependent hidden Markov models. For a medium and a large vocabulary continuous speech recognition task, recognition performance was established for classical RASTA filtering and compared to using no channel normalisation, cepstrum mean normalisation, and phase-corrected RASTA. Phase-corrected RASTA is a technique that consists of classical RASTA filtering followed by a phase correction operation. In this manner, channel bias is as effectively removed as with classical RASTA. However, for phase-corrected RASTA, amplitude drift towards zero in stationary signal portions is diminished compared to classical RASTA. The results show that application of classical RASTA filtering resulted in decreased recognition performance when compared to using no channel normalisation for all conditions studied, although the decrease appeared to be smaller for context-dependent models than for context-independent models. However, for all conditions, recognition performance was significantly and substantially improved when phase-corrected RASTA was used and reached the same performance level as obtained for cepstrum mean normalisation in some cases. It is concluded that classical RASTA filtering can only be effective for channel robustness if the impact of the amplitude drift towards zero can be kept as limited as possible.
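For reference, classical RASTA filtering along the time axis of each cepstral trajectory can be sketched as below, using the standard transfer function from Hermansky and Morgan (1994). The stationary input demonstrates the amplitude drift towards zero that phase-corrected RASTA is designed to repair; the phase correction itself is not implemented here.

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(cepstra):
    """Classical RASTA band-pass filter along the time axis of each cepstral
    trajectory: H(z) = 0.1 * z**4 * (2 + z**-1 - z**-3 - 2*z**-4) / (1 - 0.98*z**-1),
    implemented causally (the z**4 advance is dropped). The leaky-integrator
    pole at 0.98 is what pulls stationary portions towards zero."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -0.98])
    return lfilter(b, a, cepstra, axis=0)

# A constant channel offset (DC in the modulation spectrum) is filtered out.
frames = np.ones((300, 13)) * 5.0  # stationary signal with channel bias
out = rasta_filter(frames)
print(np.abs(out[-1]).max())       # ~0: bias removed, but by drifting to zero
```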
Journal of the Acoustical Society of America | 1989
Johan de Veth; L.W.J. Boves; Wim van Golstein Brouwers
A cascade six pole-pair and five zero-pair synthesizer has been developed as part of a text-to-speech conversion system for Dutch. Control information for this synthesizer is derived from, among other things, measurements on natural speech. A pitch-synchronous robust ARMA analysis technique was developed and applied to utterances produced by a number of adult male talkers. The resulting pole-zero parameters were separated into sets pertaining to the source and to the vocal tract. The vocal tract parameters were corrected in those frames where the analysis method made occasional mistakes. Analysis-resynthesis of sentence material, using the corrected vocal tract parameters to control the synthesizer driven by impulse and noise excitation, yielded high-quality synthetic speech. The tract parameters were then used to inverse filter the speech to obtain the source function, which was subsequently parametrized using the Liljencrants-Fant model. It is hoped that the speech quality will be improved by replacing the impulse excitation...
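A much-simplified stand-in for the analysis and inverse-filtering pipeline described above: the paper fits poles and zeros pitch-synchronously, whereas this sketch fits a single all-pole (LP) model by the autocorrelation method and inverse filters to approximate the source. All signals and parameter values are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_inverse_filter(speech, order=12):
    """Fit an all-pole model via the autocorrelation method, then inverse
    filter the speech to estimate the source (residual) signal. Hedged,
    all-pole, frame-free simplification of the paper's ARMA pipeline."""
    r = np.correlate(speech, speech, mode='full')[len(speech) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    inverse = np.concatenate(([1.0], -a))   # A(z) = 1 - sum_k a_k z**-k
    return lfilter(inverse, [1.0], speech)  # residual approximates the source

# Hypothetical usage: a 100 Hz pulse train through a single two-pole "tract".
fs, f0 = 8000, 100
excitation = np.zeros(fs // 2)
excitation[::fs // f0] = 1.0
tract = [1.0, -1.8 * np.cos(2 * np.pi * 500 / fs), 0.81]  # resonance near 500 Hz
speech = lfilter([1.0], tract, excitation)
residual = lpc_inverse_filter(speech, order=4)
print(residual[:3].round(3))  # pulse-like residual recovered at utterance onset
```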
Journal of the Acoustical Society of America | 1987
L.W.J. Boves; Johan de Veth
It is well documented that linear prediction analysis will only allow prediction coefficients to relate to vocal tract shapes under very special conditions, and that the problems in interpreting LP analysis results in articulatory terms are due to the simplifying assumptions underlying the analysis model. Usually, the assumption that the system is all pole [or auto‐regressive (AR)] is thought to be the most important simplification. Therefore, it is hoped that the use of the less restrictive pole‐zero [or auto‐regressive moving average (ARMA)] analysis model will offer greater opportunities for an articulatory interpretation. A detailed study of a number of different ARMA analysis implementations has shown that the all‐pole assumption may not be the most important restriction on the articulatory interpretation of the analysis results; the assumption of Gaussian excitation may prove to be at least equally impeding. [Research supported by the Foundation for Speech Technology, funded by SPIN.]