Richard Heusdens
Delft University of Technology
Publications
Featured research published by Richard Heusdens.
IEEE Transactions on Audio, Speech, and Language Processing | 2011
Cees H. Taal; Richard C. Hendriks; Richard Heusdens; Jesper Jensen
In the development of noise-reduction algorithms, an objective machine-driven intelligibility measure that correlates highly with speech intelligibility is of great interest. Besides reducing the time and cost of real listening experiments, an objective intelligibility measure could also help answer how to improve the intelligibility of noisy unprocessed speech. In this paper, a short-time objective intelligibility measure (STOI) is presented which shows high correlation with the intelligibility of noisy and time-frequency weighted noisy speech (e.g., resulting from noise reduction) in three different listening experiments. In general, STOI showed better correlation with speech intelligibility than five other reference objective intelligibility models. In contrast to conventional intelligibility models, which tend to rely on global statistics across entire sentences, STOI is based on shorter time segments (386 ms). Experiments indeed show that it is beneficial to take segment lengths of this order into account. In addition, a free Matlab implementation is provided.
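The core mechanics of such a short-time measure can be sketched as follows. This is a deliberately simplified illustration, not the published STOI algorithm: the equal-width band grouping, window, and constants below are placeholder assumptions (STOI itself uses one-third octave bands, a specific normalization, and clipping of the intermediate measure).

```python
import numpy as np

def short_time_intelligibility(clean, degraded):
    # Simplified sketch: frame both signals, group DFT bins into a few
    # bands, then correlate the clean and degraded band envelopes over
    # short segments of ~30 frames (~384 ms at a 12.8 ms hop) and
    # average the correlations into a single score.
    frame, hop, n_bands, seg = 256, 128, 4, 30
    n_frames = (len(clean) - frame) // hop
    win = np.hanning(frame)

    def envelopes(x):
        spec = np.array([np.abs(np.fft.rfft(x[i * hop:i * hop + frame] * win))
                         for i in range(n_frames)])
        # crude equal-width band grouping (STOI uses 1/3-octave bands)
        return np.stack([b.sum(axis=1)
                         for b in np.array_split(spec, n_bands, axis=1)], axis=1)

    E_c, E_d = envelopes(clean), envelopes(degraded)
    scores = []
    for m in range(seg, n_frames):
        for j in range(n_bands):
            c = E_c[m - seg:m, j] - E_c[m - seg:m, j].mean()
            d = E_d[m - seg:m, j] - E_d[m - seg:m, j].mean()
            denom = np.linalg.norm(c) * np.linalg.norm(d)
            if denom > 0:
                scores.append(np.dot(c, d) / denom)
    return float(np.mean(scores))
```

Identical inputs yield a score of 1, and added noise lowers it, which captures the monotone behaviour the full measure is designed to exhibit.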
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Jan S. Erkelens; Richard C. Hendriks; Richard Heusdens; Jesper Jensen
This paper considers techniques for single-channel speech enhancement based on the discrete Fourier transform (DFT). Specifically, we derive minimum mean-square error (MMSE) estimators of speech DFT coefficient magnitudes as well as of complex-valued DFT coefficients based on two classes of generalized gamma distributions, under an additive Gaussian noise assumption. The resulting generalized DFT magnitude estimator has as a special case the existing scheme based on a Rayleigh speech prior, while the complex DFT estimators generalize existing schemes based on Gaussian, Laplacian, and Gamma speech priors. Extensive simulation experiments with speech signals degraded by various additive noise sources verify that significant improvements are possible with the more recent estimators based on super-Gaussian priors. The increase in perceptual evaluation of speech quality (PESQ) over the noisy signals is about 0.5 points for street noise and about 1 point for white noise, nearly independent of input signal-to-noise ratio (SNR). The assumptions made for deriving the complex DFT estimators are less accurate than those for the magnitude estimators, leading to a higher maximum achievable speech quality with the magnitude estimators.
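The Gaussian-prior special case that these complex-DFT estimators generalize is the familiar Wiener gain. A minimal sketch, assuming the a priori SNR is already known (in practice it must itself be estimated, e.g. via a decision-directed approach):

```python
import numpy as np

def wiener_gain_estimate(noisy_dft, speech_psd, noise_psd):
    # MMSE estimate of the complex speech DFT coefficient under a
    # Gaussian speech prior and additive Gaussian noise: scale each
    # noisy coefficient by xi / (1 + xi), where xi is the a priori SNR.
    # The generalized-gamma estimators reduce to this gain in the
    # Gaussian special case.
    xi = speech_psd / noise_psd
    return (xi / (1.0 + xi)) * noisy_dft
```

At xi = 1 the gain is 1/2; as xi grows, the gain approaches 1 and the noisy coefficient passes through almost unchanged.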
international conference on acoustics, speech, and signal processing | 2010
Richard C. Hendriks; Richard Heusdens; Jesper Jensen
Most speech enhancement algorithms heavily depend on the noise power spectral density (PSD). Because this quantity is unknown in practice, estimation from the noisy data is necessary. We present a low-complexity method for noise PSD estimation. The algorithm is based on a minimum mean-squared error estimator of the noise magnitude-squared DFT coefficients. Compared to minimum-statistics-based noise tracking, segmental SNR and PESQ are improved for non-stationary noise sources by 1 dB and 0.25 MOS points, respectively. Compared to recently published algorithms, similarly good noise tracking performance is obtained, but at a computational complexity that is on the order of a factor of 40 lower.
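A toy version of such a tracker can be sketched as below. This is not the paper's estimator, whose MMSE step follows from the full statistical model; here the MMSE step is crudely approximated by clamping the noisy periodogram, which serves the same purpose of limiting speech leakage into the noise estimate.

```python
import numpy as np

def track_noise_psd(noisy_power_frames, alpha=0.8):
    # Simplified recursive noise-PSD tracking: for each frame and bin,
    # form an estimate of the noise periodogram and smooth it over
    # time. Clamping the noisy periodogram to at most a few times the
    # current estimate limits how much speech power leaks into the
    # noise estimate (a stand-in for the paper's MMSE estimator of the
    # magnitude-squared noise DFT coefficients).
    n_frames, n_bins = noisy_power_frames.shape
    psd = noisy_power_frames[:5].mean(axis=0)  # initialize from first frames
    est = np.empty((n_frames, n_bins))
    for m in range(n_frames):
        noise_per = np.minimum(noisy_power_frames[m], 3.0 * psd)
        psd = alpha * psd + (1 - alpha) * noise_per
        est[m] = psd
    return est
```

For stationary noise the estimate settles close to (slightly below, due to the clamp) the true noise power.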
IEEE Transactions on Audio, Speech, and Language Processing | 2008
Jan S. Erkelens; Richard Heusdens
This paper considers estimation of the noise spectral variance from speech signals contaminated by highly nonstationary noise sources. The method can accurately track fast changes in noise power level (up to about 10 dB/s). In each time frame, for each frequency bin, the noise variance estimate is updated recursively with the minimum mean-square error (MMSE) estimate of the current noise power. A time- and frequency-dependent smoothing parameter is used, which is varied according to an estimate of speech presence probability. In this way, the amount of speech power leaking into the noise estimates is kept low. For the estimation of the noise power, a spectral gain function is used, which is found by an iterative data-driven training method. The proposed noise tracking method is tested on various stationary and nonstationary noise sources, for a wide range of signal-to-noise ratios, and compared with two state-of-the-art methods. When used in a speech enhancement system, improvements in segmental signal-to-noise ratio of more than 1 dB can be obtained for the most nonstationary noise sources at high noise levels.
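The idea of a speech-presence-driven smoothing parameter can be sketched in one update step. The hard SNR threshold and the two smoothing constants below are illustrative assumptions; the paper uses a soft speech presence probability and a trained spectral gain function instead.

```python
import numpy as np

def update_noise_variance(noise_var, noisy_power, snr_threshold=2.0,
                          alpha_speech=0.99, alpha_noise=0.85):
    # One recursive update per (frame, bin): where the noisy power is
    # close to the current noise estimate (speech likely absent), adapt
    # fast; where it is much larger (speech likely present), adapt
    # slowly, so that little speech power leaks into the noise estimate.
    speech_present = noisy_power > snr_threshold * noise_var
    alpha = np.where(speech_present, alpha_speech, alpha_noise)
    return alpha * noise_var + (1 - alpha) * noisy_power
```

With a noise variance of 1, a noisy power of 10 (speech-like) nudges the estimate only to 1.09, while a noisy power of 1.5 (noise-like) pulls it more aggressively toward the observation.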
international conference on acoustics, speech, and signal processing | 2010
Cees H. Taal; Richard C. Hendriks; Richard Heusdens; Jesper Jensen
Existing objective speech-intelligibility measures are suitable for several types of degradation; however, they turn out to be less appropriate for methods where noisy speech is processed by a time-frequency (TF) weighting, e.g., noise reduction and speech separation. In this paper, we present an objective intelligibility measure which shows high correlation (rho = 0.95) with the intelligibility of both noisy and TF-weighted noisy speech. The proposed method performs significantly better than three other, more sophisticated, objective measures. Furthermore, it is based on an intermediate intelligibility measure for short-time (approximately 400 ms) TF regions and uses a simple DFT-based TF decomposition. In addition, a free Matlab implementation is provided.
IEEE Signal Processing Letters | 2002
Richard Heusdens; R. Vafin; W.B. Kleijn
We propose a segment-based matching-pursuit algorithm in which the psychoacoustic properties of the human auditory system are taken into account. Rather than scaling the dictionary elements according to auditory perception, we define a psychoacoustic-adaptive norm on the signal space that can be used for assigning the dictionary elements to the individual segments in a rate-distortion optimal way. The new algorithm is asymptotically equal to signal-to-mask-ratio-based algorithms in the limit of infinite analysis-window length. However, the new algorithm provides a significantly improved selection of the dictionary elements for finite window lengths.
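Running matching pursuit under a modified norm can be sketched with a diagonal weighting. The function and weights below are illustrative only; the paper's psychoacoustic-adaptive norm is derived from a masking model rather than a fixed diagonal weight.

```python
import numpy as np

def weighted_matching_pursuit(signal, dictionary, weights, n_atoms=5):
    # Matching pursuit under a weighted norm ||x||_W^2 = sum_i w_i x_i^2:
    # atoms (rows of `dictionary`) are chosen to maximally reduce the
    # *weighted* residual energy instead of the plain l2 residual energy.
    residual = np.asarray(signal, dtype=float).copy()
    approx = np.zeros_like(residual)
    W = np.asarray(weights, dtype=float)
    for _ in range(n_atoms):
        # weighted correlations <r, d>_W / <d, d>_W for each atom
        num = dictionary @ (W * residual)
        den = np.einsum('ij,j,ij->i', dictionary, W, dictionary)
        coefs = num / den
        # pick the atom giving the largest weighted-energy reduction
        k = np.argmax(coefs ** 2 * den)
        approx += coefs[k] * dictionary[k]
        residual -= coefs[k] * dictionary[k]
    return approx
```

With unit weights and an orthonormal dictionary this reduces to ordinary matching pursuit, picking the largest-magnitude coefficients first; a perceptual weighting instead steers atom selection toward perceptually important components.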
EURASIP Journal on Advances in Signal Processing | 2005
Steven van de Par; Ag Armin Kohlrausch; Richard Heusdens; Jesper Jensen; Søren Holdt Jensen
Psychoacoustical models have been used extensively within audio coding applications over the past decades. Recently, parametric coding techniques have been applied to general audio and this has created the need for a psychoacoustical model that is specifically suited for sinusoidal modelling of audio signals. In this paper, we present a new perceptual model that predicts masked thresholds for sinusoidal distortions. The model relies on signal detection theory and incorporates more recent insights about spectral and temporal integration in auditory masking. As a consequence, the model is able to predict the distortion detectability. In fact, the distortion detectability defines a (perceptually relevant) norm on the underlying signal space which is beneficial for optimisation algorithms such as rate-distortion optimisation or linear predictive coding. We evaluate the merits of the model by combining it with a sinusoidal extraction method and compare the results with those obtained with the ISO MPEG-1 Layer I-II recommended model. Listening tests show a clear preference for the new model. More specifically, the model presented here leads to a reduction of more than 20% in terms of number of sinusoids needed to represent signals at a given quality level.
international conference on acoustics, speech, and signal processing | 2002
Steven van de Par; Ag Armin Kohlrausch; Ghassan Charestan; Richard Heusdens
The use of psychoacoustical masking models for audio coding applications has been widespread over the past decades. In such applications, it is typically assumed that the original input signal serves as a masker for the distortions introduced by the lossy coding method that is used. Such masking models are based on the peripheral bandpass filtering properties of the auditory system and basically evaluate the distortion-to-masker ratio within each auditory filter. Up to now, these models have been based on the assumption that the masking of distortions is governed by the auditory filter for which the ratio between distortion and masker is largest. This assumption, however, is not in line with some new findings within the field of psychoacoustics. A more accurate assumption would be that the human auditory system is able to integrate distortions that are present within a range of auditory filters. In this contribution, a new model is presented which is in line with new psychoacoustical studies and which is suitable for application within an audio codec. Although this model can be used to derive a masking curve, the model also gives a measure for the detectability of distortions, provided that distortions are not too large.
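The contrast between the classical max rule and the integration idea can be sketched in a few lines. The calibration constant and the per-filter ratios are placeholders; the actual model includes spectral and temporal integration details omitted here.

```python
import numpy as np

def detectability(distortion_power, masker_power, c=1.0):
    # Integration model sketch: sum the distortion-to-masker ratios over
    # all auditory filters, modelling the listener's ability to combine
    # distortion cues across filters.
    return c * (distortion_power / masker_power).sum()

def max_rule(distortion_power, masker_power):
    # Classical assumption: audibility is governed only by the single
    # auditory filter with the worst distortion-to-masker ratio.
    return (distortion_power / masker_power).max()
```

The two rules disagree most when the same total distortion energy is spread over many filters: the integration model assigns it the same detectability as concentrated distortion, while the max rule rates the spread distortion far less audible.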
IEEE Transactions on Information Theory | 2006
Jan Østergaard; Jesper Jensen; Richard Heusdens
In this paper, we derive analytical expressions for the central and side quantizers which, under high-resolution assumptions, minimize the expected distortion of a symmetric multiple-description lattice vector quantization (MD-LVQ) system subject to entropy constraints on the side descriptions for given packet-loss probabilities. We consider a special case of the general n-channel symmetric multiple-description problem where only a single parameter controls the redundancy tradeoffs between the central and the side distortions. Previous work on two-channel MD-LVQ showed that the distortions of the side quantizers can be expressed through the normalized second moment of a sphere. We show here that this is also the case for three-channel MD-LVQ. Furthermore, we conjecture that this is true for the general n-channel MD-LVQ. For given source, target rate, and packet-loss probabilities we find the optimal number of descriptions and construct the MD-LVQ system that minimizes the expected distortion. We verify theoretical expressions by numerical simulations and show in a practical setup that significant performance improvements can be achieved over state-of-the-art two-channel MD-LVQ by using three-channel MD-LVQ.
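The central/side trade-off can be illustrated with a toy two-description scalar quantizer; this is a stand-in for the paper's lattice vector quantizers, with a hand-picked index assignment rather than an optimized one.

```python
import numpy as np

def md_scalar_quantize(x, delta=0.5):
    # Two-description scalar quantization sketch: the central quantizer
    # has step delta; the index assignment splits the central index n
    # into two side indices i, j with n = i + j. Receiving both
    # descriptions recovers the fine central cell; receiving only one
    # gives a coarser (step 2*delta) but still usable reconstruction.
    n = np.round(x / delta).astype(int)   # central index
    i, j = (n + 1) // 2, n // 2           # index assignment
    central = delta * (i + j)             # equals delta * n
    side1 = 2 * delta * i - delta / 2     # reconstruction from i alone
    side2 = 2 * delta * j + delta / 2     # reconstruction from j alone
    return central, side1, side2
```

The central error stays within delta/2 while each side error stays within delta, the basic redundancy trade-off that the n-channel lattice construction optimizes against packet-loss probabilities.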
IEEE Transactions on Audio, Speech, and Language Processing | 2008
Richard C. Hendriks; Jesper Jensen; Richard Heusdens
All discrete Fourier transform (DFT) domain-based speech enhancement gain functions rely on knowledge of the noise power spectral density (PSD). Since the noise PSD is unknown in advance, estimation from the noisy speech signal is necessary. An overestimate of the noise PSD will lead to a loss in speech quality, while an underestimate will lead to an unnecessarily high level of residual noise. We present a novel approach for noise tracking, which updates the noise PSD for each DFT coefficient in the presence of both speech and noise. This method is based on the eigenvalue decomposition of correlation matrices that are constructed from time series of noisy DFT coefficients. The presented method is very well capable of tracking gradually changing noise types. In comparison to state-of-the-art noise tracking algorithms, the proposed method reduces the error between the estimated and the true noise PSD. In combination with an enhancement system, the proposed method improves the segmental SNR by several decibels for gradually changing noise types. Listening experiments show that the proposed system is preferred over a state-of-the-art noise tracking algorithm.
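The correlation-matrix idea can be sketched for a single frequency bin. The dimensions and eigenvalue split below are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

def noise_power_from_eigvals(dft_series, dim=8, n_noise_eigs=4):
    # Stack a time series of noisy DFT coefficients from one bin into
    # lagged vectors, estimate their correlation matrix, and read the
    # noise power off the smallest eigenvalues: speech energy tends to
    # concentrate in a few dominant eigendirections, while noise spreads
    # roughly evenly over all of them.
    x = np.asarray(dft_series)
    vecs = np.array([x[i:i + dim] for i in range(len(x) - dim + 1)])
    R = (vecs.conj().T @ vecs) / len(vecs)
    eigvals = np.linalg.eigvalsh(R)  # returned in ascending order
    return float(np.mean(eigvals[:n_noise_eigs].real))
```

For a strong sinusoid (rank-2 in the lagged space) buried in white noise, the smallest eigenvalues recover the noise power even though the observed series is dominated by the "speech" component.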