Paul Dalsgaard
Aalborg University
Publications
Featured research published by Paul Dalsgaard.
Speech Communication | 2005
Zheng-Hua Tan; Paul Dalsgaard; Børge Lindberg
Abstract The past decade has witnessed a growing interest in deploying automatic speech recognition (ASR) in communication networks. Networks such as wireless networks present a number of challenges due to, e.g., bandwidth constraints and transmission errors. The introduction of distributed speech recognition (DSR) largely eliminates the bandwidth limitations, and the presence of transmission errors becomes the key robustness issue. This paper reviews the techniques that have been developed for ASR robustness against transmission errors. In the paper, a model of network degradations and robustness techniques is presented. These techniques are classified into three categories: error detection, error recovery and error concealment (EC). A one-frame error detection scheme is described and compared with a frame-pair scheme. As opposed to vector-level techniques, a technique for error detection and EC at the sub-vector level is presented. A number of error recovery techniques such as forward error correction and interleaving are discussed, in addition to a review of both feature-reconstruction and ASR-decoder based EC techniques. To enable the comparison of some of these techniques, evaluation has been conducted on the basis of the same speech database and channel. Special attention is given to the unique characteristics of DSR as compared to streaming audio, e.g. voice-over-IP. Additionally, a technique for adapting ASR to the varying quality of networks is presented. The frame-error-rate is here used to adjust the discrimination threshold with the goal of optimising out-of-vocabulary detection. The paper concludes with a discussion of the applicability of different techniques based on the channel characteristics and the system requirements.
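One of the error-recovery techniques surveyed, interleaving, can be sketched in a few lines. The following is a generic block interleaver for illustration only (not the specific scheme evaluated in the paper): frames are written row-wise into a depth × width matrix and read out column-wise before transmission, so a burst of consecutive losses on the channel is spread into isolated single-frame losses after deinterleaving at the server, which are easier to conceal.

```python
def interleave(frames, depth):
    """Block-interleave a sequence of feature frames (write row-wise,
    read column-wise); padding with None fills an incomplete block."""
    pad = (-len(frames)) % depth
    s = list(frames) + [None] * pad
    width = len(s) // depth
    return [s[r * width + c] for c in range(width) for r in range(depth)]

def deinterleave(frames, depth):
    """Inverse permutation: write column-wise, read row-wise."""
    width = len(frames) // depth
    return [frames[c * depth + r] for r in range(depth) for c in range(width)]
```

With depth 2 the stream `[0..7]` is sent as `[0, 4, 1, 5, 2, 6, 3, 7]`, so a two-frame burst on the channel hits frames that are four positions apart in the original order.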
international conference on acoustics, speech, and signal processing | 1994
O. Anderson; Paul Dalsgaard; William J. Barry
The research reported in this paper presents a method to identify poly- and mono-phonemes for four European languages. The functionality of the poly-phonemes is tested in two experiments, and a limited set of mono-phonemes is identified for a language-identification experiment. Ten acoustically similar speech sounds were identified across the four languages British English, Danish, German, and Italian. These sounds, which constitute a substantial proportion of the phonemes of each language, are designated as (language-independent) poly-phonemes, and may serve as a multi-lingual training base for labelling and recognition systems. The remaining sounds of each language, which do not fulfil the similarity conditions, are dubbed mono-phonemes. Two application experiments were conducted. In the first, the poly-phonemes are applied in a label alignment task. In the second, a small selection of mono-phonemes for each of the four languages is used in a preliminary test of the ability of these sets to serve as language discriminators.
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Zheng-Hua Tan; Paul Dalsgaard; Børge Lindberg
In this paper, the temporal correlation of speech is exploited in front-end feature extraction, client-based error recovery, and server-based error concealment (EC) for distributed speech recognition. First, the paper investigates a half frame rate (HFR) front-end that uses double frame shifting at the client side. At the server side, each HFR feature vector is duplicated to construct a full frame rate (FFR) feature sequence. This HFR front-end gives comparable performance to the FFR front-end but contains only half the FFR features. Second, different arrangements of the other half of the FFR features create a set of error recovery techniques encompassing multiple description coding and interleaving schemes, where interleaving has the advantage of not introducing a delay when there are no transmission errors. Third, a subvector-based EC technique is presented in which error detection and concealment are conducted at the subvector level, as opposed to conventional techniques where an entire vector is replaced even though only a single bit error occurs. The subvector EC is further combined with weighted Viterbi decoding. Encouraging recognition results are observed for the proposed techniques. Lastly, to understand the effects of applying various EC techniques, this paper introduces three approaches consisting of speech-feature, dynamic-programming-distance, and hidden Markov model state-duration comparisons.
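The half-frame-rate idea described above can be sketched in a few lines (a minimal illustration only, assuming feature vectors are the rows of a NumPy array):

```python
import numpy as np

def client_hfr(ffr_features):
    """Client side: double the frame shift, i.e. keep every second
    full-frame-rate vector, halving the transmitted feature count."""
    return ffr_features[::2]

def server_reconstruct(hfr_features):
    """Server side: duplicate each received HFR vector to rebuild a
    full-frame-rate sequence for the recogniser."""
    return np.repeat(hfr_features, 2, axis=0)
```

The reconstruction exploits the temporal correlation of speech: adjacent frames are similar enough that duplication gives performance comparable to transmitting the full sequence.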
international conference on spoken language processing | 1996
Bojan Petek; Ove Kjeld Andersen; Paul Dalsgaard
The results from applying an improved algorithm to the task of automatic segmentation of spontaneous telephone-quality speech are presented and compared with the results obtained after superimposing white noise. Three segmentation algorithms are compared, all based on variants of the Spectral Variation Function (SVF). Experimental results are obtained on the OGI multi-language telephone speech corpus (OGI-TS). We show that the use of the auditory forward and backward masking effects prior to the SVF computation increases the robustness of the algorithm to white noise. When the average signal-to-noise ratio (SNR) is decreased to 10 dB, the peak ratio (defined as the ratio of the number of peaks measured at the target over the original SNRs) is increased by 16%, 12%, and 11% for the MFC (Mel Frequency Cepstra), RASTA (Relative Spectral Processing), and FBDYN (Forward-Backward Auditory Masking Dynamic Cepstra) SVF segmentation algorithms, respectively.
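The boundary-detection principle behind an SVF-based segmenter can be sketched as follows. This is a generic illustration under simplifying assumptions (Euclidean distance between successive cepstral frames, naive peak picking); the algorithms compared in the paper differ in their front-end processing:

```python
import numpy as np

def svf(cepstra):
    """Spectral variation: Euclidean distance between successive
    cepstral frames; segment boundaries show up as peaks."""
    d = np.diff(cepstra, axis=0)
    return np.sqrt((d ** 2).sum(axis=1))

def pick_boundaries(curve, threshold):
    """Local maxima above a threshold are taken as segment boundaries."""
    return [i for i in range(1, len(curve) - 1)
            if curve[i] > threshold
            and curve[i] >= curve[i - 1] and curve[i] > curve[i + 1]]
```

On a toy signal that jumps from one spectral shape to another, the single peak in the SVF curve marks the transition frame.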
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Haitian Xu; Paul Dalsgaard; Zheng-Hua Tan; Børge Lindberg
A condition-dependent training strategy divides a training database into a number of clusters, each corresponding to a noise condition, and subsequently trains a hidden Markov model (HMM) set for each cluster. This paper investigates and compares a number of condition-dependent training strategies in order to achieve a better understanding of the effects on automatic speech recognition (ASR) performance caused by a splitting of the training databases. The effect of mismatches in signal-to-noise ratio (SNR) is also analyzed. The results show that a splitting of the training material in terms of both noise type and SNR value is advantageous compared to previously used methods, and that training only a limited number of HMM sets per noise type is sufficient for robust handling of SNR mismatches. This leads to the introduction of an SNR and noise-classification-based training strategy (SNT-SNC). Better ASR performance is obtained on test material containing data from known noise types than with either multicondition training or noise-type-dependent training strategies. The computational complexity of the SNT-SNC framework is kept low by choosing only one HMM set for recognition. The HMM set is chosen on the basis of results from noise classification and SNR estimation. However, compared to other strategies, the SNT-SNC framework shows lower performance for unknown noise types. This problem is partly overcome by introducing a number of model- and feature-domain techniques. Experiments using both artificially corrupted and real-world noisy speech databases are conducted and demonstrate the effectiveness of these methods.
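The model-selection step of such a strategy can be sketched roughly as follows. The nearest-SNR rule and all names are illustrative assumptions, not the paper's exact procedure: after noise classification and SNR estimation, a single pre-trained HMM set is picked, keeping decoding cost at the level of a single-model system.

```python
def select_hmm_set(model_sets, noise_type, snr_db):
    """model_sets maps noise type -> {training SNR in dB: HMM set}.
    Given the classified noise type and estimated SNR, return the one
    HMM set trained at the nearest SNR for that noise type."""
    per_type = model_sets[noise_type]
    nearest_snr = min(per_type, key=lambda s: abs(s - snr_db))
    return per_type[nearest_snr]
```

Only a few SNR points per noise type are needed if each trained set is robust to moderate SNR mismatch, which matches the paper's observation that a limited number of HMM sets per noise type suffices.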
Computer Speech & Language | 1992
Paul Dalsgaard
Abstract A two-stage approach to phoneme label alignment is presented. A self-organizing neural network is employed in the first stage. The second stage performs the label alignment of an independently given input phoneme string to the corresponding speech signal. The first stage transforms signal parameters into a set of continuously valued acoustic-phonetic features. The second stage uses the Viterbi decoding/level building technique to position the label boundaries. The validity of the feature transformation approach in stage one is demonstrated in a detailed experimental analysis, the results of which are used to derive a multi-dimensional probability density model for all individual phonemes. These models are used in the second stage label alignment process. Results are given in two parts. The first provides the experimental evidence to support the use of probability density functions based on acoustic-phonetic features, in the form of histograms for a number of vocalic and consonantal Danish and British English phonemes. The second gives the results from the label alignment process. Here, differences between reference time boundaries from a manually labelled test speech corpus and time boundaries from the alignment process are presented in histograms showing the label alignment time differences for a number of selected phoneme pairs for Danish and British English. The results show an overall accuracy of the label alignment of 85% and 43% for Danish and British English, respectively.
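The boundary-placement principle of the second stage can be illustrated with a small monotone dynamic-programming alignment. This is a simplification of Viterbi/level building, assuming per-frame phoneme log-likelihoods are already available from the first stage's models:

```python
import numpy as np

def align(loglik):
    """loglik[p, t]: log-likelihood of phoneme p in frame t.  Finds the
    best monotone left-to-right path (each phoneme covers >= 1 frame)
    and returns the frame index at which each non-final phoneme ends."""
    P, T = loglik.shape
    score = np.full((P, T), -np.inf)
    back = np.zeros((P, T), dtype=int)
    score[0, 0] = loglik[0, 0]
    for t in range(1, T):
        score[0, t] = score[0, t - 1] + loglik[0, t]
        for p in range(1, P):
            stay, enter = score[p, t - 1], score[p - 1, t - 1]
            back[p, t] = p if stay >= enter else p - 1
            score[p, t] = max(stay, enter) + loglik[p, t]
    # backtrack to recover the phoneme-transition frames
    boundaries, p = [], P - 1
    for t in range(T - 1, 0, -1):
        prev = back[p, t]
        if prev != p:
            boundaries.append(t - 1)  # frame where phoneme `prev` ends
        p = prev
    return boundaries[::-1]
```

On a toy two-phoneme utterance whose likelihoods switch midway, the recovered boundary lands on the last frame of the first phoneme.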
international conference on acoustics, speech, and signal processing | 1991
Paul Dalsgaard; Ove Kjeld Andersen; William J. Barry
In previous work on label alignment, encouraging results were obtained using selected acoustic-phonetic features to model the individual speech phonemes. Selection was based on minimal covariance between features on the one hand, and the inclusion of features underlying critical phonological oppositions on the other. In the present work, principal component analysis was applied to give a number of uncorrelated output parameters which maximally exploit the discriminatory power of the features and are derived independently of the phonological functionality. Results of label alignment on three different European languages, Danish, English, and Italian, using different numbers of principal parameters show that the accuracy with ten parameters is at least as good as with 15 manually selected features. The best result is found for British English, which has 78% of its phoneme transition boundaries positioned within ±20 ms of manually placed reference boundaries.
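The decorrelation step can be sketched with a plain eigendecomposition-based PCA (a generic illustration, not the paper's exact feature set):

```python
import numpy as np

def pca_transform(features, n_components):
    """Project feature vectors (rows) onto their first principal
    components, yielding mutually uncorrelated output parameters."""
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return centered @ eigvecs[:, order]
```

Because the projection axes are eigenvectors of the covariance matrix, the output parameters are uncorrelated by construction, which is exactly the property exploited in place of the manual feature selection.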
Biennial on DSP for in-Vehicle and Mobile Systems | 2007
Haitian Xu; Zheng-Hua Tan; Paul Dalsgaard; Ralf Mattethat; Børge Lindberg
The growth in wireless communication and mobile devices has supported the development of distributed speech recognition (DSR) technology. During the last decade this has led to the establishment of ETSI-DSR standards and an increased interest in research aimed at systems exploiting DSR. So far, however, DSR-based systems executing on mobile devices are only in their infancy. One of the reasons is the lack of easy-to-use software development packages. This chapter presents a prototype version of a configurable DSR system for the development of speech enabled applications on mobile devices.
IEEE Signal Processing Letters | 2008
Haitian Xu; Zheng-Hua Tan; Paul Dalsgaard; Børge Lindberg
The nonlocal means (NL-means) algorithm recently proposed for image denoising has proved highly effective at removing additive noise while to a large extent maintaining image details. The algorithm performs denoising by averaging each pixel with other pixels that have similar characteristics in the image. This letter considers the real and imaginary parts of the complex speech spectrogram each as a separate image and applies a modified NL-means algorithm to them for denoising in order to improve the noise robustness of speech recognition. Recognition results on a noisy speech database show that the proposed method is superior to classical methods such as spectral subtraction.
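The core of NL-means can be sketched as follows. This is a generic, unoptimised pixel-domain version for illustration; the letter applies a modified variant to the real and imaginary spectrogram images:

```python
import numpy as np

def nl_means(img, patch=1, search=3, h=0.5):
    """Minimal NL-means: each pixel is replaced by a weighted average of
    pixels in a search window, with weights derived from the similarity
    of the surrounding patches (more similar patch -> larger weight)."""
    H, W = img.shape
    padded = np.pad(img, patch, mode='reflect')
    out = np.zeros_like(img, dtype=float)
    for i in range(H):
        for j in range(W):
            ref = padded[i:i + 2 * patch + 1, j:j + 2 * patch + 1]
            wsum, acc = 0.0, 0.0
            for di in range(-search, search + 1):
                for dj in range(-search, search + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < W:
                        cand = padded[ii:ii + 2 * patch + 1,
                                      jj:jj + 2 * patch + 1]
                        w = np.exp(-((ref - cand) ** 2).mean() / h ** 2)
                        wsum += w
                        acc += w * img[ii, jj]
            out[i, j] = acc / wsum
    return out
```

Averaging over similar patches rather than merely adjacent pixels is what lets the method suppress additive noise without blurring genuine structure, the property the letter exploits on spectrograms.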
international conference on acoustics, speech, and signal processing | 2003
Zheng-Hua Tan; Paul Dalsgaard; Børge Lindberg
This paper presents research on two aspects of distributed speech recognition (DSR) in the presence of channel transmission errors in wireless network environments. The first is on experiments with a frame-based channel error protection scheme, where in previous research we reported results from experiments using randomly distributed bit-errors. This paper presents results from experiments using three additional, more realistic error distributions: burst-like packet loss, GSM error patterns and UMTS statistics. The second is on exploiting the knowledge about channel transmission errors for the purpose of optimising the Out-of-Vocabulary (OOV) detection. Transmission errors influence the acoustic likelihood, and therefore affect the optimal threshold setting for discrimination between In-Vocabulary (IV) words and OOV words. An OOV-detection method is proposed in which the estimated Frame-Error-Rate (FER) is used to adjust the discrimination threshold. Results from experiments are reported over a range of transmission errors.
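The FER-adaptive OOV detection can be sketched roughly as follows. The linear mapping and constants are illustrative assumptions; the paper's point is only that the threshold should be relaxed as the estimated frame-error-rate grows, since transmission errors depress acoustic likelihoods:

```python
def oov_threshold(base_threshold, fer, slope=0.5):
    """Lower the IV/OOV confidence threshold linearly with the
    estimated frame-error-rate (illustrative mapping)."""
    return base_threshold - slope * fer

def is_in_vocabulary(confidence, fer, base_threshold=0.7):
    """Accept a hypothesis as in-vocabulary if its confidence clears
    the FER-adjusted threshold."""
    return confidence >= oov_threshold(base_threshold, fer)
```

A hypothesis with confidence 0.6 would be rejected as OOV on a clean channel but accepted when a 30% FER is estimated, reflecting the degraded likelihoods.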