Sabine Deligne
IBM
Publications
Featured research published by Sabine Deligne.
sensor array and multichannel signal processing workshop | 2002
Sabine Deligne; Gerasimos Potamianos; Chalapathy Neti
We introduce a non-linear enhancement technique called audio-visual codebook dependent cepstral normalization (AVCDCN) and we consider its use with both audio-only and audio-visual speech recognition. AVCDCN is inspired by CDCN, an audio-only enhancement technique that approximates the non-linear effect of noise on speech with a piecewise constant function. Our experiments show that the use of visual information in AVCDCN allows significant performance gains over CDCN.
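The abstract gives no pseudocode; as a rough illustration of the codebook-dependent idea behind CDCN and AVCDCN, the sketch below softly assigns each noisy cepstral frame to a codeword and subtracts the posterior-weighted correction. The names (`cdcn_enhance`, `corrections`) are hypothetical, and AVCDCN would compute the posteriors from joint audio-visual features rather than audio alone.

```python
import numpy as np

def cdcn_enhance(noisy_cepstra, codewords, corrections, sigma=1.0):
    """Piecewise-constant cepstral enhancement in the spirit of CDCN.

    Each noisy frame is softly assigned to codewords; the estimated
    clean frame is the noisy frame minus the posterior-weighted
    correction attached to each codeword.
    """
    enhanced = []
    for x in noisy_cepstra:                        # x: (D,) noisy cepstral frame
        d2 = np.sum((codewords - x) ** 2, axis=1)  # squared distance to each codeword
        logp = -0.5 * d2 / (sigma ** 2)
        p = np.exp(logp - logp.max())
        p /= p.sum()                               # posterior over codewords
        enhanced.append(x - p @ corrections)       # subtract expected correction
    return np.array(enhanced)

# Toy usage: 4 codewords in a 13-dimensional cepstral space.
rng = np.random.default_rng(0)
codewords = rng.normal(size=(4, 13))
corrections = 0.1 * rng.normal(size=(4, 13))
clean_est = cdcn_enhance(rng.normal(size=(100, 13)), codewords, corrections)
```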
NATO advanced study institute on computational models of speech pattern processing | 1999
Gérard Chollet; Jan Cernocký; Andrei Constantinescu; Sabine Deligne; Frédéric Bimbot
The models used in current automatic speech recognition (or synthesis) systems generally rely on a representation based on phonetic symbols. The phonetic transcription of a word can be seen as an intermediate representation between the acoustic and the linguistic levels, but the a priori choice of phonemes (or phone-like units) can be questioned, as it is probably not optimal. Moreover, the phonetic representation has the drawback of being strongly language-dependent, which partly prevents the reuse of acoustic resources across languages. In this article, we present and develop the concept of ALISP (Automatic Language Independent Speech Processing), a general methodology that consists in inferring the intermediate representation between the acoustic and the linguistic levels from speech and linguistic data rather than from a priori knowledge, with as little supervision as possible. We describe the benefits that can be expected from developing the ALISP approach, together with the key issues to be solved. We also present preliminary experiments that can be viewed as first steps towards the ALISP goal.
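As a very crude illustration of data-driven unit inference in the ALISP spirit (this is not the authors' pipeline, which involves more elaborate segmentation and modeling), the sketch below vector-quantizes acoustic frames with k-means and merges runs of identical labels into pseudo-units:

```python
import numpy as np
from itertools import groupby

def discover_units(frames, n_units=64, n_iter=20, seed=0):
    """Toy data-driven unit inventory: k-means over acoustic frames,
    then run-length merging of consecutive identical labels."""
    rng = np.random.default_rng(seed)
    centers = frames[rng.choice(len(frames), n_units, replace=False)].copy()
    for _ in range(n_iter):
        # assign every frame to its nearest center
        labels = ((frames[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(n_units):
            pts = frames[labels == k]
            if len(pts):
                centers[k] = pts.mean(0)
    # consecutive frames sharing a label form one pseudo-unit
    return [k for k, _ in groupby(labels)]

# Toy usage on random "features"; real input would be e.g. MFCC frames.
units = discover_units(np.random.default_rng(1).normal(size=(500, 13)), n_units=8)
```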
international conference on acoustics, speech, and signal processing | 2001
Sabine Deligne; Benoît Maison; Ramesh A. Gopinath
We present a scheme for the acoustic modeling of speech recognition applications requiring dynamic vocabularies. It applies especially to the acoustic modeling of out-of-vocabulary words that need to be added to a recognition lexicon based on the observation of a few (say one or two) speech utterances of these words. Standard approaches to this problem derive a single pronunciation from each speech utterance by combining acoustic and phone transition scores. In our scheme, multiple pronunciations are generated from each speech utterance of a word to be enrolled by varying the relative weights assigned to the acoustic and phone transition models. In our experiments, the use of these multiple baseforms dramatically outperforms the standard approach, with a relative decrease in word error rate ranging from 20% to 40% on all our test sets.
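A minimal sketch of the multiple-baseform idea, assuming a phone decoder `decode(utterance, acoustic_weight=...)` that returns a phone string; both the function and its signature are placeholders, not a real API:

```python
def enroll_pronunciations(utterance, decode, weights=(0.5, 1.0, 2.0)):
    """Generate several candidate baseforms from one enrollment utterance
    by sweeping the acoustic-vs-phone-transition weight, then dedupe."""
    baseforms = []
    for w in weights:
        phones = decode(utterance, acoustic_weight=w)
        if phones not in baseforms:   # keep distinct pronunciations only
            baseforms.append(phones)
    return baseforms
```

Sweeping the weight shifts the balance between the acoustic scores and the phone transition model, so each setting can surface a different plausible pronunciation of the enrolled word.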
international conference on acoustics, speech, and signal processing | 2005
Vaibhava Goel; Hong-Kwang Jeff Kuo; Sabine Deligne; Cheng Wu
Conventional methods for training statistical models for automatic speech recognition, such as acoustic and language models, have focused on criteria such as maximum likelihood and sentence or word error rate (WER). However, unlike dictation systems, the goal for spoken dialogue systems is to understand the meaning of what a person says, not to get every word correctly transcribed. For such systems, we propose to optimize the statistical models under end-to-end system performance criteria. We illustrate this principle by focusing on the estimation of the language model (LM) component of a natural language call routing system. This estimation, carried out under a conditional maximum likelihood objective, aims at optimizing the call routing (classification) accuracy, which is often the criterion of interest in these systems. LM updates are derived using the extended Baum-Welch procedure (Gopalakrishnan et al., 1991). In our experiments, we find that our estimation procedure leads to a small but promising gain in classification accuracy. Interestingly, the estimated language models also lead to an increase in the word error rate while improving the classification accuracy, showing that the system with the best classification accuracy is not necessarily the one with the lowest WER. Significantly, our LM estimation procedure does not require the correct transcription of the training data, and can therefore be applied to unsupervised learning from untranscribed speech data.
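For discrete parameters such as n-gram probabilities, the extended Baum-Welch re-estimate cited above typically takes the following form (reconstructed from the general EBW literature, not copied from this paper):

```latex
\hat{p}(w \mid h) =
  \frac{c_{\text{num}}(h,w) - c_{\text{den}}(h,w) + D\,p(w \mid h)}
       {\sum_{v}\bigl[c_{\text{num}}(h,v) - c_{\text{den}}(h,v)\bigr] + D}
```

Here c_num and c_den are expected counts collected from the correct-class and competing-class hypotheses, and D is a smoothing constant chosen large enough to keep all updated probabilities positive.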
international conference on acoustics, speech, and signal processing | 2003
Sabine Deligne; Lidia Mangu
In this paper, we explore the use of lattices to generate pronunciations for speech recognition based on the observation of a few (say one or two) speech utterances of a word. Various search strategies are investigated in combination with schemes where single or multiple pronunciations are generated for each speech utterance. In our experiments, a strategy that combines merging time-overlapping links in a context-dependent subphone lattice and generating multiple pronunciations provides the best recognition accuracy. This results in average relative gains of 30% over the generation of single pronunciations using a Viterbi search.
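As an illustration of what merging time-overlapping links might look like (a simplified stand-in for the paper's lattice processing, with made-up tuple conventions):

```python
def merge_overlapping_links(links, min_overlap=0.5):
    """Greedy merge of time-overlapping, same-label lattice links.

    links: list of (label, start, end, score) tuples. Two links with the
    same label are merged when their temporal overlap exceeds
    `min_overlap` of the shorter link; the merged link keeps the union
    span and the better score.
    """
    merged = []
    for lab, s, e, sc in sorted(links, key=lambda l: l[1]):
        for i, (mlab, ms, me, msc) in enumerate(merged):
            if lab != mlab:
                continue
            overlap = min(e, me) - max(s, ms)
            if overlap > min_overlap * min(e - s, me - ms):
                merged[i] = (lab, min(s, ms), max(e, me), max(sc, msc))
                break
        else:
            merged.append((lab, s, e, sc))
    return merged

# Toy usage: the two overlapping "ah" links collapse into one.
print(merge_overlapping_links([("ah", 0.0, 0.4, -3.1),
                               ("ah", 0.1, 0.5, -2.7),
                               ("k", 0.5, 0.8, -1.9)]))
```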
sensor array and multichannel signal processing workshop | 2002
Sabine Deligne; Ramesh A. Gopinath
We address the problem of blind separation of convolutive mixtures of spatially and temporally independent sources modeled with mixtures of Gaussians. We present an EM algorithm to compute maximum likelihood estimates of both the separating filters and the source density parameters, whereas, in the state of the art, separating filters are usually estimated with gradient descent techniques. The use of the EM algorithm, as opposed to the usual gradient descent techniques, does not require the empirical tuning of a learning rate, and thus can be expected to provide a more stable convergence. In addition, we show how multichannel autoregressive spectral estimation techniques can be used in order to initialize the EM algorithm properly. We demonstrate the efficiency of our EM algorithm together with the proposed initialization scheme by reporting on simulations with artificial mixtures.
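The initialization idea can be sketched concretely: fit a multichannel autoregressive model to the mixtures by least squares and use the resulting prediction-error filter as the initial separating filter. The function below is one plausible reading of that step, not the paper's exact recipe:

```python
import numpy as np

def mc_ar_init(x, order=5):
    """Least-squares multichannel AR fit; the prediction-error filter
    serves as an initial separating filter.

    x: (T, C) observed convolutive mixtures.
    Returns filter taps B of shape (order + 1, C, C) with B[0] = I,
    so that y[t] = sum_k B[k] @ x[t - k] is the prediction error.
    """
    T, C = x.shape
    # Regression: x[t] ~ sum_{k=1..order} A_k @ x[t - k]
    X = np.hstack([x[order - k : T - k] for k in range(1, order + 1)])
    Y = x[order:]
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)   # (C * order, C)
    B = np.zeros((order + 1, C, C))
    B[0] = np.eye(C)
    for k in range(1, order + 1):
        B[k] = -A[(k - 1) * C : k * C].T        # prediction-error taps
    return B

# Toy usage on a random two-channel signal.
B0 = mc_ar_init(np.random.default_rng(0).normal(size=(1000, 2)), order=3)
```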
ieee automatic speech recognition and understanding workshop | 2001
Sabine Deligne; Ramesh A. Gopinath
We address the issue of speech recognition in the presence of interfering signals, in cases where the signals corrupting the speech are recorded in separate channels. We propose to combine a trivial form of filtering with MCDCN, a multi-channel version of codebook dependent cepstral normalization, where the cepstra of the noise are estimated from the reference signals. We report on recognition experiments in a car where the speech signal is corrupted by radio talk or CD music played over the car speakers. Our approach allows relative word error rate reductions in the range of 70-90% compared to a no-compensation baseline, at a relatively low computational cost.
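A toy rendition of such a front end, assuming a single coupling gain `alpha` for the "trivial filtering" step; both the gain and the helper names are made up, and the real system would estimate the filtering and feed the noise cepstra into CDCN-style compensation:

```python
import numpy as np

def cepstra(signal, frame=256, hop=128, n_ceps=13):
    """Frame-wise real cepstra: inverse DFT of the log magnitude spectrum."""
    n = 1 + (len(signal) - frame) // hop
    frames = np.stack([signal[i * hop : i * hop + frame] for i in range(n)])
    logmag = np.log(np.abs(np.fft.rfft(frames * np.hanning(frame))) + 1e-8)
    return np.fft.irfft(logmag)[:, :n_ceps]

def mcdcn_front_end(mic, ref, alpha=0.9):
    """Remove a scaled copy of the reference channel from the microphone
    signal, and estimate the noise cepstra from the reference channel."""
    primary = mic - alpha * ref
    noise_cep = cepstra(ref).mean(axis=0)
    return primary, noise_cep
```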
international conference on acoustics, speech, and signal processing | 2004
Sabine Deligne; Satya Dharanipragada
We consider the problem of enrollment for low-resource speech recognition systems designed for noisy environments. Noise robustness concerns, memory and computational constraints, along with the use of compact acoustic models for fast Gaussian computation, make adaptation especially challenging. We derive a maximum a posteriori (MAP) algorithm especially designed for the fast off-line adaptation of these compact acoustic models. It requires less computation and memory than standard feature-space maximum likelihood linear regression (FMLLR), which is another technique well suited for compact acoustic models. In our speaker enrollment experiments for in-car speech recognition, we present a computationally efficient procedure to simulate noisy conditions with the adaptation data. In these experiments, MAP compares favorably with FMLLR in terms of recognition accuracy. Moreover, combining FMLLR and MAP significantly outperforms each technique individually, thus providing an efficient alternative for systems with larger resources.
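The abstract does not spell out the update; the textbook MAP mean re-estimate below shows the general shape of such an adaptation step (the paper's variant for compact models may differ):

```python
import numpy as np

def map_update_means(means, stats_gamma, stats_x, tau=10.0):
    """MAP re-estimation of Gaussian means.

    means:       (K, D) prior (speaker-independent) means
    stats_gamma: (K,)   per-Gaussian occupancy counts on the adaptation data
    stats_x:     (K, D) per-Gaussian sums of adaptation frames
    tau:         prior weight; large tau keeps the update close to the prior
    """
    return (tau * means + stats_x) / (tau + stats_gamma[:, None])
```

With little adaptation data the occupancies stay small and the means barely move, which is what makes MAP well behaved for fast enrollment.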
Archive | 2001
Sabine Deligne; François Yvon; Frederic Bimbot
Language can be viewed as the result of a complex encoding process which maps a message into a stream of symbols: phonemes, graphemes, morphemes, words, and so on, depending on the level of representation. At each level of representation, specific constraints, such as phonotactic, morphological, or grammatical constraints, apply, greatly reducing the possible combinations of symbols and introducing statistical dependencies between them. Numerous probabilistic models have been developed in the area of speech and language processing to capture these dependencies. In this chapter, we explore the potential of the multigram model to learn variable-length dependencies in strings of phonemes and in strings of graphemes. In the multigram approach described here, a string of symbols is viewed as a concatenation of independent variable-length subsequences of symbols. The ability of the multigram model to learn relevant subsequences of phonemes is illustrated by the selection of multiphone units for speech synthesis.
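A minimal sketch of the decoding half of the multigram model: a Viterbi-style dynamic program that segments a symbol string into its most probable sequence of independent variable-length subsequences. The inventory `probs` is hand-set here; the full model re-estimates it with EM:

```python
import math

def multigram_segment(symbols, probs, max_len=4):
    """Segment `symbols` into the multigram sequence maximizing the
    product of subsequence probabilities. `probs` maps tuples of
    symbols to probabilities and must cover the whole string."""
    n = len(symbols)
    best = [0.0] + [-math.inf] * n       # best log-prob of each prefix
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for l in range(1, min(max_len, i) + 1):
            seq = tuple(symbols[i - l : i])
            if seq in probs:
                score = best[i - l] + math.log(probs[seq])
                if score > best[i]:
                    best[i], back[i] = score, i - l
    cuts, i = [], n                      # backtrace the best segmentation
    while i > 0:
        cuts.append(symbols[back[i] : i])
        i = back[i]
    return cuts[::-1]

# Toy usage: segment a phoneme-like string with a hand-set inventory.
probs = {('h', 'e'): 0.2, ('l', 'l', 'o'): 0.1, ('l',): 0.3, ('o',): 0.3}
print(multigram_segment(list("hello"), probs))  # [['h','e'], ['l','l','o']]
```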
Archive | 2001
Sabine Deligne; Ramesh A. Gopinath