Gilles Adda
Centre national de la recherche scientifique
Publications
Featured research published by Gilles Adda.
Computer Speech & Language | 2002
Lori Lamel; Jean-Luc Gauvain; Gilles Adda
The last decade has witnessed substantial progress in speech recognition technology, with today's state-of-the-art systems being able to transcribe unrestricted broadcast news audio data with a word error rate of about 20%. However, acoustic model development for these recognizers relies on the availability of large amounts of manually transcribed training data. Obtaining such data is both time-consuming and expensive, requiring trained human annotators and substantial amounts of supervision. This paper describes some recent experiments using lightly supervised and unsupervised techniques for acoustic model training in order to reduce the system development cost. The approach uses a speech recognizer to transcribe unannotated broadcast news data from the DARPA TDT-2 corpus. The hypothesized transcription is optionally aligned with closed captions or transcripts to create labels for the training data. Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for the success of the approach, and that the acoustic models can be initialized with as little as 10 minutes of manually annotated data. These experiments demonstrate that light or no supervision can dramatically reduce the cost of building acoustic models.
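As an informal illustration of the lightly supervised labeling step this abstract describes, the sketch below aligns a recognizer hypothesis against closed captions and keeps only the word runs on which both agree as training labels. The data structures and the minimum-run threshold are assumptions for the example, not the authors' implementation.

```python
# Hedged sketch: keep only hypothesis/caption agreements as acoustic training labels.
from difflib import SequenceMatcher

def lightly_supervised_labels(hyp_words, caption_words, min_run=3):
    """Return runs of words where the recognizer hypothesis and captions agree."""
    matcher = SequenceMatcher(a=hyp_words, b=caption_words, autojunk=False)
    kept = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_run:  # discard very short, likely accidental matches
            kept.append(hyp_words[block.a:block.a + block.size])
    return kept

hyp = "the president said on tuesday that markets were calm".split()
cc = "president said tuesday that markets were calm".split()
print(lightly_supervised_labels(hyp, cc))
# [['tuesday', 'that', 'markets', 'were', 'calm']]
```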
Computational Linguistics | 2011
Karën Fort; Gilles Adda; K. Bretonnel Cohen
Recently heard at a tutorial in our field: “It cost me less than one hundred bucks to annotate this using Amazon Mechanical Turk!” Assertions like this are increasingly common, but we believe they should not be stated so proudly; they ignore the ethical consequences of using MTurk (Amazon Mechanical Turk) as a source of labor. Manually annotating corpora or manually developing any other linguistic resource, such as a set of judgments about system outputs, represents such a high cost that many researchers are looking for alternative solutions to the standard approach. MTurk is becoming a popular one. However, as in any scientific endeavor involving humans, there is an unspoken ethical dimension involved in resource construction and system evaluation, and this is especially true of MTurk. We would like here to raise some questions about the use of MTurk. To do so, we will define precisely what MTurk is and what it is not, highlighting the issues raised by the system. We hope that this will point out opportunities for our community to deliberately value ethics above cost savings.
Communications of The ACM | 2000
Jean-Luc Gauvain; Lori Lamel; Gilles Adda
Some existing applications that could greatly benefit from new technology are the creation of and access to digital multimedia libraries (disclosure of the information content and content-based indexing, such as are under exploration in the OLIVE project), media monitoring services (selective dissemination of information based on automatic detection of topics of interest), as well as new emerging applications such as News on Demand and Internet watch services. Such applications are feasible due to the large technological progress made over the last decade, benefiting from advances in microelectronics that have facilitated the implementation of more complex models and algorithms. Automatic speech recognition is a key technology for audio and video indexing. Most of the linguistic information is encoded in the audio channel of video data, which, once transcribed, can be accessed in the same way as textual data.
international conference on spoken language processing | 1996
Lori Lamel; Gilles Adda
Creation of pronunciation lexicons for speech recognition is widely acknowledged to be an important but labor-intensive aspect of system development. Lexicons are often manually created and make use of knowledge and expertise that is difficult to codify. We describe our American English lexicon developed primarily for the ARPA WSJ/NAB tasks. The lexicon is phonemically represented and contains alternate pronunciations for about 10% of the words. Tools have been developed to add new lexical items, as well as to help ensure consistency of the pronunciations. Our experience in large vocabulary, continuous speech recognition is that systematic lexical design can improve system performance. Some comparative results with commonly available lexicons are given.
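The sketch below illustrates, with a toy phone inventory and made-up entries, the kind of phonemically represented lexicon with alternate pronunciations and consistency checking the abstract describes; it is not the actual LIMSI lexicon or its tooling.

```python
# Toy phone inventory and lexicon; entries and phone symbols are illustrative only.
PHONE_SET = {"ae", "ax", "b", "d", "dx", "eh", "er", "ey", "t"}

LEXICON = {
    "data":   [("d", "ey", "t", "ax"), ("d", "ae", "t", "ax")],   # alternate pronunciations
    "better": [("b", "eh", "dx", "er"), ("b", "eh", "t", "er")],  # flapped vs. unflapped /t/
}

def add_entry(lexicon, word, pron):
    """Add a pronunciation, rejecting phones outside the inventory (a consistency check)."""
    bad = [p for p in pron if p not in PHONE_SET]
    if bad:
        raise ValueError(f"unknown phones {bad} in entry for {word!r}")
    lexicon.setdefault(word, []).append(tuple(pron))

add_entry(LEXICON, "dad", ["d", "ae", "d"])
```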
Speech Communication | 1994
Jean-Luc Gauvain; Lori Lamel; Gilles Adda; Martine Adda-Decker
In this paper we report on progress made at LIMSI in speaker-independent large vocabulary speech dictation using newspaper-based speech corpora in English and French. The recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on newspaper texts for language modeling. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra- and interword), phone duration models, and sex-dependent models. For English the ARPA Wall Street Journal-based CSR corpus is used, and for French the BREF corpus containing recordings of texts from the French newspaper Le Monde is used. Experiments were carried out with both these corpora at the phone level and at the word level with vocabularies containing up to 20,000 words. Word recognition experiments are also described for the ARPA RM task, which has been widely used to evaluate and compare systems.
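As a minimal illustration of the n-gram language modeling step mentioned in the abstract, the sketch below estimates bigram statistics from raw text. A real system would add back-off and discounting; only the counting and maximum-likelihood estimate are shown, and the example sentences are invented.

```python
# Minimal bigram estimation sketch (no smoothing or back-off).
from collections import Counter

def bigram_probs(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

lm = bigram_probs(["the market rose", "the market fell"])
print(lm[("the", "market")])   # 1.0
print(lm[("market", "rose")])  # 0.5
```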
international conference on acoustics, speech, and signal processing | 2002
Lori Lamel; Jean-Luc Gauvain; Gilles Adda
This paper describes some recent experiments using unsupervised techniques for acoustic model training in order to reduce the system development cost. The approach uses a speech recognizer to transcribe unannotated raw broadcast news data. The hypothesized transcription is used to create labels for the training data. Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for success of the approach, and that the acoustic models can be initialized with as little as 10 minutes of manually annotated data. These experiments demonstrate that unsupervised training is a viable training scheme and can dramatically reduce the cost of building acoustic models.
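The outline below sketches, with placeholder train and transcribe functions, the bootstrap-and-retrain loop implied by this abstract: seed models from a few minutes of annotated speech, then repeatedly transcribe raw audio and retrain on the automatic transcripts. It is a schematic view under those assumptions, not the system's actual training procedure.

```python
# Schematic self-training loop; `train` and `transcribe` are placeholders
# supplied by the caller, not real toolkit functions.
def self_training(seed_data, raw_audio, train, transcribe, iterations=3):
    models = train(seed_data)  # e.g. models bootstrapped from ~10 minutes of manual transcripts
    for _ in range(iterations):
        auto_labels = [(utt, transcribe(models, utt)) for utt in raw_audio]
        models = train(seed_data + auto_labels)  # retrain on the automatic transcripts
    return models
```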
international conference on acoustics, speech, and signal processing | 1994
Jean-Luc Gauvain; Lori Lamel; Gilles Adda; Martine Adda-Decker
We report progress made at LIMSI in speaker-independent large vocabulary speech dictation using the ARPA Wall Street Journal-based CSR corpus. The recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. The recognizer uses a time-synchronous graph-search strategy which is shown to still be viable with vocabularies of up to 20,000 words when used with bigram back-off language models. A second forward pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra- and interword), phone duration models, and sex-dependent models. The recognizer has been evaluated in the Nov92 and Nov93 ARPA tests for vocabularies of up to 20,000 words.
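The sketch below illustrates the second-pass idea described above: candidate paths taken from the bigram word graph are rescored with a trigram language model and re-ranked. The trigram_logprob callable, the language model weight, and the score combination are assumptions for the example, not the system's actual scoring.

```python
# Hedged sketch of word-graph (lattice) rescoring with a trigram LM.
def rescore_paths(paths, acoustic_scores, trigram_logprob, lm_weight=10.0):
    """paths: list of word sequences; returns (score, path) pairs, best first."""
    def lm_score(words):
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        return sum(trigram_logprob(tuple(padded[i - 2:i + 1]))
                   for i in range(2, len(padded)))
    scored = [(ac + lm_weight * lm_score(p), p)
              for p, ac in zip(paths, acoustic_scores)]
    return sorted(scored, key=lambda x: x[0], reverse=True)
```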
international conference on acoustics, speech, and signal processing | 2001
Lori Lamel; Jean-Luc Gauvain; Gilles Adda
The last decade has witnessed substantial progress in speech recognition technology, with today's state-of-the-art systems being able to transcribe broadcast audio data with a word error rate of about 20%. However, acoustic model development for the recognizers requires large corpora of manually transcribed training data. Obtaining such data is both time-consuming and expensive, requiring trained human annotators and substantial amounts of supervision. We describe some experiments using different levels of supervision for acoustic model training in order to reduce the system development cost. The experiments have been carried out using the DARPA TDT-2 corpus (also used in the SDR99 and SDR00 evaluations). Our experiments demonstrate that light supervision is sufficient for acoustic model development, drastically reducing the development cost.
international conference on acoustics, speech, and signal processing | 2003
Jean-Luc Gauvain; Lori Lamel; Holger Schwenk; Gilles Adda; Langzhou Chen; Fabrice Lefèvre
This paper describes the development of a speech recognition system for the processing of telephone conversations, starting with a state-of-the-art broadcast news transcription system. We identify major changes and improvements in acoustic and language modeling, as well as decoding, which are required to achieve state-of-the-art performance on conversational speech. Some major changes on the acoustic side include the use of speaker normalization (VTLN), the need to cope with channel variability, and the need for efficient speaker adaptation and better pronunciation modeling. On the linguistic side the primary challenge is to cope with the limited amount of language model training data. To address this issue we make use of a data selection technique, and a smoothing technique based on a neural network language model. At the decoding level lattice rescoring and minimum word error decoding are applied. On the development data, the improvements yield an overall word error rate of 24.9% whereas the original BN transcription system had a word error rate of about 50% on the same data.
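As a small illustration of one of the language modeling techniques mentioned above, the sketch below linearly interpolates a back-off n-gram model with a neural network language model. The two probability functions and the interpolation weight are placeholders; in practice the weight would be tuned on held-out data.

```python
# Hedged sketch of LM smoothing by linear interpolation with a neural network LM.
def interpolated_prob(word, history, p_ngram, p_neural, lam=0.5):
    """Linear interpolation of two language model probability estimates."""
    return lam * p_neural(word, history) + (1.0 - lam) * p_ngram(word, history)
```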
IEEE Transactions on Audio, Speech, and Language Processing | 2006
Spyridon Matsoukas; Jean-Luc Gauvain; Gilles Adda; Thomas Colthurst; Chia-Lin Kao; Owen Kimball; Lori Lamel; Fabrice Lefèvre; Jeff Z. Ma; John Makhoul; Long Nguyen; Rohit Prasad; Richard M. Schwartz; Holger Schwenk; Bing Xiang
This paper describes the progress made in the transcription of broadcast news (BN) and conversational telephone speech (CTS) within the combined BBN/LIMSI system from May 2002 to September 2004. During that period, BBN and LIMSI collaborated in an effort to produce significant reductions in the word error rate (WER), as directed by the aggressive goals of the DARPA Effective, Affordable, Reusable Speech-to-Text (EARS) program. The paper focuses on general modeling techniques that led to recognition accuracy improvements, as well as engineering approaches that enabled efficient use of large amounts of training data and fast decoding architectures. Special attention is given to efforts to integrate components of the BBN and LIMSI systems, discussing the tradeoff between speed and accuracy for various system combination strategies. Results on the EARS progress test sets show that the combined BBN/LIMSI system achieved relative WER reductions of 47% and 51% on the BN and CTS domains, respectively.
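The sketch below shows a simplified word-level voting combination (ROVER-style) as one example of the family of system combination strategies the abstract refers to. Real combination aligns full hypotheses or confusion networks and weighs confidence scores; here the system outputs are assumed to be already aligned word-for-word.

```python
# Simplified word-level voting across already-aligned system outputs.
from collections import Counter

def vote(aligned_hyps):
    """aligned_hyps: list of equal-length word lists, one per system."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*aligned_hyps)]

print(vote([["the", "cat", "sat"], ["the", "cap", "sat"], ["a", "cat", "sat"]]))
# ['the', 'cat', 'sat']
```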