Yannick Estève
Université du Maine (Le Mans, France)
Publications
Featured research published by Yannick Estève.
International Conference on Acoustics, Speech, and Signal Processing | 2009
Vincent Jousse; Simon Petitrenaud; Sylvain Meignier; Yannick Estève; Christine Jacquin
In this paper, we consider the extraction of speaker identity from audio records of broadcast news without a priori acoustic information about speakers. Using an automatic speech recognition system and an automatic speaker diarization system, we present improvements to a method that extracts speaker identities from automatic transcripts and assigns them to speech segments. Experiments are carried out on French broadcast news records from the ESTER 1 evaluation campaign. Experimental results using the outputs of automatic speech recognition and automatic diarization are presented.
IEEE Transactions on Audio, Speech, and Language Processing | 2013
Benjamin Lecouteux; Georges Linarès; Yannick Estève; Guillaume Gravier
Combining automatic speech recognition (ASR) systems generally relies on the posterior merging of the outputs or on acoustic cross-adaptation. In this paper, we propose an integrated approach where the outputs of secondary systems are integrated into the search algorithm of a primary one. In this driven decoding algorithm (DDA), the secondary systems are viewed as observation sources that should be evaluated and combined with others by a primary search algorithm. DDA is evaluated on a subset of the ESTER I corpus consisting of 4 hours of French radio broadcast news. Results demonstrate that DDA significantly outperforms vote-based approaches: we obtain an improvement of 14.5% relative word error rate over the best single system, as opposed to the 6.7% obtained with a ROVER combination. An in-depth analysis of DDA shows its ability to improve robustness (gains are greater in adverse conditions) and a relatively low dependency on the search algorithm. Applying DDA to both decoders, including a beam-search-based one, yields similar performance.
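The core idea of driven decoding, biasing the primary system's hypothesis scores toward the auxiliary system's output, can be pictured with a toy rescoring sketch. This is only an illustration under assumed names and a made-up scoring scheme, not the decoder integration described in the paper, which operates inside the search algorithm itself.

# Toy sketch of the driven-decoding idea: a hypothesis score from the primary
# decoder is augmented with a bonus measuring agreement with the auxiliary
# system's one-best output. Function names, the agreement measure and the
# weight alpha are illustrative assumptions, not the paper's formulation.
from difflib import SequenceMatcher

def auxiliary_agreement(hyp_words, aux_words):
    """Fraction of the hypothesis that aligns with the auxiliary one-best."""
    matcher = SequenceMatcher(a=hyp_words, b=aux_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(hyp_words), 1)

def driven_score(decoder_score, hyp_words, aux_words, alpha=2.0):
    """Primary decoder score plus a weighted agreement bonus (alpha tuned on dev data)."""
    return decoder_score + alpha * auxiliary_agreement(hyp_words, aux_words)

aux_one_best = "the driven decoding algorithm combines systems".split()
hypotheses = [
    (-12.3, "the driven decoding algorithm combines systems".split()),
    (-11.9, "the given decoding algorithms combine systems".split()),
]
best_score, best_words = max(hypotheses,
                             key=lambda h: driven_score(h[0], h[1], aux_one_best))
print(" ".join(best_words))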
International Conference on Acoustics, Speech, and Signal Processing | 2008
Benjamin Lecouteux; Georges Linarès; Yannick Estève; Guillaume Gravier
The driven decoding algorithm (DDA) was initially proposed as an integrated approach for combining two automatic speech recognition (ASR) systems. It consists of guiding the search algorithm of a primary ASR system with the one-best hypothesis of an auxiliary system. In this paper, we generalize DDA to confusion-network-driven decoding and propose new schemes for multiple-system combination. While previous experiments involved two ASR systems on broadcast news data, the proposed extended DDA is evaluated using three ASR systems from different labs. Results show that generalized DDA significantly outperforms the ROVER method: we obtain a 15.7% relative word error rate improvement with respect to the best single system, as opposed to 8.5% with the ROVER combination.
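For comparison, the ROVER baseline mentioned above can be sketched as word-level voting over aligned system outputs. Real ROVER builds a word transition network by dynamic-programming alignment and can weight votes by confidence; the toy version below assumes the outputs are already aligned (None marking a missing word), purely to illustrate the voting step.

# Toy ROVER-style voting over pre-aligned ASR outputs; alignment and
# confidence weighting are omitted for brevity.
from collections import Counter

def rover_vote(aligned_outputs):
    """aligned_outputs: equal-length word lists, one per system (None = no word)."""
    consensus = []
    for slot in zip(*aligned_outputs):
        winner, _ = Counter(slot).most_common(1)[0]  # None may win (word deleted)
        if winner is not None:
            consensus.append(winner)
    return consensus

systems = [
    ["the", "cat", "sat",  None,   "down"],
    ["the", "cat", "sad", "right", "down"],
    ["a",   "cat", "sat",  None,   "down"],
]
print(" ".join(rover_vote(systems)))  # -> the cat sat down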
2006 IEEE Odyssey - The Speaker and Language Recognition Workshop | 2006
Julie Mauclair; Sylvain Meignier; Yannick Estève
Automatic speaker diarization consists of splitting the signal into homogeneous segments and clustering them by speaker. However, the speaker segments are assigned anonymous labels. This paper proposes a solution for identifying those speakers by extracting their full names as pronounced in French broadcast news. A semantic classification tree is automatically built on a training corpus and associates the full names detected in the transcription of a segment with that segment or with one of its neighbors. Then, a merging method associates a full name with each speaker cluster instead of the anonymous label provided by the diarization. The experiments are carried out on French broadcast news records from the ESTER 2005 evaluation campaign. About 70% of show duration is correctly processed for both the development and evaluation corpora. On the evaluation corpus, 18.2% of show duration is wrongly named and no decision is made for 11.9% of show duration.
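A drastically simplified sketch of the naming step follows: name occurrences found in the transcript are attributed to the current or a neighboring segment, and each anonymous cluster then receives the name most often attributed to it. A hand-written rule stands in for the semantic classification tree learned in the paper, and the data structures are assumptions made for illustration.

# Sketch: attribute detected full names to speaker clusters, then merge per
# cluster. The rule in attribute_name is a placeholder for the paper's
# automatically learned semantic classification tree.
from collections import Counter, defaultdict

def attribute_name(occurrence, segments, idx):
    """Decide whether a name pronounced in segment idx designates the current
    or the next speaker (the role played by the classification tree)."""
    if "je suis" in occurrence["left_context"]:        # self-introduction
        return segments[idx]["cluster"]
    if idx + 1 < len(segments):                        # introducing the next speaker
        return segments[idx + 1]["cluster"]
    return segments[idx]["cluster"]

def name_clusters(segments, occurrences):
    votes = defaultdict(Counter)
    for occ in occurrences:
        cluster = attribute_name(occ, segments, occ["segment_index"])
        votes[cluster][occ["full_name"]] += 1
    # Each anonymous cluster gets the full name most frequently attributed to it.
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}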
International Conference on Acoustics, Speech, and Signal Processing | 2007
Benjamin Lecouteux; Georges Linarès; Yannick Estève; Julie Mauclair
The combination of automatic speech recognition (ASR) systems generally relies on the a posteriori merging of system outputs or on cross-adaptation. In this paper, we propose an integrated approach where the search of a primary system is driven by the outputs of a secondary one. This method drives the primary system's search using the one-best hypotheses and the word posteriors gathered from the secondary system. Experiments are carried out within the experimental framework of the ESTER evaluation campaign (S. Galliano et al. 2005). Results show that the driven decoding algorithm significantly outperforms the two single ASR systems (an 8% relative WER reduction, 1.7% absolute). Finally, we investigate the interactions between driven decoding and cross-adaptation. The best cross-adaptation strategy in combination with the driven decoding process yields a final absolute gain of about 1.9% WER.
Affective Computing and Intelligent Interaction | 2015
Laurence Devillers; Sophie Rosset; Guillaume Dubuisson Duplessis; Mohamed A. Sehili; Lucile Bechade; Agnes Delaborde; Clément Gossart; Vincent Letard; Fan Yang; Yücel Yemez; Bekir Berker Turker; T. Metin Sezgin; Kevin El Haddad; Stéphane Dupont; Daniel Luzzati; Yannick Estève; Emer Gilmartin; Nick Campbell
As a remarkably effective way of showing amusement and engagement, laughter is one of the most important social markers in human interactions. Laughing together helps to set up a positive atmosphere and favors the creation of new relationships. This paper presents a data collection of social interaction dialogs involving humor between a human participant and a robot. In this work, interaction scenarios have been designed in order to study social markers such as laughter. They have been implemented within two automatic systems developed in the Joker project: a social dialog system using paralinguistic cues and a task-based dialog system using linguistic content. One of the major contributions of this work is to provide a context for studying human laughter produced during human-robot interaction. The collected data will be used to build a generic intelligent user interface which provides a multimodal dialog system with social communication skills, including humor and other informal, socially oriented behaviors. This system will emphasize the fusion of verbal and non-verbal channels for emotional and social behavior perception, interaction and generation capabilities.
IEEE Automatic Speech Recognition and Understanding Workshop | 2003
Christian Raymond; Yannick Estève; Frédéric Béchet; R. De Mori; Géraldine Damnati
The approach proposed is an alternative to the traditional architecture of spoken dialogue systems, where the system belief is either not taken into account during the automatic speech recognition process or included in the decoding process but never challenged. By representing all the conceptual structures handled by the dialogue manager as finite state machines and by building a conceptual model that contains all the possible interpretations of a given word graph, we propose a decoding architecture that first searches for the best conceptual interpretation before looking for the best string of words. Once both N-best sets (at the concept level and at the word level) are generated, a verification process is performed on each N-best set using acoustic and linguistic confidence measures. A first selection strategy, which does not yet include the dialogue context, is proposed, and significant error reductions on the understanding measures are obtained.
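The verification and selection step can be pictured with a small sketch: each candidate in the concept-level N-best set carries acoustic and linguistic confidence measures, and the strategy keeps the first candidate whose measures clear fixed thresholds, rejecting the utterance otherwise. Field names, thresholds and the fallback policy are assumptions made for illustration, not the paper's exact strategy.

# Sketch of confidence-based selection over a concept-level N-best list.
def select_interpretation(concept_nbest, ac_threshold=0.6, lm_threshold=0.5):
    """concept_nbest: dicts sorted by decoder score, e.g.
    {"concepts": [...], "words": [...], "ac_conf": 0.8, "lm_conf": 0.7}."""
    for cand in concept_nbest:
        if cand["ac_conf"] >= ac_threshold and cand["lm_conf"] >= lm_threshold:
            return cand            # accepted interpretation
    return None                    # reject (e.g. ask the user to rephrase)

nbest = [
    {"concepts": ["DEST=Paris"],  "words": ["to", "Paris"],  "ac_conf": 0.55, "lm_conf": 0.62},
    {"concepts": ["DEST=Calais"], "words": ["to", "Calais"], "ac_conf": 0.71, "lm_conf": 0.66},
]
print(select_interpretation(nbest)["concepts"])  # -> ['DEST=Calais']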
IEEE Automatic Speech Recognition and Understanding Workshop | 2009
Richard Dufour; Yannick Estève; Paul Deléglise; Frédéric Béchet
Processing spontaneous speech is one of the many challenges that automatic speech recognition (ASR) systems have to deal with. The main markers characterizing spontaneous speech are disfluencies (filled pauses, repetitions, repairs and false starts), and many studies have focused on the detection and the correction of these disfluencies. In this study we define spontaneous speech as unprepared speech, as opposed to prepared speech, where utterances contain well-formed sentences close to those found in written documents. Disfluencies are of course very good indicators of unprepared speech; however, they are not the only ones: ungrammaticality and language register are also important, as are prosodic patterns. This paper proposes a set of acoustic and linguistic features that can be used for characterizing and detecting spontaneous speech segments in large audio databases. Moreover, we introduce a strategy that takes advantage of a global classification process using a probabilistic model, which significantly improves spontaneous speech detection.
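A minimal sketch of such a two-stage detector is given below: segment-level features feed a probabilistic classifier, and a global pass smooths the local decisions over neighboring segments. The feature set, the logistic-regression classifier and the smoothing scheme are stand-ins chosen for illustration; the paper's actual features and probabilistic model differ.

# Sketch: local probabilistic classification of segments as spontaneous vs.
# prepared, followed by a crude global smoothing over neighbouring segments.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per segment: [filled_pause_rate, repetition_rate, speech_rate, lm_perplexity]
X_train = np.array([[0.12, 0.05, 5.1, 310.0],
                    [0.01, 0.00, 4.2, 140.0],
                    [0.09, 0.04, 5.6, 280.0],
                    [0.02, 0.01, 4.0, 120.0]])
y_train = np.array([1, 0, 1, 0])           # 1 = spontaneous, 0 = prepared
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def detect_spontaneous(segment_features, window=1):
    """Local posteriors, then neighbourhood averaging as a global pass."""
    local = clf.predict_proba(segment_features)[:, 1]
    smoothed = np.array([local[max(0, i - window): i + window + 1].mean()
                         for i in range(len(local))])
    return smoothed > 0.5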
International Conference on Acoustics, Speech, and Signal Processing | 2004
Christian Raymond; Frédéric Béchet; R. De Mori; Géraldine Damnati; Yannick Estève
The paper proposes a new application of automatically trained decision trees to derive the interpretation of a spoken sentence. A new strategy for building structured cohorts of candidates is also described. By evaluating predicates related to the acoustic confidence of the words expressing a concept, the linguistic and semantic consistency of candidates in the cohort, and the rank of a candidate within a cohort, the decision tree automatically learns a decision strategy for rescoring or rejecting an n-best list of candidates representing a user's utterance. A relative reduction of 18.6% in the understanding error rate is obtained by our rescoring strategy with no utterance rejection, and a relative reduction of 43.1% of the same error rate is achieved with a rejection rate of only 8% of the utterances.
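The rescoring strategy can be sketched as follows: every candidate in a rank-ordered cohort is described by predicate values, a decision tree trained offline labels it as correct or not, and the first accepted candidate is returned (or the utterance is rejected). The toy predicates, training data and accept/reject policy below are illustrative assumptions rather than the paper's exact predicates.

# Sketch: decision-tree rescoring of a structured cohort of candidates.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Training rows: [acoustic_confidence, semantically_consistent, rank_in_cohort]
X_train = np.array([[0.92, 1, 1], [0.40, 0, 1], [0.85, 1, 2], [0.55, 0, 3]])
y_train = np.array([1, 0, 1, 0])            # 1 = correct interpretation
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

def rescore(cohort):
    """Return the first candidate the tree accepts, or None to reject the utterance."""
    for cand in cohort:                      # cohort is already rank-ordered
        feats = [[cand["ac_conf"], cand["consistent"], cand["rank"]]]
        if tree.predict(feats)[0] == 1:
            return cand
    return None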
IEEE Transactions on Speech and Audio Processing | 2003
Yannick Estève; Christian Raymond; R. De Mori; David Janiszek
This paper introduces new recognition strategies based on reasoning about results obtained with different Language Models (LMs). Strategies are built following the conjecture that the consensus among the results obtained with different models gives rise to different situations in which hypothesized sentences have different word error rates (WER) and may be further processed with other LMs. New LMs are built by data augmentation using ideas from latent semantic analysis and trigram analogy. Situations are defined by expressing the consensus among the recognition results produced with different LMs and by the amount of unobserved trigrams in the hypothesized sentence. The diagnostic power of the use of observed trigrams or their corresponding class trigrams is compared with that of situations based on values of sentence posterior probabilities. In order to avoid or correct errors due to syntactic inconsistency of the recognized sentence, automata obtained by explanation-based learning are introduced and used in certain conditions. Semantic Classification Trees are introduced to provide sentence patterns expressing constraints of long-distance syntactic coherence. Results on a dialogue corpus provided by France Telecom R&D have shown that, starting with a WER of 21.87% on a test set of 1422 sentences, it is possible to subdivide the sentences into three sets characterized by automatically recognized situations. The first one has a coverage of 68% with a WER of 7.44%. The second one has various types of sentences with a WER around 20%. The third one contains 13% of the sentences, which should be rejected, with a WER around 49%. The second set characterizes sentences that should be processed with particular care by the dialogue interpreter, with the possibility of asking the user for confirmation.
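The notion of situations can be illustrated with a short sketch: the hypotheses produced with different LMs are compared, the fraction of unobserved trigrams in the best hypothesis is measured, and the sentence is routed to one of the three sets described above. The thresholds and the routing rule are assumptions for illustration only.

# Sketch: route a sentence to a "situation" from LM consensus and the share
# of trigrams in the best hypothesis that were never observed in training.
def unseen_trigram_rate(words, observed_trigrams):
    trigrams = list(zip(words, words[1:], words[2:]))
    if not trigrams:
        return 0.0
    return sum(t not in observed_trigrams for t in trigrams) / len(trigrams)

def situation(hypotheses, observed_trigrams):
    """hypotheses: one word list per LM; the first is the baseline LM's output."""
    consensus = all(h == hypotheses[0] for h in hypotheses[1:])
    unseen = unseen_trigram_rate(hypotheses[0], observed_trigrams)
    if consensus and unseen < 0.2:
        return "accept"                 # low expected WER
    if not consensus and unseen > 0.5:
        return "reject"                 # high expected WER
    return "process further"            # hand over for more careful processing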