Benoit Favre
Aix-Marseille University
Publications
Featured research published by Benoit Favre.
International Conference on Acoustics, Speech, and Signal Processing | 2013
Necip Fazil Ayan; Arindam Mandal; Michael W. Frandsen; Jing Zheng; Peter Blasco; Andreas Kathol; Frédéric Béchet; Benoit Favre; Alex Marin; Tom Kwiatkowski; Mari Ostendorf; Luke Zettlemoyer; Philipp Salletmayr; Julia Hirschberg; Svetlana Stoyanchev
We present a novel approach for improving communication success between users of speech-to-speech translation systems by automatically detecting errors in the output of automatic speech recognition (ASR) and statistical machine translation (SMT) systems. Our approach initiates system-driven targeted clarification about errorful regions in user input and repairs them given user responses. Our system has been evaluated by unbiased subjects in live mode, and results show improved success of communication between users of the system.
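To make the interaction concrete, here is a minimal, purely illustrative sketch of such a system-driven clarification loop; the span detector, confidence threshold, and prompt are hypothetical stand-ins, not the system described in the paper.

```python
# Minimal sketch of a targeted-clarification loop (all names hypothetical;
# the actual system is far more elaborate).

def detect_error_spans(words, confidences, threshold=0.5):
    """Flag contiguous runs of low-confidence ASR words as error spans."""
    spans, start = [], None
    for i, c in enumerate(confidences):
        if c < threshold and start is None:
            start = i
        elif c >= threshold and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(words)))
    return spans

def clarify(words, confidences, ask_user):
    """Ask a targeted question about each error span and splice in the answer."""
    for start, end in reversed(detect_error_spans(words, confidences)):
        question = "I did not catch the part after '%s'. Could you rephrase it?" % \
                   " ".join(words[max(0, start - 3):start])
        words[start:end] = ask_user(question).split()
    return words

words = "please send the uh fax to the embassy".split()
conf  = [0.9, 0.9, 0.9, 0.2, 0.3, 0.9, 0.9, 0.9]
print(clarify(words, conf, ask_user=lambda q: "fax machine number"))
```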
Content-Based Multimedia Indexing | 2013
Meriem Bendris; Benoit Favre; Delphine Charlet; Géraldine Damnati; Grégory Senay; Rémi Auguste; Jean Martinet
Our goal is to automatically identify faces in TV content without a pre-defined dictionary of identities. Most methods are based on identity detection (from OCR and ASR) and require a propagation strategy based on visual clustering. In TV content, people appear with many variations, making the clustering very difficult. In this case, identifying speakers can provide a reliable link for identifying faces. In this work, we propose to combine reliable unsupervised face and speaker identification systems through talking-face detection in order to improve face identification results. First, OCR and ASR results are combined to extract identities locally. Then, reliable visual associations are used to propagate those identities locally. The reliably identified faces are used as unsupervised models to identify similar faces. Finally, speaker identities are propagated to faces when lip activity is detected. Experiments performed on the REPERE database show a +5% improvement in recall compared to the baseline, without degrading precision.
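A toy sketch of the propagation chain described above, with all data structures as hypothetical stand-ins: OCR names are attached to the current speaker via talking-face detection, and speaker identities then name the linked face tracks.

```python
# Toy data: one overlaid name, one speaker talking in two different shots.
ocr_name = {"shot1": "Jane Doe"}            # overlaid name visible in a shot
speaker_of_shot = {"shot1": "spk1", "shot3": "spk1"}   # diarization output
talking_face = {"shot1": "face_A", "shot3": "face_C"}  # lip-activity links

# 1) Local naming: an overlaid name is assigned to the speaker talking there.
speaker_name = {speaker_of_shot[s]: n for s, n in ocr_name.items()
                if s in speaker_of_shot}

# 2) Propagation: every face linked to a named speaker by lip activity
#    inherits that speaker's name, even in shots with no overlaid name.
face_name = {talking_face[s]: speaker_name[spk]
             for s, spk in speaker_of_shot.items()
             if s in talking_face and spk in speaker_name}
print(face_name)   # {'face_A': 'Jane Doe', 'face_C': 'Jane Doe'}
```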
Multi-source, Multilingual Information Extraction and Summarization | 2013
Heng Ji; Benoit Favre; Wen-Pin Lin; Dan Gillick; Dilek Hakkani-Tür; Ralph Grishman
Information Extraction (IE) and Summarization share the same goal of extracting and presenting the relevant information of a document. While IE was a primary element of early abstractive summarization systems, it has been left out of more recent extractive systems. However, extracting facts and recognizing entities and events should provide useful information to those systems and help resolve semantic ambiguities that they cannot tackle on their own. This paper explores novel approaches to taking advantage of cross-document IE for multi-document summarization. We propose multiple approaches to IE-based summarization and analyze their strengths and weaknesses. One of them, re-ranking the output of a high-performing summarization system with IE-informed metrics, leads to improvements in both manually evaluated content quality and readability.
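A hedged sketch of the re-ranking idea: candidate summaries from a base system are rescored by how well they cover salient IE facts. The coverage metric below is a crude stand-in; the paper combines several IE-informed scores.

```python
def ie_score(summary, salient_facts):
    """Fraction of cross-document IE facts (entities/events) the summary covers."""
    text = summary.lower()
    return sum(f.lower() in text for f in salient_facts) / len(salient_facts)

def rerank(candidates, base_scores, salient_facts, weight=0.5):
    """Interpolate the base summarizer score with the IE coverage score."""
    rescored = [(weight * b + (1 - weight) * ie_score(c, salient_facts), c)
                for c, b in zip(candidates, base_scores)]
    return max(rescored)[1]

facts = ["earthquake", "Port-au-Prince", "January 12"]
cands = ["An earthquake struck Port-au-Prince on January 12.",
         "Officials commented on the recent events."]
print(rerank(cands, base_scores=[0.6, 0.7], salient_facts=facts))
```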
International Conference on Acoustics, Speech, and Signal Processing | 2013
Frédéric Béchet; Benoit Favre
Even though small ASR errors might not impact downstream processes that make use of the transcript, larger error segments, like those generated by OOVs, can have a considerable impact on applications such as speech-to-speech translation and can eventually lead to communication failure between users of the system. This work focuses on error detection in ASR output, targeted towards significant error segments that can be recovered using a dialog system. We propose a CRF system trained to recognize error segments with ASR confidence-based, lexical and syntactic features. The most significant error segment is passed to a dialog system for interactive recovery, in which rephrased words are reinserted into the original utterance. 22% of utterances can be fully recovered, and an interesting by-product is that rewriting error segments as a single token reduces WER by 17% on an adverse corpus.
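A minimal sketch of such a CRF error-segment detector, assuming the sklearn-crfsuite package; the feature names and the tiny training set are illustrative stand-ins for the ASR-confidence, lexical and syntactic features used in the paper.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def features(sent, i):
    word, conf, pos = sent[i]
    return {
        "word.lower": word.lower(),
        "asr.conf": conf,                 # ASR confidence-based feature
        "pos": pos,                       # syntactic (POS) feature
        "prev.pos": sent[i - 1][2] if i > 0 else "BOS",
    }

# Each training sentence: [(word, asr_confidence, pos_tag), ...] with BIO labels.
train = [[("send", 0.9, "VB"), ("the", 0.9, "DT"), ("facts", 0.3, "NNS")]]
labels = [["O", "O", "B-ERR"]]

X = [[features(s, i) for i in range(len(s))] for s in train]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))  # BIO error-segment labels, e.g. [['O', 'O', 'B-ERR']]
```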
European Signal Processing Conference | 2015
Mickael Rouvier; Pierre-Michel Bousquet; Benoit Favre
This paper proposes to learn a set of high-level feature representations through deep learning, referred to as speaker embeddings, for speaker diarization. Speaker embedding features are taken from the hidden-layer neuron activations of Deep Neural Networks (DNN) trained as classifiers to recognize a thousand speaker identities in a training set. Although learned through identification, speaker embeddings are shown to be effective for speaker verification, in particular for recognizing speakers unseen in the training set. This approach is then applied to speaker diarization. Experiments conducted on the ETAPE corpus of French broadcast news show that this new speaker modeling technique decreases DER by 1.67 points (a relative improvement of about 8%).
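A minimal PyTorch sketch of the idea: a DNN is trained to classify training-set speakers, and the last hidden layer's activations serve as the embedding. Layer sizes, feature dimensions, and the comparison step are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class SpeakerNet(nn.Module):
    def __init__(self, feat_dim=60, emb_dim=128, n_speakers=1000):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim), nn.ReLU())   # activations = embedding
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, x):
        return self.classifier(self.hidden(x))

    def embed(self, x):
        return self.hidden(x)                     # used at diarization time

net = SpeakerNet()
x, y = torch.randn(32, 60), torch.randint(0, 1000, (32,))
loss = nn.CrossEntropyLoss()(net(x), y)           # identification training
loss.backward()

# At test time, unseen speakers are compared by cosine similarity of embeddings.
a, b = net.embed(torch.randn(1, 60)), net.embed(torch.randn(1, 60))
print(torch.nn.functional.cosine_similarity(a, b))
```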
Content-Based Multimedia Indexing | 2014
Mickael Rouvier; Benoit Favre; Meriem Bendris; Delphine Charlet; Géraldine Damnati
Our goal is to automatically identify people in TV news and debates without any predefined dictionary of people. In this paper, we focus on person identification beyond face authentication, in order to improve identification results even where no face is detectable. We propose to use automatic scene analysis as features for people identification, exploiting two features: scene classification (studio and report) and camera identification. People are then identified by propagating overlaid names (OCR results) and speaker identities to scene classes and specific camera shots. Experiments performed on the REPERE corpus show an improvement in face identification using scene understanding features (+13.9% F-measure compared to the baseline).
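A toy sketch of this propagation through scene features, with all data hypothetical: an overlaid name seen on one camera shot is propagated to unnamed shots of the same scene class and camera, which typically frame the same person.

```python
shots = [  # (shot_id, scene_class, camera_id, ocr_name or None)
    ("s1", "studio", "cam2", "John Smith"),
    ("s2", "report", "cam5", None),
    ("s3", "studio", "cam2", None),   # same camera as s1 -> same framing
]

# Collect the names associated with each (scene_class, camera_id) pair ...
names_by_camera = {}
for shot_id, scene, cam, name in shots:
    if name is not None:
        names_by_camera.setdefault((scene, cam), set()).add(name)

# ... and propagate them to unnamed shots sharing that pair.
for shot_id, scene, cam, name in shots:
    if name is None and (scene, cam) in names_by_camera:
        print(shot_id, "->", names_by_camera[(scene, cam)])  # s3 -> {'John Smith'}
```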
International Conference on Acoustics, Speech, and Signal Processing | 2014
Meriem Bendris; Benoit Favre; Delphine Charlet; Géraldine Damnati; Rémi Auguste
Our goal is to automatically identify faces in TV broadcasts without a pre-defined dictionary of identities. Most methods are based on identity detection (from OCR and ASR) and require a propagation strategy based on visual clustering. In TV content, people appear with many variations, making the clustering difficult. In this case, speaker clustering can provide a reliable link for face clustering. In this paper, we propose to automatically build an incomplete speaker-face mapping based on local evidence from OCR and lip-activity links. We then propose schemes for propagating speaker constraints to the constrained face-clustering problem. Experiments performed on the REPERE corpus show an improvement in face identification when propagating names to face clusters (+3.7% F-measure compared to the baseline).
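A sketch of what propagating speaker constraints into face clustering can look like: must-link and cannot-link constraints derived from the speaker-face mapping are enforced during a naive agglomerative pass. Distances and constraints here are toy data, and the paper's actual propagation schemes differ.

```python
import itertools

faces = ["f1", "f2", "f3", "f4"]
dist = {("f1", "f2"): 0.2, ("f1", "f3"): 0.8, ("f1", "f4"): 0.9,
        ("f2", "f3"): 0.7, ("f2", "f4"): 0.85, ("f3", "f4"): 0.3}
must_link = {("f3", "f4")}     # same speaker talking in both shots
cannot_link = {("f1", "f3")}   # different speakers at the same time

clusters = [{f} for f in faces]

def violates(a, b):
    return any((x, y) in cannot_link or (y, x) in cannot_link
               for x, y in itertools.product(a, b))

def merge_pair(a, b):
    clusters.remove(a); clusters.remove(b); clusters.append(a | b)

for x, y in must_link:                         # enforce must-link first
    a = next(c for c in clusters if x in c)
    b = next(c for c in clusters if y in c)
    if a is not b and not violates(a, b):
        merge_pair(a, b)

# One greedy agglomerative pass respecting cannot-link constraints.
for d, (x, y) in sorted((d, p) for p, d in dist.items()):
    a = next(c for c in clusters if x in c)
    b = next(c for c in clusters if y in c)
    if a is not b and d < 0.5 and not violates(a, b):
        merge_pair(a, b)

print(clusters)  # e.g. [{'f3', 'f4'}, {'f1', 'f2'}]
```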
International Conference on Acoustics, Speech, and Signal Processing | 2014
Frédéric Béchet; Benoit Favre; Alexis Nasr; Mathieu Morey
Retrieving the syntactic structure of erroneous ASR transcriptions can be of great interest for open-domain Spoken Language Understanding (SLU) tasks, in order to correct, or at least reduce, the impact of ASR errors on final applications. Most previous work on ASR and syntactic parsing has addressed this problem by using syntactic features during ASR to help reduce Word Error Rate (WER). The improvement obtained is often rather small; however, the structure and the relations between words obtained through parsing can be of great interest for SLU processes even without a significant decrease in WER. That is why we adopt another point of view in this paper: considering that ASR transcriptions inevitably contain some errors, we show that it is possible to improve the syntactic analysis of these erroneous transcriptions by performing a joint error detection / syntactic parsing process. The applicative framework used in this study is a speech-to-speech system developed through the DARPA BOLT project.
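A loose illustration of the joint view, under the assumption that detected error segments are isolated so the rest of the utterance can still receive a clean parse; the real system couples detection and parsing in one process, whereas this sketch only shows the collapsed output representation.

```python
def collapse_error_segments(tokens, confidences, threshold=0.5):
    """Return a token sequence where low-confidence runs become one ERROR token."""
    out, buffer = [], []
    for tok, conf in zip(tokens, confidences):
        if conf < threshold:
            buffer.append(tok)
        else:
            if buffer:
                out.append("<ERROR:%s>" % " ".join(buffer))
                buffer = []
            out.append(tok)
    if buffer:
        out.append("<ERROR:%s>" % " ".join(buffer))
    return out

tokens = ["we", "met", "the", "amb", "assy", "staff"]
confs  = [0.9, 0.9, 0.9, 0.2, 0.3, 0.9]
print(collapse_error_segments(tokens, confs))
# ['we', 'met', 'the', '<ERROR:amb assy>', 'staff'] -> parse this sequence
```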
IEEE Automatic Speech Recognition and Understanding Workshop | 2015
Mickael Rouvier; Sébastien Delecraz; Benoit Favre; Meriem Bendris; Frédéric Béchet
Person role recognition in video broadcasts consists of classifying people into roles such as anchor, journalist, or guest. Existing approaches mostly consider one modality, either audio (speaker role recognition) or image (shot role recognition), firstly because of the lack of synchrony between the two modalities, and secondly because of the lack of a video corpus annotated in both modalities. Deep Neural Network (DNN) approaches offer the ability to learn feature representations (embeddings) and classification functions simultaneously. This paper presents a multimodal fusion of audio, text and image embedding spaces for speaker role recognition in asynchronous data. Monomodal embeddings are trained on exogenous data and fine-tuned using a DNN on a 70-hour French broadcast corpus for the target task. Experiments on the REPERE corpus show the benefit of embedding-level fusion compared to monomodal embedding systems and to the standard late fusion method.
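A minimal PyTorch sketch of embedding-level fusion: pre-trained monomodal embeddings are concatenated and fed to a joint DNN fine-tuned for the role classes. Dimensions and the number of roles are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, d_audio=128, d_text=100, d_image=256, n_roles=6):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(d_audio + d_text + d_image, 256), nn.ReLU(),
            nn.Linear(256, n_roles))

    def forward(self, audio_emb, text_emb, image_emb):
        # Embedding-level (early) fusion: concatenate, then classify jointly,
        # as opposed to late fusion of per-modality decisions.
        return self.fusion(torch.cat([audio_emb, text_emb, image_emb], dim=-1))

net = FusionNet()
batch = (torch.randn(8, 128), torch.randn(8, 100), torch.randn(8, 256))
roles = torch.randint(0, 6, (8,))
loss = nn.CrossEntropyLoss()(net(*batch), roles)   # fine-tuning step
loss.backward()
print(loss.item())
```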
International Conference on Acoustics, Speech, and Signal Processing | 2014
Benoit Favre; Mickael Rouvier; Frédéric Béchet
Clarification dialogs can help address ASR errors in speech-to-speech translation systems and other interactive applications. We propose to use variants of Levenshtein alignment for merging an errorful utterance with a targeted rephrase of an error segment. ASR errors that might harm the alignment are addressed through phonetic matching, and a word embedding distance is used to account for the use of synonyms outside targeted segments. These features lead to a relative improvement of 30% in word error rate on sentences with ASR errors, compared to not performing the clarification. Twice as many utterances are completely corrected compared to using basic word alignment. Furthermore, we generate a set of potential merges and train a neural network on crowd-sourced rephrases to select the best one, leading to 24% more instances completely corrected. The system is deployed in the framework of the BOLT project.
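A sketch of merging an errorful utterance with a targeted rephrase via Levenshtein alignment with soft substitution costs. The character-overlap cost below is a crude stand-in for the paper's phonetic matching and word-embedding distance, and the utterances are toy data.

```python
def sub_cost(a, b):
    if a == b:
        return 0.0
    overlap = len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)
    return 1.0 - overlap   # cheap proxy for phonetic/semantic similarity

def align(ref, hyp):
    """Standard Levenshtein DP with soft substitution costs; returns the cost."""
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1): d[i][0] = float(i)
    for j in range(1, m + 1): d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]))
    return d[n][m]

def merge(utt, rephrase):
    """Splice the rephrase over the window of the utterance it aligns best with."""
    k = len(rephrase)
    best = min(range(len(utt) - k + 1),
               key=lambda i: align(utt[i:i + k], rephrase))
    return utt[:best] + rephrase + utt[best + k:]

utt = "send the facts miles to the embassy".split()
print(" ".join(merge(utt, "fax machine".split())))
# -> 'send the fax machine to the embassy'
```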