Publication


Featured research published by Benoit Favre.


International Conference on Acoustics, Speech, and Signal Processing | 2013

“Can you give me another word for hyperbaric?”: Improving speech translation using targeted clarification questions

Necip Fazil Ayan; Arindam Mandal; Michael W. Frandsen; Jing Zheng; Peter Blasco; Andreas Kathol; Frédéric Béchet; Benoit Favre; Alex Marin; Tom Kwiatkowski; Mari Ostendorf; Luke Zettlemoyer; Philipp Salletmayr; Julia Hirschberg; Svetlana Stoyanchev

We present a novel approach for improving communication success between users of speech-to-speech translation systems by automatically detecting errors in the output of automatic speech recognition (ASR) and statistical machine translation (SMT) systems. Our approach initiates system-driven targeted clarification about errorful regions in user input and repairs them given user responses. Our system has been evaluated by unbiased subjects in live mode, and results show improved success of communication between users of the system.
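
A minimal Python sketch of the targeted-clarification idea described above; the question template, token indices, and example transcript are invented for illustration and are not the authors' actual system:

```python
# Toy sketch: turn a detected ASR/SMT error span into a targeted
# clarification question. The template and example are assumptions,
# not the paper's actual clarification strategy.
def clarify(words, error_span):
    start, end = error_span
    fragment = " ".join(words[start:end])
    return f'Can you give me another word for "{fragment}"?'

asr_words = "the patient needs hyper barrick oxygen therapy".split()
print(clarify(asr_words, (3, 5)))
# -> Can you give me another word for "hyper barrick"?
```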


Content-Based Multimedia Indexing | 2013

Unsupervised face identification in TV content using audio-visual sources

Meriem Bendris; Benoit Favre; Delphine Charlet; Géraldine Damnati; Grégory Senay; Rémi Auguste; Jean Martinet

Our goal is to automatically identify faces in TV content without a pre-defined dictionary of identities. Most methods are based on identity detection (from OCR and ASR) and require a propagation strategy based on visual clustering. In TV content, people appear with many variations, making the clustering very difficult. In this case, identifying speakers can be a reliable link to identify faces. In this work, we propose to combine reliable unsupervised face and speaker identification systems through talking-face detection in order to improve face identification results. First, OCR and ASR results are combined to extract identities locally. Then, the reliable visual associations are used to propagate those identities locally. The reliably identified faces are used as unsupervised models to identify similar faces. Finally, speaker identities are propagated to faces when lip activity is detected. Experiments performed on the REPERE database show an improvement in recall of +5% compared to the baseline, without degrading precision.
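
The propagation pipeline sketched in this abstract can be pictured with a toy Python fragment; all identifiers, names, and data structures below are invented stand-ins for the OCR/ASR evidence and lip-activity links the paper actually uses:

```python
# Toy sketch of identity propagation: locally extracted names are spread
# within face clusters, then speaker identities fill in faces that show
# lip activity. Every value here is illustrative.
face_clusters = {0: ["f1", "f2"], 1: ["f3"]}   # unsupervised visual clusters
local_names = {"f1": "Jane Doe"}                # local OCR+ASR evidence
speaker_of = {"f3": "John Smith"}               # lip-activity speaker link

identities = {}
# 1) Propagate unambiguous locally extracted names within each face cluster.
for faces in face_clusters.values():
    names = {local_names[f] for f in faces if f in local_names}
    if len(names) == 1:
        name = names.pop()
        for f in faces:
            identities[f] = name
# 2) Fall back on speaker identity when lip activity ties a face to a voice.
for f, speaker in speaker_of.items():
    identities.setdefault(f, speaker)

print(identities)  # {'f1': 'Jane Doe', 'f2': 'Jane Doe', 'f3': 'John Smith'}
```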


Multi-source, Multilingual Information Extraction and Summarization | 2013

Open-Domain Multi-Document Summarization via Information Extraction: Challenges and Prospects

Heng Ji; Benoit Favre; Wen-Pin Lin; Dan Gillick; Dilek Hakkani-Tür; Ralph Grishman

Information Extraction (IE) and summarization share the same goal of extracting and presenting the relevant information of a document. While IE was a primary element of early abstractive summarization systems, it has been left out of more recent extractive systems. However, extracting facts and recognizing entities and events should provide useful information to those systems and help resolve semantic ambiguities that they cannot tackle. This paper explores novel approaches to taking advantage of cross-document IE for multi-document summarization. We propose multiple approaches to IE-based summarization and analyze their strengths and weaknesses. One of them, re-ranking the output of a high-performing summarization system with IE-informed metrics, leads to improvements in both manually-evaluated content quality and readability.
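
The IE-informed re-ranking mentioned in the last sentence can be illustrated with a small sketch; the fact set, scoring weights, and candidate summaries are all invented, and the paper's actual metrics are certainly richer:

```python
# Toy sketch of re-ranking summarizer output with an IE-informed score:
# candidates that cover more independently extracted facts are preferred.
# Facts, weights, and candidates are illustrative assumptions.
extracted_facts = {"earthquake", "Haiti", "January 2010"}  # from an IE system

def ie_score(summary):
    return sum(fact.lower() in summary.lower() for fact in extracted_facts)

def rerank(candidates, alpha=0.5):
    # candidates: (summary_text, base summarizer score) pairs
    return max(candidates,
               key=lambda c: (1 - alpha) * c[1] + alpha * ie_score(c[0]))

candidates = [("A quake struck Haiti in January 2010.", 0.8),
              ("There was a disaster somewhere.", 0.9)]
print(rerank(candidates)[0])  # the fact-rich candidate wins
```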


International Conference on Acoustics, Speech, and Signal Processing | 2013

ASR error segment localization for spoken recovery strategy

Frédéric Béchet; Benoit Favre

Even though small ASR errors might not impact downstream processes that make use of the transcript, larger error segments, such as those generated by OOVs, can have a considerable impact on applications such as speech-to-speech translation and can eventually lead to communication failure between users of the system. This work focuses on error detection in ASR output, targeted towards significant error segments that can be recovered using a dialog system. We propose a CRF system trained to recognize error segments using ASR confidence-based, lexical, and syntactic features. The most significant error segment is passed to a dialog system for interactive recovery, in which rephrased words are reinserted into the original utterance. 22% of utterances can be fully recovered, and an interesting by-product is that rewriting error segments as a single token reduces WER by 17% on an adverse corpus.
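
A minimal sketch of such a CRF error-segment tagger, here using the third-party sklearn-crfsuite library; the feature names, toy utterance, and BIO-style labels are assumptions, not the authors' actual feature set or data:

```python
# Sketch of a CRF tagger for ASR error segments with confidence-based,
# lexical, and syntactic (POS) features, via sklearn-crfsuite.
import sklearn_crfsuite

def word_features(sent, i):
    word, confidence, pos = sent[i]
    return {
        "word.lower": word.lower(),    # lexical
        "asr.confidence": confidence,  # ASR confidence-based
        "pos": pos,                    # syntactic
        "is_first": i == 0,
        "is_last": i == len(sent) - 1,
    }

# Toy data: tokens as (word, ASR confidence, POS); O = correct,
# B-ERR/I-ERR = error segment to pass to the recovery dialog.
train_sents = [[("give", 0.95, "VB"), ("me", 0.97, "PRP"),
                ("hyper", 0.42, "JJ"), ("barrick", 0.31, "NN")]]
train_labels = [["O", "O", "B-ERR", "I-ERR"]]

X = [[word_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit(X, train_labels)
print(crf.predict(X))  # per-token error-segment labels
```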


European Signal Processing Conference | 2015

Speaker diarization through speaker embeddings

Mickael Rouvier; Pierre-Michel Bousquet; Benoit Favre

This paper proposes to learn a set of high-level feature representations through deep learning, referred to as Speaker Embeddings, for speaker diarization. Speaker embedding features are taken from the hidden-layer neuron activations of Deep Neural Networks (DNN) trained as classifiers to recognize a thousand speaker identities in a training set. Although learned through identification, speaker embeddings are shown to be effective for speaker verification, in particular for recognizing speakers unseen in the training set. This approach is then applied to speaker diarization. Experiments, conducted on the ETAPE corpus of French broadcast news, show that this new speaker modeling technique decreases DER by 1.67 points (a relative improvement of about 8%).
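
The embedding extraction described here can be sketched in a few lines of PyTorch; the layer sizes, input dimension, and speaker count below are illustrative assumptions rather than the paper's actual architecture:

```python
# Sketch of speaker embeddings: train a DNN to classify training-set
# speakers, then reuse the last hidden layer as an embedding that also
# works for speakers unseen in training (e.g., for diarization).
import torch
import torch.nn as nn

class SpeakerNet(nn.Module):
    def __init__(self, feat_dim=600, emb_dim=256, n_speakers=1000):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim), nn.ReLU(),  # embedding layer
        )
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, x):                 # used during training
        return self.classifier(self.hidden(x))

    def embed(self, x):                   # used at diarization time
        with torch.no_grad():
            return self.hidden(x)

net = SpeakerNet()
segment_feats = torch.randn(4, 600)    # 4 speech segments (toy features)
embeddings = net.embed(segment_feats)  # 4 x 256 speaker embeddings
print(embeddings.shape)
```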


Content-Based Multimedia Indexing | 2014

Scene understanding for identifying persons in TV shows: Beyond face authentication

Mickael Rouvier; Benoit Favre; Meriem Bendris; Delphine Charlet; Géraldine Damnati

Our goal is to automatically identify people in TV news and debates without any predefined dictionary of people. In this paper, we focus on person identification beyond face authentication, in order to improve identification results beyond the cases where a face is detectable. We propose to use automatic scene analysis as features for people identification. We exploit two features: scene classification (studio vs. report) and camera identification. People are then identified by propagating overlaid names (OCR results) and speaker identities to scene classes and specific camera shots. Experiments performed on the REPERE corpus show an improvement in face identification using scene understanding features (+13.9% F-measure compared to the baseline).


International Conference on Acoustics, Speech, and Signal Processing | 2014

Multiple-view constrained clustering for unsupervised face identification in TV-broadcast

Meriem Bendris; Benoit Favre; Delphine Charlet; Géraldine Damnati; Rémi Auguste

Our goal is to automatically identify faces in TV broadcasts without a pre-defined dictionary of identities. Most methods are based on identity detection (from OCR and ASR) and require a propagation strategy based on visual clustering. In TV content, people appear with many variations, making the clustering difficult. In this case, speaker clustering can be a reliable link for face clustering. In this paper, we propose to automatically build an incomplete speaker-face mapping based on local evidence from OCR and lip-activity links. We then propose schemes for propagating speaker constraints to the face constrained-clustering problem. Experiments performed on the REPERE corpus show an improvement in face identification when propagating names to face clusters (+3.7% F-measure compared to the baseline).
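
One way to picture the speaker-constrained face clustering is a toy agglomerative procedure with must-link and cannot-link constraints; the linkage choice, threshold, and data below are invented, not the paper's actual algorithm:

```python
# Toy constrained agglomerative clustering: speaker-derived must-link /
# cannot-link constraints restrict which face clusters may merge.
import numpy as np

def constrained_agglomerative(dist, must_link, cannot_link, threshold):
    clusters = [{i} for i in range(len(dist))]
    # Merge must-linked faces first (e.g., faces tied to the same speaker).
    for a, b in must_link:
        ca = next(c for c in clusters if a in c)
        cb = next(c for c in clusters if b in c)
        if ca is not cb:
            ca |= cb
            clusters.remove(cb)

    def violates(c1, c2):
        return any((a in c1 and b in c2) or (a in c2 and b in c1)
                   for a, b in cannot_link)

    def linkage(c1, c2):  # average-link distance between two clusters
        return float(np.mean([dist[i][j] for i in c1 for j in c2]))

    while len(clusters) > 1:
        pairs = [(linkage(c1, c2), i1, i2)
                 for i1, c1 in enumerate(clusters)
                 for i2, c2 in enumerate(clusters)
                 if i1 < i2 and not violates(c1, c2)]
        if not pairs:
            break
        d, i1, i2 = min(pairs)
        if d > threshold:
            break
        clusters[i1] |= clusters[i2]
        del clusters[i2]
    return clusters

rng = np.random.default_rng(0)
dist = rng.random((5, 5)); dist = (dist + dist.T) / 2  # toy face distances
print(constrained_agglomerative(dist, must_link=[(0, 1)],
                                cannot_link=[(0, 4)], threshold=0.6))
```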


International Conference on Acoustics, Speech, and Signal Processing | 2014

Retrieving the syntactic structure of erroneous ASR transcriptions for open-domain Spoken Language Understanding

Frédéric Béchet; Benoit Favre; Alexis Nasr; Mathieu Morey

Retrieving the syntactic structure of erroneous ASR transcriptions can be of great interest for open-domain Spoken Language Understanding (SLU) tasks, in order to correct or at least reduce the impact of ASR errors on final applications. Most previous work on ASR and syntactic parsing has addressed this problem by using syntactic features during ASR to help reduce Word Error Rate (WER). The improvement obtained is often rather small; however, the structure and the relations between words obtained through parsing can be of great interest to SLU processes, even without a significant decrease in WER. That is why we adopt another point of view in this paper: considering that ASR transcriptions inevitably contain some errors, we show that it is possible to improve the syntactic analysis of these erroneous transcriptions by performing a joint error detection / syntactic parsing process. The applicative framework used in this study is a speech-to-speech system developed through the DARPA BOLT project.


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

Multimodal embedding fusion for robust speaker role recognition in video broadcast

Mickael Rouvier; Sebastien Delecraz; Benoit Favre; Meriem Bendris; Frédéric Béchet

Person role recognition in video broadcasts consists of classifying people into roles such as anchor, journalist, or guest. Existing approaches mostly consider one modality, either audio (speaker role recognition) or image (shot role recognition), firstly because of the non-synchrony between the two modalities, and secondly because of the lack of a video corpus annotated in both. Deep Neural Network (DNN) approaches offer the ability to learn feature representations (embeddings) and classification functions simultaneously. This paper presents a multimodal fusion of audio, text, and image embedding spaces for speaker role recognition in asynchronous data. Monomodal embeddings are trained on exogenous data and fine-tuned with a DNN on a 70-hour French broadcast corpus for the target task. Experiments on the REPERE corpus show the benefit of embedding-level fusion compared to the monomodal embedding systems and to the standard late fusion method.
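
The embedding-level fusion described here can be sketched in PyTorch; the dimensions, role inventory, and architecture below are assumptions for illustration, not the paper's actual network:

```python
# Sketch of embedding-level (early) multimodal fusion: monomodal audio,
# text, and image embeddings are concatenated and classified jointly.
import torch
import torch.nn as nn

ROLES = ["anchor", "journalist", "guest", "other"]  # illustrative roles

class FusionNet(nn.Module):
    def __init__(self, audio_dim=256, text_dim=300, image_dim=512):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(audio_dim + text_dim + image_dim, 256), nn.ReLU(),
            nn.Linear(256, len(ROLES)),
        )

    def forward(self, audio_emb, text_emb, image_emb):
        # Concatenate pre-trained monomodal embeddings, then classify;
        # fine-tuning updates the fusion layers on the target corpus.
        fused = torch.cat([audio_emb, text_emb, image_emb], dim=-1)
        return self.fusion(fused)

net = FusionNet()
logits = net(torch.randn(1, 256), torch.randn(1, 300), torch.randn(1, 512))
print(ROLES[logits.argmax(dim=-1).item()])  # predicted role (untrained net)
```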


International Conference on Acoustics, Speech, and Signal Processing | 2014

Reranked aligners for interactive transcript correction

Benoit Favre; Mickael Rouvier; Frédéric Béchet

Clarification dialogs can help address ASR errors in speech-to-speech translation systems and other interactive applications. We propose to use variants of Levenshtein alignment for merging an errorful utterance with a targeted rephrase of an error segment. ASR errors that might harm the alignment are addressed through phonetic matching, and a word embedding distance is used to account for the use of synonyms outside targeted segments. These features lead to a relative improvement of 30% in word error rate on sentences with ASR errors compared to not performing the clarification. Twice as many utterances are completely corrected compared to using basic word alignment. Furthermore, we generate a set of potential merges and train a neural network on crowd-sourced rephrases in order to select the best merge, leading to 24% more instances completely corrected. The system is deployed in the framework of the BOLT project.
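
A toy version of the alignment-based merging, with a crude character-similarity stand-in for the paper's phonetic matching and word-embedding distance; everything below is an illustrative assumption:

```python
# Toy Levenshtein alignment of an errorful hypothesis with a rephrase,
# using a softened substitution cost so near-matches (e.g., phonetically
# close words) align cheaply.
import difflib

def sub_cost(a, b):
    if a == b:
        return 0.0
    # Crude proxy for phonetic/synonym similarity.
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def align(hyp, rephrase):
    n, m = len(hyp), len(rephrase)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1): d[i][0] = float(i)
    for j in range(1, m + 1): d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + sub_cost(hyp[i-1], rephrase[j-1]))
    i, j, pairs = n, m, []   # backtrace, preferring matches/substitutions
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           d[i][j] == d[i-1][j-1] + sub_cost(hyp[i-1], rephrase[j-1]):
            pairs.append((hyp[i-1], rephrase[j-1])); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            pairs.append((hyp[i-1], None)); i -= 1       # word only in hyp
        else:
            pairs.append((None, rephrase[j-1])); j -= 1  # word only in rephrase
    return list(reversed(pairs))

hyp = "another word for hyper barrick".split()
rephrase = "another word for hyperbaric".split()
print(align(hyp, rephrase))  # aligned pairs drive the merge decision
```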

Collaboration


Dive into Benoit Favre's collaborations.

Top Co-Authors

Meriem Bendris

Aix-Marseille University


Ani Nenkova

University of Pennsylvania


Heng Ji

Rensselaer Polytechnic Institute


John M. Conroy

University of Pennsylvania


Mari Ostendorf

University of Washington
