Publications


Featured research published by Etienne Marcheret.


computer vision and pattern recognition | 2017

Self-Critical Sequence Training for Image Captioning

Steven J. Rennie; Etienne Marcheret; Youssef Mroueh; Jarret Ross; Vaibhava Goel

Recently it has been shown that policy-gradient methods for reinforcement learning can be utilized to train deep end-to-end systems directly on non-differentiable metrics for the task at hand. In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized. Our systems are built using a new optimization approach that we call self-critical sequence training (SCST). SCST is a form of the popular REINFORCE algorithm that, rather than estimating a baseline to normalize the rewards and reduce variance, utilizes the output of its own test-time inference algorithm to normalize the rewards it experiences. Using this approach, estimating the reward signal (as actor-critic methods must do) and estimating normalization (as REINFORCE algorithms typically do) are avoided, while at the same time harmonizing the model with respect to its test-time inference procedure. Empirically we find that directly optimizing the CIDEr metric with SCST and greedy decoding at test-time is highly effective. Our results on the MSCOCO evaluation server establish a new state-of-the-art on the task, improving the best result in terms of CIDEr from 104.9 to 114.7.
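
At its core, SCST is a REINFORCE update whose baseline is the reward earned by the model's own greedy test-time decode, so no separate critic or learned baseline is needed. Below is a minimal sketch of that loss, assuming a captioning model with hypothetical sample() and greedy_decode() methods and a cider_reward() scorer that returns per-caption rewards as tensors; it illustrates the idea rather than reproducing the paper's implementation.

```python
# Minimal SCST loss sketch (PyTorch). The captioning model, its sample()/
# greedy_decode() methods, and cider_reward() are hypothetical placeholders.
import torch

def scst_loss(model, images, references, cider_reward):
    # Sampled captions and their per-token log-probabilities.
    sampled_caps, log_probs = model.sample(images)        # log_probs: (B, T)
    # Greedy (test-time) decode, used as the reward baseline; no gradients needed.
    with torch.no_grad():
        greedy_caps = model.greedy_decode(images)
    # Sequence-level rewards (torch tensors), e.g. CIDEr against the references.
    r_sample = cider_reward(sampled_caps, references)      # (B,)
    r_greedy = cider_reward(greedy_caps, references)       # (B,)
    # REINFORCE with the greedy reward as baseline: reinforce samples that
    # beat the model's own test-time output, penalize those that do not.
    advantage = (r_sample - r_greedy).unsqueeze(1)         # (B, 1)
    loss = -(advantage * log_probs).sum(dim=1).mean()
    return loss
```

Because the advantage is the gap between the sampled and greedy rewards, only samples that outperform the model's current test-time behavior receive positive reinforcement.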


international conference on acoustics, speech, and signal processing | 2015

Deep multimodal learning for Audio-Visual Speech Recognition

Youssef Mroueh; Etienne Marcheret; Vaibhava Goel

In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach where uni-modal deep networks are trained separately and their final hidden layers fused to obtain a joint feature space in which another deep network is built. While the audio network alone achieves a phone error rate (PER) of 41% under clean conditions on the IBM large-vocabulary audio-visual studio dataset, this fusion model achieves a PER of 35.83%, demonstrating the tremendous value of the visual channel in phone classification even for audio with a high signal-to-noise ratio. Second, we present a new deep network architecture that uses a bilinear softmax layer to account for class-specific correlations between modalities. We show that combining the posteriors from the bilinear networks with those from the fused model mentioned above results in a further significant phone error rate reduction, yielding a final PER of 34.03%.
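
The first fusion scheme described above is essentially mid-level feature fusion: the final hidden layers of separately trained audio and visual networks are concatenated and a joint classifier is trained on top. A minimal PyTorch sketch under that reading, with layer sizes and the two-layer joint network chosen purely for illustration:

```python
# Sketch of mid-level fusion: final hidden layers of pre-trained uni-modal
# networks are concatenated and fed to a joint phone classifier.
# The 1024-unit hidden layer and two-layer joint network are assumptions.
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, audio_net, visual_net, audio_dim, visual_dim, n_phones):
        super().__init__()
        self.audio_net = audio_net      # pre-trained, outputs audio hidden features
        self.visual_net = visual_net    # pre-trained, outputs visual hidden features
        self.joint = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_phones),
        )

    def forward(self, audio, video):
        # Uni-modal hidden representations from the separately trained networks.
        h_a = self.audio_net(audio)
        h_v = self.visual_net(video)
        # Joint feature space built on the concatenated hidden layers.
        return self.joint(torch.cat([h_a, h_v], dim=-1))
```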


international conference on multimedia and expo | 2003

A real-time prototype for small-vocabulary audio-visual ASR

Jonathan H. Connell; Norman Haas; Etienne Marcheret; Chalapathy Neti; Gerasimos Potamianos; Senem Velipasalar

We present a prototype for the automatic recognition of audio-visual speech, developed to augment the IBM ViaVoice™ speech recognition system. Frontal-face, full-frame video is captured through a USB 2.0 interface by means of an inexpensive PC camera and processed to obtain appearance-based visual features. Subsequently, these are combined with audio features, synchronously extracted from the acoustic signal, using a simple discriminant feature fusion technique. On average, the required computations utilize approximately 67% of a Pentium™ 4, 1.8 GHz processor, leaving the remaining resources available to hidden Markov model based speech recognition. Real-time performance is therefore achieved for small-vocabulary tasks, such as connected-digit recognition. In the paper, we discuss the prototype architecture based on the ViaVoice engine, the basic algorithms employed, and their necessary modifications to ensure real-time performance and causality of the visual front-end processing. We benchmark the resulting system performance on stored videos against prior research experiments, and we report a close match between the two.
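
The fusion step can be read as projecting concatenated, frame-synchronous audio and visual features with a class-discriminant transform. The sketch below uses scikit-learn's LDA as a stand-in for the "simple discriminant feature fusion technique"; the prototype's actual transform and dimensions are not specified here, so treat everything below as an assumption.

```python
# Sketch of discriminant feature fusion: frame-synchronous audio and visual
# feature vectors are concatenated and projected with LDA to a compact fused
# feature. Using scikit-learn LDA with 40 output dimensions is illustrative,
# not the prototype's actual implementation.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fuse_features(audio_feats, visual_feats, labels, n_components=40):
    """audio_feats: (N_frames, D_a), visual_feats: (N_frames, D_v), frame-
    synchronous; labels: per-frame class ids (n_components must be smaller
    than the number of classes)."""
    av = np.hstack([audio_feats, visual_feats])
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    fused = lda.fit_transform(av, labels)
    return lda, fused
```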


Multimodal Technologies for Perception of Humans | 2008

The IBM RT07 Evaluation Systems for Speaker Diarization on Lecture Meetings

Jing Huang; Etienne Marcheret; Karthik Visweswariah; Gerasimos Potamianos

We present the IBM systems for the Rich Transcription 2007 (RT07) speaker diarization evaluation task on lecture meeting data. We first overview our baseline system that was developed last year, as part of our speech-to-text system for the RT06s evaluation. We then present a number of simple schemes considered this year in our effort to improve speaker diarization performance, namely: (i) A better speech activity detection (SAD) system, a necessary pre-processing step to speaker diarization; (ii) Use of word information from a speaker-independent speech recognizer; (iii) Modifications to speaker cluster merging criteria and the underlying segment model; and (iv) Use of speaker models based on Gaussian mixture models, and their iterative refinement by frame-level re-labeling and smoothing of decision likelihoods. We report development experiments on the RT06s evaluation test set that demonstrate that these methods are effective, resulting in dramatic performance improvements over our baseline diarization system. For example, changes in the cluster segment models and cluster merging methodology result in a 24.2% relative reduction in speaker error rate, whereas use of the iterative model refinement process and word-level alignment produce a 36.0% and a 9.2% relative reduction in speaker error, respectively. The importance of the SAD subsystem is also shown, with SAD error reduction from 12.3% to 4.3% translating to a 20.3% relative reduction in speaker error rate. Unfortunately, however, the developed diarization system depends heavily on appropriately tuned thresholds in the speaker cluster merging process. Possibly as a result of over-tuning such thresholds, performance on the RT07 evaluation test set degrades significantly compared to that observed on the development data. Nevertheless, our experiments show that the introduced techniques of cluster merging, speaker model refinement and alignment remain valuable in the RT07 evaluation.
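
Item (iv), the iterative refinement of GMM speaker models by frame-level re-labeling and decision smoothing, can be sketched roughly as follows. The component counts, iteration count, and median-filter smoothing are illustrative assumptions (the filter is only a crude stand-in for the likelihood smoothing described above), and each cluster is assumed to keep enough frames to fit its GMM.

```python
# Sketch of iterative speaker-model refinement: fit one GMM per current
# speaker cluster, re-label frames by per-frame likelihood, and smooth the
# frame decisions so very short speaker turns are suppressed.
import numpy as np
from scipy.ndimage import median_filter
from sklearn.mixture import GaussianMixture

def refine_speaker_labels(features, labels, n_iter=3, n_components=8, win=151):
    """features: (N_frames, D) speech frames; labels: initial cluster ids."""
    for _ in range(n_iter):
        speakers = np.unique(labels)
        # One GMM per speaker cluster, trained on its currently assigned frames.
        gmms = [GaussianMixture(n_components, covariance_type="diag")
                .fit(features[labels == s]) for s in speakers]
        # Frame-level re-labeling by maximum log-likelihood.
        ll = np.stack([g.score_samples(features) for g in gmms], axis=1)
        labels = speakers[np.argmax(ll, axis=1)]
        # Crude decision smoothing over a sliding window of frames.
        labels = median_filter(labels, size=win)
    return labels
```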


international conference on acoustics, speech, and signal processing | 2007

Dynamic Stream Weight Modeling for Audio-Visual Speech Recognition

Etienne Marcheret; Vit Libal; Gerasimos Potamianos

To achieve optimal multi-stream audio-visual speech recognition performance, appropriate dynamic weighting of each modality is desired. In this paper, we propose to estimate such weights based on a combination of acoustic signal space observations and single-modality audio and visual speech model likelihoods. Two modeling approaches are investigated for such weight estimation: one based on a sigmoid fitting function, the other employing Gaussian mixture models. Reported experiments demonstrate that the latter approach outperforms sigmoid-based modeling and is dramatically superior to a static weighting scheme.
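
One way to read the GMM-based variant: model the weight-indicator features under "audio reliable" and "audio unreliable" conditions with two small GMMs, take their posterior as the per-frame audio weight, and combine stream log-likelihoods with that weight. The indicator features, GMM sizes, and equal class priors in the sketch below are assumptions, not the paper's exact formulation.

```python
# Sketch of GMM-based dynamic stream weighting and multi-stream combination.
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMStreamWeighter:
    def __init__(self, n_components=4):
        self.gmm_reliable = GaussianMixture(n_components)
        self.gmm_unreliable = GaussianMixture(n_components)

    def fit(self, indicators, audio_reliable_mask):
        """indicators: (N, D) per-frame indicator features (e.g. acoustic-space
        observations plus uni-modal likelihood scores); mask: boolean (N,)."""
        self.gmm_reliable.fit(indicators[audio_reliable_mask])
        self.gmm_unreliable.fit(indicators[~audio_reliable_mask])
        return self

    def audio_weight(self, indicators):
        # Posterior probability that the audio stream is reliable (equal priors).
        ll_r = self.gmm_reliable.score_samples(indicators)
        ll_u = self.gmm_unreliable.score_samples(indicators)
        return 1.0 / (1.0 + np.exp(ll_u - ll_r))

def fused_log_likelihood(logp_audio, logp_video, lam):
    # Weighted multi-stream combination: lambda * audio + (1 - lambda) * video.
    return lam * logp_audio + (1.0 - lam) * logp_video
```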


international conference on multimedia and expo | 2005

Rapid Feature Space Speaker Adaptation for Multi-Stream HMM-Based Audio-Visual Speech Recognition

Jing Huang; Etienne Marcheret; Karthik Visweswariah

Multi-stream hidden Markov models (HMMs) have recently been very successful in audio-visual speech recognition, where the audio and visual streams are fused at the final decision level. In this paper we investigate fast feature-space speaker adaptation using multi-stream HMMs for audio-visual speech recognition. In particular, we focus on studying the performance of feature-space maximum likelihood linear regression (fMLLR), a fast and effective method for estimating feature-space transforms. Unlike the common speaker adaptation techniques of MAP or MLLR, fMLLR does not change the audio or visual HMM parameters, but simply applies a single transform to the testing features. We also address the problem of fast and robust on-line fMLLR adaptation using feature-space maximum a posteriori linear regression (fMAPLR). Adaptation experiments are reported on the IBM infrared headset audio-visual database. On average, for a 20-speaker, 1-hour independent test set, the multi-stream fMLLR achieves a 31% relative gain on the clean audio condition and a 59% relative gain on the noisy audio condition (approximately 7 dB), as compared to the baseline multi-stream system.
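
What makes fMLLR attractive here is that the HMMs stay fixed and only the test features are passed through a single affine map estimated on adaptation data. A minimal sketch of the application step (the row-wise maximum-likelihood estimation of the transform itself is omitted):

```python
# Sketch of applying an fMLLR transform at test time: x' = A x + b for every
# feature vector, with the acoustic (or visual) models left unchanged.
import numpy as np

def apply_fmllr(features, A, b):
    """features: (N_frames, D); A: (D, D); b: (D,). Returns adapted features."""
    return features @ A.T + b
```

In the multi-stream setting, a separate transform of this form would be estimated and applied per stream, which keeps adaptation cheap enough for on-line use.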


computer vision and pattern recognition | 2009

Audio-visual speech synchronization detection using a bimodal linear prediction model

Kshitiz Kumar; Jiri Navratil; Etienne Marcheret; Vit Libal; Ganesh N. Ramaswamy; Gerasimos Potamianos

In this work, we study the problem of detecting audio-visual (AV) synchronization in video segments containing a speaker in frontal head pose. The problem holds important applications in biometrics, for example spoofing detection, and it constitutes an important step in AV segmentation necessary for deriving AV fingerprints in multimodal speaker recognition. To attack the problem, we propose a time-evolution model for AV features and derive an analytical approach to capture the notion of synchronization between them. We report results on an appropriate AV database, using two types of visual features extracted from the speaker's facial area: geometric ones and features based on the discrete cosine image transform. Our results demonstrate that the proposed approach provides substantially better AV synchrony detection than a baseline method that employs mutual information, with the geometric visual features outperforming the image-transform ones.
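
The time-evolution model can be caricatured as bimodal linear prediction: predict the current audio feature from a short history of audio and visual features, and declare the streams in sync when adding the visual history lowers the prediction error relative to an audio-only predictor. The least-squares formulation and scoring below are illustrative assumptions, not the paper's exact analytical approach.

```python
# Sketch of a bimodal linear-prediction synchrony score: positive values mean
# the visual stream helps predict the audio stream, suggesting the two are in sync.
import numpy as np

def sync_score(audio, video, order=3):
    """audio: (T, Da), video: (T, Dv), frame-synchronous features."""
    T = audio.shape[0]
    rows_a, rows_av, targets = [], [], []
    for t in range(order, T):
        past_a = audio[t - order:t].ravel()
        past_v = video[t - order:t].ravel()
        rows_a.append(past_a)
        rows_av.append(np.concatenate([past_a, past_v]))
        targets.append(audio[t])
    X_a, X_av, Y = map(np.asarray, (rows_a, rows_av, targets))
    # Least-squares linear predictors: audio-only vs. audio + visual history.
    res_a = Y - X_a @ np.linalg.lstsq(X_a, Y, rcond=None)[0]
    res_av = Y - X_av @ np.linalg.lstsq(X_av, Y, rcond=None)[0]
    # Log ratio of residual energies; larger means more evidence of synchrony.
    return np.log(np.sum(res_a ** 2) / (np.sum(res_av ** 2) + 1e-12))
```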


Multimodal Technologies for Perception of Humans | 2008

The IBM Rich Transcription 2007 Speech-to-Text Systems for Lecture Meetings

Jing Huang; Etienne Marcheret; Karthik Visweswariah; Vit Libal; Gerasimos Potamianos

The paper describes the IBM systems submitted to the NIST Rich Transcription 2007 (RT07) evaluation campaign for the speech-to-text (STT) and speaker-attributed speech-to-text (SASTT) tasks on the lecture meeting domain. Three testing conditions are considered, namely the multiple distant microphone (MDM), single distant microphone (SDM), and individual headset microphone (IHM) ones --- the latter for the STT task only. The IBM system building process is similar to that employed last year for the STT Rich Transcription Spring 2006 evaluation (RT06s). However, a few technical advances have been introduced for RT07: (a) better speaker segmentation; (b) system combination via the ROVER approach applied over an ensemble of systems, some of which are built by randomized decision tree state-tying; and (c) development of a very large language model consisting of 152M n-grams, incorporating, among other sources, 525M words of web data, and used in conjunction with a dynamic decoder. These advances reduce STT word error rate (WER) in the MDM condition by 16% relative (8% absolute) over the IBM RT06s system, as measured on 17 lecture meeting segments of the RT06s evaluation test set, selected in this work as development data. In the RT07 evaluation campaign, both MDM and SDM systems perform competitively for the STT and SASTT tasks. For example, at the MDM condition, a 44.3% STT WER is achieved on the RT07 evaluation test set, excluding scoring of overlapped speech. When the STT transcripts are combined with speaker labels from speaker diarization, SASTT WER becomes 52.0%. For the STT IHM condition, the newly developed large language model is employed, but in conjunction with the RT06s IHM acoustic models. The latter are reused, due to lack of time to train new models to utilize additional close-talking microphone data available in RT07. Therefore, the resulting system achieves modest WERs of 31.7% and 33.4%, when using manual or automatic segmentation, respectively.
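
Of the advances listed, ROVER combination is the easiest to illustrate: hypotheses from several systems are aligned into a word transition network and each slot is decided by voting. The sketch below assumes the alignment has already been produced and shows only a plain majority vote, a simplification of ROVER's confidence-weighted scoring.

```python
# Sketch of the voting step behind ROVER-style system combination; the
# alignment of hypotheses into slots is assumed to be done already, with ""
# marking a deletion in that system's output.
from collections import Counter

def rover_vote(aligned_hyps):
    """aligned_hyps: list of equal-length word lists, one per system."""
    combined = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word:                      # drop slots where deletion wins the vote
            combined.append(word)
    return combined

# Example with three pre-aligned system outputs.
print(rover_vote([["the", "cat", "sat"],
                  ["the", "cat", "at"],
                  ["a",   "cat", "sat"]]))   # -> ['the', 'cat', 'sat']
```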


Computer Speech & Language | 2013

The IBM speech-to-speech translation system for smartphone: Improvements for resource-constrained tasks

Bowen Zhou; Xiaodong Cui; Songfang Huang; Martin Cmejrek; Wei Zhang; Jian Xue; Jia Cui; Bing Xiang; Gregg Daggett; Upendra V. Chaudhari; Sameer Maskey; Etienne Marcheret

This paper describes our recent improvements to IBM TRANSTAC speech-to-speech translation systems that address various issues arising from dealing with resource-constrained tasks, which include both limited amounts of linguistic resources and training data, as well as limited computational power on mobile platforms such as smartphones. We show how the proposed algorithms and methodologies can improve the performance of automatic speech recognition, statistical machine translation, and text-to-speech synthesis, while achieving low-latency two-way speech-to-speech translation on mobiles.


multimedia signal processing | 2007

An Embedded System for In-Vehicle Visual Speech Activity Detection

Vit Libal; Jonathan H. Connell; Gerasimos Potamianos; Etienne Marcheret

We present a system for automatically detecting the driver's speech in the automobile domain using visual-only information extracted from the driver's mouth region. The work is motivated by the desire to eliminate manual push-to-talk activation of the speech recognition engine in newly designed voice interfaces in the typically noisy car environment, aiming at reducing driver cognitive load and increasing naturalness of the interaction. The proposed system uses a camera mounted on the rearview mirror to monitor the driver, detect face boundaries and facial features, and finally employ lip-motion cues to recognize visual speech activity. In particular, the designed algorithm has very low computational cost, which allows real-time implementation on currently available inexpensive embedded platforms, as described in the paper. Experiments on a small multi-speaker database collected in moving automobiles are also reported and demonstrate promising accuracy.
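
A very low-cost lip-motion cue of the kind the abstract alludes to can be as simple as thresholding smoothed frame differences inside the tracked mouth region. The sketch below is exactly that and nothing more; the ROI tracking, threshold, and smoothing window are assumptions rather than the embedded system's actual algorithm.

```python
# Sketch of a cheap lip-motion cue for visual speech activity detection:
# mean absolute frame difference over the mouth region, smoothed and thresholded.
import numpy as np

def visual_speech_activity(mouth_rois, threshold=4.0, win=15):
    """mouth_rois: (T, H, W) grayscale mouth-region crops, one per frame.
    Returns a boolean decision per frame transition (length T - 1)."""
    rois = mouth_rois.astype(np.float32)
    # Per-frame lip-motion energy: mean absolute difference to the previous frame.
    motion = np.abs(np.diff(rois, axis=0)).mean(axis=(1, 2))
    # Simple moving-average smoothing to bridge short pauses within speech.
    kernel = np.ones(win) / win
    smoothed = np.convolve(motion, kernel, mode="same")
    return smoothed > threshold
```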
