Publications


Featured research published by Tony Ezzat.


IEEE International Conference on Automatic Face and Gesture Recognition | 2004

Trainable videorealistic speech animation

Tony Ezzat; Gadi Geiger; Tomaso Poggio

We describe how to create, with machine learning techniques, a generative, videorealistic speech animation module. A human subject is first recorded using a video camera as he/she utters a pre-determined speech corpus. After processing the corpus automatically, a visual speech module is learned from the data that is capable of synthesizing the human subject's mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence, which contains natural head and eye movement. The final output is videorealistic in the sense that it looks like a video camera recording of the subject. At run time, the input to the system can be either real audio sequences or synthetic audio produced by a text-to-speech system, as long as they have been phonetically aligned.
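At run time the system reduces to a simple driving loop: each phonetically aligned segment of the input audio is mapped to synthesized mouth frames, which are then re-composited onto the background video. Below is a minimal sketch of that loop; `synthesize_mouth` and `composite` are hypothetical stand-ins for the learned visual speech module and the compositing step, and the frame rate is an assumed value.

```python
FPS = 30  # assumed video frame rate, not specified in the abstract

def render_utterance(phoneme_alignment, background_frames, synthesize_mouth, composite):
    """phoneme_alignment: list of (phoneme, start_sec, end_sec) from real or TTS audio."""
    output = []
    for i, bg in enumerate(background_frames):
        t = i / FPS
        # find the phoneme active at time t
        for ph, start, end in phoneme_alignment:
            if start <= t < end:
                progress = (t - start) / max(end - start, 1e-6)
                mouth = synthesize_mouth(ph, progress)   # learned visual speech module (stand-in)
                output.append(composite(bg, mouth))      # re-composite onto head/eye background video
                break
        else:
            output.append(bg)  # silence: keep the background frame unchanged
    return output
```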


Proceedings of Computer Animation '98 | 1998

MikeTalk: a talking facial display based on morphing visemes

Tony Ezzat; Tomaso Poggio

We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject which is specifically designed to elicit one instantiation of each viseme. Using optical flow methods, correspondence from every viseme to every other viseme is computed automatically. By morphing along this correspondence, a smooth transition between viseme images may be generated. A complete visual utterance is constructed by concatenating viseme transitions. Finally, phoneme and timing information extracted from a text-to-speech synthesizer is exploited to determine which viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to synchronize the visual speech stream with the audio speech stream, and hence give the impression, of a photorealistic talking face.
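A minimal sketch of the viseme-morphing idea, using OpenCV's Farnebäck dense optical flow as a stand-in for the paper's own correspondence algorithm; the warp-and-cross-dissolve below is a common simplification, not MikeTalk's exact procedure.

```python
import cv2
import numpy as np

def morph_visemes(vis_a, vis_b, alpha):
    """Simplified morph between two grayscale viseme images (uint8) at blend factor alpha in [0, 1]."""
    # dense flow from vis_a to vis_b (pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags)
    flow = cv2.calcOpticalFlowFarneback(vis_a, vis_b, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = vis_a.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    # warp A forward by a fraction of the flow, and B backward by the remainder
    warp_a = cv2.remap(vis_a,
                       (xs - alpha * flow[..., 0]).astype(np.float32),
                       (ys - alpha * flow[..., 1]).astype(np.float32), cv2.INTER_LINEAR)
    warp_b = cv2.remap(vis_b,
                       (xs + (1 - alpha) * flow[..., 0]).astype(np.float32),
                       (ys + (1 - alpha) * flow[..., 1]).astype(np.float32), cv2.INTER_LINEAR)
    return cv2.addWeighted(warp_a, 1 - alpha, warp_b, alpha, 0)

# a viseme transition is a sequence of morphs at increasing alpha:
# frames = [morph_visemes(va, vb, a) for a in np.linspace(0.0, 1.0, 10)]
```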


International Journal of Computer Vision | 2000

Visual Speech Synthesis by Morphing Visemes

Tony Ezzat; Tomaso Poggio

We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a small set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject which is specifically designed to elicit one instantiation of each viseme. Using optical flow methods, correspondence from every viseme to every other viseme is computed automatically. By morphing along this correspondence, a smooth transition between viseme images may be generated. A complete visual utterance is constructed by concatenating viseme transitions. Finally, phoneme and timing information extracted from a text-to-speech synthesizer is exploited to determine which viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to synchronize the visual speech stream with the audio speech stream, and hence give the impression of a photorealistic talking face.


International Conference on Automatic Face and Gesture Recognition | 1996

Facial analysis and synthesis using image-based models

Tony Ezzat; Tomaso Poggio

In this paper we describe image-based modeling techniques that make possible the creation of photo-realistic computer models of real human faces. The image-based model is built using example views of the face, bypassing the need for any three-dimensional computer graphics models. A learning network is trained to associate each of the example images with a set of pose and expression parameters. For a novel set of parameters, the network synthesizes a novel, intermediate view using a morphing approach. This image-based synthesis paradigm can adequately model both rigid and non-rigid facial movements. We also describe an analysis-by-synthesis algorithm, which is capable of extracting a set of high-level parameters from an image sequence involving facial movement using embedded image-based models. The parameters of the models are perturbed in a local and independent manner for each image until a correspondence-based error metric is minimized. A small sample of experimental results is presented.


Symposium on Computer Animation | 2005

Transferable videorealistic speech animation

Yao-Jen Chang; Tony Ezzat

Image-based videorealistic speech animation achieves significant visual realism at the cost of collecting a large 5- to 10-minute video corpus from the specific person to be animated. This requirement hinders its use in broad applications, since a large video corpus for a specific person under a controlled recording setup may not be easily obtained. In this paper, we propose a model transfer and adaptation algorithm which allows a novel person to be animated using only a small video corpus. The algorithm starts with a multidimensional morphable model (MMM) previously trained from a different speaker with a large corpus, and transfers it to the novel speaker with a much smaller corpus. The algorithm consists of 1) a novel matching-by-synthesis algorithm which semi-automatically selects new MMM prototype images from the new video corpus and 2) a novel gradient descent linear regression algorithm which adapts the MMM phoneme models to the data in the novel video corpus. Encouraging experimental results are presented in which a morphable model trained from a performer with a 10-minute corpus is transferred to a novel person using a 15-second movie clip of him as the adaptation video corpus.
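As a rough illustration of the adaptation step, gradient-descent linear regression can be written in a few lines of NumPy. The shapes and the interpretation of X and Y below (source-model phoneme parameters mapped to parameters observed in the short novel-speaker corpus) are assumptions, not the paper's exact MMM formulation.

```python
import numpy as np

def adapt_by_linear_regression(X, Y, lr=1e-3, n_iters=2000):
    """Fit Y ~ X @ W + b by batch gradient descent on mean squared error."""
    n, d_in = X.shape
    d_out = Y.shape[1]
    W = np.zeros((d_in, d_out))
    b = np.zeros(d_out)
    for _ in range(n_iters):
        pred = X @ W + b
        grad = pred - Y                    # gradient of 0.5 * mean squared error w.r.t. pred
        W -= lr * (X.T @ grad) / n
        b -= lr * grad.mean(axis=0)
    return W, b
```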


International Conference on Acoustics, Speech, and Signal Processing | 2008

Localized spectro-temporal cepstral analysis of speech

Jake V. Bouvrie; Tony Ezzat; Tomaso Poggio

Drawing on recent progress in auditory neuroscience, we present a novel speech feature analysis technique based on localized spectro-temporal cepstral analysis of speech. We proceed by extracting localized 2D patches from the spectrogram and projecting them onto a 2D discrete cosine transform (2D-DCT) basis. For each time frame, a speech feature vector is then formed by concatenating low-order 2D-DCT coefficients from the set of corresponding patches. We argue that our framework has significant advantages over standard one-dimensional MFCC features. In particular, we find that our features are more robust to noise, and better capture temporal modulations important for recognizing plosive sounds. We evaluate the performance of the proposed features on a TIMIT classification task in clean, pink, and babble noise conditions, and show that our feature analysis outperforms traditional features based on MFCCs.
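The feature extraction is straightforward to sketch: slide a patch over the spectrogram, take its 2D DCT, and keep only the low-order coefficients. The patch sizes and the number of retained coefficients below are illustrative choices, not the settings used in the paper.

```python
import numpy as np
from scipy.fft import dctn

def patch_dct_features(spectrogram, patch_freq=16, patch_time=8, n_keep=3):
    """Localized 2D-DCT features: one concatenated vector of low-order coefficients per frame."""
    n_freq, n_time = spectrogram.shape
    features = []
    for t in range(0, n_time - patch_time + 1):                      # one feature vector per time frame
        frame_coeffs = []
        for f in range(0, n_freq - patch_freq + 1, patch_freq):      # non-overlapping in frequency (assumed)
            patch = spectrogram[f:f + patch_freq, t:t + patch_time]
            coeffs = dctn(patch, norm='ortho')                       # 2D discrete cosine transform
            frame_coeffs.append(coeffs[:n_keep, :n_keep].ravel())    # keep low-order 2D-DCT terms
        features.append(np.concatenate(frame_coeffs))
    return np.array(features)
```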


International Conference on Acoustics, Speech, and Signal Processing | 2007

AM-FM Demodulation of Spectrograms using Localized 2D Max-Gabor Analysis

Tony Ezzat; Jake V. Bouvrie; Tomaso Poggio

We present a method that demodulates a narrowband magnitude spectrogram S(f, t) into a frequency modulation term cos(Φ(f, t)) which represents the underlying harmonic carrier, and an amplitude modulation term A(f, t) which represents the spectral envelope. Our method operates by performing a two-dimensional local patch analysis of the spectrogram, in which each patch is factored into a local carrier term and a local amplitude envelope term using a Max-Gabor analysis. We demonstrate the technique over a wide variety of speakers, and show how the spectrograms in each case may be adequately reconstructed as S(f, t) = A(f, t)cos(Φ(f, t)).
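A much-simplified version of the local factorization can be sketched by reading the dominant 2D frequency of each patch off its Fourier peak. The paper's Max-Gabor fit is more careful than this, so the code below is only an illustration of decomposing a patch into a carrier term and an amplitude term.

```python
import numpy as np

def max_gabor_patch(patch):
    """Crude carrier/amplitude factorization of one spectrogram patch via its FFT peak.

    Returns (amplitude, carrier, dc) such that patch is approximately dc + amplitude * carrier.
    Illustrative approximation only; the paper fits local Gabor parameters, not a single FFT bin.
    """
    nf, nt = patch.shape
    dc = patch.mean()
    F = np.fft.fft2(patch - dc)
    mag = np.abs(F)
    mag[0, 0] = 0.0                                    # ignore any residual DC component
    pf, pt = np.unravel_index(np.argmax(mag), mag.shape)
    wf = np.fft.fftfreq(nf)[pf]                        # dominant frequency along the frequency axis
    wt = np.fft.fftfreq(nt)[pt]                        # dominant frequency along the time axis
    phase = np.angle(F[pf, pt])
    f_idx, t_idx = np.mgrid[0:nf, 0:nt]
    carrier = np.cos(2 * np.pi * (wf * f_idx + wt * t_idx) + phase)   # local FM (carrier) term
    amplitude = 2.0 * mag[pf, pt] / (nf * nt)                          # local AM (envelope) term
    return amplitude, carrier, dc
```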


Conference of the International Speech Communication Association | 2007

Spectro-temporal analysis of speech using 2-D Gabor filters

Tony Ezzat; Jake V. Bouvrie; Tomaso Poggio


Conference of the International Speech Communication Association | 2008

Discriminative word-spotting using ordered spectro-temporal patch features

Tony Ezzat; Tomaso Poggio


Archive | 2003

Perceptual Evaluation of Video-Realistic Speech

Gadi Geiger; Tony Ezzat; Tomaso Poggio

Collaboration


Dive into Tony Ezzat's collaborations.

Top Co-Authors

Tomaso Poggio | Massachusetts Institute of Technology
Jake V. Bouvrie | Massachusetts Institute of Technology
Gadi Geiger | Massachusetts Institute of Technology
Ethan Meyers | Massachusetts Institute of Technology
James R. Glass | Massachusetts Institute of Technology
Bhiksha Raj | Carnegie Mellon University
Evandro Gouvea | Carnegie Mellon University
Ken Schutte | Massachusetts Institute of Technology
Minjoon Kouh | Massachusetts Institute of Technology
Ryan Rifkin | Massachusetts Institute of Technology