Video-to-Video Translation for Visual Speech Synthesis
Michail C. Doukas, Viktoriia Sharmanska, Stefanos Zafeiriou
Department of Computing, Imperial College London, UK
{michail-christos.doukas16, sharmanska.v, s.zafeiriou}@imperial.ac.uk

Abstract
Despite remarkable success in image-to-image translation that celebrates the advancements of generative adversarial networks (GANs), very limited attempts are known for video domain translation. We study the task of video-to-video translation in the context of visual speech generation, where the goal is to transform an input video of any spoken word to an output video of a different word. This is a multi-domain translation, where each word forms a domain of videos uttering this word. Adaptation of the state-of-the-art image-to-image translation model (StarGAN) to this setting falls short with a large vocabulary size. Instead, we propose to use character encodings of the words and design a novel character-based GANs architecture for video-to-video translation called Visual Speech GAN (ViSpGAN). We are the first to demonstrate video-to-video translation with a vocabulary of 500 words.
1. Introduction
Creating synthetic samples of a face displaying various expressions or uttering words and sentences is a very important problem at the intersection of computer vision, graphics and machine learning. Solutions were mainly given by the graphics community [39, 10] with a lot of manual work, or by devising strictly person-specific solutions [37]. Recently, with the advent of Deep Convolutional Neural Networks (DCNNs) and particularly with the introduction of Generative Adversarial Networks (GANs) [11], there has been a shift in focus. That is, instead of designing personalized methods and sophisticated rigging architectures, the focus is on designing machine learning architectures, based on GANs, which can harness the availability of data and simulate both the physical process of facial motion creation and its visualization. In particular, our paper focuses on a GANs-based model that can transform facial motion when uttering words.

In the seminal work [11], GANs were introduced as an unsupervised machine learning method, implemented as a system of two neural networks contesting with each other in a zero-sum game framework.
Figure 1: Video-to-video translation for visual speech synthesis: given a video of a speaker uttering "president" and the target word "after", our generator produces a fake video of the speaker uttering the target word.

Since then many different GANs architectures have been proposed, such as Deep Convolutional GANs (DCGAN) [30], Wasserstein GANs (WGAN) [12], Boundary Equilibrium Generative Adversarial Networks (BEGAN) [3] and the Progressive GANs (PGAN) [19], which was the first to show impressive results in the generation of high-resolution images.

The concept of adversarial training has also been extended to supervised learning settings such as dense regression (e.g., image-to-image translation), leading to various conditional GANs (cGAN) [25] models [18, 50, 20, 17]. In image-to-image translation, the algorithms learn to synthesize a target image from a given input image, e.g. synthesizing photos from label maps, using a training set of aligned image pairs (e.g. pix2pix [18]), which are difficult to collect. A recent work that made cGANs applicable to problems where paired training data are not available is CycleGAN [50], which used the notion of cyclic consistency. There is a plethora of works on the image-to-image translation task [20, 23, 38, 4, 51, 35, 24]. A recent extension to a multi-domain setting has also been proposed in [6], where both data and labels are available but not paired. In order to support fully convolutional structures of the cGANs model [6], the authors append the one-hot label information as extra channels to the input image. The drawback of this strategy is that it is very difficult to train with many labels (domains). A similar approach was proposed in [29] for transforming facial images depending on Action Units (AUs), a systematic way to code facial motion with respect to the activation of facial muscles. Both state-of-the-art models, StarGAN [6] and GANimation [29], allow transformations of the input face images with respect to hair color, age, gender, and anatomically-aware facial movements.

Faces are one of the most widely used objects of choice for demonstrating the capabilities of GANs [19] and cGANs variants [6, 29, 48, 23, 47, 2, 26, 43]. This is the case because creating synthetic facial samples has countless applications, including video conferencing or movie dubbing for better lip-syncing results. In this paper, we aim at generating realistic synthetic videos of speakers uttering a given word. We assume that we are given a video of a person uttering one word along with a target word. Then, our method generates a new video with the same person uttering the target word. In particular, the contributions of our paper are:

• Our method extends the work of StarGAN [6] to video-to-video translation using 3D convolutional structures.

• In order to support many word labels, we propose to encode the words using character labels and design novel architectures for video translation, i.e. the generator network and the character inspector network.

• We are the first to show video-to-video translation with semantic transformation of 500 words/domains (using the 'in-the-wild' dataset LRW [9]).
2. Related work
Unlike the image domain, very limited attempts have been successful in the task of unconditional video synthesis [42, 40, 33, 31]. For the task of video style transfer, existing methods [5, 13, 16, 32] focus on transferring the style of a given reference image to the generated video; however, these methods cannot be directly applied to our problem of multi-domain video-to-video transfer. Conditional video generation has been primarily addressed with text inputs [22, 27]. To the best of our knowledge, there is only one prior work on general video-to-video synthesis [44]. In this work, a mapping function is learned to translate a video input from one modality (such as segmentation maps) to a corresponding RGB video. Using aligned input and output video pairs, the proposed model learns to synthesize video frames guided by the ground truth optical flow signal. Hence, this method is not suitable for a video-to-video translation between semantically different domains, i.e. domains of different word utterings.

The lipreading task is related to our problem, since we employ lipreading methods when forcing the generator to produce convincing videos. This task has been previously tackled using word-level [9, 36] and character-level [8] classification models. Visual speech synthesis is directly related to the objectives of our framework. This task has been primarily approached via pose transfer. That is, given a target speaker and a driving video, methods [49, 39, 46, 47, 26, 2, 48] generate video with visual speech by transferring the pose (lip movement) of one speaker to another. A different approach is speech-driven facial synthesis [43, 7, 49, 46, 48]. In this setting, the generated videos are conditioned on an image of the target speaker and an audio signal. The method in [43] generates the facial animation of a speaking person given only a facial image and a sound stream. Using a similar setting, the method of [7] produces video from audio; however, it generates frames independently from each other and animates only the mouth region. In the recent work [49], the authors use a GANs-based approach to generate talking faces conditioned on both video and sound, which is suitable for audio-visual synchronization and is different from our setup.
3. Method
We use conditional adversarial neural networks to learn a mapping from an input video of a person uttering one word to an output video of the same person uttering a target word. The training is conditioned on the input video of a speaker, the label of the input word, and the label of the target word. Each word label corresponds to a video domain of speakers uttering this word. We consider a general setting of multi-domain translation, where any two words (in the vocabulary) can form source and target domains. In this setting, well-established approaches like pix2pix [18] and CycleGAN [50] are not feasible, as they require training separate translation models for all possible pairs of labels. Instead, we base our approach on the StarGAN model [6] for multi-domain image-to-image translation.

First, in Section 3.1, we describe our method as an extension of the StarGAN framework for video-to-video translation, using 3D convolutions and words as class labels. In this approach, the word label is appended as extra channels to the input using one-hot representations, which is not scalable. Next, in Section 3.2, we describe our proposed framework, Visual Speech GAN (ViSpGAN), for video-to-video translation, which uses a character encoding of words for generation and can scale to any vocabulary size.
3.1. Word-level Video Translation
We build upon the StarGAN framework [6] to form a generative model that synthesizes realistic videos of a speaker uttering words given an input video. Our model consists of three components: a conditional video generator G, a video discriminator D and a video word classifier D_cls. Given an input video X with T frames, X = {x_1, x_2, ..., x_T}, and a target word label l_w as a one-hot vector, the generator transforms (X, l_w) into a synthetic video Y of the same length, where the speaker from the input video X utters the target word l_w. Here (X, l_w) denotes depth-wise concatenation of the video frames and the spatially replicated target label. The discriminator D is trained to distinguish real videos from the synthesized ones, while the word classifier D_cls is trained to classify videos of uttered words based on word labels in the vocabulary. Therefore, the generator is forced to synthesize fake samples which are both realistic, i.e. similar to the ones in the dataset, and represent the right target words.
Adversarial Loss. The objective of the conditional GAN-based model for synthesizing realistic videos can be expressed as a standard adversarial loss:

L_adv = E_X[log D(X)] + E_{X, l_w}[log(1 − D(G(X, l_w)))].

Here the generator G(X, l_w) is conditioned on the input video X and the target label l_w and tries to minimize the adversarial loss, whereas the discriminator is conditioned only on the real video and aims to maximize this loss. Similarly to StarGAN, for training we adopt the Wasserstein GAN objective [1] with gradient penalty [12]:

L_adv = E_X[D(X)] − E_{X, l_w}[D(G(X, l_w))] − λ_GP E_X̂[(||∇_X̂ D(X̂)||_2 − 1)^2],   (1)

where X̂ = αX + (1 − α)Y is sampled uniformly along a straight line between a pair of a real and a generated video, with α ~ U(0, 1).
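For illustration only, the gradient penalty term in (1) can be computed as in the following PyTorch-style sketch (the paper's implementation is in TensorFlow; the function name, the (batch, channels, T, H, W) tensor layout and the penalty weight of 10 are assumptions made here, not details given in the paper):

    import torch

    def gradient_penalty(D, real_video, fake_video, lambda_gp=10.0):
        # real_video, fake_video: tensors of shape (batch, channels, T, H, W);
        # lambda_gp = 10 is the common WGAN-GP default, assumed here.
        batch_size = real_video.size(0)
        # One interpolation coefficient per video, broadcast over the remaining dims.
        alpha = torch.rand(batch_size, 1, 1, 1, 1, device=real_video.device)
        x_hat = (alpha * real_video + (1.0 - alpha) * fake_video).requires_grad_(True)
        d_out = D(x_hat)
        grads = torch.autograd.grad(outputs=d_out.sum(), inputs=x_hat, create_graph=True)[0]
        grads = grads.reshape(batch_size, -1)
        # Penalise deviation of the gradient norm from 1, as in Eq. (1).
        return lambda_gp * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()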
Word Classification Loss. The generator has to synthesize a video Y uttering a target word l_w. To ensure that the target word is uttered in the output video, an auxiliary word classifier D_cls is introduced. It imposes the word classification loss when optimizing both the discriminator and the generator. By minimizing the word classification loss on fake videos,

L_cls^fake = E_{X, l_w}[− log D_cls(l_w | G(X, l_w))],   (2)

the generator learns to produce videos G(X, l_w) with uttered target words l_w as seen by the classifier D_cls. On the contrary, by minimizing the corresponding classification term on real videos,

L_cls^real = E_{X, l'_w}[− log D_cls(l'_w | X)],   (3)

the classifier D_cls learns to predict the correct word labels l'_w of the real videos. In this work, we employ the state-of-the-art lipreading neural network [36] as the word classifier D_cls.
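A minimal sketch of the classification terms (2)-(3), assuming D_cls returns logits over the word vocabulary and that integer class indices are used as labels (these are illustrative assumptions, not the authors' code):

    import torch.nn.functional as F

    def word_classification_losses(D_cls, real_video, fake_video, real_label, target_label):
        # real_label, target_label: integer word indices of shape (batch,).
        loss_real = F.cross_entropy(D_cls(real_video), real_label)    # Eq. (3), trains the classifier
        loss_fake = F.cross_entropy(D_cls(fake_video), target_label)  # Eq. (2), trains the generator
        return loss_real, loss_fake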
Cycle Consistency Loss. Adversarial training along with the word classification does not guarantee that the identity of the speaker of the input video is preserved in the generated video. Since we do not have a ground truth video Y available, we cannot compute a reconstruction error of the translation as in [18]. Instead, we apply a cycle consistency loss as in [50, 6] to preserve the content of the input video through a cyclic application of G:

L_cyc = E_{X, l_w, l'_w}[||G(G(X, l_w), l'_w) − X||].   (4)

Here the generator first synthesizes a speaker uttering the target word l_w and then translates this video G(X, l_w) back to the same speaker uttering the input word l'_w, for which we have the ground truth (the input video X).

The full objective for training the generator and discriminator networks of the 3D StarGAN model for video-to-video translation with word labels can be written as:

L_G = L_adv + λ_cls L_cls^fake + λ_cyc L_cyc,
L_D = −L_adv,
L_Dcls = L_cls^real,   (5)

where λ_cls and λ_cyc are hyper-parameters that control the relative importance of the corresponding loss terms.

In our implementation, we closely follow the network structures of the StarGAN model, equipping them with 3D convolutions over video volumes. We refer to this model as 3D StarGAN. (The only structural difference is that D_cls and D are trained as separate networks in 3D StarGAN, whereas in the StarGAN model for image translation the classifier D_cls is a part of the discriminator output D. This, however, did not work for video data.)
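As an illustrative sketch of the cycle term (4), assuming an L1 reconstruction error as in StarGAN [6] (the norm is not restated here) and a generator that takes a video and a label volume:

    import torch

    def cycle_consistency_loss(G, input_video, input_labels, target_labels):
        # Translate to the target word, then back to the original word (Eq. 4).
        fake_video = G(input_video, target_labels)
        reconstructed = G(fake_video, input_labels)
        # L1 reconstruction error against the input video (assumed, as in StarGAN [6]).
        return torch.mean(torch.abs(reconstructed - input_video))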
Arguably, the key design question of conditional generative adversarial networks is how to incorporate the label information when training the generator network. We found the idea of the StarGAN generator (to append the input RGB channels with the target labels) simple and remarkably effective. However, the 3D StarGAN extension for multi-word video translation has several shortcomings. First, it is difficult to train, in particular the generator network, even with a modest vocabulary size. Secondly, it does not account for commonalities in words and requires a lot of training data to learn all possible word translations. Finally, adding even one extra word to the vocabulary means expanding and retraining the generator from scratch.

Instead, we take a different approach and propose to use characters as a multi-label embedding for word translations. Our intuition is that similar words produce similar lip movements in the video frames where the common characters are uttered. For example, the words "because" and "become" share the same prefix "bec", which produces a similar lip movement for both words. Character-level word embedding has several advantages for synthesis: (i) it can be scaled (potentially) to any vocabulary size, since we condition the generation on characters of the alphabet; (ii) it explores commonalities in word utterings, as words share characters. Finally, it can be naturally embedded in the functioning of the character inspector, as will become clear shortly.

Figure 2: Illustration of the video generation with characters (as input to the generator network). Given the target word "around" and an input video of T = 24 frames, each of the six characters in the word is replicated 4 times ("aaaa-rrrr-oooo-uuuu-nnnn-dddd") to form labels for the 24 frames of the target video. After that, each character label (a 26-dimensional binary vector over the English alphabet) is spatially replicated and then concatenated (as extra channels) to the corresponding frame of the input video using depth-wise concatenation.

Figure 3: Overview of our proposed framework, which consists of four networks: the video generator G, the video discriminator D, the word classifier D_cls and the character inspector D_insp. The generator synthesizes the fake video Y conditioned on the input video and the target character labels, and then maps Y back to the original word domain again to get the reconstructed video X̂. Additionally, D tries to distinguish between real and fake samples, and D_cls and D_insp force Y to contain the target word.
Video generation with characters. Throughout this work we consider the uttering of English words, which can be generalized to other languages. The English alphabet has 26 characters, and for simplicity we represent each character as a 26-dimensional one-hot vector. Given the target word l_w = {c_1, c_2, ..., c_N} with N characters, we replicate each character k = T/N times to create a sequence of labels of the same length as the number of frames T:

l_c = {c_1, ..., c_1, c_2, ..., c_2, ..., c_N, ..., c_N},   (6)

where each character c_n is repeated k times. To form an input to the generator network, (X, l_c), each character label is first spatially replicated and then concatenated (as extra channels) to the corresponding video frame using depth-wise concatenation. Please see Figure 2 for an illustration.

This simple strategy of conditional generation might look like each letter takes up the same amount of time in the produced video. However, this is not the case, as the produced video uttering has to be realistic, and this is controlled by the discriminator networks. Next, we describe how we enhance the GANs discriminator with a character-level classifier.
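As an illustration of this labelling scheme (not the authors' code), the character-label volume could be built as follows; the alphabet indexing, the (T, H, W, 26) layout and the 64x64 spatial size in the example are assumptions made here for concreteness:

    import numpy as np

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # 26 characters

    def make_character_labels(word, num_frames, height, width):
        # One 26-dimensional one-hot vector per character of the target word.
        onehots = np.zeros((len(word), len(ALPHABET)), dtype=np.float32)
        for i, ch in enumerate(word.lower()):
            onehots[i, ALPHABET.index(ch)] = 1.0
        # Replicate each character k = T / N times along the temporal axis (Eq. 6);
        # assumes N divides T, as in the "around" example with T = 24 and N = 6.
        k = num_frames // len(word)
        per_frame = np.repeat(onehots, k, axis=0)                           # shape (T, 26)
        # Spatially replicate so the labels can be concatenated to the video
        # frames as extra channels (depth-wise concatenation).
        return np.tile(per_frame[:, None, None, :], (1, height, width, 1))  # (T, H, W, 26)

    # Example: labels for "around" over a 24-frame clip
    labels = make_character_labels("around", num_frames=24, height=64, width=64)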
Character Inspector. The task of classifying videos of word utterings based on character sequences is not novel. A recently proposed model [8] addresses it by including an LSTM decoder that predicts a sequence of distributions over character tokens, together with an attention mechanism. Since we do not aim to solve the lipreading task itself, we adopt a simple yet effective approach. Instead of designing a character classifier that tries to predict the exact character sequence {c_1, c_2, ..., c_N} in the word, our approach involves inspecting whether the characters are uttered in the video or not. We call this classifier network the character inspector.

Given a video X with a word label l'_w, the character inspector D_insp outputs a 26-dimensional vector D_insp(X) ∈ [0, 1]^26 with the probability of each of the 26 characters being present in the word uttered in the video. In order to learn the task of inspection, we minimize the cross-entropy loss between D_insp(X) and the character-level embedding of the word, i.e. a 26-dimensional binary vector z' (which can be further reduced to 5 dimensions with binary coding), where z'_i = 1 if the i-th character appears in l'_w and z'_i = 0 otherwise:

L_insp^real = E_{X, l'_w}[ Σ_{i=1}^{26} − z'_i log D_insp^i(X) − (1 − z'_i) log(1 − D_insp^i(X)) ].   (7)

By minimizing L_insp^real on real videos from the dataset, D_insp learns to recognize which letters are present or not present in the utterings. On the other hand, we utilize the same network to force the generator to synthesize videos with the characters of the target word l_w by minimizing:

L_insp^fake = E_{X, l_w}[ Σ_{i=1}^{26} − z_i log D_insp^i(G(X, l_c)) − (1 − z_i) log(1 − D_insp^i(G(X, l_c))) ],   (8)

where l_c encodes the word label l_w as in (6), and the cross-entropy is computed between the predicted distribution D_insp(G(X, l_c)) and the character indicator z of the target word l_w, i.e. z_i = 1 if character i appears in l_w and z_i = 0 otherwise.

The advantage of using a character inspector (as opposed to a character detector, for example) is that it enables the GANs model to learn the appearance and duration of the characters from the real videos and to reflect this knowledge when synthesizing fake videos. Not enforcing detection of the exact character sequence conditioned on in the video generation also has a shortcoming: the character inspector is unaware of the relative order of characters in the word or how many times they appear in the video. This could be addressed by incorporating an LSTM transducer to decode the exact letter sequence (which is an inherently harder problem than determining whether a character is present in the video or not). Instead, we combine the character inspector with the word classifier, which clears away the need to learn the exact order of letters in the word.
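A minimal sketch of the character-presence target z and the inspection loss of (7)-(8), assuming D_insp returns per-character probabilities of shape (batch, 26) and that every video in the batch shares the same word label (again, an illustration rather than the paper's code):

    import torch
    import torch.nn.functional as F

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def character_presence_vector(word):
        # 26-dimensional binary indicator z: z_i = 1 iff the i-th letter occurs in the word.
        z = torch.zeros(len(ALPHABET))
        for ch in word.lower():
            z[ALPHABET.index(ch)] = 1.0
        return z

    def inspector_loss(D_insp, video, word):
        probs = D_insp(video)                            # (batch, 26), values in [0, 1]
        z = character_presence_vector(word).to(probs.device)
        z = z.unsqueeze(0).expand_as(probs)              # same target word assumed for the whole batch
        # Per-character binary cross-entropy; used with real videos for Eq. (7)
        # and with generated videos for Eq. (8).
        return F.binary_cross_entropy(probs, z)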
Feature Matching Loss. Finally, we further enhance the lip movement features of the generated samples by adding a feature matching loss term to the objective of the generator, which is similar to the commonly used VGG loss [45, 44]. The idea behind this loss is that two videos Y and Y* of the same word uttering l_w should yield similar visual features when they pass through a pre-trained lipreading network. We use the LSTM of the word classifier network D_cls, and its output at the last time step T, as the visual feature representation h_T(Y) of the video Y. During training we sample a random video Y* belonging to the target class l_w and compute the distance between the visual features it yields and the visual feature vector obtained for the fake one, Y = G(X, l_c). Therefore, the feature matching loss is defined as

L_fm = E_{X, l_w, Y*}[ ||h_T(Y*) − h_T(G(X, l_c))|| ],   (9)

and by minimizing L_fm with respect to the parameters of G, the generator learns to compose videos with visual features that match those of the real videos in the dataset.
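For illustration, assuming a frozen feature extractor h_T (e.g. the last LSTM state of the pre-trained D_cls) and, purely as an example, an L2 distance between the two feature vectors:

    import torch

    def feature_matching_loss(h_T, fake_video, real_video_same_word):
        # h_T: frozen feature extractor, e.g. the last LSTM state of the pre-trained D_cls.
        f_fake = h_T(fake_video)
        f_real = h_T(real_video_same_word).detach()
        # Distance between the feature vectors (Eq. 9); L2 is chosen here only for illustration.
        return torch.norm(f_fake - f_real, p=2, dim=-1).mean()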
Visual Speech GAN Objective. By combining all the loss terms, the generator of the proposed ViSpGAN model is trained under the objective function:

L_G = L_adv + λ_cls L_cls^fake + λ_cyc L_cyc + λ_insp L_insp^fake + λ_fm L_fm.   (10)

The discriminator D, the word classifier D_cls and the character inspector D_insp of ViSpGAN are optimized with respect to the losses

L_D = −L_adv,   L_Dcls = L_cls^real,   L_Dinsp = L_insp^real.   (11)

The hyper-parameters are used to balance the contributions of the corresponding loss terms in comparison with L_adv, and we set λ_cls = 1, λ_insp = 1, λ_cyc = 50 and λ_fm = 50.

Video Generator G. Based on [6], the generator is made up of three convolutional layers for downsampling the input, six residual blocks and another three transposed convolutional layers for upsampling. In order to adapt G to the task of video generation, we replace 2D convolutions with spatio-temporal 3D convolutions. We note that downsampling is applied not only in the spatial but also in the temporal dimension, reducing the dimension by a factor of two each time. In each layer, convolutions are followed by instance normalization [41] and the ReLU activation function, except for the last one, where we simply apply a hyperbolic tangent function to obtain a confined output value for each pixel of the output video.

Video Discriminator D and Character Inspector D_insp. We adopt the PatchGAN [21, 50, 18] setting for D, which identifies whether small video patches are real or fake; thus the output of the discriminator is not a single value but a 3D volume. The discriminator receives a video of T frames and uses a series of six 3D convolutions, each of which reduces the spatial dimensions by a factor of two, except for the last layer. On the other hand, the temporal dimension is reduced only after the third layer, since it is much smaller than the spatial ones. In D we do not apply instance normalization, but we use Leaky ReLU with a leakage of 0.05. The same architecture is used for the character inspector as well, except for the final layer, where the spatio-temporal dimensions become unitary after downsampling and the feature dimension contains 26 values, each one indicating the presence or absence of the corresponding letter in the input video.

Word classifier D_cls. This network is adopted from the work of [36] on the task of lipreading. First, the input video is passed through a 3D convolutional layer, followed by instance normalization and ReLU. Next, we unstack this 4D tensor (time x height x width x channels) in the temporal dimension, to get one 3D feature map for each time step. Each of these T feature maps is passed through the same 34-layer ResNet [14] for downsampling and then through a two-layer LSTM network. Finally, the last hidden state of the LSTM is mapped through a fully connected layer and a Softmax unit to a probability vector with length equal to the number of word classes N_l.
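To make the structure of the video generator G described above concrete, the following PyTorch-style sketch mirrors the downsample/residual/upsample design; the channel widths, kernel sizes and label-channel count are illustrative assumptions rather than values specified in the paper:

    import torch.nn as nn

    class ResBlock3D(nn.Module):
        # Residual block with two 3D convolutions and instance normalization.
        def __init__(self, channels):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv3d(channels, channels, kernel_size=3, padding=1),
                nn.InstanceNorm3d(channels), nn.ReLU(inplace=True),
                nn.Conv3d(channels, channels, kernel_size=3, padding=1),
                nn.InstanceNorm3d(channels))

        def forward(self, x):
            return x + self.block(x)

    class Generator3D(nn.Module):
        # Three strided 3D convolutions (each halving T, H and W), six residual
        # blocks, and three transposed 3D convolutions, with a tanh output.
        def __init__(self, in_channels=3 + 26, base=64):
            super().__init__()
            layers, ch = [], in_channels
            for out_ch in (base, base * 2, base * 4):
                layers += [nn.Conv3d(ch, out_ch, kernel_size=4, stride=2, padding=1),
                           nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True)]
                ch = out_ch
            layers += [ResBlock3D(ch) for _ in range(6)]
            for out_ch in (base * 2, base):
                layers += [nn.ConvTranspose3d(ch, out_ch, kernel_size=4, stride=2, padding=1),
                           nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True)]
                ch = out_ch
            layers += [nn.ConvTranspose3d(ch, 3, kernel_size=4, stride=2, padding=1), nn.Tanh()]
            self.net = nn.Sequential(*layers)

        def forward(self, x):
            # x: video frames depth-concatenated with character labels, (batch, 3 + 26, T, H, W).
            return self.net(x)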
4. Experiments
For empirical evaluations, we use the LRW dataset [9], which is an 'in-the-wild' audio-visual database of speakers uttering 500 different words. This database has been extracted from BBC TV broadcasts and contains various speakers in many different poses. There are roughly 1000 short videos for each word class in the training set (500K videos), while the validation and test sets both contain 50 video samples per word class. Each clip consists of 29 frames. In order to overcome the high computational demands of processing video inputs, we mainly focus on the mouth area of the speakers. We extract facial landmarks from each frame of every video in the dataset using [34]. Subsequently, we use the coordinates that correspond to the landmark in the center of the speaker's mouth to extract the region of interest, which is a fixed-size crop of the mouth area. Since the generator performs downsampling also in the temporal dimension, we choose to crop in the temporal dimension as well, such that the number of frames T can be divided by 2 multiple times. In our experiments, we use T = 24, by simply dropping the first two and the last three frames of the video. The word is typically contained in the center of the video, and the first and last frames often include parts of the previous/next words from the sentence the video was extracted from when LRW was created. We use data augmentation by performing horizontal flips of the frames. Finally, we divide the training set into two equal parts. The first split is used to train the models and the second split is used to train an auxiliary network for the quantitative evaluation of the models, as further explained in Section 4.1.

We conduct two sets of experiments. In the first one, we start with an ablation study of the ViSpGAN model and validate the individual effectiveness of 1) the word classifier D_cls, 2) the character inspector D_insp and 3) the feature matching loss L_fm, using a subset of 50 words from the LRW dataset. Then we provide a quantitative comparison of the proposed ViSpGAN model and the baseline method 3D StarGAN. In the second experiment, we qualitatively compare the generative performance of the models when trained on 50 words and report a qualitative analysis of the ViSpGAN model when trained on the whole dataset of 500 words. Our implementation is based on TensorFlow, and training the model on the entire dataset requires almost seven days on an NVIDIA Tesla K80 GPU.
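For illustration, the preprocessing described above (temporal cropping from 29 to 24 frames, a mouth-centred spatial crop and horizontal-flip augmentation) could be sketched as follows; the crop size, the [-1, 1] scaling and the function name are assumptions made here, not the paper's exact pipeline:

    import numpy as np

    def preprocess_clip(frames, mouth_centers, crop_size=64):
        # frames: (29, H, W, 3) video; mouth_centers: (29, 2) pixel coordinates (x, y)
        # of the central mouth landmark per frame. crop_size is a placeholder value.
        frames = frames[2:-3]                    # keep 24 frames: drop first two, last three
        mouth_centers = mouth_centers[2:-3]
        half = crop_size // 2
        crops = []
        for frame, (cx, cy) in zip(frames, mouth_centers):
            x0, y0 = int(cx) - half, int(cy) - half
            crops.append(frame[y0:y0 + crop_size, x0:x0 + crop_size])
        clip = np.stack(crops).astype(np.float32) / 127.5 - 1.0   # scale to [-1, 1] (assumed, to match the tanh output)
        if np.random.rand() < 0.5:               # horizontal-flip augmentation
            clip = clip[:, :, ::-1]
        return clip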
Setup. We conduct this experiment using 50 words. We arrange the words in alphabetical order and use all video samples from the first 50 word classes (N_l = 50). We train all models on the first split of the training set using the Adam optimizer and a batch size of 8. A single generator update is performed after 5 discriminator and character inspector updates. We train all models for 20 epochs, with a linearly decreasing learning rate over the later epochs, in the same manner as [6]. We pre-train the word classifier D_cls, which remains fixed in all models. This classifier is trained on the first split of the training set, before the adversarial training takes place. For that, we use Adam while exponentially decreasing the initial learning rate in discrete steps after each epoch; in addition, weight regularization is applied, and we use a batch size of 16. For all our experiments, we produce fake samples using video inputs from the test set of LRW, since these videos have not been seen by the models during adversarial training.
Quantitative evaluation. To assess the performance of the proposed framework and all its variants in the ablation study, we test how well an independently trained auxiliary lipreading network can classify the fake samples generated by the models. For this reason we train an auxiliary classifier for the task of word recognition, using the second split of the training set, which has not been seen by our model during adversarial training. The architecture of this network is inspired by that of the word classifier D_cls, but it utilizes a ResNet-18 and a bi-directional GRU in the back end. After training this classifier for the first 50 words of LRW, we get an accuracy of 73% on the test set. This performance is slightly inferior to state-of-the-art lipreading classifiers [36, 8] on the same dataset, and could be improved by using a more sophisticated training strategy. We deliberately avoid reusing the architecture of the word classifier D_cls here to be on the safe side: the GAN models in the ablation study that have been trained by optimizing the word classification loss have learned to 'fool' the architecture of D_cls by producing samples that it classifies correctly.
Model variation              Accuracy ↑    FID ↓
ViSpGAN w/o L_fm, D_cls
ViSpGAN w/o L_fm, D_insp
ViSpGAN w/o L_fm
ViSpGAN w/o D_cls
ViSpGAN                      99.1%
3D StarGAN                   89.2%
ViSpGAN (500 words)          95.6%

Table 1: Classification accuracy of the auxiliary word classifier on the videos generated by the model variations (the higher the better) and the FID scores (the lower the better).

In the ablation study, we compare the performance of the auxiliary classifier on synthetic samples produced by the ViSpGAN model and its variations: ViSpGAN without the word classifier D_cls, ViSpGAN without feature matching L_fm or without both, and ViSpGAN without the character inspector D_insp. For each model, we generate fake samples for each of the 50 target words using the unseen test set. We compute two metrics: the accuracy of the auxiliary classifier and the visual quality of the videos using the Fréchet Inception Distance (FID) [15]. For the latter, we use the auxiliary classifier (its last GRU state) as a feature extractor on real and fake videos. We compute the FID score using the real test videos and the fake videos produced by the generators of the models. We report the accuracy scores together with the FID scores in Table 1.

As can be seen from Table 1, the full model (ViSpGAN) enjoys the best performance compared to the other variants. It is also clear that the character inspector alone is not enough for synthesizing utterings of word structures, i.e. the auxiliary classifier has low accuracy on samples from the ViSpGAN w/o L_fm, D_cls model. In general, the auxiliary classifier has higher accuracy on samples generated by the models with the character inspector (ViSpGAN, ViSpGAN w/o L_fm, ViSpGAN w/o D_cls). We conjecture that those generated videos contain word utterings which are easier to classify due to their more pronounced character presence. The feature matching component L_fm helps to generate more consistent visual features for each word class. This significantly increases the classification accuracy on fake videos and produces more pronounced lip movements. However, it comes at the cost of the visual quality of the samples, as can be seen by the increased FID scores. The models without feature matching but with the word classifier (ViSpGAN w/o L_fm) produce samples with a better FID score but lower accuracy as evaluated by the auxiliary classifier.

Next, we compare the ViSpGAN model with our proposed adaptation of the StarGAN baseline model to video data, 3D StarGAN, when trained on 50 words. We report the accuracy and the FID score in Table 1. As can be seen from the results, the performance on samples from 3D StarGAN is inferior to the proposed ViSpGAN model in terms of accuracy, but has a better FID score. We also trained a variant of the ViSpGAN model with the word classifier alone (ViSpGAN w/o L_fm, D_insp), and, as expected, its performance is similar to 3D StarGAN both in terms of accuracy and FID score.

Finally, for comparison, we also perform the same evaluation protocol using the ViSpGAN model trained and tested on the entire dataset of 500 words. We report the results in Table 1. We observe that the FID score is greatly improved by training the model on the entire LRW dataset (which also corresponds to better visual quality of the generated videos and will be demonstrated in the next section on qualitative analysis). The accuracy stays high even for 500-word classification.
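The FID scores in Table 1 are computed from features of the auxiliary classifier. For reference, the standard FID computation [15] over two sets of already-extracted feature vectors is sketched below (this illustrates the generic formula, not the authors' code):

    import numpy as np
    from scipy import linalg

    def fid_score(real_features, fake_features):
        # real_features, fake_features: arrays of shape (num_videos, feature_dim),
        # e.g. the last GRU state of the auxiliary classifier for each video.
        mu_r, mu_f = real_features.mean(axis=0), fake_features.mean(axis=0)
        sigma_r = np.cov(real_features, rowvar=False)
        sigma_f = np.cov(fake_features, rowvar=False)
        covmean = linalg.sqrtm(sigma_r @ sigma_f)     # matrix square root of the covariance product
        if np.iscomplexobj(covmean):
            covmean = covmean.real
        diff = mu_r - mu_f
        return float(diff @ diff + np.trace(sigma_r + sigma_f - 2.0 * covmean))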
Qualitative evaluation. We start with the qualitative comparison of the fake samples generated by the 3D StarGAN and the ViSpGAN models, when trained on the LRW subset of 50 words. In Fig. 4, we show an example of the generated videos given an input video with the uttering "about" and the target word "after". As we can see, the video output of 3D StarGAN closely follows the input video. We can notice a slight 'mouth opening' in the blue box, a behaviour that corresponds to the uttering of the letter 'a'. The ViSpGAN models (with and without feature matching) create a much more plausible lip movement for 'a' at the same frames. Furthermore, in the green box, we observe that ViSpGAN produces more plausible and conspicuous lip movements for the letter combination 'ft', in comparison with 3D StarGAN and ViSpGAN without feature matching. We attribute this to the fact that the character labels along with the character inspector force the generator to learn visual features that correspond directly to letters.

Next, we demonstrate the ability of our model to synthesize different target words, "absolutely", "after" and "westminster", given the same video input. For this task, we use the ViSpGAN model trained on the entire LRW dataset. As can be seen in Fig. 5a, the generator has learned to modify the input video according to the target word and produces three different word utterings. Changes are most pronounced in the middle of the video, where the actual word resides. Finally, in Fig. 5b, we show that even when the generator is given inputs which are dissimilar to each other, it successfully synthesizes videos with consistent lip movements for the same target word ("football"). We use Poisson Editing [28] to blend the generated mouth region back into the original video with the speaker. More results, including videos, are in the supplementary material.

Figure 4: Generated videos conditioned on the same video input from the test set of LRW, using: (i) 3D StarGAN, (ii) ViSpGAN without feature matching, and (iii) ViSpGAN, when trained on 50 words. The original word is "about" and the target word is "after". The 'blue' box shows how the models generate the letter 'a' (slight mouth opening in the 3D StarGAN output, pronounced lip movement for both ViSpGAN models). In the 'green' box the ViSpGAN model produces more conspicuous lip movements for the letter combination 'ft' in comparison with the other two models.
Figure 5: Videos generated with our ViSpGAN model trained on 500 words. (a) Given an input video with the word "about" and three target words "absolutely", "after" and "westminster", the generator synthesizes three corresponding videos. (b) Given two input videos and the same target word "football", the generator produces two fake videos with consistent lip movements of the target word. For each video pair, the first one is the input to the generator and the second one is the output. In the 'blue' colored box, the uttering corresponds to "foot", while the 'red' colored box corresponds to "ball".
5. Conclusions
We presented ViSpGAN, a GAN-based framework for video-to-video translation, for the task of visual speech synthesis. We conditioned video generation on the video of a speaker and a target word encoded as a sequence of characters. We showed that synthetic samples generated by the proposed ViSpGAN model contain strong visual speech features, similar to the real videos, when assessed by an auxiliary lipreading network. The advantage of the proposed model is that, in principle, it can produce outputs of any length. By applying the generator sequentially on a long video we could produce multiple words based on their characters. This could be a way of generating videos of full sentences, and it is an interesting future direction to explore.

Acknowledgments
VS is supported by the Imperial College Research Fellowship. We gratefully acknowledge NVIDIA for GPU donation and Amazon for AWS Cloud Credits.
References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML), 2017.
[2] W. J. Baddar, G. Gu, S. Lee, and Y. M. Ro. Dynamics transfer GAN: Generating video by transferring arbitrary temporal dynamics from a source video to a single target image. arXiv:1712.03534, 2017.
[3] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv:1703.10717, 2017.
[4] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[5] D. Chen, J. Liao, L. Yuan, N. Yu, and G. Hua. Coherent online video style transfer. In International Conference on Computer Vision (ICCV), 2017.
[6] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[7] J. S. Chung, A. Jamaludin, and A. Zisserman. You said that? In British Machine Vision Conference (BMVC), 2017.
[8] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[9] J. S. Chung and A. Zisserman. Lip reading in the wild. In Asian Conference on Computer Vision (ACCV), 2016.
[10] P. Garrido, L. Valgaerts, H. Sarmadi, I. Steiner, K. Varanasi, P. Perez, and C. Theobalt. VDub: Modifying face video of actors for plausible visual alignment to a dubbed audio track. In Computer Graphics Forum, pages 193–204, 2015.
[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
[12] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. arXiv:1704.00028, 2017.
[13] A. Gupta, J. Johnson, A. Alahi, and L. Fei-Fei. Characterizing and improving stability in neural style transfer. arXiv:1705.02092, 2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), 2016.
[15] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NIPS), 2017.
[16] H. Huang, H. Wang, W. Luo, L. Ma, W. Jiang, X. Zhu, Z. Li, and W. Liu. Real-time neural style transfer for videos. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[17] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In European Conference on Computer Vision (ECCV), 2018.
[18] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[19] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), 2018.
[20] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning (ICML), 2017.
[21] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In European Conference on Computer Vision (ECCV), 2016.
[22] Y. Li, M. R. Min, D. Shen, D. E. Carlson, and L. Carin. Video generation from text. arXiv:1710.00421, 2017.
[23] M. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. arXiv:1703.00848, 2017.
[24] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
[25] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014.
[26] K. Olszewski, Z. Li, C. Yang, Y. Zhou, R. Yu, Z. Huang, S. Xiang, S. Saito, P. Kohli, and H. Li. Realistic dynamic facial textures from a single image using GANs. In International Conference on Computer Vision (ICCV), 2017.
[27] Y. Pan, Z. Qiu, T. Yao, H. Li, and T. Mei. To create what you tell: Generating videos from captions. arXiv:1804.08264, 2018.
[28] P. Pérez, M. Gangnet, and A. Blake. Poisson image editing. ACM Transactions on Graphics (TOG), 22:313–318, 2003.
[29] A. Pumarola, A. Agudo, A. Martinez, A. Sanfeliu, and F. Moreno-Noguer. GANimation: Anatomically-aware facial animation from a single image. In European Conference on Computer Vision (ECCV), 2018.
[30] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations (ICLR), 2015.
[31] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
[32] M. Ruder, A. Dosovitskiy, and T. Brox. Artistic style transfer for videos and spherical images. International Journal of Computer Vision, 126(11):1199–1219, 2018.
[33] M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. In International Conference on Computer Vision (ICCV), 2017.
[34] J. Shen, S. Zafeiriou, G. G. Chrysos, J. Kossaifi, G. Tzimiropoulos, and M. Pantic. The first facial landmark tracking in-the-wild challenge: Benchmark and results. In International Conference on Computer Vision Workshops, 2015.
[35] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[36] T. Stafylakis and G. Tzimiropoulos. Combining residual networks with LSTMs for lipreading. arXiv:1703.04105, 2017.
[37] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36:95:1–95:13, 2017.
[38] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. arXiv:1611.02200, 2016.
[39] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[40] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video generation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[41] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022, 2016.
[42] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems (NIPS), 2016.
[43] K. Vougioukas, S. Petridis, and M. Pantic. End-to-end speech-driven facial animation with temporal GANs. arXiv:1805.09313, 2018.
[44] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. In Advances in Neural Information Processing Systems (NIPS), 2018.
[45] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. arXiv:1711.11585, 2017.
[46] O. Wiles, A. Koepke, and A. Zisserman. X2Face: A network for controlling face generation by using images, audio, and pose codes. In European Conference on Computer Vision (ECCV), 2018.
[47] R. Xu, Z. Zhou, W. Zhang, and Y. Yu. Face transfer with generative adversarial network. arXiv:1710.06090, 2017.
[48] H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang. Talking face generation by adversarially disentangled audio-visual representation. arXiv:1807.07860, 2018.
[49] H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang. Talking face generation by adversarially disentangled audio-visual representation. arXiv:1807.07860, 2018.
[50] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision (ICCV), 2017.
[51] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems (NIPS), 2017.