Lip Movements Generation at a Glance

Lele Chen*, Zhiheng Li*, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu

Department of Computer Science, University of Rochester
School of Computer Science, Wuhan University
Department of Biomedical Engineering, University of Rochester
Department of Electrical and Computer Engineering, University of Rochester
{lchen63,zli82,rmaddox}@ur.rochester.edu  {zhiyao.duan,chenliang.xu}@rochester.edu

Abstract.
Cross-modality generation is an emerging topic that aims to synthesize data in one modality based on information in a different modality. In this paper, we consider a task of such: given an arbitrary audio speech and one lip image of an arbitrary target identity, generate synthesized lip movements of the target identity saying the speech. To perform well on this task, a model must not only consider the retention of the target identity, the photo-realism of the synthesized images, and the consistency and smoothness of lip images in a sequence, but, more importantly, learn the correlations between audio speech and lip movements. To solve this collection of problems, we explore the best modeling of the audio-visual correlations in building and training a lip-movement generator network. Specifically, we devise a method to fuse audio and image embeddings to generate multiple lip images at once, and propose a novel correlation loss to synchronize lip changes and speech changes. Our final model utilizes a combination of four losses for a comprehensive consideration of lip movements generation; it is trained in an end-to-end fashion and is robust to lip shapes, view angles, and different facial characteristics. Thorough experiments on three datasets, ranging from lab-recorded to lips in-the-wild, show that our model significantly outperforms other state-of-the-art methods extended to this task.
1 Introduction

Cross-modality generation has become an important and emerging topic of computer vision and its broader AI communities, with examples beyond the most prominent image/video-to-text [1,2] found in video-to-sound [3], text-to-image [4], and even sound-to-image [5]. This paper considers the following task: given an arbitrary audio speech and one lip image of an arbitrary target identity, generate synthesized lip movements of the target identity saying the speech. Notice that the speech does not have to be spoken by the target identity, and neither the speech nor the image of the target identity is required to appear in the training set (see Fig. 1).
* Equal contribution. This work was done when Zhiheng Li was a visiting student at the University of Rochester.
Fig. 1: The model takes an audio speech of a woman and one lip image of the target identity, a male celebrity in this case, and synthesizes a video of the man's lips saying the same speech. The synthesized lip movements need to correspond to the speech audio and also maintain the target identity, video smoothness, and sharpness.

Solving this task is crucial to many applications, e.g., enhancing speech comprehension while preserving privacy, or assistive devices for hearing-impaired people.

Lip movements generation has traditionally been solved as a sub-problem of synthesizing a talking face from the speech audio of a target identity [6,7,8,9]. For example, Fan et al. [6] restitch the lower half of the face via a bi-directional LSTM to re-dub a target video from a different audio source; their model selects a target mouth region from a dictionary of saved target frames. More recently, Suwajanakorn et al. [9] generate a synthesized talking face of President Obama with accurate lip synchronization, given his speech audio. They first use an LSTM model trained on many hours of his weekly address footage to generate mouth landmarks, then retrieve mapped textures and apply complicated post-processing to sharpen the generated video. However, one common problem of these methods is that they retrieve rather than generate images and thus require a sizable amount of video frames of the target identity to choose from, whereas our method generates lip movements from a single image of the target identity, i.e., at a glance.

The only work we are aware of that addresses the same task as ours is Chung et al. [10]. They propose an image generator network with skip connections and optimize the reconstruction loss between synthesized images and real images. Each time, their model generates one image from 0.35 seconds of audio. Although their video, generated image-by-image and enhanced by post-processing, looks fine, they essentially bypass the harder questions concerning the consistency and smoothness of images in a sequence, as well as the temporal correlations between speech audio and lip movements in a video.

To overcome the above limitations, we propose a novel method that takes speech audio and a lip image of the target identity as input, and generates multiple lip images (16 frames) of a video depicting the corresponding lip movements (see Fig. 1). Observing that speech is highly correlated with lip movements even across identities, a concept that grounds lip reading [11,12], the core of our paper is to explore the best modeling of such correlations in building and training a lip-movement generator network. To achieve this goal, we devise a method to fuse time-series audio embedding and identity image embedding in generating multiple lip images, and propose a novel audio-visual correlation loss to synchronize lip changes and speech changes in a video. Our final model utilizes a combination of four losses, including the proposed audio-visual correlation loss, a novel three-stream adversarial learning loss that guides a discriminator to judge both image quality and motion quality, a feature-space loss that minimizes perceptual-level differences, and a reconstruction loss that minimizes pixel-level differences, for a comprehensive consideration of lip movements generation. The whole system is trained in an end-to-end fashion and is robust to lip shapes, view angles, and different facial characteristics (e.g., beard vs.
no beard).

We evaluate our model along with its variants on three datasets: the GRID audiovisual sentence corpus (GRID) [13], Linguistic Data Consortium (LDC) [14], and Lip Reading in the Wild (LRW) [12]. To measure the quantitative accuracy of lip movements, we propose a novel metric that evaluates the detected landmark distance of synthesized lips to ground-truth lips. In addition, we use a cohort of three metrics, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) [15], and the perceptual-based no-reference objective image sharpness metric (CPBD) [16], to measure the quality of the synthesized lip images, e.g., image sharpness. We compare our model with Chung et al. [10] and an extended version of the state-of-the-art video Generative Adversarial Network (GAN) model [17] on our task. Experimental results show that our model outperforms them significantly on all three datasets (see "Full model" in Tab. 3). Furthermore, we show real-world novel examples of synthesized lip movements of celebrities who are not in our dataset. Our demo video can be found at https://youtu.be/7IX_sIL5v0c.

Our paper makes three contributions. First, to the best of our knowledge, we are the first to consider the correlations between speech and lip movements in generating multiple lip images at a glance. Second, we explore various models and loss functions in building and training a lip-movement generator network. Third, we quantify the evaluation metrics, and our final model achieves significant improvement over state-of-the-art methods extended to this task on three datasets ranging from lab-recorded to lips in-the-wild.

The rest of our paper is organized as follows: Sec. 2 contains related work, Sec. 3 depicts our generator network for lip movements and introduces our audio-visual correlation loss, Sec. 4 details our full model and training procedure, Sec. 5 shows experimental results, and Sec. 6 concludes the paper and points out directions for future work.
2 Related Work

We briefly surveyed work on lip movements generation in the Introduction. Here, we discuss related work for each technique used in our model.
Fig. 2: Full model illustration. The audio encoder and identity encoder extract audio and visual embeddings. The audio-identity fusion network fuses features from the two modalities. The decoder expands the fused feature into a synthesized video. The correlation networks strengthen the audio-visual mapping. The three-stream discriminator distinguishes generated videos from real videos.

A related but different task to ours is lip reading, which also tackles a cross-modality problem: [11,12] use the correlation between lip movements and sentences/words to interpret audio information from visual information. Rasiwasia et al. [18] use Canonical Correlation Analysis (CCA) [19] subspace learning to learn two intermediate feature spaces for the two modalities, where they correlate the projected features. Cutler and Davis [20] use a Time-Delay Neural Network (TDNN) [21] to extract temporally invariant audio and visual features. These works have inspired us to model correlations between speech audio and lip movements in generating videos.

Audio variations and lip movements are not always synchronized in the production of human speech; lips often move before the audio signal is produced [22]. Such delay between audio and visual signals needs to be considered when designing a model. Suwajanakorn et al. [9] apply a time-delayed RNN that does not output values in the first few RNN cells, so the output is shifted according to the delayed steps. However, the delay is fixed empirically by hand, and it is hard to determine the amount of delay for videos in-the-wild. We follow [21] to extract features with a large receptive field along the temporal dimension, but use a convolutional network instead of a TDNN, which leads to a simpler design.

Adversarial training [23] was recently introduced as a novel and effective way to train generative models. Researchers find that by conditioning the model on additional information, it is possible to direct the data generation process [24,25,26]. Furthermore, GANs have shown the ability to bridge the gap between different modalities and produce useful joint representations. We also use a GAN loss in our training, but we show that combining it with other losses leads to better results.
3 Lip-Movement Generator Network

The overall data flow of our lip-movement generator network is depicted in Fig. 2.
For simple illustration, we omit the channel dimension of all tensors in this paper. Recall that the input to our network is a speech audio and one single image of the target identity, and the output is a sequence of synthesized lip images of the target identity saying that audio. The synthesized lip movements need to correspond to the speech audio, maintain the target identity, ensure video smoothness, and be photo-realistic.
3.1 Audio-Identity Fusion

First, we encode the two-stream input information. For the audio stream, the raw audio waveform, denoted as $S_{raw}$, is first transformed into a log-mel spectrogram (see details in Sec. 5.1), denoted as $S_{lms}$, and then encoded by an audio encoder network into audio features $f_s \in \mathbb{R}^{T \times F}$, where $T$ and $F$ denote the numbers of time frames and frequency channels. For the visual stream, an input identity image, denoted as $p_r$, is encoded by an identity encoder network into image features $f_p \in \mathbb{R}^{H \times W}$, where $H$ and $W$ denote the height and width of the output feature map.

We fuse the audio features $f_s$ and the visual features $f_p$ together; their output, the synthesized video feature $f_v$, is then expanded by several residual blocks and 3D deconvolution operations to generate the synthesized video $\hat{v}$. To ensure that the synthesized clip is based on the target person and also captures the time variation of the speech, we investigate an effective way to fuse $f_s$ and $f_p$ into $f_v$. The challenge is that the feature maps exist in different modalities, e.g., audio, visual, and audio-visual, and reside in different feature spaces, e.g., time-frequency, space, and space-time.
Fig. 3: Audio-identity fusion: transferring audio time-frequency features and image spatial features to video spatio-temporal features.

Our fusion method is based on duplication and concatenation; the process is depicted in Fig. 3. For each audio feature, we duplicate that feature along the frequency dimension at each time step, i.e., from size $T \times F$ to size $T \times F \times F$. The image feature, which can be viewed as a template for the video representation, is copied $T$ times, i.e., from $H \times W$ to $T \times H \times W$. We set $H = W = F$ in this method. Then, the two kinds of duplicated features are concatenated along the channel dimension.
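To make the tensor bookkeeping concrete, below is a minimal sketch of this fusion in PyTorch, with explicit batch and channel dimensions that the paper omits; `fuse_audio_identity` is a hypothetical name.

```python
# A minimal sketch of the duplicate-and-concatenate fusion, assuming PyTorch
# tensors with an explicit batch/channel layout (the paper omits channels).
import torch

def fuse_audio_identity(f_s: torch.Tensor, f_p: torch.Tensor) -> torch.Tensor:
    """f_s: audio features (B, C_a, T, F); f_p: image features (B, C_p, H, W).
    Assumes H == W == F as in the paper; returns f_v of shape
    (B, C_a + C_p, T, F, F)."""
    B, C_a, T, F = f_s.shape
    _, C_p, H, W = f_p.shape
    assert H == W == F, "the paper sets H = W = F"
    # Duplicate the audio features along a new spatial axis: T x F -> T x F x F.
    a = f_s.unsqueeze(-1).expand(B, C_a, T, F, F)
    # Copy the identity feature map T times: H x W -> T x H x W.
    p = f_p.unsqueeze(2).expand(B, C_p, T, H, W)
    # Concatenate along the channel dimension to form the video feature f_v.
    return torch.cat([a, p], dim=1)
```

The expanded tensors are broadcast views; `torch.cat` materializes the fused video feature that the decoder consumes.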
3.2 Audio-Visual Correlation Loss

We believe that the acoustic information of audio speech is correlated with lip movements even across identities because of their shared high-level representation. Moreover, we expect the variations along the temporal axis of the two modalities to be more strongly correlated: compared with the acoustic feature and the visual feature of lip shape themselves, the variation of the audio feature (e.g., the voice rising to a higher pitch) and the variation of the visual feature (e.g., the mouth opening) have a higher likelihood of being correlated. Therefore, we propose a method to optimize the correlations of the two modalities in their feature spaces. We use $f'_s$, of size $(T-1) \times F$, the derivative of the audio feature $f_s$ (of size $T \times F$) between consecutive frames in the temporal dimension, to represent the changes in speech. It goes through an audio derivative encoder network $\phi_s$, giving the audio derivative feature $\phi_s(f'_s)$. Similarly, we use $\mathcal{F}(v)$ to represent the optical flows between consecutive frames of a video $v$, where $\mathcal{F}$ is an optical flow estimation algorithm. It goes through an optical flow encoder network $\phi_v$, giving $\phi_v(\mathcal{F}(v))$ to depict the visual variations of lip movements in the feature space. We use a cosine similarity loss to maximize the correlation between the audio derivative feature and the visual derivative feature:

$$\ell_{corr} = 1 - \frac{\phi_s(f'_s) \cdot \phi_v(\mathcal{F}(v))}{\|\phi_s(f'_s)\| \cdot \|\phi_v(\mathcal{F}(v))\|}. \qquad (1)$$

Here, the optical flow algorithm applied to the synthesized frames needs to be differentiable for back-propagation [27]. In our implementation, we add a small constant $\epsilon$ to the denominator to avoid division by zero. To avoid the trivial solution in which $\phi_s$ and $\phi_v$ learn to predict constant outputs $\phi_s(f'_s)$ and $\phi_v(\mathcal{F}(v))$ that are perfectly correlated, driving $\ell_{corr}$ to 0, we combine other losses during the training process (see Eq. 2).
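Eq. (1) amounts to one minus a cosine similarity between the two derivative features. A minimal sketch, assuming $\phi_s$ and $\phi_v$ have already produced feature tensors of matching flattened size:

```python
# A hedged sketch of Eq. (1); eps plays the role of the small constant
# the paper adds to the denominator to avoid division by zero.
import torch

def correlation_loss(audio_deriv_feat: torch.Tensor,
                     flow_feat: torch.Tensor,
                     eps: float = 1e-8) -> torch.Tensor:
    """audio_deriv_feat = phi_s(f'_s), flow_feat = phi_v(F(v)), shapes (B, D)."""
    a = audio_deriv_feat.flatten(1)
    f = flow_feat.flatten(1)
    cos = (a * f).sum(dim=1) / (a.norm(dim=1) * f.norm(dim=1) + eps)
    return (1.0 - cos).mean()  # l_corr: 0 when the features are perfectly correlated
```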
Correlation Networks. The audio and visual information are not perfectly aligned in time; usually, the lip shape forms earlier than the sound. For instance, when we say the word 'bed', the upper and lower lips meet before the word is voiced [22]. If such a delay exists, the aforementioned correlation loss, which assumes perfectly aligned audio-visual information, may not work. We verify this delayed-correspondence problem with a case study on 3260 videos randomly sampled from the GRID dataset (see the sketch at the end of this section); the solution to the problem is given afterwards. In the case study, for each 75-frame video $v$, we calculate the mean values of each of the 74 derivatives of the audio $S_{lms}$ and the mean values of each of the 74 optical flow fields $\mathcal{F}(v)$. For each video, we shift the mean values of the optical flows forward in time at different offsets (0 to 7 in our case study) and calculate the Pearson correlation coefficient between the two sequences. Results for four videos are shown in Fig. 4(a). Finally, we count the number of videos for each offset at which a video attains its largest correlation coefficient, as shown in Fig. 4(b). Figure 4 shows that different videos prefer different offsets for the maximum correlation coefficient, which indicates that fixing one constant offset for all audio-visual inputs cannot handle the inconsistent delays among the videos of a dataset.

Fig. 4: (a): Correlation coefficients at different offsets for four example videos. (b): Number of videos per offset at which the maximum correlation coefficient is attained. The x-axes of both (a) and (b) indicate the number of time steps by which the flow field is shifted forward.

To mitigate this delayed-correlation problem, we design correlation networks (shown in Fig. 2) containing an audio derivative encoder $\phi_s$ and an optical flow encoder $\phi_v$ to extract the features used for calculating the correlation loss in Eq. 1. These networks reduce the feature size while retaining the temporal length, and the sizes of their two outputs are matched for calculating the correlation loss. We implement these networks with 3D CNNs, which also help mitigate the fixed-offset problem of previous work [9]. Both $\phi_s$ and $\phi_v$ output features with large receptive fields (9 for $\phi_s(f'_s)$ and 13 for $\phi_v(\mathcal{F}(v))$), which consider the audio-visual correlation over a large temporal extent. Compared with the time-delayed RNN proposed in [9], a CNN can learn the delay from the dataset rather than setting it as a hyperparameter. Moreover, the CNN architecture benefits from its weight-sharing property, leading to a simpler and smaller design than a TDNN [21].
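The case study above reduces to shifting one sequence against the other and scoring the Pearson correlation. An illustrative reconstruction under the stated setting (75-frame videos, offsets 0 to 7), with hypothetical array inputs:

```python
# Illustrative reconstruction of the delay case study, assuming the per-frame
# mean audio derivatives and mean optical-flow values of a 75-frame video are
# available as length-74 numpy arrays.
import numpy as np

def best_offset(audio_means: np.ndarray, flow_means: np.ndarray,
                max_offset: int = 7) -> int:
    """Return the forward shift of the flow sequence (0..max_offset) that
    maximizes the Pearson correlation with the audio-derivative sequence."""
    scores = []
    for k in range(max_offset + 1):
        a = audio_means[: len(audio_means) - k]  # drop the tail to match lengths
        scores.append(np.corrcoef(a, flow_means[k:])[0, 1])
    return int(np.argmax(scores))
```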
4 Full Model and Training

Without loss of generality, we use pairs of lip-movement video and speech audio $\{(v^j, s^j)\}$ in the training stage, where $v^j$ represents the $j$-th video in our dataset and $s^j$ represents the corresponding speech audio. We omit the superscript $j$ when it is not necessary for the discussion of one sample. We use $p_r$ to denote one lip image of the target speaker, which provides the initial texture information. During training, we train over $(v, s)$ in the training set and sample $p_r$ as one frame randomly selected from the raw video from which $v^j$ is sampled, so that $v$ and $p_r$ contain the same identity; the system is therefore robust to the lip shape in $p_r$. The objective of training is to generate a realistic video $\hat{v}$ that resembles $v$. For testing, the speech $s$ and the identity image $p_r$ can be any speech and any lip image (even outside the dataset used in training). Next, we present the full model in the context of training.

Our full model (see Fig. 2) is end-to-end trainable and is optimized according to the following objective function:

$$L = \ell_{corr} + \lambda_1 \ell_{pix} + \lambda_2 \ell_{perc} + \lambda_3 \ell_{gen}, \qquad (2)$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the coefficients of the different loss terms; we set them to 0.5, 1.0, and 1.0, respectively, in this paper. The intuitions behind the four losses are as follows (a sketch of their combination follows the list):

– $\ell_{corr}$: the correlation loss, illustrated in Sec. 3.2, is introduced to ensure the correlation between audio and visual information.
– $\ell_{pix}$: a pixel-level reconstruction loss, defined as $\ell_{pix}(\hat{v}, v) = \|v - \hat{v}\|$, which aims to make the model sensitive to the speaker's appearance, i.e., to retain the identity texture. However, we find that using it alone reduces the sharpness of the synthesized video frames.
– $\ell_{perc}$: the perceptual loss, originally proposed by [28] for image style transfer and super-resolution. It utilizes high-level features to compare generated images and ground-truth images, resulting in better sharpness of the synthesized image. We adapt this perceptual loss and detail it in Sec. 4.1.
– $\ell_{gen}$: the adversarial loss, which allows our model to generate overall realistic-looking images, defined as $\ell_{gen} = -\log D([s^j, \hat{v}^j])$, where $D$ is a discriminator network. We describe our proposed three-stream GAN discriminator in Sec. 4.2.
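For concreteness, a minimal sketch of how the four terms combine per Eq. (2), with the coefficients reported above; the individual loss tensors are assumed to be computed elsewhere:

```python
# Hypothetical combination of the four training losses per Eq. (2).
import torch

def full_model_loss(l_corr: torch.Tensor, l_pix: torch.Tensor,
                    l_perc: torch.Tensor, l_gen: torch.Tensor) -> torch.Tensor:
    """Weighted sum of the four losses; coefficients follow the paper."""
    lambda1, lambda2, lambda3 = 0.5, 1.0, 1.0
    return l_corr + lambda1 * l_pix + lambda2 * l_perc + lambda3 * l_gen
```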
4.1 Perceptual Loss

To avoid an over-smoothed appearance of the synthesized video frames $\hat{v}$, we adapt the perceptual loss proposed by Johnson et al. [28], which reflects the perceptual-level similarity of images. The perceptual loss is defined as:

$$\ell_{perc}(\hat{v}, v) = \|\varphi(v) - \varphi(\hat{v})\|, \qquad (3)$$

where $\varphi$ is a feature extraction network. We train an autoencoder to reconstruct video clips; to make the network more sensitive to structural features, we apply six residual blocks after the convolution layers. We train the autoencoder from scratch, then fix the weights and use its encoder part as $\varphi$ to calculate the perceptual loss when training the full model.

4.2 Three-Stream GAN Discriminator

Fig. 5: Three-stream GAN discriminator illustration.

The GAN discriminator in [17] for video synthesis considers motion changes only implicitly through 3D convolution. To generate sharp and smoothly changing video frames, we propose a three-stream discriminator network (see Fig. 5) to distinguish the synthesized video $\hat{v}^j$ from the real video $v^j$; it not only considers motion explicitly but is also conditioned on the input speech signal. The input to the discriminator is a video clip with the corresponding audio, processed by the following three streams. For the audio stream (also used in our generator), we first convert the raw audio to a log-mel spectrogram, then use four convolutional layers followed by a fully-connected layer to obtain a 1D vector, which we duplicate to match the features from the other streams. For the video stream, we use four 3D CNN layers to extract video features. In addition, we include an optical flow stream that attends to motion changes explicitly: we fine-tune FlowNet [29], pre-trained on the FlyingChairs dataset, to extract optical flow, then apply four 3D CNN layers to extract features.

Finally, we concatenate the three-stream features along the channel dimension and pass them through two convolutional layers to output the discriminator probability. We adopt the mismatch strategy [4] to make our discriminator sensitive to mismatched audio and visual information. The discriminator loss is therefore defined as:

$$\ell_{dis} = -\log D([s^j, v^j]) - \lambda_p \log(1 - D([s^j, \hat{v}^j])) - \lambda_u \log(1 - D([s^j, v^k])), \quad k \neq j, \qquad (4)$$

where $v^k$ represents a mismatched real video. We set $\lambda_p$ and $\lambda_u$ to equal values.
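A hedged sketch of Eq. (4), assuming a discriminator callable `D` that maps an (audio, video) pair to a probability; since the shared value of $\lambda_p$ and $\lambda_u$ is not given above, 0.5 is an assumed illustration:

```python
# A hedged sketch of the discriminator objective in Eq. (4): real pairs,
# synthesized pairs, and mismatched real audio-video pairs.
import torch

def discriminator_loss(D, s, v_real, v_fake, v_mismatch,
                       lambda_p: float = 0.5, lambda_u: float = 0.5,
                       eps: float = 1e-8) -> torch.Tensor:
    """v_mismatch is a real video paired with the wrong audio (k != j)."""
    real = torch.log(D(s, v_real) + eps)
    fake = torch.log(1.0 - D(s, v_fake.detach()) + eps)  # stop generator grads
    mism = torch.log(1.0 - D(s, v_mismatch) + eps)
    return -(real + lambda_p * fake + lambda_u * mism).mean()
```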
5 Experiments

In this section, we first introduce the datasets, experimental settings, and our adapted evaluation metrics. Then, we show an ablation study and a comparison to the state of the art. Finally, we demonstrate real-world novel examples.

5.1 Datasets and Experimental Settings

We present our experiments on the GRID [13], LRW [12], and LDC [14] datasets. We crop all the mouth regions as our focused region. There are 33 different speakers in GRID, each with 1000 short videos. The LRW dataset consists of 500 different words spoken by hundreds of different speakers. There are 14 speakers in the LDC dataset, each of whom reads 238 different words and 166 different sentences. Videos in GRID and LDC are lab-recorded, while videos in LRW are collected from news broadcasts. Basic dataset information is shown in Tab. 1.

Table 1: Dataset information (number of samples; approximate duration in hours). Validation set: known speakers but unseen sentences. Testing set: unseen speakers and unseen sentences.

         GRID           LRW             LDC
Train    –              841k (~159 h)   36k (~6 h)
Val.     23k (~4 h)     N/A             4k (<1 h)
Test     7k (~1 h)      40k (~7 h)      6.6k (~1 h)

Our data is composed of two parts: audio and image frames. The network can output different numbers of frames; in this work, we only consider generating 16 image frames. As the videos are sampled at 25 fps, the time span of the synthesized image frames is 0.64 seconds. We use a sliding-window approach (window size: 16 frames, overlap: 8 frames) to obtain training and testing video samples from the raw videos.
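The sliding-window sampling is straightforward; an illustrative sketch under the stated setting (16-frame windows, 8-frame overlap):

```python
# An illustrative sliding-window sampler matching the stated setting:
# 25 fps video, 16-frame windows, 8-frame overlap.
def sliding_windows(num_frames: int, window: int = 16, overlap: int = 8):
    """Yield (start, end) frame indices of each training/testing sample."""
    step = window - overlap  # 8-frame hop
    for start in range(0, num_frames - window + 1, step):
        yield start, start + window

# Example: a 75-frame GRID video yields windows starting at 0, 8, 16, ..., 56.
```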
Audio: We extract audio from the video files at a sampling rate of 44.1 kHz. Each input audio clip is 0.64 seconds long (0.04 seconds per frame × 16 frames). When computing the log-mel spectrogram, the number of samples between successive frames (the hop length), the length of the FFT window, and the number of mel bands are 512, 1024, and 128, respectively. This operation converts a 0.64-second raw audio clip into a 64 × 128 time-frequency representation.
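A sketch of this preprocessing using librosa, which is an assumption, as the paper does not name its audio toolkit:

```python
# Log-mel preprocessing under the stated parameters; librosa is assumed.
import librosa

wav, sr = librosa.load("speech.wav", sr=44100)        # hypothetical input file
clip = wav[: int(0.64 * sr)]                          # one 0.64-second sample
mel = librosa.feature.melspectrogram(y=clip, sr=sr, n_fft=1024,
                                     hop_length=512, n_mels=128)
S_lms = librosa.power_to_db(mel).T                    # (time, 128) log-mel
# The paper reports a 64 x 128 time-frequency representation per sample.
```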
Images: First, we extract all image frames from the videos. Then, we extract lip landmarks [30] and crop each image around the lip; landmarks are used only for cropping and evaluation. We resize all cropped images to 64 × 64 and normalize them (mean = 0.5, std = 0.5) along the channel dimension. Thus, each 0.64-second audio clip corresponds to a 16 × 3 × 64 × 64 RGB image sequence.
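An illustrative version of this preprocessing, assuming OpenCV and a lip bounding box derived from the dlib landmarks; `preprocess_lip` is a hypothetical helper:

```python
# Illustrative lip-region preprocessing under the stated settings.
import cv2
import numpy as np

def preprocess_lip(frame: np.ndarray, box: tuple) -> np.ndarray:
    """Crop the lip region (x, y, w, h), resize to 64 x 64, and normalize
    with mean 0.5 and std 0.5 per channel, mapping pixels into [-1, 1]."""
    x, y, w, h = box
    crop = cv2.resize(frame[y:y + h, x:x + w], (64, 64))
    crop = crop.astype(np.float32) / 255.0
    return (crop - 0.5) / 0.5
```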
Implementation details: We adopt the Adam optimizer with a fixed learning rate of $10^{-4}$ and weight decay of $5 \times 10^{-4}$. We initialize all network layers according to the method described in [31]. All models are trained and tested on a single NVIDIA GTX 1080Ti. During testing, generating one single frame takes 0.015 seconds.
5.2 Evaluation Metrics

To evaluate the quality of the synthesized video frames, we compute the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) [15]. To evaluate the sharpness of the generated frames, we compute the perceptual-based no-reference objective image sharpness metric (CPBD) [16].

To the best of our knowledge, no quantitative metric has previously been used to evaluate the accuracy of generated lip-movement videos. Therefore, to evaluate whether the synthesized video $\hat{v}$ exhibits accurate lip movements corresponding to the input audio, we propose a new metric that calculates the Landmark Distance (LMD). We use Dlib [30], a HOG-based facial landmark detector widely used in lip-movement generation and related work [9,32], to detect lip landmarks on $\hat{v}$ and $v$, denoted $LF$ and $LR$, respectively. To eliminate the geometric difference, we calibrate the two mean points of the lip landmarks in $LF$ and $LR$. Then, we calculate the Euclidean distance between each corresponding pair of landmarks on $LF$ and $LR$, and normalize by the temporal length and the number of landmark points. LMD is defined as:

$$LMD = \frac{1}{T \times P} \sum_{t=1}^{T} \sum_{p=1}^{P} \|LR_{t,p} - LF_{t,p}\|_2, \qquad (5)$$

where $T$ denotes the temporal length of the video and $P$ denotes the total number of landmark points on each image (20 points).
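A minimal sketch of LMD, assuming the detected landmarks are available as arrays of shape (T, P, 2):

```python
# A minimal sketch of the LMD metric in Eq. (5), with P = 20 lip points.
import numpy as np

def landmark_distance(LF: np.ndarray, LR: np.ndarray) -> float:
    """LF: landmarks on synthesized frames; LR: landmarks on real frames."""
    # Calibration: subtract each frame's mean lip point to remove the
    # global geometric difference between the two videos.
    LF = LF - LF.mean(axis=1, keepdims=True)
    LR = LR - LR.mean(axis=1, keepdims=True)
    T, P, _ = LF.shape
    # Mean Euclidean distance over all frames and landmark points.
    return float(np.linalg.norm(LR - LF, axis=2).sum() / (T * P))
```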
Table 2: Ablation results on the GRID dataset. Methods (a)-(i) combine different subsets of the losses $\ell_{corr}$ (including a non-derivative correlation variant), $\ell_{gen}$, $\ell_{pix}$, and $\ell_{perc}$ with the two-stream, three-stream, and frame-difference three-stream discriminators, and are evaluated with LMD (lower is better) and SSIM, PSNR, and CPBD (higher is better). The full model (method (e)) uses all four losses as described in Sec. 4. We bold each leading score.
5.3 Ablation Study

We conduct ablation experiments to study the contributions of three components of our full model separately: the correlation loss, the three-stream GAN discriminator, and the perceptual loss. All ablation models are trained and tested on the GRID dataset; results are shown in Tab. 2, and the different implementations are discussed below.
Perceptual Loss. Generally, we find that the perceptual loss improves LMD, SSIM, and CPBD, meaning that it helps our model generate more accurate lip movements with higher image quality while also improving image sharpness (see method (c) vs. method (d) in Tab. 2).
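For reference, a sketch of how Eq. (3) can be computed, assuming `phi` is the frozen encoder of the pretrained video autoencoder; the norm choice is an assumption:

```python
# A hedged sketch of the perceptual loss of Eq. (3).
import torch

def perceptual_loss(phi, v_fake: torch.Tensor, v_real: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        target = phi(v_real)          # ground-truth features, no gradients
    return torch.norm(phi(v_fake) - target)  # feature-space distance
```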
Correlation Models. When the correlation loss is removed from the final objective (Eq. 2), the results are worse in LMD, SSIM, and PSNR, demonstrating the importance of the correlation loss in generating more accurate lip movements (see method (d) vs. method (e), or method (g) vs. method (h)).

Besides, we investigate a model variant, Non-Derivative Correlation (method (f) in Tab. 2), to analyze the necessity of feeding derivative features to $\phi_s$ and $\phi_v$. Instead of the derivative of the audio features and the optical flow, this variant uses the audio features $f_s$ and the video frames $v$ directly as inputs; neither the derivative nor the optical flow is computed. Other settings (e.g., network structure and loss functions) are identical to the full model (method (e) in Tab. 2). The comparison between methods (e) and (f) in Tab. 2 shows that the derivative correlation model outperforms the Non-Derivative Correlation model in SSIM, PSNR, and LMD. For the Non-Derivative Correlation model, the landmark distance is even worse than that of the model without the correlation loss (method (d)). This experimental result supports our assumption that it is the derivatives of the audio and visual information, rather than the raw features, that are correlated.
Fig. 6: Generated videos of our model (Full model) on the three testing datasets (GRID, LDC, LRW), compared with Chung et al., Vondrick et al., and the ground truth. None of the speakers in the testing set appear in the training set.

Furthermore, since the Non-Derivative Correlation model fails to learn the derivative features implicitly (i.e., its convolutional layers fail to transform the features into their derivatives), using the derivatives of the audio and visual features for the correlation, as strong expert prior knowledge, is necessary.
GAN Discriminator. We find that $\ell_{gen}$ improves CPBD (see methods (a), (b), and (c) in Tab. 2), demonstrating that the discriminator improves frame sharpness.

Furthermore, we use two model variants to study the effectiveness of the proposed three-stream GAN discriminator. The Two-Stream Discriminator (Two-Stream D. in Tab. 2) contains only the audio stream and the video stream. The Frame-Difference Three-Stream Discriminator (Three-Stream D. (Frame-Diff.) in Tab. 2) replaces the optical flow with the frame-wise difference. First, compared with the Two-Stream Discriminator variant, our full model with the proposed Three-Stream Discriminator gives better results (see method (e) vs. method (g)), which indicates the effectiveness of explicitly modeling motion changes among the frames. Second, compared with the Frame-Difference Three-Stream Discriminator variant, the full model generates more realistic (higher CPBD) and more accurate lip movements (lower LMD) (see methods (e) and (i)), which indicates that optical flow is a better representation than the frame-wise difference for modeling motion changes.
5.4 Comparison with State-of-the-Art Methods

In this section, we compare our full model with two state-of-the-art methods [17,10]. We extend [17] to a conditional GAN structure that receives the same target image information and audio information as our model. The quantitative results are shown in Tab. 3. We test our models on three different datasets; the results show that our proposed model outperforms the state-of-the-art models on most of the metrics. In terms of LMD and PSNR, our full model performs better than the methods that use a discriminator [17] or a reconstruction loss [10]. The model proposed by Chung et al., based on a reconstruction loss, generates much more blurred images, which makes them look unrealistic; we can see this phenomenon in the CPBD column. The LRW dataset consists of people talking in the wild, so the resolution of the lip region is much smaller; we need to scale the ground truth up to 64 × 64, which leads to a lower resolution and CPBD. We suspect this is the reason why we achieve a better CPBD than the ground truth on the LRW dataset.
Table 3: Results on the three datasets. All models in this table are trained from scratch (no pre-training) and tested on one dataset at a time. We bold each leading score.

               GRID                       LDC                        LRW
Method         LMD  SSIM PSNR  CPBD      LMD  SSIM PSNR CPBD        LMD SSIM PSNR CPBD
G.T.           0    N/A  N/A   0.141     0    N/A  N/A  0.211       0   N/A  N/A  0.068
Vondrick [17]  2.38 0.60 28.45 0.129     2.34 –    –    –           –   –    –    –
Chung [10]     –    –    –     –         –    –    –    –           –   –    –    –
Full model     –    –    –     –         –    –    –    –           –   –    –    –
Fig. 7: Randomly selected outputs of the full model on the LRW testing set. The lip shapes in the synthesized videos not only synchronize well with the ground truth but also maintain identity information, such as beard vs. no beard.
The qualitative results compared with other methods are shown in Fig. 6. Our model generates sharper video frames on all three datasets, which is also supported by the CPBD results, even when the input identity images are of low resolution. We show additional results of our method in Fig. 7. Our model generates realistic lip-movement videos that are robust to view angles, lip shapes, and facial characteristics most of the time. However, our model sometimes fails to preserve skin color (see the last two examples in Fig. 7), which, we suspect, is due to the imbalanced data distribution of the LRW dataset. Furthermore, the model has difficulty capturing the amount of lip deformation of each person, which is an intrinsic problem when learning from a single image.
Fig. 8: Generated images based on three identity images from outside the datasets, which are also not paired with the input audio from the GRID dataset. Two full models, trained on GRID and LRW respectively, are used for comparison.

Our model still performs well when generating videos given unpaired identity images and audio in the real world, i.e., when the source identity of the provided audio differs from the target identity and both are outside the datasets. Results are shown in Fig. 8, in which three identity images of celebrities are selected from outside the training datasets and the input audio is selected from the GRID dataset. For our model trained on LRW, both the identity images and the audio are unseen; for our model trained on GRID, we leave the source identity out of training. The generated videos show promising qualitative performance. The lip regions of Musk and Sandberg are both rotated by some degrees, and we can see the same rotation in the generated video frames. Besides, our model retains beards in the generated clips when the target identity has a beard. However, we observe that the model trained on the GRID dataset fails to preserve the identity information. Because the LRW dataset has many more identities than GRID (hundreds vs. 33), the model trained on LRW has better generalization ability.
6 Conclusion

In this paper, we study the task: given an arbitrary audio speech and one lip image of an arbitrary target identity, generate synthesized lip movements of the target identity saying the speech. To perform well on this task, a model must not only consider the retention of the target identity, the photo-realism of the synthesized images, and the consistency and smoothness of images in a video, but also learn the correlations between the speech audio and the lip movements. We achieve this by proposing a new generator network, a novel audio-visual correlation loss, and a full model that considers four complementary losses. We show significant improvements on three datasets compared to two state-of-the-art methods. There are several future directions. First, variable-length lip-movement generation is needed for more practical use. Second, it is valuable to extend our method to one generating the full face in an end-to-end paradigm.
Acknowledgement
This work was supported in part by NSF BIGDATA 1741472 and the University of Rochester AR/VR Pilot Award. We gratefully acknowledge the gift donations of Markable, Inc. and Tencent, and the support of NVIDIA with the donation of GPUs used for this research. This article solely reflects the opinions and conclusions of its authors and not the funding agents.
References
1. Das, P., Xu, C., Doell, R., Corso, J.J.: A thousand frames in just a few words: lingual description of videos through latent topics and sparse object stitching. In: IEEE Conference on Computer Vision and Pattern Recognition (2013)
2. Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., Berg, T.L.: Baby talk: Understanding and generating simple image descriptions. In: IEEE Conference on Computer Vision and Pattern Recognition (2011)
3. Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E.H., Freeman, W.T.: Visually indicated sounds. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
4. Reed, S.E., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: ICML (2016)
5. Chen, L., Srivastava, S., Duan, Z., Xu, C.: Deep cross-modal audio-visual generation. In: Proceedings of the Thematic Workshops of ACM Multimedia (2017)
6. Fan, B., Wang, L., Soong, F.K., Xie, L.: Photo-real talking head with deep bidirectional LSTM. In: IEEE International Conference on Acoustics, Speech and Signal Processing (2015)
7. Garrido, P., Valgaerts, L., Sarmadi, H., Steiner, I., Varanasi, K., Pérez, P., Theobalt, C.: VDub: Modifying face video of actors for plausible visual alignment to a dubbed audio track. Computer Graphics Forum (2015)
8. Charles, J., Magee, D.R., Hogg, D.C.: Virtual immortality: Reanimating characters from TV shows. In: ECCV Workshops (2016)
9. Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (2017)
10. Chung, J.S., Jamaludin, A., Zisserman, A.: You said that? In: British Machine Vision Conference (2017)
11. Assael, Y.M., Shillingford, B., Whiteson, S., de Freitas, N.: LipNet: End-to-end sentence-level lipreading. In: GPU Technology Conference (2017)
12. Chung, J.S., Zisserman, A.: Lip reading in the wild. In: Asian Conference on Computer Vision (2016)
13. Cooke, M., Barker, J., Cunningham, S., Shao, X.: An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America (2006)
14. Richie, S., Warburton, C., Carter, M.: Audiovisual database of spoken American English. Linguistic Data Consortium (2009)
15. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (2004)
16. Narvekar, N.D., Karam, L.J.: A no-reference image blur metric based on the cumulative probability of blur detection (CPBD). IEEE Transactions on Image Processing (2011)
17. Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: NIPS (2016)
18. Rasiwasia, N., Pereira, J.C., Coviello, E., Doyle, G., Lanckriet, G.R.G., Levy, R., Vasconcelos, N.: A new approach to cross-modal multimedia retrieval. In: Proceedings of the ACM International Conference on Multimedia (2010)
19. Hotelling, H.: Relations between two sets of variates. Biometrika (1936)
20. Cutler, R., Davis, L.S.: Look who's talking: Speaker detection using video and audio correlation. In: IEEE International Conference on Multimedia and Expo (2000)
21. Waibel, A.H., Hanazawa, T., Hinton, G.E., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing (1989)
22. Chandrasekaran, C., Trubanova, A., Stillittano, S., Caplier, A., Ghazanfar, A.A.: The natural statistics of audiovisual speech. PLOS Computational Biology (2009)
23. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., Bengio, Y.: Generative adversarial nets. In: NIPS (2014)
24. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
25. Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In: NIPS (2016)
26. Odena, A., Olah, C., Shlens, J.: Conditional image synthesis with auxiliary classifier GANs. In: ICML (2017)
27. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Cognitive Modeling (1988)
28. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision (2016)
29. Dosovitskiy, A., Fischer, P., Ilg, E., Häusser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. In: IEEE International Conference on Computer Vision (2015)
30. King, D.E.: Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research (2009)
31. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: IEEE International Conference on Computer Vision (2015)
32. Son Chung, J., Senior, A., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)
Appendix A Network Architectures
This section details the network structures mentioned in the main paper, including the Audio Encoder (Tab. 4), the Identity Encoder (Tab. 5), the Decoder (Tab. 6), the Audio Derivative Encoder (Tab. 7), the Flow Encoder (Tab. 8), and the three-stream GAN discriminator (Tab. 9, Tab. 10, and Tab. 11). For simplicity, every "Conv" in the following tables stands for the sequence of Convolution, Batch Normalization, and ReLU.
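A sketch of this "Conv" unit in PyTorch (the framework is an assumption):

```python
# The "Conv" unit used throughout these tables:
# Convolution -> Batch Normalization -> ReLU, in either 2D or 3D.
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel, stride, padding, three_d=False):
    Conv = nn.Conv3d if three_d else nn.Conv2d
    Norm = nn.BatchNorm3d if three_d else nn.BatchNorm2d
    return nn.Sequential(Conv(in_ch, out_ch, kernel, stride, padding),
                         Norm(out_ch), nn.ReLU(inplace=True))
```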
Each table lists the network's layers with their output sizes, kernels, strides, and padding.

Table 4: Network structure of the Audio Encoder.

Table 5: Network structure of the Identity Encoder.

Table 6: Network structure of the Decoder, built from 3D residual blocks and transposed convolutions. One "Trans. Convs" stands for a sequence of Transposed Convolution, Batch Normalization, and ReLU.

Table 7: Network structure of the Audio Derivative Encoder (denoted as $\phi_s$ in the main paper).

Table 8: Network structure of the Flow Encoder (denoted as $\phi_v$ in the main paper).

Table 9: Network structure of the audio stream in the three-stream GAN discriminator. FC stands for fully connected layer.

Table 10: Network structure of the video stream in the three-stream GAN discriminator.

Table 11: Network structure of the optical flow stream in the three-stream GAN discriminator.