Geometry-guided Dense Perspective Network for Speech-Driven Facial Animation
Jingying Liu†, Binyuan Hui†, Kun Li∗, Member, IEEE, Yunke Liu, Yu-Kun Lai, Member, IEEE, Yuxiang Zhang, Yebin Liu, Member, IEEE, and Jingyu Yang, Senior Member, IEEE

• † Equal contribution.
• ∗ Corresponding author: Kun Li (Email: [email protected])
• Jingying Liu, Binyuan Hui, Kun Li and Yunke Liu are with the College of Intelligence and Computing, Tianjin University, Tianjin 300350, China.
• Yu-Kun Lai is with the School of Computer Science and Informatics, Cardiff University, Cardiff CF24 3AA, United Kingdom.
• Yuxiang Zhang and Yebin Liu are with the Department of Automation, Tsinghua University, Beijing 100084, China.
• Jingyu Yang is with the School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China.
Abstract—Realistic speech-driven 3D facial animation is a challenging problem due to the complex relationship between speech and face. In this paper, we propose a deep architecture, called Geometry-guided Dense Perspective Network (GDPnet), to achieve speaker-independent realistic 3D facial animation. The encoder is designed with dense connections to strengthen feature propagation and encourage the re-use of audio features, and the decoder is integrated with an attention mechanism to adaptively recalibrate point-wise feature responses by explicitly modeling interdependencies between different neuron units. We also introduce a non-linear face reconstruction representation as a guidance of the latent space to obtain more accurate deformation, which helps solve the geometry-related deformation and is good for generalization across subjects. Huber and HSIC (Hilbert-Schmidt Independence Criterion) constraints are adopted to promote the robustness of our model and to better exploit the non-linear and high-order correlations. Experimental results on the public dataset and a real scanned dataset validate the superiority of our proposed GDPnet compared with the state-of-the-art model.
Index Terms—Speech-driven, 3D Facial Animation, Geometry-guided, Speaker-independent
1 Introduction
The most important approach of human communication is through speaking and making corresponding facial expressions. Understanding the correlation between speech and facial motion is highly valuable for human behavior analysis. Therefore, speech-driven facial animation has drawn much attention from both academia and industry recently, and has a wide range of applications and prospects, such as gaming, live broadcasting, virtual reality, and film production [24], [26], [38]. 3D models, as a popular and effective representation for human faces, have a stronger ability than 2D images to show facial motion and to help understand the correlation between speech and facial motion. However, 3D models are more complicated than images, and it is more difficult to obtain realistic 3D animation results. As shown in Figure 1, our aim is to animate a 3D template model of any person according to an audio input.

Despite the great progress in speaker-specific speech-driven facial animation [3], [19], [31], speaker-independent facial animation is still a challenging problem. Some methods animate unrealistic artist-designed character rigs driven by audio [6], [39]. Others achieve more realistic animation by combining audio and video [24], [28], relying on manual processes, or focusing only on the mouth [33]. VOCA [5] achieves the first audio-driven speaker-independent 3D facial animation in any language using a realistic 3D scanned template. It can generate animation of different styles across a range of identities. But there are still three challenges to achieve realistic audio-driven 3D facial animation for an arbitrary person and language:
• The animated results are easily affected by both facial motion and geometry structure. Therefore, we need to consider the geometry representation of 3D models to generate more realistic animation results, in addition to relating the audio and the facial motion.
• The relation between audio and visual signals is complicated, and we need more effective neural networks to learn this non-linear and high-order relationship.
• Real signals usually contain noise and outliers, which challenge the robustness of the animation method.

In this paper, to address these challenges, we propose a geometry-guided dense perspective network (GDPnet), which consists of encoder and decoder modules. For the encoder, to ensure maximum information flow between layers in the network, we connect all layers (with matching feature-map sizes) directly with each other. For the decoder, we utilize an attention mechanism to use global information to selectively emphasize informative features. Besides, we propose a geometry-guided strategy and adopt two constraints from different perspectives to achieve more robust animation. Experimental results demonstrate that the non-linear geometry representation is beneficial to the speech-driven model, and our model generalizes well to arbitrary subjects unseen during training.
We will make code available to the community.
Specifically, the main contributions of this work are summarized as follows:
• We propose a dense perspective network to better model the non-linear and high-order relationship between audio and visual signals. An encoder with dense connections is designed to strengthen feature propagation and encourage the re-use of audio features, and a decoder with an attention mechanism is used to better regress the final 3D facial mesh.
• We adopt a non-linear face representation to guide the network training, which helps to solve the geometry-related deformation and is effective for generalization across subjects.
• We introduce Huber and HSIC (Hilbert-Schmidt independence criterion) constraints to promote the robustness of our model and better measure the non-linear and high-order correlations.
• Our model is easy to train and fast to converge. At the same time, we achieve more accurate and realistic animation results for various persons in various languages.

Figure 1: Our method is able to output reasonable and realistic 3D animated faces for any person in any language. Top: Actor from VOCASET [5]; Middle: Actor from D3DFACS [4]; Bottom: Great tribute to Mr. Albert Einstein.
2 Related Work
Despite the great progress in facial animation from images or videos [20], [35], [36], [37], less attention has been paid to speech-driven facial animation, especially animating a 3D face. However, understanding the correlation between speech and facial deformation is very important for human behavior analysis and virtual reality applications. Speech-driven 3D facial animation can be categorized into two types, speaker-dependent animation and speaker-independent animation, according to whether the method supports generalization across characters.
Speaker-dependent animation mainly uses a large amount of data to learn the animation ability in a specific situation. Cao et al. [3] first rely on a database of high-fidelity recorded facial motions, which includes speech-related motions, but the method relies on high-quality motion capture data. Suwajanakorn et al. [31] utilize a recurrent neural network trained on millions of video frames to synthesize mouth shape from audio, but this method only focuses on learning to generate videos of President Barack Obama from his voice and stock footage. Karras et al. [19] first propose an end-to-end network for animation. Given a voice input and a specific emotion embedding, it outputs the 3D vertex positions of a fixed-topology mesh that corresponds to the center of the audio window. Besides, it can produce expressive 3D facial motion from audio in real time and with low latency. However, this kind of animation method has limited practical applications due to its inconvenience for generalization across characters.
Many works focus on the facial animation of artist-designed character rigs [6], [7], [13], [18], [30], [32], [33], [34], [39]. Liu et al. [24] first propose a speaker-independent method based on a Kinect sensor with video and audio input for 3D facial animation, which reconstructs 3D facial expressions and 3D mouth shapes from color and depth input with a multi-linear model and adopts a deep network to extract phoneme state posterior probabilities from the audio. However, this method relies on a lot of pre-processing and inefficient search methods. Taylor et al. [33] propose a simple and effective deep learning approach for speech-driven facial animation using a sliding window predictor to learn arbitrary non-linear mappings from phoneme label input sequences to mouth movements. Pham et al. [27] propose a regression framework based on a long short-term memory (LSTM) recurrent neural network to estimate rotation and activation parameters of a 3D blendshape face model. Based on this work, they [28] further employ convolutional neural networks to learn meaningful acoustic feature representations, but their method also needs the recurrent layer to process the information of time series. Zhou et al. [39] propose a three-stage network using hand-engineered audio features to regress a cartoon character. However, the animated face is not a realistic scanned face. Cudeiro et al. [5] first provide a self-captured multi-subject 4D face dataset and propose a generic speech-driven 3D facial animation framework that works across a range of identities. However, none of these methods take into account the influence of geometry representation on speech-driven 3D facial animation.

In this paper, we propose a speaker-independent speech-driven 3D facial animation method by designing a geometry-guided dense perspective network. The introduced non-linear geometry representation and the two constraints from different perspectives are very beneficial to achieve realistic and robust animation.

Figure 2: The architecture of our proposed geometry-guided dense perspective network.
3 Geometry-guided Dense Perspective Network
Figure 2 shows the architecture of our geometry-guided dense perspective network (GDPnet). First of all, we extract speech features using DeepSpeech [12] and encode the identity information as a one-hot embedding. After concatenating the two kinds of information, the encoder maps them to a latent low-dimensional representation. The purpose of the decoder is to map the hidden representation to a high-dimensional space of 3D vertex displacements, and the final output mesh is obtained by adding the displacements to the template.
Suppose we have three types of data $\{(\mathbf{p}, \mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{F}$. Here, the index $i$ refers to a specific frame, and $F$ is the total number of frames. $\mathbf{x}_i \in \mathbb{R}^{W \times D}$ is the speech feature window centered at the $i$-th frame generated by DeepSpeech [12], where $D$ is the number of phonemes in the alphabet plus an extra one for a blank label, and $W$ is the window size. $\mathbf{p} \in \mathbb{R}^{N \times 3}$ denotes the corresponding template mesh, reflecting the subject-specific geometry, and $N$ is the number of vertices of the mesh. $\mathbf{y}_i \in \mathbb{R}^{N \times 3}$ denotes the ground truth for facial animation at each frame. Finally, let $\hat{\mathbf{y}}_i \in \mathbb{R}^{N \times 3}$ denote the output of our GDPnet model for the input $\mathbf{x}_i$ with template $\mathbf{p}$.

Our GDPnet model consists of an encoder and a decoder, as shown in Figure 2. The input of the encoder is a DeepSpeech feature of the audio and specific identity information. In order to effectively express different subjects, we encode the identity information as a one-hot embedding so as to control different speaking styles. In particular, the dimension of the identity embedding is equal to the number of subjects in the training set. During inference, changing the identity embedding alters the output speaking style.
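To make the input encoding concrete, the sketch below assembles a $W \times D$ DeepSpeech window and a one-hot identity embedding into a single network input (NumPy, with illustrative helper names). The values of $W$, $D$ and the subject count follow Sections 3 and 4; tiling the identity code over the temporal axis is an assumption (VOCA-style conditioning), not the authors' released code.

```python
import numpy as np

# W x D DeepSpeech window plus one-hot identity; values follow Secs. 3-4,
# the temporal tiling of the identity code is an assumption.
W, D, NUM_SUBJECTS = 16, 29, 8

def build_input(deepspeech_window: np.ndarray, subject_id: int) -> np.ndarray:
    """Concatenate a W x D DeepSpeech window with a one-hot identity embedding."""
    assert deepspeech_window.shape == (W, D)
    one_hot = np.zeros(NUM_SUBJECTS, dtype=np.float32)
    one_hot[subject_id] = 1.0                       # selects the speaking style
    identity = np.repeat(one_hot[None, :], W, axis=0)
    return np.concatenate([deepspeech_window, identity], axis=-1)  # (W, D + NUM_SUBJECTS)

window = np.random.randn(W, D).astype(np.float32)   # stand-in for DeepSpeech features
print(build_input(window, subject_id=3).shape)       # (16, 37)
```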
The purpose of the encoder is to map speech features to latent representations. Similar to VOCA [5], to learn temporal features and reduce the dimensionality of the input, we stack four convolutional layers for the encoder.

The problem of simply stacking the convolutional layers is that the information in the shallow layers can be easily lost [15]. We believe that both shallow and deep features are important, and hence we need an effective way to combine the features in the shallow layers and the deep layers. This encourages feature reuse throughout the network, and leads to more compact models. To further improve the information flow between layers, we utilize a dense connectivity pattern. Consequently, the $\ell$-th layer receives the feature maps of all preceding layers, $\mathbf{x}_0, \ldots, \mathbf{x}_{\ell-1}$, as input:
$$\mathbf{x}_\ell = H_\ell([\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{\ell-1}]), \quad (1)$$
where $[\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{\ell-1}]$ refers to the concatenation of the feature maps produced in layers $0, \ldots, \ell-1$, and $H_\ell$ is a composite function of convolution (Conv) and pooling in the feature-map dimension to reduce the number of feature maps before concatenation (indicated as the Down Sample layer in Figure 2). As a direct consequence of the input concatenation, the feature maps learned by any layer can be accessed easily. Benefiting from the dense connection structure, we can reuse features effectively, which makes the encoder learn more specific and richer latent representations.

Figure 3: Qualitative evaluation results for clean audio inputs (ground truth, VOCA, GDPnet, and the corresponding error maps).

The decoder maps the latent representation to a high-dimensional space of 3D vertex displacements, and the final output mesh is obtained by adding the displacements to the template vertex positions. To achieve this, we stack two fully connected layers with the tanh activation function. Inspired by the attention mechanism in image classification [14], we add an attention mechanism to perform feature recalibration. In this way, the network learns to use global information to selectively emphasize informative features and suppress less useful ones. Let $\mathbf{x}_\ell \in \mathbb{R}^{C \times 1}$ denote the input of the attention layer, where $C$ is the number of feature maps; the attention value $\mathbf{a}_\ell$ is calculated by
$$\mathbf{a}_\ell = \sigma(\mathbf{W}_2\,\delta(\mathbf{W}_1 \mathbf{x}_\ell)), \quad (2)$$
where $\sigma$ refers to the sigmoid function, $\delta$ refers to the ReLU function, and $\mathbf{W}_1$ and $\mathbf{W}_2$ denote the learnable parameter weights of the attention block. The final output of the attention block is obtained by
$$\tilde{\mathbf{x}}_\ell = \mathbf{x}_\ell \otimes \mathbf{a}_\ell. \quad (3)$$
Here, $\otimes$ is element-wise multiplication. Through the attention block, the model can adaptively select important features for the current input samples, and different inputs generate different attention responses.

The final output layer is a fully connected layer with a linear activation function, which produces an $N \times 3$ output corresponding to the 3-dimensional displacement vectors of the $N$ vertices; $N = 5023$ is used in our experiments. The final mesh is generated by adding this output to the identity template.

Figure 4: Qualitative evaluation results for noisy audio inputs (ground truth, VOCA, GDPnet, and the corresponding error maps).
In order to make the training more stable, the weights of this layer are initialized with 50 PCA components calculated from the vertex displacements of the training data, and the bias is initialized to zero.
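As a rough illustration of the encoder-decoder just described, the sketch below builds a GDPnet-like model with Keras layers. The four densely connected convolutions (Eq. 1), the two tanh fully connected layers with attention recalibration (Eqs. 2-3), and the linear $N \times 3$ output follow the text; the channel widths, kernel sizes, temporal pooling and the omission of the PCA-based initialization are our simplifications, not the authors' exact configuration.

```python
import tensorflow as tf

W, D, NUM_SUBJECTS, N_VERTS, LATENT = 16, 29, 8, 5023, 64  # values from Secs. 3-4

def build_gdpnet_like_model():
    audio = tf.keras.Input(shape=(W, D + NUM_SUBJECTS))  # DeepSpeech window + one-hot identity

    # Encoder: four conv layers with dense connections (Eq. 1); a 1x1 conv plays
    # the role of the "Down Sample" block that shrinks the channel count before
    # the next concatenation.
    feats = [audio]
    for channels in (32, 32, 64, 64):
        h = tf.keras.layers.Concatenate(axis=-1)(feats) if len(feats) > 1 else feats[0]
        h = tf.keras.layers.Conv1D(channels, kernel_size=3, padding="same", activation="relu")(h)
        h = tf.keras.layers.Conv1D(channels, kernel_size=1, padding="same")(h)
        feats.append(h)
    h = tf.keras.layers.Concatenate(axis=-1)(feats)
    z = tf.keras.layers.Dense(LATENT)(tf.keras.layers.GlobalAveragePooling1D()(h))

    # Decoder: two tanh fully connected layers with attention recalibration
    # (Eqs. 2-3): a sigmoid gate computed from the features rescales them.
    h = tf.keras.layers.Dense(128, activation="tanh")(z)
    gate = tf.keras.layers.Dense(128, activation="relu")(h)
    gate = tf.keras.layers.Dense(128, activation="sigmoid")(gate)
    h = tf.keras.layers.Multiply()([h, gate])
    h = tf.keras.layers.Dense(128, activation="tanh")(h)

    # Linear output producing per-vertex 3D displacements, added to the
    # identity template outside the network.
    offsets = tf.keras.layers.Dense(N_VERTS * 3)(h)
    offsets = tf.keras.layers.Reshape((N_VERTS, 3))(offsets)
    return tf.keras.Model(audio, offsets)

model = build_gdpnet_like_model()
model.summary()
```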
The encoder-decoder structure described above can be regarded as a cross-modal process. The encoder maps the speech modality to the latent representation space, while the decoder maps the latent representation space to the mesh modality. We refer to the latent representation $\mathbf{r}$ as a cross-modal representation, which should express the expression and deformed geometry of a certain identity. It is closely related to the latent representations used for 3D face representation and reconstruction with autoencoders [17], [22], [29], where an encoder encodes the input face mesh into a latent representation $\hat{\mathbf{r}}$ and a decoder decodes this latent representation into a reconstructed 3D mesh. In this paper, we use an MGCN (Multi-column Graph Convolutional Network) [22] to extract a geometry representation for each training mesh due to its ability to extract non-local multi-scale features. The geometry network is an encoder-decoder architecture with multi-column graph convolutional networks that capture features of different scales and learn a better latent space representation.

During GDPnet training, each audio frame has a corresponding mesh, and its geometry representation can be obtained using the autoencoder. Using this 3D geometry representation effectively constrains the cross-modal representation. Specifically, we want the encoder output of GDPnet to be closely related to the 3D geometry representation. Therefore, we need an appropriate method to measure the relationship between them. Here we introduce two measurement approaches: the Huber [16] constraint and the Hilbert-Schmidt independence criterion (HSIC) constraint.

Most work uses the $\ell_2$ loss to measure the distance between two vectors, but this measurement is easily affected by noise and outliers. The $\ell_1$ loss is a better choice for robustness, but it is non-differentiable at 0, which makes optimization difficult. The Huber loss adopts a piece-wise formulation to integrate the advantages of the $\ell_2$ and $\ell_1$ losses and has been widely used in a variety of tasks.
Figure 5: Specific subject names for training, validation and test. Training set: FaceTalk_170728_03272_TA, FaceTalk_170811_03274_TA, FaceTalk_170904_00128_TA, FaceTalk_170904_03276_TA, FaceTalk_170912_03278_TA, FaceTalk_170913_03279_TA, FaceTalk_170915_00223_TA, FaceTalk_170725_00137_TA. Validation set: FaceTalk_170811_03275_TA, FaceTalk_170908_03277_TA. Test set: FaceTalk_170731_00024_TA, FaceTalk_170809_00138_TA.
Definition 1.
Assuming that there are two vectors $r$ and $\hat{r}$, the Huber constraint $L_\xi$ is defined as $L_\xi: \mathbb{R} \rightarrow [0, +\infty)$,
$$L_\xi(r, \hat{r}) = \begin{cases} \frac{1}{2}(r - \hat{r})^2 & \text{if } |r - \hat{r}| \le \xi, \\ \xi\,|r - \hat{r}| - \frac{1}{2}\xi^2 & \text{otherwise}, \end{cases} \quad (4)$$
where $\xi > 0$ is the parameter that balances bias and robustness, and is set to 1.0 as the default setting.

The parameter $\xi$ controls the blending of the $\ell_2$ and $\ell_1$ losses, which can be regarded as the two extremes of the Huber loss with $\xi \rightarrow \infty$ and $\xi \rightarrow 0$, respectively. For smaller values of $|r - \hat{r}|$, the loss function $L_\xi$ is an $\ell_2$ loss, and it becomes an $\ell_1$ loss when the magnitude of $|r - \hat{r}|$ exceeds $\xi$.

In addition to the distance between the two representations, we also impose a constraint from the perspective of correlations. If the two representations are more related, they should contain similar information. The Hilbert-Schmidt independence criterion (HSIC) measures non-linear and high-order correlations and is able to estimate the dependence between representations without explicitly estimating the joint distribution of the random variables. It has been successfully used in multi-view learning [2], [25].

Assuming that there are two variables $R = [r_1, \ldots, r_i, \ldots, r_M]$ and $\hat{R} = [\hat{r}_1, \ldots, \hat{r}_i, \ldots, \hat{r}_M]$, where $M$ is the batch size, we define a mapping $\phi(r)$ to a kernel space $\mathcal{H}$, in which the inner product of two vectors is defined as $k_R(r_i, r_j) = \langle \phi(r_i), \phi(r_j) \rangle$. Then, $\hat{\phi}(\hat{r})$ is defined to map $\hat{r}$ to a kernel space $\mathcal{G}$; similarly, the inner product of two vectors is defined as $k_{\hat{R}}(\hat{r}_i, \hat{r}_j) = \langle \hat{\phi}(\hat{r}_i), \hat{\phi}(\hat{r}_j) \rangle$.

Definition 2.
HSIC is formulated as
$$\mathrm{HSIC}\big(P_{R\hat{R}}, \mathcal{H}, \mathcal{G}\big) = \big\|C_{R\hat{R}}\big\|^2 = \mathbb{E}_{R\hat{R}R'\hat{R}'}\big[k_R(R, R')\, k_{\hat{R}}(\hat{R}, \hat{R}')\big] + \mathbb{E}_{RR'}\big[k_R(R, R')\big]\,\mathbb{E}_{\hat{R}\hat{R}'}\big[k_{\hat{R}}(\hat{R}, \hat{R}')\big] - 2\,\mathbb{E}_{R\hat{R}}\Big[\mathbb{E}_{R'}\big[k_R(R, R')\big]\,\mathbb{E}_{\hat{R}'}\big[k_{\hat{R}}(\hat{R}, \hat{R}')\big]\Big], \quad (5)$$
where $k_R$ and $k_{\hat{R}}$ are kernel functions, $\mathcal{H}$ and $\mathcal{G}$ are the Hilbert spaces, and $\mathbb{E}_{R\hat{R}}$ is the expectation over $R$ and $\hat{R}$. Let $\mathcal{D} := \{(r_1, \hat{r}_1), \cdots, (r_m, \hat{r}_m)\}$ be drawn from $P_{R\hat{R}}$. The empirical version of HSIC is induced as:
$$\mathrm{HSIC}(\mathcal{D}, \mathcal{H}, \mathcal{G}) = (m-1)^{-2}\, \mathrm{tr}(K_1 H K_2 H), \quad (6)$$
where $\mathrm{tr}(\cdot)$ is the trace of a square matrix, and $K_1$ and $K_2$ are the Gram matrices with $K_{1,ij} = k_R(r_i, r_j)$ and $K_{2,ij} = k_{\hat{R}}(\hat{r}_i, \hat{r}_j)$. $H$ centers the Gram matrix so that it has zero mean in the feature space:
$$H = I_m - \frac{1}{m}\mathbf{1}_m \mathbf{1}_m^{T}. \quad (7)$$
Please refer to [11] for a more detailed derivation of HSIC.

The loss of the proposed GDPnet consists of three parts, i.e., the reconstruction loss, the constraint loss and the velocity loss:
$$L = L_r + \lambda_1 L_c + \lambda_2 L_v, \quad (8)$$
where $\lambda_1$ and $\lambda_2$ are positive constants to balance the loss terms. The reconstruction loss $L_r$ computes the distance between the predicted output and the ground truth:
$$L_r = \|\mathbf{y}_i - \hat{\mathbf{y}}_i\|_F^2. \quad (9)$$
Figure 6: User study result. The bars show the percentage of users choosing VOCA [5] or ours given the same sentence for five cases.

During the training stage, the reconstruction representation constraint $L_c$ can use either the Huber or the HSIC constraint, as described in Section 3.3. The choice between the two constraints is a trade-off: the Huber constraint converges faster, while the HSIC constraint performs better, as discussed in Section 4.2. Besides, we use the velocity loss
$$L_v = \|(\mathbf{y}_i - \mathbf{y}_{i-1}) - (\hat{\mathbf{y}}_i - \hat{\mathbf{y}}_{i-1})\|_F^2 \quad (10)$$
to induce temporal stability, which considers the smoothness of the prediction and the ground truth in the sequence context.

Our GDPnet is implemented using TensorFlow [1] and trained with the Adam optimizer [21] on an NVIDIA GeForce GTX 1080 Ti GPU. We train our model for 50 epochs with a fixed learning rate (no learning rate decay) and use Adam with a momentum of 0.9, optimizing the loss between the output mesh and the ground-truth mesh. The balancing weights for the loss terms are fixed during training, with $\lambda_2 = 10.0$. For the network architecture, we use a window size of $W = 16$ with $D = 29$ speech features, and set the dimension of the latent representation to 64.
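To make the constraint losses of Section 3.3 concrete, the following NumPy sketch implements the Huber constraint of Eq. (4) (applied element-wise and averaged, which is our reading of the vector form) and the empirical HSIC estimator of Eqs. (6)-(7). The Gaussian kernel and the way HSIC would enter the total loss of Eq. (8) (e.g. with a negative sign so that dependence between the two representations is encouraged) are assumptions, not the authors' exact formulation.

```python
import numpy as np

def huber_constraint(r, r_hat, xi=1.0):
    """Huber constraint of Eq. (4) between the cross-modal code r and the
    geometry representation r_hat, applied element-wise and averaged."""
    d = np.abs(r - r_hat)
    quadratic = 0.5 * d ** 2
    linear = xi * d - 0.5 * xi ** 2
    return np.mean(np.where(d <= xi, quadratic, linear))

def empirical_hsic(R, R_hat, sigma=1.0):
    """Empirical HSIC of Eqs. (6)-(7); R and R_hat are (m, dim) batches of
    latent codes. The Gaussian kernel is an illustrative choice."""
    m = R.shape[0]
    def gram(X):
        sq = np.sum(X ** 2, axis=1)
        dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        return np.exp(-dist2 / (2.0 * sigma ** 2))
    H = np.eye(m) - np.ones((m, m)) / m              # centering matrix, Eq. (7)
    K1, K2 = gram(R), gram(R_hat)
    return np.trace(K1 @ H @ K2 @ H) / (m - 1) ** 2  # Eq. (6)

# Example: two correlated 64-d batches of latent codes.
rng = np.random.default_rng(0)
R = rng.standard_normal((8, 64))
R_hat = R + 0.1 * rng.standard_normal((8, 64))
print(huber_constraint(R, R_hat), empirical_hsic(R, R_hat))
```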
4 Experiments

In this section, we first introduce the experimental setup, including the dataset, the training setup and the evaluation metric. Then, we evaluate the performance of our GDPnet quantitatively and qualitatively, compared with the state-of-the-art method, and we also conduct a blind user study. Finally, we perform an ablation study to analyze the effects of different components of our approach.
VOCASET [5] provides about 29 minutes of high-quality 4D scans captured at 60 fps, as well as alignments of the entire head including the neck. The raw 3D head scans are registered with a sequential alignment method using the publicly available generic FLAME model [23]. Each registered mesh has 5023 vertices with 3D coordinates. In addition to high-quality face models, VOCASET also provides the corresponding voice data, which is very useful to train and evaluate speech-driven 3D facial animation. In total, it has 12 subjects and 480 sequences, each containing a sentence spoken in English with a duration of 3-5 seconds. The sentences are taken from a diverse corpus similar to [8]. As we know, the posture, head rotation and other subjective information of the speaker cannot be completely judged from the voice alone. In order to eliminate the influence of pose and distortion on the model, we only use the unposed data for training, so that we can effectively make use of the template information to obtain more realistic animation results for unseen voices.
In order to train and test effectively, we split the 12 subjects into a training set, a validation set and a test set, as VOCA [5] did: eight subjects for training, and the remaining subjects split as two for validation and two for testing. The training set consists of all sentences of the eight training subjects. For the validation and test sets, 20 unique sentences are selected so that they are not shared with any other subject. The specific data division is shown in Figure 5. Note that there is no overlap between the training, validation and test sets for either subjects or sentences.

Table 1: Performance (mm) and training time of different GDPnet variants (columns: HSIC, Huber, Dense, Attention, Validation error, Test error, Training time).

Table 2: Per-subject errors of VOCA [5] and our GDPnet (columns: Speaker val1, Speaker val2, Mean; Speaker test1, Speaker test2, Mean).
To quantitatively evaluate the performance of the proposed method and the compared method, we adopt the mean squared error (MSE), i.e., the average squared difference between the estimated value and the ground-truth value. Specifically, the MSE between the generated mesh $\hat{\mathbf{y}}$ and the ground-truth mesh $\mathbf{y}$ is defined as:
$$\mathrm{mse}(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{N} \sum_{i=1}^{N} \|\mathbf{v}_i - \hat{\mathbf{v}}_i\|^2, \quad (11)$$
where $\mathbf{v}_i$ is a vertex of the mesh, and $N$ is the number of vertices.

Furthermore, we study the impact of different components in our GDPnet. Specifically, we analyze four key components: the HSIC constraint, the Huber constraint, the dense connection structure in the encoder, and the attention mechanism. By taking one or several components into account, we obtain six variants as follows:
(a) without any of the components;
(b) with the HSIC constraint loss to leverage the geometry-guided training strategy;
(c) with the Huber constraint loss to leverage the geometry-guided training strategy;
(d) with the HSIC constraint and the dense connection structure in the encoder;
(e) with the HSIC constraint and the attention mechanism in the decoder;
(f) with the HSIC constraint, the dense connection structure and the attention mechanism.

In Table 1, we compare the mean squared errors of the different variants on the validation set and the test set, together with the training time. With the HSIC or Huber constraint, the adoption of our geometry-guided training strategy rapidly reduces the training convergence time of the network. Convergence with the Huber constraint is the fastest, because the computation of the Huber loss is cheaper than that of the HSIC loss. However, the performance with the Huber constraint is slightly worse than with the HSIC constraint, since the correlation measurement of HSIC is more consistent with this task. Therefore, we use the HSIC constraint in the following comparison experiments. In summary, each module in our GDPnet improves the animation performance effectively, especially when the dense connection structure and the attention mechanism are used together.
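For reference, the per-mesh error of Eq. (11) used in these comparisons can be computed as in the small sketch below, assuming the meshes are stored as $(N, 3)$ arrays.

```python
import numpy as np

def mesh_mse(y_hat, y):
    """Eq. (11): mean squared Euclidean distance over the N vertices of the
    predicted mesh y_hat and the ground-truth mesh y, both of shape (N, 3)."""
    return np.mean(np.sum((y - y_hat) ** 2, axis=1))
```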
In this section, we compare our method with the state-of-the-art method, VOCA [5], quantitatively and qualitatively, together with a user study. VOCA [5] is the only state-of-the-art method that pursues the same goal as our work: generating realistic 3D facial animation given an audio clip in any language and any 3D face model.
We first evaluate the quantitative results of our GDPnet and VOCA [5] on the VOCASET dataset. For a fair comparison, we use the same dataset split as VOCA [5]. As presented in Table 2, we calculate the mean squared error for each subject in the validation set and the test set. It can be seen that the overall performance of our model is better than VOCA, demonstrating the better generalization ability of our model. To more clearly distinguish the different speakers in the validation and test sets, we denote the $i$-th subject in the validation set as Speaker val$_i$, and similarly for the test set. Our GDPnet improves the accuracy on Speaker val$_1$ and achieves competitive performance on Speaker val$_2$ in the validation set. It is worth noting that, in the test set, our method reduces the error for both Speaker test$_1$ and Speaker test$_2$. This proves that GDPnet generalizes better than VOCA. Some visual results are shown in Figure 3. The per-vertex errors are color-coded on the reconstructed mesh for visual inspection. Our method obtains more accurate results, which are closer to the ground truth.

In order to evaluate the robustness of our method, we combine a speech signal with Gaussian noise, natural noise (from http://soundbible.com) or outliers, and use the polluted signal as the input. The averaged errors over the noisy cases are given in Table 2. Our method also obtains more accurate results for the noisy cases. Some visual results with Gaussian noisy inputs are shown in Figure 4. More results with different noises are shown in Figure 9. These results demonstrate that our GDPnet is more robust to noise and outliers.

Figure 7: Our method generalizes across various scanned models from the 3dMD dataset [9] (raw scan, fitted model, animation).

To evaluate the generalizability of our method, we perform a qualitative evaluation and a perceptual evaluation with a user study, compared with the state-of-the-art method. For the user study, we show the video results of our method and VOCA [5] speaking the same sentence, and ask the users to choose the one that is more reasonable and natural. We collect 199 answers in total, and Figure 6 shows the user study results. Our model receives many more votes than VOCA [5] in all five situations.

• Generalization across unseen subjects and real scanned subjects: Our method can animate any model that has a topology consistent with FLAME. To demonstrate the generalization capability of our method, we non-rigidly register the FLAME model against several scanned models from the 3dMD dataset [9], a self-scanned model and a model of Albert Einstein downloaded from TurboSquid. Specifically, we first manually define some 3D landmarks and fit the FLAME model to these 3D landmarks. Then, we adopt ED graph-based non-rigid deformation and per-vertex refinement to obtain a fitted mesh with geometry details. Figure 1 shows some animation results on unseen subjects in VOCASET [5], D3DFACS [4] and our fitted dataset, driven by the same audio sequence. Figure 7 gives more results on our fitted dataset. Video results compared with VOCA [5] are shown in the supplemental material. Our method achieves more reasonable and realistic 3D facial animation results.
• Generalization across languages:
Although trained with speech signals in English, our model can generate animation results in any language. Figure 8 shows some examples of our generalization, compared with VOCA [5]. Our method achieves more reasonable and realistic results for different languages. The supplementary video gives the detailed results.

Figure 8: Our method generates natural and realistic animations across languages (Chinese, German, Japanese), compared with VOCA [5].

• Robustness to noise and outliers: To demonstrate our robustness to noise and outliers, we combine a speech signal with Gaussian noise, natural noise or outliers, and use the polluted signal as the input. Figure 9 shows a comparison between VOCA [5] and our model. Benefiting from the geometry-guided training strategy, our model not only has a faster training convergence time, but also better robustness. Also, the supplementary video shows more visual results.
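The paper does not specify the exact noise levels or mixing procedure; the sketch below only illustrates one plausible way to corrupt a speech waveform with Gaussian noise at a chosen signal-to-noise ratio before extracting DeepSpeech features.

```python
import numpy as np

def add_gaussian_noise(speech, snr_db=10.0, seed=0):
    """Corrupt a 1-D speech waveform with white Gaussian noise at snr_db;
    the SNR value is a placeholder, not the paper's setting."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(speech ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return speech + rng.standard_normal(speech.shape) * np.sqrt(noise_power)
```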
For speech without face motion, where the mouth stays fully closed, our method cannot judge the speaker's expression from the voice alone. Figure 10 shows a failure case of our method. Therefore, using only audio features cannot achieve perfect 3D facial animation. In future work, we will investigate more supervision information to assist the model in face inference, such as 2D visual information.
5 Conclusion
In this paper, we propose a geometry-guided dense perspective network (GDPnet) to animate a 3D template model of any person speaking sentences in any language. We design an encoder with dense connections to strengthen feature propagation and encourage the re-use of audio features, and a decoder with an attention mechanism to better regress the final 3D facial mesh. We also propose a geometry-guided training strategy with two constraints from different perspectives to achieve more robust animation. Experimental results demonstrate that our method achieves more accurate and reasonable animation results and generalizes well to any unseen subject.

Figure 9: Our method is robust to various noise and outliers (Gaussian, natural, outliers) in the input audio, compared with VOCA [5].
Figure 10: One failure case using our method (ground truth vs. ours).

References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] X. Cao, C. Zhang, H. Fu, S. Liu, and H. Zhang. Diversity-induced multi-view subspace clustering. 2015.
[3] Y. Cao, W. C. Tien, P. Faloutsos, and F. H. Pighin. Expressive speech-driven facial animation. ACM Transactions on Graphics, 24:1283–1302, 2005.
[4] D. Cosker, E. Krumhuber, and A. Hilton. A FACS valid 3D dynamic action unit database with applications to 3D dynamic morphable facial modeling. In Proc. International Conference on Computer Vision, pages 2296–2303. IEEE, 2011.
[5] D. Cudeiro, T. Bolkart, C. Laidlaw, A. Ranjan, and M. J. Black. Capture, learning, and synthesis of 3D speaking styles. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[6] C. Ding, L. Xie, and P. Zhu. Head motion synthesis from speech using deep neural networks. Multimedia Tools and Applications, 74(22):9871–9888, 2015.
[7] P. Edwards, C. Landreth, E. Fiume, and K. Singh. JALI: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics, 35(4):1–11, 2016.
[8] W. M. Fisher. The DARPA speech recognition research database: specifications and status. In Proc. DARPA Workshop on Speech Recognition, pages 93–99, 1986.
[9] A. Ghosh, G. Fyffe, B. Tunwattanapong, J. Busch, X. Yu, and P. Debevec. Multiview face capture using polarized spherical gradient illumination. ACM Transactions on Graphics, 30(6):129, 2011.
[10] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
[11] A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory (ALT), 2005.
[12] A. Y. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
[13] P. Hong, Z. Wen, and T. S. Huang. Real-time speech-driven face animation with expressions using neural networks. IEEE Transactions on Neural Networks, 13(4):916–927, 2002.
[14] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[15] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. pages 2261–2269, 2016.
[16] P. J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 1964.
[17] Z.-H. Jiang, Q. Wu, K. Chen, and J. Zhang. Disentangled representation learning for 3D face shape. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11957–11966, 2019.
[18] P. Kakumanu, R. Gutierrez-Osuna, A. Esposito, R. Bryll, A. Goshtasby, and O. Garcia. Speech driven facial animation. In Proceedings of the 2001 Workshop on Perceptive User Interfaces, pages 1–5, 2001.
[19] T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics, 36:94:1–94:12, 2017.
[20] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Nießner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. Deep video portraits. ACM Transactions on Graphics, 37(4):1–14, 2018.
[21] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[22] K. Li, J. Liu, Y.-K. Lai, and J. Yang. Generating 3D faces using multi-column graph convolutional networks. In Computer Graphics Forum.
[23] T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6), 2017.
[24] Y. Liu, F. Xu, J. Chai, X. Tong, L. Wang, and Q. Huo. Video-audio driven real-time facial animation. ACM Transactions on Graphics, 34:182:1–182:10, 2015.
[25] K. W.-D. Ma, J. P. Lewis, and W. B. Kleijn. The HSIC bottleneck: Deep learning without back-propagation. ACM SIGKDD Conference on Knowledge Discovery and Data Mining, abs/1908.01580, 2019.
[26] Y. Pei and H. Zha. Transferring of speech movements from video to 3D face space. IEEE Transactions on Visualization and Computer Graphics, 13(1):58–69, 2007.
[27] H. X. Pham, S. Cheung, and V. Pavlovic. Speech-driven 3D facial animation with implicit emotional awareness: A deep learning approach. pages 2328–2336, 2017.
[28] H. X. Pham, Y. Wang, and V. Pavlovic. End-to-end learning for 3D facial animation from speech. In ACM International Conference on Multimodal Interaction, 2018.
[29] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black. Generating 3D faces using convolutional mesh autoencoders. In European Conference on Computer Vision (ECCV), 2018.
[30] G. Salvi, J. Beskow, S. Al Moubayed, and B. Granström. SynFace—speech-driven facial animation for virtual speech-reading support. EURASIP Journal on Audio, Speech, and Music Processing, 2009(1):191940, 2009.
[31] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics, 36:95:1–95:13, 2017.
[32] S. Taylor, A. Kato, B. Milner, and I. Matthews. Audio-to-visual speech conversion using deep neural networks. 2016.
[33] S. L. Taylor, T. Kim, Y. Yue, M. Mahler, J. Krahe, A. G. Rodriguez, J. K. Hodgins, and I. A. Matthews. A deep learning approach for generalized speech animation. ACM Transactions on Graphics, 36:93:1–93:11, 2017.
[34] S. L. Taylor, M. Mahler, B.-J. Theobald, and I. Matthews. Dynamic units of visual speech. In Proceedings of the 11th ACM SIGGRAPH/Eurographics Conference on Computer Animation, pages 275–284, 2012.
[35] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2387–2395, 2016.
[36] T. Weise, S. Bouaziz, H. Li, and M. Pauly. Realtime performance-based facial animation. ACM Transactions on Graphics, 30(4):1–10, 2011.
[37] Q. Wu, J. Zhang, Y.-K. Lai, J. Zheng, and J. Cai. Alive caricature from 2D to 3D. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7336–7345, 2018.
[38] Z. Deng, U. Neumann, J. P. Lewis, T.-Y. Kim, M. Bulut, and S. Narayanan. Expressive facial animation synthesis by learning speech coarticulation and expression spaces. IEEE Transactions on Visualization and Computer Graphics, 12(6):1523–1534, 2006.
[39] Y. Zhou, S. Xu, C. Landreth, E. Kalogerakis, S. Maji, and K. Singh. VisemeNet: audio-driven animator-centric speech animation. ACM Transactions on Graphics, 37(4), 2018.