Emotion Transfer Using Vector-Valued Infinite Task Learning
Alex Lambert, Sanjeel Parekh, Zoltán Szabó, Florence d'Alché-Buc
LTCI, Télécom Paris, Institut Polytechnique de Paris, France; Center of Applied Mathematics, CNRS, École Polytechnique, Institut Polytechnique de Paris, France. Alex Lambert and Sanjeel Parekh contributed equally. Corresponding author: [email protected]

Abstract
Style transfer is a significant problem of machine learning with numerous successful applications. In this work, we present a novel style transfer framework building upon infinite task learning and vector-valued reproducing kernel Hilbert spaces. We instantiate the idea in emotion transfer, where the goal is to transform facial images to different target emotions. The proposed approach provides a principled way to gain explicit control over the continuous style space. We demonstrate the efficiency of the technique on popular facial emotion benchmarks, achieving low reconstruction cost and high emotion classification accuracy.
1 Introduction

Recent years have witnessed increasing attention around style transfer problems (Gatys et al., 2016; Wynen et al., 2018; Jing et al., 2020) in machine learning. In a nutshell, style transfer refers to the transformation of an object according to a target style. It has found numerous applications in computer vision (Ulyanov et al., 2016; Choi et al., 2018; Puy and Pérez, 2019; Yao et al., 2020), natural language processing (Fu et al., 2018) as well as audio signal processing (Grinstein et al., 2018), where the objects at hand are contents in which style is inherently part of their perception. Style transfer is one of the key components of data augmentation (Mikołajczyk and Grochowski, 2018) as a means to artificially generate meaningful additional data for the training of deep neural networks. Besides, it has also been shown to be useful for counterbalancing bias in data by producing stylized contents with a well-chosen style (see for instance Geirhos et al. (2019)) in image recognition. More broadly, style transfer fits into the wide paradigm of parametric modeling, where a system, a process or a signal can be controlled by its parameter value. Adopting this perspective, style transfer-like applications can also be found in digital twinning (Tao et al., 2019; Barricelli et al., 2019; Lim et al., 2020), a field of growing interest in health and industry.

In this work, we propose a novel principled approach for style transfer, exemplified in the context of emotion transfer of face images. Given a set of emotions, classical emotion transfer refers to the task of transforming face images according to these target emotions. The pioneering works in emotion transfer include that of Blanz and Vetter (1999), who proposed a morphable 3D face model whose parameters could be modified for facial attribute editing. Susskind et al. (2008) designed a deep belief net for facial expression generation using action unit (AU) annotations. More recently, extensions of generative adversarial networks (GANs; Goodfellow et al. 2014) have proven to be particularly powerful for tackling image-to-image translation problems (Zhu et al., 2017). Several works have addressed emotion transfer for facial images by conditioning GANs on a variety of guiding information, ranging from discrete emotion labels to photos and videos. In particular, StarGAN (Choi et al., 2018) is conditioned on discrete expression labels for face synthesis. ExprGAN (Ding et al., 2018) proposes synthesis with the ability to control expression intensity through a controller module conditioned on discrete labels. Other GAN-based approaches make use of additional information such as AU labels (Pumarola et al., 2018), target landmarks (Qiao et al., 2018), fiducial points (Song et al., 2018) and photos/videos (Geng et al., 2018). While GANs have achieved high-quality image synthesis, they come with some pitfalls: they are particularly difficult to train and require large amounts of training data.

In this paper, unlike previous approaches, we adopt a functional point of view: given some person, we assume that the full range of the emotional faces can be modelled as a continuous function from emotions to images.
This view exploits the geometry of the representation of emotions (Russell, 1980), assuming that one can pass a facial image "continuously" from one emotion to another. We then propose to address the problem of emotion transfer by learning an image-to-function model able to predict, for a given facial input image represented by its landmarks (Tautkute et al., 2018), the continuous function that maps an emotion to the image transformed by this emotion.

This function-valued regression approach relies on a technique recently introduced by Brault et al. (2019) called infinite task learning (ITL). ITL enlarges the scope of multi-task learning (Evgeniou and Pontil, 2004; Evgeniou et al., 2005) by learning to solve simultaneously a set of tasks parametrized by a continuous parameter. While strongly linked to other parametric learning methods such as the one proposed by Takeuchi et al. (2006), the approach differs from previous works by leveraging operator-valued kernels and vector-valued reproducing kernel Hilbert spaces (vRKHS; Pedrick 1957; Micchelli and Pontil 2005; Carmeli et al. 2006). vRKHSs have proven to be relevant in solving supervised learning tasks such as multiple quantile regression (Sangnier et al., 2016) or unsupervised problems like anomaly detection (Schölkopf et al., 2001). A common property of these works is that the output to be predicted is a real-valued function of a real parameter.

To solve the emotion transfer problem, we present an extension of ITL, vector ITL (or shortly vITL), which involves functional outputs with vectorial representations of the faces and the emotions, showing that the approach remains easily controllable by the choice of appropriate kernels guaranteeing continuity and smoothness. In particular, the functional point of view, through the inherent regularization induced by the kernel, makes the approach suitable even for limited and partially observed emotional images. We demonstrate the efficiency of the vITL approach in a series of numerical experiments showing that it can achieve state-of-the-art performance on two benchmark datasets.

The paper is structured as follows. We formulate the problem and introduce the vITL framework in Section 2. Section 3 is dedicated to the underlying optimization problem. Numerical experiments conducted on two benchmarks of the domain are presented in Section 4. Discussion and future work conclude the paper in Section 5. Proofs of auxiliary lemmas are collected in Section 6.

2 Problem Formulation

In this section we define our problem. Our aim is to design a system capable of transferring emotions: having access to the face image of a given person, our goal is to convert his/her face to a specified target emotion. In other words, the system should implement a mapping of the form

(face, emotion) ↦ face.    (1)

In order to tackle this task, one requires a representation of the emotions, and similarly a representation of the faces. The classical categorical description of emotions deals with the classes 'happy', 'sad', 'angry', 'surprised', 'disgusted', 'fearful'. The valence-arousal model (Russell, 1980) embeds these categories into a 2-dimensional space. The resulting representations of the emotions are points θ ∈ R², the two coordinates of these vectors encoding the valence (pleasure to displeasure) and the arousal (high to low) associated with the emotions. This is the emotion representation we use, while noting that there are alternative encodings in higher dimension (Θ ⊂ R^p, p > 2; Vemulapalli and Agarwala 2019) to which the presented framework can be naturally adapted.
Throughout this work faces are represented by landmark points. Landmarks have proved to be a useful representation in facial recognition (Saragih et al., 2009; Scherhag et al., 2018; Zhang et al., 2015), 3D facial reconstruction and sentiment analysis. Tautkute et al. (2018) have shown that emotions can be accurately recognized by detecting changes in the localization of the landmarks. Given M landmarks on the face, each described by its 2D location, a face is encoded as a vector x ∈ X := R^{2M}; we write d := 2M. The resulting mapping (1) is illustrated in Fig. 1: starting from a neutral face and the target emotion happy, one can traverse to the happy face; from the happy face, given the target emotion surprise, one can get to the surprised face.

In an ideal world, for each person, one would have access to a trajectory z mapping each emotion θ ∈ Θ to the corresponding landmark locations x ∈ X; this function z : Θ → X can be taken for instance to be an element of L²(Θ, µ; X), the space of R^d-valued square-integrable functions w.r.t. a measure µ. The probability measure µ allows capturing the frequency of the individual emotions. In practice, one has realizations (z_i)_{i∈[n]}, each z_i corresponding to a single person possibly appearing multiple times. The trajectories are observable at finitely many emotions (θ̃_{i,j})_{j∈[m]}, where [m] := {1, …, m}. In order to capture relation (1) one can rely on a hypothesis space H with elements

h : X → (Θ → X).    (2)

The value h(x)(θ) represents the landmark prediction from face x and target emotion θ. We consider two settings for emotion transfer.
Single emotional input: In the first setting, the assumption is that all the faces appearing as the input in (1) come from a fixed emotion θ̃ ∈ Θ. The data which can be used to learn the mapping h consists of t = n triplets

x_i = z_i(θ̃) ∈ X,   Y_i = (z_i(θ̃_{i,j}) =: y_{i,j})_{j∈[m]} ∈ X^m,   (θ_{i,j})_{j∈[m]} = (θ̃_{i,j})_{j∈[m]} ∈ Θ^m,   i ∈ [t].

To keep the notation simple, we assume that m is the same for all the z_i-s. In this case θ_{i,j} is a literal copy of θ̃_{i,j}, which helps to get a unified formulation with the joint emotional input setting. To measure the quality of the reconstruction using a function h, one can consider a convex loss ℓ : X × X → R_+ on the landmark space, where R_+ denotes the set of non-negative reals. The resulting objective function to minimize is

R_S(h) := (1/(tm)) Σ_{i∈[t]} Σ_{j∈[m]} ℓ(h(x_i)(θ_{i,j}), y_{i,j}).    (3)

The risk R_S(h) captures how well the function h reconstructs on average the landmarks y_{i,j} when applied to the input landmark locations x_i; a sketch of its computation is given below.
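To make the objective concrete, here is a minimal NumPy sketch of the empirical risk (3) with the squared loss used later in the paper. The array layout (one row per person, one emotion grid per person) and the function name are our own illustrative conventions, not the authors' code.

```python
import numpy as np

def empirical_risk(h, X, Theta, Y):
    """Empirical risk (3) with the squared loss ||.||^2.

    X: (t, d) input landmarks, Theta: (t, m, 2) target emotions,
    Y: (t, m, d) observed landmarks; h(x, theta) returns a d-vector."""
    t, m = Theta.shape[:2]
    total = sum(np.sum((h(X[i], Theta[i, j]) - Y[i, j]) ** 2)
                for i in range(t) for j in range(m))
    return total / (t * m)
```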
Joint emotional input: In this setting, the faces appearing as input in (1) can arise from any emotion. The observations consist of triplets

x_{m(i−1)+l} = z_i(θ̃_{i,l}) ∈ X,   Y_{m(i−1)+l} = (z_i(θ̃_{i,j}) =: y_{m(i−1)+l,j})_{j∈[m]} ∈ X^m,   (θ_{m(i−1)+l,j})_{j∈[m]} = (θ̃_{i,j})_{j∈[m]} ∈ Θ^m,

where (i, l) ∈ [n] × [m] and the number of input-output pairs is t = nm. Having defined this dataset, one can optimize the same objective (3) as before. Particularly, this means that the pair (i, l) plays the role of the index i of the previous case; (θ_{i,j})_{(i,j)∈[t]×[m]} is an extended version of (θ̃_{i,j}) to match the indices going from 1 to t in (3).

We leverage the flexible class of vector-valued reproducing kernel Hilbert spaces (vRKHS; Carmeli et al. 2010) for the hypothesis class schematically illustrated in (2). Learning within a vRKHS has been shown to be relevant for tackling function-valued regression (Kadri et al., 2010, 2016). The construction follows the structure

h : X → (Θ → X),   with h(x) ∈ H_G and h ∈ H_K,    (4)

which we detail below. The vector (R^d)-valued capability is beneficial to handle the Θ → X = R^d mapping; the associated R^d-valued RKHS H_G is uniquely determined by a matrix-valued kernel G : Θ × Θ → R^{d×d} = L(X), where L(X) denotes the space of bounded linear operators on X, in this case the set of d × d matrices. Similarly, in (4) the X → H_G mapping is modelled by a vRKHS H_K corresponding to an operator-valued kernel K : X × X → L(H_G). A matrix-valued kernel G has to satisfy two conditions: G(θ, θ′) = G(θ′, θ)^⊤ for any (θ, θ′) ∈ Θ², where (·)^⊤ denotes transposition, and Σ_{i,j∈[N]} v_i^⊤ G(θ_i, θ_j) v_j ≥ 0 for all N ∈ N* := {1, 2, …}, {θ_i}_{i∈[N]} ⊂ Θ and {v_i}_{i∈[N]} ⊂ R^d. Analogously, for an operator-valued kernel K it has to hold that K(x, x′) = K(x′, x)* for all (x, x′) ∈ X², where (·)* denotes the adjoint operator, and Σ_{i,j∈[N]} ⟨w_i, K(x_i, x_j) w_j⟩_{H_G} ≥ 0, with ⟨·,·⟩_{H_G} being the inner product in H_G, for all N ∈ N*, {x_i}_{i∈[N]} ⊂ X and {w_i}_{i∈[N]} ⊂ H_G. These abstract requirements can be guaranteed for instance by the choice (made throughout the manuscript)

G(θ, θ′) = k_Θ(θ, θ′) A,   K(x, x′) = k_X(x, x′) Id_{H_G}    (5)

with scalar-valued kernels k_X : X × X → R and k_Θ : Θ × Θ → R, and a symmetric, positive definite matrix A ∈ R^{d×d}; Id_{H_G} is the identity operator on H_G. This choice corresponds to the intuition that for similar input landmarks and target emotions, the predicted output landmarks should also be similar, as measured by k_X, k_Θ and A, respectively. More precisely, smoothness (analytic property) of the emotion-to-landmark output function can be induced for instance by choosing a Gaussian kernel k_Θ(θ, θ′) = exp(−γ‖θ − θ′‖²) with γ > 0. The matrix A, when chosen as A = I_d, corresponds to independent landmark coordinates, while other choices encode prior knowledge about the dependency among the landmark coordinates (Álvarez et al., 2012).
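As an illustration of the decomposable choice (5), the following sketch builds the two scalar Gaussian Gram matrices from which the operator-valued kernel is assembled. The function name, the synthetic data and the use of scipy's cdist are our assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_gram(U, V, gamma):
    """Gram matrix of the Gaussian kernel k(u, v) = exp(-gamma * ||u - v||^2)."""
    return np.exp(-gamma * cdist(U, V, metric="sqeuclidean"))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 136))                 # five landmark vectors (t = 5)
Theta = rng.normal(size=(7, 2))               # seven target emotions in the VA plane
KX = gaussian_gram(X, X, gamma=0.1)           # (5, 5) input Gram matrix
KT = gaussian_gram(Theta, Theta, gamma=1.0)   # (7, 7) emotion Gram matrix
```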
Similarly, the smoothness of the function h can be driven by the choice of a Gaussian kernel over X, while the identity operator on H_G is the simplest choice to cope with functional outputs. Denoting the norm in H_K by ‖·‖_{H_K}, the final objective function is

min_{h∈H_K} R_λ(h) := R_S(h) + λ‖h‖²_{H_K}    (6)

with a regularization parameter λ > 0 which balances between the data-fitting term R_S(h) and smoothness (‖h‖²_{H_K}). We refer to (6) as vector-valued infinite task learning (vITL).
Remark: This problem is a natural adaptation of the ITL framework (Brault et al., 2019), which learns, with operator-valued kernels, mappings of the form X → (Θ → Y) where Y is a subset of R; here Y = X. Another difference is µ: in ITL this probability measure is designed to approximate integrals via a quadrature rule, in vITL it captures the observation mechanism.

3 Optimization

This section is dedicated to the solution of (6), which is an optimization problem over functions (h ∈ H_K). The following representer lemma provides a finite-dimensional parameterization of the optimal solution.

Lemma 3.1 (Representer)
Problem (6) has a unique solution ĥ, and it takes the form

ĥ(x)(θ) = Σ_{i=1}^{t} Σ_{j=1}^{m} k_X(x, x_i) k_Θ(θ, θ_{i,j}) A ĉ_{i,j},   ∀(x, θ) ∈ X × Θ,    (7)

for some coefficients ĉ_{i,j} ∈ R^d with i ∈ [t] and j ∈ [m].

Based on this lemma, finding ĥ is equivalent to determining the coefficients {ĉ_{i,j}}_{i∈[t], j∈[m]}. Throughout this paper we consider the squared loss ℓ(x, x′) = ‖x − x′‖²; in this case the task boils down to the solution of a linear equation, as detailed in the following result.

Lemma 3.2 (Optimization task for C) Assume that K is invertible, and let the matrix Ĉ = [Ĉ_i]_{i∈[tm]} ∈ R^{(tm)×d} containing all the coefficients, the Gram matrix K = [k_{i,j}]_{i,j∈[tm]} ∈ R^{(tm)×(tm)}, and the matrix consisting of all the observations Y = [Y_i]_{i∈[tm]} ∈ R^{(tm)×d} be defined as

Ĉ_{m(i−1)+j} := ĉ_{i,j}^⊤,   (i, j) ∈ [t] × [m],
k_{m(i₁−1)+j₁, m(i₂−1)+j₂} := k_X(x_{i₁}, x_{i₂}) k_Θ(θ_{i₁,j₁}, θ_{i₂,j₂}),   (i₁, j₁), (i₂, j₂) ∈ [t] × [m],
Y_{m(i−1)+j} := y_{i,j}^⊤,   (i, j) ∈ [t] × [m].

Then Ĉ is the solution of the following linear equation:

K Ĉ A + tmλ Ĉ = Y.    (8)

When A = I_d (the identity matrix of size d × d), the solution is analytic:

Ĉ = (K + tmλ I_{tm})^{−1} Y.    (9)

Remarks:

• Computational complexity: In case of A = I_d, the complexity of the closed-form solution is O((tm)³). If all the samples are observed at the same emotion locations, i.e. θ_{i,j} = θ_{l,j} for all (i, l, j) ∈ [t] × [t] × [m], then the Gram matrix K has a tensorial structure K = K_X ⊗ K_Θ with K_X = [k_X(x_i, x_j)]_{i,j∈[t]} ∈ R^{t×t} and K_Θ = [k_Θ(θ_{1,i}, θ_{1,j})]_{i,j∈[m]} ∈ R^{m×m}. In this case, the computational complexity reduces to O(t³ + m³); see the sketch after these remarks. If additional scaling is required, one can leverage recent dedicated kernel ridge regression solvers (Rudi et al., 2017; Meanti et al., 2020). If A is not the identity, then multiplying (8) by A^{−1} from the right gives K Ĉ + tmλ Ĉ A^{−1} = Y A^{−1}, which is a Sylvester equation for which efficient custom solvers exist (El Guennouni et al., 2002).

• Regularization in vRKHS: Using the notations above, for any h ∈ H_K parameterized by a matrix C, it holds that ‖h‖²_{H_K} = Tr(KCAC^⊤). Given two matrices A₁, A₂ and associated vRKHSs H_{K₁} and H_{K₂}, if A₁ and A₂ are invertible, then any function in H_{K₁} parameterized by C also belongs to H_{K₂} (and vice versa), within which it is parameterized by C A₁ A₂^{−1}. This means that the two spaces contain the same functions, but their norms are different.
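The closed form (9) and the Kronecker remark above can be turned into a short solver. The sketch below assumes A = I_d and a shared emotion grid (so K = K_X ⊗ K_Θ) and diagonalizes the two small factors instead of inverting the tm × tm Gram matrix; for a general invertible A one could instead apply scipy.linalg.solve_sylvester to K Ĉ + Ĉ (tmλ A^{−1}) = Y A^{−1}. All names are illustrative, not the authors' code.

```python
import numpy as np

def fit_vitl_identity(KX, KT, Y, lam):
    """Solve (9) for A = I_d when K = KX kron KT (shared emotion grid).

    KX: (t, t) Gram of the input landmarks, KT: (m, m) Gram of the emotions,
    Y:  (t*m, d) outputs stacked so that row m*(i-1)+j holds y_{i,j}.
    Returns the coefficients c_hat as an array of shape (t, m, d)."""
    t, m = KX.shape[0], KT.shape[0]
    sx, Ux = np.linalg.eigh(KX)          # KX = Ux diag(sx) Ux^T
    st, Ut = np.linalg.eigh(KT)          # KT = Ut diag(st) Ut^T
    Y3 = Y.reshape(t, m, -1)
    # Rotate into the joint eigenbasis of KX kron KT.
    Z = np.einsum("pi,qj,pqd->ijd", Ux, Ut, Y3)
    # The regularized kernel is diagonal there: divide by (sx_i st_j + tm*lam).
    Z /= sx[:, None, None] * st[None, :, None] + t * m * lam
    # Rotate back to obtain the coefficients of (9).
    return np.einsum("ip,jq,pqd->ijd", Ux, Ut, Z)

def predict(x_new, th_new, X, Theta, C3, gx, gt):
    """Evaluate (7) at a new (face, emotion) pair, with A = I_d and
    Gaussian kernels of bandwidths gx (inputs) and gt (emotions)."""
    kx = np.exp(-gx * np.sum((X - x_new) ** 2, axis=1))       # (t,)
    kt = np.exp(-gt * np.sum((Theta - th_new) ** 2, axis=1))  # (m,)
    return np.einsum("i,j,ijd->d", kx, kt, C3)
```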
4 Numerical Experiments

In this section we demonstrate the efficiency of the proposed vITL technique in emotion transfer. We first introduce the two benchmark datasets used in our experiments and give details about the data representation and the choice of the hypothesis space in Section 4.1. Then, in Section 4.2, we provide a quantitative performance assessment of the vITL approach (in mean squared error and classification accuracy sense), with a comparison to the state-of-the-art StarGAN method. Section 4.3 is dedicated to the investigation of the role of A (see (5)) and the robustness of the approach w.r.t. partial observation. These two sets of experiments (Section 4.2 and Section 4.3) are augmented with a qualitative analysis (Section 4.4). The code written for all these experiments is available on GitHub.

4.1 Datasets and Experimental Setup

We used the following two popular face datasets for evaluation.

• Karolinska Directed Emotional Faces (KDEF; Lundqvist et al. 1998): This dataset contains facial emotion pictures from 70 actors (35 females and 35 males) recorded over two sessions, which gives rise to a total of 140 samples per emotion. In addition to neutral, the captured facial emotions include afraid, angry, disgusted, happy, sad and surprised.

• Radboud Faces Database (RaFD; Langner et al. 2010): This benchmark contains emotional pictures of 67 unique identities (including Caucasian males and females, Caucasian children, and Moroccan Dutch males). Each subject was trained to show the following expressions: anger, disgust, fear, happiness, sadness, surprise, contempt, and neutral, according to the facial action coding system (FACS; Ekman et al. 2002).

In our experiments, we used frontal images and seven emotions from each of these datasets. An edge-map illustration of landmarks for different emotions is shown in Fig. 2.

At this point, it is worth recalling that we are learning a function-valued function h : X → (Θ → X) using a vRKHS as our hypothesis class (see Section 2). In the following we detail the choices made concerning the representation of the landmarks in X, that of the emotions in Θ, and the kernel design (k_X, k_Θ and A).
Landmark representation, pre-processing: We applied the following pre-processing steps to get the landmark representations which form the input of the algorithms. To extract the landmark points for all the facial images, we used the standard dlib library. The estimator is based on dlib's implementation of Kazemi and Sullivan (2014), trained on the iBUG 300-W face landmark dataset. Each landmark is represented by its 2D location. The alignment of the faces was carried out by the Python library imutils. The method ensures that faces across all identities and emotions are vertical, centered and of similar sizes. In essence, this is implemented through an affine transformation computed after drawing a line segment between the estimated eye centers. Each image was resized to a fixed square size, and the landmark points computed in the step above were transformed through the same affine transformation. These two preprocessing steps gave rise to the aligned, scaled and vectorized landmarks x ∈ R^{2·68} = R^{136}.
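A hedged sketch of the extraction step: dlib's 68-point predictor (the standard shape_predictor_68_face_landmarks.dat model trained on iBUG 300-W) combined with imutils' helper to obtain a (68, 2) array per face. The exact pipeline of the paper (alignment parameters, error handling) is not reproduced here; the function name is ours.

```python
import cv2
import dlib
from imutils import face_utils

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmarks(image_path):
    """Return the (68, 2) landmark array of the first face found in the image."""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    rects = detector(gray, 1)             # detected face rectangles
    shape = predictor(gray, rects[0])     # dlib full_object_detection
    return face_utils.shape_to_np(shape)  # numpy array of shape (68, 2)
```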
Emotion representation: We represented emotion labels as points in the 2D valence-arousal space (VA; Russell 1980). Particularly, we used the manually annotated part of the large-scale AffectNet database (Mollahosseini et al., 2017). For all samples of a particular emotion in the AffectNet data, we computed the centroid (data mean) of the valence and arousal values. The resulting ℓ2-normalized 2D vectors constituted our emotion representation, as depicted in Fig. 3. The normalization is akin to assuming that the modeled emotions are of the same intensity. In our experiments, the emotion 'neutral' was represented by the origin. Such an emotion embedding allowed us to take into account prior knowledge about the angular proximity of emotions in the VA space, while keeping the representation simple and interpretable for post-hoc manipulations.
Figure 2: Illustration of the landmark edge maps for different emotions (neutral, fearful, angry, disgusted, happy, sad, surprised) on both datasets: (a) KDEF, (b) RaFD.
Figure 3: Extracted ℓ2-normalized valence-arousal centroids for each emotion from the manually annotated train set of the AffectNet database.
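The embedding described above can be computed as follows: per-class valence-arousal centroids, ℓ2-normalized, with 'neutral' mapped to the origin. The function signature is our own; loading the AffectNet annotations is assumed to be done elsewhere.

```python
import numpy as np

def emotion_embeddings(va, labels, emotions):
    """VA-centroid embedding of emotion labels.

    va: (N, 2) valence-arousal annotations, labels: (N,) emotion of each row.
    Returns a dict mapping emotion name -> theta in R^2."""
    theta = {}
    for e in emotions:
        c = va[labels == e].mean(axis=0)  # centroid in the VA plane
        theta[e] = np.zeros(2) if e == "neutral" else c / np.linalg.norm(c)
    return theta
```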
Kernel design: We took the kernels k_X and k_Θ to be Gaussian on the landmark representation space and the emotion representation space, with respective bandwidths γ_X and γ_Θ. The matrix A was taken to be I_d unless specified otherwise.

4.2 Quantitative Assessment

In this section we provide a quantitative assessment of the proposed vITL approach.
Performance measures:
We applied two metrics to quantify the performance of the compared systems, namely the test mean squared error (MSE) and the emotion classification accuracy. The classification accuracy can be thought of as an indirect evaluation. To compute this measure, for each dataset we trained a ResNet-18 classifier to recognize emotions from ground-truth landmark edge maps (as depicted in Fig. 2). The trained network was then used to compute the classification accuracy over the predictions at test time. To rigorously evaluate the outputs for each split of the data, we used a classifier trained on RaFD to evaluate KDEF predictions and vice versa; this also allowed us to make the problem more challenging. The ResNet-18 network was appropriately modified to take grayscale images as input (a sketch is given below). During training, we used random horizontal flipping and cropping between 90-100% of the original image size to augment the data. All the images were finally resized to a common fixed size and fed to the network. The network was trained from scratch using the stochastic gradient descent optimizer with momentum.

We report the mean and standard deviation of the aforementioned metrics over ten 90%-10% train-test splits of the data. The test set for each split was constructed by removing 10% of the identities from the data. For each split, the best γ_X, γ_Θ and λ values were determined by cross-validation on KDEF and RaFD, respectively.
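One standard way to adapt ResNet-18 to single-channel edge maps, consistent with the description above, is to swap the first convolution for a one-channel stem; whether the authors did exactly this is an assumption on our part.

```python
import torch.nn as nn
from torchvision.models import resnet18

def grayscale_resnet18(num_classes=7):
    """ResNet-18 for 1-channel edge-map inputs, trained from scratch."""
    net = resnet18(num_classes=num_classes)  # random initialization
    # Replace the stock 3-channel stem with a single-channel convolution.
    net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    return net
```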
Baseline: We used the popular StarGAN (Choi et al., 2018) system as our baseline; other GAN-based studies use additional information and are not directly comparable to our setting. For a fair comparison, the generator G and discriminator D were modified to be fully-connected networks that take vectorized landmarks as input. In particular, G was an encoder-decoder architecture where the target emotion, represented by the same 2D emotion encoding as in our case, was appended at the bottleneck layer. It contained approximately one million parameters, chosen to be comparable with the number of coefficients in vITL (839,664 = 126 × 7 × 7 × 136 for KDEF). The ReLU activation function was used in all layers except before the bottleneck in G and before the penultimate layers of both G and D. We used the default parameter values of the authors' code (available at https://github.com/yunjey/stargan). Experiments over each split of KDEF and RaFD were run for 50K and 25K iterations, respectively.
MSE results:
The test MSE for the compared systems is summarized in Table 1. As the table shows, the vITL technique outperforms StarGAN on both datasets. One can observe a low reconstruction cost for vITL in both the single and the joint emotional input case. Interestingly, a performance gain is obtained with vITL joint on the RaFD data in the MSE sense. We hypothesize that this is due to the joint model benefiting from input landmarks of other emotions in the small-data regime (only 67 samples per emotion for RaFD). Despite our best efforts, we found it quite difficult to train StarGAN reliably, and the diversity of its outputs was low.

Classification results:
The emotion classification accuracies are available in Table 2. The classification results clearly demonstrate the improved performance and the higher quality of the generated emotions of vITL over StarGAN; the latter also produces predictions with visible face distortions, as illustrated in Section 4.4. To provide further insight into the classification performance, we also show the confusion matrices of the joint vITL model on a particular split of the KDEF and RaFD datasets in Fig. 4. For both datasets, the classes 'happy' and 'surprised' are the easiest to detect. Some confusions arise between the classes 'neutral' vs 'sad' and 'fearful' vs 'surprised'. Such mistakes are expected when only using landmark locations for recognizing emotions.
Table 1: MSE (mean ± std) on test data for the vITL single models (θ̃ ∈ {neutral, fearful, angry, disgusted, happy, sad, surprised}), the vITL joint model and the StarGAN system, on KDEF frontal and RaFD frontal. Lower is better.

Table 2: Emotion classification accuracy (mean ± std) for the vITL single models (top), the vITL joint model (middle) and the StarGAN system (bottom), on KDEF frontal and RaFD frontal. Higher is better.

Figure 4: Confusion matrices of the vITL joint model over the emotion classes (angry, disgusted, fearful, happy, neutral, sad, surprised). Left: KDEF; right: RaFD. The y axis represents the true labels, the x axis the predicted labels. More diagonal is better.

4.3 Role of the Matrix A and Robustness to Partial Observation

This section is dedicated to the effect of the choice of A (in the kernel G) and to the robustness of vITL w.r.t. partial observation.

Influence of A in the matrix-valued kernel G: Here we illustrate the effect of the matrix A (see (5)) on the vITL estimator and show that a good choice of A can lead to lower-dimensional models while preserving the quality of the prediction. The choice of A is built on the knowledge that the empirical covariance matrix of the output training data contains structural information that can be exploited within a vRKHS (Kadri et al., 2013). In order to investigate this possibility, we performed the singular value decomposition of Y^⊤Y, which gives the eigenvectors collected in a matrix V ∈ R^{d×d}.
Figure 5: Test MSE (mean ± std) on KDEF and RaFD as a function of the rank of the matrix A. Smaller MSE is better.

Figure 6: Logarithm of the test MSE (min-mean-max) on KDEF and RaFD as a function of the percentage of missing data. Solid line: mean; dashed lines: min-max. Smaller MSE is better.
For a fixed rank r ≤ d, define J_r = diag(1, …, 1, 0, …, 0) with r ones followed by d − r zeros, set A = V J_r V^⊤ and train a vITL system with the resulting A. While in this case A is no longer invertible, each coefficient ĉ_{i,j} from Lemma 3.1 belongs to the r-dimensional subspace of R^d generated by the eigenvectors associated with the r largest eigenvalues of Y^⊤Y. This makes a reparameterization possible and leads to a decrease in the size of the model, going from t × m × d parameters to t × m × r. We report in Fig. 5 the resulting test MSE performance (mean ± standard deviation) obtained from the different splits, and empirically observe that r = 20 suffices to preserve the optimal performance of the model; a construction sketch is given below.
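A sketch of this construction: since J_r keeps only the eigenvectors attached to the r largest eigenvalues of Y^⊤Y, the product V J_r V^⊤ reduces to V_r V_r^⊤.

```python
import numpy as np

def low_rank_A(Y, r):
    """A = V J_r V^T from the eigendecomposition of Y^T Y (Section 4.3).

    Y: (tm, d) stacked output matrix; r: retained rank."""
    _, V = np.linalg.eigh(Y.T @ Y)  # eigenvalues in ascending order
    Vr = V[:, -r:]                  # eigenvectors of the r largest eigenvalues
    return Vr @ Vr.T                # rank-r, symmetric, positive semi-definite
```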
Learning under a partial observation regime: To assess the robustness of vITL w.r.t. missing data, we considered a random mask (η_{i,j})_{i∈[n], j∈[m]} ∈ {0, 1}^{n×m}; a sample z_i(θ_{i,j}) was used for learning only when η_{i,j} = 1. Thus, the percentage of missing data was p := 1 − (1/(nm)) Σ_{(i,j)∈[n]×[m]} η_{i,j}. The experiment was repeated for the different splits of the dataset, and on each split we averaged the results using different random masks (η_{i,j})_{i∈[n], j∈[m]}. The resulting test MSE of the predictor as a function of p is summarized in Fig. 6. As can be seen, the vITL approach is quite stable in the presence of missing data on both datasets.
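A minimal way to generate such a mask; the i.i.d. Bernoulli sampling scheme is our assumption, the paper only specifies the {0,1}-valued mask and the missing-data ratio p.

```python
import numpy as np

def observation_mask(n, m, p_missing, seed=None):
    """Random mask eta in {0,1}^{n x m}: z_i(theta_{i,j}) is kept iff
    eta_{i,j} = 1, so the expected fraction of missing samples is p_missing."""
    rng = np.random.default_rng(seed)
    return (rng.random((n, m)) >= p_missing).astype(int)
```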
4.4 Qualitative Analysis

In this section we show example outputs produced by vITL in the context of discrete and continuous emotion generation. While the former is the classical task of synthesis given input landmarks and a target emotion label, the latter serves to demonstrate a key benefit of our approach: the ability to synthesize meaningful outputs while continuously traversing the emotion embedding space.
Discrete emotion generation:
In Fig. 7 and 8 we show qualitative results for generating landmarks using the discrete emotion labels present in the datasets. For vITL, not only are the emotions recognizable, but the landmarks on the face boundary are reasonably well synthesized, and other parts of the face are visibly less distorted compared to StarGAN. The identity, in terms of the face shape, is also better preserved.

Figure 7: Discrete expression synthesis results (ground truth, vITL, StarGAN) on the KDEF dataset with ground-truth neutral landmarks as input, for the emotions neutral, angry, disgusted, fearful, happy, sad and surprised.

Figure 8: Discrete expression synthesis results (ground truth, vITL, StarGAN) on the RaFD dataset with ground-truth neutral landmarks as input.
Continuous emotion generation:
Starting from the neutral emotion, continuous generation in the radial direction is illustrated in Fig. 9. The landmarks vary smoothly and conform to the expected intensity variation of each emotion as the radius of the vector in the VA space increases. We also show, in Fig. 10, the capability to generate intermediate emotions by changing the angular position, in this case from 'happy' to 'surprised'. For a more fine-grained video illustration traversing from 'happy' to 'sad' along the circle, see the GitHub repository. These experiments and qualitative results demonstrate the efficiency of the vITL approach in emotion transfer; the target-emotion paths used for these figures can be generated as sketched below.

Figure 9: Continuous expression synthesis results with vITL on the KDEF dataset, with ground-truth neutral landmarks. The generation starts from neutral and proceeds in the radial direction towards an emotion (happy, fearful, angry, disgusted, sad, surprised) with increasing radii r ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}.

Figure 10: Continuous expression synthesis with the vITL technique on the RaFD dataset, with ground-truth neutral landmarks. The generation starts from 'happy' and proceeds by changing the angular position towards 'surprised'.
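Two small helpers, with names of our choosing, reproduce the two traversal modes: radial scaling of a unit-norm emotion vector for intensity (Fig. 9), and rotation along the unit circle between two anchors for intermediate emotions (Fig. 10).

```python
import numpy as np

def radial_path(theta, radii):
    """Scale a unit-norm emotion vector: a growing radius encodes a growing
    intensity of the same emotion."""
    return np.outer(radii, theta)  # (len(radii), 2)

def angular_path(theta_a, theta_b, num=10):
    """Rotate along the unit circle of the VA plane between two emotion
    anchors, producing intermediate emotions."""
    a = np.arctan2(theta_a[1], theta_a[0])
    b = np.arctan2(theta_b[1], theta_b[0])
    angles = np.linspace(a, b, num)
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (num, 2)
```

Each θ on such a path is then fed to the fitted model, e.g. via the predict helper from the solver sketch of Section 3.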
5 Conclusion

In this paper we introduced a novel approach to style transfer based on function-valued regression, and exemplified it on the problem of emotion transfer. The proposed vector-valued infinite task learning (vITL) framework relies on operator-valued kernels. vITL (i) is capable of encoding and controlling continuous style spaces, (ii) benefits from a representer theorem enabling efficient computation, and (iii) facilitates regularity control via the choice of the underlying kernels. The framework can be extended in several directions. Other losses (Sangnier et al., 2016; Laforgue et al., 2020) can be leveraged to produce outlier-robust or sparse models. Instead of being chosen prior to learning, the input kernel could be learned using deep architectures (Mehrkanoon and Suykens, 2018; Liu et al., 2020), opening the door to a wide range of applications.

6 Proofs

This section contains the proofs of our auxiliary lemmas.
Proof 6.1 (Lemma 3.1): For all g ∈ H_G, let K_x g denote the function defined by (K_x g)(t) = K(t, x) g for all t ∈ X. Similarly, for all c ∈ X, G_θ c stands for the function t ↦ G(t, θ) c, where t ∈ Θ. Let us take the finite-dimensional subspace

E = span( K_{x_i} G_{θ_{i,j}} c : i ∈ [t], j ∈ [m], c ∈ R^d ).

The space H_K can be decomposed as E and its orthogonal complement: E ⊕ E^⊥ = H_K. The existence of ĥ follows from the coercivity of R_λ (i.e. R_λ(h) → +∞ as ‖h‖_{H_K} → +∞), which is a consequence of the quadratic regularizer and the lower boundedness of ℓ. Uniqueness comes from the strong convexity of the objective. Let us decompose ĥ = ĥ_E + ĥ_{E^⊥}, and take any c ∈ R^d. Then, for all (i, j) ∈ [t] × [m],

⟨ĥ_{E^⊥}(x_i)(θ_{i,j}), c⟩_{R^d} = (a) ⟨ĥ_{E^⊥}(x_i), G_{θ_{i,j}} c⟩_{H_G} = (b) ⟨ĥ_{E^⊥}, K_{x_i} G_{θ_{i,j}} c⟩_{H_K} = (c) 0,

where (a) follows from the reproducing property in H_G, (b) is a consequence of the reproducing property in H_K (noting that K_{x_i} G_{θ_{i,j}} c ∈ E), and (c) comes from the decomposition E ⊕ E^⊥ = H_K. This means that ĥ_{E^⊥}(x_i)(θ_{i,j}) = 0 for all (i, j) ∈ [t] × [m], and hence R_S(ĥ) = R_S(ĥ_E). Since

λ‖ĥ‖²_{H_K} = λ(‖ĥ_E‖²_{H_K} + ‖ĥ_{E^⊥}‖²_{H_K}) ≥ λ‖ĥ_E‖²_{H_K},

we conclude that ĥ_{E^⊥} = 0, and get that there exist coefficients ĉ_{i,j} ∈ R^d such that ĥ = Σ_{i∈[t]} Σ_{j∈[m]} K_{x_i} G_{θ_{i,j}} ĉ_{i,j}. This evaluates, for all (x, θ) ∈ X × Θ, to

ĥ(x)(θ) = Σ_{i=1}^{t} Σ_{j=1}^{m} k_X(x, x_i) k_Θ(θ, θ_{i,j}) A ĉ_{i,j},

as claimed in (7).

Proof 6.2 (Lemma 3.2): Applying Lemma 3.1, problem (6) writes as

min_{C ∈ R^{(tm)×d}} (1/(tm)) ‖KCA − Y‖²_F + λ Tr(KCAC^⊤),

where ‖·‖_F denotes the Frobenius norm. By setting the gradient of this convex functional to zero, and using the symmetry of K and A, one gets

(2/(tm)) K(KCA − Y)A + 2λKCA = 0,

which implies (8) by the invertibility of K and A.

Acknowledgements: A.L. and S.P. were funded by the research chair Data Science & Artificial Intelligence for Digitalized Industry and Services at Télécom Paris. ZSz benefited from the support of the Europlace Institute of Finance and that of the Chair Stress Test, RISK Management and Financial Steering, led by the French École Polytechnique and its Foundation and sponsored by BNP Paribas.
References
M. A. Álvarez, L. Rosasco, and N. D. Lawrence. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.

Barbara Rita Barricelli, Elena Casiraghi, and Daniela Fogli. A survey on digital twin: definitions, characteristics, applications, and design implications. IEEE Access, 7:167653–167671, 2019.

Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 187–194, 1999.

Romain Brault, Alex Lambert, Zoltán Szabó, Maxime Sangnier, and Florence d'Alché-Buc. Infinite task learning in RKHSs. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1294–1302, 2019.

Claudio Carmeli, Ernesto De Vito, and Alessandro Toigo. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4:377–408, 2006.

Claudio Carmeli, Ernesto De Vito, Alessandro Toigo, and Veronica Umanità. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8(1):19–61, 2010.

Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 8789–8797, 2018.

Hui Ding, Kumar Sricharan, and Rama Chellappa. ExprGAN: facial expression editing with controllable expression intensity. In Conference on Artificial Intelligence (AAAI), pages 6781–6788, 2018.

Paul Ekman, Wallace Friesen, and Joseph Hager. Facial Action Coding System: The Manual. Salt Lake City, UT: Research Nexus, 2002.

A. El Guennouni, Khalide Jbilou, and A. J. Riquet. Block Krylov subspace methods for solving large Sylvester equations. Numerical Algorithms, 29(1):75–96, 2002.

Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 109–117, 2004.

Theodoros Evgeniou, Charles Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. Style transfer in text: exploration and evaluation. In Conference on Artificial Intelligence (AAAI), pages 663–670, 2018.

Justin J. Gatys, Alexandre A., and F.-F. Li. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), pages 694–711, 2016.

Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations (ICLR), 2019.

Jiahao Geng, Tianjia Shao, Youyi Zheng, Yanlin Weng, and Kun Zhou. Warp-guided GANs for single-photo facial animation. ACM Transactions on Graphics, 37(6):1–12, 2018.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.

Eric Grinstein, Ngoc Q. K. Duong, Alexey Ozerov, and Patrick Pérez. Audio style transfer. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 586–590, 2018.

Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Yizhou Yu, and Mingli Song. Neural style transfer: a review. IEEE Transactions on Visualization and Computer Graphics, 26(11):3365–3385, 2020.

Hachem Kadri, Emmanuel Duflos, Philippe Preux, Stéphane Canu, and Manuel Davy. Nonlinear functional regression: a functional RKHS approach. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 374–380, 2010.

Hachem Kadri, Mohammad Ghavamzadeh, and Philippe Preux. A generalized kernel approach to structured output learning. In International Conference on Machine Learning (ICML), pages 471–479, 2013.

Hachem Kadri, Emmanuel Duflos, Philippe Preux, Stéphane Canu, Alain Rakotomamonjy, and Julien Audiffren. Operator-valued kernels for learning from functional response data. Journal of Machine Learning Research, 17(20):1–54, 2016.

Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of regression trees. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1867–1874, 2014.

Pierre Laforgue, Alex Lambert, Luc Brogat-Motte, and Florence d'Alché-Buc. Duality in RKHSs with infinite dimensional outputs: application to robust losses. In International Conference on Machine Learning (ICML), pages 5598–5607, 2020.

Oliver Langner, Ron Dotsch, Gijsbert Bijlstra, Daniel H. J. Wigboldus, Skyler T. Hawk, and A. D. van Knippenberg. Presentation and validation of the Radboud Faces Database. Cognition and Emotion, 24(8):1377–1388, 2010.

Kendrik Yan Hong Lim, Pai Zheng, and Chun-Hsien Chen. A state-of-the-art survey of digital twin: techniques, engineering product lifecycle management and business innovation perspectives. Journal of Intelligent Manufacturing, 31:1313–1337, 2020.

Feng Liu, Wenkai Xu, Jie Lu, Guangquan Zhang, Arthur Gretton, and Danica J. Sutherland. Learning deep kernels for non-parametric two-sample tests. In International Conference on Machine Learning (ICML), pages 6316–6326, 2020.

Daniel Lundqvist, Anders Flykt, and Arne Öhman. The Karolinska Directed Emotional Faces (KDEF). CD ROM from Department of Clinical Neuroscience, Psychology section, Karolinska Institutet, 91(630):2–2, 1998.

Giacomo Meanti, Luigi Carratino, Lorenzo Rosasco, and Alessandro Rudi. Kernel methods through the roof: handling billions of points efficiently. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Siamak Mehrkanoon and Johan A. K. Suykens. Deep hybrid neural-kernel networks using random Fourier features. Neurocomputing, 298:46–54, 2018.

Charles Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17:177–204, 2005.

Agnieszka Mikołajczyk and Michał Grochowski. Data augmentation for improving deep learning in image classification problem. In International Interdisciplinary PhD Workshop (IIPhDW), pages 117–122, 2018.

Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. AffectNet: a database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1):18–31, 2017.

George Pedrick. Theory of Reproducing Kernels for Hilbert Spaces of Vector Valued Functions. PhD thesis, 1957.

Albert Pumarola, Antonio Agudo, Aleix M. Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. GANimation: anatomically-aware facial animation from a single image. In European Conference on Computer Vision (ECCV), pages 818–833, 2018.

Gilles Puy and Patrick Pérez. A flexible convolutional solver for fast style transfers. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 8963–8972, 2019.

Fengchun Qiao, Naiming Yao, Zirui Jiao, Zhihao Li, Hui Chen, and Hongan Wang. Geometry-contrastive GAN for facial expression transfer. Technical report, 2018. (https://arxiv.org/abs/1802.01822).

Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: an optimal large scale kernel method. In Advances in Neural Information Processing Systems (NIPS), pages 3891–3901, 2017.

James A. Russell. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161–1178, 1980.

Maxime Sangnier, Olivier Fercoq, and Florence d'Alché-Buc. Joint quantile regression in vector-valued RKHSs. In Advances in Neural Information Processing Systems (NIPS), pages 3693–3701, 2016.

Jason M. Saragih, Simon Lucey, and Jeffrey F. Cohn. Face alignment through subspace constrained mean-shifts. In International Conference on Computer Vision (ICCV), pages 1034–1041, 2009.

Ulrich Scherhag, Dhanesh Budhrani, Marta Gomez-Barrero, and Christoph Busch. Detecting morphed face images using facial landmarks. In International Conference on Image and Signal Processing (ICISP), pages 444–452, 2018.

Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

Lingxiao Song, Zhihe Lu, Ran He, Zhenan Sun, and Tieniu Tan. Geometry guided adversarial facial expression synthesis. In International Conference on Multimedia (MM), pages 627–635, 2018.

Joshua M. Susskind, Geoffrey E. Hinton, Javier R. Movellan, and Adam K. Anderson. Generating facial expressions with deep belief nets. In Affective Computing, chapter 23. IntechOpen, 2008.

Ichiro Takeuchi, Quoc Le, Timothy Sears, and Alexander Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 7:1231–1264, 2006.

Fei Tao, He Zhang, Ang Liu, and A. Y. C. Nee. Digital twin in industry: state-of-the-art. IEEE Transactions on Industrial Informatics, 15(4):2405–2415, 2019.

Ivona Tautkute, Tomasz Trzciński, and Adam Bielski. I know how you feel: emotion recognition with facial landmarks. In Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1959–19592, 2018.

Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor Lempitsky. Texture networks: feed-forward synthesis of textures and stylized images. In International Conference on Machine Learning (ICML), pages 1349–1357, 2016.

Raviteja Vemulapalli and Aseem Agarwala. A compact embedding for facial expression similarity. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5683–5692, 2019.

Daan Wynen, Cordelia Schmid, and Julien Mairal. Unsupervised learning of artistic styles with archetypal style analysis. In Advances in Neural Information Processing Systems (NeurIPS), pages 6584–6593, 2018.

Xu Yao, Gilles Puy, Alasdair Newson, Yann Gousseau, and Pierre Hellier. High resolution face age editing. In International Conference on Pattern Recognition (ICPR), 2020.

Zheng Zhang, Long Wang, Qi Zhu, Shu-Kai Chen, and Yan Chen. Pose-invariant face recognition using facial landmarks and Weber local descriptor. Knowledge-Based Systems, 84:78–88, 2015.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision (ICCV), pages 2223–2232, 2017.