An Enhanced Adversarial Network with Combined Latent Features for Spatio-Temporal Facial Affect Estimation in the Wild
Decky Aspandi, Federico Sukno, Björn Schuller, Xavier Binefa
Decky Aspandi, Federico Sukno, Björn Schuller, and Xavier Binefa
Department of Information and Communication Technologies, Pompeu Fabra University, Barcelona, Spain
Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
GLAM – Group on Language, Audio, & Music, Imperial College London, UK
{decky.aspandi, federico.sukno, xavier.binefa}@upf.edu, [email protected]

Keywords: Affective Computing, Temporal Modelling, Adversarial Learning.

Abstract: Affective Computing has recently attracted the attention of the research community, due to its numerous applications in diverse areas. In this context, the emergence of video-based data allows to enrich the widely used spatial features with the inclusion of temporal information. However, such spatio-temporal modelling often results in very high-dimensional feature spaces and large volumes of data, making training difficult and time consuming. This paper addresses these shortcomings by proposing a novel model that efficiently extracts both spatial and temporal features of the data by means of its enhanced temporal modelling based on latent features. Our proposed model consists of three major networks, coined Generator, Discriminator, and Combiner, which are trained in an adversarial setting combined with curriculum learning to enable our adaptive attention modules. In our experiments, we show the effectiveness of our approach by reporting our competitive results on both the AFEW-VA and SEWA datasets, suggesting that temporal modelling improves the affect estimates both in qualitative and quantitative terms. Furthermore, we find that the inclusion of attention mechanisms leads to the highest accuracy improvements, as its weights seem to correlate well with the appearance of facial movements, both in terms of temporal localisation and intensity. Finally, we observe a sequence length of around 160 ms to be the optimal one for temporal modelling, which is consistent with other relevant findings utilising similar lengths.
Affective Computing has recently attracted the attention of the research community, due to its numerous applications in diverse areas, which include education (Duo and Song, 2010) or healthcare (Liu et al., 2008), among others. The growing availability of affect-related datasets, such as AFEW-VA (Kossaifi et al., 2017) and the recently introduced SEWA database (Kossaifi et al., 2019), enables the rapid development of deep learning-based techniques, which currently hold the state of the art.

Further, the emergence of video-based data allows to enrich the widely used spatial features with the inclusion of temporal information. To this end, several authors have explored the use of long short-term memory (LSTM) recurrent neural networks (RNNs) (Tellamekala and Valstar, 2019; Ma et al., 2019), endowed also with attention mechanisms (Luong et al., 2015; Li et al., 2020; Xiaohua et al., 2019). However, such spatio-temporal modelling often results in very high-dimensional feature spaces and large volumes of data, making training difficult and time consuming. Moreover, it has been shown that the sequence length considered during training can be a decisive factor for successful temporal modelling (Kossaifi et al., 2017; Xia et al., 2020; Farhadi and Fox, 2018; Aspandi et al., 2019b), and yet a detailed study of this aspect is lacking in the field.

This paper addresses both the lack of incorporation and the lack of analysis of temporal modelling in affective analysis. We propose a novel model which can be efficiently used to extract both spatial and temporal features of the data by means of its enhanced temporal modelling based on latent features. We do so by incorporating three major networks, coined Generator, Discriminator, and Combiner, which are trained in an adversarial setting to estimate the affect domains of Valence (V) and Arousal (A). Furthermore, we capitalise on these latent features to enable temporal modelling using LSTM RNNs, which we train progressively using curriculum learning enhanced with adaptive attention. Specifically, the contributions of this paper are as follows:

(a) We upgrade the standard adversarial setting, consisting of a Generator and a Discriminator, with a third network that combines the features from these networks, which are modified accordingly. This yields features that combine the latent space from the autoencoder-based Generator and a V-A quadrant estimate produced by the modified Discriminator, resulting in a compact but meaningful representation that helps reduce the training complexity.

(b) We propose the use of curriculum learning to enable analysis and optimisation of the temporal modelling length.

(c) We incorporate dynamic attention to further enhance our model estimates and show its effectiveness by reporting state-of-the-art accuracy on both the AFEW-VA and SEWA datasets.

Affective Computing started by exploiting the use of classical machine learning techniques to enable automatic affect estimation. Examples of early approaches include partial least squares regression (Povolny et al., 2016) and support vector machines (Nicolaou et al., 2011). Subsequently, to further progress the investigations in this field, the development of larger datasets was addressed.
Several datasets have been introduced so far, starting with SEMAINE (McKeown et al., 2010), AFEW-VA (Kossaifi et al., 2017), RECOLA (Ringeval et al., 2013), OMG (Barros et al., 2018), AffectNet (Mollahosseini et al., 2015), and more recently SEWA (Kossaifi et al., 2019), aff-wild (Kollias et al., 2019; Zafeiriou et al., 2017), and aff-wild2 (Kollias and Zafeiriou, 2019; Kollias et al., 2020). Furthermore, the V-A labels have become the standard emotional dimensions over time, as opposed to hard emotion labels, given their continuous nature (Kossaifi et al., 2017; Kossaifi et al., 2019).

Throughout the last few years, models based on Deep Learning have emerged and currently hold the state of the art in the context of affective analysis, given their ability to learn from large-scale data. A recent example along this line is the work from Mitenkova et al. (Mitenkova et al., 2019), who introduce tensor modelling for affect estimation using spatial features. In their work, they use Tucker tensor regression optimised by means of deep gradient methods, thus allowing to preserve the structure of the data and to reduce the number of parameters. Other works, such as (Handrich et al., 2020), adopt a multi-task approach to simultaneously address face detection and affective state prediction. Specifically, they use YOLO-based CNN models (Huang et al., 2018) to estimate the facial locations alongside V-A values through their proposed combined losses. As such, their models are able to incorporate the characteristics of facial attributes and estimate their relevance to affect inference.

The recent growth of video-based datasets has encouraged the inclusion of temporal modelling, which has been shown to improve models' training (Xie et al., 2016; Cootes et al., 1998). Relevant examples in Affective Computing include the works of Tellamekala et al. (Tellamekala and Valstar, 2019) and Ma et al. (Ma et al., 2019). Tellamekala et al. enforce temporal coherency and smoothness on their feature representation by constraining the differences between adjacent frames, while Ma et al. resort to LSTM RNNs with residual connections applied to multi-modal data. Furthermore, the use of attention has also been recently explored by Xiaohua et al. (Xiaohua et al., 2019) and Li et al. (Li et al., 2020). Xiaohua et al. adopt multi-stage attention, involving both spatial and temporal attention, for their facial-based affect estimation. Meanwhile, using spectrogram data as input, Li et al. propose a deep network that utilises an attention mechanism (Luong et al., 2015) on top of their LSTM networks to predict the affective states.

Unfortunately, to our knowledge, all previous works involving temporal modelling in affective computing miss one important aspect of the analysis: the sequence length involved in their training. While the specified length of temporal modelling has been shown to affect the final results on other related facial analysis tasks (Kossaifi et al., 2017; Xia et al., 2020; Farhadi and Fox, 2018; Aspandi et al., 2019b), the computational cost required to train large spatio-temporal models hampers such analysis.
However, these problems can be mitigated by: 1) the use of progressive sequence learning to permit step-wise observation of various sequence lengths; this approach has been demonstrated in the recent work of (Aspandi et al., 2019b) on facial landmark estimation, which uses curriculum learning to enable more robust training analysis and tuning of the temporal length; and 2) the use of reduced feature sizes, enabling a more efficient training process (Comas et al., 2020); this has been explored in the affective computing field by recent works such as (Aspandi et al., 2020), which uses generative modelling to extract a latent space of representative features. These two aspects have inspired us to propose the combined models presented in this work, as explained in the next section.

Figure 1: Schematic representation of our full ANCLaF networks. Left: our base model, which consists of three networks jointly trained in an adversarial setting: Latent Feature Extractor (G), Quadrant Estimator (D), and Valence/Arousal Estimator (C). Right: our network endowed with sequence modelling (ANCLaF-S) and attention mechanism (ANCLaF-SA).
Figure 1 shows the overview of our proposed models, which consist of three networks: a Latent Feature Extractor (acting as Generator, G), a Quadrant Estimator (or Discriminator, D), and a Valence/Arousal Estimator (or Combiner, C). Given an input image I that contains the facial area, both G and D are responsible for learning low-dimensional features that the Combiner uses to estimate the associated Valence (V) and Arousal (A) state θ. The architecture of both the G and D networks follows the recent work from (Aspandi et al., 2020), and we propose to use an LSTM enhanced with attention to create our C network. We propose two main architecture variants: the ANCLaF network (left part of Figure 1), which uses single images as input and estimates V and A values independently for each frame, and
ANCLaF-S and
ANCLaF-SA (right part of Figure 1), which use sequences of latent features extracted from n frames as input and utilise LSTM RNNs for the inference (-S), optionally combined with internal attention layers (-SA). The pipeline of our base model
ANCLaF starts with the G network. It receives either the original input image I, or a distorted version of it, \tilde{I}, as detailed in (Aspandi et al., 2019c; Aspandi et al., 2019a). It simultaneously produces the cleaned reconstruction of the input image \hat{I} and a 2D latent representation that will be used as features (Z):

G_{\Phi_G}(I) = dec_{\Phi_G}(enc_{\Phi_G}(I)), \quad \text{with } Z_I \approx enc_{\Phi_G}(I),    (1)

where \Phi are the parameters of the respective networks, and enc and dec denote the encoder and decoder, respectively. Subsequently, the D network receives \hat{I} and tries to estimate whether it was obtained from a true or a fake example (namely, an original or a distorted input image), as well as a rough estimate of the affective state. In contrast with the formulation in (Aspandi et al., 2020), in which D directly targets the intensities of V and A, we propose to base the estimated affect on the circumplex quadrant (Q) (Russell, 1980), which discretises emotions along the valence and arousal dimensions (four quadrants). This, in turn, reduces the training complexity. Thus, letting FC stand for a fully connected layer:

D_{\Phi_D}(I) = FC_{\Phi_D}(enc_{\Phi_D}(I)) \Rightarrow Q_I \text{ and } \{0, 1\}.    (2)

Then, Q is used to condition the extracted latent features Z through layer-wise concatenation, which we call ZQ (Dai et al., 2017; Ye et al., 2018). Given these conditioned latent features, the C network performs the final stage of affect estimation, producing refined predictions of both V and A (Lv et al., 2017; Triantafyllidou and Tefas, 2016; Aspandi et al., 2019b). Thus, if \hat{\theta} denotes the estimated V and A:

ANCLaF(I) = C_{\Phi_C}([G_{\Phi_G}(I); D_{\Phi_D}(G_{\Phi_G}(I))]) = C_{\Phi_C}([Z_I; Q_I]) = FC_{\Phi_C}(enc_{\Phi_C}([Z_I; Q_I])) \Rightarrow \hat{\theta}_I^{ANCLaF}.    (3)

3.2 Attention Enhanced Sequence Latent Affect Networks

We propose two sequence-based variants of our models: ANCLaF-S and ANCLaF-SA. Both of them use the combined features ZQ extracted by the G and D networks, which are fed to LSTM networks to allow for temporal modelling (Hochreiter and Schmidhuber, 1997) and complemented with an FC layer to produce the final estimates. These networks are trained using curriculum learning (Bengio et al., 2009; Farhadi and Fox, 2018; Aspandi et al., 2019b), in which the number of frames is progressively increased, allowing a more thorough analysis of the training progress. Moreover, the training outcome for a given length facilitates the subsequent training of longer sequences (Farhadi and Fox, 2018). In this work, we considered series of 2, 4, 8, 16, and 32 successive frames (N = {2, 4, 8, 16, 32}) for both the training and inference stages. Depending on the number of frames taken into account (n), we use ANCLaF-S-n and ANCLaF-SA-n to name the respective variants of the ANCLaF-S and ANCLaF-SA networks. Lastly, the main difference between the two sequence models is that ANCLaF-SA also includes internal attentional modelling using the current and previous internal states of the LSTM layers. Thus, the V-A predictions of ANCLaF-S-n are:

\forall n \in N: \; ANCLaF\text{-}S\text{-}n(I_n), h_n = FC_{\Phi_C}(LSTM_{\Phi_C}([Z_{I_n}, Q_{I_n}], h_{n-1})) \Rightarrow FC_{\Phi_C}(LSTM_{\Phi_C}(ZQ_{I_n}, h_{n-1})),    (4)

where LSTM is the Long Short-Term Memory network (Hochreiter and Schmidhuber, 1997), and h_n are the LSTM states (h) after n successive frames.
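As a rough illustration of Equations 3 and 4, the following PyTorch-style sketch shows how a Combiner head could consume the concatenated features ZQ, either frame by frame or through an LSTM for the sequence variant. It is a minimal sketch under assumed feature sizes (latent_dim=128, quadrant_dim=4, hidden_dim=64) and module names; it is not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CombinerSketch(nn.Module):
    """Illustrative Combiner (C) head: maps the concatenated latent features Z and
    quadrant estimate Q to Valence-Arousal values, optionally through an LSTM
    for the sequence variant (ANCLaF-S)."""

    def __init__(self, latent_dim=128, quadrant_dim=4, hidden_dim=64, use_lstm=True):
        super().__init__()
        in_dim = latent_dim + quadrant_dim                    # ZQ = [Z ; Q]
        self.use_lstm = use_lstm
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True) if use_lstm else None
        head_in = hidden_dim if use_lstm else in_dim
        self.fc = nn.Linear(head_in, 2)                       # Valence and Arousal

    def forward(self, z, q, state=None):
        # z: (batch, n_frames, latent_dim), q: (batch, n_frames, quadrant_dim)
        zq = torch.cat([z, q], dim=-1)                        # layer-wise concatenation ZQ
        if self.use_lstm:                                     # ANCLaF-S: temporal modelling over n frames
            out, state = self.lstm(zq, state)
            return self.fc(out), state                        # per-frame (V, A) plus LSTM state
        return self.fc(zq), None                              # ANCLaF: frame-wise estimate

# Toy usage: a batch of 2 clips, 8 frames each, with hypothetical feature sizes.
z = torch.randn(2, 8, 128)
q = torch.softmax(torch.randn(2, 8, 4), dim=-1)               # rough circumplex-quadrant estimate from D
va, _ = CombinerSketch()(z, q)
print(va.shape)                                               # torch.Size([2, 8, 2])
```

Setting use_lstm=False reduces the head to the frame-wise estimator of Equation 3, while the LSTM path mirrors the recursion of Equation 4.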
Building upon ANCLaF-S, in ANCLaF-SA we further use attention modelling (Luong et al., 2015) to enable adaptive weights on the model inferences, by calculating context vectors (C) that summarise the importance of each previous state h. Differently from the original method, however, here we also propose to include both the LSTM inner state (c) and outgoing state (h) (Kim et al., 2018) to provide the full previous information, and to adapt the technique to only consider the n previous states, following our curriculum learning approach. Hence, given the combined LSTM states at frame t, denoted S_t = [h_t; c_t], and the n previous states \bar{S}, the alignment score is calculated as:

a_n(t) = align(S_t, \bar{S}_n), \quad \text{with } S_x = [h_x; c_x]
       = \frac{\exp(W_a [S_t^\top; \bar{S}_n])}{\sum_{n'} \exp(W_a [S_t^\top; \bar{S}_{n'}])}.    (5)

Then, the location-based function computes the alignment scores from the previous states \bar{S}:

a_n = \mathrm{softmax}(W_a \bar{S}).    (6)

The alignment vector is then used to compute the context vector C_t as the weighted average over the considered n previous hidden states:

C_t = \frac{1}{n} \sum_n a_n \odot \bar{S}_n.    (7)

Finally, the context vector is concatenated with the current ZQ to be used as input to the C network pipeline:

\forall n \in N: \; ANCLaF\text{-}SA\text{-}n(I_n), h_n = FC_{\Phi_C}(LSTM_{\Phi_C}([C_n; ZQ_{I_n}], h_{n-1})).    (8)

We use the modified adversarial training from (Aspandi et al., 2020) to train both the G and D networks, and incorporate the training of the C network by providing the latter with the features extracted from both the G and D nets on the fly. With this setup, we allow C to benefit from the improved quality of the features extracted by G and D as their training progresses. The modified adversarial loss for these three networks is:

L_{adv} = E_I[\log D(I)] + E_I[\log(1 - D(G(\tilde{I})))] + E_{afc}[C(I), \theta_I].    (9)

We use similar L_{afc} losses as in (Aspandi et al., 2020), which incorporate multiple affect metrics: Root Mean Square Error (RMSE, Eq. 11), Correlation (COR, Eq. 12) and Concordance Correlation Coefficient (CCC, Eq. 13) (Kossaifi et al., 2017), with the addition of the Intra-class Correlation Coefficient (ICC, Eq. 14) (Kossaifi et al., 2019). Thus, with \hat{\theta} and \theta as the predicted and ground-truth V-A values, L_{afc} is defined as follows:

E_{afc} = \sum_{i=1}^{F} \frac{f_i}{F} \left( L_{RMSE} + L_{COR} + L_{CCC} + L_{ICC} \right),    (10)

L_{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{\theta}_i - \theta_i)^2},    (11)

L_{COR} = \frac{E[(\hat{\theta} - \mu_{\hat{\theta}})(\theta - \mu_{\theta})]}{\sigma_{\hat{\theta}} \, \sigma_{\theta}},    (12)

L_{CCC} = \frac{2 \, E[(\hat{\theta} - \mu_{\hat{\theta}})(\theta - \mu_{\theta})]}{\sigma_{\hat{\theta}}^2 + \sigma_{\theta}^2 + (\mu_{\hat{\theta}} - \mu_{\theta})^2},    (13)

L_{ICC} = \frac{2 \, E[(\hat{\theta} - \mu_{\hat{\theta}})(\theta - \mu_{\theta})]}{\sigma_{\hat{\theta}}^2 + \sigma_{\theta}^2},    (14)

where f_i is the total number of instances of the discrete V-A class i, and F is a normalisation factor (Aspandi et al., 2019a) over the total V-A classes (discretised by a value of 10). This normalisation factor is crucial in cases of large imbalance in the number of instances per class, as in the AFEW-VA dataset (see Section 4.1).
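For concreteness, the sketch below shows one plausible PyTorch implementation of the affect terms of Equations 11-14. Turning the correlation-type metrics into minimisable losses as (1 − value), and omitting the class-balancing factor f_i/F of Equation 10, are simplifications assumed here rather than the authors' exact formulation.

```python
import torch

def pearson(pred, target, eps=1e-8):
    """Pearson correlation coefficient (COR, Eq. 12)."""
    mu_p, mu_t = pred.mean(), target.mean()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    return cov / (pred.std(unbiased=False) * target.std(unbiased=False) + eps)

def ccc(pred, target, eps=1e-8):
    """Concordance correlation coefficient (CCC, Eq. 13)."""
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    return 2 * cov / (var_p + var_t + (mu_p - mu_t) ** 2 + eps)

def icc(pred, target, eps=1e-8):
    """ICC variant as in Eq. 14: like CCC but without the mean-shift penalty."""
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    return 2 * cov / (var_p + var_t + eps)

def affect_loss(pred, target):
    """Combined affect loss: RMSE plus (1 - correlation-type terms)."""
    rmse = torch.sqrt(torch.mean((pred - target) ** 2))
    return rmse + (1 - pearson(pred, target)) + (1 - ccc(pred, target)) + (1 - icc(pred, target))

# Toy usage on hypothetical valence predictions for one sequence.
pred = torch.randn(32, requires_grad=True)
target = torch.randn(32)
loss = affect_loss(pred, target)
loss.backward()
```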
We use both the AFEW-VA (Kossaifi et al., 2017) and SEWA (Kossaifi et al., 2019) datasets to train all our model variants, following their original subject-independent protocol (5-fold cross-validation). We conducted two training stages for each of our proposed models. Firstly, we trained the G, D, and C networks simultaneously using the adversarial loss indicated in Equation 9. This training stage produced our baseline results without any sequential modelling, as well as the conditional latent features ZQ to be used in the next stage of ANCLaF-S(A) training.

In the second stage, the training of both ANCLaF-S and ANCLaF-SA was performed using the combined latent and quadrant features, under the previously defined curriculum learning scheme. We progressively train our ANCLaF-S models from 2, 4, 8, 16 to 32 steps of temporal modelling with multi-stage transfer learning (Christodoulidis et al., 2016). Subsequently, we add our proposed attention mechanism to the pre-trained ANCLaF-S models, thus obtaining our ANCLaF-SA models. In both cases, we optimise the affect loss defined in Equation 10 with the same experimental settings used to train ANCLaF.

We note that our combined training setup translates into more than 100 experiments in total. Hence, the use of latent features (a well-known means to obtain reduced-dimensionality representations) is critical to speed up our training process and make our experiments feasible. By using the extracted latent features instead of the original images, we observed training times reduced to roughly 1:4 of the original for each of our models (around 12 hours versus 2 days) on a single NVIDIA Titan X GPU. Full definitions of our models can be found in the online source code (https://github.com/deckyal/Seq-Att-Affect).

To quantify the impact of our temporal modelling, we opted to use two of the most popular and accessible video datasets available: Acted Facial Expressions in the Wild (AFEW-VA) (Kossaifi et al., 2017) and Automatic Sentiment Analysis in the Wild (SEWA) (Kossaifi et al., 2019). On the one hand, AFEW-VA has more individual examples (600 versus 538) than SEWA; on the other hand, the latter has more frame examples, more contextual information (such as subject id and associated culture) and is more balanced in terms of V-A labels (Mitenkova et al., 2019). Furthermore, both datasets contain in-the-wild situations, enabling realistic model evaluations. Finally, the labels provided are continuous V-A values, together with additional facial landmark locations that we further refined using external models (Aspandi et al., 2019b) to obtain a more stable detection of the facial area.

In each experiment, we provide the results from all variants of our models to highlight the contribution of each module: first, we evaluate the ANCLaF model, which operates by exclusively using the latent features extracted from each frame (ZQ) without any temporal modelling. Then, we provide results from both ANCLaF-S and ANCLaF-SA, which incorporate temporal modelling (and attention in the case of -SA). We report both RMSE and COR results on both datasets, adding also the ICC and CCC metrics for the AFEW-VA and SEWA datasets, respectively, to facilitate quantitative comparison with other results reported in the literature. Finally, for fair comparisons, we compare our models against external results that followed a similar experimental protocol, i.e., used exclusively the corresponding dataset in their training stage.
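To make the progressive (curriculum) schedule of the second training stage concrete, the toy sketch below trains a stand-in sequence regressor over increasingly long windows of per-frame ZQ features, carrying the weights from one stage to the next. The tiny LSTM model, the random data, and the MSE objective are placeholders only; the actual pipeline optimises the affect loss of Equation 10.

```python
import torch
import torch.nn as nn

class TinySeqRegressor(nn.Module):
    """Toy LSTM regressor standing in for the C network (illustrative only)."""
    def __init__(self, in_dim=132, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2)          # per-frame (valence, arousal)
    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out)

def windows(features, targets, n):
    """Yield non-overlapping windows of n consecutive frames."""
    for start in range(0, features.size(0) - n + 1, n):
        yield features[start:start + n].unsqueeze(0), targets[start:start + n].unsqueeze(0)

# Hypothetical per-frame ZQ features and V-A labels for one clip.
feats, labels = torch.randn(320, 132), torch.randn(320, 2)
model, criterion = TinySeqRegressor(), nn.MSELoss()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

# Curriculum: reuse the weights learnt at each length as the start of the next stage.
for n in (2, 4, 8, 16, 32):
    for x, y in windows(feats, labels, n):
        optimiser.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimiser.step()
    print(f"finished curriculum stage with sequence length {n}")
```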
Table 1 and Table 2 provide the full comparisons of our proposed models against other reported results on the AFEW-VA and SEWA datasets, respectively. We can identify several findings based on these results. Firstly, our base ANCLaF model, relying on a single image at a time, produces quite competitive accuracy compared to other results from the literature. Furthermore, its accuracy is also higher than that of the original AEG-CD-SZ model upon which it is based (Aspandi et al., 2020), as shown by its higher accuracy on the SEWA dataset, especially for Valence. This may indicate better processing of the visual features, considering that AEG-CD-SZ also incorporates audio features, which in turn explains that model's higher accuracy on the prediction of Arousal.

Figure 2: Analysis of prediction results from a single frame (ANCLaF) and from multiple frames with temporal modelling (ANCLaF-S-n). Top: overview of the overall results; bottom: a closer look at the prediction results.

Table 1: Quantitative comparisons on the AFEW-VA dataset.
Model                              RMSE↓ (VAL / ARO / AVG)    COR↑ (VAL / ARO / AVG)    ICC↑ (VAL / ARO / AVG)
Baseline (Kossaifi et al., 2017)   2.680 / 2.275 / 2.478
ANCLaF-SA-16                       2.601 /   –   /   –

Secondly, we notice a slight accuracy improvement when our models incorporate sequence modelling (ANCLaF-S), especially in terms of the correlation metrics, namely the Concordance Correlation Coefficient (CCC) and the ICC. This finding demonstrates the benefit of temporal modelling, which yields more stable results than those achieved by ANCLaF (cf. Section 4.3). However, even though the overall accuracy of ANCLaF-S is better than that of ANCLaF (and quite comparable to other state-of-the-art models), the improvement can be considered modest, especially if we compare it with the improvement achieved when we include attention in our models. Indeed, our ANCLaF-SA outperforms almost all compared models across the different affect metrics. These findings suggest that the plain utilisation of LSTMs may not be enough to attain a considerable and substantial increase of accuracy (Schmitt et al., 2019), justifying the inclusion of the attention mechanism in our approach.
Table 2: Quantitative comparisons on the SEWA dataset.
Model                              RMSE↓ (VAL / ARO / AVG)    COR↑ (VAL / ARO / AVG)     CCC↑ (VAL / ARO / AVG)
Baseline (Kossaifi et al., 2019)     –   /   –   /   –        0.350 / 0.350 / 0.350      0.350 / 0.290 / 0.320
Tensor (Mitenkova et al., 2019)    0.334 / 0.380 / 0.357      0.503 / 0.439 / 0.471      0.469 / 0.392 / 0.431
AEG-CD-SZ (Aspandi et al., 2020)
ANCLaF-SA-16                       0.334 / 0.331 /   –

Thirdly, we further observe a noticeable trend of steady increase in accuracy in the predictions of both ANCLaF-S and ANCLaF-SA as the number of considered frames grows from 2 to 8, after which accuracy plateaus (or even worsens slightly) as n continues to increase. This trend suggests that, in general, a medium sequence length (between 4 and 16 frames) is optimal for producing more accurate predictions, and that too short and too long sequences degrade temporal modelling. This finding is quite consistent with those from (Aspandi et al., 2019b), indicating the importance of progressive learning, which allows us to analyse and choose the optimal sequence length during training. Lastly, this sequence length selection may also impact the context vector and the weights learnt by our attentional module, which explains why a similar trend was observed in the results of these models (see Section 4.4 for more details).

Figure 2 shows an example of V-A predictions for ANCLaF and ANCLaF-S-n, together with the ground-truth annotations. Specifically, in the top part we can see that the affect states predicted by our models are, in general, quite close to the ground-truth values. However, we notice that the results of our sequence-based models are more accurate than those of their non-sequential counterpart. We can also see that the predicted values from ANCLaF are quite scattered, and thus quite unstable compared to ANCLaF-S, which explains its lower COR, CCC, and ICC values. Our sequence modelling, on the other hand, is able to produce smooth predictions with higher overall accuracy.

In the bottom part of the figure, showing a magnified portion of the same example, we further notice that the results for all ANCLaF-S-n are quite similar, with those from ANCLaF-S-8 showing the highest resemblance to the ground truth. Thus, the inclusion of too short or too long sequences yields sub-optimal results due to the complexity of the facial movements occurring between frames (see the next section for further details).

Figure 3: Analysis of the attention impact on the prediction results of our sequence modelling (results from ANCLaF-S-8 and ANCLaF-SA-8, which correspond to the best ANCLaF-S and ANCLaF-SA models, respectively). Top: overview of the overall results; bottom: two examples of a closer view of the prediction graph. The column Wa-8 shows the attention weights learnt for the eight considered frames.
To analyse the impact of the attention mechanism on our sequence modelling, we first show in Figure 3 a comparison of our baseline sequence modelling (ANCLaF-S) against ANCLaF-SA with attention activated. In the top part, we can see the predictions from the best-performing models with and without attention (ANCLaF-S-8 and ANCLaF-SA-8). Comparing the predictions from both models, we find that the results are quite similar, though in some cases ANCLaF-SA seems to be more accurate and closer to the ground truth. The quantitative accuracy results indicated in the respective legends confirm this observation.

The attention weights learnt by ANCLaF-SA, involving the previous eight frames, are also displayed at the bottom of the prediction plots. We can see that the weights computed with respect to the associated frames tend to be higher in the presence of changes. Indeed, we observe that the attention weights are usually activated prior to subsequent facial movements. Interestingly, the intensity of the activations also appears to reflect the magnitude of these facial movements, that is, the changes between frames. For instance, in frames 280-287, the differences in weight intensity are small, which correlates with the subtle changes observed in those frames (e.g., closing of the eyes). In contrast, in frames 643-650, we see high levels of activation in the first few frames, corresponding to the more discernible facial movements in the respective frames, such as the changes observed in the mouth area. These correlations illustrate how our models are capable of learning temporal changes.

Figure 4: Analysis of the relationship between the selected sequence length (n) and the learnt weights of our attentional approach. Top: overview of the prediction results of all variants of our models with attention (ANCLaF-SA-n) alongside their learnt weights. Middle: details for frames 622 to 653 with their associated weights for each model. Bottom: legend containing the quantitative comparisons.

Figure 4 provides further details on the attention mechanism for different temporal modelling lengths. We can see that all the displayed models show quite smooth results, thanks to the temporal modelling, but not all of them achieve the same prediction accuracy. The bottom part of the figure, highlighting the input sequence from frames 622 to 653, can help to provide an intuition about the optimal temporal modelling length, which was found to be about 8 frames. To this end, let us start by looking at the whole set of 32 frames: such a sequence comprises multiple facial changes, and considering all of them together makes the training task harder to optimise. On the other hand, if we consider groups of very few frames (e.g., 2 or 4 frames), the system is likely to capture only part of a given facial action, which may prevent it from properly interpreting it. Therefore, the optimal sequence length is the one that contains enough frames to interpret facial changes without extending the temporal context too much, which would unnecessarily increase training complexity and reduce accuracy.

Finally, it is important to emphasise that the optimal sequence length needs to take into account the frame rate and the specific facial movements present in each dataset. In the considered datasets, with an overall frame rate of 50 fps, each frame spans 20 ms, so the optimal 8-frame sequence corresponds to 160 ms.
CONCLUSIONS
In this work, we have successfully built a sequence-attention-based neural network for affect estimation in the wild. We did so by incorporating three major sub-networks: the Generator, which is responsible for extracting latent features from each frame; the Discriminator, which supplies a first-step affect estimate in the form of the emotional quadrant; and the Combiner, which merges the latent features and quadrant information to produce the final refined affect estimates of Valence and Arousal on a frame-by-frame basis. We then added an LSTM layer to allow temporal modelling, which we further enhanced by using step-wise attention modelling. We trained these three major sub-networks in an adversarial setting, and used curriculum learning in the sequential training stages.

We showed the effectiveness of our approach by reporting top state-of-the-art results on two of the most widely used video datasets for affect analysis, namely AFEW-VA and SEWA. Specifically, our baseline models, which operate without any sequence modelling, yield results that are quite competitive with other models reported in the literature. Our more advanced, sequence-based models clearly helped to improve the affect estimates both in qualitative and quantitative terms. Qualitatively, the temporal modelling helped to produce more stable results, with visibly smoother transitions between affect predictions. Quantitatively, our models produced the overall best accuracy results reported so far on both tested datasets.

Within the sequence-based models, we observed the highest accuracy improvements when the attention mechanism was included. Detailed analysis of the attention weights highlighted their correlation with the appearance of facial movements, both in terms of (temporal) localisation and intensity. Finally, we found a sequence length of around 160 ms to be the optimal one for temporal modelling, which is consistent with other relevant findings utilising similar lengths.

Future work will need to explore further optimisation of the considered adversarial topologies and attention mechanisms, as well as their transferability across databases, cultures, and domains.
ACKNOWLEDGMENTS
This work is partly supported by the Spanish Ministry of Economy and Competitiveness under project grant TIN2017-90124-P, the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502), and the donation bahi2018-19 to the CMTech at UPF. Further funding has been received from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 826506 (sustAGE).
REFERENCES
Aspandi, D., Mallol-Ragolta, A., Schuller, B., and Binefa, X. (2020). Latent-based adversarial neural networks for facial affect estimations. In , pages 348–352, Los Alamitos, CA, USA. IEEE Computer Society.
Aspandi, D., Martinez, O., and Binefa, X. (2019a). Heatmap-guided balanced deep convolution networks for family classification in the wild. In , pages 1–5.
Aspandi, D., Martinez, O., Sukno, F., and Binefa, X. (2019b). Fully end-to-end composite recurrent convolution network for deformable facial tracking in the wild. In , pages 1–8.
Aspandi, D., Martinez, O., Sukno, F., and Binefa, X. (2019c). Robust facial alignment with internal denoising auto-encoder. In , pages 143–150.
Barros, P., Churamani, N., Lakomkin, E., Siqueira, H., Sutherland, A., and Wermter, S. (2018). The omg-emotion behavior dataset. In , pages 1–7. IEEE.
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 41–48, New York, NY, USA. Association for Computing Machinery.
Christodoulidis, S., Anthimopoulos, M., Ebner, L., Christe, A., and Mougiakakou, S. (2016). Multisource transfer learning with convolutional neural networks for lung pattern analysis. IEEE Journal of Biomedical and Health Informatics, 21(1):76–84.
Comas, J., Aspandi, D., and Binefa, X. (2020). End-to-end facial and physiological model for affective computing and applications. In , pages 1–8, Los Alamitos, CA, USA. IEEE Computer Society.
Cootes, T. F., Edwards, G. J., and Taylor, C. J. (1998). Active appearance models. In Burkhardt, H. and Neumann, B., editors, Computer Vision — ECCV'98, pages 484–498, Berlin, Heidelberg. Springer Berlin Heidelberg.
Dai, B., Fidler, S., Urtasun, R., and Lin, D. (2017). Towards diverse and natural image descriptions via a conditional gan. In The IEEE International Conference on Computer Vision (ICCV).
Duo, S. and Song, L. (2010). An e-learning system based on affective computing. Physics Procedia, 24.
Farhadi, D. G. A. and Fox, D. (2018). Re3: Real-time recurrent regression networks for visual tracking of generic objects. IEEE Robot. Autom. Lett., 3(2):788–795.
Handrich, S., Dinges, L., Al-Hamadi, A., Werner, P., and Al Aghbari, Z. (2020). Simultaneous prediction of valence/arousal and emotions on affectnet, aff-wild and afew-va. Procedia Computer Science, 170:634–641.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Comput., 9(8):1735–1780.
Huang, R., Pedoeem, J., and Chen, C. (2018). Yolo-lite: a real-time object detection algorithm optimized for non-gpu computers. In , pages 2503–2510. IEEE.
Kim, C., Li, F., and Rehg, J. M. (2018). Multi-object tracking with neural gating using bilinear lstm. In The European Conference on Computer Vision (ECCV).
Kollias, D., Schulc, A., Hajiyev, E., and Zafeiriou, S. (2020). Analysing affective behavior in the first abaw 2020 competition. arXiv preprint arXiv:2001.11409.
Kollias, D., Tzirakis, P., Nicolaou, M. A., Papaioannou, A., Zhao, G., Schuller, B., Kotsia, I., and Zafeiriou, S. (2019). Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision, 127(6-7):907–929.
Kollias, D. and Zafeiriou, S. (2019). Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface. arXiv preprint arXiv:1910.04855.
Kossaifi, J., Tzimiropoulos, G., Todorovic, S., and Pantic, M. (2017). Afew-va database for valence and arousal estimation in-the-wild. Image and Vision Computing, 65:23–36.
Kossaifi, J., Walecki, R., Panagakis, Y., Shen, J., Schmitt, M., Ringeval, F., Han, J., Pandit, V., Schuller, B., Star, K., et al. (2019). Sewa db: A rich database for audio-visual emotion and sentiment research in the wild. arXiv preprint arXiv:1901.02839.
Li, C., Bao, Z., Li, L., and Zhao, Z. (2020). Exploring temporal representations by leveraging attention-based bidirectional lstm-rnns for multi-modal emotion recognition. Information Processing & Management, 57(3):102185.
Liu, C., Conn, K., Sarkar, N., and Stone, W. (2008). Online affect detection and robot behavior adaptation for intervention of children with autism. IEEE T Robot, 24:883–896.
Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Lv, J.-J., Shao, X., Xing, J., Cheng, C., and Zhou, X. (2017). A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. Pages 3691–3700.
Ma, J., Tang, H., Zheng, W.-L., and Lu, B.-L. (2019). Emotion recognition using multimodal residual lstm network. In Proceedings of the 27th ACM International Conference on Multimedia, pages 176–183.
McKeown, G., Valstar, M. F., Cowie, R., and Pantic, M. (2010). The semaine corpus of emotionally coloured character interactions. In , pages 1079–1084. IEEE.
Mitenkova, A., Kossaifi, J., Panagakis, Y., and Pantic, M. (2019). Valence and arousal estimation in-the-wild with tensor methods. In , pages 1–7. IEEE.
Mollahosseini, A., Hasani, B., and Mahoor, M. H. (2015). Affectnet: A database for facial expression, valence, and arousal computing in the wild. Department of Electrical and Computer Engineering, University of Denver, Denver, CO, 80210.
Nicolaou, M. A., Gunes, H., and Pantic, M. (2011). Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE T Affect Comput, 2(2):92–105.
Povolny, F., Matejka, P., Hradis, M., Popková, A., Otrusina, L., Smrz, P., Wood, I., Robin, C., and Lamel, L. (2016). Multimodal emotion recognition for avec 2016 challenge. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, AVEC '16, pages 75–82, New York, NY, USA. Association for Computing Machinery.
Ringeval, F., Sonderegger, A., Sauer, J., and Lalanne, D. (2013). Introducing the recola multimodal corpus of remote collaborative and affective interactions. In , pages 1–8.
Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161.
Schmitt, M., Cummins, N., and Schuller, B. (2019). Continuous emotion recognition in speech – do we need recurrence? Training, 34(93):12.
Tellamekala, M. K. and Valstar, M. (2019). Temporally coherent visual representations for dimensional affect recognition. In , pages 1–7. IEEE.
Triantafyllidou, D. and Tefas, A. (2016). Face detection based on deep convolutional neural networks exploiting incremental facial part learning. In , pages 3560–3565.
Xia, Y., Braun, S., Reddy, C. K. A., Dubey, H., Cutler, R., and Tashev, I. (2020). Weighted speech distortion losses for neural-network-based real-time speech enhancement. In ICASSP 2020 - 2020 IEEE ICASSP, pages 871–875.
Xiaohua, W., Muzi, P., Lijuan, P., Min, H., Chunhua, J., and Fuji, R. (2019). Two-level attention with two-stage multi-task learning for facial emotion recognition. Journal of Visual Communication and Image Representation, 62:217–225.
Xie, J., Girshick, R. B., and Farhadi, A. (2016). Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In ECCV 2016, pages 842–857.
Ye, H., Li, G. Y., Juang, B.-H. F., and Sivanesan, K. (2018). Channel agnostic end-to-end learning based communication systems with conditional gan. In , pages 1–5. IEEE.
Zafeiriou, S., Kollias, D., Nicolaou, M. A., Papaioannou, A., Zhao, G., and Kotsia, I. (2017). Aff-wild: Valence and arousal 'in-the-wild' challenge. In