The Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) Dataset: Collection, Insights and Improvements
Lukas Stappen†, Member, IEEE, Alice Baird†, Member, IEEE, Lea Schumann, and Björn Schuller, Fellow, IEEE
Abstract — Truly real-life data presents a strong, but exciting, challenge for sentiment and emotion research. The high variety of possible 'in-the-wild' properties makes large datasets such as these indispensable with respect to building robust machine learning models. A sufficient quantity of data covering a deep variety in the challenges of each modality, forcing exploratory analysis of the interplay of all modalities, has not yet been made available in this context. In this contribution, we present MuSe-CaR, a first-of-its-kind multimodal dataset. The data is publicly available, as it recently served as the testing bed for the 1st Multimodal Sentiment Analysis Challenge, which focused on the tasks of emotion, emotion-target engagement, and trustworthiness recognition by means of comprehensively integrating the audio-visual and language modalities. Furthermore, we give a thorough overview of the dataset in terms of collection and annotation, including annotation tiers not used in this year's MuSe 2020. In addition, for one of the sub-challenges – predicting the level of trustworthiness – no participant outperformed the baseline model, and so we propose a simple, but highly efficient, Multi-Head-Attention network that exceeds the baseline by around 0.2 CCC (almost 50 % improvement) using multimodal fusion.
Index Terms — Sentiment Analysis, Affective Computing, Database, Multimedia Retrieval, Trustworthiness
1 INTRODUCTION
Global video traffic is estimated to grow four-fold in the coming years [1], having accounted for 80 % of all online traffic in 2019 [2]. On social media, users view eight billion videos daily on Facebook [3], and YouTube has become the second biggest social network, with nearly two billion active users and one billion hours watched each day [4]. The internet has undergone a rapid transformation from a largely text-based Web 2.0 to a multimedia, user content-driven net. However, extracting, processing, and analysing relevant information from the huge amounts of semi-structured user-generated data available remains a challenge [5].

Text-based sentiment analysis is now widely used, e.g., for brand perception or customer satisfaction assessment, as machine learning approaches are able to learn rich text representations from data that can be applied to sentiment classification [6], [7]. However, the increased availability of other modalities (e.g., facial and vocal cues) offers new opportunities for affective computing by incorporating diverse information. Fused representations from text and images have shown improvements over unimodal models for the prediction of sentiment and emotion [8]. As well as this, inter-modality dynamics are harnessed through the integration of multiple modalities, and this has brought forward advancements with respect to sentiment prediction [9], [10], [11].

• All authors are with the EIHW – Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany. Contact E-mail: stappen,baird,[email protected]
• Björn Schuller is also with GLAM – Group on Language, Audio, & Music, Imperial College, London, UK.
† These authors contributed equally to the design of the dataset.
Furthermore, the engagement of multimodal data for sentiment analysis has received an increasing level of attention lately [12]. The interest of the research community and industry in developing methods for areas including multimodal sentiment analysis – which analyses the interaction between users' emotions and topics in multimedia content – has grown with the wide dissemination of multimodal user-generated content [9], [13], [14]. While it is established that multimodal approaches lead to higher-quality prediction results compared to unimodal input data [15], due to a lack of robustness, it is still an on-going challenge to develop and employ these techniques in real-world applications. Despite the recent progress in constructing larger datasets [16] to explore and develop counter-strategies for novel 'in-the-wild' paradigms, many areas remain unexplored to this day.

In this work, we present in detail the process for the collection and annotation of the Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) dataset. The MuSe-CaR dataset is a large, extensively annotated multimodal (video, audio, and text) dataset that has been gathered under real-world conditions with the intention of developing appropriate methods and further understanding of multimodal sentiment analysis 'in-the-wild'. To the best of the authors' knowledge, it is more than three times larger than any other continuously annotated dataset aimed at pushing the understanding of multimodal sentiment beyond discrete modelling. Further to this, MuSe-CaR provides never-before-seen annotations, which explicitly allow for modelling of speaker topic and physical entity in relation to continuous emotions. A selection of the MuSe-CaR dataset was utilised for the First International Multimodal Sentiment Analysis in Real-life Media Challenge (MuSe 2020) [17], which was held at the ACM Multimedia 2020 conference.

In previous research, multimodal sentiment analysis and emotion recognition are often applied in the context of product reviews [18], [19], sourced from the openly accessible video platform YouTube [18]. With MuSe-CaR, we are influenced by this collection strategy, and have designed the dataset with an abundance of computational tasks in mind. Furthermore, the dominant focus of MuSe-CaR is to aid machine understanding of how positive and negative sentiment, as well as emotional arousal, is linked to an entity and aspects in a review (and other user-generated content in general). In doing so, MuSe-CaR aims to bridge fields within affective computing, which currently utilise a variety of emotionally annotated signals (dimensional and categorical).

We collected over 40 hours of user-generated video material with more than 350 reviews and 70 host speakers (as well as 20 overdubbed narrators) from YouTube. The extensive annotations consist of 15 different annotation tiers/tracks (3 continuous dimensional, 3 partially continuous binary, 5 categorical, and 4 automatically extracted tiers). Among others, MuSe-CaR offers conversational topic labelling, the novel continuous dimension of
Trustworthiness, and full word-aligned transcriptions.

When selecting the data, it was of particular importance to balance the uncontrollable 'in-the-wild' influences against constraining properties that allow for meaningful learning and generalisation using current deep learning methods. Such 'in-the-wild' characteristics of MuSe-CaR include: i) video: face-angle, shot size, camera motion, reviewer visibility, reviewer face occlusion ((sun-)glasses), and highly varied backgrounds within a single video; ii) audio: ambient noises, narrator and host diarisation, diverse microphone types, and speaker locations; iii) text: colloquialisms, and domain-specific terms. However, the contextual interaction with emotions, e.g., towards entities and aspects, is content dependent. The number of different entities, topics, and aspects that appear in the videos requires special consideration in order to create balanced records for supervised tasks based on vision (image object detection), audio, and linguistics (aspect detection) and their derived fusion. For this reason, we have limited the dataset to vehicle reviews. Furthermore, most of the reviewers are semi-professional or professional reviewers (e.g., YouTube channels, influencers). This has several practical advantages: on the one hand, it increases the video quality significantly; on the other hand, it makes the videos more consistent (a broad but similar range of topics around the vehicle is covered, e.g., vehicle safety). Finally, approval to use multiple videos is greatly simplified, as a channel with multiple videos (and independent speakers) only needs to be contacted once, and not per video*.

Based upon this dataset, our contributions in this work are threefold:
• First, we give in-depth information regarding the MuSe-CaR selection, collection, and annotation process, which was not addressed in the MuSe 2020 challenge baseline paper [17]. This includes the presentation of additional annotation tiers that have not yet been utilised and introduced (cf. Section 3.5). We expect that this will assist future participants and other researchers to conduct and interpret studies on MuSe-CaR more easily.
• Second, we revisit the tasks introduced in MuSe 2020 and demonstrate that the limits in terms of performance have not yet been reached in at least one of these tasks. In doing so, we propose a simple, yet efficient, model utilising state-of-the-art components, with which we beat the baseline of MuSe-Trust by around 50 %. Furthermore, we describe extensive experiments run to identify the key settings in modelling the novel task of Trustworthiness.
• Finally, we outline future tasks and research directions using the dataset and the introduced annotation tiers.

* We contacted the creators for consent, see Section 3 for details.
2 RELATED RESOURCES
In the following, we highlight important databases which MuSe-CaR builds upon, focusing on computational sentiment and emotion analysis from audio-visual recordings. For an overview of databases which utilise only one or other combinations of modalities in this area, the reader is referred to recent survey studies (e.g., for text [20], for vision [21], and for audio [22]). An overview of findings based on our specific areas of interest is summarised in Table 1.
TABLE 1
Selection of multimodal sentiment analysis and affective computing datasets focusing on at least one of three types of prediction targets: (Sentiment) classes, (Primitive) Emotions, Object-of-Interest. Modalities available; Language: MULTI1 = CN, DE, EN, GR, HU, SE and MULTI2 = DE, ES, FR, PT; Annotation Duration (hh:mm)†; number of minimum Annotations per target. Subjectivity includes Sentiment: number of sentiment classes (* derived sentiment intensity from Valence); number of (basic) Emotions; Continuous, Primitive Dimensions: Valence, Arousal, Trustworthiness, Likability, Intensity, Power, Expectation, Dominance; number of Increment Steps, Trace annotations; Object-of-Interest: classes of topics or entities.

Name            | Modal. | Language | AnDu  | #Ann. | Sent. | Emo. | Dimensions
MuSe-CaR (ours) | V,A,L  | EN       | –     | 5     | –     | –    | V,A,T (trace)
Affective Computing
SEWA [16]       | V,A    | MULTI1   | 4:39  | 5     | –     | –    | V,A,L
HUMAINE [23]    | V,A    | EN       | 4:11  | 6     | –     | –    | V,A,I
RECOLA [24]     | V,A    | FR       | 3:50  | 6     | –     | –    | V,A
AFEW-VA [25]    | V,A    | EN       | 2:28  | –     | –     | –    | V,A (21 steps)
VAM [26]        | V,A    | EN       | 12:00 | 6-8   | –     | 6    | V,A,D
IEMOCAP [27]    | V,A,L  | EN       | 11:28 | 5     | –     | –    | –
SEMAINE [28]    | V,A    | EN       | 6:30  | 6     | –     | –    | –
Belfast [29]    | V,A    | EN       | 3:57  | 6     | –     | –    | V,A
Multimodal Sentiment
UR-FUNNY [30]   | V,A,L  | EN       | 90:23 | 2     | –     | –    | –
MOSEAS [31]     | V,A,L  | MULTI2   | 68:49 | 3     | 7     | 6    | V,A*
MOSEI [32]      | V,A,L  | EN       | 65:53 | 3     | 7     | 6    | –
ICT-MMMO [18]   | V,A,L  | EN       | 13:58 | 2     | 5     | –    | –
Ext. POM [33]   | V,A,L  | EN       | 15:40 | 1     | 5     | –    | –
CH-SIMS [34]    | V,A,L  | CN       | 2:20  | 5     | 5     | –    | –
AMMER [35]      | V,A,L  | DE       | 1:18  | 1     | –     | –    | –
Youtubean [36]  | V,A,L  | EN       | 1:11  | 2     | 3     | –    | –
MOUD [37]       | V,A,L  | ES       | 0:59  | 2     | 3     | –    | –
YouTube [19]    | V,A,L  | EN       | 0:29  | 3     | 3     | –    | –

† Note: In the case of multimedia sentiment analysis databases, we have usually indicated the total size. However, the exclusion of non-spoken parts typically reduces the size significantly, and authors tend not to specify the size of the adjusted audio-visual database.
It is generally accepted [38], [39] that (multimodal) sentiment analysis consists of a holder and the object (subject, entity) that the emotion is evoked by. Furthermore, the survey of [12] divides the field into three major groups: multimodal sentiment analysis i) in (monologue) video reviews, e.g., from video platforms [18], [31], [40], ii) in human-machine and human-human interactions [16], and iii) in the analysis of general multimedia content (e.g., images, gifs) from social media [41], and stresses the need for additional datasets to extend the field.

Recently, UR-FUNNY [30] collected a large number of Ted-Talk videos on more than 400 topics. On a small part of this data, binary humour is predicted.
MOSEAS [31] is a multilingual collection of 40 000 audio-visual sentences consisting of sentiment, emotion, and attribute labels. MOSEI [32] contains videos of 250 topics for sentiment analysis (7 classes) and emotion recognition (6 classes). Only videos with transcriptions and punctuation are provided by the creators; additionally, they exclude video content where the camera was not fixed in place. [42] extended POM [43], an audio-visual film review dataset, with annotations at the level of opinion-forming segments and components. It consists of 600 videos with an average length of 94 seconds, in which a person looks straight into the camera and talks about six film aspects. ICT-MMMO [18] consists of user-generated review videos to predict the sentiment. CH-SIMS [34] acquired 60 Mandarin raw videos from movies, and television series and shows, with segments of up to 10 seconds. It is limited to parts where both the face and the voice appear at the same time. Youtubean [36] collects seven popular product reviews of one cell phone model with aspect and sentiment annotations. MOUD [37] includes YouTubers, clearly visible (no occlusions) and oriented frontally to the camera, expressing their opinions in 30-second segments without any background music. The YouTube corpus [19] provides sentiment labels for a wide range of YouTube product reviews. AMMER [35] focuses on emotional interactions in a simulated car journey in German. The discrete emotions used in these databases are not fully able to represent sentiment, making more complex representations [12], e.g., through polarity and intensity scores, necessary.
One such way to represent sentiment more comprehensively is through the primitive dimensions of emotions, e.g., the circumplex model of emotions [44]. In this context, Valence often serves as an umbrella term for sentiment and is used interchangeably [45], [46], [47].
SEWA [16] provides large audio-visual data with continuous arousal, valence, and likeability traces during human-human interactions recorded online via static webcams. However, only around 4 hours are continuously annotated. HUMAINE [23] provides continuous intensity annotations in addition to Arousal and Valence. The RECOLA dataset [24] contains subjects interacting in a tightly controlled laboratory environment. The audio, visual, and electro-dermal activity were annotated with Arousal and Valence traces. AFEW-VA [25] includes 7 facial expressions from films, annotated by 3 raters, in addition to Arousal and Valence with intensities from -10 to 10 in 21 hard incremental steps. VAM [26] consists of clips from a German talk-show. Besides Valence, Arousal, and Dominance, which are annotated on a discretised 5-point scale, there are also six basic emotions which are sparsely labelled. IEMOCAP [27] consists of audio, video, and transcriptions of ten actors in dyadic sessions in a controlled setting. SEMAINE [28] is a richly annotated database of 21 human-agent interaction sessions. Belfast [29] is a collection of recordings in which participants, actively stimulated by specific tasks, show moderate emotional responses. The recordings come with continuous Valence and Arousal annotations.
From this overview of recent literature, on the one hand, we find that multimodal sentiment analysis databases currently try to select content from a wide range of topics, which increases the (linguistic) generalisation potential of developed models. However, multimedia databases which utilise topics or aspects as a prediction target [36], [42] are rare, limiting the (supervised) understanding of the context almost exclusively to linguistic analysis. To date, no such methods offer complete solutions, and they rely on the spoken word. In situations where spoken language is not present, and towards multimodal understanding, we want to examine the relationships between emotion and object/physical entity. The literature suggests that a) topic definitions such as reviews [32] are too high-level to justify an in-depth understanding of the opinion-topic structure and its multimodal interactions, a point which is necessary for real-life applications, and b) with the drastic improvements of general language models [48], generalisation improves naturally over time [49].

On the other hand, with a wide range of (health and wellbeing) situations in mind, affective computing focuses on the elementary sensing of affect, often using primitive and generalistic (continuous) emotional dimensions, making it very broadly applicable [50]. These enable short- and long-range understanding and aggregation of affects, emotions, and sentiments. However, understanding what these are aimed at (e.g., subject, entity) is not necessarily seen as part of the field. Moreover, the shift towards perception in a noisy environment (in-the-wild), especially with regard to visual characteristics, has only recently been made a focus [16], [25]. We argue that a) with the improvements of models through deep learning and its efficient utilisation on large-scale data, this focus might change, and b) the properties of continuous traces might be helpful when dynamically breaking down a large sequence of audio-visual emotional events into shorter segments, e.g., sentences, aspects, or noun-adjective pairs in the future.

The MuSe-CaR dataset described in this work was designed to overcome some of these basic limitations when utilising user-generated, real-life media on a large scale, and attempts to bring the best of the two worlds together. We considered the following aspects in the design of MuSe-CaR, listed below:
• The recording of the database should be as uncontrolled as possible in terms of recording settings and emotional and linguistic content. However, there should be a certain overlap in the topics and aspects addressed.
• Instead of isolated segments and sentences, the database should contain long sequences across multiple aspects of a topic.
• The emotion-object interaction of the speaker should take place in various environments (e.g., outside and inside of a car).
• The depicted speaker is not actively acting out emotions; rather, emotions are naturally elicited depending on the topic, aspect, and situation.
• The audio-visual material provides verbal and nonverbal information. However, occasionally, one modality provides only limited information (voice but no face, e.g., the camera focuses on an object; face but no voice, e.g., acceleration; audio-visual but no transcription, e.g., speech-to-text failed due to complex audio scenarios).
• The emotional and topic annotations should be assigned based on human subjective evaluations.

3 THE MUSE-CAR DATABASE
In the following, we describe how we identified relevant material based on the previously defined criteria. Furthermore, we outline the communication process with the creators to obtain consent for the use of their content. Then, we provide information regarding the dataset composition, as well as the data annotation process and the utilised annotation tools.
The selection of videos from YouTube was carried out in a semi-automatic process. We developed a basic crawler which receives a number of hand-selected keywords (e.g., 'review' and a car brand) and provides metadata of pre-selected videos. In our view, the legal situation for crawling videos from the web is inconclusive in many countries‡. Therefore, we contacted the creators of the videos with high user engagement (views, likes, etc.), which most likely indicates relevance and high-quality content. Actively requesting the creators' consent to use the data for academic purposes in an opt-in approach gives researchers worldwide legal certainty when using the database. We sent up to three (follow-up) emails to the creators over a period of three months, reaching an agreement in around 50 % of requests. In the emails, we explained the intention of the dataset for non-commercial use in challenges and for research, and provided an example of an End User License Agreement (EULA).

If consent was given, three individuals carried out a deeper initial inspection by viewing around 10 % of each video. In total, 366 videos were inspected. Based on the criteria defined in Section 2.3 and the content of each video, the inspectors filled out a survey, asking them to estimate important data properties such as the level of in-the-wild characteristics. These also included the emotionality and quality of the video for our purpose on a scale of 1 to 5 (where 1 is substandard, and 5 is optimal). Regarding the emotionality, less than 5 % are rated below 3, and around 80 % as 4 or 5.

‡ Uploading a video to YouTube automatically issues that video under YouTube's own license. Regarding this licence, the use of the data in the EU is only possible by YouTube directly or with the consent of the creator. In similar works [13], [31], [32], the database producers refer to the fair use principle for academic use. These exceptions to intellectual property rights, however, do not seem to be applicable in the European legal sphere. Furthermore, YouTube's standard terms and conditions, as well as those of the API, have to be considered. A fraction of videos are also available under the Creative Commons (CC-BY, full use if the creator credits are mentioned) licence model.
Fig. 1. Thumbnails showing reviewers in various constellations to the camera and interacting with the object.

Furthermore, regarding video quality properties, the videos seem suitable for our purpose, with more than 85 % holding scores equal to or higher than 4. Additional estimation questions reveal more detailed information about the videos (e.g., face inclusion): shot range or camera selfie-angles, the scene and noise settings (including the quality of sound), whether background music and other speakers are present, and whether speakers have a dialect or accent. A detailed analysis can be seen in Figure 2. Interesting observations are that there are fewer close-up than medium-sized shots in most videos (close-ups only account for 1-25 % of the videos, cf. Figure 1). For selfie shots, the camera is mainly held at a lower angle. Videos are often filmed inside the car, either while driving or not driving. Outside-the-car shots (both driving and not driving) account for less coverage. In general, noise sounds and background music are present; however, the sound quality is only perceived to be bad for 1-10 % of the video duration. The great majority of videos have a banner present (e.g., a copyright sign in a corner) for over 75 % of the video duration.

Since the subjects were not actively selected, but are professional, semi-professional ('influencers'), and casual reviewers, we can only estimate the characteristics of our cohort. We assume a broad age range from the mid-20s until the late-50s, while most speakers are English natives from the United Kingdom or the United States of America; a small minority are non-native, yet fluent, English speakers. Around a third of the reviewers wear glasses. There are barely any videos with speakers having a dialect or accent.

Videos which received less than 4 points of overall quality for training, less than 3 on emotionality, or which lack the properties outlined in the previous section, were excluded. After inspection, 303 videos remained for annotation, which corresponds to 40.2 hours of video with an average duration of 8 minutes (90 % are shorter than 14 minutes).
It is well established that linguistic information from the spoken language can facilitate the learning of emotions and is a cornerstone for understanding context. To support future research into the interplay of the audio, visual, and text modalities, we automatically transcribed the data. In recent years, speech-to-text has achieved almost human-level quality in popular languages, such as English. Furthermore, using the text modality for emotion recognition in a real-life scenario would need to work independently, without human intervention. Considering the size of the dataset along with these prerequisites, we decided to use automatic transcription services on our corpus.
Fig. 2. Absolute counts of the estimation of various in-the-wild characteristics, namely camera shot size, camera selfie-angle, scene setting, and additional noise influences of the collection. The percentage estimates the duration of collected video in a certain category, e.g., 25-50 % in "close up shots" means that the collector estimates that 25-50 % of the duration of the collected video contains close up shots.

The transcriptions of the videos using the Google Cloud Speech API§ and Amazon Transcribe¶ were both of high quality. The first also contains non-verbal cues and audio elements, such as laughter, music, etc. We provide both, but only use the latter for our experiments, since we felt that its precision slightly exceeds that of the first. One reason for this improvement may be the option to create a customised dictionary to improve the transcription quality regarding domain-specific, automotive-typical terms. The transcriptions of the spoken language include punctuation (e.g., period, question mark, exclamation mark), and every transcribed word comes with a beginning and an end timestamp as well as a duration. These metadata help to align the text with the annotations (different sampling rate) and the other modalities, and enable studies on more than 28 295 sentences, exceeding all English multimodal sentiment analysis databases (cf. Section 2), with almost 5 k more sentences than the next biggest (MOSEI). Although we have not rigorously evaluated them, our preliminary screening suggests that the punctuation and boundaries are accurate, and even better than commonly available voice activity detectors we also tried.

§ https://cloud.google.com/speech-to-text
¶ https://aws.amazon.com/transcribe/
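Because every transcribed word carries start and end timestamps, aligning the text modality with a 0.25 s annotation grid reduces to an interval-overlap computation. The following Python sketch assumes a simplified list of (word, start, end) tuples parsed from the transcription output; the exact release format of the transcripts may differ.

# Hypothetical per-word transcript entries (word, start_s, end_s), e.g. parsed
# from the transcription service output; the real MuSe-CaR format may differ.
words = [("this", 0.12, 0.31), ("car", 0.31, 0.58), ("feels", 0.58, 0.97)]

def words_per_step(words, duration_s, step_s=0.25):
    """Map each annotation step (sampled every step_s seconds) to the words
    whose spoken time span overlaps that step."""
    n_steps = int(duration_s / step_s)
    aligned = [[] for _ in range(n_steps)]
    for word, start, end in words:
        first = int(start / step_s)
        last = min(int(end / step_s), n_steps - 1)
        for i in range(first, last + 1):
            aligned[i].append(word)
    return aligned

print(words_per_step(words, duration_s=1.0))
# [['this'], ['this', 'car'], ['car', 'feels'], ['feels']]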
Annotator : It is the responsibility of the annotator tolabel the data based on the subsequent instruction. § https://cloud.google.com/speech-to-text ¶ https://aws.amazon.com/transcribe/ Auditor : It is the responsibility of the auditor to reviewthe performance of the information labelled, and ensureit is in line with the annotation protocol. Only afterthe annotations have been manually and automaticallychecked, verified and endorsed by the auditors is theannotation deemed usable.3)
3) Administrator: It is the responsibility of the administrator to manage all parties, and assign duties during the entire annotation process.

During annotation, an interactive process between the annotator, auditor, and administrator was applied:
1) Assignment of tasks: The annotator is assigned one or more packages by the administrator. Similar to previous work [27], a session package corresponds to ca. 40 minutes of video material. The annotators were instructed to have suitable rests between videos and sessions, so that the expected (and paid) working time was one hour. In one session, all videos had to be annotated with the same annotation type (e.g., Valence).
2) Annotation: The annotator annotates the videos piece by piece and package by package, and sends packages to the auditor after completion. Once all packages have been processed and evaluated by the auditor, new packages can be assigned to the annotator.
3) Progress tracking: The administrator regularly tracks the progress of the annotators and auditors. This keeps track of which packages are still to be annotated or audited – in the worst case, re-evaluating the suitability of the workload.
4) Quality tracking: By calculating the annotation accuracy over several annotators, and with continuous auditing throughout, the quality of the annotator can be tracked, and the quality is therefore continuously improved.
For categorical annotation (e.g., speaker topic), we used the annotation software ELAN 4.9.4 [52] – chosen for its multimodal interface, which allows for a waveform and video display (see Figure 3), as well as other useful functionalities, including the ability to jump to areas of interest [52]. It has been shown that some emotions are transmitted more strongly via visual signals (e.g., sadness), while others more via audio (e.g., anger) [53]. In addition, context information transported by both modalities plays a crucial role in emotion perception [54]. For an audio-visual annotation of the continuous emotions, we chose the software DARMA [55]. DARMA enables the recording of annotation signals from a Logitech Extreme 3D Pro joystick. The joystick allows for the transfer of perceived emotions more intuitively [16]. The continuous annotation was made from the very start to the end of a video, and was sampled every 0.25 seconds with an axis magnitude ranging from -1 000 to 1 000.
MuSe-CaR contains annotations for continuously-valued (Valence, Arousal, and Trustworthiness), binary-valued (host/narrator turns, banner, and person appearance), and categorical (topics, entities) ratings. Overall, we have annotated 11 tiers for each video. The dimensional annotations reflect the continuous emotional state of the individuals speaking.
Fig. 3. Interface of the annotation software ELAN when labelling speaker topics (cf. Section 3.5.7).

Fig. 4. Snapshot of emotional scenes from the MuSe-CaR dataset depicting high Arousal (left), expressed non-verbally as excitement about the acceleration of the car, and verbally positively valenced appreciation of the car's exterior.
In parallel to the continuously-valued signals, simple continuous binary (activated/deactivated) annotations are recorded simultaneously by pressing and holding the trigger adjacent to the index finger of the joystick: i) Trustworthiness + the turns between the host and the narrator, ii) Valence + the appearance of banners, and iii) Arousal + the appearance of more than one person.

In addition to the annotations, we provide more than 10 pre-computed features, e.g., facial landmarks, acoustic low-level descriptors (LLDs), hand gestures, head gestures, facial action units, etc., which in other works [11], [16] were also declared as (semi-)automatically generated annotations for prediction. A detailed description of these can be found in [17].
The videos are annotated using a continuous dimensional model of emotion [44], considering both Arousal and Valence (cf. Figure 4). We discuss these together for ease; however, the literature shows that it is important to annotate them separately [16]. In the dataset, an example of high Arousal is Stressed or Elated (happiness); however, Stressed would be negative Valence, and Elated (happiness) would be positive Valence.

The annotators were trained in person. First, the axes were introduced by showing them an explanatory video to impart a general understanding of aspects of emotion. After further explanation and examples, they could experience the handling of the software and the reaction of the joystick in a hands-on session. The test annotations were followed by a group discussion, where individual annotations were compared to a previously recorded annotation of an experienced annotator, as well as between the group members. The final packages were done individually in a quiet environment with headsets.
User-generated information has proven useful when creating large (emotional) datasets with real-world content [32]. It is known [56] that the built-up reputation of a creator – unknown in real life – has a strong influence on user engagement and, thus, most likely also on other perceived emotions. However, quantifying Trustworthiness is hard [57], and there is as yet no dataset that offers the possibility to link it to Arousal and Valence, as well as to train cross-domain detectors.

Since this is a completely novel dimension, we explain our definition more deeply. Generally, there is no single, prevailing definition of Trustworthiness, due to the lack of a conceptual agreement [57], [58], [59]. Analogous to our understanding, in [60], Trustworthiness is defined as the ability, benevolence, and integrity of a trustee. In the context of a stranger (our moderator) from social media content, Trustworthiness presupposes that this person can objectively assess the (facts of the) matter on the one hand, and, on the other hand, communicates this assessment unbiased, therefore honestly. Building on this definition, we asked the annotators to evaluate the Trustworthiness throughout the video, i.e., when the host is discussing a particular aspect, how honest and knowledgeable does the annotator feel their review is? Put another way, this could also include the annotator perceiving a commercial gain rather than a truthful review of the product. Several video examples and cases were given to the annotators. We reiterate that this can be a subjective tier for annotation, and therefore additionally consider it from the annotator's perspective: Do you believe the information that is given to you? Do you have the feeling their argumentation is based on facts and experience, or are they rather trying to sell something?
Several methods exist to fuse a set of subjective emotion annotations in order to establish a consensus from individual annotations [61], [62]. Since there can be no fully objective signal of subjective information such as emotions, this label is referred to as the gold standard. Figure 5 depicts the frequency distribution of the created gold standards, utilising the Evaluator Weighted Estimator (EWE) approach [62]. Essentially, EWE considers the reliability of each annotation by calculating the cross-correlation of the annotation with the mean annotation (over all annotators). It can also be seen as the weighted mean of the annotators' agreement [63], [64]. EWE is described further in [62], and has been applied to multiple similar continuous emotion databases [16], [24], [65].

For MuSe-CaR, every video is annotated by at least five independent annotators employed by the EIHW Chair, from a group of eleven (six female and five male), all fluent in English. The age of the annotators ranges from 21 to 30 years. In this context, the mean concordance correlation coefficient (CCC) is used to measure the inter-rater agreement across all annotations for each dimension. The levels of agreement are moderate, as is to be expected [66], and consistent with those of other emotional datasets [16], [67], showing that the stronger 'in-the-wild' characteristics do not seem to have a strong influence on the perception of emotions.

Fig. 5. Density estimation of the continuously-annotated Arousal, Valence, and Trustworthiness tiers after EWE fusion of individual annotator tracks. While Arousal is almost perfectly Gaussian-shaped, Valence is skewed toward the positive end of the spectrum, and Trustworthiness is even more peaked.
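A minimal numpy sketch of this fusion, assuming the annotator traces are already resampled to a common grid, follows; clipping negatively correlated raters and normalising the weights is one common convention, not necessarily the exact variant used for MuSe-CaR.

import numpy as np

def ewe_fusion(annotations):
    # Evaluator Weighted Estimator: weight each annotator trace by its
    # correlation with the mean trace, then take the weighted average.
    # `annotations` has shape (n_annotators, n_timesteps).
    annotations = np.asarray(annotations, dtype=float)
    mean_trace = annotations.mean(axis=0)
    weights = np.array([np.corrcoef(a, mean_trace)[0, 1] for a in annotations])
    weights = np.clip(weights, 0.0, None)   # drop negatively correlated raters
    return weights @ annotations / weights.sum()  # assumes some positive weight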
Within the videos, there are two human subjects which are of main interest to the annotators; these have been defined as follows:
• The host is the reviewer/presenter of the car and its features, often talking directly to the audience and expressing their own opinions.
• The narrator, present in some videos, may occur in the opening and closing sequences, may provide additional information, and is not visible to the audience.
During the annotation of Trustworthiness, the turns between the host and the narrator were annotated using the binary trigger. When the host is not speaking, and the narrator speaks, the binary trigger is held.
Social media networks, such as YouTube, are a large, self-extending, and diverse data pool. In these videos, however, superimposed graphics occur, such as text and channel logos – some for copyright reasons, others to either inform or entertain the viewer. For our tasks, such banners might not be useful, since they might obscure entities or the host. Therefore, visible banners that show up are annotated in parallel to Valence. At a later stage, we want to detect and measure their influence on visual features, or exclude/replace them [68], [69], [70]. Banners appear with a variety of properties: i) appearance: static or dynamic; ii) position: bottom-right, bottom-left, upper-right, upper-left, entire footer, centre, or changing; iii) timing: on-screen billing, opening statement, in-between, or closing credits; iv) duration: highlight (very short), short (few seconds), or consistent (copyright); and v) transparency: none, partly, or mostly transparent.

The focus of this database is on videos with only one visible speaker/person. In a few cases, more than one person may appear, either for specific comments or for demonstration purposes, e.g., to show how many people fit in the back seat, or when the host is being shown a novel feature. These situations are binary annotated in parallel to Arousal.
Fig. 6. Examples of interaction between the host and the car parts fromGO-CarD [73].
Speaker topics rely on generalisable topics vocalised by the host and related to the object of interest. These higher-level topics combine many different aspects under one term. The videos are labelled by speaker topic segments, while one topic segment often comprises several sentences (cf. Figure 3). A detailed overview of sub-topics and aspects covered by each topic can be found in Table 2. It also shows the distribution of the topics across all 28 k sentences, of which 20 % have more than one topic annotated.
TABLE 2
Annotated speaker topics with examples of sub-topics and aspects. The % gives the share of the number of sentences in each topic, while 20 % of sentences have multiple topics.

Feature exterior (7 %): light: headlight, foglight, taillight; door: exterior locks, handle
Feature interior (6 %): audio system: radio, speaker; seat: belt, split folding backs
Handling, Driving Experience (13 %): driving actions: braking, steering, gear shifting; dynamics: centroid, chassis, suspension
User Experience (7 %): infotainment: screen, bluetooth, realtime traffic; interaction: interface, iDrive system, gestures
Performance (13 %): powertrain: electric, hybrid, combustion; engine: horsepower, RPM, acceleration
Quality & Aesthetic (7 %): design: interior, exterior, style (sporty, etc.); quality: material quality, clearance
Safety (2 %): tests: Euro NCAP, NHTSA, rating; assistance sys.: anti-lock brakes, traction control
Comfort (6 %): surface: leather, touch; space: leg room, head room, luggage
General information (16 %): introduction: series, weight, sales, warranty; comparison: models, brands, competitors
Costs (3 %): one-off: retail price, base price, feature price; after sale: insurance, maintenance, resale
In contrast to the voice-based speaker topics, the physical entities rely purely on visual input. The domain-specific objects of human-object interaction, the car parts, were annotated with bounding boxes for all recordings at a step size of 4 frames per second. 28 types of exterior and interior car parts, such as door, steering wheel, and infotainment, were labelled.

Manual bounding box annotation of all frames, each with a high number of classes, is highly labour-intensive. Based on previous experiments [71], [72], we can assume that around two boxes per minute can be labelled. Since there are 576 000 frames to annotate, with up to 15 boxes per frame, it would be impractical to label all of them manually (requiring between 4 800 and 72 000 hours for a single annotator). We therefore chose a semi-automatic process. First, a localiser (Darknet-53 network) was pre-trained on 15 003 vehicle images from other datasets. This underlying data does not include any human interaction. Therefore, we extracted and labelled another 1 000 frames, comprising more than 8 000 boxes, from MuSe-CaR and fine-tuned the algorithm. A detailed description can be found in a separate paper [73]. The network achieves a mean average precision of 67.6 %, with scores up to 94.0 % for very distinctive parts. A visual inspection was performed on the annotations, and those deemed unsatisfactory were removed or corrected. Figure 6 shows examples of these parts during human-object interactions.
For the same reasons as in Section 3.5.8, we applied a semi-automatic process to the extraction of faces, the start (start of visible face) and end (end of visible face) points of occurrence, as well as the relative position within a frame. We labelled faces in a small selection of videos from each channel by hand, to measure quantitatively the success of our automatic extraction and localisation. MTCNN [74] provides a robust framework for this task, having previously been proven accurate for face annotation in an emotional context [16]. It has a cascaded structure of three stages and is trained on the datasets WIDER FACE [75] and CelebA [76]. We classified the detected bounding boxes into true and false positives on our selection, resulting in high accuracy and F1 scores. Furthermore, we conducted a visual inspection of the bounding boxes. Given the high level of visual in-the-wild characteristics, for instance, partly visible faces, different sizes, side-shots, sunglasses, etc., we consider both the quantitative and qualitative results as strong and sufficient for further feature extraction. As described in detail in [17], we use frameworks such as VGGface [77] and OpenFace [78] to extract face-related features, including the intensity and presence of Facial Action Units (FAUs), facial landmarks, head pose, and gaze position.
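As an illustration of the detection step, the publicly available facenet-pytorch implementation of MTCNN can be used as follows; it is a stand-in for whichever MTCNN build the annotation pipeline actually used, and the frame file name is hypothetical.

from facenet_pytorch import MTCNN
from PIL import Image

mtcnn = MTCNN(keep_all=True)                # three-stage cascaded detector

frame = Image.open("frame_000123.jpg")      # hypothetical extracted video frame
boxes, probs = mtcnn.detect(frame)          # (n, 4) boxes and confidences, or None
if boxes is not None:
    for (x1, y1, x2, y2), p in zip(boxes, probs):
        print(f"face at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f}), conf {p:.2f}")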
Each dimensional and categorical annotation is followed by a very brief survey of the annotator's perception of the content viewed and a self-assessment of their own annotation. On a 10-point Likert scale (0 not at all, 10 very much), annotators are asked four questions: i) How appealing did you find the video? ii) How emotional did you find the host? iii) How trustworthy did you find the content? iv) How confident are you about the accuracy of your annotation? This data is directly linked to each annotation. An overview of the answers collected is depicted in Figure 7. Although these are somewhat subjective attributes to evaluate, we see a general agreement across our annotations for all questions, with a coefficient of variation between 0.12 and 0.25. The appeal of the videos to the annotators seems rather strong, with more than 20 % of the ratings between seven and eight. Similarly, the levels of emotionality and trustworthiness portrayed by the hosts were perceived as high. Regarding the self-assessment of the annotation quality, the annotators seem very confident in their performance.
For the MuSe 2020 challenge, we proposed a selection of tasks; a detailed description can be found in [17]. To make the (pre-processed) data for each task (e.g., specific features) easily accessible after the challenge, we moved them to several Zenodo repositories||, a high-speed research data host with storage in the CERN data centre. The MuSe-CaR database is available online for researchers who fulfil the requirements of the EULA (e.g., academic use only). The specific links can be found in the following sub-sections.

|| Metadata, labels, raw video, and audio: https://zenodo.org/deposit/4134766
Fig. 7. Annotation-related metadata of almost 5 000 ratings, where the annotator gave scores from 1 to 10 regarding the video watched.

Multimodal Sentiment in-the-Wild (MuSe-Wild)††: MuSe-Wild aims to predict the level of the affective dimensions of Arousal and Valence in a time-continuous manner. Timestamps to enable modality alignment and fusion on word-, sentence-, and utterance-level, as well as several acoustic, visual, and textual-based features, are pre-computed and provided with the task package.

Multimodal Emotion-Target Engagement (MuSe-Topic)‡‡: The MuSe-Topic task focuses on the prediction of 10 classes of domain-specific (automotive, as given by the chosen database) topics as the target of categorical Valence and Arousal emotions. The three classes (low, medium, and high) of Valence and Arousal are each predicted for every topic segment. These classes are created by averaging the mean value of the temporally aggregated continuous labels of MuSe-Wild and then dividing them into three equally sized classes (33 %) for each label.
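A compact way to derive such three-class labels is to cut the per-segment mean values at the empirical 33rd and 66th percentiles; the sketch below assumes one aggregated emotion value per topic segment and is illustrative rather than the challenge's exact pre-processing code.

import numpy as np

def three_class_labels(segment_means):
    # Split per-segment mean emotion values into three equally sized
    # classes via the 33rd/66th percentiles: 0 = low, 1 = medium, 2 = high.
    segment_means = np.asarray(segment_means, dtype=float)
    low, high = np.percentile(segment_means, [100 / 3, 200 / 3])
    return np.digitize(segment_means, [low, high])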
Multimodal Trustworthiness (MuSe-Trust)§§: The last task aims to develop methods to predict a continuous Trustworthiness signal in a sequential manner. Aligned Valence and Arousal annotations are also provided, to explore the relationship between all three dimensions, e.g., by training multi-task networks.

The evaluation metric of MuSe-Wild and MuSe-Trust is the CCC, which is often used in similar tasks [65], [79], as it is a theoretically well-understood measure [80], able to reflect reproducibility and performance while being robust to changes in scale and location [81]. MuSe-Topic is measured by a combination of F1 and Unweighted Average Recall.

Table 3 shows the size of the training, validation, and test partitions, which consider emotional ratings, speaker/channel independence, and duration, and which come with the packages. Before partitioning, we removed less informative data in a pre-processing step. MuSe-Wild and MuSe-Trust only include parts where an active voice or a visible face is present. Only MuSe-Trust includes non-product-related video segments, such as advertisements, which might have an impact on the Trustworthiness perception. MuSe-Topic, the more NLP-related task, only includes parts with a speaker topic annotation and transcription.

†† MuSe-Wild: https://zenodo.org/record/4134609
‡‡ MuSe-Topic: https://zenodo.org/record/4134733
§§ MuSe-Trust: https://zenodo.org/record/4134758
TABLE 3
Partitioning of the three tasks. Reported are the number of unique videos, and the duration for each task in hh:mm:ss.

Partition | No. | MuSe-Wild | MuSe-Topic | MuSe-Trust
Train     | 166 | 22:16:43  | 22:35:55   | 22:45:52
Devel.    | 62  | 06:48:58  | 06:49:46   | 06:52:22
Test      | 64  | 06:02:20  | 06:14:08   | 06:12:53
Σ         | 291 | 35:08:01  | 35:39:49   | 35:51:07
4 SURPASSING THE MUSE-TRUST BASELINE
Although the organisers of MuSe 2020 received several prediction submissions for the MuSe-Trust task, none could surpass the results of the baseline models, resulting in no papers for this task being accepted for the official challenge workshop [82]. In this section, we show that a simple but efficient neural network architecture called DeepTrust, utilising a Multi-Head-Attention Layer (MHAL) for encoding, in addition to a bi-directional Long Short-Term Memory Recurrent Neural Network (LSTM) with augmentation, is suitable for modelling Trustworthiness. We use the feature sets provided by the challenge, as well as extracting new ones shown to be effective in the other MuSe tasks, resulting in two different feature sets for each modality (for acoustic and vision: one handcrafted, and one based on deep representations). We run extensive experiments regarding augmentation, complexity, architecture, and the learning impulse (loss) of such a network. Furthermore, we evaluate the performance of unimodal features and multimodal fusion, as well as the training style (single- vs multi-task learning), for the prediction of Trustworthiness.
Acoustic: For the handcrafted acoustic feature set, we use the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [83] provided by the MuSe challenge. It is based on 23 acoustic spectral, cepstral, and prosodic low-level descriptors (LLDs), from which statistical functionals are calculated. We extract this 88-dimensional feature vector with a window size of 5 seconds and a hop size of 250 ms to enable an alignment to the annotation sampling rate. We further apply standardisation to the vector dimensions. In addition, we extract VGGish features [84], pre-trained on an extensive YouTube audio dataset (AudioSet) [85]. The underlying data contains 600 classes, and the recordings contain a variety of 'in-the-wild' noises that we expect to be beneficial for obtaining robust features from our 'in-the-wild' videos. By aligning the frame and hop size to the annotation sample rate, we extract a 128-dimensional VGGish embedding vector every 0.25 s from the underlying log spectrograms.
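For reproduction, windowed eGeMAPS functionals of this kind can be computed with the opensmile Python package; the snippet below is a sketch assuming a mono WAV input (the file name is hypothetical, and the challenge itself ships pre-computed features).

import opensmile
import soundfile as sf

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,    # 88 functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

signal, sr = sf.read("review_audio.wav")            # hypothetical file name
win, hop = int(5.0 * sr), int(0.25 * sr)            # 5 s window, 250 ms hop
features = [
    smile.process_signal(signal[start:start + win], sr)
    for start in range(0, max(len(signal) - win + 1, 1), hop)
]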
Vision: Facial Action Units (FAUs) are widely adopted in tasks close to emotion recognition, describing visually perceptible facial movements. FAUs break facial expressions down into 17 individual components of muscle movement, which we obtain from the OpenFace toolkit [86]. For the deep features, the pre-trained VGGface features [77], which were originally developed to identify the faces of people, are utilised. By removing the last softmax activation layer, we acquire a vector feature representation of the face.
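The idea of taking the penultimate layer of a face-recognition CNN as a feature vector can be sketched as follows; InceptionResnetV1 pre-trained on VGGFace2 (from facenet-pytorch) serves here as a stand-in for the VGGface network used in the paper.

import torch
from facenet_pytorch import InceptionResnetV1

model = InceptionResnetV1(pretrained="vggface2").eval()

faces = torch.randn(4, 3, 160, 160)    # hypothetical batch of aligned face crops
with torch.no_grad():
    embeddings = model(faces)          # (4, 512) embeddings, no softmax head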
Text: A standard method to transfer words from a symbolic to a continuous representation is word embeddings. These calculate a static, numerical vector per word, depending on the semantics in which the word occurs during training. We extract a 300-dimensional FastText vector for each word of the automatic transcription [87]. Instead of a static vector representation, context-based NLP transformers extract one vector per word in direct dependence on the context – at the time of inference. For this technique, we apply a BERT model [88], well established for a number of NLP tasks, and extract the sum of the last four layers as a 768-dimensional feature vector, similar to [89]. All features are aligned using the timestamps introduced in Section 3.5.
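Summing the last four hidden layers can be done with the transformers library as sketched below; bert-base-uncased is an assumption, chosen because it matches the 768-dimensional vectors described above.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

inputs = tokenizer("the ride is surprisingly comfortable", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One 768-d contextual vector per token: sum of the last four hidden layers.
token_vectors = torch.stack(out.hidden_states[-4:]).sum(dim=0)  # (1, seq_len, 768)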
DeepTrust: a multi-head attention network for Trustworthiness prediction
We utilise two neural network mechanisms to model the short- and long-term dependencies of continuous Trustworthiness: enhancing the encoding of the input state by an MHAL, and modelling the temporal dynamics of state changes by an LSTM. The attention heads improve the local representation of the extracted features and are able to sustain the long-term (global) dynamics of a sequence in their representation. However, this does not inherit a deeper positional understanding by nature [90]. To this end, we use the functionality of LSTMs, which are particularly capable of learning short- and mid-term patterns. Similar architectures were used by [89] and [91] for emotion recognition.

Mathematically, we train a function $F(X_i)$, where $X$ is a sequence of uni- or multimodal input features, which predicts a sequence of regression point estimates $y_i$. For this, we apply $h$ multi-attention heads to obtain more meaningful sequence representations $s_t$, with $T$ being the maximum number of steps. Multi-attention heads are the key building block of transformer networks. In this process, $h$ softmax dot-product attentions (self-attention mechanism, $\alpha$) are calculated in parallel, to learn different discriminative patterns in each head from the three linear projection inputs (query $Q$, key $K$, and values $V$), whereby the division by $\sqrt{d_k}$ prevents very small gradients. After scaling, the results of the individual heads are concatenated and fed into a subsequent linear layer $W^S$:

$$\text{MultiHead} = s_t = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^S, \quad (1)$$
$$\text{head}_i = \alpha\left(Q W_i^Q, K W_i^K, V W_i^V\right), \quad (2)$$
$$\alpha(Q, K, V) = \text{softmax}\left(Q K^\top / \sqrt{d_k}\right) V. \quad (3)$$

The resulting enhanced sequence $S$ is the input of, e.g., a one-directional LSTM, yielding the temporally encoded sequence $O$:

$$o_t = \overrightarrow{\text{LSTM}}(s_t), \quad t \in \{1, \dots, T\}. \quad (4)$$

Finally, the temporally encoded information is fed into a regression layer, predicting $y_i$.
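A minimal PyTorch sketch of this architecture is given below; the single attention block, the layer sizes, and the bidirectional read-out are illustrative assumptions rather than the exact DeepTrust configuration.

import torch
import torch.nn as nn

class DeepTrust(nn.Module):
    # Multi-head self-attention to enhance the framewise feature sequence,
    # an LSTM for the temporal dynamics, and a linear regression head.
    def __init__(self, feat_dim, n_heads=4, hidden=64):
        super().__init__()
        self.mhal = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                  # x: (batch, steps, feat_dim)
        s, _ = self.mhal(x, x, x)          # self-attention: Q = K = V = x
        o, _ = self.lstm(s)                # temporal encoding
        return self.head(o).squeeze(-1)    # one point estimate per step

model = DeepTrust(feat_dim=768)            # e.g., BERT features
y_hat = model(torch.randn(2, 200, 768))    # a ws = 200 segment, batch of 2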
Unlike most large emotional datasets with continuous annotations, MuSe-CaR does not break the videos down into artificial, equally-sized segments, so that content and context can remain largely intertwined. However, this has the disadvantage that some sequences end up very long (> 5 000 steps), increasing the amount of computational power needed. To solve this issue and increase the amount of data, the sequences are segmented into a fixed number of sequence steps ws, moving with a hop size hs, as proposed by the baseline paper [17] and several participants [89]. Furthermore, we utilise the segment id, as it provides the models with an additional positional encoding [89].
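The windowing itself is a few lines of code; the zero-overlap vs. overlapping behaviour examined later follows directly from the choice of hs relative to ws.

import numpy as np

def segment_sequence(features, ws=200, hs=100):
    # Cut one long per-video feature sequence of shape (steps, dims) into
    # fixed-length windows of ws steps moving with hop size hs; hs < ws
    # yields overlapping windows and thus augments the training data.
    features = np.asarray(features)
    starts = range(0, max(len(features) - ws, 0) + 1, hs)
    return np.stack([features[s:s + ws] for s in starts])

windows = segment_sequence(np.random.randn(1050, 88), ws=200, hs=100)
print(windows.shape)   # (9, 200, 88)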
The models are trained for a maximum of 100 epochs using an Adam optimiser, while the learning rate is reduced when it reaches a plateau for more than 10 epochs. As in the challenge, the CCC is evaluated on the development set after each epoch, and, after training, the best configuration is subsequently evaluated on the test set. As we aim for an integrated approach to modality fusion, we use early (concatenating the inputs) as well as late (using an LSTM) fusion, in order to better learn from the interaction of the modalities. We achieve the best results on development and on test by (late) fusing all except the VGGish predictions of the fully trained unimodal models using an LSTM. The results represent a major improvement of more than 50 % on the development set over the baselines established by [17] (cf. Table 4).

For the ablation study, we choose to set ws = 200, hs = 100 after initial experiments, and fix the number of hidden neurons of the bidirectional LSTM. If not otherwise stated, we run a hyperparameter optimisation over the number of heads h, the learning rate, and the batch size, and report the best result, indicating the hyperparameter setting by HP.
Unimodal results and unimodal fusion: First, we compare the performance of our feature sets. As shown in Table 5, the advanced BERT features and the deep acoustic VGGish features yield higher results in terms of CCC. Only the results on the vision features behave contrary: the deep face features VGGface seem to easily overfit on the development set, while the FAU scores are lower but generalise better. Early fusion of both text feature sets adds a small advantage compared to single use. For all other modalities, the fused results are worse on the test set. When comparing the models with eGeMAPS and FastText features with those of the baseline, both show large improvements on the development set, and the result for FastText almost doubled on the test set. For the following experiments, we use the best performing feature set of each modality, namely BERT, VGGish, and FAU.
Augmentation: Previous challenges [79] identified the effective use of the available data as a key performance driver. Table 6 shows the results under a changing number of sequence steps ws and hop sizes hs.

TABLE 4
Results of MuSe-Trust using the dev(elopment) and test set. Results are reported in concordance correlation coefficient (CCC). As feature sets for text: FastText (FT) and BERT; acoustic: VGGish, eGeMAPS, and DeepSpectrum (DS); vision: VGGface, FAU, and 2D landmarks (2D). Furthermore, the baseline models use the raw audio (A) in End2You and several combined vision features (V) in LSTM + Self-Att.

Model                     | Features                                 | Dev.  | Test
MultiFusion [92]          | FT + DS + 2D                             | –     | –
Official Baselines [17]
LSTM + Self-Att           | eGeMAPS                                  | –     | –
LSTM + Self-Att           | FastText                                 | –     | –
LSTM + Self-Att           | FastText + eGeMAPS                       | –     | –
LSTM + Self-Att           | FastText + eGeMAPS + V                   | –     | –
End2You                   | FastText + VGGface + A                   | .3198 | .4128
Ours (fusion)
DeepTrust best-of (early) | BERT + VGGish + FAU                      | –     | –
DeepTrust (early)         | BERT + FT + VGGish + eGeMAPS + FAU       | –     | –
DeepTrust best-of (late)  | BERT + VGGish + FAU                      | –     | –
DeepTrust (late)          | BERT + FT + VGGish + eGeMAPS + FAU       | .6507 | .6105
TABLE 5
Results of MuSe-Trust using the devel(opment) and test set, applying early and late fusion. Results are reported in concordance correlation coefficient (CCC). As feature sets for text: FastText (FT) and BERT; acoustic: VGGish and eGeMAPS; vision: VGGface and FAU for our models.

Feature sets             | Dev.  | Test  | HP
Text
FastText                 | –     | –     | I
BERT                     | –     | .5539 | I
BERT + FastText          | .5648 | –     | II
Audio
eGeMAPS                  | –     | –     | III
VGGish                   | .5376 | .4035 | I
VGGish + eGeMAPS         | –     | –     | I
Vision
FAU                      | –     | .3623 | I
VGGface                  | .4000 | –     | IV
VGGface + FAU            | –     | –     | V
Early Fusion
T+A: BERT + VGGish       | .5833 | –     | VI
T+V: BERT + FAU          | –     | .5880 | VII
A+V: VGGish + FAU        | –     | –     | I
ISH + FAU . . I the sequences have no overlap. A larger ws appears to begenerally valuable. Most likely, this variable ( ws ) improvesthe ability to capture global dynamic changes, leading to amore expressive representation of the state of trust. Havingno overlap yields stable, generalisable results, while, whenapplying an overlap, the results could be either better orworse. As a rule of thumb, the longer the sequences are,the higher the overlap can be, while a good estimate rangesbetween hs = 0.3 – 0.5 ws . However, if ws is small (e. g. , 100,200), the results might not be generalisable to test. This mightbe counteracted by applying additional augmentation to thereappearing sequence steps. TABLE 6Results of MuSe-Trust using the devel(opment) and test set. Results arereported in CCC. steps BERT VGG
ws | hs | BERT Dev. | BERT Test | VGGish Dev. | VGGish Test | FAU Dev. | FAU Test | Ø Dev. | Ø Test
750 | 750 | .5641 | .5540 | .4274 | .4344 | .3775 | .3719 | .4563 | .4534
750 | 500 | .5739 | .5747 | .5604 | .4699 | .3671 | .3705 | .5005 | .4717
750 | 250 | .5889 | .5693 | .5386 | .4686 | .4305 | .4843 | .5193 | .5074
200 | 200 | .5512 | .5245 | .5566 | .4752 | .3614 | .2710 | .4897 | .4236
200 | 150 | .5500 | .5533 | .5517 | .3034 | .3558 | .3508 | .4858 | .4025
200 | 100 | .5624 | .5539 | .5376 | .4035 | .3675 | .3623 | .4892 | .4399
200 | 50 | .5282 | .5160 | .5440 | .4081 | .3319 | .1820 | .4680 | .3687
100 | 100 | .5167 | .5128 | .5064 | .4445 | .3711 | .3709 | .4647 | .4427
100 | 50 | .5312 | .5068 | .5369 | .2918 | .3816 | .3294 | .4832 | .3760
100 | 25 | .5233 | .5216 | .5264 | .2517 | .3642 | .2911 | .4713 | .3548
Other side effects are that, as the length of the sequences increases, both the memory requirements and the training time grow. With our standard architecture, a maximum of ws = … is supported on a GPU with 32 GB of memory, which makes it necessary to trade off performance against usability.

Heads:
In a unimodal setting, the most suitable number of heads varies from modality to modality, with no clear tendency (e.g., with respect to the number of feature dimensions). The best results (cf. Table 7) are obtained with 4 heads for BERT (… on test), 16 heads for VGGish (… on test), and 2 heads for FAU (… on test). On average, 2, 4, and 16 heads perform very similarly on the development set, with a slight advantage for 16 heads (…) on development and for 8 heads (…) on test data.

TABLE 7: Results of MuSe-Trust using the devel(opment) and test set under varying numbers of attention heads. Results are reported in CCC.

heads | BERT Dev. | BERT Test | VGGish Dev. | VGGish Test | FAU Dev. | FAU Test | Ø Dev. | Ø Test
2 | … | … | … | … | … | .3774 | .4878 | .4296
4 | .5624 | .5539 | .5376 | .4035 | .3675 | .3591 | .4892 | .4388
8 | … | … | … | … | … | … | … | …
16 | … | … | … | .4592 | .3548 | .3352 | .4953 | .4352
Loss:
Since the loss and the metric are the same (CCC), we also report the Pearson correlation coefficient (PCC) and the root mean square error (RMSE) for this experiment (cf. Table 8). The CCC loss clearly performs better than the MSE and L1 losses for the BERT and VGGish features, as well as for the average results of the two correlation-based metrics (CCC and PCC). However, this is not the case for FAU, where L1 and MSE perform equally well or outperform the CCC loss. For RMSE, FAU has such a strong impact that the average RMSE of both other loss functions is also better.
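The loss implementation itself is not reproduced in this paper; a common differentiable formulation of the CCC loss (1 - CCC), sketched here in PyTorch under the assumption of per-batch statistics, is:

    import torch

    def ccc_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
        """1 - concordance correlation coefficient over a flattened batch."""
        pred, gold = pred.flatten(), gold.flatten()
        pred_mean, gold_mean = pred.mean(), gold.mean()
        covariance = ((pred - pred_mean) * (gold - gold_mean)).mean()
        ccc = 2.0 * covariance / (
            pred.var(unbiased=False) + gold.var(unbiased=False)
            + (pred_mean - gold_mean) ** 2
        )
        return 1.0 - ccc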
TABLE 8: Results of MuSe-Trust using the devel(opment) and test set under different loss functions. Results are reported in CCC, PCC, and RMSE.

loss | metric | BERT Dev. | BERT Test | VGGish Dev. | VGGish Test | FAU Dev. | FAU Test | Ø Dev. | Ø Test
CCC | CCC | .5624 | .5539 | .5376 | .4035 | .3675 | .3623 | .4892 | .4399
CCC | PCC | .5684 | .5998 | .5384 | .4421 | .3770 | .4301 | .4946 | .4907
CCC | RMSE | .3652 | .3485 | .3867 | .4199 | .4693 | .4780 | .4071 | .4155
L1 | CCC | .5076 | .5211 | .3650 | .2408 | .3678 | .3407 | .4135 | .3675
L1 | PCC | .5432 | .5712 | .3877 | .3270 | .3724 | .3728 | .4344 | .4237
L1 | RMSE | .3595 | .3409 | .4031 | .3978 | .4350 | .3690 | .3992 | .3692
MSE | CCC | .5215 | .5433 | .3932 | .4094 | .3537 | .3243 | .4228 | .4257
MSE | PCC | .5455 | .5570 | .3932 | .410 | .3584 | .3498 | .4324 | .4392
MSE | RMSE | .3584 | .3470 | .3932 | .4160 | .4407 | .3796 | .3974 | .3809
Model:
Next, we compare several architectural choices on the unimodal feature selections. As we can see in Table 9, using the combination of both modules is sensible. One configuration (2 MHAL + Bi-LSTM) yields a better result on the BERT test set; for all others and on average, the architecture with a one-layer MHAL and a bidirectional LSTM achieves the best results.
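Beyond the module names, the exact configuration is not fully specified above, so the following PyTorch sketch only illustrates the winning combination, a multi-head attention layer (MHAL) applied as self-attention, followed by a bidirectional LSTM and a regression layer; all dimensions are placeholders:

    import torch
    import torch.nn as nn

    class DeepTrustSketch(nn.Module):
        """Illustrative MHAL + Bi-LSTM regressor; hyperparameters are placeholders."""

        def __init__(self, feat_dim: int = 768, hidden: int = 64, heads: int = 4):
            super().__init__()
            # self-attention over the input sequence; feat_dim must be divisible
            # by the number of heads
            self.mhal = nn.MultiheadAttention(feat_dim, num_heads=heads, batch_first=True)
            self.bilstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden, 1)  # per-step regression output

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            attended, _ = self.mhal(x, x, x)      # (batch, steps, feat_dim)
            encoded, _ = self.bilstm(attended)    # (batch, steps, 2 * hidden)
            return self.out(encoded).squeeze(-1)  # (batch, steps)

    # e.g., a batch of 8 windows of 200 steps of BERT features
    y = DeepTrustSketch()(torch.rand(8, 200, 768))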
Multimodal fusion:
Fusing the best single modalities improves all results on the development set. However, the best combination on development, the fusion of BERT and VGGish, generalises poorly, achieving a lower result than BERT alone. In comparison, the fusion of text and vision features generalises well and achieves … CCC on test data, a better result than every single modality. Similarly, VGGish and FAU achieve slightly higher results on both sets when being fused.

TABLE 9: Results of MuSe-Trust using the devel(opment) and test set for different network configurations. Results are reported in CCC.

network | BERT Dev. | BERT Test | VGGish Dev. | VGGish Test | FAU Dev. | FAU Test | Ø Dev. | Ø Test
MHAL | .3117 | .3248 | .4230 | .3150 | .3677 | .3351 | .3675 | .3250
LSTM | .5165 | .5170 | .5441 | .3771 | .3270 | .2513 | .4625 | .3818
MHAL+LSTM | .5423 | .5526 | .5368 | .2248 | .3609 | .3047 | .4800 | .3607
MHAL+2 Bi-LSTM | .5456 | .5504 | .5259 | .3688 | .3642 | .2973 | .4786 | .4055
MHAL+Bi-LSTM | .5624 | .5539 | .5376 | .4035 | .3675 | .3623 | .4892 | .4399
2 MHAL+Bi-LSTM | … | .5762 | .4918 | .3818 | .3645 | .3447 | .4704 | .4342
2 MHAL+2 Bi-LSTM | .5410 | .5344 | .4942 | .3233 | .3553 | .3523 | .4635 | .4033
3 MHAL+3 Bi-LSTM | .5437 | .5089 | .4977 | .3376 | .3455 | .3104 | .4623 | .3856
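Late fusion, in contrast, feeds the per-step predictions of the already trained unimodal models into a small sequence model; a minimal sketch, assuming three aligned unimodal prediction streams:

    import torch
    import torch.nn as nn

    class LateFusionLSTM(nn.Module):
        """Fuse per-step predictions of trained unimodal models with an LSTM."""

        def __init__(self, n_models: int = 3, hidden: int = 32):
            super().__init__()
            self.lstm = nn.LSTM(n_models, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 1)

        def forward(self, unimodal_preds: torch.Tensor) -> torch.Tensor:
            # unimodal_preds: (batch, steps, n_models), one channel per modality
            encoded, _ = self.lstm(unimodal_preds)
            return self.out(encoded).squeeze(-1)

    # e.g., stacking per-step BERT-, VGGish-, and FAU-based predictions
    preds = torch.stack([torch.rand(8, 200) for _ in range(3)], dim=-1)
    fused = LateFusionLSTM()(preds)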
Multi-task learning:
Using our approach to predict Arousal, Valence, and Trustworthiness simultaneously outperforms the baseline by more than … CCC on development and almost … CCC on the test set. Adding more weight to the Trustworthiness predictions (II.) improves the result slightly; a sketch of this weighted loss follows Table 10.

TABLE 10: Results of MuSe-Trust using the devel(opment) and test set. Results for Trustworthiness (T), Arousal (A), and Valence (V) are reported in CCC. Configurations: (I.) equal loss weights of 0.33 each; (II.) 0.5 × Trustworthiness, 0.25 × {Arousal, Valence}.

Model | Features | T Dev. | T Test | A Dev. | V Dev.
End2You-Multitask [17] | FastText + VGGface + A | .3264 | .4119 | – | –
MHAL+LSTM-Multi (I.) | BERT + VGGish + FAU | .5428 | .5456 | .4102 | .4442
MHAL+LSTM-Multi (II.) | BERT + VGGish + FAU | .5497 | .5518 | .4132 | .4215
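Configuration (II.) corresponds to a fixed convex weighting of the three per-target losses; a minimal sketch, reusing the ccc_loss function from the loss experiment above:

    import torch

    # weights follow configuration (II.): 0.5 Trustworthiness, 0.25 each for
    # Arousal and Valence; configuration (I.) would use 0.33 for all targets
    WEIGHTS = {"trustworthiness": 0.5, "arousal": 0.25, "valence": 0.25}

    def multitask_loss(preds: dict, golds: dict) -> torch.Tensor:
        """Weighted sum of per-target CCC losses."""
        return sum(w * ccc_loss(preds[t], golds[t]) for t, w in WEIGHTS.items())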
FUTURE WORK AND LIMITATIONS
In the future, we plan to extend the first proposed tasks in several directions by using the additional levels of annotation presented. Of particular interest is the possible connection between the novel dimension of Trustworthiness and Arousal/Valence, and its usability for evaluating user-generated data, which could give insights into the subjective perception of Trustworthiness.

One focus of the MuSe challenge is to bring together communities from differing computational disciplines: mainly, the sentiment analysis community, which comes from an NLP background and prefers to predict discrete sentiment/emotion categories [93], and the audio-visual emotion recognition community, which predicts continuous-valued Valence and Arousal dimensions of emotion (circumplex model of affect [44]), originated in intelligent audio and visual signal processing, and often disregards the potential of the textual modality [65], [79], [94], [95]. Both approaches have their advantages: classes of emotions are more intuitive for humans (a happy label vs Arousal and Valence scores), whereas dimensions are more generalisable and benefit strongly from related, explicitly multimodal learning techniques [96], [97], [98]. In theory, Russell's circumplex model of emotion allows a mapping (discretisation) from continuous signals to emotion labels, although this is often disputed [99]. To date, no reliable approaches exist for dynamically mapping continuous emotional values to classes on a large scale. The question arises whether a special mapping can be derived for 'in-the-wild' environments. This would also help enable transfer-learning capabilities between the two categories of research and their accompanying datasets.
Connected to the previously introduced direction, the continuous annotation signals also have to be summarised over time, e.g., on sentence- or segment-level, for the mapping process. Emotions are intense feelings that are rather short-term and typically directed at a source/topic. We want to explore suitable ways to aggregate emotion annotations over time with reference to topics. Corresponding fusion and aggregation approaches might even rely directly on the five raw annotations, for instance through Dynamic Time Warping (DTW) [100] or Deep Canonical Time Warping (DCTW) [101], and might incorporate an explicit model of the uncertainty expressed by the (dis-)agreement of the annotators. Starting with unsupervised approaches, the dataset could be extended to specific classes in the future. Furthermore, one should explore ways to generally improve the time and cost of intensive continuous annotation. One idea is to learn stable representations of annotator behaviour in order to create additional artificial annotations based on a reduced number of real annotations.

Since multimodal sentiment analysis 'in-the-wild' often utilises user-generated data with a variety of 'in-the-wild' characteristics, we want to explore those directions in more depth. With the collected YouTube metadata of user engagement (e.g., view count, like and dislike ratio/count, sentimentality of the video comments), we want to investigate whether such signals can be predicted purely from (features of) the annotated continuous emotions.

The investigation of 'in-the-wild' influences and data imbalance is of further interest. Typically, when developing models for emotion recognition, the aim is to work with a dataset that is as balanced and clean as possible with regard to person and environment characteristics. This avoids bias in the modelling, which otherwise leads to worse results when predicting data with an unknown distribution, and also has ethical implications in real-world deployments if not counteracted. When collecting data 'in-the-wild', however, this is sometimes not possible. For this reason, the community needs to explore the effects of, e.g., gender-imbalanced training data, as well as develop appropriate counter-strategies. Another example is the relevance of face-related features in the prediction of Valence. While in most lab settings it is rather unusual for participants to wear glasses or sunglasses, this is more often the case for 'in-the-wild' data. Previous studies showed that the degree of occlusion of the face is negatively correlated with the performance of face recognition [102], and that the presence of occlusions, such as sunglasses or masks, degrades the performance of facial expression recognition systems [103]. We have deliberately collected around 30 % of hosts wearing glasses or sunglasses in order to deepen these studies and to estimate how multimodal approaches help to overcome such challenges. A similar direction is to measure the influence of superimposed banners, which might obscure important video content, such as interaction objects. We want to explore ways to detect, artificially remove, and replace such occlusions, e.g., by utilising Generative Adversarial Networks.
Limitations.
As defined by the categories of [66], collecting meaningful in-the-wild data is always a trade-off between naturalness and lack of control. Although we provide natural, minutes-long context to capture the change in emotions towards multiple topics instead of short dialogues, we see some limitations regarding the range of possible context. In this corpus, we wanted to connect emotions to overarching topics, enabling emotion-context interaction; hence, only selected material of one domain (car reviews) was considered, which naturally limits the range of topics. This has also been done in other large emotional datasets, e.g., advertisements in the case of [16]. We see improvements on the other criteria: primitive (dimensional) descriptors are common in affective computing, but not in multimodal sentiment analysis. We address this gap by providing two dimensions of real continuous emotions and a user-generated-content-specific one, giving advanced flexibility and helping to bridge the two communities. The scope of the dataset (e.g., number of hosts, modalities) does, based on the previous results, seem to be sufficient to generalise person-independent affect well, and the provided linguistic transcriptions are of high value. Focusing on the user-generated area, we kept a high degree of naturalness, excluding only parts unusable for affect recognition (e.g., segments with neither face nor voice) while keeping all others, even very noisy sections (e.g., occlusions, selfie-camera footage, or half-visible faces).
SUMMARY AND CONCLUSION
In this paper, we introduced MuSe-CaR – a multimodal sentiment analysis in real-life media dataset. It was collected in user-generated, noisy environments, and consists of around 300 audio-visual and transcribed recordings of more than 70 hosts. We described the extensive annotation process in depth, covering 11 tiers, including dimensional emotions and layers that model their interaction with speaker topics and visual entities. We intentionally selected videos containing novel and challenging in-the-wild characteristics, including dynamic backgrounds and changing shots as well as angles of the face.

From this multimodal corpus of emotional car reviews, we derived three initial tasks: i) MuSe-Wild, where the level of the affective dimensions of Arousal and Valence has to be predicted; ii) MuSe-Topic, where the domain-related conversational topics and three intensity classes of Arousal and Valence have to be predicted; and iii) MuSe-Trust, where the level of continuous Trustworthiness has to be predicted. These tasks are publicly available to the research community, representing a testing bed for efforts in the automatic analysis of audio-visual behaviour. In addition, we proposed a simple but efficient network, DeepTrust, which uses attention-enhanced encoding to tackle the last task, largely outperforming the baseline results of MuSe-Trust. We provided exhaustive experiments and showed that it also has multi-task prediction capabilities, which is helpful both in advancing this novel task and the field of continuous affect estimation. Finally, we introduced some of our future research directions and limitations of the dataset. We hope this dataset is another valuable extension for the research community and another cornerstone in mastering multimodal sentiment analysis.

ACKNOWLEDGMENTS
The authors would like to thank the annotators and studentsat the University of Augsburg and Imperial College Londonwho contributed to this project in various ways.
REFERENCES
[5] …, International Journal of Computer Applications, vol. 135, no. 7, pp. 975–8887, 2016.
[6] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[7] J. Camacho-Collados and M. T. Pilehvar, "On the role of text preprocessing in neural network architectures: An evaluation study on text categorization and sentiment analysis," in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018, pp. 40–46.
[8] X. Chen, Y. Wang, and Q. Liu, "Visual and textual sentiment analysis using deep fusion convolutional neural networks," in …, IEEE, 2017, pp. 1557–1561.
[9] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, "Tensor fusion network for multimodal sentiment analysis," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017, pp. 1103–1114.
[10] M. P. Fortin and B. Chaib-draa, "Multimodal sentiment analysis: A multitask learning approach," in Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM), 2019.
[11] L. Stappen, V. Karas, N. Cummins, F. Ringeval, K. Scherer, and B. Schuller, "From speech to facial activity: Towards cross-modal sequence-to-sequence attention networks," in …, IEEE, 2019, pp. 1–6.
[12] M. Soleymani, D. Garcia, B. Jou, B. Schuller, S.-F. Chang, and M. Pantic, "A survey of multimodal sentiment analysis," Image and Vision Computing, vol. 65, pp. 3–14, 2017.
[13] A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency, "MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos," arXiv preprint arXiv:1606.06259, 2016.
[14] A. Zadeh, P. P. Liang, S. Poria, P. Vij, E. Cambria, and L.-P. Morency, "Multi-attention recurrent network for human communication comprehension," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[15] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, "Multimodal machine learning: A survey and taxonomy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2018.
[16] J. Kossaifi, R. Walecki, Y. Panagakis, J. Shen, M. Schmitt, F. Ringeval, J. Han, V. Pandit, A. Toisoul, B. W. Schuller et al., "SEWA DB: A rich database for audio-visual emotion and sentiment research in the wild," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[17] L. Stappen, A. Baird, G. Rizos, P. Tzirakis, X. Du, F. Hafner, L. Schumann, A. Mallol-Ragolta, B. W. Schuller, I. Lefter, E. Cambria, and I. Kompatsiaris, "MuSe 2020 challenge and workshop: Multimodal sentiment analysis, emotion-target engagement and trustworthiness detection in real-life media," in …, ACM, 2020.
[18] M. Wöllmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, and L.-P. Morency, "YouTube movie reviews: Sentiment analysis in an audio-visual context," IEEE Intelligent Systems, vol. 28, no. 3, pp. 46–53, 2013.
[19] L.-P. Morency, R. Mihalcea, and P. Doshi, "Towards multimodal sentiment analysis: Harvesting opinions from the web," in Proceedings of the 13th International Conference on Multimodal Interfaces (ICMI), 2011, pp. 169–176.
[20] V. Karas and B. W. Schuller, "Deep learning for sentiment analysis: An overview and perspectives," Natural Language Processing for Global and Local Business, pp. 97–132, 2020.
[21] S. Li and W. Deng, "Deep facial expression recognition: A survey," IEEE Transactions on Affective Computing, 2020.
[22] M. Swain, A. Routray, and P. Kabisatpathy, "Databases, features and classifiers for speech emotion recognition: A review," International Journal of Speech Technology, vol. 21, no. 1, pp. 93–120, 2018.
[23] E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. McRorie, J.-C. Martin, L. Devillers, S. Abrilian, A. Batliner et al., "The HUMAINE database: Addressing the collection and annotation of naturalistic and induced emotional data," in International Conference on Affective Computing and Intelligent Interaction (ACII), Springer, 2007, pp. 488–500.
[24] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, "Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions," in …, IEEE, 2013, pp. 1–8.
[25] J. Kossaifi, G. Tzimiropoulos, S. Todorovic, and M. Pantic, "AFEW-VA database for valence and arousal estimation in-the-wild," Image and Vision Computing, vol. 65, pp. 23–36, 2017.
[26] M. Grimm, K. Kroschel, and S. Narayanan, "The Vera am Mittag German audio-visual emotional speech database," in …, IEEE, 2008, pp. 865–868.
[27] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
[28] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder, "The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent," IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 5–17, 2011.
[29] I. Sneddon, M. McRorie, G. McKeown, and J. Hanratty, "The Belfast induced natural emotion database," IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 32–41, 2011.
[30] M. K. Hasan, W. Rahman, A. B. Zadeh, J. Zhong, M. I. Tanveer, L.-P. Morency, and M. E. Hoque, "UR-FUNNY: A multimodal language dataset for understanding humor," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), ACL, 2019, pp. 2046–2056.
[31] A. B. Zadeh, Y. Cao, S. Hessner, P. P. Liang, S. Poria, and L.-P. Morency, "MOSEAS: A multimodal language dataset for Spanish, Portuguese, German and French," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 1801–1812.
[32] A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L.-P. Morency, "Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2236–2246.
[33] E. Marrese-Taylor, C. Rodriguez-Opazo, J. A. Balazs, S. Gould, and Y. Matsuo, "A multi-modal approach to fine-grained opinion mining on video reviews," arXiv preprint arXiv:2005.13362, 2020.
[34] W. Yu, H. Xu, F. Meng, Y. Zhu, Y. Ma, J. Wu, J. Zou, and K. Yang, "CH-SIMS: A Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3718–3727.
[35] D. Cevher, S. Zepf, and R. Klinger, "Towards multimodal emotion recognition in German speech events in cars using transfer learning," Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019), 2019.
[36] E. Marrese-Taylor, J. Balazs, and Y. Matsuo, "Mining fine-grained opinions on closed captions of YouTube videos with an attention-RNN," in Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Copenhagen, Denmark: ACL, Sep. 2017, pp. 102–111.
[37] V. Pérez-Rosas, R. Mihalcea, and L.-P. Morency, "Utterance-level multimodal sentiment analysis," in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, vol. 1, 2013, pp. 973–982.
[38] B. Liu and L. Zhang, "A survey of opinion mining and sentiment analysis," in Mining Text Data, Springer, 2012, pp. 415–463.
[39] M. V. Mäntylä, D. Graziotin, and M. Kuutila, "The evolution of sentiment analysis: A review of research topics, venues, and top cited papers," Computer Science Review, vol. 27, pp. 16–32, 2018.
[40] V. P. Rosas, R. Mihalcea, and L.-P. Morency, "Multimodal sentiment analysis of Spanish online videos," IEEE Intelligent Systems, vol. 28, no. 3, pp. 38–45, 2013.
[41] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, "Large-scale visual sentiment ontology and detectors using adjective noun pairs," in Proceedings of the 21st ACM International Conference on Multimedia (ACMMM), ACM, 2013, pp. 223–232.
[42] A. Garcia, S. Essid, F. d'Alché-Buc, and C. Clavel, "A multimodal movie review corpus for fine-grained opinion mining," arXiv preprint arXiv:1902.10102, 2019.
[43] S. Park, H. S. Shim, M. Chatterjee, K. Sagae, and L.-P. Morency, "Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach," in Proceedings of the 16th International Conference on Multimodal Interaction (ICMI), 2014, pp. 50–57.
[44] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, no. 6, pp. 1161–1178, 1980.
[45] M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, and A. Kappas, "Sentiment strength detection in short informal text," Journal of the American Society for Information Science and Technology, vol. 61, no. 12, pp. 2544–2558, 2010.
[46] S. M. Mohammad, "Sentiment analysis: Detecting valence, emotions, and other affectual states from text," in Emotion Measurement, Elsevier, 2016, pp. 201–237.
[47] D. Preoţiuc-Pietro, H. A. Schwartz, G. Park, J. Eichstaedt, M. Kern, L. Ungar, and E. Shulman, "Modelling valence and arousal in Facebook posts," in Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 2016, pp. 9–15.
[48] L. Floridi and M. Chiriatti, "GPT-3: Its nature, scope, limits, and consequences," Minds and Machines, pp. 1–14, 2020.
[49] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "GLUE: A multi-task benchmark and analysis platform for natural language understanding," in …, 2019.
[50] A. Luneski, E. Konstantinidis, and P. Bamidis, "Affective medicine: A review of affective computing efforts in medical informatics," Methods of Information in Medicine, vol. 49, no. 3, pp. 207–218, 2010.
[51] A. Baird, S. Hantke, and B. W. Schuller, "Responsible and representative multimodal data acquisition and analysis: On auditability, benchmarking, confidence, data-reliance & explainability," CoRR, vol. abs/1903.07171, 2019. [Online]. Available: http://arxiv.org/abs/1903.07171
[52] P. Wittenburg, H. Brugman, A. Russel, A. Klassmann, and H. Sloetjes, "ELAN: A professional framework for multimodality research," in …, 2006, pp. 1556–1559.
[53] L. C. De Silva, T. Miyasato, and R. Nakatsu, "Facial emotion recognition using multi-modal information," in Proceedings of ICICS, 1997 International Conference on Information, Communications and Signal Processing. Theme: Trends in Information Systems Engineering and Wireless Multimedia Communications, vol. 1, IEEE, 1997, pp. 397–401.
[54] L. F. Barrett, B. Mesquita, and M. Gendron, "Context in emotion perception," Current Directions in Psychological Science, vol. 20, no. 5, pp. 286–290, 2011.
[55] J. M. Girard and A. G. Wright, "DARMA: Software for dual axis rating and media annotation," Behavior Research Methods, vol. 50, no. 3, pp. 902–909, 2018.
[56] C. Schwemmer and S. Ziewiecki, "Social media sellout: The increasing role of product promotion on YouTube," Social Media + Society, vol. 4, no. 3, 2018.
[57] S. T. Moturu and H. Liu, "Quantifying the trustworthiness of social media content," Distributed and Parallel Databases, vol. 29, no. 3, pp. 239–260, 2011.
[58] H. Horsburgh, "Trust and social objectives," Ethics, vol. 72, no. 1, pp. 28–40, 1961.
[59] J. C. Cox, R. Kerschbamer, and D. Neururer, "What is trustworthiness and what drives it?" Games and Economic Behavior, vol. 98, pp. 197–218, 2016.
[60] J. A. Colquitt, B. A. Scott, and J. A. LePine, "Trust, trustworthiness, and trust propensity: A meta-analytic test of their unique relationships with risk taking and job performance," Journal of Applied Psychology, vol. 92, no. 4, p. 909, 2007.
[61] Y. Panagakis, M. A. Nicolaou, S. Zafeiriou, and M. Pantic, "Robust correlated and individual component analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1665–1678, 2015.
[62] B. W. Schuller, Intelligent Audio Analysis, Springer, 2013.
[63] M. Grimm and K. Kroschel, "Evaluation of natural emotions using self assessment manikins," in IEEE Workshop on Automatic Speech Recognition and Understanding, 2005, IEEE, 2005, pp. 381–385.
[64] S. Hantke, E. Marchi, and B. Schuller, "Introducing the weighted trustability evaluator for crowdsourcing exemplified by speaker likability classification," in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), 2016, pp. 2156–2161.
[65] F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, S. Scherer, S. Mozgai, N. Cummins, M. Schmitt, and M. Pantic, "AVEC 2017: Real-life depression, and affect recognition workshop and challenge," in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge (AVEC), 2017, pp. 3–9.
[66] E. Douglas-Cowie, L. Devillers, J.-C. Martin, R. Cowie, S. Savvidou, S. Abrilian, and C. Cox, "Multimodal databases of everyday emotion: Facing up to complexity," in Ninth European Conference on Speech Communication and Technology (INTERSPEECH), 2005.
[67] L. Devillers, L. Vidrascu, and L. Lamel, "Challenges in real-life emotion annotation and machine learning based detection," Neural Networks, vol. 18, no. 4, pp. 407–422, 2005.
[68] A. Słucki, T. Trzciński, A. Bielski, and P. Cyrta, "Extracting textual overlays from social media videos using neural networks," in International Conference on Computer Vision and Graphics, Springer, 2018, pp. 287–299.
[69] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. Lim Tan, "Text flow: A unified text detection system in natural scene images," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4651–4659.
[70] H. Yang, C. Wang, C. Bartz, and C. Meinel, "SceneTextReg: A real-time video OCR system," in Proceedings of the 24th ACM International Conference on Multimedia (ACMMM), ACM, 2016, pp. 698–700.
[71] H. Su, J. Deng, and L. Fei-Fei, "Crowdsourcing annotations for visual object detection," in Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[72] K. Konyushkova, J. Uijlings, C. H. Lampert, and V. Ferrari, "Learning intelligent dialogs for bounding box annotation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9175–9184.
[73] L. Stappen, X. Du, V. Karas, S. Müller, and B. W. Schuller, "Go-CaRD – generic, optical car part recognition and detection: Collection, insights, and applications," arXiv preprint arXiv:2006.08521, 2020.
[74] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[75] T. Afouras, J. S. Chung, A. W. Senior, O. Vinyals, and A. Zisserman, "Deep audio-visual speech recognition," CoRR, vol. abs/1809.02108, 2018. [Online]. Available: http://arxiv.org/abs/1809.02108
[76] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in Proceedings of the International Conference on Computer Vision (ICCV), December 2015.
[77] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in British Machine Vision Conference (BMVC), 2015.
[78] T. Baltrušaitis, P. Robinson, and L.-P. Morency, "OpenFace: An open source facial behavior analysis toolkit," in …, IEEE, 2016, pp. 1–10.
[79] M. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, and M. Pantic, "AVEC 2013: The continuous audio/visual emotion and depression recognition challenge," in Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge (AVEC), ACM, 2013, pp. 3–10.
[80] V. Pandit and B. Schuller, "On many-to-many mapping between concordance correlation coefficient and mean square error," arXiv preprint arXiv:1902.05180, 2019.
[81] I. Lawrence and K. Lin, "A concordance correlation coefficient to evaluate reproducibility," Biometrics, pp. 255–268, 1989.
[82] L. Stappen, B. W. Schuller, I. Lefter, E. Cambria, and I. Kompatsiaris, "Summary of MuSe 2020: Multimodal sentiment analysis, emotion-target engagement and trustworthiness detection in real-life media," in …, ACM, 2020.
[83] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., "The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing," IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2015.
[84] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., "CNN architectures for large-scale audio classification," in …, IEEE, 2017, pp. 131–135.
[85] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in …, IEEE, 2017, pp. 776–780.
[86] T. Baltrušaitis, P. Robinson, and L.-P. Morency, "OpenFace: An open source facial behavior analysis toolkit," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Lake Placid, NY: IEEE, 2016, 10 pages.
[87] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[88] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[89] L. Sun, Z. Lian, J. Tao, B. Liu, and M. Niu, "Multi-modal continuous dimensional emotion recognition using recurrent neural network and self-attention mechanism," in Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop, 2020, pp. 27–34.
[90] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[91] J. Huang, J. Tao, B. Liu, Z. Lian, and M. Niu, "Efficient modeling of long temporal contexts for continuous emotion recognition," in …, IEEE, 2019, pp. 185–191.
[92] H.-J. Yang, G.-S. Lee, J.-H. Kim, and S.-H. Kim, "Multimodal fusion with attention mechanism for trustworthiness prediction in car advertisements," 2020.
[93] A. Zadeh, P. P. Liang, L.-P. Morency, S. Poria, E. Cambria, and S. Scherer, "Proceedings of grand challenge and workshop on human multimodal language (Challenge-HML)," in Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), 2018.
[94] D. Kollias, A. Schulc, E. Hajiyev, and S. Zafeiriou, "Analysing affective behavior in the first ABAW 2020 competition," arXiv preprint arXiv:2001.11409, 2020.
[95] B. W. Schuller, S. Steidl, A. Batliner, P. B. Marschik, H. Baumeister, F. Dong, S. Hantke, F. B. Pokorny, E.-M. Rathner, K. D. Bartl-Pokorny et al., "The INTERSPEECH 2018 computational paralinguistics challenge: Atypical & self-assessed affect, crying & heart beats," in Interspeech, 2018, pp. 122–126.
[96] J. Arevalo, T. Solorio, M. Montes-y-Gómez, and F. A. González, "Gated multimodal networks," Neural Computing and Applications, pp. 1–20, 2020.
[97] R. Gomez, J. Gibert, L. Gomez, and D. Karatzas, "Exploring hate speech detection in multimodal publications," in The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 1470–1478.
[98] X. Qiu, Z. Feng, X. Yang, and J. Tian, "Multimodal fusion of speech and gesture recognition based on deep learning," in Journal of Physics: Conference Series, vol. 1453, 2020, p. 012092.
[99] S. Hamann, "Mapping discrete and dimensional emotions onto the brain: Controversies and consensus," Trends in Cognitive Sciences, vol. 16, no. 9, pp. 458–466, 2012.
[100] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, USA: Prentice-Hall, Inc., 1993.
[101] G. Trigeorgis, M. A. Nicolaou, S. Zafeiriou, and B. W. Schuller, "Deep canonical time warping," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5110–5118.
[102] R. R. Atallah, A. Kamsin, M. A. Ismail, S. A. Abdelrahman, and S. Zerdoumi, "Face recognition and age estimation implications of changes in facial features: A critical review study," IEEE Access, vol. 6, pp. 28290–28304, 2018.
[103] M. Sajjad, A. Shah, Z. Jan, S. I. Shah, S. W. Baik, and I. Mehmood, "Facial appearance and texture feature-based robust facial expression recognition framework for sentiment knowledge discovery," Cluster Computing, vol. 21, no. 1, pp. 549–567, 2018.
Lukas Stappen received his Master of Sciencein Data Science with distinction from King’sCollege London in 2017. He then joined thegroup for Machine Learning in Health Informatics.Currently, he is a PhD candidate at the Chairfor Embedded Intelligence for Health Care andWellbeing, University of Augsburg, Germany, anda PhD Fellow of the BMW Group. His researchinterests include affective computing, multimodalsentiment analysis, and multimodal/cross-modalrepresentation learning with a core focus on ‘in-the-wild’ environments.
Alice Baird received her MFA in Sound Art from Columbia University's Computer Music Center and is currently a Ph.D. Fellow of the ZD.B, supervised by Prof. Björn Schuller at the Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany. Her research is focused on intelligent audio analysis in the domain of speech and general audio, and her research interests include health informatics, affective computing, computational paralinguistics, and speech pathology.
Lea Schumann received her B. Sc. degree incomputer science from the University of Augs-burg, Germany, in 2018 and is currently workingtowards her M. Sc. degree in computer sciencewith a strong focus on deep learning. Her re-search interests include computer vision, affec-tive computing, and natural language processing.