Affective Computing for Large-Scale Heterogeneous Multimedia Data: A Survey
Sicheng Zhao, Shangfei Wang, Mohammad Soleymani, Dhiraj Joshi, Qiang Ji
SICHENG ZHAO,
University of California, Berkeley, USA
SHANGFEI WANG∗, University of Science and Technology of China, China
MOHAMMAD SOLEYMANI∗, University of Southern California, USA
DHIRAJ JOSHI,
IBM Research AI, USA
QIANG JI,
Rensselaer Polytechnic Institute, USA

The wide popularity of digital photography and social networks has generated a rapidly growing volume of multimedia data (i.e., image, music, and video), resulting in a great demand for managing, retrieving, and understanding these data. Affective computing (AC) of these data can help to understand human behaviors and enable wide applications. In this article, we survey the state-of-the-art AC technologies comprehensively for large-scale heterogeneous multimedia data. We begin this survey by introducing the typical emotion representation models from psychology that are widely employed in AC. We briefly describe the available datasets for evaluating AC algorithms. We then summarize and compare the representative methods on AC of different multimedia types, i.e., images, music, videos, and multimodal data, with the focus on both handcrafted features-based methods and deep learning methods. Finally, we discuss some challenges and future directions for multimedia affective computing.

CCS Concepts: •
General and reference → Surveys and overviews; • Information systems → Sentiment analysis; • Human-centered computing → Human computer interaction (HCI).

Additional Key Words and Phrases: Affective computing, emotion recognition, sentiment analysis, large-scale multimedia
ACM Reference Format:
Sicheng Zhao, Shangfei Wang, Mohammad Soleymani, Dhiraj Joshi, and Qiang Ji. 2019. Affective Computing for Large-Scale Heterogeneous Multimedia Data: A Survey.
ACM Trans. Multimedia Comput. Commun. Appl.
∗ Corresponding authors: Shangfei Wang, Mohammad Soleymani.

Users are increasingly recording their daily activities, sharing interesting experiences, and expressing personal viewpoints using mobile devices on social networks, such as Twitter, Facebook, and Weibo. As a result, a rapidly growing volume of multimedia data (i.e., image, music, and video) has been generated, as shown in Figure 1, which results in a great demand for the management, retrieval, and understanding of these data. Most existing work on multimedia analysis focuses on the cognitive aspects, i.e., understanding the objective content, such as object detection in images [44], speaker recognition in speech [47], and action recognition in videos [48]. Since what people feel has a direct influence on their decision making, affective computing (AC) of these multimedia data is of significant importance and has attracted increasing attention [22, 146, 154, 156, 164]. For example, companies would like to know how customers evaluate their products and can thus improve their services [53]; depression and anxiety detection from social media can help understand psychological distress and thus potentially prevent suicidal actions [106].

Fig. 1. Different multimedia data ((a) text, (b) image, (c) music, (d) video, (e) multi-modal data) that are widely used to express emotions.

While sentiment analysis in text [91] has long been a standard task, AC from other modalities, such as image and video, has only recently begun to be considered. In this article, we aim to review the existing AC technologies comprehensively for large-scale heterogeneous multimedia data, including image, music, video, and multimodal data.

Affective computing of multimedia (ACM) aims to recognize the emotions that are expected to be evoked in viewers by a given stimulus. Similar to other supervised learning tasks, ACM is typically composed of three steps: data collection and annotation, feature extraction, and mapping learning between features and emotions [157]. One main challenge for ACM is the affective gap, i.e., "the lack of coincidence between the features and the expected affective state in which the user is brought by perceiving the signal" [45]. In the early stage, various hand-crafted features were designed to bridge this gap with traditional machine learning algorithms, while more recently researchers have focused on end-to-end deep learning from raw multimedia data to recognize emotions. Existing ACM methods mainly assign the dominant (average) emotion category (DEC) to an input stimulus, based on the assumption that different viewers have similar reactions to the same stimulus. We can usually formulate this task as a single-label learning problem.

However, emotions are influenced by subjective and contextual factors, such as educational background, cultural diversity, and social interaction [95, 141, 168]. As a result, different viewers may react differently to the same stimulus, which creates the subjective perception challenge. Therefore, the perception inconsistency makes it insufficient to simply predict the DEC for such a highly subjective variable. As stated in [168], we can perform two kinds of ACM tasks to deal with the subjectivity challenge: predicting personalized emotion perception for each viewer and assigning multiple emotion labels to each stimulus. For the latter, we can either assign multiple labels to each stimulus
with equal importance using multi-label learning methods, or predict the emotion distribution, which tries to learn the degree of each emotion [141].

In this article, we concentrate on surveying the existing methods on ACM and analyzing potential research trends. Section 2 introduces the widely-used emotion representation models from psychology. Section 3 summarizes the existing available datasets for evaluating ACM tasks. Section 4, Section 5, Section 6, and Section 7 survey the representative methods on AC of images, music, videos, and multimodal data, respectively, including both handcrafted features-based methods and deep learning methods. Section 8 provides some suggestions for future research, followed by the conclusion in Section 9.

To the best of our knowledge, this article is among the first to provide a comprehensive survey of affective computing of multimedia data from different modalities. Previous surveys mainly focus on a single modality, such as images [56, 161], speech [35], music [60, 145], video [11, 131], and multimodal data [116]. From this survey, readers can more easily compare the correlations and differences among different AC settings. We believe that this will be instrumental in generating novel research ideas.
There are two dominant emotion representation models deployed by psychologists: categorical emotions (CE) and dimensional emotion space (DES). CE models classify emotions into a few basic categories, such as happiness and anger. Some commonly used models include Ekman's six basic emotions [34] and Mikels's eight emotions [77]. When emotions are classified into positive and negative (polarity) [160, 163], sometimes including neutral, "emotion" is often called "sentiment". However, sentiment is usually defined as an attitude held toward an object [116]. DES models represent emotions as continuous coordinate points in a 3D or 2D Cartesian space, such as valence-arousal-dominance (VAD) [104] and activity-temperature-weight [66]. VAD is the most widely used DES model, where valence represents the pleasantness ranging from positive to negative, arousal represents the intensity of emotion ranging from excited to calm, and dominance represents the degree of control ranging from controlled to in control. Dominance is difficult to measure and is often omitted, leading to the commonly used two-dimensional VA space [45].

The relationship between CE and DES and the transformation from one to the other are studied in [124]. For example, positive valence relates to a happy state, while negative valence relates to a sad or angry state. CE models are easier for users to understand and label, but the limited set of categories may not well reflect the subtlety and complexity of emotions. DES can describe detailed emotions with subtle differences more flexibly, but it is difficult for users to assign absolute continuous values, which may also be problematic. CE and DES are mainly employed in classification and regression tasks, respectively, with discrete and continuous emotion labels. If we discretize DES into several constant values, we can also use it for classification [66]. Ranking-based labeling can be applied to ease such comprehension difficulties for raters.

Although less explored in this context, one of the most well-known theories that explains the development of emotional experience is appraisal theory. According to this theory, cognitive evaluation or appraisal of a situation, or of content in the case of multimedia, results in the emergence of emotions [88, 102]. According to Ortony, Clore and Collins (OCC) [88], emotions are experienced following a scenario comprising a series of phases. First, there is a perception of an event, object or action. Then, there is an evaluation of the events, objects or actions according to personal wishes or norms. Finally, perception and evaluation result in a specific emotion or emotions arising. Certain appraisal dimensions, such as novelty and complexity, can be labeled and detected from content. For example, Soleymani automatically recognized image novelty and complexity, which are related to interestingness. There are also domain-specific emotion taxonomies and scales. The Geneva Emotional Music Scale [153] is a music-specific emotion model for describing emotions induced by music. It consists of a hierarchical structure with 45 emotions, nine emotional categories and three superfactors that can describe emotion in music.

Another relevant distinction worth mentioning is that emotion in response to multimedia can be expected, induced, or perceived emotion. Expected emotion is the emotion that the multimedia creator intends to make people feel, perceived emotion is what people perceive as being expressed, while induced (felt) emotion is the actual emotion that is felt by a viewer. Discussing the differences and correlations of various emotion models is out of the scope of this article. The typical emotion models that have been widely used in ACM are listed in Table 1.

Table 1. Representative emotion models employed in ACM.

Model | Ref | Type | Emotion states/dimensions
Ekman | [34] | CE | happiness, sadness, anger, disgust, fear, surprise
Mikels | [77] | CE | amusement, anger, awe, contentment, disgust, excitement, fear, sadness
Plutchik | [97] | CE | anger, anticipation, disgust, fear, joy, sadness, surprise, trust (each with 3 scales)
Sentiment | [160, 163] | CE | positive, negative, (neutral)
VAD | [104] | DES | valence, arousal, dominance
Activity-temperature-weight | [66] | DES | activity, temperature, weight
GEMS | [153] | CE | 45 emotions, nine categories, three superfactors
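The discretization of DES mentioned above is straightforward to implement. The following minimal sketch is not taken from any surveyed method; the 9-point scale midpoint, the neutral band, and the quadrant names are illustrative assumptions. It maps a VA rating either to a sentiment polarity or to a coarse emotion quadrant.

```python
# Illustrative sketch: discretize dimensional VA ratings into categorical labels.
# The 9-point scale (midpoint 5.0) and the category names are assumptions.

def va_to_sentiment(valence, arousal, midpoint=5.0, neutral_band=0.5):
    """Map a valence-arousal rating on a 9-point scale to a sentiment polarity."""
    if abs(valence - midpoint) <= neutral_band:
        return "neutral"
    return "positive" if valence > midpoint else "negative"

def va_to_quadrant(valence, arousal, midpoint=5.0):
    """Map a valence-arousal rating to one of four coarse emotion quadrants."""
    if valence >= midpoint:
        return "excited/happy" if arousal >= midpoint else "calm/content"
    return "angry/afraid" if arousal >= midpoint else "sad/bored"

print(va_to_sentiment(7.2, 6.1))   # positive
print(va_to_quadrant(3.0, 7.5))    # angry/afraid
```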
The early datasets for AC of images mainly come from the psychology community and are small-scale. The International Affective Picture System (IAPS) is an image set that is widely used in psychology to evoke emotions [62]. Each image, depicting a complex scene, is associated with the mean and standard deviation (STD) of VAD ratings on a 9-point scale from about 100 college students. The IAPSa dataset is selected from IAPS with 246 images [77], which are labeled by 20 undergraduate students. The Abstract dataset consists of 279 abstract paintings without contextual content, peer rated by approximately 230 people. The Artistic dataset (ArtPhoto) includes 806 artistic photographs from a photo sharing site [71], with emotions determined by the artist uploading each photo. The Geneva affective picture database (GAPED) is composed of 520 negative, 121 positive, and 89 neutral images [26]. Besides, these images are also rated with valence and arousal values, ranging from 0 to 100 points. There are 500 abstract paintings in each of the MART and devArt datasets, which are collected from the Museum of Modern and Contemporary Art of Trento and Rovereto [3] and the "DeviantArt" online social network [3], respectively.

Recent datasets, especially the large-scale ones, are constructed using images from social networks. The Tweet dataset (Tweet) consists of 470 and 113 tweets for positive and negative sentiments, respectively [18]. The FlickrCC dataset includes about 500k Flickr creative common (CC) images, which are collected based on 1,553 adjective noun pairs (ANPs) [18]. The images are mapped to Plutchik's Wheel of Emotions with 8 basic emotions, each with 3 scales. The Flickr dataset contains about 300k images [144], with the emotion category of an image defined by the synonym word list that shares the most words with the adjective words of the image's tags and comments. The FI dataset consists of 23,308 images which are collected from Flickr and Instagram by searching the emotion keywords [149] and labeled by 225 Amazon Mechanical Turk (MTurk) workers. The number of images in each Mikels emotion category is larger than 1,000. The Emotion6 dataset [95] consists of 1,980 images collected from Flickr, with 330 images for each Ekman emotion category. Each image was scored by 15 MTurk workers to obtain the discrete emotion distribution information.
The IESN dataset, constructed for personalized emotion prediction [168], contains about 1M images from Flickr. Lexicon-based methods and VAD averaging [134] are applied to the text of metadata from uploaders to obtain expected emotions, and to comments from viewers to obtain personalized emotions. There are 7,723 active users with more than 50 involved images. We can also easily obtain the DEC and emotion distribution for each image. The FlickrLDL and TwitterLDL datasets [141] are constructed for discrete emotion distribution learning. The former is a subset of FlickrCC labeled by 11 viewers. The latter consists of 10,045 images which are collected by searching various sentiment keywords from Twitter and labeled by 8 viewers. These datasets are summarized in Table 2.

Table 2. Released and freely available datasets for AC of images, where 'Ref' is short for reference, '#Annotators' is the number of annotators, 'Labeling' represents the method used to obtain labels, such as human annotation (annotation) and keyword searching (keyword), and 'Labels' means the detailed labels in the dataset, such as dominant emotion category (dominant), average dimension values (average), personalized emotion (personalized), and emotion distribution (distribution).

Dataset | Ref | #Images | Type | #Annotators | Emotion model | Labeling | Labels
IAPS | [62] | 1,182 | natural | ≈100 (half f) | VAD | annotation | average
IAPSa | [77] | 246 | natural | 20 (10f, 10m) | Mikels | annotation | dominant
Abstract | [71] | 279 | abstract | ≈230 | Mikels | annotation | dominant
ArtPhoto | [71] | 806 | artistic | – | Mikels | keyword | dominant
GAPED | [26] | 730 | natural | 60 | Sentiment, VA | annotation | dominant, average
MART | [3] | 500 | abstract | 25 (11f, 14m) | Sentiment | annotation | dominant
devArt | [3] | 500 | abstract | 60 (27f, 33m) | Sentiment | annotation | dominant
Tweet | [18] | 603 | social | 9 | Sentiment | annotation | dominant
FlickrCC | [18] | ≈500k | social | – | Plutchik | keyword | dominant
Flickr | [144] | ≈300k | social | – | – | keyword | dominant
FI | [149] | 23,308 | social | 225 | Mikels | annotation | dominant
Emotion6 | [95] | 1,980 | social | 15 | Ekman | annotation | distribution
IESN | [168] | ≈1M | social | – | – | keyword | personalized, distribution
FlickrLDL | [141] | – | social | 11 | – | annotation | distribution
TwitterLDL | [141] | 10,045 | social | 8 | – | annotation | distribution
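As an illustration of the lexicon-based VAD averaging idea used above to derive labels from textual metadata, the sketch below averages the VAD scores of matched words. The tiny lexicon and its scores are invented for the example only; real pipelines would rely on published affective norms such as those in [134].

```python
# Hypothetical sketch of lexicon-based VAD averaging over image metadata text.
# The lexicon below is made up for illustration; values are on a 1-9 scale.

import re

VAD_LEXICON = {               # word: (valence, arousal, dominance)
    "happy":  (8.2, 6.5, 7.0),
    "sunset": (7.2, 4.9, 5.6),
    "storm":  (3.5, 6.8, 4.2),
    "lonely": (2.2, 4.5, 3.4),
}

def text_to_vad(text):
    """Return the mean VAD of lexicon words found in `text`, or None if none match."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [VAD_LEXICON[w] for w in words if w in VAD_LEXICON]
    if not hits:
        return None
    return tuple(sum(dim) / len(hits) for dim in zip(*hits))

print(text_to_vad("What a happy sunset over the bay!"))  # roughly (7.7, 5.7, 6.3)
```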
A notable benchmark for music emotion recognition is the audio music mood classification (AMC) task, organized by the annual Music Information Retrieval Evaluation eXchange (MIREX) [32]. In the MIREX mood classification task, initially 600 songs were shared with the participants. Starting from 2013, 1,438 thirty-second excerpts from Korean pop songs have been added to MIREX. The MIREX benchmark aims to automatically classify songs into five emotion clusters derived from cluster analysis of online tags. The emotional representation used in the MIREX mood challenge has been debated in the literature due to its data-driven origin rather than a grounding in the psychology of emotion. For example, in [65], semantic and acoustic overlaps have been found between clusters. The MIREX mood challenge considers only one label for the whole song and disregards the dynamic, time-evolving nature of music.

Computer Audition Lab 500 (CAL500) is a dataset of 500 popular songs labeled with multiple tags, including emotions [128]. The dataset was labeled in the lab by 66 annotators. The Soundtracks dataset [33] for music and emotion was developed by Eerola and Vuoskoski and contains instrumental music from the soundtracks of 60 movies. The expert annotators selected songs based on five basic discrete categories (anger, fear, sadness, happiness, and tenderness) and the dimensional VA representation of emotions. Although not developed with music content analysis in mind, the Database for Emotion Analysis using Physiological Signals (DEAP) [61] also includes valence, arousal and dominance ratings for 120 one-minute music video clips of western pop music. Each video clip is annotated by 14–16 participants who were asked to report their felt valence, arousal, and dominance on a 9-point scale. AMG1608 [23] is another music dataset; it contains arousal and valence ratings for 1,608 Western songs in different genres and is annotated through MTurk.

Music datasets with emotion labels usually consider one emotion label per song (static). The MoodSwings dataset [121] was the first to annotate music dynamically over time. MoodSwings was developed by Schmidt et al. and includes 240 15-second excerpts of western pop songs with per-second valence and arousal labels, collected on MTurk. The MediaEval "Emotion in Music" challenge was organized from 2013 to 2015 within the MediaEval Multimedia Evaluation initiative. MediaEval is a community-driven benchmarking campaign dedicated to evaluating algorithms for social and human-centered multimedia access and retrieval [63]. Unlike MIREX, the "Emotion in Music" task focused on dynamic emotion recognition in music, tracking arousal and valence over time [6, 115]. The data from the MediaEval tasks were compiled into the MediaEval Database for Emotional Analysis in Music (DEAM), which is the largest available dataset with dynamic annotations, at 2Hz, with valence and arousal annotations for 1,802 songs and song excerpts licensed under a Creative Commons license. PMEmo is a dataset of 794 songs with dynamic and static arousal and valence annotations in addition to electrodermal responses from ten participants [155].

These datasets are summarized in Table 3. For a more detailed review of available music datasets with emotional labels, we refer the readers to [89].

Table 3. Released and freely available datasets for music emotion recognition, where '#Songs' is the number of songs or excerpts, '#Annotators' is the number of annotators, and 'Labeling' and 'Labels' are as defined in Table 2.

Dataset | Ref | #Songs | Type | #Annotators | Emotion model | Labeling | Labels
MIREX mood | [32] | 2,038 | western and kpop | 2–3 | Clusters | annotation | dominant, distribution
CAL500 | [128] | 500 | western | >2 | – | annotation | dominant
Soundtracks | [33] | 110 | instrumental | 110 | self-defined, VA | annotation | distribution
MoodSwings | [121] | 240 | western | unknown | VA | annotation | distribution
DEAP | [61] | 120 | western | 14–16 | VAD | annotation | average
AMG1608 | [23] | 1,608 | western | 15 | VA | annotation | distribution
DEAM | [6] | 1,802 | diverse | 5–10 | VA | annotation | average, distribution
PMEmo | [155] | 794 | western | 10 | VA | annotation | distribution

The target of video affective content computing is to recognize the emotions evoked by videos. In this field, it is necessary to construct large benchmark datasets with precise emotional tags. However, the majority of existing research evaluates the proposed methods on self-collected datasets. The scarce video resources in those self-collected datasets, combined with copyright restrictions, result in limited accessibility for other researchers to reproduce existing work. Therefore, it is beneficial to summarize some publicly available datasets in this field. In general, publicly available datasets can be classified into two types: datasets consisting of videos only, such as movie clips or user-generated videos, and datasets including both videos and audience information.
The LIRIS-ACCEDE dataset [30] is one of the largest datasets in this area. Because it is collected under Creative Commons licenses, there are no copyright issues. The LIRIS-ACCEDE dataset is a living database in development: to fulfill the requirements of different tasks, new data, features and tags are added over time. The LIRIS-ACCEDE dataset includes the Discrete LIRIS-ACCEDE collection and the Continuous LIRIS-ACCEDE collection, released in 2015, and was used for the MediaEval Emotional Impact of Movies tasks from 2015 to 2018.

The Discrete LIRIS-ACCEDE collection [13] includes 9,800 clips, derived from 40 feature films and 120 short films. Specifically, the majority of the 160 films are collected from the video platform VODO. The duration of all 160 films is about 73.6 hours in total. All of the 9,800 video clips last about 27 hours in total, and the duration of each clip is between 8 and 12 seconds, which is long enough for viewers to feel emotions. In this collection, all of the 9,800 video clips are labeled with values of valence and arousal.

The Continuous LIRIS-ACCEDE collection [12] differs from the Discrete LIRIS-ACCEDE collection in annotation type. Roughly speaking, the annotations for movie clips in the Discrete LIRIS-ACCEDE collection are global: a whole 8- to 12-second video clip is represented by a single value of valence and arousal. This annotation type limits the possibility of tracking emotions over time. To address this issue, 30 longer films were selected from the 160 films mentioned above. The total duration of all the selected films is about 7.4 hours. In this collection, there are emotional annotations of valence and arousal for each second of the films.

The MediaEval Affective Impact of Movies collections between 2015 and 2018 were used for the MediaEval Affective Impact of Movies tasks in each corresponding year. Specifically, the MediaEval 2015 Affective Impact of Movies task [112] includes two sub-tasks: affect detection and violence detection. The Discrete LIRIS-ACCEDE collection was used as the development set, and 1,100 additional video clips extracted from 39 new movies were included. All of the newly collected data were shared under Creative Commons licenses. In addition, three values were used to label the 10,900 video clips: a binary signal representing the presence of violence, a class tag of the excerpt for felt arousal, and an annotation for felt valence.

The MediaEval 2016 Affective Impact of Movies task [27] also includes two sub-tasks: global emotion prediction and continuous emotion prediction. The Discrete LIRIS-ACCEDE collection and the Continuous LIRIS-ACCEDE collection were used as the development sets for the first and second sub-tasks, respectively. In addition, 49 new movies were chosen for the test sets: 1,200 short video clips from the new movies were extracted for the first task, and 10 long movies were selected for the second task. For the first sub-task, the tags include scores of valence and arousal for each whole movie clip. For the second sub-task, scores of valence and arousal for each second of the movies are evaluated.

The MediaEval 2017 Affective Impact of Movies task [28] focuses on long movies for two sub-tasks: valence/arousal prediction and fear prediction. The Continuous LIRIS-ACCEDE collection was selected as the development set, and an additional 14 new movies were collected as the test set. The annotations contain a valence value and an arousal value; in addition, there is a binary value for each 10-second segment representing whether the segment is supposed to induce fear or not.

The MediaEval 2018 Affective Impact of Movies task [29] is also dedicated to valence/arousal prediction and fear prediction. The Continuous LIRIS-ACCEDE collection and the test set of the MediaEval 2017 Emotional Impact of Movies task were used as the development set. In addition, 12 other movies selected from the set of the 160 movies mentioned in the Discrete LIRIS-ACCEDE part were used as the test set. Specifically, for the first sub-task, there are annotations containing valence and arousal values for each second of the movies; for the second sub-task, the beginning and ending times of each sequence that induces fear are recorded.

The VideoEmotion dataset [55] is a well-designed user-generated video collection. It contains 1,101 videos downloaded from web platforms such as YouTube and Flickr. The annotations of the videos in this dataset are based on Plutchik's wheel of emotions [97].

Both the YF-E6 dataset and the VideoStory-P14 dataset are introduced in [136]. To collect the YF-E6 emotion dataset, the six basic emotion types were used as keywords to search videos on YouTube and Flickr, yielding 3,000 videos in total. Ten annotators then performed the labeling task: only when the tags for a video clip were more than 50 percent consistent was the clip added to the dataset. Finally, the dataset includes 1,637 videos labeled with the six basic emotion types. The VideoStory-P14 dataset is based on the VideoStory dataset. Similar to the VideoEmotion dataset, the keywords in Plutchik's Wheel of Emotions were used in the search process during the construction of the VideoStory dataset. Finally, there are 626 videos in the VideoStory-P14 dataset, each with a unique emotion tag.
Table 4. Released and freely available datasets for video emotion recognition, where '#Videos' is the number of videos or films, 'Hours' is the total duration in hours, '#Annotators' is the number of annotators, and 'Labeling' and 'Labels' are as defined in Table 2.

Dataset | Ref | #Videos | Hours | Type | #Annotators | Emotion model | Labeling | Labels
Discrete LIRIS-ACCEDE | [13] | 9,800 | 26.9 | film | – | VA | annotation | dominant
Continuous LIRIS-ACCEDE | [12] | 30 | 7.4 | film | 10 (7f, 3m) | VA | annotation | average
MediaEval 2015 | [112] | 1,100 | – | film | – | 3 discrete VA values | annotation | dominant
MediaEval 2016 | [27] | 1,210 | – | film | – | VA | annotation | distribution, average
MediaEval 2017 | [28] | 14 | 8 | film | – | VA, fear | annotation | average
MediaEval 2018 | [29] | 12 | 9 | film | – | VA, fear | annotation | average
VideoEmotion | [55] | 1,101 | 32.7 | user-generated | 10 (5f, 5m) | Plutchik | annotation | dominant
YF-E6 | [136] | 1,637 | 50.9 | user-generated | 10 (5f, 5m) | Ekman | annotation | dominant
VideoStory-P14 | [136] | 626 | – | user-generated | – | Plutchik | keyword | dominant
DEAP | [61] | 120 | 2 | music video | – | VAD | annotation | personalized
MAHNOB-HCI | [119] | – | – | multiple types | – | VAD, Ekman+neutral | annotation | personalized
DECAF | [1] | 76 | – | music video/movies | – | VAD | annotation | personalized
AMIGOS | [24] | 20 | – | movies collection | – | VAD, Ekman | annotation | personalized
ASCERTAIN | [123] | 36 | – | movies collection | 58 (21f, 37m) | VA | annotation | personalized
The DEAP dataset [61] includes the EEG and peripheral physiological signals collected from 32 participants while watching 40 one-minute long excerpts of music videos. In addition, frontal face videos were recorded for 22 of the 32 participants. Participants rated each video according to the level of like/dislike, familiarity, arousal, valence, and dominance. Though the DEAP dataset is publicly available, it should be noted that it does not include the actual videos because of licensing issues; links to the videos are provided instead.

MAHNOB-HCI [119] is a multimodal dataset including multiple classes of information recorded in response to video affective stimuli. In particular, speech, face videos, and eye gaze are recorded. In addition, two experiments were conducted to record both peripheral and central nervous system physiological signals from 27 subjects. In the first experiment, subjects were asked to report their emotional responses to 20 affect-inducing videos, including the level of arousal, valence, dominance, and predictability, as well as emotion categories. In the second experiment, the participants evaluated whether they agreed with the displayed labels after watching short videos and images. The dataset is available for academic use through a web interface.

The DECAF dataset [1] consists of infrared facial video, Electrocardiogram (ECG), Magnetoencephalogram (MEG), horizontal Electrooculogram (hEOG) and trapezius Electromyogram (tEMG) signals, recorded from 30 participants watching 36 movie clips and 40 one-minute music videos, the latter derived from the DEAP dataset [61]. The subjective feedback is based on the valence, arousal, and dominance space. In addition, time-continuous emotion annotations for the movie clips are also included in the dataset.

The AMIGOS dataset [24] includes multiple classes of affective data: individual and group viewers' responses to both short and long videos. EEG, ECG, GSR, and frontal and full-body video were recorded in two experimental settings, i.e., 40 participants watching 16 short emotional clips, and participants, alone or in groups, watching longer videos.
In addition to audiovisual content and viewers' reactions, other modalities, such as language, also contain significant information for affective understanding of multimedia content.

Visual sentiment is the sentiment associated with the concepts depicted in images. Two datasets were developed by mining images associated with adjective-noun pair (ANP) representations that have affective significance [17]. ANPs in [17] were generated by first using seed terms from Plutchik's Wheel of Emotion [97] to query Flickr and YouTube. After mining the tags associated with visual content on YouTube and Flickr, adjective and noun candidates were identified through part-of-speech tagging. Adjectives and nouns were then paired to create ANP candidates, which were filtered by sentiment strength, named entities, and popularity. The Visual Sentiment Ontology (VSO) [17] (https://visual-sentiment-ontology.appspot.com) is the result of this process. SentiBank led to the creation of a photo-tweet sentiment dataset, with both visual and textual data with polarity labels, collected on Amazon Mechanical Turk. This work was later extended to form a multilingual ANP set and its dataset (MVSO, http://mvso.cs.columbia.edu) [25, 57], containing 15,630 ANPs from 12 major languages and 7.37M images [58].

The My Reaction When (MRW) dataset contains 50,107 video-sentence pairs crawled from social media, depicting physical or emotional reactions to the situations described in the sentences [120]. The GIFs are sourced from Giphy (https://giphy.com/). Even though there are no emotional labels, the language and visual associations are mainly based on sentiment, which makes this dataset an interesting resource for affective content analysis.

CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) is a collection of multiple datasets for multimodal sentiment analysis and emotion recognition (https://github.com/A2Zadeh/CMU-MultimodalSDK). This collection includes more than 23,500 sentence utterance videos from more than 1,000 people from YouTube [152]. All the videos are transcribed and aligned with the audiovisual modalities. A multimodal multi-party dataset for emotion recognition in conversation (MELD) was primarily developed for emotion recognition in multiparty interactions [98]. MELD contains visual, audio, and textual modalities and includes 13,000 utterances from 1,433 dialogues from the TV series Friends, with each utterance labeled with emotion and sentiment.

COGNIMUSE is a collection of videos annotated with sensory and semantic saliency, events, cross-media semantics, and emotions [178]. A subset of 3.5 hours extracted from movies, including the textual modality, is annotated with arousal and valence. Li et al. collected a dataset of 360-degree virtual reality videos that can elicit different emotions [68]. Even though the dataset consists of only 73 short videos, on average 183 seconds long, it is one of the first datasets of its kind, although its content understanding stays limited. These multimodal datasets are summarized in Table 5.

Table 5. Released and freely available datasets for multimodal emotion recognition. Disc. for MELD corresponds to the six Ekman emotions in addition to neutral. 'Labeling' represents the method used to obtain labels, such as human annotation (annotation), self-reported felt emotion (self-report) and keyword searching (keyword); 'Labels' means the detailed labels in the dataset, such as dominant emotion category (dominant), average dimension values (average), personalized emotion (personalized), and emotion distribution (distribution).
Dataset | Ref | #Samples | Modalities | Source | Emotion model | Labeling | Labels
SentiBank tweet | [17] | 603 | images, text | images | Sentiment | annotation | dominant
MVSO | [57] | 7.36M | image, metadata | photos | Sentiment | automatic | average
CMU-MOSEI | [152] | 23,500 | video, audio, text | YouTube videos | Sentiment | annotation | average
MELD | [98] | 13,000 | video, audio, text | TV series | Sentiment, Disc. | annotation | dominant
COGNIMUSE | [178] | 3.5h | video, audio, text | movies | VA | annotation, self-report | dominant
VR | [68] | 73 | video, audio | VR videos | VA | self-report | average
In the early stages, AC researchers mainly worked on designing handcrafted features to bridge the affective gap. Recently, with the advent of deep learning, especially convolutional neural networks (CNNs), methods have shifted to end-to-end deep representation learning. Motivated by the fact that the perception of image emotions may depend on different types of features [171], some methods employ fusion strategies to jointly consider multiple features. In this section, we summarize and compare these methods. Please note that here we classify CNN features directly extracted from pre-trained deep models into the handcrafted features category.
Low-level Features are difficult for viewers to interpret. These features are often directly borrowed from other computer vision tasks. Some widely extracted features include GIST, HOG2x2, self-similarity, and geometric context color histogram features, as in [94], because of their individual power and distinct description of visual phenomena from a scene perspective.

Compared with the above generic features, some specific features derived from art theory and psychology have been designed. For example, Machajdik and Hanbury [71] extracted elements-of-art features, including color and texture. The MPEG-7 visual descriptors are employed in [66], which include four color-related descriptors and two texture-related descriptors. How shape features in natural images influence emotions is investigated in [70] by modeling the concepts of roundness-angularity and simplicity-complexity. Sartori et al. [103] designed two kinds of visual features to represent different color combinations based on Itten's color wheel.
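As a concrete illustration of such low-level color statistics, in the spirit of the elements-of-art features of [71] (whose exact feature definitions differ), the following sketch computes mean saturation, mean brightness, and a coarse hue histogram for an image; the bin count and file name are illustrative assumptions.

```python
# Illustrative sketch of elements-of-art style low-level color statistics.

import numpy as np
from PIL import Image

def color_statistics(path):
    """Mean saturation, mean brightness and a coarse 8-bin hue histogram."""
    hsv = Image.open(path).convert("RGB").convert("HSV")
    hsv = np.asarray(hsv, dtype=np.float32) / 255.0
    hue, sat, val = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    hue_hist, _ = np.histogram(hue, bins=8, range=(0.0, 1.0), density=True)
    return {
        "mean_saturation": float(sat.mean()),
        "mean_brightness": float(val.mean()),
        "hue_histogram": hue_hist.tolist(),
    }

# features = color_statistics("example.jpg")
```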
Mid-level Features contain more semantics, are more easily interpreted by viewers than low-level features, and are thus more relevant to emotions. Patterson and Hays [94] proposed to detect 102 attributes in 5 different categories, including materials, surface properties, functions or affordances, spatial envelope attributes, and object presence. Besides these attributes, eigenfaces that may contribute to facial images are also incorporated in [150]. More recently, in [100], SIFT features are first extracted as basic features, which are fed into a bag-of-visual-words (BoVW) representation over multi-scale blocks. Another mid-level representation is the latent topic distribution estimated by probabilistic latent semantic analysis.

Harmonious composition is essential in an artwork. Several compositional features, such as low depth of field, have been designed to analyze such characteristics of an image [71]. Based on the fact that figure-ground relationships, color patterns, shapes, and their diverse combinations are often jointly employed by artists to express emotions in their artworks, Wang et al. [133] proposed to extract interpretable aesthetic features. Inspired by principles-of-art, Zhao et al. [162] designed corresponding mid-level features, including balance, emphasis, harmony, variety, gradation, and movement. For example, Itten's color contrasts and the rate of focused attention are employed to measure emphasis.
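The BoVW construction mentioned above can be sketched as follows: local SIFT descriptors are quantized against a learned codebook and pooled into a normalized histogram per image. The codebook size and the use of OpenCV SIFT with scikit-learn k-means are illustrative choices, not the exact pipeline of [100].

```python
# Rough sketch of a SIFT-based bag-of-visual-words (BoVW) representation.

import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def build_codebook(image_paths, k=256):
    """Cluster descriptors from a set of training images into k visual words."""
    all_desc = np.vstack([sift_descriptors(p) for p in image_paths])
    return KMeans(n_clusters=k, n_init=10).fit(all_desc)

def bovw_histogram(path, codebook):
    """Quantize an image's descriptors and pool them into a normalized histogram."""
    desc = sift_descriptors(path)
    if len(desc) == 0:
        return np.zeros(codebook.n_clusters, np.float32)
    words = codebook.predict(desc)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float32)
    return hist / max(hist.sum(), 1.0)
```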
High-level Features that represent the semantic content of images can be easily understood by viewers, and the conveyed emotions can often be recognized directly from these semantics. In the early years, simple semantic content, such as faces and skin, was extracted from images in [71]. For images that contain faces, facial expressions may directly determine the emotions. Yang et al. [142] extracted 8 kinds of facial expressions as high-level features. They built compositional features of local Haar appearances by a minimum-error-based optimization strategy, which are embedded into an improved AdaBoost algorithm. For images with no detected faces, the expressions are simply set as neutral. Finally, they generated an 8-dimensional vector with each element representing the number of corresponding facial expressions.

More recently, semantic concepts have been described by adjective noun pairs (ANPs) [18, 21], which are detected by SentiBank [18] or DeepSentiBank [21]. The advantage of ANPs is that they turn a neutral noun into an ANP with strong emotions and make the concepts more detectable, compared to nouns and adjectives individually. A 1,200-dimensional vector representing the probabilities of the ANPs can form a feature vector.

Table 6 summarizes the above-mentioned hand-crafted features at different levels for AC of images. Some recent methods also extract CNN features from pre-trained deep models, such as AlexNet [157, 158] and VGGNet [141].

Table 6. Summary of the hand-crafted features at different levels for AC of images, where 'Dim' is the feature dimensionality.

Feature | Ref | Level | Short description | Dim
LOW_C | [94] | low | GIST, HOG2x2, self-similarity and geometric context color histogram features | 17,032
Elements | [71] | low | color: mean saturation, brightness and hue, emotional coordinates, colorfulness, color names, Itten contrast, Wang's semantic descriptions of colors, area statistics; texture: Tamura, Wavelet and gray-level co-occurrence matrix | 97
MPEG-7 | [66] | low | color: layout, structure, scalable color, dominant color; texture: edge histogram, texture browsing | –
To map the extracted handcrafted features to emotions, Machine Learning Methods are commonly employed. Some typical learning models include Naive Bayes (NB) [71, 133], support vector machines (SVM) [70, 150, 162], K nearest neighbors (KNN) [66], sparse learning (SL) [67, 103], logistic regression (LR) [150], multiple instance learning (MIL) [100], and matrix completion (MC) [3] for emotion classification; support vector regression (SVR) [70, 162] for emotion regression; and multi-graph learning (MGL) [171] for emotion retrieval.

Instead of assigning the DEC to an image, some recent methods have begun to focus on the perception subjectivity challenge, i.e., predicting personalized emotions for each viewer or learning emotion distributions for each image.
The personalized emotion perception of a specified user after viewing an image is predicted in [166, 168], in association with online social networks. The authors considered different types of factors that may contribute to emotion recognition, including the images' visual content, the social context related to the corresponding users, the emotions' temporal evolution, and the images' location information. To jointly model these factors, they proposed rolling multi-task hypergraph learning (RMTHG), which can also easily handle the data incompleteness issue.

Generally, the distribution learning task can be formulated as a regression problem, which differs slightly for different distribution categories (i.e., discrete or continuous). For example, if emotion is represented by CE, the regression problem targets predicting the discrete probability of each emotion category, with the probabilities summing to 1; if we represent emotion based on DES, the regression problem is typically transformed into the prediction of the parameters of specified continuous probability distributions. For the latter, we usually need to first determine the form of the continuous distributions, such as an exponential distribution or a Gaussian distribution. Some representative learning methods for emotion distribution learning of discrete emotions include shared sparse learning (SSL) [169], weighted multimodal SSL (WMMSSL) [157, 159], augmented conditional probability neural network (ACPNN) [141], and weighted multi-model CPNN (WMMCPNN) [158].

Table 7. Representative work on AC of images using hand-crafted features, where 'Fusion' indicates the fusion strategy of different features; 'cla, reg, ret, cla_p, dis_d, dis_c' in the Task column are short for classification, regression, retrieval, personalized classification, discrete distribution learning, and continuous distribution learning (the same below), respectively; 'Result' is the reported best accuracy for classification, mean squared error for regression, discounted cumulative gain for retrieval, F1 for personalized classification, and KL divergence for distribution learning (the [169] entry reports the sum of squared differences) on the corresponding datasets.

Ref | Feature | Fusion | Learning | Dataset | Task | Result
[71] | Elements, Composition, FS | early | NB | IAPSa, Abstract, ArtPhoto | cla | 0.471, 0.357, 0.495
[66] | MPEG-7 | – | KNN | unreleased | cla | 0.827
[70] | Shape, Elements | early | SVM, SVR | IAPSa; IAPS | cla; reg | 0.314; V-1.350, A-0.912
[67] | Segmented objects | – | SL | IAPS, ArtPhoto | cla | 0.612, 0.610
[150] | Sentributes | – | SVM, LR | Tweet | cla | 0.824
[133] | Aesthetics | – | NB | Abstract, ArtPhoto | cla | 0.726, 0.631
[162] | Principles | – | SVM, SVR | IAPSa, Abstract, ArtPhoto; IAPS | cla; reg | 0.635, 0.605, 0.669; V-1.270, A-0.820
[171] | LOW_C, Elements, Attributes, Principles, ANP, Expressions | graph | MGL | IAPSa, Abstract, ArtPhoto, GAPED, Tweet | ret | 0.773, 0.735, 0.658, 0.811, 0.701
[103] | IttenColor | – | SL | MART, devArt | cla | 0.751, 0.745
[100] | BoVW | – | MIL | IAPSa, Abstract, ArtPhoto | cla | 0.699, 0.636, 0.707
[3] | IttenColor | – | MC | MART, devArt | cla | 0.728, 0.761
[168] | GIST, Elements, Attributes, Principles, ANP, Expressions | graph | RMTHG | IESN | cla_p | 0.582
[169] | GIST, Elements, Principles | – | SSL | Abstract | dis_d | 0.134
[157] | GIST, Elements, Attributes, Principles, ANP, deep features from AlexNet | weighted | WMMSSL | Abstract, Emotion6, IESN | dis_d | 0.482, 0.479, 0.478
[141] | ANP, VGG16 | – | ACPNN | Abstract, Emotion6, FlickrLDL, TwitterLDL | dis_d | 0.480, 0.506, 0.469, 0.555
[158] | GIST, Elements, Attributes, Principles, ANP, AlexNet | weighted | WMMCPNN | Abstract, Emotion6, IESN | dis_d | 0.461, 0.464, 0.470
[167] | GIST, Elements, Attributes, Principles, ANP, AlexNet | – | MTSSR | IESN | dis_c | 0.436
Both SSL and WMMSSL can only model one test image at a time, which is computationally inefficient. Once their parameters are learned, ACPNN and WMMCPNN can easily predict the emotion distribution of a test image. Based on the assumption that the VA emotion labels can be well modeled by a mixture of two bidimensional Gaussians (a Gaussian mixture model, GMM), Zhao et al. [167] proposed to learn continuous emotion distributions in VA space by multi-task shared sparse regression (MTSSR). Specifically, the parameters of the GMM are regressed, including the mean vector and covariance matrix of the two Gaussian components as well as the mixing coefficients.

Table 7 summarizes some representative work based on hand-crafted features. Generally, high-level features (such as ANP) can achieve better recognition performance for images with rich semantics, mid-level features (such as Principles) are more effective for artistic photos, while low-level features (such as Elements) perform better for abstract paintings.
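The following small sketch illustrates the modeling assumption behind [167]: per-image VA ratings from many viewers are summarized by a two-component bivariate Gaussian mixture whose weights, means, and covariances then serve as regression targets. The ratings here are synthetic and the use of scikit-learn is an illustrative choice, not the actual MTSSR implementation.

```python
# Sketch: summarize per-image VA ratings with a 2-component bivariate GMM
# whose parameters become the regression targets for continuous distribution learning.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic VA ratings for one image: most viewers feel positive/calm,
# a minority feel negative/aroused (values on a 1-9 scale).
ratings = np.vstack([
    rng.normal([6.5, 3.5], 0.6, size=(40, 2)),
    rng.normal([3.0, 6.0], 0.8, size=(15, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="full").fit(ratings)
targets = np.concatenate([
    gmm.weights_,                 # 2 mixing coefficients
    gmm.means_.ravel(),           # two 2-d mean vectors
    gmm.covariances_.ravel(),     # two 2x2 covariance matrices
])
print(targets.shape)              # (14,) -> regression target for this image
```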
To deal with the situation where images are weakly labeled, a potentially cleaner subset of the training instances is selected progressively in [148]. First, an initial CNN model is trained on the training data. Second, the training samples with distinct sentiment scores between the two classes, i.e., those predicted with high probability by the trained model on the training data itself, are selected. In [149], the AlexNet pre-trained on ImageNet is fine-tuned to classify emotions into 8 categories by changing the output dimension of the last layer of the CNN from 1,000 to 8. Besides using the fully connected layer as a classifier, they also trained an SVM classifier on features extracted from the penultimate layer of the pre-trained AlexNet model.

Multi-level deep representations (MldrNet) are learned in [101] for image emotion classification. The input image is segmented into 3 levels of patches, which are fed into 3 different CNN models: AlexNet, an aesthetics CNN (ACNN), and a texture CNN (TCNN). The fused features are fed into multiple instance learning (MIL) to obtain the emotion labels. Zhu et al. [175] proposed to integrate the different levels of features from MldrNet with a bidirectional GRU model (BiGRU) to exploit their dependencies. They generated two features from the BiGRU model and concatenated them as the final feature representation. To enforce that the feature vectors extracted from a pair of images of the same category are close, and those from different categories are far apart, they jointly optimized a contrastive loss together with the traditional cross-entropy loss.

More recently, Yang et al. [138] employed deep metric learning to explore the correlation of emotional labels with the same polarity, and proposed a multi-task deep framework to jointly optimize retrieval and classification. By considering the relations among emotion categories in Mikels' wheel, they jointly optimized a novel sentiment constraint with the cross-entropy loss. Extending triplet constraints to a hierarchical structure, the sentiment constraint employs a sentiment vector based on the texture information from the convolutional layer to measure the difference between affective images. In [105, 139], Yang et al. proposed a weakly supervised coupled convolutional neural network to exploit the discriminability of localized regions for emotion classification. Based on image-level labels, a sentiment map is first detected in one branch with a cross-spatial pooling strategy. Then the holistic and localized information is jointly combined in the other branch to conduct the classification task. The detected sentiment map can easily explain which regions of an image determine the emotions.

The above deep methods mainly focus on dominant emotion prediction. There is also some work on emotion distribution learning based on deep models. The earliest is a mixed bag of emotions [95], which trains a deep CNN regressor (CNNR) for each emotion category in Emotion6 based on the AlexNet architecture. The number of output nodes is changed to 1 to predict a real value for each emotion category, and the Softmax loss is replaced with a Euclidean loss. To ensure that the different probabilities sum to 1, the predicted probabilities of all emotion categories are normalized.
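A minimal PyTorch sketch of the fine-tuning strategy described above for [149] follows: an ImageNet-pre-trained AlexNet has its 1,000-way output layer replaced by an 8-way layer for the Mikels categories. The optimizer and learning rate are illustrative assumptions, not the settings of the cited work.

```python
# Sketch: replace AlexNet's 1000-way ImageNet head with an 8-way emotion head.

import torch
import torch.nn as nn
from torchvision import models

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 8)  # 1000 -> 8

# Fine-tune end-to-end with a standard cross-entropy objective (illustrative settings).
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```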
However, CNNR has some limitations. First, the predicted probabilities are not guaranteed to be non-negative. Second, the probability correlations among different emotions are ignored, since the regressor for each emotion category is trained independently. In [140], Yang et al. designed a multi-task deep framework based on VGG16 that jointly optimizes a cross-entropy loss for emotion classification and a Kullback-Leibler (KL) divergence loss for emotion distribution learning. To adapt single-label emotion datasets to the emotion distribution learning setting, they transformed each single label into an emotion distribution using emotion distances computed on Mikels' wheel [166, 168]. By extending the size of the training set in this way, this method achieves state-of-the-art performance for discrete emotion distribution learning.

The representative deep learning based methods are summarized in Table 8. The deep representation features generally perform better than the hand-crafted ones, which are intuitively designed for specific domains based on several small-scale datasets. However, how the deep features correlate with specific emotions remains unclear.

Table 8. Representative work on deep learning based AC of images, where 'Pre' indicates whether the network is pre-trained using ImageNet, 'Dim' is the dimensionality of the learned features, 'Classifier' is the classifier used on top of the features (if any), and 'Loss' is the auxiliary loss besides cross-entropy (if any).

Ref | Base net | Pre | Dim | Classifier | Loss | Dataset | Task | Result
[148] | self-defined | no | 24 | – | – | FlickrCC | cla | 0.781
[149] | AlexNet | yes | 4,096 | SVM | – | FI, IAPSa, Abstract, ArtPhoto | cla | 0.583, 0.872, 0.776, 0.737
[101] | AlexNet, ACNN, TCNN | yes | 4,096, 256, 4,096 | MIL | – | FI, IAPSa, Abstract, ArtPhoto, MART | cla | –
[175] | self-defined | no | 512 | – | contrastive | FI, IAPSa, ArtPhoto | cla | 0.730, 0.902, 0.855
[138] | GoogleNet-Inception | yes | 1,024 | – | sentiment | FI, IAPSa, Abstract, ArtPhoto | cla; ret | 0.676, 0.442, 0.382, 0.400; 0.780, 0.819, 0.788, 0.704
[105, 139] | ResNet-101 | yes | 2,048 | – | – | FI, Tweet | cla | 0.701, 0.814
[95] | AlexNet | yes | 4,096 | – | Euclidean | Emotion6 | dis_d | 0.480
[140] | VGG16 | yes | 4,096 | – | KL | Emotion6, FlickrLDL, TwitterLDL | dis_d | 0.420, 0.530, 0.530
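To make the KL-divergence-based distribution learning of [140] concrete, the hedged sketch below trains a VGG16 backbone to output an 8-way emotion distribution against annotator label distributions; the actual multi-task network, loss weighting, and training details in [140] differ, and the batch here is dummy data.

```python
# Sketch: discrete emotion distribution learning with a KL-divergence loss.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
backbone.classifier[6] = nn.Linear(backbone.classifier[6].in_features, 8)

def distribution_loss(logits, target_dist):
    """KL(target || predicted) for batches of 8-way emotion distributions."""
    log_pred = F.log_softmax(logits, dim=1)
    return F.kl_div(log_pred, target_dist, reduction="batchmean")

images = torch.randn(4, 3, 224, 224)          # dummy image batch
target = torch.full((4, 8), 1.0 / 8)           # dummy annotator distributions
loss = distribution_loss(backbone(images), target)
```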
Music emotion recognition (MER) strives to identify the emotion expressed by music and subsequently predict the listener's felt emotion from acoustic content and music metadata, e.g., lyrics, genre, etc.
Emotional understanding of music has applications in music recommendation and is particularly useful for production music retrieval. An analysis of search queries from creative professionals showed that 80% contain emotional terms, demonstrating the prominence of emotion in that field [52]. A growing body of work has tried to address emotional understanding of music from acoustic content and metadata (see [60, 145] for earlier reviews on this topic).

Earlier work on emotion recognition from music relied on extracting acoustic features similar to the ones used in speech analysis, such as audio energy and formants. Acoustic features describe attributes related to musical dimensions. Musical dimensions include melody, harmony, rhythm, dynamics, timbre (tone color), expressive techniques, musical texture, and musical form [89], as shown in Table 9. Some also add energy as a musical dimension, which is important for MER [145]. Melody is a linear succession of tones and can be captured by features representing key, pitch and tonality. Among others, chroma features are often used to represent melody [145]. Harmony is how the combination of various pitches is processed during hearing.
Understanding harmony involves chords or multiple notes played together. Examples of acoustic features capturing harmony include the chromagram, key, mode, and chords [4]. Rhythm consists of repeated patterns of musical sounds, i.e., notes and pulses, that can be described in terms of tempo and meter. Higher-tempo songs often induce higher arousal; fluent rhythm is associated with higher valence, and firm rhythm is associated with sad songs [145]. Mid-level acoustic features, such as onset rate, tempo and beat histograms, can represent the rhythmic characteristics of music. Dynamics of music involve the variation in softness or loudness of notes, which includes changes of loudness (contrast) and emphasis on individual sounds (accent) [89]. Dynamics of music can be captured by changes in acoustic features related to energy, such as root mean square (RMS) energy. Timbre is the perceived sound quality of musical notes; it is what differentiates different voices and instruments playing the same sound. Acoustic features capturing timbre, such as MFCC and spectral shape, describe sound quality [143]. Acoustic features describing timbre include MFCC, spectral features (centroid, contrast, flatness), and zero crossing rate [4]. Expressive techniques are the way a musical piece is played, including tempo and articulation [89]. Acoustic features, such as tempo, attack slope, and attack time, can be used to describe this dimension. Musical texture is how rhythmic, melodic, and harmonic features are combined in music production [89]. It is related to the range of tones played at the same time. Musical form describes how a song is structured, such as introduction, verse and chorus [89]. Energy, whose dynamics are described by music dynamics features, is strongly associated with arousal perception.

Table 9. Musical dimensions and acoustic features describing them.

Musical dimension | Acoustic features
Melody | pitch
Harmony | chromagram, chromagram peak, key, mode, key clarity, harmonic change, chords
Rhythm | tempo, beat histograms, rhythm regularity, rhythm strength, onset rate
Dynamics and loudness | RMS energy, loudness, timbral width
Timbre | MFCC, spectral shapes (centroid, shape, spread, skewness, kurtosis, contrast and flatness), brightness, rolloff frequency, zero crossing rate, spectral contrast, auditory modulation features, inharmonicity, roughness, dissonance, odd-to-even harmonic ratio [4]
Musical form | similarity matrix (similarity between all possible frames) [89]
Texture | attack slope, attack time

There are a number of toolboxes available for extracting acoustic features from music that can be used for music emotion recognition. Music Analysis, Retrieval and Synthesis for Audio Signals (Marsyas) [129] is an open source framework developed in C++ that supports extracting a large range of acoustic features with music information retrieval applications in mind, including time-domain zero-crossings, spectral centroid, rolloff, flux, and Mel-Frequency Cepstral Coefficients (MFCC).
MIRToolbox [64] is an open source toolbox implemented in MATLAB for music information retrieval applications. MIRToolbox offers the ability to extract a comprehensive set of acoustic features at different levels, including features related to tonality, rhythm, and structure. Speech and music interpretation by large-space extraction (openSMILE) [36, 37] is an open source software package developed in C++ with the ability to extract a large number of acoustic features for speech and music analysis in real time. LibROSA [75] is a Python package for music and audio analysis. It is mainly developed with music information retrieval applications in mind and supports importing from different audio sources and extracting musical features, such as onsets, chroma and tempo, in addition to low-level acoustic features. ESSENTIA [16] is an open source library developed in C++ with a Python interface that is designed for audio analysis. ESSENTIA contains an extensive collection of algorithms supporting audio input/output functionality, standard digital signal processing blocks, statistical characterization of data, and a large set of spectral, temporal, tonal and high-level music descriptors.
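As a concrete example of extracting several of the acoustic features listed in Table 9, the following sketch uses LibROSA; the excerpt length, file name, and pooling by the mean are illustrative assumptions rather than settings from any surveyed system.

```python
# Sketch: clip-level acoustic features for MER with LibROSA.

import librosa
import numpy as np

y, sr = librosa.load("song_excerpt.wav", duration=30.0)   # illustrative file

features = {
    "mfcc_mean": np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13), axis=1),
    "chroma_mean": np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1),
    "spectral_centroid": float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))),
    "zero_crossing_rate": float(np.mean(librosa.feature.zero_crossing_rate(y))),
    "rms_energy": float(np.mean(librosa.feature.rms(y=y))),
    "tempo": float(librosa.beat.beat_track(y=y, sr=sr)[0]),
}
```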
Music emotion recognition either attempts to classify songs or excerpts into categories (classification) or to estimate their expressed emotions on continuous dimensions (regression). The choice of machine learning model in music emotion recognition depends on the emotional representation used: mood clusters [32], dimensional representations such as arousal, tension and valence, as well as music-specific emotion representations can be used. An analysis of the methods submitted to the MediaEval "Emotion in Music" task revealed that the use of deep learning accounted for superior emotion recognition performance much more than the choice of features [6]. Recent methods for emotion recognition in music rely on deep learning and often use spectrogram features that are converted to images [5]. Aljanaki and Soleymani proposed learning musically meaningful mid-level perceptual features that can describe emotions in music [5]. They demonstrated that perceptual features such as melodiousness, modality, rhythmic complexity and dissonance can explain a large portion of the emotional variance in music, both in dimensional representations and in MIREX clusters. They also trained a deep convolutional neural network to recognize these mid-level attributes. There has also been work attempting to use lyrics in addition to acoustic content for recognizing emotion in music [73]. However, lyrics are copyrighted and not easily available, which hinders further work in this direction.
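The spectrogram-as-image representation used by many deep MER systems (e.g., [5]) can be sketched as follows; the mel-band count and hop length are typical values rather than the settings of any specific cited system.

```python
# Sketch: convert audio to a normalized log-mel spectrogram "image" for a CNN.

import librosa
import numpy as np

y, sr = librosa.load("song_excerpt.wav", duration=30.0)   # illustrative file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=512)
log_mel = librosa.power_to_db(mel, ref=np.max)             # dB-scaled spectrogram

# Normalize to [0, 1] so it can be fed to an image CNN as a single-channel input.
log_mel = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
print(log_mel.shape)    # (128, num_frames)
```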
Currently, the features used in affective video content analysis mainly fall into two categories [107, 109]. One considers the stimulus of the video content and extracts features reflecting the emotions conveyed by the video content itself; the other extracts features from the viewers. Features extracted from the video content are content-based features, and features derived from the signals of the viewers' responses are viewer-related features.
Generally speaking, video content comprises a series of ordered frames as well as the corresponding audio signal. Therefore, it is natural to extract features from these two modalities. The audiovisual features can further be divided into low-level and mid-level according to their ability to describe the semantics of the video content.
Commonly, the low-level features are directly computed from the raw visual and audio content and usually carry no semantic information. As for visual content, color, lighting, and tempo are important elements that can endow a video with strong emotional rendering and give viewers direct visual stimuli. In many cases, computations are conducted over each frame of the video, and the average values over the whole video are taken as visual features. Specifically, the color-related features often include the histogram and variance of color [20, 107, 177], the proportions of color [82, 86], the number of white frames and fades [177], the grayness [20], darkness ratio, color energy [132, 170], brightness ratio, and saturation [85, 86], etc. In addition, the difference between dark and light can be reflected by the lighting key, which is used to evoke emotions in video and draw the attention of viewers by creating an emotional atmosphere [82]. As for the tempo-related features, properties of shots can reinforce the expression of a video, such as the shot change rate and shot length variance [20, 85, 87], according to movie grammar. To better exploit the temporal information of the video, motion vectors have been computed as features in [173]. Since optical flow can characterize the influence of camera motion, the histogram of optical flow (HOF) has been computed as a feature in [147]. Additionally, Yi and Wang [147] traced motion key points at multiple spatial scales and computed the mean motion magnitude of each frame as features.
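To make the frame-averaging scheme concrete, the sketch below computes a simple color histogram, brightness, saturation, and an optical-flow-based motion cue per frame with OpenCV and averages them over the clip. It is a rough illustration under assumed choices (hue histogram, Farnebäck flow, placeholder video path), not the exact features of any cited paper.

```python
# Rough sketch of clip-level visual features averaged over frames (OpenCV).
import cv2
import numpy as np

def visual_features(path, hist_bins=8):
    cap = cv2.VideoCapture(path)
    hists, brightness, saturation, motion = [], [], [], []
    prev_gray = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # Hue histogram as a simple color descriptor (OpenCV hue range is 0-180)
        h = cv2.calcHist([hsv], [0], None, [hist_bins], [0, 180]).ravel()
        hists.append(h / (h.sum() + 1e-8))
        brightness.append(hsv[..., 2].mean())
        saturation.append(hsv[..., 1].mean())
        # Dense optical flow magnitude as a crude motion/tempo cue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            motion.append(np.linalg.norm(flow, axis=2).mean())
        prev_gray = gray
    cap.release()
    # Average the frame-level values into one clip-level feature vector
    return np.concatenate([np.mean(hists, axis=0),
                           [np.mean(brightness), np.mean(saturation),
                            np.mean(motion) if motion else 0.0]])
```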
To represent audio content, pitch, zero crossing rate (ZCR), Mel frequency cepstrum coefficients (MFCC), and energy are the most popular features [2, 43, 82, 132, 170]. In particular, the MFCC [14, 49, 74, 147, 176, 177] and its ∆MFCC are frequently used to characterize emotions in video clips, and the derivatives and statistics (min, max, mean) of MFCC or ∆MFCC are also widely explored. As for pitch, [87] shows that the pitch of sound is closely associated with some emotions, such as anger with a higher pitch and sadness with a lower standard deviation of pitch. A similar situation occurs with energy [86, 87]; for example, the total energy of anger or happiness is higher than that of unexciting emotions. ZCR [86] is used to separate different types of audio signals, such as music, environmental sound, and human speech. Besides these frequently used features, audio flatness [177], spectral flux [177], delta spectrum magnitude, harmony [86, 111, 177], band energy ratio, spectral centroid [49, 177], and spectral contrast [86] are also utilized.

Evidently, the aforementioned features are mostly handcrafted. With the emergence of deep learning, features can be automatically learned through deep neural networks. Some pre-trained convolutional neural networks (CNNs) are used to learn static representations from every frame or from selected key frames, while a long short-term memory (LSTM) network is exploited to capture the dynamic representations existing in videos. For instance, in [136], an AlexNet with seven fully-connected layers trained on 2600 ImageNet classes is used to learn features. A Convolutional Auto-Encoder (CAE) is designed to ensure that the CNN can extract visual features effectively in [92]. Ben-Ahmed and Huet [14] first used the pre-trained ResNet-152 to extract feature vectors; these vectors are then fed into an LSTM in their temporal order to extract higher-order representations, and a pre-trained model, SoundNet, is utilized to learn audio features. Because the emotions expressed in a video are in many cases induced and communicated by the protagonist, the features of the protagonist are extracted from the key frames by a pre-trained CNN and used for video affective analysis in [176, 177]. In addition to the protagonist, other objects in each frame also give insights into the emotional expression of a video. For example, in [110], Shukla et al. removed the non-gaze regions from video frames (eye ROI) and retained only the coarse-grained scene structure (gist information) with a Gaussian filter. These operations allow the subsequent affective analysis to focus on the important information and reduce unnecessary noise.
Unlike low-level features, mid-level features often contain semantic information. For example, the EmoBase10 feature set, which captures audio cues, is computed in [147]. Hu et al. [49] proposed combining audio and visual features to model the contextual structures of key frames selected from a video, producing a multi-instance sparse coding (MI-SC) representation for subsequent analysis. In addition, lexical features [81] are extracted from the speakers’ dialogues using a natural language toolkit; these features can reflect emotional changes within a video and can also represent its overall emotional expression. Muszynski et al. [81] used aesthetic movie highlights, related to occurrences of meaningful movie scenes, as expert-defined descriptors; such expert-derived features are more knowledgeable and abstract for video affective analysis, especially for movies. HHTC features, computed by combining Hilbert–Huang transform features of the visual and audio signals with cross-correlation features, are proposed in [85].
Besides the content-related features, viewers’ facial expressions and the changes in physiological signals evoked by video content are the most common sources of viewer-related features. McDuff and Soleymani [74] coded the facial actions of viewers for further affective video analysis. Among the various physiological signals, electrocardiography (ECG), galvanic skin response (GSR),
and electroencephalography (EEG) are the most frequently used ones, and their statistical measures, such as mean, median, and spectral power bands, are often adopted as features. Wang et al. [132] used EEG signals to construct a new EEG feature with the assistance of the relationships among video content by exploiting canonical correlation analysis (CCA). In [40], viewer-related features are extracted from the whole pupil dilation ratio time-series, which factors out individual differences in pupil diameter: its average and standard deviation serve as global features, and four spectral power bands serve as local features.

In addition to the viewers’ responses mentioned above, the comments or other textual information produced by viewers can also reflect their attitudes or emotional reactions toward the videos, so it is reasonable to extract features from such text. In [82], a “sentiment analysis” module based on unigrams and bigrams is built to learn comment-related features from data collected via the YouTube links provided with the DEAP dataset.
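The sketch below illustrates the kind of statistical and spectral-band descriptors commonly derived from a single physiological channel (e.g., GSR, pupil diameter, or one EEG electrode). The sampling rate, the band boundaries, and the synthetic signal are placeholder assumptions for illustration only.

```python
# Illustrative statistical + band-power features from a physiological signal.
import numpy as np
from scipy.signal import welch

def physio_features(signal, fs=128.0,
                    bands=((0.0, 0.5), (0.5, 1.0), (1.0, 2.0), (2.0, 4.0))):
    # Global statistics over the whole recording
    stats = [np.mean(signal), np.median(signal),
             np.std(signal), np.min(signal), np.max(signal)]
    # Power spectral density via Welch's method
    freqs, psd = welch(signal, fs=fs, nperseg=min(len(signal), 256))
    # Average power within each frequency band
    band_powers = [psd[(freqs >= lo) & (freqs < hi)].mean()
                   for lo, hi in bands]
    return np.array(stats + band_powers)

# Example: a synthetic 60-second recording at 128 Hz
x = np.random.randn(60 * 128)
print(physio_features(x).shape)  # (9,)
```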
After feature extraction, a classifier or a regressor is used to obtain the emotional analysis results. For classification, there are several frequently used classifiers, including support vector machines (SVM) [2, 42, 83, 109, 130, 147], Naive Bayes (NB) [42, 83], Linear Discriminant Analysis (LDA) [110], logistic regression (LR) [137], and ensemble learning [2], etc.

Recent work shows that SVM-based methods are very popular for affective video content analysis due to their simplicity, max-margin training property, and use of kernels [131]. For example, Yi and Wang’s work [147] demonstrated that a linear SVM is more suitable for classification than RBM, MLP, and LR. In [109], LDA, linear SVM (LSVM), and radial basis SVM (RSVM) classifiers are employed in emotion recognition experiments, and the RSVM obtained the best F1 scores. In [42], both Naive Bayes and SVM are used as classifiers in unimodal and multimodal conditions; in the unimodal condition, NB is not better than SVM, and the fusion results showed that SVM is much better than NB in multimodal situations. However, SVM also has its shortcomings, such as the difficulty of selecting a suitable kernel function, and it is not always the best choice. In [2], the results demonstrated that ensemble learning outperforms SVM in terms of classification accuracy; ensemble learning has attracted a lot of attention in many fields because of its accuracy, simplicity, and robustness. In addition, in [137], LR is adopted as the classifier for its effectiveness and simplicity; in fact, LR is used frequently in many transfer learning tasks.

However, none of the classifiers mentioned above is able to capture temporal information, so some methods exploit it explicitly. For example, Gui et al. [40] combined SVM and LSTM to predict emotion labels. Specifically, global features and sequence features are extracted to represent the pupillary response signals; an SVM classifier is trained with the global features, an LSTM classifier is trained with the sequence features, and a decision fusion strategy is proposed to combine the two classifiers.

A regressor is needed when mapping the extracted features to a continuous dimensional emotion space. Recently, one of the most popular regression methods has been support vector regression (SVR) [12, 79, 177]. For example, in [12], video features such as audio, color, and aesthetic features are fed into an SVR in the SVR-standard experiment, while in the SVR-transfer learning experiment, a pre-trained CNN is treated as a feature extractor and its outputs are used as the input to the SVR. The experimental results showed that SVR-transfer learning outperforms the other methods. Indeed, the various kernel functions available in SVR provide strong adaptability.
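The following minimal sketch shows the two-step pipeline in its simplest form: pre-extracted clip-level features are fed to an SVM for discrete emotion classification and to an SVR for continuous valence regression. The data here are random placeholders and the hyperparameters are generic defaults, not values from the surveyed experiments.

```python
# Two-step pipeline sketch: features -> SVM classifier / SVR regressor.
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.randn(200, 64)            # clip-level feature vectors
y_cls = np.random.randint(0, 4, 200)    # discrete emotion labels
y_val = np.random.uniform(-1, 1, 200)   # continuous valence scores

# Discrete emotion classification with an RBF-kernel SVM
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X[:150], y_cls[:150])
print("classification accuracy:", clf.score(X[150:], y_cls[150:]))

# Valence regression with support vector regression
reg = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
reg.fit(X[:150], y_val[:150])
pred = reg.predict(X[150:])
print("valence MSE:", np.mean((pred - y_val[150:]) ** 2))
```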
Table 10. Representative work on AC of videos using different kinds of features, where P_a, P_v, MSE_a, MSE_v, Acc_a, Acc_v, Acc, and MAP indicate the Pearson correlation coefficients of arousal and valence, the mean sum error of arousal and valence, the accuracy of arousal and valence, the average accuracy, and the mean average precision, respectively. ‘statistics’ means (min, max, mean); ‘–’ indicates a value that is not reported here.

Ref | Feature | Fusion | Learning | Dataset | Task | Result
[111] | Mel frequency spectral; MFCC, chroma and their derivatives and statistics; audio compressibility; harmonicity; shot frequency; HOF and statistics; histogram of 3D HSV and statistics; video compressibility; histogram of facial area | decision | LSTM | Dataset described by Malandrakis [72] | reg | –
[40] | Average, standard deviation, and four spectral power bands of the pupil dilation ratio time-series | decision | SVM, LSTM | MAHNOB-HCI | cla | Acc_a: 0.730, Acc_v: 0.780
[14] | Audio and visual deep features from pre-trained models | feature | CNN, LSTM, SVM | PMIT | cla | MAP: 0.2122
[2] | MFCC; color values; HoG; dense trajectory descriptor; CNN-learned features | decision | CNN, SVM, ensemble | DEAP | cla | Acc: 0.81, 0.49
[79] | HHTC features | feature | SVR | Discrete LIRIS-ACCEDE | reg | MSE_a: 0.294, MSE_v: 0.290
[42] | Statistical measures (such as mean, median, skewness, kurtosis) for EEG data; power spectral features; ECG; GSR; face/head pose | decision | SVM/NB | music excerpts [80] | cla | F1 (v: 0.59, 0.58; a: 0.60, 0.57)
[41] | Time-span visual and visual features | feature | CNN, OpenSMILE toolbox | music excerpts [80] | cla | MSE_a: 0.082, MSE_v: –
[43] | Tempo; pitch; zero cross; roll off; MFCCs; saturation; color heat; shot length feature; general preferences; visual excitement; motion feature; fMRI feature | feature | DBM, SVM | TRECVID | cla | –
[177] | Colorfulness; MFCC; CNN-learned features from the keyframes containing the protagonist | decision | CNN, SVM, SVR | LIRIS-ACCEDE, PMSZU | cla/reg | –
[173] | Multi-frame motion vectors | decision | CNN | SumMe, TVSum, Continuous LIRIS-ACCEDE | reg | –
[49] | Median of the L values in Luv space; means and variances of components in HSV space; texture feature; mean and standard deviation of motions between frames in a shot; MFCC; spectral power; mean and variance of the spectral centroids; time-domain zero crossing rate; multi-instance sparse coding | feature | SVM | Musk1, Musk2, Elephant, Fox, Tiger | cla | Acc: 0.911, 0.906, 0.885, 0.627, 0.868
[82] | Lighting key; color; motion vectors; ZCR; energy; MFCC; pitch; textual features | decision | SMO, Naive Bayes | DEAP | cla | F1: 0.849, 0.811; Acc: 0.911, 0.883
[20] | Key lighting; grayness; fast motion; shot change rate; shot length variation; MFCC; CNN-learned features; power spectral density; EEG; ECG; respiration; galvanic skin resistance | feature | SVM | DEAP | cla | Acc_v: 0.7, 0.7, 0.7125; Acc_a: 0.6876, 0.7, 0.8; F1 (a): 0.664, 0.687, 0.789
[83] | MFCC; ZCR; energy; pitch; color histograms; lighting key; motion vector | decision | SVM, Naive Bayes | DEAP | cla | F1: 0.869, 0.846; Acc: 0.925, 0.897
[109] | CNN feature; low-level audio-visual features; EEG | decision | LDA, LSVM, RSVM | Dataset introduced by the authors | cla | –
[147] | MKT; ConvNets feature; EmoLarge; IS13; MFCC; EmoBase10; DSIFT; HSH | decision | SVM, LR, RBM, MLP | MediaEval 2015, 2016 Affective Impact of Movies | cla | Acc_a: 0.574, Acc_v: 0.462
[110] | CNN feature | – | SVM, LDA | Dataset in [108] | cla | –
[12] | CNN feature | – | SVR | LIRIS-ACCEDE | reg | MSE_a: 0.021, MSE_v: 0.027

In total, there are two fusion strategies for multimodal information: feature-level fusion and decision-level fusion. Feature-level fusion means that the multimodal features are combined and then used
as the input of a classifier or a regressor. Decision-level fusion combines the results of several different classifiers, and the final results are computed according to the fusion method.

One way of implementing feature-level fusion is feature averaging or concatenation [14, 147, 177]. In [14], the two feature vectors for visual and audio data are averaged as the global genre representations. In [177], multiple classes of features are concatenated to generate a high-dimensional joint representation. Some machine learning methods are also employed to learn joint features [41, 43, 90, 135]. In [41], a two-branch network is used to combine the visual and audio features; the outputs of the two-branch network are then fed into a classifier, and the experimental results showed that the joint features outperform other methods. In [43], the low-level audio-visual features and fMRI-derived features are fed into a multimodal DBM to learn joint representations, with the goal of learning the relation between the audio-visual features and the fMRI-derived features. In [135], PCA is used to learn the multimodal joint features. In [132], canonical correlation analysis (CCA) is used to construct a new video feature space with the help of EEG features and a new EEG feature space with the assistance of video content, so only one modality is needed to predict emotion at test time.

By combining the results of different classifiers, the decision-level fusion strategy is able to achieve better results [2, 40, 42, 82, 83, 111]. In [2], linear fusion and SVM-based fusion techniques are explored to combine the outputs of several classifiers. Specifically, in linear fusion the output of each classifier has its own weight and the final result is the weighted sum of all outputs; in SVM-based fusion, the outputs of the unimodal classifiers are concatenated, and the resulting higher-level representation of each video clip is fed into a fusion SVM to predict the emotion. Based on these results, linear fusion is better than SVM-based fusion. In [42, 83], linear fusion is also used to fuse the outputs of multiple classifiers; the differences among these linear fusion methods lie in how the weights are distributed.
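The two strategies can be contrasted in a few lines of code. In the sketch below, feature-level fusion concatenates the modality features before training one classifier, while decision-level fusion trains per-modality classifiers and linearly combines their posterior probabilities. The modality names, weights, and random data are assumptions made only for the illustration.

```python
# Feature-level fusion (concatenation) vs. weighted decision-level fusion.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(300, 20))
X_visual = rng.normal(size=(300, 30))
y = rng.integers(0, 2, 300)
train, test = slice(0, 200), slice(200, 300)

# Feature-level fusion: concatenate modalities, train one classifier
X_joint = np.hstack([X_audio, X_visual])
joint_clf = LogisticRegression(max_iter=1000).fit(X_joint[train], y[train])
p_joint = joint_clf.predict_proba(X_joint[test])[:, 1]

# Decision-level fusion: per-modality classifiers combined with linear weights
clf_a = LogisticRegression(max_iter=1000).fit(X_audio[train], y[train])
clf_v = LogisticRegression(max_iter=1000).fit(X_visual[train], y[train])
w_a, w_v = 0.4, 0.6
p_fused = (w_a * clf_a.predict_proba(X_audio[test])[:, 1]
           + w_v * clf_v.predict_proba(X_visual[test])[:, 1])

acc = lambda p: np.mean((p > 0.5) == y[test])
print("feature-level:", acc(p_joint), "decision-level:", acc(p_fused))
```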
Traditionally, video emotion recognition involves two steps, i.e., feature extraction and regression or classification. Because of the lack of consensus on the most relevant emotional features, we may not be able to extract the best features for the problem at hand; this two-step mode has therefore hampered the development of affective video content analysis. To address this problem, some methods based on end-to-end training frameworks have been proposed. Khorrami et al. [59] combined a CNN and an RNN to recognize the emotional information in videos: a CNN is trained on facial images sampled from the video frames to extract features, and the features are then fed into an RNN to perform continuous emotion recognition. In [51], a single network using ConvLSTM is proposed, where videos are input to the network and the predicted emotional information is output directly. Due to the complexity of CNNs and RNNs, training such frameworks requires large amounts of data, whereas the existing datasets for video affective content analysis are usually limited in size. This is why end-to-end methods are still less common than the traditional two-step methods, despite their potential.
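As a toy sketch of this frame-CNN-plus-recurrent-network idea (not the architecture of any cited paper), the PyTorch module below encodes each frame with a small CNN and aggregates the frame embeddings over time with an LSTM before a final emotion head; all layer sizes are arbitrary assumptions.

```python
# Toy end-to-end CNN + LSTM video emotion model (illustrative only).
import torch
import torch.nn as nn

class CNNLSTMEmotion(nn.Module):
    def __init__(self, num_emotions=8, hidden=128):
        super().__init__()
        # Small per-frame CNN encoder
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # LSTM aggregates the frame embeddings over time
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_emotions)

    def forward(self, clips):                 # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        frames = clips.flatten(0, 1)          # (B*T, 3, H, W)
        feats = self.cnn(frames).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)        # last hidden state
        return self.head(h_n[-1])             # (B, num_emotions)

model = CNNLSTMEmotion()
logits = model(torch.randn(2, 16, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 8])
```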
In this section, we survey the work that analyzes multimodal data beyond audiovisual content. Most of the existing work on affective understanding of multimedia relies on one modality, even when additional modalities are available, for example in videos [55]. Earlier work on emotional understanding of multimedia used hand-crafted features from different modalities that are fused at the feature or decision level [8, 15, 46, 117, 126]. The more recent work mainly uses deep learning models [55, 93].
Language is a commonly used modality in addition to vision and audio. There is a large body of work on text-based sentiment analysis [91]. Sentiment analysis from text is well established and is deployed at scale in industry in a broad set of applications involving opinion mining [116]. With the shift toward an increasingly multimodal social web, multimodal sentiment analysis is becoming more relevant; for example, vloggers post their opinions on YouTube, and photos commonly accompany user posts on Instagram and Twitter. Analyzing text for emotion recognition requires representing terms by features. Lexicon-based approaches are among the most popular methods for text-based emotion recognition; they use knowledge of words’ affect to estimate the affect of a document or piece of content. Linguistic Inquiry and Word Count (LIWC) is a well-known lexical tool that matches the terms in a document with its dictionary and generates scores along different dimensions, including affective and cognitive constructs such as “present focus” and “positive emotion” [54]. The terms in each category are selected by experts and extensively validated on different content. AffectNet is another notable lexical resource, which includes a semantic network of 10,000 items with representations for “pleasantness”, “attention”, “sensitivity”, and “aptitude” [19]; the continuous representations can be mapped to 24 distinct emotions. DepecheMood is a lexicon created through a data-driven method mining a news website annotated with its particular set of discrete emotions, namely, “afraid”, “amused”, “angry”, “annoyed”, “don’t care”, “happy”, and “inspired” [122]. DepecheMood has been extended to DepecheMood++ by including Italian [7].

The more recent development in text-based affective analysis is models powered by deep learning. Leveraging large-scale data, deep neural networks are able to learn representations that are relevant for affective analysis of language. Word embeddings, such as Word2Vec [78] or GloVe [96], are among the most common such representations: they learn a vector representation of each word from its language context that captures semantic and syntactic similarities. More recently, representation learning models that can encode a whole sequence of terms (sentences, documents) have shown impressive performance in different language understanding tasks, including sentiment and emotion analysis. Bidirectional Encoder Representations from Transformers (BERT) [31] is a method for learning a language model that can be trained on large amounts of data in an unsupervised manner. The resulting pre-trained model is very effective at representing a sequence of terms as a fixed-length vector. The BERT architecture is a multi-layer bidirectional Transformer network that encodes the whole sequence at once, and BERT representations achieve state-of-the-art results in multiple natural language understanding tasks.
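As a hedged sketch of how such fixed-length text representations are typically obtained in practice, the snippet below loads a pre-trained BERT model through the Hugging Face transformers library and mean-pools the token embeddings into one vector per sentence. The model name ("bert-base-uncased") and the pooling choice are common defaults assumed here, not choices made by the survey.

```python
# Fixed-length sentence representations from a pre-trained BERT model.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

texts = ["What an exciting day!", "This movie made me cry."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# Mean-pool the token embeddings (masking padding) into one vector per text;
# these vectors can then feed a sentiment or emotion classifier.
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_vecs = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(sentence_vecs.shape)  # (2, 768)
```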
The audiovisual features used for multimodal understanding of affect are similar to the ones discussed in the previous sections; the main difference between multimodal models lies in their methods for multimodal fusion. Multimodal methods involve extracting features from multiple modalities, e.g., audiovisual, and training joint or separate machine learning models for fusion [9]. Multimodal fusion can be done in model-based or model-agnostic ways. The model-agnostic fusion methods do not rely on a specific classification or regression method and include feature-level, decision-level, or hybrid fusion techniques. Model-based methods address multimodal fusion in model construction; examples include Multiple Kernel Learning (MKL) [38], graphical models such as Conditional Random Fields [10], and neural networks [84, 99].

Pang et al. [93] used a Deep Boltzmann Machine (DBM) to learn a joint representation across text, vision, and audio to recognize expected emotions from social media videos. Each modality is separately encoded by stacking multiple Restricted Boltzmann Machines (RBMs), and the pathways are merged into a joint representation layer. The model was evaluated on recognizing eight emotion categories for 1,101 videos from [55]. Muszynski et al. [81] studied perceived vs. induced emotion in movies. To this end, they collected additional labels on a subset of the LIRIS-ACCEDE dataset [13] and found that perceived and induced emotions do not always agree. Using multimodal Deep Belief
Networks (DBNs), they demonstrated that fusing electrodermal responses with audiovisual content features improves the overall accuracy of emotion recognition [81].

In [111], the authors performed regression to estimate intended arousal and valence levels (as judged by the experts in [72]). LSTM recurrent neural networks are used for the unimodal regressions and combined via early and late fusion for audiovisual estimation, with late fusion achieving the best results. Tarvainen et al. [125] performed an in-depth analysis of how emotions are constructed in movies, identified scene type as a major factor, and then used content features to recognize emotions along the three dimensions of hedonic tone (valence), energetic arousal (awake–tired), and tense arousal (tense–calm).

Bilinear fusion models inter- and intra-modality interactions by taking the outer product of unimodal embeddings [69]. Zadeh et al. [151] extended this idea to a Tensor Fusion Network to model intra-modality and inter-modality dynamics in multimodal sentiment analysis. The tensor fusion network includes modality embedding sub-networks, a tensor fusion layer that models the unimodal, bimodal, and trimodal interactions using a three-fold Cartesian product of the modality embeddings, and a final sentiment inference sub-network conditioned on the tensor fusion layer. The main drawback of such methods is the increase in the dimensionality of the resulting multimodal representation.
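The core tensor-fusion operation can be written compactly: each unimodal embedding is padded with a constant 1 so that the three-fold outer product retains the unimodal and bimodal terms alongside the trimodal ones. The sketch below is an illustrative reduction of that idea with made-up embedding dimensions; it also makes the dimensionality drawback explicit.

```python
# Sketch of bilinear/tensor fusion via a three-fold outer product.
import torch

def tensor_fusion(z_text, z_audio, z_visual):
    # Append 1 so that lower-order (unimodal/bimodal) terms are preserved
    one = torch.ones(z_text.size(0), 1)
    zt = torch.cat([z_text, one], dim=1)     # (B, dt+1)
    za = torch.cat([z_audio, one], dim=1)    # (B, da+1)
    zv = torch.cat([z_visual, one], dim=1)   # (B, dv+1)
    # Three-fold outer product, flattened into one fusion vector per sample
    fused = torch.einsum('bi,bj,bk->bijk', zt, za, zv)
    return fused.flatten(start_dim=1)        # (B, (dt+1)*(da+1)*(dv+1))

f = tensor_fusion(torch.randn(4, 32), torch.randn(4, 16), torch.randn(4, 8))
print(f.shape)  # torch.Size([4, 5049]) -- the representation grows quickly
```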
Although remarkable progress has been made on affective computing of multimedia (ACM) data, there are still several open issues and directions that can boost the performance of ACM.
Multimedia Content Understanding.
As emotions may be directly evoked in viewers by the multimedia content, accurately understanding what is contained in multimedia data can significantly improve the performance of ACM. Sometimes it is even necessary to analyze subtle details. For example, we may feel “amused” by a video of a laughing baby; but if the laugh comes from a negative character, we are more likely to feel “angry”. In such cases, besides the common property, such as “laugh”, we may need to further recognize the identity, such as “a lovely baby” or “an evil antagonist”.
Multimedia Summarization.
Emotions can play a vital role in the selection of multimedia for the creation of summaries or highlights. This is an important application in the entertainment and sports industries (e.g., movie trailers, sports highlights). There has been some recent work in this direction where affect information from audiovisual cues has led to the successful creation of video summaries [76, 113]. In particular, the work reported in [113] used audiovisual emotions in part to create an AI trailer for a 20th Century Fox film in 2016. Similarly, the AI highlights system described in [76] hinges on audiovisual emotional cues and has been employed to create the official highlights at Wimbledon and the US Open since 2017. This is a very promising direction for affective multimedia computing, which can have a direct impact on real-world media applications.
Contextual Knowledge Modeling.
The contextual information of a viewer watching some multimedia is very important. Similar multimedia data under different contexts may evoke totally different emotions. For example, we may feel “happy” when listening to a love song at a wedding; but if the same song is played when two lovers are parting, it is more likely that we feel “sad”. The prior knowledge of viewers or of the multimedia data may also influence the emotion perception: an optimistic viewer and a pessimistic viewer may have totally different emotions about the same multimedia data.
Group Emotion Clustering.
It is too generic to simply recognize the dominant emotion, while it is too specific to predict personalized emotion. It would make more sense to model emotions for groups or cliques of viewers with similar interests and backgrounds. Clustering different viewers
into corresponding groups, possibly based on user profiles, may provide a feasible solution to this problem.
New AC Setting Adaptation.
Because of the domain shift [127], deep learning models trained on one labeled source domain may not work well on another unlabeled or sparsely labeled target domain, which results in low transferability to new domains. Exploring domain adaptation techniques that fit the AC tasks well is worth investigating. One possible solution is to translate the source data to an intermediate domain that is indistinguishable from the target data while preserving the source labels [165, 172], using Generative Adversarial Networks [39, 174]. How to deal with more practical settings, such as multiple labeled source domains and the homogeneity of emotion models, is even more challenging.
Regions-of-Interest Selection.
Different regions of a given multimedia item may contribute differently to emotion recognition. For example, the regions that contain the most important semantic information in images are more discriminative than the background, and some video frames are of no use for emotion recognition. Detecting and selecting the regions of interest may significantly improve the recognition performance as well as the computational efficiency.
Viewer-Multimedia Interaction.
Instead of directly analyzing the multimedia content or implicitly considering viewers’ physiological signals (such as facial expressions, electroencephalogram signals, etc.), jointly modeling both the multimedia content and the viewers’ responses may better bridge the affective gap and result in superior performance. We should also study how to deal with missing or corrupted data; for example, some physiological signals may be unavailable during the data collection stage.
Affective Computing Applications.
Although AC is claimed to be important in real-world applications, few practical systems have been developed due to the relatively low performance. With the availability of larger datasets and improvements in self-supervised and semi-supervised learning, we foresee the deployment of ACM in real-world applications. For example, in media analytics, content understanding methods will identify the emotional preferences of users and the emotional nuances of social media content to better target advertising efforts; in fashion recommendation, intelligent customer service, such as customer-multimedia interaction, can provide a better experience to customers; and in advertising, generating or curating multimedia that evokes strong emotions can attract more attention. We believe that emotional artificial intelligence will become a significant component of mainstream multimedia applications.
Benchmark Dataset Construction.
Existing studies on ACM mainly adopt small-scale datasets or construct relatively larger-scale ones using a keyword searching strategy without guaranteed annotation quality. To advance the development of ACM, creating a large-scale, high-quality dataset is urgently needed. It has been shown that there are three critical factors for dataset construction in ACM, i.e., the context of the viewer response, personal variation among viewers, and the effectiveness and efficiency of corpus creation [118]. In order to include a large number of samples, we may exploit online systems and crowdsourcing platforms to recruit large numbers of viewers with a representative spread of backgrounds to annotate multimedia and provide contextual information on their emotional responses. Since emotion is a subjective variable, personalized emotion annotation would make more sense, from which we can obtain both the dominant emotion and the emotion distribution. Further, accurate understanding of multimedia content can boost affective computing performance. Inferring emotional labels from social media users’ interactions with data, e.g., likes and comments, in addition to their spontaneous responses, e.g., facial expressions, where possible, will provide new avenues for enriching affective datasets.
In this article, we have surveyed affective computing (AC) methods for heterogeneous multimedia data. For each multimedia type, i.e., image, music, video, and multimodal data, we summarized and compared the available datasets, handcrafted features, machine learning methods, deep learning models, and experimental results. We also briefly introduced the commonly employed emotion models and outlined potential research directions in this area. Although deep learning-based AC methods have achieved remarkable progress in recent years, an efficient and robust AC method that is able to obtain high accuracy under unconstrained conditions is yet to be achieved. With the advent of a deeper understanding of emotion evocation in brain science, accurate emotion measurement in psychology, and novel deep learning architectures in machine learning, affective computing of multimedia data will remain an active research topic for a long time.
ACKNOWLEDGMENTS
This work was supported by Berkeley DeepDrive, the National Natural Science Foundation of China (Nos. 61701273, 91748129), and the National Key R&D Program of China (Grant No. 2017YFC011300). The work of MS is supported in part by the U.S. Army. Any opinion, content or information presented does not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred.
REFERENCES [1] Mojtaba Khomami Abadi, Ramanathan Subramanian, Seyed Mostafa Kia, Paolo Avesani, Ioannis Patras, and Nicu Sebe.2015. DECAF: MEG-based multimodal database for decoding affective physiological responses.
IEEE Transactions onAffective Computing
6, 3 (2015), 209–222.[2] Esra Acar, Frank Hopfgartner, and Sahin Albayrak. 2017. A comprehensive study on mid-level representationand ensemble learning for emotional analysis of video material.
Multimedia Tools and Applications
76, 9 (2017),11809–11837.[3] Xavier Alameda-Pineda, Elisa Ricci, Yan Yan, and Nicu Sebe. 2016. Recognizing emotions from abstract paintingsusing non-linear matrix completion. In
IEEE Conference on Computer Vision and Pattern Recognition . 5240–5248.[4] Anna Aljanaki. 2016.
Emotion in Music: representation and computational modeling . Ph.D. Dissertation. UtrechtUniversity.[5] Anna Aljanaki and Mohammad Soleymani. 2018. A data-driven approach to mid-level perceptual musical featuremodeling. In
International Society for Music Information Retrieval Conference .[6] Anna Aljanaki, Yi-Hsuan Yang, and Mohammad Soleymani. 2017. Developing a benchmark for emotional analysis ofmusic.
PloS One
12, 3 (2017), e0173392.[7] Oscar Araque, Lorenzo Gatti, Jacopo Staiano, and Marco Guerini. 2018. DepecheMood++: a Bilingual Emotion LexiconBuilt Through Simple Yet Powerful Techniques. arXiv preprint arXiv:1810.03660 (2018).[8] Sutjipto Arifin and Peter YK Cheung. 2008. Affective level video segmentation by utilizing the pleasure-arousal-dominance information.
IEEE Transactions on Multimedia
10, 7 (2008), 1325–1341.[9] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2019. Multimodal Machine Learning: A Surveyand Taxonomy.
IEEE Transactions on Pattern Analysis and Machine Intelligence
41, 2 (2019), 423–443.[10] Tadas Baltrusaitis, Ntombikayise Banda, and Peter Robinson. 2013. Dimensional affect recognition using ContinuousConditional Random Fields. In
IEEE International Conference and Workshops on Automatic Face and Gesture Recognition .1–8.[11] Yoann Baveye, Christel Chamaret, Emmanuel Dellandréa, and Liming Chen. 2018. Affective video content analysis: Amultidisciplinary insight.
IEEE Transactions on Affective Computing
9, 4 (2018), 396–409.[12] Yoann Baveye, Emmanuel Dellandréa, Christel Chamaret, and Liming Chen. 2015. Deep learning vs. kernel methods:Performance for emotion prediction in videos. In
International Conference on Affective Computing and IntelligentInteraction . 77–83.[13] Yoann Baveye, Emmanuel Dellandrea, Christel Chamaret, and Liming Chen. 2015. Liris-accede: A video database foraffective content analysis.
IEEE Transactions on Affective Computing
6, 1 (2015), 43–55.[14] Olfa Ben-Ahmed and Benoit Huet. 2018. Deep Multimodal Features for Movie Genre and Interestingness Prediction.In
International Conference on Content-Based Multimedia Indexing . 1–6.ACM Trans. Multimedia Comput. Commun. Appl., Vol. 1, No. 1, Article 1. Publication date: January 2019. ffective Computing for Large-Scale Heterogeneous Multimedia Data: A Survey 1:25 [15] Sergio Benini, Luca Canini, and Riccardo Leonardi. 2011. A connotative space for supporting movie affectiverecommendation.
IEEE Transactions on Multimedia
13, 6 (2011), 1356–1370.[16] Dmitry Bogdanov, Nicolas Wack, Emilia Gómez, Sankalp Gulati, Perfecto Herrera, O. Mayor, Gerard Roma, JustinSalamon, J. R. Zapata, and Xavier Serra. 2013. ESSENTIA: an Audio Analysis Library for Music Information Retrieval.In
International Society for Music Information Retrieval Conference . 493–498.[17] Damian Borth, Tao Chen, Rongrong Ji, and Shih-Fu Chang. 2013. Sentibank: large-scale ontology and classifiers fordetecting sentiment and emotions in visual content. In
ACM International Conference on Multimedia . 459–460.[18] Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. 2013. Large-scale visual sentimentontology and detectors using adjective noun pairs. In
ACM International Conference on Multimedia . 223–232.[19] Erik Cambria, Thomas Mazzocco, Amir Hussain, and Chris Eckl. 2011. Sentic Medoids: Organizing Affective CommonSense Knowledge in a Multi-Dimensional Vector Space. In
Advances in Neural Networks . 601–610.[20] Mo Chen, Gong Cheng, and Lei Guo. 2018. Identifying affective levels on music video via completing the missingmodality.
Multimedia Tools and Applications
77, 3 (2018), 3287–3302.[21] Tao Chen, Damian Borth, Trevor Darrell, and Shih-Fu Chang. 2014. DeepSentiBank: Visual Sentiment ConceptClassification with Deep Convolutional Neural Networks.
Computer Science (2014).[22] Tao Chen, Felix X Yu, Jiawei Chen, Yin Cui, Yan-Ying Chen, and Shih-Fu Chang. 2014. Object-based visual sentimentconcept analysis and application. In
ACM International Conference on Multimedia . 367–376.[23] Yu-An Chen, Yi-Hsuan Yang, Ju-Chiang Wang, and Homer Chen. 2015. The AMG1608 dataset for music emotionrecognition. In
IEEE International Conference on Acoustics, Speech and Signal Processing . 693–697.[24] Juan Abdon Miranda Correa, Mojtaba Khomami Abadi, Nicu Sebe, and Ioannis Patras. 2017. AMIGOS: A dataset forMood, personality and affect research on Individuals and GrOupS. arXiv preprint arXiv:1702.02510 (2017).[25] Vaidehi Dalmia, Hongyi Liu, and Shih-Fu Chang. 2016. Columbia mvso image sentiment dataset. arXiv preprintarXiv:1611.04455 (2016).[26] Elise S Dan-Glauser and Klaus R Scherer. 2011. The Geneva affective picture database (GAPED): a new 730-picturedatabase focusing on valence and normative significance.
Behavior Research Methods
43, 2 (2011), 468–477.[27] Emmanuel Dellandréa, Liming Chen, Yoann Baveye, Mats Viktor Sjöberg, Christel Chamaret, et al. 2016. Themediaeval 2016 emotional impact of movies task. In
CEUR Workshop Proceedings .[28] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, and Mats Sjöberg. 2017. The MediaEval 2017Emotional Impact of Movies Task. In
Working Notes Proceedings of the MediaEval 2017 Workshop .[29] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, Zhongzhe Xiao, and Mats Sjöberg. 2018. TheMediaEval 2018 Emotional Impact of Movies Task. In
Working Notes Proceedings of the MediaEval 2018 Workshop .[30] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, Zhongzhe Xiao, and Mats Sjöberg. 2019.Datasets column: predicting the emotional impact of movies.
ACM SIGMultimedia Records
10, 4 (2019), 6.[31] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep BidirectionalTransformers for Language Understanding. In
Annual Conference of the North American Chapter of the Association forComputational Linguistics . 4171–4186.[32] XHJS Downie, Cyril Laurier, and MBAF Ehmann. 2008. The 2007 MIREX audio mood classification task: Lessonslearned. In
International Society for Music Information Retrieval Conference . 462–467.[33] Tuomas Eerola and Jonna K Vuoskoski. 2011. A comparison of the discrete and dimensional models of emotion inmusic.
Psychology of Music
39, 1 (2011), 18–49.[34] Paul Ekman. 1992. An argument for basic emotions.
Cognition & Emotion
6, 3-4 (1992), 169–200.[35] Moataz El Ayadi, Mohamed S Kamel, and Fakhri Karray. 2011. Survey on speech emotion recognition: Features,classification schemes, and databases.
Pattern Recognition
44, 3 (2011), 572–587.[36] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent developments in openSMILE, themunich open-source multimedia feature extractor. In
ACM International Conference on Multimedia . 835–838.[37] Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. OpenSMILE: The Munich Versatile and Fast Open-sourceAudio Feature Extractor. In
ACM International Conference on Multimedia . 1459–1462.[38] Mehmet Gönen and Ethem AlpaydÄśn. 2011. Multiple Kernel Learning Algorithms.
Journal of Machine LearningResearch
12, Jul (2011), 2211–2268.[39] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, andYoshua Bengio. 2014. Generative adversarial nets. In
Advances in Neural Information Processing Systems . 2672–2680.[40] Dongdong Gui, Sheng-hua Zhong, and Zhong Ming. 2018. Implicit Affective Video Tagging Using Pupillary Response.In
International Conference on Multimedia Modeling . 165–176.[41] Jie Guo, Bin Song, Peng Zhang, Mengdi Ma, Wenwen Luo, et al. 2019. Affective video content analysis based onmultimodal data fusion in heterogeneous networks.
Information Fusion
51 (2019), 224–232.[42] Rishabh Gupta, Mojtaba Khomami Abadi, Jesús Alejandro Cárdenes Cabré, Fabio Morreale, Tiago H Falk, and NicuSebe. 2016. A quality adaptive multimodal affect recognition system for user-centric multimedia indexing. In
ACM
ACM Trans. Multimedia Comput. Commun. Appl., Vol. 1, No. 1, Article 1. Publication date: January 2019. :26 Zhao et al.
International Conference on Multimedia Retrieval . 317–320.[43] Junwei Han, Xiang Ji, Xintao Hu, Lei Guo, and Tianming Liu. 2015. Arousal recognition using audio-visual featuresand FMRI-based brain response.
IEEE Transactions on Affective Computing
6, 4 (2015), 337–347.[44] Junwei Han, Dingwen Zhang, Gong Cheng, Nian Liu, and Dong Xu. 2018. Advanced deep-learning techniques forsalient and category-specific object detection: a survey.
IEEE Signal Processing Magazine
35, 1 (2018), 84–100.[45] Alan Hanjalic. 2006. Extracting moods from pictures and sounds: Towards truly personalized TV.
IEEE SignalProcessing Magazine
23, 2 (2006), 90–100.[46] Alan Hanjalic and Li-Qun Xu. 2005. Affective video content representation and modeling.
IEEE Transactions onMultimedia
7, 1 (2005), 143–154.[47] John HL Hansen and Taufiq Hasan. 2015. Speaker recognition by machines and humans: A tutorial review.
IEEESignal Processing Magazine
32, 6 (2015), 74–99.[48] Samitha Herath, Mehrtash Harandi, and Fatih Porikli. 2017. Going deeper into action recognition: A survey.
Imageand Vision Computing
60 (2017), 4–21.[49] Weiming Hu, Xinmiao Ding, Bing Li, Jianchao Wang, Yan Gao, Fangshi Wang, and Stephen Maybank. 2016. Multi-perspective cost-sensitive context-aware multi-instance sparse coding and its application to sensitive video recognition.
IEEE Transactions on Multimedia
18, 1 (2016), 76–89.[50] Xiao Hu and J Stephen Downie. 2007. Exploring Mood Metadata: Relationships with Genre, Artist and Usage Metadata..In
International Society for Music Information Retrieval Conference . 67–72.[51] Jian Huang, Ya Li, Jianhua Tao, Zheng Lian, and Jiangyan Yi. 2018. End-to-End Continuous Emotion Recognitionfrom Video Using 3D Convlstm Networks. In
IEEE International Conference on Acoustics, Speech and Signal Processing .6837–6841.[52] Charles Inskip, Andy Macfarlane, and Pauline Rafferty. 2012. Towards the disintermediation of creative music search:analysing queries to determine important facets.
International Journal on Digital Libraries
12, 2-3 (2012), 137–147.[53] Bernard J Jansen, Mimi Zhang, Kate Sobel, and Abdur Chowdury. 2009. Twitter power: Tweets as electronic word ofmouth.
Journal of the Association for Information Science and Technology
60, 11 (2009), 2169–2188.[54] Audra E. Massey Jeffrey H. Kahn, Renée M. Tobin and Jennifer A. Anderson. 2007. Measuring Emotional Expressionwith the Linguistic Inquiry and Word Count.
JSTOR: The American Journal of Psychology
Twenty-EighthAAAI Conference on Artificial Intelligence .[56] Dhiraj Joshi, Ritendra Datta, Elena Fedorovskaya, Quang-Tuan Luong, James Z. Wang, Jia Li, and Jiebo Luo. 2011.Aesthetics and emotions in images.
IEEE Signal Processing Magazine
28, 5 (2011), 94–115.[57] Brendan Jou, Tao Chen, Nikolaos Pappas, Miriam Redi, Mercan Topkara, and Shih-Fu Chang. 2015. Visual AffectAround the World: A Large-scale Multilingual Visual Sentiment Ontology. In
ACM International Conference onMultimedia . 159–168.[58] Brendan Jou, Margaret Yuying Qian, and Shih-Fu Chang. 2016. SentiCart: Cartography and Geo-contextualization forMultilingual Visual Sentiment. In
ACM International Conference on Multimedia Retrieval . 389–392.[59] Pooya Khorrami, Tom Le Paine, Kevin Brady, Charlie Dagli, and Thomas S Huang. 2016. How deep neural networkscan improve emotion recognition on video data. In
IEEE International Conference on Image Processing . 619–623.[60] Youngmoo E Kim, Erik M Schmidt, Raymond Migneco, Brandon G Morton, Patrick Richardson, Jeffrey Scott,Jacquelin A Speck, and Douglas Turnbull. 2010. Music emotion recognition: A state of the art review. In
Inter-national Society for Music Information Retrieval Conference , Vol. 86. 937–952.[61] Sander Koelstra, Christian Muhl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yazdani, Touradj Ebrahimi, ThierryPun, Anton Nijholt, and Ioannis Patras. 2012. Deap: A database for emotion analysis using physiological signals.
IEEE Transactions on Affective Computing
3, 1 (2012), 18–31.[62] Peter J Lang, Margaret M Bradley, and Bruce N Cuthbert. 1997. International affective picture system (IAPS): Technicalmanual and affective ratings.
NIMH Center for the Study of Emotion and Attention (1997), 39–58.[63] Martha Larson, Mohammad Soleymani, Guillaume Gravier, Bogdan Ionescu, and Gareth JF Jones. 2017. The bench-marking initiative for multimedia evaluation: MediaEval 2016.
IEEE MultiMedia
24, 1 (2017), 93–96.[64] Olivier Lartillot, Petri Toiviainen, and Tuomas Eerola. 2008. A Matlab Toolbox for Music Information Retrieval. In
Data Analysis, Machine Learning and Applications . 261–268.[65] Cyril Laurier, Perfecto Herrera, M Mandel, and D Ellis. 2007. Audio music mood classification using support vectormachine. In
MIREX task on Audio Mood Classification . 2–4.[66] Joonwhoan Lee and EunJong Park. 2011. Fuzzy similarity-based emotional classification of color images.
IEEETransactions on Multimedia
13, 5 (2011), 1031–1039.[67] Bing Li, Weihua Xiong, Weiming Hu, and Xinmiao Ding. 2012. Context-aware affective images classification basedon bilayer sparse representation. In
ACM International Conference on Multimedia . 721–724.ACM Trans. Multimedia Comput. Commun. Appl., Vol. 1, No. 1, Article 1. Publication date: January 2019. ffective Computing for Large-Scale Heterogeneous Multimedia Data: A Survey 1:27 [68] Benjamin J. Li, Jeremy N. Bailenson, Adam Pines, Walter J. Greenleaf, and Leanne M. Williams. 2017. A PublicDatabase of Immersive VR Videos with Corresponding Ratings of Arousal, Valence, and Correlations between HeadMovements and Self Report Measures.
Frontiers in Psychology
IEEE International Conference on Computer Vision . 1449–1457.[70] Xin Lu, Poonam Suryanarayan, Reginald B Adams Jr, Jia Li, Michelle G Newman, and James Z Wang. 2012. On shapeand the computability of emotions. In
ACM International Conference on Multimedia . 229–238.[71] Jana Machajdik and Allan Hanbury. 2010. Affective image classification using features inspired by psychology andart theory. In
ACM International Conference on Multimedia . 83–92.[72] Nikos Malandrakis, Alexandros Potamianos, Georgios Evangelopoulos, and Athanasia Zlatintsi. 2011. A supervisedapproach to movie emotion tracking. In
IEEE International Conference on Acoustics, Speech and Signal Processing .2376–2379.[73] Ricardo Malheiro, Renato Panda, Paulo Gomes, and Rui Pedro Paiva. 2016. Emotionally-relevant features forclassification and regression of music lyrics.
IEEE Transactions on Affective Computing
9, 2 (2016), 240–254.[74] Daniel McDuff and Mohammad Soleymani. 2017. Large-scale affective content analysis: Combining media contentfeatures and facial reactions. In
IEEE International Conference on Automatic Face & Gesture Recognition . 339–345.[75] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. 2015.librosa: Audio and music signal analysis in python. In
Proceedings of the Python in Science Conferences , Vol. 8. 18–25.[76] Michele Merler, Khoi-Nguyen C. Mac, Dhiraj Joshi, Quoc-Bao Nguyen, Stephen Hammer, John Kent, Jinjun Xiong,Minh N. Do, John R. Smith, and Rogerio S. Feris. 2018. Automatic curation of sports highlights using multimodalexcitement features.
IEEE Transactions on Multimedia
21, 5 (2018), 1147–1160.[77] Joseph A Mikels, Barbara L Fredrickson, Gregory R Larkin, Casey M Lindberg, Sam J Maglio, and Patricia A Reuter-Lorenz. 2005. Emotional category data on images from the International Affective Picture System.
Behavior ResearchMethods
37, 4 (2005), 626–630.[78] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Wordsand Phrases and their Compositionality. In
Advances in Neural Information Processing Systems 26 . 3111–3119.[79] Shasha Mo, Jianwei Niu, Yiming Su, and Sajal K Das. 2018. A novel feature set for video emotion recognition.
Neurocomputing
291 (2018), 11–20.[80] Fabio Morreale, Raul Masu, Antonella De Angeli, et al. 2013. Robin: an algorithmic composer for interactive scenarios.
Sound and Music Computing Conference
IEEE Transactions on Affective Computing (2019).[82] Shahla Nemati and Ahmad Reza Naghsh-Nilchi. 2016. Incorporating social media comments in affective videoretrieval.
Journal of Information Science
42, 4 (2016), 524–538.[83] Shahla Nemati and Ahmad Reza Naghsh-Nilchi. 2017. An evidential data fusion method for affective music videoretrieval.
Intelligent Data Analysis
21, 2 (2017), 427–441.[84] Mihalis A Nicolaou, Hatice Gunes, and Maja Pantic. 2011. Continuous Prediction of Spontaneous Affect from MultipleCues and Modalities in Valence-Arousal Space.
IEEE Transactions on Affective Computing
2, 2 (2011), 92–105.[85] Jianwei Niu, Yiming Su, Shasha Mo, and Zeyu Zhu. 2017. A Novel Affective Visualization System for Videos Basedon Acoustic and Visual Features. In
International Conference on Multimedia Modeling . 15–27.[86] Jianwei Niu, Shihao Wang, Yiming Su, and Song Guo. 2017. Temporal Factor-Aware Video Affective Analysis andRecommendation for Cyber-Based Social Media.
IEEE Transactions on Emerging Topics in Computing
Neurocomputing
173 (2016), 339–345.[88] Andrew Ortony, Gerald L. Clore, and Allan Collins. 1988.
The Cognitive Structure of Emotions . Cambridge UniversityPress.[89] Renato Eduardo Silva Panda. 2019.
Emotion-based Analysis and Classification of Audio Music . Ph.D. Dissertation.Univesity of Coimbra, Coimbra, Portugal.[90] Yagya Raj PANDEYA and LEE Joonwhoan. 2019. Music-Video Emotion Analysis Using Late Fusion of Multimodal.
DEStech Transactions on Computer Science and Engineering iteee (2019).[91] Bo Pang, Lillian Lee, et al. 2008. Opinion mining and sentiment analysis.
Foundations and Trends® in InformationRetrieval
2, 1–2 (2008), 1–135.[92] Lei Pang and Chong-Wah Ngo. 2015. Mutlimodal learning with deep boltzmann machine for emotion prediction inuser generated videos. In
ACM International Conference on Multimedia Retrieval . 619–622.[93] Lei Pang, Shiai Zhu, and Chong Wah Ngo. 2015. Deep Multimodal Learning for Affective Analysis and Retrieval.
IEEE Transactions on Multimedia
17, 11 (2015), 2008–2020.ACM Trans. Multimedia Comput. Commun. Appl., Vol. 1, No. 1, Article 1. Publication date: January 2019. :28 Zhao et al. [94] Genevieve Patterson and James Hays. 2012. Sun attribute database: Discovering, annotating, and recognizing sceneattributes. In
IEEE Conference on Computer Vision and Pattern Recognition . 2751–2758.[95] Kuan-Chuan Peng, Amir Sadovnik, Andrew Gallagher, and Tsuhan Chen. 2015. A Mixed Bag of Emotions: Model,Predict, and Transfer Emotion Distributions. In
IEEE Conference on Computer Vision and Pattern Recognition . 860–868.[96] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation.In
Conference on Empirical Methods in Natural Language Processing . 1532–1543.[97] Robert Plutchik. 1980.
Emotion: A psychoevolutionary synthesis . Harpercollins College Division.[98] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2019.MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In
Annual Meeting of theAssociation for Computational Linguistics . 527–536.[99] Shyam Sundar Rajagopalan, Louis-Philippe Morency, Tadas BaltrusÌĘaitis, and Roland Goecke. 2016. Extending LongShort-Term Memory for Multi-View Structured Learning. In
European Conference on Computer Vision . 338–353.[100] Tianrong Rao, Min Xu, Huiying Liu, Jinqiao Wang, and Ian Burnett. 2016. Multi-scale blocks based image emotionclassification using multiple instance learning. In
IEEE International Conference on Image Processing . 634–638.[101] Tianrong Rao, Min Xu, and Dong Xu. 2016. Learning multi-level deep representations for image emotion classification. arXiv preprint arXiv:1611.07145 (2016).[102] David Sander, Didier Grandjean, and Klaus R. Scherer. 2005. A systems approach to appraisal mechanisms in emotion.
Neural Networks
18, 4 (2005), 317–352.[103] Andreza Sartori, Dubravko Culibrk, Yan Yan, and Nicu Sebe. 2015. Who’s afraid of itten: Using the art theory of colorcombination to analyze emotions in abstract paintings. In
ACM International Conference on Multimedia . 311–320.[104] Harold Schlosberg. 1954. Three dimensions of emotion.
Psychological Review
61, 2 (1954), 81.[105] Dongyu She, Jufeng Yang, Ming-Ming Cheng, Yu-Kun Lai, Paul L Rosin, and Liang Wang. 2019. WSCNet: Weaklysupervised coupled networks for visual sentiment classification and detection.
IEEE Transactions on Multimedia (2019).[106] Guangyao Shen, Jia Jia, Liqiang Nie, Fuli Feng, Cunjun Zhang, Tianrui Hu, Tat-Seng Chua, and Wenwu Zhu. 2017.Depression Detection via Harvesting Social Media: A Multimodal Dictionary Learning Solution.. In
InternationalJoint Conference on Artificial Intelligence . 3838–3844.[107] Abhinav Shukla. 2018.
Multimodal Emotion Recognition from Advertisements with Application to ComputationalAdvertising . Ph.D. Dissertation. International Institute of Information Technology Hyderabad.[108] Abhinav Shukla, Shruti Shriya Gullapuram, Harish Katti, Karthik Yadati, Mohan Kankanhalli, and RamanathanSubramanian. 2017. Affect recognition in ads with application to computational advertising. In
ACM InternationalConference on Multimedia . 1148–1156.[109] Abhinav Shukla, Shruti Shriya Gullapuram, Harish Katti, Karthik Yadati, Mohan Kankanhalli, and RamanathanSubramanian. 2017. Evaluating content-centric vs. user-centric ad affect recognition. In
ACM International Conferenceon Multimodal Interaction . 402–410.[110] Abhinav Shukla, Harish Katti, Mohan Kankanhalli, and Ramanathan Subramanian. 2018. Looking Beyond a CleverNarrative: Visual Context and Attention are Primary Drivers of Affect in Video Advertisements. In
ACM InternationalConference on Multimodal Interaction . 210–219.[111] Sarath Sivaprasad, Tanmayee Joshi, Rishabh Agrawal, and Niranjan Pedanekar. 2018. Multimodal ContinuousPrediction of Emotions in Movies using Long Short-Term Memory Networks. In
ACM International Conference onMultimedia Retrieval . 413–419.[112] Mats Sjöberg, Yoann Baveye, Hanli Wang, Vu Lam Quang, Bogdan Ionescu, Emmanuel Dellandréa, Markus Schedl,Claire-Hélène Demarty, and Liming Chen. 2015. The MediaEval 2015 Affective Impact of Movies Task. In
MediaEval .[113] John R. Smith, Dhiraj Joshi, Benoit Huet, Winston Hsu, and Jozef Cota. 2017. Harnessing A.I. for augmenting creativity:Application to movie trailer creation. In
ACM International Conference on Multimedia . 1799–1808.[114] Mohammad Soleymani. 2015. The quest for visual interest. In
ACM International Conference on Multimedia . 919–922.[115] Mohammad Soleymani, Micheal N. Caro, Erik M. Schmidt, Cheng-Ya Sha, and Yi-Hsuan Yang. 2013. 1000 Songs forEmotional Analysis of Music. In
ACM International Workshop on Crowdsourcing for Multimedia . 1–6.[116] Mohammad Soleymani, David Garcia, Brendan Jou, Björn Schuller, Shih-Fu Chang, and Maja Pantic. 2017. A surveyof multimodal sentiment analysis.
Image and Vision Computing
65 (2017), 3–14.[117] Mohammad Soleymani, Joep JM Kierkels, Guillaume Chanel, and Thierry Pun. 2009. A bayesian framework for videoaffective representation. In
International Conference on Affective Computing and Intelligent Interaction and Workshops .1–7.[118] Mohammad Soleymani, Martha Larson, Thierry Pun, and Alan Hanjalic. 2014. Corpus development for affectivevideo indexing.
IEEE Transactions on Multimedia
16, 4 (2014), 1075–1089.[119] Mohammad Soleymani, Jeroen Lichtenauer, Thierry Pun, and Maja Pantic. 2012. A multimodal database for affectrecognition and implicit tagging.
IEEE Transactions on Affective Computing
3, 1 (2012), 42–55.ACM Trans. Multimedia Comput. Commun. Appl., Vol. 1, No. 1, Article 1. Publication date: January 2019. ffective Computing for Large-Scale Heterogeneous Multimedia Data: A Survey 1:29 [120] Yale Song and Mohammad Soleymani. 2019. Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval. In
[121] Jacquelin A Speck, Erik M Schmidt, Brandon G Morton, and Youngmoo E Kim. 2011. A Comparative Study of Collaborative vs. Traditional Musical Mood Annotation. In International Society for Music Information Retrieval Conference, Vol. 104. 549–554.
[122] Jacopo Staiano and Marco Guerini. 2014. Depeche Mood: a Lexicon for Emotion Analysis from Crowd Annotated News. In Annual Meeting of the Association for Computational Linguistics. 427–433.
[123] Ramanathan Subramanian, Julia Wache, Mojtaba Khomami Abadi, Radu L Vieriu, Stefan Winkler, and Nicu Sebe. 2018. ASCERTAIN: Emotion and personality recognition using commercial sensors. IEEE Transactions on Affective Computing 9, 2 (2018), 147–160.
[124] Kai Sun, Junqing Yu, Yue Huang, and Xiaoqiang Hu. 2009. An improved valence-arousal emotion space for video affective content representation and recognition. In IEEE International Conference on Multimedia and Expo. 566–569.
[125] Jussi Tarvainen, Jorma Laaksonen, and Tapio Takala. 2018. Film mood and its quantitative determinants in different types of scenes. IEEE Transactions on Affective Computing (2018).
[126] René Marcelino Abritta Teixeira, Toshihiko Yamasaki, and Kiyoharu Aizawa. 2012. Determination of emotional content of video clips by low-level audiovisual features. Multimedia Tools and Applications 61, 1 (2012), 21–49.
[127] Antonio Torralba and Alexei A Efros. 2011. Unbiased look at dataset bias. In IEEE Conference on Computer Vision and Pattern Recognition. 1521–1528.
[128] Douglas Turnbull, Luke Barrington, David Torres, and Gert Lanckriet. 2007. Towards musical query-by-semantic-description using the CAL500 data set. In International ACM SIGIR Conference on Research and Development in Information Retrieval. 439–446.
[129] George Tzanetakis and Perry Cook. 2000. Marsyas: A framework for audio analysis. Organised Sound 4, 3 (2000), 169–175.
[130] Shangfei Wang, Shiyu Chen, and Qiang Ji. 2019. Content-based video emotion tagging augmented by users’ multiple physiological responses. IEEE Transactions on Affective Computing 10, 2 (2019), 155–166.
[131] Shangfei Wang and Qiang Ji. 2015. Video affective content analysis: a survey of state-of-the-art methods. IEEE Transactions on Affective Computing 6, 4 (2015), 410–430.
[132] Shangfei Wang, Yachen Zhu, Lihua Yue, and Qiang Ji. 2015. Emotion recognition with the help of privileged information. IEEE Transactions on Autonomous Mental Development 7, 3 (2015), 189–200.
[133] Xiaohui Wang, Jia Jia, Jiaming Yin, and Lianhong Cai. 2013. Interpretable aesthetic features for affective image classification. In IEEE International Conference on Image Processing. 3230–3234.
[134] Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods 45, 4 (2013), 1191–1207.
[135] Baixi Xing, Hui Zhang, Kejun Zhang, Lekai Zhang, Xinda Wu, Xiaoying Shi, Shanghai Yu, and Sanyuan Zhang. 2019. Exploiting EEG Signals and Audiovisual Feature Fusion for Video Emotion Recognition. IEEE Access (2019).
[136] Baohan Xu, Yanwei Fu, Yu-Gang Jiang, Boyang Li, and Leonid Sigal. 2016. Heterogeneous knowledge transfer in video emotion recognition, attribution and summarization. IEEE Transactions on Affective Computing 9, 2 (2016), 255–270.
[137] Can Xu, Suleyman Cetintas, Kuang-Chih Lee, and Li-Jia Li. 2014. Visual Sentiment Prediction with Deep Convolutional Neural Networks. arXiv preprint arXiv:1411.5731 (2014).
[138] Jufeng Yang, Dongyu She, Yukun Lai, and Ming-Hsuan Yang. 2018. Retrieving and classifying affective images via deep metric learning. In AAAI Conference on Artificial Intelligence.
[139] Jufeng Yang, Dongyu She, Yu-Kun Lai, Paul L Rosin, and Ming-Hsuan Yang. 2018. Weakly supervised coupled networks for visual sentiment analysis. In IEEE Conference on Computer Vision and Pattern Recognition. 7584–7592.
[140] Jufeng Yang, Dongyu She, and Ming Sun. 2017. Joint image emotion classification and distribution learning via deep convolutional neural network. In International Joint Conference on Artificial Intelligence. 3266–3272.
[141] Jufeng Yang, Ming Sun, and Xiaoxiao Sun. 2017. Learning Visual Sentiment Distributions via Augmented Conditional Probability Neural Network. In AAAI Conference on Artificial Intelligence. 224–230.
[142] Peng Yang, Qingshan Liu, and Dimitris N Metaxas. 2010. Exploring facial expressions with compositional features. In IEEE Conference on Computer Vision and Pattern Recognition. 2638–2644.
[143] Xinyu Yang, Yizhuo Dong, and Juan Li. 2018. Review of data features-based music emotion recognition methods. Multimedia Systems 24, 4 (2018), 365–389.
[144] Yang Yang, Jia Jia, Shumei Zhang, Boya Wu, Qicong Chen, Juanzi Li, Chunxiao Xing, and Jie Tang. 2014. How Do Your Friends on Social Media Disclose Your Emotions? In AAAI Conference on Artificial Intelligence. 306–312.
[145] Yi-Hsuan Yang and Homer H Chen. 2012. Machine recognition of music emotion: A review. ACM Transactions on Intelligent Systems and Technology 3, 3 (2012), 40.
[146] Xingxu Yao, Dongyu She, Sicheng Zhao, Jie Liang, Yu-Kun Lai, and Jufeng Yang. 2019. Attention-aware Polarity Sensitive Embedding for Affective Image Retrieval. In IEEE International Conference on Computer Vision.
[147] Yun Yi and Hanli Wang. 2018. Multi-modal learning for affective content analysis in movies. Multimedia Tools and Applications (2018), 1–20.
[148] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. 2015. Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks. In AAAI Conference on Artificial Intelligence. 381–388.
[149] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. 2016. Building a large scale dataset for image emotion recognition: The fine print and the benchmark. In AAAI Conference on Artificial Intelligence. 308–314.
[150] Jianbo Yuan, Sean Mcdonough, Quanzeng You, and Jiebo Luo. 2013. Sentribute: image sentiment analysis from a mid-level perspective. In ACM International Workshop on Issues of Sentiment Discovery and Opinion Mining. 10.
[151] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor Fusion Network for Multimodal Sentiment Analysis. In Conference on Empirical Methods in Natural Language Processing. 1103–1114.
[152] Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018. Multi-attention recurrent network for human communication comprehension. In AAAI Conference on Artificial Intelligence.
[153] Marcel Zentner, Didier Grandjean, and Klaus R Scherer. 2008. Emotions evoked by the sound of music: characterization, classification, and measurement. Emotion 8, 4 (2008), 494–521.
[154] Chi Zhan, Dongyu She, Sicheng Zhao, Ming-Ming Cheng, and Jufeng Yang. 2019. Zero-Shot Emotion Recognition via Affective Structural Embedding. In IEEE International Conference on Computer Vision.
[155] Kejun Zhang, Hui Zhang, Simeng Li, Changyuan Yang, and Lingyun Sun. 2018. The PMEmo Dataset for Music Emotion Recognition. In ACM International Conference on Multimedia Retrieval. 135–142.
[156] Yanhao Zhang, Lei Qin, Rongrong Ji, Sicheng Zhao, Qingming Huang, and Jiebo Luo. 2016. Exploring coherent motion patterns via structured trajectory learning for crowd mood modeling. IEEE Transactions on Circuits and Systems for Video Technology 27, 3 (2016), 635–648.
[157] Sicheng Zhao, Guiguang Ding, Yue Gao, and Jungong Han. 2017. Approximating Discrete Probability Distribution of Image Emotions by Multi-Modal Features Fusion. In International Joint Conference on Artificial Intelligence. 4669–4675.
[158] Sicheng Zhao, Guiguang Ding, Yue Gao, and Jungong Han. 2017. Learning Visual Emotion Distributions via Multi-Modal Features Fusion. In ACM International Conference on Multimedia. 369–377.
[159] Sicheng Zhao, Guiguang Ding, Yue Gao, Xin Zhao, Youbao Tang, Jungong Han, Hongxun Yao, and Qingming Huang. 2018. Discrete Probability Distribution Prediction of Image Emotions With Shared Sparse Learning. IEEE Transactions on Affective Computing (2018).
[160] Sicheng Zhao, Guiguang Ding, Jungong Han, and Yue Gao. 2018. Personality-Aware Personalized Emotion Recognition from Physiological Signals. In International Joint Conference on Artificial Intelligence.
[161] Sicheng Zhao, Guiguang Ding, Qingming Huang, Tat-Seng Chua, Björn W Schuller, and Kurt Keutzer. 2018. Affective Image Content Analysis: A Comprehensive Survey. In International Joint Conference on Artificial Intelligence. 5534–5541.
[162] Sicheng Zhao, Yue Gao, Xiaolei Jiang, Hongxun Yao, Tat-Seng Chua, and Xiaoshuai Sun. 2014. Exploring principles-of-art features for image emotion recognition. In ACM International Conference on Multimedia. 47–56.
[163] Sicheng Zhao, Amir Gholaminejad, Guiguang Ding, Yue Gao, Jungong Han, and Kurt Keutzer. 2019. Personalized emotion recognition by personality-aware high-order learning of physiological signals. ACM Transactions on Multimedia Computing, Communications, and Applications 15, 1s (2019), 14.
[164] Sicheng Zhao, Zizhou Jia, Hui Chen, Leida Li, Guiguang Ding, and Kurt Keutzer. 2019. PDANet: Polarity-consistent Deep Attention Network for Fine-grained Visual Emotion Regression. In ACM International Conference on Multimedia.
[165] Sicheng Zhao, Chuang Lin, Pengfei Xu, Sendong Zhao, Yuchen Guo, Ravi Krishna, Guiguang Ding, and Kurt Keutzer. 2019. CycleEmotionGAN: Emotional Semantic Consistency Preserved CycleGAN for Adapting Image Emotions. In AAAI Conference on Artificial Intelligence. 2620–2627.
[166] Sicheng Zhao, Hongxun Yao, Yue Gao, Guiguang Ding, and Tat-Seng Chua. 2018. Predicting personalized image emotion perceptions in social networks. IEEE Transactions on Affective Computing 9, 4 (2018), 526–540.
[167] Sicheng Zhao, Hongxun Yao, Yue Gao, Rongrong Ji, and Guiguang Ding. 2017. Continuous Probability Distribution Prediction of Image Emotions via Multi-Task Shared Sparse Regression. IEEE Transactions on Multimedia 19, 3 (2017), 632–645.
[168] Sicheng Zhao, Hongxun Yao, Yue Gao, Rongrong Ji, Wenlong Xie, Xiaolei Jiang, and Tat-Seng Chua. 2016. Predicting personalized emotion perceptions of social images. In ACM International Conference on Multimedia. 1385–1394.
[169] Sicheng Zhao, Hongxun Yao, Xiaolei Jiang, and Xiaoshuai Sun. 2015. Predicting discrete probability distribution of image emotions. In IEEE International Conference on Image Processing. 2459–2463.
[170] Sicheng Zhao, Hongxun Yao, Xiaoshuai Sun, Xiaolei Jiang, and Pengfei Xu. 2013. Flexible presentation of videos based on affective content analysis. In International Conference on Multimedia Modeling. 368–379.
[171] Sicheng Zhao, Hongxun Yao, You Yang, and Yanhao Zhang. 2014. Affective image retrieval via multi-graph learning. In ACM International Conference on Multimedia. 1025–1028.
[172] Sicheng Zhao, Xin Zhao, Guiguang Ding, and Kurt Keutzer. 2018. EmotionGAN: unsupervised domain adaptation for learning discrete probability distributions of image emotions. In ACM International Conference on Multimedia. 1319–1327.
[173] Sheng-hua Zhong, Jiaxin Wu, and Jianmin Jiang. 2019. Video summarization via spatio-temporal deep architecture. Neurocomputing 332 (2019), 224–235.
[174] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks. In IEEE International Conference on Computer Vision. 2223–2232.
[175] Xinge Zhu, Liang Li, Weigang Zhang, Tianrong Rao, Min Xu, Qingming Huang, and Dong Xu. 2017. Dependency exploitation: a unified CNN-RNN approach for visual emotion recognition. In International Joint Conference on Artificial Intelligence. 3595–3601.
[176] Yingying Zhu, Zhengbo Jiang, Jianfeng Peng, and Sheng-hua Zhong. 2016. Video affective content analysis based on protagonist via convolutional neural network. In Pacific Rim Conference on Multimedia. 170–180.
[177] Yingying Zhu, Min Tong, Zhengbo Jiang, Shenghua Zhong, and Qi Tian. 2019. Hybrid feature-based analysis of video’s affective content using protagonist detection. Expert Systems with Applications 128 (2019), 316–326.
[178] Athanasia Zlatintsi, Petros Koutras, Georgios Evangelopoulos, Nikolaos Malandrakis, Niki Efthymiou, Katerina Pastra, Alexandros Potamianos, and Petros Maragos. 2017. COGNIMUSE: a multimodal video database annotated with saliency, events, semantics and emotion with application to summarization.