The Focus-Aspect-Polarity Model for Predicting Subjective Noun Attributes in Images
Tushar Karayil, Philipp Blandfort, Jörn Hees, Andreas Dengel
Tushar Karayil (DFKI, Germany), [email protected]
Philipp Blandfort (DFKI and TUK, Germany), [email protected]
Jörn Hees (DFKI, Germany), [email protected]
Andreas Dengel (DFKI, Germany), [email protected]
Equal contribution.
Abstract
Subjective visual interpretation is a challenging yet important topic in computer vision. Many approaches reduce this problem to the prediction of adjective- or attribute-labels from images. However, most of these do not take attribute semantics into account, or only process the image in a holistic manner. Furthermore, there is a lack of relevant datasets with fine-grained subjective labels. In this paper, we propose the Focus-Aspect-Polarity model to structure the process of capturing subjectivity in image processing, and introduce a novel dataset following this way of modeling. We run experiments on this dataset to compare several deep learning methods and find that incorporating context information based on tensor multiplication in several cases outperforms the default way of information fusion (concatenation).
1. Introduction
Subjectivity is the phenomenon wherein human perception is influenced by personal feelings, tastes, opinions, etc. The variance which arises as a result of this phenomenon plays a crucial role in the visual domain. For example, the meaning that we infer from an image can depend on: our internal templates about the stimuli [26], expectations and learned biases about the visual object [7], context / prior visual input [6], random neural fluctuations in cortex [16], and other factors like the personality of the interpreting individual. This innate diversity in interpretation has made evaluation and computational modeling of subjectivity a difficult task.

The challenge in modeling subjectivity arises from two main sources.
Figure 1. Illustration of the task. The model takes an image and a noun (focus) present in the image as input. It outputs the corresponding aspects and the polarity of each aspect. For the given image in the illustration, since the focus is on the noun "person", the model identifies the aspects "Age" and "Happiness" as the most appropriate. The polarity provided for each aspect determines which set of attributes suits the noun in the context of the image. For the aspect age in the given example, the polarity output indicates that suitable attributes for the person in the image would be "old", "elderly", "mature" or "senior", in contrast to "young". Attribution of image: "Old man and his dog" by Katie Cook, used under CC BY-NC-ND 2.0.

First, subjective interpretation is by definition arbitrary in a certain sense, since there is no a priori objective taste, feeling, or opinion, and at times such context information might not be accessible at all. In particular, this poses challenges to evaluation, and in many cases it is reasonable to expect that there will be a larger margin to a perfect score. Second, subjectivity tends to be more fine-grained than objectivity. For example, in images, the objective information that is detected is typically about which entities are visible, while subjective information is rather about characterizing how these entities or the picture as a whole differ from some expectation [6, 7].

To attenuate these issues, previous methods typically consider holistic aspects of subjectivity (e.g. in visual sentiment analysis) or mix subjective components with non-subjective components (as in adjective-noun pairs) [3]. A problem with the latter is that, in evaluation, these components are mixed and might be hard to separate later on, while the original interest was to focus on the subjective parts. Additionally, existing works which use this approach do not include any sophisticated structuring of the subjective components. We also found that there is a clear lack of datasets with more fine-grained or structured subjective aspects annotated.

In order to overcome these shortcomings, we propose a novel dataset (aspects-DB) and the Focus-Aspect-Polarity model for subjective visual interpretation, disentangling three components of subjectivity: 1) focus: the center of attention, 2) aspect: which dimension to evaluate on, and 3) polarity: the result of this evaluation. Our proposed way of modeling is illustrated in Figure 1. Briefly put, our model works as follows: Given an image and, as context, a noun present in the image (as a proxy to describe which part of the image is attended to), we would like to first identify which dimension of evaluation (represented by aspects) one is likely to use for describing the noun in the given image, and secondly predict how the noun would be evaluated with respect to these dimensions of evaluation (represented by aspect polarities). Finally, in this paper we analyze several methods for the emerging tasks aspect prediction and polarity detection, thereby providing an overview of different ways of using context information in this particular case and revealing general open issues.

The rest of the paper is organized as follows. Section 2 surveys related work relevant to this paper. Section 3 introduces the model and the new dataset. Section 4 explains the two tasks that form the core of this method. Section 5 gives a detailed description of the experiments and architectures used. Section 6 provides our insights and findings from the experiments along with the open questions. Section 7 concludes the paper with a summary and future work.
2. Related Work
This section can be broadly divided into three segments: First, the methods which attempt to capture subjectivity. Second, the methods which use adjective-noun pairs for this purpose. Finally, the available attribute detection datasets.
There have been many promising approaches which researchers have employed for detecting subjective parts of visual interpretation. While some works focused on attributes to enhance the quality of nouns [10, 21, 4], others focused on understanding aesthetics [23, 9].

The authors of [4, 3] proposed the large-scale visual sentiment ontology to detect adjective-noun pairs inside an image. Given an image, they propose to find, from a set of adjective-noun pairs, the pair that best describes the image. Although adjective-noun pairs capture the sentiment to an extent, they do not reveal the degree to which this sentiment applies. Moreover, relying on a single adjective-noun pair to describe the whole image would mean only the most prominent noun is focused upon.

The authors of [22] propose a cross-modal mapping from a visual semantic space onto a linguistic space in order to automatically annotate images with adjectives. The mapping is performed by a projection function that maps the vector representation of an image tagged with an object / attribute onto the linguistic representation of the object / attribute word. This mapping function can then be applied to any given image to obtain its linguistic projection. The main advantage, as claimed by [22], is that of zero-shot learning, i.e., unseen attributes (not present in training) can be predicted. However, in this approach the whole image is mapped onto an adjective without focusing on any particular noun or aspect. Our method, on the other hand, focuses on finding a suitable set of adjectives for a given noun in the context of an image, and still allows for zero-shot learning.
Our work mainly builds on a line of work originating from the Visual Sentiment Ontology [5] proposed by Borth et al., which aims at detecting adjective-noun combinations in images. So far, the best performing method in this direction is the cross-residual network (XResNet) [17], which we include in our experiments and describe in detail in Section 5.2. For any given image, XResNet outputs scores for adjective-noun combinations as well as scores for all individual adjectives and nouns separately. This means that it separates the more subjective parts of interpretation (represented by the adjectives) from the more objective ones (represented by the nouns).

There are two major datasets that have been used for training the above-mentioned architectures: the Visual Sentiment Ontology (VSO) [5] and the Multilingual Visual Sentiment Ontology (MVSO) [18]. These datasets have been created from the popular photo-sharing platform Flickr. However, the data in these cases suffers from a clear bias towards positive attributes / adjectives [19]. The authors of [17] have taken some efforts towards achieving a better overall balance, but even there, for any given noun the number of associated adjectives is typically very small and the distribution heavily skewed. More importantly, the "feasible" adjectives for a given noun are in most cases not mutually exclusive, at times even similar in meaning (e.g. "smiling person" and "happy person"), and yet any non-ground-truth adjective is typically considered to be wrong. This makes it harder to interpret performances on these datasets in terms of the ability to capture subjective aspects.

These issues can be overcome by considering the problem of attribute prediction to focus on the subjective part of visual interpretation: Given an image and an entity (in our case represented by a noun) in the image, estimate the suitability of attributes under consideration of their semantic relations. In this regard we created a dataset with structured and properly balanced attributes for any given noun, thus addressing the issue of balance which was found lacking in the existing VSO and MVSO.
There are several popular attribute datasets available for computer vision research.

The Visual Genome [20] contains over 100,000 images with fine-grained annotations, including region descriptions, object instances and visual attributes in the order of millions. However, the attributes in this dataset mostly relate to objective information. Hence, the most common attributes are colors like white, blue, red and black, and despite the large number of total annotations in Visual Genome, we found the number of subjective attribute instances to be too low for our purpose.

aPascal and aYahoo [11] are two attribute datasets containing natural object-based images with attribute annotations. Here again, the included attributes correspond to objective features, such as parts of a face like eyes, nose and so on, which makes them inappropriate for analyzing subjective interpretation.

Another attribute dataset is the SUN Attribute Dataset [24], which contains scene attributes of the four categories "functions / affordances" (e.g. "diving", "climbing"), "materials", "surface properties" and "spatial envelope". The former three categories are restricted to objective information, and while there are several subjective attributes (such as "scary" or "stressful") in the "spatial envelope" category, all of these annotations describe the scene in a holistic manner.

Overall, none of the available datasets is appropriate for focusing on more fine-grained subjective visual interpretation.
3. Modeling and Dataset
3.1. The Focus-Aspect-Polarity Model

The work of Borth et al. [5] shows that adjective-noun combinations are often visible and reasonably simple to automatically detect in images, presumably because they contain both subjective (in the adjective) and objective (in the noun) information.

If we consider the semantics of adjectives as described by Baroni and Zamparelli in [2], where adjectives are interpreted as modifiers of nouns, we see that visually detecting adjective-noun combinations can be understood as a model that combines attention and evaluation, where the noun describes where the viewer is focusing when interpreting the image and the adjective contains the subjective evaluation of this part of the image.

For the adjective, we want to take a step further and acknowledge the fact that adjectives for the same noun are often semantically related. In other words, they can be organized along various dimensions of evaluation. Examples of such dimensions would be size, age, cuteness or temperature.

So instead of considering any non-ground-truth adjective as wrong and thereby largely ignoring semantic relations between adjectives, we organized the adjectives into opposing lists. Arranging them in this manner paves the way for a more appropriate evaluation, as opposing adjectives (which are mutually exclusive) cannot occur together for the same noun. For example, if we consider the opposing adjectives in ["cute", "adorable"] vs ["scary", "ugly"], classifying a puppy as "cute" or as "adorable" is semantically similar, but "cute" and "scary" cannot apply to the same puppy. To further elaborate, we arranged lists of mutually opposing adjectives, where each opposing list reflects a certain dimension of evaluation, which we call an "aspect", of the noun. The aspects and the adjectives pertaining to these aspects that are incorporated in our dataset are listed in Table 2 and will be derived in Section 3.2.

In summary, we separate three potential sources of subjectivity in our model:
1. Focus: Given a single image, there are typically different components one can pay attention to. For this paper we will assume that this place of focus can be captured by a noun. Note that nouns can relate to an entity in the image (such as "dog" or "dude"), but can also refer to the whole scene (as in "place") or the picture itself ("shot").
2. Aspect: Once the focus has been determined, there are several potential dimensions for evaluation. For example, people in the image can be evaluated with respect to their physical size, age, level of activity and so on. In our dataset, selecting an aspect for evaluation is essentially about choosing a set of semantically related attributes.
3. Polarity: In our case, we chose all aspects to be represented by mutually exclusive sets of adjectives, such that evaluating each aspect amounts to a binary decision problem. For example, physical size would have adjectives like "small", "tiny", "short" on one side and "tall", "big", "huge" on the other. Picking a certain polarity then means to say that one of the adjectives from this side is appropriate to be used as an attribute for the given noun.

Aspect | Noun | Left Polarity | Right Polarity
age | people | "young" | "old", "elderly", "mature", "senior", "aged"
activity | city | "active", "busy" | "sleepy", "sleeping"
happiness | boy | "smiling", "laughing", "happy" | "crying", "sad"
Table 1. Examples of ground truth data in the proposed aspects-DB dataset. Each row represents an aspect, an example noun and sample images which correspond to the left and right polarities. Attributions of images, from left to right: "Young Heart Attack" by Richard Child, used under CC BY-NC-ND 2.0, "Old Chinese Men" by Michael Goodine, used under CC BY-NC-ND 2.0, "Busy Times Square" by Jim Larrison, used under CC BY-NC-ND 2.0, "A quiet Saturday morning" by Pedro, used under CC BY-NC-ND 2.0, "Grandma looks happy" by Praveen, used under CC BY-NC-ND 2.0, "Boy crying" by Francisco Osorio, used under CC BY-NC-ND 2.0.
The following points summarize the key features of this method of modeling:

• Three different sources of subjectivity are disentangled. This brings about the possibility to evaluate these components separately, and can potentially be exploited by computational models, e.g. by learning biases at these distinct levels.

• Semantic relations between attributes are respected. In particular, by detecting aspect polarities instead of individual attributes, we treat attributes of the same polarity as being synonymous for the given aspect. We thereby avoid considering an attribute as wrong if it means the same but is merely phrased differently, as happens for example when using adjective-noun combinations or single attributes as independent class labels.

• This modeling leads to a more sensible way of 0-shot learning for attribute detection, i.e., predicting subjective attributes for nouns for which they were not available during training time. We will explore this direction below.
3.2. The aspects-DB Dataset

To overcome the shortcomings of the existing datasets mentioned above, and to have a fair evaluation for experiments, we decided to create a new dataset called aspects-DB for subjective visual interpretation, following the Focus-Aspect-Polarity model. We will now describe the steps we took for building the dataset.

First, based on the Visual Sentiment Ontology, we compiled thematic lists of nouns, focusing on terms from the urban environment since such images often contain several entities:

• people: "person", "people", "guy", "girl", "woman", "man", "baby", "boy", "child"
• animals: "cat", "dog", "animal", "pet", "puppy", "kitten", "bird"
• buildings: "building", "house", "architecture", "hotel", "church", "restaurant"
• scene: "street", "place", "view", "city", "event", "neighborhood", "location"
• plants: "tree", "plant", "flower"

Second, based on semantic adjective classes in GermaNet [13] we selected a list of adjective classes that can apply to these nouns and can be visible in images. Examples of such semantic classes are appearance ("pretty", "ugly", ...), size ("small", "big", "large", ...) or age ("young", "old", ...). For each such class we fixed its meaning (e.g. evaluation), then came up with one or more adjectives for either side (e.g. "good", "great" vs "bad", "stupid"). This gave us an initial list of aspects with associated attributes (represented by adjectives) grouped into mutually exclusive sets. We then iteratively expanded both sides of each aspect by using synonym and antonym information from a thesaurus. For example, for the aspect evaluation the thesaurus would be invoked to find synonyms of "good", add any of these synonyms to the left side of the aspect if they are mutually exclusive with all attributes on the right side, and add antonyms of "good" to the right side of the aspect if they are mutually exclusive with all attributes on the left side.

Third, for each adjective-noun combination based on the noun list and the initial aspect list, we used the Flickr API to crawl images tagged with "[adjective] [noun]" (as one tag).

Finally, we iteratively performed the following steps for cleaning and structuring the data properly:

• We removed all adjective-noun combinations with less than 20 occurrences.
• For each noun-aspect combination we counted the total number of available images for each polarity. We only kept noun-aspect combinations if for each polarity the total number of available images was at least 100.
• All nouns with less than 500 images in total and less than two available aspects were removed.
• We removed all aspects with a total of less than 500 images for any of the two polarities.
• We manually checked whether images obtained for individual adjective-noun combinations captured the associated aspect and visibly included the noun. All combinations where this was not the case were removed. Note that this led to the removal of almost half of the originally crawled data.
• For each feasible noun-aspect combination we randomly sampled the same number of images for left and right polarity from all relevant adjective-noun combinations. (No images were recycled, i.e., each image in our dataset is used for exactly one noun-aspect combination.)

The final aspects-DB dataset contains 67,818 images in total and features 13 nouns for 6 aspects. A complete list of aspects can be found in Table 2. More detailed statistics are presented in Table 3.
The dataset is balanced on polarity level, i.e., for each noun-aspect combination, half of the available images belong to the left polarity of the aspect and the other half to the right. Since the ground truth was obtained from adjective-noun pairs, we keep the adjective part in our dataset as extra information: for each image, aspects-DB includes a noun, an aspect, the polarity of this aspect and the original adjective the noun was combined with in the adjective-noun tag. Table 1 shows a few examples of nouns, their top aspects and the corresponding left and right polarities. The dataset is available to the public and can be downloaded at http://madm.dfki.de/downloads.

We would like to emphasize that the ground truth in aspects-DB is meant to capture general tendencies in subjective interpretation (where we use tags as proxy). These tendencies must to some extent be corpus / domain specific, and on item level we cannot expect perfect performance. This means that the task is not to detect objectively correct labels as in many common image classification datasets, but to model general biases such as: "for this image of a sleepy puppy and noun dog, people would typically interpret the image with respect to aspect age. Aspects age, activity, evaluation would likely be rated as having polarities - (young), + (sleepy), and - (good) respectively".
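To make the data format concrete, the following is a minimal sketch of what a single aspects-DB record contains, written as a Python dictionary. The field names and the file name are our own placeholders, but the fields mirror the description above (noun, aspect, aspect polarity, and the adjective from the original tag).

```python
# Hypothetical aspects-DB record; field names and file name are placeholders,
# but the fields follow the dataset description above.
sample = {
    "image": "flickr_12345.jpg",   # crawled Flickr image
    "noun": "dog",                 # focus (proxy for the attended part of the image)
    "aspect": "age",               # dimension of evaluation
    "polarity": -1,                # -1 = left polarity ("young"), +1 = right polarity
    "adjective": "young",          # adjective from the original "[adjective] [noun]" tag
}
```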
4. Tasks
4.1. Aspect Prediction

In the first task, an image and a noun are given, and the task is to predict which one of the aspects in our dataset (see Table 2) a subjective interpretation would most likely focus on. For example, given an image with a puppy together with the noun "dog", a likely aspect from our list would typically be age. This problem is modeled as a multi-class classification task, where for each given image and noun, only a single aspect is considered to be correct.

No. | Aspect Name | Attributes with Left Polarity (-1) | Attributes with Right Polarity (+1)

Table 2. Aspects in the aspects-DB dataset. We only list attributes that are included in any adjective-noun combination in the dataset.

Noun | eval. | size | age | happ. | rar. | act.
people | 0 | 0 | 7352 | 0 | 1620 | 0
guy | 1672 | 0 | 296 | 0 | 0 | 0
man | 554 | 0 | 5846 | 0 | 0 | 0
baby | 0 | 0 | 0 | 690 | 0 | 298
boy | 1558 | 0 | 0 | 602 | 0 | 0
cat | 874 | 0 | 960 | 214 | 0 | 0
dog | 2094 | 0 | 402 | 630 | 0 | 922
building | 0 | 312 | 9912 | 0 | 0 | 0
house | 0 | 2084 | 7814 | 0 | 0 | 0
architecture | 528 | 0 | 8746 | 0 | 0 | 0
hotel | 342 | 0 | 5384 | 0 | 0 | 0
city | 0 | 698 | 2372 | 0 | 0 | 286
tree | 0 | 2428 | 328 | 0 | 0 | 0

Table 3. Numbers of images for all noun-aspect combinations in our final dataset. For all combinations that are withheld during training for 0-shot experiments (see Section 4), the corresponding numbers are underlined.

Note that our dataset is not balanced on aspect level, and indeed, for a given noun the aspect prior typically strongly favors a certain aspect (see dataset statistics in Table 3). This skewness motivates us to not use an evaluation metric that is based on accuracy. More precisely, for each noun we first compute the average F1 score across all applicable aspects, and then average this number across all nouns to obtain the overall performance measure of a model:

$$\mathrm{F1}_{asp} := \frac{1}{|\mathcal{N}|} \sum_{n \in \mathcal{N}} \bigg( \frac{1}{|\mathcal{A}(n)|} \sum_{a \in \mathcal{A}(n)} \mathrm{F1}_n(a) \bigg),$$

where $\mathcal{A}(n)$ denotes all aspects that are available for noun $n$, $\mathcal{N}$ the set of nouns, and $\mathrm{F1}_n(a)$ is the F1 score for aspect $a$ calculated over all images for noun $n$.

All available data is, for each polarity, split into 50% training, 20% development and 30% test data.

4.2. Aspect Polarity Detection

Aspect polarity detection is about deciding which polarity applies to a given noun for a given aspect in the context of the input image. Coming back to the previous puppy example of Section 4.1, the true polarity for aspect age would be "left" (corresponding to young age) when given an image of a puppy with the noun context "dog". For training and evaluation we only consider one aspect at a time, hence this problem can be seen as a binary classification task.

Apart from the standard polarity detection task, where the same dataset split is used as for aspect prediction, we consider 0-shot polarity detection, where some aspect-noun combinations (the ones that are underlined in Table 3) are only contained in the final test set and the rest of the data was randomly split into 70% training and 30% development (per aspect-noun combination). It should be noted that zero-shot learning on aspect prediction cannot be done in the same way (unless the noun is left out completely for training): If we remove individual noun-aspect combinations and train a model on the remaining ones, the model generally learns that for these nouns any excluded aspect is not feasible. This points at another problem in the adjective-noun way of modeling, where aspect and aspect polarity are both blended into the adjective information.

For calculating overall accuracy for polarity detection (for both sub-tasks), for each aspect we compute the average accuracy across all nouns, and then compute the average of these numbers:

$$\mathrm{acc}_{pol} := \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \bigg( \frac{1}{|\mathcal{N}(a)|} \sum_{n \in \mathcal{N}(a)} \mathrm{acc}(a, n) \bigg),$$

where $\mathcal{A}$ denotes the set of aspects, $\mathcal{N}(a)$ returns the list of all nouns available for aspect $a$, and $\mathrm{acc}(a, n)$ denotes the accuracy for aspect $a$ and noun $n$.
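As a concrete reading of the two formulas above, the following sketch computes both metrics from per-image predictions. The DataFrame layout (columns noun, aspect_true, aspect_pred, aspect, pol_true, pol_pred) is our own assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

def aspect_f1(df: pd.DataFrame) -> float:
    """F1_asp: average the per-aspect F1 over each noun's available aspects,
    then average over nouns. Assumes one row per test image with columns
    noun, aspect_true, aspect_pred (column names are ours)."""
    per_noun = []
    for _, g in df.groupby("noun"):
        aspects = g["aspect_true"].unique()  # aspects available for this noun
        per_noun.append(np.mean([
            f1_score(g["aspect_true"] == a, g["aspect_pred"] == a)
            for a in aspects
        ]))
    return float(np.mean(per_noun))

def polarity_accuracy(df: pd.DataFrame) -> float:
    """acc_pol: average the per-noun accuracy over each aspect's nouns,
    then average over aspects. Assumes columns noun, aspect, pol_true, pol_pred."""
    per_aspect = []
    for _, g in df.groupby("aspect"):
        per_aspect.append(np.mean([
            accuracy_score(gn["pol_true"], gn["pol_pred"])
            for _, gn in g.groupby("noun")
        ]))
    return float(np.mean(per_aspect))
```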
5. Methods
In this section we explain the methods we compare in our experiments (Section 6), where they are evaluated on both tasks described in the previous section. For all models except the XResNet variants, visual features are extracted from the image by an inception-v3 network [27], which was trained on ImageNet [8] and kept unchanged.
5.1. Logistic Regression

We deploy various models based on logistic regression which take visual features from the inception network as the only input. The motivation for using these models was to have a robust starting point which allows us to compare the effect of modeling the tasks in different ways. More precisely, the different possibilities for modeling the tasks lead to the following three versions (for both aspect and aspect polarity prediction):

• The noun-agnostic version does not consider noun information at any point. Aspect prediction is modeled as a classification task with multiple classes. So for predicting the most likely aspect given an image and noun, a single logistic regression model is trained to output the corresponding class from the visual features, irrespective of the noun. Aspect polarity detection is modeled as separate binary classification problems, i.e., for each aspect, one logistic regression model is trained to detect the polarity of the respective aspect from the image vector, again not taking the noun into account.

• In the noun-specific variant, separate models are trained for distinct nouns. For each individual noun, we then follow the same approach as described in the previous point. This means that for each noun we have one model predicting the most likely aspect, and for each noun-aspect combination we have one model for aspect polarity detection. We explore this possibility as a simple way to take the noun context into account.

• Finally, we consider a logistic regression model (adj-noun) trained on detecting adjective-noun combinations from the inception features. We include this model to analyze the effect of modeling the output as adjective-noun as compared to aspect and polarity. Here, a single model is trained, and conditioning on a noun is done by simply ignoring all outputs with a different noun. To evaluate this model on aspect prediction and aspect polarity prediction, the remaining adjective-noun scores have to be converted to aspect and polarity scores: For aspect prediction, we select the adjective from the highest ranked adjective-noun combination and return the aspect it is contained in. In the case of aspect polarity detection, for any given aspect the adjective-noun outputs are filtered further such that all remaining adjectives are included in this aspect. The polarity of the highest ranked among these adjectives is given as the final polarity output.

In all these cases we use the scikit-learn [25] implementation for training and inference.
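A minimal sketch of the noun-specific variant with scikit-learn is given below. The dummy arrays stand in for the 2048-dimensional inception-v3 embeddings, and all variable names are our own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Dummy stand-ins for inception-v3 features (2048-d) and per-noun aspect labels.
rng = np.random.default_rng(0)
data = {noun: (rng.normal(size=(200, 2048)), rng.integers(0, 3, size=200))
        for noun in ["dog", "city", "building"]}

# Noun-specific variant: one multi-class aspect classifier per noun.
aspect_models = {noun: LogisticRegression(max_iter=1000).fit(X, y)
                 for noun, (X, y) in data.items()}

# Conditioning on the noun then simply means selecting that noun's model.
x_test = rng.normal(size=(1, 2048))
predicted_aspect = aspect_models["dog"].predict(x_test)
```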
5.2. XResNet

Cross-residual networks, or XResNet for short, refers to an architecture which was introduced in [17] for adjective-noun pair detection and is based on the well-known residual networks (ResNet) architecture [15]. Figure 2 shows the structure of the XResNet architecture we used.

Figure 2. XResNet architecture (adapted from [17]). Solid shortcuts indicate identity, dotted connections indicate projections, and dashed shortcuts indicate cross-residual weighted connections. We train and evaluate two different versions of XResNet, one which predicts adjective, adjective-noun and noun, and one which predicts aspect-polarity, aspect-polarity-noun and noun (indicated in the diagram by purple and green color respectively).

The main difference of XResNet as compared to ResNet is that the network branches out at the end into three distinct heads, where these branches remain closely connected to each other via so-called cross-residual connections. The standard XResNet architecture has 50 layers and finally branches out to predict adjectives, nouns and adjective-noun pairs respectively.

We trained this standard model based on the adjective and noun ground truth in our dataset, starting from a pre-trained model and using the settings described in [17] for fine-tuning on our data. Since this method outputs scores for adjectives, nouns and adjective-noun combinations, the output needs to be converted into aspect and aspect polarity information for evaluation. We consider two ways of doing this conversion, each based on a different output branch of the model:

• Using the adjective output (adj): To convert adjective scores to a prediction of the most likely aspect, adjectives are ordered by score and the aspect of the highest ranked adjective is output. For aspect polarity prediction, for any given aspect, all adjectives that are associated with the current aspect are ordered by score and the polarity of the highest ranked among these adjectives is taken as the prediction of the model. Note that this version completely ignores any given noun information.

• Using the adjective-noun output (adj-noun): For both aspect and aspect polarity prediction the output is first filtered based on the given noun, such that only adjective-noun combinations featuring this noun are kept. All remaining adjective-noun combinations only differ in their adjective, and can thus be treated like a list of adjective scores. To obtain the final output we then follow the same steps as described in the previous point (see the sketch at the end of this subsection).

To exclude the possibility that training on adjectives and nouns while testing on aspects and polarities is adversarial to final performance, we train another XResNet model which directly predicts aspect-polarity-noun combinations, aspect-polarity combinations and the noun, instead of adjective-noun, adjective and noun. In this case, the output does not need to be converted, but we still have two possibilities for evaluation, a noun-agnostic one using the aspect-polarity scores for the prediction (asp-pol), and one based on the aspect-polarity-noun output (asp-pol-noun) where conditioning on the noun is done by ignoring all irrelevant output scores.
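The adj-noun conversion described above amounts to filtering followed by an argmax; the sketch below illustrates it with toy scores. All names (aspect_of, polarity_of, the score dictionary) are hypothetical.

```python
# Toy illustration of the adj-noun -> aspect / polarity conversion.
# aspect_of and polarity_of encode the aspect structure of the dataset;
# scores maps (adjective, noun) pairs to model outputs. All names are ours.
aspect_of = {"young": "age", "old": "age", "cute": "evaluation"}
polarity_of = {"young": -1, "old": +1, "cute": -1}
scores = {("young", "dog"): 0.7, ("old", "dog"): 0.2, ("cute", "dog"): 0.5}

def predict_aspect(scores, noun):
    # Keep only combinations with the given noun; output the aspect
    # of the highest ranked remaining adjective.
    filtered = {adj: s for (adj, n), s in scores.items() if n == noun}
    return aspect_of[max(filtered, key=filtered.get)]

def predict_polarity(scores, noun, aspect):
    # Additionally restrict to adjectives belonging to the given aspect.
    filtered = {adj: s for (adj, n), s in scores.items()
                if n == noun and aspect_of[adj] == aspect}
    return polarity_of[max(filtered, key=filtered.get)]

print(predict_aspect(scores, "dog"))           # "age" (via "young": 0.7)
print(predict_polarity(scores, "dog", "age"))  # -1 ("young" outranks "old")
```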
5.3. Concatenation Model

The concatenation model is a straightforward application of information fusion, where a one-hot encoding of the noun is appended to the image embedding obtained from the inception network. This concatenated vector is then used as the input to a multi-layer perceptron (MLP) with one hidden layer.

We build one such model for aspect prediction and one for detecting aspect polarity. For both aspect prediction and aspect polarity detection, the corresponding model has one output neuron per aspect. Note, however, that for polarity detection during training and testing we only consider the output of the unit corresponding to the aspect which is processed at the time.
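A minimal PyTorch sketch of the concatenation model follows. The 2048-dimensional embedding size is that of inception-v3 pooled features; the ReLU activation and default sizes are our assumptions for illustration (the hidden widths 10 and 500 appear in Tables 4 and 5).

```python
import torch
import torch.nn as nn

class ConcatModel(nn.Module):
    """Concatenation baseline: one-hot noun appended to the image embedding,
    fed to an MLP with one hidden layer and one output unit per aspect.
    Default sizes and the ReLU activation are illustrative assumptions."""

    def __init__(self, embed_dim=2048, num_nouns=13, num_aspects=6, hidden=500):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + num_nouns, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_aspects),
        )

    def forward(self, image_embedding, noun_onehot):
        # Information fusion by simple concatenation along the feature axis.
        return self.mlp(torch.cat([image_embedding, noun_onehot], dim=-1))
```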
5.4. Tensor Conditioning

Instead of merely concatenating image features and context, we consider a slightly more sophisticated way of conditioning on the context, using tensor products to combine information. Similar ways of using context information have been used in several publications in the field of natural language processing, for example [14], [12] and [1], but we are not aware of any other publication in computer vision including this method.

This approach is illustrated in Figure 3. The core part is the Tensor Conditioning layer, which can be understood as part of a neural network that combines the noun-agnostic and noun-specific logistic regression models: For each noun $i = 1, \dots, 13$, there is a weight matrix $W_i$ and a bias term $b_i$. In addition, the layer uses a weight matrix $W$ and a bias term $b$ that are used irrespective of the context. Given as input the image embedding $x$ and the $i$-th noun, the output of the Tensor Conditioning layer is then computed as

$$\tanh\big( (W + W_i) \cdot x + b + b_i \big).$$

We now represent nouns as one-hot vectors $n \in \mathbb{R}^{13}$ and put together all noun weight matrices $W_i$ into a third-order tensor $\mathcal{W}$ and all noun biases $b_i$ into a bias matrix $B$. The final layer function $T(x, n)$ can be formulated by using a tensor product between the noun context and the weight tensor to obtain the weight matrix for the given noun:

$$T(x, n) = \tanh\bigg( \Big(W + \sum_{i=1}^{13} W_i \, n_i\Big) \cdot x + b + \sum_{i=1}^{13} b_i \, n_i \bigg) = \tanh\big( (W + \mathcal{W} \cdot n) \cdot x + b + B \cdot n \big).$$

As in the concatenation approach, we deploy separate Tensor Conditioning models for the two tasks of aspect prediction and aspect polarity detection.
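A minimal PyTorch sketch of the Tensor Conditioning layer is given below; the initialization scheme and dimension names are our own choices, but the forward pass follows the formula for T(x, n) above.

```python
import torch
import torch.nn as nn

class TensorConditioning(nn.Module):
    """Tensor Conditioning layer: tanh((W + W·n) x + b + B·n).
    Initialization and default dimensions are illustrative choices."""

    def __init__(self, embed_dim=2048, num_nouns=13, out_dim=6):
        super().__init__()
        # Context-independent weight matrix W and bias b.
        self.W = nn.Parameter(0.01 * torch.randn(out_dim, embed_dim))
        self.b = nn.Parameter(torch.zeros(out_dim))
        # Per-noun weight matrices W_i stacked into a third-order tensor,
        # and per-noun biases b_i stacked into a bias matrix.
        self.W_nouns = nn.Parameter(0.01 * torch.randn(num_nouns, out_dim, embed_dim))
        self.b_nouns = nn.Parameter(torch.zeros(num_nouns, out_dim))

    def forward(self, x, n):
        # x: (batch, embed_dim) image embeddings; n: (batch, num_nouns) one-hot nouns.
        # The tensor product with the noun vector yields the context weights.
        W_ctx = self.W + torch.einsum("bi,ioe->boe", n, self.W_nouns)
        b_ctx = self.b + n @ self.b_nouns
        return torch.tanh(torch.einsum("boe,be->bo", W_ctx, x) + b_ctx)
```

Because the noun vector is one-hot, the product over the noun axis simply picks out the matching W_i and b_i, which matches the selection interpretation given in the Figure 3 caption.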
6. Results
For both tasks described in Section 4, we ran experiments with all conditioning methods explained in Section 5. All hyper-parameters (learning rate, number of hidden units for the concatenation method) were fine-tuned based on performances on training and development data (see Section 4). We report performances on the test data.

Figure 3. An overview of the Tensor Conditioning model. Given as input a one-hot encoded noun and an image, the Tensor Conditioning model embeds the image with a pre-trained inception-v3 network. This image embedding vector is then processed with the Tensor Conditioning layer, which consists of two linear layers, a context-independent and a context-dependent one, followed by element-wise additive fusion of their outputs. For the context-dependent path, the Tensor Conditioning layer keeps a tensor with context-dependent weights and a matrix with context-dependent biases, which are multiplied by the noun vector to obtain the weights and bias. (Since the noun is one-hot encoded, this multiplication amounts to a selection operation.) We use two separate Tensor Conditioning models for our experiments, one for aspect prediction with aspect likelihoods and one for aspect polarity detection with aspect polarities as output.
All results for the aspect prediction task are listed in Table 4. As a statistical baseline, we include the score of a method which ignores the image and randomly predicts an aspect with probability P(aspect | noun), based on general dataset statistics.

As we can see when comparing the performances of the models to the statistical baseline, noun-agnostic logistic regression and the concatenation model apparently did not work for this task at all. Many of the other models achieve comparable scores, so there was no unique best performing method but rather a large group of top models. Among these, Tensor Conditioning performs slightly worse than the rest (0.62 aspect-F1 while the top one achieves 0.64), but as expected it performs very similarly to noun-specific logistic regression. Interestingly, logistic regression (adj-noun) is on par with the top XResNet models (adj-noun and asp-pol-noun), despite being a considerably simpler approach.

There are two methods between the top and bottom groups: the asp-pol and adj versions of XResNet. Here, surprisingly, predicting based on aspects and aspect polarities is worse than predicting based on adjectives, while this discrepancy disappears completely when using the other output layers of the respective models (see adj-noun and asp-pol-noun). In the case of logistic regression, the way of modeling also only marginally affects final performance.

Overall, the noun context seems to be important and is used by the models, but this effect is less pronounced for the more sophisticated XResNet models.

Method | Aspect-F1
statistical baseline | 0.47
logistic regression (adj-noun) |
logistic regression (noun-agnostic) | 0.48
logistic regression (noun-specific) | 0.63
XResNet [17] (adj) | 0.59
XResNet [17] (adj-noun) |
XResNet (asp-pol) | 0.54
XResNet (asp-pol-noun) |
concatenation + MLP (10) | 0.48
concatenation + MLP (500) | 0.51
Tensor Conditioning | 0.62

Table 4. Aspect prediction performances of all models. All methods except the baseline and the XResNet models use inception-v3 to embed the image. The number in parentheses after MLP indicates the number of hidden units used for this model. Please refer to Section 5 for details on the individual models.

Results for both polarity detection experiments can be found in Table 5.
Method | Polarity accuracy (standard) | Polarity accuracy (0-shot)
statistical baseline | 50.0% | 50.0%
logistic regression (adj-noun) | 79.1% | -
logistic regression (noun-agnostic) | 75.3% | 63.4%
logistic regression (noun-specific) |  | -
XResNet [17] (adj) | 75.5% | 61.5%
XResNet [17] (adj-noun) | 78.8% | -
XResNet (asp-pol) | 76.4% | 60.5%
XResNet (asp-pol-noun) |  | -
concatenation + MLP (10) | 69.1% |
concatenation + MLP (500) | 71.4% | 63.8%
Tensor Conditioning | 79.2% | 61.8%

Table 5. Aspect polarity detection performances of all models. All methods except the baseline and the XResNet models use inception-v3 to embed the image. The number in parentheses after MLP indicates the number of hidden units used for this model. Please refer to Section 5 for details on the individual models. Note that not all models are applicable to the 0-shot learning task.

The two best performing models for the standard sub-task are noun-specific logistic regression and the asp-pol-noun XResNet, both of which are not applicable to the 0-shot setting. Among all models with 0-shot capability, Tensor Conditioning performs best on the standard task. Again, the concatenation method works worst among all models, but this time at least it outperforms the baseline by a large margin. For this task, modeling according to aspects and polarities seems to work better in general than using adjectives and nouns (see the corresponding logistic regression and XResNet results). As for aspect prediction, noun information is used but does not lead to terribly large improvements compared to methods not using any noun context.

The performances in the 0-shot experiment are quite comparable, with concatenation achieving the highest overall score. This is an interesting finding, considering that it is worst for polarity detection in the standard task.
7. Conclusion
We introduced a new method for capturing the subjectivity prevalent in images. To overcome several challenges, including the heavy bias towards positive tags / titles in social media, and to make it possible to separately evaluate different parts of subjective visual interpretation, we compiled a new dataset. We ran our experiments on the new dataset and reported the results with different architectures. It was also shown that with the new model, it is possible to perform 0-shot learning to predict unseen noun-attribute combinations. Given the prevalence of simple concatenation for combining information in deep learning approaches, we find it interesting that Tensor Conditioning performed better in two out of three tasks.

Our results raise some fundamental questions, which we want to investigate in the future:

• How can context be modeled optimally? Often researchers use concatenation as a default choice and focus on data or hyper-parameters for improvement without changing this part of the architecture, but our results showed a decrease in performance in two out of three cases with the concatenation method.

• Which properties of the tasks make some methods (like concatenation) fail in one but outperform all other methods in another?

Furthermore, we plan to explore more ways of conditioning on context, and to adapt our approach to applications such as personalized tag prediction and affective image captioning, where biases at different stages of subjective visual interpretation according to the Focus-Aspect-Polarity model can be made dependent on a user context to mimic the subjective interpretation of the given user.
Acknowledgments
This work was supported by the BMBF project DeFuseNN (Grant 01IW17002) and the NVIDIA AI Lab (NVAIL) program.
References

[1] D. Bamman, C. Dyer, and N. A. Smith. Distributed representations of geographically situated language. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 828–834. Association for Computational Linguistics, 2014.
[2] M. Baroni and R. Zamparelli. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 1183–1193, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[3] D. Borth, T. Chen, R. Ji, and S.-F. Chang. SentiBank: Large-scale ontology and classifiers for detecting sentiment and emotions in visual content. In Proceedings of the 21st ACM International Conference on Multimedia, pages 459–460. ACM, 2013.
[4] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In Proceedings of the 21st ACM International Conference on Multimedia, pages 223–232. ACM, 2013.
[5] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In Proceedings of the 21st ACM International Conference on Multimedia, MM '13, pages 223–232, New York, NY, USA, 2013. ACM.
[6] H. H. Bülthoff. Bayesian decision theory and psychophysics. Perception as Bayesian Inference, 123, 1996.
[7] C.-C. Carbon. Cognitive mechanisms for explaining dynamics of aesthetic appreciation. i-Perception, 2(7):708–719, 2011.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[9] S. Dhar, V. Ordonez, and T. L. Berg. High level describable attributes for predicting aesthetics and interestingness. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1657–1664. IEEE, 2011.
[10] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1778–1785. IEEE, 2009.
[11] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785, June 2009.
[12] E. Guevara. A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, GEMS '10, pages 33–37, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[13] B. Hamp and H. Feldweg. GermaNet - a lexical-semantic net for German. In Proceedings of the ACL Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 9–15, 1997.
[14] M. Hartung, F. Kaupmann, S. Jebbara, and P. Cimiano. Learning compositionality functions on word embeddings for modelling attribute meaning in adjective-noun phrases. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 54–64. Association for Computational Linguistics, 2017.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
[16] G. Hesselmann, C. A. Kell, and A. Kleinschmidt. Ongoing activity fluctuations in hMT+ bias the perception of coherent visual motion. Journal of Neuroscience, 28(53):14481–14485, 2008.
[17] B. Jou and S.-F. Chang. Deep cross residual learning for multitask visual recognition. In ACM Multimedia, 2016.
[18] B. Jou, T. Chen, N. Pappas, M. Redi, M. Topkara, and S.-F. Chang. Visual affect around the world: A large-scale multilingual visual sentiment ontology. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 159–168. ACM, 2015.
[19] S. Kalkowski, C. Schulze, A. Dengel, and D. Borth. Real-time analysis and visualization of the YFCC100M dataset. In Proceedings of the 2015 Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions, pages 25–30. ACM, 2015.
[20] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. 2016.
[21] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[22] A. Lazaridou, G. Dinu, A. Liska, and M. Baroni. From visual attributes to adjectives through decompositional distributional semantics. TACL, 3:183–196, 2015.
[23] A. K. Moorthy, P. Obrador, and N. Oliver. Towards computational models of the visual aesthetic appeal of consumer videos. In European Conference on Computer Vision, pages 1–14. Springer, 2010.
[24] G. Patterson, C. Xu, H. Su, and J. Hays. The SUN Attribute Database: Beyond categories for deeper scene understanding. International Journal of Computer Vision, 108(1-2):59–81, 2014.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[26] M. L. Smith, F. Gosselin, and P. G. Schyns. Measuring internal representations from behavioral and brain data. Current Biology, 22(3):191–196, 2012.
[27] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016.