A Novel Attention-based Aggregation Function to Combine Vision and Language
Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
University of Modena and Reggio Emilia
Email: {name.surname}@unimore.it

Abstract—The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements – like regions and words – proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on both the COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.
I. INTRODUCTION
As humans we learn to combine vision and language early in life, building connections between visual stimuli and our ability to communicate in a common natural language. The abundance and diversity of daily-created data pose new unparalleled opportunities in the attempt to artificially reproduce this joint visual-semantic understanding. Recent progress at the intersection of Computer Vision and Natural Language Processing has led to new architectures capable of automatically combining the two modalities, improving the performance of different vision-and-language tasks, such as image captioning [1], [2], cross-modal retrieval [3], [4], [5], [6], and visual question answering [7], [1], [8]. All these settings have usually been addressed by using recurrent neural networks that can naturally model the sequential nature of textual data. However, the recent advent of fully attentive mechanisms, in which the recurrent relation is abandoned in favor of the use of self- and cross-attention, has consistently changed the way to deal with visual and textual data, as testified by the success and performance improvements obtained with the Transformer [9] and BERT [10] models.

Nonetheless, the difficulty in tackling these problems is still given by the huge discrepancy between visual-semantic modalities. In this context, recent research efforts have mainly focused on treating images and text as sets or sequences of building elements, such as image regions and sentence words, leading to a better content understanding of both modalities [1], [6].
Fig. 1. We propose a novel aggregation function for vision-and-language tasks. Given sets of visual and textual inputs, our approach computes a set of scores for each modality, using a novel operator based on cross-attention which ensures a learnable reduction based on cross-modal information flow.
While this approach has allowed more fine-grained alignment and richer representation capabilities of visual-semantic concepts, it has also caused a large increase in the number of features that need to be combined together without losing inter- and intra-modality interactions.

As such, aggregating features represents one of the crucial steps in visual-semantic tasks, in which different pieces of information are fused together to obtain a compact and comprehensive representation of both modalities. In this paper, we tackle the problem of aggregating visual-semantic features in an effective and learnable way, and propose a novel aggregation function based on attentive mechanisms that can be successfully applied to different vision-and-language tasks. Our method can be seen as a variant of the cross-attention schema in which a set of scores is learned to aggregate feature vectors coming from image regions and textual words, thus taking into account the cross-modality interactions between elements (Fig. 1).

We apply our attention-based aggregation function to cross-modal retrieval and visual question answering: in the first case, the compact representation of visual-semantic data is used to measure the similarity between the input image and the textual sentence, while, in the visual question answering task, it is used to compute a classification score over a set of possible answers for each image-question pair. Experimentally, we test our approach on the COCO dataset for cross-modal retrieval and on the VQA 2.0 dataset for visual question answering, and we demonstrate its effectiveness in both settings with respect to different commonly used aggregation functions.

To summarize, our main contributions are as follows:
• We introduce a new aggregation method based on attentive mechanisms that learns a compact representation of sets or sequences of feature vectors.
• We tailor our method to combine vision-and-language data in order to obtain a cross-modal reduction for both classification and ranking objectives. Also, our method can be easily adapted to other tasks requiring an aggregation of elements with minimum changes in the architecture design.
• We show the effectiveness of our solution when compared to other common reduction operators, demonstrating superior performance in aggregating multi-modal features.

II. RELATED WORK
In the last few years, several research efforts have been made to improve the performance of cross-modal retrieval and visual question answering methods, resulting in novel architectures [11], [12], [13], [14], effective training [3], [5] and pre-training [8] strategies, and more powerful representations of images and text [1], [6], [15]. In the following, we review the most important work related to these two visual-semantic tasks and provide a brief overview of feature aggregation functions used in the deep learning literature.
A. Cross-modal Retrieval
The key issue of cross-modal retrieval methods is to measure the visual-semantic similarity between images and textual sentences. Typically, this is achieved by learning a common embedding space where visual and textual data can be projected and compared. One of the first attempts in this direction has been made by Kiros et al. [3], in which a triplet ranking loss is used to maximize the distance between mismatching items and minimize that between matching pairs.

Following this line of work, Faghri et al. [5] introduced a simple modification of standard loss functions, based on the use of hard negatives during training, that has been demonstrated to be effective in improving the final performance and has been widely adopted by several subsequent methods [16], [11], [6]. Among them, Gu et al. [11] further improved the visual-semantic embedding representations by incorporating generative processes of images and text. Differently, Engilberge et al. [16] proposed a novel approach in which spatial pooling mechanisms are used to embed visual features and localize new concepts from the embedding space.

Recently, strong improvements have been obtained with the stacked cross-attention mechanism proposed by Lee et al. [6], in which a latent correspondence between image regions and words of the caption is learned to match images and textual sentences. Wang et al. [12] extended this paradigm by adding the relative position of image regions in the visual encoder, demonstrating better performance. On a similar line, Li et al. [15] introduced a visual-semantic reasoning model based on graph convolutional networks that can generate better visual representations and capture key objects and semantic concepts present in a scene.
B. Visual Question Answering
Many different solutions have been proposed to address the VQA task, ranging from Bayesian [17] and compositional [7], [18] approaches to spatial attention-based methods [19], [1] and bilinear pooling schemes [20]. In the last few years, the use of attention mechanisms has become the leading choice for this task, resulting in new models in which relevance scores over visual and textual features are computed to process only relevant information. Among them, Anderson et al. [1] revisited the standard attention over a spatial grid of features and proposed to encode images with multiple feature vectors coming from a pre-trained object detector.

After this work, several methods with attention over image regions have been presented [20], [21], [22], [8], [13]. While Cadene et al. [21] proposed a reasoning module to encode the semantic interaction between each visual region and the question, Gao et al. [22] introduced a dynamic fusion framework that integrates inter- and intra-modality information. Differently, Li et al. [13] presented a novel solution based on graph attention networks that considers spatial and semantic relations to enrich image representations.

Following the advent of fully-attentive mechanisms for sequence modeling tasks like machine translation and language understanding [9], [10], different Transformer-based solutions have also been proposed to address multimodal settings [8], [14], [2]. In the context of visual question answering, Yu et al. [23] presented a co-attention module made of a stack of attentive layers based on self-attention, keeping the textual encoder based on recurrent neural networks. Gao et al. [14], instead, introduced a novel architecture entirely based on fully-attentive mechanisms that learns cross-modality relationships between latent summarizations of visual regions and questions. On a similar line, Tan et al. [8] proposed a Transformer-based model that has demonstrated improved performance thanks to a pre-training phase on large amounts of image-sentence pairs.
C. Feature Aggregation Methods
The aggregation of spatial and temporal features is one of the key challenges in deep learning architectures. Different solutions have been proposed, and they heavily depend on the domain in which the aggregation functions are applied (i.e. images or text). While fusing and pooling operations applied over depths, scales, and resolutions constitute fundamental components in visual recognition architectures, the sequential nature of textual data requires different strategies to reduce feature dimensionality.

Regarding the visual domain, starting from the first strategies adopted in early popular deep learning models [24], [25], [26], the architecture design has moved in the last few years towards deeper and wider networks [27], [28], [29] incorporating bottlenecks and connectivity novelties like skipping, gating, and aggregating mechanisms. While going deeper, i.e. aggregating across channels and depths, improves the semantic recognition accuracy, spatial fusion, i.e. aggregating across scales and resolutions, is needed to achieve a better localization capability. In this context, feature pyramid networks [30] are the predominant approach, making use of top-down and lateral connections between feature hierarchical levels.

On a different note, data with a sequential nature such as textual sentences require different solutions to take into account the temporal dependencies between elements. In this setting, the use of recurrent neural networks has remained the most commonly used strategy, where hidden representations, learned through memory and gating mechanisms, are adopted as global encodings of a sequence of feature vectors.

Recently, with the advent of fully attentive architectures [9] that overcame the limitations of recurrent networks, novel solutions have based their global understanding of sequences on the addition of a special CLS token at the beginning of each sequence [10], [8]. Thanks to the use of attention that models inter- and intra-modality connections, this CLS token can learn a compact representation of an input sequence for general classification purposes. Additionally, similar efforts have been made on the encoding of textual sentences, where again mean and max pooling or the CLS token have remained the predominant aggregation approaches [31], [32].

Differently from previous works, we propose a novel aggregation method based on attentive mechanisms that can reduce in a learnable way a set or a sequence of features coming from either the visual or textual domain.

III. PROPOSED METHOD
The popular scaled dot-product attention mechanism operates on a set of input vectors and generates a new set of vectors, where each one has been updated with relevant information coming from the others. This has been proven largely effective in sequence-to-sequence tasks such as natural language understanding and machine translation [9], [10]. However, visual-semantic tasks such as visual question answering and cross-modal retrieval deal with multi-modal input sequences and require alignment between modalities and global perspectives in order to reach a classification or similarity output.

To this end, we propose a novel attention-based aggregation function that learns to align and combine two sets of features into a global and compact representation based on the cross-domain connections between modalities. In the simplest case, the two sets of features will be regions from an input image and word features from a natural language sentence. In a nutshell, our approach leverages dot-product attention to compute cross-modal scores for each element of the two feature sets. The weights are then used to take a weighted sum of the input feature vectors, thus reducing the two sets into a pair of vectors which can be used for classification or ranking.

In the following, we first present our attention-based reduction method. With the aim of testing the operator on both image-text matching and visual question answering, we then introduce a general architecture for both tasks, where features from multi-modal inputs are extracted and combined. In the last section, we discuss the final stages of the architecture and the training choices.
A. Attention-based Aggregation Function
Motivated by the need of leveraging the information contained in a sequence of vectors and, at the same time, of comparing multi-modal information, our aggregation function is based on the scaled dot-product attention mechanism [9].

To recall what attention is, given three sets of vectors, i.e. queries Q, keys K and values V, scaled dot-product attention computes a weighted sum of the value vectors according to a similarity distribution between query and key vectors. This is usually done in a multi-head fashion, so that for each head h the attention operator is defined as

$$\mathrm{Attention}_h(Q_h, K_h, V_h) = \mathrm{softmax}\left(\frac{Q_h K_h^T}{\sqrt{d}}\right) V_h, \qquad (1)$$

where $Q_h$ is a matrix of $n_q$ query vectors, $K_h$ and $V_h$ both contain $n_k$ keys and values, and $d$ is the dimensionality of queries and keys, used as a scaling factor.

In the case of self-attention, queries, keys and values are obtained for each head as linear projections of the same input vectors belonging to a single modality, while in cross-attention, queries are a projection of one modality's vectors and keys and values are projections of the other modality's vectors. Inspired by cross-attention, we define a Score Attention operator which computes a relevance score for each element of the query sequence, considering its relationships with keys and values coming from the other modality. Formally, given the set of query, key and value vectors from all heads, our Score dot-product attention can be formulated as
$$\mathrm{ScoreAttn}(Q, K, V) = \mathrm{fc}\left(\left[\mathrm{softmax}\left(\frac{Q_h K_h^T}{\sqrt{d}}\right) V_h\right]_h\right), \qquad (2)$$

where Q, K, V indicate the sets of queries, keys and values for the different heads, $[\ldots]_h$ indicates the concatenation of the outputs of all heads, and fc is a linear projection that outputs a single scalar score for each input query.

In order to learn complex interactions between modalities and therefore to guide the reduction process based on the cross-domain relations, Score Attention is applied on queries from one modality and keys and values from the other modality.
Therefore, given a set of input vectors X coming from one modality (e.g. regions of an image) and a set of input vectors Z coming from the other modality (e.g. words of a text), we obtain a final condensed representation for X conditioned on Z as a weighted sum of its vectors using the scores provided by the Score Attention operator, i.e.

$$Y(X, Z) = \sum_{i=0}^{n_q} S_i(X, Z) \cdot X_i \qquad (3)$$

$$S(X, Z) = \mathrm{softmax}\left(\mathrm{ScoreAttn}(Q, K, V)\right), \qquad (4)$$

where queries Q are obtained as projections of X, while keys and values K, V are obtained from Z. The softmax is applied over the $n_q$ scores returned by the Score Attention operator. Conversely, the same applies to the reduction of the other modality Z, by considering Z as the query sequence and X as the key and value one.

As it can be seen from Eq. (2) and (4), our Score Attention operator can be thought of as a cross-attention that, instead of yielding a sequence of vectors, computes a sequence of scores conditioned on keys and values from the other modality. Therefore, the final compressed representation for each modality can capture a global perspective of the input, focusing on elements that show higher importance with respect to the cross-domain interactions.

Noticeably, this aggregation function can be executed multiple times in parallel with different query, key and value projections, thus yielding more than one output vector. This in principle can foster a more disentangled representation, in which different output vectors refer to different global aspects of the same input features. We therefore test our method with different numbers of compressed vectors, and we refer later to this hyper-parameter as k. Whenever the number of vectors is more than one, we average their contributions with a non-learnable reduction operator. More details on this can be found in the Implementation Details section.

Fig. 2. Our architecture for cross-modal feature extraction and matching. After a cross-modal feature extraction stage, the proposed attention-based aggregation function aligns and reduces vectors from both modalities into compact and cross-modal representations.
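As an illustration, the following PyTorch sketch implements the Score Attention reduction of Eqs. (1)–(4) for one modality; class and variable names, as well as the default dimensions, are our own assumptions and not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ScoreAttentionAggregation(nn.Module):
    """Reduces the set x into a single vector, with scores conditioned on the other modality z."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)  # queries from the modality to reduce
        self.k_proj = nn.Linear(d_model, d_model)  # keys from the other modality
        self.v_proj = nn.Linear(d_model, d_model)  # values from the other modality
        self.fc = nn.Linear(d_model, 1)            # one scalar score per query (Eq. 2)

    def forward(self, x, z):
        # x: (B, n_q, d_model) elements to aggregate; z: (B, n_k, d_model) other modality
        B, n_q, _ = x.shape
        q = self.q_proj(x).view(B, n_q, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(z).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(z).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        # scaled dot-product attention per head (Eq. 1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = (attn @ v).transpose(1, 2).reshape(B, n_q, -1)      # concatenate heads
        scores = torch.softmax(self.fc(heads).squeeze(-1), dim=-1)  # (B, n_q), Eq. (4)
        return (scores.unsqueeze(-1) * x).sum(dim=1)                # weighted sum, Eq. (3)
```

Two instances of this module, with the roles of the two inputs swapped, produce the pair of compact vectors used in the rest of the pipeline.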
B. Visual-Semantic Model

To test our aggregation operator, we devise a general architecture for cross-modal feature extraction and matching, with the aim of tackling different tasks with the same common pipeline. Specifically, the architecture is tested on both image-text retrieval and visual question answering. Given input regions from an image, and words from a textual description, we adopt a bi-directional GRU as text encoder, retaining for each word the average embedding between the forward hidden state and the backward hidden state. On the visual side, instead, we apply a linear projection to the features of image regions.

Following recent progress in fully-attentive models and cross-modality interaction [9], [8], after this encoding stage we propagate visual and textual features with a cross-attention operation, followed by a self-attention for each modality. On top of this, two instances of the aggregation operator are applied, one for each modality, thus obtaining one global vector for each modality. A summary of the overall architecture is reported in Fig. 2.
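A minimal sketch of the two encoders described above, i.e. the bi-directional GRU with averaged directions and the linear region projection; the dimensions (word_dim, d_model) are placeholders, since the exact sizes are not specified here.

```python
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, word_dim=300, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, d_model, batch_first=True, bidirectional=True)

    def forward(self, tokens):                      # tokens: (B, n_words)
        out, _ = self.gru(self.embed(tokens))       # (B, n_words, 2 * d_model)
        fwd, bwd = out.chunk(2, dim=-1)             # forward / backward hidden states
        return (fwd + bwd) / 2                      # average the two directions per word

class RegionEncoder(nn.Module):
    def __init__(self, region_dim=2048, d_model=512):
        super().__init__()
        self.proj = nn.Linear(region_dim, d_model)  # linear projection of detector features

    def forward(self, regions):                     # regions: (B, n_regions, region_dim)
        return self.proj(regions)
```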
C. Training
The last stage of the model and the training objectives depend on the specific task. In the following, we report the main differences.
Visual Question Answering.
After applying the aggregation operator, the two vector representations are concatenated and fed to a fully connected layer which is in charge of predicting the final answer class. Additionally, in the case of VQA, we add a position-wise feed-forward layer between the reduction operator and the final concatenation for class prediction. During the training phase, we employ the binary cross-entropy loss in a multi-label fashion, i.e. applying it independently for all classes. For fairness of comparison, we do not make use of any data augmentation strategy and do not employ any external data source, as part of the VQA literature does.
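A compact sketch of this classification head and its multi-label objective; the answer vocabulary size and layer widths are placeholders, and the exact placement of the position-wise feed-forward layer follows our reading of the description above.

```python
import torch
import torch.nn as nn

class VQAHead(nn.Module):
    def __init__(self, d_model=512, n_answers=3000):  # n_answers: placeholder vocabulary size
        super().__init__()
        # position-wise feed-forward applied to each aggregated vector before concatenation
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))
        self.classifier = nn.Linear(2 * d_model, n_answers)

    def forward(self, img_vec, txt_vec):               # both: (B, d_model)
        fused = torch.cat([self.ffn(img_vec), self.ffn(txt_vec)], dim=-1)
        return self.classifier(fused)                  # answer logits: (B, n_answers)

# binary cross-entropy applied independently to every answer class (multi-label)
criterion = nn.BCEWithLogitsLoss()
# loss = criterion(logits, targets)  # targets: (B, n_answers) soft scores in [0, 1]
```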
Cross-modal Retrieval.
In the case of image-text matching, instead, the compressed vectors given by the application of the aggregation operator are compared with a cosine similarity to measure their similarity score. During training, we adopt a hinge-based triplet ranking loss, which is the most common ranking objective in the retrieval literature. Following Faghri et al. [5], we only backpropagate the loss obtained on the hardest negatives found in the mini-batch. Given image and sentence pairs $(I, T)$, our final loss with margin $\alpha$ is thus defined as

$$L_{hard}(I, T) = \max_{\hat{T}} \left[ \alpha - S(I, T) + S(I, \hat{T}) \right]_+ + \max_{\hat{I}} \left[ \alpha - S(I, T) + S(\hat{I}, T) \right]_+,$$

where $S$ indicates the cosine similarity, $[x]_+ = \max(x, 0)$, $\hat{T}$ is the hardest negative sentence and $\hat{I}$ is the hardest negative image.
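A possible implementation of this hardest-negative loss over a mini-batch similarity matrix, in the style of the VSE++ formulation of [5]; the margin value and function name are illustrative, since they are not reported here.

```python
import torch

def hard_triplet_loss(sim, margin=0.2):
    """sim: (B, B) cosine similarities with sim[i, j] = S(image_i, caption_j); matching pairs on the diagonal."""
    pos = sim.diag().view(-1, 1)                        # S(I, T) of the matching pairs
    cost_cap = (margin - pos + sim).clamp(min=0)        # caption negatives for each image (rows)
    cost_img = (margin - pos.t() + sim).clamp(min=0)    # image negatives for each caption (columns)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_cap = cost_cap.masked_fill(mask, 0)            # do not treat the positive pair as a negative
    cost_img = cost_img.masked_fill(mask, 0)
    # keep only the hardest negative per query, then sum over the batch
    return cost_cap.max(dim=1)[0].sum() + cost_img.max(dim=0)[0].sum()
```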
IV. EXPERIMENTAL EVALUATION

In this section, we report the results on the two considered visual-semantic tasks (i.e. visual question answering and cross-modal retrieval) by comparing our attention-based aggregation function with respect to different baselines. First, we provide implementation details and introduce the datasets used in our experiments.
A. Datasets
To validate the effectiveness of our solution, we employ two of the most widely used datasets containing visual-semantic data. In particular, we carry out the experiments on the VQA 2.0 [33] and COCO [34] datasets to address visual question answering and cross-modal retrieval, respectively.
COCO.
The dataset contains more than 120,000 images, each of them annotated with different textual descriptions. We follow the splits provided by Karpathy et al. [35], where 5,000 images are used for validation, 5,000 for testing, and the rest for training. Following the standard evaluation protocol [5], retrieval results on this dataset are reported by averaging over 5 folds of 1,000 test images each.

VQA 2.0.
The dataset is composed of images coming from the COCO dataset and is divided into training, validation, and test sets according to the official splits. For each image, three questions are provided on average. These questions are divided into three different categories: Yes/No, Number, and Others. Each image-question pair is annotated with 10 answers collected by human annotators, and the most frequent answer is selected as the correct one. We report experimental results on the validation and test-dev sets of this dataset, only using the training split to train our model. Differently from standard literature that uses additional training data coming from different datasets, we only focus on image-question-answer triplets from this dataset.

B. Implementation Details
To encode image regions, we employ the Faster R-CNN model finetuned on the Visual Genome dataset [36], [1], obtaining a 2048-dimensional feature vector for each region, and we reduce the dimensionality of region feature vectors by feeding them to a fully connected layer. For each image, we select the top regions with the highest class detection confidence scores. As mentioned, to encode word vectors, we use a bi-directional GRU with a single layer, using either learned or pre-trained word embeddings to represent the words of the sentence.

Following the standard implementation [9], each scaled dot-product attention also includes a dropout, a residual connection, and a layer normalization. In all our experiments, we use Adam [37] as optimizer.

Visual Question Answering.
For VQA models, the initial learning rate is decreased by a fixed factor at regular epoch intervals. To represent words, we use and finetune the pre-trained GloVe word embeddings [38]. Input questions are truncated to a maximum length, padding the shorter ones. For the additional position-wise feed-forward layer used in VQA models, we set the hidden size to the same dimensionality d of the attention layers. When we use a number of compressed vectors k larger than 1, we average the k vectors of each modality to obtain a single compact representation for both image regions and words.

Following a common practice in the VQA task [1], the set of candidate answers is limited to correct answers that appear in the training set a minimum number of times.

Cross-modal Retrieval.
For retrieval models, the initial learning rate is decayed by a fixed factor after a given number of epochs, and we clip the norm of vectorized gradients. To encode words, we use one-hot vectors and linearly project them with a learnable embedding matrix. To create the word vocabulary, we take into account only the words that appear a minimum number of times in the training and validation sets.

In our attention-based aggregation function, when the number of compressed vectors k is larger than 1, we compute a pair-wise cosine similarity between each pair of compressed vectors coming from the two modalities, and we average the resulting k similarity scores. Intuitively, each aggregation module learns to extract and compare different relevant information, specializing each vector to a distinct semantic meaning.
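As a small sketch of this multi-vector similarity (our naming; we pair the i-th compressed vector of one modality with the i-th of the other, which is how we read "each pair" above):

```python
import torch.nn.functional as F

def multi_vector_similarity(img_vecs, txt_vecs):
    """img_vecs, txt_vecs: (B, k, d) compressed vectors; returns one similarity score per pair in the batch."""
    sims = F.cosine_similarity(img_vecs, txt_vecs, dim=-1)  # (B, k): cosine of each paired vector
    return sims.mean(dim=-1)                                 # average of the k similarity scores
```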
C. Baselines

To evaluate the proposed method, we compare our results with respect to five different aggregation functions, namely mean pooling, max pooling, log-sum-exp pooling, 1D convolution, and CLS token. For all baselines, we employ the pipeline defined in Sec. III-B, and the same hyper-parameters and implementation choices used for our architecture.
Mean Pooling.
The mean aggregation function is one of the most common approaches for feature reduction and refers to the global average pooling between each vector of the input sequence. Since input sequences may have different lengths, in our experiments the mean pooling operation is computed using only the valid elements of the sequence.

TABLE I
ACCURACY RESULTS ON THE VQA 2.0 DATASET. THE RESULTS ARE REPORTED ON THE VALIDATION AND TEST-DEV SPLITS. ALL MODELS ARE TRAINED ONLY ON THE VQA 2.0 TRAINING SPLIT.

                        Validation                       Test-Dev
Aggregation Function    All    Yes/No  Number  Others    All    Yes/No  Number  Others
Mean Pooling            54.87  71.50   37.93   46.69     56.05  71.00   38.88   47.19
Max Pooling             56.73  75.68   37.64   47.37     57.95  75.14   38.48   47.69
LogSumExp Pooling       54.61  70.94   38.27   46.53     55.68  70.36   38.72   47.00
1D Convolution          56.87  72.35   39.18   49.79     57.79  71.71   39.97   49.96
CLS Token               58.31  74.29   39.89   51.03     59.40  74.26   40.31   51.07
Ours (k = 1)
Ours (k = 2)
Ours (k = 3)
Ours (k = 5)
Ours (k = 7)
Ours (k = 10)

Max Pooling.
Similarly to the mean operation, max pooling is another commonly used strategy to reduce feature dimensionality and selects the maximum activation in the feature maps. In our setting, we apply max pooling to the sequence dimension, thus obtaining a single summarized vector for each input sequence.
LogSumExp Pooling.
It can be considered as a smooth approximation of the maximum function and is defined as the logarithm of the sum of the argument exponentials. We apply this operation along the feature dimension, thus condensing the most important features for each vector of the sequence.
1D Convolution.
Convolution is the fundamental operation of CNNs and works well for identifying patterns in data. We test 1D convolutions applied to the sequence dimension to obtain a compact and aggregated representation of the whole set of vectors. In our experiments, we set the kernel size equal to the input sequence length.
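For reference, minimal sketches of the non-learnable reductions described above, with masking over padded elements and dimension choices following the descriptions in this section; the fixed sequence length used for the 1D convolution is a placeholder.

```python
import torch
import torch.nn as nn

def masked_mean(x, mask):            # x: (B, n, d); mask: (B, n) bool, True for valid elements
    m = mask.unsqueeze(-1).float()
    return (x * m).sum(dim=1) / m.sum(dim=1).clamp(min=1)

def masked_max(x, mask):             # max over the sequence dimension, ignoring padding
    x = x.masked_fill(~mask.unsqueeze(-1), float('-inf'))
    return x.max(dim=1)[0]

def logsumexp_pool(x):               # smooth maximum along the feature dimension
    return torch.logsumexp(x, dim=-1)

seq_len = 20                         # placeholder: kernel size equals the input sequence length
conv1d = nn.Conv1d(in_channels=512, out_channels=512, kernel_size=seq_len)
# aggregated = conv1d(x.transpose(1, 2)).squeeze(-1)   # (B, seq_len, 512) -> (B, 512)
```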
CLS Token.
Following the recent progress of pre-training strategies and cross-modality matching [10], we also consider the integration of a special CLS token at the beginning of each input sequence. Thanks to the cross- and self-attention operations, the CLS token can be used as a final compact representation of the entire sequence. We add a CLS token for each modality and use them in the last stage of the pipeline according to the specific task.
D. Visual Question Answering Results
Experimental results for the VQA task are shown in Table I by comparing our aggregation function with respect to the aforementioned baselines. For each method, we report the accuracy on all image-question pairs of the considered splits and the accuracy values on the three question categories of the VQA 2.0 dataset (i.e. Yes/No, Number, and Others). As it can be seen, our method surpasses all other aggregation functions by a significant margin on both validation and test-dev splits, and improves the overall accuracy on both splits with respect to the CLS token, which is the top performing baseline in this task.

Additionally, we test our attention-based aggregation method by using a different number of k compressed vectors and different word embedding strategies. In the bottom section of Table I, we report the accuracy results by varying the number of compressed vectors. As it can be noticed, the model with a single compressed vector already reaches good results, surpassing all other baselines. Nevertheless, higher performance can be achieved with more compressed vectors, suggesting that a correct answer can be positively influenced by capturing different aspects of the input features. Above a certain number of k vectors, we instead observe a degradation of the performance. This can be explained by the greater complexity of the model, which undermines the benefits of learning different global vectors.

TABLE II
COMPARISON BETWEEN DIFFERENT WORD EMBEDDING STRATEGIES ON THE VQA 2.0 VALIDATION SET.

Aggregation Func.   Word Emb.          All    Yes/No  Number  Others
Ours (k = 5)        Learned            59.29  77.24   40.29   50.66
Ours (k = 5)        GloVe              60.98  78.51   42.20
Ours (k = 5)        GloVe Finetuned
Ours (k = 7)        Learned            59.23  76.98   40.02   50.80
Ours (k = 7)        GloVe
Ours (k = 7)        GloVe Finetuned    60.95  78.40

In Table II, we show the performance on the VQA 2.0 validation set when using different word embedding strategies. In particular, we compare the results obtained by employing learnable word embeddings and pre-trained GloVe vectors, either fixed or finetuned during training. In our experiments, the GloVe word embeddings lead to an improvement of the final accuracy using both 5 and 7 compressed vectors. The performance gap between fixed and finetuned GloVe vectors is not very large, but a slight improvement is given when using the finetuned version.
For this reason, all experiments are carried out by using the GloVe vectors finetuned during training. On the contrary, learning word embeddings from scratch leads to lower performance in all settings.

Fig. 3. Qualitative results on the VQA 2.0 validation set. For each image, we report a sample question, the ground-truth answer, and the corresponding answers predicted by our aggregation function and by the mean pooling operation.
Qualitative Results.
Some sample results on the VQA 2.0 validation set are reported in Fig. 3. For each image, we show the corresponding question, the ground-truth correct answer, and the answers predicted by our attention-based aggregation function and by the mean pooling operation. The results demonstrate the effectiveness of our strategy also from a qualitative point of view and confirm better performance than one of the most widely used solutions to aggregate features. Our method is able to correctly identify the color of the objects mentioned in the question and count the number of instances of a given entity. Also, it can accurately answer either simple (e.g. Yes/No) or more complex questions that require a complete understanding of the scene.
E. Cross-modal Retrieval Results
Table III shows the results for the cross-modal retrieval task on the COCO test set. For both text and image retrieval, we report the results in terms of recall@K (with K = 1, 5, 10), which measures the portion of query images or query captions for which at least one correct result is found among the top-K retrieved elements. Also in this setting, we compare our aggregation function with respect to the previously defined baselines, and we analyze the performance by varying the number of compressed vectors used to aggregate input sequences.

As it can be seen, our attention-based aggregation achieves the best results among all considered aggregation functions on both text and image retrieval. Also in this case, the CLS token results to be the top performing baseline according to all evaluation metrics, confirming the importance of using inter- and intra-modality interactions to reduce feature dimensionality. Differently from the VQA task, the best performance is achieved with a lower number of compressed vectors, as shown in the bottom section of Table III. In this setting, we do not find the use of GloVe word vectors beneficial, and all results are thus obtained by learning word embeddings during training. This suggests that the large amount of textual data contained in the COCO dataset, compared to that available for the VQA task, can lead to specific and more suited word embedding representations.
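For completeness, recall@K as used here can be computed from a query-to-gallery similarity matrix as in the following sketch (assuming, for simplicity, a single correct gallery item per query located on the diagonal):

```python
import torch

def recall_at_k(sim, k):
    """sim: (n, n) similarity matrix with the correct match of query i in column i."""
    ranks = sim.argsort(dim=1, descending=True)     # gallery indices sorted by similarity
    correct = torch.arange(sim.size(0)).unsqueeze(1)
    hits = (ranks[:, :k] == correct).any(dim=1)     # is the correct item within the top-K?
    return hits.float().mean().item()
```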
TABLE III
CROSS-MODAL RETRIEVAL RESULTS ON THE MICROSOFT COCO 1K TEST SET.

                        Text Retrieval            Image Retrieval
Aggregation Function    R@1    R@5    R@10        R@1    R@5    R@10
Mean Pooling            69.66  93.12  97.64       50.42  82.27  90.83
Max Pooling             69.04  92.68  96.98       51.20  83.27  91.52
LogSumExp Pooling       64.20  91.52  96.84       47.22  82.26  91.23
1D Convolution          65.66  91.86  96.58       49.25  81.43  90.42
CLS Token               70.30  93.38  97.24       51.05  83.28
Ours (k = 1)
Ours (k = 2)
Ours (k = 3)
Ours (k = 4)

Qualitative Results.
Finally, we show some sample results for text and image retrieval in Fig. 4(a) and 4(b), respectively. Also in this case, we compare our results with those obtained by using the mean pooling aggregation function. As it can be seen, these qualitative results further confirm the effectiveness of our solution, leading to more accurate retrieval in both settings.

Fig. 4. Qualitative results for text and image retrieval. For each sample, we report the top-1 result retrieved by our aggregation function and by the mean pooling operation.

V. CONCLUSION
Aggregating features has always played a critical role in deep learning architectures. In this paper, we proposed a novel aggregation function based on a variant of the cross-attention mechanism, which reduces sets or sequences of elements into a single compact representation in a learnable fashion. We specifically tailored our method for visual-semantic tasks such as visual question answering and cross-modal retrieval, where images and text are combined to obtain a classification or a ranking score, respectively. Experimental results demonstrated that our approach achieves better performance when compared to other commonly used reduction functions on both considered tasks. Further, we showed that our method can be applied with minimum changes in the overall design to different settings where any form of element reduction is required.

REFERENCES

[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, "Bottom-up and top-down attention for image captioning and visual question answering," in CVPR, 2018.
[2] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, "Meshed-Memory Transformer for Image Captioning," in CVPR, 2020.
Top-1 (b) Image RetrievalFig. 4. Qualitative results for text and image retrieval. For each sample, we report the top-1 result retrieved by our aggregation function and by the meanpooling operation.[3] R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-semanticembeddings with multimodal neural language models,” in
NeurIPSWorkshops , 2014.[4] L. Baraldi, M. Cornia, C. Grana, and R. Cucchiara, “Aligning text anddocument illustrations: towards visually explainable digital humanities,”in
ICPR , 2018.[5] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler, “VSE++: ImprovingVisual-Semantic Embeddings with Hard Negatives,” in
BMVC , 2018.[6] K.-H. Lee, X. Chen, G. Hua, H. Hu, and X. He, “Stacked cross attentionfor image-text matching,” in
ECCV , 2018.[7] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Neural modulenetworks,” in
CVPR , 2016.[8] H. Tan and M. Bansal, “LXMERT: Learning Cross-Modality EncoderRepresentations from Transformers,” in
EMNLP , 2019.[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in
NeurIPS ,2017.[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-trainingof Deep Bidirectional Transformers for Language Understanding,” in
NAACL-HLT , 2019.[11] J. Gu, J. Cai, S. R. Joty, L. Niu, and G. Wang, “Look, imagine andmatch: Improving textual-visual cross-modal retrieval with generativemodels,” in
CVPR , 2018.[12] Y. Wang, H. Yang, X. Qian, L. Ma, J. Lu, B. Li, and X. Fan, “PositionFocused Attention Network for Image-Text Matching,” in
IJCAI , 2019.[13] L. Li, Z. Gan, Y. Cheng, and J. Liu, “Relation-aware graph attentionnetwork for visual question answering,” in
ICCV , 2019.[14] P. Gao, H. You, Z. Zhang, X. Wang, and H. Li, “Multi-modality latentinteraction network for visual question answering,” in
ICCV , 2019.[15] K. Li, Y. Zhang, K. Li, Y. Li, and Y. Fu, “Visual Semantic Reasoningfor Image-Text Matching,” in
ICCV , 2019.[16] M. Engilberge, L. Chevallier, P. P´erez, and M. Cord, “Finding beans inburgers: Deep semantic-visual embedding with localization,” in
CVPR ,2018.[17] M. Malinowski and M. Fritz, “A multi-world approach to question an-swering about real-world scenes based on uncertain input,” in
NeurIPS ,2014.[18] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, “Learningto reason: End-to-end module networks for visual question answering,”in
ICCV , 2017.[19] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attentionnetworks for image question answering,” in
CVPR , 2016.[20] J.-H. Kim, J. Jun, and B.-T. Zhang, “Bilinear Attention Networks,” in
NeurIPS , 2018.[21] R. Cadene, H. Ben-Younes, M. Cord, and N. Thome, “MUREL: Multi-modal Relational Reasoning for Visual Question Answering,” in
CVPR ,2019.[22] P. Gao, Z. Jiang, H. You, P. Lu, S. C. Hoi, X. Wang, and H. Li, “Dynamicfusion with intra-and inter-modality attention flow for visual questionanswering,” in
CVPR , 2019. [23] Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian, “Deep modular co-attentionnetworks for visual question answering,” in
CVPR , 2019.[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classificationwith deep convolutional neural networks,” in
NeurIPS , 2012.[25] K. Simonyan and A. Zisserman, “Very deep convolutional networks forlarge-scale image recognition,” in
ICLR , 2015.[26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”in
CVPR , 2015.[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” in
CVPR , 2016.[28] S. Xie, R. Girshick, P. Doll´ar, Z. Tu, and K. He, “Aggregated residualtransformations for deep neural networks,” in
CVPR , 2017.[29] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Denselyconnected convolutional networks,” in
CVPR , 2017.[30] T.-Y. Lin, P. Doll´ar, R. Girshick, K. He, B. Hariharan, and S. Belongie,“Feature pyramid networks for object detection,” in
CVPR , 2017.[31] S. Humeau, K. Shuster, M.-A. Lachaux, and J. Weston, “Poly-encoders:Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring,” in
ICLR , 2020.[32] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings usingsiamese bert-networks,” arXiv preprint arXiv:1908.10084 , 2019.[33] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Makingthe v in vqa matter: Elevating the role of image understanding in visualquestion answering,” in
CVPR , 2017.[34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,P. Doll´ar, and C. L. Zitnick, “Microsoft COCO: Common Objects inContext,” in
ECCV , 2014.[35] A. Karpathy and L. Fei-Fei, “Deep Visual-Semantic Alignments forGenerating Image Descriptions,” in
CVPR , 2015.[36] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen,Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei,“Visual Genome: Connecting Language and Vision Using CrowdsourcedDense Image Annotations,”
IJCV , vol. 123, no. 1, pp. 32–73, 2017.[37] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,”in
ICLR , 2015.[38] J. Pennington, R. Socher, and C. Manning, “GloVe: Global Vectors forWord Representation,” in