Retrieval Augmentation to Improve Robustness and Interpretability of Deep Neural Networks
Rita Parada Ramos
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
[email protected]

Patrícia Pereira
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
[email protected]

Helena Moniz
INESC-ID, Faculdade de Letras, Universidade de Lisboa
Lisbon, Portugal
[email protected]

João Paulo Carvalho
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
[email protected]

Bruno Martins
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
[email protected]
Abstract—Deep neural network models have achieved state-of-the-art results in various tasks related to vision and/or language. Despite the use of large training data, most models are trained by iterating over single input-output pairs, discarding the remaining examples for the current prediction. In this work, we actively exploit the training data to improve the robustness and interpretability of deep neural networks, using the information from nearest training examples to aid the prediction both during training and testing. Specifically, the proposed approach uses the target of the nearest input example to initialize the memory state of an LSTM model or to guide attention mechanisms. We apply this approach to image captioning and sentiment analysis, conducting experiments with both image and text retrieval. Results show the effectiveness of the proposed models for the two tasks, on the widely used Flickr8k and IMDB datasets, respectively. Our code is publicly available at http://github.com/RitaRamo/retrieval-augmentation-nn.

Index Terms—deep learning, retrieval augmentation, nearest neighbors, LSTM, attention mechanism
I. INTRODUCTION
The most common methodology in deep learning involves the supervised training of a neural network with input-output pairs, so as to minimize a given loss function [11]. In general, deep neural networks predict the output conditioned solely on the current input or, more recently, leveraging an attention mechanism [2] that focuses only on parts of the input as well. This leaves the rest of the dataset examples unused for the current prediction, either during training or inference.

In this work, we leverage similar examples in the training set to improve the robustness and interpretability of deep neural networks, both at training and testing time. We propose an approach that retrieves the nearest training example to the one being processed and uses the corresponding target example (i) as auxiliary context to the input (e.g., combining the input together with the retrieved target), or (ii) to guide the attention mechanism of the neural network.

We show that the retrieved target can be easily incorporated in an LSTM model [6], making use of its initial memory state. In general, previous studies have given little consideration to the initialization of the LSTM's memory state. Typically, the memory state is initialized simply with a vector of zeros. Even when it is initialized with the current context (e.g., the input text for machine translation tasks, or the input image in image captioning), it is just initialized in the same way as the hidden state: with a simple affine transformation of the same context. Our approach takes advantage of the initial memory state by encoding auxiliary information from training examples. We also present a new multi-level attention method that attends to the inputs, as well as to the target of the nearest example.

We evaluate the proposed approach on image captioning and sentiment analysis. In brief, image captioning involves generating a textual description of an image. The dominant framework uses a CNN as an encoder that represents the image and passes this representation to an RNN decoder that generates the respective caption, often combined with neural attention [8], [20]. The task of sentiment analysis aims to classify the sentiment of an input text. Within neural methods, RNNs and CNNs are commonly used for sentiment analysis, recently also combining attention mechanisms [1], [17].

Our general aim is to show the robustness of our approach by applying it to different tasks (i.e., generation and classification) and by using a retrieval mechanism with different modalities (i.e., image and text retrieval).

II. PROPOSED APPROACH
The proposed approach consists of two steps. The first step involves retrieving the nearest training example given the current input. The second step leverages the target of the nearest example, either by encoding it as the LSTM's initial memory state or by using it to guide the attention mechanism.

A. Retrieval Component
For retrieval, we use Facebook AI Similarity Search (FAISS) [9] to store the training examples as high-dimensional vectors and to search over the dataset. FAISS is an open-source library for nearest-neighbor search, optimized for memory usage and speed, being able to quickly find similar high-dimensional vectors over a large pool of example vectors. To search over the stored vectors, FAISS uses the Euclidean distance by default, although it also supports the inner product. We use the default Euclidean distance to retrieve the nearest example x_n given the current input x_c.

Note that, for the second stage, the target output of the nearest example is required, and not the nearest example itself. This can be retrieved with an auxiliary lookup table that maps the corresponding index of x_n to its target y_n.

a) Image Retrieval: In image captioning, each training input image x is stored as a D-dimensional vector using a pre-trained encoder network:

    r_x = \mathrm{Enc}(x) \in \mathbb{R}^D .   (1)

In the previous expression, r_x denotes the vector representation of the input, and D its dimensionality. A CNN encoder Enc extracts the image features V via the last convolutional layer, followed by a global average pooling operation.

b) Text Retrieval: In sentiment analysis, each training input sentence x is mapped to a vector representation r_x using a pre-trained sentence representation model, denoted as S in the following expression:

    r_x = S(x) \in \mathbb{R}^D .   (2)

In particular, we use a pre-trained sentence transformer (http://github.com/UKPLab/sentence-transformers) to obtain the corresponding sentence representations [18], namely the paraphrase-distilroberta-base-v1 model. This RoBERTa-based sentence representation model has been trained to produce meaningful sentence embeddings for similarity assessment and retrieval tasks.
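As a minimal sketch of this retrieval component, the following Python snippet builds a FAISS index over pre-computed training representations and maps the index of the nearest example back to its target y_n through the auxiliary lookup table. The names train_reprs, train_targets, and retrieve_nearest_target, as well as the random stand-in data, are illustrative assumptions, not taken from our released code.

    import numpy as np
    import faiss  # Facebook AI Similarity Search [9]

    D = 768                                                  # representation dimensionality
    train_reprs = np.random.rand(1000, D).astype("float32")  # stand-in for Enc(x) or S(x) vectors
    train_targets = ["target_%d" % i for i in range(1000)]   # auxiliary lookup table: index -> y_n

    index = faiss.IndexFlatL2(D)  # exact search with the default Euclidean distance
    index.add(train_reprs)        # store all training examples

    def retrieve_nearest_target(r_x: np.ndarray, k: int = 1):
        """Return the targets y_n of the k training inputs nearest to r_x."""
        _, ids = index.search(r_x.reshape(1, -1).astype("float32"), k)
        return [train_targets[i] for i in ids[0]]

IndexFlatL2 performs exhaustive exact search; for larger datasets, FAISS also provides approximate indexes that trade accuracy for speed.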
B. Incorporating the Nearest Target in the LSTM

In the second step, we use the target of the nearest example to initialize the memory state of the LSTM or to guide the attention mechanism.
1) LSTM Initial Memory State:
After retrieving the nearest input x_n in the first step, the corresponding target y_n is incorporated in the LSTM as the initial memory state. This can be accomplished as long as the target example is encoded into a continuous vector space with the same dimensionality as the LSTM. Our approach consists in mapping the retrieved target y_n into a fixed-length representation and using an affine transformation to match the dimensionality of the LSTM:

    r_{y_n} = W_n f(y_n) .   (3)

In the previous expression, W_n is a learned parameter that projects the vector representation of the retrieved target to the same dimensionality as the LSTM. For the image captioning and sentiment analysis tasks, we use the vector representations f(y_n) produced through the procedures described next:

a) Image Captioning: We explore three alternative representations for the target caption of the nearest input image.

• Average of static word embeddings: Each word is represented with a pre-trained embedding, in particular using fastText [16], and then the word vectors are averaged to build the caption representation.

• Weighted average of static word embeddings: Taking the average of word vectors assumes that all words are of equal importance. To better capture the meaning of a caption, we average the word embeddings weighted by their norms. Liu et al. [13] have shown that the norm of a word embedding can be used to define the importance of a word, with frequent words having smaller norms than rare words.

• Contextual embeddings: The aforementioned representations ignore word order and use static word embeddings that do not depend on the left/right context within the sequence of words. To take this information into consideration, we encode the caption using the pre-trained sentence transformer paraphrase-distilroberta-base-v1.

b) Sentiment Analysis: In this task, the target of the nearest input sentence is either positive or negative. We suggest the following representations to encode the nearest target:
• 1s and -1s: The nearest target is represented with a vector of 1s when positive or with a vector of -1s when negative. The rationale behind this choice relates to using opposite vectors with a cosine similarity of -1.

• Average embeddings of positive/negative sentences: When the nearest target is positive, we use a representation obtained from all the positive training sentences using the pre-trained fastText embeddings. This is done by averaging the word vectors of each positive sentence and then averaging over all the positive sentences to obtain the final embedding that represents a positive target. The same idea is applied to a negative target (i.e., building a representation from the average embeddings of all negative sentences). Essentially, we intend to provide the model with an overall memory of what resembles a positive/negative sentence.

• Weighted average embeddings: Similar to the previous formulation, but weighting each word by the norm of the corresponding word vector.

• Contextual embeddings: Similar to the two aforementioned representations, but using paraphrase-distilroberta-base-v1 to represent each sentence.

Fig. 1. Overview of the proposed approach. Typically, an LSTM model predicts the output based solely on the input, without leveraging auxiliary context from other training examples. Our approach retrieves the nearest training example x_n and incorporates its target y_n into the LSTM's initial memory state m_0. We apply our approach to sentiment analysis (top) and image captioning (bottom).
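To illustrate the weighted-average representation used in both tasks, the following minimal sketch computes a norm-weighted average of pre-trained word embeddings. Here, word_vectors is assumed to be a token-to-vector mapping (e.g., loaded fastText vectors), and the resulting f(y_n) would still be projected through the learned transformation of Eq. 3.

    import numpy as np

    def weighted_average_embedding(tokens, word_vectors, dim=300):
        """Average word embeddings weighted by their norms: frequent words
        tend to have smaller norms than rare words [13], so rare words
        contribute more to the final representation."""
        vecs = [word_vectors[t] for t in tokens if t in word_vectors]
        if not vecs:
            return np.zeros(dim, dtype=np.float32)
        stacked = np.stack(vecs)                    # (num_words, dim)
        weights = np.linalg.norm(stacked, axis=1)   # one weight per word
        return (weights[:, None] * stacked).sum(axis=0) / weights.sum()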
2) Guiding an Attention Mechanism:
The target of the nearest input can also be included in an attention mechanism. We present a new multi-level attention method that attends to the inputs and also to the retrieved target, deciding which of those to focus on for the current prediction.

a) Image Captioning:
First, a visual context vector c_t is computed with typical additive attention [2], given the image features V \in \mathbb{R}^{D \times K} and the previous hidden state h_{t-1} \in \mathbb{R}^D of the LSTM:

    a_t = w_a^T \tanh(W_v V + W_h h_{t-1}) ,   (4)
    \alpha_t = \mathrm{softmax}(a_t) ,   (5)
    c_t = \sum_{i=1}^{K} \alpha_{i,t} v_i .   (6)

In the previous expressions, W_h, W_v, and w_a^T are learned parameters, and \alpha_t are the attention weights. The attended image vector c_t is the visual context vector.

Then, the multi-level context vector \hat{c}_t is obtained given the previous hidden state h_{t-1} and the concatenation of the visual context vector c_t with the retrieved target vector r_{y_n} \in \mathbb{R}^D:

    \hat{a}_t = w_{\hat{a}}^T \tanh(W_m \, \mathrm{concat}(c_t, r_{y_n}) + W_h h_{t-1}) ,   (7)
    \hat{\alpha}_t = \mathrm{softmax}(\hat{a}_t) ,   (8)
    \hat{c}_t = \hat{\alpha}_{1,t} c_t + \hat{\alpha}_{2,t} r_{y_n} .   (9)

In the previous expressions, W_m and w_{\hat{a}}^T are parameters to be learned, and \hat{\alpha}_t consists of \hat{\alpha}_{1,t}, i.e., the weight given to the visual context vector, and \hat{\alpha}_{2,t}, the weight given to the retrieved target, with \hat{\alpha}_{2,t} = 1 - \hat{\alpha}_{1,t}. In this way, the proposed attention can decide to attend to the current image, or to focus on the nearest example.

b) Sentiment Analysis: The same attention mechanism described for image captioning is also applied here, with small modifications. In this case, attention is not calculated at each time-step of the LSTM, since the prediction only occurs at time-step T. Therefore, the previous hidden state h_{t-1} is replaced by the last hidden state h_T, and the visual features V are replaced by all the LSTM hidden states H.
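The following PyTorch sketch illustrates the multi-level attention of Eqs. 7-9, treating the visual context vector and the retrieved target vector as two candidates to be scored against the previous hidden state. The module and variable names are ours for illustration, not from the released code, and the sketch assumes both candidates have already been projected to a common dimensionality, as described for Eq. 4.

    import torch
    import torch.nn as nn

    class MultiLevelAttention(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.W_m = nn.Linear(dim, dim)  # projects each candidate vector (Eq. 7)
            self.W_h = nn.Linear(dim, dim)  # projects the previous hidden state
            self.w_a = nn.Linear(dim, 1)    # scores each candidate

        def forward(self, c_t, r_yn, h_prev):
            # Stack the two candidates along a new axis: (batch, 2, dim).
            cands = torch.stack([c_t, r_yn], dim=1)
            scores = self.w_a(torch.tanh(self.W_m(cands) + self.W_h(h_prev).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)  # (batch, 2, 1); alpha_2 = 1 - alpha_1
            # Weighted sum of the two candidates gives the multi-level context (Eq. 9).
            return (alpha * cands).sum(dim=1), alpha.squeeze(-1)

For sentiment analysis, the same module would be called once at time-step T, with h_T in place of h_{t-1} and the context vector computed over the hidden states H in place of c_t.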
III. IMPLEMENTATION DETAILS

We compare our two models, named Image Captioning through Retrieval (ICR) and Sentiment Analysis through Retrieval (SAR), against a vanilla encoder-decoder model with neural attention and a vanilla attention-based LSTM, respectively. The differences between the baselines and our models are the aforementioned approaches: (i) the LSTM's memory state being initialized with the target of the nearest input, and (ii) the multi-level attention mechanism.

In image captioning, we use a ResNet pre-trained on ImageNet as the encoder, without fine-tuning, together with a standard LSTM as decoder, with one hidden layer and 512 units. For each input image, the ResNet encoder extracts the current image features V = [v_1, ..., v_K] (2048 x 146-D) and performs average pooling to obtain the global image feature \bar{v} = \frac{1}{K} \sum_{i=1}^{K} v_i (which corresponds to r_x). The initial hidden state of the LSTM is then initialized with an affine transformation of the image feature, h_0 = W_{ih} \bar{v} (512-D). The initial memory state m_0 is also initialized in the same way, with m_0 = W_{im} \bar{v}, for the baseline model. For the ICR model, m_0 is initialized with the retrieved target (Eq. 3), corresponding to the first reference caption of the retrieved nearest image. At each time-step t, the LSTM receives as input the fastText embedding of the current word in the caption (300-D), concatenated with the corresponding attention context vector (512-D). The baseline model uses the visual context vector c_t and the ICR model uses the multi-level context vector \hat{c}_t. In the particular case of the multi-level attention mechanism, the image features V receive an affine transformation before passing through Eq. 4, to ensure the same dimensionality as the retrieved target when computing Eq. 9. Finally, the current output word probability is computed with an affine transformation with a dropout rate of 0.5, followed by a softmax layer.

In sentiment analysis, we also use an LSTM with one hidden layer and 512 units. We use a standard LSTM for coherence with image captioning, but note that the SAR model can also use a bi-directional model. As typically done in sentiment analysis, the baseline LSTM hidden states are initialized with a vector of zeros, whereas our SAR model initializes m_0 with the sentiment of the nearest review (Eq. 3). At each time step, the LSTM receives as input the current word embedding (based on pre-trained fastText embeddings). After processing the whole sequence, at the last time-step T, attention is applied over the respective hidden states H and the last hidden state h_T. Then, the corresponding context vector is passed through an affine transformation with dropout (0.5), followed by a sigmoid layer, in order to obtain the positive sentiment probability.

The models are trained with the standard categorical and binary cross-entropy losses, respectively for image captioning and sentiment analysis. The batch size is set to 32, and we use the Adam optimizer with learning rates of 4e-4 and 1e-3, for image captioning and sentiment analysis, respectively. As stopping criterion, we use early stopping, terminating if there is no improvement after 12 consecutive epochs on the validation set (over the BLEU score for image captioning, and accuracy for sentiment analysis); the learning rate is decayed after 5 consecutive epochs without improvement (shrink factor of 0.8). At test time, we use greedy decoding in image captioning.
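As a concrete illustration of the state initialization just described, the sketch below (with illustrative names, not our released code) initializes the hidden state from the global image feature and, in the ICR case, the memory state from the retrieved target representation of Eq. 3.

    import torch
    import torch.nn as nn

    class RetrievalInitLSTM(nn.Module):
        def __init__(self, feat_dim=2048, target_dim=300, hidden_dim=512, emb_dim=300):
            super().__init__()
            self.W_ih = nn.Linear(feat_dim, hidden_dim)    # image feature -> h_0
            self.W_n = nn.Linear(target_dim, hidden_dim)   # retrieved target -> m_0 (Eq. 3)
            self.lstm_cell = nn.LSTMCell(emb_dim + hidden_dim, hidden_dim)

        def init_states(self, v_bar, f_yn):
            h0 = self.W_ih(v_bar)  # affine transform of the global image feature
            m0 = self.W_n(f_yn)    # retrieved target as the initial memory state
            return h0, m0

        def step(self, word_emb, context, states):
            # Input: word embedding (300-D) concatenated with the attention
            # context vector (512-D), as described above.
            return self.lstm_cell(torch.cat([word_emb, context], dim=1), states)

The baseline would instead set m0 with an affine transformation of the same image feature (or with zeros, in the sentiment analysis case); everything else is shared, which isolates the effect of the retrieved information.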
IV. EXPERIMENTS

This section presents the experimental evaluation of the proposed approach. We first describe the datasets and evaluation metrics, and then present and discuss the obtained results.
A. Datasets and Metrics
We report experimental results on commonly used datasets for image captioning and sentiment analysis, namely the Flickr8k dataset [7] and the IMDB dataset [15], respectively. The Flickr8k dataset has 8000 images with 5 reference captions per image. We use the publicly available splits of Karpathy (http://cs.stanford.edu/people/karpathy/deepimagesent/), with 6000 images for training, 1000 images for validation, and the remaining 1000 images for testing. The IMDB dataset contains 50000 movie reviews from IMDB, labeled as positive or negative. We use the publicly available train-test splits, with 25000 reviews each, and we do a random split on the training set, holding out 10% of the original training reviews for validation. For both datasets, the vocabulary consists of words that occur at least five times.

To evaluate caption quality, we use classic metrics from the literature, such as BLEU, METEOR, ROUGE-L, CIDEr, and SPICE [3]. All the aforementioned metrics were calculated through the implementation in the MS COCO caption evaluation package (http://github.com/tylin/coco-caption). Additionally, we use the recent BERTScore [21] metric, which has a better correlation with human judgments. Regarding the evaluation of sentiment analysis, we also use established metrics, namely the F-score and the classification accuracy.
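For concreteness, the vocabulary rule above can be implemented as in the following small sketch, which assumes simple whitespace tokenization (an illustrative choice, as the exact tokenization is an implementation detail).

    from collections import Counter

    def build_vocab(train_texts, min_count=5):
        """Keep only words that occur at least min_count times in training."""
        counts = Counter(tok for text in train_texts for tok in text.lower().split())
        return {w for w, c in counts.items() if c >= min_count}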
B. Image Captioning Results

In Table I, we present the captioning results on the Flickr8k dataset. We compare the baseline model with the proposed ICR model, displaying the performance of the different representations suggested to encode the retrieved target caption, namely using the average of pre-trained embeddings (avg), the weighted average (weighted), and RoBERTa sentence embeddings (RoBERTa).

Results show that the ICR model outperforms the baseline, achieving better performance on all metrics independently of the representation used. The best performance was achieved using the nearest target encoded with the weighted average, followed by the simple average and then by the RoBERTa embeddings, with all three surpassing the baseline. We hypothesise that the static fastText embeddings worked better than the contextual RoBERTa embeddings in the representation of the retrieved target possibly because fastText embeddings are also used for the input words in the captioning model, lying in the same representation space.

A representative example is shown in Figure 2, containing the captions generated by the aforementioned models, together with a visualization of our multi-level attention mechanism from the best model (ICR weighted). We provide more examples in Figure 4. These qualitative results further confirm that the ICR weighted model tends to produce better captions.

TABLE I
IMAGE CAPTIONING PERFORMANCE ON THE FLICKR8K TEST SPLIT, WITH THE BEST RESULTS SHOWN IN BOLD.

Models         BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE-L  CIDEr   SPICE   BERTScore
Baseline       0.5541  0.3848  0.2554  0.1691  0.1948  0.4421   0.4526  0.1353  0.4945
ICR avg
ICR weighted
ICR RoBERTa

TABLE II
IMAGE CAPTIONING ABLATION STUDY ON THE FLICKR8K TEST SPLIT. THE BASELINE IS COMPARED AGAINST THE USE OF THE RETRIEVED TARGET IN THE LSTM'S INITIAL MEMORY STATE (initialization m) AND IN THE ATTENTION MECHANISM (multi-level attention). THE BEST RESULTS ARE SHOWN IN BOLD AND THE SECOND-BEST ARE UNDERLINED.

Models                              BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE-L  CIDEr   SPICE   BERTScore
Baseline                            0.5541  0.3848  0.2554  0.1691  0.1948  0.4421   0.4526  0.1353  0.4945
ICR weighted initialization m
ICR weighted multi-level attention
ICR weighted (combined)             0.6080  0.4251  0.2857  0.1896  0.2002  0.4583   0.4897  0.1364  0.5250

Fig. 2. Top: a visualization of our multi-level attention mechanism, showing how much attention the ICR weighted model pays to the current input image (blue bars) and to the retrieved target caption (orange bars). Bottom: the captions generated by the different models for the input image.
Regarding the multi-level attention mechanism, in general, we noticed that the ICR weighted model tends to attend to the current input image at the beginning of the generation process and, after obtaining the current context, the model switches its attention to focus more on the semantics of the nearest caption. As observed in this example, the retrieved information can aid the prediction, and it can also provide some evidence for the model's decision. We thus argue that augmenting LSTM models with retrieved examples can also bring benefits in terms of model interpretability. We also show some retrieved nearest examples in Figure 3. The encoder was pre-trained on the ImageNet dataset, which sometimes results in retrieved images that are not that similar to the given image, or that share some similar aspects but have different caption contexts (see the last row of the figure). Better image representations could be achieved by fine-tuning the encoder on a task derived from the information associated with the captions (e.g., the nouns and adjectives of the captions, thus promoting similarity between images that correspond to similar captions).
Current input-output pair: a man dressed in black is surfing on a large blue wave
Retrieved nearest neighbor: a huge wave crashes over a man on a surfboard

Current input-output pair: hiker walking along high rocks in front of a blue sky with puffy white clouds
Retrieved nearest neighbor: a male hiker sits perched on a high rock in the mountains

Current input-output pair: a group of people are in the water and people are all over the beach
Retrieved nearest neighbor: a dog is walking on the sand

Fig. 3. Examples of retrieved images, together with the corresponding target captions used in our ICR models.
We also performed ablation studies, as shown in Table II, to quantify the impact of each method we use in the ICR model: the retrieved target incorporated in the LSTM's memory state (initialization m) and the multi-level attention mechanism (multi-level attention). Specifically, we compare the baseline against the best-performing model (ICR weighted), including either the former or the latter method. Observing Table II, the baseline obtains the lowest scores across all metrics. The performance is improved by using the retrieved information in the LSTM memory cell or in the attention mechanism, showing that the retrieved targets are effectively exploited in both methods. We also note that there are only small gains from combining the two methods. Overall, the performance of both proposed methods is similar, and each of the two can be used to capture the retrieved information.

Fig. 4. Examples of test captions generated by the different models under analysis. The captioning quality tends to improve using the semantics of the retrieved target captions (highlighted in bold).
Table III presents the sentiment analysis results on the IMDB dataset. We contrast the baseline model against the proposed SAR model, comparing the performance of the different representations suggested to encode the retrieved target label: using vectors of -1s and 1s (-11s) to represent the negative and positive labels, respectively, using the average of fastText embeddings (avg) of all the positive and negative training reviews, using their weighted average (weighted), or using RoBERTa sentence embeddings (RoBERTa).

While the performance decreases with the SAR -11s model compared to the baseline, suggesting that this representation is not effective, the other SAR models slightly outperform the baseline, yielding better performance on both accuracy and F-score. Our best result, using the SAR avg model, yields a 0.0053 increase in accuracy compared to the baseline, and a 0.0052 increase in the F-score. We also conducted ablation studies with the proposed methods (i.e., initialization m or multi-level attention), and observed that both outperformed the baseline, as can be seen in Table IV. It should be noticed that the differences in performance are very small, but perhaps with a more challenging dataset, or in another classification task, the influence could be larger. For some retrieval examples, please see Table V. In most cases, the text retrieval is effective in finding a good neighbor, but sometimes the neighbor can be the same movie with an opposite review, as we can see in the last example.

TABLE III
SENTIMENT ANALYSIS RESULTS ON THE IMDB TEST SPLIT. BEST PERFORMANCE IS SHOWN IN BOLD.

Models        Accuracy  F-score
Baseline      0.8930    0.8929
SAR -11s
SAR avg
SAR weighted
SAR RoBERTa

TABLE IV
SENTIMENT ANALYSIS ABLATION STUDY. BEST PERFORMANCE IS SHOWN IN BOLD AND THE SECOND-BEST IS UNDERLINED.

Models                          Accuracy  F-score
Baseline                        0.8930    0.8929
SAR avg initialization m
SAR avg multi-level attention
SAR avg (combined)              0.8983    0.8981
V. RELATED WORK
Our approach is closely related to those from some previous studies in which models predict the output conditioned on retrieved examples [5], [19]. Hashimoto et al. [5] used an encoder-decoder model, augmented by a learned retriever, to generate source code. The authors propose a retrieve-and-edit approach, in which the retriever finds the nearest input example and then that prototype is edited into an output pertinent to the input. In turn, Weston et al. [19] introduced a retriever for dialogue generation. The idea is to first retrieve the nearest response, after which the generator, i.e., a sequence-to-sequence model, receives the current input concatenated with the retrieved response, separated by a special token. Our approach is similar, but rather than concatenating the input with the retrieved example, we make use of the LSTM's memory cell state to incorporate the nearest training example. In our view, the retrieved examples should be considered as additional context, and not be treated as regular input. Also, unlike Hashimoto et al. [5], our retrieval component does not need to be trained.

Our retriever is, in fact, based on that of Khandelwal et al. [10]. In their work, a pre-trained Language Model (LM) is augmented with a nearest-neighbor retrieval mechanism, in which similar examples are found using FAISS. The probabilities for the next word are computed by interpolating the LM's output distribution with the nearest-neighbor distribution. However, different from their work, we do not use the retriever just for inference, aiding the prediction both during training and testing time.

Retrieved examples have also been used to guide attention mechanisms. For instance, Gu et al. [4] described an attention-based neural machine translation model that attends to the current source sentence as well as to retrieved translation pairs. There are also other multi-level attention studies that attend to more than the input. Lu et al. [14] proposed an adaptive encoder-decoder framework that can choose when to rely on the image or on the language model to generate the next word. Li et al. [12] proposed three attention structures, representing the attention to different image regions, to different words, and to vision and semantics. Our retrieval multi-level attention mechanism takes inspiration from these approaches. Note, however, that our attention mechanism is simpler and can be easily integrated on top of other attention modules. Our multi-level mechanism attends to the retrieved vector and to a given attention vector, which does not need to be the visual context vector that we used for image captioning, or the context vector from the hidden states used for sentiment analysis. We can plug in any context vector to be attended together with the retrieved target vector.

VI. CONCLUSIONS AND FUTURE WORK
In this work, we proposed a new approach that can be incorporated into any LSTM model, leveraging the information of similar examples in the training set, during training and test time. For a given input, we first retrieve the nearest training input example and then use the corresponding target as auxiliary context to the input of the neural network, or to guide its attention mechanism. We showed that the retrieved target can be easily incorporated in an LSTM's initial memory state, and we also designed a multi-level attention mechanism that can attend both to the input and to the target of the nearest example. We conducted a series of experiments, presenting alternative ways to represent the nearest examples for two different tasks, namely image captioning and sentiment analysis. Both the Image Captioning through Retrieval model and our Sentiment Analysis through Retrieval model yield better results than baselines without the retrieved information. Besides helping to improve result quality, this retrieval approach also provides cues for the model's decisions, and thus it can also be of help in terms of making models more interpretable.

Despite the interesting results, there are also many possible ideas for future work. For instance, further work could better explore our approach in terms of interpretability. Additionally, both models could be improved by receiving information from the top nearest examples, instead of relying on a single retrieved example. In the particular case of image captioning, it would be important to fine-tune the encoder used in the retrieval component, producing better image representations and thus ensuring a more effective search for nearest examples. In addition, the selection of the retrieved target could also be improved. In this work, we simply select the first reference caption of the retrieved input image, when there are actually five possible references for the retrieved target. It would be interesting to select the most descriptive caption (i.e., in terms of length) or the most similar one among the five, or even to try to combine the information of the five reference captions. Furthermore, the proposed multi-level attention mechanism could attend to the various words of the retrieved caption, as it does for the image regions. Regarding sentiment analysis, our main priority involves conducting more experiments on datasets that are more challenging than the IMDB reviews, to better assess the proposed approach. Moreover, we also plan to test with datasets that have more fine-grained labels than only positive or negative targets. We also note that, in movie reviews, neighbor inputs can have different targets while still being similar, since they refer to the same movie. Therefore, exploring the proposed approach in the task of intent classification could be more effective, since neighbor inputs there usually have the same targets.

TABLE V
EXAMPLES OF REVIEWS WITH THEIR CORRESPONDING LABELS AND THE RETRIEVED NEAREST REVIEWS WITH THEIR LABELS. (+) DENOTES A POSITIVE REVIEW AND (-) DENOTES A NEGATIVE REVIEW.

Current input-output pair: (+) anyone who doesn t laugh all through this movie has been embalmed i have watched it at least twenty times and (...) absolutely perfect (...) especially sally field and kevin kline
Retrieved nearest neighbor: (+) it s very funny it has a great cast who each give great performances it s a well written screenplay filled with stars who perform their roles to perfection kevin kline

Current input-output pair: (-) i can t say much about this film i think it speaks for itself as do the current ratings on here i rented this about two years ago and i totally regretted it
Retrieved nearest neighbor: (-) this movie was a disappointment i was looking forward to seeing a good movie i am the type of person who starts a movie and doesn t turn it off until the end but i was forcing myself not to turn it off

Current input-output pair: (+) obviously written for the stage lightweight but worthwhile how can you go wrong with ralph richardson olivier and merle oberon
Retrieved nearest neighbor: (+) i had really only been exposed to olivier s dramatic performances and those were mostly much later films than divorce in this film he is disarmed of his pomp and overconfidence by sassy merle oberon and plays the flustered divorce attorney with great charm

Current input-output pair: (-) by the numbers story of the kid prince a singer on his way to becoming a star then he falls in love with appolonia kotero but he has to deal with his wife beating father (...) i couldn t believe it the script is terrible lousy dialogue and some truly painful comedy routines
Retrieved nearest neighbor: (+) purple rain has never been a critic s darling but it is a cult classic and deserves to be if you are a prince fan this is for you the main plot is prince seeing his abusive parents in himself and him falling in love with a girl

ACKNOWLEDGEMENTS
We would like to thank André F. T. Martins and Marcos Treviso for the feedback on preliminary versions of this document. Our work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT), under the project grants with references UIDB/50021/2020 (INESC-ID multi-annual funding) and PTDC/CCI-CIF/32607/2017 (MIMU), under the project Parceria Internacional CMU Ref. 045909 (MAIA), and through the Ph.D. scholarship with reference 2020.06106.BD.
REFERENCES

[1] Qurat Tul Ain, Mubashir Ali, Amna Riaz, Amna Noureen, Muhammad Kamran, Babar Hayat, and A. Rehman. Sentiment analysis using deep learning techniques: a review. International Journal of Advanced Computer Science and Applications, 8(6):424, 2017.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[3] Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. Evaluation of text generation: A survey. arXiv preprint arXiv:2006.14799, 2020.
[4] Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O.K. Li. Search engine guided neural machine translation. In Proceedings of the Association for the Advancement of Artificial Intelligence, pages 5133-5140, 2018.
[5] Tatsunori B. Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S. Liang. A retrieve-and-edit framework for predicting structured outputs. In Proceedings of the Annual Meeting of Neural Information Processing Systems, pages 10052-10062, 2018.
[6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[7] Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853-899, 2013.
[8] MD Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6):1-36, 2019.
[9] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.
[10] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019.
[11] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
[12] Yangyang Li, Shuangkang Fang, Licheng Jiao, Ruijiao Liu, and Ronghua Shang. A multi-level attention model for remote sensing image captions. Remote Sensing, 12(6):939, 2020.
[13] Xuebo Liu, Houtim Lai, Derek F. Wong, and Lidia S. Chao. Norm-based curriculum learning for neural machine translation. arXiv preprint arXiv:2006.02014, 2020.
[14] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 375-383, 2017.
[15] Andrew Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 142-150, 2011.
[16] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation, 2018.
[17] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, and Rada Mihalcea. Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research. arXiv preprint arXiv:2005.00357, 2020.
[18] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
[19] Jason Weston, Emily Dinan, and Alexander H. Miller. Retrieve and refine: Improved sequence generation models for dialogue. arXiv preprint arXiv:1808.04776, 2018.
[20] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, pages 2048-2057, 2015.
[21] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019.