The Role of the Input in Natural Language Video Description
Silvia Cascianelli, Student Member, IEEE, Gabriele Costante, Member, IEEE, Alessandro Devo, Thomas A. Ciarfuglia, Member, IEEE, Paolo Valigi, Member, IEEE, and Mario L. Fravolini
Abstract—Natural Language Video Description (NLVD) has recently received strong interest in the Computer Vision, Natural Language Processing (NLP), Multimedia, and Autonomous Robotics communities. The State-of-the-Art (SotA) approaches obtained remarkable results when tested on the benchmark datasets. However, those approaches poorly generalize to new datasets. In addition, none of the existing works focus on the processing of the input to the NLVD systems, which is both visual and textual. In this work, an extensive study dealing with the role of the visual input is presented, evaluated with respect to the overall NLP performance. This is achieved by performing data augmentation of the visual component, applying common transformations to model camera distortions, noise, lighting, and camera positioning, which are typical in real-world operative scenarios. A t-SNE based analysis is proposed to evaluate the effects of the considered transformations on the overall visual data distribution. For this study, the English subset of the Microsoft Research Video Description (MSVD) dataset, which is commonly used for NLVD, is considered. It was observed that this dataset contains a relevant amount of syntactic and semantic errors. These errors have been amended manually, and the new version of the dataset (called MSVD-v2) is used in the experimentation. The MSVD-v2 dataset is released to help gain insight into the NLVD problem.
Index Terms—Video Description, Multimodal Data, Input Preprocessing.
I. INTRODUCTION

VISUAL AND TEXTUAL data-based tasks [1] are receiving growing interest in many research communities. Some studied problems are visual content retrieval based on natural language queries [2]–[5], text-guided video summarization [6], [7], story understanding [8], and visual content description [9]–[11]. This paper tackles the video description (NLVD) problem. This is particularly interesting both for its research challenges and for its numerous possible applications. These include automatic video captioning of web content, automatic generation of the Descriptive Video Service (DVS) track of movies, products for the visually impaired and the blind, effective human-machine interaction, service and collaborative robotics applications, and video surveillance, to name a few. The approaches developed to address this problem are data-driven. In the training phase, the NLVD systems receive as input a video stream and an associated description, that is, a sentence in natural language. In the test phase, those
The authors are with the Department of Engineering, University of Perugia, Perugia, Italy (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). Manuscript received Month XX, 2018; revised Month XX, 201X.
Fig. 1: Natural Language Video Description systems are trained on videos and associated captions. In the test phase, these systems are expected to produce a relevant and syntactically correct sentence describing unseen videos.

systems are expected to output a descriptive sentence given a video (Fig. 1). The quality of the produced description is difficult to assess objectively [12], [13]. Nevertheless, to obtain a quantitative evaluation, the common practice is adopting metrics designed for NLP tasks such as machine translation and summarization, and for image description. The SotA approaches obtained good results on the benchmark datasets in terms of evaluation metrics. However, the human performance in terms of the same metrics is still significantly higher (see TABLE IV). Another issue with the current NLVD methods is that both training and test are performed on the same dataset. The recent work by Cascianelli et al. [14] outlined the poor generalization capabilities of those algorithms when tested on a new dataset. This may limit their practical applicability.

The recently growing interest in NLVD is accompanied by intense activity of design and collection of new datasets suitable for studying the problem. The most commonly used datasets for NLVD are the Montreal Video Annotation dataset (M-VAD) [15], the Max Planck Institute of Informatics Movie Description dataset (MPII-MD) [16], the Microsoft Video Description Corpus (MSVD) [17], and the Microsoft Research - Video to Text dataset (MSR-VTT) [18]. These datasets are generic in the depicted actions and featured actors in the scene. The M-VAD and the MPII-MD contain snippets from movies, which typically have high resolution. The MSVD and the MSR-VTT, instead, include videos from YouTube, which thus have a more varied quality. In this respect, these two latter datasets seem more suitable for the study of NLVD systems able to generalize. However, it is not guaranteed that they capture the high variability of the video quality (e.g., color channels, resolution) in the problem. This is an obstacle to the deployment of NLVD systems in applications such as surveillance and service robotics, where the characteristics of the camera and its position in the scene differ from scenario to scenario. From a textual standpoint, in the M-VAD and MPII-MD, the videos are paired with the associated sentence from the script or the transcribed DVS track. Therefore, these datasets lack diversity in the possible descriptions for each video. This also has a drawback in the evaluation procedure, since using the standard evaluation metrics can be to some extent misleading [19]. In the MSVD and MSR-VTT, there are several descriptions for each video (on average 43 in the first dataset, 20 in the other), collected via the Amazon Mechanical Turk (AMT) service. Since they better capture the different ways to describe the same video, these two datasets seem more suitable to study NLVD. However, not only do SotA methods still perform poorly on them, but humans also obtain imperfect performance scores (see TABLE IV and TABLE VII). In light of these considerations, this study is conducted using the MSVD dataset.
It is well known that the quality of the training data is crucial for the performance of NLVD algorithms. Therefore, it is important to use the most reliable datasets, deeply analyse their characteristics, and design the training input properly. Input preprocessing is a well-known good practice for effectively training machine learning algorithms [20]–[22]. For example, via data augmentation the training set can be automatically enlarged, thus providing more samples to the algorithm. This reduces the overfitting and increases the generalization capability of the model. Further, via data cleansing, outliers and incorrect samples are removed, so that the distribution of the dataset better represents the problem. This reduces the training time and increases the accuracy of the models. To the best of our knowledge, the role of the input has been neglected so far for NLVD systems. In our opinion, this aspect should be deeply explored for two main reasons: to allow improving the generalization capabilities, and to gain further insights into the problem and thus design NLVD algorithms more judiciously. In light of this, the purpose of this work is to tackle the following practical issues: 1) to quantify the performance improvement due to input preprocessing; 2) to provide some practical guidelines for a rational selection of suitable input augmentation strategies.

For this study, the benchmark MSVD dataset is considered, and a standard encoder-decoder NLVD system is designed. A number of visual transformations are then applied to the videos in the dataset. The selection of the most appropriate appearance transformations for visual data augmentation is guided by a data-driven analysis based on t-SNE [23]. Further, since the transformed videos have to preserve the original semantic content, the augmentation strategies have been selected among those that do not affect the relation between the video and the associated description. In the experimentation, it was observed that the MSVD dataset contains a relevant number of syntactic and semantic errors. This suggested to (manually) amend these inconsistencies, producing an improved dataset, called MSVD-v2. This new dataset is used in addition to the original one in the experiments, to evaluate the effects of training the NLVD system with more consistent textual data.

The remainder of the paper is organized as follows. In Section II the related work is overviewed. In Section III the proposed approach is explained. In Section IV the results of an extensive experimental study are reported and discussed. In Section V the conclusions are drawn.

II. RELATED WORK
The NLVD problem is attracting the interest of many research communities, from the Computer Vision [15] and NLP [23] communities to the Multimedia [10], [11], [24] and Autonomous Robotics [14] ones. The early proposed approaches to NLVD consist in addressing the task as template filling [17], [23], [25] or description retrieval [11], [26]. The most recent and most popular approach to NLVD is treating the problem as a machine translation one [27], from a video sequence to a natural language sentence, using an encoder-decoder architecture. The frames of the video are usually subsampled and processed by one or more Convolutional Neural Networks (ConvNets) to extract a visual descriptor for each frame. Object recognition ConvNets and action recognition ConvNets are commonly used and combined together to obtain a good representation of the frames. Integrating the Optical Flow is also a used strategy [28]. Another recently proposed approach [29] consists in representing the video frames via a sequential vector of locally aggregated descriptors (SeqVLAD) layer, which combines a VLAD encoding and a recurrent-convolutional network. The SeqVLAD framework aggregates the intra-frame spatial information and the inter-frame motion information. The frame descriptors are used to encode the video. The encoding can be obtained directly by mean pooling the features, as done, e.g., in [30], or, more effectively, via an RNN-based encoder. Typically, an LSTM-based encoder is used. This can be a single LSTM [31], a bidirectional LSTM (BiLSTM) [32], or a multilayer LSTM [33]. Using the GRU in the encoder is less common [14]. The video encoding is then fed to the sentence decoder together with the ground truth sentence, word by word. The words in the ground truth sentences are used to form a vocabulary for the dataset. The words in the caption are represented as vectors in a Word Embedding (WE) [34], [35]. The WE is usually learned during the training of the NLVD system [36], or in some cases is a pretrained WE, as in [37]. The decoder is trained to predict the probability of each word in the vocabulary to be the next one in the sentence, based on the video encoding and the previous words in the sentence. At each step, the most probable word is emitted, and the process stops when an End-Of-Sequence (<EOS>) tag is emitted. The decoder is designed to be a recurrent architecture. The LSTM is the preferred choice, either as a single block [19] or in a multilayer LSTM-based architecture [28]. Some works [14], [33], [38] employ the GRU as the main block of the decoder.

To improve the performance, attention mechanisms are employed at different points of the encoder-decoder system. In particular, at each word generation step, the decoder takes as input the video features weighted according to their relevance to the next word, based on the previously emitted words [31], [32], [38], [39]. With the same principle, in [40] the attention mechanism is applied to the mean-pooled features from a predefined number of object tracklets in the video. In [41], the textual information is used to select Regions-of-Interest (ROIs) in the video frames, whose descriptors are combined with those of the global frame in a Dual Memory Recurrent Model. An alternative strategy to combine visual and textual information is reshaping the feature vectors into circulant matrices and combining them to extract the multimodal relation between the two different modalities [42], or building a multimodal matching tensor of sequential data [43].
The attention mechanism can be implemented as an additional layer in the encoder-decoder architecture or can be integrated into the gating strategy of the decoder, as done in [10]. Recent trends include training multitask NLVD models [44], [45], using a reinforcement-learning framework [46], [47], or a cycle learning framework [48].

Devising a SotA NLVD system is beyond the scope of this paper. Here, the focus is on the input to these systems and the effects of its preprocessing on the NLVD performance. The study is conducted considering a simple yet effective NLVD encoder-decoder architecture.

a) Input Preprocessing: Data-driven approaches, such as Deep Learning-based ones, heavily depend on the quality of the training data, in terms of effectiveness, achieved representation power, and generalization capability. For this reason, attention is usually put on properly preprocessing the input to those algorithms [21]. Data augmentation at the visual level is a well-known strategy to improve the performance of algorithms for many Computer Vision tasks. Emblematic is the case of [49], where the generalization capabilities of the AlexNet ConvNet increased by training the model on altered images. To be beneficial for the training, the applied alterations should be carefully designed to capture the characteristics of the data of the problem. In this work, visual data augmentation for NLVD is proposed for the first time, taking into account the characteristics of the videos captured by the camera in various application scenarios, and maintaining the relation with the associated descriptions.

The recent work in [50] presents style augmentation as a novel strategy to perform visual data augmentation exploiting a style transfer network [51]. In particular, the texture, contrast, colour, and illumination of the image are altered, but shapes and semantic content are preserved. This strategy has been found effective for improving the performance on classification tasks, domain transfer, and depth estimation. Style transfer via neural networks was introduced by Gatys et al. in [52], and many other works followed this approach for transforming images with the style of paintings [51], [53], [54] or other photorealistic images taken under completely different conditions [55]. The content representation and the style representation of the input image are extracted from a pre-trained ConvNet. In particular, the content is represented by the feature responses in higher layers, and the style is represented by the feature correlations of multiple lower layers. Content and style are modelled by two separate terms of the loss function, which is minimized to synthesise the new image having the desired style and content. Following the novel approach of [50], in this work style augmentation is tested for NLVD.

In the NLP literature, data augmentation has been proposed to enlarge the training corpora automatically. For example, the authors of [56] performed textual data augmentation by replacing words with their synonyms from WordNet [57] for ConvNet-based models for ontology classification, sentiment analysis, and text categorization. In [58] the focus was on Natural Language Normalization, and the problem of small datasets for that task was addressed. The authors trained a machine translation architecture on a small normalization dataset and translated a bigger corpus of standard text into an unnormalized form. With this, the authors were able to augment the small text normalization datasets.
In [59], data augmentation for machine translation was performed, targeting rare words. The authors trained an LSTM language model to alter both source and target sentences in a parallel corpus. This way, they maintained the relation between the two sentences in the two languages. Doing the same for NLVD is not straightforward because one of the two "languages" is visual. Few works on NLVD operate at the input level. In [40] data augmentation is proposed at the sentence level. The authors proposed to enrich the sentence part of the MSR-VTT with sentences from the MSVD. These sentences are selected based on the visual similarity between the associated videos in the two datasets. However, once included in the MSR-VTT, the sentences are paired with fake videos, i.e., all-zeros vectors. Thus, this approach does not maintain the relation between video and text. In this paper, a new version of a benchmark dataset is presented. The sentences associated with the videos have been manually checked and corrected in case of errors, thus maintaining their semantic relatedness to the videos.

III. PROPOSED APPROACH
To study the role of the input in the NLVD problem, a basic encoder-decoder architecture is designed, and a standard benchmark dataset, namely the MSVD [17], is considered. In this section, the NLVD system, the video augmentation strategy, and the text checking procedure that led to the amended version of the dataset are described.

First of all, it is instructive to briefly overview the standard evaluation metrics used for NLVD systems and throughout this study to guide the design choices. These metrics are: BLEU [60], in its 4-gram variant; ROUGE [61], in its Longest Common Subsequence (LCS) variant; METEOR [62]; CIDEr [12].

Call n-gram a sequence of n consecutive words. Given a candidate sentence A and a reference sentence B to compare (a code sketch of these two quantities is given at the end of this overview):
• The ratio of the number of n-grams in A that are mapped to n-grams in B to the total number of n-grams in A is the n-gram precision.
• The ratio of the number of n-grams in A that are mapped to n-grams in B to the total number of n-grams in B is the n-gram recall.

BLEU is a precision-oriented metric designed for machine translation evaluation. To obtain the score, n-gram precision is calculated considering n-grams up to length four. BLEU correlates well with human judgement on the quality of the translation when evaluated on the entire test set, but poorly at the sentence level.

ROUGE is a recall-oriented metric designed for summarization evaluation. It is based on the idea that a candidate summary should ideally overlap the reference summary. This metric has three variants, depending on the sentence comparison strategy adopted. In the NLVD literature, the variant that considers the longest common subsequence (LCS), called ROUGE-L, is used. All ROUGE variants correlate well with human judgement.

METEOR is a precision and recall-based MT evaluation metric. For its computation, unigrams in the candidate and reference sentences are matched based on their exact form, i.e., if the unigrams are the same word, stemmed form, i.e., if the unigrams have the same root, and meaning, i.e., if the unigrams are synonyms. Then, unigram precision and unigram recall are calculated based on the found matches, and the F-mean is obtained, weighing the recall more than the precision. In addition, a multiplicative factor is used to reward identically ordered contiguous matched unigrams. METEOR correlates with human judgement better than unigram precision, unigram recall, and their harmonic combination, also at the sentence level.

CIDEr is a metric designed to assess the quality of the description of an image. It is based on the cosine similarity between n-grams in the candidate description and in the set of reference descriptions associated with the image. Each n-gram is weighted using a Term Frequency-Inverse Document Frequency (TF-IDF) strategy. This metric is designed to correlate well with human judgement on the image description quality, and is thus particularly suitable for the task of NLVD.

The possible values for all the above metrics span from 0 to 1. For all but CIDEr, these are reported using values from 0 to 100. The values of the CIDEr metric are reported between 0 and 1000. This is done to make the CIDEr values of the same order of magnitude as those of the other metrics. In fact, even SotA approaches obtain very low scores in terms of the CIDEr metric.
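As referenced above, the following is a minimal Python sketch of sentence-level n-gram precision and recall. It is an illustration only, not the official BLEU/ROUGE implementations, which add clipping across multiple references, brevity penalties, and corpus-level aggregation:

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams (as tuples) contained in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision_recall(candidate, reference, n):
    # Candidate sentence A vs reference sentence B, as defined above.
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    matched = sum((cand & ref).values())  # n-grams of A mapped to n-grams of B
    precision = matched / max(sum(cand.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall

# Example: unigram (n = 1) overlap between a candidate and a reference caption.
p, r = ngram_precision_recall("a man is playing a guitar",
                              "a man plays the guitar", 1)
```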
A. Basic Encoder-Decoder NLVD System

Outperforming the SotA is beyond the scope of this paper; thus, a simple yet effective encoder-decoder architecture is designed and used. This helps in better highlighting the effects of the input preprocessing on the performance. Its pictorial representation is in Fig. 2. In the following, the model is referred to as the Basic Encoder-Decoder Description System (BEDDS).

The video frames are sampled one every five. On the sampled frames, the output of the last fully connected layer of the ResNet50 [63] and C3D [64] ConvNets is computed. The choice of these two SotA ConvNets is the result of a preliminary ablation study and confirms the results reported, e.g., in [28], [48] on the benefits of using very deep object recognition ConvNets and including the temporal information either via action recognition ConvNets or Optical Flow. This allows capturing both the appearance and the movement in the frame. In this study, ResNet50 has been used instead of its deeper variants, to limit the computational cost of the experiments. Note that the C3D vector is computed on a 16-frame sliding window centered on the sampled frame. The outputs of the two ConvNets are concatenated to form the feature vector x* describing the frame, whose dimension is the sum of the two ConvNet output dimensions. As a result, the input video is represented by the sequence of feature vectors describing its N frames (x_1, x_2, ..., x_N). Usually, in the NLVD literature, the visual feature vectors are mapped into a lower-dimensional space via a learnt linear transformation (VE). In the NLVD architecture used for this study, it has been decided not to perform this mapping operation, in light of the preliminary study whose results are reported in TABLE I.

Model                                  B     R-L   M     C
BEDDS (VGG16) + WE + VE                41.5  66.8  30.4  60.7
BEDDS (VGG16) + WE                     41.2  67.0  30.9  57.1
BEDDS (VGG16) + WE - GRU enc.          41.9  67.5  30.6  54.1
BEDDS (ResNet50+C3D) + WE              45.0  69.2  32.3  66.7
BEDDS (ResNet50+C3D) + WE - GRU enc.   43.9  69.1  --    --

TABLE I: Preliminary ablation study on the encoder-decoder architecture used for this study on the MSVD. B stands for BLEU-4, R-L for ROUGE-L, M for METEOR, and C for CIDEr. Bold indicates the best performance.

The sentence words are converted to lower case, and the punctuation is removed. The Begin-Of-Sequence (<BOS>) and the <EOS> tags are prepended and appended, respectively, to the sentence. Afterwards, the so-preprocessed sentence is tokenized, and the tokens form the dataset vocabulary D. Some SotA NLVD approaches include in the vocabulary only those words that appear in the dataset with a minimum frequency. For this study, it has been decided to include all the words in the vocabulary, to exclude the effects of the additional minimum-frequency hyperparameter on the performance. The words in the dataset are represented using the 300-dimensional GloVe [34], [35] WE, pre-trained on a six billion word corpus. In many SotA architectures the WE is learned from scratch, or a pretrained WE is finetuned during the training of the NLVD system. In this study, all these strategies for the WE have been tested, and the pretrained GloVe WE led to the best performance (see TABLE I). In addition, with this choice, the overall model has fewer parameters to train. Note that, in case a word in the dataset has no corresponding vector in the GloVe embedding, a 300-dimensional random-valued vector is assigned to it. In general, such words are either proper nouns, typos, or very rare words. In fact, their number decreases considerably after the textual data cleansing procedure described in III-C. As a result, the input sentence is represented by the sequence of embedding vectors corresponding to its L words (y_1, y_2, ..., y_L).

The frame feature vectors are fed, one at a time, to the encoder LSTM [65]. Using the LSTM as the main block of the encoder in NLVD systems is a common practice. In the case of this study, the choice was guided by a preliminary study in which the LSTM and the GRU have been compared as the main block of the encoder. The study (see TABLE I) confirmed the results of [14] in that the two blocks are equivalent in terms
of overall performance.

Fig. 2: Architecture of the encoder-decoder model used in this study. Recurrent layers are depicted as unfolded graphs for explanatory purposes. The ResNet50 and C3D ConvNets extract features from the video frames, which are the input to the LSTM encoder. The final state of the encoder and the GloVe embedding of the words in the caption are the input to the GRU decoder, which generates the output description one word at a time until it emits the <EOS> tag.

Although the GRU has fewer parameters, for this work it has been decided to use an LSTM-based encoder, because this is the common strategy in the NLVD literature. The LSTM is a Deep-RNN able to handle both long and short-term temporal dependencies between serial data. It has an inner memory cell c_n and a gating strategy to update the memory cell value and produce the output h^e_n, based on the current input x_n, and the previous state c_{n-1} and output h^e_{n-1}. In particular, the new memory cell value is obtained by combining the previous value, multiplied by the forget gate f_n, and a candidate new state \tilde{c}_n, multiplied by the input gate i_n. This is to modulate how much to forget of the previous value, and how much to update the current value with the new information. The output is obtained by multiplying the current memory cell with the output gate, which modulates how much memory to expose for the output. More formally, the LSTM used as the encoder in this study is defined by the following equations (1)-(6):

f_n = \sigma(W_f x_n + U_f h^e_{n-1} + b_f)    (1)
i_n = \sigma(W_i x_n + U_i h^e_{n-1} + b_i)    (2)
o_n = \sigma(W_o x_n + U_o h^e_{n-1} + b_o)    (3)
\tilde{c}_n = \tanh(W_c x_n + U_c h^e_{n-1} + b_c)    (4)
c_n = f_n \odot c_{n-1} + i_n \odot \tilde{c}_n    (5)
h^e_n = o_n \odot \tanh(c_n)    (6)

where the W_*s, the U_*s, and the b_*s are learnable weight matrices and bias vectors, \sigma is the sigmoid function, \tanh is the hyperbolic tangent function, and \odot is the element-wise product.

The last output of the encoder, which represents the entire video, is passed to the decoder as its first state, i.e., h^e_N = h^d_0 = v. The first input to the decoder is the WE of the <BOS> token; the subsequent inputs are the WEs of the words in the sentence, which terminates with the WE of the <EOS> token. In this work, the decoder is a GRU [66]. The GRU is a more recent Deep-RNN, able to deal with both long and short-term time dependencies between the elements in a sequence. Different from the LSTM, it has no memory cell, and its output corresponds to its inner state h^d_l. Similar to the LSTM, the state is calculated via a gating strategy using the current input y_l and the previous state h^d_{l-1}. First, a candidate state \tilde{h}^d_l is computed from the current input and the previous state, multiplied by the reset gate r_l. This gate controls how much of the old state to forget in the candidate new state. Afterwards, the state is updated, also obtaining the output. To this end, the previous state and the current candidate state are combined after being multiplied by the update gate z_l. This gate controls how much of the old state to maintain and how much of the current candidate state to use in the new state. More formally, the GRU used as the decoder in this study is defined by the following equations (7)-(10):

r_l = \sigma(W_r y_l + U_r h^d_{l-1} + b_r)    (7)
z_l = \sigma(W_z y_l + U_z h^d_{l-1} + b_z)    (8)
\tilde{h}^d_l = \tanh(W_h y_l + U_h (r_l \odot h^d_{l-1}) + b_h)    (9)
h^d_l = (1 - z_l) \odot h^d_{l-1} + z_l \odot \tilde{h}^d_l    (10)

where the W_*s, the U_*s, and the b_*s are learnable weight matrices and bias vectors.
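A direct NumPy transcription of Eqs. (1)-(10) may help fix the notation; the weight shapes and the dictionary-based parameter packing are illustrative choices of this sketch, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encoder_step(x_n, h_prev, c_prev, W, U, b):
    # One encoder step; W, U, b are dicts of weight matrices and bias vectors.
    f = sigmoid(W["f"] @ x_n + U["f"] @ h_prev + b["f"])        # Eq. (1), forget gate
    i = sigmoid(W["i"] @ x_n + U["i"] @ h_prev + b["i"])        # Eq. (2), input gate
    o = sigmoid(W["o"] @ x_n + U["o"] @ h_prev + b["o"])        # Eq. (3), output gate
    c_tilde = np.tanh(W["c"] @ x_n + U["c"] @ h_prev + b["c"])  # Eq. (4), candidate state
    c = f * c_prev + i * c_tilde                                # Eq. (5), memory update
    h = o * np.tanh(c)                                          # Eq. (6), output
    return h, c

def gru_decoder_step(y_l, h_prev, W, U, b):
    # One decoder step on the word embedding y_l.
    r = sigmoid(W["r"] @ y_l + U["r"] @ h_prev + b["r"])              # Eq. (7), reset gate
    z = sigmoid(W["z"] @ y_l + U["z"] @ h_prev + b["z"])              # Eq. (8), update gate
    h_tilde = np.tanh(W["h"] @ y_l + U["h"] @ (r * h_prev) + b["h"])  # Eq. (9)
    return (1.0 - z) * h_prev + z * h_tilde                           # Eq. (10), new state
```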
At each step, the decoder outputs the state h^d_l. This is multiplied by a weight matrix W_D to obtain the output vector \hat{y}_l. From this, the output word is selected from the vocabulary using the softmax function, which models the probability that the output word is the next one in the description, i.e.:

Pr(\hat{y}_l | v, y_1, y_2, ..., y_{l-1}) \sim \frac{e^{\hat{y}_l}}{\sum_{y \in D} e^{y}}    (11)

In the training phase, the objective is to maximize the log-likelihood of the words over the sentence, i.e.,

\max_{W} \sum_{l=1}^{L} \log Pr(\hat{y}_l | v, y_1, y_2, ..., y_{l-1})    (12)

where W denotes all the parameters of the model.

In the test phase, the input to the GRU decoder at each step is the previously emitted word, and the decoding process stops automatically when the <EOS> token is emitted as the most probable token.
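For illustration, a hedged sketch of this greedy test-phase decoding loop follows; it reuses the GRU step defined above, and all remaining names (W_D, glove, vocab) are placeholders rather than the authors' code:

```python
import numpy as np

def greedy_decode(v, W, U, b, W_D, glove, vocab, bos_id, eos_id, max_len=30):
    # h^d_0 = v: the encoder final state initializes the decoder (see above).
    h = v
    y = glove[bos_id]                        # first input: WE of <BOS>
    words = []
    for _ in range(max_len):
        h = gru_decoder_step(y, h, W, U, b)  # Eqs. (7)-(10)
        logits = W_D.T @ h                   # output vector \hat{y}_l
        next_id = int(np.argmax(logits))     # most probable word, Eq. (11)
        if next_id == eos_id:                # stop on <EOS>
            break
        words.append(vocab[next_id])
        y = glove[next_id]                   # feed back the emitted word's WE
    return " ".join(words)
```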
Training videos  Training samples  B     R-L   M     C
200              8182              33.0  63.7  27.6  36.3
400              16221             35.4  65.6  29.3  48.7
600              24599             41.6  66.3  29.6  52.4
800              32606             42.0  68.0  31.1  58.4
1000             41035             44.8  69.2  32.5  68.1
1200             49158             --    --    --    --

TABLE II: BEDDS model performance on the MSVD depending on the number of training videos. B stands for BLEU-4, R-L for ROUGE-L, M for METEOR, and C for CIDEr. Bold indicates the best performance.

B. Data Augmentation to Study the Role of the Visual Input
For the training of Deep Neural Networks, the availability of a large number of training samples is critical. NLVD systems are no exception, as can be seen from TABLE II. This is a first reason to perform data augmentation, enlarging the training set via video alteration. In addition, these systems lack generalization capability, as observed in [14].

In this work, to study the generalization capabilities and the performance of the designed NLVD system under non-ideal characteristics of the visual input, it is proposed to apply alterations to the videos in the MSVD that could reflect some operating conditions of cameras in real-world scenarios. In fact, when the video captioning system is used in a specific application context (e.g., for a greyscale surveillance camera placed above the monitored scene), some considerations on the characteristics of the images can be traced (the images will be greyscaled, keystone distorted, possibly blurred, occasionally very dark or very bright, etc.). According to those considerations, altered videos can be included in the dataset used for training or fine-tuning the captioning system.

The transformations applied in this study are the following (a code sketch of these alterations is given at the end of this subsection):
• Grayscale conversion, to model grayscale cameras, which are largely used, e.g., for surveillance and robotics applications.
• Gaussian blur, to model the occasional out-of-focus operating condition.
• Keystone distortion, to model the non-optimal position of the camera in the scene. In fact, e.g., flying drones and surveillance cameras are usually above the scene, while, e.g., small terrestrial robots, kids, or users seated underneath a stage are below the scene. In this work, this distortion has been implemented using the perspective transform.
• Brightness enhancement and reduction, to model the different illumination conditions that may be encountered in the application scenario.
• Salt & Pepper noise, to generally represent low-quality images from cheap cameras.

Each of them is applied, to different degrees of severity, to all the videos in the MSVD. Exemplar applied transformations are reported in Fig. 3.

In addition, the videos have been transformed in the style of some artistic paintings. This was motivated by the fact that the strategy of performing data augmentation via style transfer has been found beneficial for many Computer Vision tasks [50]. In this study, the effectiveness of style augmentation for NLVD is investigated. In particular, the approach of [67] has been adopted to transform the videos directly. The applied approach builds on the style transformation network in [53] and uses the instance normalization proposed in [54] instead of batch normalization. The artistic styles selected are those of Pablo Picasso's 'La Muse', Leonid Afremov's 'Rain Princess', Edvard Munch's 'The Scream', Francis Picabia's 'Udnie', Katsushika Hokusai's 'The Great Wave off Kanagawa', and William Turner's 'The Wreck of a Transport Ship'. Some examples are reported in Fig. 4.

The applied alterations, either classical or artistic, do not modify the semantic content of the video. In fact, when selecting the transformations, those that would have affected the semantic content of the video have not been considered. For example, cropping, which is a typically applied visual data augmentation strategy, has not been considered, to avoid the risk of cropping out something described in the caption.
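The OpenCV sketch below illustrates the classical alterations listed above; the function names and default parameter values are placeholders of this illustration (the severities actually used are given in IV-A2), not the authors' implementation:

```python
import cv2
import numpy as np

def grayscale(img):
    g = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return cv2.cvtColor(g, cv2.COLOR_GRAY2BGR)       # keep three channels

def gaussian_blur(img, rho=15):                      # rho: odd kernel size
    return cv2.GaussianBlur(img, (rho, rho), 0)

def keystone(img, top_ratio=0.5):
    # Perspective transform that shrinks the top edge (w_top / w_bottom < 1).
    h, w = img.shape[:2]
    off = w * (1.0 - top_ratio) / 2.0
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = np.float32([[off, 0], [w - off, 0], [w, h], [0, h]])
    return cv2.warpPerspective(img, cv2.getPerspectiveTransform(src, dst), (w, h))

def brightness(img, factor=1.5):                     # >1 enhances, <1 reduces
    return cv2.convertScaleAbs(img, alpha=factor, beta=0)

def salt_and_pepper(img, p=0.1):                     # p: probability of altered pixels
    out = img.copy()
    mask = np.random.rand(*img.shape[:2])
    out[mask < p / 2] = 0                            # pepper
    out[mask > 1 - p / 2] = 255                      # salt
    return out
```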
C. Data Cleansing to Study the Role of the Textual Input
Providing high-quality training sentences to NLVD models is critical to achieving good performance. Among the available datasets for studying NLVD, in this study the popular MSVD [17] is adopted. This is the English portion of the dataset presented by Chen and Dolan [68] for paraphrase evaluation. The authors asked AMT workers to describe with a complete, grammatically correct sentence a short segment of various video clips from YouTube, focussing on the main actor and action depicted. The annotators had the option to watch the entire clip or only the segment to describe, with or without the audio, and could also choose the language for the description. In case English was not the native language of the annotator, the suggestion was given to use the Google translation service. These aspects made the collection of low-quality descriptions possible. Therefore, the authors organized the annotation process in two tasks to describe the same videos. Each annotator performed the first task. According to the quality of the English descriptions provided during the first task, the authors manually granted the best annotators access to the second task. Finally, however, the resulting dataset collected the descriptions from both tasks: ~50k from the first task and ~30k from the second task.

Despite the instructions and the quality assessment procedure, the English portion of the MSVD contains syntactically and semantically incorrect descriptions. An example is reported in Fig. 5. Therefore, for this work, a task involving 21 users has been prepared, in which the users have been asked to check and correct all the captions of the MSVD. Note that simply removing the sentences with errors would have reduced the performance, as can be observed from TABLE III. For this reason, it has been preferred to amend the errors. Each caption correction has been double-checked. For the task, four types of errors have been defined, and the annotators had to find and correct them. The types of errors, ranked based on their severity, are:
1. unsuitability, i.e., the sentence has no meaning, is ill-formulated, or, in general, does not respect the instructions given in the original task of [68]. These sentences have been replaced with other correct descriptions of the same video.
2. hallucination, i.e., the sentence describes actors or actions or objects that do not appear in the video. These errors have been corrected by double-checking the video.
3. syntactic, i.e., the sentence contains a grammatical error or a typo. These errors have been amended.
4. proper noun, i.e., the sentence contains a proper noun, which cannot be inferred from the video but comes from the experience of the annotator. The proper nouns have been removed or replaced with a common one.

Fig. 3: Example of visual alterations applied to the training videos in the MSVD. 3a is an original image from the video. 3b is the grayscale-converted image. 3c is the blurred image. 3d is the image with Salt & Pepper noise applied. 3e and 3f are the image with two keystone distortions applied. 3g is the brightness-reduced image. 3h is the brightness-enhanced image.

Fig. 4: Example of visual style transfer applied to the training videos in the MSVD. 4a is an original image from the video. 4b is the image in the style of Picasso's painting 'La Muse'. 4c is the image in the style of Afremov's painting 'Rain Princess'. 4d is the image in the style of Munch's painting 'The Scream'. 4e is the image in the style of Picabia's painting 'Udnie'. 4f is the image in the style of Hokusai's painting 'The Great Wave off Kanagawa'. 4g is the image in the style of Turner's painting 'The Wreck of a Transport Ship'.

Fig. 5: Examples of captions with errors associated with a video in the MSVD.
Training captions per video  Training samples  B     R-L   M     C
10                           12000             35.3  61.2  26.9  78.9
~43                          49158             --    --    --    --

TABLE III: BEDDS model performance on the MSVD depending on the number of training captions per video. B stands for BLEU-4, R-L for ROUGE-L, M for METEOR, and C for CIDEr. Bold indicates the best performance.
Dataset   B            R-L          M            C
MSVD      60.10 ± --   -- ± --      -- ± --      -- ± --
MSVD-v2   -- ± --      -- ± --      -- ± --      -- ± --

TABLE IV: Human performance on the MSVD in its original version from [17], [68] and in the checked version from this work, MSVD-v2. B stands for BLEU-4, R-L for ROUGE-L, M for METEOR, and C for CIDEr. Bold indicates the best performance.

In the test subset, the annotators labelled the errors in addition to correcting them. In case of multiple errors, the annotators labelled the caption giving priority to the most severe type of error. From this process, it emerged that 24.62% of the captions in the test set contained one or more of the errors just described. 49.20% of them had syntactical errors, 27.10% were unsuitable descriptions for the associated video, 12.18% contained hallucinations, and 11.52% proper nouns.

To gain insights into the MSVD, both the original one from [17], [68] and the one obtained for this study, referred to as MSVD-v2, the average human performance has been measured as follows. For each video, a ground truth sentence has been considered as the predicted description, and the performance scores have been calculated similarly to what is done for the automatic NLVD models. This procedure has been repeated 23 times, since each video in the test subset of the MSVD is associated with at least 23 captions. Finally, the mean and standard deviation of the scores have been calculated. The human performance changes after checking the text part of the dataset, as reported in TABLE IV. In particular, the mean value increases for all the scores and the variance decreases. This is no surprise considering the high number of detected and amended errors. In fact, the considered metrics are based on the similarity of words and groups of words in the compared sentences, and the dissimilarity due to the errors has been removed (or significantly reduced). The values of the scores are not perfect because of the natural diversity in the possible ways to describe each video.

The obtained MSVD-v2 dataset is available online at http://sira.diei.unipg.it/supplementary/input4nlvd2018/.
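The leave-one-out procedure just described can be sketched as follows; `score` stands in for any of the four metrics, and the data layout is an assumption of this illustration:

```python
import statistics

def human_performance(captions_per_video, score, rounds=23):
    # captions_per_video: list of caption lists (>= 23 captions per test video).
    # In round k, the k-th caption plays the role of the predicted description
    # and is scored against the remaining ones, as for the automatic models.
    run_means = []
    for k in range(rounds):
        per_video = []
        for caps in captions_per_video:
            pred, refs = caps[k], caps[:k] + caps[k + 1:]
            per_video.append(score(pred, refs))
        run_means.append(sum(per_video) / len(per_video))
    return statistics.mean(run_means), statistics.stdev(run_means)
```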
IV. EXPERIMENTS AND RESULTS

In this section, the implementation details of the experimental setup used in this study are reported, and the obtained results are presented and discussed. The BEDDS model described in III-A has been used as the baseline for observing the effects of the visual data augmentation and textual data cleansing preprocessing steps.

A. Implementation Details

1) Architecture Details:
The dimension of the hidden state of the encoder LSTM and GRU and of the decoder GRU has been set to 1000. When used, the learnt WE maps the words into a lower-dimensional space, and the VE maps the feature vectors into a lower-dimensional space. The vocabulary D has been built using the training and validation subsets of either the MSVD or the MSVD-v2 dataset. In the first case, it consists of 10 160 words; the MSVD-v2 vocabulary is slightly smaller, since the cleansing removes typos and proper nouns.

For the training, the Stochastic Gradient Descent algorithm has been employed, with a constant learning rate. The batch size has been set to 64 samples. As the early stopping criterion, the METEOR score on the validation set has been used (similar to what is done, e.g., in [48], [69]). In particular, the training has been stopped if the value of the METEOR score did not increase for ten consecutive epochs. In the test phase, the best model in terms of the METEOR score has been used. On average, the training ends in ~40 epochs for the models trained on the original dataset, ~20 for the style-augmented dataset, and ~15 for the classically augmented dataset.
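A minimal sketch of this early-stopping rule follows; `train_one_epoch` and `evaluate_meteor` are placeholder callables of this illustration, not the authors' code:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate_meteor,
                              max_epochs=200, patience_limit=10):
    # Stop when the validation METEOR has not improved for ten consecutive
    # epochs; keep the best validation model for the test phase.
    best_score, best_model, patience = float("-inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch(model)            # one SGD pass, batch size 64
        meteor = evaluate_meteor(model)   # METEOR on the validation set
        if meteor > best_score:
            best_score, best_model, patience = meteor, copy.deepcopy(model), 0
        else:
            patience += 1
            if patience >= patience_limit:
                break
    return best_model
```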
2) Visual Data Augmentation Details:
In addition to the transformations described in III-B, in the test phase only, vertical flipping and contrast reduction and enhancement have been applied. Apart from style transfer, vertical flipping, and greyscale conversion, for all the applied transformations a parameter can be set to vary their severity. Some values of the parameters have been chosen for the transformations applied to the videos in the training set, and others for the transformations applied to the test set only. In particular:
• The kernel size ρ of the Gaussian blur has been set to 12, 15, and 17 in the training phase, and to 5, 7, 10, and 20 in the test phase only.
• The ratio between the top width w_top and the bottom width w_bottom of the image for the keystone distortion has been set to four values in the training phase, and to four other values in the test phase only (see TABLE V).
• The enhancement and reduction factors for the brightness alteration have been set to one pair of values in the training phase, and to other values in the test phase only (see TABLE V).
• The probability p that a pixel is affected by the Salt & Pepper noise has been set to three values in the training phase, and to two other values in the test phase only (see TABLE V).

B. Results

1) Effects of Visual Data Augmentation:
The performance of the BEDDS model trained on the original training videos of the MSVD has been evaluated on the MSVD original test videos, and on the same videos altered as explained in IV-A2, to evaluate its generalization capability with respect to different visual conditions. The performance decreases proportionally with the intensity of the various transformations applied. This indicates that the model lacks robustness to unseen appearances of the scene.

The BEDDS model has also been trained on an augmented MSVD, obtained as explained in III-B and IV-A2. The resulting models are referred to as BEDDS-VA, in case of training on classically altered videos, and BEDDS-ST, in case of training on style-transformed videos. The comparison of the three models on the MSVD dataset is reported in TABLE V and in TABLE VI.

When tested on the original test videos of the MSVD, the performance of the BEDDS-VA model is inferior to that of the BEDDS model trained on the original training videos only. However, on the test videos altered with the same transformations as in the training set, the BEDDS-VA model outperforms the BEDDS model in terms of all the metrics. Also for the BEDDS-VA model, the performance decreases proportionally with the intensity of the various transformations applied, but the performance drop is smaller than that of the BEDDS model trained on the original videos only. On test videos altered with unseen transformations, including style transfer, the BEDDS-VA model outperforms the BEDDS model in the majority of cases. This is particularly true for the performance in terms of the CIDEr metric, which is the one that by design better captures the human consensus on the quality of image descriptions. The cases in which the performance of the BEDDS model is comparable or superior to that of the BEDDS-VA model are those of transformations that do not significantly alter the appearance of the video, such as vertical flipping and small keystone distortion. This suggests that the BEDDS model is biased towards the appearance of the training videos.

The BEDDS-ST model outperforms the BEDDS and BEDDS-VA models when tested on the style-transformed videos and on the severe Salt & Pepper noise alteration with the highest probability of altered pixels. However, in the majority of the other cases, its performance is inferior to that of the other models. This suggests that, different from other Computer Vision tasks such as classification under domain shift and depth estimation, performing style transfer for visual data augmentation is not effective for the NLVD task.

Some intuitions on this behaviour can be gained by observing the data distribution obtained via the t-SNE analysis [70] depicted in Fig. 6. The points represent the ResNet50 features extracted from the fifth frame of each video in the MSVD dataset, both original and altered as described in III-B. Recall that the ResNet50 features capture the appearance of the frames. From the t-SNE plots, it can be observed that the style-transformed frames form separate clusters, which do not overlap with the other data points. Some of the classically altered frames are grouped together, e.g., those altered via Salt & Pepper noise, severe Gaussian blur, and brightness variation. In such cases, the BEDDS-VA model outperforms the BEDDS model. The transformations that do not severely alter the appearance of the videos result in points that are distributed as those corresponding to the original frames. Therefore, in such cases, the BEDDS model performs comparably or better than the BEDDS-VA model. In application scenarios, the same analysis can be performed on videos captured under the specific operating conditions. This can guide the selection of the most appropriate visual transformations to apply to the videos to be included in the dataset for training or finetuning the NLVD system.

Furthermore, the performance of the BEDDS, BEDDS-VA, and BEDDS-ST models has been tested, without retraining, on the MSR-VTT dataset. This dataset has characteristics similar to those of the MSVD dataset, in terms of visual quality of the videos and number and quality of the captions, since both contain videos from YouTube with multiple captions per video, collected via the AMT service. The results of this comparison are reported in TABLE VII. The performance of the three models is comparable, and all below the human performance on the same dataset, calculated as done for the MSVD dataset in III-C.

These results suggest that with the visual data augmentation preprocessing step the model can deal better with appearance changes. However, the recently proposed style augmentation approach turns out to be less effective than classical alterations in the context of NLVD. In addition, the robustness with respect to the appearance conditions of specific applications can be further increased by training the NLVD models on videos altered accordingly.
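The t-SNE analysis of Fig. 6 can be reproduced along these lines; the file names and the label encoding are placeholders of this sketch:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# ResNet50 features of the fifth frame of each (original or altered) video,
# with an integer label identifying the applied alteration.
features = np.load("resnet50_fifth_frame_features.npy")  # (num_videos, feat_dim)
labels = np.load("alteration_labels.npy")                # (num_videos,)

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
for lab in np.unique(labels):
    pts = coords[labels == lab]
    plt.scatter(pts[:, 0], pts[:, 1], s=3, label=f"alteration {lab}")
plt.legend(markerscale=3)
plt.show()  # separate clusters suggest a useful augmentation (see text)
```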
2) Effects of Textual Data Cleansing:
The BEDDS model has been trained on either the MSVD or the MSVD-v2 dataset, the latter obtained as explained in III-C. The resulting model is referred to as BEDDS-TC. Both variants have been tested on the two datasets. The BEDDS-TC model outperforms the BEDDS model on both datasets in terms of all metrics but CIDEr on the MSVD (69.8 for BEDDS-TC and 70.0 for BEDDS). The results of this study are reported in TABLE VIII. The same trend can be observed also when testing on the MSR-VTT dataset, as observed from TABLE VII.

Considering only the performance gain obtained in terms of evaluation metrics is limiting and can be misleading for investigating the effects of training with high-quality textual data. Therefore, the descriptions produced by the two models have been compared further, from a qualitative point of view. The complete corpus of results is available online at http://sira.diei.unipg.it/supplementary/input4nlvd2018/. As expected, there are cases where one model outputs a correct description while the other a completely wrong one. Nevertheless, both the BEDDS-TC and the BEDDS models produce correct detailed descriptions for the same videos. It is interesting to focus on the cases where the BEDDS model outputs an erroneous detailed description. Some examples are reported in Fig. 7 for the MSVD dataset, and in Fig. 8 for the MSR-VTT dataset. In such cases, the descriptions from the BEDDS-TC model are more generic but still correct. However, metrics based on n-gram similarity rather than semantic consistency, like those used in the NLVD evaluation, cannot properly capture this aspect. In addition, synonyms and hypernyms can be penalized [71]. This can explain the little performance gain achieved with textual data cleansing in terms of such metrics. In light of this and of the considerations in III-C on the syntactic and semantic errors in the MSVD, we believe that using the MSVD-v2 dataset to train and test the NLVD algorithms is reasonable, because it contains better quality ground truth captions. This is confirmed by the average human performance estimation on the MSVD and MSVD-v2 datasets. As mentioned in III-C, its mean value is higher on the amended dataset, and the variance is smaller. Not even human performance can be perfect for this task, due to its intrinsic subjectivity. However, the improved performance after the textual data cleansing suggests that the MSVD-v2 dataset represents a more reliable benchmark than the MSVD for the NLVD task. Finally, the comparison of the performance on the original and the amended datasets highlights the importance of the consistency of the textual component when designing an NLVD system.

Fig. 7: Exemplar captions produced by the BEDDS model, which was trained on the original MSVD, and the BEDDS-TC model, which was trained on the MSVD-v2 dataset.

Fig. 8: Exemplar captions produced by the BEDDS model and the BEDDS-TC model on the MSR-VTT dataset.

TABLE V: Performance of the BEDDS, BEDDS-VA, and BEDDS-ST models on differently altered test videos of the MSVD, used both in training and test phase or in test phase only. B stands for BLEU-4, R-L for ROUGE-L, M for METEOR, and C for CIDEr. Bold indicates the best performance.

V. CONCLUSION
In this work, a study has been presented to evaluate the performance of NLVD systems in case the video input dataset is augmented with transformed videos, derived from the original ones by applying common transformations. For this purpose, extensive studies have been performed on the benchmark MSVD dataset and on a refined version specifically amended for this study (the MSVD-v2 dataset). The experiments have been carried out using a simple yet effective NLVD encoder-decoder architecture.

The results of the analysis reveal that the visual data augmentation generally provides improvements in terms of robustness to appearance changes. In particular, considering the CIDEr score, which by design correlates with the human judgement on image description, the model trained on the augmented videos obtains an average performance improvement of about +4 points, with peaks up to +22 points for severe Gaussian blur, when tested on videos altered using a different set of transformations compared to those used in the training set. As expected, this improvement is more significant when the NLVD model is tested on videos altered with the same transformations used in the training set (about +12 points on average, with peaks up to +29 points for severe alterations such as keystone distortion and Salt & Pepper noise). This suggests that, when applying the NLVD system in a real-world scenario, it is beneficial to train or finetune the system with videos altered according to the visual conditions typical of the specific application.
Style                                     BEDDS                    BEDDS-VA                 BEDDS-ST
                                          B     R-L   M     C      B     R-L   M     C      B     R-L   M     C
Original videos                           --    --    --    --     --    --    --    --     --    --    --    --
Afremov's Rain Princess                   18.5  51.1  19.9  20.6   13.9  48.2  19.2  12.4   --    --    --    --
Munch's The Scream                        27.9  60.5  25.5  36.9   27.8  60.5  25.4  38.5   --    --    --    --
Picabia's Udnie                           23.0  57.2  23.1  21.6   23.0  57.8  23.2  21.4   --    --    --    --
Hokusai's The Great Wave off Kanagawa     22.9  57.9  22.8  24.6   24.2  58.4  24.1  26.9   --    --    --    --
Turner's The Wreck of a Transport Ship    25.4  59.3  23.6  30.2   26.6  60.5  24.8  34.8   --    --    --    --
TABLE VI: Performance of the BEDDS, BEDDS-VA, and BEDDS-ST models on the test videos of the MSVD, transformed in different artistic styles. B stands for BLEU-4, R-L for ROUGE-L, M for METEOR, and C for CIDEr. Bold indicates the best performance.

Fig. 6: Distribution of the frames in the MSVD dataset, altered with classical alterations and style transfer. 6a contains the points associated to the original frames and to all the altered frames. 6b contains the points associated to the original frames and to the style-transformed frames used for training the BEDDS-ST model. 6c contains the points associated to the original frames and to the frames altered with the classical alterations in the training set of the BEDDS-VA model. 6d contains the points associated to the original frames and to the altered frames used only for test.
Model     B     R-L   M     C
BEDDS     16.9  42.7  16.3  --
BEDDS-VA  16.9  42.2  16.0  9.0
BEDDS-ST  16.9  42.1  16.0  8.6
BEDDS-TC  --    --    --    --
Human     -- ± --   -- ± --   -- ± --   -- ± --

TABLE VII: Performance of the BEDDS, BEDDS-VA, and BEDDS-ST models on the original test videos of the MSR-VTT. B stands for BLEU-4, R-L for ROUGE-L, M for METEOR, and C for CIDEr. Bold indicates the best performance of the models. For completeness, the human performance is also reported.
Test set  Model     B     R-L   M     C
MSVD      BEDDS     45.1  69.4  32.9  70.0
MSVD      BEDDS-TC  --    --    --    69.8
MSVD-v2   BEDDS     --    --    --    --
MSVD-v2   BEDDS-TC  --    --    --    --

TABLE VIII: Performance of the BEDDS and BEDDS-TC models on the two versions of the MSVD, original and checked (MSVD-v2). B stands for BLEU-4, R-L for ROUGE-L, M for METEOR, and C for CIDEr. Bold indicates the best performance.

In this work, it has been shown that some insights on the utility of the specific input transformations can be gained using a t-SNE analysis. Specifically, the videos altered via transformations that do not severely change the appearance are distributed as the original videos, while those altered with severe transformations (such as Salt & Pepper noise, severe Gaussian blur, and brightness variation) are grouped in separate clusters. For those latter cases, data augmentation brings a significant improvement in the performance of the NLVD system. Finally, it was observed that the BEDDS-TC model, trained on the refined MSVD-v2 dataset, provides more generic but correct captions, reflected in a performance improvement in terms of all the evaluation metrics.

ACKNOWLEDGMENT
We gratefully thank the NVIDIA Corporation for the donation of the Titan Xp GPU used for this research. Our gratitude also goes to the users who volunteered for the MSVD text checking task.

REFERENCES
[1] P. Cui, S. Liu, and W. Zhu, "General knowledge embedded image representation learning," IEEE Transactions on Multimedia, vol. 20, no. 1, pp. 198–207, 2018.
[2] C. Kofler, L. Yang, M. Larson, T. Mei, A. Hanjalic, and S. Li, "Predicting failing queries in video search," IEEE Transactions on Multimedia, vol. 16, no. 7, pp. 1973–1985, 2014.
[3] H. Xie, Y. Zhang, J. Tan, L. Guo, and J. Li, "Contextual query expansion for image retrieval," IEEE Transactions on Multimedia, vol. 16, no. 4, pp. 1104–1114, 2014.
[4] W. Li, J. Joo, H. Qi, and S.-C. Zhu, "Joint image-text news topic detection and tracking by multimodal topic and-or graph," IEEE Transactions on Multimedia, vol. 19, no. 2, pp. 367–381, 2017.
[5] Y. Hu, L. Zheng, Y. Yang, and Y. Huang, "Twitter100k: A real-world dataset for weakly supervised cross-media retrieval," IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 927–938, 2018.
[6] H. Song, X. Wu, W. Yu, and Y. Jia, "Extracting key segments of videos for event detection by learning from web sources," IEEE Transactions on Multimedia, vol. 20, no. 5, pp. 1088–1100, 2018.
[7] X. Yang, T. Zhang, and C. Xu, "Text2video: An end-to-end learning framework for expressing text with videos," IEEE Transactions on Multimedia, 2018.
[8] L. Baraldi, C. Grana, and R. Cucchiara, "Recognizing and presenting the storytelling video structure with deep multimodal networks," IEEE Transactions on Multimedia, vol. 19, no. 5, pp. 955–968, 2017.
[9] L. Li, S. Tang, Y. Zhang, L. Deng, and Q. Tian, "GLA: Global–local attention for image description," IEEE Transactions on Multimedia, vol. 20, no. 3, pp. 726–737, 2018.
[10] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, "Video captioning with attention-based LSTM and semantic consistency," IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2045–2055, 2017.
[11] J. Dong, X. Li, and C. G. Snoek, "Predicting visual features from text for image and video caption retrieval," IEEE Transactions on Multimedia, 2018.
[12] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
[13] P. Anderson, B. Fernando, M. Johnson, and S. Gould, "SPICE: Semantic propositional image caption evaluation," in European Conference on Computer Vision. Springer, 2016, pp. 382–398.
[14] S. Cascianelli, G. Costante, T. A. Ciarfuglia, P. Valigi, and M. L. Fravolini, "Full-GRU natural language video description for service robotics applications," IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 841–848, 2018.
[15] A. Torabi, C. Pal, H. Larochelle, and A. Courville, "Using descriptive video services to create a large data source for video annotation research," arXiv preprint arXiv:1503.01070, 2015.
[16] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, "A dataset for movie description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3202–3212.
[17] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko, "YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2712–2719.
[18] J. Xu, T. Mei, T. Yao, and Y. Rui, "MSR-VTT: A large video description dataset for bridging video and language," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5288–5296.
[19] A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele, "Movie description," International Journal of Computer Vision, vol. 123, no. 1, pp. 94–120, 2017.
[20] C.-M. Teng, "Correcting noisy data," in ICML. Citeseer, 1999, pp. 239–248.
[21] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, "Data preprocessing for supervised leaning," International Journal of Computer Science, vol. 1, no. 2, pp. 111–117, 2006.
[22] N. M. Nawi, W. H. Atomi, and M. Rehman, "The effect of data pre-processing on optimized training of artificial neural networks," Procedia Technology, vol. 11, pp. 32–39, 2013.
[23] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney, "Integrating language and vision to generate natural language descriptions of videos in the wild," in COLING, vol. 2, no. 5, 2014, p. 9.
[24] K. Cho, A. Courville, and Y. Bengio, "Describing multimedia content using attention-based encoder-decoder networks," IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1875–1886, 2015.
[25] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama, "Generating natural-language video descriptions using text-mined knowledge," in AAAI, vol. 1, 2013, p. 2.
[26] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 19–27.
[27] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence – video to text," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4534–4542.
[28] Y. Guo, J. Zhang, and L. Gao, "Exploiting long-term temporal dynamics for video captioning," World Wide Web, pp. 1–15, 2018.
[29] Y. Xu, Y. Han, R. Hong, and Q. Tian, "Sequential video VLAD: Training the aggregation locally and temporally," IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 4933–4944, 2018.
[30] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, "Translating videos to natural language using deep recurrent neural networks," arXiv preprint arXiv:1412.4729, 2014.
[31] X. Li, Z. Zhou, L. Chen, and L. Gao, "Residual attention-based LSTM for video captioning," World Wide Web, pp. 1–16, 2018.
[32] Y. Bin, Y. Yang, F. Shen, N. Xie, H. T. Shen, and X. Li, "Describing video with attention-based bidirectional LSTM," IEEE Transactions on Cybernetics, 2018.
[33] L. Baraldi, C. Grana, and R. Cucchiara, "Hierarchical boundary-aware neural encoder for video captioning," in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 3185–3194.
[34] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[35] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[36] J. Song, Y. Guo, L. Gao, X. Li, A. Hanjalic, and H. T. Shen, "From deterministic to generative: Multi-modal stochastic RNNs for video captioning," arXiv preprint arXiv:1708.02478, 2017.
[37] S. Venugopalan, L. A. Hendricks, R. Mooney, and K. Saenko, "Improving LSTM-based video description with linguistic knowledge mined from text," arXiv preprint arXiv:1604.01729, 2016.
[38] H. Wang, C. Gao, and Y. Han, "Sequence in sequence for video captioning," Pattern Recognition Letters, vol. 105, pp. 23–29, 2018.
[40] T.-H. Chen, K.-H. Zeng, W.-T. Hsu, and M. Sun, "Video captioning via sentence augmentation and spatio-temporal attention," in Asian Conference on Computer Vision. Springer, 2016, pp. 269–286.
[41] Z. Yang, Y. Han, and Z. Wang, "Catching the temporal regions-of-interest for video captioning," in Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017, pp. 146–153.
[42] A. Wu and Y. Han, "Multi-modal circulant fusion for video-to-language and backward," in IJCAI, 2018, pp. 1029–1035.
[43] Y. Yu, J. Kim, and G. Kim, "A joint sequence fusion model for video question answering and retrieval," in The European Conference on Computer Vision (ECCV), September 2018.
[44] R. Pasunuru and M. Bansal, "Multi-task video captioning with video and entailment generation," arXiv preprint arXiv:1704.07489, 2017.
[45] L. Li and B. Gong, "End-to-end video captioning with multitask reinforcement learning," arXiv preprint arXiv:1803.07950, 2018.
[46] R. Pasunuru and M. Bansal, "Reinforced video captioning with entailment rewards," arXiv preprint arXiv:1708.02300, 2017.
[47] X. Wang, W. Chen, J. Wu, Y.-F. Wang, and W. Y. Wang, "Video captioning via hierarchical reinforcement learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4213–4222.
[48] B. Wang, L. Ma, W. Zhang, and W. Liu, "Reconstruction network for video captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7622–7631.
[49] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[50] P. T. Jackson, A. Atapour-Abarghouei, S. Bonner, T. Breckon, and B. Obara, "Style augmentation: Data augmentation via style randomization," arXiv preprint arXiv:1809.05375, 2018.
[51] G. Ghiasi, H. Lee, M. Kudlur, V. Dumoulin, and J. Shlens, "Exploring the structure of a real-time, arbitrary neural artistic stylization network," arXiv preprint arXiv:1705.06830, 2017.
[52] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image style transfer using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414–2423.
[53] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision. Springer, 2016, pp. 694–711.
[54] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv preprint arXiv:1607.08022, 2016.
[55] Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz, "A closed-form solution to photorealistic image stylization," arXiv preprint arXiv:1802.06474, 2018.
[56] X. Zhang and Y. LeCun, "Text understanding from scratch," arXiv preprint arXiv:1502.01710, 2015.
[57] C. Fellbaum, "WordNet," in Theory and Applications of Ontology: Computer Applications. Springer, 2010, pp. 231–243.
[58] I. Saito, J. Suzuki, K. Nishida, K. Sadamitsu, S. Kobashikawa, R. Masumura, Y. Matsumoto, and J. Tomita, "Improving neural text normalization with data augmentation at character- and morphological levels," in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), vol. 2, 2017, pp. 257–262.
[59] M. Fadaee, A. Bisazza, and C. Monz, "Data augmentation for low-resource neural machine translation," arXiv preprint arXiv:1705.00440, 2017.
[60] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2002, pp. 311–318.
[61] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, Barcelona, Spain, 2004.
[62] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, vol. 29, 2005, pp. 65–72.
[63] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[64] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
[65] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[66] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[67] L. Engstrom, "Fast style transfer," https://github.com/lengstrom/fast-style-transfer/, 2016, commit c77c028.
[68] D. L. Chen and W. B. Dolan, "Collecting highly parallel data for paraphrase evaluation," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-2011), Portland, OR, June 2011.
[69] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2015, pp. 2048–2057.
[70] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
[71] M. Kilickaya, A. Erdem, N. Ikizler-Cinbis, and E. Erdem, "Re-evaluating automatic metrics for image captioning," arXiv preprint arXiv:1612.07600, 2016.
Silvia Cascianelli received the M.Sc. degree magna cum laude in Information and Automation Engineering in 2015. In the same year, she joined the Intelligent Systems, Automation and Robotics Laboratory (ISARLab) as a Ph.D. student, and she is currently a Research Assistant there. Her research interests are mainly Machine Learning, Natural Language Processing, and Computer Vision for Robotics.
Gabriele Costante received the Ph.D. degree in Robotics from the University of Perugia in 2016. He is currently a Post-Doc Researcher at the ISARLab and a Lecturer of Computer Vision at the University of Perugia, Department of Engineering. His research interests are mainly Robotics, Computer Vision, and Machine Learning.
Alessandro Devo received the M.Sc. degree magna cum laude in Information and Robotics Engineering from the University of Perugia in 2018, with a thesis on Natural Language Video Description for service robotics applications. He then joined the ISARLab as a Ph.D. student. His research interests are mainly Machine Learning, Reinforcement Learning, and Computer Vision.
Thomas A. Ciarfuglia received the Ph.D. degree in Robotics from the University of Perugia in 2011. He joined the ISARLab in 2008 and worked as a Post-Doc there. He is a Lecturer of Machine Learning at the University of Perugia, Department of Engineering. His research interests are Machine Learning and Computer Vision for Robotics.
Paolo Valigi received the Ph.D. degree from the University of Rome "Tor Vergata" in 1991. From 1990 to 1994 he worked with the Fondazione Ugo Bordoni. Since 2004 he has been a Full Professor at the University of Perugia, Department of Engineering. He is currently the head of the ISARLab. His research interests are in the field of Robotics and Systems Biology.