TEMPORAL SUB-SAMPLING OF AUDIO FEATURE SEQUENCES FOR AUTOMATED AUDIO CAPTIONING
Khoa Nguyen, Konstantinos Drossos, Tuomas Virtanen
3D Media Group, Tampere University, Tampere, Finland, firstname.lastname@tuni.fi
Audio Research Group, Tampere University, Tampere, Finland, firstname.lastname@tuni.fi

ABSTRACT
Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. However, the length of the textual description is considerably shorter than the length of the audio signal, for example 10 words versus some thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that explicitly takes advantage of this difference in lengths between the sequences, by applying temporal sub-sampling to the audio input sequence. We employ a sequence-to-sequence method whose encoder outputs a fixed-length vector, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach using the freely available dataset Clotho, and we evaluate the impact of different temporal sub-sampling factors. Our results show an improvement in all considered metrics.
Index Terms— audio captioning, recurrent neural networks, temporal sub-sampling, hierarchical sub-sampling networks
1. INTRODUCTION
Audio captioning is the task of automatically describing the contents of a general audio signal, using natural language [1, 2]. It can be considered an inter-modal translation task, where the contents of the audio signal are translated to text [2, 3]. Audio captioning offers the ability to develop methods that can learn complex information from audio data, like spatiotemporal relationships, differentiation between foreground and background, and higher and abstract knowledge (e.g. counting) [2, 4].

Audio captioning started in 2017 [1] and all published audio captioning methods (up to now) are based on deep neural networks (DNNs) [3, 5]. The usual set-up of methods follows sequence-to-sequence architectures, where an encoder (usually based on recurrent neural networks, RNNs) takes a sequence of audio feature vectors as an input, and a decoder (also usually RNN-based) takes as an input the output of the encoder and outputs a sequence of words. A common element of sequence-to-sequence architectures is the alignment of the input and output sequences [6, 7, 8].
K. Drossos and T. Virtanen would like to acknowledge CSC - IT Center for Science, Finland, for computational resources. Part of the computations leading to these results were performed on a TITAN-X GPU donated by NVIDIA to K. Drossos.

Three main alternative techniques have been proposed for the alignment of the input and output sequences, namely the employment of a fixed-length vector representation of the output of the encoder [6], the usage of an attention mechanism [7, 8], and self-attention [9]. The former two approaches have been widely adopted in the audio captioning field. Specifically, [1] presents a method that employs an RNN encoder, an RNN decoder, and an attention mechanism between the encoder and the decoder, to align the input and output sequences. Study [3] again used an RNN encoder and an RNN decoder, but the alignment of the input and output sequences was performed with the usage of a fixed-length vector. This vector was the mean, over time, of the output of the encoder. Finally, [5] presented an approach for audio captioning, employing the attention mechanism presented in [8].

Although the above-mentioned techniques seem to be essential for the audio captioning task, they are applied only at the output of the encoder. This means that the encoder processes the whole input audio sequence, and only at its output is there an association (through the alignment mechanisms) of the different parts of the input sequence with the different parts of the output sequence. If we could adopt a policy for effectively collating sequential parts of the input sequence, then we might be able to let the encoder better learn entities that exhibit a long presence in time, i.e., long temporal patterns that correspond to one output class, like the sound of a car and the word "car". This need for DNNs has been identified at least since 2012 [10], when hierarchical sub-sampling networks were introduced; they have also been adopted more recently, e.g. [11].

In this paper we employ hierarchical sub-sampling networks and we present a novel approach for audio captioning methods employing multi-layered and RNN-based encoders. We draw inspiration from the empirical observation that each word in the caption corresponds to multiple time-steps of the input sequence to a DNN-based audio captioning method, and we hypothesize that the performance of the method could be enhanced by reducing the temporal length of the sequence after each RNN layer in an RNN-based encoder. To assess our hypothesis and our method, we employ a method that does not use temporal sub-sampling, and we alter it by solely adding the sub-sampling. The obtained results show that, indeed, temporal sub-sampling can enhance the performance of audio captioning methods.

The rest of the paper is organized as follows. In Section 2 we present our method and the temporal sub-sampling scheme. In Section 3 we present the followed evaluation procedure, and the obtained results are in Section 4. Section 5 concludes the paper.

Figure 1: Illustration of the proposed method, with $L_{enc}$ bi-directional RNN layers at the encoder. Details are shown for $\text{RNN}^{l=1}_{enc}$; the rest of the layers are similar. Details of the sub-sampling are in Figure 2.
2. PROPOSED METHOD
Our proposed method employs a multi-layered, bi-directional RNN-based encoder and an RNN-based decoder. It accepts as an input a sequence of $T$ audio feature vectors with $F$ features, $\mathbf{X} \in \mathbb{R}^{T \times F}$, and outputs a sequence of $S$ vectors, $\hat{\mathbf{Y}} = [\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_S]$, where each vector $\hat{\mathbf{y}}_s \in [0, 1]^{D}$ contains the predicted probability for each of the $D$ words available to the method, and $D$ is the amount of unique words in the captions of the training dataset. After each bi-directional RNN layer in the encoder (apart from the last one), our method applies a temporal sub-sampling to the output sequence of the bi-directional RNN layer, before using the sequence as an input to the next bi-directional RNN layer in the encoder. Figure 1 illustrates the method. Our method is otherwise the same as the baseline method of the DCASE 2020 automated audio captioning task (task 6, http://dcase.community/challenge2020/task-automatic-audio-captioning), enhanced with the addition of temporal sub-sampling of the latent representations in the encoder.

Our encoder consists of $L_{enc}$ bi-directional RNNs, where $\overrightarrow{\text{RNN}}^{l}_{enc}$ and $\overleftarrow{\text{RNN}}^{l}_{enc}$ are the $l$-th forward and backward RNNs of the encoder, respectively. $\overrightarrow{\text{RNN}}^{1}_{enc}$ and $\overleftarrow{\text{RNN}}^{1}_{enc}$ process the input sequence $\mathbf{X}$ as

$\overrightarrow{\mathbf{h}}_t = \overrightarrow{\text{RNN}}^{1}_{enc}(\mathbf{x}_t, \overrightarrow{\mathbf{h}}_{t-1})$ and   (1)

$\overleftarrow{\mathbf{h}}_t = \overleftarrow{\text{RNN}}^{1}_{enc}(\overleftarrow{\mathbf{x}}_t, \overleftarrow{\mathbf{h}}_{t-1})$,   (2)

where $\overleftarrow{\mathbf{x}}_t$ is the $t$-th feature vector of the time-reversed $\mathbf{X}$, $\overrightarrow{\mathbf{h}}_t, \overleftarrow{\mathbf{h}}_t \in [-1, 1]^{\Xi}$ are the outputs of $\overrightarrow{\text{RNN}}^{1}_{enc}$ and $\overleftarrow{\text{RNN}}^{1}_{enc}$, respectively, for the $t$-th time-step, $\overrightarrow{\mathbf{h}}_0 = \overleftarrow{\mathbf{h}}_0 = [0]^{\Xi}$, and $\Xi$ is the amount of output features of each of the RNNs of the $l$-th bi-directional RNN of the encoder. Then, the outputs of $\overrightarrow{\text{RNN}}^{1}_{enc}$ and $\overleftarrow{\text{RNN}}^{1}_{enc}$, $\overrightarrow{\mathbf{h}}_t$ and $\overleftarrow{\mathbf{h}}_t$, respectively, are concatenated as

$\mathbf{h}_t = [\overrightarrow{\mathbf{h}}^{\top}_t, \overleftarrow{\mathbf{h}}^{\top}_t]^{\top}$,   (3)

and $\mathbf{H}^{1} = [\mathbf{h}_1, \ldots, \mathbf{h}_T]$.

Figure 2: Illustration of the sub-sampling process with a factor $M = 2$, with an input sequence of $A$ vectors of $B$ features, $\mathbf{O} \in \mathbb{R}^{A \times B}$, and an output sequence $\mathbf{O}'' \in \mathbb{R}^{\lfloor A/M \rfloor \times B}$. With the red "X" we indicate the vectors that were discarded, according to the sub-sampling process.

Then, for $1 \leq l < L_{enc}$, our method applies a temporal sub-sampling to $\mathbf{H}^{l}$, as

$\mathbf{H}''^{\,l} = \{\mathbf{h}^{l}_{iM+1}\}_{i=0}^{\lfloor T_{l}/M \rfloor}$,   (4)

where $T_l$ is the amount of time-steps of $\mathbf{H}^{l}$, $M \in \mathbb{N}^{\star}$ is the sub-sampling factor, and $\lfloor \cdot \rfloor$ is the floor function. Figure 2 illustrates the sub-sampling process. Then, $\mathbf{H}''^{\,l-1}$ is given as an input to $\overrightarrow{\text{RNN}}^{l}_{enc}$ and $\overleftarrow{\text{RNN}}^{l}_{enc}$, similarly to Eqs. (1), (2), and (3), obtaining $\mathbf{H}'^{\,l}$. $\mathbf{H}^{l}$ is obtained by a residual connection between $\mathbf{H}''^{\,l-1}$ and $\mathbf{H}'^{\,l}$, as

$\mathbf{H}^{l} = \mathbf{H}'^{\,l} + \mathbf{H}''^{\,l-1}$,   (5)

where $\mathbf{H}^{l} \in \mathbb{R}^{T_l \times \Delta}$ is the output of the $l$-th bi-directional RNN layer of the encoder, with $\Delta = 2\Xi$ due to the concatenation of Eq. (3). By utilizing Eq. (4), we force the RNNs of the encoder to squeeze the information of the input sequence into a smaller output sequence, effectively making the RNNs learn a time-filtering and time-compression of the information in the input sequence [10, 11]. This can prove beneficial in the case of audio captioning, since one output class (i.e. one word) corresponds to multiple input time-steps. A minimal code sketch of this encoder follows.
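The following is a minimal PyTorch sketch of the encoder of Eqs. (1)-(5). The paper's own implementation is linked in Section 3; here the choice of GRUs for the bi-directional RNNs and all class, argument, and hyper-parameter names are illustrative assumptions.

```python
# A sketch of the sub-sampling encoder, assuming GRUs; not the official code.
import torch
import torch.nn as nn


class SubSamplingEncoder(nn.Module):
    def __init__(self, in_features: int, hidden: int, num_layers: int, m: int):
        super().__init__()
        self.m = m  # temporal sub-sampling factor M
        self.layers = nn.ModuleList([
            nn.GRU(
                input_size=in_features if i == 0 else 2 * hidden,
                hidden_size=hidden,
                batch_first=True,
                bidirectional=True,  # forward/backward RNNs of Eqs. (1)-(2)
            )
            for i in range(num_layers)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x  # (batch, T, F)
        for i, rnn in enumerate(self.layers):
            if i > 0:
                h = h[:, :: self.m, :]  # Eq. (4): keep every M-th time-step
            h_out, _ = rnn(h)  # concatenated forward/backward outputs, Eq. (3)
            # Eq. (5): residual connection; skipped for the first layer,
            # whose input has F features instead of 2 * hidden.
            h = h_out + h if i > 0 else h_out
        return h[:, -1, :]  # fixed-length output z, see Eq. (6) below


# With M = 2 and three layers, a 15 s clip (T = 1292) is sub-sampled twice:
# enc = SubSamplingEncoder(64, 256, 3, 2)
# z = enc(torch.randn(1, 1292, 64))  # last layer sees 323 steps; z: (1, 512)
```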
By temporal sub-sampling, we force the encoder to express the learnt information with a lower temporal resolution, effectively providing a shorter sequence in which each time-step represents longer temporal patterns [10]. The above presented scheme of temporal sub-sampling results in reducing the length of the input audio sequence by $M^{L_{enc}-1}$ times. For example, a sub-sampling factor of $M = 2$ and $L_{enc} = 3$ RNN layers results in reducing the length of the input audio sequence 4 times (a reduction of 75.0%).

Finally, the output of the encoder, $\mathbf{z} \in \mathbb{R}^{\Delta}$, is formed as

$\mathbf{z} = \mathbf{H}^{L_{enc}}_{T_{L_{enc}}}$.   (6)

That is, $\mathbf{z}$ is the last time-step of the output sequence $\mathbf{H}^{L_{enc}}$ of the $L_{enc}$-th bi-directional RNN of the encoder.

The decoder of our method consists of an RNN and a linear layer followed by a softmax non-linearity (the latter two will collectively be referred to as the classifier). The decoder takes $\mathbf{z}$ as an input at every time-step, as

$\mathbf{u}_s = \text{RNN}_{dec}(\mathbf{z}, \mathbf{u}_{s-1})$,   (7)

where $\mathbf{u}_s \in [0, 1]^{\Psi}$ is the output of $\text{RNN}_{dec}$ for the $s$-th time-step of the decoder, with $\mathbf{u}_0 = [0]^{\Psi}$. $\mathbf{u}_s$ is used as an input to the classifier, Cls, as

$\hat{\mathbf{y}}_s = \text{Cls}(\mathbf{u}_s)$.   (8)

The process described by Eqs. (7) and (8) is repeated until $s = S$ or until $\hat{\mathbf{y}}_s$ matches a predefined symbol, depending on whether the decoder is in the optimization or the inference process, respectively. The encoder, the decoder, and the classifier are jointly optimized to minimize the binary cross-entropy at each time-step, between the predicted, $\hat{\mathbf{y}}_s$, and ground truth, $\mathbf{y}_s = [y_{s,1}, \ldots, y_{s,D}]$, one-hot encodings of words. A corresponding sketch of the decoding loop is given below.
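A minimal sketch of the decoder of Eqs. (7)-(8), again with a GRU cell as an illustrative assumption, and with greedy unrolling for a fixed number of steps:

```python
# A sketch of the decoder; class and argument names are assumptions.
import torch
import torch.nn as nn


class Decoder(nn.Module):
    def __init__(self, z_dim: int, hidden: int, vocab_size: int):
        super().__init__()
        self.rnn = nn.GRUCell(z_dim, hidden)             # RNN_dec of Eq. (7)
        self.classifier = nn.Linear(hidden, vocab_size)  # linear part of Cls

    def forward(self, z: torch.Tensor, max_steps: int) -> torch.Tensor:
        # z: (batch, z_dim), the fixed-length encoder output of Eq. (6)
        u = torch.zeros(z.size(0), self.rnn.hidden_size, device=z.device)
        outputs = []
        for _ in range(max_steps):
            u = self.rnn(z, u)  # Eq. (7): the same z is fed at every step
            outputs.append(self.classifier(u).softmax(-1))  # Eq. (8)
        # At inference, one would instead stop as soon as <eos> is predicted.
        return torch.stack(outputs, dim=1)  # (batch, S, D)
```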
3. EVALUATION
To evaluate our method, we employ a freely available and well-curated audio captioning dataset, called Clotho [4], and the baseline of the DCASE 2020 audio captioning task. For assessing the performance of our method, we use machine translation and captioning metrics employed by most of the existing audio captioning work.
Clotho contains audio clips of CD quality (44.1 kHz sampling rate, 16-bit sample width), and 5 captions for each audio clip. The time duration of the audio clips ranges from 15 to 30 seconds, and the amount of words in each caption ranges from eight to 20 words. Clotho provides three splits for developing audio captioning methods, namely development, evaluation, and testing. The development and evaluation splits are freely available online (https://zenodo.org/record/3490684), while the testing split is withheld for scientific challenges. In this work, we employ the development and evaluation splits of Clotho, having 2893 and 1045 audio clips, yielding 14465 and 5225 captions, respectively. We choose Clotho because it is built to offer audio content diversity, and extra care has been taken to eliminate spelling errors, named entities, and speech transcription in the captions. Additionally, Clotho is already employed in the DCASE 2020 audio captioning task.

We extract $F = 64$ log-scaled mel-band energies from each of the 4981 audio clips of Clotho, using a 1024-sample long window (approximately 23 ms) with 50% overlap and the Hamming windowing function. This results in sequences of $T = 1292$ to $T = 2584$ audio feature vectors, for audio clips of 15 and 30 seconds duration, respectively. We process the captions of Clotho, starting by appending ⟨eos⟩ to all captions, where ⟨eos⟩ is a special token that signifies the end of the sequence (i.e. the caption). Then, we identify the set of words in the captions, include ⟨eos⟩ in that set as well, and represent each element of that set (called a token from now on) with a one-hot encoded vector $\mathbf{y} \in \{0, 1\}^{D}$. This process yields a sequence of $S = 8$ to $S = 21$ one-hot encoded tokens for each caption. To implement the above, we employed the freely available code from the DCASE 2020 audio captioning task (https://github.com/audio-captioning/clotho-dataset). We use each audio clip with all of its corresponding captions as separate input-output examples, resulting in a total of 14465 and 5225 input-output examples for the development and evaluation splits, respectively. We use each of the sequences of audio feature vectors in the splits as our $\mathbf{X}$ and each of the corresponding sequences of one-hot encoded tokens as our $\mathbf{Y}$.
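As an illustration of this feature-extraction step, here is a short sketch using librosa; the paper links its own code above, so the exact library calls (and the spectrogram power, left at librosa's default here) are assumptions:

```python
# Illustrative log-mel extraction, not necessarily the paper's exact code.
import librosa
import numpy as np


def extract_log_mel(path: str) -> np.ndarray:
    """64 log-scaled mel-band energies: 1024-sample Hamming window,
    50% overlap (hop of 512 samples), 44.1 kHz audio."""
    y, sr = librosa.load(path, sr=44100)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=512,
        window="hamming", n_mels=64,
    )
    # A 15 s clip gives roughly 661500 / 512, i.e. about 1292 feature vectors.
    return np.log(mel + np.finfo(float).eps).T  # shape (T, 64)
```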
To optimize our method and fine-tune its hyper-parameters, we employ the development split. By empirical observation of the values of the loss on the development split, we decided to round the loss value to three decimal digits and to stop the training if the loss did not improve for 100 consecutive epochs. For the training of our method, we employed a batch size of 16 (mainly due to computational resource constraints). To apply a uniform $T$ and $S$ in a batch, we calculate the maximum $T$ and $S$ in the batch, and we prepend vectors of $[0]^{F}$ to each $\mathbf{X}$ and append the one-hot encoded vector of ⟨eos⟩ to every $\mathbf{Y}$, in order to make their $T$ and $S$ equal to the maximum $T$ and $S$ in the same batch. Additionally, we observed that there is a considerable imbalance in the frequency of appearance of the tokens in the captions. That is, some $w_i$, for example "a"/"an" and "the", appear quite frequently (e.g. over 4000 times) in the captions, but some appear far fewer times, e.g. five.

To overcome this imbalance, each token, $w_i$, is inversely weighted by its frequency in the dataset, resulting in the following loss formulation:

$L'(\hat{\mathbf{y}}_s, \mathbf{y}_s) = \Phi_s L(\hat{\mathbf{y}}_s, \mathbf{y}_s)$,   (9)

where $\Phi_s$ is a weight for the loss calculation of the token represented by $\mathbf{y}_s$. But due to the quite large frequency of tokens like "a", we observed that $\Phi_s$ could get very low values which, when compared to a value of $\Phi_s = 1$ for the non-frequent tokens, is a great difference. This difference in the values of $\Phi_s$ significantly hampers the learning of the frequent tokens and, in addition, such tokens would never contribute to the training of our method, since we round $L'$ to three decimal digits. Thus, we employed a clamping of $\Phi_s$ as

$\Phi_s = \begin{cases} \frac{\min(f_w)}{f_{w_s}} & \text{if } \frac{\min(f_w)}{f_{w_s}} \geq \beta, \\ \beta & \text{otherwise}, \end{cases}$   (10)

where $\beta$ is a hyper-parameter that we set by following the above described process, $f_{w_s}$ is the frequency of the token $w_s$ in the development split, $w_s$ is the token that $\mathbf{y}_s$ corresponds to, and $\min(f_w)$ is the minimum frequency over all tokens in the Clotho dataset.
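A minimal sketch of the clamped weighting of Eq. (10); counting token frequencies on the development split follows the paper, while the function name and the $\beta$ value below are purely illustrative assumptions:

```python
# Clamped inverse-frequency token weights, Eq. (10); names are assumptions.
from collections import Counter


def token_weights(captions: list[list[str]], beta: float) -> dict[str, float]:
    """captions: tokenized captions of the development split."""
    freq = Counter(token for caption in captions for token in caption)
    f_min = min(freq.values())
    # Eq. (10): weight min(f_w) / f_{w_s}, floored at beta for frequent tokens.
    return {token: max(f_min / f, beta) for token, f in freq.items()}


# e.g. weights = token_weights(dev_captions, beta=0.01); the per-step loss of
# Eq. (9) then becomes weights[w_s] * L(y_hat_s, y_s).
```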
We follow the baseline method of the DCASE 2020 audio captioning task and use $L_{enc} = 3$, with $\Xi = 256$ and $\Psi = 256$, the same values as in the baseline. According to Clotho, $D = 4366$. We optimize the parameters of our method using $L'$ and the Adam optimizer [12], with the values for $\beta_1$ and $\beta_2$ that are reported in the corresponding paper [12], and we employ dropout between $\text{RNN}^{1}_{enc}$ and $\text{RNN}^{2}_{enc}$, and between $\text{RNN}^{2}_{enc}$ and $\text{RNN}^{3}_{enc}$. We choose the hyper-parameters that yielded the lowest loss value for the development split, following the above mentioned policy for stopping the training process and implementing a random search over the combinations of $\Phi_s$ and the learning rate of Adam. To evaluate the impact of the sub-sampling, we employ four different values for $M$, namely 2, 4, 8, and 16. Finally, the total amount of parameters of our method is 4 573 711, and the code of our method, based on the PyTorch framework, is freely available online (https://github.com/DK-Nguyen/audio-captioning-sub-sampling).

We assess the performance of our method using the above mentioned processes and hyper-parameters, and employing the metrics used in the DCASE 2020 audio captioning task. Each metric is calculated using the predicted sequence of words $\hat{\mathbf{Y}}$ for an $\mathbf{X}$, and all the ground truth sequences $\mathbf{Y}$ for the same $\mathbf{X}$. We compare the obtained values against the ones reported by the baseline method of the DCASE 2020 audio captioning task, which is our method with $M = 1$. The calculation of the metrics is performed using the available tools for the DCASE 2020 audio captioning task (https://github.com/audio-captioning/caption-evaluation-tools).

Table 1: Results for the baseline method, i.e. $M = 1$, and our proposed method with sub-sampling factors $M = \{2, 4, 8, 16\}$.

Specifically, we use the machine translation metrics BLEU$_n$, ROUGE$_L$, and METEOR, and the captioning metrics CIDEr, SPICE, and SPIDEr. BLEU$_n$ is a precision-based metric that measures a weighted geometric mean of the modified precision of $n$-grams between predicted and ground truth captions [13]. ROUGE$_L$ [14] calculates an F-measure using the longest common sub-sequence (LCS) between predicted and ground truth captions, and METEOR [15] measures a harmonic mean of the precision and recall of segments between predicted and ground truth captions, which is shown to have high correlation with human judgements of translation quality. CIDEr [16] calculates a weighted cosine similarity using term-frequency inverse-document-frequency (TF-IDF) weighting of $n$-grams, and SPICE [17] measures the ability of the predicted captions to recover objects, attributes, and the relationships between them from the ground truth captions. Finally, SPIDEr is a weighted mean of CIDEr and SPICE, exploiting the advantages of both metrics [18]. We assess the performance of our method versus the DCASE 2020 baseline using the values of the above mentioned metrics, evaluated on the evaluation split of Clotho.
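The reported scores come from the caption-evaluation-tools linked above. Purely as an illustration of the $n$-gram precision idea behind BLEU$_n$ (and not of the official tooling), a toy computation with NLTK could look as follows:

```python
# Toy BLEU-1 computation with NLTK; for illustration only, the paper's
# results are produced by the DCASE caption-evaluation-tools.
from nltk.translate.bleu_score import sentence_bleu

references = [
    "a flock of birds comes together with a lot of chirping".split(),
    "birds sing in different tones while in a large group".split(),
]
candidate = "a group of birds chirping together".split()

print(sentence_bleu(references, candidate, weights=(1.0,)))  # unigram BLEU
```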
4. RESULTS AND DISCUSSION
In Table 1 are the values of the employed metrics for the evaluation split of Clotho. As can be seen from Table 1, using a sub-sampling factor $M \geq 2$ always improves the values of the metrics. This fact clearly indicates that our proposed sub-sampling benefits the performance of audio captioning methods. The maximum value of SPIDEr, obtained for $M = 8$, is 0.067, and can be mainly attributed to the better SPICE score that $M = 8$ yields. However, the values of the metrics for the different sub-sampling factors do not exhibit a systematic behaviour. That is, increasing $M$ above 2, i.e. $M > 2$, does not consistently increase or decrease the performance. This could be attributed to the fact that the employed method uses a fixed-length output from the encoder. Thus, the impact of further reducing the length of the sequence cannot be observed, since the output of the encoder is always a fixed-length vector. This fact strongly indicates that the impact of an increased $M$ might be more visible in an audio captioning method that employs an alignment mechanism which uses the whole output sequence of the encoder, and not only a fixed-length vector (e.g. attention).

Table 2: Reduction of the sequence length (in percentages) compared to the input audio, required time for predictions on the Clotho evaluation split, and minimum and maximum resulting amount of time-steps ($T_L^{min}$ and $T_L^{max}$, respectively), according to sub-sampling factor $M = \{2, 4, 8, 16\}$ and $L_{enc} = 3$.

M    T_L^min   T_L^max   Reduction in length   Time (sec)
2      323       646           75.0%              812
4       80       161           93.8%              134
8       20        40           98.4%              678
16       5        10           99.6%

In Table 2 is the resulting reduction in the time needed for obtaining all the predicted outputs on the Clotho evaluation split, and the resulting length of the output sequence of the encoder compared to the input sequence $\mathbf{X}$ (the short sketch at the end of this section reproduces the length columns). As can be observed from Table 2, the increase in $M$ also has a clear impact on the time needed for obtaining the predicted captions. This is to be expected, since with temporal sub-sampling the output of the encoder is up to 99.6% shorter compared to the length of the input audio.

Finally, an example of the output of our method with $M = 8$ is "a person is walking a through something and", for the file of the Clotho evaluation split "clotho file 01 A pug struggles to breathe 1 14 2008", with ground truth captions like "a man walking who is blowing his nose hard and about to sneeze" and "a small dog with a flat face snoring and groaning". Another example is the predicted caption "a group of of birds and birds a" for the file "clotho file sparrows", with ground truth captions like "a flock of birds comes together with a lot of chirping" and "birds sing in different tones while in a large group". As can be seen, our method manages to identify the sources and the actions, but it lacks in language modelling (LM). The latter is to be expected, since our method did not focus on the LM perspective, e.g. by using attention or an explicit LM.
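As referenced above, the length columns of Table 2 follow directly from the input lengths of Section 3 ($T$ from 1292 to 2584) and the $M^{L_{enc}-1}$ reduction of Section 2; a few lines of Python reproduce them:

```python
# Reproduces the T_L^min, T_L^max, and reduction columns of Table 2,
# assuming L_enc = 3 and input lengths between 1292 and 2584 time-steps.
for m in (2, 4, 8, 16):
    r = m ** (3 - 1)  # M^(L_enc - 1): one in every r time-steps survives
    print(f"M={m:2d}  T_min={1292 // r:4d}  T_max={2584 // r:4d}  "
          f"reduction={100 * (1 - 1 / r):.1f}%")
# M= 2  T_min= 323  T_max= 646  reduction=75.0%
# M= 4  T_min=  80  T_max= 161  reduction=93.8%
# M= 8  T_min=  20  T_max=  40  reduction=98.4%
# M=16  T_min=   5  T_max=  10  reduction=99.6%
```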
5. CONCLUSIONS AND FUTURE WORK
In this paper we presented an approach for audio captioning that utilizes temporal sub-sampling, given the empirical observation that a word in a general audio caption refers to a sequence of audio samples. Our approach focuses on methods that use a multi-layered and RNN-based encoder, applying a temporal sub-sampling to the output sequence of each RNN layer of the encoder. We evaluated our approach using the freely available audio captioning dataset Clotho, using multiple sub-sampling factors. The obtained results clearly indicate that temporal sub-sampling can benefit audio captioning methods. We observed an increase in all metrics with all employed sub-sampling factors (i.e. $M \geq 2$), compared to the case of not using sub-sampling (i.e. $M = 1$). The maximum benefit was observed for $M = 8$, where an increase of 1.3 is observed for SPIDEr. From the variation of the values of the utilized metrics according to the different sub-sampling factors, we hypothesize that the temporal sub-sampling might have a more pronounced effect when it is employed in a method that uses an alignment mechanism like attention. However, more research is needed to verify or disprove this.
6. REFERENCES

[1] K. Drossos, S. Adavanne, and T. Virtanen, "Automated audio captioning with recurrent neural networks," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 374-378.

[2] S. Lipping, K. Drossos, and T. Virtanen, "Crowdsourcing a dataset of audio captions," in Detection and Classification of Acoustic Scenes and Events (DCASE) 2019, Oct. 2019.

[3] M. Wu, H. Dinkel, and K. Yu, "Audio caption: Listen and tell," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 830-834.

[4] K. Drossos, S. Lipping, and T. Virtanen, "Clotho: An audio captioning dataset," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 736-740.

[5] C. D. Kim, B. Kim, H. Lee, and G. Kim, "AudioCaps: Generating captions for audios in the wild," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jun. 2019, pp. 119-132.

[6] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

[7] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations (ICLR), 2014.

[8] M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, "Bidirectional attention flow for machine comprehension," in Proceedings of the International Conference on Learning Representations (ICLR), 2016.

[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 6000-6010.

[10] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, ser. Studies in Computational Intelligence. Springer Berlin Heidelberg, 2012. [Online]. Available: https://books.google.fi/books?id=wpb-CAAAQBAJ

[11] F. Scheidegger, L. Cavigelli, M. Schaffner, A. C. I. Malossi, C. Bekas, and L. Benini, "Impact of temporal subsampling on accuracy and performance in practical video classification," in 25th European Signal Processing Conference (EUSIPCO), 2017, pp. 996-1000.

[12] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the International Conference on Learning Representations (ICLR), 2014.

[13] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002, pp. 311-318.

[14] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, 2004, pp. 74-81.

[15] A. Lavie and A. Agarwal, "METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments," in Proceedings of the Second Workshop on Statistical Machine Translation, 2007, pp. 228-231.

[16] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4566-4575.

[17] P. Anderson, B. Fernando, M. Johnson, and S. Gould, "SPICE: Semantic propositional image caption evaluation," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 382-398.

[18] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, "Improved image captioning via policy gradient optimization of SPIDEr," in 2017 IEEE International Conference on Computer Vision (ICCV), 2017.