ParaCNN: Visual Paragraph Generation via Adversarial Twin Contextual CNNs
Shiyang Yan, Yang Hua, Neil Robertson
Queen’s University Belfast
Abstract.
Image description generation plays an important role in many real-world applications, such as image retrieval, automatic navigation, and disabled people support. A well-studied task of image description generation is image captioning, which usually generates a short caption and thus neglects many fine-grained properties, e.g., the information of subtle objects and their relationships. In this paper, we study visual paragraph generation, which describes an image with a long paragraph containing rich details. Previous research often generates the paragraph via a hierarchical Recurrent Neural Network (RNN)-like model, which has complex memorising, forgetting and coupling mechanisms. Instead, we propose a novel pure CNN model, ParaCNN, which generates visual paragraphs using a hierarchical CNN architecture with contextual information between the sentences within one paragraph. The ParaCNN can generate paragraphs of arbitrary length, which is more applicable in many real-world scenarios. Furthermore, to enable the ParaCNN to model the paragraph comprehensively, we also propose an adversarial twin net training scheme. During training, we force the forwarding network's hidden features to be close to those of the backwards network by using adversarial training. During testing, we only use the forwarding network, which already includes the knowledge of the backwards network, to generate a paragraph. We conduct extensive experiments on the Stanford Visual Paragraph dataset and achieve state-of-the-art performance.
Keywords:
Visual Paragraph Generation, CNN Language Model, Adversarial Training, Twin Net
1 Introduction

Image description generation [1,2,20,8] is a key topic in both the computer vision (CV) and natural language processing (NLP) communities. A classical task of image description generation is image captioning, which requires the machine to generate a caption to describe the image. Despite the encouraging progress [6,1,2] in image captioning, a single short sentence with fewer than twenty words is not enough to describe the full content of an image, which could be very informative on the subtle objects and their relationships. Furthermore, image captioning tends to capture scene-level clues rather than fine-grained entities, limiting its direct application in many real-world problems such as disabled people support and visual semantic navigation. Consequently, it is more natural to describe images in detailed paragraphs, which have been studied recently [19,22,5,31,27,24]. An illustration of the differences between image captioning and visual paragraph generation is shown in Figure 1.
Fig. 1.
An illustration of the differences between image captioning and visual paragraph generation. The caption is a simplified sentence whilst the paragraph contains more fine-grained properties.
The challenges in visual paragraph generation are twofold: first, the sentences within one paragraph should be coherent and consistent; secondly, encoding long-term dependencies in current sequence models, such as Recurrent Neural Networks (RNNs) [15,9], is still challenging. To address these challenges, most of the previous research [19,22,5,31] uses RNNs to form an encoder-decoder framework, often with a hierarchical architecture to model the words and the paragraph. The designed coupling mechanism between sentences in a hierarchical model [5,22] provides consistency and coherence within one paragraph. However, RNNs have complex memorising, forgetting and addressing mechanisms. Although Convolutional Neural Networks (CNNs) are usually not considered mainstream for sequence-to-sequence tasks, several recent works apply CNNs to sequence-to-sequence tasks [12,2] with encouraging results. As pointed out in [12], compared to RNNs, computations in CNNs can be fully parallelised and, more importantly, the number of non-linearities in CNNs is fixed and independent of the input length, which makes them easier to optimise. Nevertheless, the vanilla deep CNN model is not suitable for visual paragraph generation, as it models the entire paragraph equally and consequently neglects the coherence and consistency between sentences within a paragraph.

In this paper, we propose a novel CNN architecture, ParaCNN, to effectively and efficiently tackle the task of visual paragraph generation. This architecture contains a hierarchy of topic convolutions and word convolutions. We first use the CNN network to decompose the visual features into several visual topics. Next, each newly generated visual topic, combined with the contextual features from the previously generated sentence, is applied in the word convolutions [2] to generate a new sequence of words. This hierarchical structure solves the problem of coherence by using the context as the coupling mechanism between sentences within a paragraph. The parallel computing capability of the pure CNN model is more efficient than a hierarchical RNN structure and proves to be better at modelling long sequences. Unlike the hierarchical RNN-based models [19,22,5,31], our CNN model is flexible enough to generate paragraphs of variable length, which is a very appealing property in real-world applications.

As argued in [28], a sequence-to-sequence model should form a better summary of the past. However, the training objective of current sequence models corresponds to one-step prediction, where local correlations are usually stronger than long-term dependencies and thus dominate the learning. Hence, inspired by the 'twin' training scheme [28], we propose an adversarial twin net training scheme. We force the hidden features of the forwarding network to be close to those of the backwards one by using adversarial training [14]. We only use the forwarding network in inference. The twin structure can force the ParaCNN to learn a good summary of the past by making the model focus on the information that is useful for predicting a specific token and is also useful to the backwards network. Our contributions can be summarised as follows: 1) We propose a visual paragraph model based on pure convolution operations, denoted as ParaCNN. The ParaCNN is flexible enough to generate a varying number of sentences, which is more suitable for real-world applications.
To the best of our knowledge, we are the first to use pure convolutions to generate a visual paragraph. 2) We propose an adversarial twin training scheme for the ParaCNN, which significantly improves performance. 3) We perform extensive experiments to validate the proposed methods and provide insights for future studies.
2 Related Work

Most of the existing image captioning models use a deep CNN as the image embedding and feed the visual features to an RNN (LSTM/GRU) network. Since the attention mechanism has shown improved performance on various NLP tasks [29,4,33,13], the visual attention mechanism [32,1] has been extensively studied in image captioning. For instance, a channel-wise attention mechanism is proposed in [6]. Semantic attention, which includes the image attributes, yields better results [34]. Encoding fine-grained features from an object detector has also been studied [1,21]. We find that the visual attention mechanism [32] also improves the performance of the proposed ParaCNN. The features from an object detector can boost performance since an object detector focuses on semantically meaningful and fine-grained image features. In this paper, we also use object features [16] as the image embedding.
Convolutional sequence processing for image captioning has achieved results comparable with RNN-based approaches, and with much higher efficiency [2], because of its parallel training scheme. Other merits of using a CNN in image captioning include that a CNN can more easily handle long-term dependencies [11] and avoids the complex memorising and forgetting mechanisms. The convolutional captioning model proposed in [2] inspires us to propose the ParaCNN model.
Regions-Hierarchical [19] introduces the first large-scale paragraph captioning dataset, which utilises the images from the Visual Genome dataset and adds new paragraph annotations. The dataset contains more pronouns and verbs and more diversity than single-sentence captioning datasets, which makes it more challenging. Regions-Hierarchical [19] proposes a hierarchical RNN model to generate the visual paragraphs. The hierarchical decoder includes two LSTMs, where the output of a sentence-level LSTM is used as the input of a word-level LSTM. Subsequent research uses a similar hierarchical approach. For instance, RTT-GAN [22] extends this model with another paragraph-level LSTM and applies adversarial training. Diverse (VAE) [5] also uses a hierarchical RNN structure with a more complex coupling mechanism between sentences and includes a variational auto-encoder structure [18] to diversify the generated visual paragraph.

It is worth mentioning a different approach [25], which uses a flat RNN to model the long paragraph. While their baseline model, which only uses a single RNN, yields poor performance, they achieve state-of-the-art results by introducing a repetition penalty sampling (Rep. Penalty Sampling) technique in self-critical training [27]. We also apply Rep. Penalty Sampling, but only in the testing stage (a sketch is given below); as a result, it requires little additional computational resource. Self-critical training on long sequences is very computationally intensive, which makes it less practical in the real world. Our primary focus is not on the sampling method but on the improvement of the model. Other methods, such as [31], use external knowledge from depth images to enrich the visual representation and achieve improved results.
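For concreteness, the following is a minimal sketch of the kind of repetition penalty applied at inference time; the exact penalty form in [25] may differ, and the function name and penalty value are our own illustrative choices.

```python
import torch

def penalized_greedy_step(logits, prev_tokens, penalty=2.0):
    """Down-weight the logits of already generated tokens before the argmax.
    The penalty form (divide positive logits, multiply negative ones) is
    illustrative; see [25] for the exact formulation."""
    logits = logits.clone()                     # logits: (vocab,) for one step
    for t in set(prev_tokens):
        logits[t] = logits[t] / penalty if logits[t] > 0 else logits[t] * penalty
    return int(logits.argmax())                 # penalised greedy choice
```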
3 Method

We first introduce the problem formulation of visual paragraph generation and the training and inference mechanisms of the proposed ParaCNN. Subsequently, we present the twin net training algorithm for the ParaCNN.
Problem Formulation. In image paragraph generation, we are given an image $I$ and are required to generate a paragraph of sentences $P = \{S_1, \ldots, S_M\}$. Each sentence $S_i$ contains words $\{W_1, \ldots, W_N\}$. This paragraph structure forms a hierarchical architecture, which requires the model to generate a sequence of topics corresponding to the sentences, denoted as $T = \{T_1, \ldots, T_M\}$.
Fig. 2.
Illustration of ParaCNN: each paragraph annotation is split into 6 sentences, each containing a maximum of 30 words. We generate a new topic by using convolutional layers with the contextual information of the previously generated sentence; the pooling method for context generation can be configured as self-attention pooling or mean pooling. Then a sentence CNN model decodes the topic into a sequence of words. The whole network is computed in parallel and optimised with backpropagation. This figure is an illustrative example; in practice, we use deeper CNN networks. Furthermore, the sentence generator (in pink) can be replaced with a transformer-based model (in blue), which proves the wide-range feasibility.
Best viewed in color. Then the model should be able to generate the word sequence based on the generated topics. Next, we introduce the proposed ParaCNN, split into a training part and an inference part. Training.
We propose the contextual hierarchical CNN based on the above problem formulation, as shown in Figure 2. Firstly, we feed the image features into a CNN model, denoted as the paragraph convolution $f_p$, to generate $M$ topics. Each topic is then used to generate a sentence by the convolutional sentence decoder $f_s$, similar to the one in [2]. The convolutional operations in both the paragraph convolution and the sentence convolution are realised by applying masked 1d convolutions so that the model only looks at past information, since we only have past information during inference. More formally, the dynamics of the ParaCNN can be written as:

$$P_i(w_i^j \mid w_1^j, \ldots, w_{i-1}^j, T_j) = f_s(w_1^j, \ldots, w_{i-1}^j, T_j), \quad (1)$$

$$T_j = f_p(I, \mathrm{context}(S_{j-1})), \quad (2)$$
Fig. 3.
The twin network training pipeline with adversarial training, in which the probability distribution of hidden features from the forwarding network is pulled close to that of the backwards network.
Best viewed in color. where $S$ is the sentence, $\mathrm{context}$ is the average of the Multi-head Self-attention [29] pooled sentence embedding and $T$ is the topic vector, which is obtained from the paragraph convolutions $f_p$.

Equation 1 gives a formal presentation of word generation with the convolutional decoder. The model generates a new word based on all the past information, instead of only the last time step as in most RNN-based approaches. Equation 2 illustrates the topic generation process: the model generates a new topic based on the visual representations and the mean pooled embedding vector of the previously generated sentence. If we consider the whole model as $f$, we can write down the model's dynamics as in Equation 3:

$$P_i(w_i \mid w_1, \ldots, w_{i-1}, I) = f(w_1, \ldots, w_{i-1}, I). \quad (3)$$
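To make the masked convolution and the dynamics in Eqs. (1)-(3) concrete, the following is a minimal PyTorch sketch. The module names, tensor shapes and the teacher-forced loop are our own assumptions for illustration; they do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Masked 1d convolution: position i only sees positions <= i."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.left_pad = kernel_size - 1          # pad on the left only
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                        # x: (B, C, L)
        x = F.pad(x, (self.left_pad, 0))         # no access to the future
        return self.conv(x)

def paracnn_forward(visual, sent_emb, f_p, f_s, pool):
    """Teacher-forced sketch of Eqs. (1)-(3): f_p produces topic T_j from the
    image features and the context of sentence j-1 (Eq. 2); f_s decodes word
    logits from the topic and the past words of sentence j (Eq. 1).
    visual: (B, D) image features; sent_emb: (B, M, N, D) word embeddings."""
    B, M, N, D = sent_emb.shape
    context = torch.zeros(B, D, device=visual.device)  # empty before sentence 1
    logits = []
    for j in range(M):
        topic = f_p(visual, context)               # Eq. (2): next topic
        logits.append(f_s(sent_emb[:, j], topic))  # Eq. (1): word logits
        context = pool(sent_emb[:, j])             # context for sentence j+1
    return logits                                  # M tensors of (B, N, vocab)
```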
Inference. The inference is performed sequentially, one word at a time. We first feed the start token `<start>` as $w_0$ to the model, and then $w_1 \sim p_1(w_1 \mid w_0, I)$ is sampled. In this paper, we use greedy sampling in inference. Afterwards, we sequentially feed the generated tokens $w_1, \ldots, w_{i-1}$ to the model, and sample the new token $w_i \sim p_i(w_i \mid w_1, \ldots, w_{i-1}, I)$.
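A minimal sketch of this greedy loop follows; the model interface and token ids are assumptions, not the authors' code.

```python
import torch

@torch.no_grad()
def greedy_decode(model, image_feats, start_id, max_len=30):
    """Feed all past tokens back at each step and take the argmax word."""
    tokens = [start_id]                              # w_0 = <start>
    for _ in range(max_len):
        inp = torch.tensor([tokens])                 # (1, i) token ids so far
        logits = model(image_feats, inp)             # assumed to return (1, i, vocab)
        tokens.append(int(logits[0, -1].argmax()))   # greedy choice of w_i
    return tokens[1:]                                # drop the start token
```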
Adversarial Twin Net Training. TwinNet [28] forces an RNN to also focus on future tokens. In [28], an L2 loss forces the hidden states of the forwarding and backwards networks to be close when they share the same ground truths. However, an L2 loss cannot model the distribution; therefore, we propose to use adversarial training [14] to push the distributions of the forwarding and backwards ParaCNN close to each other. We define the forwarding network as a general training process using MLE. The backwards network is implemented by flipping the ground-truth labels and feeding them to the model, so that its output hidden features contain backwards information. Specifically, we use the more stable Wasserstein-GAN (W-GAN) [3] to deploy the adversarial training framework. During discriminator training, we set the hidden features of the forwarding network as false and the hidden features of the backwards network as true, to push the forwarding network to generate hidden features indistinguishable from those of the backwards network. During the training of the forwarding network, we take the probability of its hidden features being true as the adversarial loss to fine-tune the forwarding network. The training process is presented in Figure 3.
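The following is a minimal PyTorch sketch of this scheme, using the one-layer bi-GRU discriminator and the 5:1 critic/generator schedule described in the training details below; weight clipping follows standard W-GAN practice, and the loss weights are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    """One-layer bi-GRU scoring a sequence of hidden features (W-GAN critic)."""
    def __init__(self, dim=512):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, h):                        # h: (B, L, dim)
        out, _ = self.gru(h)
        return self.score(out[:, -1]).mean()     # scalar critic score

def twin_critic_step(fwd_h, bwd_h, critic, c_opt, n_critic=5, clip=0.01):
    """Critic update: backwards-net features are 'real', forward ones 'fake'."""
    for _ in range(n_critic):
        c_opt.zero_grad()
        loss_c = critic(fwd_h.detach()) - critic(bwd_h.detach())
        loss_c.backward()
        c_opt.step()
        for p in critic.parameters():            # W-GAN weight clipping
            p.data.clamp_(-clip, clip)

def generator_loss(logits, targets, fwd_h, bwd_h, critic,
                   lam_l2=1.0, lam_adv=1.0):
    """MLE word loss, plus the L2 twin regulariser and the adversarial twin
    loss; bwd_h is assumed re-flipped so positions align token-wise with
    fwd_h, and the critic is assumed frozen during this update. The lambda
    weights are illustrative, not from the paper."""
    mle = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    l2 = F.mse_loss(fwd_h, bwd_h.detach())       # align token-wise features
    adv = -critic(fwd_h)                         # fool the W-GAN critic
    return mle + lam_l2 * l2 + lam_adv * adv
```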
Advantages of the adversarial twin model.
Empirically speaking, the adversarial twin model's advantages are twofold. First, if the model needs to predict $w_i$ based on $w_1, \ldots, w_{i-1}$, the twin networks learn the transition knowledge not only from $w_{i-2}$ to $w_{i-1}$ but also from $w_{i-1}$ to $w_{i-2}$, which makes the transition smoother. As a result, the model can learn a more comprehensive representation of the past. Second, the training of the twin networks can also be considered a knowledge transfer process, in which knowledge from the backwards network is transferred to the forwarding network, which is used in the inference phase. This knowledge contains information from the future, which is helpful in the prediction/generation process during the inference stage.

The L2 loss [28] is not as good as adversarial training in terms of knowledge transfer. Nevertheless, it acts as a regularisation term [28] for the training objectives and keeps specific tokens aligned between the forwarding and backwards networks during twin net training. We find that the L2 loss and the adversarial twin net training scheme are complementary. We demonstrate that adding the adversarial loss combined with the L2 regularisation term in twin net training makes the original ParaCNN generate the best results among all our experimental settings.

4 Experiments

We perform all the experiments on the Stanford Visual Paragraph dataset proposed by [19]. In this dataset, each image contains one paragraph. The training, validation and testing sets contain 14,575, 2,487 and 2,489 images, respectively. We evaluate the BLEU, METEOR, ROUGE-L and CIDEr scores for the generated paragraphs.
As our model contains an explicit hierarchical structure of paragraph, sentences and words, we split each paragraph into 6 sentences, each of which contains up to 30 words.
Table 1.
Parameter Settings: 'Projection' means the visual vector after a fully-connected layer for the visual input. 'Topic' indicates the dimension of the topic vector and 'Context' is the dimension of the context vector.
Vocabulary | Paragraph   | Sentence Max | Epochs | Learning Rate | Optimiser
8668       | 6 sentences | 30 words     | 40     | 4e-4          | RMSprop

Visual Input | Projection | Topic | Word Embedding | Context | Convolutional Channels
4096         | 512        | 512   | 512            | 512     | 512
Table 2.
The Model Complexity and Time Efficiency of the ParaCNN.

Methods             | Model Complexity    | Time Efficiency
Hierarchical RNNs   | 165.65 M Parameters | 2506s / epoch
ParaCNN             | 38.27 M Parameters  | 2078s / epoch
ParaCNN + Attention | 38.38 M Parameters  | 2101s / epoch

Therefore, our model can model a paragraph with a maximum of 180 words. During training, we add a `<start>` token in front of each paragraph. We skip words which appear fewer than 2 times when establishing the vocabulary; hence we have a total of 8,668 words in our vocabulary (a sketch of this step follows). We then use one-hot encoding to encode each word before passing it to a two-layer trainable word embedding module.
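A small sketch of this vocabulary step is given below; the special tokens and function layout are our assumptions.

```python
from collections import Counter

def build_vocab(paragraphs, min_count=2):
    """Keep words appearing at least `min_count` times; map words to ids.
    On the Stanford Visual Paragraph data this yields ~8,668 entries."""
    counts = Counter(w for p in paragraphs for w in p.split())
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    specials = ['<pad>', '<start>', '<unk>']     # assumed special tokens
    return {tok: i for i, tok in enumerate(specials + kept)}
```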
Training Details.
To make a fair comparison, the visual representation of each image is the same as that of [19,5,22], i.e., the dense captioning [16] visual features. We map the visual features to a 512-dimensional vector by using fully connected layers and max pooling, which may harm the final performance since the dimension of the visual features is only half of that in [19,5,22]. However, our primary focus is on the feasibility of our model with only convolutional operations. For the adversarial training of the twin ParaCNN, we use a one-layer bi-GRU as the discriminator, and the dimensions of both the hidden states and the outputs of the GRU model are set to 512. During adversarial training, following the practice of W-GAN [3], we alternately train the discriminator for 5 steps and the generator for 1 step. We use the RMSprop optimisation technique with backpropagation in training. We implement our models with the PyTorch [26] deep learning package. All experiments are conducted on a PC with an Ubuntu 18.04 system and an Nvidia GeForce 2080Ti GPU. A detailed configuration of the parameters is shown in Table 1.
Ablation Studies on the Parameters of the ParaCNN. We perform ablation studies on the kernel size and the depth of the topic convolution and word convolution modules, with results shown in Table 3. We set various kernel sizes and depths for the topic and word convolutions. A kernel size of 5 for both the topic and word convolutions yields the best performance.
Fig. 4.
The cross-entropy loss of the hierarchical RNNs and the proposed ParaCNN. Our ParaCNN converges faster and reaches a lower loss value than the hierarchical RNN counterpart.
Table 3.
Ablation Studies on the Parameters of the ParaCNN: a kernel size of 5, a depth of 4 for the topic convolutions and a depth of 5 for the sentence convolutions yield the best performance.
Methods | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr
Kernels (Topics, Words) = (5, 7), Depth (Topics, Words) = (3, 6) | 35.6 | 19.1 | 9.9 | 5.1 | 13.5 | 25.3 | 10.3
Kernels (Topics, Words) = (3, 5), Depth (Topics, Words) = (3, 6) | 37.5 | 20.3 | 10.7 | 5.5 | 14.1 | 25.7 | 12.3
Kernels (Topics, Words) = (5, 5), Depth (Topics, Words) = (3, 5) | 39.5 | 22.5 | 13.0 | 7.5 | 15.1 | 28.2 | 15.1
Kernels (Topics, Words) = (5, 5), Depth (Topics, Words) = (3, 6) | 40.0 | 22.7 | 13.1 | 7.5 | 15.3 | 28.4 | 15.6
Kernels (Topics, Words) = (4, 5), Depth (Topics, Words) = (4, 5) | 39.9 | 22.7 | 13.2
Table 4.
The Performance Comparison of Different Module Schemes: overall, our ParaCNN with the attention mechanism generates the best results.
Methods | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr
Topics = RNN, Words = RNN (Results of [7]) | 35.6 | 18.0 | 9.1 | 4.5 | 13.9 | - | 10.6
Topics = RNN, Words = CNN | 36.1 | 18.9 | 9.9 | 5.4 | 13.6 | 26.0 | 12.1
Topics = CNN, Words = CNN (No context) | 38.1 | 21.2 | 12.3 | 7.1 | 14.6 | 28.1 | 14.4
Topics = CNN, Words = CNN (ParaCNN)

Increasing or decreasing the kernel size of the topic convolution does not bring a performance increase. We also set the maximum depth of the whole network to 9, with 4 convolutional layers for topic generation and 5 convolutional layers for word generation in our studies.
The Impact of Different Model Configurations for ParaCNN.
Firstly, the topic and word modules are both set as RNNs, as implemented by [7]. The result, shown in Table 4, is not satisfactory. We then replace the word module with a CNN decoder and see a slight increase in all the language evaluation metrics, consistent with previous research [2] showing that a convolutional decoder is slightly better than an RNN decoder.
Table 5.
The Performance of the Transformer-based Sentence Generator.
Methods | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr
ParaCNN | 40.9 | 23.3 | 13.3 | 7.5 | 15.5 | 28.2 | 16.4
ParaCNN (Transformer) | 39.9 | 22.0 | 12.5 | 7.1 | 15.2 | 27.5 | 14.7
Table 6.
The Impact of Training Paragraph Length: we increase the training paragraph length to 7 sentences but observe little difference in the results.
Methods | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr
ParaCNN (7 Sent.) | 38.9 | 22.1 | 12.6 | 7.2
ParaCNN (6 Sent.)
CNNs are then applied for both the topic and word modules, but without contextual information during the topic convolution. We see a large increase in all the evaluation metrics, which shows the superiority of CNNs in paragraph generation, especially in topic modelling, since the improvement is largely brought by using CNNs in topic generation. Subsequently, we add contextual information from the last generated sentence during topic generation and again see an obvious improvement in the results, as presented in the last row of Table 4. We also visualise the training loss of the hierarchical RNNs and our ParaCNN in Figure 4, which shows that our ParaCNN converges faster than the RNN-based model from the 10th epoch onwards.

To analyse the model complexity and time efficiency of the ParaCNN, we record the model complexity and training time of the hierarchical RNNs and our ParaCNN in Table 2. The recorded results show that our ParaCNN has lower model complexity and higher time efficiency in training.
The Impact of the Transformer-based Sentence Generator.
We replace the word convolution module with a state-of-the-art transformer-based sentence generator [10]; the results are presented in Table 5. The results are not as good as those of our CNN-based module.
The Impact of the Number of Sentences within a Paragraph.
As presented in Table 7, our ParaCNN can generate a varying number of sentences within a paragraph, thanks to the flexibility of convolutional operations. Although 6 sentences still yield the best result, this shows that, in real-world applications, our ParaCNN is flexible enough to generate longer paragraphs on demand. When we re-train the model using 7 sentences, the ParaCNN tends to be more consistent with human consensus, since there is an increase in METEOR and CIDEr, as presented in Table 6. Besides, we train another standalone model to predict the number of generated sentences. This model is a three-layer fully connected neural network, trained using the ground-truth number of sentences in the training set (a sketch is given below).
Table 7.
The Impact of Generated Paragraph Length: we use the trained ParaCNN model to generate a varying number of sentences, including an adaptive number determined by a separate standalone model based on the visual inputs.
Methods | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr
Twin ParaCNN (5 Sent.) | 37.9 | 21.5 | 12.4 | 7.1 | 14.7 | 28.1 | 13.6
Twin ParaCNN (7 Sent.) | 38.1 | 21.8 | 12.5 | 7.2 | 16.8 | 28.3 | 16.7
Twin ParaCNN (6 Sent.)
Ad. Sent. (Max 6) | 38.0 | 21.7 | 12.4 | 7.1 | 14.7 | 27.9 | 15.9
Ad. Sent. (Min 5, Max 6) | 39.5 | 22.5 | 12.9 | 7.4 | 15.2 | 28.3 | 15.7
Ad. Sent. (Min 5, Max 7) | 41.0 | 23.3 | 13.4 | 7.7 | 15.5 | 28.3 | 15.7
Ad. Sent. (Min 6, Max 7) | 40.2 | 22.9 | 13.1 | 7.5
Table 8.
The Performance of the Attention Mechanism for the ParaCNN: the visual attention mechanism improves the performance of each scheme.
Methods | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr
ParaCNN (No Visual Attention)

We notice a slight decrease in the evaluation metrics with this adaptive-length inference. When limiting the number of sentences to a specific range, there is an increase in performance. A suitable length of the generated paragraph is vital for maintaining performance. Nevertheless, this demonstrates the flexibility of our model: it can generate a paragraph of arbitrary length.
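A sketch of this standalone predictor follows, framed as classification over possible sentence counts; the layer sizes and output framing are our assumptions.

```python
import torch.nn as nn

class SentenceCountPredictor(nn.Module):
    """Three-layer fully connected net that predicts, from the visual
    features, how many sentences the paragraph should contain."""
    def __init__(self, visual_dim=4096, hidden=512, max_sent=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(visual_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, max_sent))          # logits over 1..max_sent

    def forward(self, v):                         # v: (B, visual_dim)
        return self.net(v)
```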
The Impact of the Visual Attention Mechanism on the ParaCNN.
Additionally, the visual attention mechanism [32] is applied in the word generation convolutional module. There are 5 convolutional layers in the word generation module, as presented in Table 3. We add the visual attention mechanism to the 2nd and the 4th layers of the word generation module instead of to every layer. The results are shown in Table 8. The table shows that the visual attention mechanism generally improves the results, since it focuses the ParaCNN on more precise locations of the visual features and thus generates more accurate concepts.
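A minimal sketch of inserting such a visual attention layer between word-convolution layers is given below; this is a simple dot-product attention in the spirit of [32], and the projection layout is our assumption.

```python
import torch.nn as nn

class VisualAttention(nn.Module):
    """Soft attention over region features, queried by decoder states;
    inserted after the 2nd and 4th word-convolution layers."""
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # project decoder states to queries
        self.k = nn.Linear(dim, dim)   # project region features to keys

    def forward(self, h, regions):     # h: (B, L, D), regions: (B, R, D)
        scores = self.q(h) @ self.k(regions).transpose(1, 2)  # (B, L, R)
        attn = scores.softmax(dim=-1)                         # over regions
        return h + attn @ regions      # add attended visual context back
```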
The Impact of Twin Net Training and Adversarial Optimisation.
We validate the effectiveness of the twin net training scheme on the ParaCNN model by forcing the hidden features (the last fully-connected features) of the forwarding network and the backwards network to be close. We train the networks from scratch for 40 epochs, and the results are presented in Table 9. We see a clear increase in all the language evaluation metrics, which proves that twin net training is effective for the ParaCNN. We then apply the twin net with the adversarial training framework to optimise the ParaCNN, and the results are also presented in Table 9.
Table 9.
The Performance of Twin Net Training: twin net training improves most of the metrics. In particular, the adversarial twin net improves the CIDEr metric significantly.
Methods | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr
ParaCNN | 39.3 | 22.8 | 13.5 | 8.1 | 15.6 | 29.4 | 17.5
Twin ParaCNN (L2) | 40.9 | 23.8 | 14.2 | 8.5 | 15.9 | 29.9 | 18.3
Twin ParaCNN (Adversarial) | 41.6 | 24.0 | 14.0 | 8.5 | 16.2 | 29.0 | 19.1
Twin ParaCNN (L2, Rep. Penalty Sampling) | 42.5 | 25.3 | 15.3 | 9.2 | 16.4
Table 10.
The Results of the L2 Twin Net, the Adversarial Twin Net, and Their Combination: we see an improvement from using adversarial training in the twin net. Furthermore, the combination of the L2 regularisation term and adversarial twin net training yields the best results.
Methods | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr
Twin ParaCNN (L2) | 40.9 | 23.8 | 14.2 | 8.5 | 15.9 | 29.9 | 18.3
Twin ParaCNN (Adversarial) | 41.6 | 24.0 | 14.0 | 8.5 | 16.2 | 29.0 | 19.1
Twin ParaCNN (L2 + Adversarial) | 42.7 | 24.9 | 14.5 | 8.5 | 16.8 | 29.9 | 19.6
Twin ParaCNN (L2 + Adversarial, Rep. Penalty Sampling)
Table 11.
The Impact of the Learning Rate of the Discriminator in Twin Net Training: a suitable learning rate for the discriminator is critical in the adversarial twin net scheme.
Methods | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr
Twin ParaCNN (Adversarial, lr 3e-4) | 40.7 | 23.5 | 13.8 | 8.2 | 15.9 | 29.2 | 18.5
Twin ParaCNN (Adversarial, lr 3e-4, Rep. Penalty Sampling) | 42.4 | 25.1 | 15.0 | 9.0 | 16.4

The adversarial training of the twin net brings a large improvement on CIDEr, which evaluates how well the generated paragraph matches the human consensus. In addition, as Table 11 reveals, the learning rate of the discriminator is critical for improving the performance. We find improved results from using adversarial twin net training. Moreover, as Table 10 shows, the L2 loss combined with adversarial twin net training yields the best results on all the evaluation metrics.
The Impact of Different Pooling Methods in Context Generation.
The pooling method in context generation, as shown in Figure 2, can be set to mean pooling, which simply averages the information. Alternatively, we use a multi-head self-attention model [29] to replace the mean-pooling operation. The self-attention can discover the internal relationships in the previously generated sentence and select more informative knowledge for the next sentence generation; the comparison is given in Table 12.
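A sketch of the two pooling options follows; the module interface is ours, and, per the method section, the self-attention variant averages the Multi-head Self-attention [29] refined embeddings.

```python
import torch.nn as nn

class ContextPool(nn.Module):
    """Pool a generated sentence's word embeddings into one context vector:
    'mean' simply averages; 'attn' first refines the embeddings with
    multi-head self-attention, then averages."""
    def __init__(self, dim=512, heads=8, mode='attn'):
        super().__init__()
        self.mode = mode
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, sent):                    # sent: (B, N, dim)
        if self.mode == 'attn':
            sent, _ = self.attn(sent, sent, sent)
        return sent.mean(dim=1)                 # context: (B, dim)
```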
Comparison with the State-of-the-art Methods. (a) Flat models: (a1) Sentence-Concat.
Two sentence-level captioning models, Neuraltalk [17] and NIC [30], pre-trained on the MSCOCO [23] dataset, are adopted to predict 5 sentences for each image.
Table 12.
The Performance Comparison between Different Pooling Methods for Context Generation.
Methods | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr
ParaCNN (Mean Pooling) | 39.3 | 22.8 | 13.5 | 8.1 | 15.6 | 29.4 | 17.5
ParaCNN (Self-attention Pooling) | 40.6 | 23.9 | 14.3 | 8.6 | 16.1 | 29.8 | 18.2
Twin ParaCNN (Mean Pooling)
Table 13.
The Performance Comparison with the State-of-the-art Methods on the Visual Paragraph Dataset: we achieve state-of-the-art results on the evaluation metrics.
Category | Methods | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | CIDEr
Flat Models | Sentence-Concat [19] | 31.1 | 15.1 | 7.6 | 4.0 | 12.1 | 6.8
Flat Models | Template [19] | 37.5 | 21.0 | 12.0 | 7.4 | 14.3 | 12.2
Flat Models | Image-Flat [19] | 34.0 | 19.1 | 12.2 | 7.7 | 12.8 | 11.1
Flat Models | Top-down Attention [1] | 32.8 | 19.0 | 11.4 | 6.9 | 12.9 | 13.7
Flat Models | Self-critical [25] | 29.7 | 16.5 | 9.7 | 5.9 | 13.6 | 13.8
Flat Models | DAM-Att [31] | 35.0 | 20.2 | 11.7 | 6.6 | 13.9 | 17.3
Flat Models | Meshed Transformer [10] | 37.5 | 22.3 | 13.7 | 8.4 | 15.4 | 16.1
Hierarchical Models | Regions-Hierarchical [19] | 41.9 | 24.1 | 14.2 | 8.7 | 16.0 | 13.5
Hierarchical Models | RTT-GAN [22] | 42.0 | 24.9 | 14.9 | 9.0 | 17.1 | 16.9
Hierarchical Models | Diverse (VAE) [5] | 42.4 | 25.6 | 15.2 | 9.4
Human Annotation [19] | 42.9 | 25.7 | 15.6 | 9.7 | 19.2 | 28.6

(a2) Image-Flat.
Image-Flat [19] applies a deep CNN to encode the image and trains a flat RNN model to generate the entire paragraph. (a3) DAM-Att.
DAM-Att [31] uses additional information from depth images to enhance the spatial relationship information. Our model is a hierarchical CNN, which is less similar to the flat RNNs, and our ParaCNN performs much better than the flat models. (b) Hierarchical Models: (b1) Regions-Hierarchical.
Regions-Hierarchical [19] applies hierarchical RNNs to model the hierarchy from words to paragraph. The ParaCNN achieves much better results on the evaluation metrics, as presented in Table 4. Besides, our ParaCNN converges faster than the hierarchical RNN model. (b2) RTT-GAN.
RTT-GAN [22] realises the hierarchical RNNs in a GAN setting where the generator mimics the human-annotated paragraphs and tries to fool the discriminator. (b3) Diverse (VAE).
Diverse (VAE) [5] models the paragraph distribution with a variational auto-encoder (VAE), which can preserve the consistency of the global topics of paragraphs. This method focuses on the diversity of the topics in the dataset and the internal consistency within one paragraph. In contrast, the proposed ParaCNN focuses on the feasibility of pure convolutional operations for visual paragraph generation and on the effectiveness of the twin net training scheme. (b4) Context-Aware Visual Policy network (CAVP).
CAVP [35] proposes a reinforcement learning scheme that considers the previous visual attentions as context and decides whether the context is used for the current word or sentence generation. The focus of CAVP [35] differs from ours, but we also consider the contextual information of the previously generated sentence and observe the effectiveness of this contextual information.
Fig. 5.
The visualisation of images and corresponding generated paragraphs from different methods. The figure shows that the ParaCNN can generate more detailed paragraphs, and that the adversarial twin net training indeed improves over the ParaCNN baseline by generating more accurate sentences. More interestingly, the ParaCNN with twin net generates more novel concepts of subtle objects and their relationships, which are neglected by other methods and even by the ground truth. The best results are generated with the combination of the L2 loss and adversarial twin net training.
Visualisation of the Generated Paragraphs.
As shown in Figure 5, we present the generated paragraphs from the ParaCNN with standard training, with adversarial twin net training, and with the L2 loss combined with twin net training. The twin net training scheme generates paragraphs quite different from the ParaCNN baseline and includes more fine-grained details, which shows that it indeed improves the ParaCNN training.
5 Conclusion

This paper studies the task of visual paragraph generation, which describes images with long paragraphs and thus facilitates real-world applications such as image retrieval, automatic navigation and disabled people support. Specifically, we propose a novel model, called ParaCNN, using only convolutional operations on the multimodal data of image and natural language. The ParaCNN forms a hierarchical structure over words, sentences and the paragraph, thus solving the long-sequence generation problem at the same time. The ParaCNN is flexible enough to generate an arbitrary number of sentences, which is more suitable for real-world applications. Moreover, we propose an adversarial twin net training scheme to enable the ParaCNN to incorporate backwards information, which improves performance. We conduct extensive experiments to validate the proposed method and achieve state-of-the-art results.
References
1. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR (2018)
2. Aneja, J., Deshpande, A., Schwing, A.G.: Convolutional image captioning. In: CVPR (2018)
3. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: ICML (2017)
4. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
5. Chatterjee, M., Schwing, A.G.: Diverse and coherent paragraph generation from images. In: ECCV (2018)
6. Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., Chua, T.S.: SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In: CVPR (2017)
7. Chen, X.: TensorFlow implementation of the paper: A hierarchical approach for generating descriptive image paragraphs. https://github.com/chenxinpeng/im2p (2017)
8. Cho, K., Courville, A., Bengio, Y.: Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia 17(11), 1875-1886 (2015)
9. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
10. Cornia, M., Stefanini, M., Baraldi, L., Cucchiara, R.: M2: Meshed-memory transformer for image captioning. arXiv preprint arXiv:1912.08226 (2019)
11. Fan, A., Lewis, M., Dauphin, Y.: Hierarchical neural story generation. arXiv preprint arXiv:1805.04833 (2018)
12. Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. In: ICML (2017)
13. Golub, D., He, X.: Character-level question answering with attention. arXiv preprint arXiv:1604.00727 (2016)
14. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS (2014)
15. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735-1780 (1997)
16. Johnson, J., Karpathy, A., Fei-Fei, L.: DenseCap: Fully convolutional localization networks for dense captioning. In: CVPR (2016)
17. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)
18. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
19. Krause, J., Johnson, J., Krishna, R., Fei-Fei, L.: A hierarchical approach for generating descriptive image paragraphs. In: CVPR (2017)
20. Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., Berg, T.L.: BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(12), 2891-2903 (2013)