CPTR: Full Transformer Network for Image Captioning
Wei Liu*1,2, Sihan Chen*1,2, Longteng Guo, Xinxin Zhu, Jing Liu
1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
2 School of Artificial Intelligence, University of Chinese Academy of Sciences
ABSTRACT
In this paper, we consider the image captioning task from a new sequence-to-sequence prediction perspective and propose CaPtion TransformeR (CPTR), which takes the sequentialized raw images as the input to Transformer. Compared to the "CNN+Transformer" design paradigm, our model can model global context at every encoder layer from the beginning and is totally convolution-free. Extensive experiments demonstrate the effectiveness of the proposed model, and we surpass the conventional "CNN+Transformer" methods on the MSCOCO dataset. Besides, we provide detailed visualizations of the self-attention between patches in the encoder and the "words-to-patches" attention in the decoder, thanks to the full Transformer architecture.
Index Terms — image captioning, Transformer, sequence-to-sequence
1. INTRODUCTION
Image captioning is a challenging task that concerns generating a natural language description of an input image automatically. Currently, most captioning algorithms follow an encoder-decoder architecture, in which a decoder network predicts words according to the features extracted by the encoder network via an attention mechanism. Inspired by the great success of Transformer [1] in the natural language processing field, recent captioning models tend to replace the RNN with Transformer in the decoder part for its capacity for parallel training and excellent performance. However, the encoder part always remains unchanged, i.e., utilizing a CNN model (e.g., ResNet [2]) pretrained on the image classification task to extract spatial features, or a Faster R-CNN [3] pretrained on the object detection task to extract bottom-up [4] features.

Recently, research on applying Transformer to the computer vision field has attracted extensive attention. For example, DETR [5] utilizes Transformer to decode detection predictions without prior knowledge such as region proposals and non-maximal suppression. ViT [6] first utilizes Transformer without any convolution operation for image classification and shows promising performance, especially when pretrained on very large datasets (i.e., ImageNet-21K, JFT). After that, full Transformer methods for both high-level and low-level downstream tasks have emerged, such as SETR [7] for image semantic segmentation and IPT [8] for image processing.

Inspired by the above works, we consider solving the image captioning task from a new sequence-to-sequence perspective and propose CaPtion TransformeR (CPTR), a full Transformer network that replaces the CNN in the encoder part with a Transformer encoder and is totally convolution-free. Compared to the conventional captioning models that take as input the features extracted by a CNN or object detector, we directly sequentialize raw images as input. Specifically, we divide an image into small patches of fixed size (e.g., 16 × 16), flatten each patch, and reshape them into a 1D patch sequence. The patch sequence passes through a patch embedding layer and a learnable positional embedding layer before being fed into the Transformer encoder.

Compared to the "CNN+Transformer" paradigm, CPTR is a simpler yet effective method that totally avoids convolution operations. Due to the local essence of the convolution operator, the CNN encoder is limited in global context modeling, which can only be fulfilled by gradually enlarging the receptive field as the convolution layers go deeper. In contrast, the encoder of CPTR can utilize long-range dependencies among the sequentialized patches from the very beginning via the self-attention mechanism. During the generation of words, CPTR models "words-to-patches" attention in the cross attention layer of the decoder, which is proved to be effective. We evaluate our method on the MSCOCO image captioning dataset, and it outperforms both "CNN+RNN" and "CNN+Transformer" captioning models.

* Wei Liu and Sihan Chen contribute equally to this paper.
2. FRAMEWORK

2.1. Encoder
As depicted in Fig. 1, instead of using a pretrained CNN or Faster R-CNN model to extract spatial features or bottom-up features like the previous methods, we choose to sequentialize the input image and treat image captioning as a sequence-to-sequence prediction task. Concretely, we divide the original image into a sequence of image patches to adapt to the input form of Transformer.
Fig. 1. The overall architecture of the proposed CPTR model.

Firstly, we resize the input image to a fixed resolution $X \in \mathbb{R}^{H \times W \times 3}$ (with 3 color channels), then divide the resized image into $N$ patches, where $N = \frac{H}{P} \times \frac{W}{P}$ and $P$ is the patch size ($P = 16$ in our experiment settings). After that, we flatten each patch and reshape the patches into a 1D sequence $X_p \in \mathbb{R}^{N \times (P^2 \cdot 3)}$. We use a linear embedding layer to map the flattened patch sequence to the latent space and add a learnable 1D position embedding to the patch features, which yields the final input to the Transformer encoder, denoted as $P_a = [p_1, \ldots, p_N]$.
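To make the sequentialization concrete, the following is a minimal PyTorch sketch of the patchify-and-embed step described above. The class and variable names are our own illustration, not the authors' released code; only the shapes, the linear embedding, and the learnable 1D position embedding follow the text.

```python
# A minimal sketch of patch sequentialization, assuming PyTorch.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Flattens an image into N = (H/P) * (W/P) patches, projects each patch
    to the model dimension, and adds a learnable 1D position embedding."""

    def __init__(self, image_size=384, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        assert image_size % patch_size == 0, "image size must be divisible by patch size"
        self.patch_size = patch_size
        num_patches = (image_size // patch_size) ** 2
        # Linear embedding layer mapping each flattened patch (P*P*3 values) to `dim`.
        self.proj = nn.Linear(patch_size * patch_size * in_channels, dim)
        # Learnable 1D position embedding, one vector per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, images):  # images: (B, 3, H, W)
        b, c, h, w = images.shape
        p = self.patch_size
        # (B, 3, H/P, P, W/P, P) -> (B, H/P * W/P, P*P*3): one row per patch.
        x = images.reshape(b, c, h // p, p, w // p, p)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(b, (h // p) * (w // p), p * p * c)
        return self.proj(x) + self.pos_embed

# Example: a 384x384 image yields a sequence of (384/16)^2 = 576 patch features.
tokens = PatchEmbedding()(torch.randn(2, 3, 384, 384))
print(tokens.shape)  # torch.Size([2, 576, 768])
```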
The encoder of CPTR consists of $N_e$ stacked identical layers, each of which consists of a multi-head self-attention (MHA) sublayer followed by a positional feed-forward sublayer. MHA contains $H$ parallel heads, and each head $h_i$ corresponds to an independent scaled dot-product attention function, which allows the model to jointly attend to different subspaces. A linear transformation $W^O$ is then used to aggregate the attention results of the different heads. The process can be formulated as follows:

$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(h_1, \ldots, h_H)\, W^O \qquad (1)$$

The scaled dot-product attention is the particular attention proposed in the Transformer model, which can be computed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V \qquad (2)$$
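As a reference for Eqs. (1)-(2), here is a minimal PyTorch sketch of scaled dot-product attention and its multi-head aggregation. Names such as `MultiHeadAttention` and `w_o` are our own, and the attention mask needed for the decoder's masked self-attention is omitted for brevity.

```python
# A minimal sketch of Eqs. (1)-(2), assuming PyTorch.
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    """Eq. (2): Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    """Eq. (1): runs H heads in parallel and aggregates them with W^O."""

    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d_k = num_heads, dim // num_heads
        self.w_q, self.w_k, self.w_v = (nn.Linear(dim, dim) for _ in range(3))
        self.w_o = nn.Linear(dim, dim)  # the aggregation W^O

    def forward(self, q, k, v):
        b = q.size(0)
        # Project and split into H heads: (B, N, dim) -> (B, H, N, d_k).
        split = lambda x, w: w(x).view(b, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        heads = scaled_dot_product_attention(q, k, v)
        # Concat(h_1, ..., h_H), then apply W^O.
        concat = heads.transpose(1, 2).reshape(b, -1, self.h * self.d_k)
        return self.w_o(concat)
```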
where $Q \in \mathbb{R}^{N_q \times d_k}$, $K \in \mathbb{R}^{N_k \times d_k}$ and $V \in \mathbb{R}^{N_k \times d_v}$ are the query, key and value matrices, respectively.

The positional feed-forward sublayer that follows is implemented as two linear layers with a GELU activation function and dropout between them to further transform the features. It can be formulated as:

$$\mathrm{FFN}(x) = \mathrm{FC}_2(\mathrm{Dropout}(\mathrm{GELU}(\mathrm{FC}_1(x)))) \qquad (3)$$

Each sublayer is equipped with a sublayer connection composed of a residual connection followed by layer normalization:

$$x_{\mathrm{out}} = \mathrm{LayerNorm}(x_{\mathrm{in}} + \mathrm{Sublayer}(x_{\mathrm{in}})) \qquad (4)$$

where $x_{\mathrm{in}}$ and $x_{\mathrm{out}}$ are the input and output of one sublayer, respectively, and the sublayer can be an attention layer or a feed-forward layer.

2.2. Decoder

On the decoder side, we add a sinusoid positional embedding to the word embedding features and take both the addition results and the encoder output features as the input. The decoder consists of $N_d$ stacked identical layers, each containing a masked multi-head self-attention sublayer, followed by a multi-head cross attention sublayer and a positional feed-forward sublayer, sequentially.

The output feature of the last decoder layer is utilized to predict the next word via a linear layer whose output dimension equals the vocabulary size. Given a ground truth sentence $y^*_{1:T}$ and the prediction $y^*_t$ of the captioning model with parameters $\theta$, we minimize the following cross entropy loss:

$$L_{XE}(\theta) = -\sum_{t=1}^{T} \log\left(p_\theta\!\left(y^*_t \mid y^*_{1:t-1}\right)\right) \qquad (5)$$
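The pieces above fit together as in the following sketch of Eqs. (3)-(5), again in PyTorch with our own names. The FFN hidden width (3072) and dropout rate (0.1) are assumptions borrowed from common Transformer/ViT settings, since the paper does not state them here.

```python
# A minimal sketch of Eqs. (3)-(5), assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    """Eq. (3): FC_2(Dropout(GELU(FC_1(x)))). Hidden width and dropout
    rate are assumed values, not taken from the paper."""
    def __init__(self, dim=768, hidden=3072, p=0.1):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(dim, hidden), nn.Linear(hidden, dim)
        self.drop = nn.Dropout(p)

    def forward(self, x):
        return self.fc2(self.drop(F.gelu(self.fc1(x))))

class SublayerConnection(nn.Module):
    """Eq. (4): x_out = LayerNorm(x_in + Sublayer(x_in)) (post-norm)."""
    def __init__(self, dim=768):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))

def caption_xe_loss(logits, targets):
    """Eq. (5): -sum_t log p(y*_t | y*_{1:t-1}), averaged over the batch.
    logits: (B, T, V) decoder outputs under teacher forcing;
    targets: (B, T) integer-encoded ground-truth words y*_{1:T}."""
    b, t, v = logits.shape
    return F.cross_entropy(logits.reshape(b * t, v), targets.reshape(b * t),
                           reduction="sum") / b
```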
Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE | CIDEr
(each metric reported as c5 / c40)

CNN+RNN:
SCST [9] | 78.1 / 93.7 | 61.9 / 86.0 | 47.0 / 75.9 | 35.2 / 64.5 | 27.0 / 35.5 | 56.3 / 70.7 | 114.7 / 116.0
LSTM-A [10] | 78.7 / 93.7 | 62.7 / 86.7 | 47.6 / 76.5 | 35.6 / 65.2 | 27.0 / 35.4 | 56.4 / 70.5 | 116.0 / 118.0
Up-Down [4] | 80.2 / 95.2 | 64.1 / 88.8 | 49.1 / 79.4 | 36.9 / 68.5 | 27.6 / 36.7 | 57.1 / 72.4 | 117.9 / 120.5
RF-Net [11] | 80.4 / 95.0 | 64.9 / 89.3 | 50.1 / 80.1 | 38.0 / 69.2 | 28.2 / 37.2 | 58.2 / 73.1 | 122.9 / 125.1
GCN-LSTM [12] | - | 65.5 / 89.3 | 50.8 / 80.3 | 38.7 / 69.7 | 28.5 / 37.6 | 58.5 / 73.4 | 125.3 / 126.5
SGAE [13] | 81.0 / 95.3 | 65.6 / 89.5 | 50.7 / 80.4 | 38.5 / 69.7 | 28.2 / 37.2 | 58.6 / 73.6 | 123.8 / 126.5

CNN+Transformer:
ETA [14] | 81.2 / 95.0 | 65.5 / 89.0 | 50.9 / 80.4 | 38.9 / 70.2 | 28.6 / 38.0 | 58.6 / 73.9 | 122.1 / 124.4
CPTR (ours) | … | … | … | … | … | … | …

Table 1. Performance comparisons on the MSCOCO online test server. All models are finetuned with self-critical training. c5/c40 denotes the official test settings with 5/40 ground-truth captions.

Like other captioning methods, we also finetune our model using self-critical training [9].
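For readers unfamiliar with self-critical training [9], the following is a hedged sketch of its REINFORCE-style loss, which uses the reward of the greedily decoded caption as a baseline; the CIDEr reward computation and caption sampling are model-specific and only assumed here.

```python
# A hedged sketch of the self-critical (SCST) loss of [9], assuming PyTorch.
import torch

def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """sample_log_probs: (B,) summed log-probs of a sampled caption.
    sample_reward / greedy_reward: (B,) CIDEr scores of the sampled and
    greedily decoded captions against the ground-truth references."""
    advantage = sample_reward - greedy_reward  # baseline = greedy decode
    # Gradient ascent on the advantage-weighted log-likelihood.
    return -(advantage.detach() * sample_log_probs).mean()
```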
3. EXPERIMENTS

3.1. Dataset and Implementation Details
We evaluate our proposed model on the MS COCO [15] dataset, which is the most commonly used benchmark for image captioning. To be consistent with previous works, we use the "Karpathy splits" [16], which contain 113,287, 5,000 and 5,000 images for training, validation and test, respectively. The results are reported on both the Karpathy test split for offline evaluation and the MS COCO test server for online evaluation.

We train our model in an end-to-end fashion with the encoder initialized by the pre-trained ViT model. The input images are resized to 384 × 384 resolution and the patch size is set to 16. The encoder contains 12 layers and the decoder contains 4 layers. The feature dimension is 768, and the attention head number is 12 for both encoder and decoder. The whole model is first trained with cross-entropy loss for 9 epochs using an initial learning rate of … × 10⁻…, decayed by 0.5 at the last two epochs. After that, we finetune the model using self-critical training [9] for 4 epochs with an initial learning rate of … × 10⁻…, decayed by 0.5 after 2 epochs. We use the Adam optimizer and a batch size of 40. Beam search is used with a beam size of 3.

We use BLEU-1,2,3,4, METEOR, ROUGE and CIDEr scores [17] to evaluate our method, denoted as B-1, B-2, B-3, B-4, M, R and C, respectively.
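For convenience, the hyperparameters above can be collected into a single illustrative configuration, sketched below. The two initial learning rates are left as placeholders rather than guessed (only the 0.5 decay schedule is stated above).

```python
# The stated hyperparameters, gathered into one illustrative config dict.
config = {
    "encoder_init": "pretrained ViT",     # encoder initialized from ViT
    "image_size": 384, "patch_size": 16,  # 576-token patch sequence
    "encoder_layers": 12, "decoder_layers": 4,
    "dim": 768, "num_heads": 12,
    "optimizer": "Adam", "batch_size": 40,
    "xe_epochs": 9,   "xe_lr": None,    # initial XE learning rate: placeholder
    "scst_epochs": 4, "scst_lr": None,  # initial SCST learning rate: placeholder
    "lr_decay": 0.5,  # XE: at the last two epochs; SCST: after 2 epochs
    "beam_size": 3,
}
```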
We compare the proposed CPTR to "CNN+RNN" models including LSTM [18], SCST [9], LSTM-A [10], RFNet [11], Up-Down [4], GCN-LSTM [12], LBPF [19] and SGAE [13], and to "CNN+Transformer" models including ORT [20] and ETA [14].

Method | B-1 | B-2 | B-3 | B-4 | M | R | C

CNN+RNN:
LSTM [18] | - | - | - | 31.9 | 25.5 | 54.3 | 106.3
SCST [9] | - | - | - | 34.2 | 26.7 | 55.7 | 114.0
LSTM-A [10] | 78.6 | - | - | 35.5 | 27.3 | 56.8 | 118.3
RFNet [11] | 79.1 | 63.1 | 48.4 | 36.5 | 27.7 | 57.3 | 121.9
Up-Down [4] | 79.8 | - | - | 36.3 | 27.7 | 56.9 | 120.1
GCN-LSTM [12] | 80.5 | - | - | 38.2 | 28.5 | 58.3 | 127.6
LBPF [19] | 80.5 | - | - | 38.3 | 28.5 | 58.4 | 127.6
SGAE [13] | 80.8 | - | - | 38.4 | 28.4 | 58.6 | 127.8

CNN+Transformer:
ORT [20] | 80.5 | - | - | 38.6 | 28.7 | 58.4 | 128.3
ETA [14] | 81.5 | - | - | 39.3 | 28.8 | 58.9 | 126.6
CPTR (ours) | … | … | … | … | … | … | 129.4
Table 2. Performance comparisons on the COCO Karpathy test split. All models are finetuned with self-critical training.

The methods mentioned above all use image features extracted by a CNN or an object detector as inputs, while our model directly takes the raw image as input. Table 2 shows the performance comparison results on the offline Karpathy test split: CPTR achieves a 129.4 CIDEr score, outperforming both the "CNN+RNN" and "CNN+Transformer" models. We attribute the superiority of the CPTR model over the conventional "CNN+" architectures to its capacity of modeling global context at all encoder layers. The online COCO test server evaluation results shown in Table 1 also demonstrate the effectiveness of our CPTR model.
We conduct ablation studies from the following aspects: (a) different pre-trained models to initialize the Transformer encoder, and (b) different input image resolutions.
Table 3. Ablation studies on the cross-entropy training stage. Res: image resolution.

Increasing the input resolution from 224 × 224 to 384 × 384 while keeping the patch size equal to 16 brings huge performance gains (from a 111.6 CIDEr score to a 116.5 CIDEr score). This is sensible because the length of the patch sequence increases from (224/16)² = 196 to (384/16)² = 576 with the larger input resolution, which divides the image more finely and provides more features to interact with each other via the encoder self-attention layers.

Finally, we take one example image to show the caption predicted by the CPTR model, and we visualize both the self-attention weights over the patch sequence in the encoder and the "words-to-patches" cross attention weights in the decoder. With regard to the encoder self-attention, we choose an image patch and visualize its attention weights to all patches. As shown in Fig. 2, in the shallow layers, both local and global contexts are exploited by different attention heads thanks to the full Transformer design, which cannot be fulfilled by conventional CNN encoders. In the middle layers, the model tends to pay attention to the primary object in the image, i.e., the "teddy bear". The last layer fully utilizes global context and pays attention to all objects in the image, i.e., "teddy bear", "chair" and "laptop".

Besides, we visualize the "words-to-patches" attention weights in the decoder during the caption generation process.
Fig. 2. Visualization of the encoder self-attention weights for different layers and attention heads. The image at the upper left corner is the raw image, and the red point on it marks the chosen query patch. The first, second and third rows visualize the attention weights of the 1st, 6th and 12th encoder layers, respectively. The columns show different heads within a given layer.
Fig. 3. Visualization of the attention weights computed by the "words-to-patches" cross attention in the last decoder layer. "A teddy bear sitting in a blue chair with a laptop" is the caption generated by our model.

As shown in Fig. 3, the CPTR model correctly attends to the appropriate image patches when predicting each word.
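As an illustration of how a figure like Fig. 3 can be rendered, the sketch below overlays per-word attention maps on the 24 × 24 patch grid. Here `attn` is assumed to be the words-to-patches cross-attention matrix captured from the last decoder layer (e.g., via a forward hook); capturing it is model-specific and not shown.

```python
# A minimal sketch of rendering "words-to-patches" attention maps.
import numpy as np
import matplotlib.pyplot as plt

def show_word_to_patch_attention(image, attn, words, grid=24):
    """image: (H, W, 3) array; attn: (len(words), grid*grid) attention
    weights; words: the generated caption tokens."""
    fig, axes = plt.subplots(1, len(words), figsize=(3 * len(words), 3))
    for ax, word, weights in zip(np.atleast_1d(axes), words, attn):
        ax.imshow(image)
        # Overlay the word's attention, reshaped to the patch grid and
        # stretched to the image extent (left, right, bottom, top).
        ax.imshow(np.asarray(weights).reshape(grid, grid), alpha=0.5,
                  cmap="jet", extent=(0, image.shape[1], image.shape[0], 0))
        ax.set_title(word)
        ax.axis("off")
    plt.show()
```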
4. CONCLUSIONS
In this paper, we rethink image captioning as a sequence-to-sequence prediction task and propose CPTR, a full Transformer model that replaces the conventional "CNN+Transformer" design. Our network is totally convolution-free and possesses the capacity of modeling global context information at every encoder layer from the beginning. Evaluation results on the popular MS COCO dataset demonstrate the effectiveness of our method, and we surpass "CNN+Transformer" networks. Detailed visualizations demonstrate that our model can exploit long-range dependencies from the beginning, and that the decoder's "words-to-patches" attention can precisely attend to the corresponding visual patches to predict words.

5. REFERENCES
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.

[4] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang, "Bottom-up and top-down attention for image captioning and visual question answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.

[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko, "End-to-end object detection with transformers," arXiv preprint arXiv:2005.12872, 2020.

[6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.

[7] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al., "Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers," arXiv preprint arXiv:2012.15840, 2020.

[8] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao, "Pre-trained image processing transformer," arXiv preprint arXiv:2012.00364, 2020.

[9] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel, "Self-critical sequence training for image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7008–7024.

[10] Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei, "Boosting image captioning with attributes," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4894–4902.

[11] Lei Ke, Wenjie Pei, Ruiyu Li, Xiaoyong Shen, and Yu-Wing Tai, "Reflective decoding network for image captioning," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8888–8897.

[12] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei, "Exploring visual relationship for image captioning," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 684–699.

[13] Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai, "Auto-encoding scene graphs for image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10685–10694.

[14] Guang Li, Linchao Zhu, Ping Liu, and Yi Yang, "Entangled transformer for image captioning," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8928–8937.

[15] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.

[16] Andrej Karpathy and Li Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.

[17] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick, "Microsoft COCO captions: Data collection and evaluation server," arXiv preprint arXiv:1504.00325, 2015.

[18] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.

[19] Yu Qin, Jiajun Du, Yonghua Zhang, and Hongtao Lu, "Look back and predict forward in image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8367–8375.

[20] Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares, "Image captioning: Transforming objects into words," in Advances in Neural Information Processing Systems, 2019.