SIT3: Code Summarization with Structure-Induced Transformer
Hongqiu Wu, Hai Zhao∗, Min Zhang
Department of Computer Science and Engineering, Shanghai Jiao Tong University
Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University, Suzhou, China
[email protected], [email protected], [email protected]

Abstract
Code summarization (CS) is becoming a promising area in natural language understanding. It aims to generate sensible annotations for source code automatically and is a programmer-oriented task. Previous works attempt to apply structure-based traversal (SBT) or non-sequential models such as Tree-LSTM and GNN to learn structural program semantics. Both lines of work have drawbacks: 1) incorporating SBT into Transformer has been shown to be ineffective; 2) GNNs are limited in capturing global information; 3) the ability of a plain Transformer to capture structural semantics has been underestimated. In this paper, we propose a novel model based on structure-induced self-attention, which encodes sequential inputs with highly effective structure modeling. Extensive experiments show that our newly proposed model achieves new state-of-the-art results on popular benchmarks. To our best knowledge, it is the first work on code summarization that uses Transformer to model structural information with high efficiency and no extra parameters. We also provide a tutorial on how we pre-process the data.
With the rise of the Internet industry, program comprehension has played an indispensable role in the process of software development and maintenance. However, the overwhelming amount of programs brings big challenges to programmers, who have limited time and energy, and this runs counter to the goal of liberating human labor and building a human-oriented society. In this case, machine program comprehension has become a valuable research direction in today's natural language processing.

∗ Corresponding author. This paper was partially supported by National Key Research and Development Program of China (No. 2017YFB0304100) and Key Projects of National Natural Science Foundation of China (U1836222 and 61733011).
Figure 1: A sample of the code summarization task in Java (the snippet's target summary reads "Attach to a new plot, and display.").

Among these directions, the code summarization task aims to enable machines to understand source code and generate sensible annotations automatically, which literally helps to reduce large amounts of unnecessary and tedious work for programmers.

In the early days, code summarization was a derivative problem of information retrieval (Haiduc et al., 2010; Eddy et al., 2013; Wong et al., 2013, 2015), solved by matching the most similar code snippets that are labeled with summaries. Such methods lack generalization and perform at low quality. Fortunately, the rapid development of deep learning and highly active open-source communities provides computation and data support for high-quality automatic code summarization. In recent years, researchers have treated code summarization as a language generation task, which usually depends on sequence-to-sequence models. Iyer et al. (2016) employed a Recurrent Neural Network (RNN) with an attention mechanism to generate summaries from original source code. Uri et al. (2019) further tried to enrich code representations by injecting syntactic information, known as the Abstract Syntax Tree (AST). Experiments showed that structural representations (e.g., the AST) may help models better comprehend code snippets and achieve more sensible generation results.

Model             | FMM | Structure-sensitive | Long-term dependency
LSTM              |     |                     |
Tree-LSTM         |     |          ✓          |
Transformer       |     |                     |          ✓
SBT + LSTM        |  ✓  |          ✓          |
SBT + Transformer |     |          ✓          |          ✓
SIT3              |  ✓  |          ✓          |          ✓
Table 1: Comparison of previous models with the proposed SIT3 model. The FMM column refers to whether the input features match the corresponding model.

It is already known that RNN-based models may encounter a bottleneck when modeling long sequences due to their poor long-term dependency. For instance, a normal snippet of Java usually has hundreds of tokens, as shown in Figure 1. More recently, Ahmad et al. (2020) proposed a Transformer model to capture long-term and non-sequential information of source code, which outperformed previous RNN-based models by a large margin. Surprisingly, however, their experiments showed no improvement when fusing SBT into their Transformer. In this paper, we attribute such failure to adding sequential features into a non-sequential model.

Generally, a summarization task aims to understand the whole content from an integral view, which requires models to identify where the main points are and how they are organized. To this end, it is even more important to inject structural information into language modeling. Previous works can be divided into two categories. The first is to flatten structural information so that sequential models can learn it directly: Hu et al. (2018a) proposed a structure-based traversal (SBT) method to flatten ASTs into linear sequences. The other is to employ non-sequential encoders (e.g., Tree-LSTM (Shido et al., 2019), Tree-Transformer (Harer et al., 2019), Graph Neural Networks (Liu et al., 2020; Alex et al., 2020)) to directly model structural inputs.

Each method has its own merits and both are structure-sensitive. However, non-sequential models based on RNN are usually limited in capturing long-range information, while GNNs are sensitive to local information through message passing with k hops, where k is usually small. SBT was only shown to be positive on RNN but invalid on Transformer (Ahmad et al., 2020). Such a method aims to enable sequential models to learn non-sequential relationships through certain elaborate linear forms. RNN may achieve this goal thanks to its sequential architecture and attention mechanism.
Figure 2: Transforming the original self-attention (left, a complete graph) into structure-induced self-attention (right, a clear-cut graph) using the adjacency matrix. Note that we omit self-loops for concision.

Transformer, in contrast, obtains features through self-attention (SAN), which nevertheless acts more like a non-sequential process. Consequently, such sequential features are unsuitable for a non-sequential architecture to extract implicit structural information from. We boldly call this feature-model mismatch in Table 1. In this paper, we propose a novel approach on top of Transformer that solves this problem via an adjacency matrix, which completely retains structural information and brings it into the attention mechanism. The contributions of our work are listed as follows:

1. We propose a novel model for the code summarization task based on structure-induced self-attention. Extensive experiments show that our newly proposed model achieves new state-of-the-art performance and converges much faster with exactly the same scale of parameters as Transformer.

2. The AST, a multi-way tree, used to be challenging to pre-process and tricky to train with. However, we apply a sequential Transformer model to effectively fuse the AST. We further intensify structural representations from the simple AST to multiple graphs. We also provide a tutorial on how we pre-process the data to facilitate future research.

3. Through comprehensive experiments and analysis, we validate our proposed structure-induced method and show the underestimated ability of the vanilla Transformer to capture graph information like a GNN does.
Figure 3: The whole architecture of SIT3, the structure-induced network. Note that it contains 3 stacks in its encoder.
We aim to design a structure-sensitive Transformer model that is able to comprehend code snippets both semantically and syntactically. Meanwhile, we do not introduce extra parameters, so as to guarantee the efficiency of the model. In this section, we first review the self-attention (SAN) of Transformer. Instead of repeating the original model, we explain it from a graphic angle. Based on the SAN graph, we prune its connection relations and then propose our structure-induced self-attention.
Transformer is composed of stacks of identical layers for both the encoder and the decoder (Vaswani et al., 2017). Each layer builds on the self-attention (SAN) mechanism, which is denoted as:
$$\mathrm{SAN}(X) = \mathrm{Softmax}\left(\frac{X^{Q}(X^{K})^{\top}}{\sqrt{d_k}}\right)X^{V} \qquad (1)$$

where X = (x_1, ..., x_l) denotes the input sequence of sub-words, l denotes the sequence length and d_k denotes the hidden size per head. We view each sub-word as a node, ignore their relative positions, and then transfer SAN into a graphic representation. Equation (1) can be rewritten as follows:

$$\mathrm{SAN}(X) = \begin{pmatrix} e_{11} & e_{12} & \cdots & e_{1l} \\ \vdots & \vdots & \ddots & \vdots \\ e_{l1} & e_{l2} & \cdots & e_{ll} \end{pmatrix} \cdot \begin{pmatrix} n_1 \\ \vdots \\ n_l \end{pmatrix} \qquad (2)$$

where the attention scores form the weight matrix of edges, in which e_ij represents how significantly node n_i attends to node n_j, while the value matrix refers to the node representations n_1, ..., n_l. Figure 2 depicts the process of calculating attention scores from a graphic angle.

As we can see pictorially, SAN generates a directed complete graph. Note that it is actually also cyclic, since each node connects to itself, which is not depicted. Such a complete graph treats every word as a neighbor with no consideration of their inner structures, which may definitely bring along noise and redundancy. We argue that there should be a certain structure inside the whole sentence that represents the relationships between words more explicitly.

To achieve a more explicit SAN graph, we prune the unnecessary connections based on the Abstract Syntax Tree (AST), which formulates the syntactic information of a source code snippet in a multi-way tree structure. Figure 6 shows a complete AST graph of a certain program. To appropriately incorporate such structure into SAN, we introduce its adjacency matrix A. Since SAN itself is undirected, the adjacency matrix should be symmetric. Then we perform a dot product between A and the key-query scores to obtain the structure-induced attention scores. The formulation of SI-SAN can be written as:

$$\mathrm{SI\text{-}SAN}(X) = \mathrm{Softmax}\left(A \cdot \frac{X^{Q}(X^{K})^{\top}}{\sqrt{d_k}}\right)X^{V} \qquad (3)$$

As shown in Figure 2, when a_ij = 0 in the adjacency matrix, the connection between n_i and n_j is dropped out. Different from calculating global information over the whole sentence as in the original SAN, we expect the model to calculate more accurate information with consideration of the inner structures.

Figure 4: Comparison of different types of self-attention patterns: (a) full, (b) random, (c) Gaussian, (d) structure-induced.
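To make Equation (3) concrete, below is a minimal single-head sketch in PyTorch; the function and argument names are ours, not from any released code. Instead of literally multiplying the scores by A, the sketch masks the positions where A = 0 to −∞ before the softmax, which is the usual way to drop connections in practice.

```python
import torch
import torch.nn.functional as F

def si_san(x, w_q, w_k, w_v, adj):
    """Single-head structure-induced self-attention (illustrative sketch).

    x            : (l, d) token representations
    w_q/w_k/w_v  : (d, d_k) projection matrices
    adj          : (l, l) symmetric 0/1 adjacency matrix with self-loops
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5   # raw attention scores
    # prune connections absent from the structure graph: positions with
    # adj == 0 are masked out before the softmax, so they receive no weight
    scores = scores.masked_fill(adj == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```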
Multi-graph structure induction.
Inspired by FusionGNN (Liu et al., 2020), we further expand the AST structure with the Control Flow Graph (CFG) and the Program Dependency Graph (PDG) to create our structure-induced self-attention (SI-SAN). As a result of combining multiple graphs, SI-SAN is expected to capture various structural semantics. Note that loops will emerge in the combined graph, which may destroy the original organization and further introduce redundancy. Following the theory of information diffusion (Goyal et al., 2020), we apply a pooling operation to cut down needless loops, which also makes further model compression possible. For more details on multi-graph construction, please refer to Appendix A; a sketch of merging the different views into one adjacency matrix is given below.
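The merging step can be as simple as taking the union of the edge sets of the individual graphs over the same token positions. The following is a hedged sketch: the helper name and edge-list format are assumptions, and the loop-pruning pooling mentioned above is omitted.

```python
import numpy as np

def multi_graph_adjacency(edge_sets, length):
    """Merge several structural views (e.g. AST, CFG and PDG edges over the
    same token positions) into one symmetric 0/1 adjacency matrix.

    edge_sets : iterable of edge lists, one list of (i, j) pairs per graph
    length    : number of tokens in the code snippet
    """
    adj = np.eye(length, dtype=np.int64)        # keep self-loops
    for edges in edge_sets:
        for i, j in edges:
            adj[i, j] = adj[j, i] = 1           # SI-SAN uses an undirected graph
    return adj
```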
Hierarchical structure induction.
SI-SAN treats all nodes within the same connected subgraph equally. It is more rational to assume that each node pays unequal attention to itself, its brothers, its ancestors and its descendants. For that reason, we further expand the binary adjacency weights into hierarchical weights: for the node itself, we set the attention score to 1.0, while for its brothers and ancestors we set the scores to 0.75 and 0.5, respectively. A small sketch of this weighting is shown below.
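A minimal sketch of turning parent links into such a weighted matrix follows. The function name, the parent-array input format, and the symmetric treatment of ancestor/descendant pairs are our assumptions; the paper only fixes the 1.0 / 0.75 / 0.5 values.

```python
import numpy as np

def hierarchical_weights(parent, length):
    """Weighted adjacency for hierarchical structure induction.
    parent[i] is the parent index of node i (-1 for the root)."""
    w = np.zeros((length, length))
    np.fill_diagonal(w, 1.0)                            # the node itself: 1.0
    for i in range(length):
        for j in range(length):
            if i != j and parent[i] != -1 and parent[i] == parent[j]:
                w[i, j] = 0.75                          # brothers (siblings): 0.75
    for i in range(length):
        p = parent[i]
        while p != -1:                                  # walk up to every ancestor
            w[i, p] = w[p, i] = 0.5                     # ancestors/descendants: 0.5
            p = parent[p]
    return w
```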
Figure 3 depicts the whole architecture of our model. Note that we present the model directly here and leave its validation to the ablation study. We substitute half of the encoder layers of Transformer with our structure-induced self-attention layers. Specifically, for Transformer-Base, which has 6 layers in its encoder, we substitute the 2nd, 4th and 6th layers and retain the others.
Induced Contextual Aggregation.
To combine contextual representations, we transform the original sequential structure into 3 stacks, where two layers (SAN & SI-SAN) form a new stack. Specifically, given an input sequence X = (x_1, ..., x_l), where l denotes the sequence length, we first pass it through a SAN layer to obtain the hidden representation H = (h_1, ..., h_l). Subsequently, we pass H through an SI-SAN layer to obtain H' = (h'_1, ..., h'_l). Finally, we use an aggregation module to fuse H and H' into the aggregated representation H̄ = (h̄_1, ..., h̄_l):

$$H = \mathrm{SAN}(X), \qquad H' = \mathrm{SI\text{-}SAN}(H), \qquad \bar{H} = \mathrm{Aggr}(H + H') \qquad (4)$$

where the aggregation operation we use is a simple position-wise sum. We then continue this process for the other two stacks. We find that such a twin structure leads to better performance than a serialized form. In each stack, the model first learns global information with SAN, where all connections are available. Then, through SI-SAN, the model is told which of the connections are useful and which should be dropped.

Note that SIT3 still consists of 6 encoder layers and 6 decoder layers; it only changes the arrangement of Transformer's modules and does not introduce any extra parameters. Nevertheless, experiments show that this architecture is more effective and obtains more promising results than those of Vaswani et al. (2017) and Ahmad et al. (2020).
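Putting the two attention variants together, one encoder stack can be sketched as follows. The class is an illustration of Eq. (4), not the released implementation; SANLayer and SISANLayer stand in for standard Transformer encoder layers, the latter applying the adjacency mask.

```python
import torch.nn as nn

class SITStack(nn.Module):
    """One of the three encoder stacks: SAN followed by SI-SAN,
    fused by a position-wise sum (Eq. 4)."""

    def __init__(self, san_layer, si_san_layer):
        super().__init__()
        self.san = san_layer
        self.si_san = si_san_layer

    def forward(self, x, adj):
        h = self.san(x)                  # global attention: all connections visible
        h_prime = self.si_san(h, adj)    # structure-induced attention
        return h + h_prime               # Aggr(.): position-wise sum
```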
Model                            |        Java          |        Python
                                 | BLEU  ROUGE-L METEOR | BLEU  ROUGE-L METEOR
CODE-NN (Iyer et al., 2016)      | 27.60  41.10   12.61 | 17.36  37.81   09.29
Tree2Seq (Eriguchi et al., 2016) | 37.88  51.50   22.55 | 20.07  35.64   08.96
RL+Hybrid2Seq (Wan et al., 2018) | 38.22  51.91   22.75 | 19.28  39.34   09.75
DeepCom (Hu et al., 2018a)       | 39.75  52.67   23.06 | 20.78  37.35   09.98
API+Code (Hu et al., 2018b)      | 41.31  52.25   23.73 | 15.36  33.65   08.57
Dual Model (Wei et al., 2019)    | 42.39  53.61   25.77 | 21.80  39.45   11.14
Transformer (Ahmad et al., 2020) | 44.58  54.76   26.43 | 32.52  46.73   19.77
Transformer (Ours)               | 44.60  54.93   26.70 | 33.12  47.21   20.56
SIT3                             |                      |
Table 2: BLEU, ROUGE-L and METEOR for our approach compared with the other baselines. The results in the upper part are directly reported from Ahmad et al. (2020). Note that we only reproduce the Transformer result since it is much stronger than the other baselines; our reproduced result is stronger still.

Our experiments are conducted on the two most frequently used benchmarks, Java (Hu et al., 2018a) and Python (Wan et al., 2018). However, since we have no idea of how the data is split, which is expected to be random, we use 5-fold cross-validation to obtain more stable results.
Pre-processing.
For Java code, we refer to the method provided in Hu et al. (2018a). They use the javalang module of Python to compile Java and fetch the ASTs in dictionary form. For Python code, we use the parser of Python's own ast module to generate trees and write a script to resolve them into SI-SAN and HIS-SAN masks. For more details on pre-processing, please refer to Appendix B; a sketch of the first step is given below.
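As an illustration of the Python side, here is a minimal sketch that extracts parent-child edges with the standard ast module. The helper name and edge format are ours; mapping AST nodes back to source tokens and dropping non-terminal nodes, as discussed next, is a separate step.

```python
import ast

def ast_edges(source):
    """Parse Python source and return the parent-child edges of its AST."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((parent, child))
    return tree, edges
```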
OOV nodes.
Note that we do not introduce any tokens beyond the original sentences, such as the non-terminal nodes generated by the AST. Since SI-SAN can still model the structure without non-terminal nodes, we remove them from the tree and retain the remaining structure. The advantage is that we avoid the OOV problem (Hu et al., 2018a) of source code and prevent input sequences from becoming too long, which is beneficial for both training and memory efficiency.
Our proposed model is evaluated against a number of state-of-the-art methods on the benchmarks. We divide these baselines into two groups according to their model structures.
Transformer-based approaches.
We refer to the Transformer in Ahmad et al. (2020), which is equipped with copy attention (See et al., 2017) and relative position encoding (RPE) (Shaw et al., 2018). To our best knowledge, this serves as a stronger baseline than the vanilla Transformer (Vaswani et al., 2017). For a more objective comparison, we run their model on our machine using the same data as SIT3. It turns out that we obtain a much stronger baseline in our implementation. Note that we also utilize RPE in SIT3 because of its better capability in capturing long sequences, while we do not utilize copy attention because it seems unusual to copy from one modality to another.
LSTM-based approaches.
This group includes all relevant LSTM models with sequential and non-sequential inputs.
We train our model on a single GPU with a batch size of 64. The initial learning rate is selected in { } with a warm-up rate of 0.06 and L2 weight decay of 0.01. The maximum number of epochs is set to 230, which is enough for all models to converge. For validation, we simply use greedy search, while for evaluation, we use beam search with beam size selected in {4, 5, 8} and choose the best result.

Table 2 shows the overall results on the Java and Python benchmarks. We can see that the performance of Transformer is already very strong, outperforming all previous works by a significant margin. However, our model is more powerful, further boosting Transformer by about 1 and 2 BLEU points on Java and Python respectively and achieving new state-of-the-art results. Specifically, SIT3 achieves even more impressive results on Python, increasing by 2.02, 1.86 and 1.40 points on BLEU, ROUGE-L and METEOR respectively. We speculate that Python is split 0.6/0.2 for training and testing, in which case evaluation tends to be much more challenging because of far fewer training samples. On the contrary, the gap on Java, which is split 0.8/0.1, is supposed to be smaller. Even so, SIT3 still boosts Transformer by 0.94, 0.62 and 1.05 points on BLEU, ROUGE-L and METEOR respectively.

Moreover, Figure 5 shows the trend of BLEU scores on the development set over training steps. We can see that SIT3 achieves a much faster convergence rate than Transformer. For instance, on the Python dataset, SIT3 reaches the best performance of Transformer in about 120 epochs, nearly half of the latter's best epoch. Note that the time consumed per epoch is almost the same for both models. Such a high convergence rate supports our SI-SAN, which tends to be more structure-sensitive than the original SAN.

Figure 5: Convergence of Transformer and SIT3: BLEU score over training steps on (a) Java and (b) Python.
Taken as a whole, our experimental results demonstrate our success in modeling structure with Transformer. In this section, we conduct an ablation study to validate our model and analyze how it works through four comparative experiments. Note that we conduct the study on a different dataset, Python-V2 (https://github.com/EdinburghNLP/code-docstring-corpus/tree/master/V2), on which we perform our own pre-processing to make it considerably cleaner. Consequently, all approaches obtain better performance, which helps discriminate the results. Readers can also refer to Appendix B for details.

To validate SI-SAN, we implement the following experiments.
Model            | BLEU  | ROUGE-L
Transformer-6V   | 44.21 | 53.01
Transformer-5V1S | 44.13 | 52.95
Transformer-4V2S | 44.43 | 53.06
Transformer-3V3S | 45.33 | 54.05
Transformer-2V4S | 44.57 | 53.22
Transformer-1V5S | 44.41 | 53.22
Transformer-6S   | 44.00 | 52.97
Table 3: BLEU and ROUGE-L for variant models with incremental proportions of SI-SAN. We omit METEOR here since it is much lower than the other two metrics and less discriminative.
Model             | BLEU  | ROUGE-L
SIT3-Gaussian-SAN | 40.00 | 49.70
SIT3-SI-SAN       | 45.33 | 54.05
Table 4: BLEU and ROUGE-L for variant models with different attention patterns.

We gradually replace the original SAN layers in the original Transformer with SI-SAN and observe how performance changes. Specifically, for the model Transformer-5V1S, we replace the last layer with SI-SAN and also use its hidden representation for induced contextual aggregation. The results of variant models with incremental proportions of SI-SAN layers are shown in Table 3.

We see that as the proportion of SI-SAN layers increases, the BLEU score increases in a similar pattern. However, it is interesting that when the SI-SAN proportion rises above half, the BLEU score begins to decline. We hypothesize that this is caused by over-pruning: the connections dropped in SI-SAN may carry certain useful information. That is why we apply induced contextual aggregation, to reduce redundancy while retaining the original expression.
Previously, Ahmad et al. (2020) followed Hu et al. (2018a) and used the SBT technique to flatten AST structures into linear sequences. Their experiments suggested that incorporating AST information into Transformer does not result in improvement. The reason they gave was that Transformer might learn structural information itself through relative position representations. However, we achieve substantial improvement when incorporating the AST into Transformer using SI-SAN. We attribute the failure of SBT to the feature-model mismatch problem, and here we additionally specify it with the following two reasons. First, flattening methods unavoidably weaken the original structures and lead to information loss when transforming them into linear form. In contrast, SI-SAN retains complete structural information and incorporates it into the attention mechanism via the adjacency matrix; consequently, each node is only visible to the specific nodes that are accessible according to the structure. Second, position encoding is not sufficient to represent position features in tree-like or graph structures. Recent experiments show that position encoding in SAN serves more as a feature of different positions than of relative distances. Even under relative position encoding, which models mutual interactions among tokens, hierarchical relations in a tree structure are still difficult to represent. Take the Python code if a == 0: break as an instance: if is supposed to be the neighbor of both a and break, but they are three positions apart under RPE.

In addition, the average length of SBT input code sequences is about 1.5 times longer, which may introduce additional cost. Since our architecture does not need non-terminal tokens, we are able to retain the original sequences while incorporating their inner structures. SI-SAN is shown to be as fast as SAN per epoch but with much faster convergence.

Another natural question is whether SIT3 achieves good performance simply through denoising training with a structure-induced mask. To answer this question, we run two extra experiments, replacing our mask in SI-SAN with a random and a Gaussian mask, which are depicted in Figure 4. As shown in Table 4, randomly adding a noisy mask does not achieve any improvement but instead deteriorates the model.
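Returning to the if a == 0: break example, a quick check with CPython's ast module illustrates the point: the If node is directly connected to both the comparison containing a and the break statement, although they are several positions apart in the token sequence. The snippet is wrapped in a loop only so that break is well-formed.

```python
import ast

snippet = "if a == 0: break"
tree = ast.parse("while True:\n    " + snippet)   # wrap so `break` is legal

for node in ast.walk(tree):
    if isinstance(node, ast.If):
        children = [type(c).__name__ for c in ast.iter_child_nodes(node)]
        print(children)   # -> ['Compare', 'Break']
```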
RNN-based approaches.
While a number of works (Haiduc et al., 2010; Eddy et al., 2013; Wong et al., 2013, 2015; Zhang et al., 2020) on code summarization depended on information retrieval approaches, retrieving summaries from similar code snippets, most recent works tend to treat it as a machine translation problem based on sequence-to-sequence models. Meanwhile, attention mechanisms are used most frequently for their better performance in capturing long-range features. Allamanis et al. (2016) used a Convolutional Neural Network (CNN) with copy attention, and more commonly, Iyer et al. (2016) used a Recurrent Neural Network (RNN) model to generate summaries from code snippets. Hu et al. (2018b) proposed to use API knowledge from a related task to assist training. Additionally, reinforcement learning (Wan et al., 2018) and dual learning (Wei et al., 2019) have been shown effective in boosting model performance.
Transformer-based approaches.
It is known that RNN-based models may encounter a bottleneck when modeling long sequences due to their poor long-term dependency. Ahmad et al. (2020) proposed an enhanced Transformer with copy attention and relative position encoding, which outperformed previous RNN-based models by a large margin.
Structure-based approaches.
Recent works on code summarization pay more and more attention to structural information of source code, such as the Abstract Syntax Tree (AST). Hu et al. (2018a) proposed a flattening method called structure-based traversal (SBT) to transform ASTs into linear sequences and then feed them into an RNN. Shido et al. (2019) and Harer et al. (2019) proposed Tree-LSTM and Tree-Transformer to directly encode tree-style inputs. Differing from modeling code with sequential models, Liu et al. (2020) and Alex et al. (2020) apply Graph Neural Networks (GNN) to encode programs as graphs.
Pre-training approaches.
Apart from directly training models on task-specific data, CodeBERT (Feng et al., 2020), which is pre-trained on large-scale bimodal data, also achieved promising results after fine-tuning. However, it does not introduce a generation-related training objective or the structural information of the programming language itself. Pre-training approaches for program comprehension tasks are worth further exploration.
This paper presents a novel structure-induced Transformer model, SIT3, for the code summarization task, which is shown to be more structure-sensitive than Transformer. With a well-designed architecture, we successfully incorporate multiple structures into the attention mechanism without tedious pre-processing or tricky implementation. We further adopt a new module arrangement to aggregate both global self-attention and structure-induced self-attention representations. Experiments on two challenging benchmarks, Java and Python, show that SIT3 yields new state-of-the-art results. Furthermore, we provide substantial analysis to validate the effectiveness of our proposed approach. In the future, we plan to extend this work to pre-trained language models.
References
Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A transformer-based approach for source code summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4998–5007, Online. Association for Computational Linguistics.

LeClair Alex, Haque Sakib, Wu Lingfei, and McMillan Collin. 2020. Improved code summarization via a graph neural network.

Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A convolutional attention network for extreme summarization of source code. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2091–2100, New York, New York, USA. PMLR.

Brian P. Eddy, Jeffrey A. Robinson, Nicholas A. Kraft, and Jeffrey C. Carver. 2013. Evaluating source code summarization techniques: Replication and expansion. Pages 13–22. IEEE.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833, Berlin, Germany. Association for Computational Linguistics.

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155.

Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, and Ashish Verma. 2020. PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination. In International Conference on Machine Learning, pages 3690–3699. PMLR.

Sonia Haiduc, Jairo Aponte, Laura Moreno, and Andrian Marcus. 2010. On the use of automated text summarization techniques for summarizing source code. Pages 35–44. IEEE.

Jacob Harer, Chris Reale, and Peter Chin. 2019. Tree-Transformer: A transformer-based method for correction of tree-structured data. arXiv preprint arXiv:1908.00449.

Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018a. Deep code comment generation. Pages 200–210. IEEE.

Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, and Zhi Jin. 2018b. Summarizing source code with transferred API knowledge. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 2269–2275. International Joint Conferences on Artificial Intelligence Organization.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2073–2083, Berlin, Germany. Association for Computational Linguistics.

Shangqing Liu, Yu Chen, Xiaofei Xie, Jing Kai Siow, and Yang Liu. 2020. Automatic code summarization via multi-dimensional semantic fusing in GNN. arXiv preprint arXiv:2006.05405.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana. Association for Computational Linguistics.

Yusuke Shido, Yasuaki Kobayashi, Akihiro Yamamoto, Atsushi Miyamoto, and Tadayuki Matsumura. 2019. Automatic source code summarization with extended Tree-LSTM. Pages 1–8. IEEE.

Alon Uri, Brody Shaked, Levy Omer, and Yahav Eran. 2019. code2seq: Generating sequences from structured representations of code. In International Conference on Learning Representations.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu. 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pages 397–407.

Bolin Wei, Ge Li, Xin Xia, Zhiyi Fu, and Zhi Jin. 2019. Code generation as a dual task of code summarization. In Advances in Neural Information Processing Systems, pages 6563–6573.

Edmund Wong, Taiyue Liu, and Lin Tan. 2015. CloCom: Mining existing source code for automatic comment generation. Pages 380–389. IEEE.

Edmund Wong, Jinqiu Yang, and Lin Tan. 2013. AutoComment: Mining question and answer sites for automatic comment generation. Pages 562–567. IEEE.

Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, and Xudong Liu. 2020. Retrieval-based neural source code summarization.