Relevance Transformer: Generating Concise Code Snippets with Relevance Feedback
Carlos Gemmell, Federico Rossetto, and Jeffrey Dalton
University of Glasgow, Scotland, UK
{carlos.gemmell,federico.rossetto,jeff.dalton}@glasgow.ac.uk
ABSTRACT
Tools capable of automatic code generation have the potential to augment programmers' capabilities. While straightforward code retrieval is incorporated into many IDEs, an emerging area is explicit code generation. Code generation is currently approached as a Machine Translation task, with Recurrent Neural Network (RNN) based encoder-decoder architectures trained on code-description pairs. In this work we introduce and study modern Transformer architectures for this task. We further propose a new model called the Relevance Transformer that incorporates external knowledge using pseudo-relevance feedback. The Relevance Transformer biases the decoding process to be similar to existing retrieved code while enforcing diversity. We perform experiments on multiple standard benchmark datasets for code generation, including Django, Hearthstone, and CoNaLa. The results show improvements over state-of-the-art methods based on BLEU evaluation. The Relevance Transformer model demonstrates the potential of Transformer-based architectures for code generation and introduces a method of incorporating pseudo-relevance feedback during inference.
CCS CONCEPTS
• Information systems → Information retrieval

KEYWORDS
Code Generation, Code Retrieval, Neural Machine Translation
ACM Reference Format:
Carlos Gemmell, Federico Rossetto, and Jeffrey Dalton. 2020. Relevance Transformer: Generating Concise Code Snippets with Relevance Feedback. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20), July 25–30, 2020, Virtual Event, China. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3397271.3401215
INTRODUCTION

To effectively write code, a programmer requires parallel knowledge of many different programming languages, libraries, and techniques. The sheer amount of structured information required is often too much to memorize, resulting in frequent online searches for library examples or syntax clarifications. This lengthens the development process and reduces productivity.
Figure 1: Generation sample from the Relevance Transformer on the Django dataset. The sample shows a sentence under construction and the token to be produced at the next time step.
Relevant words: ['filter', 'objects', 'id', 'author__id', 'Book', 'pk', '*', 'Sample', 'Entry', 'name', "'name'", 'title', "'title'", 'exists', '-']
Next token prediction: 'objects' predicted over 'groupby'
While code retrieval [9] is a helpful feature in many IDEs, it is often inflexible to the varying demands of a programmer and has trouble adapting to context. Code generation seeks to solve these problems by allowing the programmer to express their ideas in natural language and have the code be generated by an algorithm. In doing so, the programmer can focus on higher-level tasks.

Current work in Neural Machine Translation (NMT) systems related to code generation uses RNN-based encoder-decoder models, often Long Short-Term Memory (LSTM) networks. While RNN-based models are useful in many translation tasks [1, 14], newer models such as the Transformer [15] show significant advances in NMT due to their self-attentive architectures. However, a problem shared by all these architectures is their inability to incorporate external knowledge.

To our knowledge, we are the first to use Transformer-based architectures for the task of code generation. We propose the Relevance Transformer, a new model that incorporates pseudo-relevance feedback for translation during the decoding phase. Following methods from Lavrenko and Croft [7], we induce a positive bias on autoregressive generation, improving decoding quality. This bias is produced by retrieving code snippets relevant to the English description and extracting common tokens proportional to their relevance for the model. Results on standard benchmark collections show consistent gains over both retrieval and generation baselines, including significant gains on the realistic CoNaLa dataset [17] based on Stack Overflow questions.
RELATED WORK
Retrieval models are well established in the field of code improvement. Many attempts emphasize helping programmers debug programs and remove duplicate code by identifying close matches in source code. Early approaches [6] rely on highly structured formal methods to convert queries into a structured query language to search for exact matches. Mishne et al. [9] propose a code snippet retrieval method by forming unstructured queries over source code and use a "fuzzy" matching approach to help programmers find snippets similar to their query. These approaches attempt to search the code itself to find relevant results. Sindhgatta [13] employs a different approach by querying over code authors' annotations to retrieve relevant code snippets. This last approach is most similar to our retrieval model.

Most recent work treats code generation as a Machine Translation task and applies translation models, such as encoder-decoder networks [14]. These sequence-to-sequence (Seq2Seq) models allow for variable-length input and output. While Seq2Seq models provide a strong baseline, Ling et al. [8] propose a latent predictor network which allows selective copying of input tokens relevant to the output sequence by selecting different predictors. Later networks incorporate structural information from code as ASTs [4, 11]. These models use code-specific actions and build the target code by specifying a sequence of rules to construct the tree.

Other work focuses on maintaining the token representation by enhancing the input with retrieved snippets of code. Hashimoto et al. [3] use a two-stage training method by retrieving similar snippets of code and then using these snippets as input to a Seq2Seq model. The retrieval algorithm takes only an English description and is trained using an oracle to produce a ranking that returns the pairs whose code is most similar to the desired output. This process adds context to support the decoder in producing the target code.

While the field of cross-lingual information retrieval employs translation dictionaries [5] and Statistical Machine Translation [2] to improve effectiveness, the inverse problem is seldom approached. Zhang et al. [18] use retrieved translation chunks to boost the probability of decoding certain tokens. While this decoding process is similar to ours, they employ an alignment dictionary to bring in external knowledge and don't normalize their increments with respect to the retrieved documents. In contrast, we don't require any structured knowledge, relying only on documents found in the training set.
METHOD

We define the task of code generation from natural language as follows: given a query description, q, the goal is to generate the single most relevant snippet of code, c, that satisfies the query. We formulate the task as:

Input: Tokens from q are split into a sequence $\{q_i\}_{i \in [1, \ldots, n]}$, with i denoting the position of the token in the sequence.
Output: Code tokens from c are split into a sequence $\{c_i\}_{i \in [1, \ldots, m]}$. We note that c can come either from retrieval (existing code) or be produced by a generative model. The output is a short snippet equivalent to a small line (or lines) of code.

Retrieval. One of the core components of our model is the retrieval algorithm. It is responsible for producing a ranking of relevant documents with respect to an input query. In our problem, the query is the natural language English description from the code-description pair, (q, c). Our search corpus is composed of all English descriptions in the training set. The retrieval algorithm scores a document d through its similarity function RS(q, d). We identify two effective methods for retrieving snippets. The first is a BM25 implementation in Lucene, using PyLucene as an interface. The second is the similarity scoring function from ReCode [4], a token-level string similarity score. While we test both, we opt for BM25 due to its more efficient implementation. The ranking produced by the retrieval algorithm is used to pick the top k documents. We extract the code from the pairs and use it either as the final output, as is the case for our baseline retrieval methods, or as a guide for our relevance model, as sketched below.
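As a minimal illustration of this retrieval step (the paper's system uses a Lucene BM25 index via PyLucene; here the rank_bm25 package stands in, and the training pairs shown are hypothetical):

```python
from rank_bm25 import BM25Okapi

# Hypothetical (description, code) training pairs; the real corpus is the
# full training set of code-description pairs.
train_pairs = [
    ("filter Book objects by author id", "Book.objects.filter(author__id=pk)"),
    ("check whether any Entry objects exist", "Entry.objects.exists()"),
]

# Index the English descriptions, tokenized on whitespace.
corpus = [desc.split() for desc, _ in train_pairs]
bm25 = BM25Okapi(corpus)

def retrieve(query, k=5):
    """Score every training description against the query, mirroring
    RS(q, d), and return the code and score of the top-k pairs."""
    scores = bm25.get_scores(query.split())
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return [(train_pairs[i][1], scores[i]) for i in top]
```

The retrieved code is then either returned directly (the retrieval baselines) or used to bias decoding, as described next.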
Transformer. Our system uses the Transformer [15] at its core. This architecture employs several self-attentive layers in an encoder-decoder structure to map a variable-length input to a variable-length output sequence. The output is produced autoregressively, generating a conditional distribution over the entire vocabulary at each time step t. During training, the model uses a look-ahead attention mask to hide future predictions from the current step, thus basing its prediction only on the English tokens q and the currently produced output sequence $c_{t-1}$. Given the smaller size of the datasets in contrast to the original uses of Transformers, we reduce the size of our model to two attention layers for both the encoder and decoder, four attention heads, an embedding dimension of 512, and a pointwise feed-forward network dimension of 1024.

Relevance Transformer. In this section, we outline how the Relevance Transformer copes with the unique challenges of generating code. Initial naïve attempts consisted of simply appending top code results to the input, but these proved unsuccessful. There are several key components in the Relevance Transformer that provide significant improvements over the base implementation: pseudo-relevance feedback decoding and input token copying.

Pseudo-relevance feedback decoding. One key aspect of our proposed network is a sequence-aware pseudo-relevance feedback [7] decoding method. During a decoding step, our copy-augmented Transformer produces a probability distribution over each token in the vocabulary, as well as positional out-of-vocabulary terms; we denote this as $M(q, c_{t-1})$, where $c_{t-1} = \{c_1, \ldots, c_{t-1}\}$ is the current decoded sequence. We aim to improve decoding quality by retrieving the top k documents $D(q, k)$ and emphasising common words in the results, excluding a set $ST(n)$ of frequent stop tokens. We achieve this by interpolating normalized token frequency scores with the original NMT distribution, Equation 1:

$$P(w_t \mid q, c_{t-1}) = \left[ \lambda \cdot M(w_t \mid q, c_{t-1}) + (1 - \lambda) \cdot RF(q, w_t) \cdot RP(c_{t-1}, w_t) \right] \cdot Z \quad (1)$$

$$fr(w_t, d) = \frac{\mathrm{count}(w_t, d)}{\mathrm{length}(d)}, \qquad RF(q, w_t) = \left[ 1 - \mathbb{1}[w_t \in ST(n)] \right] \cdot \sum_{d \in D(q, k)} fr(w_t, d) \cdot RS(q, d) \quad (2)$$

where Z is the normalization constant. For each token, we take into account the score given by the retrieval algorithm as well as the document length to emphasize top-scoring snippets. While there is no guarantee that a single top-scoring snippet will provide good suggestions for words in the output, the aggregation of multiple top-scoring snippets gives confidence to increase the probability of common words, Equation 2.

We also take into account terms that have already been seen in the current decoded sequence. As such, we use a repetition penalty (Equation 3) to condition the probability given to a term based on its previous presence in the prediction:

$$RP(c_{t-1}, w_t) = 1 - \mathbb{1}[w_t \in c_{t-1}] \quad (3)$$

Copy mechanism. Copy methods stem from Pointer Networks [16], which use the attention distribution produced over the input sequence to choose an element from the input at each decoding time step. While at its core a Pointer Network only allows copying elements from the input, Copy Generator Networks [12] support both the generation of new tokens and the copying of relevant tokens from the input. Our code generation task benefits from having many tokens in the input sequence in common with the output sequence, such as variable names and method identifiers. These are notoriously troublesome for sequence generation tasks since they are often very rare in the small code-description pair collections. As such, Copy Generator Networks provide an effective method to emphasize tokens regardless of their frequency in the dataset by copying them from the input.

$$M(w_t \mid q, c_{t-1}) = p_{\mathrm{gen}} \cdot T(w_t \mid q, c_{t-1}) + (1 - p_{\mathrm{gen}}) \cdot a_t(w_t) \quad (4)$$

Our implementation of copy generation in the Transformer is inspired by See et al. [12]. We use the final encoder attention vector and produce a copying vector emphasising each input token relative to its attention weight $a_t(w_t)$, Equation 4. This is then interpolated with the original vocabulary distribution $T(w_t \mid q, c_{t-1})$ through a $p_{\mathrm{gen}}$ function. The use of out-of-vocabulary tokens for very rare words, described in the data preprocessing section, allows for even more generic copying of words that have not been seen in the training dataset.
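A minimal sketch of one decoding step combining Equations 1–4 follows. This is an illustration rather than the released implementation: the encoder attention mass is assumed to be already scattered onto vocabulary positions, and the interpolation weight lam is a placeholder since the paper does not report its value of λ.

```python
import numpy as np

def copy_augmented_probs(vocab_probs, attention, p_gen):
    """Equation 4: interpolate the Transformer's vocabulary distribution
    T(w_t | q, c_{t-1}) with the encoder attention mass a_t(w_t) to obtain
    the copy-augmented distribution M."""
    return p_gen * vocab_probs + (1.0 - p_gen) * attention

def relevance_biased_step(m_probs, vocab, retrieved, stop_tokens,
                          decoded_so_far, lam=0.9):
    """Equations 1-3: bias M towards tokens common in retrieved snippets."""
    rf = np.zeros(len(vocab))
    for i, w in enumerate(vocab):
        if w in stop_tokens:                 # 1 - 1[w in ST(n)] zeroes stop tokens
            continue
        for tokens, rs_score in retrieved:   # top-k snippets D(q, k)
            fr = tokens.count(w) / len(tokens)  # fr(w_t, d), Equation 2
            rf[i] += fr * rs_score              # weighted by RS(q, d)

    # Repetition penalty RP (Equation 3): zero out already-produced tokens.
    rp = np.array([0.0 if w in decoded_so_far else 1.0 for w in vocab])

    # Interpolate and renormalize (Equation 1; the division plays the role of Z).
    scores = lam * m_probs + (1.0 - lam) * rf * rp
    return scores / scores.sum()
```

At each time step the highest-scoring token of the resulting distribution can be emitted, or the distribution can be passed to a standard beam search.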
EXPERIMENTAL SETUP

In this section, we describe the collections of code, the data preprocessing, and our evaluation metrics.

Django. This dataset [10] was produced by a single engineer tasked to annotate the entire Django source code line by line (18k+ lines). The original aim of the dataset was to map from code to pseudo-code. This leads to relatively detailed descriptions of each line which map to code.
Figure 2: Predicted samples from the Relevance Transformer on the Django and CoNaLa datasets.

Django sample:
Desc : description(COPY) is a string "The '%s' function"(COPY) replaced by value of receiver(COPY) . __name__ .
Truth: description(COPY) = "The '%s' function"(COPY) % receiver(COPY) . __name__
Pred : description(COPY) = "The '%s' function"(COPY) % receiver(COPY)
BLEU : 0.67

CoNaLa sample:
Desc : split string ` input ` based on occurrences of regex pattern '[ ](?=[A-Z]+\\b)'(COPY)
Truth: re . split ( '[ ](?=[A-Z]+\\b)'(COPY) , input )
Pred : re . split ( '[ ](?=[A-Z]+\\b)'(COPY) , input )
BLEU : 1.0
Hearthstone. The dataset consists of 665 samples, each sourced from the cards of the game. A card consists of a name, description, and several key statistics. These fields form the whole of the English description. The code consists of the associated Python source code from the game files. In contrast to the other datasets, Hearthstone consists of much longer sequences of approximately 400 tokens. However, many of these sequences have similar boilerplate Python code.
CoNaLa. This dataset is sourced from Stack Overflow questions and answers. It consists of over 2k hand-written short answers to programming questions. These are high-quality code-description pairs; however, the dataset size is limited. The authors provide an additional automatically annotated set of 600k+ pairs. During evaluation of the automatically annotated set, we deemed it too noisy for our task and decided to use only the 2k hand-written pairs.
Data preprocessing. Our training samples consist of two parallel languages: English and code. We process our samples into a common vocabulary set by tokenizing on spaces and specific code identifiers. This kind of tokenization is equivalent to that of ReCode [4] and preserves strings, variable names, and function identifiers as individual tokens. A unified vocabulary is especially important since common tokens shared between input and output sequences allow for copying. We assign each out-of-vocabulary token shared between the sequences a generic positional token; this gives the model the flexibility to copy potentially unseen relevant tokens to the output based on context. As such, our vocabulary size is comparatively small at under 1k tokens, while still allowing rare tokens to be predicted. A sketch of this scheme follows.
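The following is an illustrative sketch of the positional out-of-vocabulary scheme (not the authors' code; the UNK_i placeholder format and the function name are assumptions):

```python
def encode_pair(desc_tokens, code_tokens, vocab):
    """Map tokens outside the shared ~1k-token vocabulary to generic
    positional placeholders (UNK_0, UNK_1, ...). Because the same
    placeholder is used in both the description and the code, the copy
    mechanism can move tokens never seen in training into the output."""
    oov_ids = {}

    def encode(tokens):
        out = []
        for tok in tokens:
            if tok in vocab:
                out.append(tok)
            else:
                # First occurrence of an OOV token gets the next placeholder;
                # repeats reuse it so alignment holds across both sequences.
                if tok not in oov_ids:
                    oov_ids[tok] = f"UNK_{len(oov_ids)}"
                out.append(oov_ids[tok])
        return out

    return encode(desc_tokens), encode(code_tokens), oov_ids
```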
Table 1: BLEU scores of retrieval and generative methods on the Django, Hearthstone, and CoNaLa datasets.

Method                                  Django   Hearthstone   CoNaLa
Retrieval Methods
  BM25 (fine-tuned baseline)             43.1       59.5        13.2
  ReCode sequence similarity             43.4       65.1        11.2
  Oracle retrieval similarity            58.1       74.2        38.0
Generative Methods
  Seq2Seq LSTM                           58.9       60.4         —
  Latent predictor networks [8]           —          —           —
  Transformer baseline [15]              79.2       72.5        17.5
  Transformer + Copy                     81.8       74.0        20.8
  Transformer + Copy + Naïve Retrieval   80.7       60.1        19.0
  Relevance Transformer                   —          —           —
Evaluation. BLEU is a standard metric in the field of code generation [4, 8]. We follow this standard and use the BLEU implementation from ReCode [4] to evaluate the quality of our model's output. The score for each pair is averaged to give an overall BLEU score for the dataset. We also test for significance with a paired t-test and apply Bonferroni corrections where applicable.
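As a rough illustration of this per-pair averaging (using NLTK's sentence-level BLEU as a stand-in for the ReCode implementation the paper actually uses):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def average_bleu(references, predictions):
    """Average sentence-level BLEU over all pairs; references and
    predictions are lists of token lists."""
    smooth = SmoothingFunction().method3  # avoid zero scores on short snippets
    scores = [
        sentence_bleu([ref], pred, smoothing_function=smooth)
        for ref, pred in zip(references, predictions)
    ]
    return sum(scores) / len(scores)
```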
RESULTS

In this section, we examine the results of our experiments on the three collections. Table 1 is divided into retrieval and generative methods. Despite being simple, retrieval methods are strong baselines in a code setting. Code repetition and similar patterns, such as in Hearthstone, lead to high sequence similarity despite only being able to retrieve code from the training set. We test an oracle method by taking the highest-scoring retrieved snippet according to BLEU, setting an upper bound on the effectiveness of these methods.

In the generative methods section, we first outline the state-of-the-art non-AST methods for each of the datasets. The base Transformer [15] model is used as a baseline for comparison. We note that the base Transformer model is already very effective at this task, surpassing the previously stated results. Following this, the naïve retrieval method is tested, which concatenates the top code document to the input and uses our copy mechanism. Our experiments show that the more complex input reduces overall effectiveness. In contrast, the Relevance Transformer comprises both relevance feedback and a copy mechanism and shows statistically significant improvements over the base Transformer at a 95% confidence interval for Django and CoNaLa. Hearthstone's 66 test samples give inconclusive but suggestive results. Following a closer inspection of the decoded results, the effectiveness increase for CoNaLa suggests pseudo-relevance feedback is particularly useful at boosting low-scoring sequences by providing a starting point of potentially useful terms for the model.

In Figure 1, we show how our Relevance Transformer plays a key role in emphasising words that are likely to be in the target sequence. In that example, the Transformer on its own predicts the next token in the sequence to be 'groupby'. This token is still relevant in the context, but it is not the correct prediction. The pseudo-relevance feedback corrects this by emphasising common tokens from the top retrieved documents and results in the production of the correct token, 'objects'.
CONCLUSION

In this work, we study the challenging task of code generation. We introduce the Relevance Transformer, a model that leverages external knowledge from pseudo-relevance feedback to increase translation quality and diversity. It uses feedback results at inference time with a copy mechanism to improve over the baseline Transformer and achieves state-of-the-art results on three standard code datasets. Our approach is general, and our results demonstrate that incorporating knowledge from retrieval can provide a significant benefit to generative models, in code generation and potentially in other domains as well.
ACKNOWLEDGEMENTS
We thank Iain Mackie for his contributions during development.
REFERENCES
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[2] Jianfeng Gao, Jian-Yun Nie, Endong Xun, Jian Zhang, Ming Zhou, and Changning Huang. 2001. Improving query translation for cross-language information retrieval using statistical models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 96–104.
[3] Tatsunori B. Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S. Liang. 2018. A retrieve-and-edit framework for predicting structured outputs. In Advances in Neural Information Processing Systems. 10052–10062.
[4] Shirley Anugrah Hayati, Raphael Olivier, Pravalika Avvaru, Pengcheng Yin, Anthony Tomasic, and Graham Neubig. 2018. Retrieval-based neural code generation. arXiv preprint arXiv:1808.10025 (2018), 925–930.
[5] David A. Hull and Gregory Grefenstette. 1996. Querying across languages: a dictionary-based approach to multilingual information retrieval. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 49–57.
[6] Jun-Jang Jeng and Betty H.C. Cheng. 1993. Using formal methods to construct a software component library. In European Software Engineering Conference. Springer, 397–417.
[7] Victor Lavrenko and W. Bruce Croft. 2017. Relevance-based language models. In ACM SIGIR Forum, Vol. 51. ACM, New York, NY, USA, 260–267.
[8] Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, and Phil Blunsom. 2016. Latent predictor networks for code generation. arXiv preprint arXiv:1603.06744 (2016).
[9] Gilad Mishne, Maarten de Rijke, et al. 2004. Source code retrieval using conceptual similarity. In RIAO, Vol. 4. Citeseer, 539–554.
[10] Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 574–584.
[11] Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract syntax networks for code generation and semantic parsing. arXiv preprint arXiv:1704.07535 (2017).
[12] Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368 (2017).
[13] Renuka Sindhgatta. 2006. Using an information retrieval system to retrieve source code samples. In Proceedings of the 28th International Conference on Software Engineering. 905–908.
[14] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[16] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems. 2692–2700.
[17] Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from Stack Overflow. In 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR). IEEE, 476–486.
[18] Jingyi Zhang, Masao Utiyama, Eiichiro Sumita, Graham Neubig, and Satoshi Nakamura. 2018. Guiding neural machine translation with retrieved translation pieces. arXiv preprint arXiv:1804.02559 (2018).