Automatic Code Generation using Pre-Trained Language Models
Luis Perez
Department of Computer Science, Stanford University
[email protected]

Lizi Ottens
Department of Computer Science, Stanford University
[email protected]

Sudharshan Viswanathan
Department of Computer Science, Stanford University
[email protected]
Abstract
Recent advancements in natural language processing [1] [2] have led to near-human performance on multiple natural language tasks. In this paper, we seek to understand whether similar techniques can be applied to a highly structured environment with strict syntax rules. Specifically, we propose an end-to-end machine learning model for code generation in the Python language built on top of pre-trained language models. We demonstrate that a fine-tuned model can perform well on code generation tasks, achieving a BLEU score of 0.22, an improvement of 46% over a reasonable sequence-to-sequence baseline. All results and related code used for training and data processing are available on GitHub at https://github.com/kandluis/code-gen.
1. Introduction
Automating even small parts of software development is an active research area [3], with multiple proposed approaches (see Section 2). Succeeding in the automation of even small tasks can save time for countless software engineers, which translates to saved resources across multiple industries. Furthermore, as software continues to eat the world (in Marc Andreessen's phrase) and demand for experienced software developers continues to outpace supply, automatic code generation will become increasingly important.

In this paper, we propose a machine learning model to automate the task of writing code by assisting developers in writing individual units of functionality (or "functions"). Automating code generation can take on many forms, from auto-completing lines of source code to generating lines of source code from comments, generating source code from UI images, or generating unit tests from source code. In this project, we aim to take the initial lines of code (a function signature) along with a doc-string (function documentation) and generate the corresponding function body. To do this, we use a pre-trained language model and fine-tune it on a canonical corpus of Python code scraped from GitHub [4].
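To make the task concrete, a hypothetical training pair is shown below (the function and its body are illustrative, not drawn from the dataset): the model receives the signature and docstring, and must generate the body.

# Model input: a function signature plus its docstring.
input_context = (
    "def moving_average(values, window):\n"
    '    """Compute the moving average of `values` over a fixed window."""\n'
)

# Target output: the corresponding function body.
target_body = (
    "    result = []\n"
    "    for i in range(len(values) - window + 1):\n"
    "        result.append(sum(values[i:i + window]) / window)\n"
    "    return result\n"
)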
2. Background
A primary challenge in code generation is that it remains an active area of research, with many possible solutions and ongoing investigation [5]. State-of-the-art solutions have not yet come close to automating the basic tasks software engineers perform on a daily basis.
2.1. Token Completion via Static Analysis

The most traditional and well-known approach, used by multiple IDEs across a range of languages, consists simply of token completion based on structured information obtained from static analysis of the code. For example, when a developer types a sequence of characters, the system attempts to find near-matching strings corresponding to function definitions and proposes completing these function calls. Similarly, for object methods, upon typing of the accessor token (such as "->" or "."), the IDE proposes auto-completing the different methods belonging to the object. The biggest drawback of these approaches is that they lack true understanding of the programmer's intent, and also lack context about the surrounding code beyond the heuristics hand-crafted by the tool's developers.

2.2. Using Machine Learning for Code Search

Another approach taken in multiple papers in the literature [4] involves framing the problem as a code search problem. Rather than trying to generate code or complete the code the developer is writing, we can re-frame the problem as one of searching for relevant pre-existing snippets. This is the primary approach we take in three of our baseline models.
Other, more novel approaches from the literature [5] are typically applied to restricted language domains and add substantial complexity in modeling and evaluation. Specifically, while pre-trained models are trained on free-form language data, programming languages often utilize non-natural variable names, function names, and syntax with far more structure [5]. Work in this area has focused on creating more structured models that take advantage of specific architectures [6]. In [7], the authors first decompose the input sequence of text tokens for the context into a tree-like structure. Other approaches restrict the output of the model to a context-free grammar (CFG) or domain-specific language (DSL) [8]; a code generation model's output must adhere to a very specific form in order to be syntactically correct.

In this paper, we instead take a different approach. Motivated by the ever-increasing sizes and capabilities of language models, we focus on improving performance on the code prediction task by making use of pre-trained language models that are then fine-tuned on code.
In this project, we leverage the CodeSearchNet dataset [4]. The dataset consists of 2 million (comment, code) pairs from open source libraries, spanning languages from Python to JavaScript, PHP, Java, Go, and Ruby. Median code length is between 60 and 100 tokens, with a 95th-percentile code length of up to 350 tokens; median documentation length is 10 tokens. The distributions of methods and (comment, code) pairs across programming languages are visualized in Figure 3.

We restrict our dataset to samples in the Python programming language rather than the others available. Focusing on Python leaves over 1M methods and approximately 500k (comment, code) pairs in our dataset. We make this decision for both practical and modeling reasons. From a practical perspective, restricting to a reasonably-sized dataset focused on a single language domain permits more thorough ablation studies. From a modeling perspective, we believe that transfer learning from natural language to a single programming language such as Python is an easier task to accomplish.
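As a rough illustration of the data handling involved, the sketch below loads (docstring, code) pairs from one shard of the dataset. It assumes the gzipped JSON-lines layout of the public CodeSearchNet release, whose records carry `docstring`, `code`, and `language` fields; the project's actual pipeline may differ.

import gzip
import json

def load_pairs(shard_path):
    """Load (docstring, code) pairs from a CodeSearchNet .jsonl.gz shard."""
    pairs = []
    with gzip.open(shard_path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Keep only Python functions that actually have documentation.
            if record.get("language") == "python" and record.get("docstring"):
                pairs.append((record["docstring"], record["code"]))
    return pairs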
3. Methodology
In this section, we explain our methodology for the multiple experiments and baselines proposed, as well as details of the training data and its distribution.
Figure 4 shows the general architecture of the baseline models from the CodeSearchNet task. We successfully trained and evaluated two baselines: a Neural Bag-of-Words model and an RNN-based model. See Section 4. Generally speaking, the baseline models take as input examples of (comment, code) pairs and learn to retrieve a specific code snippet. Each programming language has its own encoder network (see the three columns to the right in Figure 4), tasked with encoding a set of candidate code snippets. These encodings are then combined, through a dot-product operation with the embedding generated by the query (docstring) encoder, into a matrix of pairwise similarity scores. The matrix diagonal holds the score of each correctly paired docstring/code snippet. Through this methodology, the baseline models are able to extract meaningful information and learn a joint embedding over query/code pairs. We train these models as baselines since we believe they will be useful in the downstream task of code generation. The models are trained on the following loss function:

-\sum_{i}^{N} \log\left( \frac{\exp\left(E_c(c_i)^\top E_q(d_i)\right)}{\sum_j \exp\left(E_c(c_j)^\top E_q(d_i)\right)} \right)   (1)

The above baseline is useful only in the sense that it would allow our system to find pre-existing code snippets which might be relevant to the developer. Since our goal is rather to generate novel code, we propose a different baseline based on a more traditional sequence-to-sequence model. In this case, we use a traditional RNN architecture which takes individual characters as input. We take this approach to circumvent the need to learn word-level embeddings. Furthermore, we hypothesize that reusing word-level vocabularies from NLP models would actually harm the performance of the model for code generation, primarily because most of the syntax involved in writing code does not map directly to the English language. Concretely, we encode each character present in the training data as a 1-of-k (one-hot) vector and feed the characters into an RNN one at a time. The output is a k-dimensional vector corresponding to a probability distribution over the entire set of characters.

For the model architecture, we sweep over multiple types of RNN cells, including LSTM, vanilla RNN, and GRU. We find the best-performing model to be LSTM-based, with two hidden layers in the internal RNN cell. Training uses fixed-length character sequences sampled at random from our input code: given the characters from position i to i + 50, the model is trained to predict the characters from i + 1 to i + 51. This means we have a many-to-many sequence model (see Figure 6.2.1). We sweep over batch sizes of 64 and 128 and train for a total of 20 epochs (see Table 4.1). To avoid issues with gradient explosion and to stabilize training, we make liberal use of gradient clipping; in particular, we clip all gradients to a fixed absolute threshold. We sweep over learning rates and find that a starter learning rate of 0.0002 with an exponentially decaying schedule (decayed once per epoch) appears to perform best as measured on a held-out validation set. We also experiment with dropout, but find little impact on final performance.
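The character-level baseline above can be sketched as follows in PyTorch. This is a minimal illustration rather than the exact training code: the CharRNN class, the hidden size of 512, the clipping threshold of 5.0, and the Adam optimizer are all illustrative assumptions rather than the paper's exact settings.

import torch
import torch.nn as nn

class CharRNN(nn.Module):
    """Character-level LSTM: one-hot inputs, distribution over the next character."""

    def __init__(self, vocab_size, hidden_size=512, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(vocab_size, hidden_size, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, one_hot_seq, state=None):
        out, state = self.lstm(one_hot_seq, state)
        return self.proj(out), state  # logits over the character vocabulary

# One training step: given characters i..i+50, predict characters i+1..i+51.
vocab_size, seq_len, batch = 128, 50, 32
model = CharRNN(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
loss_fn = nn.CrossEntropyLoss()

idx = torch.randint(vocab_size, (batch, seq_len + 1))  # stand-in character ids
one_hot = nn.functional.one_hot(idx, vocab_size).float()

optimizer.zero_grad()
logits, _ = model(one_hot[:, :-1])  # inputs: positions 0..seq_len-1
loss = loss_fn(logits.reshape(-1, vocab_size), idx[:, 1:].reshape(-1))
loss.backward()
nn.utils.clip_grad_value_(model.parameters(), 5.0)  # clip by absolute value
optimizer.step()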
Our final approach relies on the use of pre-trained language models. We fine-tune our code generator using the small GPT-2 model with 117 million parameters. Using such a large backbone and continuing to fine-tune allows us to generate synthetic code samples of even higher quality, treating programming languages as just another specific domain alongside the encyclopedia articles, news, and books the backbone was trained on.
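A condensed sketch of this fine-tuning setup, assuming the HuggingFace transformers implementation of GPT-2 (an assumption about tooling; the paper does not name its GPT-2 codebase); the learning rate and the toy training string are illustrative.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # the 117M "small" model
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # illustrative rate

# A toy (signature + docstring + body) training string; the real pipeline
# feeds batched CodeSearchNet functions instead.
example = 'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b\n'
batch = tokenizer(example, return_tensors="pt")

# Standard causal language-modeling objective: passing labels=input_ids makes
# the model compute the shifted next-token cross-entropy internally.
optimizer.zero_grad()
loss = model(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()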
Figure 1. Decoder-Only Architecture used by GPT-2.
The general architecture of the GPT-2 model is a sequence-to-sequence predictive task based on the transformer architecture [9] [1]. However, it consists solely of the 12-layer decoder, as visualized in Figure 1. Each layer has 12 independent attention heads, leading to 144 distinct attention patterns. By making use of an attention-based framework, the model is more adept at dealing with long-range dependencies, because the attention mechanism allows the model to focus on the encoding of any of the input sequence tokens.
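For concreteness, the architecture just described corresponds to the following configuration object in the HuggingFace transformers library (again an assumption about tooling):

from transformers import GPT2Config

# Small GPT-2: a 12-layer, decoder-only transformer with 12 attention heads
# per layer and 768-dimensional embeddings.
config = GPT2Config(n_layer=12, n_head=12, n_embd=768)
print(config.n_layer * config.n_head)  # 144 distinct attention patterns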
4. Results
CodeSearchNet provides a good starting point, as we are able to train different models on the input code streams. We trained a simple LSTM model as well as a neural bag-of-words model on a combination of all the available (code, documentation) pairs. For details on these simple baselines, please see Appendix Section 6.1.
As both of the above baselines focus on understanding and extracting useful embeddings for our overall task, our primary baseline consists of a straightforward sequence-to-sequence model. Given that code typically does not consist of English words and can instead have quite varied syntax, our baseline model uses character-level embeddings, making it character-aware [10].

Due to computational constraints, we train only on the Python subset of the data, and only on 10% of the total data available. For the char-rnn model [10], this corresponds to around 50MB of raw text, or 78,357,395 characters with 1,618 distinct symbols. Figure 9 shows the training and validation losses of the model. The loss is a softmax loss over the 1,618 characters for a sequence of length 128 (the model is trained on sequences of length 128 by default). Figure 10 shows the perplexity, the exponential of the per-character cross-entropy loss.

We include a sample generated from the best-performing model for reference (see Appendix Section 6.2). A hyper-parameter sweep over learning rate and batch size for a total of 20 epochs yields the final measured performance shown in Table 4.1.
Table 4.1. BLEU scores from the hyper-parameter sweep of the char-rnn baseline.

Batch Size   Starter Learning Rate   Regularization Weight   BLEU (Train)   BLEU (Eval)
64           0.02                    0.1                     0.022          0.012
64           0.02                    0.01                    0.023          0.019
64           0.002                   0.1                     0.034          0.028
64           0.002                   0.01                    0.037          0.012
64           0.0002                  0.1                     0.090          0.073
64           0.0002                  0.01                    0.094          0.014
128          0.02                    0.1                     0.024          0.021
128          0.02                    0.01                    0.021          0.013
128          0.002                   0.1                     0.033          0.029
128          0.002                   0.01                    0.038          0.011
128          0.0002                  0.1                     -              -
128          0.0002                  0.01                    0.113          0.034
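For reference, BLEU scores such as those in Table 4.1 can be computed with an off-the-shelf implementation. The sketch below uses NLTK with smoothing; this is an assumption, as the paper does not specify which BLEU implementation it used.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical reference bodies and model generations, tokenized on whitespace;
# a real evaluation would tokenize code the same way the dataset does.
references = [["return a + b".split()], ["return x * y".split()]]
hypotheses = ["return a + b".split(), "return x + y".split()]

smooth = SmoothingFunction().method1  # avoids zero scores on short snippets
print(corpus_bleu(references, hypotheses, smoothing_function=smooth))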
We worked with the publicly available small GPT-2 model with 117 million parameters, training for 100,000 mini-batch iterations with a batch size of 2. We have included some sample code that our model generated directly in the report. Qualitatively, our model generates code which is far more reasonable than our baseline's. The generated code is novel, as verified by n-gram overlap analysis between the generated code and the training dataset. We also note that the model learns an appropriate understanding of Python syntax, with uses of if-statements, function and method calls, as well as regularly commented code. For full output, see Appendix Section 6.2.

We observed that byte-pair encoding, as used in GPT-2, is a much better strategy for generating code than using raw characters, while the size of the model itself also has a very observable effect on generating Python-like code. Overall, the GPT-2 model quickly achieves performance that is much better than the baseline, and continued training shows that our BLEU score continues to increase, as seen in Figure 2.
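The n-gram overlap analysis mentioned above can be sketched as follows; novelty() is a hypothetical helper, not the authors' exact script.

def ngrams(tokens, n):
    """Set of all contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(generated, training_snippets, n=4):
    """Fraction of the generated code's n-grams never seen during training."""
    train_ngrams = set()
    for snippet in training_snippets:
        train_ngrams |= ngrams(snippet.split(), n)
    gen_ngrams = ngrams(generated.split(), n)
    if not gen_ngrams:
        return 0.0
    return len(gen_ngrams - train_ngrams) / len(gen_ngrams)

# A score near 1.0 indicates the sample is not copied from the training data.
print(novelty("if x is None : return [ ]", ["return x", "if x : pass"]))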
Figure 2. BLEU Score During Training of GPT-2 Based Model for Python Code Generation
5. Conclusions
In this paper, we explore the problem of automatically completing a function given its function signature and human-readable documentation. We find the best-performing model to be a fine-tuned version of GPT-2, a transformer-based NLP model trained to generate natural text on an extremely large dataset. Despite the fact that our task focuses specifically on code rather than natural language, we hypothesize that the model is able to treat programming language as another specific domain alongside the encyclopedia articles, news, or books that its backbone has been trained on. We are able to achieve a BLEU score of 0.22, improving on our baseline by over 40%.
6. Contributions
All team members contributed equally to this project. Baselines from the CodeSearchNet models for code search were trained and tuned by Luis Perez and Sudharshan Viswanathan. Data analysis and understanding of the features (including histograms, distribution of tokens, and other data insights) was primarily performed by Lizi Ottens. Training of the baseline char-rnn model, as well as analysis of results and discussion, was contributed primarily by Luis Perez. Fine-tuning and training with the small and medium GPT-2 models was primarily explored and analyzed by Lizi Ottens and Sudharshan Viswanathan. All written submissions were co-written by all three authors.
References

[1] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
[3] Miltiadis Allamanis, Earl T. Barr, Premkumar T. Devanbu, and Charles A. Sutton. A survey of machine learning for big code and naturalness. CoRR, abs/1709.06182, 2017.
[4] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. 2019.
[5] Yasir Hussain, Zhiqiu Huang, Senzhang Wang, and Yu Zhou. CodeGRU: Context-aware deep learning with gated recurrent unit for source code modeling. CoRR, abs/1903.00884, 2019.
[6] Zeyu Sun, Qihao Zhu, Yingfei Xiong, Yican Sun, Lili Mou, and Lu Zhang. TreeGen: A tree-based transformer architecture for code generation. 2019.
[7] Xinyue Liu, Xiangnan Kong, Lei Liu, and Kuorong Chiang. TreeGAN: Syntax-aware sequence generation with generative adversarial networks. CoRR, abs/1808.07582, 2018.
[8] Zeyu Sun, Qihao Zhu, Lili Mou, Yingfei Xiong, Ge Li, and Lu Zhang. A grammar-based structural CNN decoder for code generation. CoRR, abs/1811.06837, 2018.
[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
[10] Yoon Kim, Yacine Jernite, David A. Sontag, and Alexander M. Rush. Character-aware neural language models. CoRR, abs/1508.06615, 2015.
[11] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014.

Appendix and Figures
6.1. CodeSearchNet Baselines

The Neural Bag of Words and LSTM CodeSearchNet baselines both report metrics in the same fashion. Below, we show the training curves, which correspond to the loss in Equation (1). Additionally, given that the baseline models for CodeSearchNet focus on code snippet retrieval, we also report the achieved mean reciprocal rank (MRR). The MRR is a statistical measure for evaluating any process that produces a list of possible responses to a sample of queries, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer: 1 for first place, 1/2 for second place, 1/3 for third place, and so on. The mean reciprocal rank is the average of the reciprocal ranks over a sample of queries, as in Equation (2):

\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}   (2)

The Neural Bag of Words baseline consists of a simple encoder architecture which takes as input a bag-of-words representation of the code and, using a single neural network, encodes these token representations into an embedding [4]. This baseline actually performs the best, achieving the lowest overall training and validation losses (see Figure 5) as well as the highest MRR on the validation set (see Figure 6).
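Equation (2) translates directly into code; a minimal sketch:

def mean_reciprocal_rank(ranks):
    """Equation (2): average of 1/rank_i over all queries in the sample Q.

    `ranks` holds, for each query, the 1-indexed position of the correct
    code snippet in the model's ranked list of candidates.
    """
    return sum(1.0 / r for r in ranks) / len(ranks)

print(mean_reciprocal_rank([1, 2, 4]))  # (1 + 0.5 + 0.25) / 3 ≈ 0.583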
In the RNN-based baseline, we employ the GRU cell [11] to summarize the input sequence. This baseline performs significantly worse, suffering from what appears to be clear over-fitting: in Figure 7, while the training loss plateaus, the validation loss quickly begins climbing. Although this behavior does not appear to affect the overall MRR achieved on the validation set, the model still performs worse than the bag-of-words baseline, as shown in Figure 8.
6.2. Generated Code Samples

Listing 1. Sample generated using the char-rnn model.
Downloads Dailymotion videos by URL. (x or pillares (if macks), style as a bool to you
extner method instrux Error, too io instance of that Seirt to be two categorical String
to mandation :attr:`Columns serverName(zr, j)
dimensed source
also = ``axis``. Element. This reprdures a substimcle of """
code = item
return self.filename
restroxig, self.get_channels():
    """ Get the function as firm a plist {} and a :attr:acted coordint: value of Time
    endaes: target_ext = int(cmd_dict): Runnee network number of element. What you's
    associates pych of List does request. Channels represented should be a Target
    Except the more notes ``index``.
def vertevent_notify_channel(self, infna=1)
elif trying.format(comps + ',' % (random) + 1,
H.data:
if list + lookbing.get('get_dict.get'] }

Listing 2. Sample generated using the small (117M) GPT-2 model after 40,000 iterations.

input_snippet =
def semi_local_linear_trend_transition_matrix(autoregressive_coef):
    """Build the transition matrix for a semi-local linear trend model."""

output_code =
columns = np.asarray(columns)
if autocorrelated_trend_matrix:
    column = np.asarray(columns)
    i = column.min()
    j = column.max()
assert hasattr(column, "sigma")
assert hasattr(axes, "axis")
i = columns.copy()
j = columns.copy()
ax = ColumnSetAxes(i, j)
axes.set_axes(ax)
ax.set_xlabel(self.default_label)
ax.set_ylabel(self.default_ylabel)
axes.set_title(self.title)
axes.text(u"Subsim"]
self.get_index(axis=columns, cols=cols)
if not columns:
    if autocorrelated_trend_matrix:
        ax.columns = columns
    else:
        if i < or j < or i + 1 <= i

6.2.1. Figures