A Structural Model for Contextual Code Changes
Neural Edit Completion
SHAKED BRODY, Technion, Israel
URI ALON, Technion, Israel
ERAN YAHAV, Technion, Israel

We address the problem of predicting edit completions based on a learned model that was trained on past edits. Given a code snippet that is partially edited, our goal is to predict a completion of the edit for the rest of the snippet. We refer to this task as the EDIT COMPLETION task and present a novel approach for tackling it. The main idea is to directly represent structural edits. This allows us to model the likelihood of the edit itself, rather than learning the likelihood of the edited code. We represent an edit operation as a path in the program's Abstract Syntax Tree (AST), originating from the source of the edit to the target of the edit. Using this representation, we present a powerful and lightweight neural model for the EDIT COMPLETION task.

We conduct a thorough evaluation, comparing our approach to a variety of representation and modeling approaches that are driven by multiple strong models such as LSTMs, Transformers, and neural CRFs. Our experiments show that our model achieves a 28% relative gain over state-of-the-art sequential models and 2× higher accuracy than syntactic models that learn to generate the edited code instead of modeling the edits directly. We make our code, dataset, and trained models publicly available.

Software development is an evolutionary process. Programs are being maintained, refactored, fixed, and updated on a continuous basis. Program edits are therefore at the very core of software development. Poor edits can lead to bugs, security vulnerabilities, unreadable code, unexpected behavior, and more. The ability to suggest a good edit in code is therefore crucial.

We introduce the EDIT COMPLETION task: predict edit completions based on a learned model that was trained on past edits. Given a code snippet that is partially edited, our goal is to predict an edit completion that completes the edit for the rest of the snippet. The edit completion is represented technically as a sequence of edit operations that we refer to as an edit script.

Problem Definition.
Let P be a given program fragment and C be the surrounding context of P before any edits were applied. Let ∆C denote the edits that were applied to C, and C′ = ∆C(C) the resulting edited context. The goal in our EDIT COMPLETION task is to predict an edit function ∆P, such that applying ∆P to P results in the program fragment after the edit: ∆P(P) = P′. Our underlying assumption is that the distribution of edits in P can be inferred from the edits ∆C that occurred in its context. We thus model the probability Pr(∆P | ∆C). We present a new approach for representing and predicting ∆P in the EDIT COMPLETION task, named C³: Contextual Code Changes.

Motivating examples.
Consider the EDIT COMPLETION examples in Figure 1a and Figure 1b. These illustrate the significance of edits in the context C and how they can help in suggesting a likely edit for P. In Figure 1a, the edit in the context consists of changing the if statement predicate, resulting in a null check for the variable attack. After the edit in the context, the value of attack in P cannot be null. Therefore, the ternary statement that checks attack for nullness in P can be removed. Our model successfully predicted the needed edit ∆P, which is applied to P to yield P′. Figure 1b shows another example, in which the edit in the context is a modification of a function signature. In C′, the return type was changed to FileCharacteristics, and the output parameter fileCharacteristics of the function was removed. P consists of an assignment to the parameter fileCharacteristics, and a return statement with the value true. The edit in the context implies a necessary edit in P, in which the assignment statement has to be removed (since fileCharacteristics is no longer defined) and the return statement must include a variable of type FileCharacteristics. Our model successfully predicted the correct edit for P. P′ consists of returning an object of type FileCharacteristics.

Authors' addresses: Shaked Brody, Technion, Israel, [email protected]; Uri Alon, Technion, Israel, [email protected]; Eran Yahav, Technion, Israel, [email protected].

Fig. 1. Examples of EDIT COMPLETION. The input consists of a program fragment P and the edits that occurred in the context, which transformed C into C′. The output is ∆P – an edit script that describes the likely edit. Applying ∆P to P results in P′ – the code after the edit.

Input (∆C, transforming C into C′):
- if (self.isDisabled())
+ if (attack == null || attack.IsTraitDisabled)
      return false;
Output (∆P, transforming P into P′):
- var targetPos = attack != null ? attack.GetTargetPosition(pos, target) : target.CenterPosition;
+ var targetPos = attack.GetTargetPosition(pos, target);
(a) The predicate of the if statement in C was edited to include a null check for attack. Thus, in P, the check attack != null and the ternary operator can be removed.

Input (∆C, transforming C into C′):
- public override bool GetFileCharacteristics(out FileCharacteristics fileCharacteristics)
+ public override FileCharacteristics GetFileCharacteristics()
Output (∆P, transforming P into P′):
- fileCharacteristics = new FileCharacteristics(this.OpenTime, this.currentFileLength);
- return true;
+ return new FileCharacteristics(this.OpenTime, this.currentFileLength);
(b) The signature of GetFileCharacteristics in C was edited to return a FileCharacteristics object instead of modifying an output parameter. Thus, in P, the method should return a FileCharacteristics object instead of returning true.

+ var item = nodes.FirstOrDefault(x => x.name == n && x.Type == NodeType.Directory && x.ParentId == parent.Id);
- var item = nodes.Where(x => x.name == n && x.Type == NodeType.Directory && x.ParentId == parent.Id).FirstOrDefault();
(a)

+ INode parent = nodes.First(x => x.Type == NodeType.Root);
- INode parent = nodes.Where(x => x.Type == NodeType.Root).First();
(b)

Fig. 2. An example of two edits. These examples are different and the edits operate on different values. However, observing the structure of these edits reveals the similarity between them and allows a learning model to generalize better. This similarity is expressed as almost identical AST paths. For simplicity, only the program fragment that should be edited, P, is shown, without the context C.

Edit Completion vs. Code Completion.
It is important to note that EDIT COMPLETION and code completion are completely different tasks. The goal of code completion is to predict missing fragments of a program, given a partial program as context. In contrast, the goal of EDIT COMPLETION is to predict additional edits in a partial sequence of edit operations. That is, while code completion operates on code, EDIT COMPLETION operates on code edits.

Representing Code Edits.
The main design decision in learning code edits is how to represent the edit, i.e., how to represent the difference between the code in its original form and its desired, altered form. Naïvely, differencing of programs can be performed by treating code as text and using text-diff algorithms for line differencing [Hunt and McIlroy 1975] or inline differencing [Birney et al. 1996]. In contrast, we model the difference between the abstract syntax trees (ASTs) of the original and the edited code. This allows us to naturally use paths in the AST (AST paths) to model edits.
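As a point of contrast, the text-based baseline can be reproduced with Python's stdlib difflib, which implements line differencing in the spirit of the algorithms cited above. This sketch is for illustration only (the paper's corpus is C#, and the `pred` placeholder stands for a full lambda); note how the line diff reports whole lines as removed and added, with no notion of the syntactic move that actually happened:

```python
import difflib

# Line differencing ("code as text"): the whole line is reported as replaced,
# even though structurally only the Where(...) call was removed and its
# argument moved into FirstOrDefault.
before = ["var item = nodes.Where(pred).FirstOrDefault();"]
after = ["var item = nodes.FirstOrDefault(pred);"]

for line in difflib.unified_diff(before, after, lineterm=""):
    print(line)
```

An AST-based differencer, by comparison, can describe the same change as a handful of node moves and deletions, which is the representation this paper builds on.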
Our approach.
We present a novel approach for EDIT COMPLETION: predicting contextual code changes – C³. Code changes can be described as a sequence of edit operations, such as "move a node, along with its underlying subtree, to be a child of another node" or "update the value of a node to be identical to the value of another node". Such edit operations can be naturally represented as paths between the source node and the target node, along with the relationship between them and the edit command, i.e., "move" or "update". AST paths provide a natural way to express binary relationships between nodes (and thus subtrees) in the AST. We use AST paths to represent ∆C – the edits that occurred in the context and transformed C into C′, such that ∆C(C) = C′. We also use AST paths to represent ∆P – the edits that should be applied to P. We thus model the probability Pr(∆P | ∆C), where both the input ∆C and the output ∆P are represented as AST paths.

Representing edits as paths allows a learning model to generalize well across different examples. Consider the two examples in Figure 2. In Figure 2a, the edit modifies a series of LINQ calls – converting Where(...).FirstOrDefault() into FirstOrDefault(...); the edit in Figure 2b similarly converts Where(...).First() into First(...). Our use of AST paths allows the model to generalize across these edits well, although the edits are not identical and their predicates are different.

We apply a pointer network [Vinyals et al. 2015] to point to paths in the AST of P and create an edit operation sequence, i.e., an edit script. While prior work used AST paths to read programs and predict a label [Alon et al. 2019a,c], we generate an edit script by predicting AST paths, i.e., making AST paths the output of our model.
Previous approaches.
In related tasks, such as bug fixing and program repair, previous approaches have mostly represented code as a flat token stream [Chen et al. 2019; Tufano et al. 2018; Vasic et al. 2019]; although this allows using NLP models out-of-the-box, such models do not leverage the rich syntax of programming languages. Yin et al. [2019] suggested a system that learns to represent an edit and uses its representation to apply the edit to another code snippet. Although it sounds similar, the task that Yin et al. [2019] addressed and our task are dramatically different: Yin et al. [2019] addressed the (easier) variant that assumes the edit that needs to be applied is given as part of the input, in the form of "before" and "after" versions of another code with the same edit applied; their task is only to apply the given edit to a given code. Thus, in the task of Yin et al. [2019], the assumption is that ∆C = ∆P. In contrast, we do not assume that the edit ∆P is given; we condition on edits that occurred in the context (∆C), but these edits are different from the edits that need to be applied to P, and our model needs to predict the edit to P itself, i.e., predict what needs to be edited and how. Other work did use syntax but did not represent the structure of the edit itself. Dinella et al. [2020] proposed a model for detecting and fixing bugs using graph transformations, without considering context changes (i.e., ∆C = ∅). Their method can predict unary edit operations on the AST. In contrast, in our work we predict binary edit operations; thus, our representation is much more expressive. For example, consider the edit of moving a subtree: in our representation, this edit can be expressed as a single binary operation; with unary operations, the same edit requires multiple commands.

Modeling Code Likelihood vs. Modeling Edit Likelihood.
In general, there are two main learning approaches for learning to edit a given code snippet. Assume that we wish to model the probability of a code snippet Y given another code snippet X. Much prior work [Chen et al. 2019; Mesbah et al. 2019] has followed the approach of generating Y directly, attempting to model Y given X, thus modeling the probability Pr(Y | X). This approach is straightforward, but it requires modeling the likelihood of Y, which is a more difficult problem than necessary. In contrast, it can be much more effective to model the likelihood of the edit that transforms X into Y, without modeling the likelihood of Y itself, hence Pr(∆X→Y | X). Our modeling of the edit follows the latter approach: Pr(∆P | ∆C). In this work, we learn to predict the edit (∆P) that transforms P into P′, instead of predicting the entire program P′. By applying ∆P to P, generating P′ is straightforward: ∆P(P) = P′. Learning to predict the edit instead of the edited code makes our learning task much easier and provides much higher accuracy, as we show in Section 6.

We show the effectiveness of C³ on EDIT COMPLETION on a new dataset, scraped from over 300,000 commits from GitHub. Our approach significantly outperforms textual and syntactic approaches that either model the code or model only the edit, and that are driven by strong neural models.
Contributions.
The main contributions of this paper are:
• We introduce the EDIT COMPLETION task: given a program P and edits that occurred in its context, predict the likely edits that should be applied to P.
• C³ – a novel approach for representing and predicting contextual edits in code. This is the first approach that represents structural edits directly.
• Our technique directly captures the relationships between subtrees that change in an edit using paths in the AST. The output of our technique is an edit script that is executed to edit the program fragment P.
• A prototype implementation of our approach, called C³PO, for Contextual Code Changes via Path Operations. C³PO is implemented using a strong neural model that predicts the likely edit by pointing to an AST path that reflects that edit.
• A new EDIT COMPLETION dataset of source code edits and their surrounding context edits, scraped from over 300,000 commits from GitHub.
• An extensive empirical evaluation that compares our approach to a variety of representation and modeling approaches that are driven by strong models such as LSTMs, Transformers, and neural CRFs. Our evaluation shows that our model achieves a 28% relative gain over state-of-the-art sequential models, and over 2× higher accuracy than syntactic models that do not model edits directly.
• A thorough ablation study that examines the contribution of syntactic and textual representations in different components of our model.
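To make the expected output format concrete before the worked example: an edit completion is an edit script – a sequence of primitive operations that, applied to P, yields P′. The sketch below is illustrative only; the record layout and the toy line-based encoding of a fragment are ours, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EditOp:
    kind: str    # one of "MOV", "DEL", "UPD", "INS"
    source: str  # source AST node of the operation
    target: str  # target AST node (or the artificial DEL node)

def apply_edit(fragment: list[str], delta: list[EditOp]) -> list[str]:
    """Toy Delta_P(P) = P': a fragment is modeled as a list of lines, and only
    DEL operations (naming a line to drop) are interpreted."""
    deleted = {op.source for op in delta if op.kind == "DEL"}
    return [line for line in fragment if line not in deleted]

p = ["var check = attack != null;", "use(check);"]
delta_p = [EditOp("DEL", "var check = attack != null;", "<DEL>")]
print(apply_edit(p, delta_p))  # the edited fragment P'
```

The paper's real operations act on AST nodes rather than lines; the point of the sketch is only that predicting delta_p is a much smaller output space than generating the edited fragment from scratch.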
In this section, we demonstrate our approach using a simple EDIT COMPLETION example. The main idea is to represent all valid edit operations in P as AST paths, and to predict a sequence of these paths. Since every path is associated with an edit operation, by pointing to a sequence of paths we, in fact, predict an edit script.

High-level overview. Consider the edit that occurred in the context in Figure 3a – the insertion of a new definition of the method AddNavigation, which overloads previous definitions:

+ public virtual Navigation AddNavigation(string name, ForeignKey foreignKey, bool pointsToPrincipal)

After applying this edit, it is possible to use this new signature when calling AddNavigation. Consider the original code snippet P at the top of Figure 3e. The edit in the context allows simplifying the call to AddNavigation using the new signature, as shown in the edited code snippet P′ at the bottom of Figure 3e:

+ productType.AddNavigation("FeaturedProductCategory", featuredProductFk, pointsToPrincipal: false);
- productType.AddNavigation(new Navigation(featuredProductFk, "FeaturedProductCategory", pointsToPrincipal: false));

Fig. 3. An EDIT COMPLETION example from our test set. Figure 3a shows the edit that transforms C into C′ – overloading the function AddNavigation. Figure 3e shows P and P′ as code in red and green, respectively. Figure 3b depicts the partial AST and the first three edit operations of the edit. Figure 3c shows the AST after applying the first three operations, and shows the next three operations as AST paths. Figure 3d illustrates the AST after performing all operations, resulting in an AST that corresponds to P′. Every edit operation is represented by an AST path having the same color and number as the edit command. Dotted contours mark subtrees that will be affected by applying these operations; already-affected subtrees are surrounded by dashed contours.

Consider the partial AST of P in Figure 3b. The desired edit can be described as an edit script consisting of six edit operations on the AST of P. Consider the first operation: ① MOV. The meaning of this operation is to move the node Expr, with its subtree, to be the leftmost child of the node Unit. This edit operation can be represented by the red ① path: Expr → Arg → ArgList → Call → Expr → Unit. Note how this path directly captures the syntactic relationship between the node Expr and the node Unit, allowing our model to predict a MOV operation as part of the edit script. In Figure 3c we can see the result of applying the first three operations – ① MOV, ② MOV, ③ MOV – which move subtrees to new locations in the tree. The last three commands are DEL operations, expressing the deletion of a node and its underlying subtree. These operations can be represented using paths as well. For instance, ④ DEL is represented by the green ④ path: Navigation → Call → Expr → Unit → DEL, where DEL is an artificial node that we add as a child of the AST's root. In Figure 3d we can see the AST after applying all six operations. After executing all six operations, our model produces P′, shown in Figure 3e.

Path Extraction.
To inform the model about the available edits to predict from, we parse the AST of P and extract all AST paths that represent valid edits. Every path can represent different edit "commands" that use the same path. For example, consider the blue ② path in Figure 3b: Name → Call → ArgList → Arg → Expr → Call. This path can represent a move operation – MOV, i.e., moving the node Name, with its subtree, to be the leftmost child of Call; alternatively, this path can represent an insertion operation – INS, i.e., copying Name with its subtree and inserting it as the leftmost child of Call. To distinguish between different edit operations that are represented using the same AST path, each path is encoded as a vector once, and projected into three vectors using different learned functions. Each resulting vector corresponds to a different kind of edit operation. For example, the orange ③ path in Figure 3b can represent either a "move" (MOV), "update" (UPD), or "insert" (INS) operation. In this case, this path was projected using the learned function that represents "move".
Edit Script Prediction.
We predict one edit operation at each step by pointing at a path and its associated operation, among the valid edit operations. This results in an edit script. For example, in Figure 3, our model finds that the red ① path with MOV is most likely to be the first operation. Then, given this edit, our model finds that the blue ② path with MOV is most likely to be the next operation, and so on, until we predict a special "end of sequence" (EOS) symbol.
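A greedy version of this pointing loop can be sketched as follows. The random vectors stand in for the learned path encodings and decoder states, and the masking of already-chosen candidates is a simplification of ours, not a mechanism claimed by the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
candidates = rng.normal(size=(5, d))  # encoded (path, operation) candidates
eos = rng.normal(size=d)              # artificial end-of-sequence symbol
options = np.vstack([candidates, eos])

def decode(max_steps=10):
    """Greedily point at one candidate per step until EOS is chosen."""
    script, query = [], rng.normal(size=d)
    for _ in range(max_steps):
        scores = options @ query      # one score per candidate (Eq. 1, simplified)
        scores[script] = -np.inf      # sketch-only: do not repeat a chosen edit
        choice = int(np.argmax(scores))
        if choice == len(candidates): # pointed at EOS: the script is complete
            break
        script.append(choice)
        query = options[choice]       # condition the next step on this edit
    return script

print(decode())  # indices of the predicted edit operations, in order
```

Each pointed-at index corresponds to one (path, operation) pair, so the returned list is a toy edit script.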
Modeling Code Likelihood vs. Modeling Edit Likelihood.
Modeling edits using AST paths provides an effective way to model only the difference between P and P′. For example, consider the red ① path that moves the subtree rooted at Expr from its original place to be the first child of Unit. To predict this edit, our model only needs to select the red ① path out of the available operations. In contrast, a model that attempts to generate P′ entirely [Chen et al. 2019] would need to generate the entire subtree from scratch in the new location.

Pairwise Edit Operations.
Most edit operations, such as "move" and "update", can be described as pairwise operations, having the "source" and the "target" locations as their two arguments. AST paths provide a natural way to represent pairwise relations, originating from the "source" location and reaching the "target" location through the shortest path between them in the tree. In contrast, prior work that used only unary edit operations, such as HOPPITY [Dinella et al. 2020], is limited to inserting each node individually, and thus uses multiple edit commands to express the ① MOV operation; our model represents this edit operation as a single AST path – the red ① path.

Key aspects.
The example in Figure 3 demonstrates several key aspects of our method:
• Edits applied to the context of P can provide useful information for the required edit to P.
• Pairwise edit operations can be naturally represented as AST paths.
• A neural model, trained on these paths, can generalize well to other programs, thanks to the direct modeling of code edits as paths.
• By pointing at the available edit operations, the task that the model addresses becomes choosing the most likely edit, rather than generating P′ from scratch, which significantly eases the learning task.

In this section, we provide the necessary background. First, in Section 3.1 we define abstract syntax trees (ASTs) and AST paths. In Section 3.2 we use these definitions to describe how to represent code edits using AST paths and perform AST differencing. Finally, in Section 3.3 and Section 3.4 we describe the concepts of attention and pointer networks, which are crucial components in our neural architecture (described in Section 5).
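For concreteness, Python's stdlib ast module displays the kind of tree that the definitions below formalize; the paper's corpus is C#, so this is illustrative only:

```python
import ast

# Nonterminal nodes (Module, Assign, Call, ...) have children; terminal
# positions carry values taken from the program text ('x', 'f', 1).
tree = ast.parse("x = f(1)")
print(ast.dump(tree))
```

The printed dump shows exactly the two ingredients Definition 3.1 separates: inner nodes typed by grammar nonterminals, and leaves mapped to values from the program.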
Given a programming language L and its grammar, we use V to denote the set of nonterminals and T to denote the set of terminals in the grammar. The Abstract Syntax Tree (AST) of a program can be constructed in the standard manner and is defined as follows:

Definition 3.1 (Abstract Syntax Tree). Given a program P written in a programming language L, its Abstract Syntax Tree A is the tuple (A, B, r, X, δ, φ), where A is the set of non-leaf nodes, such that the type of each n ∈ A belongs to V; B is the set of leaves, such that the type of each n ∈ B belongs to T; r ∈ A is the root of the tree; X is a set of values taken from P; δ : A → (A ∪ B)* is a function that maps nonterminal nodes to their children; and φ : B → X is a mapping that maps a terminal node to a value.

An AST path is simply a sequence of nodes in the AST. Formally:

Definition 3.2 (AST Path). Given an AST A = (A, B, r, X, δ, φ), an AST path is a sequence of nodes p = n_1, n_2, ..., n_k, where n_i ∈ A ∪ B, such that for every consecutive pair of nodes n_i and n_{i+1}, either n_i ∈ δ(n_{i+1}) or n_{i+1} ∈ δ(n_i). We follow Alon et al. [2018] and associate each node's child index with its type.

For example, consider the blue ② path in Figure 3b. The path starts at the node Name, goes up to its parent node Call, then goes down to its right-most child ArgList, and so on. AST paths are a natural way to describe relationships between nodes in the AST, and can serve as a general representation of relationships between elements in programs. For example, Alon et al. [2018, 2019c] used paths between leaves in the AST as a way to create an aggregated representation of the AST. In this work, we use AST paths to model relationships between arbitrary nodes in the tree (both terminals and nonterminals), to model the effect of edit operations.
An edit in a program can be represented as a sequence of operations on its AST. To compute the difference between two programs, we compute the difference between their two ASTs, using algorithms such as GumTree [Falleri et al. 2014]. Given two programs P and P′ along with their ASTs A and A′, GumTree outputs an edit script, consisting of instructions for how to change A to become A′. Each operation in the script is either MOV, DEL, UPD, or INS, and operates on one or two nodes.

Fig. 4. Example of AST edit operations. Figure 4a depicts the AST before the change. Figure 4b shows the result of a MOV operation – moving C to be the right sibling of D. Figure 4c shows the result of DEL – removing C. Figure 4d shows the result of UPD – updating C to Z. Figure 4e shows the result of INS – inserting E to be the right sibling of D.

The command MOV n_s, n_t stands for moving a subtree inside the AST. This operation takes the source node n_s to be moved, and the target node n_t, which will be the left sibling of n_s after the move. The command DEL n_s stands for removing the node n_s from the tree. We use the command UPD v, n_t to update the value of the node n_t to become v. Lastly, to represent insertion, we use INS n_s, n_t, where n_s is the root of a subtree to be inserted and n_t is the target node that will be the left sibling of n_s after the insertion. Figure 4 demonstrates all operations: Figure 4a illustrates the AST before the edits; Figure 4b shows the result of MOV C, D; Figure 4c depicts the command DEL C; Figure 4d shows the update of C to the value Z, i.e., UPD Z, C; Figure 4e illustrates the command INS E, D – the insertion of node E as a right sibling of D.

In general, AST differencing algorithms consist of two steps. The first step maps nodes from A to A′, where each node belongs to at most a single mapping, and mapped nodes share the same type. The second step uses the mapping and aims to produce a short edit script. The GumTree algorithm focuses on the first step of mapping, since there are known quadratic optimal algorithms [Chawathe et al. 1996] for the second step. GumTree [Falleri et al. 2014] breaks the mapping stage into three steps. The first step is a top-down algorithm that finds isomorphic subtrees across A and A′; the roots of these subtrees are called the anchors mapping. The second step is a bottom-up algorithm that seeks a containers mapping – node pairs among A and A′ such that their descendants share common anchors.
Finally, the last step seeks additional mappings between the descendants of the containers mapping pairs. Applying the GumTree algorithm for the mapping stage and using known techniques for producing the edit script results in an end-to-end efficient algorithm. The complexity of this algorithm is O(n²) in the worst case, where n is the number of nodes in the larger of A and A′, i.e., n = max(|A|, |A′|).

An attention mechanism computes a learned weighted average of some input vectors, given another input query vector. Usually, attention is used by a neural model to align elements from different modalities. For example, in neural machine translation (NMT) [Bahdanau et al. 2014], attention allows the model to "focus" on different words from the source language while predicting every word in the target language, by computing a different weighted average at every step. This ability has shown significant improvement across various tasks such as translation [Bahdanau et al. 2014; Luong et al. 2015; Vaswani et al. 2017], speech recognition [Chan et al. 2016], and code summarization and captioning [Alon et al. 2019a].

Formally, given a set of k vectors Z = z_1, z_2, ..., z_k ∈ R^d (usually, an encoding of the input of the model) and a query vector q ∈ R^d (usually, the hidden state of a decoder at a certain time step t), attention performs the following computation. The first step is computing a "score" for each input vector z_i. For example, Luong et al. [2015] use a learned matrix W_a ∈ R^(d×d) to compute the score s_i of the vector z_i:

    s_i = z_i · W_a · q^⊤    (1)

Next, all scores are normalized into a pseudo-probability using the softmax function:

    α_i = e^(s_i) / Σ_{j=1}^{k} e^(s_j)    (2)

where every normalized score is between zero and one, α_i ∈ [0, 1], and their sum is one: Σ_i α_i = 1. Then, a context vector is computed as a weighted average of the inputs z_1, z_2, ..., z_k, such that the weights are the computed attention weights α:

    c = Σ_{i=1}^{k} α_i · z_i

This dynamic weighted average can be computed iteratively at different prediction time steps t, producing different attention scores α^t and thus a different context vector c^t. This gives a decoder the ability to focus on different elements in the encoded inputs at each prediction step.

A pointer network [Vinyals et al. 2015] is a variant of the seq2seq paradigm [Sutskever et al. 2014], where the output sequence is a series of pointers to the encoded inputs, rather than a sequence from a separate vocabulary of symbols. This mechanism is especially useful when the output sequence is composed only of elements from the input, possibly permuted and repeated. For example, the problem of sorting a sequence of numbers can be naturally addressed using pointer networks: the input to the model is the unsorted sequence of numbers, and the output is the sorted sequence, where every output prediction is a pointer to an element in the input sequence.

Pointing can be performed similarly to attention: at each decoding step, Equation (1) and Equation (2) compute input scores, as in attention; then, the resulting normalized scores α_i can be used for classification over the encoded inputs, as the output probability of the model. Pointer networks and attention share almost the same implementation, but they are different in principle. Attention computes a dynamic average c^t at each decoding iteration. Then, c^t is used in the prediction at this time step, among a separate closed set of possible classes (for example, the possible classes can be the words in the target language). In a pointer network, on the other hand, the possible classes at each decoding step are the elements in the input sequence itself.

Another difference is that in pointer networks there is a label associated with each "pointing" step. Each "pointing" distribution α is directly supervised by computing a cross-entropy loss with a reference label. In other words, each pointing can be measured for its correctness, and the pointed-to input is either correct or incorrect. In contrast, attention is not directly supervised; the model's attention distribution α is internal to the model. The attention distribution α is usually neither "correct" nor "incorrect", because the attention is used for a follow-up prediction.

In the EDIT COMPLETION task that we consider in this work, the input contains multiple edit operations that occurred in the context, and the output is a series of edit operations that should be performed. The main challenge is: how do we represent edits in a learning model? We look for a representation that is expressive and generalizable. The representation should be expressive, such that different edits are reflected differently; this allows a model to consider the difference between examples. However, just representing every edit uniquely is not enough: the representation should also be generalizable, such that similar edits are reflected similarly; this allows a model to generalize better, even if the edit that should be predicted at test time does not look exactly like an edit that was observed at training time.

Representing edits using AST paths provides an expressive and generalizable solution. An edit operation, such as "move", can be represented as the path in the AST from the subtree that should be moved, up to its new destination.
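Equations (1) and (2) above translate directly into a few lines of NumPy; the random matrices below stand in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 4, 8
Z = rng.normal(size=(k, d))   # encoded inputs z_1..z_k
q = rng.normal(size=d)        # decoder state (query) at time step t
W_a = rng.normal(size=(d, d)) # learned scoring matrix (random stand-in)

s = Z @ W_a @ q                      # Eq. (1): s_i = z_i . W_a . q^T
alpha = np.exp(s) / np.exp(s).sum()  # Eq. (2): softmax normalization
c = alpha @ Z                        # context vector: weighted average of inputs

print(alpha, c.shape)
```

Reusing alpha itself as the output distribution over the encoded inputs, instead of feeding c into a separate classifier, is exactly the pointer-network variant discussed above.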
This path includes the syntactic relation between the source and the target of the move. Different move operations result in different paths (and thus the representation is expressive), and similar moves result in similar paths (and thus the representation is generalizable). In this section, we explain how AST paths can naturally represent such edit operations.

We represent edit operations as follows:

(1) The MOV (move) operation has two arguments: the first is the source node, i.e., the root of the subtree to be moved, and the second is the target node. The meaning of "MOV n_s, n_t" is that node n_s moves to be the right sibling of node n_t. To support moving a node to be the leftmost child, we augment the AST with Placeholder nodes, which are always present as the leftmost child of every nonterminal node.

(2) The DEL (delete) operation has one argument: the root of the subtree to be deleted. We represent DEL as a path that originates from the root of the subtree to be deleted and ends in a special DEL target node that we artificially add as a child of the AST's root. In practice, we thus represent "DEL n_s" as "DEL n_s, n_DEL", where n_DEL is the DEL node.

(3) The UPD (update) operation has two arguments: the first argument is a node with a source value, and the second argument is a node whose value needs to be updated. For instance, if the value of node n_t needs to be updated to x, and the value of node n_s is x, we denote this by "UPD n_s, n_t".

(4) The INS (insert) operation has two arguments: the first argument is the root of the subtree to be copied, and the second is the target node. The operation "INS n_s, n_t" means that the subtree rooted at n_s should be copied and inserted as the right sibling of n_t. If n_s should be inserted as a leftmost child, the target node is the appropriate Placeholder node.

Since all four operations can be represented using two nodes n_s and n_t from the AST of P, the AST path from n_s to n_t is a natural way to represent an edit operation. Figure 5 demonstrates a MOV operation and its associated path representation. Figure 5a depicts the path Arg → ArgList → Arg, which can be associated with MOV and represents the operation "MOV Arg, Arg", i.e., moving the first argument to be the last. Figure 5b shows the AST after the movement.

To represent insertions (INS) and updates (UPD) in the context, which transformed C into C′, we augment the AST with additional UPD and INS nodes. To represent all update operations "UPD n_s, n_t", we add the necessary n_s nodes as children of UPD. For example, in Figure 6, there are two update operations that involve two source nodes, y and Bar. Thus, we add these nodes as children of UPD.
Fig. 5. An example of a path that represents a MOV operation. Figure 5a shows the path Arg → ArgList → Arg, which represents the edit of moving the first argument to be the last argument. A dotted contour surrounds the subtree that will be affected by applying the operation. Figure 5b shows the AST after applying the edit; the affected subtree is surrounded by a dashed contour.
Fig. 6. An example of UPD (update) and INS (insert) operations in the context C. The orange ① path represents that the node Foo has been updated to the value Bar. Similarly, the green ② path represents that the node x has been updated to y. The purple ③ path represents the insertion of the node Name along with its subtree.
We represent these update operations with paths that originate from the added nodes. The orange ① path, for instance, represents the update of Foo to become Bar. In the case of insertion of a new subtree, we represent the operation with a path that originates from INS and ends in the root of the inserted subtree. Consider the purple ③ path in Figure 6: this path represents that the subtree rooted at Name was inserted as the leftmost child of Type. We augment the AST with these additional UPD and INS nodes as children of the AST's root, along with the special
DEL node. These modifications allow us to represent any edit in the context. In this work, we focus on edits that can be represented as AST paths in P. Examples that require generating code from scratch call for other, more heavyweight code completion models, and are out of the scope of this paper.

In this section, we describe our model in detail. The main idea that guides the design of our model is to allow a neural model to consider multiple edits that occurred in the context (∆C), and to predict a single path operation that should be applied to P at every time step.

Fig. 7. A high-level overview of our architecture. On the left, the partial AST that represents the context C; the red paths represent the transformation from C to C′. On the right, the partial AST of P and its paths that represent possible valid predictions. The model attends to the paths that transform C to C′ in order to point to a path of P that corresponds to an edit operation.

The major challenge is: how to predict a single path operation? Classifying among a fixed vocabulary of path operations is combinatorially infeasible. Alternatively, decomposing the prediction of a path operation into a sequence of smaller atomic node predictions increases the chance of making a mistake, and can lead to predicting a path operation that is not even valid in the given example. We take a different approach: we encode all the path operations that are valid in a given example, and train the model to point to a single path operation among these valid operations. That is, in every example, the model predicts path operations among a different set of valid operations.

At a high level, our model reads the edits that occurred in the context and predicts edits that should be performed in the program. Since there might be multiple edits in the context, our model uses attention to compute a dynamic weighted average of them. To predict the edit in the program, our model enumerates all possible edits, expresses them as AST paths, and points at the most likely edit. Thus, the input of the model is a sequence of AST paths from the augmented C, and the output is a sequence of AST paths from P.
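As a concrete illustration of this pipeline, here is a toy numpy sketch of its main computations: embedding-based node and value encoding, attention over encoded context paths, and pointing at a candidate edit. All matrix names, sizes, and vocabularies here are our own illustrative assumptions; the real model additionally runs LSTMs over the paths (Section 5.2), which we omit for brevity:

```python
import re
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding size (the paper uses 64)

# Learned embedding matrices (randomly initialized here).
node_vocab = {"Call": 0, "ArgList": 1, "Arg": 2}
subtok_vocab = {"to": 0, "string": 1}
E_nodes = rng.normal(size=(len(node_vocab), D))
E_index = rng.normal(size=(4, D))        # child positions 0..3
E_subtokens = rng.normal(size=(len(subtok_vocab), D))

def encode_node(node_type, child_index):
    # encode_node(w) = E_index[i] + E_nodes[w]
    return E_index[child_index] + E_nodes[node_vocab[node_type]]

def encode_value(value):
    # encode_value(w) = sum of subtoken embeddings, e.g. toString -> to, string
    subtokens = [s.lower() for s in re.findall(r"[A-Z]?[a-z]+", value)]
    return sum(E_subtokens[subtok_vocab[s]] for s in subtokens)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Pretend these were produced by the path-level LSTMs:
Z_C = rng.normal(size=(3, D))    # encoded context-edit paths
Z_Op = rng.normal(size=(5, D))   # encoded candidate edit operations
W_a = rng.normal(size=(D, D))    # learned attention matrix
W_p = rng.normal(size=(D, D))    # learned pointing matrix
h_t = rng.normal(size=(D,))      # decoder state at time step t

alpha_t = softmax(Z_C @ W_a @ h_t)   # attention over context edits
c_t = alpha_t @ Z_C                  # weighted average of context vectors
y_t = softmax(Z_Op @ W_p @ c_t)      # pointing distribution over edits
predicted = int(np.argmax(y_t))      # index of the predicted edit operation
```

The pointing distribution y_t is computed over only the edits that are valid for the example at hand, which is what distinguishes pointing from classification over a fixed vocabulary.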
Our model is illustrated in Figure 7. It follows the encoder-decoder paradigm: the encoder encodes all valid paths of the input code (P) and the paths of the input context (transforming C to C′) into continuous vectors; the decoder generates an edit sequence by pointing into the set of paths of P while attending to the paths of C and C′. First, to consider edits that occurred in the context, our model encodes the sequence of context paths that transformed C into C′ as a set of vectors. Then, the model performs a series of predictions, where each prediction is an edit that should be applied to P. At each prediction step, the model attends (as explained in Section 3.3) to the context paths. Using the resulting attention vector, the model points (as explained in Section 3.4) to a single path in P. The path in P that the model points to is translated to the edit that should be applied in this step. In the next step, the chosen edit from the previous step is used to compute the attention query.

An edit operation that occurred in the context can be represented as an AST path (Section 4). We denote the sequence of paths that represent the edits in the context as ∆C = Paths(C, C′). The edit function that should be predicted is also represented as AST paths, where each path is associated with an edit operation. We denote the sequence of AST paths that represent the edits that should be predicted as ∆P = Paths(P, P′); we use these vectors as the classes to predict from. Using the above notations, we model the conditional probability Pr(∆P | ∆C).

Given a sequence of paths Paths(C, C′), we encode all paths using a Path Encoder (Section 5.2.1). Then, since the context paths form a sequence with a meaningful order, they go through an LSTM [Hochreiter and Schmidhuber 1997], resulting in the sequence of vectors Z_C. We enumerate all valid edits that can be applied to P, denote this set as Paths(P, P), and encode these paths using the Path Encoder (Section 5.2.1), which results in the set of vectors Z_P. Every path in Paths(P, P) can represent different edit operations (Section 5.2.2), e.g., both "update" and "move". Thus, every path vector z ∈ Z_P is projected to represent the different edit operations, resulting in the set of vectors Z_Op, which represents the set of classes that the model can predict from.

Given a set of AST paths, our goal is to create a vector representation z_i for each path v_1 ... v_l. The vocabulary of AST nodes is limited to a fixed-size vocabulary determined by the grammar of the language. In contrast, the values of AST leaves correspond to the tokens in the textual representation of the program; the vocabulary of these tokens is therefore unbounded. To address the unbounded vocabulary of terminal values, we follow previous work [Allamanis et al. 2015, 2016; Alon et al. 2019a] and split these values into subtokens. For example, the value toString is split into to and string. We represent each path as a sequence of node types using an LSTM, and use subtoken embeddings to represent terminal values (the tokens).

Node Representation.
Each AST path is composed of nodes v_1, ..., v_l. Each node is taken from a limited vocabulary of symbols of the programming language. Terminal nodes also have a user-defined token value. Every node has an associated child index, i.e., its index among its sibling nodes [Alon et al. 2018]. We represent each node using a learned embedding matrix E_nodes and a learned embedding matrix E_index for its child indices. We sum the vector of the node type w with the vector of its child index i to represent the node:

    encode_node(w) = E_index_i + E_nodes_w

The first and the last node of an AST path may be terminals whose values are tokens in the code. We use a learned embedding matrix E_subtokens to represent each subtoken:

    encode_value(w) = Σ_{s ∈ split(w)} E_subtokens_s    (3)

where w is the value associated with a terminal node. Differently from code2vec [Alon et al. 2019c], our paths can originate from and end in nonterminal nodes as well.

Path Representation.
We encode the path v_1, ..., v_l by applying an LSTM:

    h_1, ..., h_l = LSTM_path(encode_node(v_1), ..., encode_node(v_l))

We concatenate the last state vector with the encodings of the values associated with the first and the last nodes of the path, and pass the result through a learned fully connected layer W_path and a nonlinearity:

    encode_path(v_1 ... v_l) = tanh(W_path · [h_l ; encode_value(φ(v_1)) ; encode_value(φ(v_l))])

where φ is the function that retrieves the value associated with a terminal node (Section 3.1). If v_1 or v_l is a nonterminal, and thus has no associated value, we encode the node itself instead of its value, i.e., encode_node(v_1) instead of encode_value(φ(v_1)).

To express the order of the context paths Paths(C, C′), we pass them through another LSTM:

    Z_C = LSTM_C(PathEncoder(Paths(C, C′)))    (4)

Applying the path encoder to Paths(P, P) results in Z_P:

    Z_P = PathEncoder(Paths(P, P))

To represent the different operations (i.e., MOV, UPD, INS) that share the same path z ∈ Z_P, we project z using different learned matrices W_MOV, W_UPD, W_INS:

    z_MOV = z · W_MOV        z_UPD = z · W_UPD        z_INS = z · W_INS

such that z_MOV, z_UPD, and z_INS are used for pointing at MOV, UPD, and INS edits that are all described by the same encoded path z. This creates our set of possible classes to point to:

    Z_Op = ⋃_{z ∈ Z_P} { z_MOV, z_UPD, z_INS }    (5)

We use Z_Op as the representations of the classes that our model outputs a distribution over.

The decoder generates an edit script given the outputs of the encoder. At each decoding time step, the decoder predicts a single edit operation by pointing to a single vector from Z_Op, while attending to the sequence of vectors Z_C. The decoder consists of three main components: an LSTM [Hochreiter and Schmidhuber 1997], attention [Bahdanau et al. 2014], and a pointer [Vinyals et al. 2015]. The decoder LSTM receives an input vector at each time step; it uses this input vector to update its internal state, and uses the updated state as the query for attention. Given the current state, we compute an attention vector over the vectors in Z_C, and use the resulting vector to point to a (prediction) vector in Z_Op. In the next time step, the input vector for the LSTM is the vector pointed to in the previous step. The initial hidden state of the LSTM is an elementwise average of the paths in Z_P and in Z_C.

Attention.
We employ attention as described in Section 3.3, where the query is h_t, the hidden state of the decoder LSTM at time step t. At each time step, we compute a scalar score for every vector z_i ∈ Z_C by a bilinear product of the context vector z_i, a learned matrix W_a, and h_t. We then normalize all scores with a softmax function to get the normalized weights α_t:

    α_t = softmax(Z_C · W_a · h_t^⊤)

We then compute a weighted average of Z_C to get the attention vector c_t:

    c_t = Σ_{z_i ∈ Z_C} α_i · z_i

Pointing.
Given the vector c_t, we compute a pointing score for each valid edit represented by some z ∈ Z_Op. The resulting scores are normalized using a softmax; these normalized scores constitute the model's output distribution. Concretely, we perform a bilinear product of every z ∈ Z_Op with a learned weight matrix W_p and c_t, which results in a scalar score for every valid prediction in Z_Op. We then apply a softmax, resulting in a distribution over the vectors in Z_Op:

    ŷ_t = softmax(Z_Op · W_p · c_t)    (6)

We use this distribution ŷ_t as the model's prediction at time step t. At training time, we train all learnable weights to maximize the log-likelihood [Rubinstein 1999] of ŷ_t according to the true label. At test time, we compute the argmax of ŷ_t to get the prediction: our model predicts the edit operation that corresponds to the element with the highest pointing score. The output of the decoder across time steps can be (unambiguously) translated to an edit script.

We implement our approach for EDIT COMPLETION in a neural model called C3PO, for Contextual Code Changes via Path Operations. The main contributions of our approach are (a) the syntactic representation of code edits, and (b) modeling the likelihood of code edits rather than the likelihood of the edited code. Thus, these are the main ideas that we wish to evaluate. We compare our model with baselines that represent each of the different paradigms (Table 1) on a new dataset. Our model shows significant performance improvement over the baselines.

We introduce a new EDIT COMPLETION dataset of code edits in C#. To make the task even more challenging, we filtered out examples in which (a) the edit in P consists of only DEL operations, or (b) both P and its context contain only UPD operations such that all updates in P are included in the updates of C, since these usually reflect simple renamings that are easily predicted by modern IDEs.
Following recent work on the adverse effects of code duplication [Allamanis 2019; Lopes et al. 2017], we split the dataset into training-validation-test by project. This resulted in a dataset containing 39.5k/4.4k/5.9k train/validation/test examples, respectively. We trained all models and baselines on the training set, performed tuning and early stopping using the validation set, and report final results on the test set. A summary of the statistics of our dataset and a list of the repositories we used to create it are shown in Appendix A. We make our new dataset publicly available.

                        Textual                                Syntactic
    Code likelihood     SequenceR [Chen et al. 2019]           Path2Tree [Aharoni and Goldberg 2017]
    Edit likelihood     LaserTagger+CRF [Malmi et al. 2019]    C3PO (this work)

Table 1. A high-level taxonomy of our model and the baselines.
The two main contributions of our approach that we wish to examine are (a) the syntactic representation of code edits, and (b) modeling edit likelihood rather than code likelihood. Since we define the new task of EDIT COMPLETION, we picked strong neural baselines and adapted them to this task, to examine the importance of these two main contributions. Table 1 shows a high-level comparison of our model and the baselines. Each model can be classified along two properties: whether it uses a syntactic or a textual representation of the edit, and whether it models the likelihood of the code or the likelihood of the edit. We put significant effort into performing a fair comparison to all baselines, including subtoken splitting as in our model, lowercasing the subtokens, and replacing generated UNK tokens with the tokens that were given the highest attention score.
LaserTagger [Malmi et al. 2019] is a textual model that models the edit likelihood. LaserTagger learns to apply textual edits to a given text. The model follows the framework of sequence tagging, i.e., classifying each token in the input sequence. Each input token is classified into one of KEEP_φ, DELETE_φ, and SWAP, where φ belongs to a vocabulary of all common phrases obtained from the training set. While LaserTagger leverages edit operations, it does not take advantage of the syntactic structure of the input. Since the original implementation of LaserTagger uses a pre-trained BERT NLP model, which cannot be used for code, we carefully re-implemented a model in their spirit, without BERT. We used the same preprocessing scripts and sequence tags as Malmi et al. [2019], and encoded the input using either a bidirectional LSTM or a Transformer [Vaswani et al. 2017] (LaserTagger_LSTM and LaserTagger_Transformer, respectively). We further strengthened these models with neural Conditional Random Fields (CRFs) [Ma and Hovy 2016]. To represent context edits, we employed a sequence alignment algorithm [Birney et al. 1996] and extracted the textual edits. We encoded these context edits using a bidirectional LSTM and concatenated the resulting vector to the model's encoded input.
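The context-edit extraction above uses the alignment algorithm of Birney et al. [1996]; purely as an illustration of the idea, token-level edit operations can be recovered with Python's standard difflib (the snippet and tokens below are our own toy example, not taken from the dataset):

```python
import difflib

def textual_edits(before, after):
    """Token-level edit operations between two token sequences,
    extracted by sequence alignment (difflib here, for illustration)."""
    sm = difflib.SequenceMatcher(a=before, b=after)
    return [(tag, before[i1:i2], after[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]

before = "public int foo ( )".split()
after = "private int bar ( )".split()
print(textual_edits(before, after))
# [('replace', ['public'], ['private']), ('replace', ['foo'], ['bar'])]
```

Such extracted edit tuples are what the context encoder consumes, after subtokenization.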
SequenceR is a re-implementation of Chen et al. [2019]. SequenceR follows the sequence-to-sequence paradigm from Neural Machine Translation (NMT), with attention [Luong et al. 2015] and a copy mechanism [Gu et al. 2016]. The input is the subtokenized code snippet, along with the textual edits in the context; the output is the edited code. Hence, this method takes advantage of neither syntax nor edit operations. We carefully re-implemented this approach because the original SequenceR abstracts identifier names away and replaces them with generic names; for example, int x = 0 becomes int varInt = 0. Since our model uses identifier names, and we found that identifier names help our model, we kept identifier names in SequenceR as well to perform a fair comparison. While the original SequenceR uses LSTMs with copy and attention (SequenceR_LSTM), our re-implementation allowed us to strengthen this baseline by replacing the LSTM with a Transformer [Vaswani et al. 2017] with a copy mechanism (SequenceR_Transformer). We evaluated both SequenceR_LSTM, which follows the original model of Chen et al. [2019], and the strengthened SequenceR_Transformer baseline.
Path2Tree follows Aharoni and Goldberg [2017]. This baseline leverages the syntax and models the code likelihood. In this baseline, we performed a pre-order traversal of the AST and represented the AST as a serialized sequence of nodes. Using this sequential serialization of the AST, we could employ strong neural seq2seq models. The input consists of the paths that represent edits in the context (as in our model), along with a serialized sequence that represents the AST of P. The output of the model is the sequence that represents the AST of P′. As the underlying neural seq2seq model, we used both a Transformer (Path2Tree_Transformer) with a copy mechanism and a BiLSTM with attention and copy mechanisms (Path2Tree_LSTM).
From each sample in our dataset, we (a) extracted all paths of Paths(P, P) that describe possible valid edit operations, and (b) extracted the paths that represent the transformation of C to C′, i.e., Paths(C, C′). We did not filter or discard any of these paths, nor did we limit path lengths. We used input embedding dimensions of 64 and single-layer LSTM cells with 128 units. This resulted in a very lightweight model of only 750K learnable parameters. We trained our model on a Tesla V100 GPU using the Adam optimizer [Kingma and Ba 2014] with a learning rate of 0.001 to minimize the cross-entropy loss, and applied dropout [Hinton et al. 2012].

In the baselines, we used BiLSTMs with 2 layers and an embedding and hidden state size of 512, resulting in 10M learned parameters in SequenceR_LSTM and in Path2Tree_LSTM. We used the original hyperparameters of the Transformer [Vaswani et al. 2017] to train the Transformers in SequenceR_Transformer and Path2Tree_Transformer, resulting in 45M learned parameters. LaserTagger_LSTM uses BiLSTMs with two layers, a hidden state size of 128, and an embedding size of 64; this model contains 1M learned parameters. For LaserTagger_Transformer we used 4 layers of Transformer encoders with 8 attention heads, an embedding size of 64, and a hidden size of 512. For both, the context encoder uses a BiLSTM with two layers, a hidden state size of 128, and an embedding size of 64. We experimented with LaserTagger variants whose context encoder uses a Transformer, and with setups that had larger dimensions, but they achieved slightly lower results. In the other baselines, larger dimensions did contribute to the performance.
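The training objective mentioned above, cross-entropy over the pointer distribution, amounts to the negative log-probability the model assigns to the correct edit operation. A self-contained sketch (the scores and the function name are our own toy choices):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_loss(scores, gold_index):
    """Cross-entropy of the pointing distribution against the index of
    the correct edit operation (a sketch of the training loss)."""
    probs = softmax(np.asarray(scores, dtype=float))
    return -float(np.log(probs[gold_index]))

# A higher score on the gold edit yields a lower loss.
scores = [0.1, 2.0, -1.0, 0.5]
assert pointer_loss(scores, 1) < pointer_loss(scores, 2)
```

Minimizing this loss over the training set is equivalent to maximizing the log-likelihood of the reference edit scripts.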
Evaluation Metric.
To perform a fair comparison across all examined models, we had to use a metric that is meaningful and measurable in all models and baselines. We thus measured exact-match accuracy across all models and baselines. The accuracy of each model is the percentage of test-set examples for which the entire target sequence was predicted correctly.
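Exact-match accuracy over predicted edit scripts can be computed as follows (a simple sketch; the edit-script strings below are our own toy encoding):

```python
def exact_match_accuracy(predictions, references):
    """Percentage of examples whose entire predicted edit sequence
    equals the reference sequence."""
    assert len(predictions) == len(references)
    hits = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

preds = [["MOV a b"], ["UPD x y", "DEL z"], ["INS n m"]]
refs  = [["MOV a b"], ["UPD x y"],          ["INS n m"]]
print(round(exact_match_accuracy(preds, refs), 1))  # 66.7
```

Note that a prediction with any extra, missing, or reordered operation counts as entirely wrong, which makes this a strict metric.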
Performance.
Table 2 depicts the main results of our evaluation: C3PO gains more than 11% absolute accuracy over LaserTagger_Transformer, which performed best among the baselines. These results emphasize the need for a structural representation of both edits and context. C3PO achieves more than twice the accuracy of the syntactic baseline Path2Tree. Although this baseline uses AST paths to represent the changes in the context of P and represents P with its underlying AST, its performance is inferior to C3PO's, because Path2Tree does not model the edit operations directly and thus needs to generate the entire AST of P′.

    Model                                                       Acc     Learnable Parameters
    SequenceR_LSTM [Chen et al. 2019] + copy                    30.7    10M
    SequenceR_Transformer [Chen et al. 2019] + copy             32.6    45M
    LaserTagger_LSTM [Malmi et al. 2019] + CRF                  40.9    1M
    LaserTagger_Transformer [Malmi et al. 2019] + CRF           41.4    1.6M
    Path2Tree_LSTM [Aharoni and Goldberg 2017] + copy           22.5    10M
    Path2Tree_Transformer [Aharoni and Goldberg 2017] + copy    25.5    45M
    C3PO (this work)                                            —       750K

Table 2. Our model achieves significantly higher accuracy than the baselines.
These results show the importance of the two main contributions of our model.
Modeling the edit has the most significant contribution; this is expressed in the advantage of our model over both versions of Path2Tree, and in the advantage of both versions of LaserTagger over both versions of SequenceR. Syntactic representation over textual representation also has a significant contribution, which is expressed in the superiority of our model over both versions of LaserTagger. Using these two key contributions, our model performs significantly better than all baselines while being much more lightweight in terms of learnable parameters. The same results are visualized in Figure 8.

Fig. 8. Visualization of the accuracy scores of our model compared to the baselines (Path2Tree_LSTM, Path2Tree_Transformer, SequenceR_LSTM, SequenceR_Transformer, LaserTagger_LSTM, LaserTagger_Transformer, and C3PO). The values are the same as in Table 2. Our model achieves significantly higher accuracy than the baselines.
We manually examined the predicted examples and discuss two representative cases.

Figure 9 shows an example in which the modification of a method signature in the context affects P, which lies in the method body. The context of P, shown in Figure 9a, includes a change in the signature of the method GetFileCharacteristic: the name of the method was changed to GetAppender, and its return type was updated from FileCharacteristic to BaseFileAppender. Consider P in Figure 9b. P is a return statement, located in the body of the changed method GetFileCharacteristic. Since the return type of the method was updated to BaseFileAppender, the return statements inside the method must be changed as well. The renaming of the method to GetAppender might also have hinted to our model that the appender object itself should be returned. Our model successfully predicted the desired edit: altering the return statement from return appender.GetFileCharacteristic(); to return appender;. This example shows how context edits are important for predicting edits in a program, by providing information about (a) return type changes and (b) method renaming.

    Context edits (C → C′):
    - public FileCharacteristic GetFileCharacteristic(string fileName)
    + private BaseFileAppender GetAppender(string fileName)

    Program edits (P → P′):
    - return appender.GetFileCharacteristic();
    + return appender;

Fig. 9. An example where the edit of a method signature affects the edit of P, which lies in the method body. Figure 9a illustrates the edit in the context and the paths that describe the transformation from C to C′. Figure 9b shows the predicted edit operations along with their associated paths in P.

Figure 10 illustrates a case where the edit in the context is conceptually similar to the edit in P, but is not identical. Figure 10a shows a variable declaration statement, where part is cast to the type MethodCallExpression and assigned to the newly declared variable methodExpression. In the edited context, the keyword var was updated to the explicit type
MethodCallExpression. Figure 10b shows an edit that is similar in spirit: P consists of an initialization statement, where the variable nameParts is assigned a new Stack object. Our model predicted an UPD edit, because it only requires updating the value of a single node. This example demonstrates a class of examples where the edit in the context hints at edits in P that are similar in spirit but not identical, and that should be performed differently.

    Context edits (C → C′):
    - var methodExpression = (MethodCallExpression)part;
    + MethodCallExpression methodExpression = (MethodCallExpression)part;

Fig. 10. An example in which the edit in the context is conceptually similar to the edit of P. Figure 10a illustrates the edit that occurred in the context and the paths that describe the transformation from C to C′. Figure 10b shows the predicted edit operations along with their associated paths in P.

We conducted an extensive ablation study to examine the importance of different components in our model. We focus on two axes: the representation of ∆P and the representation of ∆C. This allows us to examine where the advantage of our model over the strongest baselines comes from: the syntactic representation of the context, or the syntactic representation of P?

                        No Context      Textual Context     Path-Based Context
    Textual P                           †
    Path-Based P                                            C3PO (this work) †

Table 3. Variations on our model. † marks results that are copied from Table 2.

In our model, P is represented using its syntactic structure, i.e., a path-based representation. Alternatively, P can be represented using its textual representation. The representation of P determines the representation of P′, i.e., they must be represented similarly; otherwise, the model would need to "translate" P into a different representation to predict P′. However, the representation of the context C can, in principle, differ from that of P. We thus took our model and examined different representations of the context: path-based context (as in our original model), textual context, and "no context". For each type of context representation, we also experimented with different representations of P: a syntactic representation (as in our original model) and a textual representation of P. For the textual representation of P we used LaserTagger_Transformer [Malmi et al. 2019], which we found to be the strongest textual baseline in Section 6. All the hybrid models were re-trained, and their performance is shown in Table 3.
Contribution of context.
We observe that the contribution of the changes in the context is considerable, for both representations of P (textual and path-based). Ignoring changes in the context (the left "No Context" column of Table 3) results in lower accuracy. This motivates our task of predicting edits given the context: program edits are correlated with edits that occurred in the context, and predicting edits should take the context edits into account.

P representation. We observe that across all settings of context representation, a syntactic representation of P performs better than a textual representation of P. That is, even if the context is textual (the right column of Table 3), a model benefits from a syntactic representation of P. This advantage is even clearer in the "No Context" case, where the path-based representation of P achieves over 10% absolute accuracy more than the textual representation of P. A path-based representation of P allows us to model the edit in P directly, which makes the learning task much easier and more generalizable.

Context representation.
As Table 3 shows, the representation of the context should be compatible with the representation of P. If P is textual, a textual context performs better; if P is syntactic, a syntactic context performs better. We hypothesize that matching the context representation to the program representation allows the model to utilize the context better, and eases the modeling of the correlation between edits that occurred in the context and edits that should be applied to P.

Representing programs in learning models.
The representation of programs in learning models is a question that is even more important than the employed learning algorithm or neural architecture. In the last few years, several approaches have been proposed. Early work used the straightforward textual representation, essentially learning from the flat token stream [Allamanis et al. 2016; Chen et al. 2019; Iyer et al. 2016; Tufano et al. 2018; Vasic et al. 2019]; although this allows leveraging NLP learning approaches, such models do not exploit the rich syntax of programming languages, and eventually perform worse than other representations despite their use of strong NLP models. Another line of work represents programs as graphs; these approaches usually augment the AST of a program with additional semantic edges and use a graph neural network to learn from the resulting graph [Allamanis et al. 2018; Brockschmidt et al. 2019; Fernandes et al. 2019; Hellendoorn et al. 2020; Yin et al. 2019]. Graphs provide a natural way to represent programs, and they easily allow augmenting programs with domain knowledge such as semantic analysis. However, it is unclear how well these models can perform in the absence of full semantic information: given partial code, code that cannot be compiled, or languages that are difficult to analyze semantically. In this work, we leverage AST paths to represent programs [Alon et al. 2018]. AST paths were shown to be an effective representation for predicting variable names and method names [Alon et al. 2019c], natural language descriptions [Alon et al. 2019a], and code completion [Alon et al. 2019b]. In our task, AST paths allow us to model edits directly, along with the syntactic relationship between the source and the target node of each edit.
Representing edits.
Much work has recently been proposed on representing edits. Yin et al. [2019] proposed a model that learns to apply a given code edit to another given code snippet. Although it sounds similar, the task that we address, EDIT COMPLETION, is very different, since our input contains no specific edit that needs to be applied. In EDIT COMPLETION, the model must predict what should be edited and how, rather than only applying a given edit. In our work, there is no guarantee that the edit that needs to be predicted is included in the context; furthermore, there can be several edits in the context. Thus, our model needs to choose and predict the edit itself, while the edits that occurred in the context might only be related.
SequenceR [Chen et al. 2019] used state-of-the-art NMT models to predict bug fixes in single-line buggy programs. Our work differs from their approach in the representation of both the input and the output: Chen et al. [2019] represent the code as a token stream, while our approach represents edits as AST paths. Further, their approach attempts to generate the entire edited program, whereas our model models only the edit. We demonstrate the advantage of our approach over SequenceR empirically in Section 6.
A problem related to ours is that of fixing compilation errors. Tarlow et al. [2019] follow the encoder-decoder paradigm, using an encoder that consists of a graph neural network (GNN), which encodes a multi-graph built from the AST and the compilation error log messages. The decoder is a Transformer [Vaswani et al. 2017] that outputs a sequence representing the predicted edit. DeepDelta [Mesbah et al. 2019] used an NMT model whose input consists of compilation errors and an AST path from the problematic symbol in the code to the root of the tree. The output of their model is a sequence that represents the edit script.
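The contrast between generating the edited code and modeling only the edit can be illustrated with a toy example (ours, not taken from any of the cited systems; the tokenization and the edit-script tuple format are illustrative):

```python
# A toy illustration of the two output representations discussed above.
# A seq2seq model over code (SequenceR-style) must emit the entire edited
# statement; an edit-script model emits only the change itself.

before = "if ( x == null ) return 0 ;".split()
after  = "if ( x == null ) return -1 ;".split()

# (a) generating the edited code: the target is the full token stream
seq2seq_target = after                  # 9 tokens to predict

# (b) modeling the edit: the target names only what changed
edit_script = [("UPD", 7, "0", "-1")]   # (operation, position, old, new)

# applying the edit script reproduces the edited program
patched = list(before)
for op, pos, old, new in edit_script:
    assert op == "UPD" and patched[pos] == old
    patched[pos] = new

print(len(seq2seq_target), len(edit_script))  # 9 1
```

The point of the sketch is only the asymmetry in target size: the edit-script target stays small regardless of how long the surrounding program is.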
In our work, pairwise AST paths allow us to model the desired edit directly, instead of predicting an edit using multiple predictions.
Recently, Dinella et al. [2020] proposed a model called HOPPITY for detecting and fixing bugs in source code using graph transformations. The main difference between our approach and theirs is that HOPPITY does not model edit operations directly, as our model does; rather, it models a graph that represents the input and uses the resulting node representations to predict actions. This modeling restricts their model to unary edit operations, while our model predicts binary edits: HOPPITY can only predict single-node edits in each step, such as deleting a subtree root, inserting a single node, or changing a single node's value. Thus, edits like moving large subtrees require multiple insertion operations, one node at a time. In our approach, moving or inserting a subtree can be performed by a single edit operation. Dinella et al. [2020] evaluated their model on examples that contain at most three single-node operations. However, as shown in Appendix A, the average size of moved subtrees in our training set is 3.48 nodes. Such edits would have required HOPPITY to generate the entire subtree in the new position (three operations) and delete the subtree in its original place (one operation), resulting in four operations in total. Hence, our average case is larger than the cases examined by HOPPITY.
CC2Vec [Hoang et al. 2020] represents edits in version control systems (e.g., GitHub). However, their approach represents edits only textually. CC2Vec was demonstrated on the tasks of predicting commit messages, predicting bug fixes, and defect prediction; their model cannot predict the edit itself, which is the problem that we address in this paper.
Chakraborty et al. [2018] proposed a two-step model that aims to apply edits in code.
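The operation-count comparison above can be sketched as a back-of-the-envelope calculation (a sketch of the argument, not either system's actual cost model):

```python
# Moving a subtree of k nodes is one binary MOV operation in our
# representation, whereas a model limited to single-node operations must
# insert each of the k nodes at the new position and then delete the
# subtree at its original place.

def single_node_ops(k):
    return k + 1  # k insertions + 1 deletion

def binary_edit_ops(k):
    return 1      # one MOV, regardless of subtree size

# The example from the text: a moved subtree of three nodes.
print(single_node_ops(3), binary_edit_ops(3))  # 4 1
```

The gap grows linearly with subtree size, which is why a binary MOV operation matters even at the dataset's average of 3.48 moved nodes.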
The first step of their model encodes the sequence that represents a pre-order traversal of the AST of the original code and generates the sequence that represents the AST of the edited code. In the second step, they assign the values of terminal nodes to concrete values. Their approach predicts the edit by synthesizing the entire AST. In Section 6, we show the advantage of modeling the likelihood of edits over modeling the likelihood of the code. Additionally, our model is trained end-to-end, while Chakraborty et al. [2018] train the different components of their model separately.

CONCLUSION

We presented a novel approach for representing and predicting edits in code. Our main idea is to learn the likelihood of the edit itself, rather than learning the likelihood of the new program. We use paths from the Abstract Syntax Tree to represent code edits that occurred in the context, and use them to point to edits that should be predicted.
We demonstrate the effectiveness of our approach on the EDIT COMPLETION task: predicting edits in code given the edits in its surrounding context. We conjecture that our direct modeling of the likelihood of edits and our use of the rich structure of code are the main components that contribute to the strength of our model. We affirm this conjecture in a thorough evaluation and ablation study. Our method performs significantly better than strong neural baselines that leverage syntax but do not model edits directly, or that model edits but do not leverage syntax.
We believe that our approach can serve as a basis for a variety of models and tools that require modeling and predicting code edits, such as bug fixing, an EDIT COMPLETION assistant in the programmer's IDE, and automatically adapting client code to changes in public external APIs. Further, we believe that our work can serve as the basis for a future "neural code reviewer" that can save human effort and time. To these ends, we make all our code, dataset, and trained models publicly available.
REFERENCES
Roee Aharoni and Yoav Goldberg. 2017. Towards String-To-Tree Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Vancouver, Canada, 132–140. https://doi.org/10.18653/v1/P17-2021
Miltiadis Allamanis. 2019. The Adverse Effects of Code Duplication in Machine Learning Models of Code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Athens, Greece) (Onward! 2019). Association for Computing Machinery, New York, NY, USA, 143–153. https://doi.org/10.1145/3359591.3359735
Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015. Suggesting Accurate Method and Class Names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (Bergamo, Italy) (ESEC/FSE 2015). ACM, New York, NY, USA, 38–49. https://doi.org/10.1145/2786805.2786849
Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to Represent Programs with Graphs. In International Conference on Learning Representations. https://openreview.net/forum?id=BJOFETxR-
Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. In Proceedings of The 33rd International Conference on Machine Learning (Proceedings of Machine Learning Research), Maria Florina Balcan and Kilian Q. Weinberger (Eds.), Vol. 48. PMLR, New York, New York, USA, 2091–2100. http://proceedings.mlr.press/v48/allamanis16.html
Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019a. code2seq: Generating Sequences from Structured Representations of Code. In International Conference on Learning Representations. https://openreview.net/forum?id=H1gKYo09tX
Uri Alon, Roy Sadaka, Omer Levy, and Eran Yahav. 2019b. Structural Language Models of Code. arXiv:1910.00577 [cs.LG]
Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2018. A general path-based representation for predicting program properties. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018). https://doi.org/10.1145/3192366.3192412
Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019c. code2vec: Learning Distributed Representations of Code. Proc. ACM Program. Lang. 3, POPL, Article 40 (Jan. 2019), 29 pages. https://doi.org/10.1145/3290353
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
Ewan Birney, Julie D. Thompson, and Toby J. Gibson. 1996. PairWise and SearchWise: Finding the Optimal Alignment in a Simultaneous Comparison of a Protein Profile against All DNA Translation Frames. Nucleic Acids Research 24, 14 (July 1996), 2730–2739. https://doi.org/10.1093/nar/24.14.2730
Marc Brockschmidt, Miltiadis Allamanis, Alexander L. Gaunt, and Oleksandr Polozov. 2019. Generative Code Modeling with Graphs. In International Conference on Learning Representations. https://openreview.net/forum?id=Bke4KsA5FX
Saikat Chakraborty, Miltiadis Allamanis, and Baishakhi Ray. 2018. Tree2Tree Neural Translation Model for Learning Source Code Changes. ArXiv abs/1810.00314 (2018).
William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. IEEE, 4960–4964.
Sudarshan S. Chawathe, Anand Rajaraman, Hector Garcia-Molina, and Jennifer Widom. 1996. Change Detection in Hierarchically Structured Information. SIGMOD Rec. 25, 2 (June 1996), 493–504. https://doi.org/10.1145/235968.233366
Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, and Martin Monperrus. 2019. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair. CoRR abs/1901.01808 (2019). arXiv:1901.01808 http://arxiv.org/abs/1901.01808
Elizabeth Dinella, Hanjun Dai, Ziyang Li, Mayur Naik, Le Song, and Ke Wang. 2020. HOPPITY: Learning Graph Transformations to Detect and Fix Bugs in Programs. In International Conference on Learning Representations. https://openreview.net/forum?id=SJeqs6EFvB
Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering (ASE '14), Vasteras, Sweden, September 15-19, 2014. 313–324. https://doi.org/10.1145/2642937.2642982
Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Structured Neural Summarization. In International Conference on Learning Representations. https://openreview.net/forum?id=H1ersoRqtm
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. CoRR abs/1603.06393 (2016). arXiv:1603.06393 http://arxiv.org/abs/1603.06393
Vincent J. Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis, and David Bieber. 2020. Global Relational Models of Source Code. In International Conference on Learning Representations. https://openreview.net/forum?id=B1lnbRNtwr
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR abs/1207.0580 (2012). arXiv:1207.0580 http://arxiv.org/abs/1207.0580
Thong Hoang, Hong Jin Kang, Julia Lawall, and David Lo. 2020. CC2Vec: Distributed Representations of Code Changes. arXiv:2003.05620 [cs.SE]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
James W. Hunt and M. Douglas McIlroy. 1975. An algorithm for differential file comparison.
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing Source Code using a Neural Attention Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, 2073–2083. https://doi.org/10.18653/v1/P16-1195
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. http://arxiv.org/abs/1412.6980 Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.
Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, and Jan Vitek. 2017. DéjàVu: a map of code duplicates on GitHub. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017), 1–28.
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. CoRR abs/1508.04025 (2015). arXiv:1508.04025 http://arxiv.org/abs/1508.04025
Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1064–1074.
Eric Malmi, Sebastian Krause, Sascha Rothe, Daniil Mirylenka, and Aliaksei Severyn. 2019. Encode, Tag, Realize: High-Precision Text Editing. In EMNLP-IJCNLP.
Ali Mesbah, Andrew Rice, Emily Johnston, Nick Glorioso, and Edward Aftandilian. 2019. DeepDelta: learning to repair compilation errors. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 925–936.
Reuven Rubinstein. 1999. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability 1, 2 (1999), 127–190.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. CoRR abs/1409.3215 (2014). arXiv:1409.3215 http://arxiv.org/abs/1409.3215
Daniel Tarlow, Subhodeep Moitra, Andrew Rice, Zimin Chen, Pierre-Antoine Manzagol, Charles Sutton, and Edward Aftandilian. 2019. Learning to Fix Build Errors with Graph2Diff Neural Networks. arXiv:1911.01205 [cs.LG]
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2018. An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation. CoRR abs/1812.08693 (2018). arXiv:1812.08693 http://arxiv.org/abs/1812.08693
Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, and Rishabh Singh. 2019. Neural Program Repair by Jointly Learning to Localize and Repair. In International Conference on Learning Representations. https://openreview.net/forum?id=ByloJ20qtm
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer Networks. arXiv:1506.03134 [stat.ML]
Pengcheng Yin, Graham Neubig, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L. Gaunt. 2019. Learning to Represent Edits. In International Conference on Learning Representations. https://openreview.net/forum?id=BJl6AjC5F7

A DATASET
Table 4 lists the GitHub repositories that we used to create our dataset.
Repository                      User                   Split
corefx                          dotnet                 Train
shadowsocks-windows             shadowsocks            Train
CodeHub                         CodeHubApp             Train
coreclr                         dotnet                 Train
roslyn                          dotnet                 Train
PowerShell                      PowerShell             Train
WaveFunctionCollapse            mxgmn                  Train
SignalR                         SignalR                Train
ShareX                          ShareX                 Train
Nancy                           NancyFx                Train
dapper-dot-net                  StackExchange          Train
mono                            mono                   Train
Wox                             Wox-launcher           Train
AutoMapper                      AutoMapper             Train
RestSharp                       restsharp              Train
BotBuilder                      Microsoft              Train
SparkleShare                    hbons                  Train
Newtonsoft.Json                 JamesNK                Train
MonoGame                        MonoGame               Train
MaterialDesignInXamlToolkit     MaterialDesignInXAML   Train
ReactiveUI                      reactiveui             Train
msbuild                         Microsoft              Train
aspnetboilerplate               aspnetboilerplate      Train
orleans                         dotnet                 Train
Hangfire                        HangfireIO             Train
Sonarr                          Sonarr                 Train
dnSpy                           0xd4d                  Train
Psychson                        brandonlw              Train
acat                            intel                  Train
SpaceEngineers                  KeenSoftwareHouse      Train
PushSharp                       Redth                  Train
cli                             dotnet                 Train
StackExchange.Redis             StackExchange          Train
akka.net                        akkadotnet             Train
framework                       accord-net             Train
monodevelop                     mono                   Train
Opserver                        opserver               Train
ravendb                         ravendb                Train
OpenLiveWriter                  OpenLiveWriter         Validation
Mvc                             aspnet                 Validation
GVFS                            Microsoft              Validation
OpenRA                          OpenRA                 Validation
Rx.NET                          dotnet                 Validation
MahApps.Metro                   MahApps                Validation
FluentValidation                JeremySkinner          Validation
ILSpy                           icsharpcode            Validation
ServiceStack                    ServiceStack           Test
choco                           chocolatey             Test
duplicati                       duplicati              Test
CefSharp                        cefsharp               Test
NLog                            NLog                   Test
JavaScriptServices              aspnet                 Test
EntityFrameworkCore             aspnet                 Test
Table 4. Our dataset repositories.
A summary of statistics over our dataset is shown in Table 5.
                                       Train   Validation   Test
Avg. size of moved subtrees (MOV)       3.48      2.95       2.85
Avg. size of deleted subtrees (DEL)     4.49      4.97       4.39
Avg. size of inserted subtrees (INS)    1.27      2.09       1.26
[The remaining rows of Table 5, covering the counts and shares of MOV, DEL, INS, and UPD operations, were not recoverable from this extraction.]