Mining Software Repair Models for Reasoning on the Search Space of Automated Program Fixing
Matias Martinez
University of Lille & INRIA
Martin Monperrus
University of Lille & INRIA

Empirical Software Engineering, Springer, 2013 (accepted for publication on Sep. 11, 2013).
Abstract—This paper is about understanding the nature of bug fixing by analyzing thousands of bug fix transactions of software repositories. It then places this learned knowledge in the context of automated program repair. We give extensive empirical results on the nature of human bug fixes at a large scale and a fine granularity with abstract syntax tree differencing. We set up mathematical reasoning on the search space of automated repair and the time to navigate through it. By applying our method on 14 repositories of Java software and 89,993 versioning transactions, we show that not all probabilistic repair models are equivalent.
I. INTRODUCTION
Automated program fixing consists of generating source code in order to fix bugs in an automated manner [1], [2], [3], [4], [5]. The generated fix is often an incremental modification (a "patch" or "diff") over the software version exhibiting the bug. The previous contributions in this new research field make different assumptions on what is required as input (e.g. good test suites [2], pre- and post-conditions [3], policy models [1]). The repair strategies also vary significantly. Examples of radically different models include genetic algorithms [2] and satisfiability models (SAT) [6].

In this paper, we take a step back and look at the problem from an empirical perspective. What are real bug fixes made of? The kind of results we extensively discuss later are for instance: in bug fixes of open source software projects, the most common source code change consists of inserting a method invocation. Can we reuse this knowledge for reasoning on automated program repair? We propose a framework to do so, by reasoning on the kind of bug fixes. This framework enables us to show that the granularity of the analysis of real commits (which we call "repair models") has a big impact on the navigation into the search space of program repair. We further show that the heuristics used to build probability distributions on top of the repair models also make a significant difference: not all repair actions are equal!

Let us now make precise what we mean with repair actions and repair models. A software repair action is a kind of modification on source code that is made to fix bugs. We can cite as examples: changing the initialization of a variable; adding a condition in an "if" statement; adding a method call, etc. In this paper, we use the term "repair model" to refer to a set of repair actions. For instance, the repair model of Weimer et al. [2] has three repair actions: deleting a statement, inserting a statement taken from another part of the software, and swapping two statements.

There is a key difference between a repair action and a repair: a repair action is a kind of repair, a repair is a concrete patch. In object-oriented terminology, a repair is an instance of a repair action. For instance, "adding a method call" is a repair action, "adding x.foo()" is a repair. A repair action is program- and domain-independent: it contains no domain-specific data such as variable names or literal values.

First, we present an approach to mine repair actions from patches written by developers. We find traces of human-based program fixing in software repositories (e.g. CVS, SVN or Git), where there are versioning transactions (a.k.a. commits) that only fix bugs. We use those "fix transactions" to mine AST-level repair actions such as adding a method call, changing the condition of an "if", or deleting a catch block. Repair actions are extracted with the abstract differencing algorithm of Fluri et al. [7]. This results in repair models that are much bigger (41 and 173 repair actions) compared to related work, which considers at most a handful of repair actions.

Second, we propose to decorate the repair models with a probability distribution. Our intuition is that not all repair actions are equal and that certain repair actions are more likely to fix bugs than others. We also take an empirical viewpoint to define those probability distributions: we learn them from software repositories.
We show that those probability distributions are independent of the application domain.

Third, we demonstrate that our probabilistic repair models enable us to reason on the search space of automated program repair. The multinomial theorem [8, p.73] comes into play to analyze the time to navigate into the search space of automated repair from a theoretical viewpoint.

To sum up, our contributions are:
• An extensive analysis of the content of software versioning transactions: our analysis is novel both with respect to size (89,993 transactions of 14 open-source Java projects) and granularity (173 repair actions at the level of the AST).
• A probabilistic mathematical reasoning on automated repair showing that, depending on the viewpoint, one may quickly navigate – or not – into the search space of automated repair. Despite being theoretical, our results highlight an important property of the deep structure of this search space: the likely-correct repairs are highly concentrated in some parts of the search space, as stars are concentrated into galaxies in our universe.

This article is a revised version of a technical report [9]. It reads as follows. Section II describes how we map concrete versioning transactions to change actions. Section III discusses how to only select bug fix transactions. Section IV then shows that those change actions are actually repair actions under certain assumptions. Section V presents our theoretical analysis on the time to navigate in the search space of automated repair. Finally, we compare our results with the related work (in Section VII) and conclude.

II. DESCRIBING VERSIONING TRANSACTIONS WITH A CHANGE MODEL
In this section, we describe the contents of versioning transactions of 14 repositories of Java software. Previous empirical studies on versioning transactions [10], [11], [12], [13], [14] focus on metadata (e.g., authorship, commit text) or size metrics (number of changed files, number of hunks, etc.). On the contrary, we aim at describing versioning transactions in terms of contents: what kind of source code change they contain: addition of method calls; modification of conditional statements; etc. There is previous work on the evolution of source code (e.g. [15], [16], [17]). However, to our knowledge, they are all at a coarser granularity compared to what we describe in this paper.

Note that other terms exist for referring to versioning transactions: "commits", "changesets", "revisions". Those terms reflect the competition between versioning tools (e.g. Git uses "changeset" while SVN uses "revision") and the difference between technical documentation and academic publications, which often use "transaction". In this paper, we equate those terms and generally use the term "transaction", as previous research does.

Software versioning repositories (managed by version control systems such as CVS, SVN or Git) store the source code changes made by developers during the software lifecycle. Version control systems (VCS) enable developers to query versioning transactions based on revision number, authorship, etc. For a given transaction, a VCS can produce a difference ("diff") view that is a line-based difference view of source code. For instance, let us consider the following diff:

  while (i < MAX_VALUE) {
    op.createPanel(i);
-   i=i+1;
+   i=i+2;
  }

The difference shows one line replaced by another one. However, one could also observe the changes at the abstract syntax tree (AST) level, rather than at the line level. In this case, the AST diff is an update of an assignment statement within a while loop. In this section, our research question is: what are versioning transactions made of at the abstract syntax tree level?

To answer this question, we have followed the following methodology. First, we have chosen an AST differencing algorithm from the literature. Then, we have constituted a dataset of software repositories to run the AST differencing algorithm on a large number of transactions. Finally, we have computed descriptive statistics on those AST-based differences. Let us first discuss the dataset.
A. Dataset
CVS-Vintage is a dataset of 14 repositories of open-source Java software [18]. The inclusion criterion of CVS-Vintage is that the repository mostly contains Java code and has been used in previous published academic work on mining software repositories and software evolution. This dataset covers different domains: desktop applications, server applications, libraries such as logging, compilation, etc. It includes the repositories of the following projects: ArgoUML, Columba, JBoss, JHotdraw, Log4j, org.eclipse.ui.workbench, Struts, Carol, Dnsjava, Jedit, Junit, org.eclipse.jdt.core, Scarab and Tomcat. In all, the dataset contains 89,993 versioning transactions, 62,179 of which have at least one modified Java file. Over time, 259,264 Java files have been revised (which makes a mean number of 4.2 Java files modified per transaction).
B. Abstract Syntax Tree Differencing
There are different propositions of AST differencing algorithms in the literature. Important ones include Raghavan et al.'s Dex [19], Neamtiu et al.'s AST matcher [20] and Fluri et al.'s ChangeDistiller [7]. For our empirical study on the contents of versioning transactions, we have selected the latter. ChangeDistiller [7] is a fine-grain AST differencing tool for Java. It expresses fine granularity source code changes using a taxonomy of 41 source change types, such as "statement insertion" or "if conditional change". ChangeDistiller handles changes that are specific to object-oriented elements, such as "field addition". Fluri and colleagues have published an open-source, stable and reusable implementation of their algorithm for analyzing AST changes of Java code.

ChangeDistiller produces a set of "source code changes" for each pair of Java files from versioning transactions. For a source code change, the main output of ChangeDistiller is a "change type" (from the taxonomy aforementioned). However, for our analysis, we also consider two other pieces of information. We reformulate the output of ChangeDistiller: each AST source code change is represented as a 2-value tuple scc = (ct, et), where ct is one of the 41 change types and et (for entity type) refers to the source code entity related to the change (for instance, a statement update may change a method call or an assignment). Since ChangeDistiller is an AST differencer, formatting transactions (such as changing the indentation) produce no AST-level change at all. The short listing above would be represented as one single AST change that is a statement update (ct) of an assignment (et).
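This tuple representation can be captured by a small data structure. Below is a minimal Java sketch for illustration; the type and enum values are ours (a tiny subset of the taxonomy), not ChangeDistiller's actual API:

enum ChangeType { STATEMENT_INSERT, STATEMENT_UPDATE, STATEMENT_DELETE, CONDITION_EXPRESSION_CHANGE }
enum EntityType { METHOD_INVOCATION, ASSIGNMENT, IF_STATEMENT, VARIABLE_DECLARATION_STATEMENT }

/** An AST-level source code change scc = (ct, et). */
record SourceCodeChange(ChangeType ct, EntityType et) { }

class SccDemo {
    public static void main(String[] args) {
        // The i=i+1 -> i=i+2 diff shown earlier maps to one single AST change:
        SourceCodeChange scc =
                new SourceCodeChange(ChangeType.STATEMENT_UPDATE, EntityType.ASSIGNMENT);
        System.out.println(scc); // SourceCodeChange[ct=STATEMENT_UPDATE, et=ASSIGNMENT]
    }
}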
C. Change Models

All versioning transactions can be expressed within a "change model". We define a change model as a set of "change actions". For instance, the change model of standard Unix diff is composed of two change actions: line addition and line deletion. A change model represents a kind of feature space, and observations in that space can be valued. For instance, a standard Unix diff produces two integer values: the number of added lines and the number of deleted lines. ChangeDistiller enables us to define the following change models.

CT (Change Type) is composed of 41 features, the 41 change types of ChangeDistiller. For instance, one of these features is "Statement Insertion" (we may use the shortened name "Stmt_Insert").
CTET (Change Type Entity Type) is made of all valid combinations of the Cartesian product between change types and entity types. CTET is a refinement of CT: each repair action of CT is mapped to [1...n] repair actions of CTET. Hence the labels of the repair actions of CTET always contain the label of CT. There are 104 entity types and 41 change types, but many combinations are impossible by construction; as a result, CTET contains 173 features. For instance, since there is one entity type representing assignments, one feature of CTET is "statement insertion of an assignment".

In the rest of this paper, we express versioning transactions within those two change models. There is no better change model per se: they describe versioning transactions at different granularities. We will see later that, depending on the perspective, both change models have pros and cons.

D. Measures for Change Actions
We define two measures for a change action i: α_i is the absolute number of occurrences of change action i in a dataset; χ_i is the probability of observing change action i, as given by its frequency over all changes (χ_i = α_i / Σ_j α_j). For instance, let us consider feature space CT and the change action "statement insertion" (StmtIns). If there are α_StmtIns = 12 source code changes related to statement insertion among 100, the probability of observing a statement insertion is χ_StmtIns = 12%.
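As a minimal illustration, the two measures can be computed from a map of change action counts; the class and method names below are hypothetical, not part of the study's tooling:

import java.util.Map;

class ChangeActionMeasures {
    /** chi_i = alpha_i / sum_j alpha_j */
    static double chi(Map<String, Long> alpha, String action) {
        long total = alpha.values().stream().mapToLong(Long::longValue).sum();
        return alpha.getOrDefault(action, 0L) / (double) total;
    }

    public static void main(String[] args) {
        // Toy counts: 12 statement insertions among 100 observed changes
        Map<String, Long> alpha = Map.of("Stmt_Insert", 12L, "Other", 88L);
        System.out.printf("chi = %.0f%%%n", 100 * chi(alpha, "Stmt_Insert")); // chi = 12%
    }
}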
E. Empirical Results

We have run ChangeDistiller over the 62,179 Java transactions of our dataset, resulting in 1,196,385 AST-level changes for both change models. For change model CT, which is rather coarse-granularity, the three most common changes are "statement insert" (28% of all changes), "statement delete" (23% of all changes) and "statement update" (14% of all changes). Certain changes are rare; for instance, "addition of class derivability" (adding keyword final to the class declaration) only appears 99 times (0.0008% of all changes). The complete results are given in the companion technical report [21].

Table I presents the top 20 change actions and the associated measures for change model CTET. The comprehensive table for all 173 change actions is given in the companion technical report [21]. In Table I, one sees that inserting method invocations as statements is the most common change, which makes sense for open-source object-oriented software that is growing.
Change Action                                         α_i      Prob. χ_i
Statement insert of method invocation                 83,046   6.9%
Statement insert of if statement                      79,166   6.6%
Statement update of method invocation                 76,023   6.4%
Statement delete of method invocation                 65,357   5.5%
Statement delete of if statement                      59,336   5%
Statement insert of variable declaration statement    54,951   4.6%
Statement insert of assignment                        49,222   4.1%
Additional functionality of method                    49,192   4.1%
Statement delete of variable declaration statement    44,519   3.7%
Statement update of variable declaration statement    41,838   3.5%
Statement delete of assignment                        41,281   3.5%
Condition expression change of if statement           40,415   3.4%
Statement update of assignment                        34,802   2.9%
Addition of attribute                                 29,328   2.5%
Removal of method                                     26,172   2.2%
Statement insert of return statement                  24,184   2%
Statement parent change of method invocation          21,010   1.8%
Statement delete of return statement                  20,880   1.7%
Insert of else statement                              20,227   1.7%
Deletion of else statement                            17,197   1.4%
Total                                                 1,196,385

Table I. The abundance of AST-level changes of change model CTET over the versioning transactions of our dataset. The probability χ_i is the relative frequency over all changes (e.g. 6.9% of source code changes are insertions of method invocation).

Let us now compare the results over change models CT and CTET. One can see that statement insertion is mostly composed of inserting a method invocation (6.9%), inserting an "if" conditional (6.6%), and inserting a new variable (4.6%). Since change model CTET is at a finer granularity, there are fewer observations per feature: both α_i and χ_i are lower. The probability distribution (χ_i) over the change model is less sharp (smaller values) since the feature space is bigger. A high value of χ_i means that we have a change action that can frequently be found in real data: those change actions have a high "coverage" of data. CTET features describe modifications of software at a finer granularity. The differences between those two change models illustrate the tension between a high coverage and the analysis granularity.

F. Project-independence of Change Models
An important question is whether the probability distribution (composed of all χ_i) of Table I is generalizable to Java software or not. That is, do developers evolve software in a similar manner over different projects? To answer this question, we have computed the metric values not for the whole dataset, but per project. In other words, we have computed the frequency of change actions in 14 software repositories. We would like to see that the values do not vary between projects, which would mean that the probability distributions over change actions are project-independent. Since our dataset covers many different domains, having high correlation values would be a strong point towards generalization.

As correlation metric, we use Spearman's ρ. We choose Spearman's ρ because it is non-parametric. In our case, what matters is to know whether the importance of change actions is similar (for instance, that "statement update" is more common than "condition expression change"). Contrary to parametric correlation metrics (e.g. Pearson), Spearman's ρ only focuses on the ordering between change actions, which is what we are interested in.

Figure 1. Histogram of the Spearman Correlation between Change Action Frequencies of Change Model CT Mined on Different Projects. There is no outlier: the values are all higher than 0.75, meaning that the importance of change actions is project-independent.

We compute the Spearman correlation values between the probability distributions of all pairs of projects of our dataset (i.e. (14 × 13)/2 = 91 combinations). One correlation value takes as input two vectors representing the probability distributions (of size 41 for change model CT and 173 for change model CTET).

The critical value of Spearman's ρ depends on the size of the vectors being compared and on the required confidence level. At confidence level α = 0.01, the critical value for change model CT (41 features) is 0.364, and it is 0.301 for change model CTET (values from statistical tables; we used [22]). If the correlation is higher than the critical value, the null hypothesis (a random distribution) is rejected.

For instance, in change model CT, the Spearman correlation between Columba and ArgoUML is 0.94, which is much higher than the critical value (0.364). This means that the correlation is statistically significant at the α = 0.01 confidence level. The high value shows that those two projects were evolved in a very similar manner. All values are given in the companion technical report [21]. Figure 1 gives the distribution of Spearman correlation values for change model CT. 75% of the pairs of projects have a Spearman correlation higher than 0.85. For all pairs of projects, in change model CT, Spearman's ρ is much higher than the critical value. This shows that the likelihood of observing a change action is globally independent of the project used for computing it.
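For illustration, the following minimal sketch (not the implementation used in the study) shows how Spearman's ρ can be computed between two such frequency vectors, by applying Pearson's formula to the ranks:

import java.util.Arrays;

class SpearmanRho {
    static double[] ranks(double[] v) {
        Integer[] idx = new Integer[v.length];
        for (int i = 0; i < v.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(v[a], v[b]));
        double[] r = new double[v.length];
        for (int pos = 0; pos < v.length; pos++) r[idx[pos]] = pos + 1; // ties ignored for brevity
        return r;
    }

    /** Spearman's rho = Pearson's correlation computed on the ranks. */
    static double rho(double[] x, double[] y) {
        double[] rx = ranks(x), ry = ranks(y);
        double mx = Arrays.stream(rx).average().orElse(0), my = Arrays.stream(ry).average().orElse(0);
        double num = 0, dx = 0, dy = 0;
        for (int i = 0; i < x.length; i++) {
            num += (rx[i] - mx) * (ry[i] - my);
            dx += (rx[i] - mx) * (rx[i] - mx);
            dy += (ry[i] - my) * (ry[i] - my);
        }
        return num / Math.sqrt(dx * dy);
    }

    public static void main(String[] args) {
        // Two toy frequency vectors over the same change actions (e.g. two projects)
        double[] p1 = {0.28, 0.23, 0.14, 0.05};
        double[] p2 = {0.31, 0.20, 0.16, 0.04};
        System.out.println(rho(p1, p2)); // 1.0: identical rankings
    }
}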
To understand the meaning of those correlation values, let us now analyze in detail the lowest and highest correlation values. The highest correlation value is 0.98 and it corresponds to the project pair Eclipse-Workbench and Log4j. In this case, 33 out of 41 change actions have a rank difference between 0 and 3. The lowest correlation value is 0.80 and it corresponds to the Spearman correlation between projects Tomcat and Carol. In this case, the maximum rank change is 23 (for change action "Removing Method Overridability", i.e. removing final for methods). In total, between Tomcat and Carol, there are six change actions for which the importance changes by at least 10 ranks. Those high rank changes trigger the 0.80 Spearman correlation. However, for common changes, it turns out that their ranks do not change at all (e.g. for "Statement Insert", "Statement Update", etc.). Note that, since Spearman correlation is based on ranks, a value of 0.85 means either that most change actions are ranked similarly or that a single change action has a really different rank.

We have also computed the correlation between projects within change model CTET (see the companion technical report [21]). The values are all above 0.301, the critical value for vectors of size 173 at the α = 0.01 confidence level, showing that in change model CTET, the change action importance is project-independent as well, in a statistically significant manner. Despite being high, we note that the values are slightly lower than for change model CT; this is due to the fact that Spearman's ρ generally decreases with the vector size (as shown by the statistical tables; most tables of Spearman's ρ stop at N=60, but since the critical value decreases with N, a correlation above the N=60 critical value still rejects the null hypothesis).
To sum up, we provide the empirical importance of 173 source code change actions; we show that the importance of change actions is project-independent; and we show that the probability distribution of change actions is very unbalanced. Our results are based on the analysis of 62,179 transactions. To our knowledge, those results have never been published before, given this analysis granularity and the scale of the empirical study.

The threats to the validity of our results are of two kinds. From the internal validity viewpoint, a bug somewhere in the implementation may invalidate our results. From the external validity viewpoint, there is a risk that our dataset of 14 projects is not representative of Java software as a whole, even if the projects are written by different persons from different organizations in different application domains. Also, our results may not generalize to other programming languages.

III. SLICING TRANSACTIONS TO FOCUS ON BUG FIXES
In Section II, we have defined and discussed two measures per change action i: α_i and χ_i. For instance, χ_StmtInsert gives the frequency of statement insertions. Those measures implicitly depend on a bag of transactions to be computed. So far we have considered all versioning transactions of the repository. For defining a repair space, we need to apply those two measures on a transaction bag representative of software repair. How should we slice transactions to focus on bug fixes? An intuitive method, that we will use as a baseline, is to rely on the commit message (by slicing only those transactions that contain a given word or expression related to bug fixing). Before going further, let us clarify the goal of the classification: the goal is to have a good approximation of the probability distribution of change actions for software repair (note that our goal is not to have a good classification in terms of precision or recall).
A. Slicing Based on the Commit Message
When committing source code changes, developers may write a comment/message explaining the changes they have made. For instance, when a transaction is related to a bug fix, they may write a comment referencing the bug report or describing the fix.

To identify transaction bags related to bug fixes, previous work focused on the content of the commit text: whether it contains a bug identifier, or whether it contains some keywords such as "fix" (see [23] for a discussion of those approaches). To identify bug fix patterns, Pan et al. [24] select transactions containing at least one occurrence of "bug", "fix" or "patch". We call this transaction bag BFP. We will compute α_i and χ_i based on this definition.

Such a transaction bag makes a strong assumption on the development process and the developers' behavior: it assumes that developers generally put syntactic features in commit texts enabling to recognize repair transactions, which is not really true in practice [23], [25], [26].

B. Slicing Based on the Change Size in Terms of Number of AST Changes
We may also define fixing transaction bags based on their "AST diffs", i.e. based on the type and number of change actions that a versioning transaction contains. This transaction bag is called N-SC (for N abstract Syntactic Changes); e.g. 5-SC represents the bag of transactions containing five AST-level source code changes.

In particular, we assume that small transactions are very likely to only contain a bug fix and unlikely to contain a new feature. Repair actions may be those that appear atomically in transactions (i.e. the transaction only contains one AST-level source code change). "1-SC" (composed of all transactions of one single AST change) is the transaction bag that embodies this assumption. Let us verify this assumption.

C. Do Small Versioning Transactions Fix Bugs?

1) Experiment:
We set up a study to determine whether small transactions correspond to bug fix changes. We define small as those transactions that introduce only one AST change.
2) Overview:
The study consists in the manual inspection and evaluation of source code changes of versioning transactions. First, we randomly take a sample set of transactions from our dataset (see II-A). Then, we create an "evaluation item" for each pair of files of the sample set (the file before and after the revision). An evaluation item contains data to help the raters decide whether a transaction is a bug fix or not: the syntactic line-based differencing between the revision pair of the transaction (it helps to visualize the changes), the AST change between them (type and location, e.g. insertion of a method invocation at line 42) and the commit message associated with the transaction.
                                 Full Agreement (3/3)   Majority (2/3)
Transaction is a Bug Fix                   74                 21
Transaction is not a Bug Fix               22                 23
I don't know                                0                  1

Table II. The results of the manual inspection of 144 transactions by three raters.
3) Sampling Versioning Transactions:
We use stratified sampling to randomly select 1-SC versioning transactions from the software history of 16 open source projects (mostly from [18]). Recall that a "1-SC" versioning transaction only introduces one AST change. The stratification consists of picking 10 items (if 10 are found) per project. In total, the sample set contains 144 transactions sampled over the 6,953 1-SC transactions present in our dataset.
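As an illustration, such stratified sampling can be sketched as follows; the grouping by project and the per-project cap come from the text, while the generic helper below is hypothetical:

import java.util.*;

class StratifiedSampler {
    /** Pick at most perProject random items from each project's list of transactions. */
    static <T> List<T> sample(Map<String, List<T>> byProject, int perProject, long seed) {
        Random rnd = new Random(seed);
        List<T> sample = new ArrayList<>();
        for (List<T> transactions : byProject.values()) {
            List<T> copy = new ArrayList<>(transactions);
            Collections.shuffle(copy, rnd);
            sample.addAll(copy.subList(0, Math.min(perProject, copy.size())));
        }
        return sample;
    }
}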
4) Evaluation Procedure:
The 144 evaluation items were evaluated by three people called the raters: the paper authors and a colleague, member of the faculty at the University of Bordeaux. During the evaluation, each item (see III-C2) is presented to a rater, one by one. The rater has to answer the question Is this change a bug fix? The possible answers are a) Yes, the change is a bug fix, b) No, the change is not a bug fix, and c) I don't know. Optionally, the rater can write a comment to explain his decision.
5) Experiment Results:

a) Level of Agreement:
The three raters fully agreed that 74 of the 144 (51.4%) transactions from the sample are bug fixes. If we consider the majority (at least 2 out of 3 raters agree), 95 of 144 transactions (66%) were considered as bug fix transactions. The complete rating data is given in the companion technical report [21].

Table II presents the number of agreements. The column Full Agreement shows the number of transactions for which all raters agreed. For example, the three raters agreed that there is a bug fix in 74/144 transactions. The Majority column shows the number of transactions for which two out of three raters agree. To sum up, small transactions predominantly consist of bug fixes.

Among the transactions with full agreement on the absence of bug fix changes, the most common case found was the addition of a method. This change indeed consists of one single AST change (the addition of a "method" node). Interestingly, in some cases, adding a method was indeed a bug fix, when polymorphism is used: the new method fixes the bug by replacing the super implementation.

b) Statistics:
Let p_i measure the degree of agreement for a single item (in our case, with three raters, p_i ∈ {0, 1/3, 1}). The overall agreement P̄ [27] is the average over the p_i. Using the scale introduced by [28], the value we obtain means there is a Substantial overall agreement between the raters, close to an Almost perfect agreement.

The coefficient κ (Kappa) [27], [29] measures the confidence in the agreement level by removing the chance factor (some degree of agreement is expected even when the ratings are purely random [27], [29]). The κ degree of agreement in our study is 0.517, a value distant from the critical value (which is 0). The null hypothesis is rejected: the observed agreement was not due to chance.
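For illustration, the following minimal sketch computes the per-item agreement p_i and the overall agreement P̄ for three raters; it is a simplified computation, not the exact tooling used in the study:

class RaterAgreement {
    /** p_i: fraction of agreeing rater pairs for one item; for 3 raters this is 0, 1/3 or 1. */
    static double itemAgreement(int[] ratings) {
        int pairs = 0, agreeing = 0;
        for (int a = 0; a < ratings.length; a++)
            for (int b = a + 1; b < ratings.length; b++) {
                pairs++;
                if (ratings[a] == ratings[b]) agreeing++;
            }
        return (double) agreeing / pairs;
    }

    /** P-bar: the mean of the per-item agreements. */
    static double overallAgreement(int[][] items) {
        double sum = 0;
        for (int[] item : items) sum += itemAgreement(item);
        return sum / items.length;
    }

    public static void main(String[] args) {
        // Three toy items rated by three raters (0 = not a fix, 1 = bug fix, 2 = don't know)
        int[][] items = { {1, 1, 1}, {1, 1, 0}, {0, 1, 2} };
        System.out.println(overallAgreement(items)); // (1 + 1/3 + 0) / 3 ≈ 0.44
    }
}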
6) Conclusion:
The manual inspection of 144 versioning transactions shows that there is a relation between one-AST-change transactions and bug fixing. Consequently, we can use the 1-SC transaction bag to estimate the probability of change actions for software repair.

IV. FROM CHANGE MODELS TO REPAIR MODELS
This section presents how we can transform a "change model" into a "repair model" usable for automated software repair. As discussed in Section II, a change model describes all types of source code change that occur during software evolution. On the contrary, we define a "repair action" as a change action that often occurs for repairing software, i.e. often used for fixing bugs.

By construction, a repair model is equal to a subset of a change model in terms of features. But more than the number of features, our intuition is that the probability distribution over the feature space varies between change models and repair models. For instance, one might expect that changing the initialization of a variable has a higher probability in a repair model. Hence, the difference between a change model and a repair model is a matter of perspective. Since we are interested in automated program repair, we now concentrate on the "repair" perspective and hence use the terms "repair model" and "repair action" in the rest of the paper.
A. Methodology
We have applied the same methodology as in Section II. We have computed the probability distributions of repair models CT and CTET based on different definitions of fix transactions, i.e. we have computed α_i and χ_i based on the transaction bags discussed in Section III: ALL transactions, N-SC and BFP. For N-SC, we chose four values of N: 1-SC, 5-SC, 10-SC and 20-SC. Transactions larger than 20-SC have almost the same topology of changes as ALL, as we will show later (see Section IV-C2).

The main question we ask is whether those different definitions of "repair transactions" yield different topologies for repair models.

B. Empirical Results
Table III presents the top 10 change types of repair model CT associated with their probability χ_i for different versioning transaction bags. The complete table for all repair actions is given in the companion technical report [21]. Overall, the distribution of repair actions over real bug fix data is very unbalanced: the probability of observing a single repair action goes from more than 30% to 0.000x%. We observe a Pareto effect: the top 10 repair actions account for more than 92% of the cumulative probability distribution.
Furthermore, we have made the following observations from the experiment results.

First, the order of repair actions (i.e. their likelihood of contributing to bug repair) varies significantly depending on the transaction bag used for computing the probability distribution. For instance, the probability of a statement insertion is 29% for transaction bag ALL, but jumps to 33% for transaction bag 20-SC.

Second, the probability distributions for transaction bags ALL and BFP are close: each repair action has a similar probability value in both. As a consequence, transaction bag BFP may essentially be a random subset of ALL transactions. All those observations also hold for repair model CTET; the complete table is given in the companion technical report [21].
Those results are a first answer to our question: different definitions of "repair transactions" yield different probability distributions over a repair model.

C. Discussion
We have shown that one can base repair models on different methods to extract repair transaction bags. There are certain analytical arguments for or against those different repair space topologies. For instance, selecting transactions based on the commit text makes a very strong assumption on the quality of software repository data, but ensures that the selected transactions contain at least one actual repair. Alternatively, small transactions indicate a focus on a single concern, hence they are likely to be a repair. However, small transactions may only see the tip of the fix iceberg (large transactions may be bug fixing as well), resulting in a distorted probability distribution over the repair space. At the experimental level, the threats to validity are the same as for Section II.

Table IV. The Spearman correlation values between repair actions of transaction bag "ALL" and those from the transaction bags built with different heuristics.
1) Correlation between Transaction Bags:
To what extent are the 6 transaction bags different? We have calculated the Spearman correlation values between the probability distributions over repair actions of all pairs of transaction bags. In particular, we would like to know whether the heuristics yield significantly different results compared to all transactions (transaction bag ALL). Table IV presents these correlation values.
ALL                   BFP                  1-SC                    5-SC               10-SC                20-SC
Stmt_Insert-29%       Stmt_Insert-32%      Stmt_Upd-38%            Stmt_Insert-28%    Stmt_Insert-31%      Stmt_Insert-33%
Stmt_Del-23%          Stmt_Del-23%         Add_Funct-14%           Stmt_Upd-24%       Stmt_Upd-19%         Stmt_Del-16%
Stmt_Upd-15%          Stmt_Upd-12%         Cond_Change-13%         Stmt_Del-11%       Stmt_Del-14%         Stmt_Upd-16%
Param_Change-6%       Param_Change-7%      Stmt_Insert-12%         Add_Funct-10%      Add_Funct-8%         Param_Change-7%
Order_Change-5%       Order_Change-6%      Stmt_Del-6%             Cond_Change-7%     Param_Change-7%      Add_Funct-7%
Add_Funct-4%          Add_Funct-4%         Rem_Funct-5%            Param_Change-5%    Cond_Change-6%       Cond_Change-5%
Cond_Change-4%        Cond_Change-3%       Add_Obj_St-3%           Add_Obj_St-3%      Add_Obj_St-3%        Add_Obj_St-3%
Add_Obj_St-2%         Add_Obj_St-2%        Order_Change-2%         Rem_Funct-3%       Rem_Funct-2%         Order_Change-3%
Rem_Funct-2%          Alt_Part_Insert-2%   Rem_Obj_St-2%           Order_Change-1%    Order_Change-2%      Rem_Funct-2%
Alt_Part_Insert-2%    Rem_Funct-2%         Inc_Access_Change-1%    Rem_Obj_St-1%      Alt_Part_Insert-1%   Alt_Part_Insert-2%

Table III. Top 10 change types of change model CT and their probability χ_i for different transaction bags. The different heuristics used to compute the fix transaction bags have a significant impact on both the ranking and the probabilities.

For instance, the Spearman correlation value between ALL and 1-SC is 0.68. This value shows, as we have noted before, that there is not a strong correlation between the orders of the repair actions of the two transaction bags. In other words, heuristic 1-SC indeed focuses on a specific kind of transactions. On the contrary, the value between ALL and BFP is 0.99. This means the orders of the frequencies of repair actions are almost identical. Moreover, Table IV shows that the correlation values between N-SC (N = 1, 5, 10 and 20) and ALL tend to 1 (i.e. perfect alignment) when N grows. This validates the intuition that the size of transactions (in number of AST changes) is a good predictor to focus on transactions that are different in nature from the normal software evolution. Crossing this result with the results of our empirical study of 144 1-SC transactions, there is some evidence that by concentrating on small transactions, we probably have a good approximation of repair transactions.
2) Skewness of Probability Distributions:
Figure 2 shows the probability of the most frequent repair actions of repair model CTET according to the transaction size (in number of AST changes). For instance, the probability of updating a method invocation decreases from 15% in 1-SC transactions to 7% in all transactions. In particular, we observe that: a) for transactions with 1 AST change, the change probabilities are more unbalanced (i.e. less uniform than for all transactions); there are 5 changes that are much more frequent than the rest; b) for transactions with more than 10 AST changes, the probabilities of top changes are less dispersed and all smaller than 0.9%; c) the probabilities of those 5 most frequent changes decrease when the transaction size grows. This is a further piece of evidence that the N-SC heuristics provide a focus on transactions that are of a specific nature, different from the bulk of software evolution.
3) Conclusion:
Those results on repair actions are especially important for automated software repair: we think it would be fruitful to devise automated repair approaches that "imitate" how human developers fix programs. To us, using the probabilistic repair models described in this section is a first step in that direction.
Figure 2. Probabilities of the 12 most frequent AST changes for 11 different transaction bags: 10 that include transactions with i AST changes, with i = 1...10, and the ALL transaction bag.

V. AUTOMATED ANALYSIS OF THE TIME TO NAVIGATE INTO THE SEARCH SPACE OF AUTOMATED PROGRAM REPAIR
This section discusses the nature and size of the search space of automated program repair. We show that the two repair models defined in Section IV allow mathematical reasoning. We present a way of comparing repair models and their probability distributions based on data from software repositories.
A. Decomposing The Repair Search Space
The search space of automated program repair consists of all explorable bug fixes for a given program and a given bug (whether compilable, executable or correct). If one bounds the size of the repair (e.g. all patches of at most 40 lines), the search space size is finite. A naive search space is huge, because even in a bounded-size scenario, there is a myriad of elements to be added, removed or modified: statements, variables, operators, literals.
A key point of automated program repair research consists of decreasing the time to navigate the repair search space.
There are many ways to decrease this time. For instance, fault localization enables the search to first focus on places where fixes are likely to be successful. This and other components of a repair process may participate in an efficient navigation. One of them is the "shaping" of fixes.

Informally, the shape of a bug fix is a kind of patch. For instance, the repair shape of adding an "if" throwing an exception for signaling an incorrect input consists of inserting an if and inserting a throw. The concept of "repair shape" is equivalent to what Wei et al. [3] call a "fix schema", and Weimer et al. [2] a "mutation operator".

In this paper, we define a "repair shape" as an unordered tuple of repair actions (from a set of repair actions called R). Since a bug fix may contain several instances of the same repair action (e.g. several statement insertions), the repair shape may contain the same repair action several times. In the if/throw example aforementioned, in repair space CTET, the repair shape of this bug fix consists of two repair actions: statement insertion of "if" and statement insertion of "throw". The shaping space consists of all possible combinations of repair actions.

The instantiation of a repair shape is what we call fix synthesis. The complexity of the synthesis depends on the repair actions of the shaping space. For instance, the repair actions of Weimer et al. [2] (insertion, deletion, replace) have an "easy" and bounded synthesis space (random picking in the code base).

To sum up, we consider that the repair search space can be viewed as the combination of the fault localization space (where the repair is likely to be successful), the shaping space (which kind of repair may be applied) and the synthesis space (assigning concrete statements and values to the chosen repair actions). The search space can then be loosely defined as the Cartesian product of those spaces, and its size then reads:

|FaultLocalization| × |Shape| × |Synthesis|

In this paper, we concentrate on the shaping part of the space. If one can find efficient strategies to navigate through this shaping space, this would contribute to efficiently navigating through the repair search space as a whole, thanks to the combination.
B. Mathematical Analysis Over Repair Models
To analyze the shaping space, we now present a mathematical analysis of our probabilistic repair models. So far, we have two repair models, CT and CTET (see Section IV), and different ways to parametrize them.

According to our probabilistic repair model, a good navigation strategy consists of concentrating on likely repairs first: the repair shape is more likely to be composed of frequent repair actions. That is, a repair shape of size n is predicted by drawing n repair actions according to the probability distribution over the repair model. Under the pessimistic assumption that repair actions are independent, our repair model makes it possible to know the exact median number of attempts N that is needed to find a given repair shape R (demonstration given in the companion technical report [21]):

N = k such that Σ_{i=1..k} p(1-p)^{i-1} ≥ 0.5    (1)

with p = (n! / Π_j (e_j!)) × Π_{r ∈ R} P(r), where e_j is the number of occurrences of repair action r_j inside R. Equation (1) holds if and only if we consider the repair actions as independent. If they are not, it means that we under-estimate the deep structure of the repair space, hence we over-approximate the time to navigate the space to find the correct shape. In other words, even if the repair actions are not independent (which is likely for some of them), our conclusions are sound.

For instance, the repair of revision 1.2 of Eclipse's CheckedTreeSelectionDialog ("Fix for 19346 integrating changes from Sebastian Davids", http://goo.gl/d4OSi) consists of two inserted statements. Equation 1 tells us that in repair model CT, we would need on average 12 attempts to find the correct repair shape for this real bug.

Having only a repair shape is far from having a real fix. However, the concept of repair shape, associated with the mathematical formula analyzing the time to navigate the repair space, is key to compare ways to build a probability distribution over repair models.
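Equation (1) is straightforward to implement. The following minimal Java sketch (ours, for illustration) computes the shape probability p and the median number of attempts N for a repair shape given as a multiset of repair actions:

import java.util.Map;

class ShapingTime {
    /** p = n!/prod(e_j!) * prod P(r)^{e_j}, for a shape given as action -> multiplicity. */
    static double shapeProbability(Map<String, Integer> shape, Map<String, Double> dist) {
        int n = shape.values().stream().mapToInt(Integer::intValue).sum();
        double p = factorial(n);
        for (Map.Entry<String, Integer> e : shape.entrySet()) {
            p /= factorial(e.getValue());
            p *= Math.pow(dist.get(e.getKey()), e.getValue());
        }
        return p;
    }

    /** Median number of attempts: smallest k with 1-(1-p)^k >= 0.5 (geometric distribution). */
    static long medianAttempts(double p) {
        return (long) Math.ceil(Math.log(0.5) / Math.log(1 - p));
    }

    static double factorial(int n) {
        double f = 1;
        for (int i = 2; i <= n; i++) f *= i;
        return f;
    }

    public static void main(String[] args) {
        // Toy distribution over two repair actions; shape = two statement insertions
        Map<String, Double> dist = Map.of("Stmt_Insert", 0.29, "Stmt_Delete", 0.23);
        double p = shapeProbability(Map.of("Stmt_Insert", 2), dist); // 2!/2! * 0.29^2
        System.out.println(medianAttempts(p)); // 8 attempts
    }
}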
C. Comparing Probability Distributions Over Repair Actions From Versioning History

We have seen in Section V-B that the time for finding correct repair shapes depends on a probability distribution over repair actions. The probability distribution P is crucial for minimizing the search space traversal: a good distribution P results in concentrating on likely repairs first, i.e. the repair space is traversed in a guided way, by first exploring the parts of the space that are likely to be more fruitful. This poses two important questions: first, how to set up a probability distribution over repair actions; second, how to compare the efficiency of different probability distributions to find good repair shapes.

To compute a probability distribution over repair actions, we propose to learn them from software repositories. For instance, if many bug fixes are made of inserted method calls, the probability of applying such a repair action should be high. Despite our single method (learning the probability distributions from software repositories), we have shown in Section IV that there is no single way to compute them (they depend on different heuristics). To compare different distributions against each other, we set up the following process.

One first selects bug repair transactions in the versioning history. Then, for each bug repair transaction, one extracts its repair shape (as a set of repair actions of a repair model). Then one computes the average time that a maximum likelihood approach would need to find this repair shape, using Equation 1.

Let us assume two probability distributions P1 and P2 over a repair model and four fixes (F1...F4), each consisting of two repair actions and observed in a repository. Let us further assume that the times (in number of attempts) to find the exact shapes of F1...F4 according to P1 are lower than according to P2 (e.g. 5 attempts against 25 for F1). In this case, it is clear that the probability distribution P1 enables us to find the correct repair shapes faster (the shaping times for P1 are lower). Beyond this example, by applying the same process over real bug repairs found in a software repository, our process enables us to select the best probability distribution for a given repair model.

Since Equation 1 is parametrized by a number of repair actions, we instantiate this process for all bug repair transactions of a certain size (in terms of AST changes). This means that our process determines the best probability distribution for a given bug fix shape size.

Input: C                              ▷ A bag of transactions
Output: The median number of attempts to find good repair shapes
begin
  Ω ← {}                              ▷ Result set
  T, E ← split(C)                     ▷ Cross-validation: split C into Training and Evaluation data
  M ← train_model(T)                  ▷ Train a repair model (e.g. compute a probability distribution over repair actions)
  for s ∈ E do                        ▷ For all repairs observed in the repository
    n ← compute_repairability(s, M)   ▷ How long to find this repair according to the repair model
    Ω ← Ω ∪ {n}                       ▷ Store the "repairability" value of s
  return median(Ω)                    ▷ Returning the median number of attempts to find the repair shapes

Figure 3. An algorithm to compare fix shaping strategies. There may be different flavors of the functions split, train_model and compute_repairability.

Project / Repair Size        1      2      3      4      5      6      7      8
ArgoUML                    (996)  (638)  (386)  (362)  (254)  (234)  (197)  (166)
Carol                       (30)   (15)   (10)   (10)    (7)   (13)    (6)    (9)
Columba                    (382)  (255)  (144)  (146)  (113)  (108)   (73)   (94)
Dnsjava                    (165)  (139)   (71)   (82)   (54)   (50)   (33)  ∞ (44)
jEdit                      (115)   (84)   (53)   (48)   (32)   (30)   (29)   (32)
jBoss                      (514)  (353)  (208)  (189)  (147)  (150)   (86)  (113)
jHotdraw6                   (21)   (21)    (9)   (10)   (10)    (3)  ∞ (5)    (2)
jUnit                       (40)   (39)   (18)  ∞ (11)   (7)  ∞ (11)   (9)  ∞ (6)
Log4j                      (223)  (134)   (68)   (70)   (64)   (42)   (41)  ∞ (48)
org.eclipse.jdt.core      (1606) (1025)  (657)  (631)  (392)  (416)  (314)  (309)
org.eclipse.ui.workbench  (1184)  (783)  (414)  (464)  (326)  (305)  (215)  (192)
Scarab                     (653)  (346)  (202)  (159)  (113)  (137)   (89)   (77)
Struts                     (221)  (133)   (86)  (103)   (61)   (77)   (39)   (34)
Tomcat                     (281)  (167)  (111)  (120)   (84)   (87)   (61)   (51)

Table V. The median number of attempts (in bold) required to find the correct repair shape of fix transactions. The values in brackets indicate the number of fix transactions tested per project and per transaction size for repair model CT. The repair model CT is made from the probability distribution of changes included in 5-SC transaction bags. For small transactions, finding the correct repair shape in the search space is done in a handful of attempts.

D. Cross-Validation
We compute different probability distributions P_x from transaction bags found in repositories. We evaluate the time to find the shapes of real fixes that are also found in repositories, which may bias the results. To overcome this problem, we use cross-validation: we always use different sets of transactions to estimate P and to calculate the average number of attempts required to find a correct repair shape. Using cross-validation reduces the risk of overfitting.

Since we have a dataset of 14 independent software repositories, we use this dataset structure for cross-validation. We take one repository for extracting repair shapes and the remaining 13 projects to calibrate the repair model (i.e. to compute the probability distributions). We repeat the process 14 times, by testing each of the 14 projects separately. In other words, we try to predict real repair shapes found in one repository from data learned on other software projects.

Figure 3 sums up this algorithm to compare fix shaping strategies. From a bag of transactions C, function split creates a set of training transactions and a set of evaluation transactions. Then, one trains a repair model (with function train_model); for repair models CT and CTET this means computing a probability distribution on a specific bag of transactions. Finally, for each repair of the testing data, one computes its "repairability" according to the repair model (with Equation 1). The algorithm returns the median repairability, i.e. the median number of attempts required to repair the test data.
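A minimal leave-one-project-out sketch of this procedure, reusing the Equation (1) sketch shown earlier; the data layout and helper names are hypothetical, and smoothing of unseen actions is omitted for brevity:

import java.util.*;

class CrossValidation {
    /** For each held-out project, train a distribution on the others and compute the median repairability. */
    static Map<String, Long> evaluate(Map<String, List<Map<String, Integer>>> shapesByProject) {
        Map<String, Long> medianByProject = new HashMap<>();
        for (String held : shapesByProject.keySet()) {
            Map<String, Double> dist = trainDistribution(shapesByProject, held);
            List<Long> attempts = new ArrayList<>();
            for (Map<String, Integer> shape : shapesByProject.get(held))
                attempts.add(ShapingTime.medianAttempts(ShapingTime.shapeProbability(shape, dist)));
            Collections.sort(attempts);
            medianByProject.put(held, attempts.get(attempts.size() / 2));
        }
        return medianByProject;
    }

    /** Empirical probability distribution over repair actions, excluding one project. */
    static Map<String, Double> trainDistribution(
            Map<String, List<Map<String, Integer>>> data, String excluded) {
        Map<String, Double> counts = new HashMap<>();
        double total = 0;
        for (Map.Entry<String, List<Map<String, Integer>>> e : data.entrySet()) {
            if (e.getKey().equals(excluded)) continue;
            for (Map<String, Integer> shape : e.getValue())
                for (Map.Entry<String, Integer> a : shape.entrySet()) {
                    counts.merge(a.getKey(), (double) a.getValue(), Double::sum);
                    total += a.getValue();
                }
        }
        final double t = total;
        counts.replaceAll((k, v) -> v / t);
        return counts;
    }
}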
E. Empirical Results

We run our fix shaping process on our dataset of 14 repositories of Java software, considering two repair models: CT and CTET (see Section II-C). We remind that CT consists of 41 repair actions and CTET of 173 repair actions. For both repair models, we have tested the different heuristics of Section IV-A to compute the median repair time: all transactions (ALL); one AST change (1-SC); 5 AST changes (5-SC); 10 AST changes (10-SC); 20 AST changes (20-SC); transactions with commit text containing "bug", "fix" or "patch" (BFP); and a baseline of a uniform distribution over the repair model (EQP, for equally-distributed probability).

We extracted all bug fix transactions with less than 8 AST changes from our dataset. For instance, the versioning repository of DNSJava contains 165 transactions of 1 repair action, 139 transactions of size 2, 71 transactions of size 3, etc. The biggest number of available repair tests is in jdt.core (1,605 fixes consist of one AST change), while JHotdraw has only 2 transactions of 8 AST changes. We then computed the median number of attempts to find the correct shape of those 23,048 fix transactions. Since this number highly depends on the probability distributions P_x, we computed the median repair time for all combinations of fix transaction sizes, projects, and heuristics discussed above (8 × 14 × 7).

Table V presents the results of this evaluation for repair space CT and transaction bag 5-SC. For each project, the bold values give the median repairability in terms of number of attempts required to find the correct repair shape with a maximum likelihood approach. Then, the bracketed values give the number of transactions per transaction size (size in number of AST changes) and per project. For instance, over the 996 fix transactions of size 1 in the ArgoUML repository, it takes an average of 6 attempts to find the correct repair shape. On the contrary, for the 51 transactions of size 8 in the Tomcat repository, it takes an average of 34,240 attempts to find the correct repair shape. Those results are encouraging: for small transactions, it takes a handful of attempts to find the correct repair shape. The probability distribution over the repair model seems to drive the search efficiently. The other heuristics yield similar results; the complete results (6 tables, one per heuristic) are given in [21].

About cross-validation, one can see that the performance over the 14 runs (one per project) is similar (all columns of Table V contain numbers that are of a similar order of magnitude). Given our cross-validation procedure, this means that for all projects, we are able to predict the correct shapes using only knowledge mined in the other projects. This gives us confidence that one could apply our approach to any new project using the probability distributions mined in our dataset.

Figure 4. The repairability of small transactions in repair model CT. Certain probability distributions yield a median repair time that is much lower than others.
Figure 5. The repairability of small transactions in repair space CTET. There is no way to find the repair shapes of transactions larger than 4 AST code changes.

Furthermore, finding the correct repair shapes of larger transactions (up to 8 AST changes) has an order of magnitude of 10^4 attempts and not more. Theoretically, for a given fix shape of n AST changes, the size of the shaping space is the number of repair actions of the model at the power of n (e.g. |CT|^n). For CT and n = 4, this results in a space of 41^4 = 2,825,761 possible shapes. In practice, over all projects, for small shapes (i.e. less than or equal to 3 changes), a well-defined probability distribution can guide the search to the correct shape in a median time lower than 200 attempts. This again shows that the probability distribution over the repair model is so unbalanced that the likelihood of possible shapes is concentrated on less than 200 shapes (i.e. the probability density over |CT|^n is really sparse).

Now, what is the best heuristic, with respect to shaping, to train our probabilistic repair models? For each repair shape size of Table V and each heuristic, we computed the median repairability over all projects of the dataset (a median of median numbers of attempts). We also computed the median repairability for a baseline of a uniform distribution (EQP) over the repair model (i.e. ∀i, P(r_i) = 1/|CT|). Figure 4 presents this data for repair model CT. It shows the median number of attempts required to identify correct repair shapes on the Y-axis. The X-axis is the number of repair actions in the repair test (the size). Each line represents a probability estimation heuristic.

Figure 4 gives us important pieces of information. First, the heuristics yield different repair times. For instance, the repair time for heuristic 1-SC is generally higher than for 20-SC. Overall, there is a clear order between the repairability times: for transactions with less than 5 repair actions heuristic 5-SC gives the best results, while for bigger transactions 20-SC is the best. Interestingly, certain heuristics are inappropriate for maximum-likelihood shaping of real bug fixes: the resulting probability distributions yield a repair time that explodes even for small shapes (this is the case for the uniform distribution EQP, even for shapes of size 3). Also, all median repair times tend toward infinity for shapes of size larger than 9. Finally, although 1-SC is not good over many shape sizes, we note that it is the best for small shapes of size 1. This is explained by the empirical setup (where we also decompose transactions by shape size).
1) On The Best Heuristics for Computing Probability Distributions over Repair Actions:
To sum up, for small repair shapes heuristic 1-SC is the best with respect to probabilistic repair shaping, but it is not efficient for shapes of size greater than two AST-level changes. Heuristics 5-SC and 20-SC are the best for changes of size greater than 2. An important point is that some probability distributions (in particular those built from heuristics EQP and 1-SC) are really suboptimal for quickly navigating the search space.

Do those findings hold for repair model CTET, which has a finer granularity?
2) On The Difference between Repair Models CT and CTET:
We have also run the whole evaluation with the repairmodel CTET (see II-C). The empirical results are given in thecompanion technical report [21](in the same form as Table V).Figure 5 is the sibling of figure 4 for repair model CTET.They look rather different. The main striking point is that withrepair model CTET, we are able to find the correct repair shapefor fixes that are no larger than 4 AST changes. After that, thearithmetic of very low probabilities results in virtually infinitetime to find the correct repair shape. On the contrary, in therepair model CT, even for fixes of 7 changes, one could findthe correct shape in a finite number of attempts. Finally, in thisrepair model the average time to find a correct repair shape isseveral times larger than in CT (in CT, the shape of fixes ofsize 3 can be find in approx. 200 attempts, in CTET, it’s morearound 6,000).For a given repair shape, the synthesis consists of findingconcrete instances of repair actions. For instance, if the pre-dicted repair action in CTET consists of inserting a methodcall, it remains to predict the target object, the method and itsparameters. We can assume that the more precise the repairaction, the smaller the “synthesis space”. For instance, inCTET, the synthesis space is smaller compared to CT, becauseit only composed of enriched versions of basic repair actionsof repair model CT (for instance inserting an “if” instead ofinserting a statement).Our results illustrate the tension between the richness of therepair model and the ease of fixing bugs automatically. Whenwe consider CT, we find likely repair shapes quickly (lessthan 5,000 attempts), even for large repair, but to the priceof a larger synthesis space. In other words, there is a balancebetween finding correct repair actions and finding concreterepair actions. When the repair actions are more abstract, itresults in a larger synthesis space, when repair actions are moreconcrete, it hampers the likelihood of being able to concentrateon likely repair shapes first. We conjecture that the profilebased on CT is better because of the following two points:it enables us to find bigger correct repair shapes (good) in asmaller amount of time (good). Finally, we think that our results empirically explore someof the foundations of “repairing”: there is a difference betweenprescribing aspirin (it has a high likelihood to contribute tohealing, but only partially) and prescribing a specific medicine(one can try many medicines before finding the perfect one).VI. A
VI. ACTIONABLE GUIDELINES FOR AUTOMATED SOFTWARE REPAIR
Our results blend empirical findings with theoretical insights. How can they be used within an approach for automated software repair? This section presents actionable guidelines arising from our results. We apply those guidelines in a case study that consists of reasoning on a simplified version of GenProg within our probabilistic framework.
A. Consider Using a Probability Distribution over Repair Actions
Automated software repair approaches embed a set of repair actions, either explicitly or implicitly. On two different repair models, we have shown that the importance of each repair action greatly varies. Furthermore, our mathematical analysis has proved that considering a uniform distribution over repair actions is extremely suboptimal.

Hence, from the viewpoint of the time to fix a bug, we recommend setting up a probability distribution over the considered repair actions. This probability distribution can be learned on past data, as we do in this paper, or simply tuned with an incremental evaluation process. For instance, Le Goues et al. [30] have done similar probabilistic tuning over their three repair actions. Overall, using a probability distribution over repair actions could significantly speed up the repair process.
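As a concrete starting point, learning such a distribution can be as simple as frequency counting over past fix transactions. The sketch below uses hypothetical type and action names; note that the 1-SC/5-SC/20-SC heuristics of this paper additionally select which transactions enter the count, which this sketch omits.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class RepairModelLearner {

        // Each fix transaction is abstracted as the list of repair action
        // names (e.g. "INSERT_METHOD_CALL") observed in its diff.
        static Map<String, Double> learnDistribution(List<List<String>> fixTransactions) {
            Map<String, Long> counts = new HashMap<>();
            long total = 0;
            for (List<String> transaction : fixTransactions) {
                for (String action : transaction) {
                    counts.merge(action, 1L, Long::sum);
                    total++;
                }
            }
            // Maximum-likelihood estimate: p(action) = count(action) / total.
            final double t = total;
            Map<String, Double> distribution = new HashMap<>();
            counts.forEach((action, count) -> distribution.put(action, count / t));
            return distribution;
        }
    }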
B. Be Aware of the Interplay between Shaping and Synthesis
We have shown that having more precise repair actions has a real impact on shaping time. In repair model CT, for fix shapes of size 3, the logical shaping time is approximately 150 attempts. In repair model CTET, for shapes of the same size, the average logical time jumps to around 4,000, which represents more than a ten-fold increase. Our work quantitatively highlights the impact of considering more precise repair actions.
By being aware of the interplay between shaping and synthesis, the research community will be able to create a disciplined catalog of repair actions and to identify where the biggest synthesis challenges lie.

C. Analyze the Repairability Depending on the Fix Size
We have shown that certain repair shapes are impossible to find because of their size. In repair model CT, shapes of more than 10 repair actions are not found in a finite time. In repair model CTET, repair shapes of more than 5 actions are not found either. Given that a repair shape is an abstraction over a concrete bug fix, if one cannot find the abstraction, there is no chance to find the concrete bug fix.

Our analysis for identifying this limit is agnostic of the repair actions. Hence, one can use our methodology and equation to analyze the size of the "findable" fixes. Our probabilistic framework enables one to understand the theoretical limits of certain repair processes.
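As an illustration of such an analysis, the following sketch computes, for a hypothetical per-action probability and a given attempt budget, the shape size beyond which fixes stop being findable (again under the geometric-median reading of the logical time used above):

    public class FindableFixSize {

        static long medianAttempts(double p) {
            return (long) Math.ceil(Math.log(0.5) / Math.log(1.0 - p));
        }

        public static void main(String[] args) {
            double perActionProbability = 0.10; // hypothetical frequent action
            long budget = 1_000_000;            // maximum affordable attempts

            for (int size = 1; size <= 10; size++) {
                // P(shape) decays exponentially with the shape size.
                double pShape = Math.pow(perActionProbability, size);
                long attempts = medianAttempts(pShape);
                System.out.printf("size=%d -> median attempts=%d%s%n",
                        size, attempts, attempts > budget ? " (not findable)" : "");
            }
        }
    }

With these illustrative values, fixes of up to 6 actions stay within the budget, while size 7 already requires about 7 million attempts: the "findable" limit is a direct consequence of the exponential decay of shape probabilities.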
    // insert 1
    if (a == 0) {              // ast 1
        // insert 2
        System.out.println(b); // ast 2
        // insert 3
    }
    // insert 4
    while (b != 0) {           // ast 3 (infinite loop)
        // insert 5
        if (a > b) {           // ast 4
            // insert 6
            a = a - b;         // ast 5
            // insert 7
        } else {
            // insert 8
            b = b - a;         // ast 6
            // insert 9
        }
        // insert 10
    }
    // insert 11
    System.out.println(a);     // ast 7
    // insert 12
    return;                    // ast 8
    // insert 13
    }

Listing 1. The infinite loop bug of Weimer et al. [2]. Code insertions can be made at 13 places; 8 AST subtrees can be deleted or copied.
Let us now apply those three guidelines to a small case study.
D. Case Study: Reasoning on GenProg within our Probabilistic Framework
We now aim at showing that our model also enables us to reason on Weimer et al.'s [2] example program. This program, shown in Listing 1, implements Euclid's greatest common divisor algorithm, but runs in an infinite loop if a = 0 and b > 0. The fix consists of adding a "return" statement on line 6.

a) Probability Distribution: In Weimer et al.'s repair approach, the repair model consists of three repair actions: inserting statements, deleting statements, and swapping statements (in more recent versions of GenProg, swapping has been replaced by "replacing"). By statements, they mean AST subtrees. With a uniform probability distribution, the logical time to find the correct shape is 4 (from Equation 1). If one favors insertion over deletion and swap, for instance by increasing p_insert at the expense of the other two actions, the median logical time to find the correct repair action becomes 2, which is twice as fast. Between 2 and 4, the difference seems negligible, but for larger repair models it might be counted in days, as we show now.

b) Shaping and Synthesis: In the GCD program, there are n_place = 13 places where n_ast = 8 AST statements can be inserted. In this case, the size of the synthesis space can be formally approximated: the number of possible insertions is n_place x n_ast; the number of possible deletions is n_ast; the number of possible swaps is (n_ast)^2. This enables us to apply our probabilistic reasoning at the level of concrete fixes as follows. We define the concrete repair distribution as: p_insert(ast_i, place_k) = p_insert / (n_place x n_ast); p_delete(ast_j) = p_delete / n_ast; p_swap(ast_i, ast_j) = p_swap / (n_ast)^2. With a uniform distribution p_insert = p_delete = p_swap = 1/3, formula 1 yields that the logical time to fix this particular bug (the insertion of the right node at the right place) is 219 runs. Recall that, empirically, p_insert > p_delete (see Table III). What if we distort the uniform distribution over the repair model to favor insertion? The following table gives the results for arbitrary distributions spanning different kinds of distribution:

p_insert  p_delete  p_swap  Logical time
  .33       .33      .33        219
  .39       .28      .33        185
  .45       .22      .33        160
  .40       .40      .20        180
  .50       .30      .20        144
  .60       .20      .20        120

This table shows that as soon as we favor insertion over deletion of code, the logical time to find the repair does actually decrease.

Interestingly, the same kind of reasoning applies to fault localization. Let us assume that a fault localizer filters out half of the possible places where to modify code (i.e. n_place = 7). Under the uniform distribution over the concrete repair space, the logical time to find the fix decreases from 219 to 118 runs.

c) Repairability and Fix Size: We consider the same model but on larger programs with fault localization, for instance 100 AST nodes and 20 potential places for changes. Let us assume that the concrete fix consists of inserting a single node: the logical time already grows to thousands of runs. For a concrete fix consisting of two repair actions (e.g. inserting two nodes), the probabilities multiply and the logical time quickly grows out of reach.
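Under the assumption that the logical time of Equation 1 is the median of a geometric distribution, i.e. ceil(ln 0.5 / ln(1 - p)), the numbers above can be reproduced mechanically; the sketch below prints exactly the table's values and the fault-localization figure:

    public class GcdCaseStudy {

        static final int N_PLACE = 13; // insertion points in Listing 1
        static final int N_AST = 8;    // AST subtrees in Listing 1

        static long logicalTime(double p) {
            return (long) Math.ceil(Math.log(0.5) / Math.log(1.0 - p));
        }

        public static void main(String[] args) {
            double[][] distributions = {
                {0.33, 0.33, 0.33}, {0.39, 0.28, 0.33}, {0.45, 0.22, 0.33},
                {0.40, 0.40, 0.20}, {0.50, 0.30, 0.20}, {0.60, 0.20, 0.20},
            };
            for (double[] d : distributions) {
                // Probability of the one concrete fix: inserting the right
                // AST subtree at the right place.
                double p = d[0] / (N_PLACE * N_AST);
                System.out.printf("p_insert=%.2f -> %d runs%n", d[0], logicalTime(p));
            } // prints 219, 185, 160, 180, 144, 120

            // Fault localization halves the candidate places (n_place = 7),
            // which roughly halves the logical time:
            System.out.println(logicalTime(0.33 / (7 * N_AST))); // 118
        }
    }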
VII. RELATED WORK

d) Empirical Studies of Versioning Transactions: Purushothaman and Perry [14] studied small commits (in terms of number of lines of code) of proprietary software at Lucent Technology. They showed the impact of small commits with respect to introducing new bugs, and whether they are oriented toward corrective, perfective or adaptive maintenance. German [11] asked different research questions on what he calls "modification requests" (small improvements or bug fixes), in particular with respect to authorship and change coupling (files that are often changed together). Alali and colleagues [13] discussed the relations between different size metrics for commits.

e) Abstract Syntax Tree Differencing:
The evaluation of AST differencing tools often gives hints about common change actions in software. For instance, Raghavan et al. [19] showed the six most common types of changes for the Apache web server and the GCC compiler, the number one being "altering existing function bodies". This example clearly shows the difference with our work: we provide change and repair actions at a much finer granularity. Similarly, Neamtiu et al. [20] give interesting numerical findings about software evolution, such as the evolution of added functions and global variables of C code; their analysis also remains at a coarser granularity than ours. Fluri et al. [7] give some frequency numbers of their change types in order to validate the accuracy and the runtime performance of their distilling algorithm. Those numbers were not, and were not meant to be, representative of the overall abundance of change types. Giger et al. [37] discuss the relations between 7 categories of change types, and not the detailed change actions as we do.

f) Automated Software Repair:
We have already mentioned many pieces of work on automated software repair (incl. [1], [2], [3], [4], [5], [38]). We have discussed in detail the relationship of our work with GenProg. Let us now compare with the other closest papers.

Wei et al. [3] presented AutoFix-E, an automated repair tool which works with contracts. In our perspective, AutoFix-E is based on two repair actions: adding sequences of state-changing statements (called "mutators") and adding a precondition (in the form of an "if" conditional). Their fix schemas are combinations of those two elementary repair actions. In contrast, we have 173 basic repair actions and we are able to predict repair shapes that consist of combinations of 4 repair actions. However, our approach is more theoretical than theirs. Our probabilistic view on repair may speed up their repair approach: it is likely that not all "fix schemas" are equivalent. For instance, according to our experience, adding a precondition is a very common kind of fix in real bugs.

Debroy and Wong [39] invented an approach to repair bugs using mutations inspired from the field of mutation testing. The approach uses a fault localization technique to obtain the candidate faulty locations. For a given location, it applies mutations, producing mutants of the program. Eventually, a mutant is classified as "fixed" if it passes the test suite of the program. Their repair actions are composed of mutations of arithmetic, relational, logical, and assignment operators. Compared to our work, mutating a program is a special kind of fix synthesis where no explicit high-level repair shapes are manipulated. Also, in the light of our results, we assume that a mutation-based repair process would be faster using probabilities on top of the mutation operators.
Kim et al. [40] introduced PAR, an algorithm that generates program patches using a set of 10 manually written fix templates. Like GenProg, the approach leverages evolutionary computing techniques to generate program patches. We share with PAR the idea of extracting repair knowledge from human-written patches. Beyond this high-level point in common, there are three important differences. First, they do a manual extraction of fix patterns (by reading 62,656 patches) while we automatically mine them from past commits. Second, PAR patterns and our repair actions are expressed at a different granularity. PAR patterns contain a specification of the context that matches a piece of AST, a specification of analysis (e.g. to collect compatible expressions in the current scope), and a specification of change. Our repair actions correspond to this last part. While their patterns are operational, their change specifications are ad hoc (due to the process of manually specifying templates). On the contrary, our specifications of repair actions are systematic and automatically extracted, but our approach is more theoretical and we do not fix concrete bugs. This shows again that the foundations of their approach contain more manual work than ours: a PAR pattern is a manually identified repair schema where all the synthesis rules are manually encoded. Finally, we think it is possible to marry our approaches by decorating their templates with probability distributions (whether mined or not) so as to speed up the repair.
VIII. CONCLUSION
In this paper, we have presented the idea that one can mine repair actions from software repositories. In other words, one can learn from past bug fixes the main repair actions (e.g. adding a method call). Those repair actions are meant to be generic enough to be independent of the kind of bug and the software domain. We have discussed and applied a methodology to mine the repair actions of 62,179 versioning transactions extracted from 14 repositories of 14 open-source projects. We have extensively discussed the rationale and consequences of adding a probability distribution on top of a repair model. We have shown that certain distributions over repair actions can result in an infinite time (on average) to find a repair shape, while other fine-tuned distributions enable us to find a repair shape in hundreds of repair attempts.

The main direction of future work consists of going beyond empirical results and theoretical analysis. We are now exploring how to use this learned knowledge (in the form of probabilistic repair models) to fix real bugs. In particular, we are planning to use probabilistic models to see whether one can repair the bugs of PAR's and GenProg's datasets faster. The latter involves having a Java implementation of GenProg and would advance our knowledge on whether GenProg's efficiency is really language-independent (segfaults and buffer overruns do not exist in Java).
REFERENCES

[1] W. Weimer, "Patches as better bug reports," in Proceedings of the International Conference on Generative Programming and Component Engineering, 2006.
[2] W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest, "Automatically finding patches using genetic programming," in Proceedings of the International Conference on Software Engineering, 2009.
[3] Y. Wei, Y. Pei, C. A. Furia, L. S. Silva, S. Buchholz, B. Meyer, and A. Zeller, "Automated fixing of programs with contracts," in Proceedings of the International Symposium on Software Testing and Analysis, ACM, 2010.
[4] V. Dallmeier, A. Zeller, and B. Meyer, "Generating fixes from object behavior anomalies," in Proceedings of the International Conference on Automated Software Engineering, 2009.
[5] A. Arcuri, "Evolutionary repair of faulty software," Applied Soft Computing, vol. 11, no. 4, pp. 3494-3514, 2011.
[6] D. Gopinath, M. Z. Malik, and S. Khurshid, "Specification-based program repair using SAT," in Proceedings of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2011.
[7] B. Fluri, M. Wursch, M. Pinzger, and H. Gall, "Change distilling: Tree differencing for fine-grained source code change extraction," IEEE Transactions on Software Engineering, vol. 33, pp. 725-743, Nov. 2007.
[8] M. Bóna, A Walk Through Combinatorics: An Introduction to Enumeration and Graph Theory. World Scientific, 2011.
[9] M. Martinez and M. Monperrus, "Mining repair actions for guiding automated program fixing," Tech. Rep., INRIA, 2012.
[10] A. Hindle, D. M. German, and R. Holt, "What do large commits tell us?: A taxonomical study of large commits," in Proceedings of the International Working Conference on Mining Software Repositories, 2008.
[11] D. M. German, "An empirical study of fine-grained software modifications," Empirical Software Engineering, vol. 11, no. 3, pp. 369-393, 2006.
[12] L. Hattori and M. Lanza, "On the nature of commits," in Proceedings of the 4th International ERCIM Workshop on Software Evolution and Evolvability (EVOL), pp. 63-71, 2008.
[13] A. Alali, H. Kagdi, and J. Maletic, "What's a typical commit? A characterization of open source software repositories," in Proceedings of the IEEE International Conference on Program Comprehension, 2008.
[14] R. Purushothaman and D. Perry, "Toward understanding the rhetoric of small source code changes," IEEE Transactions on Software Engineering, vol. 31, pp. 511-526, June 2005.
[15] B. Livshits and T. Zimmermann, "DynaMine: Finding common error patterns by mining software revision histories," in Proceedings of the European Software Engineering Conference held jointly with the International Symposium on Foundations of Software Engineering, 2005.
[16] R. Robbes, Of Change and Software. PhD thesis, University of Lugano, 2008.
[17] E. Giger, M. Pinzger, H. Gall, T. Xie, and T. Zimmermann, "Comparing fine-grained source code changes and code churn for bug prediction," in Working Conference on Mining Software Repositories, 2011.
[18] M. Monperrus and M. Martinez, "CVS-Vintage: A dataset of 14 CVS repositories of Java software," Tech. Rep. hal-00769121, INRIA, 2012.
[19] S. Raghavan, R. Rohana, D. Leon, A. Podgurski, and V. Augustine, "Dex: A semantic-graph differencing tool for studying changes in large code bases," in Proceedings of the 20th IEEE International Conference on Software Maintenance, 2004.
[20] I. Neamtiu, J. S. Foster, and M. Hicks, "Understanding source code evolution using abstract syntax tree matching," in Proceedings of the International Workshop on Mining Software Repositories, 2005.
[23] ..., in Proceedings of the International Symposium on Empirical Software Engineering and Measurement, 2010.
[24] K. Pan, S. Kim, and E. J. Whitehead, "Toward an understanding of bug fix patterns," Empirical Software Engineering, vol. 14, no. 3, pp. 286-315, 2008.
[25] R. Wu, H. Zhang, S. Kim, and S.-C. Cheung, "ReLink: Recovering links between bugs and changes," in Proceedings of the 2011 Foundations of Software Engineering Conference, pp. 15-25, 2011.
[26] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu, "Fair and balanced?: Bias in bug-fix datasets," in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE '09, pp. 121-130, ACM, 2009.
[27] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37-46, 1960.
[28] J. R. Landis and G. G. Koch, "The measurement of observer agreement for categorical data," Biometrics, vol. 33, no. 1, pp. 159-174, 1977.
[29] J. L. Fleiss, "Measuring nominal scale agreement among many raters," Psychological Bulletin, vol. 76, no. 5, pp. 378-382, 1971.
[30] C. Le Goues, W. Weimer, and S. Forrest, "Representations and operators for improving evolutionary software repair," in Proceedings of GECCO, pp. 959-966, 2012.
[31] A. Arcuri and L. Briand, "A practical guide for using statistical tests to assess randomized algorithms in software engineering," in Proceedings of the 33rd International Conference on Software Engineering, pp. 1-10, ACM, 2011.
[32] A. Hindle, D. German, M. Godfrey, and R. Holt, "Automatic classification of large changes into maintenance categories," in Proceedings of the International Conference on Program Comprehension, 2009.
[33] B. Fluri, E. Giger, and H. C. Gall, "Discovering patterns of change types," in Proceedings of the International Conference on Automated Software Engineering, 2008.
[34] S. Vaucher, H. Sahraoui, and J. Vaucher, "Discovering new change patterns in object-oriented systems," in Proceedings of the Working Conference on Reverse Engineering, 2008.
[35] S. Kim, K. Pan, and E. J. Whitehead, "Memories of bug fixes," in Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2006.
[36] C. C. Williams and J. K. Hollingsworth, "Automatic mining of source code repositories to improve bug finding techniques," IEEE Transactions on Software Engineering, vol. 31, no. 6, pp. 466-480, 2005.
[37] E. Giger, M. Pinzger, and H. C. Gall, "Can we predict types of code changes? An empirical analysis," in Proceedings of the Working Conference on Mining Software Repositories, pp. 217-226, 2012.
[38] A. Carzaniga, A. Gorla, N. Perino, and M. Pezzè, "Automatic workarounds for web applications," in Proceedings of the 2010 Foundations of Software Engineering Conference, pp. 237-246, ACM, 2010.
[39] V. Debroy and W. Wong, "Using mutation to automatically suggest fixes for faulty programs," in Proceedings of the International Conference on Software Testing, Verification and Validation (ICST), pp. 65-74, IEEE, 2010.
[40] D. Kim, J. Nam, J. Song, and S. Kim, "Automatic patch generation learned from human-written patches," in Proceedings of the 2013 International Conference on Software Engineering, pp. 802-811, IEEE Press, 2013.