Evaluating SZZ Implementations Through a Developer-informed Oracle
Giovanni Rosa*, Luca Pascarella†, Simone Scalabrino*, Rosalia Tufano†, Gabriele Bavota†, Michele Lanza†, and Rocco Oliveto*
*University of Molise, Italy
†Software Institute @ USI, Università della Svizzera italiana, Switzerland
Abstract—The SZZ algorithm for identifying bug-inducing changes has been widely used to evaluate defect prediction techniques and to empirically investigate when, how, and by whom bugs are introduced. Over the years, researchers have proposed several heuristics to improve the SZZ accuracy, providing various implementations of SZZ. However, fairly evaluating those implementations on a reliable oracle is an open problem: SZZ evaluations usually rely on (i) the manual analysis of the SZZ output to classify the identified bug-inducing commits as true or false positives; or (ii) a golden set linking bug-fixing and bug-inducing commits. In both cases, these manual evaluations are performed by researchers with limited knowledge of the studied subject systems. Ideally, there should be a golden set created by the original developers of the studied systems.

We propose a methodology to build a "developer-informed" oracle for the evaluation of SZZ variants. We use Natural Language Processing (NLP) to identify bug-fixing commits in which developers explicitly reference the commit(s) that introduced a fixed bug. This was followed by a manual filtering step aimed at ensuring the quality and accuracy of the oracle. Once built, we used the oracle to evaluate several variants of the SZZ algorithm in terms of their accuracy. Our evaluation helped us to distill a set of lessons learned to further improve the SZZ algorithm.
Index Terms—SZZ, Defect Prediction, Empirical Study
I. INTRODUCTION
The SZZ algorithm, proposed by Śliwerski, Zimmermann, and Zeller [1] at MSR 2005, identifies, given a bug-fixing commit C_BF, the commits that likely introduced the bug fixed in C_BF. These commits are termed "bug-inducing" commits. In essence, given C_BF as input, SZZ identifies the last change (commit) to each source code line changed in C_BF (i.e., changed to fix the bug). This is done by relying on the annotation/blame feature of versioning systems. The identified commits are considered as the ones that later on triggered the bug-fixing commit C_BF.

SZZ has been widely adopted to (i) design and evaluate defect prediction techniques [2]–[6], and to (ii) run empirical studies aimed at investigating under which circumstances bugs are introduced [7]–[10]. The relevance of the SZZ algorithm was recognized a decade later with a MIP (Most Influential Paper) award presented at the 12th Working Conference on Mining Software Repositories (MSR 2015).

Several researchers have proposed variants of the original algorithm, with the goal of boosting its accuracy [11]–[16]. For example, one issue with the basic SZZ implementation is that it considers changes to code comments and whitespaces like any other change. This means that if a comment is modified in C_BF, the latest change to that comment is mistakenly considered as a bug-inducing commit. An improvement by Kim et al. [11] was therefore to ignore changes to code comments and blank lines as candidate bug-inducing commits.

Despite the major advances made on the accuracy of SZZ, Alencar da Costa et al. [14] highlighted the major difficulties in fairly evaluating and comparing the SZZ variants proposed in the literature. They observed that the studies presenting and evaluating SZZ variants mostly rely on manual analysis of a small sample of SZZ results [1], [11]–[13], with the goal of evaluating its accuracy. Such an evaluation is usually performed by the researchers who, not being the original developers of the studied systems, do not always have the knowledge needed to correctly identify the bug-inducing commit. Also, due to the high cost of such a manual analysis, it is usually performed on a small sample of the identified bug-inducing commits. Other researchers built instead a ground truth to evaluate the performance of the SZZ algorithm [16]. However, also in these cases, the ground truth is produced by the researchers. Alencar da Costa et al. [14] called for evaluations performed with "domain experts (e.g., developers or testers)", reporting however that "such an analysis is impractical" since "the experts would need to verify a large sample of bug-introducing changes, which is difficult to scale up to the size of modern defect datasets" [14].

We present a methodology to build a "developer-informed" oracle for the evaluation of SZZ implementations. To explain its idea, let us take as example commit a8a97bd from the apache/thrift GitHub project, accompanied by a commit message saying:
"THRIFT-4513: fix bug in comparator introduced by e58f75d". The developer fixing the bug is explicitly documenting the commit that introduced such a bug. Based on this observation, we defined a number of strict NLP-based heuristics to automatically identify notes in bug-fixing commits in which developers explicitly reference the commit(s) that introduced the fixed bug. We applied these heuristics to a total of 19,603,736 commits mined through GH Archive [39], which archives all public events on GitHub.

Our goal with the above described process is not to be exhaustive, i.e., we do not want to identify all bug-fixing commits in which developers indicated the bug-inducing commit(s), but rather to obtain a high-quality dataset of bug-fixing commits for which the bug-inducing commit(s) are known with high confidence.
TABLE I: Variants of the SZZ algorithm. For each one, we specify (i) the algorithm on which it is based, (ii) references of works using it, (iii) the oracle used in the evaluation (how it was built, number of projects and bug fixes considered).

Approach name   Reference                   Based on   Used by                            Oracle type                      Projects   Bug fixes
B-SZZ           Śliwerski et al. [1]        //         [3], [4], [17]–[24]                //                               //         //
AG-SZZ          Kim et al. [11]             B-SZZ      [2], [8], [25]–[31]                Manually defined (researchers)   2          301
DJ-SZZ          Williams and Spacco [12]    AG-SZZ     [6], [7], [32]–[37]                Manually defined (researchers)   1          25
L-SZZ & R-SZZ   Davies et al. [13]          AG-SZZ     [14]                               Manually defined (researchers)   3          174
MA-SZZ          da Costa et al. [14]        AG-SZZ     [6], [9], [10], [15], [16], [38]   Automatically computed metrics   10         2,637
RA-SZZ          Neto et al. [15]            MA-SZZ     [5], [6], [15]                     Manually defined (researchers)   10         365
RA-SZZ*         Neto et al. [16]            RA-SZZ     None                               Manually defined (researchers)   10         365

We mined the time period between March 2011 and April 2020, obtaining 3,585 commits. To further increase the intrinsic quality of the dataset, we manually validated the 3,585 commits, to (i) verify if, from the commit message, it was clear that the developer was documenting the bug-inducing commit; and (ii) take note of any issue referenced in the commit message (e.g., issue THRIFT-4513 in the previous example). Information from the issue tracker is exploited by some of the SZZ implementations and we wanted our dataset to include it.

As output of this process, we obtained a dataset of 1,930 validated bug-fixing commits in which developers documented the commit(s) that introduced the bug, with 212 also including information about the fixed issue(s). To the best of our knowledge, our work is the first presenting a dataset for the SZZ evaluation built by using information about the bug-inducing commit(s) explicitly reported by the bug fixer.

We tested nine variants of SZZ on our dataset. Besides reporting their precision and recall, we analyzed their complementarity and focused on the set of bug-fixes where all SZZ variants fail. A qualitative analysis of those cases allowed us to distill lessons learned useful to further improve the SZZ algorithm in the future. Summarizing, our contributions are:

1) A methodology to build a "developer-informed" oracle for the evaluation of SZZ implementations, which does not require major manual efforts as compared to the classical manual identification of bug-inducing commits.
2) A first, easily extensible dataset built using our methodology and featuring 1,930 validated bug-fixing commits.
3) An empirical study comparing the effectiveness of several SZZ implementations.
4) A comprehensive replication package featuring (i) the dataset, and (ii) the implemented SZZ variants [40].
II. BACKGROUND AND RELATED WORK
We start by presenting several variants of the SZZ algorithm [1] proposed in the literature over the years. Then, we discuss how those variants have been used in the SE research community.
A. SZZ and its variants
Table I presents the SZZ variants proposed in the literature. We report for each of them its name and reference, the approach it builds upon (i.e., the starting point on which the authors provide improvements), some references to works that used it, and information about the oracle used for the evaluation. Specifically, we report how the oracle was built and the number of projects/bug reports considered.

All the approaches that aim at identifying bug-inducing commits (BICs) rely on two elements: (i) the revision history of the software project, and (ii) an issue tracking system (optional, needed only by some SZZ implementations).

The original SZZ algorithm was proposed by Śliwerski et al. [1] (we refer to it as B-SZZ, following the notation provided by da Costa et al. [14]). B-SZZ takes as input a bug report from an issue tracking system, and tries to find the commit that fixes the bug. To do this, B-SZZ uses two confidence levels: syntactic (possible references to the bug ID in the issue tracker) and semantic (e.g., the bug description is contained in the commit message). B-SZZ relies on the CVS diff command to detect the lines changed in the fix commit and the annotate command to find the commits in which the lines were modified. Using this procedure, B-SZZ determines the earlier change at the location of the fix. Potential bug-inducing commits performed after the bug was reported are always ignored.

Kim et al. [11] noticed that B-SZZ has limitations mostly related to formatting/cosmetic changes (e.g., moving a bracket to the next line). Such changes can deceive B-SZZ: (i) it can report as BIC a revision which only changed the code formatting, and (ii) it can consider as part of a bug-fix a formatting change unrelated to the actual fix. They introduced a variant (AG-SZZ) in which they used an annotation graph, a data structure associating the modified lines with the containing function/method. AG-SZZ also ignores the cosmetic parts of the bug-fixes to provide more precise results.

Williams and Spacco [12] improved the AG-SZZ algorithm in two ways: first, they use a line-number mapping approach [41] instead of the annotation graph introduced by Kim et al. [11]; second, they use DiffJ [42], a Java syntax-aware diff tool, which allows their approach (which we call DJ-SZZ) to exclude non-executable changes (e.g., import statements).

Davies et al. [13] propose two variations on the criterion used to select the BIC among the candidates: L-SZZ uses the largest candidate, while R-SZZ uses the latest one. These improvements were done on top of the AG-SZZ algorithm.

MA-SZZ, introduced by da Costa et al. [14], excludes from the candidate BICs all the meta-changes, i.e., commits that do not change the source code. This includes (i) branch changes, which are copy operations from one branch to another, (ii) merge changes, which consist in applying the changes performed in a branch to another one, and (iii) property changes, which only modify file properties (e.g., permissions).

To further reduce the false positives, two new variants were introduced by Neto et al., RA-SZZ [15] and RA-SZZ* [16]. Both exclude from the BIC candidates the refactoring operations, i.e., changes that should not modify the behavior of the program. Both approaches use state-of-the-art tools: RA-SZZ uses RefDiff [43], while RA-SZZ* uses Refactoring Miner [44], with the second one being more effective [16].

The original SZZ was not empirically evaluated [1]. Instead, all its variants, except MA-SZZ, were manually evaluated by their authors.
One of them, RA-SZZ* [16], used an external dataset, i.e., Defects4J [45]. MA-SZZ was evaluated using automated metrics, namely earliest bug appearance, future impact of a change, and realism of bug introduction [14]. In Table II we list the open-source implementations of SZZ.
Tool name            Approach         Public repository
SZZ Unleashed [33]   ∼ DJ-SZZ [12]    https://github.com/wogscpar/SZZUnleashed
OpenSZZ [46]         ∼ B-SZZ [1]      https://github.com/clowee/OpenSZZ
PYDRILLER [47]       ∼ AG-SZZ [11]    https://github.com/ishepard/pydriller
TABLE II: Open-source tools implementing SZZ.

SZZ Unleashed [33] partially implements DJ-SZZ: it uses line-number mapping [12], but it does not rely on DiffJ [42] for computing diffs, also working on non-Java files. It does not take into account meta-changes [14] and refactorings [16]. OpenSZZ [46] implements the basic version of the approach, B-SZZ. Since it is based on the git blame command, it implicitly uses the annotation graph [11]. PYDRILLER [47], a general purpose tool for analyzing git repositories, also implements B-SZZ. It uses a simple heuristic for ignoring C- and Python-style comment lines, as proposed by Kim et al. [11]. We do not report in Table II a comprehensive list of all the SZZ implementations that can be found on GitHub, but only the ones presented in papers.
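For instance, PyDriller exposes this SZZ-style computation through its Git class. The following is a minimal sketch (the repository path and commit hash are placeholders, and the exact API may differ across PyDriller versions):

```python
from pydriller import Git

# Placeholder path to a locally cloned repository and a known bug-fixing commit.
gr = Git("/path/to/cloned/repo")
fix_commit = gr.get_commit("a8a97bd")

# Maps each file modified in the fix to the set of commits that last touched
# the lines changed by the fix, i.e., the candidate bug-inducing commits.
candidates = gr.get_commits_last_modified_lines(fix_commit)
for path, commit_hashes in candidates.items():
    print(path, sorted(commit_hashes))
```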
B. SZZ in Software Engineering Research
The original SZZ algorithm and its variations were used in a plethora of studies. We discuss some examples, while for a complete list we refer to the extensive literature review by Rodríguez-Pérez et al. [37], featuring 187 papers.

SZZ has been used to run several empirical investigations having different goals [7]–[10], [17], [18], [20], [22]–[25], [27]–[31], [35], [37]. For example, Aman et al. [9] studied the role of local variable names in fault-introducing commits and they used SZZ to retrieve such commits, while Palomba et al. [17] focused on the impact of code smells, and used SZZ to determine whether an artifact was smelly when a fault was introduced. Many studies also leverage SZZ to evaluate defect prediction approaches [2]–[6], [19], [21], [26], [34], [38].

Looking at Table I it is worth noting that, despite its clear limitations [11], many studies, even recent ones, still rely on B-SZZ [3], [4], [17]–[24] (the approaches that use git implicitly use the annotation graph defined by Kim et al. [11]). Improvements are only slowly adopted in the literature, possibly due to the fact that some of them are not released as tools and that the two standalone tools providing a public SZZ implementation were released only recently [33], [46].

The studies most similar to ours are the one by da Costa et al. [14] and the one by Rodríguez-Pérez et al. [36]. Both report a comparison of different SZZ variants. Da Costa et al. [14] defined and used a set of metrics for evaluating SZZ implementations without relying on a manually defined oracle. However, they specify that, ideally, domain experts should be involved in the construction of the dataset [14], which motivated our study. Rodríguez-Pérez et al. [37] introduced a model for distinguishing bugs caused by modifications to the source code (the ones that SZZ algorithms can detect) and the ones that are introduced due to problems with external dependencies. They also used the model to define a manually curated dataset on which they evaluated SZZ variants. Their dataset is created by researchers and not domain experts. In our study, instead, we rely on the explicit information provided by domain experts in their commit messages.
III. BUILDING A DEVELOPER-INFORMED DATASET OF BUG-INDUCING COMMITS
We present a methodology to build a dataset of bug-inducing commits by exploiting information provided by developers when fixing bugs. Our methodology reduces the manual effort required for building such a dataset and, more importantly, does not assume technical knowledge of the involved source code on the researchers' side.

The proposed methodology involves two main steps: (i) automatic mining from open-source repositories of bug-fixing commits in which developers explicitly indicate the commit(s) that introduced the fixed bug, and (ii) a manual filtering aimed at improving the dataset quality by removing ambiguous commit messages that do not give confidence in the information provided by the developer. In the following, we detail these two steps. The whole process is depicted in Fig. 1.
A. Mining Bug-fixing and Bug-inducing Commits
There are two main approaches proposed in the literature for selecting bug-fixing commits. The first one relies on the linking between commits and issues [48]: issues labeled with "bug", "defect", etc. are mined from the issue tracking system, storing their issue ID (e.g., THRIFT-4513). Then, commits referencing the issue ID are mined from the versioning system and identified as bug-fixing commits. While such a heuristic is fairly precise, it has two important drawbacks that make it unsuitable for our work. First, the link to the issue tracking system must be known, and a specific crawler for each different type of issue tracker (e.g., Jira, Bugzilla, GitHub, etc.) must be built. Second, projects can use a customized set of labels to indicate bug-related issues. Manually extracting this information for a large set of repositories is expensive. The basic idea behind this first phase is therefore to use the commit messages to identify bug-fixing commits: we automatically analyze commit messages searching for those explicitly referencing bug-inducing commits.
Fig. 1: Process used for building the dataset.

As a preliminary step, we mined GH ARCHIVE [39], which provides, on a regular basis, a snapshot of public events generated on GitHub in the form of JSON files. We mined the time period going from March 1st, 2011 to April 2020, focusing on push events: such events gather the commits done by a developer on a repository before performing the push action. Considering the goal of building an oracle for SZZ algorithms, we are not interested in any specific programming language. We performed three steps to select a candidate set of commits to manually analyze in the second phase: (i) we selected a first candidate set of bug-fixing commits, (ii) we used syntax-aware heuristics to refine such a set, and (iii) we removed duplicates.
1) Word-Based Bug-Fixing Selection:
To identify bug-fixing commits, we first apply a lightweight regular expression on all the commits we gathered, as done in previous work [49], [50]. We mark as potential bug-fixes all commits accompanied by a message including at least a fix-related word (i.e., "fix" or "solve") and a bug-related word (i.e., "bug", "issue", "problem", "error", or "misfeature"). We exclude the messages that include the word merge to ignore merge commits. Note that we do not need such a heuristic to be 100% precise, since two additional and more precise steps will be performed on the identified set of candidate fixing commits to exclude false positives (i.e., an NLP-based step and a manual analysis).
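A minimal sketch of this word-based filter follows; the word families come from the paper, while the exact regular expressions and the function name are our assumptions:

```python
import re

# Word families from the paper: fix-related and bug-related terms (any declension).
FIX_RE = re.compile(r"\b(fix|solv)\w*\b", re.IGNORECASE)
BUG_RE = re.compile(r"\b(bug|issue|problem|error|misfeature)\w*\b", re.IGNORECASE)
MERGE_RE = re.compile(r"\bmerge\b", re.IGNORECASE)

def is_candidate_bug_fix(message: str) -> bool:
    """Coarse filter: keep messages containing a fix-related and a bug-related
    word, excluding merge commits. Precision is refined by later NLP/manual steps."""
    if MERGE_RE.search(message):
        return False  # ignore merge commits
    return bool(FIX_RE.search(message)) and bool(BUG_RE.search(message))
```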
2) Syntax-Aware Filtering:
We needed to select from the set of candidate bug-fixing commits only the ones in which developers likely documented the bug-inducing commit(s). We used the syntax-aware heuristics described below to do this. The first author defined such heuristics through a trial-and-error procedure, taking a 1-month time period of events on GH Archive to test and refine different versions of the heuristics, manually inspecting the achieved results after each run. The final version has been consolidated with the feedback of two additional authors.
As a preliminary step, we used the doc.sents function of the SPACY Python module for NLP (https://spacy.io/) to extract the set S_c of sentences composing each commit message c. For each sentence s_i ∈ S_c, we used SPACY to build its word dependency tree t_i, i.e., a tree containing the syntactic relationships between the words composing the sentence. Fig. 2 provides an example of t_i generated for the sentence "fixes a search bug introduced by 2508e12".

Fig. 2: Example of word dependency tree built by SPACY.
By navigating the word dependency tree, we can infer that the verb "fix" refers to the noun "bug", and that the verb "introduced" is linked to the commit id through the "by" apposition.
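A small sketch of this step with spaCy (assuming the en_core_web_sm pipeline, which includes a dependency parser; the hash regex is a simplification):

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")       # English pipeline with a dependency parser
HASH_RE = re.compile(r"^[0-9a-f]{7,40}$")  # simplified commit-hash pattern

doc = nlp("fixes a search bug introduced by 2508e12")
for sent in doc.sents:  # the doc.sents generator mentioned above
    for token in sent:
        if HASH_RE.match(token.text):
            # Ancestors of the hash token in the dependency tree: for the
            # sentence above they include "introduced" and "fixes".
            print(token.text, "->", [t.lemma_ for t in token.ancestors])
```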
H1: Exclude Commits Without Reference and Reverts. We split each s_i ∈ S_c into words and we select all its commit hashes H(s_i) using a regular expression matching hexadecimal strings (based on the character class [0-9a-f]). We ignore all the s_i for which H(s_i) is empty (i.e., which do not mention any commit hash). Similarly, we filter out all the s_i that either (i) start with a commit hash, or (ii) include the verb "revert" referring to any h_j ∈ H(s_i). We keep all the remaining s_i. We exclude the commits that do not contain any valid sentence according to this heuristic. We use the H(s_i) extracted with this heuristic also for the following heuristics.
H2: Coarsely Filter Explicit Introducing References. If one of the ancestors of h_j is the verb "introduce" (in any declension), as it happens in Fig. 2, we consider this as a strong indication of the fact that the developer is indicating h_j as (one of) the bug-inducing commit(s). In this case, we check if h_j also includes at least one of the fix-related words and one of the bug-related words as one of its ancestors or children. At least one of the two words (i.e., the one indicating the fixing activity or the one referring to a bug) must be an ancestor. We do this to avoid erroneously selecting sentences such as "Improving feature introduced in 2508e12 and fixed a bug", in which both the fix-related and the bug-related word are children of h_j. For example, the h_j in Fig. 2 meets this constraint since it has among its ancestors both fix and bug. We also exclude the cases in which the words attempt or test (again, in different declensions) appear as ancestors of h_j. We do this to exclude false positives observed while experimenting with earlier versions of this heuristic. For example, the sentence "Remove attempt to fix error introduced in 2f780609" belongs to a commit that aims at reverting previous changes.
Similarly, the sentence "Add tests for the fix of the bug introduced in 2f780609" most likely belongs to the message of a test-introduction commit.
H3: Finely Filter Non-Explicit Introducing References. If h_j does not contain the verb "introduce" as one of its ancestors, we apply a finer filtering heuristic: both a word indicating a fixing activity and a word indicating a bug must appear as one of h_j's ancestors. Also, we define a list of stop-words that must not appear either in the h_j's ancestors or in the dependencies (i.e., ancestors and children) of the "fixing activity" word. Such a stop-word list, derived through a trial-and-error procedure, includes eight additional words (was, been, seem, solved, fixed, try, trie (to capture tries and tried), and by), besides attempt and test also used in H2. This allows, for example, to exclude sentences such as "This definitely fixes the bug I tried to fix in commit 26f3fe2", which meets all selection criteria for H3 but is a false positive.
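As an illustration, the H2 check can be expressed on top of the spaCy dependency tree built in the previous sketch (word lists from the paper; function and variable names are ours):

```python
FIX_WORDS = {"fix", "solve"}
BUG_WORDS = {"bug", "issue", "problem", "error", "misfeature"}
H2_STOP_WORDS = {"attempt", "test"}

def passes_h2(hash_token) -> bool:
    """Sketch of H2 on a spaCy token that looks like a commit hash: 'introduce'
    must be an ancestor, a fix-related and a bug-related word must appear among
    ancestors/children (at least one as ancestor), and no stop-word ancestor."""
    ancestors = {t.lemma_.lower() for t in hash_token.ancestors}
    children = {t.lemma_.lower() for t in hash_token.children}
    if "introduce" not in ancestors or ancestors & H2_STOP_WORDS:
        return False  # not an explicit reference, or an attempt/test sentence
    deps = ancestors | children
    has_both = bool(deps & FIX_WORDS) and bool(deps & BUG_WORDS)
    one_is_ancestor = bool(ancestors & FIX_WORDS) or bool(ancestors & BUG_WORDS)
    return has_both and one_is_ancestor
```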
3) Duplicate Deletion:
We saved the list of commits including at least one sentence s_i meeting H1 and either H2 or H3 in a MySQL database. Since we analyzed a large set of projects, it was frequent that some commits were duplicated, due to the fact that different forks of a given project are available. As a final step, we removed such duplicates, keeping only the commit of the main project repository.

Out of the 19,603,736 parsed commits, the automated filtering selected 3,585 commits. Our goal with the above described process is not to be exhaustive, i.e., we do not want to identify all bug-fixing commits in which developers indicated the bug-inducing commit(s), but rather to obtain a high-quality dataset of bug-fixing commits for which the bug-inducing commit(s) are known with high confidence. The quality of the dataset is then further increased during the subsequent step of manual analysis.

B. Manual Analysis
Four of the authors (from now on, evaluators) manually inspected the 3,585 commits produced by the previous step. The evaluators have different backgrounds (graduate student, faculty member, junior and senior researcher with two years of industrial experience). The goal of the manual validation was to verify (i) whether the commit was an actual bug-fix, and (ii) if it included in the commit message a non-ambiguous sentence clearly indicating the commit(s) in which the fixed bug was introduced. For both steps the evaluators mostly relied on the commit message and, if available, on possible references to the issue tracker. Those references could be issue IDs or links that the evaluators inspected to (i) ensure that the fixed issue was a bug, and (ii) store for each commit the links to the mentioned issues and, for each issue, its opening date. The latter is an information that may be required by an SZZ implementation (e.g., SZZ Unleashed [33] and OpenSZZ [46] require the link to the issue) to exclude from the candidate list of bug-inducing commits those performed after the opening of the fixed issue. Indeed, if the fixed bug has already been reported at date d_i, a commit performed on date d_j > d_i cannot be responsible for its introduction.

Since the commits to inspect come from a variety of software systems, they rely on different issue trackers. When an explicit link was not available but an issue was mentioned in the commit message (e.g., see the commit message shown in the introduction), the evaluators searched for the project's issue tracker, looking on the GitHub repository for documentation pointing to it (in case the project did not use the GitHub issue tracker itself). If no information was found, an additional Google search was performed, looking for the project website or directly searching for the issue ID mentioned in the commit message.

The manual validation was supported by a web-based application we developed that assigns to each evaluator the candidate commits to review, showing for each of them its commit message and a clickable link to the commit GitHub page. Using a form, the evaluator indicated whether the commit was relevant for the oracle (i.e., an actual bug-fix documenting the bug-inducing commit) or not, listing the mentioned issues together with their opening date. Each commit was assigned by the web application to two different evaluators, for a total of 7,170 evaluations. To be more conservative and to have higher confidence in our oracle, we decided to not resolve conflicts (i.e., cases in which one evaluator marked the commit as relevant and the other as irrelevant): we excluded from our oracle all commits with at least one "irrelevant" flag.

C. The Obtained SZZ Oracles
Out of the 3,585 manually validated commits, 1,930 (55.6%) passed our manual filtering, of which 212 include references to a valid issue (i.e., an issue labeled as a bug that can be found online). This indicates that SZZ implementations that rely on information from issue trackers can only be run on a minority of bug-fixing commits. Indeed, the 1,930 instances we report have been manually checked as true positive bug-fixes, and only 212 of these (11.0%) mention the fixed issue. The dataset is available in our replication package [40].

These 1,930 commits and their related bug-inducing commits impact files written in many different languages. All the implementations of the SZZ algorithm (except for B-SZZ) perform some language-specific parsing to ignore changes performed to code comments. In our study (Section IV) we experimented with several versions of the SZZ, including those requiring the parsing of comments. We implemented support for the top-8 programming languages present in our oracle (i.e., the ones responsible for more code commits): C, Python, C++, JS, Java, PHP, Ruby, and C#. Table III reports the features of the overall dataset and of the language-filtered one.
Language   Overall            Language-filtered
C          350   433   52     297   366   41
Python     271   304   36     249   279   35
C++        198   241   31     138   162   20
JS         169   180   26     127   135   18
Java        88   101   14      72    80   10
PHP         63    71    6      56    64    5
Ruby        43    47    5      36    37    4
C#
Total
TABLE III: Features of the language-filtered / overall datasets.

It is worth noting that a repository or even a commit can involve several programming languages: for this reason, the total may be lower than the sum of the per-language values (i.e., a repository can be counted in two or more languages). Besides sharing the datasets as JSON files, we also share the cloned repositories from which the bug-fixing commits have been extracted. This enables the replication of our study and the use of the datasets for the assessment of future SZZ improvements.

IV. STUDY DESIGN
The goal of this study is to experiment with several implementations of the SZZ algorithm on the previously defined language-filtered dataset (context of our study). The perspective is that of researchers interested in assessing the effectiveness of the state-of-the-art implementations and identifying possible improvements that can further increase the accuracy of the SZZ algorithm. To achieve such a goal, we aim to answer the following research question:
How do different variants of SZZ perform in identifying bug-inducing changes?

A. Data Collection
We focused our experiment on several variants of the SZZ algorithm. Specifically, we (i) re-implemented all the main approaches available in the literature (presented in Section II) in a new tool, and (ii) adapted three existing tools (PYDRILLER [47], SZZ Unleashed [33], and OpenSZZ [46]) to work with our dataset. We provide in our replication package [40] both our tool and the adapted versions of the other tools, including detailed instructions on how to run them.

We report the details about all the implementations we compare in Table IV and, for each of them, we explicitly mention (i) how it filters the lines changed in the fix (e.g., it removes cosmetic changes), (ii) which methodology it uses for identifying the preliminary set of bug-inducing commits (e.g., annotation graph), (iii) how it filters such a preliminary set (e.g., it removes meta-changes), and (iv) if it uses a heuristic for selecting a single bug-inducing commit and, if so, which one (e.g., most recent commit). We also explicitly mention any difference between our implementations and the approaches as described in the original papers presenting them.

It is worth noting that we intentionally made all our re-implementations optionally independent from the issue-tracker systems: we did this because most of the instances of our dataset do not provide links to the bug report. However, we experimented with all techniques both with and without such a filtering applied. As for the tools, instead, we did not modify their implementation of the BIC-finding procedures: e.g., we did not remove the filtering by issue date from SZZ Unleashed. On the other hand, we implemented wrappers for such tools that allowed us to run them with our dataset. SZZ Unleashed depends on a specific issue-tracker system (i.e., Jira) for filtering commits done after the bug-report was opened. We made it independent from it by adapting our datasets to the input it expects
(i.e., Jira issues in JSON format). It is worth noting that, despite the complexity of such files, SZZ Unleashed only uses the issue opening date in its implementation. For this reason, we only provide such a field and we set the others to null.

Note that some of the original implementations listed in Table IV can identify bug-fixing commits. In our study, we did not want to test such a feature: we test a scenario in which the implementations already have the bug-fixing commits for which they should detect the bug-inducing commit(s).

To evaluate the previously described implementations, we defined two datasets extracted from the language-filtered dataset: (i) the oracle_all dataset, featuring 1,115 bug-fixes, which includes both the ones with and without issue information, and (ii) the oracle_issues dataset, featuring 129 instances, which includes only instances with issue information. Also, we defined two additional datasets, oracle_J_all (80 instances) and oracle_J_issues (10 instances), obtained by considering only Java-related commits from oracle_all and oracle_issues, respectively. We did this because two implementations, i.e., RA-SZZ* (which relies on Refactoring Miner [51] and, thus, only works on Java files) and OpenSZZ, only work on Java files.

We ran all the implementations on all the datasets on which they can be executed (i.e., we did not run RA-SZZ* and OpenSZZ on the datasets including non-Java files). It is worth noting that SZZ Unleashed requires the issue date in order to work, so it would not be possible to run it on the oracle_all dataset. To avoid this problem, we simulated the best-case scenario for such commits: we pretended that an issue about the bug was created a few seconds after the last bug-inducing commit was done. Consider the bug-fixing commit BF without issue information and its set of bug-inducing commits BIC; we assumed that the issue mentioned in BF had max_{b ∈ BIC}(date(b)) + δ as opening date, where δ is a small time interval (we used 60 seconds).
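A sketch of the simulated issue date and of the date-based filtering follows; commit objects exposing a committer_date attribute (as in PyDriller) are an assumption of this snippet:

```python
from datetime import timedelta

def simulated_issue_date(true_bics, delta=timedelta(seconds=60)):
    """Best-case scenario for bug-fixes without issue data: pretend the issue
    was opened delta after the last true bug-inducing commit."""
    return max(b.committer_date for b in true_bics) + delta

def filter_candidates_by_issue_date(candidates, issue_opened_at):
    """Issue date filter: a commit performed after the bug was reported
    (d_j > d_i) cannot be responsible for its introduction."""
    return {c for c in candidates if c.committer_date <= issue_opened_at}
```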
B. Data Analysis

Given the defined oracle and the set of bug-inducing commits detected by the experimented implementations, we evaluated their accuracy by using two widely-adopted Information Retrieval (IR) metrics, namely recall and precision [52].
TABLE IV: Characteristics of the SZZ implementations we compare in our study. We mark with a "◆" our re-implementations.

Acronym     Fix Line Filtering                         BIC Identification Method   BIC Filtering       BIC Selection   Differences w.r.t. the original paper
◆ B-SZZ     //                                         Annotation Graph [11]       //                  //              We use git blame instead of the CVS annotate, i.e., we implicitly use an annotation graph [11]. We do not filter BICs based on the issue creation date.
◆ AG-SZZ    Cosmetic changes [11]                      Annotation Graph [11]       //                  //              No differences.
◆ MA-SZZ    Cosmetic changes [11]                      Annotation Graph [11]       Meta-Changes [14]   //              No differences.
◆ L-SZZ     Cosmetic changes [11]                      Annotation Graph [11]       Meta-Changes [14]   Largest [13]    We filter meta-changes [14].
◆ R-SZZ     Cosmetic changes [11]                      Annotation Graph [11]       Meta-Changes [14]   Latest [13]     We filter meta-changes [14].
◆ RA-SZZ*   Cosmetic changes [11], Refactorings [16]   Annotation Graph [11]       Meta-Changes [14]   //              We use Refactoring Miner 2.0 [51].
SZZ@PYD     Cosmetic changes [11]                      Annotation Graph [11]       //                  //              We implement a wrapper for PYDRILLER [47].
SZZ@UNL     Cosmetic changes [11]                      Line-number Mapping [12]    Issue-date [1]      //              We implement a wrapper for SZZ Unleashed [33].
SZZ@OPN     //                                         Annotation Graph [11]       //                  //              We implement a wrapper for OpenSZZ [46].
We computed them using the following formulas:

recall = |correct ∩ identified| / |correct|

precision = |correct ∩ identified| / |identified|

where correct and identified represent the set of true positive bug-inducing commits (those indicated by the developers in the commit message) and the set of bug-inducing commits detected by the experimented algorithm, respectively. As an aggregate indicator of precision and recall, we report the F-measure [52], defined as the harmonic mean of precision and recall. Such metrics were also used in previous work for evaluating SZZ variants (e.g., Neto et al. [16]).
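These metrics can be computed directly over sets of commit hashes; a minimal sketch (the function name is ours):

```python
def precision_recall_f1(correct: set, identified: set):
    """IR metrics over the oracle (correct) and the SZZ output (identified)."""
    tp = len(correct & identified)  # true positive bug-inducing commits
    recall = tp / len(correct) if correct else 0.0
    precision = tp / len(identified) if identified else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```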
Given the set of experimented SZZ variants/tools SZZ_exp = {v_1, v_2, ..., v_n}, we also analyze their complementarity by computing the following metrics for each v_i [53]:

correct_{v_i ∩ v_j} = |correct_{v_i} ∩ correct_{v_j}| / |correct_{v_i} ∪ correct_{v_j}|

correct_{v_i \ (SZZ_exp \ v_i)} = |correct_{v_i} \ correct_{(SZZ_exp \ v_i)}| / |correct_{v_i} ∪ correct_{(SZZ_exp \ v_i)}|

where correct_{v_i} represents the set of correct bug-inducing commits detected by v_i and correct_{(SZZ_exp \ v_i)} the correct bug-inducing commits detected by all other techniques but v_i. correct_{v_i ∩ v_j} measures the overlap between the sets of correct bug-inducing commits identified by two given implementations: we computed it between all the pairs of SZZ variants and present the results using a heatmap. correct_{v_i \ (SZZ_exp \ v_i)}, instead, measures the correct bug-inducing commits identified only by technique v_i and missed by all others.

It is worth clarifying that, when we compute the overlap metrics, we compare all the implementations among them on the same dataset. This means, for example, that we do not compute the overlap between a variant tested on oracle_all and another variant tested on oracle_issues.

As a last step in our analysis, we compute the set of bug-fixing commits for which none of the experimented techniques was able to correctly identify the bug-inducing commit(s). We qualitatively discuss these cases to understand their peculiarities and point to future improvements of the SZZ algorithm.

V. RESULTS DISCUSSION
Table V reports the results achieved by the experimented SZZ variants and tools.

TABLE V: Precision, recall, and F-measure calculated for all SZZ algorithms. † Java only files.

No issue date filter
Algorithm    oracle_all (Recall / Precision / F1)   oracle_issue (Recall / Precision / F1)
B-SZZ        0.69 / 0.39 / 0.50                     0.69 / 0.38 / 0.49
AG-SZZ       0.60 / 0.45 / 0.52                     0.62 / 0.43 / 0.51
L-SZZ        0.45 / 0.52 / 0.48                     0.43 / 0.49 / 0.46
R-SZZ        0.57 / 0.66 / 0.61                     0.56 / 0.64 / 0.60
MA-SZZ       0.64 / 0.36 / 0.46                     0.65 / 0.36 / 0.47
†RA-SZZ*
†SZZ@OPN     0.19 / 0.32 / 0.24                     0.10 / 0.50 / 0.17

With date filter
Algorithm    oracle_all (Recall / Precision / F1)   oracle_issue (Recall / Precision / F1)
B-SZZ        0.69 / 0.42 / 0.53                     0.69 / 0.39 / 0.50
AG-SZZ       0.60 / 0.49 / 0.54                     0.62 / 0.44 / 0.52
L-SZZ        0.45 / 0.54 / 0.49                     0.43 / 0.50 / 0.46
R-SZZ        0.57 / 0.73 / 0.64                     0.56 / 0.67 / 0.61
MA-SZZ       0.64 / 0.39 / 0.48                     0.65 / 0.37 / 0.47
†RA-SZZ*
†SZZ@OPN     0.19 / 0.33 / 0.24                     0.10 / 0.50 / 0.17
The top part of the table shows the results when the issue date filter has not been applied, while the bottom part relates to the application of the date filter. With "issue date filter" we refer to the process through which we remove from the list of candidate bug-inducing commits returned by a given technique all those performed after the issue documenting the bug has been opened. Those are known to be false positives. For this reason, such a filter is expected to not have any impact on recall (since the discarded bug-inducing commits should all be false positives) while increasing precision. The left part of Table V shows the results achieved on oracle_all, while the right part focuses on oracle_issue.

The first result to extrapolate from Table V is the general trend concerning the performance of the SZZ implementations. When not using the issue date filtering (top part), the highest achieved F-Measure is 61% (R-SZZ). R-SZZ uses the annotation graph, ignores cosmetic changes and meta-changes, and only considers as the bug-inducing commit the latest change that impacted a line changed to fix the bug. Such a combination of heuristics makes R-SZZ the most precise on both oracles, achieving a 66% precision on oracle_all and 64% on oracle_issue. With respect to the recall/precision tradeoff, there is a price to pay in terms of recall that, however, is not dramatically worse compared to the best approach in terms of recall: SZZ@UNL (SZZ Unleashed). The latter achieves a 72% recall on both the oracle_all and oracle_issue datasets, with, however, a precision of 9% and 6%, respectively. We investigated the reasons behind such a low precision, finding that it is mainly due to a set of outlier bug-fixing commits for which SZZ@UNL identifies a high number of (false positive) bug-inducing commits. For example, three bug-fixing commits are responsible for 72 identified bug-inducing commits, out of which only three are correct. We analyzed the distribution of bug-inducing commits reported by SZZ@UNL for the different bug-fixing commits. Cases for which more than 20 bug-inducing commits are identified for a single bug-fix can be considered outliers. By ignoring those cases, the recall and precision of SZZ@UNL are 67% and 18%, respectively, on oracle_all, and 67% and 17% on oracle_issue. By lowering the outlier threshold to 10 bug-inducing commits, the precision grows in both datasets to 24%. We believe that the low precision of SZZ@UNL may be due to misbehavior of the tool in a few isolated cases.
Two implementations (i.e., RA-SZZ* and SZZ@OPN) only work with Java files. In this case, we computed their recall and precision by only considering the bug-fixing commits impacting Java files. Both of them exhibit limited recall and precision. While this is due in part to limitations of the implementations, it is also worth noting that the number of Java-related commits in our datasets is quite limited (i.e., 80 in oracle_all and only 10 in oracle_issue). Thus, failing on a few of those cases penalizes in terms of performance metrics. Still, we found the low precision of RA-SZZ* surprising, considering the expensive mechanism it uses to limit false positives (i.e., ignoring lines impacted by refactoring operations detected by Refactoring Miner [51]).

B-SZZ, the simplest SZZ version, exhibits a good recall of 69% on both datasets, making it the second-best algorithm after SZZ@UNL. Nonetheless, B-SZZ pays in precision, making it the fourth algorithm together with the PyDriller implementation for oracle_all and the fifth for oracle_issue. The similarity between B-SZZ and the PyDriller implementation results in very similar performances.

AG-SZZ, L-SZZ, and MA-SZZ exhibit, as compared to others, good performance for both recall and precision. These algorithms provide a good balance between recall and precision, as also shown by their F-Measure. The issue date filter mostly improves precision on the oracle_all dataset for RA-SZZ* (+8%) and R-SZZ (+7%). This boosts the latter to a very good 73% precision on oracle_all, and 67% on oracle_issue (+3%).

To summarize the achieved results: we found that R-SZZ is the most precise variant on our datasets, with a precision of
∼70% when the issue date filter is applied. Thus, we recommend it when precision is more important than recall (e.g., when a set of bug-inducing commits must be mined for qualitative analysis). SZZ@UNL ensures instead a high recall at, however, a high precision cost. If the focus is on recall, the current recommendation is to rely on B-SZZ, using, for example, the implementation provided by PyDriller. Finally, looking at Table V, it is clear that there are still margins of improvement for the accuracy of the SZZ algorithm. We discuss possible directions for future work in Section V-A.

Table VI shows the correct_{v_i \ (SZZ_exp \ v_i)} metric we computed for each SZZ variant v_i.

TABLE VI: Bug-inducing commits correctly identified exclusively by the v_i algorithm. † Java only files.
Algorithm    No date filter (oracle_all, oracle_issue)   With date filter (oracle_all, oracle_issue)
B-SZZ        0/804, 0/94                                 0/804, 0/94
AG-SZZ       0/804, 0/94                                 0/804, 0/94
L-SZZ        0/804, 0/94                                 0/804, 0/94
R-SZZ        0/804, 0/94                                 0/804, 0/94
MA-SZZ       0/804, 0/94                                 0/804, 0/94
†RA-SZZ*
†SZZ@OPN     0/56, 0/7                                   0/56, 0/7
This metric measures the correct bug-inducing commits identified only by technique v_i and missed by all the others. Fig. 3a and Fig. 3b depict the correct_{v_i ∩ v_j} metric computed between each pair of SZZ variants when not filtering based on the issue date, while Fig. 4a and Fig. 4b show the same metric when the issue filter has been applied. Given the metric definition, the depicted heatmaps are symmetric (i.e., correct_{v_i ∩ v_j} = correct_{v_j ∩ v_i}). The only technique able to identify bug-inducing commits missed by all other SZZ implementations is SZZ@UNL (20 on oracle_all and 3 on oracle_issue), as reported in Table VI. This is not surprising considering the high SZZ@UNL recall and the high number of bug-inducing commits it returns for certain bug-fixes. It also explains why none of the other implementations identifies bug-inducing commits missed by all the others: given 804 as the cardinality of the union of the true positives identified by all SZZ techniques, SZZ@UNL correctly retrieves 800 of them.

Looking at the overlap metrics in Fig. 3 and Fig. 4, two observations can be made. First, the overlap in the identified true positives is substantial. Excluding SZZ@OPN, 21 of the 28 comparisons have an overlap in the identified true positives ≥
70%, and the lower values are still in the range 60-70%. The low overlap values observed for SZZ@OPN are instead due to its low recall. Second, the complementarity between the different SZZ variants is quite low, which indicates that there is a set of bug-fixing commits for which all of the variants fail in identifying the correct bug-inducing commit(s).
Fig. 3: Overlap between SZZ variants when no issue date filter is applied: (a) oracle_all, (b) oracle_issue.
Fig. 4: Overlap between SZZ variants when the issue date filter is applied: (a) oracle_all, (b) oracle_issue.

We manually analyzed those cases to derive possible future improvements to the SZZ algorithm.
A. Improvements to SZZ
The manual analysis of the 311 bug-fixing commits on which all SZZ variants fail allowed us to identify recurring patterns and distill three possible ways to improve the SZZ algorithm.
1) The buggy line is not always impacted in the bug-fix:
In some cases, fixing a bug introduced in line l may not result in changes performed to l. An example in Java is the addition of an if guard statement checking for null values before accessing a variable. In this case, while the bug has been introduced with the code accessing the variable without checking whether it is null, the bug-fixing commit does not impact such a line; it just adds the needed if statement. An example from our dataset is the bug-fixing commit from the thcrap repository [54] in which line 289 is modified to fix a bug introduced in commit b67116d, as pointed out by the developer in the commit message. However, the bug was introduced with changes performed on line 290 [54]. Thus, running git blame on line 289 of the fix commit will retrieve a wrong bug-inducing commit. Defining approaches to identify the correct bug-inducing commit in these cases is far from trivial. However, by manually analyzing a large dataset of bug-fixing commits, it should be possible to identify fixing patterns with associated buggy lines. Such a dataset could be used to train a model able, given a bug-fixing commit, to point to the location of the bug.
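To illustrate, a hypothetical Python analogue of the Java example above:

```python
# Bug-inducing commit (hypothetical): the unguarded access is introduced here.
def greeting(user):
    return "Hello, " + user.name      # crashes when user is None

# Bug-fixing commit: only a guard line is ADDED; the buggy line is untouched.
# Blaming the lines changed in the fix therefore cannot reach the inducing commit.
def greeting_fixed(user):
    if user is None:                  # added by the fix
        return "Hello, stranger"
    return "Hello, " + user.name      # unchanged line where the bug lives
```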
2) SZZ is sensitive to history rewritings:
Bird et al. [55] highlighted some of the perils of mining git repositories, among which the possibility for developers to rewrite the change history. This can be achieved through rebasing, for example: using such a strategy can have an impact on mining the change history [56] and, therefore, on the performance of the SZZ algorithm. Besides rebasing, git allows to partially rewrite history by reverting changes introduced in one or more commits in the past. This action is often performed by developers when a task they are working on leads to a dead end. Once run, the revert command results in new commits in the change history that turn back the indicated changes. Consequently, SZZ can improperly report one of these commits as a candidate bug-inducing commit.

For example, in the message of a commit from the xkb-switch project [57], the developer indicates the commit in which the bug she is fixing was introduced. By performing a blame on the fix commit, git returns as a bug-inducing commit a revert commit [58]. By performing an additional blame step, the correct bug-inducing commit pointed by the developer can be retrieved [59]. Future SZZ variants should handle revert commits, and properly deal with them when analyzing the change history.
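As a sketch of such handling, the following blames a line and re-blames past revert commits. It relies on the default subject produced by git revert and, as a simplification, assumes the line keeps its position across revisions, which a real implementation would have to track:

```python
import subprocess

def blame_skipping_reverts(repo: str, path: str, line: int, rev: str = "HEAD") -> str:
    """Blame a single line and, when the resulting commit looks like a revert,
    blame again from its parent so the original change can be reached."""
    while True:
        blame = subprocess.check_output(
            ["git", "-C", repo, "blame", "--porcelain",
             "-L", f"{line},{line}", rev, "--", path],
            text=True,
        )
        commit = blame.split()[0]  # first token of porcelain output is the hash
        subject = subprocess.check_output(
            ["git", "-C", repo, "log", "-1", "--format=%s", commit],
            text=True,
        ).strip()
        if not subject.startswith("Revert"):
            return commit
        rev = commit + "^"  # continue blaming from before the revert commit
```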
3) Looking at the “big picture” in code changes:
In several bug-fixing commits we inspected, the implemented changes included both added and modified/deleted lines. SZZ implementations focus on the latter, since there is no way to blame a newly added line. However, we found cases in which the added lines were responsible for the bug-fixing, while the modified/deleted ones were unrelated. There has been a recent attempt to address this problem: Sahal and Tosun [60] proposed an SZZ extension that considers the past history of all the lines in the block in which the added line appears. However, the research in this aspect is still at the beginning.

An example is commit ca11949 from the snake repository [61], in which two lines are added and two deleted to fix a bug. The deleted lines, while being the target of SZZ, are unrelated to the bug-fix, as clear from the commit message pointing to another commit [62] as the one responsible for the bug introduction. In the bug-inducing commit, the developer refactored the code to simplify an if condition. While refactoring the code, she introduced a bug (i.e., she missed an else branch). The fix adds the else branch to the sequence of if/else if branches introduced in the bug-inducing commit. In this case, by relying on static analysis, it should be possible to link the added lines, representing the else branch, to the set of if/else if statements preceding them. While the added lines cannot be blamed, lines related to them (e.g., acting on the same variable, being part of the same "high-level construct" like in this case) could be blamed to increase the chances of identifying the bug-inducing commit. While this would help recall, it would penalize precision without careful heuristics aimed at filtering out false positives.

VI. THREATS TO VALIDITY
Construct validity.
During the manual validation, the evaluators mainly relied on the commit message and the linked issue(s), when available, to confirm that a mined commit was a bug-fixing commit. Misleading information in the commit message could result in the introduction of false positive instances in our dataset. However, all commits have been checked by at least two evaluators and doubtful cases have been excluded, privileging a conservative approach. To build our dataset, we considered all the projects from GitHub, without explicitly defining criteria to select only projects that are invested in software quality. Our assumption is that the fact that developers take care of documenting the bug-inducing commit(s) is an indication that they care about software quality. To ensure that the commits in our dataset are from projects that take quality into account, we manually analyzed 123 projects from our dataset, which allowed us to cover a significant sample of commits (286 out of 1,115, with a 95% ± 5% confidence level). For each of them, we checked if they contained elements that indicate a certain degree of attention to software quality, i.e., (i) unit test cases, (ii) code reviews (through pull requests), and (iii) continuous integration pipelines. We found that in 95% of the projects, developers (i) wrote unit test cases, and (ii) conducted code reviews through pull requests. Also, we found CI pipelines in 75% of the projects.
Internal validity.
There is a possible subjectiveness introduced by the manual analysis, which has been mitigated with multiple evaluators per bug-fix. Also, we re-implemented most of the experimented SZZ approaches, thus possibly introducing variations as compared to what was proposed by the original authors. We followed the description of the approaches in the original papers, documented in Table IV any difference between our implementations and the original proposals, and share our implementations [40]. Also, note that the differences documented in Table IV always aim at improving the performance of the SZZ variants and, thus, should not be detrimental to their performance.
External validity.
While it is true that we mined millions of commits to build our dataset, we used very strict filtering criteria that resulted in 1,930 instances for our oracle. Also, the SZZ implementations have been experimented on a smaller dataset of 1,115 instances that is, however, still larger than those used in previous works. Finally, our dataset represents a subset of the bug-fixes performed by developers. This is due to our design choice, where we used strict selection criteria when building our oracle to prefer quality over quantity. It is possible that our dataset is biased towards a specific type of bug-fixing commits: there might be an inherent difference between the bug fixes for which developers document the bug-inducing commit(s) (i.e., the only ones we considered) and other bug fixes.
VII. CONCLUSION
When an algorithm like SZZ becomes so prominent in software engineering research, it is more than just necessary to explore ways to ameliorate its performance. It is also crucial to create a platform that allows for a sound and fair comparison of any new variant. Our goal was to create such a platform, exemplified in a publicly available and extensible oracle of multiple and documented datasets, together with open-source re-implementations of a considerable number of variants. Moreover, as we used our oracle to compare the variants and check the validity of our re-implementations, we came up with several concrete improvements to the existing SZZ variants. Given the pivotal role of SZZ for various research endeavors, for example in the context of defect analysis and prediction, and the whole field of MSR (mining software repositories), we believe our work can set the stage for numerous and, above all, comparable ameliorations of the seminal SZZ algorithm.
ACKNOWLEDGMENT
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 851720). We are grateful for the support by the Swiss National Science Foundation (SNF) and JSPS (Project "SENSOR").
REFERENCES

[1] J. Śliwerski, T. Zimmermann, and A. Zeller, "When do changes induce fixes?" ACM SIGSOFT Software Engineering Notes, vol. 30, no. 4, pp. 1–5, 2005.
[2] H. Hata, O. Mizuno, and T. Kikuno, "Bug prediction based on fine-grained module histories," IEEE, 2012, pp. 200–210.
[3] M. Tan, L. Tan, S. Dara, and C. Mayeux, "Online defect prediction for imbalanced data," vol. 2, IEEE, 2015, pp. 99–108.
[4] L. Pascarella, F. Palomba, and A. Bacchelli, "Fine-grained just-in-time defect prediction," Journal of Systems and Software, vol. 150, pp. 22–36, 2019.
[5] M. Yan, X. Xia, Y. Fan, A. E. Hassan, D. Lo, and S. Li, "Just-in-time defect identification and localization: A two-phase framework," IEEE Transactions on Software Engineering, 2020.
[6] Y. Fan, X. Xia, D. A. da Costa, D. Lo, A. E. Hassan, and S. Li, "The impact of changes mislabeled by SZZ on just-in-time defect prediction," IEEE Transactions on Software Engineering, 2019.
[7] G. Bavota and B. Russo, "Four eyes are better than two: On the impact of code reviews on software quality," IEEE, 2015, pp. 81–90.
[8] M. Tufano, G. Bavota, D. Poshyvanyk, M. Di Penta, R. Oliveto, and A. De Lucia, "An empirical study on developer-related factors characterizing fix-inducing commits," Journal of Software: Evolution and Process, vol. 29, no. 1, p. e1797, 2017.
[9] H. Aman, S. Amasaki, T. Yokogawa, and M. Kawahara, "Empirical study of fault introduction focusing on the similarity among local variable names," in QuASoQ@APSEC, 2019, pp. 3–11.
[10] B. Chen and Z. M. J. Jiang, "Extracting and studying the logging-code-issue-introducing changes in Java-based large-scale open source software systems," Empirical Software Engineering, vol. 24, no. 4, pp. 2285–2322, 2019.
[11] S. Kim, T. Zimmermann, K. Pan, E. James Jr et al., "Automatic identification of bug-introducing changes," IEEE, 2006, pp. 81–90.
[12] C. Williams and J. Spacco, "SZZ revisited: verifying when changes induce fixes," in Proceedings of the 2008 Workshop on Defects in Large Software Systems, 2008, pp. 32–36.
[13] S. Davies, M. Roper, and M. Wood, "Comparing text-based and dependence-based approaches for determining the origins of bugs," Journal of Software: Evolution and Process, vol. 26, no. 1, pp. 107–139, 2014.
[14] D. A. Da Costa, S. McIntosh, W. Shang, U. Kulesza, R. Coelho, and A. E. Hassan, "A framework for evaluating the results of the SZZ approach for identifying bug-introducing changes," IEEE Transactions on Software Engineering, vol. 43, no. 7, pp. 641–657, 2016.
[15] E. C. Neto, D. A. da Costa, and U. Kulesza, "The impact of refactoring changes on the SZZ algorithm: An empirical study," IEEE, 2018, pp. 380–390.
[16] ——, "Revisiting and improving SZZ implementations," IEEE, 2019, pp. 1–12.
[17] F. Palomba, G. Bavota, M. Di Penta, F. Fasano, R. Oliveto, and A. De Lucia, "On the diffuseness and the impact on maintainability of code smells: a large scale empirical investigation," Empirical Software Engineering, vol. 23, no. 3, pp. 1188–1221, 2018.
[18] B. Çaglayan and A. B. Bener, "Effect of developer collaboration activity on software quality in two large scale projects," Journal of Systems and Software, vol. 118, pp. 288–296, 2016.
[19] M. Wen, R. Wu, and S.-C. Cheung, "Locus: Locating bugs from software changes," IEEE, 2016, pp. 262–273.
[20] D. Posnett, R. D'Souza, P. Devanbu, and V. Filkov, "Dual ecological measures of focus in software development," IEEE, 2013, pp. 452–461.
[21] S. Kim, E. J. Whitehead, and Y. Zhang, "Classifying software changes: Clean or buggy?" IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 181–196, 2008.
[22] O. Kononenko, O. Baysal, L. Guerrouj, Y. Cao, and M. W. Godfrey, "Investigating code review quality: Do people and participation matter?" IEEE, 2015, pp. 111–120.
[23] S. Wehaibi, E. Shihab, and L. Guerrouj, "Examining the impact of self-admitted technical debt on software quality," vol. 1, IEEE, 2016, pp. 179–188.
[24] V. Lenarduzzi, F. Lomio, H. Huttunen, and D. Taibi, "Are SonarQube rules inducing bugs?" IEEE, 2020, pp. 501–511.
[25] M. L. Bernardi, G. Canfora, G. A. Di Lucca, M. Di Penta, and D. Distante, "The relation between developers' communication and fix-inducing changes: An empirical study," Journal of Systems and Software, vol. 140, pp. 111–125, 2018.
[26] F. Rahman, D. Posnett, A. Hindle, E. Barr, and P. Devanbu, "BugCache for inspections: hit or miss?" in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, 2011, pp. 322–331.
[27] J. Eyolfson, L. Tan, and P. Lam, "Correlations between bugginess and time-based commit characteristics," Empirical Software Engineering, vol. 19, no. 4, pp. 1009–1039, 2014.
[28] A. T. Misirli, E. Shihab, and Y. Kamei, "Studying high impact fix-inducing changes," Empirical Software Engineering, vol. 21, no. 2, pp. 605–641, 2016.
[29] G. Canfora, M. Ceccarelli, L. Cerulo, and M. Di Penta, "How long does a bug survive? An empirical study," IEEE, 2011, pp. 191–200.
[30] L. Prechelt and A. Pepper, "Why software repositories are not used for defect-insertion circumstance analysis more often: A case study," Information and Software Technology, vol. 56, no. 10, pp. 1377–1389, 2014.
[31] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu, "Fair and balanced? Bias in bug-fix datasets," in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, 2009, pp. 121–130.
[32] P. Marinescu, P. Hosek, and C. Cadar, "Covrig: A framework for the analysis of code, test, and coverage evolution in real software," in Proceedings of the 2014 International Symposium on Software Testing and Analysis, 2014, pp. 93–104.
[33] M. Borg, O. Svensson, K. Berg, and D. Hansson, "SZZ Unleashed: an open implementation of the SZZ algorithm, featuring example usage in a study of just-in-time bug prediction for the Jenkins project," in
Proceedings of the 3rd ACM SIGSOFT International Workshop onMachine Learning Techniques for Software Quality Evaluation , 2019,pp. 7–12.[34] Z. T´oth, P. Gyimesi, and R. Ferenc, “A public bug database of githubprojects and its application in bug prediction,” in
Computational Scienceand Its Applications – ICCSA 2016 . Springer International Publishing,2016, pp. 625–638.[35] R.-M. Karampatsis and C. Sutton, “How often do single-statementbugs occur? the manysstubs4j dataset,” in
Proceedings of the 17thInternational Conference on Mining Software Repositories, MSR , 2020,p. To appear.[36] G. Rodr´ıguez-P´erez, G. Robles, A. Serebrenik, A. Zaidman, D. M.Germ´an, and J. M. Gonzalez-Barahona, “How bugs are born: a modelto identify how bugs are introduced in software components,”
EmpiricalSoftware Engineering , pp. 1–47, 2020.[37] G. Rodr´ıguez-P´erez, G. Robles, and J. M. Gonz´alez-Barahona, “Repro-ducibility and credibility in empirical software engineering: A case studybased on a systematic literature review of the use of the szz algorithm,”
Information and Software Technology , vol. 99, pp. 164–176, 2018.[38] H. Tu, Z. Yu, and T. Menzies, “Better data labelling with emblem (andhow that impacts defect prediction),”
IEEE Transactions on SoftwareEngineering
Replication package , https://github.com/grosa1/icse2021-szz-replication-package.[41] C. C. Williams and J. W. Spacco, “Branching and merging in therepository,” in
Proceedings of the 2008 international working conferenceon Mining software repositories . IEEE, 2017, pp. 269–279.[44] N. Tsantalis, M. Mansouri, L. Eshkevari, D. Mazinanian, and D. Dig,“Accurate and efficient refactoring detection in commit history,” in . IEEE, 2018, pp. 483–494.[45] R. Just, D. Jalali, and M. D. Ernst, “Defects4j: A database of existingfaults to enable controlled testing studies for java programs,” in
Pro-ceedings of the 2014 International Symposium on Software Testing andAnalysis , 2014, pp. 437–440.[46] V. Lenarduzzi, F. Palomba, D. Taibi, and D. A. Tamburri, “Openszz: Afree, open-source, web-accessible implementation of the szz algorithm,” in
Proceedings of the 28th IEEE/ACM International Conference onProgram Comprehension , 2020, p. To appear.[47] D. Spadini, M. Aniche, and A. Bacchelli, “PyDriller: Pythonframework for mining software repositories,” in
Proceedings ofthe 2018 26th ACM Joint Meeting on European Software EngineeringConference and Symposium on the Foundations of Software Engineering- ESEC/FSE 2018 . New York, New York, USA: ACM Press, 2018,pp. 908–911. [Online]. Available: http://dl.acm.org/citation.cfm?doid=3236024.3264598[48] T. F. Bissyande, F. Thung, S. W. an?d D. Lo, L. Jiang, and L. Reveillere,“Empirical evaluation of bug linking,” in , 2013, pp. 89–98.[49] M. Fischer, M. Pinzger, and H. C. Gall, “Populating a release historydatabase from version control and bug tracking systems,” in , 2003, p. 23.[50] M. Tufano, C. Watson, G. Bavota, M. D. Penta, M. White, andD. Poshyvanyk, “An empirical study on learning bug-fixing patchesin the wild via neural machine translation,”
ACM Trans. Softw. Eng.Methodol. , vol. 28, no. 4, pp. 19:1–19:29, 2019.[51] N. Tsantalis, A. Ketkar, and D. Dig, “Refactoringminer 2.0,”
IEEETransactions on Software Engineering , 2020.[52] R. Baeza-Yates and B. Ribeiro-Neto,
Modern Information Retrieval .Addison-Wesley, 1999.[53] R. Oliveto, M. Gethers, D. Poshyvanyk, and A. D. Lucia, “On theequivalence of information retrieval methods for automated traceabilitylink recovery,” in
Proceedings of the 18th IEEE International Conferenceon Program Comprehension . IEEE Computer Society, 2010, pp. 68–71.[54]
Adjacent fix commit example , https://github.com/thpatch/thcrap/commit/29f16632f6bedd85e849034b01c3a8e8d4b7d83d.[55] C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton, D. M. German, andP. Devanbu, “The promises and perils of mining git,” in .IEEE, 2009, pp. 1–10.[56] V. Kovalenko, F. Palomba, and A. Bacchelli, “Mining file histories:Should we consider branches?” in
Proceedings of the 33rd ACM/IEEEInternational Conference on Automated Software Engineering , 2018, pp.202–213.[57]
Revert fix commit example , https://github.com/grwlf/xkb-switch/commit/5d8cee18015b9a64aa3e06a81802f8186a99cc02.[58]
Revert bug-inducing commit wrong , https://github.com/grwlf/xkb-switch/commit/8b9cf29bca85076500ae5a2759f86e2042c527d0.[59]
Revert bug-inducing commit correct , https://github.com/grwlf/xkb-switch/commit/42abcc0da1c7f1062d069349edf90aa3b8832ca4.[60] E. Sahal and A. Tosun, “Identifying bug-inducing changes for code addi-tions,” in
Proceedings of the 12th ACM/IEEE International Symposiumon Empirical Software Engineering and Measurement , 2018, pp. 1–2.[61]
Block bug-inducing commit wrong , https://github.com/krmpotic/snake/commit/ca119496290f4ba8594c1e298a77336825c71e77.[62]