[PDF] Characterizing and Mitigating Self-Admitted Build Debt

Abstract

Technical Debt is a metaphor used to describe the situation in which long-term code quality is traded for short-term goals in software projects. In recent years, the concept of self-admitted technical debt (SATD) was proposed, which focuses on debt that is intentionally introduced and described by developers. Although prior work has made important observations about admitted technical debt in source code, little is known about SATD in build systems. In this paper, we coin the term Self-Admitted Build Debt (SABD) and through a qualitative analysis of 500 SABD comments in the Maven build system of 300 projects, we characterize SABD by location and rationale (reason and purpose). Our results show that limitations in tools and libraries, and complexities of dependency management are the most frequent causes, accounting for 49% and 23% of the comments. We also find that developers often document SABD as issues to be fixed later. To automate the detection of SABD rationale, we train classifiers to label comments according to the surrounding document content. The classifier performance is promising, achieving an F1-score of 0.67-0.75. Finally, within 16 identified 'ready-to-be-addressed' SABD instances, the three SABD submitted by pull requests and the five SABD submitted by issue reports were resolved after developers were made aware. Our work presents the first step towards understanding technical debt in build systems and opens up avenues for future work, such as tool support to track and manage SABD backlogs.

Full PDF

11 Characterizing and MitigatingSelf-Admitted Build Debt

Tao Xiao, Dong Wang, Shane McIntosh, Hideaki Hata,Raula Gaikovina Kula, Takashi Ishio, and Kenichi Matsumoto

Abstract —Technical Debt is a metaphor used to describe the situation in which long-term code quality is traded for short-term goals insoftware projects. In recent years, the concept of self-admitted technical debt (SATD) was proposed, which focuses on debt that isintentionally introduced and described by developers. Although prior work has made important observations about admitted technicaldebt in source code, little is known about SATD in build systems. In this paper, we coin the term

Self-Admitted Build Debt (SABD) andthrough a qualitative analysis of 500 SABD comments in the Maven build system of 300 projects, we characterize SABD by locationand rationale (reason and purpose). Our results show that limitations in tools and libraries, and complexities of dependencymanagement are the most frequent causes, accounting for 49% and 23% of the comments. We also ﬁnd that developers oftendocument SABD as issues to be ﬁxed later. To automate the detection of SABD rationale, we train classiﬁers to label commentsaccording to the surrounding document content. The classiﬁer performance is promising, achieving an F1-score of 0.67–0.75. Finally,within 16 identiﬁed ‘ready-to-be-addressed’ SABD instances, the three SABD submitted by pull requests and the ﬁve SABD submittedby issue reports were resolved after developers were made aware. Our work presents the ﬁrst step towards understanding technicaldebt in build systems and opens up avenues for future work, such as tool support to track and manage SABD backlogs.

Index Terms —Self-Admitted Technical Debt, Build Debt, Build System, Build Maintenance (cid:70)

NTRODUCTION T H roughout the software development process, stake-holders strive to build functional, maintainable, andhigh-quality software. Despite their best efforts, developersinevitably encounter situations where suboptimal solutions,known as Technical Debt (TD) are implemented in a softwareproject [8]. Although studies have traced evidence of TDin source code, TD covers a range of software artifacts andprocesses (i.e., architecture, build, defects, design, documen-tation, infrastructure, people, process, requirements, service,and testing) [2]. Clear evidence of TD is at the core of self-admitted technical debt (SATD), where developers recordthe reasoning behind such suboptimal solutions. Potdar andShihab [35] observed that SATD was present existed in 31%of source code ﬁles.Although prior work on SATD in source code has madeimportant advances, modern software development has abroader scope than solely producing source code. Indeed, acomplex collection of other software artifacts and tools areneeded to produce ofﬁcial software releases. At the heart ofthese, other artifact is the build system , which orchestratestools (e.g., automated test suites, containerization tools,external and internal dependency management) into a re-peatable (and ideally incremental) process. Software teamsuse build system speciﬁcations to express dependencies within • T. Xiao, D. Wang, H. Hata, R. G. Kula, T. Ishio, and K. Matsumoto arewith the Nara Institute of Science and Technology, Japan.E-mail: { tao.xiao.ts2, wang.dong.vt8, hata, raula-k, ishio, matu-moto } @is.naist.jp. • S. McIntosh is with the Cheriton School of Computer Science, Universityof Waterloo, Canada.E-mail: [email protected]. and among internal and external software artifacts. Buildsystems also suffer from TD. Developers at Google calledsuch form of TD as

Build Debt (BD), which is describedas the efforts to measure and pay down technical debtfound in their build ﬁles and associated dead code [33].They described the four types of BD, namely (i) dependencyspeciﬁcations—where there is a slowdown of the build andtest systems and the brittleness of a project’s build due tounder-declared direct dependencies; and (ii) unbuildabletargets—where abandoned targets that have not success-fully built for several months (i.e., zombie targets); and (iii)visibility—where public components that become privateare never removed; and (iv) unnecessary command-lineﬂags—where a set of ﬂags for libraries and binaries are nolonger needed.Despite the critical role that build systems play, to thebest of our knowledge, there have been no previous studiesthat investigate SATD in build speciﬁcation ﬁles. To ﬁll thisgap, in this paper, we propose to analyze build ﬁles andtheir existing SATD, which we refer to as

Self-Admitted BuildDebt (SABD) . More speciﬁcally, we set out to characterizeSABD, explore its potential for automation and evaluateSABD mitigation strategies. By analyzing the 500 SABDcomments extracted from 300 GitHub repositories that uti-lize the Maven build system, we address the following threeresearch questions: (RQ1) What are the characteristics of a SABD?

Motivation:

It is unclear what types of SABD exist. Simi-lar to SATD, analyzing SABD characteristics (location andrationale) will lay the foundation for understanding thescope of build debt. Answering this question will drivefuture research and tool development on SABD problemsof practical relevance. a r X i v : . [ c s . S E ] F e b (RQ1.1) Location: Which build ﬁle speciﬁcations are mostsusceptible to SABD? Results:

SABD tends to occur in the plugin conﬁguration andthe external dependency conﬁguration code, accounting for47% and 31%, respectively. (RQ1.2) Rationale: What causes a developer to documentSABD and what purpose does it serve?

Results:

We analyze rationale along reason and purposedimensions. First, we ﬁnd that there are ten categoriesof SABD reasons. The most frequent reasons include thelimitations in tools and libraries, and complexities of de-pendency management, accounting for 49% and 23% ofanalyzed SABD instances, respectively. Second, we ﬁnd thatthere are six purposes for leaving SABD comments, withdocumenting issues to be ﬁxed later and explaining therationale for a workaround occurring the most frequently,accounting for 34% and 23% of analyzed SABD instances,respectively. (RQ2) Can automated classiﬁers accurately identify thecharacteristics of SABD?

Motivation:

Analysis of build systems at large organizationslike Google would require an automated approach to bepractical. For practitioners, automatic SABD identiﬁcationcould facilitate the replication of detection approaches topromote their adoption, and improve the detection qualityand traceability. Hence, we explore the feasibility of trainingautomatic classiﬁers to identify SABD characteristics.

Results:

Experimental results show automation is feasible,achieving a precision of 0.68 and 0.79, a recall of 0.67 and0.75, and an F1-score of 0.67 and 0.75 for SABD reasonsand purposes, respectively. Comparing both traditional andstate-of-the-art machine learning techniques, we ﬁnd thatthe auto-sklearn based classiﬁers tend to outperform the setof baseline classiﬁers, i.e., Naive Bayes (NB), Support VectorMachine (SVM), and k-Nearest Neighbor (kNN) by a marginof 10–22 percentage points. (RQ3) To what extent can SABD be removed?

Motivation:

The manner by which developers handle SABDis currently unknown, i.e., whether or not they can beremoved. Hence, we investigate the ‘ready-to-be-addressed’SABD removal by submitting pull requests and issue re-ports. Answering this research question can address thenecessity of proposing automatic tools to identify SABDcomments for researchers and furthermore facilitate bettertechnical debt management for projects.

Results:

Within 16 ‘ready-to-be-addressed’ SABD instances,we propose pull requests for seven cases, three of whichwere merged. Moreover, we produce issue reports for ninecases, ﬁve of which were resolved within 20 days. Whilethere are a number of factors at play, these responses suggestthat developers are receptive and reactive to SABD.

Replication package.

To facilitate replication and futurework in the area, we have prepared a replication package,which includes the manually labeled dataset and the scriptsfor reproducing our analyses. The package is available on-line at https://github.com/NAIST-SE/SABD.The remainder of the paper is organized as follows.Section 2 describes the workﬂow that we followed to collectSABD comments. Sections 3–5 show the experiments thatwe conducted to address RQ1–3, respectively. Section 6presents our recommendations for build system stakehold- (DP2) Identify SABDcommentsGitHub (DP1) Extractcomments fromMaven repositories SABDcomments

Data Preparation

Manually classifylocations, reasons,and purposes

RQ 1

Coded SABDcomments (RQ 1 results)Model evaluationresults (RQ 2 results)Feedback(RQ 3 results)Create issue reportsand pull requestsIdentify 'ready-to-be-addressed' SABD

RQ 3

Constructclassification models

RQ 2

Evaluate models'ready-to-be-addressed' SABDClassificationmodelsComments

Fig. 1: Overview of the studyers based on our study. Section 7 situates our work withrespect to the literature on build systems and technicaldebt. Section 8 discusses the threats to validity. Section 9draws the conclusions and highlights opportunities for fu-ture work.

ATA P REPARATION

In this section, we describe the data collection procedure.Figure 1 shows an overview of the procedure, which consistsof two steps: (DP1) Extract comments from Maven reposito-ries; and (DP2) Identify SABD comments. (DP1) Extract comments from Maven repositories.

Maven is a popular build automation tool used primarily forJava projects. In a large-scale analysis of 177,039 repositories,McIntosh et al. [31] found that Maven repositories tend to bethe most actively maintained. Since developers are activelyupdating their Maven ﬁles, we suspect that technical debtmay also be accruing. Thus, we select Maven as our studiedbuild system.We analyze Java repositories in the dataset shared byHata et al. [14]. That dataset includes the Git repositoriesof actively developed software projects containing (i) morethan 500 commits; and (ii) at least 100 commits duringthe most active two-year period. Forked repositories areexcluded from the analysis. We analyze the latest version(

HEAD revision) of the repositories. The list of

HEAD revisionsis summarized in the replication package.We select the Maven speciﬁcations from each studiedrepository using the ﬁlename convention (i.e., pom.xml ).Each studied repository may have several Maven speciﬁ-cations. Since the speciﬁcations are written in XML, com-ments are recognized as content appearing between “ ” XML tokens. We extract comment content fromMaven speciﬁcations using a script that builds upon the JavaSE XML parser. Finally, we extract 253,555 comments from100,765 POM ﬁles in 3,710 Maven repositories. (DP2) Identify SABD comments.

We detect SABD com-ments using the keyword-based approach of Potdar and

1. https://github.com/takashi-ishio/CommentLister

TABLE 1: Summary of studied dataset

Shihab [35]. In addition, to reduce the risk of missing SABDcomments and enlarge the dataset, we expand Potdar andShihab’s keyword list to include 13 frequent features thatwere recommended by Huang et al. [16]. Our adapted listof SABD keywords is summarized in our online appendix. In the end, we are able to detect 3,424 SABD comments.Table 1 provides an overview of our studied dataset.

HARACTERIZING

SABD C

OMMENTS (RQ1)

SABD comments may appear in different locations withinbuild ﬁles. Moreover, the rationale for incurring SABD maydiffer. In this section, we analyze the locations, reasonsfor adoption, and purposes served by SABD comments. Toperform our analyses, we use a manually-intensive method.Below, we present our approach to classify SABD commentsaccording to locations, reasons, and purposes (Section 3.1),followed by the results (Section 3.2). Finally, we explore therelationships between locations and reasons, and locationsand purposes (Section 3.3).

We apply an open coding approach [7] to classify randomlysampled SABD comments in build ﬁles. Open coding isa qualitative data analysis method by which artifacts un-der inspection (SABD comments in our case) are classiﬁedaccording to emergent concepts (i.e., codes). After coding,we apply open card sorting [34] to raise low-level codesto higher levels and more abstract concepts, especially forSABD reasons. Below, we describe our code saturation,sample coding, and card sorting procedures in more detail.

Code saturation.

Section 2 shows that there are 3,424SABD comments out of 253,555 comments appearing in thecurated set of GitHub Maven repositories. Since coding ofall 3,424 instances is impractical, we elect to code a sampleof SABD comments.To discover as a complete list of SABD locations, reasons,and purposes as possible, we strive for saturation . Similar toprior work [15], we set our saturation criterion to 50, i.e.,we continue to code randomly selected SABD commentsuntil no new codes have been discovered for 50 consecutivecomments. Finally, we reach saturation after coding 266SABD comments. To test the level of agreement of ourconstructed codes, we calculate the Kappa agreement ofour codes among the ﬁrst two authors, who independentlycoded the locations, reasons, and purposes of all 266 SABDcomments. Cohen’s Kappa for location codes is 0.91 or‘Almost perfect’ agreement [44], whereas Cohen’s Kappafor reason and purpose codes are 0.78 and 0.75, respectively,which indicate ‘Substantial’ agreement [44]. The somewhatlower agreement can be explained by the need for extrapo-lation when coding the reason and purpose of an instanceof SABD from its context and comment content.

2. https://doi.org/10.6084/m9.ﬁgshare.13018580

Sample coding.

To increase the generalizability of ourresults, after our codes achieve saturation, we coded addi-tional SABD comments to reach 500 samples. We dividedthe additional 234 samples into two sets. Then, the ﬁrstauthor independently coded the ﬁrst set, and the secondauthor independently coded the second set. In a series offollow-up meetings with the third author, each case wherethe coders disagree was discussed until a consensus wasachieved. When coding each SABD comment, the codersfocus on the location, key reason, and key purpose. Forexample, a SABD comment from the

Apache OODT projectis located in the plugin conﬁguration. The reason of thiscomment is labeled as ‘External library limitation’ and itspurpose is ‘Document for later ﬁx’.Since the open coding is an exploratory data analysistechnique, it may be prone to errors. To mitigate errors, wecode in two passes. First, we code based on the commentitself. After completing an initial round of coding, we per-form a second pass over all of SABD comments to correctmiscoded instances. In the second pass, we code basedon contextual information, such as the surrounding buildspeciﬁcation code, prior commit history, and other relevantdevelopment records.

Card sorting.

We apply open card sorting to construct ataxonomy of SABD reasons. Open card sorting helps us togenerate general types from our low-level codes. The opencard sorting includes two steps. First, the coded commentsare merged into cohesive groups that can be representedby a similar subcategory. Second, the related subcategoriesare merged to form categories that can be summarized by ashort title.

In this section, we present our results for RQ1, consisting ofSABD location and rationale (reason and purpose).

RQ1.1 - SABD Location

Observation 1:

Plugin conﬁguration and External dependenciesconﬁguration are the most frequently occurring locations in oursample.

We identiﬁed the nine location codes that emergedfrom our qualitative analysis. Table 2 provides an overviewof the categories and their deﬁnitions, frequencies, andlines of code (LOC) for SABD locations. From the table,we observe that

Plugin conﬁguration is the most frequentlyoccurring location for developers to leave SABD comments,with 47% of SABD comments appearing in that location. Thesecond most frequently occurring location is the

Externaldependencies conﬁguration location, accounting for 31% ofSABD comments appearing in that location. In addition,locations such as

Project metadata , Build organization , and

Software conﬁguration management are rarely associated withSABD, i.e., 1% for each category.The fourth column of Table 2 shows that our locationtendencies appear to follow with the volume of code ineach category. This result shows that, as one might expect,categories that require larger volumes of build conﬁgurationcode tend to be more prone to contain SABD.

TABLE 2: Deﬁnition and Frequency of SABD locations and lines of code (LOC) of build code

Category Deﬁnition Frequency LOC

Plugin conﬁguration Build code that speciﬁes which build plugin features should be in-cluded or excluded and how they should be conﬁgured for buildexecution, e.g., , . 238 (47%) 100,434(44%)External dependencies conﬁguration The system under construction or the plugins in use during the buildprocess may rely on external code in order to function correctly.The Maven build technology also includes tooling for specifying andresolving these dependencies through a central repository of artifacts.Maven users may declare their external dependencies in Maven speci-ﬁcations using tags, e.g., . 156 (31%) 99,940(43%)Build variables Maven users may declare their own build variables or override inher-ited variables from a parent POM, e.g., . 66 (13%) 8,099 (4%)Resource conﬁguration Build code that speciﬁes where resources are stored and what kinds ofresources are used during the build process, e.g., . 10 (2%) 1,272 (1%)Multi-directory conﬁguration Build code that avoids redundancies or duplicate conﬁgurationsthrough inheritance. Maven users may declare their project’s parentin Maven speciﬁcations using tags, e.g., . 10 (2%) 5,464 (2%)Repository conﬁguration The system which is needed to deploy artifacts from the organizationmay rely on remote repositories in order to populate the requireddependencies to a local repository, e.g., . 9 (2%) 4,372 (2%)Project metadata Build code that speciﬁes project descriptive information. In Maven,this information includes the version, artifact, and group identiﬁers ofthe project, e.g., , , . 6 (1%) 9,222 (4%)Build organization Build code that speciﬁes the build lifecycle and its outputs should beconﬁgured, e.g., . 3 (1%) 417 (0%)Software conﬁguration management Build code that speciﬁes a set of information for release build to checkout the tag that was created for this release. Maven users may declareit in Maven speciﬁcations using tags. e.g., . 2 (1%) 789 (0%) RQ1.2 - SABD Rationale

Reasons.

The top portion of Table 3 deﬁnes and quantiﬁesthe reasons that we observe for SABD comments in ourstudied sample.

Observation 2:

Limitation is the most common reason cate-gory for developers to leave SABD comments.

We identiﬁed 17subcategories that emerged from our classiﬁcation for SABDreasons. The 17 subcategories ﬁt into ten categories. Table 3shows deﬁnitions and frequencies for various SABD reasoncategories. As we can see from the table, 49% of SABDcomments are left due to the

Limitation reason.Upon closer inspection, external library limitation is themain limitation, accounting for 42% of the occurrences of theLimitation category. Indeed, it appears that working aroundthe constraints imposed by external libraries is a complexityof modern development from which build speciﬁcations arenot exempt. The following is an example of the

Limitation reason. The comment describes the limitation of the speciﬁcversion of the maven-war-plugin plugin. < !−− This is broken in maven−war−plugin 2.0, works in 2.0.1 −− >< warSourceExcludes > WEB−INF/no−lib/*.jar < /warSourceExcludes > The second most frequently occurring category isthe

Dependency reason (23%). In an example below,the org.apache.httpcomponents:httpclient depen-dency is required to implement the

OAuth2 testing. Thus,developers leave comments as a reminder for why thisdependency is needed. < !−− Required to implement OAuth2 testing −− >< dependency >< groupId > org.apache.httpcomponents < /groupId >< artifactId > httpclient < /artifactId >< scope > test < /scope >< /dependency >

3. https://tinyurl.com/y5jtkxkb4. https://tinyurl.com/y5jxjk45

Less frequently occurring patterns include the

Deploy-ment process reason (1%). Finally, only 5% of commentsare labeled as

No reason , which means that we could notdetermine the reasons from them.In the study of Mensah et al. [32], they identiﬁed thepossible causes of SATD introduction. The most prominentcauses are code smells (23%), complicated and complextasks (22%), inadequate code testing (21%), and unexpectedcode performance (17%). Comparatively, those causes ac-count for only 1% of the SABD. We suspect that this isbecause build code speciﬁes a set of rules to prepare andtransform the source code into deliverables. Unlike imper-ative or object-oriented systems, build speciﬁcations areprimarily implemented to inform an expert system so thatit may make efﬁcient and correct decisions. This change inparadigm is likely changing the characteristics of observableSABD.We describe the remaining reason categories in detailusing representative examples in our online appendix tohelp the reader understand this taxonomy. Purposes.

The bottom portion of Table 3 deﬁnes andquantiﬁes the purposes that we observe for SABD com-ments in our studied sample.

Observation 3:

Document for later ﬁx is the mostfrequently occurring purpose.

Our classiﬁcation revealedsix SABD purposes. Table 3 shows the results of ourpurpose classiﬁcation. We observe that 34% SABDcomments are left by developers with the

Documentfor later ﬁx purpose. The result indicates that SABDcomments are likely to be used as a short-termmemo for developers to recheck in the future. Forexample, since Maven resolves dependencies transitively,it is possible to include unwanted or problematic

5. https://doi.org/10.6084/m9.ﬁgshare.13147727

TABLE 3: Deﬁnition and Frequency of SABD reasons andpurposes. Ten categories merged from subcategories forSABD reasons are shown in bold.

Category Deﬁnition Frequency R e a s o n Limitation

Constraints imposed by the design orimplementation of third-party librariesor development tools.

245 (49%)

External library limitation 208 (42%)External tool limitation 22 (4%)Build tool limitation 15 (3%)

Dependency

When accessing unavailable artifacts orassets, such as missing or staledependencies.

117 (23%)

Missing dependency 56 (11%)Stale dependency 49 (10%)Problematic dependency 12 (2%)

Recursive call

Coherence issues, recursive calls toinvoke another build ﬁle.

33 (7%)Document

Inadequate project description issues,such as licensing and metadataspeciﬁcation.

23 (4%)

Specify metadata 12 (2%)Licensing 11 (2%)

Build break

Broken builds (i.e., failures that occurduring the build process) in build ﬁles.

22 (4%)Compiler setting

Conﬁguration issues during thecompilation process, such as compilerconﬁguration and symbol visibility.

18 (4%)

Compiler conﬁguration 16 (3%)Symbol visibility 2 (1%)

Deployment process

Processes which make the softwareartifacts ready for execution oravailable for use.

12 (2%)

Installation 6 (1%)Deployment 6 (1%)

Code smell

Violations of fundamentals of designprinciples, i.e., instances of poor codingpractice in build ﬁles.

Changes that need to be propagated tokeep software artifacts in sync duringupdates.

A label could not be assigned (due tolack of information).

25 (5%) P u r p o s e Document for later ﬁx Document an issue that should berevisited in the future. 172 (34%)Workaround Document constraints imposed bydesign or implementation choices andthe impact they have had on solutionstructure and/or content. 113 (23%)Warning for futuredevelopers Warn other developers to pay attentionto an aspect of the solution that maynot be clear from its structure orcontent. 111 (22%)Document suboptimalimplementation choice Explain why a problematic solution hasbeen adopted. 82 (16%)Placeholder for laterextension Document an extension point for laterenhancement(s). 13 (3%)Silence build warnings Defer or ignore warnings emitted byunderlying tools. 9 (2%) dependencies. For the software.amazon.awssdk:s3 dependency which includes the broken netty-nio-client:software.amazon.awssdk dependency, developers exclude this dependency topreserve a clean (i.e., non-broken) build status. Developersleft this comment as a note to revisit in the future. < dependency >< groupId > software.amazon.awssdk < /groupId >< artifactId > s3 < /artifactId >< version > $ { awsjavasdk.version } < /version >< exclusions >< exclusion >< !−− TODO remove exclusions after we ﬁx netty module −− >< artifactId > netty−nio−client < /artifactId >< groupId > software.amazon.awssdk < /groupId >< /exclusion >< /exclusions >< /dependency > Another commonly occurring purpose is the

Workaround purpose, accounting for 23% of our sample. In the examplebelow, the io.grpc:grpc-core dependency is partly un-usable. Developers comment out this dependency and leftthe comment to document this temporary ﬁx.

6. https://tinyurl.com/y4wg8n3z7. https://tinyurl.com/y43rxj9a < !−− FIXME(lesv) Temporary ﬁx due to Datastore having the wrong@@version@@ −− >< !−− < dependencyManagement >< dependencies >< dependency >< groupId > io.grpc < /groupId >< artifactId > grpc−core < /artifactId >< version > < /version >< /dependency >< /dependencies >< /dependencyManagement > −− >< !−− end of FIXME −− > On the other hand, only 3% and 2% of SABD commentsare identiﬁed for the

Placeholder for later extension purposeand the

Silence build warnings purpose.The survey of Maldonado et al. [26] showed that devel-opers most often use SATD to track future bugs and badimplementation areas. In our context of build systems, thehigh frequency of the

Document for later ﬁx purpose agreeswith their observations.We describe the remaining purpose categories in detailusing representative examples in our online appendix tohelp the reader understand this taxonomy. Observation 4:

Location categories share a strong relationshipwith reason categories.

Motivated by representative examples,we observe that SABD comments in similar locations canvary in terms of reasons and purposes. Thus, we conducta further study to investigate the relationships betweenlocations and reasons, and locations and purposes. We vi-sualize these relationships by using two parallel sets [19]in parallel categories diagrams. Parallel sets are variants ofparallel coordinates, in which the width of lines that connectsets corresponds to the frequency of their co-occurrence.Figure 2 shows relationships between locations and reasons,and locations and purposes.In Figure 2a, SABD comments in

Plugin conﬁguration most frequently occur (65.1%) because of the

Limitation reason in our sample. The example below shows a co-occurrence of

Plugin conﬁguration - Limitation . This SABDcomment is located in the tag and indicates thatthe current plugin suffers from an external library limitation,i.e., maven-shade-plugin . < plugin >< artifactId > maven−antrun−plugin < /artifactId >< executions >< !−− XXX: workaround for an issue with maven−shade−pluginThere appears to be some stale state from previous executions of theShade plugin, which manifest themselves as ‘‘We have a duplicate‘‘ warnings on build and also as some classes not being updatedon build. −− >< execution > ... < /plugin > SABD comments located in

External dependencies con-ﬁguration most frequently occur (58.3%) due to the

Dependency reason in our sample. The example be-low shows this relationship, where a comment lo-cated in the tag indicates that miss-ing the com.google.protobuf:protobuf-java depen-dency should be added to ﬁx a dependency issue.

8. https://doi.org/10.6084/m9.ﬁgshare.131477399. https://tinyurl.com/y6fuzkrk10. https://tinyurl.com/yxdopn3g

Plugin(cid:3)con(cid:192)guration

Loca(cid:87)ion

Build(cid:3)(cid:89)ariablesMulti-director(cid:92)(cid:3)con(cid:192)guration

Build(cid:3)organi(cid:93)ationSoft(cid:90)are(cid:3)con(cid:192)guration(cid:3)managementE(cid:91)ternal(cid:3)dependencies(cid:3)con(cid:192)gurationResource(cid:3)con(cid:192)gurationRepositor(cid:92)(cid:3)con(cid:192)gurationProject(cid:3)metadata Build(cid:3)break

Rea(cid:86)on

LimitationChange(cid:3)propagationCode(cid:3)smellCompiler(cid:3)settingDeplo(cid:92)ment(cid:3)processDocumentDependenc(cid:92)No(cid:3)reasonRecursi(cid:89)e(cid:3)call N (cid:82) . (cid:3) (cid:82) f(cid:3) c a (cid:86) e (cid:86) . % . % The Trial Version (a) Location-Reason

Plugin(cid:3)con(cid:192)guration

Loca(cid:87)ion

Build(cid:3)variablesMulti-director(cid:92)(cid:3)con(cid:192)guration

Build(cid:3)organi(cid:93)ationSoftware(cid:3)con(cid:192)guration(cid:3)managementE(cid:91)ternal(cid:3)dependencies(cid:3)con(cid:192)gurationResource(cid:3)con(cid:192)gurationRepositor(cid:92)(cid:3)con(cid:192)gurationProject(cid:3)metadata Document(cid:3)for(cid:3)later(cid:3)(cid:192)(cid:91)

P(cid:88)(cid:85)(cid:83)o(cid:86)e

Silence(cid:3)build(cid:3)warningsWorkaroundWarning(cid:3)for(cid:3)future(cid:3)developersDocument(cid:3)suboptimal(cid:3)implementation(cid:3)choicePlaceholder(cid:3)for(cid:3)later(cid:3)e(cid:91)tension N (cid:82) . (cid:3) (cid:82) (cid:73)(cid:3) c a (cid:86) e (cid:86) . % . % . % . % The Trial Version (b) Location-Purpose

Fig. 2: Parallel sets between locations and reasons, andlocations and purposes. For example,

Plugin conﬁguration most frequently occur because of the

Limitation reason. < dependencies >< !−− ﬁx protobuf dependency issue −− >< dependency >< groupId > com.google.protobuf < /groupId >< artifactId > protobuf−java < /artifactId >< version > < /version >< /dependency >< /dependencies > These two observations suggest that location categories tendto be more prone to SABD causes.Moreover, for the relationships between locations andpurposes that are shown in Figure 2b, SABD comments inthe

Plugin conﬁguration location are most often left withthe

Workaround purpose (30.7%). For instance, the SABDcomment below is provided with a workaround for Travisbuild. < plugins >< plugin >

11. https://tinyurl.com/y65fqwlv < groupId > org.apache.maven.plugins < /groupId >< artifactId > maven−sureﬁre−plugin < /artifactId >< version > < /version >< conﬁguration >< !−− Travis build workaround −− >< argLine > −Xms1024m −Xmx2048m < /argLine >< /conﬁguration >< /plugin >< /plugins > Figure 2b also shows that SABD comments in the

Ex-ternal dependencies conﬁguration location are most often leftwith the

Document for later ﬁx purpose (41.7%). In a casebelow, the comment located in the tag isused to document a pending maven-plugin 3.0 . De-spite these two observations, there is no clear relationshipbetween locations and purposes of SABD comments. Theresult suggests that the location of SABD does not affectthe purpose of developers leaving SABD comments or viceversa. < dependency >< groupId > org.jenkins−ci.main < /groupId >< artifactId > maven−plugin < /artifactId >< version > < /version >< exclusions >< exclusion > < !−− TODO pending maven−plugin 3.0 −− >< groupId > org.apache.ant < /groupId >< artifactId > ant < /artifactId >< /exclusion >< /exclusions >< /dependency > RQ1 Summary:

We identiﬁed nine SABD locations,ten reasons, and six purposes in Maven build sys-tems. Location categories tend to be more proneto SABD causes. In the build system maintenance,stakeholders involved in SABD management shouldbe aware of diverse SABD characteristics to assist inmaking effective management decisions.

OMMENT C LASSIFICATION (RQ2)

In Section 3, our results provide evidence that diverseSABD locations and rationale (i.e., reasons and purposes)exist in build ﬁles. To facilitate the replication of detectionapproaches, and promote the adoption and traceability ofSABD, an automated SABD classiﬁer should be beneﬁcial.Thus, we further study the feasibility of automaticallyclassifying SABD comments. To do so, we use the manuallycoded SABD comments from Section 3 as a dataset. With thisdataset, we train classiﬁers based on machine learning tech-niques and evaluate their performance. Below, we presentour approaches to automated classiﬁcation (Section 4.1) andmodel evaluation (Section 4.2), as well as the results to RQ2(Section 4.3).

Since in the SABD reasons and purposes, minority codesprovide the lesser with enough signal to reliably classify,while minority codes may possess more valuable knowl-edge. Thus, to reduce the bias introduced by oversampling,we rearrange the codes whose frequencies are less than 10%of the sampled comments with the aspects of SABD reasons

12. https://tinyurl.com/y49s8kg3 and purposes. For the reason category, we merge

Recur-sive call , Document , Build break , Compiler setting , Deploymentprocess , Code smell , Change propagation , and

No reason into

Other . For the purpose category, we merge

Placeholder forlater extension and

Silence build warnings into

Other . Text preprocessing.

An analysis of coded SABD com-ments revealed that bug report links usually appear in com-ments. Thus, for all SABD comments, we replace hyperlinkswith abstracturl by using the regular expression similarto the previous study [25]: https ?: \ / \ \ . ) ?[ −a−zA−Z0−9@:%. \ + ˜ { } \ . [ a−z ] { }\ b ([ − a−zA−Z0−9@:% \ +.˜ } Moreover, to reduce the impact of noisy text in com-ments, we remove special characters by using the regular ex-pression [ˆA-Za-z0-9]+ . Additionally, we apply Spacy to lemmatize words, which accounts for term conjugation.Although it is common practice, we opt to exclude stopwords removal, since stop words like ‘for’ and ‘until’ conveycritical semantics in the context of SABD comments [25]. Feature extraction.

We apply the N-gram Inverse Doc-ument Frequency (IDF) approach to extract features fromthe preprocessed text using the N-gram weighting schemetool [39] with its default setting. N-gram IDF [40] is a the-oretical extension of the IDF approach for handling wordsand phrases of any length. The approach generates a list ofall valid N-gram terms, and the strength of their associationwith the targeted classes excluding

Other . We remove anyterm that appears only once in each class. In total, 1,997and 1,120 N-gram terms are retrieved for SABD reasons andpurposes, respectively.

Classiﬁer preparation.

Previous studies [24, 25, 45] re-ported that classiﬁers trained by combining N-gram IDFand auto-sklearn machine learning tend to outperform clas-siﬁers that are trained with single word features. Headingtheir advice, we train our classiﬁers using auto-sklearn [10],which automatically determines effective machine learningpipelines for classiﬁcation. Auto-sklearn searches a con-ﬁguration space of 15 classiﬁcation algorithms, 14 featurepreprocesses, and four data preprocesses for optimal hyper-parameter settings. We conﬁgure the approach to optimizefor the weighted F1-score, with a budget of one hour foreach round, and a memory capacity of 32 GB.

To evaluate our classiﬁer, we use common performancemeasures. The precision is the fraction of SABD commentsthat are correctly classiﬁed. The recall is the fraction of trulySABD comments that are classiﬁed as such. The F1-score isthe harmonic mean of precision and recall.To investigate the impact that the choice of classiﬁcationtechnique has, we apply Naive Bayes (NB), Support VectorMachine (SVM), and k-Nearest Neighbors (kNN) classiﬁca-tion techniques. These classiﬁers have been broadly adoptedin prior studies [16, 46]. Similar to prior work [25], we applyTF-IDF [38] to extract the features for our baseline classiﬁers.

13. https://spacy.io/

TABLE 4: Performance of classiﬁers for SABD reason

Category auto-sklearn NB SVM kNNPrecision

Limitation

Recall

Limitation 0.73 0.62

F1-score

Limitation 0.73 0.65

TABLE 5: Performance of classiﬁers for SABD purpose

Category auto-sklearn NB SVM kNNPrecision

Document for later ﬁx

Recall

Document for later ﬁx 0.82 0.37

Workaround

Avg.

F1-score

Document for later ﬁx

Avg.

Observation 5:

The auto-sklearn classiﬁer tends to outperformthe baseline classiﬁers for both reasons and purposes.

Table 4shows the classiﬁer performance with respect to the reasonsfor SABD. The table shows that the average precision is0.68, which is greater than the precision of the NB andkNN classiﬁers (0.57 and 0.62, respectively). The averagerecall and F1-score of the auto-sklearn are also greater thanbaseline classiﬁers by at least nine and seven percentagepoints, respectively. Upon closer inspection, we ﬁnd thatthe classiﬁcation of

Limitation achieves the best performancewhen compared with the other two reason categories. Forinstance, the precision, recall, and F1-score for

Limitation are0.75, 0.73, and 0.73, respectively, which are greater than thenext best performance,

Dependency , by a margin of at leastseven percentage points.Table 5 shows the classiﬁer performance with respect tothe purposes of SABD. As we can see from Table 5, theauto-sklearn classiﬁer outperforms the baseline classiﬁers.The average recall and F1-score are greater than the otherbaseline classiﬁers by at least eight points. Closer inspec-tion of the purpose categories reveals that classifying the

Workaround purpose has the best performance, with theF1-score reaching 0.89. Such high performance is possibleas there usually exist keywords that explicitly map thiscategory, e.g., ‘workaround’ and ‘temporary’. Moreover, forthe other purpose categories, we ﬁnd that the performanceis still promising, e.g., achieving F1-scores of 0.77 and 0.70for

Document for later ﬁx and

Waiting for future developers purposes. On the other hand, we observe that SVM out-performs auto-sklearn, especially in terms of precision. Forexample, SVM reaches a mean/median precision of 0.82 for

TABLE 6: Frequently occurring n-gram features in eachcategory

Category N-gram features Frequency R e a s o n Limitation be break 18available 10break in 10link 9jdk 9

Dependency java 9 7require by 6to implement 6framework 5be need 4 P u r p o s e Documentfor later ﬁx ﬁx this 7late 6the project 6todo ﬁx this 6pron ﬁx 6

Documentsuboptimalimplementationchoice ofﬂine 6todo why 6script 4do pron 4copy 3

Workaround workaround for 54workaround for abstracturl 19workaround to 14a workaround 9java 9 7

Warning forfuture developers break in 10war 6ﬁx a 5in osgi 4mvn 4 the SABD purpose, while the auto-sklearn classiﬁer achievesa mean/median precision of 0.79.In Table 6, we list the most frequently occurring N-gram features. For example, ‘workaround for’ appeared 54times for SABD comments with the

Workaround purpose,and ‘be break’ appeared 18 times for SABD comments about

Limitation reason.

RQ2 Summary:

The auto-sklearn classiﬁer tends tooutperform the baselines for SABD reasons (0.68precision, 0.67 recall, 0.67 F1-score) and purposes(0.79 precision, 0.75 recall, 0.75 F1-score).

EMOVAL (RQ3)

In this section, we investigate the willingness of developersto remove the ‘ready-to-be-addressed’ SABD that containsresolved bug reports similar to the previous study [24]. Todo so, we mine for links in the comments of the manuallycoded data from Section 3. We then systematically assesswhether the SABD is ready to be addressed. This concept isknown as ‘on-hold’ SATD, which is a condition to indicatethat a developer is waiting for a certain event to occurelsewhere (e.g., an update to the behavior of a third-partylibrary or tool), according to the study of Maipradit etal. [25]. Below, we ﬁrst describe our studies of the incidencesof ‘ready-to-be-addressed’ SABD (Section 5.1), and our pro-posed clean-up pull requests and tracking issue reports(Section 5.2). TABLE 7: Frequency of link target types in 91 SABD com-ments

Category Frequency bug report 87 (84%)tutorial or article 6 (6%)404 4 (4%)Stack Overﬂow 2 (2%)pull request 1 (1%)software homepage 1 (1%)forum thread 1 (1%)blog post 1 (1%) sum

103 (100%)

Identify ready-to-be-addressed SABD.

We systematicallyidentify ‘ready-to-be-addressed’ SABD using the followinglist of conditions:Step 1. Extract hyperlinks or issue IDs from the com-ments. Using regular expressions, we extract 103 links from91 SABD comments. Then we manually code them basedon the link target coding guide of Hata et al. [14]. Table 7shows the link target distribution. We observe that bug report is the most frequently occurring (84%) link target in SABDcomments.Step 2. Check link target in SABD comments. We checkif the link target is an bug report with the status of ‘resolved’,‘closed’, ‘veriﬁed’, or ‘completed’, and resolution type is setto ‘ﬁxed’ similar to the previous study [24]. Furthermore,to facilitate the creation of our pull requests and issuereports, we exclude four candidates where: (I) the repositoryreferenced in the SABD comment has been archived; (II)the repository referenced by the issue report in the SABDcomment has been archived; (III) the repository referencedin the SABD comment is a mirror repository; (IV) the issuereport link in the SABD comment is a ‘cross-reference’ (e.g.,the issue report is referenced to aid in documenting therationale behind an implementation choice).

Observation 6:

Of the 91 SABD comments that containhyperlinks, 16 contain ‘ready-to-be-addressed’ SABD. Amongthe 16, Plugin conﬁguration is the most frequently occurringlocation, Limitation is the most frequently occurring reason, andWorkaround is the most frequently occurring purpose, i.e., 13,9, and 12 cases, respectively.

Table 8 shows that we initiallyidentiﬁed 27 ‘ready-to-be-addressed’ SABD in our dataset.However, we observed that 10 of the 27 SABD had alreadybeen removed by developers. Five SABD were removedbecause the entire ﬁle was deleted. The other ﬁve SABDwere addressed directly by developers. Additionally, oneSABD had been submitted as an issue report by a developer,but it has not been closed.

To evaluate the importance of ‘ready-to-be-addressed’SABD, we created issue reports and pull requests to thestudied projects. When preparing issue reports, we also pro-vide possible solutions for developers to deal with ‘ready-to-be-addressed’ SABD. Examples of issue reports and pullrequests are shown in Figures 3a and 3b.

Observation 7:

For ‘ready-to-be-addressed’ SABD, removalrates have been reached 43% and 56% in pull requests and issue (a) Example of an issue report(b) Example of a pull request

Fig. 3: Examples of created issue report and pull requestTABLE 8: Distribution of ‘ready-to-be-addressed’ SABD

Status Frequency

Existing 16 (59%)File deleted 5 (19%)Fixed by developers 5 (19%)Developers try to ﬁx 1 (3%) reports, respectively.

In total, we prepared seven issue reportsfor nine instances of SABD, since three SABD belong tothe same repository. In addition, we prepared seven pullrequests for the other seven instances of SABD comments.We found developers actively resolve these pull requestsand issue reports within a 20 days time frame.Three of the four responded pull requests have beenaccepted and merged into the master branch. For instance,one developer responded: “I merged it, and found and ﬁxedtwo others of the same type, which would have remained if youhad not brought it to our attention. Thanks!” . Only one pullrequest was rejected because the developer had to considerthe plugin version dependency, i.e., “Thanks for the reminder!Upgrading the plugin in my TODO list for the summer, so I’lllook into it shortly. The plugin version must be updated beforeremoving the conﬁg.” .For the prepared issue reports, ﬁve ‘ready-to-be-addressed’ SABD were resolved. For instance, one devel-oper left the appreciation: “Thanks for making us aware of thisfact.” . On the other hand, two issue reports were rejected:one case is where the developer did not agree that the issuewas an instance of technical debt, and another case is where,due to the system supporting multiple versions, the SABDcould not be removed.

RQ3 Summary:

We identiﬁed 16 instances of SABDthat are ‘ready-to-be-addressed’. Through our ex-periment, we propose pull requests for seven cases,three of which were merged. Moreover, we produceissue reports for nine cases, ﬁve of which wereresolved within 20 days. These responses suggestthat developers are receptive and reactive to SABDin build systems.

ECOMMENDATIONS

Based on our ﬁndings, we make the following recommen-dations for practitioners, researchers, and tool builders.First, we recommend that practitioners: • Track SABD by using issue trackers, as we ﬁnd thatdevelopers have tried to add issue report hyperlinksor issue IDs in tandem with comments. Using onlycomments to track or manage SABD is still problematic.Indeed, only 91 out of the 500 SABD comments containhyperlinks. Explicitly referencing related content willimprove traceability. • Check SABD containing resolved bug reports, as we identi-ﬁed 16 instances of SABD comments that are ready-to-be-addressed from RQ3. These stale SABD commentscould create confusion for anyone inspecting the code.Even more insights for practitioners would be discov-ered by the following future research: • Further studies of workarounds for SABD.

During thecoding process, we observe that one SABD workaroundcan be used across several repositories. This suggeststhat the retrieval and curation of workarounds mayhave broad implications beyond the scope of a speciﬁcproject, and are thus important and of value for practi-tioners. • Establishing an understanding of SABD in other buildsystems.

As seen in Tables 2 and 3, we identiﬁed ninelocations, ten reasons, and six purposes, which couldimprove the overall understanding of SABD in buildsystems. Applying the coding guides from RQ1 to otherbuild systems (e.g., make , CMake, or Ant) could help toestablish a broader theory of SABD in build systems ingeneral. • Improving the classiﬁcation of SABD in build systems.

InRQ2, we propose automatic classiﬁers to identify SABDcharacteristics. Tables 4 and 5 show that our classiﬁersare promising. The result demonstrates the feasibility ofautomatic classiﬁers and serves as a key step for devel-oping a SABD classiﬁcation system. However, it still hasroom for performance improvement. We suggest that infuture research, researchers evaluate other approachesto improve the SABD classiﬁers.The following directions for future work may yield valuefor tool builders: • Tool support for managing SABD in build systems.

Al-though we recommend that practitioners use issuereport hyperlinks or IDs to track SABD, it could bepractically useful to have tools or systems to helppractitioners manage SABD traceability automatically.A SABD management tool could make the developersaware of the debt being incurred and would makeit easy to continually avoid the debt as part of theirnormal workﬂow. A possible mock-up is presented byMaipradit et al. [24, Fig. 7]. • Focusing on top SABD locations and reasons would pro-vide the most beneﬁt to developers . In RQ1, we providethe most frequently occurring locations and reasonsfor SABD in the build systems. We suggest that toolbuilders make an extra effort on these top locations andreasons. • Tool support for recommending solutions to SABD in buildsystems.

During the creation of pull requests and issuereports, we observe that the possible solutions thatwe provided for developers to mitigate ‘ready-to-be-addressed’ SABD are similar and straightforward (e.g.,remove the extraneous comment or code). This obser-vation suggests that an automated tool for addressingSABD could be useful. This will not just help develop-ers to manage such SABD, but also will improve thequality of the ﬁnal product.

ELATED W ORK

In this section, we position our work with respect to theliterature on build systems and technical debt related to thisstudy.

Build system maintenance is a hidden cost, which takesa considerable amount of development effort. Kumfert etal. [20] argued that the need to keep the build systemsynchronized with the source code generates an implicitoverhead on the development process, and in their survey,developers claimed that they spend up to 35.71% of theirtime on build system maintenance. McIntosh et al. [30]analyzed ten large, long-live projects by mining the versionhistories, and their study showed that build system mainte-nance is 27% overhead on source code development. Adamset al. [1] studied the evolution of the Linux KBUILD ﬁles andhow these ﬁles co-evolve with the source code. McIntosh etal. [29] made similar observations in Java build systems.Build breakage and how to repair it have been widelystudied. Kerzazi et al. [18] interviewed 28 software engi-neers to study why build breakages are introduced in anindustrial setting. Rausch et al. [36] performed an analysisof build failures, which studies the variety and frequency ofbuild breakage in the CI environments of 14 open sourceJava projects. Islam and Zibran [17] studied the factorsthat may impact the build outcome, observing that thenumber of changed lines of code, ﬁles, and built commits intasks are most signiﬁcantly associated with build outcomes.Zolfagharinia et al. [47] studied the impact of operationsystem and runtime environment on build breakage in theCI environment of the Comprehensive Perl Archive Net-work (CPAN) ecosystem, suggesting interpretation of buildresults is not straightforward.In addition, researchers have proposed automated ap-proaches to repair build breakages. For example, Macho etal. [23] proposed BUILDMEDIC, an approach to automat-ically repair Maven builds that break due to dependency-related issues. Hassan and Wang [13] proposed HireBuild,an approach to automatically repair build scripts with ﬁxinghistories. Hassan [12] also outlined promising preliminarywork towards automatic build repair in CI environment thatinvolves both source code and build script.There have also been other predictive approaches pro-posed to promote awareness and simplify interactions withbuild systems. Tufano et al. [42] envisioned a predictivemodel that would preemptively alert developers about theextent to which their software changes may impact future building activities. Hassan and Zhang [11] deﬁned a modelfor predicting the certiﬁcation results of a software build.Bisong et al. [5] proposed and analyzed models that canpredict the build time of a job. Cao et. al [6] proposedBuildM´et´eo—a tool to forecast the duration of incrementalbuild jobs by analyzing a timing-annotated Build Depen-dency Graph (BDG).Although plenty of studies widely investigate the impor-tance of build system maintenance and propose approachesto relieve the build issue, there is no study that focuseson SATD within the scope of build system maintenance.However, build systems often suffer from massive main-tenance activities during the development process, and thepart of these activities is produced by SATD, since SATDchanges are more difﬁcult to perform and SATD inevitablygenerate long-term maintenance problems from a short-term hack. Thus, in this study, we ﬁrst characterize andmitigate SABD in the Maven build system and explore thefeasibility of training automatic classiﬁers to identify SABDcharacteristics.

Due to the importance of technical debt to the software de-velopment process and quality, there have been surveys andmapping studies about technical debt. Sierra et al. [41] sur-veyed research work on SATD, analyzing the characteristicsof current approaches and techniques for SATD detection,comprehension, and repayment. Li et al. [21] performed amapping study on technical debt and technical manage-ment. Vassallo et al. [43] showed that 88% of participantsmentioned documenting their suboptimal implementationchoices in the code that they produced.Prior studies widely analyzed the factors or activitiesthat affect technical debt. Besker et al. [3] observed that sixorganizational factors (experience of developers, softwareknowledge of startup founders, employee growth, uncer-tainty, lack of development process, and the autonomy ofdevelopers regarding TD decisions) were associated withthe beneﬁts and challenges of the intentional accumulationof technical debt in software. Besker et al. [4] also investi-gated activities on which wasted time is spent and whetherdifferent TD types impact the wasted time in different ways.The detection of technical debt is also widely studied.Liu et al. [22] proposed the SATD detector to automaticallydetect SATD comments and highlight, list, and managedetected comments in an Integrated Development Envi-ronment (IDE). Farias et al. [9] carry out three empiricalstudies to curate the knowledge embedded in the SATDidentiﬁcation vocabulary, which can be used to automat-ically identify and classify TD items through code com-ment analysis. Yan et al. [46] also proposed an automatedchange-level TD determination model that can identify TD-introducing changes. Wattanakriengkrai et al. [45] combineN-gram IDF and auto-sklearn machine learning approachesto train classiﬁers to identify requirement and design debt.Maldonado et al. [27] used NLP maximum entropy classi-ﬁers [28] to automatically identify design and requirementSATD in source code comments. Moreover, Ren et al. [37]used Convolution Neural Network-based approaches withbaseline text-mining approaches [16] to identify SATD in a cross-project prediction setting. Maipradit et al. [24, 25]identiﬁed “On-Hold” SATD for automated management.Inspired by these past studies of SATD, in this paper,we conduct the ﬁrst study on self-admitted technical debtin build systems. Similar to prior work, we ﬁrst set out tocharacterize SABD in build systems in terms of locations,reasons, and purposes. We provide three coding guides forSABD in build systems, and automated SABD classiﬁersare provided in Section 4. Furthermore, in this work, weinvestigate the willingness of developers to remove the‘ready-to-be-addressed’ SABD that refers to resolved issuereports. HREATS T O V ALIDITY

Below, we discuss the threats to the validity of our study:

Construct validity.

We use comment patterns to identifySABD comments in build ﬁles. Since SABD comment pat-terns are not enforced, we will miss SABD comments thatdo not conform to these comment patterns. To mitigate thisrisk, we expand upon a popular comment patterns list [35]with features recommended by Huang et al. [16].

Internal validity.

We rely on manually coded data,which may be miscoded due to the subjective nature ofunderstanding the coding schema. To mitigate this threat,we apply three best practices for open coding: 1) we con-duct four rounds of independent coding and calculate theCohen’s Kappa to ensure that our agreements at least are‘Substantial’; 2) we pursue saturation with concrete criteria,i.e., 50 consecutively coded comments for which no newcategories were discovered; 3) we perform two passes thatrevisited miscoded SABD comments based on additionalcontextual information.

External validity.

We only conduct an empirical study of300 Maven projects. As such, our results may not generalizeto all Maven projects or other build technologies. On theother hand, our sample of projects is diverse, includingprojects of varying size and domain. Nonetheless, replica-tion studies may help to improve the strength of generaliza-tions that can be drawn.

ONCLUSIONS

Addressing self-admitted technical debt (SATD) is an im-portant step in the development process. Recently, manystudies have focused on SATD in source code, but littleis known about SATD in the build systems. Thus, in thispaper, we characterize and propose mitigation strategies forthe coined term

Self-Admitted Build Debt (SABD). To do that,we (i) manually classiﬁed 500 SABD comments accordingto their locations, reasons, and purposes; and (ii) trainedSABD classiﬁers using the coded SABD comments; and (iii)investigated the willingness of developers to remove the‘ready-to-be-addressed’ SABD that references resolved bugreports.We observe that (i) SABD comments in Maven buildsystems most often occur in the plugin conﬁguration loca-tion; the most frequently occurring reasons behind SABDis to document limitations in tools and libraries, as wellas issues to be ﬁxed later; and (ii) our auto-sklearn clas-siﬁer achieves better performance than baseline classiﬁers, achieving an F1-score of 0.67–0.75; and (iii) the removal ratesof ‘ready-to-be-addressed’ SABD in pull requests and issuereports reached 43% and 56%, respectively. We foresee manypromising avenues for future work, such as improvementsto the classiﬁers, expanding our coded corpus of SABDcomments to other build systems, and automatic approachesto address SABD in build systems. A CKNOWLEDGEMENT

We would like to thank Rungroj Maipradit for provid-ing technical assistance in training auto-sklearn classi-ﬁer. This work has been supported by JSPS KAKENHIGrant Numbers JP18KT0013, JP18H04094, JP20K19774, andJP20H05706. R EFERENCES [1] B. Adams, K. D. Schutter, H. Tromp, and W. Meuter,“The evolution of the linux build system,”

ElectronicCommunication of the European Association of SoftwareScience and Technology (ECEASST) , 2007.[2] N. S. R. Alves, L. F. Ribeiro, V. Caires, T. S. Mendes,and R. O. Sp´ınola, “Towards an ontology of termson technical debt,” in

Proceedings of the InternationalWorkshop on Managing Technical Debt (MTD) , 2014.[3] T. Besker, A. Martini, R. Edirisooriya Lokuge, K. Blin-coe, and J. Bosch, “Embracing technical debt, froma startup company perspective,” in

Proceedings of theInternational Conference on Software Maintenance and Evo-lution (ICSME) , 2018.[4] T. Besker, A. Martini, and J. Bosch, “Software developerproductivity loss due to technical debt—a replicationand extension study examining developers’ develop-ment work,”

Journal of Systems and Software (JSS) , 2019.[5] E. Bisong, E. Tran, and O. Baysal, “Built to last or builttoo fast? evaluating prediction models for build times,”in

Proceedings of the International Conference on MiningSoftware Repositories (MSR) , 2017.[6] Q. Cao, R. Wen, and S. McIntosh, “Forecasting theduration of incremental build jobs,” in

Proceedings ofthe International Conference on Software Maintenance andEvolution (ICSME) , 2017.[7] K. Charmaz,

Constructing Grounded Theory . SAGE,2014.[8] W. Cunningham, “The wycash portfolio managementsystem,”

SIGPLAN OOPS Messenger , 1992.[9] M. A. de Freitas Farias, M. G. de Mendonc¸a Neto,M. Kalinowski, and R. O. Sp´ınola, “Identifying self-admitted technical debt through code comment anal-ysis with a contextualized vocabulary,”

Information andSoftware Technology (IST) , 2020.[10] M. Feurer, A. Klein, K. Eggensperger, J. T. Springen-berg, M. Blum, and F. Hutter, “Efﬁcient and robustautomated machine learning,” in

Proceedings of the In-ternational Conference on Neural Information ProcessingSystems (NIPS) , 2015.[11] A. E. Hassan and K. Zhang, “Using decision trees topredict the certiﬁcation result of a build,” in

Proceed-ings of the International Conference on Automated SoftwareEngineering (ASE) , 2006. [12] F. Hassan, “Tackling build failures in continuous inte-gration,” in Proceedings of the International Conference onAutomated Software Engineering (ASE) , 2019.[13] F. Hassan and X. Wang, “Hirebuild: An automaticapproach to history-driven repair of build scripts,” in

Proceedings of the International Conference on SoftwareEngineering (ICSE) , 2018.[14] H. Hata, C. Treude, R. G. Kula, and T. Ishio, “9.6 millionlinks in source code comments: Purpose, evolution, anddecay,” in

Proceedings of the International Conference onSoftware Engineering (ICSE) , 2019.[15] T. Hirao, S. McIntosh, A. Ihara, and K. Matsumoto,“The review linkage graph for code review analytics: Arecovery approach and empirical study,” in

Proceedingsof the European Conference on Foundations of SoftwareEngineering (ESEC/FSE) , 2019.[16] Q. Huang, E. Shihab, X. Xia, D. Lo, and S. Li, “Iden-tifying self-admitted technical debt in open sourceprojects using text mining,”

Empirical Software Engineer-ing (EMSE) , 2017.[17] M. R. Islam and M. F. Zibran, “Insights into continuousintegration build failures,” in

Proceedings of the Interna-tional Conference on Mining Software Repositories (MSR) ,2017.[18] N. Kerzazi, F. Khomh, and B. Adams, “Why do auto-mated builds break? an empirical study,” in

Proceedingsof the International Conference on Software Maintenanceand Evolution ((ICSME)) , 2014.[19] R. Kosara, F. Bendix, and H. Hauser, “Parallel sets:Interactive exploration and visual analysis of categor-ical data,”

Transactions on Visualization and ComputerGraphics (TVCG) , 2006.[20] G. Kumfert and T. Epperly, “Software in the doe: Thehidden overhead of ”the build”,” Lawrence LivermoreNational Lab., CA (US), Tech. Rep., 2002.[21] Z. Li, P. Avgeriou, and P. Liang, “A systematic mappingstudy on technical debt and its management,”

Journalof Systems and Software (JSS) , 2015.[22] Z. Liu, Q. Huang, X. Xia, E. Shihab, D. Lo, and S. Li,“Satd detector: A text-mining-based self-admitted tech-nical debt detection tool,” in

Proceedings of the Inter-national Conference on Software Engineering: CompanionProceeedings (ICSE-Companion) , 2018.[23] C. Macho, S. McIntosh, and M. Pinzger, “Automat-ically repairing dependency-related build breakage,”in

Proceedings of the International Conference on SoftwareAnalysis, Evolution and Reengineering (SANER) , 2018.[24] R. Maipradit, B. Lin, C. Nagy, G. Bavota, M. Lanza,H. Hata, and K. Matsumoto, “Automated identiﬁcationof on-hold self-admitted technical debt,” in

Proceedingsof the International Working Conference on Source CodeAnalysis and Manipulation (SCAM) , 2020.[25] R. Maipradit, C. Treude, H. Hata, and K. Matsumoto,“Wait for it: identifying “on-hold” self-admitted techni-cal debt,”

Empirical Software Engineering (EMSE) , 2020.[26] E. D. S. Maldonado, R. Abdalkareem, E. Shihab, andA. Serebrenik, “An empirical study on the removalof self-admitted technical debt,” in

Proceedings of theInternational Conference on Software Maintenance and Evo-lution (ICSME) , 2017.[27] E. D. S. Maldonado, E. Shihab, and N. Tsantalis, “Using natural language processing to automatically detectself-admitted technical debt,”

Transactions on SoftwareEngineering (TSE) , 2017.[28] C. Manning and D. Klein, “Optimization, maxent mod-els, and conditional estimation without magic,” in

Pro-ceedings of the Conference of the North American Chapterof the Association for Computational Linguistics on HumanLanguage Technology: Tutorials(NAACL-Tutorials) , 2003.[29] S. Mcintosh, B. Adams, and A. E. Hassan, “The evolu-tion of java build systems,”

Empirical Software Engineer-ing (EMSE) , 2012.[30] S. McIntosh, B. Adams, T. H. Nguyen, Y. Kamei, andA. E. Hassan, “An empirical study of build mainte-nance effort,” in

Proceedings of the International Confer-ence on Software Engineering (ICSE) , 2011.[31] S. McIntosh, M. Nagappan, B. Adams, A. Mockus,and A. E. Hassan, “A large-scale empirical study ofthe relationship between build technology and buildmaintenance,”

Empirical Software Engineering (EMSE) ,2015.[32] S. Mensah, J. Keung, J. Svajlenko, K. E. Bennin, andQ. Mi, “On the value of a prioritization scheme for re-solving self-admitted technical debt,”

Journal of Systemsand Software (JSS) , 2018.[33] J. D. Morgenthaler, M. Gridnev, R. Sauciuc, andS. Bhansali, “Searching for build debt: Experiencesmanaging technical debt at google,” in

Proceedings ofthe International Workshop on Managing Technical Debt(MTD) , 2012.[34] P. Morville and L. Rosenfeld,

Information architecturefor the World Wide Web: Designing large-scale web sites .O’Reilly Media, 2006.[35] A. Potdar and E. Shihab, “An exploratory study on self-admitted technical debt,” in

Proceedings of the Interna-tional Conference on Software Maintenance and Evolution(ICSME) , 2014.[36] T. Rausch, W. Hummer, P. Leitner, and S. Schulte, “Anempirical analysis of build failures in the continuousintegration workﬂows of java-based open-source soft-ware,” in

Proceedings of the International Conference onMining Software Repositories (MSR) , 2017.[37] X. Ren, Z. Xing, X. Xia, D. Lo, X. Wang, andJ. Grundy, “Neural network-based detection of self-admitted technical debt: from performance to explain-ability,”

Transactions on Software Engineering and Method-ology (TOSEM) , 2019.[38] G. Salton and C. Buckley, “Term-weighting approachesin automatic text retrieval,”

Information Processing andManagement (IP&M) , 1988.[39] M. Shirakawa. N-gram weighting scheme,. [Online].Available: https://github.com/iwnsew/ngweight[40] M. Shirakawa, T. Hara, and S. Nishio, “Idf for wordn-grams,”

Transactions on Information Systems (TOIS) ,2017.[41] G. Sierra, E. Shihab, and Y. Kamei, “A survey of self-admitted technical debt,”

Journal of Systems and Software(JSS) , 2019.[42] M. Tufano, H. Sajnani, and K. Herzig, “Towards pre-dicting the impact of software changes on buildingactivities,” in

Proceedings of the International Conferenceon Software Engineering: New Ideas and Emerging Results (ICSE-NIER) , 2019.[43] C. Vassallo, F. Zampetti, D. Romano, M. Beller,A. Panichella, M. Di Penta, and A. Zaidman, “Continu-ous delivery practices in a large ﬁnancial organization,”in Proceedings of the International Conference on SoftwareMaintenance and Evolution (ICSME) , 2016.[44] A. Viera and J. Garrett, “Understanding interobserveragreement: The kappa statistic,”

Family medicine , 2005.[45] S. Wattanakriengkrai, R. Maipradit, H. Hata,M. Choetkiertikul, T. Sunetnanta, and K. Matsumoto,“Identifying design and requirement self-admittedtechnical debt using n-gram idf,” in

Proceedings ofInternational Workshop on Empirical Software Engineeringin Practice (IWESEP) , 2018.[46] M. Yan, X. Xia, E. Shihab, D. Lo, J. Yin, and X. Yang,“Automating change-level self-admitted technical debtdetermination,”

Transactions on Software Engineering(TSE) , 2019.[47] M. Zolfagharinia, B. Adams, and Y.-G. Gu´eh´eneuc,“Do not trust build results at face value: An empiricalstudy of 30 million cpan builds,” in

Proceedings of theInternational Conference on Mining Software Repositories(MSR) , 2017.

Tao Xiao is a Master’s student at the Depart-ment of Information Science, Nara Institute ofScience and Technology, Japan. He receivedhis BSc degree in Software Engineering fromChiang Mai University, Thailand, in 2020. Hismain research interests are empirical softwareengineering, mining software repositories, natu-ral language processing.

Dong Wang is currently working toward the Doc-tor degree in Nara Institute of Science and Tech-nology in Japan. His research interests includecode review and mining software repositories.

Shane McIntosh is an Associate Professor atthe University of Waterloo. Previously, he was anAssistant Professor at McGill University, wherehe held the Canada Research Chair in Soft-ware Release Engineering. He received hisPh.D. from Queen’s University, for which he wasawarded the Governor General’s Academic GoldMedal. In his research, Shane uses empiricalmethods to study software build systems, re-lease engineering, and software quality: http://shanemcintosh.org/.

Hideaki Hata is an Assistant Professor at theNara Institute of Science and Technology. Hisresearch interests include software ecosystems,human capital in software engineering, and soft-ware economics. He received a Ph.D. in in-formation science from Osaka University. Moreabout Hideaki and his work is available online athttps://hideakihata.github.io/.

Raula Gaikovina Kula is an Assistant Profes-sor at Nara Institute of Science and Technol-ogy. He received the Ph.D degree from NaraInstitute of Science and Technology in 2013.His interests include Software Libraries, Soft-ware Ecosystems, Code Reviews and MiningSoftware Repositories.

Takashi Ishio received the Ph.D. degree in infor-mation science and technology from Osaka Uni-versity in 2006. He was a JSPS Research Fellowfrom 2006–2007. He was an Assistant Profes-sor at Osaka University from 2007–2017. He isnow an Associate Professor of Nara Institute ofScience and Technology. His research interestsinclude program analysis, program comprehen-sion, and software reuse. He is a member of theIEEE, ACM, IPSJ and JSSST.