Better Technical Debt Detection via SURVEYing
Fahmid M. Fahid
Computer Science, North Carolina State University, Raleigh, USA
[email protected]
Zhe Yu
Computer Science, North Carolina State University, Raleigh, USA
[email protected]
Tim Menzies
Computer Science, North Carolina State University, Raleigh, USA
[email protected]
Abstract—Software analytics can be improved by surveying; i.e. rechecking and (possibly) revising the labels offered by prior analysis. Surveying is a time-consuming task and effective surveyors must carefully manage their time. Specifically, they must balance the cost of further surveying against the additional benefits of that extra effort.
This paper proposes SURVEY0, an incremental Logistic Regression estimation method that implements such cost/benefit analysis. Some classifier is used to rank the as-yet-unvisited examples according to how interesting they might be. Humans then review the most interesting examples, after which their feedback is used to update an estimator for how many interesting examples remain.
This paper evaluates SURVEY0 in the context of self-admitted technical debt. As software projects mature, they can accumulate "technical debt"; i.e. developer decisions which are sub-optimal and decrease the overall quality of the code. Such decisions are often commented on by programmers in the code; i.e. they are self-admitted technical debt (SATD). Recent results show that text classifiers can automatically detect such debt. We find that we can significantly outperform prior results by SURVEYing the data. Specifically, for ten open-source JAVA projects, we can find 83% of the technical debt via SURVEY0 using just 16% of the comments (and if higher levels of recall are required, SURVEY0 can adjust towards that with some additional effort).
Index Terms —Technical debt, software analytics
I. INTRODUCTION
This paper is about cost-effective analytics using surveying; i.e. rechecking and (possibly) revising the labels found by prior analysis. We demonstrate the value of surveying by showing that it can lead to better predictors for technical debt than existing state-of-the-art methods [1]. Studying technical debt is important since it can significantly damage project maintainability [2], [3], [4]. When developers cut corners and make haste to rush out code, then that code often contains technical debt; i.e. decisions that must be repaid, later on, with further work. Technical debt is like dirt in the gears of software production. As technical debt accumulates, development becomes harder and slower. Technical debt can affect many aspects of a system including evolvability (how fast we can add new functionality) and maintainability (how easily developers can handle new or unseen bugs in code).
Surveying is important for automated software engineering since many automated software analytics methods assume that they are learning from correctly labelled examples. However, before an automated method uses labels from old data, it is prudent to revisit and recheck the labels generated by prior analysis. This is needed since humans often make mistakes in the labelling [5]. But surveying can be a (very) time-consuming process. For example, later we show that surveying all the data used in this study would require more than 350 hours. Clearly, surveying will not be adopted as standard practice unless we can reduce its associated effort.
Algorithm 1 describes SURVEY0, a human-in-the-loop algorithm for reducing the cost of surveying. The details of SURVEY0 are offered later in this paper. For now, it suffices to say that SURVEY0 includes an early exit strategy (in Step 5) that is triggered if "enough" examples have been found.
To assess the significance of SURVEY0, this paper builds predictors for technical debt, with and without surveying. That experience lets us answer the following research questions.
RQ1: Is surveying necessary?
If there are no disputes about what labels to assign to (e.g.) code comments, there is no need for surveying the data. But this is not the case. We find that labels about technical debt from different sources have many disagreements (36% to 79%, median to max). Hence:
Conclusion
Surveying is required to resolve disagreement about labels.
RQ2: Is SURVEY0 useful?
SURVEY0 cannot be recommended unless it improves our ability to make quality predictions. Therefore we compared the predictive power of classifiers that were trained using SURVEY0's labels. We found that SURVEY0's labels improved recall from 63% to 83% (median values across 10 data sets). That is:
Algorithm 1
SURVEY0 = {C, S, E, R, m}. Using a human reader R, a sorter S, a classifier C and an estimator E, SURVEY0 updates its knowledge every m examples.
1) Randomly sort n software artifacts (e.g. code comments);
2) Use prior data to build a classifier C and a sorter S;
3) Using S, ask reader R to review and label the first m ≪ n artifacts as "good, bad" (in our case, "bad" means "has TD");
4) Using the labelled examples, update an estimator E for the number of "bad" remaining in the n − m examples;
5) Exit if E says we found enough "bad" examples;
6) Else:
• Skip over the first m artifacts. Set n = n − m.
• Apply the sorter S to arrange the remaining artifacts, in order of descending "bad"-ness;
• Loop to Step 3.
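For concreteness, a minimal sketch of the Algorithm 1 loop is shown below. The helpers build_classifier, sort_scores, estimate_remaining and human_label are hypothetical stand-ins for the classifier C, sorter S, estimator E and reader R detailed in Section IV; this is an illustration of the workflow, not our released implementation.

```python
# Minimal sketch of the Algorithm 1 loop.  build_classifier, sort_scores,
# estimate_remaining and human_label are hypothetical stand-ins for the classifier C,
# sorter S, estimator E and reader R described in Section IV; not the released code.
import numpy as np

def survey0(prior_comments, prior_labels, new_comments, m=100, target_recall=0.90):
    clf = build_classifier(prior_comments, prior_labels)  # Step 2: C learned from prior projects
    labels = {}                                            # index -> 0/1 label from the reader R
    remaining = list(range(len(new_comments)))
    np.random.shuffle(remaining)                           # Step 1: random initial order
    while remaining:
        # Steps 3 and 6: the sorter S ranks unvisited comments by likely "bad"-ness (has TD)
        scores = np.asarray(sort_scores(clf, [new_comments[i] for i in remaining]))
        remaining = [i for _, i in sorted(zip(-scores, remaining))]
        batch, remaining = remaining[:m], remaining[m:]
        for i in batch:
            labels[i] = human_label(new_comments[i])       # reader R labels the next m comments
        # Step 4: the estimator E guesses how many "bad" comments exist in total
        est_total = estimate_remaining(labels, new_comments, clf)
        if est_total and sum(labels.values()) >= target_recall * est_total:
            break                                          # Step 5: exit once "enough" TD found
    return labels
```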
Conclusion
SURVEY0 improved quality predictions.
RQ3: Is SURVEY0 comparable to the state-of-the-art for human-in-the-loop AI?
SURVEY0 is not a research contribution unless it out-performs other human-in-the-loop AI tools. Therefore we compared our results to those from an "optimum" tool. For this study, "optimum" was computed by giving prior state-of-the-art methods an undue advantage (allowing those methods to tune hyperparameters for better test results). We found that:
Conclusion
SURVEY0 made its predictions at a near-optimum rate.
RQ4: How soon can SURVEY0 learn quality predictors?
SURVEY0 cannot be said to mitigate the relabelling problem unless it finds most quality issues using very few artifacts. Therefore we tracked how many artifacts SURVEY0 had to show the humans before it found most of the technical debt. We found that SURVEY0 asked humans to read around 16% of the comments while finding 83% of the technical debt (median values across ten data sets).
Conclusion
SURVEY0 can be recommended as a way to reduce the labelling effort.
RQ5: Can SURVEY0 find more issues?
The previous research question showed that SURVEY0 can find most issues after minimal human effort. But if finding (e.g.) 83% of the quality issues is not enough, can SURVEY0 be used to find even more technical debt issues? We find that SURVEY0's stopping rule can be modified to find more issues at the additional cost of more reading:
Conclusion
SURVEY0 can be used to achieve some additional desired level of quality assurance.
RQ6: How much does SURVEY0 delay human readers?
Humans grow frustrated and unproductive when they must wait a long time for a system response. Therefore we recorded how long humans had to wait for SURVEY0's conclusions. We found that SURVEY0 needs half a minute to find the next m = 100 most interesting programmer comments. Humans, on the other hand, need around twenty minutes [7] to assess whether those 100 comments are examples of "self-admitted technical debt" (defined later in this paper). That is, SURVEY0 delays humans by 0.5/(20 + 0.5) ≈ 2.4%. To put that another way:
Conclusion
SURVEY0 imposed negligible overhead (i.e. less than 5%) on the activity of human experts.
The rest of this paper is structured as follows. Section II discusses the background and related concepts needed for our study. Section IV gives a brief description of our dataset, a detailed description of SURVEY0, and our experimental and evaluation methods. Section VI discusses the results of our study. We discuss threats to validity in Section VII. We close by discussing the implications of this work and possible future directions.
Note that, for reproduction purposes, all our data and scripts are publicly available (see github.com/blinded4Review).
II. BACKGROUND
A. About Technical Debt
Technical debt (TD) affects multiple aspects of the software development process (see Figure 1). The term was first introduced by Cunningham in 1993 [2]. It is a widespread problem:
• In 2012, after interviewing 35 software developers from different projects in different companies, varying in both size and type, Lim et al. [8] found that developers generate TD due to factors like increased workload, unrealistic deadlines in projects, lack of knowledge, boredom, peer-pressure among developers, unawareness or short-term business goals of stakeholders, and reuse of legacy, third party, or open source code.
Fig. 1. Impact of technical debt on software. From [6].
• After observing five large scale projects, Wehaibi et al. found that the amount of technical debt in a project may be very low (only 3% on average), yet it creates a significant amount of defects in the future (and fixing such technical debt is more difficult than fixing regular defects) [9].
• Another study of five large scale software companies revealed that TD contaminates other parts of a software system and that most of the future interest is non-linear in nature with respect to time [10].
• According to the SIG (Software Improvement Group) study of Nugroho et al., a regular mid-level project owes $8 , , in TD and resolving TD has a Return On Investment (ROI) of 15% in seven years [4].
• Guo et al. also found similar results and concluded that the cost of resolving TD in the future is twice as much as resolving it immediately [3].
Much research has tried to identify TD as part of code smells using static code analysis, with limited success [11], [12], [13], [14], [15]. Static code analysis has a high rate of false alarms while imposing complex and heavy structures for identifying TD [16], [17], [18], [19].
Recently, much more success has been seen in work on so-called "self-admitted technical debt" (SATD). A significant part of technical debt is often "self-admitted" by the developer in code comments [20]. In 2014, after studying four large scale open source software projects, Potdar and Shihab [20] concluded that developers intentionally leave traces of TD in their comments with remarks like "hack, fixme, is problematic, this isn't very solid, probably a bug, hope everything will work, fix this crap". Potdar and Shihab found 62 distinct keywords for identifying such TD [20] (similar conclusions were made by Farias et al. [21]). In 2015, Maldonado et al. used five open source projects to manually classify different types of SATD [7] and found:
• SATD mostly contains requirement debt and design debt in source code comments;
• 75% of the SATD gets removed, but the median lifetime of SATD ranges between 18 and 173 days [22].
Another study tried to find SATD-introducing commits in Github using different change-level features [23]. Instead of using a bag-of-words approach, a recent study also proposed word embedding as the vectorization technique for identifying SATD [24]. Other studies investigated source code comments using different text processing techniques. For example, Tan et al. analyzed source code comments using natural language processing to understand programming rules and documentation and to indicate comment quality and inconsistency [25], [26]. A similar study was done by Khamis et al. [27]. After analyzing and categorizing comments in source code, Steidl et al. proposed a machine learning technique that can measure comment quality according to category [28]. Malik et al. used random forests to understand the lifetime of code comments [29]. A similar study over three open source projects was also done by Fluri et al. [30].
In 2017, Maldonado et al. identified two types of SATD in 10 open source projects (average 63% F1 score) using Natural Language Processing (a Max Entropy Stanford Classifier) using only 23% of the training examples [31]. A different approach was introduced by Huang et al. in 2018 [1]. Using eight datasets, Huang et al. built a Naive Bayes Multinomial sub-classifier for each training dataset, using information gain for feature selection. By implementing an ensemble technique over the sub-classifiers, they found an average 73% F1 score across all datasets [1]. A recent IDE plugin for Eclipse was also released using this technique for identifying SATD in Java projects [32]. To the best of our knowledge, Huang et al.'s EMSE'18 paper is the current state-of-the-art approach for identifying SATD. Hence, we base our work on their methods.
B. About Surveying
This section describes surveying, why it is needed, and why cost-effective methods for surveying are required.
Standard practice in software analytics is for different researchers to try their methods on shared data sets. For example, in 2010, Jureczko et al. [33] offered tables of data that summarized dozens of open source JAVA projects. That data is widely used in the literature. A search on Google Scholar for "xalan synapse" (two of the Jureczko data sets) shows that these data sets are used in 177 papers and eight textbooks, 126 of which are from the last five years.
Reusing data sets from other researchers has its advantages and disadvantages. One advantage is repeatability of research results; i.e. using this shared data, it is now possible and practical to repeat/refute/improve prior results. For examples of such analysis, see the proceedings of the PROMISE conference or the ROSE festivals (recognizing and rewarding open science in SE) at FSE'18, FSE'19, ESEM'19 and ICSE'19. See also the list of 678 papers which reuse data from the Software-artifact Infrastructure Repository at Nebraska University (sir.csc.ncsu.edu/portal/usage.php).
Another advantage is faster research. Software analytics data sets contain independent and dependent variables. For example, in the case of self-admitted technical debt, the independent variables are the programmer comments and the dependent variable is the label "SATD=yes" or "SATD=no". Independent variables can often be collected very quickly (e.g. Github's API permits 5000 queries per hour). However, assigning the dependent labels is a comparatively much slower task. According to Maldonado and Shihab [7], classifying 33,093 comments as "SATD ∈ {yes, no}" from five open source projects took approximately 95 hours for a single person; i.e. 10.3 seconds per comment. Using that information, we calculate that relabelling the data used in this paper would require months of work (see Table I). When a task takes months to complete, it is not surprising that research teams tend to reuse old labels rather than make their own.
That said, the clear disadvantage of reusing old labels is reusing old mistakes. Humans often make mistakes when labelling [5]. Hence, it is prudent to review the labels found in a dataset. We use the term "surveying" to refer to the process of revisiting, rechecking, and possibly revising the labels offered by prior analysis.
TABLE I
COST OF LABELLING, THREE DIFFERENT SCENARIOS.

Scenario 1: Our dataset contains 62,275 comments (from ten projects). At 10.3 seconds/comment [7], this takes 178 hours to label. With two readers (one to read, one to verify), this time becomes 356 hours. Using the power of pizza, we can assemble a hackathon team of half a dozen graduate students willing to work on tasks like this, six hours per day, two days per month; i.e. 6*6*2 = 72 hours per month.

Scenario 2: Note that, if pushed, we could demand more time from these students. For example, we could demand that two students work on this task, full time. Given the tedium of that task, we imagine that they could work productively on it for 20 hours per week per person. Under these conditions, revisiting and relabelling our data would take nearly two months.

Scenario 3: Given sufficient funds, such labelling could be done at a much faster rate. Crowdsourcing tools like Mechanical Turk could be used to assemble any number of readers to revisit and relabel all the comments in just a matter of hours [34], [35], [36]. While this is certainly a useful heuristic method for scaling up labelling, every so often there must be a validation study where the results of crowdsourcing are checked against some "ground truth". This paper is concerned with cost-effective methods for generating that ground truth.
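As a rough check on these scenarios, the back-of-the-envelope arithmetic (assuming the 10.3 seconds-per-comment reading rate of [7] and the comment counts of Table II) is:

```latex
% Arithmetic behind Table I (10.3 s/comment from [7]; 62,275 comments summed from Table II)
62{,}275 \times 10.3\,\mathrm{s} \approx 178\,\mathrm{h};\qquad 2 \times 178\,\mathrm{h} = 356\,\mathrm{h}\ \text{(read + verify)}
\\
\text{Scenario 1: } 356\,\mathrm{h} \div 72\,\mathrm{h/month} \approx 5\ \text{months};\qquad
\text{Scenario 2: } 356\,\mathrm{h} \div (2 \times 20)\,\mathrm{h/week} \approx 9\ \text{weeks} \approx 2\ \text{months}
```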
In our experience, surveying is usually done on a somewhat informal basis. For example, researchers might manually survey a small number of randomly selected artifacts (e.g. 1% of the corpus, or 100 artifacts). There are many problems with this informal approach to surveying:
• How many random selections are enough? That is, on what basis should we select?
• And when to stop surveying? Should finding N errors prompt N more samples? Or is there some point after which further surveying is no longer cost-effective?
In order to answer these questions, the rest of this article discusses cost-effective methods for surveying.
III. RELATED WORK
The process we call surveying uses some technology from research on active learning [37], [38]. "Active learners" assume that some oracle offers labels for examples and that a cost is incurred each time we invoke the oracle (in the case of surveying, that might mean asking a human to check if a particular code comment is an example of SATD). The research on active learning was certainly motivating for this work. However, standard active learning methods were not immediately applicable to the problem of technical debt. Accordingly, we made numerous changes to standard methods. Firstly,
SURVEY0's workflow is different (more informed) than that of a standard active learner. Such learners do not know when to stop learning. Since our goal is to understand how many more items we need to read, SURVEY0 adds an incremental estimation method that studies how fast humans are currently finding interesting examples and imposes a stopping criterion based on that estimation. That estimation method is described later in this paper. (Aside: outside the machine learning literature, we did find two information retrieval methods for predicting when to stop incremental learning, from Ros et al. [39] and Cormack [40]. When we experimented with these methods, we found that our estimators out-performed them. For more on this point, see the
RQ3 results discussed later in this paper.)
Secondly, we needed different learning methods. Active learning in SE has been applied previously in (e.g.) the code search recommender tools of Gay et al. [41] that seek methods implicated in bug reports. Our work is very different to that:
• Code search recommender tools input bug reports and output code locations. In between, those tools search static code descriptors; i.e. theirs is a code analysis tool.
• The tools of this paper input programmer comments and output predictors of technical debt. In between, our methods search text comments; i.e. ours is a text mining tool.
Thirdly, we had to make more use of prior knowledge:
• Initially, we tried tools built to help researchers find (say) a few dozen relevant papers within 1000 abstracts downloaded from Google Scholar [42]. Those tools were not successful (they resulted in single digit recall values).
• On investigation, we realized those tools started learning afresh for each new problem. That is, those tools assumed that prior knowledge was not relevant to new projects.
• That assumption seemed inappropriate for this paper since, for most commercial software developers, software is more often extended and refined than built from scratch. In such an environment, it is possible to discover important lessons from prior projects.
• Hence, as shown in Step 2 of Algorithm 1, SURVEY0 starts by learning models from all prior projects. After that, the rest of SURVEY0 uses feedback from the current project to refine the estimations from those models.
IV. INSIDE SURVEY0
Recall from the above that SURVEY0 is characterized by: {C, S, E, R, m}. That is, SURVEY0 updates its knowledge every m examples, using a reader R, a sorter S, a classifier C and an estimator E. In the experiments of this paper, m defines how much data is passed to humans (each time, we pass m = 100 examples). The rest of this section describes C, S, E, R. Just to say the obvious, this section includes many engineering choices which future research may want to revisit and revise. We make no claim that SURVEY0 is the best surveying tool. Rather, our goal is to popularize the surveying problem and produce a baseline result which can be used to guide the creation of better surveyors.

A. About the Classifiers "C"

This paper compares two classifiers:
• C = ensemble decision trees (EnsembleDT)
• C = linear SVM
These learners were selected as follows. Firstly, as to our use of SVMs, these are a commonly used method for text mining [43]. An SVM separates binary-labelled data with the margin between the two classes as far apart as possible. An SVM uses a kernel to transform the problem data into a higher dimensional space where it is easier to find a decision boundary between examples. The SVM models this boundary as a set of support vectors; i.e. examples of different classes closest to the boundary. Depending on the kernel used with an SVM, training times can be very fast or very slow. For this work, we tried a linear SVM and an SVM with radial basis functions. There was no significant performance delta between them and the linear SVM was much faster. Hence, for this work, we report linear SVM only.
Secondly, as to our use of EnsembleDT, our aim was to extend the results of Huang et al.'s EMSE'18 paper. That work used an ensemble of Naive Bayes classifiers. In that approach:
• The authors first trained one Naive Bayes Multinomial (NBM) sub-classifier for each training project.
• These solo classifiers were then consulted as an ensemble, where each solo classifier voted on whether or not some test example was an example of SATD.
• The output of such an ensemble classifier is the majority vote across all the ensemble members.
To build their system, Huang et al. used the Weka library (written in Java) [44] with its built-in "StringToWordVector" for vectorization and "NaiveBayesMultinomial" for classification. We were unable to find an equivalent vectorizer in Python, so we used the standard TF-IDF vectorizer. We failed to reproduce their results using Scikit-learn's [45] Naive Bayes Multinomial. But by retaining the ensemble method (as recommended by Huang et al.) and switching the classifier to Decision Trees (DT), we obtained similar results. Thus, for our experiment, we used their framework and data (with two additional projects) but with a modification to the learner (Decision Trees, not Naive Bayes Multinomial).
Decision tree learners recursively split data such that each split is more uniform than its parent. The attribute used to split the data is selected in order to minimize the diversity of the data after each split. This is a very fast and efficient machine learning algorithm [46].
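To make these choices concrete, the two configurations can be assembled with scikit-learn roughly as follows. This is a sketch only; the exact vectorizer settings and tree parameters used in our experiments may differ, and the 0/1 label encoding (1 = SATD) is assumed.

```python
# Sketch of the two classifier options C: a linear SVM over TF-IDF features, and an
# ensemble of decision trees (one tree per training project, counting SATD votes).
# Hyperparameters are illustrative defaults, not the exact settings of our experiments.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def build_linear_svm(train_comments, train_labels):
    vec = TfidfVectorizer()                       # standard TF-IDF vectorization
    X = vec.fit_transform(train_comments)
    return vec, LinearSVC().fit(X, train_labels)  # labels: 1 = SATD, 0 = not SATD

def build_ensemble_dt(projects):
    """projects: list of (comments, labels) pairs, one per prior project."""
    vec = TfidfVectorizer()
    vec.fit([c for comments, _ in projects for c in comments])
    trees = [DecisionTreeClassifier().fit(vec.transform(comments), labels)
             for comments, labels in projects]
    return vec, trees

def ensemble_satd_votes(vec, trees, comments):
    """How many ensemble members vote 'SATD' for each comment (used by the sorter)."""
    X = vec.transform(comments)
    return sum(tree.predict(X) for tree in trees)
```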
B. About the Sorters "S"

SURVEY0 asks its learners to sort examples by how "interesting" they are. Our two classifiers need different sorters:
• For EnsembleDT, the S function counts how many times ensemble members vote for SATD.
• For linear SVMs, a "most interesting" example would be an unlabelled artifact on the SATD side of the decision boundary, and furthest away from that boundary. Hence, the sorter S for linear SVMs is "distance from the boundary" (and for this measure, we take the SATD side of the boundary to be positive distance).
Note that when our estimator needs the probability that an example is technical debt, we reach into these sort orders and report the position of that example w.r.t. the other examples. Formally, those probabilities are generated by normalizing the sort scores into the range 0..1.

Fig. 2. Example retrieval curve (project SQL) using SURVEY0. "Actual" is the retrieval by the human according to the sorter S. "Total Estimation" is the output from the estimator. With Target@90, that becomes the "Target@90 Estimation". This intersects at point S where we stop with 85% recall and 17% cost.
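A sketch of the two sorters, and of the normalization that turns sort scores into the probabilities consumed by the estimator, follows (assuming scikit-learn-style models and 0/1 labels; illustrative only):

```python
# Sketch of the sorter S: score unlabelled comments by how "interesting" they are,
# then normalize those scores to 0..1 for the estimator E.
import numpy as np

def svm_scores(svm, X_unlabelled):
    # Linear SVM: signed distance from the decision boundary (SATD side is positive).
    return svm.decision_function(X_unlabelled)

def ensemble_scores(trees, X_unlabelled):
    # EnsembleDT: number of ensemble members voting SATD (labels assumed to be 0/1).
    return sum(tree.predict(X_unlabelled) for tree in trees)

def to_probabilities(scores):
    # Normalize the sort scores into the range 0..1, as used by the estimator.
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.full_like(scores, 0.5)

def most_interesting_first(scores):
    # Indices of the unlabelled comments, highest score (most likely SATD) first.
    return np.argsort(scores)[::-1]
```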
C. About the Estimator "E"

SURVEY0 uses an internal estimator, built using a Logistic Regression (LR) curve. Using this estimator, it is possible to guess how many more interesting examples are left to find. This estimator is used as follows. First, users specify a target goal, e.g. find 90% of all the technical debt comments. Next SURVEY0 executes, asking the reader R to examine m = 100 comments at a time. As this process continues, more and more of the technical debt comments are discovered.
Figure 2 shows a typical growth curve. The dotted blue line shows the evolving estimator. In practice, E often over-estimates how much technical debt has been found. Hence, after reading x = 17% of the code, the estimator reports that the target has been met; i.e. that 90% of the TD has been found (even though the exact figure is 85%, see Figure 2).
Algorithm 2 describes our estimator. The estimator takes two inputs: some probability from the classifier (see §IV-B) and the labels. All unlabelled examples are assumed to be "not technical debt" (because, as shown in Table II, actual TD comments are quite rare). A logistic regression model is then trained using the probabilities (from the learners) as the independent variable and the labels as the dependent variable. Using an iterative approach, the labels for the unlabelled data are predicted and the total number of remaining target class examples is calculated.

Algorithm 2: SURVEY0 estimator, with m_l examples labelled by a human (1 is SATD, 0 is Not-SATD) and m_u unlabelled examples all marked as 0 (because the dataset is very imbalanced). This is our y_i. The algorithm obtains its probabilities from the sorter S described in §IV-B.
1) Count the total number of positives in y_i, say C_i.
2) Train a Logistic Regression curve LR using x and y_i.
3) Use LR to predict the probabilities of the m_u unlabelled datapoints, say p_i.
4) Sort p_i in decreasing order.
5) For each datapoint m_u_j in the sorted p_i not marked as "seen", calculate a cumulative sum of the probabilities from the sorted list one by one and mark each datapoint as "seen". Whenever the sum > 1, reset the probability of the first one, m_u_j = 1, and the rest to 0. Repeat Step 5 until all datapoints are marked as "seen". At the end of this step, p_i has new probabilities with 1s and 0s only.
6) Merge p_i (the new probabilities of the unlabelled examples) with m_l (the labelled examples) to get y_{i+1}.
7) Count the total number of positives in y_{i+1}, say C_{i+1}.
8) If C_i ≠ C_{i+1}, go back to Step 1 with y_{i+1} as the new y_i.
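A simplified sketch of Algorithm 2 is shown below, assuming scikit-learn's LogisticRegression. The grouping rule in Step 5 is our reading of the cumulative-sum step; the released implementation may differ in detail.

```python
# Simplified sketch of the Algorithm 2 estimator: iteratively fit a logistic regression
# on the sorter probabilities and convert its predictions into an estimated total
# number of SATD comments.
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_total(probs, labelled_mask, labels, max_iter=50):
    """probs: sorter probabilities (0..1) for all comments.
    labelled_mask: True where the human reader has supplied a label.
    labels: 0/1 human labels (ignored where labelled_mask is False)."""
    x = np.asarray(probs, dtype=float).reshape(-1, 1)
    y = np.where(labelled_mask, labels, 0)             # unlabelled points start as "not SATD"
    for _ in range(max_iter):
        count = int(y.sum())                           # Step 1: positives in current y
        if count == 0:
            return 0                                   # degenerate case: nothing found yet
        lr = LogisticRegression().fit(x, y)            # Step 2: LR on probabilities vs labels
        p = lr.predict_proba(x)[:, 1]                  # Step 3: P(SATD) for every comment
        order = np.argsort(p)[::-1]                    # Step 4: sort decreasing
        new_y, running, head = y.copy(), 0.0, None     # Step 5: cumulative-sum relabelling
        for i in order:
            if labelled_mask[i]:
                continue                               # human labels are kept as-is (Step 6)
            head = i if head is None else head
            running += p[i]
            new_y[i] = 0
            if running > 1.0:
                new_y[head] = 1                        # first comment of the group becomes 1
                running, head = 0.0, None
        y = new_y
        if int(y.sum()) == count:                      # Steps 7-8: stop when the count converges
            return count
    return int(y.sum())
```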
D. About the Reader "R"

SURVEY0 uses a human expert to label examples in the test project. At each iteration, the sorter suggests the m most likely target class examples from the unlabelled data points and the human labels them one by one.
In this experiment, we implemented an automated human oracle to mimic the behaviour of a human reader. To do that, we kept the actual labels of our test project (labelled by the authors of the data set [7]) as a separate reference set. At each iteration, the oracle looks into the reference set and labels the comment (thus mimicking a human expert).

V. EXPERIMENTAL MATERIALS

A. Evaluation Metrics
Recall:
Our framework is concerned with how much of the target class (SATD) is found within the comments that have been checked. Formally, this is known as recall:
Recall = TruePositive / (TruePositive + FalseNegative) × 100 = SATDFound / TotalSATD × 100   (1)
The larger the recall, the better the retrieval process.
Cost:
As our framework has a human involved, we wanted to measure the cost of finding the target class (SATD). For that, we focused only on the number of comments read as a ratio of the total number of comments. Thus,
Cost = CommentsRead / TotalComments × 100   (2)
Cost is a measurement of the overall effort needed from the human. The smaller the cost, the better the surveying.

TABLE II
DATASET DETAILS. IN TEN PROJECTS, THE SELF-ADMITTED TECHNICAL DEBT COMMENTS ARE AROUND 5% OF ALL COMMENTS. C = COMMENTS, S = SATD COMMENTS.

Project | Release | Description | Comments (C) | SATD (S) | S/C*100
ApacheAnt | 1.7.0 | Automating Build | 4098 | 131 | 3.2
ApacheJMeter | 2.10 | Testing | 8057 | 374 | 4.64
ArgoUML | - | UML Diagram | 9452 | 1413 | 14.95
Columba | 1.4 | Email Client | 6468 | 204 | 3.15
EMF | 2.4.1 | Model Framework | 4390 | 104 | 2.37
Hibernate Distribution | 3.3.2 | Object Mapping Tool | 2968 | 472 | 15.90
jEdit | 4.2 | Java Text Editor | 10322 | 256 | 2.48
jFreeChart | 1.0.19 | Java Framework | 4408 | 209 | 4.74
jRuby | 1.4.0 | Ruby for Java | 4897 | 622 | 12.70
SQL12 | - | Database | 7215 | 286 | 3.96
MEDIAN | | | 5683 | 271 | 4.77
B. Data
Table II shows the data used in this study. This data comes from the same source as Huang et al.; i.e. the publicly available dataset from Maldonado and Shihab [7]. This dataset contains ten open source JAVA projects from different application domains, varying in size, number of developers and, most importantly, in number of comments in the source code. The provided dataset contains project names and classification type (if any) with the actual comments. Note that our problem does not concern the type of SATD; rather, we care about the binary problem of being SATD or not. So, we changed the final label into a binary one by defining
WITHOUT CLASSIFICATION as no and the rest (for example DESIGN) as yes.
When creating this dataset, Maldonado et al. [7] used jDeodorant [47], an Eclipse plugin, for extracting comments from the source code of Java files. After that, they applied four filtering heuristics to the comments (a rough sketch of such filters is given at the end of this subsection). A short description of each is given below (for more details, see [7]):
• Removed licensed comments, auto-generated comments etc. because, according to the authors, they do not contain SATD by developers.
• Removed commented-out source code, as commented-out source code does not contain any SATD.
• Removed Javadoc comments that do not contain words like "todo", "fixme", "xxx" etc. because, according to the authors, the remaining Javadoc comments rarely contain SATD.
• Multiple single-line comments are grouped into a single comment because they all convey a single message and it is easy to consider them as a group.
After applying these filters, the number of comments in each project was reduced significantly (for example, the number of comments in Apache Ant was reduced to 4,098, almost 19% of the original size).
Two of the authors of [7] then manually labelled each comment according to the six different types of TD mentioned by Alves et al. [48]. Note that if those labels were perfect, then SURVEY0 would not be necessary.
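The four filters described above might be approximated as follows. The regular expressions and the comment representation are hypothetical stand-ins for illustration only; they are not the actual rules used by Maldonado et al. [7] or jDeodorant.

```python
# Illustrative sketch of the four comment-filtering heuristics described above.
import re

TASK_TAGS = re.compile(r"\b(todo|fixme|xxx)\b", re.IGNORECASE)
LICENSE_HINTS = re.compile(r"license|copyright|auto-?generated", re.IGNORECASE)
CODE_HINTS = re.compile(r"[;{}]\s*$|^\s*(if|for|while)\s*\(")   # looks like commented-out code

def keep_comment(text, is_javadoc=False):
    if LICENSE_HINTS.search(text):            # 1) drop license / auto-generated comments
        return False
    if CODE_HINTS.search(text):               # 2) drop commented-out source code
        return False
    if is_javadoc and not TASK_TAGS.search(text):
        return False                          # 3) keep Javadoc only if it carries a task tag
    return True

def group_single_line(lines):
    """4) Merge consecutive single-line comments; lines = (text, follows_previous) pairs."""
    grouped, buffer = [], []
    for text, follows_previous in lines:
        if follows_previous and buffer:
            buffer.append(text)
        else:
            if buffer:
                grouped.append(" ".join(buffer))
            buffer = [text]
    if buffer:
        grouped.append(" ".join(buffer))
    return grouped
```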
C. Standard Rig

In the following, when we say standard rig we mean a 10-by-10 cross validation study that tries to build a predictor for technical debt, as follows:
• For i = 1 to 10 projects:
– test = project[i]
– train = projects − test
– 10 times repeat:
∗ Generate a new random number as seed.
∗ Apply the classifier C.
· For ensembles, we generate n−1 decision trees using the seed (each learning from 90% of the training data, selected at random).
· For SVM, we shuffle the data using the seed.
∗ Apply SURVEY0, with m = 100, stopping at some target recall (usually, 90% recall for SATD).
Note also that, when generating the estimator, we shuffle the data using the seed before building the logistic regression model.
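A sketch of this rig, reusing the hypothetical survey0 helper sketched after Algorithm 1, is given below; the recall and cost arithmetic follows Equations (1) and (2). This is illustrative only, not the exact experiment code.

```python
# Sketch of the "standard rig": leave-one-project-out, repeated ten times with fresh seeds.
import numpy as np

def standard_rig(projects, target_recall=0.90, repeats=10, m=100):
    """projects: dict mapping a project name to (list of comments, 0/1 label array)."""
    results = []
    for test_name, (test_comments, test_labels) in projects.items():
        train = [(c, y) for name, (c, y) in projects.items() if name != test_name]
        prior_comments = [c for comments, _ in train for c in comments]
        prior_labels = np.concatenate([np.asarray(y) for _, y in train])
        for seed in range(repeats):                    # ten repeats per held-out project
            np.random.seed(seed)                       # fresh seed per repeat
            labelled = survey0(prior_comments, prior_labels, test_comments,
                               m=m, target_recall=target_recall)  # dict: index -> human label
            found = sum(labelled.values())
            recall = 100.0 * found / max(1, int(np.sum(test_labels)))   # Equation (1)
            cost = 100.0 * len(labelled) / len(test_comments)           # Equation (2)
            results.append((test_name, seed, recall, cost))
    return results
```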
VI. RESULTS

The experimental materials described above were used to answer the research questions from the introduction.
TABLE III
RQ1 RESULTS: AGREEMENT BETWEEN SURVEY0'S LABELS AND THOSE GENERATED BY OTHER METHODS, SORTED BY RECALL. IN THIS TABLE, THE smaller THE VALUES IN THE RIGHT-HAND-SIDE COLUMN, THE larger THE AGREEMENT BETWEEN SURVEY0'S LABELS AND OTHER LABELS.

Projects | Recall | 100−Recall
jruby | 88 | 12
argouml | 87 | 13
columba | 87 | 13
jmeter | 73 | 27
hibernate | 72 | 28
jfreechart | 46 | 54
emf | 33 | 67
sql12 | 54 | 69
ant | 24 | 76
jedit | 21 | 79
MEDIAN | 63 | 37
A. RQ1: Is surveying necessary?
Table III reports the levels of (dis)agreement seen between the labels found after rechecking and revising (using SURVEY0) and the labels in the original data. This data was generated using our standard rig:
• In that table, we measure disagreement as 100−recall.
• A disagreement of 0% indicates that the labels found via SURVEY0 are the same as in the original data sets.
• Note that the disagreements are quite large and range from 36% to 79% (median to max). That is:
Conclusion
Surveying is required to resolve disagreement about labels.
When discussing these results with colleagues, they commented "does that not mean that SURVEY0 is just getting it wrong all the time?". We would argue against that interpretation. As shown below, surveying improves classification predictions, so whatever SURVEY0 is doing, it is also improving the correspondence between the labels and the target concept.
Now suppose the reader is unconvinced by the last paragraph and wants to check whose labels are correct:
• The pre-existing labels?
• Or the labels generated by SURVEY0?
At that point the reader would encounter the "ground truth" problem. That is, to assess which labels are correct, the reader would need some "correct" labels. After some reflection (and a review of Table I), the reader might realize that finding such a correct set of labels can be very costly; so much so that they would like some intelligent assistant to help them label the data in a cost-effective manner. That is, they need some tool like SURVEY0.
This is not a fanciful scenario. We envisage that once tools like SURVEY0 become widespread, informal "labelling collectives" will emerge between collaborating research groups. Data sets would be passed between research groups, each one checking the labels of the other. If the level of disagreement on the next round of labelling falls below a community-decided level of acceptability, then that data could move on to be used in research papers.
B. RQ2: Is SURVEY0 useful?
Figure 3 and Figure 4 show the recalls and costs achieved from the standard rig (when the target goal is 90%). In those figures:
• The
EnsembleDT results are the closest we can come to reproducing the methods of Huang et al. from EMSE'18. In these results, some classifier is learned from nine projects, then applied to the tenth. These results make no use of SURVEY0; i.e. here, there is no label review or revision.
• The other plots come from SURVEY0 using either EnsembleDT or Linear SVM as the learner.

Fig. 3. Recall of SURVEY0, SURVEY-EnsembleDT and recall of EnsembleDT without SURVEYing (from RQ1).
Fig. 4. Cost of SURVEY0 and SURVEY-EnsembleDT.

We observe that:
• The two sets of treatments have median recalls of 82.5% and 62% respectively.
• That is, the treatments using SURVEY0 perform much better than those that do not.
That is:
Conclusion
SURVEY0 improved quality predictions.
As to which classifier we would recommend, Figure 3 reports that, in terms of recall, both Linear SVM and EnsembleDT perform just as well. However, Figure 4 reports that Linear SVM has a much lower associated cost; i.e. it can find the technical debt comments much faster than EnsembleDT. Based on these results, we recommend SURVEY0 with a Linear SVM classifier.
C. RQ3: Is SURVEY0 comparable to the state-of-the-art for human-in-the-loop AI?
Certain information retrieval methods offer stopping criteria for when to halt exploring new data. Here, we assess two such state-of-the-art approaches, developed for assisting Systematic Literature Reviews.
Fig. 5. Comparison between optimized state-of-the-art human-in-the-loop frameworks with SURVEY0.
Those stopping methods require certain tuning parameters which we set by "cheating"; i.e. manually tuning them using our test data. That is, we gave the information retrieval methods an undue advantage over SURVEY0.
Ros et al. [39] suggest that, if no target class example is found in x consecutive examples seen (if each iteration offers m examples, then a total of x/m iterations), then we should stop. Ros proposed x = 50 but, after "cheating", we found that x = 10 worked better (i.e. obtained higher recalls with minimum cost).
Cormack et al. [40] find the knee in the current retrieval curve at each iteration and stop once the ratio between the slopes before and after that knee exceeds some threshold. As Figure 5 shows, SURVEY0 performed close to these tuned baselines. That is:
Conclusion
SURVEY0 made its predictions at a near-optimum rate.
D. RQ4: How soon can SURVEY0 learn quality predictors?
As we know from RQ2 and RQ3, SURVEY0 has a high recall; meaning, when looking for SATDs, it finds most of them. But in order to do that, SURVEY0 needs a human expert to read through the comments suggested by the classifier and sorter. Thus, a core part of SURVEY0 is ensuring that the cost of reading is minimized and learning when to stop.
Our experiments with SURVEY0 at Target@90 recall show that, after reading only 16% (median) of the comments, SURVEY0 stops while finding 83% (median) of the SATDs. This 16% cost has an IQR of only 5% across projects (IQR = inter-quartile range = (75−25)th percentile), implying the cost is nearly the same for all ten projects. Hence, we say:
TABLE IV
RQ5: COST EFFECTIVE SURVEY0 FOR TARGET@90 AND TARGET@95.

Projects | Target@90 Recall | Target@90 Cost | Target@95 Recall | Target@95 Cost
ant | 85 | 21 | 87 | 35
jmeter | 89 | 15 | 93 | 38
argouml | 79 | 15 | 90 | 21
columba | 98 | 16 | 98 | 33
emf | 78 | 32 | 83 | 41
hibernate | 78 | 15 | 85 | 21
jedit | 80 | 32 | 86 | 44
jfreechart | 61 | 19 | 66 | 20
jruby | 85 | 13 | 94 | 19
sql12 | 85 | 17 | 91 | 24
MEDIAN | 83 | 16 | 89 | 29
IQR | | | |
Conclusion
SURVEY0 can be recommended as a way to reduce the labelling effort.
E. RQ5: Can SURVEY0 find more issues?
In the above experiments, we set the target goal to be 90% recall. Here, we report what happens when we seek to find more technical debt; i.e. when we set the target recall to 95%.
Table IV shows the results. As before, if we set the target to X%, we achieve a performance level of slightly less than X (so the median recalls achieved when the target was 90% or 95% were 83% and 89%, respectively).
We also see that increasing the target recall by just 5% (from 90 to 95) nearly doubles the cost of finding the technical debt (from reading 16% of the comments to 29%). We make no comment here on whether or not it is worth increasing the cost in this way. All we say is that, if required, our methods can be used to tune how much work is done to reach some desired level of recall. That is:
Conclusion
SURVEY0 can be used to advise on how much more work is required to achieve some additional desired level of quality assurance.
F. RQ6: How much does SURVEY0 delay human readers?
SURVEY0 needs a human expert in the loop to classify the most likely datapoints. To that end, SURVEY0 offers m examples at each iteration, before estimating the remaining target class (here, SATD).
This estimation process has its own overhead. According to Maldonado et al., each example needs approximately 10.3 seconds to classify. So, if m examples are offered at each iteration, then the human expert will need approximately m × 10.3 seconds to finish reading. If the estimation process takes longer than this, then the human expert becomes unproductive while waiting for the next iteration. After experimenting, we see that, on average, the estimation process takes 30 seconds for m = 100. On the other hand, for m = 100, a human reader will need 100 × 10.3 seconds, or approximately 20 minutes. If we calculate the overhead of each iteration, it is only 0.5/(20 + 0.5) ≈ 2.4%. To put that another way:
Conclusion
SURVEY0 imposed negligible overhead (i.e. less than 5%) on the activity of human experts.
VII. THREATS TO VALIDITY
Model Bias:
One internal threat to validity is our bias in classifier selection and stopping rule selection. We experimented with a wide variety of state-of-the-art classifiers used in text mining as rankers while building SURVEY0 and found SVM to be the best. Yet there are other advanced and complex classifiers (such as LSTM) that we did not use in our selection, because of the simplicity of our dataset as well as the fact that no prior work has used them to classify SATDs. We also intentionally avoided a few stopping rules as baselines (such as Wallace [50]), as previous research showed [51] that our baselines are significantly better than theirs. Nevertheless, we are aware that our model selection is not comprehensive and could be explored further in future research.
Evaluation Bias:
We have reported Recall and Cost as our overall measures. We repeated each experiment ten times and report only the median values, minimizing any bias towards randomness. We understand these quality measures are not comprehensive and there might be other quality measures used in software engineering that would reflect a more comprehensive summary of our findings. A more comprehensive analysis using other measures is left for future work.
Sample Bias:
The dataset was provided by the authors Maldonado and Shihab [7]. Other data might lead to other conclusions. Clearly, this work needs to be repeated on other data.
VIII. DISCUSSION
In our work, we have studied the comments of ten open source projects developed in JAVA. Our work shows that, with minimal cost, we can identify self-admitted technical debt using a combination of AI and human effort. There are several ways to extend the current work.
• Feature Selection and Vectorization:
Recent work shows that feature selection can improve the overall classification of SATD [1]. More recent work also implies that word embedding models such as word2vec are promising for identifying SATD [24]. We believe our framework can also improve significantly after proper feature selection and vectorization. We initially did some feature selection, but a more rigorous experiment must be done in this regard in the future.
• New Dataset:
Our work is confined to Java projects and open source projects. We want to develop new datasets to generalize our findings and possibly discover new facts along the way.
• Metrics:
There are other goal metrics to explore. For example, measuring cost in terms of time or man-hours might be a better quality measure.
• Results:
According to our experiment, we can find 83% of the SATD while reading only 16% of the data (both median values). We hypothesize that these results can be improved using hyperparameter optimization. The only drawback is the run-time of such tuning. In our future work, we will try to find improved results using hyperparameter tuning.
IX. CONCLUSION
Technical debt is a metaphor describing quick and dirty workarounds adopted for immediate gain. This is an intentional practice and developers often leave comments indicating that their work is sub-optimal. Although this phenomenon is unavoidable in reality, research shows that the long term impact of this practice is dire. Thus identifying technical debt is a major concern for the industry. This work has explored methods for building a technical debt predictor at minimal cost.
The methods used here to reduce the cost of building technical debt predictors are quite general to any human-in-the-loop process where some subject matter expert is required to read and label a large corpus. Such work can be time-consuming, tedious, and error-prone. Our work is a response to that. We offer a complete framework where a human is guided by an AI to label artifacts with minimal effort. At least for the data studied here, we can find on average (median) 83% of the artifacts of interest by reading only 16% of the artifacts. Examining the possible implications on larger datasets, with better estimators and well tuned parameters, will open interesting possibilities in the future.
REFERENCES
[1] Q. Huang, E. Shihab, X. Xia, D. Lo, and S. Li, "Identifying self-admitted technical debt in open source projects using text mining,"
EmpiricalSoftware Engineering , vol. 23, no. 1, pp. 418–451, 2018.[2] W. Cunningham, “The wycash portfolio management system,”
ACMSIGPLAN OOPS Messenger , vol. 4, no. 2, pp. 29–30, 1993.[3] Y. Guo, C. Seaman, R. Gomes, A. Cavalcanti, G. Tonin, F. Q. Da Silva,A. L. Santos, and C. Siebra, “Tracking technical debtan exploratorycase study,” in . IEEE, 2011, pp. 528–531.[4] A. Nugroho, J. Visser, and T. Kuipers, “An empirical model of technicaldebt and interest,” in
Proceedings of the 2nd Workshop on ManagingTechnical Debt . ACM, 2011, pp. 1–8.[5] L. Hatton, “Testing the value of checklists in code inspections,”
IEEEsoftware , vol. 25, no. 4, 2008.[6] I. Ozkaya, R. L. Nord, and P. Kruchten, “Technical debt: From metaphorto theory and practice,”
IEEE Software , vol. 29, no. 06, pp. 18–21, nov2012.[7] E. d. S. Maldonado and E. Shihab, “Detecting and quantifying differenttypes of self-admitted technical debt,” in . IEEE, 2015, pp. 9–15.[8] E. Lim, N. Taksande, and C. Seaman, “A balancing act: What softwarepractitioners have to say about technical debt,”
IEEE software , vol. 29,no. 6, pp. 22–27, 2012.[9] S. Wehaibi, E. Shihab, and L. Guerrouj, “Examining the impact of self-admitted technical debt on software quality,” in , vol. 1. IEEE, 2016, pp. 179–188.[10] A. Martini and J. Bosch, “The danger of architectural technical debt:Contagious debt and vicious circles,” in . IEEE, 2015, pp. 1–10.[11] R. Marinescu, G. Ganea, and I. Verebi, “Incode: Continuous qualityassessment and improvement,” in . IEEE, 2010, pp. 274–275. [12] R. Marinescu, “Detection strategies: Metrics-based rules for detectingdesign flaws,” in
IEEE, 2004, pp. 350–359.[13] ——, “Assessing technical debt by identifying design flaws in softwaresystems,”
IBM Journal of Research and Development , vol. 56, no. 5,pp. 9–1, 2012.[14] N. Zazworka, R. O. Sp´ınola, A. Vetro, F. Shull, and C. Seaman, “Acase study on effectively identifying technical debt,” in
Proceedingsof the 17th International Conference on Evaluation and Assessment inSoftware Engineering . ACM, 2013, pp. 42–47.[15] F. A. Fontana, V. Ferme, and S. Spinelli, “Investigating the impact ofcode smells debt on quality code evaluation,” in
Proceedings of theThird International Workshop on Managing Technical Debt . IEEEPress, 2012, pp. 15–22.[16] N. Tsantalis and A. Chatzigeorgiou, “Identification of extract methodrefactoring opportunities for the decomposition of methods,”
Journal ofSystems and Software , vol. 84, no. 10, pp. 1757–1782, 2011.[17] N. Tsantalis, D. Mazinanian, and G. P. Krishnan, “Assessing therefactorability of software clones,”
IEEE Transactions on SoftwareEngineering , vol. 41, no. 11, pp. 1055–1090, 2015.[18] J. Graf, “Speeding up context-, object-and field-sensitive sdg genera-tion,” in . IEEE, 2010, pp. 105–114.[19] K. Ali and O. Lhot´ak, “Application-only call graph construction,” in
European Conference on Object-Oriented Programming . Springer,2012, pp. 688–712.[20] A. Potdar and E. Shihab, “An exploratory study on self-admittedtechnical debt,” in . IEEE, 2014, pp. 91–100.[21] M. A. de Freitas Farias, M. G. de Mendonc¸a Neto, A. B. da Silva,and R. O. Sp´ınola, “A contextualized vocabulary model for identifyingtechnical debt on code comments,” in . IEEE, 2015, pp. 25–32.[22] E. d. S. Maldonado, R. Abdalkareem, E. Shihab, and A. Serebrenik, “Anempirical study on the removal of self-admitted technical debt,” in . IEEE, 2017, pp. 238–248.[23] M. Yan, X. Xia, E. Shihab, D. Lo, J. Yin, and X. Yang, “Automatingchange-level self-admitted technical debt determination,”
IEEE Trans-actions on Software Engineering , 2018.[24] J. Flisar and V. Podgorelec, “Enhanced feature selection using wordembeddings for self-admitted technical debt identification,” in . IEEE, 2018, pp. 230–233.[25] L. Tan, D. Yuan, G. Krishna, and Y. Zhou, “/* icomment: Bugs orbad comments?*,” in
ACM SIGOPS Operating Systems Review , vol. 41,no. 6. ACM, 2007, pp. 145–158.[26] S. H. Tan, D. Marinov, L. Tan, and G. T. Leavens, “@ tcomment: Testingjavadoc comments to detect comment-code inconsistencies,” in . IEEE, 2012, pp. 260–269.[27] N. Khamis, R. Witte, and J. Rilling, “Automatic quality assessment ofsource code comments: the javadocminer,” in
International Conferenceon Application of Natural Language to Information Systems . Springer,2010, pp. 68–79.[28] D. Steidl, B. Hummel, and E. Juergens, “Quality analysis of sourcecode comments,” in . Ieee, 2013, pp. 83–92.[29] H. Malik, I. Chowdhury, H.-M. Tsou, Z. M. Jiang, and A. E. Hassan,“Understanding the rationale for updating a functions comment,” in . IEEE, 2008,pp. 167–176.[30] B. Fluri, M. Wursch, and H. C. Gall, “Do code and comments co-evolve? on the relation between source code and comment changes,” in . IEEE,2007, pp. 70–79.[31] E. da Silva Maldonado, E. Shihab, and N. Tsantalis, “Using naturallanguage processing to automatically detect self-admitted technicaldebt,”
IEEE Transactions on Software Engineering , vol. 43, no. 11, pp.1044–1062, 2017.[32] Z. Liu, Q. Huang, X. Xia, E. Shihab, D. Lo, and S. Li, “Satd detector:a text-mining-based self-admitted technical debt detection tool,” in
Pro-ceedings of the 40th International Conference on Software Engineering:Companion Proceeedings . ACM, 2018, pp. 9–12.
33] M. Jureczko and L. Madeyski, “Towards identifying software projectclusters with regard to defect prediction,” in
Proceedings of the 6thInternational Conference on Predictive Models in Software Engineering ,ser. PROMISE ’10. New York, NY, USA: ACM, 2010, pp. 9:1–9:10.[Online]. Available: http://doi.acm.org/10.1145/1868328.1868342[34] D. Chen, K. T. Stolee, and T. Menzies, “Replication can improve priorresults: A github study of pull request acceptance,”
ICPC’19 , 2019.[35] J. Wang, M. Li, S. Wang, T. Menzies, and Q. Wang, “Images don’t lie:Duplicate crowdtesting reports detection with screenshot information,”
Information & Software Technology , vol. 110, pp. 139–155, 2019.[Online]. Available: https://doi.org/10.1016/j.infsof.2019.03.003[36] J. Wang, Y. Yang, Z. Yu, T. Menzies, and Q. Wang, “Characterizingcrowds to better optimize worker recommendation in crowdsourcedtesting,”
TSE’19 (to appear) , 2019.[37] G. V. Cormack and M. R. Grossman, “Evaluation of machine-learningprotocols for technology-assisted review in electronic discovery,” in
Pro-ceedings of the 37th international ACM SIGIR conference on Research& development in information retrieval . ACM, 2014, pp. 153–162.[38] N. Abe and H. Mamitsuka, “Query learning strategies using boostingand bagging,” in
Proceedings of the Fifteenth International Conferenceon Machine Learning , ser. ICML ’98. San Francisco, CA, USA:Morgan Kaufmann Publishers Inc., 1998, pp. 1–9. [Online]. Available:http://dl.acm.org/citation.cfm?id=645527.657478[39] R. Ros, E. Bjarnason, and P. Runeson, “A machine learning approach forsemi-automated search and selection in literature studies,” in
Proceed-ings of the 21st International Conference on Evaluation and Assessmentin Software Engineering . ACM, 2017, pp. 118–127.[40] G. V. Cormack and M. R. Grossman, “Engineering quality and reliabilityin technology-assisted review,” in
Proceedings of the 39th InternationalACM SIGIR conference on Research and Development in InformationRetrieval . ACM, 2016, pp. 75–84.[41] G. Gay, S. Haiduc, A. Marcus, and T. Menzies, “On the use of relevancefeedback in ir-based concept location,” in , 2009, pp. 351–360.[42] Z. Yu, N. A. Kraft, and T. Menzies, “Finding better active learners for faster literature reviews,”
Empirical Software Engineering , vol. 23, no. 6,pp. 3161–3186, 2018.[43] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Sup-port vector machines,”
IEEE Intelligent Systems and their applications ,vol. 13, no. 4, pp. 18–28, 1998.[44] I. H. Witten, E. Frank, L. E. Trigg, M. A. Hall, G. Holmes, and S. J.Cunningham, “Weka: Practical machine learning tools and techniqueswith java implementations,” 1999.[45] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vander-plas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-esnay, “Scikit-learn: Machine learning in Python,”
Journal of MachineLearning Research , vol. 12, pp. 2825–2830, 2011.[46] S. R. Safavian and D. Landgrebe, “A survey of decision tree classifiermethodology,”
IEEE transactions on systems, man, and cybernetics ,vol. 21, no. 3, pp. 660–674, 1991.[47] M. Fokaefs, N. Tsantalis, E. Stroulia, and A. Chatzigeorgiou, “Jdeodor-ant: identification and application of extract class refactorings,” in . IEEE,2011, pp. 1037–1039.[48] N. S. Alves, L. F. Ribeiro, V. Caires, T. S. Mendes, and R. O. Sp´ınola,“Towards an ontology of terms on technical debt,” in . IEEE, 2014,pp. 1–7.[49] V. Satopaa, J. Albrecht, D. Irwin, and B. Raghavan, “Finding a” kneedle”in a haystack: Detecting knee points in system behavior,” in .IEEE, 2011, pp. 166–171.[50] B. C. Wallace, I. J. Dahabreh, K. H. Moran, C. E. Brodley, and T. A.Trikalinos, “Active literature discovery for scoping evidence reviews:How many needles are there,” in
KDD workshop on data mining for healthcare (KDD-DMH), 2013.
[51] Z. Yu and T. Menzies, "fast: Better automated support for finding relevant se research papers," arXiv preprint arXiv:1705.05420, 2017.