SQAPlanner: Generating Data-Informed Software Quality Improvement Plans
Dilini Rajapaksha, Chakkrit Tantithamthavorn, Jirayus Jiarpakdee, Christoph Bergmeir, John Grundy, Wray Buntine
Abstract: Software Quality Assurance (SQA) planning aims to define proactive plans, such as defining a maximum file size, to prevent the occurrence of software defects in future releases. To aid this, defect prediction models have been proposed to generate insights as to the most important factors that are associated with software quality. However, such insights derived from traditional defect models are far from actionable, i.e., practitioners still do not know what they should do or avoid to decrease the risk of having defects, nor what the risk threshold is for each metric. A lack of actionable guidance and risk thresholds can lead to inefficient and ineffective SQA planning processes. In this paper, we investigate practitioners' perceptions of current SQA planning activities and the challenges of such activities, and propose four types of guidance to support SQA planning. We then propose and evaluate our AI-driven SQAPlanner approach, a novel approach for generating the four types of guidance and their associated risk thresholds in the form of rule-based explanations for the predictions of defect prediction models. Finally, we develop and evaluate an information visualization for our SQAPlanner approach. Through a qualitative survey and an empirical evaluation, our results lead us to conclude that SQAPlanner is needed, effective, stable, and practically applicable. We also find that 80% of our survey respondents perceived our visualization as more actionable. Thus, SQAPlanner paves the way for novel research in actionable software analytics, i.e., generating actionable guidance on what practitioners should do and not do to decrease the risk of having defects to support SQA planning.
Index Terms: Software Quality Assurance, SQA Planning, Actionable Software Analytics, Explainable AI.
1 INTRODUCTION
Software Quality Assurance (SQA) planning is the process of developing proactive SQA plans. One of the most important SQA activities is to define development policies and their associated risk thresholds [12] (e.g., defining the maximum file size, the maximum code complexity, and the minimum degree of code ownership). Such SQA plans will later be enforced for the whole team to ensure the highest quality of software systems. These policies are essential to improve software quality and software maintainability [29].

Recently, top software companies have released several commercial AI-driven defect prediction tools, for example, Microsoft's Code Defect AI and Amazon's CodeGuru. Such tools heavily rely on the concept of defect prediction models that have been well studied in the past decades [17]. In particular, Microsoft's Code Defect AI is built on top of the concept of explainable Just-In-Time defect prediction [21, 44], i.e., explaining the predictions of defect models using the LIME model-agnostic technique [37]. The crux of Microsoft's Code Defect AI tool is similar to the recent parallel work by Jiarpakdee et al. [21], who also suggested using the LIME model-agnostic technique to explain the predictions of defect models.

• D. Rajapaksha, C. Tantithamthavorn, J. Jiarpakdee, C. Bergmeir, J. Grundy, and W. Buntine are with the Faculty of Information Technology, Monash University, Melbourne, Australia. E-mail: {dilini.rajapakshahewaranasinghage, chakkrit, jirayus.jiarpakdee, christoph.bergmeir, john.grundy, wray.buntine}@monash.edu
• Corresponding author: C. Tantithamthavorn.
However, these current state-of-the-art defect prediction approaches can only indicate the most important features, which are still far from actionable. Thus, practitioners still do not know (1) what they should do to decrease the risk of having defects and what they should avoid so as not to increase the risk of having defects, and (2) what the risk threshold is for each metric (e.g., how large is a file size that would be risky? And how small is a file size that would be non-risky?).

A lack of actionable guidance and risk thresholds can lead to inefficient and ineffective SQA planning processes. Such ineffective SQA planning processes will result in the recurrence of software defects, slow project progress, high costs of development, unsatisfactory software products, and unhappy end-users. These challenges are very significant to the practical applications of defect prediction models, but still remain largely unexplored.

We aim to help practitioners make better data-informed SQA planning decisions by generating actionable guidance derived from defect prediction models. Thus, we first propose the following four types of guidance to support SQA planning:

(G1) Risky current practices that lead the defect model to predict a file as defective are needed to help practitioners understand what the current risky practices are.

(G2) Non-risky current practices that lead the defect model to predict a file as clean are needed to help practitioners understand what the non-risky current practices are.

(G3) Potential practices to avoid so as not to increase the risk of having defects are needed to help practitioners understand which currently unimplemented practices to avoid to not increase the risk of having defects.
(G4) Potential practices to follow to decrease the risk of having defects are needed to help practitioners understand which practices to newly implement to decrease the risk of having defects.

To achieve this aim, our research study has the following three key objectives:

(Obj1) Investigating practitioners' perceptions of and challenges in carrying out current SQA planning activities, and their perceptions of our proposed four types of guidance;

(Obj2)
Developing and evaluating our novel SQAPlanner approach and comparing it with state-of-the-art approaches;

(Obj3)
Developing and evaluating an information visualization for our SQAPlanner approach and comparing it with the visualization of Microsoft's Code Defect AI tool.

To achieve the first objective, we conducted a qualitative survey with practitioners to address the following research questions:

(RQ1) How do practitioners perceive SQA planning activities?
86% of the respondents perceived SQA planning activities as important, and 70% perceived them as being used in practice. However, 66% perceived them as time-consuming and 58% as difficult, indicating that a data-informed SQA planning tool is needed to support QA teams in better data-informed decision-making and policy-making.

(RQ2) How do practitioners perceive our proposed four types of guidance to support SQA planning?
Both (G1) the guidance on risky current practices that lead a model to predict a file as defective and (G4) the guidance on potential practices to follow to decrease the risk of having defects are perceived by the respondents as among the most useful, the most important, and the ones they are most willing to adopt.

Motivated by the findings of RQ1 and RQ2, we propose AI-driven SQAPlanner, i.e., an approach to generate the four types of guidance in the form of rule-based explanations [34] to support data-informed SQA planning.
Our AI-driven SQAPlanner is a significant advancement over the LIME model-agnostic technique [37], since LIME only indicates which factors are the most important in supporting the predictions towards the defective (G1) and clean (G2) classes, while our AI-driven SQAPlanner can additionally provide actionable guidance on what developers should avoid (G3) and should do (G4) to decrease the risk of having defects. Then, we conduct an empirical evaluation of our SQAPlanner approach and compare it with two state-of-the-art local rule-based model-agnostic techniques, i.e., Anchor [38] (an extension of LIME [37]) and LORE [15]. Through a case study of 32 releases across 9 open-source software projects, we address the following research questions:

(RQ3) How effective are the rule-based explanations generated by our SQAPlanner approach when compared to the state-of-the-art approaches?
The rule-based guidance generated by our SQAPlanner approach achieves the highest coverage (median 89%), confidence (median 99%), and lift scores (median 6.6) when compared to the baseline techniques.

(RQ4) How stable are the rule-based explanations generated by our SQAPlanner approach when they are regenerated?
Our SQAPlanner approach produces the most consistent rule-based guidance (a median Jaccard coefficient of 0.92) when compared to the baseline techniques, suggesting that our approach generates the most stable rule-based guidance when explanations are regenerated.

(RQ5) How applicable are the rule-based explanations generated by our SQAPlanner approach to minimizing the risk of having defects in the subsequent releases?
For 55%-87% of the defective files, our SQAPlanner approach can generate rule-based guidance that is applicable in the subsequent release to decrease the risk of having defects.

To evaluate the practical usefulness of our SQAPlanner, we developed a proof-of-concept prototype to visualize the actual generated actionable guidance. The visualization of our SQAPlanner is designed to provide the following key information: (1) the list of guidance that practitioners should follow and should avoid; (2) the actual feature value of that file; and (3) the threshold and range values for practitioners to follow to mitigate the risk of having defects. Then, we compare our visualization with the visualization of Microsoft's Code Defect AI (see Figure 2). Finally, we conducted a qualitative survey to address the following research questions:

(RQ6) How do practitioners perceive the visualization of SQAPlanner when compared to the visualization of the state-of-the-art?
80% of the respondents agree that the visualization of our SQAPlanner is better at providing actionable guidance than the visualization of Microsoft's Code Defect AI.

(RQ7) How do practitioners perceive the actual guidance generated by our SQAPlanner?

The key contributions of this paper are as follows:

• An empirical investigation of practitioners' perceptions and challenges of current SQA planning activities.
• An empirical investigation of practitioners' perceptions of our proposed four types of guidance.
• The development of our novel AI-driven SQAPlanner approach to generate the proposed four types of guidance in the form of rule-based explanations to better support SQA planning. The implementation is available at https://github.com/awsm-research/SQAPlanner-implementation.
• An empirical investigation of the effectiveness, stability, and applicability of the rule-based explanations generated by our SQAPlanner.
• The development of the visualization of our SQAPlanner approach and an empirical investigation of practitioners' perceptions of the visualization and the actual guidance.

The rest of the paper is organized as follows. Section 2 discusses the significance of SQA planning, the limitations of current AI-driven defect prediction tools, and the motivation for the proposed four types of guidance to support SQA planning. Section 3 presents an overview of our case study and the motivation of the research questions. Section 4 presents the results of practitioners' perceptions of SQA planning activities and the four types of guidance to support SQA planning. Section 5 presents our SQAPlanner approach, while Section 6 presents the empirical results of our SQAPlanner approach. Section 7 presents the empirical investigation of the visualization of our SQAPlanner and the actual guidance generated by our SQAPlanner approach. Section 8 summarizes the threats to the validity of our study, and Section 9 discusses related work. Finally, Section 10 draws conclusions.
2 BACKGROUND AND MOTIVATION
In this section, we first discuss the significance of Software Quality Assurance (SQA) planning. Then, we discuss the limitations of current AI-driven defect prediction tools. Finally, we propose the four types of guidance to support SQA planning.
A classic principle commonly applied to SQA processes is to prevent software defects before they occur [27]. It is widely known that the cost of software defects rises significantly the later they are discovered in the process. Thus, finding and fixing software defects prior to releasing software is usually much cheaper and faster than fixing them after the software is released [3]. Therefore, SQA teams play a critical role in software companies as gatekeepers, i.e., not allowing software defects to pass through to end-users.

Consider an example of an SQA practice inside Atlassian, Australia's largest software company, with a variety of well-known software products, e.g., the JIRA Issue Tracking System, BitBucket, and Trello.

Fig. 1: A JIRA software development process and how QA engineers interact with developers prior to releasing a software product.

Figure 1 provides an overview of a JIRA software development process. During this process, a QA engineer has multiple points at which he or she provides feedback on the way a feature is developed and tested, i.e., providing every form of quality improvement guidance for all steps of the software development process, from planning to completion. This process allows for immediate active feedback to ensure that knowledge gained from previous software defects is fed back into the testing notes for future releases to prevent defects in the next iteration.
An AI-driven defect prediction model (aka a defect model) is a classification model trained on historical data to predict whether a file is likely to be defective in the future. Defect models serve two main purposes. The first is to predict. The predictions of defect models can help developers prioritize their limited SQA resources on the most risky files [9, 31, 46, 48]. Therefore, developers can spend their limited SQA effort on the most risky files instead of wasting their time inspecting less risky files. The second is to explain. The insights derived from defect models can help managers chart quality improvement plans to avoid the pitfalls that led to defects in the past [2, 30, 49]. For example, if the insights suggest that code complexity shares the strongest relationship with defect-proneness, managers should initiate quality improvement plans to control and monitor the code complexity of that system.

Recently, top software companies have released several commercial AI-driven defect prediction tools, for example, Microsoft's Code Defect AI and Amazon's CodeGuru. Such tools heavily rely on the concept of defect prediction models that have been well studied in the
past decades [17]. In particular, Microsoft's Code Defect AI is built on top of the concept of explainable Just-In-Time defect prediction [21, 44], i.e., explaining the predictions of defect models using the LIME model-agnostic technique [37]. LIME is a model-agnostic technique for explaining the predictions of any AI/ML algorithm. The crux of Microsoft's Code Defect AI tool is similar to the recent parallel work by Jiarpakdee et al. [21], i.e., extracting several software metrics (e.g., churn), building a classification model (e.g., random forests), generating a prediction for each file in a commit, and generating an explanation of each prediction using the LIME model-agnostic technique [37].

Fig. 2: An example visualization of Microsoft's Code Defect AI tool (http://codedefectai.azurewebsites.net/). However, this tool does not suggest what practitioners should do to decrease the risk of having defects, nor what practitioners should avoid in order not to increase the risk of having defects. In addition, this tool does not suggest a risk threshold for each metric.

Figure 2 presents an example visualization of Microsoft's Code Defect AI product for the file
ErrorHandlerBuilderRef.java of Apache Camel release 2.9.0. The figure shows that this file is predicted as defective with a confidence score of 70%. The three most important factors associated with this prediction are the number of lines of class and method declarations, the number of distinct developers, and the degree of code ownership. Thus, these insights can help managers chart quality improvement plans to control for these metrics. However, there exist the following limitations.

• First, practitioners still do not know what they should do to decrease the risk of having defects, and what they should avoid so as not to increase the risk of having defects.
We find that LIME can only indicate which factors are the most important in supporting the predictions towards the defective (G1) and clean (G2) classes, without providing actionable guidance on what practitioners should avoid (G3) and should do (G4) to decrease the risk of having defects.

• Second, practitioners still do not know the risk threshold for each metric (e.g., how large is a file size that would be risky? And how small is a file size that would be non-risky?).

A lack of these types of guidance and their risk thresholds can lead to inefficient and ineffective SQA planning processes. Such ineffective SQA planning processes can result in the recurrence of software defects, slow project progress, high costs of development, unsatisfactory software products, and unhappy end-users. To the best of our knowledge, these challenges are very significant to the practical applications of defect prediction models, but still remain largely unexplored.
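To make the contrast concrete, the sketch below (not the paper's implementation; all feature names, weights, and thresholds are hypothetical) shows the difference in plain data structures: an importance-style explanation only ranks features, whereas a rule-style explanation carries an explicit condition and threshold that can be rendered as an actionable recommendation.

```python
# Hypothetical importance-style explanation: ranked features, no thresholds.
importance_explanation = [
    ("ClassAndMethodLines", 0.31),
    ("DistinctDevelopers", 0.24),
    ("CodeOwnership", -0.18),
]

# Hypothetical rule-style explanation: a condition (antecedent) with an
# explicit risk threshold, and the class (consequent) it is associated with.
rule_explanation = {
    "condition": ("LOC", ">", 100),
    "consequence": "DEFECT",
}

def to_guidance(rule):
    """Render a rule as an actionable, human-readable recommendation."""
    feature, op, threshold = rule["condition"]
    if rule["consequence"] == "DEFECT" and op == ">":
        return (f"{feature} above {threshold} is associated with defects; "
                f"consider keeping {feature} at or below {threshold}.")
    # Rules associated with the clean class describe practices to maintain.
    return f"Keep satisfying {feature} {op} {threshold}."

print(to_guidance(rule_explanation))
# LOC above 100 is associated with defects; consider keeping LOC at or below 100.
```

The importance list above offers no analogous rendering: without a threshold, there is no concrete target value to recommend, which is precisely the gap the four types of guidance aim to fill.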
To address these challenges, we propose AI-driven SQAPlanner, i.e., an approach for generating the four types of guidance and their risk thresholds in the form of rule-based explanations for the predictions of defect models. Below, we discuss a motivating scenario of how our AI-driven SQAPlanner could be used in a software development process to assist SQA planning.
Without our SQAPlanner. Consider Bob, a QA manager joining a new software development project. His main responsibility is to apply SQA activities (e.g., code review and testing) to find defects and to develop quality improvement plans to prevent them in the next iteration. However, he has little knowledge of the software project. Therefore, he decides to deploy a defect prediction model to guide his QA team to the risky areas of source code, so his team can effectively allocate their limited effort to these risky areas. However, Bob still encounters various SQA planning problems during the planning steps to prevent software defects in the next iteration. In particular, without AI-driven SQA planning tools, he cannot understand what the risky and non-risky practices are for this team and this project, which key actions to avoid because they increase the risk of having defects, and which key actions to take to decrease the risk of having defects.
A lack of AI-driven SQA planning tools could lead to a failure to develop the most effective SQA plans.
Ultimately, this results in the recurrence of software defects, slow project progress, high costs of software development, unsatisfactory software products, and unhappy end-users.
With our SQAPlanner. Now consider that Bob adopts our AI-driven SQAPlanner tool. In particular, given a file that is predicted as defective by a defect prediction model, our SQAPlanner can further generate rule-based explanations to better understand the key risky practices, non-risky practices, actions to avoid that increase the risk of defects, and actions to take to decrease the risk of having defects for that file. Bob can use our SQAPlanner to make data-informed decisions when developing SQA plans. This could result in more optimal SQA plans, leading to higher-quality software systems, fewer software defects, lower costs of software development, satisfactory software products, and happy end-users.
First, we propose to generate the guidance in the form of rule-based explanations, since our recent work [21] found that decision trees/rules are the most preferred representation of explanations by software practitioners, as they involve logical reasoning that practitioners are familiar with. Formally, a rule-based explanation (e) is an association rule e = {r = p ⇒ q} that describes the association between p, a Boolean condition on feature values (i.e., the antecedent, left-hand side, LHS), and q, the consequence (i.e., the consequent, right-hand side, RHS), for the decision value y = f′(x). In this paper, we use an arrow (⇒) to describe the association between the Boolean condition (p) on the feature values of a file and the predictions (q) towards the {DEFECT, CLEAN} classes. Note that an association in general does not mean that there is a causal relationship.

Second, motivated by the limitations of Microsoft's Code Defect AI tool (see Figure 2), we hypothesize that the following four types of guidance (G), presented in the form of rule-based explanations, are beneficial to guide practitioners when developing SQA plans. Below, we present the definition, the motivation, and an example of each of the four types of guidance.

G1: Risky current practices that lead the defect model to predict a file as defective are needed to help practitioners understand what current practices are problematic. For example, an association rule of {LOC > 100} ⇒ DEFECT indicates that a file with LOC greater than 100 is associated with the predictions towards the defective class. Thus, practitioners should consider decreasing the LOC to less than 100, as this may likely decrease the risk of having defects.

Fig. 3: An overview of our study design and research questions.
G2: Non-risky current practices that lead the defect model to predict a file as clean are needed to help practitioners understand what current practices contribute towards a low risk of having defects. For example, an association rule of {Ownership > 0.8} ⇒ CLEAN indicates that a file with an ownership value greater than 0.8 is associated with the predictions towards the clean class. Thus, practitioners should consider maintaining or increasing the ownership value above 0.8 to potentially decrease the risk of having defects.
G3: Potential practices to avoid so as not to increase the risk of having defects are needed to help practitioners understand which currently unimplemented practices to avoid. For example, an association rule of {MinorDeveloper > 0} ⇒ DEFECT indicates that a file with a number of minor developers greater than 0 is associated with the predictions towards the defective class. Thus, practitioners should avoid increasing the number of minor developers above zero so as not to increase the risk of having defects.
G4: Potential practices to follow to decrease the risk of having defects are needed to help practitioners understand which practices to newly implement to decrease the risk of having defects. For example, an association rule of {RatioCommentToCode > 0.6} ⇒ CLEAN indicates that a file with a proportion of comments to code larger than 60% is associated with the predictions towards the clean class. Thus, practitioners should consider increasing the proportion of comments to code to greater than 60% to decrease the risk of having defects.
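Association rules like the ones above are typically scored with coverage, confidence, and lift, the three measures RQ3 later uses to evaluate SQAPlanner. The following sketch computes them for a G1-style rule over a hypothetical set of file records (the files, labels, and the threshold of 100 are illustrative, not data from the paper's case study).

```python
# Hypothetical file records: a code metric (LOC) and the known class label.
files = [
    {"LOC": 250, "label": "DEFECT"},
    {"LOC": 180, "label": "DEFECT"},
    {"LOC": 120, "label": "CLEAN"},
    {"LOC": 40,  "label": "CLEAN"},
    {"LOC": 60,  "label": "CLEAN"},
]

def antecedent(f):
    """p: the Boolean condition (LHS) of the rule {LOC > 100} => DEFECT."""
    return f["LOC"] > 100

CONSEQUENT = "DEFECT"  # q: the class (RHS) the rule is associated with

def rule_metrics(files, antecedent, consequent):
    n = len(files)
    covered = [f for f in files if antecedent(f)]
    correct = [f for f in covered if f["label"] == consequent]
    coverage = len(covered) / n               # P(p): fraction of files the rule applies to
    confidence = len(correct) / len(covered)  # P(q | p): how often the rule is right
    base_rate = sum(f["label"] == consequent for f in files) / n  # P(q)
    lift = confidence / base_rate             # how much p raises the odds of q over chance
    return coverage, confidence, lift

cov, conf, lift = rule_metrics(files, antecedent, CONSEQUENT)
print(cov, conf, lift)
```

A lift above 1 means the condition genuinely concentrates the consequent class, which is why a high-lift rule (the paper reports a median lift of 6.6 for SQAPlanner) is more informative than the class base rate alone.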
3 STUDY DESIGN AND RESEARCH QUESTIONS
In this paper, we aim to help practitioners make data-informed SQA plans by providing guidance on (1) what practitioners should do to decrease the risk of having defects and (2) what practitioners should avoid in order not to increase the risk of having defects, together with (3) a risk threshold, in the form of rule-based explanations for the predictions of defect prediction models. To achieve this aim, we design our case study according to the following objectives (see Figure 3):
Objective 1: Investigating the practitioners' perceptions of SQA planning and the proposed four types of guidance.
SQA planning activities are important in software development processes (e.g., to define initial software development policies), but often vary from organization to organization [11]. However, there exist no empirical studies that investigate how practitioners perceive the importance of SQA planning activities in their organization and what their key challenges are. Thus, we formulate the following research question:

• (RQ1) How do practitioners perceive SQA planning activities?

One of the most important SQA planning activities is to define development policies and their associated risk thresholds [12]. Such development policies will later be enforced for the whole team to ensure the highest quality of software systems (e.g., the maximum file size, the maximum code complexity, the minimum comment-to-code ratio, and the minimum degree of code ownership). Such policies are essential to improve software quality and software maintainability. Recently, Microsoft's Code Defect AI tool was released to the public, and the crux of this tool is defect prediction models. However, Figure 2 shows that the tool only indicates the importance scores of features generated by LIME, which are still far from actionable. That is, LIME only indicates which factors are the most important in supporting the predictions towards the defective (G1) and clean (G2) classes, but does not actually guide developers on what they should avoid (G3) and should do (G4) to decrease the risk of having defects. We hypothesize that our proposed four types of guidance, presented in the form of rule-based explanations, would be more actionable to guide practitioners when developing SQA plans. Thus, we formulate the following research question:

• (RQ2) How do practitioners perceive our proposed four types of guidance to support SQA planning?

Objective 2: Developing and Evaluating our AI-Driven SQAPlanner Approach.
To address the practitioners' challenges of SQA planning and the limitations of Microsoft's Code Defect AI tool, we propose SQAPlanner to help practitioners make data-informed decisions when developing SQA plans. First, SQAPlanner develops a defect prediction model to generate a prediction. Then, SQAPlanner generates a rule-based explanation of the prediction to provide actionable guidance. However, different local rule-based model-agnostic techniques for generating explanations are available in the eXplainable AI (XAI) domain (e.g., Anchor [38] and LORE [15]). Thus, it remains unclear whether our SQAPlanner outperforms the state-of-the-art rule-based model-agnostic techniques. Therefore, we conduct an empirical study to evaluate our approach and compare it with the baseline techniques. Thus, we formulate the following research questions.

• (RQ3) How effective are the rule-based explanations generated by our SQAPlanner approach when compared to the state-of-the-art approaches?
• (RQ4) How stable are the rule-based explanations generated by our SQAPlanner approach when they are regenerated?
• (RQ5) How applicable are the rule-based explanations generated by our SQAPlanner approach to minimizing the risk of having defects in the subsequent releases?

Objective 3: Developing the Visualization of SQAPlanner and Investigating the Practitioners' Perceptions. While the rule-based explanations of our SQAPlanner are designed to help practitioners understand the logic behind the predictions of defect models, such rule-based explanations may not be immediately actionable and easily understandable by practitioners. Thus, we develop a proof-of-concept by translating the rule-based explanations of the actionable guidance into human-understandable explanations.
The visualization of our SQAPlanner is designed to provide the following key information: (1) the list of guidance that practitioners should follow and should avoid; (2) the actual metric values of that file; and (3) the risk threshold and range values for practitioners to follow to mitigate the risk of having defects. Then, we conduct a post-validation qualitative survey with practitioners to evaluate their perceptions of the visualization of our SQAPlanner when compared to the existing visualization of Microsoft's Code Defect AI (see Figure 2). Thus, we formulate the following research questions:

• (RQ6) How do practitioners perceive the visualization of SQAPlanner when compared to the visualization of the state-of-the-art?
• (RQ7) How do practitioners perceive the actual guidance generated by our SQAPlanner?

4 PRACTITIONERS' PERCEPTIONS ON SQA PLANNING AND THE FOUR TYPES OF GUIDANCE
In this section, we aim to investigate the practitioners' perceptions of (1) the SQA planning activities (RQ1) and (2) the proposed four types of guidance to support SQA planning (RQ2). Below, we describe the approach and present the results.
TABLE 1: (RQ1 and RQ2) A summary of the agreement percentage, the disagreement percentage, and the agreement factor for the practitioners' perceptions of SQA planning activities and our proposed four types of guidance.
Statement                                                          %Agreement  %Disagreement  Agreement Factor

(RQ1) SQA planning activities:
  Perceived importance                                                 86%           6%            14.33
  Being used in practice                                               70%          10%             7.00
  Perceived time-consuming                                             66%          10%             6.60
  Perceived difficulty                                                 58%          24%             2.42

(RQ2) Perceived usefulness:
  G1: Risky current practices that lead the defect model
      to predict a file as defective                                   82%           6%            13.67
  G2: Non-risky current practices that lead the defect model
      to predict a file as clean                                       64%          10%             6.40
  G3: Potential practices to avoid to not increase the risk
      of having defects                                                52%          20%             2.60
  G4: Potential practices to follow to decrease the risk
      of having defects                                                80%           8%            10.00

(RQ2) Perceived importance (statements as above):
  G1                                                                   64%          10%             6.40
  G2                                                                   60%          10%             6.00
  G3                                                                   64%          24%             2.67
  G4                                                                   82%           6%            13.67

(RQ2) Willingness to adopt (statements as above):
  G1                                                                   74%          12%             6.17
  G2                                                                   66%          12%             5.50
  G3                                                                   52%          22%             2.36
  G4                                                                   72%          12%             6.00
To investigate practitioners' perceptions of SQA planning activities and their feedback on our proposed four types of data-driven guidance to support such activities, we conducted a survey study with 50 software practitioners. As suggested by Kitchenham and Pfleeger [24], we followed these steps when conducting our study: (1) design and develop the survey, (2) evaluate the survey, (3) recruit and select participants, (4) verify data, and (5) analyse data. We describe each step below.

(Step 1) Design and develop the survey.
We first devised the concept of data-driven software quality assurance (SQA) planning with respect to the four types of rules generated by our approach. We then set out to investigate practitioners' perceptions along four dimensions, i.e., perceived importance, being used in practice, perceived time consumption, and perceived difficulty. We designed our survey as a cross-sectional study in which participants provide their responses at one fixed point in time. The survey consists of 16 closed-ended questions and 4 open-ended questions. For closed-ended questions, we use agreement and evaluation ordinal scales. To mitigate any inconsistency in the interpretation of numeric ordinal scales, we labeled each level of the ordinal scales with words, as suggested by Krosnick [26] (e.g., strongly disagree, disagree, neutral, agree, and strongly agree). The survey takes the form of an online questionnaire hosted on Google Forms. When accessing the survey, each participant is provided with an explanatory statement that describes the purpose of the study, why the participant was chosen for this study, possible benefits and risks, and confidentiality. The survey takes approximately 15 minutes to complete and is anonymous.

(Step 2) Evaluate the survey.
We carefully evaluated the survey via pre-testing [28] to assess its reliability and validity. We revised the survey to identify and fix potential problems (e.g., missing, unnecessary, or ambiguous questions) until reaching a consensus. Finally, the survey was rigorously reviewed and approved by the Monash University Human Research Ethics Committee (MUHREC ID: 22542).

(Step 3) Recruit and select participants.
The target population of the survey is software practitioners. To reach the target population, we used the recruiting service provided by Amazon Mechanical Turk to recruit 50 participants as a representative subset of the target population. We used the participant filter options "Employment Industry - Software & IT Services" and "Job Function - Information Technology" to ensure that the recruited participants are valid samples representing the target population. We paid 6.4 USD as a monetary incentive to each participant [10, 40].
Fig. 4: (RQ1) The Likert scores of the practitioners' perceptions of SQA planning along four dimensions, i.e., importance, being used in practice, time-consuming, and difficulty.

(Step 4) Verify data.
To verify our survey response data, we manually read all of the open-question responses to check their completeness, i.e., whether all questions were appropriately answered. We excluded 11 responses that were missing or not related to the questions. In the end, we had a set of 989 responses. We summarize and present the results of the closed-ended responses on a Likert scale with stacked bar plots, while we discuss and provide examples of the open-ended responses.

(Step 5) Analyse data.
We manually analysed the responses to the open-ended questions to extract in-depth insights. For closed-ended questions, we summarise and present key statistical results. We compute the agreement and disagreement percentage of each closed-ended question. The agreement percentage of a statement is the percentage of respondents who strongly agree or agree with the statement (% strongly agree + % agree), while the disagreement percentage of a statement is the percentage of respondents who strongly disagree or disagree with the statement (% strongly disagree + % disagree). We also compute an agreement factor for each statement, as suggested by Wan et al. [51]. The agreement factor is a measure of agreement between respondents, calculated for each statement using the following equation: (% strongly agree + % agree)/(% strongly disagree + % disagree). High values of the agreement factor indicate a high agreement of respondents with a statement. An agreement factor of 1 indicates that the numbers of respondents who agree and disagree with a statement are equal. Finally, low values of the agreement factor indicate a high disagreement of respondents with a statement.
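The three statistics above can be computed as in this minimal sketch; the response list is illustrative (it happens to reproduce the first row of Table 1), not actual survey data.

```python
def agreement_stats(responses):
    """Agreement %, disagreement %, and the agreement factor (Wan et al. [51])
    for one closed-ended statement."""
    n = len(responses)
    agree = sum(r in ("agree", "strongly agree") for r in responses) / n
    disagree = sum(r in ("disagree", "strongly disagree") for r in responses) / n
    # Agreement factor = (% strongly agree + % agree) / (% strongly disagree + % disagree)
    factor = agree / disagree if disagree > 0 else float("inf")
    return round(agree * 100), round(disagree * 100), round(factor, 2)

# Illustrative: 43 of 50 respondents agree, 3 disagree, 4 are neutral
responses = (["strongly agree"] * 20 + ["agree"] * 23
             + ["neutral"] * 4 + ["disagree"] * 3)
print(agreement_stats(responses))  # (86, 6, 14.33)
```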
The demographics of our 50 practitioner survey respondents are as follows:
• Country of Birth: India (58%) and US (36%)
• Roles: developers (50%), managers (42%), and others (8%)
• Years of Professional Experience: less than 5 years (26%), 6–10 years (38%), 11–15 years (22%), 16–20 years (12%), and more than 25 years (2%)
• Programming Language: Java (44%), Python (30%), C/C++/C
• Use of Static Analysis Tools: Yes (62%) and No (38%)
These demographics indicate that the responses are collected from practitioners who reside in various countries, have a range of roles, varied years of experience, and varied programming language expertise. This indicates that our findings are likely not bound to specific characteristics of practitioners.

Fig. 5: (RQ2) The Likert scores of the perceived usefulness, the perceived importance, and the willingness to adopt of the respondents for each proposed type of guidance.
Results. For SQA planning activities, 86% of the respondents perceive them as important, and 70% perceive them as being used in practice. However, 66% perceive them as time-consuming and 58% as difficult.
Figure 4 shows the distributions of the Likert scores of the practitioners' perceptions of SQA planning activities. The survey results show that SQA planning activities are perceived as important by 86% of the respondents, and as being used in practice by 70% of the respondents. However, they are perceived as time-consuming by 66% of the respondents and as difficult by 58% of the respondents. Table 1 also shows that the agreement factors of all studied dimensions of SQA planning activities are above 1, with values of 2.42–14.33. This indicates that most respondents agree (with very few respondents who disagree) that SQA planning activities are important, used in practice, time-consuming, and difficult.

Respondents described that some of the SQA planning activities in their organisations involve human heuristics in decision-making. For example, they used documentation and review checklists [7] (e.g., R34: "Lessons learnt from projects are documented and common mistakes are included in review checklists to ensure that they are not repeated."), and team meetings (e.g., R10: "team meetings, brainstorm, and in house system", and R48: "... through step by step manual processes working together in a core team"). These findings indicate that a data-informed SQA planning tool is needed to help QA teams make better data-informed decisions and policies.

(RQ2) How do practitioners perceive our proposed four types of guidance to support SQA planning?
Results. Both (G1) the guidance on risky practices that lead a model to predict a file as defective and (G4) the guidance on practices to follow to decrease the risk of having defects are perceived by the respondents as among the most useful, the most important, and the most likely to be adopted.
Figure 5 shows the Likert scores of the practitioners' perceptions of the proposed four types of guidance in terms of perceived usefulness, perceived importance, and willingness to adopt. The survey results show that all types of guidance are perceived as useful by 52%–80% of the respondents, important by 60%–82% of the respondents, and considered for adoption by 52%–72% of the respondents. Similar to RQ1, we observe that the agreement factors for all of the proposed types of guidance are higher than 1 for all of the studied dimensions. This suggests that most respondents agree (with very few who disagree) that all four types of guidance are useful and important, and that respondents are willing to adopt them.

Respondents provided positive feedback on our proposed four types of guidance, since these types of guidance can help with SQA planning (e.g., R37: "It allows the QA team who might not necessarily know the changes that have gone into each program to focus their energy on the most risky components, programs, or functionalities. It also gives managers a great view of the risks involved and how it could potentially be reduced or mitigated."). However, some respondents raised critical concerns about the potential negative impact of these four types of guidance on the development process, for example, the cost of implementation and internal resistance (e.g., R27: "Some extra time spent improving the process. Needing to implement the process including training. Employee resistance to adoption."), and lax development practice (e.g., R30: "Sometimes we get too reliant on the automated processes and other things slip through ...").

OUR AI-DRIVEN SQAPLANNER APPROACH
Our SQAPlanner consists of two major phases: (1) developing defect prediction models; and (2) generating four types of guidance using a local rule-based model-agnostic technique to explain the predictions of the defect models. Figure 6 presents an overview of the workflow of our SQAPlanner approach.
There is a plethora of classification techniques that have been used to develop defect prediction models [13, 17, 46]. We first select the following five classification techniques: Decision Trees (DT), Logistic Regression (LR), multi-layer Neural Network (NN), Random Forest (RF), and Support Vector Machine (SVM). These classification techniques are popularly used in defect prediction studies. Since the performance of defect prediction models may vary depending on the studied datasets, we first conduct a preliminary analysis to identify the most accurate classification techniques for our study. We use the implementations of the selected five classification techniques provided by the scikit-learn Python package. For each training dataset, we build defect prediction models using all of the 65 software metrics (see Table 3 and Table 4). To ensure that our experiment is strictly controlled and fair across the studied classification techniques, we use the default settings of the classification techniques provided by the scikit-learn Python package, do not apply feature selection techniques, and do not apply class rebalancing techniques. This setting ensures that the results are not bound to (i.e., not sensitive to) the randomization of the non-deterministic optimization algorithms [48], feature selection algorithms [22], and class rebalancing algorithms [43]. Then, we evaluate the performance of each classification technique using the testing datasets, and measure the predictive ability of the defect models using the Area Under the Receiver Operating Characteristic Curve (AUROC or AUC). AUC measures the ability to distinguish defective and clean files. The values of AUC range from 0 to 1.
An AUC value of 0 is considered the worst performance, an AUC value of 0.5 is equivalent to random guessing, and an AUC value of 1 is considered the best performance [18]. Then, we use the Non-Parametric Scott-Knott ESD test (Version 3.0) to find the classification techniques that perform best across our studied datasets. We chose
the Non-Parametric Scott-Knott ESD test, since it does not produce overlapping groups like other post-hoc tests (e.g., Nemenyi's test) [47] and it does not require the assumptions of normal distributions, homogeneous distributions, and a minimum sample size. The Non-Parametric Scott-Knott ESD test is a multiple comparison approach that leverages hierarchical clustering to partition the set of median values of techniques (e.g., medians of variable importance scores, medians of model performance) into statistically distinct groups with a non-negligible difference. The mechanism of the Non-Parametric Scott-Knott ESD test consists of two steps: (Step 1) Find a partition that maximizes the median of each distribution between groups using the non-parametric Kruskal-Wallis test with Chi-square statistics. (Step 2) Split the distributions into two groups or merge them into one group using the non-parametric Cliff's |δ| effect size. The implementation of the Non-Parametric Scott-Knott ESD test is available in the ScottKnottESD R package (Version 3.0).

Random Forest is the most accurate studied classification technique, with a median AUC value of 0.77.

Fig. 6: An overview diagram of our SQAPlanner to generate four types of guidance in the form of rule-based explanations for each file.
Figure 7 presents the Scott-Knott ESD ranking of the studied classification techniques with the distribution of the AUC values. We find that the other classification techniques achieve median AUC values of 0.74, 0.63, 0.65, and 0.59 for SVM, DT, NN, and LR, respectively. Finally, the Scott-Knott ESD test confirms that random forest statistically outperforms the other classification techniques. For the rest of the paper, we focus on the random forest models for the following reasons:
• Random Forest is one of the most accurate studied classification techniques for our case study and is less sensitive to parameter settings [46, 48];
• Random Forest is a classification technique that is to a certain degree explainable with its own built-in
4. http://github.com/klainfo/ScottKnottESD
feature importance techniques (e.g., gini importance and permutation importance) [5, 21–23]. Since SVM does not have its own built-in feature importance techniques, we excluded SVM from our analysis; and
• Random Forest is a classification technique that is robust to overfitting [46], outliers [43], and class mislabelling [45].

Fig. 7: The Non-Parametric Scott-Knott ESD ranking of the studied classification techniques with the distribution of the AUC values.
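Under the settings described above (default hyperparameters, no feature selection, no class rebalancing), Phase 1 can be sketched as follows. The synthetic dataset merely stands in for a real 65-metric defect dataset; the sizes and seeds are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for a defect dataset: 65 metrics, imbalanced defective/clean labels
X, y = make_classification(n_samples=600, n_features=65,
                           weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

for clf in (RandomForestClassifier(random_state=0),
            LogisticRegression(max_iter=1000)):
    clf.fit(X_train, y_train)  # default settings, no tuning
    # AUC: the ability to rank defective files above clean ones (0.5 = random)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(type(clf).__name__, round(auc, 2))
```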
There are five major steps for generating the four types of guidance using a local rule-based model-agnostic technique. First, for each instance to be explained (i_explain), we select the nearest instances surrounding the instance to be explained from the training set (I_nearest), cf. Line 1. Second, we generate synthetic instances (I_synthetic) around the neighbourhood of each instance to be explained, cf. Line 2. Then, we create a set of combined instances I_combined = I_nearest ∪ I_synthetic, cf. Line 3, which is a combination of the nearest instances and the synthetic instances. Third, we use the global defect prediction models to generate the predictions of the combined instances (i.e., P_{I_combined}), cf. Line 4. Fourth, to learn the associations between the synthetic features and the predictions of the global defect prediction models, we use the Magnum Opus association rule learning algorithm [53] to generate a set of optimal association rules that are the most predictive (i.e., rules with the highest confidence) and the most interesting (i.e., rules with the highest lift) from the combined instances and their predictions, cf. Line 5. Finally, we classify the set of association rules into four types of rule-based guidance with respect to a contingency table of such association rules, and identify the best rule for each type of guidance, cf. Line 6. Below, we explain each major step in detail.
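As a compact preview of these steps, the sketch below implements the neighbourhood selection, crossover-style synthetic generation, and model labelling (Lines 1–4), plus the contingency-table classification of Line 6. The Magnum Opus rule mining of Line 5 is a commercial service and is omitted; all names (`explain_instance`, `StubModel`, the kernel) are illustrative, not the authors' implementation.

```python
import numpy as np

def explain_instance(global_model, X_train, x_explain,
                     n_top=10, n_synthetic=200, seed=0):
    """Lines 1-4 of Algorithm 1 (illustrative): select neighbours, expand
    them with crossover-style synthetic instances, and label everything
    with the global model's predictions."""
    rng = np.random.default_rng(seed)
    # Line 1: similarity = exponential kernel over Euclidean distance
    sim = np.exp(-np.linalg.norm(X_train - x_explain, axis=1))
    nearest = X_train[np.argsort(-sim)[:n_top]]
    # Line 2: crossover -- interpolate between two random parents
    i, j = rng.integers(len(nearest), size=(2, n_synthetic))
    alpha = rng.random((n_synthetic, 1))
    synthetic = nearest[i] + (nearest[j] - nearest[i]) * alpha
    # Line 3: combine the nearest and the synthetic instances
    combined = np.vstack([nearest, synthetic])
    # Line 4: predictions of the global model over the neighbourhood
    return combined, global_model.predict(combined)

def rule_type(lhs_true, rhs_matches_prediction):
    # Line 6: contingency table of LHS truth vs. RHS agreement (G1-G4)
    return {(True, True): "G1 (supporting)",
            (True, False): "G2 (contradicting)",
            (False, True): "G3 (hypothetical supporting)",
            (False, False): "G4 (counterfactual)"}[(lhs_true,
                                                    rhs_matches_prediction)]

class StubModel:  # stand-in for a fitted global defect model
    def predict(self, X):
        return (X[:, 0] > 0).astype(int)

X_train = np.random.default_rng(1).normal(size=(100, 5))
combined, preds = explain_instance(StubModel(), X_train, X_train[0])
print(combined.shape, rule_type(True, True))
```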
Phase 2-1: Select the nearest instances surrounding an instance to be explained
We assume that instances from the neighbourhood of the instance to be explained have approximately equivalent
Fig. 8: An approach to select instances around the neighbourhood.

characteristics to the instance to be explained. Figure 8 presents an overview of the steps to select the nearest instances from the neighbourhood of the instance to be explained. In particular, there are three steps as follows:

(Step 1) – Normalize feature values.
Different features may have different units, and thus their value ranges may vary greatly, for example, LOC (e.g., 100 lines of code) and Ownership (e.g., an ownership score of 0.5). Thus, we first apply a Z-score normalization to each feature in the defect datasets.

(Step 2) – Compute the similarity scores of instances in the training data.
To do so, we first compute the Euclidean distance between the instances in the training data (Tr_x) and the instance to be explained (i_e). Then, we apply an exponential kernel function to convert such Euclidean distances into similarity scores, which makes the distances more linearly distributed.

(Step 3) – Select the smallest number of the most similar instances using the top-N instances of each class.

To do so, we first sort the similarity scores of the instances (sim) in descending order for each class. Then, we select the top N instances of each class from the sorted similarity scores. The lowest similarity score among the top N instances of both classes (i.e., Min(sim_{True,Nth}, sim_{False,Nth})) is used as a threshold to select the minimum number of the most similar instances, i.e., it determines the boundary of the neighbourhood. For example, given N = 10, suppose the lowest similarity scores of the top-10 instances of the DEFECT and CLEAN classes are 0.8 and 0.9, respectively. Then, the similarity score of 0.8 (the 10th instance from the DEFECT class) is used to determine the boundary of the neighbourhood, and the selected instances are those with similarity scores of at least 0.8 (i.e., sim ≥ 0.8).

Phase 2-2: Generate synthetic instances to expand the neighbourhood
The number of selected nearest instances in the neighbourhood may not be enough to accurately learn the behaviour of the instance to be explained. Thus, we generate synthetic instances to expand the neighbourhood. To do so, we use the crossover (or interpolation) technique and the mutation technique to generate new
Algorithm 1: A Local Rule-based Model Interpretability with k-optimal Associations
Input:
  Tr_x − training instances without the target (class label)
  Tr_y − target (class label) of the training instances
  i_explain − an instance to be explained
  M − a global defect prediction model
  N_features −
  N_synthetic −
Output: G_{i_explain} − four types of rule-based guidance for the instance to explain, i_explain
1: I_nearest ← SelectFromNeighbourhood(Tr_x, i_explain)
2: I_synthetic ← GenerateFromNeighbourhood(I_nearest, N_features, N_synthetic, i_explain)
3: I_combined ← I_nearest ∪ I_synthetic
4: P_{I_combined} ← GetPredictFromGlobalModel(I_combined, M)
5: R_{i_explain} ← GenerateMagnumOpusRules(I_combined, P_{I_combined})
6: G_{i_explain} ← GenerateRuleGuidance(R_{i_explain}, i_explain, P_{i_explain})
7: return G_{i_explain}

synthetic instances while ensuring that the majority of such synthetic instances are within the neighbourhood of the instance to be explained. Below, we describe how we generate synthetic instances using the crossover and the mutation techniques in detail.

Generate synthetic instances using the crossover technique.
To do so, we randomly select two different instances from the neighbourhood of the instance to be explained. Then, we generate the synthetic instances based on the crossover technique using the following equation:
I_crossover = x + (y − x) × α   (1)
where x and y are random parent instances from the training set, and α is a randomly generated number between 0 and 1.

Generate synthetic instances using the mutation technique.
To do so, we randomly select three different instances from the neighbourhood of the instance to be explained. Then, we generate synthetic instances based on the mutation technique [42] using the following equation:
I_mutation = x + (y − z) × µ   (2)
where x, y, and z are random parent instances from the training set, and µ is a randomly generated number between 0.5 and 1.0.

Phase 2-3: Generate the predictions of the nearest instances and the synthetic instances from the defect prediction models
First, we denote the set of the nearest instances (generated in Phase 2-1) and the synthetic instances (generated in Phase 2-2) as the combined instances I_combined, where I_combined = I_nearest ∪ I_synthetic. Then, we generate the predictions of the combined instances in the neighbourhood (i.e., Prediction_{I_nearest ∪ I_synthetic}) from the defect prediction models to learn the behaviour and the logic of such defect prediction models.

Phase 2-4: Generate association rules using Magnum Opus association rule mining
The Magnum Opus association rule mining algorithm performs statistically sound association rule mining by combining k-optimal association discovery techniques [54] with the OPUS search algorithm [53] to find the k most interesting associations according to a defined criterion (e.g., lift, confidence, coverage). The effectiveness of our SQAPlanner relies on this algorithm to generate the rule-based explanations. Using the OPUS search algorithm, it effectively prunes the search space by discarding associations that are likely to be spurious, and removes false positives by performing Fisher's exact hypothesis test. We use the implementation of the k-optimal association rule mining technique provided by the BigML platform.

Phase 2-5: Generate four types of rule-based guidance
Finally, we classify the optimal set of association rules identified by Magnum Opus into four categories with respect to a contingency table of the LHS and RHS of the association rules. Then, we identify the best rule, i.e., the most predictive and the most interesting, for each type of guidance as the output of SQAPlanner. To better illustrate how we classify the output rules generated by Magnum Opus, we use four example association rules. Given an instance to explain, i_explain, that has 200 lines of code (LOC = 200) and is predicted as DEFECT by the global defect prediction model, our SQAPlanner framework generates the following four types of rule-based explanations:
G1: Risky current practices that lead the defect model to predict a file as defective.
Technical Name. Supporting Rules (ℜ+).
Definition. If LHS = true, then RHS = true.
Example. {LOC > 150} ⇒ DEFECT
Interpretation. This example is a supporting rule, since (1) the antecedent (LHS) of the rule holds true, as the actual LOC of i_explain (i.e., 200) is higher than 150, and (2) the consequent (RHS) of the rule holds true, as the prediction of i_explain generated by the global defect prediction model is DEFECT.

G2: Non-risky current practices that lead the defect model to predict a file as clean.
Technical Name. Contradicting Rules (ℜ−).
Definition. If LHS = true, then RHS = false.
Example. {LOC < 500} ⇒ CLEAN
Interpretation. This example is a contradicting rule, since (1) the antecedent (LHS) of the rule holds true, as the actual LOC of i_explain (i.e., 200) is lower than 500, yet (2) the consequent (RHS) of the rule does not hold true, as the prediction of i_explain generated by the global defect prediction model is DEFECT.
5. https://bigml.com/
G3: Potential practices to avoid to not increase the risk of having defects.
Technical Name. Hypothetical Supporting Rules (ℜH+).
Definition. If LHS = false, then RHS = true.
Example. {LOC > 300} ⇒ DEFECT
Interpretation. This example is a hypothetical supporting rule, since (1) the antecedent (LHS) of the rule does not hold true, as the actual LOC of i_explain (i.e., 200) is not higher than 300, yet (2) the consequent (RHS) of the rule holds true, as the prediction of i_explain generated by the global defect prediction model is DEFECT.

G4: Potential practices to follow to decrease the risk of having defects.
Technical Name. Hypothetical Contradicting Rules or Counterfactual Rules (ℜH−).
Definition. If LHS = false, then RHS = false.
Example. {LOC < 100} ⇒ CLEAN
Interpretation. This example is a hypothetical contradicting rule, since (1) the antecedent (LHS) of the rule does not hold true, as the actual LOC of i_explain (i.e., 200) is not lower than 100, and (2) the consequent (RHS) of the rule does not hold true, as the prediction of i_explain generated by the global defect prediction model is DEFECT.

EXPERIMENTAL DESIGN AND RESULTS
In this section, we aim to investigate (RQ3) the effectiveness, (RQ4) the stability, and (RQ5) the applicability of the rule-based explanations generated by our SQAPlanner. Below, we describe the studied projects and the experimental design, and present the results.
To select suitable projects, we identified three important criteria that need to be satisfied:
• Criterion 1 — Publicly-available defect datasets: To support verifiability and foster replicability of our study, we choose to train our defect prediction models using publicly available defect datasets.
• Criterion 2 — Multiple releases: The central hypothesis of our approach is that the guidance derived from past knowledge (a release k − 1) can be used to explain the predictions of defective files in the target release (a release k) and be applicable to prevent software defects in future releases (a release k + 1). Thus, we need multiple releases for each software project to validate our hypothesis.
• Criterion 3 — Labels of defective files are based on actual affected releases: Prior work raises concerns that the approximation of the post-release window periods (e.g., 6 months) that are popularly used in many defect datasets may introduce bias to the construct validity of our results [56]. Instead of relying on traditional post-release window periods, we choose to use defect datasets that are labeled

TABLE 2: A statistical summary of the studied systems.
TABLE 3: A summary of the studied code metrics.
Granularity | Metrics | Count
File | AvgCyclomatic, AvgCyclomaticModified, AvgCyclomaticStrict, AvgEssential, AvgLine, AvgLineBlank, AvgLineCode, AvgLineComment, CountDeclClass, CountDeclClassMethod, CountDeclClassVariable, CountDeclFunction, CountDeclInstanceMethod, CountDeclInstanceVariable, CountDeclMethod, CountDeclMethodDefault, CountDeclMethodPrivate, CountDeclMethodProtected, CountDeclMethodPublic, CountLine, CountLineBlank, CountLineCode, CountLineCodeDecl, CountLineCodeExe, CountLineComment, CountSemicolon, CountStmt, CountStmtDecl, CountStmtExe, MaxCyclomatic, MaxCyclomaticModified, MaxCyclomaticStrict, RatioCommentToCode, SumCyclomatic, SumCyclomaticModified, SumCyclomaticStrict, SumEssential | 37
Class | CountClassBase, CountClassCoupled, CountClassDerived, MaxInheritanceTree, PercentLackOfCohesion | 5
Method | CountInput_{Min, Mean, Max}, CountOutput_{Min, Mean, Max}, CountPath_{Min, Mean, Max}, MaxNesting_{Min, Mean, Max} | 12
TABLE 4: A summary of the studied process and ownership metrics.

Metrics | Description
Process Metrics
COMM | The number of Git commits
ADDED_LINES | The normalized number of lines added to the module
DEL_LINES | The normalized number of lines deleted from the module
ADEV | The number of active developers
DDEV | The number of distinct developers
Ownership Metrics
MINOR_COMMIT | The number of unique developers who have contributed less than 5% of the total code changes (i.e., Git commits) on the module
MINOR_LINE | The number of unique developers who have contributed less than 5% of the total lines of code on the module
MAJOR_COMMIT | The number of unique developers who have contributed more than 5% of the total code changes (i.e., Git commits) on the module
MAJOR_LINE | The number of unique developers who have contributed more than 5% of the total lines of code on the module
OWN_COMMIT | The proportion of code changes (i.e., Git commits) made by the developer who has the highest contribution of code changes on the module
OWN_LINE | The proportion of lines of code written by the developer who has the highest contribution of lines of code on the module

based on the actual affected releases, as suggested by recent studies [8, 56]. Thus, we finally selected a corpus of publicly available defect datasets provided by Yatish et al. [56], where the ground truths are labeled based on the affected releases. These datasets consist of 32 releases that span 9 open-source, real-world, non-trivial software systems. Table 2 shows a statistical summary of the studied datasets. Each dataset has 65 software metrics along 3 dimensions, i.e., 54 code metrics, 5 process metrics, and 6 human metrics. Table 3 shows a summary of the static code metrics, while Table 4 shows a summary of the process and human metrics. The full details of the data collection process are available at Yatish et al. [56].
We hypothesize that the guidance derived from past knowledge (a release k − 1) can be used to explain the predictions of defective files in the target release (a release k) and be applicable to prevent software defects in future releases (a release k + 1). Thus, we evaluate our approach (see Figure 9) using sets of three consecutive releases (k − 1, k, and k + 1) for training, testing, and explanation evaluation, respectively. We first train our defect models using the random forest classification technique on a training release (i.e., a release k − 1). Then, we generate rule-based explanations for each file in the testing release (i.e., a release k). Finally, we evaluate the applicability of the rule-based explanations with the explanation evaluation release (i.e., a release k + 1). Taking the ActiveMQ project as an example, we first use release 5.0.0 for training, release 5.1.0 for testing, and release 5.2.0 for explanation evaluation. We repeat the experiment similarly for the other consecutive releases (i.e., {5.1.0, 5.2.0, 5.3.0}, {5.2.0, 5.3.0, 5.8.0}) and for the other projects.

Motivation. Our SQAPlanner is based on the assumption that our rule-based explanations are generated based on the approximation of the characteristics of files that are similar to the file to be explained. This assumption is similar to those of many local rule-based model-agnostic techniques [15, 37, 38], namely that the behaviour of the instance to be explained is similar to the behaviours of the instances around its neighbourhood. According to the definition of rule-based explanations in Section 2.4,
our SQAPlanner-generated rule-based explanations will be considered effective if such rule-based explanations achieve a high coverage and a high confidence value.

Fig. 9: An evaluation framework of our SQAPlanner approach.
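The rolling three-release evaluation described above can be sketched as a sliding window over consecutive releases; the release list is the ActiveMQ example named in the text.

```python
# Slide a window of three consecutive releases: train on release k-1,
# test on release k, evaluate the generated explanations on release k+1.
releases = ["5.0.0", "5.1.0", "5.2.0", "5.3.0", "5.8.0"]  # ActiveMQ (from the text)

windows = [tuple(releases[i:i + 3]) for i in range(len(releases) - 2)]
for train_rel, test_rel, eval_rel in windows:
    print(f"train={train_rel}  test={test_rel}  explanation-eval={eval_rel}")
```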
Approach. To address RQ3, we evaluate the rule-based explanations generated by our SQAPlanner using the traditional association rule evaluation measures (i.e., coverage, confidence, and lift).
Coverage measures the support of the antecedent of an association rule, i.e., the percentage of files that satisfy the rule conditions. Formally, Coverage(p → q) = Support(p), where Support(p) is the proportion of files that fulfill p:
Support(p) = |{files ∈ Dataset such that the files fulfill p}| / |Dataset|
For example, a rule-based explanation (G1) of {DEV > 10} ⇒ DEFECT with a coverage value of 0.9 indicates that 90% of the files fulfill the risky practice of having more than ten developers who touch a file. A high coverage value of the G1 guidance indicates that such a risky practice is common to many files in the dataset.
Confidence (i.e., precision or strength) measures the percentage of files that fulfill the antecedent and consequent together over the number of files that fulfill only the antecedent, which can be defined as follows: Confidence(p → q) = Support(p → q) / Support(p). For example, a rule-based explanation (G1) of {DEV > 10} ⇒ DEFECT with a confidence value of 0.8 indicates that 80% of the files that fulfill the risky practice of having more than ten developers who touch a file are actually defective. A high confidence value of the G1 guidance indicates that files fulfilling such a risky practice are highly likely to be defective.
Lift measures how many times more often the antecedent and consequent occur together compared to what would be expected if they (i.e., both antecedent and consequent) were statistically independent, which can be defined as follows: Lift(p → q) = Support(p → q) / (Support(p) × Support(q)).

Fig. 10: (RQ3) The distribution of the evaluation measures of our rule-based explanations when compared to baseline approaches (i.e., LORE and Anchor).

For example, a rule-based explanation (G1) of {DEV > 10} ⇒ DEFECT with a lift value of 5 indicates that the file will be 5 times (i.e., 500%) more likely to be defective if the rule is fulfilled. A lift value greater than one means that a file is likely to be defective if the conditions are fulfilled, while a lift value less than one means a file is unlikely to be defective if the conditions are fulfilled. A high lift value of the G1 guidance indicates that there is a high chance that a file is defective if such a risky practice is fulfilled. Thus, practitioners should pay attention to guidance rules with a high lift value.
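To make the three measures concrete, the following minimal sketch (our illustration, not the paper's implementation) computes coverage, confidence, and lift for the G1 rule on a toy dataset:

```python
# A file is modelled as a dict of metric values plus a "defect" label;
# the rule antecedent and consequent are predicate functions.

def coverage(files, antecedent):
    """Coverage(p -> q) = Support(p): fraction of files fulfilling p."""
    return sum(antecedent(f) for f in files) / len(files)

def confidence(files, antecedent, consequent):
    """Confidence(p -> q) = Support(p and q) / Support(p)."""
    support_p = sum(antecedent(f) for f in files)
    support_pq = sum(antecedent(f) and consequent(f) for f in files)
    return support_pq / support_p if support_p else 0.0

def lift(files, antecedent, consequent):
    """Lift(p -> q) = Support(p and q) / (Support(p) * Support(q))."""
    n = len(files)
    p_frac = sum(antecedent(f) for f in files) / n
    q_frac = sum(consequent(f) for f in files) / n
    pq_frac = sum(antecedent(f) and consequent(f) for f in files) / n
    return pq_frac / (p_frac * q_frac) if p_frac and q_frac else 0.0

# The G1 rule {DEV > 10} => DEFECT on a toy dataset of four files:
files = [
    {"DEV": 12, "defect": True},
    {"DEV": 15, "defect": True},
    {"DEV": 11, "defect": False},
    {"DEV": 3,  "defect": False},
]
p = lambda f: f["DEV"] > 10
q = lambda f: f["defect"]
print(coverage(files, p), confidence(files, p, q), lift(files, p, q))
# 0.75, 2/3 (~0.667), and 4/3 (~1.333) respectively
```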
Baseline comparison.
We compare our SQAPlanner with the two state-of-the-art local rule-based model-agnostic techniques, i.e., Anchor [38] and LORE [15].
Anchor, an extension of LIME [37], was proposed by Ribeiro et al. [38]. The key idea of Anchor is to select if-then rules (so-called anchors) that have high confidence, such that features that are not included in the rules do not affect the prediction outcome if their feature values are changed. In particular, Anchor selects only rules with a minimum confidence of 95%, and then selects the rule with the highest coverage if multiple rules have the same confidence value.
LORE was proposed by Guidotti et al. [15]. For each instance to be explained, LORE generates files around the neighbourhood using a genetic algorithm. LORE then obtains predictions of the generated files from the global defect models to learn the behaviour and the logic of the defect models. Finally, a decision tree is built on the defined neighbourhood of the instance to be explained and is later converted to rules.
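To make the local-surrogate idea concrete, the following toy sketch mimics a LORE-style explanation under heavy simplification (our assumptions: Gaussian perturbation instead of LORE's genetic neighbourhood generation, and a single-threshold surrogate instead of a decision tree):

```python
import random

def black_box(loc):
    """Stand-in global defect model: large files are predicted defective."""
    return "Bug" if loc > 1000 else "Clean"

def explain_locally(instance_loc, n_neighbours=200, seed=0):
    rng = random.Random(seed)
    # 1. Generate synthetic neighbours around the instance to be explained.
    neighbours = [instance_loc + rng.gauss(0, 200) for _ in range(n_neighbours)]
    # 2. Label the neighbours with the global (black-box) model.
    labelled = [(x, black_box(x)) for x in neighbours]
    # 3. Fit a one-split surrogate: pick the threshold that best separates labels.
    best_t, best_acc = None, -1.0
    for t, _ in labelled:
        acc = sum((x > t) == (y == "Bug") for x, y in labelled) / len(labelled)
        if acc > best_acc:
            best_t, best_acc = t, acc
    # 4. Read an if-then rule for this instance off the surrogate split.
    side = ">" if instance_loc > best_t else "<="
    return f"LOC {side} {best_t:.0f} => {black_box(instance_loc)}"

print(explain_locally(1200))  # a rule of the form "LOC > ... => Bug"
```

The surrogate recovers a threshold close to the black-box model's true decision boundary (1000 in this toy setup), illustrating how a local rule approximates the global model's behaviour in the instance's neighbourhood.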
Results. Figure 10 presents the results for coverage, confidence, and lift of the local rule-based model-agnostic techniques.

(Coverage) At the median, 89% of files are supported by the rule-based explanations, suggesting that our SQAPlanner outperforms the LORE and Anchor local rule-based model-agnostic techniques. Figure 10 shows that the median coverage is 89%, 34%, and 6% for our SQAPlanner, LORE, and Anchor, respectively. We suspect that the high coverage values achieved by our SQAPlanner are due to the flexibility of the k-optimal search, which allows us to search particularly for rules with high coverage. High coverage is important as it measures how representative a rule is of a given dataset, so our results suggest that our SQAPlanner achieves the most representative rules.

(Confidence) At the median, 99% of the files that fulfill the antecedent of a rule-based explanation also fulfill its consequent, which outperforms the LORE and Anchor model-agnostic techniques. Figure 10 shows the distributions of the confidence for our SQAPlanner, LORE, and Anchor. We find that LORE and Anchor achieve high confidence, with median confidence values of 95% and 98%, respectively. We find that the comparable confidence values achieved by LORE and Anchor have to do with their main optimization goal, since both techniques aim to search for rules with the highest confidence. Nevertheless, we find that our SQAPlanner achieves the highest median confidence of 99%.

(Lift) The rule-based explanations generated by our SQAPlanner achieve a median lift value of 6.6, which outperforms the LORE and Anchor model-agnostic techniques. Figure 10 shows that the median lift is 6.6, 5.2, and 0.98 for our SQAPlanner, LORE, and Anchor, respectively. The highest lift value of 6.6 indicates that files will be 6.6 times (i.e., 660%) more likely to be defective if the rule is matched. Similarly, the highest lift value of our SQAPlanner can be attributed to the flexibility of the k-optimal search, which allows us to search particularly for rules with the highest lift. On the other hand, Anchor achieves a lower lift score, since Anchor constructs the neighbourhood in a way that it contains only files of the same class as the instance in consideration. Thus, the lift scores for Anchor under these circumstances are equal to the confidence values.

(RQ4) How stable are the rule-based explanations generated by our SQAPlanner approach when they are regenerated?

Motivation. Our SQAPlanner approach and the two state-of-the-art local rule-based model-agnostic techniques (i.e., LORE and Anchor) involve random data generation when generating synthetic instances around the neighbourhood. As such, the randomization bias may produce different rule-based explanations when the approaches are re-executed. Thus, we aim to investigate the consistency of the rule-based explanation of the same instance when these model-agnostic techniques are re-executed.
Fig. 11: (RQ4) The distribution of the Jaccard coefficients of the rule-based model-agnostic techniques.
Approach. To address RQ4, we repeat our experiment ten times to investigate the stability of our rules. Since the rules generated by the baseline comparison are optimized based on confidence only, we focus on the rules generated by our approach that are optimized for confidence as well. For each rule-based explanation of each file, we use the Jaccard coefficient to measure the consistency of the generated rule-based explanations. The Jaccard coefficient compares the common and the distinct features in two given sets (e.g., X and Y) using the following equation: J(X, Y) = |X ∩ Y| / |X ∪ Y|. The coefficient ranges from 0% to 100%; the higher the coefficient, the higher the similarity of rules over two independent runs.

Results. Our SQAPlanner approach produces the most consistent rule-based explanations when compared to LORE and Anchor.
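The Jaccard computation described in the Approach can be sketched as follows (a minimal illustration; the rule feature sets are made up for the example):

```python
# Stability check: the Jaccard coefficient between the feature sets of two
# rule-based explanations generated in two independent runs.

def jaccard(x, y):
    """J(X, Y) = |X intersect Y| / |X union Y|, in [0, 1]."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if x | y else 1.0

run1 = {"LOCDeclaration", "DistinctDeveloper", "Ownership"}
run2 = {"LOCDeclaration", "DistinctDeveloper", "MinorCommit"}
print(jaccard(run1, run2))  # 0.5: two shared features out of four distinct ones
```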
Figure 11 shows that our SQAPlanner achieves a median Jaccard coefficient of 0.92, while LORE and Anchor achieve median Jaccard coefficients of 0.42 and 0.79, respectively. In other words, for each prediction of an instance to be explained, our rule-based explanations are (at the median) 92% consistent with the rule-based explanations produced when re-executing our framework in multiple independent runs. In addition, our SQAPlanner's rule-based explanations are (at the median) 13% and 50% more consistent than the rule-based explanations generated by Anchor and LORE, respectively. We suspect that the highest consistency achieved by our approach is a result of the more robust nature of our framework when selecting similar instances from the training data and when generating synthetic instances around the neighbourhood (as described in Sections 5.2 and 7). In contrast, Anchor uses a bandit algorithm [25] to generate neighbours, while LORE uses a genetic algorithm to generate neighbours.

(RQ5) How applicable are the rule-based explanations generated by our SQAPlanner approach to minimize the risk of having defects in the subsequent releases?
Motivation. The central hypothesis of our approach is that the rule-based explanations derived from past knowledge (a release k−1) can be used to explain the predictions of defective files in a target release (a release k), and thus be applicable to guide SQA planning to prevent software defects in future releases (a release k+1). We want to investigate the proportion of files where the rule-based explanations are satisfied and not satisfied by the actual feature values in the subsequent release.

Fig. 12: (RQ5) An approach to evaluate the applicability of the hypothetical contradicting rules.

Approach. To address RQ5, we focus on the hypothetical contradicting rules, which are rules that guide what practices to follow to decrease the risk of having defects (i.e., whether the prediction of the same file could be reversed if the rule were followed in a subsequent release). We note that hypothetical contradicting rules cannot be generated for all files by Anchor and LORE: we find that LORE produces hypothetical contradicting rules for at most 69% of the files across projects (the median proportion per project is 41%), and Anchor by definition does not generate any hypothetical contradicting rules. Since our approach is the only one that can generate hypothetical contradicting rules, we focus only on our SQAPlanner approach. Figure 12 presents an approach to evaluate the applicability of the hypothetical contradicting rules. We analyze the applicability of the hypothetical contradicting rules along two perspectives:
RQ5-a: Are hypothetical contradicting rules applied when the prediction of an instance changes from defective in a testing release k to clean in a validation release k+1? Hypothetical contradicting rules are considered applicable if such rules follow the actual feature values in the validation data when the prediction of the instance changes from defective in k to clean in k+1. For example, A.java is predicted to be defective in the testing data (k) but predicted to be clean in the validation data (k+1). We consider the generated hypothetical contradicting rule (e.g., {LOC < 900} ⇒ CLEAN) to be correct if such a rule is in accordance with the actual feature values in the validation data (i.e., LOC = 850). In this example, the hypothetical contradicting rule suggests that developers reduce the lines of code to less than 900 to potentially reverse the decision of the defect models from defective to clean, which is consistent with the validation data (LOC = 850).

RQ5-b: Are hypothetical contradicting rules not applied when the prediction of an instance does not change from defective in a testing release k to clean in a validation release k+1? Hypothetical contradicting rules are considered applicable if such rules do not follow the actual feature values in the validation data when the prediction of the instance does not change from defective in k to clean in k+1. For example, B.java is predicted to be defective in both the testing data and the validation data. Thus, we consider the generated hypothetical contradicting rule (e.g., {LOC < 1,100} ⇒ CLEAN) to be applicable if such a rule does not follow the actual feature values in the validation data (i.e., LOC = 1,500).

For each perspective, we compute the number of instances where the hypothetical contradicting rule does follow (RQ5-a) and does not follow (RQ5-b) the actual feature values in the subsequent release. Figure 13 presents the proportion of files where the hypothetical contradicting rule does follow (RQ5-a) and does not follow (RQ5-b) the actual feature values in the subsequent release for each measure.
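The RQ5-a/RQ5-b applicability checks can be sketched as follows (a minimal illustration of our own, using the two example files from Figure 12; rules are assumed to have the form {metric < threshold} ⇒ CLEAN):

```python
def rule_applies(threshold, validation_value):
    """True if the actual feature value in release k+1 satisfies the rule."""
    return validation_value < threshold

def evaluate(instances):
    """instances: list of (threshold, validation_value, pred_k, pred_k1).
    RQ5-a: fraction of rules that DO apply when the prediction flips Bug -> Clean.
    RQ5-b: fraction of rules that do NOT apply when the prediction stays Bug."""
    rq5a_hits = rq5a_total = rq5b_hits = rq5b_total = 0
    for threshold, value, pred_k, pred_k1 in instances:
        if pred_k == "Bug" and pred_k1 == "Clean":
            rq5a_total += 1
            rq5a_hits += rule_applies(threshold, value)
        elif pred_k == "Bug" and pred_k1 == "Bug":
            rq5b_total += 1
            rq5b_hits += not rule_applies(threshold, value)
    return (rq5a_hits / rq5a_total if rq5a_total else 0.0,
            rq5b_hits / rq5b_total if rq5b_total else 0.0)

# The two files from Figure 12:
instances = [
    (900,  850,  "Bug", "Clean"),  # A.java: rule followed, prediction flipped
    (1100, 1500, "Bug", "Bug"),    # B.java: rule not followed, still defective
]
print(evaluate(instances))  # (1.0, 1.0)
```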
Results. For 55%-87% of the instances in the subsequent releases, our SQAPlanner's hypothetical contradicting rules are correctly applicable when the prediction changes from defective to clean. Figure 13 shows that 87%, 82%, and 55% of the instances in the subsequent releases have hypothetical contradicting rules that follow the actual feature values in the validation data with respect to coverage, confidence, and lift, respectively. This finding indicates that our SQAPlanner's hypothetical contradicting rules, learned from past knowledge (k−1) to explain the predictions of instances from the target release (k), could potentially reverse the predictions of the same instances in the subsequent release (k+1) from defective to clean.

Fig. 13: The percentage of the instances in the subsequent releases where our hypothetical contradicting rules (RQ5-a) follow the actual feature values in the validation data when the decision is changed and (RQ5-b) do not follow the actual feature values in the validation data when the decision is not changed, for each measure.

For 67%-81% of the instances in the subsequent releases, our hypothetical contradicting rules are correctly non-applicable when the prediction does not change. Figure 13 shows that 67%, 81%, and 71% of the instances in the subsequent releases have hypothetical contradicting rules that do not follow the actual feature values in the validation data when the prediction does not change, with respect to coverage, confidence, and lift, respectively. In other words, when files are still defective in the subsequent release, our hypothetical contradicting rules are still largely in agreement (i.e., our hypothetical contradicting rules are correctly non-applicable).
We conducted a qualitative analysis to illustrate the effectiveness of the guidance generated by our SQAPlanner. We selected the ErrorHandlerBuilderRef.java file of the release 2.9.0 of the Camel software system as the subject of this qualitative analysis. Our SQAPlanner approach correctly predicts this file as defective with a probability score of 70%. Below, we discuss the implications of our rule-based explanations to guide developers on what they could follow and could avoid to decrease the risk of having defects.
What are the risky practices that lead a model to predict a file as defective?
To answer this question, we use the supporting rule to generate guidance (G1) for this file as follows:

ℜ+ = {LOCDeclaration > 28.150 & DistinctDeveloper > 1.68 & Ownership < 0.85} ⇒ DEFECT
Implication. This supporting rule indicates that this file is being predicted as defective since it is associated with the conditions of having more than 28 lines of declarative code, more than 1.68 distinct developers, and a line-based ownership score of less than 85%. When comparing this to the actual feature values of the file {LOCDeclaration = 34, DistinctDeveloper = 3, Ownership = 0.65}, we find that the conditions of this supporting rule are consistent with the actual feature values, and the consequent is consistent with our SQAPlanner's prediction (i.e., defective).

What are the non-risky practices that lead a model to predict a file as clean?
To answer this question, we use our contradicting rule to generate guidance (G2) for this file as follows:

ℜ− = {0.44 ≤ RatioCommentToCode ≤ 0.96} ⇒ CLEAN

Implication. We find that our contradicting rule is consistent with the actual feature values of the file to be explained. The actual feature value of this file is {RatioCommentToCode = 0.51}, meaning that 51% of the total lines of code are comment lines. The contradicting rule (ℜ−) indicates that the condition that supports its prediction as not being defective is {0.44 ≤ RatioCommentToCode ≤ 0.96}, i.e., files that have a RatioCommentToCode of more than 44% but less than 96% are likely not to be defective. Developers should thus adhere to the contradicting rule, i.e., keep the comment ratio above 44% of the file, to not increase the risk of having defects.
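For concreteness, checking whether a file's actual metric values satisfy a rule's conditions (as done for the supporting and contradicting rules above) can be sketched as follows (a minimal illustration; the helper names are ours, not SQAPlanner's):

```python
import operator

# A rule's antecedent is a list of (metric, comparison operator, threshold).
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def satisfies(file_metrics, conditions):
    """True if every condition of the rule holds for the file's actual values."""
    return all(OPS[op](file_metrics[metric], t) for metric, op, t in conditions)

# The contradicting rule and the actual value for ErrorHandlerBuilderRef.java:
r_minus = [("RatioCommentToCode", ">=", 0.44), ("RatioCommentToCode", "<=", 0.96)]
actual = {"RatioCommentToCode": 0.51}
print(satisfies(actual, r_minus))  # True: the file falls in the non-risky range
```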
What are the practices to avoid to not increase the risk of having defects?
To answer this question, we use our hypothetical supporting rule to generate guidance (G3) for this file as follows:

ℜH+ = {MinorCommit > 0} ⇒ DEFECT

Implication. Having more than zero minor developers will increase the risk of having defects. The actual feature value of this file is {MinorCommit = 0}, meaning that this file has no minor developers (i.e., minor contributors) who edit or change the file. This finding is consistent with Bird et al. [2] and Rahman [32], who found that minor developers often introduce defects. Thus, developers should adhere to the hypothetical supporting rule in order not to increase the risk of having defects.
What are the practices to follow to decrease the risk of having defects?
To answer this question, we use our hypothetical contradicting rule to generate guidance (G4) for this file as follows:

ℜH− = {LOCBlank < 7.62 & OutputMean < 2} ⇒ CLEAN

Implication. If developers changed the file to have fewer than 8 blank lines and fewer than 2 output variables, this could reverse the prediction of having defects to being clean. The actual feature values of this file are {LOCBlank = 19, OutputMean = 3.7}, meaning that this file has 19 blank lines and an average of 3.7 output variables (i.e., fan-out) of functions in the file. The hypothetical contradicting rule indicates that if {LOCBlank < 7.62 & OutputMean < 2} holds, the file is likely to reverse the prediction of having defects to being clean. Thus, our hypothetical contradicting rule provides suggestions to developers on what they should do to decrease the risk of having defects. It should be noted that our contradicting rule shows correlations that may not necessarily be causal.

PRACTITIONERS' PERCEPTIONS OF OUR SQAPLANNER VISUALIZATION
In this section, we aim to investigate the practitioners' perceptions of the visualization of SQAPlanner when compared to the visualization of the state of the art (RQ6) and of the actual guidance generated by SQAPlanner (RQ7). Below, we describe the approach and present the results.
To address RQs 6 and 7, we developed a proof-of-concept to visualize the actionable guidance generated by our SQAPlanner. Traditionally, the importance scores of Random Forests or LIME's model-agnostic techniques are commonly presented using a bar chart. However, such bar charts can only indicate the importance scores, without providing guidance on what to do and what not to do. To address this challenge, we propose to use a bullet plot (see Figure 14). The visualization of our SQAPlanner is designed to provide the following key information: (1) the list of guidance that practitioners should follow and should avoid; (2) the actual feature value of that file; and (3) its threshold and range values for practitioners to follow to mitigate the risk of having defects. The green shades indicate the non-risky range values of features, while the red shades indicate the risky range values of features. The vertical bars indicate the actual values of features for a given file. The green arrows provide directions for how a feature should be changed (i.e., increased or decreased). The list of guidance is structured into two parts: (1) what to do to decrease the risk of being defective; and (2) what to avoid to not increase the risk of being defective. For each guidance, we translate a rule-based explanation into an actionable guidance. A guidance is presented in the form of natural language to ensure that it is actionable and understandable by practitioners.

To translate the rule-based explanations into actual guidance, we focus only on the ErrorHandlerBuilderRef.java file of the release 2.9.0 of the Camel software system. We use the rule-based explanations from Section 6.4 as a reference. Finally, we derive the following statements according to the reference rule-based explanations in Section 6.4:

• (S1) Decreasing the number of class and method declaration lines to less than 29 lines to decrease the risk of being defective.
• (S2) Decreasing the number of distinct developers to less than 2 developers to decrease the risk of being defective.
• (S3) Increasing the ownership code proportion to more than 0.85 to decrease the risk of being defective.
• (S4) Avoid decreasing the comment to code ratio to less than 0.44 to not increase the risk of being defective.
• (S5) Avoid increasing the number of minor developers to more than 0 developers to not increase the risk of being defective.
• (S6) Decreasing the number of blank lines to less than 8 lines to decrease the risk of being defective.
• (S7) Decreasing the number of output variables to less than 2 variables to decrease the risk of being defective.

To implement the visualization of our SQAPlanner approach, we decided to use Microsoft's Code Defect AI as our core infrastructure. We first downloaded the repository of Code Defect AI from GitHub. Then, we carefully studied the repository and deployed Code Defect AI in our local environment with continuous support from the core developer of Code Defect AI. Then, we integrated our SQAPlanner approach and replaced their visualization (bar plots) with our visualization generated by SQAPlanner, using the implementation of bullet plots provided by the d3.js JavaScript library.

To investigate the practitioners' perceptions of our SQAPlanner visualization, we used a qualitative survey as a research method. We also used the visualization of Microsoft's Code Defect AI (see Figure 2) as a baseline comparison.
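The translation from rule conditions into statements such as S1-S7 can be sketched as follows (an illustrative helper of our own, not the actual SQAPlanner implementation; the metric-name mapping is an assumption):

```python
# Hypothetical mapping from metric identifiers to human-readable descriptions.
DESCRIPTIONS = {
    "LOCDeclaration": "the number of class and method declaration lines",
    "DistinctDeveloper": "the number of distinct developers",
    "Ownership": "the ownership code proportion",
}

def to_guidance(metric, op, threshold):
    """Render one rule condition of a contradicting rule as a guidance sentence."""
    name = DESCRIPTIONS.get(metric, metric)
    if op == "<":
        return (f"Decreasing {name} to less than {threshold} "
                "to decrease the risk of being defective.")
    if op == ">":
        return (f"Increasing {name} to more than {threshold} "
                "to decrease the risk of being defective.")
    raise ValueError(f"unsupported operator: {op}")

print(to_guidance("LOCDeclaration", "<", 29))
print(to_guidance("Ownership", ">", 0.85))
```

Rendering each condition of a contradicting rule this way yields statements in the same shape as S1 and S3 above.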
The objectives of the survey are as follows: (1) to investigate the practitioners' perceptions of the visualization of our SQAPlanner; and (2) to investigate the practitioners' perceptions of the actionable guidance generated by our SQAPlanner. Similar to Section 4, we designed the survey as a cross-sectional study where participants provide their responses at one fixed point in time. The design of our survey is described below.
Part 1—Practitioners' perceptions of the visualizations of our SQAPlanner:
We first provided the concept of defect prediction models and described how our SQAPlanner can be used to support SQA planning. Then, we presented the visualization of our SQAPlanner and the visualization of Microsoft's Code Defect AI. We asked the participants a closed-ended question to inquire which visualization is best for providing actionable guidance on how to mitigate the risk of having defects. We also asked the participants an open-ended question to inquire about their rationale for why the selected visualization is preferred over the other.

6. https://github.com/aricent/codedefectai
7. https://bl.ocks.org/mbostock/4061961

Fig. 14: The visualization of our SQAPlanner is designed to provide the following key information: (1) the list of guidance that practitioners should follow and should avoid; (2) the actual feature value of that file; and (3) its threshold and range values for practitioners to follow to mitigate the risk of having defects. (In the bullet plots, the red shade indicates the range of values with a high risk of being defective, while the green shade indicates the range of values with a low risk; the bold vertical line indicates the actual value of each feature of this file.)
Part 2—Practitioners' perceptions of the actual guidance generated by SQAPlanner:
We again presented the visualization of SQAPlanner. Then, for each statement, we asked the participants a closed-ended question to inquire whether they agree with each of the seven statements that we translated from the rule-based explanations. We used an online questionnaire service provided by Google Forms. We carefully evaluated the survey via pre-testing [28] to assess its reliability and validity. The survey was rigorously reviewed and approved by the Monash University Human Research Ethics Committee (MUHREC Project ID: 27209). We used a recruiting service provided by MTurk to recruit participants. We received 240 closed-ended and 30 open-ended responses from 30 respondents. Finally, we manually verified and analyzed the survey responses to ensure that the responses are of high quality.
Fig. 15: (RQ6, RQ7) The results of a qualitative survey with practitioners: (a) perceptions of the visualization; (b) perceptions of the actual guidance generated by our SQAPlanner.
Results. 80% of our respondents agree that the visualization of our SQAPlanner is better for providing actionable guidance when compared to the visualization of Microsoft's Code Defect AI. Figure 15a shows the percentage of the respondents who selected which visualization is best to provide actionable guidance on how to mitigate the risk of having defects. After analyzing the open-ended responses, we found that practitioners (e.g., R10 and R12) provided rationales that the suggested threshold values of each factor and the directional arrows provided by SQAPlanner make the visualization clearer about what developers should do and should avoid to decrease the risk of having defects. Respondents (e.g., R19, R20, and R23) also pointed out that the summary of "What to do" and "What to avoid" is straight to the point and helpful.

On the other hand, 20% of the respondents rated the visualization of Microsoft's Code Defect AI as better. Respondents (e.g., R5 and R16) provided rationales that the visualization of Microsoft's Code Defect AI is presented in a more simple and concise manner (i.e., it only presents the most important factors that are associated with the risk of having defects). Thus, future research should take into consideration the complexity of the provided information when designing a novel visualization for AI-driven defect prediction.

(RQ7) How do practitioners perceive the actual guidance generated by our SQAPlanner?
Results. Figure 15b presents the percentage of the respondents who agree with the seven statements derived from the actual guidance generated by our SQAPlanner. We find that 90% of the respondents agree the most with (S1) Decreasing the number of class and method declaration lines to less than 29 lines to decrease the risk of being defective. On the other hand, only 63% of the respondents agree with (S3) Increasing the ownership code proportion to more than 0.85 to decrease the risk of being defective. We suspect that the wide range of agreement rates for our statements has to do with the degree of understandability of the software metrics, since practitioners may find that the number of class and method declaration lines in S1 is more intuitive and easier to understand than the ownership code proportion in S3. Thus, future research should take into consideration the degree of understandability of the software metrics when designing a novel visualization for AI-driven defect prediction.
THREATS TO VALIDITY
Construct Validity. Many local model-agnostic techniques could be used to generate many forms of explanations, e.g., feature importance and rules. In this paper, we focused only on rule-based explanations by comparing with LORE and Anchor, an extension of LIME. We also studied only a limited number of available classification techniques. Thus, our results may not be applicable or generalise to the use of other techniques. Nonetheless, other classification techniques can be explored in future work to see if they improve on our results.
Internal Validity. The practicality of rule-based explanations heavily relies on the software metrics that are used to train the models. In this paper, we chose to generate rules based on 65 well-known and hand-crafted software metrics, rather than using advanced automated feature generation like deep learning. Future work may focus on trying to explain other machine learning-based models, such as explaining deep learning models used in an SQA context.

The goal of our SQAPlanner (i.e., the local rule-based model-agnostic technique) is a post-hoc analysis of the global defect prediction models. That means SQAPlanner can only explain the behavior of the (global) defect prediction models, regardless of whether the predictions are correct or incorrect. If the predictions of the global defect models for the testing dataset are incorrect, SQAPlanner will explain why the global defect prediction models generate wrong predictions in the form of rule-based explanations. Therefore, the robustness or the sensitivity of our SQAPlanner does not depend on the accuracy of the predictions of the global defect prediction models.

External Validity. We applied our SQAPlanner approach to a limited number of software systems. Thus, our results may not generalize to other datasets, domains, or ecosystems. However, we mitigated this by choosing a range of different non-trivial, real-world, open-source software applications. Nonetheless, additional replication studies in a proprietary setting and other ecosystems will prove useful to compare with the results reported here.

SQA planning involves various activities. However, this paper only focused on helping practitioners to define development policies and their associated risk thresholds [12], without considering other activities. In addition, the dependent variable that we used in this study only focused on software quality (i.e., defective or clean), without considering other aspects (e.g., testability, reusability, robustness, and maintainability). Thus, other SQA planning activities and other quality attributes can be explored in future work.
9 RELATED WORK
In this section, we discuss related work and gaps to highlight the contributions of our work to the literature.
Despite the advances of AI/ML techniques that are tightly integrated into software development practices (e.g., defect prediction [17], automated code review [1, 50], and automated code completion [19, 20]), such AI/ML techniques come with their own limitations. The central problem is that most AI/ML models are black-box models, i.e., we understand the underlying mathematical principles without explicit declarative knowledge representation. In other words, developers do not understand how decisions are made by such AI/ML techniques. In addition, current defect modelling practices do not uphold current data privacy laws and regulations, which require justifications of individual predictions for any decisions made by an AI/ML model. Therefore, applying such black-box AI/ML techniques to software development practices for safety-critical and cyber-physical systems [4, 55], which involve safety, security, business, personal, or military operations, is unfavourable and must be avoided.

Explainable AI is essential in software engineering to build appropriate trust (including Fairness, Accountability, and Transparency (FAT)). Developers can then (1) understand the reasons and the logic behind every decision and (2) effectively improve the prediction models by understanding any unsound predictions made by the models. Recently, explainable AI has been employed in software engineering [44], making defect prediction models more practical [52] (i.e., using LIME to explain which tokens and which lines are likely to be defective in the future) and explainable [21] (i.e., using LIME to explain why a file is predicted as defective). However, there exist no studies that are able to provide concrete guidance on what developers should or should not do to support SQA planning. To the best of our knowledge, this paper is the first to generate local rule-based explanations to help QA teams make data-informed decisions in software quality assurance planning.
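As a concrete illustration of the kind of local explanation discussed above, the following is a minimal LIME-style sketch on synthetic data: a local linear surrogate is fitted to perturbed neighbours of a single file, weighted by proximity, to show which metrics drive its "defective" prediction. The metric names, data, and kernel are assumptions for illustration, not the setup used in this paper.

```python
# LIME-style local explanation sketch (assumed setup, synthetic data):
# explain one prediction of a black-box defect model with a weighted
# linear surrogate fitted on perturbed neighbours of the instance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical file-level metrics: lines of code, churn, number of authors.
feature_names = ["loc", "churn", "n_authors"]
X = rng.uniform(0, 1, size=(500, 3))
y = ((X[:, 0] + X[:, 1]) > 1.0).astype(int)          # toy "defective" label

global_model = RandomForestClassifier(random_state=0).fit(X, y)  # black-box f

x = np.array([0.55, 0.60, 0.50])                     # instance to explain
Z = x + rng.normal(0, 0.15, size=(1000, 3))          # neighbourhood of x
pz = global_model.predict_proba(Z)[:, 1]             # f's behaviour near x
w = np.exp(-np.sum((Z - x) ** 2, axis=1) / 0.25)     # proximity kernel

surrogate = Ridge(alpha=1.0).fit(Z, pz, sample_weight=w)  # local f'
explanation = sorted(zip(feature_names, surrogate.coef_),
                     key=lambda t: -abs(t[1]))
for name, coef in explanation:
    print(f"{name}: {coef:+.3f}")
```

In this toy setting, loc and churn receive large positive weights while n_authors stays near zero, mirroring the feature-importance explanations that LIME produces for real defect models.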
There are two key approaches for achieving explainability in defect prediction models. The first is to make the entire decision process transparent and comprehensible (i.e., global explainability). The second is to explicitly provide an explanation for each individual prediction (i.e., local explainability).

Examples of global explainability methods are regression models [33, 35], decision trees [57], decision rules [39], and Fast-and-Frugal trees [6]. These transparent AI/ML techniques often provide built-in model interpretation techniques to uncover the relationships between the studied features and defect-proneness, for example, an ANOVA analysis for logistic regression or a variable importance analysis for random forest. However, the insights derived from these transparent AI/ML techniques do not provide justifications for each individual prediction.

Model-agnostic techniques explicitly provide an instance explanation for each decision of AI/ML models (i.e., local explainability) for a given testing instance [16]. Formally, given a defect model f and an instance x, the instance explanation problem aims to provide an explanation e for the prediction f(x) = y. To do so, we address the problem by building a local interpretable model f′ that mimics the local behaviour of the global defect model f. An explanation of the prediction is then derived from the local interpretable model f′. The local interpretable model focuses on learning the behaviour of the defect model in the neighbourhood of the specific instance x, without aiming to provide a single description of the logic of the black box for all possible instances. Thus, an explanation e ∈ E is obtained through f′ if e = ε(f′, x) for some explanation logic ε(·, ·) which reasons over f′ and x. Two common ways to represent explanations are feature-importance explanations and rule-based explanations.

Unlike the model-specific explanation techniques discussed above, the great advantage of model-agnostic techniques is their flexibility. Such model-agnostic techniques (1) can interpret any learning algorithm (e.g., regression, random forest, and neural networks); (2) are not limited to a certain form of explanation (e.g., feature importance or rules); and (3) are able to process any input data (e.g., features, words, and images [36]).

There is a plethora of model-agnostic techniques [16] for identifying the most important features at the instance level. For example, LIME (Local Interpretable Model-agnostic Explanations) [37] is a model-agnostic technique that mimics the behaviour of the black-box model with a local linear model to generate explanations of the predictions. BreakDown [14, 41] is a model-agnostic technique that uses a greedy strategy to sequentially measure the contributions of metrics towards the expected prediction. However, none of these techniques can generate explanations with the logic behind them.

Despite the advances of model-agnostic techniques in the Explainable AI community, such techniques have not been employed in practical software engineering contexts. To the best of our knowledge, this paper is the first to generate local rule-based explanations to help QA teams make data-informed decisions in software quality assurance planning.
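The formalism above can be sketched for the rule-based representation as well: fit a shallow decision tree f′ on neighbours of x labelled by the global model f, then read the conjunction of conditions along x's decision path as the explanation e. This is a simplified sketch in the spirit of LORE-style rule extraction, not the SQAPlanner algorithm itself; the metric names and synthetic data are assumptions for illustration.

```python
# Rule-based local explanation sketch (assumed setup, synthetic data):
# a shallow decision tree f' mimics the global defect model f near x,
# and x's decision path yields a rule-based explanation e.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
feature_names = ["loc", "churn", "n_authors"]        # hypothetical metrics
X = rng.uniform(0, 1, size=(500, 3))
y = ((X[:, 0] > 0.6) & (X[:, 1] > 0.4)).astype(int)  # toy "defective" label

f = RandomForestClassifier(random_state=0).fit(X, y)  # global defect model

x = np.array([0.8, 0.7, 0.3])                         # instance to explain
Z = np.clip(x + rng.normal(0, 0.25, size=(2000, 3)), 0, 1)  # neighbourhood
f_prime = DecisionTreeClassifier(max_depth=3, random_state=0)
f_prime.fit(Z, f.predict(Z))                          # local surrogate f'

# Walk x's path through the tree to collect the rule's conditions.
tree = f_prime.tree_
node, conditions = 0, []
while tree.children_left[node] != -1:                 # stop at a leaf
    feat, thr = tree.feature[node], tree.threshold[node]
    if x[feat] <= thr:
        conditions.append(f"{feature_names[feat]} <= {thr:.2f}")
        node = tree.children_left[node]
    else:
        conditions.append(f"{feature_names[feat]} > {thr:.2f}")
        node = tree.children_right[node]

pred = int(f_prime.predict(x.reshape(1, -1))[0])
rule = " AND ".join(conditions) + (" => defective" if pred == 1 else " => clean")
print(rule)
```

The resulting rule (e.g., conditions over loc and churn with concrete thresholds) illustrates why rule-based explanations carry actionable risk thresholds that plain feature-importance weights do not.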
10 CONCLUSIONS
Defect prediction models have been proposed to generate insights (e.g., the most important factors that are associated with software quality). However, such insights derived from traditional defect models are far from actionable, i.e., practitioners still do not know what they should do and should avoid to decrease the risk of having defects, and what the risk threshold for each metric is. A lack of actionable guidance and risk thresholds could lead to inefficient and ineffective SQA planning processes.

In this paper, we investigate practitioners' perceptions of and challenges with current SQA planning activities, and their perceptions of our proposed four types of guidance. Then, we propose and evaluate our SQAPlanner approach, i.e., an approach for generating four types of guidance and their associated risk thresholds in the form of rule-based explanations for the predictions of defect prediction models. Finally, we develop and evaluate the visualization of our SQAPlanner approach.

Through the use of a qualitative survey and an empirical evaluation, our results lead us to conclude that SQAPlanner is needed, important, effective, stable, and applicable. We also find that 80% of respondents perceived that our visualization is more actionable. Thus, our SQAPlanner paves a way for novel research in actionable software analytics.

Finally, we note that we do not seek to claim the generalization and causation of our proposed guidance. Instead, the key message of our study is that our rule-based guidance can explain the behaviour of the defect models that learnt the relationship between software features and defect-proneness from past release data. Thus, they can indicate important relationships in the data and provide a useful tool to support decision- and policy-making in software quality assurance. Our rule-based guidance could be used as a guidance tool for supporting decision-making so that developers can (1) understand the reasons and the logic behind every prediction, and (2) effectively improve the prediction models by understanding any unsound predictions made by the models.

ACKNOWLEDGMENTS
C. Tantithamthavorn was partially supported by the Australian Research Council's Discovery Early Career Researcher Award (ARC DECRA) funding scheme (DE200100941). C. Bergmeir was partially supported by the Australian Research Council's Discovery Early Career Researcher Award (ARC DECRA) funding scheme (DE190100045). J. Grundy was partially supported by the Australian Research Council's Laureate Fellowship funding scheme (FL190100035).

REFERENCES

[1] S. Asthana, R. Kumar, R. Bhagwan, C. Bird, C. Bansal, C. Maddila, S. Mehta, and B. Ashok, "Whodo: automating reviewer suggestions at scale," in
Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2019, pp. 937–945.
[2] C. Bird, B. Murphy, and H. Gall, "Don't Touch My Code! Examining the Effects of Ownership on Software Quality," in Proceedings of the European Conference on Foundations of Software Engineering (ESEC/FSE), 2011, pp. 4–14.
[3] B. Boehm and V. R. Basili, "Software defect reduction top 10 list," Foundations of Empirical Software Engineering: The Legacy of Victor R. Basili, vol. 426, no. 37, pp. 426–431, 2005.
[4] M. Borg, S. Gerasimou, N. Hochgeschwender, and N. Khakpour, "Explainability for safety and security," Explainable Software for Cyber-Physical Systems (ES4CPS), Report from the GI Dagstuhl Seminar 19023, p. 15, 2019.
[5] L. Breiman, A. Cutler, A. Liaw, and M. Wiener, "randomForest: Breiman and Cutler's Random Forests for Classification and Regression. R package version 4.6-12." Software available at URL: https://cran.r-project.org/package=randomForest.
[6] D. Chen, W. Fu, R. Krishna, and T. Menzies, "Applications of Psychological Science for Actionable Analytics," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2018, pp. 456–467.
[7] C. Y. Chong, P. Thongtanunam, and C. Tantithamthavorn, "Assessing the students understanding and their mistakes in code review checklists – an experience report of 1,791 code review checklists from 394 students," in International Conference on Software Engineering: Joint Software Engineering Education and Training track (ICSE-JSEET), 2021.
[8] D. A. da Costa, S. McIntosh, W. Shang, U. Kulesza, R. Coelho, and A. E. Hassan, "A Framework for Evaluating the Results of the SZZ Approach for Identifying Bug-introducing Changes," Transactions on Software Engineering (TSE), vol. 43, no. 7, pp. 641–657, 2017.
[9] M. D'Ambros, M. Lanza, and R. Robbes, "An Extensive Comparison of Bug Prediction Approaches," in Proceedings of the International Conference on Mining Software Repositories (MSR), 2010, pp. 31–41.
[10] P. Edwards, I. Roberts, M. Clarke, C. DiGuiseppi, S. Pratap, R. Wentz, and I. Kwan, "Increasing response rates to postal questionnaires: Systematic review," BMJ, vol. 324, no. 7347, p. 1183, 2002.
[11] S. Farooqui and W. Mahmood, "A survey of pakistan's sqa practices: A comparative study," in , 2017.
[12] D. Galin, Software Quality: Concepts and Practice. John Wiley & Sons, 2018.
[13] B. Ghotra, S. McIntosh, and A. E. Hassan, "Revisiting the Impact of Classification Techniques on the Performance of Defect Prediction Models," in Proceedings of the International Conference on Software Engineering (ICSE), 2015, pp. 789–800.
[14] A. Gosiewska and P. Biecek, "iBreakDown: Uncertainty of Model Explanations for Non-additive Predictive Models," arXiv preprint arXiv:1903.11420, 2019.
[15] R. Guidotti, A. Monreale, S. Ruggieri, D. Pedreschi, F. Turini, and F. Giannotti, "Local rule-based explanations of black box decision systems," arXiv preprint arXiv:1805.10820, 2018.
[16] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, D. Pedreschi, and F. Giannotti, "A Survey of Methods for Explaining Black Box Models," vol. 51, no. 5, pp. 1–45, 2018. [Online]. Available: http://arxiv.org/abs/1802.01933
[17] T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell, "A Systematic Literature Review on Fault Prediction Performance in Software Engineering," Transactions on Software Engineering (TSE), vol. 38, no. 6, pp. 1276–1304, 2012. [Online]. Available: http://ieeexplore.ieee.org.pc124152.oulu.fi:8080/xpls/abs{_}all.jsp?arnumber=6035727
[18] J. A. Hanley and B. J. McNeil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, no. 1, pp. 29–36, Apr. 1982. [Online]. Available: http://dx.doi.org/10.1148/radiology.143.1.7063747
[19] V. J. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis, "Deep learning type inference," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2018, pp. 152–162.
[20] V. J. Hellendoorn, S. Proksch, H. C. Gall, and A. Bacchelli, "When code completion fails: a case study on real-world completions," in Proceedings of the 41st International Conference on Software Engineering. IEEE Press, 2019, pp. 960–970.
[21] J. Jiarpakdee, C. Tantithamthavorn, H. K. Dam, and J. Grundy, "An empirical study of model-agnostic techniques for defect prediction models," 2020.
[22] J. Jiarpakdee, C. Tantithamthavorn, and A. E. Hassan, "The Impact of Correlated Metrics on Defect Models," Transactions on Software Engineering (TSE), p. To Appear, 2019.
[23] J. Jiarpakdee, C. Tantithamthavorn, and C. Treude, "AutoSpearman: Automatically Mitigating Correlated Software Metrics for Interpreting Defect Models," in Proceedings of the International Conference on Software Maintenance and Evolution (ICSME), 2018, pp. 92–103.
[24] B. A. Kitchenham and S. L. Pfleeger, "Personal opinion surveys," in Guide to Advanced Empirical Software Engineering. Springer, 2008, pp. 63–92.
[25] L. Kocsis and C. Szepesvári, "Bandit based monte-carlo planning," in European Conference on Machine Learning. Springer, 2006, pp. 282–293.
[26] J. A. Krosnick, "Survey research," Annual Review of Psychology, vol. 50, no. 1, pp. 537–567, 1999.
[27] S. Kumaresh and R. Baskaran, "Defect analysis and prevention for software process quality improvement," International Journal of Computer Applications, vol. 8, no. 7, pp. 42–47, 2010.
[28] M. S. Litwin, How to Measure Survey Reliability and Validity. Sage, 1995, vol. 7.
[29] B. R. Maxim and M. Kessentini, "An introduction to modern software quality assurance," in Software Quality Assurance. Elsevier, 2016, pp. 19–46.
[30] S. McIntosh, Y. Kamei, B. Adams, and A. E. Hassan, "The Impact of Code Review Coverage and Code Review Participation on Software Quality," in Proceedings of the International Conference on Mining Software Repositories (MSR), 2014, pp. 192–201.
[31] T. Menzies, J. Greenwald, and A. Frank, "Data Mining Static Code Attributes to Learn Defect Predictors," Transactions on Software Engineering (TSE), vol. 33, no. 1, pp. 2–13, 2007.
[32] F. Rahman and P. Devanbu, "Ownership, experience and defects: a fine-grained study of authorship," in Proceedings of the International Conference on Software Engineering (ICSE), 2011, pp. 491–500.
[33] ——, "How, and Why, Process Metrics are Better," in Proceedings of the International Conference on Software Engineering (ICSE), 2013, pp. 432–441.
[34] D. Rajapaksha, C. Bergmeir, and W. Buntine, "LoRMIkA: Local rule-based model interpretability with k-optimal associations," Information Sciences, vol. 540, pp. 221–241, 2020.
[35] G. K. Rajbahadur, S. Wang, Y. Kamei, and A. E. Hassan, "The Impact of Using Regression Models to Build Defect Classifiers," in Proceedings of the International Conference on Mining Software Repositories (MSR), 2017, pp. 135–145.
[36] M. T. Ribeiro, S. Singh, and C. Guestrin, "Model-agnostic Interpretability of Machine Learning," arXiv preprint arXiv:1606.05386, 2016.
[37] ——, "Why should I trust you?: Explaining the Predictions of Any Classifier," in Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDDM), 2016, pp. 1135–1144.
[38] ——, "Anchors: High-precision model-agnostic explanations," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[39] D. Rodríguez, R. Ruiz, J. C. Riquelme, and J. S. Aguilar-Ruiz, "Searching for Rules to Detect Defective Modules: A Subgroup Discovery Approach," Information Sciences, vol. 191, pp. 14–30, 2012.
[40] E. Smith, R. Loftin, E. Murphy-Hill, C. Bird, and T. Zimmermann, "Improving developer participation rates in surveys," in Proceedings of the International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), 2013, pp. 89–92.
[41] M. Staniak and P. Biecek, "Explanations of Model Predictions with live and breakDown Packages," arXiv preprint arXiv:1804.01955, 2018.
[42] R. Storn and K. Price, "Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces," Journal of Global Optimization, vol. 11, no. 4, pp. 341–359, Dec. 1997. [Online]. Available: https://doi.org/10.1023/A:1008202821328
[43] C. Tantithamthavorn, A. E. Hassan, and K. Matsumoto, "The Impact of Class Rebalancing Techniques on the Performance and Interpretation of Defect Prediction Models," Transactions on Software Engineering (TSE), p. To Appear, 2019.
[44] C. Tantithamthavorn, J. Jiarpakdee, and J. Grundy, "Explainable AI for software engineering," arXiv preprint arXiv:2012.01614, 2020.
[45] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, A. Ihara, and K. Matsumoto, "The Impact of Mislabelling on the Performance and Interpretation of Defect Prediction Models," in Proceedings of the International Conference on Software Engineering (ICSE), 2015, pp. 812–823.
[46] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, "Automated Parameter Optimization of Classification Techniques for Defect Prediction Models," in Proceedings of the International Conference on Software Engineering (ICSE), 2016, pp. 321–332.
[47] ——, "An Empirical Comparison of Model Validation Techniques for Defect Prediction Models," Transactions on Software Engineering (TSE), vol. 43, no. 1, pp. 1–18, 2017.
[48] ——, "The Impact of Automated Parameter Optimization on Defect Prediction Models," Transactions on Software Engineering (TSE), pp. 683–711, 2018.
[49] P. Thongtanunam, S. McIntosh, A. E. Hassan, and H. Iida, "Revisiting Code Ownership and its Relationship with Software Quality in the Scope of Modern Code Review," in Proceedings of the International Conference on Software Engineering (ICSE), 2016, pp. 1039–1050.
[50] P. Thongtanunam, C. Tantithamthavorn, R. G. Kula, N. Yoshida, H. Iida, and K.-i. Matsumoto, "Who Should Review My Code? A File Location-based Code-reviewer Recommendation Approach for Modern Code Review," in Proceedings of the International Conference on Software Analysis, Evolution, and Reengineering (SANER), 2015, pp. 141–150.
[51] Z. Wan, X. Xia, A. E. Hassan, D. Lo, J. Yin, and X. Yang, "Perceptions, expectations, and challenges in defect prediction," IEEE Transactions on Software Engineering, 2018.
[52] S. Wattanakriengkrai, P. Thongtanunam, C. Tantithamthavorn, H. Hata, and K. Matsumoto, "Predicting defective lines using a model-agnostic technique," 2020.
[53] G. I. Webb, "OPUS: An efficient admissible algorithm for unordered search," Journal of Artificial Intelligence Research, vol. 3, pp. 431–465, 1995.
[54] G. I. Webb and S. Zhang, "K-optimal rule discovery," Data Mining and Knowledge Discovery, vol. 10, no. 1, pp. 39–79, 2005.
[55] Y. Yang, D. Falessi, T. Menzies, and J. Hihn, "Actionable analytics for software engineering," IEEE Software, vol. 35, no. 1, pp. 51–53, 2017.
[56] S. Yathish, J. Jiarpakdee, P. Thongtanunam, and C. Tantithamthavorn, "Mining Software Defects: Should We Consider Affected Releases?" in Proceedings of the International Conference on Software Engineering (ICSE), 2019, p. To Appear.
[57] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project Defect Prediction," in Proceedings of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering (ESEC/FSE), 2009, pp. 91–100.
Dilini Rajapaksha received the BSc (Hons) degree from the Sri Lanka Institute of Information Technology (SLIIT). She is currently a Ph.D. candidate at Monash University, Australia. Her research interests include machine learning and time-series forecasting. The goal of her Ph.D. is to provide local explanations for the predictions given by time-series and machine learning models.
Chakkrit Tantithamthavorn is a Lecturer in Software Engineering and a 2020 ARC DECRA Fellow in the Faculty of Information Technology, Monash University, Australia. His current fellowship is focusing on the development of "Practical and Explainable Analytics to Prevent Future Software Defects". His work has been published at several top-tier software engineering venues, such as the IEEE Transactions on Software Engineering (TSE), the Springer Journal of Empirical Software Engineering (EMSE), and the International Conference on Software Engineering (ICSE). More about Chakkrit and his work is available online at http://chakkrit.com.
Jirayus Jiarpakdee is a Ph.D. candidate at Monash University, Australia. His research interests include empirical software engineering and mining software repositories (MSR). The goal of his Ph.D. is to apply the knowledge of statistical modelling, experimental design, and software engineering to improve the explainability of defect prediction models.
Christoph Bergmeir is a Lecturer in Data Science and Artificial Intelligence, and a 2019 ARC DECRA Fellow in the Monash Faculty of Information Technology. His fellowship is on the development of "efficient and effective analytics for real-world time series forecasting". He also works as a Data Scientist in a variety of projects with external partners in diverse sectors, e.g., in healthcare or infrastructure maintenance. Christoph holds a PhD in Computer Science from the University of Granada, Spain, and an M.Sc. degree in Computer Science from the University of Ulm, Germany.
John Grundy is an Australian Laureate Fellow and Professor of Software Engineering at Monash University, Australia. He has published widely in automated software engineering, domain-specific visual languages, model-driven engineering, software architecture, and empirical software engineering, among many other areas. He is a Fellow of Automated Software Engineering and a Fellow of Engineers Australia.