SQAPlanner: Generating Data-Informed Software Quality Improvement Plans
Dilini Rajapaksha, Chakkrit Tantithamthavorn, Jirayus Jiarpakdee, Christoph Bergmeir, John Grundy, Wray Buntine
Abstract: Software Quality Assurance (SQA) planning aims to define proactive plans, such as defining a maximum file size, to prevent the occurrence of software defects in future releases. To aid this, defect prediction models have been proposed to generate insights as to the most important factors that are associated with software quality. However, such insights derived from traditional defect models are far from actionable, i.e., practitioners still do not know what they should do or avoid to decrease the risk of having defects, nor what the risk threshold is for each metric. A lack of actionable guidance and risk thresholds can lead to inefficient and ineffective SQA planning processes. In this paper, we investigate practitioners' perceptions of current SQA planning activities and the challenges of such activities, and propose four types of guidance to support SQA planning. We then propose and evaluate our AI-driven SQAPlanner approach, a novel approach for generating the four types of guidance and their associated risk thresholds in the form of rule-based explanations for the predictions of defect prediction models. Finally, we develop and evaluate an information visualization for our SQAPlanner approach. Through a qualitative survey and an empirical evaluation, our results lead us to conclude that SQAPlanner is needed, effective, stable, and practically applicable. We also find that 80% of our survey respondents perceived our visualization as more actionable. Thus, SQAPlanner paves the way for novel research in actionable software analytics, i.e., generating actionable guidance on what practitioners should do and not do to decrease the risk of having defects to support SQA planning.
Index Terms: Software Quality Assurance, SQA Planning, Actionable Software Analytics, Explainable AI.
1 INTRODUCTION
Software Quality Assurance (SQA) planning is the process of developing proactive SQA plans. One of the most important SQA activities is to define development policies and their associated risk thresholds [12] (e.g., defining the maximum file size, the maximum code complexity, and the minimum degree of code ownership). Such SQA plans will later be enforced for the whole team to ensure the highest quality of software systems. These policies are essential to improve software quality and software maintainability [29].

Recently, top software companies have released several commercial AI-driven defect prediction tools, for example, Microsoft's Code Defect AI and Amazon's CodeGuru. Such tools heavily rely on the concept of defect prediction models that have been well studied in the past decades [17]. In particular, Microsoft's Code Defect AI is built on top of the concept of explainable Just-In-Time defect prediction [21, 44], i.e., explaining the predictions of defect models using the LIME model-agnostic technique [37]. The crux of Microsoft's Code Defect AI tool is similar to the recent parallel work by Jiarpakdee et al. [21], who also suggested using the LIME model-agnostic technique to explain the predictions of defect models.

• D. Rajapaksha, C. Tantithamthavorn, J. Jiarpakdee, C. Bergmeir, J. Grundy, and W. Buntine are with the Faculty of Information Technology, Monash University, Melbourne, Australia. E-mail: {dilini.rajapakshahewaranasinghage, chakkrit, jirayus.jiarpakdee, christoph.bergmeir, john.grundy, wray.buntine}@monash.edu
• Corresponding author: C. Tantithamthavorn.
However, these current state-of-the-art defect prediction approaches can only indicate the most important features, which are still far from actionable. Thus, practitioners still do not know (1) what they should do to decrease the risk of having defects and what they should avoid so as not to increase the risk of having defects, and (2) what the risk threshold is for each metric (e.g., how large is a file size that would be risky? And how small is a file size that would be non-risky?).

A lack of actionable guidance and risk thresholds can lead to inefficient and ineffective SQA planning processes. Such ineffective SQA planning processes will result in the recurrence of software defects, slow project progress, high costs of development, unsatisfactory software products, and unhappy end-users. These challenges are very significant to the practical applications of defect prediction models, but still remain largely unexplored.

We aim to help practitioners make better data-informed SQA planning decisions by generating actionable guidance derived from defect prediction models. Thus, we first propose the following four types of guidance to support SQA planning:

(G1) Risky current practices that lead the defect model to predict a file as defective are needed to help practitioners understand what the current risky practices are.

(G2) Non-risky current practices that lead the defect model to predict a file as clean are needed to help practitioners understand what the non-risky current practices are.

(G3) Potential practices to avoid so as not to increase the risk of having defects are needed to help practitioners understand which currently unimplemented practices to avoid to not increase the risk of having defects.
(G4) Potential practices to follow to decrease the risk of having defects are needed to help practitioners understand which practices to newly implement to decrease the risk of having defects.

To achieve this aim, our research study has the following three key objectives:

(Obj1) Investigating practitioners' perceptions of and challenges in carrying out current SQA planning activities, and their perceptions of our proposed four types of guidance;

(Obj2)
Developing and evaluating our novel SQAPlanner approach and comparing it with state-of-the-art approaches;

(Obj3)
Developing and evaluating an information visualization for our SQAPlanner approach and comparing it with the visualization of Microsoft's Code Defect AI tool.

To achieve the first objective, we conducted a qualitative survey with practitioners to address the following research questions:

(RQ1) How do practitioners perceive SQA planning activities?
86% of the respondents perceived SQA planning activities as important, and 70% perceived them as being used in practice. However, 66% perceived them as time-consuming and 58% as difficult, indicating that a data-informed SQA planning tool is needed to support QA teams in better data-informed decision-making and policy-making.

(RQ2) How do practitioners perceive our proposed four types of guidance to support SQA planning?
Both (G1) the guidance on risky current practices that lead a model to predict a file as defective and (G4) the guidance on potential practices to follow to decrease the risk of having defects are perceived by the respondents as among the most useful, the most important, and the ones they are most willing to adopt.

Motivated by the findings of RQ1 and RQ2, we propose AI-driven SQAPlanner, i.e., an approach to generate the four types of guidance in the form of rule-based explanations [34] to support data-informed SQA planning.
Our AI-driven SQAPlanner is a significant advancement over the LIME model-agnostic technique [37], since LIME only indicates which factors are the most important in supporting the predictions towards the defective (G1) and clean (G2) classes, while our AI-driven SQAPlanner can additionally provide actionable guidance on what developers should avoid (G3) and should do (G4) to decrease the risk of having defects. Then, we conduct an empirical evaluation of our SQAPlanner approach and compare it with two state-of-the-art local rule-based model-agnostic techniques, i.e., Anchor [38] (an extension of LIME [37]) and LORE [15]. Through a case study of 32 releases across 9 open-source software projects, we address the following research questions:

(RQ3) How effective are the rule-based explanations generated by our SQAPlanner approach when compared to the state-of-the-art approaches?
The rule-based guidance generated by our SQAPlanner approach achieves the highest coverage (median 89%), confidence (median 99%), and lift scores (median 6.6) when compared to the baseline techniques.

(RQ4) How stable are the rule-based explanations generated by our SQAPlanner approach when they are regenerated?
Our SQAPlanner approach produces the most consistent rule-based guidance (a median Jaccard coefficient of 0.92) when compared to the baseline techniques, suggesting that our approach generates the most stable rule-based guidance when explanations are regenerated.

(RQ5) How applicable are the rule-based explanations generated by our SQAPlanner approach to minimizing the risk of having defects in the subsequent releases?
For 55%-87% of the defective files, our SQAPlanner approach can generate rule-based guidance that is applicable in the subsequent release to decrease the risk of having defects.

To evaluate the practical usefulness of our SQAPlanner, we developed a proof-of-concept prototype to visualize the actual generated actionable guidance. The visualization of our SQAPlanner is designed to provide the following key information: (1) the list of guidance that practitioners should follow and should avoid; (2) the actual feature value of that file; and (3) the threshold and range values for practitioners to follow to mitigate the risk of having defects. Then, we compare our visualization with the visualization of Microsoft's Code Defect AI (see Figure 2). Finally, we conducted a qualitative survey to address the following research questions:

(RQ6) How do practitioners perceive the visualization of SQAPlanner when compared to the visualization of the state-of-the-art?
80% of the respondents agree that the visualization of our SQAPlanner is better at providing actionable guidance than the visualization of Microsoft's Code Defect AI.

(RQ7) How do practitioners perceive the actual guidance generated by our SQAPlanner?

The key contributions of this paper are as follows:

• An empirical investigation of practitioners' perceptions and challenges of current SQA planning activities.
• An empirical investigation of practitioners' perceptions of our proposed four types of guidance.
• The development of our novel AI-driven SQAPlanner approach to generate the proposed four types of guidance in the form of rule-based explanations to better support SQA planning. The implementation is available at https://github.com/awsm-research/SQAPlanner-implementation.
• An empirical investigation of the effectiveness, stability, and applicability of the rule-based explanations generated by our SQAPlanner.
• The development of the visualization of our SQAPlanner approach and an empirical investigation of practitioners' perceptions of the visualization and the actual guidance.

The rest of the paper is organized as follows. Section 2 discusses the significance of SQA planning, the limitations of current AI-driven defect prediction tools, and the motivation for the proposed four types of guidance to support SQA planning. Section 3 presents an overview of our case study and the motivation of the research questions. Section 4 presents the results of practitioners' perceptions of SQA planning activities and the four types of guidance to support SQA planning. Section 5 presents our SQAPlanner approach, while Section 6 presents the empirical results of our SQAPlanner approach. Section 7 presents the empirical investigation of the visualization of our SQAPlanner and the actual guidance generated by our SQAPlanner approach. Section 8 summarizes the threats to the validity of our study, and Section 9 discusses related work. Finally, Section 10 draws conclusions.
2 BACKGROUND AND MOTIVATION
In this section, we first discuss the significance of Software Quality Assurance (SQA) planning. Then, we discuss the limitations of current AI-driven defect prediction tools. Finally, we propose the four types of guidance to support SQA planning.
A classic principle commonly applied to SQA processes is to prevent software defects before they occur [27]. It is widely known that the cost of software defects rises significantly the later they are discovered in the process. Thus, finding and fixing software defects prior to releasing software is usually much cheaper and faster than fixing them after the software is released [3]. Therefore, SQA teams play a critical role in software companies as gatekeepers, i.e., not allowing software defects to pass through to end-users.

Consider an example of an SQA practice inside Atlassian, Australia's largest software company, with a variety of well-known software products, e.g., the JIRA Issue Tracking System, BitBucket, and Trello.

Fig. 1: A JIRA software development process and how QA engineers interact with developers prior to releasing a software product.

Figure 1 provides an overview of a JIRA software development process. During this process, a QA engineer has multiple points at which he or she provides feedback on the way a feature is developed and tested, i.e., providing every form of quality improvement guidance for all steps of the software development process, from planning to completion. This process allows for immediate active feedback to ensure that knowledge gained from previous software defects is fed back into the testing notes for future releases to prevent defects in the next iteration.
An AI-driven defect prediction model (aka a defect model) is a classification model trained on historical data to predict whether a file is likely to be defective in the future. Defect models serve two main purposes. The first is to predict. The predictions of defect models can help developers prioritize their limited SQA resources on the most risky files [9, 31, 46, 48]. Therefore, developers can spend their limited SQA effort on the most risky files instead of wasting their time inspecting less risky files. The second is to explain. The insights derived from defect models can help managers chart quality improvement plans to avoid the pitfalls that led to defects in the past [2, 30, 49]. For example, if the insights suggest that code complexity shares the strongest relationship with defect-proneness, managers should initiate quality improvement plans to control and monitor the code complexity of that system.

Recently, top software companies have released several commercial AI-driven defect prediction tools, for example, Microsoft's Code Defect AI and Amazon's CodeGuru. Such tools heavily rely on the concept of defect prediction models that have been well studied in the
past decades [17]. In particular, Microsoft's Code Defect AI is built on top of the concept of explainable Just-In-Time defect prediction [21, 44], i.e., explaining the predictions of defect models using the LIME model-agnostic technique [37]. LIME is a model-agnostic technique for explaining the predictions of any AI/ML algorithm. The crux of Microsoft's Code Defect AI tool is similar to the recent parallel work by Jiarpakdee et al. [21], i.e., extracting several software metrics (e.g., churn), building a classification model (e.g., random forests), generating a prediction for each file in a commit, and generating an explanation of each prediction using the LIME model-agnostic technique [37].

Fig. 2: An example visualization of Microsoft's Code Defect AI tool (http://codedefectai.azurewebsites.net/). However, this tool does not suggest what practitioners should do to decrease the risk of having defects, nor what practitioners should avoid in order not to increase the risk of having defects. In addition, this tool does not suggest a risk threshold for each metric.

Figure 2 presents an example visualization of Microsoft's Code Defect AI product for the file
ErrorHandlerBuilderRef.java of Apache Camel release 2.9.0. The figure shows that this file is predicted as defective with a confidence score of 70%. The three most important factors associated with this prediction are the number of lines of class and method declarations, the number of distinct developers, and the degree of code ownership. Thus, these insights can help managers chart quality improvement plans to control for these metrics. However, there exist the following limitations.

• First, practitioners still do not know what they should do to decrease the risk of having defects, and what they should avoid so as not to increase the risk of having defects.
We find that LIME can only indicate which factors are the most important in supporting the predictions towards the defective (G1) and clean (G2) classes, without providing actionable guidance on what practitioners should avoid (G3) and should do (G4) to decrease the risk of having defects.

• Second, practitioners still do not know the risk threshold for each metric (e.g., how large is a file size that would be risky? And how small is a file size that would be non-risky?).

A lack of these types of guidance and their risk thresholds can lead to inefficient and ineffective SQA planning processes. Such ineffective SQA planning processes can result in the recurrence of software defects, slow project progress, high costs of development, unsatisfactory software products, and unhappy end-users. To the best of our knowledge, these challenges are very significant to the practical applications of defect prediction models, but still remain largely unexplored.
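To make the contrast concrete, the sketch below (not the paper's implementation; all feature names, weights, and thresholds are hypothetical) shows the difference in plain data structures: an importance-style explanation only ranks features, whereas a rule-style explanation carries an explicit condition and threshold that can be rendered as an actionable recommendation.

```python
# Hypothetical importance-style explanation: ranked features, no thresholds.
importance_explanation = [
    ("ClassAndMethodLines", 0.31),
    ("DistinctDevelopers", 0.24),
    ("CodeOwnership", -0.18),
]

# Hypothetical rule-style explanation: a condition (antecedent) with an
# explicit risk threshold, and the class (consequent) it is associated with.
rule_explanation = {
    "condition": ("LOC", ">", 100),
    "consequence": "DEFECT",
}

def to_guidance(rule):
    """Render a rule as an actionable, human-readable recommendation."""
    feature, op, threshold = rule["condition"]
    if rule["consequence"] == "DEFECT" and op == ">":
        return (f"{feature} above {threshold} is associated with defects; "
                f"consider keeping {feature} at or below {threshold}.")
    # Rules associated with the clean class describe practices to maintain.
    return f"Keep satisfying {feature} {op} {threshold}."

print(to_guidance(rule_explanation))
# LOC above 100 is associated with defects; consider keeping LOC at or below 100.
```

The importance list above offers no analogous rendering: without a threshold, there is no concrete target value to recommend, which is precisely the gap the four types of guidance aim to fill.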
To address these challenges, we propose AI-driven SQAPlanner, i.e., an approach for generating the four types of guidance and their risk thresholds in the form of rule-based explanations for the predictions of defect models. Below, we discuss a motivating scenario of how our AI-driven SQAPlanner could be used in a software development process to assist SQA planning.
Without our SQAPlanner. Consider Bob, a QA manager joining a new software development project. His main responsibility is to apply SQA activities (e.g., code review and testing) to find defects and to develop quality improvement plans to prevent them in the next iteration. However, he has little knowledge of the software project. Therefore, he decides to deploy a defect prediction model to guide his QA team to the risky areas of source code, so his team can effectively allocate their limited effort to these risky areas. However, Bob still encounters various SQA planning problems during the planning steps to prevent software defects in the next iteration. In particular, without AI-driven SQA planning tools, he cannot understand what the risky and non-risky practices are for this team and this project, which key actions to avoid because they increase the risk of having defects, and which key actions to take to decrease the risk of having defects.
A lack of AI-driven SQA planning tools could lead to a failure to develop the most effective SQA plans.
Ultimately, this results in the recurrence of software defects, slow project progress, high costs of software development, unsatisfactory software products, and unhappy end-users.
With our SQAPlanner. Now consider that Bob adopts our AI-driven SQAPlanner tool. In particular, given a file that is predicted as defective by a defect prediction model, our SQAPlanner can further generate rule-based explanations to better understand the key risky practices, non-risky practices, actions to avoid that increase the risk of defects, and actions to take to decrease the risk of having defects for that file. Bob can use our SQAPlanner to make data-informed decisions when developing SQA plans. This could result in more optimal SQA plans, leading to higher-quality software systems, fewer software defects, lower costs of software development, satisfactory software products, and happy end-users.
First, we propose to generate the guidance in the form of rule-based explanations, since our recent work [21] found that decision trees/rules are the most preferred representation of explanations by software practitioners, as they involve logical reasoning that practitioners are familiar with. Formally, a rule-based explanation (e) is an association rule e = {r = p ⇒ q} that describes the association between p, a Boolean condition on feature values (i.e., the antecedent, left-hand side, LHS), and q, the consequence (i.e., the consequent, right-hand side, RHS), for the decision value y = f′(x). In this paper, we use an arrow (⇒) to describe the association between the Boolean condition (p) on the feature values of a file and the predictions (q) towards the {DEFECT, CLEAN} classes. Note that an association in general does not mean that there is a causal relationship.

Second, motivated by the limitations of Microsoft's Code Defect AI tool (see Figure 2), we hypothesize that the following four types of guidance (G), presented in the form of rule-based explanations, are beneficial to guide practitioners when developing SQA plans. Below, we present the definition, the motivation, and an example of each of the four types of guidance.

G1: Risky current practices that lead the defect model to predict a file as defective are needed to help practitioners understand what current practices are problematic. For example, an association rule of {LOC > 100} ⇒ DEFECT indicates that a file with LOC greater than 100 is associated with the predictions towards the defective class. Thus, practitioners should consider decreasing the LOC to less than 100, as this may likely decrease the risk of having defects.

Fig. 3: An overview of our study design and research questions.
G2: Non-risky current practices that lead the defect model to predict a file as clean are needed to help practitioners understand what current practices contribute towards a low risk of having defects. For example, an association rule of {Ownership > 0.8} ⇒ CLEAN indicates that a file with an ownership value greater than 0.8 is associated with the predictions towards the clean class. Thus, practitioners should consider maintaining or increasing the ownership value above 0.8 to potentially decrease the risk of having defects.
G3: Potential practices to avoid so as not to increase the risk of having defects are needed to help practitioners understand which currently unimplemented practices to avoid. For example, an association rule of {MinorDeveloper > 0} ⇒ DEFECT indicates that a file with a number of minor developers greater than 0 is associated with the predictions towards the defective class. Thus, practitioners should avoid increasing the number of minor developers above zero so as not to increase the risk of having defects.
G4: Potential practices to follow to decrease the risk of having defects are needed to help practitioners understand which practices to newly implement to decrease the risk of having defects. For example, an association rule of {RatioCommentToCode > 0.6} ⇒ CLEAN indicates that a file with a proportion of comments to code larger than 60% is associated with the predictions towards the clean class. Thus, practitioners should consider increasing the proportion of comments to code to greater than 60% to decrease the risk of having defects.
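Association rules like the ones above are typically scored with coverage, confidence, and lift, the three measures RQ3 later uses to evaluate SQAPlanner. The following sketch computes them for a G1-style rule over a hypothetical set of file records (the files, labels, and the threshold of 100 are illustrative, not data from the paper's case study).

```python
# Hypothetical file records: a code metric (LOC) and the known class label.
files = [
    {"LOC": 250, "label": "DEFECT"},
    {"LOC": 180, "label": "DEFECT"},
    {"LOC": 120, "label": "CLEAN"},
    {"LOC": 40,  "label": "CLEAN"},
    {"LOC": 60,  "label": "CLEAN"},
]

def antecedent(f):
    """p: the Boolean condition (LHS) of the rule {LOC > 100} => DEFECT."""
    return f["LOC"] > 100

CONSEQUENT = "DEFECT"  # q: the class (RHS) the rule is associated with

def rule_metrics(files, antecedent, consequent):
    n = len(files)
    covered = [f for f in files if antecedent(f)]
    correct = [f for f in covered if f["label"] == consequent]
    coverage = len(covered) / n               # P(p): fraction of files the rule applies to
    confidence = len(correct) / len(covered)  # P(q | p): how often the rule is right
    base_rate = sum(f["label"] == consequent for f in files) / n  # P(q)
    lift = confidence / base_rate             # how much p raises the odds of q over chance
    return coverage, confidence, lift

cov, conf, lift = rule_metrics(files, antecedent, CONSEQUENT)
print(cov, conf, lift)
```

A lift above 1 means the condition genuinely concentrates the consequent class, which is why a high-lift rule (the paper reports a median lift of 6.6 for SQAPlanner) is more informative than the class base rate alone.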
3 STUDY DESIGN AND RESEARCH QUESTIONS
In this paper, we aim to help practitioners make data-informed SQA plans by providing guidance on (1) what practitioners should do to decrease the risk of having defects and (2) what practitioners should avoid in order not to increase the risk of having defects, together with (3) a risk threshold, in the form of rule-based explanations for the predictions of defect prediction models. To achieve this aim, we design our case study according to the following objectives (see Figure 3):
Objective 1: Investigating the practitioners' perceptions of SQA planning and the proposed four types of guidance.
SQA planning activities are important in software development processes (e.g., to define initial software development policies), but often vary from organization to organization [11]. However, there exist no empirical studies that investigate how practitioners perceive the importance of SQA planning activities in their organization and what their key challenges are. Thus, we formulate the following research question:

• (RQ1) How do practitioners perceive SQA planning activities?

One of the most important SQA planning activities is to define development policies and their associated risk thresholds [12]. Such development policies will later be enforced for the whole team to ensure the highest quality of software systems (e.g., the maximum file size, the maximum code complexity, the minimum comment-to-code ratio, and the minimum degree of code ownership). Such policies are essential to improve software quality and software maintainability. Recently, Microsoft's Code Defect AI tool was released to the public, and the crux of this tool is defect prediction models. However, Figure 2 shows that the tool only indicates the importance scores of features generated by LIME, which are still far from actionable. That is, LIME only indicates which factors are the most important in supporting the predictions towards the defective (G1) and clean (G2) classes, but does not actually guide developers on what they should avoid (G3) and should do (G4) to decrease the risk of having defects. We hypothesize that our proposed four types of guidance, presented in the form of rule-based explanations, would be more actionable to guide practitioners when developing SQA plans. Thus, we formulate the following research question:

• (RQ2) How do practitioners perceive our proposed four types of guidance to support SQA planning?

Objective 2: Developing and Evaluating our AI-Driven SQAPlanner Approach.
To address the practitioners' challenges of SQA planning and the limitations of Microsoft's Code Defect AI tool, we propose SQAPlanner to help practitioners make data-informed decisions when developing SQA plans. First, SQAPlanner develops a defect prediction model to generate a prediction. Then, SQAPlanner generates a rule-based explanation of the prediction to provide actionable guidance. However, different local rule-based model-agnostic techniques for generating explanations are available in the eXplainable AI (XAI) domain (e.g., Anchor [38] and LORE [15]). Thus, it remains unclear whether our SQAPlanner outperforms the state-of-the-art rule-based model-agnostic techniques. Therefore, we conduct an empirical study to evaluate our approach and compare it with the baseline techniques. Thus, we formulate the following research questions.

• (RQ3) How effective are the rule-based explanations generated by our SQAPlanner approach when compared to the state-of-the-art approaches?
• (RQ4) How stable are the rule-based explanations generated by our SQAPlanner approach when they are regenerated?
• (RQ5) How applicable are the rule-based explanations generated by our SQAPlanner approach to minimizing the risk of having defects in the subsequent releases?

Objective 3: Developing the Visualization of SQAPlanner and Investigating the Practitioners' Perceptions. While the rule-based explanations of our SQAPlanner are designed to help practitioners understand the logic behind the predictions of defect models, such rule-based explanations may not be immediately actionable and easily understandable by practitioners. Thus, we develop a proof-of-concept by translating the rule-based explanations of the actionable guidance into human-understandable explanations.
The visualization of our SQAPlanner is designed to provide the following key information: (1) the list of guidance that practitioners should follow and should avoid; (2) the actual metric values of that file; and (3) the risk threshold and range values for practitioners to follow to mitigate the risk of having defects. Then, we conduct a post-validation qualitative survey with practitioners to evaluate their perceptions of the visualization of our SQAPlanner when compared to the existing visualization of Microsoft's Code Defect AI (see Figure 2). Thus, we formulate the following research questions:

• (RQ6) How do practitioners perceive the visualization of SQAPlanner when compared to the visualization of the state-of-the-art?
• (RQ7) How do practitioners perceive the actual guidance generated by our SQAPlanner?

4 PRACTITIONERS' PERCEPTIONS ON SQA PLANNING AND THE FOUR TYPES OF GUIDANCE
In this section, we aim to investigate the practitioners' perceptions of (1) the SQA planning activities (RQ1) and (2) the proposed four types of guidance to support SQA planning (RQ2). Below, we describe the approach and present the results.
TABLE 1: (RQ1 and RQ2) A summary of the agreement percentage, the disagreement percentage, and the agreement factor for the practitioners' perceptions of SQA planning activities and our proposed four types of guidance.
Statement                                                          %Agreement  %Disagreement  Agreement Factor

(RQ1) SQA planning activities:
  Perceived importance                                                 86%           6%            14.33
  Being used in practice                                               70%          10%             7.00
  Perceived time-consuming                                             66%          10%             6.60
  Perceived difficulty                                                 58%          24%             2.42

(RQ2) Perceived usefulness:
  G1: Risky current practices that lead the defect model
      to predict a file as defective                                   82%           6%            13.67
  G2: Non-risky current practices that lead the defect model
      to predict a file as clean                                       64%          10%             6.40
  G3: Potential practices to avoid to not increase the risk
      of having defects                                                52%          20%             2.60
  G4: Potential practices to follow to decrease the risk
      of having defects                                                80%           8%            10.00

(RQ2) Perceived importance (statements as above):
  G1                                                                   64%          10%             6.40
  G2                                                                   60%          10%             6.00
  G3                                                                   64%          24%             2.67
  G4                                                                   82%           6%            13.67

(RQ2) Willingness to adopt (statements as above):
  G1                                                                   74%          12%             6.17
  G2                                                                   66%          12%             5.50
  G3                                                                   52%          22%             2.36
  G4                                                                   72%          12%             6.00
To investigate practitioners' perceptions of SQA planning activities and their feedback on our proposed four types of data-driven guidance to support such activities, we conducted a survey study with 50 software practitioners. As suggested by Kitchenham and Pfleeger [24], we followed these steps when conducting our study: (1) design and develop the survey, (2) evaluate the survey, (3) recruit and select participants, (4) verify data, and (5) analyse data. We describe each step below.

(Step 1) Design and develop the survey.
We first devised the concept of data-driven software quality assurance (SQA) planning with respect to the four types of rules generated by our approach. We then set out to investigate practitioners' perceptions along four dimensions, i.e., perceived importance, being used in practice, perceived time consumption, and perceived difficulty. We designed our survey as a cross-sectional study in which participants provide their responses at one fixed point in time. The survey consists of 16 closed-ended questions and 4 open-ended questions. For closed-ended questions, we use agreement and evaluation ordinal scales. To mitigate any inconsistency in the interpretation of numeric ordinal scales, we labeled each level of the ordinal scales with words, as suggested by Krosnick [26] (e.g., strongly disagree, disagree, neutral, agree, and strongly agree). The survey takes the form of an online questionnaire hosted on Google Forms. When accessing the survey, each participant is provided with an explanatory statement that describes the purpose of the study, why the participant was chosen for this study, possible benefits and risks, and confidentiality. The survey takes approximately 15 minutes to complete and is anonymous.

(Step 2) Evaluate the survey.
We carefully evaluated the survey via pre-testing [28] to assess its reliability and validity. We revised the survey to identify and fix potential problems (e.g., missing, unnecessary, or ambiguous questions) until reaching a consensus. Finally, the survey was rigorously reviewed and approved by the Monash University Human Research Ethics Committee (MUHREC ID: 22542).

(Step 3) Recruit and select participants.
The target population of the survey is software practitioners. To reach the target population, we used the recruiting service provided by Amazon Mechanical Turk to recruit 50 participants as a representative subset of the target population. We used the participant filter options "Employment Industry - Software & IT Services" and "Job Function - Information Technology" to ensure that the recruited participants are valid samples representing the target population. We paid 6.4 USD as a monetary incentive to each participant [10, 40].
Fig. 4: (RQ1) The Likert scores of the practitioners' perceptions of SQA planning along four dimensions, i.e., importance, being used in practice, time-consuming, and difficulty.

(Step 4) Verify data.
To verify our survey response data, we manually read all of the open-question responses to check their completeness, i.e., whether all questions were appropriately answered. We excluded 11 responses that were missing or not related to the questions. In the end, we had a set of 989 responses. We summarize and present the results of the closed-ended responses on a Likert scale with stacked bar plots, while we discuss and provide examples of the open-ended responses.

(Step 5) Analyse data.
We manually analysed the responses to the open-ended questions to extract in-depth insights. For closed-ended questions, we summarise and present key statistical results. We compute the agreement and disagreement percentage of each closed-ended question. The agreement percentage of a statement is the percentage of respondents who strongly agree or agree with the statement (% strongly agree + % agree), while the disagreement percentage of a statement is the percentage of respondents who strongly disagree or disagree with the statement (% strongly disagree + % disagree). We also compute an agreement factor for each statement, as suggested by Wan et al. [51]. The agreement factor is a measure of agreement between respondents, calculated for each statement using the following equation: (% strongly agree + % agree)/(% strongly disagree + % disagree). High values of the agreement factor indicate a high agreement of respondents with a statement. An agreement factor of 1 indicates that the numbers of respondents who agree and disagree with a statement are equal. Finally, low values of the agreement factor indicate a high disagreement of respondents with a statement.
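The three statistics above can be computed as in this minimal sketch; the response list is illustrative (it happens to reproduce the first row of Table 1), not actual survey data.

```python
def agreement_stats(responses):
    """Agreement %, disagreement %, and the agreement factor (Wan et al. [51])
    for one closed-ended statement."""
    n = len(responses)
    agree = sum(r in ("agree", "strongly agree") for r in responses) / n
    disagree = sum(r in ("disagree", "strongly disagree") for r in responses) / n
    # Agreement factor = (% strongly agree + % agree) / (% strongly disagree + % disagree)
    factor = agree / disagree if disagree > 0 else float("inf")
    return round(agree * 100), round(disagree * 100), round(factor, 2)

# Illustrative: 43 of 50 respondents agree, 3 disagree, 4 are neutral
responses = (["strongly agree"] * 20 + ["agree"] * 23
             + ["neutral"] * 4 + ["disagree"] * 3)
print(agreement_stats(responses))  # (86, 6, 14.33)
```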
The demographics of our 50 practitioner survey respondents are as follows:
• Country of Birth: India (58%) and US (36%)
• Roles: developers (50%), managers (42%), and others (8%)
• Years of Professional Experience: less than 5 years (26%), 6–10 years (38%), 11–15 years (22%), 16–20 years (12%), and more than 25 years (2%)
• Programming Language: Java (44%), Python (30%), C/C++/C
• Use of Static Analysis Tools: Yes (62%) and No (38%)
These demographics indicate that the responses are collected from practitioners who reside in various countries, have a range of roles, varied years of experience, and varied programming language expertise. This indicates that our findings are likely not bound to specific characteristics of practitioners.

Fig. 5: (RQ2) The Likert scores of the perceived usefulness, the perceived importance, and the willingness to adopt of the respondents for each proposed type of guidance.
Results. For SQA planning activities, 86% of the respondents perceive them as important, and 70% perceive them as being used in practice. However, 66% perceive them as time-consuming and 58% as difficult.
Figure 4 shows the distributions of the Likert scores of the practitioners' perceptions of SQA planning activities. The survey results show that SQA planning activities are perceived as important by 86% of the respondents, and as being used in practice by 70% of the respondents. However, they are perceived as time-consuming by 66% of the respondents and as difficult by 58% of the respondents. Table 1 also shows that the agreement factors of all studied dimensions of SQA planning activities are above 1, with values of 2.42–14.33. This indicates that most respondents agree (with very few respondents who disagree) that SQA planning activities are important, used in practice, time-consuming, and difficult.

Respondents described that some of the SQA planning activities in their organisations involve human heuristics in decision-making. For example, they used documentation and review checklists [7] (e.g., R34: "Lessons learnt from projects are documented and common mistakes are included in review checklists to ensure that they are not repeated."), and team meetings (e.g., R10: "team meetings, brainstorm, and in house system", and R48: "... through step by step manual processes working together in a core team"). These findings indicate that a data-informed SQA planning tool is needed to help QA teams make better data-informed decisions and policies.

(RQ2) How do practitioners perceive our proposed four types of guidance to support SQA planning?
Results. Both (G1) the guidance on risky practices that lead a model to predict a file as defective and (G4) the guidance on practices to follow to decrease the risk of having defects are perceived by the respondents as among the most useful, the most important, and the most likely to be adopted.
Figure 5 shows the Likert scores of the practitioners' perceptions of the proposed four types of guidance in terms of perceived usefulness, perceived importance, and willingness to adopt. The survey results show that all types of guidance are perceived as useful by 52%–80% of the respondents, important by 60%–82% of the respondents, and considered for adoption by 52%–72% of the respondents. Similar to RQ1, we observe that the agreement factors for all of the proposed types of guidance are higher than 1 for all of the studied dimensions. This suggests that most respondents agree (with very few who disagree) that all four types of guidance are useful and important, and that respondents are willing to adopt them.

Respondents provided positive feedback on our proposed four types of guidance, since these types of guidance can help with SQA planning (e.g., R37: "It allows the QA team who might not necessarily know the changes that have gone into each program to focus their energy on the most risky components, programs, or functionalities. It also gives managers a great view of the risks involved and how it could potentially be reduced or mitigated."). However, some respondents raised critical concerns about the potential negative impact of these four types of guidance on the development process, for example, the cost of implementation and internal resistance (e.g., R27: "Some extra time spent improving the process. Needing to implement the process including training. Employee resistance to adoption."), and lax development practice (e.g., R30: "Sometimes we get too reliant on the automated processes and other things slip through ...").

OUR AI-DRIVEN SQAPLANNER APPROACH
Our SQAPlanner consists of two major phases: (1) developing defect prediction models; and (2) generating four types of guidance using a local rule-based model-agnostic technique to explain the predictions of the defect models. Figure 6 presents an overview of the workflow of our SQAPlanner approach.
There is a plethora of classification techniques that have been used to develop defect prediction models [13, 17, 46]. We first select the following five classification techniques: Decision Trees (DT), Logistic Regression (LR), multi-layer Neural Network (NN), Random Forest (RF), and Support Vector Machine (SVM). These classification techniques are popularly used in defect prediction studies. Since the performance of defect prediction models may vary depending on the studied datasets, we first conduct a preliminary analysis to identify the most accurate classification techniques for our study. We use the implementations of the selected five classification techniques provided by the scikit-learn Python package. For each training dataset, we build defect prediction models using all of the 65 software metrics (see Table 3 and Table 4). To ensure that our experiment is strictly controlled and fair across the studied classification techniques, we use the default settings of the classification techniques provided by the scikit-learn Python package, do not apply feature selection techniques, and do not apply class rebalancing techniques. This setting ensures that the results are not bound to (i.e., not sensitive to) the randomization of the non-deterministic optimization algorithms [48], feature selection algorithms [22], and class rebalancing algorithms [43]. Then, we evaluate the performance of each classification technique using the testing datasets, and measure the predictive ability of the defect models using the Area Under the Receiver Operating Characteristic Curve (AUROC or AUC). AUC measures the ability to distinguish defective and clean files. The values of AUC range from 0 to 1.
An AUC value of 0 is considered the worst performance, an AUC value of 0.5 is equivalent to random guessing, and an AUC value of 1 is considered the best performance [18]. Then, we use the Non-Parametric Scott-Knott ESD test (Version 3.0) to find the classification techniques that perform best across our studied datasets. We chose
the Non-Parametric Scott-Knott ESD test, since it does not produce overlapping groups like other post-hoc tests (e.g., Nemenyi's test) [47] and it does not require the assumptions of normal distributions, homogeneous distributions, and a minimum sample size. The Non-Parametric Scott-Knott ESD test is a multiple comparison approach that leverages hierarchical clustering to partition the set of median values of techniques (e.g., medians of variable importance scores, medians of model performance) into statistically distinct groups with a non-negligible difference. The mechanism of the Non-Parametric Scott-Knott ESD test consists of two steps: (Step 1) Find a partition that maximizes the median of each distribution between groups using the non-parametric Kruskal-Wallis test with Chi-square statistics. (Step 2) Split the distributions into two groups or merge them into one group using the non-parametric Cliff's |δ| effect size. The implementation of the Non-Parametric Scott-Knott ESD test is available in the ScottKnottESD R package (Version 3.0).

Random Forest is the most accurate studied classification technique, with a median AUC value of 0.77.

Fig. 6: An overview diagram of our SQAPlanner to generate four types of guidance in the form of rule-based explanations for each file.
Figure 7 presents the Scott-Knott ESD ranking of the studied classification techniques with the distribution of the AUC values. We find that the other classification techniques achieve median AUC values of 0.74, 0.63, 0.65, and 0.59 for SVM, DT, NN, and LR, respectively. Finally, the Scott-Knott ESD test confirms that random forest statistically outperforms the other classification techniques. For the rest of the paper, we focus on the random forest models for the following reasons:
• Random Forest is one of the most accurate studied classification techniques for our case study and is less sensitive to parameter settings [46, 48];
• Random Forest is a classification technique that is to a certain degree explainable with its own built-in
4. http://github.com/klainfo/ScottKnottESD
feature importance techniques (e.g., gini importance and permutation importance) [5, 21–23]. Since SVM does not have its own built-in feature importance techniques, we excluded SVM from our analysis; and
• Random Forest is a classification technique that is robust to overfitting [46], outliers [43], and class mislabelling [45].

Fig. 7: The Non-Parametric Scott-Knott ESD ranking of the studied classification techniques with the distribution of the AUC values.
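Under the settings described above (default hyperparameters, no feature selection, no class rebalancing), Phase 1 can be sketched as follows. The synthetic dataset merely stands in for a real 65-metric defect dataset; the sizes and seeds are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for a defect dataset: 65 metrics, imbalanced defective/clean labels
X, y = make_classification(n_samples=600, n_features=65,
                           weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

for clf in (RandomForestClassifier(random_state=0),
            LogisticRegression(max_iter=1000)):
    clf.fit(X_train, y_train)  # default settings, no tuning
    # AUC: the ability to rank defective files above clean ones (0.5 = random)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(type(clf).__name__, round(auc, 2))
```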
There are five major steps for generating the four types of guidance using a local rule-based model-agnostic technique. First, for each instance to be explained (i_explain), we select the nearest instances surrounding the instance to be explained from the training set (I_nearest), cf. Line 1. Second, we generate synthetic instances (I_synthetic) around the neighbourhood of each instance to be explained, cf. Line 2. Then, we create a set of combined instances I_combined = I_nearest ∪ I_synthetic, cf. Line 3, which is a combination of the nearest instances and the synthetic instances. Third, we use the global defect prediction models to generate the predictions of the combined instances (i.e., P_{I_combined}), cf. Line 4. Fourth, to learn the associations between the synthetic features and the predictions of the global defect prediction models, we use the Magnum Opus association rule learning algorithm [53] to generate a set of optimal association rules that are the most predictive (i.e., rules with the highest confidence) and the most interesting (i.e., rules with the highest lift) from the combined instances and their predictions, cf. Line 5. Finally, we classify the set of association rules into four types of rule-based guidance with respect to a contingency table of such association rules, and identify the best rule for each type of guidance, cf. Line 6. Below, we explain each major step in detail.
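As a compact preview of these steps, the sketch below implements the neighbourhood selection, crossover-style synthetic generation, and model labelling (Lines 1–4), plus the contingency-table classification of Line 6. The Magnum Opus rule mining of Line 5 is a commercial service and is omitted; all names (`explain_instance`, `StubModel`, the kernel) are illustrative, not the authors' implementation.

```python
import numpy as np

def explain_instance(global_model, X_train, x_explain,
                     n_top=10, n_synthetic=200, seed=0):
    """Lines 1-4 of Algorithm 1 (illustrative): select neighbours, expand
    them with crossover-style synthetic instances, and label everything
    with the global model's predictions."""
    rng = np.random.default_rng(seed)
    # Line 1: similarity = exponential kernel over Euclidean distance
    sim = np.exp(-np.linalg.norm(X_train - x_explain, axis=1))
    nearest = X_train[np.argsort(-sim)[:n_top]]
    # Line 2: crossover -- interpolate between two random parents
    i, j = rng.integers(len(nearest), size=(2, n_synthetic))
    alpha = rng.random((n_synthetic, 1))
    synthetic = nearest[i] + (nearest[j] - nearest[i]) * alpha
    # Line 3: combine the nearest and the synthetic instances
    combined = np.vstack([nearest, synthetic])
    # Line 4: predictions of the global model over the neighbourhood
    return combined, global_model.predict(combined)

def rule_type(lhs_true, rhs_matches_prediction):
    # Line 6: contingency table of LHS truth vs. RHS agreement (G1-G4)
    return {(True, True): "G1 (supporting)",
            (True, False): "G2 (contradicting)",
            (False, True): "G3 (hypothetical supporting)",
            (False, False): "G4 (counterfactual)"}[(lhs_true,
                                                    rhs_matches_prediction)]

class StubModel:  # stand-in for a fitted global defect model
    def predict(self, X):
        return (X[:, 0] > 0).astype(int)

X_train = np.random.default_rng(1).normal(size=(100, 5))
combined, preds = explain_instance(StubModel(), X_train, X_train[0])
print(combined.shape, rule_type(True, True))
```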
Phase 2-1: Select the nearest instances surrounding an instance to be explained
We assume that instances from the neighbourhood of the instance to be explained have approximately equivalent
Fig. 8: An approach to select instances around the neighbourhood.

characteristics to the instance to be explained. Figure 8 presents an overview of the steps to select the nearest instances from the neighbourhood of the instance to be explained. In particular, there are three steps as follows:

(Step 1) – Normalize feature values.
Different features may have different units, and thus their value ranges may vary greatly, for example, LOC (e.g., 100 lines of code) and Ownership (e.g., an ownership score of 0.5). Thus, we first apply a Z-score normalization to each feature in the defect datasets.

(Step 2) – Compute the similarity scores of instances in the training data.
To do so, we first compute the Euclidean distance between the instances in the training data (Tr_x) and the instance to be explained (i_e). Then, we apply an exponential kernel function to convert such Euclidean distances into similarity scores, which makes the distances more linearly distributed.

(Step 3) – Select the smallest number of the most similar instances using the top-N instances of each class.

To do so, we first sort the similarity scores of the instances (sim) in descending order for each class. Then, we select the top N instances of each class from the sorted similarity scores. The lowest similarity score among the top N instances of both classes (i.e., Min(sim_{True,Nth}, sim_{False,Nth})) is used as a threshold to select the minimum number of the most similar instances, i.e., it determines the boundary of the neighbourhood. For example, given N = 10, suppose the lowest similarity scores of the top-10 instances of the DEFECT and CLEAN classes are 0.8 and 0.9, respectively. Then, the similarity score of 0.8 (the 10th instance from the DEFECT class) is used to determine the boundary of the neighbourhood, and the selected instances are those with similarity scores of at least 0.8 (i.e., sim ≥ 0.8).

Phase 2-2: Generate synthetic instances to expand the neighbourhood
The number of selected nearest instances in the neighbourhood may not be enough to accurately learn the behaviour of the instance to be explained. Thus, we generate synthetic instances to expand the neighbourhood. To do so, we use the crossover (or interpolation) technique and the mutation technique to generate new
Algorithm 1: A Local Rule-based Model Interpretability with k-optimal Associations
Input:
  Tr_x − training instances without the target (class label)
  Tr_y − target (class label) of the training instances
  i_explain − an instance to be explained
  M − a global defect prediction model
  N_features −
  N_synthetic −
Output: G_{i_explain} − four types of rule-based guidance for the instance to explain, i_explain
1: I_nearest ← SelectFromNeighbourhood(Tr_x, i_explain)
2: I_synthetic ← GenerateFromNeighbourhood(I_nearest, N_features, N_synthetic, i_explain)
3: I_combined ← I_nearest ∪ I_synthetic
4: P_{I_combined} ← GetPredictFromGlobalModel(I_combined, M)
5: R_{i_explain} ← GenerateMagnumOpusRules(I_combined, P_{I_combined})
6: G_{i_explain} ← GenerateRuleGuidance(R_{i_explain}, i_explain, P_{i_explain})
7: return G_{i_explain}

synthetic instances while ensuring that the majority of such synthetic instances are within the neighbourhood of the instance to be explained. Below, we describe how we generate synthetic instances using the crossover and the mutation techniques in detail.

Generate synthetic instances using the crossover technique.
To do so, we randomly select two different instances from the neighbourhood of the instance to be explained. Then, we generate the synthetic instances based on the crossover technique using the following equation:
I_crossover = x + (y − x) × α   (1)
where x and y are random parent instances from the training set, and α is a randomly generated number between 0 and 1.

Generate synthetic instances using the mutation technique.
To do so, we randomly select three different instances from the neighbourhood of the instance to be explained. Then, we generate synthetic instances based on the mutation technique [42] using the following equation:
I_mutation = x + (y − z) × µ   (2)
where x, y, and z are random parent instances from the training set, and µ is a randomly generated number between 0.5 and 1.0.

Phase 2-3: Generate the predictions of the nearest instances and the synthetic instances from the defect prediction models
First, we denote the set of the nearest instances (generated in Phase 2-1) and the synthetic instances (generated in Phase 2-2) as the combined instances I_combined, where I_combined = I_nearest ∪ I_synthetic. Then, we generate the predictions of the combined instances in the neighbourhood (i.e., Prediction_{I_nearest ∪ I_synthetic}) from the defect prediction models to learn the behaviour and the logic of such defect prediction models.

Phase 2-4: Generate association rules using Magnum Opus association rule mining
The Magnum Opus association rule mining algorithm performs statistically sound association rule mining by combining k-optimal association discovery techniques [54] with the OPUS search algorithm [53] to find the k most interesting associations according to a defined criterion (e.g., lift, confidence, coverage). The effectiveness of our SQAPlanner relies on this algorithm to generate the rule-based explanations. Using the OPUS search algorithm, it effectively prunes the search space by discarding associations that are likely to be spurious, and removes false positives by performing Fisher's exact hypothesis test. We use the implementation of the k-optimal association rule mining technique provided by the BigML platform.

Phase 2-5: Generate four types of rule-based guidance
Finally, we classify the optimal set of association rules identified by Magnum Opus into four categories with respect to a contingency table of the LHS and RHS of the association rules. Then, we identify the best rule, i.e., the most predictive and the most interesting, for each type of guidance as the output of SQAPlanner. To better illustrate how we classify the output rules generated by Magnum Opus, we use four example association rules. Given an instance to explain, i_explain, that has 200 lines of code (LOC = 200) and is predicted as DEFECT by the global defect prediction model, our SQAPlanner framework generates the following four types of rule-based explanations:
G1: Risky current practices that lead the defect model to predict a file as defective.
Technical Name. Supporting Rules (ℜ+).
Definition. If LHS = true, then RHS = true.
Example. {LOC > 150} ⇒ DEFECT
Interpretation. This example is a supporting rule, since (1) the antecedent (LHS) of the rule holds true, as the actual LOC of i_explain (i.e., 200) is higher than 150, and (2) the consequent (RHS) of the rule holds true, as the prediction of i_explain generated by the global defect prediction model is DEFECT.

G2: Non-risky current practices that lead the defect model to predict a file as clean.
Technical Name. Contradicting Rules (ℜ−).
Definition. If LHS = true, then RHS = false.
Example. {LOC < 500} ⇒ CLEAN
Interpretation. This example is a contradicting rule, since (1) the antecedent (LHS) of the rule holds true, as the actual LOC of i_explain (i.e., 200) is lower than 500, yet (2) the consequent (RHS) of the rule does not hold true, as the prediction of i_explain generated by the global defect prediction model is DEFECT.
5. https://bigml.com/
G3: Potential practices to avoid to not increase the risk of having defects.
Technical Name. Hypothetical Supporting Rules (ℜH+).
Definition. If LHS = false, then RHS = true.
Example. {LOC > 300} ⇒ DEFECT
Interpretation. This example is a hypothetical supporting rule, since (1) the antecedent (LHS) of the rule does not hold true, as the actual LOC of i_explain (i.e., 200) is not higher than 300, yet (2) the consequent (RHS) of the rule holds true, as the prediction of i_explain generated by the global defect prediction model is DEFECT.

G4: Potential practices to follow to decrease the risk of having defects.
Technical Name. Hypothetical Contradicting Rules or Counterfactual Rules (ℜH−).
Definition. If LHS = false, then RHS = false.
Example. {LOC < 100} ⇒ CLEAN
Interpretation. This example is a hypothetical contradicting rule, since (1) the antecedent (LHS) of the rule does not hold true, as the actual LOC of i_explain (i.e., 200) is not lower than 100, and (2) the consequent (RHS) of the rule does not hold true, as the prediction of i_explain generated by the global defect prediction model is DEFECT.

EXPERIMENTAL DESIGN AND RESULTS
In this section, we aim to investigate (RQ3) the effectiveness, (RQ4) the stability, and (RQ5) the applicability of the rule-based explanations generated by our SQAPlanner. Below, we describe the studied projects and the experimental design, and present the results.
To select suitable projects, we identified three important criteria that need to be satisfied:
• Criterion 1 — Publicly-available defect datasets: To support verifiability and foster replicability of our study, we choose to train our defect prediction models using publicly available defect datasets.
• Criterion 2 — Multiple releases: The central hypothesis of our approach is that the guidance derived from past knowledge (a release k − 1) can be used to explain the predictions of defective files in the target release (a release k) and be applicable to prevent software defects in future releases (a release k + 1). Thus, we need multiple releases for each software project to validate our hypothesis.
• Criterion 3 — Labels of defective files are based on actual affected releases: Prior work raises concerns that the approximation of the post-release window periods (e.g., 6 months) that are popularly used in many defect datasets may introduce bias to the construct validity of our results [56]. Instead of relying on traditional post-release window periods, we choose to use defect datasets that are labeled

TABLE 2: A statistical summary of the studied systems.
TABLE 3: A summary of the studied code metrics.
Granularity | Metrics | Count
File | AvgCyclomatic, AvgCyclomaticModified, AvgCyclomaticStrict, AvgEssential, AvgLine, AvgLineBlank, AvgLineCode, AvgLineComment, CountDeclClass, CountDeclClassMethod, CountDeclClassVariable, CountDeclFunction, CountDeclInstanceMethod, CountDeclInstanceVariable, CountDeclMethod, CountDeclMethodDefault, CountDeclMethodPrivate, CountDeclMethodProtected, CountDeclMethodPublic, CountLine, CountLineBlank, CountLineCode, CountLineCodeDecl, CountLineCodeExe, CountLineComment, CountSemicolon, CountStmt, CountStmtDecl, CountStmtExe, MaxCyclomatic, MaxCyclomaticModified, MaxCyclomaticStrict, RatioCommentToCode, SumCyclomatic, SumCyclomaticModified, SumCyclomaticStrict, SumEssential | 37
Class | CountClassBase, CountClassCoupled, CountClassDerived, MaxInheritanceTree, PercentLackOfCohesion | 5
Method | CountInput_{Min, Mean, Max}, CountOutput_{Min, Mean, Max}, CountPath_{Min, Mean, Max}, MaxNesting_{Min, Mean, Max} | 12
TABLE 4: A summary of the studied process and ownership metrics.

Metrics | Description
Process Metrics
COMM | The number of Git commits
ADDED_LINES | The normalized number of lines added to the module
DEL_LINES | The normalized number of lines deleted from the module
ADEV | The number of active developers
DDEV | The number of distinct developers
Ownership Metrics
MINOR_COMMIT | The number of unique developers who have contributed less than 5% of the total code changes (i.e., Git commits) on the module
MINOR_LINE | The number of unique developers who have contributed less than 5% of the total lines of code on the module
MAJOR_COMMIT | The number of unique developers who have contributed more than 5% of the total code changes (i.e., Git commits) on the module
MAJOR_LINE | The number of unique developers who have contributed more than 5% of the total lines of code on the module
OWN_COMMIT | The proportion of code changes (i.e., Git commits) made by the developer who has the highest contribution of code changes on the module
OWN_LINE | The proportion of lines of code written by the developer who has the highest contribution of lines of code on the module

based on the actual affected releases, as suggested by recent studies [8, 56]. Thus, we finally selected a corpus of publicly available defect datasets provided by Yatish et al. [56], where the ground truths are labeled based on the affected releases. These datasets consist of 32 releases that span 9 open-source, real-world, non-trivial software systems. Table 2 shows a statistical summary of the studied datasets. Each dataset has 65 software metrics along 3 dimensions, i.e., 54 code metrics, 5 process metrics, and 6 human metrics. Table 3 shows a summary of the static code metrics, while Table 4 shows a summary of the process and human metrics. The full details of the data collection process are available at Yatish et al. [56].
We hypothesize that the guidance derived from past knowledge (a release k − 1) can be used to explain the predictions of defective files in the target release (a release k) and be applicable to prevent software defects in future releases (a release k + 1). Thus, we evaluate our approach (see Figure 9) using sets of three consecutive releases (k − 1, k, and k + 1) for training, testing, and explanation evaluation, respectively. We first train our defect models using the random forest classification technique on a training release (i.e., a release k − 1). Then, we generate rule-based explanations for each file in the testing release (i.e., a release k). Finally, we evaluate the applicability of the rule-based explanations with the explanation evaluation release (i.e., a release k + 1). Taking the ActiveMQ project as an example, we first use release 5.0.0 for training, release 5.1.0 for testing, and release 5.2.0 for explanation evaluation. We repeat the experiment similarly for the other consecutive releases (i.e., {5.1.0, 5.2.0, 5.3.0}, {5.2.0, 5.3.0, 5.8.0}) and for the other projects.

Motivation. Our SQAPlanner is based on the assumption that our rule-based explanations are generated based on the approximation of the characteristics of files that are similar to the file to be explained. This assumption is similar to those of many local rule-based model-agnostic techniques [15, 37, 38], namely that the behaviour of the instance to be explained is similar to the behaviours of the instances around its neighbourhood. According to the definition of rule-based explanations in Section 2.4,
our SQAPlanner-generated rule-based explanations will be considered effective if such rule-based explanations achieve a high coverage and a high confidence value.

Fig. 9: An evaluation framework of our SQAPlanner approach.
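The rolling three-release evaluation described above can be sketched as a sliding window over consecutive releases; the release list is the ActiveMQ example named in the text.

```python
# Slide a window of three consecutive releases: train on release k-1,
# test on release k, evaluate the generated explanations on release k+1.
releases = ["5.0.0", "5.1.0", "5.2.0", "5.3.0", "5.8.0"]  # ActiveMQ (from the text)

windows = [tuple(releases[i:i + 3]) for i in range(len(releases) - 2)]
for train_rel, test_rel, eval_rel in windows:
    print(f"train={train_rel}  test={test_rel}  explanation-eval={eval_rel}")
```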
Approach. To address RQ3, we evaluate the rule-based explanations generated by our SQAPlanner using the traditional association rule evaluation measures (i.e., coverage, confidence, and lift).
Coverage measures the support of the antecedent of an association rule, i.e., the percentage of files that satisfy the rule conditions. Formally, Coverage(p → q) = Support(p), where Support(p) is the proportion of files that fulfill p:
Support(p) = |{files ∈ Dataset such that the files fulfill p}| / |Dataset|
For example, a rule-based explanation (G1) of {DEV > 10} ⇒ DEFECT with a coverage value of 0.9 indicates that 90% of the files fulfill the risky practice of having more than ten developers who touch a file. A high coverage value of the G1 guidance indicates that such a risky practice is common to many files in the dataset.
Confidence (i.e., precision or strength) measures the percentage of files that fulfill the antecedent and consequent together over the number of files that fulfill only the antecedent, which can be defined as follows: Confidence(p → q) = Support(p → q) / Support(p). For example, a rule-based explanation (G1) of {DEV > 10} ⇒ DEFECT with a confidence value of 0.8 indicates that 80% of the files that fulfill the risky practice of having more than ten developers who touch a file are actually defective. A high confidence value of the G1 guidance indicates that files fulfilling such a risky practice are highly likely to be defective.
Lift measures how many times more often the antecedent and consequent occur together compared to what would be expected if they (i.e., both antecedent and consequent) were statistically independent, which can be defined as follows: Lift(p → q) = Support(p → q) / (Support(p) × Support(q)).

Fig. 10: (RQ3) The distribution of the evaluation measures of our rule-based explanations when compared to baseline approaches (i.e., LORE and Anchor).

For example, a rule-based explanation (G1) of {DEV > 10} ⇒ DEFECT with a lift value of 5 indicates that the file will be 5 times (i.e., 500%) more likely to be defective if the rule is fulfilled. A lift value greater than one means that a file is likely to be defective if the conditions are fulfilled, while a lift value less than one means a file is unlikely to be defective if the conditions are fulfilled. A high lift value of the G1 guidance indicates that there is a high chance that a file is defective if such a risky practice is fulfilled. Thus, practitioners should pay attention to guidance rules with a high lift value.
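To make the three measures concrete, the following minimal sketch (our illustration, not the paper's implementation) computes coverage, confidence, and lift for the G1 rule on a toy dataset:

```python
# A file is modelled as a dict of metric values plus a "defect" label;
# the rule antecedent and consequent are predicate functions.

def coverage(files, antecedent):
    """Coverage(p -> q) = Support(p): fraction of files fulfilling p."""
    return sum(antecedent(f) for f in files) / len(files)

def confidence(files, antecedent, consequent):
    """Confidence(p -> q) = Support(p and q) / Support(p)."""
    support_p = sum(antecedent(f) for f in files)
    support_pq = sum(antecedent(f) and consequent(f) for f in files)
    return support_pq / support_p if support_p else 0.0

def lift(files, antecedent, consequent):
    """Lift(p -> q) = Support(p and q) / (Support(p) * Support(q))."""
    n = len(files)
    p_frac = sum(antecedent(f) for f in files) / n
    q_frac = sum(consequent(f) for f in files) / n
    pq_frac = sum(antecedent(f) and consequent(f) for f in files) / n
    return pq_frac / (p_frac * q_frac) if p_frac and q_frac else 0.0

# The G1 rule {DEV > 10} => DEFECT on a toy dataset of four files:
files = [
    {"DEV": 12, "defect": True},
    {"DEV": 15, "defect": True},
    {"DEV": 11, "defect": False},
    {"DEV": 3,  "defect": False},
]
p = lambda f: f["DEV"] > 10
q = lambda f: f["defect"]
print(coverage(files, p), confidence(files, p, q), lift(files, p, q))
# 0.75, 2/3 (~0.667), and 4/3 (~1.333) respectively
```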
Baseline comparison.
We compare our SQAPlanner with the two state-of-the-art local rule-based model-agnostic techniques, i.e., Anchor [38] and LORE [15].
Anchor, an extension of LIME [37], was proposed by Ribeiro et al. [38]. The key idea of Anchor is to select if-then rules (so-called anchors) that have high confidence, such that features that are not included in the rules do not affect the prediction outcome if their feature values are changed. In particular, Anchor selects only rules with a minimum confidence of 95%, and then selects the rule with the highest coverage if multiple rules have the same confidence value.
LORE was proposed by Guidotti et al. [15]. For each instance to be explained, LORE generates files around the neighbourhood using a genetic algorithm. LORE then obtains predictions of the generated files from the global defect models to learn the behaviour and the logic of the defect models. Finally, a decision tree is built on the defined neighbourhood of the instance to be explained and is later converted to rules.
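To make the local-surrogate idea concrete, the following toy sketch mimics a LORE-style explanation under heavy simplification (our assumptions: Gaussian perturbation instead of LORE's genetic neighbourhood generation, and a single-threshold surrogate instead of a decision tree):

```python
import random

def black_box(loc):
    """Stand-in global defect model: large files are predicted defective."""
    return "Bug" if loc > 1000 else "Clean"

def explain_locally(instance_loc, n_neighbours=200, seed=0):
    rng = random.Random(seed)
    # 1. Generate synthetic neighbours around the instance to be explained.
    neighbours = [instance_loc + rng.gauss(0, 200) for _ in range(n_neighbours)]
    # 2. Label the neighbours with the global (black-box) model.
    labelled = [(x, black_box(x)) for x in neighbours]
    # 3. Fit a one-split surrogate: pick the threshold that best separates labels.
    best_t, best_acc = None, -1.0
    for t, _ in labelled:
        acc = sum((x > t) == (y == "Bug") for x, y in labelled) / len(labelled)
        if acc > best_acc:
            best_t, best_acc = t, acc
    # 4. Read an if-then rule for this instance off the surrogate split.
    side = ">" if instance_loc > best_t else "<="
    return f"LOC {side} {best_t:.0f} => {black_box(instance_loc)}"

print(explain_locally(1200))  # a rule of the form "LOC > ... => Bug"
```

The surrogate recovers a threshold close to the black-box model's true decision boundary (1000 in this toy setup), illustrating how a local rule approximates the global model's behaviour in the instance's neighbourhood.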
Results. Figure 10 presents the results for coverage, confidence, and lift of the local rule-based model-agnostic techniques.

(Coverage) At the median, 89% of files are supported by the rule-based explanations, suggesting that our SQAPlanner outperforms the LORE and Anchor local rule-based model-agnostic techniques. Figure 10 shows that the median coverage is 89%, 34%, and 6% for our SQAPlanner, LORE, and Anchor, respectively. We suspect that the high coverage values achieved by our SQAPlanner are due to the flexibility of the k-optimal search, which allows us to search particularly for rules with high coverage. High coverage is important as it measures how representative a rule is of a given dataset, so our results suggest that our SQAPlanner achieves the most representative rules.

(Confidence) At the median, 99% of the files that fulfill the antecedent of a rule-based explanation also fulfill its consequent, which outperforms the LORE and Anchor model-agnostic techniques. Figure 10 shows the distributions of the confidence for our SQAPlanner, LORE, and Anchor. We find that LORE and Anchor achieve high confidence, with median confidence values of 95% and 98%, respectively. We find that the comparable confidence values achieved by LORE and Anchor have to do with their main optimization goal, since both techniques aim to search for rules with the highest confidence. Nevertheless, we find that our SQAPlanner achieves the highest median confidence of 99%.

(Lift) The rule-based explanations generated by our SQAPlanner achieve a median lift value of 6.6, which outperforms the LORE and Anchor model-agnostic techniques. Figure 10 shows that the median lift is 6.6, 5.2, and 0.98 for our SQAPlanner, LORE, and Anchor, respectively. The highest lift value of 6.6 indicates that files will be 6.6 times (i.e., 660%) more likely to be defective if the rule is matched. Similarly, the highest lift value of our SQAPlanner can be attributed to the flexibility of the k-optimal search, which allows us to search particularly for rules with the highest lift. On the other hand, Anchor achieves a lower lift score, since Anchor constructs the neighbourhood in a way that it contains only files of the same class as the instance in consideration. Thus, the lift scores for Anchor under these circumstances are equal to the confidence values.

(RQ4) How stable are the rule-based explanations generated by our SQAPlanner approach when they are regenerated?

Motivation. Our SQAPlanner approach and the two state-of-the-art local rule-based model-agnostic techniques (i.e., LORE and Anchor) involve random data generation when generating synthetic instances around the neighbourhood. As such, the randomization bias may produce different rule-based explanations when the approaches are re-executed. Thus, we aim to investigate the consistency of the rule-based explanation of the same instance when these model-agnostic techniques are re-executed.
Fig. 11: (RQ4) The distribution of the Jaccard coefficients of the rule-based model-agnostic techniques.
Approach. To address RQ4, we repeat our experiment ten times to investigate the stability of our rules. Since the rules generated by the baseline comparison are optimized based on confidence only, we focus on the rules generated by our approach that are optimized for confidence as well. For each rule-based explanation of each file, we use the Jaccard coefficient to measure the consistency of the generated rule-based explanations. The Jaccard coefficient compares the common and the distinct features in two given sets (e.g., X and Y) using the following equation: J(X, Y) = |X ∩ Y| / |X ∪ Y|. The coefficient ranges from 0% to 100%; the higher the coefficient, the higher the similarity of rules over two independent runs.

Results. Our SQAPlanner approach produces the most consistent rule-based explanations when compared to LORE and Anchor.
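The Jaccard computation described in the Approach can be sketched as follows (a minimal illustration; the rule feature sets are made up for the example):

```python
# Stability check: the Jaccard coefficient between the feature sets of two
# rule-based explanations generated in two independent runs.

def jaccard(x, y):
    """J(X, Y) = |X intersect Y| / |X union Y|, in [0, 1]."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if x | y else 1.0

run1 = {"LOCDeclaration", "DistinctDeveloper", "Ownership"}
run2 = {"LOCDeclaration", "DistinctDeveloper", "MinorCommit"}
print(jaccard(run1, run2))  # 0.5: two shared features out of four distinct ones
```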
Figure 11 shows that our SQAPlanner achieves a median Jaccard coefficient of 0.92, while LORE and Anchor achieve median Jaccard coefficients of 0.42 and 0.79, respectively. In other words, for each prediction of an instance to be explained, our rule-based explanations are (at the median) 92% consistent with the rule-based explanations produced when re-executing our framework in multiple independent runs. In addition, our SQAPlanner's rule-based explanations are (at the median) 13% and 50% more consistent than the rule-based explanations generated by Anchor and LORE, respectively. We suspect that the highest consistency achieved by our approach is a result of the more robust nature of our framework when selecting similar instances from the training data and when generating synthetic instances around the neighbourhood (as described in Sections 5.2 and 7). In contrast, Anchor uses a bandit algorithm [25] to generate neighbours, while LORE uses a genetic algorithm to generate neighbours.

(RQ5) How applicable are the rule-based explanations generated by our SQAPlanner approach to minimize the risk of having defects in the subsequent releases?
Motivation. The central hypothesis of our approach is that the rule-based explanations derived from past knowledge (a release k−1) can be used to explain the predictions of defective files in a target release (a release k), and thus be applicable to guide SQA planning to prevent software defects in future releases (a release k+1). We want to investigate the proportion of files where the rule-based explanations are satisfied and not satisfied by the actual feature values in the subsequent release.

Fig. 12: (RQ5) An approach to evaluate the applicability of the hypothetical contradicting rules.

Approach. To address RQ5, we focus on the hypothetical contradicting rules, which are rules that guide what practices to follow to decrease the risk of having defects (i.e., whether the prediction of the same file could be reversed if the rule were followed in a subsequent release). We note that hypothetical contradicting rules cannot be generated for all files by Anchor and LORE: we find that LORE produces hypothetical contradicting rules for at most 69% of the files across projects (the median proportion per project is 41%), and Anchor by definition does not generate any hypothetical contradicting rules. Since our approach is the only one that can generate hypothetical contradicting rules, we focus only on our SQAPlanner approach. Figure 12 presents an approach to evaluate the applicability of the hypothetical contradicting rules. We analyze the applicability of the hypothetical contradicting rules along two perspectives:
RQ5-a: Are hypothetical contradicting rules applied when the prediction of an instance changes from defective in a testing release k to clean in a validation release k+1? Hypothetical contradicting rules are considered applicable if such rules follow the actual feature values in the validation data when the prediction of the instance changes from defective in k to clean in k+1. For example, A.java is predicted to be defective in the testing data (k) but predicted to be clean in the validation data (k+1). We consider the generated hypothetical contradicting rule (e.g., {LOC < 900} ⇒ CLEAN) to be correct if such a rule is in accordance with the actual feature values in the validation data (i.e., LOC = 850). In this example, the hypothetical contradicting rule suggests that developers reduce the lines of code to less than 900 to potentially reverse the decision of the defect models from defective to clean, which is consistent with the validation data (LOC = 850).

RQ5-b: Are hypothetical contradicting rules not applied when the prediction of an instance does not change from defective in a testing release k to clean in a validation release k+1? Hypothetical contradicting rules are considered applicable if such rules do not follow the actual feature values in the validation data when the prediction of the instance does not change from defective in k to clean in k+1. For example, B.java is predicted to be defective in both the testing data and the validation data. Thus, we consider the generated hypothetical contradicting rule (e.g., {LOC < 1,100} ⇒ CLEAN) to be applicable if such a rule does not follow the actual feature values in the validation data (i.e., LOC = 1,500).

For each perspective, we compute the number of instances where the hypothetical contradicting rule does follow (RQ5-a) and does not follow (RQ5-b) the actual feature values in the subsequent release. Figure 13 presents the proportion of files where the hypothetical contradicting rule does follow (RQ5-a) and does not follow (RQ5-b) the actual feature values in the subsequent release for each measure.
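The RQ5-a/RQ5-b applicability checks can be sketched as follows (a minimal illustration of our own, using the two example files from Figure 12; rules are assumed to have the form {metric < threshold} ⇒ CLEAN):

```python
def rule_applies(threshold, validation_value):
    """True if the actual feature value in release k+1 satisfies the rule."""
    return validation_value < threshold

def evaluate(instances):
    """instances: list of (threshold, validation_value, pred_k, pred_k1).
    RQ5-a: fraction of rules that DO apply when the prediction flips Bug -> Clean.
    RQ5-b: fraction of rules that do NOT apply when the prediction stays Bug."""
    rq5a_hits = rq5a_total = rq5b_hits = rq5b_total = 0
    for threshold, value, pred_k, pred_k1 in instances:
        if pred_k == "Bug" and pred_k1 == "Clean":
            rq5a_total += 1
            rq5a_hits += rule_applies(threshold, value)
        elif pred_k == "Bug" and pred_k1 == "Bug":
            rq5b_total += 1
            rq5b_hits += not rule_applies(threshold, value)
    return (rq5a_hits / rq5a_total if rq5a_total else 0.0,
            rq5b_hits / rq5b_total if rq5b_total else 0.0)

# The two files from Figure 12:
instances = [
    (900,  850,  "Bug", "Clean"),  # A.java: rule followed, prediction flipped
    (1100, 1500, "Bug", "Bug"),    # B.java: rule not followed, still defective
]
print(evaluate(instances))  # (1.0, 1.0)
```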
Results. For 55%-87% of the instances in the subsequent releases, our SQAPlanner's hypothetical contradicting rules are correctly applicable when the prediction changes from defective to clean. Figure 13 shows that 87%, 82%, and 55% of the instances in the subsequent releases have hypothetical contradicting rules that follow the actual feature values in the validation data with respect to coverage, confidence, and lift, respectively. This finding indicates that our SQAPlanner's hypothetical contradicting rules, learned from past knowledge (k−1) to explain the predictions of instances from the target release (k), could potentially reverse the predictions of the same instances in the subsequent release (k+1) from defective to clean.

Fig. 13: The percentage of the instances in the subsequent releases where our hypothetical contradicting rules (RQ5-a) follow the actual feature values in the validation data when the decision is changed and (RQ5-b) do not follow the actual feature values in the validation data when the decision is not changed, for each measure.

For 67%-81% of the instances in the subsequent releases, our hypothetical contradicting rules are correctly non-applicable when the prediction does not change. Figure 13 shows that 67%, 81%, and 71% of the instances in the subsequent releases have hypothetical contradicting rules that do not follow the actual feature values in the validation data when the prediction does not change, with respect to coverage, confidence, and lift, respectively. In other words, when files are still defective in the subsequent release, our hypothetical contradicting rules are still largely in agreement (i.e., our hypothetical contradicting rules are correctly non-applicable).
We conducted a qualitative analysis to illustrate the effectiveness of the guidance generated by our SQAPlanner. We selected the ErrorHandlerBuilderRef.java file of the release 2.9.0 of the Camel software system as the subject of this qualitative analysis. Our SQAPlanner approach correctly predicts this file as defective with a probability score of 70%. Below, we discuss the implications of our rule-based explanations to guide developers on what they could follow and could avoid to decrease the risk of having defects.
What are the risky practices that lead a model to predict a file as defective?
To answer this question, we use the supporting rule to generate guidance (G1) for this file as follows:

ℜ+ = {LOCDeclaration > 28.150 & DistinctDeveloper > 1.68 & Ownership < 0.85} ⇒ DEFECT
Implication. This supporting rule indicates that this file is being predicted as defective since it is associated with the conditions of having more than 28 lines of declarative code, more than 1.68 distinct developers, and a line-based ownership score of less than 85%. When comparing this to the actual feature values of the file {LOCDeclaration = 34, DistinctDeveloper = 3, Ownership = 0.65}, we find that the conditions of this supporting rule are consistent with the actual feature values, and the consequent is consistent with our SQAPlanner's prediction (i.e., defective).

What are the non-risky practices that lead a model to predict a file as clean?
To answer this question, we use our contradicting rule to generate guidance (G2) for this file as follows:

ℜ− = {0.44 ≤ RatioCommentToCode ≤ 0.96} ⇒ CLEAN

Implication. We find that our contradicting rule is consistent with the actual feature values of the file to be explained. The actual feature value of this file is {RatioCommentToCode = 0.51}, meaning that 51% of the total lines of code are comment lines. The contradicting rule (ℜ−) indicates that the condition that supports its prediction as not being defective is {0.44 ≤ RatioCommentToCode ≤ 0.96}, i.e., files that have a RatioCommentToCode of more than 44% but less than 96% are likely not to be defective. Developers should thus adhere to the contradicting rule, i.e., keep the comment ratio above 44% of the file, to not increase the risk of having defects.
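For concreteness, checking whether a file's actual metric values satisfy a rule's conditions (as done for the supporting and contradicting rules above) can be sketched as follows (a minimal illustration; the helper names are ours, not SQAPlanner's):

```python
import operator

# A rule's antecedent is a list of (metric, comparison operator, threshold).
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def satisfies(file_metrics, conditions):
    """True if every condition of the rule holds for the file's actual values."""
    return all(OPS[op](file_metrics[metric], t) for metric, op, t in conditions)

# The contradicting rule and the actual value for ErrorHandlerBuilderRef.java:
r_minus = [("RatioCommentToCode", ">=", 0.44), ("RatioCommentToCode", "<=", 0.96)]
actual = {"RatioCommentToCode": 0.51}
print(satisfies(actual, r_minus))  # True: the file falls in the non-risky range
```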
What are the practices to avoid to not increase the risk of having defects?
To answer this question, we use our hypothetical supporting rule to generate guidance (G3) for this file as follows:

ℜH+ = {MinorCommit > 0} ⇒ DEFECT

Implication. Having more than zero minor developers will increase the risk of having defects. The actual feature value of this file is {MinorCommit = 0}, meaning that this file has no minor developers (i.e., minor contributors) who edit or change the file. This finding is consistent with Bird et al. [2] and Rahman [32], who found that minor developers often introduce defects. Thus, developers should adhere to the hypothetical supporting rule in order not to increase the risk of having defects.
What are the practices to follow to decrease the risk of having defects?
To answer this question, we use our hypothetical contradicting rule to generate guidance (G4) for this file as follows:

ℜH− = {LOCBlank < 7.62 & OutputMean < 2} ⇒ CLEAN

Implication. If developers changed the file to have fewer than 8 blank lines and fewer than 2 output variables, this could reverse the prediction of having defects to being clean. The actual feature values of this file are {LOCBlank = 19, OutputMean = 3.7}, meaning that this file has 19 blank lines and an average of 3.7 output variables (i.e., fan-out) of functions in the file. The hypothetical contradicting rule indicates that if {LOCBlank < 7.62 & OutputMean < 2} holds, the file is likely to reverse the prediction of having defects to being clean. Thus, our hypothetical contradicting rule provides suggestions to developers on what they should do to decrease the risk of having defects. It should be noted that our contradicting rule shows correlations that may not necessarily be causal.

PRACTITIONERS' PERCEPTIONS OF OUR SQAPLANNER VISUALIZATION
In this section, we aim to investigate the practitioners' perceptions of the visualization of SQAPlanner when compared to the visualization of the state of the art (RQ6) and of the actual guidance generated by SQAPlanner (RQ7). Below, we describe the approach and present the results.
To address RQs 6 and 7, we developed a proof-of-concept to visualize the actionable guidance generated by our SQAPlanner. Traditionally, the importance scores of Random Forests or LIME's model-agnostic techniques are commonly presented using a bar chart. However, such bar charts can only indicate the importance scores, without providing guidance on what to do and what not to do. To address this challenge, we propose to use a bullet plot (see Figure 14). The visualization of our SQAPlanner is designed to provide the following key information: (1) the list of guidance that practitioners should follow and should avoid; (2) the actual feature value of that file; and (3) its threshold and range values for practitioners to follow to mitigate the risk of having defects. The green shades indicate the non-risky range values of features, while the red shades indicate the risky range values of features. The vertical bars indicate the actual values of features for a given file. The green arrows provide directions for how a feature should be changed (i.e., increased or decreased). The list of guidance is structured into two parts: (1) what to do to decrease the risk of being defective; and (2) what to avoid to not increase the risk of being defective. For each guidance, we translate a rule-based explanation into an actionable guidance. A guidance is presented in the form of natural language to ensure that it is actionable and understandable by practitioners.

To translate the rule-based explanations into actual guidance, we focus only on the ErrorHandlerBuilderRef.java file of the release 2.9.0 of the Camel software system. We use the rule-based explanations from Section 6.4 as a reference. Finally, we derive the following statements according to the reference rule-based explanations in Section 6.4:

• (S1) Decreasing the number of class and method declaration lines to less than 29 lines to decrease the risk of being defective.
• (S2) Decreasing the number of distinct developers to less than 2 developers to decrease the risk of being defective.
• (S3) Increasing the ownership code proportion to more than 0.85 to decrease the risk of being defective.
• (S4) Avoid decreasing the comment to code ratio to less than 0.44 to not increase the risk of being defective.
• (S5) Avoid increasing the number of minor developers to more than 0 developers to not increase the risk of being defective.
• (S6) Decreasing the number of blank lines to less than 8 lines to decrease the risk of being defective.
• (S7) Decreasing the number of output variables to less than 2 variables to decrease the risk of being defective.

To implement the visualization of our SQAPlanner approach, we decided to use Microsoft's Code Defect AI as our core infrastructure. We first downloaded the repository of Code Defect AI from GitHub. Then, we carefully studied the repository and deployed Code Defect AI in our local environment with continuous support from the core developer of Code Defect AI. Then, we integrated our SQAPlanner approach and replaced their visualization (bar plots) with our visualization generated by SQAPlanner, using the implementation of bullet plots provided by the d3.js JavaScript library.

To investigate the practitioners' perceptions of our SQAPlanner visualization, we used a qualitative survey as a research method. We also used the visualization of Microsoft's Code Defect AI (see Figure 2) as a baseline comparison.
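The translation from rule conditions into statements such as S1-S7 can be sketched as follows (an illustrative helper of our own, not the actual SQAPlanner implementation; the metric-name mapping is an assumption):

```python
# Hypothetical mapping from metric identifiers to human-readable descriptions.
DESCRIPTIONS = {
    "LOCDeclaration": "the number of class and method declaration lines",
    "DistinctDeveloper": "the number of distinct developers",
    "Ownership": "the ownership code proportion",
}

def to_guidance(metric, op, threshold):
    """Render one rule condition of a contradicting rule as a guidance sentence."""
    name = DESCRIPTIONS.get(metric, metric)
    if op == "<":
        return (f"Decreasing {name} to less than {threshold} "
                "to decrease the risk of being defective.")
    if op == ">":
        return (f"Increasing {name} to more than {threshold} "
                "to decrease the risk of being defective.")
    raise ValueError(f"unsupported operator: {op}")

print(to_guidance("LOCDeclaration", "<", 29))
print(to_guidance("Ownership", ">", 0.85))
```

Rendering each condition of a contradicting rule this way yields statements in the same shape as S1 and S3 above.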
The objectives of the survey are as follows: (1) to investigate the practitioners' perceptions of the visualization of our SQAPlanner; and (2) to investigate the practitioners' perceptions of the actionable guidance generated by our SQAPlanner. Similar to Section 4, we designed the survey as a cross-sectional study where participants provide their responses at one fixed point in time. The design of our survey is described below.
Part 1—Practitioners' perceptions of the visualizations of our SQAPlanner:
We first provided the concept of defect prediction models and described how our SQAPlanner can be used to support SQA planning. Then, we presented the visualization of our SQAPlanner and the visualization of Microsoft's Code Defect AI. We asked the participants a closed-ended question to inquire which visualization is best for providing actionable guidance on how to mitigate the risk of having defects. We also asked the participants an open-ended question to inquire about their rationale for why the selected visualization is preferred over the other.

6. https://github.com/aricent/codedefectai
7. https://bl.ocks.org/mbostock/4061961

Fig. 14: The visualization of our SQAPlanner is designed to provide the following key information: (1) the list of guidance that practitioners should follow and should avoid; (2) the actual feature value of that file; and (3) its threshold and range values for practitioners to follow to mitigate the risk of having defects. (In the bullet plots, the red shade indicates the range of values with a high risk of being defective, while the green shade indicates the range of values with a low risk; the bold vertical line indicates the actual value of each feature of this file.)
Part 2—Practitioners' perceptions of the actual guidance generated by SQAPlanner:
We again presented the visualization of SQAPlanner. Then, for each statement, we asked the participants a closed-ended question to inquire whether they agree with each of the seven statements that we translated from the rule-based explanations. We used an online questionnaire service provided by Google Forms. We carefully evaluated the survey via pre-testing [28] to assess its reliability and validity. The survey was rigorously reviewed and approved by the Monash University Human Research Ethics Committee (MUHREC Project ID: 27209). We used a recruiting service provided by MTurk to recruit participants. We received 240 closed-ended and 30 open-ended responses from 30 respondents. Finally, we manually verified and analyzed the survey responses to ensure that the responses are of high quality.
Fig. 15: (RQ6, RQ7) The results of a qualitative survey with practitioners: (a) perceptions of the visualization; (b) perceptions of the actual guidance generated by our SQAPlanner.
Results. 80% of our respondents agree that the visualization of our SQAPlanner is better for providing actionable guidance when compared to the visualization of Microsoft's Code Defect AI. Figure 15a shows the percentage of the respondents who selected which visualization is best to provide actionable guidance on how to mitigate the risk of having defects. After analyzing the open-ended responses, we found that practitioners (e.g., R10 and R12) provided rationales that the suggested threshold values of each factor and the directional arrows provided by SQAPlanner make the visualization clearer about what developers should do and should avoid to decrease the risk of having defects. Respondents (e.g., R19, R20, and R23) also pointed out that the summary of "What to do" and "What to avoid" is straight to the point and helpful.

On the other hand, 20% of the respondents rated the visualization of Microsoft's Code Defect AI as better. Respondents (e.g., R5 and R16) provided rationales that the visualization of Microsoft's Code Defect AI is presented in a more simple and concise manner (i.e., it only presents the most important factors that are associated with the risk of having defects). Thus, future research should take into consideration the complexity of the provided information when designing a novel visualization for AI-driven defect prediction.

(RQ7) How do practitioners perceive the actual guidance generated by our SQAPlanner?
Results. Figure 15b presents the percentage of the respondents who agree with the seven statements derived from the actual guidance generated by our SQAPlanner. We find that 90% of the respondents agree the most with (S1) Decreasing the number of class and method declaration lines to less than 29 lines to decrease the risk of being defective. On the other hand, only 63% of the respondents agree with (S3) Increasing the ownership code proportion to more than 0.85 to decrease the risk of being defective. We suspect that the wide range of agreement rates for our statements has to do with the degree of understandability of the software metrics, since practitioners may find that the number of class and method declaration lines in S1 is more intuitive and easier to understand than the ownership code proportion in S3. Thus, future research should take into consideration the degree of understandability of the software metrics when designing a novel visualization for AI-driven defect prediction.
THREATS TO VALIDITY
Construct Validity. Many local model-agnostic techniques could be used to generate many forms of explanations, e.g., feature importance and rules. In this paper, we focused only on rule-based explanations by comparing with LORE and Anchor, an extension of LIME. We also studied only a limited number of available classification techniques. Thus, our results may not be applicable or generalise to the use of other techniques. Nonetheless, other classification techniques can be explored in future work to see if they improve on our results.
Internal Validity. The practicality of rule-based explanations heavily relies on the software metrics that are used to train the models. In this paper, we chose to generate rules based on 65 well-known and hand-crafted software metrics, rather than using advanced automated feature generation like deep learning. Future work may focus on trying to explain other machine learning-based models, such as explaining deep learning models used in an SQA context.

The goal of our SQAPlanner (i.e., the local rule-based model-agnostic technique) is a post-hoc analysis of the global defect prediction models. That means SQAPlanner can only explain the behavior of the (global) defect prediction models, regardless of whether the predictions are correct or incorrect. If the predictions of the global defect models for the testing dataset are incorrect, SQAPlanner will explain why the global defect prediction models generate wrong predictions in the form of rule-based explanations. Therefore, the robustness or the sensitivity of our SQAPlanner does not depend on the accuracy of the predictions of the global defect prediction models.

External Validity. We applied our SQAPlanner approach to a limited number of software systems. Thus, our results may not generalize to other datasets, domains, or ecosystems. However, we mitigated this by choosing a range of different non-trivial, real-world, open-source software applications. Nonetheless, additional replication studies in a proprietary setting and other ecosystems will prove useful to compare with the results reported here.

SQA planning involves various activities. However, this paper only focused on helping practitioners to define development policies and their associated risk thresholds [12], without considering other activities. In addition, the dependent variable that we used in this study only focused on software quality (i.e., defective or clean), without considering other aspects (e.g., testability, reusability, robustness, and maintainability). Thus, other SQA planning activities and other quality attributes can be explored in future work.
9 RELATED WORK
In this section, we discuss related work and gaps to highlight the contributions of our work to the literature.
Despite the advances of AI/ML techniques that are tightly integrated into software development practices (e.g., defect prediction [17], automated code review [1, 50], and automated code completion [19, 20]), such AI/ML techniques come with their own limitations. The central problem is that most AI/ML models are black-box models, i.e., we understand the underlying mathematical principles without explicit declarative knowledge representation. In other words, developers do not understand how decisions are made by such AI/ML techniques. In addition, current defect modelling practices do not uphold current data privacy laws and regulations, which require justifications of individual predictions for any decisions made by an AI/ML model. Therefore, applying such black-box AI/ML techniques to software development practices for safety-critical and cyber-physical systems [4, 55], which involve safety, security, business, personal, or military operations, is unfavourable and must be avoided.

Explainable AI is essential in software engineering to build appropriate trust (including Fairness, Accountability, and Transparency (FAT)). Developers can then (1) understand the reasons and the logic behind every decision and (2) effectively improve the prediction models by understanding any unsound predictions made by the models. Recently, explainable AI has been employed in software engineering [44], making defect prediction models more practical [52] (i.e., using LIME to explain which tokens and which lines are likely to be defective in the future) and explainable [21] (i.e., using LIME to explain why a file is predicted as defective). However, there exist no studies that are able to provide concrete guidance on what developers should or should not do to support SQA planning. To the best of our knowledge, this paper is the first to generate local rule-based explanations to help QA teams make data-informed decisions in software quality assurance planning.
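As a concrete illustration of the kind of local explanation discussed above, the following is a minimal LIME-style sketch on synthetic data: a local linear surrogate is fitted to perturbed neighbours of a single file, weighted by proximity, to show which metrics drive its "defective" prediction. The metric names, data, and kernel are assumptions for illustration, not the setup used in this paper.

```python
# LIME-style local explanation sketch (assumed setup, synthetic data):
# explain one prediction of a black-box defect model with a weighted
# linear surrogate fitted on perturbed neighbours of the instance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical file-level metrics: lines of code, churn, number of authors.
feature_names = ["loc", "churn", "n_authors"]
X = rng.uniform(0, 1, size=(500, 3))
y = ((X[:, 0] + X[:, 1]) > 1.0).astype(int)          # toy "defective" label

global_model = RandomForestClassifier(random_state=0).fit(X, y)  # black-box f

x = np.array([0.55, 0.60, 0.50])                     # instance to explain
Z = x + rng.normal(0, 0.15, size=(1000, 3))          # neighbourhood of x
pz = global_model.predict_proba(Z)[:, 1]             # f's behaviour near x
w = np.exp(-np.sum((Z - x) ** 2, axis=1) / 0.25)     # proximity kernel

surrogate = Ridge(alpha=1.0).fit(Z, pz, sample_weight=w)  # local f'
explanation = sorted(zip(feature_names, surrogate.coef_),
                     key=lambda t: -abs(t[1]))
for name, coef in explanation:
    print(f"{name}: {coef:+.3f}")
```

In this toy setting, loc and churn receive large positive weights while n_authors stays near zero, mirroring the feature-importance explanations that LIME produces for real defect models.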
There are two key approaches for achieving explainability in defect prediction models. The first is to make the entire decision process transparent and comprehensible (i.e., global explainability). The second is to explicitly provide an explanation for each individual prediction (i.e., local explainability).

Examples of global explainability methods are regression models [33, 35], decision trees [57], decision rules [39], and Fast-and-Frugal trees [6]. These transparent AI/ML techniques often provide built-in model interpretation techniques to uncover the relationships between the studied features and defect-proneness, for example, an ANOVA analysis for logistic regression or a variable importance analysis for random forest. However, the insights derived from these transparent AI/ML techniques do not provide justifications for each individual prediction.

Model-agnostic techniques explicitly provide an instance explanation for each decision of AI/ML models (i.e., local explainability) for a given testing instance [16]. Formally, given a defect model f and an instance x, the instance explanation problem aims to provide an explanation e for the prediction f(x) = y. To do so, we address the problem by building a local interpretable model f′ that mimics the local behaviour of the global defect model f. An explanation of the prediction is then derived from the local interpretable model f′. The local interpretable model focuses on learning the behaviour of the defect model in the neighbourhood of the specific instance x, without aiming to provide a single description of the logic of the black box for all possible instances. Thus, an explanation e ∈ E is obtained through f′ if e = ε(f′, x) for some explanation logic ε(·, ·) which reasons over f′ and x. Two common ways to represent explanations are feature-importance explanations and rule-based explanations.

Unlike the model-specific explanation techniques discussed above, the great advantage of model-agnostic techniques is their flexibility. Such model-agnostic techniques (1) can interpret any learning algorithm (e.g., regression, random forest, and neural networks); (2) are not limited to a certain form of explanation (e.g., feature importance or rules); and (3) are able to process any input data (e.g., features, words, and images [36]).

There is a plethora of model-agnostic techniques [16] for identifying the most important features at the instance level. For example, LIME (Local Interpretable Model-agnostic Explanations) [37] is a model-agnostic technique that mimics the behaviour of the black-box model with a local linear model to generate explanations of the predictions. BreakDown [14, 41] is a model-agnostic technique that uses a greedy strategy to sequentially measure the contributions of metrics towards the expected prediction. However, none of these techniques can generate explanations with the logic behind them.

Despite the advances of model-agnostic techniques in the Explainable AI community, such techniques have not been employed in practical software engineering contexts. To the best of our knowledge, this paper is the first to generate local rule-based explanations to help QA teams make data-informed decisions in software quality assurance planning.
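The formalism above can be sketched for the rule-based representation as well: fit a shallow decision tree f′ on neighbours of x labelled by the global model f, then read the conjunction of conditions along x's decision path as the explanation e. This is a simplified sketch in the spirit of LORE-style rule extraction, not the SQAPlanner algorithm itself; the metric names and synthetic data are assumptions for illustration.

```python
# Rule-based local explanation sketch (assumed setup, synthetic data):
# a shallow decision tree f' mimics the global defect model f near x,
# and x's decision path yields a rule-based explanation e.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
feature_names = ["loc", "churn", "n_authors"]        # hypothetical metrics
X = rng.uniform(0, 1, size=(500, 3))
y = ((X[:, 0] > 0.6) & (X[:, 1] > 0.4)).astype(int)  # toy "defective" label

f = RandomForestClassifier(random_state=0).fit(X, y)  # global defect model

x = np.array([0.8, 0.7, 0.3])                         # instance to explain
Z = np.clip(x + rng.normal(0, 0.25, size=(2000, 3)), 0, 1)  # neighbourhood
f_prime = DecisionTreeClassifier(max_depth=3, random_state=0)
f_prime.fit(Z, f.predict(Z))                          # local surrogate f'

# Walk x's path through the tree to collect the rule's conditions.
tree = f_prime.tree_
node, conditions = 0, []
while tree.children_left[node] != -1:                 # stop at a leaf
    feat, thr = tree.feature[node], tree.threshold[node]
    if x[feat] <= thr:
        conditions.append(f"{feature_names[feat]} <= {thr:.2f}")
        node = tree.children_left[node]
    else:
        conditions.append(f"{feature_names[feat]} > {thr:.2f}")
        node = tree.children_right[node]

pred = int(f_prime.predict(x.reshape(1, -1))[0])
rule = " AND ".join(conditions) + (" => defective" if pred == 1 else " => clean")
print(rule)
```

The resulting rule (e.g., conditions over loc and churn with concrete thresholds) illustrates why rule-based explanations carry actionable risk thresholds that plain feature-importance weights do not.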
10 CONCLUSIONS
Defect prediction models have been proposed to generate insights (e.g., the most important factors that are associated with software quality). However, such insights derived from traditional defect models are far from actionable, i.e., practitioners still do not know what they should do and should avoid to decrease the risk of having defects, and what the risk threshold for each metric is. A lack of actionable guidance and risk thresholds could lead to inefficient and ineffective SQA planning processes.

In this paper, we investigate practitioners' perceptions of and challenges with current SQA planning activities, and their perceptions of our proposed four types of guidance. Then, we propose and evaluate our SQAPlanner approach, i.e., an approach for generating four types of guidance and their associated risk thresholds in the form of rule-based explanations for the predictions of defect prediction models. Finally, we develop and evaluate the visualization of our SQAPlanner approach.

Through the use of a qualitative survey and an empirical evaluation, our results lead us to conclude that SQAPlanner is needed, important, effective, stable, and applicable. We also find that 80% of respondents perceived that our visualization is more actionable. Thus, our SQAPlanner paves a way for novel research in actionable software analytics.

Finally, we note that we do not seek to claim the generalization and causation of our proposed guidance. Instead, the key message of our study is that our rule-based guidance can explain the behaviour of the defect models that learnt the relationship between software features and defect-proneness from past release data. Thus, they can indicate important relationships in the data and provide a useful tool to support decision- and policy-making in software quality assurance. Our rule-based guidance could be used as a guidance tool for supporting decision-making so that developers can (1) understand the reasons and the logic behind every prediction, and (2) effectively improve the prediction models by understanding any unsound predictions made by the models.

ACKNOWLEDGMENTS
C. Tantithamthavorn was partially supported by the Australian Research Council's Discovery Early Career Researcher Award (ARC DECRA) funding scheme (DE200100941). C. Bergmeir was partially supported by the Australian Research Council's Discovery Early Career Researcher Award (ARC DECRA) funding scheme (DE190100045). J. Grundy was partially supported by the Australian Research Council's Laureate Fellowship funding scheme (FL190100035).

REFERENCES

[1] S. Asthana, R. Kumar, R. Bhagwan, C. Bird, C. Bansal, C. Maddila, S. Mehta, and B. Ashok, "Whodo: automating reviewer suggestions at scale," in
Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2019, pp. 937–945.
[2] C. Bird, B. Murphy, and H. Gall, "Don't Touch My Code! Examining the Effects of Ownership on Software Quality," in Proceedings of the European Conference on Foundations of Software Engineering (ESEC/FSE), 2011, pp. 4–14.
[3] B. Boehm and V. R. Basili, "Software defect reduction top 10 list," Foundations of Empirical Software Engineering: The Legacy of Victor R. Basili, vol. 426, no. 37, pp. 426–431, 2005.
[4] M. Borg, S. Gerasimou, N. Hochgeschwender, and N. Khakpour, "Explainability for safety and security," Explainable Software for Cyber-Physical Systems (ES4CPS), Report from the GI Dagstuhl Seminar 19023, p. 15, 2019.
[5] L. Breiman, A. Cutler, A. Liaw, and M. Wiener, "randomForest: Breiman and Cutler's Random Forests for Classification and Regression. R package version 4.6-12." Software available at URL: https://cran.r-project.org/package=randomForest.
[6] D. Chen, W. Fu, R. Krishna, and T. Menzies, "Applications of Psychological Science for Actionable Analytics," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2018, pp. 456–467.
[7] C. Y. Chong, P. Thongtanunam, and C. Tantithamthavorn, "Assessing the students understanding and their mistakes in code review checklists – an experience report of 1,791 code review checklists from 394 students," in International Conference on Software Engineering: Joint Software Engineering Education and Training track (ICSE-JSEET), 2021.
[8] D. A. da Costa, S. McIntosh, W. Shang, U. Kulesza, R. Coelho, and A. E. Hassan, "A Framework for Evaluating the Results of the SZZ Approach for Identifying Bug-introducing Changes," Transactions on Software Engineering (TSE), vol. 43, no. 7, pp. 641–657, 2017.
[9] M. D'Ambros, M. Lanza, and R. Robbes, "An Extensive Comparison of Bug Prediction Approaches," in Proceedings of the International Conference on Mining Software Repositories (MSR), 2010, pp. 31–41.
[10] P. Edwards, I. Roberts, M. Clarke, C. DiGuiseppi, S. Pratap, R. Wentz, and I. Kwan, "Increasing response rates to postal questionnaires: Systematic review," BMJ, vol. 324, no. 7347, p. 1183, 2002.
[11] S. Farooqui and W. Mahmood, "A survey of pakistan's sqa practices: A comparative study," in , 2017.
[12] D. Galin, Software Quality: Concepts and Practice. John Wiley & Sons, 2018.
[13] B. Ghotra, S. McIntosh, and A. E. Hassan, "Revisiting the Impact of Classification Techniques on the Performance of Defect Prediction Models," in Proceedings of the International Conference on Software Engineering (ICSE), 2015, pp. 789–800.
[14] A. Gosiewska and P. Biecek, "iBreakDown: Uncertainty of Model Explanations for Non-additive Predictive Models," arXiv preprint arXiv:1903.11420, 2019.
[15] R. Guidotti, A. Monreale, S. Ruggieri, D. Pedreschi, F. Turini, and F. Giannotti, "Local rule-based explanations of black box decision systems," arXiv preprint arXiv:1805.10820, 2018.
[16] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, D. Pedreschi, and F. Giannotti, "A Survey of Methods for Explaining Black Box Models," vol. 51, no. 5, pp. 1–45, 2018. [Online]. Available: http://arxiv.org/abs/1802.01933
[17] T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell, "A Systematic Literature Review on Fault Prediction Performance in Software Engineering," Transactions on Software Engineering (TSE), vol. 38, no. 6, pp. 1276–1304, 2012. [Online]. Available: http://ieeexplore.ieee.org.pc124152.oulu.fi:8080/xpls/abs{_}all.jsp?arnumber=6035727
[18] J. A. Hanley and B. J. McNeil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, no. 1, pp. 29–36, Apr. 1982. [Online]. Available: http://dx.doi.org/10.1148/radiology.143.1.7063747
[19] V. J. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis, "Deep learning type inference," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2018, pp. 152–162.
[20] V. J. Hellendoorn, S. Proksch, H. C. Gall, and A. Bacchelli, "When code completion fails: a case study on real-world completions," in Proceedings of the 41st International Conference on Software Engineering. IEEE Press, 2019, pp. 960–970.
[21] J. Jiarpakdee, C. Tantithamthavorn, H. K. Dam, and J. Grundy, "An empirical study of model-agnostic techniques for defect prediction models," 2020.
[22] J. Jiarpakdee, C. Tantithamthavorn, and A. E. Hassan, "The Impact of Correlated Metrics on Defect Models," Transactions on Software Engineering (TSE), p. To Appear, 2019.
[23] J. Jiarpakdee, C. Tantithamthavorn, and C. Treude, "AutoSpearman: Automatically Mitigating Correlated Software Metrics for Interpreting Defect Models," in Proceedings of the International Conference on Software Maintenance and Evolution (ICSME), 2018, pp. 92–103.
[24] B. A. Kitchenham and S. L. Pfleeger, "Personal opinion surveys," in Guide to Advanced Empirical Software Engineering. Springer, 2008, pp. 63–92.
[25] L. Kocsis and C. Szepesvári, "Bandit based monte-carlo planning," in European Conference on Machine Learning. Springer, 2006, pp. 282–293.
[26] J. A. Krosnick, "Survey research," Annual Review of Psychology, vol. 50, no. 1, pp. 537–567, 1999.
[27] S. Kumaresh and R. Baskaran, "Defect analysis and prevention for software process quality improvement," International Journal of Computer Applications, vol. 8, no. 7, pp. 42–47, 2010.
[28] M. S. Litwin, How to Measure Survey Reliability and Validity. Sage, 1995, vol. 7.
[29] B. R. Maxim and M. Kessentini, "An introduction to modern software quality assurance," in Software Quality Assurance. Elsevier, 2016, pp. 19–46.
[30] S. McIntosh, Y. Kamei, B. Adams, and A. E. Hassan, "The Impact of Code Review Coverage and Code Review Participation on Software Quality," in Proceedings of the International Conference on Mining Software Repositories (MSR), 2014, pp. 192–201.
[31] T. Menzies, J. Greenwald, and A. Frank, "Data Mining Static Code Attributes to Learn Defect Predictors," Transactions on Software Engineering (TSE), vol. 33, no. 1, pp. 2–13, 2007.
[32] F. Rahman and P. Devanbu, "Ownership, experience and defects: a fine-grained study of authorship," in Proceedings of the International Conference on Software Engineering (ICSE), 2011, pp. 491–500.
[33] ——, "How, and Why, Process Metrics are Better," in Proceedings of the International Conference on Software Engineering (ICSE), 2013, pp. 432–441.
[34] D. Rajapaksha, C. Bergmeir, and W. Buntine, "LoRMIkA: Local rule-based model interpretability with k-optimal associations," Information Sciences, vol. 540, pp. 221–241, 2020.
[35] G. K. Rajbahadur, S. Wang, Y. Kamei, and A. E. Hassan, "The Impact of Using Regression Models to Build Defect Classifiers," in Proceedings of the International Conference on Mining Software Repositories (MSR), 2017, pp. 135–145.
[36] M. T. Ribeiro, S. Singh, and C. Guestrin, "Model-agnostic Interpretability of Machine Learning," arXiv preprint arXiv:1606.05386, 2016.
[37] ——, "Why should I trust you?: Explaining the Predictions of Any Classifier," in Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDDM), 2016, pp. 1135–1144.
[38] ——, "Anchors: High-precision model-agnostic explanations," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[39] D. Rodríguez, R. Ruiz, J. C. Riquelme, and J. S. Aguilar-Ruiz, "Searching for Rules to Detect Defective Modules: A Subgroup Discovery Approach," Information Sciences, vol. 191, pp. 14–30, 2012.
[40] E. Smith, R. Loftin, E. Murphy-Hill, C. Bird, and T. Zimmermann, "Improving developer participation rates in surveys," in Proceedings of the International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), 2013, pp. 89–92.
[41] M. Staniak and P. Biecek, "Explanations of Model Predictions with live and breakDown Packages," arXiv preprint arXiv:1804.01955, 2018.
[42] R. Storn and K. Price, "Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces," Journal of Global Optimization, vol. 11, no. 4, pp. 341–359, Dec. 1997. [Online]. Available: https://doi.org/10.1023/A:1008202821328
[43] C. Tantithamthavorn, A. E. Hassan, and K. Matsumoto, "The Impact of Class Rebalancing Techniques on the Performance and Interpretation of Defect Prediction Models," Transactions on Software Engineering (TSE), p. To Appear, 2019.
[44] C. Tantithamthavorn, J. Jiarpakdee, and J. Grundy, "Explainable AI for software engineering," arXiv preprint arXiv:2012.01614, 2020.
[45] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, A. Ihara, and K. Matsumoto, "The Impact of Mislabelling on the Performance and Interpretation of Defect Prediction Models," in Proceedings of the International Conference on Software Engineering (ICSE), 2015, pp. 812–823.
[46] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, "Automated Parameter Optimization of Classification Techniques for Defect Prediction Models," in Proceedings of the International Conference on Software Engineering (ICSE), 2016, pp. 321–332.
[47] ——, "An Empirical Comparison of Model Validation Techniques for Defect Prediction Models," Transactions on Software Engineering (TSE), vol. 43, no. 1, pp. 1–18, 2017.
[48] ——, "The Impact of Automated Parameter Optimization on Defect Prediction Models," Transactions on Software Engineering (TSE), pp. 683–711, 2018.
[49] P. Thongtanunam, S. McIntosh, A. E. Hassan, and H. Iida, "Revisiting Code Ownership and its Relationship with Software Quality in the Scope of Modern Code Review," in Proceedings of the International Conference on Software Engineering (ICSE), 2016, pp. 1039–1050.
[50] P. Thongtanunam, C. Tantithamthavorn, R. G. Kula, N. Yoshida, H. Iida, and K.-i. Matsumoto, "Who Should Review My Code? A File Location-based Code-reviewer Recommendation Approach for Modern Code Review," in Proceedings of the International Conference on Software Analysis, Evolution, and Reengineering (SANER), 2015, pp. 141–150.
[51] Z. Wan, X. Xia, A. E. Hassan, D. Lo, J. Yin, and X. Yang, "Perceptions, expectations, and challenges in defect prediction," IEEE Transactions on Software Engineering, 2018.
[52] S. Wattanakriengkrai, P. Thongtanunam, C. Tantithamthavorn, H. Hata, and K. Matsumoto, "Predicting defective lines using a model-agnostic technique," 2020.
[53] G. I. Webb, "OPUS: An efficient admissible algorithm for unordered search," Journal of Artificial Intelligence Research, vol. 3, pp. 431–465, 1995.
[54] G. I. Webb and S. Zhang, "K-optimal rule discovery," Data Mining and Knowledge Discovery, vol. 10, no. 1, pp. 39–79, 2005.
[55] Y. Yang, D. Falessi, T. Menzies, and J. Hihn, "Actionable analytics for software engineering," IEEE Software, vol. 35, no. 1, pp. 51–53, 2017.
[56] S. Yathish, J. Jiarpakdee, P. Thongtanunam, and C. Tantithamthavorn, "Mining Software Defects: Should We Consider Affected Releases?" in Proceedings of the International Conference on Software Engineering (ICSE), 2019, p. To Appear.
[57] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project Defect Prediction," in Proceedings of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering (ESEC/FSE), 2009, pp. 91–100.
Dilini Rajapaksha received the BSc (Hons) degree from the Sri Lanka Institute of Information Technology (SLIIT). She is currently a Ph.D. candidate at Monash University, Australia. Her research interests include machine learning and time-series forecasting. The goal of her Ph.D. is to provide local explanations for the predictions given by time-series and machine learning models.
Chakkrit Tantithamthavorn is a Lecturer in Software Engineering and a 2020 ARC DECRA Fellow in the Faculty of Information Technology, Monash University, Australia. His current fellowship is focusing on the development of "Practical and Explainable Analytics to Prevent Future Software Defects". His work has been published at several top-tier software engineering venues, such as the IEEE Transactions on Software Engineering (TSE), the Springer Journal of Empirical Software Engineering (EMSE), and the International Conference on Software Engineering (ICSE). More about Chakkrit and his work is available online at http://chakkrit.com.
Jirayus Jiarpakdee is a Ph.D. candidate at Monash University, Australia. His research interests include empirical software engineering and mining software repositories (MSR). The goal of his Ph.D. is to apply the knowledge of statistical modelling, experimental design, and software engineering to improve the explainability of defect prediction models.
Christoph Bergmeir is a Lecturer in Data Science and Artificial Intelligence, and a 2019 ARC DECRA Fellow in the Monash Faculty of Information Technology. His fellowship is on the development of "efficient and effective analytics for real-world time series forecasting". He also works as a Data Scientist in a variety of projects with external partners in diverse sectors, e.g., in healthcare or infrastructure maintenance. Christoph holds a PhD in Computer Science from the University of Granada, Spain, and an M.Sc. degree in Computer Science from the University of Ulm, Germany.
John Grundy is an Australian Laureate Fellow and Professor of Software Engineering at Monash University, Australia. He has published widely in automated software engineering, domain-specific visual languages, model-driven engineering, software architecture, and empirical software engineering, among many other areas. He is a Fellow of Automated Software Engineering and a Fellow of Engineers Australia.