Architectural Decay as Predictor of Issue- and Change-Proneness
Duc Minh Le*, Suhrid Karthik†, Marcelo Schmitt Laser†, and Nenad Medvidovic†

*Software Infrastructure, Bloomberg L.P., London, EC4N 4TQ, UK ([email protected])
†Computer Science Department, University of Southern California, Los Angeles, CA 90089 ({skarthik,schmittl,neno}@usc.edu)

Abstract—Architectural decay imposes real costs in terms of developer effort, system correctness, and performance. Over time, those problems are likely to be revealed as explicit implementation issues (defects, feature changes, etc.). Recent empirical studies have demonstrated that there is a significant correlation between architectural "smells"—manifestations of architectural decay—and implementation issues. In this paper, we take a step further in exploring this phenomenon. We analyze the available development data from 10 open-source software systems and show that information regarding current architectural decay in these systems can be used to build models that accurately predict future issue-proneness and change-proneness of the systems' implementations. As a less intuitive result, we also show that, in cases where historical data for a system is unavailable, such data from other, unrelated systems can provide reasonably accurate issue- and change-proneness prediction capabilities.
Index Terms—Architectural Decay, Issue Proneness, Change Proneness, Architectural Smell, Decay Prediction
I. INTRODUCTION
Software systems change regularly, as do their architectures. Over time, a system's architecture is increasingly affected by decay, caused by careless or unintended design decisions [53]. Decay results in systems whose implemented architectures differ in important ways from their designed architectures. Both researchers and practitioners have recognized the negative impact of architectural decay and its role in causing technical debt. Despite this, when developers modify a system during maintenance, they often focus on code and neglect the architecture.

Researchers have proposed a number of techniques to analyze a system at the code level and to predict issues that are likely to appear in the system's future versions. A common approach has been to use historical artifacts, such as data from issue trackers and version control systems, to build prediction models. Early approaches [29], [47], [24], [13] built models to predict implementation issues based on code metrics. Later studies made use of other properties that were reckoned to be potential causes of issues, such as code dependencies [73] and code smells [25].

In contrast to code-level techniques, analogous techniques at the architecture level have not received nearly as much attention, even though recent work has demonstrated that even simple code updates can cause system-wide architectural changes [36]. Frequently, such updates introduce architectural smells in a system (e.g., dependency cycle, ambiguous interface [35]). These smells may have no immediately visible effect, but they are symptoms of architectural decay and accumulated technical debt [64], [18], [36], [35]. As decay compounds in long-lived systems, the number of architectural smells grows, creating unforeseen issues when engineers try to modify a system. In such cases, engineers are eventually likely to realize the negative effects of the incurred technical debt and the need to refactor their system. However, they usually spot deeper architectural problems only when related implementation-level issues surface. For example, fixing code smells using existing approaches will not help to uncover the underlying architectural issues, and modifications to address the thus-identified problems run the risk of being inadequate, short-term patches [51]. Despite this, predictive models that leverage architectural characteristics to anticipate the implementation issues or the amount of change a system may experience have been scarce.

In this paper, we propose and empirically evaluate an approach to predict a system's (1) future implementation issues and (2) proneness to change based on the system's current and past architectural characteristics. Our work is inspired in part by the recent finding [35] that architectural smells and implementation issues are strongly correlated. Specifically, we analyze 466 versions of 10 open-source software systems. For each system version, we use 3 different methods to recover its architecture from source code. We analyze the thus-obtained 1,398 architectural models to detect 11 distinct types of architectural smells. The detected smells are subsequently used as features in our prediction models. We make use of different machine learning techniques to predict a given system's issue- and change-proneness based on the collected architectural features.

Our study has resulted in two principal findings regarding the predictive power of the models obtained in this manner:
1) The architectural smells detected in a system can help to accurately predict both the issue-proneness and change-proneness of that system at a given point in time. Our models yielded precision and recall scores of at least 70% (and as high as 95%) for specific recovered architectural views of the subject systems. This finding allows maintainers to foresee future problems involving newly smell-impacted parts of a system.
2) Different, independently developed software systems tend to share issue- and change-proneness characteristics. This allows developers to use models created using data from a set of existing systems to predict the issue- and change-proneness of an unrelated system for which historical data does not exist (e.g., a newly developed system). While the accuracy of such general-purpose prediction models is lower than that of the system-specific models, the loss in accuracy is moderate, typically under 10%. Our results indicate that this is a fruitful area for further investigation, and that our models are already usable in practice for making certain types of decisions.

Section II introduces foundations for our study. Section III presents the research questions and describes the study. The results are detailed in Section IV. Threats to validity, related work, conclusions, and acknowledgments round out the paper.

Fig. 1: Architecture recovery pipeline used in our study and enabled by the ARCADE tool suite (components: recovery techniques, architectural smell detector, issue extractor, relation analyzer; artifacts: source code, architecture, architectural-smell instances, issues, correlation data, prediction models).

II. FOUNDATION
Our work is directly enabled by three research threads: (1) software architecture recovery, (2) definition and analysis of architectural smells, and (3) tracking implementation issues. Figure 1 depicts how these threads are combined to answer our research questions in this paper.
A. Architecture Recovery with ARCADE
Garcia et al. [19] conducted a comparative evaluation of software architecture recovery techniques. Their objective was to measure the existing techniques' accuracy and scalability on a set of systems for which researchers had previously obtained "ground-truth" architectures [20]. To that end, the authors implemented a tool suite, named ARCADE, offering a large set of architecture recovery choices to an engineer. Garcia et al.'s results indicate that two techniques implemented in ARCADE consistently outperformed the rest: ACDC [66] and ARC [23]. We select these techniques for our study. ACDC leverages a system's structural characteristics to cluster implementation-level modules into architectural components, while ARC focuses on the concerns implemented by a system. ACDC relies on static dependency analysis; ARC uses information retrieval and machine learning.

(Footnote: The existing techniques implemented within ARCADE support structural clusterings of software systems' elements based on a range of criteria. While the resulting recovered models contain only partial architectural information for a given system, in this paper we will refer to them as "recovered architectures". We note that our use of this term is consistent with existing literature.)
PKG is another technique implemented in ARCADE.
PKG extracts a system's implementation package structure. The package structure of a system is considered to be a reliable view of a system's "implementation architecture" [34]. We use it to complement the two selected clustering-based architectural views.
B. Architectural Smells
Architectural smells are instances of poor architectural design decisions [45]. They negatively impact system lifecycle properties, such as understandability, testability, extensibility, and reusability [21]. While code smells [16], anti-patterns [10], or structural design smells [17] originate from implementation constructs (e.g., classes, methods, variables), architectural smells stem from poor use of software architecture-level abstractions: components, connectors, interfaces, patterns, styles, etc. Detected instances of architectural smells are candidates for restructuring [9], to help prevent architectural decay and improve system quality.

Researchers have collected a growing catalog of architectural smells. Garcia et al. [21], [22] identified an initial set of four smells related to connectors, interfaces, and concerns. Mo et al. [46] introduced a new concern-related smell. Ganesh et al. [17] also summarized a catalog of structural design smells, some of which are at the architecture level. Le et al. [35] described 11 different architectural smells and proposed a set of algorithms to detect them. Table I summarizes a consolidated list of smells that were identified in the above references, after removing duplicates and non-architectural smells.
C. Issue Tracking Systems
Issue tracking systems are commonly used development tools that allow users to report different problems and concerns about a system and to monitor their status. All subject systems selected for analysis in this paper use Jira [4] as their issue tracking system. However, this is not a limitation; our approach can be applied to other issue trackers.

When reporting implementation issues, engineers categorize them into different types: bug, new feature, improvement, task to be performed, etc. We consider all issue types in our study because they may result in relevant changes to a system. In other words, any issue type or individual issue instance may have an underlying architectural cause. Note that it would be possible to perform a finer-grained analysis, using the same process we employed, that would focus on a specific subset of issues or types.

TABLE I: Consolidated catalog of architectural smells

Category         | Type                              | Definition                                                    | Consequences
Interface-based  | Unused Interface                  | Component's interface is not linked to other components       | Adds unnecessary complexity to the system
Interface-based  | Unused Brick                      | Component's interfaces are all unused                         | Same as Unused Interface, but more severe
Interface-based  | Sloppy Delegation                 | Component delegates functionality it could have performed     | Reduces separation of concerns
Interface-based  | Functionality Overload            | Component has an excessive amount of functionality            | Reduced modularity
Interface-based  | Lego Syndrome                     | Component handles exceedingly small amount of functionality   | High coupling
Change-based     | Duplicate Functionality           | Several components replicate the same functionality           | Bugs if changing only one duplicate
Change-based     | Logical Coupling                  | Parts of different components are frequently changed together | Similar to Duplicate Functionality
Dependency-based | Dependency Cycle                  | Set of components whose links form a circular chain           | Changes to one component affect the entire cycle
Dependency-based | Link Overload                     | Component's interfaces have too many dependencies             | Reduced isolation of changes
Concern-based    | Scattered Parasitic Functionality | Multiple components responsible for realizing one concern     | Changing a feature modifies multiple system parts
Concern-based    | Concern Overload                  | Component implements an excessive number of concerns          | Violates separation of concerns

Each issue has a status that indicates where the issue is in its lifecycle [3]. An issue starts as "open", progresses to "resolved", and finally to "closed". We restrict our study to closed and resolved issues that have been "fixed", and ignore those resolved issues that fall under "won't fix", "cannot reproduce", etc. We do so because any effects caused by the fixed issues presumably appear in certain system versions and disappear once the issue is addressed. Additionally, a fixed issue contains information that is useful for our study: (1) affected versions in which the issue has been found, (2) type of issue, and (3) fixing commits, i.e., the changes applied to the system to resolve the issue. Finding fixing commits is not always easy since there is no standard method for engineers to keep track of this information. Three ways of keeping track of an issue's fixing commits are commonly employed in our set of subject systems: (1) direct links to the commits, (2) specifying pull requests, and (3) specifying patch files. Our implemented tool supports collecting data from all three methods.

Based on the collected information, issues are mapped to detected smells. To do this, first, we find the system versions that the issue affects. Then we find the architectural smells present in those versions. We say the issue is infected by a given smell if and only if (1) both the issue and the smell affect the same system version and (2) the resolution of the issue changes files that are involved in the smell. Based on this relationship, we studied whether the characteristics of an issue (e.g., issue type, number of fixing commits) depend on whether the issue is infected by a given smell.

Note that resolving an issue may not remove the smell that led to the issue in the first place. One reason is that developers could find a workaround. The smell may also correlate with more than one issue.
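To make the infection rule concrete, here is a minimal sketch of the mapping as a predicate. This is illustrative only, not the ARCADE implementation; the Issue and Smell record shapes and their field names are our own assumptions.

```python
from dataclasses import dataclass

@dataclass
class Issue:
    affected_versions: set  # system versions in which the issue was reported
    fixed_files: set        # files changed by the issue's fixing commits

@dataclass
class Smell:
    version: str            # system version in which the smell was detected
    involved_files: set     # files participating in the smell instance

def is_infected(issue: Issue, smell: Smell) -> bool:
    """An issue is 'infected' by a smell iff (1) they affect the same system
    version and (2) fixing the issue changed a file involved in the smell."""
    shares_version = smell.version in issue.affected_versions
    touches_smelly_file = bool(issue.fixed_files & smell.involved_files)
    return shares_version and touches_smelly_file

# Example: an issue fixed by changing dfs/DFSClient.java, which participates
# in a Dependency Cycle smell detected in version 0.20.0.
issue = Issue({"0.20.0"}, {"dfs/DFSClient.java"})
smell = Smell("0.20.0", {"dfs/DFSClient.java", "mapred/JobTracker.java"})
print(is_infected(issue, smell))  # True
```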
In general, it is difficult to identify the exact relationship between a specific architectural smell instance and a specific implementation issue. Fortunately, we do not need to do that in our work, because we are looking for prediction models that uncover smell-issue correlations across most cases.

III. EMPIRICAL STUDY SETUP
This section describes our study setup. Our hypothesis and research questions are described in Section III-A. We then describe how we pre-processed the raw data in Section III-B.
A. Research Questions and Research Hypothesis
Our hypothesis is that it is possible to construct accurate models to predict the impact of architectural decay on a system's implementation. To evaluate this hypothesis, we focus on the predictability of a system's issue- and change-proneness based on the identified architectural smells (i.e., the symptoms of decay). We define two research questions accordingly.
RQ1.
To what extent can the architectural smells detected in a system help to predict the issue-proneness and change-proneness of that system at a given point in time?
The training data used to build the prediction models for a system is collected from different versions of that system. If these models can be shown to accurately predict issue- and change-proneness, this would indicate that architectural smells have consistent impacts on those two properties throughout a system's life span. In turn, this would confirm that the impact of architectural smells is not related to other factors, such as system size, which will change during a system's evolution. In addition, an accurate prediction model will be useful for maintainers to foresee the future issue- and change-proneness of newly smell-affected parts of a system, helping them to decide when and where they may need to refactor the system.
RQ2.
To what extent do unrelated software systems tend to share properties with respect to issue-proneness and change-proneness?
This question investigates whether the issue- and change-proneness of a system can be accurately predicted by a general-purpose model trained using symptoms of architectural decay from unrelated systems. If such a model can be constructed, it can be reused by developers to predict properties of systems for which historical information is not (yet) available. An affirmative answer to this question would also have a deeper implication: software systems tend to share fundamental properties regardless of system type, application domain, developers, employed tools, programming languages, execution platforms, etc.

Fig. 2: Data processing pipeline (raw data from ARCADE; pre-processing: labelling and balancing; model building: 10-fold cross-validation for the RQ1 result, one system held out as test set for the RQ2 result).

B. Building the Data Pipeline
To answer the two research questions, we build multiple prediction models based on different systems' architectural-smell data and assess the models' accuracy. We rely on ARCADE [36] to collect the underlying raw architectural-smell data, and WEKA [50]—a well-known ML framework—to pre-process the data, build prediction models, and evaluate their accuracy. The data pipeline we use is illustrated in Figure 2. Section III-B1 introduces the list of subject systems and the process of recovering their architectural artifacts with ARCADE. Two main pre-processing tasks are labeling and balancing the raw data, which are discussed in Sections III-B2 and III-B3, respectively. Creating the training and test sets, evaluating prediction models, and determining the baseline models are discussed in Sections III-B4, III-B5, and III-B6, respectively.
1) ARCADE and Subject Systems:
We collected data from ten open-source systems from the Apache Software Foundation, shown in Table II. Specifically, our study uses three types of data: (1) architectural smells detected in recovered architectures, (2) implementation issues collected from the Jira [4] issue repository, and (3) code commits extracted from GitHub [5]. Using ARCADE, we recover the subject systems' architectures using the three recovery techniques—ACDC, ARC, and PKG—whose accuracy and scalability have been demonstrated by prior work (recall Section II-A). We then analyze the recovered architectures for the presence of smells identified in the literature (recall Section II-B and Table I), as well as the systems' issue- and change-proneness. These architectural artifacts are the raw data for building prediction models.

TABLE II: Subject systems in our study (columns: System, Domain).
2) Labeling the Data:
Data labeling is a key step to ensure the success of prediction models. In our prediction problem, we are interested in two properties—issue-proneness and change-proneness. These properties can be obtained by, first, counting the raw numbers of issues and changes in a system's development history and, then, finding a way of characterizing those numbers. Specifically, we assign nominal labels based on the raw numbers of issues and changes related to source files to represent different levels of issue- and change-proneness.

Converting a set of numerical values to nominal labels depends on the values' distribution. In our problem, the numbers of issues and changes follow a heavy-tailed distribution [15], where many files are associated with small numbers of issues and code changes, while comparatively fewer files are associated with large numbers of issues and changes. This is not an uncommon type of distribution [72], [8]. As an illustration, the Pareto chart [68] in Figure 3 depicts the distribution of issues per file in Hadoop: while few files are associated with a large number of issues, the arc, which represents the cumulative percentage of file-groups' sizes, shows a clear heavy-tailed pattern.

Fig. 3: Pareto chart of issues per file in Hadoop. The x-axis represents the Hadoop files grouped by the number of issues they contain, the left y-axis the number of files in the same groups, and the right y-axis the cumulative percentage of groups' sizes.

One common labeling approach is to segment a heavy-tailed distribution into head and tail segments. A more sophisticated approach is to divide the distribution into three parts—head, body, and tail—which in our case represent the three levels of proneness: low, medium, and high. We choose the latter approach because the numerical values in our study span a wide range. Having these three levels gives developers a better estimation of architectural decay's impact.

To segment a dataset, we use the Pareto principle [52], a popular segmentation method for heavy-tailed distributions, widely used in software engineering (e.g., [8], [30], [61]). To obtain the three segments, we apply the Pareto principle twice, as suggested in the literature [6]. Specifically, we divide the original dataset into two portions. The first portion contains 80% of the original dataset's low-end, while the second portion contains 20% of the high-end. We apply the Pareto segmentation once more to the latter portion, thus obtaining two new portions that respectively contain the next 16% (80% of the 20%) and 4% (20% of the 20%) of the high-end data points.

In order to collect the data regarding architectural decay, for each version of a subject system, we first collect the list of "fixed" issues affecting that version. Next, we collect the files that were changed when fixing the issues. For each file, we gather its associated architectural smells, the number of issues whose fixing commits changed that file (used when determining the system's issue-proneness in Sections IV-RQ1-A and IV-RQ2-A), and the total number of changes (used when determining the system's change-proneness in Sections IV-RQ1-B and IV-RQ2-B). After the raw data is collected, we label it using the Pareto technique mentioned above before feeding it to supervised ML algorithms.

To determine the level of issue-proneness of a source file in a system version, first, the number of issues related to that file is collected. This is one data point. We collect data points for all files in all available versions of a system, and then sort the dataset by the numbers of issues, from low to high. The first 80% of data points are marked with "low" labels; the next 16% and 4%, respectively, are marked with "med(ium)" and "high" labels. To determine the change-proneness of a source file in a system's version, we count the number of commits related to that file and repeat a similar labeling process.

Table III shows several data samples from our datasets after labeling. The shown features, i.e., architectural smells in our case, are CO (Concern Overload), SF (Scattered parasitic Functionality), LO (Link Overload), and DC (Dependency Cycle). The output features, i.e., labels, are the levels of issue-proneness and change-proneness. The two leftmost columns show the versions and filenames of each data point. The next eleven columns are binary features that indicate the presence (1) or absence (0) of a specific smell (recall Table I) in a given file.

TABLE III: Data samples from Hadoop

Vers.  | Filename               | CO | SF | LO | DC | ... | Iss | Chg
0.20.0 | dfs/DFSClient.java     | 0  | 1  | 1  | 1  | ... | H   | L
0.20.0 | mapred/JobTracker.java | 1  | 0  | 1  | 0  | ... | M   | M
0.20.0 | tools/Logalyzer.java   | 0  | 0  | 0  | 0  | ... | L   | L
...    | ...                    | ...| ...| ...| ...| ... | ... | ...
The two rightmost columns indicate the issue-proneness ("Iss") and change-proneness ("Chg") of the files. For example, in version 0.20.0 of Hadoop, DFSClient.java has three smells: SF, LO, and DC. The file's issue-proneness is high (H), and its change-proneness is low (L). On the other hand, both the issue- and change-proneness of JobTracker.java are medium (M).
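The double Pareto split can be sketched as follows; this is a minimal illustration under the assumption that per-file issue (or commit) counts have already been gathered, and the helper name pareto_labels is our own.

```python
def pareto_labels(counts):
    """Label the lowest 80% of data points 'low', the next 16% 'med', and
    the top 4% 'high' (the Pareto principle applied twice: 80/20, then
    80/20 again on the high-end 20%)."""
    n = len(counts)
    order = sorted(range(n), key=lambda i: counts[i])    # low to high
    low_cut, med_cut = (n * 80) // 100, (n * 96) // 100  # integer cut-offs
    labels = [None] * n
    for rank, i in enumerate(order):
        labels[i] = "low" if rank < low_cut else ("med" if rank < med_cut else "high")
    return labels

# 25 files: most have few issues, a handful have many (heavy tail).
issue_counts = [0, 0, 0, 1, 1, 0, 2, 1, 0, 0, 3, 1, 0, 2, 0,
                1, 0, 4, 1, 0, 6, 9, 15, 28, 41]
labels = pareto_labels(issue_counts)
print(labels.count("low"), labels.count("med"), labels.count("high"))  # 20 4 1
```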
3) Balancing the Data:
Due to the distribution of data and the labeling approach, we need to balance our datasets [55]. Recall from Section III-B2 that the low : med : high ratio of our datasets is 80:16:4 (i.e., 20:4:1). If such a dataset were used to train a prediction model, the most likely outcome would be a model that predicts "low" for every data point. As we are more interested in "high" and "med" labels, such a model would be useless. It is thus important to ensure that weighted metrics are not biased by less (or more) frequent labels.

We use SMOTE [11] to balance our dataset, oversampling "med" by a factor of 5 and "high" by a factor of 20. SMOTE is a technique that synthesizes new minority samples based on nearest neighbors between sample data points. Adding new minority samples guarantees that the dataset will be balanced, i.e., that the low : med : high ratio will be 1:1:1.
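As an illustration of this step, the sketch below applies the SMOTE implementation from the imbalanced-learn Python library, a stand-in for the WEKA filter actually used in the study, to synthetic data with the 80:16:4 label ratio.

```python
from collections import Counter
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
# 1,000 files with 11 binary smell features, labeled 80:16:4 (low:med:high).
X = rng.integers(0, 2, size=(1000, 11))
y = np.array(["low"] * 800 + ["med"] * 160 + ["high"] * 40)

# Oversample "med" by 5x and "high" by 20x so every class reaches 800 samples.
smote = SMOTE(sampling_strategy={"med": 800, "high": 800}, random_state=0)
X_bal, y_bal = smote.fit_resample(X, y)
print(Counter(y_bal))  # Counter({'low': 800, 'med': 800, 'high': 800})
```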
4) Training and Test Sets:
To build and test our prediction models, we use two different approaches for the two research questions, as illustrated in Figure 4. In the first approach, used for RQ1, one dataset is created for each subject system with a cross-validation setup. Specifically, we use 10-fold cross-validation, where the dataset is randomly divided into ten equal-sized subsets. Then, we sequentially select one subset and test it against the prediction model built from the other nine subsets. The final result is the mean of the ten tests' results. In the second approach, used for RQ2, we combine all subject systems and then divide them into two independent datasets: a training set, which comprises nine systems, and a test set, which comprises the single remaining system.

Fig. 4: Creating datasets to answer RQ1 (top: ten folds of a single system's dataset, e.g., Hadoop, with each fold used once as the test set) and RQ2 (bottom: one system held out as the test set; the remaining systems form the training set for that system's prediction model).
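The RQ1 scheme can be sketched as below, with scikit-learn and synthetic data standing in for WEKA and the real smell datasets.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(600, 11))            # 11 binary smell features
y = rng.choice(["low", "med", "high"], size=600)  # placeholder labels

fold_scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=1).split(X):
    # Train on nine folds, test on the held-out tenth fold.
    model = DecisionTreeClassifier(random_state=1)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(fold_scores))  # final result: mean over the ten folds
```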
5) Evaluation Metrics:
To evaluate the accuracy of our models, we use precision and recall [54]. Precision is the fraction of correctly predicted labels over all predicted labels. Recall is the fraction of correctly predicted labels over all actual labels. For illustration, consider the sample confusion matrix, shown in Table IV, that is produced after classifying 25 samples into "high", "med", and "low". The precision for the "high" label is the number of correctly predicted "high" samples (4) out of all samples predicted to be "high" (4+6+3=13), i.e., 30.8%.

TABLE IV: Example predicted vs. actual values

                     True/Actual
                High    Med    Low
Predict High    4       6      3
        Med     1       2      0
        Low     1       2      6
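The arithmetic can be checked directly from the matrix; the short sketch below reproduces Table IV, with rows as predicted labels and columns as actual labels.

```python
import numpy as np

labels = ["high", "med", "low"]
cm = np.array([[4, 6, 3],   # predicted "high"
               [1, 2, 0],   # predicted "med"
               [1, 2, 6]])  # predicted "low"

for i, lab in enumerate(labels):
    precision = cm[i, i] / cm[i, :].sum()  # correct / all predicted as lab
    recall = cm[i, i] / cm[:, i].sum()     # correct / all actually lab
    print(f"{lab}: precision {precision:.1%}, recall {recall:.1%}")
# high: precision 30.8% (4/13), recall 66.7% (4/6)
```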
6) Determining Baseline Models:
To determine the effectiveness of the prediction models, we need to compare them to a baseline. In this case, we consider a baseline model to be the simplest possible prediction. The model can be obtained through different approaches. For some problems this may be a random result, and for others it may be the most common prediction. As our dataset has been balanced (Section III-B3), the simplest approach is "uniform": generate predictions uniformly at random. This implies a prediction in which Table IV has equal values in all cells, giving us a model with both precision and recall of 33.3%.
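The 33.3% figure can be verified by simulation; a quick sketch of the uniform baseline on a balanced three-class dataset.

```python
import numpy as np

rng = np.random.default_rng(4)
labels = ["low", "med", "high"]
y_true = np.repeat(labels, 10_000)             # perfectly balanced classes
y_pred = rng.choice(labels, size=y_true.size)  # uniform random guessing

for lab in labels:
    precision = np.mean(y_true[y_pred == lab] == lab)
    recall = np.mean(y_pred[y_true == lab] == lab)
    print(f"{lab}: precision {precision:.3f}, recall {recall:.3f}")  # ~0.333
```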
IV. EMPIRICAL STUDY RESULTS
In this section, for each of the two research questions, we discuss the validation method and the associated findings.
RQ1:
To what extent can the architectural smells detected in a system help to predict the issue-proneness and change-proneness of that system at a given point in time?
In this prediction problem, all input features are binary (recall Table III), indicating whether a file contains an architectural smell. For this reason, decision-based techniques are most likely to yield good results [41]. Metrics collected from a range of models we built and evaluated using four different classification techniques—decision table [31], decision tree [56], logistic regression [39], and naive Bayes [28]—confirmed this. We thus only discuss the results obtained by the decision-table models.
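For illustration, such a comparison can be replicated along the following lines; scikit-learn learners stand in for the WEKA ones used in the study (scikit-learn offers no decision-table learner, so a decision tree is the closest analogue here), and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(500, 11))            # binary smell features
y = rng.choice(["low", "med", "high"], size=500)  # placeholder labels

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("logistic regression", LogisticRegression(max_iter=1000)),
                  ("naive Bayes", BernoulliNB())]:  # suits binary features
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```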
A. Issue-Proneness
Recall from Section III-B that, to compute issue-proneness, for each file in each version of a given system, we gather the file's associated architectural smells and the number of issues whose fixing commits changed the file. Table V shows the precision and recall of the models for predicting the issue-proneness of our subject systems from Table II. These metrics are computed using 10-fold cross-validation [32]. The bottom-most row shows the average values across all systems. For each system, we built different prediction models based on smells detected in the three architectural views (ACDC, ARC, and PKG). In total, 30 prediction models per system were created and evaluated.

TABLE V: Predicting issue-proneness

          ACDC                  ARC                   PKG
System    Precision  Recall    Precision  Recall     Precision  Recall
Camel     69.9%      68.4%     70.8%      67.0%      68.2%      62.8%
CXF       78.0%      76.7%     68.9%      68.3%      64.7%      63.8%
Hadoop    81.2%      80.1%     76.6%      76.6%      72.8%      73.4%
Ignite    78.9%      78.1%     78.9%      79.1%      70.4%      71.0%
Nutch     80.8%      71.6%     82.5%      82.7%      68.3%      52.1%
OpenJPA   71.4%      68.3%     74.5%      73.2%      69.2%      67.9%
Pig       71.7%      69.1%     71.3%      71.1%      68.6%      69.5%
Struts2   89.2%      89.0%     95.0%      94.8%      79.1%      78.3%
Wicket    69.2%      70.1%     76.7%      77.1%      63.7%      65.4%
ZooKeeper 72.0%      72.6%     70.8%      69.2%      68.7%      69.4%
Average   76.2%      74.4%     76.6%      75.9%      69.4%      67.4%
In general, the prediction models that relied on architectures recovered by ACDC and ARC were comparable in terms of accuracy: the average (precision, recall) for the ACDC and ARC models were (76.2%, 74.4%) and (76.6%, 75.9%), respectively. On the other hand, the models emerging from PKG yielded accuracy that was up to 13% lower. The models yielded very high predictive power in the cases of certain systems; for example, the ARC-based models for Struts2 achieved ≈95% precision and recall. Table VI breaks down the ACDC-based results by label, and Figure 5 plots them: for the "high" label, the models performed markedly better (2× in a majority of cases) than the baseline's 33.3%, further confirming that our models are useful for predicting files with high numbers of related issues.

TABLE VI: Predicting issue-proneness with "high", "med", and "low" labels under ACDC

          High                  Med                   Low
System    Precision  Recall    Precision  Recall     Precision  Recall
Camel     73.9%      56.9%     57.6%      63.6%      78.2%      69.9%
CXF       94.4%      83.3%     65.2%      76.0%      74.5%      70.7%
Hadoop    71.2%      81.5%     78.3%      78.1%      72.1%      81.5%
Ignite    93.8%      89.1%     66.4%      76.8%      76.6%      67.8%
Nutch     66.9%      94.7%     90.4%      61.2%      95.0%      75.8%
OpenJPA   69.3%      89.5%     65.9%      49.8%      79.1%      65.6%
Pig       80.5%      90.9%     72.9%      52.3%      61.8%      64.1%
Struts2   96.3%      95.7%     88.2%      81.1%      83.1%      90.4%
Wicket    78.8%      89.6%     59.3%      60.3%      69.5%      57.3%
ZooKeeper 71.0%      88.0%     64.2%      54.2%      80.7%      75.6%
Average   79.6%      85.9%     70.5%      65.3%      76.1%      71.9%

Fig. 5: Precision (a) and recall (b) of issue-proneness prediction for each label in ACDC.

Our results confirm that architectural smell-based models can accurately predict the issue-proneness of a system. In other words, architectural smells have a consistent impact on a system's implementation with respect to issue-proneness over the system's lifetime. This finding means that architectural decay can be a powerful indicator of the health of a system's implementation. It serves as a direct motivator for software engineers to pay more attention to the architecture, and architectural smells, in their systems. For example, system maintainers can use our models to foresee future problems, to devise refactoring plans, to prioritize their activities, etc.

The comparatively poorer performance of PKG in answering RQ1 suggests that implementation-package structure is not effective for measuring architectural decay and can mask deeper architectural problems. This observation is in line with previous findings [36], which showed that, compared to ACDC and ARC, PKG is markedly less useful for understanding the underlying architectural changes and their impact.

This leads to another observation. Recall the categorization of architectural smells in Section II-B and Table I: two of the four categories are dependency-based and concern-based smells. This suggests that ACDC (dependency-based recovery) and ARC (concern-based recovery) should inherently outperform PKG when such smells are encountered. It further suggests that targeting specific recovery techniques to specific types of smells, and then finding a way to combine their results, may yield even higher accuracy in our prediction models. We are exploring this hypothesis in our ongoing work.
B. Change-Proneness
Recall from Section III-B that, to compute change-proneness, for each file in each version of a given system we gather (1) the file's associated architectural smells and (2) the total number of changes to the file reflected in the implementation issues' fixing commits. We used the same approach to evaluate the accuracy of the 30 architectural models for each system in predicting change-proneness as we did for predicting issue-proneness.

Table VII shows the accuracy of our models. The models based on PKG-recovered architectures again have the lowest accuracy. In some systems, e.g., CXF and Nutch, the values for PKG-recovered architectures are 10-20% lower than the corresponding values in the other two views. The average (precision, recall) are (74.7%, 71.6%) and (73.6%, 73.5%) for the ACDC- and ARC-based architectural views, respectively.

TABLE VII: Predicting change-proneness

          ACDC                  ARC                   PKG
System    Precision  Recall    Precision  Recall     Precision  Recall
Camel     69.9%      63.4%     68.0%      67.1%      60.3%      61.0%
CXF       73.7%      70.8%     69.7%      63.4%      60.8%      63.4%
Hadoop    78.1%      73.2%     74.9%      74.8%      67.4%      70.0%
Ignite    77.5%      76.1%     75.8%      76.1%      68.7%      69.1%
Nutch     73.1%      66.8%     76.3%      78.0%      62.2%      46.1%
OpenJPA   78.3%      77.7%     74.3%      70.0%      68.2%      62.1%
Pig       70.1%      67.4%     69.6%      70.2%      65.9%      66.5%
Struts2   89.3%      85.8%     87.8%      96.7%      71.2%      73.7%
Wicket    66.6%      65.3%     72.1%      71.8%      62.7%      59.0%
ZooKeeper 69.9%      69.6%     67.8%      67.2%      65.5%      64.4%
Average   74.7%      71.6%     73.6%      73.5%      65.3%      63.5%

Notably, the values yielded when analyzing Struts2 are, once again, very high. A further investigation of Struts2's dataset highlighted a distinguishing characteristic: 36 of the analyzed versions are distributed across just four minor Struts2 versions: 2.0.x, 2.1.x, 2.2.x, and 2.3.x. In other words, the changes in most of these 36 versions were "patches". It is reasonable to expect that the architectures and detected smell instances between patches within a single minor version will be very similar. The prediction model for Struts2 benefits from this similarity and thus achieves very high accuracy in the cross-validation test. This suggests a promising strategy for building prediction models: to increase the accuracy of models used to predict properties of a system version, one should select recent versions instead of all versions across the entire system lifespan.

In summary, our results confirm that the historical data of a software system regarding its architectural smells, issues, and changes can be used to develop models that accurately predict the issue- and change-proneness of that system. The results also indicate that architectural smells have a consistent impact on software system implementations throughout the systems' lifetimes. Our architecture-based prediction approach, whose performance is usually two times better than the baseline, is useful for software maintainers to foresee likely future problems in newly smell-impacted parts of their system. The approach can also help in creating maintenance plans that can effectively reduce the system's issue- and change-proneness. Lastly, ACDC and ARC outperform PKG, emphasizing the importance of selecting the appropriate architecture recovery techniques and targeting them to the task at hand.
RQ2:
To what extent do unrelated software systems tend to share properties with respect to issue- and change-proneness?
The results obtained in answering RQ1 showed that architectural smells consistently impact the issue- and change-proneness of a software system during its lifetime. In that sense, RQ2 can be considered an extension of RQ1: we aim to understand whether architectural smells have consistent impacts across unrelated software systems, more specifically, whether the issue- and change-proneness of a system can be accurately predicted by models trained with data from unrelated systems. More deeply, this research question tries to assess whether there are fundamentally shared traits across software systems, regardless of their developers and development processes, implementation features, application domains, underlying designs, etc.

To answer this question, instead of using 10-fold cross-validation, we selected each subject system as the test system and used its dataset as the test set; the training set was then created by combining the datasets of the remaining nine systems. For reference, we also built a prediction model by combining all ten systems, i.e., including the test system. Note that the datasets of different subject systems have different sizes; we had to resample those datasets to the same size before combining them.
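A sketch of this setup follows; the system names are drawn from Table II's population but the data is synthetic, the resample helper is our own, and scikit-learn stands in for WEKA.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
# Per-system datasets of different sizes: 11 binary smell features + labels.
systems = {name: (rng.integers(0, 2, size=(n, 11)),
                  rng.choice(["low", "med", "high"], size=n))
           for name, n in [("Camel", 900), ("CXF", 700), ("Hadoop", 1200)]}

def resample(X, y, size, rng):
    """Resample a dataset to a common size before combining systems."""
    idx = rng.choice(len(X), size=size, replace=True)
    return X[idx], y[idx]

for test_name in systems:
    # Train on all systems except the held-out test system ("9 Others").
    parts = [resample(*systems[s], 500, rng) for s in systems if s != test_name]
    X_train = np.vstack([p[0] for p in parts])
    y_train = np.concatenate([p[1] for p in parts])
    X_test, y_test = systems[test_name]
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(test_name, round(model.score(X_test, y_test), 3))
```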
A. Issue-Proneness
Tables VIII, IX, and X summarize the precision and recall values of the RQ2 experiments with regard to predicting issue-proneness under ACDC, ARC, and PKG, respectively. The leftmost columns of these tables show the lists of systems. The precision and recall values are presented for three different cases:
1) "10-fold" column – 10-fold cross-validation on the test set. We reproduce this result from RQ1 for easy reference.
2) "All 10" column – Models trained on datasets from all 10 systems, including the test set.
3) "9 Others" column – Models trained on the 9 other systems' datasets, not including the test set.
In total, besides the 300 issue-proneness prediction models (30 per system) that emerged from RQ1's analysis, we built and evaluated 60 additional issue-proneness models to answer RQ2.

TABLE VIII: Predicting issue-proneness – precision (top) and recall (bottom) under ACDC

System    10-fold (RQ1)  All 10   9 Others
Camel     69.9%          64.8%    53.6%
CXF       78.0%          71.4%    66.4%
Hadoop    81.2%          71.1%    62.8%
Ignite    78.9%          73.9%    60.2%
Nutch     80.8%          74.9%    59.6%
OpenJPA   71.4%          68.8%    63.9%
Pig       71.7%          66.8%    61.4%
Struts2   89.2%          77.1%    69.1%
Wicket    69.2%          66.7%    55.0%
ZooKeeper 72.0%          65.4%    56.0%

Camel     68.4%          57.5%    46.7%
CXF       76.7%          71.3%    65.7%
Hadoop    80.1%          69.2%    62.9%
Ignite    78.1%          73.5%    59.3%
Nutch     71.6%          68.8%    54.4%
OpenJPA   68.3%          63.0%    57.3%
Pig       69.1%          64.1%    58.8%
Struts2   89.0%          76.4%    68.8%
Wicket    70.1%          66.0%    54.9%
ZooKeeper 72.6%          60.3%    56.9%

TABLE IX: Predicting issue-proneness – precision (top) and recall (bottom) under ARC

System    10-fold (RQ1)  All 10   9 Others
Camel     70.8%          64.9%    59.7%
CXF       68.9%          55.2%    49.0%
Hadoop    76.6%          67.6%    59.6%
Ignite    78.9%          66.9%    62.3%
Nutch     82.5%          64.6%    62.3%
OpenJPA   74.5%          66.9%    63.9%
Pig       71.3%          62.1%    61.7%
Struts2   95.0%          76.1%    63.8%
Wicket    76.7%          63.3%    62.0%
ZooKeeper 70.8%          66.3%    50.4%

Camel     67.0%          59.4%    48.5%
CXF       68.3%          62.3%    54.5%
Hadoop    76.6%          67.4%    59.4%
Ignite    79.1%          66.5%    61.6%
Nutch     82.7%          58.1%    53.9%
OpenJPA   73.2%          65.5%    62.0%
Pig       71.1%          62.5%    61.1%
Struts2   94.8%          75.7%    63.7%
Wicket    77.1%          65.3%    63.6%
ZooKeeper 69.2%          67.1%    56.4%

TABLE X: Predicting issue-proneness – precision (top) and recall (bottom) under PKG

System    10-fold (RQ1)  All 10   9 Others
Camel     68.2%          59.5%    46.0%
CXF       64.7%          62.7%    59.1%
Hadoop    72.8%          61.8%    50.2%
Ignite    70.4%          70.2%    62.6%
Nutch     68.3%          66.9%    51.9%
OpenJPA   69.2%          71.2%    53.1%
Pig       68.6%          68.0%    53.6%
Struts2   79.1%          92.4%    67.6%
Wicket    63.7%          66.1%    60.2%
ZooKeeper 68.7%          66.3%    44.0%

Camel     62.8%          50.9%    43.5%
CXF       63.8%          60.0%    44.7%
Hadoop    73.4%          61.5%    50.3%
Ignite    71.0%          69.5%    62.3%
Nutch     62.1%          54.1%    50.9%
OpenJPA   67.9%          68.3%    39.2%
Pig       69.5%          68.0%    44.5%
Struts2   78.3%          92.0%    67.1%
Wicket    65.4%          66.1%    58.9%
ZooKeeper 69.4%          66.8%    42.7%

We found several consistent trends across all three architectural views. First, a prediction model built by combining datasets of multiple different software systems, even if the test system itself is included, has lower accuracy than the model built for that specific test system. This can be seen in all three Tables VIII, IX, and X, where the "All 10" columns have lower values for precision and recall than the corresponding "10-fold" (results from RQ1) columns.

More interesting is the case where the test system is excluded and the model is trained on the datasets from the remaining nine systems (the "9 Others" column). This represents the scenario of using a generic predictive model comprising entirely different systems. The precision and recall values predictably decrease further across all three architectural views. These results reflect the intuition that using datasets from different systems can create a more general-purpose model, but is also likely to add noise and reduce the model's ability to predict the properties of a specific system. Therefore, if a sufficiently large dataset for a given system is available, the system's prediction models should be trained only on that dataset.

At the same time, it is interesting to note that the loss of accuracy between the "10-fold" and "9 Others" models is relatively moderate: with few exceptions, it is on the order of 10-20%. On the lower end, one example exception is PKG's precision for Wicket's issue-proneness (Table X, top), where the discrepancy is only 3.5%. On the higher end, an interesting exception is the pair of precision and recall values obtained by ARC for Struts2 (Table IX), which are both more than 30% lower for the "9 Others" models. This ties to the above discussion of the limited types of smells that exist in Struts2: its uniqueness decreased the ability of other systems to predict its issue-proneness, just as it helped ensure highly accurate models when using only its own historical data.

Figure 6 shows a comparison of precision and recall between different combinations of ACDC-based models. We observe that using data from the "9 Others" systems can yield a relatively good prediction model, with at least a 50% improvement compared to the baseline (0.5 vs. 0.33). In addition, the accuracy of the "All 10" models lends support to a hypothesis that if a system has a short history of development, then including generic data can help improve predictive performance. We are currently evaluating this hypothesis more extensively.

Fig. 6: Precision (a) and recall (b) of predicting issue-proneness under ACDC.
B. Change-Proneness
We observed analogous trends to those discussed above in the experiments that attempt to predict change-proneness using unrelated systems' datasets. We elide this data for space.

In summary, the results of the experiments conducted in the context of RQ2 confirm that software systems tend to share properties with respect to issue- and change-proneness. The accuracy of general-purpose models is lower than that of specific models, but the gap is not prohibitive. Our results suggest that developers can use general-purpose models to get an overall sense of the likely issue- and change-proneness of a new software system in the early stages of its development, before sufficiently large numbers of system versions become available. Similarly, developers can use such models to predict important properties of any existing systems for which historical data is missing, spotty, or unreliable.

An interesting question is whether restricting general-purpose models to systems that are likely to share certain key characteristics can improve the models' predictive power. This is something we have not done in our current study: while the set of test systems we used share some characteristics (e.g., Java-based enterprise systems and Apache projects), they are also inherently different systems targeting a variety of domains. Our ongoing work is investigating whether taking into account factors such as the role of the employed development processes, off-the-shelf frameworks, system design principles and patterns, application domains, etc. can be used to increase the accuracy of the general-purpose models.

Overall, the predictive models we developed provide developers another tool to check and maintain their software system's health and track technical debt. A straightforward way to identify "unhealthy" parts of a system is to look for long-lived smelly files, i.e., files that have been involved in architectural smells across a large number of system versions. These files have a high potential to introduce new issues. Figure 7 shows examples of such files from Hadoop and Struts2. The x-axes in both plots indicate system versions, while the y-axes indicate the numbers of smells in which each of the files is involved.

Fig. 7: Top-5 long-lived smelly files in Hadoop (top) and Struts2 (bottom).

From the collected data such as that depicted in Figure 7, we have observed that long-lived smelly files are repeatedly involved in new issues during a system's lifetime. For example, DFSClient.java and JobTracker.java have each been mentioned in a large number of Hadoop issues to date; Dispatcher.java is mentioned in ≈670 Struts2 issues; and so on. We posit that stemming such trends and properly addressing the underlying problems will require considering the architectural causes of these issues.
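A minimal sketch of this heuristic, assuming the smell-detection output has been flattened into (version, file) pairs; the threshold value is illustrative.

```python
from collections import Counter

# One (version, file) pair per smell involvement detected by the analysis.
smell_records = [
    ("0.20.0", "dfs/DFSClient.java"), ("0.20.1", "dfs/DFSClient.java"),
    ("0.20.2", "dfs/DFSClient.java"), ("0.20.0", "mapred/JobTracker.java"),
    ("0.20.2", "mapred/JobTracker.java"), ("0.20.1", "tools/Logalyzer.java"),
]

# Count the distinct versions in which each file participates in some smell.
versions_per_file = Counter(f for _, f in set(smell_records))
THRESHOLD = 2  # "large number of versions"; tune for real data
long_lived = sorted(f for f, n in versions_per_file.items() if n >= THRESHOLD)
print(long_lived)  # ['dfs/DFSClient.java', 'mapred/JobTracker.java']
```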
V. THREATS TO VALIDITY
The key threats to external validity include our subject systems. Most of the steps in our data gathering process are automated. However, manual intervention is required since each system has different implementation conventions. Due to the manually intensive data gathering process, we have used data from ten subject systems in our dataset. We mitigate a possible threat stemming from the number of systems by using data from their 466 versions and evaluating 720 prediction models.

All our subject systems are Apache projects, implemented in Java, and use the Jira issue tracking system. The reason for this is that it helped to simplify our data gathering and analysis workflow. In our ongoing work, we are expanding our analysis beyond Apache. The diversity of the chosen systems, however, helps to reduce this threat, as does the wide adoption of Apache software, Java, and Jira. Further, all the recovery techniques and smell definitions in this paper are language-independent.

Our study's construct validity is threatened by (1) the accuracy of the recovered architectural views, (2) the detection of architectural smells, and (3) the relevance of implementation issues. To mitigate the first threat, we applied three architecture recovery techniques (ACDC, ARC, and PKG) that had previously exhibited the greatest usefulness in an extensive comparative analysis of available techniques [19] and in a study of architectural change during system evolution [36], [7], [62]. The three techniques were developed independently and use different strategies for recovering a system's architecture. To mitigate the second threat, we selected architectural smell types that were previously studied on a smaller scale [42], [46], [38], [22], [21], and were shown to be strong indicators of architectural problems. Finally, to mitigate the third threat, we only collected "resolved" and "closed" issues, i.e., those issues that have been independently verified and fixed by developers.

The primary threat to our study's internal validity and conclusion validity involves the predictability relationship between reported implementation issues and architectural smells. Our prediction models are built based on significant correlations between architectural smells and implementation issues, which have been confirmed in other work [35]. Although correlation does not imply causality, we have shown examples of the causal relationship's existence. Prior work has also confirmed the causality between implementation issues and architectural smells via manual inspection [70], [44]. In addition, our observations are consistent across the ten systems.

VI. RELATED WORK
Predicting implementation issues and code change have been widely studied research problems in software maintenance. The main type of implementation issue that researchers were interested in early on was defects. Li et al. [40] used OO metrics as predictors of software maintenance effort. Subramanyam et al. [63] also demonstrated that a set of metrics [12] has significant implications on software defects. Nagappan et al. [49] found a representative set of code complexity measures to determine failure-prone software entities. However, the metrics considered in prior work cannot prevent defects at higher abstraction levels, such as architectural problems.

Issue prediction based on bug-fixing history is also an established area. Rahman et al. [58] developed an algorithm that ranks files by their numbers of past changes. The algorithm helps developers find hot spots in the system that need developers' attention. There are more sophisticated methods that combine historical information and software change impact analysis to increase the efficiency and accuracy of the prediction [67], [27], [57]. However, as before, these approaches do not explain higher-level defects caused by architectural decay.

Code changes have a close connection with defects in software. Nagappan et al. [48] used code churn to predict the defect density of software systems. Hassan et al. [26] used complexity metrics based on code changes to predict faults. Code change has been used in a number of other research efforts [71], [14], [37], [36] to evaluate system maintainability.

To predict code changes, Romano et al. proposed two approaches, relying on code metrics [59] and anti-patterns [60]. Xia et al.'s approach [69] predicts a system's change-proneness using co-change information of unrelated systems. While their approach is similar to the one we employed in the context of RQ2, it yields relatively low accuracy. Malhotra et al. [43] used hybridized techniques to identify change-prone classes. However, their empirical study is relatively small. Kouroshfar et al. [33] do use architectural information to study the correlation between co-changes across architectural models and defects. However, they restrict their study to cross-module changes.

VII. CONCLUSION
This paper's contributions are twofold. First, we have developed an approach that can identify parts of a software system that are likely targets of future maintenance activities based on architectural characteristics, as well as the change- and issue-proneness of different architectural elements. Second, we have conducted an empirical study that highlights the impact of architectural decay on ten well known open-source systems. We leverage the identified correlations between symptoms of architectural decay and reported implementation issues to develop an architecture-based approach that accurately predicts a system's issue- and change-proneness. Our approach has been validated on ten existing systems, considering 11 different types of smells under three different architectural views. This is the first study of its kind and, as such, its results can be treated as a foundation on which subsequent work should build. At the same time, the study has resulted in several important findings regarding the predictive power of architecture-based models.

Our study confirmed that architectural smells consistently impact a system's implementation during the system's lifecycle. In other words, the impact does not change significantly with other factors such as system size. This means that the detected architectural smells can help to accurately predict the issue-proneness and change-proneness of a system at any relevant point in time. In turn, such architecture-based prediction can serve as a useful tool for maintainers to recognize future problems associated with newly smell-impacted parts of the system and to plan their activities.

As a perhaps more unexpected result, we have shown that unrelated software systems tend to share properties with respect to issue- and change-proneness. This allows developers to use general-purpose models created with the available data from a set of existing systems to predict the properties of systems for which such information is missing. Unsurprisingly, the accuracy of such general-purpose models is lower than that of system-specific models, but not prohibitively so. Our results suggest that it is possible to develop such models sufficiently accurately to use them as a basis of actionable advice.

It is important to keep in mind that this was an initial attempt at constructing general-purpose prediction models. Our models were trained using all architectural smells and software systems, without particular prior planning. Our future work will investigate how to select an appropriate set of systems to improve the accuracy of these models. We will also explore whether further accuracy improvements can be achieved by restricting the types of architectural smells on which the models are trained.

VIII. ACKNOWLEDGMENTS
This work is supported by the U.S. National Science Foundation under grants 1717963, 1823354, and 1823262, and by the U.S. Office of Naval Research under grant N00014-17-1-2896.
EFERENCES[1] CXF Implementation Issue - CXF-223. https://issues.apache.org/jira/browse/CXF-223, 2007.[2] Pig Implementation Issue - PIG-1178. https://issues.apache.org/jira/browse/PIG-1178, 2010.[3] What is an issue. https://confluence.atlassian.com/jira064/what-is-an-issue-720416138.html, 2018.[4] Apache jira. https://issues.apache.org/jira, 2019.[5] GitHub. https://github.com/, 2019.[6] J. Arthur.
Six Sigma simplified: quantum improvement made easy .KnowWare International, 2001.[7] P. Behnamghader, D. M. Le, J. Garcia, D. Link, A. Shahbazian, andN. Medvidovic. A large-scale study of architectural evolution in open-source software systems.
Empirical Software Engineering , 2016.[8] B. W. Boehm. Value-based software engineering: Overview and agenda.In
Value-based software engineering , pages 3–14. Springer, 2006.[9] I. Bowman, R. Holt, and N. Brewster. Linux as a case study: its extractedsoftware architecture. In
ICSE , 1999.[10] F. Buschmann, K. Henney, and D. C. Schmidt.
Pattern-oriented softwarearchitecture, on patterns and pattern languages , volume 5. John wiley& sons, 2007.[11] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer.Smote: synthetic minority over-sampling technique.
Journal of artificialintelligence research , 16:321–357, 2002.[12] S. R. Chidamber and C. F. Kemerer. A metrics suite for object orienteddesign.
IEEE Trans. Softw. Eng. , 20(6):476–493, June 1994.[13] D. Coleman, D. Ash, B. Lowther, and P. Oman. Using metrics to evaluatesoftware system maintainability.
Computer , 27(8):44–49, Aug 1994.[14] M. D’Ambros, H. Gall, M. Lanza, and M. Pinzger. Analysing softwarerepositories to understand software evolution. In
Software evolution ,pages 37–67. Springer, 2008.[15] S. Foss, D. Korshunov, S. Zachary, et al.
An introduction to heavy-tailedand subexponential distributions , volume 6. Springer, 2011.[16] M. Fowler.
Refactoring: Improving the Design of Existing Code . Addison-Wesley Professional, 1999.[17] S. Ganesh, T. Sharma, and G. Suryanarayana. Towards a principle-basedclassification of structural design smells.
Journal of Object Technology ,12(2):1–1, 2013.[18] J. Garcia.
A Unified Framework for Studying Architectural Decay ofSoftware Systems . PhD thesis, University of Southern California, 2014.[19] J. Garcia, I. Ivkovic, and N. Medvidovic. A comparative analysisof software architecture recovery techniques. In
Automated SoftwareEngineering (ASE), 2013 IEEE/ACM 28th International Conference on ,pages 486–496, 2013.[20] J. Garcia, I. Krka, C. Mattmann, and N. Medvidovic. Obtaining ground-truth software architectures.
ICSE , 2013.[21] J. Garcia, D. Popescu, G. Edwards, and N. Medvidovic. Toward acatalogue of architectural bad smells. In
[23] J. Garcia, D. Popescu, C. Mattmann, N. Medvidovic, and Y. Cai. Enhancing architectural recovery using concerns. In ASE, 2011.
[24] T. Gyimothy, R. Ferenc, and I. Siket. Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering, 31(10):897–910, Oct 2005.
[25] T. Hall, M. Zhang, D. Bowes, and Y. Sun. Some code smells have a significant but small effect on faults. ACM Transactions on Software Engineering and Methodology, 23(4):33:1–33:39, Sept. 2014.
[26] A. E. Hassan. Predicting faults using the complexity of code changes. In Proceedings of the 31st International Conference on Software Engineering, pages 78–88. IEEE Computer Society, 2009.
[27] H. Hata, O. Mizuno, and T. Kikuno. Bug prediction based on fine-grained module histories. In Software Engineering (ICSE), 2012 34th International Conference on, pages 200–210. IEEE, 2012.
[28] G. H. John and P. Langley. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338–345. Morgan Kaufmann Publishers Inc., 1995.
[29] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller. Predicting faults from cached history. In Proceedings of the 29th International Conference on Software Engineering, ICSE ’07, pages 489–498, Washington, DC, USA, 2007. IEEE Computer Society.
[30] A. R. Kiremire. The application of the Pareto principle in software engineering. Consulted January, 13:2016, 2011.
[31] R. Kohavi. The power of decision tables. In European Conference on Machine Learning, pages 174–189. Springer, 1995.
[32] R. Kohavi et al. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, volume 14, pages 1137–1145, Montreal, Canada, 1995.
[33] E. Kouroshfar, M. Mirakhorli, H. Bagheri, L. Xiao, S. Malek, and Y. Cai. A study on the role of software architecture in the evolution and quality of software. In Proceedings of the 12th Working Conference on Mining Software Repositories, pages 246–257. IEEE Press, 2015.
[34] P. B. Kruchten. The 4+1 view model of architecture. IEEE Software, 1995.
[35] D. Le, D. Link, A. Shahbazian, and N. Medvidovic. An empirical study of architectural decay in open-source software. In ICSA, 2018.
[36] D. M. Le, P. Behnamghader, J. Garcia, D. Link, A. Shahbazian, and N. Medvidovic. An empirical study of architectural change in open-source software systems. In Proc. Mining Software Repositories, 2015.
[37] D. M. Le, C. Carrillo, R. Capilla, and N. Medvidovic. Relating architectural decay and sustainability of software systems. In WICSA, 2016.
[38] D. M. Le and N. Medvidovic. Architectural-based speculative analysis to predict bugs in a software system. In Proceedings of the 38th International Conference on Software Engineering, ICSE ’16, pages 807–810, New York, NY, USA, 2016. ACM.
[39] S. Le Cessie and J. C. Van Houwelingen. Ridge estimators in logistic regression. Applied Statistics, pages 191–201, 1992.
[40] W. Li and S. Henry. Object-oriented metrics that predict maintainability. Journal of Systems and Software, 23(2):111–122, 1993.
[41] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40(3):203–228, 2000.
[42] I. Macia, J. Garcia, D. Popescu, A. Garcia, N. Medvidovic, and A. von Staa. Are automatically-detected code anomalies relevant to architectural modularity? An exploratory analysis of evolving systems. In Proceedings of the 11th Annual International Conference on Aspect-Oriented Software Development. ACM, 2012.
[43] R. Malhotra and M. Khanna. An exploratory study for software change prediction in object-oriented systems using hybridized techniques. Automated Software Engineering, 24(3):673–717, 2017.
[44] A. Martini, F. A. Fontana, A. Biaggi, and R. Roveda. Identifying and prioritizing architectural debt through architectural smells: A case study in a large software company. In C. E. Cuesta, D. Garlan, and J. Pérez, editors, Software Architecture, pages 320–335, Cham, 2018. Springer International Publishing.
[45] T. Mens and T. Tourwé. A survey of software refactoring. IEEE TSE, Jan. 2004.
[46] R. Mo, J. Garcia, Y. Cai, and N. Medvidovic. Mapping architectural decay instances to dependency models. In Managing Technical Debt (MTD), 2013 4th International Workshop on, pages 39–46, 2013.
[47] R. Moser, W. Pedrycz, and G. Succi. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In Proceedings of the 30th International Conference on Software Engineering, ICSE ’08, pages 181–190, New York, NY, USA, 2008. ACM.
[48] N. Nagappan and T. Ball. Use of relative code churn measures to predict system defect density. In Proceedings of the 27th International Conference on Software Engineering, pages 284–292. ACM, 2005.
[49] N. Nagappan, T. Ball, and A. Zeller. Mining metrics to predict component failures. In Proceedings of the 28th International Conference on Software Engineering. ACM, 2006.
[51] W. Oizumi, A. Garcia, T. Colanzi, M. Ferreira, and A. von Staa. When code-anomaly agglomerations represent architectural problems? An exploratory study. In Software Engineering (SBES), 2014 Brazilian Symposium on, pages 91–100, Sept 2014.
[52] V. Pareto and A. Page. Manuale di economia politica (Manual of Political Economy). Milan, Italy: Societa Editrice Libraia, 1906.
[53] D. E. Perry and A. L. Wolf. Foundations for the study of software architecture. ACM SIGSOFT SEN, 1992.
[54] D. M. Powers. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
[55] F. Provost. Machine learning from imbalanced data sets 101. In Proceedings of the AAAI’2000 Workshop on Imbalanced Data Sets, pages 1–3, 2000.
[56] J. R. Quinlan. C4.5: Programs for Machine Learning. Elsevier, 2014.
[57] F. Rahman and P. Devanbu. How, and why, process metrics are better. In Software Engineering (ICSE), 2013 35th International Conference on, pages 432–441. IEEE, 2013.
[58] F. Rahman, D. Posnett, A. Hindle, E. Barr, and P. Devanbu. BugCache for inspections: Hit or miss? In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ESEC/FSE ’11, pages 322–331, New York, NY, USA, 2011. ACM.
[59] D. Romano and M. Pinzger. Using source code metrics to predict change-prone Java interfaces. In Software Maintenance (ICSM), 2011 27th IEEE International Conference on, pages 303–312. IEEE, 2011.
[60] D. Romano, P. Raila, M. Pinzger, and F. Khomh. Analyzing the impact of antipatterns on change-proneness using fine-grained source code changes. In Reverse Engineering (WCRE), 2012 19th Working Conference on, pages 437–446. IEEE, 2012.
[61] A. S. Sayyad and H. Ammar. Pareto-optimal search-based software engineering (POSBSE): A literature survey. In Realizing Artificial Intelligence Synergies in Software Engineering (RAISE), pages 21–27. IEEE, 2013.
[62] A. Shahbazian, D. Nam, and N. Medvidovic. Toward predicting architectural significance of implementation issues. In Mining Software Repositories (MSR), May 2018.
[63] R. Subramanyam and M. S. Krishnan. Empirical analysis of CK metrics for object-oriented design complexity: Implications for software defects. IEEE Transactions on Software Engineering, 29(4):297–310, April 2003.
[64] R. Taylor, N. Medvidovic, and E. Dashofy. Software Architecture: Foundations, Theory, and Practice. 2009.
[65] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1–13, 2007.
[66] V. Tzerpos and R. Holt. ACDC: An algorithm for comprehension-driven clustering. In Working Conference on Reverse Engineering (WCRE), 2000.
[67] S. Wang and D. Lo. Version history, similar report, and structure: Putting them together for improved bug localization. In Proceedings of the 22nd International Conference on Program Comprehension, pages 53–63. ACM, 2014.
[68] L. Wilkinson. Revising the Pareto chart. The American Statistician, 60(4):332–334, 2006.
[69] X. Xia, D. Lo, S. McIntosh, E. Shihab, and A. E. Hassan. Cross-project build co-change prediction. In Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on, pages 311–320. IEEE, 2015.
[70] L. Xiao. Detecting and preventing the architectural roots of bugs. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 811–813, New York, NY, USA, 2014. ACM.
[71] L. Xiao, Y. Cai, R. Kazman, R. Mo, and Q. Feng. Identifying and quantifying architectural debt. In Proceedings of the 38th International Conference on Software Engineering, ICSE ’16, pages 488–498, New York, NY, USA, 2016. ACM.
[72] K. Yamashita, S. McIntosh, Y. Kamei, A. E. Hassan, and N. Ubayashi. Revisiting the applicability of the Pareto principle to core development teams in open source software projects. In Proceedings of the 14th International Workshop on Principles of Software Evolution, pages 46–55. ACM, 2015.
[73] T. Zimmermann and N. Nagappan. Predicting defects using network analysis on dependency graphs. In