On the Interplay between Non-Functional Requirements and Builds on Continuous Integration

Klérisson V. R. Paixão*, Crícia Z. Felício†, Fernanda M. Delfim* and Marcelo de A. Maia*
*Universidade Federal de Uberlândia – Uberlândia (MG), Brazil
{klerisson, fernanda, marcelo.maia}@ufu.br
†Instituto Federal do Triângulo Mineiro – Uberlândia (MG), Brazil
[email protected]
Abstract—Continuous Integration (CI) implies that a whole developer team works together on the mainline of a software project. CI systems automate the builds of a software project. Sometimes a developer checks in code that breaks the build. A broken build might not be a problem by itself, but it has the potential to disrupt co-workers, hence it affects the performance of the team. In this study, we investigate the interplay between non-functional requirements (NFRs) and build statuses from 1,283 software projects. We found significant differences among the statuses of NFR-related builds. Thus, tools can be proposed to improve CI with a focus on new ways to prevent failures, especially for efficiency- and usability-related builds. Also, the time required to put a broken build back on track follows a bimodal distribution across all NFRs, with a higher peak within a day and a lower peak at about six weeks. Our results suggest that a more planned schedule for maintainability in Ruby, and for functionality and reliability in Java, would decrease delays related to broken builds.
Index Terms—Software repository mining; Continuous integration; Topic models; Non-functional requirements
I. INTRODUCTION

“In general the answer to how to stay efficient when a build is almost always broken is: stop breaking the build.” – Anonymous
This excerpt from an online Question and Answer community (https://perma.cc/NS8Z-3GX8) lays out competing concepts of Continuous Integration (CI) in the software industry. CI means that a whole developer team works together on the mainline of a software project [1]. The CI build process automatically takes source code commits, compiles the code, and then progresses through a pipeline of testing. Sometimes a developer checks into the source code repository something that breaks the build, i.e., code that does not compile or does not pass unit or code analysis tests. If, on one hand, this may disrupt colleagues' work, on the other, it prevents breakages from going unnoticed.

The build of a system is one of the first steps of moving software from development to customers. A failure in the build may disrupt not only the co-workers but also the business [2]. Hence, as important as avoiding broken builds is the time taken to fix the build. Longer times mean more wasted developer time. Understanding for what reasons a set of source code changes broke the build is hard without the developer's advice, yet it is crucial to prevent problems [3]. Moreover, relying on the developer for such analysis is only feasible in small-scale cases.

A growing body of work in software engineering uses topic analysis to make sense of textual data in software repositories [4]. As we gain access to larger datasets, it becomes important to scale our ability to conduct such analyses [5]. In this direction, Hindle et al. established a link between topics computed from commit messages and non-functional requirements (NFRs) [6]. Their technique enables large-scale topic analysis over such artifacts, because NFRs are widely spread across software projects. Furthermore, that work sheds some light on what a set of commit messages means in terms of NFRs. Therefore, if failed CI builds are related to certain NFRs, then developers can use topics to prevent failures.

In this study, we examine the NFR categories computed from the list of all commits that were built in a given build job from Travis-CI (a CI platform for open-source software development [7]) in 1,283 projects from GitHub. By studying a large corpus of projects, we aim to empirically investigate the interplay of NFRs and Travis-CI build statuses. Our research is guided by two main research questions:
RQ1. Which NFRs occur more frequently in failed Travis-CI builds than in successful ones?
RQ2. How long do NFR-related builds remain broken?

We found significant differences among the statuses of NFR-related builds. Thus, tools can be proposed to improve CI with a focus on new ways to prevent failures. Further, our results suggest that a more planned schedule for maintainability in Ruby, and for functionality and reliability in Java, would decrease delays related to broken builds.

The paper outline is standard: literature review, material and method description, results, and conclusions.

II. RELATED WORK
Our study inherits from a rich ecosystem of tools and applications for software repository mining, and draws on the insights of prior work on NFRs and topic modeling.
Non-functional requirements. While a functional requirement describes what a system should do, NFRs place constraints on the performance of the system, i.e., how it will do so [8]. NFRs may also describe aspects of the system related to its evolution over time, e.g., maintainability, extensibility, and documentation, to name a few. Unfortunately, there is a lot of disagreement on what NFRs really are. Mairiza et al. found 114 different NFR classes [9], which contrasts with the international standard ISO 9126 quality model [10], where NFRs are defined by six high-level classes: maintainability, functionality, portability, efficiency, usability, and reliability. Eckhardt et al. analyzed NFRs taken from industrial requirements specifications to better understand their nature [11]. Their results suggest that NFRs are buried in functional requirements, insofar as we should not make any distinction between them. Despite the aforementioned discussion on whether NFRs are correctly categorized, we cannot deny that NFR concepts pervade all modern software projects; therefore, we can use such definitions to compare projects.

Fig. 1. The automatic non-functional requirement labeling computes commit message topics from over 3 million Travis-CI build jobs.

Topic Modeling.
The history of topic models in academic research related to Software Engineering is long. For a comprehensive survey on this matter, we refer to Chen et al. [4]. Herein, we focus on topic analysis applied to commit messages. Hindle et al. mined commits in a windowed-time fashion [12]. They applied the latent Dirichlet allocation (LDA) [13] technique to 30-day periods of commit messages to identify topic trends. Their technique allows the automated summarization of “what has been done” in a given time. In another work [14], those authors also used topic analysis to annotate commit messages, among other software artifacts, and map the results onto software project phases. Their idea is to propose an alternative approach to monitor software process compliance. With respect to the study of CI build statuses, there are some common themes between our work and Santos and Hindle's work [15]. In that work, an n-gram language model was proposed to compute how “unusual” a commit message is. The results suggest a positive correlation between message unusualness and build failures.

In comparison, our goal is to investigate the correlation between the NFRs developers were working on and CI build statuses. We rely on the method of Hindle et al. that links a set of commit messages to NFRs [6]. However, since we are interested in system-wide build statuses, our NFR-related topics are extracted from all commits reported in CI builds.

III. MATERIAL AND METHOD
A. TravisTorrent Dataset
TravisTorrent [7] is a synthesis of software projects from GitHub that have Travis-CI enabled. Version 8.2.2017 comprises 3,702,595 builds from 1,283 projects. For our particular interest, the structure of the build entries involves the job id, project name, status, build duration, started timestamp, and all commits that were built.

Regarding the status of a build, there are five values in the dataset. We consider three of them in this study: passed, which means a project has been built and passed its test suite; failed, the project failed to build or failed its tests; and errored, a misconfiguration was found in the project. The last two statuses were grouped; ultimately, they both mean that the build is broken. We discard the other two statuses (started and canceled), because we know neither the process outcome nor the reasons behind a cancellation.

Additionally, with the project name, we fetch (clone) the repository from GitHub. Then, with the commit list, all messages are taken.
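As a minimal sketch of this filtering step (assuming the dataset has been exported to a CSV file, here hypothetically named travistorrent.csv, with a status column such as tr_status; actual column names may differ between TravisTorrent releases):

```python
import pandas as pd

# Load the TravisTorrent export (hypothetical file and column names).
builds = pd.read_csv("travistorrent.csv")

# Keep only the three statuses considered in the study.
builds = builds[builds["tr_status"].isin(["passed", "failed", "errored"])]

# Collapse 'failed' and 'errored' into a single 'broken' outcome.
builds["outcome"] = builds["tr_status"].map(
    lambda s: "passed" if s == "passed" else "broken"
)

print(builds["outcome"].value_counts())
```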
B. NFR Labeling
The overall automatic NFR labeling process is illustrated in Fig. 1. First, for each GitHub project we clone the repository. Then, we select the commits that were built for each build job. With the commits, we fetch the associated messages.

Such messages, per project, are given as input to the topic modeling phase. We use the Mallet toolkit [16] to generate 20 topics with 10 words per topic. To automatically label each build job with a topic, we use the exp3 word-list; please refer to the work of Hindle et al. [6] for details on the word-list generation. This word set consists of keywords separated by NFR (maintainability, functionality, portability, efficiency, usability, and reliability).

The motivation to choose this word-list instead of the others (exp1 or exp2) is that it contains more words per NFR category. Since we aim to contrast diverse projects, a broad list of words might be more representative. Recall that this process is done per project, so the topics computed for one project are not affected by the topics of other projects.

Finally, with each build job and its associated topic, we labeled the build job with an NFR whenever there was a match between the topic's words and the word-list.
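To make the matching step concrete, here is a minimal sketch; the keyword sets below are abridged, hypothetical stand-ins for the exp3 word-list of Hindle et al. [6], not the actual lists:

```python
# Minimal sketch of labeling a build job's topic with NFRs.
# The keyword sets are placeholders, not the real exp3 word-list.
NFR_WORDS = {
    "maintainability": {"refactor", "cleanup", "rename", "restructure"},
    "functionality": {"feature", "support", "implement", "add"},
    "portability": {"platform", "windows", "linux", "port"},
    "efficiency": {"performance", "speed", "memory", "cache"},
    "usability": {"ui", "usability", "documentation", "help"},
    "reliability": {"bug", "crash", "failure", "test"},
}

def label_build(topic_words):
    """Return the NFRs whose word-lists overlap the topic's top words."""
    labels = {nfr for nfr, words in NFR_WORDS.items()
              if words & set(topic_words)}
    return labels or {"unnamed"}  # no match: the 'Unnamed' category

# Example: top-10 words of a topic computed by Mallet for one project.
print(label_build(["fix", "bug", "crash", "error", "null",
                   "exception", "stack", "trace", "report", "issue"]))
```

When no word-list matches, the build falls into the Unnamed category reported in the results below.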
IV. RESULTS AND DISCUSSION
This section reports our results. For replication purposes, the raw data used for our analyses is available for download at https://doi.org/10.6084/m9.figshare.2279505.v1.

RQ1. Which NFRs occur more frequently in failed Travis-CI builds than in successful ones?

While a common best practice in continuous integration is to have all the tests passing at all times, build breakage happens. The primary endpoint of the study is to identify patterns of failure that might help developers prioritize their efforts on preventing such failures.
TABLE I. PAIRWISE CHI-SQUARE COMPARISON OF NFRS.

Fig. 2. Passed vs. Broken builds for (a) Ruby and (b) Java. Figures on bars indicate percentages.
Fig. 2 shows, for each NFR and for each programming language (Ruby and Java), the percentage of passed and broken builds. Unnamed indicates the number of builds the approach was not able to classify automatically, as explained in Section III-B. Therefore, we do not consider this category in our statistical analysis.

To test for a significant difference among the proportions of builds, we perform a pairwise Pearson chi-square test on a contingency table, where columns represent the builds per NFR and rows the build statuses (H0: the proportion of builds having different statuses does not change among NFRs). P-values < 0.05 were considered statistically significant. Table I shows the p-values of the paired NFRs.

For Ruby projects, the analyses revealed significant differences in 10 out of the 15 pairwise comparisons. There are no significant differences between efficiency and portability or usability. The same is observed with functionality and reliability or maintainability. With Java projects, all pairwise comparisons were significant except between efficiency and usability, functionality and portability, and maintainability and reliability.
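For illustration, a minimal sketch of one such pairwise test (the counts below are made up; the real contingency tables come from the labeled builds):

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table for one NFR pair:
# rows = build status (passed, broken);
# columns = NFRs (e.g., efficiency, usability).
table = [[900, 450],   # passed builds
         [300, 150]]   # broken builds

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # p < 0.05 => reject H0 for this pair
```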
RQ2. How long do NFR-related builds remain broken?

Here, we investigate the impact of broken builds considering the time elapsed until the build is fixed. Although it is not desirable to face build failures, they play an important role in the development process. For example, a broken build denotes a bug caught early [17]. However, since developers base their work on project branches, builds that remain broken for longer times affect the project's performance.

Table II shows the average time elapsed between a broken build and a subsequent passed one, grouped by NFR. Fig. 3 shows the graphical distribution of broken builds for each setting.
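A minimal sketch of how this elapsed time can be derived from a per-project build log (assuming a DataFrame like the one sketched in Section III-A, with a started_at timestamp and the passed/broken outcome; this is our reading of the metric, not a published script):

```python
import pandas as pd

def times_to_fix(builds: pd.DataFrame) -> pd.Series:
    """Elapsed time from each broken build to the next passed build."""
    builds = builds.copy()
    builds["started_at"] = pd.to_datetime(builds["started_at"])
    builds = builds.sort_values("started_at").reset_index(drop=True)
    durations = []
    for i, row in builds.iterrows():
        if row["outcome"] != "broken":
            continue
        # The first subsequent build that passed marks the fix.
        fixed = builds.iloc[i + 1:].query("outcome == 'passed'")
        if not fixed.empty:
            durations.append(fixed.iloc[0]["started_at"] - row["started_at"])
    return pd.Series(durations)

# Averaging these durations per NFR yields numbers comparable to Table II.
```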
Discussion. The goal of RQ1 was to examine whether comparing the statuses of NFR-related builds reveals an impact on continuous integration builds. The study revealed significant results. For Ruby projects, despite the absolute number of builds related to efficiency, it holds the same proportion of passed and broken builds as the usability group. Together they represent around 70% of the builds. However, the RQ2 results expose that a broken build related to the efficiency NFR takes 1.6x more time on average to be fixed than a usability broken build. We observe a similar scenario for Java.

Fig. 3. Graphical distribution of broken builds along the time, grouped by NFR, for (a) Ruby and (b) Java.
TABLE II. AVERAGE DURATION OF BROKEN BUILDS IN MINUTES.

NFR              Ruby   Java
Maintainability   403     37
Functionality     144     58
Portability       121     34
Efficiency        118     40
Usability          73     36
Reliability       191     64
Unnamed           143     19
Total average     170     41
Broken builds from the reliability group have no significant proportional differences from the maintainability group, but an issue from the former group takes 1.7x more time on average than one from the latter.

Fig. 3 shows the distribution of the time taken from a broken build until a subsequent successful build, per NFR. Note that the distribution is bimodal along all NFRs. That is, it has two peaks, showing that most broken builds are either fixed within a day (higher peak) or take around six weeks (lower peak), which matches a common release methodology adopted in industry (rapid release cycles).

Our design decisions suggest a set of limitations, many of which we hope to address in future work. We did not measure the accuracy of the NFR labeling method. Further, we study the association of builds with only one topic, but there might be cases where they can be linked with multiple topics. Although our approach can be seen as a replication of the work of Hindle et al. [6], further evaluation is needed. Finally, we only consider NFRs in our study; thus, we restrict our discussion to the relationship of NFRs and build statuses. Future work could use or propose other classifications of builds.

V. CONCLUSION
We examined a large set of projects to expose the relationship between NFRs and CI build statuses. Certain categories of NFR-related builds are more prevalent, such as efficiency and usability, regardless of whether the language is Ruby or Java. So, recommendation systems to help avoid breakages in those kinds of builds would produce a larger overall impact on the whole process.

Moreover, builds related to maintainability for Ruby projects, and to functionality together with reliability for Java, take longer to be fixed. So, they could be postponed to whenever developers are available to watch the builds, avoiding conflicts among themselves.
Acknowledgment. We thank the Brazilian agencies CAPES, CNPq, and FAPEMIG.

REFERENCES
[2] …, in Proc. ICSME, 2014, pp. 41–50.
[3] B. Adams and S. McIntosh, “Modern release engineering in a nutshell – why researchers should care,” in Proc. SANER, 2016, pp. 78–90.
[4] T.-H. Chen, S. W. Thomas, and A. E. Hassan, “A survey on the use of topic models when mining software repositories,” Empir. Softw. Eng., vol. 21, no. 5, pp. 1843–1919, 2016.
[5] L. B. L. De Souza and M. D. A. Maia, “Do software categories impact coupling metrics?” in Proc. MSR, 2013, pp. 217–220.
[6] A. Hindle, N. A. Ernst, M. W. Godfrey, and J. Mylopoulos, “Automated topic naming,” Empir. Softw. Eng., vol. 18, no. 6, pp. 1125–1155, 2013.
[7] M. Beller, G. Gousios, and A. Zaidman, “TravisTorrent: Synthesizing Travis CI and GitHub for full-stack research on continuous integration,” in Proc. MSR, 2017.
[8] L. Chung, B. A. Nixon, E. Yu, and J. Mylopoulos, Non-Functional Requirements in Software Engineering. Springer, 2012, vol. 5.
[9] D. Mairiza, D. Zowghi, and N. Nurmuliani, “An investigation into the notion of non-functional requirements,” in Proc. SAC, 2010, pp. 311–317.
[10] ISO/IEC, ISO 9126. Software Engineering – Product Quality, 2001.
[11] J. Eckhardt, A. Vogelsang, and D. M. Fernández, “Are “non-functional” requirements really non-functional? An investigation of non-functional requirements in practice,” in Proc. ICSE, 2016, pp. 832–842.
[12] A. Hindle, M. W. Godfrey, and R. C. Holt, “What's hot and what's not: Windowed developer topic analysis,” in Proc. ICSM, 2009, pp. 339–348.
[13] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[14] A. Hindle, M. W. Godfrey, and R. C. Holt, “Software process recovery using recovered unified process views,” in Proc. ICSM, 2010, pp. 1–10.
[15] E. A. Santos and A. Hindle, “Judging a commit by its cover: Correlating commit message entropy with build status on Travis-CI,” in Proc. MSR, 2016, pp. 504–507.
[16] A. K. McCallum, “MALLET: A machine learning for language toolkit,” 2002.
[17] M. Hilton, T. Tunnell, K. Huang, D. Marinov, and D. Dig, “Usage, costs, and benefits of continuous integration in open-source projects,” in Proc. ASE, 2016.