The effects of change decomposition on code review -- a controlled experiment
Marco di Biase, Magiel Bruntink, Arie van Deursen, Alberto Bacchelli
Delft University of Technology, Delft, The Netherlands; Software Improvement Group, Amsterdam, The Netherlands; University of Zurich, Zurich, Switzerland
ABSTRACT
Background:
Code review is a cognitively demanding and time-consuming process. Previous qualitative studies hinted at how decomposing change sets into multiple yet internally coherent ones would improve the reviewing process. So far, the literature has provided no quantitative analysis of this hypothesis.
Aims: (1) Quantitatively measure the effects of change decomposition on the outcome of code review (in terms of number of found defects, wrongly reported issues, suggested improvements, time, and understanding); (2) Qualitatively analyze how subjects approach the review and navigate the code, building knowledge and addressing existing issues, in large vs. decomposed changes.
Method:
Controlled experiment using the pull-based development model involving 28 software developers among professionals and graduate students.
Results:
Change decomposition leads to fewer wrongly reported issues, influences how subjects approach and conduct the review activity (by increasing context-seeking), yet impacts neither understanding the change rationale nor the number of found defects.
Conclusions:
Change decomposition not only reduces the noise for subsequent data analyses but also significantly supports the tasks of the developers in charge of reviewing the changes. As such, commits belonging to different concepts should be separated, adopting this as a best practice in software engineering.
Subjects
Human-Computer Interaction, Software Engineering
Keywords
Code review, Controlled experiment, Change decomposition, Pull-based development model
INTRODUCTION
Code review is one of the activities performed by software teams to check code quality, with the purpose of identifying issues and shortcomings (
Bacchelli & Bird, 2013). Nowadays, reviews are mostly performed in an iterative, informal, change- and tool-based fashion, also known as modern code review (MCR) (
Cohen, 2010). Software development teams, both in open source and industry, employ MCR to check code changes before being integrated into their codebases (
Rigby & Bird, 2013). Past research has provided evidence that MCR is associated with improved key software quality aspects, such as maintainability (
Morales, McIntosh & Khomh, 2015) and security (di Biase, Bruntink & Bacchelli, 2016), as well as with fewer defects (
McIntosh et al., 2016).
Reviewing a source code change is a cognitively demanding process. Researchers provided evidence that understanding the code change under review is among the most challenging tasks for reviewers (
Bacchelli & Bird, 2013). In this light, past studies have argued that code changes that, at the same time, address multiple, possibly unrelated concerns (also known as noisy (Murphy-Hill, Parnin & Black, 2012) or tangled changes (Herzig & Zeller, 2013)) can hinder the review process (
Herzig & Zeller, 2013; Kirinuki et al., 2014), by increasing the cognitive load for reviewers. Indeed, it is reasonable to think that grasping the rationale behind a change that spans multiple concepts in a system requires more effort than reviewing the same changes committed separately. Moreover, the noise could put a reviewer on the wrong track, thus leading to missed defects (false negatives) or to raising unfounded issues in sound code (false positives in this paper). Qualitative studies reported that professional developers perceive tangled code changes as problematic and asked for tools to automatically decompose them (
Tao et al., 2012; Barnett et al., 2015). Accordingly, change untangling mechanisms have been proposed (
Tao & Kim, 2015; Dias et al., 2015; Barnett et al., 2015). Although such tools are expectedly useful, the effect of change decomposition on review is an open research problem.
Tao & Kim (2015) presented the earliest and most relevant results in this area, showing that change decomposition allows practitioners to achieve their tasks better in a similar amount of time.
In this paper, we continue on this research line and focus on evaluating the effects of change decomposition on code review. We aim at answering questions such as: Is change decomposition beneficial for understanding the rationale of the change? Does it have an impact on the number/types of issues raised? Are there differences in time to review? Are there variations with respect to defect lifetime?
To this end, we designed a controlled experiment focusing on pull requests, a widespread approach to submit and review changes (Gousios et al., 2015). Our work investigates whether the results from
Tao & Kim (2015) can be replicated, and extends the knowledge on the topic. With a Java system as a subject, we asked 28 software developers among professionals and graduate students to review a refactoring and a new feature (according to professional developers (
Tao et al., 2012), these are the most difficult to review when tangled). We measure how the partitioning vs. non-partitioning of the changes impacts defects found, false positive issues, suggested improvements, time to review, and understanding of the change rationale. We also perform qualitative observations on how subjects conduct the review and address defects or raise false positives, in the two scenarios.
This paper makes the following contributions:
- The design of an asynchronous controlled experiment to assess the benefits of change decomposition in code review using pull requests, available for replication (di Biase et al., 2018);
- Empirical evidence that change decomposition in the pull-based review environment leads to fewer false positives.
RELATED WORK
Several studies explored tangled changes and concern separation in code reviews.
Tao et al. (2012) investigated the role of understanding code changes during the software development process, exploring practitioners' needs. Their study outlined that grasping the rationale when dealing with the process of code review is indispensable. Moreover, to understand a composite change, it is useful to break it into smaller ones, each concerning a single issue. Rigby et al. (2014) empirically studied the peer review process for six large, mature OSS projects, showing that small change size is essential to the more fine-grained style of peer review. Kirinuki et al. (2014) provided evidence about problems with the presence of multiple concepts in a single code change. They showed that these are unsuitable for merging code from different branches, and that tangled changes are difficult to review because practitioners have to seek the changes for the specified task in the commit.
Regarding empirical controlled experiments on the topic of code reviews, the most relevant work is by Uwano et al. (2006). They used an eye-tracker to characterize the performance of subjects reviewing source code. Their experimentation environment enabled them to identify a pattern called scan, consisting of the reviewer reading the entire code before investigating the details of each line. In addition, their qualitative analysis found that participants who did not spend enough time during the scan took more time to find defects. Uwano's experiment was replicated by Sharif, Falcone & Maletic (2012). Their results indicated that the longer participants spent in the scan, the quicker they were able to find the defect. Conversely, review performance decreases when participants did not spend sufficient time on the scan, because they find irrelevant lines. Recently, Baum, Schneider & Bacchelli (2019) highlighted how performance in code review is significantly higher when code changes are small, whereas complex and longer changes lead to lower review effectiveness.
Even if MCR is now a mainstream process, adopted in both open source and industrial projects, we found only two studies on change partitioning and its benefits for code review. The work by Barnett et al. (2015) analyzed the usefulness of an automatic technique for decomposing changesets. They found a positive association between change decomposition and the level of understanding of the changesets. According to their results, this would help time to review as the different contexts are separated.
Tao & Kim (2015) proposed a heuristic-based approach to decompose changesets with multiple concepts. They conducted a user study with students investigating whether their untangling approach affected the time and the correctness in performing review-related tasks. Results were promising: participants completed the tasks better with untangled changes in a similar amount of time. Despite the innovative techniques they proposed to untangle code changes and these promising results, the evaluation of the effects of change decomposition was preliminary.
In contrast, our research focuses on setting up and running an experiment to empirically assess the benefits of change decomposition for the process of code review, rather than evaluating the performance of an approach.
MOTIVATION AND RESEARCH OBJECTIVES
In this section, we present the context of our work and the research questions.
Experiment definition and context
Our analysis of the literature showed that there is only preliminary empirical evidence on how change decomposition affects code review outcomes: change understanding, time to completion, effectiveness (i.e., number of defects found), false positives (issues mistakenly identified as defects by the reviewer), and suggested improvements. This lack of empirical evidence motivates us in setting up a controlled experiment, exploiting the popular pull-based development model, to assess the conjecture that a proper separation of concerns in code review is beneficial to the efficiency and effectiveness of the review.
Pull requests feature asynchronous, tool-based activities in the bigger scope of pull-based software development (Gousios, Pinzger & Van Deursen, 2014). The pull-based software process features a distributed environment where changes to a system are proposed through patch submissions that are pulled and merged locally, rather than being directly pushed to a central repository.
Pull requests are the way contributors submit changes for review in GitHub. Change acceptance has to be granted by other team members called integrators (
Gousios et al., 2015). They have the crucial role of managing and integrating contributions and are responsible for inspecting the changes for functional and non-functional requirements. A total of 80% of integrators use pull requests as the means to review changes proposed to a system (
Gousios et al., 2015). In the context of distributed software development and change integration, GitHub is one of the most popular code hosting sites with support for pull-based development. GitHub pull requests contain a branch from which changes are compared by an automatic discovery of commits to be merged. Changes are then reviewed online. If further changes are requested, the pull request can be updated with new commits to address the comments. The inspection can be repeated and, when the patch set fits the requirements, the pull request can be merged to the master branch.
Research questions
The motivation behind MCR is to find defects and improve code quality (Bacchelli & Bird, 2013). We are interested in checking if reviewers are able to address defects (referred to in this paper as effectiveness). Furthermore, we focus on comments pointing out false positives (wrongly reported defects) and suggested improvements (non-critical non-functional issues such as suggested refactorings). Suggested improvements highlight reviewer participation (
McIntosh et al., 2014) and these comments are generally considered very useful (
Bosu, Greiler & Bird, 2015). Our first research question is:
RQ1.
Do tangled pull requests influence effectiveness (i.e., number of defects found), false positives, and suggested improvements of reviewers, when compared to untangled pull requests?
Based on the first research question, we formulate the following null-hypotheses for (statistical) testing. Tangled pull requests do not reduce:
- H1: the effectiveness of the reviewers during peer review
- H2: the false positives detected by the reviewers during peer review
- H3: the suggested improvements written by the reviewers during peer review
Given the structure and the settings of our experimentation, we can also measure the time spent on review activity and defect lifetime. Thus, our next research question is:
RQ2.
Do tangled pull requests influence the time necessary for a review and defect lifetime, when compared to untangled pull requests?
For the second research question, we formulate the following null-hypotheses. Tangled pull requests do not reduce:
- H4: time to review
- H5: defect lifetime
Further details on how we measure time and define defect lifetime are described in the section "Outcome Measurements".
In our study, we aim to measure whether change decomposition has an effect on understanding the rationale of the change under review. Understanding the rationale is the most important information need when analyzing a change, according to professional software developers (Tao et al., 2012). As such, the question we set out to answer is:
RQ3.
Do tangled pull requests influence the reviewers' understanding of the change rationale, when compared to untangled ones?
For our third research question, we test the following null-hypothesis. Tangled pull requests do not reduce:
- H6: change-understanding of reviewers during peer review when compared to untangled pull requests
Finally, we qualitatively investigate how participants individually perform the review to understand how they address defects or potentially raise false positives. Our last research question is then:
RQ4.
What are the differences in patterns and features used between reviews of tangled and untangled pull requests?
EXPERIMENTAL DESIGN AND METHOD
In this section, we detail how we designed the experiment and the research method that we followed.
Object system chosen for the experiment
The system that was used for reviews in the experiment is JPacman, an open-source Java system available on GitHub (https://github.com/SERG-Delft/jpacman-framework) that emulates a popular arcade game and is used at Delft University of Technology to teach software testing.
The system has about 3,000 lines of code and was selected because a more complex and larger project would require participants to grasp the rationale of a more elaborate system. In addition, the training phase required for the experiment would imply hours of effort, increasing the consequent fatigue that participants might experience. In the end, the experiment targets assessing differences in review partitioning and is tailored for a process rather than a product.
Recruiting of the subject participants
The study was conducted with 28 participants recruited by means of convenience sampling (
Wohlin et al., 2012) among experienced and professional software developers, PhD students, and MSc students. They were drawn from a population sample that volunteered to participate. The voluntary nature of participation implies consent to use the data gathered in the context of this study. The Delft University of Technology Human Research Committee approved our study with IRB approval. Software developers belong to three software companies, PhD students belong to three universities, and MSc students come from different faculties of Delft University of Technology. We involved as many different roles as possible to have a larger sample for our study and increase its external validity. Using a questionnaire, we asked about development experience, language-specific skills, and review experience as the number of reviews per week. We also included a question that asked whether a participant knew the source code of the game. Table 1 reports the results of the questionnaire, which are used to characterize our population and to identify key attributes of each subject participant.
Table 1: Descriptive data of the subject participants. Each group (control, reviewing tangled changes; treatment, reviewing untangled changes) comprised 6 software developers, 3 PhD students, and 5 MSc students.
Monitoring vs. realism
In line with the nature of pull-based software development and its peer review with pull requests, we designed the experimentation phase to be executed asynchronously. This implies that participants could run the experiment when and where they felt most comfortable, with no explicit constraints for place, time, or equipment. With this choice, we purposefully gave up some degree of control to increase realism. Having a more strictly controlled experimental environment would not replicate the usual way of running such tasks (i.e., asynchronous and informal). Besides, an experiment run synchronously in a laboratory would still raise some control challenges: it might be distracting for some participants, or even induce a "follow the crowd" behavior, thus leading to people rushing to finish their tasks.
To regain some degree of control, participants ran all the tasks in a provided virtual machine available in our replication package (di Biase et al., 2018). Moreover, we recorded a screencast of the experiment, thus leaving no room for misaligned results and mitigating issues of incorrect interpretation. Subjects were provided with instructions on how to use the virtual machine, but no time window was set.
Independent variable, group assignment, and duration
The independent variable of our study is change decomposition in pull requests. We split our subjects between a control group and a treatment group: the control group received one pull request containing a single commit with all the changes tied together; the treatment group received two pull requests with changes separated according to a logical decomposition. Our choice of using only two pull requests instead of a larger number is mainly dictated by the limited time participants were allotted for the experiment, and the possibly increased influence of distractions. Changes spanning a greater part of the codebase require additional expertise, knowledge, and focus, which reviewers might lack. Extensive literature in psychology (Shiffrin, 1988; Wickens, 1991; Cowan, 1998; James, 2013) reports that cognitive resources such as attention are finite. Since complex tasks like reviewing code drain such resources, the effectiveness of the measured outcomes will be negatively impacted by a longer duration.
Participants were randomly assigned to either the control group or the treatment group using strata based on experience as developers and previous knowledge. Previous research has shown that these factors have an impact on review outcome (Rigby et al., 2012; Bosu, Greiler & Bird, 2015). Developers who previously made changes to files to be reviewed had a higher proportion of useful comments.
All subjects were asked to run the experiment in a single session so that external distracting factors could be eliminated as much as possible. If a participant needed a pause, the pause is considered and excluded from the final result, as we measure and monitor periods of inactivity. We seek to reduce the impact of fatigue by limiting the expected time required for the experiment to an average of 60 min; this value is closer to the minimum rather than the median for similar experiments (Ko, LaToza & Burnett, 2015). As stated before, though, we did not suggest or force any strict limit on the duration of
the experiment, to the end of replicating the informal code review scenario. No learning effect is present, as every participant ran the experiment only once.
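To make the stratified assignment described above concrete, the following is a minimal sketch of stratified random assignment; the strata, field names, and participant data are hypothetical and only illustrate the idea of balancing experience and prior knowledge across the two groups, not the actual procedure or data of the study.

```python
import random
from collections import defaultdict

def assign_groups(participants, seed=42):
    """Stratified random assignment: balance each stratum across control/treatment.

    `participants` is a list of dicts with hypothetical keys
    'name', 'experience' ('high'/'low'), and 'knows_system' (bool).
    """
    random.seed(seed)
    strata = defaultdict(list)
    for p in participants:
        strata[(p["experience"], p["knows_system"])].append(p)

    assignment = {}
    for members in strata.values():
        random.shuffle(members)
        # Alternate within the stratum so both groups receive a similar
        # mix of experience and prior knowledge of the system.
        for i, p in enumerate(members):
            assignment[p["name"]] = "control" if i % 2 == 0 else "treatment"
    return assignment

# Hypothetical participants, for illustration only.
people = [
    {"name": "P1", "experience": "high", "knows_system": False},
    {"name": "P2", "experience": "high", "knows_system": False},
    {"name": "P3", "experience": "low", "knows_system": True},
    {"name": "P4", "experience": "low", "knows_system": True},
]
print(assign_groups(people))
```

Shuffling within each stratum before alternating preserves randomization while keeping the strata balanced between the two groups.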
Pilot experiments
We ran two pilot experiments to assess the settings. The first subject (a developer with five FTE-years of experience) took too long to complete the training and showed some issues with the virtual machine; a full-time employee (FTE) works the equivalent of 40 hours a week, and we consider one FTE-year when a person has worked the equivalent of 40 hours a week for one year (e.g., an individual working two years as a developer for 20 hours a week has 1 FTE-year of experience). Consequently, we restructured the training phase, addressing the potential environment issues in the material provided to participants. The second subject (an MSc student with little experience) successfully completed the experiment in 50 min with no issues. Both pilot experiments were executed asynchronously in the same way as the actual experiment.
Tasks of the experiment
The participants were asked to conduct the following four tasks. Further details are available in the online appendix (di Biase et al., 2018).
Preparing the environment
Participants were given precise and detailed instructions on how to set up the environment for running the experiment. These entailed installing the virtual machine, setting up the recording of the screen during the experiment, and troubleshooting common problems, such as network or screen resolution issues.
Training the participants
Before starting with the review phase, we first ensured that the participants were sufficiently familiar with the system. It is likely that the participants had never seen the codebase before: this situation would limit the realism of the subsequent review task. To train our participants we asked subjects to implement three different features in the system:
1. change the way the player moves on the game board, using different keys;
2. check if the game board has null squares (a board is made of multiple squares) and perform this check when the board is created; and
3. implement a new enemy in the game, with artificial intelligence similar to another enemy but with different parameters.
This learning-by-doing approach is expected to have higher effectiveness than providing training material to participants (Slavin, 1987). By definition, this approach is a method of instruction where the focus is on the role of feedback in learning. The desired features required changes across the system's codebase. The third feature to be implemented targeted the classes and components of the game that would be the object of the review tasks. This feature was placed last to progressively increase the level of difficulty.
No time window was given to participants, aiming for a more realistic scenario. As explicitly mentioned in the provided instructions, participants were allowed to use any source for retrieving information about something they did not know. This was permitted as the study does not aim to assess skills in implementing some functionality in a
programming language. The only limitation was that the participants had to use the tools within the virtual machine.
The virtual machine provided the participants with the Eclipse Java IDE. The setup already had the project imported in Eclipse's workspace. We used an Eclipse plugin, WatchDog (Beller et al., 2015), to monitor development activity. With this plugin, we measured how much time participants spent reading, typing, or using the IDE. The purpose was to quantify the time to understand code among participants and whether this relates to a different outcome in the following phases. Results for this phase are shown in Fig. 1, which contains boxplots depicting the data. It shows that there is no significant difference between the two groups. We verified the absence of a statistically significant difference by performing Mann-Whitney U-tests on the four variables in Fig. 1, with the following p-values: IDE active: p = 0.98, user active: p = 0.80, reading: p = 0.73, writing: p = 0.73.
Figure 1: Boxplots for training phase measurements. The results highlight no differences between the two groups.
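As an illustration of the kind of statistical comparison used here (and for the later research questions), the sketch below shows how a two-sided Mann-Whitney U-test could be run on per-group measurements. The numeric values are made up for illustration; the real measurements come from WatchDog and the replication package.

```python
from scipy.stats import mannwhitneyu

# Hypothetical reading times (in seconds) for the two groups.
control   = [300, 420, 380, 510, 290, 350, 400]
treatment = [310, 400, 390, 480, 300, 360, 410]

# Two-sided Mann-Whitney U-test: no normality assumption is required.
stat, p_value = mannwhitneyu(control, treatment, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.2f}")
```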
Perform code review on proposed change(s)
Participants were asked to review two changes made to the system:
1. the implementation of the artificial intelligence for one of the enemies, and
2. the refactoring of a method in all enemy classes (moving its logic to the parent class).
These changes can be inspected in the online appendix (di Biase et al., 2018) and have been chosen to meet the same criteria used by Herzig, Just & Zeller (2016) when choosing tangled changes. The proposed changes can be classified as refactoring and enhancement. Previous literature gave insight as to how these two kinds of changes, when tangled together, are the hardest to review (Tao et al., 2012). Although recent research proposed a theory for the optimal ordering of code changes in a review (
Baum, Schneider & Bacchelli, 2017), we used the default ordering and presentation provided by GitHub, because it is the de-facto standard. Changesets were included in pull requests on private GitHub repositories so that participants performed the tasks in a real-world review environment. Pull requests had identical descriptions for both the control and the treatment, with no additional information except their descriptive title. While research showed that a short description may lead to poor review participation (
Thongtanunam et al., 2017), this does not apply to our experiment as there is no interaction among subjects.
Subjects were instructed to understand the change and check its functional correctness. We asked the participants to comment on the pull request(s) if they found any problem in the code, such as any functional error related to correctness and issues with code quality. The proposed changes had three different functional issues that were intentionally injected into the source code. Participants could see the source code of the whole project in case they needed more context, but only through GitHub's browser-based UI. The size of the changeset was around 100 lines of code and it involved seven files. Gousios, Pinzger & Van Deursen (2014) showed that the number of total lines changed by pull requests is on average less than 500, with a median of 20. Thus, the number of lines of the changeset used in this study is between the median and the average. Our changeset size is also consistent with recent research which found that code review is highly effective on smaller changes (
Baum, Schneider & Bacchelli, 2019) and that reviewability can be empirically defined through several factors, one being change size (Ram et al., 2018).
Post-experiment questionnaire
In the last phase, participants were asked to answer a post-experiment questionnaire. Questions are shown in the section "Results", RQ3: Q1-Q4 were about change-understanding, while Q5-Q12 involved subjects' opinions about changeset comprehension and its correctness, rationale, understanding, etc. Q5-Q12 were a summary of interesting aspects that developers need to grasp in a code change, as mentioned in the study of
Tao et al. (2012). The answers had to be provided on a Likert scale (
Oppenheim, 2000) ranging from "Strongly disagree" (1) to "Strongly agree" (5).
Outcome measurements
Effectiveness, false positives, suggested improvements
Subjects were asked to comment on a pull request either in the pull request discussion or with an in-line comment on a commit belonging to that pull request. The number of comments addressing functional issues was counted as the effectiveness. At the same time, we also measured false positives (i.e., comments in a pull request that do not address a real issue in the code) and suggested improvements (i.e., remarks on other non-critical non-functional issues). We distinguished suggested improvements and false positives from the comments that correctly addressed an issue because the three functional defects were intentionally put in the source code. Comments that did not directly and correctly tackle one of these three issues were classified either as false positives or suggested improvements. They were classified by the first author by looking at the description provided by the subject. A correctly identified issue needs to highlight the problem, and optionally provide a short description.
Time
Having the screencast of the whole experiment, as well as using tools that give time measures, we gathered the following measurements:
- Time for Task 2, in particular the total time Eclipse is opened/active and the total time the user is active/reading/typing, as collected by WatchDog (section "Tasks of the Experiment").
- Total net time for Task 3, defined as the time from when the subject opens a pull request until when (s)he completes the review, excluding any breaks.
- Defect lifetime, defined as the period during which a defect continues to exist. It is measured from the moment the subject opens a pull request to when (s)he writes a comment that correctly identifies the issue. In the case of multiple comments on the same pull request, this is the time between finishing with one defect and addressing the next. A similar measure was previously used by Prechelt & Tichy (1998).
All the above measures are collected in seconds elapsed.
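As an illustration of how such time measures can be derived from annotated screencast events, the sketch below computes a net review time and a defect lifetime from hypothetical timestamps; the event names, break, and values are assumptions for illustration only, while the actual values were taken manually from the screencasts and WatchDog logs.

```python
from datetime import datetime, timedelta

def seconds_between(start, end, breaks=()):
    """Elapsed seconds from start to end, excluding inactivity breaks."""
    total = (end - start).total_seconds()
    pauses = sum(b.total_seconds() for b in breaks)
    return total - pauses

# Hypothetical timestamps extracted from one screencast.
pr_opened      = datetime(2018, 6, 1, 10, 0, 0)
defect_comment = datetime(2018, 6, 1, 10, 9, 30)
review_done    = datetime(2018, 6, 1, 10, 14, 0)
coffee_break   = timedelta(minutes=2)

defect_lifetime = seconds_between(pr_opened, defect_comment)
net_review_time = seconds_between(pr_opened, review_done, breaks=[coffee_break])
print(defect_lifetime, net_review_time)  # 570.0 720.0
```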
Change-understanding
In this experiment, change understanding was measured by means of a questionnaire submitted to participants after the review activity, as mentioned in Task 4 in the section "Tasks of the Experiment." Questions are shown in the section "Results," RQ3, from Q1 to Q4. The aim is to evaluate differences in change-understanding. A similar technique was used by
Binkley et al. (2013).
Final survey
Lastly, participants were asked to give their opinion on statements targeting the perception of correctness, understanding, rationale, logical partitioning of the changeset, difficulty in navigating the changeset in the pull request, comprehensibility, and the structure of the changes. This phase, as well as the previous one, was included in Task 4, corresponding to questions Q5-Q12 (section "Results," RQ3). Results were given on a Likert scale from "Strongly disagree" (1) to "Strongly agree" (5) (Oppenheim, 2000), reported as mean, median, and standard deviation over the two groups, and tested for statistical significance with the Mann-Whitney U-test.
Research method for RQ4
For our last research question, we aimed to build some initial hypotheses to explain the results from the previous research questions. We sought to identify which actions and patterns led a reviewer to find an issue or raise a false positive, as well as to write other comments. This method was applied only to the review phase, without analyzing actions and patterns concerning the training phase. The method to map actions to concepts started by annotating the screencasts retrieved after the conclusion of the experimental phase. Subjects performed a series of actions that defined and characterized both the outcome and the execution of the review. The first author inserted notes regarding actions performed by participants to build a knowledge base of steps (e.g., participant opens fileName, participant uses the GitHub search box with a keyword, etc.). Using the methodology for qualitative content analysis delineated by Schreier (2013), we firstly defined the coding frame. Our goal was to characterize the review activity based on patterns and behaviors. As previous studies already tackled this problem and came up with reliable categories, we used the investigations by Tao et al. (2012) and Sillito, Murphy & De Volder (2006) as the base for our frame. We used the concepts from
Tao et al. (2012) regarding Information needs for reasoning and assessing the change and Exploring the context and impact of the change, as well as the Initial focus points and Building on initial focus points steps from Sillito, Murphy & De Volder (2006). To code the transcriptions, we used deductive category application, resembling the data-driven content analysis technique by
Mayring (2000). We read the transcribed material, checking whether a concept covered a transcribed action (e.g., participant opens file fileName so that (s)he is looking for context). We grouped actions covered by the same concept (e.g., a participant opens three files, but always for context purposes) and continued until we built a pattern that led to a specific outcome (i.e., addressing a defect or a false positive). We split the patterns according to their concept ordering such that those that led to more defects found or false positive issues were visible.
THREATS TO VALIDITY AND LIMITATIONS
Internal validity
The sample size of our experiment poses an inherent threat to its internal validity. Furthermore, using a different experimental strategy (e.g., that used by
Tao & Kim (2015)) would remove personal performance biases, while causing a measurable learning effect. In fact,
Wohlin et al. (2012) state that "due to learning effects, a human subject cannot apply two methods to the same piece of code." This would affect the study goals and construct validity. In addition, the design and asynchronous execution of the experimental phase creates uncertainty regarding possible external interactions. We could not control random changes in the experimental setting, and this translates to possible disturbances coming from the surrounding environment that could cause skewed results. Moreover, our experimental settings could not control whether participants interacted among themselves, although participants did not have any information about each other.
Regarding the statistical regression (
Wohlin et al., 2012), the tests used in our study were not performed with the Bonferroni correction, following the advice by Perneger: "Adjustments are, at best, unnecessary and, at worst, deleterious to sound statistical inference" (Perneger, 1998). Other corrections such as the false discovery rate (FDR) are also not suited for our study. The de-facto standard to perform the FDR correction is the Benjamini-Hochberg (BH) procedure (Benjamini & Hochberg, 1995). The BH procedure, though useful when dealing with large numbers of p-values (e.g., 100), needs careful adjustment of a threshold to detect false positives. The number of statistical tests performed in our study is small enough to warrant not applying FDR or other significance corrections.
Construct validity
With respect to the restricted generalizability across constructs (Wohlin et al., 2012), in our experiment we uniquely aim to measure the values presented in the section "Outcome Measurements." The treatment might influence the direct values we measure, but it could potentially cause negative effects on concepts that our study does not capture. Additionally, we acknowledge threats regarding the time measures taken by the first author for RQ2. Clearly, manual measures are suboptimal; they were adopted to avoid participants having to perform such measures themselves. When running an experiment, participants might try to guess the purpose of the experimentation phase. Therefore, we could not control their behavior based on guesses that either positively or negatively affected the outcome. Threats to construct validity are also connected to the design of the questionnaires used for RQ3, despite their being designed using standard ways and scales (Oppenheim, 2000). Finally, threats connected to the manual annotation of the screencasts recorded and analyzed by the first author could lead to misinterpreted or misclassified actions performed by the participants in our experiment.
External validity
Threats to external validity for this experiment concern the selection of participants for the experimentation phase. Volunteers selected with convenience sampling could have an impact on the generalizability of results, which we tried to mitigate by sampling multiple roles for the task. If the group is very heterogeneous, there is a risk that the variation due to individual differences is larger than that due to the treatment (
Cook & Campbell, 1979). Furthermore, we acknowledge and discuss the possible threat regarding the system selection for the experimental phase. Naturally, the system used is not fully representative of a real-world scenario. Our choice, however, as explained in the section "Object System Chosen for the Experiment," aims to reduce the training phase effort required from participants and to encourage the completion of the experiment. Although research empirically showed that small code changes are easier to review (Ram et al., 2018) and current industrial practice reports that reviews are almost always done on very small changesets (
Sadowski et al., 2018), the external validity of our study is influenced by the size of the changes under review and the number of pull requests used in the experimental setup. The lack of empirical studies that provide an initial reference on the size of tangled changesets left us unable to address such threats. Future research should provide empirical evidence about the average tangled change, as well as the impact of larger changesets or a larger number of pull requests on the results of our experiment. Finally, our experiment was designed considering only a single programming language, using the pull-based methodology to review and accept the proposed changes, and using GitHub as a platform. Therefore, threats for our experiment are related to mono-operation and mono-method bias (Wohlin et al., 2012).
RESULTS
This section presents the results for the four research questions we introduced in the section "Research Questions".
RQ1. Effectiveness, false positives, and suggestions
For our first research question, descriptive statistics about the results are shown in Table 2. It contains data about the effectiveness of participants (i.e., the number of issues correctly addressed), false positives, and the number of suggested improvements. Given the sample size, we applied a non-parametric test and performed a Mann-Whitney U-test to test for differences between the control and the treatment group. This test, unlike a t-test, does not require the assumption of a normal distribution of the samples. Results of the statistical test are intended to be significant for a confidence level of 95%.
Table 2: RQ1 — number of defects found (effectiveness), false positives, and suggested improvements, with per-group descriptive statistics (14 subjects per group), 95% confidence intervals, and p-values.
Results indicate a significant difference between the control and the treatment group regarding the number of false positives, with a p-value of 0.03. On the contrary, there is no statistically significant difference regarding the number of defects found (effectiveness) and the number of suggested improvements.
An example of a false positive is when one of the subjects of the control group writes: "This doesn't sound correct to me. Might want to fix the for, as the variable varName is never used." This is not a defect, as varName is used to check how many times the for-statement has to be executed, despite not being used inside the statement. This is also written in a code comment. Another false positive example is provided by a participant in the control group who, reading the refactoring proposed by the changeset under review, writes: "The method methodName is used only in Class ClassName, so fix this." This is not a defect, as the same methodName is used by the other classes in the hierarchy. As such, we can reject only the null hypothesis H2 regarding the false positives, while we cannot provide statistically significant evidence about the other two variables tested in H1 and H3.
The statistical significance alone for the false positives does not provide a measure of the actual impact of the treatment. To measure the effect size of the factor over the dependent variable we chose Cliff's Delta (Cliff, 1993), a non-parametric measure of effect size. The calculation is given by comparing each of the scores in one group to each of the scores in the other, with the following formula:
d = \frac{\#(x_1 > x_2) - \#(x_1 < x_2)}{n_1 n_2}
where x_1 and x_2 are values from the two groups and n_1 and n_2 are their sample sizes. For data with a skewed marginal distribution it is a more robust measure compared to Cohen's standardized effect size (Cohen, 1992). The computed value shows a positive (i.e., tangled pull requests lead to more false positives) effect size (d = 0.36), revealing a medium effect. The effect size is considered negligible for |d| < 0.147, small for |d| < 0.33, medium for |d| < 0.474, and large otherwise (Romano et al., 2006).
Result 1: Untangled pull requests (treatment) lead to fewer false positives, with a statistically significant, medium-size effect.
Given the presence of suggested improvements in our results, we found that the control group wrote seven in total, while the participants in the treatment group wrote 19. This difference is interesting, calling for further classification of the suggestions. For the control group, participants wrote three improvements regarding code readability, two concerning functional checks, one regarding understanding of source code, and one regarding other code issues. For the treatment group, we classified five suggestions for code readability, eight for functional checks, and seven for maintainability. Although subjects had been explicitly given the goal to find and comment exclusively on functional issues (section "Tasks of the Experiment"), they wrote these suggestions spontaneously. The suggested improvements are included in the online appendix (di Biase et al., 2018) along with their classification.
RQ2. Review time and defect lifetime
To answer RQ2, we measured and analyzed the time subjects took to review the pull requests, as well as the amount of time they used to address each of the issues present. Descriptive statistics about the results for our second research question are shown in Table 3. It contains data about the time participants used to review the patch, complemented by measurements of how long they took to address two of the three issues present in the changeset. All measures are in seconds. We exclude data relative to the third defect, as only one participant detected it. To perform the data analysis, we used the same statistical means described for the previous research question.
Table 3: RQ2 — review net time and first and second defect lifetime, with per-group descriptive statistics, 95% confidence intervals, and p-values (measurements in seconds elapsed).
When computing the net review time used by the subjects, results show no significant difference, thus we are not able to reject null-hypothesis H4. This indicates that the average case of the treatment group takes the same time to deliver the review, despite having two pull requests to deal with instead of one. However, analyzing the results regarding the defect lifetime we also see no significant difference and cannot reject H5. Data show that the mean time to address the first issue is about 14% faster in the treatment group compared with the control. This is because subjects have to deal with less code that concerns a single concept, rather than having to extrapolate context information from a tangled change. At the same time, the treatment group takes longer (median) to address the second defect. We believe that this is due to the presence of two pull requests, and as such, the context switch has an overhead effect. From the screencast recordings we found no reviewer using a multi-screen setup; therefore, subjects had to close one pull request and then review the next, where they needed to gain knowledge on different code changes.
Result 2: Our experiment was not able to provide evidence for a difference in net review time between the untangled pull requests (treatment) and the tangled one (control), despite the additional overhead of dealing with two separate pull requests in the treatment group.
RQ3. Understanding the change's rationale
For our third research question, we seek to measure whether subjects are affected by the dependent variable in their understanding of the rationale of the change. Rationale-understanding questions are Q1-Q4 (Table 4) and Fig. 2 reports the results. Q1-Q12 mark the respective questions, while answers from the (C)ontrol or (T)reatment group are marked with their first letter next to the question number. Numbers in the figure count the participants' answers to questions per Likert item. Higher scores for Q1, Q2, and Q4 mean better understanding, whereas for Q3 a lower score signifies a correct understanding.
Table 4: RQ3 — Post-experiment questionnaire (questions marked with * have p < 0.05).
Questions on understanding the rationale of the changeset (the purpose of this changeset entails ...):
Q1: ... changing a method for the enemy AI
Q2: ... the refactoring of some methods
Q3: ... changing the game UI panel
Q4: ... changing some method signature
Questions on the participant's perception of the changeset:
Q5: The changeset was functionally correct
Q6: I found no difficulty in understanding the changeset
Q7: The rationale of this changeset was perfectly clear
Q8*: The changeset followed a logical separation of concerns
Q9: Navigating the changeset was hard
Q10*: The relations among the changes were well structured
Q11: The changeset was comprehensible
Q12*: Code changes were spanning too many features
As for the previous research questions, we test our hypothesis with a non-parametric statistical test. Given the result, we cannot reject the null hypothesis H6 of tangled pull requests reducing change understanding. Participants are in fact able to answer the questions correctly, independent of their experimental group.
After the review, our experimentation also included a final survey (Q5-Q12 in Table 4) that participants filled in at the end. Results shown in Fig. 2 indicate that subjects judge the changeset equally (Q5), found no difficulty in understanding the changeset (Q6), and agree on having understood the rationale behind the changeset (Q7).
Figure 2: RQ3 — Answers to questions Q1-Q12 in Table 4. (C)ontrol and (T)reatment answers are marked with their respective first letter. Numbers count the participants' answers to questions per Likert item. Questions with * have p < 0.05.
These results show that our experiment cannot provide evidence of differences in change understanding between the two groups.
Participants did not find the changeset hard to navigate (Q9), and believe that the changeset was comprehensible (Q11). Answers to questions Q9 and Q11 are surprising to us, as we would expect dissimilar results for code navigation and comprehension. In fact, change decomposition should allow subjects to navigate code more easily, as well as improve source comprehension.
On the other hand, subjects from the control and treatment group judge differently when asked whether the changeset was partitioned according to a logical separation of concerns (Q8), whether the relationships among the changes were well structured (Q10), and whether the changes were spanning too many features (Q12). These answers are in line with what we would expect, given the different structure of the code to be reviewed. The answers differ with statistical significance for Q8, Q10, and Q12.
Result 3: Our experiment was not able to provide evidence of a difference in understanding the rationale of the changeset between the experimental groups. Subjects reviewing the untangled pull requests (treatment) recognize the benefits of untangled pull requests, as they evaluate the changeset as being (1) better divided according to a logical separation of concerns (Q8), (2) better structured (Q10), and (3) not spanning too many features (Q12).
RQ4. Tangled vs. untangled review patterns
For our last research question, we seek to identify differences in patterns and features during review, and their association with quantitative results. We derived such patterns from
Tao et al. (2012) and
Sillito, Murphy & De Volder (2006). These two studies are relevant as they investigated the role of understanding code during the software development process. Tao et al. (2012) laid out a series of information needs derived from state-of-the-art research in software engineering, while Sillito, Murphy & De Volder (2006) focused on questions asked by professional experienced developers while working on implementing a change. The mapping found in the screencasts is shown in Table 5. Table 6 contains the qualitative characterization, ordered by the sum of defects found. Values in each row correspond to how many times a participant in either group used that pattern to address a defect or point to a false positive.
Table 5: RQ4 — Concepts from the literature and their mapped keywords.
- Rationale: What is the rationale behind this code change? (Tao et al., 2012)
- Correctness: Is this change correct? Does it work as expected? (Tao et al., 2012)
- Context: Who references the changed classes/methods/fields? (Tao et al., 2012)
- Caller/callee: How does the caller method adapt to the change of its callees? (Tao et al., 2012)
- Similar/precedent: Is there a precedent or exemplar for this? (Sillito, Murphy & De Volder, 2006)
Table 6: RQ4 — Patterns in review used to address a defect or leading to a false positive (FP). Each pattern is an ordered sequence of concepts; values count defects found / false positives per group.
- P1 (Rationale, Correctness): control 8/3, treatment 4/0
- P2 (Rationale, Context, Correctness): control 4/0, treatment 5/0
- P3 (Context, Rationale, Correctness): control 3/2, treatment 3/0
- P4 (Context, Correctness, Caller/callee): control 1/0, treatment 2/0
- P5 (Context, Correctness): control 2/1, treatment 0/0
- P6 (Correctness, Context): control 0/0, treatment 2/0
- P7 (Rationale, Correctness, Context): control 0/0, treatment 1/0
- P8 (Correctness, Context, Caller/callee): control 1/0, treatment 0/0
- P9 (Correctness, Context, Similar/precedent): control 1/0, treatment 0/1
Results indicate that pattern P1 is the one that led to the most issues being addressed in the control group (eight), but at the same time it is the most imprecise one (three false positives). We conjecture that this is related to the lack of a context-seeking concept. Patterns P1 and P3 account for most false positives in the control group. In the treatment group, pattern P2 led to more issues being addressed (five), followed by the previously mentioned P1 (four).
Analyzing the transcribed screencasts, we note that the control group tended to review code changes using less context exploration than the treatment group. Among the participants belonging to the treatment, we witnessed a much more structured way of conducting the review. The overall behavior is that of getting the context of the single change, looking for the files involved, called, or referenced by the changeset, in order to grasp the rationale. All of the subjects except three repeated this step multiple times to explore a chain of method calls, or to seek more context in that same file by opening it in GitHub. We consider this the main reason why untangled pull requests lead to more precise (fewer false positives) results.
Result 4: Our experiment revealed that review patterns for untangled pull requests (treatment) show more context-seeking steps, in which the participants open more referenced/related classes to review the changeset.
DISCUSSION
In this section, we analyze and discuss the results presented in the section "Results," with consequent implications for researchers and practitioners.
Implications for researchers
In past studies, researchers found that developers call for tool and research support for decomposing a composite change (Tao et al., 2012). For this reason, we were surprised that our experiment was not able to highlight differences in terms of reviewers' effectiveness (number of defects found) and reviewers' understanding of the change rationale, when the subjects were presented with smaller, self-contained changes. Further research with additional participants is needed to corroborate our findings.
If we exclude latent problems with the experiment design that we did not account for, this result may indicate that reviewers are still able to conduct their work properly, even when presented with tangled changes. However, the results may change in different contexts. For example, the cognitive load for reviewers may be higher with tangled changes, with recent research showing promising insights regarding this hypothesis (Baum, Schneider & Bacchelli, 2019). Therefore, the negative effects in terms of effectiveness could be visible when a reviewer has to assess a large number of changes every day, as happens with integrators of popular projects in GitHub (
Gousios et al., 2015). Moreover, the changes we considered are of average size and difficulty, yet results may be impacted by larger changes and/or more complex tasks. Finally, participants were not core developers of the considered software system; it is possible that core developers would be more surprised by tangled changes, find them more convoluted or less "natural," and thus reject them (Hellendoorn, Devanbu & Bacchelli, 2015). We did not investigate these scenarios further, but studies can be designed and carried out to determine whether and how these aspects influence the results of the code review effort.
Given the remarks and comments of professional developers on tangled changes (Tao et al., 2012), we were also surprised that the experiment did not highlight any differences in the net review time between the two groups. Barring experimental design issues, this result can be explained by the additional context switch, which does not happen in the tangled pull request (control) because the changes are done in the same files. An alternative explanation could be that the reviewers with the untangled pull requests (treatment) spent more time "wandering around" and pinpointing small issues because they found the important defects quicker; this would be in line with the cognitive bias known as Parkinson's Law (Parkinson & Osborn, 1957) (all the available time is consumed). However, the time to find the first and second defects (3) is the same for both experimental groups, thus voiding this hypothesis. Moreover, similarly to us, Tao & Kim (2015) also did not find a difference with respect to time to completion in their preliminary user study. Further studies should be designed to replicate our experiment and, if the results are confirmed, to derive a theory on why there is no reduction in review time.
Our initial hypothesis on why time does not decrease with untangled code changes is that reviewers of untangled changes (treatment) may be more willing to build a more appropriate context for the change. This behavior seems to be backed up by our qualitative analysis (section "Results"), through the context-seeking actions that we witnessed for the treatment group. If our hypothesis is not refuted by further research, this could indicate that untangled changes may lead to a more thorough low-level understanding of the codebase. Although we did not measure this in the current study, it may explain the lower number of false positives with untangled changes. Finally, untangled changes may lead to better transfer of code knowledge, one of the positive effects of code review (Bacchelli & Bird, 2013).
Recommendation for practitioners
Our experiment showed no negative effects when changes are presented as separate, untangled changesets, despite the fact that reviewers have to deal with two pull requests instead of one, with the consequent added overhead and a more prominent context switch. With untangled changesets, our experiment highlighted an increased number of suggested improvements, more context-seeking actions (which, it is reasonable to assume, increase the knowledge transfer created by the review), and a lower number of wrongly reported issues.
For the aforementioned reasons, we support the recommendation that change authors prepare self-contained, untangled changesets when they request a review. In fact, untangled changesets are not detrimental to code review (despite the overhead of having more pull requests to review), and we found evidence of positive effects. We expect the cost of untangling code changes to be minimal in terms of cognitive effort and time for the author. This practice is also supported by the answers to questions Q8, Q10, and Q12 of the questionnaire and by comments written by reviewers in the control group (e.g., "Please make different commit for these two features," "I would prefer having two pull requests instead of one if you are fixing two issues").
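As an illustration only (this workflow is not prescribed by our study), the sketch below shows one way an author could split a tangled working copy into two branches, each backing its own pull request. The branch names, file paths, remote name, and the assumption that the two concerns touch different files are all hypothetical.

```python
import subprocess


def git(*args: str) -> None:
    """Run a git command and stop if it fails."""
    subprocess.run(["git", *args], check=True)


# Illustrative scenario (hypothetical paths and branch names): the working copy
# holds an uncommitted refactoring and an uncommitted new feature that touch
# different files, so each concern can be committed on its own branch.
git("checkout", "-b", "refactoring")             # branch off for the refactoring
git("add", "src/RefactoredClass.java")           # stage only the refactored file
git("commit", "-m", "Refactor class for readability")
git("push", "-u", "origin", "refactoring")       # first pull request comes from this branch

git("checkout", "main")                          # the feature edits are still uncommitted
git("checkout", "-b", "new-feature")             # and carry over to the new branch
git("add", "src/NewFeature.java")                # stage only the feature file
git("commit", "-m", "Add new feature")
git("push", "-u", "origin", "new-feature")       # second pull request comes from this branch
```

When the two concerns overlap in the same file, interactive staging (for example, git add -p) can be used instead, at the cost of slightly more care from the author.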
CONCLUSION
The goal of the study presented in this paper is to investigate the effects of change decomposition on MCR (Cohen, 2010), particularly in the context of the pull-based development model (Gousios, Pinzger & Van Deursen, 2014).
We involved 28 subjects, who performed a review of pull request(s) pertaining to (1) a refactoring and (2) the addition of a new feature in a Java system. The control group received a single pull request with both changes tangled together, while the treatment group received two pull requests (one per type of change). We compared the control and treatment groups in terms of effectiveness (number of defects found), number of false positives (wrongly reported issues), number of suggested improvements, time to complete the review(s), and level of understanding of the rationale of the change. Our investigation also involved a qualitative analysis of the reviews performed by the subjects involved in our study.
Our results suggest that untangled changes (treatment group) lead to:
1. fewer reported false positive defects,
2. more suggested improvements for the changeset,
3. the same time to review (despite the overhead of two different pull requests),
4. the same level of understanding of the rationale behind the change, and
5. more context-seeking patterns during review.
Our results support the case that committing changes belonging to different concepts separately should be an adopted best practice in contemporary software engineering. In fact, untangled changes not only reduce the noise for subsequent data analyses (Herzig, Just & Zeller, 2016), but also support the tasks of the developers in charge of reviewing the changes by increasing context-seeking patterns.
ACKNOWLEDGEMENTS
The authors would like to thank all participants of the experiment and the pilot. We furthermore thank the fellow researchers who gave critical suggestions that helped strengthen the methodology of our study.
ADDITIONAL INFORMATION AND DECLARATIONS
Funding
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 642954. Alberto Bacchelli has received support from the Swiss National Science Foundation through the SNF Project No. PP00P2_170529. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Grant Disclosures
The following grant information was disclosed by the authors:
European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement: 642954.
Swiss National Science Foundation through the SNF Project: PP00P2_170529.
Competing Interests
Arie van Deursen is an Academic Editor for PeerJ Computer Science. Marco di Biase and Magiel Bruntink are employed by Software Improvement Group.
Author Contributions
- Marco di Biase conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, performed the computation work, authored or reviewed drafts of the paper, approved the final draft.
- Magiel Bruntink conceived and designed the experiments, authored or reviewed drafts of the paper, approved the final draft.
- Arie van Deursen conceived and designed the experiments, authored or reviewed drafts of the paper, approved the final draft.
- Alberto Bacchelli conceived and designed the experiments, authored or reviewed drafts of the paper, approved the final draft.
Ethics
The following information was supplied relating to ethical approvals (i.e., approving body and any reference numbers):
The Human Subjects Committee of the Faculty of Economics, Business Administration and Information Technology at the University of Zurich authorized the research described in Alberto Bacchelli's research proposal with IRB 2018-024.
Data Availability
The following information was supplied regarding data availability:
The raw data is available at: https://data.4tu.nl/repository/uuid:826f7051-35f6-4696-b648-8e56d3ea5931
REFERENCES
Bacchelli A, Bird C. 2013. Expectations, outcomes, and challenges of modern code review. In: Proceedings of the 2013 International Conference on Software Engineering, ICSE '13. Piscataway: IEEE Press, 712–.
Barnett M, Bird C, Brunet J, Lahiri S. 2015. Helping developers help themselves: automatic decomposition of code review changesets. In: Proceedings of the 37th International Conference on Software Engineering — Volume 1, ICSE '15. Piscataway: IEEE Press, 134–.
Baum T, Schneider K, Bacchelli A. 2017. On the optimal order of reading source code changes for review. Piscataway: IEEE, 329–.
Baum T, Schneider K, Bacchelli A. 2019. Associating working memory capacity and code change ordering with code review performance. Empirical Software Engineering. New York: Springer. DOI 10.1007/s10664-018-9676-8.
Beller M, Gousios G, Panichella A, Zaidman A. 2015. When, how, and why developers (do not) test in their IDEs. In: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015. New York: ACM, 179–.
Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological):289–300 DOI 10.1111/j.2517-6161.1995.tb02031.x.
Binkley D, Davis M, Lawrie D, Maletic J, Morrell C, Sharif B. 2013. The impact of identifier style on effort and comprehension. Empirical Software Engineering:219–.
Bosu A, Greiler M, Bird C. 2015. Characteristics of useful code reviews: an empirical study at Microsoft. Piscataway: IEEE, 146–.
Cliff N. 1993. Dominance statistics: ordinal analyses to answer ordinal questions. Psychological Bulletin:494–509 DOI 10.1037/0033-2909.114.3.494.
Cohen J. 1992. Statistical power analysis. Current Directions in Psychological Science:98–.
Cohen J. 2010. Modern code review. In: Oram A, Wilson G, eds. Making Software. Chapter 18. Sebastopol: O'Reilly, 329–.
Cook TD, Campbell DT. 1979. Quasi-experimentation: design and analysis for field settings. Vol. 3. Chicago: Rand McNally.
Cowan N. 1998. Attention and memory: an integrated framework. Vol. 26. Oxford: Oxford University Press.
di Biase M, Bruntink M, Bacchelli A. 2016. A security perspective on code review: the case of Chromium. In: Proceedings of the 16th IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2016, October 2-3, 2016. Piscataway: IEEE, 21–.
di Biase M, Bruntink M, Van Deursen A, Bacchelli A. 2018. The effects of change decomposition on code review — a controlled experiment — online appendix. Available at https://data.4tu.nl/repository/uuid:826f7051-35f6-4696-b648-8e56d3ea5931.
Dias M, Bacchelli A, Gousios G, Cassou D, Ducasse S. 2015. Untangling fine-grained code changes. In: Proceedings of the 22nd International Conference on Software Analysis, Evolution, and Reengineering, SANER 2015. Piscataway: IEEE Computer Society, 341–.
Gousios G, Pinzger M, Van Deursen A. 2014. An exploratory study of the pull-based software development model. In: Proceedings of the 36th International Conference on Software Engineering — ICSE 2014 (May 2014). New York: ACM, 345–.
Gousios G, Zaidman A, Storey M, Van Deursen A. 2015. Work practices and challenges in pull-based development: the integrator's perspective. In: Proceedings of the 37th International Conference on Software Engineering — Volume 1, ICSE '15. Piscataway: IEEE Press, 358–.
Hellendoorn VJ, Devanbu PT, Bacchelli A. 2015. Will they like this? Evaluating code contributions with language models. In: Proceedings of the 12th Working Conference on Mining Software Repositories. Piscataway: IEEE Press, 157–.
Herzig K, Just S, Zeller A. 2016. The impact of tangled code changes on defect prediction models. Empirical Software Engineering:303–336 DOI 10.1007/s10664-015-9376-6.
Herzig K, Zeller A. 2013. The impact of tangled code changes. In: Proceedings of the 10th IEEE Working Conference on Mining Software Repositories, MSR '13. Piscataway: IEEE, 121–.
James W. 2013. The principles of psychology. Redditch: Read Books Ltd.
Kirinuki H, Higo Y, Hotta K, Kusumoto S. 2014. Hey! Are you committing tangled changes? In: Proceedings of the 22nd International Conference on Program Comprehension, ICPC 2014. New York: ACM, 262–.
Ko A, LaToza T, Burnett M. 2015. A practical guide to controlled experiments of software engineering tools with human participants. Empirical Software Engineering:110–.
Mayring P. 2000. Qualitative content analysis. Forum: Qualitative Social Research:159–.
McIntosh S, Kamei Y, Adams B, Hassan A. 2014. The impact of code review coverage and code review participation on software quality: a case study of the Qt, VTK, and ITK projects. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014. New York: ACM, 192–.
McIntosh S, Kamei Y, Adams B, Hassan AE. 2016. An empirical study of the impact of modern code review practices on software quality. Empirical Software Engineering:2146–.
Morales R, McIntosh S, Khomh F. 2015. Do code review practices impact design quality? A case study of the Qt, VTK, and ITK projects. In: Proceedings of the 22nd International Conference on Software Analysis, Evolution and Reengineering, SANER 2015. Piscataway: IEEE, 171–.
Murphy-Hill E, Parnin C, Black A. 2012. How we refactor, and how we know it. IEEE Transactions on Software Engineering:5–18 DOI 10.1109/tse.2011.41.
Oppenheim A. 2000. Questionnaire design, interviewing and attitude measurement. London: Bloomsbury Publishing.
Parkinson CN, Osborn RC. 1957. Parkinson's law, and other studies in administration. Vol. 24. Boston: Houghton Mifflin.
Perneger TV. 1998. What's wrong with Bonferroni adjustments. British Medical Journal:1236–.
Prechelt L, Tichy W. 1998. A controlled experiment to assess the benefits of procedure argument type checking. IEEE Transactions on Software Engineering:302–312 DOI 10.1109/32.677186.
Ram A, Sawant AA, Castelluccio M, Bacchelli A. 2018. What makes a code change easier to review: an empirical investigation on code change reviewability. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2018). New York: ACM, 201–.
Rigby PC, Bird C. 2013. Convergent contemporary software peer review practices. In: Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013. New York: ACM, 202–.
Rigby P, Cleary B, Painchaud F, Storey M, German D. 2012. Contemporary peer review in action: lessons from open source development. IEEE Software:56–61 DOI 10.1109/ms.2012.24.
Rigby P, German D, Cowen L, Storey M. 2014. Peer review on open-source software projects. ACM Transactions on Software Engineering and Methodology:1–.
Romano J, Kromrey J, Coraggio J, Skowronek J. 2006. Appropriate statistics for ordinal level data: should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys. In: Annual Meeting of the Florida Association of Institutional Research, 1–.
Sadowski C, Söderberg E, Church L, Sipko M, Bacchelli A. 2018. Modern code review: a case study at Google. In: Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP '18). New York: ACM, 181–.
Schreier M. 2013. Qualitative content analysis. In: Flick U, ed. The SAGE Handbook of Qualitative Data Analysis. London: SAGE, 170–.
Sharif B, Falcone M, Maletic JI. 2012. An eye-tracking study on the role of scan time in finding source code defects. In: Proceedings of the Symposium on Eye Tracking Research and Applications. New York: ACM, 381–.
Shiffrin RM. 1988. Attention. In: Atkinson RC, Herrnstein RJ, Lindzey G, Luce RD, eds. Stevens' Handbook of Experimental Psychology: Perception and Motivation; Learning and Cognition. Vol. 2. Oxford: John Wiley & Sons, 739–.
Sillito J, Murphy G, De Volder K. 2006. Questions programmers ask during software evolution tasks. In: Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering. New York: ACM, 23–.
Slavin R. 1987. Mastery learning reconsidered. Review of Educational Research:175–.
Tao Y, Dang Y, Xie T, Zhang D, Kim S. 2012. How do software engineers understand code changes? An exploratory study in industry. In: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE '12. New York: ACM, 1–.
Tao Y, Kim S. 2015. Partitioning composite code changes to facilitate code review. In: Proceedings of the 12th Working Conference on Mining Software Repositories. Piscataway: IEEE, 180–.
Thongtanunam P, McIntosh S, Hassan AE, Iida H. 2017. Review participation in modern code review. Empirical Software Engineering:768–817 DOI 10.1007/s10664-016-9452-6.
Uwano H, Nakamura M, Monden A, Matsumoto K. 2006. Analyzing individual performance of source code review using reviewers' eye movement. In: Proceedings of the 2006 Symposium on Eye Tracking Research & Applications. New York: ACM, 133–.
Wickens CD. 1991. Processing resources and attention. Multiple-Task Performance. London: Taylor & Francis, 3–.
Wohlin C, Runeson P, Höst M, Ohlsson M, Regnell B, Wesslén A. 2012. Experimentation in software engineering. Berlin/Heidelberg: Springer Science & Business Media.