Noname manuscript No.
(will be inserted by the editor)

Assessing Practitioner Beliefs about Software Engineering

Shrikanth N.C. · William Nichols · Fahmid Morshed Fahid · Tim Menzies

Received: date / Accepted: date
Abstract
Software engineering is a highly dynamic discipline. Hence, as times change, so too might our beliefs about core processes in this field. This paper checks five beliefs that originated in past decades and that comment on the relationships between (i) developer productivity, (ii) software quality, and (iii) years of developer experience.
Using data collected from 1,356 developers in the period 1995 to 2006, we found support for only one of the five beliefs, titled "Quality entails productivity". We found no clear support for the four other beliefs based on programming languages and software developers. However, from the sporadic evidence on those four beliefs we learned that a narrow scope could delude practitioners into misinterpreting certain effects as holding in their day-to-day work. Lastly, through an aggregated view of the five beliefs, we find that programming languages act as a confounding factor for developer productivity and software quality. Thus the overall message of this work is that it is both important and possible to revisit old beliefs in SE. Researchers and practitioners should routinely retest old beliefs.
Keywords software analytics · beliefs · productivity · quality · experience

Shrikanth N.C., Fahmid Morshed Fahid, Tim Menzies
Department of Computer Science, North Carolina State University, Raleigh, NC, USA.
E-mail: [email protected], [email protected], and [email protected]

William Nichols
Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, USA.
E-mail: [email protected]

1 Introduction

"Though deeply learned, unflecked by fault, 'tis rare to see when closely scanned, a man from all unwisdom free."
– Valluvar's sacred couplet (translated, 1886, G.U. Pope [1, 43])

Ideally, practitioners and researchers in Software Engineering (SE) learn lessons from the past in order to better manage their future projects. But while many researchers record those beliefs [54, 58, 59, 61], very little is currently being done to verify the veracity of those beliefs. We assert that it is important to quantitatively assess SE beliefs, as such beliefs are used by:

– Practitioners, when they justify design or process decisions; e.g., "better not use goto statements in our code";
– Managers, to justify purchases, training programs, or hiring decisions; e.g., "test-driven development processes are best";
– Researchers, as they select what issues they should explore next; e.g., "it is better to remove more bugs, earlier in the life-cycle, since the longer they stay in the code, the more expensive they become to remove".

But the justification for such beliefs may be weak. Nagappan et al. recently rechecked and rejected Dijkstra's famous comment that goto is necessarily considered harmful [36]. As to early bug removal, Menzies et al. looked for evidence about whether or not "the longer a bug remains in the system, the exponentially more costly it becomes to fix". An extensive literature survey found only ten papers that actually experimented with this issue, of which five did, and five did not, support this belief [30]. Further, Fucci et al. reviewed numerous studies on test-driven development and found no evidence of an advantage from writing tests before writing code [19]. To say the least, this result is very different from numerous prior claims [18].

More generally, Devanbu et al. reported at ICSE'16 just how widely practitioner beliefs at Microsoft diverged from each other and from the existing empirical evidence [13]. Also, Shrikanth and Menzies explain that discrepancy between practitioners and empirical evidence by documenting the poverty of evidence for numerous defect prediction beliefs in dozens of software projects [47].

Motivated by the above examples, in this paper, we:

– Determine what large data sources exist. Since 1995, our second author (Nichols) has been teaching data collection methods to developers. As part of that work, he has collected data from ten tasks assigned to 1,356 developers. In all, we have data from 5,424 completed tasks.
– Check how that data comments on known catalogs of SE beliefs.
For this paper, we used the 2003 textbook "A handbook of software and systems engineering: Empirical observations, laws, and theories" [15] by Albert Endres & Dieter Rombach. That book documents dozens of SE hypotheses, laws, and theories. For a variety of reasons, this paper only explores the five Endres and Rombach beliefs listed in Table 1. Those reasons are:

– No single article could explore all the beliefs recorded by Endres and Rombach.
– The data used in this study, provided by Nichols (the second author of this paper) et al. [55], could only comment on a subset of the Endres and Rombach beliefs. Beliefs such as "Prototyping (significantly) reduces requirement and design errors, especially for user interfaces" (Boehm's second law) or "Screen pointing-time is a function of distance and width" (Fitts-Shneiderman law) would require a different data source to assess.
– Of the remaining beliefs, we found that five were the most widely cited. For example, one of the original SMALLTALK papers [20] (cited 7,430 times) motivates its work using the "Dahl-Goldberg" hypothesis listed in Table 1. Also, the paper proposing the "Apprentice's Law" [38] has been cited 4,390 times. The remaining three beliefs are all referenced in "The Mythical Man-Month" [7] and the famous 1987 article "No silver bullet" [6]; these two works are cited 8,649 and 5,085 times, respectively. (All the citation counts in this bullet were collected from Google Scholar, December 2019.)

Table 1: Beliefs studied in this paper.

1. "Productivity and reliability depend on the length of a program's text, independent of language level used." (Corbató's Law [10], 1969)
2. "Object-oriented programming reduces errors and encourages reuse." (Dahl-Goldberg Hypothesis [11, 20], 1967 & 1989)
3. "Quality entails productivity." (Mills-Jones Hypothesis [9, 32], 1983 & 1990)
4. "Individual developer performance varies considerably." (Sackman's Second Law [46], 1968)
5. "It takes 5000 hours to turn a novice into an expert." (Apprentice's Law [38], 1993)

Finally, there is some coherence between the five beliefs we selected. Specifically, they explore aspects of the entities of Figure 1. That is to say, in theory, we could learn more from a summation of these beliefs than from a separate study of each of them. Specifically, after studying the data about these five beliefs, we can ask and answer three research questions:

RQ1: Why do beliefs diverge among practitioners?
Apart from belief 3 titled “Quality entails productivity”, none of the other beliefsare supported.
Next, we ask:
RQ2: What is the relationship between Productivity, Quality, and Expertise?
A focus on quality early in the project life-cycle minimizes rework, but programming experience neither improves production rate nor mitigates defects.
Finally, we ask:
RQ3: What impacts Productivity and Quality?
Programming languages affect both production rate and defects. Specifically, tasks completed using C# showed better production rates and fewer defects than tasks completed in the other languages we studied.
Fig. 1: A summary of the beliefs in Table 1. Beliefs are broken into their entities, and edges are drawn between two entities to acknowledge the presence of an effect as reported in the SE literature. The strength of each effect, using the data we collected, is assessed in § 5.

The contributions of this paper are:
– A replication study: we assess five SE beliefs to understand the widespread relevance of a large disconnect between SE beliefs and the actual evidence in practice.
– Prior publications such as [13, 27] report the disconnect between practitioners and empirical evidence, which is important, but only a few [47] extend to offer an explanation for that disconnect. We highlight that such disconnects exist even among decades-old SE beliefs based on developer productivity, expertise, and software quality.
– Importantly, our advice to practitioners is not to dwell on years of developer experience but to value some programming languages over others. We also suggest that practitioners focus on quality right from the early stages of a project, preferably adhering to a disciplined process.
– The data is publicly available [55], and the results of this study are reproducible. The reproduction package is available at http://tiny.cc/se_beliefs.

The rest of this paper is structured as follows. The next section describes the beliefs explored in this paper, and § 3 describes the data we used to explore those beliefs. In § 4, we discuss the choice of our datasets, statistical tests, measures, and terminology needed for the assessments in § 5, where we detail the modeling of the beliefs. Next, we discuss the results of our assessment in § 6, followed by threats to validity in § 7 and conclusions in § 8.

This section describes the beliefs explored in this paper (and the next section describes the data we used to explore those beliefs).

2.1 Quality

Belief 2 claims that
"Object-oriented programming reduces errors and encourages reuse"; i.e., some groups of programming languages induce more defects than others. In the literature, there is some support for this claim:

– Ray et al. analyzed Open Source (OS) projects and found a modest but significant effect of programming languages on software quality [44].
– Kochhar et al. [5, 26] showed that some languages, when used together (interoperability) with other languages, induced defects.
– Bhattacharya and Neamtiu [4] argue that C++ is a better choice than C for both software quality and developer productivity.
– Monden et al. empirically assessed four beliefs related to systems testing. They found evidence for an old belief that more reused code is harmful [49] in one of the two organizations they assessed [34].

Belief 3 claims that
"Quality entails productivity"; i.e., this belief implies a relationship between quality and productivity. Mills, by applying Cleanroom Software Engineering, showed the possibility of simultaneous productivity and software quality improvements in both commercial and research projects [32].
2.2 Productivity

Many works [25, 28, 46] in the past decades have studied developer productivity. Belief 1, titled "Productivity and reliability depend on the length of a program's text, independent of language level used", implies that Lines of Code (LOC) is a better indicator of software quality and productivity than the choice of programming language. To the best of our knowledge, this 1969 belief has not been well explored in the past. Compared to the late 60's, practitioners now write code in numerous programming languages, using tools (like Integrated Development Environments) to catalyze software development. Thus it is essential to revisit the claimed effect.

Interestingly, some researchers acknowledge the widely held belief that some good developers are much better (almost 10X) than many poor developers [46]. Belief 4 is centered around the claim that "Individual developer performance varies considerably". On related lines of thought, using the same data set, Nichols pointed out that a developer who is productive in one task is not necessarily productive in another [37]. That result warns us that even if we do find a hero [2] developer, they may not remain a hero consistently. Thus the focus should be on answering whether this productivity variance also impacts software quality. If it does not, then practitioners can confidently withdraw the large appeal of these moderate productivity variances in practice.

While exploring the literature on developer productivity, we also note a common debate on a universal productivity metric:
– In one study, Vasilescu et al. measured productivity as the number of pull requests, to show productivity improvements through the Continuous Integration practice in the GitHub arena [53].
– In another recent study, Murphy-Hill et al. showed that non-technical factors (a self-rated metric) were good predictors of productivity [35].
– Suggestions on how to augment traditional measures, such as incorporating rework time, were also discussed in the past [40].

Since all the productivity measures discussed above have their limitations, we lean towards the most prevalent measure in the literature, 'production rate' (program size over time). The list of measures used in this study is given in § 4.2.

2.3 Expertise

Two common beliefs are that experts perform the same task better (higher quality while meeting deadlines) than novices, and that expertise is built over time. The differences between experts and novices have been discussed in various domains [16, 17]. In SE, back in 1985, Wiedenbeck considered 20 developers in two equal groups of 10 and found the expert group to be significantly better at certain programming sentence identification tasks than the novice group. The expert group had 20,000 hours (mean) of experience in their programming languages, whereas the novice population had as little as 500 hours (mean).
Although some studies have highlighted that there is more to expertise than just years of experience [3], we think it is important to revisit prevalent beliefs, especially belief 5, titled "It takes 5000 hours to turn a novice into an expert", as it is known to influence software quality and developer productivity. For example, a 2014 TSE article by Bergersen et al. claimed that the first few years of experience correlate with developer performance. But later, a 2017 EMSE article by Dieste et al. found years of experience to be a poor predictor of developer productivity and quality [14].

Our work is similar to [13, 47] in that we too assess various beliefs in an empirical study, but we differ in the following ways:

– The truisms we assess have influenced numerous SE articles, as discussed earlier in § 1.
– We observe variations in the entities of beliefs, such as developer productivity, defects, and years of developer experience, among different programming languages. The results of that observation can help managers to prefer some programming languages over others.
– Although some Open Source software (OSS) lessons may extend to practice, this work looks for evidence in tasks completed by developers from industries of various domains. The generalizability of our results is discussed in § 7.

In this section, we discuss the source and nature of the data while detailing the collection framework. Then we detail the statistical tests and SE measures used to answer our RQs.

In summary, our data comes from a decades-long training program. Consultants from the Software Engineering Institute (SEI, based in Pittsburgh, USA) traveled around the world to train developers in personal data collection. This "Personal Software Process" (or PSP) [23] is based on the belief that a disciplined process can improve productivity and quality [41]. Specifically, practitioners using PSP are encouraged to estimate how long some tasks will take, and then explain any differences between the predicted and actual effort.

There are several reasons to use this data. Firstly, it is a minimal intrusion into the actual development work of practitioners. With the support of the right tools (e.g., the tools from the SEI), practitioners spend less than 20 minutes per day on PSP data collection. Hence, PSP can generate accurate and insightful records of actual developer activity [37, 40-42, 51].

Table 2: Count of Engineers (Developers) by Domain
Product Domain         Number    Product Domain   Number
Software Services      378       Telecom          92
Business IT            351       Financial        68
Automation & Control   112       Government       66
Accounting Software    99        Embedded         55
Consumer Electronics   99        Aerospace        51
Automotive             97        Other            319
Secondly, when SEI consultants train practitioners in PSP, they use a standard set of ten tasks. The course is taught over 10 class days, with one week focused on measurement and estimation and the second week focused on reviews, design, and quality (there was typically a minimum two-week gap between weeks one and two). Hence, we have data on thousands of developers doing the same set of tasks, using a wide variety of programming methods and tools. For an overview of that data:

– Table 2 lists the thousands of developers who have had this PSP training, along with the kind of software they usually develop.
– Figure 2 lists the languages used by attendees as they tried to complete the ten programming tasks.
– Table 3 sorts the ten tasks (labelled from 1 to 10) from simplest (at level "0") to hardest (at level "2"). A small slice of the 20-page task 10 specification is presented in Figure 3. Concise requirements for task 10 include writing programs to:
  – read a table of historical data using the linked list from task 1;
  – write a multiple regression solver to estimate the regression parameters;
  – from user-supplied estimates of new LOC, reused LOC, and modified LOC, compute the expected effort and prediction interval;
  – print out the results.
– Figure 4 lists the tens of thousands of defects recorded during the PSP training tasks.

Thirdly, this PSP data comes from industrial practitioners from around the world. The PSP classes were taught in the US, Japan, Korea, Australia, Mexico, Sweden, Germany, the United Kingdom, the Netherlands, and India. Class size ranged from 1 to 20 developers, with a mean of 10.4 and an interquartile range of 7 to 14. Only about 3.2% of subjects (123 of 3,832) were from a university setting, while most of the classes, 361 of 373, were taught in industry to practicing software developers. Early adopters included the Air Force, ABB, Honeywell, Allied Signal, Boeing, and Microsoft.
Fig. 2: Distribution of all the tasks attempted & completed by developers using a specific programming language. For a fair sample size comparison, we only consider tasks completed using C, C++, C#, Java, and VB.
Fourthly, this is high-fidelity data. A study of PSP data collection by Disney and Johnson [24], using 10 developers who wrote 89 programs, found that manual collection and calculations on paper led to a 5% error rate, mostly in derived calculations or transcription errors. One explanation for this low error rate is the way the data was collected. The SEI authorized instructors to review each developer's (student's) PSP data as a required criterion for the successful completion of the task. Grading rubrics included self-consistency checks, checks to ensure that estimates and actuals are consistent with historical data, and comparisons with peers for data from each sub-process. Developers are also shown class summaries for comparison with their peers. Hence, various studies [21, 45, 50] have found the data to be very accurate.

Table 3: Overview of the data set, composed of 10 tasks at various levels, each completed in one of the five programming languages we considered. Level 2 tasks (rows shown in gray) have higher complexity than the earlier levels.
Level | Task | Developer Attempts | Programming Languages
Fig. 3: Task 10 (or assignment 10A) listed in Table 3 asks the developers to extend program 6 to calculate the three-parameter multiple-regression factors from a historical data set, then estimate the development effort with 70% to 90% prediction intervals.

Fig. 4: Distribution of 124,521 defects recorded by developers while completing the 10 tasks using the 5 programming languages C, C++, C#, Java, and VB.

The ten tasks had the following properties:

– They varied slightly in size, difficulty, and complexity.
– They were chosen to be sufficiently difficult to generate useful data on estimation, effort, size, and defects, and could typically be completed in an afternoon with 100 to 200 Lines of Code in a 3rd Generation language.
Fig. 5: This bar-chart portrays the proportion of 1,356 developers completing 5,424 level 2 tasks (labelled 7, 8, 9 and 10), grouped by a specific programming language.

– Two programs were dedicated to counting program size; the remainder were primarily statistical, including regression, multiple regression, a Simpson's rule integral, the Chi function, Student's T function, and prediction intervals.
– Developers were not expected to be domain experts and were provided a specification package that included descriptions of necessary formulas, algorithms, required test cases, and numeric examples suitable for a developer with no specific statistical expertise.

The developers collected their personal data for effort, size, and defects using the PSP data framework, which measures direct time in minutes and program size in new and changed lines of code. Developers were instructed to build solutions in incremental cycles of design, code, and test, selecting their own increment size, typically a component or feature of 25-50 lines of code. Though some developers could produce working programs in a single cycle, most used 3 to 5 cycles, depending on their solution size and complexity. For effort accounting, each increment was initially designed and coded (creation), then reviewed (appraisal), followed by compile and test (failure). All time required to achieve a clean compile was attributed to compile. All rework necessary to get the tests to pass was attributed to test. The accounting highlighted rework so that rework could be minimized.

– Although developers used numerous programming languages to complete the tasks, predominantly (85%) the developers used C, C++, C#, Java, or VB.
– For simplicity of presentation, beliefs 1 and 4 use data only from the level 2 tasks (labeled 7, 8, 9 and 10) listed in Table 3. For all other beliefs (2, 3 and 5), we consider all 10 tasks.
– Suiting the nature of the beliefs, we use data from the appropriate type of defect in our analysis. The three types of defects we study are shown in Figure 4.

4.2 Measures

We use the three SE measurements below to derive our conclusions while assessing the 5 beliefs chosen in this study.
Program size (LOC) = Lines of Code
Production rate = LOC / hour (coding time)
Quality (defects) = number of defects
(Unless specified otherwise, we consider defects injected in the coding phase. The other types of defects we analyze are defects injected in the design phase and defects removed in the testing phase.)

As mentioned earlier, information per programming task, such as the number of defects, program size, and coding time, is captured in practice by developers. To recap, developers completed the 10 programming tasks of increasing complexity listed in Table 3. They used various programming languages, as shown in Figure 2, but largely C, C++, C#, Java, and VB.
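To make these definitions concrete, the following minimal Python sketch derives the three measures from a single task record. The field names (loc, coding_minutes, defects) are illustrative assumptions, not the actual PSP data schema.

```python
# A minimal sketch (not the actual PSP tooling): deriving the three study
# measures from one task record. Field names are illustrative assumptions.

def measures(record):
    loc = record["loc"]                      # program size: new/changed LOC
    hours = record["coding_minutes"] / 60.0  # direct coding time, in hours
    return {
        "program_size": loc,                 # LOC
        "production_rate": loc / hours,      # LOC per coding hour
        "defects": record["defects"],        # coding-phase defects unless noted
    }

# Example: 150 LOC written in 4 hours of coding time, with 4 coding defects.
print(measures({"loc": 150, "coding_minutes": 240, "defects": 4}))
```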
All five beliefs we assess require us to compare different distributions of measures like program size, production rate, and defects. To cater to our experiment setup, for beliefs 1, 2, 4 and 5 (later in § 5) we rank distributions, and for belief 3 we report correlations. Based on the ρ (correlation coefficient) values reported in the SE literature [60], we derive the following ranges for |ρ|:

Correlation:
∗ no support
∗ minimum/weak support
∗ support
∗ strong support
∗ very strong support
We acknowledge that these ranges are debatable. All the correlations we report are significant at the 99% confidence level (i.e., p value < 0.01).

Rank:
∗ A lower Scott-Knott rank indicates better production rate, better quality (fewer defects), and larger program size (LOC); i.e., the population distribution in Rank 1 is better than that in Rank 2.
∗ Distributions placed in different ranks indicate significantly different populations.

In summary:
– Rank: clusters a list of populations to report significant differences.
– Correlation: reports significant associations between two variables.
Later, in our experiments in § 5, we compare populations of SE measures such as defects, production rate, and program size. Note that populations may have the same median while their distributions are very different; hence, to identify significant differences (ranks) among two or more populations, we use the Scott-Knott test recommended by Mittas et al. in TSE'13 [33].

Scott-Knott is a top-down bi-clustering approach used to rank different treatments; a treatment could be program size, production rate, defects, etc. This method sorts a list of l treatments with ls measurements by their median score. Before sorting the l treatments, we normalize our data to [0, 1]. This is because SE measures like program size, defects, etc. do not typically fall within a fixed range that fits the quartile plots (later in § 5). We normalize the l treatments by applying min-max normalization, as shown below. Note that this transformation does not impact the rank of the l treatments in any way.

x' = (x - min(x)) / (max(x) - min(x))

where:
– max(x) is the global maximum, i.e., the largest value among the list of l treatments;
– min(x) is the global minimum, i.e., the least value among the list of l treatments.

The Scott-Knott approach then splits the normalized list l into sub-lists m, n in order to maximize the expected value of differences in the observed performances before and after division. For lists l, m, n of sizes ls, ms, ns where l = m ∪ n, the "best" division maximizes E(Δ), i.e., the difference in the expected mean value before and after the split:

E(Δ) = (ms/ls) · |m.μ - l.μ| + (ns/ls) · |n.μ - l.μ|

Notably, these techniques are preferred since they do not make Gaussian assumptions (they are non-parametric). To avoid "small effects" with statistically significant results, we employ the conjunction of bootstrapping and the A12 effect size test of Vargha and Delaney [52] for the hypothesis test H to check whether m, n are truly significantly different.
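To make the ranking procedure concrete, the following minimal Python sketch implements the median sort, the E(Δ) split, and an effect-size check. It is an illustrative simplification under stated assumptions: the bootstrap step is omitted, a 0.1 margin on A12 is an assumed cutoff for rejecting small effects, and the treatment names and values are made up.

```python
# Illustrative Scott-Knott sketch. Simplifications: the paper's bootstrap
# step is omitted, and the 0.1 "non-small effect" margin for A12 is an
# assumed cutoff, not a value taken from the paper.

def a12(m, n):
    """Vargha-Delaney A12: probability that a value from m exceeds one from n."""
    gt = sum(1 for x in m for y in n if x > y)
    eq = sum(1 for x in m for y in n if x == y)
    return (gt + 0.5 * eq) / (len(m) * len(n))

def best_split(groups):
    """Index maximizing E(Delta) = ms/ls*|m.mu - l.mu| + ns/ls*|n.mu - l.mu|."""
    pooled = [v for _, vs in groups for v in vs]
    ls, l_mu = len(pooled), sum(pooled) / len(pooled)
    best, cut = -1.0, 1
    for i in range(1, len(groups)):
        m = [v for _, vs in groups[:i] for v in vs]
        n = [v for _, vs in groups[i:] for v in vs]
        e = (len(m) / ls) * abs(sum(m) / len(m) - l_mu) \
          + (len(n) / ls) * abs(sum(n) / len(n) - l_mu)
        if e > best:
            best, cut = e, i
    return cut

def scott_knott(groups):
    """Cluster (name, values) treatments; returns clusters in rank order."""
    groups = sorted(groups, key=lambda g: sorted(g[1])[len(g[1]) // 2])  # median
    if len(groups) < 2:
        return [groups]
    cut = best_split(groups)
    left = [v for _, vs in groups[:cut] for v in vs]
    right = [v for _, vs in groups[cut:] for v in vs]
    if abs(a12(right, left) - 0.5) > 0.1:   # split only on a non-small effect
        return scott_knott(groups[:cut]) + scott_knott(groups[cut:])
    return [groups]                          # no meaningful split: one shared rank

# Made-up, pre-normalized production rates; here rank 1 holds the lowest
# medians, so reverse the sort for "higher is better" measures.
data = [("C#", [0.9, 1.0, 0.8, 0.9]),
        ("Java", [0.5, 0.6, 0.4, 0.5]),
        ("C", [0.5, 0.4, 0.6, 0.5])]
for rank, cluster in enumerate(scott_knott(data), 1):
    print(rank, [name for name, _ in cluster])
```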
Spearman's rank correlation (a non-parametric test) assesses associations between two of the measures discussed earlier, for example, a correlation between production rate and software quality. We chose Spearman, as recommended by some SE quality studies [12], to handle skewed data; further, it is unaffected by monotone transformations (such as log, reciprocal, square root, etc.) of the variables.

Spearman's rank correlation between two samples X, Y works on the ranked values x_i ∈ X and y_i ∈ Y (with means x̄ and ȳ) and is estimated via

ρ = Σ_{i=1..n} (x_i - x̄)(y_i - ȳ) / sqrt( Σ_{i=1..n} (x_i - x̄)² · Σ_{i=1..n} (y_i - ȳ)² )

We draw conclusions using both the correlation coefficient (ρ) and its associated p value in all our experiments. The correlation coefficient (ρ) varies from +1, i.e., ranks are identical, to -1, i.e., ranks are opposite, where 0 indicates no correlation.

– A higher |ρ| value indicates stronger evidence.
– A lower p value indicates the evidence is statistically significant.
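For illustration, the following minimal Python sketch (using SciPy, with made-up defect counts rather than study data) computes ρ and its p value, and checks the point made above that a monotone transformation such as log leaves ρ unchanged.

```python
# Minimal sketch: Spearman's rho on made-up defect counts, plus a check that
# a monotone transform (e.g., log) does not change the rank correlation.
import math
from scipy.stats import spearmanr

code_defects = [2, 5, 1, 9, 4, 7, 3, 6]  # illustrative values, not study data
test_defects = [1, 4, 0, 7, 3, 5, 2, 4]

rho, p = spearmanr(code_defects, test_defects)
print(f"rho = {rho:+.2f}, p = {p:.4f}")  # the paper reports rho only when p < 0.01

rho_log, _ = spearmanr([math.log(x + 1) for x in code_defects], test_defects)
assert abs(rho - rho_log) < 1e-9  # ranks, hence rho, are unchanged
```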
In this section, for each belief listed in Table 1, we discuss the rationale, construct the experiment, and discuss the strength of the assessed belief.

5.1 Belief 1: Corbató's Law

This section discusses an effect reported in a 1969 paper by Corbató [10] that
Productivity and defects depend on the length of a program’s text, independent of thelanguage level used.
That is to say, (a) longer programs tend to get more defects, and (b) this effect is not mitigated by newer generation languages. Note that, if true, Corbató's rule warns us that, by merely switching to a newer language:
Table 4: Normalized distributions of "program size", "production rate", and "defects" for the level 2 task(s), ranked using the Scott-Knott test (elucidated in § 4). Group 1 (program size, LOC) ranks the level 2 tasks, with task 10 at rank 1. Group 2 (production rate, LOC/hour) and group 3 (defects) rank task 10 completed using C and C#. (Rank, median, IQR, and quartile-plot columns omitted.)
Prediction:
If Corbató was wrong, then we should see either:
– production rates differ by programming language, and/or
– defects differ by programming language.
This part of the paper focuses on the more complex tasks (mainly for simplicity of presentation). Specifically, we used the level 2 tasks (labelled 7, 8, 9 & 10) from Table 3. Table 4 shows our results in three groups: program size, production rate, and defects. From this table, we make several observations.

– Program size distributions in group 1 reveal that tasks 8 and 10 completed using C and C# have similar program sizes.
– Subsequently, in groups 2 and 3 ("production rate" and "defects"), we focus only on the populations 10, C and 10, C#.
– The focus of groups 2 and 3 ("production rate" and "defects") on task 10 (chosen in the previous step) reveals that developers who completed the task using C# were more productive and induced fewer defects.
– Thus, as per Corbató's Law, if only LOC matters and language level does not, then the task should show similar production rates and defects irrespective of language (C or C#).

Accordingly, we say:

Belief 1: Our results contradicted Corbató's law: with similar program size (LOC), tasks completed using C# showed better production rates and fewer defects than tasks completed using C.

5.2 Belief 2: Dahl-Goldberg Hypothesis

This section discusses belief 2 [11, 20], which implies that
Programs written using non-OO languages naturally induce more defects.
If true, then programs written in OO languages like C++ should get fewer defects than those written in C (non-OO).

To check this effect, we studied tasks completed by developers in 5 programming languages. Among those five languages, C++, VB, C#, and Java support OO, while C does not. OO languages offer constructs such as private and protected members (in Java) to encapsulate certain complex parts of code; modern OO languages such as C# provide similar encapsulation facilities. In this experiment, we consider two types of defects:

– "defects injected in design" (design defects), and
– "defects injected in code" (coding defects).

Prediction:
If Dahl & Goldberg were wrong, then programming similar tasks using OO languages such as C#, C++, Java, and VB would not yield fewer defects than using the non-OO language C.
Table 5 presents the "defects (Code + Design)" results in two groups (programming languages and task 10). From this table, we make several observations.

– The defect distributions in group 1 reveal that tasks completed by developers using C# and VB have fewer defects.
– Notably, tasks completed using C++, which supports OO, have the most defects.
– A focused analysis of defects in group 2 shows that task 10 completed in C# has the fewest defects.
– Defects are lower in only two of the four languages that support OO (C# and VB).

Table 5: Normalized distributions of "defects (Coding + Design)" in two different groups (programming languages and task 10), ranked using the Scott-Knott test (elucidated in § 4).
Defects (Coding + Design): Group 1 ranks the five programming languages, with VB at rank 1 (median 5, IQR 5); Group 2 ranks task 10 across languages, with 10, C# at rank 1. (Remaining rows and quartile plots omitted.)
Accordingly, we say:

Belief 2: Programs written in OO languages are not necessarily less defect-prone.

5.3 Belief 3: Mills-Jones Hypothesis

This section discusses the claim that
Quality entails productivity.
That is to say, a lack of early emphasis on quality in the project life-cycle will lead to a lot of rework (unproductive effort) and defective software. Mills showed that highly reliable software could be produced through cleanroom software engineering, which employs statistical quality control.

Fig. 6: The box-plots in this chart show the distribution of correlation scores grouped by programming language. The correlation is computed between "code defects" (X) and "test defects" (Y), computed as described in § 4.

Proponents of cleanroom software engineering focus on quality right from the early stages of the project. Having such a focus minimizes unnecessary effort in the later stages of the project, such as the effort of fixing defects in the final testing phase that went undetected earlier in the project life cycle (during coding or unit test).

To study this effect, we check for a linear trend between "code defects" (defects injected during coding) and "test defects" (defects that escaped to the testing phase) using the correlation test elucidated in § 4. We consider all 10 tasks (labeled 1 to 10) in Table 3 to gain more data points for the independent and dependent variables. Lastly, we export the significant correlation scores into a box-plot and discuss the strength of the observed trend based on the median. To achieve that, we do the following:

– We capture the number of "code defects" and the number of "test defects" for each task completed using a specific programming language.
– Then, we correlate the captured lists of "code defects" and "test defects" across all the 10 tasks and export the correlation coefficient (ρ) values.
– The above step results in 50 ρ scores (10 tasks x 5 programming languages). We plot the exported scores (distributions) in Figure 6.

Prediction:
If Mills & Jones were wrong, then it could mean that managers need notinvest in quality assurance activities early in their project life cycle.
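Before turning to the results, the following minimal Python sketch illustrates the grouping-and-correlation procedure just described. The dataframe columns and values are illustrative assumptions, not the study's actual schema or data.

```python
# Illustrative sketch of the belief-3 procedure: one Spearman score per
# (language, task) group; over 5 languages x 10 tasks this yields 50 scores.
import pandas as pd
from scipy.stats import spearmanr

# One row per completed task instance (made-up values, two groups only):
df = pd.DataFrame({
    "language":     ["C#"] * 4 + ["Java"] * 4,
    "task":         [1] * 8,
    "code_defects": [3, 1, 4, 2, 5, 2, 6, 3],
    "test_defects": [2, 0, 3, 1, 4, 1, 5, 2],
})

for (lang, task), g in df.groupby(["language", "task"]):
    rho, p = spearmanr(g["code_defects"], g["test_defects"])
    # The paper keeps only scores significant at the 99% level (p < 0.01)
    # and box-plots them per language, as in Figure 6.
    print(lang, task, round(rho, 2))
```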
Figure 6 presents a box-plot of all the exported correlation (ρ) scores, grouped by programming language. We used all tasks (labeled 1 to 10) from Table 3. From this figure, we make the following observations.

– We find a median ρ of +0.4 between "code defects" and "test defects" in Java; in the remaining four programming languages we analyzed, the correlation is above +0.5.
– An overall median correlation of +0.5 across all five programming languages confirms that rework increases (more test defects) when there is a lack of emphasis on quality early (more code defects). (Technical aside: to be precise, we say that we can support, but not strongly support, this belief since, from § 4, a |ρ| of 0.5 indicates "support".)

Accordingly, we say:

Belief 3: Emphasis on early quality does minimize rework. That said, the strength of this support is not strong (ρ = +0.5).

5.4 Belief 4: Sackman's Second Law

This section discusses the claim that "Individual developer performance varies considerably."
That is to say, developer X is considerably "better" at completing a task than developer Y. By "better" we mean that developer X writes more lines of code in less time than developer Y, and that developer X's deliverable gets fewer defects than developer Y's deliverable. Also note that, if true, Sackman's Second Law warns us that:

– Only some developers are productive and write quality code.

Variation between developers was a rather surprising finding by Sackman in 1966, as the objective of the original study was to compare productivity between online programming and offline programming [46]. Endres & Rombach also note that this effect has not been extensively studied in the past few decades. They also offer some doubts concerning the small sample size and the statistical approach used in the original (Sackman's) study: it considered only 12 developers, and its conclusion is based on extremes rather than the entire distribution. Note that in this work we compare large distributions of production rate and defect scores captured from thousands of developers.

Naturally, managers would prefer a few high performers over many low performers, but recently (2019) Nichols, using the same data, showed that a developer X who is productive in one task is not necessarily productive in another [37]. Thus we address the quality aspect of this belief. In other words, we check whether such a large production rate variance reflects poor quality (high defects). To assess that strength, we construct the experiment as follows:

– We rank the "production rate" distributions in our data set to identify a task 'T_P' (where T_P is a specific task 'T' completed using a programming language 'P' having the largest production rate variance).
– Correspondingly, we rank the "defect" distributions of tasks to compare the defect distribution of 'T_P' with the distributions of the same task 'T' completed in the 4 other programming languages. If 'T_P' portrays lower quality (more defects), then such a result would support this belief.

Prediction:
If Sackman was wrong, then practitioners may ease their strong preference for a few high-performing developers.
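The following minimal Python sketch illustrates the first step of this experiment, locating the population T_P with the largest production rate spread (IQR); the column names and values are made up for illustration.

```python
# Illustrative sketch: find the (task, language) population T_P with the
# widest production-rate spread, measured by the interquartile range (IQR).
import pandas as pd

df = pd.DataFrame({
    "task":     [10] * 8,
    "language": ["C#"] * 4 + ["Java"] * 4,
    "rate":     [5.0, 20.0, 9.0, 30.0, 8.0, 10.0, 9.0, 11.0],  # LOC/hour
})

iqr = (df.groupby(["task", "language"])["rate"]
         .agg(lambda s: s.quantile(0.75) - s.quantile(0.25)))
print(iqr.idxmax(), iqr.max())  # T_P: the population with the widest spread
```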
For simplicity of presentation, we only consider the level 2 tasks (labeled 7, 8, 9, and 10) listed in Table 3. Table 6 shows our "production rate" and "defects" results in four groups. From this table, we make several observations.

– Group 1 shows that among the level 2 tasks that developers completed, task 10 ('T', the row shown in gray) has the largest normalized production rate variance of 5 (IQR).
– Group 2 shows that among the level 2 tasks that developers completed, developers using C# ('P', the row shown in gray) have the largest normalized production rate variance of 6 (IQR).
– Group 3 tests the above two results in conjunction; i.e., 'T_P' (task 10 completed using C#) has the largest normalized production rate variance of 6 (IQR).
– The population '10, C#' (T_P) with the largest production rate variance did not correspondingly portray the highest defects in group 4. Rather, the evidence contradicts this belief, as '10, C#' has fewer defects than the same task completed in the other languages.

Table 6: Normalized "production rate" and "defects" distributions in four different groups, ranked using the Scott-Knott test (elucidated in § 4). The rows shown in gray have the largest production rate variances (IQRs of 5, 6 and 6). The last group indicates that task 10 completed using C# (T_P) has fewer defects than Java, C and C++.
Production rate (LOC/hour): Group 1 (level 2 tasks) places task 10 at rank 1 (median 7, IQR 5); Group 2 (programming languages) places C# at rank 1; Group 3 (task 10 across the 5 languages) places 10, C# at rank 1. Defects: Group 4 (task 10 across the 5 languages) places 10, C# at rank 1, i.e., fewest defects. (Quartile plots omitted.)

Accordingly, we say:

Belief 4:
Software Quality is not impacted by the variance in developer productionrate.
5.5 Belief 5: Apprentice's Law

This section discusses the claim that "It takes 5000 hours to turn a novice into an expert."
To assess the effect of prolonged programming experience, we analyze "production rate" and "defects" for the expert and novice groups. An expert is someone who is both knowledgeable and skilled in their field of work; an expert in this study is a developer who can complete the task on time (productive) with no defects (quality). Adopting from [15], we map the 5000-hour threshold as follows:

expert: years of experience corresponding to at least 5000 hours of programming experience
novice: years of experience corresponding to fewer than 5000 hours of programming experience

That is to say, (a) expert developers induce fewer defects than novices, and (b) expert developers are more productive in completing tasks than novice developers. Note that, if true, the Apprentice's Law warns us that we should mistrust novices due to their lower quality code. To check this, we analyze the distributions of "production rate" and "defects" among experts and novices. The ratio of expert to novice developers in our data is shown in Figure 7.

Fig. 7: Proportion of expert and novice developers who completed all the 10 tasks listed in Table 3 using a specific programming language.

Table 7: This table shows "production rate" and "defects" distributions of expert and novice developers in four groups, ranked using the Scott-Knott test (elucidated in § 4). Groups 1 and 2 show no significant differences between expert and novice developers. On the other hand, groups 3 and 4 portray significantly different distributions (as seen from the different Scott-Knott ranks) among developers using different programming languages.
Production rate (LOC/hour): Group 1 (all tasks) places expert and novice in the same rank (expert median 6, IQR 4). Defects: Group 2 (all tasks) likewise shows one shared rank (expert median 6, IQR 6). Group 3 (production rate, by language and expertise) and Group 4 (defects, by language and expertise) both place C# populations at rank 1. (Quartile plots omitted.)
Table 7 presents our results on production rate and defects in 4 groups. From this table, we make the following observations.

– Despite numerous studies in the past that endorsed this effect, groups 1 and 2 reveal no effect of prolonged programming experience: novice developers were as productive, and induced the same amount of defects, as expert developers.
– Our earlier results confirm that some programming languages have an effect on "production rate" and "quality" (defects). Thus, to check whether "years of experience" also influences developers using different programming languages, we segregate the expert and novice populations by programming language and find the following:
  – "Years of experience" has little influence on "defects" among developers using different programming languages. As in our earlier results in Table 5, overall, C# novices portray better quality (fewer defects) than developers of three other languages. This also implies that, strangely, C# novices portray better quality (fewer defects) than experts.
  – Similar to the results shown earlier in Table 6, production rate is best among C# developers.
The Apprentice's Law is supported only along the lines of production rate, and only for Java and C++ developers (2 of the 5 groups of developers); it has no influence in mitigating defects. Our evidence supports the counterclaim that practical industrial experience has little to do with expertise. There is no noticeable performance difference between experts and novices (groups 1 and 2). We believe the conditions for deliberate practice [16] are not achieved in normal work; thus, years of experience have limited benefit. Hence, overall, we say:

Belief 5: Experienced developers did not necessarily write better (fewer defects) programs on time.
Earlier, in Figure 1, we presented the relationships between beliefs and their entities as recorded in the literature. We revisit that figure using the evidence from assessing the five beliefs in § 5 (Figure 8). Here a line edge (black) between two entities denotes an effect backed by empirical evidence.

6.1 RQ1: Why do beliefs diverge among practitioners?
Given that the five beliefs we chose are decades-old prevalent beliefs, we naturally expected strong support; surprisingly, our analysis showed that none of these beliefs is strongly supported presently. It is important to note that such beliefs naturally take hold in practice [13, 15, 39], and this is not to say that these beliefs were never true.

We reason below that a probable source of divergence of beliefs among practitioners [13] could be the misinterpretation of effects observed from partial evidence. For example, recall the effect reported in belief 2 that "Programs written using non-OO languages naturally induce more defects." Although the results of belief 2 from § 5 do not support this claim overall, a practitioner observing only a subset of languages could conclude that it holds; such misinterpretation arises from the lack of a broader perspective.

Lastly, looking at Figure 8, amidst the overall negative results we found support for belief 3, titled "Quality entails productivity", which is unaffected among 4 of the 5 programming languages we analyzed, and which we endorse.

Accordingly, we say:
Apart from belief 3 titled “Quality entails productivity”, none of the other beliefsare supported.
6.2 RQ2: What is the relationship between Productivity, Quality, and Expertise?
If studies like this can confirm associations among these three entities (Productivity, Quality, and Expertise), practitioners can make better choices during their project life-cycle. Such associations include:

– Experienced developers produce a quality deliverable on time.
– Early quality assurance activities can ensure faster delivery.

Using the directed edges of the graph shown in Figure 8, we find:

– The belief 5 results do not reveal any beneficial effect of years of developer experience on software quality; experience only makes some groups of developers (specifically Java and C++) more productive.
– On the other hand, the belief 3 results confirm that early quality assurance facilitates on-time delivery. That association is unaffected by programming language or task complexity.
Accordingly, we say:
A focus on quality early in the project life-cycle minimizes rework, but programming experience neither improves production rate nor mitigates defects.
Notably, it is now apparent from the results of the beliefs and these discussions that some programming languages were better than others. We discuss that next in RQ3.
6.3 RQ3: What impacts Productivity and Quality?
The beliefs we chose to assess in this study bundle three entities: developer productivity, software quality, and developer expertise. Practitioners, throughout their project life-cycles, track production rate, quality, and developer expertise to better manage their deliverables and meet deadlines. Hence, if practitioners can understand what affects these entities, they can make better decisions. While assessing the five beliefs, we noted that developers completing tasks using some programming languages were better (more productive, and induced fewer defects) than others. Thus we think it is useful to revisit the results of the beliefs in § 5:

– The results of beliefs 4 and 5 confirm that production rate is better among C# developers.
– The results of beliefs 2 and 5, strangely, confirm that defects are lower among C# novices than experts. Years of developer experience only made an impact on the production rate of C++ and Java developers. But note that the production rate of Java experts is significantly lower than that of C# novices.
– The results of belief 1 show that, at similar program sizes (LOC), tasks completed using C# had better production rates and fewer defects.
Accordingly, we say:

Programming language, rather than years of developer experience, impacts productivity and quality; specifically, C# developers showed better production rates and fewer defects.
7 Threats to Validity

We draw the following subsections from Wohlin et al. [57] (first conceived by Cook and Campbell [8]).

Programs cannot be directly compared; because this data replicates the same task across multiple developers, quality is best measured by the total number of defects in code, design, and test.

7.2 Internal Validity

Threats to internal validity concern causal relationships arising from artifacts of the study design and execution. They may also include factors that were not controlled or measured, or unintended factors introduced during study execution.

The PSP course's emphasis on measuring production, estimation, and quality could have influenced the developers' performance. The mitigation is that the developers were not in any sort of competition with each other; instead, they were instructed to collect consistent data to measure their performance trends. Also, there are no overlaps; i.e., the same developer completed the ten tasks only once using a programming language. Other uncontrolled factors include experience with a specific programming language and aspects of the development environment in which the class was taken.

Analyst bias in conducting the research is always a potential threat. This is minimized because the data was collected for an entirely different purpose, over an extended time, by several independent individuals. We further minimize this threat by relying on quantitative data and fully revealing that data. Lastly, we do not consider PSP as a treatment; rather, we use that data to observe evidence about the prevalent beliefs we evaluated. We do not question the authenticity of these beliefs in the past; given the notable increase in the number of programming languages, supporting tools, memory, computation power, and online workforce, we question the relevancy of these beliefs presently (§ 6).

8 Conclusion

Through an extensive evaluation of five old Software Engineering beliefs (originating between 1967 and 1993), we found support for only one belief, titled "
Quality entails productivity". That implies that on-time delivery is achieved with a quality-driven focus. The four other beliefs we assessed are not supported; the uncertainties in the results of those beliefs portray how practitioners with a narrow scope could misinterpret specific effects as holding in their work.

Notably, we observed programming language to be a better indicator of software quality and production rate than years of developer experience. In other words, production rate and software quality varied across programming languages. Overall, irrespective of programming experience, C# developers showed better production rates and fewer defects.
Our results reinforce the recent findings of Shrikanth & Menzies [47] and others in the past [13, 39]: practitioners should not inherently believe that their past will hold in the present. Like peer SE researchers [34, 48], we suggest that all practitioners, especially subject matter experts, consider assessing a handful of beliefs empirically from time to time to understand what works for their organization. Specifically, our current results prescribe:

– Practitioners should emphasize quality right from the early stages of their projects.
– Practitioners should be less concerned about programming experience and more concerned about programming language.
Acknowledgements
This work was partially supported by an NSF grant. TSP^SM and PSP^SM are service marks of Carnegie Mellon University.

References
1. Tiruvalluvanayanar arulicceyta tirrukkural = the 'sacred' kurral of tiruvalluva-nayanar. https://archive.org/details/tiruvalluvanayan00tiruuoft/mode/2up. Accessed: 2020-04-18.
2. Amritanshu Agrawal, Akond Rahman, Rahul Krishna, Alexander Sobran, and Tim Menzies. We don't need another hero? In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2018.
3. Sebastian Baltes and Stephan Diehl. Towards a theory of software development expertise. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 187–200. ACM, 2018.
4. Pamela Bhattacharya and Iulian Neamtiu. Assessing programming language impact on development and maintenance: A study on C and C++. In Proceedings of the 33rd International Conference on Software Engineering, pages 171–180. ACM, 2011.
5. Tegawendé F. Bissyandé, Ferdian Thung, David Lo, Lingxiao Jiang, and Laurent Réveillère. Popularity, interoperability, and impact of programming languages in 100,000 open source projects. In 2013 IEEE 37th Annual Computer Software and Applications Conference (COMPSAC), pages 303–312. IEEE, 2013.
6. F. Brooks and H.J. Kugler. No silver bullet. April, 1987.
7. Frederick P. Brooks Jr. et al. The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition, 2/E. Pearson Education India, 1995.
8. Donald Thomas Campbell and Thomas D. Cook. Quasi-experimentation: Design & analysis issues for field settings. Rand McNally College Publishing Company, Chicago, 1979.
9. Richard H. Cobb and Harlan D. Mills. Engineering software under statistical quality control. IEEE Software, 7(6):45–54, 1990.
10. Fernando J. Corbató. PL/I as a tool for system programming. Datamation, 15(5):68, 1969.
11. Ole-Johan Dahl and Kristen Nygaard. Class and subclass declarations. In Pioneers and Their Contributions to Software Engineering, pages 235–253. Springer, 2001.
12. Marco D'Ambros, Michele Lanza, and Romain Robbes. An extensive comparison of bug prediction approaches. In 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), pages 31–41. IEEE, 2010.
13. Premkumar Devanbu, Thomas Zimmermann, and Christian Bird. Belief & evidence in empirical software engineering. In Proceedings of the 38th International Conference on Software Engineering (ICSE'16), pages 108–119. IEEE, 2016.
14. Oscar Dieste, Alejandrina M. Aranda, Fernando Uyaguari, Burak Turhan, Ayse Tosun, Davide Fucci, Markku Oivo, and Natalia Juristo. Empirical evaluation of the effects of experience on code quality and programmer productivity: an exploratory study. Empirical Software Engineering, 22(5):2457–2542, 2017.
15. Albert Endres and H. Dieter Rombach. A handbook of software and systems engineering: Empirical observations, laws, and theories. Pearson Education, 2003.
16. K. Anders Ericsson. Deliberate practice and the acquisition and maintenance of expert performance in medicine and related domains. Academic Medicine, 79(10):S70–S81, 2004.
17. K. Anders Ericsson, Ralf T. Krampe, and Clemens Tesch-Römer. The role of deliberate practice in the acquisition of expert performance. Psychological Review, 100(3):363, 1993.
18. Steven Fraser, Dave Astels, Kent Beck, Barry Boehm, John McGregor, James Newkirk, and Charlie Poole. Discipline and practices of TDD (test driven development). In Companion of the 18th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA '03, pages 268–270, New York, NY, USA, 2003. Association for Computing Machinery.
19. D. Fucci, H. Erdogmus, B. Turhan, M. Oivo, and N. Juristo. A dissection of the test-driven development process: Does it really matter to test-first or to test-last? IEEE Transactions on Software Engineering, 43(7):597–614, 2017.
20. Adele Goldberg and David Robson. Smalltalk-80: the language and its implementation. Addison-Wesley Longman Publishing Co., Inc., 1983.
21. Fernanda Grazioli. An Analysis of Student Performance During the Introduction of the PSP: An Empirical Cross Course Comparison. PhD thesis, Universidad de la República, 2013.
22. Anders Hejlsberg, Scott Wiltamuth, and Peter Golde. The C# Programming Language. Adobe Press, 2006.
23. Watts S. Humphrey. A discipline for software engineering. Addison-Wesley Longman Publishing Co., Inc., 1995.
24. Philip M. Johnson and Anne M. Disney. A critical analysis of PSP data quality: Results from a case study. Empirical Software Engineering, 4(4):317–349, December 1999.
25. Mik Kersten and Gail C. Murphy. Using task context to improve programmer productivity. In Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 1–11, 2006.
26. Pavneet Singh Kochhar, Dinusha Wijedasa, and David Lo. A large scale study of multiple programming languages and code quality. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), volume 1, pages 563–573. IEEE, 2016.
27. Pavneet Singh Kochhar, Xin Xia, David Lo, and Shanping Li. Practitioners' expectations on automated fault localization. In Proceedings of the 25th International Symposium on Software Testing and Analysis, pages 165–176. ACM, 2016.
28. Thomas D. LaToza, Maryam Arab, Dastyni Loksa, and Amy J. Ko. Explicit programming strategies. Empirical Software Engineering, pages 1–34, 2020.
29. Yingling Li, Lin Shi, Jun Hu, Qing Wang, and Jian Zhai. An empirical study to revisit productivity across different programming languages. Pages 526–533. IEEE, 2017.
30. Tim Menzies, William Nichols, Forrest Shull, and Lucas Layman. Are delayed issues harder to resolve? Revisiting cost-to-fix of defects throughout the lifecycle. Empirical Software Engineering, 22(4):1903–1935, 2017.
31. Harlan D. Mills. Cleanroom engineering. Advances in Computers, 36:1, 1993.
32. H.D. Mills. Software productivity in the enterprise. In Software Productivity, pages 265–270. Little, Brown, 1983.
33. N. Mittas and L. Angelis. Ranking and clustering software cost estimation models through a multiple comparisons algorithm. IEEE Transactions on Software Engineering, 39(4):537–551, April 2013.
34. Akito Monden, Masateru Tsunoda, Mike Barker, and Kenichi Matsumoto. Examining software engineering beliefs about system testing defects. IT Professional, 19(2):58–64, 2017.
35. Emerson Murphy-Hill, Ciera Jaspan, Caitlin Sadowski, David Shepherd, Michael Phillips, Collin Winter, Andrea Knight, Edward Smith, and Matt Jorde. What predicts software developers' productivity? IEEE Transactions on Software Engineering, 2019.
36. Meiyappan Nagappan, Romain Robbes, Yasutaka Kamei, Éric Tanter, Shane McIntosh, Audris Mockus, and Ahmed E. Hassan. An empirical study of goto in C code from GitHub repositories. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pages 404–414. ACM, 2015.
37. William R. Nichols. The end to the myth of individual programmer productivity. IEEE Software, 36(5):71–75, 2019.
38. Donald A. Norman. Things That Make Us Smart: Defending Human Attributes in the Age of the Machine. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1993.
39. Carol Passos, Ana Paula Braun, Daniela S. Cruzes, and Manoel Mendonca. Analyzing the impact of beliefs in software project practices. In ESEM'11, 2011.
40. Mark C. Paulk. Factors affecting personal software quality. 2006.
41. Mark C. Paulk. The impact of process discipline on personal software quality and productivity. Software Quality Professional, 12(2):15, 2010.
42. Mark Christopher Paulk. An empirical study of process discipline and software quality. PhD thesis, University of Pittsburgh, 2005.
43. George Uglow Pope et al. Sacred Kurral of Tiruvalluva Nayanar. Asian Educational Services, 1999.
44. Baishakhi Ray, Daryl Posnett, Vladimir Filkov, and Premkumar Devanbu. A large scale study of programming languages and code quality in GitHub. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 155–165. ACM, 2014.
45. Dieter Rombach, Jürgen Münch, Alexis Ocampo, Watts S. Humphrey, and Dan Burton. Teaching disciplined software development. Journal of Systems and Software, 81(5):747–763, 2008.
46. Harold Sackman, Warren J. Erikson, and E. Eugene Grant. Exploratory experimental studies comparing online and offline programing performance. Technical report, System Development Corporation, Santa Monica, CA, 1966.
47. N.C. Shrikanth and Tim Menzies. Assessing practitioner beliefs about software defect prediction. arXiv preprint, 2019.
48. F. Shull. I believe! IEEE Software, 29(1):4–7, January 2012.
49. William M. Thomas, Alex Delis, and Victor R. Basili. An analysis of errors in a reuse-oriented development environment. Journal of Systems and Software, 38(3):211–224, 1997.
50. Diego Vallespir and William Nichols. An analysis of code defect injection and removal in PSP. In Proceedings of the TSP Symposium 2012, Pittsburgh, 2012. Carnegie Mellon University.
51. Diego Vallespir and William Nichols. Quality is free: personal reviews improve software quality at no cost. Software Quality Professional, 18(2), 2016.
52. András Vargha and Harold D. Delaney. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2):101–132, 2000.
53. Bogdan Vasilescu, Yue Yu, Huaimin Wang, Premkumar Devanbu, and Vladimir Filkov. Quality and productivity outcomes relating to continuous integration in GitHub. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pages 805–816. ACM, 2015.
54. Zhiyuan Wan, Xin Xia, Ahmed E. Hassan, David Lo, Jianwei Yin, and Xiaohu Yang. Perceptions, expectations, and challenges in defect prediction. IEEE Transactions on Software Engineering, 2018.
55. William Nichols, Watts Humphrey, Julia Mullaney, James McHale, Dan Burton, and Alan Willett. PSP student assignment data, 2019.
56. Claes Wohlin. Is prior knowledge of a programming language important for software quality? In Proceedings of the International Symposium on Empirical Software Engineering, pages 27–34. IEEE, 2002.
57. Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. Experimentation in software engineering. Springer Science & Business Media, 2012.
58. Xin Xia, Lingfeng Bao, David Lo, Pavneet Singh Kochhar, Ahmed E. Hassan, and Zhenchang Xing. What do developers search for on the web? Empirical Software Engineering, 22(6):3149–3185, 2017.
59. Xin Xia, Zhiyuan Wan, Pavneet Singh Kochhar, and David Lo. How practitioners perceive coding proficiency. In Proceedings of the 41st International Conference on Software Engineering, pages 924–935. IEEE Press, 2019.
60. Thomas Zimmermann, Rahul Premraj, and Andreas Zeller. Predicting defects for Eclipse. In Third International Workshop on Predictor Models in Software Engineering (PROMISE'07: ICSE Workshops 2007), pages 9–9. IEEE, 2007.
61. Weiqin Zou, David Lo, Zhenyu Chen, Xin Xia, Yang Feng, and Baowen Xu. How practitioners perceive automated bug report management techniques. IEEE Transactions on Software Engineering, 2018.