How do students test software units?
Lex Bijlsma, Niels Doorn, Harrie Passier, Harold Pootjes, Sylvia Stuurman
Lex Bijlsma
Department of Computer Science, Open Universiteit
Heerlen, The Netherlands
[email protected]

Niels Doorn
Academy of ICT & Creative Technologies, NHL Stenden Hogeschool
Emmen, The Netherlands
[email protected]

Harrie Passier
Department of Computer Science, Open Universiteit
Heerlen, The Netherlands
[email protected]

Harold Pootjes
Department of Computer Science, Open Universiteit
Heerlen, The Netherlands
[email protected]

Sylvia Stuurman
Department of Computer Science, Open Universiteit
Heerlen, The Netherlands
[email protected]
Abstract—We gained insight into the ideas and beliefs on testing of students who finished an introductory course on programming without any formal education on testing. We asked students to fill in a small survey, to do four exercises and to fill in a second survey. We interviewed eleven of these students in semi-structured interviews, to obtain more in-depth insight. The main outcome is that students do not test systematically, while most of them think they do test systematically. One of the misconceptions we found is that most students can only think of test cases based on programming code. Even if no code was provided (black-box testing), students try to come up with code to base their test cases on.

Index Terms—D.2 Software Engineering, D.2.5 Testing and Debugging, D.2.5.j Test levels, D.2.5.k Testing strategies, K.3.2.b Computer science education
I. INTRODUCTION
Professional software developers spend a considerable part of their time on testing. Agile methods in particular have increased the importance of testing throughout the development process. Yet, many recent graduates lack sufficient testing skills [1]. For example, Edwards found that students detected only approximately 15% of the bugs, and observed that students often only apply so-called 'happy path' testing [2], implicitly assuming that the input is 'ideal'.

In most university curricula, scant attention is paid to testing. For example, testing is fully integrated within the curriculum in only 2 out of 20 Dutch universities [3]. In some curricula, testing is not a topic at all. In most curricula, testing education is limited to an introduction of Java's popular test framework JUnit. How to compose test cases is often not given explicit attention.

There are strong indications that testing instruction influences the quality of students' programs positively. Some educators state that testing as an activity improves software comprehension [4] and use that motivation to introduce and integrate testing in programming education. In fact, knowledge about testing in itself tends to improve the quality of students' programs [5]. This effect was even observed when test cases were provided by the teacher [6]. Moreover, testing is one of the core knowledge areas in both the ACM/IEEE curriculum guidelines for computer science and software engineering [7], [8].

When we want to produce explicit, procedural guidance on how to create test cases and better instructional material, we should know how students view testing initially and which misconceptions need attention. Our research question is, therefore:

• What ideas do students have about testing before they have had any relevant instruction?

We provide an overview of what is known with respect to our question in Section II. In Section III we explain how we approached our research. The results are described and analyzed in Sections IV through VII. We conclude with answers to our question.

II. RELATED WORK
As far as we know, only Edwards systematically examined how students test. He used the tests that students sent in for an assignment in a course on data structures [2]. Grading was based on branch coverage of the tests. The mean coverage was 95.4%. The similarity between the tests was large: 90% of the tests were the same. To check which bugs the students' tests could detect, all tests of the students were combined into a single large test suite, along with the reference tests for the assignment. This test suite was then run against all student programs. The tests of the students only detected 13.6% of the total number of bugs. Almost all students only performed 'happy path' testing, testing only the default scenario.

The fact that not only students, but also professionals tend to rely on 'happy path testing' has been known for a long time. Leventhal [9] found strong evidence of happy path testing (also called positive bias testing). The only 'antidote' mentioned there is to construct thorough and complete program specifications.

The tendency of students to test only the default scenario is in accordance with the finding that students have 'alternative standards' for correctness [10]. Students soften the requirement that a function should show correct behavior for all input to the notion that the behavior should be correct for most input, for input that seems 'logical'.

Practitioners complain that many recently graduated students lack sufficient testing skills [1]. Practitioners see a skill gap between university graduates and industry expectations and think that graduates often do not seem to see the value of testing. They observe that graduated students often follow a trial-and-error strategy: build something and 'see if it works'. Explicitly teaching testing does have an effect: students who have been educated in the subject produce better test cases [11].

Testing, so it seems, requires explicit attention in the curriculum. Without an explicit specification given by the teacher, students seem to assume a specification that only allows 'ideal' input.

III. METHOD
A. Aim of the study
This is an in-depth explorative study on the perceptions of software testing by first-year students in computer science. The emphasis is on qualitative data. The participants have programming knowledge on the level of an introductory course about programming, but have no prior formal education on the topic of software testing. We want to study their natural way of testing software units during programming. To do so, we used surveys, exercises and interviews. In this section, we describe our research setup, methods and the way we analyzed the results.
B. General approach
Thirty-one students were involved, all first-year computer science students at a university of applied sciences. All students have basic knowledge of:

• HTML and imperative programming using PHP (period 1);
• databases and SQL, with some attention to exception handling (period 2);
• an introduction to OO programming with Java using BlueJ (period 3).

In none of these courses was testing a topic.

We asked the students to fill in a survey. Then we asked them to do four exercises, and then we asked them to fill in a second survey. Together, this took them about 45 minutes. Finally, we interviewed a subset of the participants. These interviews took about twenty minutes.
1) Pre-exercises survey:
The aim of the first survey is to probe the ideas and beliefs the participants have on software testing without being exposed to the exercises. This survey contains one multiple-choice question about when during development tests can best be formulated, followed by an open question to motivate the given answer, and six statements about testing with five-point Likert scale answer options, three of which were taken from Kolikant [10].
2) The exercises:
The participants were presented with four exercises. Three exercises had a functional description and an implementation. One had a functional description only. For each exercise, the participant is asked:

• to give the test cases needed to decide whether the function is correct or not;
• (for the first three) to determine the correctness of the provided implementation and, in case of incorrectness, to provide a test case to prove this claim.
3) Post-exercises survey:
The second survey is held directly after the exercises. The survey asks for:

• the perceived complexity of the exercises;
• the process they followed to complete the exercises;
• the supposed correctness of their answers;
• the general ratio of time that they think should be spent on programming and testing in daily practice;
• whether they took boundary values into account during the exercises.

All questions of this survey have five-point Likert scale answer options.
4) Interviews:
To obtain more in-depth knowledge of the way the exercises had been done, we interviewed a subset of the participants using a semi-structured interview. We could also verify the given answers to the exercises and the students' ideas about testing in more depth. All interviews were conducted by two interviewers, one as the chair and the other one making notes. All interviews were audio-recorded and transcribed verbatim.
C. Ideas about testing
We aimed to study both the students' testing methods and the way they understand the concepts related to testing. We also gained insights into the misconceptions of the participants. We discerned the following ideas about testing; the instruments we used are given in parentheses:

• During what programming phase (before, during or after programming) is testing relevant (pre-exercises survey)?
• Which stakeholder should conduct testing (pre-exercises survey)?
• In what depth and width should software be tested (pre- and post-exercises surveys, exercises and interviews)?
• What time ratio should be spent on testing (post-exercises survey)?
• Completeness of testing, i.e. when does one have enough test cases (post-exercises survey, exercises and interviews)?
• Does creating test cases help with understanding code (pre- and post-exercises surveys)?
• Does one use boundary values for test cases (pre- and post-exercises surveys)?

We also asked them to note the start and end time of each exercise, to see how much time they spent creating test cases.
D. Analysis
The analysis was done by four researchers, all involved in software engineering education.

1) Pre- and post-exercises surveys: Both surveys were subjected to quantitative analysis. Answers to the open question from the pre-exercises survey were collected and analyzed quantitatively. We labeled the answers using characteristics that we found in the answers. Examples of these characteristics are:

• testing before, during or after programming;
• testing the whole by testing the individual components;
• an iterative approach: write some code, immediately write tests;
• a trial-and-error approach;
• focus on code;
• focus on functionality;
• testing as a means to check whether the code is robust.

The analysis was performed by two researchers and reviewed by the other two researchers.
2) Exercises:
The answers to the exercises were analyzed separately on completeness of the test cases, test approach, mistakes, misconceptions, and time spent. After that, the findings were aggregated using a classification defined during a brown paper session. The results of these findings and the classification were discussed until consensus on all decisions was reached.
3) Interviews:
All transcripts and notes were read entirely by all researchers. We analyzed the interviews with respect to the completeness of the test cases and the approaches. We used a classification, defined during a brown paper session, to aggregate the results, and analyzed them quantitatively. The results of these findings and the classification were discussed until we reached consensus on all decisions.
4) Meta-analysis:
Finally, to determine the main findings, we performed a meta-analysis. We determined the most important classes during a brown paper session and provided them with clear examples. These are the examples we show in this article. Again, the results of this analysis and classification were discussed until we reached consensus on all decisions.

IV. RESULTS - PRE-EXERCISES SURVEY
31 students filled in the survey.
A. When to test
Question:
The best time to construct test cases is (a) after, (b) before or (c) during programming? Motivate your answer. (Multiple answers are allowed.)
Answers:

Combination   Frequency (N = 31)
a             2
b             2
c             12
ab            0
ac            9
bc            1
abc           5

If we just look at the number of times an alternative was mentioned, whether or not in combination with other alternatives, we get the following:

Alternative   Frequency (N = 31)
after         16
before        8
during        27

Conclusion: There is a preference for testing during programming. The low score for testing before programming is understandable if students base their test cases on code inspection (see later). We also suspect, based on the motivations students added, that some students do not clearly distinguish between constructing test cases and running tests. Also, the preference for testing during programming is possibly based on a confusion between testing and compiling, and between running and debugging (see later).

B. Claims about testing
We presented the students with six statements about testing. Questions 4, 5 and 6 were taken from Kolikant [10]. The students could indicate their level of agreement on a five-point Likert scale (1: completely disagree, 5: completely agree). For these statements, we have N = 29.

1) Absence of errors

Claim: Testing can make it plausible that your program does not contain errors.
Answers: Average 3.41, standard deviation 1.02. Many students place high trust in the power of testing.

2) Who tests?

Claim:
It is best if end users perform the tests.
Answers: Average 3.66, standard deviation 1.20. This statement is widely agreed with, which is unexpected in view of the preference they expressed to construct test cases during development, see IV-A.

3) Which test cases?

Claim:
The most important consideration when selecting test cases is to ensure that they are representative of the expected use of the program.
Answers: Average 3.59, standard deviation 1.05. This shows the tendency to 'happy path testing'.

4) Confidence

Claim:
For a program I have written myself, I know it works well when I have run it several times and obtained correct output.
Answer: Average 2.66, standard deviation 0.98. This is similar to claim A.1 from Kolikant's paper [10]. In that paper, 50% of respondents agreed with the statement, both at high school and college level. Our respondents seem to possess a somewhat more sophisticated attitude: only 24% agreed.

5) Reasonable output

Claim:
In testing a program for a complicated calculation, I am satisfied when the output looks reasonable. It is not necessary to redo the calculation by hand.
Answer: Average 1.69, standard deviation 0.70. This is similar to claim A.2 from Kolikant [10]. In his study, 33% of the high school students and 69% of the college students agreed. Of our respondents, only one agreed, choosing answer 4.

6) No testing

Claim:
Sometimes I am sure that a program I have written is completely correct. In such a case, if the program compiles, it is not necessary to run or test the program.
Answer: Average 1.55, standard deviation 0.90. This is similar to claim A.3 from Kolikant [10]. In the Kolikant study, 42% of high school students and 31% of college students agreed. Of our respondents, only two (7%) agreed. Both gave the answer 4.

V. RESULTS - EXERCISES
Students were presented with four exercises. In each exercise, they were asked to write test cases for the given function. Three of the exercises were both 'black-box' and 'white-box': both a functional specification and Java code were provided. One exercise was 'black-box', with only a functional specification.

For each of the white-box exercises, students were asked whether they considered the code to be correct. If not, they were asked to present a test case that would fail due to the incorrect code.

All exercises were single functions with input and output in the form of arrays of integers or a single integer. The exercises contained programming constructs and syntax that should be familiar to the students and were part of the previous Java courses they followed.
A. The exercises
The exercises were as follows:
1) Exercise 1: The longest period of frost:
This function determines the length of the longest period of frost from a series of temperatures. The input is an array of integers representing the temperatures of a sequence of days. The output is an integer representing the length of the longest number of consecutive days the temperature was below zero. The body of the method is not correct: currentPeriod is initialized to -1, which should be 0. The code is as follows:

/**
 * Returns the longest uninterrupted period of temperatures below 0
 */
public int longestBelowZero(int[] temperatures) {
    int longestPeriod = 0;
    int currentPeriod = -1;
    for (int i = 0; i < temperatures.length; i++) {
        if (temperatures[i] < 0) {
            currentPeriod++;
        } else {
            if (currentPeriod >= longestPeriod) {
                longestPeriod = currentPeriod;
            }
            currentPeriod = 0;
        }
    }
    return longestPeriod;
}

Listing 1. Frost exercise (frost)
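Stated as a concrete unit test, an array that starts with a frost period of two days exposes this initialization error: the specification requires 2, whereas the listing above yields 1. The following is a minimal sketch, assuming JUnit 5 and a hypothetical class Weather containing the method from Listing 1:

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class LongestBelowZeroTest {

    @Test
    void frostPeriodAtTheStartOfTheArray() {
        // Specification: two consecutive days below zero, so the expected result is 2.
        // The faulty listing returns 1, because currentPeriod starts at -1 instead of 0.
        assertEquals(2, new Weather().longestBelowZero(new int[] {-1, -1, 0}));
    }
}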
2) Exercise 2: The lowest index of the lowest value:
The input for this function is an array of integers. The function should determine the lowest index of the lowest number in the array. The provided code is incorrect: the index in the for loop that iterates over the values is initialized to 1, which should be 0. The code is as follows:

/**
 * Returns the lowest index of the lowest value
 */
public int findTheLowestIndexOfTheLowestValue(int[] numbers) {
    int index = 1;
    for (int i = 1; i < numbers.length; i++) {
        if (numbers[i] < numbers[index]) {
            index = i;
        }
    }
    return index;
}

Listing 2. Lowest index of the lowest value exercise (min-min)
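A pair of JUnit 5 test cases may clarify why happy-path inputs miss this error: as long as the lowest value is not at index 0, the faulty initialization is harmless, but an array whose minimum sits at index 0 makes the method return 1 instead of 0. This is a sketch only; the class Exercise2 holding the method is hypothetical:

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class FindLowestIndexTest {

    private final Exercise2 sut = new Exercise2(); // hypothetical class holding the method

    @Test
    void happyPathInputDoesNotRevealTheBug() {
        // The minimum 2 is at index 1, so even the faulty code answers correctly.
        assertEquals(1, sut.findTheLowestIndexOfTheLowestValue(new int[] {4, 2, 7}));
    }

    @Test
    void minimumAtIndexZeroRevealsTheBug() {
        // Specification: the lowest value 1 is at index 0; the faulty code returns 1.
        assertEquals(0, sut.findTheLowestIndexOfTheLowestValue(new int[] {1, 2, 3}));
    }
}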
3) Exercise 3: Changing coins:
The input of this function is an integer representing an amount of money in cents. The function returns the smallest sequence of coins from the euro coin series that can be used to represent that amount of money. The code is correct.

/**
 * Returns the smallest sequence of coins
 * to represent the input argument.
 * Possible coins:
 * 1, 2, 5, 10, 20 and 50 cent and
 * 1 and 2 euro (100 and 200 cent)
 */
public ArrayList<Integer> exchange(int amount) {
    ArrayList<Integer> result = new ArrayList<>();
    int[] coins = {200, 100, 50, 20, 10, 5, 2, 1}; // euro coin series in descending order
    for (int coin : coins) {
        for (int i = 0; i < amount / coin; i++) {
            result.add(coin);
        }
        amount = amount - (amount / coin) * coin;
    }
    return result;
}

Listing 3. Exchange exercise (coins)
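As a usage illustration of the greedy loop (a sketch, assuming the coin array as reconstructed above in descending order and a hypothetical enclosing class Money), changing 288 cents yields seven coins:

import java.util.List;

public class ExchangeDemo {
    public static void main(String[] args) {
        // 288 cents = 2 euro + 50 + 20 + 10 + 5 + 2 + 1 cent, seven coins in total.
        List<Integer> coins = new Money().exchange(288);
        System.out.println(coins); // prints [200, 50, 20, 10, 5, 2, 1]
    }
}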
4) Exercise 4: Palindrome:
The input of this function is a string. It returns true when the input is a palindrome, and false when it is not. This is the black-box exercise, without implementation. Only the signature and the description are given.

/**
 * Input: A string
 * Output: true if the string is a palindrome,
 * otherwise false
 */
public boolean isPalindrome(String word) {
    // no body is provided
}

Listing 4. Palindrome checker exercise (palindrome)
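Because only the specification is given, test cases for this exercise have to be derived from the description alone. The sketch below (JUnit 5; the class Words is hypothetical) shows what such a specification-based test set might look like; the behaviour for the empty string is not fixed by the description, so that assertion is an explicit assumption a tester would have to clarify:

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.Test;

class IsPalindromeTest {

    private final Words sut = new Words(); // hypothetical class holding isPalindrome

    @Test
    void recognizesOddAndEvenLengthPalindromes() {
        assertTrue(sut.isPalindrome("racecar")); // odd length
        assertTrue(sut.isPalindrome("abba"));    // even length
    }

    @Test
    void rejectsANonPalindrome() {
        assertFalse(sut.isPalindrome("paling"));
    }

    @Test
    void edgeCasesLeftOpenByTheSpecification() {
        assertTrue(sut.isPalindrome("a")); // single character
        assertTrue(sut.isPalindrome(""));  // assumption: the empty string counts as a palindrome
    }
}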
These four exercises have an algorithmic nature. Based on the description and the code, students should be able to understand the algorithm and come up with test cases. The exercises differ in the concepts that are used. The frost exercise contains a for loop that iterates over the input array. It also has branching, a conditional statement (if-else), with another conditional statement in the else branch. The second exercise uses the same array with two indexes in the conditional statement. This can be easily overlooked. The coin exchange exercise uses a nested loop construct. This is often considered to be a complex concept for novice programmers [12], [13]. There is also a mathematical statement with a subtraction, a multiplication and a division. The palindrome exercise handles string input and uses a boolean return value.
B. Observations
Analyzing the students' answers, we divided our observations into four main categories:

1) test approaches
2) completeness of the test cases
3) misconceptions
4) programming knowledge

For each category, we give some examples. The exercise is mentioned between brackets, i.e. frost, coins, min-min, or palindrome.
1) Test approaches:

a) Happy path testing: The test approach that most students applied is happy path testing. We determined 33 test sets consisting of happy path test cases only: frost 10, coins 3, min-min 6, and palindrome 14. An example from the frost exercise:

'One test case with at least one temperature below zero.'
Another example from the min-min exercise: '[0,1,2,0,2,-1]'

b) Structural approaches:
We determined nine test cases that could be interpreted as boundary testing. One student differentiated on the frost exercise between a test case with one temperature and a test case with several temperatures:

'One test with a known outcome. Then, one array with one item. And one array with temperatures below zero only.'
Another student defined an empty array and a number of arrays with length greater than zero for the frost exercise:

'[], [-1,-1,-1,4,3,-2,-2], [-1,0,1,2,3,], [-1,-1,2,2].'
One student applied a more or less systematic test approach. The student described four test cases (frost):

'An array without temperatures below zero, an array with one period of frost, an array with multiple periods of frost, and an array with a period of frost at the beginning.'
The last test case shows the bug in the code. Nevertheless, the student did not define test cases with arrays of length zero and one.

c) Test cases based on code inspection — bug found:
In four situations a student only wrote one test case, based on the bug in the code. One example (frost): '[-1,-1,0,1]'
The function's output is 1, while the longest period of frost is 2, due to the wrong initialization of currentPeriod. Another example (min-min):

'[. . . ] after that I would use a test case where the lowest value is on the first index, this will probably fail because of the for-loop which starts with i=1. Personally, I would probably never test this because I would have noticed this while programming.'
This test case demonstrates the bug. The last part of the quote underlines the approach of code inspection as an alternative to testing. One student mentioned that the body of the change function (coins) is incorrect, but was not able to define a test case showing the bug.

d) Miscellaneous:
One student gave an answer we do not understand (frost); it might be an approach to debugging instead of testing:

'You have to see the array to figure out if it is correct.'
One student was unable to provide a concrete test case. This student gave the following description (frost):

'An input value of which you know the output value'

which is basically a very high-level description of testing in general.
2) Completeness of test cases:
A complete test set should discern several aspects, for instance structural as well as domain-specific, or specification-based as well as implementation-based test cases. Implementation-based test cases are only possible, of course, if an implementation is present [14]. In the case of the frost exercise, examples of structural aspects are an empty array, an array with one element and an array with several elements. Examples of domain-specific aspects are no frost periods at all and frost periods of different lengths spread out over the array in several ways. An example of an implementation aspect is what to do in case of an anomaly. If the implementation is present, one can think of applying various coverage criteria as well as testing overflow situations in cases where specific types are used for variables.

We observed that almost all the test sets defined by the students are far from complete. For example, for the frost exercise, a minimum of three test cases is needed to have path coverage. Only one student provided enough test cases to reach path coverage, as follows:

'[0,-1,1,-1,0,-1], [-1,0,-1,-1], [1,1,0,1]'
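Written out as unit tests (a sketch; JUnit 5, the hypothetical class Weather used earlier for the frost listing, and expected values derived from the specification), this student's test set looks as follows:

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class FrostPathCoverageTest {

    private final Weather sut = new Weather(); // hypothetical class holding longestBelowZero

    @Test
    void alternatingSingleFrostDays() {
        assertEquals(1, sut.longestBelowZero(new int[] {0, -1, 1, -1, 0, -1}));
    }

    @Test
    void longestFrostPeriodAtTheEnd() {
        assertEquals(2, sut.longestBelowZero(new int[] {-1, 0, -1, -1}));
    }

    @Test
    void noFrostAtAll() {
        assertEquals(0, sut.longestBelowZero(new int[] {1, 1, 0, 1}));
    }
}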
Most students defined either one or two test cases, or they provided test cases that could not test the given functions sufficiently. For example, for the first exercise, we found only one test case eleven times:

'One test case with at least one temperature below zero.'

We found only two test cases four times:

'One test case with a negative number and one test case with a positive number.'
We found three test cases four times and four test cases two times. As mentioned before, most of these test cases test the happy path scenario only. An example of an incomplete test case for the coin exercise:

'Test cases with multiples of coins.'
The students who applied a more systematic testing approach had a slightly more complete test set.
3) Misconceptions:

a) Exhaustive testing: One student presented one test case and then proposed an exhaustive testing scenario (frost):

'[-1,-1,0,0,1,1] and I shall look to all possible inputs and see whether the program reacts as is expected with several days of frost.'

b) Test cases without expected result:
Some students specified an array with random numbers as a test case, for example:

'Random numbers in an array.'
The problem with this approach is, of course, that the result of such a test case is unknown beforehand and therefore it is impossible to determine the correctness of the function. Another example (min-min), also defining a random array as a test case, is:

'1.-) One array with equal numbers, 2.-) one array starting with the lowest number, 3.-) one array with all random numbers, and 4.-) one array with numbers you know the result of.'

c) Type testing:
One student defined a test case with an array containing a character, where an array with integers is expected (min-min):

'One array with two numbers, one array with a lowest number, and an array with a character.'
The language used is Java, a strongly typed language.

d) Dividing by zero:
Two students remarked that dividing by zero is forbidden and thought that, as a consequence, dividing zero by something else is forbidden as well.

e) Implementation is required:
As part of the palindrome exercise, one student wrote one test case ('lol', which is a palindrome), but mentioned that it is impossible to check the case because the implementation is missing.
4) Programming knowledge:
Although these students should have the required knowledge of Java, it seems that some students struggle with the given code. For example, one student wrote as a test case (coins):

'49,7,9,127,61 I think that something is missing with 'int coin', because an int can not be an array.'
Here, this student probably lacks sufficient knowledge of Java types. Some students are not focused on input-output testing, but on print-based testing, to check whether certain statements are successfully executed and in what order. For example, one student wrote as a test case (min-min):

'I define a method that prints the array to see the array is successfully created.'
One student had no idea how to solve this exercise (frost) and stated:

'I have no idea.'
One student did not understand the palindrome exercise, judging by the answer:

'droom, paling, moordnilap, true, false, 12345, palindr00m'

5) Time spent on the exercises: The students were asked to note the start and end time for each exercise. We got the following averages per exercise:

Exercise       Average time spent   Standard deviation
1 (N = 31)     6:42                 3:07
2 (N = 30)     4:30                 1:46
3 (N = 31)     4:54                 1:59
4 (N = 31)     2:17                 1:16

The first exercise took the students the longest. This could be caused by the time needed to understand how the exercises worked. During the interviews, students did consider the third exercise to be the most difficult. The last exercise, the black-box exercise, took considerably less time than the exercises with the code provided. This supports our finding that students mainly use the code to think of test cases.

Overall, the short time spent by students to solve these exercises strikes us. This finding matches the findings of happy path testing and of test cases based on code inspection.
6) Correct or incorrect:
For each white-box exercise, students were asked if the code was correct and, if not, to come up with a test case to support their claim. Exercises one and two both contained one logical error and exercise three was correct. None of the exercises contained syntax errors. The following table shows the results:

Exercise      Correct   Incorrect   Valid test case
1 (N = 22)    11        11          7
2 (N = 25)    4         21          19
3 (N = 23)    10        13          n/a

With respect to the first exercise, an equal number of students thought that the code was correct and incorrect. Seven students were able to provide a valid test case to support their claim. On the second exercise, students scored a lot better. Most students noticed the bug and were able to provide a test case to support their claim. With the third exercise, most students wrongly think the code is incorrect. This supports the indication that most students found this exercise the hardest.

VI. RESULTS - POST-EXERCISE SURVEY
The survey has been filled in by 31 students. The following statements were submitted to the students after they had performed the exercises.

A. Understanding
Claim:
Having to think of test cases has increased my understanding of the program code.
Answer: Average 3.42, standard deviation 1.13. We did not verify whether any deeper understanding was actually reached, but students feel they did reach an increased understanding.
B. Test coverage
Claim:
My test cases were sufficient to test the program.
Answer: Average 3.10, standard deviation 0.83. In fact the test cases were clearly insufficient, which the students only realized when discussing them during the interview phase. Apparently, many students interpreted the exercise as 'find the coding error in this program' and stopped when they had found one.

This attitude could have been stimulated by the question to consider the correctness of the code and, if not correct, to present a test case that would fail due to this incorrectness. However, this possibility was not supported during the interviews.
C. Systematic testing
Claim:
I test a program by systematically checking all possible input values.
Answer: Average 3.67, standard deviation 0.75. This is similar to claim A.4 from Kolikant [10]. In that study, 71% of the high school students and 75% of the college students agreed. In our population, 79% agreed. This is a remarkable claim, because the exercise results showed that the students produced a very limited set of test cases that certainly did not cover all possibilities.
D. Overlooking cases
Claim:
There is always the possibility that the program fails for some input value I have not discovered.
Answer: Average 4.59, standard deviation 0.62. This is similar to claim A.5 from Kolikant [10]. In that case, 54% of the high school students and 81% of the college students agreed. In our population no less than 93% agreed. This shows that the optimism exhibited in the previous section should not be taken too literally. Kolikant [10] concludes from these numbers that students tend to describe their non-systematic methods as systematic. Our results strongly confirm this conclusion.
E. Time use
Claim:
The ratio of time spent on programming and testing should be (1) 100/0, (2) 75/25, (3) 50/50, (4) 25/75, (5) 0/100.
Answer: Average 2.64, standard deviation 0.75. Of the respondents, 48% thought most of the time should be spent on programming, 41% thought you should spend the same amount of time on both, and only 10% thought that you should spend more time on testing. One of these students gave the answer 0% programming, 100% testing, which is a strange answer.
F. Boundary values
Claim:
In selecting test cases I take boundary values into account.
Answer: Average 3.53, standard deviation 0.85. However, it was established in the interviews that not all students know the meaning of the term boundary values. In the exercise results, we see boundary values used only occasionally.

VII. RESULTS - INTERVIEWS
We interviewed eleven students. Here, we present our findings and illustrate them using phrases from the transcribed interview texts.
A. Test cases are based on code inspection
Code inspection is very often mentioned as an approach to compose test cases. The functionality of the code is determined based on the code itself instead of on the specification.

'First, I've read the description of the exercise, after which I read the code thoroughly to determine its functionality. Otherwise, I am not able to determine the expected outcomes.'
Other examples of students indicating explicitly that they need the code to understand its functionality are:

'I read the text above the code and looked at the code to determine if I understood what happens in the code.'

'Here, I read the code and hope to get more information on how it should work.'
That students need the code to compose test cases is shown by the following examples:

'I was surprised that there was no code! That means that you have to think about test cases based only on the specification!'

'There was no code available, so I have to think about how it works. So I imagined how it could be implemented, to see how it should work.'

B. When a bug is found, a test case for that bug is composed
Many students mention that they are looking for bugs in the code. For each bug that they find, they compose a test case. An example is:

'Interviewer: And suddenly, you saw the error in the code?
Student: Yes, and then I thought, I write [1,2,3] and then it is ready, on to the next one.'
Furthermore, students mention that, besides happy path testing, the test cases are limited to the bugs found in the code. This can explain the low number of test cases we observe in the students' solutions to the exercises (see Section V). An example:

'Actually, I devise test cases more or less on what I see in the code, as if to say this is erroneous.'
Other methods of implementation-based testing were never mentioned during the interviews.

C. Wrong test strategies
Wrong test strategies that are often mentioned are: random-based test cases, happy path testing, pursuing exhaustive testing, restricting test cases to the examples described as part of the exercise, and restricting test cases to bugs found in the code.

An example of random testing is the following. On the question of whether the number of test cases is sufficient, the student answers:

'Student: For this, it is enough.
Interviewer: You took some numbers, randomly, and looked ...
Student: ... if they are correct. Yes.'
An example of happy path testing is:

'It was more about figuring out .... how much ..., how often the longest period of frost took place, say .... when the longest period of frost took place. For testing, you need only negative numbers. If there are no negative numbers, there is no longest period of frost. So ...'
Some students pursue exhaustive testing:

'Only integers are allowed. Thus, in that case all possible integers as input until the computer is not able to process them. That should be a physical problem. Yes.'
Sometimes, students limit the test cases to the example(s) given in the exercise text. This often leads to happy path testing too.

'I used exactly the same examples as given in the exercise text.'
Finally, as we have mentioned earlier, students often limit test cases to bugs found in the code.
D. Unnecessary or even impossible testing
Some students mention that they add test cases to check value types, although the program was coded in Java, which means the compiler detects type errors directly. We consider this a misconception.

'Here I can add some characters and look how the program reacts because the program expects integers, but if I put in characters, then the program should chuck them out.'

'If I have to input a number, then I input a string, for example 'HELLO', and see what happens.'

Another misconception is that some students consider testing a way of finding syntax errors.

'It is possible that you forget a semicolon, and yet it does not work. In such a case it is good to look at each line of code and to see where it goes wrong. This is a way of testing.'
These misconceptions probably show that students do not understand what a compiler does.
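For example (a sketch; Exercise2 is the hypothetical class used earlier for the min-min method of Listing 2), a call with arguments of the wrong type is rejected by the Java compiler, so no test case can ever exercise it:

public class TypeCheckDemo {
    public static void main(String[] args) {
        Exercise2 sut = new Exercise2(); // hypothetical class holding the min-min method

        sut.findTheLowestIndexOfTheLowestValue(new int[] {3, 1, 2}); // compiles and runs

        // The next call does not compile: String[] cannot be converted to int[].
        // The compiler rejects it before any test could run.
        // sut.findTheLowestIndexOfTheLowestValue(new String[] {"a"});
    }
}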
E. Lack of motivation
Students often do not see the necessity to test code thoroughly:

'Student: No, if I had a computer, then I should apply much longer test cases.
Interviewer: Is there a reason you did not do that?
Student: Yes, too much effort.'
The following example shows the importance of grading:

'I think that, how important is the exercise ..., if it is for grading, then I should perform testing more elaborately than just looking at the code. That is possible, then it works, but in cases of grading, then you should find all errors in the code.'
The following example is related to attitude/engagement:

'I do not find myself good. It was early in the morning. It is possible that I missed some things. The attitude I made the exercise with played a role too. For me, this research is not important, it is not my research.'

F. Reading someone else's code is difficult
Students often mentioned that reading someone else's code is difficult.

'I experienced a lot of problems with the code conventions because I am used to place the brackets in a different way.'

'I did not understand the code really, because of course it is not my own code.'
This could be a reason for the few test cases students wrote.
G. Pen and paper versus working on a computer
Some students explicitly mentioned that they prefer working on a computer instead of working with pen and paper. Working on a computer means running the code to see if it 'works'.

'It is difficult for me to do it just with pen and paper. It is easier to do it on a computer. Then, you can easily see what happens while running the code, what the code does exactly.'
Students look for bugs by experimenting with the code, for which a computer is needed. With pen and paper, this approach is not possible. In fact, they debug the code instead of testing it. This is an educational issue: students have to learn the differences between debugging and testing and have to learn how to write specification-based tests, probably best done with pen and paper.

VIII. CONCLUSIONS
Our long-term goal is to improve the quality of the code that students produce, through better testing education. To improve our test education, we need insight into students' misconceptions and their view on testing before they have had any relevant instruction on this topic.

A. Findings

1) Test cases are based on code inspection:
It was remarkable that students based their tests on code inspection, even in the case of an exercise with only a specification. For this exercise, they first thought about the code they would write to solve the problem. Some students could not write tests at all for some exercises because they did not understand the code. Conclusion: students at this level do not have the notion of basing tests on the specification.
2) Test cases for a bug:
During the interviews and in the test cases defined during the exercises, we see that students read the code, find a bug and write a test case for that bug. This can partially be attributed to the test setup. During the exercises, students were asked if they thought the code was correct and to write a test case that shows the bug if they believed the code was incorrect. This question can be a trigger to specifically search for bugs.

However, it is apparent that many students moved on to the next exercise after designing a test for the (presumed) bug in the code. They did not take the time to think of other test cases.

This strong focus on the given code shows that many of the students do not write test cases as a way to assure the correctness of a program during its complete life cycle, but more as a way to debug the code. This is consistent with the findings of Edwards regarding a trial-and-error approach to software development and testing [2]. It is known from the literature that beginning students do not see a difference between testing and debugging [15].
3) Lack of systematic testing:
Both the interviews and the exercises show that the students tend to limit themselves to 'happy path testing'. This finding fits with the survey results showing that students are optimistic about the correctness of their code. This is a known phenomenon [2].

In the classification of Michaeli [16], our students have a 'level 1' understanding of software quality (thinking that software that successfully processes sample data works). In the classification of Beizer [17] they are in phase 1 (thinking that the purpose of testing is to show that the software works) and in some cases phase 2 (thinking that the purpose of testing is to show that the software does not work).

A more extreme misconception was found where students did not think at all about providing test cases, but merely copied the examples that were mentioned in the exercise text for the purpose of illustrating and clarifying the specification. This may be ascribed to misunderstanding the task.
4) Incomplete test sets:
The exercises reveal that almost all the test sets defined by the students are far from complete, mostly containing only happy path test cases. Specification-based requirements (such as robustness), as well as implementation-based requirements (such as coverage ratios), are not satisfied. The above findings explain these incomplete test sets well.
5) Wrong test strategies:
Besides happy path testing, we observed test cases restricted to the examples given as part of the exercise, and test cases restricted to bugs found in the code. Another remarkable test strategy we observed is exhaustive testing, i.e. trying to feed a function with all possible inputs. This is a known misconception: Complete Testing is Possible. These students described this approach, but did not try to show their test cases. One student mentioned the possibility of physical problems.
6) Lack of motivation:
Many students showed a lack of motivation for testing. They are optimistic about the correctness of their own code and consider testing merely an additional burden. One reason may be that the test tasks are experienced as too simple to justify the extra work [18], while code inspection is still feasible. This leads to a paradox in testing education. If the code is small enough to understand, testing is not a necessity. If the code becomes larger, students are unable to comprehend the code and are therefore unable to design tests (at least, white-box tests).
7) Time spent to test:
The time spent by students to read an exercise, to define test cases and to inspect the code is remarkably short. This observation matches the findings of happy path testing, of test cases based on code inspection and written only for found bugs, and of a lack of motivation.
8) Unnecessary or even impossible testing:
Although the language we use is Java, some students proposed type testing in their answers. Possibly, students tested the program for robustness, i.e. how it reacts to erroneous inputs. Also, some students used testing as a way to find syntax errors. Because type checking and syntax checking are performed by the compiler, we consider these misconceptions, i.e. unnecessary testing. We did not find this type of misconception in existing research.
B. Regarding Kolikant’s findings
Regarding Kolikant's study [10], our population of students reveals more mistrust concerning the correctness of a program based on reasonable output of that program: 24% of our population versus 50% of the population of Kolikant consider reasonable output to be a sufficient indicator of correctness. The difference increases in the case of complicated calculations: our population 3% versus Kolikant 33% in the case of high school students and 69% in the case of college students.

A similar observation was made involving the no-testing approach in the case that a programmer is certain that his/her program is correct. In our study, only 7% agreed that testing is not necessary if the code compiles, whereas in the Kolikant study 31% agreed with that statement. These findings follow from the pre-exercises surveys.

Almost similarly to Kolikant, we observe that 79% of the students think that they test systematically. The exercises and interviews show that they produced a very limited set of test cases that certainly did not cover all possibilities. We also, like Kolikant, conclude that students tend to describe their non-systematic methods as systematic.

REFERENCES

[1] R. Pham, S. Kiesling, L. Singer, and K. Schneider, "Onboarding inexperienced developers: struggles and perceptions regarding automated testing," Software Quality Journal, vol. 25, no. 4, pp. 1239–1268, 2017.
[2] S. H. Edwards and Z. Shams, "Do student programmers all tend to write the same software tests?" in Proceedings of the 2014 Conference on Innovation & Technology in Computer Science Education. ACM, 2014, pp. 171–176.
[3] N. Doorn, "How can more students become 'test-infected': current state of affairs and possible improvements," Master's thesis, Open Universiteit, 2018.
[4] S. H. Edwards, "Using software testing to move students from trial-and-error to reflection-in-action," ACM SIGCSE Bulletin, vol. 36, no. 1, pp. 26–30, 2004.
[5] O. A. L. Lemos, F. F. Silveira, F. C. Ferrari, and A. Garcia, "The impact of software testing education on code reliability: An empirical assessment," Journal of Systems and Software, vol. 137, pp. 497–511, 2018.
[6] M. A. Brito, J. A. L. Rosi, S. R. d. Souza, and R. T. Braga, "An experience on applying software testing for teaching introductory programming courses," CLEI Electronic Journal, vol. 15, no. 1, 2012.
[7] The Joint Task Force on Computing Curricula, Association for Computing Machinery (ACM) and IEEE Computer Society, Curriculum Guidelines for Undergraduate Programs in Computer Science. ACM, 2013.
[8] The Joint Task Force on Computing Curricula, Association for Computing Machinery (ACM) and IEEE Computer Society, Curriculum Guidelines for Undergraduate Degree Programs in Software Engineering. ACM, 2014.
[9] L. M. Leventhal, B. E. Teasley, and D. S. Rohlman, "Analyses of factors related to positive test bias in software testing," International Journal of Human-Computer Studies, vol. 41, no. 5, pp. 717–749, 1994.
[10] Y. B.-D. Kolikant, "Students' alternative standards for correctness," in Proceedings of the First International Workshop on Computing Education Research. ACM, 2005, pp. 37–43.
[11] O. S. Gómez, S. Vegas, and N. Juristo, "Impact of CS programs on the quality of test cases generation: An empirical study," in Proceedings of the 38th International Conference on Software Engineering Companion. ACM, 2016, pp. 374–383.
[12] I. Cetin, "Students' understanding of loops and nested loops in computer programming: An APOS theory perspective," Canadian Journal of Science, Mathematics and Technology Education, vol. 15, no. 2, pp. 155–170, 2015. [Online]. Available: https://doi.org/10.1080/14926156.2015.1014075
[13] D. Ginat, "On novice loop boundaries and range conceptions," Computer Science Education, vol. 14, no. 3, pp. 165–181, 2004. [Online]. Available: https://doi.org/10.1080/0899340042000302709
[14] A. Bijlsma, H. Passier, H. Pootjes, and S. Stuurman, "Integrated test development: An integrated and incremental approach to write software of high quality," in Proceedings of the 7th Computer Science Education Research Conference (CSERC), V. Pieterse, G. Papadopoulos, D. Stikkolorum, and H. Passier, Eds. ACM, 2018, pp. 9–20.
[15] L. Murphy, G. Lewandowski, R. McCauley, B. Simon, L. Thomas, and C. Zander, "Debugging: the good, the bad, and the quirky - a qualitative analysis of novices' strategies," in ACM SIGCSE Bulletin, vol. 40, no. 1. ACM, 2008, pp. 163–167.
[16] T. Michaeli and R. Romeike, "Addressing teaching practices regarding software quality: Testing and debugging in the classroom," in Proceedings of the 12th Workshop on Primary and Secondary Computing Education. ACM, 2017, pp. 105–106.
[17] B. Beizer, Software Testing Techniques, 2nd ed. Van Nostrand Reinhold, 1990.
[18] L. P. Scatalon, E. F. Barbosa, and R. E. Garcia, "Challenges to integrate software testing into introductory programming courses," in 2017 IEEE Frontiers in Education Conference (FIE).