Automating Test Case Identification in Java Open Source Projects on GitHub
Matej Madeja, Jaroslav Porubän, Michaela Bačíková, Matúš Sulír, Ján Juhár, Sergej Chodarev, Filip Gurbáľ
Computing and Informatics, Vol. 32, 2013, 1001–1031
Department of Computers and Informatics, Technical University of Košice, 042 00 Košice, Slovakia
e-mail: [email protected], [email protected]
Abstract.
Software testing is one of the very important Quality Assurance (QA) components. Many researchers deal with the testing process in terms of tester motivation and how tests should or should not be written. However, the recommendations do not tell how tests are actually written in real projects. In this paper the following was investigated: (i) the denotation of the word test in different natural languages; (ii) whether the word test correlates with the presence of test cases; and (iii) which testing frameworks are mostly used. The analysis was performed on 38 GitHub open source repositories thoroughly selected from the set of 4.3M GitHub projects. We analyzed 20,340 test cases in 803 classes manually and 170k classes using an automated approach. The results show that: (i) there exists a weak correlation (r = 0.655) between the occurrence of the word test and the presence of test cases in a class; (ii) the proposed algorithm using static file analysis correctly detected 95% of test cases; (iii) 15% of the analyzed classes used a main() function; these represent regular Java programs that test the production code without using any third-party framework. The identification rate of such tests is very low due to implementation diversity. The results may be leveraged to more quickly identify and locate test cases in a repository, to understand practices in customized testing solutions, and to mine tests to improve program comprehension in the future.

Keywords:
Program comprehension, Java testing, testing practices, test smells, open source projects, GitHub
1 INTRODUCTION
Software quality is a highly desirable feature of a software product that is delivered by verification and validation activities. One of the Quality Assurance (QA) components is software testing, a popular risk management strategy used to ensure that source code meets all the requirements. However, the development of such tests is a time-consuming and costly process, as it represents more than a half of the entire development process [28]. At first glance, the task of creating tests is straightforward — to test; but in addition to that, tests describe the expected behavior of the production code being tested. Years ago, Demeyer et al. [6] suggested that if the tests are maintained together with the production code, their implementation is the most accurate mirror of the product specification and can be considered up-to-date documentation. Obviously, tests can contain a number of useful production code metadata that can support program comprehension.

Understanding the code is one of the very first tasks a developer must struggle with before the implementation of a particular feature. When the product specification changes (e.g. requirements for new features are added), the developer must first understand them, then create his/her own mental model [4] and finally express the created mental model in a specific artifact — the code implementation. The problem is that two developers are likely to create two different mental models for the same issue. If more than one developer is working on a project, it is expected that they will think differently and that the same feature will be implemented in different ways depending on the author. A comprehension gap arises when one developer needs to adopt another programmer's mental model from the code, and this is the most time-consuming process.

An assumption can be made that by using the knowledge about the structure and semantics of tests and their connection to the production code, it is possible to increase the effectiveness of program comprehension and reduce the comprehension gap. This would be possible, for example, by enriching the source code with metadata from the tests directly in the production code, e.g. data used for testing, test scenarios, object relations, comments, etc. To achieve this goal, it is necessary to know in detail how the tests are actually written and what kind of data they encompass.

There exist many guidelines on how tests should be created. First, naming conventions may aid the readability and comprehension of the code. According to the empirical study by Butler et al. [3], developers largely follow naming conventions. It can be assumed that these conventions will also be followed for the test code. Our previous research [23] shows that there is a relation between the naming of identifiers in the test code and the production code being tested. This indicates that the relationship between the test and production code is not only at the level of method calls, object instances, or identifier references, but also at the vocabulary level that is connected to the domain knowledge and mental models of a tester and a developer.

Furthermore, many authors [25, 21, 9] define best practices to simplify the test code. It can therefore be assumed that there is a relation between the occurrence of the word test and the number of test cases in a file, which means that searching for the test string could be beneficial for faster test case identification. Based on the previous reasoning, this paper defines the following hypotheses:
H1: There is a correlation between the occurrence of the word "test" in the file content and the number of test cases.
H2: Tests are usually constituted by means of test frameworks, but there are also other ways of automated code testing.

These hypotheses represent only a partial step towards mining information from tests to support program comprehension. Upon successful identification, it will be possible to analyze the impact of testing frameworks on test writing and developer thinking. The knowledge about how tests are written can be used to mine information from suitable testing code segments to enrich the production code with this information and support code comprehension to speed up the development process.

This paper analyzes 38 projects that have been carefully selected from all GitHub projects with a majority of code written in the Java language. In addition to confirming or refuting the hypotheses, the paper provides an overview of known test frameworks, examines whether it is appropriate to search for tests using the word "test" given the different natural languages of developers, and provides an algorithm for static code analysis to automate the identification of test cases. Because testing is a relatively complex task that can be made up of different types of tests (unit, user interface, etc.) and it is not possible to analyze all of them in a reasonable amount of time, this paper focuses exclusively on unit testing.
Section 2 presents the current state of research and the gaps found in the research. In Section 3, the research method is described, including the analysis of the word test in different languages, the selection of unit testing frameworks, the used search strategies and the process of analysis. This section also presents the pseudocode of the proposed algorithm used for automated test case detection. Section 4 summarizes the results, threats to validity are mentioned in Section 5, and conclusions can be found in Section 6.
2 RELATED WORK

Many researchers examine software testing, but we still know little about the structure and semantics of test code. This section summarizes the related work on software testing from various perspectives.
Learning about real testing practices is a constant research challenge. The goal of such research is mostly to find imperfections and risks, learn, and make recommendations on how to prevent them and how to streamline their development. Leitner and Bezemer [20] studied 111 Java-based projects from GitHub that contain performance tests. The authors identified tests by searching for one or more terms in the test file name or for the presence of a popular framework import, solely in the src/test project directory. Selected projects were subjected to a manual analysis, in which several metrics were monitored. The most important result for this paper was the fact that 103 projects also included unit tests, usually following standardized best practices. On the other hand, the performance testing approach of the same projects often appears less extensive and less standardized. Another finding was that 58 projects (52%) mix performance tests freely with their functional test suite, i.e., performance tests are in the same package, or even the same test file, as functional tests. Six projects implemented tests as usage examples. Using a similar approach we would like to analyze unit tests, but with careful selection from all GitHub projects at a specific time, resulting in more relevant projects used for analysis.

Code coverage, also known as test coverage, is a very popular method for evaluating project quality. Ellims et al. [7] investigated the usage of unit testing in practice in three projects that the authors evaluated as well-tested. Statement coverage was found to be indeed a poor measure of test adequacy. According to the findings of Hemmati [12], basic criteria such as statement coverage are a very weak metric, detecting only 10% of the faults. A test case may cover a piece of code but miss its faults. According to Hilton et al. [13], coverage can be beneficial in the code review process if a smaller part of the project is evaluated. By reducing coverage to a single ratio for the whole project, much valuable information could be lost. Kochhar et al. [16] performed an analysis of 100 large open-source Java projects showing that 31% of the projects have coverage greater than 50% and only 8% have coverage greater than 75%. In that study, imports of the JUnit and TestNG frameworks were searched for to identify tests in a project. This method could be useful when looking for the occurrence of specific testing frameworks in the code.
Based on the mentioned research, it is possible to roughly estimate the motivation of the tester, the relationship between code coverage and test case quality, and best practices improving the reliability of test suites. However, it is still unknown what frameworks are used in Java projects, whether there are tests that are not dependent on a framework, or whether a particular project uses its own framework. The following issues in the existing research have also been identified:

1. Respondent type: In some cases, research conducted in an academic environment does not reflect real testing practices because students are not full-fledged developers; they often have little experience. Tracking students' practices provides skewed data. Observations of practices in a single software company or country are also misleading because people from the same corporate or state environment have similar practices and influence each other, e.g., company code style, used frameworks, similar education, etc.

2. Project type and size: Experiments often compare testing practices on sample projects, which do not represent real projects, but are created or modified for a particular experiment. The size of such projects is very small, so they contain only a few lines of production code. Code comprehension of such projects is much simpler and therefore the tests are much simpler, too. A real project normally contains more complicated tests.

3. The amount of analyzed data: Studies that analyze enterprise or open source projects usually select a small number of projects (mostly up to 5), which results in samples too small for generalized claims.

With respect to the mentioned research imperfections, this paper tries to carefully select projects for analysis from a large number of independent projects to provide results as relevant as possible.
3 METHOD

First of all, it is necessary to find suitable projects containing test cases. Thus, metadata of all GitHub open-source projects was obtained via GHTorrent [11] (Section 3.1) due to its high availability. GHTorrent collects project metadata from GitHub, one of the biggest project sharing platforms in the world. The experiment was limited to projects whose most used language is Java. A common technique in existing research is to search for testing frameworks' imports [30] or for files with the word test in the filename [20]. Because our main goal for the future is to improve production code comprehension, it is necessary to find particular test cases and not only a test class.

As we go deeper in this study and try to identify specific test cases (not only test classes), it is necessary to consider whether searching for the word test is appropriate. Keep in mind that the aim is not to count the number of test cases in a project. Otherwise, we could run the tests via an automated build tool (e.g. ant, maven or gradle) and collect the number of tests. The issue in that case is that building such open source projects often fails [31] and we would need to build every single project and run its tests, which is a time-consuming task. In this paper we try to count and especially find the location of such test cases.

Since the testing process can also be denoted as verification, examination, etc., an in-depth analysis (Section 3.2) of the testing process denotation in various foreign languages was performed, which showed that searching for the word test is suitable. Due to the limitations of the GitHub Search API it was possible to search for only one word across all GitHub Java projects.

As the framework is assumed to influence developer thinking and test case implementation, a list of 50 unit testing frameworks for Java (Section 3.3) has been created. Because the goal is to detect customized testing practices compared with framework-based ones in existing projects, it is not possible to use an automated method, and since it is not possible to manually analyze all GitHub projects, we need to select the most suitable ones. Based on the meaning of the word test we assume that there will be a correlation between the occurrence of the word test (in file content or filename) and the number of test cases. Three search types were used:

1. the word test in the filename,
2. the word test in the file content,
3. frameworks' imports in the file content (38 frameworks).

Every single project was searched as mentioned above, 4.3 million projects in total. It is possible to expect that the more occurrences of the word test in the project, the more test cases will be present in it and the more we will learn from it in the future. Therefore, projects with the highest occurrence of the word test (in file content or filename) or with the highest occurrence of a specific framework's import were selected for manual analysis. By searching for test regardless of the framework, we were also able to analyze testing practices without any third-party framework, i.e. customized testing solutions. Because GitHub contains many projects that are not relevant, e.g. testing, homework or cloned projects, rules for searching relevant projects have been defined (Section 3.4.2), resulting in a set of projects used for manual and automated analysis. A script was created to partially automate the identification of test cases and to collect some metrics about particular files (see Section 3.5). All methodology details are described in the following sections.
3.1 Project Metadata Acquisition

To provide conclusions that are as general as possible, it was necessary to choose the most general sample of data possible. It would be ideal to analyze all types of projects, i.e. proprietary and open source. This experiment is focused exclusively on open source projects, because access to proprietary projects is limited.

GitHub (https://github.com/) is a distributed code repository and project hosting web site. It has become one of the most popular web-based services to host both proprietary and mostly open source projects; therefore, we can consider it a suitable source of projects. It provides an open Application Programming Interface (API, https://docs.github.com/en/rest) allowing one to work with all public projects (with small exceptions).

To avoid the latency of the official API, the GitHub Archive project stores public events from the GitHub timeline and publishes them via Google BigQuery. Downloading via Google BigQuery is charged. GHTorrent [11] was used instead, which provides a mirror of GitHub projects' metadata. It monitors the GitHub public event timeline, retrieves contents and dependencies of every event and requests the GitHub API to store project data into a database. That includes general info about projects, commits, comments, users, etc. The study data mining started in May 2019; the dump mysql-2019-05-01 (https://ghtorrent.org/downloads.html) has been downloaded and imported into our local database.

3.2 Denotation of the Word test in Different Natural Languages

Leitner et al. [20] searched for tests only in the src/test directory and identified test classes manually. However, tests can be placed in any project directory (e.g. Android uses src/androidTest, see https://developer.android.com/). Another approach is to search for the "test" string in filenames, as executed by Kochhar et al. [18], who assumed that the tests would be exclusively in files containing the case-insensitive "test" string. As in the previous case, best practices lead the developer to use test in the file name, but it is not mandatory. For this reason, the most accurate approach should be searching for the word test in the file content. Of course, it is first necessary to consider whether the word test is the right one to search for.

Therefore, the meaning of the word test was verified using Google Translate (https://translate.google.com/) in 109 different languages (all available by Google) as follows:
1. From English to a foreign language and back to English. By means of this method the most frequent meanings of the word test in a foreign language were obtained. We obtained multiple meanings per language. By translating them back to English we found out which foreign language translations correspond to the original word test.
2. From a foreign language to English and back to the foreign language. The opposite approach was used to find out whether the string test has a meaning in a particular foreign language. The word was translated into English and all its meanings were verified against the available translation alternatives in the given language.

Multiple translations ensured that the correct meaning of the word in a particular language was understood. Using the 1st method it was found that word sets related to the testing process in different foreign languages are mostly translated as test in English, see Figure 1 (translation frequency, as determined by Google Translate, indicates how often a translation appears in public documents: 3 - high; 2 - middle; 1 - low). This means that when a foreign developer would like to express something related to testing (e.g. to write a test case), he/she will mostly use the word test. In this meaning it is the first choice when searching for test cases by a string. In very marginal denotations (i.e. translations found with low frequency) meanings outside of the testing area occurred, e.g., essay, audition or flier. Because such meanings occurred only infrequently, they can be omitted. There were also 14 languages that did not include the word test in their reverse translation at all; its meaning was rather denoted by examination, check or quiz.
Fig. 1. Sum of reverse translation frequency of the word test in public documents of different languages.
A total of 44 languages used a non-Latin charset. For these languages, the 2nd approach did not make sense. For the remaining languages, the meaning was completely identical in 43 languages and the same or similar in 20 cases. We found only 2 languages, Hungarian and Latvian, in which the word test has a completely different meaning, such as body, hew, or tool (nothing related to testing). The analysis shows that the word test will actually refer to the testing process in the code, and its meaning varies only in very rare cases. Only the word test will be searched for in this experiment because of the rate limitations of the GitHub API (explained in Section 3.4).

3.3 Unit Testing Frameworks for Java

Usually there is a reason for the developer to use the word test in the testing code. In this context, the crucial question is whether developers are motivated to use the word test in their code. The developer is greatly influenced by the test framework, which teaches him or her different habits. As a part of this study, we analyzed 50 Java unit testing frameworks, extensions, and supporting libraries to determine whether the use of the word test during test implementation is optional, recommended, or mandatory (see Table 1). The list of testing frameworks was created as a part of this study from different sources, such as blogs, technical reports, research papers, etc.
Table 1. Analyzed unit testing frameworks and extensions for Java.
Name | Package for import | Framework type | First version | Last commit | Must include "test"
SpryTest | N/A | U | N/A | N/A (archived) | N/A
Instinct | N/A | B | 24.01.2007 | 07.03.2010 (archived) | N/A
Java Server-Side Testing framework (JSST) | N/A | U | 17.11.2010 | 17.11.2010 (archived) | yes
NUTester | N/A | U | 05.02.2009 | 27.03.2012 (archived) | N/A
SureAssert | N/A | A | 29.05.2011 | 04.02.2019 (archived) | N/A
Tacinga | N/A | U | 14.02.2018 | 22.02.2018 (archived) | N/A
Unitils | N/A | U | 29.09.2011 (v3.2) | 08.10.2015 (archived) | N/A
Cactus | org.apache.cactus | U | 11.2008 | 05.08.2011 (archived) | yes
Concutest | N/A | U | 30.04.2009 | 12.01.2010 (archived) | yes
Jtest | N/A | G | 1997 | 21.05.2019 (last release) | yes
Randoop | N/A | G | 23.08.2010 | 05.05.2020 | yes
EvoSuite | N/A | G | 25.12.2015 (v1.0.2) | 30.04.2020 | yes
JWalk | N/A | G | 19.05.2006 | 14.06.2017 | yes
TestNG | org.testng | U | 31.07.2010 (v5.13) | 11.04.2020 | yes
Artos | com.artos | U | 22.09.2018 | 19.04.2020 | yes
JUnit 5 | org.junit | U | 10.09.2017 | 02.05.2020 | yes
JUnit 4 | org.junit | U | 16.02.2006 | 10.04.2020 | yes
JUnit 3 | junit.framework | U | N/A | N/A | yes
BeanTest | info.novatec.bean-test | U | 23.04.2014 | 02.05.2015 | yes
GrandTestAuto | org.GrandTestAuto | U | 21.11.2009 | 22.01.2014 | yes
Arquillian | org.jboss.arquillian | U | 10.04.2012 | 21.04.2020 | yes
EtlUnit | org.bitbucket.bradleysmithllc.etlunit | U | 02.12.2013 (v2.0.25) | 04.04.2014 | yes
HavaRunner | com.github.havarunner | U | 16.12.2013 | 08.06.2017 | yes
JExample | ch.unibe.jexample | U | 2008 | N/A | yes
Cuppa | org.forgerock.cuppa | U | 22.03.2016 | 01.10.2019 | yes
DbUnit | org.dbunit | U | 27.02.2002 | 24.02.2020 | yes
GroboUtils | net.sourceforge.groboutils | U | 20.12.2002 | 05.11.2004 | yes
JUnitEE | org.junitee | U | 23.07.2001 (v1.2) | 11.12.2004 | yes
Needle | de.akquinet.jbosscc.needle | U | N/A | 16.11.2016 | yes
OpenPojo | com.openpojo | U | 13.10.2010 | 20.03.2020 | yes
Jukito | org.jukito | U/M | 25.01.2011 | 17.04.2017 | yes
Spring testing | org.springframework.test | M/U | 01.10.2002 | 06.05.2020 | yes
Concordion | org.concordion | U/SbE | 23.11.2014 (v1.4.4) | 27.04.2020 | no
Jnario | org.jnario | B | 23.07.2014 | N/A | no
Cucumber-JVM | io.cucumber | B | 27.03.2012 | 04.05.2020 | no
Spock | spock.lang | B | 05.03.2009 | 01.05.2020 | no
JBehave | org.jbehave | B | 2003 | 23.04.2020 | no
JGiven | com.tngtech.jgiven | B | 05.04.2014 | 10.04.2020 | yes
JDave | org.jdave | B | 18.02.2008 | 17.01.2013 | no
beanSpec | org.beanSpec | B | 15.09.2007 | 27.06.2014 (alpha) | no
EasyMock | org.easymock.EasyMock | M | 2001 | 10.04.2020 | yes
JMock | org.jmock | M | 10.04.2007 | 23.04.2020 | yes
JMockit | org.jmockit | M | 20.12.2012 | 13.04.2020 | yes
Mockito | org.mockito | M | 2008 | 30.04.2020 | yes
Mockrunner | com.mockrunner | M | 2003 | 16.03.2020 | yes
PowerMock | org.powermock | M | 28.05.2014 (v1.5.5) | 30.03.2020 | yes
AssertJ | org.assertj | A | 26.03.2013 | 05.05.2020 | yes
Hamcrest | org.hamcrest | A | 01.03.2012 | 06.05.2020 | yes
XMLUnit | org.xmlunit | A | 03.2003 | 04.05.2020 | yes
Legend: U – unit; B – behavioural; A – assert; M – mock; G – generator; SbE – specification by example.

Most of the analyzed frameworks fall into the unit testing category. Information about the first version and the last commit may be interesting in terms of the framework lifetime and its occurrence in projects. Projects marked as archived or as test generators in Table 1 were excluded from further analysis for the following reasons: 1. archived projects usually had unavailable documentation or were never released; 2. test generators produce tests that are not based on the programmer's mental model, but are generated automatically (semi-randomly), which is not interesting from the code comprehension point of view.

It can be seen that 37 of 50 frameworks require the word test as a method/class annotation (@Test) or as part of its name (testMethod, methodTest). This condition is due to the fact that the listed frameworks are mostly extensions that depend on one of the base frameworks, such as JUnit or TestNG. Different versions of JUnit are listed separately, because test labeling differs between them (annotations vs. method name format). A deeper analysis of the frameworks' JavaDocs revealed that many frameworks include other classes, methods, or annotations that include the word test in their names. Although the use of these methods is not mandatory, it may support the search. At the same time, it should be noted that mostly behavioral frameworks do not use the test convention. This is due to the thinking of the developer, whose role in BDD is not to write a test, but a scenario or specification. Such frameworks often use domain specific languages (DSLs) to simplify usage for non-programmer team members.
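For illustration, the difference in test labeling between JUnit 3 and JUnit 4 can be sketched as follows (class and method names are invented for this example):

import junit.framework.TestCase;
import org.junit.Test;

// JUnit 3: a test case is a public method whose name starts with "test"
// in a class extending TestCase, so the word test is enforced by convention.
class StackJUnit3Test extends TestCase {
    public void testPushIncreasesSize() { /* assertions */ }
}

// JUnit 4: a test case is any method annotated with @Test; the word test
// still appears, now in the annotation, even if the method name avoids it.
class StackJUnit4Test {
    @Test
    public void pushIncreasesSize() { /* assertions */ }
}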
3.4 Data Gathering

The whole process of data gathering can be seen in Figure 2. GHTorrent provided 140 million GitHub projects. From this set, all deleted, non-Java or duplicated projects were removed. After cleaning the initial data, a total of 6.7 million projects were kept for further analysis.

Fig. 2. The GitHub data mining process for the study.

GHTorrent contained only basic metadata about the projects, which was not sufficient for our needs. Given the meaning of the word test (see Section 3.2), it was decided that searching for test across all projects would be beneficial. The GitHub API provides a code search endpoint (https://docs.github.com/en/rest/reference/search), which indexes only repositories that are originals, i.e., not forks. Repository forks are not searchable unless the fork has more stars than the parent repository; therefore, such projects were also removed. If a project was detected as deleted, private, or blocked by GitHub while querying code search, it was not considered. Finally, a total of 4.3 million projects were included. The GitHub code search API has the following limitations:

• up to 1,000 results for each search;
• up to 30 requests per minute (authenticated user);
• global request rate limited at 5,000 requests per hour;
• only files smaller than 384 KB and repositories with fewer than 500,000 files are searchable.

To deal with these limitations, a script was created for data acquisition, which was executed in parallel on 20 instances with different internet protocol (IP) addresses. For each instance, 10 GitHub personal access tokens, which the script cycled, were reserved. The tokens were provided by experiment supporters. Another issue was that our requests were often evaluated as abuse, which significantly slowed down the whole process of data gathering. For each project, two requests to the GitHub code search API were issued, as presented in Table 2. Gathering occurrences of the word test took 43 days in total.

Table 2. The GitHub API requests used to search the string "test" in a project.

Search "test" in | Example request at https://api.github.com/search/code
Java file content | ?q=test+in:file+language:java+repo:apache/camel
Java filenames | ?q=filename:test+language:java+repo:apache/camel
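For illustration, a single such code search request could be issued as in the following minimal Java sketch. This is not the study's acquisition script; the class name and the GITHUB_TOKEN environment variable are assumptions of the example, and token cycling and abuse handling are omitted:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CodeSearchExample {

    public static void main(String[] args) throws Exception {
        // Personal access token; the environment variable name is an
        // assumption of this example.
        String token = System.getenv("GITHUB_TOKEN");
        URI uri = URI.create("https://api.github.com/search/code"
                + "?q=test+in:file+language:java+repo:apache/camel");
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Accept", "application/vnd.github+json")
                .header("Authorization", "token " + token)
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The total_count field of the JSON body holds the number of matching files.
        System.out.println(response.body());
    }
}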
GitHub indexes only the default branch code (usually master), so the whole analysis was performed only on the default branch. The string "test" can also be a part of other words, e.g. fastest, lastest, thisistestframework; there exist 532 such words containing the string test in total. To avoid inaccuracies when searching for a word containing the selected string, false positives must be excluded from the search. When using regular GitHub search, the search term will appear in the results under the following rules:

• the string uses the camel case convention without numbers (numbers can be used, but they are not considered individual words, e.g. test123 will not be found), e.g., myTest;
• the string uses the snake case convention, e.g., my_test, test_123;
• the string includes a delimiter or special character (space, ., $, @, etc.), e.g., test.delimiter, @Test;
• the search is case insensitive, e.g. Test sentence, test sentence.

GitHub considers as a Java language file any file with the .java or .properties extension. The same search rules apply to both search types: file content and filename search. Obviously, according to the above rules, GitHub search automatically filters the results; therefore, unwanted words containing the string test do not appear in the results. At the same time, it is necessary to note the side effect, i.e. that neither the word testing nor testsAllMethods will be matched. Any string alternative that should also be found would have to be searched for outside the GitHub API. The basic variant of the word will suffice, because best practices use the word test to indicate the testing method or class.

3.4.2 Selection of Relevant Projects

When searching for different testing types, the effort is to go through as many projects as possible. Because GitHub contains millions of repositories, it is a challenge to choose the projects that can be the most instructive and to filter out irrelevant ones. To make the selection as objective as possible, we planned to use the reaper tool [24] (https://github.com/RepoReapers/reaper), which can assess a GitHub repository in collaboration with GHTorrent using project metadata and code: architecture, community, continuous integration, documentation, history, issues, license and unit testing. By evaluating all these metrics, a particular repository can be tagged as a real software project, thus excluding example projects, forks, irrelevant ones, etc.

The last update of the official reaper repository was committed in 2018 and many of the libraries it used have changed their options; therefore, a modification of the project was necessary. Many assessment attributes require project files to be available, so each project needs to be cloned or downloaded as an archive. For large projects, this can mean gigabytes of data, and the size of the project subsequently affects the length of the analysis. To find out whether reaper would be beneficial for our study, a manual analysis of 50 projects was performed and the results were compared with the evaluation by reaper. All available evaluation attributes were selected except for the unit tests assessment, because it was limited to the JUnit and TestNG frameworks. The thresholds and weights of particular attributes were preserved as set by the developers of the tool, because these values were considered empirically confirmed.

Because we want to select a sample of projects from which we would learn the most, projects with the highest number of files containing the word test in their body and filename were selected for the comparison. The same attributes as used by reaper were taken into account in the manual evaluation, but the relevance of the project for this study was assessed by an observer. The evaluation of 50 projects using the reaper tool took 10 days, with the most time being spent on evaluating the project architecture. Many repositories with the highest test presence in file content or filename (e.g. https://github.com/zg/jdk, https://github.com/dmatej/Glassfish, https://github.com/svn2github/cytoscape) were actually identified as Subversion (SVN) mirrors by manual analysis, and because there were multiple copies of the same code (caused by SVN's branching style), the projects were not relevant; reaper, however, assessed such projects as suitable. Because of this significant issue, important projects could be lost by assessing projects in an automated manner, so it was concluded that it is more efficient to select projects manually, driven by the following rules inspired by existing research:

• Priority was given to projects with the highest number of occurrences of the word test in the project (in file content and filename). According to [27] we can expect the presence of tests in popular projects. If it is assumed that the word test correlates with the number of test cases in the project, large and long-maintained projects are expected, which the authors consider the best sample for the study.

• History, as evidence of sustained evolution. Projects under 50 commits were excluded (inspired by reaper) because they represented small or irrelevant projects. Projects that contained a large number of commits (more than 1,000 per day), considered to be committed by a robot, were also excluded.

• Originality was evaluated by comparing the readme file for similarities in other repositories. By such a comparison it is possible to detect clones and similar repositories [34]. Jiang et al. [14] found that developers clone repositories to submit pull requests, fix bugs, add new features, etc. The problem is that developers often do not create forks but project clones (a manual copy of a project), while the readme file is often left unchanged.

• Community, as evidence of collaboration, was assessed by the number of contributors in the project. The more developers participate in the project, the more likely it is that the (testing) code will be written in a different style.

3.4.3 Framework Import Search

Inspired by Stefan et al. [30], in order to monitor the impact of frameworks on test writing, we searched the projects' code via the GitHub API for imports of any of the testing frameworks from Table 1 (excluding generators and archived projects). This way, projects with different frameworks were obtained. Only projects that contained the word test in a Java file body at least once were queried. Because there was a large number of requests (37 per single project), the project set was limited to 500,000, ordered by the number of Java files containing the word test in their body. The search string was bounded by quotes because the GitHub API normally splits words on special characters (e.g. the dot), which yields irrelevant results. Example search request: https://api.github.com/search/code?q="org.testng"+in:file+language:java+repo:apache/camel.

For each framework we created a separate list of projects, again sorted by the occurrence of the word test in the project, to find projects with a high number of test cases if possible. Original repositories of the searched framework were removed from the analysis (e.g. when searching for JUnit, the original JUnit framework repository was excluded). Subsequently, the selection of relevant projects was performed according to the steps mentioned in Section 3.4.2. For some frameworks, e.g. JExample, which were created as a part of research [19], no software repositories with a business focus were found; as a consequence, it was necessary to include also example, homework, or cloned/forked repositories if the original one was not publicly available.

Three different data sets were received by searching via the GitHub API: 1. the word test in the filename, 2. the word test in the file content, 3. frameworks' imports in the file content. The first four relevant and top projects (highest test or framework import string occurrence) from each set were manually investigated in order to find out the test writing practices. The projects were cloned and, to keep consistency between the test search and the manual analysis, each project was reverted to the timestamp of the GitHub API download using the following command (with the actual timestamp and default branch substituted):

git checkout `git rev-list -n 1 --before="<timestamp>" <default-branch>`
3.5 Automated Detection of Test Cases

During the manual analysis, the following indicators of test case presence were collected for each file:

1. Annotations @Test — very popular, mostly thanks to JUnit and TestNG.
2. Methods with test at the beginning of the name — best practices lead developers to use this convention (also for historical reasons).
3. Methods with Test at the end of the name — an alternative to the previous one.
4. Public methods — possibly all public methods of a test class can be considered tests.
5. Occurrence of main — customized testing solutions are executed via main().
6. File path containing test — should relate to testing.
7. Classes containing $ in the name — the character $ in a class name mostly denotes generated code that should not be analyzed (cf. https://docs.oracle.com/javase/specs/jls/se11/html/jls-3.html).
8. Total number of test occurrences in the file content — to reveal the relation between executable test cases and the presence of the word test in the content.

All listed metrics (counts of occurrences in a file) were saved for each analyzed file. The pseudocode for collecting the mentioned metrics can be seen in Listing 1 (implementation available at GitHub: https://github.com/madeja/unit-testing-practices-in-java/blob/master/AnalyzeProjectCommand.php). The presented algorithm is partly a result of the study, because it was created in parallel with the manual analysis: the manual analysis complemented the algorithm implementation and vice versa. This algorithm was used to evaluate the test identification for each Java file containing the word test. Subsequently, the automated identification was checked during the manual analysis to determine the correct number of test cases and the metric used for the calculation (e.g., the number of annotations and public methods can be the same, but the relevant number of tests can only come from one of them). It is necessary to identify the number of particular test cases in order to link a specific test case with the unit under test (UUT) and its specific method. Each test case is likely to represent a unique use case and thus unique information to enrich the production code.

Algorithm predictTests(filePath)
Input: File path to analyze.
Output: List of statistical data.

content         := load filePath content and remove comments
nonClassContent := remove all class content, keep only content outside of it,
                   such as imports or class annotations
classContent    := remove all content outside of the class block and keep only
                   first-level methods without body, using /\{([^\{\}]++|(?R))*\}/

annotations    := count of matches of /@Test/ in classContent
startsWithTest := count of matches of /public +.*void *.* +[Tt]est[a-zA-Z\d$_]* *\(/ in classContent
endsWithTest   := count of matches of /public +.*void *.* +[a-zA-Z$_]{1}[a-zA-Z\d$_]*Test *\(/ in classContent
publicMethods  := count of matches of /public +.*void +.*\(/ in classContent
includesMain   := count of matches of /public +static +void +main.*\(/ in classContent
hasDollar      := true if $ in filename, else false
testInPath     := true if "/test" in filePath, else false

if TestNG import found in content then
    if @Test found in nonClassContent then
        testCaseCount := publicMethods
    else
        testCaseCount := annotations
else if JUnit4 import found in content then
    testCaseCount := annotations
else if JUnit3 import found in content then
    testCaseCount := startsWithTest
else if startsWithTest > 0 then
    testCaseCount := startsWithTest
else if annotations > 0 then
    testCaseCount := annotations
else
    testCaseCount := 0

return annotations, startsWithTest, endsWithTest, publicMethods,
       includesMain, hasDollar, testInPath, testCaseCount

Listing 1: Pseudocode of the algorithm for gathering metadata and the identified number of tests in a Java source file.

Gathered metadata about test case identification were analyzed from different perspectives. As it is not feasible to analyze all tests in a repository, it can be assumed that the testing style in a project is uniform, and it is better to analyze more projects implemented by different developers. Test classes with the highest numbers of the following attributes were analyzed: 1. @Test annotations, 2. public methods with names starting with test, 3. public methods with names ending with Test, 4. the main method, 5. occurrences of the word test. For framework-dependent searches there was an additional analysis of files with the highest framework import occurrence in the content.
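For readers who prefer compilable code over pseudocode, the following is a simplified Java sketch of the heuristic from Listing 1. It is not the reference implementation (that is the PHP script linked above); comment removal is omitted and the class-body vs. outside-class distinction is reduced to a crude approximation:

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestCasePredictor {

    private static final Pattern ANNOTATION = Pattern.compile("@Test\\b");
    private static final Pattern STARTS_WITH_TEST =
            Pattern.compile("public +.*void *.* +[Tt]est[a-zA-Z\\d$_]* *\\(");
    private static final Pattern PUBLIC_VOID_METHOD =
            Pattern.compile("public +.*void +.*\\(");

    private static int count(Pattern p, CharSequence s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    /** Predicts the number of test cases in one Java source file. */
    public static int predictTestCount(Path file) throws Exception {
        String content = Files.readString(file); // comment removal omitted for brevity
        int classStart = content.indexOf("class ");
        // Crude stand-in for the pseudocode's nonClassContent
        // (imports and class-level annotations).
        String beforeClass = classStart > 0 ? content.substring(0, classStart) : "";

        int annotations = count(ANNOTATION, content);
        int startsWithTest = count(STARTS_WITH_TEST, content);

        if (content.contains("import org.testng")) {
            // A class-level @Test in TestNG marks every public method as a test.
            return ANNOTATION.matcher(beforeClass).find()
                    ? count(PUBLIC_VOID_METHOD, content)
                    : annotations;
        } else if (content.contains("import org.junit.")) {      // JUnit 4/5
            return annotations;
        } else if (content.contains("import junit.framework")) { // JUnit 3
            return startsWithTest;
        } else if (startsWithTest > 0) {
            return startsWithTest;
        }
        return annotations; // zero annotations means zero predicted tests
    }
}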
4 RESULTS

Using the automated script, all files of the repositories from Table 3 were processed, 38 repositories and 170,076 classes altogether, from which 803 classes and 20,340 test methods were manually investigated. Some special practices in terms of the structure of the testing code or the developer's reasoning were observed. The first 4 projects in Table 3 represent repositories with the largest occurrences of the word test in the filename, another 4 in the file content, and the other repositories represent the top import occurrence of a particular framework.

Table 3. Statistics of the investigated repositories.
Repository | Framework | Classes A | Classes M | Tests A | Tests M | Java KLOC | T_A
openjdk/client | testng, junit | 30410 | 130 | 30410 | 1661 | 5149 | 20798
SpoonLabs/astor | junit | 30331 | 36 | 30331 | 1548 | 2338 | 13324
apache/camel | junit | 10438 | 81 | 10438 | 625 | 1240 | 6847
apache/netbeans | testng, junit | 13056 | 78 | 13056 | 1627 | 5009 | 11908
JetBrains/intellij-community | testng, junit | 20375 | 49 | 20375 | 4805 | 3842 | 13630
SpoonLabs/astor | testng, junit | 30331 | 44 | 30331 | 5883 | 2338 | 13324
corretto/corretto-8 | testng, junit | 13688 | 10 | 13688 | 1659 | 3638 | 10792
aws/aws-sdk-java | junit | 28574 | 18 | 28574 | 302 | 3680 | 20528
wildfly/wildfly | arquillian | 5109 | 24 | 5109 | 123 | 548 | 3553
eclipse-ee4j/cdi-tck | arquillian | 4758 | 30 | 4758 | 139 | 97 | 2748
resteasy/Resteasy | arquillian | 2821 | 13 | 2821 | 144 | 220 | 1675
keycloak/keycloak | arquillian | 1681 | 16 | 1681 | 104 | 396 | 1286
jsfunit/jsfunit | cactus | 222 | 13 | 222 | 125 | 21 | 142
bleathem/mojarra | cactus | 737 | 16 | 737 | 250 | 171 | 556
topcoder-platform/tc-website-master | cactus | 1635 | 8 | 1635 | 42 | 366 | 1199
apache/hadoop-hdfs | cactus | 325 | 4 | 325 | 20 | 101 | 282
zanata/zanata-platform | dbunit | 770 | 21 | 770 | 171 | 197 | 554
B3Partners/brmo | dbunit | 145 | 18 | 145 | 37 | 47 | 106
gilbertoca/construtor | dbunit | 145 | 18 | 145 | 64 | 24 | 53
sculptor/sculptor | dbunit | 153 | 11 | 153 | 101 | 26 | 103
geotools/geotools | groboutils | 3424 | 5 | 3424 | 5 | 1272 | 3659
notoriousre-i-d/ce-packager | groboutils | 107 | 11 | 107 | 75 | 46 | 91
tliron/prudence | groboutils | 16 | 2 | 16 | 3 | 13 | 11
MichaelKohler/P2 | jexample | 36 | 12 | 36 | 53 | 4 | 24
akuhn/codemap | jexample | 132 | 15 | 132 | 286 | 41 | 112
wprogLK/TowerDefenceANTS | jexample | 17 | 3 | 17 | 50 | 9 | 12
rbhamra/Jboss-Files | needle | 44 | 21 | 44 | 30 | 5 | 30
akquinet/mobile-blog | needle | 19 | 10 | 19 | 33 | 2 | 10
s-case/s-case | needle | 46 | 15 | 46 | 13 | 39 | 33
dbarton-uk/population-pie | needle | 7 | 6 | 7 | 16 | 1 | 4
abarhub/rss | openpojo | 26 | 2 | 26 | 3 | 6 | 20
BRUCELLA2/Prescriptions-Scolaires | openpojo | 25 | 19 | 25 | 40 | 10 | 18
jpmorganchase/tessera | openpojo | 382 | 8 | 382 | 12 | 45 | 234
tensorics/tensorics-core | openpojo | 161 | 3 | 161 | 1 | 24 | 85
orange-cloudfoundry/static-creds-broker | jgiven | 21 | 11 | 21 | 33 | 2 | 16
eclipse/sw360 | jgiven | 175 | 4 | 175 | 51 | 56 | 161
Orchaldir/FantasyWorldSimulation | jgiven | 54 | 13 | 54 | 198 | 7 | 37
kodokojo/docker-image-manager | jgiven | 11 | 5 | 11 | 8 | 3 | 8
SUM | | 170076 | 803 | 363730 | 20340 | 31033 | 127973
Legend: A – processed automatically; M – investigated manually; KLOC – thousands of lines of code; T_A – average time of automated test case detection in ms.

4.1 Correlation Between test Occurrence and Test Cases

To evaluate the precision of the algorithm from Listing 1, its results were compared to the manual identification of 20,340 test cases across all three datasets (the word test searched in the filename, in the file content, and the framework import search in the file content, see Section 3.5). An accuracy of 95.72% for test case detection was achieved by the automated identification. Most of the false positives and false negatives were caused by customized testing solutions, e.g. when tests were performed directly from the main() function by calling methods of the class. If the naming conventions of the called (testing) methods were not governed by the principles of frameworks (e.g. prepending the method name with "test" or using public methods), not all test cases were detected in an automated way.

The proposed algorithm was used to identify all tests in all Java classes of the projects from Table 3. The script was applied to all Java files that contained the string "test" in the file content or in the filename (in total 170,076 files). Figure 3 shows the correlation, with the linear regression line, of the word "test" and the number of test cases in a particular class. A standard (Pearson) correlation coefficient of r = 0.655 was reached, which is not strongly significant when considering the significance level α = 0.05.

Fig. 3. Correlation of the word "test" presence and the number of test cases for the classes analyzed by the automated script.

However, from the perspective of finding projects containing tests, this technique is beneficial and can help future experimenters filter projects containing tests much faster. Because projects have different numbers of test classes and use different frameworks, the detailed ratio of word "test" occurrence and test case presence per project can be found at GitHub (https://github.com/madeja/unit-testing-practices-in-java/blob/master/correlation-boxplot.png).
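For completeness: the coefficient reported above is the standard Pearson correlation over the per-class pairs, where x_i denotes the number of occurrences of the word test in class i and y_i the number of test cases detected in it:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}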
Due to existing research [20] that identified test files by searching for "test" in the file path, the results were also limited to files containing "test" in the path (120,907 files); for this subset, correlation coefficients of r = 0.… and r = 0.… were reached. The occurrence of main() without a 3rd-party testing framework (explained further in Section 4.3.2) was detected in 26,205 (15.41%) classes containing the word test in their content. The proposed algorithm from Section 3.5 successfully identified test cases in only 6% of the classes in this set. Because main() tests make up a fairly large proportion and the identification of test cases in them is not straightforward, it is necessary to investigate this testing style deeper in the future.

H1: There is a correlation between the occurrence of the word "test" in the file content and the number of test cases.

H1 has been rejected, because the reached correlation coefficient r = 0.655 represents only a weak correlation.

4.2 Performance of the Automated Detection

Executing a full code analysis, e.g. in an IDE, of a large project with thousands of KLOC (kilo lines of code) is a time-consuming task. One such example is the project openjdk/client from Table 3. To get faster feedback about tests in a project, the proposed algorithm was used for static source code analysis. Because the proposed automated algorithm should eventually run as a part of an integrated development environment (IDE) extension, it should be fast enough. To emulate an environment similar to what a developer might use, a common laptop was used. Table 3 shows the average time (T_A) of the automated analysis, executed 10 times. The average time of execution was 158 ms per KLOC, which the authors consider a satisfactory response time in terms of user experience for use in an IDE extension.
Master test.
This testing code style represents test classes which contain onlyone executable test method (see GitHub ). JUnit will consider only the all() method as a test case, because it is annotated with @Test annotation. Other meth-ods are considered auxiliary ones. The problem with such a notation is the com-plexity of test comprehension. If the test fails, the developer only has informationthat the test all failed, but does not know what the test should have verified, whatdata was used, etc.According to the best practices, it should be clear from the test name what thetest verifies. In this context, from a semantic point of view, it is possible to considermethods as test cases on lines 1-8. The mentioned methods are crucial in terms offailure and understanding of the test, and from the method name it is also clearwhat the test verifies. Another disadvantage of these test types is the assertionroulette test smell [32], because iterations of the test over the input data make itdifficult to determine which data caused the test failure and whether the input datado not interfere with each other between the tests.
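A minimal, hypothetical sketch of the master test pattern (the names here are invented; the repository's actual example is linked above):

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class ParseIntMasterTest {

    // Semantically these are the real test cases, but the runner never
    // sees them individually.
    private void verifyDecimal()  { assertEquals(42, Integer.parseInt("42")); }
    private void verifyNegative() { assertEquals(-7, Integer.parseInt("-7")); }

    // The only method JUnit reports on; a failure says just "all".
    @Test
    public void all() {
        verifyDecimal();
        verifyNegative();
    }
}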
Reverse proxy test. If a separate test is written for each use case, the recommendations are met, but this does not mean that the test will be easy to understand. There are tests that call one auxiliary method from multiple tests, and the result is evaluated in the auxiliary method. According to the test evaluation manner, they can be divided into:

1. Result evaluation via the method name (see https://github.com/madeja/unit-testing-practices-in-java/blob/master/examples/c_reverseProxyMethod.java).
2. Result evaluation via an internal object state (see https://github.com/madeja/unit-testing-practices-in-java/blob/master/examples/c_reverseProxyObject.java).

The 1st approach is much more difficult to comprehend due to the high degree of abstraction. It is not clear directly from the test method code (L5-7 of the example) what is compared during the test, because the input data are loaded from a file determined by the test method name (L2). In the JetBrains/intellij-community project, from which the example is taken, the doTest() method is a general one, and it was necessary to investigate multiple classes to comprehend how the tests are evaluated. At the same time, a too generic auxiliary method can result in the general fixture test smell.

The 2nd approach is similar to the previous one, but uses the internal state of an object that is initialized before a particular test during the test setup, or an enum type with different method implementations. A problem with these methods might arise if the method accepts an input parameter which is later used to change the control flow. If the same test is called with different input data, the test logic does not change and therefore it is the same test. However, if the control flow changes in the test, e.g. by some variable value, it can be considered a separate test (different flow, different test). If the same auxiliary method is called more than once, there may be 2 different tests, which contradicts best practices and makes comprehension difficult.
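A minimal, hypothetical sketch of the first (method name) variant, with fixture loading and parsing reduced to stubs:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class ParserReverseProxyTest {

    @Test public void testSimpleExpression() { doTest(); }
    @Test public void testNestedBlocks()     { doTest(); }

    // Generic helper: the evaluation happens here, far away from the test
    // method, and the input is derived from the calling method's name.
    private void doTest() {
        String caller = Thread.currentThread().getStackTrace()[2].getMethodName();
        assertEquals(load(caller + ".expected"), parse(load(caller + ".input")));
    }

    // Hypothetical stubs standing in for real fixture loading and parsing.
    private String load(String name)   { return name; }
    private String parse(String input) { return input.replace(".input", ".expected"); }
}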
Multiple test execution. Mostly server-side applications test different use cases which require an action after the execution of the base functionality, e.g. whether the right content is shown after the main test execution (see https://github.com/madeja/unit-testing-practices-in-java/blob/master/examples/c_multipleExecution.java). Because JUnit 3 is used in the example, every public method prefixed with test is considered a test case, so testEcho() is executed twice: as a single test case and as a part of testA4JRedirect().

4.3.2 Custom Testing Practices

This group of testing practices consists of classic Java programs executable via the main() function, whose task is to verify the functionality of the production code. Such tests are often written due to the possibility of configuring the execution via command line parameters, which allows variability of test execution. On the other hand, tests should not be so environmentally dependent that they need to be configured to such an extent. A second possible reason to write such tests is that they clarify the testing code when there is a large number of cases. Test methods are called directly from main() and, if necessary, the environment setup is performed there as well. The following invocation styles were observed:

• Calling methods one by one: all testing methods are manually called from main() together with parameters.
• Calling methods according to input data: by iterating over the test data, specific tests are called based on the current data.
• Helper function that returns an array of test cases: the helper method returns an array of instances created from abstract classes, whereas the abstract methods (which represent test cases) are implemented during instance creation. The main() contains an iteration over the array of object instances.
• Iterating values of an enum: similar to the previous one, but it iterates over enum values. When creating the enum, a method of the test class is implemented and the data is set. The test class has its own implementation of a method and state in each iteration.
• Calling a constructor: in the main function, the testing class instance is created and the tests are called from the constructor.

There is a problem of how to identify such tests in an automated way and how to determine the number of tests in such a class. The main() function also occurs in classic tests (e.g. to run a test outside of an IDE or without a build automation tool, cf. https://junit.org/junit4/faq.html), e.g. based on JUnit or TestNG. The function can also be found in modified runners of testing frameworks. To clearly distinguish the presence of a customized solution without any framework, it is possible to check the presence of a framework import: if a class contains the main() function and such an import together, it is a runner or a regular test based on the framework, not a customized solution.

Other interesting ways of writing customized tests were also observed. For example, in the openjdk/client repository, there were tests for trichotomous relations for which a custom @Test annotation was implemented (see https://github.com/madeja/unit-testing-practices-in-java/blob/master/examples/c_main1.java). The annotation is used to indicate the test and, at the same time, to define the type of comparison in the method (L1, L4 of the example). Thanks to the usage of the word test, it is possible to detect the correct number of tests, in a similar way as for JUnit.
In this example,the impact of 3 rd party framework on the developer’s customized solution is visible.There are many tests in the repository using standardized frameworks, therefore theusage of @Test annotation is a logical way of defining a test case. The exampleshows the execution of every annotated method 20,000 times (L9). Writing testsmanually using a framework would not be as effective and would be difficult to com-prehend. On the other hand, such tests in large iterations can easily give rise to the assertion roulette test smell, which makes it difficult to identify a test failure. https://junit.org/junit4/faq.html https://github.com/madeja/unit-testing-practices-in-java/blob/master/examples/c_main1.java M. Madeja, J. Porub¨an, M. Baˇc´ıkov´a, M. Sul´ır, J. Juh´ar, S. Chodarev, F. Gurb´aˇl
While in the previous case the tests were evaluated using asserts, there are also approaches that have their own error handling. E.g., in the same repository, for all ResourceBundle classes a helper test class RBTestFmwk has been implemented, which represents a custom framework from which the test classes inherit. The framework provides the processing of main() function parameters, test execution, and result processing. The test methods to be performed are defined as input parameters. The disadvantage is that when running such tests, it is necessary to know the internal structure of the class, at least the names of the methods to be performed.

In general, the following risks were observed by analyzing other main() testing methods:

• Execution interruption — if a test fails, execution may be completely interrupted and no further tests will be performed (e.g. due to a raised exception).
• Failure identification — because testing is often performed repeatedly over different data, it can be difficult to identify the exact cause of a test failure, and in some cases debugging of the test code may be required.
• Dependence — tests often use the same sources or data for testing and may affect the results of other tests. Also, the tests are often order-dependent, and test order randomization was not found in any repository.

As mentioned in Section 4.1, because of the high diversity in writing such tests, it is necessary to carry out an extensive study dealing solely with this issue to find a way to precisely identify such test cases.
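To make the discussed style concrete, here is a minimal, hypothetical sketch of the "calling methods one by one" variant, which also illustrates the execution interruption risk from the list above:

public class LegacyMathTest {

    static void testAdd() {
        if (Math.addExact(2, 2) != 4) {
            throw new AssertionError("testAdd failed");
        }
    }

    static void testNegate() {
        if (Math.negateExact(5) != -5) {
            throw new AssertionError("testNegate failed");
        }
    }

    public static void main(String[] args) {
        // A raised AssertionError stops the whole run here, so no
        // further tests are executed after the first failure.
        testAdd();
        testNegate();
        System.out.println("All tests passed.");
    }
}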
H2: Tests are usually written by means of testing frameworks, but there are also other ways of automated code testing.
H2 has been confirmed. Third-party frameworks are mostly used for testing (84.59%), but there are also customized testing solutions in the form of classic Java programs whose task is to test the production code. Such customized solutions are mainly used to execute a large number of test cases repeatedly, which simplifies the implementation.
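For illustration, the framework-import heuristic described above can be approximated by a simple static check over the file content. The sketch below is an approximation under stated assumptions, not the authors' implementation; the import prefixes and the plain string matching are illustrative only:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Sketch: roughly classify a Java source file as a framework-based
    // test/runner or a possible customized (main()-based) testing solution.
    public class TestKindClassifier {

        // Illustrative import prefixes of common testing frameworks.
        private static final List<String> FRAMEWORK_IMPORTS = List.of(
                "import org.junit.", "import junit.framework.", "import org.testng.");

        public static String classify(Path javaFile) throws Exception {
            String src = Files.readString(javaFile);
            boolean hasMain = src.contains("public static void main(");
            boolean hasFrameworkImport =
                    FRAMEWORK_IMPORTS.stream().anyMatch(src::contains);

            if (hasMain && hasFrameworkImport) {
                // main() next to a framework import: a runner or a regular
                // framework-based test, not a customized solution.
                return "framework runner or framework-based test";
            }
            if (hasMain && src.contains("test")) {
                return "possible customized testing solution";
            }
            return "framework-based test or production code";
        }

        public static void main(String[] args) throws Exception {
            System.out.println(classify(Path.of(args[0])));
        }
    }

A real implementation would parse the import declarations properly rather than match raw strings, but even this rough check separates framework runners from stand-alone main()-based tests.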
The study relied on the GHTorrent databank and the GitHub API search to identify relevant projects. Because only projects with a majority of Java code were selected, the testing practices of projects where Java was not the major language could have been missed. Test classes that did not use the word test to indicate a test case were also missed. The search for test cases was based on best practices and the rules of the identified frameworks, but there may still exist other ways of writing tests that were not detected, such as customized solutions based on the main() function.
External validity:
To provide generalizable results, 20,340 test cases were analyzed manually and 170k classes were analyzed automatically. The meaning and occurrence of the word test was also analyzed for different natural languages and testing frameworks. The results can be used to identify test cases in Java-based projects, or in projects written in a different programming language that uses similar testing conventions. Despite the presented observations, our findings, as is usual in empirical software engineering, may not directly generalize to other systems, particularly to commercial ones or to those implemented in other programming languages.
This paper presented a large empirical study of Java open source GitHub projects to better understand how to identify test cases and their exact location in a project without the need for deep and time-consuming dynamic code analysis. An algorithm based on searching for the word test in the content or names of repository files was proposed and, at the same time, unusual testing practices were investigated. In total, 20,340 test cases in 803 classes were investigated manually and 170k classes were analyzed automatically. We summarize the most interesting findings of our study:

• There is no strong correlation between the occurrence of the word test and the presence of test cases in a class.
• Searching for the word test in file contents can be used to identify projects containing tests.
• Using static file analysis, the proposed algorithm is able to correctly detect 95% of test cases.
• Approximately 15% of the analyzed files contain test in their content together with the main() function; these represent regular Java programs that test the production code without using any third-party framework. The identification success rate for such test cases is very low because of their implementation diversity.

Several test writing styles were found and classified, along with code samples from the analyzed repositories. Possible origins of test smells and other code comprehension defects were also mentioned. Based on these findings, the following main contributions of this paper are concluded:
• The possibility of fast and automated test case identification together with its exact location in the project.
• The correlation coefficient r = 0.655 between the occurrence of the word test and the presence of a test case in a file, which allows thresholding projects or files for similar analyses.
• An overview of the observed testing practices with respect to the existence of customized testing solutions, with emphasis on the places in testing code usable for mining information about the production code.

As future work, we plan to find a solution for the accurate identification of test cases in customized solutions. We believe that by studying testing practices, it will be possible to train artificial intelligence to automatically recognize tests by the structure and nature of the code. At the same time, we would like to focus on mining tests for information that could support production source code comprehension and streamline the development process.
This work was supported by project VEGA No. 1/0762/19: Interactive pattern-driven language development.
REFERENCES

[1] Beller, M., Gousios, G., Panichella, A., Zaidman, A.: When, how, and why developers (do not) test in their IDEs. In: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015), ACM, New York, NY, USA, 2015, pp. 179–190.
[2] Bissi, W., Serra Seca Neto, A. G., Emer, M. C. F. P.: The effects of test driven development on internal quality, external quality and productivity: A systematic review. Information and Software Technology 74 (2016), pp. 45–54.
[3] Butler, S., Wermelinger, M., Yu, Y.: Investigating naming convention adherence in Java references. In: (2015), pp. 41–50.
[4] Corritore, C. L., Wiedenbeck, S.: Mental representations of expert procedural and object-oriented programmers in a software maintenance task. International Journal of Human-Computer Studies 50 (1999), No. 1, pp. 61–83.
[5] Cruz, L., Abreu, R., Lo, D.: To the attention of mobile software developers: guess what, test your app! Empirical Software Engineering 24 (2019), No. 4, pp. 2438–2468.
[6] Demeyer, S., Ducasse, S., Nierstrasz, O.: Object-Oriented Reengineering Patterns. Elsevier, 2002.
[7] Ellims, M., Bridges, J., Ince, D. C.: Unit testing in practice. In: (2004), pp. 3–13.
[8] Fucci, D., Erdogmus, H., Turhan, B., Oivo, M., Juristo, N.: A dissection of the test-driven development process: Does it really matter to test-first or to test-last? IEEE Transactions on Software Engineering 43 (2017), No. 7, pp. 597–614.
[9] Garcia, B.: Mastering Software Testing with JUnit 5: Comprehensive Guide to Develop High Quality Java Applications. Packt Publishing Ltd, 2017.
[10] Gopinath, R., Jensen, C., Groce, A.: Mutations: How close are they to real faults? In: (2014), pp. 189–200.
[11] Gousios, G.: The GHTorrent dataset and tool suite. In: Proceedings of the 10th Working Conference on Mining Software Repositories (MSR '13), IEEE Press, Piscataway, NJ, USA, 2013, pp. 233–236.
[12] Hemmati, H.: How effective are code coverage criteria? In: (2015), pp. 151–156.
[13] Hilton, M., Bell, J., Marinov, D.: A large-scale study of test coverage evolution. In: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018), ACM, New York, NY, USA, 2018, pp. 53–63.
[14] Jiang, J., Lo, D., He, J., Xia, X., Kochhar, P. S., Zhang, L.: Why and how developers fork what from whom in GitHub. Empirical Software Engineering 22 (2017), No. 1, pp. 547–578.
[15] Just, R., Jalali, D., Inozemtseva, L., Ernst, M. D., Holmes, R., Fraser, G.: Are mutants a valid substitute for real faults in software testing? In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014), ACM, New York, NY, USA, 2014, pp. 654–665.
[16] Kochhar, P. S., Lo, D., Lawall, J., Nagappan, N.: Code coverage and postrelease defects: A large-scale study on open source projects. IEEE Transactions on Reliability 66 (2017), No. 4, pp. 1213–1228.
[17] Kochhar, P. S., Thung, F., Lo, D.: Code coverage and test suite effectiveness: Empirical study with real bugs in large systems. In: (2015), pp. 560–564.
[18] Kochhar, P. S., Thung, F., Nagappan, N., Zimmermann, T., Lo, D.: Understanding the test automation culture of app developers. In: (2015), pp. 1–10.
[19] Kuhn, A., Van Rompaey, B., Haensenberger, L., Nierstrasz, O., Demeyer, S., Gaelli, M., Van Leemput, K.: JExample: Exploiting dependencies between tests to improve defect localization. In: Abrahamsson, P., Baskerville, R., Conboy, K., Fitzgerald, B., Morgan, L., Wang, X. (Eds.): Agile Processes in Software Engineering and Extreme Programming, Springer, Berlin, Heidelberg, 2008, pp. 73–82.
[20] Leitner, P., Bezemer, C.-P.: An exploratory study of the state of practice of performance testing in Java-based open source projects. In: Proceedings of the 8th ACM/SPEC International Conference on Performance Engineering (ICPE '17), ACM, New York, NY, USA, 2017, pp. 373–384.
[21] Lewis, W. E.: Software Testing and Continuous Quality Improvement. CRC Press, 2017.
[22] Linares-Vásquez, M., Bernal-Cárdenas, C., Moran, K., Poshyvanyk, D.: How do developers test Android applications? In: (2017), pp. 613–622.
[23] Madeja, M., Porubän, J.: Tracing naming semantics in unit tests of popular GitHub Android projects. In: Rodrigues, R., Janousek, J., Ferreira, L., Coheur, L., Batista, F., Oliveira, H. G. (Eds.): OpenAccess Series in Informatics (OASIcs), Vol. 74, Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2019, pp. 3:1–3:13.
[24] Munaiah, N., Kroh, S., Cabrey, C., Nagappan, M.: Curating GitHub for engineered software projects. Empirical Software Engineering 22 (2017), No. 6, pp. 3219–3253.
[25] Nayyar, A.: Instant Approach to Software Testing: Principles, Applications, Techniques, and Practices. BPB Publications, 2019.
[26] Peruma, A., Almalki, K., Newman, C. D., Mkaouer, M. W., Ouni, A., Palomba, F.: On the distribution of test smells in open source Android applications: An exploratory study. In: CASCON (2019), pp. 193–202.
[27] Pham, R., Singer, L., Liskin, O., Filho, F. F., Schneider, K.: Creating a shared understanding of testing culture on a social coding site. In: (2013), pp. 112–121.
[28] Scalabrino, S., Linares-Vásquez, M., Poshyvanyk, D., Oliveto, R.: Improving code readability models with textual features. In: (2016), pp. 1–10.
[29] Spadini, D., Palomba, F., Zaidman, A., Bruntink, M., Bacchelli, A.: On the relation of test smells to software code quality. In: (2018), pp. 1–12.
[30] Stefan, P., Horky, V., Bulej, L., Tuma, P.: Unit testing performance in Java projects: Are we there yet? In: Proceedings of the 8th ACM/SPEC International Conference on Performance Engineering (ICPE '17), ACM, New York, NY, USA, 2017, pp. 401–412.
[31] Sulír, M., Bačíková, M., Madeja, M., Chodarev, S., Juhár, J.: Large-scale dataset of local Java software build results. Data 5 (2020), No. 3, 86.
[32] Van Deursen, A., Moonen, L., Van Den Bergh, A., Kok, G.: Refactoring test code. In: Proceedings of the 2nd International Conference on Extreme Programming and Flexible Processes in Software Engineering (XP) (2001), pp. 92–95.
[33] Zerouali, A., Mens, T.: Analyzing the evolution of testing library usage in open source Java projects. In: (2017), pp. 417–421.
[34] Zhang, Y., Lo, D., Kochhar, P. S., Xia, X., Li, Q., Sun, J.: Detecting similar repositories on GitHub. In: (2017), pp. 13–23.
Matej Madeja was born in 1992 in Kežmarok, Slovakia. In 2017 he graduated (MSc) at the Department of Computers and Informatics of the Faculty of Electrical Engineering and Informatics at the Technical University of Košice, where he defended his master's thesis in the field of informatics. Currently, he is a PhD student at the same department. His research is focused on improving the efficiency of program comprehension, source code testing techniques, and the teaching of programming.

Jaroslav Porubän is a Professor and the Head of the Department of Computers and Informatics, Technical University of Košice, Slovakia. He received his MSc in Computer Science in 2000 and his PhD in Computer Science in 2004. Since 2003 he has been a member of the Department of Computers and Informatics at the Technical University of Košice. The main subject of his current research is computer language engineering, concentrating on the design and implementation of domain-specific languages and on computer language composition and evolution.
Michaela Bačíková is an assistant professor and the Head of the Information Systems Laboratory at the Department of Computers and Informatics, Technical University of Košice, Slovakia. She received her PhD in Computer Science in 2014. Since 2014 she has been a member of the Department of Computers and Informatics at the Technical University of Košice. The main subject of her current research is UX, HCI, and usability, focusing on domain-related terminology in user interfaces (domain usability). She also focuses on software languages and innovations in the teaching process.

Matúš Sulír is an assistant professor at the Department of Computers and Informatics, Technical University of Košice, Slovakia. At the same university, he graduated with a Master's degree in Computer Science in 2014 and received his PhD in 2018. His research is focused on program comprehension, particularly the integration of run-time information with source code, attribute-oriented programming, and debugging. He is also interested in empirical studies in software engineering.
Ján Juhár is a researcher at the Department of Computers and Informatics, Technical University of Košice. He received his PhD in Computer Science in 2018. Since 2018 he has been a member of the Department of Computers and Informatics at the Technical University of Košice. His research focuses on program comprehension, programming tools, source code metadata, and program projections.

Sergej Chodarev is an assistant professor at the Department of Computers and Informatics, Technical University of Košice, Slovakia. He received his Master's degree in Computer Science in 2009 and his PhD in Computer Science in 2012. The subject of his research includes domain-specific languages, metaprogramming, and software engineering.
Filip Gurbáľ