MSR Mining Challenge: The SmartSHARK Repository Mining Data
Alexander Trautsch
Institute of Computer Science, University of Goettingen
Göttingen, Germany
[email protected]

Steffen Herbold
Institute of Computer Science, University of Goettingen
Göttingen, Germany
[email protected]

This work was partially funded by DFG Grant 402774445.
Abstract—The SmartSHARK repository mining data is a collection of rich and detailed information about the evolution of software projects. The data is unique in its diversity and contains detailed information about each change, issue tracking data, continuous integration data, as well as pull request and code review data. Moreover, the data contains not only raw data scraped from repositories, but also annotations in the form of labels determined through a combination of manual analysis and heuristics, as well as links between the different parts of the data set. The SmartSHARK data set provides a rich source of data that enables us to explore research questions that require data from different sources and/or longitudinal data over time.
Index Terms—repository mining, version control, issue tracking, mailing list, continuous integration, code review, software metrics
I. INTRODUCTION
In recent years, we invested a large amount of effort to create a versatile data set about the evolution of software projects that combines data from different sources, based on our SmartSHARK platform for replicable and reproducible software repository mining [1], [2]. The core of this approach was to combine all data we generated for different publications in a single database that grows with every publication. This not only means that we add more projects over time, but also that the amount of information for the projects already within the database increases. By now, our database contains the following data:

• Data collected from Git, e.g., the commit messages, authors, dates, as well as the changed hunks. The clone of the Git repository at the time of collection is also stored to enable further analysis of the source code.
• Data about the source code for each commit focused on Java, e.g., software metrics (size, complexity, documentation, code clones), static analysis warnings from PMD (https://pmd.github.io/), and the number of nodes of each type in the AST of a file.
• Data about code changes, i.e., the detection of change types with ChangeDistiller [3], as well as refactorings with RefDiff [4] and RefactoringMiner [5].
• Data collected from Jira, i.e., the issues, comments, and changes made to issues.
• Data collected from GitHub, i.e., issues, pull requests, and code reviews as part of pull requests.
• Data collected from mailing lists, i.e., all emails from the developer and user mailing lists.
• Links between commits and issues, as well as links between commits and pull requests.
• Manually validated links between commits and bug issues, as well as the type of issues labeled as bug for 38 projects [6].
• Manually validated line labels that mark which changes contributed to a bug fix for 23 projects, as well as partial data for five additional projects [7].
• Annotations for commits and changes, i.e., bug fixing changes including their probable inducing changes, whether changes modified Javadocs or inline comments, whether TODOs were added or removed, and whether test code changed or we were able to detect refactorings.
• Travis CI build logs and build status information for all projects that use Travis CI.

The identities of developers are managed in a separate collection that is not shared publicly, unless specifically requested with a description of the purpose. Hence, developers are only identified by the (random) object identifier in the database.

II. DATA DESCRIPTION
This publication describes version 2.0 of the data, which is publicly available (full data: https://doi.org/10.6084/m9.figshare.13651346.v1; a smaller version without code entity states, code group states, and clone instances: https://doi.org/10.5281/zenodo.4462750). Older releases are available on our homepage (https://smartshark.github.io/dbreleases/), where we will also post future releases. A description of how to set up the data for local use, as well as an example for accessing the data with Python, is available online (https://smartshark.github.io/fordevs/).

In the following, we describe the data sources, the tools we used for the data collection, the size and format of the data, the schema of our database, the sampling strategy we used, and the list of projects for which data is available.
A. Data Sources
The raw data was collected from four different sources.

• Version control data is collected directly from a clone of the Git repository. The repositories are retrieved from GitHub.
• Issue tracking data is collected from the Apache Jira (https://issues.apache.org/jira/) and GitHub.
• Pull request data is collected from GitHub.
• Continuous integration data is collected from Travis CI.

All data is publicly available, but the tool vendors may require registration to scrape the data.
B. Data Collection Tools
Figure 1 shows the data collection tools we used. All tools are available on GitHub (https://github.com/smartshark/).

Fig. 1. Overview of data collection tools. The arrows indicate dependencies between tools. The colors indicate that different data sources are used (blue: Git repository; green: Jira and GitHub Issues; light blue: GitHub pull requests; yellow: mailing list archive; orange: Travis CI; grey: manual validation). A mix of colors means that data from different sources is required, as indicated by the dependencies and the colors.

The vcsSHARK downloads a clone of the Git repository and collects metadata about commits. The coastSHARK, mecoSHARK, changeSHARK, refSHARK, and rminerSHARK use the clone of the repository to collect software metrics and detect refactorings. The memeSHARK removes duplicate software metrics and reduces the data volume. The travisSHARK collects data from Travis CI and links it to the commits. The prSHARK collects pull requests, including reviews and comments, from GitHub and links them to commits. The mailingSHARK collects emails from mailing lists. The issueSHARK collects issue tracking data from Jira and GitHub issues. The linkSHARK establishes links between the issues and commits. The labelSHARK uses these links, the textual differences of changes, and changes to code metrics to compute labels for the commits. These labels are used by the inducingSHARK to find the probable changes that are inducing for the labels, e.g., for bugs.

The visualSHARK is used for manual validation of data, e.g., of links between commits and issues, issue types, and changes that contribute to bug fixes. This information is used by the labelSHARK and inducingSHARK to improve data that relies on this information, e.g., bug labels for commits. For completeness, we also mention the identitySHARK, which can be used to merge multiple identities of the same person in our data (e.g., different user name, same email). However, this data is not part of our public data set and will only be made available upon request if the desired usage is clearly specified and does not raise any ethical or data privacy related concerns.
C. Size and Format
The data set currently contains 69 projects; manual validations are available for a subset of 38 projects. Overall, these projects have 317,401 commits, 140,363 issues, 44,403 pull requests, and 2,501,717 emails.

All data is stored in a MongoDB. The size of the complete MongoDB is 1,140 gigabytes. This size drops drastically to about 37 gigabytes if we omit the collections with code clone data and software metrics. The data is still growing, and additional projects will be made available through subsequent releases of the data set.

Drivers for MongoDB are available for many programming languages (https://docs.mongodb.com/drivers/). Additionally, we provide Object-Relational Mapping (ORM) for Python and Java.
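As a minimal illustration of such access, the following Python sketch uses the pymongo driver to count the commits of one project by following the id references described in Section II-D. This is a sketch under assumptions: the connection string and the database name smartshark_2_0 depend on how the dump was restored locally.

    from pymongo import MongoClient

    # Assumption: a local MongoDB instance that contains the restored
    # SmartSHARK dump; the database name depends on the import.
    client = MongoClient("mongodb://localhost:27017")
    db = client["smartshark_2_0"]

    # Follow the id references: project -> vcs_system -> commit.
    project = db["project"].find_one({"name": "commons-math"})
    vcs_system = db["vcs_system"].find_one({"project_id": project["_id"]})
    n_commits = db["commit"].count_documents({"vcs_system_id": vcs_system["_id"]})
    print("commons-math commits:", n_commits)

The same traversal is also possible through the provided ORMs, which wrap these collections in Python and Java classes.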
D. Overview of the Database Schema
We currently have data of four types of repositories: version control systems, issue tracking systems, pull request systems, and mailing lists. Figure 2 gives an overview of our database schema; a complete documentation is available online (https://smartshark2.informatik.uni-goettingen.de/documentation/). Each project has an entry with the name and id. The software repositories are assigned to projects by their id.

Fig. 2. Overview of the database schema and the relationships between the collections.

The simplest systems are the mailing lists. The emails for the mailing lists are stored in the message collection. The issue tracking data is stored in three collections: issue stores the data about the issue itself, e.g., the title, description, and type; issue_comment stores the discussion of the issue; and event stores any update made to the issue, e.g., changes of the type, status, or description. This way, the complete life-cycle including all updates is available in the database.

The pull requests are organized similarly, but require seven collections due to the direct relationship to source code and associated code reviews: pull_request stores the metadata of the pull request, e.g., the title, description, and the associated branches; pull_request_comment stores the discussion of the pull request; pull_request_event stores any update made to the pull request; pull_request_file and pull_request_commit store references to files and commits within pull requests; and pull_request_review and pull_request_review_comment store the information about code reviews.

The version control system data is relatively complex, due to the diversity of the data stored. The main collection is commit, which contains the general metadata about the commits, e.g., the author, committer, revision hash, commit message, and time stamp. Moreover, commit also contains computed data, e.g., labels like bug fixing or links to issues. The file_action collection groups all changes made to a file in a commit; hunk contains the actual changes, including the diffs. The general information about the history is completed by the branch and tag collections. The code_group_state and code_entity_state collections contain the results of the static analysis we run on the repository at each commit. Code groups are, e.g., packages; code entities are, e.g., files, classes, and methods.
We removed duplicate code entities, e.g., files where the measurements did not change from one commit to the next. This way, we reduce the data volume by over 11 terabytes. To still allow the identification of the code entity states at the time of a commit, the commit collection contains a list of references to the correct code entity states. While the code entity states also contain a link to the commit for which they were measured, this link should be avoided, because users may inadvertently assume that they could find all code entity states for a specific commit this way, which is not the case (see the sketch at the end of this section). The clone_instance collection stores data about code clones. The automatically detected refactorings are stored in the refactoring collection. The travis_build collection contains the general information about a build, e.g., time stamps and the build status, and the travis_job collection contains the logs of the individual build jobs.

The people collection is not associated with any specific data source. Instead, we map all metadata that contains accounts, names, or email addresses to this collection and store the name, email address, and user name. The identity collection contains lists of people that very likely belong to the same identity. We use our own identity merging algorithm, which is available online (https://github.com/smartshark/identitySHARK; a scientific publication about the algorithm is not yet available).
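Returning to the code entity states: to make the recommended lookup direction concrete, the following sketch (reusing the pymongo setup from Section II-C) resolves the code entity states through the id list stored in the commit document instead of the reverse link. The field names revision_hash and code_entity_states follow the description above and the schema overview, but are assumptions in detail and should be checked against the online documentation.

    # Resolve the code entity states that were valid at a specific commit.
    # We query via the id list on the commit, not via the commit_id stored
    # in code_entity_state, which does not cover all valid entities.
    commit = db["commit"].find_one({"revision_hash": "0a1b2c3"})  # placeholder hash
    ces_ids = commit.get("code_entity_states", [])
    n_entities = db["code_entity_state"].count_documents({"_id": {"$in": ces_ids}})
    print("code entity states at this commit:", n_entities)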
E. Sampling Strategy and Representativeness

The data contains only projects from the Apache Software Foundation that have Java as the main language. The projects all have between 1,000 and 20,000 commits, i.e., the data does not contain very small or very large projects. The reason for the exclusion of very large projects is the data volume and processing time for the static analysis of each commit.

While the sample is not randomly drawn, it should be representative for well-maintained Java projects that have high standards for their development processes, especially with respect to issue tracking. Moreover, the projects cover different kinds of applications, including build systems (ant-ivy), Web applications (e.g., jspwiki), database frameworks (e.g., calcite), big data processing tools (e.g., kylin), and general purpose libraries (commons).
F. List of Projects
We have collected the data described above for the following projects. Manually validated data is available for the subset of 38 projects mentioned in Section II-C, and Travis CI data is available for the projects that use Travis CI.

activemq, ant-ivy, archiva, bigtop, calcite, cayenne, commons-bcel, commons-beanutils, commons-codec, commons-collections, commons-compress, commons-configuration, commons-dbcp, commons-digester, commons-imaging, commons-io, commons-jcs, commons-jexl, commons-lang, commons-math, commons-net, commons-rdf, commons-scxml, commons-validator, commons-vfs, deltaspike, derby, directory-fortress-core, eagle, falcon, flume, giraph, gora, helix, httpcomponents-client, httpcomponents-core, jackrabbit, james, jena, jspwiki, kafka, knox, kylin, lens, mahout, manifoldcf, maven, mina-sshd, nifi, nutch, oozie, openjpa, opennlp, parquet-mr, pdfbox, phoenix, pig, ranger, roller, samza, santuario-java, storm, streams, struts, systemml, tez, tika, wss4j, xerces2-j, zeppelin, zookeeper

III. USAGE EXAMPLES
The SmartSHARK data is versatile and allows different kinds of research. In the past, we have focused mostly on the analysis of bugs, as well as longitudinal analysis of trends within the development history. Below, we list some examples of papers that used (a subset of) this data set. Please note that some of the papers below are still under review and not yet published in their final versions.

• We evaluated defect prediction data quality with a focus on SZZ and manual validation [6]. The manuscript describes how we manually validated the links between commits and issues, as well as the issue types, and how we used SmartSHARK to create release-level defect prediction data.
• We evaluated trends of static analysis warnings from PMD and usage of custom rules for PMD, as well as the impact on defect density [8].
• We evaluated the impact of static source code metrics and static analysis warnings from PMD on just-in-time defect prediction [9].
• We used the manually validated issue type data to train and evaluate issue type prediction models [10].
• We provided the data for the modelling of developer behaviour through Hidden Markov Models (HMMs) [11].
• We analyzed the tangling within bug fixing commits, as well as the capability of researchers to manually identify tangling [12].
• We conducted an initial evaluation of a cost model for defect prediction [13].

IV. POSSIBLE RESEARCH QUESTIONS
The strength of our data is the capability to reason over data from different information sources. Questions regarding differences between the discussions on mailing lists and within issue trackers can be answered without scraping data from multiple sources. Moreover, the static analysis results and the labeling of changes allow for research into relationships, e.g., between refactorings and self-admitted technical debt or bug fixes. The availability of manually validated data enables us to evaluate the validity of heuristics, as well as to develop improved heuristics, e.g., for the labeling of bug fixes. Moreover, while we already established many links between the data sources, there are more opportunities that could be considered, e.g., links between the mailing list and commits, or between the mailing list and reported issues. Similarly, the links between pull requests and bugs can be explored, e.g., to understand why post-release bugs were not spotted during code review.
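As a sketch of how such questions could be approached with the existing links, the following query (again assuming the pymongo setup from Section II-C) contrasts commits that have heuristically linked issues with commits that have manually validated bug links. The field names linked_issue_ids and fixed_issue_ids are taken from the schema overview in Figure 2; their exact semantics are an assumption and should be verified against the documentation.

    # Compare heuristic issue links with validated bug links on commits.
    # Testing for "<array field>.0" is the idiomatic MongoDB check for a
    # non-empty array.
    heuristic = db["commit"].count_documents({"linked_issue_ids.0": {"$exists": True}})
    validated = db["commit"].count_documents({"fixed_issue_ids.0": {"$exists": True}})
    print("heuristic links:", heuristic, "validated bug links:", validated)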
V. LIMITATIONS
The greatest limitation of the SmartSHARK data is the number of projects for which data is available. The reason for this is the large computational effort required for the static analysis of the Java code of each commit. This limits the external validity not only due to the sample size, but also due to the focus on Java as the programming language. In the future, we plan to overcome this limitation by extending the database with a large set of projects for which we omit the static analysis and, thereby, are able to scale up the number of projects. While this will not support the same research questions, there are many interesting questions that can be answered without a static analysis of the source code for each commit.
VI. CONCLUSION
The SmartSHARK data set provides a rich source of data that enables us to explore research questions that require data from different sources and/or longitudinal data over time. Since all data is stored in a single database, results are easy to reproduce. The data is still growing, and future releases will further extend the data with more projects and additional data sources.
REFERENCES

[1] F. Trautsch, S. Herbold, P. Makedonski, and J. Grabowski, "Addressing problems with replicability and validity of repository mining studies through a smart data platform," Empirical Software Engineering, Aug. 2017.
[2] A. Trautsch, F. Trautsch, S. Herbold, B. Ledel, and J. Grabowski, "The SmartSHARK ecosystem for software repository mining," in Proc. of the 2020 Int. Conf. Softw. Eng. - Demonstrations Track, 2020.
[3] B. Fluri, M. Würsch, M. Pinzger, and H. Gall, "Change distilling: Tree differencing for fine-grained source code change extraction," IEEE Transactions on Software Engineering, vol. 33, no. 11, pp. 725–743, 2007.
[4] D. Silva and M. T. Valente, "RefDiff: Detecting refactorings in version histories," in Proc. of the 14th Int. Conf. on Mining Software Repositories (MSR), May 2017, pp. 269–279.
[5] N. Tsantalis, M. Mansouri, L. M. Eshkevari, D. Mazinanian, and D. Dig, "Accurate and efficient refactoring detection in commit history," in Proceedings of the 40th International Conference on Software Engineering, ser. ICSE '18. New York, NY, USA: ACM, 2018, pp. 483–494. [Online]. Available: http://doi.acm.org/10.1145/3180155.3180206
[6] S. Herbold, A. Trautsch, and F. Trautsch, "Issues with SZZ: An empirical assessment of the state of practice of defect prediction data collection," 2019.
[7] S. Herbold, A. Trautsch, B. Ledel, A. Aghamohammadi, T. A. Ghaleb, K. K. Chahal, T. Bossenmaier, B. Nagaria, P. Makedonski, M. N. Ahmadabadi, K. Szabados, H. Spieker, M. Madeja, N. Hoy, V. Lenarduzzi, S. Wang, G. Rodríguez-Pérez, R. Colomo-Palacios, R. Verdecchia, P. Singh, Y. Qin, D. Chakroborti, W. Davis, V. Walunj, H. Wu, D. Marcilio, O. Alam, A. Aldaeej, I. Amit, B. Turhan, S. Eismann, A.-K. Wickert, I. Malavolta, M. Sulir, F. Fard, A. Z. Henley, S. Kourtzanidis, E. Tuzun, C. Treude, S. M. Shamasbi, I. Pashchenko, M. Wyrich, J. Davis, A. Serebrenik, E. Albrecht, E. U. Aktas, D. Strüber, and J. Erbel, "Large-scale manual validation of bug fixing commits: A fine-grained analysis of tangling," 2020.
[8] A. Trautsch, S. Herbold, and J. Grabowski, "A longitudinal study of static analysis warning evolution and the effects of PMD on software quality in Apache open source projects," Empirical Software Engineering, 2020. [Online]. Available: https://doi.org/10.1007/s10664-020-09880-1
[9] ——, "Static source code metrics and static analysis warnings for fine-grained just-in-time defect prediction," in Proc. of the 2020 Int. Conf. Softw. Maintenance and Evolution, 2020.
[10] S. Herbold, A. Trautsch, and F. Trautsch, "On the feasibility of automated issue type prediction," https://arxiv.org/abs/2003.05357, 2020.
[11] V. Herbold, "Mining developer dynamics for agent-based simulation of software evolution," Dissertation, University of Goettingen, Germany, 2019. [Online]. Available: http://hdl.handle.net/21.11130/00-1735-0000-0003-C15C-C
[12] S. Herbold, A. Trautsch, and B. Ledel, "Large-scale manual validation of bugfixing changes," Mar. 2020. [Online]. Available: osf.io/acnwk
[13] S. Herbold, "On the costs and profit of software defect prediction," IEEE Transactions on Software Engineering, no. 01, pp. 1–1, Dec. 2019.
[14] Y. Zhao, H. Leung, Y. Yang, Y. Zhou, and B. Xu, "Towards an understanding of change types in bug fixing code," Information and Software Technology.
[15] J. Śliwerski, T. Zimmermann, and A. Zeller, "When do changes induce fixes?" in Proceedings of the 2005 International Workshop on Mining Software Repositories, ser. MSR '05. New York, NY, USA: ACM, 2005, pp. 1–5. [Online]. Available: http://doi.acm.org/10.1145/1082983.1083147
[16] D. Spadini, M. Aniche, and A. Bacchelli, "PyDriller: Python framework for mining software repositories," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering - ESEC/FSE 2018. New York, NY, USA: ACM Press, 2018, pp. 908–911. [Online]. Available: http://dl.acm.org/citation.cfm?doid=3236024.3264598
[17] K. Herzig, S. Just, and A. Zeller, "It's not a bug, it's a feature: How misclassification impacts bug prediction," in Proceedings of the 2013 International Conference on Software Engineering, ser. ICSE '13. Piscataway, NJ, USA: IEEE Press, 2013, pp. 392–401. [Online]. Available: http://dl.acm.org/citation.cfm?id=2486788.2486840
[18] R. P. L. Buse and W. R. Weimer, "Learning a metric for code readability," IEEE Trans. Softw. Eng., vol. 36, no. 4, pp. 546–558, Jul. 2010. [Online]. Available: http://dx.doi.org/10.1109/TSE.2009.70
[19] S. Scalabrino, M. Linares-Vásquez, R. Oliveto, and D. Poshyvanyk, "A comprehensive model for code readability," Journal of Software: Evolution and Process, vol. 30, no. 6, p. e1958, 2018.
[20] G. Gousios, "The GHTorrent dataset and tool suite," in Proceedings of the 10th Working Conference on Mining Software Repositories, ser. MSR '13. Piscataway, NJ, USA: IEEE Press, 2013, pp. 233–236.
[21] F. Trautsch, S. Herbold, and J. Grabowski, "Are unit and integration test definitions still valid for modern Java projects? An empirical study on open-source projects," Journal of Systems and Software, May 2019, pp. 184–185.
[23] F. Trautsch, S. Herbold, P. Makedonski, and J. Grabowski, "Addressing problems with external validity of repository mining studies through a smart data platform," in