An Empirical Analysis of the R Package Ecosystem

Abstract

In this research, we present a comprehensive, longitudinal empirical summary of the R package ecosystem, including not just CRAN, but also Bioconductor and GitHub. We analyze more than 25,000 packages, 150,000 releases, and 15 million files across two decades, providing comprehensive counts and trends for common metrics across packages, releases, authors, licenses, and other important metadata. We find that the historical growth of the ecosystem has been robust under all measures, with a compound annual growth rate of 29% for active packages, 28% for new releases, and 26% for active maintainers. As with many similar social systems, we find a number of highly right-skewed distributions with practical implications, including the distribution of releases per package, packages and releases per author or maintainer, package and maintainer dependency in-degree, and size per package and release. For example, the top five packages are imported by nearly 25% of all packages, and the top ten maintainers support packages that are imported by over half of all packages. We also highlight the dynamic nature of the ecosystem, recording both dramatic acceleration and notable deceleration in the growth of R. From a licensing perspective, we find a notable majority of packages are distributed under copyleft licensing or omit licensing information entirely. The data, methods, and calculations herein provide an anchor for public discourse and industry decisions related to R and CRAN, serving as a foundation for future research on the R software ecosystem and "data science" more broadly.



Ethan Bommarito, Michael J Bommarito II

Licensio, LLC


Keywords: software, software development, R, licensing, open source, dependency, complex system

1. Introduction

Since its first release in 1993, R has established itself as the most popular open-source statistical computing language [1]. One common explanation for this success is R’s large community, partially inherited from S, the language it “succeeded,” and its rich ecosystem of open-source packages and contributors. While its relative popularity has varied over the years, the increasing emphasis on statistical analyses in academia and industry has been evidenced by R’s increasing absolute rank in the TIOBE index. Inspired by research such as [2] [3] [4] [5] [6], our prior research [7] on Python’s PyPI, and professional experience with software development and information security, we seek in this paper to empirically describe the package ecosystem of this important language. Unlike extant literature, we analyze both complete historical package metadata and package contents, providing a more comprehensive understanding of releases, authors, licenses, dependencies, and other trends in package source and metadata over time. This research is intended to provide a convenient reference for empirical claims regarding the R ecosystem and to provide an anchor for a larger body of future research.

Email addresses: [email protected] (Ethan Bommarito), [email protected] (Michael J Bommarito II)

Preprint submitted to arXiv, February 22, 2021

R’s most well-known source of packages, the Comprehensive R Archive Network (CRAN), was first proposed in 1996 by Kurt Hornik and collaborators at TU Wien after inspiration by CTAN and CPAN. They announced the first realization of this proposal in early 1997 and launched the original server in the same year [8]. Since then, the number of “official” CRAN mirrors has grown to over 100 in 2020 [9]. As its official and longest-serving package repository, CRAN provides an empirical source of information about both R specifically and “data science” broadly. While a number of other studies have examined CRAN, either alone or in combination with GitHub, these studies have generally relied on smaller samples, metadata-only information, and qualitative coding in their analysis [10] [11] [3]. While our research does not address the same questions as these publications, it does establish a complete, longitudinal, and reproducible baseline for CRAN across its history, including direct analysis of package source code and data.

Over the two decades since CRAN’s launch, a number of academic communities have especially embraced R. Most notably, many members of the bioinformatics field have standardized on the usage of R in their work. In 2001, a group of such researchers began the “Bioconductor project[...], an initiative for the collaborative creation of extensible software for computational biology and bioinformatics (CBB)”. Their own words exemplify why so many researchers have been drawn to R: “[t]he primary motivations for an open-source computing environment for statistical genomics are transparency, pursuit of reproducibility and efficiency of development” [12] [13]. Like CRAN, the Bioconductor project provides its own repository of packages managed and distributed based on the needs of the bioinformatics community. Similar projects, such as Omegahat, have also intermittently existed over the last two decades [14].
Despite its prominent role in an important area of research, Bioconductor has received far less attention in prior research outside of [15] and [2], both of which are dated at this point.

In the context of society broadly, the last five years have witnessed a dramatic increase in attention towards the “emerging” fields of data science, machine learning, and “AI.” R has long held a prominent position in the endeavors of academic and industrial researchers in this space. As statistics departments often transitioned their primary teaching materials from S to R, many of the university students who filled the early ranks of “data scientists” were trained in R as their primary language. In industry, organizations including Microsoft, Google, Oracle, Facebook, and IBM have been active in the area, embedding R in databases and BI platforms, acquiring companies that provide commercial support and extension, publishing their own R packages, and collaborating with core developers. Many of these academic and industry activities have also been open source, and, as is common among open-source activities, have been recorded on GitHub [16]. GitHub itself has become increasingly popular over this time, and the introduction of the devtools package for R has greatly simplified the process of developing and using packages from GitHub.

Together, CRAN, Bioconductor, and GitHub contain a record of the evolving activities of authors and maintainers as they adapt to and affect the environment they co-create - metaphorically, an ecosystem. Each package source offers different benefits to package developers and package users. As a purpose-built distribution, Bioconductor offers members of the bioinformatics community a focused and well-tested platform for research, including both original research and subsequent replication.
This platform is the result of a committee of experts and developers who are constantly supporting and improving the distribution, though at their sole discretion and on their own release schedule. In contrast, while CRAN provides official packages and some degree of quality control, the CRAN repository is more open than Bioconductor. Packages of any nature may be submitted to CRAN so long as they comply with its policies and pass its basic automated test suite. These packages are made available as soon as approved by the responsible team of R maintainers or community members. GitHub lies at the most open end of the spectrum, containing not just packages meant for re-use, but also any source code, data, or related activities that an individual or organization makes available. Many package developers who release in CRAN or Bioconductor use GitHub to manage their development activities prior to “official” package or distribution release. The devtools package even allows package authors to distribute their packages directly from GitHub without relying on either the Bioconductor or CRAN teams for review or approval.

2. Data and Methods

Decades have passed since the R language and CRAN first appeared, and many things have changed since their first release and deployment. The data presented in this research is based on a platform developed by the authors to archive and analyze common software languages for the purpose of information security, compliance, code quality, and valuation [17]. With respect to R, the platform’s data retrieval protocols are outlined below:

2.1. CRAN Data Retrieval

1. Retrieve a list of all CRAN packages from a trusted mirror (e.g., https://cloud.r-project.org/)
2. For each package P,
   (a) Download the most recent release from /src/contrib/
   (b) Download all archived releases of P from /src/contrib/Archive/P
   (c) For each release R in package P,
       i. Parse and store the release metadata in the DESCRIPTION file
       ii. Parse and store any license or copyright information in the LICENSE file
       iii. For each file F in release R,
            A. Analyze and index F in file database
            B. If F is classified as a source file, calculate source code metrics

In many cases, source and data files do not change from release to release. In the extreme case, releases may only update package structure or metadata. We perform SHA-1 hashing of all files to efficiently index, reduce unnecessary reprocessing, and identify identical files across all packages and releases.
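The SHA-1 indexing step can be illustrated with a short Python sketch. This is not the authors' platform: the function names `sha1_of_file` and `index_release` and the in-memory index are hypothetical, and a production system would presumably back the index with a database.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_release(release_dir, file_index):
    """Record every file under release_dir in file_index (hash -> list of paths).

    file_index is assumed to be a defaultdict(list) shared across releases.
    Returns the hashes seen for the first time, i.e. the only files that
    would need a full analysis pass.
    """
    new_hashes = []
    for path in sorted(Path(release_dir).rglob("*")):
        if not path.is_file():
            continue
        file_hash = sha1_of_file(path)
        if file_hash not in file_index:
            new_hashes.append(file_hash)
        file_index[file_hash].append(str(path))
    return new_hashes
```

Calling `index_release` on successive releases of the same package with one shared `defaultdict(list)` returns only the files whose contents have not been seen before, which is the property that lets unchanged files skip reprocessing.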

2.2. Bioconductor Data Retrieval

1. Retrieve a list of Bioconductor versions from the Bioconductor server
2. For each Bioconductor release version V greater than 2.5,
   (a) Retrieve a list of all package releases from https://bioconductor.org/packages/V/bioc/
   (b) For each release R,
       i. Parse and store the release metadata in the DESCRIPTION file
       ii. Parse and store any license or copyright information in the LICENSE file
       iii. For each file F in release R,
            A. Analyze and index F in file database
            B. If F is classified as a source file, calculate source code metrics
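The version filter in step 2 is easy to get wrong if versions are compared as strings, since "2.10" sorts before "2.5" lexically. The sketch below, in Python, shows one way to apply the cutoff numerically and build the listing URLs from the pattern given above. The helper names `parse_version` and `bioc_release_urls` are hypothetical, and the strict "greater than 2.5" comparison follows the protocol text literally.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.10' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def bioc_release_urls(versions, minimum="2.5"):
    """Return listing URLs for Bioconductor versions strictly greater than minimum.

    The URL pattern is the one stated in the retrieval protocol; the strict
    comparison is an assumption based on the wording 'greater than 2.5'.
    """
    floor = parse_version(minimum)
    kept = sorted(
        (version for version in versions if parse_version(version) > floor),
        key=parse_version,
    )
    return [f"https://bioconductor.org/packages/{version}/bioc/" for version in kept]
```

For example, `bioc_release_urls(["3.12", "2.4", "2.5", "2.10"])` keeps only versions 2.10 and 3.12, in that order.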

2.3. GitHub Data Retrieval

1. Using the GitHub Search API, collect all releases that are tagged under the R language
2. For each release R,
   (a) Parse and store the metadata returned from the GitHub Search API
   (b) Parse and store the release metadata in the DESCRIPTION file
   (c) Parse and store any license or copyright information in the LICENSE file
   (d) For each file F in release R,
       i. Analyze and index F in file database
       ii. If F is classified as a source file, calculate source code metrics

As the protocol descriptions imply, data is collected from CRAN and Bioconductor in a similar manner. Current releases are available under a distribution folder on a mirror, and historical releases are grouped by package folder containing one or more prior releases. Whereas CRAN is organized by R version, Bioconductor is organized by its own package releases, live or archived, tied to a Bioconductor version.

GitHub, however, differs from this structure quite radically, and the nature of GitHub repositories makes the data much less simple. In this analysis, we analyze releases as published in GitHub repositories, which represent the most comparable record to CRAN and Bioconductor. Notably, GitHub repositories also include GitHub metadata, e.g., license type, not just DESCRIPTION metadata. This GitHub release [...] devtools. While devtools supports installing from commits, tags, branches, pull requests, or releases, the analysis of commit-by-commit, branch-by-branch development practices is not our focus in this research. In the instance where the metadata GitHub provides is in conflict with the metadata provided in a DESCRIPTION file, we prefer the information provided in the DESCRIPTION file, since it is more likely to agree with other versions of that package hosted on other sites.

Notes:
Most mirrors include sub-folders for “archived,” “abandoned,” or “old(er)” packages. We retrieve these packages, even though they may not be currently available to install.
Due to issues with historical Bioconductor package listings, retrieval of historical Bioconductor packages prior to version 2.5 produced inconsistent and incomplete results. As a result, they have been excluded from this analysis.
The GitHub Search API does not return stable results, requires rate limiting, and has peculiar chunking requirements. Our search results were performed by adaptively chunking requests by push date and size, but results may vary based on chunking strategy.

Note that the retrieval methods above retrieve releases from three separate sources of R packages. While many packages or releases are only available from one archive, some packages or releases may be present in more than one source. For example, many authors may release a tagged version on GitHub and submit the GitHub archives to CRAN for listing. That package may subsequently be distributed as part of a Bioconductor release. In some cases, identical sources may be present with varying versions or metadata, e.g., to explicitly indicate compatibility with a newer R version. So as to avoid overcounting of such identical or trivially-varying releases, all counts presented in this research reflect deduplication of releases for the same package name and version string or SHA-1 hash. For example, while ggplot2 2.2.1 exists as both a GitHub release and on CRAN, we count it as a single unique package release for 2016. Further, much of the activity on GitHub may relate to forks of popular packages as users contribute via GitHub pull request workflows. Such activities are an important part of a healthy open-source community, but may result in overcounting R activity at the package level. On the other hand, we also rely on GitHub’s R tagging and release functionality for this search; authors who never use the GitHub release functionality or whose source is not properly tagged by GitHub are absent from our results, and this may result in undercounting R activity at the package and author level.
In our analysis of forks with releases, we find that only 42 packages meet this condition, and so we believe they do not materially affect our interpretations. Further, many users fork R packages without ever committing a modification. By only including GitHub releases, we exclude these trivial forks, which would otherwise outnumber real packages.

Lastly, a number of important metadata fields demonstrate great variance across contributors and within contributors. Contributors record their name with and without initials, middle name, or credentialing. Contributors include, omit, or change their email addresses or affiliations. Contributors use different Relator values over time. Similar dynamics occur across many other fields. We perform simple normalization for authors, relying on both name and email address, and have manually reviewed these normalizations. However, normalization of such inconsistencies in other fields is left to future work on specific areas of interest.
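The release deduplication rule described earlier in this section (same package name and version string, or same SHA-1 hash) can be sketched in Python as follows. The record layout and the function name `deduplicate_releases` are hypothetical, and treating a hash match as sufficient regardless of package name is one reading of the rule rather than a confirmed detail of the authors' method.

```python
def deduplicate_releases(releases):
    """Keep the first release seen for each (name, version) pair or SHA-1 hash.

    Each release is assumed to be a dict with 'name', 'version', 'sha1',
    and 'source' keys. Later records that repeat an already-seen
    (name, version) pair or an already-seen hash are dropped.
    """
    seen_versions = set()
    seen_hashes = set()
    unique = []
    for release in releases:
        version_key = (release["name"], release["version"])
        hash_key = release["sha1"]
        is_duplicate = version_key in seen_versions or hash_key in seen_hashes
        # Remember both keys even for duplicates so later records can match them.
        seen_versions.add(version_key)
        seen_hashes.add(hash_key)
        if not is_duplicate:
            unique.append(release)
    return unique
```

Under this sketch, a ggplot2 2.2.1 record from CRAN and a ggplot2 2.2.1 record from GitHub collapse into a single release even if their archives hash differently, which matches the example given in the text.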

3. Results

As described above in Section 2, our results are based on three sources of R packages in the ecosystem: CRAN, Bioconductor and GitHub. CRAN is the largest, oldest, and best-known of these sources, and it offers the longest history. GitHub, while not exclusive to R or any other language, is immense compared to CRAN in terms of the number of authors, the amount of activity, and how much R source code is published there. Relative to these two sources, Bioconductor is the most focused and smallest. Despite these large variations in scale between the sources, we attempt to present them in parallel where possible in the tables and figures below. The majority of our tables below are grouped by source, year, or another similar index. Where possible, we standardize the presentation of results on the “package” or “release” level (a “package” is a grouping of one or more “releases”). We track package releases across sources, so if the same package version is released across more than one source, we can keep their releases separated by source in the event that identical versions do not contain identical contents.

While the authors’ platform executes these procedures on a daily basis for commercial purposes, the tables and figures presented herein are based on a snapshot from December 2020. A summary of data is [...] DESCRIPTION files; some authors use CRAN-specific or SPDX standards, but many do not. We discuss future work on this topic in Section 4 below.

Note: Outside of counts, these releases are analyzed separately, e.g., for information security issues, so long as they are not identical by SHA-1 hash.

Statistic                  CRAN        Bioconductor  GitHub
Number of Packages         20,023      2,043         6,650
Number of Releases         115,134     21,335        19,225
Number of Maintainers      14,131      1,957         1,432
Number of Unique Licenses  609         109           284
Number of Files            11,713,946  1,815,723     4,124,235

Table 1: Summary of key statistics by source

We divide the remainder of the results section as follows. First, in Subsection 3.1, we present time series of counts for packages, releases, and maintainers over time, allowing us to summarize the growth of R over the last two decades. Next, in Subsection 3.2, we examine key facts and distributions related to packages and releases per se, such as the number of packages over time and the distribution of duration between releases. In Subsection 3.3, we present information about contributors such as authors and maintainers, including measures of the most active maintainers.

While these total counts are impressive, it is important to understand how the ecosystem together and sources alone have grown over time to create today’s collection of software. Importantly, our analysis window for each of these sources varies. While we have archived all CRAN releases for two decades, our reliable data for Bioconductor and GitHub is limited to only the past decade. As such, we will initially present and interpret their time series independently.

Table 2 documents the number of new packages, active packages, new releases, active maintainers, and cumulative releases by year for CRAN. Overall, these numbers clearly demonstrate a language that has experienced dramatic growth, with order-of-magnitude increases in packages, releases, and active maintainers. Recent years have seen over 2,000 new packages, over 7,000 actively maintained packages, and over 1,600 active maintainers on CRAN. Notable, however, is the peak and dip in growth of new packages and active maintainers from 2017 to 2019. While the ecosystem has returned to growth in these metrics as of 2020, it is striking that CRAN experienced both a year-over-year decrease and an overall deceleration in growth given academic and industry interest in related fields.

Next, Table 3 details the same metrics for Bioconductor since Version 2.5. As expected, Bioconductor’s overall growth is much slower as a result of its focus on domain relevance and quality, and this is confirmed in its higher ratio of actively maintained packages to total releases than CRAN. It is further unsurprising that Bioconductor does not exhibit the same kind of monotonic growth, as the committee’s inclusion criteria and maintenance standards may even result in a reduction of included packages in some cases.

Finally, in Table 4, we show the last eight years of metrics from GitHub’s releases. GitHub exhibits the most rapid absolute increase out of all sources, likely reflecting the adoption of devtools by many developers of packages.
Recent years have seen approximately 1,000 new packages per year, approximately 1,500 actively maintained packages, and over 600 active maintainers on GitHub. As discussed above in Section 2, some trivial amount of package activity is influenced by fork and pull request workflows, but this does not change our qualitative interpretations.

Note: Bioconductor’s release page lists the number of packages available in each Bioconductor release version. In some cases, these packages may be unchanged from prior Bioconductor releases in prior years. As a result, while our numbers differ from Bioconductor’s release numbers, ours reflect only new packages and releases in a calendar year and are comparable to other sources.

Year New Packages Active Packages New Releases Active Maintainers Total Releases

Table 2: Number of new packages, active packages, new releases, active maintainers and total releases on CRAN by year; 2020 is a partial year through December

Year New Packages Active Packages New Releases Active Maintainers Total Releases

Table 3: Number of new packages, active packages, new releases, active maintainers and total releases on Bioconductor by year

Year New Packages Active Packages New Releases Active Maintainers Total Releases

Table 4: Number of new packages, active packages, new releases, active maintainers and total releases on GitHub by year

[Figure 1 plot: new packages by year; series: CRAN, Bioconductor, GitHub]

Figure 1: New packages by year across package sources.

Since these sources have existed for different durations and serve different needs, direct comparisons of raw counts can be misleading and mischaracterize the nature of the ecosystem. Even growth rates may not be a useful metric for sources like Bioconductor, where one might expect a maturity and stability at a level of “feature completion” compared to an unconstrained scope. Therefore, as described above, we synthesize these three sources into a single, cumulative time series that represents R’s total ecosystem. By viewing the sum of these sources together, we much more accurately capture the true activity and evolution of R over the last two decades. We present the same metrics from Tables 2, 3, and 4 for the total ecosystem in Table 5.

Table 5 documents an impressive two decades for the R ecosystem. In its first five years, R’s package ecosystem slowly grew to approximately 250 active packages and 100 maintainers. In the next five years, these numbers grew more than five-fold to over 1,700 active packages and nearly 600 maintainers. Between 2010 and 2015, these numbers more than doubled to over 4,000 active packages and over 1,200 maintainers. Finally, in the most recent five years, this growth has slowed but still resulted in over 7,500 active packages and 1,600 maintainers.

To better summarize this growth across R and within each source, we calculate the compound annual growth rate (CAGR) for each source and the collective ecosystem. These results are presented for new packages, active packages, new releases, and active maintainers below in Table 6. These calculations confirm our observations above and provide a convenient summarization of R’s two decades of growth. On average, R has sustained a total growth rate of approximately 29% for active packages, 28% for new releases, and 26% for active maintainers. For comparison with Python, we document a CAGR of 47% for active packages and 39% for maintainers within PyPI in [7]. While this cumulative growth has been impressive, it is worth noting that recent years indicate a potential peak or plateau in R’s growth. Since 2017-2018, the number of new packages and active maintainers on both GitHub and CRAN has remained relatively constant or decreased. We will continue to monitor this trend through 2021 to examine whether the R ecosystem may be stable or actually decreasing at this point.

[Figure 2 plot: x-axis: year, 2000 to 2020; y-axis: active maintainers, 0 to 1,500; series: CRAN, Bioconductor, GitHub]

Figure 2: Active Maintainers by Year

Year New Packages Active Packages New Releases Active Maintainers Total Releases

Table 5: Number of new packages, active packages, new releases, active maintainers and total releases across repositories by year

Measure             CRAN    Bioconductor  GitHub  Total Ecosystem
New Packages        20.85%  1.04%         38.99%  22.49%
Active Packages     27.89%  21.04%        47.54%  29.29%
New Releases        26.56%  2.61%         54.77%  28.03%
Active Maintainers  23.64%  1.06%         35.78%  25.61%
Total Releases      33.95%  22.45%        74.04%  35.82%

Table 6: Compound Annual Growth Rate (CAGR) for new packages, active packages, new releases, active maintainers, and total releases by source. CAGR calculation starts at 2000 for CRAN, 2009 for Bioconductor, 2013 for GitHub, and 2000 for the combined ecosystem.
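For readers who want to reproduce figures of this kind, the sketch below shows the standard definition of CAGR, (end / start)^(1 / periods) - 1, in Python. The paper does not spell out its exact formula or endpoint handling, so this is an assumption, and the function name `cagr` and the counts in the usage note are hypothetical rather than values taken from Tables 2 to 5.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate between two observations.

    start_value and end_value are the metric at the first and last year;
    periods is the number of years between them.
    """
    if start_value <= 0 or periods <= 0:
        raise ValueError("start_value and periods must be positive")
    return (end_value / start_value) ** (1.0 / periods) - 1.0
```

As an illustration with made-up numbers, a metric that grows from 50 to 8,000 over 20 years yields `cagr(50, 8000, 20)`, a rate a little under 29% per year.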

In order to better understand package dynamics generally and within each source, we next examine per-package distributions and statistics. First, we calculate summary statistics for the distribution of releases per package in Table 7. This distribution helps us understand how packages are, or are not, maintained by each source and overall. The differences between GitHub, CRAN, and Bioconductor are immediately apparent from this table. GitHub is the most right-skewed of these sources, as expected, with a median of 1.0 release per package and a mean of 2.9. Many repositories in GitHub are forks or single releases that see little maintenance or are even entirely abandoned. There are, however, repositories that see regular updates, such as popular projects developed by the R core team, academic groups, or companies. CRAN shows some right-skewness as well, though much less, as its median releases per package is 3.0 and mean is 5.8. Bioconductor, by contrast, has both the highest mean releases per package and an almost identical median, demonstrating the quality control and support provided through its curation. Across the ecosystem, half of all R packages have had 3.0 or fewer releases, with a mean of approximately 5.6.

Statistic           CRAN    Bioconductor  GitHub  Across Repositories
Mean                5.75    10.44         2.89    5.60
Standard Deviation  8.76    6.40          5.54    8.51
Minimum             1.00    1.00          1.00    1.00
25th Percentile     1.00    5.00          1.00    1.00
50th Percentile     3.00    10.00         1.00    3.00
75th Percentile     7.00    15.00         3.00    7.00
Maximum             200.00  21.00         201.00  250.00

Table 7: Descriptive statistics for distribution of number of releases per package
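The rows of Table 7 (mean, standard deviation, minimum, quartiles, maximum) can be computed with the Python standard library as sketched below. The function name `describe` is hypothetical, and two details are assumptions because the paper does not state them: the sample (n - 1) standard deviation and linearly interpolated quartiles.

```python
import statistics


def describe(values):
    """Summary statistics in the layout of the descriptive tables.

    Requires at least two values, since the sample standard deviation
    is undefined for a single observation.
    """
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),
        "min": data[0],
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": data[-1],
    }
```

Applied to a list of release counts per package, a right-skewed distribution shows up as a mean well above the 50th percentile, which is the pattern the text describes for GitHub and CRAN.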

We next examine the distribution of duration between releases or inter-release timing for packages. Table 8 below presents the mean, standard deviation, min, max, and quartiles for this distribution by source and across the entire ecosystem. Packages that have a lower duration between releases are likely higher velocity and/or younger packages, whereas packages with more time between releases may be either mature or lower velocity. As Bioconductor has a defined semi-annual release calendar, its timing is generally less varied. For CRAN, the mean and median inter-release durations are 213 and 96 days, respectively. For GitHub, these mean and median durations are 100 and 37 days, respectively. Unsurprisingly, packages that are active and released on GitHub demonstrate a higher velocity than those released through CRAN, likely as a result of decreased submission and approval process friction. On average across the ecosystem, packages release approximately six months apart, though there is substantial variation.

Measure             CRAN        Bioconductor  GitHub      Across Repositories
Mean                213 days    283 days      100 days    178 days
Standard Deviation  331 days    343 days      172 days    312 days
Minimum             1 day       1 day         1 day       0 days
25th Percentile     33 days     161 days      8 days      11 days
50th Percentile     96 days     181 days      37 days     68 days
75th Percentile     247 days    286 days      114 days    200 days
Maximum             5,236 days  4,658 days    2,182 days  5,236 days

Table 8: Descriptive statistics for inter-release duration distribution

To better understand the right tail of this distribution - packages with many releases - we examine the top 20 packages by release count. These packages are listed below in Table 9. In general, most of these packages have long histories on CRAN, e.g., Matrix, or are built by active teams or companies on GitHub, e.g., canvasXpress. Many of these packages are even authored and maintained by the R-core development team and its members, such as nlme, lattice, Matrix, and mgcv.

Name           Count
spatstat       250
PortalData     201
canvasXpress   199
Matrix         198
pomp           162
mgcv           155
nlme           136
RcppArmadillo  136
lattice        135
rgdal          134
caret          133
spdep          129
plotrix        129
sp             125
Rcmdr          121
XML            121
gstat          118
arm            117
lme4           117
foreign        115

Table 9: Top 20 packages by number of releases across repositories
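The inter-release durations summarized in Table 8 are gaps between consecutive releases of the same package. A minimal Python sketch of that calculation is shown below; the function name `inter_release_days` is hypothetical, and the use of calendar dates rather than timestamps is an assumption.

```python
from datetime import date


def inter_release_days(release_dates):
    """Return the gaps, in days, between consecutive releases of one package.

    A package with a single release has no gaps and returns an empty list,
    so it contributes nothing to the duration distribution.
    """
    ordered = sorted(release_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
```

For example, releases dated 2020-01-01, 2020-01-31, and 2020-07-01 produce gaps of 30 and 152 days. Pooling these lists across all packages in a source gives the sample from which the quartiles and means are taken.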

Releases and packages do not, of course, grow on trees. Contributors of a wide variety create and sustain these works, and, in order to truly understand the ecosystem, we must examine dynamics related to these individuals and organizations. Maintainer data, the most common of these contributor types, is provided in numerous forms within the R ecosystem. In all packages containing DESCRIPTION files, an Authors@R field can contain a list of one or more individuals or organizations, including details about the role that they play for that package or release. In addition, packages published to CRAN must include a Maintainer field for an author who provides an email address. Table 10 below provides a summary of the distribution of packages and releases per maintainer. As is the case in many records of human creators, the majority of maintainers have a single work while a small number of maintainers are responsible for many packages; the median number of packages and releases per maintainer is 1 and 3, respectively. Even at the 75th percentile, these values increase to just 2 packages and 8 releases.
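To make the Authors@R and Maintainer fields concrete, the sketch below parses a DESCRIPTION file in Python. DESCRIPTION files use a "Field: value" layout in which continuation lines begin with whitespace; the parser here handles only that basic layout, and the names `parse_description` and `split_maintainer` are hypothetical. It is an illustration, not the authors' parser.

```python
import re


def parse_description(text):
    """Parse 'Field: value' records, joining indented continuation lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and current is not None:
            # Continuation of the previous field's value.
            fields[current] += "\n" + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            current = key.strip()
            fields[current] = value.strip()
    return fields


MAINTAINER_PATTERN = re.compile(r"^\s*(?P<name>[^<]*?)\s*<(?P<email>[^>]+)>\s*$")


def split_maintainer(value):
    """Split 'Name <email>' into (name, email); email is None if absent."""
    match = MAINTAINER_PATTERN.match(value)
    if match is None:
        return value.strip(), None
    return match.group("name"), match.group("email")
```

Given a file containing `Maintainer: Jane Doe <jane@example.org>` (a made-up example), `split_maintainer(parse_description(text)["Maintainer"])` returns the name and email separately, which is the pair the normalization step described in Section 2 relies on.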

Statistic           Packages per Maintainer  Releases per Maintainer
Mean                1.81                     8.90
Standard Deviation  2.56                     24.62
Minimum             1.00                     1.00
25th Percentile     1.00                     1.00
50th Percentile     1.00                     3.00
75th Percentile     2.00                     8.00
Maximum             99.00                    791.00

Table 10: Descriptive statistics for distribution of packages and releases per maintainer across repositories

In Table 11, we examine the right tail of the distribution in Table 10 by listing the top 20 maintainers by package counts. Many of these authors should come as no surprise, as Hadley Wickham and Dirk Eddelbuettel might as well be listed as essential personnel in R’s ecosystem. Some of these authors, however, are less well-known to the community, such as Jia Zhong Wu, who is responsible for IBM’s IBMPredictiveAnalytics GitHub repository. In some cases, these authors are actually organizations or their automation, such as the Bioconductor Package Maintainer metadata author description.

Name                             Count
Scott Chamberlain                99
Hadley Wickham                   78
Richel Bilderbeek                69
Dirk Eddelbuettel                66
Jia Zhong Wu                     64
Gábor Csárdi                     57
Bioconductor Package Maintainer  56
Jeroen Ooms                      54
Thomas J. Leeper                 38
Kurt Hornik                      37
Scott Chamberlain                36
Max Kuhn                         35
Bob Rudis                        34
Robin K. S. Hankin               34
Hadley Wickham                   34
Martin Maechler                  34
Gabor Csardi                     34
Kartikeya Bolar                  33
Henrik Bengtsson                 32
Jan Wijffels                     31

Table 11: Top 20 maintainers by number of packages maintained across repositories

Not all packages are equal in effort, of course. Packages vary in size, composition, and complexity; some might contain primarily data instead of source code, and others might contain a small number of very complex functions. In order to provide a view into the distribution of effort in maintaining packages, we present the top maintainers by number of lines of code (LOC) in Table 12. These LOC calculations include only files such as R, Fortran, C, and C++ source files, excluding any lines or bytes from data, auto-generated documentation, or other package contents. Unsurprisingly, a small number of authors are responsible for an outsized percentage of R code; the top 20 maintainers alone support over 15% of all lines of code in the R ecosystem.

Maintainer                 KLOC   Percent of total KLOC
Michael Lawrence      15,727.78   1.64%
Adrian Baddeley       10,878.97   1.13%
Marek Gagolewski      10,326.76   1.08%
weizhouUMICH          10,038.94   1.05%
Henrik Bengtsson       9,527.69   0.99%
Adrian Baddeley        8,299.77   0.86%
Joshua N. Pritikin     8,230.79   0.86%
Gabor Csardi           7,367.14   0.77%
Kurt Hornik            7,193.62   0.75%
Martin Maechler        6,823.04   0.71%
Stan buildbot          6,706.61   0.70%
Wei-Chen Chen          6,209.87   0.65%
Douglas Bates          6,035.10   0.63%
Jeffrey S. Racine      5,074.88   0.53%
Roger Bivand           5,022.16   0.52%
Edzer Pebesma          4,801.93   0.50%
Vladislav Kim          4,372.42   0.46%
Doug and Martin        4,367.44   0.46%
Virginie Rondeau       3,978.60   0.41%
Alexander Robitzsch    3,970.17   0.41%

Table 12: Top 20 maintainers by KLOC
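The "over 15%" figure can be checked directly against the last column of Table 12. The short sketch below simply sums the twenty reported percentages; the variable and function names are illustrative only, and the inputs are the rounded values printed in the table, so the result is approximate.

```python
# Percent-of-total-KLOC column from Table 12, in table order.
top20_percent = [
    1.64, 1.13, 1.08, 1.05, 0.99, 0.86, 0.86, 0.77, 0.75, 0.71,
    0.70, 0.65, 0.63, 0.53, 0.52, 0.50, 0.46, 0.46, 0.41, 0.41,
]


def combined_share(percents):
    """Sum per-maintainer shares, rounded to two decimals like the table."""
    return round(sum(percents), 2)


top20_share = combined_share(top20_percent)
print(f"Top 20 maintainer entries: {top20_share}% of all KLOC")
```

Note that some names appear twice in Table 12, so the twenty rows correspond to slightly fewer than twenty distinct people.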

Conversely, many packages also have more than just one contributor, whether that contributor is an individual or organization. Luckily, many releases explicitly include additional contributors in their DESCRIPTION metadata. Table 13 details the descriptive statistics for the distribution of contributors per package across repositories. The mean and median number of listed contributors per package are 2.5 and 2.0, respectively, increasing to 3.0 at the 75th percentile. The right tail of this distribution is marked by packages that distribute collections of data from multiple sources. For example, the package with 127 contributors is the rcorpora package, which includes a “collection of small text corpora of interesting data.”

Statistic             Contributors per Package
Mean                    2.46
Standard Deviation      3.05
Minimum                 1.00
25th Percentile         1.00
50th Percentile         2.00
75th Percentile         3.00
Maximum               127.00

Table 13: Descriptive statistics for distribution of contributors per package across repositories

Within R’s DESCRIPTION metadata, packages may also specify the role of an author or contributor to a package. These roles are based on the Library of Congress’s MARC Code List for Relators [18], though as they are manually entered without validation, some data quality issues occur. Table 14 presents counts of the most common three-letter role designations. Some of these tags clearly reflect R’s usage in academia, e.g., the fifth most popular tag is “Thesis Advisor.” Many other tags highlight the unreliability of such metadata, either due to common misspellings or humorous self-reported roles like woodcutter. As an example, all contributors to the rcorpora package discussed above are listed without a role tag.

Tag                   Count
Untagged            184,996
Author               92,357
Creator              49,071
Contributor          48,824
Copyright Holder     17,457
Thesis Advisor        1,510
Translator              793
Funder                  725
Data Contributor        725
Contractor              334

Table 14: Top 10 package contributor tags across repositories
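To make the tagging scheme concrete, the following sketch tallies role tags in the style of Table 14. It is not the pipeline used in this paper: the input structure, the function name, and the sample contributors are hypothetical, and we assume one count per contributor-role pair, with contributors that list no role counted as Untagged. The three-letter codes are the MARC relator codes corresponding to the labels in Table 14.

```python
from collections import Counter

# MARC relator codes for the role labels that appear in Table 14.
ROLE_LABELS = {
    "aut": "Author",
    "cre": "Creator",
    "ctb": "Contributor",
    "cph": "Copyright Holder",
    "ths": "Thesis Advisor",
    "trl": "Translator",
    "fnd": "Funder",
    "dtc": "Data Contributor",
    "ctr": "Contractor",
}


def tally_roles(contributors):
    """Count role tags over (name, [codes]) pairs; no codes counts as Untagged."""
    counts = Counter()
    for _name, codes in contributors:
        if not codes:
            counts["Untagged"] += 1
            continue
        for code in codes:
            key = code.strip().lower()
            # Codes outside the lookup table are kept visible rather than dropped.
            counts[ROLE_LABELS.get(key, f"Unrecognized ({key})")] += 1
    return counts


# Hypothetical contributors, not taken from the paper's data set.
sample = [
    ("A. Maintainer", ["aut", "cre"]),
    ("B. Helper", ["ctb"]),
    ("C. Advisor", ["THS "]),       # inconsistent case and stray whitespace
    ("D. Corpus Source", []),       # no role listed
    ("E. Typo", ["atu"]),           # misspelled code
]
role_counts = tally_roles(sample)
for label, count in role_counts.most_common():
    print(f"{label}: {count}")
```

Keeping unrecognized codes as their own bucket, rather than discarding them, is one way to surface the misspellings and self-invented roles described above.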

In general, our analysis suggests that the R ecosystem leans heavily on a small number of contributing individuals and organizations, but that there are many potential sources of confusion from a data quality perspective. The community would likely benefit from a concerted effort to solicit maintainer assistance and to improve validation for metadata through R CMD functionality. Such improvements in data quality, especially as they relate to copyright holders and licensing, may be critical for the further adoption of R in industry.

Based on our author data above, just 0.1% of maintainers are responsible for 3.4% of packages. However, not all packages are equally important in the ecosystem, as some packages may be relied on much more frequently as a dependency than others. [10] has already documented the existence of a right-skewed distribution for package in-degree in a similar but smaller, non-longitudinal sample of CRAN. The interested reader is referred to their research for more detail on the structure of R package networks, such as macro- and mesoscopic community analysis.

In Table 15, we begin by examining the in-degree distribution for packages in our sample. Conceptually, packages that have higher in-degree are typically more “important” or popular in the ecosystem, just as food web analyses can reveal important organisms in real ecological systems. As our sample is longitudinal and spans multiple package sources, some of which do not automatically connect, it is important to note that our results do not characterize the current CRAN network alone. Further, just as in ecological systems, some organisms may have critical roles by virtue of network structure and their position, which is not captured by in-degree alone. The interested reader is again directed to [10] [11] for more on such structural analysis. We note, however, that neither prior literature nor our research reflects the use of libraries in ad-hoc statistical analysis or “data science” scripts, where users rarely formally save or package the sequence of executed statements.

Within our sample, the mean and median package in-degrees are 1.93 and 0.0, confirming that the distribution is right-skewed. In fact, even the 75th percentile is 0.0, i.e., at least three out of four packages have never been imported as a dependency. This finding is not surprising when one considers that many R packages are specialized statistical methods or point solutions with narrow application or domain; therefore, while many researchers may use these packages to perform ad-hoc statistical analyses, there is little reason to “build on top of” these packages.

Statistic             In-Dependencies per Package
Mean                      1.93
Standard Deviation       26.01
Minimum                   0.00
25th Percentile           0.00
50th Percentile           0.00
75th Percentile           0.00
Maximum               1,845.00

Table 15: Descriptive statistics for distribution of in-dependencies per package across repositories

Next, we examine the out-degree distribution for packages in our sample in Table 16. The mean and median out-degree per package are 1.93 and 1.0, less skewed than in-degree. Further, while the 75th percentile out-degree, 3.0, is higher than the corresponding in-degree quartile, the maximum out-degree (59) is more than an order of magnitude smaller than the maximum in-degree (1,845). Taken together, these in- and out-degree distributions confirm the intuition that R contains a small number of very important packages and a large number of loosely-connected or singleton packages.

Statistic             Out-Dependencies per Package
Mean                   1.93
Standard Deviation     3.32
Minimum                0.00
25th Percentile        0.00
50th Percentile        1.00
75th Percentile        3.00
Maximum               59.00

Table 16: Descriptive statistics for distribution of out-dependencies per package across repositories
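Tables 15 and 16 report the same mean, 1.93, which follows from how degree is defined: every dependency edge adds one to the out-degree of the importing package and one to the in-degree of the imported package, so both totals equal the number of edges. The sketch below illustrates this on a toy network. The package names, the input format, and the function name are hypothetical, and we assume each package-to-dependency edge is counted once regardless of how many releases declare it.

```python
from statistics import mean, median


def degree_tables(packages, release_dependencies):
    """Return (in_degree, out_degree) dicts from (package, dependency) pairs.

    Pairs may repeat across releases; building a set keeps each
    package-to-dependency edge once, so releases do not multiply degree.
    """
    edges = {(pkg, dep) for pkg, dep in release_dependencies if pkg != dep}
    in_degree = {p: 0 for p in packages}
    out_degree = {p: 0 for p in packages}
    for pkg, dep in edges:
        out_degree[pkg] = out_degree.get(pkg, 0) + 1
        in_degree[dep] = in_degree.get(dep, 0) + 1
    return in_degree, out_degree


# Hypothetical toy ecosystem, not drawn from the paper's data.
packages = ["core", "plots", "tidy", "modelA", "modelB", "lonely"]
release_dependencies = [
    ("plots", "core"), ("plots", "core"),   # same edge seen in two releases
    ("tidy", "core"),
    ("modelA", "core"), ("modelA", "plots"),
    ("modelB", "core"), ("modelB", "tidy"),
]
in_degree, out_degree = degree_tables(packages, release_dependencies)

print("mean in/out:", mean(in_degree.values()), mean(out_degree.values()))
print("median in/out:", median(in_degree.values()), median(out_degree.values()))
print("max in/out:", max(in_degree.values()), max(out_degree.values()))
```

Even in this small example the means agree while the in-degree is concentrated on a single package, which is the same qualitative pattern as in the two tables.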

As shown in Table 17, packages at the extreme right tail such as dplyr have over 1,000 unique package dependents. The table lists the twenty most depended-on, i.e., highest in-degree, packages. There are a number of immediate observations. Many of the most popular packages are intended to simplify syntax, provide foundational data structures, or make pre-processing data easier. Rcpp is relied on by many CRAN packages that provide acceleration through C++ or vendor third-party C++ libraries. When in-degree is calculated based on release instead of package, the ranks are slightly different but the distribution is even more extreme; for example, dplyr and ggplot2 are each a dependency of 7,701 and 8,194 separate releases, respectively. Overall, the extreme dependence is notable compared to other languages; the top five packages are imported by nearly one in every four packages.

Strikingly, a number of the most depended-on packages are maintained by the same contributors. In order to understand just how much concentration there is, we examine how in-degree is distributed across maintainers in the ecosystem. Table 18 shows the top 20 maintainers by this measure, and the results confirm our intuitions again. Hadley, for example, is the listed maintainer responsible for packages that are imported by over 4,000 other packages. When one considers that a number of these contributors are also employed by RStudio, the concentration of support effort becomes even more extreme. In total, the top 10 maintainers are responsible for nearly 16,000 unique package dependencies; this means that more than one out of every two packages relies on these ten maintainers. Such extreme reliance on a small number of packages and persons has implications for ecosystem sustainability and potential supply chain attacks.

For readers interested in how our in- and out-degree distributions compare to other right-skewed distributions or the prior work mentioned above, we have calculated log-log plots of degree distributions in Figure 3. As noted above, our results are longitudinal and span multiple package sources; therefore, they are larger and conceptually different from static, single-source analyses of dependency networks. These calculations count unique dependencies per package; multiple releases do not weight or multiply edges or degree, and even if dependencies are removed in later releases, we still record such imports.

Name             Count
dplyr            1,845
ggplot2          1,798
MASS             1,178
Rcpp             1,026
magrittr           907
data.table         860
Matrix             780
jsonlite           665
httr               639
Biobase            562
foreach            550
plyr               480
GenomicRanges      463
IRanges            463
BiocGenerics       453
S4Vectors          452
igraph             448
doParallel         432
purrr              409
lattice            407

Table 17: Top 20 dependencies by number of packages across repositories

Maintainer                                Number of Packages Dependent
Hadley Wickham                            4,321
Thomas Lin Pedersen                       1,824
Brian Ripley                              1,633
Bioconductor Package Maintainer           1,390
Dirk Eddelbuettel                         1,344
Lionel Henry                              1,218
Martin Maechler                           1,195
Dirk Eddelbuettel and Romain Francois     1,028
Douglas Bates                             1,015
Jeroen Ooms                                 928
Stefan Milton Bache                         907
Doug and Martin                             901
Gábor Csárdi                                887
M Dowle and T Short                         860
Developers                                  860
Gabor Csardi                                720
Biocore Team c/o BioC user list             687
Hong Ooi                                    644
Jim Hester                                  642
Rich Calaway                                633

Table 18: Top 20 maintainers by number of packages dependent on their maintained packages

Figure 3: Number of Dependents by Package (log-log plot; axis label: Number of Packages)
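The "nearly 16,000" figure for the top ten maintainers can be reproduced by summing the first ten rows of Table 18. The sketch below does that, and also shows one plausible way a per-maintainer count could be derived from a dependency edge list. The aggregation rule (a dependent package counts once per maintainer, however many of that maintainer's packages it imports), the function name, and the toy data are assumptions for illustration, not a description of the exact calculation behind Table 18.

```python
from collections import defaultdict


def dependents_per_maintainer(maintainer_of, edges):
    """Count unique dependent packages for each maintainer.

    maintainer_of maps package -> maintainer; edges are (package, dependency)
    pairs. A package that imports two packages by the same maintainer is
    counted once for that maintainer.
    """
    dependents = defaultdict(set)
    for pkg, dep in edges:
        dependents[maintainer_of[dep]].add(pkg)
    return {m: len(pkgs) for m, pkgs in dependents.items()}


# Hypothetical toy data, not drawn from the paper's data set.
maintainer_of = {"core": "alice", "plots": "alice", "tidy": "bob", "modelA": "carol"}
edges = [("plots", "core"), ("modelA", "core"), ("modelA", "plots"), ("modelA", "tidy")]
toy_counts = dependents_per_maintainer(maintainer_of, edges)
print(toy_counts)

# "Number of Packages Dependent" for the first ten rows of Table 18.
top10_dependents = [4321, 1824, 1633, 1390, 1344, 1218, 1195, 1028, 1015, 928]
top10_total = sum(top10_dependents)
print(f"Top 10 maintainers: {top10_total:,} package dependencies")
```

Because the same package can depend on several of these maintainers, the summed figure counts maintainer-dependent pairs and may include a given package more than once.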

Licensing is a key factor for the success of open-source software ecosystems. Depending on the composition and usage of software within the ecosystem, there is risk of both free-riding if licenses are too permissive and under-investment if licenses are too restrictive. The R Project itself, including the R interpreter and core libraries, is currently licensed under the GPL Version 2. While a complete discussion of licensing factors and analysis of R licensing is outside the scope of this paper, we present a summary distribution of license metadata in Table 19. Unfortunately, the licensing metadata situation for R is complicated by a mixture of license restrictions and poor metadata management. For instance, there are nearly two thousand releases that have no listed license in their metadata. Many other metadata fields specify a license identifier, but also indicate that a separate LICENSE file, which may contradict the metadata field, is also present.

Excluding omitted license metadata, the GPL family licenses such as GPL Version 2 and 3 outnumber all other licenses like MIT, BSD, or Apache family licenses. Compared with other open-source ecosystems, this is a striking finding; from a proportion perspective, GPL is an order of magnitude more common in R than in Python, for example. This preference may be due to the original R core team’s preference for GPL licensing. In future research, we will present more detailed analysis of licensing and compliance trends within R packages and source files, documenting trends such as re-licensing, “mixed” or multi-licensing, data sub-“licensing” or restrictions, and contradictory license indications.
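As a rough illustration of the family-level comparison, the sketch below groups the license strings reported in Table 19 into coarse families and sums their shares of releases. The grouping rules, family names, and function name are our own simplifying assumptions (for example, LGPL and AGPL are placed in the GPL family, and strings that name no recognizable license fall into "other"); the percentages are the rounded values from the table, so totals are approximate and cover only the listed licenses.

```python
import re

# Percent of releases per license string, as listed in Table 19.
# In R license specifications, "|" separates alternative licenses.
TABLE_19_PERCENT = {
    "No License in Metadata": 5.94,
    "GPL 2": 33.37,
    "GPL 3": 20.39,
    "MIT + file LICENSE": 11.44,
    "GPL": 2.88,
    "Artistic-2.0": 2.08,
    "GPL 3 | file LICENSE": 0.96,
    "CC0": 0.91,
    "LGPL-3": 0.74,
    "file LICENSE": 0.63,
    "MIT": 0.56,
    "BSD 3 + file LICENSE": 0.55,
    "GPL 2 | file LICENSE": 0.53,
    "Apache License 2.0": 0.53,
    "BSD 2 + file LICENSE": 0.52,
    "GPL (>= 2.0)": 0.49,
    "AGPL-3": 0.49,
    "LGPL": 0.47,
    "What license is it under?": 0.36,
    "GPL 3 + file LICENSE": 0.26,
    "Unlimited": 0.25,
}


def license_family(label):
    """Map a license string to a coarse family (assumed grouping)."""
    if label.startswith("No License"):
        return "none"
    if re.search(r"\b[AL]?GPL\b", label):
        return "GPL family"
    if re.search(r"\b(MIT|BSD|Apache)\b", label):
        return "permissive"
    return "other"


family_share = {}
for label, percent in TABLE_19_PERCENT.items():
    family = license_family(label)
    family_share[family] = round(family_share.get(family, 0.0) + percent, 2)

for family, share in sorted(family_share.items(), key=lambda kv: -kv[1]):
    print(f"{family}: {share}%")
```

Under these assumptions, the GPL family accounts for roughly three in five releases among the listed licenses, several times the combined share of the MIT, BSD, and Apache entries.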

License                       Count   Percent
No License in Metadata        1,707    5.94%
GPL 2                         9,583   33.37%
GPL 3                         5,856   20.39%
MIT + file LICENSE            3,286   11.44%
GPL                             826    2.88%
Artistic-2.0                    597    2.08%
GPL 3 | file LICENSE            275    0.96%
CC0                             260    0.91%
LGPL-3                          213    0.74%
file LICENSE                    182    0.63%
MIT                             161    0.56%
BSD 3 + file LICENSE            158    0.55%
GPL 2 | file LICENSE            152    0.53%
Apache License 2.0              152    0.53%
BSD 2 + file LICENSE            150    0.52%
GPL (>= 2.0)                    141    0.49%
AGPL-3                          140    0.49%
LGPL                            134    0.47%
What license is it under?       102    0.36%
GPL 3 + file LICENSE             74    0.26%
Unlimited                        73    0.25%

Table 19: Top 20 Licenses (metadata) by Number of Releases

Unlike most extant research, our data set and analysis also include full file-level records and file-level detail data. For example, we present in Table 20 a summary of the average number of lines of code per file and per release for R, Fortran, C, and C++ files across all sources. Many packages in the R ecosystem contain not just R source files, but also Fortran, C, or C++ sources, either for acceleration or to vendor from third party packages. While Fortran, C, and C++ files are, on average, longer than R source files, an unsurprising finding given the languages, there are over twice as many lines of R code across all sources as there are Fortran, C, and C++ lines together.

Statistic                    R          C        C++    Fortran
Average Per File        157.31     527.21     308.30     528.44
Average Per Release   4,208.14   1,118.92     614.00     223.84

Table 20: Summary of LOC across sources by source ﬁle type
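The two observations drawn from Table 20 can be checked with simple ratios. The sketch below assumes that each per-release average is taken over the same set of releases, so that the ratio of per-release averages approximates the ratio of total lines; that assumption is ours and is not stated in the table. Variable names are illustrative.

```python
# Values from Table 20 (lines of code).
avg_per_file = {"R": 157.31, "C": 527.21, "C++": 308.30, "Fortran": 528.44}
avg_per_release = {"R": 4208.14, "C": 1118.92, "C++": 614.00, "Fortran": 223.84}

compiled = ("C", "C++", "Fortran")

# How much longer the average compiled-language file is than the average R file.
file_length_ratio = {
    lang: round(avg_per_file[lang] / avg_per_file["R"], 2) for lang in compiled
}

# R lines per release relative to all compiled-language lines per release.
compiled_per_release = sum(avg_per_release[lang] for lang in compiled)
r_to_compiled = round(avg_per_release["R"] / compiled_per_release, 2)

print("file length relative to R:", file_length_ratio)
print("R lines per compiled-language line:", r_to_compiled)
```

Under this assumption, the average compiled-language file is roughly two to three times as long as the average R file, while R accounts for a little over twice as many lines per release as the three compiled languages combined.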

Table 21 presents trends in R, Fortran, C, and C++ source KLOC over time, detailing how many lines of code by language are available in the latest release for all packages as of each year. In the first decade of R packages, Fortran played an important role; for example, in 2005, there were 5 LOC of R for every one of Fortran, and in 2010, there were 7 LOC of R for every one of Fortran. By 2015, there were more than 10 LOC of R for every one of Fortran, and by 2020, there are nearly 30 LOC of R for every one of Fortran. C has similarly fallen from near-parity with R to approximately 1:7 in 2020. Prior to the release of Rcpp in late 2008, there were very few lines of C++; however, after the introduction of Rcpp, C++ has grown to nearly the same proportion as C in recent years. When understanding and auditing R sources, it is clearly critical that the licenses and security of these C, C++, and Fortran sources are considered in addition to R alone.

Packages contain a wide variety of other information, including mandatory and optional fields in the required DESCRIPTION metadata file distributed with every package. Many of these metadata fields actually drive behavior in the R interpreter. As an example of deeper analysis that can be performed with our platform and data, we examine two additional types of metadata fields: those related to “lazy” data loading and those related to compilation of non-R source, e.g., Fortran, C, or C++.

Year R C C++ Fortran

Table 21: KLOC by source language for most recent release as of each year

Table 22 shows a breakdown of how many packages and releases tag Lazy Data, or as it also shows up in the metadata, Lazy Loading. Lazy Data is a reference to how a package handles the loading of data. Normally, when R loads a package, it will read all objects into memory; when Lazy Loading is enabled, however, the interpreter will read only the data that the process currently needs instead of all package objects. This is a very useful flag for packages that may distribute large models or data files, and may be a good proxy for growth in recent data science applications within R. The fact that over 60% of packages have LazyData fields set to true implies that a large proportion of packages vendor or contemplate vendoring data, but also that there may be unexpected performance issues at execution time that result from such unexamined configuration.

Statistic        LazyData   NeedsCompilation
% of Releases       58.27              20.52
% of Packages       63.20              17.95

Table 22: Breakdown of Packages and Releases with LazyData and NeedsCompilation tags
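For readers unfamiliar with how these tags appear in practice, the sketch below reads the two fields from DESCRIPTION-style text. It is a simplified reader written for illustration, not the parser used in our platform: it assumes the common "Field: value" layout with indented continuation lines, treats "yes" and "true" in any letter case as set, and treats a missing field as not set. The sample package and function names are hypothetical.

```python
def parse_description(text):
    """Parse DESCRIPTION-style 'Field: value' text into a dict.

    Lines that start with whitespace continue the previous field.
    """
    fields, current = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and current is not None:
            fields[current] += " " + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            current = name.strip()
            fields[current] = value.strip()
    return fields


def is_set(fields, name):
    """Treat yes/true (any case) as set; a missing field counts as not set."""
    return fields.get(name, "").strip().lower() in {"yes", "true"}


# Hypothetical DESCRIPTION content, not taken from a real package.
SAMPLE = """\
Package: examplepkg
Version: 0.1.0
Title: A Hypothetical Package
Description: Used only to illustrate field parsing;
    the continuation line is folded into this field.
LazyData: true
NeedsCompilation: no
"""

fields = parse_description(SAMPLE)
print("LazyData:", is_set(fields, "LazyData"))
print("NeedsCompilation:", is_set(fields, "NeedsCompilation"))
```

Counting the packages or releases for which each check is true, and dividing by the respective totals, yields percentages of the kind shown in Table 22.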

To investigate this, we can look at the difference between the datasets included in these releases. Table 23 shows the breakdown for the average dataset included in an R release as compared to the average dataset included in a Lazy Data R release. We can see that releases that are flagged as using lazy loading do have, on average, larger included datasets. However, the difference in their averages is only about 5 KB (269,792.16 versus 264,669.80 bytes). Further, both show a large right-skew, meaning that there are many small datasets whose average is pulled up by a few especially large datasets, which makes sense as R is a language that is often used to deal with big data. The lazy loading datasets actually exhibit a more exaggerated right-skew, as they have a higher mean and a lower median.

Statistic                 All packages   Lazy Loading packages
Count                        57,684.00               34,935.00
Mean                        264,669.80              269,792.16
Standard Deviation        1,644,041.99            1,801,426.24
Minimum                          75.00                   75.00
25th Percentile               1,390.00                1,308.00
50th Percentile               9,289.50                8,130.00
75th Percentile              96,034.50               91,350.00
Maximum                 102,473,361.00          102,473,361.00

Table 23: Descriptive Statistics of the number of bytes for the average R dataset, conditioned

The Needs Compilation tag, on the other hand, is very straightforward. It tells whether or not the release in question needs to be compiled before it can be used. Table 22 shows that more than 20% of releases and roughly 18% of packages have the Needs Compilation tag flagged as true. Since, on average, more releases need compilation than packages, this seems to indicate that some subset of packages have many releases that require compilation. Likely, there are developers who provide source code releases, perhaps even as a regular “developmental” or “trunk” version.
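The gap between the release-level and package-level Needs Compilation shares can be illustrated with a toy calculation. In the sketch below, each package carries a single flag and a release count; this is a simplification, since in practice the flag is recorded per release and may change over a package's history. The numbers and names are hypothetical.

```python
def flag_shares(packages):
    """Return (percent of packages, percent of releases) with the flag set.

    packages is an iterable of (needs_compilation, release_count) pairs.
    """
    packages = list(packages)
    n_packages = len(packages)
    n_releases = sum(releases for _flag, releases in packages)
    flagged_packages = sum(1 for flag, _releases in packages if flag)
    flagged_releases = sum(releases for flag, releases in packages if flag)
    return (100 * flagged_packages / n_packages,
            100 * flagged_releases / n_releases)


# Hypothetical ecosystem: one compiled package that releases often,
# and four pure-R packages that release rarely.
toy = [(True, 10), (False, 2), (False, 3), (False, 4), (False, 1)]
package_share, release_share = flag_shares(toy)
print(f"{package_share:.1f}% of packages, {release_share:.1f}% of releases")
```

When flagged packages publish more releases than average, the release-level share exceeds the package-level share, which is the direction of the difference observed in Table 22.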

4. Conclusion and Future Work

To our knowledge, the results presented above provide the most comprehensive, longitudinal record of the R package ecosystem available. We aggregate packages across not just two decades of CRAN, but also include nearly a decade of GitHub and Bioconductor. Further, our methods and data include not just package-level metadata, but also normalized source control metadata and file-level information. Within this sample, we find that the R package ecosystem has grown by orders of magnitude, but that recent history suggests a potential change in this trend. We also find an extreme concentration of responsibility on a very small number of maintainers and organizations. While we exclude policy discussion and normative questions from the scope of this publication, further discussion and analysis by the community is likely warranted.

We intend to present future work on two tracks. First, we have hundreds of millions of records derived from parsing R, Fortran, C, and C++ sources with antlr, and we intend to provide further metrics related to conventions, quality, and security in the R ecosystem. Second, we will present in separate research a comparative analysis of R’s licensing and license network relative to Python, focusing specifically on potential risks to the community and broader adoption in academia and industry.

5. Acknowledgements

This work was supported by Licensio, LLC, of which both authors are members. The authors would like to acknowledge the late Professor Rick Riolo, whose spirit of inquiry into complex systems everywhere lives on in his many students around the world.

References

[1] TIOBE, Tiobe index july: Programming language r climbs up the rankings, https://developer-tech.com/news/2020/jul/07/tiobe-index-july-programming-language-r-rankings/, 2020. Online; accessed September 11 2020.
[2] A. Decan, T. Mens, M. Claes, P. Grosjean, On the development and distribution of r packages: An empirical analysis of the r ecosystem, in: Proceedings of the 2015 european conference on software architecture workshops, ACM, p. 41.
[3] A. Decan, T. Mens, M. Claes, P. Grosjean, When github meets cran: An analysis of inter-repository package dependency problems, in: 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), volume 1, IEEE, pp. 493–504.
[4] A. Decan, T. Mens, M. Claes, On the topology of package dependency networks: A comparison of three programming language ecosystems, in: Proceedings of the 10th European Conference on Software Architecture Workshops, ACM, p. 21.
[5] A. Decan, T. Mens, M. Claes, An empirical comparison of dependency issues in oss packaging ecosystems, in: 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 2–12.
[6] A. Decan, T. Mens, P. Grosjean, An empirical comparison of dependency network evolution in seven software packaging ecosystems, Empirical Software Engineering 24 (2019) 381–416.
[7] E. Bommarito, M. Bommarito, An empirical analysis of the python package index (pypi), arXiv preprint arXiv:1907.11073 (2019).
[8] K. Hornik, F. Leisch, Vienna and R: Love, marriage and the future, Citeseer, 2002.
[9] R. Foundation, the status of cran mirrors, https://cran.r-project.org/mirmon_report.html, 2020. Online; accessed September 11 2020.
[10] M. Mora-Cantallops, S. Sánchez-Alonso, E. García-Barriocanal, A complex network analysis of the comprehensive r archive network (cran) package ecosystem, Journal of Systems and Software 170 (2020) 110744.
[11] M. Mora-Cantallops, M.-Á. Sicilia, E. García-Barriocanal, S. Sánchez-Alonso, Evolution and prospects of the comprehensive r archive network (cran) package ecosystem, Journal of Software: Evolution and Process 32 (2020) e2270.
[12] R. C. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, et al., Bioconductor: open software development for computational biology and bioinformatics, Genome biology 5 (2004) R80.
[13] Bioconductor, Bioconductor website, 2020. Online; accessed December 15 2020.
[14] O. Project, Omega project for statistical computing, 2020. Online; accessed December 2020.
[15] R. Nagarajan, M. Scutari, packdep: network abstractions of cran and bioconductor, in: The R User Conference, useR! 2013 July 10-12 2013 University of Castilla-La Mancha, Albacete, Spain, volume 10:30, p. 100.
[16] GitHub, Github, https://github.com, 2020. Online; accessed December 2020.
[17] L. Licensio, Licensio, https://licens.io/, 2020. Online; accessed December 2020.
[18] Library of Congress, Marc code list for relators, 2021. Online; accessed January 2021.