Attributing and Referencing (Research) Software: Best Practices and Outlook from Inria
Pierre Alliez, Roberto Di Cosmo, Benjamin Guedj, Alain Girault, Mohand-Said Hacid, Arnaud Legrand, Nicolas P. Rougier
Pierre Alliez, Université Côte d'Azur, Inria, France, [email protected]
Roberto Di Cosmo, Inria, Software Heritage, University of Paris, France, [email protected]
Benjamin Guedj, Inria, France and University College London, United Kingdom, [email protected]
Alain Girault, Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, 38000 Grenoble, France, [email protected]
Mohand-Saïd Hacid, Univ. Lyon, University Claude Bernard Lyon 1, LIRIS, Lyon, France, [email protected]
Arnaud Legrand, Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France, [email protected]
Nicolas Rougier, Univ. Bordeaux, Inria, CNRS, IMN, LaBRI, Bordeaux, France, [email protected]
Abstract — Software is a fundamental pillar of modern scientific research, across all fields and disciplines. However, there is a lack of adequate means to cite and reference software, due to the complexity of the problem in terms of authorship, roles and credits. This complexity is further increased when it is considered over the lifetime of a software that can span up to several decades. Building upon the internal experience of Inria, the French research institute for digital sciences, we provide in this paper a contribution to the ongoing efforts to develop proper guidelines and recommendations for software citation and reference. Namely, we recommend: (1) a richer taxonomy for software contributions, with a qualitative scale; (2) to put humans at the heart of the evaluation; and (3) to distinguish citation from reference.
Keywords — Software citation; software reference; authorship; development process.
I. INTRODUCTION
Software is a fundamental pillar of modern scientific research, across all fields and disciplines, and the actual knowledge embedded in software is contained in its source code, which is, as written in the GPL license, "the preferred form [of a program] for making modifications to it [as a developer]" and "provides a view into the mind of the designer" [18]. With the rise of Free/Open Source Software, which requires and fosters source code accessibility, access has been provided to an enormous amount of software source code that can be massively reused. Similar principles are now permeating the Open Science movement, in particular after the attention drawn to it by the crisis in scientific reproducibility [20], [12]. All this has recently motivated the need to properly reference and credit software in scholarly works [13], [19], [9].

In this context, we provide a contribution to the ongoing efforts to develop proper guidelines and recommendations, building upon the internal experience of Inria, the French research institute for digital sciences. Born in 1967, more than 50 years ago, Inria has grown to directly employ 2,400 people, and its 190 project-teams involve more than 3,000 scientists working towards meeting the challenges of computer science and applied mathematics, often at the interface with other disciplines. Software lies at the very heart of the Institute's activity, and it is present in all its diversity, ranging from very long term large systems (e.g., the award-winning Coq proof assistant, the CompCert certified compiler, and the CGAL Computational Geometry Algorithms Library, to name only a few of the most well-known ones), to medium-sized projects and small but sophisticated codes implementing advanced algorithms.

Inria has always considered software as a first-class and noble product of research, as an instrument for research itself, and as an important contribution in the career of researchers.
As such, whenever a team is evaluated or a researcher applies for a position or a promotion, a concise and precise self-assessment notice must be provided for each software developed in the team or by the applicant, so that it can be assessed in a systematic and relevant way.

With the emerging awareness of the importance of making research openly accessible and reproducible, Inria has stepped up its engagement for software: (i) it has been working for years on reproducible research, and is running a MOOC on this subject; (ii) it has been at the origin of the Software Heritage initiative, which is building a universal archive of source code [1]; and (iii) it has experimented with a novel process for research software deposit and citation by connecting the French national open access publication portal, HAL (hal.archives-ouvertes.fr), to Software Heritage [1].

Nevertheless, citing and referencing software is a very complex task, for several reasons. First, software authorship is extremely varied, involving many roles: software architect, coder, debugger, tester, team manager, and so on. Second, software itself is a complex object: its lifespan can range from a few months to decades, its size from a few dozen lines of code to several millions, and it can be stand-alone or rely on multiple external packages. And finally, sometimes one may want to reference a particular version of a given software (this is crucial for reproducible research), while at other times one may want to cite the software as a whole.

In this article, we report on the practices, processes, and vision, both in place and under consideration at Inria, to address the challenges of referencing and accessing software source code, and properly crediting the people involved in its development, maintenance, and dissemination.

The article is structured as follows: Section II briefly surveys previous work.
Section III describes the inherent complexity of software, which is the main reason why the topic studied in this paper is challenging. Section IV presents the key internal processes that Inria has established over the last decades to track the hundreds of software projects to which the institute contributes, and the criteria and taxonomies they use. Section V draws the main lessons that have been learned from this long-term experience; in particular, we state three recommendations to contribute to a better handling of research software worldwide. Finally, Section VI concludes by providing a set of recommendations for the future.

II. SURVEY OF PREVIOUS WORK
The astrophysics community is one of the oldest to have attempted to systematically describe the software developed and used in its research work. The Astrophysics Source Code Library was started in 1999. Over the years it has established a curation process that enables the production of quality metadata for research software. These metadata can be used for citation purposes, and they are widely used in the astrophysics field [2]. Around 2010, interest in software arose in a variety of domains: a few computer science conferences started an artefact evaluation process, and community efforts later produced the software citation principles [19]. In a nutshell, this document recognizes the importance of software, credit and attribution, persistence and accessibility, and provides several recommendations based on use cases that illustrate the different situations where one wants to cite a piece of software.

We do acknowledge these valuable efforts, which have contributed to raising awareness about the importance of research software in the scholarly world. Nonetheless, we consider that a lot more work is needed before we can consider this problem settled: the actual recommendations that can be found on how to make software citable and referenceable, and how to give credit to its authors, fall quite short of what is needed for an object as complex as software. For example, in most of the guidelines we have seen, making software referenceable for reproducibility (where the precise version of the software needs to be indicated), or citable for credit (to authors or institutions), seems to boil down to simply finding a way to attach a DOI to it, typically by depositing a copy of the source code in repositories like Zenodo or Figshare. This simple approach, inspired by common practices for research data, is not appropriate for software.

When our goal is giving credit to authors, attaching an identifier to metadata is the easy part, and any system of digital identifiers, be it DOI, Ark or Handles, will do.
The difficulty lies in getting quality metadata, and in particular in determining who should get credit, for what kind of contribution, and who has authority to make these decisions. The heated debate spawned by recent experiments that tried to automatically compute the list of authors out of commit logs in version control systems [6] clearly shows how challenging this can be.

As we will see in Section IV-D, when looking for reproducibility, it is necessary to precisely identify not only the main software but also its whole environment, and to make it available in an open and perennial way. In this context, we need verifiable build methods and intrinsic identifiers that do not depend on resolvers that can be abused or compromised (see Wiley using fake DOIs to trap web crawlers... and researchers as well), and DOIs are not designed for this use case [9].

To make progress in our effort to make research software better recognized, a first step is to acknowledge its complexity, and to take it fully into account in our recommendations.

III. COMPLEXITY OF THE SOFTWARE LANDSCAPE
Software is very different from articles and data, with which we have much greater familiarity and experience, as they have been produced and used in the scholarly arena long before the first computer program was ever written. In this section, we highlight some of the main characteristics that make the task of assessing, referencing, attributing and citing software a problem far more complex than it may appear at first sight.

Software development is a multifaceted and continuously evolving activity, involving a broad spectrum of goals, actors, roles, organizations, practices and time extents. Without pretending to be exhaustive, we detail below the most important aspects that need to be taken into account for assessing, referencing, attributing or citing software.
Structure
A software project can be organized either as a monolithic program (e.g., the Netlib BLAS libraries) or as a composite assembly of modules (e.g., the Clang compiler). It can either be self-contained or have many external dependencies. For example, the Eigen C++ template library for linear algebra (http://eigen.tuxfamily.org) aims for minimal dependencies while listing an ecosystem of unsupported modules.
Lifetime
A software can be produced during a single, short extent of time (referred to as a one-shot contribution), or over a long time span, possibly fragmented into several time intervals of activity. Some long-running software projects extend over several decades. For example, the CGAL project started in 1996 as a European consortium, became open source in 2004, and has provided more than 30 releases since then.
Community
A software can be the product of a single scholar, a well-identified team, or a scattered team of scholars spanning a large scientific community that may be difficult to track precisely. The CGAL open source project lists more than 130 contributors, distinguishing between former and current developers, and acknowledging the reviewers and initial consortium members. In contrast, the MeshLab 3D mesh processing software (https://en.wikipedia.org/wiki/MeshLab) is authored by a single team from the CNR, Pisa.
Authorship
The software developers writing the code are the most visible authors of a software program, but they are not, by far, the only ones. A variety of activities are involved in the creation of software, ranging from stating the high-level specifications, to testing and bug fixing, through designing the software architecture, making technical choices, running use cases, implementing a demonstrator, drafting the documentation, deploying onto several platforms, and building a community of users. In these contexts the roles of a single contributor can be plural, with contributions spanning variable time extents. Authorship is even more complicated when developers resort to pseudonymity, i.e., a disguised identity used to avoid disclosing their legal identities. For all these reasons, evaluating the real contributions to a significant piece of software is a very difficult problem: in our experience at Inria, automated tools may help in this task, but are by far insufficient, and it is essential to have humans in the loop.
Authority
Beyond good practices, most quality or certified software development projects define management processes and authority rules. Authorities are entitled to make decisions, give orders, control processes, enforce rules, and report. They can be institutions, organizations, communities, or sometimes a single person (e.g., Guido van Rossum for Python). Some projects set up an editorial board, similar in spirit to scientific journals, with reviewers, managers and well-defined procedures (see CGAL's Open Source Project Rules and Procedures). Each new contribution must be submitted for review and approval before being integrated. Some decisions can be taken top-down while others are bottom-up. In some cases, a shared governance is implemented. This organization can be compared to the Linux kernel development organization, where Linus Torvalds integrates contributions but delegates the responsibility of software quality evaluation to a few trusted colleagues. Another important aspect is the traceability of who did what during the software project. In its simplest form, the number of lines of code or the commit logs are used for tracing contributions and changes, but more advanced means such as repository mining-based metrics [17], bug-related metrics, or peer evaluation are common.
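As a toy illustration of why raw commit counts are only a crude traceability signal, the following sketch tallies commits per author from a simplified, hypothetical log format (the author names and messages are made up):

```python
from collections import Counter

def commits_per_author(log_lines):
    """Count commits per author in a simplified 'author | message' log.

    Raw counts ignore the size, difficulty and nature of each change,
    which is why such metrics are insufficient on their own.
    """
    counts = Counter()
    for line in log_lines:
        author, _, _message = line.partition("|")
        counts[author.strip()] += 1
    return counts

# Hypothetical log excerpt: bob outnumbers alice on commits,
# even though alice's single commit is the substantial one.
log = [
    "alice | implement core mapping algorithm",
    "bob   | fix typo in README",
    "bob   | fix typo in tutorial",
    "bob   | bump version number",
]
print(commits_per_author(log))
```

The same caveat applies to line counts: the metric is easy to compute, but a human still has to judge what each change was worth.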
Levels of description
Another dimension that adds to the complexity is the variety of levels at which a software project can be described, either for citation or for reference.
Exact status of the source code.
For the purpose of exact reproducibility, one must be able to reference any precise point in the development history of a software project, even if it is not labeled as a release; in this case, cryptographic identifiers like those used in distributed version control systems, and now generalized in Software Heritage [9], are necessary. For instance, the sentence "you can find at swh:1:cnt:cdf19c4487c43c76f3612557d4dc61f9131790a4;lines=146-187 of swh:1:snp:c9c31ee9a3c631472cc8817886aaa0d3784a3782;origin=https://github.com/rdicosmo/parmap/ the exact core mapping algorithm used in this article" makes two distinct references. The former points to the lines of a source file, while the latter points to the software context in which this file is used.

(Major) release.
When a much coarser granularity is sufficient, one can designate a particular (major) release of the project. For instance: "This functionality is available in OCaml version 4" or "from CGAL version 3".

Project.
Sometimes one needs to cite a software project at the highest level; a typical example is a researcher, a team or an institution reporting the software projects it develops or contributes to. In this case, one must list only the project as a whole, and not all its different versions. For instance: "Inria has created OCaml and Scikit-Learn".

IV. FOUR PROCESSES FOR FOUR DIFFERENT NEEDS
There are four main reasons why the research software produced at Inria is carefully referenced and evaluated: (i) managing the career of individual researchers and research engineers, (ii) assessing technology transfer, (iii) enhancing the visibility and impact of a research team, and (iv) promoting reproducible research practices. We detail next these four topics, and the information collected to cater to these different needs.
A. Career management
Software development is a research output taken into account in the evolution of the career of individual researchers and research engineers. Measuring the impact of a software provides a means to measure the scope and magnitude of contributions of research results, when they are carefully translated into usable software. Evaluating the maturity and breadth of software is also essential to guide further developments and resource allocation.

Inria has an internal evaluation body, the Evaluation Committee (EC), whose role includes evaluating individual researchers when they apply for various positions (typically ranging from junior researcher to leading roles such as senior researcher or research director), and organizing the evaluations of whole research teams, which take place every 4 years. In both cases, evaluating a given software revolves around three items: (i) the software itself, which can be downloaded and tested; (ii) precise self-assessment criteria filled in by the developers themselves; and (iii) a factual and high-level description of the software, including the programming language(s) used, the number of lines of code, the number of man-months of development effort, and the web site from where the software and any other relevant material (a user manual, demos, research papers, ...) can be downloaded.

Among these three items, the self-assessment criteria play a crucial role because they provide key information on the software, how it was developed, and what role each developer played. Version 1 of these "Criteria for Software Self-Assessment" dates from August 2011 [15]. They are also used by the Institute for Information Sciences and Technologies (INS2I) of the French National Centre for Scientific Research (CNRS). They comprise two lists of criteria using a qualitative scale. The first list characterizes the software itself:
Audience. Ranging from A1 (personal prototype) to A5 (usable by a wide public).

Software Originality. Ranging from SO1 (none) to SO4 (original software implementing a fair number of original ideas).

Software Maturity. Ranging from SM1 (demos work, rest not guaranteed) to SM5 (high-assurance software, certified by an evaluation agency or formally verified).

Evolution and Maintenance. Ranging from EM1 (no future plans) to EM4 (well-defined and implemented plan for future maintenance and evolution, including an organized users group).

Software Distribution and Licensing. Ranging from SDL1 (none) to SDL5 (external packaging and distribution, either as part of e.g., a Linux distribution, or packaged within a commercially distributed product).

As an example, the OCaml compiler is assessed as: Audience A5, Software Originality SO3, Software Maturity SM4, Evolution and Maintenance EM4, Software Distribution and Licensing SDL5.

The second list characterizes the contribution of the developers and comprises the following criteria: Design and Architecture (DA), Coding and Debugging (CD), Maintenance and Support (MS), and Team/Project Management (TPM). Each contribution is rated on a scale ranging from "not involved" to "main contributor". As an example, the personal contribution of one of OCaml's main developers might be: Design and Architecture DA3, Coding and Debugging CD4, Maintenance and Support MS3, Team/Project Management TPM4.

Overall, these self-assessment criteria have been in use at Inria for several years now. The feedback from both jury members (for individual researchers) and international evaluators (for research teams) is that they are extremely useful, despite their coarse granularity and being self-declared. All praise the relevance of the criteria and the fact that they provide a means to assess the scope and magnitude of contributions to a given software much more accurately.
B. Technology transfer
Information about authorship and intellectual property is a key asset when technology transfer takes place, either in industrial contracts or for the creation of start-ups. Besides, technology transfer is at the heart of Inria's strategy to increase its societal and economic impact. However, in the particular case of software, technology transfer raises a number of difficulties. Most of the time, transferring a software to industry starts by sending a copy of the software to a French registration agency named Agence pour la Protection des Programmes (APP). When doing so, a dedicated form has to be filled in that requires specifying all the contributors to the software and, for each of them, the percentage of her/his contribution.

When the software is old (typically more than 10 years old), this involves carrying out some archaeology to retrieve the contributions of the first developers (some of whom may have left Inria, or may not have been Inria employees at all). A dedicated technology transfer team interacts with the researchers in this process, taking into account all the different contributions to software development. In particular, they use a taxonomy of roles that includes the following:
Coding.
This seems the most obvious part, but it is actually complex, as one cannot just count the number of lines of code written, or the number of accepted pull requests. Sometimes a long code fragment may be a straightforward re-implementation of a very well-known algorithm or data structure, involving no complexity or creativity at all, while at other times a few lines of code can embody a complex and revolutionary approach (e.g., massively speeding up the execution time). Often, a major contribution to a project is not adding code, but fixing code or removing portions of code by refactoring the project and increasing its modularity and genericity.
Testing and debugging.
This is an essential role when developing software that is meant to be used more than once. This activity may require setting up a large database of relevant use cases and devising a rigorous testing protocol (e.g., non-regression testing).
Algorithm design.
Inventing the underlying algorithm that forms the very basis of the software being transferred to industry is, of course, a key contribution.
Software architecture design.
This is another important activity that does not necessarily show up in the code itself, but which is essential for the maintenance, modularity, efficiency and evolution of the software. As Steve Jobs famously said while promoting Object-Oriented Programming and the NeXT computer more than twenty-five years ago, "The line of code that has no bug and that costs nothing to maintain, is the line of code that you never wrote".

Documentation.
This activity is essential to ease (re)usability and to support long-term maintenance and evolution. It ranges from internal technical documentation to drafting the user manual and tutorials.

The older and bigger the software, the more difficult this authorship identification task is.

Fig. 1. Example of the complexity of direct and indirect dependencies for a specific Python package (matplotlib). Boxes represent actual packages (libraries that need to be installed on the system), arrows indicate dependencies on other packages, and labels indicate the minimal/maximal version numbers. In blue, the Python dependencies; in red, the "true" system dependencies incurred by Python (e.g., the libc or libjpeg62); in green, some "fake" dependencies incurred by the package management system but which are very likely not used by Python (e.g., adduser or dpkg).

C. Visibility and impact of a research team
Software is a part of the scientific production that any research team exposes. Software that is distributed to a large scholarly audience or commercialized to industrial users may become an important source of inspiration for novel research challenges. Feedback from practitioners or academic users is a precious source of knowledge for identifying the research problems with high potential practical impact. Software can also be a key instrument for research, central to the daily research activity of a team, and a main support for teaching and education. It may also become a communication medium between young researchers, e.g., Ph.D. students sharing their research topics and experiments via a common set of software components.

Inria considers (research) software to be a valuable output of research, and has always encouraged its research teams to advertise the software projects they contribute to: this can be on the public web page of the team, or in its annual activity report. To simplify the collection of the information concerning the software projects, an internal database, called BIL ("Base d'Information des Logiciels", i.e., "database of information on software"), has been in use for several years. It allows research teams to deposit very detailed metadata describing the software projects they are involved in. The BIL can then be used to automatically generate the list of software descriptions for the team web page and for the activity report, and also to prefill part of the forms used in the two processes described above for individual career evaluation and for technology transfer, avoiding the burden of typing in the same information over and over again.
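The "enter once, reuse everywhere" workflow around such a database can be sketched as follows; the record fields, project name and rendering functions are hypothetical illustrations, not the actual BIL schema:

```python
# One metadata record per software project, deposited once by the team.
# Field names and values are illustrative; the real BIL schema is richer.
record = {
    "name": "ExampleSolver",
    "description": "A sparse linear solver developed by the team.",
    "license": "GPL-3.0",
    "url": "https://example.org/examplesolver",
}

def web_page_entry(rec):
    """Render the record for the team web page."""
    return f"{rec['name']}: {rec['description']} ({rec['url']})"

def activity_report_entry(rec):
    """Render the same record for the annual activity report."""
    return f"{rec['name']} [{rec['license']}] - {rec['description']}"

# Both outputs derive from the single deposit, so nothing is typed twice.
print(web_page_entry(record))
print(activity_report_entry(record))
```

The design point is that the metadata is the single source of truth, and each downstream document (web page, report, evaluation form) is a projection of it.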
D. Reproducible Research
Another important concern of Inria is supporting the reproducibility of research results, and the reproducibility crisis takes on a whole new dimension when software is involved. Scholars are struggling to find ways to aggregate in a coherent compendium the data, the software, and the explanations of their experiments. The focus is no longer on giving credit, but on finding, rebuilding, and running the exact software referenced in a research article. We identified at least three major issues.

First, the frequent lack of availability of the software source code, and/or of precise references to the right version of it, is a major issue [7]. Solving this issue requires stable and perennial source code archives and specialized identifiers [9]. Second, characterizing and reproducing the full software environment that is used in an experiment requires tracking a potentially huge graph of dependencies (a small example is shown in Figure 1); specific tools to identify and express such dependencies are needed. Finally, although the notion of a research compendium is appealing, it should aggregate objects of very different natures (article, data, software) for which specific archives and solutions may already exist. To ease the deposit of such objects, we believe the compendium should build on stable references to objects rather than try to address all problems at once.

In recent years, various building blocks have emerged to address these challenges, and they may lead to such a global approach and stable references to the software artifacts themselves. Inria has fostered and supported a few of them, which we briefly present here.
Software Heritage: a universal archive of source code.
Software Heritage (SWH) was started in 2015 to collect, preserve and share the source code of all software ever written, together with its full development history [1]. As of today, it has already collected almost 6 billion unique source code files coming from over 85 million software origins that are regularly harvested. The recently added "save code now" feature enables users to proactively request the addition of new software origins or their update. Source code and its development history are stored in a universal data model based on Merkle DAGs [9], providing persistent, intrinsic, unforgeable, and verifiable identifiers for the more than 10 billion objects it contains [9]. Each intrinsic identifier is computed from the content and metadata of the software itself, through cryptographic hashes, and is embedded into the software's persistent identifier. This universal archive of all software source code addresses the issue of preserving and referencing source code for reproducibility.

Reproducible builds.
In the early 2000s, the groundbreaking notion of a functional package manager was introduced by the Nix system [10], using cryptographic hashes to ensure that binaries are rebuilt and executed in the exact same software environment. Similar notions provide the foundation of the Guix toolchain, which has been developed over the last decade under the umbrella of the GNU project, with key contributions from Inria [8]. The essential property of these tools is that, given the same source files and the associated functional build recipes, one obtains as a result of the build process the very same binary files in the same environment. Very recently, Guix has been connected with SWH to ensure long-term reproducibility: when the source code (currently downloaded from the upstream distribution sites) disappears from the designated location, Guix transparently uses the SWH intrinsic identifiers to fetch the archived copy from the archive. Functional build recipes are themselves a form of source code, and they too can be archived and given intrinsic identifiers, which will provide proper references also for software environments.
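The core idea behind functional package managers can be illustrated with a deliberately simplified sketch (this is not Nix's or Guix's actual derivation scheme): the identifier of a build output is a hash over the sources, the build recipe, and the identifiers of all dependencies, so identical inputs always yield the same environment identifier.

```python
import hashlib

def build_id(sources, recipe, dep_ids):
    """Toy content-addressed build identifier.

    Any change to the sources, the recipe, or any (transitive)
    dependency changes the identifier, so two equal identifiers
    denote the very same software environment.
    """
    h = hashlib.sha256()
    h.update(sources)
    h.update(recipe)
    for dep in sorted(dep_ids):  # order-independent over dependencies
        h.update(dep.encode())
    return h.hexdigest()

# Hypothetical builds: a base library, then an application on top of it.
libc = build_id(b"libc sources", b"make install", [])
app_v1 = build_id(b"app sources", b"make", [libc])
app_v1_again = build_id(b"app sources", b"make", [libc])
app_v2 = build_id(b"app sources v2", b"make", [libc])

assert app_v1 == app_v1_again   # same inputs, same environment
assert app_v1 != app_v2         # changed sources, new identifier
```

Because dependency identifiers are themselves inputs to the hash, a change anywhere in the dependency graph of Figure 1 would propagate to the top-level identifier.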
Curation of software deposit in HAL for SWH.
Over the past two years, Inria has fostered a collaboration between SWH and HAL, the French national open access archive, with the goal of providing an efficient process for research software deposit. Figure 2 provides a high-level overview of this process: researchers submit software source code and metadata to the HAL portal; these submissions are placed in a moderation loop where humans interact with the researchers to improve the quality of the metadata and to avoid duplicates; once a submission is approved, it is sent to SWH via a generic deposit mechanism based on the SWORD standard archive exchange protocol; it is then ingested in the SWH archive; finally, the unique intrinsic identifier needed for reproducibility is returned to the HAL portal, which displays it alongside the identifier for the metadata. Detailed guidelines have been developed to help researchers and moderators produce a high-quality deposit of their source code.
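The intrinsic identifier returned at the end of this process is a Software Heritage identifier. For a single source file, such an identifier can be recomputed locally from the file's bytes alone; the sketch below uses the Git-compatible hashing that SWH adopts for content objects (shown as an illustration of the principle, not as a replacement for the official SWH tooling):

```python
import hashlib

def swhid_for_content(data: bytes) -> str:
    """Compute the SWHID of a content object (a single file).

    SWH hashes file contents the same way Git hashes blobs:
    sha1 over the header "blob <length>" plus a NUL byte,
    followed by the file bytes. Being derived from the bytes
    themselves, the identifier is intrinsic: anyone can
    recompute and verify it without a resolver.
    """
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"

# The identifier of an empty file matches Git's well-known empty-blob hash.
print(swhid_for_content(b""))
# swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

This verifiability is exactly the property that DOIs, which rely on a mutable resolver, cannot offer for reproducibility use cases.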
Fig. 2. Moderated software deposit in SWH via HAL.
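The SWORD leg of this pipeline exchanges Atom entries carrying the curated metadata. As a hedged sketch (the Atom vocabulary below is standard, but the CodeMeta namespace URI and the exact deposit profile used between HAL and SWH are assumptions for illustration), the metadata half of such a deposit could be assembled like this:

```python
import xml.etree.ElementTree as ET

# Namespaces: Atom is standard; the CodeMeta namespace URI here is an
# assumption for illustration, not the exact profile used by HAL/SWH.
ATOM = "http://www.w3.org/2005/Atom"
CODEMETA = "https://doi.org/10.5063/schema/codemeta-2.0"
ET.register_namespace("", ATOM)
ET.register_namespace("codemeta", CODEMETA)

def deposit_entry(title: str, authors: list, license_id: str) -> bytes:
    """Build a minimal SWORD/Atom deposit entry for a software artefact."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    for name in authors:
        author = ET.SubElement(entry, f"{{{ATOM}}}author")
        ET.SubElement(author, f"{{{ATOM}}}name").text = name
    ET.SubElement(entry, f"{{{CODEMETA}}}license").text = license_id
    return ET.tostring(entry, encoding="utf-8")

xml = deposit_entry("demo-solver", ["A. Researcher"], "GPL-3.0-or-later")
print(xml.decode())
```

The moderation loop of Figure 2 operates on exactly this kind of record: humans review and enrich the fields before the entry is pushed to SWH for ingestion.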
V. LESSONS LEARNED ON CREDITING SOFTWARE
The processes described above have been established inside Inria and refined over decades to answer the internal needs of the institution. While their goal has not been to guide external processes such as software citation, we strongly believe they provide a solid basis on which to build a universal framework for software citation and reference.
Here are a few important lessons we learned from all the above: (research) software projects present a great degree of variability along many axes; contributions to software can take many forms and shapes; and there are key contributions that must be recognised but do not show up in the code, nor in the logs of the version control systems. This has several main consequences:
• the need for a rich metadata schema to describe software projects;
• the need for a rich taxonomy of software contributions, which must not be flattened onto the single role of software developer;
• last but not least, while tools may help, a careful human process involving the research teams is crucial to produce the qualified information and metadata that is needed for proper credit and attribution in the scholarly world.
We focus here mostly on the two latter issues, as the question of metadata for software has already attracted significant attention, with the CodeMeta initiative providing a good vehicle for standardisation and for incorporating the new entities that may be needed [16].

A. Taxonomy of contributor roles: a proposal
The need to recognise different levels and forms of contribution is not new in academia: in computer science and mathematics we are quite used to separating, for example, the persons named as authors from those who are only mentioned in the acknowledgements. In the specific case of software projects, the Software Credit Ontology (https://dagstuhleas.github.io/SoftwareCreditRoles/doc/index-en.html) proposes a total of 23 roles, among which 13 are directly concerned with an actual contribution to a software project, under the contributor category: code reviewer; community contributor; designer; developer; documenter; idea contributor; infrastructure supporter; issue reporter; marketing and sales; model driven software engineering expert; packager; requirement elicitator; systems and network engineer. As we can see, this ontology is more focused on the business aspects of software projects (see for instance the marketing and sales role) than on the technology aspects (for instance, the developer role is not further refined into bug fixer, core developer, or maintainer). The taxonomy we propose in the recommendation below is a refinement and combination of the taxonomies presented in Sections IV-A and IV-B.
Recommendation
Giving credit to the contributors of a software project is very similar to giving credit to the contributors of research articles. We thus need a rich taxonomy. In the previous sections we discussed two taxonomies, developed and used in two different contexts inside Inria: despite minor differences (for example, maintenance and user support are not taken into account for technology transfer), one can rather easily extract the following taxonomy of contributor roles, which covers all the use cases seen and may be extended in the future:
• Design
• Debugging
• Maintenance
• Coding
• Architecture
• Documentation
• Testing
• Support
• Management
But this is only part of the story: in both of the internal Inria processes we described, contributions are not just classified into different roles, they are also quantified, either at a coarse grain (from 1 to 5 for career evaluation) or at a very fine grain (percentages are used for technology transfer, where a financial return needs to be precisely redistributed). We thus recommend using a coarse-grain qualitative scale, as it is easy to implement and proves very helpful whenever technology transfer occurs.
Other disciplines too have pushed efforts to create a richer taxonomy of contributions for research articles, with the CRediT system [3] detailing 14 different possible roles, one of which is software: the key idea is that each person listed as an author needs to specify one or more of the 14 roles.
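In machine-readable metadata, such roles and weights could be recorded along the following lines. This is a hedged sketch: CodeMeta has standard author/contributor properties, but the `role` and `contributionLevel` keys below are hypothetical illustrations of the proposed taxonomy and 1-to-5 scale, not part of the CodeMeta schema.

```python
import json

# Hypothetical contributor records combining the proposed role taxonomy
# with the coarse-grain 1-to-5 scale; "role" and "contributionLevel" are
# illustrative keys, not standard CodeMeta properties.
TAXONOMY = {"Design", "Debugging", "Maintenance", "Coding", "Architecture",
            "Documentation", "Testing", "Support", "Management"}

contributors = [
    {"name": "A. Researcher", "role": "Design", "contributionLevel": 5},
    {"name": "B. Engineer", "role": "Coding", "contributionLevel": 4},
    {"name": "C. Student", "role": "Documentation", "contributionLevel": 2},
]

# Validate records against the taxonomy and the 1-5 scale before export.
for c in contributors:
    assert c["role"] in TAXONOMY and 1 <= c["contributionLevel"] <= 5

print(json.dumps({"contributor": contributors}, indent=2))
```

A coarse scale like this is deliberately easy to fill in by hand during a moderated deposit, which matters more than precision for the career-evaluation use case described above.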
B. The importance of the human in the loop
This quantification is essential, in particular considering that an academic credit system will inevitably be built on top of software citations, which brings us to our next key point: the importance of having humans in the loop, which has already been clearly advocated in a different context by the team behind the Astrophysics Source Code Library [2].
As we have already noted, many of the contributor roles identified above are not reflected in the code. In order to assess these roles, in kind and in quantity, it is necessary to interact with the team that has created and evolved the software: this is what the technology transfer service at Inria routinely does.
What about the activities that are tightly related to the software source code itself, like coding, testing, and debugging? Here it is very tempting to use automated tools to determine the role of a contributor and the importance of each contribution. There is indeed a wealth of developer-scoring algorithms that target GitHub contributors (see for example http://git-awards.com/, https://github.com/msparks/git-score and GitHub's own scoring using the number of commits, deletions, or additions). Unfortunately, these measures are far from robust: refactoring (which may be just renaming, moving files around, or even changing tabs into spaces!) can lead to huge score increases while the actual developer contribution is marginal. And even if one could rule out irrelevant code changes, our experience at Inria is that the importance and quality of a contribution cannot be assessed by counting the number of lines of code that have been added (see our description of the coding role in Section IV-B). This is particularly the case for research software that involves significant innovations.

Recommendation
As a bottom line, we strongly suggest refraining, for research software, from trying to generate software citation and credit metadata, and in particular the list of (main) authors, using automated tools: we need instead quality information in the scholarly world, and currently this can only be achieved with qualified human intervention. We strongly encourage the authors of research software to provide such qualitative information, for example in an AUTHORS file, and to use the aforementioned taxonomy and scale.
As an illustration of this recommendation, the rich metadata collected by HAL in the deposit process are sent to SWH using the now standard CodeMeta schema [16], and will soon be extended with the taxonomy of Section V-A.
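There is no standard layout for such a file; as a purely hypothetical sketch (names and addresses invented), an AUTHORS file using the proposed taxonomy and 1-to-5 scale might look like:

```
# AUTHORS — roles from the proposed taxonomy, contribution level 1-5
A. Researcher  <[email protected]>  Design:5, Architecture:4
B. Engineer    <[email protected]>    Coding:4, Testing:3, Maintenance:3
C. Student     <[email protected]>    Documentation:2, Debugging:2
```

Kept under version control next to the source code, such a file is itself archived and identified by SWH, so the credit information travels with the software.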
C. Distinguish citation from reference
We have extensively covered the best practices for assessing and attributing software artefacts: they are essential for giving qualified academic credit to the people who contribute to them, and are key prerequisites for creating citations for software. This complex undertaking requires significant human intervention, as well as proper processes and tools.
The overall problem of reproducible research is quite different: while there are examples of rather comprehensive solutions in very specialised domains, it seems very difficult to find a unique solution general enough to cover all the use cases. An example of a domain-specific solution is provided by the IPOL journal (Image Processing On Line, an Open Science journal dedicated to image processing): each article describes an algorithm and contains its source code, with an online demonstration facility and an archive of experiments.
We believe that the three building blocks described in Section IV-D (Software Heritage, Nix/Guix, curated connections between SWH and HAL) will make it possible to provide precise references (as illustrated at the end of Section III) to specific software excerpts, contexts, and environments, and to permanently bind them with research articles.
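As a hedged sketch of what such a reference could look like in an article's bibliography (assuming the biblatex-software package distributed by Software Heritage, which defines an @software entry type with an swhid field; the concrete entry below is invented for illustration, with a zeroed placeholder identifier):

```latex
% Hypothetical bibliography entry pinning a software artefact to its
% intrinsic identifier (all entry data invented for illustration).
@software{demo-solver,
  title   = {Demo Solver},
  author  = {Researcher, A. and Engineer, B.},
  year    = {2019},
  version = {1.2},
  swhid   = {swh:1:rel:0000000000000000000000000000000000000000},
}
```

Because the swhid field is intrinsic, such a reference remains resolvable through the SWH archive even if the project's hosting platform disappears.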
Recommendation
It is essential to distinguish citations of projects or results from exact references to software and its environment, and we believe that both should be used in articles. We also strongly encourage the use of tools like Guix and Software Heritage to build such perennial references.
Although neither a consensus nor a standard exists yet on how to use references in articles, we are currently working on proposing concrete guidelines and on adding support in Software Heritage to easily provide the corresponding LaTeX snippets.

VI. CONCLUSION
In this article we presented for the first time the internal processes and efforts in place at Inria for assessing, attributing, and referencing research software. They play an essential role in the careers of individual Inria researchers and engineers, the evaluation of whole research teams, the technology transfer activities and incentive policies, and the visibility of research teams.
These processes have to cope with the great complexity and variability of research software, in terms of the nature of its related activities and practices, the roles of its contributing actors, and the diversity of lifespans.
Recommendations
Based on our experience over several decades, we have distilled the important lessons learned and are happy to provide a set of recommendations that can be summarized as follows:
Recognise the diversity of contributor roles
The taxonomy of contributors described in Section V-A has been extensively tested internally at Inria. We recommend that it be incorporated into the CodeMeta standard and into all the platforms and tools that support software attribution and citation. In the meantime, researchers can adopt it right away in the metadata they incorporate in their own source code.
Keep the human in the loop
To obtain quality metadata, as seen in Section V-B, it is essential to have humans in the loop. We strongly advise against the unsupervised use of automated tools to create such metadata. Even though automated tools can save a lot of time, we recommend instead the implementation of a metadata curation and moderation mechanism in all tools and platforms that are involved in the creation of metadata for research software, like Zenodo or FigShare. We also recommend that research institutions, and academia in general, rely on human experts to assess the qualitative contributions of research software, and refrain from adopting as evaluation criteria automated metrics that are easily biased.
Distinguish citation from reference
As explained in Section III, citations, used to provide credit to contributors, are conceptually different from references, designed to support reproducibility. While the latter can be largely automated using platforms like Software Heritage and tools like Guix, the former require careful human curation. Research articles will then be able to provide both software citations and software references, and we are currently working on concrete guidelines that we will make publicly available.

REFERENCES
[1] J.-F. Abramatic, R. Di Cosmo, and S. Zacchiroli. Building the universal archive of source code. Commun. ACM, 61(10):29–31, Sept. 2018.
[2] A. Allen and J. Schmidt. Looking before leaping: Creating a software registry. Journal of Open Research Software, 3(e15), 2015.
[3] L. Allen, A. O'Connell, and V. Kiermer. How can we ensure visibility and diversity in research contributions? How the Contributor Role Taxonomy (CRediT) is helping the shift from authorship to contributorship. Learned Publishing, 32(1):71–74, 2019.
[4] Association for Computing Machinery. Artifact review and badging, Apr 2018. Retrieved April 27th, 2019.
[5] C. L. Borgman, J. C. Wallis, and M. S. Mayernik. Who's got the data? Interdependencies in science and technology collaborations. Computer Supported Cooperative Work, 21(6):485–523, 2012.
[6] C. T. Brown. Revisiting authorship, and JOSS software publications. http://ivory.idyll.org/blog/2019-authorship-revisiting.html, Jan 2019. Retrieved April 2nd, 2019.
[7] C. Collberg and T. A. Proebsting. Repeatability in computer systems research. Communications of the ACM, 59(3):62–69, Feb 2016.
[8] L. Courtès and R. Wurmus. Reproducible and user-controlled software environments in HPC with Guix. In Euro-Par 2015: Parallel Processing Workshops, pages 579–591, 2015.
[9] R. Di Cosmo, M. Gruenpeter, and S. Zacchiroli. Identifiers for digital objects: the case of software source code preservation. In Proceedings of the 15th International Conference on Digital Preservation, iPRES 2018, Boston, USA, Sept. 2018. Available from https://hal.archives-ouvertes.fr/hal-01865790.
[10] E. Dolstra, M. de Jonge, and E. Visser. Nix: A safe and policy-free system for software deployment. In L. Damon, editor, Proceedings of the 18th Conference on Systems Administration (LISA 2004), Atlanta, USA, November 14–19, 2004, pages 79–92. USENIX, 2004.
[11] Y. Gil, C. H. David, I. Demir, B. Essawy, W. Fulweiler, J. Goodall, L. Karlstrom, H. Lee, H. Mills, J.-H. Oh, S. Pierce, A. Pope, M. Tzeng, S. Villamizar, and X. Yu. Towards the geoscience paper of the future: Best practices for documenting and sharing research from data to software to provenance. Earth and Space Science, 3, 07 2016.
[12] K. Hinsen. Software development for reproducible research. Computing in Science and Engineering, 15(4):60–63, 2013.
[13] J. Howison and J. Bullard. Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science and Technology, 67(9):2137–2155, 2016.
[14] L. Hwang, A. Fish, L. Soito, M. Smith, and L. H. Kellogg. Software and the scientist: Coding and citation practices in geodynamics. Earth and Space Science, 4(11):670–680, 2017.
[18] L. J. Shustek. What should we collect to preserve the history of software? IEEE Annals of the History of Computing, 28(4):110–112, 2006.
[19] A. M. Smith, D. S. Katz, and K. E. Niemeyer. Software citation principles. PeerJ Computer Science, 2:e86, 2016.
[20] V. Stodden, R. J. LeVeque, and I. Mitchell. Reproducible research for scientific computing: Tools and strategies for changing the culture.