Code Replicability in Computer Graphics
NICOLAS BONNEEL, Univ Lyon, CNRS, France
DAVID COEURJOLLY, Univ Lyon, CNRS, France
JULIE DIGNE, Univ Lyon, CNRS, France
NICOLAS MELLADO, Univ Toulouse, CNRS, France
Fig. 1. We ran 151 codes provided by papers published at SIGGRAPH 2014, 2016 and 2018. We analyzed whether these codes could still be run as of 2020 to provide a replicability score, and performed statistical analysis on code sharing.
Image credits: Umberto Salvagnin, Bluenose Girl, Dimitry B., motiqua, Ernest McGray Jr., Yagiz Aksoy, Hillebrand Steve. 3D models by Martin Lubich and Wig42.
Being able to duplicate published research results is an important part of conducting research, whether to build upon these findings or to compare with them. This process is called "replicability" when using the original authors' artifacts (e.g., code), or "reproducibility" otherwise (e.g., re-implementing algorithms). Reproducibility and replicability of research results have gained a lot of interest recently, with assessment studies being conducted in various fields, and they are often seen as a trigger for better result diffusion and transparency. In this work, we assess replicability in Computer Graphics by evaluating whether the code is available and whether it works properly. As a proxy for this field we compiled, ran and analyzed 151 codes out of 374 papers from the 2014, 2016 and 2018 SIGGRAPH conferences. This analysis shows a clear increase in the number of papers with available and operational research codes, with a strong dependency on the subfield, and indicates a correlation between code replicability and citation count. We further provide an interactive tool to explore our results and evaluation data.

CCS Concepts: • Computing methodologies → Computer graphics; • Software and its engineering → Open source model.

Additional Key Words and Phrases: Replicability, reproducibility, open source, code review, SIGGRAPH
Authors' addresses: Nicolas Bonneel, [email protected], Univ Lyon, CNRS, Lyon, France; David Coeurjolly, [email protected], Univ Lyon, CNRS, Lyon, France; Julie Digne, [email protected], Univ Lyon, CNRS, Lyon, France; Nicolas Mellado, [email protected], Univ Toulouse, CNRS, Toulouse, France.

© 2020 Association for Computing Machinery. 0730-0301/2020/7-ART1 $15.00. https://doi.org/10.1145/3386569.3392413
The ability to reproduce an experiment and validate its results is a cornerstone of scientific research and a key to our understanding of the world. Scientific advances often provide useful tools and build upon a vast body of previous work published in the literature. As such, research that cannot be reproduced by peers despite best efforts often has limited value, and thus impact, as it does not benefit others, cannot be used as a basis for further research, and casts doubt on published results. Reproducibility is also important for comparison purposes, since new methods are often seen in the light of results obtained by published competing approaches. Recently, serious concerns have emerged in various scientific communities, from psychological sciences [Open Science Collaboration et al. 2015] to artificial intelligence [Hutson 2018], over the lack of reproducibility, and one could wonder about the state of computer graphics research in this matter.

In the recent trend of open science and reproducible research, this paper aims at assessing the state of replicability of papers published in ACM Transactions on Graphics as part of the SIGGRAPH conferences. Contrary to reproducibility, which assesses how results can be obtained by independently reimplementing published papers – an overwhelming task given the hundred papers accepted yearly to this event – replicability ensures the authors' own codes run and produce the published results. While sharing code is not the only available option to guarantee that published results can be duplicated by a practitioner – after all, many contributions can be reimplemented from published equations or algorithm descriptions with more or less effort – it remains an important tool that reduces the time spent in reimplementation, in particular as computer graphics algorithms get more sophisticated.
Our contributions are twofold. First, we analyze code sharing practices and replicability in computer graphics. We hypothesize a strong influence of topics, an increase of replicability over time similar to the trend observed in artificial intelligence [Hutson 2018], and an increased impact of replicable papers, as observed in image processing [Vandewalle 2019]. To evaluate these hypotheses, we manually collected source codes of SIGGRAPH 2014, 2016 and 2018 papers and ran them, and, when possible, assessed how they could replicate results shown in the paper or produce reasonably similar results on different inputs. Second, we provide detailed step-by-step instructions to make these software packages run (in practice, in many cases, code adaptations had to be made due to dependencies having evolved) through a website, thus amounting to a large code review covering 151 codes obtained from 374 SIGGRAPH papers. We hope this platform can be used collaboratively in the future to help researchers having difficulties reproducing published results. Our study shows that:

• Code sharing is correlated with paper citation count, and has improved over time.
• Code sharing practices largely vary with sub-fields of computer graphics.
• It is often not enough to share code for a paper to be replicable. Build instructions with precise dependency version numbers, as well as example command lines and data, are important.
The impact of research depends on a number of parameters that are independent of the quality of the research itself, but relate to the practices surrounding it. Has the peer review process been fairly conducted? Are the findings replicable? Is the paper accessible to the citizen? A number of these questions have been studied in the past within various scientific communities, which this section reviews.
Definitions.
Reproducible research was initiated in computer science [Claerbout and Karrenbach 1992] via the automation of figure production within scientific articles. Definitions have evolved [Plesser 2018] and have been debated [Goodman et al. 2016]. As per ACM standards [ACM 2016], repeatability indicates the original authors can duplicate their own work, replicability involves other researchers duplicating results using the original artifacts (e.g., code) and hardware, and reproducibility corresponds to other researchers duplicating results with their own artifacts and hardware – we will hence use this definition. We however mention that various actors of replicable research have advocated for the opposite definition: replicability being about answering the same research question with new materials, while reproducibility involves the original artifacts [Barba 2018] – a definition championed by the National Academies of Sciences [2019].
Reproducibility and replicability in experimental sciences.
Concerns over lack of reproducibility have started to emerge in several fields of study, which has led to the term "reproducibility crisis" [Pashler and Harris 2012]. In experimental sciences, replicability studies evaluate whether claimed hypotheses are validated from observations (e.g., whether the null hypothesis is consistently rejected and whether effect sizes are similar). In different fields of psychology and social sciences, estimations of replication rates have varied between 36% out of 97 studies with significant results, with half the original effect size [Open Science Collaboration et al. 2015], 50%–54% out of 28 studies [Klein et al. 2018], 62% out of 21 Nature and Science studies with half the original effect size [Camerer et al. 2018], and up to roughly 79% out of 342 studies [Makel et al. 2012]. In oncology, a reproducibility rate of 11% out of 53 oncology papers has been estimated [Begley and Ellis 2012], and a collaboration between Science Exchange and the Center for Open Science (initially) planned to replicate 50 cancer biology studies [Baker and Dolgin 2017]. Over 156 medical studies reported in newspapers, about 49% were confirmed by meta-analyses [Dumas-Mallet et al. 2017]. A survey published in Nature [Baker 2016] showed large disparities among scientific fields: respondents working in engineering believed an average of 55% of published results are reproducible. Proposed remedies include lowering the significance threshold to p < .005 [Benjamin et al. 2018] or simply abandoning hypothesis testing and p-values as binary indicators [McShane et al. 2019], providing confidence intervals and using visualization techniques [Cohen 2016], or improving experimental protocols [Begley 2013].

While computer graphics papers occasionally include experiments such as perceptual user studies, our paper focuses on code replicability.
Reproducibility and replicability in computational sciences.
In hydrology, Stagge et al. [2019] estimate via a survey tool that 0.6% to 6.8% of 1,989 articles (95% confidence interval) can be reproduced using the available data, software and code – a major reported issue being the lack of directions to use the available artifacts (for 89% of tested articles). High energy physicists, who depend on costly, often unique, experimental setups (e.g., the Large Hadron Collider) and produce enormous datasets, face reproducibility challenges both in data collection and processing [Chen et al. 2019]. Such challenges are tackled by rigorous internal review processes before data and tools are opened to larger audiences. It is argued that analyses should be automated from inception and not as an afterthought. Closer to our community is the replication crisis reported in artificial intelligence [Gundersen and Kjensmo 2018; Hutson 2018]. Notably, the authors surveyed 400 papers from the top AI conferences IJCAI and AAAI, and found that 6% of presenters shared their algorithm's code, 54% shared pseudo-code, 56% shared their training data, and 30% shared their test data, while the trend was improving over time. In a recent study on the reproducibility of IEEE Transactions on Image Processing papers [Vandewalle 2019], the authors showed that, on average, code availability approximately doubled the number of citations of published papers. Contrary to these approaches, we not only check for code availability, but also evaluate whether the code compiles and produces similar results as those found in the paper, with reasonable efforts to adapt and debug codes when needed.

Efforts to improve reproducibility are nevertheless flourishing, from early recommendations such as building papers using Makefiles in charge of reproducing figures [Schwab et al. 2000] to various reproducibility badges proposed by ACM [ACM 2016] in collaboration with the Graphics Replicability Stamp Initiative [Panozzo 2016]. Colom et al. [2018] list a number of platforms and tools that help in reproducible research. Close to the interest of the computer graphics community, they bring forward the IPOL journal [Colom et al. 2015], whose aim is to publish image processing codes via a web interface that allows visualizing results, along with a complete and detailed peer-reviewed description of the algorithm. They further mention an initiative by GitHub [2016] to replicate published research, though it has seen very limited success (three replications were initiated over the past three years). In Pattern Recognition, reproducible research is awarded with the Reproducible Label in Pattern Recognition organized by the biennial Workshop on Reproducible Research in Pattern Recognition [Kerautret et al. 2019, 2017]. Programming languages and software engineering communities have created Artifact Evaluation Committees for accepted papers [Krishnamurthi 2020], with incentives such as rewarding with additional presentation time at the conference and an extra page in the proceedings, with special recognition for best efforts. Other initiatives include reproducibility challenges, such as the one organized yearly since 2018 by the ICLR conference in machine learning [Pineau et al. 2019], which accepts submissions aiming at reproducing published research at ICLR. In 2018, reproducibility reports of 26 ICLR papers were submitted, out of which 4 were published in the ReScience C journal.
Open access. Software bugs have had important repercussions on collected data and analyses, hence pushing for open sourcing data and code. Popular examples include Microsoft Excel converting gene names such as SEPT2 (for Septin 2) to dates [Ziemann et al. 2016], or a bug in widely used fMRI software packages that resulted in largely inflated false-positive rates, possibly affecting many published results [Eklund et al. 2016]. Recently, Nature Research has enforced an open data policy [Nature 2018], stated in their policies as "authors are required to make materials, data, code, and associated protocols promptly available to readers without undue qualifications", and proposes a journal focused on sharing high re-use value data called Scientific Data [Scientific Data (Nature Research) 2014]. Other platforms for sharing scientific data include the Open Science Framework [Center for Open Science 2015]. Related to code, Colom et al. [2018] report the website mloss that lists machine learning codes, RunMyCode for scientists to share code associated with their research paper, or ResearchCompendia that stores data and codes. Long-term code availability is also an issue, since authors' webpages are likely to move according to institution affiliations, so that code might simply become unavailable. Code shared on platforms such as GitHub is only available as long as the company exists, which can also be an issue. For long-term code storage, the Software Heritage initiative [Di Cosmo and Zacchiroli 2017] aims at crawling the web and platforms such as GitHub, Bitbucket, Google Code etc. for open source software and stores them in a durable way. Recently, the GitHub Archive Program [Github 2020] pushed these ideas further and proposes a pace layer strategy where code is archived at different frequencies (real-time, monthly, every 5 years), with advertised lifespans up to 500 years and possibly 10,000 years.
Other assessments of research practices. Reproducibility of paper acceptance outcomes has been assessed in machine learning. In 2014, the prestigious NIPS conference (now NeurIPS) performed the NIPS consistency experiment: a subset of 170 out of 1678 submissions were assigned to two independent sets of reviewers, and consistency between reviews and outcomes was evaluated. The entire process, results, and analyses were shared on an open platform [Lawrence 2014]. Decisions were inconsistent for 43 out of 166 reviewed papers (4 were withdrawn, 101 were rejected by both committees, 22 were accepted by both committees). Other initiatives for more transparent processes include the sharing of peer reviews of published papers on platforms such as OpenReview [Soergel et al. 2013] or directly by journals [The Royal Society Publishing 2020], and the post-publication monitoring for misconduct or retractions on platforms such as PubPeer and RetractionWatch [Didier and Guaspare-Cartron 2018].
Our goal is to assess trends in replicability in computer graphics. We chose to focus on the conference in the field with the highest exposure, ACM SIGGRAPH, as an upper bound proxy for replicability. Although this hypothesis remains to be verified, this conference more often publishes completed research projects, as opposed to the preliminary exploratory ideas more often seen in smaller venues, which could explain lower code dissemination elsewhere. To estimate a trend over time, we focus on three SIGGRAPH conferences: SIGGRAPH 2014 (Vancouver, 127 accepted papers), 2016 (Anaheim, 119 accepted papers), and 2018 (Vancouver, 128 accepted papers). We did not include SIGGRAPH 2019 (Los Angeles) since authors sometimes need time to clean up and publish their code after publication. We did not include SIGGRAPH Asia nor papers published in ACM Transactions on Graphics outside of the conference main track, to reduce variability in results and keep a more focused scope. We chose a two-year interval between conferences in the hope of getting clearer trends, and to keep a tractable number of papers to evaluate. We searched for source codes as well as closed-source binaries for all papers. We restricted our search to original implementations and reimplementations authored and released by the original authors of the paper, excluding reimplementations by others, as we aim at assessing replicability and not reproducibility (see Sec. 2). For each paper, we report the objective and subjective information described below.
Identifying and factual information. This includes the paper name and DOI, ACM keywords, pdf, project and code or binaries URLs if they have been found, as well as information indicating if authors are from the industry, academia, or unaffiliated, for further analysis. For papers, we include information as to whether they can be found on arXiv or other Open Archive Initiative providers we may have found, in open access on the ACM Digital Library, or by other means such as institutional web pages. Aside from ACM keywords, we further categorize papers into 6 broad topics related to computer graphics, and we also keep track of whether they relate to neural networks. We defined these topics as:
• Rendering. This includes simulating light transport, real-time rendering, sampling, reflectance capture, data structures for intersections, and non-photorealistic rendering.
• Animation and simulation. This includes character animation, motion capture and rigging/skinning, cinematography/camera path planning, deformable models, as well as fluid, cloth, hair or sound simulation, including geometric or topology problems related to these subjects.
• Geometry. This includes geometry processing and modeling, for point-based, voxel-based and mesh-based geometries, as well as topology, mapping, vector fields and shape collection analysis. We also include image-based modeling.
• Images. This includes image and video processing, as well as texture synthesis and editing, image segmentation, drawing, sketching and illustration, intrinsic decomposition or computational photography. We also included here image-based rendering, which relies more on image techniques than rendering.
• Virtual Reality. This category includes virtual and augmented reality, 3D displays, and interactions.
• Fabrication. This includes 3D printing, knitting or caustic design.

We strive to classify each paper into a single category to simplify analyses. Both these categories and paper assignments to these categories can be largely debated. While they may be prone to errors at the individual level, they still provide meaningful insight when seen as statistical aggregates. These categories were used in our analysis instead of ACM keywords for several reasons: first, we counted more than 127 different ACM keywords, which would make for overspecialized categories. The hierarchical nature of this taxonomy also makes the analysis more complicated. In Fig. 2 we show the distribution of ACM keywords of papers involved in each of our categories. Interestingly, this visualization highlights the lack of ACM keywords dedicated to fabrication despite the increasing popularity of this topic.

Information about code includes the code license, the presence of documentation, readme files and explicit mention of the code authors (who usually are a subset of the paper authors), as well as the build mechanism (Makefile, CMakeLists, SCons, IDE projects, or other types of scripts), and lists of dependencies. We notably indicate whether library or software dependencies are open source (e.g., Eigen, OpenCV), closed source but free at least for research purposes (e.g., mosek, CUDA or Intel MKL), or closed source and paying even for research purposes (e.g., Matlab). Similarly, we ask whether the code depends on data other than examples or input data (e.g., training data or neural network description files) and their license.

One of our key contributions is that we report the undocumented steps required to make the code run – from bug fixes to dependency installation procedures. We believe this information is valuable to the community, as these steps are often independently rediscovered by students relying on these codes, sometimes after significant effort.
Fig. 2. Distribution of the ACM keywords per topic: (a) Animation, (b) Fabrication, (c) Geometry, (d) Image, (e) Rendering, (f) Virtual Reality. The font size reflects the number of papers associated with a keyword.

Subjective judgments on replicability. For papers without published code, this includes information as to whether the paper contains explicit algorithms and how much effort is deemed required to implement them (on a scale of 1 to 5). For algorithms requiring little reimplementation effort (with a score of 5) – typically short shaders or short self-contained algorithms – this can give an indication as to why releasing the code was judged unnecessary. For papers containing code, we evaluate how difficult it was to replicate results through a number of questions on a scale of 1 to 5. This includes the difficulty to find and install dependencies, to configure and build the project, to fix bugs, to adapt the code to other contexts, and how much we could replicate the results shown in the paper. We strived to remain flexible in the replicability score: often, the exact input data were not provided but the algorithms produced satisfactory results, qualitatively close to those published, on different data; or algorithms relied on random generators (e.g., for neural network initializations) that do not produce repeatable number sequences and results. Contrary to existing replicability initiatives, we did not penalize these issues, and they did not prevent high replicability scores.

We shared the task of evaluating these 374 submissions across 4 full-time tenured researchers (authors of the paper), largely experienced in programming and running complex computer graphics systems. Reasonable efforts were made to find and compile the provided code, including retrieving outdated links from the WayBack Machine [Tofel 2007], recreating missing Makefiles, debugging, trying on multiple OS (compiling was tested on Windows 10, Debian Buster, Ubuntu 18.04 and 19.10, and macOS 10.15; Ubuntu 14.04 and Windows 2012 virtual machines were used for very specific tests), or adapting the code to match libraries having evolved.
Efforts to adapt the code to evolved libraries, compilers or languages are due to practical reasons: it is sometimes impractical to rely on old Visual Studio 2010 precompiled libraries when only having access to a newer version, or to rely on an outdated TensorFlow version that requires downgrading CUDA drivers to version 8 for the sole purpose of having a single code run. We chose to avoid contacting authors for clarifications, instructions or to report bug fixes, to protect anonymity. We also added the GitHub projects to Software Heritage [Di Cosmo and Zacchiroli 2017] when they were not already archived, and gave the link to the Software Heritage entry in our online tool.
We provide the data collected during our review as a JSON file, available as supplementary material. Each JSON entry describes the properties of a paper (e.g., author list, project page, ACM keywords, topics) and its replicability results (e.g., scores, replicability instructions). All the indicators and statistics given in this paper are computed from this data, and we provide in the supplementary materials all the scripts required to replicate our analysis. We facilitate data exploration by providing an intuitive web interface, available at https://replicability.graphics (see Fig. 3), to visualize the collected data. This interface allows two types of exploration: of the whole dataset or per paper.
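As an illustration of how such an analysis script might look, the following minimal Python sketch loads the JSON file and counts papers with available code per publication year. The file name and field names (`year`, `code_available`) are assumptions made for illustration; the actual schema of the supplementary JSON may differ.

```python
import json
from collections import Counter

# Load the per-paper records (file name and fields are assumed, not the actual schema).
with open("replicability_data.json", "r", encoding="utf-8") as f:
    papers = json.load(f)

# Count, for each SIGGRAPH year, how many papers have code or binaries available.
total_per_year = Counter()
with_code_per_year = Counter()
for paper in papers:
    year = paper["year"]                    # e.g., 2014, 2016 or 2018
    total_per_year[year] += 1
    if paper.get("code_available", False):  # hypothetical boolean field
        with_code_per_year[year] += 1

for year in sorted(total_per_year):
    ratio = 100.0 * with_code_per_year[year] / total_per_year[year]
    print(f"{year}: {with_code_per_year[year]}/{total_per_year[year]} papers with code ({ratio:.1f}%)")
```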
Dataset exploration.
Our dataset exploration tool is split into two components: a table listing the reviewed papers, and two graphs showing statistics about the table content. A first graph displays the distribution of papers with respect to code/pseudocode availability and their replicability score. A second graph shows paper availability, either as ACM Open Access or as a preprint provided by the authors. The interactive table allows filtering the dataset by author name, paper title, publication year and/or topic, and updates the graphs according to the selection. It is also possible to sort the papers by their properties, in particular their replicability score or a documentation score between 0 and 2 (0: no documentation, 2: exhaustive documentation). Each paper is associated with a dedicated webpage accessible directly from the table.
Per-paper exploration.
The paper webpage gives direct access to the information extracted from the JSON file. It includes the links to resources available online (ACM Digital Library, preprint, code), various information (e.g., paper topic, nature of the artifact, list of the dependencies) and a breakdown of the replicability experiment when code was available (scores and comments). In addition, the paper webpage gives the Altmetric Attention Score and links to the Altmetric webpage of the paper if available. This score measures the overall attention a paper has received, including on social networks, which differs from academic citation scores. The comment section mostly covers the steps that the reviewer had to follow in order to try to replicate the paper, including details about dependency management and updates, bug fixes or code modifications. We expose the exact revision number (for git projects) or MD5 hash of the archive file (for direct downloads) of the codes the comments relate to. The website allows commenting on scores and instructions, both as a user and as a paper author, as well as adding new entries for new or updated codes.

This section analyzes both objective and subjective metrics. All reported p-values were adjusted for multiple comparisons using the false discovery rate control procedure proposed by Benjamini and Hochberg [1995].
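For readers unfamiliar with this correction, the sketch below shows one way the Benjamini-Hochberg adjustment can be applied to a list of raw p-values; it is a generic illustration of the procedure, not the authors' actual analysis script, and the p-values shown are placeholders.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (false discovery rate control)."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)                        # indices of p-values, ascending
    ranked = p[order] * m / np.arange(1, m + 1)  # raw BH scaling: p_(i) * m / i
    # Enforce monotonicity from the largest rank down, then clip at 1.
    adjusted_sorted = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted            # restore the original ordering
    return adjusted

# Placeholder p-values, e.g., coming from several chi-squared tests.
raw = [0.001, 0.04, 0.03, 0.20, 0.45]
print(benjamini_hochberg(raw))
```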
Availability of papers.
Papers are overall available. Over all 374 papers, only two are available solely on the ACM Digital Library. Notably, ACM provides free access to all SIGGRAPH and SIGGRAPH Asia proceedings, though this is little advertised [ACM 2020]. Also, 27 papers are available as preprints on arXiv (9 only on arXiv), 17 on HAL (7 only on HAL), and 44 benefit from the ACM Open Access policy; the other papers are available as preprints at least on the authors' websites or other paper repositories. Some references or preprints may also be available on other OAI providers thanks to database interconnections or local initiatives; we only report here the most significant ones found by this study.

Availability of code.
Software packages were available for 151 papers: 133 papers for which source code was provided, plus 18 papers for which no source code was provided but compiled software was. For the rest of the analysis, we considered compiled and open source software combined. While open source research codes allow for adaptation, making it easier to build upon them, and are thus ideal, binary software at least allows for effortless method comparisons. Nevertheless, among these software packages, we could not run 19 of them due to technical issues preventing the codes from compiling or running, and 5 of them due to lack of dedicated hardware (see Sec. 6). Among these 133 codes, 60 do not have license information, which could notably prevent code dissemination in the industry, and 11 do not come with any documentation nor build instructions.

We performed χ² tests to analyze trends in code sharing. Overall, codes or binaries could be found for 37 papers out of 127 (29.1%) in 2014, 47 out of 119 (39.5%) in 2016, and 67 papers out of 128 (52.3%) in 2018. This increase is statistically significant between 2014 and 2018, though not between 2014 and 2016. Code availability also strongly depends on the topic, ranging from its lowest value for Fabrication to 26.9% for Animation, 31.8% for Virtual Reality, 47.9% for Rendering, 51.9% for Geometry and 57.9% for Images (Fig. 4).

We also analyzed the involvement of at least one author from the industry on the release of codes or binaries. We found that, overall, papers involving the industry provided code or binaries 31.3% of the time, while this was the case for 45.4% of purely academic papers – a statistically significant difference.
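The year-to-year comparison above is a test of proportions on a contingency table. As a hedged illustration (not the authors' own script), the following sketch runs a χ² test on the reported 2014 vs. 2018 counts using SciPy; the same pattern applies to the industry vs. academia comparison. Note that SciPy applies Yates' continuity correction by default for 2×2 tables, so the exact value may differ slightly from the one reported in the paper.

```python
from scipy.stats import chi2_contingency

# Counts reported in the study: papers with vs. without code/binaries.
#                  with code   without code
table_2014_2018 = [
    [37, 127 - 37],   # SIGGRAPH 2014: 37 of 127 papers
    [67, 128 - 67],   # SIGGRAPH 2018: 67 of 128 papers
]

chi2, p_value, dof, expected = chi2_contingency(table_2014_2018)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.2e}")
```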
Fig. 3. We designed a web interface to explore our collected data, allowing to see individual paper replicability and build instructions, available at https://replicability.graphics.

Fig. 4. We compute the percentage of papers that include either code or binaries as a function of year and topic. We also show Clopper-Pearson 95% confidence intervals.

Fig. 5. We compute the median number of citations and its 95% confidence intervals for papers sharing code (or executables) and for papers not sharing code nor executables.
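The Clopper-Pearson intervals mentioned in Fig. 4 can be computed directly from quantiles of the Beta distribution. The sketch below is a generic SciPy-based illustration, not taken from the paper's supplementary scripts.

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion k/n."""
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

# Example with numbers from the study: 67 of 128 SIGGRAPH 2018 papers had code or binaries.
low, high = clopper_pearson(67, 128)
print(f"52.3% of papers, 95% CI: [{100*low:.1f}%, {100*high:.1f}%]")
```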
Given the sheer amount of deep learning codes available online, we hypothesized that deep learning-related papers were more likely to share code. We tested this hypothesis on our dataset, but found that they provided code only 44.6% of the time (25 out of 56), while this was the case 39.6% of the time for non-deep-learning papers (126 out of 318) – a non-significant difference.

Table 1. Additional quantitative data from our study.
As replicability scores are subjective, we first perform an analysis of variance (ANOVA), despite some limitations here (see Norman [2010] for a full discussion), aimed at determining two things: is there a dependence of the replicability scores on the reviewer of the code? And does the year influence replicability (as it would seem that older non-maintained codes are harder to replicate)? The ANOVA is performed on the replicability score, taking only papers with codes for which compiling was successful, and with two factors: reviewer and year. The answer to both questions seems negative (p = .13 for the reviewer factor, and no significant effect of the year).
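A two-factor ANOVA of this kind can be run with statsmodels. The snippet below is a hedged sketch with synthetic data standing in for the real scores; the column names and values are illustrative, not the study's data.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Illustrative data: replicability scores (1-5) with the reviewer and year as factors.
df = pd.DataFrame({
    "score":    [5, 4, 3, 5, 2, 4, 3, 5, 4, 2, 3, 4],
    "reviewer": ["A", "A", "B", "B", "C", "C", "D", "D", "A", "B", "C", "D"],
    "year":     [2014, 2016, 2018, 2014, 2016, 2018, 2014, 2016, 2018, 2014, 2016, 2018],
})

# Two-factor ANOVA: does the score depend on the reviewer and/or the publication year?
model = smf.ols("score ~ C(reviewer) + C(year)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```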
13 and p = . ≤ Our analysis has a number of limitations. First, the data we collectedmay only be partly reliable. While we spent reasonable efforts tofind, run and compile codes, it is possible that we missed codes, orthat additional efforts or contacting the authors for clarifications orto report bugs would result in different outcome for a few papers.Similarly, we could not fully evaluate codes that depend on specifichardware (such as spatial light modulators, microcontrollers, Hall
ACM Trans. Graph., Vol. 39, No. 4, Article 1. Publication date: July 2020. Submission ID: papers_262. 2020-05-07 00:43. Page 6 of 1–8. ode Replicability in Computer Graphics • 1:7 effect sensors etc.) for 4 papers. Our analysis focused on assessingthe codes provided by the authors which only assesses replicabilitybut not reproducibility: there are instances for which papers weresuccessfully reimplemented by other teams, which falls out of ouranalysis scope. It could also be expected that certain codes could beavailable upon request ; in fact, in a few cases, the provided coderelied on data only available upon request, which we did not assess.Second, the codes we found and assessed may have evolved afterthe paper has been published, which we cannot control. Similarly,the published code could be a cleaned-up version of the originalcode, or even a full reimplementation.Third, our focus on SIGGRAPH could hide a more negative pictureof the entire field. We believe that the exposure SIGGRAPH probablygives biases our results, with a tendency to find more codes herethan in smaller venues. It would be an interesting future work tocompare replicability across computer graphics venues.
Our replicability evaluation led us to identify a number of issues. First, the number of dependencies was often correlated with the difficulty to compile – especially on Windows. Precompiled libraries were sometimes provided for compilers that became outdated, or some dependencies were no longer supported on recent hardware or OS. The lack of precise dependency version numbers was another important issue we faced. Package managers for Python such as pip or conda evolve and default to different library versions, and build instructions or installation scripts did not directly work with these new versions. Lack of instructions for running software raised important frustration: default parameters were sometimes not provided, command line parameters not described, or results were output as numerical values in the console or written to files of undocumented format with no clear way to use them. In one case we had to develop software for reading the output file and displaying results. Similarly, input data were not always provided, or sometimes only provided upon request. Finally, in some cases, the software implemented only part of the method, producing results that did not match the quality of the results shown in the paper (e.g., missing post-processing, or only implementing the most important parts of the paper).

This leads us to issue several recommendations for the research community in Computer Graphics to promote their research work from the code replicability point of view.

For the authors.
Sharing the code or any artifact helping the replicability of the paper results is a good way to spread the contributions of the paper, as shown in terms of citation numbers in Sec. 5.1 and independently by Vandewalle [2019]. When shipping the code as supplementary material for the paper, several concerns must be addressed: accessibility of the source code (e.g., using the Software Heritage archive when the article is accepted); replicability of the build process, specifying the exact versions of the software or libraries that must be used to build and execute the code (for instance using container/virtualization services such as Docker, or package/configuration management services such as Anaconda (https://anaconda.org) or Nix (https://nixos.org)); clarity of the source code as a knowledge source (e.g., through technical documentation and comments in the code); and, finally, tractability of the coding process (authorship, clear licensing etc.).

Extra care should be given to codes that depend on rapidly evolving libraries. This is particularly the case of deep learning libraries (TensorFlow, PyTorch, Caffe etc.). As an example, several syntax changes occurred in PyTorch over the past few years, and Caffe appears not to be maintained anymore (e.g., pre-built binaries are provided up to Visual Studio 2015, and the last commit on GitHub was in March 2019); Python 2.7 is not maintained anymore as of January 1st, 2020. We recommend limiting the number of dependencies when possible – e.g., avoiding depending on large libraries for the sole purpose of loading image files – and possibly shipping dependencies with the source code (with integration into the project build framework). Similarly, deep learning codes can require up to several days of training: sharing pre-trained models together with the training routines is a good way to ensure replicability. A minimal way to document the exact software environment is sketched below.
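For instance, a small Python helper along these lines could be shipped with a research code to record the exact interpreter, OS and dependency versions used to produce the paper's results; the list of packages is project-specific and shown here only as an illustration.

```python
import sys
import platform
from importlib import metadata

# Packages whose exact versions we want to record (project-specific, illustrative list).
PACKAGES = ["numpy", "scipy", "torch"]

def dump_environment(path="environment_versions.txt"):
    """Write the Python version, OS and exact package versions to a text file."""
    lines = [
        f"python {platform.python_version()} ({sys.platform})",
        f"os {platform.platform()}",
    ]
    for name in PACKAGES:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{name} (not installed)")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    dump_environment()
```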
For the conference program chairs.
Not all research papers need to be involved in a source code replicability effort. A paper presenting a mathematical proof of the asymptotic variance of some sampler is intrinsically reproducible and does not need source code. On the contrary, for research papers for which it would make sense, the source code should be considered as a valuable artifact when evaluating a submission. This can be emphasized and encouraged in the guidelines and submission forms. Asking the reviewers to check the replicability of the article through the provided code is an ultimate goal for targeting replicability, but it may not be sustainable for the entire community. An intermediate action could be to communicate about the importance of paper replicability and to allow authors to submit their (anonymous) code and data with appropriate entries in the submission system. Furthermore, we advocate for a specific deadline for the submission of the code and data materials, e.g., one week after the paper deadline. The objective would be to give authors additional time to sanitize their code and encourage its publication, without interfering with the intense paper polishing process happening right before the paper deadline, nor with the reviewing process, since reviewers would only wait for a short amount of time before getting the code.
For the publishers.
Publishers already offer the possibility to attach supplementary materials to published papers (videos, source code, data...). Besides videos, other types of supplementary documents are not clearly identifiable. We would recommend clearly tagging (and promoting on the publisher library) supplementary data that correspond to source code. Independent platforms, such as Software Heritage [Di Cosmo and Zacchiroli 2017], permit archiving source codes and attaching unique identifiers as well as a timestamp to them. Publishers could easily rely on these platforms to link to the source code of a paper.
Our in-depth study of three years of ACM SIGGRAPH conference papers showed a clear increase in replicability, as authors are increasingly sharing their codes and data. Our study also showed that sharing code was correlated with a higher impact, measured in terms of citation numbers. We developed a website which aims at helping practitioners run existing codes on current hardware and software generations, with build instructions for the 151 codes we found online, when we could run them. Contrary to existing replicability stamps, our replicability scores are non-binary but on a 1-to-5 integer scale, less strict in the sense that results sufficiently close to those shown in the paper, on different data, were deemed appropriate, but sometimes inconsistent with these stamps when software could not be run anymore on current hardware and software generations. In the future, we hope to see interactions with these replicability stamp initiatives, with which we share the common goal of spreading open research.
We thank Roberto Di Cosmo for insightful discussions and the reviewers for their constructive feedback. This work was funded in part by ANR-16-CE23-0009 (ROOT), ANR-16-CE33-0026 (CALiTrOp), ANR-15-CE40-0006 (CoMeDiC) and ANR-16-CE38-0009 (e-ROMA).
REFERENCES
Nature News.
Nature News.
arXiv preprint arXiv:1802.03311 (2018).
C Glenn Begley. 2013. Reproducibility: Six red flags for suspect work. Nature.
Nature.
Nature Human Behaviour 2, 1 (2018), 6.
Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 1 (1995), 289–300.
Colin F Camerer, Anna Dreber, Felix Holzmeister, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Gideon Nave, Brian A Nosek, Thomas Pfeiffer, et al. 2018. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour.
Nature Physics 15, 2 (2019), 113–119.
Jon F Claerbout and Martin Karrenbach. 1992. Electronic documents give reproducible research a new meaning. In SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists, 601–604.
Jacob Cohen. 2016. The earth is round (p < .05). In What if there were no significance tests? Routledge, 69–82.
Miguel Colom, Bertrand Kerautret, and Adrien Krähenbühl. 2018. An Overview of Platforms for Reproducible Research and Augmented Publications. In International Workshop on Reproducible Research in Pattern Recognition. Springer, 25–39.
Miguel Colom, Bertrand Kerautret, Nicolas Limare, Pascal Monasse, and Jean-Michel Morel. 2015. IPOL: a new journal for fully reproducible research; analysis of four years development. IEEE, 1–5.
Roberto Di Cosmo and Stefano Zacchiroli. 2017. Software Heritage: Why and How to Preserve Software Source Code. In iPRES 2017: 14th International Conference on Digital Preservation. Kyoto, Japan.
Emmanuel Didier and Catherine Guaspare-Cartron. 2018. The new watchdogs' vision of science: A roundtable with Ivan Oransky (Retraction Watch) and Brandon Stell (PubPeer). Social Studies of Science 48, 1 (2018), 165–167.
Estelle Dumas-Mallet, Andy Smith, Thomas Boraud, and François Gonon. 2017. Poor replication validity of biomedical association studies reported by newspapers. PLoS ONE 12, 2 (2017), e0172650.
Anders Eklund, Thomas E Nichols, and Hans Knutsson. 2016. Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences.
Science Translational Medicine 8, 341 (2016).
Odd Erik Gundersen and Sigbjørn Kjensmo. 2018. State of the art: Reproducibility in artificial intelligence. In Thirty-Second AAAI Conference on Artificial Intelligence.
Matthew Hutson. 2018. Artificial intelligence faces reproducibility crisis.
Bertrand Kerautret, Miguel Colom, Daniel Lopresti, Pascal Monasse, and Hugues Talbot. 2019. Reproducible Research in Pattern Recognition. Springer.
Bertrand Kerautret, Miguel Colom, and Pascal Monasse. 2017. Reproducible Research in Pattern Recognition. Springer.
Richard A Klein, Michelangelo Vianello, Fred Hasselman, Byron G Adams, Reginald B Adams Jr, Sinan Alper, Mark Aveyard, Jordan R Axt, Mayowa T Babalola, Štěpán Bahník, et al. 2018. Many Labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science.
Perspectives on Psychological Science 7, 6 (2012), 537–542.
Blakeley B McShane, David Gal, Andrew Gelman, Christian Robert, and Jennifer L Tackett. 2019. Abandon statistical significance. The American Statistician 73, sup1 (2019), 235–245.
National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and Replicability in Science. National Academies Press.
Nature. 2018. Reproducibility: let's get it right from the start. Nature Communications 9, 1 (2018), 3716. https://doi.org/10.1038/s41467-018-06012-8
Geoff Norman. 2010. Likert scales, levels of measurement and the "laws" of statistics. Advances in Health Sciences Education 15, 5 (2010), 625–632.
Open Science Collaboration et al. 2015. Estimating the reproducibility of psychological science. Science.
Perspectives on Psychological Science 7, 6 (2012), 531–536.
Joelle Pineau, Koustuv Sinha, Genevieve Fried, Rosemary Nan Ke, and Hugo Larochelle. 2019. ICLR Reproducibility Challenge 2019. ReScience C (2019).
Hans E Plesser. 2018. Reproducibility vs. replicability: a brief history of a confused terminology. Frontiers in Neuroinformatics 11 (2018), 76.
Matthias Schwab, N Karrenbach, and Jon Claerbout. 2000. Making scientific computations reproducible. Computing in Science & Engineering.
ICML 2013 Peer Review Workshop (2013). https://openreview.net/ (accessed Dec. 2019).
James H Stagge, David E Rosenberg, Adel M Abdallah, Hadia Akbar, Nour A Attallah, and Ryan James. 2019. Assessing data availability and research reproducibility in hydrology and water resources. Scientific Data.
Proceedings of the 7th International Web Archiving Workshop. 27–37.
Patrick Vandewalle. 2019. Code availability for image processing papers: a status update. https://lirias.kuleuven.be/retrieve/541895$$DVandewalle_onlinecode_TIP_SITB19.pdf
Mark Ziemann, Yotam Eren, and Assam El-Osta. 2016. Gene name errors are widespread in the scientific literature.