File-based localization of numerical perturbations in data analysis pipelines
Ali Salari, Gregory Kiar, Lindsay Lewis, Alan C. Evans, Tristan Glatard
Concordia University, Montreal, Canada; McGill University, Montreal, Canada; Montreal Neurological Institute, Montreal, Canada.
Abstract—Data analysis pipelines are known to be impacted by computational conditions, presumably due to the creation, propagation, and amplification of numerical errors. While this process could play a major role in the current reproducibility crisis, the precise causes of such instabilities and the path along which they propagate in pipelines are unclear. We present Spot, a tool to identify which processes in a pipeline create numerical differences when executed in different computational conditions. Spot leverages system-call interception through ReproZip to reconstruct and compare provenance graphs without pipeline instrumentation. By applying Spot to the structural pre-processing pipelines of the Human Connectome Project, we found that linear and non-linear registration are the cause of most numerical instabilities in these pipelines, which confirms previous findings.
I. INTRODUCTION
Vibrations in computational infrastructures impact data analyses in various fields, but identifying the origin of these effects in complex pipelines remains challenging. In some cases, most likely due to numerical instabilities, small perturbations resulting from changes in operating system versions [8], hardware [12], or parallelization parameters [4] may result in substantially different analysis outcomes. To better understand and correct these effects, efficient tools are needed to assist pipeline developers in the comparison of results obtained across different conditions.

In neuroimaging, our primary application field, data analyses often consist of hundreds of computational processes, often coming from multiple toolboxes, that are aggregated to perform a specific function. For instance, the fMRIprep pipeline [5] assembles software blocks from FSL [11], AFNI [2], FreeSurfer [6], and ANTs [1] to provide a state-of-the-art functional MRI processing tool with minimal user input. Another example is the set of pipelines of the Human Connectome Project [7], which combine tools from FSL and FreeSurfer to pre-process structural, functional, and diffusion data from their uniquely high-fidelity open dataset. In both cases, the pipelines leverage toolboxes that are widely trusted in the community; yet, at the same time, substantial variations in results have been observed in these toolboxes as a result of minor data or infrastructure perturbations [9], [8], [15], [13], suggesting that further investigation of their numerical conditioning is required. For such complex pipelines, a lightweight solution has to be found to perform such evaluations with limited code instrumentation.

Numerical evaluations are traditionally performed using techniques such as interval arithmetic [10] that require complete code re-writes and are therefore barely applicable to complex pipelines.
Recently, Monte-Carlo Arithmetic (MCA) [16], [3] provided a practical way to evaluate the uncertainty of numerical results without the need to rewrite the application in a different paradigm. By perturbing floating-point computations, it introduces a controllable amount of noise in the pipelines, effectively sampling results from a random distribution. While this technique is very appealing, it suffers from two main issues that make it impractical at the scale of a complete pipeline. First, it requires that all software components be recompiled for MCA instrumentation, which is not always feasible. Second, it multiplies the execution time by a factor of 10 to 100, which is impractical when executions already take a few hours to complete.

We present Spot, a tool to identify the source of numerical differences in complex pipelines without instrumentation. Using system-call interception through the ReproZip tool [17], Spot traverses graphs of processes and intermediary files to pinpoint the pipeline components that are unstable across execution conditions. When differences start accumulating, effectively masking any further instability, it restores clean data copies through a set of wrapper scripts. Wrapper scripts are also used to restore temporary data that might have been deleted during the execution, and to disambiguate files that have been written by multiple processes. The remainder of this paper presents the design of Spot, and its application to the pre-processing pipelines of the HCP project.

II. TOOL DESCRIPTION
Spot identifies the components in a pipeline, at the resolution level of a system process, that produce different results in different execution conditions. First, a directed bipartite provenance graph is recorded for each pipeline execution, where nodes represent application processes and files, and edges represent read and write file accesses (Figure 1a). Second, transient files, i.e., files that are either deleted during pipeline execution or modified by multiple processes, are identified and disambiguated, resulting in a provenance DAG (Directed Acyclic Graph) in which file nodes have a single parent (in-degree of 1) (Figure 1b). DAGs produced in different conditions are then compared, in a step-by-step execution that prevents the propagation of differences in the pipeline (Figure 1c). The resulting labeled graph identifies the non-reproducible processes in the pipeline.

Fig. 1: Provenance graphs created from the example pipeline in Listing 1: (a) raw provenance graph (ReproZip output), with transient files shown in gray boxes; (b) provenance DAG, with disambiguated transient files; (c) labeled DAG comparing 2 execution conditions, showing 1 non-reproducible process. Processes are represented with circles, files with rectangles, and read/write accesses with plain edges. For convenience, the process tree is also shown, with gray dashed edges. Processes forked by bet were captured by ReproZip while they did not appear in Listing 1. Processes associated with executables located in /usr/bin/ or /bin/ are not shown.

To ensure that a file can be unambiguously associated with the process that created it, we assume that the pipeline can be transformed such that:
1) Processes don't run concurrently;
2) Each process sequentially reads, computes, and writes.
In practice, pipeline processes may still run concurrently provided that they don't write concurrently to the same files. A process may also interleave file writes with computing, for instance when different file blocks are processed sequentially. However, only a single version of the file must eventually be made available to the other processes. In particular, in case a process deletes a file that it had created itself, this file must not be used by any other process. Finally, we also require that processes are associated with a command line (executable and arguments), to facilitate process instrumentation.

Listing 1 (fragment; only the argument check was recovered):

if [ $# != 1 ]
then
  echo "usage: $0 <input>"
  exit 1
fi
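The disambiguation of multi-writer files described above can be sketched as follows. This is a minimal illustration using the process and file names of Figure 1, not Spot's actual implementation:

```python
# Minimal sketch of transient-file disambiguation: files written by several
# processes are split into per-writer versions so that every file node in
# the provenance graph has in-degree 1 (names follow Figure 1).

def disambiguate(accesses):
    """accesses: ordered list of (process, file, mode), mode 'r' or 'w'.
    Returns the edges of the provenance DAG, where the k-th version of a
    multi-writer file is renamed file#k (k > 1)."""
    version = {}  # file -> index of its current version
    edges = []
    for process, f, mode in accesses:
        if mode == "w":
            version[f] = version.get(f, 0) + 1
            name = f if version[f] == 1 else f"{f}#{version[f]}"
            edges.append((process, name))  # process -> file version
        else:
            k = version.get(f, 1)
            name = f if k == 1 else f"{f}#{k}"
            edges.append((name, process))  # file version -> process
    return edges

# Example from Figure 1: bet2 and fslmaths both write output.nii.gz.
trace = [("bet2", "output.nii.gz", "w"),
         ("fslmaths", "output.nii.gz", "r"),
         ("fslmaths", "output.nii.gz", "w"),
         ("fslstats", "output.nii.gz", "r"),
         ("fslstats", "voxels.txt", "w")]
print(disambiguate(trace))
```

After disambiguation, fslstats reads the fslmaths version of output.nii.gz (output.nii.gz#2), so each file node has a single writing parent.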
A. Recording provenance graphs
We use ReproZip [17] to capture: (1) the set of processes created by the pipeline, and (2) the set of files read and written by each process, including temporary files. ReproZip collects this information through the ptrace() system call, with no required instrumentation of the pipeline. Using the ReproZip trace, Spot reconstructs a provenance graph by creating process and file nodes and by adding directed edges corresponding to file reads and writes (Figure 1a).

Provenance graphs are often data-dependent, due to variations in input data that may trigger differing branching or looping patterns across executions, for example. Some of these differences can be neglected: for instance, when a data decompression step is present at the beginning of the execution for some subjects only. Other differences cannot: for instance, when entirely different processing paths are used for different datasets. Spot includes helpers to identify different instances of provenance graphs, such as supporting the clustering of process trees, where nodes are processes and edges are fork() or clone() system calls, using the tree edit distance [19] implemented in Python's zss package.
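As a simplified stand-in for this clustering step (Spot uses the zss tree edit distance; the sketch below instead groups subjects only when their process trees are exactly identical, up to child ordering), the idea can be sketched as:

```python
# Simplified stand-in for provenance-graph clustering: group subjects whose
# process trees have an identical canonical signature. Trees are dicts
# mapping a process to its children; subject and process names are
# illustrative, not taken from a real trace.

def signature(tree, root):
    """Canonical, child-order-independent serialization of a process tree."""
    kids = sorted(signature(tree, child) for child in tree.get(root, []))
    return f"{root}({','.join(kids)})"

def cluster(trees, root="pipeline"):
    """Group subject ids by identical process-tree signature."""
    groups = {}
    for subject, tree in trees.items():
        groups.setdefault(signature(tree, root), []).append(subject)
    return list(groups.values())

# Two subjects with a data-decompression step, one without.
trees = {
    "sub-01": {"pipeline": ["gunzip", "bet"], "bet": ["bet2"]},
    "sub-02": {"pipeline": ["bet", "gunzip"], "bet": ["bet2"]},
    "sub-03": {"pipeline": ["bet"], "bet": ["bet2"]},
}
print(cluster(trees))  # sub-01 and sub-02 fall in the same cluster
```

The tree edit distance used by Spot additionally tolerates small structural differences between trees, which exact-signature matching does not.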
B. Capturing transient files

We capture temporary files by replacing every process P by a wrapper that first calls P and then saves the produced temporary files to a read-only directory. This process replacement is done by pre-pending to the PATH environment variable a directory that contains a wrapper script named after the executable called by P.

Files written by multiple processes are disambiguated using a similar technique. For a file F written by the processes in P = {P_1, ..., P_n}, we first check that the processes in P do not write concurrently to F, which would violate our assumptions. Then, we replace every process P_i by a PATH-based wrapper that first calls P_i and then saves F to a read-only directory. In this way, successive versions of F are preserved for comparison. We finally update the provenance graph accordingly, so that all files in the graph have an in-degree of 1 (Figure 1b). This operation also makes the provenance graph acyclic, since we assumed that a process could only release a single version of a file.
C. Labeling processes

After capturing transient files in the first condition (i.e., operating system, library versions, etc.), we re-run the pipeline step by step in the second condition to label processes. The output files created by a process in both conditions are compared: if no differences are found, the process is marked as reproducible; otherwise, the process is marked as non-reproducible, and the output files produced in the first condition are copied to the second one, to ensure that differences do not propagate further in the pipeline. Processes are instrumented transparently through a modification of the PATH variable similar to the one described previously. By default, differences in output files are identified by comparing file checksums. Other comparison functions can also be defined for specific file types, for instance to ignore file headers or file sections containing timestamps. Spot finally creates a labeled provenance graph highlighting non-reproducible processes.

Figure 1c illustrates a hypothetical incremental labeling of the example in Listing 1. Process bet2 is labeled as non-reproducible (red) as it produces files with differences. To prevent the propagation of these differences, the files produced by bet2 in Condition 2 are replaced with the files produced by bet2 in Condition 1. Processes fslmaths and fslstats are then executed and labeled as reproducible (green) as they produce files without differences.
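The default checksum-based labeling step can be sketched as follows (a minimal illustration; the function name and file layout are hypothetical):

```python
# Sketch of process labeling: compare the outputs of one process across the
# two conditions by checksum; on a mismatch, label it non-reproducible and
# restore the condition-1 files so differences do not propagate further.
import hashlib
import shutil

def sha256(path):
    """Checksum of a file, read in blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(8192), b""):
            h.update(block)
    return h.hexdigest()

def label_process(outputs_cond1, outputs_cond2):
    """outputs_cond*: parallel lists of paths to the process outputs."""
    reproducible = all(sha256(a) == sha256(b)
                       for a, b in zip(outputs_cond1, outputs_cond2))
    if not reproducible:
        # Overwrite condition-2 outputs with the condition-1 versions.
        for a, b in zip(outputs_cond1, outputs_cond2):
            shutil.copyfile(a, b)
    return "reproducible" if reproducible else "non-reproducible"
```

Type-specific comparison functions (e.g., ignoring headers or timestamps) would replace the sha256 comparison for the corresponding file types.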
III. EXPERIMENTS
We applied Spot to the minimal pre-processing pipelines released by the Human Connectome Project (HCP), a leading initiative in neuroimaging.
A. HCP pipelines and dataset
The HCP developed a set of pre-processing pipelines to process structural, functional, and diffusion MRI data acquired in the project. We focus on the HCP pre-processing pipelines for structural data, and particularly on PreFreeSurfer and FreeSurfer. A detailed description of the analyses done by these pipelines is available in [7]. In summary, the PreFreeSurfer pipeline consists of the following steps:
• Gradient Distortion Correction (DC),
• Alignment and Anatomical Average (AAve) of the T1w(s) and T2w(s),
• Anterior/Posterior Commissure Alignment (ACPC-A),
• Brain Extraction (BExt),
• Bias Field Correction (BFC),
• Atlas Registration (AR).
And the FreeSurfer pipeline consists of the following:
• Image downsampling,
• T1w image registration,
• T1w image segmentation,
• Surface placement,
• Surface registration.
We randomly selected 20 unprocessed subjects from the HCP data release S500, available in the ConnectomeDB repository as a subset of the 1200 Subject Release. For each subject, available data consisted of 1 or 2 T1-weighted images and 1 or 2 T2-weighted images; acquisition protocols and parameters, including image dimensions and voxel sizes, are detailed in [18].

B. Data processing
We built Docker images for the HCP pre-processing pipelines v3.19.0 (PreFreeSurfer and FreeSurfer) in CentOS 6.9 (Final) and CentOS 7.4 (Core), available on DockerHub. The container images contain the HCP software dependencies, including FSL (version 5.0.6), FreeSurfer (version 5.3.0-HCP, CentOS4 build), and Connectome Workbench (version 1.0).

We processed the 20 subjects with PreFreeSurfer and FreeSurfer, using the 2 CentOS versions. Each subject was processed twice on the same operating system to detect within-OS variability coming from pseudo-random operations. We compared pipeline results using the FreeSurfer tools mri_diff, mris_diff, and lta_diff, to ignore execution-specific information such as file paths or timestamps. To compare segmentations X and Y, we used the Dice coefficient, defined as follows:

DICE = 2|X ∩ Y| / (|X| + |Y|)
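The Dice comparison can be transcribed directly from the formula above; representing segmentations as sets of voxel coordinates is an illustrative simplification:

```python
# Dice coefficient between two segmentations, represented here as sets of
# foreground voxel coordinates (a direct transcription of the formula above).

def dice(x, y):
    """x, y: sets of voxel coordinates labeled as foreground."""
    if not x and not y:
        return 1.0  # two empty segmentations are identical by convention
    return 2.0 * len(x & y) / (len(x) + len(y))

# Hypothetical 2D example: the two segmentations share 2 voxels.
seg_centos6 = {(0, 0), (0, 1), (1, 0)}
seg_centos7 = {(0, 0), (0, 1)}
print(dice(seg_centos6, seg_centos7))  # 2*2/(3+2) = 0.8
```

A Dice coefficient of 1 indicates identical segmentations; values below 0.9 are commonly taken to indicate substantial disagreement.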
IV. RESULTS
The average processing time per subject was approximately 2 hours for PreFreeSurfer and 8 hours for FreeSurfer. The average output file size was 2.7 GB for PreFreeSurfer and 4.1 GB for FreeSurfer. For each subject, PreFreeSurfer accessed 83K files and created 7.7K processes, and FreeSurfer accessed 62K files and created 4K processes.
1) Within-OS differences:
We did not observe any within-OS difference in PreFreeSurfer. In FreeSurfer, we identified 2 processes leading to within-OS differences due to the use of pseudo-random numbers: image registration with mri_segreg, and cortical surface curvature estimation with mris_curvature. Fixing the random seed used in FreeSurfer removed these differences.

TABLE I: Types of provenance graphs in PreFreeSurfer.

Type | Number of subjects | Number of T1w images | Number of T2w images
-----|--------------------|----------------------|---------------------
1    | 9                  | 2                    | 2
2    | 8                  | 1                    | 1
3    | 1                  | 1                    | 2
4    | 2                  | 2                    | 1
2) Between-OS differences in PreFreeSurfer:
We identified four types of subjects with different PreFreeSurfer provenance graphs (Table I). Differences between subject types came from different numbers of T1 and T2 images in the raw data. We verified that the provenance graphs were identical for all subjects of the same type, for both versions of CentOS.

Figure 2 shows the frequency of non-reproducible pipeline processes in PreFreeSurfer. The processes identified as non-reproducible were observed in linear registration with FSL flirt (in ACPC Alignment, Brain Extraction, Distortion Correction, and Atlas Registration), in non-linear registration with FSL fnirt (in Brain Extraction and Atlas Registration), and in image warping with FSL new_invwarp (in Brain Extraction and Atlas Registration). Differences were also observed in image mean computations with FSL fslmaths (in Anatomical Average). Figure 3 shows a complete PreFreeSurfer labeled DAG, localizing the observed differences in the entire pipeline, for a given subject.

To illustrate the magnitude of the differences, Figure 4 compares fnirt results in Brain Extraction for a particular subject. Differences appear to be important, in particular in the areas framed in red.
3) Between-OS differences in FreeSurfer:
The only non-reproducible process identified by Spot in FreeSurfer was mris_make_surfaces (cortical and white matter surface generation), a dynamically-linked executable that produced different results for 10 out of 20 subjects.

However, FreeSurfer results still differ between conditions, due to the propagation and amplification of differences created in PreFreeSurfer. We observed the effect of this propagation in FreeSurfer results, as shown in Figure 5 for whole-brain segmentations. The Dice coefficients associated with the 44 regions segmented by FreeSurfer are shown in Figure 6: Dice coefficients below 0.9 are observed in most regions, and particularly in the smallest ones.

V. DISCUSSION
Our results provide insights into the reproducibility of neuroimaging pipelines, and into the relevance of the approach implemented in Spot for reproducibility studies.
A. Key findings
Linear and non-linear registration with FSL were found to frequently lead to differences between results obtained with different operating systems. This does not come as a surprise given the instabilities associated with these processes. It also corroborates our previous findings in [8], where fMRI pre-processing with FSL was found to vary across operating systems starting from the motion correction step, a step that uses FSL's flirt tool internally. It would be relevant to investigate whether the observed instability of registration processes generalizes to other toolkits, or whether it remains specific to FSL. In view of the effect of small data perturbations on a variety of toolboxes and processes, such as cortical surface extraction using FreeSurfer and CIVET [15] or connectome estimation using Dipy [14], it is probable that this observation generalizes widely across toolboxes and requires a deeper investigation of the stability of linear and non-linear registration.

While only a handful of processes were found non-reproducible across the tested operating systems, the effects of such instabilities were found to propagate widely in the pipelines, and to substantially impact the segmentations created by FreeSurfer. This illustrates the need to conduct reproducibility studies on entire pipelines rather than isolated processes. It also highlights the need for a deeper stability analysis of pipeline processes.

As shown in Figure 2, the reproducibility of a given tool may vary across subjects and across processing parameters. For instance, linear registration with flirt seems to be fully reproducible in the Anatomical Average sub-pipeline, while it is highly non-reproducible in ACPC Alignment. In Brain Extraction, the same tool was found reproducible for some subjects only. Therefore, reproducibility studies need to be performed on several subjects.
While this is common practice to some extent in neuroimaging, software tests are often executed only on a single dataset to reduce the associated computational load. Our results show that pipeline tests should encompass enough subjects to cover execution paths adequately.

Our results also illustrate the type of variability that can be introduced in neuroimaging results by operating system updates. The numerical noise introduced by operating system updates is realistic, as such updates are likely to occur throughout the time span of a neuroscience study, but it is also uncontrolled, as it originates in updates of low-level libraries by third-party developers. A possible method to study this problem more comprehensively would be to introduce controlled numerical perturbations in pipelines, either in the data or in floating-point computations through Monte-Carlo Arithmetic [16]. The work in [14] discusses and compares these two techniques.
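The Monte-Carlo Arithmetic idea can be illustrated with a simplified random-rounding model, where each value is perturbed by a random relative error of magnitude 2^-t (t playing the role of the virtual precision). Real MCA, e.g., through Verificarlo [3], instruments every floating-point operation; this sketch only perturbs individual input values:

```python
# Simplified illustration of Monte-Carlo Arithmetic: perturb values with a
# random relative error of magnitude 2**-t, then sample a computation many
# times to estimate the uncertainty of its result. This is a toy model of
# MCA, not an instrumented-arithmetic implementation.
import random
import statistics

def perturb(x, t=24):
    """Apply a random relative error of magnitude 2**-t to x."""
    return x * (1.0 + random.uniform(-1.0, 1.0) * 2.0 ** -t)

random.seed(0)
# Repeat the same computation under perturbation to sample its distribution.
samples = [perturb(0.1) + perturb(0.2) for _ in range(1000)]
print(statistics.mean(samples), statistics.stdev(samples))
```

The spread of the sampled results (here the standard deviation) quantifies the sensitivity of the computation to the injected noise; unstable pipeline steps would show a much larger spread than stable ones.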
B. Spot evaluation
The processes identified by Spot as non-reproducible were all associated with dynamically-linked executables. This makes sense, as statically-linked executables are not impacted by library updates. Moreover, the hypothetical effects of hardware or Linux kernel updates were not measured, as the different operating systems were deployed in Docker containers on the same host, that is, using the same kernel and hardware.

To evaluate the reproducibility of a pipeline, Spot needs to execute it 3 times, in order to (1) record a first ReproZip trace, (2) save transient files, and (3) compare results in the second condition. This is one more execution than the theoretical minimum of 2. It might be possible to reduce this overhead by executing at step (2) only the processes depending on transient files.

We demonstrated the applicability of our approach by evaluating two of the arguably most complex pipelines in neuroimaging. Technically, these pipelines consist of a mix of tools assembled from different toolboxes through a variety of scripts written in different languages. Our file-based approach, notably enabled by ReproZip, was able to analyze these pipelines without requiring their instrumentation, which saved a very substantial technical effort. The assumptions made on the pipeline structure, related to the absence of concurrent writes, were not violated in our analysis, and are unlikely to impede Spot's applicability to the most common neuroimaging pipelines.

Fig. 2: Heatmap of non-reproducible processes across PreFreeSurfer pipeline steps. Each cell represents the occurrence of a particular command line in a pipeline step among Anatomical Average (AAve), Anterior/Posterior Commissure Alignment (ACPC-A), Brain Extraction (BExt), Bias Field Correction (BFC), or Atlas Registration (AR). Cell labels indicate the fraction of subjects for which the corresponding process wasn't reproducible. For example, the flirt tool was invoked 6 times in step DC for each of the 20 subjects: 2 instances weren't reproducible in 19 subjects, 3 instances were always reproducible, and 1 instance wasn't reproducible in 17 subjects.

File-based analyses also have limitations related to the granularity at which they operate. Indeed, differences can only be identified at the level of an entire operating-system process, which can correspond to arbitrary amounts of code. Narrowing down the analysis to particular libraries, functions, or even code sections would require another approach. Similarly, Spot would not be able to detect differences in data not saved in files but instead passed to subsequent processes in memory. A common scenario in neuroimaging pipelines is that tools return results in their standard output, which is parsed by the calling process and passed to subsequent ones through variables.
VI. CONCLUSION
We presented Spot, a tool to detect the source of numerical differences in complex pipelines executed in different computational conditions. Spot leverages system-call interception through the ReproZip tool, and can therefore be applied to the most complex pipelines without requiring their instrumentation. It is available at https://github.com/big-data-lab-team/spot under the MIT license.

By applying Spot to the pre-processing pipelines of the Human Connectome Project, compared across different operating systems, we showed that between-OS differences mostly originate in linear and non-linear image registration tools. Moreover, differences introduced during image registration propagate widely in the pipelines, leading to important variability in whole-brain segmentations.

Future work will investigate in more detail the numerical stability of registration algorithms. Additionally, we plan on using Monte-Carlo arithmetic to inject controlled amounts of noise in pipelines and monitor uncertainty propagation and amplification in their results.
VII. ACKNOWLEDGMENTS
REFERENCES

[1] Brian B. Avants, Nick Tustison, and Gang Song. Advanced Normalization Tools (ANTs). Insight Journal, 2:1–35, 2009.
[2] Robert W. Cox. AFNI: what a long strange trip it's been. NeuroImage, 62(2):743–747, 2012.
[3] Christophe Denis, Pablo de Oliveira Castro, and Eric Petit. Verificarlo: checking floating point accuracy through Monte Carlo arithmetic, 2016.
[4] Kai Diethelm. The limits of reproducibility in numerical simulation. Computing in Science & Engineering, 14(1):64–72, 2011.
[5] Oscar Esteban, Christopher J. Markiewicz, Ross W. Blair, Craig A. Moodie, A. Ilkay Isik, Asier Erramuzpe, James D. Kent, Mathias Goncalves, Elizabeth DuPre, Madeleine Snyder, et al. fMRIPrep: a robust preprocessing pipeline for functional MRI. Nature Methods, 16(1):111, 2019.
[6] Bruce Fischl. FreeSurfer. NeuroImage, 62(2):774–781, 2012.
[7] Matthew F. Glasser, Stamatios N. Sotiropoulos, J. Anthony Wilson, Timothy S. Coalson, Bruce Fischl, Jesper L. Andersson, Junqian Xu, Saad Jbabdi, Matthew Webster, Jonathan R. Polimeni, et al. The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage, 80:105–124, 2013.
Fig. 3: A complete provenance graph from the PreFreeSurfer pipeline. For better visualization, processes associated with commands in /bin or /usr/bin were omitted, as well as imtest, imcp, remove_ext, fslval, avscale, and fslhd.

Fig. 4: Differences between T2 fnirt results in PreFreeSurfer's Brain Extraction (CentOS6 vs CentOS7). An animated version of the comparison is available here.

Fig. 5: Sum of binarized differences between whole-brain FreeSurfer segmentations obtained from PreFreeSurfer processings in CentOS6 vs CentOS7 (N=20). Segmentations were resampled and overlaid to the MNI152 volume template. An animated comparison of segmentations obtained for a particular subject is available here.

[8] Tristan Glatard, Lindsay B. Lewis, Rafael Ferreira da Silva, Reza Adalat, Natacha Beck, Claude Lepage, Pierre Rioux, Marc-Etienne Rousseau, Tarek Sherif, Ewa Deelman, Najmeh Khalili-Mahani, and Alan C. Evans. Reproducibility of neuroimaging analyses across operating systems. Frontiers in Neuroinformatics, 9:12, 2015.
[9] Ed H. B. M. Gronenschild, Petra Habets, Heidi I. L. Jacobs, Ron Mengelers, Nico Rozendaal, Jim van Os, and Machteld Marcelis. The effects of FreeSurfer version, workstation type, and Macintosh operating system version on anatomical volume and cortical thickness measurements. PLoS ONE, 7(6):e38234, 2012.
[10] Timothy Hickey, Qun Ju, and Maarten H. Van Emden. Interval arithmetic: from principles to implementation. Journal of the ACM, 48(5):1038–1068, 2001.
[11] Mark Jenkinson, Christian F. Beckmann, Timothy E. J. Behrens, Mark W. Woolrich, and Stephen M. Smith. FSL. NeuroImage, 62(2):782–790, 2012.
[12] Fabienne Jézéquel, Jean-Luc Lamotte, and Issam Saïd. Estimation of numerical reproducibility on CPU and GPU. In 2015 Federated Conference on Computer Science and Information Systems (FedCSIS), pages 675–680. IEEE, 2015.
[13] David N. Kennedy, Sanu A. Abraham, Julianna F. Bates, Albert Crowley, Satrajit Ghosh, Tom Gillespie, Mathias Goncalves, Jeffrey S. Grethe, Yaroslav O. Halchenko, Michael Hanke, et al. Everything matters: the ReproNim perspective on reproducible neuroimaging. Frontiers in Neuroinformatics, 13:1, 2019.
[14] Gregory Kiar, Pablo de Oliveira Castro, Pierre Rioux, Eric Petit, Shawn T. Brown, Alan C. Evans, and Tristan Glatard. Comparing perturbation models for evaluating stability of neuroimaging pipelines, 2019.
[15] L. B. Lewis, C. Y. Lepage, N. Khalili-Mahani, M. Omidyeganeh, S. Jeon, P. Bermudez, A. Zijdenbos, R. Vincent, R. Adalat, and A. C. Evans. Robustness and reliability of cortical surface reconstruction in CIVET and FreeSurfer. Annual Meeting of the Organization for Human Brain Mapping, 2017.
[16] Douglass Stott Parker. Monte Carlo Arithmetic: exploiting randomness in floating-point arithmetic. University of California, Los Angeles, Computer Science Department, 1997.
[17] Rémi Rampin, Fernando Chirigati, Dennis Shasha, Juliana Freire, and Vicky Steeves. ReproZip: the reproducibility packer. Journal of Open Source Software, 1(8):107, 2016.
[18] David C. Van Essen, Stephen M. Smith, Deanna M. Barch, Timothy E. J. Behrens, Essa Yacoub, Kamil Ugurbil, Wu-Minn HCP Consortium, et al. The WU-Minn Human Connectome Project: an overview. NeuroImage, 80:62–79, 2013.
[19] Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989.
Fig. 6: Dice coefficients between CentOS6 and CentOS7 segmentations for each of the 44 regions segmented by FreeSurfer, ordered by increasing Dice coefficient, from small structures (non-WM hypointensities, vessels, optic chiasm) to large ones (cerebral and cerebellum white matter and cortex, background).