Performance Evaluation of Big Data Processing Strategies for Neuroimaging
Valérie Hayot-Sasson, Shawn T. Brown and Tristan Glatard
Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
Montreal Neurological Institute, McGill University, Montreal, Canada
Abstract—Neuroimaging datasets are rapidly growing in size as a result of advancements in image acquisition methods, open science and data sharing. However, the adoption of Big Data processing strategies by neuroimaging processing engines remains limited. Here, we evaluate three Big Data processing strategies (in-memory computing, data locality and lazy evaluation) on typical neuroimaging use cases, represented by the BigBrain dataset. We contrast these strategies using Apache Spark and Nipype as our representative Big Data and neuroimaging processing engines, on Dell EMC's Top-500 cluster. Big Data thresholds were modeled by comparing the data-write rate of the application to the filesystem bandwidth and number of concurrent processes. This model acknowledges the fact that page caching provided by the Linux kernel is critical to the performance of Big Data applications. Results show that in-memory computing alone speeds up executions by a factor of up to 1.6, whereas when combined with data locality, this factor reaches 5.3. Lazy evaluation strategies were found to increase the likelihood of cache hits, further improving processing time. Such important speed-up values are likely to be observed on typical image processing operations performed on images of size larger than 75 GB. A ballpark speculation from our model showed that in-memory computing alone will not speed up current functional MRI analyses unless coupled with data locality and processing around 280 subjects concurrently. Furthermore, we observe that emulating in-memory computing using in-memory file systems (tmpfs) does not reach the performance of an in-memory engine, presumably due to swapping to disk and the lack of data cleanup. We conclude that Big Data processing strategies are worth developing for neuroimaging applications.
I. INTRODUCTION
Big Data processing engines have significantly improved the performance of Big Data applications by diminishing the amount of data movement that occurs during the execution of an application. Locality-aware scheduling, introduced by the MapReduce [1] framework, reduced the overall costs associated to network transfer of data by scheduling tasks to the nodes located nearest to the data. Reduction of data movement was further improved upon through in-memory computing [2], which ensures that data is maintained in memory between tasks whenever possible. To further reduce the cost of data movement, lazy evaluation, the process of performing computations only when invoked, was leveraged by Big Data frameworks to enable further optimizations such as regrouping of tasks and computing only what is necessary.

Frameworks such as MapReduce and Spark [2] have become mainstream tools for data analytics, although many others, such as Dask [3], are emerging. Meanwhile, several scientific domains including bioinformatics, physics or astronomy have entered the Big Data era due to increasing data volumes and variety. Nevertheless, the adoption of Big Data engines for scientific data analysis remains limited, perhaps due to the widespread availability of scientific processing engines such as Pegasus [4] or Taverna [5], and the adaptations required in Big Data processing engines for scientific computing.

Scientific applications differ from typical Big Data use cases, which might explain the remaining gap between Big Data and scientific engines. While Big Data applications mostly target text processing (e.g. Web search, frequent pattern mining, recommender systems [6]) implemented in consistent software libraries, scientific applications often involve binary data such as images and signals, processed by a sequence of command-line/containerized tools using a mix of programming languages (C, Fortran, Python, shell scripts), referred to as workflows or pipelines. With respect to infrastructure, Big Data applications commonly run on clouds or dedicated commodity clusters with locality-aware file systems such as the Hadoop Distributed File System (HDFS [7]), whereas scientific applications are usually deployed on large, shared clusters where data is transferred between data and compute nodes through shared file systems such as Lustre [8]. Such differences in applications and infrastructure have important consequences. To mention only one, in-memory computing requires instrumentation to be applied to command-line tools.

Technological advances of the past decade, in particular page caching in the Linux kernel [9], in-memory file systems (tmpfs) and memory-mapped files, might also explain the lack of adoption of Big Data engines for scientific applications. In such configurations, in-memory computing would be a feature provided by the operating system rather than by the engine itself. The frontier between these two components is blurred and needs to be clarified.

Our primary field of interest, neuroimaging, is no exception to the generalized rise of data volumes in science, due to the joint increase of image resolution and subject cohort sizes [10]. Processing engines have been developed with neuroinformatics applications in mind, for instance Nipype [11] or the Pipeline System for Octave and Matlab (PSOM [12]). Big Data engines have also been used for neuroimaging applications, including the Thunder project [13] and in more specific works such as [14].
However, no quantitative performance evaluation has been conducted on neuroimaging applications to assess the added value of Big Data engines compared to traditional processing engines. This paper addresses the following questions:

1) What is the effect of in-memory computing, lazy evaluation and data locality on current neuroimaging applications?
2) Can in-memory computing be effectively enabled by the operating system rather than the data processing engine?

Answers to these questions have important implications. In [15], a comparative study of Dask, Spark, TensorFlow, MyriaDB, and SciDB on neuroinformatics use cases is presented. It concludes that these systems need to be extended to better address scientific code integration, data partitioning, data formats and system tuning. We argue that such efforts should only be conducted if substantial performance improvements are expected from in-memory computing, lazy evaluation or data locality. On the other hand, neuroimaging data processing engines are still being developed, and the question remains whether these projects should just be migrated to Spark, Dask, or other Big Data engines.

Our study focuses on performance. We intentionally do not compare Big Data and scientific data processing engines on the grounds of workflow language expressivity, fault tolerance, provenance capture and representation, portability or reproducibility, which are otherwise critical concerns, addressed for instance in [16]. Besides, our study of performance focuses on the impact of data writes and transfers. It purposely leaves out task scheduling to computing resources, to focus on the understanding of data writes and movement. Task scheduling will be part of our discussion, however.

In terms of infrastructure, we focus on the case of High-Performance Computing (HPC) clusters that are typically available through University facilities or national computing infrastructures such as XSEDE, Compute Canada or PRACE, as neuroscientists typically use such platforms. We assume that HPC systems are multi-tenant, that compute nodes are accessible through a batch scheduler, and that a file system shared among the compute nodes is available. We intentionally did not consider distributed, HDFS-like file systems, as initiatives to deploy them in HPC centers, for instance Hadoop on-demand [17], have not become mainstream yet.

Our methods, including performance models, processing engines, applications and infrastructure used, are described in Section II. Section III presents our results, which we discuss in Section IV along with the two research questions mentioned previously. Section V concludes on the relevance of Big Data processing strategies for neuroimaging applications.

II. MATERIALS AND METHODS
The application pipelines, benchmarks, performance data, and analysis scripts used to implement the methods described hereafter are all available at https://github.com/big-data-lab-team/paper-in-mem-locality for further inspection and reproducibility. Links to the processing engines and processed data are provided in the text.
A. Engines

1) Apache Spark:
Apache Spark is a well-established Scala-based processing framework for Big Data, with APIs in Java, Python and R. Its generalized nature allows Spark to not only be applied to batch workflows, but also SQL queries, iterative machine learning applications, data streaming and graph processing. Spark's main features include data locality, in-memory processing and lazy evaluation, which it achieves through its principal abstraction, the Resilient Distributed Dataset (RDD).

An RDD is an immutable parallel data structure that achieves fault tolerance through the concept of lineage [18]. Rather than permitting fine-grained transformations, only coarse-grained transformations (applying to many elements in the RDD) can be applied, thereby making it simple to maintain a log of data modifications. This log, known as the lineage, is used to reproduce any lost data modifications.

Two types of operations can be performed on RDDs: transformations and actions. Applying a transformation to an RDD produces a new child RDD through a narrow or wide dependency. A narrow dependency signifies that the child is only dependent on a single parent partition, whereas a child RDD is dependent on all parent partitions in a wide dependency. Examples of transformations include map, filter and join. To materialize an RDD, an action must be performed, such as a reduce or a collect. Lazy evaluation is represented in Spark through the use of transformations and actions. A series of transformations may be defined without the data ever being materialized. Using this strategy, Spark can optimize data processing throughout the application.

All actions and wide dependencies require a shuffle – Spark's most costly operation. Every shuffle begins with each map task saving its data to local files for fault tolerance. The shuffle operation then redistributes the data across the partitions as requested. A shuffle marks a stage boundary in Spark, where reduce-like operations will not begin until all dependent map tasks have completed.

Although Spark uses in-memory computing, it is not necessary for all the data to fit in memory. Spark will spill any RDD elements that cannot be maintained in memory to disk. Moreover, as Spark transformations generate new RDDs and numerous transformations may occur within a single application, Spark implements a Least-Recently Used (LRU) eviction policy. If an evicted RDD needs to be reused, Spark will recompute it using the lineage data collected. As demonstrated in [13], caching significantly improves processing times of iterative algorithms where RDDs are reused. It can be of even greater importance if the RDD is costly to recompute.

Data locality in Spark is achieved through the scheduling of tasks to partitions which have the data loaded in memory. If the data is instead stored on HDFS, the scheduler will assign it to one of the preferred locations specified by HDFS. Spark's scheduler utilizes delay scheduling to optimize fairness and locality for all tasks.

Three different types of schedulers are compatible with Spark: 1) Spark Standalone, 2) YARN [19] and 3) Mesos [20]. The Spark Standalone scheduler is the default scheduler. The YARN scheduler is designed for Hadoop [21] clusters and is prepackaged with Hadoop installations, whereas Mesos was designed to be used in multi-tenant cluster environments. In our experiments, we focus on the Spark Standalone cluster.

Executing Spark applications on HPC systems with Spark-unaware schedulers may be inefficient. The amount of resources requested by Spark may impede Spark-cluster scheduling time.
Using pilot-scheduling strategies to add nodes to the Spark cluster as they are allocated by the underlying schedulers may speed up allocation and overall processing time [22]. This, however, is not studied in the current paper. As Spark is frequently used by the scientific community, we designed our experiments using the PySpark API. This comes at a cost to performance, as the PySpark code must undergo Python-to-Java serialization. We used Spark version 2.3.2 installed from https://spark.apache.org.
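To make the transformation/action distinction concrete, the following minimal PySpark sketch (with hypothetical chunk identifiers, not taken from our pipeline code) shows how a chain of map transformations stays lazy until an action forces materialization, and how persist() keeps intermediate results in engine memory:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-increment-demo").getOrCreate()
sc = spark.sparkContext

# Parallelize a list of (hypothetical) chunk identifiers into 4 partitions.
chunks = sc.parallelize(range(8), 4)

# Transformations are lazy: nothing runs yet, Spark only records the lineage.
incremented = chunks.map(lambda x: x + 1).map(lambda x: x + 1)

# persist() marks the RDD to be kept in memory once it is materialized.
incremented.persist()

# An action (collect) triggers the whole lineage; both maps are pipelined
# within each partition, so no intermediate data touches disk.
print(incremented.collect())

spark.stop()
```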
2) Nipype:
Nipype is a popular Python neuroimaging processing engine. It aims at being a solution for easily creating reproducible neuroimaging workflows. Although Nipype does not employ any Big Data processing strategies, it provides plugins to numerous schedulers found in most clusters readily available to researchers, such as the Sun/Oracle Grid Engine (SGE/OGE), TORQUE, Slurm and HTCondor. It also includes its own scheduler, MultiProc, for parallel processing on single nodes. Furthermore, Nipype provides many built-in interfaces to commonly used neuroimaging tools that can be incorporated within the workflows. Nipype's ability to easily parallelize workflows in researcher-available cluster setups, capture detailed provenance information necessary for reproducibility, and allow users to easily integrate existing neuroimaging tools, makes it preferable over existing Big Data solutions, which would necessitate modifications to achieve this.

Jobs, or Interfaces in Nipype, are encapsulated by a Node. A Node dictates that the job will execute on a single input. However, the MapNode, a child variant of the Node, can execute on multiple inputs. All tasks in Nipype execute in their own uniquely named subdirectory, which facilitates provenance tracking of inputs and outputs and also enables checkpointing of the workflow. In the case of Node failure or application modification, only the nodes which have been modified (verified by hash) or have not successfully completed are re-executed.

In order for Nipype to operate as intended, a filesystem shared by all the nodes is required. It is still possible to save to a non-shared local filesystem, but this may come at the expense of fault tolerance, as data located on failed nodes will be permanently lost. Moreover, the user will need to ensure that the files are appropriately directed to the nodes that require them, as there is no guarantee of data locality in Nipype.

We used Nipype version 1.1.4 installed through the pip package manager.
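As an illustration of these concepts, here is a minimal, hypothetical Nipype workflow (the function and node names are ours, not those of our benchmark pipeline) that wraps a pure-Python increment step in Nodes and runs them with the MultiProc plugin:

```python
from nipype import Node, Workflow
from nipype.interfaces.utility import Function

def increment(x):
    """Toy task: increment a number (stands in for an image-processing step)."""
    return x + 1

# Each Node wraps one Interface and runs in its own uniquely named subdirectory.
inc1 = Node(Function(input_names=['x'], output_names=['out'],
                     function=increment), name='increment_1')
inc2 = Node(Function(input_names=['x'], output_names=['out'],
                     function=increment), name='increment_2')
inc1.inputs.x = 1

# The workflow's base_dir is where Nipype checkpoints intermediate results.
wf = Workflow(name='toy_increment', base_dir='/tmp/nipype_work')
wf.connect(inc1, 'out', inc2, 'x')

# MultiProc parallelizes independent nodes on a single compute node.
wf.run(plugin='MultiProc', plugin_args={'n_procs': 2})
```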
B. Data Storage Locations
Data storage location is critical to the performance of Big Data applications on HPC clusters. Data may reside in the engine memory, on a file system whose contents reside in virtual memory (for instance tmpfs), on disks local to the processing node, or on a shared file system. Table I summarizes the Big Data strategies that can be used depending on the data location. In addition, lazy evaluation is available in Spark regardless of data location. The remainder of this Section explains this Table and provides related performance models.

TABLE I: Big Data strategies on a shared HPC cluster.

Data Location        In-Memory Computing   Data Locality
In-memory            Yes                   Yes
tmpfs                Yes                   Yes
Local Disk           Page Caching          Yes
Shared File System   Page Caching          No
1) In-Engine-Memory and In-Memory File System:
The main difference between storing the data in the engine memory, as in Spark, and simply writing to an in-memory file system, such as tmpfs, is what happens when the processed data fills up the available memory. When engine memory is used, the engine must clean up unused data to avoid crashes. When an in-memory file system is used, the user is responsible for ensuring that the filesystem does not reach capacity. Should it be necessary to free up additional memory, the kernel will swap filesystem memory to disk and the performance will become that of local disk writes. In our experiments, we explore the configuration where the data consumed by the application approaches the threshold of available memory.
2) Local Disk:
Storing data on local disks inevitably enables data locality, since data transfers are not necessary when tasks are executed on the nodes where the data resides. However, in the absence of a more specific filesystem such as HDFS to handle file replication across computing nodes, data locality comes at the price of stringent scheduling restrictions, as tasks can only be scheduled to the single node that contains their input data.

The performance of local disk accesses is strongly dependent on the page caching mechanism provided by the Linux kernel, described in detail in [9]. To summarize, data read from disk remains cached in memory until evicted by an LRU (Least Recently Used) strategy. When a process invokes the read() system call, the kernel will return the data directly from memory if the requested data lies in the page cache, realizing a cache hit. Cache hits drastically speed up data reads, by masking the disk latency and bandwidth behind a memory buffer. In effect, page caching provides in-memory computing transparently to the processing engine. However, page cache eviction strategies currently cannot be controlled by the application, which prevents processing engines from anticipating reads by preloading the cache. Scheduling strategies might be designed to maximize cache hits, however. For instance, lazy evaluation could result in more cache hits by scheduling data-dependent tasks on the same node.

Page caching has a more dramatic effect on disk writes, reducing their duration by several orders of magnitude. When a process calls the write() system call, data is copied to a memory cache that is asynchronously written to disk by flusher threads, when memory shrinks, when "dirty" (unwritten) data grows, or when a process invokes the sync() system call. This asynchronous flushing of the page cache is called write-back.

Page caching is essentially a way to emulate in-memory computing at the kernel level, without requiring a dedicated engine. The size of the page cache, however, becomes a limitation when processes write faster than the disk bandwidth. When this happens, the page cache rapidly fills up and writes are limited by the disk write bandwidth as if no page cache was involved.

We introduce the following basic model to describe the filling and flushing of the page cache by an application:

d(t) = (D/C − δ/γ) t + d₀,

where:
• d(t) is the amount of data in the page cache at time t
• D is the total amount of data written by the application
• C is the total CPU time of the application
• δ is the disk bandwidth
• γ is the max number of concurrent processes on a node
• d₀ is the amount of data in the page cache at time t = 0

This model applies to parallel applications assuming that (1) concurrent processes all write the same amount of data, (2) concurrent processes all consume the same CPU time, and (3) data is written uniformly along task execution. With these assumptions, all the processes will write at the same rate, which explains why the model does not depend on the total number of concurrent processes in the application, but only on the max number of concurrent processes executing on the same node (γ).
While these assumptions would usually be violated in practice, this simple model already provides interesting insights on the performance of disk writes, as shown later. Naturally, the model also ignores other processes that might be writing to disk concurrently with the application, which we assume negligible here.

In general, an application should ensure that ḋ remains negative or null, leading to the following inequality:

D/C ≤ δ/γ    (1)

This defines a D/C (data-write) rate beyond which the page cache becomes asymptotically useless. It should be noted that the transient phase during which the page cache fills up might last a significant amount of time, in particular when ḋ is positive and small. We intentionally do not model the transient phase, as it requires detailed knowledge of parameters that are difficult to estimate, such as the page cache size and the initial amount of data in it (d₀).

We will use Equation 1 to define our benchmarks and interpret the results. It should be noted that leveraging the page cache, and therefore ensuring that Equation 1 holds, has important performance implications: with page caching, the write throughput will be that of memory, while without page caching it will be that of the disk.
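To illustrate how Equation 1 is used, the short Python sketch below evaluates the page-cache growth rate D/C − δ/γ for one of our experimental workloads (750 GB written over 4,300 s of CPU time, 9 concurrent processes per node, and a local-disk bandwidth of about 190 MB/s; these figures appear later in Tables II and III):

```python
def page_cache_slope(d_gb, c_s, delta_mb_s, gamma):
    """Return the page-cache growth rate D/C - delta/gamma, in MB/s."""
    d_over_c = d_gb * 1024 / c_s       # application data-write rate (MB/s)
    flush_rate = delta_mb_s / gamma    # per-process disk flush bandwidth (MB/s)
    return d_over_c - flush_rate

slope = page_cache_slope(d_gb=750, c_s=4300, delta_mb_s=193.64, gamma=9)

if slope <= 0:
    print("Equation 1 holds: the page cache can absorb all writes.")
else:
    print(f"Page cache fills at ~{slope:.1f} MB/s per process; "
          "writes eventually fall back to disk bandwidth.")
```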
3) Shared File System:
We model a shared file system using its global realized bandwidth ∆, shared by all concurrent processes in the cluster. We are aware that such a simplistic model does not describe at all the intricacies of systems such as Lustre. In particular, metadata management, RPC protocol optimizations and storage optimizations are all covered under the realized bandwidth. We do, however, consider the effect of page caching in shared file systems, since in Linux writes to network-mounted volumes benefit from this feature too.

TABLE II: Measured bandwidths.

Data location    Measured write bandwidth (MB/s)
tmpfs            1377.18
Local disk (δ)   193.64
Lustre (∆)       504.03

As in the local disk model, we note that page caching will only be useful when the flush bandwidth is greater than the write throughput of the application, that is:

D/C ≤ ∆/Γ    (2)

where Γ is the max number of concurrent processes in the cluster. Note that ∆/Γ will usually be much lower than δ/γ.
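As a worked check of Equation 2, the following sketch uses the measured bandwidths of Table II and the 125-chunk configuration of Table III (both introduced below) to compute the ratios (D/C)/(δ/γ) and (D/C)/(∆/Γ) that we rely on later to predict page cache saturation:

```python
# Measured bandwidths (Table II), in MB/s.
DELTA_LOCAL = 193.64   # local disk, delta
DELTA_LUSTRE = 504.03  # Lustre, Delta

# Experiment 1 conditions (Table III): 750 GB written over 4,300 s of CPU time,
# 9 processes per node (gamma), 125 concurrent processes in the cluster (Gamma).
d_over_c = 750 * 1024 / 4300                    # ~178.6 MB/s
local_ratio = d_over_c / (DELTA_LOCAL / 9)      # (D/C) / (delta/gamma), ~8.3
lustre_ratio = d_over_c / (DELTA_LUSTRE / 125)  # (D/C) / (Delta/Gamma), ~44.7

print(f"(D/C)/(delta/gamma) = {local_ratio:.1f}")
print(f"(D/C)/(Delta/Gamma) = {lustre_ratio:.1f}")
# Ratios above 1 mean Equations 1 and 2 do not hold: the page cache
# eventually saturates and writes fall back to disk/Lustre bandwidth.
```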
C. Infrastructure

All experiments were executed on Dell EMC's Zenith cluster, a Top-500 machine in the Dell EMC HPC and AI Innovation Lab, running Slurm. For the Spark experiments, a Spark cluster was started on a Slurm allocation comprised of 16 dedicated nodes. Each compute node runs Red Hat Enterprise Linux Server release 7.4 (Maipo) as the base operating system, with kernel version 3.10.0-693.17.1.el7.x86_64 (patched for Spectre/Meltdown). Dell EMC PowerEdge C6420 servers with dual Intel Xeon Gold 6148/F processors (40 cores per node) and 192 GB of 2666 MHz memory serve as the compute nodes. Each compute node has a 120 GB M.2 SATA SSD as local disk. A Dell HPC Lustre Solution with a raw storage of 960 TB is accessible on each compute node through a 100 Gb/s Intel OmniPath network. All the nodes connect to a director switch in a 1:1 non-blocking topology. The realized write bandwidths of the local disk, the Lustre file system and tmpfs were measured by sequentially writing various numbers of image blocks containing random intensities, to avoid caching effects (see measure_bandwidth.py in our repository). They are reported in Table II.
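A minimal sketch of the kind of sequential-write benchmark performed by measure_bandwidth.py is shown below. This is our own simplified illustration rather than the repository script; the block size and output path are placeholders. Flushing and calling fsync after each block ensures the timing reflects device bandwidth rather than page-cache writes:

```python
import os
import time
import numpy as np

def measure_write_bandwidth(path, n_blocks=10, block_mb=512):
    """Write random blocks sequentially and return the realized bandwidth in MB/s."""
    block = np.random.randint(0, 255, size=block_mb * 1024 * 1024, dtype=np.uint8)
    start = time.time()
    with open(path, 'wb') as f:
        for _ in range(n_blocks):
            f.write(block.tobytes())
            f.flush()
            os.fsync(f.fileno())   # force the data out of the page cache
    elapsed = time.time() - start
    os.remove(path)
    return n_blocks * block_mb / elapsed

print(measure_write_bandwidth('/tmp/bench.bin'))
```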
D. Datasets
We used BigBrain [23], a 75 GB, 40 µm isotropic histological image of a 65-year-old human brain. The BigBrain was selected due to its uniqueness, as there does not yet exist a higher-resolution image of a human brain. Moreover, there currently exists a lack of standardized tools for processing the BigBrain as a consequence of its size. To examine the effects that processing the BigBrain has on Big Data strategies, we partitioned the full image into 30, 125 and 750 chunks. Additionally, the full image was also split in half, and the half image was partitioned into 125 chunks.

Processing large images is only one part of the Big Data problem in neuroscience. The other is the processing of large MRI datasets, that is, datasets consisting of many small brain images belonging to various different subjects. This situation is commonly observed in functional MRI (fMRI), where it is becoming increasingly common to process data from hundreds of subjects. Although we have not explicitly explored the processing of large MRI datasets, the 75 GB BigBrain is within the size ballpark [10] of MRI datasets commonly processed in today's studies.

Since both small and large datasets may need to be processed using the same analysis pipeline, we examined the effects of the data management strategies on small data as well. For this, we selected a 12 MB T1-weighted image belonging to subject 1 of OpenNeuro's ds000001 dataset, version 6. In order to split the image into 125 equal-sized chunks, it was necessary to zero-pad the image, which subsequently increased the total image size to 13 MB.
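The chunking itself can be pictured as splitting a 3D array into an even grid of blocks after zero-padding. The numpy sketch below is our own illustration (the actual splitting was done with the scripts in our repository, and the toy volume dimensions here are hypothetical):

```python
import numpy as np

def split_into_chunks(img, blocks_per_dim):
    """Zero-pad a 3D array so each dimension divides evenly, then split it
    into blocks_per_dim**3 equal-sized chunks."""
    padded_shape = [int(np.ceil(s / blocks_per_dim)) * blocks_per_dim for s in img.shape]
    padded = np.zeros(padded_shape, dtype=img.dtype)
    padded[:img.shape[0], :img.shape[1], :img.shape[2]] = img

    bx, by, bz = (s // blocks_per_dim for s in padded_shape)
    chunks = []
    for i in range(blocks_per_dim):
        for j in range(blocks_per_dim):
            for k in range(blocks_per_dim):
                chunks.append(padded[i*bx:(i+1)*bx, j*by:(j+1)*by, k*bz:(k+1)*bz])
    return chunks

# Toy volume: 125 chunks correspond to a 5x5x5 grid of blocks.
volume = np.random.rand(100, 100, 100).astype(np.float32)
chunks = split_into_chunks(volume, blocks_per_dim=5)
print(len(chunks))  # 125
```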
E. Applications

Algorithm 1: Incrementation

Input:
  x: a sleep delay in seconds
  n: a number of iterations
  C: a set of image chunks
  fs: filesystem to write to (mem, tmpfs, local disk, Lustre)

for each chunk ∈ C do
  read chunk from Lustre
  for i ∈ [1, n] do
    chunk ← chunk + 1
    sleep x
    if i < n then
      save chunk to fs
    end if
  end for
  save chunk to Lustre
end for

To effectively investigate how the different strategies impact processing, we selected a simple incrementation pipeline (Algorithm 1) that consists exclusively of map stages. A series of map-only stages would enable us to evaluate the effects of in-memory computing when data locality is preserved. Incrementation was selected over other applications, such as binarization, as it ensured that a new image was created at each step (i.e. no caching effects within the executing application). Each partitioned chunk was incremented by 1, in parallel, by each task. As incrementing images is not a time-consuming process, we added a sleep delay to the tasks to study the effects of task duration. The incremented chunks would either be maintained in memory (Spark only) or saved to tmpfs, local disk or Lustre (Spark and Nipype). Should more than a single iteration be requested, the incremented chunks would be incremented again and saved to the same file system. This would repeat until the number of requested iterations had elapsed. In all conditions, the first input chunks and final output chunks would reside on Lustre. We chose to perform our initial input/final output on Lustre, as local storage is typically only accessible to a user for the duration of the execution in HPC environments.
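For reference, the Spark version of Algorithm 1 can be sketched as a chain of map transformations over the chunk set. The version below is a simplified illustration of our pipeline: file I/O is stubbed out and the helper names are ours rather than those of the repository scripts:

```python
import time
from pyspark.sql import SparkSession

def increment_chunk(chunk, delay):
    """One task of Algorithm 1: increment the chunk and sleep to emulate work."""
    time.sleep(delay)
    return chunk + 1

spark = SparkSession.builder.appName("incrementation").getOrCreate()
sc = spark.sparkContext

n_iterations, delay = 3, 0.1
# Stand-in for reading 125 image chunks from Lustre.
chunks = sc.parallelize(range(125), 125)

# Each iteration is a map-only stage; with in-memory computing, intermediate
# chunks never leave the executors between iterations.
for _ in range(n_iterations):
    chunks = chunks.map(lambda c: increment_chunk(c, delay))

# The action triggers the (lazy) chain; final results would be saved to Lustre.
result = chunks.collect()
spark.stop()
```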
F. Experiments
We conducted four experiments in which we varied (1) the number of iterations in the application (n in Algorithm 1), (2) the task duration (x), (3) the chunk size given a constant image size, and (4) the total image size. To evaluate the page-cache model, experiment conditions fell in different regions of Equations 1 and 2, as summarized in Table III. Among the 16 nodes available, 1 was dedicated to the Spark master and driver, and the remaining 15 were used as compute nodes. Since data locality is not normally preserved in Nipype (a new Slurm allocation is requested for each processed task), we instrumented Nipype to ensure data locality (see run_benchmarks.py in our repository). That is, the chunks were split into partitions, and for each partition, we requested a Slurm allocation to process the entire pipeline in parallel using Nipype's MultiProc scheduler on a given node. This was possible as no communication was required between the processed chunks.

For our first incrementation experiment, we investigated the effects of Big Data strategies on varying total data size. To achieve this, we increased the number of incrementation iterations from 1 to 10 and 100. The total data size would then increase from 75 GB, at 1 iteration, to 7,500 GB, at 100 iterations. The total number of chunks was 125. Chunks all ran concurrently (Γ=125) and were equally balanced among 15 nodes, leading to 8 or 9 concurrent jobs per node (γ=9). Task duration was fixed at 3.44 seconds.

In the second experiment, we evaluated the effects of Big Data strategies on varying task duration. If the page cache has sufficient time to flush, it would be expected that in-memory computing and local disk perform equivalently. We varied the task duration between 2.4 and 320 seconds such that the D/C rate falls into different regions of Equations 1 and 2. The number of chunks was maintained at 125, leading to Γ=125 and γ=9. The number of iterations was fixed to 10.

As a third incrementation experiment, we were interested in the effects of chunk size on Big Data strategies. Naturally, a greater chunk size signifies a decrease in parallelism. However, it also signifies an increase in sequential I/O (increased ∆/Γ and δ/γ). For this experiment we partitioned the complete BigBrain image into 30, 125 and 750 chunks, corresponding to γ values of 2, 9 and 25, respectively. While Spark attempted to load-balance the data, it used only 25 of the 40 cores for 750 chunks. In contrast, Nipype tried to use as many cores as possible. Unlike the previous experiment, the D/C rate was kept static at 178.6 MB/s; however, this ratio ensured that different regions of the inequality were reached depending on the amount of parallelism. The number of iterations was fixed to 10, and the task duration was adjusted so that C remained constant at 4,400 s.

For our fourth and final incrementation experiment, we investigated the effects of the strategies on different image sizes. We selected the 75 GB BigBrain, the 38 GB half BigBrain and the 13 MB T1-weighted MRI image for this experiment. The number of chunks was fixed to 125. Similarly to the previous experiment, the total sequential compute time was fixed (10 iterations, 1.76 seconds per task); however, due to the varying size of the total data processed (D), the D/C rate varied. Once again, we ensured that the D/C rate fell in multiple different regions of the inequality. The D/C rate ranged from 349.1 MB/s for BigBrain and 174.6 MB/s for half BigBrain, to 0.06 MB/s for the MRI image. Only the 0.06 MB/s MRI satisfied the inequality for both Lustre and local disk.

TABLE III: Experiment conditions. Red cells denote the conditions where the inequalities in Equations 1 and 2 do not hold, i.e., the page cache is asymptotically useless. Green cells show the conditions where the page cache covers all data writes.

Experiment 1: Number of Iterations
n      D (GB)   C (s)     D/C (MB/s)   γ    δ/γ (MB/s)   Γ     ∆/Γ (MB/s)   (D/C)/(δ/γ)   (D/C)/(∆/Γ)
1      75       430       178.6        9    21.5         125   4.0          8.3           44.7
10     750      4,300     178.6        9    21.5         125   4.0          8.3           44.7
100    7,500    43,000    178.6        9    21.5         125   4.0          8.3           44.7

Experiment 2: Task Duration
x (s)  D (GB)   C (s)     D/C (MB/s)   γ    δ/γ (MB/s)   Γ     ∆/Γ (MB/s)   (D/C)/(δ/γ)   (D/C)/(∆/Γ)
2.4    750      3,000     256          9    21.5         125   4.0          11.9          64
3.44   750      4,300     178.6        9    21.5         125   4.0          8.3           44.7
7.68   750      9,600     80           9    21.5         125   4.0          3.7           20
320    750      400,000   1.9          9    21.5         125   4.0          0.09          0.48

Experiment 3: Number of Chunks
chunks D (GB)   C (s)     D/C (MB/s)   γ    δ/γ (MB/s)   Γ     ∆/Γ (MB/s)   (D/C)/(δ/γ)   (D/C)/(∆/Γ)
30     750      4,400     174.6        2    96.8         30    16.8         1.8           10.4
125    750      4,400     174.6        9    21.5         125   4.0          8.1           43.7
750    750      4,400     174.6        25   7.7          375   1.3          22.7          134.3

Experiment 4: Image Size
image          D (GB)   C (s)   D/C (MB/s)   γ    δ/γ (MB/s)   Γ     ∆/Γ (MB/s)   (D/C)/(δ/γ)   (D/C)/(∆/Γ)
BigBrain       750      2,200   349.1        9    21.5         125   4.0          16.2          87.3
Half BigBrain  375      2,200   174.6        9    21.5         125   4.0          8.1           43.7
MRI            0.127    2,200   0.06         9    21.5         125   4.0          0.003         0.015

III. RESULTS
A. Experiment 1: Number of Iterations
Fig. 1a shows the difference between the different filesystem choices for a varying number of iterations. At 1 iteration, all filesystems behave the same, although the application was writing faster than the disk bandwidth. This is because application data was not saturating the page cache (transient phase). The page cache, on Zenith, occupies 40% of total memory. With 192 GB of RAM on each node, 76.8 GB of dirty data could be held in a node's page cache at any given time. As the total amount of data written by the application increases to 750 GB, there is a greater disparity between Lustre and in-memory (2.67x slower, on average). Local disk performance, however, is still comparable to memory (1.38x slower, on average). Despite local disk and Lustre both being in transient state, local disk encounters less contention than what would be found on Lustre.

At 100 iterations, or 7,500 GB, Lustre is found to be, on average, 3.82x slower than Spark in-memory. The slowdown experienced can be explained by the smaller percentage of total data residing in the page cache at a given time, compared to 10 iterations. Therefore, the effects of Lustre bandwidth are more significant in this application. At 100 iterations, the application was writing 500 GB per node (7,500 GB / 15 nodes) and hence could not run on tmpfs or local disk.

While some variability can be seen in Fig. 1a between the two engines, this is believed to be insignificant, and potentially due to Slurm node allocation delays in our launching of Nipype.
B. Experiment 2: Task Duration
Increasing task duration ensured that all file systems had comparable performance (Fig. 1b). Lustre, for instance, is approximately 1.01x slower than Spark in-memory at a task duration of 320 seconds, whereas it is approximately 3.25x slower than Spark in-memory with 2.4-second tasks. This pattern corroborates our page-cache model, which postulates that data movement costs will have little impact on compute-intensive tasks. The reasoning behind this is that longer tasks give the page cache more time to flush between disk writes.
C. Experiment 3: Image Block Size
As can be seen in Fig. 1c, makespan decreases with an increasing number of chunks. This is due to the fact that parallelism increases with the number of chunks. At 30 chunks, only 2 CPUs per node are actively working. At 125 chunks, this changes to a maximum of 9 CPUs per node, and at 750 chunks, up to 40 CPUs can be active.

Due to a size limitation of 2 GB imposed on Spark partitions, Spark with in-memory computing was not run on the 30-chunk configuration.

Local disk and tmpfs perform comparably for all conditions, with Lustre being significantly slower. As with varying the number of iterations, Lustre is slower due to increased filesystem contention, which is, at minimum, 15x greater than contention on local disk, due to the number of nodes used. With an increase in the number of chunks, local disk and tmpfs makespans begin to converge. A potential explanation for this may be that tmpfs is utilizing swap space. As concurrency increases, the memory footprint of the application also increases. It is possible that at 750 chunks, swapping to disk is required by tmpfs, thus resulting in processing times similar to local disk.

Swapping may also be an explanation for the variance between Spark in-memory and tmpfs performance. While Spark may also spill to disk, it only does so when data does not fit in memory. As none of the RDDs generated throughout the pipeline were cached and all concurrently accessed data could be maintained in memory, spilling to disk was not necessary.

Fig. 1: Experiment results: makespans of Spark and Nipype writing to memory, tmpfs, local disk and Lustre. (a) Experiment 1: complete BigBrain, 125 chunks, 3.44-second tasks. (b) Experiment 2: complete BigBrain, 125 chunks, 10 iterations. (c) Experiment 3: complete BigBrain, 10 iterations, C = 4,400 s. (d) Experiment 4: 125 chunks, 10 iterations, 1.76-second tasks.
D. Experiment 4: Image Size
Increasing overall data size decreases performance, as can be seen in Fig. 1d. When the data size is very small (e.g. the MRI image), all file system makespans are comparable. This is due to the fact that the page cache can be leveraged fully regardless of file system. However, this time, Spark in-memory performed significantly worse than all other filesystems, with a makespan of 2,211 seconds. Upon further inspection, it appeared that Spark in-memory executed sequentially, on a single worker node. The lack of parallelism for the MRI image may be a result of Spark's maximum partition size, which is by default 128 MB – significantly larger than the 13 MB MRI image.

At half BigBrain, the makespan differences become apparent for both local disk and Lustre, with Lustre becoming 2.4x slower than in-memory. This can be attributed to page cache saturation, as predicted by the model for both the half BigBrain image and the complete BigBrain. Only the MRI image was predicted to fall within the model constraints.

When the complete BigBrain is processed, the disparity between the different filesystems becomes even greater. Lustre becomes 3.68x slower, whereas local disk becomes 1.68x slower. An explanation for this is that the page cache fills up faster due to data size.
E. Page Cache Model Evaluations
In order to evaluate the page cache model, we compared the observed speed-up ratio provided by in-memory computing to the (D/C)/(δ/γ) and (D/C)/(∆/Γ) ratios (Fig. 2). Speed-up ratios were computed as the ratio between the makespan obtained with Spark on local disk or Lustre, and the makespan obtained with Spark for in-memory computing. Experiments for which there was no in-memory equivalent (i.e. BigBrain split into 30 chunks) were not considered.

Results show that, overall, the model correctly predicted the effect of the page cache on processing times for local disk and Lustre. That is, the speed-up provided by in-memory computing was larger than 1 for D/C rates larger than δ/γ (local disk) or ∆/Γ (Lustre). Conversely, the speed-up provided by in-memory computing remained close to 1 for D/C rates smaller than δ/γ (local disk) or ∆/Γ (Lustre). The two points close to the origin correspond to the sequential processing of the MRI image by Spark mentioned previously.

Points which violated model predictions were found at 1 iteration, where the page cache would not have been saturated in spite of a high D/C (transient state). However, in all cases, the "1" boundary was never trespassed by more than a factor of 0.19, and this is therefore likely a result of system variability.

Fig. 2: Page cache model evaluation: speed-up of Spark in-memory as a function of (D/C)/(δ/γ) for (a) local disk and of (D/C)/(∆/Γ) for (b) Lustre. Grey regions denote areas that violate model predictions.

IV. DISCUSSION
A. Effect of In-Memory Computing
We measured the effect of in-memory computing by comparing the runs of Spark in-memory (yellow bars in Fig. 1) to those of Spark on local disk (non-hatched green bars). The speed-up provided by in-memory computing is also reported in Fig. 2a. The speed-up provided by in-memory computing increases with (D/C)/(δ/γ), as expected from the model. In our experiments, it peaked at 1.6, for a ratio of 16.2. This corresponds to the processing of the BigBrain with 125 chunks and 1.76-second tasks in experiment 4 (total computing time C = 2,200 s), which is typically encountered in common image processing tasks such as denoising, intensity normalization, etc. The speed-up of 1.6 is also reached with a ratio of 22.7 in experiment 3, obtained by processing the BigBrain with 750 chunks.

The results also allow us to speculate on the effect of in-memory computing on the pre-processing of functional MRI, another typical use case in neuroimaging. Assuming an average processing time of 20 minutes per subject, which is a ballpark value commonly observed with the popular SPM or FSL packages, an input data size of 100 MB per subject, and an output data size of 2 GB (a 20-fold increase compared to the input size), the D/C rate would be 1.8 MB/s, which would reach the δ/γ threshold measured on this cluster for γ = 108, that is, if 108 subjects were processed on the same node. This is very unlikely as the number of CPUs per node was 40. We therefore conclude that in-memory computing is likely to be useless for fMRI analysis. Naturally, this estimate is strongly dependent on the characteristics of the cluster.
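The fMRI back-of-the-envelope calculation above (and the 280-subject figure that reappears in the Lustre discussion below) can be reproduced directly from the measured bandwidths of Table II and the assumed per-subject data-write rate:

```python
# Assumed per-subject figures for fMRI pre-processing (see text):
# ~2 GB written over ~20 minutes of processing, i.e. ~1.8 MB/s per subject.
d_over_c = 1.8            # data-write rate per subject process (MB/s)
delta = 193.64            # measured local-disk bandwidth (MB/s), Table II
DELTA = 504.03            # measured Lustre bandwidth (MB/s), Table II

gamma_threshold = delta / d_over_c   # subjects per node at which Eq. 1 stops holding
Gamma_threshold = DELTA / d_over_c   # subjects per cluster at which Eq. 2 stops holding

print(f"gamma threshold ~ {gamma_threshold:.0f} subjects/node")     # ~108
print(f"Gamma threshold ~ {Gamma_threshold:.0f} subjects/cluster")  # ~280
```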
B. Effect of Data Locality

We measure the effect of data locality by comparing the runs of Spark on local disk (non-hatched green bars in Fig. 1) to those of Spark on Lustre (non-hatched blue bars). The speed-up provided by local execution peaked at 3.2, for 750 chunks in experiment 3. Overall, writing locally was usually preferable over writing to Lustre, as a result of the lower contention on local disk. Although it may be true that network bandwidths exceed those of disks [24], locality remains important as contention on a shared filesystem tends to be much higher than on local disk. The only time writing locally did not have a significant impact over Lustre was in experiments 1 and 4, at 1 iteration and when processing the MRI image, respectively. In both these scenarios, the Lustre writes did not impact performance as the data could be written to the page cache and flushed to Lustre asynchronously.
C. Combined Effect of In-Memory and Data Locality
We measure the combined effect of data locality and in-memory computing by comparing the runs of Spark in-memory (yellow bars in Fig. 1) to those of Spark on Lustre (non-hatched blue bars). The speed-up provided by the combined use of data locality and in-memory computing is also reported in Fig. 2b. The provided speed-up increases with (D/C)/(∆/Γ), as expected from the model. In our experiments, it peaked around 5, for ratios of 120.4 and 64. Again, this configuration is likely to happen in typical image processing tasks performed on the BigBrain.

As for the fMRI speculation, the D/C rate of 1.8 MB/s would reach the ∆/Γ threshold for Γ = 280, which is a realistic number of subjects to process on a complete cluster. Naturally, this estimate is highly dependent on the observed bandwidth of the shared file system (∆).

D. Effect of Lazy Evaluation

The effects of lazy evaluation can be seen throughout the experiments. Nipype was found to be slower than Spark in most experiments. While the Nipype execution graph is generated prior to workflow execution, there are no optimizations to ensure that the least amount of work is performed to produce the required results.

During the processing of Experiment 3, the 750 chunks were processed in two batches for both Spark and Nipype, due to CPU limitations. Rather than running each iteration on the full dataset, as with Nipype, Spark opted to perform all the iterations on the first batch (load, increment, save), and then proceeded to process the second batch. Such an optimization is important, even when processing data on disk, as it would presumably increase the occurrence of cache hits. This may partially explain the speedup seen at 750 chunks in Figure 1c.
E. Can tmpfs and Page Caches Emulate In-Memory Computing?
Although tmpfs and the page cache do improve performance, as seen in Figure 1, they do not always perform equivalently to in-memory computing. Tmpfs's main limitation is that data residing on it may be swapped to disk if the system's memory usage is high. When it reaches this point, its performance slows down to swap disk bandwidth, as observed in Figure 1d. The page cache suffers a similar dilemma. I/O-blocking writes to disk occur when a given percentage (e.g. 40%) of total memory is occupied by dirty data. When the threshold is exceeded, processes performing writes must wait for dirty data to be flushed to disk.

Furthermore, like memory, tmpfs and the page cache are shared resources on a node. If other users on the node are heavily using memory, causing tmpfs to write to swap space, or are performing data-intensive operations that fill up the page cache, tmpfs/page cache performance will be limited for the remaining users. However, it is possible to request a certain amount of available memory through the HPC scheduler. Ultimately, in-memory data will also need to be spilled to disk if memory usage exceeds the amount of available memory, although disk writes are likely to occur in tmpfs and the page cache before the requested available memory is filled.
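The dirty-data threshold mentioned above is exposed by the Linux kernel and can be inspected on any compute node; a small sketch reading the standard procfs entries is shown below (the reported values will of course differ per system):

```python
def read_proc(path):
    """Return the contents of a procfs file as a string."""
    with open(path) as f:
        return f.read().strip()

# Percentage of total memory that may hold dirty pages before writers block.
dirty_ratio = read_proc('/proc/sys/vm/dirty_ratio')

# Current amount of dirty (not yet written-back) data, reported in /proc/meminfo.
dirty_kb = next(line.split()[1] for line in read_proc('/proc/meminfo').splitlines()
                if line.startswith('Dirty:'))

print(f"vm.dirty_ratio = {dirty_ratio}%, currently dirty: {dirty_kb} kB")
```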
F. Scheduling Remarks
A common recommendation in Spark is to limit the number of cores per executor to 5, to preserve a good I/O throughput (see the Cloudera blog), but the rationale for this recommendation is hardly explained. We believe that the throughput degradation observed with more than 5 cores per executor might be coming from full page caches.

Spark does not currently include any active management of disk I/Os or page caches. We believe that it would be beneficial to extend it in this direction, to increase the performance of operations where local storage has to be used, such as disk spills or shuffles. For instance, workflow-aware cache eviction policies that maximize page cache usage for the workflows could be investigated.

An alternative Nipype plugin designed for running on Slurm was not used in the experiments. The Slurm plugin requests a Slurm allocation for each processed data chunk. Such a scheduling strategy was not ideal in our environment, where oversubscription of nodes was not enabled.

Unlike Spark, Nipype by default opts to use all available CPUs rather than to load-balance data across the cluster. That is, given 50 chunks and 40 cores, Spark will only use 25 cores and process in two batches. Nipype will also have no choice but to split up the processing into two batches, but will first process 40 chunks, immediately followed by the remaining 10. While both are reasonable strategies for data processing, Spark may end up benefiting more from the page cache, as less data is written in parallel (25 vs 40), giving more time for the page cache to flush.

Nipype's MapNodes, which apply a given function to each element of a list, were found to be slower than Nodes, which apply a function to a single element, due to a blocking mechanism. For this reason, we chose to iterate through a series of Nodes in our code despite MapNodes being easier to use.
G. Other Comments
Writing to node-local storage in a cluster environment comes at a cost, for both Nipype and Spark without HDFS. When a node is lost, the node-local data is lost with it. While Spark will recompute the lost partition automatically using lineage, Nipype will fail for all tasks requiring the data. Nevertheless, Spark will also fail if RDDs of node-local filenames are shuffled, as the data associated to the filenames will not be shuffled with the RDD and there will be no mechanism in place to fetch it.

When executed on Lustre, Nipype will checkpoint itself, ensuring resumption from the last checkpoint during re-execution. This is particularly important in the case of compute-intensive applications, such as those found in neuroimaging. Spark also provides a checkpointing mechanism; however, it requires HDFS.

It is common in neuroimaging applications for users to want access to all intermediate data. Such a feature is currently only possible when writing to a shared filesystem. It would also not be an option with Spark in-memory. To enable this, burst buffers or heterogeneous storage managers (e.g. Triple-H [25]) could be used to ensure fast processing and that all outputs (including intermediate outputs) are sent asynchronously to the shared filesystem.

It was expected that Spark would experience longer processing times, particularly with the small datasets, due to Java serialization. This was not found to be the case. Unlike Spark, Nipype performs a more thorough provenance capture, which potentially accounts for its longer processing times.

In this paper, we analyzed the effects of Big Data strategies on a map-only artificial neuroimaging pipeline. This allowed us to examine the effects of these strategies without them being significantly obscured by other conditions. Studying the effects of such strategies on map-reduce-type workflows, in addition to real neuroimaging pipelines, would allow us to gain further insight into the added value of Big Data performance strategies for neuroimaging use cases; this remains to be done.

V. CONCLUSION
Big Data performance optimization strategies help improve the performance of typical neuroimaging applications. Our experiments indicate that, overall, in-memory computing enables greater speedups than what can be obtained by using the page cache and tmpfs. While the page cache and tmpfs do give memory-like performance, they are likely to fill up faster than memory, leading to increased performance penalties when compared to in-memory computing. We conclude that extending Big Data processing engines to better support neuroimaging applications, including developing their provenance, fault-tolerance, and reproducibility features, is worthwhile.

Data locality plays an important role in application performance. Local disk was found to perform better than the shared filesystem despite having lower bandwidth, due to increased contention on the shared filesystem. Since local disk typically has less contention than shared filesystems, it is recommended to store data locally. However, using local storage without a distributed file system may limit fault tolerance.

Although a more thorough analysis of lazy evaluation remains to be performed, it is speculated that this may be the cause of the general performance difference between Spark and Nipype. Furthermore, it was found that lazy-evaluation optimizations increase the likelihood of cache hits, thus improving overall performance.

Even though Big Data strategies are beneficial to the processing of large images, it is estimated that it would require running a functional MRI dataset with 280 concurrent subjects for any noticeable impact, using our Lustre bandwidth estimate. Benchmarking Spark and Nipype using such a large fMRI dataset would be a relevant follow-up experiment, to test this hypothesis. It would also be useful to evaluate other types of applications, such as ones containing data shuffling steps.

Finally, we plan to extend this study by including task scheduling strategies in a multi-tenant environment. We expect to observe important differences between Spark and Nipype, due to Spark's use of overlay scheduling. The impact of other Big Data technologies, such as distributed in-memory file systems (e.g. Apache Ignite) and Lustre scalability issues [26], could also be investigated.

VI. ACKNOWLEDGMENTS
We are thankful to the Dell EMC HPC and AI Innovation Lab in Round Rock, TX, for providing the infrastructure and high-quality technical support.

REFERENCES
[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Comm. of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[2] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, et al., "Apache Spark: a unified engine for big data processing," Comm. of the ACM, vol. 59, no. 11, pp. 56–65, 2016.
[3] M. Rocklin, "Dask: Parallel computation with blocked algorithms and task scheduling," in Proc. of the 14th Python in Science Conference, no. 130-136. Citeseer, 2015.
[4] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good, et al., "Pegasus: A framework for mapping complex scientific workflows onto distributed systems," Scientific Programming, vol. 13, no. 3, pp. 219–237, 2005.
[5] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, et al., "Taverna: a tool for the composition and enactment of bioinformatics workflows," Bioinformatics, vol. 20, no. 17, pp. 3045–3054, 2004.
[6] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2014.
[7] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010, pp. 1–10.
[8] P. Schwan et al., "Lustre: Building a file system for 1000-node clusters," in Proc. of the 2003 Linux Symposium, 2003, pp. 380–386.
[9] R. Love, Linux Kernel Development. Pearson Education, 2010.
[10] J. D. Van Horn and A. W. Toga, "Human neuroimaging as a Big Data science," Brain Imaging and Behavior, vol. 8, no. 2, pp. 323–331, 2014.
[11] K. Gorgolewski, C. D. Burns, C. Madison, D. Clark, Y. O. Halchenko, M. L. Waskom, and S. S. Ghosh, "Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in Python," Frontiers in Neuroinformatics, vol. 5, p. 13, 2011.
[12] P. Bellec, S. Lavoie-Courchesne, P. Dickinson, J. Lerch, A. Zijdenbos, and A. C. Evans, "The pipeline system for Octave and Matlab (PSOM): a lightweight scripting framework and execution engine for scientific workflows," Frontiers in Neuroinformatics, vol. 6, p. 7, 2012.
[13] J. Freeman, N. Vladimirov, T. Kawashima, Y. Mu, N. J. Sofroniew, D. V. Bennett, J. Rosen, C.-T. Yang, L. L. Looger, and M. B. Ahrens, "Mapping brain activity at scale with cluster computing," Nature Methods, vol. 11, no. 9, p. 941, 2014.
[14] M. Makkie, H. Huang, Y. Zhao, A. V. Vasilakos, and T. Liu, "Fast and scalable distributed deep convolutional autoencoder for fMRI Big Data analytics," Neurocomputing, vol. 325, pp. 20–30, 2019.
[15] P. Mehta, S. Dorkenwald, D. Zhao, T. Kaftan, A. Cheung, M. Balazinska, A. Rokem, A. Connolly, J. Vanderplas, and Y. AlSayyad, "Comparative evaluation of Big-Data systems on scientific image analytics workloads," Proc. of the VLDB Endowment, vol. 10, no. 11, pp. 1226–1237, 2017.
[16] T. Guedes, V. Silva, M. Mattoso, M. V. N. Bedo, and D. de Oliveira, "A practical roadmap for provenance capture and data analysis in Spark-based scientific workflows," in Workshop on Workflows in Support of Large-Scale Science (WORKS), Dallas, TX, 2018.
[17] S. Krishnan, M. Tatineni, and C. Baru, "myHadoop - Hadoop-on-Demand on traditional HPC resources," San Diego Supercomputer Center Technical Report TR-2011-2, University of California, San Diego, 2011.
[18] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," HotCloud, vol. 10, no. 10-10, p. 95, 2010.
[19] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, et al., "Apache Hadoop YARN: Yet another resource negotiator," in Proc. of the 4th Annual Symposium on Cloud Computing. ACM, 2013, p. 5.
[20] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica, "Mesos: A platform for fine-grained resource sharing in the data center," in NSDI, vol. 11, no. 2011, 2011, pp. 22–22.
[21] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
[22] I. Paraskevakos, A. Luckow, G. Chantzialexiou, M. Khoshlessan, O. Beckstein, G. C. Fox, and S. Jha, "Task-parallel analysis of molecular dynamics trajectories," CoRR, vol. abs/1801.07630, 2018.
[23] K. Amunts, C. Lepage, L. Borgeat, H. Mohlberg, T. Dickscheid, M.-É. Rousseau, S. Bludau, P.-L. Bazin, L. B. Lewis, A.-M. Oros-Peusquens, et al., "BigBrain: an ultrahigh-resolution 3D human brain model," Science, vol. 340, no. 6139, pp. 1472–1475, 2013.
[24] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, "Disk-locality in datacenter computing considered irrelevant," in HotOS, vol. 13, 2011, pp. 12–12.
[25] N. S. Islam, X. Lu, M. Wasi-ur Rahman, D. Shankar, and D. K. Panda, "Triple-H: A hybrid approach to accelerate HDFS on HPC clusters with heterogeneous storage architecture," in Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on. IEEE, 2015, pp. 101–110.
[26] N. Chaimov, A. Malony, S. Canon, C. Iancu, K. Z. Ibrahim, and J. Srinivasan, "Scaling Spark on HPC systems," in