CannyFS: Opportunistically Maximizing I/O Throughput Exploiting the Transactional Nature of Batch-Mode Data Processing
Jessica Nettelblad* · Carl Nettelblad**
Abstract
We introduce a user mode file system, CannyFS, that hides latency by assuming all I/O operations will succeed. The user mode process will in turn report errors, allowing proper cleanup and a repeated attempt to take place. We demonstrate benefits for the model tasks of extracting archives and removing directory trees in a real-life HPC environment, giving typical reductions in time use of over 80%.

This approach can be considered a view of HPC jobs and their I/O activity as transactions. In general, file systems lack clearly defined transaction semantics. Over time, the competing trends to add cache and maintain data integrity have resulted in different practical tradeoffs.

High-performance computing is a special case where overall throughput demands are high. Latency can also be high, with non-local storage. In addition, a theoretically possible I/O error (like permission denied, loss of connection, exceeding disk quota) will frequently warrant the resubmission of a full job or task, rather than traditional error reporting or handling. Therefore, opportunistically treating each I/O operation as successful, and part of a larger transaction, can speed up some applications that do not leverage asynchronous I/O.
* E-mail: [email protected]
Uppsala Multidisciplinary Center for Advanced Computational Science, Uppsala University, Box 335, SE-751 05 Uppsala, Sweden

** E-mail: [email protected]
Division of Scientific Computing, Department of Information Technology, Science for Life Laboratory, Uppsala University, Box 335, SE-751 05 Uppsala, Sweden
We have produced a proof-of-concept implementation of a user mode file system, CannyFS, based on the idea that a full job at a computational cluster can be seen as a single transaction, and that specifics regarding success and consistency of individual I/O operations within the transaction can be considered irrelevant. If a transaction fails, it should be fully rolled back (results removed), and retried. CannyFS relies on the canny assumption that any I/O operation can (or should) succeed.

Compared to web applications or local interactive applications, many types of HPC software have very low requirements on making their I/O activity immediately visible outside of the writing job itself. It can sometimes be deemed acceptable to lose the result of a job if a hardware outage occurs halfway through, since the proper course of action will be to resubmit the full job anyway, if no checkpointing is done. However, traditional file systems cannot make such assumptions. Rather, the development over time has been in the direction of journaling file systems, where all changes to metadata or even file content are recorded in such a way that a power outage should always result in a consistent state representing a single point in time [5].

In a distributed environment, the true state of the file system will need to be synchronized between all possible users. One common choice is that metadata accesses (reads and/or writes, such as file creation, or checking file existence) are completed synchronously against the server. Any I/O operation is assumed to be able to fail, due to logic or hardware errors (ranging from permission denied or file not found to an exceeded quota or a total failure in network connectivity). In some use cases, such errors are expected and necessary for proper operation. In other cases, they represent truly exceptional states, where the proper course of action is to kill the whole HPC job, remedy the cause of the error, and then resubmit the job.

Other technologies in computer science face similar issues of data consistency, distributed state, and error reporting. For relational databases, the concept of transactional integrity is common [2]. Depending on the isolation level, transactions should act as more or less independent entities, where reads and writes should act as if all other transactions were either completed, or not even started yet. The sequence of transactions should be serializable, in the sense that the data read and the data written should be as if all transactions were executed in (some) serial order. Which serial order the actual execution corresponds to is undeterminable, and does not have to correspond to the order in which requests were sent. Failures during a transaction correspond to the whole transaction being rolled back. This gives database management software some freedom in how to treat the current "dirty" data during the transaction, as long as a final verdict of committing or rolling back is reached.

In low-level parallelism, the concept of "transactional memory" has gained similar popularity, including hardware implementations [9,3]. Rather than explicitly locking and maintaining a proper state of data for each possible outcome, a general feature exists for either accepting the full update, or rolling back.

File system transactions have existed, both as a research subject, and as general purpose implementations in some releases of e.g. Microsoft Windows [6,10,12]. This has been done from a data integrity perspective.
In this communication, we choose to consider the full outcome of an HPC job as a transaction. We assume that the job itself is already isolated, i.e. that no other processes read from the data produced by the job, or modify the data accessed by the job, while it is running. We also assume that the output of the job can easily be rolled back, manually or automatically, e.g. by removing a specific set of files, and that I/O failures are rare. We believe this to be a representative assumption in many cluster computing workloads, especially data processing tasks for wide datasets containing a high number of input and output files traversed serially on individual nodes. Using these assumptions, we can do very aggressive buffering of reads and writes, going as far as assuming that opening of files on a distributed resource will succeed. The result is a solution which is far less sensitive to network or I/O subsystem latency.

We evaluate this approach on an existing traditional HPC resource under load, doing I/O against the common high-performance file system from a single node, while other jobs are executing on other nodes. When using our prototype user mode shim file system CannyFS, we find great improvements in speed for our experiment workload, which contains a high number of small I/O operations over many files.
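To make the transactional view concrete, the sketch below (our own illustration, independent of CannyFS) shows a trivial job wrapper that treats the whole output directory as the unit of commit or rollback; the ./job_output path and the ./run_job command are hypothetical placeholders.

```cpp
// Illustration only: treat a whole batch job as one transaction.
// Assumes C++17 for <filesystem>; "./job_output" and "./run_job" are
// hypothetical placeholders, not part of CannyFS.
#include <cstdlib>
#include <filesystem>
#include <iostream>

int main() {
    namespace fs = std::filesystem;
    const fs::path out = "./job_output";

    fs::create_directories(out);
    // Run the workload; any I/O error inside it fails the whole "transaction".
    int rc = std::system("./run_job --output ./job_output");

    if (rc != 0) {
        // Roll back: remove partial results, then resubmit the job.
        fs::remove_all(out);
        std::cerr << "job failed, output rolled back; resubmit\n";
        return 1;
    }
    std::cout << "job committed: results left in " << out << "\n";
    return 0;
}
```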
Fig. 1
Schematic illustration of time use when writing four files, with (a) normal operation without CannyFS and (b) overlapping operation with CannyFS. In (a), a typical synchronous process looping over four files will at each time only expose the I/O requests to a single file to the I/O subsystem. If the subsystem makes guarantees on not returning before such requests have completed (e.g. file creation being successful, flushing when closing), the time delay for completing the full task can be considerable. In (b), CannyFS will report each request as completed, allowing the calling process to send additional requests. Total throughput is put to better use and latency is hidden. Even though the interval between opening and closing of any specific file might be longer, as in this example, total time use can be drastically reduced, especially in distributed environments with some level of I/O congestion. No code changes are needed in the calling process to accomplish this.
CannyFS is a user mode file system that will mirror the full root file system of the host where it runs, or alternatively only a specific subpath. It is intended to be run by an individual user within a batch job, with a mounting point which only this user has access to.

The characterizing feature of CannyFS is that all, or some, I/O operations are treated as eager, in the sense that they return as having completed without any request yet being sent to the actual I/O subsystem supposed to handle it. Naturally, this can only be true for requests that at completion return none, or only trivial, data. An example of the result, interpreted as the number of files getting concurrent I/O requests, is shown in Fig. 1.
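To make the eager pattern concrete, here is a minimal sketch (our illustration, not the actual CannyFS source): a FUSE mkdir handler that immediately reports success and hands the real call to a background worker. The single global queue, the BACKING path and the handler names are simplifying assumptions; CannyFS itself maintains per-file queues and covers many more operations.

```cpp
// Build (assumed): g++ -std=c++14 eager_mkdir.cpp $(pkg-config --cflags --libs fuse) -pthread
// Sketch only, not the CannyFS source: an "eager" FUSE mkdir that reports
// success at once and performs the real call later on a worker thread.
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <cstdio>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

static const std::string BACKING = "/path/to/backing";   // assumed backing directory

static std::mutex mtx;
static std::condition_variable cv;
static std::queue<std::function<void()>> pending;         // deferred operations

static void enqueue(std::function<void()> op) {
    { std::lock_guard<std::mutex> lk(mtx); pending.push(std::move(op)); }
    cv.notify_one();
}

static void worker() {                                     // drains the queue forever
    for (;;) {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [] { return !pending.empty(); });
        auto op = std::move(pending.front());
        pending.pop();
        lk.unlock();
        op();
    }
}

static int eager_mkdir(const char *path, mode_t mode) {
    std::string target = BACKING + path;                   // redirect into the mirrored tree
    enqueue([target, mode] {
        if (::mkdir(target.c_str(), mode) == -1)
            std::perror(("deferred mkdir " + target).c_str());  // error surfaces late
    });
    return 0;                                               // caller sees success immediately
}

int main(int argc, char *argv[]) {
    std::thread(worker).detach();
    struct fuse_operations ops = {};
    ops.mkdir = eager_mkdir;
    return fuse_main(argc, argv, &ops, nullptr);
}
```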
A data read operation cannot be eager. In fact, the opposite is true: when a read takes place, all writes to the same object first have to be flushed. One exception to this is that some levels of file system metadata (stat and related calls) can be mocked by default values. Individual flags are provided for the eagerness status of approximately 20 different I/O operations, roughly corresponding to different POSIX I/O primitives. The default setting is that all of these are on, but depending on the nature of a specific workload, it might be necessary to turn some off for proper operation (i.e. if a task somehow verifies that I/O completes properly during its normal course of operation, in a way that CannyFS is yet unable to recognize or represent).

There are similarities between CannyFS and a traditional write-behind cache. However, CannyFS only stores very limited data in user space (see Section 3). Most traditional caches share the assumption of CannyFS, that an operation will not fail due to a sudden complete crash or loss of connectivity. Furthermore, most caching implementations will not allow e.g. new files to be created without synchronizing this to the source file system. This fact can make execution of tasks that sweep over a high number of files sequentially very slow over a high-latency link, or when interfacing to an I/O system with moderate congestion, resulting in non-negligible roundtrip times for individual synchronous I/O requests.
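The flush-before-read rule described above can be expressed with a small piece of bookkeeping. The following is our own simplified illustration (not the CannyFS data structures), assuming a per-path counter of queued but not yet executed operations:

```cpp
// Sketch of the flush-before-read rule: reads are never eager, so a read must
// wait until every deferred operation on the same path has actually executed.
// The PendingOps bookkeeping and path strings below are illustrative only.
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

class PendingOps {
    std::mutex mtx;
    std::condition_variable cv;
    std::map<std::string, int> queued;          // path -> deferred ops not yet retired

public:
    void on_enqueue(const std::string &path) {  // an eager operation was accepted
        std::lock_guard<std::mutex> lk(mtx);
        ++queued[path];
    }
    void on_retired(const std::string &path) {  // a worker actually performed it
        { std::lock_guard<std::mutex> lk(mtx); --queued[path]; }
        cv.notify_all();
    }
    // Called before a read (or a stat that must be exact): block until all
    // earlier deferred operations on this path have been flushed.
    void wait_until_flushed(const std::string &path) {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [&] { return queued[path] == 0; });
    }
};

int main() {
    PendingOps ops;
    ops.on_enqueue("/data/out.txt");          // eager write queued, caller told "done"
    ops.on_retired("/data/out.txt");          // worker finished the real write
    ops.wait_until_flushed("/data/out.txt");  // a read may now safely proceed
    return 0;
}
```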
CannyFS utilizes FUSE [8], and is implemented in modern C++, using some features of C++14. It is based on a heavily modified version of the fusexmp_fh example, and is thus licensed under the General Public License.

Each operation that can be run eagerly is implemented as a lambda expression. This makes it easy to either enqueue an operation, or to perform it immediately, using the same codepath. Any I/O errors encountered during deferred operation are recorded and printed to stderr twice: when they happen, and in a global destructor (which is called during orderly process teardown). This ensures that the user will be notified of any I/O errors that were not properly reported back to the calling process. Serialization of writes, and of reads awaiting the completion of all preceding writes, is achieved through Resource Acquisition Is Initialization (RAII) wrappers encapsulating the crucial lock logic. Therefore, implementations of individual I/O operations can be almost trivial.

As an option, the file system process can abort when a failure is encountered, ensuring that any future accesses will return errors once at least one such failure has been recorded. This point will still occur after the logical point where the write was deemed completed by the caller.

Separate queues are maintained per open file. This means that all operations on a file, no matter what file handle they were executed through, are serialized. One reason for this design choice is to ensure that reads from one handle are not executed until writes that have been reported as complete have in fact been completed. Each file with active events gets a separate thread, making the corresponding I/O calls synchronously. A counter is updated in order to make it possible for synchronous read operations, happening outside the queue, to know whether a specific event has been executed or not.

It is expected that this serialization of events will not be perfect, since CannyFS cannot reliably reproduce all ways in which two paths that are not lexicographically identical can still end up referring to the same file. Proper operation thus has to be established for a specific workload. The most important intended use case is when all files are owned by the same user, without any links, and the most crucial metadata accessed are file names and actual file data, ignoring auxiliary dates and attributes.

In addition to pure write operations, CannyFS can also accelerate read-based directory traversal. This is mainly done by preemptively reading stat data for all files when a readdir call is made. Together with the serialization of writes (including removal operations), this means that an rm -rf style call over a large directory tree with many files can be significantly accelerated, but so can similar calls like find and du, which in practice tend to make one file system metadata read request per directory entry from the underlying file system.
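The enqueue-or-run pattern and the twofold error reporting described above could look roughly as follows. This is a hedged sketch under our own naming, not the CannyFS code; the enqueue() hook is a placeholder that a real implementation would route to a per-file worker thread.

```cpp
// Sketch of the enqueue-or-run pattern with twofold error reporting; names
// and the inline enqueue() placeholder are our assumptions, not CannyFS code.
#include <cstdio>
#include <functional>
#include <mutex>
#include <string>
#include <vector>

// Placeholder: a real implementation would hand the lambda to a per-file
// worker thread; running it inline keeps this sketch self-contained.
static void enqueue(std::function<void()> op) { op(); }

static std::mutex err_mtx;
static std::vector<std::string> deferred_errors;

static void record_error(const std::string &what) {
    std::lock_guard<std::mutex> lk(err_mtx);
    std::fprintf(stderr, "deferred I/O error: %s\n", what.c_str());  // first report
    deferred_errors.push_back(what);
}

// One code path for both modes: eager operations are queued and the caller is
// told "success" (0) right away; non-eager ones run synchronously.
static int run_or_defer(bool eager, std::function<int()> op, const std::string &what) {
    if (!eager)
        return op();
    enqueue([op, what] {
        if (op() != 0)
            record_error(what);               // caller has already seen success
    });
    return 0;
}

// Global object: its destructor re-prints recorded errors at orderly teardown.
static struct ErrorReporter {
    ~ErrorReporter() {
        for (const auto &e : deferred_errors)
            std::fprintf(stderr, "unresolved I/O error: %s\n", e.c_str());
    }
} reporter;

int main() {
    // Demo: a deferred operation that fails; the caller still saw 0.
    int seen = run_or_defer(true, [] { return -1; }, "mock write to /tmp/example");
    std::printf("caller saw return value %d\n", seen);
    return 0;
}
```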
Actual file content for read and write operations is never moved into user space in favorable cases. Rather, the Linux kernel and FUSE support for ferrying data between pipe file descriptors using the splice system call is used. Another layer of indirection pipes is created for write operations, in order to move data from the request handling onto the worker threads. Copying into normal user space buffers would ensure that write operations never block, at the cost of higher overhead.

The recommended approach is to make sure that the FUSE options -o big_writes -o max_write=65536 are used on most Linux kernels, where a pipe buffer is generally 65536 bytes in size. On very recent kernels, increasing the limit (in pages) imposed by /proc/sys/fs/pipe-user-pages-soft might also be recommended. If this is not done, later pipe allocations might fail, or only get a much smaller buffer. A too small buffer would result in I/O stalling, since the buffer-filling thread then gets implicitly synchronized with the actual execution of the write operation, removing the benefits of CannyFS.

The number of open files allowed per process is often limited. With many I/O operations in flight over many files, CannyFS might hit this limit. Therefore, there is also an option to limit the total number of operations that can stay queued at the same time, with a default value of 300. This is also to accommodate a high fraction of file descriptors being used for the pipe approach outlined above. The default file descriptor limit is 1024 in some environments. Increasing this, and the operation count limit in CannyFS, is highly recommended. The number of threads per process or user can also be limited, sometimes with the same default of 1024. In some environments, these limits can be inspected with ulimit -a and adjusted with ulimit -n and ulimit -u.
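As a standalone illustration of why the pipe buffer size matters (again a sketch, not CannyFS code), the snippet below asks Linux to grow a freshly created pipe to 65536 bytes, enough to park one full big_writes request without stalling the filling thread; the kernel may refuse or cap the request according to the pipe-user-pages-soft budget discussed above.

```cpp
// Sketch: grow a pipe intended to park incoming write data to 64 KiB.
// F_SETPIPE_SZ is Linux-specific; g++ defines _GNU_SOURCE by default, which
// exposes it via <fcntl.h>.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fds[2];
    if (pipe(fds) == -1) {
        std::perror("pipe");
        return 1;
    }

    // On success fcntl returns the capacity actually granted (possibly rounded
    // up); it can fail or be capped when the user exceeds the budget in
    // /proc/sys/fs/pipe-user-pages-soft.
    int granted = fcntl(fds[1], F_SETPIPE_SZ, 65536);
    if (granted == -1)
        std::perror("F_SETPIPE_SZ");
    else
        std::printf("pipe buffer is now %d bytes\n", granted);

    close(fds[0]);
    close(fds[1]);
    return 0;
}
```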
Benchmarks consisted of two file-intensive I/O workloads: unzipping the Linux kernel (the zip file of the current master branch from github as of 2016-10-01), and removing the resulting directory tree, as a separate operation. The full directory tree uses roughly 2,100 MB of file data, distributed over 59,259 directory entries, for an average file size of 36 kB. Three storage operation modes were considered: CannyFS mapping of the storage solution, direct solution access over NFS, and temporarily saving data on a tmpfs mount, then staging it out using rsync.

All tests were executed as the exclusive job on a node within the larger HPC cluster Milou at the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX), which is used for varied workloads mainly related to bioinformatics. The main storage solution consists of a set of NAS units from Hitachi, mounted over NFS with the NFS client settings rw,noatime,vers=3,proto=tcp,wsize=1048576,rsize=1048576. The storage network connection for the node was a single GbE port. Even though these settings might be tuned, such tuning would require system administrator intervention, while the choice to use CannyFS for a specific task can be made by the end user.

Unzipping was done using the version of the unzip tool included in the OS distribution (Scientific Linux 6). Each CannyFS test was executed by creating a new mount, and including the time for fully killing the CannyFS process after the test (which will unmount the file system and flush all pending I/O). After this, file system caches were also flushed, by writing 3 to /proc/sys/vm/drop_caches. Total time use was measured up to the point after cache flushing. Timing for tmpfs included the time for synchronization to permanent storage. 48 replicates were executed, with interleaving between the three storage modes to avoid systematic bias due to varying total cluster file system load. The number of allowed simultaneous requests was set to 4,000, which was found beneficial compared to the default. Results are summarized in Table 1 and in the box plots in Fig. 2 and Fig. 3, showing the time use for the combinations of scenarios and modes. The directory removal task does not make sense to measure in the tmpfs case, since the actual data removed will already be on the NFS file system at the time when removal is supposed to take place.

The mean time consumption when using CannyFS was 80 seconds for archive extraction and 75 s for directory tree removal. When accessing the file system directly, the corresponding times were 517 s and 214 s, respectively, corresponding to walltime usage reductions of almost 85% for the extraction case, and roughly 65% for directory tree removal. Time results with the use of CannyFS were much less sensitive to total I/O load on the cluster. The maximum CannyFS time recorded for extraction was 98 seconds, while the maximum when writing directly to NFS was 915 seconds. Overall, results using tmpfs as an intermediary were similar to those for direct access to the NFS-based storage solution.

The actual time usage distribution for directory tree removal was bimodal, probably due to varying caching behavior for file system metadata at different levels. When caches are filled, directory entry removal is a very simple operation. As can be seen from Fig. 4, the CannyFS distribution is much more centered, but around a slightly higher value than the lower mode of the distribution acquired when using direct file system access.
Transactional integrity has been a necessity for relational databases since their inception. In the database context, the opaqueness of events within transactions to the outside world has also been a means of maximizing performance, while maintaining some level of data integrity. For transactional memory, it is more clear that the overall goal is to give high performance by not forcing the software implementation to handle the intricacies and overhead of locking for individual updates, or being able to undo a set of changes if a later change fails, for whatever reason. It is our belief that a wide set of I/O tasks follow a similar logic, whereas current implementations focus on integrity and consistency on the level of individual files or requests. Our practical CannyFS implementation demonstrates that significant performance benefits are possible when latency in the underlying I/O solution is high.
Table 1 Timing results (in seconds)

Test case            I/O mode        Min   Mean  Median   Max
Archive extraction   CannyFS          61     80      81    98
                     NFS             191    517     509   915
                     tmpfs + rsync   303    572     589   940
Directory removal    CannyFS          45     75      49   595
                     NFS              33    214      65  1021
Fig. 2
Benchmark for archive extraction with and without CannyFS. Typical decrease over 80%, with much decreased variability. Top and bottom lines for each box represent maximum and minimum, respectively. The boxed area represents the center two quartiles, with the median explicitly marked. The benchmarks consisted of extracting a zip archive containing the full Linux kernel source tree onto a network file system, using three operation modes: the shim file system CannyFS hiding operation latency to the extraction process, direct access over unmodified NFS, and extraction onto tmpfs followed by transfer using rsync, like a typical data out-staging workflow.

We are not the first to observe that file system latency can pose a problem. Other solutions have tended to focus on tuning specific implementations, such as HPN-SSH [7] providing a more high-performing version of SSH (including the SFTP protocol underlying sshfs), mainly by matching network and protocol buffer sizes, and in some cases increasing them. NFS implementations have the option of close-to-open cache consistency, i.e. the guarantee that one client opening a file after another client has closed it will see all changes made by the other party. In practice, this makes the closing of files a barrier. In addition, NFS can be used with or without the async option, controlling what operations need to complete before blocking calls return on the client.
Fig. 3
Benchmark for directory tree removal with and without CannyFS, with some slight overhead, but a far decreased maximum and median time use. Top and bottom lines for each box represent maximum and minimum, respectively, after an outlier filtering at approximately ± . σ. The boxed area represents the center two quartiles, with the median explicitly marked. The benchmarks concerned removing the full directory tree of a previously extracted and flushed copy of the Linux kernel source tree, stored on a network file system, using two operation modes: the shim file system CannyFS hiding operation latency to the removal process (plain rm -rf), and direct access over unmodified NFS. The overhead of our threading model and a user mode file system makes CannyFS slightly slower than the ideal case for NFS operation. However, this ideal case only occurs when NFS attribute caching of the recently extracted archive kicks in. CannyFS again shows resilience against varying load conditions on the storage infrastructure.

On a more general level, pCacheFS [11] is similar to CannyFS. pCacheFS is a shimming file system that employs a permanent local cache. However, pCacheFS in its current form is exclusively intended for read-only mirroring of file systems. CannyFS, on the other hand, is in one sense write-only. It performs most consistently when a task creates a new directory and writes a significant number of files to it, without ever reading them back. pCacheFS is also implemented in Python, while we aim for high-performing C++, trying to limit the overhead to what is incurred by any FUSE implementation.
Fig. 4
Histogram over directory tree removal times with (teal) and without (amber) CannyFS. Minimum removal times are lower without the CannyFS overhead, but CannyFS is far more reliable, bar a limited number of outliers.

This work has similarities to the concept of Burst Buffer [1], in that the goal is to accelerate the execution of write I/O. Burst Buffer attempts to use non-volatile or DRAM storage in specific nodes to coordinate writes from multiple nodes. CannyFS makes no attempt to coordinate writes from several nodes and keeps all data in the address space of the local machine. The justification is that a job which does not fully succeed can be restarted, with partial data cleared. If data need to be synchronized between different jobs, one option would be to have one node using CannyFS and all other nodes routing their I/O through a remote mount (ideally over a high-speed fabric) on that node. This would clearly not be sustainable for very high throughput scenarios, but for compute-bound tasks with some I/O-bound phases, it could still be relevant. We also note that the functionality of CannyFS is most similar to the planned "Stage 2" of Burst Buffer (transparent caching), which that project has yet to reach. CannyFS can also be easily applied within existing infrastructures by individual users, simply by allowing them to do FUSE mounts within directories they have access to. No additional hardware investment or reconfiguration is needed.

One could argue that rather than implementing a shimming file system, each task should be tuned regarding I/O. Tasks could be made asynchronous internally. However, traditional file I/O is not asynchronous in the POSIX standard, and making all I/O operations explicitly asynchronous can still be cumbersome within existing codebases. One could also argue that proper tuning of cache, integrity, and network stack settings for distributed and remote file systems can improve performance. However, such changes will often be system-wide and require root access. Different jobs on the same machine might have different needed transaction semantics against the same logical remote file system volume, making very aggressive system-wide caching or asynchronous behavior problematic. CannyFS can be a competitive option to first storing results on a scratch file system and then transferring them to permanent storage, as illustrated by our experiments. This reduces the volume load on scratch storage, and also saves the time of a separate unstaging step. For overlapped staging in of data, tools such as vmtouch [4] might be considered, depending on the size of datasets.

Our experiments confirm that significant gains are possible for a real-life scenario. We argue in favor of a pragmatic standpoint, where one can simply observe that these non-optimal, synchronous I/O workloads are real and do exist even in contexts where high performance concerns are important. In addition to maintenance tasks such as expanding archives and building code, the data and log output routines of many software packages which have paid great care to their computational performance are nonetheless synchronous and blocking. Therefore, significant gains, although not as staggering as what is demonstrated here, could be possible for some tasks that at first glance would be assumed to be CPU-bound.

5.1 Future work

While functional, the current implementation of CannyFS has some limitations. The eager functionality tries to model the behavior of true file systems, but, for example, running typical make or configure scripts is not always possible without disabling many parts of the eagerness and "inaccurate stat" functionality through command-line options. Especially configure scripts tend to create files with the same name repeatedly, and sometimes complex patterns of symlinks. An active aim is to improve functionality enough to allow the full configure and multi-stage build process of GCC into a clean build directory as a showcase scenario.

An example of an area which we do not intend to support is tools like rsync, which actively try to resume copying after a failure. Due to the possible re-ordering of I/O operations relative to their reported completion order, we do not intend to provide any guarantees that a partial copy is correct (although a full hashing should always be able to determine the correct status, some tools tend to use heuristics based on file sizes or modification dates in some scenarios). The same issue could also affect running make again on a pre-existing build tree, since the usual checks for the need to update files based on modification dates might fail.

The CannyFS implementation is currently creating a very high number of threads, and scrapping them. Ideally, threads should be reused, or a more general framework for task dependencies be used. However, many task-based models focus on tasks that are compute or network bound, not fully supporting the characteristics of file I/O. While thread creation is expensive, the cost tends to be low compared to many file I/O operations. Pipes are actively recycled between I/O operations. The fact that CannyFS only buffers writes, sending all reads to the operating system, means that any switch between reads and writes can incur a heavy performance penalty. In our benchmark, this is encountered due to the way unzip handles symlinks: the eventual target path of the symlink is first written to a regular file, which is then immediately read back and saved in memory for an eventual symlink API call.
Until the write operation has cleared the internal CannyFS queue, the read will be held back. No specific priorities are considered in the queuing (separate queues are handled per file and the kernel scheduler will schedule any thread that is ready), hence a high number of unrelated I/Os might be retired before the crucial one holding the read back is executed. We believe that the overhead seen in CannyFS directory tree removal relative to the most favorable cases using direct NFS access is mainly due to thread creation activity, or more specifically the overhead incurred on the spawning thread, since that is part of the critical synchronous path. However, even in that scenario, the varying latency of the underlying storage solution resulted in CannyFS coming up favorably in terms of mean time usage. The lowest numbers are probably due to file metadata still being cached on the NFS level. When rm was run separately (not shown) with a considerable time delay after archive extraction, the time usage for CannyFS stayed mostly the same (a few higher outliers), while low values for NFS time usage were never seen.

In addition, the current code is based on the high-level FUSE interface. Specifically, it relies on the text representation of paths quite frequently. The low-level interface would allow a more seamless integration with the Linux kernel module, which among other things also maintains a separate notion of file size.

CannyFS in its current form should be ready for limited production use. However, as with any FUSE file system, the allow_other permission is generally not recommended. This option allows users beyond the one launching the FUSE process to access the mounted file system. Any exploitable bug in the code would in that context result in a privilege escalation into the permissions of the user running the FUSE process.

The purpose of CannyFS is to hide latency in order to make synchronous tasks with a high number of discrete I/O events complete faster, getting close to saturating available bandwidth. Our experiments confirm that a reduction is possible, although performance is still far from what would be achievable if the file system end point were fully volatile. One can note that even the archive expansion task contains a few dependencies between I/Os, such as the creation of symlinks, which introduce bottlenecks. We offer a general solution for hiding latency where data integrity is only required on the level of tasks, but not on the level of individual I/O operations or files.

CannyFS is better than pre-sliced bread! It's like slicing your bread after you eat it! The code is licensed under the GPL and is available at .

Acknowledgments
The computational resources were provided by SNIC through the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) under Project c2016040. JN is funded as a systems administrator within UPPMAX. CN would like to acknowledge the former colleagues at the IT Division within the Uppsala University Administration for their insight into varied I/O workloads.
References
1. Bhimji, W., Bard, D., Romanus, M., Paul, D., Ovsyannikov, A., Friesen, B., Bryson, M., Correa, J., Lockwood, G.K., Tsulaia, V., et al.: Accelerating science with the NERSC Burst Buffer Early User Program. CUG (2016)
2. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database System Implementation, vol. 654. Prentice Hall, Upper Saddle River, NJ (2000)
3. Hammarlund, P., Kumar, R., Osborne, R.B., Rajwar, R., Singhal, R., D'Sa, R., Chappell, R., Kaushik, S., Chennupaty, S., Jourdan, S., et al.: Haswell: The fourth-generation Intel Core processor. IEEE Micro (2), 6–20 (2014)
4. Hoyte, D.: vmtouch - the virtual memory toucher. https://github.com/hoytech/vmtouch
5. Prabhakaran, V., Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H.: Analysis and evolution of journaling file systems. In: USENIX Annual Technical Conference, General Track, pp. 105–120 (2005)
6. Prabhakaran, V., Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H.: Analysis and evolution of journaling file systems. In: USENIX Annual Technical Conference, General Track, pp. 105–120 (2005)
7. Rapier, C., Bennett, B.: High speed bulk data transfer using the SSH protocol. In: Proceedings of the 15th ACM Mardi Gras Conference: From Lightweight Mash-ups to Lambda Grids: Understanding the Spectrum of Distributed Computing Requirements, Applications, Tools, Infrastructures, Interoperability, and the Incremental Adoption of Key Capabilities, p. 11. ACM (2008)
8. Szeredi, M., Rauth, N.: FUSE - Filesystems in Userspace. https://github.com/libfuse/libfuse
9. Tremblay, M., Chaudhry, S.: A third-generation 65nm 16-core 32-thread plus 32-scout-thread CMT SPARC processor. In: 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers (2008)
10. Tweedie, S.C.: Journaling the Linux ext2fs filesystem. In: The Fourth Annual Linux Expo (1998)
11. Tyers, J., Penninckx, P.: pCacheFS - persistent-caching FUSE filesystem. https://github.com/ibizaman/pcachefs