CannyFS: Opportunistically Maximizing I/O Throughput Exploiting the Transactional Nature of Batch-Mode Data Processing
Jessica Nettelblad* · Carl Nettelblad**
Abstract
We introduce a user mode file system, CannyFS, that hides latency by assuming all I/O operations will succeed. The user mode process will in turn report errors, allowing proper cleanup and a repeated attempt to take place. We demonstrate benefits for the model tasks of extracting archives and removing directory trees in a real-life HPC environment, giving typical reductions in time use of over 80%.

This approach can be considered a view of HPC jobs and their I/O activity as transactions. In general, file systems lack clearly defined transaction semantics. Over time, the competing trends to add cache and maintain data integrity have resulted in different practical tradeoffs.

High-performance computing is a special case where overall throughput demands are high. Latency can also be high, with non-local storage. In addition, a theoretically possible I/O error (like permission denied, loss of connection, exceeding disk quota) will frequently warrant the resubmission of a full job or task, rather than traditional error reporting or handling. Therefore, opportunistically treating each I/O operation as successful, and part of a larger transaction, can speed up some applications that do not leverage asynchronous I/O.
* E-mail: [email protected]
Uppsala Multidisciplinary Center for Advanced Computational Science, Uppsala University, Box 335, SE-751 05 Uppsala, Sweden

** E-mail: [email protected]
Division of Scientific Computing, Department of Information Technology, Science for Life Laboratory, Uppsala University, Box 335, SE-751 05 Uppsala, Sweden
We have produced a proof-of-concept implementation of a user mode file system, CannyFS, based on the idea that a full job at a computational cluster can be seen as a single transaction, and that specifics regarding success and consistency of individual I/O operations within the transaction can be considered irrelevant. If a transaction fails, it should be fully rolled back (results removed), and retried. CannyFS relies on the canny assumption that any I/O operation can (or should) succeed.

Compared to web applications or local interactive applications, many types of HPC software have very low requirements on making their I/O activity immediately visible outside of the writing job itself. It can sometimes be deemed acceptable to lose the result of a job if a hardware outage occurs halfway through, since the proper course of action will be to resubmit the full job anyway, if no checkpointing is done. However, traditional file systems cannot make such assumptions. Rather, the development over time has been in the direction of journaling file systems, where all changes to metadata or even file content are recorded in such a way that a power outage should always result in a consistent state representing a single point in time [5].

In a distributed environment, the true state of the file system will need to be synchronized between all possible users. One common choice is that metadata accesses (reads and/or writes, such as file creation, or checking file existence) are completed synchronously against the server. Any I/O operation is assumed to be able to fail, due to logic or hardware errors (ranging from permission denied or file not found to an exceeded quota or a total failure in network connectivity). In some use cases, such errors are expected and necessary for proper operation. In other cases, they represent truly exceptional states, where the proper course of action is to kill the whole HPC job, remedy the cause of the error, and then resubmit the job.

Other technologies in computer science face similar issues of data consistency, distributed state, and error reporting. For relational databases, the concept of transactional integrity is common [2]. Depending on the isolation level, transactions should act as more or less independent entities, where reads and writes should act as if all other transactions were either completed, or not even started yet. The sequence of transactions should be serializable, in the sense that the data read and the data written should be as if all transactions were executed in (some) serial order. Which serial order the actual execution corresponds to is undeterminable, and does not have to correspond to the order in which requests were sent. Failures during a transaction correspond to the whole transaction being rolled back. This gives database management software some freedom in how to treat the current "dirty" data during the transaction, as long as a final verdict of committing or rolling back is reached.

In low-level parallelism, the concept of "transactional memory" has gained similar popularity, including hardware implementations [9,3]. Rather than explicitly locking and maintaining a proper state of data for each possible outcome, a general feature exists for either accepting the full update, or rolling back.

File system transactions have existed, both as a research subject, and as general purpose implementations in some releases of e.g. Microsoft Windows [6,10,12]. This has been done from a data integrity perspective.
In this communication, we choose to consider the full outcome of an HPC job as a transaction. We assume that the job itself is already isolated, i.e. that no other processes read from the data produced by the job, or modify the data accessed by the job, while it is running. We also assume that the output of the job can easily be rolled back, manually or automatically, e.g. by removing a specific set of files, and that I/O failures are rare. We believe this to be a representative assumption in many cluster computing workloads, especially data processing tasks for wide datasets containing a high number of input and output files traversed serially on individual nodes. Using these assumptions, we can do very aggressive buffering of reads and writes, going as far as assuming that opening of files on a distributed resource will succeed. The result is a solution which is far less sensitive to network or I/O subsystem latency.

We evaluate this approach on an existing traditional HPC resource under load, doing I/O against the common high-performance file system from a single node, while other jobs are executing on other nodes. When using our prototype user mode shim file system CannyFS, we find great improvements in speed for our experiment workload, which contains a high number of small I/O operations over many files.
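To make the transactional view concrete, the sketch below (our own illustration, independent of CannyFS) shows a trivial job wrapper that treats the whole output directory as the unit of commit or rollback; the ./job_output path and the ./run_job command are hypothetical placeholders.

```cpp
// Illustration only: treat a whole batch job as one transaction.
// Assumes C++17 for <filesystem>; "./job_output" and "./run_job" are
// hypothetical placeholders, not part of CannyFS.
#include <cstdlib>
#include <filesystem>
#include <iostream>

int main() {
    namespace fs = std::filesystem;
    const fs::path out = "./job_output";

    fs::create_directories(out);
    // Run the workload; any I/O error inside it fails the whole "transaction".
    int rc = std::system("./run_job --output ./job_output");

    if (rc != 0) {
        // Roll back: remove partial results, then resubmit the job.
        fs::remove_all(out);
        std::cerr << "job failed, output rolled back; resubmit\n";
        return 1;
    }
    std::cout << "job committed: results left in " << out << "\n";
    return 0;
}
```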
Fig. 1
Schematic illustration of time use when writing four files, with (a) normal operation without CannyFS and (b) overlapping operation with CannyFS. In (a), a typical synchronous process looping over four files will at each time only expose the I/O requests to a single file to the I/O subsystem. If the subsystem makes guarantees on not returning before such requests have completed (e.g. file creation being successful, flushing when closing), the time delay for completing the full task can be considerable. In (b), CannyFS will report each request as completed, allowing the calling process to send additional requests. Total throughput is put to better use and latency is hidden. Even though the interval between opening and closing of any specific file might be longer, as in this example, total time use can be drastically reduced, especially in distributed environments with some level of I/O congestion. No code changes are needed in the calling process to accomplish this.
CannyFS is a user mode file system that will mirror the full root file system of the host where it runs, or alternatively only a specific subpath. It is intended to be run by an individual user within a batch job, with a mounting point which only this user has access to.

The characterizing feature of CannyFS is that all, or some, I/O operations are treated as eager, in the sense that they return as having completed without any request yet being sent to the actual I/O subsystem supposed to handle it. Naturally, this can only be true for requests that at completion return none, or only trivial, data. An example of the result, interpreted as the number of files getting concurrent I/O requests, is shown in Fig. 1.
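To make the eager pattern concrete, here is a minimal sketch (our illustration, not the actual CannyFS source): a FUSE mkdir handler that immediately reports success and hands the real call to a background worker. The single global queue, the BACKING path and the handler names are simplifying assumptions; CannyFS itself maintains per-file queues and covers many more operations.

```cpp
// Build (assumed): g++ -std=c++14 eager_mkdir.cpp $(pkg-config --cflags --libs fuse) -pthread
// Sketch only, not the CannyFS source: an "eager" FUSE mkdir that reports
// success at once and performs the real call later on a worker thread.
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <cstdio>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

static const std::string BACKING = "/path/to/backing";   // assumed backing directory

static std::mutex mtx;
static std::condition_variable cv;
static std::queue<std::function<void()>> pending;         // deferred operations

static void enqueue(std::function<void()> op) {
    { std::lock_guard<std::mutex> lk(mtx); pending.push(std::move(op)); }
    cv.notify_one();
}

static void worker() {                                     // drains the queue forever
    for (;;) {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [] { return !pending.empty(); });
        auto op = std::move(pending.front());
        pending.pop();
        lk.unlock();
        op();
    }
}

static int eager_mkdir(const char *path, mode_t mode) {
    std::string target = BACKING + path;                   // redirect into the mirrored tree
    enqueue([target, mode] {
        if (::mkdir(target.c_str(), mode) == -1)
            std::perror(("deferred mkdir " + target).c_str());  // error surfaces late
    });
    return 0;                                               // caller sees success immediately
}

int main(int argc, char *argv[]) {
    std::thread(worker).detach();
    struct fuse_operations ops = {};
    ops.mkdir = eager_mkdir;
    return fuse_main(argc, argv, &ops, nullptr);
}
```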
A data read operation cannot be eager. In fact, the opposite is true: when a read takes place, all writes to the same object first have to be flushed. One exception to this is that some levels of file system metadata (stat and related calls) can be mocked by default values. Individual flags are provided for the eagerness status of approximately 20 different I/O operations, roughly corresponding to different POSIX I/O primitives. The default setting is that all of these are on, but depending on the nature of a specific workload, it might be necessary to turn some off for proper operation (i.e. if a task somehow verifies that I/O completes properly during its normal course of operation, in a way that CannyFS is yet unable to recognize or represent).

There are similarities between CannyFS and a traditional write-behind cache. However, CannyFS only stores very limited data in user space (see Section 3). Most traditional caches share the assumption of CannyFS, that an operation will not fail due to a sudden complete crash or loss of connectivity. Furthermore, most caching implementations will not allow e.g. new files to be created without synchronizing this to the source file system. This fact can make execution of tasks that sweep over a high number of files sequentially very slow over a high-latency link, or when interfacing to an I/O system with moderate congestion, resulting in non-negligible roundtrip times for individual synchronous I/O requests.
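The flush-before-read rule described above can be expressed with a small piece of bookkeeping. The following is our own simplified illustration (not the CannyFS data structures), assuming a per-path counter of queued but not yet executed operations:

```cpp
// Sketch of the flush-before-read rule: reads are never eager, so a read must
// wait until every deferred operation on the same path has actually executed.
// The PendingOps bookkeeping and path strings below are illustrative only.
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

class PendingOps {
    std::mutex mtx;
    std::condition_variable cv;
    std::map<std::string, int> queued;          // path -> deferred ops not yet retired

public:
    void on_enqueue(const std::string &path) {  // an eager operation was accepted
        std::lock_guard<std::mutex> lk(mtx);
        ++queued[path];
    }
    void on_retired(const std::string &path) {  // a worker actually performed it
        { std::lock_guard<std::mutex> lk(mtx); --queued[path]; }
        cv.notify_all();
    }
    // Called before a read (or a stat that must be exact): block until all
    // earlier deferred operations on this path have been flushed.
    void wait_until_flushed(const std::string &path) {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [&] { return queued[path] == 0; });
    }
};

int main() {
    PendingOps ops;
    ops.on_enqueue("/data/out.txt");          // eager write queued, caller told "done"
    ops.on_retired("/data/out.txt");          // worker finished the real write
    ops.wait_until_flushed("/data/out.txt");  // a read may now safely proceed
    return 0;
}
```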
CannyFS utilizes FUSE [8], and is implemented in modern C++, using some features of C++14. It is based on a heavily modified version of the fusexmp_fh example, and is thus licensed under the General Public License.

Each operation that can be run eagerly is implemented as a lambda expression. This makes it easy to either enqueue an operation, or to perform it immediately, using the same codepath. Any I/O errors encountered during deferred operation are recorded and printed to stderr twice: when they happen, and in a global destructor (which is called during orderly process teardown). This ensures that the user will be notified of any I/O errors that were not properly reported back to the calling process. Serialization of writes, and of reads awaiting the completion of all preceding writes, is achieved through Resource Acquisition Is Initialization (RAII) wrappers encapsulating the crucial lock logic. Therefore, implementations of individual I/O operations can be almost trivial.

As an option, the file system process can abort when a failure is encountered, ensuring that any future accesses will return errors once at least one such failure has been recorded. This point will still occur after the logical point where the write was deemed completed by the caller.

Separate queues are maintained per open file. This means that all operations on a file, no matter what file handle they were executed through, are serialized. One reason for this design choice is to ensure that reads from one handle are not executed until writes that have been reported as complete have in fact been completed. Each file with active events gets a separate thread, making the corresponding I/O calls synchronously. A counter is updated in order to make it possible for synchronous read operations, happening outside the queue, to know whether a specific event has been executed or not.

It is expected that this serialization of events will not be perfect, since CannyFS cannot reliably reproduce all ways in which two paths that are not lexicographically identical can still end up referring to the same file. Proper operation thus has to be established for a specific workload. The most important intended use case is when all files are owned by the same user, without any links, and the most crucial metadata accessed are file names and actual file data, ignoring auxiliary dates and attributes.

In addition to pure write operations, CannyFS can also accelerate read-based directory traversal. This is mainly done by preemptively reading stat data for all files when a readdir call is made. Together with the serialization of writes (including removal operations), this means that an rm -rf style call over a large directory tree with many files can be significantly accelerated, but so can similar calls like find and du, which in practice tend to make one file system metadata read request per directory entry from the underlying file system.
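The enqueue-or-run pattern and the twofold error reporting described above could look roughly as follows. This is a hedged sketch under our own naming, not the CannyFS code; the enqueue() hook is a placeholder that a real implementation would route to a per-file worker thread.

```cpp
// Sketch of the enqueue-or-run pattern with twofold error reporting; names
// and the inline enqueue() placeholder are our assumptions, not CannyFS code.
#include <cstdio>
#include <functional>
#include <mutex>
#include <string>
#include <vector>

// Placeholder: a real implementation would hand the lambda to a per-file
// worker thread; running it inline keeps this sketch self-contained.
static void enqueue(std::function<void()> op) { op(); }

static std::mutex err_mtx;
static std::vector<std::string> deferred_errors;

static void record_error(const std::string &what) {
    std::lock_guard<std::mutex> lk(err_mtx);
    std::fprintf(stderr, "deferred I/O error: %s\n", what.c_str());  // first report
    deferred_errors.push_back(what);
}

// One code path for both modes: eager operations are queued and the caller is
// told "success" (0) right away; non-eager ones run synchronously.
static int run_or_defer(bool eager, std::function<int()> op, const std::string &what) {
    if (!eager)
        return op();
    enqueue([op, what] {
        if (op() != 0)
            record_error(what);               // caller has already seen success
    });
    return 0;
}

// Global object: its destructor re-prints recorded errors at orderly teardown.
static struct ErrorReporter {
    ~ErrorReporter() {
        for (const auto &e : deferred_errors)
            std::fprintf(stderr, "unresolved I/O error: %s\n", e.c_str());
    }
} reporter;

int main() {
    // Demo: a deferred operation that fails; the caller still saw 0.
    int seen = run_or_defer(true, [] { return -1; }, "mock write to /tmp/example");
    std::printf("caller saw return value %d\n", seen);
    return 0;
}
```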
Actual file content for read and write operations is never moved into user space in favorable cases. Rather, the Linux kernel and FUSE support for ferrying data between pipe file descriptors using the splice system call is used. Another layer of indirection pipes is created for write operations, in order to move data from the request handling onto the worker threads. Copying into normal user space buffers would ensure that write operations never block, at the cost of higher overhead.

The recommended approach is to make sure that the FUSE options -o big_writes -o max_write=65536 are used on most Linux kernels, where a pipe buffer is generally 65536 bytes in size. On very recent kernels, increasing the limit (in pages) imposed by /proc/sys/fs/pipe-user-pages-soft might also be recommended. If this is not done, later pipe allocations might fail, or only get a much smaller buffer. A too small buffer would result in I/O stalling, since the buffer-filling thread then gets implicitly synchronized with the actual execution of the write operation, removing the benefits of CannyFS.

The number of open files allowed per process is often limited. With many I/O operations in flight over many files, CannyFS might hit this limit. Therefore, there is also an option to limit the total number of operations that can stay queued at the same time, with a default value of 300. This is also to accommodate a high fraction of file descriptors being used for the pipe approach outlined above. The default file descriptor limit is 1024 in some environments. Increasing this, and the operation count limit in CannyFS, is highly recommended. The number of threads per process or user can also be limited, sometimes with the same default of 1024. In some environments, these limits can be inspected with ulimit -a and adjusted with ulimit -n and ulimit -u.
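As a standalone illustration of why the pipe buffer size matters (again a sketch, not CannyFS code), the snippet below asks Linux to grow a freshly created pipe to 65536 bytes, enough to park one full big_writes request without stalling the filling thread; the kernel may refuse or cap the request according to the pipe-user-pages-soft budget discussed above.

```cpp
// Sketch: grow a pipe intended to park incoming write data to 64 KiB.
// F_SETPIPE_SZ is Linux-specific; g++ defines _GNU_SOURCE by default, which
// exposes it via <fcntl.h>.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fds[2];
    if (pipe(fds) == -1) {
        std::perror("pipe");
        return 1;
    }

    // On success fcntl returns the capacity actually granted (possibly rounded
    // up); it can fail or be capped when the user exceeds the budget in
    // /proc/sys/fs/pipe-user-pages-soft.
    int granted = fcntl(fds[1], F_SETPIPE_SZ, 65536);
    if (granted == -1)
        std::perror("F_SETPIPE_SZ");
    else
        std::printf("pipe buffer is now %d bytes\n", granted);

    close(fds[0]);
    close(fds[1]);
    return 0;
}
```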
Benchmarks consisted of two file-intensive I/O workloads: unzipping the Linux kernel (the zip file of the current master branch from github as of 2016-10-01), and removing the resulting directory tree, as a separate operation. The full directory tree uses roughly 2,100 MB of file data, distributed over 59,259 directory entries, for an average file size of 36 kB. Three storage operation modes were considered: CannyFS mapping of the storage solution, direct solution access over NFS, and temporarily saving data on a tmpfs mount, then staging it out using rsync.

All tests were executed as the exclusive job on a node within the larger HPC cluster Milou at the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX), which is used for varied workloads mainly related to bioinformatics. The main storage solution consists of a set of NAS units from Hitachi, mounted over NFS with the NFS client settings rw,noatime,vers=3,proto=tcp,wsize=1048576,rsize=1048576. The storage network connection for the node was a single GbE port. Even though these settings might be tuned, such tuning would require system administrator intervention, while the choice to use CannyFS for a specific task can be made by the end user.

Unzipping was done using the version of the unzip tool included in the OS distribution (Scientific Linux 6). Each CannyFS test was executed by creating a new mount, and including the time for fully killing the CannyFS process after the test (which will unmount the file system and flush all pending I/O). After this, file system caches were also flushed, by writing 3 to /proc/sys/vm/drop_caches. Total time use was measured up to the point after cache flushing. Timing for tmpfs included the time for synchronization to permanent storage. 48 replicates were executed, with interleaving between the three storage modes to avoid systematic bias due to varying total cluster file system load. The number of allowed simultaneous requests was set to 4,000, which was found beneficial compared to the default. Results are summarized in Table 1 and in the box plots in Fig. 2 and Fig. 3, showing the time use for the combinations of scenarios and modes. The directory removal task does not make sense to measure in the tmpfs case, since the actual data removed will already be on the NFS file system at the time when removal is supposed to take place.

The mean time consumption when using CannyFS was 80 seconds for archive extraction and 75 s for directory tree removal. When accessing the file system directly, the corresponding times were 517 s and 214 s, respectively, corresponding to walltime usage reductions of almost 85% for the extraction case, and roughly 65% for directory tree removal. Time results with the use of CannyFS were much less sensitive to total I/O load on the cluster. The maximum CannyFS time recorded for extraction was 98 seconds, while the maximum when writing directly to NFS was 915 seconds. Overall, results using tmpfs as an intermediary were similar to those for direct access to the NFS-based storage solution.

The actual time usage distribution for directory tree removal was bimodal, probably due to varying caching behavior for file system metadata at different levels. When caches are filled, directory entry removal is a very simple operation. As can be seen from Fig. 4, the CannyFS distribution is much more centered, but around a slightly higher value than the lower mode of the distribution acquired when using direct file system access.
Transactional integrity has been a necessity for relational databases since their inception. In the database context, the opaqueness of events within transactions to the outside world has also been a means of maximizing performance, while maintaining some level of data integrity. For transactional memory, it is more clear that the overall goal is to give high performance by not forcing the software implementation to handle the intricacies and overhead of locking for individual updates, or being able to undo a set of changes if a later change fails, for whatever reason. It is our belief that a wide set of I/O tasks follow a similar logic, whereas current implementations focus on integrity and consistency on the level of individual files or requests. Our practical CannyFS implementation demonstrates that significant performance benefits are possible when latency in the underlying I/O solution is high.
Table 1 Timing results (in seconds)

Test case            I/O mode        Min   Mean  Median   Max
Archive extraction   CannyFS          61     80      81    98
                     NFS             191    517     509   915
                     tmpfs + rsync   303    572     589   940
Directory removal    CannyFS          45     75      49   595
                     NFS              33    214      65  1021
Fig. 2
Benchmark for archive extraction with and without CannyFS. Typical decrease over 80%, with much decreased variability. Top and bottom lines for each box represent maximum and minimum, respectively. The boxed area represents the center two quartiles, with the median explicitly marked. The benchmarks consisted of extracting a zip archive containing the full Linux kernel source tree onto a network file system, using three operation modes: the shim file system CannyFS hiding operation latency to the extraction process, direct access over unmodified NFS, and extraction onto tmpfs followed by transfer using rsync, like a typical data out-staging workflow.

We are not the first to observe that file system latency can pose a problem. Other solutions have tended to focus on tuning specific implementations, such as HPN-SSH [7] providing a more high-performing version of SSH (including the SFTP protocol underlying sshfs), mainly by matching network and protocol buffer sizes, and in some cases increasing them. NFS implementations have the option of close-to-open cache consistency, i.e. the guarantee that one client opening a file after another client has closed it will see all changes made by the other party. In practice, this makes the closing of files a barrier. In addition, NFS can be used with or without the async option, controlling what operations need to complete before blocking calls return on the client.
Fig. 3
Benchmark for directory tree removal with and without CannyFS, with some slight overhead, but a far decreased maximum and median time use. Top and bottom lines for each box represent maximum and minimum, respectively, after an outlier filtering at approximately ± . σ. The boxed area represents the center two quartiles, with the median explicitly marked. The benchmarks concerned removing the full directory tree of a previously extracted and flushed copy of the Linux kernel source tree, stored on a network file system, using two operation modes: the shim file system CannyFS hiding operation latency to the removal process (plain rm -rf), and direct access over unmodified NFS. The overhead of our threading model and a user mode file system makes CannyFS slightly slower than the ideal case for NFS operation. However, this ideal case only occurs when NFS attribute caching of the recently extracted archive kicks in. CannyFS again shows resilience against varying load conditions on the storage infrastructure.

On a more general level, pCacheFS [11] is similar to CannyFS. pCacheFS is a shimming file system that employs a permanent local cache. However, pCacheFS in its current form is exclusively intended for read-only mirroring of file systems. CannyFS, on the other hand, is in one sense write-only. It performs most consistently when a task creates a new directory and writes a significant number of files to it, without ever reading them back. pCacheFS is also implemented in Python, while we aim for high-performing C++, trying to limit the overhead to what is incurred by any FUSE implementation.
Fig. 4
Histogram over directory tree removal times with (teal) and without (amber) CannyFS. Minimum removal times are lower without the CannyFS overhead, but CannyFS is far more reliable, bar a limited number of outliers.

This work has similarities to the concept of Burst Buffer [1], in that the goal is to accelerate the execution of write I/O. Burst Buffer attempts to use non-volatile or DRAM storage in specific nodes to coordinate writes from multiple nodes. CannyFS makes no attempt to coordinate writes from several nodes and keeps all data in the address space of the local machine. The justification is that a job which does not fully succeed can be restarted, with partial data cleared. If data need to be synchronized between different jobs, one option would be to have one node using CannyFS and all other nodes routing their I/O through a remote mount (ideally over a high-speed fabric) on that node. This would clearly not be sustainable for very high throughput scenarios, but for compute-bound tasks with some I/O-bound phases, it could still be relevant. We also note that the functionality of CannyFS is most similar to the planned "Stage 2" of Burst Buffer (transparent caching), which that project has yet to reach. CannyFS can also be easily applied within existing infrastructures by individual users, simply by allowing them to do FUSE mounts within directories they have access to. No additional hardware investment or reconfiguration is needed.

One could argue that rather than implementing a shimming file system, each task should be tuned regarding I/O. Tasks could be made asynchronous internally. However, traditional file I/O is not asynchronous in the POSIX standard, and making all I/O operations explicitly asynchronous can still be cumbersome within existing codebases. One could also argue that proper tuning of cache, integrity, and network stack settings for distributed and remote file systems can improve performance. However, such changes will often be system-wide and require root access. Different jobs on the same machine might have different needed transaction semantics against the same logical remote file system volume, making very aggressive system-wide caching or asynchronous behavior problematic. CannyFS can be a competitive option to first storing results on a scratch file system and then transferring them to permanent storage, as illustrated by our experiments. This reduces the volume load on scratch storage, and also saves the time of a separate unstaging step. For overlapped staging in of data, tools such as vmtouch [4] might be considered, depending on the size of datasets.

Our experiments confirm that significant gains are possible for a real-life scenario. We argue in favor of a pragmatic standpoint, where one can simply observe that these non-optimal, synchronous I/O workloads are real and do exist even in contexts where high performance concerns are important. In addition to maintenance tasks such as expanding archives and building code, the data and log output routines of many software packages which have paid great care to their computational performance are nonetheless synchronous and blocking. Therefore, significant gains, although not as staggering as what is demonstrated here, could be possible for some tasks that at first glance would be assumed to be CPU-bound.

5.1 Future work

While functional, the current implementation of CannyFS has some limitations. The eager functionality tries to model the behavior of true file systems, but, for example, running typical make or configure scripts is not always possible without disabling many parts of the eagerness and "inaccurate stat" functionality through command-line options. Especially configure scripts tend to create files with the same name repeatedly, and sometimes complex patterns of symlinks. An active aim is to improve functionality enough to allow the full configure and multi-stage build process of GCC into a clean build directory as a showcase scenario.

An example of an area which we do not intend to support is tools like rsync, which actively try to resume copying after a failure. Due to the possible re-ordering of I/O operations relative to their reported completion order, we do not intend to provide any guarantees that a partial copy is correct (although a full hashing should always be able to determine the correct status, some tools tend to use heuristics based on file sizes or modification dates in some scenarios). The same issue could also affect running make again on a pre-existing build tree, since the usual checks for the need to update files based on modification dates might fail.

The CannyFS implementation is currently creating a very high number of threads, and scrapping them. Ideally, threads should be reused, or a more general framework for task dependencies be used. However, many task-based models focus on tasks that are compute or network bound, not fully supporting the characteristics of file I/O. While thread creation is expensive, the cost tends to be low compared to many file I/O operations. Pipes are actively recycled between I/O operations. The fact that CannyFS only buffers writes, sending all reads to the operating system, means that any switch between reads and writes can incur a heavy performance penalty. In our benchmark, this is encountered due to the way unzip handles symlinks: the eventual target path of the symlink is first written to a regular file, which is then immediately read back and saved in memory for an eventual symlink API call.
Until the write operation has cleared the internal CannyFS queue, the read will be held back. No specific priorities are considered in the queuing (separate queues are handled per file and the kernel scheduler will schedule any thread that is ready), hence a high number of unrelated I/Os might be retired before the crucial one holding the read back is executed. We believe that the overhead seen in CannyFS directory tree removal relative to the most favorable cases using direct NFS access is mainly due to thread creation activity, or more specifically the overhead incurred on the spawning thread, since that is part of the critical synchronous path. However, even in that scenario, the varying latency of the underlying storage solution resulted in CannyFS coming up favorably in terms of mean time usage. The lowest numbers are probably due to file metadata still being cached on the NFS level. When rm was run separately (not shown) with a considerable time delay after archive extraction, the time usage for CannyFS stayed mostly the same (a few higher outliers), while low values for NFS time usage were never seen.

In addition, the current code is based on the high-level FUSE interface. Specifically, it relies on the text representation of paths quite frequently. The low-level interface would allow a more seamless integration with the Linux kernel module, which among other things also maintains a separate notion of file size.

CannyFS in its current form should be ready for limited production use. However, as with any FUSE file system, the allow_other permission is generally not recommended. This option allows users beyond the one launching the FUSE process to access the mounted file system. Any exploitable bug in the code would in that context result in a privilege escalation into the permissions of the user running the FUSE process.

The purpose of CannyFS is to hide latency in order to make synchronous tasks with a high number of discrete I/O events complete faster, getting close to saturating available bandwidth. Our experiments confirm that a reduction is possible, although performance is still far from what would be achievable if the file system end point were fully volatile. One can note that even the archive expansion task contains a few dependencies between I/Os, such as the creation of symlinks, which introduce bottlenecks. We offer a general solution for hiding latency where data integrity is only required on the level of tasks, but not on the level of individual I/O operations or files.

CannyFS is better than pre-sliced bread! It's like slicing your bread after you eat it! The code is licensed under the GPL and is available at .

Acknowledgments
The computational resources were provided by SNIC through the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) under Project c2016040. JN is funded as a systems administrator within UPPMAX. CN would like to acknowledge the former colleagues at the IT Division within the Uppsala University Administration for their insight into varied I/O workloads.
References
1. Bhimji, W., Bard, D., Romanus, M., Paul, D., Ovsyannikov, A., Friesen, B., Bryson, M., Correa, J., Lockwood, G.K., Tsulaia, V., et al.: Accelerating science with the NERSC Burst Buffer Early User Program. CUG (2016)
2. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database System Implementation, vol. 654. Prentice Hall, Upper Saddle River, NJ (2000)
3. Hammarlund, P., Kumar, R., Osborne, R.B., Rajwar, R., Singhal, R., D'Sa, R., Chappell, R., Kaushik, S., Chennupaty, S., Jourdan, S., et al.: Haswell: The fourth-generation Intel Core processor. IEEE Micro (2), 6–20 (2014)
4. Hoyte, D.: vmtouch - the virtual memory toucher. https://github.com/hoytech/vmtouch
5. Prabhakaran, V., Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H.: Analysis and evolution of journaling file systems. In: USENIX Annual Technical Conference, General Track, pp. 105–120 (2005)
6. Prabhakaran, V., Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H.: Analysis and evolution of journaling file systems. In: USENIX Annual Technical Conference, General Track, pp. 105–120 (2005)
7. Rapier, C., Bennett, B.: High speed bulk data transfer using the SSH protocol. In: Proceedings of the 15th ACM Mardi Gras Conference: From Lightweight Mash-ups to Lambda Grids: Understanding the Spectrum of Distributed Computing Requirements, Applications, Tools, Infrastructures, Interoperability, and the Incremental Adoption of Key Capabilities, p. 11. ACM (2008)
8. Szeredi, M., Rauth, N.: FUSE - Filesystems in Userspace. https://github.com/libfuse/libfuse
9. Tremblay, M., Chaudhry, S.: A third-generation 65nm 16-core 32-thread plus 32-scout-thread CMT SPARC processor. In: 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers (2008)
10. Tweedie, S.C.: Journaling the Linux ext2fs filesystem. In: The Fourth Annual Linux Expo (1998)
11. Tyers, J., Penninckx, P.: pCacheFS - persistent-caching FUSE filesystem. https://github.com/ibizaman/pcachefs