Rambrain - a library for virtually extending physical memory
Imgrund, M. a,b, Arth, A. a,c
a University Observatory Munich, Scheinerstraße 1, 81679 Munich, Germany
b Max-Planck-Institute for Radio Astronomy, Auf dem Hügel 69, 53121 Bonn, Germany
c Max-Planck-Institute for Extraterrestrial Physics, Giessenbachstrasse 1, 85748 Garching, Germany
Abstract
We introduce Rambrain, a user space C++ library that manages the memory consumption of data-intense applications. Using Rambrain one can overcommit memory beyond the size of physical memory present in the system. While there exist other, more advanced techniques to solve this problem, Rambrain focuses on saving development time by providing a fast, general and easy-to-use solution. Rambrain takes care of temporarily swapping out data to disk and can handle multiples of the physical memory size present. Rambrain is thread-safe, OpenMP and MPI compatible and supports asynchronous IO. The library is designed to require minimal changes to existing programs and to pose only a small overhead.
Keywords: memory management; physical memory limitations; abstraction library; system paging; open source; MPI/OpenMP
1. Introduction
Facing large amounts of data, be it simulations or observation results, many astrophysicists have become part-time software engineers. As the primary target of their work focuses on producing astrophysical results, developing data analysis code is an inevitable obstacle on the way to the actual goal. In the case of the authors this goal is, respectively, to analyse extensive data sets of pulsar timing information (based on Imgrund et al., 2015) and to post-process large snapshots of cosmological simulations (see Arth et al., in prep.). While typical software engineering amounts to serialising given tasks to be executed as quickly as possible, many everyday codes evaluating data or simulation results are written to be run only a few times. In this light, the primary focus of an astrophysicist often lies on saving development time and not execution time.
Writing code that processes large data sets is one of the most time consuming tasks. When developing applications that use large amounts of main memory, a single larger dataset may suffice for the system to run out of memory. The typically chosen solution to this is finding a machine with more main memory. It is obvious that this solution is only temporary when facing growing amounts of data. The sophisticated solution amounts to writing memory management functions in an optimised but specialised way for the problem at hand, so called "out-of-core computing". This, however, is very (development) time consuming.
Alternatively, one can think of following the typical approach nowadays, which has been made possible by ongoing hardware developments, and solve the memory shortage by parallelising one's code. In addition to common computing clusters, hardware vendors increase the number of possibilities by introducing additional components like non-volatile memory (NVRAM) or memory with high bandwidth (MCDRAM). However, the task of parallelising remains and is in general non-trivial to implement, since a distributed memory parallelisation, for example using MPI, has to be chosen. Additionally, not every code scales properly. Thus, one might run into the issue of wasting a lot of CPU time, which has to be granted after writing computing proposals, just to fulfil memory requirements.
Therefore, we introduce Rambrain, a library that facilitates quick development of applications in need of large main memory. It is built to easily integrate with existing C++ code on Linux and helps applications to swap out temporarily unneeded data in order to transparently access multiples of the actual physical memory available on the system. While there may exist other solutions, more specific to the problem at hand, showing slightly better performance, we argue that in most situations the flexibility of a fast, reliable and out-of-the-box solution is preferred to a few percent performance gain. In the following, we provide a quick review of other solutions to the problem at hand and discuss in which cases Rambrain might be a superior choice.
Email addresses: [email protected] (Imgrund, M.), [email protected] (Arth, A.)
URL: https://github.com/mimgrund/rambrain/ (Imgrund, M.)
2. Common strategies to avoid out-of-memory errors
The most basic strategy to still run an application in a situation of scarce free memory is using native system swapping. Modern operating systems like Linux manage
Preprint submitted to SoftwareX, September 22, 2018

Table 1: Code metadata
C1 Current code version: 1.1
C2 Permanent link to code/repository used for this code version: https://github.com/mimgrund/rambrain
C3 Legal Code License: GPL
C4 Code versioning system used: git
C5 Software code languages, tools, and services used: C++, OpenMP, MPI
C6 Compilation requirements, operating environments & dependencies: Linux, libaio
C7 If available, link to developer documentation/manual: http://mimgrund.github.io/rambrain/
C8 Support email for questions: [email protected]

the association of physical memory to the various processes running at a given moment. As an application developer, you are presented with a more or less consecutive virtual memory address space. It is in general not clear whether a chunk of virtual memory, a so called "page", is residing in a physical main memory location, called a "frame", at a given time or not. This layer of abstraction facilitates the assignment of memory to a process, so that the system can overcommit physical memory and reassign virtual pages to physical frames when desired. When free frames become scarce, the system writes out currently unused pages to secondary storage (such as hard disks) in order to free frames. When a process tries to access a non-resident page, a page fault is triggered and the page is read in from secondary storage by the memory manager of the system (Ligh et al., 2014, p. 20); if necessary, the required frames are freed by writing the occupying pages out beforehand.
While this process is efficient under normal operation, the system typically slows down to being unusable when actively consuming nearly all physical memory. Especially when multiple processes compete for the remaining space (a typical situation for a developer working and debugging), the computer is virtually unusable until the memory-intense calculation has finished. How long a system can survive in a usable state might depend on the type of secondary storage employed. For example, an SSD may keep a system usable for a longer time than a common HDD just because of its higher speed of reading and writing data. Inevitably, the system will still be overwhelmed by the amount of data scheduled for transfer and especially by the concurrent requests due to multitasking.
This swapping mechanism is also limited by the available swap space on the secondary storage.
While adding more swap space with the system's on-board mechanisms is possible, it needs super user privileges and reserves the whole swap size on the disk even if it is not used completely. Furthermore, it aggravates the situation when multiple processes are competing for memory, as more and more parts of other programs can be swapped out and need to be swapped in again in order to continue execution.
Using system swapping as a mechanism for overcommitting main memory can also provoke the action of the so called "Out-Of-Memory Killer" (OOM-Killer). As available memory becomes scarce, the system tries to keep most processes running. In order to free memory for other processes, the OOM-Killer will kill one or more processes, selected by assigning a score correlated with the importance, memory consumption and execution and idle times of the candidate process. The OOM-Killer thus can abort a simulation or analysis at the very last step, and protections against it are hard to find (see e.g. Rodrigues, 2009). The OOM-Killer can by now be controlled somewhat more finely via the /proc file system, but shutting it off for a certain process needs administrator privileges. However, one has to keep in mind that even if one can force the own application to stay alive, the OOM-Killer can simply shut down system processes, which may trigger secondary effects on the target process. To the knowledge of the authors, it is not possible to completely turn off the OOM-Killer on every system. This becomes clear when considering the alternatives in a situation of low RAM. A call to the sbrk family of functions to increase the heap size could possibly block indefinitely, locking the process that called for more memory. Unless any other process frees memory or terminates, the next process demanding more heap memory will block too.
[Footnote: Using the system tools mkswap/swapon as root.]
The resulting cascade of blocking processes would probably have much worse consequences for system health than killing a specific process based on a reasonable metric.
There exist other global kernel parameters, such as the kernel 'swappiness', to manipulate kernel swapping behaviour. At first glance, decreasing or increasing the amount of pre-emptive swap-out of idle applications' virtual memory to disk sounds like a reasonable strategy to globally keep the system efficiently in function. Tuning this parameter, however, is only useful when the amount of free physical memory is huge compared to the problem at hand. While low values of this parameter will delay the start of swapping out considerably, the demand of the main application for more RAM will dominate at some point below the physical memory size. In addition, such global tweaks have to be applied system wide. While a user space solution like Rambrain can be applied to any system at hand, it requires very good cooperation with system administrators to employ such a behaviour on a managed machine.
The next often mentioned solution to memory and swap management is the mlock and mmap family of kernel functions. mlock is capable of locking address ranges against kernel swap-out and can also advise the kernel to swap in ranges of memory from the swap space. While these functions can be a usable approach for real-time applications that rely on fast memory access, they in no way limit heap growth. Thinking from the perspective of 'freeing physical memory for new calculations', the functions are of very limited use, as one cannot force the operating system to write out data to swap and there is no guarantee that this will affect the physical process size at all.
The mmap family of functions is used to seamlessly map disk files to virtual address space. The file can then be manipulated as if it were resident at that virtual address space location.
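To make the mmap approach concrete, the following minimal sketch (the file path and helper name are illustrative, not part of Rambrain) backs an array of doubles with a disk file, letting the kernel page the data in and out on demand:

```cpp
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

// Map a file of n doubles into the virtual address space. The kernel
// lazily pages the contents in on access and writes dirty pages back,
// so the array can be larger than physical memory.
double *map_doubles(const char *path, size_t n) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return nullptr;
    // Grow the file first: accessing a mapping beyond EOF raises SIGBUS.
    if (ftruncate(fd, n * sizeof(double)) != 0) { close(fd); return nullptr; }
    void *p = mmap(nullptr, n * sizeof(double), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd); // the mapping holds its own reference to the file
    return p == MAP_FAILED ? nullptr : static_cast<double *>(p);
}
```

The stronger limitations discussed in the text, such as allocating many differently sized objects on top of such a region and deciding which windows to keep mapped, are exactly what a library layer has to manage on behalf of the user.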
Combined with mlock calls, the user is able to finely tune which parts of a file will be resident in physical memory. There even exists an interface that can be used to track which parts of a file currently reside in physical memory. Also, memory mapped regions are accounted for as cache; thus this memory will be swapped away preferably when system memory becomes low, which reduces the overall memory footprint of the application. However, using memory maps for large files effectively can be very complicated, as it may only be reasonable to open certain 'windows' into regions of the file used for swapping, and the number of regions is limited by file descriptor limits.
Such a more controllable user-space solution is desirable, for example combining the memory mapping system calls with moderately sized swap files on the secondary storage. Memory mapping techniques are fast because they use the same paging and copy mechanisms as system swapping, but they are subject to stronger limitations than letting the system handle the paging itself. The consecutive logical address space that is handed over to the process has to be managed by the user. This means that the user has to take care of allocating multiple data structures on top of the space, a mechanism that the new/delete operators normally deal with in C++. While handling, for example, a vector of fixed size structures in a memory map is simple, allocating objects of different sizes is highly non-trivial. As the system is responsible for writing out the memory mapped regions to the file on secondary storage, efficient interaction with the kernel when changing the memory-mapped region is challenging when trying to optimise this process for performance. Furthermore, a strategy deciding which contiguous region to swap out is all but clear.
The authors in fact started to write a backend for the actual swapping I/O of Rambrain with memory mapped files.
[Footnote: Both the number and size of memory maps are limited by the system.]
In the long run, it turned out to be much more complicated to synchronise the swapping behaviour of the mapped regions to gain performance without knowing the exact access pattern of the user beforehand and having only few guarantees from the Linux kernel API. Thus, a perhaps more performant solution to a problem at hand can be implemented using these facilities, but this turns out to be a difficult endeavour that will at least lead to complicated memory management code. Rambrain wants to facilitate the development of memory-intensive applications and is designed to take the burden of writing exactly such code from the user. In that respect, Rambrain will not beat a custom tailored solution, but coding such a solution is a hard task in its own respect. This renders such a technique possible, but complicates a robust implementation and favourable run time behaviour in highly dynamic situations.
Of course, there already exist other solutions to tackle large data structures in memory, such as the STXXL (Dementiev et al., 2008), which facilitates out-of-core computation by providing large standard containers in analogy to the Standard Template Library (STL). While this is a very useful idea, it still has some drawbacks imposed by its specialised approach. Rambrain has built-in class support for the full C++ standard, in contrast to the limitation to POD support of the STXXL. Rambrain provides direct access to pointers in memory and thus will pose no overhead over heap allocation once the pointers have been provided. Additionally, objects created with Rambrain can be used in association with normal STL containers and will be swapped, too.
An alternative approach, using parallel virtual file systems, is also imaginable (see for example Tang et al., 2004). However, this kind of approach still leaves the programmer with the burden of writing IO operations himself, even if they may be encapsulated, e.g.
as a function. Furthermore, optimising the data flow on this level comes near to developing an out-of-core algorithm for the problem at hand that takes control over all input and output operations manually. Introductory reviews of such algorithms can be found in Toledo (1999a); Vitter (2001). Of course one can design a very clever way of handling input and output data to boost performance. This, however, opposes the goal to find a more generic solution that gives the developer moderate control over input and output flow while taking from him the burden of handling the input and output manually. Specialised solutions cover for example n-body codes (Salmon and Warren, 1997) or linear algebra calculations (Toledo, 1999b; Reiley and van de Geijn, 1999).
From the view of the application developer, the situation is very simple: when writing a program the developer knows what data he uses, what he will use next, and what is not needed for a longer time. This information is always present directly in the source code. In the next section we will introduce the interface which communicates this information to the library.

Listing 1: Typical two dimensional field initialisation
1  double kx = 1., ky = 1.;
2  unsigned int x_max = 1024, y_max = 1024;
3
4  double *arr[x_max];
5  for (int x = 0; x < x_max; ++x)      // allocate rows
6      arr[x] = new double[y_max];
7  for (int x = 0; x < x_max; ++x) {    // initialize field
8      double *line = arr[x];
9      double xx = x / (double)x_max;
10     for (int y = 0; y < y_max; ++y) {
11         double yy = y / (double)y_max;
12         line[y] = sin((xx * kx + yy * ky));
13     }
14 }
15 // do something and delete afterwards:
16 for (int x = 0; x < x_max; ++x)
17     delete[] arr[x];                 // deallocate lines
3. Interfacing Rambrain
In order to manage the storage needs of a C++ application, we are faced with the problem of designing an interface to tell Rambrain which data is to be managed and when it has to be present. In this section we introduce this interface, built to require minimal changes to existing code while at the same time providing rich convenience features where possible.
As a memory manager keeping track of data has some overhead of its own, it is only useful when the managed data is large. Rambrain can manage simple primitives, arrays and whole classes, and also supports nesting of managed objects into managed classes. For a start, consider the code in listing 1 that initialises a two dimensional plane wave field of data type double on heap memory. We allocate an array of pointers to the respective field rows in line 4, allocate the actual rows in line 6, and set up a plane wave over all field values in lines 7 to 14. Some calculations are executed prior to the deallocation of the rows in line 17.
If we assume now that y_max and x_max take large values, the allocated doubles will consume a non-negligible amount of RAM, passing a gigabyte at roughly 11600 elements per dimension. Thus, the developer would have to swap out elements if he seeks to avoid system swapping, to ensure that the program does not run out of physical memory. A manual implementation inserts many lines of code when allocating memory and around line 8. Alternatively, the user would write his own memory manager version calling functions to load and unload data. When several objects are needed at once, loading and unloading become the dominant part of the code. Furthermore, the additional lines start to obfuscate the algorithmic code structure. The nested for loops as well as the essential initialisation done will be difficult to spot.
Minimal changes to this passage of code will allocate the arrays so that Rambrain is aware of them and dynamically loads and unloads the lines if needed, as can be seen in listing 2.

Listing 2: Typical two dimensional field initialisation with Rambrain
1  double kx = 1., ky = 1.;
2  unsigned int x_max = 1024, y_max = 1024;
3
4  managedPtr<double> *arr[x_max];
5  for (int x = 0; x < x_max; ++x)      // allocate rows
6      arr[x] = new managedPtr<double>(y_max);
7  for (int x = 0; x < x_max; ++x) {    // initialize field
8      adhereTo<double> glue(*arr[x]);
9      double *line = glue;
10     double xx = x / (double)x_max;
11     for (int y = 0; y < y_max; ++y) {
12         double yy = y / (double)y_max;
13         line[y] = sin((xx * kx + yy * ky));
14     }
15 }
16 // do something and delete afterwards:
17 for (int x = 0; x < x_max; ++x)
18     delete arr[x];                   // deallocate lines

The overall structure is minimally changed. Apart from adding line 8 we only wrap data objects. We introduce two template classes here, managedPtr<> and adhereTo<>, to emplace Rambrain. When using Rambrain in a minimal way, these two classes will be the only ones actively referenced by the developer.
The first class, managedPtr<>, replaces allocation and deallocation by Rambrain wrappers. This replacement is necessary to hide away the pointer to the actual data in logical memory, as the element may or may not be present when the user dereferences that pointer. Consequently, we need a way to give back access to the data. This is done by adhereTo<>, which states its meaning in camel case: this object adheres to the data. While the respective adhereTo<> object exists according to scoping rules, it is guaranteed that the user can fetch a valid pointer to the data by assigning the adhereTo<> object to the pointer, as is done in line 9.
In the following, we will also refer to this as "pulling the pointer". The scoping relieves the user from the need to explicitly state that the data is no longer used for the moment. While the corresponding adhereTo<> object exists, the pointer to the data remains valid. When this "glue" to a managedPtr<> is deleted, for example by going out of scope, the object may be swapped out to disk in order to free space in physical memory for other objects, if needed. This already concludes what a developer needs to know about Rambrain to write his own code using the library in the most basic fashion.
Currently, Rambrain is, amongst others, equipped with the following advanced features that give more detailed control or convenience. The line numbers given refer to the code examples in listing 3. The advanced features show that the interface is both minimalistic and powerful enough to facilitate development with Rambrain.
• Allocation of simple datatypes.
The user may allocate a single object or multiple objects at once, passing an initial value. Multidimensional arrays are also supported; they will be collapsed to an array of managedPtr<>s of the size of the last dimension. (lines 1-4)
• Class allocation.
Class objects may have nested managedPtr<>s, which can be swapped out independently of the class object. Rambrain supports parametrised as well as default constructors. Destructors will be called in the correct sequence. Furthermore, the member hierarchy can be tracked. Finally, Rambrain will ensure correct deallocation of the object. As some or all parts of it may have been swapped out, this is a non-trivial task. The code supports array initialisation on classes, too. (lines 6-15)
• Different kinds of loading stages.
The user may explicitly state whether to load objects immediately or delay actual loading until the first pointer is pulled from the adhereTo<> object. Rambrain can profit from const-accessing the data. In case the object has already been swapped out, the swap file copy is not changed and is reused, so another write-out is not necessary. If the developer requests write access, the object has to be rewritten to the file system for a swap-out. Therefore, when only reading data, using const pointers is highly encouraged, as will be seen in section 5.4. (lines 17-23)
• Convenience macros.
When adhering to an object and pulling a pointer should happen in the same spot, we provide convenience macros that create the adhereTo<> object together with pulling a pointer in a single line. For class members this may happen shadowing a member; in this case, the resulting code reads as if the class contained an unmanaged array of the same name. Of course, const versions of these macros exist, too. (lines 25-30)
• Multithreading options.
When using Rambrain in a single threaded context, Rambrain throws an exception when the user tries to pull pointers referencing more data than the physical memory limit at
Listing 3: Advanced features
1  managedPtr<double> a1;             // single element
2  managedPtr<double> a2(5);          // array of five elements
3  managedPtr<double> a3(5, 1.);      // five elements, all set to 1.
4  managedPtr<double,2> a4(5, 5, 0.); // two dim., vals set to 0.
5
6  class B { public:
7      B(); B(double &a, double &b);
8      ~B();
9      void someFunction();
10     managedPtr<double> data;
11 };                                 // Class with ctors/dtor
12 managedPtr<B> b1;                  // single element, default constructor
13 managedPtr<B> b2(1);               // single element, default constructor
14 managedPtr<B> b3(1, a, b);         // single element, param. ctor
15 managedPtr<B> b4(5, a, b);         // 5 elements, parametrised ctor
16
17 adhereTo<double> glue1(a1);        // Load right away
18 adhereTo<double> glue2(a2, false); // Load when used
19 const adhereTo<double> glue3(a3);  // Access const
20
21 const double *c1 = glue1;
22 double *c2 = glue2;                // If not present, will be fetched here
23 const double *c3 = glue3;
24
25 // Equivalent to: adhereTo<double> a1_glue(a1); double *a1data = a1_glue;
26 ADHERETO(double, a1, a1data);
27
28 void B::someFunction() {
29     ADHERETOLOC(double, data);     // shadows member B::data
30 }
31
32 // MT: Do not fail if too much memory is requested:
33 manager->setOutOfSwapIsFatal(false);
34 // MT: Avoid deadlock when needing multiple data at once:
35 double *c5, *c6;
36 adhereTo<double> c5glue(a1), c6glue(a2);
37 { LISTOFINGREDIENTS
38     c5 = c5glue;
39     c6 = c6glue;
40 }

once. In a multithreaded context this behaviour is configurable: instead of failing, Rambrain can wait for memory to be unlocked by other threads releasing their adhereTo<>s (line 33). However, this can potentially introduce a deadlock. Take for example a couple of threads that need two pointers each to start their calculation. Assume only half or less of these managedPtr<>s fit into RAM. In this case, all or some threads may have requested the first of the needed two pointers in parallel. Since Rambrain cannot free pulled pointers while the respective adhereTo<>s in scope exist, it blocks all threads and waits for memory to become available for swap-out. This, however, will never happen, as all threads are waiting and no thread eventually finishes to unlock data for swapping.
To circumvent this situation, the user may use a globally locking scope conveniently provided by Rambrain (lines 37-39). It is, however, highly encouraged not to overcommit memory in multi-threaded situations either, as performance may drop by this forced serialisation.
Having introduced the basic usage style of the library, let us evaluate the impact of using Rambrain on code design. While the syntax suggests that there would be nothing to keep in mind, a few limits and caveats apply nevertheless.
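The scoping semantics described above can be illustrated with a small standard-C++ analogue. The sketch below is not Rambrain's implementation; Payload and PinGuard are invented names that merely mimic how an adhereTo<>-style guard pins data while in scope and exposes it through a pointer conversion:

```cpp
#include <vector>

// Invented stand-in for a managed allocation: the data plus a pin count.
// A positive pin count means "in use, must stay resident".
struct Payload {
    std::vector<double> data;
    int pins = 0;
};

// RAII guard mimicking adhereTo<> scoping: pins on construction,
// unpins on destruction, and yields the raw pointer via a conversion
// operator, which corresponds to "pulling the pointer" in the text.
struct PinGuard {
    Payload &p;
    explicit PinGuard(Payload &payload) : p(payload) { ++p.pins; }
    ~PinGuard() { --p.pins; } // leaving scope marks the data swappable again
    operator double *() { return p.data.data(); }
};
```

In Rambrain itself the destructor does not free anything; it merely signals the manager that the object may be swapped out again when memory is needed.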
Rambrain's physical memory usage is limited to a certain amount that the managedPtr<>s may consume. As Rambrain cannot use the native OS paging mechanisms, it is bound to the memory limits set by the user. Consequently, the set of currently existing adhereTo<>s marks data as in-use and determines what cannot be swapped out. Additional managed pointers may only consume the remaining free memory. Thus, Rambrain will be unable to manage problems that demand the simultaneous use of more data than this limit. The code has to be written in a way that the maximum simultaneously accessed data amounts to fewer bytes than the limit. This usually is the case anyway, as algorithms are formulated in a local way on the data.
The size of the simultaneously used data structures relates to the way of solving a problem. A matrix operation, for example, can typically be formulated on various matrix representations such as rows, columns, sparse single elements or smaller submatrices. To gain something from managing such a subobject, the user has to take care that the payload per managed pointer is large enough, so that the overhead of managing the data becomes small. We propose allocating smaller structures via traditional mechanisms and leaving the data-intense elements to Rambrain. If however a managedPtr<> is chosen, it is vital to keep in mind that this block of data can only be swapped out and in as a whole. Ideally, all elements of a single requested managedPtr<> will be needed in one step of a calculation. If not, Rambrain might end up having to swap in many excess bytes to use just one or two elements.
[Footnote: Currently we do not track the overhead imposed by the usage of Rambrain, as well as other heap allocations. This is planned for a future release.]
[Footnote: Explicit delayed loading can be emplaced to limit this to the set of adhereTo<>s that a pointer was pulled from.]
Fortunately enough, the same argument applies to normal CPU cache locality, and developers are used to developing for this consecutive, local access scheme. For a review of the term locality and further hints please see for example Denning (2005); Chellappa et al. (2008). Therefore, existing and highly optimised libraries are perfectly suited to be used together with Rambrain.
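The locality argument can be made quantitative with a toy model. In the sketch below (all names and the chunk size are illustrative), data is held in fixed-size chunks and a counter records a simulated swap-in whenever an access leaves the currently resident chunk:

```cpp
#include <cstddef>

// Toy model: elements live in chunks of CHUNK entries, and only one
// chunk is "resident" at a time. Each change of the resident chunk
// stands in for one swap-in from secondary storage.
constexpr std::size_t CHUNK = 64;

struct ChunkCounter {
    long resident = -1; // index of the currently loaded chunk
    long loads = 0;     // simulated swap-ins
    void access(std::size_t i) {
        long c = static_cast<long>(i / CHUNK);
        if (c != resident) { resident = c; ++loads; }
    }
};
```

Sequential access over 1024 elements triggers 16 simulated swap-ins (one per chunk), whereas alternating between elements i and i + 512 triggers one per access: the thrashing scenario that swapping whole blocks for just one or two elements produces.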
4. Architecture and Design
Figure 1: Architecture of Rambrain: Rambrain is divided into four major classes, each serving a distinct purpose. The classes in dashed boxes are abstract classes.
Having described the interface of Rambrain, let us now describe how Rambrain is internally implemented and what design decisions have been taken to serve the user's data requests. As depicted in Fig. 1, Rambrain is divided into four independent classes. While the user front end is implemented in a standardised way by the two classes managedPtr<> and adhereTo<>, whose functioning has been described above, the abstract backend classes can be inherited to implement a custom strategy for which elements to select for swapping. We currently provide two implementations of each of these classes. One amounts to a dummy class that is used for testing purposes. The other implementations, cyclicManagedMemory as well as managedFileSwap, will be described in the following sections. We provide profound source code documentation for all classes. The documentation can be compiled from the source code using doxygen (van Heesch, 2015) or viewed online (Imgrund and Arth, 2017, 2015) in a daily generated version.
Figure 2: Cyclic managed memory: Having accessed one element, it is very likely that the former next element will be the next one this time, too. Obeying this ordering, the algorithm will asynchronously pre-fetch "pre-emptive" elements and swap out allocated but unused elements when necessary.
4.1. Cyclic managed memory

It is a major design decision which elements to choose for swap-out to secondary storage when facing many currently unused objects. In this section we argue that a generic strategy should at least be capable of handling random access and access in the same order in an efficient way, and we describe the actual implementation.
When swapping out the same amount of data to media not capable of fast random access, swap-out size and fragmentation factors limit the speed achieved in a practical situation: the throughput per byte to be written/read is reduced when writing only small chunks, as the overhead of managing the transfer, both physically and logically, will take a greater fraction of the execution time of the request. This is especially true when using hard disks as secondary storage: when fragments of the needed data are distributed over larger parts of the disk, the read/write head of the disk has to be positioned differently at every fragment. This process consumes more time than accessing consecutively stored data. While this argument does not apply to modern solid state disks any more, splitting data over multiple locations still poses an overhead, as there must exist structures to describe and manage the splitting. Consequently, a strategy writing out and reading in larger and consecutive parts at once will in general be faster than a strategy swapping out small chunks.
With no prior knowledge of what access pattern the user will impose on the data, we can only make general assumptions and search for a strategy which can learn access patterns. The actual pattern encountered will lie somewhere in between the two extremes of a completely ordered and repeated sequence and a random access pattern. The Linux kernel, for example, tracks 'page age' and, when needed, preferably swaps out pages that have not recently been touched by the memory management subsystem.
Without going further into details, this strategy has proven useful for the general access patterns encountered on systems which have to swap memory occasionally. In the intended use case of Rambrain, however, the need to swap out data is an ever-present circumstance. Letting the user state which data is currently required places Rambrain in a better situation than the kernel memory management is in. Rambrain is actively told which data is not required any more, and there exist hints as to which data will be accessed by the application in the near future. Thus, Rambrain can much more clearly specify the 'age' and 'ageing' of data in the application's context and also infer what to swap in next.
Thinking of looping over an array of data, which is very common in scientific codes, the most simple strategy is based on the assumption that if one element has been accessed right after another, they repeatedly may be requested in that sequence in the future. Having accessed all elements, it is most likely that the first element will be accessed again. When there are multiple array objects, this also holds when a subset of objects is under consideration: even when needing only a subset of all arrays, it is likely that the elements of each array will be accessed in the same order. This assumption suggests a cyclic strategy, which we implement in the cyclicManagedMemory class and illustrate in Fig. 2. The order is represented as a doubly linked list of element pointers with connected end points. To organise this as an effective queueing system, the most recently accessed element is marked with a so called "active" pointer and the last still allocated and not swapped out element with a "counteractive" pointer. The counteractive element is followed by swapped out elements or elements that are in the process of being written to secondary storage. When accessed in an ordered way, we may keep elements in physical memory for as long as possible.
The cycle defines a reasonable sequence of swap-out: the elements that have not been accessed for the longest time are the next candidates. They are conveniently found by dereferencing the counteractive pointer and moving this pointer backwards as elements are swapped out. This writes large chunks of data consecutively into the swap files. When a swapped-out element is requested by the user, the elements that are presumed to be needed next are loaded pre-emptively as well, and the elements are placed in front of the former active element.

In this way, accessing the next element in a local sequence is very fast, as it may already have been loaded, and no re-ordering of the cycle is necessary at all. Only the active pointer has to be moved backwards one element to apparently move all active elements one position forward in the cycle. As long as the arrays themselves are accessed consecutively, local ordering is also preserved by this scheme when interchanging access between various arrays. The interested reader may consult e.g. Rusling (1998) or https://linux-mm.org/ for details on the kernel's strategy.

Pre-emptive element swap-in and decay

It is a non-trivial question how many bytes are to be swapped in pre-emptively. A pre-emptively swapped-in element will use up free physical space. Thus one has to make sure not to load unneeded elements that would be swapped out again immediately; this could cause a major increase in IO operations, thereby slowing down the system. It is prevented by tracking the amount of pre-emptively swapped-in bytes: pre-emptive swap-in takes place only as long as at most a certain number of pre-emptively loaded bytes are present. If a pre-emptively loaded memory element is accessed by the user, its size is subtracted from the pre-emptive budget. If an element has to be swapped in from the swap file, the next elements are fetched too, until the pre-emptive budget is filled up again.
In this way, random access does not cause additional overhead by swapping in unnecessary bytes, as the pre-emptive budget will always be near its limit and thus no further pre-emptive elements are swapped in.

This procedure, however, can lead to a permanently filled-up pre-emptive budget. Imagine that an array A fills the RAM completely before an array B is accessed consecutively. Given that some elements of A have been loaded pre-emptively, they will never be used while B is accessed. Thus, they effectively block the pre-emptive budget that would be useful for loading B consecutively. To avoid this situation, Rambrain implements a decay of pre-emptive elements. The amount of decaying pre-emptive elements is determined by probabilistic arguments to prevent random access from producing too many useless pre-emptive bytes, in the following way: the maximum size of the pre-emptive budget can be used to estimate the probability of hitting a pre-emptive element at random,

P_preemptive ≈ L_preemptive / (L_ram + L_swap) ≤ L_preemptive / L_ram,

where L_ram is the maximum physical memory allowed, L_swap the amount of occupied swapped-out bytes and L_preemptive the size of the pre-emptive budget. Now, every time an element is not available in RAM, we determine the number N of pre-emptive elements that have been accessed since the last element had to be swapped in. The probability that these N elements have been accessed randomly can consequently be estimated by P_preemptive^N (assuming equally distributed element sizes which are only a fraction of the pre-emptive budget). If this value drops below 1 percent, we let twice the amount of the free pre-emptive budget decay, but at least one byte. Decaying implies swapping out pre-emptive elements to make space for new pre-emptive elements. This typically implies loading at least two elements pre-emptively, as the pre-emptive swap-in fraction is by default set to ten percent and this fraction squared equals the significance level assumed above.
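The decay criterion just described can be condensed into a few lines. The following is a minimal sketch with invented names, assuming the 1 percent significance level and the "twice the free budget, at least one byte" rule from the text:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch of the decay criterion (illustrative; not Rambrain's internals).
// Returns how many pre-emptively held bytes should decay (be swapped out),
// or 0 if the observed hits are still plausible under random access.
std::uint64_t bytesToDecay(std::uint64_t Lram,        // allowed physical memory
                           std::uint64_t Lpreemptive, // max pre-emptive budget
                           std::uint64_t freeBudget,  // currently unused budget
                           unsigned nHits)            // pre-emptive hits since last miss
{
    // Upper bound on the probability of hitting a pre-emptive element at random:
    double p = static_cast<double>(Lpreemptive) / static_cast<double>(Lram);
    // Probability that nHits such hits occurred purely by chance:
    double pRandom = std::pow(p, nHits);
    if (pRandom < 0.01) {
        // Access looks sequential: make room for fresh pre-emptive loads.
        std::uint64_t amount = 2 * freeBudget;
        return amount > 0 ? amount : 1;  // at least one byte
    }
    return 0;
}
```

With the default ten-percent budget fraction, a single pre-emptive hit is dismissed as chance, while three hits in a row (probability 0.001) trigger decay.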
When loaded into RAM, the data area of a managedPtr<> has to be allocated consecutively, as pulling a pointer guarantees a consecutive layout. On secondary storage devices we may split the data over various swap file locations. While this is not desirable, it is of use when free swap file space is running out and we want to use smaller left-over chunks from previous deallocations.

Another major difference to managing heap memory, like the memory allocator in the standard libraries that is interfaced by the new/delete operator implementations, is that one cannot easily use the free space for the management overhead. This is because the managing structures have to be accessible very quickly and would cause considerable latency when resident on secondary storage.

Of course, managing the chunks of the swap file in physical memory poses unavoidable overhead. It will limit the amount of managed memory, as this overhead grows past the physical size of memory available. At the moment the user has to manage large enough data amounts in one managedPtr<> to keep this overhead small. While this sounds like reintroducing the problem we set out to solve, we find a typical memory overhead of 5 to 10 percent of the amount of allocated structures when the data content is about 1 kB. This amounts to being able to manage half a terabyte of data as if it were in RAM on a 32 GB machine; the data would be saved in roughly 5·10⁸ managedPtr<>s of this size. It is advisable to switch to higher memory loads per managedPtr<>, which reduces the overhead by the corresponding factor, making more space addressable on disk. We plan to pack objects into larger sets in future versions of the library to further reduce the overhead. It is also planned to monitor the overhead and strictly constrain it to the overall limit in future releases.
Thus, given the task to swap out a managedPtr<>, our standard implementation managedFileSwap checks its list of free chunks of memory in the swap files and tries to find the first free chunk the managedPtr<> fits into. If it fails to find such a chunk, it starts to split the data consecutively over the remaining gaps. If this also fails, it cleans up cached managedPtr<>s produced by const accesses and tries again. If no free space is left, it will simply fail. As this unfortunate case may happen after days of calculation, we also provide a swap policy mechanism that states how the library should react in that case. Policies amount to "fail in case of a full swap", "ask the user whether to assign more swap space" or "automatically extend swap space if free disk space is left to do so".
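The chunk-search strategy described above (first fit, then consecutive splitting over the remaining gaps, then failure subject to the swap policy) might be sketched as follows; Chunk and placeObject are illustrative names, not managedFileSwap's real interface:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative first-fit allocator over swap-file gaps. Each chunk is
// (offset, length); returns the pieces an object of size `need` would
// occupy, or an empty vector if the swap is full.
using Chunk = std::pair<std::uint64_t, std::uint64_t>;

std::vector<Chunk> placeObject(std::vector<Chunk> &freeChunks, std::uint64_t need)
{
    // 1) First fit: use the first gap the object fits into as a whole.
    for (auto &c : freeChunks) {
        if (c.second >= need) {
            Chunk used{c.first, need};
            c.first += need;   // shrink the gap
            c.second -= need;
            return {used};
        }
    }
    // 2) Otherwise split the data consecutively over the remaining gaps.
    std::uint64_t total = 0;
    for (auto &c : freeChunks) total += c.second;
    if (total < need) return {};  // swap full: caller applies the swap policy
    std::vector<Chunk> pieces;
    for (auto &c : freeChunks) {
        if (need == 0) break;
        std::uint64_t take = c.second < need ? c.second : need;
        if (take == 0) continue;
        pieces.push_back({c.first, take});
        c.first += take;
        c.second -= take;
        need -= take;
    }
    return pieces;
}
```

An empty return corresponds to the "full swap" case in which the configured swap policy decides whether to fail, ask the user, or extend the swap.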
The main techniques to write out large data sets to secondary storage are Memory Mapping (MM), Direct Memory Access (DMA) and Asynchronous IO (AIO). (This, however, is a non-trivial task, as typically the standard memory allocation implementation has control over the system call extending the heap size.)

• Memory Mapping: The memory management unit in control of the virtual address space can be used to seemingly load the contents of a whole file into physical memory. The same process used for paging is utilised to write out or read in missing pieces and lets an application use all the space at once. When dealing with large files, this technique is very popular, as it is fast (it may use DMA internally). However, when files become too big, the memory management unit quickly runs into problems similar to those encountered with native swapping. A possible fix may be to map only parts of the swap files. In this case, however, one has to control tightly which mappings to close first, as closing will block while the mapped region is not yet completely written to disk. While kernel hinting exists, a technique to tell the kernel which pages to write out first, the one-to-one mapping of allocations to the page file poses a bigger obstacle. Optimal decisions on where to store certain elements are hard to find in a generic way, and one is again limited to consecutive memory allocations; splitting data would render pulling a pointer to consecutive memory impossible. Furthermore, the advantage of directly mapping allocations to swap file locations can quickly become a problem when the data has to be moved to keep the memory-mapped region minimal. We thus quickly deferred using this method. There may be some interesting features to it, as automatic pre-fetching might already mimic an early stage of pre-emptive loading. Cleverly opening and closing such page-file "windows", however, is hard to handle with no guarantees about future access patterns.
• Direct Memory Access: DMA can in principle copy parts of memory directly to secondary storage without routing the data through the CPU. It is fast in both throughput and latency. However, it imposes memory alignment restrictions on both sides and supports only writing chunks of a certain size (typically 512 kB for hard disks). Since writing is direct, the action bypasses any buffering by the kernel and thus leads directly to disk access. While this can be advantageous in situations where one writes out many consecutive datasets and implements a write cache on one's own, it typically leads to overhead in our use case. Together with the imposed alignment restrictions, it is not clear how to write an efficient implementation without writing complex scheduling code or incurring large overhead when user objects do not fit the DMA alignment. DMA, while fast, is very complex to handle in situations where it is a priori unclear what the user will request from Rambrain. Thus the benefits of fast IO and low CPU impact vanish in light of the efficiency of kernel file system buffering. There is a long-running discussion involving Linus Torvalds, who strongly discourages the use of DMA by the user (please see Torvalds, 2002).

• Asynchronous IO:
The Linux kernel provides the user with the possibility to asynchronously load and save data to file descriptors. Primary actions are taken only on the file system cache, which has gone through a long evolution and is by now a very fast and efficient way to use free physical space without negative effects under high load. Furthermore, DMA or memory mapping techniques may be employed in the background to bring the cache in sync with secondary storage. Implementing asynchronous IO on top of normal buffering implies fast execution and efficient write-out while at the same time being robust to architecture changes. Finally, the most efficient way of actually carrying out a certain storage operation may only be determinable at system level.

The interested reader may be warned, however, that there currently exist three AIO implementations: kio (kernel asynchronous IO), libaio (which is just a C wrapper for the former) and POSIX AIO. The latter is currently implemented as blocking AIO; the former is not guaranteed to be truly asynchronous, as its implementation is file system driver specific. We use a pool of submitting threads to provide true AIO where possible and simulated AIO otherwise, using the libaio wrapper for the system calls. In this way, IO operations are non-blocking and have a low impact on CPU load.

By using asynchronous read and write requests, Rambrain is capable of loading data in the background with small impact on the CPU load. A technique for doing this is to first create the adhereTo<> object, which triggers swapping in of the object. While the asynchronous IO is swapping in the element, other calculations can be done. When finally pulling the requested pointer, it may already have been copied in in the background. A graphical scheme comparing synchronous and explicitly asynchronous requests to Rambrain is available in Figure 3, and a schematic listing of the code producing this access scheme can be found in listing 4.
Putting the highlighted line four after line six would constitute a synchronous version of the code. As the application can already process other data while fetching the next needed objects, this can effectively hide latency, similar to GPU programming techniques or pre-fetching for caches (see e.g. Callahan et al., 1991).

Listing 4: Explicit asynchronous access

    1  managedPtr<double> data(1024),
    2                     data2(1024);
    3  adhereTo<double> glue(data);
    4  adhereTo<double> glue2(data2);
    5  double *ptr = glue;
    6  do_something_on_data(ptr);
    7  double *ptr2 = glue2;
    8  do_something_on_data2(ptr2);

Having chosen AIO for transferring the data to secondary storage, the actual implementation is simple on the interface side but quite demanding on the scheduler side, as the scheduler has to deal with incomplete swap-outs and swap-ins when scheduling further action. As a rule of thumb, it has been found very useful to "double-book" memory in the sense that chunks moving from or to physical memory demand their size in both budgets. At the same time we also track the amount of memory which will be freed by such actions (and which thus can be waited for when needed). When an action completes, the budget of free memory on the source side is restored to the correct value, and the bytes which were pending before are subtracted from the pending-bytes count. In this way, the scheduler can find the right strategy given the currently pending IO and demand a small amount of IO to satisfy the constraints imposed by user requests.
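The double-booking rule of thumb can be illustrated with a toy accounting structure (invented names; Rambrain's real scheduler tracks more state than this):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of "double-booked" budgets: a transfer in flight occupies its size
// in BOTH the RAM and the swap budget until the AIO request completes.
struct Budgets {
    std::uint64_t ramUsed = 0;      // bytes booked in physical memory
    std::uint64_t swapUsed = 0;     // bytes booked in the swap file
    std::uint64_t pendingFree = 0;  // bytes that pending IO will release

    // Begin swapping a chunk out of RAM: book it in swap as well and
    // remember that its RAM share will become free.
    void beginSwapOut(std::uint64_t size) {
        swapUsed += size;       // double-booked while the write is in flight
        pendingFree += size;    // the scheduler may wait for these bytes
    }

    // AIO write completed: the RAM side is actually released now.
    void completeSwapOut(std::uint64_t size) {
        ramUsed -= size;
        pendingFree -= size;
    }
};
```

The pendingFree counter is what lets the scheduler demand only a small amount of additional IO: bytes already on their way out can simply be waited for.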
Multithreading complicates writing the scheduler code considerably, since one has to be very careful that the needs of one thread do not interfere with the needs of another. Scheduler and swap are both written as one instance shared by all local threads. This design decision was taken because data may be shared among threads and thus needs a common swapping procedure. Copying data between threads, however, will result in separate managedPtr<>s for each instance. This does not impose a big memory overhead, since only the shallow control structures are possibly present multiple times, not the data themselves. Consequently, passing managedPtr<>s and adhereTo<>s from one thread to another has to happen thread-safely, as does access to one managedPtr<> from multiple threads. Thread safety in this sense does not mean that one thread has exclusive access to a managed pointer, but that the mechanisms ensuring the availability of the data are written such that the object is present if at least one adhereTo<> exists in any thread, and that the object may be swapped out at the destruction of the very last adhereTo<> instance.

While reference counting is strongly related to the concept of shared memory parallelisation, a distributed memory setup is much easier to describe. Since every machine harbours its own memory unit, it instantiates its own management structures, swap and data pointers. Data are then copied between processes via the classical send and receive routines of the employed library, for example MPI. This poses slightly more overhead than the shared memory case, but also provides the opportunity of a more intelligent access strategy, especially if an asynchronous parallelisation model is implemented.
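The adherence-based residency guarantee described above boils down to mutex-protected reference counting; the following is a minimal sketch (Residency is an invented name standing in for the bookkeeping behind managedPtr<>/adhereTo<>):

```cpp
#include <cassert>
#include <mutex>

// Illustrative reference-counting core: an object stays resident while at
// least one adherence exists in any thread and becomes a swap-out candidate
// only when the last one is released.
class Residency {
public:
    // Called when an adherence is created in some thread.
    void adhere() {
        std::lock_guard<std::mutex> lock(m_);
        if (count_++ == 0) resident_ = true;  // swap in on first adherence
    }
    // Called when an adherence is destroyed.
    void release() {
        std::lock_guard<std::mutex> lock(m_);
        if (--count_ == 0) resident_ = false; // may be swapped out again
    }
    bool resident() const {
        std::lock_guard<std::mutex> lock(m_);
        return resident_;
    }
private:
    mutable std::mutex m_;
    unsigned count_ = 0;
    bool resident_ = false;
};
```

The mutex stands in for whatever synchronisation the real implementation uses; the point is that adherences in different threads share one count, so no thread can cause a swap-out while another still holds a pointer.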
One has to keep in mind, however, that if all machines or compute nodes write their swap files to the same disk, they may compete and slow down all IO, depending strongly on the timing of operations. In total, the amount of memory overhead due to parallelism should be negligible, especially since typical applications are globally dominated, in terms of memory, by the amount of data handled.
5. Results and Discussion
In this section we measure how code which utilises Rambrain compares to code without Rambrain. Measuring performance is a non-trivial task for technical as well as theoretical reasons. First of all, tests should be reproducible and measure the overhead imposed by Rambrain. However, reaching this goal is non-trivial, as file system operations, kernel asynchronous IO or scheduler performance in a multithreaded situation may affect the overall performance as well. Especially the typical use case, a developer seeking to work and debug on the same system, is hard to simulate in a reproducible and meaningful way. Separating library-imposed overhead and IO performance would be of no use either, as the user is interested in overall performance. Most of the tests carried out, however, are highly sped up by disk caching, which is also present on a productive system. We emphasize that while only RAM-to-RAM copying is done by the OS in these cases, these tests best measure the overhead implied by the workings and logic of the Rambrain library, since once the user is IO limited, test results will be dominated by hardware performance.

In order to provoke swapping actions we set up a test system by finding a PC with the smallest physical RAM module sizes and removing all RAM modules but one. The tests were then carried out using OpenSuse 13.2 (based on kernel 3.16) on an Intel(R) Core(TM)2 Quad CPU Q6700 operating at 2.66 GHz on an ASUSTeK P5NT WS mainboard with 32 KB L1 cache, 4 MB L2 cache and a standard unbranded 2 GB memory module. The hard disk used is a Samsung SpinPoint S250.
We present the overhead the library imposes on the execution time of user code in a regime where nothing actually has to be swapped. This allows one to judge whether Rambrain reaches near-native performance and thus can be employed even if it is unclear whether it will be needed on the target system. We propose a test in which we perform a rather simple n-body simulation of a fixed set of particles using a forward Euler integrator (Euler, 1768).

[Figure 3: two sequence diagrams of object lifetimes across the main thread, the Rambrain libraries and kernel asynchronous IO (io_submit()/io_getevent() calls): (a) blocking IO, where the main thread waits for the asynchronous copy, and (b) explicit asynchronous IO, where work on already returned pointers overlaps with the copy.]

Figure 3:
Exemplary interaction of user code with Rambrain library.
Rambrain may be faster when given clues about upcoming data requirements. While in (a) the time waiting for data to arrive is wasted, the user may use this idle time for calculations on already arrived data, as depicted in (b) and written in listing 4. As preventing idle time is highly desirable, Rambrain tries to behave like case (b) without the user explicitly hardcoding this. In order to do so, Rambrain tries to guess the upcoming data demands of the program and automatically pre-fetches elements that will be needed.
[Figure 4 plot, "N-Body code": execution time [s] and relative execution time difference [%] versus memory usage [GB]; curves: Native, Rambrain, Overhead.]
Figure 4:
Execution time of an n-body code: We present timing information from a simple n-body code which accumulates data by saving particle trajectories and velocities. By comparing versions with and without Rambrain we see that the overhead of the library amounts to only a few percent of the execution time in the regime of reasonable data sizes.
While each timestep only depends on the last positions and velocities of all particles, we save the trajectories and velocities along the way in two-dimensional arrays. A typical use case for this is in-place visualisation of such a simulation. The memory used by the program therefore grows over time, adding two vectors per particle in each iteration. The results of both runs are shown in Fig. 4.

At the beginning of the simulation, when hardly any data is present, we notice quite a big relative overhead of the Rambrain library. However, this amounts to an absolute difference of only one to two seconds. From a few MB of data on, both curves show the same scaling with time, which is given by the algorithm itself. The relative overhead presented by the blue line declines very rapidly and finally converges to a value between one and two percent close to the two GB mark.

In conclusion, a code utilising Rambrain is always a bit slower in the regime where no data has to be swapped out compared to native code. However, the impact on execution time is small, and we see no strict need for users to completely switch off Rambrain in this case.
In this subsection we demonstrate the internal movement of data for a common problem: transposing a big matrix which itself does not completely fit in memory. We save matrices block-wise, as is done in many linear algebra libraries (see e.g. Blackford et al., 2002). This allows for a straightforward migration to a Rambrain version of the algorithm, simply replacing one layer of pointers by a managedPtr<> class.

The result is shown in Figure 5. The left part of the plot shows the data allocation phase. At first the main memory is filled up very quickly with data, then data is consecutively swapped out to make room for more allocations. In the transposition phase afterwards, data is exchanged between swap and memory, loading all the blocks necessary for the current transposition step. Please note that the asynchronous nature of Rambrain makes it very difficult to measure these values at a few discrete time points, since it is not clear when exactly the AIO events are handled in the background. Finally, the deletion of data is also plotted in the graph, but happens so fast that it is below the resolution limit of this plot. In total, we see that our design criteria are met and that Rambrain behaves well by constraining the usable memory. Additionally, the approximately linear scaling of the "Swapped Out" curve demonstrates that the overhead of the library itself does not depend on the current state of the memory.

The diagnostic output leading to figures like this can be triggered directly in Rambrain, so that the user is able to easily profile the fundamental behaviour of his code.

[Figure 5 plot, "Timing information": size [MB] versus execution time [ms]; curves: Swapped Out, Swapped In, Main Memory, Swap Memory.] Figure 5: Data movement for one 'block' algorithm matrix transpose: We show how data is moved between main memory and swap in one matrix transpose run. The vertical line marks the time point when the execution progresses from data allocation to the actual transposition.
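The block-wise storage that makes this migration straightforward can be sketched independently of Rambrain; here plain std::vector blocks stand in for the managedPtr<>s, and only the two blocks of the current step would need to be resident:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Block-wise in-place transpose: the matrix is stored as an nBlocks x nBlocks
// grid of bs x bs blocks, so each transposition step touches at most two
// blocks. (With Rambrain, each Block would be a managedPtr<> and would be
// adhered to at the start of the step, triggering its swap-in.)
using Block = std::vector<double>;

void transposeBlocked(std::vector<Block> &blocks, int nBlocks, int bs)
{
    for (int bi = 0; bi < nBlocks; ++bi)
        for (int bj = bi; bj < nBlocks; ++bj) {
            Block &a = blocks[bi * nBlocks + bj];  // block (bi, bj)
            Block &b = blocks[bj * nBlocks + bi];  // block (bj, bi)
            // Swap element (i, j) of block a with element (j, i) of block b;
            // on the diagonal (a == b) only the upper triangle is touched.
            for (int i = 0; i < bs; ++i)
                for (int j = (bi == bj ? i + 1 : 0); j < bs; ++j)
                    std::swap(a[i * bs + j], b[j * bs + i]);
        }
}
```

Because every step needs only the block pair (bi, bj)/(bj, bi), all remaining blocks can stay swapped out, which is exactly the access pattern visible in Figure 5.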
In this subsection we address the possible speed-up in execution time one can gain by efficiently using the asynchronous nature of Rambrain and its ability to pre-emptively load and unload elements automatically. To measure the performance of this mechanism, we propose the test shown in listing 5. We set up a two-dimensional array which is realised by a list of managed pointers. While keeping the first dimension (i.e. the number of one-dimensional arrays) fixed at 1024, we vary the size of the underlying arrays (second dimension, bytesize). In order to measure the speed-up by asynchronism and pre-emptive actions we need to give the library some time to work in the background. Therefore, as in a typical use case, we iterate over the arrays in consecutive order and write the result of a simple integer multiplication into the respective array.

Listing 5: Standard implicitly asynchronous loading

    unsigned int numel = 1024, bytesize;
    managedPtr<managedPtr<char>> arr(numel, bytesize);
    ADHERETOLOC(managedPtr<char>, arr, ptr);
    float load;
    float rewritetimes = load / 100.;
    int iterations = 10230;

    for (int i = 0; i < iterations; ++i) {
        unsigned int use = (i % numel);
        // AdhereTo
        adhereTo<char> glue(ptr[use]);
        // Pull the pointer to the object
        char *loc = glue;

        // Produce some computational load
        for (int r = 0; r < rewritetimes * bytesize; r++) {
            loc[r % bytesize] = r * i;
        }
    }

We vary the percentage of the array that data is written to (load) and the data chunk size, simulating an arbitrary computational load that scales with the data. The results of this test are presented in Figure 6.

It is clearly observable that the execution time decreases due to the pre-emptive strategy. Increasing the work done on the data in the left plot, the library's overhead is already masked at a few tens of percent of touched array elements. Working on the file buffer cache only, this test shows the minimal overhead of the Rambrain libraries. In a real use case scenario, the computational load required to completely mask swapping is higher. This result clearly encourages the user to leave the standard behaviour of pre-emptive support enabled whenever possible. Even if the data access is completely random, trying to be pre-emptive does not imply a big performance drawback. Of course a problem-specific approach, pre-fetching exactly the next needed elements without guessing, could improve performance here. However, this strongly violates our assumption that we value development time over execution time. We therefore argue that this optimisation leads towards developing a customised out-of-core algorithm, something no generic memory manager can substitute for. Be aware, however, that when disk bandwidth becomes the limiting factor, only part of the swap-in/out procedure can be masked by pre-emptive swap-in. For this reason, the overhead of loading the data can become dominant when limited by the disk and not carrying out enough calculations. While the pre-emptive strategy is still faster than not using the calculation time for loading the next needed data in the background, the loading overheads in percent converge in the bandwidth-limited case.
This can be seen in the lower right panel of the figure, as the pre-emptive and non-pre-emptive strategies converge when disk caching is no longer sufficient and write-outs to secondary storage dominate the timing.

Our next test is designed to examine how much time is saved by properly pulling const pointers when possible. As outlined in section 3.2 it is possible to request a pointer to constant data from an adhereTo<> object instead of a pointer to mutable data. This should be done in general, see e.g. Meyers (2012), but is of special importance in the case of Rambrain: not following this best practice leaves Rambrain with no clue as to whether the data has been modified, forcing Rambrain to write the data out to the swap again. Hence, if the data already has a representation in the swap and is addressed as constant, this copy is kept as long as the swap has enough free space. When the in-memory copy of the data is later deleted and a swap-out occurs, the data need not be written out again, saving expensive write operations.

In order to test this mechanism we allocate two blocks of data, each consisting of an array of smaller data chunks. The first one we call the real data, while the second one is the dummy data, which we adhere to and pull a pointer from to ensure that the real data is swapped out due to memory restrictions. Afterwards we access the real data and the dummy data in alternating sequence, once swapping in the data const and once non-const. We measure the time it takes to swap in the dummy data in both cases, thereby capturing the time it takes to also swap out the real data. We present the resulting behaviour for different sizes of data blocks in Figure 7.

We notice that the change in execution time by const access obviously scales with the amount of data, since it is highly dependent on the time it takes to complete the swap-out.
In the regime of a data block amounting to between one and ten megabytes, we decrease the execution time of the relevant code sections by about 20 to 30 percent. Since these are relatively small data sizes in comparison to the main memory, we can assume that these data swap-outs are completely handled by the disk cache; therefore we save only the time for cache management and essentially a memory copy. When we enter the regime of secondary storage IO we can expect the difference in execution time to be even larger, since the secondary storage itself is much slower than the main memory. For most storage types, storing data takes longer than reading data, so we expect this mechanism to save even more time in this case. It is strongly advised to use const access whenever possible, also in light of the properties of other caches and the optimisations applied by the compiler.
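The saving mechanism can be illustrated with a simple dirty-flag model (invented names; Rambrain tracks this internally when const pointers are pulled):

```cpp
#include <cassert>

// Sketch of why const access saves write-outs: an element that still has a
// valid copy in the swap file and was only read need not be written again.
struct Element {
    bool inSwap = false;  // a copy exists in the swap file
    bool dirty  = true;   // in-memory data differs from that copy
    int  writeOuts = 0;   // counts actual (expensive) write operations

    void pullConst()   { /* read-only access: swap copy stays valid */ }
    void pullMutable() { dirty = true; }  // data may have been modified

    void swapOut() {
        if (!inSwap || dirty) {  // only write when necessary
            ++writeOuts;
            inSwap = true;
            dirty = false;
        }
    }
};
```

A const pull leaves the swap copy valid, so the next swap-out degenerates to freeing the in-memory copy; a mutable pull invalidates the copy and forces a full write.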
[Figure 6 plots: total execution time [ms] scaling with computational load (percentage of the array that will be written to) and with object size (byte size per used chunk), together with the load, overhead and idle fractions, for the pre-emptive mechanism enabled (default) and disabled.] Figure 6: Pre-emptive loading: We compare the enabled and disabled pre-emptive mechanism of Rambrain and find that the pre-emptive behaviour of Rambrain results in a significant performance boost.

[Figure 7 plot: execution time [ms] versus data size [kB]; curves: Non-Const, Const.] Figure 7: Speed-up by pulling const pointers: We run a simple test where data is pulled once as constant and once as writeable pointers and compare the time it takes to swap out the data afterwards, in a regime where all the data still fits in the disk cache.

Finally, let us compare the performance of Rambrain and system swapping. In principle, a local administrator can equip a Linux system with more swap space than usual by creating additional swap files or partitions with the system command mkswap and enabling them for use with the command swapon. Note, however, that it is not possible to do so as a normal user. Additionally, this approach requires the allocation of the whole swap file space on secondary storage right at the beginning, regardless of how much of it will actually be used. Using this technique we create and enable a 10 GB swap file on the described test machine.

We compare a code which uses Rambrain to a non-managed code utilising system swapping. We carry out two different runs: in the first, data is written consecutively to an 8 GB matrix; in the second, the application writes to randomly chosen elements of this matrix. In the latter test we explicitly disabled the pre-emptive swapping algorithm.

On some attempts to run unmanaged, the native application is killed by the OOM killer. This probably happens due to the fast growth of heap memory. Also, having a swap file which is not at least about 25 percent bigger than the actually swapped size often provokes the OOM killer to terminate the process. Even if the OOM killer does not kill the test process, it may shut down other processes in the background to free memory for the test process. When the attempts succeed, the system is virtually unusable, as even opening another shell prompt takes minutes. Furthermore, the interference of the native code with the system does not stop when the application exits, but leaves the system in a slowly reacting state for minutes to hours of usage, as large parts of other applications and system processes have been swapped out to disk.
We expect that running other applications, such as an integrated development environment, in parallel will aggravate the situation when trying to solve the problem using OS swapping.

But the actual execution time also favours Rambrain-managed code. In the case of consecutive access, the version using Rambrain is about 10 percent faster than the native version. In the case of random access, Rambrain is still 2 percent faster than the native code if we obey the design limitation that all elements of a single managedPtr<> will be accessed.

This test result is further confirmed by the daily experience of the authors, who are able to develop code on the same machine on which their analysis software runs in parallel, without being disturbed by the process which uses Rambrain.
To demonstrate that Rambrain is actually applicable to a real world problem, we choose a memory-intensive difference imaging algorithm. The algorithm is designed to find variable light sources by comparison of multiple images. To mitigate errors due to noise, these have to be convolved with a kernel first, before being subtracted from each other. For best accuracy, a variable point spread function with a high number of free parameters is chosen as a kernel, and an optimal version is computed by a minimisation technique. Alard (2000) shows that the best results can be achieved by choosing one global kernel for the whole image. While this may seem to be the best approach anyway, we want to emphasize that usually only local kernels can be used because of the vast memory consumption that arises in the case of high resolution image material. This applies for example to the difference imaging code presented by Gössl and Riffeser (2002), into which we embed Rambrain in order to overcome exactly these limitations set by main memory.

High resolution images taken with state of the art instruments (see for example Lee et al., 2012, 2015) can easily be about 14000 pixels in size each. Typically, kernels with several hundreds of free parameters are used, which leads to an exemplary memory consumption by kernel matrices of

ImageSize · KernelSize · (Values + Errors) · sizeof(float) = 14000 · … B ≈ … GB.
This exceeds the physical size of main memory of a typical PC, while the CPU time needed for such an analysis amounts to only a few hundred CPU hours.
In Figure 8 we present the results of such an analysis using simulated data. We assess the quality of the achieved fit by folding the reconstructed signal with the kernel and subtracting this from the input. If the kernel constructed by the method reproduces the point spread function very well, the signal should vanish completely and only noise should remain. The left panel presents the result with a local kernel, where the image has been subdivided into several parts in order to fit into memory. The still-present star-like features indicate that the kernel does not fit as well as in the middle image. This panel shows the global kernel in combination with Rambrain, and the right one displays the differences between those two. One can clearly see that it not only makes a difference to use a global kernel, but that being able to run this kind of global algorithm on the data leads to a result that contains a larger fraction of the signal of variable light sources in the sky.
With Rambrain's capability to extend memory up to disk limitations, even more advanced algorithms can be applied without the typical memory restrictions. Barris et al. (2005), for example, propose to use all unique pairs of images of a given set in order to calculate a yet more elaborate kernel. Memory management for this task can be delegated to a suitable library such as Rambrain. For further scientific analysis of actual observational images we refer to the upcoming paper of Riffeser et al. (in prep.).
6. Conclusions and Outlook
We introduced the reader to writing code that utilises the Rambrain library. We described in detail why the proposed interface is sufficient to consistently handle data swap-out automatically and leads to satisfactory performance. We have demonstrated that the outlined mechanisms not only work properly, but also outperform naive approaches that mimic their strategy. Of course the library cannot compete with a fully specialised out-of-core algorithm, but it can save a lot of development time by providing automatic facilities for large data sets. The library handles asynchronous transfer of data, which provides latency hiding of disk IO operations and reduces idle times to a few percent if the computational load allows. Furthermore, we have shown that the memory and CPU overhead of the library are both in the acceptable regime of only several percent. As all of this is provided with minimal user-side interaction, we consider the goal of writing a memory manager that enables the user to transparently access multiples of the physical memory to be fulfilled. As memory management reduces to simply stating what data is currently needed, users can focus on the main goals of their application at the price of only a small overhead.
We demonstrated the actual usage of our library via the example of difference imaging in astrophysics. However, the occasions on which such data-intense problems surface in scientific work are vast and growing in number.
The interested reader may find the code released as an open source project (Imgrund and Arth, 2015), accompanied by extensive further documentation, a list of the small set of prerequisites, notes about the (also system-wide) configuration options, a complete list of features, and code examples.
Interesting features are planned for future releases, such as direct mapping of file content to managedPtr<>'s, so that loading the data beforehand is no longer necessary. While the usage of Rambrain is currently only shown natively in other C++ codes, it is possible to interface and call the relevant functions from codes written in different programming languages such as Fortran or Python. The library might lose some of its elegance regarding the usage of strict scoping in C++, but we expect it to be fully functional when interfaced correctly. Writing such interfaces in a proper manner is also part of our future plans. Since the code is open source and available on GitHub, the interested reader is warmly invited to collaborate and assist in the development of these future features.
Carrying out over 100 automatic tests, partly consisting of random interaction with the library, at every development step, and keeping track of performance, has proven very useful for finding bugs which only occur under rare circumstances, e.g. in multithreaded situations, and has improved the robustness of the code considerably.
We feel this library to be ready for use by a more general scientific audience.

Figure 8: Difference imaging residual. Left: multiple local kernels; Middle: global kernel with Rambrain; Right: difference of both images.
7. Acknowledgements
We thank Karsten Wiesner and Christian Storm for helpful comments on the presentation of advanced topics in this paper. We thank Arno Riffeser for working together with us on implementing and evaluating Rambrain in his difference imaging code. We thank Susanna Maurer for helping us organise a workshop on Rambrain at LMU. We also thank the anonymous reviewers, who helped to improve the overall quality of the paper and provided suggestions to complete the argumentation. Additionally, we thank all others who helped us find bugs in the actual implementation.
References
Alard, C. (2000). Image subtraction using a space-varying kernel. Astronomy and Astrophysics, Supplement, 144:363–370.
Barris, B. J., Tonry, J. L., Novicki, M. C., and Wood-Vasey, W. M. (2005). The NN2 Flux Difference Method for Constructing Variable Object Light Curves. The Astronomical Journal, 130:2272–2277.
Blackford, L. S., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Kaufman, L., Lumsdaine, A., Petitet, A., Pozo, R., Remington, K., and Whaley, R. C. (2002). An Updated Set of Basic Linear Algebra Subprograms (BLAS). ACM Trans. Math. Softw., 28(2):135–151.
Callahan, D., Kennedy, K., and Porterfield, A. (1991). Software prefetching. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pages 40–52, New York, NY, USA. ACM.
Chellappa, S., Franchetti, F., and Püschel, M. (2008). Generative and Transformational Techniques in Software Engineering II: International Summer School, GTTSE 2007, Braga, Portugal, July 2-7, 2007. Revised Papers, chapter How to Write Fast Numerical Code: A Small Introduction, pages 196–259. Springer Berlin Heidelberg, Berlin, Heidelberg.
Dementiev, R., Kettner, L., and Sanders, P. (2008). STXXL: standard template library for XXL data sets. Software: Practice and Experience, 38(6):589–637.
Denning, P. J. (2005). The Locality Principle. Commun. ACM, 48(7):19–24.
Euler, L. (1768). Institutionum calculi integralis. Number Bd. 1 in Institutionum calculi integralis. imp. Acad. imp. Saent.
Gössl, C. A. and Riffeser, A. (2002). Image reduction pipeline for the detection of variable sources in highly crowded fields. Astronomy and Astrophysics, 381:1095–1109.
Imgrund, M. and Arth, A. (2015). Github repository for Rambrain. https://github.com/mimgrund/rambrain/.
Imgrund, M. and Arth, A. (2017). Daily auto-generated documentation for Rambrain. http://mimgrund.github.io/rambrain/.
Imgrund, M., Champion, D. J., Kramer, M., and Lesch, H. (2015). A Bayesian method for pulsar template generation. MNRAS, 449:4162–4183.
Lee, C.-H., Riffeser, A., Koppenhoefer, J., Seitz, S., Bender, R., Hopp, U., Gössl, C., Saglia, R. P., Snigula, J., Sweeney, W. E., Burgett, W. S., Chambers, K. C., Grav, T., Heasley, J. N., Hodapp, K. W., Kaiser, N., Magnier, E. A., Morgan, J. S., Price, P. A., Stubbs, C. W., Tonry, J. L., and Wainscoat, R. J. (2012). PAndromeda - First Results from the High-cadence Monitoring of M31 with Pan-STARRS 1. The Astronomical Journal, 143:89.
Lee, C.-H., Riffeser, A., Seitz, S., Bender, R., and Koppenhoefer, J. (2015). Microlensing Events from the 11 Year Observations of the Wendelstein Calar Alto Pixellensing Project. The Astrophysical Journal, 806:161.
Ligh, M., Case, A., Levy, J., and Walters, A. (2014). The Art Of Memory Forensics. Wiley.
Meyers, S. (2012). Effective C++ Digital Collection: 140 Ways to Improve Your Programming. Pearson Education.
Reiley, W. C. and van de Geijn, R. A. (1999). POOCLAPACK: Parallel out-of-core linear algebra package. Technical report, Austin, TX, USA.
Rodrigues, G. (2009). Taming the OOM killer. LWN.net.
Rusling, D. A. (1998).
Salmon, J. K. and Warren, M. S. (1997). Parallel, out-of-core methods for n-body simulation. In PPSC. SIAM.
Tang, J., Fang, B., Hu, M., and Zhang, H. (2004). A parallel out-of-core computing system using PVFS for Linux clusters. In Proceedings of the International Workshop on Storage Network Architecture and Parallel I/Os, SNAPI '04, pages 33–39, New York, NY, USA. ACM.
Toledo, S. (1999a). A survey of out-of-core algorithms in numerical linear algebra. In Abello, J. M. and Vitter, J. S., editors, External Memory Algorithms, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 161–179. American Mathematical Society.
Toledo, S. (1999b). External memory algorithms, chapter A Survey of Out-of-core Algorithms in Numerical Linear Algebra, pages 161–179. American Mathematical Society, Boston, MA, USA.
Torvalds, L. (2002). O_DIRECT performance impact on 2.4.18. Newsgroup fa.linux.kernel.
van Heesch, D. (2015). Doxygen project webpage.
ACM Comput. Surv., 33(2):209–271.