A Linux Kernel Scheduler Extension for Multi-Core Systems

Aleix Roca*, Samuel Rodríguez*, Albert Segura*, Kevin Marquet† and Vicenç Beltran*
*Barcelona Supercomputing Center, Jordi Girona 29-31, Barcelona, 08034, Spain
Email: {arocanon, samuel.rodriguez, asegurasalvador, vbeltran}@bsc.es
†Univ Lyon, INSA Lyon, Inria, CITI, 6 avenue des arts, Villeurbanne, F-69621, France
Email: [email protected]
Abstract—The Linux kernel is mostly designed for multi-programmed environments, but high-performance applications have other requirements. Such applications run standalone and usually rely on runtime systems to distribute the application's workload on worker threads, one per core. However, due to current OS limitations, it is not feasible to track whether workers are actually running or blocked due to, for instance, a requested resource. For I/O-intensive applications, this leads to a significant performance degradation given that the core of a blocked thread becomes idle until it is able to run again. In this paper, we present a proof-of-concept Linux kernel extension denoted User-Monitored Threads (UMT) which tackles this problem. Our extension allows a user-space process to be notified of when the selected threads become blocked or unblocked, making it possible for a runtime to schedule additional work on the idle core. We implemented the extension on the Linux Kernel 5.1 and adapted the Nanos6 runtime of the OmpSs-2 programming model to take advantage of it. The whole prototype was tested on two applications which, on the tested hardware and under the appropriate conditions, reported speedups of almost 2x.

Index Terms—Linux Kernel Scheduler, Task-Based Programming Models, I/O, HPC
I. INTRODUCTION
High-performance computing applications usually rely on task-based programming models to parallelize and seamlessly load balance an application's workload. The main objective of a programming model is to provide an abstraction layer for application developers that eases the task of getting the most out of the available hardware resources. For this purpose, task-based programming models rely on runtime systems to schedule fragments of an application's work (named tasks) on threads (named workers) as well as to manage the task execution sequence.

Runtime systems, due to their particular nature, usually perform their own HPC-tailored thread scheduling on top of the Operating System (OS) scheduler. The objective is to maximize data cache reusability and memory locality on NUMA machines, given that runtimes have a more in-depth knowledge of the application's behaviour than the general OS scheduler: unlike the kernel, runtimes know in advance which data a task will access, as specified in its dependencies. Therefore, it is better for a runtime to distribute tasks on pinned workers than to let the kernel guess where each worker should run based on its previous accesses.
The usual pattern followed by runtimes is to bind a worker per available core with the objective of minimizing thread migrations and oversubscription, both being sources of cache pollution. Oversubscription refers to a period of time in which multiple threads in the ready state compete for the same core. This is generally undesirable given the OS' context switch overhead and the penalization incurred for having to share the core's caches.

However, a runtime's balancing capabilities are subject to the underlying OS scheduler policy. In particular, when a thread performs a blocking I/O operation against the OS kernel, the core where the thread was running becomes idle until the operation completes. Certainly, although runtimes only keep a worker bound per core to avoid oversubscription, other non-runtime threads such as kernel or system background threads might be scheduled in the meantime. Nonetheless, because I/O operations are generally expensive, most of the time the core is likely to remain idle. This problem can lead to significant performance loss, as some HPC or high-end server applications perform lots of I/O operations while dealing with file and network requests.

A possible solution is to make the runtime aware of blocked and unblocked workers. In this way, a blocked worker's core could be used by another worker in the meantime. This approach requires special kernel support, and several solutions exist to do so, but their complexity has prevented them from inclusion into the Linux kernel mainline code.

In this article, we make the following contributions:
• We propose a new, simple and lightweight Linux kernel extension in which the OS provides user-space feedback on the blocking and unblocking activities of its threads.
• We modify the Nanos6 runtime of the OmpSs-2 [1] task-based programming model to take advantage of the proposed Linux Kernel extension.
• We evaluate the whole prototype experimentally and show that speedups of up to 2x can be achieved.

Although the focus of this article is on the integration of our solution with runtime systems, it is worth noting that its scope is wider, being usable on other applications that rely on both I/O and a multi-threaded environment. Throughout this paper, we use the term "core" to designate a logical computation unit such as a hardware thread.

II. RELATED WORK
Mechanisms to provide kernel-space to user-space feedback when a blocking operation occurs have already been considered in the context of user-level threads. User-level threads provide means to perform context switches in user-space, thus minimizing the cost of context switching. This is also known as the N:1 model, in which N user-level threads are mapped to a single kernel thread. The problem with this model is that when a user-level thread performs a blocking operation, all user-level threads associated with the same kernel-level thread block. A palliative approach exists, known as the hybrid approach, in which a set of user-level threads is mapped to a set of kernel threads. However, if just one of these user-level threads blocks, all the other user-level threads associated with the same kernel-level thread will block. The heart of the matter is that the kernel is not aware of user-level threads.

Scheduler Activations (SA) [2] provides user-level threads with their own reusable kernel context and a user-to-kernel feedback mechanism based on upcalls (function calls from kernel-space to user-space). When a user-space thread blocks, a new type of kernel thread, known as an activation thread, is created (or retrieved from a pool) to relieve it. The activation thread upcalls a special user-space function that informs the user-space scheduler of the blocked thread. Then, still on the upcall, the user scheduler runs and schedules another user-space thread. When the blocking operation finishes, another activation thread upcalls a user-space function to inform the user scheduler that its thread is ready again.

SA has a significant drawback: the user-space scheduler thread cannot safely access shared resources protected by a lock that it has no direct access to. For example, consider a user-level thread blocked on a page fault while holding an internal glibc lock. In response, the SA kernel would wake up another user-level thread to handle the blocking event. If the user-level event handler were to acquire the same glibc resource as the blocked thread, it would deadlock. This extends to internal kernel structures, such as memory allocation locks. SA was integrated into production OSes such as NetBSD [3], [4] and FreeBSD [5] (where it is known as Kernel Schedule Entities, or KSE). A Linux implementation was proposed [6]–[8], but the SA concept was rejected because of its complexity [9].
In the NetBSD 5.0 release, SA support was removed in favor of the previous 1:1 threading scheme because "The SA implementation was complicated, scaled poorly on multiprocessor systems and had no support for real-time applications" [10]. FreeBSD KSE support was also dropped since version 7.0.

The Windows OS has a similar implementation called User-Mode Scheduling (UMS) [11]. It is based on the same principles of upcalls, userland context switches and in-kernel unblocked thread retention. The interface is available since the 64-bit versions of Windows 7 and Windows Server 2008 R2. The locking problem also arises in their implementation, as noted in the cited document above: "To help prevent deadlocks, the UMS scheduler thread should not share locks with UMS workers. This includes both application-created locks and system locks that are acquired indirectly by operations such as allocating from the heap or loading DLLs".

The K42 research OS [12], [13] proposed a more sophisticated mechanism to solve a similar problem. The K42 kernel schedules entities called dispatchers, and dispatchers schedule user-space threads. A process consists of an address space and one or more dispatchers. All threads belonging to a dispatcher are bound to the same core as the dispatcher. Hence, to achieve parallelism, multiple dispatchers are required.

When a user-space thread invokes a kernel service to initiate an I/O request, a "reserved" thread from kernel space is dynamically assigned from a pool of reserved threads. This thread is in charge of initiating the I/O operation against the underlying hardware and blocking until the request is ready. In the meantime, the kernel returns control to the user-space thread dispatcher so it can schedule another user thread. When the I/O completes, the kernel notifies the dispatcher with a signal-like mechanism so it can schedule the user thread again. It is worth noting that because dispatchers schedule user threads, an unblocked thread is not going to run unless there is some explicit interaction from the dispatcher's scheduler.

The K42 user-level dispatcher's scheduler is provided by a trusted thread library. However, this library suffers from the same problem as SA: it cannot share any lock with the user-level threads; otherwise, a deadlock would block the process if the dispatcher's scheduler tried to get a lock that was already taken by the blocked user thread.

The proposed User-Monitored Threads (UMT) model is similar to SA and K42 in the sense that all use a mechanism to notify a user-space thread whenever another thread blocks or unblocks in kernel-space. The main differences of UMT with SA and K42 are, on the one hand, that unblocked threads are not retained anywhere (hence, a deadlock with the user-space scheduler is not possible) and, on the other hand, that the UMT implementation is much simpler and more lightweight. However, as a consequence of not retaining unblocked threads, UMT needs to deal with periods of oversubscription. Nonetheless, the UMT correction mechanisms minimize their effect.

Essentially, APIs that asynchronously "notify" a user program with an upcall, rather than "return" information with a downcall, are more complex; similarly to how signal handlers compare to the signalfd mechanism. UMT's simplicity partially lies in its downcall-based approach.

The TAMPI [14] and TASIO [15] libraries tackle a similar problem with a completely different approach. Both libraries rely on the OmpSs-2 asynchronous-aware feature to enable I/O and computation overlapping by integrating asynchronous operations into the tasking model. TAMPI works at the MPI layer, while TASIO works at the OS surface. In essence, they translate synchronous operations into asynchronous ones and, instead of blocking, return control back to the runtime. In general, TAMPI and TASIO are ad-hoc solutions that require blocking and non-blocking APIs, while UMT is a generic approach that works with any blocking event regardless of existing asynchronous support. However, because of their specialization, the ad-hoc solutions' performance could be better in some situations.

III. PROPOSAL: USER-MONITORED THREADS (UMT)
A. Proposal overview
The User-Monitored Threads (UMT) model specifies a mechanism that allows user applications to receive Linux kernel notifications on the blocking/unblocking events of a subset of their threads. The details of this functionality are given in the following sections, but they are sketched in Fig. 1. In this figure, the UMT Scheduler box illustrates the modified Linux Kernel process scheduler, Wi identifies the runtime's worker threads, and L denotes the runtime's Leader Thread, whose role is to monitor the UMT communication channel. L is free to run on any core; the OS scheduler decides which worker will be preempted for L to run, if any. Basically:
• At time T1, four workers W1, W2, W3 and W4 are bound to cores C0, C1, C2 and C3, respectively. L is waiting for UMT events. An idle pool holds additional blocked workers.
• At time T2, the worker W1 blocks because of an I/O operation and L is notified of the event.
• At time T3, L wakes an idle worker from the pool and waits again for more events (when W5 wakes, it would also generate an unblock event, which is omitted for simplicity). Worker W5 is now running on a core that, without the proposed mechanism, would have been idle.
• At time T4, W1 is unblocked after the I/O operation finishes. An unblocking event is generated and L wakes up. Because there are no free cores at the moment, L waits until it momentarily preempts another worker. Once it does so, it reads the UMT events and notices that multiple workers (W1 and W5) are running on the same core (C0).
• At time T5, after W5 finishes executing tasks, it checks L's current UMT event status and realizes that there is an oversubscription problem affecting its current core. To fix the problem, the worker surrenders and returns to the idle pool. This generates another event that wakes up L, which updates the UMT event status again.
• At time T6, the oversubscription period has ended and the four workers are running normally.
B. Kernel-space support
The UMT kernel-space support relies on two components: a pair of new system calls, which are used to initiate and manage UMT, and the eventfd Linux kernel feature, used as the notification channel between kernel- and user-space.

An eventfd is a simplified pipe designed as a lightweight inter-process synchronization mechanism. Eventfds are interfaced as usual file descriptors but, internally, they simply hold a 64-bit counter. The standard write() and read() system calls can be used to increment and read the counter, respectively. Once read, the counter is cleared, but if its value was zero, the reader blocks until something is written.

Fig. 1. UMT overview example

In UMT, eventfds are created when a user-space process claims that it wants to monitor some of its threads by calling the new system call umt_enable(). At this point, the Linux Kernel initializes an eventfd per core, stores them in the context of the calling process and returns them to user-space. Threads start being monitored as soon as each of them calls the umt_thread_ctrl() syscall. UMT uses each eventfd to simultaneously keep the count of both blocked and unblocked threads on the corresponding core since the last read. The block and unblock counters are stored in the first 32 and the next 32 bits of the eventfd's 64-bit counter, respectively. Counter overflows are not considered, as we decided to focus on simplicity and UMT viability first; namely, if monitored threads were to block on the same core without the core's eventfd being read, the blocked counter would overflow and unnoticeably corrupt the unblocked counter. Future UMT versions will not rely on these counters, as mentioned in Section III-D.

Within the kernel, the eventfds are written exclusively in a new wrapper function that substitutes the genuine Linux Kernel __schedule() function. This function is the common entry point of all paths that lead to a context switch. The blocked eventfd counter is incremented by one just before calling the genuine __schedule() function, while the unblocked eventfd counter is incremented by one on return. Not all threads that call __schedule() do block; some of them might just be preempted. Only threads about to block, or that wake up after being blocked (regardless of the reason), write the eventfd counters. The right scenario is easily detected by just checking that the current process' internal Linux state equals the TASK_RUNNING macro in the case of blocking, and by checking the previous running state in the case of unblocking. UMT does not keep track of preempted threads: whenever a thread is preempted, its core does not become idle, but another thread starts running on it. Therefore, it is not necessary to inform user-space of such an event.

The eventfd counters are consumed by a user-space application through the standard read() system call, which resets the eventfd counters to zero. The eventfd read value holds the number of blocked and unblocked tasks on the core since the last read operation. By subtracting the number of blocked threads from the number of unblocked threads, the number of ready threads on the associated core's eventfd is known. However, because each read operation erases the count, it is necessary to keep a user-space per-core count and add the result of the subtraction after each eventfd read.

A thread migration might lead to uncompensated counters, i.e., a thread migrated from core A to core B while it was in the preempted state on core A will not have the chance to trigger the eventfd block event on core A before migrating and, therefore, will be seen as if it were still running on A even after being migrated to B. When a process migration occurs, the UMT patch ensures the consistency of the eventfd counters by checking, after a context switch, whether the current thread's core differs from its last one. If this is the case, a block event might have been missed on the previous core and, if needed, the missed event is written to the previous core's eventfd at this point (see Section III-D for more details).

Because of UMT's kernel-side simplicity, and unlike other similar approaches, it is likely that this code will not conflict with other Linux Kernel features and will be easy to maintain. Also, the introduced overhead for non-UMT-enabled applications is kept to a minimum, given that the UMT instrumentation points are reduced to two new conditional statements in the context switch path. However, the interface used in this implementation could evolve in future versions to match the Linux Kernel community requirements.

C. User-space support: The case of Nanos6
The UMT user-space support requires four main components: an initialization step to enable the UMT kernel feature (using the new system calls), a mechanism to read and process the eventfd counters of each core (using the read() syscall) to obtain the number of ready threads, a mechanism to wake up threads on idle cores based on the counter of ready threads, and a protection mechanism to limit oversubscription.

In order to validate the proposal, we have adapted the Nanos6 runtime of the OmpSs-2 task-based programming model to integrate our Linux Kernel extension. The original Nanos6 threading model relies on explicit management of thread binding. The runtime keeps a single worker bound to each core and an unbound Leader Thread that periodically wakes up to run polling services [14]. Workers continuously request tasks and only leave the core voluntarily when: no more tasks are left, an explicit taskwait construct prevents the task from continuing until all its children tasks complete, or the next ready task to execute is already bound to another worker (in which case there is a swap of workers). The Nanos6 threading model has been extended to manage multiple workers bound to the same core in order to support UMT.

The UMT-enabled Nanos6 initialization phase still creates a worker bound to each core and an unbound Leader Thread, extended with the purpose of monitoring all eventfds. The Leader Thread main loop first performs a blocking read operation on all eventfds using the standard epoll system call. When one of the monitored threads sends an event, the Leader Thread gets unblocked, reads the corresponding eventfd, subtracts the two eventfd counters, and calculates the number of ready threads by adding the result to the user-held counter. Then, for each core, it checks whether the count of ready threads is zero. If this is the case and there are still tasks to execute, the Leader Thread retrieves an idle thread from a pool and gives it a task to execute on the idle core. This is the precise point at which we break the original Nanos6 threading model, where there could be a single ready thread per core at a time. In other words, in the case that a blocked worker was bound to the core where the Leader Thread has just started a second worker, it might happen that when the blocked worker resumes, the second one is still running and both of them have to compete for the core. However, in general, oversubscription prevails only for a limited amount of time.

Nanos6 workers perform the oversubscription check at every task scheduling point, which includes: starting, finishing or creating a task, as well as waiting (taskwait pragma) or yielding (taskyield pragma). To do so, workers first update the ready-thread counters by doing a non-blocking read on the eventfds. Then, if the number of ready threads on the current core is greater than one, the worker surrenders and returns to the pool of idle workers.
D. Discussion
The proposed design and implementation are based on several relaxed assumptions that simplify the design and do not particularly compromise performance. However, in future versions, we plan to improve them.

The core counters might occasionally suffer from temporal inconsistencies due to migrations or concurrent updates. For instance, if worker A reads an eventfd of core 0 but gets preempted before it can update the shared Nanos6 atomic counter, another worker B could use the current core counter value to determine whether to become idle or not. As a consequence, worker B will take a decision based on incorrect data (the eventfd has been read, but the corresponding user-space core counter has not been updated yet). This could be solved by protecting this critical region with a lock, but we have declined this option given that the situation is unlikely to happen and non-critical, i.e., there are two possible outcomes: a worker becomes idle when it is the only worker on its core, or a worker continues to run although its core is already being used. In the former case, the Leader Thread would eventually notice the idle core during its 1 ms periodic scans and schedule a worker there. In the latter case, the oversubscription period would simply last a bit longer. In any case, the application's correctness would not be compromised.

With the objective of reducing cache pollution, it could be interesting to have a Leader Thread monitoring each core's eventfd instead of having a single Leader Thread that monitors all of them. However, this would require many more Leader Thread context switches. For instance, in the single Leader Thread approach, if events have been generated on four different cores, a single Leader Thread context switch will serve all of them. Instead, in the multiple Leader Thread approach, four context switches would be needed. Because a Leader Thread context switch might lead to the preemption of a busy runtime worker, it is not clear whether having multiple Leader Threads would improve performance.

Another relevant technique that the multi-leader-thread and UMT-enabled Nanos6 might benefit from is the leader-follower approach [16]. Sometimes, the Leader Thread might attempt to wake up a worker on the same core that it is running on. Instead, the Leader Thread could first create or designate another worker to become the new Leader and then morph into a standard worker to immediately start executing tasks. The recently nominated Leader Thread would wake up at some point to continue listening for incoming UMT events, repeating the cycle again. This is quite similar to how SA proposes to respond to kernel events.

The proposed design notifies user-space whenever a worker blocks or unblocks. However, Nanos6 workers only need to take immediate action when the counter reaches zero. Subsequently, it would be interesting to adapt the Linux kernel to notify user-space only when there are no ready workers bound to a core. This mechanism would also outdate the mentioned eventfd overflow issue.

Please note that retaining workers within the kernel when they are unblocked (just as is done in SA) to prevent oversubscription would enable the possibility of deadlocking the application and the system, as commented in Section II. Therefore, UMT avoids holding workers within the kernel; it just makes more information available to user-space.

In summary, the main UMT advantages are:
• Simple and lightweight extension to monitor blocking and unblocking events of threads.
• Deadlock free, compared to similar methods such as SA.

And the disadvantages:
• Periods of oversubscription might impact performance depending on the application, but simple techniques can keep it to a minimum (see Section IV-B).
• Unnecessary context switches, which can be solved with a leader-follower approach.
• Unnecessary block/unblock events, although this can be palliated by notifying only when the cores become idle.
• User-space core counter temporal inconsistencies which, as explained above, are not critical and unlikely.

IV. EXPERIMENTAL VALIDATION
We evaluate the UMT performance on both network and storage I/O with two applications: the Full-Waveform Inversion (FWI) application mock-up and the Heat Diffusion benchmark based on the Gauss-Seidel algorithm.

Both applications use synchronous network operations instead of asynchronous ones and, therefore, must enforce sequential ordering of communications. Efficiently using asynchronous APIs with task-based programming models requires complex code or special support [14], [15] that might not be available in all runtimes. Nonetheless, we show that UMT offers a generic approach to alleviate this problem transparently.

Network I/O over Omni-Path and InfiniBand is directly managed in user-space, so it never blocks on the kernel side. Therefore, we have run all tests on top of an Ethernet network to illustrate the effect of UMT on network communications. Performance metrics are obtained for all applications with both a UMT-enabled Nanos6 runtime and an unmodified version. Both runtimes are executed on top of the modified kernel, but only the former activates the kernel UMT facility.
A. Environment, Tools and Metrics
All tests have been run on the BSC's "Cobi" Intel Scalable System Framework (SSF) cluster. Each node features two Intel Xeon E5-2690v4 sockets with a total of 28 real cores and 56 hardware threads at 2.60 GHz, 128 GiB of DDR4 memory at 2400 MHz, a 960 GB Intel Optane 905P SSD used to run the benchmarks, and an Intel DC S3520 SATA SSD with 222 GiB that holds the system installation. All nodes are connected with both Intel Omni-Path and Ethernet networks. The nodes' Linux distribution is a minimal installation of SUSE Linux Enterprise Server (SLES) 12.2-0. We have updated the genuine distribution's kernel with our UMT-enabled Linux Kernel 5.1.

We profiled the maximum random read and write speeds of both the Optane and SATA SSDs using the Flexible I/O tester (fio) [17] by running 56 threads issuing up to 4 asynchronous I/O operations of 1 MiB each. The Optane SSD approximately reported reads of 2500 MiB/s and writes of 2100 MiB/s. The SATA SSD showed an approximate peak performance of 250 MiB/s for writes and 270 MiB/s for reads.

Custom metrics (such as the oversubscription period) were obtained with the Linux Trace Toolkit next generation (LTTng) [18] and the Babeltrace parser. The visualization tool Trace Compass [19] was a key component to analyze UMT.
B. Oversubscription and optional tweaks
UMT works transparently without any application modifications. However, minor straightforward programming techniques will increase its performance by limiting oversubscription and enabling more parallelism.

To keep oversubscription to a minimum, the application developer only needs to consider the following task layout: avoid packing I/O operations and computationally intensive work, in this order, within the same task. Instead, it is better to either split I/O and computation into two different tasks or, if possible, do computation first and then I/O. Another option is to add a task scheduling point after the I/O, such as a taskyield or a taskwait (which, in general, will not impact performance as it will not wait for any task).

The reasoning is twofold. On the one hand, blocking operations trigger the UMT mechanism, resulting in additional awakened workers running more tasks. On the other hand, the UMT-enabled Nanos6 oversubscription prevention mechanism only forces workers to surrender at task scheduling points (such as task finish). Therefore, tasks that perform I/O followed by computation will first trigger the execution of multiple threads per core that will immediately block while performing I/O, triggering the wake-up of further workers until either no more runnable tasks or workers are left. Then, eventually, threads will gradually wake up as their requested data becomes available and will resume execution. If the second part of the tasks is computationally intensive, all the previously woken-up threads will inevitably have to compete for a share of their assigned core. The oversubscription period will continue until the first tasks finish and Nanos6 has a chance to stop workers before they get another task. Alternatively, adding a scheduling point between I/O and computation will enforce an oversubscription check within the task's execution, which will also prevent the problem.

Task-based MPI applications usually need to enforce sequential ordering on their communication tasks, as it is possible that all cores assigned to two MPI processes become blocked while running unmatched MPI send and receive operations. When UMT is in use, there is no need for such a restriction as long as networking transmissions do block. In such cases, UMT will report idle cores due to either blocking send or receive operations, and the runtime will be able to schedule more tasks. Eventually, all matching send-receive operations will be in flight, and the execution will continue.

UMT seamlessly overlaps I/O and computation, as long as there is enough parallelism. To enable more parallelism (if needed), implementing an n-buffering scheme might be useful in order to defer I/O operations while preventing stalls on dependent tasks. This is particularly useful for write operations, as read operations are likely to be on the critical path.
C. UMT and the page cache
UMT also works with I/O indirectly performed through the page cache. When a worker writes data to the page cache and it is full, the thread blocks until there is enough free memory to proceed. Because UMT reports blocking threads regardless of the cause, the runtime is notified of this situation and responds as usual. However, the page cache flush might be performed by a kernel thread when the amount of system memory is below a certain threshold, or when pages are older than a configurable number of centiseconds (see dirty_background_ratio, dirty_ratio, dirty_expire_centisecs and dirty_writeback_centisecs in the Linux Kernel source file Documentation/sysctl/vm.txt). In such cases, runtime threads do not block because flushing is performed transparently by the system which, in fact, is also overlapping I/O with computation. Yet this approach, unlike UMT combined with non-buffered I/O, does not extend to read operations and has the additional cost of an extra memory copy for writing to the page cache. An advantage of the page cache is that it naturally optimizes write I/O operations whose addresses coincide in the same page cache page before they are flushed, reducing the total number of bytes written to the storage device. However, this only applies to applications that write multiple times to the same addresses. We further evaluate the effect of UMT and the page cache in Section IV-E1.
D. Full Waveform Inversion Mock-up (FWI)

1) Introduction:
The acoustic Full Waveform Inversion (FWI) [20] method aims to generate high-resolution subsoil velocity models from acquired seismic data through an iterative process. The time-dependent seismic wave equation is solved forward and backward in time in order to estimate the velocity model.

From the computational point of view, FWI is mainly divided into two phases: the forward propagation and the backward propagation. In each phase, two three-dimensional volumes are updated for a sequence of time steps, one for velocities and one for stresses. In each forward propagation timestep, the FWI models are updated and a snapshot might be saved to disk. Next, in the backward propagation phase, all timesteps are processed again in reverse order, and the corresponding snapshots are read instead of written.

We have parallelized the FWI mock-up using MPI and OmpSs-2. The velocity and stress volumes are split at the Y-slice level, a slice of length 1 being the minimum parallel granularity. Fig. 2 shows the task decomposition of a forward propagation timestep. The backward propagation is analogous and is not shown. Each MPI rank is assigned a range of consecutive Y-slices for both the velocity and the stress volumes. After a rank has finished computing either the velocity or the stress slices, it sends the left-most and right-most slices to the previous and next ranks, respectively. The exact number of slices exchanged in the halo is an execution parameter. All slices assigned to an MPI rank are computed in OmpSs-2 tasks. Each task wraps the computation of exactly one slice. Writing and reading snapshots is also done at the slice level. The velocity and stress tasks involved in the computation of the halo also perform the MPI send operation to the appropriate rank. The MPI receive operations, instead, run in standalone tasks.

We have introduced the three optimizations targeting UMT mentioned in Section IV-B.
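The slice assignment described above (consecutive Y-slice ranges per rank, left-most and right-most slices sent to the neighbouring ranks) can be sketched as a small partitioning function. This is our own illustrative reconstruction, not code from the mock-up; the function and field names are hypothetical, and a halo width of 1 is assumed by default.

```python
def partition_slices(n_slices, n_ranks, halo=1):
    """Assign consecutive Y-slice ranges to MPI ranks and record the
    halo slices each rank sends to its previous/next neighbour."""
    base, rem = divmod(n_slices, n_ranks)
    plan, start = [], 0
    for rank in range(n_ranks):
        count = base + (1 if rank < rem else 0)
        lo, hi = start, start + count - 1              # inclusive range
        sends = {}
        if rank > 0:                                   # left-most slices
            sends[rank - 1] = list(range(lo, lo + halo))
        if rank < n_ranks - 1:                         # right-most slices
            sends[rank + 1] = list(range(hi - halo + 1, hi + 1))
        plan.append({"rank": rank, "slices": (lo, hi), "sends": sends})
        start += count
    return plan

# Two ranks over the 808 Y-slices of the two-node FWI volume.
for p in partition_slices(808, 2):
    print(p)
```

Each entry lists the rank's owned slice range and, per neighbour, which slices it must send, which is exactly the information the halo-exchange tasks need.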
2) Results:
We have evaluated FWI in two scenarios: a single node, which involves only storage I/O, and two nodes, which also includes network I/O. In both scenarios, we use one MPI process per socket to maximize the main memory bandwidth and restrict each MPI process to work with its own file. In each scenario, tests are repeated for both the SATA and Optane SSDs. In all cases, performance is evaluated in terms of processed kilo volume cells per second (kc/s). We have run between 5 and 10 repetitions of each test.

We used two problem sizes, one for the single-node and another for the two-node setting, the latter being twice the size of the former. The input frequency is 20 Hz in both cases, and the volume dimensions in terms of Z, X and Y are 208x208x408 for the single-node tests and 208x208x808 for the two-node tests. Each I/O task processes a Y-slice of 1521 KiB, the total volume size being approximately 606 MiB. In total, 118 forward and 118 backward propagation iterations are processed. All tests are run with an I/O frequency (iof) of 1 and 3. In total, each node writes and reads 70 GiB for iof 1 and 23 GiB for iof 3.
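The quoted I/O volumes follow from the problem dimensions. The arithmetic below reproduces them assuming 4-byte cells and nine scalar fields per cell; the field count is our inference from the 1521 KiB slice size, as it is not stated explicitly in the text.

```python
# Back-of-the-envelope check of the FWI I/O volumes quoted above.
# Assumptions (ours): 4 bytes per cell, nine scalar fields per cell.
Z, X, Y = 208, 208, 408                  # single-node volume dimensions
FIELDS, CELL_BYTES = 9, 4
TIMESTEPS = 118

slice_kib = Z * X * FIELDS * CELL_BYTES / 1024
volume_mib = slice_kib * Y / 1024
# One snapshot is written every `iof` forward timesteps (and reread backward).
written_gib = {iof: volume_mib * (TIMESTEPS // iof) / 1024 for iof in (1, 3)}

print(f"slice  = {slice_kib:.0f} KiB")                       # ~1521 KiB
print(f"volume = {volume_mib:.0f} MiB")                      # ~606 MiB
print(f"iof 1 -> {written_gib[1]:.0f} GiB, "
      f"iof 3 -> {written_gib[3]:.0f} GiB")                  # ~70 / ~23 GiB
```

The numbers match the text (1521 KiB per slice, ~606 MiB per snapshot, ~70 GiB written per node at iof 1 and ~23 GiB at iof 3), supporting the nine-fields reading.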
Fig. 2. FWI task and MPI task decomposition of two ranks. Tasks work with a single volume slice. V and S tasks compute velocity and stress slices, respectively. The S suffix denotes tasks that compute and send their block with MPI. The R suffix denotes tasks that only receive a data block. W tasks write a velocity slice to disk.

The use of the page cache is discouraged for FWI. Although FWI rereads in the backward phase what was written in the forward phase, it is likely that all the cached data has been flushed and replaced by the time it is reused.

Table I summarizes the results obtained on both the SATA and Optane SSDs. UMT improves performance in almost all the presented scenarios. The key observation behind these metrics is that the baseline's core usage is suboptimal even though the task-based parallelization approach and the task granularity should be enough to keep all cores busy during the entire execution. It can therefore be deduced that the idle time is spent in I/O operations. The UMT-enabled Nanos6 runtime is aware of idle cores and uses this knowledge to schedule pending work on them, effectively improving resource usage, which in most cases almost reaches 100%.

Because work is computed earlier with UMT, more pressure falls on the I/O devices. On systems where such devices are not saturated, both network and storage devices can increase their throughput, resulting in a generalized performance improvement. However, as can be seen in the SATA test with iof 1, no further improvement is achieved if the storage device is already at its peak performance in the baseline version. I/O can still be overlapped with computation but, if the storage device is the bottleneck, the UMT effect is limited to computing work in advance; in the end, the execution still has to wait for the storage device.

The two-node setting achieves particularly high speedups for two main reasons. On the one hand, Ethernet communications are likely to block, and UMT overlaps other work in the meantime.
On the other hand, the use of UMT removes the need for network serialization (as explained in Section IV-B), which reduces the number of synchronization points.

Fig. 3 shows the FWI CPU, disk and network utilization graphs with and without UMT for a single run on two nodes with Optane. The disk view aggregates both read and write operations, but the kind of I/O can easily be distinguished by the division in the graphs that separates the two FWI phases: first writes for the forward phase, then reads for the backward phase. The CPU view shows how UMT affects the execution of each phase by bringing its utilization to almost full capacity. CPU usage drops slightly at the end of each phase, when the number of remaining tasks becomes low and it is no longer possible to overlap I/O with computation. The network and disk, far from being saturated, follow the same tendency and see their throughput increased during both phases.

Because UMT does not distinguish the block/unblock reason that triggers it, it is not possible to know to what extent UMT influences each I/O interface and how each of them individually contributes to the overall performance improvement. Moreover, their effects combine; for instance, increasing CPU usage by overlapping storage I/O with computation implies that work is computed earlier and, therefore, more data is ready to be served through the network I/O interface. Consequently, it is difficult to dissect each component. However, in order to focus on storage I/O (instead of network I/O and the effect of dropping the sequential ordering constraint), we have run two fully independent FWI instances on a single node (one per socket) with half the problem size, using iof 3 for SATA and iof 1 for Optane. The tests run on Optane showed a 3% speedup, while the SATA tests obtained 6%.
Hence, it can be deduced that a considerable part of the benefit comes from communications.

The impact of oversubscription has been negligible in all tests. Our custom LTTng and Babeltrace analysis scripts reported UMT oversubscription periods limited to approximately 2.25% of the total execution length. The number of context switches per core increased by approximately 8200 in 330 s.

Finally, the Linux Perf tool has been used to analyze the individual performance of FWI, Nanos6 and the Linux kernel. Table II shows three Perf traces with different sampling frequencies, grouped by their dynamic shared object. We used multiple frequencies because, although a high sampling frequency increases the trace accuracy, it also generates bigger traces that might affect the storage device performance. The table shows that the overhead of the UMT-enabled Nanos6 and Linux kernel is only slightly higher than that of the baseline versions, approximately 0.04% more for Nanos6 and 0.10% more for the Linux kernel.
TABLE I
FWI RESULTS. DISK THROUGHPUT IS REPORTED PER NODE. DISK AND NETWORK ARE REPORTED IN MiB/s.

                    One Node                                 Two Nodes
Storage  IOF   Speedup  kc/s   CPU(%)  write   read     Speedup  kc/s   CPU(%)  write   read    Network
Optane    3    13%      12103  74.66   141.14  140.57   38%      19407  51.01   112.10  113.77  45.04
Optane    1    13%      11883  75.01   409.21  406.77   34%      19492  52.57   328.29  341.17  45.21
SATA      3    15%      11782  74.98   138.24  136.01   39%      19157  51.90   110.96  112.00  44.46
SATA      1     1%       7450  55.28   249.74  262.17    4%      14240  52.28   233.03  257.23  33.04

Rows report the UMT version; speedups are relative to the corresponding baseline.
Fig. 3. FWI baseline vs UMT metrics run on two nodes backed by Optane.

TABLE II
PERCENTAGE OF PERF SAMPLES DISTRIBUTED OVER FWI, NANOS6 AND LINUX KERNEL FOR THREE SAMPLING FREQUENCIES.

Component          99Hz(%)  999Hz(%)  9999Hz(%)
FWI (baseline)     93.58    97.05     97.57
FWI (UMT)          96.16    96.95     96.64
Nanos6 (baseline)  0.05     0.06      0.06
Nanos6 (UMT)       0.11     0.11      0.09
Kernel (baseline)  0.46     0.39      0.40
Kernel (UMT)       0.59     0.47      0.50
E. Heat diffusion

1) Introduction:
We use a Gauss-Seidel-based iterative heat equation solver with checkpointing, parallelized with OmpSs-2 and MPI following a wavefront strategy.

Fig. 4 shows the task decomposition. Essentially, the algorithm updates a two-dimensional matrix for a number of iterations. Blocks of consecutive rows are distributed among the MPI processes, which exchange halos with their previous and next ranks. Each MPI process computes its share of the volume by splitting it into blocks that are processed in tasks. The execution of iterations is fully overlapped: the tasks that belong to iteration i+1 start running as soon as the tasks of iteration i they depend on have been computed.

Fig. 4. Heat Diffusion task decomposition for four MPI processes. Tasks are shown as circles; there is no distinction between computation tasks and I/O tasks in terms of the task layout.

Checkpointing is performed every n iterations by writing the whole model in parallel. When a checkpointing iteration is reached, a set of tasks that perform both the model update and storage I/O (in that order) are created instead of the usual update-only tasks. Because enough parallelism was available, there was no need to implement an n-buffering scheme.

The main purpose of this benchmark is to illustrate the behaviour of UMT on applications that perform I/O mostly for checkpointing. This common I/O pattern is particularly interesting because write operations do not stall parallelism and easily enable the possibility of overlapping I/O with other tasks. Data written in a checkpointing iteration is usually not read again during the execution of the application and, therefore, the system's page cache only adds the overhead of an extra memory copy (similarly to FWI). For this reason, this kind of write operation benefits from a non-buffered policy, but this also turns I/O into a blocking operation, which increases the time cores spend idling. Checkpointing is thus, by its nature, an ideal scenario for UMT.

To illustrate this, Table III reports the Heat Diffusion performance obtained when running on top of the page cache (without O_DIRECT). The remaining experiments in this section rely on the Linux kernel O_DIRECT mechanism to bypass the page cache. Note that although the buffered version achieves a speedup similar to that of the non-buffered version, the absolute performance of the buffered version is worse than that of the non-buffered one.

TABLE III
HEAT DIFFUSION BUFFERED VS NON-BUFFERED STORAGE I/O ANALYSIS ON A SINGLE NODE WITH OPTANE AND IOF 15. CPU IS SHOWN IN PERCENTAGE AND STORAGE I/O IN MiB/s.

Cache         Speedup  cells/s  CPU    Write
Buffered      14%      4835     78.11  1242.85
Non-Buffered  20%      4934     76.36  1239.09

Rows report the UMT version; speedups are relative to the corresponding baseline.
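The overlapped-iterations wavefront described in the introduction above can be made concrete with a small scheduling sketch: a task for block b at iteration i depends on its left neighbour in the same iteration and on the same block in the previous iteration. This dependency set is our simplification of the solver's real Gauss-Seidel dependencies, and the function is purely illustrative.

```python
def wavefront_schedule(n_blocks, n_iters):
    """Earliest step at which each (iteration, block) task can run,
    assuming task (i, b) depends on (i, b-1) and (i-1, b). Successive
    iterations overlap as a wavefront rather than running back to back."""
    start = {}
    for it in range(n_iters):
        for b in range(n_blocks):
            deps = []
            if b > 0:
                deps.append(start[(it, b - 1)])
            if it > 0:
                deps.append(start[(it - 1, b)])
            start[(it, b)] = 1 + max(deps, default=-1)
    return start

s = wavefront_schedule(n_blocks=4, n_iters=3)
# Iteration 1 starts at step 1, well before iteration 0 finishes at step 3.
print(s[(0, 3)], s[(1, 0)], s[(2, 3)])
```

Under this model, task (i, b) starts at step i + b, so iteration i+1 begins only one step behind iteration i instead of waiting for it to complete, which is the overlap the text describes.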
2) Results:
Table IV shows the observed results. Similarly to FWI, UMT improves the benchmark's performance by reducing the cores' idle time while raising both storage and network I/O throughput.

We used two different input sets for the SATA and Optane SSDs. Because the Optane SSD is approximately 10 times faster than the SATA SSD, we had to increase the benchmark's checkpointing frequency when running with Optane to keep the pressure on storage I/O; otherwise, the benchmark's computational part would greatly outweigh the I/O part, and UMT's margin for improvement would become too small for its effect to be appreciable. The Optane tests write approximately 780 GiB per node, while the SATA tests write 117 GiB.

Fig. 5 shows the baseline vs UMT comparison of CPU, disk and network utilization for a full test run on two nodes with Optane. In the baseline version, CPU usage exhibits many bursts corresponding to adjacent computing iterations. Checkpointing iterations dramatically increase idle time and, because they are interleaved with computation bursts, they prevent the cores from running at full capacity during most of the execution. The baseline Optane throughput follows the same pattern as the CPU usage, with performance peaks coinciding with CPU usage valleys. A similar scenario repeats with the network. The UMT version, instead, flattens the CPU usage at its peak and keeps both the Optane and Ethernet interfaces working at a higher capacity most of the time. As a result, the speedup almost reaches 2x (97%). The storage-I/O-only tests (two independent processes on one node with half the problem size each) reported a 13% speedup for SATA (iof 100) and a 5% speedup for Optane (iof 15), again showing that network communication improvements seem to be particularly relevant in these tests.

Similarly to FWI, oversubscription has had a negligible effect on performance. Oversubscription periods range between 2.4% and 3.2%.
The number of context switches per core increased by approximately 115000 for two nodes (in 840 s) and 35000 for one node (in 610 s).

V. CONCLUSION AND PERSPECTIVES
In this paper, we presented a proof-of-concept of the User-Monitored Threads (UMT) model, which monitors blocking and unblocking thread events based on the eventfd simplified pipe. The proposal introduces a simple and lightweight Linux kernel extension along with a coupled user-space runtime. We implemented the kernel extension on top of a Linux kernel 5.1 and adapted the runtime of the OmpSs-2 task-based programming model to make use of it. The Nanos6 runtime uses the mechanism to identify idle cores. To do so, it uses the UMT block and unblock events to keep track of the number of ready workers bound to each core of the system. When a core becomes idle because, for example, all of its bound workers are blocked performing I/O operations, the runtime schedules additional workers on it. Multiple workers bound to the same core might lead to oversubscription periods, but the runtime minimizes this effect by forcing a worker to surrender its core when multiple ready threads are detected to be bound to it.

We tested UMT with two applications and conclude that it has two main effects. On the one hand, it provides a mechanism to queue more I/O operations, which brings the application's I/O throughput closer to the storage device's maximum rate. On the other hand, blocked processes no longer stall the core to which they are bound, and useful computation can run instead, effectively enabling transparent overlap of I/O and computation.
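The per-core bookkeeping described above can be sketched as a counter of ready workers per core, driven by block/unblock events. This is an illustrative model only: Nanos6 consumes these events from the kernel through an eventfd, which this sketch does not reproduce, and all class and method names are ours.

```python
class CoreMonitor:
    """Sketch of per-core ready-worker tracking: on a block event the
    counter drops; if it reaches zero the core is idle and the runtime
    schedules an extra worker there. On unblock, the counter rises and
    a count above one signals oversubscription (one worker should yield)."""

    def __init__(self, n_cores):
        self.ready = [1] * n_cores       # one running worker per core
        self.idle_events = []            # cores where extra work was scheduled

    def on_block(self, core):
        self.ready[core] -= 1
        if self.ready[core] == 0:        # core went idle
            self.idle_events.append(core)
            self.ready[core] += 1        # a new worker is now bound here

    def on_unblock(self, core):
        self.ready[core] += 1            # oversubscribed until one yields

    def oversubscribed(self, core):
        return self.ready[core] > 1

m = CoreMonitor(n_cores=2)
m.on_block(0)                # a worker on core 0 enters blocking I/O
print(m.idle_events)         # core 0 reported idle, extra worker scheduled
m.on_unblock(0)              # the I/O completes: two ready workers on core 0
print(m.oversubscribed(0))   # oversubscription detected; one should yield
```

The oversubscribed() check corresponds to the self-surrender policy above: when it returns true, one of the ready workers bound to the core yields it back.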
Oversubscription periods might limit performance but, as studied in the experimentation section, simple implementation techniques dramatically limit their effect. Overall, we were able to achieve speedups of almost 2x with the evaluated tests that combine both storage and network I/O. It is worth noting that applications mostly benefit from UMT when they are not already saturating either the system's cores or the I/O devices, when they exhibit enough parallelism to completely overlap I/O with computation and when, ideally (although not mandatorily), non-buffered I/O can be used.

Regarding future work, we plan to improve and polish some UMT details. First, we plan to carefully study a UMT version in which notifications are only sent when a core is idle. Then, we will consider implementing a leader-follower approach in Nanos6 to minimize the number of unnecessary context switches. Finally, we will continue testing UMT with further task-based, I/O-intensive real applications. Eventually, we will propose UMT for inclusion in the mainline Linux kernel repository.
TABLE IV
HEAT DIFFUSION RESULTS. DISK THROUGHPUT IS REPORTED PER NODE. DISK AND NETWORK ARE REPORTED IN MiB/s.

                    One Node                          Two Nodes
Storage  IOF   Speedup  cells/s  CPU(%)  write   Speedup  cells/s  CPU(%)  write   Network
Optane    15   20%      4934     76.36   1239.09 97%      5850     46.42   739.85  0.64
SATA     100   30%      4017     63.23   151.47  65%      4531     37.12   86.00   0.51

Rows report the UMT version; speedups are relative to the corresponding baseline.
Fig. 5. Heat Diffusion baseline vs UMT metrics run on two nodes backed by Optane.

ACKNOWLEDGMENT
This project is supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 754304 (DEEP-EST), by the Ministry of Economy of Spain through the Severo Ochoa Center of Excellence Program (SEV-2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P) and by the Generalitat de Catalunya (2017-SGR-1481). The authors would also like to acknowledge that the test environment (Cobi) was ceded by Intel Corporation in the frame of the BSC-Intel collaboration.
REFERENCES

[1] J. M. Perez, V. Beltran, J. Labarta, and E. Ayguad, "Improving the integration of task nesting and dependencies in OpenMP," May 2017, pp. 809-818.
[2] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy, "Scheduler activations: Effective kernel support for the user-level management of parallelism," ACM Transactions on Computer Systems (TOCS), vol. 10, no. 1, pp. 53-79, 1992.
[3] N. J. Williams, "An implementation of scheduler activations on the NetBSD operating system," in USENIX Annual Technical Conference, FREENIX Track, 2002, pp. 99-108.
[4] C. A. Small and M. I. Seltzer, "Scheduler activations on BSD: Sharing thread management between kernel and application," 1995.
[5] J. Evans and J. Elischer, "Kernel-scheduled entities for FreeBSD," 2000.
[6] V. Danjean, R. Namyst, and R. D. Russell, "Linux kernel activations to support multithreading," in Proc. 18th IASTED International Conference on Applied Informatics (AI 2000). Citeseer, 2000.
[7] ——, "Integrating kernel activations in a multithreaded runtime system on top of Linux," in International Parallel and Distributed Processing Symposium. Springer, 2000, pp. 1160-1167.
[8] V. Danjean and R. Namyst, "Controlling kernel scheduling from user space: An approach to enhancing applications' reactivity to I/O events," in International Conference on High-Performance Computing. Springer, 2003, pp. 490-499.
[9] B. H. Ingo Molnar. (2002) Linux kernel mailing list (LKML) discussion. [Online]. Available: https://lkml.org/lkml/2002/9/24/305
[10] M. Rasiukevicius, "Thread scheduling and related interfaces in NetBSD 5.0," 2009.
[11] Microsoft. (2017) User-mode scheduling. [Online]. Available: https://msdn.microsoft.com/en-us/library/windows/desktop/dd627187
[12] J. Appavoo, M. Auslander, D. DaSilva, D. Edelsohn, O. Krieger, M. Ostrowski, B. Rosenburg, R. W. Wisniewski, and J. Xenidis, "Scheduling in K42," White Paper, Aug. 2002.
[13] O. Krieger, M. Auslander, B. Rosenburg, R. W. Wisniewski, J. Xenidis, D. Da Silva, M. Ostrowski, J. Appavoo, M. Butrico, M. Mergen et al., "K42: building a complete operating system," ACM SIGOPS Operating Systems Review, vol. 40, no. 4, pp. 133-145, 2006.
[14] K. Sala, X. Teruel, J. M. Perez, A. J. Peña, V. Beltran, and J. Labarta, "Integrating blocking and non-blocking MPI primitives with task-based programming models," Parallel Computing, vol. 85, pp. 153-166, 2019.
[15] A. Roca, V. Beltran, and S. Mateo, "Introducing the task-aware storage I/O (TASIO) library," in International Workshop on OpenMP. Springer, 2019.
[16] D. C. Schmidt, C. O'Ryan, M. Kircher, I. Pyarali, and B. Backendend, "Leader/followers: a design pattern for efficient multi-threaded event demultiplexing and dispatching," Addison-Wesley, 2000.
[17] J. Axboe, "Fio - flexible I/O tester," URL http://freecode.com/projects/fio, 2014.
[18] M. Desnoyers and M. R. Dagenais, "The LTTng tracer: A low impact performance and behavior monitor for GNU/Linux," in
URL http://freecode. com/projects/fio ,2014.[18] M. Desnoyers and M. R. Dagenais, “The lttng tracer: A low impactperformance and behavior monitor for gnu/linux,” in