An Evaluation of Coarse-Grained Locking for Multicore Microkernels
Kevin Elphinstone, Amirreza Zarrabi, Adrian Danis, Yanyan Shen, Gernot Heiser
UNSW and Data61, CSIRO, Australia
firstname.lastname@data61.csiro.au
Abstract
The trade-off between coarse- and fine-grained locking is a well understood issue in operating systems. Coarse-grained locking provides lower overhead under low contention, fine-grained locking provides higher scalability under contention, though at the expense of implementation complexity and reduced best-case performance.

We revisit this trade-off in the context of microkernels and tightly-coupled cores with shared caches and low inter-core migration latencies. We evaluate performance on two architectures: x86 and ARM MPCore, in the former case also utilising transactional memory (Intel TSX). Our thesis is that on such hardware, a well-designed microkernel, with short system calls, can take advantage of coarse-grained locking on modern hardware, avoid the run-time and complexity cost of multiple locks, enable formal verification, and still achieve scalability comparable to fine-grained locking.
1. Introduction
Waste of processing power resulting from lock contention has been an issue since the advent of multiprocessor computers, and has become a mainstream computing challenge since multicores became commonplace. Much research is directed to understanding and achieving scalability to large numbers of processor cores, where lock contention is inevitable and must be minimised [Clements et al., 2013].
It is now taken as given that locks must be fine-grained, ideally protecting individual accesses to shared data structures, and that shared data structures must be minimised, or, in the extreme case of a multikernel [Baumann et al., 2009], avoided altogether.

We observe that a discussion of scalability cannot be done without taking into account operating system (OS) structure as well as platform architecture. Prior scalability work is typically performed in the context of a monolithic OS that needs to scale to hundreds or thousands of concurrent hardware execution contexts, with communication between contexts measuring in the thousands of cycles. But monolithic systems are no longer all that matters; microkernels are finding renewed interest due to their ability to reduce a system's trusted computing base and thus its attack surface [Heiser and Leslie, 2010; Klein et al., 2014; McCune et al., 2008; Steinberg and Kauer, 2010; Zhang et al., 2011].

In a monolithic system, such as Linux, typical system call latencies are long, even compared to inter-core communication latencies in the 1000s of cycles. In contrast, a well-designed microkernel is essentially a context-switching engine, with typical syscall latencies in the hundreds of cycles [Heiser and Elphinstone, 2016]. In such a system, the cost of cross-core synchronisation may be an order of magnitude higher than the basic syscall cost. It therefore makes no sense to run a single kernel image, with shared data structures, across such a manycore machine. An appropriate design should share no data between cores where communication is expensive, resulting in a multikernel design [Baumann et al., 2009].

However, the multikernel approach is not the complete answer either. It presents itself to user level as a distributed system, where userland must explicitly communicate between nodes. This is not the right model where communication latencies are small, e.g. across hardware contexts of a single core, or between cores that share an L2 cache, where they are of the order of tens of cycles, well below the latency of a syscall even in a microkernel. In this context, explicit communication between nodes is more expensive than relying on shared memory, and there is no justification for forcing a distributed-system model on userland.

We therefore argue that, for a microkernel, the right model is one that reflects the structure of the underlying hardware: a shared kernel within a closely-coupled cluster of execution contexts, but shared nothing between such clusters. The resulting model is that of a clustered multikernel [von Tessin, 2012].

A node in such a cluster puts scalability into a different context: rather than to hundreds or thousands of cores, it only needs to scale to the size of a closely-coupled cluster, no more than a few dozen execution contexts. Such a cluster matches another important category of platforms: the now ubiquitous (and inexpensive) low-end multicore processors deployed by the billions in mobile devices.

A microkernel for a closely-coupled cluster represents an area in the design space markedly different from that of manycores, and it is far from obvious that the same solutions apply. In particular, it is far from obvious that fine-grained locking makes sense. In fact, typical system-call latencies are not much longer than critical sections in other OSes.
In that sense, for a microkernel, even a big lock is not much coarser than a fine-grained lock in a system like Linux – as has been observed before, for a microkernel, a big lock may be fine-grained enough [Peters et al., 2015].

This discussion may seem academic at first, given that fine-grained locking techniques are well-known and widely implemented, so why not use them anyway? There are, in fact, strong reasons to stick with coarse-grained locking as long as possible: Each lock acquisition has a cost, which is pure overhead in the absence of contention. While insignificant compared to the overall system-call cost in a system like Linux, in a microkernel this overhead is significant.

More importantly, the concurrency introduced by fine-grained locking greatly increases the conceptual complexity of code, and thus increases the likelihood of subtle bugs that are hard to find [Lehey, 2001], as painfully confirmed by our experience implementing fine-grained locking in seL4. Furthermore, this complexity is presently a show-stopper for formal verification, which otherwise is feasible for a microkernel [Klein et al., 2014].

Additionally, as Intel TSX restricted transactional memory (RTM) extensions become widely available, there is an opportunity to have the complexity of coarse-grained locking and the performance of fine-grained locking by using RTM to elide coarse-grained locks [Rajwar and Goodman, 2001].

We therefore argue that it is important to understand the performance impact of a big-lock design, which maximises best-case performance, minimises complexity and eases assurance. To this end we conduct a detailed examination of the scalability of the seL4 microkernel on closely-coupled clusters on two vastly different hardware platforms (an x86-based server and an ARM-based embedded processor) under different locking regimes. We make the following contributions:

• We estimate the theoretical performance of a coarsely synchronised (big-lock) microkernel using queueing theory (Section 4).
• We validate the queueing model experimentally, and at the same time identify modifications to the microkernel to achieve near theoretically optimum performance (Section 5.1).
• We compare big-lock and fine-grained-lock implementations of the seL4 microkernel and evaluate those on closely-coupled cores on two architectures (ARM and x86), in contrast to the usual approach of aiming for high-end scalability across loosely-coupled cores (Section 5).
• We present (on x86) the first use of hardware transactional memory that places the majority of the kernel into a single transaction for concurrency control, and we compare it with locking (Section 5).

We show that the choice of concurrency control in the kernel is clearly distinguishable for extreme synthetic benchmarks (Section 5.1). For a realistic, system-call-intensive workload, performance differences are visible, and coarse-grained locking (or the equivalent elided with hardware transactional memory) is preferred over the extra kernel complexity of fine-grained locking (Section 5.2).
2. Background
2.1. The Locking Trade-Off

The best locking granularity is determined by a trade-off involving multiple factors. As long as there is no contention, taking and releasing locks is pure overhead, which is minimised by having just a single lock, the big kernel lock (BKL). Each lock adds some overhead which degrades the best-case (i.e. uncontended) performance.

As long as the total number of locks is small, this baseline overhead is usually small compared to the basic system-call cost. However, on a well-designed microkernel, where system calls tend to be very short (100s of cycles), this overhead might matter.

Fine-grained locking can significantly reduce contention, if it enables unlocked execution of the majority of code. In a BKL kernel, contention can be expected to be noticeable as soon as the hold time (fraction of time spent inside the kernel, also referred to as kernel time) is not small compared to the pause time (fraction of time spent in user mode).

The amount of kernel time depends on the profile of system calls executed, and thus on the workload. On a monolithic kernel, most system services are provided by the kernel, especially I/O, and consequently I/O-intensive workloads tend to have high kernel time. On a microkernel, system services are provided by server processes running in user mode, and the kernel provides communication between clients and servers. On a well-designed microkernel, such as the ones of the L4 family, kernel time is dominated by context switches [Liedtke, 1995]. The total number of kernel calls is higher than in a monolithic kernel (at least twice as high, as every server invocation invokes the kernel twice) but the average system-call latency is a tiny fraction of that of a monolithic kernel.

Hence, a BKL is a more credible design for a microkernel than for a monolithic OS, at least for a closely-coupled cluster of execution contexts, where intra-cluster communication latencies are low (e.g. due to a shared cache). As explained in the introduction, this does not prevent the kernel from use on manycores, but on such hardware, kernel state should not be shared across clusters, resulting in the clustered multikernel design. It is the design of a kernel for such a cluster which we explore in this paper. To avoid drawing invalid conclusions from idiosyncrasies of a particular platform, we examine two very different architectures: an x86-based server processor and an ARM-based processor aimed at embedded devices, especially smartphones.
2.2. The x86 Platform

As an x86 platform we use a server-class Dell PowerEdge R630 fitted with two Intel Xeon E5-2683 v3 processors. These are 14-core processors with a base clock rate of 2.0 GHz and two hardware threads per core, giving 28 hardware threads per processor. Thus the machine has a total of 56 hardware threads across the two CPU sockets. While not officially supported, the microarchitecture features Intel's TSX implementation of restricted transactional memory (RTM), which we describe further in Section 2.4.

The processor features three levels of cache. Each core has private L1 instruction and data caches, each 32 KiB in size and 8-way associative. Each core furthermore has a private, non-inclusive, 8-way 256 KiB L2 cache. The last-level cache is 35 MiB, consisting of a 2.5 MiB slice per core.

Table 1 shows our measured memory latency, cache access latency, and latency of data transfer between cores on the same socket, and across separate sockets. The measurements were obtained using code derived from BenchIT, and the results are reasonably consistent with measurements obtained by the authors [Molka et al., 2015], noting the differing clock rates of the system under test. One should also note that these results are sensitive to the distance between cores and thus will vary depending on the specific cores involved.

Table 1: Memory and cache access latency in cycles.

Platform  Level   Local core  Intra-socket  Inter-socket
x86       L1      4           115           218
          L2      12          105           208
          L3      44          44            163
          Memory  185         185           265
ARM       L1      4           17            N/A
          L2      26          28            N/A
          Memory  140         N/A           N/A
2.3. The ARM Platform

Our ARM platform is the Sabre Lite, which is based on a Freescale i.MX 6Q SoC, featuring a quad-core ARM Cortex-A9 MPCore processor [Freescale, 2013]. The cores run at a 1 GHz clock rate and have private, split L1 caches, each 4-way-associative and 32 KiB in size. The cores share a 1 MiB, unified, 16-way-associative L2 cache, which is the last-level cache. We ported the microbenchmarks from BenchIT to the ARM platform and obtain the results in Table 1.

Compared to x86, the ARM has much lower latency of data transfer between caches, and the latency is unaffected by the distance between the cores.
2.4. Intel TSX

TSX provides four new instructions: XBEGIN, XEND, XTEST and XABORT. Code successfully executed between XBEGIN and XEND instructions will appear to have completed atomically, and is thus called a transactional region. If there are any memory conflicts during the execution of the transactional region, the transaction will abort and jump to the instruction specified by the XBEGIN. A program can explicitly abort a transaction by issuing an XABORT instruction. XTEST returns whether the program is currently executing within a transactional region.

TSX takes advantage of existing cache coherency protocols to identify the sets of cache lines written to and read by different cores on the CPU. This has two important consequences: memory conflicts are captured at a cache-line granularity, and transactions are constrained by the size of the L1 and L2 caches. The mutated state must fit inside the L1 cache, and the accessed state must fit inside the L2 cache [Hasenplaugh et al., 2015]. The consequence is that it is probably not feasible to wrap a complete monolithic kernel into an RTM transaction, as it is unlikely to fit within the L1 and L2 caches.

Owing to the implementation of TSX, the RTM lock logically protects a dynamic set of individual L1 and L2 cache lines, and as such is a fairly extreme case of fine-grained locking, which should result in much reduced contention (assuming a sane layout of kernel data structures).

Note that an RTM transaction is not guaranteed to complete, even when the transaction is small enough and has no memory conflicts. A variety of (hardware-implementation-specific and frequently unspecified) scenarios can result in an abort. Of particular interest to our work are certain interactions on specific registers that trigger aborts, but are clearly unavoidable when executing OS code.

Given that transactions have no guarantee of progress, the developer must ensure that there exists a fallback method of synchronisation that ensures progress in the presence of repeated aborts. We use the commonly implemented technique of falling back on a regular lock for the code fragment in the case of repeated aborts. To avoid races between transactions and locks, our transactions test the lock upon entry to an RTM section, to ensure the lock is free and to force it into the read set of the transaction. A change in lock state by a competing thread will trigger the desired abort, and allow the section to synchronise via the lock.
3. Microkernel Implementation
3.1. seL4 Overview

As we use seL4 as our microkernel testbed, we will now summarise its relevant features; Klein et al. [2014] present more details. seL4 is event-based, with a single kernel stack. To aid verification, seL4 uses a two-phase system call structure, where the first phase confirms the preconditions required for system call execution, and the second phase executes the system call without failure. Blocking operations are handled by re-starting the system call and thus re-confirming the preconditions prior to continuing execution.

The kernel executes with interrupts disabled. This concurrency-free design has traditionally been used in L4 kernels in order to achieve high best-case performance, and has been used in other systems as well [Ford et al., 1999]. With formal verification it becomes a necessity, as it is for now infeasible to verify concurrent kernel code.

The kernel features some long-running operations resulting from the destruction of kernel objects that may have derived objects. In order to achieve usable interrupt latencies, it has explicit preemption points, where the kernel polls for pending interrupts and restarts the operation if there are any [Blackham et al., 2012]. The restart allows interrupts to be triggered from outside the kernel, prior to continuing the original operation.

seL4 supports the traditional L4-style synchronous (rendezvous) message-passing IPC with a payload of up to a few hundred bytes. IPC operates via port-like objects called endpoints. In addition, the kernel provides notifications with semantics similar to binary semaphores.
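The two-phase structure and the preemption points can be sketched as follows. This is a schematic illustration only, with hypothetical names and stub types, not the actual seL4 source; it shows why a partially executed operation never leaves the kernel in an inconsistent state and can simply be restarted.

#include <stdbool.h>
#include <stddef.h>

typedef enum { EXCEPTION_NONE, EXCEPTION_SYSCALL_ERROR, EXCEPTION_PREEMPTED } exception_t;
typedef struct { int label; } syscall_t;                   /* placeholder message */
typedef struct object { struct object *derived; } object_t; /* chain of derived objects */

/* Stubs standing in for real kernel functionality (assumptions for the sketch). */
static exception_t decode_and_check(syscall_t sc) { (void)sc; return EXCEPTION_NONE; }
static exception_t perform(syscall_t sc)          { (void)sc; return EXCEPTION_NONE; }
static bool irq_pending(void)                     { return false; }

exception_t handle_syscall(syscall_t sc)
{
    /* Phase 1: confirm all preconditions; no kernel state is mutated,
     * so aborting or restarting here is always safe. */
    if (decode_and_check(sc) != EXCEPTION_NONE)
        return EXCEPTION_SYSCALL_ERROR;

    /* Phase 2: execute; by construction this cannot fail. */
    return perform(sc);
}

/* Long-running deletion of derived objects with an explicit preemption point. */
exception_t delete_derived(object_t *obj)
{
    while (obj->derived != NULL) {
        obj->derived = obj->derived->derived;   /* delete one derived object */
        if (irq_pending())
            return EXCEPTION_PREEMPTED;         /* restart the syscall after the IRQ */
    }
    return EXCEPTION_NONE;
}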
3.2. The BKL Kernel

The BKL is the natural, minimal extension of the existing seL4 design to multicores, as it is easy to implement and mostly preserves the in-kernel assumption of no concurrency. The kernel entry and exit code, which saves and restores the user state to a per-core kernel stack and sets up safe kernel execution, remains outside of the BKL, while the rest of the kernel is protected by the BKL.

This design is not entirely sufficient – the following invariant, used in the verification, no longer holds on a multicore kernel, even when the BKL is held:

    Except for the currently executing thread's TCB and page table, all other TCBs and page tables are quiescent, and can be mutated or deleted.

User-level code executing on other cores implicitly depends on the running thread's TCB and page table to transition to kernel mode via the kernel entry code to compete for the BKL. The invariant therefore no longer holds. We address this by modifying the kernel to ensure remote cores are not dependent on any TCB or page table undergoing deletion.

We modify our prototype to keep a bitmap of cores that have seen a specific page table in the page table itself, and IPI only those cores, to trigger the remote core to enter the kernel idle loop (which has a permanently allocated TCB and page table), and also to shoot down the TLB. A TCB can be identified as active via the CPU affinity in the TCB itself combined with the per-core current-thread pointer of the remote core, in which case the TCB is handled in a similar manner to the page table. This design, which is partially driven by the existing event-driven code base, is a valid design choice thanks to the short duration of most system calls in the microkernel; it would result in poor scalability on any other kind of system.

The only other required change is introducing per-core idle threads. However, in order to minimise inter-core cache-line migrations, we also introduce per-core scheduler queues in addition to the current-thread pointers, even though access is serialised by the BKL. This partitioned scheduling implies that threads can only migrate between cores if explicitly requested by the user, which is consistent with seL4's general philosophy of having all resource management under user control (and also helps reasoning about real-time properties).

To reduce contention (and enable the use of transactional memory, see Section 2.4) we further minimise the amount of locked code by moving context-switch-related hardware operations after the BKL release, which has the benefit of reducing the critical section length. We use a CLH lock, a scalable queue lock [Craig, 1993], to synchronise the BKL kernel variant.
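For reference, a CLH lock can be sketched as follows using C11 atomics. This is a minimal illustration with hypothetical names, not the seL4 implementation; the lock's tail is assumed to be initialised to a dummy node whose flag is false, and each core owns one node which it recycles on release.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct clh_node {
    _Atomic bool locked;
} clh_node_t;

typedef struct {
    _Atomic(clh_node_t *) tail;   /* last node in the queue of waiters */
} clh_lock_t;

typedef struct {
    clh_node_t *mine;             /* node this core will enqueue */
    clh_node_t *prev;             /* predecessor's node, spun on while waiting */
} clh_ctx_t;

static void clh_acquire(clh_lock_t *lock, clh_ctx_t *ctx)
{
    atomic_store(&ctx->mine->locked, true);
    /* Append our node atomically; the previous tail is the node we wait on. */
    ctx->prev = atomic_exchange(&lock->tail, ctx->mine);
    while (atomic_load(&ctx->prev->locked))
        ;                         /* each waiter spins on a different cache line */
}

static void clh_release(clh_ctx_t *ctx)
{
    atomic_store(&ctx->mine->locked, false);
    ctx->mine = ctx->prev;        /* recycle the predecessor's node next time */
}

Because every waiter spins on its predecessor's flag rather than on a single shared location, a release invalidates exactly one remote cache line. This is what keeps the lock's service rate largely independent of queue depth, the property assumed by the model in Section 4.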
3.3. Fine-Grained Locking

To compare the coarse-grained BKL with more complex but more scalable fine-grained locking, we first replace the BKL with a big reader lock [Corbet]. The lock allows all reader cores to proceed in parallel, as they access only local state to obtain a read lock.

In our present prototype we use a single write lock around the non-IPC-related kernel code paths. These code paths, generally dealing with resource management, are infrequently executed compared to IPC and interrupt handling, and as such not performance critical. This allows us to avoid significant code changes without affecting overall performance.

This design allows us to gradually migrate the kernel code out of the writer lock into the reader lock. As long as deallocation of kernel objects remains inside the writer lock, memory safety is retained while holding the reader lock. Moving code into the reader lock exposes the contents of the objects to concurrency for improved scalability, which can then be protected using individual fine-grained locks.

IPC mutates the state of TCBs, endpoints, and (potentially) the scheduler queues (depending on whether optimisations apply that avoid queue updates during IPC [Heiser and Elphinstone, 2016]). We add ticket locks to each of these data structures for synchronising IPC within the reader lock. A typical IPC now involves the kernel reader lock, two TCB locks, and one endpoint lock. Lock contention during IPC is limited to cases where IPC involves a shared destination or endpoint, or general contention with the kernel writer lock. Independent activities performing IPC on independent cores result in no lock contention. We avoid deadlocks by identifying the affected TCBs prior to locking (made possible by the memory safety provided by the reader lock), and then locking them in order of their memory addresses.
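As an illustration of the per-object synchronisation described above, the following sketch shows a ticket lock and deadlock-free acquisition of two TCB locks in address order. The names and the tcb_t layout are assumptions for the example, not the seL4 code.

#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t next;    /* next ticket to hand out */
    _Atomic uint32_t owner;   /* ticket currently allowed to enter */
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *l)
{
    uint32_t me = atomic_fetch_add(&l->next, 1);
    while (atomic_load(&l->owner) != me)
        ;                     /* wait for our turn; FIFO by construction */
}

static void ticket_release(ticket_lock_t *l)
{
    atomic_fetch_add(&l->owner, 1);
}

typedef struct tcb {
    ticket_lock_t lock;
    /* ... remainder of the TCB ... */
} tcb_t;

/* Lock two TCBs without deadlock by always taking them in address order. */
static void tcb_lock_pair(tcb_t *a, tcb_t *b)
{
    if ((uintptr_t)a > (uintptr_t)b) { tcb_t *t = a; a = b; b = t; }
    ticket_acquire(&a->lock);
    if (b != a)
        ticket_acquire(&b->lock);
}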
3.4. Lock Elision with RTM

The TSX extensions, combined with the small size of the kernel, allow us to optimistically execute the majority of the code without concurrency control. This is analogous to taking the BKL kernel variant and speculatively eliding the lock [Rajwar and Goodman, 2001]. The event-based design of the kernel is an important enabler for lock elision, as it avoids blocking. We bracket almost the entire kernel with the transaction primitives shown in Figure 1. The somewhat simplified code is self-explanatory, except the 'L' argument to _xabort(), which is returned as the status at _xbegin() to distinguish between abort types.

beginTransaction() {
    while ((status = _xbegin()) != _XBEGIN_STARTED) {
        txnAttempts++;
        if (txnAttempts >= RTM_ATTEMPTS_THRESHOLD) {
            break; /* give up */
        }
        /* wait for lock to be freed before retrying txn */
        while (LockTest());
    }
    if (status == _XBEGIN_STARTED) {
        if (LockTest()) { /* not free */
            _xabort('L');
        }
    } else {
        lockAcquire(); /* BKL fall-back */
    }
}

endTransaction() {
    if ((txnInside = _xtest())) {
        _xend();
    } else {
        lockRelease();
    }
}

Figure 1: Kernel transaction pseudo code.

In addition to the changes described in Section 3.2, we need to move any TSX-specific abort-triggering CPU operations after the transaction. Many of those do not occur in seL4, as most aborting operations are typical for device drivers, which are user-level programs in seL4. The remaining problematic operations are:

• context-switch-triggered page-table register (CR3) loading and segment-register loading;
• IPI triggering for inter-core notifications;
• interrupt management for user-level device drivers, which consists of masking and acknowledging interrupts prior to the return to the user-level handler.

The key insight here is that it is safe to move these operations outside of the transaction, because the two-phase kernel ensures that the system call which requires these operations is guaranteed to succeed once the execution phase is entered, and because these operations are local to a core and thus are not exposed to concurrent access from other cores. Note that preemptions during this code section are prevented since the kernel runs with interrupts disabled.
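Schematically, the kernel exit path of the RTM variant defers these abort-triggering operations until after the transaction (or the fall-back BKL) has been released. The following sketch is an illustration under assumed names and per-core state, not the actual seL4 exit path.

#include <stdint.h>

/* Hypothetical per-core deferred-work state and helpers (illustrative only). */
extern uint64_t pending_cr3, current_cr3;
extern uint32_t pending_ipi_mask, pending_irq;
void endTransaction(void);
void load_cr3(uint64_t cr3);
void send_ipi(uint32_t mask);
void mask_and_ack_irq(uint32_t irq);
void restore_user_context(void);

void kernel_exit(void)
{
    endTransaction();                  /* commit the txn, or release the fall-back BKL */

    /* Deferred, core-local hardware operations that would abort an RTM
     * transaction; the two-phase syscall guarantees they cannot fail now. */
    if (pending_cr3 != current_cr3)
        load_cr3(pending_cr3);         /* address-space switch */
    if (pending_ipi_mask)
        send_ipi(pending_ipi_mask);    /* cross-core notifications */
    if (pending_irq)
        mask_and_ack_irq(pending_irq); /* hand the IRQ to its user-level driver */

    restore_user_context();            /* return to user mode; interrupts were
                                          disabled throughout, so no preemption */
}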
4. BKL Multicore Scalability
In this section we use queueing theory to model the scalability of a BKL microkernel. The theoretical model provides us with a method to estimate best-case performance for a workload parametrised by the rate at which the lock can be serviced (µ), the arrival rate of lock requests (λ, i.e. the system call rate), and the number (n) of cores in the machine. An estimate of best-case performance provides a theoretical reference point to target.

4.1. The Machine Repairman Model

We employ the machine repairman queueing model. The model is historically based on machine failures in a factory (characterised by a failure rate) combined with waiting for a single repairman (the service rate). In our case the model corresponds to the arrival rate of lock requests combined with the service rate of the lock itself.

The model for an n-core multiprocessor has n + 1 states, representing the number of cores queued on (or holding) the lock, as shown in Figure 2.

Figure 2: The machine repairman model (states 0 to n with arrival rates λ_k and service rates µ_k).
The model assumes the rate, µ, of servicing the lock to be independent of the number of cores queued, i.e. $\mu_k = 1/s$, where $s$ is the average service time of the lock. It further assumes that the rate of arrivals, λ, is proportional to the number of cores not already waiting for the lock, i.e. $\lambda_k = (n-k)/a$, where $a$ is the average inter-arrival time for a single core in the absence of contention.

In a steady state, the rates of lock acquisitions and lock releases must be balanced. If $P_i$ is the probability of being in state $i$, this means $P_k \lambda_k = P_{k+1} \mu_{k+1}$. From this we can derive the probabilities as

    P_k = P_0 \frac{s^k n!}{a^k (n-k)!}                                    (1)

The system must always be in one of these states, $\sum_{i=0}^{n} P_i = 1$, from which we can obtain

    P_0 = \left( \sum_{i=0}^{n} \frac{s^i n!}{a^i (n-i)!} \right)^{-1}     (2)

and ultimately

    P_k = \frac{s^k / \bigl(a^k (n-k)!\bigr)}{\sum_{i=0}^{n} s^i / \bigl(a^i (n-i)!\bigr)}   (3)

From this we can compute the expected queue length as

    w = \sum_{i=0}^{n} i P_i                                               (4)

and lock throughput is

    \mu (1 - P_0) = \frac{1 - P_0}{s}.                                     (5)

4.2. Model Assumptions and Kernel Design

The queueing model assumes that the average rate of serving the lock is independent of the queue depth. This is not true for non-scalable locks or in the case of mutating shared kernel state [Boyd-Wickizer et al., 2012]. We satisfy these assumptions by avoiding shared mutable state for unrelated kernel system calls through per-core data structures, and by using the scalable CLH lock.

In addition, peak throughput is inversely proportional to the lock service time. Hence, moving as much code out of the lock as possible, in particular the expensive local hardware operations (such as triggering of IPIs, and page-table register updates), will improve scalability.
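For concreteness, the following self-contained C program evaluates equations (1)–(5) numerically. The parameter values in main() are examples only (they correspond to the 358-cycle service time and 2,000-cycle delay discussed in Section 5.1); the program is an illustration of the model, not additional measurement data.

#include <stdio.h>

/* Machine-repairman model of Section 4: service time s and per-core
 * inter-arrival time a (both in cycles), n cores. Prints P0, the expected
 * queue length w (eq. 4) and lock throughput (1 - P0)/s (eq. 5). */
static void repairman(double s, double a, int n)
{
    double p[128];                 /* unnormalised P_k; assumes n < 128 */
    double sum = 1.0;

    /* Eq. (1): P_k proportional to s^k n! / (a^k (n-k)!), built via the
     * ratio P_k / P_{k-1} = (n - k + 1) * s / a. */
    p[0] = 1.0;
    for (int k = 1; k <= n; k++) {
        p[k] = p[k - 1] * (double)(n - k + 1) * s / a;
        sum += p[k];
    }

    double P0 = p[0] / sum;        /* eqs. (2)/(3) after normalisation */
    double w = 0.0;
    for (int k = 0; k <= n; k++)
        w += k * p[k] / sum;       /* eq. (4) */

    printf("n=%2d  P0=%5.3f  queue=%6.2f  throughput=%.6f locks/cycle\n",
           n, P0, w, (1.0 - P0) / s);
}

int main(void)
{
    for (int n = 1; n <= 28; n++)
        repairman(358.0, 2000.0, n);   /* example parameters, see Section 5.1 */
    return 0;
}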
5. Evaluation
To evaluate our multicore microkernel variants, we use two IPC microbenchmarks and a server-style macrobenchmark, as described in the following sections. The platforms under test have already been described in Sections 2.2 and 2.3.
5.1. IPC Microbenchmarks

IPC performance is a key contributor to overall system performance in microkernel-based systems, and optimising IPC performance has a long history in the L4 community [Heiser and Elphinstone, 2016]. The traditional benchmark for best-case IPC performance is "ping-pong": a pair of threads on a single core does nothing other than sending messages to each other. This allows us to assess the basic cost of our lock implementations, i.e. the pure acquisition and release cost, without any contention.
Figure 3: Raw one-way IPC cycle cost for different seL4 locking mechanisms. Error bars indicate standard deviations. (a) x86: none 424, CLH 436, fine-grained 508, RTM 496 cycles. (b) ARM: none 316, CLH 390, fine-grained 548 cycles.
The figure shows that on x86, the overhead of a single CLH lock (BKL) is approximately 3% compared to a baseline uniprocessor kernel with no concurrency primitives ("none"). With fine-grained locking, however, the overhead is 20%. The overhead of uncontended transactions is 17%.

On the ARM, the cost of a single lock is significantly higher at 23%, while for fine-grained locking it is over 70%. The higher synchronisation costs on the ARM processor relate to its partial-store-order memory model. It requires memory barriers (dmb instructions) to preserve memory-access ordering. In our experience, the barriers cost from 6 cycles up to 19 cycles, depending on micro-architectural state. Our implementation of CLH executes 6 barriers on this benchmark, while 16 are needed with fine-grained locking. These barriers explain most of the overhead.

The significant cost of fine-grained locking provides a motivation for sticking with the BKL as long as possible, even if verification tractability were not an issue.
To explore scalability and experimentally validate the queueing model, we extend the single-core ping-pong to multiple cores. Specifically, we run a copy of ping-pong on each hardware thread, with all hardware threads executing completely independently and unsynchronised. We use the BKL kernel on the x86 platform.

We add an exponentially distributed random delay between receiving and replying to IPC, for each ping-pong pair. The delay varies between an average of 500 and 32,000 cycles, in powers of 2, to create seven individual microbenchmarks that simulate workloads ranging from extremely system-call-intensive to relatively compute-bound.

The benchmark is embarrassingly parallel, ensuring that limits of scalability are related to our kernel design and implementation, and not the benchmark itself. The number of hardware threads (cores in this case) varies between 1 and 28. This benchmark produces extreme contention on the kernel (for low delay values). However, none of the kernel data structures are contended, as each hardware thread's pair of software threads accesses disjoint kernel objects (TCBs and IPC endpoints) during their syscalls. Hence, while we expect contention on the BKL, fine-grained locking and RTM can be expected to scale perfectly.

For each delay and core-count parameter pair, the benchmark consists of a two-second warm-up, followed by sampling total IPCs during a one-second interval to give total IPC throughput per second.

Figure 4(a) plots the resulting overall IPC throughput for varying numbers of hardware threads. Each point represents one measurement for a particular delay time (identified by symbol and colour). The vertical dotted lines show where the cores are split across the two CPU sockets. The results have negligible variance.

For the runs with an average delay of 2,000 cycles we perform a least-squares regression of the queueing model using only the points for the first 14 hardware threads (i.e. within one socket). The regression yields a service time of 358 cycles and an average delay of 1,999 cycles, with R² = 0.99, meaning that the intra-socket results are explained by the model if the service time is 358 cycles.

We use this service time to predict throughput for all other values of the delay parameter, resulting in the solid lines in the graph.
Figure 4: Total IPC throughput for varying parallelism and delay times (cycles). Points are measurements, lines are queueing-model fits. (a) Scalability measurements and fits for a 358-cycle service time, delays 500 to 32,000 cycles. (b) Scalability bound prediction for service times of 323 cycles (upper lines) and 613 cycles (lower lines), delays 500 and 4,000 cycles.

We can see that in all cases, these fit the observed throughput values very well for at least 14 threads (i.e. the model explains the intra-socket behaviour well). The model breaks down once cores of the second socket are involved, except for the highest delay times. This is unsurprising, as with multiple sockets, the assumption of a fixed service time no longer holds, as transfer times for the cache line holding the lock now depend strongly on locality. We confirm this by instrumenting the lock: the average holding time is 164 cycles for a single core. For four or more cores, the observed holding times vary between 323 and 613 cycles.

In Figure 4(b) we repeat the experimental results of Figure 4(a) for two delay values, 500 and 4,000 cycles. We also show the model prediction for 323 and 613 cycles, the minimum and maximum holding times we observed for thread counts of four or more. We can see that the results are upper-bounded by the 323-cycle curve and, as long as only one socket is used, remain close to the bound. Once the second socket is used, the lines quickly approach the lower line, corresponding to the higher service time, which remains a lower bound.
The queueing model accurately predicts experimental results where the average lock holding time is stable. Where the lock holding time is variable, it can be used to predict a performance range. The model enables prediction of where the knee of the performance curve occurs for a given lock holding time and average delay between system calls, assuming the absence of other application-related limiting factors.

The lock-holding time range for the microkernel varies from an average of 164 cycles on a single core, to approximately 300 cycles within a socket, to 600 cycles distributed across sockets. (The average holding time of the lock is strongly influenced by the average transfer time of the cache line. Given that the transfer cost is zero for re-acquisition of the lock by the same core, low core counts have a higher probability of re-acquisition, and thus unrealistically low average holding times for the general case.) Thus the lock holding time is dominated by the architectural cost of cache-line transfer for the lock. This is indirect confirmation that our microkernel is indeed scalable in the sense of not sharing any mutable state across cores except for the lock itself. It also implies that any improvement in reducing lock holding time on a single core will have only a modest effect on overall scalability, due to the high architectural costs on the Xeon.

The model and experiments show that a workload running on the microkernel with an average delay between system calls on each core of 4,000 cycles would scale to 14 cores, i.e. a single CPU socket. An average delay time of 16,000 cycles is needed for a workload to scale across both sockets. The results support our hypothesis that a big lock will scale as long as cores are closely coupled.

These results are readily applicable in general to conservative locks protecting potentially contended data that rarely contends in practice. We also note, for the following section, that these results are readily applicable to the abort path in the RTM variant of the microkernel.
To compare the different lock variants, we run on x86 with a 4,000-cycle average delay between system calls, as this is just above the scalability limit of the BKL across a whole socket.

Figure 5(a) shows that RTM behaves identically to fine-grained locking. This is expected: as explained in Section 2.4, RTM is logically an extreme case of fine-grained locking, and the baseline lock overhead is the same as for the fine-grained locks according to Figure 3.

The BKL variant serialises the IPC path across all the available hardware threads, and thus hits the knee in the performance curve as predicted by the queueing model. As also expected, it converges on a lower performance plateau as the benchmark spans the sockets.

On ARM (Figure 5(b)), where inter-core cache-line migration costs are very low (in terms of cycles), we chose an aggressive 500-cycle average delay. In addition, we run the pathological zero-delay case, where the system call rate is only limited by the user-level stubs and the cost of the system call itself.
Figure 5: Total IPC throughput for varying parallelism and different locking implementations. (a) x86, average delay 4,000 cycles (BKL, fine-grained, RTM). (b) ARM, average delay 0 and 500 cycles (BKL and fine-grained); lines are for clarity (no fit).

We see that the higher overhead of fine-grained locking is readily visible, with the BKL variant outperforming the fine-grained variant for the fairly extreme case of the 500-cycle average delay time, with perfect scalability to all four cores. It takes the unrealistic minimal delay for the BKL variant to plateau at 3 cores, allowing the fine-grained variant to exceed BKL throughput. We can expect the BKL to scale significantly beyond the size of our quad-core machine for realistic workloads.
5.2. Macrobenchmark: Redis/YCSB

In order to assess BKL scalability, and the significance of the overheads of the fine-grained schemes, we look for a "realistic worst-case" scenario, i.e. a benchmark which produces as high a system-call rate as can be expected under realistic conditions.

None of the usual embedded-system benchmarks produce significant syscall loads on the microkernel; we therefore use a server-style benchmark. Note that the nature of the benchmark is completely irrelevant for this exercise; all that counts is the rate and distribution of kernel entries. The relevant operations are IPC and interrupt handling, as all other microkernel operations deal with resource management, which is relatively infrequent.

The seL4 equivalent of a syscall in a monolithic system is sending an IPC message to a server process and waiting for a reply (i.e. two microkernel IPCs per monolithic-OS syscall). Similarly, an interrupt, which in a monolithic OS results in a single kernel entry, produces two for the microkernel-based system, as the interrupt is converted by the kernel into a notification to the driver (one kernel entry), and the driver acknowledges to the kernel with another syscall.
In order to hammer our kernel, we use a simple client-server scenario built around the Redis key-value store [Redis]. We consolidate the clients and servers on the same machine due to insufficient network bandwidth to saturate the large number of cores. Redis receives client requests from a virtual network processor on Core 0. Each client and server has its own private copy of the lwIP TCP/IP stack [lwIP] running as a usermode process.
Figure 6: Redis-based benchmark architecture. Core 0 runs the virtual network processor; each of cores 1 to n runs one Redis server and two YCSB client instances, each with its own lwIP stack.
Figure 6 shows the system under test. With the exception of Core 0, each core has a Redis server and two copies of the Redis benchmarking client. We run Redis as volatile instances, i.e. with file-system access disabled, in order to maximise throughput and therefore the rate of kernel entries.

We evaluate the performance using a modified version of the Yahoo! Cloud Serving Benchmark (YCSB) [Cooper et al., 2010] as the benchmarking client. All client instances start simultaneously and are tuned to perform a fixed number of operations that results in at least 2 minutes of run time. The benchmarking client runs the read-only workload with zipfian distribution as presented in Cooper et al. [2010]. For each kernel variant, we instrument the kernel to record idle time within the idle loop to obtain the CPU utilisation for each run.
Figure 7 shows the results of the Redis benchmark on the x86 platform. Not surprisingly, given the extreme workload, scalability is limited.

The transactions-based kernel performs consistently best. However, the BKL keeps up until 5 cores, after which throughput starts to drop. Fine-grained locking consistently performs at about 60-75% of the transactions kernel. (These are single runs, so no standard deviations are available.)

Figure 7: Redis-based benchmark throughput for the BKL, fine-grained and RTM kernel variants as a function of core count.
We measure the average delay time on each core by dividing user time by the number of kernel entries. We find that the average time is 800 cycles on core 0 (the virtual network), and around 1,600 cycles on the client/server cores (figures taken from the five-core case). This confirms that the benchmark is fairly extreme in terms of system-call load.

We can relate the observed delay times back to Figure 4(a). With 800 cycles delay on one core and 1,600 on the others, we expect a behaviour similar to the 1,000-cycle delay curve of Figure 4(a). And the similarity is indeed there: the 1,000-cycle curve peaks at 7 cores and then drops off slowly. The BKL curve of Figure 7 peaks slightly earlier but overall looks similar.

We re-iterate that this is an extreme benchmark, and Figure 4(a) tells us that a slightly less extreme version, with about four times the average delay, should scale to a full socket.
6. Related Work
Writing parallel and scalable code is a topic almost as old as computing itself. Cantrill and Bonwick [2008] provide some historical context and motivation for concurrent software, together with words of wisdom for tackling the difficulties of writing high-performance and correct concurrent software. We adhere to their advice by avoiding parallelising complex software (i.e. splitting the BKL), as our data shows it is unwarranted for closely-coupled cores.

Recent complementary work evaluates the scalability of various synchronisation primitives [David et al., 2013] on many-core processors. The authors reinforce that scalability is a function of the hardware, with scalability best when access is restricted to a single socket with uniform memory access – exactly our area of interest.

Boyd-Wickizer et al. [2012] use queueing theory to predict the collapse of ticket locks in Linux. We use a different model to predict the performance of a microkernel synchronised with a scalable lock, not just the lock itself. We also use it to validate the scalability of the implementation of the microkernel itself.

Hardware transactional memory is utilised in TxLinux [Rossbach et al., 2007] to implement cxspinlocks, a combination of co-operative spinlocks and transactions capable of supporting device I/O and nesting. A small microkernel needs neither, as I/O is at user level, and it can be designed to avoid complex, nested, fine-grained locks.

Hofmann et al. [2009] apply HTM to coarse-grained locks in Linux 2.4 on a simulated HTM system. Our goals are similar, however our experiments are on real hardware (Intel TSX), on a microkernel with a single lock. Our event-based kernel avoids the need for a cxmutex to handle waking waiting threads on exiting a transaction.

Patches for Linux to utilise Intel TSX have been made available [Kleen]. To our knowledge, no performance data was released. Eliding existing fine-grained locks does nothing to reduce kernel complexity. We elide the whole microkernel, providing favourable performance while retaining simplicity.
7. Conclusions
We have analysed the scalability of a microkernel with fast system calls across closely-coupled processor cores. We find that for such a system, the overhead of locking is significant, ranging from a few percent for a single lock on x86 to 23% on an ARM processor. This makes the best approach to concurrency control non-obvious, especially when keeping in mind that it makes no sense to scale such a kernel to a large multicore, where inter-core cache-line migration latencies exceed the basic syscall cost.

We analysed three different multicore implementations of seL4: one using a big kernel lock, one using fine-grained locking, and a third using hardware-supported transactions. We evaluated the implementations on a server-class x86 platform as well as an ARM multicore aimed at the embedded market. We support the experiments with a queueing-theoretical model.

There are three main take-aways from our evaluation. One is that the inter-core cache-line migration cost matters a lot. This is demonstrated by the ARM results, where the BKL scales perfectly to 4 cores even with the unrealistically high 500-cycle average inter-syscall time. If architects can maintain similarly tight coupling with higher core counts, our modelling shows that the BKL can be expected to scale further. In contrast, the x86 platform shows that the perfect scaling regime does not extend past two cores even with about double the inter-syscall time, and plateaus at four cores (but keep in mind that this is still an extremely, if not unrealistically, high load).

The second take-away is that lock overhead is significant on a well-designed microkernel with very short syscall latencies. This is particularly obvious on the ARM with its relaxed memory model and resulting high barrier costs, but the effect is also significant on x86, where the kernel using fine-grained locking performs only at about 75% of the BKL version until the latter reaches its performance knee.

The third take-away is that hardware transactions are an exciting development. To our knowledge, we are the first to implement lock elision for a BKL kernel using Intel's RTM. We show in microbenchmarks that a theoretically embarrassingly parallel application scales perfectly with little overhead and no serialisation. In our realistic (but extreme) macrobenchmark, which is less parallel, RTM upper-bounds the performance of both the BKL and fine-grained locking. In the case of the microkernel, where the whole system call can be packed into a single transaction, RTM gives the best of both worlds.

We can summarise our experience as: transactions are the way to go, if they are available. Failing that, the big lock is actually a good choice for a fast microkernel, as it is only outperformed by fine-grained locking under extreme circumstances, at least on the kind of closely-coupled system where a single, shared kernel instance makes sense. Under those circumstances, the reduced (compared to fine-grained locking) implementation complexity is a strong asset, as it enables formal verification, which is presently infeasible for systems using fine-grained locking. The significantly better performance under less extreme workloads is an added benefit.
References
Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The multikernel: A new OS architecture for scalable multicore systems. In SOSP, Big Sky, MT, US, Oct 2009.

Bernard Blackham, Yao Shi, and Gernot Heiser. Improving interrupt response time in a verifiable protected microkernel. In EuroSys, pages 323–336, Bern, Switzerland, Apr 2012.

Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. Non-scalable locks are dangerous. In Linux Symposium, Ottawa, CA, Jul 2012.

Bryan Cantrill and Jeff Bonwick. Real-world concurrency. ACM Queue, 6(5), Sep 2008.

Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Morris, and Eddie Kohler. The scalable commutativity rule: Designing scalable software for multicore processors. In SOSP, pages 1–17, Farmington, PA, US, Oct 2013.

Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In SoCC, Indianapolis, IN, US, Jun 2010.

J. Corbet. Big reader locks. http://lwn.net/Articles/378911/.

Travis S. Craig. Building FIFO and priority-queuing spin locks from atomic swap. Technical Report UW-CSE-93-02-02, Department of Computer Science and Engineering, University of Washington, 1993.

Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In SOSP, pages 33–48, Farmington, PA, US, Nov 2013.

Bryan Ford, Mike Hibler, Jay Lepreau, Roland McGrath, and Patrick Tullmann. Interface and execution models in the Fluke kernel. In OSDI, pages 101–115, New Orleans, LA, US, Feb 1999. USENIX.

Freescale. i.MX 6Dual/6Quad Applications Processor Reference Manual, rev. 1 edition, Apr 2013.

William Hasenplaugh, Andrew Nguyen, and Nir Shavit. Quantifying the capacity limitations of hardware transactional memory. Jul 2015.

Gernot Heiser and Kevin Elphinstone. L4 microkernels: The lessons from 20 years of research and deployment. Trans. Comp. Syst., 34(1):1:1–1:29, Apr 2016.

Gernot Heiser and Ben Leslie. The OKL4 microvisor: Convergence point of microkernels and hypervisors. In APSys, pages 19–24, New Delhi, India, Aug 2010.

Owen S. Hofmann, Christopher J. Rossbach, and Emmett Witchel. Maximum benefit from a minimal HTM. In ASPLOS, pages 145–156, 2009.

Andi Kleen. RFC: Kernel lock elision for TSX. https://lkml.org/lkml/2013/3/22/630.

Gerwin Klein, June Andronick, Kevin Elphinstone, Toby Murray, Thomas Sewell, Rafal Kolanski, and Gernot Heiser. Comprehensive formal verification of an OS microkernel. Trans. Comp. Syst., 32(1):2:1–2:70, Feb 2014.

Greg Lehey. Improving the FreeBSD SMP implementation. In USENIX Annual Technical Conference, Boston, MA, US, Jun 2001.

Jochen Liedtke. On µ-kernel construction. In SOSP, pages 237–250, Copper Mountain, CO, USA, Dec 1995.

lwIP. Download site.

Jonathan M. McCune, Bryan Parno, Adrian Perrig, Michael K. Reiter, and Hiroshi Isozaki. Flicker: An execution infrastructure for TCB minimization. In EuroSys, Apr 2008.

D. Molka, D. Hackenberg, R. Schöne, and W. E. Nagel. Cache coherence protocol and memory performance of the Intel Haswell-EP architecture. Pages 739–748, Sep 2015.

Sean Peters, Adrian Danis, Kevin Elphinstone, and Gernot Heiser. For a microkernel, a big lock is fine. In APSys, Tokyo, JP, Jul 2015.

R. Rajwar and J. R. Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution. In MICRO, pages 294–305, Dec 2001.

Redis. Download site. http://redis.io.

Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E. Ramadan, Bhandari Aditya, and Emmett Witchel. TxLinux: Using and managing hardware transactional memory in an operating system. In SOSP, Stevenson, WA, US, Oct 2007.

Udo Steinberg and Bernhard Kauer. NOVA: A microhypervisor-based secure virtualization architecture. In EuroSys, pages 209–222, Paris, FR, Apr 2010.

Michael von Tessin. The clustered multikernel: An approach to formal verification of multiprocessor OS kernels. Pages 1–6, Bern, Switzerland, Apr 2012.

Fengzhe Zhang, Jin Chen, Haibo Chen, and Binyu Zang. CloudVisor: Retrofitting protection of virtual machines in multi-tenant cloud with nested virtualization. In SOSP, Cascais, Portugal, Oct 2011.