Minimizing Event-Handling Latencies in Secure Virtual Machines
Janis Danisevskis, Michael Peter, Jan Nordholz
Technische Universität Berlin, Deutsche Telekom Laboratories, Security in Telecommunications
{ janis, peter, jnordholz } @sec.t-labs.tu-berlin.de

Abstract — Virtualization, after having found widespread adoption in the server and desktop arena, is poised to change the architecture of embedded systems as well. The benefits afforded by virtualization — enhanced isolation, manageability, flexibility, and security — could be instrumental for developers of embedded systems as an answer to the rampant increase in complexity. While mature desktop and server solutions exist, they cannot be easily reused on embedded systems because of markedly different requirements. Unfortunately, optimizations aimed at throughput, important for servers, often compromise on aspects like predictable real-time behavior, which are crucial to many embedded systems. In a similar vein, the requirements for small trusted computing bases, lightweight inter-VM communication, and small footprints are often not accommodated. This observation suggests that virtual machines for embedded systems should be constructed from scratch, with particular attention paid to the specific requirements. In this paper, we set out with a virtual machine designed for security-conscious workloads and describe the steps necessary to achieve good event-handling latencies. That evolution is possible because the underlying microkernel is well suited to satisfy real-time requirements. As the guest system we chose Linux with the
PREEMPT_RT configuration, which itself was developed in an effort to bring down event-handling latencies in a general-purpose system. Our results indicate that the increase in event-handling latencies of a guest running in a virtual machine, compared to native execution, does not exceed a factor of two.
Keywords — Virtualization; Operating System; Real-time Computing; Security
I. INTRODUCTION
Historically, innovations first took place in the desktop and server market before they trickled down to embedded systems some years later. Although the gap in capabilities is likely to persist, high-end embedded devices more and more resemble desktop systems in terms of hardware (compute power, memory sizes) and software (general-purpose operating systems, open system architecture). Assuming that this follow-suit trend is to continue, it should not be too long before virtualization gains traction in embedded devices. Before dwelling on real-time characteristics, we would like to briefly recapitulate the arguments for virtualization in embedded devices.
A. The Case for Virtualization in Embedded Systems

Development support.
Tight time-to-market timelines can only be achieved through pervasive software reuse. Despite all efforts, dependencies between software components are often intricate, requiring substantial development effort when changes are needed. Moreover, quite a few applications hinge on particular operating systems, or even on specific versions of them. Virtualization obviates the need to settle for one operating system and instead allows several of them to run in parallel. The easy reuse of tried-and-tested components will cut down on the development effort, thereby reducing costs and shortening the time-to-market.
Platform migration.
In the past, more powerful devices automatically translated into better functionality. The recent shift towards multicore processors makes that benefit increasingly harder to reap, as only multithreaded software benefits from the presence of additional cores. Virtualization opens up the opportunity to eschew the daunting task of porting single-core software stacks, instead reusing them outright. Generally, the introduction of virtualization into the system stack requires only the hypervisor to be ported to new devices; the higher layers need only small adaptations or none at all.
Security.
Traditionally, security from non-physical attacks was not a major concern for embedded systems. Even if a device contained defects, these were difficult to exploit, as the usual attack vector — network connectivity — was unheard of. That has changed profoundly in the recent past. High-end embedded systems more and more resemble desktop systems regarding computing capabilities, usage scenarios, and connectivity. The applications they accommodate share many traits with those found on desktops, security vulnerabilities included. Security-wise, the calm past has given way to a frantic struggle between attackers casting about for exploitable deficiencies and defenders hurrying to close them promptly. While security threats against desktops are severe, they hold the potential to be devastating for embedded systems. Unlike their stationary cousins, embedded systems are often restricted in their means to repulse malware. Installing a new version of an anti-virus solution is just not on the table. Given the gravity of the situation and the inevitability of its occurrence, it is urgent to tighten the defences. Virtualization may provide the isolation needed to contain an attacker, thus offering defense in depth. The feasibility of this approach has been borne out by contemporary gaming consoles, where hypervisor-based security architectures have proven sufficiently capable of denying ultimate control to flocks of highly motivated attackers.

Consolidation.
Today it is not uncommon to physically separate subsystems to ensure non-interference. The downside is that the number of control units grows, which raises concerns as to the bill of materials, the weight of wiring, and reliability in the face of numerous physical connections, which are vulnerable to aging.
License separation.
Platform vendors are often concerned that their system architecture might be revealed to competitors. For that reason, they do not like to publish, say, device drivers, as these may give hints at the underlying hardware. This attitude is at odds with some important open source operating systems, which require source code access for device drivers. The tension can be eased if software with intellectual property restrictions is placed in a VM, where it is not subject to licenses requiring source access.
Partial Validation.
If the interaction between components cannot be characterized precisely, then a tightly integrated system has to be revalidated as a whole whenever any single component is changed. Conversely, as long as component interaction is limited to well-defined interfaces (including temporal aspects), it is sufficient to check whether a component adheres to the interface specification. Changes to a component would then not require a validation of the whole system.
Remote Management.
The growing device complexity makes it ever more likely that some unexpected situation arises after the device is shipped. Therefore it would be desirable to remotely assess the situation and take remedial measures. In the best case, the issue is fixed by a remotely initiated update before the user even notices that something was about to break. For all its virtues, remote management cuts both ways. If not properly secured, a remote management interface is a downright invitation for attackers to take over the device. As such, it must be ensured that only properly authorized requests are honored. There should be a secure anchor in the platform, one that cannot be disabled even if an attacker has already succeeded in penetrating some parts of the system.
B. Challenges for Embedded Virtualization
Embedded devices differ from their desktop and server cousins in that they are often charged with tasks that require interacting with their environment in real time. The transition from special-purpose real-time operating systems to general-purpose operating systems like Linux poses enough problems in itself in that respect. Virtualization requires modifications deep in the software stack. In order not to hurt real-time performance too much, the following three problems have to be taken into account.

First, virtualization support in current processors comes in the shape of additional processor privilege levels. The details differ, though: some instruction set architectures added less privileged modes to their existing privileges (x86 guest mode (AMD) and non-root mode (Intel)), while others went for more privileged modes (PPC hypervisor mode). Not only do incoming events now have to traverse more modes (CPUs normally deliver events to the most privileged mode), the state transitions also take longer due to the larger state involved. With current hardware, these additional expensive transitions in timing-critical paths are inevitable and will lead to increases in event-handling latencies.

Second, the relative cost of individual operations changes, which requires a redesign of timing-critical paths. For example, on a non-virtualized platform,
arming a timer takes a trap into the operating system kernel and a device access. In a virtual machine, a guest is not allowed to access the physical timer directly because of contention from other guests. Attempts to access the device trap into the hypervisor, which takes care of multiplexing the physical timer. Unfortunately, this costly operation lies in critical event-handling paths. An alternative is to let the hypervisor arm the timer speculatively when it injects an event. The guest OS can then resume the event-handler process without the delay incurred by setting the (virtual) timer.

Lastly, the hypervisor has to be aware of the internal operations of a VM. Assume a process in a VM that runs with the highest priority and shall complete execution as fast as possible. Further assume that, during the execution of that process, an IRQ arrives. The hypervisor takes over control and dispatches the event to the virtual machine monitor, which in turn injects it into the VM. Only then is the decision taken that the event does not bear on the scheduling configuration, and the high-priority task is resumed. At that point an appreciable amount of time has elapsed due to the privilege changes and context switches. Had the hypervisor or the VMM had the information that the IRQ does not affect the current scheduling decision in the VM, either one could have abstained from forwarding the event and instead directly resumed the interrupted activity. This short-circuiting, though, is only possible if it is known that a high-priority process is executing. Otherwise, forwarding the event is the correct action. As the architecturally defined VM interface does not convey this kind of information, additional interfaces are needed to bridge the semantic gap that opens up between hypervisor, VMM, and guest VM.

C. Goal and Outline
Our goal was to build a system with the following properties:

1) The trusted computing base — the set of all hardware, firmware, and software components that are critical to the security and safety of a system — shall be small. It is well accepted that a reduction in complexity is instrumental in building reliable systems. We assume that the VMM runs as a user-level task.
2) Given the scarcity of native applications for security-conscious environments, the reuse of legacy software shall be easily possible. As virtualization is by definition compatible with a huge body of software and exhibits good performance characteristics, the system shall support multiple virtual machines.

3) One of these virtual machines shall be able to pick up promptly on incoming events. The hosted guest shall see event-handling latencies that are within a reasonable range of those seen on physical hardware.

The contribution of this paper is twofold. First, we present a software stack that takes into account both real-time and security requirements. We assess the impact of this arrangement on real-time workloads running in a VM. In the second part, we investigate how real-time behavior can be improved if the guest provides hints regarding its real-time operations. Although such hints require guest modifications, we found that these modifications can be kept to a minimum and improve significantly on the real-time performance.

II. BACKGROUND
Requirements regarding security, fault tolerance, and real-time performance can only be met if the lowest layer in the software stack, the operating system kernel, provides proper support. One long-standing quarrel on this issue is whether monolithic kernels can be adequately evolved or whether a shift towards microkernels is the better choice. While a huge part of the OS community recognizes the principal superiority of the microkernel design, many point out that monolithic kernels can catch up and retain their performance advantage.
A. Microkernels
Operating system kernels are a critical component in the software stack. Kernels find themselves in the situation that they have to guarantee safety and security for all applications running on top of them, yet cater to the specific needs of single applications. Applications often have very specific requirements, which are not easily conveyed through the syscall interface. An operating system would like to hand over resource management to applications, as that would
allow arbitrary policies to be applied without adversely affecting non-participating parties.

A microkernel is an operating system kernel that contains only those mechanisms needed to implement a complete operating system on top. Typically these mechanisms include address space management, scheduling, and message passing. Other functionality such as device drivers, protocol stacks, and file systems is provided by user-level servers. The upside of this approach is that resources can be selectively allocated to user-level tasks, which can then manage them at their discretion.

While hailed as the future of operating systems, microkernels have not supplanted monolithic kernels across the board. Although many experimental microkernels demonstrated the potential benefits, none of them saw enough take-up to sustain a self-supporting software ecosystem. Furthermore, systems built on monolithic kernels kept their performance edge and had some of their deficiencies alleviated. For example, the often raised point of lacking flexibility was answered by the introduction of kernel modules. The lack of success on desktops and servers notwithstanding, microkernels made their way into systems that care about security and scheduling.
Scheduling:
When a computation involving updates of multiple variables can be preempted, further synchronization mechanisms have to be employed to ensure consistency. While lockless schemes such as RCU[1] and lock-free synchronization[2] have an edge over classical lock-based synchronization in certain cases, the latter is easier to understand and thus prevalent.

Retrofitting preemptibility into monolithic kernels by adding locks to data structures is tedious because assumptions as to atomic execution are usually deeply entrenched and non-obvious. Deadlock situations may arise if the lock-grabbing sequence does not rule out circular dependencies. If the objects to be protected come into being and perish at a high rate, coming up with a suitable lock ordering is far from trivial. The situation is aggravated if long-running operations shed held locks intermittently and reacquire them before resuming. Such a courtesy may be imperative, though, if operations with tighter timing requirements need prompt access. Furthermore, for most real systems, event-handling latencies are not an overriding priority and thus must not hurt throughput performance too much. As a consequence, the number of locks in often-used code paths has to be limited, which may result in longer lock-protected paths. The problem was encountered by the developers of the
PREEMPT_RT flavor of the Linux kernel. Up to 15% performance degradation[3] was enough to stir opposition against mainline inclusion.

Given the difficulties of directly improving the preemptibility of an existing kernel, it may be expedient to absolve it from scheduling and assign that task to a real-time scheduler. For that purpose, a shim is interposed between the old kernel and the hardware, taking over the primary responsibility for scheduling. The old kernel is - from a scheduling point of view - relegated to a real-time task. As such, it can stop the flow of events from the shim to it. However, it cannot prevent events from arriving at the shim and triggering scheduling decisions there.

The idea has gained some momentum, both in open source and commercial projects. One of the first projects along this architecture was RTLinux[4]. RTAI[5] and its successor, the Xenomai project, followed suit. The downside of this approach is that the shim only takes over scheduling but does not assume control over memory management. Memory management still lies with the host kernel, in these cases Linux. Real-time tasks are loaded into the Linux kernel and become subject to the real-time scheduling after initialization. The common residency in kernel space means that crashes are not contained in address spaces. Xenomai tries to mitigate the issue by optionally placing real-time applications in address spaces. While faults of other real-time tasks become survivable, crashes in the Linux kernel are still fatal.

The next step is not only to confer scheduling on the shim but also to hand over the responsibility for memory management completely. That would strengthen the survivability of the system, as more parts are subject to address space isolation. The downside is that this encapsulation carries the cost of privilege-level transitions in critical paths. Nonetheless, many systems have adopted that architecture, among them QNX Neutrino[6], PikeOS[7], and L4[8].
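The shim's dispatch rule can be illustrated with a minimal sketch (our own illustration, not actual RTLinux, RTAI, or Xenomai code): real-time tasks occupy the high static priorities, and the whole host kernel is represented as the lowest-priority runnable task, receiving CPU time only when no real-time task is ready.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sketch of the shim architecture described above.
 * The shim dispatches strictly by static priority; the host kernel
 * is modeled as just another (lowest-priority) task. */
struct shim_task {
    const char *name;
    int         prio;      /* higher value = higher priority */
    bool        runnable;
};

/* Pick the highest-priority runnable task; returns NULL if none. */
const struct shim_task *shim_pick(const struct shim_task *tasks, size_t n)
{
    const struct shim_task *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (tasks[i].runnable && (!best || tasks[i].prio > best->prio))
            best = &tasks[i];
    return best;
}
```

Under this rule the host kernel can mask its own event delivery, but it cannot keep events from reaching the shim and triggering a dispatch decision there.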
Security:
Monolithic kernels have grown to a complexity where it becomes intractable to reason about all possible interactions in the kernel. Given that each defect may be used to gain control over the kernel, such a situation is a security nightmare. In contrast, small
kernels are more amenable to thorough examination, which instills confidence regarding their correctness. This scrutiny can even be taken as far as having them undergo formal verification[9].

A small kernel alone is not sufficient for a small trusted computing base, though. Rather, it is necessary that privileged user-level servers cannot be tricked into exercising their authority on behalf of a malevolent party. This issue — widely known as the confused deputy problem — can be ascribed to the separation of naming and privilege.

Capabilities facilitate the construction of systems that honour the principle of least authority. Despite their undenied advantages, capabilities have not found their way into mainstream operating systems. The reason might well be that capabilities are not compatible with, and thus cannot be retrofitted into, the APIs of mainstream OSs. A recent effort by Watson et al.[10] shows promise to make deeper inroads but is still in its early stages. Researchers, who do not have to take legacy APIs into consideration, were more successful in creating conceptually clean systems. The first system that featured a small kernel and capability-based access control, EROS[11], was followed by more systems with these traits[12][13].
B. Virtualization
Microkernel-based systems are relatively novel, under active development, and as such still in a state of flux. A steadily changing kernel interface, though, is a difficult target for user-level components. Although native system frameworks have come into existence, they cannot rival existing operating systems in features, maturity, and application availability. To make up for this deficiency, system designers have chosen to reuse whole operating systems, with virtualization being the most promising technique.

A virtual machine is a software environment capable of executing applications together with their operating systems. A virtual machine implements the same instruction-set architecture as real hardware; a guest cannot distinguish whether it runs natively or under virtualization. While this requirement can also be met by performance-sapping emulation, virtual machines are required to exhibit near-native performance. This requirement translates into the huge majority of instructions being executed directly by the host hardware. Complementing equivalence and performance is control. The virtualization layer always retains full control over the system. To that end, the part of the VM implementation in control — the hypervisor — interposes on operations that impinge upon system resources such as memory or I/O devices. Instead of accessing the physical resource, the guest is provided with virtual surrogates. As encapsulation is an explicit virtualization goal, virtual machines naturally lend themselves to serving as a unit of isolation.

The advantages of virtualization come at a cost, though. Virtual machines operate in special processor modes which incur substantial overhead when entered or left. While also being a problem for systems aiming at throughput, these world-switching costs can add substantially to the length of critical paths of real-time applications if these were to run in VMs.

Apart from hardware-induced costs, software-related costs have to be considered, too.
Latencies incurred by executing non-preemptible code paths within the layers implementing the VM will add to those caused by the OS kernel in the VM. For this reason, it is not recommendable to use so-called hosted VMMs, where an existing operating system is extended with hypervisor functionality. Bare-metal hypervisors can be implemented as microkernels and as such have much less code in timing-critical execution paths.

While bringing hardware and (direct) software costs under control is necessary, it is not sufficient. One severe problem with virtualization architectures is that the machine interface does not propagate high-level information, an issue also known as the semantic gap. For example, on bare hardware, there is no need for an operating system to let the hardware know that there is a highly prioritized real-time application runnable. In contrast, under virtualization, a guest disabling interrupts achieves only that no events arrive in its
VM. The system still accepts them and queues them until they can be delivered. While the guest performs correctly, it does not perform as fast as it could, as interrupt arrival entails costly world switches. If the lower layers were aware that real-time tasks are active, they could disable (at least) some events in order to expedite the guest execution. The information that a real-time task is active is not easily derived by the host, thus it has to be signaled by the guest.
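Such guest-to-host signaling can be sketched as follows. The names (`rt_hint_page`, `should_inject`) and the shared-page mechanism are our own illustrative assumptions, not an actual interface: the guest publishes whether a real-time task is running and at which priority, and the host consults this hint before deciding whether an incoming interrupt warrants a world switch.

```c
#include <stdbool.h>

/* Hypothetical page shared between guest and host, carrying the
 * scheduling hint (illustrative sketch only). */
struct rt_hint_page {
    bool rt_task_active;   /* set by the guest scheduler */
    int  rt_task_prio;     /* priority of the running real-time task */
};

/* Host-side check: forward the interrupt only if no real-time task is
 * running, or if the interrupt outranks the running task. Otherwise
 * the event is kept pending and the guest is resumed directly,
 * avoiding the world switch. */
bool should_inject(const struct rt_hint_page *hint, int irq_prio)
{
    if (!hint->rt_task_active)
        return true;                       /* normal case: forward */
    return irq_prio > hint->rt_task_prio;  /* only urgent events preempt */
}
```

The essential point is that the decision is taken below the VM boundary, before any privilege change or context switch into the VMM has happened.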
03 February 2012
III. DESIGN
We will start the description of our design by covering the hypervisor and those features with a bearing on real-time operations in VMs. Thereafter we will elaborate on our virtual machine monitor. The chapter is wrapped up with an explanation of the changes we made to the guest.

Before we set out to detail our design, we will take up an issue that often gives rise to confusion. While often used interchangeably, the terms hypervisor and virtual machine monitor signify two distinctively different components in our architecture. More confusingly, hypervisor is sometimes used synonymously with microkernel.

A hypervisor enforces isolation, that is, it implements protection domains, VMs being one flavor of them. For doing so, it needs to exercise exclusive control over data structures that control isolation, such as page tables and entry vectors. This is only possible if it runs in the most privileged CPU execution mode, as only this mode has full control over critical system resources such as control registers.

A microkernel applies the principle of minimality to the portion of software running in the most privileged execution mode. We understand microkernel and hypervisor as complementary: hypervisor implies a function, whereas microkernel denotes a particular design principle. As such, a microkernel can assume the role of a hypervisor, or conversely, a hypervisor can be implemented as a microkernel.

A virtual machine monitor provides VMs with functionality beyond CPU and memory virtualization, which is the duty of the hypervisor. Typically, the VMM supplies a VM with (virtual) devices and coordinates its execution. Contrary to a hypervisor, which always requires maximal CPU privileges, a VMM can be either part of the kernel or implemented as a user task.

A. Hypervisor
Our architecture builds on
Fiasco.OC [12], a state-of-the-art microkernel of L4 provenance. As is typical for microkernels, it only provides basic functionality such as address space construction, inter-process communication (IPC), and scheduling. (A note on privilege: we only consider accessible execution modes, which leaves out the system management mode, which is arguably even more privileged.) All the remaining aspects such as device drivers and protocol stacks, which are included in monolithic kernels, are relegated to user-level servers where they are subject to address space containment. Under this arrangement, a device driver failure, which is likely fatal in monolithic systems, becomes survivable.

The original Unix security model, which served as a role model for many contemporary systems, does not allow for mandatory access control (MAC). Each data owner may release data to arbitrary third parties at his discretion, without a system-wide security policy being able to prevent that. Efforts to add mandatory access control mechanisms such as SELinux have proven difficult and are widely deemed impractical. In contrast, Fiasco.OC features capability-based access control[12], which is highly expedient for the construction of systems that follow the principle of least authority.

Fiasco.OC can be used as a hypervisor[14][15]. In line with the microkernel philosophy, it only provides support for CPU and memory virtualization and leaves the provisioning of virtual devices to a user-level virtual machine monitor (VMM). It should be pointed out that virtualization blends smoothly with the existing system. Memory is provisioned into VMs the same way as it is into non-VM protection domains. Access to VMs is governed by the same capability mechanism that controls access to any other object in the system. Executing code in a VM is achieved through a thread and is as such under control of the Fiasco scheduler.

The object describing address spaces, the task, is also used to provide the memory context for a VM.
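The controller-task arrangement amounts to a simple run loop in the VMM. The sketch below is our own illustration, not Fiasco.OC or Karma code: `vm_resume` stands in for the kernel entry that migrates the thread into the VM (in reality a syscall taking a capability to the VM object and the full architectural state), and is mocked here with a scripted list of exit reasons so the loop can actually run.

```c
/* Hypothetical names (vm_resume, VM_EXIT_*) for illustration only. */
enum vm_exit_reason { VM_EXIT_IRQ, VM_EXIT_MMIO, VM_EXIT_SHUTDOWN };

struct vm_state { unsigned long ip; /* ... full register state ... */ };

/* Mock of the kernel entry: in a real system this migrates the thread
 * into the VM and returns when an event or intercepted access needs
 * attention, reporting back the last execution state. */
static const enum vm_exit_reason script[] = {
    VM_EXIT_IRQ, VM_EXIT_MMIO, VM_EXIT_IRQ, VM_EXIT_SHUTDOWN
};
static enum vm_exit_reason vm_resume(struct vm_state *st)
{
    static unsigned pos;
    (void)st;
    return script[pos++];
}

/* VMM main loop: resume the guest, handle whatever caused the exit,
 * repeat. Returns the number of events handled. */
int vmm_run(struct vm_state *st)
{
    int handled = 0;
    for (;;) {
        switch (vm_resume(st)) {
        case VM_EXIT_IRQ:  /* inject as virtual interrupt */ handled++; break;
        case VM_EXIT_MMIO: /* emulate the device access   */ handled++; break;
        case VM_EXIT_SHUTDOWN:
            return handled;
        }
    }
}
```

Because the resuming thread belongs to the controller task, every exit naturally returns control to the VMM, which is exactly the point where the optimizations discussed later intervene.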
The only modification is that the VM memory range covers the full 4GB on 32-bit architectures, whereas tasks are limited to 3GB. Each virtual machine has an associated controller task, which controls its execution, thereby acting as a virtual machine monitor (VMM). Unlike tasks, VMs cannot host threads directly. Instead, the thread is bound to the controller task and migrates into the VM when execution shall resume there. To that end, it invokes a microkernel syscall with a capability to the VM object and a full architectural VM state as arguments. This operation is reverted when an external event arrives or a situation occurs that needs attention. The last execution state is then reported back.

From its inception, Fiasco was developed with real-time support in mind. The kernel disables interrupts
only for unavoidable atomic operations (e.g. thread switching) and is thus highly preemptible. Scheduling of threads is strictly based on static priorities.

Streamlined synchronous IPC was one of the defining characteristics of the original L4 and later derived kernels. The reasoning was that communication is mainly procedure-call-like and that asynchronous event handling is so rare that kernel support is not warranted. It turned out that this premise does not hold for operating system kernels. Consequently, the kernel's threading model was augmented with a virtual CPU mode[16], which proved very expedient for OS rehosting. The underlying principle is that a control transfer between two tasks only takes place if there are no pending interrupts. That mirrors the behavior of physical machines, where transitioning from kernel into user level and re-allowing interrupt delivery is usually an atomic operation. Earlier designs lacking that feature could ignore pending events for a whole time slice, which is unacceptable for real-time operations. For our work, the scope of this conditional operation was extended to the switch into VM operation.

B. Virtual Machine Monitor
Instead of starting from scratch, we opted to improveon the
Karma VMM [15]. (In L4Linux, both the Linux kernel and Linux processes are implemented as L4 tasks.) We identified parts of Karma that might delay the delivery of events into the guest and replaced them.

Our first modification concerned the timer handling. Originally, Karma used Fiasco IPC timeouts as its time source. For historic reasons, Fiasco's timer granularity is rather coarse at 1000Hz. Clearly such a clock is inadequate to serve as a time source for a high-resolution Linux timer. Since changing the Fiasco timer infrastructure was beyond our scope, we decided to offer a higher-resolution timer as an alternative. Our choice fell on the HPET. Ownership of the device is given to one Karma instance, whereby it is allowed to program it directly and subscribe as a recipient of its interrupts. Whenever the guest OS tries to program its (virtual) timer, these accesses are intercepted. The VMM retrieves the timeout to be set from the guest's execution state and arms the HPET accordingly. When the time has elapsed, the HPET raises an interrupt. Fiasco forwards it to the VMM,
which, in turn, injects it as a virtual interrupt into the VM under its control.

Figure 1. Usual series of events upon arrival of a timer interrupt. Steps 1-3: the IRQ arrives and is injected into the VM. Steps 4-7: the VM reprograms the timer chip for the next interrupt by issuing a hypercall. Step 8: the woken guest task is resumed.

Since the VMM is a regular task, it does not have direct device access. Instead, it uses services provided by infrastructure servers. In many cases, service invocation involves sending messages (inter-process communication, IPC). Unfortunately, most of the currently used infrastructure employs synchronous IPC to a certain degree.

However, a VMM does not fit well into this model. A VM may accommodate multiple processes, each of which can potentially initiate communication with other servers (by accessing virtual devices). Since IPC response times are not guaranteed to be short (in fact, most services do not specify any timing parameters), it could happen that any activity in the VMM comes to a halt for a substantial period of time. Any incoming event would be delayed until an IPC reply arrived. To avoid these delays we employ a dedicated thread for external IPC. The main thread still detects the request from the guest, but instead of sending the IPC itself it leaves that task to the communication thread. The important point in this construction is that the main thread does not get tied up in an IPC and can promptly pick up on incoming events.

IV. OPTIMIZATIONS
Our most important optimization concerned the delay incurred by reprogramming the timer source. Usually, Linux sets a new timeout promptly after the arrival of a timer interrupt, i.e. directly in interrupt context,
before resuming any (possibly real-time) process. The sequence of events is shown in Figure 1. Since programming the timer involves leaving and reentering the VM and is thus a costly operation, the guest was adapted to check whether the expired timer is about to wake a real-time task, and in these cases to postpone the timer programming until after the task's execution. This allowed us to move steps 4-7 off the critical path. Once the real-time task has completed, Karma is asked to set the next timeout. From the guest's perspective this behavior is correct because the real-time load runs with the highest priority and cannot be preempted by any other process. If the timeout has not led to the wake-up of a real-time task, the timer is programmed immediately as usual. Karma takes provisions against run-away processes by setting an emergency timeout that is far larger than the worst-case execution time of the current job. Either a regular timer arming or the completion notification of the real-time task will let Karma cancel the emergency timeout.

Our second set of optimizations aimed at reducing the overhead caused by lower-priority interrupts arriving during the execution of a real-time task. As the initial version of Karma injected interrupts into the VM in strict FIFO order, the first improvement was to sort the pending interrupts and deliver the one with the highest priority first, thereby retaining hardware semantics up to the VMM-VM interface, and holding back the pending interrupts with lower priorities. Once the VM reprograms the timer, Karma can be sure that the real-time task has completed its time slice and deliver the interrupts. For the next iteration we allowed Karma to preclude the microkernel from delivering low-priority events to the virtual machine altogether. The occurrence of such events still causes a VM exit and reentry in this scenario, but the additional context switch to Karma is inhibited.
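The deferred timer programming can be captured as a small state machine. The structure and names below are our own simplification of the guest-side logic, not the actual Karma patch: when the expired timer wakes a real-time task, the reprogramming hypercall is deferred and an emergency timeout on the VMM side guards against run-away jobs; otherwise the timer is rearmed immediately as usual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the deferred timer arming described above. */
struct vtimer {
    bool     armed;            /* regular (virtual) timer armed?    */
    bool     emergency_armed;  /* VMM-side emergency timeout armed? */
    uint64_t next_expiry;
    int      hypercalls;       /* counts costly VM exits            */
};

/* Called on timer expiry in the guest. */
void timer_expired(struct vtimer *t, bool wakes_rt_task,
                   uint64_t next, uint64_t wcet_bound)
{
    t->armed = false;
    if (wakes_rt_task) {
        /* Defer reprogramming; the VMM arms an emergency timeout far
         * larger than the job's worst-case execution time. */
        t->emergency_armed = true;
        t->next_expiry = next;
        (void)wcet_bound;  /* bound used by the VMM, not modeled here */
    } else {
        /* Normal case: reprogram immediately (one hypercall). */
        t->armed = true;
        t->next_expiry = next;
        t->hypercalls++;
    }
}

/* Called when the real-time task signals completion. */
void rt_task_done(struct vtimer *t)
{
    if (t->emergency_armed) {
        t->emergency_armed = false;  /* cancel emergency timeout */
        t->armed = true;             /* arm the deferred timeout */
        t->hypercalls++;             /* now off the critical path */
    }
}
```

The invariant the real patch relies on is visible here: between expiry and completion, exactly one timeout (the emergency one) is armed, and the costly hypercall happens only after the real-time task has run.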
Once the real-time task has completed and the VM has announced to Karma that event processing can resume, the microkernel delivers the events to Karma, which in turn injects them into the guest. As a final step, we allowed Karma to use the system's hardware task priority register (TPR) to directly inhibit the generation of hardware interrupts below a chosen priority. This scenario finally also saves the costly VM exits and reentries during the execution of high-priority code by keeping interrupts pending right on the interrupt controller.

V. EVALUATION
To evaluate the feasibility of our design, we ran a number of experiments. Our test machine was equipped with an AMD Phenom II X4 at 3.4GHz and 4GB RAM (DDR3, 1333MHz) on a Gigabyte GA-770TA-UD3 board, and a Hitachi hard disk (500GB, 7200rpm, 16MB cache). The virtual machines were provided with 256MB RAM. The guest running the real-time load ran Linux version 3.0.14-rt31 with the PREEMPT_RT [17] and Karma patches applied. In the guest VM, we used cyclictest [18] (version 0.83), a utility that wakes up periodically and measures the deviation from the expected wakeup time. These delays are a good measure of the preemptibility of the operating system. Our version of cyclictest is slightly modified in that it reports the measured delay directly through a hypercall to Karma. Unless noted otherwise, the load was generated in the same VM with "tar -xf linux.tar.bz2" unpacking a disk-resident archive. For comparison, we ran the same Linux configuration, except for the Karma patches, on bare hardware. If not noted otherwise, all measurements ran for one hour.

For our experiments, the guest running timesharing loads was granted direct disk access, but had to use the secure GUI as console. The GUI server has to copy the framebuffer content of a client into the device framebuffer. Without the decoupling described in subsection III-B, it would not be possible to provide the user with a responsive user interface and run real-time tasks with tight deadlines at the same time.

Fiasco.OC offers a tracebuffer, which holds kernel-generated and user-supplied logging messages. It can be made visible as a read-only region in the address spaces of user-level tasks. We made extensive use of this facility, as time-stamped log messages were often the only way to pinpoint the causes of long delays.
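The core of such a measurement can be sketched as follows. This is a simplified, hypothetical reimplementation of cyclictest's inner loop; the real tool additionally pins the thread to a CPU, runs under SCHED_FIFO, and locks its memory with mlockall().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Difference a - b in nanoseconds. */
static int64_t ts_diff_ns(struct timespec a, struct timespec b)
{
    return (int64_t)(a.tv_sec - b.tv_sec) * 1000000000LL
         + (a.tv_nsec - b.tv_nsec);
}

/* One measurement cycle: sleep until an absolute deadline, then
 * return how late the wakeup actually was, in nanoseconds. The
 * result is non-negative, since TIMER_ABSTIME never wakes us early. */
static int64_t measure_once(long interval_ns)
{
    struct timespec deadline, now;

    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_nsec += interval_ns;
    while (deadline.tv_nsec >= 1000000000L) {   /* normalize */
        deadline.tv_nsec -= 1000000000L;
        deadline.tv_sec++;
    }
    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
    clock_gettime(CLOCK_MONOTONIC, &now);
    return ts_diff_ns(now, deadline);
}
```

A histogram of many such samples is what the latency figures in this section plot: occurrences over wakeup latency.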
A. Baseline Measurements
Our first measurement differs from the others in that it is the only one not concerned with latencies. While not the primary target, throughput is nonetheless of interest. Figure 2 lists the times it takes to compile the Linux kernel. The figures are only meant to give an understanding of relative performance, as the imposed limitations (one CPU, 256MB memory) certainly hurt. The result of an approximately 3.5% slowdown is consistent with earlier results [15].
Compile time     min       max       average
Native Linux     619.5s    620.4s    619.9s
Karma Linux      639.2s    641.0s    640.3s

Figure 2. Linux kernel compile benchmark, in seconds, smaller is better. At least three runs were performed. All setups were limited to one CPU and 256MB memory. In the case of Karma, disk accesses were carried out directly without VMM intervention.

Figure 3. Native Linux with PREEMPT_RT patch: latencies measured with cyclictest (a) while cyclictest is the only load and (b) while a device-interrupt-inducing load is running. (The 0 bin denotes latencies below 1 µs.)

In our next measurement, we ran the guest in the base configuration, as shown in Figures 3 and 4. The important point to note is that the worst-case latency under load increases less than twofold for the virtualized case (14 µs vs. 25 µs).

B. Bridging the Semantic Gap
In this section we demonstrate how the real-time performance can be improved if the actions of the guest and the layers underneath it are better coordinated. Figure 5 shows that programming the simulated local APIC timer is rather expensive. By deferring the timer reprogramming until the real-time workload has finished, we can improve the release latency of the real-time workload dramatically.

Figure 4. Baseline latencies measured with cyclictest in a guest running on a Karma virtual machine (a) while cyclictest is the only load and (b) while a device-interrupt-inducing load is running in the same guest.

Figure 5. (a) Baseline latencies measured with cyclictest. (b) Latencies with deferred timer programming.

There are, however, situations which cannot be improved by postponing the reprogramming. Usually Linux refrains from programming the timer if the required delay is so small that it cannot be reliably programmed; but it does not take the expensive round-trip to the VMM into account. Thus, if Linux calls out to the VMM in order to reprogram the timer for a delay that is above its own programming threshold but less than the actual time required to exit and reenter the VM, Karma will refuse to program the device and directly inject a synthetic timer event into the VM. This saves the additional delay of programming the timer chip, but the costs incurred by leaving and reentering the VM cannot be remedied. The latencies experienced during such a series of events (termed a "soft trigger", as opposed to a "hard trigger", which means an actual hardware timer interrupt) are shown in Figure 6. In the future, we will try to reliably detect close timeouts in Linux; in such situations it might be better to poll instead of calling the hypervisor.

Upon an HPET interrupt, Fiasco raises the TPR value such that only its own timer interrupts for scheduling can pass. (The task priority register allows specifying a threshold for interrupt delivery; interrupt vectors with a numerically smaller value are not delivered.) If the HPET interrupt results in the release of an event handler, it can then run without interference from device interrupts. After the handler has signaled its completion to the VMM, the latter unmasks the HPET. Fiasco also detects this and lowers the TPR to the regular level. Figure 7 shows the effect of that procedure. The worst-case latencies grow by 6 µs if the TPR optimization is disabled.
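The TPR threshold mechanism described above can be modeled in a few lines. This is an illustrative model of the local APIC priority rule, not code from Fiasco or Karma:

```c
#include <stdbool.h>
#include <stdint.h>

/* On the local APIC, an interrupt's priority class is the upper
 * nibble of its vector. An interrupt is delivered only if its class
 * exceeds the class held in the task priority register (TPR);
 * everything else stays pending on the interrupt controller. */
static unsigned prio_class(uint8_t vector)
{
    return vector >> 4;
}

static bool tpr_allows(uint8_t tpr, uint8_t vector)
{
    return prio_class(vector) > prio_class(tpr);
}
```

Raising the TPR on an HPET interrupt thus blocks whole priority classes of device interrupts in hardware, without any VM exit.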
Figure 6. Latency comparison of interrupts delivered directly from the hardware ("hard trigger") against interrupts synthesized by Karma when the desired wakeup time has already been reached ("soft trigger"). The figure on top shows the latencies on an otherwise idle system, the bottom one on a system with a device-interrupt-inducing load. Delayed timer programming is active in both setups.

Figure 7. Latencies measured inside the virtual machine with (b) and without (a) optimizations in place.
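The soft-trigger decision compared in Figure 6 boils down to a simple test in the VMM; the function name and the way the round-trip cost is obtained are hypothetical, not Karma's actual code:

```c
#include <stdint.h>

enum timer_action { PROGRAM_HARDWARE, INJECT_SOFT_TRIGGER };

/* delay_ns: timeout requested by the guest. roundtrip_ns: measured
 * cost of one VM exit plus reentry. If the deadline would already
 * have passed by the time control returns to the guest, programming
 * the timer chip is pointless: synthesize the interrupt instead. */
static enum timer_action decide_timeout(int64_t delay_ns,
                                        int64_t roundtrip_ns)
{
    return delay_ns <= roundtrip_ns ? INJECT_SOFT_TRIGGER
                                    : PROGRAM_HARDWARE;
}
```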
C. Isolation at VM Level
In our final measurement, we investigated whether load running in a separate VM has an impact on the real-time latencies. To that end, we ran our workload in a second VM with Linux 2.6.35.2 and the Karma patches applied. Figure 8 shows the measurement results alongside measurements with the load running in the real-time VM. We attribute the higher latencies to heavier pressure on caches and TLBs.

VI. RELATED WORK
Linux has been used in quite a few projects to serve as the general-purpose part in a hybrid system [4][19][5]. Limited to scheduling, the real-time layer has to rely on the Linux kernel for services like program loading or memory access. To circumvent the intricate user-level memory management of Linux, which involves paging, real-time tasks are often implemented as kernel modules. Since kernel memory is not paged under Linux, the real-time executive can be sure that a once-loaded task is resident. As all real-time applications run in the same kernel address space, however, faults cannot be contained by address spaces. Although recent versions of Xenomai offer user-level real-time applications, they still limit them to a specific API; maintaining more than the Linux API is seen as too cumbersome by many developers, though. Regardless of whether user tasks are encapsulated, the Linux kernel runs in kernel space in any case. Its immense size makes the presence of defects, especially in device drivers, highly likely.

Figure 8. Overall latencies measured by cyclictest (a) where load is induced in the same guest as the real-time workload and (b) where load is induced in a second virtual machine on the same physical machine.

Kiszka [20] evaluates the use of Linux as a real-time hypervisor. Using KVM with Linux-RT as host and guest system, latencies are evaluated. A paravirtualized scheduling interface is introduced to improve the scheduling in the guest. The evaluations show that reasonable scheduling latencies can be achieved.

Rehosting an operating system on top of a microkernel has been investigated before [21][22][23]. The approach reflects the lack of hardware virtualization support at the time. On architectures without tagged TLBs, performance is noticeably degraded because of the context switch now necessary for a (guest) kernel entry.
Still, it is possible to achieve reasonable guest event-handling latencies with such approaches [24]. Bruns et al. [25] evaluated the application of a microkernel-based system for real-time workloads. A rehosted RTOS was run on a commercially available embedded platform and compared to its native version. The evaluation showed that the benefits of consolidation outweighed the performance impact of the virtualization approach.

EROS [11] demonstrated that a capability system can be built on a small kernel and yield message-passing performance on a par with other then-state-of-the-art microkernels. Unlike other systems that expose disk space through a file system, EROS went for a persistent model: the state of the system resides primarily on disk and is only fetched into memory for execution. This classical example of caching entails deciding which objects are kept in memory and which are written back to disk. EROS did not provide any mechanisms to influence this caching, giving real-time applications no chance to ensure they stay memory-resident.

Mesovirtualization, proposed by Ito and Oikawa in [26], is a virtualization approach targeted at the x86 architecture that minimally changes the guest operating system. As a guest, they run Linux on their VMM implementation Gandalf and show better performance when comparing with Linux on Xen, attributing their advance to their annotations. In [27], Ito and Oikawa implemented and evaluated shadow page-table implementations in the Gandalf VMM. They achieve better results than Xen; however, they modify the page-handling code of their Linux guest.

Kinebuchi et al. [28] propose task-grain scheduling to host Linux and an RTOS on an L4-based hypervisor on embedded systems. The guests are adapted to run in this environment, and priorities of guest tasks are mapped to host ones. Their evaluation shows that interrupt latency is increased significantly compared to a native RTOS.

The current excitement about virtualization obscures the fact that the layers needed to implement virtual machines may grow the attack surface, rendering the system as a whole even more vulnerable. Peter et al. [29] have shown that kernel support can be added to microkernels with minimal effort without undermining the security properties of the system. This work did not seek to minimize the trusted computing base for the VM itself, though, which was demonstrated later [13]. Liebergeld et al. [15] further demonstrated that virtualization does not hurt the real-time performance of applications running alongside the VM. The feasibility of running real-time applications inside the VM was not investigated.

VII. CONCLUSION
In this paper, we have presented an architecture that allows for the execution of real-time tasks in a virtual machine. Our results indicate that real-time performance comparable to loads run on bare hardware is only possible if the guest can give hints about its current execution state to lower layers of the software stack.

REFERENCES

[1] Paul E. McKenney and John D. Slingwine, "Read-Copy Update: Using Execution History to Solve Concurrency Problems," in Parallel and Distributed Computing and Systems, Las Vegas, NV, October 1998, pp. 509–518.
[2] Maurice Herlihy, "A methodology for implementing highly concurrent data objects," ACM Trans. Program. Lang. Syst., vol. 15, pp. 745–770, November 1993.
[3] Thomas Gleixner, personal communication, RTLWS 2010, Nairobi.
[4] Victor Yodaiken, "The RTLinux Manifesto," 1999.
[5] L. Dozio and P. Mantegazza, "Real time distributed control systems using RTAI," in Object-Oriented Real-Time Distributed Computing, 2003, Sixth IEEE International Symposium on.
Real-Time Systems Symposium, 2002. RTSS 2002. 23rd IEEE, 2002.
[9] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood, "seL4: Formal verification of an OS kernel," in Proceedings of the 22nd ACM Symposium on Operating Systems Principles, Big Sky, MT, USA, Oct 2009, pp. 207–220, ACM.
[10] Robert N. M. Watson, Jonathan Anderson, Kris Kennaway, and Ben Laurie, "Capsicum: practical capabilities for UNIX."
[11] Jonathan S. Shapiro, Jonathan M. Smith, and David J. Farber, "EROS: a fast capability system," in Symposium on Operating Systems Principles, 1999, pp. 170–185.
[12] Adam Lackorzynski and Alexander Warg, "Taming subsystems: capabilities as universal resource access control in L4," in Proceedings of the Second Workshop on Isolation and Integration in Embedded Systems, New York, NY, USA, 2009, IIES '09, pp. 25–30, ACM.
[13] Udo Steinberg and Bernhard Kauer, "NOVA: a microhypervisor-based secure virtualization architecture," in Proceedings of the 5th European Conference on Computer Systems, New York, NY, USA, 2010, EuroSys '10, pp. 209–222, ACM.
[14] "Faithful Virtualization on a Real-Time Operating System."
[15] Steffen Liebergeld, Michael Peter, and Adam Lackorzynski, "Towards Modular Security-Conscious Virtual Machines," in Proceedings of the Twelfth Real-Time Linux Workshop, Nairobi, 2010.
[16] Adam Lackorzynski, Alexander Warg, and Michael Peter, "Generic Virtualization with Virtual Processors," in Proceedings of the Twelfth Real-Time Linux Workshop, Nairobi, 2010.
[17] "OSADL Project: Realtime Linux," URL: .
[18] "cyclictest," URL: https://rt.wiki.kernel.org/index.php/Cyclictest.
[19] Alfons Crespo, Ismael Ripoll, and Miguel Masmano, "Partitioned embedded architecture based on hypervisor: The XtratuM approach," in EDCC, 2010, pp. 67–72, IEEE Computer Society.
[20] Jan Kiszka, "Towards Linux as a Real-Time Hypervisor," in Proceedings of the 11th Real-Time Linux Workshop (RTLWS), Dresden, Germany, 2009.
[21] F. B. des Places, N. Stephen, and F. D. Reynolds, "Linux on the OSF Mach3 microkernel," available from URL: .
[22] Hermann Härtig, Michael Hohmuth, Jochen Liedtke, Jean Wolter, and Sebastian Schönberg, "The performance of microkernel-based systems," in Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, New York, NY, USA, 1997, SOSP '97, pp. 66–77, ACM.
[23] Joshua LeVasseur, Volkmar Uhlig, Jan Stoess, and Stefan Götz, "Unmodified device driver reuse and improved system dependability via virtual machines," in Proceedings of the 6th Symposium on Operating Systems Design & Implementation, Berkeley, CA, USA, 2004, pp. 2–2, USENIX Association.
[24] Adam Lackorzynski, Janis Danisevskis, Jan Nordholz, and Michael Peter, "Real-Time Performance of L4Linux," in Proceedings of the Thirteenth Real-Time Linux Workshop, Prague, 2011.
[25] Felix Bruns, Shadi Traboulsi, David Szczesny, Elizabeth Gonzalez, Yang Xu, and Attila Bilgic, "An evaluation of microkernel-based virtualization for embedded real-time systems," in Proceedings of the 2010 22nd Euromicro Conference on Real-Time Systems, Washington, DC, USA, 2010, ECRTS '10, pp. 57–65, IEEE Computer Society.
[26] Megumi Ito and Shuichi Oikawa, "Mesovirtualization: Lightweight virtualization technique for embedded systems," in SEUS, Roman Obermaisser, Yunmook Nah, Peter P. Puschner, and Franz-Josef Rammig, Eds., 2007, vol. 4761 of Lecture Notes in Computer Science, pp. 496–505, Springer.
[27] Megumi Ito and Shuichi Oikawa, "Lightweight shadow paging for efficient memory isolation in Gandalf VMM," in Proceedings of the 2008 11th IEEE Symposium on Object Oriented Real-Time Distributed Computing, Washington, DC, USA, 2008, pp. 508–515, IEEE Computer Society.
[28] Yuki Kinebuchi, Midori Sugaya, Shuichi Oikawa, and Tatsuo Nakajima, "Task grain scheduling for hypervisor-based embedded system," in HPCC, 2008, pp. 190–197, IEEE.
[29] Michael Peter, Henning Schild, Adam Lackorzynski, and Alexander Warg, "Virtual machines jailed: virtualization in systems with small trusted computing bases," in VDTS '09: Proceedings of the 1st EuroSys Workshop on Virtualization Technology for Dependable Systems, New York, NY, USA, 2009, pp. 18–23, ACM.