Augmenting Operating Systems With the GPU
Weibin Sun Robert Ricci
University of Utah, School of Computing
Abstract
The most popular heterogeneous many-core platform, the CPU+GPU combination, has received relatively little attention in operating systems research. This platform is already widely deployed: GPUs can be found, in some form, in most desktop and laptop PCs. Used for more than just graphics processing, modern GPUs have proved themselves versatile enough to be adapted to other applications as well. Though GPUs have strengths that can be exploited in systems software, this remains a largely untapped resource. We argue that augmenting the OS kernel with GPU computing power opens the door to a number of new opportunities. GPUs can be used to speed up some kernel functions, make others scale better, and make it feasible to bring some computation-heavy functionality into the kernel. We present our framework for using the GPU as a co-processor from an OS kernel, and demonstrate a prototype in Linux.
Modern GPUs can be used for more than just graphics processing; through frameworks like CUDA [1], they can run general-purpose programs. (In GPU terminology, a program running on the GPU is called a "kernel." To avoid confusion, we use the terms "OS kernel" and "GPU kernel" when the meaning could be ambiguous.) While not well-suited to all types of programs, they excel on code that can make use of their high degree of parallelism. Most uses of so-called "General Purpose GPU" (GPGPU) computation have been outside the realm of systems software. However, recent work on software routers [10] and encrypted network connections [14] has given examples of how GPGPUs can be applied to tasks more traditionally within the realm of operating systems. We claim that these uses are only scratching the surface. In Section 2, we give more examples of how GPU computing resources can be used to improve performance and bring new functionality into OS kernels. These include tasks that have applications on the desktop, on the server, and in the datacenter.

Consumer GPUs currently contain up to 512 cores [2], and are fairly inexpensive: at the time of writing, a current-generation GPU with 336 cores can be purchased for as little as $160, or about 50 cents per core. GPUs are improving at a rapid pace: the theoretical performance of NVIDIA's consumer GPUs improved from 500 gigaFLOPS in 2007 (GeForce 8800) to over 1.3 teraFLOPS in 2009 (GTX 480) [18]. Furthermore, the development of APUs, which contain a CPU and a GPU on the same chip, is likely to drive even wider adoption. This represents a large amount of computing power, and we argue that systems software should not overlook it.

Some recent OS designs have tried to embrace processor heterogeneity. Helios [17] provides a single OS image across multiple heterogeneous cores so as to simplify program development. Barrelfish [5] treats a multicore system as a distributed system, with independent OS kernels on each core and communication via message-passing.
Both, however, are targeted at CPUs that have support for traditional OS requirements, such as virtual memory, interrupts, preemption, controllable context switching, and the ability to interact directly with I/O devices. GPUs lack these features, and are thus simply not suited to designs that treat them as peers to traditional CPUs. Instead, they are better suited for use as co-processors.

Because of this, we argue that GPUs can and should be used to augment OS kernels, but that a heterogeneous OS cannot simply treat the GPU as a fully functional CPU with a different ISA. The OS kernel needs a new framework if it is to take advantage of the opportunities presented by GPUs. To demonstrate the feasibility of this idea, we designed and prototyped KGPU, a framework for calling GPU code from the Linux kernel. We describe this framework and the challenges we faced in designing it in Section 3.
We have three motivations for offloading OS kernel tasks to the GPU:

• To reduce the latency for tasks that run more quickly on the GPU than on the CPU

• To exploit the GPU's parallelism to increase the throughput for some types of operations, such as increasing the number of clients a server can handle

• To make feasible the incorporation of new functionality into the OS kernel that runs too slowly on the CPU

These open the door for new avenues of research, with the potential for gains in security, efficiency, functionality, and performance of the OS. In this section, we describe a set of tasks that have been shown to perform well on the GPU, and discuss how they show promise for augmenting the operating system.
Network Packet Processing:
Recently, the GPU has been demonstrated to show impressive performance enhancements for software routing and packet processing. PacketShader [10] is capable of fast routing table lookups, achieving a rate of close to 40 Gbps for both IPv4 and IPv6 forwarding, at most a 4x speedup over the CPU-only mode, using two NVIDIA GTX 480 GPUs. For IPSec, PacketShader gets a 3.5x speedup over the CPU. Additionally, SSLShader [14], a GPU-accelerated SSL implementation, runs four times faster than an equivalent CPU version.

While PacketShader shows the feasibility of moving part of the network stack onto GPUs and delivers excellent throughput, it suffers from a higher round trip latency for each packet when compared to the CPU-only approach. This exposes the weakness of the GPU in a latency-oriented computing model: the overhead caused by copying data and code into GPU memory and then copying results back severely affects the overall response time of a GPU computing task. To implement GPU offloading support, OS kernel designers must deal with this latency problem. Our KGPU prototype decreases the latency of GPU computing tasks with the techniques discussed in Section 3.

Though there are specialized programmable network interfaces which can be used for packet processing, the CPU+GPU combination offers a compelling alternative: the high level of interest in GPUs, and the fact that they are sold as consumer devices, drives wide deployment, low cost, and substantial investment in improving them.
In-Kernel Cryptography:
Cryptography operations accelerated by GPUs have been shown to be feasible and to achieve significant speedups over CPU versions [12, 14]. OS functionality making heavy use of cryptography includes IPSec [10], encrypted filesystems, and content-based data redundancy reduction of filesystem blocks [21] and memory pages [9]. Another potential application of GPU-accelerated cryptography is trusted computing based on the Trusted Platform Module (TPM). A TPM is traditionally hardware, but recent software implementations of the TPM specification, such as vTPM [6], have been developed for hypervisors to provide trusted computing in virtualized environments where virtual machines cannot access the host TPM directly. Because TPM operations are cryptography-heavy (such as secure hashing of executables and memory regions), they can also potentially be accelerated with GPUs.

The Linux kernel contains a general-purpose cryptography library used by many of its subsystems. This library can easily be extended to offload to the GPU. Our KGPU prototype implements AES on the GPU for the Linux kernel, and we present a microbenchmark in Section 3.3 showing that it can outperform the CPU by as much as 6x for sufficiently large block sizes. Due to the parallel nature of the GPU, blocks of data can represent either large blocks of a single task or a number of smaller blocks of different tasks. Thus, the GPU can not only speed up bulk data encryption but also scale up the number of simultaneous users of the cryptography subsystem, such as SSL or IPSec sessions with different clients.
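The reason block ciphers in ECB mode map so well onto the GPU is that every 16-byte block is encrypted independently. The following C sketch illustrates that structure only; it is not the paper's CUDA AES service, and `toy_block_encrypt` is a stand-in transform (a plain XOR with the key) so the skeleton runs without a GPU:

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK 16  /* AES block size in bytes */

/* Stand-in for the per-block AES-128 encryption primitive; KGPU uses
 * a real CUDA AES implementation [4]. Here we XOR with the key so the
 * structure stays runnable without a GPU. */
static void toy_block_encrypt(uint8_t *out, const uint8_t *in,
                              const uint8_t key[BLOCK]) {
    for (int i = 0; i < BLOCK; i++)
        out[i] = in[i] ^ key[i];
}

/* ECB mode: blocks are encrypted independently, so on a GPU each
 * block can map to its own thread. This loop is the serial equivalent
 * of the data-parallel GPU kernel. */
void ecb_encrypt(uint8_t *out, const uint8_t *in, size_t nblocks,
                 const uint8_t key[BLOCK]) {
    for (size_t b = 0; b < nblocks; b++)  /* independent iterations */
        toy_block_encrypt(out + b * BLOCK, in + b * BLOCK, key);
}
```

Because no iteration depends on another, the blocks can come from one large buffer or from many small requests belonging to different clients, which is exactly what lets the GPU both speed up bulk encryption and scale the number of concurrent users.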
Pattern Matching Based Tasks:
The GPU can accelerate regular expression matching, with speedups of up to 48x reported over CPU implementations [22]. A network intrusion detection system (NIDS) with GPU-accelerated regular expression matching [22] demonstrated a 60% increase in overall packet processing throughput on fairly old GPU hardware. Other tasks such as information flow control inside the OS [16], virus detection [3] (with two orders of magnitude speedup), rule-based firewalls, and content-based search in filesystems can potentially benefit from GPU-accelerated pattern matching.
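The parallelism here comes from scanning each packet independently. As a minimal sketch (our own illustration, not code from the cited NIDS, and with a hypothetical hard-coded pattern), each iteration of the outer loop below is independent and could be one GPU thread:

```c
#include <stddef.h>

/* Hypothetical DFA that accepts any input containing the byte
 * sequence "EVIL"; state counts how many pattern bytes have matched. */
#define ACCEPT 4

static int dfa_step(int state, unsigned char c) {
    static const char *pat = "EVIL";
    if (state == ACCEPT) return ACCEPT;
    if (c == (unsigned char)pat[state]) return state + 1;
    return c == (unsigned char)pat[0] ? 1 : 0;
}

/* Returns 1 if the packet matches the pattern. */
int scan_packet(const unsigned char *pkt, size_t len) {
    int state = 0;
    for (size_t i = 0; i < len; i++)
        state = dfa_step(state, pkt[i]);
    return state == ACCEPT;
}

/* Serial stand-in for the data-parallel GPU kernel: each packet is
 * scanned independently, so each iteration could be a GPU thread. */
void scan_all(const unsigned char **pkts, const size_t *lens,
              int *hits, size_t npkts) {
    for (size_t p = 0; p < npkts; p++)
        hits[p] = scan_packet(pkts[p], lens[p]);
}
```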
In-Kernel Program Analysis:
Program analysis is gaining traction as a way to enhance the security and robustness of programs and operating systems. For example, the Singularity OS [13] relies on safe code for process isolation rather than traditional memory protection. Recent work on EigenCFA has shown that some types of program analysis can be dramatically sped up using a GPU [20]. By re-casting the control flow analysis problem (specifically, 0CFA) in terms of matrix operations, which GPUs excel at, EigenCFA is able to see a speedup of 72x, nearly two orders of magnitude. The authors of EigenCFA are working to extend it to pointer analysis as well. With speedups like this, analysis that was previously too expensive to do at load time or execution time becomes more feasible; it is conceivable that some program analysis could be done as code is loaded into the kernel, or executed in some other trusted context.
Basic Algorithms:
A number of basic algorithms, which are used in many system-level tasks, have been shown to achieve varying levels of speedup on GPUs. These include sort, search [19], and graph analysis [11]. GPU-accelerated sort and search fit the functionality of filesystems very well. An interesting potential use of GPU-accelerated graph analysis is for in-kernel garbage collection (GC). GC is usually considered to be time-consuming because of its graph traversal operation, but a recent patent application [15] shows it is possible to do the GC on GPUs, and that it may have better performance than on CPUs. Besides GC for memory objects, filesystems also use GC-like operations to reorganize blocks, find dead links, and check unreferenced blocks for consistency. Another example of graph analysis in the kernel is the Featherstitch [7] system, which exposes the dependencies among writes in a reliable filesystem. One of the most expensive parts of Featherstitch is analysis of dependencies in its patch graph, a task we believe could be done efficiently on the GPU.

GPGPU computing is a relatively new field, with the earliest frameworks appearing in 2006. Many of the applications described in this section are, therefore, early results, and may see further improvements and broader applicability. With more and more attention being paid to this realm, we expect more valuable and interesting GPU-accelerated in-kernel applications to present themselves in the future.
Because of the functional limitations discussed in Section 1, it is impractical to run a fully functional OS kernel on a GPU. Instead, our KGPU framework runs a traditional OS kernel on the CPU, and treats the GPU as a co-processor. We have implemented a prototype of KGPU in the Linux kernel, using NVIDIA's CUDA framework to run code on the GPU.
KGPU must deal with two key challenges to efficiently use the GPU from the OS kernel: the overhead of copying data back and forth, and latency-sensitive launching of tasks on the GPU.
Data Copy Overhead:
A major overhead in GPGPU computing is caused by the fact that the GPU has its own memory, separate from the main memory used by the CPU. Transfer between the two is done via DMA over the PCIe bus. Applications using the GPU must introduce two copies: one to move the input to GPU memory, and another to return the result. The overhead of these copies is proportional to the size of the data.

There are two kinds of main memory the CUDA driver can use: one is general memory (called pageable memory in CUDA), allocated by malloc(). The other is pinned memory, allocated by the CUDA driver and mmap-ed into the GPU device. Pinned memory is much faster than pageable memory when doing DMA. In KGPU, we use pinned memory for all buffers because of its superior performance. The downside of pinned memory is that it is locked to specific physical pages, and cannot be paged out to disk; hence, we must be careful about managing our pinned buffers. This management is described in Section 3.2.
GPU Kernel Launch Overhead:
Another overhead is caused by the GPU kernel launch, which introduces DMA transfers of the GPU kernel code, driver set-up for kernel execution, and other device-related operations. This sets a lower bound on the time the OS kernel must wait for the GPU code to complete, so the lower we can make this overhead, the more code can potentially benefit from GPU acceleration. This overhead is not high when the GPU kernel execution time or the data copy overhead dominates the total execution time, as is the case for most GPGPU computing, which is throughput-oriented [8]. OS kernel workloads, on the other hand, are likely to be dominated by a large number of smaller tasks, and the latency of each operation is of greater importance. Though larger tasks can be created by batching many small requests, doing so increases the latency for each request.

CUDA has a "stream" [18] technology that allows host-side execution to proceed concurrently with GPU kernel execution and data copies. By itself, this helps to improve throughput, not latency, but we make use of it to communicate between code running on the GPU and CPU. Instead of launching a new GPU kernel every time the OS wants to call GPU code, we have designed a new GPU kernel execution model, which we call the Non-Stop Kernel (NSK). The NSK is small, is launched only once, and does not terminate. To communicate with the NSK, we have implemented a new CPU-GPU message-based communication method. It allows messages to be passed between the GPU and main memory while a GPU kernel is still running. This is impossible in traditional CUDA programming, in which the CPU has to explicitly wait for synchronization with the GPU. We use pinned memory to pass these messages, and NVIDIA's streaming features to asynchronously trigger transfers of the message buffer back and forth between CPU and GPU memory. Requests are sent from the CPU to the NSK as messages. The NSK executes the requested service, which it has pre-loaded into the GPU memory.
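The mailbox protocol between the CPU and the NSK can be sketched as follows. This is a minimal C model of the idea, not KGPU's actual ABI: the struct fields and function names are our own, and a plain in-memory struct stands in for the pinned message buffer that CUDA streams shuttle between host and device:

```c
#include <stddef.h>

/* Hypothetical message layout for the CPU<->NSK mailbox. In KGPU the
 * buffer lives in pinned memory and is transferred asynchronously;
 * here it is just shared state. */
enum msg_type { MSG_NONE = 0, MSG_REQUEST, MSG_DONE };

struct nsk_msg {
    enum msg_type type;
    int service_id;    /* which pre-loaded service to run */
    size_t in_len;     /* input size in the pinned data buffer */
    int status;        /* filled in by the NSK on completion */
};

/* Stand-in for a pre-loaded GPU service: the NSK dispatches on
 * service_id instead of launching a fresh GPU kernel per call. */
static int run_service(int service_id, size_t in_len) {
    (void)in_len;
    return service_id >= 0 ? 0 : -1;
}

/* One iteration of the NSK's polling loop. On the GPU this loop
 * never terminates: it spins on the mailbox awaiting requests. */
void nsk_poll_once(struct nsk_msg *mbox) {
    if (mbox->type == MSG_REQUEST) {
        mbox->status = run_service(mbox->service_id, mbox->in_len);
        mbox->type = MSG_DONE;   /* completion notification to the CPU */
    }
}
```

The key property this models is that the request/completion handshake is just memory traffic, so it can overlap with a service that is already running, rather than requiring a synchronizing kernel launch.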
Similarly, the CPU receives completion notifications from the NSK using these messages.

We measured the time to launch an empty GPU kernel, transfer a small amount of input data (4KB) to it, and wait for it to return. Though most CUDA benchmarks measure only the execution time on the GPU, we measured time on the CPU to capture the entire delay the OS kernel will observe. NSK outperforms the traditional launch method by a factor of 1.3x, reducing the base GPU kernel launch time to . µs for a kernel with 512 threads, . µs for 1024 threads, and . µs for 2048 threads. While this is much larger than the overhead of calling a function on the CPU, as we will show in Section 3.3, the speedup in execution time can be well worth the cost.

Because of a limitation in CUDA that does not allow a running GPU kernel to change its number of threads dynamically, NSK switches to a traditional CUDA kernel launch model when a service requires more threads on the GPU. This switch will not be necessary in the future, once vendors provide the ability to create new GPU threads dynamically.

Figure 1: KGPU framework architecture

Our framework for calling the GPU is shown in Figure 1. It is divided into three parts: a module in the OS kernel, a user-space helper process, and NSK running on the GPU. The user-space helper is necessitated by the closed-source nature of NVIDIA's drivers and CUDA runtime, which prevent the use of CUDA directly from inside the kernel.

To call a function on the GPU, the OS kernel follows these steps:

• It requests one of the pinned-memory buffers, and fills it with the input. If necessary, it also requests a buffer for the result.

• It builds a service request. Services are CUDA programs that have been pre-loaded into NSK to minimize launch time. The service request can optionally include a completion callback.

• It places the service request into the request queue.
• It waits for the request to complete, either by blocking until the completion callback is called or by busy-waiting on the response queue.

The user-space helper for KGPU watches the request queue, which is in memory shared with the OS kernel. Upon receipt of a new service request, the helper DMAs the input data buffer to the GPU using the CUDA APIs. This can proceed concurrently with another service running on the GPU. When the DMA is complete, the helper sends a service request message to NSK using the message-passing mechanism described in Section 3.1. When the NSK receives the message, it calls the service function, passing it pointers to the input buffer and output buffer. When the function completes, the NSK sends a completion message to the CPU side, and resumes polling for new request messages. The user-level helper relays the result back to the OS kernel through their shared response queue.

To avoid a copy between the kernel module and the user-space helper, the pinned data buffers allocated by the CUDA driver are shared between the two. Also, because NSK allows the user-space helper to work asynchronously via messages, service execution on the GPU and data buffer copies between main memory and GPU memory can run concurrently. As a result, the data buffers locked in physical memory must be managed carefully to cope with these complex uses. On the CPU side, buffers can be used for four different purposes:

1. Preparing for a future service call by accepting data from a caller in the OS kernel

2. DMAing input data from main memory to the GPU for the next service call

3. DMAing results from the last service call from GPU memory to main memory

4. Finishing a previous service call by returning data to the caller in the OS kernel

Each of these tasks can be performed concurrently, so, along with the service currently running on the GPU, the total depth of the service call pipeline is five stages.
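The calling sequence above can be sketched as a shared ring of service requests. This is a hypothetical single-threaded C model of the kernel/helper hand-off; the struct fields, queue size, and function names are ours, not KGPU's real API:

```c
#include <stddef.h>

/* Hypothetical service-request record placed in the shared queue. */
struct kgpu_request {
    int service_id;                    /* pre-loaded GPU service to run */
    void *in, *out;                    /* pinned-memory data buffers    */
    size_t in_len;
    void (*callback)(struct kgpu_request *);  /* optional completion cb */
    volatile int done;                 /* set once results are back     */
};

#define QLEN 16
static struct kgpu_request *req_queue[QLEN];
static int req_head, req_tail;

/* Kernel side: place the request in the shared queue for the
 * user-space helper to pick up. Returns 0 on success, -1 if full. */
int kgpu_submit(struct kgpu_request *r) {
    int next = (req_tail + 1) % QLEN;
    if (next == req_head) return -1;   /* queue full */
    r->done = 0;
    req_queue[req_tail] = r;
    req_tail = next;
    return 0;
}

/* Helper side: dequeue the next pending request, or NULL if empty. */
struct kgpu_request *kgpu_next_request(void) {
    if (req_head == req_tail) return NULL;
    struct kgpu_request *r = req_queue[req_head];
    req_head = (req_head + 1) % QLEN;
    return r;
}
```

In the real framework the two sides run in different protection domains and the queue lives in memory mapped into both, so the enqueue/dequeue indices would also need proper memory-ordering care, which this sketch omits.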
In the current KGPU prototype, we statically allocate four buffers, and each changes its purpose over time. For example, after a buffer is prepared with data from the caller, it becomes the host-to-GPU DMA buffer.

On the GPU, we use three buffers: at the same time that one is used by the active service, a second may receive input for the next service from main memory via DMA, and a third may be copying the output of the previous service to main memory.
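The CPU-side buffer rotation can be modeled as a simple state machine. This is our illustrative reading of the four purposes listed above, not the prototype's actual code; the enum names and advance order are assumptions:

```c
/* Illustrative model of KGPU's CPU-side buffer rotation. */
enum buf_role {
    BUF_PREPARE,   /* accepting input data from an OS-kernel caller   */
    BUF_H2G_DMA,   /* input being DMAed from main memory to the GPU   */
    BUF_G2H_DMA,   /* results being DMAed from the GPU to main memory */
    BUF_FINISH     /* results being returned to the caller            */
};

/* Each statically allocated buffer advances through the roles in
 * order, wrapping back to PREPARE once its results are handed back. */
enum buf_role next_role(enum buf_role r) {
    switch (r) {
    case BUF_PREPARE: return BUF_H2G_DMA;
    case BUF_H2G_DMA: return BUF_G2H_DMA;
    case BUF_G2H_DMA: return BUF_FINISH;
    default:          return BUF_PREPARE;
    }
}

/* With four buffers each in a distinct role, all four CPU-side stages
 * can run concurrently; the service executing on the GPU forms the
 * fifth stage of the pipeline. */
void rotate_all(enum buf_role roles[4]) {
    for (int i = 0; i < 4; i++)
        roles[i] = next_role(roles[i]);
}
```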
To demonstrate the feasibility of KGPU, we implemented the AES encryption algorithm as a service on the GPU for the Linux crypto subsystem. Our implementation is based on an existing CUDA AES implementation [4], and uses the ECB cipher mode for maximum parallelism. We ran a microbenchmark to compare its performance with the original CPU version in the Linux kernel, which is itself optimized using special SSE instructions on the CPU. We used a 480-core NVIDIA GTX 480 GPU, a quad-core Intel Core i7-930 2.8 GHz CPU, and 6GB of DDR3 PC1600 memory. The OS is Ubuntu 10.04 with Linux kernel 2.6.35.3.

We get a performance increase of up to 6x, as shown in Figure 2. The results show that the GPU AES-ECB outperforms the CPU implementation when the size of the data is 8KB or larger, which is two memory pages when using typical page sizes. So, kernel tasks that depend on per-page encryption/decryption, such as encrypted filesystems, can be accelerated on the GPU.
Figure 2: Encryption performance of KGPU AES. Decryption, not shown, has similar performance.
The GPU-augmented OS kernel opens new opportunities for systems software, with the potential to bring performance improvements, new functionality, and security enhancements into the OS.

We will continue to develop and improve KGPU and to implement more GPU functions in our framework. One such improvement will be dynamically dispatching tasks to the CPU or GPU depending on their size. As seen in Figure 2, the overheads associated with calling the GPU mean that small tasks may run faster on the CPU. Since the crossover point will depend on the task and the machine's specific hardware, a good approach may be to calibrate it using microbenchmarks at boot time. Another improvement will be to allow other kernel subsystems to specifically request allocation of memory in the GPU pinned region. In our current implementation, GPU inputs must be copied into these regions and the results copied out, because the pinned memory is used only for communication with the GPU. By dynamically allocating pinned buffers, and allowing users of the framework to request memory in this region, they can manage structures such as filesystem blocks directly in pinned memory, and save an extra copy. This would also allow multiple calls to be in the preparing and post-service callback stages at once.

We expect that future developments in GPUs will alleviate some of the current limitations of KGPU. While the closed nature of current GPUs necessitates interacting with them from user space, the trend seems to be towards openness; AMD has recently opened their high-end 3D GPU drivers and indicated that drivers for their upcoming APU platform will also be open source. Furthermore, by combining a GPU and CPU on the same die, APUs such as Intel Sandy Bridge and AMD Fusion are likely to remove the memory copy overhead through cache shared between CPU cores and GPU cores; lower copy overhead will mean that the minimum-sized task that can benefit from GPU offloading will drop significantly.
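The boot-time calibration idea can be sketched in a few lines: measure both paths at several sizes, record the smallest size at which the GPU wins, and dispatch on that threshold afterwards. All names here are hypothetical illustrations of the approach, not KGPU code:

```c
#include <stddef.h>

/* One boot-time measurement for a given task size. */
struct calib_point {
    size_t size;       /* task input size in bytes                    */
    double cpu_usec;   /* measured CPU time for this size             */
    double gpu_usec;   /* measured GPU time (incl. launch and copies) */
};

/* Returns the smallest measured size at which the GPU path is faster,
 * or 0 if the GPU never wins in the measured range. Assumes points
 * are sorted by ascending size. */
size_t find_crossover(const struct calib_point *pts, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (pts[i].gpu_usec < pts[i].cpu_usec)
            return pts[i].size;
    return 0;
}

/* Per-request dispatch decision made at run time. */
int should_use_gpu(size_t task_size, size_t crossover) {
    return crossover != 0 && task_size >= crossover;
}
```

Because the crossover depends on both the task and the hardware, a real implementation would keep one threshold per service rather than a single global value.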
References

[1] CUDA. .
[2] GTX 580 GPU. .
[3] Kaspersky Lab. .
[4] OpenSSL CUDA AES Engine. http://code.google.com/p/engine-cuda.
[5] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: a new OS architecture for scalable multicore systems. SOSP 2009. ACM.
[6] S. Berger, R. Cáceres, K. A. Goldman, R. Perez, R. Sailer, and L. van Doorn. vTPM: virtualizing the trusted platform module. USENIX Security 2006.
[7] C. Frost, M. Mammarella, E. Kohler, A. de los Reyes, S. Hovsepian, A. Matsuoka, and L. Zhang. Generalized file system dependencies. SOSP 2007. ACM.
[8] M. Garland and D. B. Kirk. Understanding throughput-oriented architectures. Comm. ACM, 53:58–66, 2010.
[9] D. Gupta, S. Lee, M. Vrable, S. Savage, A. C. Snoeren, G. Varghese, G. M. Voelker, and A. Vahdat. Difference engine: harnessing memory redundancy in virtual machines. OSDI 2008.
[10] S. Han, K. Jang, K. Park, and S. Moon. PacketShader: a GPU-accelerated software router. SIGCOMM 2010.
[11] P. Harish and P. J. Narayanan. Accelerating large graph algorithms on the GPU using CUDA. HiPC 2007.
[12] O. Harrison and J. Waldron. Practical symmetric key cryptography on modern graphics hardware. USENIX Security 2008.
[13] G. C. Hunt and J. R. Larus. Singularity: rethinking the software stack. SIGOPS OSR, 41:37–49, 2007.
[14] K. Jang, S. Han, S. Han, S. Moon, and K. Park. Accelerating SSL with GPUs. SIGCOMM 2010. ACM.
[15] A. S. Jiva and G. R. Frost. GPU assisted garbage collection. .
[16] M. Krohn, A. Yip, M. Brodsky, N. Cliffer, M. F. Kaashoek, E. Kohler, and R. Morris. Information flow control for standard OS abstractions. SOSP 2007. ACM.
[17] E. B. Nightingale, O. Hodson, R. McIlroy, C. Hawblitzel, and G. Hunt. Helios: heterogeneous multiprocessing with satellite kernels. SOSP 2009. ACM.
[18] NVIDIA. CUDA C Programming Guide 3.2.
[19] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1):80–113, 2007.
[20] T. Prabhu, S. Ramalingam, M. Might, and M. Hall. EigenCFA: accelerating flow analysis with GPUs. POPL 2011.
[21] K. Tangwongsan, H. Pucha, D. G. Andersen, and M. Kaminsky. Efficient similarity estimation for systems exploiting data redundancy. INFOCOM 2010.
[22] G. Vasiliadis, M. Polychronakis, S. Antonatos, E. P. Markatos, and S. Ioannidis. Regular expression matching on graphics hardware for intrusion detection. RAID 2009.