Memory virtualization in virtualized systems: segmentation is better than paging
Boris Teabe, Peterson Yuhala, Alain Tchana, Fabien Hermenier, Daniel Hagimont, Gilles Muller
Boris TEABE
Université de Toulouse
Peterson YUHALA
Université de Neuchâtel
Alain TCHANA
ENS Lyon
Fabien HERMENIER
Nutanix
Daniel HAGIMONT
Université de Toulouse
Gilles MULLER
Inria
Abstract
The utilization of paging for virtual machine (VM) memory management is the root cause of memory virtualization overhead. This paper shows that paging is not necessary in the hypervisor. In fact, memory fragmentation, which explains paging utilization, is not an issue in virtualized datacenters thanks to VM memory demand patterns. Our solution Compromis, a novel Memory Management Unit, uses direct segments for VM memory management combined with paging for the VM's processes. The paper presents a systematic methodology for implementing Compromis in the hardware, the hypervisor and the datacenter scheduler. Evaluation results show that Compromis outperforms the two popular memory virtualization solutions, shadow paging and Extended Page Table, by up to 30% and 370% respectively.
ACM Reference format:
Boris TEABE, Peterson YUHALA, Alain TCHANA, Fabien HERMENIER, Daniel HAGIMONT, and Gilles MULLER. 2016. Memory virtualization in virtualized systems: segmentation is better than paging. In Proceedings of ACM Conference, Washington, DC, USA, July 2017 (Conference'17), 14 pages.
1 Introduction

Virtualization has become the de facto cloud computing standard because it brings several benefits such as optimal server utilization, security, fault tolerance and quick service deployment [1, 3, 10]. However, there is still room for improvement, mainly at the memory level, which represents up to 90% [45] of the global virtualization overhead.

Memory virtualization overhead comes from the necessity to manage three address spaces (application, guest OS and host OS) instead of two (application and OS) as in native systems. Shadow paging [43] is the most popular memory virtualization solution. Each page table inside the guest OS is shadowed by a corresponding page table inside the hypervisor, which contains the real mapping between Guest Virtual Addresses (GVA) and Host Physical Addresses (HPA). Thus, shadow page tables are those used for address translation by the hardware page table walker (which resides inside the Memory Management Unit, MMU). Page tables inside the guest OS are never used. Shadow paging leads to a one-dimensional (1D) page walk on TLB miss, as in a native system. However, building shadow page tables comes with costly context switches between the guest OS and the hypervisor for synchronization. Nested/Extended Page Table (EPT) [15, 42] has been introduced to avoid this page table synchronization cost. It improves the page table walker so that it walks through two page tables (from the guest and from the hypervisor) at the same time, in a 2D manner. Thus, building the shadow page table does not require the protection of the guest's page tables, which drastically reduces the number of context switches. However, this solution induces several memory accesses during address translation due to the 2D page walk mechanism. In a radix-4 page table [40] (the most popular case) for instance, this 2D page walk leads to 24 memory accesses on each TLB miss instead of 4 in a native system, resulting in significant performance degradation.

While many works proved the effectiveness of paging when dealing with processes (e.g., for reducing memory fragmentation), to our knowledge there is no clear assessment of its effectiveness when dealing with virtual machines (VMs). One explanation is that the implementation of hypervisors was inspired by the bare metal implementation of OSes. By analyzing traces from two public clouds (Microsoft Azure and Bitbrain) and 308 private clouds (managed by Nutanix, a worldwide private cloud provider), we show in Section 4 that paging is not mandatory for managing memory allocated to VMs. In fact, we found that memory fragmentation is not an issue in virtualized datacenters thanks to VM memory sizing and arrival rate.

This paper presents Compromis, a novel MMU solution which uses segmentation for VMs and paging for processes inside VMs. Compromis allows a 1D page walk and generates zero context switches to virtualize memory. Compromis is inspired by Direct Segment (DS), introduced by Basu et al. [13] for native systems. For memory hungry applications which allocate their entire memory at start time and self-manage it at runtime (e.g., the Java Virtual Machine), their virtual memory space can be directly mapped to a large physical memory segment identified by a triple (Base, Limit, Offset). This way, the translation of a virtual address va is given by a simple register to register addition (va + Offset). Compromis generalizes DS to DS-n, allowing the provisioning of a VM with several memory segments. In Compromis, every processor context includes n new registers for address translation. Contrary to other DS based solutions [9, 24], Compromis considers the entire VM memory and requires no guest OS or application modification.

This paper also investigates the systems implications and presents a systematic methodology for adapting the hypervisor and other cloud services (e.g., the datacenter scheduler) to make Compromis effective. To the best of our knowledge, this is the first DS based approach in virtualized systems which puts the entire puzzle together.

We have implemented a whole prototype in both Xen and KVM virtualized systems managed by OpenStack. This means, first, the integration of our DS aware memory allocation algorithm; second, the improvement of the Virtual Machine Control Structure (VMCS) data structure for configuring the new hardware registers introduced by DS-n; finally, the improvement of OpenStack's VM placement algorithm to minimize the number of memory segments. For the evaluations, since our solution relies on the modification of the hardware, we mimicked the functioning of a VM which runs on a DS-n machine as follows. We run the VM in para-virtualized (PV) mode [11] because the latter uses a 1D page walk, as DS-n does. However, in PV all page table modifications performed by the VM kernel trap into the hypervisor using hypercalls. To avoid this behavior, which will not exist on a DS-n machine, we have modified the guest kernel to directly set page table entries with the correct HPAs, calculated in the same way as a DS-n hardware would have done.

To be exhaustive, the paper makes the following contributions:
• We first study the potential effectiveness of DS in virtualized datacenters. In other words, we answer this question: regarding VM memory demands, arrival times and departure times, is it likely to provision all or the majority of the VMs with one large memory segment? To answer this question, we study memory fragmentation in virtualized systems by analyzing traces from two public clouds (Microsoft Azure [21] and Bitbrain [39]) and 308 private clouds. We found that, using a DS aware memory allocation system, memory fragmentation is not a critical issue in virtualized datacenters as it is in native ones.
• Drawing on this conviction, we propose DS-n, a generalization of DS to provision a VM with multiple memory segments. We present the necessary hardware modifications required by DS-n.
• We propose a DS aware VM memory allocation algorithm which minimizes the number of memory segments to use.
• We evaluated the performance gain of DS-n using an accurate methodology on a real environment. The main results are as follows. First, the analyzed datacenter traces show that it is possible to provision up to 99.99% of the VMs with one memory segment, while three segments are sufficient to provision all the VMs. Second, concerning the performance gain, DS-n reduces the memory virtualization overhead to only 0.35%, outperforming both shadow paging and EPT by up to 30% and 370% respectively. The results also show that our memory allocation algorithm runs faster than traditional ones: Xen's algorithm is outperformed by 80%.

The remainder of the paper is as follows. Section 2 presents the necessary background to understand our contributions. Section 3 evaluates the limitations of state-of-the-art solutions. Section 4 presents the analysis of several production datacenter traces and validates the opportunity to apply DS in virtualized systems. Section 5 presents the necessary hardware and software improvements to make DS-n effective. Section 6 presents the evaluation results. Section 7 presents the related work. Section 8 concludes the paper.
2 Background

This section presents the two main techniques used to achieve memory virtualization, namely shadow paging [43] and Extended Page Table (noted EPT) [15, 42].
2.1 Shadow paging

Shadow paging is a software memory virtualization technique. In shadow paging, the hypervisor creates a shadow Page Table (PT) for each guest PT. This shadow PT holds the complete translation from GVA to HPA. It is walked by the Hardware Page Table Walker (HPTW) on a Translation Lookaside Buffer (TLB) miss. Guest Page Tables (GPTs) are fake ones; they are not exploited. To put shadow paging in place, the hypervisor write-protects both the CR3 register (which holds the current PT address) and the GPTs. Each time the guest OS attempts to modify these structures, it traps in the hypervisor, which fixes the CR3 register or the shadow PT. Using shadow paging, the HPTW only performs a 1D page walk as in a native system, leading to 4 memory accesses. However, the resulting context switches severely degrade the VM performance.
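To illustrate where the synchronization cost comes from, the toy model below (our illustration, not code from the paper; all structure and names are ours) mimics, at page granularity, how a hypervisor keeps a shadow page table consistent: every guest page-table update is intercepted and re-translated through the physical-to-machine mapping before being installed in the shadow table actually walked by the hardware.

```python
# Toy model of shadow paging (illustrative only, page granularity).
class ShadowPagingVM:
    def __init__(self, p2m):
        self.p2m = p2m          # GPA page -> HPA page (hypervisor knowledge)
        self.guest_pt = {}      # GVA page -> GPA page (maintained by the guest OS)
        self.shadow_pt = {}     # GVA page -> HPA page (walked by the hardware)

    def guest_set_pte(self, gva_page, gpa_page):
        # The guest PT is write-protected: this write traps into the hypervisor,
        # which is the source of the costly VMExit/VMEntry round trips.
        self.guest_pt[gva_page] = gpa_page
        self.hypervisor_fixup(gva_page, gpa_page)

    def hypervisor_fixup(self, gva_page, gpa_page):
        # The hypervisor completes the GVA -> HPA translation and installs it
        # in the shadow PT, so the hardware only ever does a 1D walk.
        self.shadow_pt[gva_page] = self.p2m[gpa_page]

vm = ShadowPagingVM(p2m={7: 1042})
vm.guest_set_pte(gva_page=3, gpa_page=7)
assert vm.shadow_pt[3] == 1042
```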
2.2 Extended Page Table

EPT (also called Nested Page Table) is a hardware-assisted memory virtualization solution proposed by chip vendors such as Intel and AMD. It relies on a two-layer PT. The first PT layer resides in the guest address space and is exclusively managed by its OS, at the rate of one PT per process. This first layer PT thus contains GPAs which point to guest pages in the guest address space. Every process context switch triggers the setting, by the guest OS, of the CR3 register with the GPA of the scheduled-in process's PT address. The second PT layer resides in the hypervisor, at the rate of one PT per VM. This PT represents the address space of the guest and it includes HPAs which point to pages (real pages in RAM) in the host address space. Every vCPU context switch triggers the setting, by the hypervisor, of the nested CR3 register (nCR3) with the HPA of the scheduled-in VM's PT address. On a TLB miss, the hardware page table walker translates a virtual address va into the corresponding HPA by performing a 2-dimensional page walk, leading to 24 memory accesses (with radix-4 page tables, each of the four guest page-table levels, plus the final guest physical address, must itself be translated through the four-level EPT, i.e., 5 × 4 = 20 EPT accesses in addition to the 4 guest page-table accesses).

3 Assessment of the memory virtualization overhead

This section presents the overhead of memory virtualization in both native and virtualized systems. Note that even in a native system, the expression "memory virtualization" is used because of the mapping of the process linear address space to the physical address space.

Benchmark        Description
SPEC CPU 2006    Compute multi-threaded workloads
PARSEC 3.0       Compute multi-threaded workloads
Redis            In-memory database
Elastic Search   In-memory database

Table 1. Benchmarks used for assessment and evaluation of our solution.

Native   total cycles of all page walks
EPT      total cycles of all (GPT + EPT) walks
Shadow   total cycles of all (hypervisor-level PT walks + VMEntry + VMExit + handler)

Table 2. Formulas to estimate the overhead of memory virtualization ("handler" is the handler which treats the VMExit generated when the guest OS attempts to modify the page table).
Methodology.
Table 1 lists the benchmarks used to evaluate the performance overhead of memory virtualization. We do this while varying the memory page size in both the hypervisor and the guest OS; this way, we also evaluate huge-page based solutions. The notation gX-hY means that X and Y are respectively the memory page sizes in the guest OS and the hypervisor. The evaluation metric is the time taken by both the hardware and the software for memory virtualization. Table 2 presents how this metric is calculated for each virtualization technology. We rely on both Performance Monitoring Counters (PMC) and low-level software probes that we have written for the purpose of this paper. Details on the experimental setup are given in Section 6.2.1.

Results.
Figure 1 presents the results, interpreted as follows. First, even in native systems memory virtualization takes a significant proportion of the execution time of an application, up to 42% for mcf. Second, running applications in a virtualized environment increases that duration, up to 50.93% for Elastic Search under shadow paging. Third, shadow paging incurs more overhead than EPT for the majority of applications, up to 43.89% of difference for vips. Finally, we observe that even when huge pages are used simultaneously in the guest OS and the hypervisor, the memory virtualization overhead is still high, almost 31.5% for the Redis benchmark. Previous works [9, 13, 24, 35] have reached the same conclusion regarding the use of huge pages.
Synthesis.
These results show that the overhead of memory virtualization is very significant in a virtualized system, even with huge pages. The root cause of this overhead is the utilization of paging as the memory virtualization basis for VMs.
Figure 1. Proportion of CPU time used for memory virtualization in native (4KB), virtualized shadow paging (SP, g4KB) and virtualized EPT (g4KB-h2MB and g2MB-h2MB), for the SPEC CPU, PARSEC, Redis and Elastic Search benchmarks.

4 The (ir)relevance of paging for VMs

Several research works tried to reduce the overhead of memory virtualization in virtualized systems. However, no work has questioned the relevance of paging in this context. This section studies the (ir)relevance of paging when dealing with VMs. To this end, we compare paging with segmentation, which is the alternative approach that has been left out.

Paging consists in organizing both the virtual address space of a process and the physical address space into fixed size memory chunks (4KB, 2MB, etc.) called pages. Thus, each virtual page can be housed in any physical page frame. The process PT and the HPTW make it possible to find the actual mapping of a virtual page to a physical page address. Segmentation, on the other hand, organizes both the virtual and the physical address spaces in the form of variable size memory chunks called segments. The size of the latter is chosen by the programmer. The correspondence between virtual segments and physical segments is provided by a segment table. The virtual address to physical address mapping is done by a simple addition.

The main reasons which promote paging over segmentation in native systems are as follows. (R1) Paging is invisible to the programmer; this is not the case with segmentation, which hardens application programming. (R2) Paging makes the implementation of the memory allocator in the OS easier. Indeed, it only requires a list of free pages, and choosing any page within this list upon receiving a memory allocation request is sufficient. This is important for scalability. With segmentation, there is a need to find an appropriate physical segment that satisfies the size of the virtual segment requested by the application. (R3) Paging limits memory fragmentation (internal fragmentation within pages is always possible, but it is negligible compared to the external fragmentation caused by segmentation), which is not the case with segmentation. (R4) Paging allows overcommitment, which is useful for optimizing memory utilization.

The question is whether all these reasons are valid when manipulating VMs. To answer this question, we analyze the relevance of each reason in virtualized systems. Before doing this, note that when we talk about memory management in a virtualized system, we are talking about the allocation of physical memory to VMs and not about memory allocation to applications inside VMs.
Relevance of R1 (Segmentation hardens application programming). This reason is valid in native environments (when dealing with applications) because application programmers do not have the expertise to manage segment sizes in a segmentation based system. Moreover, it is not at the heart of the business logic of their application. When dealing with VMs, the developers are OS developers, who are experts. Leaving OS developers the responsibility to manage memory segments is within their reach.

Relevance of R2 (Paging makes memory allocation easier). It is necessary to facilitate the work of the memory allocator for scalability purposes. In a native system, the memory allocator is subject to thousands of memory allocation and deallocation requests per second. This is not true when dealing with VMs. Each VM performs only one allocation (at startup) and one deallocation (at shutdown). Thus, the frequency of memory allocation and deallocation requests received by the hypervisor is not of the same order of magnitude as the one received by the OS in a native system. Table 3 presents the average memory allocation frequency received by a server in a native system and in virtualized private and public clouds (see Section 6 for more details on the analyzed datasets). We observe a phenomenal difference between native and virtualized systems, the latter being quite stable. Given the extremely low values for virtualized datacenters, the difficulty of finding free memory chunks is not a concern with segmentation when dealing with VMs.

Dataset                                      Alloc./Hour/Server
Native - Our lab machine
Virtualized - Private clouds
Virtualized - Microsoft Azure public cloud

Table 3. Memory allocation frequency (per hour on a server) in native and virtualized datacenters.

Relevance of R3 (Paging limits fragmentation). Fragmentation is due to the heterogeneity of memory demand sizes. Indeed, a system in which all demand sizes are identical would not suffer from fragmentation. To verify whether fragmentation could be a problem in a virtualized datacenter, we analyzed the memory demand sizes of the traces from the datacenters presented in Table 3. Figure 2 shows the CDFs. We observe that public clouds stand out with very concentrated demand sizes (14). This is because in public clouds VM sizes are imposed. Things are slightly different in private clouds (201), where there is more freedom in VM size definition. By contrast, demand sizes vary much more in native systems (25k) than in virtualized environments. These results show that fragmentation is not a relevant issue when dealing with VMs.

Figure 2. CDF of memory demand sizes in different datacenter types: native (25k demand sizes/1,969,716 demands), Azure (14 VM sizes/2,013,767 VMs) and private clouds (201 VM sizes/301,440 VMs).

Relevance of R4 (Paging allows memory overcommitment). Overcommitment is a practice which allows reserving more memory than the physical machine actually has. It exploits the fact that all applications do not require their entire memory demand at the same time. As a result, resource waste is avoided. However, overcommitment comes with performance degradation (during memory reclamation) and performance unpredictability [31]. These limitations are acceptable in a native system because there is no contract between application owners and the datacenter owner; they both belong to the same company. Best effort is the practice in such contexts. Things are different in a virtualized datacenter, especially in commercial clouds. In the latter, the datacenter operator should respect the contract signed with the VM owner, who paid for the reserved resources. Therefore, even if a VM is not using its resources, these resources have already been amortized. The necessity to avoid resource waste is less critical compared to a native system. Furthermore, the implementation of overcommitment in a virtualized system is challenging because of the VM blackbox nature [31]. It requires expertise in the workload and the system to configure it and to react in case of performance issues. As a consequence, no public cloud supports it. Private cloud providers either do not support it (Nutanix), disable it by default (VMware TPS, Hyper-V dynamic memory) or enable it with extra warnings (RedHat with KVM).

5 Compromis: a DS based memory virtualization approach for VMs
Compromis is a hardware memory virtualization solution implemented within the MMU that exploits the strengths of both direct segment (DS) and paging. The former is used by the hypervisor to deal with VMs while the latter is used by the guest OS to deal with processes. The innovation is the utilization of DS instead of paging by the hypervisor. Considering the fact that it may be impossible to satisfy a VM demand using a single memory segment, Compromis generalizes DS to DS-n. In the latter, a VM which is allocated k segments, with 1 ≤ k ≤ n, uses the Compromis hardware feature. This section presents the set of improvements that should be applied to the datacenter stack in order to make Compromis effective.
5.1 Overview

Figure 3 presents the general operations of a datacenter using Compromis. When a user requests a VM instantiation from a flavor, the cloud scheduler selects the physical machine which will host the VM according to its placement policy. In a Compromis aware datacenter, this policy is extended to choose the machine with the greatest chance of allocating large memory segments to the VM. To this end, the cloud scheduler quickly simulates the execution of the memory allocator implemented by the hypervisor of the compute nodes. This simulation is built by the cloud scheduler on top of the current state of the memory layout of every machine, which is locally stored and periodically updated (see Section 5.4).

When the hypervisor of the selected physical machine receives the VM instantiation request, it reserves the memory for the VM in the form of large memory segments rather than small page chunks as it is currently done (see Section 5.3). If the number of segments used to satisfy the VM is lower than or equal to n, then the hypervisor configures the VM in DS-n mode, a new mode that the hardware implements (see Section 5.5.1). Otherwise, the VM is configured in shadow or EPT mode, depending on the datacenter operator. In DS-n mode, the hardware performs an address translation by doing a 1D page walk (instead of a 2D one) followed by a series of register to register operations (see Section 5.2). Notice that a Compromis aware machine can simultaneously run DS-n and non DS-n VMs. The next subsections detail the modifications that should be applied to each datacenter layer for building Compromis.

Figure 3. General functioning of a datacenter which implements Compromis.

5.2 Hardware modifications

A hardware which implements Compromis includes new registers to indicate the mapping of GPA segments (in the guest address space) to HPA segments (in the host address space). The value of each register comes from a Virtual Machine Control Structure (VMCS), configured by the hypervisor at VM startup (see Section 5.3). The number of added registers is a function of n: n−1 guest base registers (noted GBReg_1, ..., GBReg_{n−1}; no such registers in DS-1), n host base registers (noted HBReg_0, ..., HBReg_{n−1}), and the limit register. These registers indicate the mapping as follows. The GPA segment [0, GBReg_1 − 1] is mapped to the HPA segment [HBReg_0, HBReg_0 + GBReg_1 − 1]. The GPA segment [GBReg_{i−1}, GBReg_i − 1] is mapped to the HPA segment [HBReg_{i−1}, HBReg_{i−1} + (GBReg_i − GBReg_{i−1})] (where GBReg_i − GBReg_{i−1} is the size of this segment). For a VM with k segments, the last GPA segment [GBReg_{k−1}, GBReg_{k−1} + (limit − HBReg_{k−1})] is mapped to the HPA segment [HBReg_{k−1}, limit].

Figure 4. Address translation handling in two DS-n machine types (top: DS-1, bottom: DS-4).

Once the configuration of the registers is made by the hypervisor, the translation of a virtual address va into the corresponding HPA hpa for a DS-n type VM (whose number of segments is lower than or equal to n) is summarized in Figure 4. Firstly, the MMU performs a 1D GPT walk, taking va as input. This operation returns a GPA gpa. Then hpa is calculated as follows:

hpa = HBReg_i + (gpa − GBReg_i)    (1)

with [GBReg_i, ∗] being the smallest GPA segment which contains gpa. If no such segment exists, a boundary violation is raised and traps into the hypervisor as a "DS-n violation" exception. More generally, for each gpa extracted from a GPT layer, an offset addition followed by a comparison is performed, meaning that every EPT walk is replaced by these two operations. For instance, when the VM has only one segment, the computation of hpa is as follows:

hpa = HBReg_0 + gpa    (2)

There is a boundary violation here if hpa is greater than limit. The performance benefit of these operations against the 2D page walk done in EPT is discussed in Section 6.
5.3 Hypervisor modifications

The hypervisor needs two main changes: the integration of a memory allocator for VMs and the configuration of the VMCS to indicate DS-n type VMs.

Memory allocator. We assume that the physical memory is organized in two parts: the first part is reserved for the hypervisor and privileged VM tasks while the second part is dedicated to user VMs. This memory organization is found in almost all popular hypervisors. In Compromis, the first memory part is managed using the traditional memory allocator. Concerning the second memory part, a new allocation algorithm is used to enforce large memory segment allocation to VMs. This section describes this new allocator.

Implementing a memory allocator requires answering three questions: (Q1) which data structure to use for storing information about free memory segments? (Q2) how do we choose elements from this data structure when responding to an allocation request? (Q3) how do we insert an element into this data structure when there is a memory release?

Answer to Q1: data structure. We use a doubly linked list to describe free memory segments. Each element in the list describes a segment using three variables:
• base: start address of the segment;
• limit: end address of the segment;
• date: allocation date of the segment containing base−1.
The elements of the list are ordered in ascending order of base. Hereafter, an element of the list is noted [base, limit, date].

Answer to Q2: allocation policy. When the hypervisor receives a request to start a VM with a memory demand M, it goes through the list described above to find out which segments should be allocated to the VM. If it finds a segment of size M, then that segment is taken off the list and allocated to the VM. Otherwise, the allocator chooses the largest segment S_b = [base_b, limit_b, date_b] among the segments whose size is greater than M. The VM is satisfied with a portion of S_b. Note that taking the largest segment prevents the multiplication of small segments, which are bad for a DS based approach. If there is no segment larger than M, then two options are possible. The first option (Opt1) satisfies the VM with the smallest segments; this gives the VMs which come later a chance to obtain the big free segments. The second option (Opt2) chooses the largest segment that exists and executes the above algorithm with a new memory size M′, with M′ equal to M minus the size of the chosen segment. Compromis offers these two options to support various workload patterns and datacenter constitutions. A workload pattern is the set of VM instantiation and shutdown requests submitted to the datacenter during a period of time. The constitution of a datacenter is the set of physical machine sizes. The option selection is the responsibility of the cloud scheduler, which has a global view of the datacenter (see Section 5.4).
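The core of this policy can be sketched as follows (our simplification: the date field and Opt1 are omitted, and segments are plain (base, limit) pairs). The free list, sorted by base, is scanned for an exact fit; failing that, the demand is carved out of the largest free segment, and when even the largest segment is too small the remainder is satisfied recursively, which is the Opt2 behaviour.

```python
def allocate(free_list, size):
    """Sketch of the Opt2 policy: return (base, limit) segments covering 'size'."""
    # Exact fit: hand the whole segment to the VM.
    for seg in free_list:
        base, limit = seg
        if limit - base + 1 == size:
            free_list.remove(seg)
            return [seg]
    # Otherwise use the largest free segment (memory exhaustion is not handled).
    largest = max(free_list, key=lambda s: s[1] - s[0] + 1)
    free_list.remove(largest)
    base, limit = largest
    seg_size = limit - base + 1
    if seg_size > size:
        # Carve a 'size'-byte portion and keep the remainder in the sorted list.
        free_list.append((base + size, limit))
        free_list.sort()
        return [(base, base + size - 1)]
    # No single segment is large enough (Opt2): take the largest one entirely
    # and satisfy the remaining demand recursively.
    return [largest] + allocate(free_list, size - seg_size)

# Example: a 3 GB demand against a fragmented layout (a 2 GB and a 1 GB hole).
GB = 1 << 30
free = [(0, 2 * GB - 1), (4 * GB, 5 * GB - 1)]
print(allocate(free, 3 * GB))   # -> the VM is satisfied with two segments
```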
Answer to Q3: taking freed memory into account. Stopping a VM results in the release of its memory, which has to be inserted into the list of free memory segments. Let S be a memory segment to insert into the list. The insertion is as follows. If S coincides with the beginning or the end of a segment S′ in the list, then S′ is simply extended (forward or backward). If this extension causes the new big segment to coincide with the beginning or the end of the segments that follow or precede it, then the extension continues. If there is no border coincidence between S and the existing segments in the list, then S is inserted in the list so that the ascending order is respected.

VM type configuration. Let k be the number of memory segments allocated to a VM. If k ≤ n then the VM is of type DS-n, otherwise it is configured with EPT or shadow paging according to the datacenter administrator's choice. The type configuration of a VM is done by modifying the VMCS of its vCPUs. To indicate that a VM is of type DS-n, a new bit of the Secondary Processor-Based VM-Execution Controls is set. Otherwise, this bit remains at zero. For DS-n VMs, the hypervisor also positions new VMCS fields that will be used to populate the GBReg, HBReg, and limit registers. The fields are populated in ascending order of the crossed segments. The values of the fields which map to the HBReg and limit registers come from the list of segments allocated to the VM. Concerning the fields which map to the GBReg registers, their values are calculated as they are filled. When k < n, the remaining fields are set to zero.

5.4 Cloud scheduler modifications

The cloud scheduler is improved for two purposes: a DS-n aware VM placement algorithm and memory allocation option selection.
VM placement algorithm improvement.
The placement algorithm determines the physical machine that will instantiate the VM. Traditionally, it has an objective such as load balancing. For example, the schedulers of OpenStack [23] and CloudStack [20] consist of a list of filters. Each filter implements a concern such as resource matchmaking [36], VM-VM or VM-host (anti-)affinities. A filter receives as argument a set of possible machines for the VM to boot and removes among them those not satisfying its concern. Each time the scheduler is invoked to decide where to place a VM, it chains the filters to eventually retrieve the satisfying machines and picks one among them.

To take benefit of DS-n, the VM scheduler must integrate inside its objective the maximization of the number of VMs of type DS-n. For a filter-based scheduler, this consists in implementing a new filter to append to the existing list. This filter maintains a local copy of the free memory segments on every machine and uses a simulator to evaluate the number of segments that will be used if the VM is instantiated on each machine. It then selects the machine leading to the least number of segments. For schedulers that do not rely on filters, the rationale is to weigh the existing objective against the one that consists in picking the machine minimizing the number of memory segments. Note that this cloud scheduler modification does not affect (reduce) the hosting capacity of the datacenter because the destination machine is selected among the original cloud scheduler candidates.
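A possible shape for such a filter is sketched below (illustrative only; SegmentMinimizingFilter and its methods are our names, not an actual OpenStack or CloudStack class, and it reuses the allocate() sketch given earlier): the hypervisor allocator is simulated on the locally cached free-segment layout of each candidate host, and only the hosts leading to the least number of segments are kept.

```python
from copy import deepcopy

class SegmentMinimizingFilter:
    """Hypothetical scheduler filter; names and structure are ours."""

    def __init__(self, host_layouts):
        # host -> locally cached copy of its free memory segment list,
        # periodically refreshed from the hypervisors.
        self.host_layouts = host_layouts

    def filter_hosts(self, candidate_hosts, vm_mem_size):
        # Simulate the hypervisor allocator (the allocate() sketch above)
        # on each candidate and count the segments the VM would obtain.
        costs = {}
        for host in candidate_hosts:
            layout = deepcopy(self.host_layouts[host])
            costs[host] = len(allocate(layout, vm_mem_size))
        best = min(costs.values())
        # Keep the hosts leading to the least number of segments; the other
        # filters and weighers of the scheduler then pick one of them.
        return [h for h in candidate_hosts if costs[h] == best]
```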
Memory allocation option selection.
Section 5.3.1 re-ported that the cloud scheduler has the responsibility to se-lect the memory allocation option that all hypervisors willuse. To this end, it embeds a memory allocator simulatorwhich implements the two options presented in 5.3.1. Thenit periodically (e.g., every week) replays in the simulator therecorded VM startup and shutdown logs. This is done whilevarying the memory allocation option. The selected optionis the one that produces the large number of DS-n VMs. All ypervisors are then notified with the name of the selectedoption and the logs repository is reset. We implemented
5.5 Implementation

We implemented Compromis in two popular hypervisors (Xen and KVM), as well as in OpenStack's Nova scheduler.

Implementation in Xen. The implementation of Compromis in Xen is straightforward. First, Xen already organizes the main memory in two parts as we wish. The first part is managed by the Linux memory allocator subsystem hosted within the privileged VM (dom0). The memory allocator for user VMs resides in the hypervisor core. It is invoked by the dom0 during the VM instantiation process. We simply replaced this allocator with the one described in Section 5.3.1. We validated the effectiveness of this algorithm by starting VMs (in hardware-assisted virtualization (HVM) mode) with single segments, while the hypervisor still uses EPT for address translation.

Concerning the configuration of the VM type, the modification of Xen does not require any particular description other than what has been said in Section 5.3.2. Concerning the handling of the cloud scheduler notifications related to the change of the memory allocation option, we define a new hypercall that informs the hypervisor of the name of the selected option.
Implementation in KVM.
Unlike Xen, KVM does not hold memory in two blocks. KVM relies on the Linux memory allocator, which sees VMs as normal processes. To implement Compromis in KVM, we first enforce the organization of the physical memory in two blocks. To this end, we use the cgroup mechanism. Then the default Linux memory allocator is associated with the first block while our memory allocator manages the second block. The /proc file system is used to record the memory allocation option imposed by the cloud scheduler.
Implementation in OpenStack Nova. The implementation of Compromis in OpenStack Nova is quite straightforward because Nova's placement algorithm is easy to identify. Its execution steps are also easy to identify, which makes its extension with a simulation of our memory allocator very simple. Concerning the periodical selection of the memory allocation option, we implemented a separate process which starts at the same time as Nova. That process relies on existing OpenStack logs to obtain VM startup and shutdown requests.
Memory overcommitment. Since Compromis allows DS-n, the implementation of memory overcommitment is possible by performing dynamic segment resizing, addition, or removal, combined with a slight cooperation between the guest OS and the hypervisor. A VM which needs more memory gains new segments or sees its segments extended. Conversely, a VM whose memory needs to be reduced will see either the size or the number of its segments reduced. The cooperation between the guest OS and the hypervisor is only necessary in this case. In fact, the hypervisor should indicate to the guest OS the range of GPAs that should be released by the VM (using the balloon driver mechanism). Indeed, the hypervisor is the only component which knows the segment ranges.
Memory Mapped IO (MMIO) region virtualization.
IO device emulation and direct IO are the two IO virtualization solutions implemented by hypervisors in HVM mode. The former solution, which is the most popular one, consists in protecting the virtual MMIO ranges seen by the guest OS so that all IO operations performed by the guest trap in the hypervisor. With this IO virtualization solution, the utilization of Compromis is straightforward since virtual MMIO regions are at the GPA layer. The validation step presented in Section 5.5.1 was performed under this solution. With direct IO virtualization, the guest OS is directly presented with the physical MMIO ranges configured by the hardware device. This solution requires Compromis to use several memory segments. Note that this solution is not popular in today's clouds because it limits scalability (it only enables very few virtual devices) and dynamic consolidation (VM live migration is not possible).
6 Evaluation

We evaluated the following aspects: (1) effectiveness (see Section 6.1): the capability to start a large number of VMs using the DS-n technology; (2) performance gain (see Section 6.2): the capability to improve the performance of applications which run in DS-n VMs; (3) startup impact (see Section 6.3): the potential positive or negative impact on VM startup latency. Unless otherwise indicated, the hypervisor and the cloud management system used are respectively Xen and OpenStack.
6.1 Effectiveness

The effectiveness evaluation is done by simulation using real datacenter traces.
Methodology. We developed a simulator which mimics a datacenter managed with OpenStack [23], improved with our contributions. The simulator replays VM startup and shutdown requests collected from several production datacenters, presented in Section 6.1.2. It considers that a VM demand includes a number of CPU cores and a memory size. For each simulated VM startup request, the simulator logs two metrics: the number of segments used for satisfying the VM memory demand and the time taken by our changes (the extension of the cloud scheduler and the utilization of our memory allocation algorithm in the hypervisor).

To highlight the benefits of each Compromis feature, we evaluated different versions, including:
• BaseLine: the simulator implements both the native OpenStack scheduler and Xen's memory allocation algorithms;
• ImprovPlacement: in this version the VM placement algorithm is improved to choose for every VM the machine which will use the minimum number of memory segments (as described in Section 5.3.1);
• DynamicOptionSelec: in this version the cloud scheduler computes every week the best memory allocation option which will be used (as described in Section 5.4).
Datasets. We used the traces of 2 public clouds (Bitbrains [39] and Microsoft Azure [21]) and 308 private clouds. Among other fields, each trace includes the VM creation and destruction times, and the VM size (in terms of memory and CPU cores).
Bitbrains.
This cloud is a service provider specialized in managed hosting and business computation for many enterprises. The dataset consists of 1,750 VMs, collected between August and September 2013. Bitbrains does not include physical machine characteristics.
Azure.
This is a public Microsoft cloud. The dataset comprises 2,013,767 VMs running on Azure from November 16th, 2016 to February 16th, 2017.

Private clouds.
This group aggregates data of 308 private IaaS clouds running diverse workloads between November 1st, 2018 and November 29th, 2018. For a given cloud, we collected one or more consistent snapshots of the cluster state at the moment the cluster triggered its hotspot mitigation service, which indicates that a machine is getting close to saturation. A snapshot depicts the running VMs, their sizing (in terms of memory and cores) and their host (in terms of available memory and cores). The collected dataset includes 301,440 VMs. As the dataset contains snapshots and not the VM creation and destruction times, we derived from each snapshot a bootstorm scenario where all the VMs are created simultaneously. This dataset includes server characteristics.

Composition and server characteristics used for Bitbrains and Azure.
Having no hardware information about the first two datasets, we consider that they are composed of the server generations presented in Table 4. We chose these server generations as they are used in Azure according to [2].
Gen6 and Godzilla are new generations while Gen2 HPC, Gen4 and Gen5 are older ones. All server generations have the same proportion.

Name       RAM (GB)   Cores   % in the traces
Gen2 HPC   128        24      20
Gen4       192        24      20
Gen5       256        40      20
Gen6       192        48      20
Godzilla   512        32      20

Table 4. Server generations used in the replay of the Bitbrains and Azure traces.
Bitbrain
Solution               1 seg.   2 seg.   3 seg.   >3 seg.
BaseLine
ImprovPlacement+Opt1   100      0        0        0
ImprovPlacement+Opt2   100      0        0        0
DynamicOptionSelec     100      0        0        0

Azure
Solution               1 seg.   2 seg.   3 seg.   >3 seg.
BaseLine
ImprovPlacement+Opt1
ImprovPlacement+Opt2
DynamicOptionSelec

Table 5. Number of memory segments allocated to VMs from Bitbrain and Azure.

Bitbrain and Azure - Table 5. BaseLine provides better results in Bitbrain (up to 81% of VMs are satisfied with less than four memory segments) compared to Azure (only about 24% of VMs are satisfied with less than four memory segments). This is because the VMs running on Bitbrain have a longer lifetime than those on Azure. However, our solutions satisfy more VMs than BaseLine (99.95%-100%). This is because BaseLine, which implements Xen, organizes the physical memory in the form of small memory chunks which are then used for allocation. As a naive algorithm, Xen cannot enforce DS for a VM even if there exists a free memory segment which is larger than the memory demand. In contrast, Compromis enforces DS near to perfection (more than 99% of VMs are satisfied with only one memory segment). Our two memory allocation options discussed in Section 5.3.1 show their slight difference in Bitbrain: ImprovPlacement+Opt1 satisfies more VMs with only one segment in comparison with ImprovPlacement+Opt2. Finally, dynamically switching between the two options (DynamicOptionSelec) is the best solution (99.99% of VMs use one memory segment).
Private clouds - Figure 5. We plot the results for these clouds separately from the previous ones because of the large number of clouds. We can make the same observation as above: our solutions satisfy almost all VMs with only one memory segment (see the kind of wall at 1 on the latitude axis).
Figure 5. Number of memory segments allocated to VMs from the 308 private clouds (longitude=cloud, latitude=number of memory segments), for BaseLine, ImprovPlacement+Opt1, ImprovPlacement+Opt2 and DynamicOptionSelec.

6.2 Performance gain

This section evaluates the performance gain brought by the utilization of DS-n.
Methodology. A DS-n machine handles a TLB miss using a 1D page walk followed by a set of register to register operations. We mimic this functioning by running the VM in para-virtualized (noted PV) mode [11], which also uses a 1D page walk. However, in PV all page table modifications performed by the VM kernel trap into the hypervisor using hypercalls. We modified the guest kernel to directly set page table entries with the correct HPAs, calculated in the same way as a DS-n hardware would have done. The reader could legitimately ask why we use PV to simulate a hardware-assisted solution. We claim that our approach makes sense in our context because the benchmarks do not solicit the PV machinery: all disks are in-memory (tmpfs) based and all network requests use the loopback interface. Accordingly, only the memory subsystem is solicited.

The evaluation methodology we use is as follows. Let T_1D be the execution time of the VM in this modified PV context. We estimate the cost (noted T_reg-reg) of the register to register operations performed by the DS-n hardware on a TLB miss using an assembly code which executes these operations; it is adaptable according to the value of n. Let N_tlb be the number of TLB misses (collected using PMC) generated by the application when it is executed in a native system. We estimate the execution time T_DS-n of a VM on a DS-n machine using this formula:

T_DS-n = T_1D + N_tlb × T_reg-reg    (3)

We evaluated different values of n from 1 to 3. We compare DS-n with EPT (whose execution time is noted T_ept) and shadow paging (whose execution time is noted T_sha). We used a 4KB page size in guest VMs, as it is the standard size. The characteristics of the experimental machine are presented in Table 6; notice that this machine includes a page walk cache [12]. The list of benchmarks we use (as previous work) is presented in Table 1. Each benchmark runs in a VM having a single vCPU and 5 GB of memory. The hypervisor and OS used are Xen 4.8 and Ubuntu 16.04 (Linux kernel 4.15) respectively.

Processor   Single socket Intel(R) Core(TM) [email protected], 4 cores
Memory      16GB DDR4 1600MHz
DTLB        4-way, 64 entries
ITLB        4-way, 128 entries

Table 6. Characteristics of the experimental machine.

Results. Figure 6 presents the evaluation results. We only present the results for DS-1 because we obtained almost the same results with DS-2 and DS-3. This is because the cost of the register to register operations realized in DS-1, DS-2 and DS-3 is extremely low compared with the cost of a 2D page walk. Figure 6 is interpreted as follows. First, obviously, CPU intensive only applications (e.g., hmmer from SPEC CPU 2006) do not benefit much from DS-n. Second, we confirm that DS-n almost nullifies the overhead of memory virtualization and brings the application almost to the same performance as in native systems; in fact, all black histogram bars are very close to 1. DS-n outperforms both EPT (up to 30% of performance difference for mcf) and shadow paging (up to 370% of performance difference for Elastic Search). Finally, we observe that DS-n produces a very low, close to zero, overhead (0.35%) but also a stable overhead (0.42 standard deviation). While a smaller overhead is always appreciable, a stable overhead can also be a requirement to host latency sensitive applications like databases or real-time systems.

To justify the origin of this significant performance gap between these memory virtualization technologies, we analyzed the values of the internal metrics, focusing on the applications Redis, gcc, and Elastic Search. For DS-n, the cost of memory virtualization is C_DS-n = C_1D × N_tlb^DS-n, where C_1D is the number of CPU cycles for performing a 1D page walk and N_tlb^DS-n is the number of TLB misses. For EPT, that cost is C_EPT = C_2D × N_tlb^EPT, where C_2D is the number of CPU cycles used to perform a 2D page walk and N_tlb^EPT is the number of TLB misses. For shadow paging, the cost is C_Sha = C_1D × N_tlb^Sha + N_exit^Sha × (C_exit + C_enter + C_handler), where N_tlb^Sha is the number of TLB misses, N_exit^Sha is the number of VMExits related to page table modification operations, C_exit and C_enter are the costs of performing a VMExit and a VMEnter, and C_handler is the average execution time of the memory management handlers in the hypervisor. Table 7 presents the values of these costs, according to our experimental machine. We observe that C_DS-n is much lower than C_EPT and C_Sha.
Figure 6. Performance overhead of DS-n (g4KB) compared with shadow paging (SP, g4KB) and EPT (g4KB-h2MB). Lower is better.
Technology   Redis   gcc   Elastic Search
C_DS-n
C_EPT        17      17    46
C_Sha        25      62    201

Table 7. The total cost (in seconds) of each memory virtualization technology for Redis, gcc and Elastic Search.
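To make the cost model concrete, the small sketch below evaluates Equation (3) and the three cost formulas; all numeric inputs are placeholder values chosen for illustration only, not measurements from Table 7 or from the paper.

```python
def t_dsn(t_1d, n_tlb, t_reg_reg):
    # Equation (3): estimated execution time of the VM on a DS-n machine.
    return t_1d + n_tlb * t_reg_reg

def cost_dsn(c_1d, n_tlb):                # C_DS-n = C_1D x N_tlb
    return c_1d * n_tlb

def cost_ept(c_2d, n_tlb):                # C_EPT = C_2D x N_tlb
    return c_2d * n_tlb

def cost_shadow(c_1d, n_tlb, n_exit, c_exit, c_enter, c_handler):
    # C_Sha = C_1D x N_tlb + N_exit x (C_exit + C_enter + C_handler)
    return c_1d * n_tlb + n_exit * (c_exit + c_enter + c_handler)

# Placeholder inputs (cycles / event counts), NOT measurements from the paper:
# 10 M TLB misses, a 2D walk assumed to be several times costlier than a 1D walk.
n_tlb = 10_000_000
print(cost_dsn(40, n_tlb))                                 # cycles spent by DS-n
print(cost_ept(240, n_tlb))                                # cycles spent by EPT
print(cost_shadow(40, n_tlb, 500_000, 1_500, 500, 3_000))  # cycles spent by shadow paging
```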
Solution             Bitbrain   Azure   Private clouds
BaseLine
DynamicOptionSelec

Table 8. Memory allocation latency (mean-stdev) in ms.
6.3 Impact on VM startup time

Recall that Compromis extends the cloud scheduler (which intervenes at VM startup time) and changes the default memory allocator used by the hypervisor (also at VM start time). Therefore, one may legitimately ask whether these changes impact the VM startup latency. We answer this question by summing the cost of the scheduler extension and the cost of our memory allocation algorithm, and then comparing it with the cost of the default Xen memory allocation algorithm. We rely on the simulation logs generated during the evaluations presented in Section 6.1. The experiment reports that almost all the different versions of our solution have the same complexity, thus we only present the results for DynamicOptionSelec, in Table 8. These results are interpreted as follows. First, we observe that our solution reduces the startup time, by up to 80% for Azure VMs. This is because our allocation algorithm is simpler compared to Xen's, which organizes memory in several memory chunk lists and iterates over these lists several times to satisfy a memory demand. Second, the smaller standard deviation shows that the startup time becomes more stable than with Xen. Such predictability is critical for auto-scaling services, as demonstrated by Nitu et al. [32]. The unpredictability of Xen comes from its complex memory allocation algorithm presented above.
7 Related work

The overhead of memory virtualization in native systems has been proven by several previous works [12–14, 16–18, 22, 26, 30, 33, 34, 45]. It has also been shown that this overhead is exacerbated in virtualized environments [6–9, 15, 19, 24, 25, 35, 42, 44, 45]. This section presents existing work in the latter context. The research in this domain can be classified into two categories: software and hardware-assisted solutions.
Software solutions. Direct paging [5] is similar to shadow paging [43] (presented in Section 2.1), but it requires the modification of the guest OS. In direct paging [5], the hypervisor introduces an additional level of abstraction between what the guest sees as physical memory and the underlying machine memory. This is done through the introduction of a Physical to Machine (P2M) mapping maintained within the hypervisor, as in shadow paging. The guest OS is aware of the P2M mapping and is modified such that, instead of writing PTEs, it writes entries mapping virtual addresses directly to the machine address space by using the P2M itself. Like shadow paging, direct paging uses a 1D page walk to handle a TLB miss. However, it includes two main drawbacks: context switches between the guest and the hypervisor for building the P2M table, and the modification of the guest OS (making proprietary OSes such as Windows unusable).
Hardware-assisted solutions. Both Intel and AMD proposed EPT [15, 42], a hardware-assisted solution which does not include the software solutions' limitations. We have already presented this solution in Section 2.2. As shown there, EPT is far from satisfactory because of the 2D page walk that it imposes. To reduce the overhead caused by this 2D page walk, several works have proposed the extension of the page walk cache (PWC) [12] used in native systems. Such a cache avoids the page walk on a PWC hit. Bhargava et al. [15] investigated for the first time this extension of the PWC for EPT. The main limitation of such solutions is their inefficiency when facing VMs with large working set sizes (e.g., in-memory databases) [45]. Also, PWC based solutions suffer from a high rate of cache misses when several VMs share the same machine, due to cache evictions. Ahn et al. [8] used a flat EPT instead of the traditional multi-level radix tree. This way, the authors reduced the number of memory references on a TLB miss to 9. Compromis totally eliminates the EPT, resulting in 4 memory references for each TLB miss.

Some solutions improved the TLB [35, 37, 45]. Ryoo et al. [37] presented POM-TLB, a very large level-3 in-RAM TLB. POM-TLB brings two main advantages. First, the number of TLB misses is reduced because of the large TLB size, thus reducing the number of 2D page walks. Second, POM-TLB benefits from the data cache to reduce RAM references. However, on a cache miss a RAM access is necessary. Also, on a POM-TLB miss, the hardware still performs a 2D page walk. This solution can be used at the same time as Compromis. Wang et al. [44] and Gandhi et al. [25] showed that neither EPT nor shadow paging can be a definite winner. They proposed dynamic switching mechanisms that exceed the benefits of each technique. To this end, TLB misses and guest page faults are monitored to determine the best technique to apply. Such dynamic solutions come with a significant overhead related to two tasks: the monitoring and the computation of the considered metrics consume a lot of CPU cycles, and switching from one technique to another requires rebuilding new page tables.
Some researchers, like Kwon et al. [28, 29], proposed the utilization of huge pages [4, 38] in the guest OS and in the hypervisor at the same time. This way, the number of levels in the page table hierarchy is reduced, and thus the number of memory references during a page walk is reduced too. However, using huge pages leads to two main limitations for the guest. First, it increases memory fragmentation, thus memory waste, for the guest. This could lead to memory pressure in the guest OS, resulting in swapping, which is negative for application performance. Second, huge pages increase the average and tail memory allocation latency in the guest, because zeroing a huge page at page allocation time is more time consuming than zeroing a 4KB page.

Talluri et al. [41] proposed hashed page tables in native systems as an efficient alternative to the radix page table structure. With hashed page tables, address translation is done using a single memory reference, assuming no collision. Yaniv et al. [45] presented how this technique can be adapted for virtualized systems. The authors showed that by using a 2D hashed page table hierarchy, the page walk is done with 3 memory references instead of 24. This is one less than in Compromis and native systems, but it suffers from hash collisions.
Direct segment (DS) based solutions.
Previous works showed the benefits of DS in both native [13, 27] and virtualized systems [9, 24, 25]. Alam et al. presented DVMT [9], a mechanism which allows applications inside the VM to request DS allocations directly from the hypervisor. The application is responsible for mapping the GVAs which are in the allocated DS address space. This is a limitation for application developers who are not experts. Gandhi et al. [24] proposed three memory virtualization solutions based on DS. Their VMM Direct mode is very close to Compromis, but DS does not concern the entire VM memory. In addition, the authors mainly investigated the two other modes.

More generally, existing solutions in this category mainly focused on hardware contributions while we study the consequences for the entire cloud stack. Also, they relied only on simulations while we tried to perform accurate experiments on real machines using real systems. Finally, we motivate (relying on trace analysis) for the first time the relevance of DS for VMs.
8 Conclusion

This paper presented Compromis, a novel MMU solution for virtualized systems. Compromis generalizes DS to provision the entire VM memory space using a minimal number of memory segments. This way, the hardware page table walker performs a 1D page walk as in native systems. By analyzing several production datacenter traces, the paper showed that Compromis can provision up to 99.99% of the VMs with a single memory segment. The paper presented a systematic implementation of Compromis in the hardware, the hypervisor and the cloud scheduler. The evaluation results show that Compromis reduces the memory virtualization overhead to only 0.35%. Furthermore, Compromis reduces the VM startup latency by up to 80% while also providing a predictable value.
References

[1] [n.d.]. Benefits of Virtualization.
[2] [n.d.]. Inside Microsoft Azure datacenter hardware and software architecture with Mark Russinovich.
[3] [n.d.]. Top 5 Business Benefits of Server Virtualization. https://blog.nhlearningsolutions.com/blog/top-5-ways-businesses-benefit-from-server-virtualization
[4] [n.d.]. Transparent Hugepages. https://lwn.net/Articles/359158/
[5] [n.d.]. X86 Paravirtualised Memory Management. https://wiki.xen.org/wiki/X86_Paravirtualised_Memory_Management
[6] Keith Adams and Ole Agesen. 2006. A Comparison of Software and Hardware Techniques for x86 Virtualization. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XII). ACM, New York, NY, USA, 2–13. https://doi.org/10.1145/1168857.1168860
[7] Ole Agesen, Jim Mattson, Radu Rugina, and Jeffrey Sheldon. 2012. Software Techniques for Avoiding Hardware Virtualization Exits. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference (USENIX ATC'12). USENIX Association, Berkeley, CA, USA, 35–35. http://dl.acm.org/citation.cfm?id=2342821.2342856
[8] Jeongseob Ahn, Seongwook Jin, and Jaehyuk Huh. 2012. Revisiting Hardware-assisted Page Walks for Virtualized Systems. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA '12). IEEE Computer Society, Washington, DC, USA, 476–487. http://dl.acm.org/citation.cfm?id=2337159.2337214
[9] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. 2017. Do-It-Yourself Virtual Memory Translation. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 457–468. https://doi.org/10.1145/3079856.3080209
[10] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. 2010. A View of Cloud Computing. Commun. ACM 53, 4 (April 2010), 50–58. https://doi.org/10.1145/1721654.1721672
[11] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. 2003. Xen and the Art of Virtualization. In SOSP. 164–177.
[12] Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don't Walk (the Page Table). In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM, New York, NY, USA, 48–59. https://doi.org/10.1145/1815961.1815970
[13] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient Virtual Memory for Big Memory Servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 237–248. https://doi.org/10.1145/2485922.2485943
[14] Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2012. Reducing Memory Reference Energy with Opportunistic Virtual Caching. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA '12). IEEE Computer Society, Washington, DC, USA, 297–308. http://dl.acm.org/citation.cfm?id=2337159.2337194
[15] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha Manne. 2008. Accelerating Two-dimensional Page Walks for Virtualized Systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XIII). ACM, New York, NY, USA, 26–35. https://doi.org/10.1145/1346281.1346286
[16] Abhishek Bhattacharjee. 2013. Large-reach Memory Management Unit Caches. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 383–394. https://doi.org/10.1145/2540708.2540741
[17] Abhishek Bhattacharjee. 2017. Translation-Triggered Prefetching. SIGARCH Comput. Archit. News 45, 1 (April 2017), 63–76. https://doi.org/10.1145/3093337.3037705
[18] Abhishek Bhattacharjee, Daniel Lustig, and Margaret Martonosi. 2011. Shared Last-level TLBs for Chip Multiprocessors. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA '11). IEEE Computer Society, Washington, DC, USA, 62–63. http://dl.acm.org/citation.cfm?id=2014698.2014896
[19] Xiaotao Chang, Hubertus Franke, Yi Ge, Tao Liu, Kun Wang, Jimi Xenidis, Fei Chen, and Yu Zhang. 2013. Improving Virtualization in the Presence of Software Managed Translation Lookaside Buffers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 120–129. https://doi.org/10.1145/2485922.2485933
[20] [n.d.]. Apache CloudStack – Open Source Cloud Computing. http://cloudstack.apache.org/
[21] Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Marcus Fontoura, and Ricardo Bianchini. 2017. Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17). ACM, New York, NY, USA, 153–167. https://doi.org/10.1145/3132747.3132772
[22] Guilherme Cox and Abhishek Bhattacharjee. 2017. Efficient Address Translation for Architectures with Multiple Page Sizes. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17). ACM, New York, NY, USA, 435–448. https://doi.org/10.1145/3037697.3037704
[23] [n.d.]. Nova filter scheduler. http://docs.openstack.org/developer/nova/filter_scheduler.html
[24] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2014. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 178–189. https://doi.org/10.1109/MICRO.2014.37
[25] Jayneel Gandhi, Mark D. Hill, and Michael M. Swift. 2016. Agile Paging: Exceeding the Best of Nested and Shadow Paging. In
Proceedings of the 43rd International Symposium on Computer Ar-chitecture (ISCA ’16) . IEEE Press, Piscataway, NJ, USA, 707–718. https://doi.org/10.1109/ISCA.2016.67 [26] Swapnil Haria, Mark D. Hill, and Michael M. Swift. 2018.Devirtualizing Memory in Heterogeneous Systems. In
Proceed-ings of the Twenty-Third International Conference on Architec-tural Support for Programming Languages and Operating Sys-tems (ASPLOS ’18) . ACM, New York, NY, USA, 637–650. https://doi.org/10.1145/3173162.3173194 [27] Nikhita Kunati and Michael M. Swift. 2018. Implementation of Di-rect Segments on a RISC-V Processor. In
IN Second Workshop onComputer Architecture Research with RISC-V (CARRV), Co-locatedwith ISCA .[28] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Ross-bach, and Emmett Witchel. 2016. Coordinated and EfficientHuge Page Management with Ingens. In
Proceedings of the 12thUSENIX Conference on Operating Systems Design and Implementa-tion (OSDI’16) . USENIX Association, Berkeley, CA, USA, 705–721. http://dl.acm.org/citation.cfm?id=3026877.3026931 [29] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach,and Emmett Witchel. 2017. Ingens: Huge Page Support for the OSand Hypervisor.
SIGOPS Oper. Syst. Rev.
51, 1 (Sept. 2017), 83–93. https://doi.org/10.1145/3139645.3139659 [30] Yashwant Marathe, Nagendra Gulur, Jee Ho Ryoo, Shuang Song, andLizy K. John. 2017. CSALT: Context Switch Aware Large TLB. In
Proceedings of the 50th Annual IEEE/ACM International Symposiumon Microarchitecture (MICRO-50 ’17) . ACM, New York, NY, USA,449–462. https://doi.org/10.1145/3123939.3124549 [31] Vlad Nitu, Aram Kocharyan, Hannas Yaya, Alain Tchana, Daniel Hag-imont, and Hrachya Astsatryan. 2018. Working Set Size EstimationTechniques in Virtualized Environments: One Size Does Not Fit All.
Proc. ACM Meas. Anal. Comput. Syst.
2, 1, Article 19 (April 2018),22 pages. https://doi.org/10.1145/3179422 [32] Vlad Nitu, Pierre Olivier, Alain Tchana, Daniel Chiba, AntonioBarbalace, Daniel Hagimont, and Binoy Ravindran. 2017. SwiftBirth and Quick Death: Enabling Fast Parallel Guest Boot andDestruction in the Xen Hypervisor. In
Proceedings of the 13thACM SIGPLAN/SIGOPS International Conference on Virtual Exe-cution Environments (VEE ’17) . ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/3050748.3050758
33] Ashish Panwar, Aravinda Prasad, and K. Gopinath. 2018. MakingHuge Pages Actually Useful. In
Proceedings of the Twenty-Third Inter-national Conference on Architectural Support for Programming Lan-guages and Operating Systems (ASPLOS ’18) . ACM, New York, NY,USA, 679–692. https://doi.org/10.1145/3173162.3173203 [34] Chang Hyun Park, Taekyung Heo, Jungi Jeong, and Jaehyuk Huh.2017. Hybrid TLB Coalescing: Improving TLB Translation Cov-erage Under Diverse Fragmented Memory Allocations. In
Proceed-ings of the 44th Annual International Symposium on ComputerArchitecture (ISCA ’17) . ACM, New York, NY, USA, 444–456. https://doi.org/10.1145/3079856.3080217 [35] Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhat-tacharjee. 2015. Large Pages and Lightweight Memory Man-agement in Virtualized Environments: Can You Have It BothWays?. In
Proceedings of the 48th International Symposium on Mi-croarchitecture (MICRO-48) . ACM, New York, NY, USA, 1–12. https://doi.org/10.1145/2830772.2830773 [36] R. Raman, M. Livny, and M. Solomon. 1998. Matchmak-ing: Distributed Resource Management for High ThroughputComputing. In
Proceedings of the 7th IEEE International Sym-posium on High Performance Distributed Computing (HPDC’98) . IEEE Computer Society, Washington, DC, USA, 140–. http://dl.acm.org/citation.cfm?id=822083.823222 [37] Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K.John. 2017. Rethinking TLB Designs in Virtualized Environ-ments: A Very Large Part-of-Memory TLB. In
Proceedingsof the 44th Annual International Symposium on Computer Ar-chitecture (ISCA ’17) . ACM, New York, NY, USA, 469–480. https://doi.org/10.1145/3079856.3080210 [38] Tom Shanley. 1996.
Pentium Pro Processor System Architecture (1sted.). Addison-Wesley Longman Publishing Co., Inc., Boston, MA,USA.[39] Siqi Shen, Vincent van Beek, and Alexandru Iosup. 2015. Statisti-cal Characterization of Business-Critical Workloads Hosted in CloudDatacenters. In . 465–474.[40] Cristan Szmajda and Gernot Heiser. 2003. Variable Radix Page Table:A Page Table for Modern Architectures. In
Advances in ComputerSystems Architecture , Amos Omondi and Stanislav Sedukhin (Eds.).Springer Berlin Heidelberg, Berlin, Heidelberg, 290–304.[41] M. Talluri, M. D. Hill, and Y. A. Khalidi. 1995. A New Page Table for64-bit Address Spaces. In
Proceedings of the Fifteenth ACM Sympo-sium on Operating Systems Principles (SOSP ’95) . ACM, New York,NY, USA, 184–200. https://doi.org/10.1145/224056.224071 [42] Rich Uhlig, Gil Neiger, Dion Rodgers, Amy L. Santoni, Fer-nando C. M. Martins, Andrew V. Anderson, Steven M. Bennett,Alain Kagi, Felix H. Leung, and Larry Smith. 2005. Intel Vir-tualization Technology.
Computer
38, 5 (May 2005), 48–56. https://doi.org/10.1109/MC.2005.163 [43] Carl A. Waldspurger. 2002. Memory Resource Management inVMware ESX Server.
SIGOPS Oper. Syst. Rev.
36, SI (Dec. 2002),181–194. https://doi.org/10.1145/844128.844146 [44] Xiaolin Wang, Jiarui Zang, Zhenlin Wang, Yingwei Luo, and Xiaom-ing Li. 2011. Selective Hardware/Software Memory Virtualization. In
Proceedings of the 7th ACM SIGPLAN/SIGOPS International Confer-ence on Virtual Execution Environments (VEE ’11) . ACM, New York,NY, USA, 217–226. https://doi.org/10.1145/1952682.1952710 [45] Idan Yaniv and Dan Tsafrir. 2016. Hash, Don’T Cache (the PageTable). In