Leveraging Architectural Support of Three Page Sizes with Trident
Venkat Sri Sai Ram, Ashish Panwar, Arkaprava Basu
{sirvisettis, ashishpanwar, arkapravab}@iisc.ac.in
Department of Computer Science and Automation, Indian Institute of Science

Abstract
Large pages are commonly deployed to reduce address translation overheads for big-memory workloads. Modern x86-64 processors from Intel and AMD support two large page sizes, 1GB and 2MB. However, previous works on large pages have primarily focused on 2MB pages, partly due to a lack of substantial evidence on the profitability of 1GB pages for real-world applications. We argue that, in fact, inadequate system software support is responsible for a decade of underutilized hardware support for 1GB pages. Through extensive experimentation on a real system, we demonstrate that 1GB pages can improve performance over 2MB pages, particularly when used in tandem with 2MB pages, for an important set of applications; the support for the latter is crucial but missing in current systems. Our design and implementation of Trident in Linux fully exploits hardware-supported large pages by dynamically and transparently allocating 1GB, 2MB, and 4KB pages as deemed suitable. Trident speeds up eight memory-intensive applications by 18%, on average, over Linux's use of 2MB pages. We also propose Trident_pv, an extension to Trident that effectively virtualizes 1GB pages via copy-less promotion and compaction in the guest OS. Overall, this paper shows that adequate software enablement can bring practical relevance to even GB-sized pages, in turn motivating architects to continue investing and innovating in large pages.
1. Introduction
It is not uncommon for certain architectural features to require software enablement. Unfortunately, it is also common to find hardware features that are underutilized or mostly ignored due to the lack of adequate software support. This happens while paying both the runtime cost of having the new hardware (e.g., power dissipation due to the feature) and the one-time hardware design and verification cost. Further, architects are left in the dark about the extent to which those features are beneficial in practice, and whether they should continue enhancing them or drop them in future products.

In this work, we shed light on one such hardware feature that has been languishing for a decade, support for 1GB pages, and provide a detailed system software (here, Linux and KVM) enablement for it.

Big-memory workloads are well known to witness significant slowdowns due to virtual-to-physical address translation (e.g., up to 20-50%). Modern processors support large pages to help reduce this overhead [30, 35, 36]. A large page TLB (Translation Lookaside Buffer) entry maps a larger contiguous virtual address region to a contiguous physical address region (e.g., 2MB or 1GB compared to the default 4KB). Consequently, the use of a large page increases TLB coverage and can reduce the number of TLB misses, the primary source of translation overheads.

However, the hardware support alone is not useful. The OSes and hypervisors (henceforth called system software) that create the address mappings need to enable large pages for applications. Further, the ease of use determines the prevalence of large page deployment in practice. If application modifications are necessary for allocating a given large page size, then that page size may not be widely deployed, and the corresponding hardware resources are consequently rendered underutilized.

x86-64 processors have supported two large page sizes, 2MB and 1GB, for over a decade. Intel's Sandy Bridge architecture, launched in 2010, supported 1GB pages and had a four-entry L1 TLB dedicated to 1GB pages in each core [9]. The current generation of Intel Coffee Lake processors additionally has a 16-entry L2 TLB for 1GB pages [8]. Upcoming Intel Ice Lake processors are presumed to have a 1024-entry L2 TLB for 1GB pages [12]. In short, processor vendors continue to enhance support for 1GB pages.

Unfortunately, software enablement of large pages has focused primarily on 2MB pages. Linux's Transparent Huge Page (THP) support enables dynamic allocation of large pages without user intervention and is thus key to the widespread use of large pages. But it is limited to 2MB pages. Previous research works on improving large page support similarly ignore 1GB pages [30, 35, 36].

With the continued growth in the memory footprint of applications, the use of 1GB pages is likely to become a necessity for good performance. The advent of denser non-volatile memory (NVM) technologies promises to significantly increase physical memory sizes [23, 32]. The ability to efficiently address a large amount of memory is essential to harness the benefits of NVM. While the hardware support for 1GB pages has sped ahead, its software enablement has fallen behind.

However, it is important for architects to find hard evidence of the practical usefulness of 1GB pages, over and above 2MB pages, to continue enhancing their support or else consider dropping it in future products. Therefore, we first set out to quantify the usefulness of 1GB pages to various applications, with and without virtualization. We find that while most memory-intensive applications benefit from 2MB pages over 4KB, a subset of them speeds up further with 1GB pages. Even for applications in that subset, use of the other large page size(s) (here, 2MB) alongside the largest page size (here, 1GB) is important. Mapping a virtual address range with a large page size requires the address range to be at least as long as that page size and to be aligned at that page size boundary. Consequently, the larger the page size, the fewer the virtual address ranges that are mappable by that page size.

We empirically find that often a significant portion of an application's address space is not mappable with 1GB pages, but is mappable with 2MB pages. Importantly, such address ranges often witness relatively frequent TLB misses. Thus, if only the largest page size is used, those address ranges would have to be mapped with the smallest (4KB) pages. Consequently, the number of TLB misses would increase significantly.

A larger page size also needs an equally long contiguous physical memory chunk. The larger the page size, the shorter the supply of the necessary physical memory chunks. Thus, it may not be possible to map a virtual address range with the largest page size even if it is mappable by that page size.

Driven by the above analysis, we built Trident in Linux to dynamically allocate all available page sizes in x86-64 systems. A key challenge in dynamically allocating 1GB pages is the lack of a sufficient number of 1GB contiguous physical memory chunks. As free physical memory gets naturally fragmented over time, finding 1GB chunks becomes far more difficult than finding 2MB chunks. Thus, the dynamic allocation of large pages needs to periodically compact physical memory to make contiguous free memory available. However, compaction for a 1GB memory chunk requires significantly more work than for 2MB. Moreover, a compaction attempt fails if it encounters even a single page frame (4KB) with unmovable contents, e.g., kernel objects like inodes, in a 1GB region. In short, compaction for 1GB chunks needs a new approach.
Trident introduces a novel smart compaction technique. We observe that the current compaction approach of sequentially scanning and moving the contents of occupied page frames does not scale to 1GB; it incurs an unnecessarily large amount of data movement. Smart compaction tracks the number of occupied bytes (i.e., the number of mapped page frames) within each 1GB physical memory chunk. Instead of scanning, it then frees the region with the least number of occupied bytes, which significantly reduces data movement. It also tracks unmovable contents within each 1GB region to further avoid unnecessary data movement.

Even with smart compaction, 1GB memory chunks are not always available when needed. About a third of the attempts to allocate a 1GB page fail due to the unavailability of contiguous physical memory. Unsurprisingly, 2MB chunks are more easily available. Trident thus maps address ranges with 2MB pages if it fails to map them with 1GB pages. Later, these 2MB page mappings are promoted to 1GB pages, when suitable.

We then propose Trident_pv, an optional extension to Trident under virtualization for copy-less 1GB page promotion and compaction in the guest OS. The guest often copies the contents of guest physical pages to create contiguity in the guest physical address space for page promotion and compaction. We observe that copying can be mimicked by exchanging the mappings between the guest physical address (gPA) and the host physical address (hPA) of the source and destination. This copy-less technique makes the promotion of 2MB pages to a 1GB page significantly faster than the traditional copy-based approach. However, the guest and the hypervisor need to coordinate to alter the desired gPA-to-hPA mappings via a hypercall.

We find that on a bare-metal system, Trident speeds up eight memory-intensive applications by 18% over Linux's THP, on average. Trident_pv can improve upon Trident's own performance under virtualization by up to 10%.

In short, this paper shows that the use of 1GB pages has been hamstrung by inadequate software enablement even after a decade of hardware support in commercial processors. Therefore, architects must prioritize enhancing the software support before investing further in hardware support for 1GB pages. Our specific contributions are as follows:
• We evaluate the usefulness of 1GB pages across various applications, both without and with virtualization.
• We empirically demonstrate why it is important to deploy all large page sizes, not only the largest one.
• We created Trident in Linux to dynamically allocate all page sizes available in x86-64 processors to significantly speed up applications with large memory footprints.
• We then propose an optional extension to Trident, called Trident_pv, that employs paravirtualization to enable copy-less 1GB page promotion and compaction in the guest.
2. Background
Hardware support for large pages: Applications with large memory footprints often spend considerable time in address translation (e.g., up to 20-50%) [16, 35]. Address translation overhead is primarily the overhead of performing page table walks on TLB misses. Hits in the TLB are fast, but a page table walk on a miss may require up to four memory accesses to look up the hierarchical in-memory page table in x86-64 processors. Large pages help reduce translation overhead in two ways: (1) they reduce the frequency of TLB misses by increasing TLB coverage, since a single entry for a large page maps a larger address range; (2) they quicken individual page table walks by reducing the number of page table levels that need to be looked up. For example, a walk for a 1GB page requires up to 2 memory accesses, compared to 3 for a 2MB page and 4 for a 4KB page, in x86-64 processors.
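To make the coverage benefit concrete: TLB reach is the product of entry count and page size. Assuming, for illustration only, a 1536-entry TLB for 4KB/2MB pages and the 16 dedicated 1GB entries of recent Intel processors mentioned in Section 1 (these entry counts are assumptions, not measurements from our platform):

\[
1536 \times 4\,\mathrm{KB} = 6\,\mathrm{MB}, \qquad
1536 \times 2\,\mathrm{MB} = 3\,\mathrm{GB}, \qquad
16 \times 1\,\mathrm{GB} = 16\,\mathrm{GB}.
\]

Even a handful of 1GB entries can thus cover more memory than thousands of 4KB entries.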
Table 1: Specification of the experimental system
  Processor        Intel Xeon Gold 6140 @2.3GHz, Skylake family, 2 sockets
  Number of cores  18 cores (36 threads) per socket
  L1-iTLB
  L1-dTLB
  L2 TLB
  Cache            32K L1-d, 32K L1-i, 1MB L2, 24MB L3
  Main Memory
  OS / Hypervisor  Ubuntu with Linux kernel version 4.17.3 / KVM

Large pages under virtualization: Address translation involves two layers under virtualization through nested page tables. First, a guest virtual address (gVA) is mapped to a guest physical address (gPA) through the guest page tables (gPT) managed by the guest OS running in a virtual machine. The gPA is then translated to a host physical address (hPA) through host page tables (hPT) maintained by the hypervisor. Two layers of indirection increase the number of memory accesses required for a page walk. For example, with four-level page tables, a TLB miss requires up to 24 memory accesses for 4KB pages. Use of 2MB and 1GB pages at both layers reduces the number of accesses to 15 and 8, respectively.
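One way to see these access counts: in a nested walk with $g$ guest page-table levels and $h$ host levels, each of the $g$ guest levels must itself be translated through all $h$ host levels, and the final guest physical address needs one more host walk, giving

\[
\text{accesses} = g \cdot h + g + h .
\]

With $g = h = 4$ (4KB pages at both layers) this is $4 \cdot 4 + 4 + 4 = 24$; 2MB pages drop one level at each layer ($g = h = 3$), giving $15$; and 1GB pages drop two ($g = h = 2$), giving $8$.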
OS support for large page allocation: OSes typically provide three mechanisms to allocate large pages. In the pre-allocation-based mechanism, users are required to reserve physical memory for large pages, and a helper library (e.g., libHugetlbfs) maps specific segment(s) of an application's memory with large pages from the reserved memory. Unfortunately, this static approach constrains the usability of large pages. The second approach needs explicit system calls to map a virtual address range with large pages. This requires application modification (e.g., the madvise syscall or extra flags in mmap). In the third approach, the OS allocates large pages without user or programmer involvement. Linux's Transparent Huge Pages (THP) is an example of this approach.
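As a concrete illustration of the first two mechanisms, the sketch below shows how a Linux application can request large pages explicitly. It assumes 1GB hugetlbfs pages were reserved beforehand (e.g., via the hugepagesz=1G boot parameter) and is illustrative rather than part of Trident:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)   /* log2(1GB) in the flag field */
#endif

int main(void) {
    size_t len = 1UL << 30;   /* 1GB */

    /* Mechanism 1/2: explicitly request a pre-reserved 1GB hugetlbfs page.
       Fails unless 1GB pages were reserved (e.g., hugepagesz=1G at boot). */
    void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    printf("1GB hugetlbfs mapping: %s\n", a == MAP_FAILED ? "failed" : "ok");

    /* Mechanism 2: hint that a normal mapping should be backed by THP
       (2MB only, in mainline Linux). */
    void *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (b != MAP_FAILED)
        madvise(b, len, MADV_HUGEPAGE);

    /* Mechanism 3 needs no code: with THP set to "always", the kernel may
       transparently back plain anonymous mappings like b with 2MB pages. */
    return 0;
}
```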
Internally, THP employs two mechanisms. On a page fault, it checks whether the faulting address falls within a virtual address range that is at least as big as, and aligned with, the large page size. If yes, and a free contiguous physical memory chunk is available, THP maps the address with a large page. For address regions that were not immediately mappable with large pages during page faults, THP employs a background thread (khugepaged) to locate virtual address ranges mapped with 4KB pages and promote (remap) them to large pages, when possible. To ensure a sufficient supply of contiguous physical memory, THP also compacts physical memory. Compaction moves the contents of occupied pages to one end of the physical memory to create contiguous free memory regions at the other end. Unfortunately, THP currently supports only 2MB pages. KVM uses THP for allocating 2MB large pages in the hypervisor (host).
3. Methodology
Table 1 details the configuration of our experimental platform. We evaluate 12 workloads with multi-GB memory footprints (see Table 2) across machine learning, graph algorithms, key-value stores, HPC, and microbenchmarks (GUPS and Btree). We use Linux's perf tool [6] to collect microarchitectural events related to virtual memory overheads. Specifically, we monitor events like the number of cycles spent on page walks via the counters DTLB_LOAD_MISSES.WALK_DURATION and DTLB_STORE_MISSES.WALK_DURATION.

Table 2: Specifications of the benchmarks
  Name       Threads  Memory   Description
  XSBench    36       117GB    Monte Carlo particle transport algorithm for nuclear reactors [44]
  SVM        36       67.9GB   Support Vector Machine, kdd2012 dataset [5]
  Graph500   36       63.5GB   Breadth-first-search and single-source-shortest-path over undirected graphs [2]
  CC/BC/PR   36       72GB     Graph algorithms from GAPBS [17]
  CG.D       36       50GB     Conjugate Gradient algorithm from NAS Parallel Benchmarks [13]
  Btree      1        10.5GB   Random lookups in a B+tree
  GUPS       1        32GB     Irregular, memory-intensive microbenchmark [3]
  Redis      1        43.6GB   An in-memory key-value store [21]
  Memcached  36       79GB     An in-memory key-value caching store [25]
  Canneal    1        32GB     Simulated cache-aware annealing from PARSEC [20]

To study performance under different states of the system, we used a tool to fragment the physical memory (more in Section 5.1). Physical memory is fragmented by reading a large file at random offsets to populate the OS's file cache. This renders physical memory sprinkled with recently used page frames containing file cache contents and thus fragmented.

Figure 1: Fraction of page walk cycles in native execution.
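A minimal sketch of such a fragmentation tool, assuming a pre-existing multi-GB file at a hypothetical path; random 4KB reads pull file data into the page cache, scattering allocated page frames across physical memory:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    /* Hypothetical path; any file much larger than free RAM works. */
    int fd = open("/mnt/bigfile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    off_t size = lseek(fd, 0, SEEK_END);
    char buf[4096];
    srand(42);

    /* Each 4KB read at a random page-aligned offset populates one
       file-cache page frame, fragmenting free physical memory. */
    for (long i = 0; i < 100000000L; i++) {
        off_t off = ((off_t)(rand() % (size / 4096))) * 4096;
        if (pread(fd, buf, sizeof(buf), off) < 0) break;
    }
    close(fd);
    return 0;
}
```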
4. How useful are 1GB large pages?
Hardware support for 1GB pages is not free, and the software running on x86-64 processors pays the price irrespective of its use of 1GB pages. For example, modern Intel processors have a 4-entry L1 TLB and a 16-entry L2 TLB dedicated to 1GB pages. Those four L1 entries for 1GB pages are accessed on every load and store, since the page size is not known during TLB lookup. Due to frequent accesses, L1 TLBs can contribute to a thermal hotspot in processors [41] and can account for 6% of a processor's total power [43]. The presence of dedicated TLBs for 1GB pages adds to this cost, and the continued increase in the number of TLB entries for 1GB pages would worsen it. It is, thus, natural to wonder whether applications can benefit from 1GB pages. We analyze various applications under different execution scenarios to understand the usefulness of 1GB pages.

Figure 2: Performance under native execution. Applications in shade benefit from 1GB pages.

Figure 3: Fraction of page walk cycles under virtualization.
Figure 1 shows the (normalized) fraction of execution cycles spent on page walks for each application while using different page sizes. The four bars for each application represent walk cycles with (1) the default 4KB pages, (2) dynamically allocated 2MB pages via THP, (3) statically pre-allocated 2MB pages via libHugetlbfs, and (4) statically pre-allocated 1GB pages via libHugetlbfs. The fourth bar approximates the behavior achievable if 1GB pages are deployed but 2MB pages are not. Note that application-transparent dynamic allocation of 1GB pages (i.e., THP-like) is not supported in Linux today.

We note that Linux's THP often performs as well as 2MB-libHugetlbfs. For Redis, THP reduces more walk cycles than libHugetlbfs. We find that this is because Redis's stack memory is TLB sensitive, and stacks cannot be mapped using libHugetlbfs. Importantly, THP requires neither pre-allocation of physical memory nor users statically deciding which program segment(s) to map with large pages.

Reductions in walk cycles do not necessarily translate to proportional performance improvements on out-of-order CPUs; the gains depend upon what portion of those cycles lies on the critical path of execution. Figure 2 shows the normalized performance. For all workloads except Redis, performance is calculated as the inverse of the execution time; for Redis, performance is the throughput reported by the benchmark.

We observe non-negligible performance improvements (at least 3%) from 1GB pages (over 2MB) for eight applications (the shaded left part of the figure). For example, Canneal speeds up by 30% over THP. These eight applications' performance improves by 12%, on average, with 1GB pages via libHugetlbfs, relative to THP using 2MB pages. The rest of the applications witness the benefits of using 2MB pages (over 4KB) but barely gain any further with 1GB pages. This is not surprising; walk cycles were already low with 2MB pages, and an out-of-order CPU could hide the remaining cycles. For the rest of the paper, we thus focus on the first eight (shaded) applications. We also observe that THP performs within 0.5% of libHugetlbfs using 2MB pages, without needing memory pre-allocation or user guidance. This emphasizes the importance of THP in the wide deployment of 2MB pages; something yet to be realized for 1GB pages.

Figure 4: Normalized performance under virtualization.

Figure 5: Total memory mappable with different page sizes ((a) Graph500; (b) SVM).
Two levels of translation under virtualization can increase overheads. Each level may use a different page size; thus, a total of nine combinations of page sizes is plausible. While we experimented with all of them, we discuss only the 4KB-4KB, 2MB-2MB, and 1GB-1GB combinations, where the first term denotes the page size used in the guest and the second denotes that in the host. We chose these configurations as they demonstrate the best performance achievable with a given page size.

Figure 3 shows the normalized fraction of page walk cycles under the three page size combinations. We notice significant reductions in walk cycles with 2MB and 1GB pages. For example, the fraction of walk cycles reduces by 80% for XSBench. Even a couple of 1GB-page-agnostic applications, e.g., PR and CC, experience a large reduction in walk cycles.

Figure 4 shows the performance under virtualization. We observe that 1GB pages provide a bit more benefit here. The eight 1GB-page-sensitive applications speed up by 17%, on average. Even BC, which did not benefit from 1GB pages under native execution, becomes slightly sensitive to 1GB pages.

Figure 6: Relative TLB-miss frequency ((a) Graph500; (b) SVM).
In the analysis so far, only one of the large page sizes was deployed, as is the norm in today's software. However, we find that using all large page sizes together can bring benefits that are not achievable using any one of them.

A virtual address range is mappable by a large page only if: (1) it is at least as long as that large page, and (2) its starting address is aligned at the boundary of that page size. All 1GB-mappable address ranges are, thus, mappable by 2MB pages, but not vice versa. When an application allocates, de-allocates, and re-allocates memory (e.g., Graph500), the virtual address space gets fragmented. Consequently, an application's entire address space may not be mappable by the largest page size. (A sketch of this alignment arithmetic follows this discussion.)

We empirically find that often GBs of an application's virtual memory is 2MB-mappable but not 1GB-mappable, whether applications allocate most memory upfront (low virtual memory fragmentation) or incrementally allocate/de-allocate memory over time (high fragmentation). We wrote a kernel module to periodically scan an application's virtual address space to measure 1GB-mappable and 2MB-mappable address ranges. Figure 5 shows the size of allocated virtual memory that is mappable with 2MB and 1GB pages over time for two representative applications, Graph500 and SVM. The x-axis represents the execution timeline (excluding initialization), and the y-axis is the virtual memory (in GB). The two lines in each graph show the amount of 1GB- and 2MB-mappable memory. We observe that several GBs of memory is mappable by 2MB pages but not by 1GB pages (the gap between the two lines). If only the largest page size were deployed, such regions would have to be mapped with 4KB pages.

We then measured the relative frequency of TLB misses across the allocated address space: our kernel module periodically clears the access bits in the (4KB) PTEs and then tracks which ones get set again by the hardware, signifying a TLB miss. Figure 6 presents the measurement. The x-axis shows the allocated virtual address regions, and the y-axis shows the relative TLB miss frequencies of pages in those regions. We use different colors for 2MB-mappable-but-1GB-unmappable and 1GB-mappable addresses. We observe that the 1GB-unmappable regions witness frequent TLB misses. Particularly for Graph500, the spike in miss frequency in a relatively small 1GB-unmappable region (about 800MB) stands out (circled). Therefore, it is important to map these 1GB-unmappable address ranges with 2MB pages to reduce TLB misses.

Furthermore, it may not always be possible to map a 1GB-mappable address range with a 1GB page due to the unavailability of 1GB contiguous physical memory. However, 2MB contiguous physical memory regions are more easily available. In short, it is important to utilize all page sizes available.

We also measured the usefulness of 1GB pages to the Linux kernel itself. The Linux kernel direct-maps the entire physical memory with the largest page size. Using a set of OS-intensive workloads, we found that 1GB pages improve performance by around 2-3% over 2MB pages (detailed in the Appendix).
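The mappability criterion above is pure alignment arithmetic. The following standalone helper (ours, not Trident's code) computes how many bytes of a virtual address range each page size can cover, which is exactly the quantity plotted in Figure 5:

```c
#include <stdint.h>
#include <stdio.h>

/* Bytes of [start, end) coverable by pages of size psz (a power of two):
   round start up and end down to psz boundaries. */
static uint64_t mappable_bytes(uint64_t start, uint64_t end, uint64_t psz) {
    uint64_t lo = (start + psz - 1) & ~(psz - 1);  /* align up   */
    uint64_t hi = end & ~(psz - 1);                /* align down */
    return hi > lo ? hi - lo : 0;
}

int main(void) {
    /* A 2.5GB region starting 1MB past a 1GB boundary (illustrative). */
    uint64_t s = (1ULL << 30) + (1ULL << 20), e = s + (5ULL << 29);
    printf("1GB-mappable: %llu MB\n",
           (unsigned long long)(mappable_bytes(s, e, 1ULL << 30) >> 20));
    printf("2MB-mappable: %llu MB\n",
           (unsigned long long)(mappable_bytes(s, e, 2ULL << 20) >> 20));
    return 0;
}
```

For this region, only 1GB of the 2.5GB is 1GB-mappable, while nearly all of it is 2MB-mappable; the gap is precisely the memory that benefits from deploying 2MB pages alongside 1GB pages.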
Summary of observations: (1) A set of niche but important memory-intensive applications speeds up with 1GB pages over 2MB pages; in contrast, 2MB pages almost universally benefit memory-intensive applications. (2) Application-transparent allocation of 2MB pages brings the benefits of 2MB pages without depending on the user, a capability that 1GB pages lack. (3) It is important to utilize all large page sizes, not only the largest.
5. Trident: Dynamic allocation of all page sizes
We design and implement Trident in Linux to enable application-transparent dynamic allocation of all three page sizes on x86-64 processors. Trident minimizes TLB misses by mapping most of an application's address space with 1GB pages, falling back to 2MB and, finally, 4KB pages.
Challenges: While the dynamic allocation of 2MB pages is not new, 1GB pages give rise to many new challenges. First, Trident needs to ensure a steady supply of free contiguous 1GB physical memory chunks even in the presence of fragmentation. We found that Linux's sequential-scanning-based compaction for creating 2MB chunks does not scale to 1GB due to excessive data copying. Second, Linux tracks free physical memory chunks only up to 4MB, whereas the dynamic allocation of 1GB pages requires maintaining free memory at up to 1GB granularity. Besides, allocating a 1GB page during a page fault is much slower than allocating a 2MB or 4KB page due to the latency of zeroing an entire 1GB of memory; low-latency 1GB page faults are necessary for an aggressive deployment of 1GB pages. Finally, Trident should map a virtual address range with the largest page size deployable at a given time, and should then periodically look for opportunities to promote address ranges mapped with smaller large page(s) to a larger one wherever possible.
At a high level, Trident modifies four major parts of Linux: (1) it enhances Linux to track free physical memory chunks of up to 1GB; (2) it updates the page fault handler to allocate a 1GB page on a fault when possible and to fall back to smaller pages if needed; (3) it extends THP's khugepaged background thread to promote (remap) virtual address ranges to 1GB pages when possible; and (4) it employs a novel smart compaction technique to ensure a steady supply of 1GB physical memory chunks at low overhead.
Tracking free physical memory: Linux's buddy allocator keeps an array of free lists of physical memory chunks of sizes 4KB up to 4MB in powers of 2 [7]. When free memory is needed, the buddy allocator provides a memory chunk from one of its lists based on the request size. Freed physical memory is returned to the buddy and coalesced with neighboring free memory chunks to create larger ones. Unfortunately, the buddy keeps track of regions only up to 4MB. We thus extended it to include separate lists for tracking up to 1GB memory chunks.
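A conceptual sketch of this extension (not the actual kernel patch): the buddy's free lists are indexed by order, where order n holds chunks of 2^n contiguous base pages, so covering 1GB means extending the maximum order from 10 (4MB) to 18 (1GB):

```c
#include <stddef.h>

#define PAGE_SHIFT        12    /* 4KB base pages                      */
#define MAX_ORDER_VANILLA 11    /* orders 0..10: 4KB .. 4MB chunks     */
#define MAX_ORDER_TRIDENT 19    /* orders 0..18: 4KB .. 1GB chunks     */

struct chunk { struct chunk *next; };

struct free_area {
    struct chunk *free_list;    /* chunks of exactly this order */
    unsigned long nr_free;
};

/* One free list per order; order 18 holds 1GB chunks, since
   (1UL << (18 + PAGE_SHIFT)) == 1GB. */
static struct free_area free_area[MAX_ORDER_TRIDENT];
```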
Allocating large pages at page faults: Like THP, Trident allocates large pages either (1) during a page fault (e.g., when a process accesses a virtual address for the first time) or (2) later, during attempts to promote an address range to a large page. We detail the former here. If the faulting virtual address falls in a 1GB-mappable address range, Trident attempts to map it with a 1GB page. If that fails, Trident attempts to map the address with a 2MB page and, on failure, with 4KB pages. If the faulting address falls in a region that is 2MB-mappable but not 1GB-mappable, Trident tries to map it with 2MB pages directly.
Asynchronous zero-fill: A 1GB page fault takes around 400 milliseconds; compare that to 850 µs for 2MB. The additional latency is due to zero-filling 1GB of memory instead of 2MB*. We instead employ asynchronous zero-fill to speed up 1GB faults: a kernel thread periodically zero-fills free 1GB regions, and Trident allocates an already-zero-filled region, if available. This reduces the average 1GB fault latency from 400 milliseconds to around 2 milliseconds.

* Zero-fill ensures that no leftover data is leaked, and hence cannot be avoided.

Table 3 compares the 1GB and 2MB pages allocated by Trident's various dynamic allocation mechanisms, which we discuss in this section. The first data column shows each application's memory footprint. The first set of sub-columns captures the behavior with un-fragmented physical memory, while the next set represents that under fragmentation. Physical memory is said to be fragmented if free memory is scattered in small holes and is thus non-contiguous. Typically, physical memory is un-fragmented only if the system is freshly booted and/or there is little memory usage; memory quickly gets fragmented as applications and the OS allocate and deallocate memory.

The sub-columns under un-fragmented show that the page fault handler alone (Page-fault only) is able to map a large fraction of an application's memory with 1GB pages for three out of eight applications (XSBench, GUPS, Graph500).
Figure 7: Trident's large-page promotion algorithm.

If an application pre-allocates its memory in large chunks, the fault handler will often find the faulting address in a 1GB-mappable region and map it with a 1GB page. However, Redis and Memcached progressively allocate memory while inserting key-value pairs; thus, the fault handler could map only a small portion of their memory with 1GB pages. SVM, Btree, and Canneal do not pre-allocate their entire memory needs, either.

The story is different if the physical memory is fragmented. Even if the fault handler finds a 1GB-mappable address range, it is unlikely to find a free 1GB physical memory chunk. Thus, it often falls back to 2MB or 4KB pages. This is evident from the data in the sub-columns for "Page-fault only" under fragmentation (Table 3), where only a few 1GB pages are allocated.
Promotion to larger pages: If an application does not pre-allocate memory, or the physical memory is fragmented, it becomes important to later remap (promote) address ranges with larger pages, when possible. Trident extends THP's khugepaged thread to promote to both 1GB and 2MB pages. Figure 7 shows a flowchart of Trident's page promotion algorithm (changes to THP are shaded). khugepaged first selects a candidate process whose memory to consider for promotion and sequentially scans its virtual address space. During scanning, Trident looks for 1GB-mappable virtual address ranges that are mapped with smaller pages. Subsequently, it looks for 2MB-mappable regions mapped with 4KB pages. If a candidate 1GB-mappable range is found, khugepaged requests the buddy allocator for a free 1GB physical memory chunk. If a 1GB chunk is unavailable, khugepaged requests compaction of the physical memory to create one; Trident extends THP's compaction functionality to create 1GB physical memory chunks (detailed shortly). If the compaction fails, Trident attempts to map the range with 2MB pages (if it is not already mapped with 2MB). Trident's policy of preferring 1GB pages but falling back to 2MB pages makes the most of the TLB resources.
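Since the flowchart is easier to follow in code form, here is a compact sketch of the per-region decision khugepaged makes in Trident (mirroring Figure 7); the types and helper implementations are our hypothetical stand-ins for the kernel's actual routines:

```c
#include <stdbool.h>

#define SZ_4K (1UL << 12)
#define SZ_2M (1UL << 21)
#define SZ_1G (1UL << 30)

struct region { unsigned long start, len, page_size; };

/* Hypothetical stand-ins for Trident's kernel routines. */
static bool mappable(struct region *r, unsigned long psz)
    { return r->len >= psz && (r->start & (psz - 1)) == 0; }
static bool buddy_has_free(unsigned long psz) { (void)psz; return false; }
static bool smart_compact(unsigned long psz)  { (void)psz; return true; }
static bool normal_compact(unsigned long psz) { (void)psz; return true; }
static void remap(struct region *r, unsigned long psz) { r->page_size = psz; }

/* One step of khugepaged's scan: prefer promoting region r to 1GB,
   compacting if the buddy has no free 1GB chunk; if that fails,
   try promoting 4KB-mapped regions to 2MB. */
void consider_region(struct region *r)
{
    if (mappable(r, SZ_1G) && r->page_size < SZ_1G) {
        if (buddy_has_free(SZ_1G) || smart_compact(SZ_1G)) {
            remap(r, SZ_1G);                 /* promote to 1GB */
            return;
        }
        /* 1GB failed: fall through and try 2MB below. */
    }
    if (mappable(r, SZ_2M) && r->page_size == SZ_4K) {
        if (buddy_has_free(SZ_2M) || normal_compact(SZ_2M))
            remap(r, SZ_2M);                 /* promote to 2MB */
    }
}
```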
Table 3: Comparison of 1GB and 2MB pages allocated via the different mechanisms employed in Trident. For each application: memory footprint (GB), then, under both un-fragmented and fragmented physical memory, the GBs mapped as 1GB/2MB pages by the page-fault handler alone, by promotion with normal compaction, and by promotion with smart compaction.

                       Un-fragmented (all data in GB)                 Fragmented (all data in GB)
            Footprint  Page-fault   Promotion     Promotion          Page-fault   Promotion     Promotion
            (GB)       only         (normal comp.) (smart comp.)     only         (normal comp.) (smart comp.)
                       1GB   2MB    1GB   2MB      1GB   2MB         1GB   2MB    1GB   2MB      1GB   2MB
  XSBench   117        114   2.94   116   1.2      116   1.2         6     5.3    79    38.1     80    37.1
  GUPS      32         31    1      31    1        31    1           9     2.5    31    1        31    1
  SVM       68.5       54    14.3   65    3.5      65    3.5         6     5      53    12.2     54    9.9
  Redis     44         0     0.5    39    3.4      39    3.4         0     0      25    10.3     28    14.3
  Btree     25         0     16.7   16    5.8      16    5.8         0     11.7   8     12.73    12    8.91
  Graph500  63.5       59    4.01   60    3.35     60    3.35        5     5.8    37    24.2     38    23.6
  Memcached 137        16    121    121   16       121   16          9     60     12    55       16    60
  Canneal   32         8     1      30    2        30    2           6     1      6     21       8     22
Figure 8: Comparison between Linux's (traditional) compaction of physical memory and smart compaction ((a) Linux's "normal" compaction; (b) smart compaction).
Table 3's sub-columns under "normal compaction" show the number of 1GB and 2MB pages allocated when the above-mentioned promotion policy is applied along with the page fault handler (under both un-fragmented and fragmented memory). For example, in the un-fragmented case, khugepaged is able to promote about 39GB of memory using 1GB pages for Redis, where the fault handler alone failed to allocate even a single 1GB page. SVM and Canneal also enjoyed many more 1GB pages due to page promotion.

When the physical memory is fragmented, page promotion helps applications get some 1GB pages, although slightly fewer than in the un-fragmented case. For example, the 1GB pages allocated to SVM drop from 65 to 53. This is expected; free 1GB memory chunks are scarce even after compaction.

Overheads of compaction for 1GB chunks, however, can negate the benefits of 1GB pages. Creating even a single 1GB chunk with Linux's sequential compaction can require copying close to a gigabyte of data, as we explain next.
Smart compaction: We thus propose a new compaction technique, called smart compaction, to reduce the cost of 1GB compaction while creating enough 1GB physical memory chunks. The primary goal is to reduce the number of bytes copied, which directly reduces the cost of compaction.

Figure 8 illustrates the difference between the normal compaction employed in Linux today and the smart compaction employed in Trident. Figure 8(a) shows the working of normal compaction. On a compaction request, the khugepaged thread starts sequentially scanning physical memory from where it left off the last time it attempted to compact (remembered in a source pointer). Scanning proceeds from low to high physical addresses. As it finds an occupied physical page frame (4KB), it copies its contents to a free page frame found by scanning in the opposite direction from a target pointer. This continues until a free memory chunk of the desired size (e.g., 2MB) is created, or the entire memory is scanned without success.

We observe that this strategy is agnostic to how full or empty a physical memory region is. Consequently, it leads to redundant copying. Consider the example in Figure 8(a). The 1GB region starting at address S is mostly occupied and has only 256 free page frames (4KB). Thus, to free that 1GB region, Linux would have to copy 999MB of data (512 × 512 − 256 4KB pages). Instead, if a mostly free region were freed, the number of bytes copied would be much smaller. While such sub-optimal compaction may be acceptable for 2MB, it is not for 1GB, as the data copying grows with size. Moreover, if the scan encounters a page frame with unmovable contents (e.g., inodes, DMA buffers) [35], then all copying done so far for that region is wasted; a free chunk cannot contain any unmovable contents, and the probability of encountering unmovable contents is much higher in a 1GB region.

To address these shortcomings, smart compaction divides the physical memory into 1GB regions and selects (rather than scans for) the region with the fewest occupied page frames for freeing (i.e., as the source of copying). Similarly, a region with the most occupied page frames is preferred as the target of copying. This strategy minimizes data copy. We also track whether a given 1GB region contains any unmovable contents, and avoid selecting regions with unmovable content for freeing (i.e., as the source). This eliminates unnecessary data copying in futile compaction attempts.

To implement the above idea, we introduced two counters for each 1GB physical memory region: one tracks the number of free page frames, and the other tracks the number of unmovable pages within the region. Whenever a page is returned to the buddy allocator (i.e., freed), we increment the free-frame counter of the encompassing 1GB region, and we decrement its unmovable-page counter if the freed page frame(s) contained unmovable data. Whenever a page frame(s) is allocated from the buddy allocator, the free counter of the encompassing region is decremented, and its unmovable-page counter is incremented if the allocated page frame(s) will contain unmovable data (e.g., when requested for kernel data structures). Note that a 1GB region can also have a 2MB page allocated within it; we treat such a page as 512 base pages for ease of bookkeeping.

As depicted in Figure 8(b), smart compaction starts by selecting a 1GB region with the largest number of free page frames and no unmovable pages as the source (S). It then selects a target region (T) to receive the contents of the occupied page frames in the source; the region with the fewest free page frames is selected as the target. T may not have enough free frames to accommodate all of S's page frames; if so, the region with the next-fewest free frames is selected to accommodate the leftovers (and so on).

The sub-columns for smart compaction in Table 3 show the number of 1GB and 2MB pages allocated under un-fragmented and fragmented physical memory. We observe that the number of 1GB pages allocated to each application is the same as under normal compaction in the un-fragmented case. Under fragmentation, smart compaction typically provides even more 1GB pages. This is because smart compaction always selects the 1GB region that is easiest to free, and thus compaction succeeds more often.

Figure 9: Reduction in bytes copied by smart compaction.

Figure 9 shows the percentage reduction in the number of bytes copied with smart compaction over normal compaction. This measurement is performed when physical memory is fragmented, as otherwise compaction is not required. We observe that smart compaction often reduces the number of bytes copied by more than half, and by up to 85%. This demonstrates that smart compaction performs less work to create the same or more 1GB chunks.
Only for XSBench is the improvement smaller. XSBench uses a large fraction of the total memory in the system, and thus even an ideal compaction algorithm could not avoid data copying under fragmentation; all compaction algorithms behave the same when physical memory is fragmented and an application needs nearly all of it.
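A sketch of the bookkeeping behind this selection; the counter updates on the allocation/free paths are constant-time, while source selection scans the per-region statistics. Field and function names here are our own, not the kernel's:

```c
#include <stdbool.h>
#include <stddef.h>

#define FRAMES_PER_1G (1UL << 18)   /* 262144 4KB frames per 1GB region */

/* Per-1GB-region statistics maintained by smart compaction. */
struct region_stat {
    unsigned int free_frames;       /* 4KB frames currently free        */
    unsigned int unmovable_frames;  /* frames pinned by kernel objects  */
};

/* Called from the buddy allocator's alloc/free paths; nr is the
   number of 4KB frames, e.g., 512 for a 2MB page. */
void on_alloc(struct region_stat *r, unsigned int nr, bool unmovable) {
    r->free_frames -= nr;
    if (unmovable) r->unmovable_frames += nr;
}
void on_free(struct region_stat *r, unsigned int nr, bool was_unmovable) {
    r->free_frames += nr;
    if (was_unmovable) r->unmovable_frames -= nr;
}

/* Source selection: the region that is cheapest to evacuate, i.e., the
   one with the most free frames and nothing unmovable. */
int pick_source(struct region_stat *stats, size_t n) {
    int best = -1;
    for (size_t i = 0; i < n; i++)
        if (stats[i].unmovable_frames == 0 &&
            stats[i].free_frames < FRAMES_PER_1G &&   /* not already free */
            (best < 0 || stats[i].free_frames > stats[best].free_frames))
            best = (int)i;
    return best;   /* bytes to copy = (FRAMES_PER_1G - free_frames) * 4KB */
}
```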
Summary: Trident incorporates the modified page fault handler, the modified promotion logic (Figure 7), and smart compaction (Figure 8) to dynamically allocate 1GB, 2MB, or 4KB pages, as deemed suitable.

6. Trident_pv: Paravirtualizing Trident
Under virtualization, Trident can be deployed both in the guest OS and in the hypervisor to bring the benefits of dynamic allocation of all page sizes, including 1GB pages, to both levels of translation. We observe that it is possible to further optimize certain guest OS operations with paravirtualization.

The guest OS copies the contents of memory pages (1) to compact gPAs, and (2) to promote address mappings between gVA and gPA to larger pages. While the cost of copying 4KB pages is not high, copying 2MB pages in order to compact or promote to a 1GB page is slow. We observe, however, that the effect of copying guest physical pages can be mimicked by simply altering the mappings between the corresponding gPAs and hPAs. This copy-less approach quickens both compaction and 1GB page promotion in the guest but needs paravirtualization. We call this optional extension Trident_pv.

For brevity, we explain the key idea behind Trident_pv with the help of large page promotion only (Figure 10). Assume that two contiguous guest virtual pages, v1 and v2, are currently mapped to two non-contiguous smaller pages, g1 and g3, in guest physical memory (Figure 10(b)). For simplicity, we assume that a large page is double the size of a small page. To remap the gVA encompassing v1 and v2 with a large page, the guest OS would ordinarily copy their contents to two contiguous guest physical target pages, g7 and g8, and then update the mapping between gVA and gPA. This traditional way of promoting large pages by copying contents is shown in Figure 10(a).

Figure 10(c) shows Trident_pv's approach to page promotion without an actual copy. Instead of copying g1 to g7, the hypervisor exchanges the gPA-to-hPA mappings of g1 and g7. After the exchange, g1 maps to h6 and g7 to h2. Since h2 contains the data originally mapped by g1, this is the same as copying g1 to g7. Similarly, the hypervisor exchanges the gPA-to-hPA mappings of g3 and g8 to create the effect of copying g3 to g8. Then the gVA encompassing v1 and v2 is mapped by the guest with a large page to the contiguous gPA encompassing g7 and g8.

Figure 10: Traditional copy-based vs. Trident_pv's copy-less page promotion ((a) traditional copy-based promotion; (b) initial non-contiguous mappings; (c) copy-less promotion in Trident_pv).

In this approach, the guest OS and the hypervisor need to coordinate for copy-less page promotion, hence the need for paravirtualization. Specifically, the guest OS supplies the hypervisor with a list of source and target guest physical pages via a hypercall. The hypervisor then updates the gPA-to-hPA mappings in the manner explained above to create the effect of copying guest physical pages. Besides promotion, Trident_pv uses the same hypercall for compacting guest physical memory to create 1GB pages in the guest.

While promising, the cost of a hypercall (~300ns) to switch between the guest and the hypervisor can outweigh the benefits of copy-less promotion. We thus batch requests for multiple page mapping exchanges in a single hypercall. Two pages are predefined for passing the list of page addresses to exchange between the guest and the hypervisor: one page contains the source gPAs (here, g1 and g3), and the other contains the target gPAs (here, g7 and g8). In a single hypercall it is thus possible to request exchanges for 512 page addresses; a single hypercall is therefore sufficient to promote an entire 1GB region in gVA mapped with 2MB pages. The hypercall returns after switching all the requested pages, or logs any failure in the same shared page used for passing the list of pages. On failure, the guest falls back to individually copying the contents of pages.

We empirically found that promoting 2MB pages to a 1GB page in the guest takes ~600 ms with the copy-based technique. Without batching, Trident_pv can promote the same in less than 30 ms, while batching reduces the time to ~500 µs. Note that Trident_pv's copy-less promotion is less useful for promoting 4KB pages to 2MB, since the cost of copying 4KB pages is not significant. Hence, we employ copy-less promotion and compaction for 1GB pages only.
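A guest-side sketch of the batched exchange, assuming a hypothetical hypercall number and KVM's standard guest hypercall helper; the actual Trident_pv interface may differ:

```c
/* Guest kernel code (KVM guest). */
#include <linux/types.h>
#include <linux/kvm_para.h>   /* kvm_hypercall2() */

#define HC_XCHG_GPA_BATCH 13  /* hypothetical hypercall number      */
#define BATCH_ENTRIES     512 /* one 4KB page of 8-byte gPA entries */

/* Two predefined, hypervisor-shared pages: sources and targets. */
static u64 src_gpas[BATCH_ENTRIES] __aligned(4096);
static u64 dst_gpas[BATCH_ENTRIES] __aligned(4096);

/* Promote a 1GB guest-virtual region currently backed by 512 scattered
   2MB pages: ask the hypervisor to exchange each source gPA's hPA
   mapping with that of a target gPA inside a free, contiguous 1GB gPA
   range, then remap the gVA with a single 1GB page. */
static int exchange_batch(void)
{
    long ret = kvm_hypercall2(HC_XCHG_GPA_BATCH,
                              __pa(src_gpas), __pa(dst_gpas));
    /* On failure, the guest falls back to copying page contents. */
    return ret ? -1 : 0;
}
```

Batching is what amortizes the ~300ns hypercall cost: one call covers all 512 exchanges needed for a 1GB promotion, which is how the latency drops from ~30 ms to ~500 µs.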
7. Evaluation
We evaluate Trident to answer the following questions: (1) Can Trident improve the performance of memory-intensive applications over Linux's default THP, and over a recent work called HawkEye [35]? (2) How important is Trident's use of all large page sizes? (3) What are the sources of Trident's performance improvement? (4) How does Trident perform under virtualization? (5) Finally, how does Trident_pv impact page promotion/compaction in the guest OS?

Performance under un-fragmented physical memory:
Figure 11 shows the normalized performance for four configurations (higher is better): (1) Linux's THP, (2) HawkEye, (3) Trident-1G-only, and (4) Trident. HawkEye is the most recent related work that improved upon THP and other previous works (e.g., [30]). It does so by efficiently allocating 2MB pages to the memory regions that suffer the most from TLB misses [35]. This enables us to compare Trident's performance to the current state of the art. Trident-1G-only denotes the configuration where Trident is disallowed from using 2MB pages; the difference between Trident-1G-only and Trident highlights the importance of leveraging all large page sizes.

For each application, there are four bars in the cluster, corresponding to the four configurations. The height of each bar is normalized to the performance of the application under THP (Linux's default configuration). The measurements in Figure 11 were performed with un-fragmented physical memory.

First, we observe that Trident improves performance over Linux's THP by 14% on average, and by up to 47% for GUPS. Applications like XSBench, SVM, Btree, and Canneal witnessed substantial gains. Excluding GUPS, the performance improvement is 12%, on average, over THP.

Figure 11: Performance under no fragmentation.

Figure 12: Fraction of walk cycles under no fragmentation.

Next, we observe that
Trident also outperforms HawkEye, by 14% on average. This is expected given the similarities between huge page management in Linux and HawkEye in an un-fragmented system, where both utilize 2MB pages aggressively to maximize performance.

Finally, there is a significant performance gap between Trident-1G-only and Trident. Trident-1G-only loses performance even relative to THP for several applications (e.g., Graph500, SVM). In hindsight, this is expected. Our analysis in Section 4.3 revealed that these applications have significant portions of their virtual memory that are 2MB-mappable but not 1GB-mappable, and these portions also witness a relatively large number of TLB misses. Trident-1G-only is forced to map these 1GB-unmappable regions with base 4KB pages and thus incurs more translation overhead than Trident, which can deploy 2MB pages. In the process, Trident-1G-only's benefits from using 1GB pages are more than negated by the overheads of mapping frequently accessed memory with 4KB pages.

A keen reader will also observe that Trident-1G-only performs much worse than 1GB pages via libHugetlbfs (Figure 2, Section 4.1), although neither of them uses 2MB pages. The reason is that libHugetlbfs maps the entire chosen segment of an application's memory (here, the heap) with 1GB pages, irrespective of the sizes of allocation requests. For example, even if an application mallocs 12KB of memory, libHugetlbfs will map it with 1GB pages. Thus, the question of 1GB-mappable vs. 2MB-mappable virtual address regions does not arise under its static memory allocation mechanism. Since Trident-1G-only performs dynamic memory allocation, it does not have any such luxury. Fortunately, Trident is able to more than make up for this by utilizing all large page sizes while also retaining the ease of programmability of dynamic allocation.

Figure 13: Performance under fragmentation.

Figure 14: Fraction of walk cycles under fragmentation.
Performance under fragmented physical memory: Arguably, performance analysis under fragmented physical memory paints a more realistic execution scenario. Figure 13 shows the normalized performance under fragmented physical memory for the same four configurations as before. Trident speeds up applications even more under fragmentation. This is unsurprising, since Trident's smart compaction adds a further edge here. On average, it improves performance by 18% over THP, and GUPS quickens by over 50%. Even excluding GUPS, the improvement over THP remains substantial.

Trident also outperforms HawkEye in all cases. In some cases under fragmentation, HawkEye performed worse than THP (e.g., Memcached). Our discussions with the authors of HawkEye revealed that this can happen for large-memory applications due to: (1) the CPU overhead of the kbinmanager kernel thread that estimates relative TLB miss rates in HawkEye, and (2) potential lock contention among kbinmanager, khugepaged, and the page-fault handler.

Trident outperforms Trident-1G-only by a good margin even under fragmentation. However, note that in cases where Trident-1G-only performed worse than THP under un-fragmented
Table 4: Percentage of 1GB physical memory allocation failures
             Page fault  Promotion           Page fault  Promotion
  XSBench    94          32        GUPS      71          0
  SVM        88          19        Redis     NA          36
  Graph500   91          38        Btree     NA          25
  Memcached  43          81        Canneal   12          92
physical memory, it performs almost equally under fragmentation. Here, THP could deploy fewer 2MB pages due to the lack of contiguous physical memory, which in turn reduces the performance gap between Trident-1G-only and THP. THP's execution time also increases due to compaction overhead, narrowing its edge further.

Table 5: Tail latency analysis for Redis.

We also measured how often fragmented physical memory prevents Trident from mapping an address with a 1GB page. Table 4 shows the percentage of attempts to allocate a 1GB page that fail due to fragmentation. There are "NA" entries under page fault for Redis and Btree, since the fault handler never attempts to allocate a 1GB page for them due to the lack of 1GB-mappable virtual address ranges during faults. We observe that 71-94% of 1GB page allocations fail due to the lack of contiguous physical memory. Even during page promotion, 1GB allocations fail often. This further reinforces the need to utilize all large page sizes: even when the largest page size cannot be used, a smaller large page (2MB) can possibly be deployed.
Impact on page walk cycles: Figure 12 shows the normalized fraction of page walk cycles for THP, HawkEye, Trident-1G-only, and Trident under un-fragmented memory. Figure 14 shows the same under fragmentation. The reductions in the fraction of walk cycles with Trident over THP are significant: 38-85% under no fragmentation and 40-97% under fragmentation. Across all configurations, we observe that the relative improvements in performance correspond to the relative reductions in page walk cycles.
Impact on tail latency: Tail latency is important for interactive applications (e.g., Redis) that must abide by strict SLAs [30]. Table 5 reports the 99th percentile latency of Redis with 4KB pages, THP, and Trident, under both un-fragmented and fragmented memory. Trident slightly improves tail latency relative to both 4KB pages and THP: Trident avoids long-latency 1GB page faults by employing asynchronous zero-fill, and 1GB pages reduce TLB misses on the critical path.
Performance under virtualization: We measured the performance of applications running inside a virtual machine with Trident deployed both in the guest OS and in the hypervisor (KVM). We do not fragment memory here. For comparison, we do the same with HawkEye. Figure 15 shows the speedups, normalized to THP deployed in the guest OS and in KVM. Under virtualization, Trident improves performance by 16% on average over THP, and by 15% over HawkEye. Canneal saw the biggest improvement (50%), but other applications also benefited significantly; for example, SVM and Graph500 each witnessed a 6% improvement.

Figure 15: Performance under virtualization.

Figure 16: Trident_pv's performance under fragmented gPA.
Performance with Trident_pv: When the gPA gets fragmented over time, the guest OS must compact and promote pages with the help of the khugepaged thread. However, significant CPU usage by khugepaged in the guest OS means wasted vCPU time (and cost) for a tenant in the cloud. In fact, Netflix reported how their deployments on Amazon EC2 can get adversely affected by high CPU utilization from THP's khugepaged [28]. We therefore evaluate Trident_pv with fragmented gPA but limit khugepaged's CPU utilization in the guest to a maximum of 10% of a single vCPU. This setup helps determine whether Trident_pv's faster copy-less promotion/compaction makes 1GB pages useful when CPU cycles are not free.

Figure 16 shows the performance of Trident and Trident_pv, normalized to THP. Trident_pv is more effective than Trident for XSBench, GUPS, Memcached, and SVM: by 5% on average and by up to 10%. We also observe that Trident_pv does not always improve performance over Trident. Recall that Trident_pv's hypercall-based copy-less approach is quicker than the copy-based approach only when promoting/compacting 2MB pages into 1GB pages. Otherwise, the overhead of the hypercall, and of tracing and altering PTEs, overshadows the benefit of avoiding the copy. In applications such as Btree, Graph500, and Canneal, 4KB pages are often promoted directly to 1GB pages without going via 2MB pages, limiting Trident_pv's scope for improving their performance.

Memory bloat:
Large pages are well known to increase memory footprint (bloat) due to internal fragmentation; the larger the page size, the greater the bloat. Trident causes bloat in two out of eight workloads, adding 38GB and 13GB of bloat for Memcached and Btree, respectively, over THP. We were able to recover the bloat by simply incorporating HawkEye's technique of dynamically detecting and recovering bloat by demoting large pages and de-duplicating zero-filled small pages [35]. However, we make no new contribution here, and the tradeoff between bloat and large pages is well explored [30, 35].
Summary: We demonstrate that Trident significantly improves performance over THP and a state-of-the-art academic proposal [35] under varied execution scenarios.
8. Related work
Address translation overhead is the topic of several recent research efforts spanning multiple fields [11, 18, 27, 31, 37, 40, 42].
Proposals that require hardware support: Hardware optimizations focus on reducing TLB misses or on accelerating page walks. Multi-level TLBs and multiple page sizes are found in today's commercial CPUs [4, 22]. Further, page walk caches are used to make page walks faster [14, 19]. Direct segments [16] can significantly reduce address translation overheads through segmentation hardware. Coalesced large-reach TLBs increase TLB coverage through contiguity-aware hints encoded in the page tables [39]; this approach can also be combined with page walk caches and large pages [19, 22, 38]. POM-TLB reduces page walk latency by servicing a TLB miss with a single memory lookup into a large in-memory TLB [42]. SpecTLB speculatively provides address translation on a TLB miss by guessing virtual-to-physical address mappings [15]. ASAP prefetches translations to reduce page-walk latency to that of a single memory lookup [33]: it first orders page table pages to match the order of the virtual memory pages and then uses base-plus-offset arithmetic to directly index into the page tables. Large page support for non-contiguous physical memory has also been proposed [24]. Tailored page sizes use whatever contiguity the OS can afford to allocate [29]. In contrast, Trident does not need new hardware; architectural support is complementary to Trident's goal of fully utilizing large page support.
Proposals with only software support: Software-only solutions mainly focus on better use of large pages. Navarro et al. proposed reservation-based large page promotion in FreeBSD, as compared to the proactive large page allocation in Linux [34]. Ingens proposes to mix THP's aggressive large page allocation with FreeBSD's conservative approach to reduce memory bloat and latency while still leveraging large pages [30]. Illuminator showed how unmovable kernel objects hinder compaction [36]. Quicksilver uses hybrid strategies across the different stages in the lifetime of a large page: it employs aggressive allocation, hybrid preparation, relaxed mapping creation, on-demand mapping destruction, and preemptive deallocation to achieve high performance with lower latency and bloat [48]. Carrefour-LP showed how large pages can degrade performance on NUMA systems due to remote DRAM accesses and unbalanced traffic [26]. Trident is complementary to these works; while they focused on KB- and MB-sized pages, Trident focuses on 1GB pages, which bring their own unique challenges for compaction and page promotion. Many insights from these works on 2MB pages apply to Trident too; e.g., HawkEye's fine-grained page promotion and Ingens's adaptive approach to balancing large-page trade-offs can be applied to Trident as well.
Translation Ranger proposed a new OS service to actively create contiguity [47]. A recent proposal allows in-place compaction of physical memory [45]; this functionality, however, must be invoked via a special syscall and is unsuitable for applications that incrementally allocate memory (e.g., Redis). Mitosis [10] eliminates remote memory accesses due to page walks by replicating page tables across NUMA nodes; Trident can also help reduce the impact of remote accesses in page walks, though only as a side effect. In work concurrent to ours, code patches enhancing THP to support 1GB pages were recently posted on the Linux mailing list [46]. This shows the Linux community's growing interest in 1GB pages. However, unlike these patches, which simply aim to enable 1GB THP allocation, our work comprehensively shows the types of workloads that can benefit from 1GB pages, deals with physical memory fragmentation via smart compaction, and virtualizes 1GB pages via copy-less 1GB page promotion and compaction in Trident_pv.
9. Conclusion
Application-transparent support is crucial to the success of large pages. While OS support for 2MB pages has matured over the years, 1GB pages have received little attention despite being present in hardware for a decade. We propose Trident to leverage the architectural support for all three page sizes available on x86-64 systems, while also dealing with the latency and fragmentation challenges. Our evaluation shows that transparent 1GB pages, particularly in tandem with 2MB pages, provide a significant performance boost over 2MB THP in Linux. Further, the paravirtualized extension of Trident, called Trident_pv, can effectively virtualize 1GB pages with copy-less guest page promotion and compaction. We hope that Trident's 18% performance improvement over Linux THP motivates researchers to further explore the role of large pages, beyond 2MB, in building efficient large-memory systems.
Trident source will be available at https://github.com/csl-iisc/Trident.

References

[1] https://www.mail-archive.com/freebsd-hackers@freebsd.org/msg13993.html.
[2] Graph500. https://graph500.org/.
[3] GUPS: HPCC RandomAccess benchmark. https://github.com/alexandermerritt/gups.
[4] Hugepages. https://wiki.debian.org/Hugepages.
[5] https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
[6] Linux perf. https://en.wikipedia.org/wiki/Perf_(Linux).
[7] Page frame allocation via buddy in Linux. https://wiki.osdev.org/Page_Frame_Allocation.
[8] WikiChip: Intel Coffee Lake architecture. https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake.
[9] WikiChip: Intel Sandy Bridge microarchitecture. https://en.wikichip.org/wiki/intel/microarchitectures/sandy_bridge_(client).
[10] Reto Achermann, Ashish Panwar, Abhishek Bhattacharjee, Timothy Roscoe, and Jayneel Gandhi. Mitosis: Transparently self-replicating page-tables for large-memory machines, 2019.
[11] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. Do-it-yourself virtual memory translation. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA '17, 2017. ACM.
[12] https://www.anandtech.com/show/14664/testing-intel-ice-lake-10nm/2.
[13] David Bailey, E. Barszcz, J. Barton, D. Browning, Robert Carter, Leonardo Dagum, Rod Fatoohi, Paul Frederickson, T. Lasinski, Robert Schreiber, Horst Simon, Venkat Venkatakrishnan, and Sisira Weeratunga. The NAS parallel benchmarks: Summary and preliminary results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, Supercomputing '91, pages 158-165, New York, NY, USA, 1991. ACM.
[14] Thomas W. Barr, Alan L. Cox, and Scott Rixner. Translation caching: Skip, don't walk (the page table). In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pages 48-59, New York, NY, USA, 2010. ACM.
[15] Thomas W. Barr, Alan L. Cox, and Scott Rixner. SpecTLB: A mechanism for speculative address translation. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, pages 307-318, New York, NY, USA, 2011. ACM.
[16] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. Efficient virtual memory for big memory servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pages 237-248, New York, NY, USA, 2013. ACM.
[17] Scott Beamer, Krste Asanovic, and David A. Patterson. The GAP benchmark suite. CoRR, abs/1508.03619, 2015.
[18] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha Manne. Accelerating two-dimensional page walks for virtualized systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIII, pages 26-35, New York, NY, USA, 2008. ACM.
[19] Abhishek Bhattacharjee. Large-reach memory management unit caches. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, pages 383-394, New York, NY, USA, 2013. ACM.
[20] Christian Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January 2011.
[21] Josiah L. Carlson. Redis in Action. Manning Publications Co., Greenwich, CT, USA, 2013.
[22] Guilherme Cox and Abhishek Bhattacharjee. Efficient address translation for architectures with multiple page sizes. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '17, 2017. ACM.
[23] https://www.anandtech.com/show/14155/intels-enterprise-extravaganza-2019-roundup, 2019.
[24] Yu Du, Miao Zhou, Bruce R. Childers, Daniel Mossé, and Rami G. Melhem. Supporting superpages in non-contiguous physical memory. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pages 223-234, 2015.
[25] Brad Fitzpatrick. Distributed caching with memcached. Linux J., 2004(124):5, August 2004.
[26] Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston, Alexandra Fedorova, and Vivien Quéma. Large pages may be harmful on NUMA systems. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC '14, pages 231-242, Berkeley, CA, USA, 2014. USENIX Association.
[27] Mel Gorman and Patrick Healy. Performance characteristics of explicit superpage support. In Proceedings of the 2010 International Conference on Computer Architecture, ISCA '10, pages 293-310, Berlin, Heidelberg, 2012. Springer-Verlag.
[28] https://www.brendangregg.com/Slides/AWSreInvent2017_performance_tuning_EC2.pdf.
[29] Faruk Guvenilir and Yale N. Patt. Tailored page sizes. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 900-912. IEEE, 2020.
[30] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. Coordinated and efficient huge page management with Ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI '16, pages 705-721, USA, 2016. USENIX Association.
[31] Joshua Magee and Apan Qasem. A case for compiler-driven superpage allocation. In Proceedings of the 47th Annual Southeast Regional Conference, ACM-SE 47, pages 82:1-82:4, New York, NY, USA, 2009. ACM.
[32] Kristie Mann. Five use cases of Intel Optane DC persistent memory at work in the data center, 2019. https://itpeernetwork.intel.com/intel-optane-use-cases/.
[33] Artemiy Margaritov, Dmitrii Ustiugov, Edouard Bugnion, and Boris Grot. Prefetched address translation. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '52, pages 1023-1036, New York, NY, USA, 2019. Association for Computing Machinery.
[34] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. Practical, transparent operating system support for superpages. SIGOPS Oper. Syst. Rev., 36(SI):89-104, December 2002.
[35] Ashish Panwar, Sorav Bansal, and K. Gopinath. HawkEye: Efficient fine-grained OS support for huge pages. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '19, New York, NY, USA, 2019. ACM.
[36] Ashish Panwar, Aravinda Prasad, and K. Gopinath. Making huge pages actually useful. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '18, pages 679-692, New York, NY, USA, 2018. ACM.
[37] Chang Hyun Park, Sanghoon Cha, Bokyeong Kim, Youngjin Kwon, David Black-Schaffer, and Jaehyuk Huh. Perforated page: Supporting fragmented memory allocation for large pages. In Proceedings of the 47th International Symposium on Computer Architecture, ISCA '20, New York, NY, USA, 2020. ACM.
[38] Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel H. Loh. Increasing TLB reach by exploiting clustering in page translations. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 558-567, 2014.
[39] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. CoLT: Coalesced large-reach TLBs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, pages 258-269, Washington, DC, USA, 2012. IEEE Computer Society.
[40] Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee. Large pages and lightweight memory management in virtualized environments: Can you have it both ways? In Proceedings of the 48th International Symposium on Microarchitecture, MICRO-48, pages 1-12, New York, NY, USA, 2015. ACM.
[41] Kiran Puttaswamy and Gabriel Loh. Thermal analysis of a 3D die-stacked high-performance microprocessor. In Proceedings of the 16th ACM Great Lakes Symposium on VLSI, GLSVLSI '06, pages 19-24, New York, NY, USA, 2006. ACM.
[42] Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. Rethinking TLB designs in virtualized environments: A very large part-of-memory TLB. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA '17, 2017. ACM.
[43] http://www.microarch.org/micro44/files/Micro%20Keynote%20Final%20-%20Avinash%20Sodani.pdf.
[44] John R. Tramm, Andrew R. Siegel, Tanzima Islam, and Martin Schulz. XSBench: The development and verification of a performance abstraction for Monte Carlo reactor analysis. In PHYSOR 2014: The Role of Reactor Physics toward a Sustainable Future, Kyoto, 2014.
[45] Zi Yan. Generating physically contiguous memory after page allocation, 2019. https://patchwork.kernel.org/cover/10815945/.
[46] Zi Yan. 1GB PUD THP support on x86_64. https://lwn.net/Articles/832881/, 2020.
[47] Zi Yan, Daniel Lustig, David Nellans, and Abhishek Bhattacharjee. Translation Ranger: Operating system support for contiguity-aware TLBs. In Proceedings of the 46th International Symposium on Computer Architecture, ISCA '19, pages 698-710, New York, NY, USA, 2019. ACM.
[48] Weixi Zhu, Alan L. Cox, and Scott Rixner. A comprehensive analysis of superpage management mechanisms and policies. In 2020 USENIX Annual Technical Conference (USENIX ATC 20), pages 829-842. USENIX Association, July 2020.