[PDF] Mitosis: Transparently Self-Replicating Page-Tables for Large-Memory Machines

Abstract

Multi-socket machines with 1-100 TBs of physical memory are becoming prevalent. Applications running on multi-socket machines suffer non-uniform bandwidth and latency when accessing physical memory. Decades of research have focused on data allocation and placement policies in NUMA settings, but there have been no studies on the question of how to place page-tables amongst sockets. We make the case for explicit page-table allocation policies and show that page-table placement is becoming crucial to overall performance. We propose Mitosis to mitigate NUMA effects on page-table walks by transparently replicating and migrating page-tables across sockets without application changes. This reduces the frequency of accesses to remote NUMA nodes when performing page-table walks. Mitosis uses two components: (i) a mechanism to enable efficient page-table replication and migration; and (ii) policies for processes to efficiently manage and control page-table replication and migration. We implement Mitosis in Linux and evaluate its benefits on real hardware. Mitosis improves performance for large-scale multi-socket workloads by up to 1.34x by replicating page-tables across sockets. Moreover, it improves performance by up to 3.24x in cases when the OS migrates a process across sockets by enabling cross-socket page-table migration.

Full PDF

MMitosis: Transparently Self-Replicating Page-Tables for Large-Memory Machines

Reto Achermann

Ashish Panwar

Abhishek Bhattacharjee Timothy Roscoe Jayneel Gandhi VMware Research ETH Zurich IISc Bangalore Yale University [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

Mitosis to mitigate NUMA effects on page-tablewalks by transparently replicating and migrating page-tablesacross sockets without application changes. This reduces thefrequency of accesses to remote NUMA nodes when perform-ing page-table walks.

Mitosis uses two components: (i) amechanism to enable efﬁcient page-table replication and mi-gration; and (ii) policies for processes to efﬁciently manageand control page-table replication and migration.We implement

Mitosis in Linux and evaluate its beneﬁts onreal hardware.

Mitosis improves performance for large-scalemulti-socket workloads by up to 1.34x by replicating page-tables across sockets. Moreover, it improves performance byup to 3.24x in cases when the OS migrates a process acrosssockets by enabling cross-socket page-table migration.

1. Introduction

In this paper, we investigate the performance issues in largeNUMA systems caused by the sub-optimal placement not ofprogram data, but of page-tables, and show how to mitigatethem by replicating and migrating page-tables across sockets.The importance of good data placement across sockets forperformance on NUMA machines is well-known [29, 32, 41,46]. However, the increase in main memory size is outpacingthe growth of TLB capacity. Thus, TLB coverage (i.e. the sizeof memory that TLBs map) is stagnating and is causing moreTLB misses [21, 50, 58, 59]. Unfortunately, the performancepenalty due to a TLB miss is signiﬁcant (up to 4 memoryaccesses on x86-64). Moreover, this penalty will grow to 5memory accesses with Intel’s new 5-level page- tables [43].Our ﬁrst contribution in this paper (§ 3) is to show by ex-perimental measurements on a real system that page-table placement in large-memory NUMA machines poses perfor-mance challenges: a page-table walk may require multipleremote DRAM accesses on a TLB miss and such misses areincreasingly frequent. We show this effect due to page-tableplacement on a large-memory machine in two scenarios. Theﬁrst is a multi-socket scenario (§ 3.1) , where large-scale mul-tithreaded workloads execute across all sockets. In this case, N o r m a li z e d R un t i m e N o r m a li z e d R un t i m e Sockets 0 1 2 3Remote 86% 68% 71% 75%Local 14% 32% 29% 25% Single-SocketRemote 100%Local 0%

Figure 1:

Top Table:

Percentage of local and remote leafPTEs as observed from each socket on a TLB miss and

Bot-tom Graph:

Normalized runtime, for two workloads showingmulti-socket (left) and workload migration (right) scenarioswith their respective improvement using

Mitosis . the page-table is distributed across sockets by the OS as it seesﬁt. Such page placement results in multiple remote page-tableaccesses, degrading performance. We show the percentageof remote/local page-table entries (PTEs) on a TLB miss asobserved from each socket in the top left table of Figure 1 forone workload (Canneal) from the multi-socket scenario. Weobserve that some sockets experience longer TLB misses sinceup to 86% of leaf PTEs are located remotely. Large-memoryworkloads like key-value stores and databases that stress TLBcapacity are particularly susceptible to this behavior.Our second analysis conﬁguration focuses on a workloadmigration scenario (§ 3.2 ), where the OS decides to migratea workload from one socket to another. Such behavior arisesfor many reasons: the need to load balance, consolidate, im-prove cache behavior, or save power/energy [3, 31, 61]. A keyquestion with migration is what happens to the data that theworkload accesses. Existing NUMA policies in commodityOSes migrates data pages to the target socket where the work-load has been migrated. Unfortunately, page-table migrationis not supported [56], making future TLB misses expensive.Such misplacement of page-tables leads to performance degra-dation for the workload since 100% of TLB misses requireremote memory access as shown in top right table of Figure 1for one workload (GUPS) from workload migration scenario.Workload migration is common in environments where virtualmachines or containers are consolidated on large systems [3].Ours is the ﬁrst study to show this problem of sub-optimalpage-table placement on NUMA machine using these twocommonly occurring scenarios.Our second contribution (§ 4) is a technique, Mitosis , whichreplicates and migrates page-tables to reduce this effect.

Mi-tosis works entirely within the OS and requires no changeto application binaries. The design consists of a mechanism a r X i v : . [ c s . O S ] N ov o enable efﬁcient page-table replication and migration (§ 5),and associated policies for processes to effectively managepage-table replication and migration (§ 6). Mitosis builds onwidely-used OS mechanisms like page-faults and system callsand is hence applicable to most commodity OSes.Our third contribution (§ 5, 6) is an implementation of

Mi-tosis for an x86-64 Linux kernel. Instead of substantiallyre-writing the memory subsystem, we extend the Linux PV-Ops [9] interface to page-tables and provide policy extensionsto Linux’s standard user-level NUMA library, allowing usersto control migration and replication of page-tables, and selec-tively enable it on a per-process basis. When a process is sched-uled to run on a core, we load the core’s page-table pointerwith the physical address of the local page-table replica forthe socket. When the OS modiﬁes the page-table, the updatesare propagated to all replicas efﬁciently and that page-tablereads return consistent values based on all replicas.An important feature of

Mitosis is that it requires no changesto applications or hardware, and is easy to use on a per-application basis. For this reason,

Mitosis is readily deployableand complementary to emerging hardware techniques to re-duce address translation overheads like segmentation [21, 49],PTE coalescing [58, 59] and user-managed virtual mem-ory [16]. We will release our implementation of

Mitosis toenable future research on page-table placement and plan toupstream our changes to Linux.Our ﬁnal contribution (§ 8) is a performance evaluationof

Mitosis on real hardware. We show the effects of page-table replication and migration on a large-memory machinein the same two scenarios used before to analyze page-tableplacement. In the ﬁrst, multi-socket scenario , we had observedthat page-table placement results in multiple remote memoryaccesses, degrading performance for many workloads. Thegraph on the bottom left of Figure 1 shows the performanceof a commonly used “ﬁrst-touch” allocation policy whichallocates data pages local to the socket that touches the dataﬁrst. This policy is not ideal as it cannot allocate page-tableslocally for all sockets.

Mitosis replicates page-tables acrosssockets to improve performance by up to 1.34x in this scenario.These gains come at a mere cost of 0.6% memory overheadcompared to the exorbitant memory cost of data replication.In the second, workload migration scenario , we had ob-served that page-table migration is not supported, which makesTLB misses expensive for workloads after their migrationacross sockets. The graph on the bottom right in Figure 1quantiﬁes the worst-case performance impact of misplacingpage-tables on memory that is remote with respect to the appli-cation socket (see remote (interfere) bar). The local bar showsthe ideal execution time with locally allocated page-tables.

Mitosis improves this situation by enabling cross-socket page-table migration, and boosts performance by up to 3.24x.

2. Background

Translation Lookaside Buffers (TLBs) enable fast addresstranslation and are key to the performance of a virtual mem-ory based system. Unfortunately, TLBs only cover a tinyfraction of physical memory available on modern systemswhile workloads consume all memory for storing their largedatasets. Hence, memory-intensive workloads incur frequentcostly TLB misses requiring page-table lookup by hardware.Research has shown that TLB miss processing is pro-hibitively expensive [21, 24, 25, 26, 38, 53] as walking page-tables (e.g., 4-level radix tree on x86-64) requires multiplememory accesses. Even worse, virtualized systems need two-levels of page-table lookups which can result in much higherTLB miss processing overheads (24 memory accesses insteadof four on x86-64). Consequently, address translation over-heads of 10-40% are not unusual [21, 24, 25, 39, 40, 50], andwill worsen with emerging 5-level page-tables [43].In response, many research proposals improve addresstranslation by reducing the frequency of TLB misses and/oraccelerating page-table walks. Use of large pages to in-crease TLB-coverage [34, 35, 36, 55, 57, 63, 64, 66] andadditional MMU structures to cache multiple levels of thepage-tables [19, 24, 26] are some of the techniques widelyadopted in commercial systems. In addition, researchers havealso proposed TLB-speculation [20, 60], prefetching transla-tions [47, 53, 62], eliminating or devirtualizing virtual mem-ory [42], or exposing virtual memory system to applications tomake the case for application-speciﬁc address translation [16].We observe that prior works studied address translationon single-socket systems. However, page-tables are oftenplaced across remote and local memories in large-memorysystems. Given the sensitivity of large page placement on suchsystems [41], we were intrigued by the question of how page-table placement affects overall performance. In this paper, wepresent compelling evidence to show that optimizing page-table placement is as crucial as optimizing data placement.

Multi-socket architectures, where CPUs are connected viaa cache-coherent interconnect, offer scalable memory band-width even at high capacity and are frequently used in moderndata centers and cloud deployments. Looking forward, thistrend will only increase; large-memory (1-100 TBs) machinesare integrating even more devices with different performancecharacteristics like Intel’s Optane memory [6]. Furthermore,emerging architectures using chiplets and multi-chip modules[17, 33, 44, 45, 48, 54, 65, 67] will drive the multi-socket andNUMA paradigm: accessing memory attached to the localsocket will have higher bandwidth and lower latency thanaccessing memory attached to a remote socket. Note that ac-cessing remote memory can incur 2-4x higher latency thanaccessing local memory [1]. Given the non-uniformity of2ccess latency and bandwidth, optimizing data placement inNUMA systems has been an active area of research.

Modern OSes provide generic support for optimizing dataplacement on NUMA systems through various allocation andmigration polices. For example, Linux provides ﬁrst-touch vs.interleaved allocation to control the initial placement of data,and additionally employs AutoNUMA to migrate pages acrosssockets in order to place data closer to the threads accessing it.To further optimize data placement, Carrefour [32] proposeddata-page replication along with migration. In addition, datareplication has also been proposed at data structure level [29]and via NUMA-aware memory allocators [46] to further re-duce the frequency of remote memory accesses. In contrast,our work focuses on page-table pages, not data pages.Some prior research has proposed replicated data struc-tures for address spaces. RadixVM [30] manages the process’address space using replicated radix trees to improve the scal-ability of virtual memory operations in the research-grade xv6OS [15]. However,

RadixVM does not replicate page-tables.Similarly, Corey [28] divides the address space into shared andprivate per-core regions where these explicitly shared regionsshare the page-table. In contrast, we use replication to manageNUMA effects of page-table walks in an industry-grade OS.

Techniques for data vs. page-table pages:

One may expectprior migration and replication techniques to extend readily topage-tables. In reality, subtle distinctions between data andpage-table pages merit some discussion. First, data pages arereplicated by simple bytewise copying of data, without any spe-cial reasoning of the contents of the pages. Page-table pages,however, require more care and cannot rely simply on byte-wise copying – to semantically replicate virtual-to-physicalmappings, upper page-table levels must hold pointers (phys-ical addresses) to their replicated, lower level page-tables –which differ from replica to replica except at the leaf level.Moreover, data replication has high memory overheads andmaintaining consistency across replicated pages (especiallyfor write-intensive pages) can outweigh the beneﬁts of repli-cation. While data replication has its values, we show thatpage-table replication is equally important – it incurs negli-gible memory overhead, can be implemented efﬁciently anddelivers substantial performance improvement.

3. Page-Table Placement Analysis

In this section, we ﬁrst present an analysis of page-table dis-tributions when running memory-intensive workloads on alarge-memory machine ( multi-socket scenario § 3.1) and thenquantify the impact of NUMA effects on page-table walks( workload migration scenario § 3.2). Our experimental plat-form is a 4-socket Intel Xeon E7-4850v3 with 512 GB physicalmemory (more detailed machine conﬁguration in § 8).

ProcessSocket 2 Socket 3ProcessSocket 1Socket 0Process ProcessD L4L2 L3L1 M e m o r y M e m o r y M e m o r y M e m o r y Figure 2: An illustration of current page-table and data place-ment for a multi-socket workload using 4-socket system.

We focus on page-table distributions where workloads usealmost all resources in a multi-socket system. Consider theexample in Figure 2. If a core in socket 0 has a TLB missfor data "D", which is local to the socket, it has to performup to 4 remote accesses to resolve the TLB miss to ultimatelydiscover that data was actually local to its socket. Even thoughMMU caches [19] help reduce some of the accesses, at leastleaf-level PTEs have to be accessed. Since big-data workloadshave large page-tables that are absent from the caches, systemmemory accesses are often unavoidable [25].

Methodology.

We are interested in the distribution of pagesfor each level in the page-table; i.e., which sockets page-tablesare allocated on. We write a kernel module that walks thepage-table of a process and dumps the PTEs including thevalue of the page-table root register (CR3) to a ﬁle. Thekernel module is then invoked every 30 seconds while a multi-socket workload (e.g., Memcached) ran, producing a streamof page-table snapshots over time. We use 30 second timeintervals as page-table allocation occurs relatively infrequentlyand smaller time interval does not change results signiﬁcantly.We use ﬁrst-touch or interleaved data allocation policy whileenabling/disabling AutoNUMA [2] data page migration withdifferent page sizes for multi-socket workloads in Table 1.

Workload Description MS WM

Memcached a commercial distributed in-memory objectcaching system [8] 350GB –Graph500 a benchmark for generation, compression andsearch of large graphs [5] 420GB –HashJoin a benchmark for hash-table probing used indatabase applications and other large applications 480GB 17GBCanneal a benchmark for simulated cache-aware annealingto optimize routing cost of a chip design [10] 382GB 32GBXSBench a key computational kernel of the Monte Carlo neu-tronics application [14] 440GB 85GBBTree a benchmarks for index lookups used in databaseand other large applications 145GB 35GBLibLinear a linear classiﬁer for data with millions of instancesand features [7] – 67GBPageRank a benchmark for page rank used to rank pages insearch engines [23] – 69GBGUPS a HPC Challenge benchmark to measure the rate ofinteger random updates of memory [11] – 64GBRedis a commercial in-memory key-value store [12] – 75GB

Figure 4: Percentage of remote leaf PTEs as observed fromeach socket for our multi-socket workloads.

Analysis.

We analyze the distribution of page-tables for eachsnapshot in time. For each page-table level, we summarize thenumber of per-socket physical pages and the number of validPTEs pointing to page-table pages (or data frames) residingon a local or remote socket. From these snapshots, we collecta distribution of leaf PTEs and which sockets they are locatedon. We focus on leaf PTEs as there are orders of magnitudemore of them than non-leaf PTEs and because they generallydetermine address translation performance (upper-level PTEscan be cached in MMU caches [25]). These distributionsindicate how many local and remote sockets a page-table walkmay visit before resolving a TLB miss.

Results.

Due to space limitations, we show a single, processedsnapshot of the page-table for Memcached in Figure 3. Thissnapshot was collected using 4KB pages, local allocation, andAutoNUMA disabled. We studied 2MB pages as well andpresent observations from them later. The processed dumpshows the distribution of all four levels of the page-table (L4being the root, and L1 the leaf). The dump is organized in fourcolumns representing the four-sockets in this system. In eachcell, the ﬁrst number is the total physical pages at that level-socket combination (e.g. socket 1 has the only L4 page-tablepage). Next is the distribution of pointers in square bracketsof the valid PTEs at this level/socket (e.g. L4 on socket 1 has8 pointers to L3 on socket 0, 3 pointers locally, and 1 pointerto socket 3). The percentage numbers in rounded brackets arethe fraction of valid PTEs pointing to remote physical pages.Figure 4 shows the percentage of remote leaf PTEs observedby a thread running on each socket. Each workload’s clusterhas per-socket values representing the percentage of remoteleaf PTEs in the page-table. We made these observations fromthe page-table dumps and distribution of leaf PTEs:1. Page-tables pages are allocated on the socket initializingthe ﬁrst data structures that the page-table pages point to.This is similar to data frame allocation but has importantunintended performance consequences. Consider that eachpage-table page has 512 entries. This means that the choiceof where to allocate a page-table page is entirely dependentupon which of the 512 entries in the page-table page gets allocated ﬁrst, and which socket the allocating thread runson. If subsequently, other entries in the page-table pageare used for threads on another socket, remote memoryreferences for page-table walks become common.2. With ﬁrst touch policy, the number of page-tables tendsto be skewed towards a single socket (e.g. socket 1 forGraph500 on Figure 4). This is especially the case when asingle thread allocates and initializes all memory.3. The interleaved policy evenly distributes page-table pagesacross all sockets.4. While we observed data pages being migrated with AutoN-UMA, page-table pages were never migrated. The fractionof data pages migrated over time depends on the workloadand its access locality.5. On all levels, a signiﬁcant fraction of page-table entriespoints to remote sockets. In the case of interleave policy,this is ( N − ) / N for an N -socket system.6. Due to the skew in page-table allocation, some socketsexperience longer TLB misses since up to 99% of leafPTEs are located remotely. Summary

On multi-socket systems, page-table page alloca-tion is skewed towards sockets that initialize the data struc-tures. While data pages are migrated by default OS policies,page-table pages remain on the socket they are allocated. Con-sequently, remote page-table walks are inevitable and multi-socket workloads suffer from longer TLB misses as their asso-ciated page-table walks require remote memory accesses.

We now focus on the impact of NUMA on page-table walksin scenarios where a process on a single socket is migrated toanother. Such situations arise frequently in commercial clouddeployments due to the need for load balancing and improvingprocess-data afﬁnity [52, 27]. Particularly, the prevalence ofvirtual machines and containers that rely on hypervisors andNUMA-aware schedulers to consolidate workloads in data cen-ters are making inter-socket process migrations increasinglycommon. For e.g., VMware ESXi may migrate processes at afrequency of 2 seconds [3]. Today, data can be migrated acrosssockets but page-tables cannot, compromising performance.

Conﬁgurations.

We run each workload in isolation whiletightly controlling and changing i) the allocation policies fordata pages and page-table pages, ii) whether or not the socketsare idle and iii) whether transparent, 2MB large pages (THP)are enabled. We disable NUMA migration. To study page-table allocations in a controlled manner, we modiﬁed Linuxkernel to force page-table allocations on a ﬁxed socket. We usethe conﬁgurations shown in Table 2 and visualized in Figure 5.We use the STREAM benchmark [13] running on the socket4 T D

Process

PT D

Process D (i) Baseline: LP-LD (ii) Process migration: RP-RD Socket 0 Socket 1 Socket 0 Socket 1

PT D

Process D Data migration (iii) Data migration: RP-LD

Socket 0 Socket 1 PT Process D (iv) Loaded remote PT: RPI-LD Socket 0 Socket 1OtherProcessprocessmigration PT Process D (v) Process re-migration from iii: LP-RD Socket 0 Socket 1 PT OtherProcess D (vi) Loaded Remote Data: LP-RDI Socket 0 Socket 1ProcessprocessmigrationMemory 0 Memory 1 Memory 0 Memory 1 Memory 0 Memory 1 Memory 0 Memory 1 Memory 0 Memory 1 Memory 0 Memory 1

Figure 5: Different conﬁgurations for workload migration scenario. We show only 6 out of 7 conﬁgurations here. The 7thconﬁguration (RPI-RDI) can be easily created from (ii) by running another process on Socket 0.

Conﬁg. Workload Page-Table Data Interference (T)LP-LD A A: Local PT A: Local Data -(T)LP-RD A A: Local PT B: Remote Data -(T)RP-LD A B: Remote PT A: Local Data -(T)RP-RD A B: Remote PT B: Remote Data -(T)RPI-LD A B: Remote PT A: Local Data B: Interfere on PT(T)LP-RDI A A: Local PT B: Remote Data B: Interfere on Data(T)RPI-RDI A B: Remote PT B: Remote Data B: Interfere on PT&Data

Table 2: Conﬁgurations for workload migration scenario,where A and B denote different sockets. T denotes if THPin Linux is used for 2MB pages. Interference is another pro-cess that runs on a speciﬁed socket and hogs its local mem-ory bandwidth. Figure 5 shows the 2-socket case. indicated by interference to create a worst-case scenario ofco-locating a memory-bandwidth heavy workload. Memoryallocation and processor afﬁnity are controlled by numactl . Measurements.

We use perf to obtain performance countervalues such as execution cycles and TLB load and store misswalk cycles (i.e., the cycles that the page walker is active for).

Results.

We then run our workloads for all seven conﬁgura-tions. Figure 6 shows the normalized run times with a 4KBpage size. The base case is the LP-LD conﬁguration whereboth page-tables and data pages are local and the system isidle. For each conﬁguration, hashed part of the bar denotesthe fraction of time spent on page-table walks. We observe thefollowing from this experiment:1. All workloads spend a signiﬁcant fraction of executioncycles (up to 90%) performing page-table walks. Parts ofthese walks may be overlapped with other work; neverthe-less, they present a performance impediment.2. LP-LD runs most efﬁciently for 4KB page size.3. The local page-table, remote data case (LP-RD and LP-RDI) suffers 3x slowdown versus the baseline. This isnot surprising and has motivated prior research on datamigration techniques in large-memory NUMA machines. 4. More surprisingly, the remote page-table, local data case(RP-LD and RPI-LD) suffers 3.3x slowdown. This slow-down can even be more severe than remote data accesses.5. When both page-tables and data pages are placed remotely(RP-RD and RPI-RDI), the slowdown is 3.6x and is theworst placement possible for all workloads.6. With 2MB page size (ﬁgure omitted for space), TLB reachimproves and the number of memory accesses for a page-table walk decreases to 3 rather than 4. These two factorsreduce the fraction of execution cycles devoted to page-table walks. Even so, overall performance is still vulnerableto remote page-table placement.

Summary.

The NUMA node on which page-table pages areplaced signiﬁcantly impacts performance. Remote page-tablescan have similar, and in some cases even worse, slowdownthan remote data pages accesses. Moreover, the slowdown isvisible even with large pages.

4. Design Concept

Mitosis ’ key concept is a mechanism and its policies to repli-cate and migrate page-tables and reduce the frequency ofremote memory accesses in page-table walks.

Mitosis requirestwo components: i) a mechanism to support low-overheadpage-table replication and migration and ii) policies for pro-cesses to efﬁciently manage and control page-table replicationand migration. Figure 7 illustrates these concepts. Our dis-cussion focuses on the multi-socket and workload migrationscenarios used before in § 3. We showed in § 3.1 that multi-socket workloads will, assum-ing a uniform distribution of page-table pages, have N − N PTEspointing to remote pages for an N -socket system. Page-tables L P - L D L P - R D L P - R D I R P - L D R P I - L D R P - R D R P I - R D I L P - L D L P - R D L P - R D I R P - L D R P I - L D R P - R D R P I - R D I L P - L D L P - R D L P - R D I R P - L D R P I - L D R P - R D R P I - R D I L P - L D L P - R D L P - R D I R P - L D R P I - L D R P - R D R P I - R D I L P - L D L P - R D L P - R D I R P - L D R P I - L D R P - R D R P I - R D I L P - L D L P - R D L P - R D I R P - L D R P I - L D R P - R D R P I - R D I L P - L D L P - R D L P - R D I R P - L D R P I - L D R P - R D R P I - R D I L P - L D L P - R D L P - R D I R P - L D R P I - L D R P - R D R P I - R D I GUPS BTree HashJoin Redis XSBench PageRank LibLinear Canneal N o r m a li z e d R un t i m e Figure 6: Normalized runtime of our workloads in workload migration scenario with 4KB page size. The lower hashed part ofeach bar is time spent in walking the page-tables. All conﬁgurations are shown in Table 2. ata migration Page-table migrationL2Process S o c k e t S o c k e t ProcessL3L4 S o c k e t S o c k e t Process ProcessD L4,L3,L2,L1 D L4,L3,L2,L1 D D L4,L3,L2,L1 D DL4,L3,L2,L1L4,L3,L2,L1 L4,L3,L2,L1L4,L3,L2,L1L4,L3,L2,L1 DL1 (b) Workload migration scenario(a) Multi-socket scenario (i) Process – initially (ii) Process after migration (iii) Process after migration with Mitosis (i) Process without page-table replication (ii) Process with page-table replication Data migrationProcess S o c k e t S o c k e t Process S o c k e t S o c k e t Process Process S o c k e t S o c k e t S o c k e t S o c k e t Process S o c k e t S o c k e t S o c k e t S o c k e t Process S o c k e t S o c k e t S o c k e t S o c k e t Process

Figure 7:

Mitosis : Page-table migration and replication on large-memory machines may be distributed among the sockets in a skewed fashion. Fig-ure 7 (a)(i) shows a scenario where threads of the same work-load running on different sockets have to make remote memoryaccesses during page-table walks.From Figure 7 (a)(i) we can see that if a thread in socket0 has a TLB miss for data “D” (which is local to the socket),it has to perform up to 4 remote accesses to resolve the TLBmiss to only ﬁnd out that the data was local to its socket.With

Mitosis , we replicate the page-tables on each socketwhere the process is running (shown in Figure 7 (a)(ii)). Thisresults in up to 4 local accesses to the page-table, precludingthe need for remote memory accesses in page-table walks.

Single-socket workloads suffer performance loss when pro-cesses are migrated across sockets while page-tables are not(shown in Figure 7 (b)(ii)) The process is migrated fromsocket 0 to socket 1, the NUMA memory manager trans-parently migrate data pages, but page-table pages remain onsocket 1. In contrast,

Mitosis migrates the page-tables alongwith the data (Figure 7 (b)(iii)). This eliminates remote mem-ory accesses for page-table walks, improving performance.

5. Mechanism

Replication and migration are inherently similar. We ﬁrstdescribe the building blocks which are required to supportpage-table replication and later show how we can leverage thereplication infrastructure to achieve page-table migration.

Mitosis enables per-process replication; the virtual memorysubsystem needs to maintain multiple copies of page-tablesfor a single process. Efﬁcient replication of page-tables canbe divided into three sub-tasks: i) strict memory allocation tohold the replicated page-tables, ii) managing and keeping thereplicas consistent, and iii) using replicas when the processis scheduled. We now describe each sub-task in detail byproviding a generalized design and our Linux implementation.We also discuss how Mitosis handles accessed and dirty bits.

All page-table allocations are performed by the OS on a page-fault–an explicit mapping request can beviewed as an eager call to the page-fault handler for the givenmemory area.

Mitosis extends the same mechanism to allocatememory across sockets for different replicas.Such allocation is strict, i.e. it has to occur on a particularlist of sockets at allocation time. It is, therefore, possiblethat it may fail due to the unavailability of memory on thosesockets. There are multiple ways to sidestep this problemby reserving pages on each socket for page-table allocationsusing per-socket page-cache . These pages can be explicitlyreserved through a system call or automatically when a processallocates a virtual memory region. Alternatively, the OS canreclaim physical memory through demand paging mechanismsor evicting a data page onto another socket.

Linux implementation:

We rely on the existing page allo-cation functionality in Linux to implement

Mitosis . Whenallocating page-table pages, we explicitly supply the list oftarget sockets for page-table replication. Since strict allocationcan fail, we implemented per-socket page-caches to reservepages for page-table allocations. The size of this page-cacheis explicitly controlled using a sysctl interface.

For security, OSes usually do not allow userprocesses to directly manage their own page-tables. Instead,OSes export an interface through which page-table modiﬁca-tions are handled, e.g. map/unmap/protect of pages.

Mitosis extends the same interfaces for updates to page-tables to keepall replicas consistent. One way to implement this is to eagerly update all replicas at the same time via this standard interfacewhen an update to the page-table is performed on any replica.On an eager update, the OS ﬁnds the physical location toupdate in the local replica by walking the local replica ofthe page-table. It is required to walk other replicas of thepage-table to locate the physical location to update all thereplicas at the same time. Therefore, an N-socket system inx86_64 will need 4 N memory accesses with replication ona page-table update: 4 memory accesses to walk the page-table on each of the N sockets. To reduce this overhead, wedesigned a circular linked-list of all replicas. The metadata6 eplica page 0 PTE

Replica page 1

PTE

Replica page 2

PTE

Replica page 3

PTE

CR3-0 CR3-1 CR3-2 CR3-3

Metadata for page 0 Metadata for page 1 Metadata for page 2 Metadata for page 2

Added pointers for circular linked list

Figure 8: Circular linked list to locate all replicas efﬁciently(implemented in Linux with struct page ). about each physical page is utilized to store the pointers tothe next physical page holding the replica of the page-table.Figure 8 shows an illustration with 4-way replication. Thisallows updates to proceed without walking the page-tables toperform the update. With this optimization, the update of all N replicas takes 2 N memory references ( N for updating the N replicas and N for reading the pointers to the next replica). Linux implementation:

We implemented eager updates tothe replica page-tables in Linux. This required intercepting anywrites to the page-tables and propagate updates accordingly.But instead of revamping the full-memory subsystem in Linux,we used a different interface, PV-Ops [9], which is requiredto support para-virtualization environments such as Xen [18].The Linux kernel shipped with distributions like Ubuntu haspara-virtualization support enabled by default.Conceptually, this is done by indirect calls to the nativeor Xen handler functions. Effectively, the indirect calls arepatched with direct calls once the subsystem is initialized. ThePV-Ops subsystem interface consists of functions to allocateand free page-tables of any level, reading and writing thetranslation base register (CR3 on x86_64), and writing page-table entries. The PV-Ops interface can be seen in Listing 1. void write_cr3(unsigned long x);void paravirt_alloc_pte(struct mm_struct *mm, unsigned long pfn);void paravirt_release_pte(unsigned long pfn);void set_pte(pte_t *ptep, pte_t pte);

Listing 1: Excerpt of the PV-Ops interface

We implemented

Mitosis as a new backend for PV-Opsalongside with the native and Xen backends. When the kernelis compiled with

Mitosis , the default PV-Ops is switched to the

Mitosis backend. We implemented the

Mitosis backend withgreat care to ensure identical behavior to the native backendwhen

Mitosis is turned off. Besides, note that replication isgenerally not enabled by default, and thus the behavior is thesame as the native interface.The PV-Ops subsystem provides an efﬁcient way for

Mitosis to track any writes to the page-tables in the system. Propa-gating those updates efﬁciently requires a fast way to ﬁnd thereplica page-tables based solely on the information providedthrough the PV-Ops interface (Listing 1) i.e. using a kernelvirtual address (KVA) or a physical frame number (PFN). We augment the page metadata to keep track of replicaswith our circular linked list. The Linux kernel keeps trackof each 4KB physical frame in the system using struct page .Moreover, each frame has a unique KVA and PFN. Linuxprovides functions to convert between struct page and it’scorresponding KVA/PFN, which is typically done by adding,subtracting or shifting the respective values and are henceefﬁcient operations. We can, therefore, obtain the structpage directly from the information passed through the PV-Opsinterface and update all replicas efﬁciently.

When the OS schedules a process or task,it performs a context switch, restores processor registers andresumes execution of the new process or task. The contextswitch involves programming the page-table base register ofthe MMU with the base address of the process’ page-tableand ﬂushing the TLB. With

Mitosis , we extend the contextswitch functionality, to select and set the base address of thesocket’s local page-table replica efﬁciently. This enables atask or process to use the local page-table replica if present.

Linux implementation:

For each process, we maintain anarray of root page-table pointers which allows directly select-ing the local replica by indexing this array using the socketid. Initializing this array with pointers to the very same rootpage-table is equivalent to the native behavior.

A page-table is mostly managed by software(the OS) most of the time and read by the hardware (on a TLBmiss). On x86, however, hardware–namely the page-walker–reports whenever a page has been accessed or written to bysetting the accessed and dirty bits in the PTEs. In other words,page-table is modiﬁed without direct OS involvement. Thus,accessed and dirty bits do not use the standard software inter-face to update the PTE and cannot be replicated easily withouthardware support. Note, that these two bits are typically setby the hardware and reset by the OS. They are used by theOS for system-level operations like swapping or writing backmemory-mapped ﬁles if they are modiﬁed in memory. With

Mitosis when replicated, we logically OR accessed and dirtybits of all the replicas when read by the OS.

Linux implementation:

We need to read accessed/dirty bitsfrom all replicas as well as reset them in all replicas. Unfortu-nately, the PV-Ops interface doesn’t provide functions to reada page-table entry, worse we have found code in the Linuxkernel which even writes to the page-table entry without goingthrough the PV-Ops interface. We augmented with the corre-sponding get functions to PV-Ops which consult all copies ofpage-table entry and make sure the ﬂags are returned correctly.The new function reads all the replicas and ORs the bits in allreplicas to get the correct information.7 .5. Page-Table Migration

We use replication to perform migration in the following way:we use

Mitosis to replicate the page-table on the socket towhich the process has been migrated. The ﬁrst replica canbe eagerly freed after migration, or alternatively kept up-to-date in the case the process gets migrated back and lazilydeallocated in case physical memory is becoming scarce.

6. Policy

The policies we implement with

Mitosis control when page-tables are replicated and determine the processes and socketsfor which replicas are created. As with NUMA policies, page-table replication policies can be applied system-wide or uponuser request. We discuss both in this section.

System-wide policies can range from simpleon/off knobs for all processes to policies that actively moni-tor performance counter events provided by the hardware todynamically enable or disable

Mitosis .Event-based triggers can be developed for page-table mi-gration and replication within the OS. For instance, the OScan obtain TLB miss rates or cycles spent walking page-tablesthrough performance counters that are available on modernprocessors and then apply policy decisions automatically. Ahigh TLB miss rate suggests that a process can beneﬁt frompage-table replication or migration. By taking the ratio be-tween the time spent to serve TLB misses and the number ofTLB misses can indicate a replication candidate. Processeswith a low TLB miss rate may not beneﬁt from replication.Even if the OS makes a decision to migrate or replicate thepage-tables, there it may be costly to copy the entire page-table as big memory workloads easily achieve page-tablesof multiple GB in size. By using additional threads or evenDMA engines on modern processors, the creation of a replicacan happen in the background and the application regains fullperformance when the replica or migration has completed.The target applications of

Mitosis are long-running, big-memory workloads with high TLB pressure, and thereforewe disable page-table replication for short-running processessince the performance and memory cost of the replicated page-tables for short-running processes cannot be amortized (§ 8.3).

Linux implementation:

We support a straightforward,system-wide policy with four states: i) completely disable Mitosis , ii) enable per-process basis, iii) ﬁx the allocationof page-tables on a particular socket, and iv) enabled for allprocesses in the system. This system-wide policy can be setthrough the sysctl interface of Linux. We leave it as futurework to implement an automatic, counter-based approach. System-wide policies usually imply a one-size-ﬁts-all approach for all processes, but user-controlled policies allow programmers to use their understanding of theirworkloads and to select policies explicitly. These user-deﬁnedreplication and migration policies can be combined with dataand process placement primitives. Such policies can be se-lected when starting the program by deﬁning the CPU set andreplication set, or at runtime using corresponding system callsto set afﬁnities and replication policies. All of these policiescan be set per-process so that users have ﬁne-grained controlon replication and migration.

Linux implementation:

We implement user-deﬁned policiesas an additional API call to libnuma and corresponding pa-rameters of numactl . Similar to setting the allocation policy,we can supply node-mask or a list of sockets to replicate thepage-tables (Listing 2). Applications can thus select the repli-cation policy at runtime, or we can use numactl to select thepolicy without changing the program. numactl [--pgtablerepl= | -r ]void numa_set_pgtable_replication_mask(struct bitmask *);

Listing 2: Additions to libnuma and numactl

Both, libnuma and numactl use two additional systemcalls to set and get the page-table replication bitmask. When-ever a new mask is set,

Mitosis will walk the existing page-table and create replicas according to the new bitmask. Thebitmask effectively speciﬁes the replication factor: N bits setcorresponds to copies on N sockets and by passing an emptybitmask, the default behavior is restored.

7. Discussion

As a proof-of-concept, we implement

Mitosis in the widely-used Linux OS. Choosing Linux as our testbed allows us toprototype our ideas on a complex and complete OS where thesubtle interactions of many systems features and

Mitosis stress-tests its evaluation. Speciﬁcally, we use mainline Linux kernelv4.17 and implement

Mitosis for the x86_64 architecture. Weplan to release this implementation for everyone to use andplan to upstream the changes to the Linux kernel.

We have chosen to implement the prototype of

Mitosis inLinux. However, the concept of

Mitosis is applicable to otheroperating systems. Microkernels, for instance, push mostof their memory management functionality into user-spacelibraries or processes while the kernel enforces security andisolation. In Barrelﬁsh [22], for example, processes managetheir own address space by explicit capability invocations toupdate page-tables with new mappings.In such a system, one could implement

Mitosis purely inuser-space by linking to a

Mitosis -enabled libraryOS, and thekernel itself would not need to be modiﬁed at all. The librarycan keep track of the address space, including page-tables,replicas etc. Those data-structures can easily be enhanced8o include an array of page-table capabilities instead of asingle such table. This would allow policies to be deﬁnedat application level by using an appropriate policy library.Updates to page-tables might need to be converted to explicitupdate messages to other sockets, which avoid the need forglobal locks and propagates updates lazily. On a page-fault,updates can be processed and applied accordingly in the page-fault handling routine. We leave such an implementation tofuture work, but believe it to be straightforward.

Larger page sizes help reduce address translation overheadsby increasing the amount of memory that each TLB entry mapby orders of magnitude. Even with 2MB and 1GB page sizesupport in x86-64 on an Intel Haswell processor, the TLBreach is still less than 1%, assuming 1TB of main memoryfor any page size. Moreover, many commodity processorsprovide limited numbers of large page TLB entries especially1GB TLB entries, which limits their beneﬁt [21, 39, 49] andadditionallly huge-pages are not always the best choice [41].Since, address translation overheads are non-negligible withlarger page sizes, they are susceptible to NUMA effects onpage-table walks. Thus, our implementation of

Mitosis sup-ports larger page sizes and evaluate them. We extend trans-parent huge pages (THP) or 2MB page size in Linux whichrequires coalescing smaller pages to a large page and split-ting larger pages in to smaller ones.

Mitosis is implemented toreplicate the page-tables even in presence of such mechanisms.

Virtualized systems widely use hardware-based nested pagingto virtualize memory [37]. This requires two-levels of page-table translation:1. gVA to gPA: guest virtual address to guest physical addressvia a per-process guest OS page-table (gPT)2. gPA to hPA: guest physical address to host physical addressvia a per-VM nested page-table (nPT)In the best case, the virtualized address translation hits in theTLB to directly translate from gVA to hPA with no overheads.In the worst case, a TLB miss needs to perform a 2D page walkthat multiplies overheads vis-a-vis native, because accessesto the guest page-table also require translation by the nestedpage-table. For x86-64, a nested page-table walk requires upto 24 memory accesses. This 2D page-table walk comes withadditional hardware complexity.Understanding page-table placement in virtualized systemsis a major undertaking and requires a separate study. We be-lieve we can extend

Mitosis ’ design to replicate both guestpage-tables and nested page-tables independently if the un-derlying NUMA architecture is exposed to the guest OS toimprove performance of applications. To extend the design,we can rely on setting accessed and dirty bits at both gPTand nPT by the nested page-table walk hardware availablesince Haswell [4]. Thus, we can extend our OS extension for or-ing the access and dirty bits across replicas to get thecorrect information at both levels independently. However,the main issue is that most cloud systems prefer not to exposethe underlying architecture to the guest OS making a casefor novel approaches to replicate and migrate both levels ofpage-tables in a virtualized environment.

Coherence between hardware TLBs is maintained by the OSwith the help of TLB ﬂush IPIs and updates to the page-tableare already thread-safe as they are performed within a criticalsection. In Linux, a lock is taken whenever the page-tableof a process is modiﬁed and thus ensuring mutual exclusion.The updates to the page-table structure are made visible afterreleasing the lock. When an entry is modiﬁed, its effect ismade visible to other cores through a global TLB ﬂush as theold entry might still be cached.With

Mitosis , we currently keep the same consistency guar-antees by updating all page-table replicas eagerly while beingin the critical section. Thus, only one thread can modify thepage-table at a time. Hardware may read the page-table whileupdates are being carried out. The critical section ensures cor-rectness while serving the page-fault while again, the globalTLB ﬂush ensures consistency after modiﬁcation of an entryin case a core has cached the old one.

8. Evaluation

We evaluate

Mitosis using a set of big-memory workloads andmicro-benchmarks. We show: (1) how multi-threaded pro-grams beneﬁt from

Mitosis (§ 8.1), (2) how

Mitosis eliminatesNUMA effects of page-walks when page-tables are placedon remote sockets due to task migration (§ 8.2) and (3), thememory and runtime overheads of

Mitosis (§ 8.3).

Hardware Conﬁguration

We used a four-socket Intel XeonE7-4850v3 with 14 cores and 128GB memory per-socket (512GB total memory) with 2-way hyper-threading running at2.20GHz. The L3 cache is 35MB in size and the processor hasa per-core two-level TLB with 64+1024 entries. Accessingmemory on the local NUMA socket has about 280 cycleslatency and throughput of 28GB/s. For a remote NUMAsocket, this is 580 cycles and 11GB/s respectively.

In this part of the evaluation, we focus on multi-threadedworkloads running in parallel on all sockets in the system.For a machine with N NUMA sockets, in expectation N − N ofpage-table accesses will be remote while the remote socketsare busy themselves. We evaluate six workloads (see § 3.1),for all commonly used conﬁgurations that inﬂuence data andpage-table placement (see Table 3). Performance is presentedas an average of three runs, excluding the initialization phase.The results are shown in Figure 9a for 4KB pages and Fig-ure 9b with 2MB large pages respectively. All bars are normal-9 .00.20.40.60.81.0 FF + M F - A F - A + M II + M FF + M F - A F - A + M II + M FF + M F - A F - A + M II + M FF + M F - A F - A + M II + M FF + M F - A F - A + M II + M FF + M F - A F - A + M II + M Canneal Memcached XSBench Graph500 HashJoin BTree N o r m a li z e d R un t i m e . x . x . x . x . x . x . x . x . x . x . x . x . x . x . x . x . x . x (a) 4KB Pages T F T F + M T F - A T F - A + M T I T I + M T F T F + M T F - A T F - A + M T I T I + M T F T F + M T F - A T F - A + M T I T I + M T F T F + M T F - A T F - A + M T I T I + M T F T F + M T F - A T F - A + M T I T I + M T F T F + M T F - A T F - A + M T I T I + M Canneal Memcached XSBench Graph500 HashJoin BTree N o r m a li z e d R un t i m e . x . x . x . x . x . x . x . x . x . x . x . x . x . x . x . x . x . x (b) 2MB Large Pages Figure 9: Normalized performance with

Mitosis for multi-socket workloads with 4KB and 2MB page size. The lower hashed partof each bar is execution time spent in walking the page-tables.

Conﬁg. Data pages Page-table pages (T)F First-touch allocation First-touch allocation (bar: purple)(T)F+M

Mitosis replication (bar: green) (T)F-A First-touch allocation First-touch allocation (bar: purple)(T)F-A+M + Auto page migration

Mitosis replication (bar: green) (T)I Interleaved allocation Interleaved allocation (bar: purple)(T)I+M

Mitosis replication (bar:green)

Table 3: Conﬁgurations for multi-socket scenario where work-load runs on all sockets. T denotes Linux with THP. M denotesthe corresponding data allocation policy with

Mitosis . ized to 4KB ﬁrst-touch allocation policy (bar: F). Bars with thesame allocation policy are grouped in boxes for comparison.The number on top of Mitosis bars (green) shows improvementfrom corresponding non-

Mitosis bars (purple) within a box.Note that data allocation policy impacts performance and isshown across boxes for each workload. The results for 2MBpages are normalized to 4KB (bar: F) to show performanceimpact with increase in page size.We observe that with 4KB pages, up to 40% of the totalruntime is spent in servicing TLB misses.

Mitosis reduces theoverall runtime for all applications with the best-case improve-ment of 1.34x for Canneal. Most of the improvements can benoted in the reduction of page-walk cycles due to replicationof page-tables.Large pages can signiﬁcantly reduce translation overheadsfor many workloads. However, NUMA effects of page-tablewalks are still noticeable, even if all workload memory isbacked by large pages. Hence,

Mitosis provides signiﬁcantspeedup e.g., 1.14x, 1.13x, 1.06x and 1.07x for Canneal, Mem-cached, XSBench and BTree, respectively. Note that the useof large pages can lead to decreased performance on NUMAsystems and still not used for many systems [41]. Using various data page placement policies improves per-formance for our workloads as expected. In combination withall policies,

Mitosis consistently improves performance.We have provided evidence that highly parallel workloadsexperience NUMA effects of remote-memory accesses dueto page-table walks. Yet, running a workload concurrentlymeans we cannot inspect a thread in isolation: a TLB misson one core may populate the cache with the PTE neededto serve the TLB miss on another core of the same socket.Moreover, accessing a remote last-level cache may be fasterthan accessing DRAM. Nevertheless, we have shown that

Mitosis is still able to improve multi-threaded workloads byup to 1.34x and that too for both page sizes. Again,

Mitosis does not cause any slowdown.

As we observed in § 3.2, NUMA schedulers can move pro-cesses from one socket to another under various constraints.In this part of the evaluation, we show that

Mitosis elimi-nates NUMA effects of page-walks originating due to dataand threads migrating to a different socket while page-tablesremain ﬁxed on the socket where workload was ﬁrst initialized.We execute the same workloads used for workload migra-tion scenario in § 3.2. As an additional conﬁguration, weenabled

Mitosis when the page-table is allocated on a remotesocket. Recall, we disabled Linux’ AutoNUMA migration,and pre-allocated and initialized the working set (17-85GB).The results are shown in Figure 10a and Figure 10b with4KB and 2MB page sizes respectively. Table 2 in § 3.2 showedthe conﬁgurations used for evaluation: LP-LD (Local PT -Local Data) and RPI-LD (Remote PT with interference - LocalData). RPI-LD+M shows the improvement with page-table L P - L D R P I - L D R P I - L D + M L P - L D R P I - L D R P I - L D + M L P - L D R P I - L D R P I - L D + M L P - L D R P I - L D R P I - L D + M L P - L D R P I - L D R P I - L D + M L P - L D R P I - L D R P I - L D + M L P - L D R P I - L D R P I - L D + M L P - L D R P I - L D R P I - L D + M GUPS BTree HashJoin Redis XSBench PageRank LibLinear Canneal N o r m a li z e d R un t i m e . x . x . x . x . x . x . x . x (a) 4KB Pages T L P - L D T R P I - L D T R P I - L D + M T L P - L D T R P I - L D T R P I - L D + M T L P - L D T R P I - L D T R P I - L D + M T L P - L D T R P I - L D T R P I - L D + M T L P - L D T R P I - L D T R P I - L D + M T L P - L D T R P I - L D T R P I - L D + M T L P - L D T R P I - L D T R P I - L D + M T L P - L D T R P I - L D T R P I - L D + M GUPS BTree HashJoin Redis XSBench PageRank LibLinear Canneal N o r m a li z e d R un t i m e . x . x . x . x . x . x . x . x (b) 2MB Large Pages Figure 10: Normalized performance with

Mitosis for workloads in workload migration scenario with 4KB and 2MB page size. Thelower hashed part of each bar is execution time spent in walking the page-tables.

Mitosis when RPI-LD case arises inthe system. The boxes denote the bars to compare to see theimprovement due to page-table migration. The number on topof the bar denotes the improvement due to

Mitosis (green bar)as compared to non-mitosis bar (purple bar) within the samebox. All bars are normalized to 4KB LP-LD conﬁguration.The results for 2MB pages are normalized to 4KB (bar: LP-LD) to show performance impact with increase in page size.With 4KB pages (Figure 10a), remote page-tables cause1.4x to 3.2x slowdown (bar: RPI-LD) relative to the baseline(LP-LD).

Mitosis can mitigate this overhead and has the sameperformance as the baseline by migrating the page-tables withprocess migration.With 2MB large pages (Figure 10b), we see that the pagewalk overheads are comparatively lower, nevertheless we ob-serve a slowdown of up to 2.3x for TRPI-LD over TLP-LDconﬁguration. Again,

Mitosis can mitigate this overhead andhas the same performance as the TLP-LD conﬁguration. Note,that for certain workloads the page-tables are cached well inthe CPU caches and thus there is no difference in runtime. Forexample, in the case of GUPS, we observe roughly one TLBmiss per data access–two cache-line requests in total per dataarray access. By breaking this down, we obtain that each leafpage-table cache-line covers about 16MB of memory whichcorresponds to 256k cache-lines of the data array. Therefore,the page-table cache-lines are accessed 256k more often thanthe data array cache-lines, and there are less than 500k page-table cache lines which can easily be cached in L3 cache ofthe socket. In summary, page-table entries are likely to bepresent in the sockets processor cache.

Memory Fragmentation:

Physical memory fragmentationlimits the availability of large pages as the system ages, lead-ing to higher page-walk overheads [51, 56]. Figure 11 showsthe performance of Mitosis under heavy fragmentation whileusing THP in Linux with 2MB page size. We observe that allworkloads, including those that did not show performance im-provement with

Mitosis while using 2MB pages in Figure 10b,show dramatic improvement with

Mitosis in this case. Thisis due to workloads falling back to 4KB pages under frag-mentation – which we have already shown to be susceptibleto NUMA effects of page-table walks. Note that we presentthis experiment under heavy fragmentation to demonstratethat even if large pages are enabled, page-walk overheads can

TLP-LDTRPI-LD+MTRPI-LDTLP-LDTRPI-LD+MTRPI-LDTLP-LD X S B e n c h R e d i s G U P S Normalized Runtime

Figure 11: Performance of

Mitosis in workload migration sce-nario with 2MB pages under heavy memory fragmentation. approach that of 4KB pages. In practice, the actual state ofmemory fragmentation may depend on several factors andthese overheads will be proportional to the failure rate of largepage allocations.

Summary:

With this evaluation, we have shown that

Mitosis completely avoids resulting overheads due to page-tables be-ing misplaced on remote NUMA sockets. In none of the cases,

Mitosis resulted in a slowdown of the workload.

Enabling

Mitosis implies maintaining replicas which consumememory and use CPU cycles to be kept consistent. We eval-uate these overheads by estimating the additional memoryrequirement, and then perform micro-benchmarks on the vir-tual memory operations and wrap up by running applicationsend-to-end to set those overheads into perspective.

We estimate the overhead of theadditional memory used to store the page-table replicas when

Mitosis is enabled. We deﬁne the two-dimensional function mem _ overhead ( Foot print , Replicas ) =

Overhead %that calculates memory overhead relative to the single page-table baseline and evaluate it using different values for theapplication’s memory footprint and the number of replicas.For this estimation, we assume 4-level x86 paging with acompact address space e.g. the application uses addresses0 .. FootPrint . Each level has at least one page-table allocatedand a page-table is 4KB in size.Table 4 shows the memory overheads of

Mitosis for smallto large applications using up to 16 replicas. We use thesingle page-table case as the baseline. The page-table accountsfor about 0.19% of the total footprint, except for the 1MBcase where it accounts for 1.5%. With an increasing memoryfootprint used by the application,

Mitosis requires less than2.9% of additional memory for 16-replicas, whereas our four-socket machine used just 0.6% additional memory.The page-tables use a small fraction of the total memoryfootprint of the application. For small programs, the fractionis higher because there is a hard minimum of at least 16KBof page-tables–a 4KB page for each level. This is reﬂected bythe large 23.1% increase in memory consumption for smallprograms. However, putting this into perspective we advocatenot to use

Mitosis in this case as the 1MB memory footprintfalls within the TLB coverage.In summary, we showed that even with a 16-socket NUMAmachine,

Mitosis adds just 2.9% memory overhead and thisoverhead drops to 0.6% for our four-socket machine.

Number of ReplicasFootprint PT Size 1 2 4 8 16

Table 4: Memory footprint overhead for

Mitosis peration 4KB region 8MB region 4GB region mmap 1.021x 1.008x 1.006xmprotect 1.121x 3.238x 3.279xmunmap 1.043x 1.354x 1.393x Table 5: Runtime overhead of

Mitosis for virtual memory oper-ation system calls using 4-way Replication.

In this part of the eval-uation, we are interested in understanding the overheads ofself-replicating page-tables for common virtual memory oper-ations such as mmap , mprotect and munmap .We conducted a micro-benchmark that repeatedly calls theVMA operations and measured the time to complete the cor-responding system calls. For each operation, we enforce thatthe page-table modiﬁcations are carried out e.g. by passingthe MAP_POPULATE ﬂat to mmap . We varied the number ofaffected pages from a single page to a large region of memoryof multiple GB in size. We ran the micro-benchmark with

Mitosis enabled and disabled on an otherwise idle system. Weuse 4KB pages and 4-way replication.The results of this micro-benchmark are shown in Table 5.The table shows CPU cycles required to perform the operationon a memory region of size 4KB, 8MB, or 4GB with

Mitosis being on or off. Further, we calculate the overheads of

Mitosis by dividing the 4-way replicated case (

Mitosis on) with thebase case,

Mitosis off. For mmap , we observe an overhead ofless than 2%. For unmap , the overhead grows to 35% while

Mitosis adds more than 3x overheads for mprotect .With 4-way replication, there are four sets of page-tablesthat need to be updated resulting in four times the work. Weattribute the rather low overhead for mmap to the allocation andzeroing of new data pages during the system call. Likewise,when performing the unmap the freed pages are handed backto the allocator, but not zeroed resulting in less work per pageand thus higher overhead of replication.

Mitosis experiencesa large overhead for mprotect which is still smaller thanthe replication factor. The mprotect operation does a read-modify-write cycle on the affected page-table entries. Thisprocess is efﬁcient with no replicas as it results in sequentialaccess within a page-table. However, with the PV-OPS inter-face, for each written entry all replicas are updated accordinglywhich kills locality. This can be avoided by either changingthe PV-Ops interface or implementing lazy updates.

We now set the VMA op-erations micro-benchmark of the previous section into theperspective of real-world applications. We show that our mod-iﬁcations to the Linux kernel to support

Mitosis has negligibleend-to-end overhead for applications.

Workload

Mitosis

Off

Mitosis

On Overhead

GUPS 270.93 (0.43) 272.18 (0.00) 0.46%Redis 633.94 (0.34) 636.31 (0.86) 0.37%

Table 6: Runtimes with LP-LD setting, including initializationwith and without

Mitosis . Standard Deviation in Brackets.

We compare the execution time of the single-threadedbenchmarks. We run those benchmarks with and without

Mitosis and measure overall execution time, including alloca-tion and initialization phase. We use the LP-LD conﬁguration,i.e. everything is locally allocated. THP is deactivated.The results are shown in Table 6. We observe that in bothcases, GUPS and Redis, the overheads of

Mitosis are less thanhalf a percent, which is small compared to the improvementswe have demonstrated earlier.

9. Conclusion

We presented

Mitosis : a technique that transparently replicatespage-tables on large-memory machines, and provides the ﬁrstplatform to systematically evaluate page-table allocation poli-cies inside the OS. With strong empirical evidence, we madethe case for taking the allocation and placement of page-tablesto a ﬁrst-class consideration, in turn, optimizing performanceon NUMA systems. We also demonstrated the beneﬁts ofreplicating page-tables in large-memory machines for varioususe-cases, while observing negligible memory and runtimeoverheads. We plan to open-source the tools used in this workto inspire further research on optimizing page-table placement.Moreover, we plan to work with the Linux community to get

Mitosis integrated into the mainline kernel.

References . servethehome . com/amd-epyc-inﬁnity-fabric-latency-ddr4-2400-v-2666-a-snapshot/.[2] “AutoNUMA: the other approach to NUMA scheduling,” https://lwn . net/articles/488709/.[3] “Extreme Performance Series: vSphere Compute & Mem-ory Schedulers,” https://static . rainfocus . com/vmware/vmworldus17/sess/1489512432328001AfWH/ﬁnalpresentationPDF/SER2343BU_FORMATTED_FINAL_1507912874739001gpDS . pdf.[4] “FOUR NEW VIRTUALIZATION TECHNOLOGIES ON THELATEST INTEL R (cid:13) XEON,” https://software . intel . com/en-us/blogs/2014/09/08/four-new-virtualization-technologies-on-the-latest-intel-xeon-are-you-ready-to.[5] “Graph500 | large scale benchmarks,” https://graph500 . . anandtech . . csie . ntu . edu . tw/~cjlin/liblinear/.[8] “memcached: a distributed memory object caching system,” https://memcached . . kernel . org/doc/Documentation/virtual/paravirt_ops . txt.[10] “Parsec benchmark suite,” https://parsec . cs . princeton . edu/overview . htm.[11] “RandomAccess: GUPS (Giga Updates Per Second),” https://icl . utk . edu/projectsﬁles/hpcc/RandomAccess/.[12] “Redis,” https://redis . . cs . virginia . edu/stream/.[14] “XSBench: The Monte Carlo Macroscopic Cross Section LookupBenchmark,” https://github . com/ANL-CESAR/XSBench.[15] “Xv6, a simple Unix-like teaching operating system,” https://pdos . csail . mit . edu/6 . . html.[16] H. Alam, T. Zhang, M. Erez, and Y. Etsion, “Do-it-yourself virtualmemory translation,” in Proceedings of the 44th Annual InternationalSymposium on Computer Architecture , ser. ISCA ’17, 2017, pp. 457–468. . hotchips . org/wp-content/uploads/hc_archives/hc29/HC29 . . . . . . pdf.[18] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho,R. Neugebauer, I. Pratt, and A. Warﬁeld, “Xen and theArt of Virtualization,” in Proceedings of the Nineteenth ACMSymposium on Operating Systems Principles , ser. SOSP ’03. BoltonLanding, NY, USA: ACM, 2003, pp. 164–177. [Online]. Available:http://doi . acm . org/10 . . Proceedings of the 37thAnnual International Symposium on Computer Architecture , ser.ISCA ’10, Saint-Malo, France, 2010, pp. 48–59. [Online]. Available:http://doi . acm . org/10 . . , June 2011, pp. 307–317.[21] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, “EfﬁcientVirtual Memory for Big Memory Servers,” in Proceedings of the40th Annual International Symposium on Computer Architecture , ser.ISCA ’13, Tel-Aviv, Israel, 2013, pp. 237–248. [Online]. Available:http://doi . acm . org/10 . . Proceedings ofthe ACM SIGOPS 22Nd Symposium on Operating Systems Principles ,ser. SOSP ’09, Big Sky, Montana, USA, 2009, pp. 29–44. [Online].Available: http://doi . acm . org/10 . . CoRR , vol. abs/1508.03619, 2015. [Online]. Available:http://arxiv . org/abs/1508 . Proceedings of the 46th Annual IEEE/ACM InternationalSymposium on Microarchitecture , ser. MICRO-46, Davis, California,2013, pp. 383–394. [Online]. Available: http://doi . acm . org/10 . . Proceedingsof the Twenty-Second International Conference on ArchitecturalSupport for Programming Languages and Operating Systems , ser.ASPLOS ’17, Xi’an, China, 2017, pp. 63–76. [Online]. Available:http://doi . acm . org/10 . . Proceedings of the 2011 IEEE17th International Symposium on High Performance ComputerArchitecture , ser. HPCA ’11, 2011, pp. 62–63. [Online]. Available:http://dl . acm . org/citation . cfm?id=2014698 . Proceedingsof the 2018 USENIX Conference on Usenix Annual TechnicalConference , ser. USENIX ATC ’18. Berkeley, CA, USA:USENIX Association, 2018, pp. 85–96. [Online]. Available:http://dl . acm . org/citation . cfm?id=3277355 . Proceedings of the 8th USENIX Conference on OperatingSystems Design and Implementation , ser. OSDI’08. San Diego,California: USENIX Association, 2008, pp. 43–57. [Online].Available: http://dl . acm . org/citation . cfm?id=1855741 . Proceedingsof the Twenty-Second International Conference on ArchitecturalSupport for Programming Languages and Operating Systems , ser.ASPLOS ’17, Xi’an, China, 2017, pp. 207–221. [Online]. Available:http://doi . acm . org/10 . . Proceedingsof the 8th ACM European Conference on Computer Systems , ser. Eu-roSys ’13, Prague, Czech Republic, 2013, pp. 211–224.[31] S. Das, S. Nishimura, D. Agrawal, and A. El Abbadi, “Albatross:Lightweight elasticity in shared storage databases for the cloud usinglive data migration,” in

Proceedings of the 2011 VLDB Endowment ,ser. VLDB ’11, 2011. [32] M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize, B. Lepers,V. Quema, and M. Roth, “Trafﬁc Management: A Holistic Approachto Memory Placement on NUMA Systems,” in

Proceedings of theEighteenth International Conference on Architectural Support forProgramming Languages and Operating Systems , ser. ASPLOS’13, Houston, Texas, USA, 2013, pp. 381–394. [Online]. Available:http://doi . acm . org/10 . . Proceedings of the28th ACM International Conference on Supercomputing , ser. ICS’14, Munich, Germany, 2014, pp. 303–312. [Online]. Available:http://doi . acm . org/10 . . , Feb 2015, pp. 223–234.[35] Z. Fang, L. Zhang, J. B. Carter, W. C. Hsieh, and S. A. McKee,“Reevaluating Online Superpage Promotion with Hardware Support,” in Proceedings of the 7th International Symposium on High-PerformanceComputer Architecture , ser. HPCA ’01, 2001, pp. 63–. [Online].Available: http://dl . acm . org/citation . cfm?id=580550 . Proceedings of the AnnualConference on USENIX Annual Technical Conference , ser. ATEC’98, New Orleans, Louisiana, 1998, pp. 8–8. [Online]. Available:http://dl . acm . org/citation . cfm?id=1268256 . IEEE Micro , vol. 37, no. 3, pp. 80–86, 2017.[38] J. Gandhi, V. Karakostas, F. Ayar, A. Cristal, M. D. Hill, K. S. McKin-ley, M. Nemirovsky, M. M. Swift, and O. S. Ünsal, “Range Translationsfor Fast Virtual Memory,”

IEEE Micro , vol. 36, no. 3, pp. 118–126,May 2016.[39] J. Gandhi, A. Basu, M. D. Hill, and M. M. Swift, “EfﬁcientMemory Virtualization: Reducing Dimensionality of NestedPage Walks,” in

Proceedings of the 47th Annual IEEE/ACMInternational Symposium on Microarchitecture , ser. MICRO-47,Cambridge, United Kingdom, 2014, pp. 178–189. [Online]. Available:http://dx . doi . org/10 . . . Proceedings of the 43rdInternational Symposium on Computer Architecture , ser. ISCA ’16,Seoul, Republic of Korea, 2016, pp. 707–718. [Online]. Available:https://doi . org/10 . . . Proceedings of the 2014 USENIX Conference on USENIX AnnualTechnical Conference , ser. USENIX ATC’14, Philadelphia, PA, 2014,pp. 231–242. [Online]. Available: http://dl . acm . org/citation . cfm?id=2643634 . Proceedings of the Twenty-ThirdInternational Conference on Architectural Support for ProgrammingLanguages and Operating Systems , ser. ASPLOS ’18. NewYork, NY, USA: ACM, 2018, pp. 637–650. [Online]. Available:http://doi . acm . org/10 . . . intel . com/sites/default/ﬁles/managed/2b/80/5-level_paging_white_paper . pdf.[44] Intel Corp., “New Intel Core Processor Combines High-PerformanceCPU with Custom Discrete Graphics from AMD to Enable Sleeker,Thinner Devices,” https://newsroom . intel . com/editorials/new-intel-core-processor-combine-high-performance-cpu-discrete-graphics-sleek-thin-devices/.[45] S. S. Iyer, “Heterogeneous Integration for Performance and Scaling,” IEEE Transactions on Components, Packaging and ManufacturingTechnology , vol. 6, no. 7, pp. 973–982, July 2016.[46] S. Kaestle, R. Achermann, T. Roscoe, and T. Harris, “Shoal: SmartAllocation and Replication of Memory for Parallel Programs,” in

Proceedings of the 2015 USENIX Conference on Usenix AnnualTechnical Conference , ser. USENIX ATC ’15, Santa Clara, CA, 2015,pp. 263–276. [Online]. Available: http://dl . acm . org/citation . cfm?id=2813767 . Proceedingsof the 29th Annual International Symposium on ComputerArchitecture , ser. ISCA ’02, 2002, pp. 195–206. [Online]. Available:http://dl . acm . org/citation . cfm?id=545215 .

48] A. Kannan, N. E. Jerger, and G. H. Loh, “Enabling interposer-based dis-integration of multi-core processors,” in , Dec 2015,pp. 546–558.[49] V. Karakostas, O. S. Unsal, M. Nemirovsky, A. Cristal, and M. Swift,“Performance analysis of the memory management unit under scale-out workloads,” in , Oct 2014, pp. 1–12.[50] V. Karakostas, J. Gandhi, F. Ayar, A. Cristal, M. D. Hill,K. S. McKinley, M. Nemirovsky, M. M. Swift, and O. Ünsal,“Redundant Memory Mappings for Fast Access to Large Memories,”in

Proceedings of the 42Nd Annual International Symposium onComputer Architecture , ser. ISCA ’15, Portland, Oregon, 2015, pp. 66–78. [Online]. Available: http://doi . acm . org/10 . . Proceedings of the 12th USENIX Conference on Operating SystemsDesign and Implementation , ser. OSDI’16. Berkeley, CA, USA:USENIX Association, 2016, pp. 705–721. [Online]. Available:http://dl . acm . org/citation . cfm?id=3026877 . Proceedings of theEleventh European Conference on Computer Systems , ser. EuroSys’16. New York, NY, USA: ACM, 2016, pp. 1:1–1:16. [Online].Available: http://doi . acm . org/10 . . ACM Trans. Archit. Code Optim. ,vol. 10, no. 1, pp. 2:1–2:38, Apr. 2013. [Online]. Available:http://doi . acm . org/10 . . . marvell . com/architecture/mochi/.[55] J. Navarro, S. Iyer, P. Druschel, and A. Cox, “Practical, TransparentOperating System Support for Superpages,” SIGOPS Oper. Syst.Rev. , vol. 36, no. SI, pp. 89–104, Dec. 2002. [Online]. Available:http://doi . acm . org/10 . . Proceedings of the Twenty-Third InternationalConference on Architectural Support for Programming Languagesand Operating Systems , ser. ASPLOS ’18. New York, NY,USA: ACM, 2018, pp. 679–692. [Online]. Available: http://doi . acm . org/10 . . , Feb 2015, pp. 210–222.[58] B. Pham, A. Bhattacharjee, Y. Eckert, and G. H. Loh, “Increasing TLBreach by exploiting clustering in page translations,” in , Feb 2014, pp. 558–567.[59] B. Pham, V. Vaidyanathan, A. Jaleel, and A. Bhattacharjee, “CoLT:Coalesced Large-Reach TLBs,” in Proceedings of the 2012 45thAnnual IEEE/ACM International Symposium on Microarchitecture , ser.MICRO-45, Vancouver, B.C., CANADA, 2012, pp. 258–269. [Online].Available: https://doi . org/10 . . . Proceedingsof the 48th International Symposium on Microarchitecture , ser.MICRO-48, Waikiki, Hawaii, 2015, pp. 1–12. [Online]. Available:http://doi . acm . org/10 . . Proceedings of the2009 International Symposium on Computer Architecture , ser. ISCA’09, 2009.[62] A. Saulsbury, F. Dahlgren, and P. Stenström, “Recency-basedTLB Preloading,” in

Proceedings of the 27th Annual InternationalSymposium on Computer Architecture , ser. ISCA ’00, Vancouver,British Columbia, Canada, 2000, pp. 117–127. [Online]. Available:http://doi . acm . org/10 . . IEEE Trans. Comput. , vol. 53, no. 7, pp. 924–927,Jul. 2004. [Online]. Available: https://doi . org/10 . . . Proceedings of the25th Annual International Symposium on Computer Architecture , ser.ISCA ’98, Barcelona, Spain, 1998, pp. 204–213. [Online]. Available:https://doi . org/10 . . . tsmc . com/english/dedicatedFoundry/services/cowos . htm.[66] M. Talluri and M. D. Hill, “Surpassing the TLB Performance ofSuperpages with Less Operating System Support,” in Proceedingsof the Sixth International Conference on Architectural Support forProgramming Languages and Operating Systems , ser. ASPLOS VI,San Jose, California, USA, 1994, pp. 171–182. [Online]. Available:http://doi . acm . org/10 . . , June 2018, pp. 726–738., June 2018, pp. 726–738.