The Cost of Software-Based Memory Management Without Virtual Memory
Drew Zagieboylo
Department of Computer Science, Cornell University, [email protected]
G. Edward Suh
Department of Electrical and Computer Engineering, Cornell University, [email protected]
Andrew C. Myers
Department of Computer Science, Cornell University, [email protected]
Abstract
Virtual memory has been a standard hardware feature for more than three decades. At the price of increased hardware complexity, it has simplified software and promised strong isolation among colocated processes. In modern computing systems, however, the costs of virtual memory have increased significantly. With large memory workloads, virtualized environments, data center computing, and chips with multiple DMA devices, virtual memory can degrade performance and increase power usage. We therefore explore the implications of building applications and operating systems without relying on hardware support for address translation. Primarily, we investigate the implications of removing the abstraction of large contiguous memory segments. Our experiments show that the overhead to remove this reliance is surprisingly small for real programs. We expect this small overhead to be worth the benefit of reducing the complexity and energy usage of address translation. In fact, in some cases, performance can even improve when address translation is avoided.
1. Introduction
Virtual memory is a central abstraction for modern operating systems (OSes), providing processes with a fictional view of a private, contiguous address space. Program code references memory via virtual addresses that are translated by the OS and hardware to physical memory addresses. While this abstraction has provided security, portability, and convenience, it can be very expensive. In particular, large-memory applications and virtualized environments tend to pay the highest performance costs for address translation [1, 2], with overheads of up to 94% reported for microbenchmarks and 66% for representative benchmarks. Graph-processing applications are hit the hardest, as their poor spatial locality increases the time spent handling TLB misses [3]. All these costs are exacerbated in a virtual machine, since TLB misses there may require nested page table walks.

Virtual memory also makes hardware and software designs more complex. Hardware page table walkers traverse complex software-managed data structures; modern TLBs support multiple physical page sizes; and translation structures must be carefully managed to enforce security. To fully utilize these features, OSes have also accrued more logic and complexity. For example, the OS must manage page tables and ensure that TLBs are synchronized across CPUs and devices. Additionally, the OS has the difficult job of turning hardware-level optimizations for translation into performance gains for software. Linux originally implemented transparent huge pages (THP) in 2010, yet the research community continues to search for practical designs in which THP improves software performance [8, 10, 13]. While these efforts have continued to improve the state of the art, they introduce more processes and heuristics into an already complex kernel.

Embedded systems, with low power and area budgets, already tend to avoid virtual memory; we posit that a broader range of systems can benefit by not translating memory accesses.
Much of the software complexity involved in virtual memory management is redundant with features implemented in managed language runtimes, and even in application code. Therefore, we believe future architectures can increase performance, simplicity, and energy efficiency by delegating more responsibility to compilers, language runtimes, and applications. This approach goes against the grain of the contemporary trend toward ever more complex hardware optimizations consuming more power and area.

In this paper, we envision a physically addressed software architecture, without address translation, in which the OS hands out fixed-size blocks of memory, and applications allocate data structures across those blocks. While this is an extreme design, determining the challenges and costs to software that it implies will be useful in exploring future hardware support for different memory abstractions. Our primary contributions to this end are experiments measuring the performance cost of removing the abstraction of large contiguous ranges of memory. Our results suggest that small-footprint applications pay only a small price from these changes, while large-footprint applications can benefit from avoiding TLB misses. We hope that these results incite further research towards supporting physical addressing without losing the security, convenience, and performance of virtual memory.

Table 1. Supporting virtual memory features with physical addressing.

Protection: Hardware support for physical memory protection and OS support for using these features.

Relocation / Migration: Most code is compiled position-independently so that it can be relocated at run time. For data, managed languages already include support for relocating objects. CARAT [12] shows how the compiler, runtime, and kernel can provide similar support for unmanaged languages so that their data can be dynamically relocated.

Swapping: Modern, performance-critical software is considered non-functional when swapping, so it is avoided at all costs. However, language support for relocation directly supports swapping. Machinery for migrating objects between memory pages can also move objects between memory and disk, under application control.

Contiguity: Without large contiguous memory regions, language runtimes and programs represent and access the program stack and large arrays differently.
2. Supporting Virtual-Memory Functionality
The disadvantages of virtual memory are well known, and many proposals exist to mitigate them. Bhattacharjee [3] summarizes how current virtual memory designs impose performance, complexity, energy, and area costs, which have become system-wide bottlenecks for many applications. In this section, we discuss the implications of changing the memory allocation abstraction to physical instead of virtual. Table 1 summarizes four main features now provided by virtual memory and how they can be supported without address translation. Our conclusion is that hardware support for physical memory protection may be necessary, but most other features can be implemented using logic already resident in managed language runtimes. The one exception is the contiguous-memory abstraction, for which we propose and evaluate preliminary solutions in Sections 3 and 4.

We argue that virtual memory can be removed with surprisingly low performance and complexity costs. For instance, by adding a modest burden to applications, we can simplify OS memory management. Recent research on transparently utilizing hardware support for multi-sized page tables [8, 10, 13] requires complex OS enhancements to both provide performance improvements and retain backwards-compatible memory allocation APIs. However, many applications, such as Memcached, already manage memory to reduce fragmentation by exploiting domain knowledge. Other applications use general-purpose user-space allocators such as jemalloc. These allocators can easily be configured to interact with a simple OS memory manager like the one we describe in Section 3.

Recently, multiple research projects have proposed separating memory protection from virtual memory to provide stronger security guarantees, and demonstrated that such protection mechanisms can be realized with low overhead. For example, capability-based protections, such as CHERI [4], provide memory protection by including additional metadata with each pointer.
While capability systems make pointers large, their overall memory overhead is quite small since they require only one bit per memory location to distinguish pointers from other data. Tagged memory such as HyperFlow [5] and RISC-V's PMP can also provide strong protection by attaching security metadata to each memory chunk; their area overhead is proportional to the granularity of protection required. While exact area and power consumption are difficult to quantify, we expect that removing (or simplifying) support for address translation can have a net positive impact, since current translation infrastructure uses as much space as an L1 cache and up to 15% of a chip's energy [3].

A final consideration is flexibility. Virtual memory is baked into the ISA, and so are performance-influencing parameters like page size. Modern instruction sets provide only a few possible page sizes. For example, x86-64 only supports 4 KB, 2 MB, or 1 GB pages; ARMv8 only supports 4 KB, 16 KB, or 64 KB. This illustrates one of the problems with an otherwise simple solution to some virtual memory woes: increasing a system's base page size such that all pages are huge pages. To achieve the best TLB reach with the least wasted memory, the OS would ideally choose a page size parameterized on available resources and workload; however, this is not possible in general with the limited options provided by hardware. Physical addressing gives the OS more choice as technology, memory resources, and workloads vary over time.
3. Contiguous Memory
We imagine that realistic physically addressed systems could utilize techniques from prior work [2, 8, 10, 13] to implement flexible partitioning of memory and a rich API for reserving contiguous regions of various sizes. Nevertheless, for the rest of this paper, we describe a general-purpose OS with a more straightforward memory allocation strategy: segment memory into fixed-size blocks as the minimum allocation unit. In our experiments (Section 4), performance was mostly insensitive to the choice of block size, and we report results based on 32 KB blocks. In a real deployment, this number would likely be of similar magnitude (somewhere between small and huge pages) but not necessarily the same. While simplistic, this strategy provides a useful tool for exploring how physically addressed applications might interact with a constrained memory manager. Since this OS has less control over external fragmentation, it cannot provide the conventional expectation that arbitrarily large memory requests are satisfied as long as there is enough unallocated memory. We investigate the performance cost of modifying software to avoid large allocations. Our measurements on a variety of benchmarks suggest that this cost is surprisingly small. Furthermore, unlike traditional address translation, this cost is only paid on accesses to "contiguous" structures; all other memory accesses incur no overhead. Modifications are needed to address two key uses of contiguous memory in programming languages: the program stack and large arrays.
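As a concrete sketch, the allocation interface such an OS might export can be modeled in a few lines of C. This is purely illustrative: the paper does not define an API, so the names (block_alloc, block_free) and the bitmap-over-a-static-pool design are our assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE (32 * 1024)  /* 32 KB minimum allocation unit */
#define NUM_BLOCKS 64           /* size of the simulated physical pool */

/* Simulated physical memory pool plus a free bitmap; a real OS would
 * track free blocks of actual physical memory instead. */
static uint8_t pool[NUM_BLOCKS][BLOCK_SIZE];
static uint8_t in_use[NUM_BLOCKS];

/* Hand out exactly one fixed-size block. There is deliberately no way
 * to request a larger contiguous region, mirroring the constrained
 * memory manager described above. */
void *block_alloc(void) {
    for (size_t i = 0; i < NUM_BLOCKS; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            return pool[i];
        }
    }
    return NULL; /* out of blocks */
}

void block_free(void *p) {
    size_t idx = (size_t)((uint8_t *)p - (uint8_t *)pool) / BLOCK_SIZE;
    in_use[idx] = 0;
}
```

A user-space allocator such as jemalloc would then carve these 32 KB blocks into application objects, calling block_alloc only when it exhausts the blocks it already holds.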
Stack-relative addressing normally requires contiguous memory locations, but only within a given stack frame. Therefore, as long as every stack frame fits within the block size, code only needs to be modified to dynamically allocate memory when the current stack block runs out of space. This modification adds some overhead to each function call (about three x86 instructions) to ensure the current stack block has enough space. In the rare case that it doesn't, a new frame is allocated, non-register arguments are copied from the old stack to the new frame, and the stack pointer is adjusted; at function exit, all of this work is cleaned up. By carefully managing the return address register on function entry, the cleanup code can be skipped when a new block is not allocated.

An existing option of the gcc compiler, stack splitting, already implements this functionality. It was originally developed to reduce memory usage in highly parallel programs, but it fulfills our purpose as well. The only modification we make to gcc's implementation is to force new stack memory requests to be equal to the block size. Although we could allocate perfectly sized frames for each function, doing so would increase the number of calls made to the allocator. Amortizing these calls by using larger allocations usually improves performance but may create internal fragmentation within the stack memory.
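The per-call check that stack splitting adds can be modeled as below. This is a hypothetical C rendering of what the emitted prologue does, not gcc's actual code: the real prologue is a few instructions comparing the stack pointer against a limit kept in thread-local storage, and the slow path also copies stack arguments and fixes up the return address, which we elide here.

```c
#include <stdlib.h>

#define STACK_BLOCK_SIZE (32 * 1024)  /* one fixed-size stack block */

/* Per-thread bookkeeping that the prologue consults. */
static char *stack_limit;  /* lowest usable address in current block */
static char *stack_ptr;    /* simulated stack pointer */

/* Conceptually runs at every function entry with that function's
 * frame size: the common case is one compare and a pointer bump. */
static void prologue_check(size_t frame_size) {
    if ((size_t)(stack_ptr - stack_limit) < frame_size) {
        /* Rare slow path: grab a fresh block-sized region and switch
         * to it. Real split stacks also copy non-register arguments
         * and arrange for the old block to be resumed on return. */
        char *blk = malloc(STACK_BLOCK_SIZE);
        stack_limit = blk;
        stack_ptr = blk + STACK_BLOCK_SIZE;
    }
    stack_ptr -= frame_size; /* fast path: just bump the pointer */
}
```

Forcing the slow path to request whole blocks, as described above, is what keeps allocator calls rare at the cost of some internal fragmentation.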
We must also consider how applications will create arrays on the heap, without the assumption that allocation (e.g., via malloc) can return arbitrarily large contiguous regions. A straightforward approach is to replace arrays with trees, where intermediate nodes in the tree hold pointers to other nodes and only the leaves actually store data. In a sense, hardware-supported page tables implement a similar data structure; we investigate replacing it with a purely software version used only for large arrays.

Figure 1. Memory layout for an array of length n represented as a tree. Each sequence of boxes represents a contiguous allocation of size m. A tree stores meta-data about its depth. Data is stored exclusively on leaf nodes; intermediate nodes store indirection pointers.

Figure 2. struct ArrayIterator and its Iterator next() implementation (pseudocode).

Figure 3. Split-stack overhead on PARSEC and SPECInt2017.

Our data structure is an implementation of "arrays as trees," a discontiguous array design described by Siebert [11]. Figure 1 provides a graphical representation of how data is partitioned across multiple memory allocations in the tree. Trees are built with constant-sized allocations, independent of the data size. Supporting a larger number of elements requires a deeper tree, but the depth is still bounded by a relatively small number: with the 32 KB node size used in this work, 3-level and 4-level trees can address about 536 GB and 2 PB of data, respectively.

In general, accessing an element in the tree requires traversing a path from the root to a leaf, making one memory access for each layer; however, software optimizations can reduce this overhead significantly for common cases. For instance, when iterating sequentially, software can cache a pointer to the most recently accessed element. As long as the next element is part of the same allocation, software only needs to increment this pointer and make a single memory access to retrieve its data. A full tree traversal happens only when iterating past the last element in a given allocation. This optimization can be captured abstractly through an Iterator interface, so that standard iteration constructs in modern programming languages can use this feature transparently. Figure 2 is a pseudocode implementation of the next() method of Iterator, which retrieves the next tree element. Inlined, this method offers further optimization opportunities.
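The tree lookup and the iterator's cached-leaf fast path can be sketched in C. This is a minimal illustration, not the paper's implementation: we use tiny 4-entry nodes so the structure is easy to see (the paper's 32 KB nodes hold 4096 pointers or 32 KB of data each), and all names here are our own.

```c
#include <stddef.h>
#include <stdlib.h>

#define FANOUT 4  /* entries per node; 4096 for 32 KB nodes */

typedef struct Node {
    union {
        struct Node *child[FANOUT];  /* interior node: indirection */
        long         data[FANOUT];   /* leaf node: actual elements */
    };
} Node;

typedef struct {
    Node *root;
    int   depth;  /* number of interior levels; 0 means root is a leaf */
} TreeArray;

static Node *build(int depth) {
    Node *n = calloc(1, sizeof(Node));
    if (depth > 0)
        for (int c = 0; c < FANOUT; c++)
            n->child[c] = build(depth - 1);
    return n;
}

/* Full root-to-leaf walk: one memory access per tree level. */
static long *tree_elem(TreeArray *t, size_t i) {
    size_t span = 1;  /* number of elements under one child */
    for (int k = 0; k < t->depth; k++) span *= FANOUT;
    Node *n = t->root;
    for (int d = t->depth; d > 0; d--) {
        n = n->child[(i / span) % FANOUT];
        span /= FANOUT;
    }
    return &n->data[i % FANOUT];
}

/* Sequential iterator: caches the current leaf so the common case is
 * a single memory access; a full traversal happens only when
 * iteration crosses into a new allocation. */
typedef struct {
    TreeArray *t;
    size_t     i;
    Node      *leaf;
} ArrayIterator;

static long iter_next(ArrayIterator *it) {
    size_t off = it->i % FANOUT;
    if (off == 0)  /* crossed a leaf boundary: full walk */
        it->leaf = (Node *)tree_elem(it->t, it->i);
    it->i++;
    return it->leaf->data[off];
}
```

For example, a depth-1 tree covers FANOUT^2 = 16 elements, and tree_elem(&t, 5) walks root, then child[1], then data[1]; iterating the same tree with iter_next performs a full walk only once every FANOUT elements.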
4. Experimental Results
We ran all experiments on a computer with 16 Intel i7-7700 CPUs clocked at 3.60 GHz, 32 KB L1 instruction and data caches, and 128 GB of physical memory, running Ubuntu 18.04. All reported measurements are the average of ten runs; with standard configurations to reduce variability (such as disabling ASLR and powersave mode), all sample standard deviations were less than 0.1% of the mean.
We consider standard benchmarks to evaluate realistic impact, and also one microbenchmark designed to study a pessimistic case. Our standard benchmarks are most of the SPECInt2017 and PARSEC suites. We omit the "exchange" SPEC benchmark because it is written in FORTRAN, and also "perlbench" and "gcc" because they crash with gcc's implementation of stack splitting. Our microbenchmark is designed to amplify the performance cost of stack splitting beyond what would be seen in most programs; it is a recursive Fibonacci program that allows measuring the overhead of checking available stack space in function-call-bound code.

Figure 3 shows the run time of split-stack-compiled programs normalized to the default gcc-compiled run times. The average run-time increase was only 2%. The variability seen depends upon the frequency with which the programs make function calls. In most cases the performance changed by less than 1%, which we believe is essentially noise; it is less than the impact of changing stack alignment [9]. We did modify one benchmark ("ferret") to change very large stack-based allocations to heap-based allocations in both the baseline and split-stack executions. These results validate our hypothesis that stack splitting is unlikely to add significant overhead. Even the Fibonacci microbenchmark showed only a 15% slowdown, which seems acceptable considering its pathological nature.
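For reference, the microbenchmark is just the classic doubly recursive Fibonacci, along these lines; nearly all of its work is function-call overhead, which is what makes it a pathological case for a per-call stack check.

```c
/* Naive recursive Fibonacci: the call tree has exponentially many
 * nodes, so a few extra prologue instructions per call dominate the
 * run time. */
static unsigned long fib(unsigned n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}
```

Compiling this with and without gcc's -fsplit-stack option and timing both binaries reproduces the kind of comparison reported above.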
To test the impact of replacing large arrays with trees, we evaluated both microbenchmarks and some standard benchmarks for runtime overhead. In order to more accurately simulate the physically addressed system that we imagine, we execute the tree-based implementations of our microbenchmarks using 1 GB huge pages, which reduces the TLB miss rate to zero in most cases. For the baseline contiguous array implementations, we did not use huge pages. For standard benchmarks, we used 4 KB pages for both arrays and trees; TLB miss rates were always close to zero regardless of page size.

Table 2. Ratios of run times for simulated physical-memory tree-based implementations vs. virtual-memory implementations. Tree-based implementations of arrays are compared against traditional contiguous arrays; 4 KB arrays fit into depth-1 trees, 4 MB into depth-2, and all others into depth-3. For each benchmark we provide both a naive implementation and a corresponding iterator-optimized version.

Benchmark          4KB   4MB   4GB   8GB   16GB  32GB  64GB
LinearScan: Naive  1.36  2.97  3.34  3.37  3.37  3.37  3.37
LinearScan: Iter   1.00  1.02  0.99  0.99  0.99  0.99  0.99
StridedScan: Naive 1.71  0.72  1.28  1.26  1.08  1.04  1.06
StridedScan: Iter  2.47  0.57  1.02  0.89  0.86  0.86  0.86

Figure 4. Ratios of run times for two large data structures: GUPS and red–black trees. For GUPS, a simulated physical-memory tree-based implementation is compared to the virtual-memory implementation. For red–black trees, the same implementation is used for both physical and virtual memory. In both cases, the results suggest that physical addressing offers better performance for large data structures.

Figure 5. Overhead of software-based contiguous memory on selected SPEC and PARSEC benchmarks.

Our microbenchmarks exhibit various levels of spatial locality by: (1) iterating over every element; (2) accessing every 1024th element (i.e., 4 KB apart); and (3) accessing pseudorandomly (in GUPS, an HPC benchmark). Lastly, we include a red–black tree benchmark which does not use an array implementation in either experiment, to illustrate the potential speedup of removing virtual memory when contiguity is not necessary. It creates a red–black tree by inserting random elements and then executes an in-order traversal that accesses memory locations with low locality.

We evaluate two standard benchmarks which exhibit good and bad spatial locality, respectively: blackscholes from PARSEC and deepsjeng from SPECInt2017. The former scans through several large arrays while executing floating-point computations on each element; the latter allocates a single large array as a hashtable and accesses it less predictably.

We measured the average element access time for each microbenchmark; Table 2 compares how the tree and contiguous array implementations perform on the two scan benchmarks, both with and without software optimizations for tree iteration. For depth-1 trees (4 KB), we might expect no overhead, since they require no more memory accesses than arrays do. However, our implementation checks the depth of the tree before accessing data, which adds branch instructions on every access.
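The three locality patterns can be sketched against a plain-array baseline (the tree versions drive the same index sequences through the tree lookup instead). The stride and the LCG constants below are illustrative, not taken from our harness.

```c
#include <stddef.h>
#include <stdint.h>

/* (1) Linear scan: touches every element, high spatial locality. */
static long linear_scan(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += a[i];
    return sum;
}

/* (2) Strided scan: touches one element per stride, so with a
 * page-sized stride each access lands on a new page. */
static long strided_scan(const long *a, size_t n, size_t stride) {
    long sum = 0;
    for (size_t i = 0; i < n; i += stride) sum += a[i];
    return sum;
}

/* (3) GUPS-style pseudorandom updates: each access touches an
 * unpredictable location, defeating TLB reach and prefetchers. */
static void random_update(long *a, size_t n, size_t iters) {
    uint64_t x = 12345;
    for (size_t i = 0; i < iters; i++) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        a[x % n] ^= (long)x;
    }
}
```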
Similarly, some of our optimizations cause unnecessary overhead on very small trees. If we statically remove these operations, depth-1 trees have identical performance to arrays in all benchmarks. Compilers and language runtimes should be able to eliminate many tree-depth checks via static analysis and code specialization [7].

For deeper trees, we saw an expected jump in overhead, caused by the increased number of memory accesses, in both of the unoptimized tree implementations. The strided 4 MB data point is the outlier, where trees outperform contiguous arrays substantially. This occurs because the contiguous arrays see the same overhead caused by TLB misses in all experiments (except for 4 KB), but these trees require fewer memory accesses than at all larger data points. In the linear scan, the arrays suffered almost no TLB misses, obviating the main advantage of trees. We hypothesize that translation hardware is optimized to make this case fast. Nevertheless, in this benchmark, trees added no overhead when using the Iterator optimization. We observed a similar phenomenon in the strided experiment, where arrays suffered extremely high TLB miss rates (above 90%) but didn't experience as much slowdown as we expected, and thus outperformed the naive tree implementation. Likely, hardware optimizations (such as page table walk caches and prefetchers) reduced the time to handle each TLB miss, mitigating some of the performance problems caused by the strided scan with contiguous arrays. Again, by using the Iterator optimization, trees running on physical memory even managed to outperform the original array implementation.

In Figure 4, we make the same comparisons but on the GUPS and red–black tree benchmarks. These benchmarks have random access patterns that should both cause significant TLB misses and make hardware translation optimizations less effective. As expected, using a tree implementation causes less slowdown on GUPS; trees even outperform arrays for the 16 GB GUPS dataset, so physical addressing should perform better at that size or larger. Our red–black tree benchmark, which used the same non-contiguous implementation in both experiments, saw up to a 50% reduction in run time when running without virtual memory. These results indicate the unsurprising but hopeful fact that removing all time spent handling TLB misses can greatly improve performance, even for workloads of relatively modest size.

Figure 5 summarizes the performance overhead induced by removing the contiguous memory abstraction on blackscholes and deepsjeng. Allocations by blackscholes totaled 600 MB of memory; deepsjeng_r uses 700 MB, and deepsjeng_s uses 7 GB. In all cases, replacing large arrays with trees degraded performance by less than 3%; performance even improved slightly for blackscholes implemented with Iterators. Even with stack splitting, total overhead is under 10%.
Note that in our experiments, trees perform worse than arrays on the 32 GB and 64 GB datasets for GUPS and strided accesses. This is an artifact of our experimental setup; beyond 16 GB, huge pages don't faithfully simulate physical-memory performance because they start taking TLB misses too [2]. Essentially, in these data points we see the software overhead of trees but none of the benefits that physical addressing would convey. While we would have preferred to run these experiments on "bare metal" hardware without virtual addressing, it is impractical to achieve such a setup while running normal software with huge datasets. Based on the performance trends we did observe, we believe that trees would start to outperform arrays for these two benchmarks on huge datasets with actual physical memory. Tree depth can effectively be held constant, but the cost in TLB misses is only going to rise as datasets get even larger.
Our evaluation measures overheads arising from the loss of the contiguous memory abstraction. Our selected SPEC and PARSEC benchmarks indicate that CPU-bound workloads see little overhead from trees and split stacks. The overhead from implementing medium-sized arrays as trees can be noticeable, but should be smaller or even negative in large-memory applications. Further, many applications (e.g., graph processing) act more like the red–black tree benchmark; our results suggest they will run significantly faster in physical memory.

Additionally, the performance gap that does remain in some of our experiments is clearly a result of differing performance optimizations. Modern address translation hardware is complex and heavily optimized, especially for common access patterns. In addition to TLBs, page table walk (PTW) caches store intermediate page table information to speed up future page table traversals. Prefetching also helps to hide TLB miss latency when access patterns are predictable. However, we can implement similar optimizations in software, achieve similar results, and avoid bloating hardware with wasteful complexity. The LinearScan: Naive and LinearScan: Iter experiments exemplify this perfectly. Our Iterator optimization essentially implements a PTW cache in software, so that we rarely traverse the whole tree; applying this optimization to the linear scan application completely removes the performance gap between traditional virtual memory and our simulated physical system. In general, it is better for applications to apply such optimizations exactly where they are needed, rather than expending circuitry on hardware to predict which optimizations might be beneficial at the moment. Certainly, there are inherently unpredictable programs (like GUPS) where no static optimization can help. Performance in those cases could benefit from hardware acceleration of tree traversals, perhaps using some simplified subset of current virtual memory optimizations. Nevertheless, making that functionality an optional accelerator rather than an obligatory step on the critical path to memory could offer the best of both optimization schemes.
5. Related Work
The many efforts to address the "address translation wall" [3] have largely been directed at opaquely improving translation hardware. More relevant to this paper are efforts to modify or remove traditional address translation, or those that offer hardware support to enable such a transition.

CARAT [12] consists of a compiler, runtime, and kernel module, which together provide memory protection and the ability to relocate memory despite the program using physical addresses. In this way, CARAT preserves the virtual memory abstraction for application software but removes the dependence on hardware support for address translation and memory protection (within the application). CARAT inserts runtime allocation tracking and guards, and uses a sophisticated program dependency analysis to support these features with minimal overheads. Our work suggests a more radical change: entirely removing software dependence on hardware translation. These techniques are certainly compatible, and the most performant future systems will likely make use of both. For instance, "arrays as trees" could ameliorate an existing CARAT limitation that arises from the difference between the sizes of application allocations and those of the underlying allocator.

Basu et al. [2] use direct segments to enable efficient contiguous memory in the face of address translation. Direct segments support efficient translation of large, contiguous memory regions allocated by the OS via base and bound registers. While effectively eliminating translation overhead for certain accesses, this scheme is still essentially static in nature, relying on traditional virtual memory as the default system behavior.

Alam et al. [1] propose a "Do-It-Yourself" translation mechanism that allows application software to provide its own address translation function. The hardware then checks that the result of translation is a legal, accessible physical frame.
Their work provides more flexibility than traditional virtual memory abstractions, but it still fundamentally relies on address translation and thus is still likely to perform worse than direct physical memory access.

Many other hardware-enabled protection schemes do not rely on address translation and thus are good candidates to support protection in the face of physical addressing. Mondrian memory protection [15] efficiently provides fine-grained partitioning and sharing of memory and would enable flexible OS memory allocation. Other results [4, 6, 14] indicate that physically tagged memory can provide efficient protection without relying on address translation; these approaches could support a variety of software memory-management abstractions.
6. Conclusion
Virtualization, large memory footprints, and coherency across multiple cores and devices all make address translation a growing performance bottleneck and source of complexity. We have explored an alternative approach: physical addressing, with responsibility delegated to software. This direction has been insufficiently examined by the architecture, OS, and programming language communities. Our experimental results show that the software overheads added by replacing the contiguous memory abstraction are low, and that physical addressing is potentially a fruitful area for further research. While we have examined a narrow set of changes to applications and compilers, there are many more opportunities to optimize memory management and access at all layers of the programming stack. Additionally, removing hardware-based address translation would enable simpler, more power-efficient circuits whose functionality and usage are driven by application needs rather than constrained by rigid design choices.

Acknowledgments
This work was supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program.
References

[1] H. Alam, T. Zhang, M. Erez, and Y. Etsion. Do-It-Yourself Virtual Memory Translation. ISCA '17.
[2] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift. Efficient Virtual Memory for Big Memory Servers. ISCA '13.
[3] A. Bhattacharjee. Preserving Virtual Memory by Mitigating the Address Translation Wall. IEEE Micro, 2017.
[4] B. Davis, R. N. M. Watson, A. Richardson, P. G. Neumann, S. W. Moore, J. Baldwin, D. Chisnall, J. Clarke, N. W. Filardo, K. Gudka, et al. CheriABI: Enforcing valid pointer provenance and minimizing pointer privilege in the POSIX C run-time environment. In Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019.
[5] A. Ferraiuolo, M. Zhao, A. C. Myers, and G. E. Suh. HyperFlow: A Processor Architecture for Nonmalleable, Timing-Safe Information Flow Security. In Proceedings of the 25th ACM Conference on Computer and Communications Security (CCS), 2018.
[6] A. Joannou, J. Woodruff, R. Kovacsics, S. W. Moore, A. Bradbury, H. Xia, R. N. Watson, D. Chisnall, M. Roe, B. Davis, et al. Efficient tagged memory. Pages 641–648, IEEE, 2017.
[7] N. D. Jones, C. K. Gomard, and P. Sestoft. Partial Evaluation and Automatic Program Generation. Prentice-Hall International, London, UK, 1993. ISBN 978-0130202499. Available online.
[8] X. Li, L. Liu, S. Yang, L. Peng, and J. Qiu. Thinking About A New Mechanism For Huge Page Management. APSys '19.
[9] T. Mytkowicz, A. Diwan, M. Hauswirth, and P. F. Sweeney. Producing wrong data without doing anything obviously wrong! ACM SIGPLAN Notices, 2009.
[10] A. Panwar, S. Bansal, and K. Gopinath. HawkEye: Efficient Fine-Grained OS Support for Huge Pages. In Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019.
[11] F. Siebert. Eliminating External Fragmentation in a Non-Moving Garbage Collector for Java. In Proceedings of the 2000 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2000.
[12] B. Suchy, S. Campanoni, N. Hardavellas, and P. Dinda. CARAT: A case for virtual memory through compiler- and runtime-based address translation. In ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), pages 329–345, 2020.
[13] X. Wang, H. Liu, X. Liao, J. Chen, H. Jin, Y. Zhang, L. Zheng, B. He, and S. Jiang. Supporting Superpages and Lightweight Page Migration in Hybrid Memory Systems. ACM Trans. Archit. Code Optim., 2019.
[14] S. Weiser, M. Werner, F. Brasser, M. Malenko, S. Mangard, and A.-R. Sadeghi. TIMBER-V: Tag-isolated memory bringing fine-grained enclaves to RISC-V. In