Disaggregation and the Application
Sebastian Angel
University of Pennsylvania
Mihir Nanavati
Microsoft Research
Siddhartha Sen
Microsoft Research
Abstract
This paper examines disaggregated data center architectures from the perspective of the applications that would run on these data centers, and challenges the abstractions that have been proposed to date. In particular, we argue that operating systems for disaggregated data centers should not abstract disaggregated hardware resources, such as memory, compute, and storage, away from applications, but should instead give them information about, and control over, these resources. To this end, we propose additional OS abstractions and interfaces for disaggregation and show how they can improve data transfer in data parallel frameworks and speed up failure recovery in replicated, fault-tolerant applications. This paper studies the technical challenges in providing applications with this additional functionality and advances several preliminary proposals to overcome these challenges.
Disaggregation splits existing monolithic servers into a number of consolidated single-resource pools that communicate over a fast interconnect [8, 32, 34, 37, 49, 50, 61]. This model decouples individual hardware resources, including tightly bound ones such as processors and memory, and enables the creation of "logical servers" with atypical hardware configurations. Disaggregation has long been the norm for disk-based storage [33] because it allows individual resources to scale, evolve, and be managed independently of one another. In this paper, we target the new trend of memory disaggregation.

Existing works on disaggregated data centers (DDCs) have focused primarily on the operational benefits of disaggregation: it allows resources to be packed more densely and improves utilization by eliminating the bin-packing problem. As a result, these works strive to preserve existing abstractions and interfaces and propose runtimes and OSes that make the unique characteristics of DDCs transparent to applications [14, 61]. The implicit underlying assumption in these works is that, from the perspective of the OS, the distributed nature of processors and memory is an inconvenient truth of the underlying hardware, much like paging or interrupts, that should be abstracted away from applications.

Our position is that the disaggregated nature of DDCs is not just a hardware trend to be tolerated and abstracted away to support legacy applications, but rather one that should be exposed to applications and exploited for their benefit. We draw inspiration from decades-old distributed shared memory systems (which conceptually closely resemble disaggregation), where early attempts at full transparency quickly gave way to weaker consistency and more restrictive programming models for performance reasons [16, 35, 36, 45]. While the driving rationale for externalizing memory has changed, along with the underlying hardware and target applications, we believe that co-designing applications and disaggregated operating systems remains an attractive proposition.

Two properties of disaggregated hardware with potential to benefit applications are the ability to reassign memory by dynamically reconfiguring the mapping between processors and memory, and the failure independence of different hardware components (i.e., the fact that processors may fail without the associated memory failing, or vice versa). Memory reassignment can be leveraged by applications performing bulk data transfers across the network to achieve zero-copy operations by remapping memory from the source to the destination, or during processor failures to find orphaned memory a new home. Failure independence also allows processors to be useful despite memory failures by acting as fast and reliable failure informers [4] and triggering recovery protocols.

We target data center applications that are logically cohesive, but physically distributed across multiple co-operating instances (examples include most microservice-based applications, data parallel frameworks, distributed data stores, and fault-tolerant locking and metadata services), and propose extending existing OSes for disaggregated systems, such as LegoOS [61], with primitives for memory reassignment and failure notification. Below is a discussion of the proposed primitive operations and the challenges in implementing them, all of which are exacerbated by the fact that the exact nature of disaggregation and the functionality of each component is in flux (§2).
• Memory grant.
This is a voluntary memory reassignment called by a source application instance to yield its memory pages and move them to a destination application instance. This reassignment requires a degree of flexibility from the interconnect, which must be able to handle modifying memory mappings quickly and at fine granularities.
• Memory steal.
This is an involuntary reassignment of memory from one application instance to another. While similar to a memory grant from the perspective of the interconnect, a key difference is that the source application instance may not have any prior warning. Since volatile state can now transcend an application instance, the programming model needs to guarantee crash consistency to ensure that state is semantically coherent at all times.
• Failure notification.
An application instance can opt to receive notifications for memory failures, or it can register other instances to automatically be notified in such cases. This requires making failure information visible to applications, as well as retaining group membership at the processor so other instances can be notified if the local instance cannot handle or mask the memory failure.

Data parallel frameworks, such as MapReduce and Dryad, can use these primitives to eliminate unnecessary data transfer during shuffles or between nodes in the data flow graph, while Chubby [12] and other applications based on Paxos [39] can recover and reassign the committed state machine from a failed replica. In addition, early detection of memory failures can trigger recovery mechanisms without waiting for conservative end-to-end timeouts. While this paper focuses on these two applications, we believe that the interfaces are broad enough to benefit other applications. For example, scalable data stores, such as Redis or memcached, could use memory grants to delegate part of their key space to new instances sans copying, while microservice-based applications can use grants and steals to achieve performance comparable to monolithic services and still retain some modularity.

This paper is largely speculative and poses more questions than it answers. We discuss some operating system primitives that empower forward-looking applications to benefit from disaggregation (while allowing legacy applications to remain oblivious to it) and describe how these primitives could be implemented and used in the context of an abstract disaggregated data center that resembles existing designs. Our hope is to foster a broader discussion around disaggregation, not from the perspective of operators, but as an opportunity, and also a challenge, for systems and application developers.

In the absence of existing disaggregated data centers, a number of different architectures have been proposed [34, 49, 50, 57, 61]. While these architectures differ in some of the details, the general strokes are similar. We assume the architecture given in Figure 1, which has three core components: individual blades with compute elements and memory elements, connected over a low-latency programmable resource interconnect. While we have chosen to explore these ideas in the context of a single architecture for simplicity, we believe that they are broadly applicable to other disaggregation models.
Compute elements.
The basic compute elements in our rack are commodity processors which retain the existing memory hierarchy with private core and shared socket caches. While some architectures have processors operate entirely on remote memory [57], this requires major modifications to the processor to support instructions such as PUSH and POP that implicitly reference the stack, as well as to the memory and caching subsystems. In line with the majority of proposed architectures, we assume a small amount of locally-attached memory at the compute elements, which is used for the operating system and as a small cache to improve performance [24, 61].
Memory elements.
Memory elements, which are conventional DRAM or NVRAM chips, can be exposed directly across the interconnect (Fabric-Attached Memory) or fronted by a low-power processing element (e.g., mobile processor, FPGA, or ASIC) that interacts with memory through a standard MMU (Proxied-Memory). We assume a form of proxied memory where addressing, virtualization, and access control are delegated to the local processing element, which interposes on memory requests. Similar functionality can be achieved for fabric-attached memory by coordinating memory controllers at multiple compute elements.

[Figure 1: The assumed rack architecture: processors with SRAM caches and memory elements (DRAM/NVRAM) connected through the Rack MMU, with a ToR switch linking the rack to other racks.]
Resource interconnect.
The resource interconnect allows processor and memory elements to communicate and can be based on RDMA over InfiniBand or Ethernet, Omnipath [11], Gen-Z [25], or a switched PCIe fabric [13, 22]. Our design is agnostic to the physical layer, but we assume a degree of programmability and on-the-fly reconfiguration within the interconnect (which we call the Rack MMU) that allows compute and memory elements to be dynamically connected and disconnected in arbitrary configurations. Recent work [63] proposes and implements one such network fabric; although the proposed architecture lacks a programmable switch, it emulates its functionality through a Clos network of switches and a coordination-free scheduling protocol.
Resource partitioning and allocation.
We assume that the unit for disaggregation is a single rack (i.e., compute and memory elements reside in the same rack), with resources being partitioned into the desired compute abstractions, such as virtual machines, containers, or processes, and presented to applications (we generically refer to all of these compute abstractions, which host application workloads, as processes). The Rack MMU acts as a resource manager for the rack and is responsible for resource partitioning within the rack and assigning compute and memory elements to processes.

The Rack MMU has a similar policy regarding sharing of hardware resources as LegoOS [61]: processes may share the same memory element, but not the same regions of memory (i.e., there is no shared memory). Similarly, compute elements can host multiple processes, but all the threads of a process are restricted to a single compute element. This simplifies caching, as shared memory would require coherence across the local memory attached to the compute elements. Memory is allocated at a fixed page-sized granularity, which is chosen according to the addressing architecture of compute and memory elements. The Rack MMU is responsible for high-level placement decisions for processes and picks compute and memory elements on the basis of some bin-packing policy, while fine-grained sharing and isolation across co-hosted processes are managed by the local OS.
Addressing and access control.
Traditional processes expect to operate on a private virtual address space, regardless of the physical layout of the underlying memory. To preserve this illusion, the Rack MMU stores a virtual-to-physical (V2P) mapping for each process, which resembles a traditional per-process page table. Compute elements query this V2P mapping, which may be cached locally, to route requests to the correct memory element.

The Rack MMU is also responsible for configuring access control to memory. When memory is allocated, the Rack MMU ensures that the topology of the interconnect allows for the existence of a path between the corresponding compute and memory elements. It also configures the page tables at the memory elements with the process identifier (effectively the CR3), the virtual address, and the appropriate permissions, enabling local enforcement at the memory elements. While this mechanism is specific to proxied memory, fabric-attached memory systems have proposed a capability-based protection system to achieve similar functionality [2].
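To make this concrete, the following is a minimal sketch (in C) of the per-process translation state the Rack MMU might maintain and the lookup a compute element would perform; all structure and field names here are illustrative assumptions rather than a definitive design.

    /* Illustrative sketch of per-process V2P state at the Rack MMU.
     * All names, field layouts, and sizes are assumptions for exposition. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12                /* assumes 4 KiB pages */

    struct v2p_entry {
        uint64_t vpage;                  /* virtual page number */
        uint16_t mem_elem;               /* memory element backing this page */
        uint64_t ppage;                  /* physical page at that element */
        uint8_t  perms;                  /* permissions, enforced at the element */
    };

    struct rack_process {
        uint32_t pid;                    /* process identifier (the effective CR3) */
        struct v2p_entry *v2p;           /* per-process table, like a page table */
        size_t n_entries;
    };

    /* Route a virtual address to (memory element, physical page). Compute
     * elements may cache the result; a miss falls back to the Rack MMU.
     * A linear scan is used here for clarity only. */
    static bool v2p_lookup(const struct rack_process *p, uint64_t vaddr,
                           uint16_t *mem_elem, uint64_t *ppage)
    {
        uint64_t vpage = vaddr >> PAGE_SHIFT;
        for (size_t i = 0; i < p->n_entries; i++) {
            if (p->v2p[i].vpage == vpage) {
                *mem_elem = p->v2p[i].mem_elem;
                *ppage    = p->v2p[i].ppage;
                return true;
            }
        }
        return false;
    }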
Scaling out.
Not all applications want to live within a single rack: to span racks, traditional Ethernet-based networking is available through a commodity top-of-rack (ToR) switch that connects to the rest of the data center network. Distributed applications comprising multiple processes have to choose the appropriate deployment: intra-rack deployments enjoy lower latencies, while cross-rack deployments have greater failure independence. This decision is analogous to the one faced by developers when selecting the appropriate placement group [5] or availability set [55] in cloud deployments today.
In traditional architectures, the OS is responsible for managing hardware resources, allocating them to processes, and enforcing isolation of shared resources. In a disaggregated environment, this is no longer true and resource allocation is now within the bailiwick of the Rack MMU; the local OS at compute elements continues to be responsible for managing the underlying hardware, providing local scheduling and isolation, and presenting a standard programming interface to applications. Additionally, the OS is responsible for transparently synchronizing application state between local and remote memory and, if any state is locally cached, managing the contents and coherence of this cache [27, 61].

Prior OSes for DDCs [14, 61] have chosen to implement a standard POSIX API and abstract away the disaggregated nature of DDCs from applications. While this allows existing unmodified applications to run on DDCs, our case studies (§5 and §4) argue that many of these applications could achieve better performance if they had more visibility and control. Accordingly, we advocate for the design and implementation of the following three operations as OS interfaces.
Memory is reassigned at page granularity by moving it from the V2P mapping of one process to another at the resource manager (Rack MMU) and invalidating any cached V2P mappings at compute elements. Following this, the resource manager revokes access to that memory region by modifying the page table entries for proxied memory (or by generating a new capability for fabric-attached memory); the detached memory can then be attached to an existing process similar to newly allocated memory.

Memory reassignment is conceptually similar to a memory grant operation in L4 [48], with one significant difference: as reassigned pages may contain data structures with internal references, these pages must be attached to the same virtual address to prevent dangling pointers. To avoid a situation in which the receiving process has already used the provided virtual addresses (which would create ambiguity), we propose reserving a fixed number of bits of the virtual address to act as a process identifier.

Mechanistically, we envision memory reassignment to occur, like in L4, through message passing between OS instances. This transfer is initiated by the application through a system call similar to vmsplice() in Linux: when called with the SPLICE_F_GIFT flag, the process "gifts" the memory to the kernel, promising to never access it again. As the page continues to use the same virtual address space in the receiving process, the sender OS marks the virtual address as being "in use" and prevents further allocations or mappings to it. Receiving processes are notified about the addition of new pages by their local OS through signals.
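As a concrete, heavily hedged illustration, the sketch below shows how a sender and receiver might use such an interface. The mem_grant() call and the signal-based delivery of new pages are hypothetical; they are modeled on the vmsplice()/SPLICE_F_GIFT semantics described above, not on any existing Linux API.

    /* Sketch of a hypothetical grant interface; mem_grant() is a stub that
     * stands in for a syscall asking the Rack MMU to remap pages to dest. */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>

    static int mem_grant(void *addr, size_t len, pid_t dest)
    {
        (void)addr; (void)len; (void)dest;
        return 0;                        /* stub for exposition */
    }

    /* Sender: fill a page-aligned buffer (len assumed to be a multiple of the
     * page size), then gift it; the pages keep their virtual address in the
     * receiver and must never be touched again here. */
    static void send_buffer(pid_t dest, size_t len)
    {
        void *buf = aligned_alloc(4096, len);
        if (!buf) return;
        memset(buf, 0xAB, len);          /* stand-in for real output data */
        if (mem_grant(buf, len, dest) != 0)
            perror("mem_grant");
    }

    /* Receiver: we assume the local OS announces newly attached pages with a
     * signal (SIGUSR1 here); the attached address would arrive out of band. */
    static void on_pages_attached(int sig) { (void)sig; }

    static void install_receiver(void)
    {
        struct sigaction sa = { .sa_handler = on_pages_attached };
        sigaction(SIGUSR1, &sa, NULL);
    }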
Memory grants are the most natural flavor of memory reassignment, but are not particularly useful in the case of compute element failures. An alternative is for other entities (processes or local OSes) to be able to take away, or steal, a process' memory. For example, when a compute element crashes, another process belonging to the same application could request the crashed process' memory. This is similar to how servers in Frangipani [64] keep their logs remotely, and can request the logs of servers that have crashed to resume their operations.

Two questions naturally arise in this case: first, who is allowed to trigger memory reassignments and when is it acceptable to do so? Second, how does the application guarantee the semantic consistency of memory that may abruptly be stolen? While it is clear in the context of memory grants that a process should have the authority to give away its own memory, the policy around forcible reassignment is less clear. One possibility is to group trusted processes together and allow any group member to initiate reassignment; another is to require that a group of processes reach consensus before reassigning any memory. In terms of timing, while we envision this primarily as an aid to recovery mechanisms when a process has crashed (or is suspected of having crashed), there might be applications where stealing memory from a running process is acceptable and actually profitable.

We propose to expose memory stealing via a syscall that requires the id of the source process and uses the group of the calling process as a capability for authentication; memory allocated using brk or mmap can disallow future reallocation with the appropriate flags. We do not enforce a specific policy at the Rack MMU and instead leave it up to the application to determine what is appropriate (we explore one such policy in the context of Paxos in Section 4.1). While a buggy application can mistakenly steal its own memory and crash, this is not morally different from threads stomping on each other's memory in buggy shared memory applications.

The second challenge is maintaining crash consistency for reassignments. This is non-trivial since most applications are not written with the idea that memory should be consistent and all invariants maintained at every point in the midst of computation; most applications do in fact have temporary windows of inconsistency. While certain programming abstractions such as transactional memory and objects [28, 62] provide atomicity, they are not sufficient in the case of compute element failures. Storage systems have historically faced similar challenges in allowing application state to outlive compute, and building transactional, crash-consistent programming models for non-volatile memory (NVRAM) is an active area of research [17, 54, 65–68]. Applications can adopt any of these programming models, which rely on a combination of techniques such as journaling, soft updates [23], shadow copies [18], and undo logs [17], to remain crash consistent when updating structures in remote memory.
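To illustrate one of these techniques, the sketch below applies an undo log to a structure that lives in memory that might be stolen. The remote_flush() helper is a placeholder for whatever primitive makes prior stores visible at the memory element (e.g., a cache-line writeback and fence); it is our assumption, not part of any proposed interface.

    /* Undo-log sketch for crash-consistent updates to stealable memory. */
    #include <stddef.h>
    #include <stdint.h>

    struct undo_entry { uint64_t *addr; uint64_t old; };

    struct undo_log {
        struct undo_entry e[64];
        volatile uint32_t n;             /* 0 means no update in progress */
    };

    /* Placeholder: make earlier stores visible/durable at the memory element. */
    static void remote_flush(const volatile void *p, size_t len) { (void)p; (void)len; }

    static void logged_store(struct undo_log *log, uint64_t *addr, uint64_t val)
    {
        log->e[log->n].addr = addr;      /* record the old value first ... */
        log->e[log->n].old  = *addr;
        remote_flush(&log->e[log->n], sizeof log->e[log->n]);
        log->n++;
        remote_flush(&log->n, sizeof log->n);
        *addr = val;                     /* ... then apply the new value */
        remote_flush(addr, sizeof *addr);
    }

    static void commit(struct undo_log *log)
    {
        log->n = 0;                      /* truncating the log commits the update */
        remote_flush(&log->n, sizeof log->n);
    }

    /* A process that steals this memory first rolls back any in-flight update. */
    static void recover(struct undo_log *log)
    {
        for (uint32_t i = log->n; i-- > 0; ) {
            *log->e[i].addr = log->e[i].old;
            remote_flush(log->e[i].addr, sizeof *log->e[i].addr);
        }
        commit(log);
    }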
Nevertheless, even when these structures are consistent in remote memory, the metadata required to locate them may be in processor registers, caches, or in stack variables that are not part of remote memory. Applications typically do not have a namespace to locate internal objects and instead rely on the compiler to keep track of them; consequently, when memory is reassigned to a new process, finding the necessary objects from raw memory pages would be a momentous task, akin to searching for a lost treasure.

Our suggestions for this are two-fold. First, applications can use an asynchronous, event-based model that forces them to reason about all critical state and package it into a heap object before yielding (i.e., stack ripping [3]), since that is all that persists across invocations. Second, the application can use a file-system-like namespace for objects [20], or it can distribute, in anticipation of failures, metadata about heap objects (depending on the application, this could be as minimal as the root address of a tree) that acts as a "map" to help locate critical state.

Figure 2: Mean round-trip times for intra- and cross-rack communication.

    Setting                                        Mean RTT (µs)
    Cross-rack (Cloud)                             45
    Intra-rack (eRPC [31])                         2
    Future intra-rack (Mellanox ConnectX-6 [1])    1
Compute elements should be notified about memory failures either asynchronously, using liveness information from a reliable interconnect, or explicitly, in response to accesses on unreliable interconnects. In the latter case, compute elements can receive messages from the controller of the memory element (when specific elements have failed), or rely on timeouts (when the entire memory element is unreachable). Error notifications are propagated back to the application through OS signals (SIGBUS); applications that want to manage faults can register for these signals and trigger a failure-recovery protocol, while legacy applications may safely ignore them.

As memory failures may result in the loss of application state, it is sometimes unclear how an application should leverage failure notifications. To guard against such cases, an application can pre-register a group of processes with the OS that will be informed in case of failures (these processes essentially serve as "emergency contacts"). This group is stored in a per-process forwarding table within the OS. As the OS is local to the compute element, memory failures do not affect the forwarding table; consequently, the application can defer notification to the OS using a syscall which broadcasts the error to the corresponding group. This allows other processes to learn of the failure and respond appropriately, making the compute element a local failure informer [4, 44].

Failures of compute elements are harder to detect, as the absence of accesses to a particular memory element need not be a sign of failure. We propose the addition of a rack-level monitor that periodically verifies the health of compute elements using heartbeats and triggers the appropriate action when failures are detected. Applications can register a group of processes to inform in the case of failures, similar to the groups registered for memory failures; alternatively, they can also register a lightweight failure handler to be run, in an isolated context, at the monitor. While this monitor is a single point of failure and may not detect all failures, we view it as an optimization rather than a replacement for failure detection using end-to-end timeouts at the application.

One might wonder why local failure informers are better than just using application-level timeouts to detect failures, especially given the reliance on timeouts to detect compute and memory failures. The answer is that we can exploit the difference between intra-rack and cross-rack latencies; as we show in Figure 2, this difference is a few orders of magnitude. As compute and memory are located within the same rack, we make the assumption that the Rack MMU achieves comparable latencies. This allows local failure detectors to have more aggressive timeouts and trigger recovery procedures earlier.

The memory interconnect described so far is capable of routing requests between any compute and memory elements within the rack, as well as blocking communication between any such elements, at very low latency. It has enough space to store address mappings for each process, so that accesses from compute elements are transparently routed to the correct memory element; further, it supports dynamic reconfiguration of routes and mappings without requiring any downtime.

While existing research and production hardware satisfies some of these requirements, achieving their composition remains an open problem. Programmable switches, such as the Barefoot Tofino and Cavium XPliant, offer low-latency, reconfigurable routing between compute and memory elements, but are limited in their port counts and memory, restricting their scale. In contrast, Shoal [63] supports high-density racks with hundreds of compute and memory elements, but does not currently offer the low latency, programmability, and reconfigurability required for grant and steal operations.
Applications use Paxos [39] to tolerate failures via the replicated state machine approach [38, 58, 60]: Paxos ensures that different replicas (which are deterministic state machines that implement the application's logic) execute the same commands in the same order, ensuring that all replicas transition through the same sequence of states. If a replica fails, a client can simply issue its requests to a live replica.

Replica failures lead the system into a state of reconfiguration where the old failed replica is removed and a new replica is introduced [15, 40, 41]. This prevents too many failures from accumulating over time and making the system unavailable. Mechanistically, reconfiguration achieves two goals: first, it brings new replicas up to date by having them fetch the latest state from existing replicas or persistent storage [15]. Second, it prevents old replicas that have been excluded from the current configuration (presumably because they have failed) from participating if they come back online.
Detecting failures.
Detecting failures is a challenging proposition in an asynchronous environment due to the difficulty of distinguishing between crashed and slow processes [4, 21]. Consequently, Paxos implementations rely on heartbeats and keep-alives with conservative end-to-end timeouts to ascertain the state of processes. Recent failure detectors [42–44] quickly and reliably detect failures and kickstart recovery mechanisms in asynchronous settings using a combination of local, host-based monitors that track the health of components across the stack, and lethal force: in cases where failures are suspected but cannot be confirmed, these detectors forcibly kill the process. The intuition behind this protocol (called STONITH, or "Shoot the Other Node in the Head") is that unnecessary failures are preferable to uncertainty. (The exact gains are hard to quantify, since network latency is only one out of many factors considered when setting end-to-end timeouts [4].)
The failure independence of DDCs enables new ways to detect and recover from failures in fault-tolerant applications using Paxos. We assume that the replicas of this application run in different racks within the same data center, a reasonable assumption for applications that want greater failure independence without paying the costs of wide-area traffic. Within this deployment, we explore two scenarios: a compute element that loses some or all of its memory elements, and a faulty compute element with functional memory elements.
Dead compute with live memory.
When a replica dies, one could in principle reassign the state machine's memory to another compute element and the system could continue operating unimpeded. Such reassignment effectively reincarnates the old node from the perspective of Paxos, which allows the consensus group to return to full health faster (no need to retrieve the state from a checkpoint or another replica). However, as we discuss in Section 3.2, the developer must ensure that the state machine's transition function preserves memory consistency after crashes.

Should the failure of the compute element be detected faster than the end-to-end timeout of the Paxos group (a likely scenario due to the difference between intra- and cross-rack latencies), the reincarnation can be transparent to the rest of the system. In such cases, a client and other replicas will only observe a connection termination and will attempt to reconnect. Each Paxos replica registers a fast failure handler with the rack monitor that requests the Rack MMU to provision a new replica that can take ownership of the dead replica's memory with the steal operator of Section 3.2. Failures detected by the monitor trigger this handler, while undetected failures are eventually detected by another replica, which must gain consensus, either through a proposal or from a stable leader, before reincarnating the failed instance.

In response to a steal operation, the Rack MMU revokes and reassigns access to the region of memory. Revocation is needed because compute element failures are not always fail-stop and the system must prevent a temporarily unavailable compute element from returning and corrupting state. The ToR switch can redirect cross-rack traffic to the new compute element using OpenFlow rules; further, it can also use these rules to fence the old compute element off from the rest of the network [43].
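A hedged sketch of this recovery path follows. The mem_steal() and register_rack_failure_handler() calls are hypothetical stand-ins for the steal operator and the rack monitor's handler registration, and the handler body assumes the "map" and crash-consistency machinery of Section 3.2.

    /* Sketch of a Paxos replica reincarnating a dead peer by stealing its
     * memory; all interfaces shown are hypothetical stubs. */
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/types.h>

    static int mem_steal(pid_t victim, void *vaddr, size_t len)
    {
        (void)victim; (void)vaddr; (void)len;
        return 0;                        /* stub: Rack MMU revokes and remaps */
    }

    static int register_rack_failure_handler(void (*fn)(pid_t dead))
    {
        (void)fn;
        return 0;                        /* stub: registered with the rack monitor */
    }

    /* Location and size of the dead peer's root object, distributed ahead of
     * time as part of the "map" described in Section 3.2. */
    static void *peer_root_vaddr;
    static size_t peer_region_len;

    static void reincarnate(pid_t dead_replica)
    {
        /* Runs only once the group agrees the peer is dead (via the monitor,
         * a proposal, or a stable leader), so a live replica is never robbed. */
        if (mem_steal(dead_replica, peer_root_vaddr, peer_region_len) != 0) {
            perror("mem_steal");
            return;
        }
        /* Roll back any in-flight update (e.g., the recover() step of the
         * undo-log sketch), adopt the state machine, and rejoin the group. */
    }

    static void setup_reincarnation(void)
    {
        register_rack_failure_handler(reincarnate);
    }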
Dead memory with live compute.
When a compute element flushes its operations to a remote memory element, it is possible for this operation to fail if the memory element is down. Instead of terminating the application right away, as we discuss in Section 3.3, the OS propagates a signal up the stack or forwards the signal to other replicas. This mechanism allows other replicas to detect memory failures more quickly than relying on end-to-end timeouts. Indeed, were the application to be terminated immediately without this notification, other replicas would not know whether the memory remains alive or not, leading to ambiguity as to the type of failure.
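A sketch of the replica-side plumbing for this scenario appears below. SIGBUS delivery on a failed access is standard POSIX behavior, but register_emergency_group() and notify_group() are hypothetical stand-ins for the forwarding-table interface of Section 3.3.

    /* Sketch: forward a memory-failure signal to a pre-registered group of
     * replicas before exiting, so they can reconfigure without waiting for
     * end-to-end timeouts. The group interfaces are hypothetical stubs. */
    #include <signal.h>
    #include <unistd.h>

    static int group_id;

    static int register_emergency_group(const int *members, int n)
    {
        (void)members; (void)n;
        return 1;                        /* stub: returns a group identifier */
    }

    static int notify_group(int gid, int event)
    {
        (void)gid; (void)event;
        return 0;                        /* stub: OS broadcasts to the group */
    }

    static void on_memory_failure(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)info; (void)ctx;
        /* Application state may be gone; defer to the local OS, which still
         * holds the forwarding table, then terminate. */
        notify_group(group_id, 1);
        _exit(1);
    }

    static void install_failure_informer(const int *replicas, int n)
    {
        struct sigaction sa = { .sa_sigaction = on_memory_failure,
                                .sa_flags = SA_SIGINFO };
        group_id = register_emergency_group(replicas, n);
        sigaction(SIGBUS, &sa, NULL);
    }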
In-memory data parallel frameworks such as data flow and graph processing systems [19, 26, 29, 51, 56, 69] express computations as a series of nodes, where each node performs an operation on its inputs. In these systems, it is often necessary to move data between nodes so that the output of a node may be used as the input to the next node. For example, in MapReduce [19], the output of mappers is shuffled and sent to reducers that operate on a chunk of related data.

Applications represent large compute jobs as a set of smaller tasks, and distribute these tasks across nodes using the data parallel framework. While completing these jobs requires all the individual tasks to finish, tasks are often unexpectedly delayed due to factors such as load imbalances and workload skews, failures, and hardware defects. As such stragglers hold up the entire job and significantly impact completion rates, frameworks employ a variety of mitigation techniques including blacklisting slow machines, speculatively timing out and rerunning tasks [7, 19], and even proactively launching multiple replicas of the same task [6].

We believe that executing data parallel systems transparently on a DDC would leave performance on the table, and argue for the use of the operators described in Section 3 to speed up data movement and straggler mitigation.
Faster data movement.
Deploying an unmodified data parallel framework on a transparent DDC results in unnecessary data movement between computational nodes; for example, Figure 3 shows how transferring data between nodes forces 3 network and memory RTTs. First, the source processor fetches data from its remote memory over the memory interconnect. Then, the source processor sends this data over the network to the destination processor via the ToR switch. Finally, the destination processor forwards the data via the memory interconnect to the remote memory for storage.

[Figure 3: Data transfer between nodes A and B on a transparent DDC: (1) A loads the data from its remote memory, (2) A sends the data to B through the ToR switch, and (3) B stores the data back to its own remote memory via the Rack MMU.]

Meanwhile, data transfer is often a bottleneck in these systems. As an example, McSherry and Schwarzkopf [53] demonstrate that Timely Dataflow [52] achieves up to 3× higher throughput when provided with a faster network. Memory grants convert the 3 RTTs for data transfer into a single RTT over the memory interconnect. The source, A, would call grant on the memory pages storing the data that it plans to send to the destination, B, and would indicate B as the recipient of these pages; the Rack MMU would make the necessary adjustments to page permissions before notifying B that the pages are ready to be mapped into its local address space. Here, all the data transfer consists of small control messages that occur via the Rack MMU and bypasses the slower ToR.
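The contrast can be sketched as two functions on the mapper's side; net_send() and mem_grant() below are illustrative stand-ins for a conventional socket send path and the hypothetical grant interface sketched earlier, respectively.

    /* Sketch contrasting a copy-based shuffle transfer with a grant-based one. */
    #include <stddef.h>
    #include <sys/types.h>

    static int net_send(pid_t dest, const void *buf, size_t len)
    {
        (void)dest; (void)buf; (void)len;
        return 0;                        /* stub: copies the data over the ToR */
    }

    static int mem_grant(void *buf, size_t len, pid_t dest)
    {
        (void)buf; (void)len; (void)dest;
        return 0;                        /* stub: Rack MMU remaps the pages */
    }

    /* Conventional path: the payload crosses the memory interconnect twice
     * and the ToR switch once (roughly the 3 RTTs of Figure 3). */
    static int shuffle_copy(pid_t reducer, void *partition, size_t len)
    {
        return net_send(reducer, partition, len);
    }

    /* Grant path: only mappings move; the partition is gifted once it is
     * fully written and must not be touched by the mapper afterwards. */
    static int shuffle_grant(pid_t reducer, void *partition, size_t len)
    {
        return mem_grant(partition, len, reducer);
    }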
Dealing with stragglers.
Straggler nodes in data parallel systems can have their memory forcibly reassigned to another node by having the job orchestrator steal the appropriate memory pages. The recipient node can resume and complete the half-completed computation, rather than starting from scratch. In case of failures, as with the Paxos example (§4.1), the failure notification interfaces can inform the job orchestrator, allowing it to relaunch the task more quickly than relying on an end-to-end timeout. If only the compute elements of the node have failed, the newly launched task can resume computation from where it had stalled.
We are by no means the first to observe either the ability to reassign memory across processes or the failure independence of resources in DDCs. While recent works on disaggregated systems have advocated for transparent solutions (RAID-style [59] memory replication in LegoOS [61], and replication and switch-based failover by Carbonari and Beschastnikh [14]), this has largely been driven by the desire to benefit legacy applications. Carbonari and Beschastnikh also observe that applications could benefit from information about failures but do not go further; we build on that observation and look at how applications that eschew transparency could use this information. More specifically, we borrow ideas from systems for single-host IPC [9, 10, 46–48], distributed shared memory [16, 35, 36, 45], and accelerated RPCs [30, 31] for fast, zero-copy data transfer, and from reliable failure informers [4, 42–44] for faster recovery.

Disaggregation represents a fundamental change in how hardware resources are built, provisioned, and presented to applications for consumption. As befitting an operator-driven initiative, early research has focused on changes necessary within the hardware rather than the application. But application developers are not averse to major changes in their programming model as long as they receive commensurate benefits; in fact, as witnessed by the prevalence of MapReduce, good models can help applications transition more smoothly. Reasoning about memory grants and steals is a significant departure from existing programming models, but there is encouraging precedent: the Rust programming language successfully introduced ownership and move semantics to guarantee memory safety and data race freedom. We believe that similar abstractions could be useful in our context.
Acknowledgements
We thank Andrew Baumann, Natacha Crooks, Joshua Leners, Youngjin Kwon, Srinath Setty, and Nathan Taylor for feedback and discussions that improved this paper.
References

[1] ConnectX-6 single/dual-port adapter supporting 200Gb/s with VPI.
[2] R. Achermann, C. Dalton, P. Faraboschi, M. Hoffmann, D. Milojicic, G. Ndu, A. Richardson, T. Roscoe, A. L. Shaw, and R. N. M. Watson. Separating Translation from Protection in Address Spaces with Dynamic Remapping. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS), 2017.
[3] A. Adya, J. Howell, M. Theimer, W. J. Bolosky, and J. R. Douceur. Cooperative Task Management Without Manual Stack Management. 2002.
[4] M. K. Aguilera and M. Walfish. No time for asynchrony. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS), 2009.
[5] Amazon. Placement Groups. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html.
[6] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Effective Straggler Mitigation: Attack of the Clones. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2013.
[7] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the Outliers in Map-reduce Clusters Using Mantri. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010.
[8] K. Asanović. FireBox: A Hardware Building Block for 2020 Warehouse-Scale Computers. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), 2014.
[9] B. Bershad, T. Anderson, E. Lazowska, and H. Levy. Lightweight Remote Procedure Call. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 1989.
[10] B. N. Bershad, T. E. Anderson, E. D. Lazowska, and H. M. Levy. User-level Interprocess Communication for Shared Memory Multiprocessors. ACM Transactions on Computer Systems (TOCS), 9(2), 1991.
[11] M. S. Birrittella, M. Debbage, R. Huggahalli, J. Kunz, T. Lovett, T. Rimmer, K. D. Underwood, and R. C. Zak. Intel Omni-path Architecture: Enabling Scalable, High Performance Fabrics. In Proceedings of the 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects (HOTI), 2015.
[12] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.
[13] BusinessWire. Liqid Fulfils the Promise of Rack-Scale Composable Infrastructure with General Availability. 2017.
[14] A. Carbonari and I. Beschastnikh. Tolerating Faults in Disaggregated Datacenters. In Proceedings of the ACM Workshop on Hot Topics in Networks (HotNets), 2017.
[15] T. Chandra, R. Griesemer, and J. Redstone. Paxos Made Live: An Engineering Perspective. In Proceedings of the Symposium on Principles of Distributed Computing (PODC), 2007.
[16] J. S. Chase, H. M. Levy, M. J. Feeley, and E. D. Lazowska. Sharing and Protection in a Single-address-space Operating System. ACM Transactions on Computer Systems (TOCS), 12(4), 1994.
[17] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson. NV-Heaps: Making Persistent Objects Fast and Safe with Next-generation, Non-volatile Memories. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011.
[18] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee. Better I/O Through Byte-addressable, Persistent Memory. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2009.
[19] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2004.
[20] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson. System Software for Persistent Memory. In Proceedings of the ACM European Conference on Computer Systems (EuroSys), 2014.
[21] M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM, 32(2), 1985.
[22] S. Foskett. Liqid Takes Composable Infrastructure to a New Level. https://gestaltit.com/exclusive/stephen/liqid-takes-composable-infrastructure-to-a-new-level/, 2018.
[23] G. R. Ganger, M. K. McKusick, C. A. N. Soules, and Y. N. Patt. Soft Updates: A Solution to the Metadata Update Problem in File Systems. ACM Transactions on Computer Systems (TOCS), 18(2), 2000.
[24] P. X. Gao, A. Narayan, S. Karandikar, J. Carreira, S. Han, R. Agarwal, S. Ratnasamy, and S. Shenker. Network Requirements for Resource Disaggregation. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.
[25] Gen-Z core specification, revision 1.0.
[26] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012.
[27] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K. G. Shin. Efficient Memory Disaggregation with INFINISWAP. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2017.
[28] N. Herman, J. P. Inala, Y. Huang, L. Tsai, E. Kohler, B. Liskov, and L. Shrira. Type-aware transactions for faster concurrent code. In Proceedings of the ACM European Conference on Computer Systems (EuroSys), 2016.
[29] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the ACM European Conference on Computer Systems (EuroSys), 2007.
[30] A. Kalia, M. Kaminsky, and D. G. Andersen. FaSST: Fast, Scalable and Simple Distributed Transactions with Two-sided (RDMA) Datagram RPCs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.
[31] A. Kalia, M. Kaminsky, and D. G. Andersen. Datacenter RPCs can be General and Fast. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2019.
[32] K. Katrinis, D. Syrivelis, D. Pnevmatikatos, G. Zervas, D. Theodoropoulos, I. Koutsopoulos, K. Hasharoni, D. Raho, C. Pinto, F. Espina, S. López-Buedo, Q. Chen, M. Nemirovsky, D. Roca, H. Klos, and T. Berends. Rack-scale Disaggregated Cloud Data Centers: The dReDBox Project Vision. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), 2016.
[33] R. H. Katz. High Performance Network and Channel-Based Storage. Technical Report UCB/CSD-91-650, EECS Department, University of California, Berkeley, Sep 1991.
[34] K. Keeton. The Machine: An Architecture for Memory-centric Computing. In Proceedings of the Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2015.
[35] P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the USENIX Winter Technical Conference, WTEC'94, 1994.
[36] L. Kontothanassis, R. Stets, G. Hunt, U. Rencuzogullari, G. Altekar, S. Dwarkadas, and M. L. Scott. Shared Memory Computing on Clusters with Symmetric Multiprocessors and System Area Networks. ACM Transactions on Computer Systems (TOCS), 23(3), 2005.
[37] J. Kyathsandra and E. Dahlen. Intel Rack Scale Architecture Overview. http://presentations.interop.com/events/las-vegas/2013/free-sessions---keynote-presentations/download/463, 2013.
[38] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7), 1978.
[39] L. Lamport. The Part-Time Parliament. ACM Transactions on Computer Systems (TOCS), 16(2), 1998.
[40] L. Lamport, D. Malkhi, and L. Zhou. Vertical Paxos and Primary-Backup Replication. In Proceedings of the Symposium on Principles of Distributed Computing (PODC), 2009.
[41] L. Lamport, D. Malkhi, and L. Zhou. Reconfiguring a state machine. ACM SIGACT News, 41(1), 2010.
[42] J. B. Leners, T. Gupta, M. K. Aguilera, and M. Walfish. Improving Availability in Distributed Systems with Failure Informers. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2013.
[43] J. B. Leners, T. Gupta, M. K. Aguilera, and M. Walfish. Taming uncertainty in distributed systems with help from the network. In Proceedings of the ACM European Conference on Computer Systems (EuroSys), 2015.
[44] J. B. Leners, H. Wu, W.-L. Hung, M. K. Aguilera, and M. Walfish. Detecting failures in distributed systems with the FALCON spy network. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2011.
[45] K. Li and P. Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems (TOCS), 7(4), 1989.
[46] J. Liedtke. Improving IPC by Kernel Design. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 1993.
[47] J. Liedtke. On Micro-kernel Construction. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 1995.
[48] J. Liedtke. Toward Real Microkernels. Communications of the ACM, 39(9), 1996.
[49] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch. Disaggregated Memory for Expansion and Sharing in Blade Servers. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2009.
[50] K. Lim, Y. Turner, J. Chang, J. Renato Santos, and P. Ranganathan. Disaggregated Memory Benefits for Server Consolidation. Technical Report HPL-2011-31, HP Laboratories, 2011.
[51] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD Conference, 2010.
[52] F. McSherry. Timely dataflow. https://github.com/TimelyDataflow/timely-dataflow.
[53] F. McSherry and M. Schwarzkopf. The impact of fast networks on graph analytics, part 1. 2015.
[54] A. Memaripour, A. Badam, A. Phanishayee, Y. Zhou, R. Alagappan, K. Strauss, and S. Swanson. Atomic In-place Updates for Non-volatile Main Memories with Kamino-Tx. In Proceedings of the ACM European Conference on Computer Systems (EuroSys), 2017.
[55] Microsoft. Regions and availability for virtual machines in Azure. https://docs.microsoft.com/en-us/azure/virtual-machines/windows/regions-and-availability.
[56] D. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: A Timely Dataflow System. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2013.
[57] S. Novaković, A. Daglis, E. Bugnion, B. Falsafi, and B. Grot. Scale-out NUMA. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
[58] B. M. Oki and B. H. Liskov. Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. In Proceedings of the Symposium on Principles of Distributed Computing (PODC), 1988.
[59] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the ACM SIGMOD Conference, 1988.
[60] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys (CSUR), 22(4), 1990.
[61] Y. Shan, Y. Huang, Y. Chen, and Y. Zhang. LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018.
[62] N. Shavit and D. Touitou. Software transactional memory. In Proceedings of the Symposium on Principles of Distributed Computing (PODC), 1995.
[63] V. Shrivastav, A. Valadarsky, H. Ballani, P. Costa, K. S. Lee, H. Wang, R. Agarwal, and H. Weatherspoon. Shoal: A Network Architecture for Disaggregated Racks. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2019.
[64] C. A. Thekkath, T. Mann, and E. K. Lee. Frangipani: a scalable distributed file system. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 1997.
[65] S. Venkataraman, N. Tolia, P. Ranganathan, and R. H. Campbell. Consistent and Durable Data Structures for Non-volatile Byte-addressable Memory. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), 2011.
[66] H. Volos, A. J. Tack, and M. M. Swift. Mnemosyne: Lightweight Persistent Memory. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011.
[67] M. Wu and W. Zwaenepoel. eNVy: A Non-Volatile, Main Memory Storage System. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1994.
[68] J. Xu and S. Swanson. NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), 2016.
[69] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012.