MigrOS: Transparent Operating Systems Live Migration Support for Containerised RDMA-applications
Maksym Planeta (TU Dresden), Jan Bierbaum (TU Dresden), Leo Sahaya Daphne Antony (AMOLF), Torsten Hoefler (ETH Zürich), Hermann Härtig (TU Dresden)
Abstract
Major data centre providers are introducing RDMA-based networks for their tenants, as well as for operating their underlying infrastructure. In comparison to traditional socket-based network stacks, RDMA-based networks offer higher throughput, lower latency, and reduced CPU overhead. However, RDMA networks make transparent checkpoint and migration operations much more difficult. The difficulties arise because RDMA network architectures remove the OS from the critical path of communication. As a result, the OS loses control over active RDMA network connections, which is required for live migration of RDMA-applications. This paper presents MigrOS, an OS-level architecture for transparent live migration of RDMA-applications. MigrOS offers changes at the OS software level and small changes to the RDMA communication protocol. As a proof of concept, we integrate the proposed changes into SoftRoCE, an open-source kernel-level implementation of an RDMA communication protocol. We designed these changes to introduce no runtime overhead, apart from the actual migration costs. MigrOS allows seamless live migration of applications in data centre settings. It also allows HPC clusters to explore new scheduling strategies, which currently do not consider migration as an option to reallocate resources.
1 Introduction

Cloud computing is undergoing a phase of rapidly increasing network performance. This trend implies higher requirements on the data and packet processing rate and results in the adoption of high-performance network stacks [10, 12, 13, 23, 64, 72, 75]. RDMA network architectures address this demand by offloading packet processing onto specialised circuitry of the network interface controllers (RDMA NICs). These RDMA NICs process packets much faster than CPUs. User applications communicate directly with the NICs to send and receive messages using specialised RDMA APIs, like IB verbs. This direct access minimises network latency, which has made RDMA networks ubiquitous in HPC [37, 45, 50, 67] and increasingly common in the data centre context [30, 56, 66]. As a result, major data centre providers already offer RDMA connectivity for end-users [1, 2].

Similarly, containers have become ubiquitous for lightweight virtualisation in data centre settings. Containerised applications do not depend on the software stack of the host, thus greatly simplifying distributed application deployment and administration. However, RDMA networks and containerisation are at odds when employed together: the former tries to bring applications and the underlying hardware "closer" to each other, whereas the latter facilitates the opposite. This paper, in particular, addresses the issue of migratability of containerised RDMA-applications through OS-level techniques.

The ability to live-migrate applications has long been available for virtual machines (VMs) and is widely appreciated in cloud computing [25, 28, 38, 59, 63]. We expect live migration to become even more popular with the growth of disaggregated [26, 33], serverless [73], and fog computing [61]. In contrast to VMs, containerised applications share the kernel, and thus their state, with the host system. In general, it is still possible to extract the relevant container state from the kernel and restore it on another host later on.
This recoverable state includes open TCP connections, shell sessions, and file locks [9, 42]. However, the state of RDMA communication channels is not recoverable by existing systems, and hence applications using RDMA cannot be checkpointed or migrated.

To outline the conceptual difficulties involved in saving the state of RDMA communication channels, we compare a traditional TCP/IP-based network stack and the IB verbs API, the most common low-level API for RDMA networks.
Figure 1: Traditional (left) and RDMA (right) network stacks. In traditional networks, the user application triggers the NIC via the kernel (1). After receiving a packet, the NIC notifies the application back through the kernel (2). In RDMA networks, the application communicates directly with the NIC (3) and vice versa (4) without kernel intervention. Traditional networks require a message copy between application buffers and NIC-accessible kernel buffers. RDMA NICs can access the message buffer in the application memory directly.

Figure 1 illustrates the comparison. First, with a traditional network stack, the kernel fully controls when the communication happens: applications need to perform system calls to send or receive a message. In IB verbs, because of direct communication between the NIC and the application, the OS has no communication interception points, apart from tearing down the connection. Although the OS can stop a process from sending further messages, the NIC may still silently change the application state. Second, part of the connection state resides at the NIC and is inaccessible to the OS. Creating a consistent checkpoint is impossible in this situation.

In this paper, we propose MigrOS, an architecture enabling transparent live migration of containerised RDMA-applications at the OS level. We identify the missing hardware capabilities of existing RDMA-enabled NICs required for transparent live migration. We augment the underlying RoCEv2 communication protocol to update the physical addresses of a migrated container transparently. We modify a software RoCEv2 implementation to show that the required protocol changes are small and do not affect the critical path of the communication. Finally, we demonstrate an end-to-end live migration flow of containerised RDMA-applications.
2 Background

This section gives a short introduction to containerisation and RDMA networking. We further outline live migration and how RDMA networking obstructs this process.
Figure 2: Primitives of the IB verbs library. Each queue pair (QP) comprises a send and a receive queue and has multiple IDs; node-global IDs (grey) are shared by all QPs on the same node.
In Linux, processes and process trees can be logically separated from the rest of the system using namespace isolation. Namespaces allow process creation with an isolated view on the file system, network devices, users, etc. Container runtimes leverage namespaces and other low-level kernel mechanisms [3, 52] to create a complete system view without external dependencies. Considering their close relation, we use the terms container and process interchangeably in this paper. A distributed application may comprise multiple containers across a network: a Spark application, for example, can run the master and each worker in an isolated container, and an MPI [29] application can containerise each rank.

The IB verbs API is today's de-facto standard for high-performance RDMA communication. It enables applications to achieve high throughput and low latency by accessing the NIC directly (OS-bypass), avoiding unnecessary memory movement (zero-copy), and delegating packet processing to the NIC (offloading). Figure 2 shows the IB verbs objects involved in communication. Memory regions (MRs) represent pinned memory shared between the application and the NIC. Queue pairs (QPs), comprising a send queue (SQ) and a receive queue (RQ), represent connections. To reduce the memory footprint, multiple QPs can replace their individual RQs with a single shared receive queue (SRQ). Completion queues (CQs) inform the application about completed communication requests. A protection domain (PD) groups all these IB verbs objects together and represents the process address space to the NIC.

To establish a connection, an application needs to exchange the following addressing information: memory protection keys to enable access to remote MRs, the global vendor-assigned address (GUID), the routable address (GID), the non-routable address (LID), and the node-specific QP number (QPN). This exchange happens over another network, like TCP/IP. During the connection setup, each QP is configured for a specific type of service. We implement MigrOS for the Reliable Connection (RC) type of service, which provides reliable in-order message delivery between two communication partners.

The application sends or receives messages by posting send requests (SRs) or receive requests (RRs) to a QP. These requests describe the message structure and refer to memory buffers within previously created MRs. The application checks for the completion of outstanding work requests by polling the CQ for work completions (WCs).

There are various implementations of the IB verbs API for different hardware, including InfiniBand [12], iWarp [31], and RoCE [7, 8]. InfiniBand is generally the fastest among these but requires specialised NICs and switches. RoCE and iWarp provide RDMA capabilities in Ethernet networks. They still require hardware support in the NIC but do not depend on specialised switches and thus make it easier to incorporate RDMA into an existing infrastructure. This work focuses on RoCEv2, a version of the RoCE protocol.

To enable RDMA-application migration, it is important to consider the following challenges:

1. User applications have to use physical network addresses (QPN, LID, GID, GUID), and the IB verbs API does not specify a way for virtualising these.
2. The NIC can write to any memory it shares with the application without the OS noticing.
3. The OS cannot instruct the NIC to pause the communication, except by abruptly terminating it.
4. User applications are not prepared for a connection changing its destination address and going into an erroneous state. As a result, the applications will terminate abruptly.
5. Although the OS is aware of all IB verbs objects created by the application, it does not control the whole state of these objects, as the state partially resides on the NIC.

We address all of these challenges in Section 3.
CRIU is a software framework for transparently checkpointing and restoring the state of Linux processes [9]. It enables live migration, snapshots, and remote debugging of processes, process trees, and containers. To extract the user-space application state, CRIU uses conventional debugging mechanisms [5, 6]. However, to extract the state of process-specific kernel objects, CRIU depends on special Linux kernel interfaces.

To restore a process, CRIU creates a new process that initially runs the CRIU executable, which reads the image of the target process and recreates all OS objects on its behalf. This approach allows CRIU to utilise the available OS mechanisms to run most of the recovery without the need for significant kernel modifications. Finally, CRIU removes any traces of itself from the process.

CRIU is also capable of restoring the state of TCP connections. This feature is crucial for the live migration of distributed applications [42]. The Linux kernel introduced a new TCP connection state, TCP_REPAIR, for that purpose. In this state, a user-level process can change the state of send and receive message queues, get and set message sequence numbers and timestamps, or open and close a connection without notifying the other side.

As of now, if CRIU attempts to checkpoint an RDMA-application, it will detect IB verbs objects and refuse to proceed. Discarding IB verbs objects in the naive hope that the application will be able to recover is failure-prone: once an application runs into an erroneous IB verbs object, in most cases, the application will hang or crash. Thus, we provide explicit support for IB verbs objects in CRIU (see Section 3).
3 Design

MigrOS is based on modern container runtimes and reuses much of the existing infrastructure with minimal changes. Most importantly, we require no modification of the software running inside the container (see Section 3.1).

Existing container runtimes rely on CRIU for checkpoint/restore functionality [3, 15, 16, 52]. Therefore, it is sufficient to extend CRIU with IB verbs support to checkpoint and restore containerised RDMA-applications. Section 3.2 describes our modifications to the IB verbs API and how CRIU uses them. We also add two new QP states to enable CRIU to create consistent checkpoints (see Section 3.3). Finally, Section 3.4 describes minimal changes to the packet-level RoCEv2 protocol to ensure that each QP maintains correct information about the location of its partner QP.
Typically, access to the RDMA network is hidden deep inside the software stack. Figure 3 gives an example of a containerised RDMA-application. The container image comes with all library dependencies, like the libc, but not the kernel-level drivers. The application uses a stack of communication libraries, comprising Open MPI [29], Open UCX [67] (not shown), and IB verbs. Normally, to migrate, the container runtime would require the application inside the container to terminate and later recover all IB verbs objects. This removes transparency from live migration.

Figure 3: Container migration architecture. Software inside the container, including the user-level driver (ibv-cont, grey), is unmodified. The host runs CRIU as well as kernel (ibv-kern) and user (ibv-migr, green) level drivers modified for migratability.

MigrOS runs alongside the container and comprises a container runtime (e.g., Docker [52]), CRIU, and the IB verbs library. We modified CRIU to make it aware of IB verbs, so that it can successfully save IB verbs objects when it traverses the kernel objects belonging to the container. We extend the IB verbs library (m-ibv-user and ibv-kern) to enable serialisation and deserialisation of the IB verbs objects. Importantly, the API extension is backwards compatible with the IB verbs library running inside the container. Thus, both m-ibv-user and ibv-user use the same kernel version of IB verbs. MigrOS requires no modifications of any software inside the container.
To enable checkpoint/restore for processes and containers, we extend the IB verbs API with two new calls (see Listing 1): ibv_dump_context and ibv_restore_object. The dump call returns a dump of all IB verbs objects within a specific IB verbs context. The dumping runs almost entirely inside the kernel for two reasons. First, some links between the objects are only visible at the kernel level. Second, to get a consistent checkpoint, it is crucial to ensure an atomic dump.

Of course, the existing IB verbs API allows the creation of new objects. However, it is not expressive enough for restoring objects. For example, when restoring a completion queue (CQ), the current API does not allow specifying the address of the shared memory region for this queue. Also, it is not possible to recreate a queue pair (QP) directly in the Ready-to-Send (RTS) state. Instead, the QP has to traverse all intermediate states before reaching RTS.

    int ibv_dump_context(struct ibv_context *ctx,
                         int *count, void *dump,
                         size_t length);

    int ibv_restore_object(struct ibv_context *ctx,
                           void **object,
                           int object_type, int cmd,
                           void *args, size_t length);

Listing 1: Checkpoint/restart extension for the IB verbs API. ibv_dump_context creates an image of the IB verbs context ctx with count objects and stores it in the caller-provided memory region dump of size length. ibv_restore_object executes the restore command cmd for an individual object (QP, CQ, etc.) of type object_type. The call expects a list of arguments specific to the object type and recovery command. args is an opaque pointer to the argument buffer of size length. A pointer to the restored object is returned via object.

We introduce the fine-grained ibv_restore_object call to restore IB verbs objects one by one, for situations when the existing API is not sufficient. In turn, MigrOS uses the extended IB verbs API to save and restore the IB verbs state of applications. During recovery, MigrOS reads the object dump and applies a specific recovery procedure for each object type. For example, to recover a QP, MigrOS calls ibv_restore_object with the command CREATE and progresses the QP through the Init, RTR, and RTS states using ibv_modify_qp. The contents of memory regions or QP buffers are recovered using standard file and memory operations. Finally, MigrOS brings the queue to its original state using the REFILL command of the restore call.
Before communication can commence, an application establishes a connection, bringing a QP through a sequence of states (depicted in Figure 4). Each newly-created QP is in the Reset (R) state. To send and receive messages, a QP must reach its final Ready-to-Send (RTS) state. Before reaching RTS, the QP traverses the Init and Ready-to-Receive (RTR) states. In case of an error, the QP goes into one of the error states: Error (E) or Send Queue Error (SQE). In the Send Queue Drain (SQD) state, a QP does not accept new send requests. Apart from that, SQD is equivalent to the RTS state.

Figure 4: QP state diagram. Normal states and state transitions are controlled by the user application. A QP is put into the error states either by the OS or the NIC. The new states are used for connection migration.

Figure 5: To migrate from one host to another, the state of the QP changes from RTS (R) to Stopped (S). Finally, the QP is destroyed (D). If the partner QP on its host sends a message during migration, this QP gets paused (P). Both QPs resume normal operation once the migration is complete.

In addition to the existing states, we add two new states invisible to the user application (see Figure 4): Stopped (S) and Paused (P). When the kernel executes ibv_dump_context, all QPs of the specified context go into the Stopped state. A stopped QP does not send or receive any messages. The QPs remain stopped until they are destroyed together with the checkpointed process.

A QP becomes Paused when it learns that its destination QP has become Stopped (see Section 3.4). A paused QP does not send messages, but also has no other QP to receive messages from. A QP remains paused until the migrated destination QP is restored at a new location and sends a message carrying the new location address. The paused QP stores the new location of the destination QP and returns to the RTS state. After that, the communication can continue.
There are two considerations when migrating a connection. First, during the migration, the communication partner of the migrating container must not confuse migration with a network failure. Second, once the migration is complete, all partners of the migrated communication node need to learn its new address.

We address the first issue by extending RoCEv2 with a connection migration protocol, which is active during and after migration (see Figure 5). This protocol is part of the low-level packet transmission protocol and is typically implemented entirely within the NIC. We add a new negative acknowledgement type, NAK_STOPPED. If a stopped QP receives a packet, it replies with NAK_STOPPED and drops the packet. When the partner QP receives this negative acknowledgement, it transitions to the Paused (P) state and refrains from sending further packets until receiving a resume message.

After migration completes, the new host of the migrated process restores all QPs to their original state. Once a QP reaches the RTS state, the new host executes the REFILL command. This command restores the driver-specific internal QP state and sends a newly introduced resume message to the partner QP. Resume messages are sent unconditionally, even if the partner QP was not paused before. This way, we also address the second issue: the recipient of the resume message updates its internal address information to point to the new location of the migrated QP, i.e., the source address of the resume message.

Each pause and resume message carries source and destination information. Thus, if multiple QPs migrate at the same time, there can be no confusion about which QPs must be paused or resumed. If at any point the migration process fails, the paused QPs will remain stuck and will not resume communication. This scenario is completely analogous to a failure during a TCP connection migration. In both cases, MigrOS is responsible for cleaning up the resources.
4 Implementation

To provide transparent live migration, MigrOS incorporates changes to CRIU, the IB verbs library, an RDMA device driver (SoftRoCE), and the packet-level RoCEv2 protocol. To migrate an application, the container runtime invokes CRIU, which checkpoints the target container. CRIU stops active RDMA connections and saves the state of IB verbs objects (see Section 4.1). SoftRoCE then pauses communication using our extensions to the packet-level protocol. After transferring the checkpoint to the destination node, the container runtime at that node invokes CRIU to recover the IB verbs objects and restores the application. SoftRoCE then resumes all paused communication to complete the migration process.

SoftRoCE is a Linux kernel-level software implementation (not an emulation [48]) of the RoCEv2 protocol [8]. RoCEv2 runs RDMA communication by tunnelling InfiniBand packets through a well-known UDP port. In contrast to other RDMA device drivers, SoftRoCE allows the OS to inspect, modify, and control the state of IB verbs objects completely.

As a performance-critical component of RDMA communication, RoCEv2 usually runs in NIC hardware, so changes to the protocol require hardware changes. We implement MigrOS with a focus on minimising these protocol changes. The key part of MigrOS is the addition of connection migration capabilities to the existing RoCEv2 protocol (see Section 4.2).
State extraction begins when CRIU discovers that its target process has opened an IB verbs device. We modified CRIU to use the API presented in Section 3.2 to extract the state of all available IB verbs objects. CRIU stores this state together with other process data in an image. Later, CRIU recovers the image on another node using the new API.

When CRIU recovers the MRs and QPs of the migrated application, the recovered objects must maintain their original unique identifiers. These identifiers are system-global and assigned by the NIC (in our case, the SoftRoCE driver) in a sequential manner. We augmented the SoftRoCE driver to expose the IDs of the last assigned MR and QP to MigrOS in userspace. These IDs are the memory region number (MRN) and the queue pair number (QPN), respectively. Before recreating an MR or QP, CRIU configures the last ID appropriately. If no other MR or QP occupies this ID, the newly created object will maintain the original ID. This approach is analogous to the way CRIU maintains the process ID of a restored process using the ns_last_pid mechanism in Linux, which exposes the last process ID assigned by the kernel.

It is possible for some other process to occupy the MRN or QPN which CRIU wants to restore. Two processes cannot use the same MRN or QPN on the same node, resulting in a conflict. In the current scheme, we avoid these conflicts by partitioning QP and MR addresses globally among all nodes in the system before application startup. CRIU faces the very same problem with process ID collisions, which has only been solved with the introduction of process ID namespaces. To remedy the collision problem for IB verbs objects, a similar namespace-based mechanism would be required. We leave this issue for future work.

Additionally, recovered MRs have to maintain their original memory protection keys. The protection keys are pseudo-random numbers provided by the NIC and are used by a remote communication partner when sending a packet.
An RDMA operation succeeds only if the
Figure 6: Resuming a connection in SoftRoCE. Packets 8 and 9 are yet to be processed by the requester. Packets 5-7 are yet to be acknowledged. Packet 4 is already acknowledged. QP b expects packet 7 next. The resume packet carries the PSN of the first unacknowledged packet. QP b replies with an acknowledgement of the last received packet.
Similarly, the responder generatesa WC after receiving all packets of an RR.After migration, when the recovered QP a is ready to6ommunicate again, it sends a resume message to QP b with the new address. This way, QP b learns the newlocation of QP a . Receiving this resume message, the re-sponder of QP b replies with an acknowledgement of thelast successfully received packet. If some packets werelost during the migration, the next PSN at the respon-der of QP b is smaller than the next PSN at the requesterof QP a . The difference corresponds to the lost packets,which must be retransmitted. Simultaneously, the re-quester of QP b can already start sending messages. Atthis point, the connection between QP a and QP b is fullyrecovered.The presented protocol ensures that both QPs recoverthe connection without losing packets irrecoverably. Ifpackets were lost during migration, the QPs can de-termine which packets were lost and retransmit them.This retransmission is part of the normal RoCEv2 pro-tocol. The whole connection migration protocol runstransparently for the user applications. We evaluate MigrOS from three main aspects. First,we analyse the implementation effort, with a specific fo-cus on changes to the RoCEv2 protocol. Second, westudy the overhead of adding migration capability, out-side of the migration phase. Third, we estimate thefine-grained cost of migration for individual IB verbsobjects, as well as the full latency of migration in real-istic RDMA-applications.For most experiments, we use a system with twomachines: Each machine is equipped with an Intel i7-4790 CPU, 16 GiB RAM, an on-board Intel 1 Gb Eth-ernet adapter, a Mellanox ConnectX-3 VPI adapter,and a Mellanox Connect-IB 56 Gb adapter. The Mel-lanox VPI adapters are set to 40 Gb Ethernet mode andconnected to a Cisco C93128TX 40 Gb Ethernet switch.The SoftRoCE driver communicates over this adapter.The machines run Debian 11 with a custom Linux 5.7-based kernel. 
We refer to this setup as local. We conduct further measurements on a cluster comprising nodes with two-socket Intel E5-2680 v3 CPUs and Connect-IB 56 Gb NICs deployed by Bull. We refer to this setup as cluster. Two nodes similar to those in the cluster were used in a local setup and equipped with Mellanox ConnectX-3 VPI NICs configured to 56 Gb InfiniBand mode.
MigrOS requires few changes to the low-level RoCEv2 protocol, as shown in Table 1. We count newly added or modified source lines of code in different components of the software stack. Only around 10% of all the changes apply to the kernel-level SoftRoCE driver. These changes mostly focus on saving and restoring the state of IB verbs objects. We counted separately the changes to the requester, responder, and completer QP tasks responsible for the active phase of communication (see Figure 6). Such QP tasks are often implemented in NIC hardware in other RDMA implementations. Therefore, it is important to minimise changes specifically to the QP tasks, as changes there directly translate to hardware changes. In our implementation, changes to the QP tasks accounted for only around 6% of the overall changes.

Table 1: Development effort in SLOC. We specifically show the magnitude of changes done to the QP tasks (see Figure 6).

  Level  | Component | Original |     Δ
  Kernel | IB verbs  |   30 565 |   719
  Kernel | SoftRoCE  |    9 446 |   872
  Kernel | QP tasks  |    1 112 |   249
  User   | IB verbs  |   12 431 |   339
  User   | SoftRoCE  |    1 004 |   332
  User   | CRIU      |   61 616 | 1 845
  Total  |           |          | 4 137

Table 2: Additional features implemented in the kernel-level SoftRoCE driver to enable recovery of IB verbs objects. We provide the size each object occupies in the dump.

  Object    | Features required           | State (bytes)
  PD        | None                        |  12
  MR        | Set MR keys and MRN         |  48
  CQ        | Set ring buffer state       |  64
  SRQ       | Set ring buffer state       |  68
  QP        | + QP tasks state, set QPN   | 271
  QP w/ SRQ | + Current WQE state         | 823

We used gprof to record the coverage of the connection migration support code outside of the migration phase. Out of all the changes made to the QP tasks, only 28 lines were touched while the application communication was active. Among them, 3 lines are variable assignments, one is an unconditional jump, and the rest are newly introduced if-else conditions that occur at most once per packet sent or received.
The rest of the code changes to the QP tasks run only during the connection migration phase.

Besides the additional logic in the QP tasks, saving and restoring IB verbs objects requires the manipulation of implementation-specific attributes. Some of these attributes cannot be set through the original IB verbs API. For example, the recovery of an MR requires the additional ability to restore the original values of the memory keys and the MRN. Some other attributes are not visible in the original IB verbs API at all. The queues (CQ, SRQ, QP) implemented in SoftRoCE require the ability to save and restore the metadata of the ring buffers backing the queues. If a QP uses a shared receive queue (SRQ), the dump of the QP additionally includes the full state of the current work queue entry (WQE). We identified all required attributes for SoftRoCE, calculated their memory footprint (see Table 2), and implemented the features required by these attributes.

We have shown the analysis of the required changes to RoCEv2 as implemented by SoftRoCE. We claim that similar changes are required in other low-level implementations of the RoCEv2 protocol residing in RDMA-capable NICs. We demonstrated that the changes to the communication path are minimal outside of the migration phase. We reasonably expect that, once mapped to hardware, the proposed changes will remain minimal.
Just adding the capability for transparent container migration may already incur overhead, even when no migration occurs. For example, DMTCP (see Section 6) intercepts all IB verbs library calls and rewrites both work requests and completions before forwarding them to the NIC. The interception happens persistently, even when the process running under DMTCP never migrates. In contrast, MigrOS does not intercept communication operations on the critical path, thereby introducing no measurable overhead. This subsection explores the overhead added to normal communication operations without migrations.

First, we reaffirm that the proposed low-level protocol changes are minimal. For that, we need to compare the performance of the migratable and non-migratable versions of the SoftRoCE driver. Unfortunately, the original version (vanilla kernel, without any modifications from our side) of the SoftRoCE driver turned out to be notoriously unstable; for example, sending SIGINT to a user-level RDMA-application caused the kernel to panic. The original driver contained a multitude of concurrency bugs and required significant restructuring.

We ended up with three versions of the driver: the original buggy version, a non-migratable fixed version, and a migratable fixed version (see Figure 7). The original version turned out to be faster; however, for the scope of our paper, correctness was of higher priority than performance. Nevertheless, the performance of
FixedMigratableBuggy Message size, B B a nd w i d t h , G b / s (a) Communication Through-put BuggyMigratableFixed Message size, B L a t e n c y , µ s (b) Communication Latency Figure 7: Performance comparison of different Soft-RoCE drivers. The original version shows a better per-formance whereas adding connection migration supportto the modified version makes practically no impact.Short Full name LocationSR SoftRoCE localCX3/40 ConnectX-3 40 Gb Ethernet localCX3/56 ConnectX-3 56 Gb InfiniBand clusterCIB ConnectIB localBIB Bull Connect-IB clusterTable 3: RDMA-capable NICs used for the evaluation.both fixed versions of the SoftRoCE driver is practicallyindistinguishable. Therefore, we conclude that MigrOSintroduces no runtime overhead outside of the migrationphase.Next, we show the overhead added by DMTCP, whichintercepts all IB verbs calls. This way, we study thecost of adding migration capability at the user level.We use the latency and bandwidth benchmarks fromthe OSU 5.6.1 benchmark suite [4] running on top ofOpen MPI 4.0 [29]. We ran the experiment on the pre-viously described cluster with ConnectIB NICs. As wehave shown above, adding support for the migrationdoes not add performance penalty. Thus, running with-out DMTCP is similar to having native migration sup-port. To be able to extract the state of IB verbs objects,DMTCP maintains shadow objects , which act as prox-ies between the user process and the NIC [24]. Figure 8shows that maintaining these shadow objects incurs anon-negligible runtime overhead for RDMA networks.
With added support for migrating IB verbs objects, the container migration time increases proportionally to the time required to recreate these objects. Our goal is to estimate the additional latency for migrating RDMA-enabled applications. This subsection shows the cost of migrating connections created by SoftRoCE, as well as the cost of connection creation with hardware-based IB verbs implementations.

Figure 8: DMTCP adds substantial communication overhead, even when migration is not used. (a) Communication throughput: DMTCP reduces the bandwidth by up to 70% for small messages. (b) Communication latency: DMTCP increases the latency by up to 23%.

Several IB verbs objects are required before a reliable connection (RC) can be established, see Section 2.2. Usually, an application creates a single PD, one or two CQs, multiple memory regions, and one QP per communication partner. To measure the cost of creating individual IB verbs objects, we modified ib_send_bw from the perftest [14] benchmark suite to create additional MR objects. We created one CQ, one PD, 64 QPs, and 64 1 MiB-sized MRs per run. Figure 9 shows the average time required to create each object across 50 runs. Each tested NIC is represented by a bar.

Figure 9: Object creation time for different RDMA devices. Before being able to send a message, a QP needs to be in the RTS state, which requires the traversal of three intermediate states (Reset, Init, RTR). We show the interval of the standard deviation around the mean.

We draw two conclusions from this experiment. First, there is substantial variation for all operations across different NICs. Second, the time required for most operations is in the range of milliseconds.

The exact time required for migrating RDMA connections depends on two factors: the number of QPs and the total amount of memory assigned to MRs [53]. Both factors are application-specific and can vary greatly. Therefore, we next show how the migration time is influenced by the application's usage of MRs and QPs.

Figure 10 shows the MR registration time depending on the region's size. MR registration costs are split between the OS and the NIC: the OS pins the memory and the NIC learns about the virtual memory mapping of the registered region. SoftRoCE does not incur the "NIC part" of the cost; therefore, MR registration with SoftRoCE is faster than for RDMA-enabled NICs. For this experiment, we do not consider the costs of transferring the contents of the MR during migration.

The number of QPs is the second variable influencing the migration time. Figure 11 shows the time for migrating a container running the ib_send_bw benchmark. The benchmark consists of two single-process containers running on two different nodes. Three seconds after the communication starts, the container runtime migrates one of the containers to another node. The migration time is measured as the maximum message latency seen by the container that did not move. The checkpoint is transferred over the same network link used by the benchmark for communication. With a growing number of QPs, the benchmark consumes more memory, ranging from 8 MiB to 20 MiB. To put things into perspective, we estimated the migration time for real devices by calculating the time to recreate IB verbs objects on RDMA-enabled NICs: we subtracted the time to create IB verbs objects with SoftRoCE from the measured migration time and added the time to create IB verbs objects with RDMA-NICs (from Figure 9). We show these estimations with dashed lines.

Figure 10: MR registration time depending on the region size.

Figure 11: Migration speed with different numbers of QPs.
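The QP bring-up sequence referenced in the caption of Figure 9 can be modelled as a small state machine. The sketch below is simplified Python covering only the Reset → Init → RTR → RTS path of the IB verbs QP lifecycle; the real state machine has further states and transitions (e.g. Error, SQD) that we omit here.

```python
# Minimal model of the QP bring-up path relevant for restoring a connection:
# verbs moves a QP one state at a time, so a restored reliable connection
# pays for creation plus three ibv_modify_qp-style transitions.

VALID_NEXT = {
    "Reset": "Init",
    "Init": "RTR",   # Ready To Receive
    "RTR": "RTS",    # Ready To Send
}

class QueuePair:
    def __init__(self):
        self.state = "Reset"   # a freshly created QP starts in Reset

    def modify(self, target: str) -> None:
        # reject transitions that skip an intermediate state
        if VALID_NEXT.get(self.state) != target:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target

def bring_up(qp: QueuePair) -> str:
    """Replay the transitions needed before the QP can send again."""
    for state in ("Init", "RTR", "RTS"):
        qp.modify(state)
    return qp.state
```

Because each transition is a separate driver operation, the per-QP restore cost in Figure 9 is the sum of the creation time and the three transition times.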
Figure 12: Migration speed comparison of Docker against CR-X for the cg (B) and mg (C) benchmarks, broken down into checkpoint, transfer, and restore.
Figure 13: MPI application migration. Migration time (s) for the NPB benchmarks mg (C), ft (B), is (C), bt (C), cg (B), sp (B), lu (B), and ep (C), broken down into checkpoint, transfer, and restore.
For evaluating the transparent live migration of real-world applications, we chose to migrate NPB 3.4.1 [18], an MPI benchmark suite. The MPI applications run on top of Open MPI 4.0 [29], which in turn uses OpenUCX 1.6.1 [67] for point-to-point communication. We configured UCX to use IB verbs communication over reliable connections (RC). This setup corresponds to Figure 3.

We containerised the applications using the self-developed runtime CR-X, based on libcontainer [16]. Unlike Docker, our container runtime facilitates faster live migration by sending the image directly to the destination node during the checkpoint process, instead of to local storage. Moreover, our container runtime stores checkpoints in RAM, reducing migration latency even further. A complete description of our container runtime is out of the scope of this paper.

To measure the latency of application migration, we start each MPI application with four processes (ranks) and migrate one of the ranks to another node approximately halfway through the application run. Each benchmark has a size parameter (A to F). We chose sizes such that the benchmarks run between 10 and 300 seconds. For this reason, we excluded the dt benchmark, which runs for only around a second. Figure 13 shows the container migration latency, averaged over 20 runs of each benchmark, with the standard deviation around the mean.

We break down the migration latency into three parts: checkpoint, transfer, and restore. MigrOS stops the target container at the beginning of the checkpoint phase. A large part of the checkpoint arrives at the destination node already during the checkpoint phase. After the transfer phase is over, MigrOS recovers the container on the destination node.
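The three-phase breakdown above can be expressed as a simple cost model. In this hypothetical sketch, the part of the checkpoint that is streamed to the destination while checkpointing is still in progress (`overlap_s`) is subtracted from the transfer phase; all numbers in the usage example are made up.

```python
# Toy cost model of the migration latency breakdown described above.
# `overlap_s` is the portion of the transfer that completes while the
# checkpoint is still being taken; only the remainder shows up as a
# separate transfer phase. All parameters are in seconds and illustrative.

def migration_latency(checkpoint_s: float, transfer_s: float,
                      restore_s: float, overlap_s: float = 0.0) -> float:
    """Downtime seen by the peer: the container stops at the start of the
    checkpoint phase and resumes only after the restore phase."""
    if not 0.0 <= overlap_s <= min(checkpoint_s, transfer_s):
        raise ValueError("overlap cannot exceed either phase")
    return checkpoint_s + (transfer_s - overlap_s) + restore_s
```

For instance, a 0.5 s checkpoint, a 0.4 s transfer of which 0.35 s overlaps with checkpointing, and a 0.3 s restore yield roughly 0.85 s of downtime.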
Overall, we observe that the migration time is proportional to the checkpoint size. The benchmarks experience a runtime delay proportional to the migration latency.

To show interoperability with other container runtimes, we measured the migration costs when using Docker 19.03 (see Figure 12). We had to implement the full end-to-end migration flow ourselves, because Docker supports only the checkpoint and restore features. To our disappointment, Docker does not employ some important optimisations and takes a long time to complete the migration. Nevertheless, we prove our claim that MigrOS is readily interoperable with other container runtimes.

Related Work

Checkpoint/Restart Techniques
Transparent live migration of processes [19, 54, 69], containers [51, 55, 58], or virtual machines [25, 28, 38, 59, 63] has long been a topic of active research. The key challenge of this technique lies in the checkpoint/restart operation. For processes and containers, this operation can be implemented at three levels: application runtime, user-level system, or kernel-level system. Table 4 compares a selection of existing checkpoint/restart systems.

            Legion  Nomad  PSMPI  DMTCP  MOSIX  MOSIX  MigrOS
RDMA        ✓       ✓      ✓      ✓      ✗      ✗      ✓
Overhead    N       N      N      Y      Y      Y      N
Runtime     ✓       ✓      ✓
User-OS                           ✓      ✓
Kernel-OS           ✓                           ✓      ✓
Hardware                                               ✓
Units       O       VM     P      P      P      P      C
Reference   [22]    [39]   [65]   [17]   [20]   [21]   Ours

Table 4: Selected checkpoint/restart systems handle either VMs, processes (P), containers (C), or application objects (O). Runtime-based systems naturally introduce no additional communication overhead for migration support.
Runtime-based systems expect the user application to access all external resources through the API of the runtime system. This restriction resolves two important issues with resource migratability: First, the runtime system controls exactly when the underlying resource is used and can easily stop the user application from using it in order to serialise the state of the resource. Second, the runtime can maintain enough information about the state of the resource to facilitate resource serialisation and deserialisation. Such interception is cheap because it happens within the application's address space.

Almost all attempts to provide transparent live migration together with RDMA networks rely on modifications of the runtime system [17, 32, 34, 39, 43, 65]. Some runtime systems operate on application-defined objects (tasks, agents, lightweight threads) for even more efficient state serialisation and deserialisation [22, 45, 74]. All runtime-based approaches bind the application to a particular runtime system.
Kernel OS-level checkpoint/restart systems [21, 36, 41, 44, 62] either perform interposition at the kernel level or extract the application state from the kernel's internal data structures. Although these systems support a wider spectrum of user applications, they incur a significantly higher maintenance burden. BLCR [36] has eventually been abandoned. CRIU [9], currently the most successful OS-level tool for checkpoint/restart, keeps the necessary Linux kernel modifications to a minimum and does not require interposing the user-kernel API. We describe this tool in more detail in Section 2.3.

Finally, user OS-level systems interpose the user-kernel API, providing the same transparency and generality as kernel-based implementations. Such systems use the LD_PRELOAD mechanism to intercept system calls from applications and virtualise system resources, like file descriptors, process IDs, and sockets. In version 4, MOSIX was redesigned to work entirely at the user level [20]. DMTCP [17] is a transparent fault-tolerance tool for distributed applications with support for IB verbs. To be able to extract the state of IB verbs objects, DMTCP maintains shadow objects, which act as proxies between a user process and the NIC [24]. In Section 5.2, we showed that maintaining these shadow objects has a non-negligible runtime overhead for RDMA networks.
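The shadow-object technique can be illustrated with a toy proxy. The sketch below is ours, not DMTCP code; it only shows why every operation pays for an extra indirection and some bookkeeping, even if the process never migrates.

```python
# Toy illustration of user-level interception with shadow objects, in the
# spirit of DMTCP [24]. The shadow records enough state to recreate the real
# object after a restart, at the price of work on every operation.

class RealQP:
    """Stands in for the NIC-backed queue pair."""
    def post_send(self, wr):
        return f"sent:{wr}"

class ShadowQP:
    """Proxy handed to the application instead of the real QP."""
    def __init__(self, real):
        self._real = real
        self.log = []    # state needed to rebuild the QP on restore

    def post_send(self, wr):
        self.log.append(("post_send", wr))   # bookkeeping on the hot path
        return self._real.post_send(wr)      # then forward to the NIC
```

In MigrOS the application holds the real object directly and the kernel extracts the equivalent state only at migration time, which is why no such hot-path bookkeeping is needed.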
Network Virtualisation
TCP/IP network virtualisation is an essential tool for isolating distributed applications from the underlying physical network topology. Even though network virtualisation enables live migration, it introduces overhead due to the additional encapsulation of network packets [60, 75]. Several new approaches try to address these performance problems [23, 60, 64, 75]. However, these approaches do not consider RDMA networks.

Other work focuses on virtualising RDMA networking. FreeFlow [46] intercepts IB verbs communication in containers to implement connection control policies in software, but does not support live container migration. Nomad [39] uses InfiniBand address virtualisation for VM migration, but implements the connection migration protocol inside an application-level runtime. LITE [71] virtualises RDMA networks, but offers no migration support and requires applications to be rewritten.

MigrOS uses traditional network virtualisation for TCP/IP networks, which is not on the performance-critical path for RDMA-applications. However, MigrOS avoids unnecessary interception of RDMA communication. Instead, MigrOS silently replaces addressing information during migration.
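The contrast with interception-based approaches can be sketched as follows: MigrOS-style address replacement touches a per-connection table only during migration, leaving the send/receive fast path untouched. The code below is a conceptual model with made-up identifiers, not the actual MigrOS data structure.

```python
# Conceptual model of address replacement at migration time: the mapping
# from a local QP to its remote peer is consulted on connection setup and
# rewritten once per migration, so no per-message interception is needed.

class ConnectionTable:
    def __init__(self):
        self._peers = {}   # local qpn -> (peer address, peer qpn)

    def connect(self, qpn: int, addr: str, remote_qpn: int) -> None:
        self._peers[qpn] = (addr, remote_qpn)

    def resolve(self, qpn: int):
        # fast path: a plain lookup, nothing is rewritten per message
        return self._peers[qpn]

    def migrate_peer(self, qpn: int, new_addr: str, new_qpn: int) -> None:
        # slow path: executed once, during the connection migration phase
        self._peers[qpn] = (new_addr, new_qpn)
```

After `migrate_peer` runs, subsequent lookups transparently return the peer's new location; no packet on the fast path ever passes through extra software layers.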
RDMA Implementations
There are multiple open-source RDMA implementations. SoftRoCE [48] and SoftiWarp [70] are pure software implementations of RoCEv2 [8] and iWarp [31], respectively. Both provide no performance advantage over socket-based communication, but are compatible with their hardware counterparts and facilitate the development and testing of RDMA-based applications. We chose to base our work on SoftRoCE because RoCEv2 has found wider adoption than iWarp.

There are also open-source FPGA-based implementations of network stacks. NetFPGA [76] does not support RDMA communication. StRoM [68] provides a proof-of-concept RoCEv2 implementation; however, we found it unfit to run real-world applications (for example, MPI) without significant further implementation effort.
Discussion
Hardware Modifications and Software Implementation
Propositions to modify hardware often meet criticism because they tend to be hard to validate in practice. We believe that limited hardware changes are worthy of consideration, as the deployment of custom [27, 47], programmable [47, 57], and software-augmented NICs [35] has already proven feasible. IB verbs has routinely been extended with additional features [40, 49] as well. Deploying MigrOS to real data centres would require hardware changes. We believe this trade-off is justified because MigrOS provides tangible performance benefits in comparison to other approaches.

To find out whether our proposed changes have any effect on the critical path of the communication, we integrated them into a software implementation of RoCEv2. Our measurements show no performance difference after adding support for migration. Given the nature of these changes, we are confident this observation applies to hardware as well. Moreover, we provide our software implementation as open source to the research community for validating our findings and further study.
Compatibility with Existing Infrastructure
MigrOS ensures backwards compatibility at the IB verbs API and RoCEv2 protocol levels by design. Moreover, MigrOS allows container runtimes to be used interchangeably. By enabling migratability through MigrOS, a data centre provider does not face the hard choice of penalising applications that do not benefit from migration. We believe these features are crucial for successful integration into existing data centre management infrastructure.
Unreliable Datagram Communication
MigrOS provides live migration for reliable communication (RC), but omits unreliable datagram (UD) communication for two reasons: First, every message received over UD exposes the address of its sender. When this sender migrates, its address changes and MigrOS currently cannot conceal this fact from the receiver. Second, a UD QP can receive messages from anywhere. This means that a UD QP does not know where to send the resume messages after migration. We leave migration support for unreliable datagrams for future work.
Conclusion

We introduce MigrOS, an OS-level architecture enabling transparent live container migration. Our architecture maintains full backwards compatibility and interoperability with the existing RDMA network infrastructure at every level. We demonstrate the end-to-end migration flow of MPI applications using different container runtimes and study the cost of migration. MigrOS provides live migration without sacrificing RDMA network performance, yet at the cost of changes to the RDMA communication protocol.

To validate our solution, we integrated the proposed RDMA communication protocol changes into an open-source implementation of the RoCEv2 protocol, SoftRoCE. For real-world deployment, these protocol changes must be implemented in NIC hardware. Finally, we provide a detailed analysis of all changes we make to SoftRoCE to show that they are small.

We are convinced the architecture of MigrOS can be useful for dynamic load balancing, efficient prepared fail-over, and live software updates in data centres and HPC clusters.
Acknowledgments
The research and the work presented in this paper have been supported by the German priority program 1648 "Software for Exascale Computing" via the research project FFMK [11]. This work was supported in part by the German Research Foundation (DFG) within the Collaborative Research Center HAEC and the Center for Advancing Electronics Dresden (cfaed). The authors are grateful to the Centre for Information Services and High Performance Computing (ZIH) of TU Dresden for providing its facilities for high-throughput calculations. In particular, we would like to thank Dr. Ulf Markwardt and Sebastian Schrader for their support with the experimental setup. The authors acknowledge support from the AWS Cloud Credits for Research for providing cloud computing resources.
Availability
The anonymised version of the code is available here: dropbox.com/s/clych73kxmuwjrt.
References

[1] Elastic Fabric Adapter — Amazon Web Services. URL https://aws.amazon.com/hpc/efa/.
[2] High performance computing VM sizes. URL https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-hpc.
[3] Linux Containers. URL https://linuxcontainers.org/lxc/introduction/.
[4] MVAPICH :: Benchmarks. URL http://mvapich.cse.ohio-state.edu/benchmarks/.
[5] proc(5) - Linux manual page. URL http://man7.org/linux/man-pages/man5/proc.5.html.
[6] ptrace(2) - Linux manual page. URL http://man7.org/linux/man-pages/man2/ptrace.2.html.
[7] Supplement to InfiniBand Architecture Specification: RoCE, volume 1. InfiniBand TA, 1.2.1 edition.
[8] Supplement to InfiniBand Architecture Specification: RoCEv2. InfiniBand TA. URL https://cw.infinibandta.org/document/dl/7781.
[9] Checkpoint/Restore In Userspace. URL https://criu.org/Main_Page.
[10] Data Plane Development Kit.
[11] FFMK Website. URL https://ffmk.tudos.org.
[12] InfiniBand Architecture Specification, volume 1. InfiniBand TA, 1.3 edition. URL https://cw.infinibandta.org/document/dl/8567.
[13] Messaging Accelerator (VMA) Documentation. URL https://docs.mellanox.com/display/VMAv883.
[14] OFED performance tests. URL https://github.com/linux-rdma/perftest.
[15] Podman: daemonless container engine. URL https://podman.io/.
[16] runc: CLI tool for spawning and running containers according to the OCI specification. URL https://github.com/opencontainers/runc.
[17] Jason Ansel, Kapil Arya, and Gene Cooperman. DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop. URL http://arxiv.org/abs/cs/0701037.
[18] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS Parallel Benchmarks. 5(3):63–73. ISSN 0890-2720. doi:10/cgsfnm.
[19] Amnon Barak and Amnon Shiloh. A distributed load-balancing policy for a multicomputer. 15(9):901–913. ISSN 00380644, 1097024X. doi:10/c8r7m6.
[20] Amnon Barak and Amnon Shiloh. The MOSIX Cluster Management System for Distributed Computing on Linux Clusters and Multi-Cluster Private Clouds.
[21] Amnon Barak, Shai Guday, and Richard G. Wheeler. The MOSIX Distributed Operating System: Load Balancing for UNIX. Springer-Verlag. ISBN 978-0-387-56663-4.
[22] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. Legion: Expressing locality and independence with logical regions. SC '12, pages 1–11. IEEE. ISBN 978-1-4673-0805-2 978-1-4673-0806-9. doi:10.1109/SC.2012.71.
[23] Adam Belay, George Prekas, Christos Kozyrakis, Ana Klimovic, Samuel Grossman, and Edouard Bugnion. IX: A Protected Dataplane Operating System for High Throughput and Low Latency. OSDI '14, pages 49–65. ISBN 978-1-931971-16-4.
[24] Jiajun Cao, Gregory Kerr, Kapil Arya, and Gene Cooperman. Transparent checkpoint-restart over InfiniBand. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing - HPDC '14, pages 13–24. ACM Press. ISBN 978-1-4503-2749-7. doi:10/ggnfr4.
[25] Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. Live Migration of Virtual Machines. In Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation - Volume 2, NSDI '05, pages 273–286. USENIX Association. doi:10.5555/1251203.1251223.
[26] Patrick Connor, James R. Hearn, Scott P. Dubal, Andrew J. Herdrich, and Kapil Sood. Techniques to migrate a virtual machine using disaggregated computing resources.
[27] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Vivek Bhanu, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. Azure Accelerated Networking: SmartNICs in the Public Cloud. USENIX Association. ISBN 978-1-931971-43-0.
[28] Umesh Deshpande, Yang You, Danny Chan, Nilton Bila, and Kartik Gopalan. Fast Server Deprovisioning through Scatter-Gather Live Migration of Virtual Machines. pages 376–383. IEEE. ISBN 978-1-4799-5063-8. doi:10.1109/CLOUD.2014.58.
[29] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In Dieter Kranzlmüller, Péter Kacsuk, and Jack Dongarra, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 3241, pages 97–104. Springer Berlin Heidelberg. ISBN 978-3-540-30218-6. doi:10.1007/978-3-540-30218-6_19.
[30] Peter X. Gao, Akshay Narayan, Rachit Agarwal, Sagar Karandikar, Sylvia Ratnasamy, Joao Carreira, Sangjin Han, and Scott Shenker. Network Requirements for Resource Disaggregation. OSDI '16, pages 249–264. USENIX Association. ISBN 978-1-931971-33-1. doi:10.5555/3026877.3026897.
[31] D. Garcia, P. Culley, R. Recio, J. Hilland, and B. Metzler. A Remote Direct Memory Access Protocol Specification. URL https://tools.ietf.org/html/rfc5040.
[32] Rohan Garg, Gregory Price, and Gene Cooperman. MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '19, pages 49–60. ACM Press. ISBN 978-1-4503-6670-0. doi:10/ggnd38.
[33] Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin. Efficient Memory Disaggregation with INFINISWAP. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation, NSDI '17, page 21. USENIX Association. ISBN 978-1-931971-37-9. doi:10.5555/3154630.3154683.
[34] Wei Lin Guay, Sven-Arne Reinemo, Bjørn Dag Johnsen, Chien-Hua Yen, Tor Skeie, Olav Lysne, and Ola Tørudbakken. Early experiences with live migration of SR-IOV enabled InfiniBand. 78:39–52. ISSN 07437315. doi:10/f68twd.
[35] Sangjin Han, Keon Jang, Aurojit Panda, Shoumik Palkar, Dongsu Han, and Sylvia Ratnasamy. SoftNIC: A Software NIC to Augment Hardware.
[36] Paul H. Hargrove and Jason C. Duell. Berkeley lab checkpoint/restart (BLCR) for Linux clusters. 46:494–499. ISSN 1742-6588, 1742-6596. doi:10/d33sc5.
[37] Berk Hess, Carsten Kutzner, David van der Spoel, and Erik Lindahl. GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. 4(3):435–447. ISSN 1549-9618, 1549-9626. doi:10/b7nkp6.
[38] Michael R. Hines, Umesh Deshpande, and Kartik Gopalan. Post-copy live migration of virtual machines. 43(3):14–26. ISSN 0163-5980. doi:10/ccwrpt.
[39] Wei Huang, Jiuxing Liu, Matthew Koop, Bulent Abali, and Dhabaleswar Panda. Nomad: migrating OS-bypass networks in virtual machines. In Proceedings of the 3rd International Conference on Virtual Execution Environments - VEE '07, page 158. ACM Press. ISBN 978-1-59593-630-1. doi:10/frgqz4.
[40] InfiniBand Trade Association. Supplement to InfiniBand Architecture Specification: XRC. URL https://cw.infinibandta.org/document/dl/7146.
[41] Jake Edge. Checkpoint/restart tries to head towards the mainline. URL https://lwn.net/Articles/320508/.
[42] Jonathan Corbet. TCP connection repair. URL https://lwn.net/Articles/495304/.
[43] J. Jose, Mingzhe Li, Xiaoyi Lu, K. C. Kandalla, M. D. Arnold, and D. K. Panda. SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience. IEEE. ISBN 978-0-7695-4996-5. doi:10/ggm53b.
[44] Asim Kadav and Michael M. Swift. Live migration of direct-access devices. 43(3):95. ISSN 01635980. doi:10/b9j36z.
[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. 28(10):91–108. ISSN 0362-1340. doi:10/cgnqf7.
[46] Daehyeok Kim, Tianlong Yu, Hongqiang Harry Liu, Yibo Zhu, Jitu Padhye, Shachar Raindel, Chuanxiong Guo, Vyas Sekar, and Srinivasan Seshan. FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds. In Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation, NSDI '19, pages 113–125. USENIX Association. ISBN 978-1-931971-49-2. doi:10.5555/3323234.3323245.
[47] Alec Kochevar-Cureton, Somesh Chaturmohta, Norman Lam, Sambhrama Mundkur, and Daniel Firestone. Remote direct memory access in computing systems. URL https://patents.google.com/patent/US10437775B2/en.
[48] Liran Liss. The Linux SoftRoCE Driver. URL https://youtu.be/NumH5YeVjHU?t=45.
[49] Liran Liss. On Demand Paging for User-level Networking.
[50] Jiuxing Liu, Jiesheng Wu, Sushmitha P. Kini, Pete Wyckoff, and Dhabaleswar K. Panda. High performance RDMA-based MPI implementation over InfiniBand. In Proceedings of the 17th Annual International Conference on Supercomputing, ICS '03, pages 295–304. Association for Computing Machinery. ISBN 978-1-58113-733-0. doi:10/c4knj6.
[51] Lele Ma, Shanhe Yi, and Qun Li. Efficient service handoff across edge servers via Docker container migration. In Proceedings of the Second ACM/IEEE Symposium on Edge Computing, SEC '17, pages 1–13. ACM Press. ISBN 978-1-4503-5087-7. doi:10/gf9x9r.
[52] Dirk Merkel. Docker: Lightweight Linux Containers for Consistent Development and Deployment. 2014(239):5. ISSN 1075-3583. doi:10.5555/2600239.2600241.
[53] Frank Mietke, Robert Rex, Robert Baumgartl, Torsten Mehlan, Torsten Hoefler, and Wolfgang Rehm. Analysis of the Memory Registration Process in the Mellanox InfiniBand Software Stack. In Wolfgang E. Nagel, Wolfgang V. Walter, and Wolfgang Lehner, editors, Euro-Par 2006 Parallel Processing, volume 4128 of Lecture Notes in Computer Science, pages 124–133. Springer Berlin Heidelberg. ISBN 978-3-540-37783-2 978-3-540-37784-9. doi:10.1007/11823285_13.
[54] Dejan Milojičić, Frederick Douglis, and Richard Wheeler. Mobility: processes, computers, and agents. ACM Press/Addison-Wesley Publishing Co. ISBN 978-0-201-37928-0.
[55] Andrey Mirkin, Alexey Kuznetsov, and Kir Kolyshkin. Containers checkpointing and live migration.
[56] Christopher Mitchell, Yifeng Geng, and Jinyang Li. Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store. pages 103–114. ISBN 978-1-931971-01-0.
[57] YoungGyoun Moon, SeungEon Lee, Muhammad Asim Jamshed, and KyoungSoo Park. AccelTCP: Accelerating Network Applications with Stateful TCP Offloading. NSDI '20, pages 77–92. USENIX Association. ISBN 978-1-939133-13-7.
[58] Shripad Nadgowda, Sahil Suneja, Nilton Bila, and Canturk Isci. Voyager: Complete Container State Migration. pages 2137–2142. IEEE. ISBN 978-1-5386-1792-2. doi:10/ggnhq5.
[59] Michael Nelson, Beng-Hong Lim, and Greg Hutchins. Fast Transparent Migration for Virtual Machines. In Proceedings of the USENIX Annual Technical Conference, ATEC '05, pages 391–394. USENIX Association. doi:10.5555/1247360.1247385.
[60] Zhixiong Niu, Hong Xu, Peng Cheng, Yongqiang Xiong, Tao Wang, Dongsu Han, and Keith Winstein. NetKernel: Making Network Stack Part of the Virtualized Infrastructure. URL http://arxiv.org/abs/1903.07119.
[61] Opeyemi Osanaiye, Shuo Chen, Zheng Yan, Rongxing Lu, Kim-Kwang Raymond Choo, and Mqhele Dlodlo. From Cloud to Fog Computing: A Review and a Conceptual Live VM Migration Framework. 5:8284–8300. ISSN 2169-3536. doi:10/ggnfkt.
[62] Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh. The design and implementation of Zap: a system for migrating computing environments. 36:361–376. ISSN 0163-5980. doi:10/fbg7vq.
[63] Zhenhao Pan, Yaozu Dong, Yu Chen, Lei Zhang, and Zhijiao Zhang. CompSC: live migration with pass-through devices. 47(7):109. ISSN 03621340. doi:10/f3887q.
[64] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The Operating System is the Control Plane.
[65] S. Pickartz, C. Clauss, S. Lankes, S. Krempel, T. Moschny, and A. Monti. Non-intrusive Migration of MPI Processes in OS-Bypass Networks. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 1728–1735. doi:10/ggscxh.
[66] Marius Poke and Torsten Hoefler. DARE: High-Performance State Machine Replication on RDMA Networks. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '15, pages 107–118. ACM Press. ISBN 978-1-4503-3550-8. doi:10/ggm3sf.
[67] Pavel Shamis, Manjunath Gorentla Venkata, M. Graham Lopez, Matthew B. Baker, Oscar Hernandez, Yossi Itigin, Mike Dubman, Gilad Shainer, Richard L. Graham, Liran Liss, Yiftah Shahar, Sreeram Potluri, Davide Rossetti, Donald Becker, Duncan Poole, Christopher Lamb, Sameer Kumar, Craig Stunkel, George Bosilca, and Aurelien Bouteiller. UCX: An Open Source Framework for HPC Network APIs and Beyond. pages 40–43. doi:10/ggmx8k.
[68] David Sidler, Zeke Wang, Monica Chiosa, Amit Kulkarni, and Gustavo Alonso. StRoM: Smart Remote Memory. page 16. doi:10/gg8qq7.
[69] Jonathan M. Smith. A survey of process migration mechanisms. 22(3):28–40. ISSN 0163-5980. doi:10/bjp787.
[70] Animesh Trivedi, Bernard Metzler, and Patrick Stuedi. A case for RDMA in clouds: turning supercomputer networking into commodity. In Proceedings of the Second Asia-Pacific Workshop on Systems - APSys '11, page 1. ACM Press. ISBN 978-1-4503-1179-3. doi:10/fzv576.
[71] Shin-Yeh Tsai and Yiying Zhang. LITE Kernel RDMA Support for Datacenter Applications. In Proceedings of the 26th Symposium on Operating Systems Principles - SOSP '17, pages 306–324. ACM Press. ISBN 978-1-4503-5085-3. doi:10/ggscxn.
[72] Dongyang Wang, Binzhang Fu, Gang Lu, Kun Tan, and Bei Hua. vSocket: virtual socket interface for RDMA in public clouds. In Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments - VEE 2019, pages 179–192. ACM Press. ISBN 978-1-4503-6020-3. doi:10/ggscxg.
[73] Kai-Ting Amy Wang, Rayson Ho, and Peng Wu. Replayable Execution Optimized for Page Sharing for a Managed Runtime Environment. In Proceedings of the Fourteenth EuroSys Conference 2019, EuroSys '19, pages 1–16. Association for Computing Machinery. ISBN 978-1-4503-6281-8. doi:10/ggnq76.
[74] David Wong, Noemi Paciorek, and Dana Moore. Java-based mobile agents. 42(3):92–ff. ISSN 0001-0782. doi:10/btg3k7.
[75] Danyang Zhuo, Kaiyuan Zhang, Yibo Zhu, Hongqiang Harry Liu, Matthew Rockett, Arvind Krishnamurthy, and Thomas Anderson. Slim: OS Kernel Support for a Low-Overhead Container Overlay Network. In Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation, NSDI '19, pages 331–344. USENIX Association. ISBN 978-1-931971-49-2. doi:10.5555/3323234.3323263.
[76] Noa Zilberman, Yury Audzevich, Georgina Kalogeridou, Neelakandan Manihatty-Bojan, Jingyun Zhang, and Andrew Moore. NetFPGA: Rapid Prototyping of Networking Devices in Open Source.