A1: A Distributed In-Memory Graph Database
Chiranjeeb Buragohain, Knut Magne Risvik, Paul Brett, Miguel Castro, Wonhee Cho, Joshua Cowhig, Nikolas Gloy, Karthik Kalyanaraman, Richendra Khanna, John Pao, Matthew Renzelmann, Alex Shamis, Timothy Tan, Shuheng Zheng
Microsoft
ABSTRACT
A1 is an in-memory distributed database used by the Bing search engine to support complex queries over structured data. The key enablers for A1 are the availability of cheap DRAM and high-speed RDMA (Remote Direct Memory Access) networking in commodity hardware. A1 uses FaRM [11, 12] as its underlying storage layer and builds the graph abstraction and query engine on top. The combination of in-memory storage and RDMA access requires rethinking how data is allocated, organized and queried in a large distributed system. A single A1 cluster can store tens of billions of vertices and edges and support a throughput of 350+ million vertex reads per second with end-to-end query latency in single-digit milliseconds. In this paper we describe the A1 data model, RDMA-optimized data structures and query execution.
ACM Reference Format:
Chiranjeeb Buragohain, Knut Magne Risvik, Paul Brett, Miguel Castro, Wonhee Cho, Joshua Cowhig, Nikolas Gloy, Karthik Kalyanaraman, Richendra Khanna, John Pao, Matthew Renzelmann, Alex Shamis, Timothy Tan, Shuheng Zheng. 2020. A1: A Distributed In-Memory Graph Database. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD'20), June 14–19, 2020, Portland, OR, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3318464.3386135
∗ Chiranjeeb Buragohain and Richendra Khanna are currently at Oracle; Timothy Tan is currently at Amazon.

INTRODUCTION

The Bing search engine handles massive amounts of unstructured and structured data. To efficiently query the structured
data, we need a platform that handles the scale of the data as well as the strict performance and latency requirements. A common pattern for low-latency query serving is to use a two-tier approach with a durable database for the ground truth and a caching layer like memcached in front for read-only serving. The Facebook TAO datastore [5] is a sophisticated example of this architecture using the graph data model. But there are some elements of the design that introduce problems. First, systems like memcached expose a primitive key-value API with little query capability. Therefore complex query execution logic is pushed into the client, rather than the database itself. Second, cache consistency is hard to achieve and such systems guarantee only eventual consistency. Finally, there is no atomicity for updates, which leads to data constraint violations. For example, in TAO one can have partial edges between two objects with the forward link existing, but no backward link. In Bing, we have a huge set of diverse data sources that need to be stitched together with real-time update requirements. Therefore we wanted to move beyond an eventually consistent cache and into a more capable transactional database system.

In representing structured data, the relational and the graph data models have equivalent capabilities, though with different ease of expression [23]. Our choice of the graph data model is a natural match for much of Bing's data, including core assets like the knowledge graph [15]. Therefore we designed A1 to be a general-purpose graph database with full transactional capabilities. Transactions in a distributed system free application developers from worrying about complex problems like atomicity, consistency and concurrency control, and instead allow them to focus on core business problems [9]. A1 also exposes a query language which simplifies application development by moving query execution into the database. Our query language doesn't attempt to be as comprehensive as SQL and instead focuses on the core capabilities needed by the applications using A1. The primitives we support are general enough that multiple classes of applications can start using A1 with little difficulty. Another key characteristic of A1 is that it is a latency-optimized database. Since search engines like Bing have a fixed latency budget to render pages, all queries issued to the backend by the search engine come with a corresponding latency budget (typically 100ms). If a query takes more than 100ms to execute, then the results of that query will simply be discarded. That means that the availability of the system is measured by its latency, not by its error rate: if a system's 80th percentile latency is 100ms, the system's effective availability is only 80%. Therefore having tight control over tail latency is a key requirement.

Cheap DRAM and fast networks with Remote Direct Memory Access (RDMA) are the two major hardware trends that enable A1. We are deploying machines with hundreds of gigabytes of DRAM, so a set of racks can hold more than 200 TB of memory. This is sufficient for most applications to keep their data in memory and to avoid accesses to secondary storage.
Until recently, RDMA has mostly been the province of exotic high-performance computing networks, but it has now become a commodity technology easily deployed in cloud data centers. The RoCE (RDMA over Converged Ethernet) networks we use offer a round-trip latency of less than 5 microseconds, bandwidths of 40Gb/s and message rates approaching 100 million messages per second. Note that running RDMA in data centers is still not an easy endeavor and we will have more details on this later. The combination of in-memory storage and RDMA allows A1 to achieve single-digit millisecond latencies for queries that access thousands of objects across multiple machines.

This paper makes three contributions. First, we describe the design and implementation of A1 on top of the FaRM distributed memory storage system (Sections 2, 3). Next, in Section 4, we show how A1 is integrated into a more complex system in Bing with replication and disaster recovery. Finally, we evaluate the applications built on top of A1 and their performance (Sections 5 and 6). A key part of our journey in building A1 has been the evolution of a research prototype like FaRM into a production system. The learnings on this path are described throughout the paper.

A1 ARCHITECTURE

A1 has a typical layered architecture with networking at the bottom and query processing at the top, as depicted in Figure 1. The four lowest layers of the stack together form a distributed storage platform called FaRM [11, 12, 24]. FaRM provides transactional storage and generic indexing structures, while the rest of A1 provides graph data structures and a specialized graph query engine. The bottom RDMA communication layer provides primitives like one-sided RDMA read/write and a fast RPC implementation. The distributed memory layer exposes a disaggregated memory model where the API enables one to allocate, read and write objects across a cluster of machines. These objects are replicated so that a single machine or rack failure never leads to any data loss. Given a handle to an object, a single one-sided RDMA read is sufficient to retrieve the object.
[Figure 1 layers, bottom to top: RDMA Communication Fabric; Distributed Memory; Distributed Transactions; Core Data Structures; Graph Store and Index; Graph Query Execution; A1 Graph API; Graph Applications]
Figure 1: Layers of the A1 architecture

The transaction engine provides atomicity, failure recovery, and concurrency control. FaRM also exposes basic data structures like B-trees. We use these layers to build a database that exposes the graph data model. The query processing works directly on the graph storage, but it also leverages aspects of the distributed memory platform and communication layer to scale out and coordinate execution of queries.

Before we get into the details of the graph storage, it is worthwhile to understand the lower FaRM layers in a bit more detail. We refer the reader to the existing literature [11, 12, 24] on the implementation of FaRM and instead focus here on building applications on top of FaRM.
FaRM is a transactional distributed in-memory storage system. It is worthwhile unpacking these adjectives. FaRM uses a set of machines in a datacenter and exposes their combined memory as a single flat storage space. The storage API exposed by FaRM is very simple: every storage object in FaRM is an unstructured chunk of contiguous memory. Objects are uniquely identified by a 64-bit address or pointer and can range in size from 64 bytes to 1 MB. All object manipulations happen in the context of a transaction. For durability FaRM replicates all data 3-ways.

FaRM uses RDMA-capable NICs (Network Interface Cards), which are becoming commodity in modern data centers, for cross-machine communication. RDMA enables the ability to read/write the contents of a remote machine's memory with low latency (less than 5 microseconds within a rack) and high throughput. RDMA achieves this low latency in three ways: first, on the local machine, the RDMA library bypasses the OS kernel and talks directly to the NIC. Second, on the remote machine, the memory is accessed by the remote NIC directly without involving the CPU. This is known as a one-sided read/write. Finally, TCP features like reliability and congestion control are all implemented within the NIC and the network switches, which reduces the load on the CPU further. Of course, taking an ordinary storage system and simply porting its network layer to RDMA doesn't always result in high performance. FaRM optimizes its whole stack, including replication, the transaction protocol and data structures, to leverage RDMA at every layer and provide a high-performance storage system.

A FaRM cluster is a set of machines each running a FaRM process. One machine is designated as a Configuration Manager (CM) whose purpose is to keep track of machine membership in the cluster and data placement. The memory of each machine is split into 2GB chunks known as regions. Objects are allocated within a region and every region is replicated 3-ways in memory for fault tolerance. Replication is done using a primary-backup mechanism and all reads/writes are served from the primary only, which ensures consistency of all operations. The 64-bit address of an object essentially consists of two 32-bit numbers: the region id, which uniquely identifies the region, and the offset within the region where that object is located. The CM is responsible for determining which machines are part of the cluster (i.e. membership) and for region metadata: the allocation of regions to machines. Given a FaRM address, the CM metadata can be used to find the machine which hosts the primary copy of the region and then we can use RDMA to directly read the contents of the object by using the offset. All reads and writes happen in the context of a transaction. The transaction protocol is a variant of two-phase commit with multiple optimizations for RDMA. For example, reads are always done using one-sided RDMA reads which bypass the CPU. Similarly, data replication happens using one-sided writes, again bypassing the CPU. In production deployments, we deploy FaRM machines across at least three fault domains. A single fault domain consists of a set of machines which share a common critical component like a network switch or power supply. Therefore all machines in a single fault domain may become inaccessible in case of a hardware failure. By replicating data across three fault domains, we ensure that no single component failure can lead to loss of more than one copy of the data.
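To make the addressing concrete, the following sketch shows one way such a 64-bit address could be decomposed; the field names, and the choice of which half holds the region id, are our assumptions rather than FaRM's actual encoding.

#include <cstdint>

// Sketch of a FaRM-style 64-bit object address: one 32-bit half identifies
// the region, the other is the byte offset of the object inside that 2GB
// region. Which half is which, and the names, are assumptions.
struct FarmAddress {
    uint64_t raw;

    uint32_t RegionId() const { return static_cast<uint32_t>(raw >> 32); }
    uint32_t Offset() const   { return static_cast<uint32_t>(raw); }

    static FarmAddress Make(uint32_t region_id, uint32_t offset) {
        return FarmAddress{(static_cast<uint64_t>(region_id) << 32) | offset};
    }
};

// To read an object: the CM's region metadata maps RegionId() to the machine
// holding the primary replica, and Offset() then locates the object for a
// single one-sided RDMA read.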
The FaRM API (Figure 2) exposes a set of basic operations on objects: allocation, reading, writing and freeing them. The ObjBuf object referred to in the API is the wrapper around the FaRM object. Reads return an ObjBuf object which holds the data for the object read. The operations must be executed in the context of a transaction, which provides programmers with atomicity and concurrency control. FaRM transactions provide strict serializability as the default isolation level using multi-version concurrency control [24]. Every transaction has a timestamp associated with it and this timestamp ensures a global order among all the transactions in the system.

Figure 2: FaRM API
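The operation names below (Alloc, OpenForWrite, ObjBuf, Hint, Status, COMMITTED) appear in the text; the exact signatures are our assumptions, so the following is only a sketch of what the API surface in Figure 2 could look like.

#include <cstdint>
#include <memory>

enum Status { COMMITTED, ABORTED };        // outcome of a commit (assumed values)
struct Hint { uint64_t near_address = 0; };  // advisory placement hint

// Local wrapper around a FaRM object (an unstructured chunk of memory).
class ObjBuf {
 public:
    uint64_t Address() const;   // 64-bit FaRM address of the object
    size_t   Size() const;      // object size in bytes
    void*    Data();            // pointer to the locally buffered contents
};

// All object manipulation happens inside a transaction.
class Transaction {
 public:
    // Allocate an object; the hint can name an existing object's address to
    // request co-location in the same region (advisory only).
    ObjBuf* Alloc(size_t size, Hint hint = {});

    // Read returns a local, immutable copy of the object at `address`.
    std::unique_ptr<ObjBuf> Read(uint64_t address);

    // Create a locally buffered, writable copy of a previously read object.
    ObjBuf* OpenForWrite(ObjBuf* read_buffer);

    // Free the object at `address`.
    void Free(uint64_t address);

    // Push buffered writes to the replicas, run the optimistic concurrency
    // checks, and commit or abort.
    Status Commit();
};

Transaction CreateTransaction();   // assumed entry point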
Note that the Alloc API takes a Hint parameter. The hint is used to determine where to allocate the object: by default we allocate the object on the local machine where the API is invoked. The more useful option is to pass the address of an existing object in the hint; in that case we attempt to allocate the new object in the same region in which the existing object lives. Since the region is our unit of replication, if two objects are allocated in the same region, they are guaranteed to be on the same machine in spite of machine failures. The hint is advisory only: in case the region doesn't have enough space, the allocator will find another place to allocate the object. Here is an example of atomically incrementing a 64-bit counter which is stored in FaRM:
Figure 3: Atomic increment of a counter using the FaRM API
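A minimal sketch of the increment loop along the lines of Figure 3, reusing the hypothetical signatures from the API sketch above; the Counter struct and the Data() accessor are likewise assumptions.

// Sketch only: atomically increment a 64-bit counter stored in FaRM.
// `address` is the FaRM address of the counter object (assumed given).
struct Counter { uint64_t value; };

Status status = COMMITTED;
do {
    Transaction tx = CreateTransaction();

    // Read gives us a local, immutable copy of the counter object.
    std::unique_ptr<ObjBuf> buf = tx.Read(address);
    uint64_t current = reinterpret_cast<const Counter*>(buf->Data())->value;

    // OpenForWrite only creates a locally buffered, writable copy;
    // no remote operation happens here.
    ObjBuf* writable = tx.OpenForWrite(buf.get());
    reinterpret_cast<Counter*>(writable->Data())->value = current + 1;

    // Commit pushes the buffered writes, performs the concurrency checks and
    // commits; under optimistic concurrency control it may abort on conflict,
    // in which case we simply retry the whole transaction.
    status = tx.Commit();
} while (status != COMMITTED);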
In this example (Figure 3), we read a FaRM object identified by its address and extract the value stored in it. The ObjBuf object is a local immutable copy of the object. To modify it, we need to create a writable copy, which we do with the OpenForWrite API. Once all the objects have been modified, we commit the transaction, which atomically makes the update. The reason we have the loop here is that FaRM transactions run under optimistic concurrency control and hence may abort under conflict, and it is necessary to retry them. Note that in this model all transaction writes are buffered locally. The OpenForWrite operation doesn't cause any remote operations; it merely creates a modified buffer and stores the updates locally. The Commit operation pushes the writes to remote machines, performs concurrency checks and finally commits the data.
Applications like A1 are integrated with FaRM using what we call the coprocessor model. In this model, A1 is compiled into the same executable as the FaRM code and is part of the same address space, so calling the FaRM API (Figure 2) is as simple as making a regular function call. As part of being a coprocessor, the application needs to integrate with FaRM's threading model, which we will talk about later. The availability of transactions in the FaRM layer proved to be a great engineering productivity boost in building A1. When writing A1, the following principles guided our development:

• Pointer-linked data structures: The standard way to build data structures in FaRM is to use FaRM objects connected by pointers, e.g. linked lists, BTrees, graphs etc. Since dereferencing a pointer generally requires an RDMA read (unless the object is hosted on the local machine), we optimize the layout and placement of data structures to reduce the number of pointer dereferences. For example, we prefer arrays to store list-oriented data instead of traversing linked lists. BTrees with a high branching ratio work well for search structures, and we use the tuple ⟨address, size⟩ as the pointer, which indicates both the address and the size of the RDMA read needed to access the data stored in the object (see the sketch after this list).

• In-memory storage: Since the cost of memory is high compared to SSD, we need to be frugal with storage. Typically A1 is used as a fast queryable store, with non-queryable attributes stored in cheaper storage systems. For example, if we are storing the profile of an actor in A1, the photo of the actor will not be stored in A1 itself.

• Locality: RDMA reduces latency, but there is still a 20x-100x difference between accessing local memory and remote memory. Therefore, at object creation time, we attempt to co-locate data that is likely to be accessed together on the same machine. Similarly, at query time, we ship queries to data to reduce the number of remote reads. When we reallocate any object, we keep its locality intact by passing the old object's address into the Alloc call.

• Concurrency: Since FaRM transactions run under optimistic concurrency control, it is critical to avoid single points of contention. For read-only queries, we use snapshot isolation to ensure that updates to data do not delay or block read-only operations. When we run a distributed query, all objects across the cluster are read as of a single consistent snapshot version, and those versions are not garbage collected until the query runs to completion.

• Cooperative multithreading: Recall that we compile the application with FaRM itself into a single binary. Inside the FaRM process, coprocessors must run using cooperative multithreading to share compute resources. At startup, we allocate a fixed number of threads and affinitize them to cores. FaRM code and the application code (i.e. the coprocessor) share these threads. Coprocessors use a fixed number of fibers per thread to achieve cooperative multithreading. All FaRM API calls which touch remote objects are asynchronous, but the use of fibers hides the asynchrony and gives the application writer the illusion of writing synchronous code.
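The ⟨address, size⟩ pointer from the first bullet is small enough to show directly; the struct below is our naming, not FaRM's.

// Sketch of the <address, size> "fat pointer" used to link FaRM objects:
// carrying the size next to the 64-bit address lets the reader fetch the
// whole object with a single one-sided RDMA read of exactly the right length.
struct FatPointer {
    uint64_t address;   // FaRM object address (region id + offset)
    uint32_t size;      // object size in bytes
};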
Figure 4: A1 cluster deployment
Figure 4 shows the full physical deployment of a single A1 cluster. Clients access the A1 cluster by making RPC calls to the A1 API. The RPC calls are routed by a software load balancer (SLB) to a set of frontend machines. The frontend machines are stateless and mostly perform simple routing and throttling functions. The RPC request is forwarded by the frontends to the backend machines where it is processed. The backend machines make up the FaRM cluster, and each machine runs a combined binary of FaRM and A1. All query execution and data processing happens on the backend machines, utilizing RDMA communication. Communication between the client and the cluster uses the traditional TCP stack, which has higher latency. However, our target workloads are complex queries with many reads and writes, so the latency between the client and the backend is typically immaterial to the total execution time.
DATA STRUCTURES AND QUERY ENGINE
Using a graph to model data is nothing new: entity-relationship databases have been in use for a while and have enjoyed renewed popularity recently. A1 adopts the property graph model: a graph consists of a set of vertices and directed edges connecting the vertices. The vertices and edges are typed and can have attributes (also known as properties) associated with them. The type of the vertex/edge defines the schema of the associated attributes. In contrast to typical property graph models such as Tinkerpop [13] or Neo4J [20], we choose to enforce a schema on attributes to improve data integrity and performance. An example will clarify the model. Consider the relationship between a film and an actor as shown in Figure 5.
Actor (name, origin, birth_date)    Film (name, genre, release_date)
Acted (character)
Figure 5: Simple graph example
We introduce two types of vertices: Film and Actor. The Actor attributes are name, origin and date of birth, while the Film attributes are name, release date and genre. The attributes in the schema are comparable to column definitions in traditional relational databases. Microsoft Bond [21] is a language for managing schematized data, similar to Protocol Buffers [18]. In Bond, our schema looks as follows:

struct Actor {
    0: string name;
    1: string origin;
    2: date birth_date;
}

struct Film {
    0: string name;
    1: string genre;
    2: date release_date;
}
Next we introduce the edge type Acted, which stores data like the name of the character played by the actor. So the edge data schema will look like:

struct Acted {
    0: string character;
}
By using Bond, A1 inherits the Bond type system with primitive types like integers, floats, strings, booleans and binary blobs. Since Bond allows composite types (arrays and maps) and nesting of structs, A1 can support a richer type system than a typical relational database.

A1 organizes customer data in a hierarchy: the top level of the hierarchy is a tenant and it is the default isolation container. Two tenants can't see each other's data. A tenant may have one or more graphs and every graph contains a set of types. A graph contains a set of vertices and edges and every vertex/edge must belong to one of the types defined within that graph. The analogy between the relational data model and the A1 model is presented in Table 1.

A1:          Graph      Type    Vertex/Edge    Attribute
Relational:  Database   Table   Row            Column
Table 1: Analogy between relational database entities and A1 entities.
When declaring a vertex type, the user must also define one of the attributes as a primary key, which must be unique and non-null. Every type by default comes with a sorted primary index defined over the primary key. Edge types do not require primary keys and there are no indexes on edges. It is also possible to declare secondary indexes on vertex attributes. There are no uniqueness or nullability requirements on secondary index attributes.

Within a graph, to uniquely identify a vertex, we need to specify the tuple ⟨type, primary-key⟩. Using the type, we can identify the relevant primary index and then retrieve the vertex by using the primary key in the index. Edges can't be identified directly except through the vertices to which they are attached. An edge is uniquely identified by the tuple ⟨source-vertex, edge-type, destination-vertex⟩. This implies that given two vertices, there can only be a single edge of a given type between them.
In terms of APIs we support the usual CRUD APIs on objects like vertices, edges, types and graphs. We divide the APIs into two classes: control plane APIs which manipulate bulk objects like graphs and types, and data plane APIs which manipulate fine-grained objects like vertices and edges. In addition, we expose a set of transaction APIs: CreateTransaction, CommitTransaction and AbortTransaction. The CreateTransaction API creates a transaction object which can be used to group multiple data plane operations into a single atomic transaction. If a transaction is not specified for a data plane operation like CreateVertex, a transaction is implicitly created for that operation and committed at the end of the call. Unlike data plane operations, control plane operations cannot be grouped under a transaction. Each control plane operation executes under its own transaction.
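As an illustration of how the data plane and transaction APIs compose, the sketch below groups two vertex creations and an edge creation into one atomic transaction; the names CreateTransaction, CreateVertex and CommitTransaction come from the text, while CreateEdge, the reference types and the argument shapes are assumptions.

// Sketch: group several data-plane operations into a single atomic
// transaction. Parameter shapes and CreateEdge/VertexRef are illustrative.
Transaction tx = CreateTransaction();

VertexRef film  = CreateVertex(tx, /*type=*/"Film", /*primary_key=*/"youve.got.mail",
                               Film{"You've Got Mail", "Comedy", /*release_date=*/"1998"});
VertexRef actor = CreateVertex(tx, /*type=*/"Actor", /*primary_key=*/"tom.hanks",
                               Actor{"Tom Hanks", "USA", /*birth_date=*/"1956"});

// Creating the edge in the same transaction as its endpoints means a reader
// can never observe a dangling half-edge.
CreateEdge(tx, /*source=*/film, /*edge_type=*/"Acted", /*destination=*/actor,
           Acted{/*character=*/"Joe Fox"});

CommitTransaction(tx);   // everything above becomes visible atomically, or nothing does

If no transaction is passed to a data plane call, it would simply run under its own implicit transaction, as described above.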
A1 roots all data structures in the catalog. It is a system data structure which returns handles to objects like tenants, graphs, types, indexes, BTrees etc. The catalog is fundamentally a key-value store where the key is the name of the object and the value is a pointer to all the data needed to access the object. For example, in the case of a BTree, the catalog maps the name of the BTree to the FaRM address of the root node of the BTree. Once we have the root node of the BTree, we create an in-memory object called a BTree proxy which allows us to look up and manipulate the BTree contents. The catalog itself is stored in FaRM and hence materializing a proxy from the BTree name can be an expensive operation: it involves multiple remote reads to map the name to the root node and then potentially reading the root node itself for any BTree metadata. To reduce load on the catalog as well as to avoid remote reads in materializing proxies, we cache proxies in memory once they are materialized. Once cached, data plane operations like CreateVertex can use them without incurring the overhead of looking up the catalog separately. The cache has a fixed TTL to ensure that we don't use stale proxies. When the TTL expires, the cache checks if the underlying object has changed: if it has, then we refresh the proxy; if it hasn't, then we simply extend the TTL and continue to use the proxy.
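A sketch of this refresh logic is shown below; BTreeProxy, MaterializeProxy, CatalogVersion and the TTL value are assumptions used only to illustrate the extend-or-refresh behavior.

#include <chrono>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

class BTreeProxy;                                                        // assumed proxy type
std::shared_ptr<BTreeProxy> MaterializeProxy(const std::string& name);   // assumed: remote reads via the catalog
uint64_t CatalogVersion(const std::string& name);                        // assumed: version of the catalog entry

using Clock = std::chrono::steady_clock;

// Sketch of the proxy cache: once materialized, a proxy is reused until its
// TTL expires; on expiry we check whether the underlying catalog entry has
// changed and either refresh the proxy or simply extend the TTL.
struct CachedProxy {
    std::shared_ptr<BTreeProxy> proxy;
    uint64_t version;                  // catalog entry version when cached
    Clock::time_point expiry;
};

std::unordered_map<std::string, CachedProxy> g_proxy_cache;
constexpr auto kTtl = std::chrono::seconds(30);    // TTL value is illustrative

std::shared_ptr<BTreeProxy> GetProxy(const std::string& name) {
    auto it = g_proxy_cache.find(name);
    if (it != g_proxy_cache.end()) {
        CachedProxy& entry = it->second;
        if (Clock::now() < entry.expiry) return entry.proxy;   // still fresh
        if (CatalogVersion(name) == entry.version) {           // unchanged underneath
            entry.expiry = Clock::now() + kTtl;                 // just extend the TTL
            return entry.proxy;
        }
    }
    // Expensive path: multiple remote reads to materialize the proxy again.
    CachedProxy fresh{MaterializeProxy(name), CatalogVersion(name), Clock::now() + kTtl};
    g_proxy_cache[name] = fresh;
    return fresh.proxy;
}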
Figure 6: Vertex and primary index
The storage format for vertices and edges is dictated by how they are accessed. For a vertex, we can look it up either using an index or through edge traversal. Since traversal queries are much more frequent than index lookups, we optimize for that. A vertex is stored as two FaRM objects: a header object and a data object, as shown in Figure 6. The vertex header contains the type of the vertex, pointers to the data structures that hold edges associated with the vertex, and a pointer to the data associated with the vertex. As the vertex is updated with new edges or new data, the header content changes, but the pointer to the header itself remains unchanged. We call this pointer the vertex pointer. The vertex data is stored in a separate variable-length object and serialized in Bond binary format. Since the data for a vertex is always schematized, the data representation is very compact and efficient to deserialize. Since the vertex data and header are looked up together most of the time, we use locality to store both of them in the same region.

Looking up a vertex from its primary key is a multi-step process. First, we look up the vertex pointer (the address of the vertex header) from the index, which is a BTree. We cache internal BTree nodes heavily [11] and in most cases this lookup requires one RDMA read rather than O(log(n)). Once the vertex pointer is found, we need two consecutive RDMA reads to read the header and then the actual data. If the vertex is being read during a traversal, then we can bypass the index lookup and we need only the two consecutive reads.
Figure 7: Vertex, edge lists and half-edge
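A possible in-memory layout for the vertex header and a half-edge entry, following Figures 6 and 7; the field names and widths are our assumptions, while the listed contents come from the text.

#include <cstdint>

// Sketch of the layout implied by Figures 6 and 7. The header holds the
// vertex type, pointers to the two edge lists and a pointer to the vertex
// data; a half-edge names the edge type, the vertex at the other end and
// the edge data. Field names and widths are assumptions.
struct VertexHeader {
    uint32_t vertex_type;       // type id of the vertex
    uint64_t out_edge_list;     // FaRM address of the outgoing edge list
    uint64_t in_edge_list;      // FaRM address of the incoming edge list
    uint64_t data_ptr;          // FaRM address of the Bond-serialized vertex data
    uint32_t data_size;         // size of the data object, so one read suffices
};

// A half-edge as it appears in an edge list: the outgoing copy points at the
// destination vertex, the incoming copy points back at the source.
struct HalfEdge {
    uint32_t edge_type;         // type id of the edge
    uint64_t other_vertex;      // vertex pointer of the other endpoint
    uint64_t data_ptr;          // FaRM address of the edge data (may be null)
    uint32_t data_size;         // size of the edge data object
};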
Unlike vertices, which can be uniquely identified by the vertex pointer, edges are not stored in a single unique FaRM object. This is again dictated by how edges are used inside A1 queries. Given an edge e from vertex v1 to vertex v2, if we delete v2 we'd like to ensure that the edge e pointing from v1 does not remain dangling. To achieve this, we store the edge as a 3-part object as shown in Figure 7. First we associate two edge lists with each vertex: an incoming edge list and an outgoing edge list. The edge e appears twice: once as an entry in the outgoing edge list of v1 and once in the incoming edge list of v2. We call the entry that appears on an edge list a half-edge. The outgoing half-edge for v1 consists of the tuple ⟨edge type, v2 pointer, data pointer⟩ while the incoming half-edge for v2 consists of the tuple ⟨edge type, v1 pointer, data pointer⟩. The data pointer field points to a FaRM object that holds the data associated with the edge. In this example, if we delete v2, then by inspecting its incoming edge list we know that there is an edge pointing to it from v1, and we can go to v1 and delete the corresponding entry there as well. The edge list data structure needs to satisfy a few constraints: first, since a single vertex can hold millions of edges, the data structure needs to be scalable. Next, given an edge characterized by the source vertex, destination vertex and edge type, we should be able to look up, insert or delete the edge quickly.

To satisfy these requirements, we use two different implementations of the edge list. For a small number of half-edges, all half-edges are stored as an unordered list in a single FaRM object of variable length. As the number of edges increases, we resize the FaRM object in a geometric progression until we reach around 1000 edges. For vertices with more than 1000 edges, we store the edges in a global BTree where the key is the tuple ⟨src vertex pointer, edge type, dest vertex pointer⟩ and the value is the edge data pointer. As long as the half-edges are stored in a single FaRM object, we use locality to co-locate that FaRM object with the associated vertex. Empirically we have found that for our current use cases, 99.9% of the vertices contain fewer than 1000 edges. Given this edge layout, once a vertex is read, enumerating its edges requires just one extra read as long as the number of edges is small. Due to locality, this read is often simply a local memory access.

Although we take pains to allocate vertices and edge lists together using locality, we do not attempt to enforce locality between different vertices. In the case of immutable or slowly changing graphs, it is possible to run offline jobs to pre-partition a graph so that vertices connected together end up close to each other. But we have avoided going down this route since it imposes a considerable burden on our customers to do the offline graph partitioning. Also, as updates happen, the original partitioning may no longer make sense. Instead we believe it is the responsibility of the database to simplify the application developer's experience and provide acceptable performance. We currently place vertices randomly across the whole cluster and use locality to push query execution to where data resides. Looking at sample query executions, we have found that this strategy can be highly effective (95% local reads) and we will discuss this more in Section 6.
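The choice between the two edge-list representations can be summarized in a short sketch; the roughly-1000-edge threshold comes from the text, while EdgeList, EdgeKey and the helper functions are assumptions.

// Sketch of inserting an outgoing half-edge. Small lists stay in one inline,
// geometrically grown FaRM object co-located with the vertex; past ~1000
// edges they move to the global BTree keyed by
// <src vertex pointer, edge type, dest vertex pointer>. Helpers are assumed.
constexpr size_t kInlineEdgeLimit = 1000;

void InsertOutgoingHalfEdge(Transaction& tx, uint64_t src_vertex,
                            VertexHeader& header, const HalfEdge& e) {
    EdgeList list = ReadEdgeList(tx, header.out_edge_list);
    if (!list.in_global_btree && list.count < kInlineEdgeLimit) {
        if (list.full()) {
            // Grow the inline object geometrically, passing the old address
            // as the Alloc hint so the list stays in the vertex's region.
            list = GrowInline(tx, list, /*hint=*/header.out_edge_list);
            header.out_edge_list = list.address;
        }
        list.Append(tx, e);
    } else {
        if (!list.in_global_btree) {
            list = SpillToGlobalBTree(tx, list);    // one-time migration
        }
        // Global BTree: key = <src vertex, edge type, dest vertex>,
        // value = edge data pointer.
        GlobalEdgeBTree().Insert(tx, EdgeKey{src_vertex, e.edge_type, e.other_vertex},
                                 e.data_ptr);
    }
}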
Recall that APIs like DeleteGraph or DeleteType are asynchronous. For example, calling DeleteGraph transitions the graph from state Active to Deleting, but the storage and resources associated with the graph are not freed synchronously. Instead an asynchronous workflow is kicked off which deletes all the resources associated with the graph and finally frees the graph itself. Before a graph can be deleted, all types associated with the graph are deleted. For a type to be deleted, we delete all the indices associated with the type, both primary and secondary. When the primary index is deleted, we delete the vertices at the same time.

The asynchronous workflows run within the A1 process using what we call a
Task execution framework. Tasks are units of work that can be scheduled to execute in the future: tasks are enqueued on a global queue that is stored in FaRM. We have a pool of worker threads on every backend machine that look for pending tasks and work on them. Since tasks are globally visible, any single task may be worked on by any backend machine in the cluster. The worker threads are stateless and they save their execution state in FaRM itself. Once a task is scheduled, it is picked up by a worker thread. If the thread can finish the task immediately, the task is completed and deleted. Alternatively, if the task is bigger, the worker may reschedule the task to run in the future or spawn more tasks to parallelize the execution. This is the pattern we follow in the DeleteGraph workflow: the DeleteGraph API call simply creates a task. When this task is executed by a worker, it spawns more tasks to delete all the types in the graph and waits for all those tasks to complete. The DeleteType tasks in turn execute for a long time, since each type needs to delete all the vertices, edges and indexes associated with the type. Using this framework, we are able to harness the entire cluster's resources to execute long-running workflows. To ensure that the workflows do not interfere with the real-time workload, the worker threads run at a low priority.
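A sketch of what a worker loop in such a task framework could look like; Task, TaskQueue and the Outcome values are assumptions, while the behaviors (finish and delete, reschedule, spawn children, run at low priority) follow the description above.

#include <optional>

// Sketch of a stateless worker in the Task execution framework.
enum class Outcome { Done, RunAgainLater };

void WorkerLoop(TaskQueue& queue) {
    for (;;) {
        std::optional<Task> task = queue.TryDequeue();   // global queue stored in FaRM
        if (!task) {
            YieldLowPriority();        // workers must not starve the real-time workload
            continue;
        }
        // Run() may complete the work, persist partial state back to FaRM,
        // or enqueue child tasks (e.g. DeleteGraph spawning DeleteType tasks).
        Outcome result = task->Run();
        if (result == Outcome::Done) {
            queue.Delete(*task);        // small task: finished, remove it
        } else {
            queue.Reschedule(*task);    // bigger task: run again later
        }
    }
}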
A1 workloads are dominated by large read-only queries which access thousands of vertices, and small updates that read and write a handful of vertices. When a query or update operation arrives at the frontend, it is by default routed to a random backend machine in the cluster. There are cases where more complex routing is required and we will discuss them later in the section. When an update operation arrives at a backend machine, that machine becomes the transaction coordinator for the operation. All reads and writes are executed on that machine, using RDMA to access remote data. Writes are made durable during transaction commit using one-sided RDMA writes. Queries are executed a little differently: the backend machine where the query first arrives is designated as the coordinator for that query; it drives the execution of the query, but the bulk of the query execution work is distributed across the cluster. We designate the machines where query execution happens at the instigation of the coordinator as workers.

To understand the A1 query language and its execution, let's take as an example a knowledge graph of films, actors and directors. If an actor appears in a film, then the film is connected to the actor with an outgoing edge of type film.actor. Similarly the director and the film are connected with an edge of type director.film. The A1 query language, known as A1QL, is similar to MQL, the Metaweb Query Language [14]. Let's consider the two-hop query that asks for all actors that worked with Steven Spielberg. In A1QL, the query is written as shown in Figure 8.

{ "id" : "steven.spielberg",
  "_out_edge" : { "_type" : "film.director",
    "_vertex" : {
      "_out_edge" : { "_type" : "film.actor",
        "_vertex" : { "_select" : ["*"] }}}}}
Figure 8: A1 query to retrieve all actors that have worked with Steven Spielberg

Every A1 query is a JSON document, with each level of nested JSON struct describing a step in the traversal and the starting point at the top level of the document. In this query, the top-level struct specifies the starting vertex as the vertex with primary key steven.spielberg, and we use the id field to look up the director from the primary index. The next level specifies that we should traverse an outgoing edge (_out_edge) of type film.director to a film. The next level describes that we should traverse out on an edge of type film.actor from the film to arrive at the actor vertex. At the last level, the _select ("*") clause indicates that we should return all values.
Figure 9: Physical A1 query execution. The query is to retrieve all actors that have worked with Steven Spielberg (Figure 8).
Now let us see how the example query from Figure 8 is executed. The query coordinator parses the query to derive a logical plan and then generates a physical plan. A1 doesn't have a true query optimizer: most of the queries submitted to A1 are straightforward and are executed without any optimization. In A1QL the user can supply some optional optimization hints. If they are supplied, they are used in creating the physical execution plan from the logical plan. Building a true optimizer is currently work in progress. Queries are built on top of a few basic operators like index scan, predicate evaluation against vertex/edge data and edge enumeration for a given vertex. The step-by-step query execution is shown in Figure 9. In this execution, the coordinator starts by instantiating a transaction and choosing the transaction timestamp as the version which will be used for all snapshot reads. Next it does an index lookup to locate the
Steven Spielberg vertex and then, from that vertex, enumerates all neighboring half-edges of type film.director.

The edge enumeration gives the coordinator a list of vertex pointers for all the films (see Figure 7). In the next step, we need to look up all the actor edges from those film vertices. Since the edge list is co-located with the vertex, it is more efficient to execute the task of edge enumeration at the actual location of the vertices. Therefore the coordinator maps the vertex pointers to the physical hosts which are the primary storage hosts for the corresponding vertices. Mapping pointers to physical hosts is a local metadata operation with no remote accesses. Operators like predicate evaluation and edge enumeration are shipped to the machine hosting the vertex via RPC so that they can be evaluated without invoking a remote read. When we have multiple vertex operators to be processed at the same machine, we batch the operators together per machine to reduce the number of RPCs. Although query shipping is the norm, if the number of vertex operators to be shipped is too small we avoid the RPC overhead by evaluating the operators locally using RDMA reads.

Each worker receives RPCs from the coordinator and instantiates a new read-only transaction at the timestamp chosen by the query coordinator. This ensures that all the query reads form a consistent global snapshot across the entire distributed graph. The typical operators that execute in a worker are predicate evaluation, which applies predicates against vertex data, and edge enumeration for the vertex. Neither of these operations requires any remote reads when locality applies. During edge enumeration, any edge predicate is also applied. Once edge enumeration is finished, the results are a set of vertex pointers for the next hop of the traversal, i.e. a set of actor vertices. These vertices are shipped back to the coordinator where they are aggregated, duplicates are removed, and they are repartitioned by pointer address to run the next phase of the traversal. Once the whole query completes, the results are aggregated by the coordinator and returned to the client. Since we keep the entire state of the query in the memory of the coordinator, we are vulnerable to queries which require a working set bigger than the coordinator's available memory. Implementing disk spill for such a case is infeasible since our goal is to be a low-latency system. Currently we simply fast-fail queries whose working set grows too large; in the future we plan to dedicate regions in the cluster for spilling intermediate query results. Fast-fail is an acceptable option since very large queries typically will not finish within their time budget anyway.
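To make the partition-and-ship step concrete, the sketch below groups the next hop's vertex pointers by primary host and sends one batched RPC per machine, falling back to local evaluation over RDMA for very small batches; PrimaryHostOf, Operators, the RPC helper and the threshold are assumptions.

#include <cstdint>
#include <unordered_map>
#include <vector>

using MachineId = uint32_t;                        // assumed machine identifier

// Sketch of the coordinator's partition-and-ship step. PrimaryHostOf is the
// local metadata lookup (region id -> primary machine); Operators bundles the
// predicate-evaluation and edge-enumeration operators for this hop.
void ShipNextHop(const std::vector<uint64_t>& vertex_ptrs,
                 uint64_t snapshot_timestamp, const Operators& ops) {
    std::unordered_map<MachineId, std::vector<uint64_t>> by_host;
    for (uint64_t v : vertex_ptrs) {
        by_host[PrimaryHostOf(v)].push_back(v);    // purely local computation
    }

    constexpr size_t kMinBatchForRpc = 4;          // threshold is illustrative
    for (auto& [host, batch] : by_host) {
        if (batch.size() < kMinBatchForRpc) {
            // Not worth an RPC: evaluate here using one-sided RDMA reads.
            EvaluateLocally(batch, snapshot_timestamp, ops);
        } else {
            // One batched RPC per machine; the worker opens a read-only
            // transaction at snapshot_timestamp so all reads are consistent.
            SendOperatorBatch(host, batch, snapshot_timestamp, ops);
        }
    }
}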
If the final result set is too large to return in a single RPC, the coordinator does not return the full result set and instead returns partial results and a continuation token. The rest of the results are cached in the coordinator and can be retrieved by the client by supplying the continuation token in the next request. The continuation token encodes the coordinator host's identity in it. When a request for result retrieval using a continuation token is received by a frontend, the frontend decodes the coordinator's identity and forwards the request to the correct machine so that the rest of the results can be returned. The coordinator caches the results only for a limited time (typically 60 seconds) to conserve resources; the client is expected to retrieve all the results in that time. If the cache times out or the coordinator crashes before all the results are returned, the client is expected to restart the query. Since our typical query execution lifetime is measured in less than a second, this is not a big concern.
REPLICATION AND DISASTER RECOVERY

Although FaRM replicates data in memory 3-ways, there are situations where data can be lost, such as power loss to an entire datacenter or the coordinated failure of 3 replicas. Therefore any system built on FaRM needs to have a disaster recovery plan. A1 implements disaster recovery by replicating all data asynchronously to a durable key-value store known as ObjectStore, which is used by Bing. We will not go into the details of ObjectStore except to say that it supports the abstraction of tables, with each table containing a large number of key-value pairs. Both keys and values are schematized using Bond. Writes to ObjectStore are made durable by replicating every write 3-ways onto durable storage.

Since the replication from A1 to ObjectStore is asynchronous, in the event of a disaster ObjectStore may not contain all the writes committed to A1. To deal with this data loss, our recovery scheme supports two types of recovery: consistent recovery and best-effort recovery. In consistent recovery we recover the database to the most up-to-date transactionally consistent snapshot that exists in ObjectStore. With best-effort recovery, we do not guarantee that the recovered state of A1 will be transactionally consistent, but the database itself will be internally consistent. Let's take an example to illustrate this. Suppose we have a single transaction in A1 that adds two vertices, A and B, and an edge from A to B. We take a few example scenarios:

• We succeed in replicating A and B, but the edge is not replicated. In that case, after consistent recovery A1 will not contain A, B or the edge. On the other hand, best-effort recovery will recover both A and B, but there will be no edge between them.

• We succeed in replicating A and the edge, but not B. Again, consistent recovery will treat this as a partial transaction and will ignore the edge and A. Best-effort recovery will recover A, notice that the other end of the edge, B, is missing, and will not recover the edge. Therefore the database will be internally consistent (no dangling edges) but not transactionally consistent.

Best-effort recovery therefore always recovers the database to a state which is at least as up to date as consistent recovery, and in almost all practical cases to a more up-to-date state.

For every graph, we create two tables in ObjectStore to durably store the data: the vertex table stores all the vertices regardless of vertex type, while the edge table stores all the edges. When an update request arrives at A1, we apply the update to A1 and also transactionally insert a log entry for the update into a replication log. The replication log is itself stored in FaRM with the usual 3-copy in-memory replication guarantee. As soon as the update transaction commits, we attempt to replicate the update in the replication log to ObjectStore synchronously with the customer request. If the replication effort succeeds, then we delete the log entry and acknowledge success to the client. If the replication effort fails, we have an asynchronous replication sweeper process that scans the replication log in FIFO order, flushes the unreplicated entries to ObjectStore and, if successful, deletes the entries. We closely monitor the age of entries in the replication log to make sure we do not have too many entries in there; in the ideal case, the replication log should be empty except for entries from ongoing update transactions.
In case of a disaster, the entries in the replication log which were not replicated to ObjectStore synchronously are the ones which will be permanently lost.

When we replicate entries from the replication log to ObjectStore, we need to make sure that entries are applied in the same order as the transaction order in A1, i.e. if we stored value v1 in vertex V and then stored value v2 in V, then regardless of delays or failures in the replication pipeline, eventually, when all updates are flushed from the replication log, ObjectStore must reflect value v2 as the final value. This is achieved differently for consistent recovery and best-effort recovery. Recall that in FaRM, every write transaction is assigned a global commit timestamp which imposes a global order among all transactions that occurred in the system. In best-effort recovery, every row in the vertex or edge table has a timestamp field which corresponds to the timestamp of the FaRM transaction responsible for that update. When a new update comes in, we compare the timestamp of the existing row in ObjectStore with the update's timestamp. If the update's timestamp is newer, then the update is a later transaction and we store the update into the row. On the other hand, if the update is older than the existing content of the ObjectStore table row, then this update is a stale update and we can discard it. For create operations, we unconditionally create the new row, while for delete operations we create a tombstone row with the delete timestamp. The tombstone entry is removed either when the row is recreated with a newer timestamp or by an offline garbage collection process which removes all tombstones older than a week. For the sake of efficiency, we do not explicitly do a read-modify-write to implement this protocol: ObjectStore exposes a native API that accepts a timestamp version and achieves this in a single roundtrip. Note that this update process is idempotent: if a replication log entry is flushed multiple times, the outcome is not changed.

Consistent recovery works a little differently: in this case we treat ObjectStore as a versioned datastore. Instead of just storing ⟨key → value⟩ rows in the ObjectStore table, we augment the key with the transaction timestamp version to get the row ⟨(key, timestamp) → value⟩. Since ObjectStore supports iterating over keys in sorted order, given a key it is easy to find all versions of that key or the latest version of the key. When an update comes in with a given timestamp, we always insert it into ObjectStore. For deletes, a tombstone entry is inserted. Again, this protocol is idempotent. To recover to a consistent snapshot from this durable versioned store, we need to find a timestamp value below which all updates in A1 are also reflected in ObjectStore. To do this, A1 continually monitors the timestamp of the oldest unreplicated entry in the replication log (t_R) and stores this value durably in ObjectStore. Clearly, when t_R is made durable in ObjectStore, all writes that have a timestamp smaller than t_R are also durable in ObjectStore. On recovery, we read the value of t_R and recover using the snapshot corresponding to this timestamp.
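To make the best-effort ordering rule concrete, here is a sketch of the timestamp-conditional apply that a replication-log flusher would perform; the actual ObjectStore call does this in a single roundtrip, so the explicit read-compare-write and the types below are only illustrative assumptions.

#include <cstdint>
#include <optional>
#include <string>

// Assumed shapes for the sketch: a log entry carries the FaRM commit
// timestamp, the operation and the Bond-serialized payload; a table row
// carries the last-applied timestamp and a tombstone flag.
enum class Op { Create, Update, Delete };
struct LogEntry { std::string key; uint64_t timestamp; Op op; std::string value; };
struct Row      { uint64_t timestamp = 0; bool tombstone = false; std::string value; };

struct ObjectStoreTable {                          // assumed minimal interface
    std::optional<Row> Get(const std::string& key);
    void Put(const std::string& key, const Row& row);
};

// Flush one replication-log entry to the best-effort vertex or edge table.
void ApplyLogEntry(ObjectStoreTable& table, const LogEntry& e) {
    std::optional<Row> existing = table.Get(e.key);
    if (existing && existing->timestamp >= e.timestamp)
        return;                                    // stale update: discard it

    Row row;
    row.timestamp = e.timestamp;                   // FaRM commit timestamp
    if (e.op == Op::Delete) {
        row.tombstone = true;                      // GC removes old tombstones later
    } else {
        row.value = e.value;                       // Bond-serialized vertex/edge data
    }
    table.Put(e.key, row);                         // idempotent: replays change nothing
}

The consistent-recovery path differs only in that the row key is augmented with the timestamp, so every version is inserted unconditionally.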
A1 IN PRODUCTION

In this section we will first look at Bing's use of A1 and then focus on our experience bringing A1 into production use. A1 is designed to be a general-purpose graph database and there are multiple applications in Bing that run on top of it. In this paper, we focus on a single use case: the knowledge graph. We have already encountered a few knowledge graph serving example scenarios. The knowledge graph is generated once a day by a large-scale map-reduce job. There are real-time updates to the knowledge graph as well. The original Bing knowledge graph stack was a custom-built system with immutable storage and a regular key-value store. This prohibited real-time updates and could not handle more complex queries within the latency constraint. A1 addresses both of these shortcomings and increases the overall flexibility of the system.

In A1, the knowledge graph is designed with a semi-structured data model. All entities, whether they are films, actors, books or cities, are modeled as a single type of vertex named entity, with all attributes stored as a key-value map. This choice is necessitated by the fact that the number of different types of entities in a knowledge graph is vast (tens of thousands) and their attributes are constantly changing because we add more and more information to entities. On the other hand, we strongly type the edges, since the edge types are typically fixed and there is little data associated with edges. In practice, we have found that weak typing of vertices does not lead to significant query slowdowns while enabling more flexible data modeling. Since A1 storage is expensive, only queryable attributes of an entity are stored in A1 while non-queryable attributes like image data are stored elsewhere.

Bing receives human-generated queries like "Tom Hanks and Meg Ryan movies" which are translated to A1QL queries. The translation step is non-trivial since it requires us to map strings like "Tom Hanks", "Thomas Hanks" or even just "Hanks" to the unique actor entity Tom Hanks that we all know and love. We will not go into the complexities of query cleaning and query generation in this paper. The results of the queries are joined with data from other sources to render the final page view. For example, in this query, if we return the vertex corresponding to the film You've Got Mail, the rendering pipeline pulls together image data for the movie (e.g. the movie poster) and generates the final page. Overall, A1 improves the average latency of the knowledge serving system by 3.6X and enables significantly more complex queries.
RDMA originated in rack-scale systems and is a difficult protocol to work with in large data center networks. Since RoCEv2 doesn't come with its own congestion control, we use DCQCN [29] to enforce our own congestion control and fairness. We handle a lot of the protocol-level instability by defensive programming around communication problems. FaRM is able to recover very quickly from any host or network level failures, which ensures that users do not notice networking hiccups. In addition to RDMA Read and Write, we also make heavy use of RDMA unreliable datagrams (UD) for clock synchronization and leases. In general we have been able to achieve latencies of less than 10 microseconds within a single rack and less than 20 microseconds across racks with oversubscribed network links.
In building A1, we enhanced FaRM from the version described in [11, 12] (denoted FaRMv1) with several features into what we will call FaRMv2.

The isolation guarantee provided by FaRMv1 transactions is serializability, but combining serializability with optimistic concurrency control can lead to certain well-known problems. For example, consider two transactions: T1 reading a linked list consisting of two items A → B, and T2 deleting B from the list concurrently. Suppose the execution interleaving of the transactions is the following:

(1) T1 reads A and gets the pointer to B.
(2) T2 deletes B and commits.
(3) T1 dereferences the pointer to B, which is now pointing to invalid memory. The application reads the invalid content of B and panics.

Since the executions of T1 and T2 are serializable, T1 will abort once it attempts to commit, but even before that, the application will conclude erroneously at the last step above that the data it has read is corrupt. The solution to this problem is known as the opacity [16] property, which guarantees that even transactions that will eventually abort (e.g. T1) are serializable with respect to committed transactions (e.g. T2) and hence will no longer cause application inconsistencies at runtime.

Optimistic concurrency control can often lead to high abort rates for large transactions. A1 is an OLTP system which combines small update transactions (touching a handful of vertices/edges at most) with much larger read-only queries which can read many thousands of vertices in a single query. Since optimistic concurrency control does not acquire read locks, the large queries are susceptible to conflicts with updates and hence abort frequently.

FaRMv2 solves both of these problems by introducing a global clock which provides read and write timestamps for all transactions. These timestamps provide a global serialization order for all transactions. In addition, FaRMv2 implements MVCC (multi-version concurrency control), which ensures that read-only transactions can run conflict-free with update transactions. For details on the implementation, we refer the reader to the FaRMv2 paper [12]. The fact that all transactions can be ordered globally using their write timestamps is also used in our disaster recovery solution, which we discussed in Section 4.

Data stored in FaRM can be made durable [12] using SSDs for storage and non-volatile RAM (NVRAM) for transaction log durability. But since A1 runs on commodity machines with no NVRAM, the durability problem needs to be solved differently. There are two different durability problems that we address. First, if we lose power to the entire data center, clearly all data in memory in A1 will be lost. We consider this a disaster scenario and implement disaster recovery, which was described in Section 4.

A software outage in 3 machines across 3 failure domains can occur during deployment or due to a bug. In the case that these 3 machines host the 3 replicas of a single region, a total loss of that region will occur. This implies losing parts of the graph or index, and should be considered catastrophic. We protect against this by implementing a feature known as fast restart. In FaRM, the memory where the regions are allocated does not belong to the FaRM process itself: instead we use a kernel driver known as PyCo, which grabs large contiguous physical memory segments at boot time. When the FaRM process starts, it maps the memory segments from the driver into its own address space and allocates regions there.
Therefore, if the FaRM process crashes unexpectedly or restarts, the region data is still available in the driver's address space and the restarted process can grab it again. Note that fast restart doesn't protect against a machine crash or power cycle, because in that case the machine will reboot and the state held in the driver's memory will be lost. In FaRMv1, only the data regions were stored in PyCo memory. As part of fast restart, we moved all data needed to correctly recover after a process crash to PyCo memory; this includes region allocation metadata and transaction logs. Recall that the configuration manager (CM) is responsible for determining which regions are hosted on which machines. In case of any machine failure, if the CM detects that all replicas for any region have been lost, it pauses the whole system and all transactions are halted. In case of an accidental A1/FaRM process crash, our deployment infrastructure automatically restarts the process. Therefore the CM waits to see if the failed process or processes will come back to life, and if they come back it initiates recovery of that region's data, including all blocked transactions. Overall, fast restart has cut down the downtime for an A1 cluster by an order of magnitude.
EVALUATION

To evaluate the performance of A1 experimentally, we use a graph consisting of 3.7 billion vertices and 6.2 billion edges, which is generated from the film and entertainment knowledge base containing 22.9 billion RDF triples with 3.7 billion entities. Graph vertices represent entities and have several attributes, and edges do not have data attributes. On average every vertex has a payload of 220 bytes. Although the average vertex degree is small, the skew in the vertex degree distribution is very large and some vertices have degrees larger than ten million.

We use a cluster of 245 machines and measure end-to-end response time from a client in the same datacenter as the A1 cluster. Every machine has two Intel E5-2673 v3 2.4 GHz processors, 128GB RAM and a Mellanox Connect-X Pro NIC with 40Gbps bandwidth. A1 uses 80GB of the available RAM for storage. The total storage space available in the machines is 245 * 80GB / 3 = 6.5TB; the factor of 3 is for 3x replication. Our data occupies 3.2TB of the total available space. The machines are distributed across 15 racks and four T1 switches connect the racks. The ToR (Top of Rack) switches provide full bisection bandwidth between machines in a single rack, while the T1 switches use oversubscribed links between racks. Therefore, most of the cross-machine traffic uses oversubscribed links. Vertices are distributed at random across the machines, and therefore 99.6% (= 244/245) of a vertex's neighbors are on a remote machine. We report the average and P99 (99th percentile) latency for a few multi-hop queries which represent various types of graph queries.

Id  A1QL

Q1  { "id" : "steven.spielberg",
      "_out_edge" : { "_type" : "director.film",
        "_vertex" : {
          "_out_edge" : { "_type" : "film.actor",
            "_vertex" : { "_select" : ["_count(*)"] }}}}}

Q2  { "id" : "character.batman",
      "_out_edge" : { "_type" : "character.film",
        "_vertex" : {
          "_out_edge" : { "_type" : "film.performance",
            "_vertex" : { "str_str_map[character]" : "Batman",
              "_out_edge" : { "_type" : "performance.actor",
                "_vertex" : { "_select" : ["_count(*)"] }}}}}}}

Q3  { "id" : "steven.spielberg",
      "_out_edge" : { "_type" : "director.film",
        "_vertex" : { "_type" : "entity",
          "_select" : ["name[0]"],
          "_match" : [
            { "_out_edge" : { "_type" : "film.actor",
                "_vertex" : { "id" : "tom.hanks" }}},
            { "_out_edge" : { "_type" : "film.genre",
                "_vertex" : { "id" : "action" }}} ] }}}

Q4  { "id" : "tom.hanks",
      "_out_edge" : { "_type" : "actor.film",
        "_vertex" : {
          "_out_edge" : { "_type" : "film.actor",
            "_vertex" : {
              "_out_edge" : { "_type" : "actor.film",
                "_vertex" : { "_select" : ["_count(*)"] }}}}}}}

Table 2: Queries used to evaluate A1 performance
We focus on the following set of specific queries and see how the system performs. The actual representation of the queries in A1QL is in Table 2.

• Q1: Count actors who have worked with Steven Spielberg.
• Q2: Count actors who have played Batman.
• Q3: Action movies with Steven Spielberg and Tom Hanks.
• Q4: Count the number of films by actors who have worked with Tom Hanks.
Figure 10: Average and P99 latency for Q1.
The first query, Q1, asks for all actors that have worked with the director Steven Spielberg. This translates to a simple 2-hop query where we look up all films by Spielberg and then, for those films, find all actors that have acted in them. Figure 10 depicts the average and P99 latency for Q1, which reads a total of 49 vertices in the first hop (films by Spielberg) and 1639 vertices in the second hop (actors in those films). The total number of edges visited was 1785; this number is larger than the number of vertices because multiple edges can point to the same end vertex. By parallelizing all these reads across the cluster, we were able to complete this query in less than 8 ms on average and 14 ms at P99 at 20,000 queries/second. Note the tight spread between the average and P99 latencies, which is a consequence of A1's focus on latency. The total number of raw FaRM objects read during the query is 3443, of which only 163 are remote. In other words, we achieve more than 95% local reads through query shipping to workers.
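To make the query-shipping idea concrete, here is a hedged sketch of a Q1-style 2-hop count: first-hop vertices are grouped by the machine that owns them, so each worker can enumerate its vertices' edges with mostly local reads. The partitioning function, data layout and helper names are assumptions for illustration, not A1's implementation.

# Illustrative sketch of a 2-hop count with query shipping (not A1's actual code).
from collections import defaultdict

NUM_MACHINES = 245

def owner(vertex_id):
    # Assumption: vertices are spread across machines by a hash of their id.
    return hash(vertex_id) % NUM_MACHINES

def two_hop_count(root, out_edges, first_type, second_type):
    """Count distinct 2-hop neighbors, shipping the second hop to vertex owners."""
    first_hop = [dst for etype, dst in out_edges.get(root, []) if etype == first_type]

    # Query shipping: group first-hop vertices by owning machine so each worker
    # enumerates the edges of its own vertices with local reads.
    by_machine = defaultdict(list)
    for v in first_hop:
        by_machine[owner(v)].append(v)

    result = set()
    for machine, vertices in by_machine.items():   # these batches run in parallel in A1
        for v in vertices:
            result.update(dst for etype, dst in out_edges.get(v, []) if etype == second_type)
    return len(result)

# Toy stand-in for films directed by Spielberg and their actors.
edges = {"steven.spielberg": [("director.film", "f1"), ("director.film", "f2")],
         "f1": [("film.actor", "a1"), ("film.actor", "a2")],
         "f2": [("film.actor", "a2"), ("film.actor", "a3")]}
print(two_hop_count("steven.spielberg", edges, "director.film", "film.actor"))  # 3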
Figure 11: Total RDMA read latency (microseconds) for different numbers of read operations.
Figure 11 shows the total RDMA read latency as a function of the number of reads performed. Recall that we ship vertex predicate evaluation and edge enumeration to workers; if a worker is assigned a batch of vertices that are stored remotely, it has to issue one or more RDMA reads to fetch their data. The total time spent in RDMA reads grows roughly linearly with the number of reads, and the average RDMA read time was 17 µs.

Q2 is deceptively simple to state but more complex in its implementation: find all actors who have played Batman. This maps to the following traversal: we first look up the entity Batman and then all movies in which this entity appears. For each of those movies, we look up the performances of all actors, filter the performances by the name of the character (Batman), and then look up the actor for each matching performance. This translates to a three-hop query from character to film to performance to actor (Figure 12); the sketch after Figure 12 spells out this traversal.
Figure 12: Average and P99 latency for Q2.
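For comparison with the A1QL form of Q2 in Table 2, the following sketch spells out the same three-hop traversal, with the character predicate evaluated at the performance vertex. The dictionary-based layout and names are illustrative assumptions, not A1's data model.

# Illustrative three-hop traversal for Q2 with a predicate on the middle hop.

def actors_who_played(out_edges, attrs, character_id, character_name):
    """character -> film -> performance (filtered by character name) -> actor."""
    actors = set()
    for film in out_edges.get((character_id, "character.film"), []):
        for perf in out_edges.get((film, "film.performance"), []):
            # The predicate is evaluated at the worker owning the performance
            # vertex, so non-matching performances never trigger the third hop.
            if attrs.get(perf, {}).get("character") == character_name:
                actors.update(out_edges.get((perf, "performance.actor"), []))
    return actors

# Toy data: two performances in one film, only one of them as Batman.
out_edges = {("character.batman", "character.film"): ["film1"],
             ("film1", "film.performance"): ["p1", "p2"],
             ("p1", "performance.actor"): ["actor1"],
             ("p2", "performance.actor"): ["actor2"]}
attrs = {"p1": {"character": "Batman"}, "p2": {"character": "Joker"}}
print(actors_who_played(out_edges, attrs, "character.batman", "Batman"))  # {'actor1'}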
Query Q3 (Figure 13) represents a more complex pattern of graph exploration. The query is to find all Spielberg movies that belong to the War genre and star Tom Hanks. Here the graph pattern we are interested in is a star pattern whose center is the movie, with the movie connected to three entities: Spielberg as director, War as genre, and Tom Hanks as an actor. A similar query is to find all comedies starring both Ben Stiller and Owen Wilson.

To evaluate the maximum throughput of the system, we carried out a test using Q4. For a given actor, Q4 finds all actors he/she has worked with and then finds the films starring them. This maps to a three-hop traversal from actor to films to actors (co-stars) to their films. The goal of Q4 was to stress the system by exploring a large number of vertices rather than to be a realistic user query. On average, Q4 accesses 24,312 vertices with 33 ms latency at a throughput of 1,000 queries/second.
Figure 13: Average and P99 latency for Q3.

We pushed the cluster to 15,000 queries/second; at this throughput, Q4 executes 365 million vertex reads per second across the cluster, i.e. 1.49 million vertex reads per second for every machine in the cluster.
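The cluster-wide and per-machine read rates quoted above follow directly from the per-query vertex count and the offered load; the short check below reproduces them (purely illustrative arithmetic).

# Sanity check of the Q4 throughput numbers quoted above.
queries_per_second = 15_000
vertices_per_query = 24_312
machines = 245

cluster_reads_per_second = queries_per_second * vertices_per_query
print(f"cluster-wide: {cluster_reads_per_second/1e6:.0f}M vertex reads/s")           # ~365M
print(f"per machine:  {cluster_reads_per_second/machines/1e6:.2f}M vertex reads/s")  # ~1.49M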
Figure 14: Latency vs. throughput for different cluster sizes (10, 15, 35 and 55 machines).
Finally, to understand the scalability characteristics of A1, we created clusters of 10, 15, 35, and 55 machines with the same network configuration as the larger cluster, using a smaller dataset of 23 million vertices and 63 million edges. The dataset was distributed uniformly across the machines, and we then ran a set of 2-hop queries. For each cluster, we measured the latency at different query loads, as shown in Figure 14. As expected, the usable throughput (below a given latency) grows with the cluster size, while the latency of queries below the capacity threshold stays mostly flat as the cluster grows; the clusters share the same network topology, so this is the behavior we anticipated. For larger clusters, the expected benefit is not just throughput scalability but also capacity for bigger datasets.
RELATED WORK
Graph databases are not new in the database world, and in recent years they have attracted great interest in industry: notable efforts in the open-source space include Neo4J[20], Apache Tinkerpop[13] and DataStax Enterprise Graph[17], while AWS Neptune[4] and CosmosDb[19] are prominent cloud-based offerings. All of these systems are disk based and, apart from DataStax Enterprise Graph and CosmosDb, none of them are distributed. Traditional commercial databases like Oracle and SQL Server also now support the graph data model and associated query capabilities.

Graph data has been represented in various ways, using RDF triples as well as property-graph-like models. RDF triples have been stored directly in relational stores [8] or in more efficient columnar formats [1]. Storing RDF data in relational stores allows one to take advantage of the existing depth of relational technology. Since A1 was built from the ground up as a new system and the FaRM data model was highly conducive to building linked data structures, we opted for a property graph model rather than RDF or a relational model. Moreover, we have found that most of our customers prefer the property graph model for modeling their data.

Trinity[25, 26] from Microsoft Research is the system closest to A1 in terms of its use of in-memory storage and horizontal scalability. Compared to A1, however, Trinity lacks transactions and is not comparable in terms of performance. Facebook's TAO[5] and Unicorn[10] are two horizontally scalable systems deployed at large scale in production. TAO's query model is much more restricted than A1's in that it is not meant for large multi-hop queries, and it does not offer any consistency or atomicity guarantees. Unicorn is built more as a search engine with very limited OLTP capability but, like A1, supports highly efficient exploration queries. Since TAO and Unicorn are disk based, their storage capacity is much larger than A1's.

LinkedIn's economic graph database [7] is a very high performance graph query system designed for low-latency queries, similar to A1. It scales up vertically and can answer lookup queries in nanoseconds, while A1 operates in microseconds. Overall, A1 has taken the approach of scaling out on cheap commodity hardware while using RDMA to keep query latency low, whereas the LinkedIn database relies on fixed sharding and specialized hardware to achieve its performance.

As the price of RAM has fallen, building distributed in-memory storage systems[22] for low-latency applications has become very attractive. The combination of RAM storage and RDMA networks is a newer development, and research systems like FaRM[11, 12] and NAMDb[28, 30] have shown the advantages of using RDMA to build scale-out transactional databases. NAMDb has studied in detail the performance benefits of building remote-pointer-based data structures such as B+-trees over RDMA. The challenges of designing data structures optimized for remote memory have been considered by Aguilera et al. [2, 3]. RDMA has also been used to build high performance RDF engines[27] and file systems[6]. However, the adoption of RDMA in industry has been limited by the difficulty of ensuring proper fairness and congestion control in large data center deployments with commodity hardware[29]. We believe the example of A1 will strengthen the case for widespread adoption of RDMA in cloud data centers.
CONCLUSION

Building a generic database is a complex problem. A1 was designed to work in a space with huge data volume, a wide variety of data sources and update frequencies, and strict requirements to answer queries with very low latency.

Distributed systems are complex to program and operate, and we chose to implement transaction support to hide the complexities of availability, replication and durability in the face of machine failure. The connected nature of graph data made it even more important to ensure correctness at all times. In our experience, developer productivity was high thanks to transaction support. Furthermore, the natural property graph model was intuitive and powerful for building search-oriented applications in Bing.

FaRM and A1 exploit the benefits of RDMA to a great extent, and the performance achieved makes more complex question answering possible at scale, within latencies acceptable for interactive search. FaRM was originally designed to support relational systems, but our work also shows that it is general enough to serve as a very efficient programming model for low-latency systems at scale.
ACKNOWLEDGMENTS

A1 is built on top of the work by the FaRM team: in particular we would like to thank Aleksandar Dragojevic, Dushyanth Narayanan, Ed Nightingale and our product manager Dana Cozmei. Getting A1 into production would not have been possible without the help and support of the ObjectStore team: Sam Bayless, Jason Li, Maya Mosyak, Vikas Sabharwal, Junhua Wang and Bill Xu. The feedback we received from our customers in Bing was invaluable in guiding our roadmap and features. We would also like to thank our anonymous reviewers, whose feedback improved the paper in multiple ways.
REFERENCES

[1] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, and Kate Hollenbach. 2007. Scalable Semantic Web Data Management Using Vertical Partitioning. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07). VLDB Endowment, 411–422. http://dl.acm.org/citation.cfm?id=1325851.1325900
[2] Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, Jayneel Gandhi, Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, and Michael Wei. 2017. Remote Memory in the Age of Fast Networks. In Proceedings of the 2017 Symposium on Cloud Computing (SoCC '17). ACM, New York, NY, USA, 121–127. https://doi.org/10.1145/3127479.3131612
[3] Marcos K. Aguilera, Kimberly Keeton, Stanko Novakovic, and Sharad Singhal. 2019. Designing Far Memory Data Structures: Think Outside the Box. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS '19). ACM, New York, NY, USA, 120–126. https://doi.org/10.1145/3317550.3321433
[4] Amazon.com. [n. d.]. AWS Neptune. https://aws.amazon.com/neptune/
[5] Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani. 2013. TAO: Facebook's Distributed Data Store for the Social Graph. In Proceedings of the 2013 USENIX Annual Technical Conference (USENIX ATC '13). USENIX Association, Berkeley, CA, USA, 49–60. http://dl.acm.org/citation.cfm?id=2535461.2535468
[6] Wei Cao, Zhenjun Liu, Peng Wang, Sen Chen, Caifeng Zhu, Song Zheng, Yuhui Wang, and Guoqing Ma. 2018. PolarFS: An Ultra-low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Database. Proc. VLDB Endow. 11, 12 (Aug. 2018), 1849–1862. https://doi.org/10.14778/3229863.3229872
[7] Andrew Carter, Andrew Rodriguez, Yiming Yang, and Scott Meyer. 2019. Nanosecond Indexing of Graph Data With Hash Maps and VLists. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). ACM, New York, NY, USA, 623–635. https://doi.org/10.1145/3299869.3314044
[8] Eugene Inseok Chong, Souripriya Das, George Eadon, and Jagannathan Srinivasan. 2005. An Efficient SQL-based RDF Querying Scheme. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB '05). VLDB Endowment, 1216–1227. http://dl.acm.org/citation.cfm?id=1083592.1083734
[9] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google's Globally-Distributed Database. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12).
[10] Michael Curtiss et al. 2013. Unicorn: A System for Searching the Social Graph. Proc. VLDB Endow. 6, 11 (Aug. 2013), 1150–1161. https://doi.org/10.14778/2536222.2536239
[11] Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. 2014. FaRM: Fast Remote Memory. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI '14). USENIX Association, Berkeley, CA, USA, 401–414. http://dl.acm.org/citation.cfm?id=2616448.2616486
[12] Aleksandar Dragojević, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. 2015. No Compromises: Distributed Transactions with Consistency, Availability, and Performance. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15). ACM, New York, NY, USA, 54–70. https://doi.org/10.1145/2815400.2815425
[13] Apache Software Foundation. [n. d.]. Apache Tinkerpop. http://tinkerpop.apache.org/
[14] Freebase. [n. d.]. Metaweb Query Language. https://github.com/nchah/freebase-mql
[15] Google. [n. d.]. Google Knowledge Graph. https://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html
[16] Rachid Guerraoui and Michal Kapalka. 2008. On the Correctness of Transactional Memory. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '08). ACM, New York, NY, USA.
[22] John Ousterhout et al. 2015. The RAMCloud Storage System. ACM Trans. Comput. Syst. 33, 3, Article 7 (Aug. 2015), 55 pages. https://doi.org/10.1145/2806887
[23] Ian Robinson, Jim Webber, and Emil Eifrem. 2013. Graph Databases. O'Reilly Media, Inc.
[24] Alex Shamis, Matthew Renzelmann, Stanko Novakovic, Georgios Chatzopoulos, Aleksandar Dragojevic, Dushyanth Narayanan, and Miguel Castro. 2019. Fast General Distributed Transactions with Opacity. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). ACM, New York, NY, USA.
[25] Proc. VLDB Endow.
[26] In Proceedings of SIGMOD 2013.
[27] Jiaxin Shi, Youyang Yao, Rong Chen, Haibo Chen, and Feifei Li. 2016. Fast and Concurrent RDF Queries with RDMA-Based Distributed Graph Exploration. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 317–332. http://dl.acm.org/citation.cfm?id=3026877.3026902
[28] Erfan Zamanian, Carsten Binnig, Tim Kraska, and Tim Harris. 2017. The End of a Myth: Distributed Transactions Can Scale. Proc. VLDB Endow. 10, 6 (2017), 685–696. https://doi.org/10.14778/3055330.3055335
[29] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. 2015. Congestion Control for Large-Scale RDMA Deployments. In SIGCOMM 2015. ACM.