Fragmented Objects: Boosting Concurrency of Shared Large Objects
Antonio Fernandez Anta, Chryssis Georgiou, Theophanis Hadjistasi, Nicolas Nicolaou, Efstathios Stavrakis, Andria Trigeorgi
Antonio Fernández Anta, Chryssis Georgiou, Theophanis Hadjistasi, Nicolas Nicolaou, Efstathios Stavrakis, and Andria Trigeorgi
IMDEA Networks Institute, Madrid, Spain, [email protected]
University of Cyprus, Nicosia, Cyprus, {chryssis, atrige01}@cs.ucy.ac.cy
Algolysis Ltd, Limassol, Cyprus, {theo, nicolas, stathis}@algolysis.com

Abstract.
This work examines strategies to handle large shared data objects in distributed storage systems (DSS), while boosting the number of concurrent accesses, maintaining strong consistency guarantees, and ensuring good operation performance. To this respect, we define the notion of fragmented objects: concurrent objects composed of a list of fragments (or blocks) that allow operations to manipulate each of their fragments individually. As the fragments belong to the same object, it is not enough for each fragment to be linearizable to obtain useful consistency guarantees for the composed object. Hence, we capture the consistency semantics of the whole object with the notion of fragmented linearizability. Then, considering that a variant of linearizability, coverability, is more suited for versioned objects like files, we provide an implementation of a distributed file system, called CoBFS, that utilizes coverable fragmented objects (i.e., files). In CoBFS, each file is a linked-list of coverable block objects. A preliminary emulation of CoBFS demonstrates the potential of our approach in boosting the concurrency of strongly consistent large objects.
Keywords: Distributed storage · Large objects · Linearizability · Coverability.
In this paper we deal with the storage and use of shared readable and writable data in unreliable distributed systems. Distributed systems are subject to perturbations, which may include failures (e.g., crashes) of individual computers, or delays in processing or communication. In such settings, large (in size) objects are difficult to handle. Even more challenging is to provide linearizable consistency guarantees for such objects. Researchers usually break large objects into smaller linearizable building blocks, with their composition yielding the complete consistent large object. For example, a linearizable shared R/W memory is composed of a set of linearizable shared R/W objects [2]. By design, those building blocks are usually independent, in the sense that changing the value of one does not affect the operations performed on the others, and that operations on the composed objects are defined in terms of operations invoked on the (smallest possible) building blocks. Operations on individual linearizable registers do not violate the consistency of the larger composed linearizable memory space.

Some large objects, however, cannot be decomposed into independent building blocks. For example, a file object can be divided into fragments or blocks, so that write operations (which are still issued on the whole file) modify individual fragments. However, the composition of these fragments does not yield a linearizable file object: it is unclear how to order writes on the file when those are applied on different blocks concurrently. At the same time, it is practically inefficient to handle large objects as single objects and use traditional algorithms (like the one in [2]) to distribute them consistently.

(⋆) Supported by the Cyprus Research and Innovation Foundation under the grant agreement POST-DOC/0916/0090.

Related work:
Attiya, Bar-Noy and Dolev [2] proposed an algorithm, colloquially referred to as ABD, that emulates a distributed shared R/W register in message-passing, crash-prone, asynchronous environments. To ensure availability, the object is replicated among a set of servers, and to provide operation ordering, a logical timestamp is associated with each written value. ABD tolerates replica server crashes, provided a majority of servers do not fail. Write operations involve a single communication round-trip: the writer broadcasts its request to all servers and terminates once it collects acknowledgments from some majority of servers. A read involves two round-trips. In the first, the reader broadcasts a request to all servers, collects acknowledgments from some majority of servers, and discovers the maximum timestamp. To ensure that any subsequent read will return a value associated with a timestamp at least as high as the discovered maximum, the reader propagates the value associated with the maximum timestamp to at least a majority of servers before completion, forming the second round-trip. ABD was later extended to the multi-writer/multi-reader model in [21], and its performance was subsequently improved by several works, including [11,16,17,13,15]. Those solutions considered small objects and relied on the dissemination of the object values in each operation, imposing a performance overhead when dealing with large objects.

Fan and Lynch [12] attempted to reduce performance overheads by separating the metadata of large objects from their value. In this way, communication-demanding operations were performed on the metadata, while large objects were transmitted to a limited number of hosts, and only when it was "safe" to do so. Although this work improved the latency of operations compared to traditional approaches like [2,21], it still required transmitting the entire large object over the network per read and write operation.
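The quorum pattern underlying ABD can be illustrated with a minimal, in-memory sketch (our own, not the paper's code). Real ABD exchanges asynchronous messages and waits for replies from any majority; here, for brevity, we simply contact the first majority of a fixed server list.

```python
# Illustrative sketch of ABD's round-trips, with in-memory "servers"
# standing in for asynchronous message passing.

class Server:
    def __init__(self):
        self.ts, self.value = 0, None

    def query(self):                 # first round-trip of a read: report local pair
        return self.ts, self.value

    def store(self, ts, value):      # write / read write-back: adopt newer pair
        if ts > self.ts:
            self.ts, self.value = ts, value

def abd_write(servers, ts, value):
    # Single round-trip: propagate <ts, value> to a majority of servers.
    majority = len(servers) // 2 + 1
    for s in servers[:majority]:
        s.store(ts, value)

def abd_read(servers):
    majority = len(servers) // 2 + 1
    # Round-trip 1: discover the maximum timestamp among a majority.
    ts, value = max((s.query() for s in servers[:majority]),
                    key=lambda pair: pair[0])
    # Round-trip 2: write the maximum pair back so later reads see it.
    for s in servers[:majority]:
        s.store(ts, value)
    return value

servers = [Server() for _ in range(5)]
abd_write(servers, ts=1, value="data")
print(abd_read(servers))   # prints: data
```

Note how the read's second round-trip is what guarantees that a subsequent read cannot return an older value; the optimization discussed later in this paper targets exactly the cost of shipping the value in these exchanges.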
Moreover, if two concurrent write operations affected different "parts" of the object, only one of them would prevail, despite the updates not being directly "conflicting."

Recently, Erasure-Coded (EC) approaches have gained momentum and have proved extremely effective in saving storage and communication costs while maintaining strong consistency and fault-tolerance [6,7,10,19,20,8,28,23]. EC approaches rely on the division of a shared object into coded blocks, delivering a single block to each data server. While very appealing for handling large objects, they face the challenge of efficiently encoding/decoding data. Despite the object being subdivided into several fragments, reads and writes are still applied on the entire object value. Therefore, multiple writers cannot work simultaneously on different parts of an object.
Value continuity is important when considering large objects, but it is oftentimes overlooked by distributed shared object implementations. In files, for example, a write operation should extend the latest written version of the object, and not overwrite any newer value.
Coverability was introduced in [24] as a consistency guarantee that extends linearizability and concerns versioned objects. An implementation of a coverable (versioned) object was presented, where ABD-like reads return both the version and the value of the object. Writes, on the other hand, attempt to write a "versioned" value on the object. If the reported version is older than the latest, then the write does not take effect and is converted into a read operation, preventing overwriting a newer version of the object.
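The versioned-write rule just described can be sketched as follows (a hedged simplification; the class and method names mirror the cvr-read/cvr-write interface of [24] but are our own stand-ins, with no replication or quorums):

```python
# Sketch of coverability's write rule: a write must quote the object's
# current version; a stale write is converted into a read.

class CoverableObject:
    def __init__(self):
        self.version, self.value = 0, None

    def cvr_read(self):
        return self.version, self.value

    def cvr_write(self, value, version):
        if version == self.version:    # caller quoted the current version
            self.version += 1
            self.value = value
            return ("written", self.version, self.value)
        # Stale version: the write takes no effect and acts as a read,
        # returning the latest version and value.
        return ("read", self.version, self.value)

obj = CoverableObject()
ver, _ = obj.cvr_read()
print(obj.cvr_write("v1", ver))    # prints: ('written', 1, 'v1')
print(obj.cvr_write("lost", ver))  # stale quote, prints: ('read', 1, 'v1')
```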
Contributions:
In this work we set the goal to study and formally define the consistency guarantees we can provide when fragmenting a large R/W object into smaller objects (blocks), so that operations are still issued on the former but are applied on the latter. In particular, the contributions of this paper are as follows:

– We define two types of concurrent objects: (i) the block object, and (ii) the fragmented object. Blocks are treated as R/W objects, while fragmented objects are defined as lists of block objects (Section 3).
– We examine the consistency properties obtained when allowing R/W operations on individual blocks of the fragmented object, in order to enable concurrent modifications. Assuming that each block is linearizable, we define the precise consistency that the fragmented object provides, termed Fragmented Linearizability (Section 4).
– We provide an algorithm that implements coverable fragmented objects. Then, we use it to build a prototype implementation of a distributed file system, called CoBFS, by representing each file as a linked-list of coverable block objects. CoBFS adopts a modular architecture, separating the object fragmentation process from the shared memory service, which allows different fragmentation strategies and shared memory implementations to be used. We show that CoBFS preserves the validity of the fragmented object and satisfies fragmented coverability (Section 5).
– We describe an experimental development and deployment of CoBFS on the Emulab testbed [1]. Preliminary results are presented, comparing our proposed algorithm to its non-fragmented counterpart. Results suggest that a fragmented object implementation boosts concurrency while reducing the latency of operations (Section 6).
Model: We are concerned with the implementation of highly-available replicated concurrent objects that support a set of operations. The system is a collection of crash-prone, asynchronous processors with unique identifiers (ids) from a totally-ordered set I, composed of two main disjoint sets of processes: (a) a set C of client process ids that may perform operations on a replicated object, and (b) a set S of server process ids, each holding a replica of the object. Let I = C ∪ S. Processors communicate by exchanging messages via asynchronous point-to-point reliable channels; messages may be reordered. (Reliability is not necessary for the correctness of the algorithms we present; it is assumed only for simplicity of presentation.) Any subset of client processes, and up to a minority of servers (less than |S|/2), may crash at any time in an execution.
Executions, histories and operations: An execution ξ of a distributed algorithm A is an alternating sequence of states and actions of A reflecting the evolution in real time of the execution. A history H_ξ is the subsequence of the actions in ξ. We say that an operation π is invoked (starts) in an execution ξ when the invocation action of π appears in H_ξ, and π responds to the environment (ends or completes) when the response action appears in H_ξ. An operation is complete in ξ when both its invocation and matching response actions appear in H_ξ, in that order. A history H_ξ is sequential if it starts with an invocation action and each invocation is immediately followed by its matching response; otherwise, H_ξ is concurrent. Finally, H_ξ is complete if every invocation in H_ξ has a matching response in H_ξ (i.e., each operation in ξ is complete). We say that an operation π precedes in real time an operation π′ (or π′ succeeds in real time π) in an execution ξ, denoted by π → π′, if the response of π appears before the invocation of π′ in H_ξ. Two operations are concurrent if neither precedes the other.

Consistency: We consider linearizable [18] R/W objects. A complete history H_ξ is linearizable if there exists some total order on the operations in H_ξ s.t. it respects the real-time order → of operations and is consistent with the semantics of the operations. Note that we use read and write in an abstract way: (i) write represents any operation that changes the state of the object, and (ii) read any operation that returns that state.

Fragmented objects: A fragmented object is a concurrent object (i.e., it can be accessed concurrently by multiple processes) that is composed of a finite list of blocks. Section 3.1 formally defines the notion of a block, and Section 3.2 gives the formal definition of a fragmented object.

A block b is a concurrent R/W object with a unique identifier from a set B.
A block has a value val(b) ∈ Σ*, extracted from an alphabet Σ. For performance reasons it is convenient to bound the block length. Hence, we denote by B_ℓ ⊂ B the set containing bounded-length blocks, s.t. ∀ b ∈ B_ℓ, |val(b)| ≤ ℓ. We use |b| to denote the length of the value of b when convenient. An empty block is a block b whose value is the empty string ε, i.e., |b| = 0. Operation create(b, D) is used to introduce a new block b ∈ B_ℓ, initialized with value D, such that |D| ≤ ℓ. Once created, block b supports the following two operations: (i) read()_b, which returns the value of the object b, and (ii) write(D)_b, which sets the value of the object b to D, where |D| ≤ ℓ. A block object is linearizable if it satisfies the linearizability properties [22,18] with respect to its create (which acts as a write), read, and write operations. Once created, a block object is an atomic register [22] whose value cannot exceed a predefined length ℓ.

A fragmented object f is a concurrent R/W object with a unique identifier from a set F. Essentially, a fragmented object is a sequence of blocks from B, with a value val(f) = ⟨b_0, b_1, ..., b_n⟩, where b_i ∈ B, for i ∈ [0, n]. Initially, each fragmented object contains
, b n (cid:105) s.t. val ( b i ) = D i , ∀ i ∈ [0 , n ] .Having the write operation to modify the values of all blocks in the list may hin-der in many cases the concurrency of the object. For instance, consider the followingexecution ξ . Let val ( f ) = (cid:104) b , b (cid:105) , val ( b ) = D , val ( b ) = D , and assume that ξ contains two concurrent writes by two different clients, one attempting to modifyblock b , and the other attempting to modify block b : π = write ( (cid:104) D (cid:48) , D (cid:105) ) f and π = write ( (cid:104) D , D (cid:48) (cid:105) ) f , followed by a read () f . By linearizability, the read will returneither the list written in π or in π on f (depending on how the operations are orderedby the linearizability property). However, as blocks are independent objects, it wouldbe expected that both writes could take effect, with π updating the value of b and π updating the value of b . To this respect, we redefine the write to only update one of theblocks of a fragmented object. Since the update does not manipulate the value of thewhole object, which would include also new blocks to be written, it should allow theupdate of a block b with a value | D | > (cid:96) . This essentially leads to the generation of newblocks in the sequence. More formally, the update operation is defined as follows: – update ( b i , D ) f updates the value of block b i ∈ f such that: • if | D | ≤ (cid:96) : sets val ( b i ) = D ; • if | D | > (cid:96) : partition D = { D , . . . , D k } such that | D j | ≤ (cid:96), ∀ j ∈ [0 , k ] , set val ( b i ) = D and create blocks b ji , for j ∈ [1 , k ] with val ( b ji ) = D j , so that f remains valid.With the update operation in place, fragmented objects resemble store-collect ob-jects presented in [3]. However, fragmented objects aim to minimize the communicationoverhead by exchanging individual blocks (in a consistent manner) instead of exchang-ing the list (view) of block values in each operation. 
Since the update operation only affects one block in the list of blocks of a fragmented object, it potentially allows a higher degree of concurrency. It is still unclear, however, what consistency guarantees we can provide when allowing concurrent updates on different blocks to take effect. Thus, we will consider that only read and update operations are issued on fragmented objects. Note that the list of blocks of a fragmented object cannot be reduced; the contents of a block can be deleted by invoking an update with an empty value.

Observe that, as a fragmented object is composed of block objects, its operations are implemented using the read, write, and create block operations. The read()_f performs a sequence of block read operations (starting from block b_0 and traversing the list of blocks) to obtain and return the value of the fragmented object. Regarding update operations, if |D| ≤ ℓ, then the update(b_i, D)_f operation performs a write operation on block b_i as write(D)_{b_i}. However, if |D| > ℓ, then D is partitioned into substrings D_0, ..., D_k, each of length at most ℓ. The update operation modifies the value of b_i as write(D_0)_{b_i}. Then, k new blocks b_i^1, ..., b_i^k are created as create(b_i^j, D_j), ∀ j ∈ [1, k], and are inserted in f between b_i and b_{i+1} (or appended at the end if i = |f|). The sequential specification of a fragmented object is defined as follows:
Definition 1 (Sequential Specification). The sequential specification of a fragmented object f ∈ F_ℓ over a complete sequential history H is defined as follows. Initially, val(f) = ⟨b_0⟩ with val(b_0) = ε. Suppose that at the invocation action of an operation π in H we have val(f) = ⟨b_0, ..., b_n⟩ with ∀ b_i ∈ f, val(b_i) = D_i and |D_i| ≤ ℓ. Then:

– if π is a read()_f, then π returns ⟨val(b_0), ..., val(b_n)⟩. At the response action of π, it still holds that val(f) = ⟨b_0, ..., b_n⟩ and ∀ b_i ∈ f, val(b_i) = D_i;
– if π is an update(b_i, D)_f operation, b_i ∈ f, then at the response action of π, ∀ j ≠ i, val(b_j) = D_j, and
  • if |D| ≤ ℓ: val(f) = ⟨b_0, ..., b_n⟩ with val(b_i) = D;
  • if |D| > ℓ: val(f) = ⟨b_0, ..., b_i, b_i^1, ..., b_i^k, b_{i+1}, ..., b_n⟩, such that val(b_i) = D_0 and val(b_i^j) = D_j, ∀ j ∈ [1, k], where D = D_0 | D_1 | ... | D_k and |D_j| ≤ ℓ, ∀ j ∈ [0, k].

A fragmented object is linearizable if it satisfies both the Liveness (termination) and Linearizability (atomicity) properties [22,18]. A fragmented object implemented by a single linearizable block is trivially linearizable as well. Here, we focus on fragmented objects that may contain a list of multiple linearizable blocks, and consider only read and update operations. As defined, update operations are applied on single blocks, which allows multiple update operations to modify different blocks of the fragmented object concurrently. Termination holds since read and update operations on the fragmented object always complete. It remains to examine the consistency properties.
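As a concrete illustration of the definitions so far, the following sketch (our own simplification: a Python list of block objects rather than a distributed linked list, with an assumed bound ℓ = 4) shows blocks with bounded values and the update case split of Definition 1, where an oversized D is split and the extra chunks become new blocks spliced after b_i:

```python
# Blocks with values bounded by ℓ, and update(b_i, D)_f splicing new
# blocks after b_i when |D| > ℓ, as in Definition 1.

L = 4  # block length bound ℓ (illustrative value)

class Block:
    def __init__(self, bid, data=""):      # create(b, D), |D| <= ℓ
        assert len(data) <= L
        self.bid, self.data = bid, data

    def read(self):                         # read()_b
        return self.data

    def write(self, data):                  # write(D)_b, |D| <= ℓ
        assert len(data) <= L
        self.data = data

def read_f(f):
    """read()_f: the list of block values, in order."""
    return [b.read() for b in f]

def update_f(f, i, D):
    """update(b_i, D)_f: write D_0 to b_i, create blocks for D_1..D_k."""
    chunks = [D[j:j + L] for j in range(0, max(len(D), 1), L)]
    f[i].write(chunks[0])
    new = [Block((f[i].bid, k + 1), c) for k, c in enumerate(chunks[1:])]
    f[i + 1:i + 1] = new                    # splice b_i^1..b_i^k after b_i

f = [Block("b0", "aaaa"), Block("b1", "bbbb")]
update_f(f, 0, "xxxxyyyyzz")
print(read_f(f))   # prints: ['xxxx', 'yyyy', 'zz', 'bbbb']
```

Note that, as in the text, the list of blocks never shrinks: an update with an empty value leaves an empty block in place.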
Linearizability:
Let H_ξ be a history of update and read invocations and responses on a fragmented object f. Linearizability [22,18] provides the illusion that the fragmented object is accessed sequentially, respecting the real-time order, even when operations are invoked concurrently:

Definition 2 (Linearizability). A fragmented object f is linearizable if, given any complete history H, there exists a permutation σ of all actions in H such that:

– σ is a sequential history and follows the sequential specification of f, and
– for operations π_1, π_2, if π_1 → π_2 in H, then π_1 appears before π_2 in σ.

Observe that, in order to satisfy Definition 2, the operations must be totally ordered. Let us consider again the sample execution ξ from Section 3. Since we decided not to use write operations, the execution changes as follows. Initially, val(f) = ⟨b_0, b_1⟩, val(b_0) = D_0, val(b_1) = D_1, and then ξ contains two concurrent update operations by two different clients, one attempting to modify the first block and the other attempting to modify the second block: π_1 = update(b_0, D_0′)_f and π_2 = update(b_1, D_1′)_f (|D_0′| ≤ ℓ and |D_1′| ≤ ℓ), followed by a read()_f operation. In this case, since both update operations operate on different blocks, independently of how π_1 and π_2 are ordered in the permutation σ, the read()_f operation will return ⟨D_0′, D_1′⟩. Therefore, the use of these update operations has increased the concurrency in the fragmented object. Using linearizable read operations on the entire fragmented object can ensure the linearizability of the fragmented object, as seen in the example presented in Figure 1(a).

(Footnotes: The operator "|" denotes concatenation; the exact way D is partitioned is left to the implementation. Our formal definition of linearizability is adapted from [4].)

Fig. 1: Executions showing the operations on a fragmented object. Fig. (a) shows linearizable reads on the fragmented object (and serialization points), and (b) reads on the fragmented object that are implemented with individual linearizable reads on blocks.
However, providing a linearizable read when the object involves multiple R/W objects (i.e., an atomic snapshot) can be expensive or can impact concurrency [9]. Thus, it is cheaper to take advantage of the atomic nature of the individual blocks and invoke one read operation per block of the fragmented object. But what consistency guarantee can we then provide on the entire fragmented object? As seen in the example of Fig. 1(b), two reads concurrent with two update operations may violate linearizability on the entire object. According to the real-time ordering of the operations on the individual blocks, block linearizability is preserved if the first read on the fragmented object returns (D_0′, D_1), while the second read returns (D_0, D_1′). Note that we cannot find a permutation of these concurrent operations that follows the sequential specification of the fragmented object; thus, the execution in Figure 1(b) violates linearizability. This leads to the definition of fragmented linearizability on the fragmented object, which, relying on the fact that each individual block is linearizable, allows executions like the one seen in Fig. 1(b). Essentially, fragmented linearizability captures the consistency one can obtain on a collection of linearizable objects, when these are accessed concurrently and individually, but under the "umbrella" of the collection.

In this respect, we specify each read()_f operation of a certain process as a sequence of read()_b operations, one on each block b ∈ f, by that process. In particular, a read operation read()_f that returns ⟨val(b_0), ..., val(b_n)⟩ is specified by n + 1 individual read operations read()_{b_0}, ..., read()_{b_n}, that return val(b_0), ..., val(b_n), respectively, where read()_{b_0} → ... → read()_{b_n}. Then, given a history H, we denote for an operation π the history H^π which contains the actions extracted from H and performed during π (including its invocation and response actions). Hence, if val(f) is the value returned by read()_f, then H^{read()_f} contains an invocation and matching response for a read()_b operation, for each b ∈ val(f). Then, from H, we can construct a history H|f that only contains operations on the whole fragmented object.
In particular, H|f is the same as H with the following changes: for each read()_f, if ⟨val(b_0), ..., val(b_n)⟩ is the value returned by the read operation, then we replace the invocation of the read()_{b_0} operation with the invocation of the read()_f operation, and the response of the read()_{b_n} operation with the response action of the read()_f operation. Then we remove from H|f all the remaining actions in H^{read()_f}.

Definition 3 (Fragmented Linearizability).
Let f ∈ F_ℓ be a fragmented object, H a complete history on f, and val(f)_H ⊆ B the value of f at the end of H. Then, f is fragmented linearizable if, for every b ∈ val(f)_H, there exists a permutation σ_b of all the actions on b in H such that:

– σ_b is a sequential history that follows the sequential specification of b, and
– for operations π_1, π_2 that appear in the history H|f extracted from H, if π_1 → π_2 in H|f, then all operations on b in H^{π_1} appear before any operation on b in H^{π_2} in σ_b.

Fragmented linearizability guarantees that all concurrent operations on different blocks prevail, and only concurrent operations on the same blocks are conflicting. Consider two reads r_1 and r_2, s.t. r_1 → r_2; then r_2 must return a supersequence of blocks with respect to the sequence returned by r_1, and, for each block belonging to both sequences, the value returned by r_2 is the same as or newer than the one returned by r_1.

Having laid out the theoretical framework of fragmented objects, we now present a prototype implementation of a distributed file system, which we call CoBFS. When manipulating files, it is expected that a value update builds upon the current value of the object. In such cases a writer should be aware of the latest value of the object (i.e., by reading the object) before updating it. In order to maintain this property in our implementation, we utilize coverable linearizable blocks as presented in [24]. Coverability extends linearizability with the additional guarantee that object writes succeed when associating the written value with the "current" version of the object. Otherwise, a write operation becomes a read operation and returns the latest version and the associated value of the object. Due to space limitations, we refer the reader to [24] for the exact coverability properties. By utilizing coverable blocks, our file system provides fragmented coverability as a consistency guarantee.
In our prototype implementation we consider each object to be a plain text file; however, the underlying theoretical formulation allows extending this implementation to support any kind of large object.
File as a coverable fragmented object:
Each file is modeled as a fragmented object whose blocks are coverable objects. The file is implemented as a linked-list of blocks, with the first block being a special block b_g ∈ B, which we call the genesis block, and each block having a pointer ptr to its next block, while the last block has a null pointer. Initially, each file contains only the genesis block; the genesis block contains special-purpose (meta)data. The value of a block b is set as a tuple, val(b) = ⟨ptr, data⟩. (The sequential specification of a block is similar to that of a R/W register [22] whose value has bounded length.)
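The linked-list layout just described can be sketched as follows (identifiers and the in-memory dictionary of blocks are our own stand-ins for blocks stored in the shared memory):

```python
# A file as a linked list of blocks headed by a genesis block, each block
# holding val(b) = <ptr, data>.

class FileBlock:
    def __init__(self, bid, data="", ptr=None):
        self.bid = bid
        self.data = data   # contents (metadata, for the genesis block)
        self.ptr = ptr     # id of the next block; None for the last block

def read_file(blocks, genesis_id):
    """Traverse from the genesis block, concatenating block contents."""
    out, bid = [], blocks[genesis_id].ptr   # skip the genesis metadata
    while bid is not None:
        out.append(blocks[bid].data)
        bid = blocks[bid].ptr
    return "".join(out)

blocks = {
    "g":  FileBlock("g", data="", ptr="b1"),  # genesis block
    "b1": FileBlock("b1", "hello ", "b2"),
    "b2": FileBlock("b2", "world", None),
}
print(read_file(blocks, "g"))   # prints: hello world
```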
Fig. 2: Basic architecture of CoBFS
Overview of the Basic Architecture:
The basic architecture of CoBFS appears in Fig. 2. CoBFS is composed of two main modules: (i) a Fragmentation Module (FM), and (ii) a Distributed Shared Memory Module (DSMM). In summary, the FM implements the fragmented object, while the DSMM implements an interface to a shared memory service that allows read/write operations on individual block objects. Following this architecture, clients may access the file system through the FM, while the blocks of each file are maintained by servers through the DSMM. The FM uses the DSMM as an external service to write and read blocks to and from the shared memory. In this respect, CoBFS is flexible enough to utilize any underlying distributed shared object algorithm.
File and block id assignment:
A key aspect of our implementation is the unique assignment of ids to both fragmented objects (i.e., files) and individual blocks. A file f ∈ F is assigned a pair ⟨cfid, cfseq⟩ ∈ C × N, where cfid ∈ C is the universally unique identifier of the client that created the file (i.e., the owner), and cfseq ∈ N is the client's local sequence number, incremented every time the client creates a new file, ensuring uniqueness of the objects created by the same client.

In turn, a block b ∈ B of a file is identified by a triplet ⟨fid, cid, cseq⟩ ∈ F × C × N, where fid ∈ F is the identifier of the file to which the block belongs, cid ∈ C is the identifier of the client that created the block (not necessarily the owner/creator of the file), and cseq ∈ N is the client's local sequence number of blocks, incremented every time this client creates a block for this file (this ensures the uniqueness of the blocks created by the same client for the same file).

Distributed Shared Memory Module:
The DSMM implements a distributed R/W shared memory based on an optimized coverable variant of the ABD algorithm, called CoABD [24]. The module exposes three operations for a block b: dsmm-read()_b, dsmm-write(v)_b, and dsmm-create(v)_b. The specification of each operation is shown in Algorithm 1. For each block b, the DSMM maintains its latest known version ver_b and the associated value val_b. Upon receipt of a read request for a block b, the DSMM invokes a cvr-read operation on b and returns the value received from that operation.

Algorithm 1 DSM Module: Operations on a coverable block object b at client p

State Variables: ver_b ∈ N, initially 0; val_b ∈ V, initially ⊥

function dsmm-read()_{b,p}
  ⟨val_b, ver_b⟩ ← b.cvr-read()
  return val_b
end function

function dsmm-create(val)_{b,p}
  ⟨val_b, ver_b⟩ ← b.cvr-write(val, 0)
end function

function dsmm-write(val)_{b,p}
  ⟨val_b, ver_b⟩ ← b.cvr-write(val, ver_b)
  return val_b
end function

To reduce the number of blocks transmitted per read, we apply a simple yet very effective optimization (Algorithm 2): a read sends a READ request to all the servers, including its local tag in the request message. When a server receives a READ request, it replies with both its local tag and the block content only if the tag enclosed in the READ request is smaller than its local tag; otherwise it replies with its local tag without the block content. Once the reader receives replies from a majority of servers, it detects the maximum tag among the replies and checks whether it is higher than its locally known tag. If it is, then it forwards that tag and its associated block content to a majority of servers; if not, the read operation returns the locally known tag and block content without performing the second phase.

Algorithm 2 Optimized coverable ABD (read operation)

at each reader r for object b
State Variables: tg_b ∈ N × W, initially ⟨0, ⊥⟩; val_b ∈ V, initially ⊥

function cvr-read()
  send ⟨READ, tg_b⟩ to all servers                ▷ Query Phase
  wait until (|S|+1)/2 servers reply
  maxP ← max({⟨tg′, v′⟩ received from some server})
  if maxP.tg > tg_b then
    send ⟨WRITE, maxP⟩ to all servers             ▷ Propagate Phase
    wait until (|S|+1)/2 servers reply
    ⟨tg_b, val_b⟩ ← maxP
  end if
  return ⟨tg_b, val_b⟩
end function

at each server s for object b
State Variables: tg_b ∈ N × W, initially ⟨0, ⊥⟩; val_b ∈ V, initially ⊥

function rcv(M)_q                                 ▷ Reception of a message from q
  if M.type ≠ READ and M.tg > tg_b then
    ⟨tg_b, val_b⟩ ← ⟨M.tg, M.v⟩
  end if
  if M.type = READ and M.tg ≥ tg_b then
    send ⟨tg_b, ⊥⟩ to q                           ▷ Reply without content
  else
    send ⟨tg_b, val_b⟩ to q                       ▷ Reply with content
  end if
end function

While this optimization makes little difference in the non-fragmented version of ABD (under read/write contention), it makes a significant difference in the case of fragmented objects. For example, if each read is concurrent with a write causing the execution of a second phase, then in the non-fragmented case the read sends the complete file to the servers; in the case of fragmented objects, only the fragments changed by the write will be sent to the servers, resulting in significant reductions.

The create and write operations invoke cvr-write operations to update the value of the shared block b. Their main difference is that version 0 is used during a create operation, to indicate that this is the first time the block is written. Notice that the write in create will always succeed, as it introduces a new, never-before-written block, whereas operation write may be converted to a read operation, thus retrieving and returning the latest value of b. We refer the reader to [24] for the implementations of cvr-read and cvr-write, which are simple variants of the corresponding implementations of ABD [2]. We state the following lemma:

Lemma 1.
The DSMM implements R/W coverable block objects.
Fig. 3: Example of a writer x writing text at the beginning of the second block of a text file with id f_id = 7. The hash value of the existing second block is replaced, and a new block is inserted immediately after; the modified block and the newly created block are sent to the DSM.

Proof (Proof Sketch).
When both the read and write operations perform two phases, the correctness of the algorithm follows from Theorem 10 in [24]. It is easy to see that the optimization does not violate linearizability. The second phase of a read is omitted only when all the servers reply with a tag smaller than or equal to the local tag of the reader r. However, since a read propagates its local tag to a majority of servers at every tag update, every subsequent operation will observe (and return) a value of the object associated with a tag at least as high as the local tag of r.

Fragmentation Module:
The FM is the core component of our implementation. Each client has an FM responsible for (i) fragmenting the file into blocks and identifying modified blocks, and (ii) following a specific strategy to store and retrieve the file blocks from the R/W shared memory. As we show later, the block update strategy followed by the FM is necessary in order to preserve the structure of the fragmented object, and sufficient to preserve the properties of fragmented coverability. For dividing the file into blocks and identifying the newly created blocks, the FM contains a Block Identification (BI) module that utilizes known approaches for data fragmentation and diff extraction.
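The two mechanisms just mentioned, content-based fragmentation and diff extraction, can be sketched as follows. This is a simplified, illustrative stand-in: a toy rolling hash replaces the Rabin-polynomial fingerprints, and the diff step merely flags blocks whose hash is absent from the previous version (the actual BI additionally classifies entries as equal/modified/inserted/deleted via string matching).

```python
# Simplified sketch of the BI's two steps: (1) content-defined chunking with
# a toy rolling hash (a stand-in for Rabin fingerprints), and (2) a hash
# comparison against the previous version to flag changed blocks.
import hashlib

MASK, MAX_LEN = 0x3F, 1024   # boundary pattern and block-size bound (toy values)

def chunk(data: bytes):
    """Split data at positions where the rolling hash hits a boundary pattern."""
    blocks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF          # toy rolling hash
        if (h & MASK) == 0 or (i - start + 1) >= MAX_LEN:
            blocks.append(data[start:i + 1])        # boundary found: close block
            start, h = i + 1, 0
    if start < len(data):
        blocks.append(data[start:])                 # trailing block
    return blocks

def diff_hashes(old_blocks, new_blocks):
    """Return indices of new blocks whose hash is absent from the old version."""
    old = {hashlib.sha1(b).hexdigest() for b in old_blocks}
    return [i for i, b in enumerate(new_blocks)
            if hashlib.sha1(b).hexdigest() not in old]
```

The property this preserves is that block boundaries are content-defined rather than offset-based, so a localized edit changes the hashes of only a few nearby blocks and the rest of the file keeps its block identities.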
Block Identification (BI):
Given the data D of a file f, the goal of BI is to break D into data blocks ⟨D_1, …, D_n⟩, s.t. the size of each D_i is less than a predefined upper bound ℓ. Furthermore, drawing ideas from the RSYNC (Remote Sync) algorithm [26], given two versions of the same file, say f and f′, the BI tries to identify blocks that (a) may exist in f but not in f′ (and vice-versa), or (b) have been changed from f to f′. To achieve these goals, BI proceeds in two steps: (1) it fragments D into blocks using the Rabin fingerprints rolling hash algorithm [25], and (2) it compares the hashes of the blocks of the current and the previous version of the file using a string matching algorithm [5] to determine the modified/new data blocks. The role of BI within the architecture of CoBFS and its process flow appear in Fig. 3, while its specification is provided in Algorithm 3. A high-level description of BI is as follows:
– Block Division: Initially, the BI partitions a given file f into data blocks based on its contents, using Rabin fingerprints. This algorithm identifies the block boundaries

Algorithm 3
Fragmentation Module: BI and Operations on a file f at client p

State Variables: H initially ∅; ℓ ∈ ℕ; L_f a linked-list of blocks, initially ⟨b_g⟩; bc_f ∈ ℕ, initially 0

function fm-block-identify()_{f,p}
    ⟨newD, newH⟩ ← RabinFingerprints(f, ℓ)
    curH = hash(L_f)    ▷ hashes of the data of the blocks in L_f
    C ← SMatching(curH, newH)
    ▷ modified
    for ⟨h(b_j), h_k⟩ ∈ C.mods s.t. h(b_j) ∈ curH, h_k ∈ newH do
        D ← {D_k : D_k ∈ newD ∧ h_k = hash(D_k)}
        fm-update(b_j, D)_{f,p}
    end for
    ▷ inserted
    for S ∈ C.inserts s.t. h_i ∈ S are in sequence do
        D ← {D_i : h_i ∈ S ∧ D_i ∈ newD ∧ h_i = hash(D_i)}
        b ← b_j s.t. ∀ h_i ∈ S inserted after h(b_j)
        fm-update(b, D)_{f,p}
    end for
end function

function fm-read()_{f,p}
    b ← val(b_g).ptr
    L_f ← ⟨b_g⟩    ▷ reset L_f
    while b not NULL do
        val(b) ← dsmm-read()_{b,p}
        L_f.insert(val(b))
        b ← val(b).ptr
    end while
    return Assemble(L_f)
end function

function fm-update(b, D = ⟨D_0, D_1, …, D_k⟩)_{f,p}
    for j = k : 1 do
        b_j ← ⟨f, p, bc_f++⟩    ▷ set block id
        val(b_j).data = D_j    ▷ set block data
        if j < k then val(b_j).ptr = b_{j+1}    ▷ set block ptr
        else val(b_j).ptr = val(b).ptr    ▷ point last to b's ptr
        end if
        L_f.insert(val(b_j))
        dsmm-create(val(b_j))_{b_j}
    end for
    val(b).data = D_0
    if k > 0 then val(b).ptr = b_1    ▷ change b's ptr if |D| > 1
    end if
    dsmm-write(val(b))_b
end function

and it performs content-based chunking by calculating and returning the fingerprints (block hashes) over a sliding window, and guarantees that each block identified has a bounded size of no more than ℓ.
– Block Matching: Given the set of blocks ⟨D_1, …, D_m⟩ and associated block hashes ⟨h_1, …
, h_m⟩ generated by the Rabin fingerprint algorithm, the BI tries to match each hash to a block identifier, based on the block ids produced during the previous division of file f, say ⟨b_1, …, b_n⟩. We produce the vector ⟨h(b_1), …, h(b_n)⟩, where h(b_i) = hash(val(b_i).data), from the current blocks of f, and using a string matching algorithm [5] we compare the two hash vectors to obtain one of the following statuses for each entry: (i) equal, (ii) modified, (iii) inserted, (iv) deleted.
– Block Updates: Based on the hash statuses computed through block matching, the blocks of the fragmented object are updated. In particular, in the case of equality, if h_i = h(b_j), then D_i is identified as the data of block b_j. In the case of modification, e.g. (h(b_j), h_i), an update(b_j, {D_i})_{f,p} action is issued to modify the data of b_j to D_i (Lines 10:13). In the case where new hashes (e.g. ⟨h_i, h_k⟩) are inserted after the hash of block b_j (i.e., h(b_j)), the action update(b_j, {val(b_j).data, D_i, D_k})_{f,p} is performed to create the new blocks after b_j (Lines 15:19). In our formulation, block deletion is treated as a modification that sets an empty data value; thus, in our implementation no blocks are deleted.

FM Operations:
The FM's external signature includes the two main operations of a fragmented object: read_f and update_f. Their specifications appear in Algorithm 3.

Read operation - read()_{f,p}: To retrieve the value of a file f, a client p may invoke a read_{f,p} on the fragmented object. Upon receiving the request, the FM issues a series of reads on the file's blocks, starting from the genesis block of f and proceeding to the last block by following the pointers in the linked-list of blocks comprising the file. All the blocks are assembled into one file via the Assemble() function. The reader p issues a read for all the blocks in the file. This is done to ensure the property stated in the following lemma:

Lemma 2.
Let ξ be an execution of CoBFS with two reads ρ_1 = read_{f,p} and ρ_2 = read_{f,q} from clients p and q on the fragmented object f, s.t. ρ_1 → ρ_2. If ρ_1 returns a list of blocks L_1 and ρ_2 a list L_2, then ∀ b_i ∈ L_1, b_i ∈ L_2 and version(b_i)_{L_1} ≤ version(b_i)_{L_2}.

Update operation - update(b, D)_{f,p}: Here we expect that the update operation accepts a block id and a sequence of data chunks (instead of a single data object), since the division is performed by the BI module. Thus, D = ⟨D_0, …, D_k⟩, for k ≥ 0, with size |D| = Σ_{i=0}^{k} |D_i|, and the size of each |D_i| ≤ ℓ for some maximum block size ℓ. Client p attempts to update the value of the block with identifier b in file f with the data in D. Depending on the size of D, the update operation will either perform a write on the block, if k = 0, or it will create new blocks and update the block pointers, in case k > 0. Assuming that val(b).ptr = b′, then:
– k = 0: In this case update, for block b, calls write(⟨val(b).ptr, D_0⟩, ⟨p, bseq⟩)_b.
– k > 0: Given the sequence of chunks D = ⟨D_0, …, D_k⟩, the following block operations are performed in this particular order:
→ create(b_k = ⟨f, p, bc_p++⟩, ⟨b′, D_k⟩, ⟨p, 0⟩)    ** Block b_k's ptr points to b′ **
→ …
→ create(b_1 = ⟨f, p, bc_p++⟩, ⟨b_2, D_1⟩, ⟨p, 0⟩)    ** Block b_1's ptr points to b_2 **
→ write(⟨b_1, D_0⟩, ⟨p, bseq⟩)_b    ** Block b's ptr points to b_1 **

The challenge here was to insert the list of blocks without causing any concurrent operation to return a divided fragmented object, while also avoiding blocking any ongoing operations.
To achieve that, create operations are executed in reverse order: we first create block b_k pointing to b′, and we move backwards until creating b_1 pointing to block b_2. The last operation, write, tries to update the value of block b with value ⟨b_1, D_0⟩. If the last coverable write completes successfully, then all the blocks are inserted in f and the update is successful; otherwise, none of the blocks appears in f and thus the update is unsuccessful. This is captured by the following lemma:

Lemma 3.
In any execution ξ of CoBFS, if ξ contains an operation π = update(b, D)_{f,p}, then π is successful iff the operation b.cvr-write called within dsmm-write(val(b))_{b,p} is successful.

Proof. It is easy to see that if π = update(b, D)_{f,p} is successful, then all the dsmm-write operations invoked within π, including dsmm-write(val(b))_{b,p}, are successful. It remains to show that π can only be unsuccessful whenever dsmm-write(val(b))_{b,p} is unsuccessful. In the case where D contains a single chunk, i.e. D = ⟨D_0⟩, π invokes a single dsmm-write(val(b))_{b,p} with val(b).data = D_0. If the cvr-write invoked in that operation is unsuccessful, then π is also unsuccessful. In the case where k > 0, π invokes k create operations with new block identifiers (due to the incremented block counter bc). The cvr-write operation on every such block will be successful, as (i) the block id ⟨f, p, bc⟩ (and thus the block) can only be generated by process p, and (ii) the block is not yet inserted in the linked-list, so no other write operation will attempt to cvr-write the same block concurrently. So the only operation that may fail, in this case as well, is dsmm-write(val(b))_{b,p}, as b was already a part of the list and may be accessed concurrently by a writer q ≠ p.

Now a read operation may return a list that contains a block b_i only if b_i was written by a successful update operation. More formally:

Lemma 4.
In any execution ξ of CoBFS, if a ρ = read_{f,p} operation returns a list L, then for any block b ∈ L there exists a successful update(∗)_{f,∗} operation that either precedes or is concurrent to ρ and invokes the sm-create(val(b))_b operation.

Proof. According to our protocol, it is clear that a block with id b appears in the list of f only if it is created and written during an update_{f,∗} operation. Also, if the block is created by an unsuccessful update that precedes ρ, then no other block in the list will point to b, ρ will not invoke an sm-read_b operation for b, and thus b ∉ L.

So it remains to examine the case where ρ may obtain b from an unsuccessful update_{f,∗}. Let us assume by contradiction that a read operation may return a block b for a file f created by an unsuccessful update. Let b_0 be the block the update attempts to write, and ⟨b_1, …, b_n⟩ the list of blocks it needs to create on the DSM. In particular, the operation will create all the blocks ⟨b_1, …, b_n⟩ and attempt to write block b_0. There are two cases to consider: (i) either b is equal to b_0, or (ii) b is in ⟨b_1, …, b_n⟩.

If case (i) is true, then p will invoke an sm-write(val(b_0))_{b_0}, as b_0 is the block that is updated. However, since we assume that the update was not successful, then by Lemma 3 the write operation is not successful. Thus, according to the coverable DSM, b_0 was never written, and this contradicts the assumption that p obtained b ∈ L.

If case (ii) holds, then b was created by p (an operation that cannot fail). However, since the update is not successful, then b_0 was not written in the list. It is also true that there is no link path leading to b, since the only such path was b_0 → b_1 → … → b.
So, during the traversal of the blocks, the read operation will not see the new pointer of b_0, and thus will never reach and obtain b, contradicting again our initial assumption.

The above lemma helps us show that the linked-list used for implementing our fragmented object stays connected in any execution.

Lemma 5.
In any execution ξ of CoBFS, if a read_{f,p} operation returns a list L = ⟨b_g, b_1, …, b_n⟩ for a file f, then val(b_g).ptr = b_1 and val(b_i).ptr = b_{i+1}, for 1 ≤ i < n.

By Lemma 1, every block operation in CoBFS satisfies coverability, and together with Lemma 2 it follows that CoBFS implements a coverable fragmented object satisfying the properties presented in Definition 3. Also, the BI ensures that the size of each block is limited under a bound ℓ, and Lemma 5 ensures that each operation obtains a connected list of blocks. Thus, CoBFS implements a valid fragmented object.

To further appreciate the proposed approach from an applied point of view, we performed a preliminary evaluation of CoBFS against CoABD. Due to the design of the two algorithms, CoABD transmits the entire file per read/update operation, while CoBFS transmits as many blocks as necessary for an update operation, but performs as many reads as the number of blocks during a read operation. The two algorithms use the read optimization of Algorithm 2. Both were implemented and deployed on Emulab [27], a network testbed with tunable and controlled environmental parameters.

Experimental Setup: Across all experiments, three distinct types of distributed nodes are defined and deployed within the emulated network environment, as listed below. Communication between the distributed nodes is via point-to-point bidirectional links implemented with a DropTail queue.
– writer w ∈ W ⊆ C: a client that dispatches update requests to servers.
– reader r ∈ R ⊆ C: a client that dispatches read requests to servers.
– server s ∈ S: listens for reader and writer requests and is responsible for maintaining the object replicas according to the underlying protocol they implement.

Performance Metrics: We assess performance using: (i) operational latency, and (ii) the update success ratio. The operational latency is computed as the sum of communication and computation delays.
In the case of CoBFS, computational latency encompasses the time necessary for the FM to fragment a file object and generate the respective hashes for its blocks. The update success ratio is the percentage of update operations that have not been converted to reads (and thus successfully changed the value of the intended object). In the case of CoABD, we compute the percentage of successful updates on the file as a whole over the number of all updates. For CoBFS, we compute the percentage of file updates where all individual block updates succeed.

Scenarios: Both algorithms are evaluated under the following experimental scenarios:
– Scalability: examine performance as the number of service participants increases.
– File Size: examine performance when using different initial file sizes.
– Block Size: examine performance under different block sizes (CoBFS only).
We use a stochastic invocation scheme in which reads are scheduled randomly from the interval [1…rInt] and updates from [1…wInt], where rInt, wInt = 4 sec. To perform a fair comparison and to yield valuable observations, the results shown are compiled as averages over five samples per scenario.

Scalability Experiments: We varied the number of readers |R|, the number of writers |W|, and the number of servers |S| over a set of ten values. While testing for readers' scalability, the number of writers and servers was kept constant, |W|, |S| = 10. Using the same approach, the scalability of writers, and in turn of servers, was tested while keeping the other two types of nodes constant (i.e., |R|, |S| = 10 and |R|, |W| = 10, respectively). In total, each writer performed 20 updates and each reader 20 reads. The size of the initial file used was set to 18 kB, with the maximum block size (a Rabin fingerprints parameter) set to 64 kB.
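The stochastic invocation scheme described above can be sketched as follows; `schedule` is an illustrative helper standing in for the actual test harness, not part of CoBFS itself.

```python
# Sketch of the stochastic invocation scheme: each client waits a uniformly
# random interval in [1, max_interval] seconds before its next operation,
# mirroring the [1..rInt] / [1..wInt] scheduling with rInt = wInt = 4 sec.
import random

R_INT = W_INT = 4.0   # seconds, as in the scenarios described above

def schedule(n_ops, max_interval, rng=random.random):
    """Return n_ops invocation times, each 1..max_interval after the previous."""
    times, t = [], 0.0
    for _ in range(n_ops):
        t += 1.0 + rng() * (max_interval - 1.0)   # random gap in [1, max_interval)
        times.append(t)
    return times
```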
File Size Experiments: We varied the file size by doubling it in each simulation run. The number of writers, readers, and servers was fixed to 5. In total, each writer performed 5 updates and each reader 5 reads. The minimum and average block sizes were both set to 512 kB.

Block Size Experiments: We varied the minimum and average block sizes of CoBFS up to 64 kB. The number of writers, readers, and servers was fixed to 10. In total, each writer performed 20 updates and each reader 20 reads. The size of the initial file used was set to 18 kB, while the maximum block size was set to 64 kB.

Fig. 4: Simulation results for algorithms CoABD and CoBFS.

Results: Overall, our results suggest that the efficiency of CoBFS is inversely proportional to the number of block operations, rather than to the size of the file. This is primarily due to the individual block-processing nature of CoBFS. In more detail:

Scalability: In Fig. 4(a), the operational latency of updates in CoBFS remains almost unchanged and smaller than that of CoABD. This is because each CoABD writer updates a rather small file, while each CoBFS writer updates only the subset of blocks which are modified or created. The computational latency of the FM in CoBFS is negligible when compared to the total update operation latency, because of the small file size. In Fig. 4(c), we observe that the update operation latency in CoABD increases even more as the number of servers increases. As more updates are successful in CoBFS, reads may transfer more data compared to reads in CoABD, explaining their slower completion, as seen in Fig. 4(b). Also, readers send multiple read block requests of small sizes, waiting each time for a reply, while CoABD readers wait for a single message containing a small file.
Concurrency: The percentage of successful file updates achieved by CoBFS is significantly higher than that of CoABD. This holds both when the number of writers increases (see Fig. 4(a)) and when the number of servers increases (see Fig. 4(c)), demonstrating the boost of concurrency achieved by CoBFS. In Fig. 4(a) we notice that as the number of writers increases (hence, concurrency increases), CoABD suffers a greater number of unsuccessful updates, i.e., updates that have become reads per the coverability property. Concurrency is also affected when the number of blocks increases, Fig. 4(d): the probability that two writes collide on a single block decreases, and thus CoBFS eventually allows all the updates (100%) to succeed. CoABD does not experience any improvement, as it always manipulates the file as a whole.

File Size: Figure 4(d) demonstrates that the update operation latency of CoBFS remains at extremely low levels. The main factor contributing to the slight increase of the CoBFS update latency is the FM computation latency, Fig. 4(e). We have set the same parameters of the Rabin fingerprints algorithm for all the initial file sizes, which may have favored some file sizes but burdened others. An optimization of the Rabin algorithm, or the use of a different algorithm for managing blocks, could possibly lead to improved FM computation latency; this is a subject for future work. The CoBFS update communication latency remains almost stable, since it depends primarily on the number and size of update block operations. This is in contrast to the update latency exhibited by CoABD, which appears to increase linearly with the file size. This was expected: as the file size increases, it takes longer to update the whole file. Despite the higher success rate of CoBFS, the read latency of the two algorithms is comparable, due to the low number of update operations.
The read latencies of the two algorithms, with and without the read optimization, can be seen in Fig. 4(f). The CoABD read latency increases sharply, even when using the optimized reads. This is in line with our initial hypothesis, as CoABD requires reads to request and propagate the whole file each time a newer version of the file is discovered. Similarly, when the read optimization is not used in CoBFS, the latency is close to that of CoABD: each read that discovers a new version of the file needs to request and propagate the content of each individual block. On the contrary, the read optimization decreases the CoBFS read latency significantly, as reads transmit only the contents of the blocks that have changed.

Block Size: From Figs. 4(g)(h) we can infer that when smaller blocks are used, the update and read latencies reach their highest values. In both cases, a small block size results in the generation of a larger number of blocks from the division of the initial file. Additionally, as seen in Fig. 4(g), a small block size leads to the generation of more new blocks during update operations, resulting in more update block operations, and hence higher latencies. As the minimum and average block sizes increase, fewer blocks need to be added when an update takes place. Unfortunately, a smaller number of blocks leads to a lower success rate. Similarly, in Fig. 4(h), smaller block sizes require more read block operations to obtain the file's value. As the minimum and average block sizes increase, fewer blocks need to be read. Thus, further increase of the minimum and average block sizes forces the latencies to decrease, reaching a plateau in both graphs. This means that the emulation finds optimal minimum and average block sizes, and increasing them further does not give better (or worse) latencies.
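For reference, the read optimization whose impact is discussed above can be sketched as follows. This is a minimal illustration under assumed helpers (`query_majority` and `propagate_majority` stand in for a majority-quorum messaging layer); it is a sketch of the idea, not the paper's implementation.

```python
# Sketch of the optimized two-phase read: the reader piggybacks its local tag
# so servers can omit the block content when the reader is up to date, and the
# propagation phase is skipped entirely when no newer tag is discovered.
from dataclasses import dataclass

@dataclass(order=True, frozen=True)
class Tag:
    ts: int    # timestamp
    wid: int   # writer id (tie-breaker), compared lexicographically after ts

class Reader:
    def __init__(self, dsm):
        self.dsm = dsm                         # assumed messaging layer
        self.tag, self.value = Tag(0, 0), None  # locally known copy

    def read(self):
        # Phase 1: query a majority, enclosing the local tag; servers with a
        # tag <= ours reply without content.
        replies = self.dsm.query_majority(("READ", self.tag))
        max_tag = max(r.tag for r in replies)
        if max_tag > self.tag:
            # A newer value exists (so its replier included the content):
            # adopt it and run the propagation phase.
            self.tag = max_tag
            self.value = next(r.value for r in replies if r.tag == max_tag)
            self.dsm.propagate_majority(("WRITE", self.tag, self.value))
        # Otherwise the second phase, and re-sending the content, is skipped.
        return self.tag, self.value
```

With fragmented objects this matters because the skipped propagation would otherwise carry a whole block's content, and an unoptimized read repeats it once per block.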
We have introduced the notion of linearizable and coverable fragmented objects and proposed an algorithm that implements coverable fragmented files. This algorithm is then used to build CoBFS, a prototype distributed file system in which each file is specified as a linked-list of coverable blocks. CoBFS adopts a modular architecture, separating the object fragmentation process from the shared memory service, allowing different fragmentation strategies and shared memory implementations to be used. We showed that it preserves the validity of the fragmented object (file) and satisfies fragmented coverability. The deployment on Emulab serves as a proof-of-concept implementation. The evaluation demonstrates the potential of our approach in boosting the concurrency and improving the efficiency of R/W operations on strongly consistent large objects.

For future work, we aim to perform a comprehensive experimental evaluation of CoBFS that will go beyond simulations (e.g., full-scale, real-time, cloud-based experimental evaluations), to further study parameters that may affect the performance of the operations (e.g., file size, block size, etc.), and to build optimizations and extensions, in an effort to unlock the full potential of our approach.

References

1. Emulab network testbed.
2. Attiya, H., Bar-Noy, A., Dolev, D.: Sharing memory robustly in message passing systems. Journal of the ACM, 124–142 (1996)
3. Attiya, H., Kumari, S., Somani, A., Welch, J.L.: Store-collect in the presence of continuous churn with application to snapshots and lattice agreement (2020)
4. Attiya, H., Welch, J.L.: Sequential consistency versus linearizability. ACM T.C.S. (1994)
5. Black, P.: Ratcliff pattern recognition. Dictionary of Algorithms and Data Structures (2021)
6. Cachin, C., Tessaro, S.: Optimal resilience for erasure-coded byzantine distributed storage. pp. 115–124. IEEE Computer Society, Los Alamitos, CA, USA (2006)
7.
Cadambe, V.R., Lynch, N.A., Médard, M., Musial, P.M.: A coded shared atomic memory algorithm for message passing architectures. Distributed Computing (1), 49–73 (2017)
8. Chen, Y.L.C., Mu, S., Li, J.: Giza: Erasure coding objects across global data centers. In: Proc. of USENIX ATC '17. pp. 539–551 (2017)
9. Delporte-Gallet, C., Fauconnier, H., Rajsbaum, S., Raynal, M.: Implementing snapshot objects on top of crash-prone asynchronous message-passing systems. IEEE Trans. Parallel Distrib. Syst. (9), 2033–2045 (2018). https://doi.org/10.1109/TPDS.2018.2809551
10. Dutta, P., Guerraoui, R., Levy, R.R.: Optimistic erasure-coded distributed storage. In: DISC 2008. pp. 182–196. Springer-Verlag, Berlin, Heidelberg (2008)
11. Dutta, P., Guerraoui, R., Levy, R.R., Chakraborty, A.: How fast can a distributed atomic read be? In: Proc. of PODC 2004. pp. 236–245
12. Fan, R., Lynch, N.: Efficient replication of large data objects. In: DISC 2003. pp. 75–91
13. Fernández Anta, A., Hadjistasi, T., Nicolaou, N.: Computationally light "multi-speed" atomic memory. In: Proc. of OPODIS 2016 (2016)
14. Fischer, M.J., Lynch, N., Paterson, M.: Impossibility of distributed consensus with one faulty process. Journal of the ACM (2), 374–382 (1985)
15. Georgiou, C., Hadjistasi, T., Nicolaou, N., Schwarzmann, A.A.: Unleashing and speeding up readers in atomic object implementations. In: Proc. of NETYS 2018. pp. 175–190
16. Georgiou, C., Nicolaou, N., Shvartsman, A.A.: Fault-tolerant semifast implementations of atomic read/write registers. Journal of Parallel and Distributed Computing (1), 62–79 (2009)
17. Hadjistasi, T., Nicolaou, N., Schwarzmann, A.A.: Oh-RAM! One and a half round atomic memory. In: Proc. of NETYS 2017. pp. 117–132. https://doi.org/10.1007/978-3-319-59647-1_10
18. Herlihy, M.P., Wing, J.M.: Linearizability: a correctness condition for concurrent objects. ACM TOPLAS (3), 463–492 (1990)
19.
Konwar, K.M., Prakash, N., Kantor, E., Lynch, N., Médard, M., Schwarzmann, A.A.: Storage-optimized data-atomic algorithms for handling erasures and errors in distributed storage systems. In: Proc. of IPDPS 2016. pp. 720–729 (May 2016)
20. Konwar, K.M., Prakash, N., Lynch, N., Médard, M.: RADON: Repairable atomic data object in networks. In: The International Conference on Distributed Systems (OPODIS) (2016)
21. Lynch, N., Shvartsman, A.A.: Robust emulation of shared memory using dynamic quorum-acknowledged broadcasts. In: Proc. of Symposium on Fault-Tolerant Computing (1997)
22. Lynch, N.: Distributed Algorithms. Morgan Kaufmann Publishers (1996)
23. Nicolaou, N., Cadambe, V., Prakash, N., Konwar, K., Medard, M., Lynch, N.: ARES: Adaptive, reconfigurable, erasure coded, atomic storage. In: IEEE 39th ICDCS. pp. 2195–2205
24. Nicolaou, N., Fernández Anta, A., Georgiou, C.: Coverability: Consistent versioning in asynchronous, fail-prone, message-passing environments. In: Proc. of IEEE NCA 2016
25. Rabin, M.O.: Fingerprinting by random polynomials (1981)
26. Tridgell, A., Mackerras, P.: The rsync algorithm (1996)
27. White, B., Lepreau, J., Stoller, L., Ricci, R., Guruprasad, S., Newbold, M., Hibler, M., Barb, C., Joglekar, A.: An integrated experimental environment for distributed systems and networks. In: OSDI 2002. pp. 255–270. USENIX Association, Boston, MA (Dec 2002)
28. Zhang, H., Dong, M., Chen, H.: Efficient and available in-memory KV-store with hybrid erasure coding and replication. In: FAST 2016. USENIX Association, Santa Clara, CA (2016)

Appendix A Fragmented Objects with Coverable Blocks

When writing a value to a linearizable R/W object, the value written does not need to depend on the previously written value. However, in some objects (e.g. files), it is expected that a value update will build upon (and thus avoid overwriting) the current value of the object.
In such cases, a writer should be aware of the latest value of the object (i.e., by reading the object) before updating it. Although a read-modify-write (RMW) semantic would be more appropriate for this type of object, it can only be achieved through consensus, which is impossible to solve in an asynchronous environment with crashes [14].

To this respect, the notion of coverability was introduced in [24] to leverage the solvability of R/W object implementations, while providing a weak RMW object. Informally, coverability extends linearizability with the additional guarantee that object writes succeed when associating the written value with the "current" version of the object. Otherwise, a write operation becomes a read operation and returns the latest version and the associated value of the object.

More formally, coverability uses a totally ordered set of versions, say Versions, and introduces the notion of versioned (coverable) objects. A coverable object is a type of R/W object where each value written is assigned a version from the set Versions. The coverable R/W object X offers two operations: (i) X.cvr-write(val, ver)_p, and (ii) X.cvr-read()_p. A process p invokes a cvr-write(val, ver)_p operation when it performs a write operation that attempts to change the value of the object. The operation returns the value of the object and its associated version, along with a flag informing whether the operation has successfully changed the value of the object or failed. A write is successful if it changes the value of the register; otherwise the write is unsuccessful. The read operation cvr-read()_p involves a request to retrieve the value of the object. The response of this operation is the value of the register together with the version of the object that this value is associated with.
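A minimal, sequential sketch of this interface might look as follows, with versions modeled as integers and the chg/unchg flag as a boolean; this is illustrative only and ignores replication and concurrency.

```python
# Sketch of a coverable R/W object: a cvr-write succeeds only when it is
# based on the object's current version; otherwise it behaves as a read,
# returning the latest value and version without changing the object.

class CoverableObject:
    def __init__(self, initial=None):
        self.version = 0       # stands in for the totally ordered Versions set
        self.value = initial

    def cvr_write(self, val, ver):
        if ver == self.version:                  # write built on current version
            self.version += 1
            self.value = val
            return self.value, self.version, True   # chg: value changed
        return self.value, self.version, False      # unchg: acted as a read

    def cvr_read(self):
        return self.value, self.version
```

A stale writer thus learns the latest value and version from the failed write itself, which is exactly how CoBFS converts conflicting block updates into reads.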
Denoting a successful write cvr-write(v, ver)(v, ver′, chg)_p as tr-write(ver)[ver′]_p (updating the object from version ver to ver′), and cvr-write(v, ver)(v′, ver′, unchg)_p as tr-write(ver)[ver′, unchg]_p, a coverable implementation satisfies the following properties (for the formal definition see [24]).

Definition 4 (Coverability [24]). A valid execution ξ is coverable with respect to a total order <_ξ on all successful write operations, W_{ξ,succ}, in ξ if:
– (Consolidation) If tr-write(ver_j)[∗] ∈ W_{ξ,succ}, then ver_j is larger than any version written by a preceding successful write operation.
– (Continuity) If tr-write(ver)[ver_i] ∈ W_{ξ,succ}, then ver was written by a preceding write operation, or ver = ⊥, the initial version.
– (Evolution) The version of the object is incrementally evolving; thus, for two version "chains" formed by concurrent writes on a single initial version ver, the last version of the longer chain is larger than the last version of the shorter chain.

If a fragmented object utilizes coverable blocks, instead of linearizable blocks, then Definition 3 provides what we would call fragmented coverability: concurrent update operations on different blocks would all prevail (as long as each update is tagged with the latest version of its block), whereas only one update operation on the same block would prevail (all the other updates on the same block that are concurrent with it would become read operations). As we see in the next section, fragmented coverability is a good alternative to RMW semantics for implementing large objects, like files, whose new value may depend on the current value of the object.

B Additional Operations Supported by the Prototype