On Architecture to Architecture Mapping for Concurrency

Abstract

Mapping programs from one architecture to another plays a key role in technologies such as binary translation, decompilation, emulation, virtualization, and application migration. Although multicore architectures are ubiquitous, the state-of-the-art translation tools do not handle concurrency primitives correctly. Doing so is rather challenging because of the subtle differences in the concurrency models between architectures. In response, we address various aspects of the challenge. First, we develop correct and efficient translations between the concurrency models of two mainstream architecture families: x86 and ARM (versions 7 and 8). We develop direct mappings between x86 and ARMv8 and ARMv7, and fence elimination algorithms to eliminate redundant fences after direct mapping. Although our mapping utilizes ARMv8 as an intermediate model for mapping between x86 and ARMv7, we argue that it should not be used as an intermediate model in a decompiler because it disallows common compiler transformations. Second, we propose and implement a technique for inserting memory fences for safely migrating programs between different architectures. Our technique checks robustness against x86 and ARM, and inserts fences upon robustness violations. Our experiments demonstrate that in most of the programs both our techniques introduce significantly fewer fences compared to naive schemes for porting applications across these architectures.



Soham Chakraborty

Department of Computer Science and Engineering, IIT Delhi, Delhi 110016, India


Architecture to architecture mapping is the widely applicable concept of converting an application that runs on some architecture X to run on a different architecture Y. Binary translators notaz [2014], Chernoff et al. [1998], for example, recompile machine code from one architecture to another in a semantics-preserving manner. Such translation is facilitated by decompilers Bougacha, Bits, Yadavalli and Smith [2019], avast, Shen et al. [2012], which lift machine code from a source architecture to an intermediate representation (IR) and compile it to a target architecture. Emulators implement a guest architecture on a host architecture. For instance, QEMU QEMU emulates a number of architectures (including x86 and ARM) over other architectures, the Android emulator Android-x86 runs x86 images on ARM, while Windows 10 on ARM emulates x86 applications Docs.

Architecture to architecture mapping is essential for application migration and compatibility. An application written for an older architecture may need to be upgraded to execute on the latest architectures, while an application primarily targeting a later architecture may need to preserve backward compatibility with older ones. For example, Arm discusses the measures required to port an application from ARMv5 to ARMv7, including synchronization primitives. Besides its practical uses, formally mapping between architectures helps in the design of future processors and architectures, as it allows one to compare and relate subtle features like concurrency, which vary significantly from one architecture to another.

A key feature that has been overlooked in these mappings is concurrency, which is crucial for achieving good performance on modern multicore processors. To emulate or port a concurrent application correctly, we must map the concurrency primitives of the source to those of the target, taking into account the subtle differences in their concurrency models. Such semantic differences appear not only between architectures (e.g., x86 and ARM), but also between different versions of the same architecture (e.g., ARMv7 and ARMv8 Pulte et al. [2018]).

Figure 1: Correct and efficient mapping schemes between x86, ARMv8, and ARMv7/ARMv7-mca.

In this paper, we address the challenge of developing correct and efficient translations between the relaxed memory concurrency models of x86, ARMv8, and ARMv7. We approach the problem from multiple angles.

First, we develop correct mapping schemes between these concurrency models, using the ARMv8 model as an efficient intermediate concurrency model for mapping between x86 and ARMv7.

This naturally leads to the question of whether the ARMv8 model can also serve as a concurrency model for the IR in a decompiler. Decompilers typically (1) raise the source machine code to an IR, (2) optimize the IR, and (3) generate the target code. Thus, if the IR follows the ARMv8 concurrency model, steps (1) and (3) can be performed efficiently to facilitate translations between x86 and ARM concurrent programs. For step (2), we evaluate common optimizations on ARMv8 concurrency and observe that a number of common transformations are unsound. The result demonstrates that achieving a correct and efficient mapping across all three steps requires a different concurrency model. We leave the exploration of such a model for future research.

Next, we focus on optimizing the direct mappings further. For correctness, direct mappings introduce fences when translating stronger accesses to weaker ones. The introduced fences are often redundant in certain memory access sequences and can be eliminated safely. We identify conditions for safe fence elimination, prove that elimination under these conditions is safe, and propose fence elimination algorithms based on them.

In addition to fence elimination, we apply memory sequence analysis to check and enforce robustness for a class of concurrent programs. Robustness analysis checks whether a program running on a weaker model demonstrates only the behaviors allowed by a stronger model. The behaviors of a robust program on the weaker model are indistinguishable from those on the stronger model, and therefore the program can seamlessly migrate from one architecture to another as far as concurrent behaviors are concerned. If a program is not robust, we insert fences to enforce robustness against a stronger model. This is especially beneficial in application porting and migration Barbalace et al. [2020, 2017], where it is crucial to preserve the observable behaviors of a running application.

Contributions & Results.

Now we discuss the specific contributions and obtained results.

• In §4 we propose the mapping schemes (↦) between x86 and ARMv8, and between ARMv8 and ARMv7, as shown in Fig. 1. We do not propose any direct mapping between x86 and ARMv7; instead we consider ARMv8 as an intermediate model. We achieve the x86 to ARMv7 mapping by combining the x86 to ARMv8 and ARMv8 to ARMv7 mappings. Similarly, the ARMv7 to x86 mapping is derived by combining the ARMv7 to ARMv8 and ARMv8 to x86 mappings. We show that direct mapping schemes would be the same as these two-step mappings through ARMv8. We also show that these mapping schemes are efficient: each leading and/or trailing fence that the mappings attach to a memory access is required to preserve correctness.
• We show that multicopy atomicity (MCA) (a write operation is observable to all other threads at the same time) does not affect the mapping schemes between ARMv8 and ARMv7, though it is a major difference between ARMv8 and ARMv7 Pulte et al. [2018], as ARMv7 allows non-MCA behavior unlike ARMv8. To demonstrate this, we propose ARMv7-mca in §4, which restricts non-MCA behaviors in ARMv7, and show that the mapping scheme from ARMv8 to ARMv7-mca is the same as ARMv8 to ARMv7 (Fig. 13a) and that the mapping scheme from ARMv7-mca to ARMv8 is the same as ARMv7 to ARMv8 (Fig. 12a).
• In §4.2, §4.6, and §4.8 we propose alternative schemes for the x86 to ARMv8 and ARMv8 to ARMv7 mappings where the respective x86 and ARMv8 programs are generated from C11 concurrent programs. In these schemes we exploit the catch-fire semantics of C11 concurrency ISO/IEC 9899 [2011], ISO/IEC 14882 [2011]. Unlike the x86 to ARMv8 and ARMv8 to ARMv7 mappings, we do not generate additional fences for the load or store accesses generated from non-atomic loads or stores.
• In §5 we study the reordering, elimination, and access strengthening transformations in the ARMv8 model. We prove the correctness of the safe transformations and provide counter-examples for the unsafe transformations.
• The mapping schemes introduce additional fences while mapping the memory accesses. These fences are required to preserve translation correctness in certain scenarios and are otherwise redundant. In §6 we identify the conditions under which the fences are redundant and prove that eliminating them is safe. Based on these conditions we define fence elimination algorithms to eliminate redundant fences without affecting the transformation correctness.
• We define the conditions for robustness of (i) an ARMv8 program against the sequential consistency (SC) and x86 models, (ii) an ARMv7/ARMv7-mca program against the SC, x86, and ARMv8 models, and (iii) an ARMv7 program against the ARMv7-mca model in §7, and prove their correctness in Appendix D. We also insert fences to enforce robustness against a stronger model for programs running on a weaker model. To the best of our knowledge we are the first to check and enforce robustness for ARM programs as well as for non-SC models.
• In §8 we discuss our experimental results. We have developed a compiler based on LLVM to capture the effect of mappings between x86, ARMv8, and ARMv7. Next, we have developed fence elimination passes based on §6. The passes eliminate a significant number of redundant fences in most of the programs and in some cases generate more efficient programs than the LLVM mappings. We have also developed analyzers to check and enforce robustness in x86, ARMv8, and ARMv7. For a number of x86 programs the result of our SC-robustness checker matches the results from Trencher Bouajjani et al. [2013], which also checks SC-robustness against the TSO model. Moreover, we enforce robustness with significantly fewer fences compared to naive schemes which insert fences without robustness information.

In the next section we informally give an overview of the proposed approaches. Next, in §3 we discuss the axiomatic models of the respective architectures which we use in later sections. The proofs and additional details are in the supplementary material.

Figure 2: LDR ↦ LDR; CBISB in ARMv8 to ARMv7 mapping is unsound. (a) Source program X[1] = 1; a = X[1]; b = Y[a]; c = Y[1]; d = Z[c]; Y[1] = 1; and its mapping, which places a CBISB after each load. Initially X[1] = Y[1] = 0 and the behavior in question is a = c = 1, b = d = 0. (b) The execution is disallowed in ARMv8. (c) The execution is allowed in ARMv7, where R = ctrl_isb ∪ addr.

In this section we discuss the overview of our proposed schemes, related observations, and the analysis techniques.

In x86 to ARMv8 mapping we considered two alternatives for mapping loads and stores: (1) x86 store and load to ARMv8 release-store (WMOV ↦ STLR) and acquire-load (RMOV ↦ LDAR) respectively; (2) x86 store and load to ARMv8 regular store and load accesses with respective leading and trailing fences as proposed in Fig. 9a, that is, WMOV ↦ DMBST; STR and RMOV ↦ LDR; DMBLD respectively. We choose (2) over (1) for the following reasons.

• Reordering is restricted. x86 allows the reordering of independent store and load operations accessing different locations Lahav and Vafeiadis [2016]. ARMv8 also allows reordering of different-location store-load pairs, but restricts the reordering of a release-store and acquire-load pair, as it violates the barrier-ordered-by (bob) order Pulte et al. [2018]. Thus scheme (1) is more restrictive than (2) with respect to reordering flexibility after mapping.
• Further optimization. Scheme (2) generates fences which are redundant in certain scenarios and can be removed safely. Consider the mappings below; the DMBST generated in (2) is redundant and can be eliminated safely, whereas mapping (1) offers no such opportunity.
(1) RMOV; WMOV ↦ LDAR; STLR
(2) RMOV; WMOV ↦ LDR; DMBLD; DMBST; STR ⇝ LDR; DMBLD; STR
• x86 ↦ ARMv8 ↦ ARMv7 would introduce additional fences. To map x86 to ARMv7 using ARMv8 as an intermediate step, scheme (1) introduces additional fences unlike (2), as follows.
(1) WMOV ↦ STLR ↦ DMB; STR; DMB
(2) WMOV ↦ DMBST; STR ↦ DMB; STR

Figure 3: a = b = 1 is disallowed in ARMv8 but allowed in ARMv7-mca for LDR ↦ LDR mapping. Program: a = X; Y = a; Y = 1; ∥ b = Y; X = b; X = 1; with data; coi relations from each load to the last store of its thread.

Figure 4: Behavior a = 1, b = c = 2, d = e = 4 is disallowed in ARMv8 but allowed in ARMv7-mca for LDR ↦ LDR; CBISB mapping. Source program a = X; Y = a; b = Y; Y = 2; c = Y; Z = c; d = Z; Z = 4; e = Z; X = e; X = 1; and its mapping, which places a CBISB after each load. In the execution R = data in ARMv8 and R = data ∪ ctrl_isb in ARMv7-mca.

ARMv7 LDR is significantly weaker than ARMv8 LDR

The ARMv8 to ARMv7 mapping in Fig. 13a introduces a trailing DMB fence for ARMv8 LDR and LDAR accesses, as introducing a trailing control fence (CBISB) is not enough for correctness. Consider the mapping of the program from ARMv8 to ARMv7 in Fig. 2a. The execution is disallowed in ARMv8 as it creates an ordered-before (ob) cycle, as shown in Fig. 2b. However, while mapping to ARMv7, if we map LDR ↦ LDR; CBISB and map the rest of the instructions following the scheme in Fig. 13a, then the execution would be allowed, as shown in Fig. 2c. Therefore LDR ↦ LDR; CBISB is too weak, and we require a DMB fence after each load, as well as after each RMW for the same reason.

We also strengthen the ARMv7 model to ARMv7-mca to exclude non-multicopy-atomic behaviors. However, even with such a strengthening, the LDR mapping requires a trailing DMB fence.

Load access mapping without trailing fence is unsound in ARMv8 to ARMv7-mca mapping.

Consider the example in Fig. 3, where the ARMv8 to ARMv7-mca mapping does not introduce a trailing fence for a load access, so we analyze the same execution in ARMv8 and ARMv7-mca. The shown behavior is not allowed in ARMv8, as there is a dependency-ordered-before (dob) relation from the reads to the respective writes due to the data; coi relation. In ARMv7-mca there is no preserved-program-order (ppo) relation from the reads to the respective writes, as data; coi ⊈ ppo. Therefore the execution is ARMv7-mca consistent and the mapping introduces a new outcome. Hence LDR ↦ LDR in ARMv8 to ARMv7-mca mapping is unsound.
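The argument above can be illustrated with a small Python sketch that encodes the Fig. 3 execution as sets of event pairs. It is deliberately simplified and rests on assumptions: the event names are hypothetical, the reads-from edges assume that each load reads from the final store of the other thread, and each model is reduced to one acyclicity check over `data`, `data; coi`, and `rfe` rather than its full set of axioms.

```python
def compose(r, s):
    """Relational composition r; s on sets of event pairs."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}

def transitive_closure(r):
    """Smallest transitive relation containing r."""
    closure = set(r)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra

def is_acyclic(r):
    """True when the transitive closure of r is irreflexive."""
    return all(a != b for (a, b) in transitive_closure(r))

# Thread 1: a = X; Y = a; Y = 1;   Thread 2: b = Y; X = b; X = 1;
data = {("ld_x", "st_y_a"), ("ld_y", "st_x_b")}      # load feeds the store
coi = {("st_y_a", "st_y_1"), ("st_x_b", "st_x_1")}   # same-thread coherence
rfe = {("st_y_1", "ld_y"), ("st_x_1", "ld_x")}       # assumed reads-from edges

# ARMv8 fragment: data; coi is part of dob, so it contributes to ob.
armv8_order = data | compose(data, coi) | rfe
# ARMv7-mca fragment: data is preserved, but data; coi is not in ppo.
armv7mca_order = data | rfe

print("ARMv8 fragment acyclic:", is_acyclic(armv8_order))
print("ARMv7-mca fragment acyclic:", is_acyclic(armv7mca_order))
```

Under these assumptions the ARMv8 fragment contains a cycle through both loads, while the ARMv7-mca fragment does not, which mirrors why the outcome a = b = 1 appears only after the mapping.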

Trailing control fence is not enough.

Consider the example mapping and the execution in Fig. 4. The execution in ARMv8 has an ordered-before (ob) cycle and hence is not consistent. The LDR ↦ LDR; CBISB mappings would result in the respective ppo relations in the execution, but these ppo relations do not rule out such a cycle. As a result, the execution is ARMv7 or ARMv7-mca consistent. Hence LDR ↦ LDR; CBISB is unsound in ARMv8 to ARMv7-mca mapping.

In §4.2, §4.6, and §4.8 we study the C11 ↦ x86 ↦ ARMv8, C11 ↦ ARMv8 ↦ ARMv7, and C11 ↦ ARMv8 ↦ ARMv7-mca mapping schemes respectively. In these mappings from stronger to weaker models, we assume that the source architecture program is generated from a C11 program following the mapping in map. We use this information to categorize the accesses in architectures as non-atomic (NA) and atomic (A), and exploit two aspects of C11 concurrency: first, a program with a data race on a non-atomic access has undefined behavior; second, C11 uses atomic accesses to achieve synchronization and avoid data races on non-atomics. Considering these properties, we introduce leading or trailing fences when mapping particular atomic accesses, and we map non-atomics to the respective accesses without any leading or trailing fence.

Pros and Cons

The C11 ↦ x86 ↦ ARMv8 scheme has a tradeoff: for non-atomics it is more efficient than x86 ↦ ARMv8 as it does not introduce additional fences, whereas an atomic store mapping requires a leading full fence or a pair of DMBLD and DMBST fences. Consider the mapping of the sequence:

Ld_NA; St_NA; St_REL ↦ RMOV_NA; WMOV_NA; WMOV_A ↦ LDR; STR; DMBFULL; WMOV_A.

In this case the C11 non-atomic memory accesses cannot be moved after the release write access. Hence we introduce a leading DMBFULL with WMOV_A in C11 ↦ x86 ↦ ARMv8 to preserve the same order. Consider the C11 to x86 to ARMv8 mapping of the program below.

C11: a = X_NA; Y_NA = 1; Z_REL = 1; ∥ r = Z_ACQ; if (r == 1) { X_NA = 2; Y_NA = 2; c = Y_NA; }
↦ x86: a = X_NA; Y_NA = 1; Z_A = 1; ∥ r = Z_A; if (r == 1) { X_NA = 2; Y_NA = 2; c = Y_NA; }
↦ ARMv8: a = X; Y = 1; DMBFULL; Z = 1; ∥ r = Z; DMBLD; if (r == 1) { X = 2; Y = 2; c = Y; }

The C11 program is data race free as it is well synchronized by the release-acquire accesses on Z, and the outcome a = 2, r = c = 1 is disallowed in the program. The generated ARMv8 program disallows the outcome; however, without the DMBFULL in the first thread the outcome would be possible. This is because a DMBLD or DMBFULL fence is required to preserve the bob relation between the load of X and the store to Z. Note that a DMBLD is not sufficient to establish a bob relation between the stores to Y and Z, and hence we require a DMBST or DMBFULL fence. Therefore we have to introduce a leading pair of DMBLD and DMBST fences or a DMBFULL fence for the WMOV_A mapping.

As a result, Fig. 9b provides a more efficient mapping for RMOV_NA and WMOV_NA accesses, but incurs more cost for WMOV_A by introducing a leading DMBFULL instead of a DMBST fence. After the mapping we may weaken such a DMBFULL fence whenever appropriate.

The C11 ↦ ARMv8 ↦ ARMv7 scheme does not introduce fences for mapping non-atomics and is therefore more efficient than ARMv8 ↦ ARMv7. Note that C11 St_⊒REL generates an STLR in ARMv8, and an ARMv8 STR is generated only from C11 St_⊑RLX, which does not enforce any such order.

Now we move to mappings between x86 and ARMv7. We do not propose direct mapping schemes; instead we use ARMv8 concurrency as an intermediate concurrency model, as x86 ↦ ARMv7/ARMv7-mca and ARMv7/ARMv7-mca ↦ x86 would be the same as x86 ↦ ARMv8 ↦ ARMv7/ARMv7-mca and ARMv7/ARMv7-mca ↦ ARMv8 ↦ x86 respectively.

Figure 5: Load-store or store-store reorderings introduce the a = b = 1 outcome and are unsound in ARMv8. Program: a = X; c = Y[a]; Z = 1; ∥ b = Z; V[b] = 1; X = 1;

x86 ↦ ARMv7 vs x86 ↦ ARMv8 ↦ ARMv7

We derive x86 ↦ ARMv8 ↦ ARMv7 by combining x86 ↦ ARMv8 (Fig. 9a) and ARMv8 ↦ ARMv7 (Fig. 13a) as follows.

MFENCE ↦ DMBFULL ↦ DMB
RMW ↦ DMBFULL; RMW; DMBFULL ↦ DMB; RMW; DMB
RMOV ↦ LDR; DMBLD ↦ LDR; DMB
WMOV ↦ DMBST; STR ↦ DMB; STR

The correctness proofs of the x86 to ARMv8 and ARMv8 to ARMv7 mapping schemes in Fig. 9a and Fig. 13a demonstrate the necessity of the introduced fences. The introduced fences only allow reordering of an independent store-load access pair on different locations, which matches the reordering that x86 allows. Therefore the introduced fences are necessary and sufficient.
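The two-step x86 ↦ ARMv8 ↦ ARMv7 derivation can be reproduced mechanically, as in the following Python sketch. The x86 to ARMv8 table is taken from the text. The per-instruction ARMv8 to ARMv7 table and the merging of adjacent DMB fences are assumptions chosen to reproduce the combined rows above; they are not a transcription of Fig. 13a, and the function names are hypothetical.

```python
X86_TO_ARMV8 = {
    "MFENCE": ["DMBFULL"],
    "RMW": ["DMBFULL", "RMW", "DMBFULL"],
    "RMOV": ["LDR", "DMBLD"],
    "WMOV": ["DMBST", "STR"],
}

# Assumed per-instruction rules: every ARMv8 fence becomes a full DMB,
# loads and RMWs get a trailing DMB, plain stores are kept as they are.
ARMV8_TO_ARMV7 = {
    "DMBFULL": ["DMB"],
    "DMBLD": ["DMB"],
    "DMBST": ["DMB"],
    "LDR": ["LDR", "DMB"],
    "RMW": ["RMW", "DMB"],
    "STR": ["STR"],
}

def apply_mapping(mapping, seq):
    """Replace each instruction by its image under the mapping."""
    return [out for ins in seq for out in mapping[ins]]

def coalesce_dmb(seq):
    """Drop a DMB that directly follows another DMB."""
    result = []
    for ins in seq:
        if ins == "DMB" and result and result[-1] == "DMB":
            continue
        result.append(ins)
    return result

def x86_to_armv7(seq):
    """Compose the two mappings through ARMv8."""
    armv8 = apply_mapping(X86_TO_ARMV8, seq)
    return coalesce_dmb(apply_mapping(ARMV8_TO_ARMV7, armv8))

for op in ["MFENCE", "RMW", "RMOV", "WMOV"]:
    print(op, "->", x86_to_armv7([op]))
```

Applied to longer sequences, the same composition also merges the fences of neighbouring accesses, for example the trailing DMB of a load with the leading DMB of the following store.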

ARMv7 ↦ x86 vs ARMv7 ↦ ARMv8 ↦ x86

We derive ARMv7 ↦ ARMv8 ↦ x86 by combining ARMv7 ↦ ARMv8 (Fig. 12a) and ARMv8 ↦ x86 (Fig. 12b) as follows. Note that the mapping does not introduce any fence along with the accesses and is therefore optimal.

DMB ↦ DMBFULL ↦ MFENCE
RMW ↦ RMW ↦ RMW
LDR ↦ LDR ↦ RMOV
STR ↦ STR ↦ WMOV

We consider ARMv8 as a concurrency model of an IR and find that many common compiler optimizations are unsound in ARMv8.

• ARMv8 does not allow store-store and load-store reorderings. Consider the program and the execution in Fig. 5. In this execution there are addr; [Ld]; po; [St] and addr; [St]; po; [St] relations in the first and second threads respectively, which result in dob relations and in turn an ob cycle. Therefore the execution is not ARMv8 consistent and the outcome a = b = 1 is disallowed. However, the load-store reordering c = Y[a]; Z = 1 ⇝ Z = 1; c = Y[a] or the store-store reordering V[b] = 1; X = 1 ⇝ X = 1; V[b] = 1 removes the respective dob relation(s) and enables a = b = 1 in the target. Thus store-store and load-store reorderings are unsafe in ARMv8.
• Overwritten-write (OW) is unsound. Consider the program and its outcome a = 1, b = 2 in Fig. 6a. In the respective execution the first thread has data; coi ⊆ dob from the load of X to the store Y = 2. The other thread has a bob relation due to the DMBFULL fence, which in turn creates an ob cycle. Hence the execution is not ARMv8 consistent and the outcome a = 1, b = 2 is disallowed. Overwriting Y = a in the first thread removes the dob relation, and then a = 1, b = 2 becomes possible.
• Read-after-write (RAW) is unsound. We study the RAW elimination in Fig. 6b, which is performed based on dependence analysis. Before we go to the transformation, we briefly discuss dependence analysis on the access sequence a = X; Y[a ∗ 0] = 1. In this case there is a false dependence from the load of X to the store of Y[a ∗ 0], as a ∗ 0 = 0 always. ARMv8 does not allow removing such a false dependence Pulte et al. [2018]. However, we observe that using a static analysis that distinguishes between true and false dependencies is also wrong in ARMv8. In this example we analyze such a false dependency and based on that we perform read-after-write elimination on the program, that is, Y[a ∗ 0] = 1; b = Y[0] ⇝ Y[a ∗ 0] = 1; b = 1. The source program does not have any execution with a = 1, b = 1, c = 0, as addr; rfi; addr ⊆ dob and in the other thread there is a bob relation, which together create an ob cycle. In the target execution there is no dob relation from the load of X to the load c = Z[b], and therefore the outcome a = 1, b = 1, c = 0 is possible. As a result, the transformation is unsound in ARMv8.

Figure 6: Overwritten-write (OW) and read-after-write elimination (RAW) are unsound in ARMv8. (a) OW introduces a = 1, b = 2: a = X; Y = a; Y = 2; ⇝ a = X; Y = 2; in the context b = Y; DMBLD; X = 1. (b) RAW introduces a = b = 1, c = 0: a = X; Y[a ∗ 0] = 1; b = Y[0]; c = Z[b]; ⇝ a = X; Y[a ∗ 0] = 1; b = 1; c = Z[b]; in the context Z[1] = 1; DMBFULL; X = 1.

Figure 7: A program of the form SB ∥ ··· ∥ SB ∥ SB′ ∥ ··· ∥ SB′ ∥ SB″ ∥ ··· ∥ SB″ is SC-robust against x86, where SB ≜ ⟨X = 1; MFENCE; t = Y;⟩, SB′ ≜ ⟨Y = 1; MFENCE; t = X;⟩, and SB″ ≜ ⟨Y = 1; t = Z;⟩.

The mapping schemes introduce leading and/or trailing fences for various memory accesses. However, some of these fences may be redundant and can safely be eliminated. Consider the x86 ↦ ARMv8 mapping and the subsequent redundant fence eliminations below.

RMOV; MFENCE; WMOV ↦ LDR; DMBLD; DMBFULL; DMBST; STR ⇝ LDR; DMBLD; STR

The ARMv8 access sequence generated by the x86 to ARMv8 mapping introduces three intermediate fences between the load-store pair. A DMBLD fence suffices to order a load-store pair, and hence the DMBFULL as well as the DMBST fence are redundant and are safely eliminated.

To perform such fence eliminations, we first identify non-deletable fences and then delete the rest of the fences. A fence is non-deletable if it is placed between a memory access pair in at least one program path such that the access pair may execute out of order without the fence. Analyzing the ARMv8 sequence above, we mark the DMBLD as non-deletable and the rest of the fences as redundant.
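The idea of keeping only non-deletable fences can be sketched for a single straight-line ARMv8 sequence as follows. This is an illustrative simplification and not the algorithm of §6: it ignores multiple program paths, dependencies, and release/acquire accesses, it assumes that DMBLD orders load-load and load-store pairs, DMBST orders store-store pairs, and DMBFULL orders all pairs, and it tries stronger fences first so that a weaker fence is kept whenever it suffices. All names are hypothetical.

```python
FENCE_ORDERS = {
    "DMBLD": {("Ld", "Ld"), ("Ld", "St")},
    "DMBST": {("St", "St")},
    "DMBFULL": {("Ld", "Ld"), ("Ld", "St"), ("St", "Ld"), ("St", "St")},
}
ACCESS_KIND = {"LDR": "Ld", "STR": "St"}

def ordered_pairs(seq):
    """Index pairs (i, j) of accesses ordered by some fence between them."""
    pairs = set()
    for i, first in enumerate(seq):
        if first not in ACCESS_KIND:
            continue
        for j in range(i + 1, len(seq)):
            second = seq[j]
            if second not in ACCESS_KIND:
                continue
            kinds = (ACCESS_KIND[first], ACCESS_KIND[second])
            between = seq[i + 1:j]
            if any(f in FENCE_ORDERS and kinds in FENCE_ORDERS[f] for f in between):
                pairs.add((i, j))
    return pairs

def eliminate_fences(seq):
    """Delete every fence whose removal leaves all ordered pairs ordered."""
    required = ordered_pairs(seq)
    work = list(seq)
    for kind in ["DMBFULL", "DMBST", "DMBLD"]:   # strongest fence first
        for k in range(len(work)):
            if work[k] != kind:
                continue
            # None keeps the indices of the remaining instructions stable.
            trial = work[:k] + [None] + work[k + 1:]
            if ordered_pairs(trial) == required:
                work = trial
    return [ins for ins in work if ins is not None]

print(eliminate_fences(["LDR", "DMBLD", "DMBFULL", "DMBST", "STR"]))
print(eliminate_fences(["STR", "DMBFULL", "LDR"]))
```

On the sequence discussed above only the DMBLD survives, whereas a DMBFULL between a store and a following load is kept because no weaker fence in this sketch orders that pair.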

Existing approaches Lahav and Margalit [2019], Bouajjani et al. [2013] explore program executions to answer such queries. We propose an alternative approach that analyzes memory access sequences. In this analysis:

1. We identify the program components which may run concurrently. Currently we consider fork-join parallelism and identify the functions which create one or multiple threads. Our analysis assumes that each such function creates multiple threads. Therefore, analyzing these functions f_1, ..., f_n, we analyze all programs of the form f_1 ∥ ··· ∥ f_1 ∥ ··· ∥ f_n ∥ ··· ∥ f_n.
2. Next, we analyze the memory access sequences in f_1, ..., f_n to check whether the memory access pairs in these functions may create a cycle.
3. In case a cycle is possible, we check if each access pair on a cycle is ordered by the robustness condition. If so, then all K consistent executions of these programs are also M consistent.

Consider the example in Fig. 7. We analyze the access sequences in the thread functions SB, SB′, and SB″ and derive a graph over memory access pairs, which contains a cycle formed by the memory access pairs in SB and SB′. These pairs on the cycle have intermediate MFENCE operations, which enforce only interleaving executions irrespective of the number of threads created from SB, SB′, SB″. Our analysis reports these x86 programs as SC-robust. Using this approach we check M-robustness against K, where K is a weaker model than M.

Enforcing robustness.

If we identify a robustness violation for a program, then we identify the memory access pairs which may violate a robustness condition. For these access pairs we introduce intermediate fences to enforce robustness against a stronger model.

ppo does not suffice to enforce robustness in ARMv7

In addition to fences, ppo relations also order a pair of accesses on different locations. However, we observe that ppo relations are not sufficient to ensure robustness for the ARMv7 model. Consider the execution in Fig. 8: the execution allows the cycle and violates SC-robustness. Therefore ppo cannot be used to order epo relations to preserve robustness.

Figure 8: Execution with a prop(b, g) ∧ coe(g, h) ∧ ahb(h, b) cycle is allowed. Events: a: Ld(A), b: St(X), c: St(X), d: Ld(X), e: St(Y), f: Ld(Y), g: St(Z), h: St(Z), i: Ld(Z), j: St(A), related by ppo, fence, coe, and rfe edges.

Syntax

Instead of delving into the syntactic notations of each instruction set, we use common expressions and commands which can be extended in each architecture.

E ::= r | v | X | E + E | E ∗ E | E ≤ E | ···   (Expr)
C ::= skip | C; C | r = E | r = X | X = E | r = RMW(X, E, E) | r = RMW(X, E) | ··· | br label | br label label   (Cmd)
P ::= X = v; ··· X = v; { C ∥ ··· ∥ C }   (Program)

In this notation we use X ∈ Locs, r ∈ Reg, and v ∈ val, where Locs, Reg, val denote finite sets of memory locations, registers, and values respectively. A program P consists of a set of initialization writes followed by a parallel composition of thread commands.

Semantics

We follow the per-execution axiomatic models for these architectures. In these models a program's semantics is defined by a set of consistent executions. An execution consists of a set of events and relations among the events.

Given a binary relation R on events, R⁻¹, R?, R⁺, and R* represent the inverse, reflexive, transitive, and reflexive-transitive closures of R respectively. dom(R) and codom(R) denote its domain and its range respectively. Relation R is total on set S when total(S, R) ≜ ∀a, b ∈ S. a = b ∨ R(a, b) ∨ R(b, a). We compose binary relations R, S ⊆ E × E relationally by R; S. [A] denotes the identity relation on a set A. We write R|loc to denote R-related event pairs on the same locations, that is, R|loc ≜ {(e, e′) ∈ R | e.loc = e′.loc}. Similarly, R|≠loc ≜ R \ R|loc is the set of R-related event pairs on different locations.

Definition 1.

An event is of the form ⟨id, tid, lab⟩, where id, tid ∈ ℕ, and lab are the unique identifier, thread id, and the label of the event based on the respective executed memory access or fence instruction. A label is of the form ⟨op, loc, rval, wval⟩.

For an event e, whenever applicable, e.lab, e.op, e.loc, e.rval, and e.wval return the label, operation type, location, read value, and written value respectively. We write Ld, St, U, and F to represent the sets of load, store, update, and fence events. Moreover, load or update events are read events (R) and store or update events are write events (W), that is, R = Ld ∪ U and W = St ∪ U. We write [[i]] to represent the event generated in the respective model from an instruction i. For example, in x86 [[i]] ∈ St holds when i is a WMOV instruction. We also overload the notation as [[P]]_M to denote the set of executions of program P in model M.

In an execution, events are related by various types of relations. Program-order (po) captures the syntactic order among the events. We write a.b to denote that b is the immediate po-successor of event a. Reads-from (rf) associates a write event with a read event that justifies its read value. Coherence-order (co) is a total order on same-location writes (stores or updates). The from-read (fr) relation relates a pair of same-location read and write events. We also categorize the relations as external and internal relations and define the extended-coherence-order (eco). Modification order (mo) is a total order on writes, updates, and fences such that mo ⊆ O × O where O = St ∪ U ∪ F. Note that the co relation is included in the mo relation. The mo relation is used in the x86 model only; the ARM models do not use mo in their definitions.

Definition 2.

An execution is of the form $X = \langle E, \mathsf{po}, \mathsf{rf}, \mathsf{co}, \mathsf{mo} \rangle$ where $X.E$ denotes the set of memory access or fence events and $X.\mathsf{po}$, $X.\mathsf{rf}$, $X.\mathsf{co}$, and $X.\mathsf{mo}$ denote the program-order, reads-from, coherence-order, and modification-order relations between the events in $X.E$.

We now discuss the architectures. We follow the axiomatic models of x86 and ARMv7 from Lahav et al. [2017], and the ARMv8 axiomatic model from Pulte et al. [2018]. We also present ARMv7-mca, a strengthened ARMv7 model with multicopy atomicity (MCA).

x86.

In x86 the MOV instruction is used both for loading a value from memory and for storing a value to memory. To differentiate these two accesses we categorize them as WMOV and RMOV operations. In addition, there are atomic update operations which we denote by RMW. x86 also provides MFENCE, which flushes buffers and caches and ensures ordering between the preceding and following memory accesses.

In x86 concurrency WMOV, RMOV, and MFENCE generate St, Ld, and F events respectively. A successful RMW generates a U event and otherwise an Ld event. We derive the x86-happens-before (xhb) relation from the program-order and reads-from relations: $\mathsf{xhb} \triangleq (\mathsf{po} \cup \mathsf{rf})^{+}$. An x86 execution $X$ is consistent when:
• $X.\mathsf{xhb}$ is irreflexive. (irrHB)
• $X.\mathsf{mo} ; X.\mathsf{xhb}$ is irreflexive. (irrMOHB)
• $X.\mathsf{fr} ; X.\mathsf{xhb}$ is irreflexive. (irrFRHB)
• $X.\mathsf{fr} ; X.\mathsf{mo}$ is irreflexive. (irrFRMO)
• $X.\mathsf{fr} ; X.\mathsf{mo} ; X.\mathsf{rfe} ; X.\mathsf{po}$ is irreflexive. (irrFMRP)
• $X.\mathsf{fr} ; X.\mathsf{mo} ; [X.U \cup X.F] ; X.\mathsf{po}$ is irreflexive. (irrUF)

ARMv7.

It provides LDR and STR instructions for load and store operations, and load-exclusive (LDREX) and store-exclusive (STREX) instructions to perform the atomic update operation RMW, where $\mathsf{RMW} \triangleq L{:}\ \mathsf{LDREX};\ \mathsf{mov};\ \mathsf{teq}\ L';\ \mathsf{STREX};\ \mathsf{teq}\ L;\ L'{:}$. ARMv7 provides the full fence DMB, which orders preceding and following instructions. There is also the lightweight control fence ISB, which is used to construct $\mathsf{CBISB} \triangleq \mathsf{cmp};\ \mathsf{bc};\ \mathsf{ISB}$ to order load operations.

In this model load (Ld), store (St), and F events are generated from the execution of LDR and LDREX, STR and STREX, and DMB instructions respectively. The ISB fence is captured in $\mathsf{ctrl}_{ISB}$ (similar to $\mathsf{ctrl}_{isync}$ in Lahav et al. [2017]) and in turn in the ppo relation, but does not create any event in an execution.

ARMv7 defines the preserved-program-order (ppo) relation, which is a subset of the program-order relation. We first discuss the primitives of ppo following §F.1 in Lahav et al. [2017]: ppo is based on data ($\subseteq Ld \times St$), control ($\subseteq Ld \times E$), and address ($\subseteq Ld \times (Ld \cup St)$) dependencies. Moreover, ISB fences along with conditionals introduce the $\mathsf{ctrl}_{ISB} \subseteq \mathsf{ctrl}$ preserved program order. Finally, $\mathsf{ctrl} ; \mathsf{po} \subseteq \mathsf{ctrl}$ and $\mathsf{ctrl}_{ISB} ; \mathsf{po} \subseteq \mathsf{ctrl}_{ISB}$ hold by definition.

Based on these primitives ARMv7 defines the read-different-writes (rdw) and detour relations as follows.

$\mathsf{rdw} \triangleq (\mathsf{fre} ; \mathsf{rfe}) \subseteq \mathsf{po}$
$\mathsf{detour} \triangleq (\mathsf{coe} ; \mathsf{rfe}) \setminus \mathsf{po}$

Read-different-writes (rdw) relates two reads on the same location in a thread which read from different writes, and detour captures the scenario where an external write takes place between a pair of same-location writes in the same thread and the read reads from that external write.

Based on these primitives ARMv7 defines the $\mathsf{ii}_0$, $\mathsf{ic}_0$, $\mathsf{ci}_0$, and $\mathsf{cc}_0$ components as follows.

$\mathsf{ii}_0 \triangleq \mathsf{addr} \cup \mathsf{data} \cup \mathsf{rdw} \cup \mathsf{rfi}$
$\mathsf{ic}_0 \triangleq \emptyset$
$\mathsf{ci}_0 \triangleq \mathsf{ctrl}_{ISB} \cup \mathsf{detour}$
$\mathsf{cc}_0 \triangleq \mathsf{data} \cup \mathsf{ctrl} \cup \mathsf{addr} ; \mathsf{po}^{?}$

Using these components ARMv7 defines the ii, ic, ci, and cc relations, where each of these relations is derived from the following sequential compositions and constraints.

$\mathsf{xy} \triangleq \bigcup_{n \geq 1} (\mathsf{x}^1\mathsf{y}^1)_0 ; (\mathsf{x}^2\mathsf{y}^2)_0 ; \cdots ; (\mathsf{x}^n\mathsf{y}^n)_0$ where
• $x, y, x^1 \cdots x^n, y^1 \cdots y^n \in \{ i, c \}$.
• If $x = c$ then $x^1 = c$.
• For every $1 \leq k \leq n - 1$, if $y^k = c$ then $x^{k+1} = c$.
• If $y = i$ then $y^n = i$.

Finally ARMv7 defines ppo as follows: $\mathsf{ppo} \triangleq [Ld] ; \mathsf{ii} ; [Ld] \cup [Ld] ; \mathsf{ii} ; [St]$.

ARMv7 also defines the fence, ARM-happens-before (ahb), and propagation (prop) relations as follows.

$\mathsf{fence} \triangleq [Ld \cup St] ; \mathsf{po} ; [F] ; \mathsf{po} ; [Ld \cup St]$
$\mathsf{ahb} \triangleq \mathsf{ppo} \cup \mathsf{fence} \cup \mathsf{rfe}$
$\mathsf{prop} \triangleq \mathsf{prop}_1 \cup \mathsf{prop}_2$ where
$\mathsf{prop}_1 \triangleq [St] ; \mathsf{rfe}^{?} ; \mathsf{fence} ; \mathsf{ahb}^{*} ; [St]$ and
$\mathsf{prop}_2 \triangleq (\mathsf{coe} \cup \mathsf{fre})^{?} ; \mathsf{rfe}^{?} ; (\mathsf{fence} ; \mathsf{ahb}^{*})^{?} ; \mathsf{fence} ; \mathsf{ahb}^{*}$

These relations are used to define the consistency constraints of an ARMv7 execution $X$ as follows:
• $X.\mathsf{co}$ is total. (total-co)
• $(X.\mathsf{po}|_{loc} \cup X.\mathsf{rf} \cup X.\mathsf{fr} \cup X.\mathsf{co})$ is acyclic. (sc-per-loc)
• $X.\mathsf{fre} ; X.\mathsf{prop} ; X.\mathsf{ahb}^{*}$ is irreflexive. (observation)
• $(X.\mathsf{co} \cup X.\mathsf{prop})$ is acyclic. (propagation)
• $[X.\mathsf{rmw}] ; X.\mathsf{fre} ; X.\mathsf{coe}$ is irreflexive. (atomicity)
• $X.\mathsf{ahb}$ is acyclic. (no-thin-air)

x86 | ARMv8
RMOV | LDR; DMBLD
WMOV | DMBST; STR
RMW | DMBFULL; RMW; DMBFULL
MFENCE | DMBFULL
(a) x86 to ARMv8

C11 to x86 | ARMv8
$\mathsf{RMOV}_{NA}$ | LDR
$\mathsf{WMOV}_{NA}$ | STR
$\mathsf{RMOV}_{A}$ | LDR; DMBLD
$\mathsf{WMOV}_{A}$ | DMBFULL; STR
RMW | DMBFULL; RMW; DMBFULL
MFENCE | DMBFULL
(b) C11 to x86 to ARMv8

Figure 9: Mapping schemes from x86 to ARMv8.

ARMv7-mca.

We strengthen the ARMv7 model and define the ARMv7-mca model to support multicopy atomicity. To do so, following Wickerson et al. [2017], we define write-order (wo) and impose the following additional constraint on ARMv7:
• $X.\mathsf{wo}^{+}$ is acyclic where $\mathsf{wo} = (\mathsf{rfe} ; \mathsf{ppo} ; \mathsf{fre})$. (mca)

ARMv8.

ARMv8 provides LDR and STR instructions for load and store operations, and load-exclusive (LDXR) and store-exclusive (STXR) instructions to construct RMW similar to that of ARMv7. In addition, ARMv8 provides load-acquire (LDAR), store-release (STLR), load-acquire exclusive (LDAXR), and store-release exclusive (STLXR) instructions which operate as half fences. In addition to DMBFULL and ISB, ARMv8 provides load (DMBLD) and store (DMBST) fences. A DMBLD fence orders a load with other accesses and a DMBST orders a pair of store accesses.

Based on these primitives ARMv8 defines the coherence-after (ca), observed-by (obs), and atomic-ordered-by (aob) relations on same-location events. ARMv8 also defines the dependency-ordered-before (dob) and barrier-ordered-by (bob) relations to order a pair of intra-thread events. Finally, ordered-before (ob) is the transitive closure of the obs, aob, dob, and bob relations.

$\mathsf{ca} \triangleq \mathsf{fr} \cup \mathsf{co}$
$\mathsf{obs} \triangleq \mathsf{rfe} \cup \mathsf{fre} \cup \mathsf{coe}$
$\mathsf{aob} \triangleq \mathsf{rmw} \cup [\mathit{range}(\mathsf{rmw})] ; \mathsf{rfi} ; [A]$
$\mathsf{dob} \triangleq \mathsf{addr} \cup \mathsf{data} \cup \mathsf{ctrl} ; [St] \cup (\mathsf{ctrl} \cup (\mathsf{addr} ; \mathsf{po})) ; [\mathsf{ISB}] ; \mathsf{po} ; [Ld] \cup \mathsf{addr} ; \mathsf{po} ; [St] \cup (\mathsf{ctrl} \cup \mathsf{data}) ; \mathsf{coi} \cup (\mathsf{addr} \cup \mathsf{data}) ; \mathsf{rfi}$
$\mathsf{bob} \triangleq \mathsf{po} ; [F] ; \mathsf{po} \cup [L] ; \mathsf{po} ; [A] \cup [Ld] ; \mathsf{po} ; [F_{LD}] ; \mathsf{po} \cup [A] ; \mathsf{po} \cup [St] ; \mathsf{po} ; [F_{ST}] ; \mathsf{po} ; [St] \cup \mathsf{po} ; [L] \cup \mathsf{po} ; [L] ; \mathsf{coi}$
$\mathsf{ob} \triangleq (\mathsf{obs} \cup \mathsf{dob} \cup \mathsf{aob} \cup \mathsf{bob})^{+}$

Finally an ARMv8 execution $X$ is consistent when:
• $X.\mathsf{po}|_{loc} \cup X.\mathsf{ca} \cup X.\mathsf{rf}$ is irreflexive. (internal)
• $X.\mathsf{ob}$ is irreflexive. (external)
• $X.\mathsf{rmw} \cap (X.\mathsf{fre} ; X.\mathsf{coe}) = \emptyset$. (atomic)

Figure 10: In x86 to ARMv8 mapping RMW requires a leading F fence. (Two-thread program with its x86 and ARMv8 executions.)

Figure 11: In x86 to ARMv8 mapping RMW requires a trailing F fence. (Two-thread program with its x86 and ARMv8 executions.)

We propose correct and efficient mapping schemes between x86 and ARM models. These schemes may introduce leading and/or trailing fences while mapping memory accesses from one architecture to another. We show by examples that the fences are necessary and prove that the fences are sufficient for correctness. To prove correctness we show that for each consistent execution of the target program after mapping there exists a corresponding consistent execution of the source program before mapping with the same behavior.
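The consistency conditions used throughout this section are irreflexivity, acyclicity, or emptiness checks over finite relations, so they are easy to prototype. The following Python sketch is illustrative only and is not taken from the paper's tooling: relations are plain sets of event pairs, the helper names (`compose`, `transitive_closure`, `external_ok`, `atomic_ok`) are hypothetical, and the example execution is hand-written rather than derived from a program.

```python
def compose(r, s):
    """Relational composition r ; s over sets of pairs."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def transitive_closure(r):
    """R+ computed by iterating composition until a fixpoint is reached."""
    closure = set(r)
    while True:
        step = closure | compose(closure, r)
        if step == closure:
            return closure
        closure = step


def irreflexive(r):
    return all(a != b for (a, b) in r)


def acyclic(r):
    return irreflexive(transitive_closure(r))


def external_ok(obs, dob, aob, bob):
    """(external): ob = (obs U dob U aob U bob)+ must be irreflexive."""
    return irreflexive(transitive_closure(obs | dob | aob | bob))


def atomic_ok(rmw, fre, coe):
    """(atomic): rmw and (fre ; coe) must not share a pair."""
    return not (rmw & compose(fre, coe))


# Hypothetical four-event execution: a, b in one thread and c, d in another.
bob = {("a", "b"), ("c", "d")}   # both threads are fenced
obs = {("b", "c"), ("d", "a")}   # external edges closing the cycle
print(external_ok(obs, set(), set(), bob))           # ob has a cycle
print(external_ok(obs, set(), set(), {("a", "b")}))  # one bob edge removed
```

Dropping a single `bob` edge breaks the cycle, which is the shape of argument used when the text shows that a fence is necessary.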

The mapping scheme from x86 to ARMv8 is in Fig. 9a. The scheme generates a DMBFULL for an MFENCE. While mapping x86 memory accesses to those of ARMv8, the scheme introduces a leading DMBST fence with a store, a trailing DMBLD fence with a load, and leading as well as trailing DMBFULL fences with an update. We now discuss why these fences are required.
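The scheme just described is a per-instruction substitution, which the following Python sketch makes concrete. It is a minimal illustration under simplifying assumptions: a thread is a list of opcode names without operands, and `X86_TO_ARMV8` and `map_x86_to_armv8` are hypothetical names introduced here.

```python
# Per-instruction x86 -> ARMv8 substitution following the scheme of Fig. 9a.
X86_TO_ARMV8 = {
    "RMOV": ["LDR", "DMBLD"],                 # load gets a trailing DMBLD
    "WMOV": ["DMBST", "STR"],                 # store gets a leading DMBST
    "RMW": ["DMBFULL", "RMW", "DMBFULL"],     # update gets both fences
    "MFENCE": ["DMBFULL"],
}


def map_x86_to_armv8(thread):
    """Translate a list of x86 opcode names into ARMv8 opcode names."""
    out = []
    for op in thread:
        out.extend(X86_TO_ARMV8[op])
    return out


print(map_x86_to_armv8(["RMOV", "WMOV"]))
```

Because each instruction is translated in isolation, adjacent translations can place fences next to each other; this is what the later fence elimination passes clean up.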

Leading store fence

In an x86 execution a pair of stores is ordered, unlike in an ARMv8 execution. A pair of store events (St) in an ARMv8 execution is bob-ordered when there is an intermediate $F_{ST}$ or F event, that is, $[St] ; \mathsf{po} ; [F_{ST} \cup F] ; \mathsf{po} ; [St] \subseteq \mathsf{bob}$. To introduce such a bob order we require at least an intermediate $F_{ST}$ fence event. Therefore the scheme generates a leading DMBST fence with a store, which ensures store-store order with preceding stores in ARMv8.

Trailing load fence

A load-store or load-load access pair is ordered in x86. To preserve the same access ordering we require an $F_{LD}$ fence between a load-load or load-store access pair. Therefore the scheme generates a trailing DMBLD fence with a load, which ensures such order.

Leading and trailing fence for atomic update

Consider the x86 programs in Figs. 10 and 11 and the outcome a = b = 0. No x86 execution would allow a = b = 0 in these two programs. However, if we translate these programs without intermediate DMBFULL fences between each pair of store and RMW accesses then a = b = 0 would be possible in these two programs in ARMv8, as shown in the corresponding executions. As a result, the translations from x86 to ARMv8 would be unsound. The leading and trailing DMBFULL fences with RMW accesses provide these intermediate fences in the respective programs to disallow a = b = 0 in both programs.

ARMv7/ARMv7-mca | ARMv8
LDR | LDR
STR | STR
RMW | RMW
DMB | DMBFULL
ISB | ISB
(a) ARMv7 or ARMv7-mca to ARMv8

ARMv8 | x86
LDR | RMOV
LDAR | RMOV
STR | WMOV
STLR | WMOV; MFENCE
RMW | RMW
DMBFULL | MFENCE
DMBLD / DMBST / ISB | skip
(b) ARMv8 to x86

Figure 12: Mapping schemes: ARMv8 to x86 and ARMv7/ARMv7-mca to ARMv8.

Mapping correctness

These fences suffice to preserve mapping correctness as stated in Theorem 1 and proved in Appendix A.1.

Theorem 1.

The mappings in Fig. 9a are correct.

In this mapping from x86 to ARMv8 we exploit the C11 semantic rule that a data race results in undefined behavior. The mapping scheme is in Fig. 9b. In this scheme we categorize the x86 load and store accesses by whether they are generated from C11 non-atomic or atomic accesses. If we know that a load/store access is generated from a C11 non-atomic load/store then we do not introduce any trailing or leading fence. We prove the correctness of the scheme (Theorem 2) in Appendix A.2.

Theorem 2.

The mapping scheme in Fig. 9b is correct.

In §2.4 we have already demonstrated the tradeoff between the x86 $\mapsto$ ARMv8 and C11 $\mapsto$ x86 $\mapsto$ ARMv8 mapping schemes.

The mapping scheme is in Fig. 12b. In this scheme an ARMv8 load or load-acquire is mapped to an x86 load and a store is mapped to an x86 store operation. The scheme generates a trailing MFENCE with a store in x86 for an ARMv8 release-store, as $[L] ; \mathsf{po} ; [A] \subseteq \mathsf{bob}$ whereas in x86 store-load pairs on different locations are unordered. Consider the example below.

Example executions (diagrams not reproduced): (a) disallowed in ARMv8; (b) fences disallow the execution in x86.

The scheme also maps an atomic access pair to an atomic update in x86. The DMBLD, DMBST, and ISB fences are not mapped to any access.
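As a concrete reading of this scheme, the sketch below encodes the Fig. 12b table in Python. It is illustrative only: threads are lists of opcode names without operands, and `ARMV8_TO_X86` and `map_armv8_to_x86` are hypothetical names. The weaker ARMv8 fences map to `skip`, modelled here as an empty instruction list.

```python
# Per-instruction ARMv8 -> x86 substitution following the scheme of Fig. 12b.
ARMV8_TO_X86 = {
    "LDR": ["RMOV"],
    "LDAR": ["RMOV"],
    "STR": ["WMOV"],
    "STLR": ["WMOV", "MFENCE"],   # release-store needs a trailing MFENCE
    "RMW": ["RMW"],
    "DMBFULL": ["MFENCE"],
    "DMBLD": [],                  # skip
    "DMBST": [],                  # skip
    "ISB": [],                    # skip
}


def map_armv8_to_x86(thread):
    """Translate a list of ARMv8 opcode names into x86 opcode names."""
    out = []
    for op in thread:
        out.extend(ARMV8_TO_X86[op])
    return out


print(map_armv8_to_x86(["STLR", "STR"]))
```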

Theorem 3.

The mapping scheme in Fig. 12b is correct.

Proof Strategy

To prove Theorem 3 we first define the corresponding ARMv8 execution $X_s$ for a given x86 consistent execution $X_t$. Next we show that $X_s$ is ARMv8 consistent. To do so, we establish Lemma 1 and then use it to establish Lemma 2 on x86 consistent executions. Next, we define x86-preserved-program-order (xppo), and then based on xppo we define x86-observation (obx) on an x86 execution and establish Lemma 3. Finally we prove Theorem 3 using Lemma 2 and Lemma 3. The detailed proofs of Lemmas 1 to 3 and Theorem 3 are discussed in Appendix A.3.

$\mathsf{obx} \triangleq \mathsf{rfe} \cup \mathsf{coe} \cup \mathsf{fre} \cup [U] \cup \mathsf{xppo}$ where $\mathsf{xppo}$ is the union of the following eight relations:
• $[Ld] ; \mathsf{po} ; [Ld \cup St]$
• $\mathsf{po} ; [F] ; \mathsf{po}$
• $[St] ; [F] ; [Ld]$
• $[Ld] ; \mathsf{po}$
• $[St] ; \mathsf{po} ; [St]$
• $\mathsf{po} ; [St]$
• $\mathsf{po} ; [St] ; \mathsf{po}|_{loc} ; [St]$
• $[U] ; \mathsf{rfi} ; [Ld]$

ARMv8 | ARMv7/ARMv7-mca
LDR | LDR; DMB
STR | STR
LDAR | LDR; DMB
STLR | DMB; STR; DMB
RMW | RMW; DMB
$\mathsf{RMW}_{A}$ | RMW; DMB
$\mathsf{RMW}_{\sqsupseteq L}$ | DMB; RMW; DMB
DMB(FULL/LD/ST) | DMB
ISB | ISB
(a) ARMv8 to ARMv7

C11 to ARMv8 | ARMv7/ARMv7-mca
$\mathsf{LDR}_{NA}$ | LDR
$\mathsf{LDR}_{A}$ | LDR; DMB
STR | STR
LDAR | LDR; DMB
STLR | DMB; STR; DMB
RMW | RMW; DMB
$\mathsf{RMW}_{A}$ | RMW; DMB
$\mathsf{RMW}_{\sqsupseteq L}$ | DMB; RMW; DMB
DMB(FULL/LD/ST) | DMB
ISB | ISB
(b) C11 to ARMv8 to ARMv7

Figure 13: Mapping schemes: ARMv8 $\mapsto$ ARMv7/ARMv7-mca and C11 $\mapsto$ ARMv8 $\mapsto$ ARMv7/ARMv7-mca.

Lemma 1.

Suppose $X$ is an x86 consistent execution. In that case $X.\mathsf{po}|_{loc} ; X.\mathsf{fr} \implies X.\mathsf{fr} \cup X.\mathsf{co}$.

Lemma 2.

Theorem 4.

The mappings in Fig. 12a are correct.

To prove Theorem 4 we relate preserved-program-order (ppo) in ARMv7 to the ordered-before (ob) relation in ARMv8. In ARMv7 ppo relates intra-thread events, and in ARMv8 dob, bob, and aob relate intra-thread event pairs. Note that the ARMv8 dob, bob, and aob relations together are not enough to capture the ARMv7 ppo relation, as the detour component of ppo involves inter-thread relations. However, the ARMv7 detour relation implies the obs relation in ARMv8 and therefore we can relate the ppo and ob relations. Considering these aspects we state the following lemma.

Lemma 4.

Suppose $X_s$ is an ARMv7 consistent execution and $X_t$ is the corresponding ARMv8 execution. In that case $X_s.\mathsf{ppo} \implies X_t.\mathsf{ob}$.

Based on Lemma 4 along with other helper lemmas we prove the mapping soundness Theorem 4. The detailed proofs of Lemma 4, the helper lemmas, and Theorem 4 are in Appendix A.5.

The mapping scheme is in Fig. 13a. Now we show that the fences along with memory accesses are necessary to preserve mapping soundness. In §2.2 we have already shown that LDR $\mapsto$ LDR; CBISB is unsound and therefore LDR $\mapsto$ LDR; DMB is necessary for correctness. Similarly, LDAR $\mapsto$ LDR; CBISB is unsound and LDAR $\mapsto$ LDR; DMB is necessary for the same reasons.

Leading and trailing fences for release-store mapping

Consider $\mathsf{po} ; [L] \subseteq \mathsf{bob}$ in ARMv8. The bob relation in the first thread along with other relations disallows this behavior. Consider the following example.

Example executions (diagrams not reproduced): (a) disallowed in ARMv8; (b) fences disallow the execution in ARMv7.

Without such an intermediate fence in the first thread the ARMv7 execution would be allowed, which in turn introduces a new outcome in the ARMv7 program, and as a result the mapping would be incorrect. Therefore the STLR mapping requires a leading fence to preserve the mapping soundness. The STLR mapping requires a trailing fence considering an example similar to that of §4.3. Considering the mapping, a CBISB is not required anymore as every load generates a trailing DMB fence.

In addition to RMW, ARMv8 provides acquire and release or stronger RMW accesses, $\mathsf{RMW}_{A}$ and $\mathsf{RMW}_{\sqsupseteq L}$ respectively. Before mapping from ARMv8 we perform the transformations $\mathsf{RMW}_{A} \leadsto \mathsf{RMW} ; \mathsf{DMBLD}$ and $\mathsf{RMW}_{\sqsupseteq L} \leadsto \mathsf{DMBFULL} ; \mathsf{RMW} ; \mathsf{DMBFULL}$. The trailing DMBLD provides the same ordering as an acquire-exclusive load with following accesses. In case of $\mathsf{RMW}_{\sqsupseteq L}$, we introduce leading and trailing DMBFULL fences similar to that of the STLR access.

For DMBFULL, DMBLD, and DMBST fences in ARMv8 the mapping scheme generates DMB fences so that the bob orders in ARMv8 executions are preserved in the corresponding ARMv7 executions. Now we prove the correctness of the mapping as stated in Theorem 5.

Theorem 5.

The mappings in Fig. 13a are correct.

To prove Theorem 5, we relate ARMv8 and ARMv7 consistent executions in Lemma 5 and Lemma 6 as intermediate steps. Lemma 5, Lemma 6, and Theorem 5 are proved in Appendix A.6.

Lemma 5.

Suppose $X_t$ is an ARMv7 consistent execution and $X_s$ is the ARMv8 execution following the mappings in Fig. 13a. In this case $X_s.\mathsf{ob} \implies (X_t.\mathsf{rfe} \cup X_t.\mathsf{coe} \cup X_t.\mathsf{fre} \cup X_t.\mathsf{rmw} \cup X_t.\mathsf{fence})^{+}$.

Lemma 6.

Similar to C11 $\mapsto$ x86 $\mapsto$ ARMv8, we propose a C11 to ARMv8 to ARMv7 mapping scheme in Fig. 13b. The proof is discussed in detail in Appendix A.7. In §2.4 we have already shown that this mapping scheme is more efficient than the ARMv8 to ARMv7 mapping.

Theorem 6.

The mapping scheme in Fig. 13b is correct.

The mapping scheme for ARMv7-mca to ARMv8 is the same as the ARMv7 to ARMv8 mapping scheme shown in Fig. 12a. To prove the mapping soundness we relate an ARMv7 consistent execution to the corresponding ARMv8 execution as follows.

Lemma 7.

Suppose $X_t$ is an ARMv8 consistent execution and $X_s$ is the corresponding ARMv7 consistent execution. In that case $[X_s.Ld] ; X_s.\mathsf{ppo} ; [X_s.Ld] ; X_s.\mathsf{po}|_{loc} ; [X_s.St] \implies [X_t.Ld] ; X_t.\mathsf{ob} ; [X_t.St]$.

Using Lemma 7 we establish the acyclicity of write-order in the ARMv7-mca source execution.

Lemma 8.

Suppose $X_t$ is a target ARMv8 consistent execution and $X_s$ is the corresponding ARMv7 consistent execution. In this case $X_s.\mathsf{wo}^{+}$ is acyclic.

The detailed proofs of Lemma 7 and Lemma 8 are discussed in Appendix A.8. The mapping correctness theorem below directly follows from Lemma 8.

Theorem 7.

The mappings in Fig. 12a are correct for ARMv7-mca.

The mapping schemes, ARMv8 to ARMv7-mca and C11 to ARMv8 to ARMv7-mca, are shown in Fig. 13. The soundness proofs are the same as for Theorems 5 and 6 respectively. We have already discussed in §2.3 why the mapping of a load access requires a trailing DMB fence to preserve correctness.

↓a \ b→ | St | Ld | L | A | F | $F_{LD}$ | $F_{ST}$
St | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗
Ld | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓
L | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗
A | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓
F | ✗ | ✗ | ✓ | ✗ | = | ✓ | ✓
$F_{LD}$ | ✗ | ✗ | ✓ | ✗ | ✓ | = | ✓
$F_{ST}$ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | =

✓ $Ld(X, v') \cdot Ld(X, v) \leadsto Ld(X, v')$ (RAR)
✓ $A(X, v') \cdot Ld(X, v) \leadsto A(X, v')$ (RAA)
✓ $A(X, v') \cdot A(X, v) \leadsto A(X, v')$ (AAA)
(a) LDR and LDAR eliminations.

✓ $Ld(X, v) \leadsto A(X, v)$ (R-A)
✓ $St(X, v) \leadsto L(X, v)$ (W-L)
✓ $F_{LD} / F_{ST} \leadsto F$ (F)
(b) Access strengthening.

Figure 14: Reordering, elimination, and strengthening transformations in ARMv8.

ARMv8

In this section we study the correctness of independent access reordering, redundant access elimination, and access strengthening in the ARMv8 model. We prove the correctness of the safe transformations in Appendix B.

Reorderings.

We show the safe (✓) and unsafe (✗) reordering transformations of the form $a \cdot b \leadsto b \cdot a$ in Fig. 14, where $a$ and $b$ represent independent and adjacent shared memory accesses on different locations. We prove the correctness of the safe reorderings in Appendix B.1.

In Fig. 5 we have already shown that we cannot move a store before any load or store. The same reasoning extends to release-store and acquire-load. It is not safe to move a store before any fence as it may violate a dob relation. Similarly a load cannot be moved before an acquire load, DMBLD, or DMBFULL operation as it may remove a bob relation. However, reordering with a DMBST is safe as the ordering between them does not affect any component of the ob relation. A release-store may safely reorder with a preceding fence as it does not eliminate any bob relation. Similarly, moving a load, store, or DMBST after an acquire-read is allowed as it does not eliminate any existing bob relation. We may safely reorder an acquire-read with DMBFULL as it does not affect the bob relations among the memory accesses. A DMBLD between a load and a load or store creates a bob relation. Hence moving a load after a DMBLD may eliminate a bob relation and is therefore disallowed. Finally, reordering fences is safe as it preserves the bob relations between memory accesses.
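The reordering table of Fig. 14 can be used as a lookup when deciding whether two adjacent, independent operations may be swapped. The Python sketch below transcribes that table; the encoding (`Y` for safe, `N` for unsafe, `=` for the diagonal fence entries) and the names `KINDS`, `TABLE`, and `can_reorder` are assumptions made for illustration, and treating `=` as trivially permitted is our reading, since swapping two identical fences changes nothing.

```python
# Operation kinds in the column order of Fig. 14.
KINDS = ["St", "Ld", "L", "A", "F", "F_LD", "F_ST"]

# Row = first operation a, column = second operation b, for a . b ~> b . a.
TABLE = {
    "St":   "NYNYNYN",
    "Ld":   "NYNYNNY",
    "L":    "NYNNNYN",
    "A":    "NNNNYYY",
    "F":    "NNYN=YY",
    "F_LD": "NNYNY=Y",
    "F_ST": "NYYYYY=",
}


def can_reorder(a, b):
    """True if a . b ~> b . a is marked safe (or trivial) in the table."""
    return TABLE[a][KINDS.index(b)] in "Y="


print(can_reorder("Ld", "F_LD"))  # moving a load after a DMBLD
print(can_reorder("F", "L"))      # release-store with a preceding full fence
```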

Redundant access elimination

In §2.5 we have shown that the overwritten-write and read-after-write transformations are unsound. However, a read-after-read elimination is safe in ARMv8, as listed in Fig. 14. We prove the correctness of the transformation in Appendix B.2.

Access strengthening

Strengthening memory accesses and fences may only introduce new ordering among events, and therefore the strengthening transformations listed in Fig. 14 hold trivially.

In this section we prove the correctness of various fence eliminations and then propose the respective fence elimination algorithms. More specifically, the proposed mapping schemes in §4 may introduce fences, some of which are redundant in certain scenarios and can safely be eliminated. To do so, we first check if a fence is non-eliminable. If not, we delete the fence.

In x86 only a store-load pair on different locations is unordered. Therefore if a fence appears between such a pair then it is not safe to eliminate the fence. Otherwise we may eliminate the fence.

Theorem 8. An MFENCE in an x86 program thread is non-eliminable if it is the only fence on a program path from a store to a load in the same thread which access different locations. An MFENCE elimination is safe when it is not non-eliminable.

We prove the theorem in Appendix C.1. This fence elimination condition is particularly useful after the ARMv8 to x86 mapping following the scheme in Fig. 12b, as it introduces certain redundant fences. For instance, the ARMv8 to x86 mapping STLR; STR $\mapsto$ WMOV; MFENCE; WMOV results in an intermediate MFENCE which is redundant and can be safely deleted as stores are ordered in x86.
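To illustrate the condition of Theorem 8, the following Python sketch applies it to straight-line code only, so a "program path" is just a position range in a list. This is a simplification of the control-flow-graph analysis used in the paper. The representation (`(opcode, location)` tuples with `None` for fences), the use of location-name equality in place of alias analysis, and the function name `x86_eliminate_fences` are all assumptions made here.

```python
def x86_eliminate_fences(thread):
    """Drop every MFENCE that is not needed to order some store-load pair.

    thread: list of (op, loc) with op in {"WMOV", "RMOV", "RMW", "MFENCE"}.
    An RMW is treated as already acting like a fence between its neighbours.
    """
    blockers = {k for k, (op, _) in enumerate(thread) if op == "RMW"}
    stores = [k for k, (op, _) in enumerate(thread) if op == "WMOV"]
    loads = [k for k, (op, _) in enumerate(thread) if op == "RMOV"]
    kept = set()
    for f, (op, _) in enumerate(thread):
        if op != "MFENCE":
            continue
        # f is needed if it separates a store and a later load on different
        # locations that no RMW or already-kept fence separates.
        needed = any(
            i < f < j
            and thread[i][1] != thread[j][1]
            and not any(i < b < j for b in blockers | kept)
            for i in stores
            for j in loads
        )
        if needed:
            kept.add(f)
    return [ins for k, ins in enumerate(thread)
            if ins[0] != "MFENCE" or k in kept]


# STLR; STR maps to WMOV; MFENCE; WMOV, where the fence is redundant.
print(x86_eliminate_fences([("WMOV", "X"), ("MFENCE", None), ("WMOV", "Y")]))
```

A fence between a store and a load on different locations survives, while a second fence on the same stretch, or one shadowed by an RMW, is dropped.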

We identify non-eliminable DMBFULL, DMBST, and DMBLD fences and then safely eliminate the rest of the fences. We prove the correctness of these fence eliminations in Appendix C.2.

For instance, considering the Fig. 9a mapping scheme, in the mapping RMOV; WMOV $\mapsto$ LDR; DMBLD; DMBST; STR the DMBLD fence suffices to order the load and store access pair and the DMBST is not required. However, we cannot immediately conclude that such a DMBST fence is entirely redundant: in the mapping WMOV; RMOV; WMOV $\mapsto$ DMBST; STR; LDR; DMBLD; DMBST; STR the second DMBST orders the two stores and is therefore non-eliminable.

Theorem 9. Suppose an ARMv8 program is generated by the x86 $\mapsto$ ARMv8 mapping (Fig. 9a). A DMBFULL in a thread of the program is non-eliminable if it is the only fence on a program path from a store to a load in the same thread which access different locations. A DMBFULL elimination is safe when it is not non-eliminable.

The trailing and leading fences in the x86 to ARMv8 mapping ensure that a DMBFULL fence can safely be eliminated following Theorem 9. Otherwise we cannot immediately eliminate a DMBFULL; rather, whenever appropriate, we may weaken such a DMBFULL fence by replacing it with a DMBST; DMBLD fence sequence when a DMBFULL fence is costlier than a pair of DMBST and DMBLD fences. We define safe fence weakening in Theorem 10 below and the detailed proof is in Appendix C.3.

Theorem 10. A DMBFULL in a program thread is non-eliminable if it is the only fence on a program path from a store to a load in the same thread which access different locations. For such a fence, DMBFULL $\leadsto$ DMBST; DMBLD is safe.

While fence weakening can be applied on any ARMv8 program, it is especially applicable after the ARMv7/ARMv7-mca to ARMv8 mapping. ARMv7 has only the DMB fence (apart from ISB) to order any pair of memory accesses, and these DMB fences translate to DMBFULL fences in ARMv8. In many cases these DMBFULL fences can be weakened, and then we can eliminate the DMBLD and DMBST fences which are not non-eliminable.

Theorem 11. A DMBST in a program thread is non-eliminable if it is placed on a program path between a pair of stores in the same thread which access different locations and there exists no other DMBFULL or DMBST fence on the same path. A DMBST elimination is safe when it is not non-eliminable.

Theorem 12. A DMBLD in a program thread is non-eliminable if it is placed on a program path from a load to a store or load access in the same thread which access different locations and there exists no other DMBFULL or DMBLD fence on the same path. A DMBLD elimination is safe when it is not non-eliminable.

In ARMv7 we safely eliminate repeated DMB fences. ARMv7 DMB fence elimination is particularly useful after the ARMv8 to ARMv7/ARMv7-mca mappings. For example, LDR; STLR $\mapsto$ LDR; DMB; DMB; STR; DMB generates repeated DMB fences and one of them can be safely eliminated.

Theorem 13. A DMB in a program thread is non-eliminable if it is the only fence on a program path between a pair of memory accesses in the same thread. A DMB elimination is safe when it is not non-eliminable.
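The interplay between the ARMv8 to ARMv7 mapping of Fig. 13a and Theorem 13 can be sketched on straight-line code as follows. The Python below is an illustration under stated assumptions, not the paper's algorithm: the input list is taken to be the whole thread (so a DMB with no access before or after it is dropped), opcodes carry no operands, the keys `RMW_A` and `RMW_L` stand for the acquire and release-or-stronger updates, and `ARMV8_TO_ARMV7`, `map_armv8_to_armv7`, and `eliminate_repeated_dmb` are hypothetical names.

```python
# Per-instruction ARMv8 -> ARMv7 substitution following Fig. 13a.
ARMV8_TO_ARMV7 = {
    "LDR": ["LDR", "DMB"],
    "STR": ["STR"],
    "LDAR": ["LDR", "DMB"],
    "STLR": ["DMB", "STR", "DMB"],
    "RMW": ["RMW", "DMB"],
    "RMW_A": ["RMW", "DMB"],
    "RMW_L": ["DMB", "RMW", "DMB"],
    "DMBFULL": ["DMB"],
    "DMBLD": ["DMB"],
    "DMBST": ["DMB"],
    "ISB": ["ISB"],
}


def map_armv8_to_armv7(thread):
    out = []
    for op in thread:
        out.extend(ARMV8_TO_ARMV7[op])
    return out


def eliminate_repeated_dmb(thread):
    """Keep a DMB only if no kept DMB already separates its nearest accesses."""
    accesses = [k for k, op in enumerate(thread) if op not in ("DMB", "ISB")]
    kept = set()
    for f, op in enumerate(thread):
        if op != "DMB":
            continue
        before = [i for i in accesses if i < f]
        after = [j for j in accesses if j > f]
        if not before or not after:
            continue  # orders no access pair within this thread
        i, j = before[-1], after[0]
        # On straight-line code the nearest pair decides for all pairs.
        if not any(i < b < j for b in kept):
            kept.add(f)
    return [op for k, op in enumerate(thread) if op != "DMB" or k in kept]


mapped = map_armv8_to_armv7(["LDR", "STLR", "LDR"])
print(mapped)
print(eliminate_repeated_dmb(mapped))
```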

We first check if a fence is non-eliminable based on the access pairs and fence locations on the program paths. We perform this analysis on the thread's control-flow graph $G = \langle V, E \rangle$ where $G.V$ denotes the program statements including the accesses and $G.E$ represents the set of edges between pairs of statements. Next, we delete a fence if it is not non-eliminable.

In Fig. 15 we define a number of conditions which we use in fence elimination. Condition $\mathit{Reach}(G, i, j)$ holds if there is a path from instruction $i$ to instruction $j$ in $G$, and $\mathit{Path}$ checks if there is any path from $i$ to $j$ through a fence $f$. $\mathit{mpairs}(G, a, b)$ is a set of $(a \times b)$ memory access pairs in $G$. We compute $\mathit{mpairs}(G, a, b)|_{\neq loc}$, the set of memory access pairs on different locations, based on must-alias analysis. FDelete deletes a set of fences. Procedure getNFS updates the set of non-eliminable fences considering the positions of other fences between the access pairs. Given a fence $f$ and an access pair $(i, j)$, we check if there is a path from $i$ to $j$ through $f$ without passing through the already identified non-eliminable fences $B$. If so, fence $f$ is also non-eliminable.

$\mathit{Reach}(G, i, j) \triangleq (i, j) \in [G.V] ; G.E^{+} ; [G.V]$
$\mathit{Path}(G, i, f, j) \triangleq \mathit{Reach}(G, i, f) \land \mathit{Reach}(G, f, j)$
$\mathit{ReachWO}(G, i, j, F) \triangleq \mathit{Reach}(\langle G.V \setminus F, G.E \setminus B \rangle, i, j)$ where $B = (G.V \times F) \cup (F \times G.V)$
$\mathit{NFS}(G, i, f, j, F) \triangleq \mathit{Path}(\langle G.V \setminus F, G.E \setminus B \rangle, i, f, j)$ where $B = (G.V \times F) \cup (F \times G.V)$
$\mathit{mpairs}(G, a, b) \triangleq \{ (i, j) \mid [\![i]\!] \in a \land [\![j]\!] \in b \land \mathit{Reach}(G, i, j) \}$
$\mathit{mpairs}(G, a, b)|_{\neq loc} \triangleq \{ (i, j) \mid \mathit{mpairs}(G, a, b) \land \lnot \mathit{mustAlias}(i, j) \}$
$\mathit{mpairs}(G, a, b)|_{loc} \triangleq \{ (i, j) \mid \mathit{mpairs}(G, a, b) \land \mathit{mustAlias}(i, j) \}$
$\mathit{FDelete}(G, F) \triangleq \langle G.V \setminus F, G.E \setminus ((G.V \times F) \cup (F \times G.V)) \rangle$

procedure getNFS(G, PR, F, B)
    for f ∈ F do
        for (i, j) ∈ PR do
            G′ ← FDelete(G, B)
            if Path(G′, i, f, j) then
                B ← B ∪ {f}; break  // inner loop
    return B
end procedure

procedure FWeaken(G, F)
    for f ∈ F do
        V ← G.V ∪ {a, b | [[a]] ∈ F_LD ∧ [[b]] ∈ F_ST}
        E ← G.E ∪ {(f, a), (a, b)}
        E ← E ∪ {(e, a) | G.E(e, f)}
        E ← E ∪ {(b, e) | G.E(f, e)}
        G′.V ← V \ {f}
        G′.E ← E \ ((G′.V × {f}) ∪ ({f} × G′.V))
    return G′
end procedure

Figure 15: Helper conditions and functions.

procedure x86 fence elimination (G)
    F = {f | f ∈ G.V ∧ [[f]] ∈ F}; U = {f | f ∈ G.V ∧ [[f]] ∈ U};
    SL ← mpairs(G, St, Ld)|≠loc
    nfs ← getNFS(G, SL, F, U);
    return FDelete(G, F \ nfs);
end procedure

procedure ARMv7 fence elimination (G)
    F = {f | f ∈ G.V ∧ [[f]] = F};
    M ← mpairs(G, E \ F, E \ F)
    nfs ← getNFS(G, M, F, ∅);
    return FDelete(G, F \ nfs);
end procedure

procedure ARMv8 fence elimination (G)
    F = {f | f ∈ G.V ∧ [[f]] = DMBFULL};
    SL ← mpairs(G, St, Ld)|≠loc
    nfs ← getNFS(G, SL, F, ∅);
    if x86 ↦ ARMv8 then
        G ← FDelete(G, F \ nfs);
    else
        G ← FWeaken(G, F \ nfs)
    FS = {f | f ∈ G.V ∧ [[f]] = DMBST};
    SS ← mpairs(G, St, St)|≠loc
    FF ← getNFS(G, SS, FS, nfs);
    G ← FDelete(G, FS \ FF);
    FL = {f | f ∈ G.V ∧ [[f]] = DMBLD};
    LS ← mpairs(G, Ld, St)|≠loc
    LL ← mpairs(G, Ld, Ld)|≠loc
    FF′ ← getNFS(G, LL ∪ LS, FL, nfs);
    return FDelete(G, FL \ FF′);
end procedure

Figure 16: Fence elimination algorithms after mappings.
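The helper conditions Reach, Path, and FDelete and the procedure getNFS from Fig. 15 translate almost directly into executable form. The Python sketch below is one possible rendering under simplifying assumptions: a control-flow graph is a pair of a node set and an edge set, alias analysis is left to the caller (who supplies the access pairs), and the function names are lowercase variants chosen here.

```python
def reach(edges, i, j):
    """Reach: there is a non-empty path from i to j along edges."""
    stack, seen = [i], set()
    while stack:
        n = stack.pop()
        for (a, b) in edges:
            if a == n and b not in seen:
                if b == j:
                    return True
                seen.add(b)
                stack.append(b)
    return False


def path(edges, i, f, j):
    """Path: some path from i to j passes through f."""
    return reach(edges, i, f) and reach(edges, f, j)


def fdelete(nodes, edges, fences):
    """FDelete: remove the given nodes and every edge touching them."""
    return (nodes - fences,
            {(a, b) for (a, b) in edges if a not in fences and b not in fences})


def get_nfs(nodes, edges, pairs, fences, blockers):
    """getNFS: grow the blocker set with every fence that is still needed."""
    blockers = set(blockers)
    for f in fences:
        for (i, j) in pairs:
            _, pruned = fdelete(nodes, edges, blockers)
            if path(pruned, i, f, j):
                blockers.add(f)
                break  # inner loop
    return blockers


# Two fences in sequence between a store s and a load l: one suffices.
seq_nodes = {"s", "f1", "f2", "l"}
seq_edges = {("s", "f1"), ("f1", "f2"), ("f2", "l")}
print(get_nfs(seq_nodes, seq_edges, [("s", "l")], ["f1", "f2"], set()))

# Two fences on separate branches between s and l: both are needed.
dia_edges = {("s", "f1"), ("f1", "l"), ("s", "f2"), ("f2", "l")}
print(get_nfs(seq_nodes, dia_edges, [("s", "l")], ["f1", "f2"], set()))
```

The fences are passed as a list so that the iteration order, and hence which of several equivalent fences is kept, is deterministic.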

Fence elimination in x86, ARMv7, and ARMv8. In Fig. 16 we define the x86, ARMv8, and ARMv7 fence elimination procedures. For instance, in the x86 procedure we first identify the store-load access pairs on different locations and the MFENCE operations in a thread. Then we identify the set of non-eliminable fences nfs using the getNFS procedure. In this case we consider the positions of atomic updates along with fences, as atomic updates also act as a fence. Finally FDelete eliminates the rest of the fences.

The ARMv8 procedure works in multiple steps, one for each kind of fence. Note that while mapping to ARMv8 we do not use release-write or acquire-load accesses. Therefore we use the same ReachWO condition to check if a fence is non-eliminable. Moreover, in case of x86 to ARMv8 we eliminate DMBFULL fences. In this case DMBFULL elimination is safe as the mapping introduces other DMBLD and DMBST fences. However, we do not eliminate a DMBFULL when it is generated from ARMv7 as it may remove order between a pair of accesses. In this case, or in general, we can weaken a DMBFULL fence and then eliminate redundant DMBST and DMBLD fences.

In ARMv7 an F is redundant when it appears between a pair of same-location load-load, store-store, store-load, and atomic load-store accesses. Such redundant fences appear in an ARMv7 program after mapping ARMv8 programs to ARMv7/ARMv7-mca following the mapping scheme in Fig. 13a. For example, a sequence LDR; LDR in ARMv8 results in a sequence LDR; F; LDR; F in ARMv7 where the introduced F instructions are redundant, and we eliminate these fences by the ARMv7 procedure.

(SC-x86A) $[R] ; \mathsf{po} \cup \mathsf{po} ; [W] \cup \mathsf{po}|_{loc} \cup \mathsf{fence}$
(SC-ARMv8) $\mathsf{po}|_{loc} \cup (\mathsf{aob} \cup \mathsf{dob} \cup \mathsf{bob})^{+}$
(x86A-ARMv8) $\mathsf{po}|_{loc} \cup (\mathsf{aob} \cup \mathsf{bob} \cup \mathsf{dob})^{+} \cup \mathsf{WR}$
(SC-ARMv7) $\mathsf{po}|_{loc} \cup \mathsf{fence}$
(x86A-ARMv7) $\mathsf{po}|_{loc} \cup \mathsf{fence} \cup \mathsf{WR}$
(ARMv8-ARMv7) $\mathsf{po}|_{loc} \cup [St] ; \mathsf{po} \cup \mathsf{fence}$
(ARMv7mca-ARMv7) $[St] ; \mathsf{po} \cup \mathsf{po} ; [St] \cup [Ld] ; (\mathsf{po}|_{loc} \cup \mathsf{fence}) ; [Ld]$

Figure 17: ($M$-$K$): Condition $R$ for the $M$-robust against $K$ analysis.

(SC) $\mathit{acy}(\mathsf{po} \cup \mathsf{rf} \cup \mathsf{fr} \cup \mathsf{co})$
(atomicity) $\mathit{irr}([\mathsf{rmw}] ; \mathsf{fre} ; \mathsf{coe})$
(a) SC

(sc-per-loc) $\mathit{acy}(\mathsf{po}|_{loc} \cup \mathsf{rf} \cup \mathsf{fr} \cup \mathsf{co})$
(atomicity) $\mathit{irr}([\mathsf{rmw}] ; \mathsf{fre} ; \mathsf{coe})$
(GHB) $\mathit{acy}((\mathsf{po} \setminus \mathsf{WR}) \cup \mathsf{fence} \cup \mathsf{rfe} \cup \mathsf{co} \cup \mathsf{fr})$ where $\mathsf{fence} = \mathsf{po} ; [\mathsf{rmw} \cup F] ; \mathsf{po}$ and $\mathsf{WR} = [St \setminus \mathit{codom}(\mathsf{rmw})] ; \mathsf{po} ; [Ld \setminus \mathit{dom}(\mathsf{rmw})]$
(b) x86A

Figure 18: SC and x86A models for robustness checking.

We first define robustness and then discuss the conditions and their analyses in more detail.

Deﬁnition 3.

Suppose $M$ and $K$ are concurrency models. A program is $M$-robust against $K$ if all its $K$-consistent executions are also $M$-consistent.

We observe that in axiomatic models the axioms are represented in the form of irreflexivity of a relation or acyclicity of one or a combination of relations. When an axiom is violated, it results in a cycle on an execution graph. Such a cycle consists of a set of internal relations, which are included in program order (po), along with external relations. If the involved po relations are appropriately ordered then such a cycle would not be possible. As a result the program would have no weaker behavior and would be $M$-robust against a weaker model $K$. To capture this idea we define the external-program-order (epo) relation as follows.

$\mathsf{epo} \triangleq \mathsf{po} \cap (\mathit{codom}(\mathsf{eco}) \times \mathit{dom}(\mathsf{eco}))$

Based on this observation we check and enforce $M$-robustness against $K$ considering the relative strength ($\sqsubset$) of the memory accesses of the memory models: SC $\sqsubset$ x86 $\sqsubset$ ARMv8 $\sqsubset$ ARMv7-mca $\sqsubset$ ARMv7. In all these cases we define the required constraints on the external-program-order (epo) edges in an execution which preserve robustness.

Checking robustness in x86. A subtle issue in checking SC-robustness against x86 is that an mo relation may take place between writes on different locations, and in that case we have to consider a possible cycle through different-location writes as well. To avoid this complexity, we use the x86A model following Alglave et al. [2014], Alglave and Maranget, as shown in Fig. 18, for robustness analyses. In this model there is no mo relation and, unlike x86, an update operation results in an $\mathsf{rmw} \subseteq \mathsf{po}|_{loc}$ relation instead of an event, similar to the ARM models. In Fig. 18 we also define the SC model (Alglave et al. [2014], Alglave and Maranget) for robustness analysis.

Robustness conditions

In Fig. 17 we define the conditions which have to be fulfilled by epo in all executions of a given program. An x86A execution is SC-robust when all epo relations are fully ordered as defined in (SC-x86A). In the ARMv8 model, condition (SC-ARMv8) preserves order for all epo relations. Condition (x86A-ARMv8) orders all epo relations except non-RMW store-load access pairs on different locations, similar to x86A. The ARMv7 model uses po|loc and fence to order epo relations fully to preserve SC-robustness. We do not use ppo in these constraints as it violates robustness, as shown in the example in Fig. 8. To preserve x86A-robustness, ARMv7 orders all epo relations except non-RMW store-load access pairs on different locations. Condition (ARMv8-ARMv7) also does not rely on ordering by dependencies such as ppo. For example, data; coi ⊆ dob in ARMv8 does not imply ppo in ARMv7. Therefore such a dob order may make an execution ARMv8-inconsistent while the execution is still allowed in the ARMv7 model, which would violate ARMv8-robustness. Finally, (ARMv7mca-ARMv7) checks if the program may have any MCA behavior.

    ReachWO(G, i, j, F) ≜ Reach(⟨G.V \ F, G.E \ B⟩, i, j) where B = (G.V × F) ∪ (F × G.V)
    Ordered(G, (i, j), F) ≜ mustAlias(i, j) ∨ ¬ReachWO(G, i, j, F)
    OnCyc(A) ≜ {(a, b) | (a, b) ∈ A ∧ ∃(p, q), (r, s) ∈ A. (a, b) ≠ (p, q) ∧ (a, b) ≠ (r, s) ∧ mayAlias(b, p) ∧ mayAlias(a, s)}
    getG(b) ≜ G where b ∈ G.V

    procedure INSERTF(P, O)
        H ← ∅
        for (a, b) ∈ O do
            if b ∉ H then
                G ← getG(b); f ← new(MFENCE); G.V ← G.V ∪ {f};
                P ← {(a, f) | (a, e) ∈ G.E⁺}
                Q ← {(f, b) | (e, b) ∈ G.E⁺}
                G.E ← G.E ∪ {(f, e)} ∪ P ∪ Q; H ← H ∪ {b}
    end procedure

    procedure SCROBUSTX86(P, N)
        ℓ ← St ∪ Ld ∪ U; A ← ⋃_{i ∈ N} mpairs(P(i), ℓ, ℓ); O ← ∅;
        for (a, b) ∈ OnCyc(A) do
            B ← {f | f ∈ G.V ∧ [[f]] ∈ F ∪ U}
            if ¬Ordered(getG(b), (a, b), B) then O ← O ∪ {(a, b)};
        if O == ∅ then return true;
        else INSERTF(P, O); return false;
    end procedure

Figure 19: Analysis and enforcement of SC-robustness against x86.

    procedure INSERTDMBV8(P, O)
        H ← ∅
        for (a, b) ∈ O do
            if a ∉ H then
                G′ ← getG(a)
                if isLd(a) ∧ ¬isLL(a) then f ← new(DMBLD);
                else f ← new(DMBFULL);
                G.V ← G.V ∪ {f};
                P ← {(p, f) | (p, a) ∈ G.E⁺}
                Q ← {(f, q) | (a, q) ∈ G.E⁺}
                G.E ← G.E ∪ {(f, a)} ∪ P ∪ Q; H ← H ∪ {a}
        return false;
    end procedure
    (a) ARMv8

    procedure INSERTDMBV7(P, O)
        H ← ∅
        for (a, b) ∈ O do
            if a ∉ H ∧ ¬isLL(a) then
                G′ ← getG(a)
                f ← new(DMBFULL); G.V ← G.V ∪ {f};
                P ← {(p, f) | (p, a) ∈ G.E⁺}
                Q ← {(f, q) | (a, q) ∈ G.E⁺}
                G.E ← G.E ∪ {(f, a)} ∪ P ∪ Q; H ← H ∪ {a}
        return false;
    end procedure
    (b) ARMv7

Figure 20: Fence insertion in ARMv8 and ARMv7 for enforcing robustness.

Now we state the robustness theorem based on these constraints and prove the respective robustness results in Appendix D.

Theorem 14. A program P is M-robust against K if in all its K-consistent executions X, X.epo ⊆ X.R holds, where R is defined as condition (M-K) in Fig. 17.
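The condition X.epo ⊆ X.R can be illustrated on a single candidate execution graph. The following is a minimal Python sketch, not the paper's implementation: it encodes the classic store-buffering shape (two threads, each storing to one location and loading the other) and evaluates the (SC-x86A) condition from Fig. 17. The event names, the `kind`/`loc` dictionaries, and the helper names are assumptions made for illustration; only fence events (not rmw) are modelled in the fence relation, and K-consistency of the graph is not checked here.

```python
def compose(r, s):
    """Relational composition r; s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}

def transitive_closure(r):
    closure = set(r)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra

def epo(po, eco):
    """epo = po ∩ (codom(eco) × dom(eco))."""
    codom = {b for (_, b) in eco}
    dom = {a for (a, _) in eco}
    return {(a, b) for (a, b) in po if a in codom and b in dom}

def sc_x86a_condition(po, kind, loc):
    """R for (SC-x86A): [R];po ∪ po;[W] ∪ po|loc ∪ fence (fence events only)."""
    fences = {e for e, k in kind.items() if k == "F"}
    fence = {(a, b) for (a, f) in po if f in fences
             for (g, b) in po if g == f}
    ordered = {(a, b) for (a, b) in po
               if kind[a] == "R" or kind[b] == "W"
               or (loc[a] is not None and loc[a] == loc[b])}
    return ordered | fence

def store_buffering(with_fences):
    """Hypothetical graph: T0 = W x; R y and T1 = W y; R x, both loads read 0."""
    kind = {"a": "W", "b": "R", "c": "W", "d": "R"}
    loc = {"a": "x", "b": "y", "c": "y", "d": "x"}
    po = {("a", "b"), ("c", "d")}
    if with_fences:
        kind.update({"f0": "F", "f1": "F"})
        loc.update({"f0": None, "f1": None})
        po |= {("a", "f0"), ("f0", "b"), ("c", "f1"), ("f1", "d")}
    fr = {("b", "c"), ("d", "a")}   # each load reads the initial value
    eco = transitive_closure(fr)    # rf and co are empty in this graph
    return po, eco, kind, loc

def condition_holds(with_fences):
    po, eco, kind, loc = store_buffering(with_fences)
    return epo(po, eco) <= sc_x86a_condition(po, kind, loc)
```

Without fences both po edges are epo edges of store-load shape on different locations, so they fall outside R; with a fence event between each store and load, the fence relation covers them and the inclusion holds.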


    ReachWO(G, i, j, F) ≜ Reach(⟨G.V \ F, G.E \ B⟩, i, j) where B = (G.V × F) ∪ (F × G.V)
    RA(i, Rel, Acq) ≜ {a | ¬ReachWO(G, i, a, Rel) ∧ a ∈ Acq}
    isSt(i) ≜ [[i]] ∈ St
    isSC(i) ≜ [[i]] ∈ St ∩ codom(rmw)
    isLd(i) ≜ [[i]] ∈ Ld
    isAcq(i) ≜ [[i]] ∈ A
    isLL(i) ≜ [[i]] ∈ Ld ∩ dom(rmw)
    isW(i) ≜ [[i]] ∈ St ∪ L
    isR(i) ≜ [[i]] ∈ Ld ∪ A

    procedure ORDERED(G, i, j)
        F_F ← {f | f ∈ G.V ∧ [[f]] ∈ F};
        F_L ← {f | f ∈ G.V ∧ [[f]] ∈ F_LD};
        F_S ← {f | f ∈ G.V ∧ [[f]] ∈ F_ST};
        L ← {a | a ∈ G.V ∧ [[a]] ∈ L};
        A ← {a | a ∈ G.V ∧ [[a]] ∈ A};
        B ← F_F ∪ RA(i, L, A);
        Switch (i, j)
            Case mustAlias(i, j):
            Case isSt(i) ∧ isLd(j) ∧ ¬ReachWO(G, i, j, B):
            Case (isRel(j) ∨ isAcq(i)) ∨ (isRel(i) ∧ isAcq(j)):
            Case isLd(i) ∧ isLd(j) ∧ ¬ReachWO(G, i, j, B ∪ F_L):
            Case isLd(i) ∧ isSt(j) ∧ ¬ReachWO(G, i, j, B ∪ F_L ∪ Lcoi(G, j)):
            Case isSt(i) ∧ isSt(j) ∧ ¬ReachWO(G, i, j, B ∪ F_S ∪ Lcoi(G, j)):
            Case (isLL(i) ∨ isLd(i)) ∧ (isSt(j) ∨ isSC(j)):
                return true;
        return false;
    (a) Checking order for pairs

    procedure SCROBUSTARMV8(P, N)
        ℓ ← St ∪ Ld ∪ L ∪ A
        A ← ⋃_{i ∈ N} mpairs(P(i), ℓ, ℓ); O ← ∅
        for (a, b) ∈ OnCyc(A) do
            B ← GETB(getG(b))
            if ¬Ordered(getG(b), a, b) then O ← O ∪ {(a, b)};
        if O == ∅ then return true;
        else INSERTDMBV8(P, O);
    end procedure
    (b) SC-robust against ARMv8

    procedure X86ROBUSTARMV8(P, N)
        ℓ ← St ∪ Ld ∪ L ∪ A
        A ← ⋃_{i ∈ N} mpairs(P(i), ℓ, ℓ); O ← ∅
        for (a, b) ∈ OnCyc(A) do
            G ← getG(b)
            B ← GETB(G)
            C ← isW(a) ∧ isR(b) ∧ ¬(isSC(a) ∧ isLL(b))
            if ¬(C ∨ Ordered(G, a, b)) then O ← O ∪ {(a, b)};
        if O == ∅ then return true;
        else INSERTDMBV8(P, O);
    end procedure
    (c) x86 robust against ARMv8

Figure 21: Robustness analysis of ARMv8 programs

When an execution is K-consistent but violates M-consistency, it contains a cycle which violates a certain irreflexivity condition. Such a cycle contains events on different locations and therefore two or more epo edges: given such an epo edge (a, b) there exist other epo edge(s) (p, q) and (r, s) such that a and b access the same locations as s and p respectively, as (b, p), (s, a) ∈ eco.

We lift this semantic notion of robustness to program syntax in order to analyze and enforce robustness. We first identify the memory access pairs in all threads as these are potential epo edges. Next, we conservatively check if the memory access pairs would satisfy the robustness conditions in Fig. 17 in all K-consistent executions of the program. If so, we report the program as M-robust against K. To enforce robustness we insert appropriate fences between the memory access pairs.

We perform such an analysis in Fig. 19 to check and enforce SC-robustness against x86 by procedure SCROBUSTX86 using a number of helper conditions. ReachWO(G, i, j, F) checks if there is a program path from access i to access j without passing through the fences F in G. Ordered(G, (i, j), F) checks if the (i, j) access pair is ordered in the respective models. For example, in Fig. 19 it checks if i and j access the same location using mustAlias, or if on all paths from i to j there exists at least one fence from F, using ReachWO.

    isWR(i, j) ≜ isW(i) ∧ isR(j) ∧ ¬(isSC(i) ∧ isLL(j))
    Ordered(G, i, j, F) ≜ mustAlias(i, j) ∨ ¬ReachWO(G, i, j, F)

    procedure SCROBUSTARMV7(P, N)
        ℓ ← St ∪ Ld
        pr ← ⋃_{k ∈ N} mpairs(P(k), ℓ, ℓ); O ← ∅
        for (i, j) ∈ OnCyc(pr) do
            G ← getG(j)
            F = {f | f ∈ G.V ∧ [[f]] ∈ F};
            if ¬Ordered(G, i, j, F) then O ← O ∪ {(i, j)}
        if O == ∅ then return true;
        else INSERTDMBV7(P, O);
    end procedure
    (a) SC-robustness against ARMv7

    procedure X86ROBUSTARMV7(P, N)
        ℓ ← St ∪ Ld
        pr ← ⋃_{k ∈ N} mpairs(P(k), ℓ, ℓ); O ← ∅
        for (i, j) ∈ OnCyc(pr) do
            G ← getG(j)
            F = {f | f ∈ G.V ∧ [[f]] ∈ F};
            if ¬(isWR(i, j) ∨ Ordered(G, i, j, F)) then O ← O ∪ {(i, j)}
        if O == ∅ then return true;
        else INSERTDMBV7(P, O);
    end procedure
    (b) x86-robustness against ARMv7

    procedure ARMV8ROBUSTARMV7(P, N)
        ℓ ← St ∪ Ld
        pr ← ⋃_{k ∈ N} mpairs(P(k), ℓ, ℓ); O ← ∅
        for (i, j) ∈ OnCyc(pr) do
            G ← getG(j)
            F = {f | f ∈ G.V ∧ [[f]] ∈ F};
            if ¬(isSt(i) ∨ Ordered(G, i, j, F)) then O ← O ∪ {(i, j)}
        if O == ∅ then return true;
        else INSERTDMBV7(P, O);
    end procedure
    (c) ARMv8 robust against ARMv7

    procedure ARMV7MCAROBUSTARMV7(P, N)
        pr ← ⋃_{k ∈ N} mpairs(P(k), Ld, Ld); O ← ∅
        for (i, j) ∈ OnCyc(pr) do
            G ← getG(j)
            F = {f | f ∈ G.V ∧ [[f]] ∈ F};
            if ¬Ordered(G, i, j, F) then O ← O ∪ {(i, j)}
        if O == ∅ then return true;
        else INSERTDMBV7(P, O);
    end procedure
    (d) ARMv7-mca robust against ARMv7

Figure 22: Robustness analysis of ARMv7 programs

Finally, given a set of memory access pairs A, OnCyc(A) ⊆ A identifies the set of memory access pairs which may result in epo edges in an execution. SCROBUSTX86 checks if all such store-load access pairs are appropriately ordered, which in turn ensures SC-robustness for the program P having N thread functions. If so, we report SC-robustness against x86. Otherwise, we insert fences between unordered pairs using the INSERTF procedure to enforce robustness. Similar to SCROBUSTX86, we also define procedures in Fig. 21 and Fig. 22 to check and enforce robustness in ARMv8 and ARMv7 programs respectively.
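The helper conditions ReachWO, Ordered, and OnCyc from Fig. 19 can be sketched in Python as follows. This is an illustrative approximation under stated assumptions, not the LLVM implementation: a control-flow graph is a plain set of edges over access names, the hypothetical `loc` map stands in for alias analysis (a `None` entry means the location is unknown), and `unordered_pairs` only mirrors the checking loop of the robustness procedure, not the fence insertion.

```python
def reach_wo(edges, i, j, fences):
    """True if some path from i to j avoids every node in `fences`."""
    succ = {}
    for (u, v) in edges:
        if u not in fences and v not in fences:
            succ.setdefault(u, set()).add(v)
    seen, stack = {i}, [i]
    while stack:
        node = stack.pop()
        for nxt in succ.get(node, ()):
            if nxt == j:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def must_alias(loc, i, j):
    return loc[i] is not None and loc[i] == loc[j]

def may_alias(loc, i, j):
    return loc[i] is None or loc[j] is None or loc[i] == loc[j]

def ordered(edges, loc, i, j, fences):
    """Ordered(G, (i, j), F) = mustAlias(i, j) or not ReachWO(G, i, j, F)."""
    return must_alias(loc, i, j) or not reach_wo(edges, i, j, fences)

def on_cyc(pairs, loc):
    """Pairs (a, b) for which other pairs (p, q), (r, s) may close a cycle."""
    result = set()
    for (a, b) in pairs:
        others = [pair for pair in pairs if pair != (a, b)]
        if (any(may_alias(loc, b, p) for (p, _) in others)
                and any(may_alias(loc, a, s) for (_, s) in others)):
            result.add((a, b))
    return result

def unordered_pairs(edges, loc, pairs, fences):
    """Access pairs on a potential cycle that are not ordered."""
    return {(a, b) for (a, b) in on_cyc(pairs, loc)
            if not ordered(edges, loc, a, b, fences)}

# Hypothetical store-buffering program: T0 = St x; Ld y and T1 = St y; Ld x.
LOC = {"s0": "x", "l0": "y", "s1": "y", "l1": "x", "f0": None, "f1": None}
PAIRS = {("s0", "l0"), ("s1", "l1")}
PLAIN = {("s0", "l0"), ("s1", "l1")}
FENCED = {("s0", "f0"), ("f0", "l0"), ("s1", "f1"), ("f1", "l1")}
```

In this example `unordered_pairs(PLAIN, LOC, PAIRS, set())` returns both store-load pairs, whereas with the fenced graph and `fences={"f0", "f1"}` it returns the empty set, which corresponds to the case where the analysis reports robustness.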

Based on the obtained results we have implemented the architecture to architecture (AA) mapping schemes defined in Figs. 9, 12 and 13, followed by the fence elimination algorithms described in Fig. 16. We have also developed robustness analyses for x86, ARMv8, and ARMv7 programs following the procedures in Figs. 19, 21 and 22.

We have implemented these mappings, fence eliminations, and robustness analyses in LLVM. To analyze programs for fence elimination and checking robustness, we leverage the existing control-flow-graph analyses, alias analysis, and memory operand type analysis in LLVM. The CFG analyses are used to define the mpairs, Path, Reach, and ReachWO conditions. The mayAlias and mustAlias functions are defined using memory operand type and alias analyses.

    Prog.        Orig      x-v8                  C-x-v8
                           AA         fd         AA         fd
    barrier      0,6,6     5,5,10     2,1,6      4,0,14     2,1,8
    dekker-tso   4,7,0     5,5,7      4,3,4      8,0,18     4,6,6
    dekker-sc    0,7,0     5,5,3      4,5,0      8,0,14     4,6,2
    pn-ra        4,3,0     5,12,7     4,7,2      4,0,16     4,5,6
    pn-ra-b      0,9,6     5,10,7     4,7,2      4,0,14     4,5,4
    pn-ra-d      0,5,4     5,10,7     4,5,4      4,0,14     4,5,4
    pn-tso       2,3,0     5,12,7     4,7,2      4,0,14     4,5,4
    pn-sc        0,3,0     5,12,3     4,9,0      4,0,12     4,5,2
    lamport-ra   4,3,7     7,5,5      5,4,4      10,0,12    4,2,8
    lamport-tso  2,3,5     7,5,3      5,4,2      8,0,10     4,2,6
    lamport-sc   0,3,5     7,5,1      5,4,0      8,0,8      4,2,4
    spinlock     0,8,6     5,7,10     2,6,0      2,0,14     2,10,0
    spinlock4    0,14,12   9,11,18    4,10,0     4,0,24     4,18,0
    tlock        0,8,4     7,8,8      4,5,2      4,0,16     2,5,4
    tlock4       0,12,8    13,12,12   8,7,4      8,0,24     4,7,8
    seqlock      0,6,4     6,4,12     5,3,2      5,0,16     5,3,2
    nbw          0,3,4     10,8,12    6,6,1      7,0,18     6,7,6
    rcu          0,2,10    12,15,2    3,12,0     12,0,11    2,4,4
    rcu-ofl      4,16,8    17,18,24   12,6,9     15,0,51    11,2,42
    cilk-tso     2,7,4     15,15,15   13,4,11    13,0,29    9,4,14
    cilk-sc      0,7,4     15,15,13   13,6,9     13,0,27    9,6,12
    cldq-ra      3,4,0     7,5,9      6,2,1      6,0,14     6,2,2
    cldq-tso     1,4,0     9,5,7      6,2,1      6,0,12     6,2,2
    cldq-sc      0,4,0     7,5,6      6,2,1      6,0,11     6,2,1
    (a) x86 to ARMv8

    Prog.        Orig    v8-x
                         AA      fd
    barrier      6,0     4,2     4,1
    dekker-tso   3,4     0,11    0,6
    dekker-sc    3,0     0,7     0,3
    pn-ra        3,4     0,7     0,3
    pn-ra-b      5,0     2,7     2,3
    pn-ra-d      5,0     2,3     2,1
    pn-tso       3,2     0,5     0,5
    pn-sc        3,0     0,3     0,1
    lamport-ra   1,4     0,7     0,5
    lamport-tso  1,2     0,5     0,3
    lamport-sc   1,0     0,3     0,1
    spinlock     4,0     2,4     2,1
    spinlock4    6,0     4,6     4,2
    tlock        6,0     2,6     2,3
    tlock4       8,0     4,8     4,2
    seqlock      6,0     4,4     4,1
    nbw          4,0     2,3     2,2
    rcu          2,0     0,2     0,1
    rcu-ofl      16,4    1,20    1,12
    cilk-tso     5,2     2,9     2,2
    cilk-sc      5,0     2,7     2,1
    cldq-ra      4,3     2,5     2,2
    cldq-tso     4,1     2,3     2,2
    cldq-sc      4,0     2,2     2,2
    (b) ARMv8 to x86

Figure 23: Mappings between x86 and ARMv8. In x86 to ARMv8:

relaxed accesses in general and for wait loops we use release/acquire accesses. Some of the programs have release-acquire/TSO/SC versions. These programs assume the program would run on the respective memory models.

We have modified the x86, ARMv7, and ARMv8 code generation phases in LLVM to capture the effect of the mapping schemes on C11 programs. For example, in the original LLVM mapping a non-atomic store (St_NA) results in WMOV and STR accesses in x86 and ARMv8 respectively. Following the AA-mapping in Fig. 9a, WMOV results in DMBST; STR in ARMv8. Therefore, to capture the effect of x86 to ARMv8 translation we generate DMBST; STR in ARMv8 instead of a STR for a C11 non-atomic store access. We modify the code lowering phase in LLVM to generate the required leading and trailing fences along with the memory accesses. The AA-mapping schemes introduce additional fences compared to the original mapping in all mapping schemes, which is evident in Figs. 23a, 23b, 24a and 24b in the ‘Orig’ and ‘AA’ columns respectively.

x86 to ARMv8 mappings (Fig. 9a).

In Fig. 23a we show the numbers of different fences resulting from C11 ↦ ARMv8 (Orig), x86 ↦ ARMv8 (AA in x-v8), and C11 ↦ x86 ↦ ARMv8 (AA in C-x-v8). Both the x86 ↦ ARMv8 and C11 ↦ x86 ↦ ARMv8 mapping schemes generate more fences than the original C11 ↦ ARMv8 mapping. x86 ↦ ARMv8 (x-v8) generates more DMBLD fences than C11 ↦ x86 ↦ ARMv8 (C-x-v8) as the former scheme generates a trailing DMBLD fence for non-atomic loads. However, the number of DMBFULL fences is higher in C-x-v8 than in x-v8 as atomic stores introduce leading DMBFULL fences instead of DMBST. For the same reason there is no DMBST in the C-x-v8 column.

ARMv8 to x86 mappings (Fig. 12b). As shown in Fig. 23b, the number of atomic updates and fence operations in the AA-mapping varies from Orig due to the mapping of C11 St_SC and St_REL accesses. In the original mapping St_SC ↦ RMW and St_REL ↦ WMOV, whereas in the AA-mapping St_(REL|SC) ↦ STLR ↦ WMOV; MFENCE. As a result, there are fewer atomic updates and more fences in the AA-mapping than in the original x86 mapping in LLVM.

We can observe the tradeoff between x86 ↦ ARMv8 and C11 ↦ x86 ↦ ARMv8 considering the number of generated DMBLD and DMBFULL fences. For example, in the Barrier program x86 ↦ ARMv8 generates more DMBLD fences than C11 ↦ x86 ↦ ARMv8 as it generates DMBLD fences for non-atomic loads. On the other hand, C11 ↦ x86 ↦ ARMv8 generates DMBFULL fences for relaxed atomic stores instead of DMBST fences.

    Programs     Orig      v7-v8
                           AA        fd
    barrier      0,6,6     0,0,12    1,1,8
    dekker-tso   4,7,0     0,0,11    2,5,4
    dekker-sc    0,7,0     0,0,7     2,5,0
    pn-ra        4,3,0     0,0,7     0,2,4
    pn-ra-b      0,9,6     0,0,15    0,4,6
    pn-ra-d      0,5,4     0,0,9     0,2,6
    pn-tso       2,3,0     0,0,5     0,2,2
    pn-sc        0,3,0     0,0,3     0,2,0
    lamport-ra   4,3,7     0,0,14    1,2,11
    lamport-tso  2,3,5     0,0,10    1,2,7
    lamport-sc   0,3,5     0,0,8     1,2,5
    spinlock     0,8,6     0,0,14    2,11,0
    spinlock4    0,14,12   0,0,26    4,21,0
    tlock        0,8,4     0,0,12    2,5,4
    tlock4       0,12,8    0,0,20    4,7,8
    seqlock      0,4,4     0,0,8     4,3,2
    nbw          0,3,4     0,0,9     1,3,4
    rcu          0,2,10    0,0,12    0,1,10
    rcu-ofl      4,16,8    0,0,29    1,1,22
    cilk-tso     2,7,4     0,0,15    4,4,8
    cilk-sc      0,7,4     0,0,13    4,6,6
    cldq-ra      3,4,0     0,0,7     3,1,6
    cldq-tso     1,4,0     0,0,5     1,1,2
    cldq-sc      0,4,0     0,0,4     1,1,1
    (a) ARMv7 to ARMv8

    Prog.        Orig   v8-v7        C-v8-v7
                        AA    fd     AA    fd
    barrier      13     19    16     13    12
    dekker-tso   12     25    23     22    19
    dekker-sc    8      21    20     18    15
    pn-ra        8      17    16     12    11
    pn-ra-b      14     19    16     18    13
    pn-ra-d      10     17    16     12    11
    pn-tso       6      15    14     10    9
    pn-sc        4      13    12     8     7
    lamport-ra   15     21    20     18    17
    lamport-tso  11     18    17     15    14
    lamport-sc   9      16    15     13    12
    spinlock     13     18    17     15    12
    spinlock4    23     32    31     27    22
    tlock        13     20    16     17    12
    tlock4       21     34    29     29    20
    seqlock      11     19    15     12    9
    nbw          7      23    21     15    13
    rcu          13     36    32     15    14
    rcu-ofl      30     55    49     39    36
    cilk-tso     13     34    31     30    22
    cilk-sc      11     32    31     28    23
    cldq-ra      8      19    18     12    12
    cldq-tso     6      18    17     12    11
    cldq-sc      5      18    17     11    11
    (b) ARMv8 to ARMv7

Figure 24: Mappings between ARMv7 and ARMv8. Original mapping to ARMv8 is (DMBFULL, release-store, acquire-load). In ARMv7-ARMv8 mapping the numbers are of (DMBLD, DMBST, DMBFULL).

ARMv8 to ARMv7 mappings (Fig. 13a)

We show the number of DMB fences in Fig. 24b due to the C11 ↦ ARMv8 (Orig), ARMv8 ↦ ARMv7 (AA in v8-v7), and C11 ↦ ARMv8 ↦ ARMv7 (AA in C-v8-v7) mappings. Both ARMv8 ↦ ARMv7 and C11 ↦ ARMv8 ↦ ARMv7 generate more fences than the C11 ↦ ARMv8 mapping. Moreover, C11 ↦ ARMv8 ↦ ARMv7 generates fewer fences than ARMv8 ↦ ARMv7 as we do not generate trailing DMB fences for non-atomic loads.

ARMv7 to ARMv8 mappings (Fig. 12a).

The result is in Fig. 24a, where the original C11 ↦ ARMv8 mapping generates DMBFULL, release-store, and acquire-load operations for these programs whereas the AA-mapping generates the respective DMBFULL fences only, as ARMv7 does not have release-store and acquire-load operations.

The fence optimization passes remove a significant number of fences, as shown in the ‘fd’ columns in Figs. 23a, 23b, 24a and 24b. We have implemented the fence elimination algorithms as LLVM passes and run them after the AA-mappings to eliminate redundant fences. Each pass extends LLVM MachineFunctionPass and runs on each machine function of the program. The precision of our analyses depends upon the underlying LLVM functions which we have used. For example, we apply alias analysis and memory operand analysis to identify the memory location accessed by a particular access. Consider a scenario where we have identified an MFENCE between a store-load pair. If we precisely identify that the store-load pair accesses the same location then we can eliminate the fence. Otherwise we conservatively mark the fence as non-eliminable.

Fence elimination after x86 to ARMv8 mapping.

The fence elimination algorithms have eliminated a number of redundant fences after the mapping. In some scenarios the original C11 to ARMv8 mapping is too restrictive as it generates


In this case we first weaken the DMBFULL fences to a pair of DMBST and DMBLD fences whenever appropriate and then perform the fence elimination. Therefore it introduces some DMBST fences in the ‘fd’ column in C-x-v8.

Fence elimination after ARMv8 to x86 mapping.

The mapping generates MFENCE for the release-store mapping, and the fence elimination safely eliminates these fences whenever possible.

Fence elimination after ARMv7 to ARMv8 mapping.

In this case the mapping introduces DMBFULL fences in ARMv8 from ARMv7 DMB fences. We eliminate the repeated fences, if any, then weaken the DMBFULL fences to DMBST and DMBLD fences, and further eliminate redundant fences.

Fence elimination after ARMv8 to ARMv7 mapping.

The ARMv8 to ARMv7 mapping generates extra fences in certain scenarios such as LDR; STLR ↦ LDR; DMB; DMB; STR; DMB, where we can safely remove a repeated DMB fence. A similar scenario takes place for the LDR A; STLR mapping in the C11 to ARMv8 to ARMv7 mapping.
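The repeated-fence scenario can be reproduced with a small Python sketch. The two mapping rules below are only those that can be read off the examples in the text (LDR ↦ LDR; DMB and STLR ↦ DMB; STR; DMB); the full scheme of Fig. 13a is not reproduced here. The function names and the list-of-mnemonics program representation are assumptions for illustration, and the clean-up step only drops directly adjacent DMB fences, which is a much simpler step than the fence elimination algorithms of Fig. 16.

```python
# Assumed subset of the ARMv8 -> ARMv7 mapping, inferred from the examples.
ARMV8_TO_ARMV7 = {
    "LDR": ["LDR", "DMB"],          # load followed by a trailing fence
    "STLR": ["DMB", "STR", "DMB"],  # release store with leading and trailing fences
}

def map_v8_to_v7(program):
    """Map a straight-line ARMv8 sequence instruction by instruction."""
    mapped = []
    for instruction in program:
        mapped.extend(ARMV8_TO_ARMV7[instruction])
    return mapped

def drop_repeated_dmb(program):
    """Remove a DMB that directly follows another DMB."""
    result = []
    for instruction in program:
        if instruction == "DMB" and result and result[-1] == "DMB":
            continue
        result.append(instruction)
    return result

mapped = map_v8_to_v7(["LDR", "STLR"])
cleaned = drop_repeated_dmb(mapped)
```

Here `mapped` is LDR; DMB; DMB; STR; DMB, as in the example above, and `cleaned` keeps a single DMB between the load and the store.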

We implement the robustness analysis as LLVM passes following the procedures in Fig. 19 as well as Figs. 21 and 22 in the appendix, after instruction lowering in x86, ARMv8, and ARMv7. We report the analysis results on the concurrent programs in Fig. 25. In these results we mark both robustness checking and robustness enforcement results. We have also included the results from Lahav and Margalit [2019] about two other robustness checkers: Trencher Bouajjani et al. [2013] and Rocker Lahav and Margalit [2019].

Now we discuss the robustness results of the benchmark programs, which are marked by ✓ or ✗. Among these programs spinlock, spinlock4, seqlock, ticketlock (tlock), and ticketlock4 (tlock4) provide robustness in all models. These results also match the results from both Trencher and Rocker, both of which are SC-robustness checkers. In the rest of the programs we observe robustness violations due to various unordered access sequences. For example, (St-Ld) violates SC-robustness in all architectures, (SC-St/Ld) violates x86A-robustness in ARMv8 and ARMv7, and (Ld-Ld) violates all robustness in the ARMv8 and ARMv7 models.

Robustness of x86 programs.

We first focus on SC-robustness against x86A and compare the result with Trencher. Our analysis precisely analyzes robustness and agrees with Trencher in all cases except lamport-ra (lmprt-ra), lamport-tso (lmprt-tso), and cilk-sc. Both lamport-ra and lamport-tso have (St-Ld) sequences in different thread functions. As a result, our analysis reports an SC-robustness violation, which is a false positive as in actual executions these access pairs never execute concurrently. In cilk-sc we report SC-robustness even though the program has store-load sequences of the form a = Ld_RLX(T); St_RLX(T, a − 1); Ld_RLX(H). In this case the St_RLX(T, a − 1); Ld_RLX(H) sequence may yield non-SC behavior during an execution, which is reported by Trencher and Rocker. However, LLVM combines the load and store of T into an atomic fetch-and-sub (fsub) operation, that is, a = Ld_RLX(T); St_RLX(T, a − 1) ⇝ a = fsub(T, 1). As a result the program becomes SC-robust against x86 in LLVM, as reported by our analysis.

Robustness of ARMv8 programs.

Next, we study SC-robustness and x86A-robustness against ARMv8 for the benchmark programs. ARMv8 allows out-of-order execution of memory accesses on different locations which do not affect dependencies. Therefore many of these programs in ARMv8 are not SC- or x86A-robust. Also, our robustness analyzer does not rely on dob ordering as it performs the analysis before the ARMv8 machine code is generated during the code lowering phase. LLVM may perform optimizations after the analysis which remove certain dependencies, and in that case an analysis that relied on dob ordering would be unsound and may report false negatives.

As ARMv8 is weaker than x86A, the programs which are not SC-robust in x86A are also not SC-robust in ARMv8. Programs like barrier, peterson-ra-Bartosz (pn-ra-b), peterson-sc (pn-sc), lamport-ra/tso/sc, rcu, rcu-offline (rcu-ofl), and chase-lev-dequeue-tso/sc (cldq-tso/sc) are in this category. There are programs which are SC-robust in x86 but not in ARMv8, such as dekker-tso. These programs violate both SC- and x86A-robustness due to unordered (Ld-Ld) or (SC-St/Ld) pairs.

Robustness of ARMv7 programs.

Now we move to the robustness analysis in ARMv7. Except for the spinlock, spinlock4, and seqlock programs, all other programs violate SC-robustness due to patterns similar to those discussed for ARMv8 robustness. Among these programs SC-robustness is violated in barrier due to an unordered (St-Ld) sequence. This access pattern is allowed in x86A, ARMv8, and ARMv7-mca and therefore these ARMv7 programs are robust in these models. Program rcu has unordered (St-St) pairs which violate SC- and x86A-robustness. However, these pairs do not violate ARMv8- and ARMv7-mca-robustness. The rest of the programs exhibit certain (Ld-Ld) pairs which result in x86, ARMv8, and ARMv7-mca robustness violations.

Whenever we identify a program as non-robust we insert appropriate fences to enforce the respective robustness. For example, in Fig. 19 we identify the different-location store-load access pairs which may violate robustness. We introduce leading MFENCE operations for the load operation in the pair as required.

A naive scheme does not use robustness information. It first eliminates existing fences in concurrent threads and then inserts fences after each memory access, except atomic updates in x86 and load-exclusive accesses in ARM models, to restrict program behavior. In both the naive scheme and our approach we do not insert fences for atomic updates. In ARMv8 we insert DMBLD trailing fences for loads and DMBFULL trailing fences for stores and store-exclusives when they are unordered with a successor. In ARMv7 we insert DMBFULL trailing fences for load, store, and store-exclusive when they are unordered with a successor.

In Fig. 25 we report the number of fences required in the naive scheme and the robustness analysis results of our proposed approach, along with the number of fences introduced to enforce robustness. We compare our results to the naive scheme as explained in Fig. 25 and find that our approach inserts fewer fences in most instances. However, our fence insertion is not optimal; we leave optimal fence insertion for enforcing robustness for future investigation.
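As a rough illustration of the naive scheme for x86 described above, the following Python sketch first drops existing fences and then places an MFENCE after every store and load, but not after an atomic update. The list-based thread representation and the operation names ("St", "Ld", "U", "MFENCE") are assumptions chosen to echo the paper's notation; this is not the implementation used in the experiments.

```python
def naive_x86_fences(thread):
    """Drop existing fences, then fence after each access except atomic updates."""
    result = []
    for op in thread:
        if op == "MFENCE":
            continue                  # existing fences are eliminated first
        result.append(op)
        if op in ("St", "Ld"):        # no fence is added after an update "U"
            result.append("MFENCE")
    return result

def fence_count(thread):
    return sum(1 for op in thread if op == "MFENCE")

example = ["St", "MFENCE", "Ld", "U", "St"]
naive = naive_x86_fences(example)
```

For `example`, the naive scheme yields St; MFENCE; Ld; MFENCE; U; St; MFENCE, that is, three fences regardless of whether the access pairs could actually take part in a robustness violation. The robustness-based approach instead inserts a fence only for access pairs reported as unordered.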

Architecture to architecture mapping

There are a number of dynamic binary translators Ding et al. [2011], Wang et al. [2011], Hong et al. [2012], Lustig et al. [2015], Cota et al. [2017] that emulate multithreaded programs. Among these, earlier translators such as PQEMU Ding et al. [2011], COREMU Wang et al. [2011], HQEMU Hong et al. [2012] and so on do not address the memory consistency model mismatches. ArMOR Lustig et al. [2015] proposes a specification format to define the ordering requirements for different memory models, which is used in translating between architectural concurrency models in dynamic translation. The specification format is used in specifying TSO and Power architectures. Cota et al. [2017] uses the rules from ArMOR in the Pico dynamic translator for QEMU. Our mapping schemes provide the ordering rules which can be used to populate the ordering tables for x86 and ARM models. Moreover, the ARMv8 reordering table in Fig. 14 demonstrates that reordering certain independent access pairs is not safe if they are part of certain dependency-based ordering. In addition to the QEMU-based translators, LLVM-based decompilers Bougacha, Bits, Yadavalli and Smith [2019], avast, Shen et al. [2012] raise binary code to LLVM IR and then compile it to another architecture. These decompilers do not support relaxed memory concurrency.

Fence optimization

Redundant fence elimination is addressed by Vafeiadis and Zappa Nardelli [2011], Elhorst [2014], Morisset and Nardelli [2017]. Vafeiadis and Zappa Nardelli [2011] perform safe fence elimination in x86, Elhorst [2014] eliminates adjacent fences in ARMv7, and Morisset and Nardelli [2017] perform efficient fence elimination in x86, Power, and ARMv7. However, none of these approaches performs ARMv8 fence elimination.

Robustness analysis.

Sequential consistency robustness has been explored against TSO Bouajjani et al. [2013], POWER Derevenetc and Meyer [2014], and Release-Acquire Lahav and Margalit [2019] models by exploring executions using model checking tools. Alglave et al. [2017] proposed fence insertion in POWER to strengthen a program to release/acquire semantics, which has the same preserved-program-order constraints between memory accesses as TSO. In contrast, we identify robustness checking conditions in ARMv7 and ARMv8, where we show that preserved-program-order is not sufficient to recover sequential consistency in ARMv7 models. Identifying a minimal set of fences is NP-hard Lee and Padua [2001], and a number of approaches such as Shasha and Snir [1988], Bouajjani et al. [2013], Lee and Padua [2001], Alglave et al. [2017] proposed fence insertion to recover stronger order, particularly sequential consistency. Similar to Lee and Padua [2001], our approach is based on analyzing control flow graphs without exploring the possible executions by model checkers. Though in certain scenarios we report false positives, our approach precisely identifies robustness for a number of well-known programs.

10 Conclusion and Future Work

In this paper we propose correct and efficient mapping schemes between x86, ARMv8, and ARMv7 concurrency models. We have shown that ARMv8 can indeed serve as an intermediate model for mapping between x86 and ARMv7. We have also shown that removing non-multicopy atomicity from ARMv7 does not affect the mapping schemes. We also show that the ARMv8 model cannot serve as an IR in a decompiler as it does not support all common compiler optimizations. Next, we propose fence elimination algorithms to remove additional fences generated by the mapping schemes. We also propose robustness analyses and enforcement techniques based on memory access sequence analysis for x86 and ARM programs.

Going forward we want to extend these schemes and analyses to other architectures as well. We believe these results would play a crucial role in a number of translators, decompilers, and state-of-the-art systems. Therefore integrating these results into these systems is another direction we would like to pursue in future.

References

C/C++11 mappings to processors.

J. Alglave and L. Maranget. herd7 consistency model simulator.

J. Alglave, L. Maranget, and M. Tautschnig. Herding cats: modelling, simulation, testing, and data-mining for weak memory. ACM Trans. Program. Lang. Syst., 36(2):7:1–7:74, 2014. doi: 10.1145/2627752.

J. Alglave, D. Kroening, V. Nimal, and D. Poetzl. Don’t sit on the fence: A static analysis approach to automatic fence insertion. ACM Trans. Program. Lang. Syst., 39(2):6:1–6:38, 2017.

Android-x86.

Arm. Migrating a software application from armv5 to armv7-a/r application. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0425/chapter1intendreader.html

avast. A retargetable machine-code decompiler based on llvm. https://github.com/avast/retdec

A. Barbalace, R. Lyerly, C. Jelesnianski, A. Carno, H. Chuang, V. Legout, and B. Ravindran. Breaking the boundaries in heterogeneous-isa datacenters. In ASPLOS 2017, pages 645–659, 2017. doi: 10.1145/3037697.3037738.

A. Barbalace, M. L. Karaoui, W. Wang, T. Xing, P. Olivier, and B. Ravindran. Edge computing: the case for heterogeneous-isa container migration. In VEE’20, pages 73–87, 2020. doi: 10.1145/3381052.3381321.

L. Bits. Framework for lifting x86, amd64, and aarch64 program binaries to llvm bitcode. https://github.com/lifting-bits/mcsema

A. Bouajjani, E. Derevenetc, and R. Meyer. Checking and enforcing robustness against TSO. In ESOP 2013, pages 533–553, 2013. doi: 10.1007/978-3-642-37036-6_29.

A. Bougacha. Binary translator to llvm ir. https://github.com/repzret/dagger

A. Chernoff, M. Herdeg, R. Hookway, C. Reeve, N. Rubin, T. Tye, S. Bharadwaj Yadavalli, and J. Yates. Fx32 a profile-directed binary translator. IEEE Micro, 18(2):56–64, 1998.

E. G. Cota, P. Bonzini, A. Bennée, and L. P. Carloni. Cross-isa machine emulation for multicores. In CGO’2017, pages 210–220. IEEE Press, 2017.

E. Derevenetc and R. Meyer. Robustness against power is pspace-complete. In ICALP’14, volume 8573 of LNCS, pages 158–170, 2014. doi: 10.1007/978-3-662-43951-7_14.

J. Ding, P. Chang, W. Hsu, and Y. Chung. PQEMU: A parallel system emulator based on QEMU. In ICPADS’11, pages 276–283, 2011. doi: 10.1109/ICPADS.2011.102.

M. Docs. How x86 emulation works on arm. https://docs.microsoft.com/en-us/windows/uwp/porting/apps-on-arm-x86-emulation

R. Elhorst. Lowering C11 atomics for ARM in LLVM. In European LLVM Conference, 2014.

D.-Y. Hong, C.-C. Hsu, P.-C. Yew, J.-J. Wu, W.-C. Hsu, P. Liu, C.-M. Wang, and Y.-C. Chung. Hqemu: A multi-threaded and retargetable dynamic binary translator on multicores. In CGO’12, pages 104–113, 2012. doi: 10.1145/2259016.2259030.

ISO/IEC 14882. Programming language C++, 2011.

ISO/IEC 9899. Programming language C, 2011.

O. Lahav and R. Margalit. Robustness against release/acquire semantics. In PLDI 2019, pages 126–141, 2019. doi: 10.1145/3314221.3314604.

O. Lahav and V. Vafeiadis. Explaining relaxed memory models with program transformations. In FM’16, pages 479–495, 2016. doi: 10.1007/978-3-319-48989-6_29.

O. Lahav, V. Vafeiadis, J. Kang, C.-K. Hur, and D. Dreyer. Repairing sequential consistency in C/C++11. In PLDI 2017, pages 618–632, 2017. doi: 10.1145/3062341.3062352. Technical Appendix Available at https://plv.mpi-sws.org/scfix/full.pdf

J. Lee and D. A. Padua. Hiding relaxed memory consistency with a compiler. IEEE Transactions on Computers, 50(8):824–833, 2001.

D. Lustig, C. Trippel, M. Pellauer, and M. Martonosi. Armor: Defending against memory consistency model mismatches in heterogeneous architectures. In ISCA’15, pages 388–400, 2015. doi: 10.1145/2749469.2750378.

R. Morisset and F. Z. Nardelli. Partially redundant fence elimination for x86, arm, and power processors. In CC’17, pages 1–10, 2017.

B. Norris and B. Demsky. CDSChecker: Checking concurrent data structures written with C/C++ atomics. In OOPSLA’13, 2013.

notaz. Starcraft. http://repo.openpandora.org/, 2014.

C. Pulte, S. Flur, W. Deacon, J. French, S. Sarkar, and P. Sewell. Simplifying ARM concurrency: multicopy-atomic axiomatic and operational models for ARMv8. PACMPL, 2(POPL):19:1–19:29, 2018. doi: 10.1145/3158107.

QEMU. the fast! processor emulator.

D. E. Shasha and M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst., 10(2):282–312, 1988. doi: 10.1145/42190.42277.

B.-Y. Shen, J.-Y. Chen, W.-C. Hsu, and W. Yang. Llbt: An llvm-based static binary translator. In CASES 2012, pages 51–60, 2012. doi: 10.1145/2380403.2380419.

V. Vafeiadis and F. Zappa Nardelli. Verifying fence elimination optimisations. In

SAS’11 , volume 6887 of

LNCS ,pages 146–162. Springer, 2011. doi: 10.1007/978-3-642-23702-7_14.Z. Wang, R. Liu, Y. Chen, X. Wu, H. Chen, W. Zhang, and B. Zang. COREMU: a scalable and portable parallelfull-system emulator. In C. Cascaval and P. Yew, editors,

PPOPP’11 , pages 213–222, 2011. doi: 10.1145/1941553.1941583.J. Wickerson, M. Batty, T. Sorensen, and G. A. Constantinides. Automatically comparing memory consistency models.In

POPL’17 , pages 190–204. ACM, 2017. doi: 10.1145/3009837.3009838.S. B. Yadavalli and A. Smith. Raising binaries to llvm ir with mctoll (wip paper). In

LCTES 2019 , page 213â ˘A ¸S218,2019. doi: 10.1145/3316482.3326354. 27n Architecture to Architecture Mapping for Concurrency

A Proofs of Mapping Schemes

A.1 x86 to ARMv8 Mappings

We first restate Theorem 1.

Theorem 1.

The mappings in Fig. 9a are correct.
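The proof that follows discharges six x86 consistency conditions, each an irreflexivity check on a composed relation: (irrHB), (irrMOHB), (irrFRHB), (irrFRMO), (irrFMRP), and (irrUF). As a rough illustration of what such checks look like on a concrete execution, here is a minimal Python sketch. It assumes relations are given as sets of event pairs, that $\mathsf{xhb} = (\mathsf{po} \cup \mathsf{rfe})^+$ as used in the proof, and that $\mathsf{fr}$ is derived as $\mathsf{rf}^{-1}; \mathsf{co}$ minus identity (a common definition, assumed here rather than taken from this appendix). The function and parameter names (`x86_violations`, `order_pairs`, `thread`) are hypothetical.

```python
from itertools import combinations


def compose(r, s):
    """Relational composition r; s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def closure(r):
    """Transitive closure r^+."""
    result = set(r)
    while True:
        extra = compose(result, r) - result
        if not extra:
            return result
        result |= extra


def irreflexive(r):
    return all(a != b for (a, b) in r)


def order_pairs(seq):
    """All pairs (seq[i], seq[j]) with i < j: a strict total order on seq."""
    return set(combinations(seq, 2))


def x86_violations(thread, po, rf, co, mo, updates_and_fences):
    """Return the names of the violated conditions (empty list if none).

    thread maps each event to a thread identifier, po must be transitive,
    and mo is a strict total order on write and fence events.
    """
    # Assumption: fr = rf^-1; co, excluding identity pairs on update events.
    inverse_rf = {(r, w) for (w, r) in rf}
    fr = {(a, b) for (a, b) in compose(inverse_rf, co) if a != b}
    rfe = {(w, r) for (w, r) in rf if thread[w] != thread[r]}
    xhb = closure(po | rfe)
    # mo; [U ∪ F]: keep only mo edges that end in an update or fence event.
    mo_to_uf = {(a, b) for (a, b) in mo if b in updates_and_fences}
    checks = {
        "irrHB": xhb,
        "irrMOHB": compose(mo, xhb),
        "irrFRHB": compose(fr, xhb),
        "irrFRMO": compose(fr, mo),
        "irrFMRP": compose(compose(compose(fr, mo), rfe), po),
        "irrUF": compose(compose(fr, mo_to_uf), po),
    }
    return [name for name, rel in checks.items() if not irreflexive(rel)]


if __name__ == "__main__":
    # Store buffering: each thread writes one location, then reads the other,
    # and both reads observe the initial writes Ix and Iy.
    thread = {"Ix": "init", "Iy": "init", "Wx": 1, "Ry": 1, "Wy": 2, "Rx": 2}
    po = order_pairs(["Wx", "Ry"]) | order_pairs(["Wy", "Rx"])
    rf = {("Iy", "Ry"), ("Ix", "Rx")}
    co = {("Ix", "Wx"), ("Iy", "Wy")}
    mo = order_pairs(["Ix", "Iy", "Wx", "Wy"])
    print(x86_violations(thread, po, rf, co, mo, set()))
```

The example execution is the classic store-buffering outcome, which these checks accept. Placing a fence between the write and the read of each thread, and extending `mo` over the fences, makes the same outcome fail (irrUF) for the order tried in the test; the sketch checks one given `mo` rather than searching over all total orders.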

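To prove Theorem 1, we prove the following formal statement.

$$P_{\mathsf{x86}} \rightsquigarrow P_{\mathsf{ARMv8}} \implies \forall X_t \in [\![P_{\mathsf{ARMv8}}]\!].\ \exists X_s \in [\![P_{\mathsf{x86}}]\!].\ \mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$$

Given an ARM execution $X_t$, we define the corresponding x86 execution $X_s$ where:

1. $[X_t.\mathsf{St} \cup X_t.\mathsf{F}]; X_t.\mathsf{ob}; [X_t.\mathsf{St} \cup X_t.\mathsf{F}] \implies X_s.\mathsf{mo}$
2. $[X_t.\mathsf{St} \cup X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{St} \cup X_t.\mathsf{F}] \implies X_s.\mathsf{mo}$
3. $[X_t.\mathsf{F}]; X_t.\mathsf{po}; X_t.\mathsf{fr} \implies X_s.\mathsf{mo}$
4. $X_t.\mathsf{co} \implies X_s.\mathsf{mo}|_{\mathsf{loc}}$

We know that $X_t$ is ARMv8 consistent. We now show that $X_s$ is x86 consistent.

Proof.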

We prove by contradiction.

(irrHB)

Assume $X_s$ has an $X_s.\mathsf{xhb}$ cycle. It implies a $(X_s.\mathsf{po} \cup X_s.\mathsf{rfe})^+$ cycle. We consider the possible cases of the $X_s.\mathsf{po}$ edges on the cycle.

Case $[X_s.\mathsf{Ld}]; X_s.\mathsf{po}; [X_s.\mathsf{W}]$:

$$\begin{aligned}
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F_{LD}}]; X_t.\mathsf{po}; [X_t.\mathsf{W}] \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{bob}; [X_t.\mathsf{W}] \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_t.\mathsf{W}]
\end{aligned}$$

Case $[X_s.\mathsf{U}]; X_s.\mathsf{po}; [X_s.\mathsf{W}]$:

$$\begin{aligned}
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{rmw}; X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po}; X_t.\mathsf{rmw}; [X_t.\mathsf{St}] \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{aob}; X_t.\mathsf{bob}; X_t.\mathsf{aob}; [X_t.\mathsf{St}] \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}]
\end{aligned}$$

Thus in both cases $X_s.\mathsf{xhb} \implies (X_t.\mathsf{ob} \cup X_t.\mathsf{rfe})^+ \subseteq X_t.\mathsf{ob}$. However, $X_t$ is ARM consistent and $X_t.\mathsf{ob}$ is irreflexive. Hence a contradiction, and $X_s.\mathsf{xhb}$ is irreflexive.

(irrMOHB)

Assume $X_s$ has an $X_s.\mathsf{mo}; X_s.\mathsf{xhb}$ cycle. From the definitions, the $\mathsf{xhb}$ part of such a cycle has the form $[X_s.\mathsf{W} \cup X_s.\mathsf{F}]; X_s.\mathsf{xhb}; [X_s.\mathsf{W} \cup X_s.\mathsf{F}]$. We consider the $\mathsf{po}$ and $\mathsf{rfe}$ edges from $\mathsf{xhb}$.

Case $[X_s.\mathsf{W} \cup X_s.\mathsf{F}]; X_s.\mathsf{po}; [X_s.\mathsf{W} \cup X_s.\mathsf{F}]$:

We consider the following subcases.

Subcase $[X_s.\mathsf{St} \cup X_s.\mathsf{F}]; X_s.\mathsf{po}; [X_s.\mathsf{St} \cup X_s.\mathsf{F}]$:

It implies $[X_t.\mathsf{St} \cup X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{St} \cup X_t.\mathsf{F}]$. From the definitions, $[X_t.\mathsf{St} \cup X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{St} \cup X_t.\mathsf{F}] \implies X_s.\mathsf{mo} \land \lnot X_s.\mathsf{mo}^{-1}$.

Subcase Otherwise:

Possible scenarios are $[X_s.\mathsf{U}]; X_s.\mathsf{po}; [X_s.\mathsf{W} \cup X_s.\mathsf{F}]$ or $[X_s.\mathsf{W} \cup X_s.\mathsf{F}]; X_s.\mathsf{po}; [X_s.\mathsf{U}]$.

Now, $[X_s.\mathsf{U}]; X_s.\mathsf{po}; [X_s.\mathsf{W} \cup X_s.\mathsf{F}]$

$$\begin{aligned}
\implies\ & X_t.\mathsf{rmw}; X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{W} \cup X_t.\mathsf{F}] \\
\implies\ & X_t.\mathsf{bob} \\
\implies\ & X_t.\mathsf{ob}
\end{aligned}$$

Similarly, $[X_s.\mathsf{W} \cup X_s.\mathsf{F}]; X_s.\mathsf{po}; [X_s.\mathsf{U}]$

$$\begin{aligned}
\implies\ & X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{W}] \\
\implies\ & X_t.\mathsf{bob} \\
\implies\ & X_t.\mathsf{ob}
\end{aligned}$$

From the definitions, $[X_t.\mathsf{St} \cup X_t.\mathsf{F}]; X_t.\mathsf{ob}; [X_t.\mathsf{St} \cup X_t.\mathsf{F}] \implies X_s.\mathsf{mo} \land \lnot X_s.\mathsf{mo}^{-1}$.

Case $[X_s.\mathsf{W} \cup X_s.\mathsf{F}]; X_s.\mathsf{rfe}; [X_s.\mathsf{W} \cup X_s.\mathsf{F}]$:

It implies $[X_s.\mathsf{W}]; X_s.\mathsf{rfe}; [X_s.\mathsf{U}]$

$$\begin{aligned}
\implies\ & ([X_t.\mathsf{Ld}]; X_t.\mathsf{rmw})^?; [X_t.\mathsf{St}]; X_t.\mathsf{rfe}; [X_t.\mathsf{Ld}]; X_t.\mathsf{rmw}; [X_t.\mathsf{St}] \quad \text{following the mappings} \\
\implies\ & ([X_t.\mathsf{Ld}]; X_t.\mathsf{aob})^?; [X_t.\mathsf{St}]; X_t.\mathsf{obs}; [X_t.\mathsf{Ld}]; X_t.\mathsf{aob}; [X_t.\mathsf{St}] \\
\implies\ & ([X_t.\mathsf{Ld}]; X_t.\mathsf{aob})^?; [X_t.\mathsf{St}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}]
\end{aligned}$$

From the definitions we know that $[X_t.\mathsf{St}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}] \implies X_s.\mathsf{mo} \land \lnot X_s.\mathsf{mo}^{-1}$.

Therefore $X_s.\mathsf{xhb} \implies X_s.\mathsf{mo}$, hence $X_s.\mathsf{mo}; X_s.\mathsf{xhb}$ is acyclic and $X_s$ satisfies (irrMOHB).

(irrFRHB)

Assume $X_s$ has an $X_s.\mathsf{fr}; X_s.\mathsf{xhb}$ cycle. We already know that $X_s.\mathsf{xhb} \implies X_t.\mathsf{ob}$ holds. We consider the cases of $X_s.\mathsf{fr}$.

Case $X_s.\mathsf{fre}$:

In this case $X_s.\mathsf{fre} \implies X_t.\mathsf{fre} \implies X_t.\mathsf{obs}$. Then there exists an $X_t.\mathsf{obs}; X_t.\mathsf{ob}$ cycle, which violates (external) in $X_t$. Hence a contradiction, and $X_s$ satisfies (irrFRHB).

Case $X_s.\mathsf{fri}$:

Following the mappings, $X_s.\mathsf{fri} \implies X_t.\mathsf{bob}$. Then there exists an $X_t.\mathsf{bob}; X_t.\mathsf{ob}$ cycle, which violates (external) in $X_t$. Hence a contradiction, and $X_s$ satisfies (irrFRHB).

(irrFRMO)

Assume $X_s$ has an $X_s.\mathsf{fr}; X_s.\mathsf{mo}$ cycle. It implies an $X_s.\mathsf{fr}; X_s.\mathsf{co}$ cycle and in consequence an $X_t.\mathsf{fr}; X_t.\mathsf{co}$ cycle, which violates (internal) in $X_t$. Hence a contradiction, and $X_s$ satisfies (irrFRMO).

(irrFMRP)

Assume $X_s$ has an $X_s.\mathsf{fr}; X_s.\mathsf{mo}; X_s.\mathsf{rfe}; X_s.\mathsf{po}$ cycle. It implies an $X_s.\mathsf{rfe}; X_s.\mathsf{po}; X_s.\mathsf{fr}; X_s.\mathsf{mo}$ cycle. Now we consider an $X_s.\mathsf{rfe}; X_s.\mathsf{po}; X_s.\mathsf{fr}$ path.

Thus $[X_s.\mathsf{W}]; X_s.\mathsf{rfe}; X_s.\mathsf{po}; X_s.\mathsf{fr}; [X_s.\mathsf{W}]$

$$\begin{aligned}
\implies\ & [X_s.\mathsf{W}]; X_s.\mathsf{rfe}; X_s.\mathsf{po}; [X_s.\mathsf{R}]; X_s.\mathsf{fre}; [X_s.\mathsf{W}] \\
& \cup\ [X_s.\mathsf{W}]; X_s.\mathsf{rfe}; X_s.\mathsf{po}; [X_s.\mathsf{R}]; X_s.\mathsf{fri}; [X_s.\mathsf{W}] \\
\implies\ & [X_t.\mathsf{St}]; X_t.\mathsf{rfe}; [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F_{LD}} \cup X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{Ld}]; X_t.\mathsf{fre}; [X_t.\mathsf{St}] \\
& \cup\ [X_t.\mathsf{St}]; X_t.\mathsf{rfe}; [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F_{LD}} \cup X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{Ld}]; X_t.\mathsf{fri}; [X_t.\mathsf{St}] \\
\implies\ & [X_t.\mathsf{St}]; X_t.\mathsf{obs}; [X_t.\mathsf{Ld}]; X_t.\mathsf{bob}; [X_t.\mathsf{Ld}]; X_t.\mathsf{obs}; [X_t.\mathsf{St}] \\
& \cup\ [X_t.\mathsf{St}]; X_t.\mathsf{obs}; [X_t.\mathsf{Ld}]; X_t.\mathsf{bob}; [X_t.\mathsf{Ld}]; X_t.\mathsf{bob}; [X_t.\mathsf{St}] \\
\implies\ & [X_t.\mathsf{St}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}] \cup [X_t.\mathsf{St}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}] \\
\implies\ & [X_t.\mathsf{St}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}]
\end{aligned}$$

However, we know $[X_t.\mathsf{St}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}] \implies [X_s.\mathsf{W}]; X_s.\mathsf{mo}; [X_s.\mathsf{W}]$. Thus $[X_s.\mathsf{W}]; X_s.\mathsf{rfe}; X_s.\mathsf{po}; X_s.\mathsf{fr}; [X_s.\mathsf{W}] \implies X_s.\mathsf{mo} \land \lnot X_s.\mathsf{mo}^{-1}$. Hence a contradiction, and thus $X_s$ satisfies (irrFMRP).

(irrUF)

Assume $X_s$ has an $X_s.\mathsf{fr}; X_s.\mathsf{mo}; [X_s.\mathsf{U} \cup X_s.\mathsf{F}]; X_s.\mathsf{po}$ cycle. It implies a $[X_s.\mathsf{U} \cup X_s.\mathsf{F}]; X_s.\mathsf{po}; [X_s.\mathsf{R}]; X_s.\mathsf{fr}; [X_s.\mathsf{W}]; X_s.\mathsf{mo}$ cycle. Now we consider a $[X_s.\mathsf{U} \cup X_s.\mathsf{F}]; X_s.\mathsf{po}; [X_s.\mathsf{R}]; X_s.\mathsf{fr}; [X_s.\mathsf{W}]$ path and its possible cases.

Case $[X_s.\mathsf{U}]; X_s.\mathsf{po}; [X_s.\mathsf{R}]; X_s.\mathsf{fr}; [X_s.\mathsf{W}]$:

$$\begin{aligned}
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{Ld}]; (X_t.\mathsf{fre} \cup X_t.\mathsf{fri}); [X_t.\mathsf{St}] \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{Ld}]; X_t.\mathsf{fre}; [X_t.\mathsf{St}] \\
& \cup\ [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{Ld}]; X_t.\mathsf{fri}; [X_t.\mathsf{St}] \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{bob}; X_t.\mathsf{obs}; [X_t.\mathsf{St}] \cup [X_t.\mathsf{Ld}]; X_t.\mathsf{bob}; [X_t.\mathsf{St}] \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}] \\
\implies\ & X_s.\mathsf{mo} \quad \text{following the definition}
\end{aligned}$$

Case $[X_s.\mathsf{F}]; X_s.\mathsf{po}; X_s.\mathsf{fr}; [X_s.\mathsf{W}]$:

$$\begin{aligned}
\implies\ & [X_t.\mathsf{F}]; X_t.\mathsf{po}; X_t.\mathsf{fr}; [X_t.\mathsf{St}] \quad \text{following the mappings} \\
\implies\ & X_s.\mathsf{mo} \land \lnot X_s.\mathsf{mo}^{-1} \quad \text{following the definition}
\end{aligned}$$

Therefore $[X_s.\mathsf{U} \cup X_s.\mathsf{F}]; X_s.\mathsf{po}; X_s.\mathsf{fr}; X_s.\mathsf{mo}$ does not have a cycle. Hence a contradiction, and $X_s$ satisfies (irrUF).

From the definition we know $X_t.\mathsf{co} \implies X_s.\mathsf{mo}|_{\mathsf{loc}}$ and therefore $\mathsf{Behavior}(X_s) = \mathsf{Behavior}(X_t)$ holds.

A.2 Correctness of C11 to x86 to ARMv8 Mapping

We restate the theorem and then prove the same.

Theorem 2.

The mapping scheme in Fig. 9b is correct.

Proof. The mapping can be represented as a combination of the following transformation steps.

1. $P_{\mathsf{C11}} \mapsto P_{\mathsf{x86}}$ mapping from map.
2. $P_{\mathsf{x86}} \mapsto P_{\mathsf{ARMv8}}$ mappings from Fig. 9a.
3. Fence strengthening $\mathsf{DMBST}; \mathsf{STR} \rightsquigarrow \mathsf{DMBFULL}; \mathsf{STR}$ in $P_{\mathsf{ARMv8}}$.
4. Elimination of leading $\mathsf{DMBFULL}$ and trailing $\mathsf{DMBLD}$ fences in the following cases.
   (a) $\mathsf{DMBFULL}; \mathsf{STR} \rightsquigarrow \mathsf{STR}$ where $\mathsf{WMOV_{NA}} \mapsto \mathsf{STR}$.
   (b) $\mathsf{LDR}; \mathsf{DMBLD} \rightsquigarrow \mathsf{LDR}$ where $\mathsf{RMOV_{NA}} \mapsto \mathsf{LDR}$.

We know (1), (2), and (3) are sound, and therefore it suffices to show that transformation (4) is sound.

Let $X_a$ and $X'_a$ be the consistent executions of $P_{\mathsf{ARMv8}}$ before and after transformation (4). Let $X$ be the corresponding C11 execution of $P_{\mathsf{C11}}$; we know $P_{\mathsf{C11}}$ is race-free. Therefore, for every non-atomic event $a$ in $X$, if there exists another same-location event $b$ then $X.\mathsf{hb}^{=}(a, b)$ holds.

Now we consider the x86 to ARMv8 mapping scheme in Fig. 9a. Considering the $\mathsf{hb}$ definition, the following are the possibilities.

Case $[\mathsf{E_{NA}}]; X.\mathsf{po}; [\mathsf{W}_{\sqsupseteq \mathsf{RLX}}]$:

$$\begin{aligned}
\implies\ & [\mathsf{E}]; X_a.\mathsf{po}; [\mathsf{F_{LD}}]; X_a.\mathsf{po}; [\mathsf{F}]; X_a.\mathsf{po}; [\mathsf{E}] \\
\implies\ & [\mathsf{E}]; X'_a.\mathsf{po}; [\mathsf{F}]; X'_a.\mathsf{po}; [\mathsf{E}] \\
\implies\ & [\mathsf{E}]; X'_a.\mathsf{bob}; [\mathsf{E}]
\end{aligned}$$

Case $[\mathsf{R}_{\sqsupseteq \mathsf{RLX}}]; X.\mathsf{po}; [\mathsf{E_{NA}}]$:

$$\begin{aligned}
\implies\ & [\mathsf{Ld}]; (X_a.\mathsf{rmw}; X_a.\mathsf{F} \cup X_a.\mathsf{po}; [\mathsf{F_{LD}}]); X_a.\mathsf{po}; [\mathsf{E}] \\
\implies\ & [\mathsf{E}]; X'_a.\mathsf{bob}; [\mathsf{E}]
\end{aligned}$$

Hence $X'_a.\mathsf{bob} = X_a.\mathsf{bob}$ and the transformation is sound for the x86 to ARMv8 mapping. As a result, the mapping scheme in Fig. 9b is sound.

A.3 ARMv8 to x86 Mappings

We restate Lemma 1.

Lemma 1.

Suppose $X$ is an x86 consistent execution. In that case $X.\mathsf{po}|_{\mathsf{loc}}; X.\mathsf{fr} \implies X.\mathsf{fr} \cup X.\mathsf{co}$.

Proof. We consider two cases in $X$.

Case $[X.\mathsf{Ld}]; X.\mathsf{po}|_{\mathsf{loc}}; X.\mathsf{fr}; [X.\mathsf{W}]$:

Let $(r, e) \in [X.\mathsf{Ld}]; X.\mathsf{po}|_{\mathsf{loc}}; [X.\mathsf{R}]$ and $(e, w') \in [X.\mathsf{R}]; X.\mathsf{fr}; [X.\mathsf{W}]$ hold. Also consider that $X.\mathsf{rf}(w_e, e)$ and $X.\mathsf{rf}(w, r)$ hold.

We show by contradiction that $X.\mathsf{co}(w, w')$ and in consequence $X.\mathsf{fr}(r, w')$ holds.

Assume $X.\mathsf{co}(w_e, w)$ holds. Therefore $X.\mathsf{fr}(e, w)$ holds. However, from the definition, $X.\mathsf{xhb}(w, e)$ holds. This is not possible in an x86 consistent execution as it violates the irreflexive$(X.\mathsf{fr}; X.\mathsf{xhb})$ condition. Hence a contradiction, and $X.\mathsf{co}(w, w_e)$ holds.

We also know that $X.\mathsf{co}(w_e, w')$ holds, as from the definition $X.\mathsf{rf}(w_e, e) \land X.\mathsf{fr}(e, w')$. As a result $X.\mathsf{co}(w, w')$ holds. Therefore $X.\mathsf{fr}(r, w')$ holds.

Thus $[X.\mathsf{Ld}]; X.\mathsf{po}|_{\mathsf{loc}}; X.\mathsf{fr}; [X.\mathsf{W}] \implies X.\mathsf{fr}$.

Case $[X.\mathsf{W}]; X.\mathsf{po}|_{\mathsf{loc}}; X.\mathsf{fr}; [X.\mathsf{W}]$:

Let $(w, w') \in [X.\mathsf{W}]; X.\mathsf{po}|_{\mathsf{loc}}; X.\mathsf{fr}; [X.\mathsf{W}]$, where $(w, r) \in [X.\mathsf{W}]; X.\mathsf{po}|_{\mathsf{loc}}; [X.\mathsf{R}]$ and $(r, w') \in [X.\mathsf{R}]; X.\mathsf{fr}; [X.\mathsf{W}]$ hold. There are two subcases.

Subcase $X.\mathsf{rf}(w, r)$: In this case $X.\mathsf{co}(w, w')$ holds by definition.

Subcase $X.\mathsf{rfe}(w_r, r)$: In this case $w \neq w_r$. We show that $X.\mathsf{co}(w, w_r)$ holds by contradiction. Assume $X.\mathsf{co}(w_r, w)$ holds. In that case $X.\mathsf{fr}(r, w)$ holds. This violates the irreflexive$(X.\mathsf{fr}; X.\mathsf{xhb})$ constraint and hence a contradiction. Therefore $X.\mathsf{co}(w, w_r)$ holds and in consequence $X.\mathsf{co}(w, w')$ holds.

Thus $[X.\mathsf{W}]; X.\mathsf{po}|_{\mathsf{loc}}; X.\mathsf{fr}; [X.\mathsf{W}] \implies X.\mathsf{co}$.

We restate Lemma 2.

Lemma 2.

Suppose $X = \langle \mathsf{E}, \mathsf{po}, \mathsf{rf}, \mathsf{mo} \rangle$ is an x86 consistent execution. For each $(X.\mathsf{po}|_{\mathsf{loc}} \cup X.\mathsf{fr} \cup X.\mathsf{co} \cup X.\mathsf{rf})^+$ path between two events there exists an alternative $(X.\mathsf{xhb} \cup X.\mathsf{fr} \cup X.\mathsf{co})^+$ path between these two events which has no intermediate load event.

Proof. Consider a load event $r$ on a $(X.\mathsf{po}|_{\mathsf{loc}} \cup X.\mathsf{fr} \cup X.\mathsf{co} \cup X.\mathsf{rf})^+$ path. On this path, the possible incoming edges to $r$ are $X.\mathsf{rf}$ and $X.\mathsf{po}|_{\mathsf{loc}}$, and the possible outgoing edges are $X.\mathsf{fr}$ and $X.\mathsf{po}|_{\mathsf{loc}}$.

Let $a$ and $b$ be the source and destination of the incoming and outgoing edges on the path. The possible cases are:

Case $X.\mathsf{rf}(a, r) \land X.\mathsf{fr}(r, b)$: From the definition, $X.\mathsf{co}(a, b)$ holds.

Case $X.\mathsf{rf}(a, r) \land X.\mathsf{po}|_{\mathsf{loc}}(r, b)$: From the definition, $X.\mathsf{xhb}(a, b)$ holds.

Case $X.\mathsf{po}|_{\mathsf{loc}}(a, r) \land X.\mathsf{fr}(r, b)$: From Lemma 1, $X.\mathsf{fr}(a, b) \lor X.\mathsf{co}(a, b)$ holds.

Case $X.\mathsf{po}|_{\mathsf{loc}}(a, r) \land X.\mathsf{po}|_{\mathsf{loc}}(r, b)$: From the definition, $X.\mathsf{xhb}(a, b)$ holds.

We restate Lemma 3.

Lemma 3.

Suppose $X = \langle \mathsf{E}, \mathsf{po}, \mathsf{rf}, \mathsf{mo} \rangle$ is an x86 consistent execution. For each $\mathsf{obx}$ path between two events there exists an alternative $\mathsf{obx}$ path which has no intermediate load event.

Proof. Consider a load event $r$ on an $X.\mathsf{obx}$ path. On this path, the possible incoming edges to $r$ are $X.\mathsf{rf}$ and $X.\mathsf{xppo}$, and the possible outgoing edges are $X.\mathsf{fr}$ and $X.\mathsf{xppo}$.

Let $a$ and $b$ be the source and destination of the incoming and outgoing edges on the path. The possible cases are:

Case $X.\mathsf{rf}(a, r) \land X.\mathsf{fr}(r, b)$: From the definition, $X.\mathsf{mo}(a, b)$ holds.

Case $X.\mathsf{rf}(a, r) \land X.\mathsf{xppo}(r, b)$: From the definition, $X.\mathsf{xhb}(a, b)$ holds as $\mathsf{xppo} \subseteq \mathsf{po}$.

Case $X.\mathsf{xppo}(a, r) \land X.\mathsf{xppo}(r, b)$: From the definition, $X.\mathsf{xhb}(a, b)$ holds as $\mathsf{xppo} \subseteq \mathsf{po}$.

Case $X.\mathsf{xppo}(a, r) \land X.\mathsf{fr}(r, b)$: We consider the subcases of $a$.

Subcase $a \in (\mathsf{W} \cup \mathsf{F})$:

We show that $X.\mathsf{mo}(a, b)$ holds. In this case, following the definition of $\mathsf{xppo}$, we know $(a, r) \in [\mathsf{St}]; \mathsf{po}; [\mathsf{F}]; \mathsf{po}; [\mathsf{Ld}]$. Let $c \in \mathsf{F}$ be such that $X.\mathsf{po}(a, c) \land X.\mathsf{po}(c, r)$ holds.

We show that $X.\mathsf{mo}(c, b)$ holds by contradiction. Assume $X.\mathsf{mo}(b, c)$ holds. In this case $X.\mathsf{fr}(r, b) \land X.\mathsf{mo}(b, c) \land c \in \mathsf{F} \land X.\mathsf{po}(c, r)$ creates a cycle. Hence a contradiction as $X$ is x86 consistent. Therefore $X.\mathsf{mo}(c, b)$ holds.

We also know that $X.\mathsf{mo}(a, c)$ holds, as $X.\mathsf{mo}(c, a)$ would lead to an $X.\mathsf{mo}; X.\mathsf{xhb}$ cycle, which is a contradiction. As a result, $X.\mathsf{mo}(a, c) \land X.\mathsf{mo}(c, b)$ implies that $X.\mathsf{mo}(a, b)$ holds.

Subcase $a \in \mathsf{Ld}$:

Let $X.\mathsf{rf}(w, a)$. We consider two scenarios based on whether there is an intermediate fence.

Subsubcase $(a, r) \in [\mathsf{Ld}]; X.\mathsf{po}; [\mathsf{F}]; X.\mathsf{po}; [\mathsf{Ld}]$:

Let $c \in \mathsf{F}$ be the intermediate fence event. It implies $(a, c) \in X.\mathsf{xppo}$, and $X.\mathsf{mo}(c, b)$ holds (see the earlier subcase). Thus there is an $X.\mathsf{obx}$ path from $a$ to $b$ without passing through $r$.

Subsubcase Otherwise:

In this case $(a, r) \in [\mathsf{Ld}]; X.\mathsf{po}; [\mathsf{Ld}] \land \nexists e.\ X.\mathsf{po}(a, e) \land X.\mathsf{po}(e, r)$. Let $c$ be the event such that $(c, r) \in X.\mathsf{po} \cap X.\mathsf{obx}$ and there is no such $c'$ in between $c$ and $r$. The scenarios are as follows:

• $c \in \mathsf{U} \cup \mathsf{F}$. In this case $X.\mathsf{mo}(c, b)$ holds, as otherwise $X.\mathsf{mo}(b, c)$ creates an $X.\mathsf{fr}; X.\mathsf{mo}; [\mathsf{U} \cup \mathsf{F}]; X.\mathsf{po}$ cycle, which results in a contradiction. Thus the $X.\mathsf{obx}$ path between the same events does not pass through $r$.

• $c \in \mathsf{St}$. Following the definition of $\mathsf{xppo}$, there is an intermediate fence event $d \in \mathsf{F}$ such that $X.\mathsf{po}(c, d) \land X.\mathsf{po}(d, r)$ holds. In this case $X.\mathsf{mo}(d, b)$ holds and also $X.\mathsf{mo}(c, d)$ holds. Hence $X.\mathsf{mo}(c, b)$ also holds. Thus the $X.\mathsf{obx}$ path between the same events does not pass through $r$.

• $c \in \mathsf{Ld}$. Let $w \in \mathsf{W}$ be the event on the $X.\mathsf{obx}$ path such that $X.\mathsf{rfe}(w, c)$ holds. We show by contradiction that $X.\mathsf{mo}(w, b)$ holds. Assume $X.\mathsf{mo}(b, w)$ holds. In that case $X.\mathsf{fr}(r, b) \land X.\mathsf{mo}(b, w) \land X.\mathsf{rfe}(w, c) \land X.\mathsf{po}(c, r)$ creates a cycle which violates x86 consistency for $X$. Hence a contradiction, and $X.\mathsf{mo}(w, b)$ holds. Thus the $X.\mathsf{obx}$ path between the same events does not pass through $r$.

We restate the theorem.

Theorem 3.

The mapping scheme in Fig. 12b is correct.
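The proof of Theorem 3 establishes three ARMv8 consistency conditions on the source execution: (internal), (external), and (atomic). The following Python sketch shows one way these three checks could be evaluated on a small, explicitly given execution. It assumes $\mathsf{ca} = \mathsf{fr} \cup \mathsf{co}$ and $\mathsf{fr} = \mathsf{rf}^{-1}; \mathsf{co}$ (the usual definitions, not restated in this appendix), and it takes $\mathsf{ob}$ as an input rather than deriving it. All function and parameter names are hypothetical.

```python
def compose(r, s):
    """Relational composition r; s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def closure(r):
    """Transitive closure r^+."""
    result = set(r)
    while True:
        extra = compose(result, r) - result
        if not extra:
            return result
        result |= extra


def acyclic(r):
    return all(a != b for (a, b) in closure(r))


def armv8_violations(thread, po_loc, rf, co, ob, rmw):
    """Return which of (internal), (external), (atomic) are violated.

    thread maps each event to a thread identifier; ob is supplied by the
    caller because its full definition is outside the scope of this sketch.
    """
    inverse_rf = {(r, w) for (w, r) in rf}
    fr = compose(inverse_rf, co)  # assumption: fr = rf^-1; co

    def external(rel):
        return {(a, b) for (a, b) in rel if thread[a] != thread[b]}

    violated = []
    # (internal): po|loc ∪ ca ∪ rf is acyclic, with ca = fr ∪ co.
    if not acyclic(po_loc | fr | co | rf):
        violated.append("internal")
    # (external): ob has no cycle.
    if not acyclic(ob):
        violated.append("external")
    # (atomic): rmw ∩ (fre; coe) is empty.
    if rmw & compose(external(fr), external(co)):
        violated.append("atomic")
    return violated


if __name__ == "__main__":
    # One thread writes x and then reads the stale initial value of x.
    thread = {"I": "init", "W": 0, "R": 0}
    print(armv8_violations(thread, {("W", "R")}, {("I", "R")},
                           {("I", "W")}, set(), set()))
```

In the example, the read is program-ordered after a same-location write yet reads from the initial write, so the (internal) check reports a cycle through $\mathsf{po}|_{\mathsf{loc}}$ and $\mathsf{fr}$.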

To prove Theorem 3, we prove the following formal statement.

$$P_{\mathsf{ARMv8}} \rightsquigarrow P_{\mathsf{x86}} \implies \forall X_t \in [\![P_{\mathsf{x86}}]\!].\ \exists X_s \in [\![P_{\mathsf{ARMv8}}]\!].\ \mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$$

Proof.

Given an x86 execution $X_t$ we define the corresponding ARM execution $X_s$. We know that $X_t$ is x86 consistent. Now we show that $X_s$ is ARM consistent. We prove by contradiction.

(internal)

Assume $X_s$ contains an $X_s.\mathsf{po}|_{\mathsf{loc}} \cup X_s.\mathsf{ca} \cup X_s.\mathsf{rf}$ cycle. It implies an $X_t.\mathsf{po}|_{\mathsf{loc}} \cup X_t.\mathsf{ca} \cup X_t.\mathsf{rf}$ cycle following the mappings. In that case we can derive a $(X_t.\mathsf{xhb} \cup X_t.\mathsf{fr} \cup X_t.\mathsf{co})^+$ cycle with no load event in $X_t$ following Lemma 2. Thus the cycle contains only same-location write events.

In that case $X_t.\mathsf{fr} \implies X_t.\mathsf{co}$ and $([X_t.\mathsf{W}]; X_t.\mathsf{xhb}; [X_t.\mathsf{W}])|_{\mathsf{loc}} \implies X_t.\mathsf{co}$, which implies an $X_t.\mathsf{mo}$ cycle as $X_t.\mathsf{co} \subseteq X_t.\mathsf{mo}$. However, we know $X_t.\mathsf{mo}$ has no cycle, and hence a contradiction. Therefore the source execution $X_s$ in ARMv8 satisfies (internal).

(external)

We prove this by contradiction. Assume $X_s$ contains an $\mathsf{ob}$ cycle. In that case $X_t$ contains an $\mathsf{obx}$ cycle, and from Lemma 3 we know that there exists an $X_t.\mathsf{obx}$ cycle which has no load event. Therefore the cycle contains only $\mathsf{W} \cup \mathsf{F}$ events and thus there is an $X_t.\mathsf{mo}$ cycle. However, $X_t$ is x86 consistent and hence there is no $X_t.\mathsf{mo}$ cycle. Thus a contradiction, and $X_s$ has no $\mathsf{ob}$ cycle. Therefore the source execution $X_s$ in ARMv8 satisfies (external).

(atomic)

We prove this by contradiction. Assume $X_s.\mathsf{rmw} \cap (X_s.\mathsf{fre}; X_s.\mathsf{coe}) \neq \emptyset$. In that case there exist $u \in X_t.\mathsf{U}$ and $w \in X_t.\mathsf{W}$ in the x86 consistent execution $X_t$ such that $X_t.\mathsf{fre}(u, w)$ and $X_t.\mathsf{coe}(w, u)$ hold. It implies there is an $X_t.\mathsf{fr}; X_t.\mathsf{mo}$ cycle as $\mathsf{fre} \subseteq \mathsf{fr}$ and $\mathsf{coe} \subseteq \mathsf{mo}$ hold. However, an $X_t.\mathsf{fr}; X_t.\mathsf{mo}$ cycle is not possible as $X_t$ is consistent. Hence a contradiction, and therefore $X_s.\mathsf{rmw} \cap (X_s.\mathsf{fre}; X_s.\mathsf{coe}) = \emptyset$.

Thus $X_s$ is ARMv8 consistent as it satisfies the (internal), (external), and (atomic) constraints.

A.4 ARMv7-mca to ARMv8 Mappings

In Appendix A.5 we have already shown all the relevant consistency constraints. It remains to show that (mca) holds for ARMv7-mca to ARMv8 mappings.

We restate Lemma 7 and then prove the same.

Lemma 7.

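We start with $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}; [X_s.\mathsf{Ld}]; X_s.\mathsf{po}|_{\mathsf{loc}}; [X_s.\mathsf{St}]$. Considering the final incoming edge to $[X_s.\mathsf{Ld}]$, we consider the following cases.

Case $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}^?; [X_s.\mathsf{E}]; X_s.\mathsf{addr}; [X_s.\mathsf{Ld}]; X_s.\mathsf{po}|_{\mathsf{loc}}; [X_s.\mathsf{St}]$:

It implies $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}^?; [X_s.\mathsf{E}]; X_s.\mathsf{addr}; X_s.\mathsf{po}; [X_s.\mathsf{St}]$

$$\begin{aligned}
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}^?; [X_t.\mathsf{E}]; X_t.\mathsf{dob}; [X_t.\mathsf{St}] \quad \text{from Lemma 4} \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}]
\end{aligned}$$

Case $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}^?; [X_s.\mathsf{E}]; X_s.\mathsf{rdw}; [X_s.\mathsf{Ld}]; X_s.\mathsf{po}|_{\mathsf{loc}}; [X_s.\mathsf{St}]$:

It implies $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}^?; [X_s.\mathsf{E}]; X_s.\mathsf{coe}; X_s.\mathsf{rfe}; [X_s.\mathsf{Ld}]; X_s.\mathsf{po}|_{\mathsf{loc}}; [X_s.\mathsf{St}]$

$$\begin{aligned}
\implies\ & [X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}^?; [X_s.\mathsf{E}]; X_s.\mathsf{coe}; X_s.\mathsf{coe}; [X_s.\mathsf{St}] \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}^?; [X_t.\mathsf{E}]; X_t.\mathsf{obs}; X_t.\mathsf{obs}; [X_t.\mathsf{St}] \quad \text{from Lemma 4} \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}]
\end{aligned}$$

Case $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}; [X_s.\mathsf{St}]; X_s.\mathsf{rfi}; [X_s.\mathsf{Ld}]; X_s.\mathsf{po}|_{\mathsf{loc}}; [X_s.\mathsf{St}]$:

It implies $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}^?; [X_s.\mathsf{Ld}]; (X_s.\mathsf{ctrl} \cup X_s.\mathsf{data} \cup X_s.\mathsf{addr}); [X_s.\mathsf{St}]; X_s.\mathsf{coi}; [X_s.\mathsf{St}]$ as $X_s$ satisfies (sc-per-loc).

$$\begin{aligned}
\implies\ & [X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}^?; [X_s.\mathsf{Ld}]; (X_s.\mathsf{ctrl} \cup X_s.\mathsf{data}); [X_s.\mathsf{St}]; X_s.\mathsf{coi}; [X_s.\mathsf{St}] \\
& \cup\ [X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}^?; [X_s.\mathsf{Ld}]; X_s.\mathsf{addr}; X_s.\mathsf{po}; [X_s.\mathsf{St}] \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}^?; [X_t.\mathsf{Ld}]; X_t.\mathsf{dob}; [X_t.\mathsf{St}] \cup [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}^?; [X_t.\mathsf{Ld}]; X_t.\mathsf{dob}; [X_t.\mathsf{St}] \quad \text{from Lemma 4} \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}]
\end{aligned}$$

Case $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}^?; [X_s.\mathsf{Ld}]; X_s.\mathsf{ctrl_{ISB}}; [X_s.\mathsf{Ld}]; X_s.\mathsf{po}|_{\mathsf{loc}}; [X_s.\mathsf{St}]$:

It implies $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}^?; [X_s.\mathsf{Ld}]; X_s.\mathsf{ctrl}; [X_s.\mathsf{St}]$ as $\mathsf{ctrl_{ISB}}; \mathsf{po} \subseteq \mathsf{ctrl_{ISB}}$ and $\mathsf{ctrl_{ISB}} \subseteq \mathsf{ctrl}$.

$$\begin{aligned}
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}^?; [X_t.\mathsf{Ld}]; X_t.\mathsf{dob}; [X_t.\mathsf{St}] \quad \text{from Lemma 4} \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}]
\end{aligned}$$

Case $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}; [X_s.\mathsf{St}]; X_s.\mathsf{detour}; [X_s.\mathsf{Ld}]; X_s.\mathsf{po}|_{\mathsf{loc}}; [X_s.\mathsf{St}]$:

It implies $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}; [X_s.\mathsf{St}]; X_s.\mathsf{coe}; [X_s.\mathsf{St}]; X_s.\mathsf{rfe}; [X_s.\mathsf{Ld}]; X_s.\mathsf{po}|_{\mathsf{loc}}; [X_s.\mathsf{St}]$ from the definition of $\mathsf{detour}$.

$$\begin{aligned}
\implies\ & [X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}; [X_s.\mathsf{St}]; X_s.\mathsf{coe}; [X_s.\mathsf{St}]; X_s.\mathsf{coe}; [X_s.\mathsf{St}] \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}]; X_t.\mathsf{obs}; [X_t.\mathsf{St}]; X_t.\mathsf{obs}; [X_t.\mathsf{St}] \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}]
\end{aligned}$$

Case $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}^?; [X_s.\mathsf{Ld}]; X_s.\mathsf{ctrl}; [X_s.\mathsf{Ld}]; X_s.\mathsf{po}|_{\mathsf{loc}}; [X_s.\mathsf{St}]$:

It implies $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}^?; [X_s.\mathsf{Ld}]; X_s.\mathsf{ctrl}; [X_s.\mathsf{St}]$ as $\mathsf{ctrl}; \mathsf{po} \subseteq \mathsf{ctrl}$.

$$\begin{aligned}
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}^?; X_t.\mathsf{dob}; [X_t.\mathsf{St}] \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}]
\end{aligned}$$

Case $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}^?; [X_s.\mathsf{Ld}]; X_s.\mathsf{addr}; X_s.\mathsf{po}^?; [X_s.\mathsf{Ld}]; X_s.\mathsf{po}|_{\mathsf{loc}}; [X_s.\mathsf{St}]$:

It implies $[X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}^?; [X_s.\mathsf{Ld}]; X_s.\mathsf{addr}; X_s.\mathsf{po}^?; [X_s.\mathsf{St}]$

$$\begin{aligned}
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}^?; X_t.\mathsf{dob}; [X_t.\mathsf{St}] \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}]
\end{aligned}$$

Now we show that $X_s$ satisfies (mca). We restate Lemma 8 and then prove the same.

Lemma 8.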

Suppose $X_t$ is a target ARMv8 consistent execution and $X_s$ is the corresponding ARMv7 consistent execution. In this case $X_s.\mathsf{wo}^+$ is acyclic.

Proof. Following the definition of $X_s.\mathsf{wo}$:

$$X_s.\mathsf{wo} \triangleq ((X_s.\mathsf{rfe}; X_s.\mathsf{ppo}; X_s.\mathsf{rfe}^{-1}) \setminus [X_s.\mathsf{E}]); X_s.\mathsf{co}$$

It implies $X_s.\mathsf{rfe}; X_s.\mathsf{ppo}; [X_s.\mathsf{Ld}]; X_s.\mathsf{fri}; [X_s.\mathsf{St}] \cup X_s.\mathsf{rfe}; X_s.\mathsf{ppo}; [X_s.\mathsf{Ld}]; X_s.\mathsf{fre}; [X_s.\mathsf{St}]$

$$\begin{aligned}
\implies\ & X_s.\mathsf{rfe}; [X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}; [X_s.\mathsf{Ld}]; X_s.\mathsf{po}|_{\mathsf{loc}}; [X_s.\mathsf{St}] \cup X_s.\mathsf{rfe}; X_s.\mathsf{ppo}; X_s.\mathsf{fre} \quad \text{from definitions} \\
\implies\ & X_t.\mathsf{rfe}; [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}] \cup X_t.\mathsf{rfe}; X_t.\mathsf{ob}; X_t.\mathsf{fre} \quad \text{from Lemma 7} \\
\implies\ & X_t.\mathsf{obs}; [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_t.\mathsf{St}] \cup X_t.\mathsf{obs}; X_t.\mathsf{ob}; X_t.\mathsf{obs} \quad \text{from Lemma 9} \\
\implies\ & X_t.\mathsf{ob}
\end{aligned}$$

Thus $X_s.\mathsf{wo}^+ \implies X_t.\mathsf{ob} \cup X_t.\mathsf{ob} \implies X_t.\mathsf{ob}$. We know $X_t.\mathsf{ob}$ is acyclic. Therefore $X_s.\mathsf{wo}^+$ is acyclic.

We restate Theorem 7 and then prove the same.

Theorem 7.

The mappings in Fig. 12a are correct for ARMv7-mca.

We formally show

$$P_{\mathsf{ARMv7\text{-}mca}} \rightsquigarrow P_{\mathsf{ARMv8}} \implies \forall X_t \in [\![P_{\mathsf{ARMv8}}]\!].\ \exists X_s \in [\![P_{\mathsf{ARMv7\text{-}mca}}]\!].\ \mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$$

Proof. Follows from Theorem 4 and Lemma 8. Moreover, $X_s.\mathsf{co} \iff X_t.\mathsf{co}$ holds. Therefore $\mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$ also holds.

A.5 ARMv7 to ARMv8 Mappings

We restate Theorem 4.

Theorem 4.

The mappings in Fig. 12a are correct.

We prove the following formal statement.

$$P_{\mathsf{ARMv7}} \rightsquigarrow P_{\mathsf{ARMv8}} \implies \forall X_t \in [\![P_{\mathsf{ARMv8}}]\!].\ \exists X_s \in [\![P_{\mathsf{ARMv7}}]\!].\ \mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$$

Given an ARMv8 execution $X_t$ we define the corresponding ARMv7 execution $X_s$ such that $X_t.\mathsf{po} \iff X_s.\mathsf{po}$, $X_t.\mathsf{rf} \iff X_s.\mathsf{rf}$, and $X_t.\mathsf{co} \iff X_s.\mathsf{co}$ hold.

We know that $X_t$ is ARMv8 consistent. We will show that $X_s$ is ARMv7 consistent. First we relate the $X_s$ and $X_t$ relations.

Lemma 9.

Suppose $X_s$ is an ARMv7 consistent execution and $X_t$ is the corresponding ARMv8 execution. In that case $X_s.\mathsf{fre} \implies X_t.\mathsf{obs}$ and $X_s.\mathsf{rfe} \implies X_t.\mathsf{obs}$.

Proof. Follows from the definition.

Lemma 10.

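Suppose $X_s$ is an ARMv7 consistent execution and $X_t$ is the corresponding ARMv8 execution. In that case $X_s.\mathsf{fence} \implies X_t.\mathsf{bob}$.

Proof. $X_s.\mathsf{fence} \implies X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po} \implies X_t.\mathsf{bob}$.

Lemma 11. $(\mathsf{ii} \cup \mathsf{ci} \cup \mathsf{cc}); [\mathsf{St}]; \mathsf{rfi} \implies \mathsf{ob}$

Proof. We know $\mathsf{rfi} \subseteq \mathsf{ii}$ and $\mathsf{ppo}$ does not have a $\mathsf{cc}; \mathsf{ii}$ subsequence following the constraint. Therefore we show $(\mathsf{ii} \cup \mathsf{ci}); [\mathsf{St}]; \mathsf{rfi} \implies \mathsf{ob}$.

It implies

$$(\mathsf{addr} \cup \mathsf{data} \cup \mathsf{ctrl_{ISB}}); [\mathsf{St}]; \mathsf{rfi} \implies (\mathsf{addr} \cup \mathsf{data}); \mathsf{rfi} \cup \mathsf{ctrl}; [\mathsf{St}] \implies \mathsf{dob} \cup \mathsf{dob} \implies \mathsf{ob}$$

Let $\mathsf{dobcc} = \mathsf{data} \cup \mathsf{ctrl}; [\mathsf{St}] \cup \mathsf{addr} \cup \mathsf{addr}; \mathsf{po}; [\mathsf{St}]$ and $\mathsf{ndobcc} = \mathsf{ctrl}; [\mathsf{Ld}] \cup \mathsf{addr}; \mathsf{po}; [\mathsf{Ld}]$. Therefore $\mathsf{cc} = \mathsf{dobcc} \cup \mathsf{ndobcc}$.

Lemma 12. $\mathsf{cc}_0^+ = \mathsf{dobcc} \cup \mathsf{ndobcc}$

Proof.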

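From the definition, $\mathsf{cc}_0^+ = (\mathsf{dobcc} \cup \mathsf{ndobcc})^+$. Consider the following cases:

• $\mathsf{dobcc}; \mathsf{dobcc} \implies \mathsf{addr}; \mathsf{addr} \implies \mathsf{addr}; \mathsf{po}; [\mathsf{St}] \cup \mathsf{addr}; \mathsf{po}; [\mathsf{Ld}] \implies \mathsf{dobcc} \cup \mathsf{ndobcc}$

• $\mathsf{dobcc}; \mathsf{ndobcc} \implies \mathsf{addr}; (\mathsf{ctrl}; [\mathsf{Ld}] \cup \mathsf{addr}; \mathsf{po}; [\mathsf{Ld}]) \implies \mathsf{addr}; \mathsf{po}; [\mathsf{Ld}] \implies \mathsf{ndobcc}$

• $\mathsf{ndobcc}; \mathsf{dobcc} \implies (\mathsf{ctrl}; [\mathsf{Ld}] \cup \mathsf{addr}; \mathsf{po}; [\mathsf{Ld}]); \mathsf{dobcc}; [\mathsf{Ld} \cup \mathsf{St}] \implies \mathsf{ctrl}; [\mathsf{St}] \cup \mathsf{addr}; \mathsf{po}; [\mathsf{St}] \cup \mathsf{ctrl}; [\mathsf{Ld}] \cup \mathsf{addr}; \mathsf{po}; [\mathsf{Ld}] \implies \mathsf{dobcc} \cup \mathsf{ndobcc}$

• $\mathsf{ndobcc}; \mathsf{ndobcc} \implies (\mathsf{ctrl}; [\mathsf{Ld}] \cup \mathsf{addr}; \mathsf{po}; [\mathsf{Ld}]); \mathsf{ndobcc}; [\mathsf{Ld}] \implies \mathsf{ndobcc}$

Therefore $\mathsf{cc}_0^+ = \mathsf{dobcc} \cup \mathsf{ndobcc}$.

Now we restate Lemma 4 and then prove the same.

Lemma 4.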

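Suppose $X_s$ is an ARMv7 consistent execution and $X_t$ is the corresponding ARMv8 execution. In that case $X_s.\mathsf{ppo} \implies X_t.\mathsf{ob}$.

Proof. From the definition of $\mathsf{ppo}$ and Lemma 12:

$$\begin{aligned}
& [\mathsf{Ld}]; X_s.\mathsf{ppo} \\
\implies\ & [\mathsf{Ld}]; (X_s.\mathsf{ii} \cup X_s.\mathsf{ci} \cup X_s.\mathsf{dobcc}; \mathsf{ci}_0^? \cup X_s.\mathsf{ndobcc}; X_s.\mathsf{ci})^+ \\
\implies\ & [\mathsf{Ld}]; (X_s.\mathsf{ii} \cup X_s.\mathsf{ci} \cup X_s.\mathsf{dobcc}; X_s.\mathsf{ci} \cup X_s.\mathsf{ndobcc}; X_s.\mathsf{ci})^+ \\
\implies\ & [\mathsf{Ld}]; (X_s.\mathsf{addr} \cup X_s.\mathsf{data} \cup X_s.\mathsf{rdw} \cup X_t.\mathsf{ob} \cup X_s.\mathsf{ctrl_{ISB}} \cup X_s.\mathsf{detour} \cup X_s.\mathsf{dobcc}; \mathsf{ci}_0^? \cup X_s.\mathsf{ndobcc}; X_s.\mathsf{ci})^+ \\
\implies\ & [\mathsf{Ld}]; (X_t.\mathsf{ob} \cup X_s.\mathsf{ndobcc}; X_s.\mathsf{ci})^+
\end{aligned}$$

The third step reduces the $\mathsf{rfi}$ edges following Lemma 11. The last step holds as:

• $X_s.\mathsf{addr} \cup X_s.\mathsf{data} \cup X_s.\mathsf{ctrl_{ISB}} \subseteq X_t.\mathsf{dob} \subseteq X_t.\mathsf{ob}$

• $X_s.\mathsf{rdw} = X_s.\mathsf{fre}; X_s.\mathsf{rfe} \subseteq X_t.\mathsf{obs}; X_t.\mathsf{obs} \subseteq X_t.\mathsf{ob}$

• $X_s.\mathsf{detour} = X_s.\mathsf{coe}; X_s.\mathsf{rfe} \subseteq X_t.\mathsf{obs}; X_t.\mathsf{obs} \subseteq X_t.\mathsf{ob}$

• $X_s.\mathsf{dobcc} = (X_s.\mathsf{data} \cup X_s.\mathsf{ctrl}; [\mathsf{St}] \cup X_s.\mathsf{addr} \cup X_s.\mathsf{addr}; X_s.\mathsf{po}; [\mathsf{St}]) \subseteq X_t.\mathsf{ob}$

Now, from the definition, $X_s.\mathsf{ndobcc}; X_s.\mathsf{ci} = (X_s.\mathsf{ctrl}; [\mathsf{Ld}] \cup X_s.\mathsf{addr}; X_s.\mathsf{po}; [\mathsf{Ld}]); (X_s.\mathsf{ctrl_{ISB}} \cup X_s.\mathsf{detour})$

$$\begin{aligned}
\implies\ & X_s.\mathsf{ctrl}; [\mathsf{Ld}]; X_s.\mathsf{ctrl_{ISB}} \cup X_s.\mathsf{addr}; X_s.\mathsf{po}; [\mathsf{Ld}]; X_s.\mathsf{ctrl_{ISB}} \quad \text{as } \mathrm{dom}(X_s.\mathsf{detour}) \not\subseteq \mathsf{Ld} \\
\implies\ & X_t.\mathsf{dob} \cup X_t.\mathsf{dob} \\
\implies\ & X_t.\mathsf{ob}
\end{aligned}$$

Hence $[\mathsf{Ld}]; (X_t.\mathsf{ob} \cup X_s.\mathsf{ndobcc}; X_s.\mathsf{ci})^+ \implies X_t.\mathsf{ob}$.

Therefore $X_s.\mathsf{ppo} \implies X_t.\mathsf{ob}$.

Lemma 13.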

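Suppose $X_s$ is an ARMv7 consistent execution and $X_t$ is the corresponding ARMv8 execution. In that case (i) $X_s.\mathsf{ahb} \implies X_t.\mathsf{ob}$ and (ii) $X_s.\mathsf{prop} \implies X_t.\mathsf{ob}$.

Proof. (i) $X_s.\mathsf{ahb} \implies X_s.\mathsf{ppo} \cup X_s.\mathsf{fence} \cup X_s.\mathsf{rfe} \implies X_t.\mathsf{ob} \cup X_t.\mathsf{bob} \cup X_t.\mathsf{obs}$ from Lemma 9, Lemma 10, and Lemma 4.

(ii) We know $X_s.\mathsf{prop} = X_s.\mathsf{prop}_1 \cup X_s.\mathsf{prop}_2$ from the definition.

Now, $X_s.\mathsf{prop}_1$

$$\begin{aligned}
\implies\ & [X_t.\mathsf{W}]; X_t.\mathsf{rfe}^?; X_t.\mathsf{fence}; X_t.\mathsf{ahb}^*; [X_t.\mathsf{W}] \\
\implies\ & X_t.\mathsf{obs}; X_t.\mathsf{bob}; (X_t.\mathsf{dob} \cup X_t.\mathsf{bob} \cup X_t.\mathsf{obs}); [X_t.\mathsf{W}] \\
\implies\ & X_t.\mathsf{ob}
\end{aligned}$$

Also, $X_s.\mathsf{prop}_2$

$$\begin{aligned}
\implies\ & ((X_t.\mathsf{co} \cup X_t.\mathsf{fr}) \setminus X_t.\mathsf{po})^?; X_t.\mathsf{rfe}^?; (X_t.\mathsf{fence}; X_t.\mathsf{ahb}^*)^?; X_t.\mathsf{fence}; X_t.\mathsf{ahb}^* \\
\implies\ & ((X_t.\mathsf{coi} \cup X_t.\mathsf{coe} \cup X_t.\mathsf{fri} \cup X_t.\mathsf{fre}) \setminus X_t.\mathsf{po})^?; X_t.\mathsf{rfe}^?; (X_t.\mathsf{fence}; X_t.\mathsf{ahb}^*)^?; X_t.\mathsf{fence}; X_t.\mathsf{ahb}^* \\
\implies\ & ((X_t.\mathsf{coe} \cup X_t.\mathsf{fre}) \setminus X_t.\mathsf{po})^?; X_t.\mathsf{rfe}^?; (X_t.\mathsf{fence}; X_t.\mathsf{ahb}^*)^?; X_t.\mathsf{fence}; X_t.\mathsf{ahb}^* \\
\implies\ & X_t.\mathsf{obs}; (X_t.\mathsf{fence}; X_t.\mathsf{ahb}^*)^?; X_t.\mathsf{fence}; X_t.\mathsf{ahb}^* \\
\implies\ & X_t.\mathsf{obs}; (X_t.\mathsf{bob}; (X_t.\mathsf{dob} \cup X_t.\mathsf{bob} \cup X_t.\mathsf{obs})^*)^?; X_t.\mathsf{bob}; (X_t.\mathsf{dob} \cup X_t.\mathsf{bob} \cup X_t.\mathsf{obs})^* \\
\implies\ & X_t.\mathsf{ob}
\end{aligned}$$

Hence $X_s.\mathsf{prop} \implies X_t.\mathsf{ob}$.

Now we prove Theorem 4.

Proof.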

We show that $X_s$ is ARMv7 consistent by contradiction.

(total-mo), (sc-per-loc), and (atomicity) hold on $X_s$ as they hold on $X_t$. It remains to show that (observation) and (propagation) hold on $X_s$.

(observation)

Assume there is an $X_s.\mathsf{fre}; X_s.\mathsf{prop}; X_s.\mathsf{ahb}^*$ cycle. Considering the relations above,

$$X_s.\mathsf{fre}; X_s.\mathsf{prop}; X_s.\mathsf{ahb}^* \implies X_t.\mathsf{obs}; X_t.\mathsf{ob}; (X_t.\mathsf{dob} \cup X_t.\mathsf{bob} \cup X_t.\mathsf{obs})^* \implies X_t.\mathsf{ob}$$

However, we know that $X_t.\mathsf{ob}$ is irreflexive, and hence a contradiction. Therefore $X_s.\mathsf{fre}; X_s.\mathsf{prop}; X_s.\mathsf{ahb}^*$ is irreflexive and $X_s$ satisfies (observation).

(propagation)

Assume there is an $X_s.\mathsf{co} \cup X_s.\mathsf{prop}$ cycle. It implies an $X_t.\mathsf{co} \cup X_t.\mathsf{ob}$ cycle. We know $X_t.\mathsf{co}; X_t.\mathsf{co} \implies X_t.\mathsf{co}$ and $X_t.\mathsf{prop}; X_t.\mathsf{prop} \implies X_t.\mathsf{prop}$. Thus an $X_t.\mathsf{co} \cup X_t.\mathsf{ob}$ cycle can be reduced to an $X_t.\mathsf{co} \cup X_t.\mathsf{ob}$ cycle where $X_t.\mathsf{co}$ and $X_t.\mathsf{prop}$ take place alternately. In this case each of $X_t.\mathsf{prop} \subseteq (X_t.\mathsf{W} \times X_t.\mathsf{W})|_{\mathsf{loc}} \subseteq X_t.\mathsf{co}$. It implies there is an $X_t.\mathsf{co}$ cycle, which is a contradiction. Hence $X_s.\mathsf{co} \cup X_s.\mathsf{prop}$ is acyclic and $X_s$ satisfies (propagation).

Therefore $X_s$ is ARMv7 consistent. Moreover, $\mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$ holds as $X_t.\mathsf{co} \iff X_s.\mathsf{co}$.

A.6 ARMv8 to ARMv7 Mappings

We restate Lemma 5 and then prove the same.

Lemma 5.

Suppose $X_t$ is an ARMv7 consistent execution and $X_s$ is the ARMv8 execution following the mappings in Fig. 13a. In this case $X_s.\mathsf{ob} \implies (X_t.\mathsf{rfe} \cup X_t.\mathsf{coe} \cup X_t.\mathsf{fre} \cup X_t.\mathsf{rmw} \cup X_t.\mathsf{fence})^+$.

Proof.

(1) $X_s.\mathsf{obs} \implies X_s.\mathsf{rfe} \cup X_s.\mathsf{coe} \cup X_s.\mathsf{fre} \implies X_t.\mathsf{rfe} \cup X_t.\mathsf{coe} \cup X_t.\mathsf{fre}$ from the definition.

(2) We know $X_s.\mathsf{dob} \subseteq [X_s.\mathsf{Ld}]; X_s.\mathsf{po}; [X_s.\mathsf{E}]$.

$$\begin{aligned}
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{E}] \quad \text{following the mappings in Fig. 13a} \\
\implies\ & [X_t.\mathsf{Ld}]; X_t.\mathsf{fence}; [X_t.\mathsf{E}] \quad \text{from the definition}
\end{aligned}$$

(3) We know $\mathsf{aob} \triangleq \mathsf{rmw} \cup [\mathrm{range}(\mathsf{rmw})]; \mathsf{rfi}; [\mathsf{A}]$. Hence

$$X_s.\mathsf{rmw} \cup [\mathrm{range}(X_s.\mathsf{rmw})]; X_s.\mathsf{rfi}; [X_s.\mathsf{A} \cup X_s.\mathsf{Q}] \implies X_t.\mathsf{rmw} \cup [\mathrm{range}(X_t.\mathsf{rmw})]; X_t.\mathsf{rfi}; [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F}]$$

(4) Following the definition of $X_s.\mathsf{bob}$, we consider its components:

• $X_s.\mathsf{po}; [X_s.\mathsf{F}]; X_s.\mathsf{po} \implies X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po} \implies X_t.\mathsf{fence}$

• $[X_s.\mathsf{STLR}]; X_s.\mathsf{po}; [X_s.\mathsf{LDAR}] \implies [X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{St}]; X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{Ld}] \implies X_t.\mathsf{fence}$

• $[X_s.\mathsf{Ld}]; X_s.\mathsf{po}; [X_s.\mathsf{F}]; X_s.\mathsf{po} \implies X_t.\mathsf{fence}$

• $[X_s.\mathsf{LDAR}]; X_s.\mathsf{po} \implies [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po} \implies X_t.\mathsf{fence}$

• $[X_s.\mathsf{St}]; X_s.\mathsf{po}; [X_s.\mathsf{F_{ST}}]; X_s.\mathsf{po}; [X_s.\mathsf{St}] \implies [X_t.\mathsf{St}]; X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{St}] \implies X_t.\mathsf{fence}$

• $X_s.\mathsf{po}; [X_s.\mathsf{STLR}] \implies X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{St}] \implies X_t.\mathsf{fence}$

• $X_s.\mathsf{po}; [X_s.\mathsf{STLR}]; X_s.\mathsf{coi} \implies X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{St}]; X_t.\mathsf{po} \implies X_t.\mathsf{fence}$

Thus $X_s.\mathsf{bob} \implies X_t.\mathsf{fence}$.

Therefore

$$X_s.\mathsf{ob} \implies (X_t.\mathsf{rfe} \cup X_t.\mathsf{coe} \cup X_t.\mathsf{fre} \cup X_t.\mathsf{fence} \cup X_t.\mathsf{rmw} \cup [\mathrm{range}(X_t.\mathsf{rmw})]; X_t.\mathsf{rfi}; [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F}])^+$$

Considering the outgoing edges from the $\mathsf{Ld}$ event in $[\mathrm{range}(X_t.\mathsf{rmw})]; X_t.\mathsf{rfi}; [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F}]$, we consider two cases:

Case $[\mathrm{range}(X_t.\mathsf{rmw})]; X_t.\mathsf{rfi}; [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po} \implies X_t.\mathsf{fence}$

Case $[\mathrm{range}(X_t.\mathsf{rmw})]; X_t.\mathsf{rfi}; [X_t.\mathsf{Ld}]; X_t.\mathsf{fre} \implies [\mathrm{range}(X_t.\mathsf{rmw})]; X_t.\mathsf{coe}$ by the definition of $\mathsf{fre}$.

Therefore $X_s.\mathsf{ob} \implies (X_t.\mathsf{rfe} \cup X_t.\mathsf{coe} \cup X_t.\mathsf{fre} \cup X_t.\mathsf{fence} \cup X_t.\mathsf{rmw})^+$.

We restate Lemma 6 and then prove the same.

Lemma 6.

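Suppose $X_t$ is an ARMv7 consistent execution and $X_s$ is the ARMv8 execution following the mappings in Fig. 13a. In this case either $X_s.\mathsf{ob} \implies ((X_t.\mathsf{E} \times X_t.\mathsf{E})|_{\mathsf{loc}} \setminus [\mathsf{E}])$ or $X_s.\mathsf{ob} \implies (X_t.\mathsf{co}; X_t.\mathsf{prop} \cup X_t.\mathsf{prop})^+$.

Proof. We know $X_s.\mathsf{ob} \implies (X_t.\mathsf{rfe} \cup X_t.\mathsf{coe} \cup X_t.\mathsf{fre} \cup X_t.\mathsf{rmw} \cup X_t.\mathsf{fence})^+$ from Lemma 5.

Scenario (1): $(X_t.\mathsf{rfe} \cup X_t.\mathsf{coe} \cup X_t.\mathsf{fre} \cup X_t.\mathsf{rmw} \cup X_t.\mathsf{fence})^+$ has no $X_t.\mathsf{fence}$. In this case $(X_t.\mathsf{rfe} \cup X_t.\mathsf{coe} \cup X_t.\mathsf{fre} \cup X_t.\mathsf{rmw})^+ \implies (X_t.\mathsf{E} \times X_t.\mathsf{E})|_{\mathsf{loc}} \setminus [\mathsf{E}]$ from the definitions.

Scenario (2): Otherwise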

In this case X s . ob = ⇒ (( X t . rfe ∪ X t . coe ∪ X t . fre ∪ X t . rmw ) ∗ ; X t . fence ) + Now we consider following cases:(RR) [ X t . Ld ]; ( X t . rfe ∪ X t . coe ∪ X t . fre ∪ X t . rmw ) + ; [ X t . Ld ]; X t . fence (RW) [ X t . Ld ]; ( X t . rfe ∪ X t . coe ∪ X t . fre ∪ X t . rmw ) + ; [ X t . St ]; X t . fence (WR) [ X t . St ]; ( X t . rfe ∪ X t . coe ∪ X t . fre ∪ X t . rmw ) + ; [ X t . Ld ]; X t . fence (WW) [ X t . St ]; ( X t . rfe ∪ X t . coe ∪ X t . fre ∪ X t . rmw ) + ; [ X t . St ]; X t . fence Case (RR): [ X t . Ld ]; ( X t . rfe ∪ X t . coe ∪ X t . fre ∪ X t . rmw ) + ; [ X t . Ld ]; X t . fence = ⇒ [ X t . Ld ]; ( X t . rfe ∪ X t . coe ∪ X t . fre ∪ X t . rmw ) + ; [ X t . St ]; X t . rfe ; [ X t . Ld ]; X t . fence = ⇒ [ X t . Ld ]; X t . fr ; [ X t . St ]; X t . rfe ; [ X t . Ld ]; X t . fence as X t satisﬁes (sc-per-loc). = ⇒ [ X t . Ld ]; X t . fri ; [ X t . St ]; X t . rfe ; [ X t . Ld ]; X t . fence ∪ [ X t . Ld ]; X t . fre ; [ X t . St ]; X t . rfe ; [ X t . Ld ]; X t . fence = ⇒ [ X t . Ld ]; ( X t . rmw ∪ X t . fence ); [ X t . St ]; X t . rfe ; [ X t . Ld ]; X t . fence ∪ X t . prop following the mapping of Fig. 13a and deﬁnition of prop . = ⇒ [ X t . Ld ]; X t . rmw ; [ X t . St ]; X t . rfe ; [ X t . Ld ]; X t . fence ∪ [ X t . Ld ]; X t . fence ; [ X t . St ]; X t . rfe ; [ X t . Ld ]; X t . fence ∪ X t . prop = ⇒ [ X t . Ld ]; X t . ppo ; [ X t . St ]; X t . rfe ; [ X t . Ld ]; X t . fence ∪ [ X t . Ld ]; X t . fence ; [ X t . St ]; X t . rfe ; [ X t . Ld ]; X t . fence ∪ X t . prop as X t . rmw = ⇒ X t . ppo . = ⇒ [ X t . Ld ]; X t . ahb ; X t . fence ∪ [ X t . Ld ]; X t . fence ; [ X t . St ]; X t . ahb ; [ X t . Ld ]; X t . fence ∪ X t . prop from deﬁnition of prop . = ⇒ X t . prop ∪ prop ∪ prop = ⇒ X t . prop Case (RW):

41n Architecture to Architecture Mapping for Concurrency [ X t . Ld ]; ( X t . rfe ∪ X t . coe ∪ X t . fre ∪ X t . rmw ) + ; [ X t . St ]; X t . fence = ⇒ [ X t . Ld ]; X t . fr ; X t . fence = ⇒ [ X t . Ld ]; X t . fri ; X t . fence ∪ [ X t . Ld ]; X t . fre ; X t . fence = ⇒ [ X t . Ld ]; X t . fence ∪ [ X t . Ld ]; X t . fre ; X t . fence = ⇒ X t . prop ∪ X t . prop from deﬁnition of prop . = ⇒ X t . prop Case (WR): [ X t . St ]; ( X t . rfe ∪ X t . coe ∪ X t . fre ∪ X t . rmw ) + ; [ X t . Ld ]; X t . fence = ⇒ [ X t . St ]; ( X t . rfe ∪ X t . coe ∪ X t . fre ∪ X t . rmw ) ∗ ; [ X t . St ]; X t . rfe ; [ X t . Ld ]; X t . fence = ⇒ [ X t . St ]; X t . co ; [ X t . St ]; X t . rfe ; [ X t . Ld ]; X t . fence as X t satisﬁes (sc-per-loc). = ⇒ X t . co ; X t . prop from deﬁnition. = ⇒ X t . co ; X t . prop as prop ⊆ prop = ⇒ [ X t . St ]; X t . coi ; [ X t . St ]; X t . rfe ; [ X t . Ld ]; X t . fence ∪ [ X t . St ]; X t . coe ; [ X t . St ]; X t . rfe ; [ X t . Ld ]; X t . fence = ⇒ [ X t . St ]; X t . coi ; [ X t . St ]; X t . rfe ; [ X t . Ld ]; X t . fence ∪ X t . prop from deﬁnitions Case (WW): [ X t . St ]; ( X t . rfe ∪ X t . coe ∪ X t . fre ∪ X t . rmw ); [ X t . St ]; X t . fence = ⇒ [ X t . St ]; X t . co ; [ X t . St ]; X t . fence = ⇒ [ X t . St ]; X t . coi ; [ X t . St ]; X t . fence ∪ [ X t . St ]; X t . coe ; [ X t . St ]; X t . fence = ⇒ X t . fence ∪ [ X t . St ]; X t . coe ; [ X t . St ]; X t . fence = ⇒ X t . prop ∪ X t . prop from deﬁnition of prop . = ⇒ X t . prop Thus (in

Scenario (2)) $X_s.\mathsf{ob} \implies (X_t.\mathsf{co}; X_t.\mathsf{prop} \cup X_t.\mathsf{prop})^+$.

Finally, we restate Theorem 5 and then prove it.

Theorem 5.

The mappings in Fig. 13a are correct.
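The proof of this theorem, like most proofs in this appendix, reduces a consistency axiom to the acyclicity of a union of relations over events. The following is a minimal sketch of that kind of check, assuming relations are represented as Python sets of event pairs. The helper names (`compose`, `transitive_closure`, `is_acyclic`) and the four-event example are hypothetical illustrations and are not taken from the paper or its tooling.

```python
def compose(r, s):
    """Relational composition r ; s over sets of (event, event) pairs."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def transitive_closure(r):
    """Smallest transitive relation containing r (written r^+ in the proofs)."""
    closure = set(r)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


def is_acyclic(r):
    """A relation is acyclic when its transitive closure is irreflexive."""
    return all(a != b for (a, b) in transitive_closure(r))


# Hypothetical store-buffering-shaped execution (names are illustrative only):
#   thread 1: a = St(x); c = Ld(y)      thread 2: b = St(y); d = Ld(x)
fence = {("a", "c"), ("b", "d")}  # assume a fence orders the accesses in each thread
fre = {("c", "b"), ("d", "a")}    # assume each load reads the initial value

# With the fence edges the union has the cycle a -> c -> b -> d -> a,
# so an execution of this shape would be rejected by an acyclicity axiom.
ordered = fence | fre
```

Here `is_acyclic(fre)` holds while `is_acyclic(ordered)` does not, which mirrors how the proofs rule out an execution by exhibiting a cycle in the relevant relation.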

To prove Theorem 5, we prove the following formal statement.

$P_{\mathsf{ARMv8}} \leadsto P_{\mathsf{ARMv7}} \implies \forall X_t \in [\![P_{\mathsf{ARMv7}}]\!].\ \exists X_s \in [\![P_{\mathsf{ARMv8}}]\!].\ \mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$

Proof.

We know that X t is ARMv7 consistent. Now we show that X s is ARMv8 consistent. We prove by contradiction. Case (internal) :

We know that (sc-per-loc) holds in X t . Hence (internal) trivially holds in X s . Case (external):

Assume there is an $X_s.\mathsf{ob}$ cycle. From Lemma 6 we know that $X_s.\mathsf{ob} \implies ((X_t.\mathsf{E} \times X_t.\mathsf{E})|_{\mathsf{loc}} \setminus [\mathsf{E}]) \cup (X_t.\mathsf{co}; X_t.\mathsf{prop} \cup X_t.\mathsf{prop})^+$. We know that $((X_t.\mathsf{E} \times X_t.\mathsf{E})|_{\mathsf{loc}} \setminus [\mathsf{E}])$ is acyclic as $X_t$ satisfies (sc-per-loc), and that $(X_t.\mathsf{co}; X_t.\mathsf{prop} \cup X_t.\mathsf{prop})^+$ is acyclic as $X_t$ satisfies (propagation). Hence a contradiction, and (external) holds in $X_s$.

Case (atomic): We know that (atomic) holds in $X_t$. Hence (atomic) trivially holds in $X_s$.

Therefore $X_s$ is consistent. Moreover, as $X_s.\mathsf{co} \iff X_t.\mathsf{co}$ holds, $\mathsf{Behavior}(X_s) = \mathsf{Behavior}(X_t)$ also holds.

A.7 Proof of correctness: C11 to ARMv8 to ARMv7

We restate the theorem and then prove the correctness.

Theorem 6.

The mapping scheme in Fig. 13b is correct.

Proof.

The mapping can be represented as a combination of the following transformation steps.
1. $P_{\mathsf{C11}} \mapsto P_{\mathsf{ARMv8}}$ mapping from map.
2. $P_{\mathsf{ARMv8}} \mapsto P_{\mathsf{ARMv7}}$ mappings from Fig. 13a.
3. Elimination of leading DMB fences for the $\mathsf{LDR_{NA}} \mapsto \mathsf{LDR}$ mapping, that is, $\mathsf{LDR_{NA}} \mapsto \mathsf{LDR}; \mathsf{DMB} \leadsto \mathsf{LDR}$.

We know (1) and (2) are sound, and therefore it suffices to show that transformation (3) is sound.

Let $X_a$ and $X'_a$ be the consistent executions of $P_{\mathsf{ARMv8}}$ before and after transformation (3). Let $X$ be the corresponding C11 execution of $P_{\mathsf{C11}}$; we know $P_{\mathsf{C11}}$ is race-free. Therefore, for every non-atomic event $a$ in $X$, if there exists another same-location event $b$ then $X.\mathsf{hb}^{=}(a, b)$ holds.

Now we consider the ARMv8 to ARMv7 mapping scheme. Considering the $\mathsf{hb}$ definition, the following are the possibilities:

Case $[\mathsf{E_{NA}}]; X.\mathsf{po}; [\mathsf{F}_{\sqsupseteq \mathsf{REL}}]; X.\mathsf{po}; [\mathsf{W_{RLX}}] \cup [\mathsf{E_{NA}}]; X.\mathsf{po}; [\mathsf{W}_{\sqsupseteq \mathsf{REL}}]$:
$\implies [\mathsf{E}]; X_a.\mathsf{po}; [\mathsf{F}]; X_a.\mathsf{po}; [\mathsf{St} \cup \mathsf{rmw}]$
$\implies [\mathsf{E}]; X'_a.\mathsf{po}; [\mathsf{F}]; X'_a.\mathsf{po}; [\mathsf{E}]$
$\implies [\mathsf{E}]; X'_a.\mathsf{fence}; [\mathsf{E}]$

Case $[\mathsf{R}_{\sqsupseteq \mathsf{RLX}}]; X.\mathsf{po}; [\mathsf{E_{NA}}] \cup [\mathsf{R_{RLX}}]; X.\mathsf{po}; [\mathsf{F}_{\sqsupseteq \mathsf{ACQ}}]; X.\mathsf{po}; [\mathsf{E_{NA}}]$:
$\implies [\mathsf{Ld}]; X_a.\mathsf{po}; [\mathsf{F}]; X_a.\mathsf{po}$
$\implies [\mathsf{E}]; X'_a.\mathsf{fence}; [\mathsf{E}]$

Therefore $X'_a.\mathsf{fence} = X_a.\mathsf{fence}$ and the transformation is sound for the ARMv8 to ARMv7 mapping.

A.8 ARMv7-mca to ARMv8 Mappings

In Appendix A.5 we have already shown all the relevant consistency constraints. It remains to show that (mca) holds for the ARMv7-mca to ARMv8 mappings.

We restate Lemma 7 and then prove it.

Lemma 7.

43n Architecture to Architecture Mapping for Concurrency = ⇒ [ X t . Ld ]; X t . ob ? ; [ X t . E ]; X t . dob ; [ X t . St ] from Lemma 4. = ⇒ [ X t . Ld ]; X t . ob ; [ X t . St ] Case [ X s . Ld ]; X s . ppo ? ; [ X s . E ]; X s . rdw ; [ X s . Ld ]; X s . po | loc ; [ X s . St ] : It implies [ X s . Ld ]; X s . ppo ? ; [ X s . E ]; X s . coe ; X s . rfe ; [ X s . Ld ]; X s . po | loc ; [ X s . St ]= ⇒ [ X s . Ld ]; X s . ppo ? ; [ X s . E ]; X s . coe ; X s . coe ; [ X s . St ]= ⇒ [ X t . Ld ]; X t . ob ? ; [ X t . E ]; X t . obs ; X t . obs ; [ X t . St ] from Lemma 4. = ⇒ [ X t . Ld ]; X t . ob ; [ X t . St ] Case [ X s . Ld ]; X s . ppo ; [ X s . St ]; X s . rﬁ ; [ X s . Ld ]; X s . po | loc ; [ X s . St ] : It implies [ X s . Ld ]; X s . ppo ? ; [ X s . Ld ]; ( X s . ctrl ∪ X s . data ∪ X s . addr ); [ X s . St ]; X s . coi ; [ X s . St ] as X s satisﬁes (sc-per-loc). = ⇒ [ X s . Ld ]; X s . ppo ? ; [ X s . Ld ]; ( X s . ctrl ∪ X s . data ); [ X s . St ]; X s . coi ; [ X s . St ] ∪ [ X s . Ld ]; X s . ppo ? ; [ X s . Ld ]; X s . addr ; X s . po ; [ X s . St ]= ⇒ [ X t . Ld ]; X t . ob ? ; [ X t . Ld ]; X t . dob ; [ X t . St ] ∪ [ X t . Ld ]; X t . ob ? ; [ X t . Ld ]; X t . dob ; [ X t . St ] from Lemma 4. = ⇒ [ X t . Ld ]; X t . ob ; [ X t . St ] Case [ X s . Ld ]; X s . ppo ? ; [ X s . Ld ]; X s . ctrl ISB ; [ X s . Ld ]; X s . po | loc ; [ X s . St ] : It implies [ X s . Ld ]; X s . ppo ? ; [ X s . Ld ]; X s . ctrl ; [ X s . St ] as ctrl ISB ; po ⊆ ctrl ISB and ctrl

Suppose $X_t$ is a target ARMv8 consistent execution and $X_s$ is the corresponding ARMv7 consistent execution. In this case $X_s.\mathsf{wo}^+$ is acyclic.

Proof. Following the definition of $X_s.\mathsf{wo}$:

$X_s.\mathsf{wo} \triangleq ((X_s.\mathsf{rfe}; X_s.\mathsf{ppo}; X_s.\mathsf{rfe}^{-1}) \setminus [X_s.\mathsf{E}]); X_s.\mathsf{co}$

It implies $X_s.\mathsf{rfe}; X_s.\mathsf{ppo}; [X_s.\mathsf{Ld}]; X_s.\mathsf{fri}; [X_s.\mathsf{St}] \cup X_s.\mathsf{rfe}; X_s.\mathsf{ppo}; [X_s.\mathsf{Ld}]; X_s.\mathsf{fre}; [X_s.\mathsf{St}]$
$\implies X_s.\mathsf{rfe}; [X_s.\mathsf{Ld}]; X_s.\mathsf{ppo}; [X_s.\mathsf{Ld}]; X_s.\mathsf{po}|_{\mathsf{loc}}; [X_s.\mathsf{St}] \cup X_s.\mathsf{rfe}; X_s.\mathsf{ppo}; X_s.\mathsf{fre}$ from the definitions
$\implies X_t.\mathsf{rfe}; [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_s.\mathsf{St}] \cup X_t.\mathsf{rfe}; X_t.\mathsf{ob}; X_t.\mathsf{fre}$ from Lemma 7
$\implies X_t.\mathsf{obs}; [X_t.\mathsf{Ld}]; X_t.\mathsf{ob}; [X_s.\mathsf{St}] \cup X_t.\mathsf{obs}; X_t.\mathsf{ob}; X_t.\mathsf{obs}$ from Lemma 9
$\implies X_t.\mathsf{ob}$.

Thus $X_s.\mathsf{wo}^+ \implies X_t.\mathsf{ob} \cup X_t.\mathsf{ob} \implies X_t.\mathsf{ob}$. We know $X_t.\mathsf{ob}$ is acyclic. Therefore $X_s.\mathsf{wo}^+$ is acyclic.

We restate Theorem 7 and then prove it.

Theorem 7.

The mappings in Fig. 12a are correct for ARMv7-mca.

We formally show

$P_{\mathsf{ARMv7\text{-}mca}} \leadsto P_{\mathsf{ARMv8}} \implies \forall X_t \in [\![P_{\mathsf{ARMv8}}]\!].\ \exists X_s \in [\![P_{\mathsf{ARMv7\text{-}mca}}]\!].\ \mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$

Proof.

Follows from Theorem 4 and Lemma 8. Moreover, $X_s.\mathsf{co} \iff X_t.\mathsf{co}$ holds. Therefore $\mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$ also holds.

B Proofs and counter-examples for Optimizations in ARMv8

B.1 Proofs of Safe reorderings

We prove the following theorem for safe reorderings in Fig. 14.

$P_{\mathsf{src}} \leadsto P_{\mathsf{tgt}} \implies \forall X_t \in [\![P_{\mathsf{tgt}}]\!]\ \exists X_s \in [\![P_{\mathsf{src}}]\!].\ \mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$

Proof.

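We know $X_t$ is ARMv8 consistent. We define $X_s$ where $a \cdot b \leadsto b \cdot a$.

$X_s.\mathsf{E} = X_t.\mathsf{E}$
$X_s.\mathsf{po} = (X_t.\mathsf{po} \setminus \{(b, a)\} \cup \{(a, b)\})^+$
$X_s.\mathsf{rf} = X_t.\mathsf{rf}$
$X_s.\mathsf{co} = X_t.\mathsf{co}$

We show that $X_s$ is ARMv8 consistent.

(internal) We know that $X_t.\mathsf{po}|_{\mathsf{loc}} = X_s.\mathsf{po}|_{\mathsf{loc}}$, $X_s.\mathsf{rf} = X_t.\mathsf{rf}$, $X_s.\mathsf{fr} = X_t.\mathsf{fr}$, and $X_s.\mathsf{co} = X_t.\mathsf{co}$ hold. We also know that $X_t$ satisfies (internal). Therefore $X_s$ also satisfies (internal).

(external) We relate the $\mathsf{ob}$ relations between memory accesses in $X_t$ and $X_s$. Let $\mathsf{M} = \mathsf{St} \cup \mathsf{Ld} \cup \mathsf{L} \cup \mathsf{A}$.
• $\mathsf{St}(x)/\mathsf{L}(x) \cdot \mathsf{Ld}(y) \leadsto \mathsf{Ld}(y) \cdot \mathsf{St}(x)/\mathsf{L}(x)$. In this case $X_s.\mathsf{aob} = X_t.\mathsf{aob}$, $X_s.\mathsf{bob} \subseteq X_t.\mathsf{bob}$, and $X_s.\mathsf{dob} = X_t.\mathsf{dob}$ hold.
• $\mathsf{Ld}(x) \cdot \mathsf{Ld}(y) \leadsto \mathsf{Ld}(y) \cdot \mathsf{Ld}(x)$. In this case $X_s.\mathsf{aob} = X_t.\mathsf{aob}$, $X_s.\mathsf{bob} = X_t.\mathsf{bob}$, and $X_s.\mathsf{dob} = X_t.\mathsf{dob}$ hold.
• $\mathsf{F_{ST}} \cdot \mathsf{Ld}(y) \leadsto \mathsf{Ld}(y) \cdot \mathsf{F_{ST}}$. In this case $X_s.\mathsf{aob} = X_t.\mathsf{aob}$, $X_s.\mathsf{bob} = X_t.\mathsf{bob}$, and $X_s.\mathsf{dob} = X_t.\mathsf{dob}$ hold.
• $\mathsf{St}(x)/\mathsf{Ld}(x)/\mathsf{F_{ST}} \cdot \mathsf{A}(y) \leadsto \mathsf{A}(y) \cdot \mathsf{St}(x)/\mathsf{Ld}(x)/\mathsf{F_{ST}}$. In this case $X_s.\mathsf{aob} = X_t.\mathsf{aob}$, $X_s.\mathsf{bob} \subseteq X_t.\mathsf{bob}$, and $X_s.\mathsf{dob} = X_t.\mathsf{dob}$ hold.
• $\mathsf{F_{LD}}/\mathsf{F_{ST}}/\mathsf{F} \cdot \mathsf{L}(y) \leadsto \mathsf{L}(y) \cdot \mathsf{F_{LD}}/\mathsf{F_{ST}}/\mathsf{F}$. In this case $X_s.\mathsf{aob} = X_t.\mathsf{aob}$ and $X_s.\mathsf{dob} = X_t.\mathsf{dob}$ hold. We also know that $[\mathsf{M}]; X_s.\mathsf{bob}; [X_s.\mathsf{L}] = [\mathsf{M}]; X_t.\mathsf{bob}; [X_t.\mathsf{L}]$ and $[\mathsf{L}]; X_s.\mathsf{bob}; [\mathsf{M}] \subseteq [\mathsf{L}]; X_t.\mathsf{bob}; [\mathsf{M}]$ hold.
• $\mathsf{A}(x)/\mathsf{F_{LD}}/\mathsf{F_{ST}} \cdot \mathsf{F} \leadsto \mathsf{F} \cdot \mathsf{A}(x)/\mathsf{F_{LD}}/\mathsf{F_{ST}}$. In this case $X_s.\mathsf{aob} = X_t.\mathsf{aob}$, $X_s.\mathsf{bob}; [\mathsf{M}] \subseteq X_t.\mathsf{bob}; [\mathsf{M}]$, $[\mathsf{M}]; X_s.\mathsf{bob} = [\mathsf{M}]; X_t.\mathsf{bob}$, and $X_s.\mathsf{dob} = X_t.\mathsf{dob}$ hold.
• $\mathsf{St}/\mathsf{L}/\mathsf{A}/\mathsf{F} \cdot \mathsf{F_{LD}} \leadsto \mathsf{F_{LD}} \cdot \mathsf{St}/\mathsf{L}/\mathsf{A}/\mathsf{F}$. In this case $X_s.\mathsf{aob} = X_t.\mathsf{aob}$, $[\mathsf{M}]; X_s.\mathsf{bob}; [\mathsf{M}] \subseteq [\mathsf{M}]; X_t.\mathsf{bob}; [\mathsf{M}]$, and $X_s.\mathsf{dob} = X_t.\mathsf{dob}$ hold.
• $\mathsf{F_{LD}}/\mathsf{A}/\mathsf{F} \cdot \mathsf{F_{ST}} \leadsto \mathsf{F_{ST}} \cdot \mathsf{F_{LD}}/\mathsf{A}/\mathsf{F}$. In this case $X_s.\mathsf{aob} = X_t.\mathsf{aob}$, $[\mathsf{M}]; X_s.\mathsf{bob}; [\mathsf{M}] = [\mathsf{M}]; X_t.\mathsf{bob}; [\mathsf{M}]$, and $X_s.\mathsf{dob} = X_t.\mathsf{dob}$ hold.

Hence $[\mathsf{M}]; X_s.\mathsf{obi}; [\mathsf{M}] \subseteq [\mathsf{M}]; X_t.\mathsf{obi}; [\mathsf{M}]$ holds. We also know that $X_s.\mathsf{rf} = X_t.\mathsf{rf}$ and $X_s.\mathsf{co} = X_t.\mathsf{co}$ hold, and that $\mathsf{irr}(X_t.\mathsf{ob})$ holds. Therefore $\mathsf{irr}(X_s.\mathsf{ob})$ also holds.

(atomic) We know that $X_t.\mathsf{rmw} = X_s.\mathsf{rmw}$, $X_s.\mathsf{rf} = X_t.\mathsf{rf}$, $X_s.\mathsf{fr} = X_t.\mathsf{fr}$, and $X_s.\mathsf{co} = X_t.\mathsf{co}$ hold. We also know that $X_t$ satisfies (atomic). Therefore $X_s$ also satisfies (atomic).

We already know $X_s.\mathsf{co} = X_t.\mathsf{co}$ and therefore $\mathsf{Behavior}(X_s) = \mathsf{Behavior}(X_t)$.

B.2 Safe eliminations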

We prove the following theorem for the (RAR), (RAA), and (AAA) safe eliminations in Fig. 14(a).

$P_{\mathsf{src}} \leadsto P_{\mathsf{tgt}} \implies \forall X_t \in [\![P_{\mathsf{tgt}}]\!]\ \exists X_s \in [\![P_{\mathsf{src}}]\!].\ \mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$

Proof.

We know $X_t$ is ARMv8 consistent. We define $X_s$ where $a \cdot b \leadsto a$ and
(RAR) $a = \mathsf{Ld}(X, v')$ and $b = \mathsf{Ld}(X, v)$, or
(RAA) $a = \mathsf{A}(X, v')$ and $b = \mathsf{Ld}(X, v)$, or
(AAA) $a = \mathsf{A}(X, v')$ and $b = \mathsf{A}(X, v)$.

$X_s.\mathsf{E} = X_t.\mathsf{E} \cup \{b\}$
$X_s.\mathsf{po} = (X_t.\mathsf{po} \cup \{(a, b)\})^+$
$X_s.\mathsf{rf} = X_t.\mathsf{rf} \cup \{(w, b) \mid X_t.\mathsf{rf}(w, a)\}$
$X_s.\mathsf{co} = X_t.\mathsf{co}$

Moreover, $[\{a\}]; X_s.\mathsf{po}_{\mathsf{imm}}; [\{b\}]; X_s.\mathsf{dob} \implies [\{a\}]; X_t.\mathsf{dob}$.

We show that $X_s$ is ARMv8 consistent. Assume $X_s$ is not consistent.

(internal) Assume an $X_s.\mathsf{po}|_{\mathsf{loc}} \cup X_s.\mathsf{ca} \cup X_s.\mathsf{rf}$ cycle. It implies an $X_t.\mathsf{po}|_{\mathsf{loc}} \cup X_t.\mathsf{ca} \cup X_t.\mathsf{rf}$ cycle, as $[\{b\}]; X_s.\mathsf{fr}$ implies $[\{a\}]; X_s.\mathsf{fr}$ and $[\{a\}]; X_t.\mathsf{fr}$. Therefore a contradiction, and $X_s$ satisfies (internal).

(external) We know $\mathsf{dom}(X_s.\mathsf{dob}); [\{b\}] \implies \mathsf{dom}(X_s.\mathsf{dob}); [\{a\}] = \mathsf{dom}(X_t.\mathsf{dob}); [\{a\}]$ holds. Moreover, $[\{b\}]; X_s.\mathsf{dob} \implies [\{a\}]; X_t.\mathsf{dob}$. Also, in the case of (AAA), $\mathsf{codom}([\{b\}]; X_s.\mathsf{bob}) = \mathsf{codom}([\{a\}]; X_s.\mathsf{bob}) \setminus \{b\} = \mathsf{codom}([\{a\}]; X_t.\mathsf{bob})$ holds. Hence $X_s.\mathsf{ob} \subseteq X_t.\mathsf{ob}$. We know $\mathsf{irr}(X_t.\mathsf{ob})$ holds. Therefore a contradiction, and $X_s$ satisfies (external).

(atomicity) From the definition, $X_s.\mathsf{rmw} = X_t.\mathsf{rmw}$, $X_s.\mathsf{coe} = X_t.\mathsf{coe}$, and $X_s.\mathsf{fre} = X_t.\mathsf{fre}$ hold. Therefore $X_s$ preserves atomicity as $X_t$ preserves atomicity.

Moreover, $\mathsf{Behavior}(X_s) = \mathsf{Behavior}(X_t)$ holds as $X_s.\mathsf{co} = X_t.\mathsf{co}$ holds.

B.3 Access strengthening

We prove the following theorem for (R-A) from Fig. 14(a).

$P_{\mathsf{src}} \leadsto P_{\mathsf{tgt}} \implies \forall X_t \in [\![P_{\mathsf{tgt}}]\!]\ \exists X_s \in [\![P_{\mathsf{src}}]\!].\ \mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$

Proof.

We know $X_t$ is ARMv8 consistent. We define $X_s$ where $a \leadsto b$, with $a = \mathsf{Ld}(X, v)$ and $b = \mathsf{A}(X, v)$.

$X_s.\mathsf{E} = X_t.\mathsf{E} \cup \{a\} \setminus \{b\}$
$X_s.\mathsf{po} = X_t.\mathsf{po} \cup \{(e, a) \mid X_t.\mathsf{po}(e, b)\} \cup \{(a, e) \mid X_t.\mathsf{po}(b, e)\}$
$X_s.\mathsf{rf} = X_t.\mathsf{rf} \cup \{(w, a) \mid X_t.\mathsf{rf}(w, b)\}$
$X_s.\mathsf{co} = X_t.\mathsf{co}$

We show that $X_s$ is ARMv8 consistent. Assume $X_s$ is not consistent.

(internal) Assume an $X_s.\mathsf{po}|_{\mathsf{loc}} \cup X_s.\mathsf{ca} \cup X_s.\mathsf{rf}$ cycle. It implies an $X_t.\mathsf{po}|_{\mathsf{loc}} \cup X_t.\mathsf{ca} \cup X_t.\mathsf{rf}$ cycle, which is a contradiction, and hence $X_s$ satisfies (internal).

(external) We know $\mathsf{dom}(X_s.\mathsf{ob}); [\{a\}] = \mathsf{dom}(X_s.\mathsf{ob}); [\{b\}]$ and $[\{a\}]; \mathsf{codom}(X.\mathsf{po}) = [\{b\}]; \mathsf{codom}(X_t.\mathsf{bob})$. Hence $X_s.\mathsf{ob} \subseteq X_t.\mathsf{ob}$. We know $\mathsf{irr}(X_t.\mathsf{ob})$ holds. Therefore a contradiction, and $X_s$ satisfies (external).

(atomicity) From the definition, $X_s.\mathsf{rmw} = X_t.\mathsf{rmw}$, $X_s.\mathsf{coe} = X_t.\mathsf{coe}$, and $X_s.\mathsf{fre} = X_t.\mathsf{fre}$ hold. Therefore $X_s$ preserves atomicity as $X_t$ preserves atomicity.

Moreover, $\mathsf{Behavior}(X_s) = \mathsf{Behavior}(X_t)$ holds as $X_s.\mathsf{co} = X_t.\mathsf{co}$ holds.

C Fence Elimination

C.1 Fence Elimination in x86

We restate the theorem on x86 fence elimination.

Theorem 8. An MFENCE in an x86 program thread is non-eliminable if it is the only fence on a program path from a store to a load in the same thread which access different locations. An MFENCE elimination is safe when it is not non-eliminable.

Proof.

We show: P src (cid:32) P tgt = ⇒ ∀ X t ∈ [[ P tgt ]] ∃ X s ∈ [[ P src ]] . Behavior ( X t ) = Behavior ( X s ) Given a X t ∈ [[ P tgt ]] we deﬁne X s ∈ P src by introducing the corresponding fence event e such that for all events w ∈ X s . W ∪ X s . F , • if ( w, e ) ∈ X s . mo ? ; X s . xhb holds then X s . mo ( w, e ) . • Otherwise X s . mo ( e, w ) .We know X t is consistent.Now we show X s is consistent.We prove by contradiction.(irrHB) Assume X s has X s . xhb cycle. We know the incoming and outgoing edges to e are X s . po edges and therefore X t . xhb already has a cycle. However, we know X t . xhb is irreﬂexive. Hence a contradiction and X s . xhb is irreﬂexive.(irrMOHB) Assume X s has X s . mo ; X s . xhb cycle. We already know that X t . mo ; X t . xhb is irreﬂexive. Therefore thecycle contains e . Two possiblilities: Case e ∈ dom ( X s . xhb ) and e ∈ codom ( X s . mo ) : Suppose X s . xhb ( e, w ) and X s . mo ( w, e ) . However, from deﬁnition we already know X s . xhb ( e, w ) = ⇒ X s . mo ( e, w ) when w ∈ X s . W ∪ X s . F . Hence a contradiction and X s . mo ; X s . xhb is irreﬂexive in this case. Case e ∈ codom ( X s . xhb ) and e ∈ dom ( X s . mo ) : Suppose X s . xhb ( w, e ) and X s . mo ( e, w ) . However, from deﬁnition we already know X s . xhb ( w, e ) = ⇒ X s . mo ( w, e ) when w ∈ X s . W ∪ X s . F . Hence a contradiction and X s . mo ; X s . xhb is irreﬂexive in this case.(irrFRHB) We know X t does not have a X t . fr ; X t . xhb cycle. We also know fr ⊆ ( W × W ) and hence event e ∈ F doesnot introduce any new X s . fr ; X xhb cycle. Therefore X s . fr ; X xhb is irreﬂexive.(irrFRMO)We know X t does not have a X t . fr ; X t . mo cycle. We also know fr ⊆ ( W × W ) and hence event e ∈ F does notintroduce any new X s . fr ; X mo cycle. Therefore X s . fr ; X mo is irreﬂexive.(irrFMRP)Assume X s has a X s . fr ; X s . mo ; X s . rfe ; X s . po cycle in X s cycle.In that case the cycle is of the form: [ X s . R ]; X s . fr ; [ X s . 
W ]; X s . mo ; [ X s . W ]; X s . rfe ; [ X s . R ]; X s . po ; [ X s . R ] .We know e ∈ F and therefore does not introduce this cycle in X s .In that case X t already has a X t . fr ; X t . mo ; X t . rfe ; X t . po cycle which is a contradiction.49n Architecture to Architecture Mapping for ConcurrencyHence X s . fr ; X s . mo ; X s . rfe ; X s . po cycle in X s is irreﬂexive.(irrUF)Assume X s has a X s . fr ; X s . mo ; [ X s . U ∪ X s . F ]; X s . po cycle.Two possiblities Case X s . fr ; X s . mo ; [ X s . U ]; X s . po : It implies a X t . fr ; X t . mo ; [ X t . U ]; X t . po cycle.However, we know X t satisﬁes (irrUF) and hence a contradiction. Case X s . fr ; X s . mo ; [ X s . F ]; X s . po : It implies a [ X s . R ]; X s . fr ; [ X s . W ]; X s . mo ; [ X s . F ]; X s . po ; [ X s . R ] cycle created by the introduced event e ∈ F .It implies [ X s . R ]; X s . fr ; [ X s . W ]; X s . mo ; [ { e } ]; X s . po ; [ X s . R ] From deﬁnition, we know [ X s . W ]; X s . mo ; [ { e } ] when [ X s . W ]; X s . mo ? ; X s . xhb ; [ { e } ] holds.Thus [ X s . R ]; X s . fr ; [ X s . W ]; X s . mo ; [ { e } ]; X s . po ; [ X s . R ]= ⇒ [ X s . R ]; X s . fr ; [ X s . W ]; X s . mo ? ; [ X s . W ]; X s . xhb ; [ { e } ]; X s . po ; [ X s . R ]= ⇒ [ X s . R ]; X s . fr ; [ X s . W ]; X s . mo ? ; [ X s . W ];( X s . xhb ? ; [ X s . W ]; X s . rfe ; X s . po ∪ X s . po ); [ { e } ]; X s . po ; [ X s . R ]= ⇒ [ X s . R ]; X s . fr ; [ X s . W ]; X s . mo ? ; [ X s . W ]; X s . xhb ? ; [ X s . W ]; X s . rfe ; X s . po ; [ { e } ]; X s . po ; [ X s . R ] ∪ [ X s . R ]; X s . fr ; [ X s . W ]; X s . mo ? ; [ X s . W ]; X s . po ; [ { e } ]; X s . po ; [ X s . R ]= ⇒ [ X s . R ]; X s . fr ; [ X s . W ]; X s . mo ? ; [ X s . W ]; X s . xhb ? ; [ X s . W ]; X s . rfe ; X s . po ; [ X s . R ] ∪ [ X s . R ]; X s . fr ; [ X s . W ]; X s . mo ? ; [ X s . W ]; X s . po ; [ { e } ]; X s . po ; [ X s . R ] Now we consider two subcases:

Subcase [ X s . R ]; X s . fr ; [ X s . W ]; X s . mo ? ; [ X s . W ]; X s . xhb ? ; [ X s . W ]; X s . rfe ; X s . po ; [ X s . R ] : = ⇒ [ X t . R ]; X t . fr ; [ X t . W ]; X t . mo ? ; [ X t . W ]; X t . xhb ? ; [ X t . W ]; X t . rfe ; X t . po ; [ X t . R ]= ⇒ [ X t . R ]; X t . fr ; X t . mo ? ; X t . rfe ; X t . po ; [ X t . R ] This is a contradiction as X t satisﬁes (irrFMRP). Subcase [ X s . R ]; X s . fr ; [ X s . W ]; X s . mo ? ; [ X s . W ]; X s . po ; [ { e } ]; X s . po ; [ X s . R ] : Now we consider the [ X s . W ]; X s . po ; [ { e } ]; X s . po ; [ X s . R ] subsequence.Possible cases: Subsubcase [ X s . St ]; X s . po ; [ { e } ]; X s . po ; [ X s . Ld ] : It implies [ X t . St ]; X t . po ; [ X t . F ]; X t . po ; [ X t . Ld ] from the deﬁnition. = ⇒ [ X t . St ]; X t . mo ; [ X t . F ]; X t . po ; [ X t . Ld ] .In that case there exists a X t . fr ; X t . mo ; [ X t . F ]; X t . po cycle.50n Architecture to Architecture Mapping for ConcurrencyThis is a contradiction as X t satisﬁes (irrFMRP). Subsubcase [ X s . W ]; X s . po ; [ { e } ]; X s . po ; [ X s . U ] : It implies [ X t . W ]; X t . po ; [ X t . U ] .It implies [ X t . W ]; X t . mo ; [ X t . U ] as X t satisﬁes (irrMOHB).In this case [ X s . R ]; X s . fr ; [ X s . W ]; X s . mo ? ; [ X s . W ]; X s . po ; [ { e } ]; X s . po ; [ X s . R ]= ⇒ [ X t . R ]; X t . fr ; [ X t . W ]; X t . mo ? ; [ X t . W ]; X t . mo ; [ X t . U ]= ⇒ [ X t . R ]; X t . fr ; X t . mo ; [ X t . U ] Hence a contradiction as X t satisﬁes (irrFRMO). Subsubcase [ X s . U ]; X s . po ; [ { e } ]; X s . po ; [ X s . Ld ] : It implies [ X t . U ]; X t . po ; [ X t . Ld ] and in consequence a [ X t . Ld ]; X t . fr ; [ X t . W ]; X t . mo ? ; [ X t . U ]; X t . po ; [ X t . Ld ] cycle.Now, [ X t . Ld ]; X t . fr ; [ X t . W ]; X t . mo ? ; [ X t . U ]; X t . po ; [ X t . Ld ]= ⇒ [ X t . Ld ]; X t . fr ; [ X s . U ]; X t . po ; [ X t . Ld ] ∪ [ X t . Ld ]; X t . fr ; X t . mo ; [ X t . U ]; X t . po ; [ X t . 
Ld ]= ⇒ [ X t . Ld ]; X t . fr ; X t . xhb ; [ X t . Ld ] ∪ [ X t . Ld ]; X t . fr ; X t . mo ; [ X t . U ]; X t . po ; [ X t . Ld ] Hence a contradiction as X t satisﬁes (irrFRHB) and (irrUF). Behavior ( X s ) = Behavior ( X t ) holds as X s . mo | loc = X t . mo | loc . C.2 Fence Elimination in ARMv8Observation.

Let $P$ be an ARMv8 program generated from an x86 program following the mappings in Fig. 9a. In this case, for every consistent execution $X \in [\![P]\!]$ the following hold:
1. A non-RMW load event is immediately followed by an $\mathsf{F_{LD}}$ event.
2. A non-RMW store event is immediately preceded by an $\mathsf{F_{ST}}$ event.
3. An RMW is immediately preceded by an $\mathsf{F}$ event.
4. An RMW is immediately followed by an $\mathsf{F}$ event.

We restate Theorem 9.

Theorem 9.

Suppose an ARMv8 program is generated by the x86 $\mapsto$ ARMv8 mapping (Fig. 9a). A DMBFULL in a thread of the program is non-eliminable if it is the only fence on a program path from a store to a load in the same thread which access different locations. A DMBFULL elimination is safe when it is not non-eliminable.
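The criterion of Theorem 9 can be illustrated on a single straight-line thread. The sketch below is a simplification under stated assumptions, not the paper's algorithm: a thread is a hypothetical list of `(kind, location)` pairs, there is exactly one program path (no branches), and "fence" in the criterion is read as DMBFULL, since weaker fences are not treated here as ordering a store before a later load.

```python
def is_non_eliminable(thread, i):
    """Return True if the DMBFULL at index i is the only full fence between
    some earlier store and some later load that access different locations.

    Assumption: `thread` is straight-line code, so the program path is unique.
    """
    assert thread[i][0] == "DMBFULL"

    # Stores that reach position i without crossing another DMBFULL.
    stores = []
    for kind, loc in reversed(thread[:i]):
        if kind == "DMBFULL":
            break
        if kind == "St":
            stores.append(loc)

    # Loads reached from position i without crossing another DMBFULL.
    loads = []
    for kind, loc in thread[i + 1:]:
        if kind == "DMBFULL":
            break
        if kind == "Ld":
            loads.append(loc)

    return any(s != l for s in stores for l in loads)


# Hypothetical threads; the instruction encoding is illustrative only.
needed = [("St", "x"), ("DMBFULL", None), ("Ld", "y")]
same_location = [("St", "x"), ("DMBFULL", None), ("Ld", "x")]
doubled = [("St", "x"), ("DMBFULL", None), ("DMBFULL", None), ("Ld", "y")]
```

In `doubled`, each fence is individually eliminable because the other one still separates the store from the load; this only justifies removing one of them, so any pass built on this check should re-run it after every removal.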

To prove Theorem 9, we show:

$P_{\mathsf{src}} \leadsto P_{\mathsf{tgt}} \implies \forall X_t \in [\![P_{\mathsf{tgt}}]\!]\ \exists X_s \in [\![P_{\mathsf{src}}]\!].\ \mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$

Proof.

Subcase $a \notin \mathsf{dom}(X_s.\mathsf{rmw})$: It implies $(a, b) \in [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F_{LD}}]; X_t.\mathsf{po}; [X_t.\mathsf{E}]$ from Observation (1) in Appendix C.2.
$\implies (a, b) \in [X_t.\mathsf{Ld}]; X_t.\mathsf{bob}; [X_t.\mathsf{E}]$
Hence a contradiction and $X_s$ violates (external).

Subcase $a \in \mathsf{dom}(X_s.\mathsf{rmw})$: It implies $(a, b) \in [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{E}]$
$\implies (a, b) \in [X_t.\mathsf{Ld}]; X_t.\mathsf{bob}; [X_t.\mathsf{E}]$
Hence a contradiction and $X_s$ violates (external).

Case $(a, b) \in [X_s.\mathsf{St}] \times [X_s.\mathsf{St}]$:
$\implies [X_t.\mathsf{St}]; X_t.\mathsf{po}; [X_t.\mathsf{F_{ST}}]; X_t.\mathsf{po}; [X_t.\mathsf{St}]$ from Observation (2) in Appendix C.2.
$\implies [X_t.\mathsf{St}]; X_t.\mathsf{bob}; [X_s.\mathsf{St}]$
This is a contradiction and hence $X_s$ satisfies (external).

Case $(a, b) \in [X_s.\mathsf{St}] \times [X_s.\mathsf{Ld}]$: It implies $(a, b) \in [X_t.\mathsf{St}]; X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{Ld}]$ from the condition in ?? .
$\implies [X_t.\mathsf{St}]; X_t.\mathsf{bob}; [X_s.\mathsf{Ld}]$
This is a contradiction and hence $X_s$ satisfies (external).

As a result, $X_s$ also satisfies (external) and is ARMv8 consistent. Moreover, we know that $X_s.\mathsf{co} = X_t.\mathsf{co}$. Hence $\mathsf{Behavior}(X_s) = \mathsf{Behavior}(X_t)$.

We restate Theorem 11.

Theorem 11. A DMBST in a program thread is non-eliminable if it is placed on a program path between a pair of stores in the same thread which access different locations and there exists no other DMBFULL or DMBST fence on the same path. A DMBST elimination is safe when it is not non-eliminable.
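One way to turn the criterion of Theorem 11 into an elimination pass is to remove one fence at a time and re-evaluate the criterion after each removal. The sketch below assumes a hypothetical straight-line thread encoded as `(kind, location)` pairs and treats both DMBST and DMBFULL as fences that order a pair of stores; it is an illustration of the stated condition, not the algorithm implemented in the paper.

```python
# Fence kinds assumed to order a store before a later store.
ORDERS_STORES = {"DMBST", "DMBFULL"}


def dmbst_non_eliminable(thread, i):
    """True if the DMBST at index i is the only store-ordering fence between
    some earlier store and some later store to a different location."""
    before = []
    for kind, loc in reversed(thread[:i]):
        if kind in ORDERS_STORES:
            break
        if kind == "St":
            before.append(loc)
    after = []
    for kind, loc in thread[i + 1:]:
        if kind in ORDERS_STORES:
            break
        if kind == "St":
            after.append(loc)
    return any(s != t for s in before for t in after)


def eliminate_dmbst(thread):
    """Remove eliminable DMBST fences one at a time.

    The check is repeated after every removal, because a fence that was
    redundant only thanks to a neighbouring fence may become non-eliminable
    once that neighbour is gone.
    """
    result = list(thread)
    changed = True
    while changed:
        changed = False
        for i, (kind, _) in enumerate(result):
            if kind == "DMBST" and not dmbst_non_eliminable(result, i):
                del result[i]
                changed = True
                break
    return result


# Hypothetical input: two adjacent DMBST fences between stores to x and y.
doubled = [("St", "x"), ("DMBST", None), ("DMBST", None), ("St", "y")]
```

On `doubled`, the pass removes exactly one of the two fences and keeps the other, since the remaining DMBST is then the only fence between the two stores.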

To prove Theorem 11, we show:

$P_{\mathsf{src}} \leadsto P_{\mathsf{tgt}} \implies \forall X_t \in [\![P_{\mathsf{tgt}}]\!]\ \exists X_s \in [\![P_{\mathsf{src}}]\!].\ \mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$

Proof.

Given a target execution $X_t \in [\![P_{\mathsf{tgt}}]\!]$ we define a source execution $X_s \in [\![P_{\mathsf{src}}]\!]$ by introducing the corresponding fence event $e \in \mathsf{F}$.

We know the target execution $X_t$ satisfies (internal) and (atomic). From the definition, the source execution $X_s$ also satisfies (internal) and (atomic) as the respective relations remain unchanged.

We now prove that $X_s$ satisfies (external) by showing $X_t.\mathsf{ob} = X_s.\mathsf{ob}$. From the definition we know that $X_t.\mathsf{obs} = X_s.\mathsf{obs}$, $X_t.\mathsf{dob} = X_s.\mathsf{dob}$, and $X_t.\mathsf{aob} = X_s.\mathsf{aob}$. Assume $X_t.\mathsf{ob} \neq X_s.\mathsf{ob}$. In that case there exist events $(a, b) \in X_s.\mathsf{bob}$ with $(a, b) \notin X_t.\mathsf{bob}$. Considering the possible cases of $a$ and $b$:

Case $(a, b) \in [X_s.\mathsf{Ld}] \times [X_s.\mathsf{E}]$: It implies $(a, b) \in [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F_{LD}} \cup X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{E}]$ from Observations (1) and (4) in Appendix C.2.
$\implies (a, b) \in [X_t.\mathsf{Ld}]; X_t.\mathsf{bob}; [X_t.\mathsf{E}]$
Hence a contradiction and $X_s$ violates (external).

Case $(a, b) \in [X_s.\mathsf{St}] \times [X_s.\mathsf{St}]$:
$\implies [X_t.\mathsf{St}]; X_t.\mathsf{po}; [X_t.\mathsf{F_{ST}} \cup X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{St}]$ from the condition in Theorem 9.
$\implies [X_t.\mathsf{St}]; X_t.\mathsf{bob}; [X_s.\mathsf{St}]$
This is a contradiction and hence $X_s$ satisfies (external).

Case $(a, b) \in [X_s.\mathsf{St}] \times [X_s.\mathsf{Ld}]$: It implies $(a, b) \in [X_s.\mathsf{St}]; X_s.\mathsf{po}; [X_s.\mathsf{F}]; X_s.\mathsf{po}; [X_s.\mathsf{Ld}]$ as $X_s.\mathsf{bob}(a, b)$ holds.
It implies $(a, b) \in [X_t.\mathsf{St}]; X_t.\mathsf{po}; [X_t.\mathsf{F}]; X_t.\mathsf{po}; [X_t.\mathsf{Ld}]$
$\implies [X_t.\mathsf{St}]; X_t.\mathsf{bob}; [X_s.\mathsf{Ld}]$
This is a contradiction and hence $X_s$ satisfies (external).

We restate Theorem 13.

Theorem 13. A DMB in a program thread is non-eliminable if it is the only fence on a program path between a pair of memory accesses in the same thread. A DMB elimination is safe when it is not non-eliminable.

To prove Theorem 13, we show:

$P_{\mathsf{src}} \leadsto P_{\mathsf{tgt}} \implies \forall X_t \in [\![P_{\mathsf{tgt}}]\!]\ \exists X_s \in [\![P_{\mathsf{src}}]\!].\ \mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$

Proof.

From the mapping scheme and the constraint in Theorem 13, in all cases there is a pair of $\mathsf{F}$ fences between the access pairs and therefore one of the fences is eliminable.

C.3 Fence Weakening in ARMv8

We restate Theorem 10.

Theorem 10. A DMBFULL in a program thread is non-eliminable if it is the only fence on a program path from a store to a load in the same thread which access different locations. For such a fence $\mathsf{DMBFULL} \leadsto \mathsf{DMBST}; \mathsf{DMBLD}$ is safe.

To prove Theorem 10, we show:

$P_{\mathsf{src}} \leadsto P_{\mathsf{tgt}} \implies \forall X_t \in [\![P_{\mathsf{tgt}}]\!]\ \exists X_s \in [\![P_{\mathsf{src}}]\!].\ \mathsf{Behavior}(X_t) = \mathsf{Behavior}(X_s)$

Proof.

Given a target execution $X_t \in [\![P_{\mathsf{tgt}}]\!]$ we define a source execution $X_s \in [\![P_{\mathsf{src}}]\!]$. From the definition we know that $X_t.\mathsf{obs} = X_s.\mathsf{obs}$, $X_t.\mathsf{dob} = X_s.\mathsf{dob}$, and $X_t.\mathsf{aob} = X_s.\mathsf{aob}$.

We know the target execution $X_t$ satisfies (internal) and (atomic). From the definition, the source execution $X_s$ also satisfies (internal) and (atomic) as the respective relations remain unchanged.

We now prove that $X_s$ satisfies (external). We consider the following possibilities:

Case $(a, b) \in [\mathsf{Ld}] \times [\mathsf{E}]$: In this case $(a, b) \in [X_s.\mathsf{Ld}]; X_s.\mathsf{po}; [X_s.\mathsf{F}]; X_s.\mathsf{po}; [X_s.\mathsf{E}]$ and $(a, b) \in [X_t.\mathsf{Ld}]; X_t.\mathsf{po}; [X_t.\mathsf{F_{LD}}]; X_t.\mathsf{po}; [X_t.\mathsf{E}]$. It implies that both $X_s.\mathsf{bob}(a, b)$ and $X_t.\mathsf{bob}(a, b)$ hold.

Case $(a, b) \in [\mathsf{St}] \times [\mathsf{St}]$: In this case $(a, b) \in [X_s.\mathsf{St}]; X_s.\mathsf{po}; [X_s.\mathsf{F}]; X_s.\mathsf{po}; [X_s.\mathsf{St}]$ and $(a, b) \in [X_t.\mathsf{St}]; X_t.\mathsf{po}; [X_t.\mathsf{F_{ST}}]; X_t.\mathsf{po}; [X_t.\mathsf{St}]$. It implies that both $X_s.\mathsf{bob}(a, b)$ and $X_t.\mathsf{bob}(a, b)$ hold.

We know that $X_t.\mathsf{ob}$ is acyclic and hence $X_s.\mathsf{ob}$ is also acyclic. As a result, $X_s$ also satisfies (external) and is ARMv8 consistent. Moreover, we know that $X_s.\mathsf{co} = X_t.\mathsf{co}$. Hence $\mathsf{Behavior}(X_s) = \mathsf{Behavior}(X_t)$.

D Proofs and Algorithms of Robustness Analysis

D.1 SC robust against x86

Theorem 14.

A program $P$ is $M$-robust against $K$ if in all its $K$-consistent executions $X$, $X.\mathsf{epo} \subseteq X.R$ holds, where $R$ is defined as condition ($M$-$K$) in Fig. 17. In this case $R = [\mathsf{R}]; \mathsf{po} \cup \mathsf{po}; [\mathsf{W}] \cup \mathsf{po}|_{\mathsf{loc}} \cup \mathsf{po}; [\mathsf{F}]; \mathsf{po}$.

Proof.

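Both x86A and SC satisfy atomicity. It remains to show, by contradiction, that $(X.\mathsf{po} \cup X.\mathsf{rf} \cup X.\mathsf{fr} \cup X.\mathsf{co})$ is acyclic.

Assume $(X.\mathsf{po} \cup X.\mathsf{rf} \cup X.\mathsf{fr} \cup X.\mathsf{co})$ creates a cycle. It implies $(X.\mathsf{po}; X.\mathsf{eco})^+$ creates a cycle. It implies $(([\mathsf{R}]; \mathsf{po} \cup \mathsf{po}; [\mathsf{W}] \cup \mathsf{po}|_{\mathsf{loc}} \cup \mathsf{fence}); X.\mathsf{eco})^+$ has a cycle.

Considering the incoming and outgoing $\mathsf{eco}$ edges to $\mathsf{po}|_{\mathsf{loc}}$:
• $X.\mathsf{rfe}; [\mathsf{Ld}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{Ld}]; X.\mathsf{fre} \implies X.\mathsf{co}$
• $[\mathsf{W}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{Ld}]; X.\mathsf{fre} \implies X.\mathsf{co}$
• $X.\mathsf{rfe}; [\mathsf{Ld}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{W}] \implies X.\mathsf{co}$

It implies $(\mathsf{po} \setminus \mathsf{WR} \cup \mathsf{fence} \cup \mathsf{rfe} \cup \mathsf{coe} \cup \mathsf{fre})$ has a cycle. It implies $(\mathsf{po} \setminus \mathsf{WR} \cup \mathsf{fence} \cup \mathsf{rfe} \cup \mathsf{co} \cup \mathsf{fr})$ has a cycle as $\mathsf{coi} \cup \mathsf{fri} \subseteq \mathsf{po} \setminus \mathsf{WR}$. However, we know $(\mathsf{po} \setminus \mathsf{WR} \cup \mathsf{fence} \cup \mathsf{rfe} \cup \mathsf{co} \cup \mathsf{fr})$ is acyclic, and therefore a contradiction.

Hence $X$ satisfies $\mathsf{acy}(X.\mathsf{po} \cup X.\mathsf{rf} \cup X.\mathsf{fr} \cup X.\mathsf{co})$.

D.2 SC, x86 robustness against ARMv8

Theorem 14.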

A program $P$ is $M$-robust against $K$ if in all its $K$-consistent executions $X$, $X.\mathsf{epo} \subseteq X.R$ holds, where $R$ is defined as condition ($M$-$K$) in Fig. 17. In this case $R = \mathsf{po}|_{\mathsf{loc}} \cup (\mathsf{aob} \cup \mathsf{dob} \cup \mathsf{bob})^+$.

Proof.

Both SC and ARMv8 satisfy atomicity. It remains to show, by contradiction, that $(X.\mathsf{po} \cup X.\mathsf{rf} \cup X.\mathsf{fr} \cup X.\mathsf{co})$ is acyclic.

Assume $(X.\mathsf{po} \cup X.\mathsf{rf} \cup X.\mathsf{fr} \cup X.\mathsf{co})$ creates a cycle. If the cycle has one or no $\mathsf{epo}$ edge then the cycle violates (sc-per-loc). Otherwise, the cycle contains two or more $\mathsf{epo}$ edges. It implies $(X.\mathsf{epo}; X.\mathsf{eco})^+$ creates a cycle. It implies $((X.\mathsf{po}|_{\mathsf{loc}} \cup (X.\mathsf{aob} \cup X.\mathsf{dob} \cup X.\mathsf{bob})^+); X.\mathsf{eco})^+$ creates a cycle.

Considering $X.\mathsf{po}|_{\mathsf{loc}}$ with incoming and outgoing $X.\mathsf{eco}$, the possible cases are:
(1) $[\mathsf{Ld}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{Ld}]; X.\mathsf{fre}; [\mathsf{St}] \implies [\mathsf{Ld}]; X.\mathsf{fre}$
(2) $[\mathsf{St}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{Ld}]; X.\mathsf{fre}; [\mathsf{St}] \implies [\mathsf{St}]; X.\mathsf{coe}$
(3) $[\mathsf{St}]; X.\mathsf{rfe}; [\mathsf{Ld}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{St}] \implies [\mathsf{St}]; X.\mathsf{coe}$
(4) $(X.\mathsf{coe} \cup X.\mathsf{fre}); [\mathsf{St}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{St}] \implies X.\mathsf{coe} \cup X.\mathsf{fre}$

Therefore a $((X.\mathsf{po}|_{\mathsf{loc}} \cup (X.\mathsf{aob} \cup X.\mathsf{dob} \cup X.\mathsf{bob})^+); X.\mathsf{eco})^+$ cycle implies a $((X.\mathsf{aob} \cup X.\mathsf{dob} \cup X.\mathsf{bob})^+; X.\mathsf{eco})^+$ cycle. It implies an $X.\mathsf{ob}$ cycle, which violates (external), and therefore a contradiction.

D.3 Proof of x86A robustness against ARMv8

Theorem 14.

A program $P$ is $M$-robust against $K$ if in all its $K$-consistent executions $X$, $X.\mathsf{epo} \subseteq X.R$ holds, where $R$ is defined as condition ($M$-$K$) in Fig. 17. In this case $R = \mathsf{po}|_{\mathsf{loc}} \cup (\mathsf{aob} \cup \mathsf{bob} \cup \mathsf{dob})^+ \cup \mathsf{WR}$.

Proof.

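Suppose $((X.\mathsf{po} \setminus \mathsf{WR}) \cup \mathsf{fence} \cup X.\mathsf{rfe} \cup X.\mathsf{co} \cup X.\mathsf{fr})$ is a cycle.

It implies $((X.\mathsf{po} \setminus \mathsf{WR}) \cup X.\mathsf{fence} \cup X.\mathsf{rfe} \cup X.\mathsf{coe} \cup X.\mathsf{fre})$ is a cycle as $\mathsf{coi} \subseteq (X.\mathsf{po} \setminus \mathsf{WR})$ and $\mathsf{fri} \subseteq (X.\mathsf{po} \setminus \mathsf{WR})$.

It implies a $((X.\mathsf{po} \setminus \mathsf{WR}); X.\mathsf{eco} \cup X.\mathsf{fence}; X.\mathsf{eco} \cup X.\mathsf{WR}|_{\mathsf{loc}}; X.\mathsf{eco} \cup X.\mathsf{WR}|_{\neq \mathsf{loc}}; X.\mathsf{eco})$ cycle.

Now $X.\mathsf{WR}|_{\neq \mathsf{loc}} \implies [\mathsf{St}]; (X.\mathsf{po} \setminus \mathsf{WR}); [\mathsf{St}]|_{\neq \mathsf{loc}}; X.\mathsf{eco}$.

Therefore it implies a $((X.\mathsf{po} \setminus \mathsf{WR}); X.\mathsf{eco} \cup X.\mathsf{fence} \cup X.\mathsf{WR}|_{\mathsf{loc}}; X.\mathsf{eco})$ cycle.

Following the definition of $\mathsf{epo}$, it implies a $((\mathsf{po}|_{\mathsf{loc}} \cup (X.\mathsf{aob} \cup X.\mathsf{dob} \cup X.\mathsf{bob})^+); X.\mathsf{eco})^+$ cycle.

Considering the incoming and outgoing edges for $X.\mathsf{po}|_{\mathsf{loc}}$:
• $[\mathsf{Ld}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{Ld}]; X.\mathsf{fre} \implies [\mathsf{Ld}]; X.\mathsf{fre}$
• $[\mathsf{St}]; X.\mathsf{rfe}; [\mathsf{Ld}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{St}] \implies [\mathsf{St}]; X.\mathsf{coe}$
• $(X.\mathsf{fre} \cup X.\mathsf{coe}); [\mathsf{St}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{St}] \implies (X.\mathsf{fre} \cup X.\mathsf{coe})$
• $[\mathsf{St}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{Ld}]; X.\mathsf{fre} \implies [\mathsf{St}]; X.\mathsf{coe}$

It implies $((X.\mathsf{aob} \cup X.\mathsf{dob} \cup X.\mathsf{bob})^+; X.\mathsf{eco})^+$ creates a cycle. It implies $X.\mathsf{ob}$ creates a cycle, which is a contradiction. Therefore $X$ is x86A consistent.

D.4 SC, x86A, ARMv8, ARMv7mca robust against ARMv7

Theorem 14.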

A program $P$ is $M$-robust against $K$ if, in all its $K$-consistent executions $X$, $X.\mathsf{epo} \subseteq X.R$ holds, where $R$ is defined as condition ($M$-$K$) in Fig. 17.

D.4.1 SC-robust against ARMv7
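As an illustration of the containment $X.\mathsf{epo} \subseteq X.R$ required by the theorem, the following minimal sketch checks it on one hand-built execution of the store-buffering litmus test, with $R$ instantiated as $\mathsf{po}|_{\mathsf{loc}} \cup \mathsf{fence}$, the condition used in this subsection. The event encoding, the helper names, the simplified notion of `fence` (two accesses with a fence event between them in program order), and the `epo` pairs supplied as input are all assumptions made for illustration; they are not taken from the paper or from any tool.

```python
# Hypothetical event encoding: (id, thread, index, kind, loc).
# kind is "Ld", "St", or "F" (fence); fence events carry loc None.

def po_loc(events):
    """Program order restricted to accesses on the same location."""
    return {(a[0], b[0]) for a in events for b in events
            if a[1] == b[1] and a[2] < b[2]
            and a[3] != "F" and b[3] != "F" and a[4] == b[4]}

def fence(events):
    """Simplified fence relation: same-thread accesses with a fence between them."""
    return {(a[0], b[0]) for a in events for b in events
            if a[1] == b[1] and a[3] != "F" and b[3] != "F"
            and any(f[3] == "F" and f[1] == a[1] and a[2] < f[2] < b[2]
                    for f in events)}

def violations(epo, r):
    """epo pairs not covered by R; an empty list means epo is contained in R."""
    return sorted(epo - r)

# Store buffering: T0 = St x; Ld y   and   T1 = St y; Ld x.
sb = [("a", 0, 0, "St", "x"), ("b", 0, 1, "Ld", "y"),
      ("c", 1, 0, "St", "y"), ("d", 1, 1, "Ld", "x")]

# The same program with a fence between the two accesses of each thread.
sb_fenced = [("a", 0, 0, "St", "x"), ("f0", 0, 1, "F", None), ("b", 0, 2, "Ld", "y"),
             ("c", 1, 0, "St", "y"), ("f1", 1, 1, "F", None), ("d", 1, 2, "Ld", "x")]

# Assumed epo pairs for the execution in which both loads read the initial value.
epo = {("a", "b"), ("c", "d")}

unfenced_violations = violations(epo, po_loc(sb) | fence(sb))
fenced_violations = violations(epo, po_loc(sb_fenced) | fence(sb_fenced))
```

Under these assumptions, both `epo` pairs are reported for the unfenced program, and none remain once each thread has a fence between its store and its load.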

In this case $R = \mathsf{po}|_{\mathsf{loc}} \cup \mathsf{fence}$.

Proof.

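Both SC and ARMv7 satisfy atomicity. It remains to show, by contradiction, that $(X.\mathsf{po} \cup X.\mathsf{rf} \cup X.\mathsf{fr} \cup X.\mathsf{co})$ is acyclic.

Assume $(X.\mathsf{po} \cup X.\mathsf{rf} \cup X.\mathsf{fr} \cup X.\mathsf{co})$ creates a cycle. If the cycle has one or no $\mathsf{epo}$ edge, then the cycle violates (sc-per-loc). Otherwise, the cycle contains two or more $\mathsf{epo}$ edges. It implies $(X.\mathsf{epo}; X.\mathsf{eco})^{+}$ creates a cycle. It implies $((X.\mathsf{po}|_{\mathsf{loc}} \cup X.\mathsf{fence}); X.\mathsf{eco})^{+}$ creates a cycle.

Considering the incoming and outgoing edges for $X.\mathsf{po}|_{\mathsf{loc}}$:

$[\mathsf{Ld}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{Ld}]; X.\mathsf{fre} \Rightarrow [\mathsf{Ld}]; X.\mathsf{fre}$
$[\mathsf{St}]; X.\mathsf{rfe}; [\mathsf{Ld}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{St}] \Rightarrow [\mathsf{St}]; X.\mathsf{coe}$
$(X.\mathsf{fre} \cup X.\mathsf{coe}); [\mathsf{St}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{St}] \Rightarrow (X.\mathsf{fre} \cup X.\mathsf{coe})$
$[\mathsf{St}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{Ld}]; X.\mathsf{fre} \Rightarrow [\mathsf{St}]; X.\mathsf{coe}$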

It implies $(X.\mathsf{fence}; X.\mathsf{eco})^{+}$ creates a cycle. Now we consider a $[\mathit{codom}(\mathsf{fence})]; X.\mathsf{eco}; [\mathit{dom}(\mathsf{fence})]$ path. The possible cases are:

Case $[\mathsf{Ld}]; X.\mathsf{eco}; [\mathsf{Ld}]$: it implies $X.\mathsf{fre}; X.\mathsf{rfe}$.
Case $[\mathsf{Ld}]; X.\mathsf{eco}; [\mathsf{St}]$: it implies $X.\mathsf{fre}$.
Case $[\mathsf{St}]; X.\mathsf{eco}; [\mathsf{St}]$: it implies $X.\mathsf{coe}$.
Case $[\mathsf{St}]; X.\mathsf{eco}; [\mathsf{Ld}]$: it implies $X.\mathsf{coe}; X.\mathsf{rfe}$.

Thus an $(X.\mathsf{fence}; X.\mathsf{eco})^{+}$ cycle implies a $((X.\mathsf{coe} \cup X.\mathsf{fre}); X.\mathsf{rfe}^{?}; X.\mathsf{fence})^{+}$ cycle. It implies a $\mathsf{prop}^{+}$ cycle, which violates (propagation). Hence a contradiction, and therefore SC is preserved.

D.4.2 x86A robust against ARMv7

Theorem 14.

A program $P$ is $M$-robust against $K$ if, in all its $K$-consistent executions $X$, $X.\mathsf{epo} \subseteq X.R$ holds, where $R$ is defined as condition ($M$-$K$) in Fig. 17.

In this case $R = \mathsf{po}|_{\mathsf{loc}} \cup \mathsf{fence} \cup \mathsf{WR}$.

Proof.

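Suppose $((X.\mathsf{po} \setminus \mathsf{WR}) \cup X.\mathsf{fence} \cup X.\mathsf{rfe} \cup X.\mathsf{co} \cup X.\mathsf{fr})$ is a cycle. It implies $((X.\mathsf{po} \setminus \mathsf{WR}) \cup X.\mathsf{fence} \cup X.\mathsf{rfe} \cup X.\mathsf{coe} \cup X.\mathsf{fre})$ is a cycle, as $X.\mathsf{coi} \subseteq \mathsf{WW}$ and $X.\mathsf{fri} \subseteq (X.\mathsf{po} \setminus \mathsf{WR})$.

It implies a $((X.\mathsf{po} \setminus \mathsf{WR}); X.\mathsf{eco} \cup X.\mathsf{fence}; X.\mathsf{eco} \cup X.\mathsf{WR}|_{\mathsf{loc}}; X.\mathsf{eco} \cup X.\mathsf{WR}|_{\neq\mathsf{loc}}; X.\mathsf{eco})$ cycle. Now $X.\mathsf{WR}|_{\neq\mathsf{loc}} \Rightarrow [\mathsf{St}]; (X.\mathsf{po} \setminus \mathsf{WR}); [\mathsf{St}]|_{\neq\mathsf{loc}}; X.\mathsf{eco}$. Therefore it implies a $((X.\mathsf{po} \setminus \mathsf{WR}); X.\mathsf{eco} \cup X.\mathsf{fence}; X.\mathsf{eco} \cup X.\mathsf{WR}|_{\mathsf{loc}}; X.\mathsf{eco})$ cycle. Following the definition of $\mathsf{epo}$, it implies a $((X.\mathsf{po}|_{\mathsf{loc}} \cup X.\mathsf{fence}); X.\mathsf{eco})^{+}$ cycle.

Considering the incoming and outgoing edges for $X.\mathsf{po}|_{\mathsf{loc}}$:

$[\mathsf{Ld}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{Ld}]; X.\mathsf{fre} \Rightarrow [\mathsf{Ld}]; X.\mathsf{fre}$
$[\mathsf{St}]; X.\mathsf{rfe}; [\mathsf{Ld}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{St}] \Rightarrow [\mathsf{St}]; X.\mathsf{coe}$
$(X.\mathsf{fre} \cup X.\mathsf{coe}); [\mathsf{St}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{St}] \Rightarrow (X.\mathsf{fre} \cup X.\mathsf{coe})$
$[\mathsf{St}]; X.\mathsf{po}|_{\mathsf{loc}}; [\mathsf{Ld}]; X.\mathsf{fre} \Rightarrow [\mathsf{St}]; X.\mathsf{coe}$

It implies $(X.\mathsf{fence}; X.\mathsf{eco})^{+}$ creates a cycle. It implies $X.\mathsf{prop}$ creates a cycle, which is a contradiction. Therefore $X$ is x86A consistent.

D.4.3 ARMv8 robust against ARMv7

Theorem 14.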

A program $P$ is $M$-robust against $K$ if, in all its $K$-consistent executions $X$, $X.\mathsf{epo} \subseteq X.R$ holds, where $R$ is defined as condition ($M$-$K$) in Fig. 17.

In this case $R = \mathsf{po}|_{\mathsf{loc}} \cup [\mathsf{St}]; \mathsf{po} \cup \mathsf{fence}$.

Proof.

We show that $X$ is ARMv8 consistent.

(internal) Assume an $(X.\mathsf{po}|_{\mathsf{loc}} \cup X.\mathsf{fr} \cup X.\mathsf{co} \cup X.\mathsf{rf})$ cycle. However, $X$ satisfies (sc-per-loc), and hence a contradiction. Therefore, $X$ satisfies (internal).

(external) Assume an $X.\mathsf{ob}$ cycle. It implies $(X.\mathsf{obs}; (X.\mathsf{aob} \cup X.\mathsf{bob} \cup X.\mathsf{dob}))^{+}$ creates a cycle. From the definition, $(X.\mathsf{aob} \cup X.\mathsf{bob} \cup X.\mathsf{dob}) \subseteq \mathsf{po}|_{\mathsf{loc}} \cup \mathsf{fence} \cup [\mathsf{St}]; \mathsf{po}$, and therefore $((X.\mathsf{rfe} \cup X.\mathsf{coe} \cup X.\mathsf{fre}); (X.\mathsf{po}|_{\mathsf{loc}} \cup \mathsf{fence}))^{+}$ creates a cycle. It implies $\mathsf{prop}$ creates a cycle, which violates (propagation). Therefore a contradiction, and $X$ satisfies (external).

(atomicity) The ARMv7 execution $X$ satisfies (atomicity).

Therefore $X$ is ARMv8 consistent.

D.4.4 ARMv7-mca robust against ARMv7

Theorem 14.

A program $P$ is $M$-robust against $K$ if, in all its $K$-consistent executions $X$, $X.\mathsf{epo} \subseteq X.R$ holds, where $R$ is defined as condition ($M$-$K$) in Fig. 17.

In this case $R = [\mathsf{Ld}]; \mathsf{po}|_{\mathsf{loc}} \cup \mathsf{fence}; [\mathsf{Ld}]$.

Proof.