Interface Compliance of Inline Assembly: Automatically Check, Patch and Refine
Frédéric Recoules, Sébastien Bardin, Richard Bonichon, Matthieu Lemerre, Laurent Mounier, Marie-Laure Potet
Frédéric Recoules
Univ. Paris-Saclay, CEA, List, Saclay
Matthieu Lemerre
Univ. Paris-Saclay, CEA, List, Saclay
Sébastien Bardin
Univ. Paris-Saclay, CEA, List, Saclay
Laurent Mounier
Univ. Grenoble Alpes, VERIMAG, Grenoble
Richard Bonichon
Tweag I/O, Paris
Marie-Laure Potet
Univ. Grenoble Alpes, VERIMAG, Grenoble
Abstract: Inline assembly is still a common practice in low-level C programming, typically for efficiency reasons or for accessing specific hardware resources. Such embedded assembly codes in the GNU syntax (supported by major compilers such as GCC, Clang and ICC) have an interface specifying how the assembly codes interact with the C environment. For simplicity reasons, the compiler treats GNU inline assembly codes as blackboxes and relies only on their interface to correctly glue them into the compiled C code. Therefore, the adequacy between the assembly chunk and its interface (named compliance) is of primary importance, as compliance issues can lead to subtle and hard-to-find bugs. We propose RUSTINA, the first automated technique for formally checking inline assembly compliance, with the extra ability to propose (proven) patches and (optimization) refinements in certain cases. RUSTINA is based on an original formalization of the inline assembly compliance problem together with novel dedicated algorithms. Our prototype has been evaluated on 202 Debian packages with inline assembly (2656 chunks), finding 2183 issues in 85 packages – 986 significant issues in 54 packages (including major projects such as ffmpeg or ALSA), and proposing patches for 92% of them. Currently, 38 patches have already been accepted (solving 156 significant issues), with positive feedback from development teams.
I. INTRODUCTION
Context.
Inline assembly, i.e. embedding assembly code inside a higher-level host language, is still a common practice in low-level C/C++ programming, for efficiency reasons or for accessing specific hardware resources – it is typically widespread in resource-sensitive areas such as cryptography, multimedia, drivers, system, automated trading or video games [1], [2]. Recoules et al. [1] estimate that 11% of Debian packages written in C/C++ directly or indirectly depend on inline assembly, including major projects such as GMP or ffmpeg, while 28% of the top rated C projects on GitHub contain inline assembly according to Rigger et al. [2].
Thus, compilers supply a syntax to embed assembly instructions in the source program. The most widespread is the GNU inline assembly syntax, driven by GCC but also supported by Clang or ICC. The GNU syntax provides an interface specifying how the assembly code interacts with the C environment. The compiler then treats GNU inline assembly codes as blackboxes and relies only on this interface to correctly insert them into the compiled C code.
Problem.
The problem with GNU inline assembly is twofold. First, it is hard to write correctly: inline assembly syntax [4] is not beginner-friendly, the language itself is neither standardized nor fully described, and some corner cases are defined by the GCC implementation (with occasional changes from time to time). Second, assembly chunks are treated as blackboxes, so that the compiler does not do any sanity checks and assumes the embedded assembly code respects its declared interface.
Hence, in addition to usual functional bugs in the assembly instructions themselves, inline assembly is also prone to interface compliance bugs, i.e., mismatches between the declared interface and the real behavior of the assembly chunk, which can lead to subtle and hard-to-find bugs – typically incorrect results or crashes due to either subsequent compiler optimizations or ill-chosen register allocation. In the end, compliance issues can lead to severe bugs (segfault, deadlocks, etc.) and, as they depend on low-level compiler choices, they are hard to identify and can hide for years before being triggered by a compiler update. For example, a 2005 compliance bug introduced in the libatomic_ops library of lock-free primitives for multithreading made deadlocks possible: it was identified and patched only in 2010 (commit 03e48c1). A similar bug was still lurking in another primitive in 2020 until we automatically found and patched it (commit 05812c2). We also found a 1997 interface compliance bug in glibc (leading to a segfault in a string primitive) that was patched in 1999 (commit 7c97add), then reintroduced in 2002 after refactoring.
Notes: Microsoft inline assembly is different and has no interface, see Sec. IX-C. From the llvm-dev mailing list [3]: “GCC-style inline assembly is notoriously hard to write correctly”. Syntactically incorrect assembly instructions are caught during the translation from assembly to machine code.
© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Goal and challenges.
We address the challenge of helping developers write safer inline assembly code by designing and developing automated techniques helping to achieve interface compliance, i.e. ensuring that the assembly template and its interface are consistent with each other. This is challenging for several reasons:
Define
The method must be built on a (currently missing) proper formalization of interface compliance, both realistic and amenable to automated formal verification;
Check, Patch & Refine
The method must be able to check whether an assembly chunk complies with its interface, but ideally it should also be able to automatically suggest patches for bugs or code refinements;
Wide applicability
The method must be generic enough to encompass several architectures, at least x86 and ARM.
Fehnker et al. [5] published the only attempt we know of to inspect the interface written by the developer. Yet, their definition of interface compliance is syntactic and incomplete – for example they cannot detect the glibc issue mentioned above. Moreover, they do not cover all subtleties of GCC inline assembly (e.g., token constraints), consider only compliance checking (neither patching nor refinement) and the implementation is tightly bound to ARM (much simpler than x86). Note that recent attempts for verifying codes mixing C and assembly [1], [6] simply assume interface compliance.
Proposal and contributions.
We propose RUSTINA, the first sound technique for comprehensive automated interface compliance checking, automated patch synthesis and interface refinements. We claim the following contributions:
• a novel semantic and comprehensive formalization of the problem of interface compliance (Sec. IV), amenable to formal verification;
• a new semantic method (Sec. V) to automatically verify the compliance of inline assembly chunks, to generate a corrective patch for the majority of compliance issues and additionally to suggest interface refinements;
• thorough experiments (Sec. VII) of a prototype implementation (Sec. VI) on a large set of x86 real-world examples (all inline assembly found in a Debian Linux distribution), demonstrating that RUSTINA is able to automatically check and curate a large code base (202 packages, 2640 assembly chunks) in a few minutes, detecting 2036 issues and solving 95% of them;
• a study of current inline assembly coding practices (Sec. VIII); besides identifying the common compliance issues found in the wild (Sec. VII-A), we also exhibit 6 recurring patterns leading to the vast majority (97%) of compliance issues and show that 5 of them rely on fragile assumptions and can lead to serious bugs (Sec. VIII).
At the time of writing, 38 patches have already been accepted by 7 projects, solving 156 significant issues (Sec. VII-C).
Summary.
Inline assembly is a delicate practice. RUSTINA aids developers in achieving interface-compliant inline assembly code. Compliant assembly chunks can still be buggy but RUSTINA automatically removes a whole class of problems. Our technique has already helped several renowned projects fix code, with positive feedback from developers.
Note: supplementary material, including prototype and benchmark data, is available online [7].
II. CONTEXT AND MOTIVATION
The code in Fig. 1 is an extract from libatomic_ops, commit 30cea1b, dating back to early 2012. It was replaced 6 months later by commit 64d81cd because it led to a segmentation fault when compiled with Clang. By 2020, another latent bug was still lurking until it was automatically discovered and patched by our prototype RUSTINA (commit 05812c2).
AO_INLINE int
AO_compare_double_and_swap_double_full(volatile AO_double_t *addr,
                                       AO_t old_val1, AO_t old_val2,
                                       AO_t new_val1, AO_t new_val2)
{
  char result;
  [...]
  __asm__ __volatile__("xchg %%ebx,%6;" /* swap GOT ptr and new_val1 */
                       "lock; cmpxchg8b %0; setz %1;"
                       "xchg %%ebx,%6;" /* restore ebx and edi */
                       : "=m"(*addr), "=a"(result)
                       : "m"(*addr), "d" (old_val2), "a" (old_val1),
                         "c" (new_val2), "D" (new_val1)
                       : "memory");
  [...]
  return (int) result;
}
Figure 1: atomic_ops/sysdeps/gcc/x86.h@30cea1b
What the code is about.
This function uses inline assembly to implement the standard atomic primitive Compare And Swap – i.e. write new_val in *addr if the latter still equals old_val (where 8-byte values old_val and new_val are split in 4-byte values old_val1, old_val2, etc.). The assembly statement (syntax discussed in Sec. III) comprises assembly instructions (e.g., "lock; cmpxchg8b %0;") building an assembly template where some operands have been replaced by tokens (e.g., %0) that will be later assigned by the compiler. It also has a specification, the interface, binding together assembly registers, tokens and C expressions: line 196 (the first ‘:’ section) declares the outputs, i.e. C values expected to be assigned by the chunk; lines 197 and 198 (the second ‘:’ section) declare the inputs, i.e. C values the compiler should pass to the chunk. The string placed before a C expression is called a constraint and indicates the set of possible assembly operands this expression can be bound to by the compiler. For instance, "d" (old_val2) indicates that register %edx should be initialized with the value of old_val2, while "=a" (result) indicates the value of result should be collected from %eax. Token %0 introduced by "=m" (*addr) is an indirect memory access: its address, arbitrarily denoted &0 here, can be bound to several possibilities (cf. Fig. 5) – including %esi or %ebx.
Fig. 2 gives the functional meaning of this binding along with the semantics of the assembly instructions (where “::” is the concatenation, “←c−” a conditional assignment, “e{h..l}” the bits extraction and “zext n” the zero extension to size n).
[Figure 2: Assembly statement semantics. Input bindings: &0 ← addr, %edx ← old_val2, %eax ← old_val1, %ecx ← new_val2, %edi ← new_val1. Instructions: xchg %ebx, %edi; lock cmpxchg8b %0; setz %al; xchg %ebx, %edi. Output binding: result ← a bit extraction of %eax.]
This example allows us to introduce the concept of interface compliance issues and the associated miscompilation problems: A) (framing condition) incomplete interfaces, possibly leading to miscompilations due to wrong data dependencies; B) (unicity) ambiguous interfaces, where the result depends on compiler choices for token allocation.
A) An incomplete frame definition.
Here, register %edx is declared as read-only (by default, non-output locations are) whereas it is overwritten by instruction cmpxchg8b (cf. Fig. 2). %edx should be declared as output as well.
Impact:
The compiler exclusively relies on the interface to know the framing condition – i.e. which locations are read or written. When this information is incomplete, data dependencies are miscalculated, potentially leading to incorrect optimizations. Here, the compiler believes %edx still contains old_val2 after the assembly chunk is executed, while it is not the case.
Note that %ebx and %edi are not missing the output attribute: while overwritten by the xchg instructions, they are then restored to their initial value.
B) Ambiguous interface.
Here, while most of the binding is fixed, the compiler still has to bind &0 according to constraint "m". Yet, if the compiler legitimately chooses %ebx, the data dependencies in the assembly itself differ from the expected ones: pointer addr is exchanged with new_val1 just before being dereferenced, which is not the expected behaviour. The problem here is that the result cannot be predicted as it depends on token resolution from the compiler.
Impact: the function is likely to end up in a segmentation fault when compiled by Clang. Historically, GCC was not able to select %ebx and the bug did not manifest, but Clang had no such restriction.
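The ambiguity can be made concrete with a small simulation. The sketch below is only an illustration under simplifying assumptions, not part of RUSTINA: the toy machine, its word-addressed 16-word memory, the concrete values and the helper names (`struct machine`, `low_word_after`) are all hypothetical, and `setz` is omitted since it does not affect memory. It runs the instruction sequence of Fig. 1 twice, once with the address of token %0 allocated in %esi and once in %ebx.

```c
#include <assert.h>
#include <stdint.h>

enum reg { EAX, EBX, ECX, EDX, ESI, EDI, NREGS };

/* Toy machine (hypothetical): a few registers and a tiny
 * word-addressed memory. */
struct machine {
    uint32_t r[NREGS];
    uint32_t mem[16];
};

static void xchg(struct machine *m, enum reg a, enum reg b)
{
    uint32_t tmp = m->r[a];
    m->r[a] = m->r[b];
    m->r[b] = tmp;
}

/* cmpxchg8b on the two words whose address is held in register `base`:
 * compare %edx:%eax with memory, store %ecx:%ebx on success, otherwise
 * load memory into %edx:%eax. Returns the resulting zero flag. */
static int cmpxchg8b(struct machine *m, enum reg base)
{
    uint32_t addr = m->r[base];
    assert(addr < 15);                 /* stay inside the toy memory */
    if (m->mem[addr] == m->r[EAX] && m->mem[addr + 1] == m->r[EDX]) {
        m->mem[addr] = m->r[EBX];
        m->mem[addr + 1] = m->r[ECX];
        return 1;
    }
    m->r[EAX] = m->mem[addr];
    m->r[EDX] = m->mem[addr + 1];
    return 0;
}

/* Runs the template of Fig. 1 with the address of token %0 allocated in
 * register `base`, and returns the low word of *addr afterwards. */
static uint32_t low_word_after(enum reg base)
{
    struct machine m = { {0}, {0} };
    const uint32_t addr = 4;
    m.mem[addr] = 10;  m.mem[addr + 1] = 20;   /* current *addr      */
    m.r[EAX] = 10;     m.r[EDX] = 20;          /* old_val1, old_val2 */
    m.r[EDI] = 8;      m.r[ECX] = 40;          /* new_val1, new_val2 */
    m.r[base] = addr;                          /* compiler's choice  */

    xchg(&m, EBX, EDI);                        /* xchg %ebx, %6      */
    (void)cmpxchg8b(&m, base);                 /* lock; cmpxchg8b %0 */
    xchg(&m, EBX, EDI);                        /* xchg %ebx, %6      */
    return m.mem[addr];
}
```

With `low_word_after(ESI)` the swap succeeds and the low word of `*addr` becomes new_val1, whereas with `low_word_after(EBX)` the address has been exchanged with new_val1 before the dereference, so `*addr` is left untouched: the same inputs yield different outputs depending only on the token assignment. On a real machine the second case dereferences an arbitrary value, hence the likely segmentation fault.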
The problem.
These compliance issues are really hard to find either manually or syntactically. First, there is clearly no hint from the assembly template itself ("cmpxchg8b %0") that register %edx is modified. Second, complex token binding and aliasing constraints must be taken into account. Third, subtle data flows must be taken into account – for example a read-only value modified then restored is not a compliance issue.
RUSTI N A insights.
To circumvent these problems, we have developed RUSTINA, an automated tool to check inline assembly compliance (i.e. formally verifying the absence of compliance errors) and to patch the identified issues.
RUSTINA builds upon an original formalization of the inline assembly interface compliance problem, encompassing both framing and unicity. From that, our method lifts the chunk to a binary-level Intermediate Representation (sketched in Fig. 2) and adapts the classical data-flow analysis framework (kill-gen [8]) in order to achieve sound interface compliance verification – in particular, RUSTINA reasons about token assignments. From the expected interface, it infers for each token an overapproximation of the set of valid locations and then computes the set of locations that shall not be altered before the token is used. Here, it deduces that writing in register %ebx may impact token %0. Also, it detects that a write occurs in the read-only register %edx, thus successfully reporting the two issues.
Moreover, RUSTINA automatically suggests patches for the two issues. For framing, Fig. 3 highlights the core differences between the two versions (%edx is now rightfully declared as output with "=d") – a similar patch now lives in the current version of the function (commit 05812c2). For unicity, it suggests declaring %ebx as clobber, yielding a working fix. Yet, this also over-constrains the interface – the syntax does not allow a simple disequality between %0 and %ebx. Developers actually patched the issue in 2012 in a completely different way by rewriting the assembly template (commit 64d81cd) – such a solution is out of RUSTINA's scope.
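The frame-write reasoning on this example, including the fact that a value modified then restored is not an issue, can be sketched as follows. This is a deliberately simplified illustration and not RUSTINA's algorithm: it handles straight-line code only, ignores memory and tokens, and the operation encoding (`struct op`, `OP_XCHG`, `OP_WRITE`) and function names are hypothetical. Each register carries a symbolic value identifier; at the exit, every register that is not declared assignable must still carry its initial identifier.

```c
#include <assert.h>
#include <stddef.h>

enum reg { EAX, EBX, ECX, EDX, ESI, EDI, NREGS };
enum kind { OP_XCHG, OP_WRITE };

/* Hypothetical, minimal abstraction of an instruction: either an
 * exchange of two registers or a write of an unknown value to `a`
 * (`b` is ignored for OP_WRITE). */
struct op { enum kind kind; enum reg a, b; };

/* Returns the bitmask of registers that may not hold their initial
 * value at exit although they are not in `assignable` (a bitmask of
 * output and clobber registers). */
static unsigned frame_write_violations(const struct op *ops, size_t n,
                                       unsigned assignable)
{
    int val[NREGS];        /* symbolic value currently held by each register */
    int fresh = NREGS;     /* identifiers >= NREGS denote unknown values     */
    assert(ops != NULL || n == 0);
    for (int r = 0; r < NREGS; ++r)
        val[r] = r;        /* identifier r stands for the initial value of r */
    for (size_t i = 0; i < n; ++i) {
        if (ops[i].kind == OP_XCHG) {
            int tmp = val[ops[i].a];
            val[ops[i].a] = val[ops[i].b];
            val[ops[i].b] = tmp;
        } else {
            val[ops[i].a] = fresh++;
        }
    }
    unsigned bad = 0;
    for (int r = 0; r < NREGS; ++r)
        if (val[r] != r && !(assignable & (1u << r)))
            bad |= 1u << r;
    return bad;
}

/* Register effects of the chunk of Fig. 1. */
static const struct op fig1_chunk[] = {
    { OP_XCHG,  EBX, EDI },   /* xchg %ebx, %edi                     */
    { OP_WRITE, EAX, EAX },   /* cmpxchg8b may overwrite %eax ...    */
    { OP_WRITE, EDX, EDX },   /* ... and %edx                        */
    { OP_WRITE, EAX, EAX },   /* setz %al                            */
    { OP_XCHG,  EBX, EDI },   /* xchg %ebx, %edi (restores both)     */
};
```

With the original interface, where only %eax is an output, the sketch flags %edx and nothing else; declaring %edx as output, as in the patch of Fig. 3, clears the report, while %ebx and %edi are never flagged because the second exchange restores them.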
@@ -193,5 +193,6 @@
- __asm__ __volatile__("xchg %%ebx,%6;" /* swap GOT ptr and new_val1 */
+ AO_t dummy;
+ __asm__ __volatile__("xchg %%ebx,%7;" /* swap GOT ptr and new_val1 */
  "lock; cmpxchg8b %0; setz %1;"
- "xchg %%ebx,%6;" /* restore ebx and edi */
- : "=m"(*addr), "=a"(result)
- : "m"(*addr), "d" (old_val2), "a" (old_val1),
+ "xchg %%ebx,%7;" /* restore ebx and edi */
+ : "=m"(*addr), "=a"(result), "=d" (dummy)
+ : "m"(*addr), "2" (old_val2), "a" (old_val1),
Figure 3: Frame-write corrective patch
Generic and automatic, our approach is well suited to handle what expert developers failed to detect, while a simpler “bad” patterns detection approach would struggle against both the combinatorial complexity induced by the size of architecture instruction sets and the underlying reasoning complexity (dataflow, token assignments). Overall, RUSTINA found and patched many other significant issues in several well-known open source projects (Sec. VII).
III. GNU INLINE ASSEMBLY SYNTAX
Overview.
This feature allows the insertion of assembly instructions anywhere in the code without the need to call an externally defined function. Fig. 4 shows the concrete syntax of an inline assembly block, which can either be basic when it contains only the assembly template or extended when it is supplemented by an interface. This section concerns the latter only. The assembly statement consists of “a series of low-level instructions that convert input parameters to output parameters” [4]. The interface binds C lvalues (i.e., expressions evaluating to C memory locations) and expressions to assembly operands specified as input or output, and declares a list of clobbered locations (i.e., registers or memory cells whose values could change). For the sake of completeness, the statement can also be tagged with volatile, inline or goto qualifiers, which are irrelevant for interface compliance and thus not discussed in this paper. The interface bindings described above are written as string specifications, which we now explain.
Templates.
The assembly text is given in the form of a formatted string template that, like printf, may contain so-called tokens (i.e., placeholders). These start with % followed by an optional modifier and a reference to an entry of the interface, either by name (an identifier between square brackets) or by a number denoting a positional argument. The compiler preprocesses the template, substituting tokens by assembly operands according to the entries and the modifiers (note that only a subset of x86 modifiers is fully documented [9]) and then emits it as is in the assembly output file.
⟨statement⟩ ::= ‘asm’ [ ‘volatile’ ] ‘(’ ⟨template:string⟩ [ ⟨interface⟩ ] ‘)’
⟨interface⟩ ::= ‘:’ [ ⟨outputs⟩ ] ‘:’ [ ⟨inputs⟩ ] ‘:’ [ ⟨clobbers⟩ ]
⟨outputs⟩ ::= ⟨output⟩ [ ‘,’ ⟨outputs⟩ ]
⟨inputs⟩ ::= ⟨input⟩ [ ‘,’ ⟨inputs⟩ ]
⟨clobbers⟩ ::= ⟨clobber:string⟩ [ ‘,’ ⟨clobbers⟩ ]
⟨output⟩ ::= [ ‘[’ ⟨identifier⟩ ‘]’ ] ⟨constraint:string⟩ ‘(’ ⟨Clvalue⟩ ‘)’
⟨input⟩ ::= [ ‘[’ ⟨identifier⟩ ‘]’ ] ⟨constraint:string⟩ ‘(’ ⟨Cexpression⟩ ‘)’
Figure 4: Concrete syntax of an extended assembly chunk
Clobbers.
They are names of hard registers whose values may be modified by the execution of the statement, but not intended as output. Clobbers must not overlap with inputs and outputs. The "cc" keyword identifies, when it exists, the conditional flags register. The "memory" keyword instructs the compiler that arbitrary memory could be accessed or modified.
a = { %eax }   b = { %ebx }   c = { %ecx }   d = { %edx }   S = { %esi }   D = { %edi }
U = a ∪ c ∪ d
q = Q = a ∪ b ∪ c ∪ d
i = n = ℤ
r = R = q ∪ S ∪ D ∪ { %ebp }
p = { r_b + k × r_i + c  for r_b ∈ r ∪ { %esp } ∪ { 0 } and r_i ∈ r ∪ { 0 } and k ∈ { 1, 2, 4, 8 } and c ∈ i }
m = { *p for p ∈ p }
g = i ∪ r ∪ m
Figure 5: GCC i386 architecture constraints
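The register part of Fig. 5 can be read as a small table from constraint letters to register sets. The sketch below encodes it with bitmasks; the encoding and the names `letter_regs` and `constraint_regs` are assumptions made for illustration and do not come from GCC or RUSTINA. Immediate and memory operands (letters such as 'i' or 'm') are not modelled, and alternatives separated by ',' are simply merged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bitmask encoding of the i386 registers of Fig. 5. */
enum {
    REG_EAX = 1 << 0, REG_EBX = 1 << 1, REG_ECX = 1 << 2,
    REG_EDX = 1 << 3, REG_ESI = 1 << 4, REG_EDI = 1 << 5,
    REG_EBP = 1 << 6
};

/* Register set denoted by one atomic constraint letter; 0 when the
 * letter denotes no register (e.g. 'm', 'i') or is a modifier. */
static unsigned letter_regs(char c)
{
    switch (c) {
    case 'a': return REG_EAX;
    case 'b': return REG_EBX;
    case 'c': return REG_ECX;
    case 'd': return REG_EDX;
    case 'S': return REG_ESI;
    case 'D': return REG_EDI;
    case 'U': return REG_EAX | REG_ECX | REG_EDX;
    case 'q': case 'Q':
        return REG_EAX | REG_EBX | REG_ECX | REG_EDX;
    case 'r': case 'R': case 'g':   /* for 'g', register part only */
        return letter_regs('q') | REG_ESI | REG_EDI | REG_EBP;
    default:
        return 0;
    }
}

/* Registers allowed by a whole constraint string: the union over its
 * letters. Modifiers such as '=', '+' or '&' contribute nothing. */
static unsigned constraint_regs(const char *s)
{
    assert(s != NULL);
    unsigned set = 0;
    for (; *s != '\0'; ++s)
        set |= letter_regs(*s);
    return set;
}
```

For instance, `constraint_regs("=a")` yields only %eax, while `constraint_regs("rm")` yields the same registers as `constraint_regs("r")` because the memory alternative is outside this register-only view.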
Constraints.
A third language describes the set of valid assembly operands for token assignment. The latter are of 3 kinds: an immediate value, a register or a memory location. Fig. 5 gives a view of common atomic constraints (“letters”) used in x86. Constraint entries can have more than one atomic constraint (e.g., "rm"), in which case the compiler chooses among the union of operand choices. The language allows organizing constraints into multiple alternatives, separated by ‘,’. Additionally, a matching constraint between an input token and an output token forces them to be equal; early clobber ‘&’ informs the compiler that it must not attempt to use the same operand for this output and any non-matched input; commutative pair ‘%’ makes an input and the next one exchangeable.
Finally, output constraints must start either with ‘=’ for the write-only mode or with ‘+’ for the read-write permission.
IV. FORMALIZING INTERFACE COMPLIANCE
A. Extended assembly
Assembly chunks.
We denote by $C : \mathtt{asm}$ a standard chunk of assembly code. Such a chunk operates over a memory state $M : \mathtt{mstate}$, that is a map from locations (registers of the underlying architecture or memory cells) to basic values (int8, int16, int32, etc.). We call $A : \mathtt{value\ set}$ the set of valid addresses for a given architecture. The value of an expression in a given memory state is given by function $\mathrm{eval} : \mathtt{mstate} \times \mathtt{expression} \mapsto \mathtt{value}$. The set of valid assembly expressions is architecture-dependent (Fig. 5 is for i386). We abstract it as a set of expressions built over registers, memory accesses * and operations. Finally, an assembly chunk $C$ can be executed in a memory state $M$ to yield a new memory state $M'$ with function $\mathrm{exec} : \mathtt{asm} \times \mathtt{mstate} \mapsto \mathtt{mstate}$. Fig. 6 recaps the above functions and types.
exec : asm × mstate ↦ mstate
eval : mstate × expression ↦ value
mstate : location ↦ value
expression as e ::= value | register | *e | e + e | e × e | ...
location ::= register | value
register ::= %eax | %ebx | %ecx | %edx | ... // case of x86
value : int8 | int16 | int32 | ...
Figure 6: Assembly types
Assembly templates.
Inline assembly does not directly use assembly chunks, but rather assembly templates, denoted $C^\diamond : \mathtt{asm}^\diamond$, which are assembly chunks where some operands are replaced by so-called tokens, i.e., placeholders for regular assembly expressions to be filled by the compiler (formally, they are identifiers %0, %1, etc.). Given a token assignment $T : \mathtt{token} \mapsto \mathtt{expression}$, we can turn an assembly template $C^\diamond : \mathtt{asm}^\diamond$ into a regular assembly chunk $C : \mathtt{asm}$ using standard syntactic substitution $\langle\,\rangle$, denoted $C^\diamond\langle T\rangle : \mathtt{asm}$. The value of token $t$ through assignment $T$ is given by $\mathrm{eval}(M, T(t))$.
Formal interface.
We model an interface $I \triangleq (B_O, B_I, S_T, S_C, F)$ as a tuple consisting of output tokens $B_O : \mathtt{token\ set}$, input tokens $B_I : \mathtt{token\ set}$, a memory separation flag $F : \mathtt{bool}$, clobber registers $S_C : \mathtt{register\ set}$ and valid token assignments $S_T : T\ \mathtt{set}$.
• Input and output tokens bind the assembly memory state and the C environment. Informally, the locations pointed to by tokens in $B_I$ are input initialized by the value of some C expressions while the values of the tokens in $B_O$ are output to some C lvalues. $B_O \cup B_I$ contains all token declarations and $B_O \cap B_I$ may be non-empty;
• If the flag $F$ is set to false, then assembly instructions may have side-effects on the C environment – otherwise they operate on separate memory parts;
• $S_C$ and $S_T$ provide additional information about how the compiler should instantiate the assembly template to machine code: the clobber registers in $S_C$ can be used for temporary computations during the execution (their value is possibly modified by the chunk), while $S_T$ represents all possible token assignments the compiler is allowed to choose – the GNU syntax typically leads to equality, disequality and membership constraints between tokens and (sets of) registers.
Note: actually, a concrete interface also contains initializer and collector expressions in order to bind I/O assembly locations input and output to C. We skip them for clarity, as they do not impact compliance.
Extended assembly chunk.
An extended assembly chunk $X \triangleq (C^\diamond, I)$ is a pair made of an assembly template $C^\diamond$ and its related interface $I$. The assembly template is the operational content of the chunk (modulo token assignment) while the interface is a contract between the chunk, the C environment and low-level location management.
B. (Detail) From GNU concrete syntax to formal interfaces
Let us see how the formal interface $I$ is derived from concrete GNU syntax (Fig. 4). Tokens $B_O$ and $B_I$ come from the corresponding output and input lists except that: a) if an output entry is declared using the ‘+’ modifier then it is added to both $B_O$ and $B_I$; and b) if an input token and an output token are necessarily mapped to the same register, they are unified. Each register in the clobber list belongs to $S_C$. If the clobber list contains "memory", the memory separation flag $F$ is false, true otherwise. The set $S_T$ of valid token assignments $T$ is formally derived in 3 steps:
1) collection of string constraints, splitting constraints by alternative (i.e., ‘,’): $(\mathtt{token} \mapsto \mathtt{string})\ \mathtt{set}$;
2) architecture-dependent (e.g., Fig. 5) evaluation of string constraints: $(\mathtt{token} \mapsto \mathtt{expression\ set})\ \mathtt{set}$, representing a disjunction of conjunctions of atomic membership constraints $\mathtt{token} \in \{exp, \ldots, exp\}$;
3) flattening: $(\mathtt{token} \mapsto \mathtt{expression})\ \mathtt{set}$, representing a disjunction of conjunctions of atomic equality constraints $\mathtt{token} = \mathtt{expression}$.
Still, token assignments must respect the following properties (and are filtered out otherwise):
• every output token maps to an assignable operand, either a register or a *e expression;
• every output token maps to a distinct location;
• each token maps to a clobber-free expression,
where a clobber-free expression is an expression without any clobber register nor any early-clobber sub-expression (i.e. containing the mapping of an early-clobber token, introduced by the ‘&’ modifier).
Fig. 7 exemplifies the interface formalization of Fig. 1's chunk introduced in Sec. II. Tokens $B_O$ and $B_I$ simply enumerate the entries present in the output and input lists respectively (L196-198). The 5th entry matches the same register %eax as the second, so %4 is unified with %1. For the sake of brevity, we split the set of token assignments into two parts: one invariant w.r.t. compiler choices, and one that may vary (we only list 4 of them but there are other valid combinations of memory references). Finally, it has no clobbered register and, because of keyword "memory", memory separation is false.
B_O = { %0, %1 }, B_I = { %2, %3, %5, %6 }
S_T = { [ %1 ↦ %eax, %3 ↦ %edx, %5 ↦ %ecx, %6 ↦ %edi ] } // fixed assignments
    × { [ %0 ↦ *%esi, %2 ↦ *%esi ], [ %0 ↦ *%ebp, %2 ↦ *%ebp ], // possible variations
        [ %0 ↦ *%esi, %2 ↦ *%ebp ], [ %0 ↦ *%ebx, %2 ↦ *%ebx ], ... }
S_C = ∅
F = false
Figure 7: Formal interface I
C. Interface compliance
An extended assembly chunk $X \triangleq (C^\diamond, I)$ is said to be interface compliant if it respects both the framing and the unicity conditions that we define below.
Observational equivalences.
As a first step, we define an equivalence relation $\cong^{T}_{B,F}$ over memory states modulo a token assignment $T$, a set of observed tokens $B$ and a memory separation flag $F$. We start by defining an equivalence relation $\overset{\diamond}{\sim}{}^{T}_{B}$. We say that $M_1 \overset{\diamond}{\sim}{}^{T}_{B} M_2$ if, for all tokens $t$ in $B$, $\mathrm{eval}(M_1, T(t)) = \mathrm{eval}(M_2, T(t))$. We can generalize it to any pair of token assignments $T_1$ and $T_2$: $M_1 \overset{\diamond}{\sim}{}^{T_1,T_2}_{B} M_2$ if, for all tokens $t$ in $B$, $\mathrm{eval}(M_1, T_1(t)) = \mathrm{eval}(M_2, T_2(t))$. Then, we define an equivalence relation $\overset{\bullet}{\sim}$ over memory states. We say that $M_1 \overset{\bullet}{\sim} M_2$ if for all (address) locations $l$ in $A$, $M_1(l) = M_2(l)$. The equivalence relation $\cong^{T}_{B,F}$ over memory states modulo a token assignment $T$ (which can be generalized to a pair $T_1$ and $T_2$ as above), a set of tokens $B$ and a memory separation flag $F$ is finally defined as: $M_1 \cong^{T}_{B,F} M_2$ if
$$M_1 \overset{\diamond}{\sim}{}^{T}_{B} M_2 \;\wedge\; (F = \mathit{false} \implies M_1 \overset{\bullet}{\sim} M_2)$$
Framing condition.
The framing condition restricts what can be read and written by the assembly template. Given a token assignment $T$, we define a location input (resp. location output) as a location pointed to by an input (resp. output) token. Then the framing condition stipulates that: (frame-read) only initial values from location inputs can be read; (frame-write) only clobber registers and location outputs are allowed to be modified by the assembly template.
More formally, a location is assignable if it can be modified (i.e., if it is mapped to by an output token $t$, belongs to the clobber set $S_C$ or is a memory location in $A$ when there is no separation, $\neg F$), and non-assignable otherwise. We then have:
frame-write: for all $M$, for all $T$ in $S_T$, for all non-assignable locations $l$: $M(l) = \mathrm{exec}(M, C^\diamond\langle T\rangle)(l)$.
frame-read: for all $M_1$, $M_2$ and $T$ in $S_T$ such that $M_1 \cong^{T}_{B_I,F} M_2$: $\mathrm{exec}(M_1, C^\diamond\langle T\rangle) \cong^{T}_{B_O,F} \mathrm{exec}(M_2, C^\diamond\langle T\rangle)$.
Unicity.
Informally, the unicity condition is respected when the evaluation of output tokens is independent from the chosen token assignment. More formally, for all $M_1$, $M_2$, $T_1$ and $T_2$ in $S_T$ such that $M_1 \cong^{T_1,T_2}_{B_I,F} M_2$: $\mathrm{exec}(M_1, C^\diamond\langle T_1\rangle) \cong^{T_1,T_2}_{B_O,F} \mathrm{exec}(M_2, C^\diamond\langle T_2\rangle)$.
Note that frame-read is a sub-case of unicity where $T_1 = T_2$.
V. CHECK, PATCH AND REFINE
Figure 8 presents an overview of RUSTINA. The tool takes as input a C file containing inline assembly templates in GNU syntax. From there, it parses the template code to produce an intermediate representation (IR) of the template $C^\diamond$, and interprets the concrete interface to produce a formal interface $I$. The tool then checks that the code complies with its interface using dedicated static dataflow analysis. If it succeeds, we have formally verified that the assembly template complies with its interface. If not, our tool examines the difference between the formal interface computed from the code and the one extracted from the specification; it can then produce a patch (if some elements of the interface were forgotten) or refine the interface (if the declared interface is larger than needed). We cannot produce a patch in every situation; when we cannot, the tool reports a compliance alarm. Such alarms can be false alarms, but this rarely happens on real code. Algorithms are fully detailed in the companion report [7].
A. Preliminary: code semantics extraction
Our analyses rely on Intermediate Representations (IR) for binary code, coming from lifters [10], [11] developed for the context of binary-level program analysis. We use the IR of the BINSEC platform [12], [13] (Fig. 9), but all such IRs are similar. They encode every machine code instruction into a small well-defined side-effect free language, typically an imperative language over bitvector variables (registers) and arrays (memory), providing jumps and conditional branches. Still, code lifters do not operate directly on assembly templates but on machine code, requiring a little extra work to recover the tokens. We replace each token in the assembly template by a distinct register, use an existing assembler (GAS) to transform the new assembly chunk into machine code and then lift it to IR. We perform the whole operation again where each token is mapped to another register, so as to distinguish tokens from hard-coded registers. Tokens are then replaced in IR by distinct new variable names.
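The token-recovery trick can be illustrated by comparing the register operands obtained from the two lifting runs. The sketch below is a hypothetical simplification: it assumes each run has already been flattened into an array of register identifiers (one per operand position) and that every token received a different register in the second run; the function name `mark_tokens` is invented for the example and does not come from RUSTINA or BINSEC.

```c
#include <assert.h>
#include <stddef.h>

/* Compares the operands of two lifted versions of the same template,
 * obtained under two different token-to-register mappings. Positions
 * where the registers differ must come from a token; positions where
 * they agree are hard-coded registers. Fills is_token[] and returns
 * the number of token positions found. */
static size_t mark_tokens(const int *run1, const int *run2, size_t n,
                          int *is_token)
{
    size_t count = 0;
    assert(n == 0 || (run1 != NULL && run2 != NULL && is_token != NULL));
    for (size_t i = 0; i < n; ++i) {
        is_token[i] = (run1[i] != run2[i]);
        count += (size_t)is_token[i];
    }
    return count;
}
```

For the instruction `xchg %%ebx,%6`, mapping %6 to %esi in the first run and to %edi in the second gives operand lists that agree on the first position (the hard-coded %ebx) and differ on the second, which is therefore recognized as a token.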
B. Compliance Checking
This section discusses our static interface compliance checks. We rely on the dataflow analysis framework [8], intensively used in compilers and software verification. We collect sets of locations (token, register or the whole memory) as dataflow facts, then compare them against the sets expected from the interface. Checking frame-write requires a forward impact analysis, checking frame-read requires a backward liveness analysis, and finally unicity requires a combination of both. Our techniques are over-approximated in order to ensure soundness. Memory is considered as a whole – all memory accesses being squashed as memory, with a number of advantages: it closely follows the interface mechanisms for memory, helps termination (the set of dataflow facts is finite) and saves us the complications of memory-aware static analysis (heap or points-to). Finally, we propose two precision optimizations in order to reduce the risk of false positives (their impact is evaluated in Sec. VII-D).
Frame-write.
Check must ensure that non-assignable loca-tions have the exact same value before and after the execution.As first approximation, a location that is never written (i.e.,never on the Left Hand Side LHS of an assignment) safelykeeps its initial value – since IR expressions are side-effectfree.
Impact analysis iterates forward from the entry of thechunk, collecting the set of LHS locations (either a token ,a register or the whole memory ). We then check thateach LHS location belongs to the set of declared assignablelocations (i.e. B O ∪ S C together with memory if ¬ F ). Frame-read.
Check must ensure that no uninitialized locationis read. This requires to compute (an overapproximation of)the set of live locations (i.e. holding a value that may be readbefore its next definition). Liveness analysis iterates backwardfrom the exit of the chunk, where output locations are live(outputs tokens B O ), computing dependencies of the RightHand Side (RHS) expression of found definitions until thefix-point is reached. We then check at the entry point thateach live location belongs to the set of declared inputs (i.e. B I together with memory if ¬ F ). Unicity.
Check must ensure that compiler choices have no im-pact on the chunk output. What may happen is that a locationis impacted or not by a preceding write depending on the tokenassignment. To check that this does not happen, we first definea relation may_impact over location ◦ (incl. tokens) suchthat l may_impact l (cid:48) is false if we can prove that (writingon) l has no impact on (the evaluation of) l (cid:48) – whatever thetoken assignment. In our implementation, l may_impact l (cid:48) returns false if there is no token assignment where l is asub-expression of l (cid:48) . Then, using previous frame-write and frame-read analyses, we finally check at each assignment toa location l that, for each live location l ’, l may_impact l ’ returns false .We now sketch the implementation of may_impact . Themain challenge is to avoid enumerating all valid token as-signments S T (c.f. Sec. IV-B) . We compute over a smallerset of abstract location facts location ∗ , indicating onlywhether a location is a constant value ( Immediate ), a regis-ter (
Direct register ) or is used to compute the addressof a token (
Indirect register ). We abstract token as-signments by reinterpreting the constraints over location ∗ ,yielding D ∗ : location ◦ (cid:55)→ location ∗ set . We thendefine the relation l ∗ impact ∗ l ∗(cid:48) over location ∗ as: l ∗ impact ∗ l ∗(cid:48) = Direct r impact ∗ Direct r : true Direct r impact ∗ Indirect r : true others : false Finally, we build the relation l may_impact l ’ such thatit returns true (sound) except if one of the following holds: • no l ∗ , l ∗(cid:48) in D ( l ) × D ( l ’) such that l ∗ impact ∗ l ∗(cid:48) ; • l or l ’ belongs to S C ; • l and l ’ are tokens, l is early clobber ("&"); • l is equal to l ’ (independent of compiler choice).NU assembly templateGNU assembly interface+ C IR template C ♦ Formal interface I C ODE SEMANTICSEXTRACTION I NTERFACE SEMANTICSEXTRACTION C HECK P ATCH
Formally verifiedcomplianceFormally compliantpatchCompliance alarm º (cid:29)(cid:29)º
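The may_impact relation described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the tool's code: tokens are identified by hypothetical names (`tok0`, ...), any name without a constraint is taken to be a hard-coded register, only a handful of x86 constraint letters are interpreted (`r`, `m`, `i` and the single-register letters `a`, `b`, `c`, `d`, `S`, `D`), and an `m` operand is assumed to be addressable through any general register.

```python
from typing import Dict, Set, Tuple

GENERAL_REGS = ("eax", "ebx", "ecx", "edx", "esi", "edi")
SINGLE_REG = {"a": "eax", "b": "ebx", "c": "ecx", "d": "edx", "S": "esi", "D": "edi"}

# Abstract location: ("Immediate", ""), ("Direct", reg) or ("Indirect", reg).
Abstract = Tuple[str, str]


def d_star(loc: str, constraints: Dict[str, str]) -> Set[Abstract]:
    """Abstract locations a token may be assigned to; a register is only itself."""
    if loc not in constraints:
        return {("Direct", loc)}
    result: Set[Abstract] = set()
    for letter in constraints[loc]:  # modifiers such as '=' or '&' are skipped
        if letter == "r":
            result |= {("Direct", r) for r in GENERAL_REGS}
        elif letter == "m":
            result |= {("Indirect", r) for r in GENERAL_REGS}
        elif letter == "i":
            result.add(("Immediate", ""))
        elif letter in SINGLE_REG:
            result.add(("Direct", SINGLE_REG[letter]))
    return result


def impact_star(a: Abstract, b: Abstract) -> bool:
    """Writing register r impacts r itself and any address computed from r."""
    return a[0] == "Direct" and b[0] in ("Direct", "Indirect") and a[1] == b[1]


def may_impact(l: str, lp: str, constraints: Dict[str, str], clobbers: Set[str]) -> bool:
    """True (the sound default) unless one of the four exemptions applies."""
    if l == lp:
        return False
    if l in clobbers or lp in clobbers:
        return False
    if l in constraints and lp in constraints and "&" in constraints[l]:
        return False
    return any(impact_star(a, b)
               for a in d_star(l, constraints) for b in d_star(lp, constraints))


constraints = {"tok0": "=r", "tok1": "r", "tok2": "m", "tok3": "i", "tok4": "=&r"}
print(may_impact("tok0", "tok1", constraints, set()))    # True: both may get the same register
print(may_impact("tok0", "tok3", constraints, set()))    # False: an immediate is never impacted
print(may_impact("tok4", "tok1", constraints, set()))    # False: tok4 is early clobber
print(may_impact("ebx", "tok2", constraints, {"ebx"}))   # False: ebx is a declared clobber
```

Working on the abstract sets returned by `d_star` avoids enumerating concrete token assignments, which is the point made in the text; the price is that the answer is only an over-approximation.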
Figure 8: Overview of RUSTInA. [Diagram labels: GNU assembly template; GNU assembly interface + C; code semantics extraction; interface semantics extraction; IR template $C^\diamond$; formal interface $I$; Check; Patch; formally verified compliance; formally compliant patch; compliance alarm.]

    inst    := lv ← e | goto e | if e then goto e else goto e
    lv      := var | @[e]_n
    e       := cst | lv | unop e | binop e e | e ? e : e
    unop    := ¬ | − | zext_n | sext_n | extract_{i..j}
    binop   := arith | bitwise | cmp | concat
    arith   := + | − | × | udiv | urem | sdiv | srem
    bitwise := ∧ | ∨ | ⊕ | shl | shr | sar
    cmp     := = | ≠ | >_u | <_u | >_s | <_s

Figure 9: The BINSEC intermediate representation

Our checkers are semantically sound in the sense that they compute an over-approximation of the assembly template semantics. Hence, successfully checking an extended assembly chunk ensures it is interface-compliant.

On the other hand, our technique could report compliance issues that do not exist (false positives). We propose below two precision improvements:
1. Expression propagation. In Fig. 1, the frame-write check, as is, would report a violation for %ebx and %esi because they are written. Yet, it is a false positive since both end up with their initial value. To avoid it, we perform a symbolic expression propagation for each written location, inlining the definition of written locations into their RHS expressions, and performing IR-level syntactic simplifications such as x − x = 0 or x ⊕ x = 0. Then, at fixpoint and before raising an alarm, the frame-write check tests whether the original value has been restored (no alarm) or not (alarm);

2. Bit-level liveness dependency. In Fig. 1, result takes only the lowest byte of %eax. However, our basic technique will count both z and %eax as live while the high bytes of %eax are actually not; such imprecision may lead to false alarms (Sec. VII-D). We improve our liveness analysis to independently track the status of each location bit. For efficiency, we do not propagate location bits but locations equipped with a bitset representing the status of each of their bits. We modify propagation rules accordingly (especially bit manipulations like extraction or concatenation), with bitwise operations over the bitsets.
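The framing checks and their two precision improvements can be put together in a small Python sketch. It rests on several assumptions that are not in the paper: the IR is reduced to a straight-line list of assignments (no jumps, hence no fixpoint), expressions are nested tuples, locations are 32 bits wide, the algebraic simplifications (x − x, x ⊕ x) are left out, and all function and variable names are hypothetical. It shows a forward pass with expression propagation for frame-write and a backward pass over per-location bitmasks for frame-read.

```python
from typing import Dict, List, Set, Tuple, Union

# Toy straight-line IR. An expression is a location name, an integer constant,
# or a tuple (operator, arg, ...). Two operators get bit-precise treatment:
#   ("extract", lo, hi, e)            bits lo..hi (inclusive) of e
#   ("concat", high, low, low_width)  high placed above the low_width bits of low
Expr = Union[str, int, tuple]
Stmt = Tuple[str, Expr]  # (written location, right-hand side)
FULL = 0xFFFFFFFF        # every bit of a 32-bit location


def substitute(e: Expr, env: Dict[str, Expr]) -> Expr:
    """Expression propagation: inline the symbolic value of each location read."""
    if isinstance(e, str):
        return env.get(e, ("init", e))
    if isinstance(e, int):
        return e
    return (e[0],) + tuple(substitute(a, env) for a in e[1:])


def frame_write_alarms(code: List[Stmt], assignable: Set[str]) -> Set[str]:
    """Written locations that are neither assignable nor restored to their initial value."""
    env: Dict[str, Expr] = {}
    for dst, rhs in code:  # forward pass
        env[dst] = substitute(rhs, env)
    return {loc for loc, value in env.items()
            if loc not in assignable and value != ("init", loc)}


def mark_live(e: Expr, mask: int, live: Dict[str, int]) -> None:
    """Record which bits of the locations read by e can influence the bits in mask."""
    if mask == 0 or isinstance(e, int):
        return
    if isinstance(e, str):
        live[e] = live.get(e, 0) | mask
    elif e[0] == "extract":
        _, lo, hi, arg = e
        mark_live(arg, (mask & ((1 << (hi - lo + 1)) - 1)) << lo, live)
    elif e[0] == "concat":
        _, high, low, low_width = e
        mark_live(low, mask & ((1 << low_width) - 1), live)
        mark_live(high, mask >> low_width, live)
    else:  # any other operator: conservatively, every bit of every operand
        for arg in e[1:]:
            mark_live(arg, FULL, live)


def live_at_entry(code: List[Stmt], outputs: Set[str]) -> Dict[str, int]:
    """Backward pass: bitmask of live bits for each location at the chunk entry."""
    live = {out: FULL for out in outputs}
    for dst, rhs in reversed(code):
        mark_live(rhs, live.pop(dst, 0), live)  # dst is defined here, so it is killed
    return live


def frame_read_alarms(code: List[Stmt], outputs: Set[str], inputs: Set[str]) -> Set[str]:
    """Locations with at least one live bit at entry that are not declared inputs."""
    return set(live_at_entry(code, outputs)) - set(inputs)


# ebx is saved, overwritten and restored; only the low byte of eax reaches result.
chunk = [
    ("tmp", "ebx"),
    ("ebx", "z"),
    ("eax", ("concat", ("extract", 8, 31, "eax"), ("extract", 0, 7, "ebx"), 8)),
    ("result", ("extract", 0, 7, "eax")),
    ("ebx", "tmp"),
]
print(frame_write_alarms(chunk, {"result", "tmp", "eax"}))  # set(): ebx is restored
print(live_at_entry(chunk, {"result"}))                      # {'z': 255}
```

In this example a location-level liveness would report eax as read before being written, because the third statement reads its high bytes; tracking bits shows that those bytes never reach the output, so only z needs to be declared as input.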
C. Interface Patching
When the compliance checking fails, RUSTInA tries to generate a patch to fix the issue. As our dataflow analysis infers an over-approximated interface for the chunk under analysis, we take advantage of it to strengthen the current interface.

Framing condition. We build a patch that makes the template $C^\diamond$ compliant with its formal interface $I$ as follows:
• frame-write: any hard-coded register (resp. token) written without belonging to $S^C$ (resp. $B^O$) is added to it;
• frame-read: any token read without belonging to $B^I$ and without being initialized before is added to $B^I$. Reading a register before assigning it prevents automatic patch generation.
In both cases, any direct access to a memory cell sets memory separation $F$ to false.

We then retrofit the changes of the formal interface in the concrete syntax to produce the patch. For instance, in Fig. 3, token %3 (i.e. %edx) violates the frame-write condition. We add a new output token %2 : "=d" (dummy) bound to its old initializer : "2" (old_val2). Since we add a new token, we take care to keep the template “numbering” consistent.

When a framing issue patch is generated, the resulting chunk is ensured to meet the framing condition.

Unicity. We give the faulty register (resp. token) the clobber (resp. early clobber) status, preventing it from being mis-assigned to another token. Note however that, since we over-constrain the interface (the syntax does not allow declaring a pair of entries as distinct), the patch may fail if there is no valid token assignment left.

When a unicity patch is generated, the resulting chunk is ensured to be fully interface compliant if it still compiles.
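The framing part of the patching rules can be written down as a short Python sketch. It is only an illustration: the `Interface` record, the `framing_patch` function and its parameters are hypothetical names, the analysis results (`written`, `live_in`, `accesses_memory`) are assumed to come from the checks of Sec. V-B, and neither the unicity patch nor the rewriting of the concrete GNU syntax is covered.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Interface:
    inputs: Set[str]      # B^I: input tokens
    outputs: Set[str]     # B^O: output tokens
    clobbers: Set[str]    # S^C: clobbered hard-coded registers
    mem_separated: bool   # F: the chunk is declared not to touch memory


def framing_patch(declared: Interface, tokens: Set[str], written: Set[str],
                  live_in: Set[str], accesses_memory: bool) -> Optional[Interface]:
    """Strengthen the declared interface, or return None when no patch can be generated.

    written: locations reported as written (and not restored) by the frame-write check.
    live_in: locations reported as live at the chunk entry by the frame-read check.
    """
    if live_in - tokens:
        # A hard-coded register is read before being assigned: no automatic patch.
        return None
    return Interface(
        inputs=declared.inputs | (live_in & tokens),
        outputs=declared.outputs | (written & tokens),
        clobbers=declared.clobbers | (written - tokens),
        mem_separated=declared.mem_separated and not accesses_memory,
    )


declared = Interface(inputs={"tok1"}, outputs={"tok0"}, clobbers=set(), mem_separated=True)
patched = framing_patch(declared, tokens={"tok0", "tok1", "tok2"},
                        written={"tok0", "edx"}, live_in={"tok1", "tok2"},
                        accesses_memory=True)
print(patched)
```

Because every field only grows (or, for the memory flag, only weakens), the patched interface accepts everything the analysis observed, which matches the idea of reusing the over-approximated inferred interface.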
D. Bonus: Refining the interface
Even if over-approximated, the interface inferred by RUSTInA during the check may be smaller than the declared one, which allows producing refinement patches that remove unnecessary constraints from the interface. This can in turn give more room to the compiler to produce smaller or faster code.

We can already remove never-read inputs, never-written clobbers or undue "memory" keywords in the absence of memory accesses (these refinements can be disabled for dummy constraints put on purpose). There is another case where a "memory" constraint can be removed. Indeed, as recommended in the documentation, single-level pointer accesses can be declared by common entries using the "m" placement constraint instead of the (much more expensive) "memory" keyword.

We design a dedicated “points-to” analysis to identify the candidates for this transformation. It is based on a dataflow analysis collecting, for each memory access, the precise location (of the form token or symbol + offset) and size of the access. If it succeeds, we can safely remove the "memory" keyword and instead add a new entry (input "m", output "=m" or both, depending on the access pattern) for each of the identified base pointers.

[Footnote to Sec. V-C, on reading a register before assigning it: if this is done on purpose, the chunk actually is out of this paper’s scope.]

Fig. 10 shows an example of refinement happening in libtomcrypt. In the original code, the "memory" constraint was forgotten. We can see that (patch) refinement produces a fix that does not add the missing keyword, but instead changes the way the content pointed to by key is given to the chunk.

      asm __volatile__ (
    -   "movl (%1), %0\n\t"
    +   "movl %1, %0\n\t"
        "bswapl %0\n\t"
    -   :"=r"(rk[0]): "r"(key));
    +   :"=r"(rk[0]): "m"(*(uint32_t*)key));
Figure 10: Smart patch of a libtomcrypt chunk

VI. IMPLEMENTATION
We have implemented RUSTInA, a prototype for interface compliance analysis following the method described in Sec. V. RUSTInA is written in OCaml (∼[…]) and builds on BINSEC [12], [13] for IR lifting (including basic syntactic simplifications), and on GAS to translate assembly into machine code. Our tool can handle a large portion of the x86 and ARM instruction sets. Yet, float and system instructions are not supported (they are unsupported by BINSEC). Despite this, we handle 84% of assembly chunks found in a Debian distribution (Sec. VII).

VII. EXPERIMENTAL EVALUATION
Research questions. We consider 5 research questions:
RQ1. Can RUSTInA automatically check interface compliance on assembly chunks found in the wild?
RQ2. Incidentally, how many assembly chunks exhibit a compliance issue, and which ones are the most frequent?
RQ3. Can RUSTInA automatically patch detected compliance issues?
RQ4. What is the real impact of the compliance issues reported and of the generated patches?
RQ5. What is the impact of RUSTInA design choices on the overall checking result?
Setup. All experiments are run on a regular Dell Precision 5510 laptop equipped with an Intel Xeon E3-1505M v5 processor and 32GB of RAM.

Benchmark. We run our prototype on all C-related x86 inline assembly chunks found in a Linux Debian 8.11 distribution, i.e., 3107 x86 chunks in 202 packages, including big inline assembly users like ALSA, GMP or ffmpeg. We remove 451 out-of-scope chunks (i.e., containing either float or system instructions), keeping 2656 chunks (85% of the initial dataset), with a mean size of 8 assembly instructions (max. size: 341).
A. Checking (RQ1, RQ2)

Table I sums up compliance checking results before (“Initial”) and after patching (“Patched”); we focus here on the Initial case.

Results. RUSTInA reports in less than 2 min (40 ms per chunk on average) that 1292 chunks out of 2656 are (fully) interface compliant (resp. 117 packages out of 202), while 1364 chunks (resp. 85 packages) have compliance issues. Among the noncompliant ones, RUSTInA pinpoints 294 chunks (resp. 54 packages) with serious compliance issues. According to our study in Sec. VIII, we count an issue as benign only when it corresponds to missing the flag register as clobber (P1 in Sec. VIII).

Table I: RUSTInA application on Debian 8.11 x86

(a) Overview at package level
                                               Initial   Patched
    Packages considered                            202
      average chunks                                15
      max chunks                                   384
      fully compliant                              117       178
      only benign issues                            31       […]
      serious issues                                54       […]

(b) Overview at chunk level
                                               Initial   Patched
    Assembly chunks                               3107
      out-of-scope (e.g. floats)                   451
    Relevant chunks                               2656
      average size                                   8
      max size                                     341
      fully compliant                             1292      2568
      only benign issues                          1070       […]
      serious issues                               294       […]

(c) Overview of found issues
                                               Initial   Patched
    Found issues                                  2183       […]
    frame-write                                   1718       […]
      (benign)  flag register clobbered           1197       […]
      (serious) read-only input clobbered           17       […]
      (serious) unbound register clobbered         436       […]
      (serious) unbound memory access               68       […]
    frame-read                                     379       […]
      (serious) non written write-only output       19       […]
      (serious) unbound register read              183       […]
      (serious) unbound memory access              177       […]
    unicity                                         86       […]
Quality assessment. While chunks deemed compliant by RUSTInA are indeed supposed to be compliant (yet, it is still useful to test it), compliance issues could be false alarms. We evaluate these two cases with 4 elements. (qa1) We run RUSTInA on known libatomic_obs and glibc compliance bugs and on their patched versions: every time, RUSTInA returns the expected result. (qa2) We consider 8 significant projects (Sec. VII-C), manually review all their faulty assembly chunks (covering roughly 50% of the serious issues reported in Table Ic) as well as randomly chosen compliant chunks, and crosscheck results with RUSTInA: they perfectly match. (qa3) For compliance proofs, we also run the checker after patching: RUSTInA deems all patched chunks compliant. (qa4) Several patches sent to developers have been accepted (Sec. VII-C).

We conclude that results returned by RUSTInA are good: as expected, a chunk deemed compliant is compliant, and reported compliance issues are most likely true alarms (we do not find any false alarm in our dataset).

ARM benchmark. We also run RUSTInA on the ARM versions of ffmpeg, GMP and libyuv (from Linux Debian 8.11) for a total of 394 chunks (average size 5, max. size 29). We found very few issues (78), all in ffmpeg and related to the use of the special flag q (accumulated saturations). Manual review confirms them. Interestingly, the "cc" keywords are not forgotten in other cases. As flags are explicit in ARM mnemonics, coding practices are different from those for x86.

RQ1: RUSTInA is effective at compliance checking, in terms of speed and precision, yielding compliance proofs and identifying compliance bugs with near-zero false alarm rate. RUSTInA is widely applicable: it runs on the full Debian assembly chunk base and, without change, on 2 different architectures.
Compliance bugs in practice. Our previous precision analysis allows us to assume that a warning from the checker likely indicates a true compliance issue. Hence, according to Tables Ia and Ib, 1364/2656 chunks (resp. 85/202 packages) are not interface-compliant, and 294 chunks (resp. 54 packages) have significant issues. According to Table Ic, 53% of significant issues come from unexpected writes, 38% from unexpected reads, while 9% are unicity problems.

RQ2: About half of the inline x86 assembly chunks found in the wild are not interface-compliant, and a significant part (11%) even exhibits significant compliance issues, affecting 27% of the packages under analysis.
B. Patching (RQ3)

Results. Table I (column “Patched”) shows that RUSTInA performs well at patching compliance issues: in […], it patches 92% of total issues (2000/2183), including 81% of significant issues (803/986). Overall, 1276 more chunks (61 more packages) become fully compliant, reaching 97% compliance on chunks (88% on packages). The remaining issues (unbound register reads) are out of the scope of patching. They often correspond to the case where some registers are used as global memory between assembly chunks while only C variables can be declared as input in inline assembly. This practice is however fragile (special case of pattern P6 in Sec. VIII).

Quality assessment. We assess the quality of the patches by adapting qa1 and qa2 from Sec. VII-A as follows: (qa1′) On known libatomic_obs and glibc compliance bugs, comparing RUSTInA-generated patches to the originals shows that they are functionally equivalent, with similar fixes. (qa2′) We manually review all (114) generated patches on 8 significant projects (Sec. VII-C) and check that they do fix the reported compliance issues. Also, recall that patched chunks pass the compliance checks (qa3) and that several patches have been accepted by developers (qa4). Overall, in most cases our automatic patches are optimal and equivalent to the ones that would be written by a human. Still, the "memory" keyword may have a significant impact on performance and developers usually try to avoid it. We address this issue with refinement (Sec. V-D). Finally, some unicity issues we found were actually resolved by developers by (deeply) rewriting the assembly template, instead of simply patching the interface.

RQ3: RUSTInA effectively generates patches for compliance issues, in terms of speed and patch quality. RUSTInA can automatically curate a large code base, removing the vast majority of compliance issues; the remaining ones require rewriting the code beyond mere interface compliance.
C. Real-life impact (RQ4)

We have selected 8 significant projects from our benchmark (namely: ALSA, ffmpeg, haproxy, libatomic_obs, libtomcrypt, UDPCast, xfstt, x264) to submit patches generated by RUSTInA in order to get real-world feedback. Note that submitting patches is time-consuming: patches must adhere to the project policy and our generated patches cannot be directly applied when the code uses macros (a common practice in inline assembly), as RUSTInA works on preprocessed C files.

Table III presents our results. Overall, we submitted 114 patches fixing 538 issues in the 8 projects. Feedback has been very positive: 38 patches have already been integrated, fixing 156 issues in 7 projects (ALSA, haproxy, libatomic_obs, libtomcrypt, UDPCast, xfstt, x264), and developers clearly expressed their interest in using RUSTInA once released. The ffmpeg patches are still under review.

RQ4: RUSTInA helps efficiently deliver quality patches.
D. Internal evaluation: precision optimizations (RQ5)

The observed absence of false positives in Sec. VII-A already takes into account the two precision enhancers (bit-level liveness analysis and symbolic expression propagation) presented in Sec. V-B. We now seek to assess the impact of these two improvements on the false positive rate (fpr) of RUSTInA. We ran a basic version of RUSTInA (no expression propagation, no bit-level liveness, but still the basic IR simplifications done by BINSEC) on our whole benchmark. It turns out that this basic version reports 127 false alarms (6% fpr): 40 frame-write (2% fpr) and 87 frame-read (23% fpr). All these alarms concern potentially significant issues. Restricting to significant issues, this amounts to false positive rates of 13% (total), 23% (frame-read) and 8% (frame-write). It turns out that our two optimizations are complementary: bit-level liveness analysis removes the 87 false frame-read alarms while expression propagation removes the 40 false frame-write alarms.

The two precision optimizations (expression propagation, bit-level liveness) on top of the RUSTInA base technique are essential in order to get a near-zero false alarm rate.

Table II: Inline assembly recurrent (compliance) error patterns
    Pattern  Omitted clobber  Additional context     Implicit protection  Details                     Robust?                  # issues  Known issue
    P1       "cc"             […]                    […]                  […]                         yes (*)                  1197      –
    P2       %ebx register    –                      compiler choice      %ebx protected in PIC mode  no (GCC ≥ 5)             30        [15]
    P3       %esp register    push/pop               compiler choice      %esp protected              no (GCC ≥ […])           […]       […]
    P4       "memory"         […]                    […]                  […]                         no (inlining, cloning)   285       [17]
    P5       MMX register     single-chunk function  ABI                  MMX are ABI caller-saved    no (inlining, cloning)   363       –
    P6       XMM register     disable XMM            compiler option      no XMM generation           no (cloning)             109       –
(*) There are discussions on the GCC mailing list to change that [18].

Table III: Submitted patches

    Project        About            Status   Patched chunks  Fixed issues  Commit
    ALSA           Multimedia       Applied  20              64/64         01d8a6e, 0fd7f0c
    haproxy        Network          Applied  1               1/1           09568fd
    libatomic_obs  Multi-threading  Applied  1               1/1           05812c2
    libtomcrypt    Cryptography     Applied  2               2/2           cefff85
    UDPCast        Network          Applied  2               2/2           20200328
    xfstt          X Server         Applied  1               3/3           91c358e
    x264           Multimedia       Applied  11              83/83         69771
    ffmpeg         Multimedia       Review   76              382/382
Note: Including 27 non automatically patchable issues, manually fixed.
VIII. BAD CODING PRACTICES FOR INLINE ASSEMBLY
In this section, we aim to: 1) seek some sort of regularity behind so many compliance issues, in order to understand why developers introduce them in the first place; 2) understand at the same time why so many compliance issues do not turn more often into observable bugs; 3) assess the risk of such bugs occurring in the future.
Common error patterns for inline assembly. We have identified 6 patterns (P1 to P6, see Table II) responsible for 91% of compliance issues (1986/2183) and 80% of significant compliance issues (789/986). In each case, some input or output declarations are missing, but surprisingly it almost always concerns the same registers (%ebx, %esp, "cc", MMX or XMM registers) or memory, with similar coding practices (e.g. no XMM declaration together with compiler options for deactivating XMM, or no declaration of %ebx together with surrounding push and pop). Hence, these patterns are deliberate rather than mere coding errors.

Underlying implicit protections and their limits. It turns out that each pattern builds on implicit protections (Table II). We identified three main categories: (1) (P1-P2-P3) compiler choices regarding inline assembly (e.g., “protected” registers, default clobbers); (2) (P4-P5) the apparent protection offered by putting a single assembly chunk inside a C function (relying mostly on the limited interprocedural analysis abilities of compilers); and (3) (P6) specific compiler options.

Yet, all these reasons are fragile: compiler choices may change, and actually do; compilers enjoy more and more powerful program analysis engines, including very aggressive code inlining like link-time optimization (LTO); and refactoring may affect the compilation context.

We now provide a precise analysis of each error pattern:

P1 omitted "cc" keyword. x86 was once a “cc0” architecture, i.e., any inline assembly statement implicitly clobbered "cc" so it was not necessary to declare it as written. As far as we know, compilers still unofficially maintain this special treatment for backward compatibility. However, some claim “that is ancient technology and one day it will be gone completely, hopefully” [18];

P2 omitted %ebx register. The Intel ABI states that %ebx should be treated separately as a special PIC (Position Independent Code) pointer register. Old versions of GCC (prior to version 5.0) totally dedicated %ebx to that role and refrained from binding it to an assembly chunk. Still, some chunks actually require %ebx (e.g. cmpxchg8b) and people used tricks to use it anyway without stating it. It becomes risky because current compilers can now spill and use %ebx as they need;

P3 omitted %esp register. %esp is here modified but restored by push and pop. Yet, compilers may decide to use %esp instead of %ebp to pass addresses of local variables. In fact, it became the default behavior since GCC version 4.6. Thus, code mixing local variable references and push and pop may read the wrong index of the stack, leading to unexpected issues;

P4 omitted "memory". Compiler analyses are often performed per function, with conservative assumptions on the memory impact of called functions, limiting the ability of the compiler to modify (optimize) the context of chunks. This is no longer true in case of inlining, where assembly interface issues become more impactful;

P5 omitted MMX register. For the same reason as above, when a chunk is inside a function, it is also protected by the ABI in use. The Intel ABI specifies that MMX registers are caller-saved, hence the compiler must ensure that their value is restored when the function exits. Yet, inlining may break this pattern since the ABI barrier is not there anymore once the function code is inlined;

P6 omitted XMM register. Using parts of the architecture out of reach of the compiler (the compiler cannot spill them, typically through adequate command-line options) is safe but fragile, as it is sensitive to future refactoring (affecting the compiler options). Moreover, newer compiler options or hardware architecture updates can implicitly reuse registers otherwise deactivated, e.g. XMM registers reused as subparts of AVX registers.
Breaking patterns. We now seek to assess how fragile (or not) these patterns are. Replaying known issues [15], [16], [17] with current compilers shows that patterns P2 to P4 are (still) unsafe. In addition, we conducted experiments to show that current compilers do have the technical capacity to break the patterns. We consider two main threat scenarios:
Cloning: developers copy the chunk as is to another project (bad but common development practice [19], [20]);
Inlining: projects import the code as a library and compile it statically with their code (link-time optimization).

We consider for each pattern 5 representative faulty chunks from the 8 projects. For each chunk, we craft a toy example aggressively tuned to call the (cloned or imported) chunk in an optimization-prone context. For instance, as P5 and P6 issues involve SIMD registers, the corresponding chunks are called within an inner loop while auto-vectorization is enabled (-O3). Results are reported in column “Robust?” of Table II. We actually break 5/6 patterns with code cloning (all but P1), and 4/6 with code inlining, demonstrating that these compliance issues should be considered plausible threats.

We identified a set of 6 recurring patterns leading to the majority of compliance issues. All of them build on fragile assumptions on the compiling chain. In particular, code cloning and compiler code inlining are serious threats.

IX. DISCUSSION
A. Threats to validity
We avoid bias as much as possible in our benchmark: 1) the benchmark is comprehensive: all Debian packages with C-embedded inline assembly; 2) we mostly work on x86, but still consider 394 ARM chunks from 3 popular projects. Our prototype is based on tools already used in significant case studies [21], [22], [23], [24], including a well-tested x86-to-IR decoder [25]. Also, results have been crosschecked in several ways and some of them manually reviewed. So, we feel confident in our main conclusions.
B. Limitations
Architecture. Our implementation supports the architectures of the BINSEC platform, currently x86-32 and ARMv7. This is not a conceptual limitation, as our technique ultimately works on a generic IR. As soon as a new architecture is available in BINSEC, we will support it for free.

Float. We do not yet support float instructions as the BINSEC IR does not. While adding support in the IR is feasible but time-consuming, our technique could also work solely with a partial instruction support reduced to I/O information about each instruction, at the price of some false positives.

System instructions. Our formalization considers assembly chunks as a deterministic way to convert well-identified inputs from the C environment to outputs. But system instructions often read or write locations hidden from the C context (system registers) and will thus appear to be non-deterministic, breaking either the framing or the unicity condition. Extending our formalization is feasible, but it is useful only if the GNU syntax is updated. Still, we consider that at most 13% of assembly chunks use such instructions.
C. Microsoft inline assembly
Microsoft inline assembly (inline MASM) proposed in Visual Studio [26] does not suffer from the same flaws as GNU’s. Indeed, each assembly instruction is known by the compiler, such that no interface is required, and moreover developers can seamlessly write variables from C into the assembly mnemonics. Yet, this solution is actually restricted to a subset of the i386 instruction set, as the cost in terms of compiler development is significantly higher.

X. RELATED WORK
Interface compliance. Fehnker et al. [5] tackle inline assembly compliance checking for ARM (patching and refinement are not addressed), but in a very limited way. This work restricts compliance to the framing case (no unicity condition) and is driven by assembly syntax rather than semantics, making it less precise than ours: for example, a saved-and-restored register will be counted as a frame-write issue. Moreover, it handles neither memory nor token constraints (tokens are assumed to be in registers and to be distinct from each other). Finally, their implementation is strongly tied to ARM with strong syntactic assumptions, and their prototype is evaluated only on 12 files from a single project.

Assembly code lifting and mixed code verification. Two recent works [1], [6] lift GNU inline assembly to semantically equivalent C code in order to perform verification of mixed codes combining C and inline assembly. Their work is complementary to ours: their lifting assumes interface compliance, but in turn they can prove functional correctness of assembly chunks. Verifying code mixing C and assembly has also been an active topic for Microsoft MASM assembly [27], [28], [29]. Yet, inline MASM does not rely on an interface (Sec. IX-C).

Binary-level analysis. While binary-level semantic analysis is hard [30], [31], [32], [33], inline assembly chunks offer nice structural properties [1] allowing efficient and precise analysis. We also benefit from previous engineering efforts on generic binary lifters [10], [11], [25].

XI. CONCLUSION
Embedding GNU-like inline assembly into higher-level languages such as C/C++ allows higher performance, but at the price of potential errors due either to the assembly glue or to undue code optimizations as the compiler blindly trusts the assembly interface. We propose a novel technique to automatically reason about inline assembly interface compliance, based on a clean formalization of the problem. The technique is implemented in RUSTInA, the first sound tool providing comprehensive automated interface compliance checking as well as automated patch synthesis and interface refinements.