Parallel Binary Code Analysis
XIAOZHU MENG, Rice University, USA
JONATHON M. ANDERSON, Rice University, USA
JOHN MELLOR-CRUMMEY, Rice University, USA
MARK W. KRENTEL, Rice University, USA
BARTON P. MILLER, University of Wisconsin-Madison, USA
SRĐAN MILAKOVIĆ, Rice University, USA

Abstract: Binary code analysis is widely used to assess a program's correctness, performance, and provenance. Binary analysis applications often construct control flow graphs, analyze data flow, and use debugging information to understand how machine code relates to source lines, inlined functions, and data types. To date, binary analysis has been single-threaded, which is too slow for applications such as performance analysis and software forensics, where it is becoming common to analyze binaries that are gigabytes in size and in large batches that contain thousands of binaries.

This paper describes our design and implementation for accelerating the task of constructing control flow graphs (CFGs) from binaries with multithreading. Existing research focuses on addressing challenging code constructs encountered while constructing CFGs, including functions sharing code, jump table analysis, non-returning functions, and tail calls. However, existing analyses do not consider the complex interactions between concurrent analyses of shared code, making it difficult to extend existing serial algorithms to be parallel. A systematic methodology to guide the design of parallel algorithms is essential. We abstract the task of constructing CFGs as repeated applications of several core CFG operations for creating functions, basic blocks, and edges. We then derive properties among CFG operations, including operation dependency, commutativity, and monotonicity. These operation properties guide our design of a new parallel analysis for constructing CFGs. We achieved as much as 25× speedup for constructing CFGs on 64 hardware threads. Binary analysis applications are significantly accelerated with the new parallel analysis: we achieve 8× for a performance analysis tool and 7× for a software forensics tool with 16 hardware threads.

Binary code analysis is a foundational technique for a variety of applications, including performance analysis [Adhianto et al. 2010; Ţăpuş et al.
2002; Miller et al. 1995], software correctness [Arnold et al. 2007; Gu and Mellor-Crummey 2018], software security [Jacobson et al. 2014; v. d. Veen et al. 2016; van der Veen et al. 2015], and software forensics [Meng et al. 2017; Rosenblum et al. 2011b]. Important binary code analysis capabilities include constructing control flow graphs (CFGs), analyzing control flow and data flow properties, and extracting source line mappings and data types from debugging information, when it is available. Traditionally, binary analysis applications are single-threaded. However, recent trends in these applications call for improving their performance.

In the field of performance analysis, it is becoming common to optimize the performance of large software systems that compile into multi-gigabyte binaries. We have witnessed this trend within software developed by national laboratories and popular machine learning frameworks such as TensorFlow [Abadi et al. 2016]. The developers of these large software systems use the following
Authors' addresses: Xiaozhu Meng, Department of Computer Science, Rice University, Houston, TX, 77005, USA, Xiaozhu.Meng@rice.edu; Jonathon M. Anderson, Department of Computer Science, Rice University, Houston, TX, 77005, USA, jma14@rice.edu; John Mellor-Crummey, Department of Computer Science, Rice University, Houston, TX, 77005, USA, johnmc@rice.edu; Mark W. Krentel, Department of Computer Science, Rice University, Houston, TX, 77005, USA, krentel@rice.edu; Barton P. Miller, Computer Sciences Department, University of Wisconsin-Madison, Madison, WI, 53706, USA, bart@cs.wisc.edu; Srđan Milaković, Department of Computer Science, Rice University, Houston, TX, 77005, USA, sm108@rice.edu.

performance analysis workflow to optimize their code: (1) compile the source code to generate the binary program, (2) measure the performance of the binary during execution, (3) attribute measurements to the corresponding source code (via binary analysis), and (4) optimize the source based on the performance results. These four steps are repeated until developers are satisfied with their software's performance.

In this performance analysis cycle, binary analysis must be repeated after any source code change because even small code changes can lead to dramatically different binaries, especially with C++ templates and aggressive compiler optimizations. In such a workflow, if the binary analysis in step (3) is slow, it will reduce the throughput and effectiveness of performance analysis. Current single-threaded binary code analysis takes too long to analyze such large binaries: it takes more than 20 minutes to analyze a 7.7GiB shared library from TensorFlow, which would interrupt the workflow of developers tuning the code for production.

In the field of software forensics, researchers have achieved great success in applying machine learning to tasks including compiler identification [Rosenblum et al.
2011a] and authorship attribution [Caliskan-Islam et al. 2018; Meng et al. 2017; Rosenblum et al. 2011b]. These machine learning-based software forensics applications require large training sets to be effective, containing hundreds to thousands of binaries. During the development of these forensic applications, the developers typically repeat the following workflow: (1) design a set of binary code features, (2) extract the features with binary analysis to construct a training set, and (3) validate the accuracy of a model trained using the new features. These three steps are repeated until the developers are satisfied with the effectiveness of the binary code features.

While the training and tuning of machine learning models have traditionally been regarded as the bottleneck of these software forensics applications, modern machine learning packages often provide support for parallel training and tuning, using multithreading and even GPUs. In such scenarios, a serial feature extraction step can be a limiting factor of the development cycle: for example, the feature extraction step in compiler identification [Rosenblum et al. 2011a] may take over 24 hours.

In this paper, we present our design and implementation of parallel binary code analysis to address the speed requirements imposed by these applications. The core of this work is a new parallel analysis for constructing control flow graphs (CFG construction), which constructs functions, basic blocks, and edges between basic blocks. CFG construction is used in nearly every binary analysis application, directly or indirectly.

Modern serial CFG construction algorithms focus on understanding complex machine code generated by compilers [Di Federico et al. 2017; Meng and Miller 2016; Shoshitaishvili et al.
2016]. Complex code constructs such as non-returning functions, tail calls, and jump tables play key roles in understanding high-level programming constructs, making their analysis important for applications. While function-level parallelism is a natural starting point for parallel CFG construction, we must address a range of complex issues:

• Functions may share code. Threads analyzing different functions may end up concurrently analyzing shared code and require synchronization.
• Control flow graphs evolve during analysis. As a result, a parallel algorithm for CFG construction needs to consider concurrent changes made by other threads.
• Current binary code analysis is not designed or implemented with parallelism in mind. Parallelization exposes flaws in existing serial analyses for jump tables and tail call identification.
• While protecting intricate data structures with mutual exclusion is a tempting way to guarantee correctness, the serialization this induces must be carefully evaluated for its impact on parallelism and performance.
To systematically address these issues, we abstract CFG construction as repeated applications of several primitive CFG construction operations. These operations include creation of CFG elements such as functions, basic blocks, and edges; modification of basic block ranges; and removal of blocks and edges. We derive operation properties, including dependencies, commutativity, and monotonicity, and use this theoretical framework to reason about the correctness and performance of CFG construction algorithms. This abstraction allows us to identify flaws in existing serial CFG construction algorithms. Many of the flaws are caused by not considering the interactions between complex code constructs. This methodology enables us to express parallelism as commutative operations and to focus our attention on operation dependencies to improve performance. We then design new algorithms and data structures for parallel CFG construction to address both correctness and performance issues.

We implemented our new parallel CFG construction in the Dyninst binary analysis and instrumentation toolkit [Paradyn Project [n.d.]], a library widely used by researchers in performance analysis, security, and software forensics, and evaluated the performance characteristics of our parallel binary analysis with a number of large binaries, including a 7.7GiB shared library from TensorFlow. We achieved as much as 25× speedup for constructing control flow graphs on 64 hardware threads, which significantly accelerates client tools that employ binary analysis. We then showcase the benefits of parallel binary analysis with two applications. The hpcstruct utility in HPCToolkit [Adhianto et al. 2010] is used to relate performance measurements back to source code; we achieved 8× speedup for hpcstruct.
BinFeat [Meng [n.d.]] is a tool for extracting binary code features for software forensics, for which we achieved 7× speedup.

In summary, this work makes the following contributions:
(1) A set of CFG operations and operation properties that enable us to reason about the correctness and performance of CFG construction algorithms.
(2) A new algorithm for parallel CFG construction that is derived from the requirements and properties of CFG operations.
(3) An implementation of the new algorithm in Dyninst that can be used by other binary analysis application developers.
(4) A demonstration of the effectiveness of our new parallel binary analysis with two binary analysis applications: hpcstruct, which significantly accelerates program structure recovery for performance analysis, and BinFeat, which significantly speeds up binary code feature extraction for software forensics.
There is a rich literature on constructing CFGs from binaries [Bardin et al. 2011; Di Federico et al. 2017; Kinder and Kravchenko 2012; Kinder and Veith 2008; Schwarz et al. 2002]. A commonly used approach is control flow traversal [Schwarz et al. 2002; Theiling 2000]. Starting from known function entry points, such as the ones found in the symbol table, it follows control flow transfers in the program to discover code and identify additional function entry points for further analysis. We discuss several challenging code constructs that must be addressed during control flow traversal and representative binary analysis tools that implement control flow traversal.
Functions sharing code:
A common compiler optimization is to share binary code between functions with common functionality, such as error handling code and stack tear-down code. This construct is common in compiled code. We have observed it in glibc-2.29 (released Jan. 2019), where common error handling code is shared by multiple syscall wrappers, and within code compiled by the Intel Compiler Suite (ICC). Sharing can also occur logically in functions with multiple entry points: binary analysis tools typically represent such functions as multiple single-entry functions that share code. Thus, Fortran functions with multiple programmer-specified entry points (using the entry keyword), and binaries on Power 8 or newer (whose ABI specifies that each function has at least two entry points), lead to functions sharing code. To support code shared by multiple functions, one can define a function as the set of basic blocks that are reachable from the function entry by traversing only intra-procedural edges [Bernat and Miller 2012; Meng and Miller 2016].
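Under this definition, a function's block set is a reachability computation over intra-procedural edges, so two functions can both contain a shared block. The following is a minimal sketch of that idea; the graph encoding (blocks named by strings, a successor map) is illustrative and not Dyninst's representation:

```python
from collections import deque

def function_blocks(entry, intra_succs):
    """Return the blocks of the function at `entry`: every block reachable
    from the entry by traversing only intra-procedural edges.
    `intra_succs` maps a block to its intra-procedural successors."""
    blocks, queue = {entry}, deque([entry])
    while queue:
        b = queue.popleft()
        for succ in intra_succs.get(b, ()):
            if succ not in blocks:
                blocks.add(succ)
                queue.append(succ)
    return blocks
```

With a shared error-handling block reachable from two entries, the traversal naturally places the block in both functions, which is exactly the code-sharing situation described above.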
Non-returning functions:
Binary analysis tools often define a call fall-through edge, which is a summary edge representing that the control flow at a function call will return to the call site. However, a call to a non-returning function will never return to its call site, so there should be no call fall-through edge at such call sites. A wrongly created call fall-through edge can lead to confusing control flow and cascading impacts on binary analysis applications.

The general idea for identifying non-returning functions is to match function names against known non-returning functions such as exit and abort, and to use an iterative analysis to identify functions that always end in calls to non-returning functions. One example is the fixed-point analysis presented by Meng and Miller [Meng and Miller 2016]. Each function has a return status with three possible values: UNSET, RETURN, and NORETURN. A function's return status is initialized to NORETURN if it is known to be a non-returning function; otherwise, its return status is UNSET. The three main components of the non-returning function analysis are: (1) a function's return status is set to RETURN if we find a return instruction; (2) if we encounter a call site calling a function with UNSET return status, we do not parse the call fall-through edge until the callee's return status is set to RETURN; (3) if there is a cyclic dependency between functions' return statuses, all functions in the cycle are non-returning.
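The fixed-point flavor of this analysis can be sketched as follows. The program model here is our own simplification (not Dyninst's): each function maps to a list of "exit paths", either ("ret",) for a reachable return instruction or ("call", callee) for a path that ends at a call and returns only if the callee returns.

```python
UNSET, RETURN, NORETURN = "UNSET", "RETURN", "NORETURN"
KNOWN_NORETURN = {"exit", "abort"}  # seeded by name matching

def return_statuses(funcs):
    """Iterate return statuses to a fixed point, then resolve cycles."""
    status = {f: NORETURN if f in KNOWN_NORETURN else UNSET for f in funcs}
    changed = True
    while changed:  # components (1) and (2): propagate RETURN facts
        changed = False
        for f, paths in funcs.items():
            if status[f] != UNSET:
                continue
            # f returns if some path ends in a return instruction, or ends
            # in a call to a function already known to return
            if any(p == ("ret",) or
                   (p[0] == "call" and status[p[1]] == RETURN)
                   for p in paths):
                status[f] = RETURN
                changed = True
    for f in status:  # component (3): unresolved cyclic dependencies
        if status[f] == UNSET:
            status[f] = NORETURN
    return status
```

A function that only ever reaches exit is classified NORETURN, while mutually recursive functions with no return path fall out of the fixed point still UNSET and are then marked NORETURN, matching rule (3) above.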
Jump table analysis:
Compilers often emit indirect jumps for switch statements in the source code. The targets of these indirect jumps are calculated based on jump table data in the binary. Being able to statically resolve the control flow targets calculated through jump tables is critical for complete control flow traversal. A common approach for resolving jump table targets is to use backward slicing to identify the instructions that are involved in the target calculation, and then to construct a symbolic expression of the jump target to identify the actual jump targets [Di Federico et al. 2017; Meng and Miller 2016; Shoshitaishvili et al. 2016; Williams-King et al. 2020].
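To make the last step concrete, suppose backward slicing has already recovered a symbolic expression of the form target = table_base + entry[index] and a bound index < n from the guarding compare-and-branch. Enumerating the concrete targets then amounts to reading the table bytes. This is a hedged sketch under those assumptions; the function name, the little-endian 4-byte PC-relative entry format, and the recovered bound are illustrative:

```python
import struct

def enumerate_targets(table_bytes, table_base, n, entry_size=4):
    """Read n jump table entries and materialize the concrete targets.
    Entries are assumed to be signed 32-bit offsets relative to table_base."""
    targets = []
    for index in range(n):  # bound recovered from the guarding comparison
        (entry,) = struct.unpack_from("<i", table_bytes, index * entry_size)
        targets.append(table_base + entry)
    return targets
```

Real analyses must also validate that each computed target lands inside the code region; an over-approximated bound n is exactly the failure mode discussed later in this paper.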
Tail calls:
A tail call [Clinger 1998] is a compiler optimization that uses a jump instruction at the end of a function to target the entry point of another function; thus, not every branch should be labeled as intra-procedural. Tail calls are often recognized through heuristics [Di Federico et al. 2017; Meng and Miller 2016], including: (1) a branch to a known function entry is a tail call; (2) a branch to a basic block that is reachable through only intra-procedural edges of the current function is not a tail call; (3) if there is stack frame tear-down before the branch, the branch is a tail call.
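These three heuristics can be sketched as a single predicate. The ordering and the parameter names are our illustration; real tools combine more evidence and architecture-specific patterns:

```python
def is_tail_call(target, known_entries, intra_reachable, frame_torn_down):
    """Classify a terminal branch using the heuristics described above."""
    if target in known_entries:      # (1) branch to a known function entry
        return True
    if target in intra_reachable:    # (2) target already reachable through
        return False                 #     intra-procedural edges only
    return frame_torn_down           # (3) stack frame torn down first
```

Note that heuristic (1) depends on which functions have already been discovered, which is precisely why these heuristics give order-dependent answers, as discussed later with Listing 1.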
Recent binary analysis tools address these challenging code constructs, including angr [Shoshitaishvili et al. 2016], Dyninst [Meng and Miller 2016], and rev.ng [Di Federico et al. 2017]. While these tools are similar in how they address challenging code constructs, their software infrastructures have distinct characteristics with regard to analysis speed.

Both angr and rev.ng first lift machine instructions to an intermediate representation (IR) and then perform analysis on the resulting IR. The first advantage of this approach is that the binary analysis is not architecture specific and can be readily ported to a new architecture once the IR is supported on that architecture. For example, rev.ng uses QEMU to lift binaries to LLVM IR; therefore, rev.ng supports every architecture that QEMU supports (more than 16 different architectures). Similarly, angr uses Valgrind's VEX IR. The second advantage is that lifting to IR facilitates the development of complex data flow analyses such as points-to analysis and value set analysis. However, this approach leads to significant performance slowdown for two reasons. First, the lifting process itself is slow. Second, the number of assignments in the IR is significantly larger than the number of machine instructions, as one instruction may be lifted to multiple IR assignments, especially on CISC architectures such as x86-64.

On the other hand, Dyninst operates directly on the binary. Dyninst's InstructionAPI provides an architecture-independent interface for querying instruction opcodes, instruction operands, registers, and memory addressing modes. The CFG construction code inside Dyninst works with this "bare-metal" instruction interface. The only exception is that when Dyninst resolves jump tables, it lifts machine instructions to ROSE IR [Quinlan and Liao 2011].
However, since lifting is applied only to the instructions involved in the jump table calculation found by backward slicing, typically only a small portion of the binary is lifted.

As we will describe in Section 7, complex data flow analyses are not needed in our target applications. Therefore, we implement our new parallel CFG construction algorithms in Dyninst to achieve better performance.
While existing literature comprehensively describes how to address each challenging code construct, a critical problem we encountered when designing a parallel CFG construction algorithm is the complex interaction between these code constructs. Serial algorithms are designed with the assumption that no concurrent modification is made to the CFG. To address this problem, we present an abstraction of control flow graphs and a series of core operations on them. This abstraction enables us to reason about interactions between different CFG construction operations.

Our abstraction builds upon the CFG definitions and operations designed for binary modification [Bernat and Miller 2012], which mainly work with fully constructed CFGs. Our abstraction instead focuses on the process of constructing CFGs.
Definitions:
We define a CFG G = ⟨B, C, E, F⟩ to be a tuple of the following:
• B is a set of address ranges [s, e), representing basic blocks within the binary. Each of these contains at most one control flow instruction, which if present is the final instruction within the range, and has incoming control flow at only address s.
• C is a set of candidate blocks [t], representing addresses which are known to start basic blocks but do not yet have known ending addresses.
• E ⊆ {(a → b) : a ∈ B, b ∈ B ∪ C} is a set of directed edges between basic blocks, representing possible control flow executions between blocks.
• F ⊆ B ∪ C is the set of function entry blocks.

Partial order: We utilize a partial order between control flow graphs, designed such that a larger graph includes more control flow elements. We define G_1 ≼ G_2 if all of the following are true:
• The address ranges contained in G_1 are also contained in G_2. Formally, let A_1 and A_2 be the addresses contained by the blocks in B_1 and B_2 respectively. Then we require A_1 ⊆ A_2.
• The explicit control flow present in G_1 is also present in G_2, regardless of adjustments to block ranges. Formally, for every edge (a = [s_a, e_a) → b = [s_b, e_b)) or (a = [s_a, e_a) → b = [s_b]) ∈ E_1, one of the similar edges ([s'_a, e_a) → [s_b, e'_b)) and ([s'_a, e_a) → [s_b]) must be present in E_2. Intuitively, G_2 may contain additional control flow edges that target addresses inside a or b, causing them to be split. The requirement here is that the end address of the source block, e_a, and the start address of the target block, s_b, are preserved under the partial order.
• The implicit control flow through a basic block in G_1 is preserved in G_2. Formally, for every block b = [s, e) ∈ B_1 there is a sequence of blocks [s_0, s_1), [s_1, s_2), ..., [s_{n−1}, s_n) ∈ B_2, with s_0 = s and s_n = e, such that ([s_i, s_{i+1}) → [s_{i+1}, s_{i+2})) ∈ E_2 for i = 0, ..., n −
2. This means that a block b in G_1 can be split into multiple smaller blocks in G_2 to incorporate other incoming control flow.
• Function entry labels in G_1 are preserved in G_2, regardless of range adjustments. Formally, for every block [s, e) or [s] ∈ F_1, there is a block starting at the same address, [s, e') or [s] ∈ F_2.

CFG operations: To construct a CFG based on the underlying binary, we define a number of core operations:
• Block End Resolution: Given a graph G containing a candidate block [t] ∈ C, we define O_BER(G, [t]) as G with the candidate block [t] replaced by an actual basic block starting at t with a determined end address. There are three possible cases:
– Block splitting. If there is an existing block b = [s, e) ∈ B such that s < t < e, then we have to split b into the basic blocks [s, t) and [t, e). Any incoming edges of b are redirected to [s, t), while outgoing edges of b and incoming edges of [t] are moved to [t, e).
– Early block ending. If there is an existing block b = [s, e) ∈ B such that t < s and the range [t, s) contains no control flow instructions, we replace [t] with [t, s) as in the first case and append the edge ([t, s) → [s, e)).
– Linear parsing. If neither of the previous cases applies, let e be the address directly after the first control flow instruction following t. We replace [t] with [t, e) as in the first case.
• Direct Edge Creation: Given a block a in a graph G that ends with a direct control flow instruction, we define O_DEC(G, a) as G with outgoing edges appended to a, based on the control flow instruction within a (if one exists). There are three cases:
– If a terminates with an unconditional jump to address t, we append the edge (a → [t]).
– If a = [s, e) terminates with a conditional jump to address t, we append edges for the cases where the condition is true, (a → [t]), and false, (a → [e]).
– If a terminates with a function call instruction to address t, we append the edge (a → [t]).
• Call Fall-Through Edge Creation: Given an edge c = ([s, e) → f) in a graph G where [s, e) contains a function call instruction and f ∈ F, we define O_CFEC(G, c) as G potentially with the additional edge ([s, e) → [e]) summarizing the execution of the callee function. Correct application of this operation depends on the non-returning function analysis used to identify whether the target function can return or not.
• Indirect Edge Creation: Given a block a in a graph G which contains a jump to a dynamic address, we define O_IEC(G, a) as G with the additional edges (a → [t_1]), ..., (a → [t_n]), where t_1, ..., t_n are target addresses determined statically. It is possible for this operation to add no edges if the analysis used is insufficient to statically determine the possible targets.
• Function Entry Identification: Given an edge e = (a → b) in a graph G, we define O_FEI(G, e) as G with the block b potentially labeled as a function entry. This operation is trivial if e was created by an explicit call instruction, but further heuristics are required to identify functions that are reached only through optimized tail calls.
• Edge Removal: Given an edge e = (a → b) within a graph G, we define O_ER(G, e) as G with the edge e removed, along with any blocks and edges that are no longer reachable from any function entry point. Formally, let B' ⊆ B and C' ⊆ C be the sets of blocks and candidate blocks in G reachable from any block in F without traversing e. We can then define O_ER(G, e) = ⟨B', C', E ∩ {(a' → b') : a' ∈ B', b' ∈ B' ∪ C'} \ {e}, F⟩.
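To illustrate the three cases of Block End Resolution, the following toy implementation operates on a deliberately simplified CFG (dicts keyed by start address); the representation, method names, and the first_cf_end stand-in for linear disassembly are our assumptions, not Dyninst's data structures:

```python
class CFG:
    """Toy CFG: blocks maps start -> end, candidates holds unresolved
    starts, edges holds (source start, target start) pairs."""
    def __init__(self):
        self.blocks = {}
        self.candidates = set()
        self.edges = set()

    def resolve_block_end(self, t, first_cf_end):
        """O_BER sketch: turn candidate [t] into a real block.
        first_cf_end(t) returns the address just past the first control
        flow instruction at or after t."""
        self.candidates.discard(t)
        split = next(((s, e) for s, e in self.blocks.items()
                      if s < t < e), None)
        if split:                        # case 1: block splitting
            s, e = split
            self.blocks[s] = t           # [s, e) shrinks to [s, t)
            self.blocks[t] = e           # new block [t, e)
            # outgoing edges of [s, e) now leave from [t, e)
            self.edges = {(t if src == s else src, dst)
                          for src, dst in self.edges}
            self.edges.add((s, t))       # fall-through across the split
            return
        end = first_cf_end(t)
        inside = [s for s in self.blocks if t < s < end]
        if inside:                       # case 2: early block ending
            end = min(inside)
            self.edges.add((t, end))     # fall through into existing block
        self.blocks[t] = end             # case 3 (or 2): record [t, end)
```

For example, resolving a candidate at 0x108 inside an existing block [0x100, 0x110) splits it into [0x100, 0x108) and [0x108, 0x110), moves the outgoing edges to the second half, and adds the fall-through edge between the halves.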
Starting with the initial graph G_0 = ⟨∅, F_0, ∅, F_0⟩, where F_0 is the set of candidate function entry blocks discovered via the binary's symbol table and unwind information, the task of CFG construction can be abstracted as repeated application of these operations. We denote G_1, G_2, ..., G_{n−1} as the intermediate results and G_n as the final CFG.

We present several important properties of the defined operations, assess existing serial algorithms with these properties, and use these properties to define critical correctness and performance issues for parallel CFG construction.
Operation dependencies:
To correctly build the CFG, operations should be applied in an order that satisfies the dependencies among them. We identify two types of dependencies:
• Applicability Dependency. We cannot apply operations to a graph element that has not been discovered. For example, we must create an edge before we can resolve the target block candidate of that edge.
• Non-returning Function Dependency. The correctness of O_CFEC for creating call fall-through edges depends on the operations applied to the callee function to determine whether the callee can return. If O_CFEC is applied when the callee does not return, an erroneous call fall-through edge is added, leading to an incorrect CFG.

Operations related by either type of dependency must be applied in order. We classify operations that are not constrained by any dependency into three categories:
• Commutative operations: The operations O_BER and O_DEC commute with themselves and with each other, allowing us to choose an order convenient for processing. To establish this, we discuss the following three cases:
– Given two candidate blocks [a] and [b] where a < b, we have O_BER(O_BER(G, [a]), [b]) = O_BER(O_BER(G, [b]), [a]). First, if there is a control flow instruction ending at address c where a < c < b, candidate block [a] will end at or before c, while candidate block [b] will end after c. These two operations act on non-overlapping address ranges and are independent, which gives us commutativity. Second, if a control flow instruction ends at c, where a < b < c and c is the end of the first control flow instruction following b, we have

O_BER(O_BER(G, [a]), [b]) = O_BER(G ∪ {[a, c)}, [b])    (linear parsing)
                          = G ∪ {[a, b), [b, c)}        (block splitting)
                          = O_BER(G ∪ {[b, c)}, [a])    (early block ending)
                          = O_BER(O_BER(G, [b]), [a]).  (linear parsing)

Thus we also have commutativity in this case.
– Given two blocks a and b, we have O_DEC(O_DEC(G, a), b) = O_DEC(O_DEC(G, b), a). This is because O_DEC(G, a) considers only the terminating control flow instruction within block a.
– Given a candidate block [t] and a block [s, e), we have O_BER(O_DEC(G, [s, e)), [t]) = O_DEC(O_BER(G, [t]), [s, e)). We observe that O_DEC(G, [s, e)) depends only on the terminating control flow instruction ending at e and generates only new candidate blocks, while O_BER(G, [t]) does not depend on candidate blocks. Therefore these two operations are independent and thus commutative.
The operation O_ER also commutes with itself, allowing us to choose an order convenient for processing. The graph O_ER(O_ER(G, e_1), e_2) = O_ER(O_ER(G, e_2), e_1) contains no blocks reachable only through e_1 or e_2, which gives us the commutativity property.

A:              B:
  ...             ...
  leaveq          mov %rsi, 1
  jmp 0x400       jmp 0x400

Listing 1. An example showing inconsistent results in the tail call heuristics used by Dyninst.

• Monotonic ordering property: While O_IEC does not commute trivially with any other operation, we can still establish a weaker property. Let O_IEC(G, a) be an indirect edge creation operation and O_x be an O_BER or O_DEC operation. We have O_x(O_IEC(G, a)) ≼ O_IEC(O_x(G), a), since the edges added by O_x can at most add control flow paths preceding a and thus increase the set of target addresses. When our goal is to achieve a maximal CFG, this allows us to reorder O_IEC after any O_BER and O_DEC operations without decreasing the final result.
• Non-reorderable operations: The operations O_CFEC and O_FEI do not always commute, nor do they satisfy the ordering property above in all cases.
Both of these operations rely on implementation-specific analyses: non-returning function analysis for O_CFEC and tail call identification heuristics for O_FEI, both of which at times require inspection of large portions of the graph. Because of the sensitivity of these operations, we are careful to apply them only after the considered subgraph has stabilized.
We compare the serial algorithms used by angr [Shoshitaishvili et al. 2016], Dyninst [Meng and Miller 2016], and rev.ng [Di Federico et al. 2017] using the defined notation and operations.
• Dyninst's and angr's CFG construction can be characterized with an increasing expression: G_0 ≼ G_1 ≼ G_2 ≼ ... ≼ G_n. This increasing pattern has the advantage of not performing the redundant work of adding and then removing graph elements. On the other hand, rev.ng has an additional step to clean candidate function entries after this increasing phase.
We observe that this cleaning step can address issues caused by non-commuting operations such as tail call identification. Listing 1 is an example where Dyninst will give inconsistent results depending on the analysis order. In this example, functions A and B branch to the same address. If A is analyzed first, because leaveq tears down the stack frame, Dyninst will treat the branch in A as a tail call and create a new function at the branch target; later, when Dyninst analyzes B, it finds that B branches to a known function entry, so the branch in B is also a tail call. In this case, function B will not include the block at 0x400. On the other hand, if B is analyzed first, because there is no stack frame tear-down before the branch in B, Dyninst will not treat the branch as a tail call, and the block at 0x400 will be part of B. Therefore, the function boundary of B is determined by the order of analysis. Note that without other context, it is equally valid to conclude either "A and B both tail call to 0x400" or "A and B share the block at 0x400". With a cleaning step at the end, we have the opportunity to generate a consistent answer.
• The jump table analysis implemented in these tools does not necessarily satisfy the monotonicity property we defined for O_IEC. The root cause of this issue is imperfect jump table analysis, where jump table targets can be over-approximated. Suppose we have O_IEC(G, b_1) and O_IEC(G, b_2).
Due to imperfect jump table analysis, O_IEC(G, b_1) may generate an over-approximated set of jump targets, resulting in invalid outgoing edges. Such additional but confusing control flow may cause the subsequent O_IEC for b_2 to fail, leading to an empty set of targets. However, if O_IEC(G, b_2) is performed first, we may get the correct non-empty set of jump targets for b_2. We have observed this problem in Dyninst's jump table analysis. While rev.ng and angr both provide detailed descriptions of how they resolve jump tables, neither is able to guarantee no over-approximation of the jump targets.

Besides the need for a cleaning step after the increasing CFG construction and the need to address jump table over-approximation, we identify three further issues that must be addressed to achieve effective parallel analysis.
• Commutative operations still need careful synchronization. Suppose we have two direct edges e_1 and e_2 that have the same target address, and we perform O_DEC(G, e_1) and O_DEC(G, e_2) concurrently. Due to commutativity, O_DEC(O_DEC(G, e_1), e_2) = O_DEC(O_DEC(G, e_2), e_1). However, the operation performed first will create the candidate block, making the second operation effectively the identity function. This uniqueness property is trivial to maintain in serial algorithms, but careful synchronization is necessary to maintain it in a parallel setting.
• Non-returning function dependencies between operations can also lead to ineffective parallelism. In a call chain where F_1 calls F_2, F_2 calls F_3, ..., and F_{n−1} calls F_n, an O_CFEC operation in F_1 may need to wait for operations in F_2 to complete, which may need to wait for operations in F_3, and so on. This effect causes undesirable serialization during the analysis.
• The monotonic ordering property for operation O_IEC(G, a) indicates that we might be able to find more control flow targets if it is applied after other edge creation operations.
However, deferring O_IEC(G, a) can exacerbate the issue of non-returning function dependencies, as this will delay the discovery of returns that are only reachable through the indirect jump.

In Section 4, we established that commutative operations can be performed in any order without impacting the final results. This is the foundation for parallel CFG construction. Section 5.1 describes the stages of our parallel analysis. Section 5.2 presents five invariants for supporting concurrent CFG operations. Section 5.3 discusses parallel control flow traversal. Finally, Section 5.4 discusses the parallel finalization step that is needed to obtain a correct CFG.
Our parallel analysis can be characterized with the expression G_0 ≼ G_1 ≼ · · · ≼ G_m ≽ G_{m+1} ≽ · · · ≽ G_n. It contains a CFG expansion phase to discover and add new graph elements, followed by a CFG correction phase to remove incorrect graph components.

Listing 2 describes the three main stages of our parallel analysis. It starts by initializing data structures for analyzing the functions defined in the symbol table (Line 1). It is necessary to parallelize this step, as we have seen large binaries containing millions of functions. The second stage is the increasing construction phase. We perform control flow traversal for the initialized functions in parallel, during which we may discover more functions. We repeatedly apply control flow traversal until there are no more functions to analyze (Lines 2-6). The details of control flow traversal are presented in Section 5.3. The final stage finalizes the CFG (Line 7). This stage includes cleaning control flow edges and blocks created by over-approximated jump tables, cleaning inconsistent tail call identification results, and determining which basic blocks belong to which function by traversing intra-procedural edges from function entry blocks.

We use Figure 1 to illustrate how our five invariants ensure that threads correctly perform concurrent CFG operations.
Invariant 1: Block Creation.
There is at most one basic block starting at any given address. This invariant means that if threads branch to the same target concurrently, one and only one thread should create the block and make the block visible to other threads. Maintaining this invariant requires efficient concurrent data structures that synchronize threads branching to the same target, while allowing threads branching to different targets to proceed independently.

1: funcs = InitFunctions()              -- Done in parallel
2: while funcs != ∅
3:   more_funcs = ∅
4:   parallel for f in funcs
5:     more_funcs = more_funcs ∪ ControlFlowTraversal(f)
6:   funcs = more_funcs
7: CFGFinalization()                    -- Done in parallel

Listing 2. Stages of our parallel binary analysis.

[Figure 1 omitted: only its captions survived extraction.] Fig. 1. An example of five threads working on a common area of code. Solid edges represent the progress of threads; bold solid edges represent actions about to take place; dashed edges represent control flow edges in the CFG. (a) T_4 and T_5 branch to two different addresses; T_1, T_2, and T_3 branch to the same target. (b) T_1, T_4, and T_5 create new basic blocks; T_1 first reaches the block end and creates new control flow edges; T_4 then reaches the block end and is going to split blocks. (c) T_4 finishes splitting blocks; T_5 reaches the block end and is going to split blocks. (d) T_5 finishes splitting blocks, which involves moving edges.

In Figure 1a, threads T_1, T_2, and T_3 branch to the same address. According to this invariant, only one thread should create a new basic block. As shown in Figure 1b, T_1 creates a new basic block B_1. T_2 and T_3 do not create any new basic blocks and leave the common code area to work on other code. Independently, T_4 creates basic block B_2 and T_5 creates basic block B_3.

Invariant 2: Block End.
There is at most one basic block ending at any given address. A naïve implementation of this invariant would let a thread check whether a block exists at its current working address; if one exists, the working thread can end its block. However, this implementation means that there will be a block start lookup after decoding each instruction, which would create a performance hotspot on the concurrent data structure used for Invariant 1. Our design defers this check until the working thread reaches a control flow instruction. In this way, we reduce the frequency of global concurrent data structure lookups from once per instruction to once per control flow instruction.

As shown in Figure 1b, threads T_1, T_4, and T_5 independently parse their blocks until they reach the indirect jump instruction. Based on this invariant, only one thread should register the block end address, which is T_1 in this example.

Invariant 3: Edge Creation.
The thread that registers a block's end is responsible for creating the outgoing control flow edges from that block. This invariant ensures that no redundant control flow edges are created and that jump table analysis for a particular indirect jump is always performed by one thread. It also reduces unnecessary block start lookups by avoiding redundant edges. As shown in Figure 1b, because thread T_1 registers the block end, T_1 continues to perform control flow analysis to resolve the indirect jump targets and create control flow edges. T_1 then leaves the common code and continues to work on other code.

Invariant 4: Block Split.
The threads that reach a block end but do not register the block end will need to split blocks. Suppose we have blocks B_1[x_1, y), B_2[x_2, y), . . . , B_n[x_n, y) created by n threads, where x_1 < x_2 < . . . < x_n < y. The result of block splitting should be B_1[x_1, x_2), B_2[x_2, x_3), . . . , B_n[x_n, y), with a fall-through edge between each pair of adjacent basic blocks. It is inefficient to wait for all relevant blocks before performing splitting, so we present the following eager block split algorithm.

Based on Invariant 2 (block end), only one block B_i[x_i, y) will register its end at y. When another block B_j[x_j, y) reaches y, the working thread can look up B_i as the registered block. Depending on the relationship between x_i and x_j, we have two cases:

• If x_i > x_j, B_j is split into [x_j, x_i) while B_i is untouched. We then register B_j at block end address x_i, which will trigger a new iteration of block splitting if another block has already registered a block end at x_i. As shown in Figure 1c, T_4 splits its block, registers it as ending at 0xA, and then leaves the common code.

• If x_i < x_j, B_i is split into [x_i, x_j) while B_j is untouched. We then replace B_i with B_j for block end address y, register B_i for block end address x_j, and move the outgoing edges from B_i to B_j. As in the first case, registering B_i at x_j may recursively trigger another block split. As shown in Figure 1d, T_5 splits the registered block and moves its control flow edges to its own block.

For both cases, each iteration of the block split algorithm ends with a smaller block end address. Therefore, our block split algorithm is guaranteed to converge.

Invariant 5: Function Creation.
There is at most one function starting at any given address. This invariant has similar properties and requirements to Invariant 1 for creating blocks. These five invariants ensure that commutative operations can be safely reordered and performed concurrently, and that the relative speed of threads will not impact the final results.
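To make the convergence argument of the eager block split algorithm concrete, below is a minimal single-threaded sketch. The types and function names here are ours, not Dyninst's; the parallel version would additionally hold the per-entry lock of Invariant 2 while mutating the registry.

```cpp
#include <cassert>
#include <map>

// Hypothetical simplified block: a half-open address range [start, end).
struct Block { long start, end; };

// Registry mapping a block-end address to the unique block registered there
// (Invariant 2). A std::map stands in for the concurrent hash map.
std::map<long, Block*> endRegistry;

// Eager block split: if another block is already registered at b's end address,
// split whichever block starts earlier and re-register the shortened block at
// its new, strictly smaller end address. Each iteration shrinks the end
// address being registered, so the loop terminates.
void registerBlockEnd(Block* b) {
  for (;;) {
    auto [it, fresh] = endRegistry.try_emplace(b->end, b);
    if (fresh) return;            // no conflict: b now owns this end address
    Block* r = it->second;        // the block already registered here
    if (r->start > b->start) {
      // Case x_i > x_j: split b into [b->start, r->start), re-register it.
      b->end = r->start;
    } else {
      // Case x_i < x_j: b replaces r at this end address (outgoing edges
      // move from r to b in the full algorithm); r shrinks, re-register it.
      it->second = b;
      r->end = b->start;
      b = r;
    }
  }
}
```

Running the sketch on three overlapping blocks that share an end address produces the expected chain of adjacent blocks, each ending where the next begins.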
Listing 3 presents the algorithm for control flow traversal. Coupled with the invariants defined in Section 5.2, control flow traversal can be performed in parallel. The traversal is repeated until there are no more unanalyzed basic blocks (Line 2). For each unanalyzed block b, we use routine linearParsing to decode instructions until a control flow transfer instruction is encountered (Line 4); modifications to Dyninst's instruction decoding code add the thread-safety needed to support this. Routine registerBlockEnd follows invariant 2 (block end) to register the block end (Line 5). Only the thread that successfully registers the block end will see a non-empty set of control flow edges returned, following invariant 3 (edge creation). All other threads reaching the same block end will see an empty set of edges and will follow invariant 4 (block split) to split the blocks (Line 7). The thread that creates the control flow edges proceeds to traverse them (Lines 8-12). If we encounter a function call, we may need to create a new function, following invariant 5 (function creation), at Line 10. If we encounter a call fall-through summary edge or a return edge, we process non-returning functions (Line 11). If we encounter other types of edges, such as indirect, direct, or conditional branches, we create new basic blocks based on invariant 1 (block creation) at Line 12.

Handling non-returning functions:
In procedure processCall, we use the non-returning function analysis presented by Meng and Miller [Meng and Miller 2016]. To address the non-returning function dependency between CFG operations, we improve the analysis to eagerly notify a function's callers as soon as its return status is set to RETURN. This improvement works well in practice because large functions may contain multiple return instructions. As soon as we encounter one of a function's return instructions, we know the function is RETURN and we can enable the O_CFEC operation in its callers without waiting for analysis of the callee to finish.
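The eager notification can be sketched as follows. This is a simplification with our own names, not Dyninst's; the real implementation synchronizes the status update and runs inside the traversal tasks.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Sketch of eager return-status notification. A caller whose call
// fall-through edge creation (O_CFEC) is blocked on an unknown callee
// registers a continuation; the first 'ret' decoded in the callee fires all
// continuations immediately, without waiting for the rest of the callee's
// analysis to finish.
enum class RetStatus { UNKNOWN, RETURN, NORETURN };

struct Function {
  RetStatus status = RetStatus::UNKNOWN;
  std::vector<std::function<void()>> blockedCFEC;

  // Caller side: run the O_CFEC operation now, or defer it.
  void whenReturns(std::function<void()> cfec) {
    if (status == RetStatus::RETURN) cfec();
    else blockedCFEC.push_back(std::move(cfec));
  }

  // Callee side: invoked as soon as any return instruction is decoded.
  void sawReturnInstruction() {
    if (status == RetStatus::RETURN) return;  // notify callers only once
    status = RetStatus::RETURN;
    for (auto& op : blockedCFEC) op();
    blockedCFEC.clear();
  }
};
```

The first return instruction unblocks every waiting caller; later return instructions in the same function are no-ops.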
Jump table analysis.
We address the two issues raised in Section 4 about jump table analysis. First, jump table analysis (O_IEC) in Dyninst does not satisfy the monotonic ordering property. We identify that when Dyninst encounters instructions or path conditions that it cannot analyze, it fails to analyze the jump table and generates an empty set of control flow targets. This issue can be addressed by taking the union of the targets discovered along different paths, essentially ignoring instructions or path conditions that fail analysis. In this way, jump table targets identified along valid control flow paths can still be propagated to the indirect jump, and the analysis can generate a non-empty set of control flow targets. While this strategy makes the jump table analysis in Dyninst satisfy the monotonic ordering property, it can over-approximate jump table sizes and lead to bogus control flow edges. We will introduce a cleaning strategy in the CFG finalization stage to remove bogus control flow edges.

Second, the monotonic ordering property specifies that we can get a larger graph if we delay jump table analysis as much as possible, but this might delay the discovery of return instructions and hurt parallelism due to non-returning function dependencies. We balance these two factors by ordering jump table analysis after the analysis of direct control flow edges in the same function, but before call fall-through edges whose callees do not have a known return status. In addition, we repeat the analysis of a jump table after more control flow paths are created within the same function. This fixed-point analysis of jump tables allows us to find most of the targets early in the analysis and gradually converge to a complete set of targets.
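The union rule can be sketched in a few lines. The set representation and function name below are ours; in Dyninst the per-path sets come from a data flow analysis over paths reaching the indirect jump.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Sketch of the union-of-paths rule. A path whose analysis fails contributes
// an empty target set instead of aborting the whole jump table analysis, so
// targets found along analyzable paths still reach the indirect jump.
// Re-running the analysis after new paths appear can only add elements to the
// union, so the per-jump target set grows monotonically toward a fixed point.
using Targets = std::set<long>;

Targets resolveJumpTable(const std::vector<Targets>& perPathTargets) {
  Targets merged;
  for (const Targets& t : perPathTargets)  // a failed path is an empty set
    merged.insert(t.begin(), t.end());
  return merged;
}
```

A failed path (empty set) no longer poisons the result, and adding a newly discovered path can only enlarge the merged target set.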
The goal of the CFG finalization stage is to remove incorrect CFG elements and determine function boundaries. No new CFG elements are added.

The first step is jump table finalization, where we remove incorrect indirect control flow edges. We find that over-approximation of jump targets is caused primarily by over-approximation of jump table sizes. We can mitigate this problem by leveraging the observation that compilers do not emit overlapping jump tables [Williams-King et al. 2020]. Therefore, if the analysis of a jump table overflows into another jump table, we can detect the over-approximation and apply edge removal operations O_ER to remove the incorrect edges and any cascading dangling blocks. We make two observations about this strategy. First, we established in Section 4 that edge removal operations commute, so it is safe to perform this mitigation in parallel. Second, this strategy cannot be used during the parallel control flow traversal step, because when we analyze a jump table we do not yet know the exact locations of all jump tables in the binary. For this reason, we delay this mitigation of over-approximation until the CFG finalization phase.

1:  more_func = ∅
2:  while f.hasMoreBlocks()
3:    b = f.nextBlock()
4:    linearParsing(b)
5:    edges = registerBlockEnd()
6:    if edges == ∅
7:      splitBlock(b)
8:    for e in edges
9:      switch e.type()
10:       when 'call', 'tailcall':        more_func ∪= processCall()
11:       when 'call-fallthrough', 'ret': more_func ∪= processNonRetFunc()
12:       when 'other':                   createNewBasicBlock(f)
13: return more_func

Listing 3. The algorithm of control flow traversal.

We then address incorrect tail call edges and determine function boundaries. We handle this with an iterative parallel control flow graph search. Starting from function entries, we add blocks to the boundary of a function by traversing intra-procedural edges.
After obtaining the temporary function boundaries, we apply three rules, in order, to correct tail call results:

(1) If a branch is marked as not a tail call, but the edge target has an incoming CALL edge, we correct this edge to be a tail call.
(2) If a branch is marked as a tail call, but the branch target is within the current function boundary, we correct this edge to be not a tail call.
(3) If a branch is currently a tail call, but the current edge is the target's only incoming edge, we treat it as not a tail call. This case is generally caused by outlined code blocks.

After correcting tail calls, we re-run the function boundary graph search and the tail call correction procedure. We flip the tail call determination at most once for each edge, ensuring convergence. Finally, we remove functions that do not have incoming inter-procedural edges.

We implemented our new parallel CFG construction algorithms in Dyninst. A careful implementation that follows our design is crucial for correctness and high performance. We present several code examples and lessons learned from our work.
In Section 5.2 we presented five invariants for parallel control flow traversal. An efficient implementation of these invariants is the foundation for scalable parallel binary code analysis.

Listing 4 is a code example of our implementation of invariant 1 (block creation). This code example can be easily adapted to implement invariant 5 (function creation). Recall the two requirements for invariant 1: (1) threads that branch to the same address should be synchronized and only one thread should create a new block, and (2) threads that branch to different addresses can make progress independently. Our implementation uses the concurrent hash map provided by Intel's Threading Building Blocks library [Intel [n.d.]] to fulfill these two requirements; it provides entry-level reader-writer locks.

1:  tbb::concurrent_hash_map<Address, Block*> blocks;
2:
3:  bool attemptToCreateBlock(Address a) {
4:    Block* b = new Block(a);
5:    if (blocks.insert({a, b})) {
6:      // Successfully registered the new block.
7:      return true;
8:    } else {
9:      // Block already exists.
10:     delete b;
11:     return false;
12:   }
13: }

Listing 4. Example implementation of invariant 1 (block creation).

The insert method of concurrent_hash_map ensures that only one of the concurrent insertions with the same key will succeed (Line 5). Therefore, we can use the return value of insert to determine whether the current thread has successfully created a block and should continue analysis of the block (Line 7). A thread that sees a false return value knows that another thread has created the block and can move on to other work (Lines 9-10).

Listing 5 is a code example showing how invariant 2 (block end), invariant 3 (edge creation), and invariant 4 (block split) fit together.

1:  tbb::concurrent_hash_map<Address, Block*> blocks_end;
2:  bool blockEnd(Block* b) {
3:    tbb::concurrent_hash_map<Address, Block*>::accessor a;
4:    if (blocks_end.insert(a, b->end())) {
5:      // Block end registered, continue to create edges.
6:      AnalyzeCFEdges(b);
7:      return true;
8:    } else {
9:      // a->second references the block in the entry.
10:     a->second = splitBlock(b, a->second);
11:     return false;
12:   }
13: }

Listing 5. Example implementation of invariant 2 (block end), invariant 3 (edge creation), and invariant 4 (block split).

concurrent_hash_map exposes its entry-level reader-writer locks via an "accessor" semantic. We can obtain an accessor for an existing entry in the table (inserting one if requested and not already present). The accessor acts as a read or write lock on the entry, and other threads that try to obtain a conflicting accessor will wait until the holding thread releases its own. Line 4 ensures that only one block is registered for a block end address, enforcing invariant 2. The accessor ensures that edge creation (Line 6) and block splitting (Line 10) are mutually exclusive.
This mutual exclusion guarantees that control flow edges will not be created while being moved. Note that invariants 3 and 4 do not require mutual exclusion, and it is possible to implement them with finer-grained synchronization. Our implementation uses mutual exclusion for simplicity, and our performance profiling has not shown this mutual exclusion to be a performance bottleneck.

In Section 5.1, we described that it is necessary to have a parallel symbol table, as a large binary may contain millions of symbols. Dyninst's symbol table supports lookups by any of four properties: byte offset, mangled name, "pretty" human-readable name, and demangled "typed" name. The original implementation used a template class from Boost [Boost Project [n.d.]], a very customizable structure called a multi_index_container. [A portion of this discussion, including Listing 6, which defines class Symtab::indexed_symbols in terms of concurrent_hash_map tables, was lost in extraction.] Concurrent insertions may not know whether the Symbol they are working with is the same, so we use the entry-level lock on the master table to mediate between threads. The thread that inserts on the master table proceeds to update the corresponding entries in the by* tables, retaining its lock to ensure that any other modifications to the collective entries occur in a total order. Once all modifications are complete, later lookups can use the by* tables directly, giving the same semantics as the original structure.
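Since Listing 6 itself did not survive extraction, the following is only our reconstruction of the design just described, with a single mutex standing in for the entry-level locks of the TBB tables and our own simplified names throughout:

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical sketch of the indexed-symbols design. A master table keyed by
// offset mediates insertion; the inserting thread updates the secondary
// by-name table before releasing the lock, so updates to the collective
// entries occur in a total order.
struct Symbol { long offset; std::string mangled; };

class IndexedSymbols {
  std::mutex lock;  // stand-in for the master table's entry-level lock
  std::unordered_map<long, Symbol*> master;                 // by byte offset
  std::unordered_multimap<std::string, Symbol*> byMangled;  // by mangled name
public:
  bool insert(Symbol* s) {
    std::lock_guard<std::mutex> g(lock);
    auto [it, fresh] = master.try_emplace(s->offset, s);
    if (!fresh) return false;          // lost the race: symbol already present
    byMangled.emplace(s->mangled, s);  // update by* tables while still locked
    return true;
  }
  Symbol* findByOffset(long off) {
    std::lock_guard<std::mutex> g(lock);
    auto it = master.find(off);
    return it == master.end() ? nullptr : it->second;
  }
};
```

Only the first insertion for a given offset succeeds; by the time a lookup can observe the symbol, its by* entries are already in place.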
We summarize two implementation lessons that improved performance, starting with replacing parallel loops with task parallelism. As described in Section 5, we initially used a parallel for loop to perform parallel control flow traversal and collect new functions to analyze. The problem with this implementation is that analysis of newly collected functions will not start until all existing functions have been analyzed, which can cause significant idleness when the analysis of functions is imbalanced. To address this issue, our improved implementation uses OpenMP tasks as the parallel programming model, and we launch a new task as soon as we discover a new function to analyze.

The second lesson is to use a thread-local cache to reduce redundant calculations without incurring thread synchronization overheads. For invariant 2 (block end), discussed in Section 5.2, we let each thread parse its blocks without any synchronization until it reaches a control flow instruction. This design causes redundant instruction decoding between overlapping blocks analyzed by different threads. However, while functions sharing code is common, most of the code blocks in a binary are not shared. This means that most of the time, a thread branches into a block that it created itself, not one created by another thread. Therefore, we implemented a thread-local cache that maintains the addresses already analyzed by the thread and use this cache to reduce redundant decoding.

The implementation of our new parallel CFG construction algorithms in Dyninst involves a large amount of code: over 120K lines related to reading ELF sections, decoding machine instructions, and performing data flow analysis. We find that the following workflow focuses our attention on what needs it most, and it helped us identify numerous thread-safety issues across the code base.

(1) Use a performance analysis tool such as HPCToolkit [Adhianto et al. 2010] to gather performance traces of the code, with the aim of identifying code regions whose computational cost justifies an overhaul to add parallelism.
(2) Inspect the identified code and choose a candidate parallelization strategy, such as loop-level parallelism, a fork-join parallel programming model, or task-level parallelism. The right choice for a particular code region depends on its code structure as well as the algorithms and data structures that it employs.
(3) Use data race detectors to identify data structures and sharing patterns that require synchronization. Candidate tools include logical data race detectors such as cilkscreen and happens-before race detectors such as helgrind, the latter available as part of the Valgrind binary instrumentation system [Nethercote and Seward 2007].
(4) Use the performance analysis tool to identify excessive mutual exclusion, unbalanced workloads, and excessive phase-based synchronization that form major bottlenecks.
(5) Test the design and implementation with a large suite of test cases and benchmarks.

The above steps are repeated until the overall performance improves satisfactorily.
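Returning to the thread-local cache from the second lesson above, a minimal sketch is shown below. The names and the global-map stand-in are ours; in Dyninst the fallback is a lookup in the concurrent block map.

```cpp
#include <cassert>
#include <unordered_set>

// Sketch of the thread-local "already analyzed" cache. Each thread consults
// its private set before querying the shared concurrent block map, so the
// common case -- branching into code this thread already decoded -- incurs
// no synchronization and no redundant decoding.
thread_local std::unordered_set<long> analyzedByThisThread;
int globalLookups = 0;  // instrumentation for this sketch only

// Stand-in for a lookup in the shared concurrent block map.
bool blockExistsGlobally(long addr) {
  ++globalLookups;
  return false;  // sketch: assume no other thread created this block
}

// Returns true if this thread should decode the code at `addr`.
bool needsDecoding(long addr) {
  if (!analyzedByThisThread.insert(addr).second)
    return false;                     // cache hit: skip the global lookup
  return !blockExistsGlobally(addr);  // cache miss: consult the shared map
}
```

Revisiting an address the thread has already analyzed touches only thread-local state, so no lock or concurrent-map traffic is generated.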
We present two application case studies that utilize our new parallel CFG construction algorithms: hpcstruct, a utility in HPCToolkit for performance analysis, and
BinFeat, a feature extraction tool for software forensics.
Besides constructing CFGs (AC1), other analysis capabilities commonly used by binary analysis applications include: (AC2) identifying loops; (AC3) building a mapping between source lines and machine instructions; (AC4) understanding function inlining for templates and inlined functions; (AC5) iterating over functions, basic blocks, edges, and machine instructions; and (AC6) performing data flow analysis such as register liveness analysis. We use two application examples to illustrate how these analysis capabilities are used.
Performance Analysis with HPCToolkit: HPCToolkit is an integrated suite of tools for measurement and analysis of application performance on computers ranging from desktops to supercomputers. To relate performance measurements to an application's source code, the hpcstruct utility in HPCToolkit relates each machine instruction address to the static calling context in which it occurs. In particular, hpcstruct relates instructions to their original function (AC1) or loop construct (AC2) by inspecting the binary's final CFG (AC5), and to an inlined function or template (AC4) and source lines (AC3) if DWARF debugging information is available.
Feature Extraction with BinFeat: BinFeat is a tool for extracting binary code features for software forensics tasks, including function entry identification, compiler identification, and authorship attribution. Commonly used features include machine instruction sequences, subgraphs of CFGs (AC1), loop nesting levels (AC2), and live register counts (AC6). BinFeat iterates over all functions and blocks to extract these features (AC5).

ParseAPI::CodeObject *co = getCodeObject();
co->parse(); // Perform CFG construction in parallel
std::vector

[The remainder of Listing 7, which sorts the binary's functions by size and analyzes them in a parallel loop, was lost in extraction.]
Even when Dyninst's CFG construction algorithms are parallelized, binary analysis applications need to reduce serial execution to achieve good speedup. The basic idea is that after the CFG has been fully constructed, binary analysis typically no longer modifies the CFG. The CFG therefore becomes read-only, and different threads can safely perform analysis independently as long as the analysis itself is thread-safe. Based on this idea, we summarize a design pattern for parallelizing binary analysis applications.

Listing 7 shows an example code snippet for writing parallel binary analysis applications. Line 2 uses the parallel CFG construction algorithm described in Section 5 to construct a CFG. Lines 3 and 4 get the list of functions in the binary and sort the functions to address load balancing between threads. Sorting is important because functions have different sizes, which can cause notable imbalance if a large function is scheduled last in a work queue. Therefore, we sort the functions in decreasing order of size so that large functions are processed first. Within the parallel loop, the user can apply intra-procedural analysis in parallel to different functions.

To complete the parallelization of a binary analysis application, an application developer will also need to parallelize application-specific logic. For hpcstruct, we parallelize the parsing of DWARF debugging information in a binary. A binary's DWARF information is organized in a forest-like structure with a tree for each compilation unit (CU). Since source files are typically of similar sizes across a project, we simply used an OpenMP parallel for loop to process the CUs in parallel, accumulating their information in structures allocated in parallel by a previous phase. This resulted in thousands of race reports, which we handled first with mutex locks and later by using concurrent data structures such as those discussed in Section 6.1.
Some races were caused by code within Libdw, a utility library from Red Hat for parsing DWARF; in cases where performance would suffer from full mutual exclusion, we applied more significant modifications by implementing a resizable hash table [Click 2007; Michael 2002; Triplett et al. 2011] in Libdw.

BinFeat needs to build a global feature index after extracting features from every function in a binary. This operation can be parallelized with a reduction, a generic parallel computing primitive.
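The reduction can be sketched with per-thread partial indices that are merged at the end. The feature representation and names below are ours; BinFeat's actual index structure may differ.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <thread>
#include <vector>

// Sketch of building a global feature index by reduction. Each worker counts
// features for its share of the functions in a private map (no locking during
// extraction); the partial maps are then merged, which is the reduction step.
using FeatureCounts = std::map<std::string, int>;

FeatureCounts buildIndex(const std::vector<std::vector<std::string>>& funcFeatures,
                         unsigned nThreads) {
  std::vector<FeatureCounts> partial(nThreads);
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < nThreads; ++t)
    workers.emplace_back([&funcFeatures, &partial, t, nThreads] {
      // Cyclic distribution of functions to workers; partial[t] is private.
      for (std::size_t i = t; i < funcFeatures.size(); i += nThreads)
        for (const std::string& f : funcFeatures[i]) ++partial[t][f];
    });
  for (std::thread& w : workers) w.join();

  FeatureCounts global;  // reduction: merge the per-thread partial indices
  for (const FeatureCounts& p : partial)
    for (const auto& [feature, n] : p) global[feature] += n;
  return global;
}
```

Because addition is associative and commutative, the merged index is independent of the thread count and of the order in which workers finish.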
Since it is challenging to generate accurate ground truth for a binary's CFG, we evaluate the correctness of our parallel CFG construction algorithm and implementation by approximating the ground truth with debug information and RTL intermediates. We then evaluate the performance of our work using hpcstruct and BinFeat.

To illustrate the correctness of our approach, we verified our algorithm and implementation using 113 binaries obtained by compiling the coreutils and tar projects. These binaries are compiled with GCC 9.3.0 for x86-64, with link-time optimization disabled and other optimizations enabled as specified by the package. In addition, we compiled these binaries with debug information and injected the flag -fdump-rtl-dfinish to generate RTL intermediates for individual source files. The debug information and RTL are used only for generating the ground truth. The ground truth of this data set consists of three parts:

• We represent the boundary of a function with address ranges, essentially projecting the CFG of a function onto the virtual address space. The DWARF .debug_info section encodes function ranges. In particular, it supports multiple non-contiguous ranges for one function and supports one range corresponding to multiple functions. Therefore, we can evaluate the handling of functions sharing code and of non-contiguous functions.

• We include the size of a jump table as part of the ground truth, which can be extracted by scanning the RTL files. Unfortunately, we cannot derive jump table locations or the actual targets from the RTL files. As existing jump table analysis has focused on bounding the size of jump tables, we believe jump table sizes provide significant evaluation value.

• RTL encodes the ground truth for calls to non-returning functions, where a non-returning call has
REG_NORETURN as one of its arguments.

We then wrote a checker program that uses our parallel CFG construction implementation to build the CFG, print out function ranges, jump table sizes, and non-returning calls, and match these items against the ground truth. By manual inspection of the automatically identified differences, we found four distinct differences between our implementation and the ground truth:

• Failing to identify non-returning calls to 'error', causing functions to include additional ranges. 'error' is non-returning when its first argument is non-zero, but returning when the first argument is zero. Existing non-returning function analysis performs name matching for external functions; this approach does not work for 'error'.

• For a function foo, the compiler may emit another function symbol ("foo.cold") for outlined cold blocks from foo. However, the debugging information does not encode "foo.cold" and lists the address ranges of the "foo.cold" blocks as part of foo.

• Failing to resolve a jump table whose calculation uses the stack to store intermediate values.

• An extra indirect jump target caused by failing to identify a non-returning call to 'error', leading to a bogus control flow edge to the indirect jump.

In all cases above, the differences are caused either by incorrectness in the individual CFG operations (O_CFEC and O_IEC) or by mismatches between the symbol table and the DWARF information. In other words, the errors are not caused by incorrect parallelism and can be fixed by improving the implementations of O_CFEC and O_IEC.

hpcstruct

We use four large binaries to illustrate the effectiveness of our parallelization for speeding up performance analysis: two binaries from Lawrence Livermore National Laboratory (LLNL1 and LLNL2), one large binary from Argonne National Laboratory (Camellia), and one shared library from TensorFlow [Abadi et al. 2016]. Sizes of the relevant sections of the four binaries are given in Table 1.
LLNL1 is a Power little-endian 64-bit binary, LLNL2 and TensorFlow are x86-64, and Camellia is a Power big-endian 32-bit binary. Due to export control, we are unable to disclose the names of these binaries until approved by LLNL.
[Table 1 omitted: its cell values were lost in extraction.] Table 1. Relevant statistics of the binaries used as input for the various benchmarks (total, .text, and .debug_* section sizes in MiB).

[Figure 2 omitted.] Fig. 2. Trace from a run of hpcstruct on TensorFlow; descriptions of labeled sections are given in Section 8.2.
[Table 2 omitted: its cell values were lost in extraction.] Table 2. Performance results, averages of 10 runs unless otherwise noted. Times for DWARF and CFG represent parallel DWARF parsing and parallel CFG construction, corresponding to sections 2 and 4 in Figure 2.
LLNL1, LLNL2, and Camellia were compiled by their corresponding software development teams; we compiled the TensorFlow binary with GCC 8.3.0. Experiments on the LLNL binaries were run on a node with 16 threads (8 cores), Camellia on one with 36 threads (18 cores), and TensorFlow on a two-socket machine with 36 cores each (72 threads total). Results for LLNL2 are based on one run for each thread count; we have limited access to the binary and cannot repeat the experiment.

[Figure 3 omitted: its data points were lost in extraction.] Fig. 3. Average speedup (geometric mean) of hpcstruct on the four binaries, as described in Section 8.2.
The results are presented in Table 2 and in Figure 3. Overall, hpcstruct has an end-to-endspeedup of 6 × to 8 × , due to several serial phases in the application code. We achieved a speedup of9 × to 25 × for constructing CFGs and a speedup of 8 × to 14 × for DWARF parsing.To better understand the end-to-end performance impact of our work, we break down themain phases of execution within hpcstruct in Figure 2, which presents a performance trace of hpcstruct running on TensorFlow with 64 threads. The contents of each phase are as follows:(1) Read data from disk into an internal buffer.(2) Parse DWARF type information in parallel and store in appropriate data structures. Imbalancein the sizes of compilation units can cause some idling.(3) Parse address to function and line mappings from DWARF and store in a serial structureoptimized for accelerated lookup. (4) Parse text regions in parallel to identify functions and construct the final CFG.(5) Convert line map and parsing results into “skeleton” objects inside hpcstruct , which aresuitable for export.(6) Query Dyninst structures in parallel to fill the “skeleton” with the final data to be serialized.(7) Serialize data and write to disk in parallel with queries to mitigate the effects of serialprocessing.Although our parallelization (2 and 4) scales well, the overall execution of hpcstruct hasdifficulties scaling. As per Amdahl’s Law, the serialization in application code (1, 5-7) and remainingdifficulties (3) prevent our speedup from scaling past 13 × . Applications with less serialization willsee larger speedups. BinFeat
Software forensics researchers typically use real-world software to construct their training sets. We follow their practice and construct a set of binaries to analyze. We compiled Apache HTTP Server [The Apache HTTP Server Project [n.d.]], Redis [The Redis Project [n.d.]], Mysqlslap [The MariaDB Project [n.d.]], and Nginx [The Nginx Project [n.d.]] with GCC 6.4.0 and -O2 optimization.
Table 3. Performance results for BinFeat: time taken (in seconds) for each stage at each core count, together with the resulting speedups. CFG, IF, CF, and DF represent the stages of CFG construction, extracting instruction features, extracting control flow features, and extracting data flow features, respectively. Our data set contains 504 binaries. This experiment was run on an x86-64 machine with 18 cores, 72 threads, and 48 MB of L3 cache.

Table 3 shows the performance results for
BinFeat. We achieved a 7× overall speedup using 32 hardware threads, but gained no further improvement with 64 threads. Extracting instruction features (18×) and control flow features (16×) scales well to 64 threads. The extraction of data flow features achieves only a 9× maximum speedup; we find that its performance is hurt by an imbalanced workload across threads. Note that we extract features from each function in parallel. Data flow analysis typically has a higher time complexity than analyzing instructions or traversing control flow graphs, so the analysis of large functions dominates the whole execution. CFG construction achieves only a 4× speedup; we identify two factors that limit its performance. First, the imbalanced-workload issue also applies to CFG construction, as the jump table analysis in CFG construction takes significantly longer to run than other CFG operations such as creating direct edges. Second, as described in Section 4, the non-returning function dependencies between CFG operations can hurt parallelism. While we mitigate this problem with the eager approach discussed in Section 5.3, the problem still persists. Note that these two issues do not show up for large binaries such as those used in the hpcstruct experiments; large binaries contain sufficient numbers of functions to keep threads busy and hide both issues.

Benefiting other applications:
Our work provides a general framework for researchers to parallelize their binary analysis applications. For example, software vulnerability searching calculates binary code similarity [Chandramohan et al. 2016; David et al. 2016] to match known vulnerable code. Calculating binary code similarity relies on the capabilities of analyzing machine instruction characteristics, control flow, and data flow. Our work has parallelized several of these common analysis capabilities, and it will be interesting to see how our work benefits other binary analysis applications.
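The imbalanced workload described above for data flow feature extraction can be illustrated with a toy list-scheduling model. The sketch below uses made-up task costs, not measurements from BinFeat: one expensive task, standing in for a large function, inflates the makespan of a static partition, while dynamic self-scheduling recovers most, but not all, of the loss:

```python
import heapq

def static_makespan(costs, nthreads):
    # Cyclic static partition: thread i is assigned tasks i, i+n, i+2n, ...
    return max(sum(costs[i::nthreads]) for i in range(nthreads))

def dynamic_makespan(costs, nthreads):
    # Self-scheduling: each idle thread grabs the next task in the queue.
    finish = [0.0] * nthreads          # per-thread finish times (min-heap)
    heapq.heapify(finish)
    for c in costs:
        t = heapq.heappop(finish)      # earliest-idle thread takes the task
        heapq.heappush(finish, t + c)
    return max(finish)

# One 40-second "large function" plus 127 one-second functions, on 8 threads.
costs = [40.0] + [1.0] * 127
print(static_makespan(costs, 8))   # 55.0
print(dynamic_makespan(costs, 8))  # 40.0
```

Even with dynamic scheduling, the makespan cannot drop below the cost of the single longest task, which matches our observation that long-running operations such as jump table analysis continue to limit speedup.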
Compiler-assisted analysis:
Our work opportunistically uses information from the compiler (such as correct and detailed labels in the code and DWARF). However, this is not a complete solution, and we cannot count on compiler support being sufficient or even accurate. Surprisingly often, even for the most widely used compilers, the compiler-provided information is incomplete or inaccurate. One key issue is that binary analysis applications typically do not control which compiler is used to generate the input binaries. For performance analysis, software developers often use the compiler and optimization flags that yield the greatest performance, which often leads to less accurate debugging information. Software forensics analysts deal with binaries collected from the wild, whose compiler-generated information is often intentionally removed to defend against analysis. Therefore, while we use compiler assistance when available, we cannot rely on its presence.

Other forms of parallelism:
We focus on multithreading as the mechanism for parallelization. Other forms of parallelism can be used to further improve the performance of binary analysis applications. For example,
BinFeat can benefit from node-level parallelism by distributing the analysis of different binaries to different machines. We believe this type of parallelism is feasible for certain applications and is orthogonal to our work. Binary analysis application developers can build on our work and pursue additional parallelization opportunities as needed.
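As a minimal single-machine analog of this idea, per-binary analyses can be fanned out with a worker pool, since each binary is analyzed independently. In the sketch below, analyze_binary is a hypothetical stand-in for a real tool's per-binary work (here it just returns the path length); a cluster deployment would instead submit one job per binary:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_binary(path):
    # Hypothetical stand-in: a real tool would parse the binary at
    # `path` and extract its features here.
    return path, len(path)

def analyze_all(paths, workers=4):
    # No shared state between binaries, so no synchronization is needed;
    # results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(analyze_binary, paths))

print(analyze_all(["httpd", "redis-server", "mysqlslap", "nginx"]))
# → {'httpd': 5, 'redis-server': 12, 'mysqlslap': 9, 'nginx': 5}
```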
Stripped binaries:
Stripped binaries no longer contain the static symbol table (.symtab), but they still have the dynamic symbol table (.dynsym) and the exception unwinding frame information (.eh_frame). In addition, our algorithms can be augmented with orthogonal research on identifying function entry points in stripped binaries [Bao et al. 2014; Rosenblum et al. 2008; Shin et al. 2015].
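A minimal sketch of this fallback, assuming a hypothetical parser has already reduced each binary to a mapping from section name to the function addresses derivable from it; when .symtab is stripped, the seeds simply come from the remaining sections:

```python
def seed_function_entries(sections):
    # `sections` maps a section name to addresses recovered from it
    # (hypothetical output of an ELF parser). Union whatever survives:
    # .symtab in unstripped binaries, .dynsym and .eh_frame otherwise.
    seeds = set()
    for name in (".symtab", ".dynsym", ".eh_frame"):
        seeds.update(sections.get(name, ()))
    return sorted(seeds)

# A stripped binary: no .symtab, but .dynsym and .eh_frame survive.
stripped = {".dynsym": [0x1040, 0x1200], ".eh_frame": [0x1040, 0x1380]}
print([hex(a) for a in seed_function_entries(stripped)])
# → ['0x1040', '0x1200', '0x1380']
```

The learning-based techniques cited above can then contribute additional candidate entry points for code reachable from none of these sections.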
Source code CFG construction:
The challenges of binary code CFG construction are largely distinct from those of source code CFG construction. First, binary code functions can share code, which is the main reason we must derive operation properties to guide the design invariants that support analyzing multiple functions of a binary in parallel. In contrast, source code functions cannot overlap unless they are nested, so CFG construction for source code does not require rigorous synchronization. For example, a source code basic block parsed by one thread will not be split by another thread. Second, jump tables in binary code are often used to implement switch statements in the source code. Jump tables are encoded as indirect control flow in the binary code, whose targets must be identified through data flow analysis. In source code, however, the body of a switch statement is naturally grouped together, and it is straightforward to identify every case clause of the switch statement. Third, the body of a source code function is contiguous, whereas basic blocks of binary code functions can be outlined to improve instruction cache performance. As a result, binary analysis must handle non-contiguous functions. Fourth, tail calls in binary code are just normal function calls in source code.
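The jump table contrast can be made concrete with a small model. In the sketch below (an illustration in Python, not actual machine code), the list of functions plays the role of the table of basic-block addresses that a compiler typically emits for a dense switch: the compiled code bounds-checks the value and then jumps indirectly through the table, and it is that indirect jump whose targets the data flow analysis must recover:

```python
# The handlers model the basic blocks whose addresses a compiler would
# store in the jump table for a dense switch statement.
def case_red():   return "red"
def case_green(): return "green"
def case_blue():  return "blue"
def default():    return "unknown"

JUMP_TABLE = [case_red, case_green, case_blue]

def switch_dispatch(x):
    # Compiled form: a bounds check followed by an indirect jump
    # through the table, e.g. `jmp *table(,%rax,8)` on x86-64.
    if 0 <= x < len(JUMP_TABLE):
        return JUMP_TABLE[x]()   # indirect control flow
    return default()

print(switch_dispatch(1), switch_dispatch(7))  # → green unknown
```

In source code, each case label names its target directly; in the binary, only the bounds check and the table's base address survive, so the analysis must reconstruct the table's extent and contents from the index computation.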
10 CONCLUSION
With the increasing size of software and the need to analyze large batches of binaries, adding multithreaded parallelism speeds up binary analysis, but doing so requires principled algorithm and data structure redesign and careful attention to implementation. Our work centers on a theoretical abstraction that expresses CFG construction as applications of individual CFG operations. We derived operation dependency, commutativity, and monotonic ordering properties, which enabled us to assess the strengths and weaknesses of existing serial CFG construction algorithms and guided us toward a new design for our parallel CFG construction algorithm. We evaluated our parallel binary analysis with a performance analysis tool, hpcstruct, and a software forensics tool, BinFeat, achieving 25× speedup for parallel CFG construction, 14× for ingesting DWARF, 8× overall for hpcstruct, and 7× overall for BinFeat using 64 hardware threads. Our results show that our parallel binary analysis can significantly speed up binary analysis applications, cutting the wait times for their users and developers.
REFERENCES
Martín Abadi et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI). https://usenix.org/conference/osdi16/technical-sessions/presentation/abadi
L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. 2010. HPCTOOLKIT: Tools for Performance Analysis of Optimized Parallel Programs. Concurrency and Computation: Practice and Experience 22, 6 (April 2010), 685–701.
Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Gregory L. Lee, Barton P. Miller, and Martin Schulz. 2007. Stack Trace Analysis for Large Scale Debugging. In IEEE International Parallel and Distributed Processing Symposium (IPDPS). Long Beach, California, USA, 1–10.
Tiffany Bao, Jonathan Burket, Maverick Woo, Rafael Turner, and David Brumley. 2014. BYTEWEIGHT: Learning to Recognize Functions in Binary Code. In USENIX Security Symposium. San Diego, CA, 845–860.
Sébastien Bardin, Philippe Herrmann, and Franck Védrine. 2011. Refinement-based CFG Reconstruction from Unstructured Programs. In International Conference on Verification, Model Checking, and Abstract Interpretation (VMCAI). Austin, TX, USA, 54–69.
Andrew R. Bernat and Barton P. Miller. 2012. Structured Binary Editing with a CFG Transformation Algebra. In Working Conference on Reverse Engineering (WCRE).
Boost. [n.d.]. Boost C++ Libraries, https://boost.org/.
Aylin Caliskan-Islam, Fabian Yamaguchi, Edwin Dauber, Richard Harang, Konrad Rieck, Rachel Greenstadt, and Arvind Narayanan. 2018. When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries. In Network and Distributed System Security Symposium (NDSS). San Diego, CA, USA.
Mahinthan Chandramohan, Yinxing Xue, Zhengzi Xu, Yang Liu, Chia Yuan Cho, and Hee Beng Kuan Tan. 2016. BinGo: Cross-Architecture Cross-OS Binary Search. In ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE). Seattle, WA, USA.
Cliff Click. 2007. A Lock-Free Hash Table. In JavaOne Conference.
William D. Clinger. 1998. Proper Tail Recursion and Space Efficiency. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM Press, Montreal, Canada, 174–185.
Cristian Ţăpuş, I-Hsin Chung, and Jeffrey K. Hollingsworth. 2002. Active Harmony: Towards Automated Performance Tuning. In ACM/IEEE Conference on Supercomputing (SC). Baltimore, Maryland, 1–11.
Yaniv David, Nimrod Partush, and Eran Yahav. 2016. Statistical Similarity of Binaries. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). Santa Barbara, California, USA, 266–280.
Alessandro Di Federico, Mathias Payer, and Giovanni Agosta. 2017. Rev.Ng: A Unified Binary Analysis Framework to Recover CFGs and Function Boundaries. In International Conference on Compiler Construction (CC). Austin, TX, USA.
Yizi Gu and John Mellor-Crummey. 2018. Dynamic Data Race Detection for OpenMP Programs. In International Conference for High Performance Computing, Networking, Storage, and Analysis (SC).
Intel. [n.d.]. Threading Building Blocks, https://threadingbuildingblocks.org/.
Emily R. Jacobson, Andrew R. Bernat, William R. Williams, and Barton P. Miller. 2014. Detecting Code Reuse Attacks with a Model of Conformant Program Execution. In International Symposium on Engineering Secure Software and Systems (ESSoS). Munich, Germany, 18.
Johannes Kinder and Dmitry Kravchenko. 2012. Alternating Control Flow Reconstruction. In International Conference on Verification, Model Checking, and Abstract Interpretation (VMCAI). Philadelphia, PA.
Johannes Kinder and Helmut Veith. 2008. Jakstab: A Static Analysis Platform for Binaries. In International Conference on Computer Aided Verification (CAV). Princeton, NJ, USA, 423–427.
Xiaozhu Meng. [n.d.]. A tool based on Dyninst to extract binary code features for software forensics, https://github.com/mxz297/BinFeat.
Xiaozhu Meng and Barton P. Miller. 2016. Binary Code Is Not Easy. In International Symposium on Software Testing and Analysis (ISSTA). Saarbrücken, Germany.
Xiaozhu Meng, Barton P. Miller, and Kwang-Sung Jun. 2017. Identifying Multiple Authors in a Binary Program. In European Symposium on Research in Computer Security (ESORICS). Oslo, Norway.
Maged M. Michael. 2002. High Performance Dynamic Lock-Free Hash Tables and List-Based Sets. In Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA). ACM, 73–82.
Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jeffrey K. Hollingsworth, R. Bruce Irvin, Karen L. Karavanic, Krishna Kunchithapadam, and Tia Newhall. 1995. The Paradyn Parallel Performance Measurement Tool. IEEE Computer 28, 11 (Nov. 1995), 37–46.
Nicholas Nethercote and Julian Seward. 2007. Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '07). ACM, New York, NY, USA, 89–100.
Nathan V. Roberts. 2014. Camellia: A Software Framework for Discontinuous Petrov-Galerkin Methods. Computers & Mathematics with Applications 68, 11 (2014), 1581–1604.
Nathan Rosenblum, Barton P. Miller, and Xiaojin Zhu. 2011a. Recovering the Toolchain Provenance of Binary Code. In International Symposium on Software Testing and Analysis (ISSTA). Toronto, Ontario, Canada.
Nathan Rosenblum, Xiaojin Zhu, and Barton P. Miller. 2011b. Who Wrote This Code? Identifying the Authors of Program Binaries. In European Symposium on Research in Computer Security (ESORICS). Leuven, Belgium, 18.
Nathan Rosenblum, Xiaojin Zhu, Barton P. Miller, and Karen Hunt. 2008. Learning to Analyze Binary Computer Code. In AAAI Conference on Artificial Intelligence. AAAI Press, Chicago, Illinois, 798–804.
B. Schwarz, S. Debray, and G. Andrews. 2002. Disassembly of Executable Code Revisited. In Ninth Working Conference on Reverse Engineering (WCRE). Richmond, VA, USA.
Eui Chul Richard Shin, Dawn Song, and Reza Moazzezi. 2015. Recognizing Functions in Binaries with Neural Networks. In USENIX Security Symposium. Austin, TX, USA.
Y. Shoshitaishvili, R. Wang, C. Salls, N. Stephens, M. Polino, A. Dutcher, J. Grosen, S. Feng, C. Hauser, C. Kruegel, and G. Vigna. 2016. SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis. In IEEE Symposium on Security and Privacy (S&P). San Jose, CA, USA.
The Apache HTTP Server Project. [n.d.]. The Apache HTTP Server Project, https://httpd.apache.org/.
The MariaDB Project. [n.d.]. mysqlslap is a tool for load-testing MariaDB.
The Nginx Project. [n.d.]. https://nginx.com/.
The Redis Project. [n.d.]. An open source, in-memory data structure store, https://redis.io/.
H. Theiling. 2000. Extracting Safe and Precise Control Flow from Binaries. In Seventh International Conference on Real-Time Systems and Applications (RTCSA). Cheju Island, South Korea, 23–30.
Josh Triplett, Paul E. McKenney, and Jonathan Walpole. 2011. Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming. In USENIX Annual Technical Conference, Vol. 11.
V. v. d. Veen, E. Göktas, M. Contag, A. Pawoloski, X. Chen, S. Rawat, H. Bos, T. Holz, E. Athanasopoulos, and C. Giuffrida. 2016. A Tough Call: Mitigating Advanced Code-Reuse Attacks at the Binary Level. In IEEE Symposium on Security and Privacy (S&P). San Jose, CA, USA.
Victor van der Veen, Dennis Andriesse, Enes Göktaş, Ben Gras, Lionel Sambuc, Asia Slowinska, Herbert Bos, and Cristiano Giuffrida. 2015. Practical Context-Sensitive CFI. In ACM Conference on Computer and Communications Security (CCS). Denver, Colorado, USA.
David Williams-King, Hidenori Kobayashi, Kent Williams-King, Graham Patterson, Frank Spano, Yu Jian Wu, Junfeng Yang, and Vasileios P. Kemerlis. 2020. Egalito: Layout-Agnostic Binary Recompilation. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).