A Generating-Extension-Generator for Machine Code
MICHAEL VAUGHN,
University of Wisconsin
THOMAS REPS,
University of Wisconsin

The problem of "debloating" programs for security and performance purposes has begun to see increased attention. Of particular interest in many environments is debloating commodity off-the-shelf (COTS) software, which is most commonly made available to end users as stripped binaries (i.e., neither source code nor symbol-table/debugging information is available). Toward this end, we created a system, called GenXGen[MC], that specializes stripped binaries.

Many aspects of the debloating problem can be addressed via techniques from the literature on partial evaluation. However, applying such techniques to real-world programs, particularly stripped binaries, involves non-trivial state-management manipulations that have never been addressed in a completely satisfactory manner in previous systems. In particular, a partial evaluator needs to be able to (i) save and restore arbitrary program states, and (ii) determine whether a program state is equal to one that arose earlier. Moreover, to specialize stripped binaries, the system must also be able to handle program states consisting of memory that is undifferentiated beyond the standard coarse division into regions for the stack, the heap, and global data.

This paper presents a new approach to state management in a program specializer. The technique has been incorporated into GenXGen[MC]. Our experiments show that our solution to issue (i) significantly decreases the space required to represent program states, and our solution to issue (ii) drastically improves the time for producing a specialized program (as much as a 13,000x speedup).
Modern commodity off-the-shelf (COTS) software tends to provide large sets of features to support the diverse use cases of their end-users. However, individual users of many COTS programs might only use a single, fixed subset of the available functionality. From such a user's perspective, unused functionality constitutes "bloat" in terms of binary size, program performance, and attack surface. A means of producing specialized versions of programs that only include features relevant to a given use case would be a useful tool for simplifying and hardening COTS software. In particular, given certain configuration settings, a developer or administrator may wish to remove features irrelevant to their particular configuration, thereby improving space usage and performance, and reducing the program's attack surface.

Toward this end, we have created a system, called GenXGen[MC], that specializes stripped binaries. The premise behind our work is that many aspects of the "debloating" problem can be addressed via techniques from the literature on partial evaluation [9, 13]. For instance, a partial evaluator pe takes as inputs (i) a program P (expressed in some language L); (ii) a partition of P's inputs into two sets, supplied and delayed (for short, S and D, respectively); and (iii) an assignment A(S) to the variables in S. As output, pe produces a residual program P_A(S) that is specialized with respect to A(S). More formally, we have

    ⟦pe⟧(P, A(S)) = P_A(S),    (1)

where ⟦·⟧ denotes the meaning function for the language in which pe is written. The requirement on residual program P_A(S) is that it must obey the following equation:

    ⟦P⟧_L(A(S ∪ D)) = ⟦P_A(S)⟧_L(A(D)),    (2)

We find "supplied" and "delayed" to be more suggestive than the standard terms "static" and "dynamic," respectively.
where ⟦·⟧_L is the meaning function for L. That is, P_A(S) with input A(D) produces the same output as P with input A(S ∪ D); however, P_A(S) has fewer input arguments, and is specialized with respect to the assignment A(S). (Here, the partition of P's inputs is implicit in A(S).)

Authors' addresses: Michael Vaughn, University of Wisconsin, [email protected]; Thomas Reps, University of Wisconsin, [email protected].

A partial evaluator may be able to identify parts of a program's control-flow graph (CFG) that are unreachable given particular configuration settings, and produce a residual program that does not contain the identified parts. Moreover, code in the program that is dependent solely on the supplied inputs can be executed by the partial evaluator, and elided from the resulting specialized program. In practice, these abilities allow a partial evaluator to perform a multitude of optimizations, without the developer of the partial evaluator needing to write explicit implementations of each optimization [13]. For example, a partial evaluator will perform removal of unreachable code and constant folding, as well as more sophisticated optimizations, such as loop unrolling and function in-lining. For debloating, a partial evaluator can (i) simplify code so that the resulting program incorporates specific features based on particular configuration parameters, and (ii) collapse abstraction layers in the original program via function in-lining.

In some contexts—including in our work—an alternative formulation of the above approach, based on the creation of generating extensions, is more desirable. A generating extension can be thought of as a self-contained, program-specific partial evaluator.
A generating extension for P and a specified set of supplied inputs S is a program ge_P,S that obeys the following equation:

    ⟦ge_P,S⟧(A(S)) = P_A(S),    (3)

where P_A(S) is the specialized residual program defined previously, which obeys Eqn. (2). This approach to program specialization is enabled by a tool called a generating-extension generator: a program that takes as input P and S, and creates a generating extension ge_P,S.

The difference between the two approaches to program specialization can be summarized as follows:
• Applying a partial evaluator to program P and partial state A(S) is similar to interpreting P on an input state, except that the output is a specialized program P_A(S).
• Applying a generating-extension generator to P is similar to compiling P, except that the outcome is a program, ge_P,S, that, when executed on a partial state A(S), produces the specialized program P_A(S).

Generating extensions, and in particular machine-code generating extensions, have the advantage that they can execute as native programs; a semantic model of the target language is only needed to produce the generating extension. At specialization time, no semantic information is needed (other than the semantics built into the hardware platform on which a generating extension runs). Moreover, a pre-made generating extension can be delivered to an end user who wishes to specialize a program without needing to deliver additional special-purpose tools for specializing programs. For these reasons, we chose to work with generating extensions.
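To make Eqn. (3) concrete, the following is a small hand-written sketch of our own (it is not part of GenXGen[MC], and the subject program power is our example, not the paper's): a generating extension for a two-input function with the exponent supplied and the base delayed. Running it on a value of n emits the residual program as C source text.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sketch (our example, not GenXGen[MC]): a hand-written
 * generating extension ge_power for the subject program
 *     int power(int n, int x) { int r = 1; while (n-- > 0) r *= x; return r; }
 * with n supplied and x delayed. Per Eqn. (3), executing ge_power on an
 * assignment to n produces the residual program power_n. */
int ge_power(int n, char *out, size_t cap) {
    int off = snprintf(out, cap, "int power_%d(int x) { int r = 1;", n);
    /* The loop over the supplied input n is executed here, at
     * specialization time; each iteration emits one residual statement
     * over the delayed input x. */
    for (int i = 0; i < n; i++)
        off += snprintf(out + off, cap - off, " r = r * x;");
    off += snprintf(out + off, cap - off, " return r; }");
    return off;  /* length of the emitted residual program text */
}
```

Note how the control flow governed by the supplied input is consumed at specialization time (the loop runs inside ge_power), while every operation on the delayed input is emitted as residual code.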
Program specializers have been created for many different types of languages, including imperative, functional, and logic-based, both for source code and—less frequently—for machine code. However, when creating a program-specialization tool for real-world programs, one faces a multitude of problems. In particular, a program-specialization tool must address two state-management problems:
(1) A program specializer needs to be able to save and restore program states efficiently.
(2) A program specializer uses a worklist-based algorithm that executes a program over partial program states (§2 and §4). To prevent redundant exploration of the program's state space, there needs to be an efficient means of determining whether a (partial) state has repeated.

Naive approaches to these issues are extremely costly (§6):
• A straightforward approach to issue (1) means copying the entire state for each save and restore operation.
• The need to test a new state against all states that have previously arisen (issue (2)) suggests the use of hashing. However, resolving collisions requires the ability to compare two states for equality.

These state-management operations have never been addressed in a completely satisfactory manner in prior work, and the disadvantages of prior approaches become more significant in the context of specializing stripped binaries. For instance, programs often use linked data structures, constructed using nodes allocated from the heap. However, for a stripped binary, program states consist of memory that is undifferentiated beyond the coarse division into regions for the stack, the heap, and global data. Moreover, for a program specializer that runs natively, the states that need to be captured and compared in issues (1) and (2) are native hardware states (at the level of the instruction-set architecture).

In this paper, we describe a new technique for state management in a program specializer that runs natively.
To demonstrate these techniques, we implement a new program-specialization tool, GenXGen[MC], and present an evaluation of its effectiveness. To address issues (1) and (2), our approach makes use of several ideas known from the literature:
• Using built-in OS process-creation and context-switching mechanisms for saving and restoring states [7].
• Using Rabin fingerprinting to create an incrementally updatable hash of a program's entire address space, where there is an exponentially small probability of the hashes of any two states colliding [19, 21].
• Exploiting hardware support for copy-on-write (CoW) memory management [18] to identify changed memory regions, without the need to instrument arbitrary subject-program memory accesses or resort to machine-code interpretation. Moreover, the use of CoW reduces physical memory pressure by sharing unchanged pages between multiple processes.

The contributions of our work are as follows:
A. Our main technical contributions are to state management in a program specializer that runs natively (§3 and §5). Unlike prior approaches used in program specializers, our state-management technique does not support a mechanism to resolve hash collisions. Instead, by choosing appropriate values for parameters of the hashing scheme, the probability of a collision can be made arbitrarily small, which allows us to forgo the conventional constraint that collisions be resolvable. By relaxing the collision-resolution constraint to a probabilistic guarantee, we obtain the following benefits:
1. With this technique, state equivalence can be checked in constant time.
2. This state-management technique handles program states over an address space divided into otherwise undifferentiated stack, heap, and global regions. Fine-grained knowledge about variables and types is not required at specialization time. Nor is it necessary for the tool to have knowledge of the distinction between free storage and storage that is in use in the heap.
3.
Moreover, we are able to use a hashing technique that supports efficient incremental updating of hash values [6, 22].
B. With these state-management techniques, we implemented GenXGen[MC], a new tool for specializing binaries. We present an evaluation of our technique's effectiveness in §6.

Our approach has several benefits:
• It allowed us to create a program specializer that specializes machine code, runs natively, and can work without symbol-table information (§4 and §5).
• The ability to perform O(1) state comparisons significantly improves specialization performance, compared to a naive approach (§6).
• The use of CoW dramatically reduces memory usage, compared to using full-state copies (§6).

To make the paper self-contained, §2 presents a summary of partial evaluation and generating extensions, using an example to provide intuition. §7 discusses related work. §8 concludes.
The purpose of this section is to provide background for readers unfamiliar with partial evaluation and generating extensions, and to help them understand how the material in §3–§5 represents an advance over previous work. (Readers already familiar with these techniques may wish to peruse the examples in this section and proceed to §3.) To aid understanding, relevant concepts are presented using source-code examples, with the naive substring-matching procedure match (Fig. 1(a)) serving as the running example.

§2.1 describes how a partial evaluator specializes match on the pattern string. §2.2 describes a C generating extension that performs the same specialization. In both approaches, there is a first phase of binding-time analysis (BTA) and a second worklist-driven specialization phase that produces the residual program. Given the desired partition of the inputs into supplied and delayed sets, BTA extends the partition to the program's variables at all program points, identifying variable occurrences that can safely be included in partial states. The specialization phase traverses the subject program's CFG, executing each basic block it encounters. Moreover, the subject program is executed over partial states: states whose values can be safely computed when program execution starts with an assignment to the supplied input variables.
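Before walking through match, the requirement on a residual program (Eqn. (2)) can be made concrete with a minimal hand-constructed example of our own (simpler than match, and not taken from the paper): a subject program with one supplied and one delayed input, alongside the residual program for one particular assignment to the supplied input.

```c
/* Our illustrative example (not from the paper): subject program P with
 * supplied input n and delayed input x, and a hand-constructed residual
 * program P_{A(S)} for the assignment A(S) = [n |-> 3]. */
int power(int n, int x) {      /* P: inputs (n, x) */
    int r = 1;
    while (n-- > 0) r *= x;
    return r;
}

int power_3(int x) {           /* P_{A(S)}: loop unrolled, n eliminated */
    return x * x * x;
}
```

For every delayed input x, power(3, x) == power_3(x), which is exactly Eqn. (2): ⟦P⟧(A(S ∪ D)) = ⟦P_A(S)⟧(A(D)). The specialization phase described below produces such residual programs mechanically.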
The C procedure match in Fig. 1(a) is an implementation of an O(|s||p|) naive substring-matching algorithm. It returns 1 if and only if the string pointed to by s contains the string p as a substring. Note that s and p are presumed to point to valid C strings, and thus match terminates whenever the null terminator (ASCII 0) for either string is encountered.

If we partially evaluate match with p pointing to the string "hat", we obtain the procedure shown in Fig. 1(b). In this version, the inner loop has been unrolled, and all manipulations and uses of pat and p have been eliminated: the characters in "hat" are hard-coded into the tests in the specialized procedure.

(a)
    int match(char *p, char *s) {
        while (*s != 0) {
            char *s1 = s;                 // block 2
            char *pat = p;
            while (1) {
                if (*pat == 0) return 1;  // block 3
                if (*pat != *s1) break;   // block 4
                pat++; s1++;              // block 5
            }
            s++;
        }
        return 0;
    }

(b)
    int match_s(char *s) {
        while (*s != 0) {
            char *s1 = s;
            if (*s1 == 'h') {
                s1++;
                if (*s1 == 'a') {
                    s1++;
                    if (*s1 == 't') {
                        return 1;
                    }
                }
            }
            s++;
        }
        return 0;
    }

Fig. 1. (a) String-matching program match; (b) match partially evaluated on p = "hat".

    int match_ge(char *p) {
        worklist_t L = empty_worklist();
        state_t successor_state;
        state_t cur_state;
        int cur_block;
        worklist_enqueue(L, 1, init_state);
        printf("match_s(char *s){");
        while (!is_empty(L)) {
            cur_block = get_worklist_head(L).block;
            cur_state = get_worklist_head(L).state;
            remove_worklist_head(L);
            if (!previously_visited(cur_block, cur_state)) {
                if (cur_block == 1) handle_block_1(L, cur_state);
                if (cur_block == 2) handle_block_2(L, cur_state);
                // code elided
                if (cur_block == 8) handle_block_8(L, cur_state);
            }
        }
        printf("}");
    }

    void handle_block_3(worklist_t L, state_t S) {
        pat = S.pat;
        printf("blk_3_%d:", S.id);
        successor_state = snapshot(pat);
        if (*pat == 0) {
            printf(" goto blk_7_%s", successor_state);
            worklist_enqueue(L, 7, successor_state);
        } else {
            printf(" goto blk_4_%s", successor_state);
            worklist_enqueue(L, 4, successor_state);
        }
    }

    void handle_block_4(worklist_t L, state_t S) {
        pat = S.pat;
        printf("blk_4_%d:", S.id);
        successor_state = snapshot(pat);
        printf("if(%c != *s1)", *pat);
        printf(" goto blk_2_%s", successor_state);
        printf(" else goto blk_5_%s", successor_state);
        worklist_enqueue(L, 2, successor_state);
        worklist_enqueue(L, 5, successor_state);
    }

    void handle_block_5(worklist_t L, state_t S) {
        pat = S.pat;
        printf("blk_5_%d:", S.id);
        pat++;
        printf("s1++;");
        successor_state = snapshot(pat);
        printf("goto blk_3_%s", successor_state);
        worklist_enqueue(L, 3, successor_state);
    }

Fig. 2. Generating extension for the naive string matcher from Fig. 1.

For this example, Eqns.
(1) and (2) become

    ⟦pe⟧(match, [p ↦ "hat"]) = match_[p ↦ "hat"] = match_s,
    ⟦match⟧_C([p ↦ "hat"] ∪ A(D)) = ⟦match_s⟧_C(A(D)),

where ⟦·⟧_C denotes the meaning function for C.

Partial evaluation can be implemented using a two-stage process, consisting of BTA and the specialization phase, which specializes the program by executing over partial states [13] (starting with an initial partial state, such as [p ↦ "hat"]).

There are many possible partitions that a BTA algorithm could produce. A BTA algorithm is acceptable for our purposes as long as the partition that it produces for each program point is congruent [13]. Informally, congruence ensures that in every subject-program statement that updates a supplied variable, the update to the supplied variable does not depend on any delayed values. A partition of the variable occurrences at the different program points of P into supplied and delayed sets (V_s and V_d, respectively) is congruent if at every statement l in P where a variable v ∈ V_s is updated, the new value of v is computed solely from variables in V_s. Congruence is important because it ensures that the partial state induced by the set of supplied inputs can always be safely updated.

A BTA algorithm can use forward slicing [12, 28] to compute a congruent partition. Given a set of variables V and a set of program points L, forward slicing computes the set of program points that may be affected by the values of V at points in L. For BTA, we compute the forward slice from the delayed inputs. The boxed statements in Fig. 1(a) show the program points included in the forward slice starting at formal parameter s. A congruent partition of the program variable occurrences is implicit in the slice.
The forward slice contains all assignments to, and uses of, variable occurrences that are transitively dependent on s, while the complement of the slice contains all assignments to and uses of variable occurrences not dependent on s. Thus, to ensure that the specialization phase only performs safe updates, it executes only the statements in the complement of the slice. Moreover, slicing can be viewed as an extension of BTA results from variable occurrences to statements: all statements dependent only on supplied state are marked as supplied; the remainder are marked as delayed.

The specialization phase is essentially a kind of interpreter that executes P over partial states, producing a residual program P′. The specializer interprets the CFG of the program, using a partial state to track the values of the variable occurrences in the supplied set. The interpretation is non-standard because at a condition classified as delayed, such as the two boxed conditionals in Fig. 1(a), there are two successor basic blocks to interpret. A worklist is used to keep track of basic blocks that still need to be processed. Every basic block is interpreted linearly, statement-by-statement, and each statement is evaluated in one of three ways. (1) All statements marked as "supplied" are evaluated, and the partial state is updated accordingly. For example, the statement pat++ will cause the value of pat in the partial state to be incremented by 1. (2) Statements marked as "delayed" are not evaluated, but are emitted to the residual program instead. For instance, the single occurrence of "s1++" in the original match program is emitted at two different times during the specialization of match. (3) However, some statements marked as "delayed" cannot just be emitted as is; if a delayed statement s depends on the value of a supplied variable v, the value of v must be lifted into the residual program's state at s.
Lifting can be performed by replacing every occurrence of v in the emitted statement with the current value of v. For example, lifting is required for the if statement in the inner loop of match: every emitted instance of the test in Fig. 1(b) compares *s1 with a character from "hat".

Unlike a standard interpreter, the specialization phase is prepared to handle control flow governed by delayed state. Consider the if statement at the end of the basic block marked as block 4 in Fig. 1(a). Due to the comparison against the (delayed) string pointed to by s1, there is not sufficient information in the partial state to determine which branch will be taken. Consequently, the specializer must arrange to specialize the blocks at both successors.

In essence, the specializer needs to "go both ways" when encountering a branch governed by delayed state. In practice, the specializer is generally implemented as a worklist-based algorithm: basic blocks are specialized and residuated using the approach described earlier; however, upon reaching a branch classified as "delayed," the specializer records the current state, σ, and adds a (σ, l) pair to the worklist for every successor block l. The specializer then removes an (s, b) pair from the worklist, and executes basic block b, starting with state s. Thus, at the basic-block level, specialization is similar to execution, except that code can also be emitted; at the end of a basic block, the specializer creates the appropriate (partial-state, basic-block) pair(s) for the block's successor(s), and inserts them into the worklist.

The partial evaluation of match illustrates why a partial evaluator needs to be able to check state equality efficiently. Consider block 2, which contains the two assignments at the start of the outer while loop, and ends with an unconditional branch into the inner loop. Every time block 2 is executed, pat is set to point to the start of string p, and block 3 is enqueued.
When block 3 is removed from the worklist, the partial evaluator continues to unroll the inner loop. Subsequently, the partial evaluator reaches the break statement following block 4, triggering a new partial evaluation of block 2: pat is reset, and block 3 is again enqueued, ultimately leading to another identical unrolling of the inner loop. Thus, a partial evaluator that always enqueues the successor of block 2, namely block 3, will never terminate.

To prevent this infinite unrolling, the partial evaluator must be able to detect duplicate partial-state/block pairs. In particular, the first time we evaluate block 2, we want to enqueue the pair (σ, block 3) consisting of the state σ where p is equal to pat, and block 3. Every subsequent time that a partial evaluation of block 2 completes, we have re-encountered the state-pair (σ, block 3). The partial evaluator will not terminate unless it can determine that (σ, block 3) has repeated.

Thus, a worklist-based partial-evaluation algorithm requires two key state-management features:
(1) the ability to save and restore partial states, and
(2) the ability to check state equality efficiently.

When partial evaluation is performed on a program written in a type-safe high-level language, both features can be implemented in a relatively straightforward fashion. Assume that match is always called such that the pointers p and s are guaranteed to reference the beginning of valid C strings. In this case, the relevant state is the set of all supplied variables and memory objects reachable from the supplied variables on the stack.
States can be saved, restored, and compared by traversing the graph of memory objects induced by the reachability relation over the supplied state, in a manner similar to the walk performed by a mark-and-sweep garbage collector.

In §5, we describe an alternative method for state management that is more suitable for generating extensions, particularly machine-code generating extensions (§2.2 and §4).

An alternative approach to program specialization can be implemented via a generating-extension generator. A generating-extension generator
GeGen takes as input a program P and the BTA results, and produces a generating extension:

    ⟦GeGen⟧(P, S) = ge_P,S

where the generating extension ge_P,S produces a residual program:

    ⟦ge_P,S⟧(A(S)) = P_A(S)

such that P_A(S) satisfies Eqn. (2). A generating extension has two key advantages over a partial evaluator:
(1) It can be implemented as a program that executes natively in the target language, without interpretation. A semantic model of the target language is only needed to construct the generating extension.
(2) The structure of a generating extension reflects the basic-block structure of the subject program.

Structurally, a generating extension can be thought of as the original subject program with the partial-evaluation code "compiled in." The two kinds of code can be intermingled in such a way that generating extensions can be produced algorithmically, basic block by basic block. Each basic block in the subject program has an associated basic-block procedure in the generating extension that updates the partial state of the subject program and generates residual code. After these actions are completed, the block yields control to the compiled-in state-management logic. This structure was used by Andersen [1] to automatically produce generating extensions for C programs, and we use a similar approach to structure our generating extensions for machine code (but with different state-management mechanisms, which are described in §5).

For example, Fig. 2 is an Andersen-style C generating extension for procedure match from Fig. 1(a). Consider procedure match_ge in Fig. 2: match_ge repeatedly dequeues a (partial-state, basic-block) pair (σ, b) from the worklist until the worklist is empty. If (σ, b) has not been visited yet, block b is evaluated on σ through a "basic-block procedure."

The state_t struct and snapshot procedure are used to save and restore states.
The state_t struct has a member for every variable in the partial state (in this case, just pat), and snapshot stores the current values of the variables in the struct. To illustrate the structure of basic-block procedures, we consider three basic blocks from the inner loop of Fig. 1(a): blocks 3 and 4, which end with the two if statements in the inner loop, and block 5, which increments the two string pointers.

The structure of each basic-block procedure reflects the structure of the basic blocks in the original program. The statements dependent only on the contents of the string pointed to by s (the boxed statements in Fig. 1(a) and Fig. 2, excluding the if statement) are merely quoted and printed verbatim. Conversely, the statements dependent only on the supplied value p (the unboxed statements in Fig. 1(a)) are evaluated during the execution of the generating extension. The statement if(*pat != *s1) break; depends on both pat and the delayed string *s1. Variable pat must be lifted: the statement is printed as written, except that supplied variable pat is replaced with its current value.

The correspondence between the generating extension and the subject program makes it straightforward to create a generating extension algorithmically. While the execution of the generating extension is now worklist-driven and non-standard, the basic-block-level structure still reflects that of the original subject program. Moreover, the generating extension itself need not include a C interpreter; a special-purpose analysis tool is only necessary to construct the generating extension itself.

Internally, however, the generating extension must still handle the two state-management issues described in §1: (1) saving and restoring partial states, and (2) checking state equality.
At the end of each basic-block procedure, the generating extension takes a snapshot of its own state, emits code to transfer control to its successor block(s), and inserts each successor block into the worklist. Finally, control returns to the top of the loop in match_ge, and another state/block pair is dequeued if the worklist is not empty.
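The duplicate-detection step that match_ge's previously_visited call performs can be sketched as follows. This is a simplified stand-in of our own (not the paper's implementation): a partial state is reduced to a single value standing for its contents (e.g., a hash), whereas a real specializer must be able to compare full states or, as in §5, rely on a collision-resistant fingerprint.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified stand-in (ours, not GenXGen[MC]'s) for previously_visited:
 * a (state, block) pair is processed at most once, which is the check
 * that guarantees termination of the worklist loop. */
#define MAX_SEEN 1024

typedef struct { uint64_t state; int block; } pair_t;

static pair_t seen[MAX_SEEN];
static int n_seen = 0;

/* Returns true the first time a (state, block) pair is offered,
 * false on every re-encounter. */
bool should_process(uint64_t state, int block) {
    for (int i = 0; i < n_seen; i++)
        if (seen[i].state == state && seen[i].block == block)
            return false;   /* pair has repeated: skip it */
    if (n_seen < MAX_SEEN)
        seen[n_seen++] = (pair_t){ state, block };
    return true;
}
```

As written, each membership test is linear in the number of states seen; replacing the scan with a constant-time lookup keyed on a whole-state hash is precisely the role of the fingerprinting scheme of §5.2.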
The best prior solution to the state-management problem has been to take advantage of the fact that a partial evaluator is similar to a language interpreter [13]—except that a partial evaluator operates on partial states, and an interpreter operates on full states. One can design an abstract datatype of partial states for which saving/restoring states and identifying state repetition can be performed with low time and space overhead. In particular, the components of (partial) states can be hash-consed [11] so that a unique representative—i.e., a canonical address—is maintained for each partial state. A set of the addresses of the unique representatives is then maintained, with hashing used to assist membership testing (and collision resolution performed by comparing addresses).

For our work on specializing binaries, such an approach was unsatisfactory. We chose the generating-extension approach because it creates program specializers that end users can use without learning sophisticated program-analysis tools. We wanted to avoid (i) packaging a full-featured interpreter for x86 with the specializers or (ii) instrumenting every load and store in the subject binary. Consequently, we did not have the option of implementing memory as an explicit data structure that can be readily swapped to save and restore states—which raises the following question:

How can issues (1) and (2) from §1 be handled efficiently in a generating extension ge_P,S that runs natively?

To address issue (1), we use two OS-level mechanisms—copy-on-write (CoW) and process context-switching—to create an efficient mechanism for state-snapshotting and restoration. (See §5.1.) However, the main element that allowed us to devise a solution is that we changed the requirements associated with issue (2) slightly.
In particular, we do not insist that there be a mechanism to resolve collisions, as long as we have control over parameters that ensure that the probability of a collision ever arising is below a value of our choosing. In other words, we allow the use of a collision-resistant hash. Moreover, the hash function is incrementally updatable: as execution of ge_P,S mutates one (partial) state σ to another (partial) state σ′, the hash value for σ′ can be computed efficiently by updating the hash value of σ. (See §5.2.)

In our implementation, by using 128-bit hash values, we ensure that for the amounts of memory used and the numbers of states visited in practice, the probability of a collision is negligible. Thus, although it is possible for our tool to produce an incorrect residual program due to a hash-value collision, the chances of that happening are vanishingly small. In our implementation, incremental updating of hash values occurs at page granularity.

(More precisely, to support the unique-representative property, one would make use of applicative maps (see [23, §6.3] and [20]), hash-consing, and a hash table to detect duplicates. The hash-code would be based on the contents of the map's entries, rather than the structure of the tree that represents the map.)

Our experimental results show that these mechanisms are critical to the practical tractability of machine-code generating extensions. Most significantly, if Rabin fingerprinting or some other O(1) state-comparison mechanism is not used, the amount of time to produce a residual program scales quadratically with the number of partial states seen.
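The property that matters here, incremental updatability at page granularity, can be illustrated with a simplified stand-in of our own: a polynomial hash modulo a prime rather than the 128-bit Rabin scheme over GF(2) that the paper uses. When one page changes, the whole-state hash is repaired from the old and new page hashes alone, instead of rehashing the entire address space.

```c
#include <stdint.h>

/* Simplified stand-in for Rabin fingerprinting (our sketch; not the
 * paper's 128-bit scheme). The state hash is a polynomial hash of
 * per-page hashes: H = sum_i h(page_i) * R^i mod M. */
#define NPAGES 8
static const uint64_t M = 1000000007ULL;                   /* prime modulus */
static const uint64_t R = 1315423911ULL % 1000000007ULL;   /* base */

static uint64_t pow_R(int i) {              /* R^i mod M */
    uint64_t p = 1;
    while (i-- > 0) p = (p * R) % M;
    return p;
}

/* Hash of the whole state: O(NPAGES) work. */
uint64_t full_hash(const uint64_t page_hash[NPAGES]) {
    uint64_t h = 0;
    for (int i = 0; i < NPAGES; i++)
        h = (h + (page_hash[i] % M) * pow_R(i)) % M;
    return h;
}

/* O(1)-per-page update when page i's hash changes from old_h to new_h:
 * add the (signed) difference scaled by R^i. */
uint64_t incr_update(uint64_t h, int i, uint64_t old_h, uint64_t new_h) {
    uint64_t delta = ((new_h % M) + M - (old_h % M)) % M;
    return (h + delta * pow_R(i)) % M;
}
```

Combined with CoW, the specializer learns exactly which pages changed since the last snapshot, so only those pages' contributions to the fingerprint need to be recomputed.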
Even on simple examples, fingerprinting yields several orders of magnitude of improvement in execution times; one simple test case required over 12 hours without fingerprinting, while requiring only 3 seconds with fingerprinting. Using CoW produces a large improvement in the amount of space needed to represent partial states; while thousands of pages of memory are required to represent visited states without CoW, at most several hundred are required when CoW is used in our experiments. Moreover, when fingerprinting is used, CoW yields an additional two-to-six-fold reduction in the time required to produce a residual program.

In this section, we explain the technique for creating generating extensions used in our tool GenXGen[MC]. GenXGen[MC] takes a program and BTA results as input, and transforms each basic block of the program into a self-contained unit that (i) executes the basic block, (ii) updates the partial state that depends on supplied input, (iii) produces a specialized basic block, (iv) snapshots the partial state, and (v) yields control to a controller process. The BTA results determine the transformation of individual instructions performed in step (iii).

GenXGen[MC] relies on the implementation of BTA from WiPER [26]. WiPER invokes CodeSurfer/x86 [3], which incorporates a number of algorithms for static analysis of machine code [4] to build a dependence graph that supports machine-code slicing [27]. As in §2.1, BTA is performed by slicing forward from the delayed inputs, marking all program points in the slice as delayed, and all points outside the slice as supplied.

To identify lifted values, reaching-definitions analysis is performed for the operands of each instruction I in the delayed set. Any static instructions that define an operand used by I must have associated code to lift the value of the operand.
Put another way, it is as if we had arranged for a year to have the right number of days so that, with any randomly chosen group of 1,000,000 people, the chances of winning a birthday-paradox bet were negligible.

L3: mov dl, [ebx]    -- dereference pat
    cmp dl, 0        -- check first if condition
    jz L7            -- if(*pat == 0) return 1
L4: mov cl, [eax]    -- dereference s1
    cmp cl, dl       -- check second if condition
    jne L2           -- if(*pat != *s1) break;
L5: incr eax         -- s1++
    incr ebx         -- pat++
    jmp L3           -- while(1)
Fig. 3. Naive string matcher's inner loop body. Boxed instructions are delayed, double-boxed instructions have their destination operands lifted, and the remainder are supplied.
In Fig. 3, BTA results are illustrated for the code that implements the innermost loop of match from Fig. 1. Register eax contains the address of the current offset in the string that is being searched, and ebx contains the address of the current offset in the pattern to be matched. The registers cl and dl contain the current characters in the string and pattern, respectively. Basic blocks L3, L4, and L5 correspond to the inner-loop blocks in Fig. 1. L7 is reached only if a match is found. L2 is the target of the inner loop's break statement, which starts another iteration of the outer loop.

In lifted instruction mov dl, [ebx], register dl must be lifted, because cmp cl, dl in block L4 compares the supplied pattern character in dl to a character from the delayed string.

Given the partitioning of instructions in Fig. 3, the generating-extension-generator emits x86 code augmented with pseudo-instructions that expand to sequences of x86 instructions. Their actions emit code, control the flow of computation, and manage partial states.

Fig. 4 illustrates the machine-code generating extension produced from match. In this presentation, we treat the pseudo-instruction actions as black boxes. As before, non-branch instructions classified as supplied are executed, updating the partial state. Conversely, non-branch instructions classified as delayed are emitted verbatim, in a manner analogous to the printf calls in Fig. 1. The lifted instruction I = mov dl, [ebx] is executed, just like a supplied instruction. After I is executed, Lift emits code that sets the value of dl in the residual program to the value that dl holds immediately after the execution of I.

MakeSnapshot records a snapshot of the current partial state of the subject program, using the technique described in §5.1.
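To make the emit-or-execute discipline concrete, the following toy model is a sketch of our own (not GenXGen[MC] code; the Insn class and its fields are invented for illustration). It mimics how a non-branch instruction is handled according to its BTA classification: supplied instructions execute against the partial state, delayed instructions are emitted verbatim into the residual program, and lifted instructions execute and then emit code that pins the now-known destination value.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Insn:
    text: str                           # e.g. "incr ebx"
    binding: str                        # "supplied", "delayed", or "lifted"
    dest: str = ""                      # destination register, for lifting
    execute: Optional[Callable] = None  # effect on the partial state

def process_block(block, state):
    """Process one basic block's non-branch instructions, returning
    the residual code emitted for it."""
    residual = []
    for insn in block:
        if insn.binding == "supplied":
            insn.execute(state)              # update the partial state
        elif insn.binding == "delayed":
            residual.append(insn.text)       # emit verbatim
        else:  # lifted: execute, then emit a load of dest's concrete value
            insn.execute(state)
            residual.append(f"mov {insn.dest}, {state[insn.dest]}")
    return residual

# Example mirroring Fig. 3: a lifted load of dl, a delayed incr eax,
# and a supplied incr ebx (97 is an arbitrary illustrative byte value).
state = {"ebx": 7, "dl": 0}
block = [
    Insn("mov dl, [ebx]", "lifted", "dl", lambda s: s.update(dl=97)),
    Insn("incr eax", "delayed"),
    Insn("incr ebx", "supplied", execute=lambda s: s.update(ebx=s["ebx"] + 1)),
]
residual = process_block(block, state)   # ["mov dl, 97", "incr eax"]
```

The supplied increment mutates only the partial state, while the lifted load leaves a constant-load behind in the residual code, just as Lift(dl) does in Fig. 4.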
The 32-bit Intel x86 instruction set (also called IA32) has six 32-bit general-purpose registers (eax, ebx, ecx, edx, esi, and edi), plus two additional registers: ebp, the frame pointer, and esp, the stack pointer. In Intel assembly syntax, which is used in the examples in this paper, the movement of data is from right to left (e.g., mov eax,ecx sets the value of eax to the value of ecx). Arithmetic and logical instructions are primarily two-operand instructions (e.g., add eax,ecx performs eax := eax + ecx). An operand in square brackets denotes a dereference (e.g., if a is a local variable stored at offset -16, mov [ebp-16],ecx performs a := ecx). Branching is carried out according to the values of condition codes ("flags") set by an earlier instruction. For instance, to branch to L1 when eax and ebx are equal, one performs cmp eax,ebx, which sets ZF (the zero flag) to 1 iff eax − ebx = 0. At a subsequent jump instruction jz L1, control is transferred to L1 if ZF = 1.

L3: EmitSpecLabel(L3)
    mov dl, [ebx]
    Lift(dl)
    cmp dl, 0
    MakeSnapshot
    EmitSpecJmp("jz", L7, L4)
    CondEnqueue("jz", L7, L4)
    Yield
L4: EmitSpecLabel(L4)
    Emit("mov cl, [eax]")
    Emit("cmp cl, dl")
    MakeSnapshot
    EmitDynJmp("jne", L2, L5)
    Enqueue(L2, L5)
    Yield
L5: EmitSpecLabel(L5)
    Emit("incr eax")
    incr ebx
    MakeSnapshot
    Enqueue(L3)
    EmitJmp(L3)
    Yield
Fig. 4. The machine-code generating extension produced for the code in Fig. 3. The three blocks are the machine-code analogs of the handle_block functions in Fig. 2.
Both the Enqueue and CondEnqueue actions place (partial-state, basic-block) pairs into the worklist. In both cases, the state used for every enqueued pair is the state recorded by the most recent call to MakeSnapshot, and a pair is only enqueued if it has not been enqueued before.
Enqueue can be used to enqueue a single successor, in the case of unconditional jumps, or multiple successors, in the case of jumps controlled by delayed state.
CondEnqueue is used for conditional jumps governed by supplied state; its action is to enqueue the successor block determined by the supplied state and the condition of the jump instruction.

The actions of EmitJmp, EmitSpecJmp, and EmitDynJmp are to emit jumps to specialized versions of a block's successors. If the original jump is conditional and governed by supplied state, we use EmitSpecJmp to emit an unconditional jump to the specialized version of whichever successor block is chosen by the jump's condition. If the original jump is conditional and governed by delayed state, EmitDynJmp is used to emit a conditional jump targeting the specialized versions of the block's two possible successors. If the jump is unconditional, we use EmitJmp to emit an unconditional jump to the specialized version of the only successor.

The action of EmitSpecLabel is to emit a label unique to the current (σ, b) pair being invoked. The action of Yield is to emit code that yields control to the controller process. The controller process is a simple piece of code that removes a pair (σ, b) from the worklist, restores state σ, and resumes execution at block b.

Pointers.
The implementation of the Lift macro needs to correctly handle pointers to stack and heap objects. A pointer to a heap or stack object during specialization time may not be a valid pointer to the object in the residual program. CodeSurfer/x86's VSA implementation reliably identifies whether a memory location or register holds a pointer to the stack or heap at a given program point in all of the programs tested. By combining this information with a special-purpose implementation of malloc used only in the generating extension, concrete pointers into memory objects can be converted into relocatable offsets.
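The relocation idea can be sketched as follows. The Allocator class and its method names below are hypothetical stand-ins for the generating extension's special-purpose malloc, not the actual implementation; the addresses are arbitrary illustrative values.

```python
class Allocator:
    """Toy stand-in for the generating extension's private malloc:
    it records the base and size of every object it hands out, so a
    concrete pointer can later be turned into a relocatable
    (object-index, offset) pair."""

    def __init__(self):
        self.objects = []          # list of (base, size) pairs
        self.next_base = 0x1000    # arbitrary starting address

    def malloc(self, size):
        base = self.next_base
        self.objects.append((base, size))
        self.next_base += size
        return base

    def relocate(self, ptr):
        """Map a concrete pointer to (object-index, offset-within-object),
        or None if ptr does not point into any known object."""
        for idx, (base, size) in enumerate(self.objects):
            if base <= ptr < base + size:
                return (idx, ptr - base)
        return None

# Usage: a pointer 8 bytes into the first allocation becomes the
# relocatable pair (0, 8), independent of where the object will
# actually live in the residual program.
alloc = Allocator()
p = alloc.malloc(64)
q = alloc.malloc(16)
```

In the real system, VSA supplies the knowledge of *which* locations hold pointers; the allocator's bookkeeping supplies the mapping from concrete addresses to offsets.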
Because a generating extension saves and restores states at the end of every basic block, it is critical for these operations to be implemented efficiently. While saving and restoring CPU state is a straightforward operation, saving and restoring memory state is potentially expensive, both from the standpoint of the storage required to represent states and the amount of time required to save and restore states.

We use different OS processes to represent different state snapshots, and the fork() system call to generate a new snapshot of interest. Thus, a set of snapshots (in our case, the elements of the generating extension's worklist) can be represented by a set of process IDs. A partial state can be restored efficiently merely by performing a process switch.

The fork() system call is implemented in a relatively time- and space-efficient fashion through the use of a policy known as copy-on-write (CoW). Through the use of hardware-supported virtual memory, logical addresses used by a process are decoupled from their physical addresses in memory. The address space of a process is broken up into fixed-size pages (typically 4096 bytes), each of which can be mapped to an arbitrary physical-memory page. This decoupling lets processes share physical pages, enabling CoW.

When a process P calls fork(), a new process P′ is created with register and memory contents identical to P, with the exception of eax, which contains the return value of fork(). Rather than allocating new physical pages for P′, every page G(P′, i) in P′ is mapped to the same physical page H as the corresponding G(P, i) in P.
However, the virtual-to-physical mappings for P′ are flagged as CoW using hardware support; when P′ writes for the first time to a page G(P′, i) inherited from P, a hardware fault occurs, the changed version of the page is allocated its own page H′ in physical memory, and the hardware state is updated so that G(P′, i) is mapped to H′.

Consider the two processes σ and σ′ in Fig. 5, where σ′ is the result of forking σ and executing some code that modifies the fourth page of the virtual address space. Each process has a four-page virtual address space, backed by an n-page physical memory. In this case, the first three virtual pages of both processes map to the same physical pages, while the fourth page of each process maps to a different physical page.

Our generating extensions exploit CoW to implement the end-of-block worklist update. Each partial state referred to in §2 is a separate Linux process. An additional "controller" process oversees the partial evaluator's worklist of (partial-state, basic-block) pairs. The controller process serves as a dispatcher for the specialization phase, and the worklist of unprocessed (partial-state, basic-block) pairs is implemented merely as a set of process IDs. Every time the controller process selects a (partial-state, basic-block) pair (σ, b) from the worklist, it signals the process P_σ that represents σ, and P_σ begins executing. P_σ immediately calls fork(), creating a child process P′. Initially, the logical address space of P′ contains σ, and its program counter is set to b. After P′ finishes executing b, its logical address space contains σ′. However, physically, only the pages that changed during the execution of b on σ are specific to P′; the rest are shared with process P_σ.
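The snapshot-by-fork() idea can be observed directly with a small script (a sketch of our own, assuming a POSIX system; the variable names and page contents are illustrative). The child's first write to the shared buffer triggers a CoW fault and gives the child (playing the role of σ′) a private copy, while the parent's snapshot (σ) is untouched.

```python
import os

# One "page" of partial state held by the parent process (sigma).
state = bytearray(b"initial page contents")

r, w = os.pipe()
pid = os.fork()
if pid == 0:                        # child: plays the role of sigma'
    os.close(r)
    state[0:8] = b"modified"        # first write => CoW fault, private copy
    os.write(w, bytes(state[0:8]))  # report what sigma' now sees
    os._exit(0)

os.close(w)
child_view = os.read(r, 8)          # the child's (post-state) view
os.waitpid(pid, 0)
parent_view = bytes(state[0:7])     # the parent's snapshot, untouched
```

Only the ID the OS assigned to the child needs to be remembered to name the snapshot; the page sharing and copying happen entirely inside the kernel.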
Because all the memory bookkeeping is implicitly taken care of by the OS and the hardware, the only data state that is stored and manipulated by the controller process is the ID that the OS assigns to each process. Thus, executing a basic block b on a given state σ incurs a cost that is linear in the number of memory-writing instructions in b, because each such instruction is executed only once. Moreover, the cost of switching between processes is constant: a switch consists of a system call and the update of several fixed-size hardware registers.

Fig. 5. States σ and σ′ after a CoW fault at P.

For a generating extension to avoid traversing previously seen computation paths, it needs an efficient way to determine whether a given state has been seen before. We desire a state-management scheme that possesses five properties:
(i) Given a state σ that has been seen before, whenever we encounter σ again, the procedure must recognize σ as a previously visited state, with no false negatives.
(ii) The procedure must be space- and time-efficient. We would like to store at most several hundred bits of information per state visited. We would also like O(1) state-equality checks.
(iii) The false-positive rate must be kept acceptably low.
(iv) We would like to be able to efficiently update the value that characterizes a visited state.
(v) If the initial state is a multi-gigabyte address space, computing the initial value that characterizes the state should require a constant amount of computation.
These criteria naturally suggest a solution based on hashing. Note that item (iii) deviates from the conventional approach to state management in program specialization, which is that there should be no false positives. As discussed in §1 and explained below, although it is possible for our tool to produce an incorrect residual program due to a hash-value collision, the chances of that happening are negligible.
Our choice to work with this relaxed requirement was motivated by the fact that our generating extensions work with native hardware states.

To produce efficient generating extensions, item (iv) is especially important. When computing the post-state hash after executing a single block, we wish to perform an amount of computation proportional to the number of changes to the pre-state made during the block's execution. By exploiting properties of CoW, it is possible to satisfy item (iv) with an appropriate choice of hash algorithm:
• When a basic block b is executed by the specializer, the first write to a page in the execution of b induces a CoW fault. Given a log of all CoW faults that occur during the execution of the basic block, we can compute the changes between the pre-state σ and the post-state σ′. Implementing this log was straightforward: we added (i) a small amount of instrumentation code to the Linux kernel's page-fault handler, and (ii) a small amount of extra state to every process structure in the kernel.
• We need an incrementally updatable hashing algorithm: one that lets us efficiently incorporate differing pages into the pre-state hash-code, without additional computation beyond processing of the data of the changed pages in the pre- and post-states. Rabin's fingerprinting scheme satisfies this criterion [6, 22].

Formally, given the pre-state σ and its associated hash H(σ), and the contents of the changed pages, P_pre = {P_1, ..., P_n} and P_post = {P′_1, ..., P′_n}, from the pre- and post-states, we compute the post-state hash using only the pre-state hash H(σ) and the contents of P_pre and P_post:

H(σ′) = H_incr(H(σ), P_1, P′_1, ..., P_n, P′_n)

Given a bit-string σ = (s_0, s_1, ..., s_{m−1}) representing the contents of a program's address space, we wish to compute a hash H(σ). To do so, the fingerprinting algorithm treats σ as a polynomial σ(t) = s_0 + s_1*t + ... + s_{m−1}*t^{m−1} of degree m−1, with coefficients drawn from Z_2.
The fingerprinting scheme selects an irreducible polynomial P(t) (i.e., P(t) is divisible only by 1 and itself) of degree k, again with coefficients from Z_2. Given P(t), the fingerprint H(σ) is defined as

H(σ) = σ(t) mod P(t)

The choice of the degree k of the irreducible polynomial P allows us to choose the size of the hash, and thereby tune the collision probability relative to a definition of a "reasonable" execution of a program specializer. For our purposes, we assume that the partial state of the subject program occupies a 32-bit address space and that the specializer visits 1 million unique states. Given these assumptions, simple counting arguments outlined in [6] and [22] show that for a 128-bit hash code (i.e., k = 128), the probability of any collision among the 1 million hash-codes is negligibly small.

Note that the evaluation of the polynomial at a value of t plays no part in fingerprinting: we merely use the algebraic properties of the polynomials themselves. Specifically, polynomials with coefficients over Z_2 have several properties convenient for the implementation of an incrementally updatable hash:
(1) The addition operation + for such polynomials is addition mod two with no carry, i.e., bitwise exclusive-or. Consequently, subtraction for polynomials with coefficients over Z_2 is the same operation as addition. These properties let us treat the contents of memory as σ(t), thus incurring no additional space overhead.
(2) Multiplication by t^i can be implemented as an i-bit shift.
(3) Fingerprinting is linear: H(A + B) = H(A) + H(B).

Though Rabin fingerprinting is most well-known for its use as a sliding-window hash, it can also be used for incremental hashing [22].
(4) The fingerprint of the product of t^i and a polynomial σ(t) can be computed via H(t^i * σ(t)) = H(H(t^i) * H(σ(t))).

Given property (1) of polynomials over Z_2, in what follows we will use σ to denote both the bit-string representation of σ and the polynomial σ(t); the intended use will be clear from context.

Consider the pre-state and post-state in Fig. 5, where virtual page 3 maps to physical page 3 in the pre-state and physical page 4 in the post-state. Properties (1)-(4) admit a simple update procedure:

H(σ′) = H(σ) + H(P_3) + H(P_4)

From (1), it follows that given a change to the i-th page in pre-state σ, the post-state σ′ can be derived by subtracting off the terms representing the contents of the i-th page in σ and adding on the terms corresponding to the post-state version of the page in σ′. By coupling this observation with properties (2), (3), and (4), it can be shown that H(σ′) can be computed directly from H(σ) using only the contents of the i-th page in σ and σ′, avoiding the need to examine all of σ′ to compute its hash value.

In particular, let w be the page size in bits supported by the OS (here 4096 * 8 = 32768 bits). In addition, let σ_{a,b} = s_a + s_{a+1}*t + ... + s_b*t^{b−a} denote the polynomial for the substring of σ starting at a and ending at b, inclusive of both a and b. Then, from properties (1) and (3), we have

H(σ′) = H(σ) + H(t^{i*w} * σ_{i*w, (i+1)*w−1}) + H(t^{i*w} * σ′_{i*w, (i+1)*w−1})

and by property (4),

H(t^{i*w} * σ_{i*w, (i+1)*w−1}) = H(H(t^{i*w}) * H(σ_{i*w, (i+1)*w−1}))

For a fixed page size of w bits, the only non-constant-time computation is H(t^{i*w}) = t^{i*w} mod P, which can be computed in time log(i*w) using modular exponentiation via squaring. Because the maximum amount of addressable memory is bounded on x86 CPUs, log(i*w) is effectively a small constant in practice.

The number of pages that must be hashed in order to compute the post-state hash is O(m), where m is the number of unique pages written during the execution of the basic block. In the common case, m is at most O(n), where n is the number of instructions in the basic block. Thus, for the common case, hashing induces a constant overhead on the amount of computation performed by a basic block. The only exceptions are special x86 opcodes, such as those that use the rep prefix; these instructions essentially implement loops that perform memory writes repeatedly until some condition is met. Such instructions are often used for, e.g., string operations. In the programs we examined, the use of rep-prefixed instructions to write large stretches of memory is uncommon; we did not encounter any cases where rep was used to write regions larger than a page. Additionally, in our semantic model we consider rep-prefixed instructions to be loops, and we treat the individual prefix-free version of the instruction as a basic block.

In addition to efficient incremental updates, this hash technique also handles new and empty pages efficiently.
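The update procedure can be sketched in code. The following is a toy version of our own under assumed parameters: a 32-bit fingerprint (k = 32) with an assumed irreducible polynomial, and tiny 64-bit "pages," rather than the 128-bit fingerprint and 4096-byte pages discussed above. A polynomial is represented as a Python int whose bits are its coefficients, so addition is XOR.

```python
# x^32 + x^22 + x^2 + x + 1: a maximal-length LFSR polynomial of
# degree 32 (hence irreducible), chosen here purely for illustration.
P = (1 << 32) | (1 << 22) | (1 << 2) | (1 << 1) | 1

def gf2_mul(a, b):
    """Carry-less product of two Z_2 polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf2_mod(a, p=P):
    """Reduce polynomial a modulo p (degree of result < deg p)."""
    d = p.bit_length() - 1
    while a.bit_length() > d:
        a ^= p << (a.bit_length() - 1 - d)
    return a

def H(poly):
    """Fingerprint: poly(t) mod P(t)."""
    return gf2_mod(poly)

def H_tn(n):
    """H(t^n) via modular exponentiation by squaring, O(log n) steps."""
    result, base = 1, 2            # the int 2 encodes the polynomial t
    while n:
        if n & 1:
            result = gf2_mod(gf2_mul(result, base))
        base = gf2_mod(gf2_mul(base, base))
        n >>= 1
    return result

def fingerprint(pages, w):
    """H of an address space given as a list of w-bit page ints."""
    h = 0
    for i, page in enumerate(pages):
        # H(t^{i*w} * page) = H(H(t^{i*w}) * H(page))   -- property (4)
        h ^= gf2_mod(gf2_mul(H_tn(i * w), H(page)))
    return h

def update(h, i, w, old_page, new_page):
    """Incrementally fold a change to page i into fingerprint h.
    XOR is both the subtraction of the old page's terms and the
    addition of the new page's terms (property (1))."""
    return h ^ gf2_mod(gf2_mul(H_tn(i * w), H(old_page) ^ H(new_page)))
```

By linearity, updating the fingerprint after a one-page change touches only that page's old and new contents, never the rest of the address space, and a zeroed page contributes nothing to the hash.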
Any new physical memory added to a program's address space is zeroed out by the OS for security reasons; thus, when a program's address space grows, the new pages contain zeroes. It is clear from the properties of reduction modulo a polynomial that the hash of a zero page is zero; thus, no additional computation needs to be performed to incorporate new pages into the hash of a program state.

Implementation.
The work to create GenXGen[MC] was reduced by adopting the BTA implementation from the WiPER partial evaluator [26], which uses the slicing facilities provided by CodeSurfer/x86 [3]. After BTA is performed on a program, GenXGen[MC] traverses the program's
CFG, and emits the generating extension using the macros described in §4. The generating extension is then assembled: the generating extension is placed as inlined assembly inside a small C++ wrapper, so the "assembler" is actually g++. The generating-extension binary can then be given values for supplied inputs, and run on a version of the Linux 4.4.14 kernel modified to track CoW faults. The final residual program is then assembled, this time using inlined assembly inside a small C wrapper, so the assembler is gcc. (The use of g++ and gcc for assembly is an implementation expedient.)
Experimental Questions.
Our experiments were designed to answer the following questions:
(1) What are the individual improvements to memory usage contributed by CoW and fingerprinting?
(2) What are the individual improvements to the time needed to emit a residual program contributed by CoW and fingerprinting?
(3) Compared to the original subject program, how much does specialization speed up execution?
Experimental Setup.
We evaluated GenXGen[MC] using the binaries of seven microbenchmarks (listed in Fig. 6) and, as "real-world" examples, three command-line binaries: two GNU coreutils programs, and one program that makes use of printf. Five of the microbenchmarks were previously used to evaluate WiPER [26]. The sixth, matcher, is the naive string matcher given in Fig. 1. The seventh, stack, is designed to stress-test the fingerprinting technique.
• gnu-wc counts lines, chars, or words in stdin. The supplied input specifies which quantities are counted; the delayed input is stdin.
• gnu-env runs a program with a specified assignment to environment variables. The supplied input is the assignment to environment variables; the delayed input is the program to invoke.
• printf is a program that calls into a simple printf library. The supplied input is a format string; the delayed input is the remaining arguments.

These programs present a reasonable cross section of real-world specialization tasks. gnu-wc represents a feature-removal task, in which a single mode of operation is chosen out of a set of potential modes, while printf is a fairly representative layer-collapsing and loop-unrolling task, in which a library call is inlined into a program. The third program, gnu-env, features aspects of both tasks, because the core environment-update loop is unrolled, and features corresponding to unused command-line flags are excised.

The ten benchmarks are provided as supplementary material.

Application    Description                                Static Input
power          Computes x^n                               n
dotprod.       Dot product of n-dimensional vectors       the first vector
interpreter    Interpreter for the minimalist             an input program
               language "Brainf*ck"
filter         Applies an m × m convolution filter        m = 3, and the elements
               on an image of size n × n                  of the filter
sha1           Computes the sha1 digest of an             the first 512 bits
               n-bit input
matcher        A naive substring-matching algorithm       the target substring
stack          A program that writes every stack          n
               page n times

Fig. 6. Microbenchmarks used in the evaluation.

For gnu-wc, specializing with respect to the supplied input selects one of three main application loops, each of which is optimized for a different counting task. The generating extension elides the other two loops. In the case of printf, the specialization unrolls the format string, eliminating run-time parsing and logic for unused format specifiers. Similarly, in gnu-env, the argument-parsing loop is unrolled, emitting a program that runs a program in a pre-defined environment.

To evaluate questions (1) and (2), we implemented GenXGen[MC] so that CoW and state fingerprinting can be independently disabled in generating extensions, yielding four possible execution modes (see Fig. 7).

To simulate disabling of CoW, we added a mechanism to force the copy of an entire process address space. When CoW is "disabled," we dirty each page without altering the state by (i) writing a single byte to each page in the address space, and then (ii) reverting the page back to its original state. These actions force every page to incur a CoW fault, causing the OS to create a copy of every page in the address space. This approach provides an upper bound on the time required because, by forcing the CoW mechanism to make the copy, a page fault must be handled by the kernel for every page, adding some overhead. We chose to estimate the cost in this way because our generating extensions are inherently multi-process: each process holds a single state.
Implementing a true CoW-free approach would have required modifying the OS to eliminate CoW, which seemed unwarranted, given that the technique is not likely to be competitive.

To disable fingerprinting, we implemented an alternative version of the generating extension's state-comparison and worklist-management algorithm. Without fingerprinting, the only way to compare the states of two processes is to do a direct comparison of process memory. Moreover, we no longer have a convenient means of indexing into a table of previously seen states. Consequently, the state manager must retain a process for every state previously seen, and must compare every newly created process state with every retained state, comparing full address spaces. In contrast, in the fingerprint-based approach, we only need to store the 128-bit fingerprint; any process that does not have outstanding worklist entries can be garbage-collected.

To measure memory usage, the generating extension tracks the number of pages in use across all processes in the generating extension. Because all processes must be retained when fingerprinting is disabled, determining the memory usage across all processes is straightforward: it is the sum of all live pages across all processes. When fingerprinting is used, memory usage is the maximum number of live pages at any given point in the program's execution. To evaluate the execution time of a generating extension, we time its end-to-end execution, from the beginning of the first basic block to the end of the last basic block.

We allowed the generating extensions to run end-to-end for the "real-world" examples. However, for the microbenchmarks, we added a time-out after 90 minutes of specialization.
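The contrast between the two bookkeeping regimes can be sketched as follows (an illustrative model of our own, not the actual implementation): with fingerprinting, the "have we seen this state?" check is a set-membership test on a small hash; without it, every new state must be compared in full against every retained state.

```python
def seen_with_fingerprints(fingerprints, fp):
    """O(1) expected time per check; stores only a small hash per
    state, so old process states can be garbage-collected."""
    if fp in fingerprints:
        return True
    fingerprints.add(fp)
    return False

def seen_without_fingerprints(retained_states, state):
    """One full-address-space comparison per retained state, and
    every state ever seen must be retained; over a whole run the
    total work grows quadratically in the number of states."""
    for old in retained_states:
        if old == state:
            return True
    retained_states.append(bytes(state))
    return False
```

The quadratic blow-up of the second variant is exactly what the [CoW, no-fingerprint] columns of Fig. 7 exhibit.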
Results.
The experimental results for questions (1) and (2) are presented in Fig. 7. With respect to question (1), both fingerprinting and CoW play a significant role in reducing memory usage. Using CoW, however, yields the most significant reduction for every application except stack. This improvement is due to the fact that for all ten applications, the instructions that are evaluated

The reason we used only three real-world programs in our study was because of limitations of CodeSurfer/x86 [3, 4], which GenXGen[MC] uses to implement BTA. Fortunately, in many circumstances, the subject program can be adapted to overcome the limitations, e.g., by manually unrolling a loop. However, the effort required to identify appropriate rewritings to overcome current limitations of the static analyses in CodeSurfer/x86, as well as to model calls to library functions, limited the number of real-world programs that we were able to use for our study.
                       Generating-extension performance                    Execution time
                       [CoW, Fingerprint]
                       [no,no]    [yes,no]   [no,yes]   [yes,yes]   orig. prog.   resid. prog.
printf client  time    68m 23s    66m 11s    6.138s     .744s       90.6± µs      77.7± µs
               pages   240577     48         12774      6           —             —
gnu-wc         time    45m 36s    50m 11s    2.755s     1.190s      283± µs       106± µs
               pages   146901     46         2129       9           —             —
gnu-env        time    13h 12m    12h 26m    15.692s    3.332s      36.5±.1 µs    31.0±.1 µs
               pages   958050     129        2129       2           —             —
power          time    74m 39s    64m 36s    2.241s     .679s       3.3± µs       .6± µs
               pages   221416     102        2129       1           —             —
dotprod.       time    >90m       >90m       11.366s    2.364s      3.7± µs       .8± µs
               pages   —          —          2129       1           —             —
interp.        time    >90m       >90m       13.638s    6.186s      35.9±.2 µs    36.2±.1 µs
               pages   —          —          2129       1           —             —
filter         time    >90m       >90m       16.391s    6.370s      4.1± µs       .6± µs
               pages   —          —          4258       2           —             —
sha1           time    >90m       >90m       24.223s    11.783s     4.6± µs       3.3± µs
               pages   —          —          2129       2           —             —
matcher        time    28m 30s    26m 13s    1.652s     .839s       .9± µs        .1± µs
               pages   185223     22         31935      3           —             —
stack          time    3m 43s     4m 36s     1m 34s     1m 26s      5,533± µs     .1± µs
               pages   195900     195900     1959       1959        —             —

Fig. 7. Run times and space usage for each generating extension, with and without CoW/fingerprinting. Run times for original and residual programs are also included, with 95% confidence intervals ("—" means "not measured").

program     original   residual
printf      754        1038
gnu-wc      1929       775
gnu-env     1820       1123
power       30         323
dotprod.    307        1123
interp.     146        558
filter      287        1207
sha1        332        2823
matcher     34         410
stack       3930       1
Fig. 8. Instruction counts for original and residual programs.

during generating-extension execution perform the majority of their writes within a single stack page. Even when fingerprinting is not used, CoW ensures that the number of pages needed to retain all previously visited states is small: roughly the number of basic blocks that executed at least one memory write.

Regarding question (2), fingerprinting plays the most significant role in reducing execution time. This result is unsurprising, because the amount of time needed to identify whether a state has been previously visited without using fingerprinting scales linearly with the number of states previously visited. Thus, the execution time scales quadratically with the number of states.

Stress-test stack(n) performs a set of writes that causes the generating extension to recompute each stack-page fingerprint n times. Still, the benefits of O(1) lookup outweigh the cost of repeatedly fingerprinting every stack page.

Using CoW also improves the execution times of generating extensions; the improvement is most pronounced in the case where fingerprinting is also used. When fingerprinting is used, the overhead of copying an entire process begins to dominate the execution time of the generating extension.

For the gnu-wc and stack generating extensions without fingerprinting, the execution time with CoW enabled was greater than when CoW was disabled. We do not have a full explanation, but we believe that the extra cost is due to the cost of collecting memory-usage data. When we measure memory usage with CoW enabled, we track every process currently using a given page. For certain workloads, especially when fingerprinting is not used (and thus page mappings are retained for every state visited), the cost of maintaining this data structure may become relatively large.
To evaluate experimental question (3), we timed the end-to-end execution of each program on an input, collecting the 10% trimmed mean of 1001 executions: i.e., for the original and residual version of each program, we ran the program 1001 times, and discarded the 100 shortest and 100 longest execution times.

To time the programs, we instrumented the beginning and end of main in each program with calls to a rdtscp-based timer. By doing so, we avoid recording the noise induced by the initial context-switching, loading, and execution of the program. The hardware-counter-based rdtscp timer provides 40-clock-cycle resolution.

For question (3), results are presented in Fig. 7. Specialization produced a speedup in all but one program. Because stack has no meaningful delayed actions, it has a 5500x speedup, due to the elision of thousands of memory writes. The most significant speedup in a specialized program with non-trivial delayed functionality was for matcher, which was 9x faster. This improvement can be attributed to the specialization of the inner loop of the program, which elides all the memory loads for the target string. In particular, this change speeds up the common case in which a character in the string being searched does not match the first character of the target string. filter yields the second most significant speedup, being 6.8x faster. The specialization of filter significantly optimizes the inner loop of the image-filtering procedure, eliminating the if statement that selects which image filter is applied to each pixel, as well as inlining loads from lookup tables that encode properties of the selected filter algorithm.
power and dotproduct benefit significantly from the unrolling of their main loops; the elimination of the branch condition at the loop head yields a 5.5x speedup and a 4.6x speedup, respectively. The specialized version of gnu-wc has a speedup of 2.7x. The specialization of gnu-wc elides the argument-parsing loop, as well as setup code that (i) sets locale information and (ii) obtains system-dependent configuration information.

gnu-env and printf enjoy more modest speedups, roughly 16-18%. Most of the speedup is due to the unrolling of the core loop in each program: the format-string-parsing loop in printf, and the argument-parsing and environment-setup loops in gnu-env.

sha1 obtains a 1.4x speedup from the elision of loads inside the main loop, along with the elision of the code that initializes the supplied data.

However, interpreter experiences a slight slowdown (< 1.1x), possibly due to the effects of aggressive unrolling on cache performance.

Specialization of Machine Code.
Run-time code generation is a generating-extension-like approach to program specialization that produces machine code on-the-fly during program execution. Unlike our approach to machine-code specialization, which operates on stripped binaries without source code or symbol-table information, run-time code-generation systems take user-annotated source code as input and perform BTA and generating-extension construction as part of compilation. In the Fox [15, 16] and Lancet [24] systems, type-level information in the source code is exploited to produce run-time machine-code generators. These systems avoid the state-management issues from §5 by exploiting the availability of high-level semantic information from the source language. In ′C [8], the user implements code generators using a DSL, and the user bears the burden of avoiding redundant states and ensuring that code generators terminate. In contrast, Klimov [14] describes a run-time code generator for Java bytecode that does not rely on information from source code. However, Klimov's system can only determine state equality for programs that do not use the heap; the approach identifies semantically identical states based on structural properties of Java Virtual Machine heap configurations. JIT compilation [2] is an example of run-time code generation in widespread use. However, because it is performed at run-time, the emphasis is on recouping the cost of translation, which limits the kinds of optimization techniques that can be performed.

Turning to interpretation-based approaches, WiPER [26] and TRIMMER [25] are partial evaluators for x86 binaries and LLVM IR, respectively. WiPER uses CodeSurfer/x86's semantic models of the 32-bit x86 instruction set to evaluate instructions. WiPER represents states using an applicative-map-based data structure that does not use hash-consing. Thus, state equality is determined by directly comparing the contents of the data structure. TRIMMER implements a non-traditional approach to partial evaluation.
In particular, TRIMMER implements an aggressive extension of LLVM's loop-unrolling and constant-propagation passes, rather than full partial evaluation. By doing so, TRIMMER avoids the need for a general-purpose state-management strategy.
Although our system is quite different from Fox [15, 16] in most respects, their use of pseudo-instruction macros inspired our approach to constructing machine-code generating extensions. We use similar macros to produce residual assembly code, and we extended the approach to include various other state-management actions.
Incremental State Hashing.
To the best of our knowledge, our work is the first application of incremental state hashing to program specialization. Our fork-based method for managing partial states was inspired by the state-management mechanism in the EXE symbolic-execution system [7].
Model checking is a method for checking properties of programs statically by exploring the state space of a transition system. To achieve acceptable performance, model-checking algorithms must avoid exploring redundant states, and Rabin's fingerprinting technique has been used to implement incremental hashing of program states in model checkers [19, 21]. One of these model checkers, StEAM [17, 19], harnesses a VM that interprets assembly language. In contrast, our implementation of incremental state hashing exploits OS-level information to apply the technique to code that executes natively, rather than in an interpreter.
Symbolic and Concolic Execution.
Partial evaluation bears some resemblance to symbolic execution. In both cases, the state space of the program is partitioned: into supplied and delayed variables in partial evaluation, and into concrete and symbolic variables in symbolic execution. Moreover, we use the OS-based state-management techniques previously implemented in systems such as EXE.
However, symbolic execution differs significantly from partial evaluation in terms of how the partitioned state space is explored. In both symbolic execution and partial evaluation, part of the state space is kept concrete. In symbolic execution, the non-concrete part of the state space is represented as sets of symbolic values, which are logical formulas in some theory. Partial evaluators, on the other hand, do not track any information about the non-concrete part of the state space; the congruence property of the BTA algorithm ensures that the delayed state is never needed to update supplied values.
Symbolic execution thus attempts to construct a symbolic approximation of the values in the symbolic portion of the state at every program point. That is, it attempts to explore, as exhaustively as possible, precisely the subset of the state space that a program specializer ignores. The state-management techniques in symbolic execution are geared toward managing the symbolic state. In particular, we are not aware of symbolic-execution engines that enforce any sort of termination or state-equality properties with respect to the concrete portion of the state. To perform partial evaluation with a symbolic-execution engine, one would need to be able to determine when, for every program point, all concrete states reachable from the starting concrete state had been reached. There is no straightforward way to achieve this with a stock symbolic-execution tool.
This paper describes how (i) the desire to perform specialization of machine code, using generating extensions running natively, motivated (ii) the development of new techniques for state management in a program specializer. The main challenge was that machine-code programs perform arbitrary reads and writes to an undifferentiated address space, and for this reason, our solution—in part—makes use of existing OS-level functionality. Our technique is used in the generating extensions created by GenXGen[MC].
It has not escaped our attention that our technique can also be used for source-code specialization. In fact, we have already adapted the implementation to create GenXGen[C], a prototype generating-extension generator for C programs.
The state-management technique presented in this paper is not the only option for a source-code specializer: because sufficient information about a program's variables is available, an interpreted approach to source-code generating extensions could track the state of memory at the level of individual variables. However, for source-code specialization, our technique offers three advantages:
• A generating extension for a source-code program is compiled to machine code, and hence specialization is performed by compiled—rather than interpreted—code.
• By intercepting CoW faults and incorporating changed pages into the hash value for a memory state, a program specializer can easily support programs that use linked data structures, with no need to perform a mark-and-sweep traversal to capture program state. Thus, our technique provides a method that can be used for languages that are not memory-safe, such as C (although a conservative mark-and-sweep algorithm [5] could be employed).
• The state-management implementation can be shared among program specializers for different languages.
As future work, we plan to develop GenXGen[C] further, and to investigate its applications.
ACKNOWLEDGMENTS
Supported, in part, by a gift from Rajiv and Ritu Batra; by ONR under grants N00014-17-1-2889 and N00014-19-1-2318; and by the UW-Madison OVCRGE with funding from WARF. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors, and do not necessarily reflect the views of the sponsoring agencies. T. Reps has an ownership interest in GrammaTech, Inc., which has licensed elements of the technology reported in this publication.
REFERENCES
[1] Lars Ole Andersen. 1994. Program Analysis and Specialization for the C Programming Language. Ph.D. Dissertation. University of Copenhagen.
[2] John Aycock. 2003. A Brief History of Just-in-Time. ACM Comput. Surv. 35, 2 (June 2003), 97–113. https://doi.org/10.1145/857076.857077
[3] G. Balakrishnan, R. Gruian, T. Reps, and T. Teitelbaum. 2005. CodeSurfer/x86 – A Platform for Analyzing x86 Executables (Tool Demonstration Paper). In Comp. Construct.
[4] G. Balakrishnan and T. Reps. 2010. WYSINWYX: What You See Is Not What You eXecute. Trans. on Prog. Lang. and Syst. 32, 6 (2010).
[5] H.J. Boehm and M. Weiser. 1988. Garbage Collection in an Uncooperative Environment. Software: Practice and Experience 18, 9 (1988), 807–820.
[6] Andrei Z. Broder. 1993. Some Applications of Rabin's Fingerprinting Method. In Sequences II. Springer, 143–152.
[7] Cristian Cadar, Vijay Ganesh, Peter Pawlowski, David Dill, and Dawson Engler. 2006. EXE: A System for Automatically Generating Inputs of Death Using Symbolic Execution. In Proceedings of the ACM Conference on Computer and Communications Security.
[8] Dawson R. Engler, Wilson C. Hsieh, and M. Frans Kaashoek. 1996. `C: A Language for High-Level, Efficient, and Machine-Independent Dynamic Code Generation. In Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. ACM, 131–144.
[9] Y. Futamura. 1971. Partial Evaluation of Computation Process—An Approach to a Compiler-Compiler. Systems, Computers, Controls 2, 5 (1971), 45–50. (Updated and revised version published as [10].)
[10] Y. Futamura. 1999. Partial Evaluation of Computation Process—An Approach to a Compiler-Compiler. Higher-Order and Symbolic Computation 12, 4 (1999), 381–391.
[11] E. Goto. 1974. Monocopy and Associative Algorithms in an Extended LISP. Tech. Rep. TR 74-03. Information Science Laboratory, Univ. of Tokyo, Tokyo, Japan.
[12] S. Horwitz, T. Reps, and D. Binkley. 1990. Interprocedural Slicing Using Dependence Graphs. Trans. on Prog. Lang. and Syst. 12, 1 (Jan. 1990), 26–60.
[13] N. Jones, C. Gomard, and P. Sestoft. 1993. Partial Evaluation and Automatic Program Generation. Prentice-Hall, Inc.
[14] Andrei V. Klimov. 2009. A Java Supercompiler and its Application to Verification of Cache-Coherence Protocols. In International Andrei Ershov Memorial Conference on Perspectives of System Informatics. Springer, 185–192.
[15] P. Lee and M. Leone. 1996. Optimizing ML with Run-Time Code Generation. In Prog. Lang. Design and Impl.
[16] Mark Leone and Peter Lee. 1996. A Declarative Approach to Run-Time Code Generation. In Workshop on Compiler Support for System Software (WCSSS), Vol. 73.
[17] P. Leven, T. Mehler, and S. Edelkamp. 2004. Directed Error Detection in C++ with the Assembly-Level Model Checker StEAM. In Spin Workshop.
[18] R.A. Lorie. 1977. Physical Integrity in a Large Segmented Database. ACM Trans. on Database Systems 2, 1 (1977), 91–104.
[19] Tilman Mehler and Stefan Edelkamp. 2006. Dynamic Incremental Hashing in Program Model Checking. Electronic Notes in Theoretical Computer Science.
[20] In Princ. of Prog. Lang.
[21] Viet Yen Nguyen and Theo C. Ruys. 2008. Incremental Hashing for Spin. In International SPIN Workshop on Model Checking of Software. Springer, 232–249.
[22] Michael O. Rabin. 1981. Fingerprinting by Random Polynomials. Technical report.
[23] T. Reps, T. Teitelbaum, and A. Demers. 1983. Incremental Context-Dependent Analysis for Language-Based Editors. TOPLAS 5, 3 (July 1983), 449–477.
[24] Tiark Rompf, Arvind K. Sujeeth, Kevin J. Brown, HyoukJoong Lee, Hassan Chafi, and Kunle Olukotun. 2014. Surgical Precision JIT Compilers. In ACM SIGPLAN Notices, Vol. 49. ACM, 41–52.
[25] Hashim Sharif, Muhammad Abubakar, Ashish Gehani, and Fareed Zaffar. 2018. TRIMMER: Application Specialization for Code Debloating. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. ACM, 329–339.
[26] Venkatesh Srinivasan and Thomas Reps. 2015. Partial Evaluation of Machine Code. SIGPLAN Not. 50, 10 (Oct. 2015), 860–879. https://doi.org/10.1145/2858965.2814321
[27] Venkatesh Srinivasan and Thomas Reps. 2016. An Improved Algorithm for Slicing Machine Code. In ACM SIGPLAN Notices, Vol. 51. ACM, 378–393.
[28] M. Weiser. 1981. Program Slicing. In