Towards Neural Decompilation
Omer Katz
Technion, Israel, [email protected]
Yuval Olshaker
Technion, Israel, [email protected]
Yoav Goldberg
Bar Ilan University, Israel, [email protected]
Eran Yahav
Technion, Israel, [email protected]
Abstract
We address the problem of automatic decompilation: converting a program in a low-level representation back to a higher-level, human-readable programming language. The problem of decompilation is extremely important for security researchers. Finding vulnerabilities and understanding how malware operates are much easier when done over source code.
The importance of decompilation has motivated the construction of hand-crafted rule-based decompilers. Such decompilers have been designed by experts to detect specific control-flow structures and idioms in low-level code and lift them to source level. The cost of supporting additional languages or new language features in these models is very high.
We present a novel approach to decompilation based on neural machine translation. The main idea is to automatically learn a decompiler from a given compiler. Given a compiler from a source language S to a target language T, our approach automatically trains a decompiler that can translate (decompile) T back to S. We used our framework to decompile both LLVM IR and x86 assembly to C code with high success rates. Using our LLVM and x86 instantiations, we were able to successfully decompile over 97% and 88% of our benchmarks, respectively.
Introduction
Given a low-level program in binary form or in some intermediate representation, decompilation is the task of lifting that program to human-readable high-level source code. Fig. 1 provides a high-level example of decompilation. The input to the decompilation task is a low-level code snippet, such as the one in Fig. 1(a). The goal of decompilation is to generate corresponding, equivalent high-level code. The C code snippet of Fig. 1(b) is the desired output for Fig. 1(a).

Figure 1. Example input (a) and output (b) of decompilation.

There are many uses for decompilation. The most common is for security purposes. Searching for software vulnerabilities and analyzing malware both start with understanding the low-level code comprising the program. Currently this is done manually by reverse engineering the program. Reverse engineering is a slow and tedious process by which a specialist tries to understand what a program does and how it does it. Decompilation can greatly improve this process by translating the binary code to more readable, higher-level code.
Decompilation has many applications beyond security. For example, porting a program to a new hardware architecture or operating system is easier when source code is available and can be compiled for the new environment. Decompilation also opens the door to the application of source-level analysis and optimization tools.
Existing Decompilers
Existing decompilers, such as Hex-Rays [2] and Phoenix [34], rely on pattern matching to identify the high-level control-flow structure in a program. These decompilers try to match segments of a program's control-flow graph (CFG) to patterns known to originate from certain control-flow structures (e.g. if-then-else or loops). This approach often fails when faced with non-trivial code, and uses goto statements to emulate the control flow of the binary code. The resulting code is often low-level, and is really assembly transliterated into C (e.g. assigning variables to temporary values/registers, using gotos, and using low-level operations rather than high-level constructs provided by the language). While it is usually semantically equivalent to the original binary code, it is hard to read, and in some cases less efficient, prohibiting recompilation of the decompiled code.
There are goto-free decompilers, such as DREAM++ [36, 37], that can decompile code without resorting to using gotos in the generated code. However, all existing decompilers, even goto-free ones, are based on hand-crafted rules designed by experts, making decompiler development slow and costly. Even if a decompiler from a low-level language L_low to a high-level language L_high exists, given a new language L'_high, it is nontrivial to create a decompiler from L_low to L'_high based on the existing decompiler. There is no guarantee that any of the existing rules can be reused for the new decompiler.

Neural Machine Translation
Recent years have seen tremendous progress in Neural Machine Translation (NMT) [16, 22, 35]. NMT systems use neural networks to translate a text from one language to another, and are widely used on natural languages. Intuitively, one can think of NMT as encoding an input text on one side and decoding it to the output language on the other side (see Section 3 for more details). Recent work suggests that neural networks are also effective in summarizing source code [9, 11–13, 20, 21, 28, 29].
Recently, Katz et al. [23] suggested using neural networks, specifically RNNs, for decompilation. Their approach trains a model for translating binary code directly to C source code. However, they did not compensate for the differences between natural languages and programming languages, leading to poor results. For example, the code they generate often cannot be compiled or is not equivalent to the original source code. Their work, however, did highlight the viability of using neural machine translation for decompilation, thus supporting the direction we are pursuing. Section 8 provides additional discussion of [23].
Our Approach
We present a novel automatic neural decompilation technique, using a two-phased approach. In the first phase, we generate a templated code snippet which is structurally equivalent to the input. The code template determines the computation structure without assignment of variables and numerical constants. Then, in the second phase, we fill the template with values to get the final decompiled program. The second phase is described in Section 5. Our approach can facilitate the creation of a decompiler from L_low to L_high for every pair of languages for which a compiler from L_high to L_low exists.
The technique suggested by [23] attempted to apply NMT to binary code as-is, i.e. without any additional steps and techniques to support the translation. We recognize that for a trainable decompiler, and specifically an NMT-based decompiler, to be useful in practice, we need to augment it with programming-languages knowledge (i.e. domain knowledge). Using domain knowledge we can make translations simpler and overcome many shortcomings of the NMT model. This insight is implemented in our approach as our canonicalization step (Section 4.3, for simplifying translations) and template filling (Section 5, for overcoming NMT shortcomings). Our technique is still modest in its abilities, but presents a significant step towards trainable decompilers and towards the application of NMT to the problem of decompilation.
The first phase of our approach borrows techniques from natural language processing (NLP) and applies them to programming languages. We use an existing NMT system to translate a program in a lower-level language to a templated program in a higher-level language. Since we are working on programming languages rather than natural languages, we can overcome some major pitfalls for traditional NMT systems, such as training data generation (Section 4.2) and verification of translation correctness (Section 4.4).
We incorporate these insights to create a decompilation technique capable of self-improvement: identifying decompilation failures as they occur, and triggering further training as needed to overcome such failures. By using NMT techniques as the core of our decompiler's first phase, we avoid the manual work required in traditional decompilers. The core of our technique is language-agnostic, requiring only minimal manual intervention (i.e., implementing a compiler interface).
One of the reasons that NMT works well in our setting is the fact that, compared to natural language, code has a more repetitive structure and a significantly smaller vocabulary. This enables training with significantly fewer examples than what is typically required for NLP [26] (see Section 6).
Mission Statement
Our goal is to decompile short snippets of low-level code to equivalent high-level snippets. We aim to handle multiple languages (e.g. x86 assembly and LLVM IR). We focus on code compiled using existing off-the-shelf compilers (e.g. gcc [1] and clang [4]), with compiler optimizations enabled, for the purpose of finding bugs and vulnerabilities in benign software. More specifically, we do not attempt to handle hand-crafted assembly as is often found in malware.
Many previous works aimed to use decompilation as a means of understanding the low-level code, and thus focused mostly on code readability. In addition to readability, we place a great emphasis on generating code that is correct (i.e., can be compiled without further modifications) and equivalent to the given input.
We wish to further emphasize that the goal of our work is not to outperform existing decompilers (e.g., Hex-Rays [2]). Many years of development have been invested in such decompilers, resulting in mature and well-tested (though not yet perfect) tools. Rather, we wish to shed light on trainable decompilation, and NMT-based decompilation in particular, as a promising alternative approach to traditional decompilation. This new approach holds the advantage over existing decompilers not in its current results, but in its potential to handle new languages, features, compilers, and architectures with minimal manual intervention. We believe this ability will play a vital role as decompilation becomes more widely used for finding vulnerabilities.

Main Contributions
The paper makes the following contributions:
• A significant step towards neural decompilation by combining ideas from neural machine translation (NMT) and program analysis. Our work brings this promising approach to decompilation closer to being practically useful and viable.
• A decompilation framework that automatically generates training data and checks the correctness of translations using a verifier.
• A decompilation technique that is applicable to many pairs of source and target languages and is mostly independent of the actual low-level source and high-level target languages used.
• An implementation of our technique in a framework called TraFix (short for TRAnslate and FIX) that, given a compiler from L_high to L_low, automatically learns a decompiler from L_low to L_high.
• An instantiation of our framework for decompilation of C source code from LLVM intermediate representation (IR) [3] and x86 assembly. We used these instances to evaluate our technique on decompilation of small, simple code snippets.
• An evaluation showing that our framework decompiles statements in both LLVM IR and x86 assembly back to C source code with high success rates. The evaluation demonstrates the framework's ability to successfully self-advance as needed.

Overview
In this section we provide an informal overview of our approach.
Consider the x86 assembly example of Fig. 1(a). Fig. 2 shows the major steps we take for decompiling that example. The first step in decompiling a given input is applying canonicalization. In this example, for the sake of simplicity, we limited canonicalization to only splitting numbers into digits (Section 4.3.1), thus replacing each number with its sequence of digits, resulting in the code in block (1). This code is provided to the decompiler for translation.
The output of our decompiler's NMT model is a canonicalized version of C, as seen in block (2). In this example, output canonicalization consists of splitting numbers into digits, the same as was applied to the input, and printing the code in post-order (Section 4.3.2), i.e. each operator appears after its operands. We apply un-canonicalization to the output, which converts it from post-order to in-order, resulting in the code in block (3). The output of un-canonicalization might contain decompilation errors, thus we treat it as a code template. Finally, by comparing the code in block (3) with the original input in Fig. 1, we fill the template (i.e. by determining the correct numeric values that should appear in the code, see Section 5), resulting in the code in block (4). The code in block (4) is then returned to the user as the final output. For further details on the canonicalizations applied by the decompiler, see Section 4.3.
Our approach to decompilation consists of two complementary phases: (1) generating a code template that, when compiled, matches the computation structure of the input, and (2) filling the template with values and constants that result in code equivalent to the input.
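The two phases can be summarized as a small pipeline. The sketch below is an illustration only: the `canonicalize`, `nmt_translate`, `uncanonicalize`, and `fill` callables are hypothetical stand-ins for the components described above, not the framework's actual API.

```python
# Illustrative sketch of the overview pipeline (blocks (1)-(4) of Fig. 2).
# Every callable here is an assumed stand-in, not the framework's real interface.
def decompile(input_low, canonicalize, nmt_translate, uncanonicalize, fill):
    canonical = canonicalize(input_low)    # block (1): canonicalized low-level input
    translated = nmt_translate(canonical)  # block (2): canonicalized C from the NMT model
    template = uncanonicalize(translated)  # block (3): code template (may contain errors)
    return fill(template, input_low)       # block (4): template filled against the input
```

With toy stand-ins for each stage, `decompile("movl 1234 , X1", ...)` walks an input through all four blocks and returns the final C statement.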
Fig. 3 provides a schematic representation of this phase. At the heart of our decompiler is the NMT model. We surround the NMT model with a feedback loop that allows the system to determine success/failure rates and improve itself as needed by further training.
Denote the input language of our decompiler as L_low and the output language as L_high, such that the grammar of both languages is known. Given a dataset of input statements in L_low to decompile, and a compiler from L_high to L_low, the decompiler can either start from scratch, with an empty model, or from a previously trained model. The decompiler translates each of the input statements to L_high. For each statement, the NMT model generates a few translations that it deems most likely. The decompiler then evaluates the generated translations. It compiles each suggested translation from L_high to L_low using existing off-the-shelf compilers. The compiled translations are compared against the original input statement in L_low and classified as successful translations or failed translations. At this phase, the translations are code templates, not yet actual code, thus the comparison focuses on matching the computation structure. A failed translation therefore does not match the structure of the input, and cannot produce code equivalent to the input in phase 2. We denote input statements for which there was no successful translation as failed inputs. Successful translations are passed to the second phase and made available to the user.

Figure 2. Steps for decompiling x86 assembly to C: (1) canonicalized x86 input, (2) NMT output, (3) templated output, (4) final fixed output.

Figure 3. Schematic overview of the first phase of our decompiler.

The existence of failed inputs triggers a retraining session. The training dataset and the validation dataset (used to evaluate progress during training) are updated with additional samples, and the model resumes training using the new datasets. This feedback loop, between the failed inputs and the model's training session, drives the decompiler to improve itself and keep learning as long as it has not reached its goal. These iterations continue until a predetermined stopping condition has been met, e.g. a significant enough portion of the input statements was decompiled successfully. The loop also allows us to focus training on aspects where the model is weaker, as determined by the failed inputs.
The well-defined structure of programming languages allows us to make predictable and reversible modifications to both the input and output of the NMT model. These modifications are referred to as canonicalization and un-canonicalization, and are aimed at simplifying the translation problem. These steps rely on domain-specific knowledge and do not exist in traditional NMT systems for natural languages. Section 4.3 motivates and describes our canonicalization methods.

Updating the Datasets
After each iteration we update the dataset used for training. Retraining without doing so would lead to over-fitting the model to the existing dataset, and would be ineffective at teaching the model to handle new inputs. We update the dataset by adding new samples obtained from two sources:
• Failed translations – we compile failed translations from L_high to L_low and use them as additional training samples. Training on these samples serves to teach the model the correct inputs for these translations, thus reducing the chances that the model will generate these translations again in future iterations.
• Random samples – we generate a predetermined number of random code samples in L_high and compile these samples to L_low.
The validation dataset is updated using only random samples. It is also shuffled and truncated to a constant size. The validation dataset is translated and evaluated many times during training; truncating it prevents the validation overhead from increasing.
The first phase of our approach produces a code template that can lead to code equivalent to the input. The goal of the second phase is to find the right values for instantiating actual code from the template. Note that the NMT model provides initial values. We need to verify that these values are correct and replace them with appropriate values if they are wrong. This step is inspired by the common NLP practice of delexicalization [18]. In NLP, using delexicalization, some words in a sentence are replaced with placeholders (e.g. NAME1 instead of an actual name). After translation these placeholders are replaced with values taken directly from the input. Similarly, we use the input statement as the source for the values needed for filling our template. Unlike delexicalization, it is not always the case that we can take a value directly from the input.
In many cases, and usually due to optimizations, we must apply some transformation to the values in the input in order to find the correct value to use. In the example of Fig. 2, the code contains two numeric values which we need to "fill". For each of these values we need to either verify it or replace it. The first value is relatively simple, as the NMT provided a correct initial value; we can determine that by comparing the value in the output to the value in the original input. For the second value, however, copying the value from the input does not produce the correct output: compiling the output with the copied value would result in the instruction sall 1, %eax rather than the desired sall 2, %eax. We thus replace the value with a variable N and try to find the right value for N. To get the correct value, we need to apply a transformation to the input value; applying the relevant transformation for this example yields a value of N that, when recompiled, produces the desired output, resulting in the code in Fig. 2(4). Section 5 further elaborates on this phase and provides additional possible transformations.

Current Neural Machine Translation (NMT) models follow a sequence-to-sequence paradigm introduced in [14]. Conceptually, they have two components, an encoder and a decoder. The encoder encodes an arbitrary-length sequence of tokens x_1, ..., x_n over alphabet A into a sequence of vectors, where each vector represents a given input token x_i in the context in which it appears. The decoder then produces an arbitrary-length sequence of tokens y_1, ..., y_m from alphabet B, conditioned on the encoded vectors. The sequence y_1, ..., y_m is generated a token at a time, until an end-of-sequence token is generated. When generating the i-th token, the model considers the previously generated tokens as well as the encoded input sequence. An attention mechanism is used to choose which subset of the encoded vectors to consider at each generation step.
The generation procedure is either greedy, choosing the best continuation symbol at each step, or uses beam search to develop several candidates in parallel. The NMT system (including the encoder, decoder and attention mechanism) is trained over many input-output sequence pairs, where the goal of the training is to produce correct output sequences for each input sequence. The encoder and the decoder are implemented as recurrent neural networks (RNNs), and in particular as specific flavors of RNNs called LSTM [19] and GRU [14] (we use LSTMs in this work). Refer to [30] for further details on NMT systems.

In this section we describe the algorithm of our decompilation framework using NMT. First, in Section 4.1, we describe the algorithm at a high level. We then describe the realization of operations used in the algorithm, such as canonicalization (Section 4.3), the evaluation of the resulting translation (Section 4.4), and the stopping condition (Section 4.5).
Our framework implements the process depicted by Fig. 3. This process is also formally described in Algorithm 1. The algorithm uses a Dataset data structure which holds pairs (x, y) of statements such that x ∈ L_high, y ∈ L_low, and y is the output of compiling x.
The framework takes two inputs: (1) a set of statements for decompilation, and (2) a compiler interface. The output is a set of successfully decompiled statements. Decompilation starts with empty sets for training and validation and canonicalizes (Section 4.3) the input set. It then iteratively extends the training and validation sets (Section 4.2), trains a model on the new sets and attempts to translate the input set. Each translation is then recompiled and evaluated against the original input (Section 4.4 and Section 5). Successful translations are then put in a Success set, which will eventually be returned to the user. Failed translations are put in a Failed set that will be used to further extend the training set. The framework repeats these steps as long as the stopping condition has not been reached (Section 4.5).
To generate samples for our decompiler to train on, we generate random code samples from a subset of the C programming language. This is done by sampling the grammar of the language. The samples are guaranteed to be syntactically and grammatically correct. We then compile our code samples using the provided compiler. Doing so results in a dataset of matching pairs of statements, one in C and the other in L_low, that can be used by the model for training and validation.

Algorithm 1 Decompilation algorithm
Input: inputset, a collection of statements in L_low; compile, an API to compile L_high to L_low
Output: a Dataset of successfully decompiled statements in L_high
Data Types: Dataset, a collection of pairs (x, y) such that x = compile(y)

procedure Decompile
  inputset ← canonicalize(inputset)
  Train ← new Dataset
  Validate ← new Dataset
  model ← new Model
  Success ← new Dataset
  Failures ← new Dataset
  while stopping condition not met do
    Train ← Train ∪ Failures ∪ gen_random_samples()
    Validate ← Validate ∪ gen_random_samples()
    model.retrain(Train, Validate)
    decompiled ← model.translate(inputset)
    recompiled ← compile(decompiled)
    for each i in 1 ... inputset.size do
      pair ← (inputset[i], decompiled[i])
      if equiv(inputset[i], recompiled[i]) then
        if fill(inputset[i], recompiled[i]) then
          Success ← Success ∪ [pair]
        else
          Failures ← Failures ∪ [pair]
        end if
      else
        Failures ← Failures ∪ [pair]
      end if
    end for
  end while
  return uncanonicalize(Success)
end procedure

We note that, alternatively, we could use code snippets from publicly available code repositories as training samples, but these are less likely to cover uncommon coding patterns.
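Algorithm 1's feedback loop can also be rendered as a compact Python sketch. All callables below (`compile_fn`, `equiv`, `fill`, `gen_random_samples`, `should_stop`, and the `model` object) are assumed interfaces standing in for the components described in Sections 4.2-4.5 and 5, not the framework's real API.

```python
def decompile_loop(inputset, compile_fn, model, equiv, fill,
                   gen_random_samples, should_stop):
    """Sketch of Algorithm 1: iteratively retrain, translate, recompile,
    and sort translations into successes and failures."""
    train, validate = [], []
    success, failures = [], []
    while not should_stop(len(inputset)):
        # Extend the datasets with past failures and fresh random samples.
        train += failures + gen_random_samples()
        validate += gen_random_samples()
        model.retrain(train, validate)
        failures, remaining = [], []
        for inp in inputset:
            decompiled = model.translate(inp)
            recompiled = compile_fn(decompiled)
            if equiv(inp, recompiled) and fill(inp, decompiled):
                success.append((inp, decompiled))
            else:
                # A failed translation becomes a new training pair.
                failures.append((recompiled, decompiled))
                remaining.append(inp)
        inputset = remaining
    return success
```

The only state carried between iterations is the growing training set and the list of failed translations, which is what lets the loop focus later training rounds on the inputs the model still gets wrong.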
It is possible to improve the performance of NMT models without intervening in the actual model. This can be achieved by manipulating the inputs in ways that simplify the translation problem. In the context of our work, we refer to these domain-specific manipulations as canonicalization. Following are two forms of canonicalization used by our implementation:

(a) Original input:                          movl 1234 , X1
(b) Input after splitting numbers to digits: movl 1 2 3 4 , X1
(c) Translation output:                      X1 = 1 2 3 4 ;
(d) Output after fusing digits to numbers:   X1 = 1234 ;

Figure 4. Reducing vocabulary by splitting numbers to digits.
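The splitting and fusing shown in Figure 4 can be sketched with two small helpers. This is a minimal illustration using regular expressions; the framework applies the same canonicalization over its whole token stream.

```python
import re

def split_numbers(code):
    """Canonicalize: replace every number with its space-separated digits,
    so the vocabulary needs only ten digit symbols."""
    return re.sub(r'\d+', lambda m: ' '.join(m.group()), code)

def fuse_digits(code):
    """Un-canonicalize: fuse runs of space-separated digits back into numbers."""
    return re.sub(r'\d(?: \d)+', lambda m: m.group().replace(' ', ''), code)
```

For example, `split_numbers("movl 1234 , X1")` yields `"movl 1 2 3 4 , X1"`, and `fuse_digits` reverses the transformation on the model's output.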
The vocabulary size of the samples provided to the model, either for training or translating, directly affects the performance and efficiency of the model. In the case of code, a large portion of the vocabulary is devoted to numerical constants and names (such as variable names, method names, etc.).
Names and numbers are usually considered "uncommon" words, i.e. words that do not appear frequently. Descriptive variable names, for example, are often used within a single method but are not often reused in other methods. This results in a distinctive vocabulary, consisting largely of uncommon words, and leading to a large vocabulary.
We observe that the actual variable names do not matter for preserving the semantics of the code. Furthermore, these names are actually removed as part of the stripping process. Therefore, we replace all names in our samples with generic names (e.g. X1 for a variable). This allows for more reuse of names in the code, and therefore more examples from which the model can learn how to treat such names. Restoring informative, descriptive names in source code is a known and orthogonal research problem for which several solutions exist (e.g. [10, 17, 32]).
Numbers cannot be handled in a similar way. Their values cannot be replaced with generic values, since that would alter the semantic meaning of the code. Furthermore, essentially every number used in the samples becomes a word in the vocabulary. Even limiting numbers to some bounded range [−K, K] would still result in a distinct word for each value in the range.
To deal with the abundance of numbers we take inspiration from NMT for natural languages. Whenever an NMT model for natural language encounters an uncommon word, instead of trying to directly translate that word, it falls back to a sub-word representation (i.e. it processes the word as several symbols). Similarly, we split all numbers in our samples into digits. We train the model to handle single digits and then fuse the digits in the output into numbers. Fig. 4 provides an example of this process on a simple input. Using this process, we reduce the portion of the vocabulary dedicated to numbers to ten symbols, one per digit. Note that this reduction comes at the expense of prolonging our input sentences.

Figure 5. Example of code structure alignment: (a) original C code, (b) post-order C code, (c) compiled x86 assembly.
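The in-order/post-order re-ordering illustrated in Fig. 5 can be sketched as follows: an in-order expression tree is linearized to post-order before training, and a simple stack-based bottom-up parser recovers the in-order form after translation. The tuple-based AST representation here is an assumption for illustration.

```python
OPS = {'+', '-', '*', '/', '='}

def to_postorder(node):
    """Linearize an expression tree to post-order tokens.
    A node is either a leaf token (str) or a tuple (op, left, right)."""
    if isinstance(node, str):
        return [node]
    op, left, right = node
    return to_postorder(left) + to_postorder(right) + [op]

def postorder_to_inorder(tokens):
    """Simple bottom-up parse: push operands, combine on each operator."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right, left = stack.pop(), stack.pop()
            stack.append('( %s %s %s )' % (left, tok, right))
        else:
            stack.append(tok)
    return stack.pop()
```

Because every operator in this subset is binary, the post-order token sequence is unambiguous and the round trip is exact.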
Alternative Method for Reducing Vocabulary Size
We observe that, in terms of usage and semantic meaning, all numbers are equivalent (other than a very few specific numbers that hold special meaning, e.g. 0 and 1). Thus, as an alternative to splitting numbers into digits, we tried replacing all numbers with constants (e.g. N1, N2, ...). As with variable names, the purpose of this replacement was to increase reuse of the relevant words while reducing the vocabulary. When applying these replacements to our input statements, we maintained a record of all applied replacements. After translation, we used this record to restore the original values in the output.
This approach worked well for unoptimized code, but failed on optimized code. In unoptimized code there is a direct correlation between constants in high-level and low-level code. That correlation allowed us to restore the values in the output. In optimized code, compiler optimizations and transformations break that correlation, making it impossible for us to restore the output based on the kept record.
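This replace-and-restore scheme can be sketched as below. The regex-based tokenization is an illustrative assumption; as discussed above, restoring only works when the high-level/low-level constant correlation is preserved (i.e. unoptimized code).

```python
import re

def abstract_numbers(code):
    """Replace each standalone number with a placeholder N1, N2, ... and
    record the original values for restoring them after translation."""
    record = []
    def repl(match):
        record.append(match.group())
        return 'N%d' % len(record)
    # \b keeps digits inside identifiers such as X1 untouched
    return re.sub(r'\b\d+\b', repl, code), record

def restore_numbers(code, record):
    """Restore recorded values into the translated output."""
    return re.sub(r'N(\d+)', lambda m: record[int(m.group(1)) - 1], code)
```

The record maps placeholders back to values positionally, which is exactly what breaks once optimizations reorder or rewrite the constants.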
Most high-level programming languages write code in-order, i.e. an operator appears between its two operands. On the other hand, low-level programming languages, which are "closer" to the hardware, often use post-order, i.e. both operands appear before the operator. The code in Fig. 5 demonstrates this difference. Fig. 5a shows a simple statement in C and Fig. 5c the x86 assembly obtained by compiling it. The different colors represent the correlation between the different parts of the computation.
Intuitively, if one were charged with the task of translating a statement, it would be helpful if both input and output shared the same order. Having a shared order simplifies "planning" the output by localizing the dependencies to some area of the input rather than spreading them across the entire input. Similarly, NMT models often perform better when the source and target languages follow similar word orders, even though the model reads the entire input before generating any output. We therefore modify the structure of the C input statements to post-order to create a better correlation with the output. Fig. 5b shows the code obtained by canonicalizing the code in Fig. 5a. After translation, we can easily parse the generated post-order code using a simple bottom-up parser to obtain the corresponding in-order code.
We rely on the deterministic nature of compilation as the basis of our evaluation. After translating the inputs, for each pair of input i and corresponding translation t (i.e. the decompiled code), we recompile t and compare the output to i. This allows us to keep track of progress and success rates, even when the correct translation is not known in advance.

Comparing computation structure
After the first step of our decompiler, the structure of the computation in the decompiled program should match that of the original program. We therefore compare the original program and the templated program from decompilation by comparing their program dependence graphs. We convert each code snippet to its corresponding Program Dependence Graph (PDG). The nodes of the graph are the different instructions in the snippet. The graph contains two types of edges: data dependency edges and control dependency edges. A data dependency edge from node n1 to node n2 means that n2 uses a value set by n1. A control dependency between n1 and n2 means that execution of n2 depends on the outcome of n1.

1: x = ...
2: y = x * x;
3: if y % 2 == ... then
4:   z = x + ...
5: else
6:   z = x - ...
7: end if
8: w = z * ...

Figure 6. Example of a Program Dependence Graph: (a) source code, (b) the corresponding dependence graph. Solid arrows for data dependencies, dashed arrows for control dependencies.

Fig. 6b shows an example of a program dependence graph for the code in Fig. 6a. Solid arrows in the graph represent data dependencies between code lines and dashed arrows represent control dependencies. Since line 2 uses the variable x, which was defined in line 1, we have an arrow from 1 to 2. Similarly, line 8 uses the variable z, which can be defined in either line 4 or line 6. Therefore, line 8 has a data dependency on both line 4 and line 6. Furthermore, the execution of lines 4 and 6 depends on the outcome of line 3. This dependency is represented by the dashed arrows from 3 to 4 and to 6.
We extend the PDG with nodes "initializing" the different variables in the code. These nodes allow us to maintain a separation between the different variables. We then search for an isomorphism between the two graphs, such that if nodes n and n' are matched by the isomorphism it is guaranteed that either (1) both n and n' correspond to variables, (2) both n and n' correspond to numeric constants, or (3) n and n' correspond to the same operator (e.g. addition, subtraction, branching, etc.).
If such an isomorphism exists, we know that both code snippets implement the same computation structure. The snippets might still differ in the variables or numeric constants they use. However, the way the snippets use these variables and constants is equivalent in both snippets. Thus, if we could assign the correct variables and constants to the code, we would get an identical computation in both snippets. We consider translations that reach this point as a successful template and attempt to fill the template as described in Section 5. A translation is determined fully successful only if filling the template (Section 5) is also successful. This kind of evaluation allows us to overcome instruction reordering, variable renaming, minor translation errors and small modifications to the code (often due to optimizations).
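A simplified illustration of this structural comparison, restricted to data dependencies over straight-line code: two snippets match up to renaming of variables if their dependence edges coincide line for line. The full framework also tracks control dependencies and runs a genuine graph-isomorphism search; this sketch only conveys the idea.

```python
def data_dep_edges(stmts):
    """stmts: list of (defined_var, used_vars) per line.
    Returns the set of data-dependency edges (defining_line, using_line)."""
    last_def, edges = {}, set()
    for line, (defined, used) in enumerate(stmts):
        for var in used:
            if var in last_def:
                edges.add((last_def[var], line))
        last_def[defined] = line
    return edges

def same_structure(a, b):
    """Two snippets share a computation structure (up to variable renaming)
    if their data-dependence edges are identical."""
    return data_dep_edges(a) == data_dep_edges(b)
```

Renaming every variable leaves the edge set unchanged, so a template that only differs from the input in names or constants still passes this check.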
Our framework terminates the decompilation iterationswhen 1 of 3 conditions is met:1. Sufficient results: given a percentage threshold p ,after each iteration the framework checks the num-ber of test samples that remain untranslated andstops when at least p % of the initial test set wassuccessfully decompiled.2. No more progress: The framework keeps trackof the amount of remaining test samples. Whenthe framework detects that that number has notchanged in x iterations, meaning no progress wasmade during these iterations, it terminates. Suchcases highlight samples that are too difficult forour decompiler to handle 3. Iteration limit: given some number n , we can ter-minate the decompilation process after n iterationshave finished. This criteria is optional and can beleft empty, in which case only the first 2 conditionsapply. An important feature of our framework is that we canfocus the training done in the first phase to languagefeatures exhibited by the input. Essentially, we can startby “learning” to decompile a subset of the high-levellanguage.Learning to decompile some subset s of the high-levellanguage takes time and resources. Therefore, given anew input dataset, utilizing another subset of the lan-guage s ′ , we would like to reuse what we have learnedfrom s .Because the vocabulary of s ′ is not necessarily con-tained in the vocabulary of s , i.e. vocab ( s ′ ) ⊈ vocab ( s ) ,we have implemented a dynamic vocabulary extensionmechanism in our framework. When the framework de-tects that the current vocabulary is not the same as thevocabulary used for previous training sessions, it createsa new model and partially initializes it using value froma previously trained model. This allow us to add supportfor new tokens in the language without starting fromscratch.Note that all tokens are equivalent in the eyes of theNMT model. Specifically, the model does not know that avariable is different from a number or an operator. 
It only learns the differences between tokens from the different contexts in which they appear. Therefore, using this mechanism, we can extend the language supported by the decompiler with new operators, features and constructs, as needed. For example, starting from a subset of the language containing only arithmetic expressions, we can easily add if statements to the subset without losing any previous progress we've made while training on arithmetic expressions. The extension mechanism is also used during training on a specific language subset. At each iteration, our framework generates new training samples to extend the existing training set. These new samples can, for example, contain new variables or numbers that weren't previously part of the vocabulary, thus requiring an extension of the vocabulary. It is important to note that in a real-world use-case we don't expect training sessions to be frequent. Additional training should only be applied when dealing with new features, a new language, or with relatively harder samples than previous samples. We expect the majority of decompilation problems to be solved using an existing model.

Filling the Template
In Section 4, we saw how the decompiler takes a low-level program and produces a high-level templated program, where some constant assignments require filling. In this section, we describe how to fill the parameters in the templated program.
From our experimentation with applying NMT models to code, we learned that NMT performs well at generating correct code structure. We also learned that NMT has difficulty generating and predicting the right constants. This is exhibited by many cases where the proposed translation differs from an exact translation by only a numerical constant or a variable. The use of word embeddings in NMT is a major contributor to these translation errors. A word embedding is essentially a summary of the different contexts in which a word appears. Embeddings are commonly used in NLP for identifying synonyms and other interchangeable words. For example, assume we have an NMT model for NLP which trains on the sentence "The house is blue". While training, the model will learn that different colors often appear in similar contexts. The model can then generalize what it has learned from "The house is blue" and apply it to the sentence "The house is green", which it has never encountered before. In practice, word embeddings are numerical vectors, and the distance between the embeddings of words that appear in similar contexts will be smaller than the distance between embeddings of words that do not. The model itself does not operate on the actual words provided by the user. It instead translates the input to embeddings and operates on those vectors. Since we are dealing with code rather than natural languages, we have many more "interchangeable" words to handle. During training, all numerical values appear in the same contexts, resulting in very similar (if not identical) embeddings. Thus, the model is often unable to distinguish between different numbers. Therefore, while word embeddings are still useful for generalizing from training examples, using embeddings in our case results in translation errors when constants are involved. Due to the above, we have decided to treat the output of the NMT model not as a final translation but as a template that needs filling.
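A minimal sketch of the abstraction that makes templating possible (hypothetical helper names; the paper's canonicalization step in Section 4.3.1 is more elaborate) replaces concrete numbers with indexed placeholders, keeping the mapping so the template can later be re-filled:

```python
import re

def abstract(code):
    """Replace concrete numbers with indexed placeholders (N1, N2, ...)
    so the NMT model never has to predict exact constants; the mapping
    is kept so the template can be re-filled later."""
    mapping, counter = {}, [0]
    def repl(m):
        counter[0] += 1
        ph = f"N{counter[0]}"
        mapping[ph] = int(m.group())
        return ph
    # \b keeps digits inside identifiers such as X0 untouched
    template = re.sub(r"\b\d+\b", repl, code)
    return template, mapping

def fill(template, mapping):
    """Substitute placeholder values back into a template."""
    for ph, val in mapping.items():
        template = template.replace(ph, str(val))
    return template

tmpl, nums = abstract("X2 = ((X0 % 40) * 63) / (98 - X1);")
# tmpl == "X2 = ((X0 % N1) * N2) / (N3 - X1);"
```

Note that the placeholder style (N1, N2, ...) matches the suggested translations shown later in Fig. 11 and Fig. 12.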
The first phase of our decompilation process verifies that the computation structure resulting from recompiling the translation matches that of the input. If that is the case, any differences are most likely the result of using incorrect constants. The second phase of our decompilation process deals with correcting any such false constants. Given that the computation structure of our translation and the input is the same, errors in constants can be found in variable names and numeric values. In the first phase, as part of comparing the computation structure, we also verify that there are no cases where a variable should have been a numeric value or vice versa. That means we can treat these two cases in isolation. We note that since we are dealing with low-level languages, in which there are often no variable names to begin with, using the correct name is inconsequential. In the case of variables, all that matters is that for each variable in the input there exists a variable in the translation that is used in exactly the same manner. This requirement is already fulfilled by matching the computation structure (Section 4.4).
We focus on correcting errors resulting from using wrong numeric values. Denoting the input as i, the translation as t, and the result of recompiling the translation as r, there are three questions that we need to address:

Which numbers in r need to change, and to which other numbers? Since the NMT model was trained on code containing numeric values and constants, the generated translation also contains such values (generated directly by the model) and constants (due to the numeric abstraction step we describe in Section 4.3.1), which are replaced with their original values. We use these numbers as an initial suggestion as to which values should be used. As explained in Section 4.4, we compare r and i by building their corresponding program dependence graphs and looking for an isomorphism between the graphs. If such an isomorphism is found, it essentially entails a mapping from nodes in one graph to nodes in the other. Using this mapping we can search for pairs of nodes n_r and n_i such that n_r ∈ r is mapped to n_i ∈ i, both nodes are numeric values, but n_r != n_i. Such nodes highlight which numbers need to be changed (n_r) and to which other numbers (n_i).

Which numbers in t affect which numbers in r? Note that although we know that a number n ∈ r is wrong and needs to be fixed, we cannot apply the fix directly. Instead we need to apply a fix to t that will result in the desired fix to r. The first step towards achieving that is to create a mapping from numbers in t to numbers in r, such that changing n_t ∈ t results in a change to n_r ∈ r. By making small controlled changes to t we can observe how r changes. We find some number n_t ∈ t, replace it with n′_t, resulting in t′, and recompile it to get r′. We then compare r and r′ to verify that the change we made maintains the same low-level computation structure. If that is the case, we identify all numbers n_r ∈ r that were changed and record those as affected by n_t.

How do we enact the right changes in t?
At this point we know which number n_t ∈ t we should change, and we know the target value n_i we want to have instead of n_r ∈ r. All we need to determine now is how to correctly modify n_t to end up with n_i. The simple case is when n_t == n_r, which means whatever number we put in t is copied directly to r, and thus we simply need to replace n_t with n_i. However, due to optimizations (some applied even when using -O0), numbers are not always copied as-is. Following are three examples we encountered in our work with x86 assembly.

Replacing numbers in conditions
Assuming x is a variable of type int, given the code if (x >= 5), it is compiled to assembly equivalent to if (x > 4), which is semantically identical but slightly more efficient.

Division/Multiplication by powers of 2
These operations are often replaced with semantically equivalent shift operations. For example, division by 8 would be compiled as a shift right by 3.
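A template filler that inverts such rewrites can generate candidate high-level constants from each low-level constant. The sketch below (hypothetical helper; a simplification of the pattern set described in this section) covers the comparison and power-of-two patterns:

```python
def candidate_sources(n_r):
    """Given a constant observed in compiled code, suggest high-level
    constants that common compiler rewrites could have turned into it.
    Covers two of the patterns above; the real pattern set is larger."""
    cands = {n_r}            # the constant may simply be copied as-is
    cands.add(n_r + 1)       # `x >= 5` lowered to `x > 4`: 4 came from 5
    cands.add(n_r - 1)       # symmetric case, e.g. `x <= 5` as `x < 6`
    if 0 < n_r < 32:
        cands.add(2 ** n_r)  # `x / 8` lowered to `x >> 3`: 3 came from 8
    return sorted(cands)

# a shift by 3 in the input suggests the source may have divided by 8
```

Each candidate is then tested as described below: apply it to t, recompile, and check whether the affected values in r now match their counterparts in i.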
Implementing division using multiplication
Since division is usually considered the most expensive operation to execute, when the divisor is known at compilation time it is more efficient to implement the division using a sequence of multiplication and shift operations. For example, calculating x/d can be implemented as (x*m) >> s for a constant m chosen such that m/2^s ≈ 1/d. We identified a set of common patterns used to make such optimizations in common compilers. Using these patterns, we generate candidate replacements for n_t. We test each replacement by applying it to t, recompiling, and checking whether the affected values n_r ∈ r are now equal to their n_i ∈ i counterparts. We declare a translation as successful only if an appropriate fix can be found for all incorrect numeric values and constants.

In this section we describe the evaluation of our decompilation technique and present our results.
We implemented our technique in a framework called
TraFix. Our framework takes as input an implementation of our compiler interface and uses it to build a decompiler. The resulting decompiler takes as input a set of sentences in a low-level language L_low, translates the sentences, and outputs a corresponding set of sentences in a high-level language L_high, specifically C in our implementation. Each sentence represents a sequence of statements in the relevant language.

X0 = X1 + X2;
(a) C code

%1 = load i32 , i32* @X1
%2 = load i32 , i32* @X2
%3 = add i32 %1 , %2
store i32 %3 , i32* @X0
(b) LLVM IR

movl X1 , %edx
movl X2 , %eax
addl %edx , %eax
movl %eax , X0
(c) x86 assembly

Figure 7.
Example of code structure alignment
Our implementation uses the NMT implementation provided by DyNmt [6] with slight modifications. DyNmt implements the standard encoder-decoder model for NMT using DyNet [31], a dynamic neural network toolkit.
Compiler Interface
The compiler interface consists of a set of methods encapsulating usage of the compiler and representation-specific information (e.g. how does the compiler represent numbers in the assembly?). The core of the API consists of: (1) a compile method that takes a sequence of C statements and returns the sequence of statements in L_low resulting from compiling it (the returned code is "cleaned up" by removing parts that don't contribute any useful information); and (2) an Instruction class that describes the effects of different instructions, which is used for building a PDG during translation evaluation (Section 4.4). We implemented such compiler interfaces for compilation (1) from C to LLVM
IR, and (2) from C to x86 assembly. Fig. 7 shows the result of compiling the simple C statement of Fig. 7a using both compilers. We evaluate
TraFix using random C snippets sampled from a subset of the C programming language. Each snippet is a sequence of statements, where each statement is either an assignment of an expression to a variable, an if condition (with or without an else branch), or a while loop. Expressions consist of numbers, variables, binary operators and unary operators. If and while statements are composed using a condition, a relational operator between two expressions, and a sequence of statements which serves as the body. We limit each sequence of statements to at most 5. Table 1 provides the formal grammar from which the benchmarks are sampled. All of our benchmarks were compiled using the compiler's default optimizations. Working on optimized code

Statements := Statement | Statements Statement
Statement := Assignment | Branch | Loop
Assignments := Assignment | Assignments Assignment
Assignment := Var = Expr;
Var := ID
Expr := Var | Number | BinaryExpr | UnaryExpr
UnaryExpr := UnaryOp Var | Var UnaryOp
UnaryOp := ++ | --
BinaryExpr := Expr BinaryOp Expr
BinaryOp := + | - | * | / | %
Branch := if (Condition) { Statements } | if (Condition) { Statements } else { Statements }
Loop := while (Condition) { Statements }
Condition := Expr Relation Expr
Relation := > | >= | < | <= | == | !=

Table 1.
Grammar for experiments. Terminals are underlined.

introduces several challenges, as mentioned in Section 5.2, but is crucial for the practicality of our approach. Note that we didn't strip the code after compilation. However, our "original" C code that we compile is already essentially stripped, since our canonicalization step abstracts all names in the code. During benchmark generation we make sure that there is no overlap between the training dataset, validation dataset and our test dataset (used as input statements to the decompiler).

Evaluating Benchmarks
Despite holding the ground truth for our test set (the C code used to generate the set), we decided not to compare the decompiled code to the ground truth. We observe that, in some cases, different C statements could be compiled to the same low-level code (e.g. the statements x = x + 1 and x++). We decided to evaluate in a manner that allows for such occurrences and is closer to what would be applied in a real use-case. We thus opted to evaluate our benchmarks by recompiling the decompiled code and comparing it against the input, as described in Section 4.4. We ran several experiments of
TraFix. For each experiment we generated 2,000 random statements to be used as the test set.
TraFix was configured to generate an initial set of 10,000 training samples and an additional 5,000 training samples at each iteration. An additional 1,000 random samples served as the validation set. There is no overlap between the test set and the training/validation sets. We decided, at each iteration, to drop half of the training samples from the previous iteration. This serves to limit the growth of the training set (and thus the training time), and assigns a higher weight to samples obtained through recent failures compared to older samples. Each iteration was limited to 2,000 epochs. In practice, our experiments never reached this limit. No iteration of our experiments with
LLVM and x86 exceeded 140 epochs (and none exceeded 100 epochs when excluding the first iteration). For each test input we generated 5 possible translations using beam search. We stopped each experiment when it had successfully translated over 95% of the test statements, or when no progress was made for the last 10 iterations. Recall that the validation set is periodically translated during training and used to evaluate training progress.
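The experiment-level stopping rule just described (a 95% success target, or 10 iterations without progress) can be sketched as follows (hypothetical function; it receives the cumulative number of successfully decompiled test samples after each iteration):

```python
def should_stop(success_counts, total, target=0.95, patience=10):
    """Decide whether the iterative decompilation loop should terminate:
    either enough of the test set was decompiled, or no progress was made
    over the last `patience` iterations (counts are cumulative successes)."""
    if not success_counts:
        return False
    if success_counts[-1] / total >= target:
        return True            # sufficient results
    recent = success_counts[-(patience + 1):]
    if len(recent) == patience + 1 and recent[-1] == recent[0]:
        return True            # stagnation: no progress in `patience` rounds
    return False

# 1,910 of 2,000 test statements decompiled -> 95.5%, stop
```

The same shape of rule, with an added iteration cap, implements the three termination conditions listed earlier for the framework as a whole.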
TraFix is capable of stopping a training session early (before the epoch limit is reached) if no progress was observed in the last k consecutive validation sessions. Intuitively, this process detects when the model has reached a stable state close enough to the optimal state that can be reached on the current training set. In our experiments a validation session is triggered after processing 1,000 batches of training samples (each batch containing 32 samples), and k was set to 10. All training sessions were stopped early, before reaching the epoch limit. The NMT model consists of a single layer each for the encoder and decoder. Each layer consists of 100 nodes, and the word embedding size was set to 300. We ran our experiments on Amazon AWS instances. Each instance is of type r5a.2xlarge, a Linux machine with 8 Intel Xeon Platinum 8175M processors, each operating at 2.5GHz, and 64GiB of RAM, running Ubuntu 16.04 with GCC [1] version 5.4.0 and Clang [4] version 3.8.0. We executed our experiments as a single process using only a single CPU, without utilizing a GPU, in order to mimic the scenario of running the decompiler on an end-user's machine. This configuration highlights the applicability of our approach, such that it can be used by many users without requiring specialized hardware.

As a measure of problem complexity, we first evaluated our decompiler on several different subsets of C using only a single iteration. The purpose of these measurements is to estimate how difficult a specific grammar is going to be for our decompiler. We used 8 different grammars for these measurements. Each grammar builds upon the previous one, meaning that grammar i contains everything in grammar i−1 and adds a new grammar feature (the only exception is grammar 4, which does not contain unary operators). The grammars are:
1. Only assignments of numbers to variables
2. Assignments of variables to variables
3. Computations involving unary operators
4. Computations involving binary operators

Figure 8. Success rate of the x86 decompiler after a single iteration on various grammars, with compiler optimizations enabled and disabled
5. Computations involving both operator types
6. If branches
7. While loops
8. Nested branches and loops

Fig. 8 shows the success rate, i.e. the percentage of successfully decompiled inputs, for the different grammars, when decompiling x86 assembly with and without compiler optimizations. Note that the measured success rates are after only a single iteration of our decompilation algorithm (Section 4.1). As can be expected, the success rate drops as the complexity of the grammar increases. That means that for more complicated grammars, our decompiler will require more iterations and/or more training data to reach the same performance level as on simpler grammars. As can also be expected, and as can be observed from the figure, decompiling optimized code is a slightly more difficult problem for our decompiler compared to unoptimized code. Although optimizations reduce our success rate by a few percent (at most 5% in our experiments), it seems that the decisive factor for the hardness of the decompilation problem is grammar complexity, not optimizations. Recall that, given a compiler, our framework learns the inverse of that compiler. That means that, in the eyes of the decompiler, optimizations are "transparent". Optimizations only cause the decompiler to learn more complex patterns than it would have learned without optimizations, but don't increase the number of patterns learned nor the vocabulary handled. Grammar complexity, on the other hand, increases both the number and complexity of the patterns the decompiler needs to learn and handle, and the vocabulary size, thus making the decompilation task much harder to learn. We emphasize that enabling/disabling compiler optimizations in our framework required no changes to the
framework. The only change necessary was adding the appropriate flags in the compiler interface.

Table 2. Statistics of iterative experiments of LLVM IR.

In our second set of experiments we allowed each experiment to execute iteratively, to observe the effects of multiple iterations on our decompilation success rates. We implemented and evaluated 2 instances of our framework: from
LLVM
IR to C, and from x86 assembly to C. We ran each experiment 5 times using the configuration described in Section 6.3. We allowed each experiment to run until it reached either the target success rate (95%) or the iteration limit. The results reported below are averaged over all experiments.

Decompiling
LLVM IR. Out of the 5 experiments we conducted using our LLVM IR instance, 3 reached the goal of a 95% success rate after a single iteration. The other 2 experiments required one additional iteration to reach that goal. Table 2 reports average statistics for these two iterations. The columns epochs, train time and translate time report averages for each iteration (i.e. averages of measurements from 5 experiments for the 1st iteration and from only 2 experiments for the 2nd iteration). The successful translations column reports the overall success rate, not just the successes in that specific iteration. The statistics in the table demonstrate that our LLVM decompiler performed exceptionally well, even though it was decompiling optimized code snippets (which are traditionally considered harder to handle). On average, our
LLVM experiments successfully decompiled 97% of the benchmarks before autonomously terminating. These include benchmarks of up to 845 input tokens. We intentionally set the goal lower than 100%. Setting it higher and allowing our instances to run for further iterations would take longer, but would also lead to a higher overall success rate. The timing measurements reported in the table highlight that the majority of execution time is spent on training the NMT model. Translation is very fast, taking only a few seconds per input, as witnessed by the first iteration. The execution time of our translation evaluation (including parsing each translation into a PDG, comparing with the input PDG, and attempting to fill the templates correlating to the translations) is extremely low, taking only a couple of minutes for the entire set of benchmarks. These observations are important due to the expected operating scenario of our decompiler. We expect the majority of inputs to be resolved using a previously trained model. Retraining an NMT model should be done only when the language grammar is extended or when significantly difficult inputs are provided. Thus, in normal operation, the execution time of the decompiler, consisting of only translation and evaluation, will be mere seconds.

Table 3. Statistics of iterative experiments of x86 assembly.

Figure 9. Cumulative success rate of the x86 decompiler as a function of the number of iterations performed.
Decompiling x86 Assembly. Table 3 provides statistics of our x86 experiments. All of these experiments terminated when they reached the iteration limit. Fig. 9 visualizes the successful translations column. The figure plots our average success rate as a function of the number of completed iterations. It is evident that with each iteration the success rate increases. Overall, our decompiler successfully handled samples of up to 668 input tokens. Our decompilation success rates on x86 were lower than those on LLVM, terminating at around 88%. This correlates with the nature of x86 assembly, which has a smaller vocabulary than that of
LLVM
IR. The smaller vocabulary shortens overall training times, but also results in longer dependencies and meaningful patterns that are harder to deduce and learn. We note that, in the case of a traditional decompiler, bridging the remaining failure rate would require a team of developers crafting additional rules and patterns. Using our technique, this can be achieved by allowing the decompiler to train longer and on more training data.
Manual examination of our results from Section 6.4 revealed that currently our main limitation is input length. There was no threshold such that inputs longer than it would definitely fail. We observed both successful and failed long inputs, often of the same length. We did, however, observe a correlation between input length and a reduced success rate. As the length of an input increases, it becomes more likely to fail. We found no other outstanding distinguishing features, in the code structure or the vocabulary used, that we could claim are a consistent cause of failures. This limitation stems from the NMT model we used. Long inputs are a known challenge for existing NMT systems [26]. NMT for natural languages is usually limited to roughly 60 words [26]. Due to the nature of code (i.e. limited vocabulary, limited structure) we can handle inputs much longer than typical natural language sentences (668 words for x86 and 845 words for
LLVM). Regardless, this challenge also applies to us, resulting in poorer results when handling longer inputs. As the field of NMT evolves to better handle long inputs, so will our results improve. To verify that this limitation is not due to our specific implementation, we created another variant of our framework. This new variant is based on
TensorFlow [5, 8] rather than DyNet. Experimenting with this variant, we got similar results to those reported in Section 6.4, and ultimately reached the same conclusion: the observed limitation on input length is inherent to using NMT.
Though we do not consider this a limitation, another aspect that could be improved is our template-filling phase. Our manual analysis identified some possibilities for improving our second phase, the template-filling phase (Section 5). The first type of failure we have observed is the result of constant folding, a compiler optimization that replaces computations involving only constants with their results. Fig. 10 demonstrates this kind of failure. Given the C code in Fig. 10a, the compiler determines that the product of its two constants can be computed at compile time. Therefore, the x86 assembly in Fig. 10b contains the single folded constant 315. Using the code of Fig. 10b as input, our decompiler suggests the C code in Fig. 10c.

(a) High-level code

movl X1 , %eax
imull 315 , %eax , %eax
movl %eax , X3
(b) Low-level code

X3 = ( X1 * 43 ) * 70 ;
(c) Suggested decompilation

Figure 10. Example of decompilation failure

Note that the decompiler suggested code that is identical in structure to the input. The first phase of our decompiler handled this example correctly, resulting in a matching code template. The failure occurred in the second phase, in which we were unable to find the appropriate numerical values. This failure occurs because our current implementation attempts to find a value for each number independently of the other numbers in the code. Essentially, this resulted in floating-point numbers, which were deemed unacceptable by the decompiler because our benchmarks use only integers. This kind of failure can be mitigated by either (1) applying constant folding to the high-level decompiled code, (2) allowing the template to be filled with floating-point numbers (which was disabled since the benchmarks contained only integers), or (3) encoding the code as constraints and using a theorem prover to find appropriate assignments to the constants.

A similar example is found in Fig. 11. We left the suggested translation in this example as constants to simplify the example. One can see that the suggested translation in Fig. 11b is structurally identical to the expected output in Fig. 11a, up to the addition of N11.

X2 = ((X0 % 40) * 63) / ((98 - X1) - X0);
(a) High-level code

X2 = ((X0 % N3) * N13) / (((N2 - X1) + N11) - X0);
(b) Suggested decompilation

Figure 11. Failure due to a redundant number

This example was not considered a matching code template by our implementation, because any value for N11 other than 0 results in a different computation structure. However, if N11 = 0, we get an exact match between the suggested translation and the expected output. Using a theorem-prover-based template-filling algorithm could detect that and assign the appropriate values to the constants, including N11, resulting in equivalent code.

X2 = 48 + (X5 * (X14 * 66));
(a) High-level code

X2 = ((N8 * X14) * X5) - N4;
(b) Suggested decompilation

Figure 12. Failure due to an incorrect operator

Fig. 12 shows another kind of failure. In this example the difference between the expected output and the suggested translation is a + that was replaced with a −. Currently only variable names and numeric constants are treated as template parameters. This kind of difference can be overcome by considering operators as template parameters as well. Since the number of options for each operator type (unary, binary) is extremely small, we could try all options for filling these template parameters.

There are a few tradeoffs that should be taken into account when using our decompilation framework:
• Iterations limit: applying an iterations limit allows one to trade off decompilation success rate for a shortened decompilation time, and would make sense in environments with limited resources (time, budget, etc.). On the other hand, setting the limit too low will prevent the decompiler from reaching its full potential and will result in a low successful-translation rate.
• Training set size: in our experiments we initialized the training set to 10,000 random samples and generated an additional 5,000 new random samples each iteration. As the training set size increases, so do the training time and memory consumption. Using too many initial training samples would be wasteful in the case of relatively simple test samples, for which a shorter training session, with fewer training samples, might suffice. On the other hand, using too few samples would result in many training sessions when dealing with harder test samples. This is also applicable when setting the number of random samples added at each iteration. Furthermore, rather than always generating a constant number of samples, one can dynamically decide the number of samples to generate based on some measure of progress (i.e.
generate fewer samples when progressing at a higher rate).
• Patience: the patience parameter determines how many iterations to wait before terminating due to not observing any progress. Setting this parameter too high would result in wasted time, because any training performed since the last time we observed progress would essentially have been in vain. On the other hand, it is possible for the model to make no progress for a few iterations, only to resume progressing once it generates the training samples it needed. Setting the patience parameter too low might cause the decompiler to stop before it can reach its full potential.

As mentioned in Section 1, traditional decompilers rely heavily on pattern matching. Development of such decompilers depends on hand-crafted rules and patterns, designed by experts to detect specific control-flow structures. Hand-crafting rules is slow, expensive and cumbersome. We observe that the successful decompilations produced by our decompiler can be re-templatized to form rules that can be used by traditional decompilers, thus simplifying traditional decompiler development. Appendix A provides examples of such rules.
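The re-templatization idea might be sketched as follows (a toy with hypothetical names): abstract a successful (low-level, high-level) pair back into placeholder form, and store the result as a lookup rule that a traditional pattern-matching decompiler could apply to structurally identical inputs:

```python
import re

def templatize(code):
    """Abstract numbers to indexed placeholders so a concrete result
    generalizes to a reusable pattern (toy version of the paper's
    canonicalization; variable and operator handling omitted)."""
    counter = [0]
    def repl(_m):
        counter[0] += 1
        return f"N{counter[0]}"
    return re.sub(r"\b\d+\b", repl, code)

class RuleBook:
    """Rules harvested from successful neural decompilations: a map from
    a low-level code template to its high-level counterpart."""
    def __init__(self):
        self.rules = {}

    def learn(self, low, high):
        self.rules[templatize(low)] = templatize(high)

    def lookup(self, low):
        return self.rules.get(templatize(low))

book = RuleBook()
book.learn("movl $5 , %eax ; movl %eax , X0", "X0 = 5 ;")
# a structurally identical input now matches the learned rule
hit = book.lookup("movl $7 , %eax ; movl %eax , X0")
```

The lookup returns the high-level template; a rule-based decompiler would then fill its placeholders from the concrete input, much like our own second phase.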
Measuring the readability of our translations requires a user study, which we did not perform. However, note that given some training set, a model trained on that set will generate code that is similar to what it was trained on. Thus, the readability of our translations stems from the readability of our training samples: our translations are as readable as the training samples we generated. This was also verified by an empirical sampling of our results. Therefore, given readable code as training samples, we can surmise that any decompiled code we generate and output will also be readable.
Decompilation
The Hex-Rays decompiler [2] was considered the state of the art in decompilation, and is still considered the de-facto industry standard. Schwartz et al. [34] presented the Phoenix decompiler, which improved upon Hex-Rays using new analysis techniques and iterative refinement, but was still unable to guarantee goto-free code (since goto instructions are rarely used in practice, they should not be part of the decompiler output). Yakdan et al. [36, 37] introduced Dream, and its predecessor Dream++, taking a significant step forward by guaranteeing goto-free code. RetDec [7], short for
Retargetable Decompiler, is an open-source decompiler released in December 2017 by Avast, aiming to be the first "generic" decompiler capable of supporting many architectures, languages, ABIs, etc. While previous work made significant improvements to decompilation, all of it falls under the title of rule-based decompilers. Rule-based decompilers require manually written rules and patterns to detect known control-flow structures. These rules are very hard to develop, prone to errors, and usually capture only part of the known control-flow structures. According to data published by Avast, it took a team of 24 developers 7 years to develop RetDec. This data emphasizes that traditional decompiler development is extremely difficult and time consuming, supporting our claim that the future of decompilers lies in approaches that can avoid this step. Our technique removes the burden of rule writing from the developer, replacing it with an automatic, neural network based approach that can autonomously extract relevant patterns from the data. Katz et al. [23] suggested the first technique to use NMT for decompilation. While they set out to solve the same problem, in practice they provide a solution to a different and significantly easier problem: producing source-level code that is readable, without any guarantees of equivalence, semantic or even syntactic. Further, the code they generate is not guaranteed to compile (and does not in practice). Because their code does not compile, nor is it equivalent, if we apply our evaluation criteria to their results, their accuracy would be at most 3.8%. Further, beyond the cardinal difference in the problem itself, they have the following limitations:
• They can only operate on code compiled with a special version of Clang which they modified for their purposes.
• All of their benchmarks are compiled without optimizations. We apply the compiler's default optimizations to all of our benchmarks.
• They limit their input to 112 tokens and output to 88 tokens. This limits their input to single statements. We successfully decompiled x86 benchmarks of up to input tokens and output tokens. Each of our samples contains several statements.

• Their methodology is flawed as they do not control for overlaps between the training and test sets. We verify that there is no such overlap in our sets.
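The last point above can be enforced mechanically. A minimal sketch of such a check (hypothetical helper names; assuming each sample is a plain token string, as in our assembly and C token sequences):

```python
# Sketch of a train/test overlap check. Samples are compared after
# whitespace normalization, so trivially reformatted duplicates are
# also detected as leakage from training into evaluation.

def canonicalize(sample: str) -> str:
    """Normalize whitespace so equivalent samples compare equal."""
    return " ".join(sample.split())

def assert_no_overlap(train, test):
    """Raise if any test sample also appears in the training set."""
    seen = {canonicalize(s) for s in train}
    leaked = [s for s in test if canonicalize(s) in seen]
    if leaked:
        raise ValueError("%d test samples leaked from training" % len(leaked))

# Example: a reformatted duplicate is still detected as overlap.
train = ["movl X , eax ; addl N , eax ;"]
test = ["movl X , eax ;  addl N , eax ;"]  # same sample, extra space
try:
    assert_no_overlap(train, test)
except ValueError as e:
    print(e)  # 1 test samples leaked from training
```

The canonicalization step matters: without it, samples that differ only in formatting would silently pass the check.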
Modeling Source Code
Modeling source code using various statistical models has seen a lot of interest for various applications. Iyer et al. [21] used LSTMs to generate natural language descriptions for C# source code. Loyola et al. [28] translated commits to source code repositories into commit messages describing them. The success presented by these papers highlights that neural networks are useful for summarizing code, and supports the use of neural networks for decompilation.

Another application of source code modeling is predicting names for variables, methods, and classes. Raychev et al. [33] used conditional random fields (CRFs) to predict variable names in obfuscated JavaScript code. He et al. [17] also used CRFs, but for the purpose of predicting debug information in stripped binaries, focusing on the names and types of variables. Allamanis et al. [9] used neural language models to predict variable, method, and class names, relying on word embeddings to determine semantically similar names. We consider this problem orthogonal to our own: given semantically equivalent source code produced by our decompiler, these techniques could be used to supplement it with variable names, etc.

Chen et al. [15] used neural networks to translate code between high-level programming languages. This problem resembles that of decompilation, but is in fact simpler. Translating low-level languages to high-level languages, as we do, is more challenging: the similarities between high-level languages are more prevalent than those between high-level and low-level languages. Furthermore, translating source code to source code directly bypasses many challenges added by compilation and optimizations.

Levy et al. [27] used neural networks to predict the alignment between source code and compiled object code. Their results could be useful in improving our second phase, i.e., filling the template and correcting errors. Specifically, their alignment prediction can be utilized to pinpoint locations in the source code that lead to errors.

Katz et al.
[24, 25] used statistical language models for modeling binary code and aspects of program structure. Based on a combination of static analysis and simple statistical language models, they predict targets of virtual function calls [24] and inheritance relations between types [25]. Their work further highlights that these techniques can deduce high-level information from the low-level representation in binaries.

Conclusion

We address the problem of decompilation: converting low-level code to high-level, human-readable source code. Decompilation is extremely useful to security researchers, as the cost of finding vulnerabilities and understanding malware drops drastically when source code is available.

A major problem with traditional decompilers is that they are rule-based: experts are needed to hand-craft the rules and patterns used for detecting control-flow structures and idioms in low-level code and lifting them to source level. As a result, decompiler development is very costly.

We presented a new approach to the decompilation problem. We base our decompiler framework on neural machine translation. Given a compiler, our framework automatically learns a decompiler from it. We implemented instances of our framework for decompiling
LLVM IR and x86 assembly to C. We evaluated these instances on randomly generated inputs with high success rates.

References

[1] 1987. GCC, the GNU Compiler Collection. https://gcc.gnu.org/.
[2] 1998. The IDA Pro disassembler and debugger.
[3] 2002. The LLVM compiler infrastructure project. http://llvm.org.
[4] 2007. clang: a C language family frontend for LLVM. https://clang.llvm.org/.
[5] 2015. TensorFlow.
[6] 2017. DyNMT, a DyNet based neural machine translation. https://github.com/roeeaharoni/dynmt-py.
[7] 2017. Retargetable Decompiler. https://retdec.com/.
[8] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. (2016).
[9] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015. Suggesting Accurate Method and Class Names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering.
[10] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015. Suggesting Accurate Method and Class Names. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering.
[11] Miltiadis Allamanis, Hao Peng, and Charles A. Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. CoRR (2016). http://arxiv.org/abs/1602.03001
[12] Miltiadis Allamanis, Daniel Tarlow, Andrew D. Gordon, and Yi Wei. 2015. Bimodal Modelling of Source Code and Natural Language. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37.
[13] Matthew Amodio, Swarat Chaudhuri, and Thomas W. Reps. 2017. Neural Attribute Machines for Program Generation. CoRR (2017). http://arxiv.org/abs/1705.09231
[14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR (2014).
[15] Xinyun Chen, Chang Liu, and Dawn Song. 2018. Tree-to-tree Neural Networks for Program Translation. CoRR (2018).
[16] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. CoRR (2014).
[17] Jingxuan He, Pesho Ivanov, Petar Tsankov, Veselin Raychev, and Martin Vechev. 2018. Debin: Predicting Debug Information in Stripped Binaries. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security.
[18] Matthew Henderson, Blaise Thomson, and Steve J. Young. 2014. Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation. IEEE Spoken Language Technology Workshop (2014).
[19] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation (1997).
[20] Xing Hu, Yuhan Wei, Ge Li, and Zhi Jin. 2017. CodeSum: Translate Program Language to Natural Language. CoRR (2017). http://arxiv.org/abs/1708.01837
[21] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing Source Code using a Neural Attention Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[22] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Continuous Translation Models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
[23] D. S. Katz, J. Ruchti, and E. Schulte. 2018. Using recurrent neural networks for decompilation. In IEEE 25th International Conference on Software Analysis, Evolution and Reengineering.
[24] Omer Katz, Ran El-Yaniv, and Eran Yahav. 2016. Estimating Types in Binaries Using Predictive Modeling. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.
[25] Omer Katz, Noam Rinetzky, and Eran Yahav. 2018. Statistical Reconstruction of Class Hierarchies in Binaries. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems.
[26] Philipp Koehn and Rebecca Knowles. 2017. Six Challenges for Neural Machine Translation. (2017).
[27] Dor Levy and Lior Wolf. 2017. Learning to Align the Source Code to the Compiled Object Code. In Proceedings of the 34th International Conference on Machine Learning.
[28] Pablo Loyola, Edison Marrese-Taylor, and Yutaka Matsuo. 2017. A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes. CoRR (2017). http://arxiv.org/abs/1704.04856
[29] Chris J. Maddison and Daniel Tarlow. 2014. Structured Generative Models of Natural Source Code. CoRR (2014). http://arxiv.org/abs/1401.0514
[30] Graham Neubig. 2017. Neural Machine Translation and Sequence-to-sequence Models: A Tutorial. CoRR (2017).
[31] Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The Dynamic Neural Network Toolkit. arXiv preprint arXiv:1701.03980 (2017).
[32] Veselin Raychev, Martin Vechev, and Andreas Krause. 2015. Predicting Program Properties from "Big Code". In Proceedings of the 42nd Annual Symposium on Principles of Programming Languages (POPL '15).
[33] Veselin Raychev, Martin Vechev, and Andreas Krause. 2015. Predicting Program Properties from "Big Code". In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.
[34] Edward J. Schwartz, JongHyup Lee, Maverick Woo, and David Brumley. 2013. Native x86 Decompilation Using Semantics-preserving Structural Analysis and Iterative Control-flow Structuring. In Proceedings of the 22nd USENIX Conference on Security.
[35] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems.
[36] K. Yakdan, S. Dechand, E. Gerhards-Padilla, and M. Smith. 2016. Helping Johnny to Analyze Malware: A Usability-Optimized Decompiler and Malware Analysis User Study. In IEEE Symposium on Security and Privacy (SP).
[37] Khaled Yakdan, Sebastian Eschweiler, Elmar Gerhards-Padilla, and Matthew Smith. 2015. No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantic-Preserving Transformations.

Extracting Decompilation Rules
Table 4 contains examples of decompilation rules extracted from our decompiler. For brevity, we present mostly relatively simple rules, but longer and more complicated rules were also found by our decompiler (examples of such rules are found at the bottom of the table, below the separating line).

input | output
movl X , eax ; addl N , eax ; movl eax , X ; | X = N + X ;
movl X , eax ; subl N , eax ; movl eax , X ; | X = X - N ;
movl X , eax ; imull N , eax , eax ; movl eax , X ; | X = X * N ;
movl X , ecx ; movl N , eax ; idivl ecx ; movl eax , X ; | X = N / X ;
movl X , eax ; movl X , ecx ; idivl ecx ; movl eax , X ; | X = X / X ;
movl X , eax ; sall N , eax ; movl eax , X ; | X = X * N ;
movl X , ecx ; movl N , eax ; idivl ecx ; movl edx , eax ; movl eax , X ; | X = N % X ;
movl X , eax ; movl X , ecx ; idivl ecx ; movl edx , eax ; movl eax , X ; | X = X % X ;
movl X , eax ; leal 1 ( eax ) , edx ; movl edx , X ; movl eax , X ; | X = X ++ ;
movl X , eax ; leal -1 ( eax ) , edx ; movl edx , X ; movl eax , X ; | X = X -- ;
movl X , eax ; addl 1 , eax ; movl eax , X ; movl X , eax ; movl eax , X ; | X = ++ X ;
movl X , eax ; imull N , eax , eax ; addl N , eax ; movl eax , X ; | X = N + ( N * X ) ;
movl X , eax ; addl N , eax ; sall N , eax ; movl eax , X ; | X = ( X + N ) * N ;
movl X , eax ; imull N , eax , ecx ; movl N , eax ; idivl ecx ; movl eax , X ; | X = N / ( X * N ) ;
movl X , eax ; cmpl N , eax ; jg .L0 ; movl N , X ; .L0: ; | if ( X < ( N + 1 ) ){ X = N ; }
jmp .L1 ; .L0: ; movl N , X ; .L1: ; movl X , eax ; cmpl N , eax ; jg .L0 ; | while ( X > N ){ X = N ; }
jmp .L1 ; .L0: ; movl N , X ; .L1: ; movl X , eax ; cmpl N , eax ; jne .L0 ; | while ( N != X ){ X = N ; }
movl X , eax ; cmpl N , eax ; jne .L0 ; movl N , X ; movl X , eax ; movl eax , X ; .L0: ; | if ( N == X ){ X = N ; X = X ; }
movl X , edx ; movl X , eax ; cmpl eax , edx ; jg .L0 ; movl N , X ; jmp .L1 ; .L0: ; movl N , X ; .L1: ; | if ( X <= X ){ X = N ; } else { X = N ; }
jmp .L1 ; .L0: ; movl X , eax ; addl N , eax ; movl eax , X ; .L1: ; movl X , eax ; cmpl N , eax ; jle .L0 ; | while ( X <= N ){ X = N + X ; }
---
jmp .L1 ; .L0: ; movl X , eax ; addl 1 , eax ; movl eax , X ; movl X , edx ; movl X , eax ; addl edx , ea... | while ( ( X - N ) > ( X % ( X - N ) ) ){ X = ( ++ X ) + X ; ...
movl X , eax ; addl 1 , eax ; movl eax , X ; movl X , edx ; movl X , eax ; movl N , ecx ; subl eax , ecx... | if ( ++ X == ( ( ( X * ( N - X ) ) - N ) * ( N - X ) ) ){ X = ...
movl X , edx ; movl X , eax ; addl edx , eax ; movl X , ecx ; movl X , edx ; addl edx , ecx ; idivl ecx ; ... | X = X * ( ( X + X ) % ( X + X ) ) ; X = ( X + X ) / ( ( N - ...
movl N , X ; movl X , eax ; movl eax , X ; movl X , eax ; movl X , edx ; addl N , edx ; subl edx , ea... | X = N ; X = X ; if ( ( N + ( X - ( X + N ) ) ) <= X ){ X ...
jmp .L1 ; .L0: ; movl X , ebx ; movl N , eax ; idivl ebx ; movl eax , X ; .L1: ; movl X , edx ; movl X ... | while ( ( X * X ) >= ( N % ( X + N ) ) ){ X = N / X ; } ; X ...

Table 4. Decompilation rules extracted from our decompiler.
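The rules above are templates: X abstracts a variable slot and N an immediate constant. As a rough illustration of how such a template can be matched against a concrete instruction sequence and its output instantiated, a minimal sketch (hypothetical helpers `match_rule` and `instantiate`, assuming whitespace-tokenized strings; this is not the decompiler's actual machinery):

```python
def match_rule(template, concrete):
    """Unify a rule template with a concrete token sequence.
    X and N placeholders bind on first occurrence and must repeat
    consistently; all other tokens must match exactly.
    Returns a bindings dict, or None on mismatch."""
    t_toks, c_toks = template.split(), concrete.split()
    if len(t_toks) != len(c_toks):
        return None
    env = {}
    for t, c in zip(t_toks, c_toks):
        if t in ("X", "N"):
            if env.setdefault(t, c) != c:  # placeholder must bind consistently
                return None
        elif t != c:
            return None
    return env

def instantiate(output, env):
    """Substitute the captured bindings into the C-level output template."""
    return " ".join(env.get(t, t) for t in output.split())

# First rule of the table, applied to a concrete x86 sequence:
rule_in = "movl X , eax ; addl N , eax ; movl eax , X ;"
rule_out = "X = N + X ;"
env = match_rule(rule_in, "movl a , eax ; addl 5 , eax ; movl eax , a ;")
print(instantiate(rule_out, env))  # a = 5 + a ;
```

Note that single-binding placeholders are a simplification: some rules in the table (e.g. `X = X / X ;`) use the same placeholder letter for distinct slots, which would require positional rather than named binding.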