In Nomine Function: Naming Functions in Stripped Binaries with Neural Networks

Fiorella Artuso, Giuseppe Antonio Di Luna, Luca Massarelli, and Leonardo Querzoni
CINI, Italy. [email protected]
Sapienza, University of Rome, Italy. {diluna,massarelli,querzoni}@diag.uniroma1.it

Abstract.
In this paper we investigate the problem of automatically naming pieces of assembly code. By naming we mean assigning to an assembly function a string of words that would likely be assigned by a human reverse engineer. We formally and precisely define the framework in which our investigation takes place: we define the problem and we provide reasonable justifications for the choices made in the design of training and testing. We performed our analysis on a large real-world corpus of nearly 9 million functions taken from more than 22k software packages. Within this framework we test baselines coming from the field of Natural Language Processing (e.g., Seq2Seq networks and the Transformer). Interestingly, our evaluation shows promising results, beating the state of the art and reaching good performance. We also investigate the applicability of fine-tuning (i.e., taking a model already trained on a large generic corpus and retraining it for a specific task), a popular and well-known technique in the NLP field. Our results confirm that fine-tuning is effective even when neural networks are applied to binaries: a model pre-trained on the aforementioned corpus achieves higher performance on specific domains (such as predicting names in system utilities, malware, etc.) when fine-tuned on them.
Keywords:
Reverse engineering · Function naming · Binary analysis · Dataset.
1 Introduction

The last few years have witnessed a growing trend consisting in the application of machine learning (ML) and natural language processing (NLP) techniques to the code analysis field, as illustrated in [14]. In fact, the vast and increasing amount of high-quality software available through open source repositories such as GitHub has given the chance to leverage large amounts of source code as ground truth for building statistical models of code. The design choice of using NLP to build such models is motivated by the naturalness hypothesis, which underlines the similarities between programming languages and human languages. According to this hypothesis, software is a form of human communication with statistical properties similar to those of natural language, and these properties can be exploited to build better software engineering tools [14]. The practice of applying ML and NLP techniques to code turned out to be very helpful and effective in many tasks, such as predicting program bugs [7], predicting identifier names [19], translating code between programming languages [6], etc. Thus, the success of the application of NLP techniques to source code has led to investigating the possible use of such techniques also in the context of binary code analysis: learning function signatures [23], identifying similar functions [10,16,22], recovering the compiler that generated a given binary [15], just to cite a few.

Following this research line, in this paper we investigate the feasibility of using similar techniques to predict the name of functions in stripped binary programs. The latter are binary executable files that only contain low-level information such as instructions, registers and memory addresses, but no debug symbols, since these are not directly necessary for program execution. Debug symbols are generated by compilers on the basis of the source code and typically include information about functions and variables, such as name, location, type and size, which are helpful for debugging and security analysis of a binary. Being non-essential for software execution, symbols are often removed from a program after compilation, increasing the complexity of reverse engineering the software.

Reconstructing symbols in a binary program can be a very useful feature for all those fields where reverse engineering code plays a crucial role, e.g., malware analysis. Usually, after having disassembled a malware, the reverse engineer starts analyzing the set of assembly instructions of the program looking for specific functions (e.g., encryption or network) that might reveal the malicious nature of the software sample under investigation. This task can be daunting, especially when the binary code is stripped and original function names are not present. In this case, it could be very helpful to have explanatory names for such functions, as they would save a lot of effort for the reverser, whose work could be guided and supported by clear hints about the content of each function.

Recently, a few works proposed solutions to this problem. For example, [11] and [8] have shown promising results in predicting function names. However, the problem at hand is far from being solved, as existing solutions work only under strong assumptions, like the presence of symbolic calls to dynamically linked libraries in each function [8], or a closed set of possible assignable names to predict from [11].
Furthermore, existing solutions have only been evaluated on small datasets that contain binary programs with little variance (less than 1k software packages), and these datasets have not been made publicly available, hindering the possibility to compare existing solutions. Starting from these issues, in this paper we propose the following contributions:

– Dataset. We created a new dataset, namely UbuntuDataset, composed of a large number of real-world binaries, spanning several heterogeneous application fields: networking, databases, video games, system utilities, C libraries, etc. Our dataset contains nearly 9 million functions from 22k distinct software packages. The dataset is made available online to the community for testing and evaluation.

– Competitive DNN solutions. Considering the function naming problem, we train and test two Deep Neural Network architectures: Seq2Seq and Transformer. We achieve an f1-score of 0.122 for Seq2Seq and of 0.230 for the Transformer. We compare our architectures with previous solutions on a subset of our test set, showing an improvement with respect to the state of the art.

– Fine-tuning. We show that a model trained on our UbuntuDataset has a drastic performance improvement when fine-tuned for a specific domain. Fine-tuning [9] is a popular technique in NLP, in which models trained on huge general-purpose datasets are re-trained on a smaller dataset specialised for a certain sub-task. In this paper we show that fine-tuning works also when applied to models for binary code: we found a marked improvement of our model when fine-tuned to name applications of a specific domain (such as system utilities). In particular, we show that a model fine-tuned on busybox has a relative improvement of 22% when tested on coreutils with respect to a pre-trained model with no fine-tuning. We also show that fine-tuning helps the network achieve better performance when naming functions across different optimisations. That is, we fine-tune our network on a certain set of packages compiled with optimisation flag O0, and then we show that this leads to a two-fold improvement when naming the same packages compiled with O1, O2, and O3. We also compare all our fine-tuned models to models that are trained from scratch only on the specific fine-tuning dataset. As expected, these latter models have lower performance. This shows that fine-tuning is able to exploit and refine the knowledge learnt on large general datasets of binaries, while the specific training sets used for this purpose do not contain enough information to learn from scratch the correct relationships between names and assembly code.

– Test on Malware. We test our solution on real-world Linux malware in a quantitative and qualitative way. Also in this case we test performance with and without fine-tuning, showing a 5-fold performance improvement in the former case.

After this introduction, Section 2 discusses the state of the art, Section 3 introduces the function naming problem, Section 4 describes the UbuntuDataset used for the evaluation, and Section 5 details the proposed solutions. Section 6 reports the experimental results on goodware, and Section 7 reports our tests on malware. Finally, Section 8 concludes the paper.
2 Related Work

There are several works on predicting names for variables, functions and objects using statistical learning.
A body of works has explored the possibility of predicting variable, function and object names within code expressed in high-level programming languages (such as Java, JavaScript, C and similar), the majority of which leverage deep learning techniques. In [2] a word2vec-like approach is used to learn a probability distribution over class and method names. The model can then be used to suggest probable names for classes, methods and variables in Java source code. In [3] convolutional neural networks are used to create extremely summarised descriptions of Java functions that resemble function names. Code2seq [4] proposes an encoder-decoder strategy on the Abstract Syntax Tree (AST) of functions to obtain explanatory sentences, so-called "code captions".
Predicting debugging symbols, including function names, through ML is a rather new field of research with few contributions. The most notable work in this area is DEBIN, proposed by He et al. [11]. It uses conditional random fields to predict debug symbols; similarly to [20], it also predicts function names, variable names and types. Differently from our work, DEBIN is only able to assign function names from a predetermined closed set, i.e., it cannot generalise to new names. The dataset used in [11] to evaluate DEBIN is composed of 9k executables from 830 Linux packages.

DEBIN's limitation is surpassed by a recent pre-print, NERO [8], which models the problem of predicting function names as a Neural Machine Translation (NMT) task where each function is represented by "call site sequences": each call site is an encoding of a function call containing the name of the called function and information on the parameters. Unfortunately, it is not clear how NERO copes with functions that do not use calls, since it seems that no feature is extracted in this case. Additionally, by taking into consideration only calls to external libraries, their method is likely to miss the difference between functions that rely on the same set of API calls but are constituted by markedly different patterns of instructions (e.g., encryption functions and sorting functions implemented from scratch). Their source code and dataset have not been released yet (the pre-print arXiv version, https://arxiv.org/abs/1902.09122v2, mentions a publicly available dataset, but no link to the dataset is present) and their evaluation is done on a rather small set of functions (less than 15k functions).

Finally, DIRE [13] proposes a probabilistic technique for variable name recovery that uses both lexical and structural information recovered by using a decompiler. DIRE uses the decompiler's internal AST representation, which encodes additional structural information, to train a neural network based on an encoder-decoder structure, where the encoder is composed of a bidirectional LSTM and a graph encoder, and the decoder is a standard LSTM. Their dataset and test set are considerably smaller than ours. We remark that recovering variable names is a different task than ours, and that function names are used in DIRE to help in predicting the names of variables.

3 The Function Naming Problem

Given a fragment of binary code b representing a functional unit of code (for simplicity, and without loss of generality, we assume it corresponds to a function) in a compiled software, we want to output a string s. Such string s has to represent a "meaningful" name for function b, that is, a name that captures its semantics and its role inside the software. The above problem is extremely challenging and, due to its nature, cannot be defined more precisely without incurring in complex reasoning about what the "semantics" of code is. Fortunately, statistical learning methods are especially suitable for problems where the definition itself is fuzzy. As customary in statistical approaches, we will try to learn a probability distribution that assigns to each function b the most probable output string s, using a large dataset of assembly functions with semantically expressive names. Our investigation of the function naming problem is based on a set of simplifying assumptions that make the training and testing phases tractable. In the following we precisely state all our assumptions, providing the reasoning behind our design, and we describe the main challenges of the problem.
3.2 Training

Statistical methods, especially those based on neural networks, are effective when trained on a suitable dataset of relatively large size (millions of functions). In a dataset that counts millions of functions it would be unreasonable to manually annotate each one with a representative name. Therefore, in this paper we assume the following:
Assumption 1 (Sensible programmer) A programmer, when writing code in a high-level language, assigns names to functions that represent their semantics and their role inside the software.
The above assumption does not always hold true. However, we find it reasonable to assume that it holds most of the time, especially in large projects developed by professional or skilled programmers, where common naming conventions are often used and enforced. This assumption allows us to create a dataset by disassembling the binary code from open source projects without the need to perform manual annotation, as function names are available in the source code.
3.3 Evaluation

There is an unavoidable ambiguity in the output of any method that names something: in general, several different meaningful descriptive names can be associated with a given piece of code. As an example, a function implementing quick sort on an array could be named "Quick-sort", "quick-array-sort", "sort", etc. This creates a problem in the way our solution is tested: in order to evaluate its real accuracy we should consider all possible meaningful names, but this is, again, infeasible. Therefore, as common in the NLP literature [5,18], we will say that a prediction is correct if it is the same name present in the dataset. In our case we will measure the performance using the classical metrics of precision, recall and F1. The drawback of this evaluation methodology is that other predictions will be deemed wrong even if they are meaningful.
Noisy names. Unfortunately, we found that function names in our dataset are noisy; for example, many function names of the OpenGL library contain the bigram "gl". Such a pattern is recurrent in many libraries and software packages; this is due to the fact that developers use words and acronyms that identify the software itself, but that do not add much to the semantics of the function. In order to clean our dataset we designed a filtering process (described in detail in Section 4). This filtering process has the purpose of associating each original name with a reduced name over a restricted vocabulary of words (whose size was around 1k words in our experiments). We found that such restriction preserves the semantics of the majority of names in our dataset and solves the problem described above.
4 The UbuntuDataset

In this section we describe our dataset and the steps used for its construction. The dataset can be found at the following URL: redacted (the URL for downloading the dataset will be uploaded shortly).

4.1 Dataset Construction

UbuntuDataset was built by downloading all available amd64 packages from the Ubuntu 19.04 apt repositories (namely main, restricted, universe and multiverse). We collected a total of 22040 packages; for each package, when available, we downloaded the corresponding debug symbols and extracted all the executable files. At the end of this process we got 87853 distinct ELF files that we disassembled with IDA Pro. We then filtered out duplicate functions, where two functions are duplicates if they contain the same list of instructions after the substitution of constants and memory addresses with a dummy value.
Table 1. Comparison of datasets used by function naming papers. (**) The paper does not report the total number of functions, but it reports an average of 79 functions per binary file and a total number of 9000 binaries. (*) The NERO pre-print [8] mentions a public dataset but no link to it is provided.

Paper                  Packages   Functions        Source                      Public
UbuntuDataset (Ours)   22040      ~9M              Ubuntu 19.04 Repositories   Yes
Debin [11]             830        711k Ext. (**)   Ubuntu                      No
Nero [8]               Unknown    15k              Linux Generic               No (*)

4.2 Function Representation
In UbuntuDataset each function is represented by its linear list of instructions. The relative order in the list is given by the address of each instruction inside the program. We truncated all functions exceeding a maximum number of instructions and we removed all functions below a minimum number of instructions.

4.3 Function Name Normalisation

We apply a normalisation process to function names whose final goal is to represent each of them as a list of tokens in a reference vocabulary. The process of normalisation is based on the following steps:
1. Demangling
2. Splitting
3. Stemming
4. Vocabulary construction
5. Final conversion

Demangling. The names contained in the debug symbol table have been mangled by the compiler for various reasons (e.g., implementing method overriding). The aim of this first step is to recover the original function names. Since compilers perform mangling in a standard way, it is possible to perform name demangling using standard libraries; in particular, we used cxxfilt (https://github.com/afq984/python-cxxfilt). During this step we filtered out all the binary functions deriving from source code written in languages different from C/C++ (e.g., Go, Haskell, etc.).
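A minimal sketch of this step using the python-cxxfilt bindings mentioned above; the error handling around non-mangled names is an illustrative assumption:

    import cxxfilt

    def demangle(symbol: str) -> str:
        # Return the demangled name; symbols that are not valid mangled
        # C++ names (e.g., plain C symbols) are returned unchanged.
        try:
            return cxxfilt.demangle(symbol)
        except cxxfilt.InvalidName:
            return symbol

    print(demangle("_ZNSt6vectorIiSaIiEE9push_backERKi"))
    # std::vector<int, std::allocator<int> >::push_back(int const&)
    print(demangle("read_file"))  # unchanged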
Splitting. This step consists in splitting function names into tokens. This is achieved by using the natural partition provided by the camelCase and snake_case notations, which are generally adopted for function names.
Stemming.
This is a technique used in information retrieval to reduce inflected (or sometimes derived) words to their base form. Although developed for a different purpose, such a technique turns out to be very useful in this context, since it has the effect of reducing the vocabulary by mapping different forms of the same token into a unique one (for example, the tokens "shared" and "sharing" are both mapped to their base form "share"). Stemming is important since it is not necessary for the network to learn the syntactically correct token to associate with each function, but rather the semantically correct one.

Fig. 1. Statistical analysis of the vocabulary: (a) frequency of the 25 most frequent tokens; (b) log-log scale plot of the frequency of all tokens in our vocabulary.
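A minimal sketch of the splitting and stemming steps above; the use of NLTK's PorterStemmer is an illustrative assumption, since the exact stemmer is not mandated here:

    import re
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def split_name(name: str) -> list:
        # Split on snake_case first, then on camelCase boundaries.
        parts = []
        for chunk in name.split("_"):
            parts.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", chunk))
        return [p.lower() for p in parts if p]

    def tokenize(name: str) -> list:
        return [stemmer.stem(tok) for tok in split_name(name)]

    print(tokenize("readSharedFile"))  # ['read', 'share', 'file']
    print(tokenize("gsl_vector_set"))  # ['gsl', 'vector', 'set']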
Vocabulary construction.
This step consists in creating a meaningful vocabulary of tokens. This operation has the goal of removing useless and meaningless tokens from function names (e.g., many functions in the GNU Scientific Library start with the token "gsl"). The token selection process consists first in assigning a score to each token and then retaining only tokens whose score is above a certain threshold τ. The score of each token t is its project frequency, that is, the number of different packages in which the token appears. This choice permits us to exclude tokens that appear only in a few packages, independently of the frequency of the token inside each package. In this way we avoid assigning a large score to tokens that are not semantically relevant (see the aforementioned example of the "gsl" token). We set τ to 500, obtaining 1080 tokens. Finally, we excluded from the vocabulary all tokens that have no meaning, ending up with a vocabulary of 1064 tokens.
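A minimal sketch of the project-frequency scoring; the (package, tokens) data layout is an illustrative assumption:

    from collections import defaultdict

    def build_vocabulary(functions, tau=500):
        """functions: iterable of (package_name, [tokens]) pairs."""
        packages_per_token = defaultdict(set)
        for package, tokens in functions:
            for token in tokens:
                packages_per_token[token].add(package)
        # Keep only tokens appearing in at least tau distinct packages.
        return {t for t, pkgs in packages_per_token.items() if len(pkgs) >= tau}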
Final conversion. This final step consists in removing from function names all those tokens not contained in the vocabulary built in the previous step. Moreover, this step is also designed to split function names that do not use the camelCase or snake_case notation. The idea is to split tokens in case they match a word in our vocabulary; for example, the token "numpy" matches the vocabulary word "num", therefore we transform it into the token "num"; the token "numvertex" is transformed into "num" and "vertex". There are a few cases in which function names consist only of out-of-vocabulary tokens, resulting in an empty name; in such cases the functions are removed from the dataset. At the end of the entire normalisation process we filtered out 11% of the initial functions and we obtained nearly 9 million functions.
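A minimal sketch of this conversion, assuming a greedy longest-match split of out-of-vocabulary tokens (the exact matching strategy is an illustrative choice):

    def final_conversion(tokens, vocabulary):
        out = []
        for token in tokens:
            if token in vocabulary:
                out.append(token)
                continue
            # Greedily split out-of-vocabulary tokens on vocabulary words,
            # e.g. "numvertex" -> ["num", "vertex"], "numpy" -> ["num"].
            i = 0
            while i < len(token):
                for j in range(len(token), i, -1):
                    if token[i:j] in vocabulary:
                        out.append(token[i:j])
                        i = j
                        break
                else:
                    i += 1  # no vocabulary word starts here, skip one char
        return out

    vocab = {"num", "vertex", "get"}
    print(final_conversion(["numpy"], vocab))      # ['num']
    print(final_conversion(["numvertex"], vocab))  # ['num', 'vertex']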
Statistical analysis of function names.
We performed some basic analysis on the names contained in UbuntuDataset. The average number of tokens in a name is small, and the vast majority of names are composed of only a few tokens. The frequency distribution of the 25 most frequent tokens is in Figure 1a, while in Figure 1b we report the log-log plot of the frequency of all tokens in our vocabulary. It is not surprising that get and set are predominant tokens; this is due to their frequent use in the object-oriented programming paradigm. Furthermore, most of the words are likely distributed according to a power-law distribution; this is consistent with the distribution of words in human languages (for example, English words seem to follow a Zipf law [17]). However, such fit is not perfect: in the log-log plot the tails of the distribution (the highest and lowest frequency words) do not fit a linear interpolation.

Dataset split. In order to train and evaluate the models we split the dataset into the canonical Train, Validation and Test sets, assigning the large majority of functions to the Train set and the remainder to the Validation and Test sets. We avoid information leakage by splitting functions by package: all functions from a given package belong to only one of the three sets.
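A minimal sketch of the leakage-free, package-level split; the split fractions are illustrative assumptions:

    import random

    def split_by_package(functions_by_package, val_frac=0.05, test_frac=0.05, seed=0):
        """functions_by_package: dict mapping package name -> list of functions.
        Splitting whole packages prevents near-duplicate functions from the
        same project from leaking between Train and Test."""
        packages = sorted(functions_by_package)
        random.Random(seed).shuffle(packages)
        n = len(packages)
        n_val, n_test = int(n * val_frac), int(n * test_frac)
        val_pkgs = set(packages[:n_val])
        test_pkgs = set(packages[n_val:n_val + n_test])
        train, val, test = [], [], []
        for pkg, funcs in functions_by_package.items():
            bucket = val if pkg in val_pkgs else test if pkg in test_pkgs else train
            bucket.extend(funcs)
        return train, val, test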
5 Solution Overview
In this section we describe the solutions we tested. We considered two different Deep Neural Network models: Seq2Seq and Transformer. Both architectures take as input the sequence of normalised instructions that constitute a function and output a prediction of its name, token by token.
Instruction normalisation. All our architectures take as input the sequence of normalised assembly instructions. Instructions are normalised with the purpose of reducing their total number by removing mostly unnecessary information. We follow a normalisation process similar to the one proposed in [16]: we replace all base memory addresses with the special symbol MEM, and all immediates whose absolute value is above a fixed threshold with the special symbol IMM. We do so because raw operands are of small benefit; for instance, the displacement given by a jump is useless (instructions do not carry with them their memory address), and it may worsen performance by artificially inflating the number of different instructions. In our normalisation, an instruction such as mov EBX, 0x8000 becomes mov EBX, IMM and an instruction such as mov EBX, [0x804a000] becomes mov EBX, MEM, while an instruction such as mov EAX, [EBP-8] is not modified.
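A minimal sketch of this normalisation using simple pattern matching over textual operands; the regular expressions and the threshold value are illustrative assumptions:

    import re

    def normalise_instruction(ins: str, threshold: int = 4096) -> str:
        mnemonic, _, ops = ins.partition(" ")
        if not ops:
            return mnemonic  # no-operand instruction, e.g. "ret"
        norm_ops = []
        for op in ops.split(","):
            op = op.strip()
            if re.fullmatch(r"\[0x[0-9a-fA-F]+\]", op):
                op = "MEM"  # absolute memory address
            elif re.fullmatch(r"-?(0x[0-9a-fA-F]+|\d+)", op):
                if abs(int(op, 0)) > threshold:
                    op = "IMM"  # large immediate
            # register operands and register-relative accesses are kept
            norm_ops.append(op)
        return mnemonic + " " + ",".join(norm_ops)

    print(normalise_instruction("mov EBX,0x8000"))       # mov EBX,IMM
    print(normalise_instruction("mov EBX,[0x804a000]"))  # mov EBX,MEM
    print(normalise_instruction("mov EAX,[EBP-8]"))      # unchanged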
Sequence transduction models. Sequence transduction models are usually used to solve NLP problems such as Neural Machine Translation (NMT). These models take as input a sequence of terms and output a transduced sequence. These architectures are natural candidates for our problem due to its similarity with the translation task: in our task we are essentially translating from assembly code to small sets of tokens in human language. Generally, NMT models are composed of an encoder, that takes the input sequence and returns a set of states c, and a decoder, that takes c and outputs the probability of an output sequence Y:

p(Y) = \prod_{t=1}^{T} p(y_t \mid y_1, \ldots, y_{t-1}, c)   (1)

In NMT-like tasks, to generate the output sequence, at each time step t the model outputs a probability distribution over the output vocabulary.

Seq2Seq Model.
We use the Seq2Seq architecture proposed in [5]: the encoder consists of a bidirectional RNN with Long Short-Term Memory (LSTM) cells. The use of a bidirectional encoder is important since it allows computing, for each instruction, a hidden state vector that takes into account the instruction itself and its previous and following context. The decoder is a forward RNN connected to the encoder by an attention mechanism that allows to better model long-distance dependencies, which represent a critical aspect of assembly code.
Parameters for Seq2Seq model.
We used an embedding size of 256 for the input and output tokens. For the encoder we used a two-layer bidirectional RNN with hidden state size of 256. Equally, for the decoder we used a unidirectional RNN with hidden state size of 256. Encoder and decoder were connected using Bahdanau attention [5]. The total number of parameters of the Seq2Seq model is 52618793. The inference step uses a beam-search strategy with beam size 1. We used the implementation provided by OpenNMT-py (https://github.com/OpenNMT/OpenNMT-py).
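For concreteness, the following is an illustrative PyTorch skeleton with the dimensions above; it is not the OpenNMT-py implementation actually used, and the attention mechanism is omitted for brevity:

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        """Bidirectional LSTM encoder + unidirectional LSTM decoder with
        the embedding/hidden sizes reported above (the real model also
        uses Bahdanau attention [5], omitted here)."""
        def __init__(self, src_vocab, tgt_vocab, emb=256, hid=256, layers=2):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, emb)
            self.tgt_emb = nn.Embedding(tgt_vocab, emb)
            self.encoder = nn.LSTM(emb, hid, num_layers=layers,
                                   bidirectional=True, batch_first=True)
            # The decoder consumes the previous token embedding concatenated
            # with the encoder's final (bidirectional) hidden state.
            self.decoder = nn.LSTM(emb + 2 * hid, hid, num_layers=layers,
                                   batch_first=True)
            self.out = nn.Linear(hid, tgt_vocab)

        def forward(self, src, tgt_in):
            enc_states, _ = self.encoder(self.src_emb(src))
            ctx = enc_states[:, -1:, :].expand(-1, tgt_in.size(1), -1)
            dec_in = torch.cat([self.tgt_emb(tgt_in), ctx], dim=-1)
            dec_out, _ = self.decoder(dec_in)
            return self.out(dec_out)  # logits over the name-token vocabulary

    model = Seq2Seq(src_vocab=50000, tgt_vocab=1064)  # src_vocab is illustrative
    logits = model(torch.zeros(8, 200, dtype=torch.long),
                   torch.zeros(8, 5, dtype=torch.long))
    print(logits.shape)  # torch.Size([8, 5, 1064])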
Transformer. The Transformer [1] is an encoder-decoder architecture entirely based on the attention mechanism. The network consists of a set of N stacked encoders and a set of N stacked decoders. The encoder is composed of a stack of N identical layers. The bottom-most layer is fed with the embedding vectors of the input sequence, whereas all the other layers are fed with the output of the previous encoder. Each layer consists of two sub-layers: a multi-head self-attention mechanism, used to understand which are the relevant tokens in the input sequence, and a fully connected feed-forward network independently applied to each position. In the same way, the decoder is composed of a stack of N identical layers. Each layer consists of three sub-layers: a masked multi-head attention mechanism over the decoder input, another multi-head attention over the encoder stack output, and a final feed-forward layer.

Parameters for Transformer model. We used an embedding size of 256 for the input and output tokens. For the encoder we used 6 encoding layers with 8 heads of attention, hidden state size of 256 and hidden feed-forward size of 2048. We used the same values for the decoder. The total number of parameters of this model is 67482921. The inference step uses a beam-search strategy with beam size 1. We used the implementation provided by OpenNMT-py.
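An illustrative PyTorch skeleton with these hyper-parameters; again, this is not the OpenNMT-py implementation actually used, and positional encodings are omitted for brevity:

    import torch
    import torch.nn as nn

    class NamingTransformer(nn.Module):
        """Encoder-decoder Transformer with the hyper-parameters reported
        above: 6 layers, 8 heads, model size 256, feed-forward size 2048."""
        def __init__(self, src_vocab, tgt_vocab, d_model=256):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, d_model)
            self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=8,
                num_encoder_layers=6, num_decoder_layers=6,
                dim_feedforward=2048, batch_first=True)
            self.out = nn.Linear(d_model, tgt_vocab)

        def forward(self, src, tgt_in):
            # Causal mask so each output position only attends to the past.
            mask = nn.Transformer.generate_square_subsequent_mask(tgt_in.size(1))
            h = self.transformer(self.src_emb(src), self.tgt_emb(tgt_in),
                                 tgt_mask=mask)
            return self.out(h)

    model = NamingTransformer(src_vocab=50000, tgt_vocab=1064)
    logits = model(torch.zeros(8, 200, dtype=torch.long),
                   torch.zeros(8, 5, dtype=torch.long))
    print(logits.shape)  # torch.Size([8, 5, 1064])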
6 Evaluation on Goodware

In this section we describe our tests on non-malware binaries. Our evaluation is divided into three main tests. The first experiment is used to train our architectures and select the best-performing one. In the other two tests we investigate the performance of our best-performing architecture on specific applications, and we test the benefit of using fine-tuning to increase its performance.
Fine-tuning.
This is a very popular technique in NLP [12] that proved extremely effective when applied to models derived from the Transformer and trained on extremely large general-purpose datasets [9]. The main idea is that a model trained on a large general-purpose dataset will learn general patterns with a given distribution, in our case the ones between assembly code and natural-language words. However, specific tasks may be characterized by small datasets with different distributions. Fine-tuning the pre-trained model on such a smaller dataset often leads to high-performance models tailored to the specific task at hand.
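A minimal sketch of the fine-tuning loop; hyper-parameters such as the learning rate are illustrative assumptions:

    import torch

    def fine_tune(model, domain_loader, epochs=5, lr=1e-4):
        """Continue training a pre-trained model on a small
        domain-specific dataset instead of training from scratch."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        model.train()
        for _ in range(epochs):
            for src, tgt_in, tgt_out in domain_loader:
                optimizer.zero_grad()
                logits = model(src, tgt_in)          # (batch, steps, vocab)
                loss = loss_fn(logits.flatten(0, 1), tgt_out.flatten())
                loss.backward()
                optimizer.step()
        return model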
Tests roadmap.
In detail, the tests are:

– UbuntuDataset Training and Test: we train and test the Seq2Seq and Transformer architectures using the UbuntuDataset. Moreover, on a subset of UbuntuDataset we compare the Transformer with DEBIN. The best-performing model of this test is the Transformer. In the other tests we investigate how the Transformer trained on UbuntuDataset, namely TransformerPT, performs on specific tasks, with and without fine-tuning.

– MultiOpt Test: we evaluate the performance of TransformerPT on a dataset composed of several applications compiled with 4 different optimisation levels (-O0 to -O3) and several compilers. Specifically, we test whether fine-tuning on a certain optimisation level increases the performance on the other levels.

– SameDomain Test: we test whether fine-tuning performed on a specific application, namely busybox, increases the performance on a different but similar application, namely the coreutils binaries. The idea is to see if re-training TransformerPT on an application of a certain subdomain (in this case system utilities) increases its performance on an application of the same domain.

Evaluation Metrics.
Following [8], we use as evaluation metrics the classical precision (P), recall (R) and f1-score (F1). More precisely, given the set of vocabulary words in the actual name x: {x_1, x_2, ..., x_n}, and the set of tokens in the prediction \hat{x}: {\hat{x}_1, \hat{x}_2, ..., \hat{x}_n}, we define a membership function as:

\alpha(\hat{x}_i, x) = \begin{cases} 1 & \text{if } \hat{x}_i \in x \\ 0 & \text{otherwise} \end{cases}   (2)

and we compute precision, recall and f1-score as:

P = \frac{\sum_i \alpha(\hat{x}_i, x)}{|\hat{x}|}   (3)

R = \frac{\sum_i \alpha(\hat{x}_i, x)}{|x|}   (4)

F1 = \frac{2 \cdot P \cdot R}{P + R}   (5)
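A direct implementation of these metrics, as a small sketch:

    def precision_recall_f1(predicted, reference):
        """Token-set precision, recall and f1 as defined in Eqs. (2)-(5)."""
        ref = set(reference)
        matched = sum(1 for tok in predicted if tok in ref)
        p = matched / len(predicted) if predicted else 0.0
        r = matched / len(reference) if reference else 0.0
        f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
        return p, r, f1

    print(precision_recall_f1(["get", "file"], ["read", "file", "name"]))
    # (0.5, 0.3333..., 0.4)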
UbuntuDataset Training and Test. We trained each model on the Train set from UbuntuDataset for a maximum of 30 epochs. We used a batch size of 512 and the Adam optimizer with a decaying learning rate. After each training epoch we evaluated the performance of the model on the Validation set. We used an early-stopping mechanism, stopping the training when the f1-score on the Validation set did not increase for more than 2 epochs. We took the model with the highest f1-score on the Validation set and tested it on the Test set.

Results.
The results are in Table 2; the Transformer outperforms Seq2Seq, reaching an f1-score of 0.230 against 0.122 for Seq2Seq. As observed also in [11], it is likely that such measures underestimate the real performance of our system, for the reasons described in Section 3.3. In our qualitative evaluation on malware (Section 7.2) we have findings that confirm this hypothesis.

Table 2. Results on UbuntuDataset Test set.

             prec.  rec.   f1
Seq2Seq      0.174  0.095  0.122
Transformer  -      -      0.230
Comparison with DEBIN and RANDOM. We compared our solution with DEBIN and RANDOM. We did not compare our solution with NERO since their solution has not been released. RANDOM is a basic prediction strategy in which tokens are randomly associated with functions with a probability that respects their frequency in the training set (as an example, "get" will be sampled more frequently than "socket"). This kind of strategy has a slight edge over a purely uniform random one. Note that we used DEBIN as released, that is, we did not train their model from scratch on the UbuntuDataset. The comparison has been performed on a subset of our Test set. We took all the binaries contained in our Test set and, for each binary, we used DEBIN to get a table of the missing debug symbols. We discarded all binaries for which DEBIN was crashing or taking more than 30 minutes to analyse. We then took all predictions for functions that were not in the symbol table of the original files; this is because DEBIN does not predict symbols that are already present in the symbol table of the binary (e.g., exported functions). At the end of this process we obtained the set of functions used for the comparison. For each of these functions we took the name predicted by DEBIN and we normalised it using the vocabulary described in Section 4.3, obtaining for each name a set of representative tokens.
Results.
Results from this test are in Table 3. We can see that all solutions are markedly better than the RANDOM strategy. On this dataset the Transformer outperforms the other solutions, reaching the highest f1-score and beating DEBIN (0.046) and Seq2Seq (0.108). We note that the relatively low value of DEBIN is not necessarily in contrast with what is reported in [11]; their tests are done for the entire naming category, which includes not only the task of naming functions but also the one of naming variables. Variable names reuse common patterns (as an example, fd, file, f are standard variable names for files), and this could justify a higher rating than the one obtained by looking only at their performance on function names.

Table 3. Results and comparison with DEBIN on the DEBIN-Test subset.

             Prec.  Rec.   f1
RANDOM       0.008  0.008  0.008
DEBIN        0.047  0.046  0.046
Seq2Seq      0.144  0.086  0.108
Transformer  -      -      -
Table 4. Results on MultipleCompilers Dataset for the MultiOpt test.

                                 O1-f1  O2-f1  O3-f1
TransformerPT (w/o Fine-tuning)  0.134  0.145  0.142
Transformer Fine-tuned           -      -      -
Transformer Scratch              0.104  0.094  0.093
MultipleCompilers Dataset. We used the dataset of [15], which contains different packages compiled with 9 different compilers and optimization levels from O0 to O3. From this dataset we selected three packages: binutils-2.30, coreutils-2.29, and curl-7.61.0. We split the functions into four folds, one for each optimization level. We used the fold with O0 for fine-tuning the pre-trained model. Note that we filtered out duplicated functions: fold O0 does not include the functions contained in the other three folds. After the filtering, fold O0 contains 76727 functions, fold O1 contains 61651 functions, fold O2 contains 46842 functions, and fold O3 contains 43930 functions. The results for this dataset are reported in Table 4.
We took the TransformerPT pre-trained on UbuntuDataset and used it to predict the names of the functions in folds O1, O2, and O3. Interestingly, the model shows the best performance on higher optimizations. We argue that this is probably due to the fact that the majority of packages in Ubuntu repositories are compiled with higher optimization. Therefore, the pre-trained model performs better on these kinds of functions, since they contain patterns observed in the training phase.

Results with fine-tuning.
We fine-tuned the pre-trained TransformerPT for 5 epochs using the functions compiled with optimization O0. After the fine-tuning, we predicted the names of the functions with the other optimization levels. The results clearly show the benefits of fine-tuning: the f1-score of the fine-tuned Transformer on all the optimization levels is close to 2x the one of the non-fine-tuned model.

Results when training from scratch.
Using the O0 fold we trained a Transformer model from scratch, using the functions of another optimization level as Validation set. As for our pre-trained model, we stopped the training after 24 epochs, when the f1-score on the Validation set did not grow for more than 2 epochs. The maximum f1-score is 0.104, which is worse than the minimum achieved by the pre-trained and the fine-tuned models. This was largely expected, as a model trained from scratch on a smaller dataset only learns a limited number of patterns.

Table 5.
Results on SameDomain Dataset for the SameDomain Test.

                                 Prec.  Rec.   f1
TransformerPT (w/o Fine-tuning)  0.200  0.166  0.181
Transformer Fine-tuned           -      -      -
Transformer Scratch              0      0      0
SameDomain Dataset.
We compiled busybox-1.31.1 with gcc 7.4.0 and all four optimisation levels. From these binaries we obtained 11897 functions that constitute our fine-tuning Train set. The Test set is composed of coreutils-2.29 compiled with 9 different compilers (clang-3.8, clang-3.9, clang-4.0, clang-5.0, gcc-3.4, gcc-4.7, gcc-4.8, gcc-4.9, gcc-5) and optimization levels from O0 to O3; after processing the functions we obtained 60770 samples.

Results without fine-tuning.
Without fine-tuning, TransformerPT has an f1-score of 0.181. Such result is lower than the one obtained on the Test set of UbuntuDataset. We believe this is due to the fact that this dataset contains multiple different compiler optimizations, including O0, which is not frequently used for pre-compiled Ubuntu packages (see also the similar explanation for the tests in the previous section).

Results with fine-tuning.
We fine-tuned the model, re-training it for a few epochs. With fine-tuning we got an overall increment of the f1-score, corresponding to a relative improvement of around 22% with respect to the pre-trained model. This confirms that fine-tuning is a good strategy when the naming has to be performed on applications of a certain specific domain.

Results when training from scratch.
When the model is trained from scratch we reach an f1-score of 0. This is because the model incorrectly learns to associate any function with an empty name. This is probably due to the fact that the training set is too small to learn any meaningful relationships.

7 Test on Malware
We tested the Transformer architecture on real-world malware. We performed a quantitative evaluation on Linux malware obtained from VirusShare (Section 7.1), and a qualitative analysis on two malware samples for which the source code is available (Section 7.2).
7.1 Quantitative Evaluation

In this section we describe our quantitative evaluation on malware.
Table 6. Results (f1-score) on MalwareDataset; singleton represents all samples that were not labelled by AVCLASS, while others groups the families containing only one sample.

Family     Samples  TransformerPT (w/o Fine-Tuning)  Transformer Fine-Tuned  Transformer Scratch
chinaz     4        0.119                            -                       0.401
dnsamp     12       0.122                            -                       0.331
drtycow    5        0.282                            -                       0.560
gafgyt     217      0.156                            -                       0.870
intfour    3        0.404                            -                       0.772
ladvix     2        0.356                            -                       0.571
mirai      17       0.124                            -                       0.455
snessik    3        0.302                            -                       0.857
sotdas     4        0.105                            -                       0.346
yangji     3        0.141                            -                       0.910
znaich     7        0.119                            -                       0.440
singleton  57       0.221                            0.401                   0.194
others     11       0.103                            -                       0.280
MalwareDataset.
We built a dataset of malware by downloading an ELF collection from VirusShare (https://tracker.virusshare.com:7000/torrents/VirusShare ELF 20190212.zip.REDACTED). To have a reliable ground truth we only considered malware that is not stripped. Moreover, we selected only the malware for AMD64. We used VirusTotal to obtain the antivirus labels for each selected sample, and AVCLASS [21] to assign a family to each sample. At the end of this process we obtained 406 malware samples belonging to 23 families. The families are very unbalanced: the majority of the samples (217) belongs to the class gafgyt, 11 families contain only one sample, and 57 samples were classified by AVCLASS as SINGLETON. We disassembled all samples with IDA Pro and we filtered function names using the procedure and the vocabulary described in Section 4. In total, we gathered 156316 functions.
Test Description and Dataset Split.
We performed three kinds of tests: one taking the Transformer architecture trained on the UbuntuDataset (namely TransformerPT) as is; another in which we fine-tune the trained architecture by performing a small re-train on a Training set described below; and a final one where we used the same aforementioned Training set to train a Transformer model from scratch. The Training set contains the 61 samples of the tsunami family; the remaining samples are used as Test set. We decided to use a rather limited Training set to model a worst-case scenario where only a few malware samples can be used to build a labeled dataset.
Results.
The results of our tests are in Table 6.
Tests without fine-tuning.
We tested TransformerPT on the MalwareDataset. The average f1-score over all the classes is 0.196, which is lower than the one on the UbuntuDataset (f1-score of 0.230). However, as we already mentioned, this f1-score possibly underestimates the real performance, since the model may predict names that are meaningful but not exactly equal to the ones in the dataset. This hypothesis is confirmed by our qualitative analysis in Section 7.2.
Tests with fine-tuning.
We fine-tuned the TransformerPT model, already trained on UbuntuDataset, using the MalwareDataset Training set. We stopped the retraining after 43 epochs, when the performance on the Training set was not improving any more. The fine-tuned model clearly shows the benefits of a domain-specific fine-tuning: we reached an f1-score of 0.640. We think that during the fine-tuning the network learns the specific domain, which is constituted mainly by encryption functions, network functions, IO operations, functions that gather information from the OS, and statically linked libraries.

Training the model from scratch. We also used the same Training set to train a Transformer from scratch. Interestingly, this model reaches a high f1-score. We argue that this is due to the code reuse between different families and to static linking. As a matter of fact, while the Transformer trained from scratch achieves good performance on certain families (gafgyt, intfour, snessik), it performs markedly worse, with respect to the fine-tuned model, on others. As an example, on the singleton group the from-scratch model reaches an f1-score of 0.194 (vs. 0.221 for the pre-trained Transformer), while the fine-tuned model reaches an f1-score of 0.401.

7.2 Qualitative Evaluation

We performed a qualitative evaluation on the infamous botnet MIRAI, for which the source code has been leaked (https://github.com/jgamblin/Mirai-Source-Code), and on the educational ransomware gonnacry (https://github.com/tarcisio-marinho/GonnaCry). We decided to use malware for which the source code was available in order to manually assess each prediction by looking also at the original source code. Our hypothesis is that the f1-score underestimates the performance of the pre-trained TransformerPT. We will show that the predictions of our model can provide a reverse engineer with useful insights on a stripped binary.

Analysis of Predictions on the Mirai Botnet.
We compiled the botnet from the source code with gcc-7.4. We disassembled it using IDA Pro, obtaining 74 functions. We report in Table 7 the comparison of reference names, predictions of TransformerPT, and predictions of DEBIN. The f1-score computed on such predictions is 0.157 for TransformerPT and 0.013 for DEBIN. We identified the following interesting aspects:

– Networking and checksum functions: all networking functions are associated with a set of tokens related to network functionalities, usually composed of send, send to or send request; we highlight teardown_connection (row 35), which is correctly named closed. The function checksum_generic (row 23) is correctly named crc.

– String and memory related: the function mem_exists (row 8) looks for a string in a buffer, and it is named str cmp; another example is util_memsearch (row 60), which is named find and indeed finds a string in memory. Functions that check if a string or a character is in a particular format are almost correctly predicted (e.g., util_isupper (row 64) is named is low).

– File operations: the function killer_kill_by_port (row 27) finds the process listening on a certain port by reading from the proc filesystem. In this case, our model predicts process read dir, which correctly represents part of its behaviour. The function has_exe_access (row 28) opens the file /proc/[pid]/exe; our model names the function read pid file. The function memory_scan_match (row 29) searches for a certain string in a file; our model predicts get from file.

– Other functions: the function attack_kill_all (row 2) is translated into kill all; by looking at the code, the function is iterating on an array of PIDs and killing each process in the array. The function attack_start (row 4) is starting a set of child processes, and it is correctly identified as run child. In attack_get_opt_str (row 5) a value is searched in an array, and this is correctly named find. attack_get_opt_ip (row 7) returns an IP address, and this is named addr by our network. The function add_attack (row 8) performs a reallocation of an array, incrementing its size, and adds an element; the predicted name add fits this behaviour.

Analysis of Predictions on Gonnacry.
We report in Table 8 the comparison of reference names, predictions of TransformerPT, and predictions of DEBIN. On this malware the predictions of the Transformer are rather good, and even when they differ from the ground truth, they still express sub-behaviours implemented in the function. get_desktop_environment (row 12) is predicted as get path; interestingly, such function returns the desktop path. Other interesting examples are the functions decrypt/encrypt (rows 23, 24): what they do is to open a file, copying its encrypted/decrypted content into another file, and the Transformer names such functions copy file. The function generate_key (row 15) creates a random string and it is named random.

Table 7. Predictions of the Transformer and DEBIN on mirai.
Table 8. Predictions of the Transformer and DEBIN on gonnacry.
8 Conclusions

This paper proposes a study on the problem of predicting names of functions in stripped binary code. We tested state-of-the-art solutions for machine translation, finding a rather good carryover to our problem. Moreover, we created a large public dataset of functions that can be used to further the research on the topic. The results that we found are encouraging and pave the way for further studies. We believe that many improvements are possible, and that one of the main challenges is to find faithful metrics that would capture the performance perceived by a human. To this end, it would be beneficial to investigate the following directions:

– Multi-reference dataset: an idea is to create a dataset where each function is associated with multiple reference names. This would ameliorate the evaluation problem, since it would take into account the possibility of deviating from a single canonical name.

– Extensive human evaluation: another line is to perform an extensive investigation using humans. It would be beneficial to evaluate this kind of solution using experienced programmers that look at the predicted name and at the source code of the function.

– Metrics based on NLP: finally, it would be interesting to use metrics based on NLP. Recently, BERTScore [24] proposed new metrics based on contextual word embeddings computed with BERT [9]. The intuition is that this process takes into account the semantics of the words in the predicted and reference sentences.
References
1. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 5998–6008, 2017.
2. Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. Suggesting accurate method and class names. In FSE, 38–49, 2015.
3. Miltiadis Allamanis, Hao Peng, and Charles A. Sutton. A convolutional attention network for extreme summarization of source code. In ICML, 2091–2100, 2016.
4. Uri Alon, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. In ICLR, 2019.
5. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
6. S. Chakraborty, M. Allamanis, and B. Ray. Tree2tree neural translation model for learning source code changes. arXiv preprint, arXiv:1810.00314, 2018.
7. H.K. Dam, T. Pham, S.W. Ng, T. Tran, J. Grundy, A. Ghose, T. Kim, and C. Kim. A deep tree-based model for software defect prediction. arXiv preprint, arXiv:1802.00921, 2018.
8. Yaniv David, Uri Alon, and Eran Yahav. Neural reverse engineering of stripped binaries. arXiv preprint, arXiv:1902.09122, 2019.
9. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 4171–4186, 2019.
10. Steven H.H. Ding, Benjamin C.M. Fung, and Philippe Charland. Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization. In SP, 472–489, 2019.
11. Jingxuan He, Pesho Ivanov, Petar Tsankov, Veselin Raychev, and Martin Vechev. Debin: Predicting debug information in stripped binaries. In CCS, 1667–1680, 2018.
12. Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In ACL, 328–339, 2018.
13. Jeremy Lacomis, Pengcheng Yin, Edward J. Schwartz, Miltiadis Allamanis, Claire Le Goues, Graham Neubig, and Bogdan Vasilescu. DIRE: A neural approach to decompiled identifier naming. In ASE, 628–639, 2019.
14. M. Allamanis, E.T. Barr, P. Devanbu, and C. Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4):81:1–81:37, 2018.
15. Luca Massarelli, Giuseppe A. Di Luna, Fabio Petroni, Leonardo Querzoni, and Roberto Baldoni. Investigating graph embedding neural networks with unsupervised features extraction for binary analysis. In BAR, 2019.
16. Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Roberto Baldoni, and Leonardo Querzoni. SAFE: Self-attentive function embeddings for binary similarity. In DIMVA, 309–329, 2019.
17. Isabel Moreno-Sánchez, Francesc Font-Clos, and Álvaro Corral. Large-scale analysis of Zipf's law in English texts. PLOS ONE, 11(1):1–19, 2016.
18. Jekaterina Novikova, Ondrej Dusek, Amanda Cercas Curry, and Verena Rieser. Why we need new evaluation metrics for NLG. arXiv preprint, arXiv:1707.06875, 2017.
19. R. Bavishi, M. Pradel, and K. Sen. Context2Name: A deep learning-based approach to infer natural variable names from usage contexts. arXiv preprint, arXiv:1809.05193, 2018.
20. Veselin Raychev, Martin Vechev, and Andreas Krause. Predicting program properties from big code. Communications of the ACM, 62(3):99–107, 2019.
21. Marcos Sebastián, Richard Rivera, Platon Kotzias, and Juan Caballero. AVclass: A tool for massive malware labeling. In RAID, 230–253, 2016.
22. Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. Neural network-based graph embedding for cross-platform binary code similarity detection. In CCS, 363–376, 2017.
23. Z. L. Chua, S. Shen, P. Saxena, and Z. Liang. Neural nets can learn function type signatures from binaries. In USENIX Security, 99–116, 2017.
24. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In ICLR, 2020.