Spotting Silent Buffer Overflows in Execution Trace through Graph Neural Network Assisted Data Flow Analysis
Zhilong Wang, Li Yu, Suhang Wang and Peng Liu
College of Information Sciences and Technology, The Pennsylvania State University, [email protected], [email protected], [email protected], [email protected]
Abstract—A software vulnerability could be exploited without any visible symptoms. When no source code is available, although such silent program executions could cause very serious damage, the general problem of analyzing silent yet harmful executions is still an open problem. In this work, we propose a graph neural network (GNN) assisted data flow analysis method for spotting silent buffer overflows in execution traces. The new method combines a novel graph structure (denoted DFG+) beyond data-flow graphs, a tool to extract DFG+ from execution traces, and a modified Relational Graph Convolutional Network as the GNN model to be trained. The evaluation results show that a well-trained model can be used to analyze vulnerabilities in execution traces (of previously-unseen programs) without support of any source code. Our model achieves 94.39% accuracy on the test data, and successfully locates 29 out of 30 real-world silent buffer overflow vulnerabilities. Leveraging deep learning, the proposed method is, to our best knowledge, the first general-purpose analysis method for silent buffer overflows. It is also the first method to spot silent buffer overflows in global variables, stack variables, or heap variables without crossing the boundary of allocated chunks.
I. INTRODUCTION
A fundamental challenge in cybersecurity is that vulnerabilities widely exist in all kinds of programs [1], even though software engineers and security analysts have spent a great deal of effort to avoid and test for them. These vulnerabilities could be exploited by attackers and pose a huge threat to individuals, organizations, and governments [2]. Although researchers have proposed a variety of techniques to automatically discover and analyze software vulnerabilities [3], [4], [5], [6], almost all existing techniques rely on visible "symptoms" (e.g., crashes, failing assertions, and errors found by integrity checkers). Most vulnerability discovery methods use such symptoms to distinguish (potentially) harmful program executions, in which a vulnerability is triggered, from benign executions [7], [8], [9].

However, a software vulnerability could be exploited without any visible symptoms, and the corresponding program execution is often called a "silent" yet harmful execution. For example, some silent buffer overflows, silent use-after-free bugs, and silent information leaks could happen given specific malicious inputs. All these silent yet harmful program executions, though not as frequently seen as executions carrying visible symptoms, could still be leveraged by attackers to compromise the system [10] and cause serious damage (e.g., altering program variables, leaking critical information).

When the source code is available, silent yet harmful executions can in principle be identified and analyzed. By leveraging semantic information obtained from source code, researchers have developed various tools to identify and analyze them [11], [12], [13]. For example, Konstantin et al. [11] developed AddressSanitizer to detect memory errors and diagnose root causes (of silent yet harmful executions) through source-code-level instrumentation.

However, in many cases the commercial software and legacy code targeted by attackers has no source code available (to organizations using the software), and it is widely recognized in the research community that when source code is not available, analyzing silent yet harmful executions is an extremely difficult problem. A fundamental challenge in solving this problem is the lack of high-level, semantically rich information about data structures in the executables [14].

Due to this fundamental challenge, the general problem of identifying and analyzing silent yet harmful executions is yet to be solved. In the literature, only a small portion of silent vulnerabilities can be identified. For example, there is a spectrum of silent buffer overflows, but only overflows across the boundary of allocated chunks in the heap can be detected by existing methods [15], [16]. These methods capture the length of a dynamically allocated buffer by hooking the heap allocation functions, then check the integrity of each buffer access by comparing the buffer length with the offset of the access. So far, no effective method has been proposed to analyze silent buffer overflows in global variables, stack variables, or heap variables without crossing the boundary of allocated chunks. As stated by Dinesh et al. [16], binary disassembly is insufficient to recover data section layouts and semantic information lost during compilation. Not surprisingly, silent vulnerability analysis without source code is still an open problem, and there lacks a general-purpose analysis method for even one main category of silent vulnerabilities.

In this work, we seek to develop a general-purpose analysis method for silent buffer overflows, one of the most important categories of silent vulnerabilities. The proposed method is based on a key observation:
Key Observation.
In silent yet harmful program executions, the data flows towards the variables corrupted by the silent buffer overflow, and the memory space layout of some of the corrupted variables, are inherently different from those of non-affected variables.
It is worth noting that human analysts have to examine enough data flows and memory layout patterns, which is usually very time consuming, before they could leverage this key observation and identify the exact differences between corrupted and non-affected variables. In light of this, we propose to leverage Graph Neural Networks [17] to significantly reduce manual effort. In fact, our method is close to 100% automatic.

Our insight is that critical information about the difference between corrupted and non-affected variables could be represented by a novel graph structure. A Graph Neural Network (GNN) could then learn essential features (from graphs extracted from execution traces) through representation learning. The learned representations could enable the GNN model to "analyze" a given execution trace by classifying the nodes in the graph extracted from the trace. Finally, the nodes classified as "vulnerable" may provide enough information for automatically locating the addresses of vulnerable instructions and vulnerable buffers, respectively.

Specifically, we design a novel graph data structure to hold important features obtained from program executions, including data flows, variables' spatial information, and some useful implicit information flows. During the model training phase, we utilize a dynamic analyzer based on Intel Pin [18] to build the newly designed graph automatically from execution traces (of various programs) and customize AddressSanitizer [11] to help assign labels to nodes in the graph. A node with the label "vulnerable" corresponds to a variable corrupted by a silent buffer overflow. Using the labeled graphs as training data, we design and train a Bi-directional Propagation Relational Graph Convolutional Network (BRGCN) to perform node classification. After the
BRGCN model is well trained and deployed, the model can be used to analyze vulnerabilities in execution traces (of previously-unseen programs) without support of any source code. The experiments show that our model achieves 94.39% accuracy on the balanced test data, and successfully locates 29 out of 30 real-world vulnerabilities which we obtained from a public vulnerability database [19], [20]. The evaluation results show that graph neural network assisted data flow analysis is an effective general-purpose method for spotting silent buffer overflows when source code is not available.

In summary, we made the following contributions:
• To the best of our knowledge, this work proposes the first graph neural network assisted data flow analysis method for spotting silent buffer overflows in execution traces. It can analyze a full spectrum of silent buffer overflows.
• We designed a new type of graph data structure, DFG+, to represent programs' data flow, variables' spatial information, and implicit information flow in an integrated manner.
• We implemented a tool based on Intel Pin to automatically generate DFG+ from program executions and customized AddressSanitizer to help assign ground-truth labels for nodes in DFG+.
• We modified the Relational Graph Convolutional Network (RGCN) [21] by introducing bi-directional relation types to make it more effective in program analysis.
• We evaluated the effectiveness of the newly designed DFG+ and the newly designed BRGCN model, and compared them with other baseline methods.

In our view, the proposed method is neither a "competitor" nor an extension of existing fuzzing tools. Without source code, existing fuzzing tools, though very efficient, simply cannot identify silent buffer overflows in global variables, stack variables, or heap variables without crossing the boundary of allocated chunks. Hence, comparing the proposed method and fuzzing tools could result in "comparing apples and oranges."

II. BACKGROUND AND RELATED WORK
A. Buffer Overflow
For decades, buffer overflow (BOF) has remained one of the main security threats plaguing cyberspace, owing to the prevalence of buffer overflow bugs in commodity software and the fundamental difficulty of finding and fixing them. Conventionally, a BOF vulnerability refers to a category of software vulnerabilities which could corrupt adjacent memory regions due to insufficient bounds checking. The buffer associated with the vulnerability is called the vulnerable buffer. According to the location of the vulnerable buffer, BOFs can be grouped into heap, stack, and global BOFs. When a BOF happens, it can cause the program to crash by corrupting data/code pointers (e.g., a return address or a jump table), or change the program state by altering some non-control data [10]. In this paper, we classify wild BOFs into three categories according to their symptoms. A wild BOF means the input is not manually crafted by an analyst, e.g., in exploit generation.

1) Visible Buffer Overflow. We call a BOF visible if it shows visible symptoms, such as the program crashing, an assertion failing, or a garbled string being displayed on the screen.

2) Silent Buffer Overflow. Another kind of BOF happens without any visible symptoms. For instance, some BOFs that only corrupt some dead variables (please see the definition of dead variable in Section V-A), or only corrupt some local variables in the stack, will not cause a crash.

3) Innocent Buffer Overflow. A BOF has no effect on the program if it only overwrites a padding space, which is inserted by some compilers at the boundary of buffers due to data alignment specified by variable attributes [22] or language features [23].

In general, visible BOFs are believed to be easily discovered and analyzed when they happen, owing to their visible symptoms. If source code is available, silent BOFs can also be identified through a sanitizer [11] or bounds checking [12]. However, it is extremely challenging to analyze silent BOFs when source code is unavailable, because critical high-level information, such as the lengths of buffers and the types of variables, is lost in the binary during compilation. Our literature review shows that all the existing works [15], [16], [24], [25], [26] can only identify silent heap BOFs which overflow across the boundary of allocated memory chunks. The details of these approaches will be discussed shortly in Section II-D. Therefore, the general problem of silent BOF analysis is still an open problem.

It is worth noting that certain BOF vulnerabilities in executables could display either a visible symptom or no visible symptom in different executions, given different inputs. In this case, if the vulnerability is triggered by one or more visible executions, the resulting BOF would be a visible BOF, even though the vulnerability could also be exploited by some silent executions. Based on this fact, some works in program testing try to find vulnerabilities "shared by" visible BOFs and silent BOFs by varying input lengths [7], [9].

    int age, i, total = 0, ages[0x20];
    for(i = 0; i <= 0x20; i++){
        age = receive();
        if(age == -1)
            break;
        ages[i] = age; //overflow when i=0x20
        total += ages[i];
    }

Code 1: A piece of code with a silent buffer overflow.

However, this is only feasible when the vulnerability is caused by the excess length of inputs, and it obviously cannot solve the general problem of silent BOF analysis. Firstly, many BOF vulnerabilities only have silent BOF (execution) instances. An illustrative example is shown in Code 1. In the code, ages is an integer array of length 0x20 and i is the index used in the for loop to access ages. We can see there will be a BOF when i equals 0x20, but the program will not terminate at this point. Secondly, the length of a buffer access does not necessarily depend on the length of the input. It can depend on the length of a portion of the input, or on the value of one or several bytes in the input. Under these circumstances, it is almost impossible to know which segment of the input crashes the program.

B. Data Flow Graph
The data flow graph was first introduced in data flow machines to describe parallel computation [27]. A data flow graph is a directed graph G(N, E), where nodes in N represent instructions and edges in E represent data dependencies among the nodes. Data flow graphs are widely used in compiler optimization, such as register allocation, instruction scheduling, and dead code elimination [28]. Although there is no universally accepted definition, data flow analysis generally refers to the process of collecting and deriving information about the way the variables of a program are defined and used [29].

Data flow graphs and data flow analysis are also widely used in software security to analyze software defects, enforce security policies, and so on. Compared with a control flow graph, a data flow graph is more informative, as it contains semantic information about the program.

C. Graph Neural Networks
In recent years, deep neural networks have shown increasingly noticeable success in security domains, such as security-oriented program analysis [30], [31], [32] and anomaly detection [33], [34], due to their remarkable representation learning capabilities [35]. Some representative genres of deep neural networks are the convolutional neural network (CNN), the recurrent neural network (RNN), and the graph neural network (GNN). CNNs are developed to capture information from grid data, whereas RNNs are designed to capture sequential information. Given the nature of our proposed graph data structure DFG+, a GNN is more compelling because of its great ability in representation learning on graphs.

Convolutional graph neural networks (ConvGNNs), among other GNNs, adopt convolution operations on graphs to capture local and global structural patterns, through specially designed convolution and readout functions [36]. Standard convolutions on images or text embeddings are not applicable to graphs because graphs have irregular structures, so special convolutions have to be designed to work on graph data [17]. Depending on how the convolution is performed, existing ConvGNNs can be classified into two categories, i.e., spectral-based ConvGNNs and spatial-based ConvGNNs [37].
1) Spectral-based ConvGNN: Spectral-based convolution is defined based on spectral graph theory [38]. In this framework, a graph Laplacian is defined, and signals on graphs are filtered using the eigen-decomposition of the graph Laplacian. The graph convolution operators are introduced by defining the graph Fourier transform. However, despite the solid mathematical foundations, such approaches suffer from a large computational burden, spatial non-localization issues, and generalization problems. Considering the computational complexity of spectral-based ConvGNNs, we do not adopt them in our analysis of DFG+.

(Footnote: Some compilers may change the layout of variables in the stack. In order to keep the example simple, we do not consider variable reordering here.)
2) Spatial-based ConvGNN: The main idea of spatial-based ConvGNNs (message-passing GNNs) is to generate a node v's representation by aggregating its own features x_v and its neighbors' features x_u, where u is in the set of neighbors of v. Generally, a spatial ConvGNN layer can be defined as:

H^{(l+1)} = \sigma\Big( \sum_{s} C^{(s)} H^{(l)} W^{(l,s)} \Big),   (1)

where H^{(l)} \in R^{n \times d_l} is the latent representation of the n nodes in the l-th layer and d_l is the number of features. C^{(s)} is the s-th convolution kernel that defines how node features are propagated to neighboring nodes, W^{(l,s)} \in R^{d_l \times d_{l+1}} is the trainable weight matrix that maps the d_l-dimensional features into d_{l+1} dimensions, and \sigma is the activation function, such as ReLU or tanh. Equation 1 covers a broad class of ConvGNNs, and different designs of spatial ConvGNNs are distinguished by their convolution kernels C^{(s)} and the variability induced by W^{(l,s)} in Equation 1.

In each layer, the message passing algorithm (neighborhood aggregation) defined in Equation 1 aggregates features from a node's local neighborhood. Therefore, the node representation learned by a k-layer ConvGNN includes not only the features of the node itself, but also the features of its k-hop neighborhood and the local graph structure [36]. The node representation learning ability of ConvGNNs has been intensively investigated by recent research [39]. The great ability of ConvGNNs in modeling graph-structured data has facilitated various domains such as social network mining [40], [41], knowledge graphs [42], bioinformatics [43], and recently code similarity comparison [44]. Thus, it is promising to adopt ConvGNNs for representation learning on DFG+ to detect vulnerabilities.
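As a concrete illustration of Equation 1, the following numpy sketch implements one such layer with two toy kernels (a self-loop kernel and a mean-over-neighbors kernel). The graph, features, and weights here are made-up placeholders for illustration, not a trained model:

```python
import numpy as np

# One spatial ConvGNN layer per Equation 1:
#   H^{(l+1)} = sigma( sum_s C^(s) H^(l) W^(l,s) )
def conv_layer(H, kernels, weights):
    Z = sum(C @ H @ W for C, W in zip(kernels, weights))
    return np.maximum(Z, 0.0)          # sigma = ReLU

rng = np.random.default_rng(0)
n, d_in, d_out = 4, 3, 2               # 4 nodes, 3 input features, 2 output features

# A toy path graph 0-1-2-3 and two convolution kernels C^(s)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
C_self = np.eye(n)                                      # keep each node's own features
C_nbr = A / np.maximum(A.sum(1, keepdims=True), 1.0)    # mean over neighbors

H0 = rng.normal(size=(n, d_in))                         # initial node features
Ws = [rng.normal(size=(d_in, d_out)) for _ in range(2)]  # one W^(l,s) per kernel

H1 = conv_layer(H0, [C_self, C_nbr], Ws)
print(H1.shape)    # (4, 2): d_l-dimensional features mapped to d_{l+1} dimensions
```

Stacking k such layers gives each node a representation that depends on its k-hop neighborhood, which is the property the section above describes.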
D. Related Works
In this section, we introduce works that spot BOFs at the source code and binary levels, and discuss their limitations.
Source Code based Schemes.
The source code based schemes usually adopt source code analysis to collect semantic information (e.g., buffer lengths) and enforce their detection rules through source code instrumentation. Among several existing works [11], [12], [13], we choose one representative scheme, AddressSanitizer (ASAN), to introduce. Briefly, ASAN detects spatial bugs by reserving a Redzone around heap, stack, and global objects, and detects temporal bugs by quarantining heap and stack objects to delay the reuse of dead objects. Technically, it leverages shadow memory to mark whether an address in the program space belongs to a Redzone or not, and checks the legality of target addresses before instructions access variables in memory. Despite its effectiveness in detecting all kinds of BOFs in global, stack, and heap memory, ASAN has two limitations: firstly, ASAN needs the program's source code, and thereby cannot detect bugs in legacy code and commercial software; secondly, it fails to detect non-linear buffer overflows (an access that jumps over a Redzone).

TABLE I: Comparison of the related works' effectiveness in detecting buffer overflow.

Defence Tools        | Requires Source Code | Silent Heap BOF | Silent Stack BOF | Silent Global BOF
BOIL                 | No                   | Partial         | No               | No
AddressSanitizer     | Yes                  | Yes             | Yes              | Yes
TaintCheck           | No                   | No              | No               | No
Memcheck             | No                   | Partial         | No               | No
Fuzzing              | No                   | No              | No               | No
Symbolic Execution   | No                   | No              | No               | No
The Proposed Method  | No                   | Yes             | Yes              | Yes
Static Approaches on Binary.
Rawat et al. [45] researched the detection of potential stack-based BOF vulnerabilities in binary code. Different from traditional works that usually define vulnerability patterns at the syntactic level (e.g., function names), they considered more features of vulnerabilities at the semantic level and defined buffer overflow inducing loops (BOIL) to summarize the semantic patterns of potentially vulnerable loops. Based on the proposed patterns, they developed a prototype to identify potentially vulnerable loops in executables. The advantage of their approach is that it does not need to execute the program and can achieve high code coverage. However, as pointed out in their paper, their scheme can only deal with a special case of BOF. We think this is due to the challenge of summarizing all patterns through human effort. In addition, the positive functions reported by this scheme can only be viewed as potentially vulnerable functions and need further verification, because no concrete input is available to verify the reported BOF.
Dynamic Approaches on Binary.
Dynamic approaches analyze BOF vulnerabilities by finding vulnerable executions. At the binary level, taint analysis (also known as data flow tracking) is a popular method for debugging vulnerabilities. TaintCheck [46], proposed by James Newsome and Dawn Song, locates BOFs based on one simple assumption: in a normal data flow, pre-defined taint sources, such as user inputs, environment variables, and network data, will not propagate to pointers. Therefore, their approach cannot detect silent BOFs which do not violate this assumption.

Memcheck [47] is a well-known vulnerability analysis tool implemented on top of the dynamic binary instrumentation framework Valgrind [48]. Technically speaking, it obtains the addresses and sizes of buffers in the heap by hooking function calls to heap allocation functions and parsing their parameters and return values. By comparing the offset of a buffer access with the length of the allocated buffer in the heap, it detects heap BOFs which write outside the allocated heap chunk. As mentioned in [11], Memcheck and some other tools (Dr. Memory [24], Purify [25], and Intel Parallel Inspector [26]) that adopt similar approaches are not capable of finding out-of-bounds bugs in the stack (other than beyond the top of the stack frame), in global memory, or in the heap if the overwrite does not cross the boundary of the allocated chunk.

Fuzzing [7], symbolic execution [49], gradient descent [9], and hybrid approaches [8] are widely adopted path exploration methods to automatically find vulnerable paths in software. These schemes try to generate inputs that achieve high coverage and find inputs that can trigger bugs. However, they select positive inputs based on whether they can crash the program. In such a case, silent BOFs are ignored. Recent research works [15], [16] that try to detect silent vulnerabilities during fuzzing suffer from the same limitations as the works discussed in the last paragraph.

Table I summarizes the limitations of the related works mentioned above.
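The chunk-granularity check that Memcheck-style tools perform (record each heap allocation, then compare accesses against chunk bounds) can be sketched in a few lines. This is a simplified illustration of the idea, not Memcheck's actual implementation:

```python
# Simplified sketch of chunk-granularity heap checking: remember each
# allocation's base and size, and flag any access outside every live chunk.
chunks = {}                       # base address -> size

def on_malloc(base, size):
    """Called when the hooked allocator returns a new chunk."""
    chunks[base] = size

def check_access(addr):
    """True if addr lies inside some allocated chunk (access considered legal)."""
    return any(base <= addr < base + size for base, size in chunks.items())

on_malloc(0x1000, 0x20)           # a 32-byte heap buffer
print(check_access(0x101f))       # True:  last byte of the chunk
print(check_access(0x1020))       # False: one byte past the end is detected
```

An overflow that stays inside an allocated chunk (e.g., overrunning one field into another within the same allocation) still passes this check, which is precisely the class of silent BOFs these tools miss.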
In conclusion, the general problem of silent BOF analysis is still an open problem, and it remains a fundamental challenge to cope with the full spectrum of silent BOFs at the binary level.

III. PROBLEM FORMULATION AND CHALLENGES
In this section, we formalize the silent BOF analysis problem and present the challenges in solving it.
A. The Problem and Research Goal
In the example shown in Code 1, an integer array ages with a fixed length is defined and allocated on the stack. The loop copies one excess integer to the array, and the nearest variable total at the adjacent stack address will be overflowed, without disturbing the normal execution of the program. We name an out-of-bound access (write/read) during execution an invalid operation or invalid access, define the instruction address of the invalid operation as the overflow point, name an execution with an invalid operation a BOF instance, and call a collection of runtime information an execution trace.

With the above notations, our research goal is to locate the overflow point of a silent BOF in an executable by analyzing its execution trace. To be more specific, we want to:
1) distinguish BOF instances from normal executions;
2) locate the invalid operations in the execution trace and the overflow point in the executable;
3) pinpoint each of them separately if there is more than one overflow point in one execution trace.

B. Why is This Problem Hard?
Due to the unavailability of type information in the binary, it is not possible to identify invalid operations by comparing the offset of a buffer read/write with the length of the target buffer. As shown in Code 2, which is the assembly code generated from Code 1, the instruction at line 1 allocates memory for the local variables:

    1  sub  $0x94,%esp
    2  movl $0x0,-0x10(%ebp)
    3  movl $0x0,-0x14(%ebp)
    4  jmp  target2
    5  target1:
    6  call

Code 2: Assembly code generated from Code 1.
The challenges faced by traditional methods motivate us to solve the problem with deep neural networks. Although the complexity of the patterns needed to identify silent BOFs is a daunting challenge for human analysts developing heuristic rules, it may not be a challenge for a deep learning algorithm given enough training data. Accordingly, we propose to spot silent BOFs based on graph neural network assisted data flow analysis. In this section, we first provide several insights based on our domain knowledge, which motivate our choice of technical approach; these insights will be verified through several experiments. Then, we provide an overview of our proposed approach and point out the challenges we must address.
A. Insights

1) The Essential Information to be Captured for BOF Analysis: Through dynamic binary instrumentation, lots of information can be collected along with a program's execution, such as the control flow, data flow, accessed memory, values of operands for each instruction, and the executed instruction sequence. However, not all of this information is useful for identifying silent BOFs. If unnecessary information gets included in the training data, it will introduce noise and reduce the accuracy of the model. Hence, two questions are raised, and the answers to them are associated with domain knowledge about buffer overflow and dynamic program analysis:

Q1. What information should be selected?

Q2. How should the data structure that holds the data be designed?

Firstly, as discussed in Section II-A, most silent BOFs do not violate a valid control flow (through corrupted code pointers), but they always violate a valid data flow. Therefore, the data flow is meaningful to integrate into our training data. A data flow graph (DFG) is the most popular way to represent a program's data flow. Secondly, the spatial information (variable layout) is useful for diagnosing BOF, because BOF is a spatial error [1]. We will discuss the challenge of representing the spatial information later on. Thirdly, some other (implicit) information flows, such as the flow from a data pointer to the variable it points to, and the flow from condition variables to branch targets, could be useful to infer whether variables are pointers or loop control variables. This information could be of great importance, as many BOFs happen due to unsafe pointer dereferences in a loop.

Fourthly, besides the information flow, the information itself, i.e., the values of variables, could also be useful. In fact, the values of certain variables like loop control variables could be very useful for analyzing BOFs. However, 1) there is no deterministic relationship between the value of a loop control variable and a BOF, and 2) the values of variables can be very noisy and hard to interpret at the binary level. Therefore, we decided not to include variable values in our data. Instead, we propose to incorporate some attributes of variables, such as whether a variable is an immediate or is copied from the user input.

Based on the above insights, we leverage a novel graph data structure to capture the essential information. Since the graph we build is based on the program's runtime data flow graph, together with some spatial information, we call it Data Flow Graph Plus (DFG+).
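To make the idea concrete, here is a hypothetical sketch of such a structure: nodes carrying simple attributes (e.g., whether a value is an immediate or comes from user input), plus typed edges for data flow, spatial adjacency, and implicit flow. The node names, attribute set, and edge-type names are illustrative assumptions on our part, not the exact DFG+ schema:

```python
from collections import namedtuple

# Illustrative (assumed) node attributes and edge types for a DFG+-like graph.
Node = namedtuple("Node", "nid is_immediate from_input")
Edge = namedtuple("Edge", "src dst etype")

EDGE_TYPES = {"data_flow", "adjacent", "pointer_deref", "branch_cond"}

class ToyDFGPlus:
    def __init__(self):
        self.nodes, self.edges = {}, []

    def add_node(self, nid, is_immediate=False, from_input=False):
        self.nodes[nid] = Node(nid, is_immediate, from_input)

    def add_edge(self, src, dst, etype):
        assert etype in EDGE_TYPES            # every edge carries a relation type
        self.edges.append(Edge(src, dst, etype))

g = ToyDFGPlus()
g.add_node("ages[i]")
g.add_node("total")
g.add_node("i")
g.add_edge("ages[i]", "total", "adjacent")    # spatial layout encoded as a relation
g.add_edge("i", "ages[i]", "data_flow")       # the index value flows into the access
print(len(g.nodes), len(g.edges))             # 3 2
```

The key design point reflected here is that spatial layout is stored as a relation between nodes rather than as a node feature, which matches the discussion of spatial information in Section V-A.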
2) Model Selection:

Q3. Why is this a node classification problem?

Given a graph structure with its nodes and edges, graph analysis tasks can be grouped into three categories [17]: graph classification, node classification, and link prediction. Graph classification aims to classify graphs into different types. When applied to our DFG+, the problem becomes classifying whether a program execution contains a silent BOF or not. Since the goal of our work is not only to identify a vulnerable execution, but also to locate the invalid operations inside the execution, we cannot follow the graph classification task. Link prediction is the problem of inferring missing relationships between entities (nodes), which also does not fit our need. Node classification, on the other hand, aims to classify nodes into different categories. If adopted, it could distinguish vulnerable nodes from benign nodes in the graph, and the vulnerable point in the execution trace can be located by mapping nodes from the graph to the program trace. Hence, the proposed research goals can be achieved by solving a node classification problem.
Q4. Why is a graph neural network a promising approach?

Firstly, a graph neural network can learn from both node features and graph structure, which is exactly how the DFG+ encodes data flow information. Secondly, deep learning has shown very promising results in some reverse engineering works, such as [30], [31]. In these works, it has shown superior performance compared with traditional methods, which indicates its great learning ability. Compared with other machine learning algorithms, deep neural networks have two advantages, as stated in [30]: "first, neural networks can learn directly from the original representation with minimal feature engineering" and "second, neural networks can learn end-to-end, where each of its constituent stages are trained simultaneously in order to best solve the end goal".
Q5. Why do we choose a relational graph neural network?

Given the DFG+ with multiple types of edges as the training data, it is natural to adopt a relational graph neural network. RGCN [21] was originally proposed to represent knowledge bases with entities and triples as directed labeled multi-graphs. The entities are treated as nodes and the triples of the form (subject, predicate, object) are encoded by labeled edges, which is similar to the data structure in DFG+. There are other models capable of modeling graph-structured data, but we think they are not suitable in our case. Graph recurrent neural networks (GRNN) [50] work on dynamic graphs where the graphs are evolving over time [51]. The GraphSAGE network [52] does not consider different types of edges in node classification. Heterogeneous graph neural networks [53] aggregate heterogeneous attributes or contents associated with nodes, which is overly complicated for DFG+, where nodes contain relatively easy-to-encode attributes. Finally, label propagation [54] relies on a nearest neighbor graph to generate pseudo-labels for nodes and is often used in semi-supervised learning [55].

Fig. 1: Approach overview (training phase: source code instrumentation and a runtime analyzer produce labeled DFG+ for BRGCN model training; testing phase: the runtime analyzer produces unlabeled DFG+ from executables, and the trained BRGCN outputs vulnerability information).
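The bi-directional relation idea can be illustrated with a minimal numpy sketch of an RGCN-style layer that, for each relation type, also propagates along the inverse edge direction with a separate weight matrix. The graph and weights below are random placeholders for illustration, not the trained BRGCN:

```python
import numpy as np

# RGCN-style layer with bi-directional relation types: each relation s gets a
# forward weight matrix (propagation along edges) and a backward weight matrix
# (propagation along the transposed adjacency, i.e., the inverse relation).
def bidirectional_relational_layer(H, adjacencies, weights):
    Z = np.zeros((H.shape[0], weights[0][0].shape[1]))
    for A, (W_fwd, W_bwd) in zip(adjacencies, weights):
        Z += A @ H @ W_fwd            # forward relation
        Z += A.T @ H @ W_bwd          # inverse relation (bi-directional pass)
    return np.maximum(Z, 0.0)         # ReLU activation

rng = np.random.default_rng(1)
n, d = 5, 4                           # 5 nodes, 4 features
# One toy relation type (e.g., data flow); DFG+ would contribute several.
A_dataflow = (rng.random((n, n)) < 0.3).astype(float)
H0 = rng.normal(size=(n, d))
Ws = [(rng.normal(size=(d, d)), rng.normal(size=(d, d)))]  # (forward, backward)

H1 = bidirectional_relational_layer(H0, [A_dataflow], Ws)
print(H1.shape)    # (5, 4)
```

Fitting this into Equation 1, each directed relation simply contributes two convolution kernels, the adjacency matrix and its transpose, each with its own trainable weights.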
B. Overview

Fig. 1 provides an overview of our proposed method, which consists of two major phases. In the training phase, we develop a runtime analyzer based on Intel Pin to trace program runtime information and organize it into a graph structure (i.e., DFG+). In the training samples, the locations of invalid operations in silent BOF executions are obtained through source code instrumentation. The invalid operations are reflected as vulnerable labels on nodes in the graph. We then train a BRGCN model on the labeled DFG+ data for testing in the following phase.

In the testing phase, the trained BRGCN model is used to predict silent BOFs in programs for which only the binary is available. The analyzer traces the program runtime information and constructs DFG+ without node labels. It also generates maps that map each node of DFG+ to instructions in the program and in the execution trace. After labels are predicted for each node in DFG+, the mapping can help us locate the invalid operation of the silent BOF.
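The final step of the testing phase can be sketched as follows: given predicted node labels and a node-to-instruction map, report the overflow points. The map format, names, and addresses here are illustrative assumptions, not the analyzer's actual output format:

```python
# Hypothetical sketch of mapping predicted "vulnerable" nodes back to the
# executable and the trace. node_map is an assumed format:
#   node_id -> (instruction address in executable, index in execution trace)
def locate_overflow_points(pred_labels, node_map):
    points = [node_map[n] for n, lab in pred_labels.items() if lab == "vulnerable"]
    return sorted(points, key=lambda p: p[1])   # order by position in the trace

labels = {"n1": "benign", "n2": "vulnerable", "n3": "benign"}
nmap = {"n1": (0x8048a10, 17), "n2": (0x8048a3c, 42), "n3": (0x8048a50, 99)}
print(locate_overflow_points(labels, nmap))     # the overflow point for node n2
```

Sorting by trace index lets multiple overflow points in one execution trace be reported separately, matching research goal 3) in Section III-A.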
C. Challenges

Challenge 1: How to apply a model trained on multiple programs/graphs to a previously-unseen program? Due to the specificity of each program, selecting more semantic information from program executions inevitably introduces knowledge tied to particular program logic into our dataset, which may not hold in other programs. Such knowledge, if learned by the model, hurts the model's generalization ability. As a result, some previous works [56], [57] on neural-network-assisted fuzzing can only train and test a model on the same program. We discuss how to cope with this challenge when presenting the design of DFG+ and the graph neural network in Section V-A and Section V-D, respectively.
Challenge 2: How to generate labels for DFG+? Generally, training a high-quality model requires a fair amount of training samples with ground truth. We do not want to manually label the nodes in DFG+, which would require substantial human effort. Hence, we need a tool that labels the data samples automatically. The details of how the data are labeled are presented in Section V-C1.
Challenge 3: How to represent spatial information in a deep-learning-friendly manner? Adding variable addresses to the training data would be the simplest way to include spatial information. However, it is hard for a deep learning model to learn the variable layout from raw variable addresses. We discuss how we represent spatial information so that the deep learning model can quickly capture it in Section V-A.
Challenge 4: How to cope with an extremely unbalanced dataset? Each DFG+ generated from a program execution has more than 200,000 nodes on average, but only a few of them are vulnerability nodes, which means the dataset is extremely unbalanced.

V. DESIGN AND IMPLEMENTATION
In this section, we first introduce the design of DFG+, the technical details of the compiler plugin and runtime analyzer, and how they work together to generate labeled DFG+. Then, we present the BRGCN and how it helps to spot silent BOFs.

A. DFG+

1) Spatial Information:
The trained model should capture general information and ignore program-specific information, so that a model trained on one set of DFG+ can be applied to another set. General information is knowledge shared among programs, such as the knowledge needed to determine whether two variables are adjacent to each other. Program-specific information is knowledge that only comes with a specific program, for example, the fact that a particular integer variable is located at a particular address.

To encode spatial information, there are two potential methods: integrate the address of each variable into the variable attributes in the data flow graph, or use relations to reflect the adjacency relationships of variables. We did not choose the first method due to two observations: 1) spatial information, such as the adjacency of two variables, consists of specific relations between variables, rather than entities or attributes, and therefore should not be encoded as node features; 2) the value of a variable's address is always tied to a concrete execution and will change if a program is compiled with different compilers or options, run in a different system environment (e.g., a different heap allocation), or even across executions (e.g., a different loading address due to ASLR [58]). Integrating addresses into the data flow graph would introduce program-specific information that is not helpful for the model. Therefore, we instead use relations (edges in a DFG+) to indicate whether two variables are adjacent to each other. In this way, we can represent spatial information in a deep-learning-friendly manner.
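The address-agnostic encoding can be sketched as follows. This is our own illustration, not the paper's tool: the variable names, sizes, and addresses are invented, and the point is only that the derived a-edge relation is identical across executions even when the raw addresses differ (e.g., under ASLR).

```python
# Sketch: derive a-edges from a snapshot of live-variable addresses, so the
# graph stores only the *relation* "adjacent", never concrete addresses.

def adjacency_edges(variables):
    """variables: list of (name, address, size); returns directed a-edges
    pointing from the lower-address variable to the higher one."""
    ordered = sorted(variables, key=lambda v: v[1])
    edges = []
    for (n1, a1, s1), (n2, a2, _) in zip(ordered, ordered[1:]):
        if a1 + s1 == a2:           # byte-adjacent in memory
            edges.append((n1, n2))  # edge direction encodes low -> high
    return edges

# Two executions of the same program: addresses shift under ASLR,
# but the extracted adjacency relation is identical.
run1 = [("var1", 0xFFFFD000, 4), ("buf1", 0xFFFFD004, 8), ("var2", 0xFFFFD00C, 4)]
run2 = [(n, a + 0x2000, s) for n, a, s in run1]  # different load address
assert adjacency_edges(run1) == adjacency_edges(run2) == [("var1", "buf1"), ("buf1", "var2")]
```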
2) Basic Design: Using the terminology of data flow analysis [59], a live variable is defined when an instruction writes a value to the variable, and a live variable is used when an instruction reads the value of the variable. A variable is live at a program point p if the current value of the variable will be used in the future. A live variable v is dead at a program point q if, after q, the value of v is redefined before it is used, or is never used again.

Nodes of Graph. A node in DFG+ represents a live variable. Therefore, multiple nodes will be created for a variable if the variable is defined and redefined along the program execution. Note that a "variable" in our context refers not only to variables defined in source code; it can be any instruction operand (e.g., the return address on the stack, a register, or an immediate value). According to the attributes of the variable that a node corresponds to, we group nodes into 4 types:
- Memory Node (m-node) denotes a live variable stored in memory.
- Register Node (r-node) denotes a live variable stored in a register.
- Immediate Node (i-node) denotes an immediate operand.
- External Node (e-node) denotes a variable defined by a system call. The e-node is a special type of node for variables associated with external data (e.g., user inputs, environment variables, and so on). Such input data usually contain dangerous variables, which could result in a BOF.
Edges of Graph. We define 5 classes of directed edges in DFG+ to reflect a program's direct or implicit information flow and its spatial information. Note that each "variable" in the following list corresponds to a node in the graph:
- Data Flow Edge (d-edge) denotes a direct information flow from a source variable to a target variable. A direct information flow exists if the value of the source variable is used to calculate the value of the target variable.
- Adjacency Edge (a-edge) denotes that two variables are adjacent to each other. The direction of an a-edge denotes the relative order (higher or lower) of the two variable addresses.
- Index Edge (i-edge) denotes an implicit information flow (implicit data flow). An information flow is implicit if a pointer or offset a is used to address a variable b to be read or written.
- Redefine Edge (r-edge) denotes that a live variable is covered by another live variable. The r-edge not only indicates that the two live variables are at the same address, but also implies the order of data flow for this variable.
- Comparison Edge (c-edge) denotes another kind of implicit information flow, which happens when a live variable is compared with another live variable. The values of the operands affect the value in the eflags register, which then affects the target of a conditional branch.

Fig. 2 shows a DFG+ generated from the execution of the piece of code in Code 2. Let us take the nodes and edges generated from sub $0x94, %esp as an example to demonstrate how the graph is generated. Three nodes represent the immediate $0x94, the source operand %esp, and the destination operand %esp, respectively. There is a d-edge and an r-edge between the two %esp nodes because the old value in %esp was used to calculate the new value of %esp, and the live variable in %esp is redefined. Besides, an a-edge denotes that the live variables in -0x10(%ebp) and -0x14(%ebp) are adjacent to each other. An i-edge denotes that -0x14(%ebp) is used to address the variable in -0x10(%ebp). A c-edge denotes that the comparison between the live variable in -0x10(%ebp) and the immediate $0x20 determines the value in %eflags.

Fig. 2: A DFG+ generated from the execution of a piece of code.

Labels of Nodes.
There are two types of labels for graph nodes: 1) the vulnerable label indicates that the node was generated from an invalid operation in a silent BOF; 2) the benign label indicates that it was generated from a normal operation.
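The per-instruction construction described above can be sketched in a few lines. This is our own minimal scaffolding, not the paper's Pin-based constructor; the node ids and the tiny graph class are invented for illustration, and only the d-edge/r-edge handling for sub $0x94, %esp is shown.

```python
# Illustrative sketch of how a single instruction grows the DFG+.

class DFGPlus:
    def __init__(self):
        self.nodes = {}   # node id -> node kind
        self.edges = []   # (src id, dst id, edge type)
        self.live = {}    # location (register/address) -> current live node id
        self._next = 0

    def new_node(self, kind, loc=None):
        nid = self._next; self._next += 1
        self.nodes[nid] = kind
        if loc is not None:
            if loc in self.live:  # old live value at this location is covered:
                self.edges.append((self.live[loc], nid, "r-edge"))
            self.live[loc] = nid
        return nid

g = DFGPlus()
imm  = g.new_node("i-node")          # $0x94
esp0 = g.new_node("r-node", "%esp")  # source operand %esp (live before)
esp1 = g.new_node("r-node", "%esp")  # destination %esp (redefines the register)
# old %esp and the immediate are both used to compute the new %esp:
g.edges.append((esp0, esp1, "d-edge"))
g.edges.append((imm,  esp1, "d-edge"))
assert (esp0, esp1, "r-edge") in g.edges  # redefinition recorded automatically
assert g.nodes[imm] == "i-node" and g.live["%esp"] == esp1
```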
3) Reflection: Through this novel design, the variable types, information flows, and adjacency relationships of variables gathered through program tracing are represented by node features and by graph structure through different types of edges (relations). From the graph, we can clearly see the different structures associated with different types of operations, but also how difficult it would be to compare the graph structures of vulnerable and benign nodes manually. In Section V-D, we show how a graph neural network captures these features through representation learning.

Here, we discuss how the design of DFG+ helps to overcome Challenge 1. The design aims to encode general information and eliminate program-specific information, so that a model trained on some programs can be applied to other programs. Specifically, variable addresses, variable values, and opcodes, which are tightly associated with a specific program, are not included in DFG+. Instead, we select address-agnostic and value-agnostic features (information flow, variable adjacency, and general variable features) from the execution trace, and encode them as different types of edges and node features in DFG+, so that a model trained on the training set can be applied to predict labels on the testing DFG+.

B. Compiler Plugin for Data Labeling
We implement a tool that inserts code into the binary through source code instrumentation, which can automatically distinguish vulnerable and benign operations in a program execution. As discussed in the related work, ASAN can detect out-of-bounds memory accesses (i.e., invalid operations) in BOF executions. Therefore, we leverage ASAN to detect the invalid vulnerable operations, which helps the graph constructor (to be discussed in the next subsection) label the nodes. However, ASAN has four features that pose 4 problems in our scenario: 1) ASAN inserts extra instructions before memory allocation, access, and destruction. 2) ASAN inserts Redzones among variables, which changes the adjacency relationships of variables. 3) ASAN reports memory errors by outputting vulnerability information and then terminating the execution. 4) ASAN can only detect BOFs at the function level for functions linked from external libraries. Specifically, ASAN hooks calls to library functions and provides wrapper functions that check whether a BOF happens by analyzing the parameters passed to these library functions.

The extra instructions inserted by ASAN introduce irrelevant information flow, and the inserted Redzones break some a-edges in the constructed DFG+. Besides, if the execution terminates at the point of the first invalid access, the data flow afterwards will be missing, so we have to modify ASAN to make it report invalid operations without terminating the program's execution. In the following paragraphs, we show how we solve these problems.
1) How To Exclude Irrelevant Data Flow from Instructions Inserted by ASAN?: To deal with the first problem, we need to distinguish the program's original instructions from the instructions inserted by ASAN. To achieve this goal, we modify the compiler plugin of ASAN to insert a pair of instructions (i.e., prefetcht1 and prefetcht2) at the beginning and end of each piece of code inserted by ASAN. The pair of instructions serves as indicators that can be easily recognized and skipped when the runtime analyzer builds the DFG+ along with the program execution. We adopt prefetch instructions because prefetch has no side effect on the program's runtime state and we can easily disable them.
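The marker-based skipping logic amounts to a small state machine over the trace. A minimal sketch, with simplified string mnemonics standing in for decoded instructions (the real analyzer works on Pin's instruction objects, not strings):

```python
# Sketch: drop everything between a prefetcht1 / prefetcht2 marker pair
# (i.e., the ASAN-inserted check code) from an instruction trace.

def filter_trace(trace):
    keep, in_asan = [], False
    for ins in trace:
        if ins == "prefetcht1":    # start of ASAN-inserted region: stop tracing
            in_asan = True
        elif ins == "prefetcht2":  # end of region: resume tracing
            in_asan = False
        elif not in_asan:
            keep.append(ins)       # original program instruction
    return keep

trace = ["mov", "prefetcht1", "cmp", "ja", "prefetcht2", "sub", "mov"]
assert filter_trace(trace) == ["mov", "sub", "mov"]
```

Note that this naive version would also drop program instructions that float into the marked region; the floated-instruction heuristic discussed later in Section V-C1 exists precisely to avoid that loss.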
2) How to Restore the Relation of Variable Adjacency?: To handle the second problem, we leverage the shadow memory to restore the original adjacency relationships of variables. Specifically, the shadow memory maintained by ASAN's runtime environment records the locations of the inserted Redzones in the address space of the target program. The compiler plugin saves the configuration of the shadow memory and shares it with the graph constructor. Through this configuration, the graph constructor can query the shadow memory and restore the original adjacency relationships of variables. We discuss the details of how the adjacency relationships are restored in Section V-C2.
3) How to Label Nodes Generated from Vulnerable Operations?: To solve this problem, we let the compiler plugin emit prefetcha as an indicator before each suspicious instruction (one that results in an out-of-bounds read/write). ASAN checks the validity of the target address before each suspicious memory access, so prefetcha will only be executed when a memory error is detected, before it actually happens. In this way, the runtime analysis routine is notified through prefetcha and assigns different labels to nodes accordingly. Thus, we achieve our goal without terminating the program execution or introducing any irrelevant data flow.
4) How to Identify Vulnerable Operations in Library Functions?: To solve the last problem, we instrument the necessary libraries with our customized compiler, then link the instrumented library functions into the target program. However, we observe that the most commonly used library on Linux, glibc, cannot be compiled by LLVM due to some unsupported features, and llvm-libc is still in the planning phase [60]. Alternatively, we only instrument vulnerable functions in glibc, such as scanf and strcpy. Then, in the runtime library (runtime-rt [61]) of LLVM, we hook calls to these vulnerable functions and redirect execution to the instrumented ones. In such a case, the vulnerable nodes in glibc can be labeled accurately.

CAVEAT. The customized compiler plugin is only used to help the runtime analyzer assign labels to vulnerable nodes in the built graph. The runtime analyzer assumes all other nodes in the graphs are benign. Therefore, there is no need to instrument functions without vulnerabilities in the libraries. However, the memory allocation and free functions, such as malloc and free, are special cases: even though no BOF happens in these functions, we still need to instrument them to update the shadow memory.
C. DFG+ Construction based on Runtime Analyzer
The runtime analyzer is implemented based on Intel Pin [18], and builds the DFG+ along with the program's execution. Intel Pin provides comprehensive APIs for code inspection and instrumentation: the inspection APIs help to analyze instructions in the binary, and the code instrumentation APIs help to instrument code according to the results of the inspection. The developed runtime analyzer consists of three components: dynamic code analysis and instrumentation, memory layout restoration, and graph construction. Fig. 3 demonstrates the whole workflow.
1) Dynamic Code Analysis and Instrumentation:
The dynamic code instrumentation consists of three phases: code inspection, code instrumentation, and runtime analysis. Before code instrumentation, the analyzer first analyzes instructions and system calls. Three types of callback functions are registered according to the analysis results:
• Instruction Callback. The structure of the information flow is easily understood from examples: the code analysis routine defines the structures of information flow in mov 0x8048000, %eax and sub %eax, %ebx, respectively. Callback functions are then registered to instructions according to the types and structures of information flow, as demonstrated in Fig. 3.
• System Call Callback. Some system calls copy external data into the program space, and the variables therein should be recognized as e-nodes. Callback functions are registered to these system calls to label the corresponding memory regions at runtime.
• Control Callback.
Two callback functions are registered to prefetcht1 and prefetcht2 to stop and resume the runtime tracing, respectively, so that the pieces of code inserted by ASAN can be skipped. A callback function is registered to prefetcha to receive the signal about invalid operations and assign labels to the corresponding vulnerable nodes accordingly.

The compiler plugin based on LLVM instruments code on the intermediate representation (IR) during compilation. During experiments, we observe that some instructions that reside outside of a prefetcht1-prefetcht2 pair at the IR level float, due to instruction reordering [62], to positions that are enclosed by prefetcht1-prefetcht2. In such a case, the information flow resulting from the floated instructions would be lost if we simply stopped the analysis process when execution enters code enclosed by prefetcht1-prefetcht2.

To solve the problem, we adopt static data flow analysis to identify the floated instructions inside the prefetcht1-prefetcht2 pair based on one heuristic rule: an instruction
Fig. 3: Workflow of the runtime analyzer to build DFG+ with node labels.

i inside a prefetcht1-prefetcht2 pair is a floated instruction if there is a data dependency between i and an instruction j that comes after prefetcht2. Accordingly, we do not exclude the information flow resulting from the floated instructions in the runtime analyzer. Then, along with the program execution, the callback functions capture the information flow and the accessed memory addresses from executed instructions, and send them to the graph constructor and the adjacency relationship restorer.
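The heuristic rule above can be sketched as a simple def-use check. This is our own simplification: instructions are modeled as (name, reads, writes) tuples over abstract locations, whereas the real analyzer resolves operands from the IR.

```python
# Sketch of the floated-instruction heuristic: an instruction inside the
# prefetcht1-prefetcht2 pair is treated as floated program code if a later
# instruction, outside the pair, reads a location it writes.

def floated(inside, after):
    """inside: instructions between the markers; after: instructions past
    prefetcht2. Returns names of inside-instructions with a dependency out."""
    used_later = set()
    for _, reads, _ in after:
        used_later |= set(reads)
    return [name for name, _, writes in inside if set(writes) & used_later]

inside = [("i1", ["%eax"], ["shadow"]),  # pure ASAN check: result unused later
          ("i2", ["%ebp"], ["%ecx"])]    # floated: %ecx is consumed afterwards
after  = [("j1", ["%ecx"], ["%edx"])]
assert floated(inside, after) == ["i2"]
```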
2) Adjacency Relationships Restoration:
We observe that there are three kinds of changed adjacency relationships of variables, each requiring a different treatment. We show these three cases based on Fig. 4, which shows the layout of a memory fragment before and after instrumentation by ASAN.

Firstly, case ① does not need any restoration. For byte(s) inside a buffer or variable, the inserted Redzones do not affect the adjacency relationships, so no restoration is needed.

Secondly, case ② needs restoration. For byte(s) on the boundary of a buffer or variable, the adjacency relationships change because of the inserted Redzone. For example, the adjacent bytes of byte i+4 in the ML are bytes i+3 and i+5. However, in the ML w/ Redzone, the adjacent bytes of byte j+12 are bytes j+11 and j+13, and byte j+11 is located in a Redzone. To restore the adjacency relationships for this kind of byte, we find the real adjacent bytes by skipping the bytes in the Redzone. By skipping Redzone2, the real adjacent byte of byte j+12, i.e., j+7, can be found.

(a) Original memory layout (ML): ... | var2 (i+12) | buf1 (i+4) | var1 (i)
(b) Memory layout after code instrumentation (w/ Redzone): ... | red4 (j+28) | var2 (j+24) | red3 (j+20) | buf1 (j+12) | red2 (j+8) | var1 (j+4) | red1 (j)
Fig. 4: Comparison between memory layouts with and without Redzone.

Thirdly, case ③, which happens in BOF, also needs restoration. When an out-of-bounds access happens in the ML w/ Redzone, one or several bytes (e.g., x) in a Redzone will be read/written. If the invalid access is mapped to the ML, the out-of-bounds read/write touches byte(s) near the vulnerable buffer at a higher address. The following three steps find the corresponding byte(s) in the ML w/ Redzone:
1) First, find the boundary byte (b) of the BOF, which is the byte next to the first overflowed byte at the lower address.
2) Second, calculate the distance (d) between b and x.
3) Third, find the byte(s) (y) by shifting d bytes upward from b, while skipping all bytes in Redzones.
After mapping the byte(s) x in a Redzone to byte(s) y outside the Redzone, the adjacent bytes found through the strategies of cases ①, ②, and ③ for y are the restored adjacent bytes for x. For example, through the aforementioned strategies, we can map the byte at j+21 to the byte at j+25, and find its real adjacent bytes at j+24 and j+26.
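The three-step mapping can be sketched directly on the Fig. 4 layout. The helper below is our own illustration; the step count follows the worked example (j+21 maps to j+25), and the byte offsets and redzone extent are the assumed Fig. 4 values.

```python
# Sketch of the three-step redzone-to-original-layout mapping (case 3).

def map_redzone_byte(x, boundary, is_redzone):
    """x: overflowed offset inside a Redzone; boundary: last valid byte of the
    vulnerable buffer (step 1). Returns an offset y outside any Redzone."""
    d = x - boundary            # step 2: distance between b and x
    y = boundary
    while d > 0:                # step 3: shift d bytes toward higher
        y += 1                  # addresses, skipping every Redzone byte
        if not is_redzone(y):
            d -= 1
    return y

# Fig. 4(b): buf1 ends at j+19, red3 occupies j+20..j+23, var2 starts at j+24.
redzone = lambda off: 20 <= off <= 23
assert map_redzone_byte(21, 19, redzone) == 25   # byte j+21 maps to j+25
```

Once y is known, its neighbors (here j+24 and j+26) are taken as the restored adjacent bytes for x, exactly as in the prose example.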
3) Graph Construction:
After the information flows are captured and filtered through the callback functions at runtime, and the adjacency relationships are restored through the aforementioned three strategies, it is straightforward to construct the DFG+. We do not cover the details of graph construction.
Supporting Data.
Some data, referred to as supporting data in Fig. 3, are important not only in the graph construction phase but also in the vulnerability identification phase. Examples include: 1) a map from each node in DFG+ to the address of its corresponding variable, and 2) a map from each node to the instruction that creates the node. We show how this information is used in Section V-E.

CAVEAT. After the model is trained, we no longer need source code to generate labels, as the model predicts them for us. Building the unlabeled graphs for binary-only programs in the testing phase, as shown in Fig. 1, is much easier: since the analyzed programs are not instrumented, there is no need to exclude irrelevant instructions or restore the adjacency relationships of variables.
D. Our Graph Neural Network
DFG+ is a novel graph data structure that holds variable attributes, program information flow, and variable layout. Generally, the vulnerable data flow in the execution context and the variable layout for variables corrupted by a silent buffer overflow could differ from those of non-affected variables. In other words, the local graph centered at a vulnerable node would be slightly different from that of a benign node. Thus, detecting the vulnerability is equivalent to node classification considering the local graph centered at each node, and we need a model able to learn node representations that capture the local graph structure and neighborhood information, which facilitates differentiating vulnerable nodes from benign nodes. Thanks to the message-passing mechanism, GNNs are good at learning node representations by aggregating a node's neighborhood information. Thus, we adopt GNNs for DFG+. As DFG+ has different types of nodes and edges, we propose to adopt RGCN [21] as our basic model, because it was developed for representation learning on knowledge graphs, which also have different types of nodes and edges.

Essentially, a multi-layer RGCN learns the representation of a node v_i by aggregating the features (attributes) of v_i and its neighbors through message passing. In particular, for different types of edges/nodes, it uses different parameters during message passing, thereby preserving the edge/node type information. The propagation rule of RGCN in the l-th layer for calculating the forward-pass update of a node v_i is:

h_i^{(l+1)} = \sigma\Big( \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_s^{(l)} h_i^{(l)} \Big),    (2)

where N_i^r denotes the set of neighbor indices of node i under relation r ∈ R, and R is the set of relations (edge types). c_{i,r} is a normalization constant that we set to the count of neighbors of node i under relation r. W_r^{(l)} is the relation-specific transformation matrix for relation r, which enables relation-specific message passing, thus preserving the edge-type relationship. To ensure that the representation of a node at layer l+1 is also informed by its representation at layer l, a self-connection term (i.e., W_s^{(l)}) is added. All messages passed along incoming edges are aggregated through an element-wise activation function σ(·). W_r^{(l)} and W_s^{(l)} are the parameters to be learned. By stacking K layers of RGCN together, the representation of node v_i can capture the K-hop local graph information centered at v_i.

Limitation of RGCN.
Equation 2 is the basic design of RGCN, which has shown promising results in early research [21]. For node classification in DFG+, however, the features of a node x's outgoing nodes are not used in an appropriate way. If N_i^r is defined as the set of neighbor indices of node v_i under relation r through incoming edges, messages can only pass along those directions. As a consequence, the node representation learned by the network only aggregates features from incoming nodes, and some important features from outgoing nodes are lost. For example, if a global variable is used twice at runtime, its corresponding node x in the DFG+ will have two outgoing d-edges, to nodes y and z respectively, i.e., y ← x → z. In this case, features cannot propagate from y to z or from z to y through x, which is undesirable because node y could be very useful for classifying node z and vice versa. For nodes without incoming links, such as x in the previous example, no information will propagate to them at all, and thus we cannot learn good representations for them.

RGCN with Bi-directional Propagation. One straightforward solution to the above issues would be to ignore the direction of each edge: if N_i^r is defined as the set of neighbor indices of node i under relation r through either incoming or outgoing edges, messages get processed with the same relation-specific transformation W_r^{(l)}. However, this ignores the difference between the incoming and outgoing directions. From the observations above, we extend the basic design to bi-directional propagation for directed graphs. Specifically, we adopt two sets of parameters for each type of edge:
1) W_r^{in} is used to propagate messages along the direction of a directed edge;
2) W_r^{out} is used to propagate messages against the direction of a directed edge.
We define the propagation rule as:

h_i^{(l+1)} = \sigma\Big( \sum_{r \in \mathcal{R}} \Big( \sum_{j \in \mathcal{IN}_i^r} \frac{1}{c_{i,r}} W_r^{in(l)} h_j^{(l)} + \sum_{k \in \mathcal{OUT}_i^r} \frac{1}{c_{i,r}} W_r^{out(l)} h_k^{(l)} \Big) + W_0^{(l)} h_i^{(l)} \Big),    (3)

where IN_i^r and OUT_i^r denote the sets of incoming and outgoing neighbors of node i under relation r ∈ R, respectively. The transformation W^{(l)} applied depends on the type and direction of the edge. By designing two sets of weights for the two directions, we make sure the information of node y has a chance to propagate to node z and vice versa (the example shown above). In our evaluation, we quantitatively evaluate the model represented by Equation 3 and compare its effectiveness with the basic design denoted by Equation 2.

Moreover, we deprecate the common one-hot encoding of node IDs adopted by [21]. Instead, we use the type of a node as its features, and expect the model to behave the same regardless of node order. We evaluate its effectiveness in Section VII.

Fig. 5(a) shows the framework of BRGCN, which takes
DFG+ as input and predicts the labels for each node. The DFG+ consists of different types of nodes and edges, which are marked with different colors in the figure. The W matrices in Equation 3 are parameters of the model, learned during the training phase. Initially, the node features in the DFG+ are embedded and fed into the model as the input of the first layer. Then, layer l computes the updated feature (latent node representation h^{(l+1)}) for each node v_i by aggregating features from its neighbors and itself. The output of the previous layer becomes the input to the next layer. Finally, in the output layer, a softmax(·) activation is applied to generate label probabilities.

Fig. 5(b) illustrates the message passing when calculating the updated feature for a node i in layer l. Features from neighboring nodes are gathered and then transformed for each relation type individually, with a different transformation matrix W for each type and direction of edge. For example, W_blue^{out(l)} is the transformation matrix for outgoing blue edges. The resulting representations are accumulated and normalized. We choose ReLU as the activation function in our model.
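The bi-directional rule of Equation 3 can be checked on the y ← x → z example with a minimal numpy sketch. This is our own toy illustration: a single relation type, identity weight matrices instead of learned parameters, and per-direction normalization c_{i,r} set to the neighbor count, as assumed simplifications.

```python
import numpy as np

def brgcn_layer(h, edges, W_in, W_out, W0):
    """One bi-directional propagation step (Equation 3) with ReLU as sigma."""
    out = h @ W0.T                                 # self-connection term W0 h_i
    for i in range(h.shape[0]):
        IN  = [j for (j, k) in edges if k == i]    # incoming neighbors of i
        OUT = [k for (j, k) in edges if j == i]    # outgoing neighbors of i
        for j in IN:
            out[i] += (W_in @ h[j]) / len(IN)      # along edge direction
        for k in OUT:
            out[i] += (W_out @ h[k]) / len(OUT)    # against edge direction
    return np.maximum(out, 0)                      # element-wise ReLU

h = np.eye(3)                  # nodes 0=x, 1=y, 2=z with distinct features
edges = [(0, 1), (0, 2)]       # y <- x -> z: x's two outgoing d-edges
I = np.eye(3)
h1 = brgcn_layer(h,  edges, I, I, I)
h2 = brgcn_layer(h1, edges, I, I, I)
# After two layers, y's feature has reached z through x (and vice versa),
# which incoming-only propagation (Equation 2) can never achieve here.
assert h1[2, 1] == 0 and h2[2, 1] > 0 and h2[1, 2] > 0
```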
A DFG+ is a directed graph G = (V, F, E, R) whose nodes (entities) v_i ∈ V have features f_i ∈ F, with labeled edges (relations) (v_i, r, v_j) ∈ E, where r ∈ R is a relation type. In each layer l, node features are updated through the function defined in Equation 3. For each node, its old features (h_i^{(l)}) and its neighbors' old features (h_j^{(l)}) are passed along the edges ((v_i, r, v_j) ∈ E ∨ (v_j, r, v_i) ∈ E), and then aggregated through a normalized sum (Σ(·)) and an activation function (σ(·)) to obtain the updated features (h_i^{(l+1)}), where h_i^{(1)} = f_i in the input layer. An n-layer network allows message passing across n hops in the graph; therefore, the representation of a node x learned by an n-layer BRGCN model aggregates node features from the n-hop subgraph centered on x. Besides, the different sets of weights W_r^{d(l)} for the different types of edges and the sum aggregation adopted in Equation 3 help to learn the graph structures corresponding to information flow and variable adjacency, respectively. By learning the different graph structures and node features, we believe the network can distinguish vulnerable from benign nodes.

Fig. 5: Bi-directional Relational Graph Convolutional Network. (a) The model overview. (b) Message propagation for one node.

To train the model, we minimize the following cross-entropy loss over all labeled nodes:

\min_\theta L = - \sum_{G \in \mathcal{G}} \sum_{i \in \mathcal{Y}} \sum_{k=1}^{K} w_k \cdot y_{ik} \ln h_{ik}^{(L)},    (4)

where G is a graph in the training set 𝒢, and 𝒴 is the set of labeled nodes in our training samples. h_i^{(L)} is the output of the BRGCN for node i. Note that we use a softmax function in the last layer, so h_i^{(L)} denotes the predicted class distribution for node i, with h_{ik}^{(L)} being the probability of node i belonging to class k, k ∈ {0, 1}. w_k is the weight for class k, and y_{ik} denotes the respective ground-truth label for node i. We introduce w_k in our loss function because the class distribution in DFG+ is extremely imbalanced, i.e., the majority of nodes are negative (benign) nodes, while the positive (vulnerable) nodes make up only a very small portion.
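The effect of the class weights w_k in Equation 4 can be illustrated with a small numpy sketch; the probabilities, label assignments, and weight values below are invented for the example, not taken from the paper's experiments.

```python
import numpy as np

def weighted_ce(probs, labels, w):
    """Class-weighted cross-entropy of Equation 4.
    probs: softmax outputs (N, K); labels: one-hot (N, K); w: per-class (K,)."""
    return -np.sum(w * labels * np.log(probs))

# 4 well-predicted benign nodes and 1 badly-predicted vulnerable node:
probs  = np.array([[0.9, 0.1]] * 4 + [[0.6, 0.4]])
labels = np.array([[1, 0]] * 4 + [[0, 1]])
uniform  = weighted_ce(probs, labels, np.array([1.0, 1.0]))
weighted = weighted_ce(probs, labels, np.array([1.0, 10.0]))
# With w_vulnerable = 10, the single mispredicted vulnerable node now
# dominates the loss instead of being drowned out by the benign majority.
assert weighted > uniform
```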
To prevent the majority class from dominating the loss function, we assign a larger weight to the positive class. θ = {W_0^{(l)}, W_r^{in(l)}, W_r^{out(l)}; r ∈ R}_{l=1}^{L} is the set of model parameters. After the loss is calculated in each training epoch, backward propagation computes the gradient of the loss function with respect to the trainable parameters θ, and the parameters are updated to minimize the loss.

Model Parameter Size and Time Complexity.
For simplicity of the analysis, we first define the dimensionalities W_r^{in(l)} ∈ R^{d_l × d_{l+1}} and W_r^{out(l)} ∈ R^{d_l × d_{l+1}}, where d_l and d_{l+1} are the dimensionalities of the node representations in the l-th and (l+1)-th layers, respectively. Since θ = {W_0^{(l)}, W_r^{in(l)}, W_r^{out(l)}; r ∈ R}_{l=1}^{L} is the set of parameters of BRGCN, the model parameter size is O(Σ_{l=1}^{L} Σ_{r∈R} d_l · d_{l+1}) = O(Σ_{l=1}^{L} d_l · d_{l+1} · |R|).

For the forward pass of BRGCN, the main time cost in the l-th layer for node v_i is the calculation of Equation 3, which is O(Σ_{r∈R} (|IN_i^r| + |OUT_i^r|) · d_l · d_{l+1}). This is equivalent to O(D_i · d_l · d_{l+1}), where D_i = Σ_{r∈R} (|IN_i^r| + |OUT_i^r|) is the sum of the in-degree and out-degree of node v_i. Thus, the time complexity of BRGCN for node v_i is O(D_i · Σ_{l=1}^{L} d_l · d_{l+1}), and the computational cost for a whole DFG+ graph is O(Σ_i D_i · Σ_{l=1}^{L} d_l · d_{l+1}), which is equivalent to O(|E| · Σ_{l=1}^{L} d_l · d_{l+1}), where E is the set of edges in the DFG+. The complexity of backward propagation via gradient descent is the same as that of the forward pass, so the total cost of one iteration is O(|E| · Σ_{l=1}^{L} d_l · d_{l+1}).

E. Vulnerability Identification
In this subsection, we demonstrate how the trained model achieves the goal proposed in Section III-A. In the training phase, we leverage the source code of vulnerable programs, inputs that can trigger the vulnerabilities, and the tools presented in the earlier subsections to generate labeled DFG+ and save the corresponding supporting data. We then train the model on the DFG+ generated from the vulnerable executions. Through the forward propagation (message passing) rules defined in Equation 3, the loss function defined in Equation 4, and backward propagation to update the trainable parameters in the model, we obtain an effective model which can predict labels for the nodes of a given unlabeled DFG+.

In the testing phase, we apply the trained model to predict the labels of unlabeled DFG+ generated from binary-only software. Through the maps in the supporting data created by the runtime analyzer, the vulnerable nodes in a DFG+ can be mapped to the corresponding instruction addresses in the binary code and the execution trace. Since the execution trace contains the addresses of memory operands, the address of the corrupted variable can also be identified. Note that in cases where one execution triggers several silent BOF vulnerabilities, the vulnerable instructions and corrupted variables can be identified separately.
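The testing-phase lookup described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's tooling: the map and trace formats (`node_to_insn`, `trace`) are hypothetical stand-ins for the supporting data produced by the runtime analyzer.

```python
# Hypothetical sketch: predicted vulnerable node IDs are resolved to
# instruction addresses and corrupted-variable addresses via the maps
# saved by the runtime analyzer. All names here are illustrative.

def locate_vulnerabilities(predicted_labels, node_to_insn, trace):
    """predicted_labels: {node_id: 0/1}; node_to_insn: {node_id: trace_index};
    trace: list of (insn_addr, mem_addr) tuples from the execution trace."""
    findings = []
    for node_id, label in predicted_labels.items():
        if label != 1:                    # keep only nodes classified vulnerable
            continue
        idx = node_to_insn[node_id]       # map the node back into the trace
        insn_addr, mem_addr = trace[idx]  # memory operand = corrupted variable
        findings.append({"node": node_id,
                         "insn": hex(insn_addr),
                         "var": hex(mem_addr)})
    return findings

# One execution may trigger several silent BOFs; each vulnerable node
# is reported separately.
example = locate_vulnerabilities(
    {3: 1, 7: 0},
    {3: 0, 7: 1},
    [(0x80484b6, 0xbfff0010), (0x80484c2, 0xbfff0020)])
```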
F. Implementation
We implement our system on a 32-bit Linux system with the Intel x86 instruction set architecture. The compiler plugin is built on LLVM-5.0.0 and its runtime library on compiler-rt-5.0.0. Compared with the implementation of ASAN, the plugin and the runtime library consist of 82 and 2236 lines of new code, respectively. The dynamic binary analyzer and the graph constructor are developed on Intel Pin 3.10 and consist of 9900 lines of C++ code in total. The graph model consists of 800 lines of Python code and is implemented on top of DGL v0.4.3 [63], a high-performance and scalable Python package for deep learning on graph-typed data.

VI. EXPERIMENT AND RESULTS
A. Data Generation and Preprocessing
We select 30 reproducible CVEs, shown in Table II, from a repository of Linux vulnerabilities [64]. We generate three labeled DFG+ for each CVE, from three different executions. In the first execution, we compile the program with the compiler plugin and find an input that triggers the vulnerability. In the second execution, we change the input so that it overflows the buffer with a different length. In the third execution, we change the length of the vulnerable buffer in the program's source code, recompile it through our compiler plugin, and run the modified program again. In all three executions, the inputs trigger the vulnerability without crashing the execution. Since the length of the vulnerable buffer or of the input can hardly be changed in some programs, we finally obtain 86 labeled DFG+s with over 35 million (35,084,810) nodes, of which only 6708 are positive.

We observe that the constructed DFG+ vary largely in size (number of nodes), from a few thousand to a few million. It is impossible to fit an entire DFG+ into BRGCN for end-to-end training, especially for those DFG+ with more than 3 million nodes. To alleviate this problem, we propose a graph cutting algorithm (the details of Algorithm 1 are in the appendices). The cutting algorithm first cuts a big graph into several small sub-graphs by removing the edges that connect different sub-graphs; all the nodes in the sub-graphs are sample nodes. It then adds n-hop neighbors to each sub-graph as supporting nodes, where n is the number of model layers. When training the model on the sub-graphs, both supporting nodes and sample nodes are involved in forward propagation, whereas only the sample nodes are considered when calculating the loss.

As can be noticed above, the dataset is extremely imbalanced: the ratio between positive and negative nodes is roughly 1/5230. To further reduce the number of negative nodes, we exclude all the r-nodes and i-nodes from the sample nodes, because a BOF can only overwrite variables in memory. We also exclude nodes without any incoming d-edge, because the live variables associated with vulnerable nodes must be written through an invalid operation in a BOF. After the exclusion, we reduce the ratio to 1/659. Note that by excluding we mean we do not choose these nodes as sample nodes in the sub-graphs; instead, we select them as supporting nodes if they are neighbors of sample nodes, to help classify the sample nodes in the sub-graphs. Finally, we further reduce the number of negative nodes through random sampling.

B. Evaluation
We experimented with different numbers of layers, sizes of hidden states, and dropout rates to find the best-performing model. Currently, BRGCN has 4 layers, including an input layer and an output layer, and each layer has hidden states of dimension 16. 10 sets of parameters (W) are used for the 5 types of edges (2 sets of parameters per type).

After obtaining the best configuration, we adopt 8-fold cross-validation to comprehensively evaluate the model. In each

TABLE II: Information and testing results of each CVE case.
CVE-ID        | Name            | Region | Detected
CVE-2004-0597 | pngslap         | stack  | ✓
CVE-2004-1120 | proz            | stack  | ✓
CVE-2004-1255 | 2fax            | stack  | ✓
CVE-2004-1257 | abc2mtex        | stack  | ✓
CVE-2004-1261 | asp2php         | stack  | ✓
CVE-2004-1262 | bsb2ppm         | stack  | ✓
CVE-2004-1275 | html2hdml       | stack  | ✓
CVE-2004-1278 | jcabc2ps        | stack  | ✓
CVE-2004-1279 | jpegtoavi       | stack  | ✓
CVE-2004-1287 | nasm            | stack  | ✓
CVE-2004-1288 | o3read          | stack  | ✓
CVE-2004-1289 | pcal            | stack  | ✓
CVE-2004-1290 | pgn2web         | stack  | ✓
CVE-2004-1292 | ringtonetools   | stack  | ✓
CVE-2004-1293 | rtf2latex2e.bin | stack  | ✓
CVE-2004-1297 | unrtf           | stack  | ✓
CVE-2004-2093 | rsync           | stack  | ✓
CVE-2004-2167 | latex2rtf       | stack  | ✗
CVE-2005-0101 | newspost        | stack  | ✓
CVE-2005-3862 | unalz           | stack  | ✓
CVE-2005-4807 | as-new          | stack  | ✓
CVE-2007-1465 | dproxy          | stack  | ✓
CVE-2009-1759 | ctorrent        | stack  | ✓
CVE-2009-2286 | compface        | stack  | ✓
CVE-2009-5018 | gif2png         | stack  | ✓
CVE-2010-2891 | smisubtree      | stack  | ✓
EDB-890       | psnup           | stack  | ✓
EDB-9264      | stftp           | stack  | ✓
EDB-14904     | fcrackzip       | stack  | ✓
EDB-15062     | rarcrack        | stack  | ✓

round of the cross-validation, we select 75%, 12.5%, and 12.5% of the 86 graphs as the training, validation, and testing sets, respectively. Table III presents the
Accuracy, Precision, Recall, and F1 on the test set. Our model achieves 94.39% accuracy and a 94.18% F1 score on the sampled dataset. Since we are the first to analyze silent vulnerabilities through deep learning, we cannot find similar works for comparison. However, we compare our design with some other potential designs in the next section.

We then examine our model's ability to identify vulnerable operations in silent BOFs. Since a silent BOF results in one or more vulnerable nodes, we can successfully locate the vulnerable operation as long as one vulnerable node is identified. As a result, the vulnerability detection rate is much better than the vulnerable-node detection rate. Table II shows the detection results when we map the vulnerable nodes in the test phase to the executables. Due to the limited number of global/heap buffer overflows in the vulnerability database, we did not find a reproducible one for our evaluation, but we modified the vulnerable stack buffers in several of the cases displayed in Table II to global and heap buffers.

TABLE III: The overall performance of our proposed models.

Fold    | Accuracy | Precision | Recall | F1
fold-1  | 0.8871   | 0.9828    | 0.7703 | 0.8636
fold-2  | 0.9623   | 1.0000    | 0.9298 | 0.9636
fold-3  | 0.9712   | 0.9455    | 1.0000 | 0.9719
fold-4  | 0.9072   | 0.9167    | 0.8958 | 0.9060
fold-5  | 0.9617   | 0.9657    | 0.9574 | 0.9615
fold-6  | 0.9503   | 0.9244    | 0.9821 | 0.9523
fold-7  | 0.9359   | 0.8864    | 1.0000 | 0.9397
fold-8  | 0.9757   | 0.9537    | 1.0000 | 0.9763
Average | 0.9439   | 0.9469    | 0.9419 | 0.9418
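The per-fold figures in Table III follow the standard definitions of these metrics; as a sanity reference, they can be reproduced from raw confusion counts. The function and the example counts below are illustrative, not the paper's evaluation code.

```python
# Illustrative computation of the per-fold metrics reported in Table III
# from raw confusion counts (tp, fp, fn, tn).

def fold_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# e.g. a hypothetical fold with 57 true positives, 2 false positives,
# no false negatives, and 100 true negatives:
acc, prec, rec, f1 = fold_metrics(tp=57, fp=2, fn=0, tn=100)
```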
VII. EXPLAINABILITY
When designing DFG+ and BRGCN, we made several design choices based on our intuitions; in this section, we try to explain their effectiveness through several experiments. Accordingly, we put forward several evaluation questions: 1) Can sequence models such as RNNs and LSTMs solve the problem by analyzing the instruction sequence directly? 2) Is a homogeneous graph, rather than a more complex relational graph, enough to classify vulnerable nodes in DFG+? 3) Is BRGCN, defined in Equation 3, more effective than RGCN, defined in Equation 2? 4) Can BRGCN effectively identify vulnerable nodes in traditional data flow graphs? 5) Can BRGCN effectively identify vulnerable nodes in DFG+ with IDs as node attributes? 6) Does BRGCN really benefit from training on multiple graphs?

To answer these questions, we set up 4 groups of experiments. In the first group, we first generate an instruction trace which includes the executed instructions and the accessed memory addresses; second, we split the instruction trace into fixed-length sequences so that each ends with a memory-access instruction; third, if the last instruction of a sequence results in a vulnerable operation in a silent BOF, we label the sequence as vulnerable, otherwise we label it as benign; fourth, we sample the same positive and negative samples as those sampled for training BRGCN. Finally, we adopt an open-source implementation of memory-augmented RNNs and LSTMs [65] to classify the execution traces; the results are reported in Table IV. From the experimental results, we can easily conclude that RNNs and LSTMs cannot help identify vulnerable operations by analyzing instruction sequences with accessed memory addresses.

In the second group of experiments, we adopt ConvGNN and RGCN and train the models on DFG+. In ConvGNN, all types of edges are treated homogeneously and processed with the same weight matrix W. RGCN adopts different propagation rules for different edge types and propagates node features along the incoming direction of edges. The experimental results in Table IV show that BRGCN outperforms RGCN, and RGCN outperforms ConvGNN. This indicates that adopting two sets of parameters for each type of edge is more effective than one set of parameters per type, which in turn is more effective than a single set of parameters regardless of edge type.

In the third group of experiments, we change the structure of DFG+. There are two variants: 1) graphs with only program runtime data flow, and 2) graphs whose nodes are assigned unique IDs as node attributes. We then train the BRGCN model on the two sets of modified graphs and display the results in Table IV. From the results we learn: 1) BRGCN cannot distinguish significant differences between the local graph structures of invalid operations and of benign operations in a data-flow-only graph, so the adoption of spatial information and other implicit information flow plays an important role in the node classification problem; and 2) when training a model on different graphs, adopting node IDs as node attributes is harmful.

TABLE IV: The performance comparison of different neural networks and graph structures.
Group | Setting             | Accuracy | Precision | Recall | F1
1     | RNN                 | 0.4977   | 0.4994    | 0.5557 | 0.5260
1     | LSTM                | 0.4948   | 0.4929    | 0.5136 | 0.5030
2     | ConvGCN             | 0.7914   | 0.8105    | 0.7619 | 0.7616
2     | RGCN                | 0.8411   | 0.8699    | 0.8158 | 0.8175
3     | BRGCN w/DF-Only     | 0.7001   | 0.6126    | 0.7702 | 0.6793
3     | BRGCN w/Node-ID     | 0.7686   | 0.7769    | 0.7466 | 0.7577
4     | BRGCN w/One-Program | 0.5741   | 0.4839    | 0.3562 | 0.4215
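The gap between RGCN and BRGCN in the ablation above comes down to the propagation rule: RGCN aggregates only along incoming edges with one weight matrix per relation, while BRGCN adds a second matrix per relation for the outgoing direction. The following numpy sketch contrasts the two rules; the normalization and shapes are simplified assumptions, not the paper's exact trained model.

```python
import numpy as np

# Minimal sketch of the two propagation rules compared above.
# h: (n, d) node features; edges: {relation: [(src, dst), ...]};
# W_in/W_out: one (d, d') matrix per relation; W0: self-loop matrix.

def rgcn_layer(h, edges, W_in, W0):
    out = h @ W0
    for r, pairs in edges.items():
        for src, dst in pairs:
            # aggregate along incoming edges only (uniform normalization)
            out[dst] += h[src] @ W_in[r] / max(1, len(pairs))
    return np.maximum(out, 0)                       # ReLU

def brgcn_layer(h, edges, W_in, W_out, W0):
    out = h @ W0
    for r, pairs in edges.items():
        for src, dst in pairs:
            out[dst] += h[src] @ W_in[r] / max(1, len(pairs))
            # second parameter set propagates against edge direction
            out[src] += h[dst] @ W_out[r] / max(1, len(pairs))
    return np.maximum(out, 0)

# Tiny two-node example with one d-edge 0 -> 1: BRGCN also updates node 0
# from its outgoing neighbor, which RGCN never does.
h = np.eye(2)
edges = {"d": [(0, 1)]}
r_out = rgcn_layer(h, edges, {"d": np.eye(2)}, np.eye(2))
b_out = brgcn_layer(h, edges, {"d": np.eye(2)}, {"d": 2 * np.eye(2)}, np.eye(2))
```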
In the last group of experiments, we train our model on graphs generated from a single program and test it on other programs. The evaluation results show that the model trained on a single program is significantly worse than the model trained on multiple programs. We conclude that, by carefully designing DFG+, our model can benefit from different programs/graphs. This indicates that common semantic features of BOF vulnerabilities are shared by different programs.

VIII. LIMITATIONS AND CONCLUSION
Our approach suffers from several limitations. First, although we achieve a high detection rate for silent BOFs, there are considerable false-positive predictions on nodes, meaning that some benign nodes are falsely classified as vulnerable. This stems from the extreme imbalance between the numbers of positive and negative nodes, which has a ratio as low as 1/5230. Although we have down-sampled the negative nodes and applied a class-weighted loss function during training, it remains an issue, because any single-percentage-point drop in classification precision leads to considerable false-positive predictions. Second, although our model can test a graph very quickly (less than 0.2 seconds on average), there is an overhead in collecting program runtime data flow and building DFG+. Unless we can trace program data flow on the fly, it is not practical to deploy our framework for detecting vulnerabilities in a real-time production environment.

In this paper, we design a novel graph data structure DFG+ to represent the program's runtime information flow and variables' spatial information. A runtime analyzer is implemented to construct DFG+, and AddressSanitizer is customized to help label the nodes. We further propose BRGCN to analyze DFG+ and detect vulnerable nodes with 94.39% accuracy. By mapping the vulnerable nodes back to the execution trace, we are able to locate the vulnerable points in the program at the binary level. We believe the DFG+ and the BRGCN proposed in our work have wide applications. Our proposed scheme could be used in vulnerability analysis to help locate vulnerable points, in software patching to help generate patches at the binary level, in exploit generation to help attack vulnerable software, and in software testing to help find software bugs.

Finally, we would like to suggest some future work that could supplement our approach. Possible avenues include applying the GNN-based approach to other silent vulnerable executions, such as detecting buffer over-reads, or to obfuscated programs.
REFERENCES
[1] L. Szekeres, M. Payer, T. Wei, and D. Song, "SoK: Eternal war in memory," in 2013 IEEE Symposium on Security and Privacy. IEEE, 2013, pp. 48–62.
[2] B. Liu, L. Shi, Z. Cai, and M. Li, "Software vulnerability discovery techniques: A survey," Nov. 2012, pp. 152–156.
[3] J. C. King, "Symbolic execution and program testing," Commun. ACM, vol. 19, no. 7, pp. 385–394, Jul. 1976.
[4] I. Yun, S. Lee, M. Xu, Y. Jang, and T. Kim, "QSYM: A practical concolic execution engine tailored for hybrid fuzzing," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '16. New York, NY, USA: ACM, 2016, pp. 529–540.
[6] A. Arora, R. Krishnan, R. Telang, and Y. Yang, "An empirical analysis of software vendors' patch release behavior: Impact of vulnerability disclosure," Information Systems Research, vol. 21, no. 1, pp. 115–132, 2010.
[7] M. Zalewski, "American fuzzy lop," 2014.
[8] I. Yun, S. Lee, M. Xu, Y. Jang, and T. Kim, "QSYM: A practical concolic execution engine tailored for hybrid fuzzing," in Proceedings of the 27th USENIX Security Symposium (Security), Baltimore, MD, Aug. 2018.
[9] P. Chen and H. Chen, "Angora: Efficient fuzzing by principled search," IEEE, 2018, pp. 711–725.
[10] S. Chen, J. Xu, E. C. Sezer, P. Gauriar, and R. K. Iyer, "Non-control-data attacks are realistic threats," in USENIX Security Symposium, vol. 5, 2005.
[11] K. Serebryany, D. Bruening, A. Potapenko, and D. Vyukov, "AddressSanitizer: A fast address sanity checker," in 2012 USENIX Annual Technical Conference (USENIX ATC), 2012, pp. 309–318.
[12] R. W. Jones and P. H. Kelly, "Backwards-compatible bounds checking for arrays and pointers in C programs," in AADEBUG. Citeseer, 1997, pp. 13–26.
[13] L. Lam and T. Chiueh, "Checking array bound violation using segmentation hardware," 2005, pp. 388–397.
[14] Y. Shoshitaishvili, R. Wang, C. Salls, N. Stephens, M. Polino, A. Dutcher, J. Grosen, S. Feng, C. Hauser, C. Kruegel, and G. Vigna, "SoK: (State of) the art of war: Offensive techniques in binary analysis," 2016, pp. 138–157.
[15] A. Fioraldi, D. C. D'Elia, and L. Querzoni, "Fuzzing binaries for memory safety errors with QASan," IEEE, 2020, pp. 23–30.
[16] S. Dinesh, N. Burow, D. Xu, and M. Payer, "RetroWrite: Statically instrumenting COTS binaries for fuzzing and sanitization," IEEE, 2020, pp. 1497–1511.
[17] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, "A comprehensive survey on graph neural networks," IEEE Transactions on Neural Networks and Learning Systems, 2020.
[18] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building customized program analysis tools with dynamic instrumentation," ACM SIGPLAN Notices.
International Symposium on Code Generation and Optimization (CGO 2011). IEEE, 2011, pp. 213–223.
[25] B. J. Reed Hastings, "Purify: Fast detection of memory leaks and access errors," in Proc. of the Winter 1992 USENIX Conference. Citeseer, 1991.
[26] "Intel Parallel Inspector," http://software.intel.com/en-us/intel-parallel-inspector/, Intel.
[27] A. R. Hurson and K. M. Kavi, "Dataflow computers: Their history and future," Wiley Encyclopedia of Computer Science and Engineering, 2007.
[28] K. Kennedy, A survey of data flow analysis techniques. IBM Thomas J. Watson Research Division, 1979.
[29] J. Badlaney, R. Ghatol, and R. Jadhwani, "An introduction to data-flow testing," North Carolina State University, Dept. of Computer Science, Tech. Rep., 2006.
[30] E. C. R. Shin, D. Song, and R. Moazzezi, "Recognizing functions in binaries with neural networks," in USENIX Security Symposium (USENIX Security 15), 2015, pp. 611–626.
[31] Z. L. Chua, S. Shen, P. Saxena, and Z. Liang, "Neural nets can learn function type signatures from binaries," 2017, pp. 99–116.
[32] W. Guo, D. Mu, X. Xing, M. Du, and D. Song, "DEEPVSA: Facilitating value-set analysis with deep learning for postmortem program analysis," 2019, pp. 1787–1804.
[33] M. Du, F. Li, G. Zheng, and V. Srikumar, "DeepLog: Anomaly detection and diagnosis from system logs through deep learning," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '17. New York, NY, USA: ACM, 2017, pp. 1285–1298.
[34] W. Meng, Y. Liu, Y. Zhu, S. Zhang, D. Pei, Y. Liu, Y. Chen, R. Zhang, S. Tao, P. Sun, and R. Zhou, "LogAnomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs," in Proceedings of the 28th International Joint Conference on Artificial Intelligence, ser. IJCAI '19. AAAI Press, 2019, pp. 4739–4745.
[35] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[36] Z. Zhang, P. Cui, and W. Zhu, "Deep learning on graphs: A survey," IEEE Transactions on Knowledge and Data Engineering, 2020.
[37] M. Balcilar, G. Renton, P. Héroux, B. Gauzere, S. Adam, and P. Honeine, "Bridging the gap between spectral and spatial domains in graph neural networks," arXiv preprint arXiv:2003.11702, 2020.
[38] F. R. Chung and F. C. Graham, Spectral graph theory. American Mathematical Soc., 1997, no. 92.
[39] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" arXiv preprint arXiv:1810.00826, 2018.
[40] J. Qiu, J. Tang, H. Ma, Y. Dong, K. Wang, and J. Tang, "DeepInf: Social influence prediction with deep learning," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2110–2119.
[41] Q. Tan, N. Liu, and X. Hu, "Deep representation learning for social network analysis," Frontiers in Big Data, vol. 2, p. 2, 2019.
[42] H. Wang, F. Zhang, M. Zhang, J. Leskovec, M. Zhao, W. Li, and Z. Wang, "Knowledge-aware graph neural networks with label smoothness regularization for recommender systems," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 968–977.
[43] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur, "Protein interface prediction using graph convolutional networks," in Advances in Neural Information Processing Systems, 2017, pp. 6530–6539.
[44] A. Nair, A. Roy, and K. Meinke, "funcGNN: A graph neural network approach to program similarity," in Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2020, pp. 1–11.
[45] S. Rawat and L. Mounier, "Finding buffer overflow inducing loops in binary executables," IEEE, 2012, pp. 177–186.
[46] J. Newsome and D. Song, "Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software," in NDSS, vol. 5. Citeseer, 2005, pp. 3–4.
[47] J. Seward and N. Nethercote, "Using Valgrind to detect undefined value errors with bit-precision," in USENIX Annual Technical Conference, General Track, 2005, pp. 17–30.
[48] N. Nethercote and J. Seward, "Valgrind: A framework for heavyweight dynamic binary instrumentation," ACM SIGPLAN Notices, vol. 42, no. 6, pp. 89–100, 2007.
[49] N. Stephens, J. Grosen, C. Salls, A. Dutcher, R. Wang, J. Corbetta, Y. Shoshitaishvili, C. Kruegel, and G. Vigna, "Driller: Augmenting fuzzing through selective symbolic execution," in NDSS, vol. 16, no. 2016, 2016, pp. 1–16.
[50] Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson, "Structured sequence modeling with graph convolutional recurrent networks," in International Conference on Neural Information Processing. Springer, 2018, pp. 362–373.
[51] E. Hajiramezanali, A. Hasanzadeh, K. Narayanan, N. Duffield, M. Zhou, and X. Qian, "Variational graph recurrent neural networks," in Advances in Neural Information Processing Systems, 2019, pp. 10701–10711.
[52] W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[53] C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla, "Heterogeneous graph neural network," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 793–803.
[54] M. Karasuyama and H. Mamitsuka, "Multiple graph label propagation by sparse integration," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, pp. 1999–2012, 2013.
[55] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum, "Label propagation for deep semi-supervised learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5070–5079.
[56] M. Rajpal, W. Blum, and R. Singh, "Not all bytes are equal: Neural byte sieve for fuzzing," arXiv preprint arXiv:1711.04596, 2017.
[57] D. She, K. Pei, D. Epstein, J. Yang, B. Ray, and S. Jana, "NEUZZ: Efficient fuzzing with neural program smoothing," IEEE, 2019, pp. 803–817.
[58] K. Z. Snow, F. Monrose, L. Davi, A. Dmitrienko, C. Liebchen, and A.-R. Sadeghi, "Just-in-time code reuse: On the effectiveness of fine-grained address space layout randomization," IEEE, 2013, pp. 574–588.
[59] U. Khedker, A. Sanyal, and B. Sathe, Data flow analysis: Theory and practice. CRC Press, 2017.
[60] "Memory-augmented recurrent neural networks," 2019. [Online]. Available: https://github.com/suzgunmirac/marnns
[61] "compiler-rt runtime libraries," LLVM project, 2020. [Online]. Available: https://compiler-rt.llvm.org
[62] R. R. Heisch, "Method and system for reordering the instructions of a computer program to optimize its execution," Dec. 21, 1999, US Patent 6,006,033.
[63] M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang, "Deep Graph Library: A graph-centric, highly-performant package for graph neural networks," arXiv preprint arXiv:1909.01315, 2019.
[64] D. Mu, "LinuxFlaw," 2019. [Online]. Available: https://github.com/mudongliang/LinuxFlaw
[65] "llvm-libc C standard library," LLVM project, 2020. [Online]. Available: https://llvm.org/docs/Proposals/LLVMLibC.html

APPENDIX A
GRAPH CUT ALGORITHM
Algorithm 1 Graph Cut Algorithm

INPUT: graph G(N, E); number of layers l; number of sub-graphs n
OUTPUT: a set of n subgraphs C = {G_i | 0 ≤ i < n}, and the IDs of the sampled nodes S_i in each subgraph G_i

initialize a set of n subgraphs: C = {G_i | 0 ≤ i < n}, where G_i = (N_i, E_i), N_i = ∅, E_i = ∅
divide the nodes of G into n samples S_i (0 ≤ i < n), satisfying |S_i| ≤ ⌈|N|/n⌉ and ∪_{i=0}^{n-1} S_i = N
for i = 0 to n − 1 do
    N_i := S_i
    for j = 1 to l do
        for e ∈ E do
            /* src(e), dst(e) denote the source node and destination node of e */
            if src(e) ∈ N_i and dst(e) ∉ N_i then
                add dst(e) to N_i
            end if
            if src(e) ∉ N_i and dst(e) ∈ N_i then
                add src(e) to N_i
            end if
        end for
    end for
    for e ∈ E do
        if src(e) ∈ N_i or dst(e) ∈ N_i then
            add e to E_i
        end if
    end for
end for
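Under the assumption that the graph is given as a node collection and a list of (src, dst) edge pairs, the algorithm above can be transcribed into Python as follows. This is a sketch: the node partitioning is a simple contiguous split, and each of the l rounds grows the sub-graph by exactly one hop, matching the l-hop supporting-node expansion described in Section VI-A.

```python
# Sketch of the graph cut: partition the nodes into n sample sets, then for
# each set add l hops of neighbors as supporting nodes and collect the
# incident edges. Graph representation (node list + edge-pair list) is an
# assumption for illustration, not the paper's actual data structures.

def graph_cut(nodes, edges, l, n):
    nodes = list(nodes)
    size = -(-len(nodes) // n)                 # ceil(|N| / n)
    samples = [set(nodes[i * size:(i + 1) * size]) for i in range(n)]
    subgraphs = []
    for S in samples:
        Ni = set(S)                            # sample nodes
        for _ in range(l):                     # one hop of supporting nodes
            grown = set(Ni)                    # per model layer
            for src, dst in edges:
                if src in Ni and dst not in Ni:
                    grown.add(dst)
                if dst in Ni and src not in Ni:
                    grown.add(src)
            Ni = grown
        Ei = [(s, d) for s, d in edges if s in Ni or d in Ni]
        subgraphs.append((Ni, Ei))
    return subgraphs, samples

# Chain 0 -> 1 -> 2 -> 3 cut into two sub-graphs with 1-hop support:
subgraphs, samples = graph_cut([0, 1, 2, 3],
                               [(0, 1), (1, 2), (2, 3)], l=1, n=2)
```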