D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis
Yunhui Zheng, Saurabh Pujar, Burn Lewis, Luca Buratti, Edward Epstein, Bo Yang, Jim Laredo, Alessandro Morari, Zhong Su
IBM Research
{zhengyu, burn, eae, laredoj, amorari}@us.ibm.com, {saurabh.pujar, luca.buratti1}@ibm.com, {yangbbo, suzhong}@cn.ibm.com

Abstract—Static analysis tools are widely used for vulnerability detection as they can understand programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of machine learning models to understand programming languages opens new possibilities when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, limited size, and synthetic, unrealistic source code. We propose D2A, a differential-analysis-based approach to label issues reported by static analysis tools. The D2A dataset is built by analyzing version pairs from multiple open-source projects. From each project, we select bug-fixing commits and we run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that got fixed by the commit. We use D2A to generate a large labeled dataset to train models for vulnerability identification. We show that the dataset can be used to build a classifier to identify possible false alarms among the issues reported by static analysis, hence helping developers prioritize and investigate potential true positives first.
Index Terms—dataset, vulnerability detection, auto-labeler
I. INTRODUCTION
The complexity and scale of modern software programs often lead to overlooked programming errors and security vulnerabilities. Research has shown that developers spend more than 50% of their time detecting and fixing bugs [1], [2]. In practice, they usually rely on automated program analysis or testing tools to audit the code and look for security vulnerabilities. Among them, static program analysis techniques have been widely used because they can understand nontrivial program behaviors, scale to millions of lines of code, and detect subtle bugs [3], [4], [5], [6]. Although static analysis has limited capacity to identify bug-triggering inputs, it can achieve better coverage and discover bugs that are missed by dynamic analysis and testing tools. In fact, static analysis can provide useful feedback and has been proven effective in improving software quality [7], [8].

Besides these classic usage scenarios, driven by the needs of recent AI research on source code understanding and vulnerability detection tasks [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], static analysis techniques have also been used to generate labeled datasets for model training [12]. As programs exhibit diverse and complex behaviors, training models for vulnerability detection requires large labeled datasets of buggy and non-buggy code examples. This is especially critical for advanced neural network models such as CNNs, RNNs, and GNNs. However, existing datasets for vulnerability detection suffer from the following limitations:

• Almost all datasets are at the function level and do not provide context information (e.g., traces) explaining how a bug may happen. Besides, they usually do not specify the bug types and locations. In many cases, the function-level example does not even include the bug root cause.
• Some datasets (e.g., CGD [13]) are derived from confirmed bugs in NVD [19] or CVE [20]. Although they have high-quality labels, the number of such samples is limited and may be insufficient for model training.
• Synthetic datasets such as Juliet [21] and S-babi [14] can be large. However, they are generated based on a few predefined patterns and thus cannot represent the diverse behaviors observed in real-world programs.
• There are also labeling efforts based on commit messages or code diffs. Predicting code labels based on commit messages is known to produce low-quality labels [12]. Code-diff-based methods [15] assume all functions in a bug-fixing commit are buggy, which may not be the case in reality. More importantly, these approaches have difficulty identifying bug types, locations, and traces.

On the other hand, static analysis can reason beyond function boundaries. It is automated and scales well enough to generate large datasets from programs in the wild. For example, Russell et al. [12] applied the Clang Static Analyzer, Cppcheck, and Flawfinder to generate a labeled dataset of millions of functions to train deep learning models and learn features from source code. In some sense, it is the most promising labeling approach as it can additionally identify bug types and locations while using traces as context information.

Despite their popularity in these scenarios, static analysis tools are known to generate an excess of false alarms. One reason is the approximation heuristics used to reduce complexity and improve scalability. In particular, static analysis tries to model all possible execution behaviors and thus can suffer from the state-space blowup problem [10].
To handle industry-scale programs, static analysis tools aggressively approximate the analysis and sacrifice precision for better scalability and speed. For example, path-sensitive analysis does not scale well on large programs, especially when modeling too many path states or reasoning about complex path constraints. Therefore, path-insensitive analysis, which ignores path conditions and assumes all paths are feasible, is commonly used in practice, which obviously introduces false positives.

These false positives greatly hinder the utilization of static analysis tools, as it is counterproductive for developers to go through a long list of reported issues only to find a few true positives [22], [23]. To suppress them, various methods have been proposed (as summarized in [24]). Among them, machine learning based approaches [25], [26], [27], [28], [10], [29], [30], [31], [32], [33] focus on learning the patterns of false positives from examples. However, training such models requires good labeled datasets. Most existing works manually generate such datasets by reviewing the code and bug reports. In our experience, this review process is very labor-intensive and cannot scale. Therefore, the datasets are relatively small and may not cover the diverse behaviors observed in reality.

To address these challenges, in this paper we propose D2A, a differential-analysis-based approach to label issues reported by static analysis tools as ones that are more likely to be true positives and ones that are more likely to be false positives. Our goal is to generate a large labeled dataset that can be used by machine learning approaches for (1) static analyzer false positive reduction, and (2) code understanding and vulnerability detection tasks. We demonstrate how our dataset can be helpful for the false positive reduction task.

In particular, for projects with commit histories, we assume some commits are code changes that fix bugs. Instead of predicting labels based on commit messages, we run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that got fixed by the commit. If we analyze a large number of consecutive version pairs and aggregate the results, some issues found in a before-commit version never disappear in an after-commit version. We say they are not very likely to be real bugs because they were never fixed. Then, we de-duplicate the issues found in all versions and adjust their classifications according to the commit history. Finally, we label the issues that are very likely to be real bugs as positives and the remaining ones as negatives. We name this procedure differential analysis and the labeling mechanism auto-labeler. Please note that we say the reported issues are very likely to be TPs or FPs because the static analyzer may make mistakes or a bug-fixing commit may not have been included for analysis. We will discuss this in more detail in Sec. III.

We ran the differential analysis on thousands of selected consecutive version pairs from OpenSSL, FFmpeg, libav, httpd, NGINX, and libtiff. Out of 349,373,753 issues reported by the static analyzer, after deduplication, we labeled 18,653 unique issues as positives and 1,276,970 unique issues as negatives. Given there is no ground truth, to validate the efficacy of the auto-labeler, we randomly selected and manually reviewed 57 examples. The result shows that D2A improves the label accuracy from 7.8% to 53%.
Although the D2A dataset is mainly intended for machine-learning-based vulnerability detection methods, which usually require a large number of labeled samples, in this paper we show it can also be used to help developers prioritize static analysis issues that are more likely to be true positives. We will present AI-based code understanding approaches in another paper. In particular, inspired by [10], we defined features solely from static analysis outputs and trained a static analysis false positive reduction model. The result shows that we were able to significantly reduce false alarms, allowing developers to investigate issues that are less likely to be false positives first. In summary, we make the following contributions:

• We propose a novel approach to label static analysis issues based on differential analysis and commit history heuristics.
• Given that it can take several hours to analyze a single version pair (e.g., 12 hrs for FFmpeg), we parallelized the pipeline such that we can process thousands of version pairs simultaneously in a cluster, which makes D2A a practical approach.
• We ran large-scale analyses on thousands of version pairs of real-world C/C++ programs and created a labeled dataset of millions of samples, with the hope that the dataset can be helpful to AI methods on vulnerability detection tasks.
• Unlike existing function-level datasets, we derive samples from inter-procedural analysis and preserve more details such as bug types, locations, traces, and analyzer outputs.
• We demonstrate a use case of the D2A dataset. We trained a static analysis false positive reduction model, which can effectively reduce the false positive rate and help developers prioritize issues that are more likely to be real bugs.
• To facilitate future research, we make the D2A dataset and its generation pipeline publicly available at https://github.com/ibm/D2A.

II. MOTIVATION
In this section, we describe two usage scenarios to show why building a good labeled dataset using static analysis can be useful for AI-based vulnerability detection methods.
A. Existing Datasets for AI on Vulnerability Detection Task
Since programs can exhibit diverse behaviors, training machine learning models for code understanding and vulnerability detection requires large datasets. However, according to a recent survey [35], the lack of good, real-world datasets has become a major barrier for this field. Many existing works created self-constructed datasets based on different criteria. However, only a few fully released their datasets.

Table I summarizes the characteristics of a few popular publicly available software vulnerability datasets. We compare these datasets to highlight the contributions D2A can make. Juliet [21], Choi et al. [34], and S-babi [14] are synthetic datasets that were generated from predefined patterns. Although their sizes are decent, the main drawback is the lack of diversity compared to real-world programs [34].

The examples in Draper [12] are from both synthetic and real-world programs, where each example contains a function and a few labels indicating the bug types.
TABLE I
PUBLICLY AVAILABLE DATASETS FOR AI ON C/C++ VULNERABILITY DETECTION

Dataset          | Example Type | Example Level | Whole Dataset Released | Bug Type | Bug Line | Bug Trace | Codebase Traceability | Compilable Example | Generation Impl. Avail. | Labelling Method
Juliet [21]      | synthetic    | function      | ✓ | ✓ | ✓ | ✗ | – | ✓ | – | predefined pattern
S-Babi [14]      | synthetic    | function      | ✓ | ✓ | ✓ | ✗ | – | ✓ | ✓ | predefined pattern
Choi et al. [34] | synthetic    | function      | ✓ | ✓ | ✓ | ✗ | – | ✓ | ✓ | predefined pattern
Draper [12]      | mixed        | function      | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | static analysis
Devign [15]      | real-world   | function      | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | manual + commit code diff
CDG [13]         | real-world   | slice         | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | NVD + code diff
D2A              | real-world   | trace         | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | differential static analysis
Note: To the best of our knowledge, there is no perfect dataset that is large enough and has 100% correct labels for AI-based vulnerability detection tasks. Datasets generated from manual reviews have better quality labels in general. However, limited by their nature, they are usually not large enough for model training. On the other hand, the quality of the D2A dataset is bounded by the capacity of static analysis. D2A has better labels compared to datasets labeled solely by static analysis and complements existing high-quality datasets in size. Please refer to Sec. II-A for details.

These labels were generated by aggregating static analysis results. Draper doesn't provide details like bug locations or traces. For real-world programs, Draper doesn't maintain the links to the original code base. If we want to further process the function-level examples to obtain more information, it's difficult to compile or analyze them without headers and compiler arguments.

The Devign [15] dataset contains real-world function examples from commits, where the labels are manually generated based on commit messages and code diffs. In particular, if a commit is believed to fix bugs, all functions patched by the commit are labeled as 1, which is not true in many cases. In addition, only a small portion of the dataset was released.

CDG [13] is derived from real-world programs. It's unique because an example is a subset of a program slice and thus not a valid program. Its label was computed based on NVD: if the slice overlaps with a bug fix, it's labeled as 1. Since the dataset is derived from confirmed bugs, the label quality is better. However, the number of such examples is limited and may not be sufficient for model training.

In fact, there is a pressing need for labeled datasets that come from real-world programs and encode context information beyond the function boundary [35], [15], [13]. It has been shown that preserving inter-procedural flow in code embedding can significantly improve model performance (e.g., 20% precision improvement in a code classification task) [36]. To this end, D2A examples are generated based on inter-procedural analysis, where an example can include multiple functions in the trace. D2A also provides extra details such as the bug types, bug locations, bug traces, links to the original code base/commits, analyzer outputs, and the compiler arguments that were used to compile the files containing the functions. We believe they are helpful for AI on vulnerability detection in general.

B. Manual Review and False Positive Reduction
We start by running a state-of-the-art static analyzer on a large real-world program. We select bug types that may lead to security problems and manually go through each issue to confirm how many reported issues are real bugs.

The goal of this exercise is two-fold. First, we want to understand the performance of a state-of-the-art static analyzer on large real-world programs in terms of how many reported issues are real bugs. Second, by looking at the false positives, we want to explore ideas that can treat the static analyzer as a black box and suppress the false positives.
TABLE II
MANUAL REVIEW: OPENSSL

Error Type          | Reported | Manual Review: FP | TP | FP:TP
UNINITIALIZED_VALUE | 101      | 101               | 0  | –
NULL_DEREFERENCE    | 64       | 51                | 13 | 4:1
RESOURCE_LEAK       | 1        | 1                 | 0  | –
TOTAL               | 166      | 153               | 13 | 12:1
Note: 326 DEAD_STORE issues were excluded from manual review.

crypto/initthread.c:385: error: NULL_DEREFERENCE
  pointer `gtr` last assigned on line 382 could be null and is dereferenced
  at line 385, column 53.

Showing all 5 steps of the trace
crypto/initthread.c:377:1: start of procedure init_thread_deregister()
  static int init_thread_deregister(void *index, int all)
crypto/initthread.c:382:5:
  int i;
  if (!all)
  ...
crypto/initthread.c:385:17:
  if (!all)
  for (i = 0; i < sk_THREAD_EVENT_HANDLER_PTR_num(gtr->skhands); i++) {

Fig. 1. Infer Bug Report Example.
1) Manual Case Study:
Since we are interested in large C/C++ programs, we require that the static analyzer be able to handle industrial-scale programs and detect a broad set of bug types. To the best of our knowledge, the Clang Static Analyzer [37] and Infer [38] are two state-of-the-art static analyzers that satisfy our needs. However, the Clang Static Analyzer doesn't support cross translation unit analysis, so its inter-procedural analysis may be incomplete. Therefore, we choose Infer in our experiments. We use OpenSSL as the benchmark. We run Infer using its default settings and the results are summarized in Table II. Infer reported issues of four bug types: DEAD_STORE, UNINITIALIZED_VALUE, NULL_DEREFERENCE, and RESOURCE_LEAK. Among them,
DEAD_STORE refers to issues where the value written to a variable is never used. Since such issues are not directly related to security vulnerabilities, they were excluded from the manual review. The remaining issues may lead to security-related problems and thus were included in the study.

Fig. 2. Feature Exploration. We experiment with a few features that may reflect the complexity of the issues. After normalization, the averaged feature values of true positives and false positives are significantly different, which suggests a classifier may achieve good performance.

The manual review was performed by 8 developers who are proficient in C/C++. We started by understanding the bug reports produced by Infer. Fig. 1 shows an example of the bug report of a
NULL_DEREFERENCE issue. It has two sections. The bug location, bug type, and a brief justification of why Infer thinks the bug can happen are listed first; this bug explanation part can be in different formats for different bugs. Then the bug trace, which consists of the last steps of the offending execution, is listed. Fig. 1 shows 3 of the 5 steps. For each step, the location and 4 additional lines of code that sit before and after the highlighted line are provided.

We first had two rounds of manual analysis to figure out whether the reported issue may be triggered. Each issue was reviewed by two reviewers. If both reviewers agreed that the reported bug can happen, we had an additional round of review and tried to confirm the bug by constructing a test case. This process was very time-consuming and challenging, especially when reviewing a complex program with cryptography involved. As shown in Table II, out of 166 security-vulnerability-related issues, we confirmed that 13 (7.8%) issues are true positives and 153 are false positives.
2) Feature Exploration for False Positive Reduction:
During the manual review, we found we could make a good guess for some issues just by looking at the bug reports. Inspired by existing false positive reduction works [27], [10], we explored the idea of predicting whether the issues flagged by Infer are true positives solely based on the bug reports as shown in Fig. 1. Existing approaches are not directly applicable as they target different languages or static analyzers. Following the intuition that complex issues are more likely to be false positives, we considered features in bug reports that may reflect the issue complexity. We explored the following 8 features that belong to 3 categories: (1) error line and error char denote the location (line and column number) where the bug occurs. (2) length, c file count, and package count denote the unique number of line numbers, source files, and directories, respectively, in the trace. (3) if count and function count are the numbers of branches and functions in the trace.

We extracted the features from the bug reports of the 166 issues. After normalization, we computed the average feature values of the 13 true positive issues and the 153 false positive issues. As shown in Fig. 2, the average feature values of true positives and false positives are significantly different and easily separable for all 8 features, which suggests a good false positive reduction classifier can perform very well.

III. DATASET GENERATION
In this section, we present the differential-analysis-based approach that labels the issues detected by the static analyzer. Then, we show how we generate two kinds of examples for the D2A dataset based on the results obtained.
A. Overview
Fig. 3 shows the overall workflow of D2A. The input to the pipeline is a URL to a git repository. The output is a set of examples generated purely using the static differential analysis.

As the pre-processing step, based on the commit messages only, the Commit Message Analyzer (Sec. III-B) selects a list of commits that are likely to be bug fixes. Because it can be very expensive to analyze a pair of consecutive versions, the goal of this step is to filter out commits that are not closely related to bug fixes (e.g., documentation improvement commits) and speed up the process.

For each selected commit, we obtain two sets of issues reported by the static analyzer by running the analyzer on the before-commit and the corresponding after-commit versions. The auto-labeler (Sec. III-C) compares these two sets and identifies the issues that are fixed by the commit.

After aggregating all such issues from multiple consecutive version pairs and filtering out noise based on commit history, the auto-labeler labels issues that are very likely to be real bugs as positives, and the issues that are never fixed by a commit as negatives because they are very likely to be false positives. We further extract the function bodies according to the bug traces and create the dataset.
B. Commit Message Analysis
We created the Commit Message Analyzer (CMA) to identify commits that are more likely to refer to vulnerability fixes rather than documentation changes or new features. Using the NVD dataset [19], CMA learns the language of vulnerabilities and uses a hybrid approach that combines semantic-similarity-based methods [39] and snippet-sample-based methods [40] to identify relevant commit messages and their associated commits. Noise is reduced by eliminating meaningless tokens, names, email addresses, links, code, etc. from each commit message prior to the analysis. Based on the semantic distribution of the vulnerable mentions, CMA identifies the category of the vulnerability and ranks commits based on confidence scores.
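As a rough illustration of this step only, the sketch below ranks commit messages by TF-IDF similarity to NVD vulnerability descriptions. It is a simplification under stated assumptions: the actual CMA combines richer semantic-similarity and snippet-sample methods, and the function and variable names here are hypothetical.

# Simplified sketch: rank commit messages by TF-IDF similarity to NVD text.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def clean(message):
    # Strip e-mail addresses, links, and inline code to reduce noise.
    message = re.sub(r"\S+@\S+|https?://\S+|`[^`]*`", " ", message)
    return re.sub(r"\s+", " ", message).lower()

def rank_commits(commit_msgs, nvd_descriptions, top_k=100):
    corpus = [clean(m) for m in commit_msgs] + [clean(d) for d in nvd_descriptions]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    commits, nvd = tfidf[:len(commit_msgs)], tfidf[len(commit_msgs):]
    # Confidence score: best similarity to any known vulnerability description.
    scores = cosine_similarity(commits, nvd).max(axis=1)
    return sorted(zip(commit_msgs, scores), key=lambda x: -x[1])[:top_k]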
C. Auto-labeler Examples
For each bug-fixing commit selected by CMA, we run the static analyzer on the versions before and after the commit.
Fig. 3. The Overview of D2A Dataset Generation Pipeline.
We evaluated several static analyzers, including CppCheck [41], Flawfinder [42], the Clang Static Analyzer [37], and Infer [38]. We chose Infer because it can detect a good set of security-related bug types and supports the cross translation unit analysis necessary for effective inter-procedural analysis. More importantly, it scales well on large programs.
Identify Fixed Issues in a Version Pair.
If we denote the issues found in the before-commit version as I_before and the ones in the corresponding after-commit version as I_after, all issues can be classified into three groups: (1) the fixed issues (I_before − I_after) that are detected in the before-commit version but disappear in the after-commit version, (2) the pre-existing issues (I_after ∩ I_before) that are detected in both versions, and (3) the introduced issues (I_after − I_before) that are not found in the before-commit version but detected in the after-commit version. We are particularly interested in the fixed issues because they are very likely to be bugs fixed by the commit. We use the infer-reportdiff tool [43] to compute them.

Note that it's possible that a fixed issue is not a real bug, as the static analyzer may make mistakes, e.g., omit an issue from the after-commit version even though the code had not changed. In our experience, an important reason is that Infer can exhibit non-deterministic behaviors [44], and the non-determinism occurs more frequently when parallelization is enabled [45]. In order to minimize the impact, we have to run Infer in single-threaded mode. However, this setting brings performance challenges, and it takes several hours to analyze a version pair. For example, on an IBM POWER8 cluster, it takes 5.3 hrs and 12 hrs to analyze a version pair of OpenSSL and FFmpeg, respectively, in single-threaded mode. As we will need to analyze thousands of version pairs, it's impractical to do so on a PC or a small workstation. Therefore, we addressed several technical challenges and parallelized the analysis to process more than a thousand version pairs simultaneously in a cluster.
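The per-commit differential step can be summarized by the following sketch (a simplification; issues are assumed to be represented by hashable fingerprints, computed as described next):

# Sketch of the per-commit differential step.
def diff_issues(issues_before, issues_after):
    before, after = set(issues_before), set(issues_after)
    fixed = before - after          # candidates for "very likely real bugs"
    pre_existing = before & after   # reported in both versions
    introduced = after - before     # only reported in the after-commit version
    return fixed, pre_existing, introduced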
Merge Issues Based on Commit History. After identifying fixed issues in each version pair, we merge and deduplicate the issues from all version pairs. In particular, we compute the sha1sum of the bug report after removing location-related contents (e.g., the file names, line numbers, etc.) and use it as the id for deduplication. The reason we remove location-related contents is that the same piece of code may be bumped to a different location by a commit, changing only the line numbers in the report. Then, we apply the following two heuristics, sketched below, to filter out the issues that are unlikely to be bugs based on the commit history.

• Fixed-then-unfixed issues: The same issue may appear in multiple version pairs. We sort all its occurrences based on the author date of the commit. If a fixed issue appears again in a later version, it's probably a false positive due to a mistake of the static analyzer. We change the labels of such cases and mark them as negatives.
• Untouched issues: We check which parts of the code base are patched by the commit. If the commit code diff doesn't overlap with any step of the bug trace at all, it's unlikely the issue was fixed by the commit; it is more likely a false positive reported by the static analyzer. We mark such cases as negatives as well.

After applying the above filters, the remaining issues in the fixed-issues group are labeled as positives (issues that are more likely to be buggy) and all other issues are labeled as negatives (issues that are more likely to be non-buggy). We call these auto-labeler examples. Because auto-labeler examples are generated based on issues reported by Infer, they all have the Infer bug reports.
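The sketch below illustrates one possible reading of the fingerprinting and filtering logic described above; the exact regular expressions and data layout are assumptions, not the released implementation.

# Sketch of the location-insensitive fingerprint and the two filters.
import hashlib
import re

def issue_fingerprint(report_text):
    # Drop file paths with line/column numbers and "line N"/"column N" phrases
    # so that code that merely moved still maps to the same id, then hash.
    stripped = re.sub(r"[\w/.-]+\.(c|h|cpp|hpp):\d+(:\d+)?", "", report_text)
    stripped = re.sub(r"\bline \d+\b|\bcolumn \d+\b", "", stripped)
    return hashlib.sha1(stripped.encode("utf-8")).hexdigest()

def relabel(occurrences, diff_locations, trace_locations):
    """occurrences: (author_date, was_fixed) pairs for one deduplicated issue,
    sorted by author date; the location arguments are sets of (file, line)."""
    # Fixed-then-unfixed: a "fixed" issue that shows up again later was
    # probably never a real bug (or the analyzer was non-deterministic).
    fixed_then_unfixed = any(
        was_fixed and i + 1 < len(occurrences)
        for i, (_, was_fixed) in enumerate(occurrences))
    # Untouched: if the commit diff never overlaps the bug trace, the commit
    # most likely did not fix this issue.
    untouched = not (trace_locations & diff_locations)
    return 0 if (fixed_then_unfixed or untouched) else 1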
D. After-fix Examples
Due to the nature of vulnerabilities, the auto-labeler produces many more negatives than positives, so the dataset of auto-labeler examples is quite imbalanced. Given that the positive auto-labeler examples are assumed to be bugs fixed in the after-commit versions, extracting the corresponding fixed versions yields another kind of negative example, which we call after-fix examples. There are two benefits: (1) Since each such negative example corresponds to a positive example, the dataset of auto-labeler positive examples and after-fix negative examples is balanced. (2) The after-fix negative examples are closely related to the positive ones, so they may help models focus on the delta parts that fixed the bugs. Note that the after-fix examples do not have a static analysis bug report because the issue does not appear in the after-commit version.
E. An Example in the D2A Dataset
Fig. 4 shows a D2A example, which contains bug-related information obtained from the static analyzer, the code base, and the commit meta-data. In particular, every example has its label (0 or 1) and label_source ("auto_labeler" or "after_fix_extractor") to denote how the example was generated and whether it is buggy. bug_type, bug_info, trace, and zipped_bug_report are obtained from the static analyzer, which provides details about the bug types, locations, traces, and the raw bug report produced by Infer. This information can be useful to train models on bug reports. For each step in the trace, if it refers to a location inside a function, we extract the function body and save it in the functions section. Therefore, an example has all functions involved in the bug trace, which can be used by function-level or trace-level models. Besides, we cross-check with the commit code diff. If a function is patched by the commit, touched_by_commit is true. In addition, the compiler arguments used to compile the source file are saved in the compiler_args field. They can be useful when we want to run extra analyses that require compilation (e.g., libclang [46] based tools).

{
  "id": "httpd_9b3a5f0ffd8ec787cf645f97902582acb3234d96_1",
  "label": 1,
  "label_source": "auto_labeler",
  "bug_type": "BUFFER_OVERRUN_U5",
  "project": "httpd",
  "bug_info": {
    "qualifier": "Offset: [0, +oo] Size: 10 by call to ...",
    "loc": "modules/proxy/mod_proxy_fcgi.c:178:31",
    "url": "https://github.com/apache/httpd/blob/..."
  },
  "versions": {
    "before": "545d85acdaa384a25ee5184a8ee671a18ef5582f",
    "after": "2c70ed756286b2adf81c55473077698d6d6d16a1"
  },
  "trace": [
    {
      "description": "Array declaration",
      "loc": "modules/proxy/mod_proxy_fcgi.c:178:31",
      "func_key": "modules/proxy/mod_proxy_fcgi.c@167:1-203:2"
    }
  ],
  "functions": {
    "modules/proxy/mod_proxy_fcgi.c@167:1-203:2": {
      "name": "fix_cgivars",
      "touched_by_commit": true,
      "code": "static void fix_cgivars(request_rec *r, ..."
    }
  },
  "commit": {
    "url": "https://github.com/apache/httpd/commit/2c70ed7",
    "changes": [
      {
        "before": "modules/proxy/mod_proxy_fcgi.c",
        "after": "modules/proxy/mod_proxy_fcgi.c",
        "changes": ["177,1^^177,5"]
      }
    ]
  },
  "compiler_args": {
    "modules/proxy/mod_proxy_fcgi.c": "-D_REENTRANT -I./server ..."
  },
  "zipped_bug_report": "..."
}

Fig. 4. A Simplified Example in D2A Dataset.
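For illustration, a consumer of examples shaped like Fig. 4 might assemble trace-level inputs as follows (the file name d2a_httpd.json is hypothetical; the released dataset may be packaged differently):

# Sketch of consuming examples shaped like Fig. 4.
import json

with open("d2a_httpd.json") as f:          # hypothetical file name
    examples = json.load(f)

for ex in examples:
    label = ex["label"]                    # 1: likely real bug, 0: likely false positive
    funcs = ex["functions"]                # all functions on the bug trace
    # Concatenate the functions along the trace, e.g. for a trace-level model.
    trace_code = "\n".join(
        funcs[step["func_key"]]["code"]
        for step in ex["trace"]
        if step.get("func_key") in funcs)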
IV. STATIC ANALYSIS FALSE POSITIVE REDUCTION
Although the D2A dataset is mainly intended for machine-learning-based vulnerability detection methods, in this paper we show the dataset can be used to train a static analysis false positive reduction model and help developers prioritize potential true positives. We will present AI-based code understanding approaches for vulnerability detection tasks in another paper.
A. Problem Statement
As observed previously [22], [23], an excessive number of false positives greatly hinders the utilization of static analyzers, as developers get frustrated and do not trust the tools. To this end, we aim to devise a method that can identify a subset of the reported issues that are more likely to be true positives, and use it as a prioritization tool. Developers may focus on the issues selected by the model first and then move to the remaining issues, which have a higher false positive rate, if they have time.

We treat the static analyzer as a black box and train a false positive reduction model solely based on the bug reports. Our goal is to achieve a balance between a large number of predicted positives and a high false positive reduction rate. We want developers to see more real bugs in the predicted positives compared to all issues reported by the static analyzer.
B. Static Analysis Outputs/Data

a) Bug Trace description:
Infer static analysis produces many output files. For our purposes, we are only interested in the bug trace text file, illustrated in Fig. 1, from which we extract the features. The bug trace starts with the location where the static analyzer believes the error to have originated, and lists all the steps up to the line generating the error. Many of the bugs are inter-procedural, so the bug trace cuts across many files and functions. For each step in the flow, the trace contains 5 lines of code centered on the statement involved, the location of the file and function in the project, and a brief description of the step. At the top of the trace, the file and line of code where the bug occurred are mentioned along with the bug type (error type). There is also a short description of the bug. The bug trace is therefore a combination of different types of data like source code, natural language, numeric data like line numbers, and file paths.

b) Dataset description:
As described in Sec. III-D, the original dataset has two types of negative examples, before-fix and after-fix. For these experiments, we built a dataset using the positive samples and the before-fix negative examples. We are not interested in the after-fix negative examples since these samples don't produce a bug trace. In every project, the number of negative labels is very large compared to the number of positive labels, as can be seen in Table VI.
C. Feature Engineering
Our primary assumption when coming up with features was that complex code is more likely to have bugs and/or is more likely to be classified as having bugs by a static analyzer, because it is highly probable that the developer failed to consider all possible implications of the code. Complex code is also more difficult for other developers to understand, increasing the chance of their introducing bugs.

One indication of complexity is the size of the bug trace. A long bug trace indicates that the control passes through many functions, files, or packages. The location of the bug could also indicate the complexity of the code. The line number is indicative of the size of the file, and the column number indicates the length of the line of code where the bug occurred. The depth of the line of code could indicate how entrenched the problematic code happens to be. Conditional statements cause many branches of execution to emerge and these can lead to convoluted and buggy code. One way to estimate the complexity is to count the number of times conditional statements occur and also the occurrences of OR/AND conditions. The error type is also a major feature that we consider, as well as the number of C keywords used. Table III lists our final set of features. We extract and normalize these features and save them in a features file.
TABLE III
FEATURES EXTRACTED FROM INFER BUG REPORT

Feature                    | Description
error                      | Infer bug/issue type
error line                 | line number of the error
error line len             | length of error line
error line depth           | indent for the error line text
average error line depth   | average indent of code lines
max error line depth       | max indent of code lines
error pos fun              | position of error within function
average code line length   | average length of lines in flow
max code line length       | max length of lines in flow
length                     | the number of lines of code
code line count            | the number of flow lines
alias count                | the number of address assignment lines
arithmetic count           | average operators / step
assignment count           | fraction of assignment steps
call count                 | fraction of call steps
cfile count                | the number of different .c files
for count                  | the number of for loops in report
infinity count             | fraction of +oo steps
keywords count             | the number of C keywords
package count              | the number of different directories
question count             | fraction of '??' steps
return count               | average branches / step
size calculating count     | average size calculations / step
parameter count            | fraction of parameter steps
offset added               | the number of "offset added"s in report
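As an illustration of how some of these features could be derived from the report text in Fig. 1, consider the following sketch; the exact parsing rules used to build D2A's feature files are assumptions here.

# Sketch of extracting a few Table III features from an Infer bug report text.
import re

C_KEYWORDS = {"if", "else", "for", "while", "return", "switch", "case",
              "sizeof", "break", "continue", "goto"}

def extract_features(report_text):
    lines = report_text.splitlines()
    feats = {}
    # error / error line: bug type and line number from the report header,
    # e.g. "crypto/initthread.c:385: error: NULL_DEREFERENCE".
    m = re.search(r"^(\S+?):(\d+): error: (\w+)", report_text, re.MULTILINE)
    feats["error"] = m.group(3) if m else "UNKNOWN"
    feats["error_line"] = int(m.group(2)) if m else 0
    # length / cfile count / package count: trace size and spread.
    locs = re.findall(r"([\w/.-]+\.c):(\d+)", report_text)
    feats["length"] = len({line for _, line in locs})
    feats["cfile_count"] = len({path for path, _ in locs})
    feats["package_count"] = len({path.rsplit("/", 1)[0] for path, _ in locs})
    # for count / keywords count: rough complexity counts over the quoted code.
    feats["for_count"] = sum(line.count("for (") for line in lines)
    feats["keywords_count"] = sum(
        tok in C_KEYWORDS
        for line in lines for tok in re.findall(r"[A-Za-z_]\w*", line))
    return feats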
D. Model Selection
We experimented with 13 well-known machine learning models: namely, Decision Trees, K-means, Random Forest, Extra-Trees, Gradient Boosting, Ada Boost, XGBoost, Catboost, LightGBM, Linear Classifiers with Stochastic Gradient Descent, Gaussian Naive Bayes, Multinomial Naive Bayes, and Complement Naive Bayes. We ranked them based on both their AUC and F1 scores and selected the four best models. These were all based on ensembles of decision trees: Random Forest and Extra-Trees for bagging methods, and LightGBM and Catboost for boosting.

Random Forest is made of many weak learners (single decision trees), which are fed with random samples of the data and trained independently using a random sample of the features. Each inner decision tree is grown by using the features which offer the best split at every step. The randomness, which makes every single tree different from the rest of the forest, combined with a high number of learners, makes the model quite robust against overfitting. A slightly different variation of Random Forest is Extra-Trees, also known as Extremely Randomized Trees; the only differences are how the data is sampled to create the input and how the splits are chosen randomly, making the forest more diversified. Differently from bagging methods, where learners are trained independently, with gradient boosting methods each tree improves over the predictions made by the previous ones. Boosting techniques are known to work well with imbalanced data, but may suffer more from overfitting; to mitigate this effect, typically many trees are used together. LightGBM and Catboost are different frameworks implementing this kind of ensemble: the former grows an imbalanced tree while the latter grows a balanced one.
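A minimal sketch of this screening step, assuming scikit-learn-style classifiers and using only a subset of the 13 candidates, might look like this (tree-ensemble sizes follow Sec. V-B3; everything else is a default or an assumption):

# Sketch of the model-screening step: train a few candidates and rank them
# by AUC and F1 on a held-out set (candidate list abridged for brevity).
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score, f1_score

CANDIDATES = {
    "rf": RandomForestClassifier(n_estimators=1000),
    "etc": ExtraTreesClassifier(n_estimators=500),
    "gb": GradientBoostingClassifier(),
    "gnb": GaussianNB(),
}

def rank_models(X_train, y_train, X_dev, y_dev):
    scores = {}
    for name, model in CANDIDATES.items():
        model.fit(X_train, y_train)
        prob = model.predict_proba(X_dev)[:, 1]
        pred = (prob > 0.5).astype(int)
        scores[name] = (roc_auc_score(y_dev, prob), f1_score(y_dev, pred))
    # Sort by AUC, breaking ties with F1; keep the best few models.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)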
E. Evaluation Metrics
In order to evaluate the different models, because of the imbalance in the dataset, we used the Area Under the Curve (AUC score, Fig. 5), a threshold-invariant metric that visualizes the trade-off when we want to reduce the false positive rate while maintaining a good true positive rate. Since the main task is to reduce the number of false positives, we calculate the percentage reduction in false positives on the test set. Relying too much on this metric can bias towards models which make very few accurate predictions.
Fig. 5. The ROC curve tracks the performance at all classification thresholds, while the Area Under the Curve (AUC) provides an aggregation of performance across all possible classification thresholds. The point on the curve that minimizes the distance from the top-left corner is the one which provides the best compromise between false positive rate and true positive rate.

To make sure this is not the case, we also calculate the total percentage of true positives which are predicted by the model. An ideal model would have a very high AUC score, a low false positive rate, and a high true positive rate. To choose the operating threshold, we select the point on the ROC curve which minimizes the distance from the top-left corner (all true positives and no false positives). Once this point is chosen, we also report the F1-score, computed as the average of each class's F1-score, since our goal is to reduce the number of false positives while preserving the real ones.
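A sketch of this threshold choice (the ROC point closest to the top-left corner), under the assumption of scikit-learn-style score arrays:

# Sketch of choosing the operating threshold: the ROC point closest to the
# top-left corner (false positive rate 0, true positive rate 1).
import numpy as np
from sklearn.metrics import roc_curve

def best_threshold(y_true, y_score):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    dist = np.sqrt(fpr ** 2 + (1.0 - tpr) ** 2)  # distance to the point (0, 1)
    return thresholds[np.argmin(dist)]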
F. Voting
Real-world datasets present a high imbalance between real bugs and false positives. Also, the projects used to derive the datasets proposed in this work vary in size, yielding different dataset sizes. Therefore, it's not easy to choose the model which does best on all the datasets. While a model can perform very well on one dataset, it could work poorly on another. To mitigate this problem we applied a soft-voting strategy, which combines the scores of each classifier and should guarantee more stable behavior across datasets.
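A sketch of the soft-voting ensemble, assuming scikit-learn's VotingClassifier and the four selected models (estimator settings follow Sec. V-B3 where stated; anything else is a default or an assumption):

# Sketch of the soft-voting ensemble over the four selected classifiers.
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              VotingClassifier)
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

voting = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=1000)),
        ("etc", ExtraTreesClassifier(n_estimators=500)),
        ("lgbm", LGBMClassifier(n_estimators=500, learning_rate=0.03,
                                importance_type="gain")),
        ("cb", CatBoostClassifier(n_estimators=500, verbose=0)),
    ],
    voting="soft",  # average the predicted probabilities of the classifiers
)
# Typical use: voting.fit(X_train, y_train)
#              scores = voting.predict_proba(X_test)[:, 1]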
V. EVALUATION
In this section, we present the results of the D2A dataset generation and label evaluation. In addition, we show the evaluation results of the AI-based static analysis false positive reduction as a use case to demonstrate how the D2A dataset can be helpful.
A. Dataset Generation Results

1) Dataset Statistics:
The dataset generation pipeline is written in Python and runs on a POWER8 cluster, where each node has 160 CPU cores and 512GB RAM. We analyzed 6 open-source programs (namely, OpenSSL, FFmpeg, httpd, NGINX, libtiff, and libav) and generated the initial version of the D2A dataset. In particular, Infer can detect more than 150 types of issues in C/C++/Objective-C/Java programs [47]. However, some issue types are not ready for production and thus disabled by default. In the pipeline, we additionally enabled all issue types related to buffer overflows, integer overflows, and memory/resource leaks, even though some of them may not be production-ready.

Table IV summarizes the dataset generation results. The column CMA Version Pairs shows the number of bug-fixing commits selected by the commit message analyzer (Sec. III-B). For each selected commit, we run Infer on both the before-commit and after-commit versions. We drop a commit if Infer failed to analyze either the before-commit version or the after-commit version. Column Infer shows the number of commits or version pairs Infer successfully analyzed. For auto-labeler examples (Sec. III-C), columns Issues Reported and Unique Auto-labeler Examples - All show the number of issues Infer detected in the before-commit versions before and after deduplication, which are labeled as positives and negatives as shown in columns Positives and Negatives. For after-fix examples (Sec. III-D), column Negatives shows the number of examples generated based on the auto-labeler positive examples. In total, we processed 11,846 consecutive version pairs. Based on the results, we generated 1,295,623 unique auto-labeler examples and 18,653 unique after-fix examples.
2) Manual Label Validation:
As there is no ground truth, to evaluate the label quality we randomly selected 57 examples (41 positives, 16 negatives) with a focus on positives. We gave more weight to positive examples because they are more important for our purpose. As mentioned in Sec. IV-A, labeling a non-buggy example as buggy is against the goal of false positive reduction, but it's acceptable if we miss some of the real bugs. If we selected examples according to the overall dataset distribution, we would have too few positive examples. Each example was independently reviewed by 2 reviewers. Table V shows the label validation results. On this biased sample set, the accuracy with and without the auto-labeler is 53% and 35%, respectively. Note the accuracy on an unbiased sample set is expected to be higher as there should be more negative examples. Take the OpenSSL study in Sec. II-B as an example: without the auto-labeler, the accuracy was only 7.8% on the set of 166 security-related examples.
B. False Positive Reduction Results

1) Dataset:
To facilitate reproducibility, we defined and plan to release a split for each project. In particular, we drop bug types without any positive examples and split each project's data into train:dev:test sets (80:10:10) while maintaining the distribution of bug types. We use the same split in this experiment. The models are trained on the train + dev sets and tested on the test set.

Fig. 6. Feature Importance of Random Forest algorithm trained on OpenSSL.

We observed that some FFmpeg and libav examples are quite similar, as libav was forked from FFmpeg [48]. We dropped the FFmpeg examples so that the all-data combined experiment would be fair. FFmpeg examples are more imbalanced compared to libav and we leave them for future work. Although we collect examples generated for many bug types that are not production-ready and are disabled by default in Infer, in this experiment we consider just the 18 security-related bug types that are enabled by default. Table VI shows the statistics of the data used in the experiment.
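A sketch of such an 80:10:10 split stratified by bug type, assuming each retained bug type has enough examples to stratify (function and variable names are illustrative; the released split files should be used for reproducibility):

# Sketch of a per-project 80:10:10 split stratified by bug type.
from sklearn.model_selection import train_test_split

def split_project(examples, bug_types, seed=0):
    # First carve out 20% for dev+test, then split that half-and-half,
    # stratifying on the bug type at every step.
    train, rest, y_train, y_rest = train_test_split(
        examples, bug_types, test_size=0.2, stratify=bug_types,
        random_state=seed)
    dev, test, _, _ = train_test_split(
        rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return train, dev, test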
2) Feature Importance:
We used feature importance ranking when selecting the final set of 25 features. Figure 6 shows the features and their relative importance for one of the models. Two features that were important for many models are the line number of the error in the file and the number of lines of code in the bug report, perhaps suggesting that large files and complex bug reports distinguish real errors.
3) FP Reduction Model:
We trained Random Forest using 1000 estimators and Extra Trees with 500. For the boosting algorithms, we used 500 estimators, a learning rate of 0.03, and importance type "gain" for the LGBM classifier, and the same number of estimators for Catboost. We define the False Positive Reduction Rate (FPRR) = (FP_infer − FP_predict) / FP_infer × 100. Because all issues are positive according to Infer, FP_infer is just the number of negative examples.

As shown in Table VII, all models can effectively reduce the false positives for each project. In most cases, the FPRR is above 70% for every model, without penalizing the detection of true bugs too much. As expected, it's hard to find one best model across all the projects. However, it's encouraging that the voting can outperform the single models in the combined experiment, which suggests that the more data we have, the better the voting system could perform.
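For clarity, the quantities reported above can be computed as in the following sketch (variable names are illustrative):

# Sketch of the quantities in Table VII. Infer flags every reported issue,
# so FP_infer equals the number of negative (non-buggy) examples.
def fp_reduction_rate(y_true, y_pred):
    fp_infer = sum(1 for t in y_true if t == 0)
    fp_predict = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fprr = (fp_infer - fp_predict) / fp_infer * 100
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return fprr, true_positives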
VI. RELATED WORK

Datasets for AI-based Vulnerability Detection. Juliet [21], Choi et al. [34], and S-babi [14] are synthetic datasets.
TABLE IV
DATASET GENERATION RESULTS

Project | Version Pairs: CMA | Infer | Issues Reported | Unique Auto-labeler Examples: All | Negatives | Positives | Unique After-fix Examples: Negatives
OpenSSL | –      | –      | –           | –         | –         | –      | –
FFmpeg  | –      | –      | –           | –         | –         | –      | –
httpd   | –      | –      | –           | –         | –         | –      | –
NGINX   | 785    | 635    | 3,283,202   | 18,366    | 17,945    | 421    | 421
libtiff | 144    | 144    | 525,360     | 12,649    | 12,096    | 553    | 553
libav   | –      | –      | –           | –         | –         | –      | –
Total   | 14,447 | 11,846 | 349,373,753 | 1,295,623 | 1,276,970 | 18,653 | 18,653

• CMA: The number of bug-fixing commits identified by the commit message analyzer.
• Infer: The number of version pairs successfully analyzed by Infer.
• Issues Reported: The number of issues detected in the before-commit versions before deduplication.
TABLE V
AUTO-LABELER MANUAL VALIDATION RESULTS

Bug Type            | Positives (Count / A / D) | Negatives (Count / A / D) | All (Count / A / D)
BUFFER_OVERRUN_L1   | 2 / 0 / 2                 | 1 / 1 / 0                 | 3 / 1 / 2
BUFFER_OVERRUN_L2   | 3 / 1 / 2                 | 1 / 1 / 0                 | 4 / 2 / 2
BUFFER_OVERRUN_L3   | 6 / 1 / 5                 | 4 / 4 / 0                 | 10 / 5 / 5
BUFFER_OVERRUN_S2   | 0 / 0 / 0                 | 1 / 0 / 1                 | 1 / 0 / 1
INTEGER_OVERFLOW_L1 | 3 / 2 / 1                 | 1 / 1 / 0                 | 4 / 3 / 1
INTEGER_OVERFLOW_L2 | 13 / 6 / 7                | 3 / 3 / 0                 | 16 / 9 / 7
INTEGER_OVERFLOW_R2 | 1 / 1 / 0                 | 0 / 0 / 0                 | 1 / 1 / 0
MEMORY_LEAK         | 1 / 1 / 0                 | 1 / 1 / 0                 | 2 / 2 / 0
NULL_DEREFERENCE    | 2 / 1 / 1                 | 1 / 0 / 1                 | 3 / 1 / 2
RESOURCE_LEAK       | 1 / 1 / 0                 | 1 / 1 / 0                 | 2 / 2 / 0
UNINITIALIZED_VALUE | 9 / 3 / 6                 | 1 / 1 / 0                 | 10 / 4 / 6
USE_AFTER_FREE      | 0 / 0 / 0                 | 1 / 1 / 0                 | 1 / 1 / 0
ALL                 | 41 (100%) / 17 (41%) / 24 (59%) | 16 (100%) / 13 (81%) / 3 (19%) | 57 (100%) / 30 (53%) / 27 (47%)

• Count: the issue count; A/D: manual review agrees/disagrees with the auto-labeler label.
TABLE VI
PRODUCTION-READY SECURITY-RELATED ERROR TYPE FILTERING

Project | All Errors (Negatives / Positives / N:P) | Prod-ready Sec Errs (Negatives / Positives / N:P)
OpenSSL | – / – / –                                | – / – / –
libav   | – / – / –                                | – / – / –
NGINX   | – / – / –                                | – / – / –
libtiff | – / – / –                                | – / – / –
httpd   | – / – / –                                | – / – / –
They are generated based on predefined patterns and cannot represent real-world program behaviors. Draper [12], Devign [15], and CDG [13] were generated from real-world programs. However, as discussed in Sec. II, they suffer from labeling or source limitations. In fact, the lack of good real-world datasets has become a major barrier for this field [35]. D2A is automated and scales well on large real-world programs. It can produce more bug-related information. We believe it can help to bridge the gap.
AI-based Static Analysis FP Reduction. Static analysis is known to produce a lot of false positives. To suppress them, several machine learning based approaches [25], [26], [27], [28], [10], [29], [30], [31], [49], [32], [33] have been proposed. Because they either target different languages or different static analyzers, they are not directly applicable. Inspired by their approaches, we designed and implemented a false positive reduction model for Infer as a use case for the D2A dataset.
VII. CONCLUSION
In this paper, we propose D2A, a novel approach to label static analysis issues based on differential analysis, and build a labeled dataset from real-world programs for AI-based vulnerability detection methods. We ran D2A on 6 large programs and generated a labeled dataset of more than 1.3M examples with detailed bug-related information obtained from the inter-procedural static analysis, the code base, and the commit history. By manually validating randomly selected samples, we show D2A significantly improves the label quality compared to static analysis alone. We train a static analysis false positive reduction model as a use case for the D2A dataset, which can effectively suppress false positives and help developers prioritize and investigate potential true positives first.
TABLE VII
FALSE POSITIVE REDUCTION RESULTS

Project  | Model  | GP  | P    | TP | GN   | N    | TN   | F1   | FPRR  | AUC
OpenSSL  | cb     | 81  | 858  | 62 | 2711 | 1934 | 1915 | 0.48 | 70.6% | 0.79
OpenSSL  | lgbm   | 81  | 827  | 65 | 2711 | 1965 | 1949 | 0.49 | 71.9% | 0.82
OpenSSL  | rf     | 81  | 591  | 59 | 2711 | 2201 | 2179 | 0.53 | 80.4% | 0.83
OpenSSL  | etc    | 81  | 616  | 52 | 2711 | 2176 | 2147 | 0.51 | 79.2% | 0.78
OpenSSL  | voting | 81  | 506  | 58 | 2711 | 2286 | 2263 | 0.55 | 83.5% | 0.83
libav    | cb     | 28  | 256  | 22 | 1495 | 1266 | 1260 | 0.53 | 84.3% | 0.89
libav    | lgbm   | 28  | 220  | 21 | 1495 | 1303 | 1296 | 0.55 | 86.7% | 0.91
libav    | rf     | 28  | 287  | 22 | 1495 | 1236 | 1230 | 0.52 | 82.3% | 0.87
libav    | etc    | 28  | 54   | 13 | 1495 | 1469 | 1454 | 0.65 | 97.3% | 0.70
libav    | voting | 28  | 254  | 21 | 1495 | 1269 | 1262 | 0.53 | 84.4% | 0.89
NGINX    | cb     | 5   | 27   | 3  | 145  | 123  | 121  | 0.55 | 83.4% | 0.85
NGINX    | lgbm   | 5   | 47   | 4  | 145  | 103  | 102  | 0.49 | 70.3% | 0.86
NGINX    | rf     | 5   | 46   | 3  | 145  | 104  | 102  | 0.47 | 70.3% | 0.75
NGINX    | etc    | 5   | 60   | 3  | 145  | 90   | 88   | 0.42 | 60.7% | 0.67
NGINX    | voting | 5   | 54   | 4  | 145  | 96   | 95   | 0.46 | 65.5% | 0.78
libtiff  | cb     | 3   | 17   | 2  | 118  | 104  | 103  | 0.56 | 87.3% | 0.92
libtiff  | lgbm   | 3   | 5    | 1  | 118  | 116  | 114  | 0.61 | 96.6% | 0.72
libtiff  | rf     | 3   | 7    | 2  | 118  | 114  | 113  | 0.69 | 95.8% | 0.98
libtiff  | etc    | 3   | 8    | 2  | 118  | 113  | 112  | 0.67 | 94.9% | 0.97
libtiff  | voting | 3   | 7    | 2  | 118  | 114  | 113  | 0.69 | 95.8% | 0.97
httpd    | cb     | 2   | 5    | 1  | 17   | 14   | 13   | 0.56 | 76.5% | 0.88
httpd    | lgbm   | 2   | 3    | 1  | 17   | 16   | 15   | 0.65 | 88.2% | 0.94
httpd    | rf     | 2   | 9    | 1  | 17   | 10   | 9    | 0.42 | 52.9% | 0.77
httpd    | etc    | 2   | 6    | 1  | 17   | 13   | 12   | 0.53 | 70.6% | 0.85
httpd    | voting | 2   | 6    | 1  | 17   | 13   | 12   | 0.53 | 70.6% | 0.85
combined | cb     | 119 | 1403 | 95 | 4486 | 3202 | 3178 | 0.48 | 70.8% | 0.82
combined | lgbm   | 119 | 1274 | 93 | 4486 | 3331 | 3305 | 0.49 | 73.7% | 0.83
combined | rf     | 119 | 1063 | 86 | 4486 | 3542 | 3509 | 0.50 | 78.2% | 0.84
combined | etc    | 119 | 1053 | 74 | 4486 | 3552 | 3507 | 0.50 | 78.2% | 0.74
combined | voting | 119 | 814  | 82 | 4486 | 3791 | 3754 | 0.54 | 83.7% | 0.84

• The released dataset split defines train/dev/test for each project. For combined, its train/dev/test sets are the union of the corresponding sets of all projects.
• The models are trained on train + dev sets and tested on the test set.
• GP/P/TP: Ground-truth/Predicted/True Positives; GN/N/TN defined similarly.
• FPRR: False Positive Reduction Rate = (GN − FP) / GN × 100
• cb: Catboost, lgbm: LightGBM, rf: Random Forest, etc: Extra-Trees
REFERENCES

[1] T. D. LaToza, G. Venolia, and R. DeLine, "Maintaining mental models: A study of developer work habits," in Proceedings of the 28th International Conference on Software Engineering, 2006.
[2] E. Murphy-Hill, T. Zimmermann, C. Bird, and N. Nagappan, "The design space of bug fixes and how developers navigate it," IEEE Transactions on Software Engineering, vol. 41, no. 1, pp. 65–81, 2015.
[3] N. Ayewah, W. Pugh, D. Hovemeyer, J. D. Morgenthaler, and J. Penix, "Using static analysis to find bugs," IEEE Software, vol. 25, no. 5, pp. 22–29, 2008.
[4] N. Ayewah, W. Pugh, J. D. Morgenthaler, J. Penix, and Y. Zhou, "Using findbugs on production software," in OOPSLA'07, 2007.
[5] F. Yamaguchi, A. Maier, H. Gascon, and K. Rieck, "Automatic inference of search patterns for taint-style vulnerabilities," 2015.
[6] G. Fan, R. Wu, Q. Shi, X. Xiao, J. Zhou, and C. Zhang, "Smoke: Scalable path-sensitive memory leak detection for millions of lines of code," in ICSE'19, 2019.
[7] V. B. Livshits and M. S. Lam, "Finding security vulnerabilities in java applications with static analysis," in Proceedings of the 14th Conference on USENIX Security Symposium, 2005.
[8] S. Guarnieri, M. Pistoia, O. Tripp, J. Dolby, S. Teilhet, and R. Berg, "Saving the world wide web from vulnerable javascript," in Proceedings of the 2011 International Symposium on Software Testing and Analysis, ISSTA'11, 2011.
[9] U. Yüksel and H. Sözer, "Automated classification of static code analysis alerts: A case study," in ICSM'13, 2013.
[10] O. Tripp, S. Guarnieri, M. Pistoia, and A. Aravkin, "ALETHEIA: Improving the usability of static security analysis," in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, 2014, pp. 762–774.
[11] U. Koc, P. Saadatpanah, J. S. Foster, and A. A. Porter, "Learning a classifier for false positive error reports emitted by static code analysis tools," in MAPL'17, 2017.
[12] R. L. Russell, L. Y. Kim, L. H. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. M. Ellingwood, and M. W. McConley, "Automated vulnerability detection in source code using deep representation learning," in ICMLA'18, 2018.
[13] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, "Vuldeepecker: A deep learning-based system for vulnerability detection," 2018.
[14] C. D. Sestili, W. S. Snavely, and N. M. VanHoudnos, "Towards security defect prediction with AI," CoRR, vol. abs/1808.09897, 2018. [Online]. Available: http://arxiv.org/abs/1808.09897
[15] Y. Zhou, S. Liu, J. K. Siow, X. Du, and Y. Liu, "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks," in NeurIPS'19, 2019.
[16] L. Buratti, S. Pujar, M. Bornea, J. S. McCarley, Y. Zheng, G. Rossiello, A. Morari, J. Laredo, V. Thost, Y. Zhuang, and G. Domeniconi, "Exploring software naturalness through neural language models," CoRR, vol. abs/2006.12641, 2020.
[17] S. Suneja, Y. Zheng, Y. Zhuang, J. Laredo, and A. Morari, "Learning to map source code to software vulnerability using code-as-a-graph," CoRR, vol. abs/2006.08614, 2020.
[18] R. Paletov, P. Tsankov, V. Raychev, and M. Vechev, "Inferring crypto api rules from code changes," in Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI 2018, 2018, pp. 450–464.
[19] NIST, "National vulnerability database," https://nvd.nist.gov/.
[20] MITRE, "Common vulnerabilities and exposures," https://cve.mitre.org/index.html.
[21] NIST, "Juliet test suite for c/c++ version 1.3," https://samate.nist.gov/SRD/testsuite.php.
[22] B. Johnson, Y. Song, E. Murphy-Hill, and R. Bowdidge, "Why don't software developers use static analysis tools to find bugs?" in ICSE'13, 2013, pp. 672–681.
[23] T. B. Muske, A. Baid, and T. Sanas, "Review efforts reduction by partitioning of static analysis warnings," 2013.
[24] T. Muske and A. Serebrenik, "Survey of approaches for handling static analysis alarms," 2016, pp. 157–166.
[25] T. Kremenek and D. R. Engler, "Z-ranking: Using statistical analysis to counter the impact of static analysis approximations," in Static Analysis, 10th International Symposium, SAS 2003, R. Cousot, Ed., 2003.
[26] Y. Jung, J. Kim, J. Shin, and K. Yi, "Taming false alarms from a domain-unaware c analyzer by a bayesian statistical post analysis," in Proceedings of the 12th International Conference on Static Analysis, ser. SAS'05, 2005, pp. 203–217.
[27] U. Yüksel and H. Sözer, "Automated classification of static code analysis alerts: A case study," 2013, pp. 532–535.
[28] Q. Hanam, L. Tan, R. Holmes, and P. Lam, "Finding patterns in static analysis alerts: Improving actionable alert ranking," in Proceedings of the 11th Working Conference on Mining Software Repositories, ser. MSR 2014, 2014, pp. 152–161.
[29] U. Koc, P. Saadatpanah, J. S. Foster, and A. A. Porter, "Learning a classifier for false positive error reports emitted by static code analysis tools," in MAPL'17, 2017, pp. 35–42.
[30] X. Zhang, X. Si, and M. Naik, "Combining the logical and the probabilistic in program analysis," in Proceedings of the 1st ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, ser. MAPL 2017, 2017, pp. 27–34.
[31] Z. P. Reynolds, A. B. Jayanth, U. Koc, A. A. Porter, R. R. Raje, and J. H. Hill, "Identifying and documenting false positive patterns generated by static code analysis tools," 2017.
[32] M. Raghothaman, S. Kulkarni, K. Heo, and M. Naik, "User-guided program reasoning using bayesian inference," in Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI 2018, 2018, pp. 722–735.
[33] U. Koc, S. Wei, J. S. Foster, M. Carpuat, and A. A. Porter, "An empirical assessment of machine learning approaches for triaging reports of a java static analysis tool," 2019, pp. 288–299.
[34] M.-J. Choi, S. Jeong, H. Oh, and J. Choo, "End-to-end prediction of buffer overruns from raw source code via neural memory networks," in Proceedings of the 26th International Joint Conference on Artificial Intelligence, ser. IJCAI'17, 2017.
[35] G. Lin, S. Wen, Q. L. Han, J. Zhang, and Y. Xiang, "Software vulnerability detection using deep neural networks: A survey," Proceedings of the IEEE, vol. 108, no. 10, pp. 1825–1848, 2020.
[36] Y. Sui, X. Cheng, G. Zhang, and H. Wang, "Flow2vec: Value-flow-based precise code embedding," OOPSLA, 2020.
[37] LLVM, "The clang static analyzer," https://clang-analyzer.llvm.org/.
[38] Facebook, "Infer static analyzer," https://fbinfer.com/.
[39] D. Chandrasekaran and V. Mago, "Evolution of semantic similarity - A survey," CoRR, vol. abs/2004.13820, 2020. [Online]. Available: https://arxiv.org/abs/2004.13820
[40] M. Sahami and T. D. Heilman, "A web-based kernel function for measuring the similarity of short text snippets," in WWW '06, 2006.
[41] Cppcheck-team, "Cppcheck," http://cppcheck.sourceforge.net/.
[42] D. A. Wheeler, "Flawfinder," https://dwheeler.com/flawfinder/.
[43] Facebook, "infer reportdiff," https://fbinfer.com/docs/man-infer-reportdiff.
[44] J. Villard, "Infer is not deterministic, infer issue