Out of Sight, Out of Place: Detecting and Assessing Swapped Arguments
Roger Scott∗, Joseph Ranieri†, Lucja Kot†, Vineeth Kashyap†
†GrammaTech, Inc.
[email protected], †{jranieri, lkot, vkashyap}@grammatech.com
∗Roger Scott performed this work while at GrammaTech.

Abstract—Programmers often add meaningful information about program semantics when naming program entities such as variables, functions, and macros. However, static analysis tools typically discount this information when they look for bugs in a program. In this work, we describe the design and implementation of a static analysis checker called SWAPD, which uses the natural language information in programs to warn about mistakenly-swapped arguments at call sites. SWAPD combines two independent detection strategies to improve the effectiveness of the overall checker. We present the results of a comprehensive evaluation of SWAPD over a large corpus of C and C++ programs totaling hundreds of millions of lines of code. In this evaluation, SWAPD found a substantial number of manually-vetted real-world cases of mistakenly-swapped arguments, suggesting that such errors—while not pervasive in released code—are a real problem and a worthwhile target for static analysis.
Index Terms—static analysis, natural language, swapped arguments, big code
I. INTRODUCTION
Static analysis tools consist of automated “checkers”, each of which identifies potential problems by looking for matches of a known code-defect pattern or violations of an established program development rule. However, traditional static analysis techniques—such as those based on data-flow analysis—do not use the rich natural language information in programs: variable names, field names in a structure or a class, function names, macro names, etc. Programmers seldom choose these names at random; they select names that convey information about the semantic concepts they are manipulating, with identifiable patterns in the creation, composition, and usage of those names. As we show in this work, static analysis tools can and should use these patterns to detect more bugs.

In this paper we introduce SWAPD, an automated static analysis checker that uses natural language information to detect mistakenly swapped arguments at call sites. Listing 1 shows an example of such a mistaken swap, found with SWAPD in the open-source code for the editor xvile [1]. (All the bugs listed in this paper were found with SWAPD on real-world code not written by the authors; for the sake of presentation, the listings simplify or elide the code context.) Here, the kill function from signal.h is called incorrectly: the arguments for process identifier and signal have been swapped. Because the two arguments are type compatible, even the compiler is unlikely to complain about the swap.

Incorrect argument ordering is an easy mistake to make when programming in a language that supports positional arguments—that is, a language where the position of an argument in a function call denotes which parameter it corresponds to—especially if the declaration for the callee function is not readily available. Programmer confusion may be exacerbated by certain function and interface design choices, such as counter-intuitive argument ordering or long parameter lists. In typed programming languages, type checking may catch some swapped-argument errors, but not all, as seen in Listing 1.

    // declaration in signal.h
    int kill(pid_t pid, int sig);

    // use in xvile
    if (child < 0 && errno == EINTR) {
        kill(SIGKILL, cpid);
Listing 1. Bug found with SWAPD: the arguments in the call to kill are mistakenly swapped.

Underpinning SWAPD are two observations about developer behavior when naming program entities. First, programmers often choose argument names that are similar to parameter names, due to an underlying conceptual match between the two [2]. Therefore, as in Listing 1, an accidental swap may have taken place if both (a) argument names do not cover (i.e., have a sufficient correspondence with) their parameter names, and (b) they would cover if argument positions were swapped. Second, if we examine several calls to a function (e.g., calls to a library function in a large code corpus), we find discernible statistical patterns with respect to argument names and their positions in the calls. Statistically, if argument names are atypical in their current positions, but common in swapped positions, it may indicate an error.

    // declaration in X11/Xlib.h
    extern Bool XQueryExtension(Display *, _Xconst char *, int *,
        int * /* first_event_return */,
        int * /* first_error_return */);

    // use in gpaste
    if (XQueryExtension (display, "XInputExtension", &xinput_opcode,
            &xinput_error_base, &xinput_event_base)) { /* ... */

Listing 2. Bug found by SWAPD when parameter names are not available in the declaration.
We have found that these two detection strategies are most effective when used in combination. In particular, we make use of statistical data to reduce both false positives (§III-F) and false negatives (§III-G). As a motivating example, consider the GPaste [3] bug in Listing 2, also found with SWAPD. Here, no parameter names are available in the declaration, so the second, statistical technique was key to detecting the bug.

A key feature of SWAPD is that we split parameter and argument names into smaller units, called morphemes (from the linguistic term for a unit of meaning in a natural language), before applying our techniques. Operating on morphemes rather than whole names is one of the factors that distinguishes SWAPD from closely related works [4], [5]. Splitting is motivated by the intuition that program identifiers are often constructed by agglutinating two or more morphemes. Listing 1 and Listing 2 represent individual examples of this naming behavior. Indeed, Figure 5 (§IV-D) shows that a significant portion of the corpus uses names containing multiple morphemes. For example, in Listing 1, a naïve attempt to match parameters and arguments based on string edit distances could fail, but splitting SIGKILL into sig and kill, and cpid into c and pid, makes the correspondence clear. Our morpheme-based approach also allows us to improve signal by removing morphemes that appear in multiple arguments in a call; those likely represent conceptual information about the calling context rather than about the intended correspondence to parameters. For example, in Listing 2, splitting xinput_error_base and xinput_event_base into constituent morphemes (and eliminating the common morphemes xinput and base) helps identify the underlying pattern—that error and event are statistically more likely in their swapped positions. We have found real bugs with multi-morpheme names (e.g., Listing 6), as summarized in Figure 6 (§IV-D).

The techniques used in SWAPD are largely programming-language agnostic. They are broadly applicable to programs in languages that support positional arguments. We have implemented a SWAPD prototype targeting C/C++ code; those languages are heavily used in security- and reliability-critical software, which provides particular motivation for accurate bug detection. During our empirical evaluation and triage of SWAPD warnings, we found that many apparent argument swaps are intentional. Thus, we have designed and adapted a variety of techniques to reduce the number of false positives.

Our major contributions include:
• A cover-based checker for detecting swapped arguments via mismatches between argument and parameter names.
• A statistical checker for swapped arguments based on data collected from a large code corpus.
• A morpheme-oriented handling of names in both these approaches, for increasing the relevant signal present in names.
• A hybrid approach combining the two checkers and further false-positive reduction techniques to achieve high accuracy in detecting swapped-argument errors.
• A comprehensive evaluation of SWAPD on an open-source C/C++ code corpus [6] containing hundreds of millions of lines of code; we believe our evaluation to be one of the largest in this research area, especially for C/C++. SWAPD found real swapped-argument errors across this corpus, suggesting that, while swapped-argument errors are not extremely common, they are a real problem, and efforts to detect them are likely to provide value to developers.

The remainder of this paper is organized as follows: we give an overview of SWAPD (§II); provide details of specific algorithms and techniques (§III); present the results of our empirical evaluation (§IV); discuss related work (§V); and conclude (§VI).

II. OVERVIEW
In this section, we give an overview of SWAPD. We include a number of forward references to §III, which contains further details on relevant algorithms, techniques, and heuristics. Figure 1 is an overview of SWAPD, featuring the bug in Listing 1. The top left quadrant shows the input to SWAPD: the call site being checked, and the corresponding function declaration. The function declaration is an optional input—if it is not available, then the cover-based checker is skipped. Given a call site, we extract names (§III-A) from the argument expressions at the call site and from the parameters in the callee function declaration (if available). Next, we split (§III-B) both argument and parameter names into morphemes.

Offline, we use a large corpus of code to compute a statistical database (§III-D), shown in the bottom left quadrant of Figure 1. The database is a key-value store, where keys are triples consisting of a function name, argument position, and morpheme. The values are weights indicating the number of projects in the corpus where the morpheme appears in calls to that function, at that argument position. Informally, the weight reflects the number of human programmer communities who considered the morpheme appropriate to use at the given argument position for that function.

The right-hand portion of the diagram shows the SWAPD pipeline of four stages. First, we compare the parameter morphemes and argument morphemes in the cover-based checker (§III-E). This stage does not need the statistical database, but it does require the function declaration with parameter names. If the cover-based checker finds a suspected error, SWAPD uses the statistical database to perform further vetting (§III-F) of the warning. The vetting rules out false positives due to usage patterns for certain functions where seemingly-swapped argument orderings are not rare, indicating that they could have a genuine and intentional use case. If we did report such warnings, there could be a lot of false positives due to function-specific patterns adopted by developers. If the suspected error passes the vetting step, we move on to the false-positive filtering stage described further below.

    // declaration in GStreamer
    guint64 gst_util_uint64_scale (guint64 val,
        guint64 num, guint64 denom);

    // use in gst-plugins
    diff = gst_util_uint64_scale_int (diff,
        denom_rate, num_rate);

Listing 3. Example call where seemingly-swapped arguments are statistically not rare—thus indicating a likely intentional swap.
Fig. 1. High-level overview of SWAPD. The top left quadrant shows the input to the checker: a call site and its corresponding declaration. The bottom left quadrant shows the offline curation—done once per corpus—of the statistical database, which is available to the checker. The argument names and parameter names corresponding to a call site are split and then processed in four stages, shown in the right half of the diagram. The sequencing of the four stages is shown using arrows marked with ✓ (if there may be a warning to report) and ✗ (if there is no warning to report). If a swapped-argument error at the call site is still suspected after the false-positive filtering stage, a warning is reported as shown in the bottom right quadrant.

Listing 3 shows a function declaration from GStreamer [7] and a call site with a likely intentional swap. It is statistically not rare to call gst_util_uint64_scale_int with the morphemes denom and num in the second and third argument positions respectively, i.e., in a swapped order based on parameter names, possibly as a shortcut for computing the reciprocal of the fraction. We discard such warnings.

If the cover-based checker and the statistical vetting do not find any errors at a call site, or are not applicable (as in Listing 2), SWAPD runs the statistical checker (§III-G) to look for other evidence of potential errors using data from the statistical database. Intuitively, we look for pairs of morphemes that appear at two argument positions at the call site, with the property that, statistically, each morpheme is significantly more common at the other's position than at its own. Hypothetically, suppose the cover-based checker was not able to identify the error shown in Figure 1. The statistical checker gives SWAPD another chance to catch the error: the statistical database suggests that the pid morpheme is often used in the first position at a kill call site, and the sig morpheme is often used in the second position. This statistical data suggests that the morphemes at this call site may have been swapped. Note that both the cover-based checker and the statistical checker could have identified the same error: we quantify how often such an overlap occurs (Figure 4, §IV-C).

The final stage for all candidate warnings is false-positive filtering (§III-H): it applies various heuristics to distinguish between intentional and mistaken swaps. For example, consider Listing 4, which presents a false-positive finding in GrafX2 [8], filtered out by SWAPD. The second call to iconv_open uses argument names that appear to be swapped; however, there is a call to the same function with the arguments in the canonical order on the preceding line. If the programmer calls the function “both ways”, with calls in close proximity to each other, it is likely that both usages are deliberate (because it undermines the theory that the swap was due to not knowing the correct order). We use several false-positive filtering heuristics, including one motivated by the call pattern in Listing 4.

    // declaration in iconv.h
    iconv_t iconv_open(const char *tocode,
        const char *fromcode);

    // use in grafx2
    cd = iconv_open(TOCODE, FROMCODE);     // From UTF8 to ANSI
    cd_inv = iconv_open(FROMCODE, TOCODE); // From ANSI to UTF8
Listing 4. The candidate warning on the second iconv_open call is ruled out in the false-positive filtering stage because of the nearby correct call on the preceding line.

Throughout our pipeline, we use techniques to minimize noise and maximize signal in the natural language information. One such technique is comparing morphemes to each other using a similarity metric that takes abbreviations into account (§III-C): for example, msg is a common abbreviation for message. Another technique is to remove morphemes that are common to pairs of argument names being checked, such as remote in Listing 5. Removal of the common morpheme allows SWAPD to detect this bug in BoNeSi [9].

    // declaration in libnet
    libnet_ptag_t libnet_build_tcp(uint16_t sp,
        uint16_t dp, uint32_t seq, uint32_t ack,
        /* 9 more parameters ... */);

    // use in bonesi
    if (libnet_build_tcp(origSrcPort, dstPort,
            remoteAck, remoteSeq,
            /* 9 more arguments ... */) == -1) { /* ... */

Listing 5. Removing the common morpheme remote from the argument names clarifies their relationships with the parameter names: the third and fourth arguments appear to be swapped.
WAP
D uses a hybrid approach based on bothnon-statistical and statistical techniques to detect and confirmswapped-argument errors.III. D
ESIGN AND I MPLEMENTATION
This section provides details about specific stages, algo-rithms, and heuristics of S
WAP D. A. Extracting name information S WAP
D begins by extracting names from argument ex-pressions at call sites. If the corresponding declarations areavailable and include parameter names, those names are alsoextracted. We modeled our name extraction on DeepBugs [5],with adaptations to C and C++. For an abstract syntax treenode n , we extract a string name ( n ), where possible, asfollows. • If n is an identifier, return its name. • If n is a non-string literal, return a string representationof its value. • If n is this , return “this”. • If n is ( m ) , return name ( m ). • If n is one of ++ m , -- m , m ++ , or m -- , return name ( m ). • If n is ⊗ m , where ⊗ ∈ { & , + , - , * } , return name ( m ). • If n is sizeof ( m ), return “sizeof”. • If n is a cast or explicit type conversion, return the nameof the operand. • If n is l . m , l -> m , or l :: m , return name ( m ). • If n is l [ m ] , return name ( l ). • If n is a call l . m ( . . . ) or m ( . . . ) , return name ( m ). • If n is a macro identifier, return the macro name. • In all other cases, return nothing.To handle C/C++ macros, we use information from thepreprocessor input rather than the parser input (which is thepreprocessor output), which often allows us to operate on moremeaningful symbolic names. If an entire function call is aresult of a macro expansion, or if it is a virtual function call,we skip collecting names from that call site.
B. Splitting names into morphemes
We split argument names for the input call site to be checked, and parameter names for function declarations. We also split argument names when building the statistical database (§III-D). Our prototype uses the Ronin [10] identifier-splitting algorithm. Ronin is an extension of the Samurai [11] algorithm, and uses a global table of token frequencies. Additionally, during splitting, we drop very common morphemes like “get”, “set”, “i”, “j”, etc.
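As an illustration, the splitting step can be reproduced with the Spiral package's Ronin splitter; the stop-morpheme list below is an illustrative subset rather than the prototype's actual list, and the exact splits depend on Ronin's frequency tables.

    # Sketch of the splitting step using the Ronin splitter from the
    # Spiral package (https://github.com/casics/spiral). The
    # stop-morpheme list here is an illustrative subset.
    from spiral import ronin

    STOP_MORPHEMES = {"get", "set", "i", "j"}

    def morphemes(name):
        """Split an identifier and drop very common morphemes."""
        return {m.lower() for m in ronin.split(name)} - STOP_MORPHEMES

    # Expected splits (subject to Ronin's frequency tables):
    # morphemes("cpid")              -> {"c", "pid"}
    # morphemes("xinput_error_base") -> {"xinput", "error", "base"}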
C. Morpheme similarity metric

Computing the similarity between two morphemes is a fundamental operation in SWAPD. We define a similarity metric ∼ to quantify the degree of correspondence between two morphemes while allowing for abbreviations. If two morphemes do not have the same first character, their ∼ value is zero. Otherwise, their ∼ value is computed by applying a penalty for each character that must be deleted from a morpheme in order for the resulting strings to contain the same characters in the same order. The penalty is lower for vowels than for consonants, decreases toward the end of the string, and is zero for a final “s” (to account for singular/plural forms). Our penalty for missing characters is normalized by the length of the morphemes, so in longer morphemes we allow for more missing characters while still maintaining a high similarity. We say two morphemes are sufficiently similar for a particular purpose if the value of ∼ is greater than a context-specific threshold. Note that ∼ can be naturally extended to be aware of synonyms; the end of §IV-D presents a brief discussion of such an experimental extension.
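The following is an illustrative reimplementation of ∼. The paper fixes the shape of the metric (the first-character anchor, cheaper vowels, cheaper deletions near the end of a morpheme, a free trailing “s”, and length normalization), but the specific penalty constants below are assumptions.

    # Illustrative reimplementation of the ~ metric; the penalty
    # constants are assumptions, only the overall shape follows the
    # description above.
    def lcs(a, b):
        """Longest common subsequence of strings a and b (standard DP)."""
        dp = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a):
            for j, cb in enumerate(b):
                dp[i + 1][j + 1] = (dp[i][j] + ca if ca == cb else
                                    max(dp[i][j + 1], dp[i + 1][j], key=len))
        return dp[len(a)][len(b)]

    def deletion_penalty(ch, pos, length):
        if ch == "s" and pos == length - 1:
            return 0.0                          # free singular/plural "s"
        base = 0.5 if ch in "aeiou" else 1.0    # vowels are cheaper
        return base * (1.0 - pos / length)      # cheaper toward the end

    def deleted_chars(s, keep):
        """Characters of s (with positions) not matched against keep."""
        out, k = [], 0
        for pos, ch in enumerate(s):
            if k < len(keep) and ch == keep[k]:
                k += 1                          # consumed by the match
            else:
                out.append((ch, pos))
        return out

    def similarity(a, b):
        if not a or not b or a[0] != b[0]:
            return 0.0                          # first characters must agree
        keep = lcs(a, b)                        # characters both strings keep
        penalty = sum(deletion_penalty(ch, pos, len(s))
                      for s in (a, b)
                      for ch, pos in deleted_chars(s, keep))
        return max(0.0, 1.0 - penalty / max(len(a), len(b)))

With these constants, similarity("msg", "message") is roughly 0.8, capturing the abbreviation, while similarity("sig", "pid") is 0 because the first characters differ.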
D. Computing a statistical database

The statistical database is keyed by triples consisting of a function name f, an argument position i, and a morpheme m. For each such triple, it contains a weight w(f, m, i). The weight captures the number of projects in the corpus where morpheme m appears at position i in a call site for f.

For a given function f, morpheme m, and argument positions i and j, we can use the weights in the database to compute a numerical relative-frequency score:

    ψ(f, m, i, j) = w(f, m, i) / w(f, m, j)

This score attempts to quantify how much more common the morpheme m is at argument position i than at argument position j at call sites to f. In the remainder of the paper, we will sometimes use the notation ψ(m, i, j), omitting the function f when it is clear from context. When we build the database, we use the splitting techniques described in §III-B, and we eliminate common morphemes that appear in all argument positions at a call site.
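A sketch of the database build and of ψ follows; corpus_call_sites is a hypothetical iterator yielding, for each call site in a project, the called function's name and the per-position argument-morpheme sets.

    # Sketch of the offline database computation and the psi score.
    # corpus_call_sites(project) is a hypothetical corpus-walking helper.
    from collections import defaultdict

    def build_weights(projects):
        """w[(f, m, i)] = number of projects where morpheme m appears at
        argument position i in some call to function f."""
        w = defaultdict(int)
        for project in projects:
            seen = set()
            for fname, arg_sets in corpus_call_sites(project):
                # Drop morphemes that appear at all argument positions.
                common = (set.intersection(*arg_sets)
                          if len(arg_sets) > 1 else set())
                for i, ms in enumerate(arg_sets):
                    for m in ms - common:
                        seen.add((fname, m, i))
            for key in seen:                # each project counts once
                w[key] += 1
        return w

    def psi(w, f, m, i, j):
        """Relative frequency of morpheme m at position i vs. j for f."""
        denom = w[(f, m, j)]
        return w[(f, m, i)] / denom if denom else float("inf")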
E. Cover-based checker

The cover-based checker detects swapped-argument errors if the morphemes in the argument names better “cover” the morphemes in the parameter names when argument positions are swapped at a call site. This checker is skipped if a function declaration lacks parameter names.

After splitting the parameter and argument names into morphemes, we proceed pairwise, for each pair of argument positions i and j. In the rest of this paper, A_i and A_j denote the sets of argument morphemes at positions i and j respectively; P_i and P_j denote the sets of parameter morphemes at the corresponding positions.

First, we eliminate any morphemes common to A_i and A_j, and similarly for P_i and P_j, to handle cases like Listing 5. If this elimination leaves any of A_i, A_j, P_i, or P_j empty, the cover checker does not proceed any further.

Next, we compute the quality of the match, or “cover”, from a set of argument morphemes to a set of parameter morphemes. We run this computation for the original order, computing how well A_i covers P_i and A_j covers P_j, and for the “swapped” order, i.e., how well A_i covers P_j and A_j covers P_i.

Informally, a set of argument morphemes covers a set of parameter morphemes if every parameter morpheme is sufficiently similar to (using the metric ∼ described in §III-C) at least one argument morpheme. This relation is asymmetric: it is possible to have argument morphemes that are not similar to any parameter morpheme, yet still have coverage. However, if a parameter morpheme is not similar to any argument morpheme, then there is no coverage.

We formalize a notion of coverage C(A, P) for an argument morpheme set A and a parameter morpheme set P:

    C(A, P) = min_{p ∈ P} max_{a ∈ A} (a ∼ p)

Our criterion for reporting a swapped-argument warning is based on two empirically determined thresholds, α1 and α2. We produce a candidate warning if and only if both of the following hold:

    (C(A_i, P_i) < α1) ∧ (C(A_j, P_j) < α1)
    (C(A_i, P_j) > α2) ∧ (C(A_j, P_i) > α2)

Informally, we require both sufficiently bad coverage in the current positions and sufficiently good coverage in the swapped positions.

    // declaration in a different file
    void removeHighCoverageNodes(Graph* graph,
        double maxCov, boolean _export,
        Coordinate minLength, /* more params */);

    // use in https://github.com/dzerbino/velvet
    removeHighCoverageNodes(graph, maxCoverageCutoff,
        (Coordinate) minContigKmerLength,
        flagExportFilteredNodes, /* more args */);

Listing 6. Example bug to showcase the machinery of the cover-based checker.
Listing 6 shows an example of a bug found using the cover-based checker. Figure 2 depicts the strong coverage mapping when the arguments at positions three and four are swapped.

Fig. 2. Coverage in the swapped positions for the example in Listing 6: the argument morphemes of minContigKmerLength cover the parameter morphemes of minLength, and those of flagExportFilteredNodes cover those of _export. The covers in the swapped positions equal 1, while the covers in the original positions equal 0.
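A sketch of the cover computation and the reporting criterion follows, reusing similarity from §III-C; the concrete threshold values are placeholders, not the empirically chosen ones.

    # Sketch of the cover-based checker for one pair of argument
    # positions. ALPHA1/ALPHA2 are placeholder values; the prototype's
    # thresholds were chosen empirically.
    ALPHA1, ALPHA2 = 0.3, 0.7

    def cover(A, P):
        """C(A, P): worst-case over parameter morphemes of the best
        similarity to any argument morpheme."""
        return min(max(similarity(a, p) for a in A) for p in P)

    def cover_check(A_i, A_j, P_i, P_j):
        """True if a candidate swapped-argument warning should be raised."""
        common_a, common_p = A_i & A_j, P_i & P_j
        A_i, A_j = A_i - common_a, A_j - common_a   # handles Listing 5
        P_i, P_j = P_i - common_p, P_j - common_p
        if not (A_i and A_j and P_i and P_j):
            return False
        bad_in_place = (cover(A_i, P_i) < ALPHA1 and
                        cover(A_j, P_j) < ALPHA1)
        good_swapped = (cover(A_i, P_j) > ALPHA2 and
                        cover(A_j, P_i) > ALPHA2)
        return bad_in_place and good_swapped

On Listing 6, for example, the swapped-position covers evaluate to 1 and the in-place covers to 0, so the criterion fires.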
F. Statistical vetting

We perform statistical vetting when the cover-based checker flags a pair of arguments in positions i and j at a call site as potentially swapped. We compute max_{m ∈ A_i} ψ(m, i, j) and max_{m ∈ A_j} ψ(m, j, i). If either of these quantities exceeds an empirically determined vetting threshold β, we conclude that the usage in question is statistically not rare, and so do not report a warning. Listing 3 shows such a case. (For each threshold in SWAPD, we evaluated the results for a variety of settings and chose the best precision/yield trade-off in our judgment; space constraints prevent us from providing further details of this process.)
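Concretely, the vetting step reduces to two lookups over the database sketched in §III-D; BETA is a placeholder for the empirically chosen β.

    # Sketch of statistical vetting of a cover-checker warning.
    BETA = 4.0   # placeholder for the empirically chosen threshold

    def statistically_vetted_out(w, f, A_i, A_j, i, j):
        """True if the seemingly swapped usage is not rare, i.e., the
        warning should be discarded (as for Listing 3)."""
        return (max(psi(w, f, m, i, j) for m in A_i) > BETA or
                max(psi(w, f, m, j, i) for m in A_j) > BETA)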
G. Statistical checks

When the cover-based checker does not find a warning at a call site, we run the statistical checker. It uses the statistical database (§III-D) and the argument names at the call site. The statistical checker can run even when the callee declaration does not include parameter names or cannot be retrieved.

If the statistical database does not include statistics for the function called at a call site, the statistical checker is skipped. Otherwise, it considers every possible pair of argument positions i and j, and detects instances where two morphemes are likely swapped across those positions using the following approach. As before, we eliminate common morphemes between A_i and A_j. After such elimination, if either A_i or A_j is empty, we skip the rest of the steps below for i and j.

We now look for pairs of argument morphemes a_i ∈ A_i and a_j ∈ A_j such that min(ψ(a_i, j, i), ψ(a_j, i, j)) > γ, for an empirically selected threshold γ. Informally, we look at how much more common each morpheme is in the other's position than in its own, and require the lesser of those two “misplacement” scores to be greater than γ.

We also require that A_i \ a_i = A_j \ a_j, i.e., exactly one morpheme is swapped between the two morpheme sets A_i and A_j. If we find such a pair of morphemes a_i and a_j, we perform one more check. We find the morpheme m with the biggest statistical difference in frequency between position j and position i, i.e., m = argmax_x [w(f, x, j) − w(f, x, i)]. We verify that a_i is sufficiently similar to m. The intuition is that if morpheme a_i is common in both positions i and j, then the likelihood of a swap is lower; we are looking for evidence that moving a_i from i to j would bring the situation closer to what is statistically most common. We perform a symmetric check for a_j. If both checks pass, we produce a candidate swapped-argument warning involving argument positions i and j, and proceed with further false-positive reduction. Note that the requirements and checks in this paragraph could be relaxed to potentially catch more bugs, at the expense of increased false-positive rates.
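The sketch below follows these steps for one pair of positions; GAMMA and SIM_THRESHOLD are placeholders for the empirically chosen values, and all_morphemes stands for the morphemes recorded for f in the database.

    # Sketch of the statistical checker for one pair of argument
    # positions (i, j). GAMMA and SIM_THRESHOLD are placeholder values.
    GAMMA, SIM_THRESHOLD = 8.0, 0.5

    def statistical_check(w, f, A_i, A_j, i, j, all_morphemes):
        common = A_i & A_j
        A_i, A_j = A_i - common, A_j - common
        if not A_i or not A_j:
            return None
        # Morphemes that statistically "belong" at positions j and i.
        m_for_j = max(all_morphemes,
                      key=lambda x: w[(f, x, j)] - w[(f, x, i)])
        m_for_i = max(all_morphemes,
                      key=lambda x: w[(f, x, i)] - w[(f, x, j)])
        for a_i in A_i:
            for a_j in A_j:
                misplacement = min(psi(w, f, a_i, j, i),
                                   psi(w, f, a_j, i, j))
                if misplacement <= GAMMA:
                    continue
                if A_i - {a_i} != A_j - {a_j}:
                    continue            # exactly one morpheme is swapped
                if (similarity(a_i, m_for_j) > SIM_THRESHOLD and
                        similarity(a_j, m_for_i) > SIM_THRESHOLD):
                    return (a_i, a_j)   # candidate swapped pair
        return None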
H. False-positive filtering

Weeding out intentionally- vs. mistakenly-swapped arguments can be difficult. We have developed a collection of heuristics to identify likely intentional swaps. Some of them come from the literature; we developed others empirically, by manually examining SWAPD warnings, identifying false positives, and formalizing common features of those false positives. Without false-positive filtering of this nature, the developer experience of using such checkers can be frustrating. We list our major heuristics below.
White-list words.
Some words hint that a swap might be intentional, e.g., “swap”, “exchange”, “rotate”, or “flip”. We expand on the “nested in reverse” heuristic [4], and look for such words in the following locations: the name of the callee function, the name of the caller function, nearby conditional expressions (i.e., the last five branches along the current execution path; see Listing 7 from Mate Panel Libs [12]), and the six immediately-preceding lines of source code (including any comment contents). We consider the presence of such words to be indicative of false positives, and filter out such warnings; a sketch of this heuristic follows the listing.

    // decl in gdk-pixbuf.h
    GdkPixbuf* gdk_pixbuf_new (/* 3 params */,
        int width, int height);

    // use in mate-panel-libs
    if (background->rotate_image /* && .. */
    // .. several lines of code, and nested ifs
    r = gdk_pixbuf_new (/* 3 args */, height, width);

Listing 7. False-positive warning filtered out, because “rotate” is in a nearby conditional expression.
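In the sketch below, the warning object's context accessors (callee/caller names, recent branch conditions, preceding source lines) are hypothetical stand-ins for what the prototype obtains from CodeSonar.

    # Sketch of the white-list-words filter. The context accessors on
    # `warning` are hypothetical.
    import re

    SWAP_HINT_WORDS = {"swap", "exchange", "rotate", "flip"}

    def whitelist_filtered(warning):
        """True if a hint word appears near the call, suggesting that
        the apparent swap is intentional."""
        context = [warning.callee_name, warning.caller_name]
        context += warning.last_branch_conditions(5)  # nearby conditionals
        context += warning.preceding_source_lines(6)  # incl. comments
        words = set(re.findall(r"[a-z]+", " ".join(context).lower()))
        return bool(words & SWAP_HINT_WORDS)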
Swap distance.
We found that, on real-world code, all warnings for argument positions i and j with |i − j| greater than a small bound were false positives. Therefore, we do not report such warnings.

Geometric code patterns.
In geometric code, it is common to combine swapping of axes with negation of one of two values to achieve various transformations. We exclude as intentional any apparent swapped-argument calls that negate exactly one of the two arguments involved.
Type checking.
We eliminate some false positives through simple type checking on the types of the two arguments involved (similar to [2], [13]). The intuition is that a swap of arguments with incompatible types would have been detected in development via a compiler error. Even among compatible types, if the swapped order requires more type coercion, that may argue for the correctness of the existing code.
Nearby declaration.
If the declaration of the callee function is in the same source file as the call site, we consider the warning a false positive. This heuristic is based on our empirical findings, and on the intuition that erroneous swaps happen when the programmer forgets the correct argument order. If the declaration is nearby, the programmer is likely aware of the correct argument order.
Nearby correct call.
If there are other calls to the same function, but with unswapped arguments, within the same caller function (see Listing 4), we consider the warning a false positive. This heuristic (similar to the “duplicate method calls” heuristic [4]) is based on empirical findings, and on an intuition about reminder proximity similar to that of the previous heuristic.
Swap is not rare.
If the suspected swap is not an isolated event, but occurs in three or more separate call sites within the same calling function, we consider it a false positive. Our observation is that anomalies tend to be intentional unless they occur very rarely. This heuristic causes us not to report cases where a function is called at several call sites consistently with the wrong argument order; however, true positives of this kind are far outweighed by false positives.
IV. EVALUATION

In this section, we present the results of an empirical evaluation of SWAPD on a large C and C++ code corpus. The major research questions we consider are:
R1. How well does SWAPD find warnings in real-world open-source code?
R2. What is the value-add of the different stages in SWAPD?
R3. (a) How often are argument and parameter names constructed from multiple morphemes? (b) How many of the true positives found involve multiple morphemes? These questions are aimed at seeking justification for morpheme-level reasoning, instead of operating directly on whole names.
R4. (a) What effect does the size of the corpus used for the statistical database have on SWAPD's findings? (b) What is the effect of leaving out from the statistical database those projects on which SWAPD produces any warnings?

We also provide a discussion of the warnings triaged (§IV-F) and threats to validity (§IV-G) of our work.
A. Prototype implementation and corpus
We use the commercial static analysis tool CodeSonar [14] to extract name information from call sites and their corresponding declarations, when available. We implemented the statistics database computation (§III-D) and the SWAPD checker prototype in Python. Our prototype generates warnings in the SARIF format [15]. Such warnings can be imported into an IDE or other SARIF viewers (such as CodeSonar) for manual inspection and triage. CodeSonar remembers the triage result (i.e., true/false positive), as well as other user annotations, by fingerprinting the warning location. We found this ability helpful for the manual construction of ground truth for the evaluation, and during review of the results.

We computed the statistical database using the open-source Fedora 29 source-package repository [6], filtered to include only projects containing C or C++ code. We performed additional filtering to reduce duplication, and eliminated extremely large projects. We successfully processed the remaining projects, consisting of hundreds of millions of lines of code. We refer to this set of projects as the SRPM corpus in the rest of this paper. The resulting statistical database contains morpheme information for over four thousand functions.
B. Evaluation methodology
We considered evaluating SWAPD on both real-world code and on a synthetically-generated dataset with randomly injected swapped arguments. We decided against the latter, because it is unclear how to generate a synthetic dataset with a realistic distribution of both erroneous and intentional swaps. In practice, intentional swaps far outnumber actual swap errors, so we do not believe that a naïve injection approach that disregards them would lead to a realistic dataset; thus, evaluation results on synthetic datasets may not carry over to real-world code. Therefore, we decided to conduct an evaluation exclusively on real-world open-source code. We perform our evaluation on the SRPM corpus, i.e., hundreds of millions of lines of C and C++ code. As far as we are aware, our evaluation is the largest (in terms of number of lines of code) for a swapped-argument checker on any programming language [4], [5], and the largest by far [13] on C/C++.

A limitation of using real-world code for evaluation is the lack of pre-existing ground truth. To obtain a list of true- and false-positive warnings, we ran SWAPD under different configurations (described in §IV-C) to obtain a pool of unique warnings reported on the SRPM corpus. Of these, we sampled and manually triaged 859 unique warnings: we marked 183 of these as true positives, and 676 as false positives. When SWAPD is run again on the SRPM corpus under any configuration, a warning reported at a triaged location is recognized and automatically classified as a true or a false positive. The manual triage task was shared by six experienced developers, some of whom were involved in the development of SWAPD. We applied a conservative triage strategy—marking warnings as true positives if they reflect issues worth raising in a code review (i.e., real bugs, or problems worth fixing even if there is no runtime error). Otherwise, we marked warnings as false positives. Listing 8 from OpenVAS libraries [16] shows an example warning marked as a false positive: we suspect that the swap in the call to init_v6_capture_device is intentional, because the preceding format string “ip6 and src host %s” uses an argument computed from dst.

    // declaration
    int init_v6_capture_device (struct in6_addr src,
        struct in6_addr dst, char *filter);

    // use in openvas-libraries
    snprintf (filter, sizeof (filter), "ip6 and src host %s",
        inet_ntop(AF_INET6, dst, addr, sizeof (addr)));
    bpf = init_v6_capture_device (*dst, src, filter);

Listing 8. Warning triaged as a false positive.
Because we manually triaged a sample of warnings, we use precision and yield as our evaluation metrics. Precision is the ratio of the number of true-positive warnings to the total number of warnings. Yield is the total number of reported true-positive warnings from our ground-truth dataset. Yield is a proxy for recall: since there is no practical way to determine the full set of all swapped-argument errors in the corpus, we cannot determine what percentage of them we have found.

Making SWAPD practically useful requires balancing precision and yield. High precision with low yield leads to few reported warnings; while these warnings are likely to be real problems, many other real problems may be missed. Low precision with high yield is also not ideal, because it leads to large numbers of false positives. The developer effort to sift through those can cause frustration and reduce adoption. In a practical tool, scoring and sorting can be used to balance these conflicting concerns. By assigning scores to warnings based on their likelihood of being true positives, we can show them to the user in descending order. The user can then decide when the effort of further manual triaging is no longer justified by the likely benefit of discovering an additional true positive. Scoring the warnings from SWAPD is an interesting problem that is outside the scope of this paper.
C. Evaluating various stages
As outlined in Section II, SWAPD involves a multi-stage pipeline with four stages: (1) cover-based checker, (2) statistical vetting, (3) statistical checker, and (4) false-positive filtering. Each of these stages impacts precision and/or yield. To expose the impact of each stage, we evaluated the precision and yield of SWAPD in a variety of different configurations. In each configuration name, the numbers indicate which of the four stages are enabled; for example, a configuration named after a single stage has only that stage enabled, and the rest of the stages disabled.

Figure 3 shows the precision and yield of SWAPD for the various configurations. A few observations are clear. The full configuration, with all four stages enabled, shows the best trade-off between precision and yield. The configuration that omits only false-positive filtering has higher yield, but very low precision—it reports an order of magnitude more warnings in total than the full configuration, and 75% of its warnings are false positives, which justifies our adoption of the false-positive filtering stage to increase precision. In fact, all the configurations without stage 4 enabled have low precision. Configurations that rely on only one of the two checkers have much lower yield, suggesting that relying solely on either the cover-based checker or the statistical checker misses several true-positive warnings, justifying our hybrid approach. Comparing otherwise-identical configurations with and without stage 2, we see a small increase in precision traded for a small decrease in yield: thus, cross-checking with the statistical database to perform statistical vetting of the cover-based checker warnings can be useful if a user prefers precision over yield.

In summary, these results answer R1: with all four stages enabled, SWAPD finds true-positive warnings on the SRPM corpus with good precision. These results also answer R2: each of the four stages contributes to either increasing precision or increasing yield, justifying the use of each stage. Figure 4 shows the overlap in true-positive warnings between the cover-based checker and the statistical checker—the two approaches largely find different sets of true warnings, further bolstering the case for using both of them.

Fig. 3. Precision vs. yield for the various SWAPD configurations. The full configuration has the best trade-off between precision and yield.

Fig. 4. The number of unique true-positive warnings reported by the cover-based checker only, by the statistical checker only, and by both.

D. The case for morpheme-level reasoning
As described in §I, operating on morphemes instead of whole names is a key feature of our approach. Finding bugs such as those in Listing 5 and Listing 6 benefits greatly from morpheme-level reasoning: using a string-distance metric on whole names is not a good fit for finding such errors. However, if almost all the argument and parameter names provided by developers consisted of single morphemes, then morpheme-level reasoning would boil down to whole-name-level reasoning, making it overkill.

Figure 5 answers R3(a); it shows how often argument names and parameter names in the SRPM corpus are constructed with different morpheme-set sizes. If we cannot extract a name from a call site or a declaration, it does not get counted in this figure. While a majority of both the argument and parameter names chosen by developers are made up of single morphemes, nearly 40% of argument names are constructed from more than one morpheme, lending credibility to our use of morpheme-level reasoning.

Fig. 5. Frequency distribution of morpheme-set sizes for argument names (left) and parameter names (right) across the SRPM corpus. The y-axis in both charts (note the different scales) indicates frequency; the x-axis gives the morpheme-set size.

Furthermore, in Figure 6, we plot the frequency distribution
of the maximum number of morphemes involved in each of the true-positive warnings reported by the full configuration; it serves to answer R3(b). For a reported warning, let i and j be the two argument positions involved in the swap, and let A_i, A_j, P_i, and P_j be the argument-morpheme and parameter-morpheme sets at those two positions respectively. Then, the maximum number of morphemes involved in the warning is given by max{|A_i|, |A_j|, |P_i|, |P_j|}. Nearly 42% of true-positive warnings involve names with more than one morpheme, confirming that our morpheme-level reasoning in the different stages is likely useful in identifying real bugs.

Fig. 6. Maximum number of morphemes (argument or parameter morphemes at either position) involved in each true positive reported by the full configuration.

    // declaration
    size_t scm_port_buffer_put (SCM buf,
        const scm_t_uint8 *src, size_t count,
        size_t end, size_t avail);

    // use in guile
    scm_port_buffer_put (new_buf,
        scm_port_buffer_take_pointer (pt->read_buf, cur),
        avail, 0, c_size);

Listing 9. Example bug identified when SWAPD considers size and count to be synonyms.

Inclusion of synonym relationships in the similarity metric ∼ would allow morpheme-level reasoning to find even more errors. For example, if we consider size and count to be synonyms, i.e., ∼(size, count) = 1, then SWAPD finds the error shown in Listing 9 from Guile [17]. As future work, we want to automatically extract synonyms from code corpora, and extend our morpheme-similarity metric with knowledge of these synonyms.
E. Effect of corpus used for computing the statistical database
In Figure 7, we take different randomly-chosen subsets (1%, 5%, 25%) of the SRPM corpus to compute the statistical database, and present the results of running the full configuration on the SRPM corpus with these different statistical databases. Each random subset is computed five times. This plot serves to answer R4(a), showing the effect of the corpus size used for the statistical database on SWAPD's findings. All five of the random trials with 1% of the corpus, and most of the five random trials with 5% of the corpus, are in the bottom-left quadrant (low precision and low yield). However, all the rest of the random subsets are in the top-right quadrant (high precision and high yield), suggesting that statistical-database computation from relatively small subsets of the SRPM corpus (sometimes even just 5%) can provide most of the precision and yield gains, compared to using the entire corpus. Note that the cover-based checker has only a weak, second-order dependency on the statistical database: if no statistics are available to vet its results, the yield can only increase, with a decrease in precision.

Fig. 7. Precision and yield reported when the full configuration is run on the SRPM corpus with the statistical database computed from a random subset (i.e., 1%, 5%, 25%, and 100%) of the projects in the SRPM corpus. Each random selection is made five times; due to overlapping points, fewer than five points per subset size may be visible. The legend “excl” refers to using a statistical database that was computed without those projects that had at least one warning reported by the full configuration. Note that the origin in this chart is not at zero precision and zero yield.

Furthermore, when computing the statistical database, if we exclude all the projects in the SRPM corpus on which the full configuration reports a warning, the precision and recall are not affected much (see legend “excl”: both precision and recall are only slightly lower than when using the entire corpus). This observation answers R4(b) and confirms that the analysis is not over-fit to the specific projects within which it is reporting warnings.

F. Discussion of triaged warnings
Figure 8 compares the relative probabilities of a true-positive warning occurring at a call site with a given number of arguments. To compute these probabilities, we leave out call sites with fewer than two arguments, because doing so does not affect the relative probabilities shown here. We computed these probabilities as described for Figure 11 in [4]. In contrast to the probability distribution for Java programs [4], we found that the probability of a true-positive warning given a call site with two or three arguments is comparable to the probabilities for call sites with higher numbers of arguments. One possible explanation for the difference in the probability distributions could be that, because C and C++ programs are “weakly typed”, there is more room for confusion in ordering arguments, even at call sites with few arguments.

Fig. 8. Probability of a true-positive warning occurring at a call site with n arguments. We computed these probability values in a manner similar to Figure 11 in [4].

In our triage of SWAPD warnings, we found a variety of reasons for false-positive warnings. Some example reasons include: incorrect splitting of names into morphemes, incorrectly detected abbreviations, function-specific patterns that statistical vetting is not able to pick up on, poor naming decisions by the developer, patterns that are rare but not incorrect, and names that do not carry much meaning (see Listing 10 from XScreenSaver [18]).

    // decl in OpenGL
    void glVertex3f(GLfloat x, GLfloat y, GLfloat z);

    // use in xscreensaver
    glVertex3f (x, z, y);

Listing 10. Likely false-positive warning: the coordinate system could be altered in OpenGL.
G. Threats to validity
Our techniques assume English names; it is unclear how much of our work is applicable to non-English names.

Our statistical database is derived from a mature open-source code corpus for the Linux platform, and this particular corpus may have good coding patterns, which is likely beneficial. However, we may have higher yield on projects that are less mature or yet to be released. Similar to much of the work in this research area—where patterns are mined from code—our statistical vetting and statistical checker make the assumption that “most code is correct”. However, in specific domains, this assumption may not hold [19]. We give more importance to statistical patterns that occur across several projects, which may help assuage some concerns about our assumption. A possible area for improvement would be to recognize similar code in different projects and discount the statistics for occurrences across multiple similar projects. We do deduplication (§IV-A) at the granularity of entire files, which ignores many other forms of code duplication.

We expect our work to be applicable to several popular programming languages that support position-based arguments, other than C and C++; however, the techniques in SWAPD may not be useful for programs written in languages with keyword arguments, such as Smalltalk and Objective-C.

Finally, many of the warnings were triaged by people who developed SWAPD, which could have caused some bias in labeling warnings. One possible source of such bias is in the sampling of the warnings to triage: if the validity of a warning is difficult to ascertain, the triager may skip it and look for an easier one. However, difficult-to-triage warnings are more likely to be false positives, so skipping these would bias the triaged results toward more true positives.

V. RELATED WORK
In this section we discuss closely related previous work.
A. Matching argument and parameter names
The idea of detecting swapped-argument errors using mismatches between argument names and parameter names has been studied before [2], [4], [13]. Of these works, Rice et al. [4] have the most extensive real-world evaluation (run on 200 million lines of proprietary code, and 10 million lines of open-source code). They detect incorrectly-ordered arguments at call sites in Java programs, and their work is most similar to our cover-based checker. They use string-similarity metrics on whole names to detect mismatched correspondences between arguments and parameters, whereas our cover-based checker performs morpheme-level reasoning. We believe that the cover-based checker is a better approach because it picks out the relevant signals from the names being compared. Comparing whole names using a string distance is akin to comparing two whole sentences using a string distance: there is a fundamental impedance mismatch. Using a cover-based checker is instead akin to comparing two sentences based on the words contained in them. Our approach is also readily extended to other morpheme-similarity measures, including considering synonymous morphemes to be equivalent. Their work will miss reporting bugs if parameter names are not available or not useful, whereas our hybrid approach can still report bugs in such cases based on mined statistical patterns. Their work will report false positives if there are function-specific anti-patterns that developers use (such as Listing 3), whereas we can filter out such warnings using statistical vetting.
B. Learning from code
With the increased availability of large amounts of code, learning models of “correct” code from existing programs and detecting anomalies as bugs [20] has been gaining popularity [21]–[25]; however, none of these works [20]–[25] mine name information for detecting swapped-argument errors. We discuss our work in contrast to two closely related efforts: DeepBugs [5] and APISan [26].

DeepBugs detects swapped-argument errors using a machine learning approach: the authors seed a corpus of programs with artificial likely swapped-argument errors, and train a classifier to distinguish the artificial code from the unmodified real code. Because the real code is expected to have very few swapped-argument errors, their hypothesis is that the classifier learns to identify swapped-argument errors in real code. They apply their technique to JavaScript programs, with a corpus of millions of lines of code. Their work is most similar to our statistical checker.

Their artificial seeding of swapped arguments in a corpus does not distinguish between intentional and unintentional swaps, and therefore their classifier is unlikely to learn such a distinction. Determining whether a swap is intentional or not requires considering the surrounding code and context (e.g., preceding source text, conditionals, caller function), but such information is not taken into account by DeepBugs.

DeepBugs only considers swaps between the first two arguments at a call site, whereas we consider swaps between all pairs of arguments. DeepBugs requires a lot of training data, and it only reports warnings when the whole function name and the whole argument names (at the first two positions) at a call site are all present in its vocabulary of frequent names. DeepBugs reasons at the whole-name level, so call sites with less-frequently-occurring whole argument names (which could be made of frequently-occurring morphemes) and function names are not even considered. Our morpheme-level reasoning boosts the signal present in name data for the statistical checker, and our hybrid approach can find bugs even when there is no statistical data available for a particular function.

Being able to explain why a warning is reported is an essential element for adoption. Explaining why DeepBugs predicted a call site to be buggy is hard [27], [28]. In contrast, our approach provides straightforward algorithmic explanations for each finding.

APISan detects various classes of errors by computing a statistical database of function-usage characteristics, and then finding anomalous patterns in the database. The characteristics APISan extracts from the arguments at a call site do not pertain to argument names; instead, it focuses on extracting and statistically reasoning about traditional semantic relations between argument values.

VI. CONCLUSION
In this paper, we have presented SWAPD, a technique to find mistakenly-swapped arguments at call sites. SWAPD exploits “big code” and carefully combines four stages (cover-based checker, statistical vetting, statistical checker, and false-positive filtering) to balance the precision and yield of the findings.

ACKNOWLEDGMENTS
This material is based on research sponsored by the Department of Homeland Security (DHS) Office of Procurement Operations, S&T Acquisition Division, via contract number 70RSAT19C00000056. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the DHS. We would like to thank Amy Gale for her feedback on our work.
REFERENCES
[1] VI Like Emacs. [Online]. Available: https://linux.die.net/man/1/xvile
[2] H. Liu, Q. Liu, C.-A. Staicu, M. Pradel, and Y. Luo, “Nomen est omen: Exploring and exploiting similarities between argument and parameter names,” in Proceedings of the 38th International Conference on Software Engineering, 2016.
[3] GPaste. [Online]. Available: https://github.com/Keruspe/GPaste
[4] A. Rice, E. Aftandilian, C. Jaspan, E. Johnston, M. Pradel, and Y. Arroyo-Paredes, “Detecting argument selection defects,” Proc. ACM Program. Lang., vol. 1, no. OOPSLA, Oct. 2017. [Online]. Available: https://doi.org/10.1145/3133928
[5] M. Pradel and K. Sen, “DeepBugs: A learning approach to name-based bug detection,” Proc. ACM Program. Lang., vol. 2, no. OOPSLA, Oct. 2018. [Online]. Available: https://doi.org/10.1145/3276517
[6] Fedora Package Sources. [Online]. Available: https://src.fedoraproject.org/
[7] GStreamer: open source multimedia framework. [Online]. Available: https://gstreamer.freedesktop.org/
[8] GrafX2. [Online]. Available: http://grafx2.chez.com/
[9] BoNeSi. [Online]. Available: https://github.com/Markus-Go/bonesi
[10] M. Hucka, “Spiral: splitters for identifiers in source code files,” Journal of Open Source Software, vol. 3, no. 24, p. 653, 2018. [Online]. Available: https://doi.org/10.21105/joss.00653
[11] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker, “Mining source code to automatically split identifiers for software analysis,” in Proceedings of the 6th IEEE International Working Conference on Mining Software Repositories (MSR), 2009, pp. 71–80.
[12] Mate Panel Libs. [Online]. Available: https://pkgs.org/download/mate-panel-libs
[13] M. Pradel and T. R. Gross, “Name-based analysis of equally typed method arguments,” IEEE Transactions on Software Engineering, 2013.
[14] GrammaTech CodeSonar. [Online]. Available: https://www.grammatech.com/codesonar
[15] Static Analysis Results Interchange Format (SARIF), OASIS Committee Specification. [Online]. Available: https://www.oasis-open.org/committees/sarif
[16] OpenVAS. [Online]. Available: https://www.openvas.org/
[17] GNU Guile. [Online]. Available: https://www.gnu.org/software/guile/
[18] XScreenSaver. [Online]. Available: https://www.jwz.org/xscreensaver/
[19] M. Egele, D. Brumley, Y. Fratantonio, and C. Kruegel, “An empirical study of cryptographic misuse in Android applications,” in Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 73–84. [Online]. Available: https://doi.org/10.1145/2508859.2516693
[20] D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf, “Bugs as deviant behavior: A general approach to inferring errors in systems code,” in Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, ser. SOSP ’01. New York, NY, USA: Association for Computing Machinery, 2001, pp. 57–72. [Online]. Available: https://doi.org/10.1145/502034.502041
[21] M. K. Ramanathan, A. Grama, and S. Jagannathan, “Static specification inference using predicate mining,” ACM SIGPLAN Notices, vol. 42, no. 6, pp. 123–134, 2007.
[22] P. Bian, B. Liang, Y. Zhang, C. Yang, W. Shi, and Y. Cai, “Detecting bugs by discovering expectations and their violations,” IEEE Transactions on Software Engineering, vol. 45, no. 10, pp. 984–1001, 2019.
[23] V. Murali, S. Chaudhuri, and C. Jermaine, “Bayesian specification learning for finding API usage errors,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2017. New York, NY, USA: Association for Computing Machinery, 2017, pp. 151–162. [Online]. Available: https://doi.org/10.1145/3106237.3106284
[24] H. Perl, S. Dechand, M. Smith, D. Arp, F. Yamaguchi, K. Rieck, S. Fahl, and Y. Acar, “VCCFinder: Finding potential vulnerabilities in open-source projects to assist code audits,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 426–437. [Online]. Available: https://doi.org/10.1145/2810103.2813604
[25] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, “VulDeePecker: A deep learning-based system for vulnerability detection,” in Network and Distributed System Security Symposium (NDSS). The Internet Society, 2018.
[26] I. Yun, C. Min, X. Si, Y. Jang, T. Kim, and M. Naik, “APISan: Sanitizing API usages through semantic cross-checking,” in Proceedings of the 25th USENIX Conference on Security Symposium, ser. SEC ’16. USA: USENIX Association, 2016, pp. 363–378.
[27] C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature Machine Intelligence, vol. 1, pp. 206–215, 2019.