Refined Grey-Box Fuzzing with Sivo
Ivica Nikolić
School of Computing, NUS, Singapore
Radu Mantu
University Politehnica of Bucharest, Romania
Shiqi Shen
School of Computing, NUS, Singapore
Prateek Saxena
School of Computing, NUS, Singapore
ABSTRACT
We design and implement from scratch a new fuzzer called Sivo that refines multiple stages of grey-box fuzzing. First, Sivo refines data-flow fuzzing in two ways: (a) it provides a new taint inference engine that requires a number of tests only logarithmic in the input size to infer the dependency of all program branches on the input bytes, and (b) it deploys a novel method for inverting branches by directly and efficiently solving systems of inequalities. Second, our fuzzer refines accurate tracking and detection of code coverage with simple and easily implementable methods. Finally, Sivo refines the selection of parameters and strategies by parameterizing all stages of fuzzing and then dynamically selecting optimal values during fuzzing. Thus the fuzzer can easily adapt to a target program and rapidly increase coverage. We compare our fuzzer to 11 other state-of-the-art grey-box fuzzers on 27 popular benchmarks. Our evaluation shows that Sivo scores the highest both in terms of code coverage and in terms of number of found vulnerabilities.
INTRODUCTION
Fuzzing is the automatic generation of test inputs for programs with the goal of finding bugs. With increasing investment of computational resources for fuzzing, tens of thousands of bugs are found in software each year today. We view fuzzing as the problem of maximizing coverage within a given computational budget. The coverage of all modern fuzzers improves with the computation budget allocated. Therefore, we can characterize the quality of a fuzzer by its rate of new coverage, say, the number of new control-flow edges exercised per CPU cycle on average.

Broadly, there are three types of fuzzers. Black-box fuzzers do not utilize any knowledge of the program internals, and are sometimes referred to as undirected fuzzers. White-box fuzzers perform intensive instrumentation, for example, enabling dynamic symbolic execution to systematically control which program branches to invert in each test. Grey-box fuzzers introduce low-overhead instrumentation into the tested program to guide the search for bug-triggering inputs. These three types of fuzzers can be combined. For instance, recent hybrid fuzzers selectively utilize white-box fuzzers in parallel to stand-alone grey-box fuzzers. Of the three types, grey-box fuzzers have empirically shown promising cost-to-bug ratios, thanks to their low-overhead techniques, and have seen a flurry of improved strategies. For example, recent grey-box fuzzers have introduced many new strategies to prioritize seed selection, byte mutations, and so on during fuzzing. Each of these strategies works well for certain target programs, while being relatively ineffective on others. There is presently no dominant strategy that works better than all others on all programs.

In this paper, we present the design of a new grey-box fuzzer called Sivo that generalizes well across many target programs. Sivo embraces the idea that there is no one-size-fits-all strategy that works universally well for all programs.
Central to its design is a "parameterization-and-optimization" engine to which many specialized strategies and their optimization parameters can be specified. The engine dynamically selects between the specified strategies and optimizes their parameters on-the-fly for the given target program based on the observed coverage. The idea of treating fuzzing as an optimization problem is not new; in fact, many prior fuzzers employ optimization either implicitly or explicitly, but they do so partially [3, 21, 28, 33]. Sivo differs from these works conceptually in that it treats parameterization as a first-class design principle: all of its internal strategies are parameterized. The selection of strategies and the determination of all parameter values is done dynamically. We empirically show the power of embracing complete parameterization as a design principle in grey-box fuzzers.

Sivo introduces 3 additional novel refinements for grey-box fuzzers. First, Sivo embodies a faster approximate taint inference engine which computes taint (or sensitivity to inputs) for program branches during fuzzing, using a number of tests that is only logarithmic in the input size. Such taint information is helpful for directed exploration of the program path space, since inputs influencing certain branches can be prioritized for mutation. Our proposed refinement improves exponentially over a recent procedure to calculate taint (or data-flow dependencies) during fuzzing [11]. Second, Sivo introduces a light-weight form of symbolic interval reasoning which, unlike full-blown symbolic execution, does not invoke any SMT/SAT solvers. Lastly, it eliminates deficiencies in the calculation of edge coverage statistics used by common fuzzers (e.g. AFL [35]), thereby allowing the optimization procedure to be more effective.
We show that each of these refinements improves the rate of coverage, both individually and collectively.

We evaluate Sivo on 27 diverse real-world benchmarks, comprising several used in recent work on fuzzing and in Google OSS-Fuzz [14]. We compare Sivo to 11 other state-of-the-art grey-box fuzzers. We find that Sivo outperforms all fuzzers in terms of coverage on 25 out of the 27 benchmarks we tested. Our fuzzer provides a 20% increase in coverage compared to the next best fuzzer, and a 180% increase compared to the baseline AFL. Furthermore, Sivo finds the most vulnerabilities among all fuzzers on 18 of the benchmarks, and on 11 benchmark programs finds unique vulnerabilities. This provides evidence that Sivo generalizes well across multiple programs and according to multiple metrics. Our fuzzer is open-source.

Fuzzers look for inputs that trigger bugs in target programs. As the distribution of bugs in programs is unknown, fuzzers try to increase the chance of finding bugs by constructing inputs that lead to maximal program code execution. The objective of fuzzers is thus to construct inputs, called seeds, that increase the amount of executed program code, called code coverage. The coverage is measured based on the control-flow graph of the executed program, where nodes correspond to basic blocks (sets of program statements) and edges exist between sequential blocks. Some of the nodes are conditional (e.g. correspond to if and switch statements) and have multiple outgoing edges. Coverage increases when at some conditional node, called a branch, the control flow takes a new edge which was not seen in previous tests; this is called inverting or flipping a branch.

Grey-box fuzzers assess code coverage by instrumenting the programs and collecting profiling coverage data during the execution of the program on the provided inputs. They maintain a pool of seeds that increase coverage.
A grey-box fuzzer selects one seed from its pool, applies to it different operations called mutations to produce a new seed, and then executes the program on the new seed. New seeds that lead to previously unseen coverage are added to the pool. To specify a grey-box fuzzer one needs to define its seed selection, the types of mutations it uses, and the type of coverage it relies on. We call all these fuzzing components stages or subroutines of grey-box fuzzers. We consider a few research questions related to the different stages of fuzzing.

RQ1: Impact of Complete Parameterization?
Fuzzers optimize for coverage. There is no single fuzzing strategy that is expected to work well across all programs, so the use of multiple strategies and optimization seems natural. Existing fuzzers do use dynamic strategy selection and optimize the parameter value selection. For example, MOpt [21], AFLFast [3], and EcoFuzz [33] use optimization techniques for input seed selection and mutations. But such parameterization often comes with internal constants which have been hand-tuned on certain programs, and it is almost never applied universally in prior fuzzers. The first question we ask is what would be the result of complete parameterization, i.e., if we encode all subroutines and their built-in constants as optimization parameters.

The problem of increasing coverage is equivalent to the problem of inverting more branches. In the initial stage of fuzzing, when the number of not-yet-inverted branches is high, AFL mutation strategies (such as mutation of randomly chosen bytes) are successful and often help to invert branches in bulk. However, easily invertible branches soon become exhausted, and different strategies are required to keep the branch inversion going. One way is to resort to targeted inversion, in which the fuzzer chooses a branch and mutates the input bytes that influence it. The following two questions are about refining targeted inversion in grey-box fuzzing.
RQ2: Efficient Taint Inference?
Several fuzzers have shown that taint information, which identifies the input bytes that influence a given variable, is useful for targeted branch inversion [1, 5, 7, 11, 24, 32].

(Sivo is open-source: https://github.com/ivicanikolicsg/SivoFuzzer)

If we want to flip a particular branch, the input bytes on which the branch condition variables depend can be mutated while keeping the other bytes unchanged. The main challenge, however, is to efficiently calculate the taint information. Classical methods for dynamic taint-tracking incur significant instrumentation overheads, whereas static methods have severe false negatives (missed dependencies) due to imprecision. State-of-the-art fuzzers aim for light-weight techniques for dynamically inferring taint during fuzzing itself. Prior works have proposed methods which require a number of tests linear in n, the size of the seed input [11]. This is extremely inefficient for programs with large inputs. This leads to our second question: Can we compute useful taint information with exponentially fewer tests?

RQ3: Efficient Constraint-based Reasoning?
Taint only captures whether a change in certain values of an input byte may lead to a change in the value of a variable. If we are willing to compute more expressive symbolic constraints, determining the specific input values which cause a program branch to flip is possible. The challenge is that computing and solving expressive constraints, for instance first-order SAT/SMT symbolic formulae, is computationally expensive. In this work, we ask: Which symbolic constraints can be cheap to infer and solve during grey-box fuzzing?
RQ4: Precise coverage measurement?
Grey-box fuzzers use coverage information as feedback to guide input generation. AFL, and almost all other fuzzers building on it, use control-flow edge counts as a common metric. Since there can be many control-flow edges in the program, space-efficient data structures for storing runtime coverage data are important. Recent works have pointed out that AFL's hash-based coverage map can result in collisions [12], which has an unpredictable impact on the resulting optimization. Although sound solutions which avoid both false positives and negatives in edge count coverage are known, they are difficult to implement in modern compilers which process one unit of compilation (e.g. one source file) at a time. How do we compute compressed edge counts with high precision using standard compilers for instrumentation?
Grey-box fuzzers instrument the target program to gather runtime profiling data, which in turn guides their seed generation strategies. The objective of Sivo is to generate seeds that increase code coverage by using the profiling data better, and by using more of it. Sivo addresses the four research questions with four refinements.
Parametrize-optimize approach (RQ1).
Sivo builds on the idea of complete parameterization of all fuzzing subroutines and strategies, i.e. none of the internal parameters are hard-coded. Sivo selects strategies and parameter values dynamically, based on the observed coverage statistics and with the use of standard optimization algorithms. Such complete parameterization and optimization inherently makes Sivo adaptable to the target program and more general, since specialized strategies that work best for the program are prioritized. To answer RQ1, we empirically show in our evaluation that this design principle individually helps Sivo outperform other evaluated fuzzers across multiple target programs.
Fast Approximate Taint Inference (RQ2).
We devise a fast and approximate taint inference engine TaintFAST based on probabilistic group testing [10]. Instead of testing individually for each input byte, TaintFAST tests carefully chosen groups of bytes and then combines the results of all tests to infer the taint for each individual byte. This reduces the test complexity of taint inference from O(n) to O(log n) executions of the program, where n is the number of input bytes. Thus the fuzzer can infer useful taint dependency even for very large inputs using TaintFAST.

Symbolic Interval Constraints (RQ3).
We propose inferring symbolic interval constraints that capture the relationship between inputs and variables used in branch conditions only. Instead of deductively analyzing the semantics of executed instructions, we take an optimistic approach and infer these constraints from the observed values of the inputs and branch condition variables. The value-based inference is computationally cheap and tailored for the common case where values of the variables are direct copies of the inputs and where branches have comparison operations (=, ≠, <, ≤, >, ≥). We show that such systems of constraints can be solved efficiently as well, without the use of SAT/SMT solvers.

Compressed and Precise Edge Count Recording (RQ4).
We tackle both the collision problem and the compressed edge count problem in tracking coverage efficiently during grey-box fuzzing. For the former, we show a simple strategy based on using multiple basic block labels (rather than only one as in AFL) that reduces or entirely eliminates the collisions. For the latter, we propose temporary coverage flushing to improve the prospect of storing important edge counts. Although this may appear to be a minor refinement in grey-box fuzzing, we find that it has a noticeable impact experimentally.
We present the details of our four refinements in Sections 4.1-4.4 and then show the complete design of Sivo in Section 4.5.
There are multiple grey-box strategies used at different stages of fuzzing (e.g. selection of seeds, choice of mutations and their parameters, etc.) that, depending on the fuzzed program, may or may not be optimal in terms of providing the fastest increase in code coverage. Even during fuzzing of the same program, optimal strategies may change as the set of not-yet-inverted branches evolves. Existing grey-box fuzzers are inefficient because they rarely aim to find the best fuzzing strategy at some stage, and never at all stages. Rather, they usually adhere to a small selection of fuzzing strategies and use them according to a hard-coded probability distribution. For instance, AFL chooses the top seed targets according to a speed metric, and during fuzzing repeatedly selects (according to a nearly uniformly random distribution) one of 20 elementary mutations. AFLFast uses different seeds based on underutilized paths. MOpt improves upon AFL, and optimizes the choice of mutations only.

We aim to reduce the above inefficiencies of grey-box fuzzers by considerably expanding the number of available strategies and optimizing their selection dynamically. For this, we use a two-fold approach: 1) introduce parametrized variations of the fuzzer subroutines at all stages of fuzzing, and 2) during fuzzing, optimize the choice of parameters and variables to increase code coverage.
Parametrization.
We introduce variations of the fuzzer subroutines either by changing some of their hard-coded parameters or by grouping similar subroutines together. When Sivo reaches a parametrized subroutine, it needs to select which variation to execute. In total, Sivo makes 17 selections with 68 variations (refer to Table 1), including, among others:

• Seed prioritization metric: Choose the metric according to speed, number of repetitions, previous success, and so on.

• Fuzzing approach: Choose either the vanilla or the data-flow driven approach. The vanilla approach is composed of mutations that do not require information about the dependency of the branches on input bytes; thus in rounds when this approach is selected there is no need to do taint inference, and the time spent on inference is saved. On the other side, data-flow consists of all mutation strategies that rely on dependency information.

• Vanilla/data-flow mutation strategy: Choose the mutation of the previously selected approach. For instance, the vanilla approach will allow the fuzzer to choose mutations such as copy-delete of bytes within a single seed, combining different seeds, and so on. The data-flow driven approach includes mutation of a fixed number of dependent bytes, mutations of bytes of particular branches, and so on.

• Specific parameters of mutations: Optimize the internal configurable variables used in each subroutine. For instance, this includes choosing the number of mutated bytes, the values of delimiters for copy-delete or combining positions, and so on.
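To make the parametrization concrete, the following sketch (our illustration; the struct fields, helper names, and constants are hypothetical, not Sivo's actual code) shows how a mutation subroutine's hard-coded constants can be lifted into selectable parameters, so that each distinct parameter value becomes one selectable variation:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: the constants of a byte-mutation subroutine
 * (how many bytes to touch, which kind of change) become fields of
 * a parameter struct, so an optimizer can pick among variations. */
typedef struct {
    int num_bytes;   /* number of byte positions to mutate      */
    int use_xor;     /* variation: 1 = xor with random, 0 = set */
} mutation_params;

/* Tiny deterministic PRNG so the sketch is self-contained. */
static unsigned next_rand(unsigned *s) {
    *s = *s * 1664525u + 1013904223u;
    return *s >> 16;
}

/* Mutate the `len`-byte buffer `seed` in place according to `p`. */
static void mutate(uint8_t *seed, size_t len,
                   const mutation_params *p, unsigned *rng) {
    for (int i = 0; i < p->num_bytes; i++) {
        size_t pos = next_rand(rng) % len;
        uint8_t val = (uint8_t)next_rand(rng);
        if (p->use_xor) seed[pos] ^= val;
        else            seed[pos]  = val;
    }
}

/* Helper: how many positions differ between two buffers. */
static int count_diff(const uint8_t *a, const uint8_t *b, size_t n) {
    int d = 0;
    for (size_t i = 0; i < n; i++) d += (a[i] != b[i]);
    return d;
}
```

The optimizer then only ever sees an index into a table of such parameter structs, never the constants themselves.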
Optimization.
To optimize all selections (e.g. the choice of strategy or subroutine), we use multi-armed bandits (MAB). More precisely, each choice provides a reward which is unknown and stochastic. The goal is to maximize the cumulative rewards obtained from all the choices selected. Recall that the objective of Sivo is to maximize the coverage. Hence, as a reward we use the additional coverage acquired from executing the choice. However, this metric alone may not be accurate because some choices incur higher computational overheads. Therefore, as a reward we use the additional coverage per time unit. In the conventional MAB, the distributions of rewards are stationary with some unknown mean. In our case, as the fuzzer progresses, it requires more computational effort to reach the remaining unexplored code and increase coverage. In other words, the rewards for the selection choices monotonically decrease over time. Therefore, we model our problem as MAB with non-stationary rewards and use discounting to solve it [17].

Besides optimizing selections, during fuzzing we use genetic algorithms (GA) to optimize any black-box objective functions that arise during inversion of branches. More precisely, we reduce parts of the inversion problem to black-box optimization and apply GA to speed up the inversion. In vanilla-type fuzzing, we use GA to search for optimal positions of a fixed number of bytes to mutate. For this purpose, as an objective function we use the number of branches that are affected (i.e. change some of their variables) when the selected bytes are mutated. As the number of affected branches increases, so does the chance of inverting one of them by mutating the bytes at the found optimal positions. In data-flow fuzzing, we use GA to invert non-discriminatory targeted branches. In this case, as an objective function we use the distance of the resulting branch value to the value that corresponds to branch inversion. For instance, for the branch "if(x == 5)", the objective function is |x - 5|.
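As an illustration of reducing branch inversion to black-box optimization, the sketch below (ours, with a simplified mutate-and-select loop rather than Sivo's actual GA) minimizes the objective |x - 5| for the branch "if(x == 5)":

```c
#include <stdlib.h>

/* Branch-distance objective for "if (x == 5)": zero iff the
 * condition holds, i.e. iff the branch gets inverted. */
static int objective(unsigned char x) { return abs((int)x - 5); }

/* Toy evolutionary search: keep the best candidate; each generation
 * propose two neighbors and one random jump, accept only improvements. */
static unsigned char search_invert(unsigned gens, unsigned seed) {
    unsigned s = seed;
    unsigned char best = 0;
    for (unsigned g = 0; g < gens && objective(best) > 0; g++) {
        s = s * 1664525u + 1013904223u;
        unsigned char cand[3] = {
            (unsigned char)(best + 1),        /* neighbor        */
            (unsigned char)(best - 1),        /* neighbor        */
            (unsigned char)(best ^ (s >> 16)) /* random mutation */
        };
        for (int i = 0; i < 3; i++)
            if (objective(cand[i]) < objective(best)) best = cand[i];
    }
    return best;
}
```

Because the neighbors always offer a strictly improving step while the objective is non-zero, the search terminates with the branch-inverting value x = 5.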
When the distance reaches zero, i.e. the minimum of the objective function is reached, the branch is inverted.

1  // input is the array uint8_t x[1024];
2  A = x[100] + 10;              // depends on 1 byte
3  B = *(uint32_t *)(x + 200);   // depends on 4 bytes
4  C = *(uint32_t *)(x + 236);   // depends on 4 bytes
5  ...
6  if( A == 50 )                 // branch depends on 1 byte
7  ...
8  if( B + C < 200 )             // branch depends on 8 bytes
9  ...
10 if( C > 200 ) {               // branch depends on 4 bytes
11 ...
12     if( A + C < 400 )         // branch depends on 5 bytes
13 ...
14 }

Figure 1: Branches with dependent input bytes.
To infer the dependency of branches on input bytes, earlier fuzzers relied on the truth value of branch conditions: if changing the value of a particular byte changes the truth value of a branch, then it is inferred that the branch depends on this byte. For instance, in Fig. 1, to correctly infer the dependency of the branch at line 6, the engine first needs to select for mutation the input byte x[100] and then to change its value from something other than 40 to 40. GreyOne [11] proposed so-called fuzzing-driven taint inference FTI, which switches the focus from the truth value of a branch to the values of the variables used in the branch. For instance, FTI determines the dependency of the branch at line 6 on x[100] as soon as this input byte is mutated, because the mutation will change the value of the variable A that is used in the branch. FTI is sound (no over-taint) and incomplete (some under-taint). Exact reasoning with soundness or completeness is not a concern, since fuzzers only use it to generate tests which are concretely run to exhibit bugs.

The prime issue with FTI, which improves significantly over many other prior data-flow based engines, is efficiency. In FTI, the taint is inferred by mutating bytes one-by-one. Thus, to infer the full dependency on all input bytes, the engine will require as many executions as there are bytes. A seed may have tens of KBs, and there may be thousands of seeds, so the full inference may quickly become a major bottleneck in the fuzzer. On the other hand, precise or improved branch dependency may not significantly boost fuzzer bug-finding performance, and thus a long inference time may be unjustified. Hence, it is critical to reduce the inference time.

The TaintFAST engine.
We use probabilistic group testing [10] to reduce the required number of executions for potential full inference from O(n) to O(log n), where n is the number of input bytes. Instead of mutating each byte individually followed by a program execution (and a subsequent FTI check, for each branch condition, whether any of its variables has changed), we simultaneously mutate multiple bytes, and then execute the program with the FTI check. We choose the mutation positions non-adaptively, according only to the value of n. This assures that all branches can be processed simultaneously.

Consider the code fragment of Figure 1 (here n = 1024). We build n-bit mutation vectors M_i, where each bit corresponds to one of the input bytes. A bit at position j is set iff the input byte j is mutated (i.e. assigned a value other than the value it has in the seed). Once M_i is built, we execute the program on the new input (that corresponds to M_i) and for each branch check if any of its variables changed value (in comparison to the values produced during the execution of the original seed). If so, we can conclude that the branch depends on some of the mutated bytes determined by M_i. Note that in all prior works the vectors M_i had a single set bit (only one mutated byte). As such, the inference is immediate, but slow. We, on the other hand, use vectors with n/2 = 512 set bits and select 2·log n = 20 such vectors. The vectors M_{2i}, M_{2i+1} consist of repeated runs of 2^i set bits followed by 2^i unset bits, but with different starts (the two vectors are complements). For instance, the first vectors begin as

M_0 = 101010101010...
M_1 = 010101010101...
M_2 = 110011001100...
M_3 = 001100110011...
M_4 = 111100001111...
M_5 = 000011110000...

We execute the resulting 20 inputs and for each branch build a 20-bit binary vector V. The bit i in V is set if any of the branch values changed after executing the input that corresponds to M_i. For instance, for the branch at line 6 of Figure 1, the bits of V pairwise encode the binary representation of the position 100, and V suffices to infer the dependency. To do so, we initialize a 1024-bit vector D that will hold the dependency of the branch on input bytes: bit j is set if the branch depends on the input byte j. We set all bits of D, i.e. we start by guessing full dependency on all inputs. Then we remove the wrong guesses according to V. For each unset bit i in V (i.e. the branch value did not change when we mutated the bytes of M_i), we unset all bits in D that are set in M_i (i.e. the branch does not depend on any of the mutated bytes of M_i).

After processing all unset bits of V, the vector D will have set bits that correspond to the potentially dependent input bytes. Theoretically, there may be under- and over-taint, according to the following information-theoretic argument: V has 20 bits of entropy and thus it can encode at most 2^20 dependencies, whereas a branch may depend on any subset of the 1024 input bytes and thus it can have 2^1024 different dependencies. In practice, however, it is reasonable to assume that most of the branches depend only on a few input bytes (C-type branches that contain multiple variables connected with AND/OR statements are split during compilation into subsequent independent branches; our inference is applied at the assembly level, thus most of the branches depend only on a few variables), and in such a case the inference is more accurate. For branches that depend on a single byte, the correctness of the inference follows immediately from group testing theory (the matrix with rows M_0, M_1, ... is 1-disjunct and thus it can detect 1 dependency). For instance, the branch at line 6 of Fig. 1 will have a correctly inferred dependency only on byte x[100]. For branches that depend on a few bytes, we can reduce (or entirely prevent) over-taint by repeating the original procedure while permuting the vectors M_i. In such a case, each repeated inference will suggest different candidates, except the truly dependent bytes that will be suggested by all procedures. These input bytes can then be detected by taking the intersection of all the suggested candidates. For instance, for the branch at line 8 (that actually depends on 8 bytes), a single execution of the procedure will return 16 byte candidates. By repeating the procedure once with randomly permuted positions of M_i, with high probability only the 8 actual candidates will remain.

The above inference procedure makes the implicit assumption that the same branches are observed across different executions. Otherwise, if a branch is not observed during some of the executions, then the corresponding bit in V will be undefined, and thus no dependency information about the branch will be inferred from that execution. For some branches the assumption always holds (e.g. for the branches at lines 6 and 8 in Figure 1). For other branches, the assumption holds only with some probability that depends on their branch conditions. For instance, the branch at line 12 may not be seen if the branch at line 10 is inverted, thus any of the 20 bits of V may be undefined with some probability. In general, for any branch that lies below some preceding branches, the probability that bits in V will be defined is equivalent to the probability that none of the above branches will be inverted by the mutations. As a rule of thumb, the deeper the branch and the easier to invert the preceding branches are, the harder it will be to infer the correct dependency.
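A toy-scale sketch of this inference (ours, with n = 16 bytes instead of 1024, and the program execution simulated by a bitmask of the truly dependent bytes) illustrates the mask construction and the decoding of D:

```c
#include <stdint.h>

#define N     16   /* toy input size (a power of two) */
#define NMASK 8    /* 2*log2(N) mutation masks        */

/* Bit j of mask m: masks 2i and 2i+1 consist of runs of 2^i set
 * bits followed by 2^i unset bits; the odd mask is the complement. */
static int mask_bit(int m, int j) {
    int run = (j >> (m / 2)) & 1;     /* which run byte j falls in */
    return (m & 1) ? run : 1 - run;   /* even masks start set      */
}

/* Infer candidate dependencies of one branch. The "execution" is
 * simulated: the branch value changes iff a mutated byte is in
 * `true_deps` (a bitmask of the truly dependent bytes). */
static uint32_t infer_deps(uint32_t true_deps) {
    uint32_t D = (1u << N) - 1;       /* guess: depends on all bytes */
    for (int m = 0; m < NMASK; m++) {
        uint32_t M = 0;
        for (int j = 0; j < N; j++)
            if (mask_bit(m, j)) M |= 1u << j;
        if ((M & true_deps) == 0)     /* branch value unchanged:     */
            D &= ~M;                  /* none of these bytes matter  */
    }
    return D;
}
```

For a branch depending on a single byte the result is exact; for several bytes it is a superset of the true dependencies, which repetition with permuted masks shrinks.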
(This holds even in the case of FTI; however, the probabilities there are higher because only a single byte is mutated.)

To infer deeper branches, we introduce a modification based on forced execution. We instrument the code so the execution at each branch takes a predefined control-flow edge, rather than deciding on the edge according to the value of the branch condition. This guarantees that the target branches seen during the execution of the original seed file (used as a baseline for mutation) will be seen in the executions of all subsequent inputs produced by mutating the original seed. We perform forced execution dynamically, with the same statically instrumented program, working in two modes. In the first mode, the program is executed normally, and a trace of all branches and their condition values is stored. In the second mode, as the branches emerge during execution, their condition values are changed to the stored values, thus the execution takes the same trace as before. No variables other than the condition values are changed.

Note that by definition our taint inference is approximate, and our use of forced execution for improving it is not sound. As mentioned before, this is not a concern in fuzzing. We experimentally show that these strategies are effective on several programs.

It was noted in RedQueen [1] that when branches depend trivially on input bytes (so-called direct copies of bytes) and the branch condition is in the form of an equality (either = or ≠), then such branches can be solved trivially. For instance, the branch at line 1 of Figure 2 depends trivially on the byte x[0], and its condition can be satisfied by assigning x[0] = 5.

1 if( x[0] == 5 )
2 ...
3 if( x[1] < 100 )
4 ...
5 if( x[2] > 10 ) {
6 ...
7     if( x[2] <= 200 )
8 ...
9 }

Figure 2: Branches and systems of intervals.

We extend this observation to inequalities. Consider, for instance, the branch at line 3 of Figure 2, which depends trivially on the input byte x[1].
From the type of the inequality (which can be obtained from the instruction code of the branch), the correct dependency on the input byte x[1], and the constant 100, we can deduce the branch form x[1] < 100. To satisfy the branch we need x[1] ∈ [0, 99], or to invert it, x[1] ∈ [100, 255]. Namely, we can represent the solution in the form of an interval for that particular input byte.

Often, to satisfy/invert a branch we need to take into account not one but several conditions, corresponding to branches that have common variables with the target branch. For instance, to satisfy the branch at line 7, we end up with two inequalities, and thus two intervals: x[2] ∈ [0, 200] corresponding to the target branch at line 7, and x[2] ∈ [11, 255] corresponding to the branch at line 5 that shares the input variable x[2] with the target branch. A solution exists (x[2] ∈ [11, 200]) because the intersection of the intervals is not empty.

In general, the system is built starting from the target branch, by gradually adding preceding branches that have common input variables with the target branch. Each branch (in)equality is immediately solved independently, resulting in one or two intervals (two intervals only when solving x ≠ value, i.e. x ∈ [0, value-1] ∪ [value+1, maxvalue]), and then the intersection is found with the previous set of intervals corresponding to those particular input bytes. Keeping the intervals sorted assures that the intersection will be found fast. Also, each individual intersection can increase the number of intervals by at most 4. Thus the whole procedure is linear in the number of branches. As a result, we can efficiently solve this type of constraints, and thus satisfy or invert branches that depend trivially on input bytes.

Even when some of the preceding branches do not depend trivially on input bytes, solving the constraints for the remaining branches gives an advantage in inverting the target branch.
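The interval manipulation can be sketched as follows (our simplified illustration for a single input byte; Sivo's implementation additionally keeps sorted sets of intervals and handles the ≠ split into two intervals):

```c
/* Sketch: per-byte constraints as inclusive intervals over [0, 255].
 * One branch inequality yields one interval; conjoining branches on
 * the same byte is interval intersection (empty result: lo > hi). */
typedef struct { int lo, hi; } interval;

static interval isect(interval a, interval b) {
    interval r;
    r.lo = a.lo > b.lo ? a.lo : b.lo;
    r.hi = a.hi < b.hi ? a.hi : b.hi;
    return r;
}

/* Interval satisfying "x OP c" for a byte x.
 * OP encoding: '<', '>', '=', 'L' for <=, 'G' for >=. */
static interval from_cmp(char op, int c) {
    interval r = { 0, 255 };
    switch (op) {
    case '<': r.hi = c - 1;    break;
    case 'L': r.hi = c;        break;
    case '>': r.lo = c + 1;    break;
    case 'G': r.lo = c;        break;
    case '=': r.lo = r.hi = c; break;
    }
    return r;
}
```

For the target branch at line 7 of Figure 2, intersecting the intervals of x[2] <= 200 and x[2] > 10 yields [11, 200] for x[2].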
In such a case, we repeatedly sample solutions from the solved constraints and expect that the non-inverted branch constraints will be satisfied by chance. As sampling from the system requires constant time (after solving it), the complexity of branch inversion is reduced to only that of satisfying the non-trivially dependent branches.

AFL uses a simple and elegant method to record the edges and their counts by using an array showmap. First, it instruments all basic blocks B_i of a program by assigning them a unique random label L_i. Then, during the execution of the program on a seed, as any two adjacent basic blocks B_i, B_j are processed, it computes a hash of the edge (B_i, B_j) as E = (L_i ≫ 1) ⊕ L_j and performs showmap[E]++. New coverage is observed if the value ⌊log₂ showmap[E]⌋ of a non-zero entry showmap[E] has not been seen before. If so, AFL updates its coverage information to include the new value, which we will refer to as the logarithmic count.

Prevent colliding edge hashes.
CollAFL [12] points out that when the number of edges is high, their hashes will start to collide due to the birthday paradox, and showmap will not be able to signal all distinct edges. Therefore, a fuzzer will fail to detect some of the coverage. CollAFL provides a solution to the problem and proposes to assign the labels L_i in such a way that no collisions appear, i.e. all (L_i ≫ 1) ⊕ L_j should be distinct. This solution is sound, but in our understanding it is hard to fully implement in practice. The current LLVM does not support inter-module passes, thus uniqueness of hashes can be guaranteed only at the level of single files (even then collisions may appear because CollAFL in part uses a greedy algorithm that does not guarantee uniqueness), and all edges that connect basic blocks of different source files can still collide. It is possible to overcome this issue with more ad-hoc approaches (for example, instead of using a compilation script for the whole program, we can manually compile each source file of the program while providing to the compilation a cumulative list of all used labels; this requires intrusive modification to build scripts), but then the compilation becomes troublesome. Note that since CollAFL does not have publicly available official source code, we could not find the exact approach taken to overcome the issue of separate compilation with LLVM in prior work.

We propose a simpler solution to the collision avoidance problem. Instead of assigning only one label L_i to each basic block B_i, we assign several labels L_{i1}, ..., L_{ik}, but use only one of them during an execution. The index of the used label is switched occasionally, for all blocks simultaneously. The switch assures that, with a high chance, each edge will not collide with any other edge for at least some of the indices. The number of labels required to guarantee that all edges will be unique with a high chance at some switch depends on the number of edges; we provide a combinatorial analysis of this number in Appendix A.

Improve compressed edge counts.
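A sketch of this recording scheme (ours; the map size, label width, and number of labels k are illustrative toy constants, not Sivo's actual values):

```c
#include <stdint.h>

#define MAP_SIZE (1 << 16)
#define K        4            /* labels per basic block */

static uint8_t showmap[MAP_SIZE];
static int     cur_idx;       /* active label index, switched
                                 occasionally for all blocks at once */

/* AFL-style hash of edge (Bi -> Bj) under the active label index:
 * two edges colliding at one index likely separate at another. */
static uint32_t edge_hash(const uint16_t Li[K], const uint16_t Lj[K]) {
    return ((uint32_t)(Li[cur_idx] >> 1) ^ Lj[cur_idx]) % MAP_SIZE;
}

static void record_edge(const uint16_t Li[K], const uint16_t Lj[K]) {
    uint32_t e = edge_hash(Li, Lj);
    if (showmap[e] < 255) showmap[e]++;   /* saturating counter */
}

/* Bucketized ("logarithmic") count used to judge coverage novelty. */
static int log_bucket(uint8_t count) {
    int b = 0;
    while (count >>= 1) b++;
    return b;
}
```

Note that log_bucket(13) and log_bucket(14) fall into the same bucket, which is exactly the compression loss discussed next for Figure 3.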
The logarithmic count helps to avoid storing all possible edge counts, but it may also implicitly hinder achieving better coverage. This is because certain important count statistics that have the same logarithmic count as previously observed during fuzzing might be discarded. For instance, consider the code in Figure 3:

    count = 0;
    for(i = 0; i < x; i++) count++;
    if( 13 == count && C1 ) F1();
    else if( 14 == count && C2 ) F2();

Figure 3: The effects of AFL's edge count compression.

If the loop in Figure 3 gets executed 13 times, then AFL will detect this as a new logarithmic count of floor(log 13) = 3, it will update the coverage, save the seed in the pool, and later when processing this seed, the code block F1 may be reached (if C1 holds as well). However, if the loop afterwards gets executed 14 times, the same logarithmic count floor(log 14) = 3 is produced, so the corresponding seed is not saved, even though reaching F2 requires the loop to be executed exactly 14 times and C2 to hold.

Footnote: For example, instead of using a compilation script for the whole program, we can manually compile each source file of the program while providing to the compilation a cumulative list of all used labels. This requires intrusive modification of build scripts.

Sivo implements all the refinements mentioned in Sections 4.1-4. It uses the standard grey-box approach of iteratively processing seeds: in each iteration it selects a seed, applies mutations to it to obtain new seeds, and stores those that increase coverage.
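The coverage-tracking mechanics discussed above, AFL's hashed showmap with logarithmic counts and Sivo's multi-label index switching, can be sketched as follows. This is a simplified model for exposition only: the map size and helper names are illustrative, not Sivo's actual implementation.

```python
import math

MAP_SIZE = 1 << 16  # illustrative showmap size

def record_trace(labels, trace, index=0):
    """Replay a trace of basic-block ids, bumping the count of each hashed
    edge E = (L_i >> 1) ^ L_j in a showmap, as AFL does."""
    showmap, prev = {}, 0
    for block in trace:
        label = labels[block][index]          # currently active label
        edge = ((prev >> 1) ^ label) % MAP_SIZE
        showmap[edge] = showmap.get(edge, 0) + 1
        prev = label
    return showmap

def log_buckets(showmap):
    """AFL's compressed edge counts: floor(log2(count)) per edge, so 13 and
    14 loop executions land in the same bucket 3 (the Figure 3 issue)."""
    return {edge: int(math.log2(count)) for edge, count in showmap.items()}

def collision_free_index(labels, edges, k):
    """Sivo's multi-label refinement: each block carries k labels; return a
    label index under which all edge hashes are distinct, if one exists."""
    for idx in range(k):
        hashes = [((labels[i][idx] >> 1) ^ labels[j][idx]) % MAP_SIZE
                  for (i, j) in edges]
        if len(set(hashes)) == len(hashes):
            return idx
    return None
```

For instance, `log_buckets({0: 13, 1: 14})` maps both edges to bucket 3, which is exactly the count compression that can hide the second branch of Figure 3.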
Algorithm 1: OneIterationSivo(Seeds, Coverage)

    use_class ← MAB_select(Seed_class)
    use_crit ← MAB_select(Seed_criterion)
    seed ← Sample(use_class, use_crit, Seeds)
    use_strategy ← MAB_select(Fuzzer_strategy)
    if use_strategy == Data-flow then
        Taint_inference(seed)
    tot_cov_incr ← 0
    while time budget left do
        use_mut ← MAB_select(strategy)
        use_mut_params ← MAB_select(use_mut)    // if needed
        new_seed ← Mutate(seed, use_mut, use_mut_params)
        new_coverage ← ProduceCoverage(new_seed)
        cov_increase ← ||new_coverage \ Coverage||
        if cov_increase > 0 then
            Seeds ← Seeds ∪ new_seed
            Coverage ← Coverage ∪ new_coverage
        MAB_update([use_mut, use_mut_params], cov_increase, while_time)
        tot_cov_incr += cov_increase
    MAB_update([use_class, use_crit, use_strategy], tot_cov_incr, iter_time)
In Sivo (refer to the pseudo-code in Algorithm 1), the seed selection is optimized with MAB according to the best class and best criterion. Similar optimization is applied to the selection of mutation strategies and of their parameters. The coverage update information is fed back to the MAB (after each coverage update computation for mutations and their parameters, and as the cumulative coverage update at the end of each iteration). In Table 1 we give a list of the selections available at different steps of the fuzzer. Sivo runs the iterations and occasionally executes the code coverage refinements; refer to Algorithm 2. To clarify the additional parts introduced by the refinements of Sivo, in Appendix B we provide the pseudo-code of a generic grey-box fuzzer.

Table 1: Fuzzer procedures and their variations
Procedure/Parameters      Variation(s)             Description

Seed_class
    SC-fast-edges            consider only the most efficient seeds for each edge
    SC-fast-multiple-edges   include as well the fastest seed for each multiplicative edge
    SC-all                   consider all seeds

Seed_criterion
    Count                    choose the least sampled seed
    Speed                    sample according to the number of executions per second
    Length                   sample according to the number of bytes
    Crash                    consider only seeds that lead to crashes
    Cov                      consider only seeds that increase edge count
    Random                   sample randomly

Fuzzer_strategy
    Vanilla                  does not require taint inference
    Data-flow                requires taint inference

Vanilla
    Mutate-rand-bytes        mutate random bytes
    Copy-remove              copy and remove byte sequences of the current seed
    Combiner                 combine multiple seeds at different positions

Data-flow
    Mutate-bytes             mutate dependent bytes
    Invert-branches          invert target branches by mutating their dependent bytes
    Invert-branches-GA       invert target branches with GA by minimizing an objective function
    Random-sampler           invert multiple (untargeted) branches with random sampling
    System-solver            invert branches with the system solver
    Mingler                  reuse previously found bytes from other seeds in this seed

Mutate-rand-bytes/Type
    MRB-GA                   use GA to determine positions of mutated bytes
    MRB-simple               mutate randomly chosen bytes

MRB-GA, MRB-simple
    True, False              bias selection of bytes according to their previous use

Copy-remove/Number

Copy-remove/Mode
    CR-rand                  add random bytes / remove
    CR-real                  copy real bytes / remove
    CR-learn                 learn and use byte divisors seen previously
    CR-prev                  copy/remove at positions previously successful

Combiner/Number

Combiner/Select
    Speed/Inverse-speed      prefer seeds that are faster/slower to execute
    Length/Inverse-length    prefer seeds that are shorter/longer
    Select-random            sample randomly

Combiner/Mode
    CM-random                combine at random positions
    CM-learn                 learn and use byte divisors to select positions

Mutate-bytes/Number

Mutate-bytes/Type
    True, False              bias selection of bytes according to their previous use

System-solver/Type
    ST-all                   add all branches dependent on the target inversion branch at once
    ST-one                   add the unsolved branches gradually, one by one

Mingler/Number
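The MAB_select and MAB_update calls in Algorithm 1 choose among the variations listed in Table 1. A minimal sketch of such a bandit is below; it uses plain UCB1 with the coverage increase as reward, whereas Sivo uses a discounted variant [17], so the exploration constant and reward scaling here are illustrative.

```python
import math

class MAB:
    """Minimal UCB1-style multi-armed bandit over a fixed set of arms
    (e.g., the Fuzzer_strategy variations "Vanilla" and "Data-flow")."""

    def __init__(self, arms):
        self.counts = {a: 0 for a in arms}
        self.rewards = {a: 0.0 for a in arms}
        self.total = 0

    def select(self):
        # Play every arm once first, then maximize the upper confidence bound.
        for arm, count in self.counts.items():
            if count == 0:
                return arm
        return max(self.counts, key=lambda a:
                   self.rewards[a] / self.counts[a]
                   + math.sqrt(2 * math.log(self.total) / self.counts[a]))

    def update(self, arm, reward):
        # Feed back the observed coverage increase for the chosen arm.
        self.counts[arm] += 1
        self.rewards[arm] += reward
        self.total += 1
```

Arms that repeatedly yield no coverage increase see their mean reward stay low, so the bandit gradually concentrates on the variations that work for the target program, which is the adaptive behavior Algorithm 1 relies on.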
We implement Sivo in C++ with around 20,000 lines of code. All of the code is written from scratch, with the exception of around 500 lines related to the so-called fork server, which is based on AFL's code. Sivo uses static instrumentation to obtain data about the coverage and branches of the programs. More precisely, three of the four refinements require additional instrumentation of programs to implement their functionality: the accurate code coverage refinement requires lighter instrumentation, whereas the TaintFAST and system solver refinements require heavier instrumentation. For this purpose, we compile the program source code with two different LLVM passes: one that uses relatively lightweight instrumentation for code coverage only, and one with heavier instrumentation that provides information about the branches as well. The compilation is done by a modified version of Clang (in a fashion similar to AFL's afl-clang++). The overheads introduced by the instrumentation are given in Appendix C.

Footnote: The fork server helps to speed up execution of programs. It runs the initialization of system resources (required by execve) once, stores the state, and all subsequent executions of the program are run starting from the stored state.

Algorithm 2:
Sivo
Seeds ← Initial_seeds
Coverage ← ProduceCoverage(Seeds)
while true do
    OneIterationSivo(Seeds, Coverage)
    if time_to_switch_index then
        SwitchIndexInCoverage()
        Coverage ← ProduceCoverage(Seeds)
    if time_to_start_flush then
        Old_coverage, Old_seeds ← Coverage, Seeds
        Seeds ← Initial_seeds
        Coverage ← ProduceCoverage(Seeds)
    if time_to_stop_flush then
        New_coverage ← Coverage \ Old_coverage
        Coverage ← Coverage ∪ Old_coverage
        Seeds ← Old_seeds ∪ GetSeedsThatProduceCov(Seeds, New_coverage)
We show that Sivo performs well on multiple benchmarks according to standard fuzzing metrics such as code coverage (Section 5.2) and found vulnerabilities (Section 5.3). We evaluate the performance of each refinement in Section 5.4.
Experiment environment.
All experiments are run on a machine with Ubuntu Desktop 16.04, two Intel Xeon E5-2680v2 CPUs @ 2.80GHz with 40 cores, 64GB DDR3 RAM @ 1866MHz and SSD storage. All fuzzers are tested on the same programs, provided with only one initial seed, randomly selected from samples available on the internet. To keep experiments computationally reasonable, while still providing a fair comparison of all considered fuzzers, we performed a two-round, tournament-like assessment. In the first round, all fuzzers were appraised over the course of 12 hours. This interval was chosen based on Google's FuzzBench periodical reports, which show that 12 hours is usually sufficient to decide the ranking of fuzzers [14]. The top 3 fuzzers from the first round that perform best on average over all evaluated programs progress to the second round, in which they are run for 48 hours.
Baseline fuzzers.
We evaluate Sivo in relation to 11 notable grey-box fuzzers. In addition to AFL [35], we take the extended and improved AFL family: AFLFast [3], FairFuzz [19], LAF-Intel [18], MOpt [21] and EcoFuzz [33]. Moreover, we include Angora [5] for its unique mutation techniques, Ankou [22] for its fitness function, and a few fuzzers that perform well on Google's OSS-Fuzz [14] platform, such as Honggfuzz [31], AFL++ [14] and AFL++_mopt [14]. To avoid any unfair comparison to grey-box fuzzers with no officially available implementation (e.g., GreyOne [11] and CollAFL [12]), we chose to omit them from our experiments.

Programs.
Our choice of programs was influenced by multiple factors, such as implementation robustness, diversity of functionality, and previous analyses in other works. We use versions that allow a direct comparison to prior evaluations. Our final selection consists of 27 programs including: binutils (e.g., readelf, nm), parsers and parser generators (e.g., bson_to_json@libbson, bison), a wide variety of analysis tools (e.g., tcpdump, exiv2, cflow, sndfile-info@libsndfile), image processors (e.g., img2txt), assemblers and compilers (e.g., nasm, tic@libncurses), compression tools (e.g., djpeg, bsdtar), the LAVA-M dataset [9], etc. A complete list of the programs, their versions under test, and their input parameters is available in Appendix E.

Footnote: We found third-party source code of this fuzzer, and closer inspection revealed a suboptimal implementation.

Efficiency metrics.
We use two metrics to compare the efficiency of fuzzers: edge coverage and the number of found vulnerabilities. To determine the coverage, we use the logarithmic edge count, because this number is the objective in the fuzzing routines of the AFL family of fuzzers. For completeness, in Appendix D we also provide a simple edge count. The coverage data may be inaccurate due to colliding edges when measured as in AFL. To rectify this, we instrument the target programs to store the full execution traces and compute precise coverage data. In Appendix D we also show the imprecise coverage, measured as in AFL.

We measure the total number of distinct vulnerabilities found by each fuzzer. For this purpose, we first confirm the reported vulnerabilities, i.e., we take all seeds generated by a fuzzer and keep those that trigger a crash under any of the sanitizers ASAN [27], UBSAN [25] and Valgrind [23]. Then, for each kept seed, we record the source line in the program that triggers the crash (according to the appropriate sanitizer). Finally, we count the number of distinct crash-triggering source lines found.
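The deduplication step described above can be sketched as follows; the crash-record tuple format is hypothetical (the paper does not specify one), standing in for parsed sanitizer reports.

```python
def count_distinct_vulns(crash_reports):
    """Count distinct vulnerabilities as distinct crash-triggering source
    locations. Each report is a (sanitizer, source_file, line) tuple."""
    return len({(src, line) for (_sanitizer, src, line) in crash_reports})
```

Two seeds that crash on the same source line, even under different sanitizers, thus count as a single vulnerability.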
Remark.
Some fuzzers [11, 12] report so-called path coverage, measured as the number of produced seeds. We find this metric rather inaccurate. For instance, a seed that inverts n branches often is equivalent to n seeds that each invert a single branch. However, the above path coverage will count the former as a single path and the latter as n paths (thus a fuzzer can easily manipulate this metric). In addition, the instability of this metric can be demonstrated by supplying to AFL all seeds found previously by AFL; often, AFL will keep a much smaller number of seeds (in some cases only 50% of the supplied seeds), because the path coverage metric depends on the order in which seeds are processed. Therefore, seed count should not be considered a reliable, standalone metric.

We run all 12 fuzzers for 12 hours each, and record the coverage discovered during the fuzzing. The results are reported in Figure 4. We can see that at the end, Sivo provides the best coverage for 25 out of the 27 programs. On average Sivo produces 11.8% higher coverage than the next best fuzzer when analyzed individually for each program. In direct comparison of fuzzers, Sivo outperforms the next best fuzzer MOpt by 20...%.

Footnote: In AFL, this is the number of set bits in the accumulative array showmap.

Footnote: ASAN and UBSAN detect more types of vulnerabilities than Valgrind; however, they also require additional instrumentation during the compilation of the programs. We use ASAN and UBSAN to confirm the vulnerabilities in such programs. For the programs that fail to compile with the additional instrumentation, we use Valgrind as an alternative. This criterion was used for all fuzzers.

Figure 4: Coverage for all fuzzers during 12 hours of fuzzing (in 5 min increments).

... time to feed enough data back to the MABs. Thus, arguably the early advantage of Sivo is achieved due to the parametrize paradigm, as well as the remaining two refinements (TaintFAST and the system solver method).

We test the top three fuzzers Sivo, MOpt, and EcoFuzz on 48-hour runs and report the obtained coverage in Figure 5. We see that Sivo is the top fuzzer for 24 of the programs, with 13.4% coverage increase on average with respect to the next best fuzzer for each program, and 15.... and ....1% with respect to MOpt and EcoFuzz. In comparison to the 12-hour runs, the other two fuzzers managed to slightly reduce the coverage gap, but this is expected (given sufficient time, all fuzzers will converge). However, the gap is still significant and Sivo provides consistently better coverage.
We summarize the number of vulnerabilities found by each fuzzer on 25 programs during the 12-hour runs in Table 2. (We removed two programs from Table 2, as none of the fuzzers finds vulnerabilities for them.) Out of 25 evaluated programs, Sivo is able to find the maximal number of vulnerabilities in 18 programs (72%). For comparison, the next best fuzzer MOpt holds the top position in 11 programs (44%) in terms of vulnerability discovery. This indicates that Sivo is significantly more efficient at finding vulnerabilities than the remaining candidate fuzzers as well. However, Sivo achieves fewer top positions in discovery of vulnerabilities than in code coverage; this is not unusual, as the objective of Sivo is code coverage, and the correlation between produced coverage and found vulnerabilities is not necessarily strong [15, 16].

We also measure and report in Table 2 the number of vulnerabilities unique to each fuzzer, i.e., bugs that are found only by that fuzzer and not by any other. This metric signals the distinctiveness of each fuzzer: the greater the number of unique vulnerabilities, the more distinct the fuzzer is at vulnerability detection. Out of 25 programs, Sivo discovers at least one unique vulnerability in 11 programs. In total, Sivo finds 31 unique vulnerabilities, while the next best fuzzer is Honggfuzz [31] with 21 unique vulnerabilities.
Next, we seek to determine the benefits of the refinements from the coverage produced by Sivo. To do this, we tagged each seed produced by Sivo with the refinement that produced it, allowing us to attribute gains in coverage to that refinement. Note that we can only estimate the benefits due to individual strategies. This is not an exact attribution for multiple reasons. The used seeds only suggest which fuzzing strategy helped to increase the coverage at that moment; the same coverage may be discovered later with other strategies. In addition, some coverage is produced by a combination of more than one refinement.

We provide upper or lower bounds on the percentage of resulting code coverage as estimates of the contribution of each refinement proposed in Sivo. We measure TaintFAST (TI) indirectly, through the coverage produced with all data-flow strategies, thus the estimate is an upper bound on the percentage of coverage. Solving systems of interval refinement (SI) is measured fairly accurately. Finally, we estimate the benefits of the accurate coverage refinement (AC) as a percentage of coverage detected in addition to AFL coverage,

Table 2: The number of found vulnerabilities. The number of unique vulnerabilities (when non-zero) is reported after "/". "-" indicates failure to instrument/run the program.
Application    AFL   AFL++  AFL++_mopt  AFLFast  FairFuzz  LAF-Intel  MOpt   EcoFuzz  Honggfuzz  Angora  Ankou  Sivo
base64         2     2      2           2        2         2          2      2        2          2       1      2
bison          3     3      3           3        4/1       3          4/1    2        2          1       3      2
bson_to_json   2     1      1           1        2         2          2      1        1          2       1      2
cflow          2     1      1           2        2         1          5      3        2          1       3      6/1
exiv2          6     5      6           5        6         6          11/3   0        -          -       8      8
fig2dev        29/1  24     29          26       30/1      22         35     30/2     43/4       1       40     59/7
ftpconf        2     2      2           2        2         2          2      2        2          2       2      2
img2sixel      1     1      1           1        1         0          16/1   12/1     15/3       -       7      22/6
img2txt        2     2      2           0        4         2          8/2    5/1      3          -       7/3    10/5
md5sum         1     1      1           1        2/1       1          1      1        1          1       1      1
nasm           4     4      5           4        8         4          10     8        2          5/1     9      13/1
nm             4     3      3           4        4         4          6/1    5        3          0       4      6/1
readelf        1     1      1           1        1         1          1      1        1          2/1     1      1
sassc          1     1      1           1        2         1          2      2        1          -       1      5/3
slaxproc       4     3      3           3        3         3          4      3        3          -       6/2    5/1
sndfile-info   0     0      0           0        3         0          8/2    6        13/6       -       1      7
tcpdump        0     0      0           0        0         0          3      1        1          -       1      7/3
testsolv       6     6      6           6        6         6          7      8/1      14/8       -       6      9/2
tic            2     1      2           1        2         2          3      2        2          -       0      3
tiff2pdf       2     2      1           2        1         2          4      3        1          0       3      4
tiffset        1     1      1           1        1         1          1      1        1          0       1      1
uniq           1     1      1           1        1         1          2      3        7          1       2      7
webm2pes       1     1      1           1        1         1          2      2        1          -       1      3/1
who            1     1      1           1        1         1          7      3        6          0       3      7
wpd2html       0     0      0           0        0         0          1/1    1        0          -       1      1

and we ignore the benefits of compressed edge counts, thus this estimate provides only a lower bound. Lastly, we cannot attribute the benefits of the complete parameterization refinement directly, since all seeds are produced through this refinement.

We provide two separate estimates: 1) for all coverage, and 2) only for the coverage found by Sivo but not by either of the other two top fuzzers (i.e., MOpt and EcoFuzz). The second category is important because it may hint at which refinements give Sivo an advantage over the other fuzzers.

The estimates are presented in Table 3. A few interesting conclusions emerge from the table. First, TI is disproportionately better represented in the additional coverage; that is, Sivo finds a lot of new coverage (in comparison to other fuzzers) with the help of data-flow fuzzing strategies (which are not present in the AFL family of fuzzers), and thus efficient taint inference is valuable.
Second, SI improves the coverage only slightly; however, this branch inversion strategy has a very strong impact on discovering new coverage for some of the programs. Finally, as expected, AC helps to improve the coverage, and its benefit grows with the amount of coverage. In 20 of the tested programs, AC is responsible for adding at least 10% of new coverage.
Missed coverage.
We also determined the amount of coverage that was produced jointly by the top fuzzers, but not by Sivo; the results are given in the last column of Table 3. Among the 27 programs, only on 6 does Sivo miss more than 10% of the joint coverage found by the top fuzzers. From the plots in Figure 5, it seems Sivo has not yet plateaued in coverage discovery for most of these programs, leaving the possibility that with longer testing it may discover part of the missed coverage.
Grey-box fuzzers, starting from the baseline AFL [35], have been the backbone of modern, large-scale testing efforts. To reach new coverage, AFL chooses a seed from an evolving pool and applies mutation operators numerous times. The fuzzer utilizes around a dozen different mutations, and selects uniformly at random which one to apply at a time. Mutated seeds that lead to new coverage are added to a pool of interesting seeds. From this pool AFL ...

Table 3: Coverage information. The first column (All Coverage) gives the percentages found individually by the three refinements of Sivo. The second column (Additional Cov) gives the percentages of coverage found by Sivo, but not found by either of the two top fuzzers. The third column (Missed Cov) gives the coverage found jointly by any of the top fuzzers, but not by Sivo.
Application    All Coverage: cov TI SI AC    Additional Cov: cov perct TI SI AC    Missed Cov: cov perct
(rows: base64, bison, bsdtar, bson_to_json, cflow, djpeg, exiv2, fig2dev, ftpconf, img2sixel, img2txt, md5sum, nasm, nm, readelf, sassc, slaxproc, sndfile-info, tcpdump, testsolv, tic, tiff2pdf, tiffset, uniq, webm2pes, who, wpd2html)

... In contrast, Sivo first parameterizes all aspects, i.e., introduces many variations of the fuzzing subroutines, and then tries to optimize the selection of all parameters. Even the seed selection subroutines of EcoFuzz and Sivo differ, despite both using multi-armed bandits: EcoFuzz utilizes MAB to select a candidate seed from the pool, whereas Sivo uses MAB to decide on the selection criterion and on the pool of seeds.

Footnote: This refers to optimization only; some fuzzers improve (but not optimize) multiple fuzzing subroutines.

Several grey boxes deploy data-flow fuzzing, i.e., they infer the dependency of branches on input bytes and use it to accomplish more targeted branch inversion. VUzzer [24], Angora [5], BuzzFuzz [13] and Matryoshka [6] use a classical dynamic taint inference engine (i.e., one that tracks taint propagation) to infer the dependency. FairFuzz [19], ProFuzzer [32] and Eclipser [8] use a lighter engine and infer partial dependency by monitoring the execution traces of the seeds. RedQueen [1] and Steelix [20] can infer only dependencies based on exact (often called direct) copies of input bytes in the branches, by mutating individual bytes. Among grey boxes, the best inference in terms of speed, type, and accuracy is achieved by GreyOne [11]. Its engine, called FTI, is based on mutation of individual bytes (thus fast, because it does not track taint propagation) and can detect dependencies of any type (not only direct copies of input bytes). FTI mutates bytes one by one and checks for changes in the variables involved in branch conditions (thus accurate, because it does not need the whole branch to flip, only some of its variables). Sivo's inference engine
TaintFAST improves upon FTI and provides an exponential decrease in the number of executions required to infer the full dependency, at a possible expense of accuracy.

Figure 5: Coverage for top three fuzzers Sivo, MOpt, EcoFuzz during 48 hours of fuzzing.

Instead of testing bytes one by one,
TaintFAST uses probabilistic group testing and reduces the number of executions.

Data-flow grey boxes accomplish targeted branch inversion by randomly mutating the dependent bytes. A few fuzzers deploy more advanced strategies: Angora [5] uses gradient-descent-based mutation, Eclipser [8] can efficiently invert branches that are linear or monotonic, and GreyOne [11] inverts branches by gradually reducing the distance between the actual and expected value in the branch condition. Sivo uses genetic algorithms to guide the process of inversion based on distance. Some fuzzers, such as RedQueen and Steelix, invert branches by directly solving branch conditions based on equality (so-called magic bytes). Sivo can solve more complex branch inversion conditions that involve inequalities, without the use of SAT/SMT solvers. On the other hand, white boxes such as KLEE [4], and hybrid fuzzers such as Driller [29] and QSYM [34], use symbolic execution that relies on SMT solvers (and thus may be slow) to perform inversions of even more complex branches.

The AFL family of fuzzers, as well as many other grey boxes, tracks edge coverage. In addition, the AFL family uses bucketization, i.e., besides edges, they track the counts of edges and group them in buckets whose ranges are powers of two. For practical purposes AFL does not record the precise edges (this would require storing whole execution traces, which may be slow), but rather works with hashes of edges (which is quite fast). The process of hashing may introduce collisions, as noted by CollAFL [12]. To avoid such collisions, CollAFL proposes to carefully select, during instrumentation, the free parameters of the hashing functions. On the other hand, Sivo's solution is to switch between different hashing functions during the fuzzing. Instead of tracking edge coverage, a few fuzzers such as Honggfuzz [30], VUzzer [24] and LibFuzzer [26] track block coverage.
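To illustrate the group-testing idea behind TaintFAST with a toy example: instead of mutating the n input bytes one at a time, each execution mutates all bytes whose index has a given bit set, so about log2(n) executions locate a dependent byte. This sketch assumes the branch depends on a single byte and ignores noise; the actual TaintFAST handles the general case.

```python
def infer_dependent_byte(run_branch, inp):
    """Find the index of the single input byte that the branch outcome
    depends on, using ~log2(len(inp)) executions (group-testing sketch).
    run_branch maps an input to the branch outcome (True/False)."""
    baseline = run_branch(inp)
    index = 0
    for bit in range(max(1, len(inp).bit_length())):
        # One execution: flip every byte whose index has this bit set.
        mutated = bytes(c ^ 0xFF if (i >> bit) & 1 else c
                        for i, c in enumerate(inp))
        if run_branch(mutated) != baseline:
            index |= 1 << bit
    return index
```

For a 16-byte input this needs 6 executions (one baseline plus five group tests) instead of 16 single-byte mutations, and the gap grows exponentially with the input size.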
We have presented four refinements for grey boxes that boost different fuzzing stages. First, we have shown a fast taint engine that requires only a logarithmic number of tests in the number of input bytes to infer the dependencies of branches on inputs. Second, we have provided an efficient method for inverting branches when they depend trivially on input bytes and their conditions are based on integer inequalities. Third, we have proposed improved coverage tracking methods that are easy to implement. Finally, we have shown the parametrize-optimize paradigm that allows fuzzers to be more flexible in adapting to target programs and thus to effectively increase coverage. We have implemented the refinements in a fuzzer called Sivo. In comparison to 11 other popular grey-box fuzzers, Sivo scores highest with regard to coverage and number of found vulnerabilities.
REFERENCES
[1] Cornelius Aschermann, Sergej Schumilo, Tim Blazytko, Robert Gawlik, and Thorsten Holz. 2019. REDQUEEN: Fuzzing with Input-to-State Correspondence. In NDSS, Vol. 19. 1-15.
[2] Marcel Böhme, Van-Thuan Pham, Manh-Dung Nguyen, and Abhik Roychoudhury. 2017. Directed greybox fuzzing. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 2329-2344.
[3] Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. 2017. Coverage-based greybox fuzzing as Markov chain. IEEE Transactions on Software Engineering 45, 5 (2017), 489-506.
[4] Cristian Cadar, Daniel Dunbar, Dawson R. Engler, et al. 2008. KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI, Vol. 8. 209-224.
[5] Peng Chen and Hao Chen. 2018. Angora: Efficient fuzzing by principled search. IEEE, 711-725.
[6] Peng Chen, Jianzhong Liu, and Hao Chen. 2019. Matryoshka: fuzzing deeply nested branches. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security.
[7] Jaeseung Choi, Joonun Jang, Choongwoo Han, and Sang Kil Cha. 2019. Grey-box concolic testing on binary code. In ICSE.
[8] Jaeseung Choi, Joonun Jang, Choongwoo Han, and Sang Kil Cha. 2019. Grey-box concolic testing on binary code. IEEE, 736-747.
[9] Brendan Dolan-Gavitt, Patrick Hulin, Engin Kirda, Tim Leek, Andrea Mambretti, Wil Robertson, Frederick Ulrich, and Ryan Whelan. 2016. LAVA: Large-scale automated vulnerability addition. In S&P.
[10] Dingzhu Du, Frank K. Hwang, and Frank Hwang. 2000. Combinatorial group testing and its applications. Vol. 12. World Scientific.
[11] Shuitao Gan, Chao Zhang, Peng Chen, Bodong Zhao, Xiaojun Qin, Dong Wu, and Zuoning Chen. 2020. GREYONE: Data Flow Sensitive Fuzzing.
[12] Shuitao Gan, Chao Zhang, Xiaojun Qin, Xuwen Tu, Kang Li, Zhongyu Pei, and Zuoning Chen. 2018. CollAFL: Path sensitive fuzzing. IEEE, 679-696.
[13] Vijay Ganesh, Tim Leek, and Martin Rinard. 2009. Taint-based directed whitebox fuzzing. IEEE, 474-484.
[14] Google. 2020. OSS-Fuzz - continuous fuzzing of open source software. (2020). https://github.com/google/oss-fuzz
[15] Laura Inozemtseva and Reid Holmes. 2014. Coverage is not strongly correlated with test suite effectiveness. In Proceedings of the 36th International Conference on Software Engineering. 435-445.
[16] George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating fuzz testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. 2123-2138.
[17] Levente Kocsis and Csaba Szepesvári. 2006. Discounted UCB. Vol. 2.
[18] lafintel. 2016. Circumventing fuzzing roadblocks with compiler transformations. (2016). https://lafintel.wordpress.com/
[19] Caroline Lemieux and Koushik Sen. 2018. FairFuzz: A targeted mutation strategy for increasing greybox fuzz testing coverage. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 475-485.
[20] Yuekang Li, Bihuan Chen, Mahinthan Chandramohan, Shang-Wei Lin, Yang Liu, and Alwen Tiu. 2017. Steelix: program-state based binary fuzzing. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 627-637.
[21] Chenyang Lyu, Shouling Ji, Chao Zhang, Yuwei Li, Wei-Han Lee, Yu Song, and Raheem Beyah. 2019. MOPT: Optimized mutation scheduling for fuzzers. In USENIX Security Symposium (USENIX Security 19). 1949-1966.
[22] Valentin J.M. Manès, Soomin Kim, and Sang Kil Cha. 2020. Ankou: guiding grey-box fuzzing towards combinatorial difference. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 1024-1036.
[23] Nicholas Nethercote and Julian Seward. 2007. Valgrind: a framework for heavyweight dynamic binary instrumentation. In PLDI.
[24] Sanjay Rawat, Vivek Jain, Ashish Kumar, Lucian Cojocar, Cristiano Giuffrida, and Herbert Bos. 2017. VUzzer: Application-aware Evolutionary Fuzzing. In NDSS, Vol. 17. 1-14.
[25] Andrey Ryabinin. 2014. UBSan: run-time undefined behavior sanity checker. (2014). https://lwn.net/Articles/617364/
[26] Kosta Serebryany. 2016. Continuous fuzzing with libFuzzer and AddressSanitizer. IEEE, 157-157.
[27] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov. 2012. AddressSanitizer: A fast address sanity checker. In USENIX ATC.
[28] Dongdong She, Rahul Krishna, Lu Yan, Suman Jana, and Baishakhi Ray. 2020. MTFuzz: Fuzzing with a Multi-Task Neural Network. In FSE.
[29] Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. 2016. Driller: Augmenting Fuzzing Through Selective Symbolic Execution. In NDSS, Vol. 16. 1-16.
[30] Robert Swiecki. 2017. Honggfuzz: A general-purpose, easy-to-use fuzzer with interesting analysis options. URL: https://github.com/google/honggfuzz (visited on 06/21/2017) (2017).
[31] Robert Swiecki. 2020. Honggfuzz: Security oriented software fuzzer. Supports evolutionary, feedback-driven fuzzing based on code coverage (SW and HW based). (2020). https://honggfuzz.dev/
[32] Wei You, Xueqiang Wang, Shiqing Ma, Jianjun Huang, Xiangyu Zhang, XiaoFeng Wang, and Bin Liang. 2019. ProFuzzer: On-the-fly input type probing for better zero-day vulnerability discovery. IEEE, 769-786.
[33] Tai Yue, Pengfei Wang, Yong Tang, Enze Wang, Bo Yu, Kai Lu, and Xu Zhou. 2020. EcoFuzz: Adaptive Energy-Saving Greybox Fuzzing as a Variant of the Adversarial Multi-Armed Bandit. In USENIX Security Symposium (USENIX Security 20).
[34] Insu Yun, Sangho Lee, Meng Xu, Yeongjin Jang, and Taesoo Kim. 2018. QSYM: A practical concolic execution engine tailored for hybrid fuzzing. In USENIX Security Symposium (USENIX Security 18). 745-761.
[35] Michal Zalewski. 2019. American fuzzy lop (2.52b). (2019). https://lcamtuf.coredump.cx/afl/
A ANALYSIS OF MULTI-LABEL BLOCK ASSIGNMENT
When each basic block B_i is assigned an n-bit random label L_i, by the birthday paradox, collisions on hashed edges (L_i >> 1) XOR L_j will appear once the number of edges 2^t reaches around 2^(n/2). Roughly, the expected number of collisions is around 2^t * 2^t / 2^n = 2^(2t-n). If we assign an additional (second) label to each basic block, then there will be roughly 2^(2t-n) colliding edge hashes on the second label as well. However, among the colliding hash edges on the first and the second label, there will be only around 2^(2t-n) * 2^(t-n) = 2^(3t-2n) common edges. Thus, if 3t - 2n < 0, i.e., t < 2n/3, then on average there will not be any common edges, and hence for at least one of the assignments (either the first or the second), each edge will have a unique hash. Similar analysis applies to a larger number of labels. In general, if each block is assigned k labels of n random bits each, and the number of edges is smaller than 2^(kn/(k+1)), then on average each hashed edge will be unique for at least one of the labels.

We can also obtain a strict analysis of the collision-free multi-label assignment. To do so, we compute the probability that each additional hashed edge does not collide with any previous hashed edge on at least one of the labels. If there are already i such hashed edges, then the probability that the (i+1)-st edge does not collide on at least one label is 1 - (i/2^n)^k. This is computed from the opposite event (that the (i+1)-st edge collides on all k labels with some of the i hashes). Thus the probability that all 2^t edges will be unique on at least one of the k labels is:

    prod_{i=1}^{2^t - 1} ( 1 - (i / 2^n)^k )    (1)

We could not reduce formula (1) further (unlike the case of k = 1), so we computed its values for various n, t, k. In the case of Sivo, n = ..., k = 4, and therefore, even when there are 2^t = ... edges, the probability remains high.

Table 4: Calculated values of the probability (1).

B PSEUDO-CODE OF A GENERIC GREY-BOX FUZZER
Algorithm 3: OneIterationGenericFuzz(Seeds, Coverage)

    seed ← Sample(ConstantCriterion, Seeds)
    Taint_inference(seed)                      // if dataflow fuzzer
    while time budget left do
        use_mut ← Sample(strategy)
        new_seed ← Mutate(seed, use_mut, ConstantParams)
        new_coverage ← ProduceCoverage(new_seed)
        cov_increase ← ‖new_coverage \ Coverage‖
        if cov_increase > 0 then
            Seeds ← Seeds ∪ new_seed
            Coverage ← Coverage ∪ new_coverage

Algorithm 4: GenericFuzzer

    Seeds ← Initial_seeds
    Coverage ← ProduceCoverage(Seeds)
    while true do
        OneIterationGenericFuzz(Seeds, Coverage)
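The generic loop of Algorithms 3 and 4 can be rendered as a runnable sketch. Everything below (the byte-flip and byte-insert mutators, the toy coverage callback, the fixed iteration budget) is a stand-in for the components the generic fuzzer parameterizes, not Sivo's implementation:

```python
import random

def flip_byte(data):
    # One illustrative mutation strategy: invert a random byte.
    i = random.randrange(len(data))
    return data[:i] + bytes([data[i] ^ 0xFF]) + data[i + 1:]

def insert_byte(data):
    # Another illustrative mutation strategy: insert a random byte.
    i = random.randrange(len(data) + 1)
    return data[:i] + bytes([random.randrange(256)]) + data[i:]

def one_iteration(seeds, coverage, produce_coverage, budget=100):
    """One round of the generic grey-box loop (Algorithm 3)."""
    seed = random.choice(seeds)                    # Sample(ConstantCriterion, Seeds)
    for _ in range(budget):                        # while time budget left
        mutate = random.choice([flip_byte, insert_byte])   # Sample(strategy)
        new_seed = mutate(seed)
        new_cov = produce_coverage(new_seed)
        if new_cov - coverage:                     # cov_increase > 0
            seeds.append(new_seed)                 # Seeds ∪ new_seed
            coverage |= new_cov                    # Coverage ∪ new_coverage

def generic_fuzzer(initial_seeds, produce_coverage, rounds=10):
    """Top-level driver (Algorithm 4), bounded here instead of 'while true'."""
    seeds = list(initial_seeds)
    coverage = set()
    for s in seeds:
        coverage |= produce_coverage(s)
    for _ in range(rounds):
        one_iteration(seeds, coverage, produce_coverage)
    return seeds, coverage
```

A toy coverage callback, e.g. `lambda data: set(data)` (treating each distinct byte value as a "covered edge"), is enough to watch the seed corpus grow as new coverage is found.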
C INSTRUMENTATION OVERHEAD
Each of the 27 evaluated programs is instrumented and compiled with two different passes: a lighter and a heavier one. We measured experimentally the usage and the overhead of the two instrumentations, averaged over all 27 tested programs, and provide the data in Table 5. We give the percentage of usage of the two instrumentations, as well as their overhead on top of the normal, uninstrumented program and on top of the standard AFL code-coverage instrumentation, measured according to the execution times of the compiled programs. The lighter instrumentation is 73% slower than the uninstrumented program and 39% slower than the AFL instrumentation, whereas the heavier one is 270% and 190% slower, respectively. On the other hand, the versions with lighter instrumentation are executed far more frequently (78% vs. 22%), so on average Sivo introduces 116% overhead on top of the uninstrumented programs, and 72% on top of AFL.
Table 5: Instrumentation Statistics
Type      Usage   Overhead vs. uninstr.   Overhead vs. AFL
Lighter   78%     73%                     39%
Heavier   22%     270%                    190%
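The average figures follow from weighting each instrumentation's overhead by how often it is used; a quick check of the arithmetic from Table 5:

```python
# Usage split and per-pass overheads (percent slower) from Table 5.
usage        = {"lighter": 0.78, "heavier": 0.22}
over_uninstr = {"lighter": 73,   "heavier": 270}   # vs. uninstrumented
over_afl     = {"lighter": 39,   "heavier": 190}   # vs. AFL instrumentation

# Weighted averages over the two passes.
avg_uninstr = sum(usage[t] * over_uninstr[t] for t in usage)
avg_afl     = sum(usage[t] * over_afl[t] for t in usage)
print(f"{avg_uninstr:.0f}% over uninstrumented, {avg_afl:.0f}% over AFL")
# prints "116% over uninstrumented, 72% over AFL"
```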
D ALTERNATIVE COVERAGE
In Figure 6 we show the simple edge count for all fuzzers on 12-hour runs. Sivo is the top fuzzer for 21 of the programs in the 12-hour runs. In comparison to Figure 4, Sivo loses 4 top spots, 2 of which are by a margin of less than 1%.

Figure 6: Edge count for all fuzzers during 12 hours of fuzzing.

In Figure 7 we show the coverage measured imprecisely, as in AFL, for all fuzzers on 12-hour runs. Sivo is the top fuzzer for 17 out of 24 programs (we had problems compiling 3 of the programs). In comparison to Figure 4, Sivo loses 6 top spots. This is a result of colliding edge hashes and the inability of the AFL coverage engine to detect and handle such collisions.

Figure 7: AFL coverage for all fuzzers during 12 hours of fuzzing.
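The imprecision of AFL-style coverage stems from edge-hash collisions in a fixed-size map. A small simulation of AFL-style edge hashing (random block labels and a 64 KB map, both illustrative parameters rather than any particular fuzzer's configuration) shows distinct edges sharing map indices well before the map fills up:

```python
import random

MAP_SIZE = 1 << 16   # 64 KB coverage map, as in AFL's default

def edge_index(prev_label, cur_label):
    # AFL-style edge hash: shifted previous label XORed with the
    # current one, truncated to the map size.
    return ((prev_label >> 1) ^ cur_label) % MAP_SIZE

random.seed(0)   # fixed seed for reproducibility
labels = [random.randrange(MAP_SIZE) for _ in range(2000)]
edges = {(random.choice(labels), random.choice(labels)) for _ in range(2000)}
indices = [edge_index(a, b) for a, b in edges]

# Distinct edges that landed on an already-occupied map index are
# indistinguishable to an AFL-style coverage engine.
collisions = len(indices) - len(set(indices))
print(f"{collisions} of {len(edges)} distinct edges share a map index")
```

By the birthday bound, roughly m^2 / (2 · 2^16) of m edges collide, so even a modest program produces coverage the engine silently conflates.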
E TESTED PROGRAMS
Program        Version              Input line (as in AFL)
base64         LAVA-M               -d @@
bison          bison 3.0.5          @@
bsdtar         libarchive 3.4.3     -acf bsdtar.tar @@
bson_to_json   libbson 1.8          @@
cflow          cflow 1.5            @@
djpeg          libjpeg 2.0.90       -colors 234 -rgb -gif -outfile djp.gif @@
exiv2          exiv2 0.27.3         -pt @@
fig2dev        fig2dev 3.2.7a       @@
ftpconf        libconfuse 3.2.2     @@
img2sixel      libsixel 1.8.2       @@
img2txt        libcaca 0.99beta19   @@
md5sum         LAVA-M               -c @@
nasm           nasm 2.14rc15        -f elf64 @@ -o nasm.o
nm             binutils 2.31        -DC @@
readelf        binutils 2.31        -a @@
sassc          libsass 3.5          @@
slaxproc       libslax 0.22.0       -c @@
sndfile-info   libsndfile 1.0.28    @@
tcpdump        tcpdump 4.10.0rc1    -AnetttttvvvXXr @@
testsolv       libsolv 0.7.2        @@
tic            ncurses 6.1          -o tic.out @@
tiff2pdf       libtiff 4.0.9        @@
tiffset        libtiff 4.0.9        -s 315 whatever @@
uniq           LAVA-M               @@
webm2pes       libwebm 1.0.0.27     @@ webm2pes.out
who            LAVA-M               @@
wpd2html       libwpd 0.10.1        @@