FLACK: Counterexample-Guided Fault Localization for Alloy Models
Guolong Zheng, ThanhVu Nguyen, Simón Gutiérrez Brida, Germán Regis, Marcelo F. Frias, Nazareno Aguirre, Hamid Bagheri
∗ University of Nebraska-Lincoln, {gzheng, tnguyen}@cse.unl.edu, [email protected]
† University of Río Cuarto and CONICET, {sgutierrez, gregis, naguirre}@dc.exa.unrc.edu.ar
‡ Dept. of Software Engineering, Instituto Tecnológico de Buenos Aires, [email protected]
Abstract—Fault localization is a practical research topic that helps developers identify code locations that might cause bugs in a program. Most existing fault localization techniques are designed for imperative programs (e.g., C and Java) and rely on analyzing correct and incorrect executions of the program to identify suspicious statements. In this work, we introduce a fault localization approach for models written in a declarative language, where the models are not "executed," but rather converted into a logical formula and solved using backend constraint solvers. We present FLACK, a tool that takes as input an Alloy model consisting of some violated assertion and returns a ranked list of suspicious expressions contributing to the assertion violation. The key idea is to analyze the differences between counterexamples, i.e., instances of the model that do not satisfy the assertion, and instances that do satisfy the assertion, to find suspicious expressions in the input model. The experimental results show that FLACK is efficient (it can handle complex, real-world Alloy models with thousands of lines of code within 5 seconds), accurate (it can consistently rank buggy expressions in the top 1.9% of the suspicious list), and useful (it can often narrow down the error to the exact location within the suspicious expressions).
I. INTRODUCTION
Declarative specification languages and the corresponding formally precise analysis engines have long been utilized to solve various software engineering problems. The Alloy specification language [1] relies on first-order relational logic and has been used in a wide range of applications, such as program verification [2], test case generation [3], [4], software design [5], [6], [7], network security [8], [9], [10], security analysis of emerging platforms such as IoT [11] and Android [12], [13], and design tradeoff analysis [14], [15], to name a few. Cunha and Macedo, among others, use a recent extension of Alloy, called Electrum [16], to validate the European Rail Traffic Management System, a system of standards for management and inter-operation of signaling for the European railways [17]. Kim [18] proposes a Secure Swarm Toolkit (SST), a platform for building an authorization service infrastructure for IoT systems, and uses Alloy to show that SST provides the necessary security guarantees.

Similar to developing programs in an imperative language such as C or Java, developers can make subtle mistakes when using Alloy to model system specifications, especially specifications that capture complex systems with non-trivial behaviors, which renders debugging even more arduous. These challenges call for debugging assistance mechanisms, such as fault localization techniques, that support declarative specification languages.

However, there is a dearth of fault localization techniques developed for Alloy. AlloyFL [19] is perhaps the only fault localization tool available for Alloy as of today. The key idea of AlloyFL is to use "unit tests," where a test is a predicate that describes an Alloy instance to encode expected behaviors, to compute suspicious expressions in an Alloy model that fails these tests. To compute the suspicious expressions, AlloyFL uses mutation testing [20], [21] and statistical debugging techniques [22], [23], [24]; i.e., it mutates expressions, collects statistics on how each mutation affects the tests, and then uses this information to assign suspicion scores to expressions.

While AlloyFL pioneered fault localization in the Alloy context and its results are promising, it relies on the availability of AUnit tests [25], i.e., predicates representing Alloy instances, which are not common in the Alloy setting. Indeed, instead of writing test cases, Alloy users write assertions to describe the desired property and let the Alloy Analyzer search for potential counterexamples (cex's) that violate the property. Moreover, it is unclear how many test cases are needed or how good they must be for AlloyFL to be effective (e.g., in the AlloyFL evaluation [19], the number of tests ranges from 30 to 120).

To address this state of affairs and improve the quality of Alloy development, we present an automated approach and an accompanying tool-suite for fault localization in Alloy models using counterexamples, called FLACK. Given an Alloy model and a property that is not satisfied by the model, FLACK first queries the underlying Alloy Analyzer for a counterexample, an instance of the model that does not satisfy the property. Next, FLACK uses a partial max sat (PMAXSAT) solver to find an instance that does satisfy the property and is as close as possible to the counterexample. FLACK then determines the relations and atoms that differ between the cex and the sat instance. Finally, FLACK analyzes these differences to compute suspicion scores for expressions in the original model.

Unlike AlloyFL, which relies on unconventional unit tests, FLACK uses well-established and widely-used assertions, naturally compatible with development practices in Alloy. Also, instead of using mutation testing or statically analyzing the effects of tests, FLACK relies on counterexamples and satisfying instances generated by constraint solvers, which are the main underlying technology in Alloy.

We evaluated FLACK on a benchmark consisting of a suite of buggy models from AlloyFL [19]. The experimental results corroborate that FLACK is able to consistently rank buggy expressions in the top 1.9% of the suspicious list. We also evaluated FLACK on three case studies consisting of larger Alloy models used in real-world settings (e.g., Alloy models for surgical robots, Java programs, and Android permissions), and FLACK was able to identify the buggy expressions within the top 1%. The run time of FLACK for most of the models is under 5 seconds (under 1 second for the AlloyFL benchmarks). The experimental results corroborate that FLACK has the potential to significantly facilitate the non-trivial task of formal specification development and exposes opportunities for researchers to develop new debugging techniques for Alloy.

To summarize, this paper makes the following contributions:

• Fault localization approach for declarative models: We present a novel fault localization approach for declarative models specified in the Alloy language. The insight underlying our approach is that expressions in an Alloy model that likely cause an assertion violation can be identified by analyzing the counterexamples and closely related satisfying instances.

• Tool implementation: We develop a fully automated technology, dubbed FLACK, that effectively realizes our fault localization approach. We make FLACK publicly available to the research and education community [26].

• Empirical evaluation: We evaluate FLACK in the context of faulty Alloy specifications found in prior work and specifications derived from real-world systems, corroborating FLACK's ability to consistently rank buggy expressions high on the suspicious list and to analyze complex, real-world Alloy models with thousands of lines of code.

The rest of the paper is organized as follows. Section II motivates our research through an illustrative example. Section III describes the details of our fault localization approach for Alloy models. Section IV presents the implementation and evaluation of the research. The paper concludes with an outline of the related research and future work.

II. ILLUSTRATION
To motivate the research and illustrate our approach, we provide an Alloy specification of a finite state machine (FSM), adapted from the AlloyFL benchmarks [19] and shown in Figure 1. The specification defines two type signatures, State and FSM, along with their fields (lines 1–5). The specification contains three fact paragraphs expressing the constraints detailed below: If a start (or a stop) state exists, there is only one of them (fact OneStartAndStop). The start state is not a subset of the stop state; no transition terminates at the start state; and no transition leaves a stop state (fact ValidStartAndStop). Finally, every state is reachable from the start state, and the stop state is reachable from any state (fact Reachability).

 1  sig FSM {
 2    start: set State,
 3    stop: set State
 4  }
 5  sig State { transition: set State }
 6  fact OneStartAndStop {
 7    // If a start state exists, there is only one of them
 8    all start1, start2 : FSM.start | start1 = start2
 9    // If a stop state exists, there is only one of them
10    all stop1, stop2 : FSM.stop | stop1 = stop2
11    some FSM.stop
12  }
13  fact ValidStartAndStop {
14    // The start state is not a subset of the stop state.
15    FSM.start !in FSM.stop
16    // No transition ends at the start state.
17    all s : State | FSM.start !in s.transition
18    // Error: should be "<=>" instead of "=>".
19    all s: State | s.transition = none => s in FSM.stop
20  }
21  fact Reachability {
22    // All states are reachable from the start state.
23    State = FSM.start.*transition
24    // The stop state is reachable from any state.
25    all s: State | FSM.stop in s.*transition
26  }
27  assert NoStopTransition { no FSM.stop.transition }
28  check NoStopTransition for 5

Fig. 1: Buggy FSM model, adapted from AlloyFL [19].

Each assertion specifies a property that is expected to hold in all instances of the model. For example, we use the assertion NoStopTransition to check that a stop state behaves as a sink. The Alloy Analyzer disproves this assertion by producing a counterexample, shown in Figure 2a, in which the stop state labeled State3 transitions to State1.

Thus, there is a "bug" in the model causing the assertion violation. Indeed, careful analysis of the model and the generated cex reveals that the problem is in the expression on line 19: instead of stating that a stop state does not have any transition to any state, the expression states that any state not having a transition to anywhere is a stop state, a subtle logical error that is difficult to notice.

The goal of FLACK is to identify such buggy expressions automatically. For this example, within a second, FLACK identifies four suspicious expressions, with the one on line 19 ranked first. Table I shows the results: expressions or nodes with higher scores are ranked higher. Moreover, FLACK suggests that the operator => is likely the issue in the expression. (There are two potential fixes: (i) reverse the implication to s in FSM.stop => s.transition = none, or (ii) replace the implication operator => with logical equivalence <=>, which technically would strengthen the intended requirement.)

TABLE I: FLACK's results obtained for the model in Figure 1.

Suspicious Expression                        Line   Score
s.transition = none => s in FSM.stop (=>)     19    1.58
s.transition = none                           19    1.25
FSM.stop in s.*transition                     25    0.5
s in FSM.stop                                 19    0.5

Fig. 2: Instance pair: (a) counterexample; (b) (PMAX) sat instance. Note the similarity between the instances.

Such a level of granularity can significantly help the developer understand and fix the problem. The results in Section IV show that
FLACK can consistently rank the exact buggy expression within the top 5 suspicious ones and do so in under a second.

The key idea underlying our fault localization approach is to analyze the differences between counterexamples (instances of the model that do not satisfy the assertion) and instances that do satisfy the assertion to find suspicious expressions in the input model. FLACK first checks the assertion NoStopTransition in the model using the Alloy Analyzer, which returns the cex in Figure 2a. Next, FLACK generates a satisfying (sat) instance that is as minimal and as similar to the cex as possible. Their differences promise effective localization of the issue.

a) Generating SAT instances:
To obtain an instance similar to the cex, FLACK transforms the input model into a logical formula representing hard constraints and the information from the cex into a formula representing soft constraints. Essentially, FLACK converts the instance-finding problem into a Partial-Max SAT (PMAXSAT) problem [27] and then uses the Pardinus [28] solver to find a solution that satisfies all the hard constraints and as many soft constraints as possible. Thus, the result is an instance of the model that is similar to the cex but satisfies the assertion. Figure 2b shows an instance produced by Pardinus given the cex shown in Figure 2a. Notice that this instance is similar to the given cex, except for the edge from State3 to State1, which represents the main difference between the two instances.

b) Finding Suspicious Expressions:
FLACK analyzes the differences between cex's and sat instances (e.g., here the transition from State3 to State1, which appears only in the cex and not in the sat instance) to identify Alloy relations causing the issue. As shown in Table II, which gives the Alloy text representation of the cex in Figure 2a, the transition relation involves the tuple State3->State1 and the stop relation involves State3. Thus, FLACK hypothesizes that the two relations transition and stop may cause the difference between the two instances. Note that while we present one cex and one sat instance in this example for the sake of simplicity, FLACK supports analyzing multiple pairs of cex's and sat instances in tandem.

TABLE II: Text representation of the cex in Figure 2a.

Relation     Tuples
FSM          FSM0
start        FSM0->State0
stop         FSM0->State3
State        State0, State1, State2, State3
transition   State0->State1, State0->State3, State1->State2,
             State2->State3, State3->State1
Next, FLACK slices the input model to contain only the expressions affecting both relations transition and FSM.stop. This results in two expressions: all s: State | FSM.stop in s.*transition (line 25) and all s: State | s.transition = none => s in FSM.stop (line 19).

At this point, FLACK could stop and return these two expressions, one of which is the buggy expression on line 19. Indeed, this level of "statement" granularity is often used in fault localization techniques such as Tarantula [29] or Ochiai [22]. However, FLACK aims to achieve a finer-grained granularity level by also considering the boolean and relational subexpressions, detailed below.

c) Ranking Boolean Nodes:
The expressions on lines 25 and 19 have four boolean nodes: (a) FSM.stop in s.*transition, (b) s.transition = none, (c) s in FSM.stop, and (d) s.transition = none => s in FSM.stop. FLACK instantiates each of these with State1 and State3, the values that differentiate the cex and the sat instance. For example, (a) becomes FSM.stop in State1.*transition and FSM.stop in State3.*transition. Next, FLACK evaluates these instantiations using the cex and the sat instance and assigns a higher suspicion score to those with inconsistent evaluation results. For example, the instantiations FSM.stop in State1.*transition and FSM.stop in State3.*transition of node (a) evaluate to true in both the cex and the sat instance, so we give (a) the score 0 (i.e., no changes). We assign score 1 to (b) because State3.transition = none evaluates to false in the cex but true in the sat instance (thus 1 change), and State1.transition = none evaluates to false in both (no change).

Overall, FLACK obtains the scores 0, 1, 0, 1 for nodes (a), (b), (c), (d), respectively. Thus, FLACK determines that nodes (b) s.transition = none and (d) s.transition = none => s in FSM.stop are the two most suspicious boolean subexpressions within the expression on line 19.

d) Ranking Relational Nodes:
While subexpression (d) indeed contains the error, it receives the same score as subexpression (b). To achieve more accurate results, FLACK further analyzes the involved relations. FLACK instantiates these relations with State1 and State3, evaluates them in the context of the cex and sat instances, and assigns scores based on the evaluations. For example, node (d) s.transition = none => s in FSM.stop contains 3 relations: (1) s.transition, (2) s, and (3) FSM.stop. Instantiating these relations with State3 and evaluating them using the cex yields the following: (1) becomes {State1}, (2) {State3}, and (3) {State3}. Thus, for the cex, (d) involves both State1 and State3, and FLACK gives it a score of 1. Next, it evaluates the instantiations using the sat instance: (1) becomes {}, (2) {State3}, and (3) {State3}. Here, (d) does not involve State1 and thus has a score of 0. FLACK assigns (d) the average score of 0.5 (for the instantiations of State3). Performing a similar computation for the instantiation of State1, we obtain a score of 2/3 for (d), as the evaluation for the cex and sat instances involves both State1 and State3 (differentiated values) and State2 (a regular value). Thus, (d) has a score of 0.58 as the average of 0.5 and 2/3.

Overall, FLACK obtains the scores 0.5, 0.25, 0.5, 0.58 for (a), (b), (c), and (d), respectively. Note that node (d) is now ranked higher than (a), as desired.

e) Suspicious Scores:
FLACK computes the final suspicion score of a node as the sum of the boolean and relational scores of that node, as shown above. For example, node (d) s.transition = none => s in FSM.stop in the expression on line 19 has the highest suspicion score of 1.58. Table I shows the suspicion scores of the expressions in the ranked list returned by FLACK.

In addition, FLACK analyzes (non-atomic) nodes containing (boolean) connectors and reports connectors that join subnodes with different scores. For example, FLACK suggests that the operator => is likely responsible for the error in (d) because the two subexpressions s.transition = none and s in FSM.stop have different scores, as shown in Table I. Indeed, in this example, the assertion violation is entirely due to this operator (a potential fix would be strengthening the model to <=> or switching the two subexpressions, as Alloy does not have the operator <=).

III. APPROACH
Figure 3 gives an overview of FLACK, which takes as input an Alloy model with some violated assertion and returns a ranked list of suspicious expressions contributing to the assertion violation. The insight guiding our research is that the differences between counterexamples that do not satisfy the assertion and closely related satisfying instances can drive the localization of suspicious expressions in the input model. To achieve this, FLACK uses the Alloy Analyzer to find counterexamples showing the violation of the assertion. It then uses a PMAX-SAT solver to find satisfying instances that are as close as possible to the cex's. Next, FLACK analyzes the differences between the cex's and satisfying instances to find expressions in the model that likely cause the errors. Finally, FLACK computes and returns a ranked list of suspicious expressions.

(While the example in Section II has only two expressions with similar scores, more complex, real-world models yield many expressions with similar scores. Thus, the fine-grained ranking step is crucial to distinguish the buggy expressions from the rest.)

Fig. 3: Overview of FLACK.
A. The Alloy Analyzer
An Alloy specification or model consists of three components: (i) type signatures (sig) define essential data types, and their fields capture relationships among those data types; (ii) facts (fact), predicates (pred), and assertions (assert) are formulae defining constraints over the data types; and (iii) run and check are commands to invoke the Alloy Analyzer. The check command is used to find counterexamples violating some asserted property, and run finds satisfying model instances (sat instances). For a model M and a property p, a cex is an instance of M that satisfies M ∧ ¬p, and a sat instance is one that satisfies M ∧ p. The specification in Figure 1 defines two signatures (FSM, State), three fields (start, stop, transition), three facts (OneStartAndStop, ValidStartAndStop, Reachability), and one assertion (NoStopTransition).

Analysis of specifications written in Alloy is entirely automated, yet bounded up to user-specified scopes on the sizes of type signatures. More precisely, to check that p is satisfied by all instances of M (i.e., p is valid) up to a certain scope, the Alloy developer encodes p as an assertion and uses the check command to validate the assertion, i.e., to show that no cex exists within the specified scope (a cex is an instance I such that I ⊨ M ∧ ¬p). To check that p is satisfied by some instance of M, the Alloy developer encodes p as a predicate and uses the run command to analyze the predicate, i.e., to search for a sat instance I such that I ⊨ M ∧ p. In our running example, the check command examines the NoStopTransition assertion and returns the cex in Figure 2a.

Internally, Alloy converts these instance-finding tasks into boolean formulae and uses a SAT solver to check their satisfiability. Each value of each relation is translated to a distinct variable in the boolean formula. For example, given a scope of 5 in the FSM model in Figure 1, the relation State contains 5 values and is translated to 5 distinct variables in the boolean formula, and the transition relation is translated to 25 variables representing the 25 combinations of ‖State‖ × ‖State‖. An instance is an assignment to all variables that makes the formula true. For example, the cex in Figure 2a is an assignment where all variables corresponding to values in Table II are assigned true and all others false. Finally, Alloy translates the result from the SAT solver, i.e., an assignment that makes the boolean formula true, back to an instance of M.

Algorithm 1: FLACK fault localization process
    input : Alloy model M, property p not satisfied by M
    output: Ranked list of suspicious expressions in M
    AlloySolver ← AlloyAnalyzer(M, p)
    pairs ← ∅
    while |pairs| < max_instance_pairs do
        c ← AlloySolver.gencex()
        AlloySolver.blockcex(c)
        s ← PMaxSolver(M, c)
        if s = nil then
            U ← AlloySolver.get_unsatcore()
            return unsat_analyzer(M, U, c)
        pairs ← pairs ∪ {(c, s)}
    end
    diffs ← comparator(pairs)
    return diffs_analyzer(diffs)

B. The FLACK Algorithm
Algorithm 1 shows the algorithm of FLACK, which takes as input an Alloy model M and a property p that is not satisfied by M (as an assertion violation) and returns a ranked list of expressions that likely contribute to the assertion violation. FLACK first uses the Alloy Analyzer and the Pardinus PMAX-SAT solver to generate pairs of cex's and closely similar sat instances. FLACK then analyzes the differences between the cex and sat instances to locate the error. If FLACK cannot generate any sat instance, FLACK inspects the unsat core returned by the Alloy Analyzer to locate the error.
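As a concrete sketch, the driver loop of Algorithm 1 can be written in Python. The callback parameters below (gen_cex, block_cex, pmax_solve, get_unsat_core, comparator) are hypothetical stand-ins for the Alloy Analyzer and Pardinus interfaces, not FLACK's actual API:

```python
# Hypothetical sketch of Algorithm 1's driver loop; all callbacks are
# stand-ins for the Alloy Analyzer / Pardinus interfaces.

MAX_INSTANCE_PAIRS = 3  # illustrative bound on the number of pairs

def flack_loop(gen_cex, block_cex, pmax_solve, get_unsat_core, comparator):
    """Collect (cex, sat) pairs; fall back to the unsat core if needed."""
    pairs = []
    while len(pairs) < MAX_INSTANCE_PAIRS:
        cex = gen_cex()                  # next counterexample, or None
        if cex is None:
            break
        block_cex(cex)                   # do not regenerate this cex
        sat = pmax_solve(cex)            # closest satisfying instance
        if sat is None:                  # property conflicts with the model
            return ("unsat-core", get_unsat_core())
        pairs.append((cex, sat))
    return ("diffs", comparator(pairs))
```

When the PMAX-SAT step fails to find any satisfying instance, the loop returns the unsat core for the UNSAT core analyzer (Section III-B4); otherwise the collected pairs are handed to the comparator.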
1) Generating Instances:
To understand why M does not satisfy p, FLACK obtains the differences between cex's and relevant sat instances. These differences can lead to the cause of the error. One option is to use the Alloy Analyzer to generate a sat instance directly (e.g., by checking a predicate consisting of p). However, such an instance generated by Alloy is often predominantly different from the cex and thus does not help identify the main difference. For example, the cex shown in Figure 2a, which violates the assertion NoStopTransition, is quite different from the two Alloy-generated satisfying instances shown in Figure 4.

Fig. 4: Model instances generated by the Alloy Analyzer.

To generate a sat instance closely similar to the cex, we reduce the problem to a PMAX-SAT (partial maximum satisfiability) problem.

Definition III.1 (Finding a Similar Sat Instance from a Cex). Given a set of hard clauses, collectively specified by model M and property p, and a set of soft clauses, specified by a counterexample cex, find a solution that satisfies all hard clauses and as many soft clauses as possible.

More specifically, the hard clauses are generated from the constraints in the Alloy model, and the soft clauses are the assignment represented by the cex, where all present variables are true and all other variables are false. Because the relations and scope of the Alloy model stay the same, the variables in the transformed boolean formula also remain the same; only the values assigned to them differ between model instances. Thus, this encoding applies to general instances regardless of their structure.

FLACK then uses an existing PMAX-SAT solver (Pardinus) to find an instance that has the property p and is as similar to the cex as possible. For example, the instance in Figure 2b generated by Pardinus is similar to the cex in Figure 2a, except for the edge from State3 to State1. The idea is that such (minimal) differences can help FLACK identify the error.
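To make Definition III.1 concrete, the sketch below solves a tiny partial-MaxSAT instance by brute force in Python. Pardinus uses far more sophisticated algorithms, and the two-variable encoding here is invented purely for illustration:

```python
from itertools import product

# Clauses are DIMACS-style lists of non-zero ints: v means variable v is
# true, -v means it is false. Hard clauses encode the model and property;
# soft unit clauses encode the cex's variable assignment.

def partial_max_sat(num_vars, hard, soft):
    """Brute force: satisfy every hard clause, maximize satisfied soft ones."""
    best, best_score = None, -1
    for bits in product([False, True], repeat=num_vars):
        assign = {i + 1: b for i, b in enumerate(bits)}
        holds = lambda lit: assign[abs(lit)] == (lit > 0)
        if not all(any(holds(l) for l in clause) for clause in hard):
            continue                      # a hard clause is violated
        score = sum(1 for clause in soft if any(holds(l) for l in clause))
        if score > best_score:
            best, best_score = assign, score
    return best

# Toy encoding: x1 = "edge State3->State1 exists", x2 = "State3 is a stop
# state". The property "no stop state has an outgoing transition" yields
# the hard clause (not x1 or not x2); the cex sets both variables to true,
# yielding the soft clauses [x1] and [x2]. The returned instance keeps as
# much of the cex as the property allows.
inst = partial_max_sat(2, hard=[[-1, -2]], soft=[[1], [2]])
```

The returned assignment satisfies the hard clause while preserving one of the two soft (cex) clauses, mirroring how Pardinus keeps the sat instance close to the cex.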
2) Comparator:
FLACK compares the generated cex's and sat instances to obtain their differences, which involve atoms, tuples, and relations. First, it obtains the tuples, and their atoms, that differ between the cex and the sat instance; e.g., in Figure 2, the tuple State3->State1, which has the atoms State1 and State3, is in the cex but not in the satisfying instance. Next, it obtains the relations with different tuples between the cex and the sat instance; e.g., the transition relation involves the tuple State3->State1 in the cex but not in the sat instance. Third, it obtains the relations that can be inferred from the tuples and atoms derived in the previous steps; e.g., the relation FSM.stop involves tuples having the State3 atom. In summary, for the pair of cex and sat instance in Figure 2, FLACK obtains the suspicious relations transition and stop and the atoms State1 and State3. FLACK applies these comparison steps to all pairs of cex's and sat instances and uses the common results.
3) Diff Analyzer:
After obtaining the differences, consisting of relations and atoms, between cex's and sat instances, FLACK analyzes them to obtain a ranked list of expressions based on their suspicion levels. FLACK assigns higher suspicion scores to expressions whose evaluations depend on these differences (and lower scores to those that do not depend on them).

Algorithm 2 shows the Diff Analyzer algorithm, which takes as input a model M, the differences diffs obtained in Section III-B2, and the pairs of cex's and sat instances obtained in Section III-B1 (based on our experiments, the first solution returned by the PMAX-SAT solver is similar enough for FLACK to locate bugs), and outputs a ranked list of suspicious expressions in M. It first identifies expressions in M that involve relations in diffs. These expressions are likely related to the difference between the cex and sat instance. For example, consider the model in Figure 1: FLACK identifies the two expressions all s: State | FSM.stop in s.*transition on line 25 and all s: State | s.transition = none => s in FSM.stop on line 19, as they involve the relations transition and stop in diffs.

Algorithm 2: Diff Analyzer
    input : Alloy model M, differences diffs, pairs of cex's and sat instances pairs
    output: Ranked list of suspicious expressions in M
    exprs ← get_susp_exprs(M, diffs)
    results ← {}
    foreach expr ∈ exprs do
        computescore(expr, results)
    end
    return sort(results)

    Function computescore(expr, results):
        score ← 0
        if isleaf(expr) then
            isexpr ← instantiate(expr, diffs)
            foreach (c, s) ∈ pairs do
                cvals ← eval(c, isexpr)
                svals ← eval(s, isexpr)
                instscore ← 0
                if diffs ⊆ cvals then instscore ← instscore + |diffs| / |cvals|
                if diffs ⊆ svals then instscore ← instscore + |diffs| / |svals|
                score ← score + instscore
            end
            score ← score / |pairs|
        else
            if isbool(expr) then
                isexpr ← instantiate(expr, diffs)
                foreach (c, s) ∈ pairs do
                    if eval(c, isexpr) ≠ eval(s, isexpr) then score ← score + 1
                end
            foreach child ∈ getchildren(expr) do
                score ← score + computescore(child, results)
            end
        results ← results ∪ {(expr, score)}
        return score

FLACK then recursively computes the suspicion score for each collected expression e, represented as an AST. If e is a leaf (e.g., a relational expression), FLACK instantiates e with atoms from diffs. FLACK then evaluates the instantiated expression for each pair of cex and sat instance. If the evaluated result for an instantiated expression contains all atoms involved in diffs, FLACK computes the score as "size of diffs / size of evaluated results"; otherwise, the score is 0. For a pair, the score is then the average of the scores for the cex and the sat instance. Finally, the score of e is the average over all pairs. Essentially, a higher suspicion score is assigned to a relational subexpression whose evaluation involves many atoms in diffs.

If e is not a leaf node, e's score is the sum of its boolean and relational scores. If e is a boolean expression (i.e., an expression that returns true or false), we instantiate e with atoms from diffs and evaluate it on each pair of cex and sat instance. If it evaluates to different results on the cex and the sat instance (e.g., one is true and the other is false), FLACK increases e's score by 1. Thus, a higher boolean score is assigned to expressions whose evaluations do not match between pairs of cex and sat instances. Then, e's relational score is calculated as the sum of the scores of e's children. The final score assigned to each expression is the sum of e's boolean score and the relational scores of e's children. In the end, FLACK returns all expressions ranked by their suspicion scores.

To make the idea concrete, consider the expression s.transition = none in Figure 1. For the cex and sat instance pair in Figure 2, diffs contains two atoms, State1 and State3. FLACK first instantiates the expression under analysis with these atoms into two concrete expressions: (1) State1.transition = none and (2) State3.transition = none. The concrete expression (1) evaluates to false in both the cex and the sat instance, while the concrete expression (2) evaluates to false in the cex and true in the sat instance. Thus, the boolean score for the expression under analysis is 1, as the aggregation of the values obtained for the concrete expressions (1) and (2).

FLACK then computes the relational score for the expression under analysis as the sum of the relational scores of its children, s.transition and none, both of which are leaves. To compute the score for s.transition, it is instantiated to State1.transition and State3.transition. State1.transition evaluates to State2 in both the cex and the sat instance; thus, it gets a score of 0. For State3.transition, in the cex it evaluates to State1 and gets a score of 1, computed as the size of the set of differing values {State3, State1} divided by the combined size of the instantiated value {State3} and the evaluated value {State1}. In the sat instance, it evaluates to the empty set and gets a score of 0. Overall, s.transition gets a relational score of 0.25 as the average over its instantiated expressions: State1.transition (0) and State3.transition (0.5). Finally, the overall score of 1.25 is assigned to the expression s.transition = none as the aggregation of its boolean and relational scores.
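The worked example above can be replayed in a few lines of Python. The score formulas below reconstruct the computation described for this specific node and pair; they are illustrative, not FLACK's general implementation:

```python
# Reconstruction of scoring "s.transition = none" on the Figure 2 pair.
# Each instance maps a state to the set of its transition targets.
cex = {"State0": {"State1", "State3"}, "State1": {"State2"},
       "State2": {"State3"}, "State3": {"State1"}}
sat = {"State0": {"State1", "State3"}, "State1": {"State2"},
       "State2": {"State3"}, "State3": set()}
diffs = {"State1", "State3"}   # atoms differentiating cex and sat instance

# Boolean score: +1 per instantiation whose truth value flips between
# the cex and the sat instance.
bool_score = sum((cex[s] == set()) != (sat[s] == set()) for s in diffs)

# Relational score of the child "s.transition": an instance contributes
# |diffs| / |{s} union eval| when the evaluation covers all differing
# atoms, else 0; scores are averaged per pair and across instantiations.
def rel_score(inst, s):
    vals = {s} | inst[s]
    return len(diffs) / len(vals) if diffs <= vals else 0.0

rel = sum((rel_score(cex, s) + rel_score(sat, s)) / 2 for s in diffs) / len(diffs)
total = bool_score + rel
```

This reproduces the scores from the text: a boolean score of 1, a relational score of 0.25 for s.transition, and an overall score of 1.25 for the node.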
4) UNSAT Core Analyzer:
It is possible that we can onlygenerate cex’s, but no sat instances, indicating that someconstraints in the model have conflicts with the property wewant to check. To identify these constraints,
FLACK inspectsthe unsat core returned by the Alloy Analyzer. The unsat coreexplains why a set of constraints cannot be satisfied by givinga minimal subset of conflicting constraints. Those conflictingconstraints can help
FLACK identify suspicious expressions.Algorithm 3 outlines the process underlying our UNSATcore analyzer, which takes as input a model M , an unsat core U , and a cex c showing M does not satisfy a property p , andoutputs a list of expressions in M conflicting with p . Recallthat these values, M , U , and c , are earlier inferred by FLACK as outlined in Algorithm 1.
Algorithm 3: UNSAT Analyzer
  input : Alloy model M, unsat core U, counterexample c
  output: a set of expressions in M
  M′ ← slice(M, U)
  s ← PMaxSolver(M′, c)
  diffs ← comparator({(c, s)})
  exprs ← collect_exprs(U)
  conflicts ← ∅
  foreach expr ∈ exprs do
      foreach diff ∈ diffs do
          if eval(expr, diff, M′) = false then
              conflicts ← conflicts ∪ {expr}
  return conflicts

FLACK starts by producing a sliced model M′, in which all expressions in the unsat core are omitted from the original model M. Removing these conflicting expressions allows us to obtain sat instances from the new model M′ to compare with the cex. FLACK then generates a minimal sat instance from M′ and compares it with the input cex to obtain the differences between the cex and the sat instance, as shown in Section III-B2. Next, FLACK attempts to identify which of the removed expressions really conflict with p by evaluating them on the obtained differences. The idea is that if an expression evaluates to true, then adding that expression back to the model would still allow the sat instance to be generated, i.e., that expression does not conflict with p. Thus, the expressions that evaluate to false are the ones conflicting with p, and they are returned as suspicious expressions. Note that we assign similar scores to these resulting expressions because they all contribute to the unsatisfiability of the original model and the intended property.

For example, if we change line 17 in the model shown in Figure 1 to all s: State | s.transition !in FSM.start, Alloy would find counterexamples such as the one in Figure 2a but fail to generate any sat instances. This is because the modified line forces all states to have some transitions, which conflicts with the constraint requiring no transition for stop states.

From the unsat core, FLACK identifies four expressions in the model: (a) all start1, start2 : FSM.start | start1 = start2, (b) some FSM.stop, (c)
FSM.start !in FSM.stop, and (d) all s: State | s.transition !in FSM.start. After removing these four expressions from the model, FLACK can now generate the same sat instance shown in Figure 2b. As before, the main difference between the cex and the sat instance involves two values: State1 and State3. FLACK then evaluates each expression using these values. Expressions (a), (b), and (c) evaluate to true for both values, while expression (d) evaluates to false for State3. Thus, FLACK correctly identifies (d) as the suspicious expression.

IV. EVALUATION

FLACK is implemented in Java 8 and uses Alloy 4.2. We extend the backend Kodkod solver [30] in Alloy to use the Pardinus solver [28] to obtain sat instances similar to counterexamples. We also modify the AST expression representation in Alloy to collect and assign suspiciousness scores to boolean and relational subexpressions.

Our evaluation addresses the following research questions:
• RQ1: Can FLACK effectively find suspicious expressions?
• RQ2: How does FLACK scale to large, complex models?
• RQ3: How does FLACK compare to AlloyFL?

All experiments described below were performed on a Macbook with a 2.2 GHz i7 CPU and 16 GB of RAM.
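The "sat instance similar to a counterexample" generation delegated to Pardinus above can be viewed as a partial MaxSAT search: the model's constraints are hard clauses, and agreement with the cex's valuation is soft. The brute-force toy below is our own illustration of that idea on boolean variables, not the Pardinus/Kodkod API:

```python
# Toy illustration (not the Pardinus/Kodkod API) of finding a satisfying
# instance as close as possible to a counterexample, phrased as partial
# MaxSAT: hard clauses must hold; soft clauses reward agreeing with the cex.
from itertools import product

def closest_sat_instance(hard_clauses, cex, n_vars):
    """Brute-force partial MaxSAT over boolean assignments."""
    best, best_agree = None, -1
    for bits in product([False, True], repeat=n_vars):
        if not all(clause(bits) for clause in hard_clauses):  # hard: required
            continue
        agree = sum(b == c for b, c in zip(bits, cex))        # soft: match cex
        if agree > best_agree:
            best, best_agree = bits, agree
    return best

# Hard constraint x0 -> x1 (the checked property); the cex (x0=T, x1=F) violates it.
hard = [lambda v: (not v[0]) or v[1]]
print(closest_sat_instance(hard, (True, False), 2))  # → (False, False)
```

A real solver replaces the exponential enumeration with a PMAX-SAT procedure, but the hard/soft split is the same.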
A. RQ1: Effectiveness
To investigate the effectiveness of
FLACK, we use the benchmark models from AlloyFL [19]. Table III shows 152 buggy models collected from 12 Alloy models in AlloyFL. These are real faults collected from Alloy release 4.1, Amalgam [31], and Alloy homework solutions from graduate students. Briefly, these models are addr (address book) and farmer (farmer cross-river puzzle) from Alloy; bempl (bad employer), grade (grade book), and other (access-control specifications) from Amalgam [31]; and arr (array), bst (balanced search tree), ctree (colored tree), cd (class diagram), dll (doubly linked list), fsm (finite state machine), and ssl (sorted singly linked list) from the AlloyFL homework models.

For models with assertions (e.g., from Amalgam [31]), we use those assertions for the experiments. For models that do not have assertions (e.g., homework assignments), we manually create assertions and expected predicates by examining the correct versions or suggested fixes (provided by [19]). Moreover, from the correct models or suggested fixes, we know which expressions contain errors and therefore use them as ground truths to compare against FLACK's results. FLACK deals with models containing multiple violated assertions by analyzing them separately and returning a ranked list for each assertion. For illustration purposes, we simulate this by simply splitting models with separate violations into separate models (e.g., bst2 contains two assertion violations and is thus split into two models, bst2 and bst2_1). Finally, FLACK is highly automatic and has just one user-configurable option: the number of pairs of cex and satisfying instances (which by default is set to 5 based on our experience).

a) Results:
Table III shows FLACK's results. For each model, we list the name, the lines of code, the number of nodes that FLACK determined irrelevant and sliced out, and the number of total AST expression nodes. The last two columns show FLACK's resulting ranking of the correct node and its total run time in seconds. The 28 italicized models contain predicate violations, while the other 124 models contain assertion violations. FLACK automatically determines the violation type and switches to the appropriate technique (e.g., using the comparator for assertion errors and the unsat analyzer for predicate violations; Section III). Finally, the models are listed in sorted order based on their ranking results.

TABLE III: Results of FLACK on 152 Alloy models. Results are sorted based on ranking accuracy. Times are in seconds.

model loc total sliced rank time | model loc total sliced rank time | model loc total sliced rank time
top 1 (91) 41 120 95 1 0.2 | ssl10 43 155 110 1 0.1 | dll20_2 36 88 47 2 0.0
addr 21 74 10 1 0.6 | ssl12 40 157 114 1 0.1 | fsm6 29 98 17 2 0.0
arr3 24 48 9 1 0.1 | ssl14 44 158 149 1 0.5 | fsm9_2 29 90 18 2 0.1
arr4 24 64 61 1 0.2 | ssl14_1 44 158 149 1 0.5 | ssl11 42 177 127 2 0.1
arr5 24 62 59 1 0.2 | ssl14_2 43 153 120 1 0.0 | bst8 59 134 57 3 0.3
arr6 25 56 30 1 0.3 | ssl14_3 44 153 108 1 0.1 | bst8_1 59 134 57 3 0.2
arr7 25 63 50 1 0.1 | ssl17 41 152 119 1 0.0 | bst22_1 49 199 124 3 0.1
bst2 56 134 56 1 0.4 | ssl17_1 42 152 106 1 0.1 | dll1_1 38 86 57 3 0.1
bst2_1 56 134 95 1 0.3 | ssl18_1 40 160 118 1 0.1 | dll18_2 36 107 71 3 0.6
bst3_2 55 141 68 1 0.2 | ssl18_2 49 160 85 1 0.3 | fsm4 31 141 39 3 0.0
cd1 27 44 33 1 0.0 | ssl19 40 169 141 1 0.1 | fsm5_2 29 69 17 3 0.0
cd1_1 27 44 31 1 0.0 | arr1 24 45 31 1 0.1 | ssl2_1 44 156 72 3 0.1
cd2 27 35 25 1 0.0 | arr2 25 60 47 1 0.2 | ssl13 43 174 123 3 0.1
cd3 26 43 32 1 0.0 | arr10 24 59 45 1 0.0 | ssl18 40 162 131 3 0.0
cd3_1 26 46 31 1 0.0 | bst1 51 171 155 1 0.2 | arr8 25 80 15 4 0.5
dll1 37 77 63 1 0.1 | bst4_1 52 163 147 1 0.1 | bst2_2 47 147 92 4 0.1
dll2 42 77 63 1 0.1 | bst5 52 184 168 1 0.1 | bst3 57 137 97 4 0.1
dll3 37 80 59 1 0.0 | bst7 52 159 143 1 0.1 | fsm2 29 70 14 4 0.0
dll3_1 37 75 49 1 0.1 | bst8_2 54 156 140 1 0.1 | fsm8 29 71 14 4 0.1
dll4 37 81 67 1 0.1 | bst9 55 185 169 1 0.1 | arr11 24 83 41 5 0.1
dll5_1 39 102 80 1 0.1 | bst10 47 157 140 1 0.6 | bst10_3 55 162 75 5 0.3
dll6 36 113 94 1 0.1 | bst10_2 52 172 156 1 0.1 | fsm9_1 29 91 18 5 0.1
dll7_1 36 73 59 1 0.1 | bst11_1 60 214 198 1 0.1 | ssl15 41 161 106 5 0.1
dll8 36 96 76 1 0.1 | bst13 53 200 184 1 0.1 | top 6-10 (10) 51 155 73 7.6 0.2
dll9 38 100 91 1 0.1 | bst14 59 202 186 1 0.1 | bst3_1 57 137 58 6 0.2
dll11 36 87 68 1 0.1 | bst15 53 197 181 1 0.1 | bst20_1 55 152 61 6 0.2
dll12 36 77 63 1 0.1 | bst17_1 52 201 185 1 0.1 | fsm7 29 64 14 6 0.1
dll13 36 60 51 1 0.0 | bst18_1 56 204 188 1 0.1 | ssl19_1 50 175 106 7 0.6
dll14_1 37 85 71 1 0.1 | bst20 52 169 153 1 0.1 | bst6 51 140 61 8 0.2
dll15 40 126 107 1 0.1 | bst21 52 182 166 1 0.2 | bst12_1 56 164 68 8 0.2
dll16 36 82 68 1 0.1 | bst22 50 213 158 1 0.6 | bst19_2 55 154 86 8 0.2
dll17_1 36 77 63 1 0.1 | dll7 38 90 79 1 0.1 | ssl19_2 44 174 81 8 0.3
dll18 36 102 84 1 0.0 | dll10 40 91 85 1 0.2 | bst17_2 55 177 79 9 0.2
dll18_1 36 101 67 1 0.1 | dll14 39 102 91 1 0.1 | bst22_2 54 209 118 10 0.1
dll20 36 90 64 1 0.0 | dll17 37 89 83 1 0.1 | >10 (6) 51 172 80 12.7 0.3
farmer 99 124 30 1 1.6 | fsm1 31 90 71 1 0.0 | bst4_2 51 154 61 11 0.3
fsm1_1 30 90 19 1 0.0 | fsm3 60 67 56 1 0.7 | bst16 64 181 79 12 0.3
fsm7_1 29 59 14 1 0.0 | fsm9 30 91 79 1 0.5 | bst17 48 186 113 12 0.2
fsm9_4 30 91 78 1 0.1 | fsm9_3 32 91 78 1 0.1 | ssl12_1 44 161 78 12 0.4
grade 33 22 7 1 0.0 | top 2-5 (35) 40 120 66 2.9 0.2 | ssl9 44 153 72 13 0.1
ssl1 40 168 126 1 0.1 | arr7_1 24 46 9 2 0.9 | bst22_3 57 199 74 16 0.5
ssl2 40 150 122 1 0.1 | arr9 27 83 43 2 0.2 | fail (10) 42 104 71 - 0.1
ssl3 45 188 135 1 0.0 | bst2_3 52 133 68 2 0.1 | bst4 46 145 95 - 0.1
ssl3_1 44 184 121 1 0.2 | bst12 47 146 94 2 0.2 | bst11 54 196 135 - 0.2
ssl4 40 146 118 1 0.1 | bst18 51 187 115 2 1.4 | bempl 51 14 7 - 0.0
ssl5 42 188 160 1 0.6 | bst19 47 158 112 2 0.1 | ctree 30 49 5 - 0.0
ssl6 42 157 148 1 0.7 | bst19_1 52 175 112 2 0.1 | dll3_2 36 75 67 - 0.0
ssl6_1 42 157 148 1 0.5 | dll2_1 42 82 57 2 0.0 | other 34 26 19 - 0.0
ssl6_2 41 152 119 1 0.1 | dll17_2 36 82 57 2 0.0 | ssl16 39 137 119 - 0.0
ssl6_3 42 152 107 1 0.1 | dll18_3 36 103 78 2 0.0 | ssl16_1 39 133 115 - 0.0
ssl7 41 136 95 1 0.1 | dll19 36 83 63 2 0.1 | ssl16_2 47 136 74 - 0.3
ssl7_1 40 135 110 1 0.1 | dll20_1 36 88 68 2 0.1 | ssl16_3 43 133 71 - 0.2
ssl8 43 166 119 1 0.1 | |

In summary, FLACK was able to rank the buggy expression in the top 1 (i.e., the buggy expression is ranked first) for 91 (60%), in the top 2 to 5 for 35 (23%), in the top 6 to 10 for 10 (7%), and above the top 10 for 6 (4%) of the 152 models. For 10 models, FLACK was not able to identify the cause of the errors (i.e., the buggy expressions are not in the ranking list); many of these are beyond the reach of FLACK (e.g., the assertion error is not due to any existing expression in the model, but rather because the model is "missing" some constraints). Finally, regardless of whether
FLACK succeeds or fails, the tool produces the results almost instantaneously (under a second).

b) Analysis: FLACK was able to locate and rank the buggy expressions in 142/152 models. Many of these bugs are common errors in which the developer did not consider certain corner cases. For example, stu5 contains the buggy expression all n : This.header.*link | n.elem <= n.link.elem, which does not allow any node without a link (the fix is changing it to all n : This.header.*link | some n.link => n.elem <= n.link.elem). FLACK successfully recognizes the difference that the last node of the list contains a link to itself in the cex but not in the sat instance, and ranks this expression second; more importantly, it ranks first the subexpression n.elem <= n.link.elem, where the fix is actually needed. FLACK also performed especially well on the 28 models with violated predicates, analyzing their unsat cores and correctly ranking the buggy expressions first.

For six models, bst4_2, bst16, bst17, bst22_3, ssl9, and ssl12_1, FLACK was not able to place the buggy expression within the top 10 (but still within the top 16). For these models, FLACK obtains differences that are not directly related to the error but consistently appear in both the cex and sat instances and therefore confuse FLACK.

FLACK was not able to identify the correct buggy expressions in 10 models, i.e., the resulting ranking list does not contain the buggy expressions. Most of these bugs are beyond the scope of FLACK (and of fault localization techniques in general). More specifically, the 9 models bst4, bst11, bempl, ctree, dll3_2, ssl16, ssl16_1, ssl16_2, and ssl16_3 have assertion violations due to missing constraints in predicates and thus do not contain buggy expressions to be localized. For other, FLACK did not find the "ground truth" buggy expression (the buggy expression does not contain the differing relation) but ranked first another expression that could also be modified to fix the error.

TABLE IV: FLACK's results on large complex models

model                 loc   total  sliced  rank  time(s)
surgical robots        200    293     278     2      2.3
android permissions    297   1138     673     2      5.2
sll-contains          5250   5562    5510     3      3.0
count-nodes           3791   5064    2861    18    188.7
remove-nth            5063   6336    3306    12   1265.0
B. RQ2: Real-world Case Studies
The AlloyFL benchmark contains a wide variety of Alloy models and bugs, but the models are relatively small (≈50 LoC). To investigate the scalability of FLACK, we consider additional case studies on larger and more complex Alloy models.

a) Surgical Robots: The study in [32] uses Alloy to model highly-configurable surgical robots to verify a critical arm movement safety property: the position of the robot arm is in the same position that the surgeon articulates in the control workspace during the surgery procedure, and the surgeon is notified if the arm is pushed outside of its physical range. This property is formulated as an assertion and checked on 15 Alloy models representing 15 types of robot arms using different combinations of hardware and software features. The study found that 5 models violate the property.

Table IV, which has the same format as Table III, lists the results. We use all 5 buggy models (each has about 200 LoC) but list them under one row because they are largely similar and share many common facts and predicates, differing only in configurable values (e.g., one model has a fact that sets AngleLmit to 3 while another uses the value 6). The buggy expression is also similar and appears in the same fact. For each model,
FLACK ranked the correct buggy expression in second place in less than 3 seconds. FLACK returned two suspicious expressions: (1) HapticsDisabled in UsedGeomagicTouch.force and (2) some notification : GeomagicTouch.force | notification = HapticsEnabled. Modifying either expression would fix the issue, e.g., changing Disabled to Enabled in (1) or Enabled to Disabled in (2).

b) Android Permissions: The COVERT project [33] uses compositional analysis to model the permissions of the Android OS and apps to find inter-app communication vulnerabilities. The generated Alloy model used in this work does not contain bugs violating assertions, so we used the MuAlloy mutation tool [34] to introduce 5 (random) bugs into various predicates in the model: 3 binary operator mutations, 1 unary operator mutation, and 1 variable declaration mutation.

Table IV shows the result. FLACK was able to locate 4 buggy expressions (the unary modification ranked 2nd; the binary operator mutations ranked 3rd, 9th, and 11th), but could not identify the other mutated expression (the variable declaration mutation). However, after manual analysis, we realized that this mutated expression does not contribute to the assertion violation (i.e., FLACK is correct in not identifying it as a fault).

c) TACO: The TACO (Translation of Annotated COde) project [2] uses Alloy to verify Java programs with specifications. TACO automatically converts a Java program annotated with invariants into an Alloy model with an assertion. If the Java program contains a bug that violates the annotated invariant, then checking the assertion in the Alloy model produces a counterexample. We use three different Alloy models with violated assertions representing three real Java programs from TACO [2]: sll-contains checks if a particular element exists in a linked list; count-nodes counts the number of a list's nodes; and remove-nth removes the nth element of a list. These (machine-generated) models are much larger than typical Alloy models (see Table IV for their sizes). For sll-contains, FLACK ranked the buggy expression third within 3 seconds. This expression helped us locate an error in the original Java program that skips the list header. The faulty expressions of remove-nth and count-nodes are ranked 12th and 18th, respectively (still quite reasonable given the large number of possible locations). Note that these buggy expressions consist of multiple errors (e.g., having 5 buggy nodes), causing FLACK to instantiate and analyze combinations of a large number of subexpressions.

Manual analysis of the identified buggy expressions showed that the original Java programs contain (single) bugs within loops. TACO performs loop unrolling and thus spreads each single bug into multiple bugs in the corresponding Alloy models.

In summary, we found that
FLACK works well on large, real-world Alloy models. While coming up with correct fixes for these models remains nontrivial, FLACK can help developers (or automatic program repair tools) quickly locate buggy expressions, which in turn helps in understanding (and hopefully repairing) the actual errors in the original models.
C. RQ3: Comparing to AlloyFL
We compare
FLACK with AlloyFL [19], which, to the best of our knowledge, is the only Alloy fault localization technique currently available. While both tools compute suspicious statements, they are very different in both assumptions and technical approaches. As discussed in Section V, AlloyFL requires AUnit tests [25], provided by the user or generated from the correct model, and adopts existing fault localization techniques from imperative programs, such as mutation testing and spectrum-based fault localization; in contrast, FLACK uses violated assertions and relies on counterexamples.

To apply AlloyFL to the 152 benchmark models, we use the best-performing configuration and test suites described in [19]. Specifically, we use the AlloyFL hybrid algorithm with the Ochiai formula and reuse the tests from [19] (automatically generated by MuAlloy [34] as described in [19]).

TABLE V: Comparison with AlloyFL.

tool      top 1   top 5   top 10   >10   failed   avg rank   time(s)
FLACK        91     126      136     6       10        2.4       0.2
AlloyFL      76     128      137     8        7        3.1      32.4

Table V compares the results of
FLACK and AlloyFL. The two approaches appear to perform similarly, with FLACK being slightly more accurate. Overall, FLACK outperforms AlloyFL: on average, the buggy expressions are ranked 2nd by FLACK and 3rd by AlloyFL. Also, in top-1 ranking, FLACK performs much better than AlloyFL (91 versus 76 models). Moreover, FLACK is much faster: the average analysis time is far less than a second for FLACK, while it takes over 30 seconds for AlloyFL to analyze the same specifications.

We were not able to apply AlloyFL to the models in Section IV-B because MuAlloy [34], which is used to generate AUnit tests for AlloyFL, does not work with these models (mostly due to unhandled Alloy operators). This is not a weakness of AlloyFL, but it suggests that it is not trivial to generate tests from existing Alloy models automatically.
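For context, the Ochiai formula used in the AlloyFL configuration above is the standard spectrum-based suspiciousness metric; a minimal generic sketch (not AlloyFL's code) is:

```python
# The standard Ochiai suspiciousness formula used in spectrum-based fault
# localization (a generic sketch, not AlloyFL's implementation).
from math import sqrt

def ochiai(failed_cover, passed_cover, total_failed):
    """failed_cover/passed_cover: failing/passing tests covering the element;
    total_failed: all failing tests in the suite."""
    denom = sqrt(total_failed * (failed_cover + passed_cover))
    return failed_cover / denom if denom else 0.0

print(ochiai(4, 0, 4))  # element covered only by failing tests → 1.0
print(ochiai(0, 5, 4))  # element never covered by a failing test → 0.0
```

Elements covered mostly by failing tests score near 1 and are ranked as most suspicious.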
D. Threats to Validity
We assume no fault in data type (sig) and field declarations, which may limit the usage of FLACK. However, none of the benchmark models we used has bugs at these locations. Moreover, we could always translate constraints on sigs and fields into facts. For example, one sig A could be translated to sig A; fact { one A }.

The models in the AlloyFL benchmark are collected from graduate students' homework and are relatively small. Thus, they may not represent faulty Alloy models in the real world. We also evaluate FLACK with large Alloy models, written by experienced Alloy developers (the surgical robot models and the Android permissions model) or generated by an automatic tool (TACO), and show that FLACK performs well on these models (Section IV-B).

We manually create assertions for models that do not have assertions. Thus, our assertions might be inaccurate and not as intended. However, for other models with assertions (e.g., those in the AlloyFL benchmark and all the case studies), we use those assertions directly, and FLACK outputs similar results.
V. RELATED WORK

FLACK is related to AlloyFL [19], which adopts spectrum-based fault localization [22], [23], [24] and mutation-based techniques [20], [21] from imperative languages. Given AUnit tests [25] labeled as should-pass or should-fail, AlloyFL computes a suspiciousness score for an expression by mutating it and giving it a higher score if the mutation increases the number of should-pass tests that pass and the number of should-fail tests that fail. AlloyFL uses MuAlloy [34] to automatically generate tests. However, MuAlloy requires the correct Alloy model to generate these tests. FLACK does not require tests and instead uses assertions, which are commonly used in Alloy.

The generation of similar instances can be viewed as a model exploration problem [35]. Bordeaux [36] uses Alloy to find pairs of SAT/UNSAT instances with minimum relative distances. In contrast, FLACK reduces the generation of an instance as close as possible to the identified counterexample to a partial max-sat problem and solves it using a PMAX-SAT solver.

Amalgam [31] explains why some tuples of a relation do or do not appear in certain instances. A user manually selects a tuple to add or delete, and Amalgam tries to explain why they can or cannot do so (typically the reason for counterexamples is due to the assertions). FLACK instead automatically identifies why a counterexample fails and finds locations that relate to this violation.

Many fault localization techniques have been developed for imperative languages. Spectrum-based techniques [22], [23], [24], [37], [38], [29], [39], [40] identify faulty statements by comparing passing and failing test executions. SAT-based techniques [41], [42] convert the fault localization problem into a SAT problem. Statistics-based methods [43], [44] collect statistical information from test executions to locate errors. Feedback-based techniques [45], [46] interactively locate errors by getting feedback from the user. Delta debugging [47], [48] identifies code changes responsible for test failures. There is also work on minimizing differences in inputs, based on the assumption that similar inputs lead to similar runs [49], [50], [51]. Program slicing [52], [53], [54] has also been used to aid debugging.

Model-based diagnosis (MBD) approaches identify faulty components of a system based on abnormal behaviors. Griesmayer [55] applied MBD to localizing faults in imperative programs using a model checker. Marques-Silva [56] converted the MBD problem into a MAXSAT problem to find the minimal diagnosis, where the system description is encoded as hard clauses and the not-abnormal predicates as soft clauses. There has also been a similar line of work on pinpointing axioms in description logic [57], [58].
VI. CONCLUSION AND FUTURE WORK
We introduce a new fault localization approach for declarative models written in Alloy. Our insight is that Alloy expressions that likely cause an assertion violation can be obtained by analyzing the counterexamples, unsat cores, and satisfying instances from the Alloy Analyzer. We present FLACK, a tool that implements these ideas to compute and rank suspicious expressions causing an assertion violation in an Alloy model. FLACK uses a PMAX-SAT solver to find satisfying instances similar to counterexamples generated by the Alloy Analyzer, analyzes satisfying instances and counterexamples to locate suspicious expressions, analyzes subexpressions to achieve a finer level of localization granularity, and uses unsat cores to help identify conflicting expressions. Preliminary results on existing Alloy benchmarks and large, real-world benchmarks show that FLACK is effective in accurately finding the expressions causing errors. We believe that FLACK takes an important step toward finding bugs in Alloy and exposes opportunities for researchers to exploit new debugging techniques for Alloy.

Currently, we are improving the accuracy and efficiency of FLACK. Specifically, instead of using a default number of instance pairs, we can search for instances incrementally until the algorithm converges. We are also exploring new approaches to effectively integrate FLACK with automatic Alloy repair techniques. Preliminary results from the recent BeAFix work [59] show that FLACK accurately identifies faults in Alloy specifications, which in turn helps BeAFix automatically analyze and repair those specifications.
VII. DATA AVAILABILITY
We make
FLACK and all research artifacts, models, and experimental data reported in the paper available to the research and education community [26].
ACKNOWLEDGMENT
We thank the anonymous reviewers for their helpful comments. This work was supported in part by award W911NF-19-1-0054 from the Army Research Office; CCF-1948536, CCF-1755890, and CCF-1618132 from the National Science Foundation; and PICT 2016-1384, 2017-1979, and 2017-2622 from the Argentine National Agency of Scientific and Technological Promotion (ANPCyT).
REFERENCES

[1] D. Jackson, "Alloy: A lightweight object modelling notation," ACM Trans. Softw. Eng. Methodol., vol. 11, no. 2, pp. 256–290, Apr. 2002.
[2] J. P. Galeotti, N. Rosner, C. G. López Pombo, and M. F. Frias, "Analysis of invariants for efficient bounded verification," in International Symposium on Software Testing and Analysis, ser. ISSTA '10. New York, NY, USA: Association for Computing Machinery, 2010, pp. 25–36.
[3] P. Abad, N. Aguirre, V. S. Bengolea, D. Ciolek, M. F. Frias, J. P. Galeotti, T. Maibaum, M. M. Moscato, N. Rosner, and I. Vissani, "Improving test generation under rich contracts by tight bounds and incremental SAT solving," in International Conference on Software Testing, Verification and Validation. IEEE Computer Society, 2013, pp. 21–30.
[4] N. Mirzaei, J. Garcia, H. Bagheri, A. Sadeghi, and S. Malek, "Reducing combinatorics in GUI testing of Android applications," in Proceedings of the 38th International Conference on Software Engineering (ICSE), 2016, pp. 559–570.
[5] H. Bagheri and K. J. Sullivan, "Bottom-up model-driven development," in , D. Notkin, B. H. C. Cheng, and K. Pohl, Eds. IEEE Computer Society, 2013, pp. 1221–1224. [Online]. Available: https://doi.org/10.1109/ICSE.2013.6606683
[6] H. Bagheri and K. J. Sullivan, "Pol: specification-driven synthesis of architectural code frameworks for platform-based applications," in Generative Programming and Component Engineering, GPCE'12, Dresden, Germany, September 26-28, 2012, K. Ostermann and W. Binder, Eds. ACM, 2012, pp. 93–102. [Online]. Available: https://doi.org/10.1145/2371401.2371416
[7] H. Bagheri, Y. Song, and K. J. Sullivan, "Architectural style as an independent variable," in ASE 2010, 25th IEEE/ACM International Conference on Automated Software Engineering, Antwerp, Belgium, September 20-24, 2010, C. Pecheur, J. Andrews, and E. Di Nitto, Eds. ACM, 2010, pp. 159–162. [Online]. Available: https://doi.org/10.1145/1858996.1859026
[8] F. A. Maldonado-Lopez, J. Chavarriaga, and Y. Donoso, "Detecting network policy conflicts using Alloy," in Proceedings of the 4th International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z - Volume 8477, ser. ABZ 2014. Berlin, Heidelberg: Springer-Verlag, 2014, pp. 314–317.
[9] T. Nelson, C. Barratt, D. J. Dougherty, K. Fisler, and S. Krishnamurthi, "The Margrave tool for firewall analysis," in Proceedings of the 24th International Conference on Large Installation System Administration, ser. LISA'10. USA: USENIX Association, 2010, pp. 1–8.
[10] N. Ruchansky and D. Proserpio, "A (not) nice way to verify the OpenFlow switch specification: Formal modelling of the OpenFlow switch using Alloy," SIGCOMM Comput. Commun. Rev., vol. 43, no. 4, pp. 527–528, Aug. 2013.
[11] M. Alhanahnah, C. Stevens, and H. Bagheri, "Scalable analysis of interaction threats in IoT systems," in ISSTA '20: 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, USA, July 18-22, 2020, S. Khurshid and C. S. Pasareanu, Eds. ACM, 2020, pp. 272–285. [Online]. Available: https://doi.org/10.1145/3395363.3397347
[12] H. Bagheri, J. Wang, J. Aerts, and S. Malek, "Efficient, evolutionary security analysis of interacting Android apps," in , 2018, pp. 357–368.
[13] H. Bagheri, E. Kang, S. Malek, and D. Jackson, "A formal approach for detection of security flaws in the Android permission system," Formal Aspects of Computing, vol. 30, no. 5, pp. 525–544, 2018. [Online]. Available: https://doi.org/10.1007/s00165-017-0445-z
[14] H. Bagheri, C. Tang, and K. Sullivan, "Trademaker: Automated dynamic analysis of synthesized tradespaces," in Proceedings of the 36th International Conference on Software Engineering, ser. ICSE 2014. New York, NY, USA: ACM, 2014, pp. 106–116. [Online]. Available: http://doi.acm.org/10.1145/2568225.2568291
[15] H. Bagheri, C. Tang, and K. J. Sullivan, "Automated synthesis and dynamic analysis of tradeoff spaces for object-relational mapping," IEEE Trans. Software Eng., vol. 43, no. 2, pp. 145–163, 2017. [Online]. Available: https://doi.org/10.1109/TSE.2016.2587646
[16] J. Brunel, D. Chemouil, A. Cunha, and N. Macedo, "The Electrum Analyzer: model checking relational first-order temporal specifications," in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France, September 3-7, 2018, M. Huchard, C. Kästner, and G. Fraser, Eds. ACM, 2018, pp. 884–887. [Online]. Available: https://doi.org/10.1145/3238147.3240475
[17] A. Cunha and N. Macedo, "Validating the hybrid ERTMS/ETCS level 3 concept with Electrum," in Abstract State Machines, Alloy, B, TLA, VDM, and Z, M. Butler, A. Raschke, T. S. Hoang, and K. Reichl, Eds. Cham: Springer International Publishing, 2018, pp. 307–321.
[18] H. Kim, E. Kang, E. A. Lee, and D. Broman, "A toolkit for construction of authorization service infrastructure for the internet of things," in , 2017, pp. 147–158.
[19] K. Wang, A. Sullivan, D. Marinov, and S. Khurshid, "Fault localization for declarative models in Alloy," in International Symposium on Software Reliability Engineering (ISSRE), 2020, pp. 391–402.
[20] S. Moon, Y. Kim, M. Kim, and S. Yoo, "Ask the mutants: Mutating faulty programs for fault localization," 2014, pp. 153–162.
[21] M. Papadakis and Y. Le Traon, "Metallaxis-FL: Mutation-based fault localization," Softw. Test. Verif. Reliab., vol. 25, no. 5-7, pp. 605–628, Aug. 2015. [Online]. Available: https://doi.org/10.1002/stvr.1509
[22] R. Abreu, P. Zoeteweij, and A. J. C. van Gemund, "On the accuracy of spectrum-based fault localization," in Proceedings of the Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION, ser. TAICPART-MUTATION '07. USA: IEEE Computer Society, 2007, pp. 89–98.
[23] J. A. Jones, M. J. Harrold, and J. Stasko, "Visualization of test information to assist fault localization," in Proceedings of the 24th International Conference on Software Engineering, ser. ICSE '02. New York, NY, USA: Association for Computing Machinery, 2002, pp. 467–477.
[24] L. Naish, H. J. Lee, and K. Ramamohanarao, "A model for spectra-based software diagnosis," ACM Trans. Softw. Eng. Methodol., vol. 20, no. 3, Aug. 2011.
[25] A. Sullivan, K. Wang, and S. Khurshid, "AUnit: A test automation tool for Alloy," in , 2018, pp. 398–403.
[26] FLACK repository, 2020. [Online]. Available: https://doi.org/10.6084/m9.figshare.13439894.v4
[27] Z. Fu and S. Malik, "On solving the partial max-sat problem," in Theory and Applications of Satisfiability Testing - SAT 2006, A. Biere and C. P. Gomes, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 252–265.
[28] A. Cunha, N. Macedo, and T. Guimarães, "Target oriented relational model finding," in Fundamental Approaches to Software Engineering, S. Gnesi and A. Rensink, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014, pp. 17–31.
[29] J. A. Jones and M. J. Harrold, "Empirical evaluation of the Tarantula automatic fault-localization technique," in Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE '05. New York, NY, USA: Association for Computing Machinery, 2005, pp. 273–282.
[30] E. Torlak and D. Jackson, "Kodkod: A relational model finder," in Tools and Algorithms for the Construction and Analysis of Systems, O. Grumberg and M. Huth, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 632–647.
[31] T. Nelson, N. Danas, D. J. Dougherty, and S. Krishnamurthi, "The power of "why" and "why not": enriching scenario exploration with provenance," in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, Paderborn, Germany, September 4-8, 2017, 2017, pp. 106–116.
[32] N. Mansoor, J. A. Saddler, B. Silva, H. Bagheri, M. B. Cohen, and S. Farritor, "Modeling and testing a family of surgical robots: An experience report," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2018. New York, NY, USA: Association for Computing Machinery, 2018, pp. 785–790. [Online]. Available: https://doi.org/10.1145/3236024.3275534
[33] H. Bagheri, A. Sadeghi, J. Garcia, and S. Malek, "COVERT: Compositional analysis of Android inter-app permission leakage," IEEE Transactions on Software Engineering, vol. 41, no. 9, pp. 866–886, 2015.
[34] K. Wang, A. Sullivan, and S. Khurshid, "MuAlloy: A mutation testing framework for Alloy," in Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings, ser. ICSE '18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 29–32. [Online]. Available: https://doi.org/10.1145/3183440.3183488
[35] N. Macedo, A. Cunha, and T. Guimarães, "Exploring scenario exploration," in Fundamental Approaches to Software Engineering, A. Egyed and I. Schaefer, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2015, pp. 301–315.
[36] V. Montaghami and D. Rayside, "Bordeaux: A tool for thinking outside the box," in Proceedings of the 20th International Conference on Fundamental Approaches to Software Engineering - Volume 10202. Berlin, Heidelberg: Springer-Verlag, 2017, pp. 22–39. [Online]. Available: https://doi.org/10.1007/978-3-662-54494-5_2
[37] R. Abreu, P. Zoeteweij, R. Golsteijn, and A. J. C. van Gemund, "A practical evaluation of spectrum-based fault localization," J. Syst. Softw., vol. 82, no. 11, pp. 1780–1792, Nov. 2009.
[38] V. Dallmeier, C. Lindig, and A. Zeller, "Lightweight defect localization for Java," in Proceedings of the 19th European Conference on Object-Oriented Programming, ser. ECOOP'05. Berlin, Heidelberg: Springer-Verlag, 2005, pp. 528–550.
[39] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan, "Scalable statistical bug isolation," in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '05. New York, NY, USA: Association for Computing Machinery, 2005, pp. 15–26. [Online]. Available: https://doi.org/10.1145/1065010.1065014
[40] S. Pearson, J. Campos, R. Just, G. Fraser, R. Abreu, M. D. Ernst, D. Pang, and B. Keller, "Evaluating and improving fault localization," in
Proceedings of the 39th International Conference on SoftwareEngineering , ser. ICSE ’17. IEEE Press, 2017, p. 609–620. [Online].Available: https://doi.org/10.1109/ICSE.2017.62 [41] M. Jose and R. Majumdar, “Cause clue clauses: Error localization usingmaximum satisfiability,”
SIGPLAN Not. , vol. 46, no. 6, pp. 437–446, Jun.2011. [Online]. Available: https://doi.org/10.1145/1993316.1993550[42] D. Gopinath, R. N. Zaeem, and S. Khurshid, “Improving theeffectiveness of spectra-based fault localization using specifications,”in
Proceedings of the 27th IEEE/ACM International Conference onAutomated Software Engineering , ser. ASE 2012. New York, NY,USA: Association for Computing Machinery, 2012, p. 40–49. [Online].Available: https://doi.org/10.1145/2351676.2351683[43] C. Liu, X. Yan, L. Fei, J. Han, and S. P. Midkiff, “Sober:Statistical model-based bug localization,”
SIGSOFT Softw. Eng. Notes ,vol. 30, no. 5, pp. 286–295, Sep. 2005. [Online]. Available:https://doi.org/10.1145/1095430.1081753[44] W. E. Wong, V. Debroy, and D. Xu, “Towards better fault localization:A crosstab-based statistical approach,”
IEEE Transactions on Systems,Man, and Cybernetics, Part C (Applications and Reviews) , vol. 42, no. 3,pp. 378–396, 2012.[45] Y. Lin, J. Sun, Y. Xue, Y. Liu, and J. Dong, “Feedback-baseddebugging,” in
Proceedings of the 39th International Conference onSoftware Engineering , ser. ICSE’17. IEEE Press, 2017, pp. 393–403.[Online]. Available: https://doi.org/10.1109/ICSE.2017.43[46] X. Li, S. Zhu, M. d’Amorim, and A. Orso, “Enlightened debugging,”in
Proceedings of the 40th International Conference on SoftwareEngineering , ser. ICSE ’18. New York, NY, USA: Associationfor Computing Machinery, 2018, p. 82–92. [Online]. Available:https://doi.org/10.1145/3180155.3180242[47] B. Ness and V. Ngo, “Regression containment through source changeisolation,” 09 1997, pp. 616 – 621.[48] A. Zeller, “Yesterday, my program worked. today, it does not. why?”in
Proceedings of the 7th European Software Engineering ConferenceHeld Jointly with the 7th ACM SIGSOFT International Symposiumon Foundations of Software Engineering , ser. ESEC/FSE-7. Berlin,Heidelberg: Springer-Verlag, 1999, pp. 253–267.[49] T. Reps, T. Ball, M. Das, and J. Larus, “The use of program profilingfor software maintenance with applications to the year 2000 problem,”in
Software Engineering — ESEC/FSE’97 , M. Jazayeri and H. Schauer,Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1997, pp. 432–449.[50] D. B. Whalley, “Automatic isolation of compiler errors,”
ACM Trans.Program. Lang. Syst. , vol. 16, no. 5, pp. 1648–1659, Sep. 1994.[51] A. Zeller and R. Hildebrandt, “Simplifying and isolating failure-inducinginput,”
IEEE Trans. Softw. Eng. , vol. 28, no. 2, pp. 183–200, Feb. 2002.[52] J. R. Lyle and M. Weiser, “Automatic program bug location by programslicing,” 1987.[53] H. Agrawal, J. Horgan, S. London, and W. Wong, “Fault localizationusing execution slices and dataflow tests,”
Proceedings of IEEE SoftwareReliability Engineering , 06 1999.[54] H. Agrawal and J. R. Horgan, “Dynamic program slicing,”
SIGPLANNot. , vol. 25, no. 6, pp. 246–256, Jun. 1990. [Online]. Available:https://doi.org/10.1145/93548.93576[55] A. Griesmayer, S. Staber, and R. Bloem, “Fault localization using amodel checker,”
Softw. Test. Verif. Reliab. , vol. 20, no. 2, p. 149–173,Jun. 2010.[56] J. Marques-Silva, M. Janota, A. Ignatiev, and A. Morgado, “Efficientmodel based diagnosis with maximum satisfiability,” in
Proceedingsof the 24th International Conference on Artificial Intelligence , ser.IJCAI’15. AAAI Press, 2015, p. 1966–1972.[57] F. Baader and R. Pe˜naloza, “Axiom pinpointing in general tableaux,”
J. Log. and Comput. , vol. 20, no. 1, p. 5–34, Feb. 2010. [Online].Available: https://doi.org/10.1093/logcom/exn058[58] F. Baader, R. Pe˜naloza, and B. Suntisrivaraporn, “Pinpointing in thedescription logic EL + ,” in KI 2007: Advances in Artificial Intelligence ,J. Hertzberg, M. Beetz, and R. Englert, Eds. Berlin, Heidelberg:Springer Berlin Heidelberg, 2007, pp. 52–67.[59] S. G. Brida, G. Regis, G. Zheng, H. Bagheri, T. Nguyen, N. Aguirre, andM. F. Frias, “Bounded exhaustive search of alloy specification repairs,”in
International Conference on Software Engineering (ICSE) . IEEE,2021, p. to appear.. IEEE,2021, p. to appear.