FLACK: Counterexample-Guided Fault Localization for Alloy Models
Guolong Zheng, ThanhVu Nguyen, Simón Gutiérrez Brida, Germán Regis, Marcelo F. Frias, Nazareno Aguirre, Hamid Bagheri
∗ University of Nebraska-Lincoln, {gzheng, tnguyen}@cse.unl.edu, [email protected]
† University of Río Cuarto and CONICET, {sgutierrez, gregis, naguirre}@dc.exa.unrc.edu.ar
‡ Dept. of Software Engineering, Instituto Tecnológico de Buenos Aires, [email protected]
Abstract—Fault localization is a practical research topic that helps developers identify code locations that might cause bugs in a program. Most existing fault localization techniques are designed for imperative programs (e.g., C and Java) and rely on analyzing correct and incorrect executions of the program to identify suspicious statements. In this work, we introduce a fault localization approach for models written in a declarative language, where the models are not "executed," but rather converted into a logical formula and solved using backend constraint solvers. We present FLACK, a tool that takes as input an Alloy model consisting of some violated assertion and returns a ranked list of suspicious expressions contributing to the assertion violation. The key idea is to analyze the differences between counterexamples, i.e., instances of the model that do not satisfy the assertion, and instances that do satisfy the assertion, to find suspicious expressions in the input model. The experimental results show that FLACK is efficient (it can handle complex, real-world Alloy models with thousands of lines of code within 5 seconds), accurate (it can consistently rank buggy expressions in the top 1.9% of the suspicious list), and useful (it can often narrow down the error to the exact location within the suspicious expressions).
I. INTRODUCTION
Declarative specification languages and the corresponding formally precise analysis engines have long been utilized to solve various software engineering problems. The Alloy specification language [1] relies on first-order relational logic and has been used in a wide range of applications, such as program verification [2], test case generation [3], [4], software design [5], [6], [7], network security [8], [9], [10], security analysis of emerging platforms such as IoT [11] and Android [12], [13], and design tradeoff analysis [14], [15], to name a few. Cunha and Macedo, among others, use a recent extension of Alloy, called Electrum [16], to validate the European Rail Traffic Management System, a system of standards for management and inter-operation of signaling for the European railways [17]. Kim [18] proposes a Secure Swarm Toolkit (SST), a platform for building an authorization service infrastructure for IoT systems, and uses Alloy to show that SST provides the necessary security guarantees.

Similar to developing programs in an imperative language such as C or Java, developers can make subtle mistakes when using Alloy to model system specifications, especially specifications that capture complex systems with non-trivial behaviors, which renders debugging even more arduous. These challenges call for debugging assistance mechanisms, such as fault localization techniques, that support declarative specification languages.

However, there is a dearth of fault localization techniques developed for Alloy. AlloyFL [19] is perhaps the only fault localization tool available for Alloy as of today. The key idea of AlloyFL is to use "unit tests," where a test is a predicate that describes an Alloy instance to encode expected behaviors, to compute suspicious expressions in an Alloy model that fails these tests. To compute the suspicious expressions, AlloyFL uses mutation testing [20], [21] and statistical debugging techniques [22], [23], [24]; i.e., it mutates expressions, collects statistics on how each mutation affects the tests, and then uses this information to assign suspicion scores to expressions.

While AlloyFL pioneered fault localization in the Alloy context and its results are promising, it relies on the availability of AUnit tests [25], i.e., predicates representing Alloy instances, which are not common in the Alloy setting. Indeed, instead of writing test cases, Alloy users write assertions to describe the desired property and let the Alloy Analyzer search for potential counterexamples (cex's) that violate the property. Moreover, it is unclear how many test cases are needed or how good they must be for AlloyFL to be effective (e.g., in the AlloyFL evaluation [19], the number of tests ranges from 30 to 120).

To address this state of affairs and improve the quality of Alloy development, we present an automated approach and an accompanying tool-suite for fault localization in Alloy models using counterexamples, called FLACK. Given an Alloy model and a property that is not satisfied by the model, FLACK first queries the underlying Alloy Analyzer for a counterexample, an instance of the model that does not satisfy the property. Next, FLACK uses a partial max sat (PMAXSAT) solver to find an instance that does satisfy the property and is as close as possible to the counterexample. FLACK then determines the relations and atoms that differ between the cex and the sat instance. Finally, FLACK analyzes these differences to compute suspicion scores for expressions in the original model.

Unlike AlloyFL, which relies on unconventional unit tests, FLACK uses well-established and widely-used assertions, naturally compatible with development practices in Alloy. Also, instead of using mutation testing or statically analyzing the effects of tests, FLACK relies on counterexamples and satisfying instances generated by constraint solvers, which are the main underlying technology in Alloy.

We evaluated FLACK on a benchmark consisting of a suite of buggy models from AlloyFL [19]. The experimental results corroborate that FLACK is able to consistently rank buggy expressions in the top 1.9% of the suspicious list. We also evaluated FLACK on three case studies consisting of larger Alloy models used in real-world settings (e.g., Alloy models for surgical robots, Java programs, and Android permissions), and FLACK was able to identify the buggy expressions within the top 1%. The run time of FLACK for most of the models is under 5 seconds (under 1 second for the AlloyFL benchmarks). The experimental results corroborate that FLACK has the potential to significantly facilitate the non-trivial task of formal specification development and exposes opportunities for researchers to develop new debugging techniques for Alloy.

To summarize, this paper makes the following contributions:

• Fault localization approach for declarative models: We present a novel fault localization approach for declarative models specified in the Alloy language. The insight underlying our approach is that expressions in an Alloy model that likely cause an assertion violation can be identified by analyzing the counterexamples and closely related satisfying instances.

• Tool implementation: We develop a fully automated technology, dubbed FLACK, that effectively realizes our fault localization approach. We make FLACK publicly available to the research and education community [26].

• Empirical evaluation: We evaluate FLACK in the context of faulty Alloy specifications found in prior work and specifications derived from real-world systems, corroborating FLACK's ability to consistently rank buggy expressions high on the suspicious list and to analyze complex, real-world Alloy models with thousands of lines of code.

The rest of the paper is organized as follows. Section II motivates our research through an illustrative example. Section III describes the details of our fault localization approach for Alloy models. Section IV presents the implementation and evaluation of the research. The paper concludes with an outline of the related research and future work.

II. ILLUSTRATION
To motivate the research and illustrate our approach, we provide an Alloy specification of a finite state machine (FSM), adapted from the AlloyFL benchmarks [19] and shown in Figure 1. The specification defines two type signatures, State and FSM, along with their fields (lines 1–5). The specification contains three fact paragraphs expressing the constraints detailed below: If a start (or a stop) state exists, there is only one of them (fact OneStartAndStop). The start state is not a subset of the stop state; no transition terminates at the start state; and no transition leaves a stop state (fact ValidStartAndStop). Finally, every state is reachable from the start state, and the stop state is reachable from any state (fact Reachability).

 1  sig FSM {
 2    start: set State,
 3    stop: set State
 4  }
 5  sig State { transition: set State }
 6  fact OneStartAndStop {
 7    // If a start state exists, there is only one of them
 8    all start1, start2 : FSM.start | start1 = start2
 9    // If a stop state exists, there is only one of them
10    all stop1, stop2 : FSM.stop | stop1 = stop2
11    some FSM.stop
12  }
13  fact ValidStartAndStop {
14    // The start state is not a subset of the stop state.
15    FSM.start !in FSM.stop
16    // No transition ends at the start state.
17    all s : State | FSM.start !in s.transition
18    // Error: should be "<=>" instead of "=>".
19    all s: State | s.transition = none => s in FSM.stop
20  }
21  fact Reachability {
22    // All states are reachable from the start state.
23    State = FSM.start.*transition
24    // The stop state is reachable from any state.
25    all s: State | FSM.stop in s.*transition
26  }
27  assert NoStopTransition { no FSM.stop.transition }
28  check NoStopTransition for 5

Fig. 1: Buggy FSM model, adapted from AlloyFL [19].

Each assertion specifies a property that is expected to hold in all instances of the model. For example, we use the assertion NoStopTransition to check that a stop state behaves as a sink. The Alloy Analyzer disproves this assertion by producing a counterexample, shown in Figure 2a, in which the stop state labeled State3 transitions to State1.

Thus, there is a "bug" in the model causing the assertion violation. Indeed, careful analysis of the model and the generated cex reveals that the problem is in the expression on line 19: instead of stating that a stop state does not have any transition to any state, the expression states that any state not having a transition to anywhere is a stop state, a subtle logical error that is difficult to notice.

The goal of FLACK is to identify such buggy expressions automatically. For this example, within a second, FLACK identifies four suspicious expressions, with the one on line 19 ranked first. Table I shows the results: expressions or nodes with higher scores are ranked higher. Moreover, FLACK suggests that the operator => is likely the issue in the expression. (There are two potential fixes: (i) reverse the implication to s in FSM.stop => s.transition = none, or (ii) replace the implication operator => with logical equivalence <=>, which technically would strengthen the intended requirement.)

TABLE I: FLACK's results obtained for the model in Figure 1.

Suspicious Expression                        Line   Score
s.transition = none => s in FSM.stop (=>)     19    1.58
s.transition = none                           19    1.25
FSM.stop in s.*transition                     25    0.5
s in FSM.stop                                 19    0.5

Fig. 2: Instance pair: (a) counterexample; (b) (PMAX) sat instance. Note the similarity between the instances.

Such a level of granularity can significantly help the developer understand and fix the problem. The results in Section IV show that
FLACK can consistently rank the exact buggy expression within the top 5 suspicious ones and do so in under a second.

The key idea underlying our fault localization approach is to analyze the differences between counterexamples (instances of the model that do not satisfy the assertion) and instances that do satisfy the assertion to find suspicious expressions in the input model. FLACK first checks the assertion NoStopTransition in the model using the Alloy Analyzer, which returns the cex in Figure 2a. Next, FLACK generates a satisfying (sat) instance that is as minimal and as similar to the cex as possible. Their differences promise effective localization of the issue.

a) Generating SAT instances:
To obtain an instance similar to the cex, FLACK transforms the input model into a logical formula representing hard constraints and the information from the cex into a formula representing soft constraints. Essentially, FLACK converts the instance-finding problem into a Partial-Max SAT (PMAXSAT) problem [27] and then uses the Pardinus [28] solver to find a solution that satisfies all the hard constraints and as many soft constraints as possible. Thus, the result is an instance of the model that is similar to the cex but satisfies the assertion. Figure 2b shows an instance produced by Pardinus given the cex shown in Figure 2a. Notice that this instance is similar to the given cex, except for the edge from State3 to State1, which represents the main difference between the two instances.

b) Finding Suspicious Expressions:
FLACK analyzes the differences between cex's and sat instances (e.g., here the transition from State3 to State1, which appears only in the cex and not in the sat instance) to identify Alloy relations causing the issue. As shown in Table II, which gives the Alloy text representation of the cex in Figure 2a, the transition relation involves the tuple State3->State1 and the stop relation involves State3. Thus, FLACK hypothesizes that the two relations transition and stop may cause the difference between the two instances. Note that while we present one cex and one sat instance in this example for the sake of simplicity, FLACK supports analyzing multiple pairs of cex's and sat instances in tandem.

TABLE II: Text representation of the cex in Figure 2a.

Relation     Tuples
FSM          FSM0
start        FSM0->State0
stop         FSM0->State3
State        State0, State1, State2, State3
transition   State0->State1, State0->State3, State1->State2,
             State2->State3, State3->State1
Next, FLACK slices the input model to contain only the expressions affecting both relations transition and FSM.stop. This results in two expressions: all s: State | FSM.stop in s.*transition (line 25) and all s: State | s.transition = none => s in FSM.stop (line 19).

At this point, FLACK could stop and return these two expressions, one of which is the buggy expression on line 19. Indeed, this level of "statement" granularity is often used in fault localization techniques such as Tarantula [29] or Ochiai [22]. However, FLACK aims to achieve a finer-grained granularity level by also considering the boolean and relational subexpressions, detailed below.

c) Ranking Boolean Nodes:
The expressions on lines 25 and 19 have four boolean nodes: (a) FSM.stop in s.*transition, (b) s.transition = none, (c) s in FSM.stop, and (d) s.transition = none => s in FSM.stop. FLACK instantiates each of these with State1 and State3, the values that differentiate the cex and the sat instance. For example, (a) becomes FSM.stop in State1.*transition and FSM.stop in State3.*transition. Next, FLACK evaluates these instantiations using the cex and the sat instance and assigns a higher suspicion score to those with inconsistent evaluation results. For example, the instantiations FSM.stop in State1.*transition and FSM.stop in State3.*transition of node (a) evaluate to true in both the cex and the sat instance, so we give (a) the score 0 (i.e., no changes). We assign score 1 to (b) because State3.transition = none evaluates to false in the cex but true in the sat instance (thus 1 change), and State1.transition = none evaluates to false in both (no change).

Overall, FLACK obtains the scores 0, 1, 0, 1 for nodes (a), (b), (c), (d), respectively. Thus, FLACK determines that nodes (b) s.transition = none and (d) s.transition = none => s in FSM.stop are the two most suspicious boolean subexpressions within the expression on line 19.

d) Ranking Relational Nodes:
While subexpression (d) indeed contains the error, it receives the same score as subexpression (b). To achieve more accurate results, FLACK further analyzes the involved relations. FLACK instantiates these relations with State1 and State3, evaluates them in the context of the cex and sat instances, and assigns scores based on the evaluations. For example, node (d) s.transition = none => s in FSM.stop contains 3 relations: (1) s.transition, (2) s, and (3) FSM.stop. Instantiating these relations with State3 and evaluating them using the cex yields the following: (1) becomes {State1}, (2) {State3}, and (3) {State3}. Thus, for the cex, (d) involves both State1 and State3, and FLACK gives it a score of 1. Next, it evaluates the instantiations using the sat instance: (1) becomes {}, (2) {State3}, and (3) {State3}. Here, (d) does not involve State1 and thus has a score of 0. FLACK assigns (d) the average score of 0.5 (for the instantiations of State3). Performing a similar computation for the instantiation of State1, we obtain a score of 2/3 for (d), as the evaluation for the cex and sat instances involves both State1 and State3 (differentiated values) and State2 (a regular value). Thus, (d) has a score of 0.58 as the average of 0.5 and 2/3.

Overall, FLACK obtains the scores 0.5, 0.25, 0.5, 0.58 for (a), (b), (c), and (d), respectively. Note that node (d) is now ranked higher than (a), as desired.

e) Suspicious Scores:
FLACK computes the final suspicion score of a node as the sum of the boolean and relational scores of that node, as shown above. For example, node (d) s.transition = none => s in FSM.stop in the expression on line 19 has the highest suspicion score of 1.58. Table I shows the suspicion scores of the expressions in the ranked list returned by FLACK.

In addition, FLACK analyzes (non-atomic) nodes containing (boolean) connectors and reports connectors that join subnodes with different scores. For example, FLACK suggests that the operator => is likely responsible for the error in (d) because the two subexpressions s.transition = none and s in FSM.stop have different scores, as shown in Table I. Indeed, in this example, the assertion violation is entirely due to this operator (a potential fix would be strengthening the model to <=> or switching the two subexpressions, as Alloy does not have the operator <=).

III. APPROACH
Figure 3 gives an overview of FLACK, which takes as input an Alloy model with some violated assertion and returns a ranked list of suspicious expressions contributing to the assertion violation. The insight guiding our research is that the differences between counterexamples that do not satisfy the assertion and closely related satisfying instances can drive the localization of suspicious expressions in the input model. To achieve this, FLACK uses the Alloy Analyzer to find counterexamples showing the violation of the assertion. It then uses a PMAX-SAT solver to find satisfying instances that are as close as possible to the cex's. Next, FLACK analyzes the differences between the cex's and satisfying instances to find expressions in the model that likely cause the errors. Finally, FLACK computes and returns a ranked list of suspicious expressions.

(While the example in Section II has only two expressions with similar scores, more complex, real-world models yield many expressions with similar scores. Thus, the fine-grained ranking step is crucial to distinguish the buggy expressions from the rest.)

Fig. 3: Overview of FLACK.
A. The Alloy Analyzer
An Alloy specification or model consists of three components: (i) type signatures (sig) define essential data types, and their fields capture relationships among those data types; (ii) facts (fact), predicates (pred), and assertions (assert) are formulae defining constraints over the data types; and (iii) run and check are commands to invoke the Alloy Analyzer. The check command is used to find counterexamples violating some asserted property, and run finds satisfying model instances (sat instances). For a model M and a property p, a cex is an instance of M that satisfies M ∧ ¬p, and a sat instance is one that satisfies M ∧ p. The specification in Figure 1 defines two signatures (FSM, State), three fields (start, stop, transition), three facts (OneStartAndStop, ValidStartAndStop, Reachability), and one assertion (NoStopTransition).

Analysis of specifications written in Alloy is entirely automated, yet bounded up to user-specified scopes on the sizes of type signatures. More precisely, to check that p is satisfied by all instances of M (i.e., p is valid) up to a certain scope, the Alloy developer encodes p as an assertion and uses the check command to validate the assertion, i.e., to show that no cex exists within the specified scope (a cex is an instance I such that I ⊨ M ∧ ¬p). To check that p is satisfied by some instance of M, the Alloy developer encodes p as a predicate and uses the run command to analyze the predicate, i.e., to search for a sat instance I such that I ⊨ M ∧ p. In our running example, the check command examines the NoStopTransition assertion and returns the cex in Figure 2a.

Internally, Alloy converts these instance-finding tasks into boolean formulae and uses a SAT solver to check their satisfiability. Each value of each relation is translated to a distinct variable in the boolean formula. For example, given a scope of 5 in the FSM model in Figure 1, the relation State contains 5 values and is translated to 5 distinct variables in the boolean formula, and the transition relation is translated to 25 variables representing the 25 combinations of ‖State‖ × ‖State‖. An instance is an assignment to all variables that makes the formula true. For example, the cex in Figure 2a is an assignment where all variables corresponding to values in Table II are assigned true and all others false. Finally, Alloy translates the result from the SAT solver, i.e., an assignment that makes the boolean formula true, back to an instance of M.

Algorithm 1: FLACK fault localization process
    input : Alloy model M, property p not satisfied by M
    output: Ranked list of suspicious expressions in M
    AlloySolver ← AlloyAnalyzer(M, p)
    pairs ← ∅
    while |pairs| < max_instance_pairs do
        c ← AlloySolver.gencex()
        AlloySolver.blockcex(c)
        s ← PMaxSolver(M, c)
        if s = nil then
            U ← AlloySolver.get_unsatcore()
            return unsat_analyzer(M, U, c)
        pairs ← pairs ∪ {(c, s)}
    end
    diffs ← comparator(pairs)
    return diffs_analyzer(diffs)

B. The FLACK Algorithm
Algorithm 1 shows the algorithm of FLACK, which takes as input an Alloy model M and a property p that is not satisfied by M (as an assertion violation) and returns a ranked list of expressions that likely contribute to the assertion violation. FLACK first uses the Alloy Analyzer and the Pardinus PMAX-SAT solver to generate pairs of cex's and closely similar sat instances. FLACK then analyzes the differences between the cex and sat instances to locate the error. If FLACK cannot generate any sat instance, FLACK inspects the unsat core returned by the Alloy Analyzer to locate the error.
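As a concrete sketch, the driver loop of Algorithm 1 can be written in Python. The callback parameters below (gen_cex, block_cex, pmax_solve, get_unsat_core, comparator) are hypothetical stand-ins for the Alloy Analyzer and Pardinus interfaces, not FLACK's actual API:

```python
# Hypothetical sketch of Algorithm 1's driver loop; all callbacks are
# stand-ins for the Alloy Analyzer / Pardinus interfaces.

MAX_INSTANCE_PAIRS = 3  # illustrative bound on the number of pairs

def flack_loop(gen_cex, block_cex, pmax_solve, get_unsat_core, comparator):
    """Collect (cex, sat) pairs; fall back to the unsat core if needed."""
    pairs = []
    while len(pairs) < MAX_INSTANCE_PAIRS:
        cex = gen_cex()                  # next counterexample, or None
        if cex is None:
            break
        block_cex(cex)                   # do not regenerate this cex
        sat = pmax_solve(cex)            # closest satisfying instance
        if sat is None:                  # property conflicts with the model
            return ("unsat-core", get_unsat_core())
        pairs.append((cex, sat))
    return ("diffs", comparator(pairs))
```

When the PMAX-SAT step fails to find any satisfying instance, the loop returns the unsat core for the UNSAT core analyzer (Section III-B4); otherwise the collected pairs are handed to the comparator.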
1) Generating Instances:
To understand why M does not satisfy p, FLACK obtains the differences between cex's and relevant sat instances. These differences can lead to the cause of the error. One option is to use the Alloy Analyzer to generate a sat instance directly (e.g., by checking a predicate consisting of p). However, such an instance generated by Alloy is often predominantly different from the cex and thus does not help identify the main difference. For example, the cex shown in Figure 2a, which violates the assertion NoStopTransition, is quite different from the two Alloy-generated satisfying instances shown in Figure 4.

Fig. 4: Model instances generated by the Alloy Analyzer.

To generate a sat instance closely similar to the cex, we reduce the problem to a PMAX-SAT (partial maximum satisfiability) problem.

Definition III.1 (Finding a Similar Sat Instance from a Cex). Given a set of hard clauses, collectively specified by model M and property p, and a set of soft clauses, specified by a counterexample cex, find a solution that satisfies all hard clauses and as many soft clauses as possible.

More specifically, the hard clauses are generated from the constraints in the Alloy model, and the soft clauses are the assignment represented by the cex, where all present variables are true and all other variables are false. Because the relations and scope of the Alloy model stay the same, the variables in the transformed boolean formula also remain the same; only the values assigned to them differ between model instances. Thus, this encoding applies to general instances regardless of their structure.

FLACK then uses an existing PMAX-SAT solver (Pardinus) to find an instance that has the property p and is as similar to the cex as possible. For example, the instance in Figure 2b generated by Pardinus is similar to the cex in Figure 2a, except for the edge from State3 to State1. The idea is that such (minimal) differences can help FLACK identify the error.
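To make Definition III.1 concrete, the sketch below solves a tiny partial-MaxSAT instance by brute force in Python. Pardinus uses far more sophisticated algorithms, and the two-variable encoding here is invented purely for illustration:

```python
from itertools import product

# Clauses are DIMACS-style lists of non-zero ints: v means variable v is
# true, -v means it is false. Hard clauses encode the model and property;
# soft unit clauses encode the cex's variable assignment.

def partial_max_sat(num_vars, hard, soft):
    """Brute force: satisfy every hard clause, maximize satisfied soft ones."""
    best, best_score = None, -1
    for bits in product([False, True], repeat=num_vars):
        assign = {i + 1: b for i, b in enumerate(bits)}
        holds = lambda lit: assign[abs(lit)] == (lit > 0)
        if not all(any(holds(l) for l in clause) for clause in hard):
            continue                      # a hard clause is violated
        score = sum(1 for clause in soft if any(holds(l) for l in clause))
        if score > best_score:
            best, best_score = assign, score
    return best

# Toy encoding: x1 = "edge State3->State1 exists", x2 = "State3 is a stop
# state". The property "no stop state has an outgoing transition" yields
# the hard clause (not x1 or not x2); the cex sets both variables to true,
# yielding the soft clauses [x1] and [x2]. The returned instance keeps as
# much of the cex as the property allows.
inst = partial_max_sat(2, hard=[[-1, -2]], soft=[[1], [2]])
```

The returned assignment satisfies the hard clause while preserving one of the two soft (cex) clauses, mirroring how Pardinus keeps the sat instance close to the cex.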
2) Comparator:
FLACK compares the generated cex's and sat instances to obtain their differences, which involve atoms, tuples, and relations. First, it obtains the tuples, and their atoms, that differ between the cex and the sat instance; e.g., in Figure 2, the tuple State3->State1, which has the atoms State1 and State3, is in the cex but not in the satisfying instance. Next, it obtains the relations with different tuples between the cex and the sat instance; e.g., the transition relation involves the tuple State3->State1 in the cex but not in the sat instance. Third, it obtains the relations that can be inferred from the tuples and atoms derived in the previous steps; e.g., the relation FSM.stop involves tuples having the State3 atom. In summary, for the pair of cex and sat instance in Figure 2, FLACK obtains the suspicious relations transition and stop and the atoms State1 and State3. FLACK applies these comparison steps to all pairs of cex's and sat instances and uses the common results.
3) Diff Analyzer:
After obtaining the differences, consisting of relations and atoms, between cex's and sat instances, FLACK analyzes them to obtain a ranked list of expressions based on their suspicion levels. FLACK assigns higher suspicion scores to expressions whose evaluations depend on these differences (and lower scores to those that do not depend on them).

Algorithm 2 shows the Diff Analyzer algorithm, which takes as input a model M, the differences diffs obtained in Section III-B2, and the pairs of cex's and sat instances obtained in Section III-B1 (based on our experiments, the first solution returned by the PMAX-SAT solver is similar enough for FLACK to locate bugs), and outputs a ranked list of suspicious expressions in M. It first identifies expressions in M that involve relations in diffs. These expressions are likely related to the difference between the cex and sat instance. For example, consider the model in Figure 1: FLACK identifies the two expressions all s: State | FSM.stop in s.*transition on line 25 and all s: State | s.transition = none => s in FSM.stop on line 19, as they involve the relations transition and stop in diffs.

Algorithm 2: Diff Analyzer
    input : Alloy model M, differences diffs, pairs of cex's and sat instances pairs
    output: Ranked list of suspicious expressions in M
    exprs ← get_susp_exprs(M, diffs)
    results ← {}
    foreach expr ∈ exprs do
        computescore(expr, results)
    end
    return sort(results)

    Function computescore(expr, results):
        score ← 0
        if isleaf(expr) then
            isexpr ← instantiate(expr, diffs)
            foreach (c, s) ∈ pairs do
                cvals ← eval(c, isexpr)
                svals ← eval(s, isexpr)
                instscore ← 0
                if diffs ⊆ cvals then instscore ← instscore + |diffs| / |cvals|
                if diffs ⊆ svals then instscore ← instscore + |diffs| / |svals|
                score ← score + instscore
            end
            score ← score / |pairs|
        else
            if isbool(expr) then
                isexpr ← instantiate(expr, diffs)
                foreach (c, s) ∈ pairs do
                    if eval(c, isexpr) ≠ eval(s, isexpr) then score ← score + 1
                end
            foreach child ∈ getchildren(expr) do
                score ← score + computescore(child, results)
            end
        results ← results ∪ {(expr, score)}
        return score

FLACK then recursively computes the suspicion score for each collected expression e, represented as an AST. If e is a leaf (e.g., a relational expression), FLACK instantiates e with atoms from diffs. FLACK then evaluates the instantiated expression for each pair of cex and sat instance. If the evaluated result for an instantiated expression contains all atoms involved in diffs, FLACK computes the score as "size of diffs / size of evaluated results"; otherwise, the score is 0. For a pair, the score is then the average of the scores for the cex and the sat instance. Finally, the score of e is the average over all pairs. Essentially, a higher suspicion score is assigned to a relational subexpression whose evaluation involves many atoms in diffs.

If e is not a leaf node, e's score is the sum of its boolean and relational scores. If e is a boolean expression (i.e., an expression that returns true or false), we instantiate e with atoms from diffs and evaluate it on each pair of cex and sat instance. If it evaluates to different results on the cex and the sat instance (e.g., one is true and the other is false), FLACK increases e's score by 1. Thus, a higher boolean score is assigned to expressions whose evaluations do not match between pairs of cex and sat instances. Then, e's relational score is calculated as the sum of the scores of e's children. The final score assigned to each expression is the sum of e's boolean score and the relational scores of e's children. In the end, FLACK returns all expressions ranked by their suspicion scores.

To make the idea concrete, consider the expression s.transition = none in Figure 1. For the cex and sat instance pair in Figure 2, diffs contains two atoms, State1 and State3. FLACK first instantiates the expression under analysis with these atoms into two concrete expressions: (1) State1.transition = none and (2) State3.transition = none. The concrete expression (1) evaluates to false in both the cex and the sat instance, while the concrete expression (2) evaluates to false in the cex and true in the sat instance. Thus, the boolean score for the expression under analysis is 1, as the aggregation of the values obtained for the concrete expressions (1) and (2).

FLACK then computes the relational score for the expression under analysis as the sum of the relational scores of its children, s.transition and none, both of which are leaves. To compute the score for s.transition, it is instantiated to State1.transition and State3.transition. State1.transition evaluates to State2 in both the cex and the sat instance; thus, it gets a score of 0. For State3.transition, in the cex it evaluates to State1 and gets a score of 1, computed as the size of the set of differing values {State3, State1} divided by the combined size of the instantiated value {State3} and the evaluated value {State1}. In the sat instance, it evaluates to the empty set and gets a score of 0. Overall, s.transition gets a relational score of 0.25 as the average over its instantiated expressions: State1.transition (0) and State3.transition (0.5). Finally, the overall score of 1.25 is assigned to the expression s.transition = none as the aggregation of its boolean and relational scores.
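The worked example above can be replayed in a few lines of Python. The score formulas below reconstruct the computation described for this specific node and pair; they are illustrative, not FLACK's general implementation:

```python
# Reconstruction of scoring "s.transition = none" on the Figure 2 pair.
# Each instance maps a state to the set of its transition targets.
cex = {"State0": {"State1", "State3"}, "State1": {"State2"},
       "State2": {"State3"}, "State3": {"State1"}}
sat = {"State0": {"State1", "State3"}, "State1": {"State2"},
       "State2": {"State3"}, "State3": set()}
diffs = {"State1", "State3"}   # atoms differentiating cex and sat instance

# Boolean score: +1 per instantiation whose truth value flips between
# the cex and the sat instance.
bool_score = sum((cex[s] == set()) != (sat[s] == set()) for s in diffs)

# Relational score of the child "s.transition": an instance contributes
# |diffs| / |{s} union eval| when the evaluation covers all differing
# atoms, else 0; scores are averaged per pair and across instantiations.
def rel_score(inst, s):
    vals = {s} | inst[s]
    return len(diffs) / len(vals) if diffs <= vals else 0.0

rel = sum((rel_score(cex, s) + rel_score(sat, s)) / 2 for s in diffs) / len(diffs)
total = bool_score + rel
```

This reproduces the scores from the text: a boolean score of 1, a relational score of 0.25 for s.transition, and an overall score of 1.25 for the node.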
4) UNSAT Core Analyzer:
It is possible that we can onlygenerate cex’s, but no sat instances, indicating that someconstraints in the model have conflicts with the property wewant to check. To identify these constraints,
FLACK inspectsthe unsat core returned by the Alloy Analyzer. The unsat coreexplains why a set of constraints cannot be satisfied by givinga minimal subset of conflicting constraints. Those conflictingconstraints can help
FLACK identify suspicious expressions.Algorithm 3 outlines the process underlying our UNSATcore analyzer, which takes as input a model M , an unsat core U , and a cex c showing M does not satisfy a property p , andoutputs a list of expressions in M conflicting with p . Recallthat these values, M , U , and c , are earlier inferred by FLACK as outlined in Algorithm 1.
Algorithm 3: UNSAT Analyzer
  input : Alloy model M, unsat core U, counterexample c
  output: a set of expressions in M
  M′ ← slice(M, U)
  s ← PMaxSolver(M′, c)
  diffs ← comparator({(c, s)})
  exprs ← collect_exprs(U)
  conflicts ← ∅
  foreach expr ∈ exprs do
      foreach diff ∈ diffs do
          if eval(expr, diff, M′) = false then
              conflicts ← conflicts ∪ {expr}
  return conflicts

FLACK starts by producing a sliced model M′, in which all expressions in the unsat core are omitted from the original model M. Removing these conflicting expressions allows us to obtain sat instances from the new model M′ to compare with the cex. FLACK then generates a minimal sat instance from M′ and compares it with the input cex to obtain the differences between the cex and the sat instance, as shown in Section III-B2. Next, FLACK attempts to identify which of the removed expressions really conflict with p by evaluating them on the obtained differences. The idea is that if an expression evaluates to true, then adding that expression back to the model would still allow the sat instance to be generated, i.e., that expression does not conflict with p. Thus, the expressions that evaluate to false are the ones conflicting with p, and they are returned as suspicious expressions. Note that we assign similar scores to these resulting expressions because they all contribute to the unsatisfiability of the original model and the intended property.

For example, if we change line 17 in the model shown in Figure 1 to all s: State | s.transition !in FSM.start, Alloy would find counterexamples such as the one in Figure 2a but fail to generate any sat instances. This is because the modified line forces all states to have some transitions, which conflicts with the constraint requiring no transition for stop states.

From the unsat core, FLACK identifies four expressions in the model: (a) all start1, start2 : FSM.start | start1 = start2, (b) some FSM.stop, (c)
FSM.start !in FSM.stop, and (d) all s: State | s.transition !in FSM.start. After removing these four expressions from the model, FLACK can now generate the same sat instance shown in Figure 2b. As before, the main difference between the cex and the sat instance involves two values: State1 and State3. FLACK then evaluates each expression using these values. Expressions (a), (b), and (c) evaluate to true for both values, while expression (d) evaluates to false for State3. Thus, FLACK correctly identifies (d) as the suspicious expression.

IV. EVALUATION

FLACK is implemented in Java 8 and uses Alloy 4.2. We extend the backend Kodkod solver [30] in Alloy to use the Pardinus solver [28] to obtain sat instances similar to counterexamples. We also modify the AST expression representation in Alloy to collect and assign suspiciousness scores to boolean and relational subexpressions.

Our evaluation addresses the following research questions:
• RQ1: Can FLACK effectively find suspicious expressions?
• RQ2: How does FLACK scale to large, complex models?
• RQ3: How does FLACK compare to AlloyFL?

All experiments described below were performed on a Macbook with a 2.2 GHz i7 CPU and 16 GB of RAM.
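The "sat instance similar to a counterexample" generation delegated to Pardinus above can be viewed as a partial MaxSAT search: the model's constraints are hard clauses, and agreement with the cex's valuation is soft. The brute-force toy below is our own illustration of that idea on boolean variables, not the Pardinus/Kodkod API:

```python
# Toy illustration (not the Pardinus/Kodkod API) of finding a satisfying
# instance as close as possible to a counterexample, phrased as partial
# MaxSAT: hard clauses must hold; soft clauses reward agreeing with the cex.
from itertools import product

def closest_sat_instance(hard_clauses, cex, n_vars):
    """Brute-force partial MaxSAT over boolean assignments."""
    best, best_agree = None, -1
    for bits in product([False, True], repeat=n_vars):
        if not all(clause(bits) for clause in hard_clauses):  # hard: required
            continue
        agree = sum(b == c for b, c in zip(bits, cex))        # soft: match cex
        if agree > best_agree:
            best, best_agree = bits, agree
    return best

# Hard constraint x0 -> x1 (the checked property); the cex (x0=T, x1=F) violates it.
hard = [lambda v: (not v[0]) or v[1]]
print(closest_sat_instance(hard, (True, False), 2))  # → (False, False)
```

A real solver replaces the exponential enumeration with a PMAX-SAT procedure, but the hard/soft split is the same.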
A. RQ1: Effectiveness
To investigate the effectiveness of
FLACK, we use the benchmark models from AlloyFL [19]. Table III shows 152 buggy models collected from 12 Alloy models in AlloyFL. These are real faults collected from Alloy release 4.1, Amalgam [31], and Alloy homework solutions from graduate students. Briefly, these models are addr (address book) and farmer (farmer cross-river puzzle) from Alloy; bempl (bad employer), grade (grade book), and other (access-control specifications) from Amalgam [31]; and arr (array), bst (balanced search tree), ctree (colored tree), cd (class diagram), dll (doubly linked list), fsm (finite state machine), and ssl (sorted singly linked list) from the AlloyFL homework models.

For models with assertions (e.g., from Amalgam [31]), we use those assertions for the experiments. For models that do not have assertions (e.g., homework assignments), we manually create assertions and expected predicates by examining the correct versions or suggested fixes (provided by [19]). Moreover, from the correct models or suggested fixes, we know which expressions contain errors and therefore use them as ground truths to compare against FLACK's results. FLACK deals with models containing multiple violated assertions by analyzing them separately and returning a ranked list for each assertion. For illustration purposes, we simulate this by simply splitting models with separate violations into separate models (e.g., bst2 contains two assertion violations and is thus split into two models, bst2 and bst2_1). Finally, FLACK is highly automatic and has just one user-configurable option: the number of pairs of cex and satisfying instances (which by default is set to 5 based on our experience).

a) Results:
Table III shows FLACK's results. For each model, we list the name, the lines of code, the number of nodes that FLACK determined irrelevant and sliced out, and the number of total AST expression nodes. The last two columns show FLACK's resulting ranking of the correct node and its total run time in seconds. The 28 italicized models contain predicate violations, while the other 124 models contain assertion violations. FLACK automatically determines the violation type and switches to the appropriate technique (e.g., using the comparator for assertion errors and the unsat analyzer for predicate violations; Section III). Finally, the models are listed in sorted order based on their ranking results.

TABLE III: Results of FLACK on 152 Alloy models. Results are sorted based on ranking accuracy. Times are in seconds.

model loc total sliced rank time | model loc total sliced rank time | model loc total sliced rank time
top 1 (91) 41 120 95 1 0.2 | ssl10 43 155 110 1 0.1 | dll20_2 36 88 47 2 0.0
addr 21 74 10 1 0.6 | ssl12 40 157 114 1 0.1 | fsm6 29 98 17 2 0.0
arr3 24 48 9 1 0.1 | ssl14 44 158 149 1 0.5 | fsm9_2 29 90 18 2 0.1
arr4 24 64 61 1 0.2 | ssl14_1 44 158 149 1 0.5 | ssl11 42 177 127 2 0.1
arr5 24 62 59 1 0.2 | ssl14_2 43 153 120 1 0.0 | bst8 59 134 57 3 0.3
arr6 25 56 30 1 0.3 | ssl14_3 44 153 108 1 0.1 | bst8_1 59 134 57 3 0.2
arr7 25 63 50 1 0.1 | ssl17 41 152 119 1 0.0 | bst22_1 49 199 124 3 0.1
bst2 56 134 56 1 0.4 | ssl17_1 42 152 106 1 0.1 | dll1_1 38 86 57 3 0.1
bst2_1 56 134 95 1 0.3 | ssl18_1 40 160 118 1 0.1 | dll18_2 36 107 71 3 0.6
bst3_2 55 141 68 1 0.2 | ssl18_2 49 160 85 1 0.3 | fsm4 31 141 39 3 0.0
cd1 27 44 33 1 0.0 | ssl19 40 169 141 1 0.1 | fsm5_2 29 69 17 3 0.0
cd1_1 27 44 31 1 0.0 | arr1 24 45 31 1 0.1 | ssl2_1 44 156 72 3 0.1
cd2 27 35 25 1 0.0 | arr2 25 60 47 1 0.2 | ssl13 43 174 123 3 0.1
cd3 26 43 32 1 0.0 | arr10 24 59 45 1 0.0 | ssl18 40 162 131 3 0.0
cd3_1 26 46 31 1 0.0 | bst1 51 171 155 1 0.2 | arr8 25 80 15 4 0.5
dll1 37 77 63 1 0.1 | bst4_1 52 163 147 1 0.1 | bst2_2 47 147 92 4 0.1
dll2 42 77 63 1 0.1 | bst5 52 184 168 1 0.1 | bst3 57 137 97 4 0.1
dll3 37 80 59 1 0.0 | bst7 52 159 143 1 0.1 | fsm2 29 70 14 4 0.0
dll3_1 37 75 49 1 0.1 | bst8_2 54 156 140 1 0.1 | fsm8 29 71 14 4 0.1
dll4 37 81 67 1 0.1 | bst9 55 185 169 1 0.1 | arr11 24 83 41 5 0.1
dll5_1 39 102 80 1 0.1 | bst10 47 157 140 1 0.6 | bst10_3 55 162 75 5 0.3
dll6 36 113 94 1 0.1 | bst10_2 52 172 156 1 0.1 | fsm9_1 29 91 18 5 0.1
dll7_1 36 73 59 1 0.1 | bst11_1 60 214 198 1 0.1 | ssl15 41 161 106 5 0.1
dll8 36 96 76 1 0.1 | bst13 53 200 184 1 0.1 | top 6-10 (10) 51 155 73 7.6 0.2
dll9 38 100 91 1 0.1 | bst14 59 202 186 1 0.1 | bst3_1 57 137 58 6 0.2
dll11 36 87 68 1 0.1 | bst15 53 197 181 1 0.1 | bst20_1 55 152 61 6 0.2
dll12 36 77 63 1 0.1 | bst17_1 52 201 185 1 0.1 | fsm7 29 64 14 6 0.1
dll13 36 60 51 1 0.0 | bst18_1 56 204 188 1 0.1 | ssl19_1 50 175 106 7 0.6
dll14_1 37 85 71 1 0.1 | bst20 52 169 153 1 0.1 | bst6 51 140 61 8 0.2
dll15 40 126 107 1 0.1 | bst21 52 182 166 1 0.2 | bst12_1 56 164 68 8 0.2
dll16 36 82 68 1 0.1 | bst22 50 213 158 1 0.6 | bst19_2 55 154 86 8 0.2
dll17_1 36 77 63 1 0.1 | dll7 38 90 79 1 0.1 | ssl19_2 44 174 81 8 0.3
dll18 36 102 84 1 0.0 | dll10 40 91 85 1 0.2 | bst17_2 55 177 79 9 0.2
dll18_1 36 101 67 1 0.1 | dll14 39 102 91 1 0.1 | bst22_2 54 209 118 10 0.1
dll20 36 90 64 1 0.0 | dll17 37 89 83 1 0.1 | >10 (6) 51 172 80 12.7 0.3
farmer 99 124 30 1 1.6 | fsm1 31 90 71 1 0.0 | bst4_2 51 154 61 11 0.3
fsm1_1 30 90 19 1 0.0 | fsm3 60 67 56 1 0.7 | bst16 64 181 79 12 0.3
fsm7_1 29 59 14 1 0.0 | fsm9 30 91 79 1 0.5 | bst17 48 186 113 12 0.2
fsm9_4 30 91 78 1 0.1 | fsm9_3 32 91 78 1 0.1 | ssl12_1 44 161 78 12 0.4
grade 33 22 7 1 0.0 | top 2-5 (35) 40 120 66 2.9 0.2 | ssl9 44 153 72 13 0.1
ssl1 40 168 126 1 0.1 | arr7_1 24 46 9 2 0.9 | bst22_3 57 199 74 16 0.5
ssl2 40 150 122 1 0.1 | arr9 27 83 43 2 0.2 | fail (10) 42 104 71 - 0.1
ssl3 45 188 135 1 0.0 | bst2_3 52 133 68 2 0.1 | bst4 46 145 95 - 0.1
ssl3_1 44 184 121 1 0.2 | bst12 47 146 94 2 0.2 | bst11 54 196 135 - 0.2
ssl4 40 146 118 1 0.1 | bst18 51 187 115 2 1.4 | bempl 51 14 7 - 0.0
ssl5 42 188 160 1 0.6 | bst19 47 158 112 2 0.1 | ctree 30 49 5 - 0.0
ssl6 42 157 148 1 0.7 | bst19_1 52 175 112 2 0.1 | dll3_2 36 75 67 - 0.0
ssl6_1 42 157 148 1 0.5 | dll2_1 42 82 57 2 0.0 | other 34 26 19 - 0.0
ssl6_2 41 152 119 1 0.1 | dll17_2 36 82 57 2 0.0 | ssl16 39 137 119 - 0.0
ssl6_3 42 152 107 1 0.1 | dll18_3 36 103 78 2 0.0 | ssl16_1 39 133 115 - 0.0
ssl7 41 136 95 1 0.1 | dll19 36 83 63 2 0.1 | ssl16_2 47 136 74 - 0.3
ssl7_1 40 135 110 1 0.1 | dll20_1 36 88 68 2 0.1 | ssl16_3 43 133 71 - 0.2
ssl8 43 166 119 1 0.1 | |

In summary, FLACK was able to rank the buggy expression in the top 1 (i.e., the buggy expression is ranked first) for 91 (60%), in the top 2 to 5 for 35 (23%), in the top 6 to 10 for 10 (7%), and above the top 10 for 6 (4%) of the 152 models. For 10 models, FLACK was not able to identify the cause of the errors (i.e., the buggy expressions are not in the ranking list); many of these are beyond the reach of FLACK (e.g., the assertion error is not due to any existing expression in the model, but rather because the model is "missing" some constraints). Finally, regardless of whether
FLACK succeeds or fails, the tool produces the results almost instantaneously (under a second).

b) Analysis: FLACK was able to locate and rank the buggy expressions in 142/152 models. Many of these bugs are common errors in which the developer did not consider certain corner cases. For example, stu5 contains the buggy expression all n : This.header.*link | n.elem <= n.link.elem, which does not allow any node without a link (the fix is changing it to all n : This.header.*link | some n.link => n.elem <= n.link.elem). FLACK successfully recognizes the difference that the last node of the list contains a link to itself in the cex but not in the sat instance, and ranks this expression second; more importantly, it ranks first the subexpression n.elem <= n.link.elem, where the fix is actually needed. FLACK also performed especially well on the 28 models with violated predicates, analyzing their unsat cores and correctly ranking the buggy expressions first.

For six models, bst4_2, bst16, bst17, bst22_3, ssl9, and ssl12_1, FLACK was not able to place the buggy expression within the top 10 (but still within the top 16). For these models, FLACK obtains differences that are not directly related to the error but consistently appear in both the cex and sat instances and therefore confuse FLACK.

FLACK was not able to identify the correct buggy expressions in 10 models, i.e., the resulting ranking list does not contain the buggy expressions. Most of these bugs are beyond the scope of FLACK (and of fault localization techniques in general). More specifically, the 9 models bst4, bst11, bempl, ctree, dll3_2, ssl16, ssl16_1, ssl16_2, and ssl16_3 have assertion violations due to missing constraints in predicates and thus do not contain buggy expressions to be localized. For other, FLACK did not find the "ground truth" buggy expression (the buggy expression does not contain the differing relation) but ranked first another expression that could also be modified to fix the error.

TABLE IV: FLACK's results on large complex models

model                 loc   total  sliced  rank  time(s)
surgical robots        200    293     278     2      2.3
android permissions    297   1138     673     2      5.2
sll-contains          5250   5562    5510     3      3.0
count-nodes           3791   5064    2861    18    188.7
remove-nth            5063   6336    3306    12   1265.0
B. RQ2: Real-world Case Studies
The AlloyFL benchmark contains a wide variety of Alloy models and bugs, but the models are relatively small (≈50 LoC). To investigate the scalability of FLACK, we consider additional case studies on larger and more complex Alloy models.

a) Surgical Robots: The study in [32] uses Alloy to model highly-configurable surgical robots to verify a critical arm movement safety property: the position of the robot arm is in the same position that the surgeon articulates in the control workspace during the surgery procedure, and the surgeon is notified if the arm is pushed outside of its physical range. This property is formulated as an assertion and checked on 15 Alloy models representing 15 types of robot arms using different combinations of hardware and software features. The study found that 5 models violate the property.

Table IV, which has the same format as Table III, lists the results. We use all 5 buggy models (each has about 200 LoC) but list them under one row because they are largely similar and share many common facts and predicates, differing only in configurable values (e.g., one model has a fact that sets AngleLmit to 3 while another uses the value 6). The buggy expression is also similar and appears in the same fact. For each model,
FLACK ranked the correct buggy expression in second place in less than 3 seconds. FLACK returned two suspicious expressions: (1) HapticsDisabled in UsedGeomagicTouch.force and (2) some notification : GeomagicTouch.force | notification = HapticsEnabled. Modifying either expression would fix the issue, e.g., changing Disabled to Enabled in (1) or Enabled to Disabled in (2).

b) Android Permissions: The COVERT project [33] uses compositional analysis to model the permissions of the Android OS and apps to find inter-app communication vulnerabilities. The generated Alloy model used in this work does not contain bugs violating assertions, so we used the MuAlloy mutation tool [34] to introduce 5 (random) bugs into various predicates in the model: 3 binary operator mutations, 1 unary operator mutation, and 1 variable declaration mutation.

Table IV shows the result. FLACK was able to locate 4 buggy expressions (the unary modification ranked 2nd; the binary operator mutations ranked 3rd, 9th, and 11th), but could not identify the other mutated expression (the variable declaration mutation). However, after manual analysis, we realized that this mutated expression does not contribute to the assertion violation (i.e., FLACK is correct in not identifying it as a fault).

c) TACO: The TACO (Translation of Annotated COde) project [2] uses Alloy to verify Java programs with specifications. TACO automatically converts a Java program annotated with invariants into an Alloy model with an assertion. If the Java program contains a bug that violates the annotated invariant, then checking the assertion in the Alloy model produces a counterexample. We use three different Alloy models with violated assertions representing three real Java programs from TACO [2]: sll-contains checks if a particular element exists in a linked list; count-nodes counts the number of a list's nodes; and remove-nth removes the nth element of a list. These (machine-generated) models are much larger than typical Alloy models (see Table IV for their sizes). For sll-contains, FLACK ranked the buggy expression third within 3 seconds. This expression helped us locate an error in the original Java program that skips the list header. The faulty expressions of remove-nth and count-nodes are ranked 12th and 18th, respectively (still quite reasonable given the large number of possible locations). Note that these buggy expressions consist of multiple errors (e.g., having 5 buggy nodes), causing FLACK to instantiate and analyze combinations of a large number of subexpressions.

Manual analysis of the identified buggy expressions showed that the original Java programs contain (single) bugs within loops. TACO performs loop unrolling and thus spreads each single bug into multiple bugs in the corresponding Alloy models.

In summary, we found that
FLACK works well on large, real-world Alloy models. While coming up with correct fixes for these models remains nontrivial, FLACK can help developers (or automatic program repair tools) quickly locate buggy expressions, which in turn helps in understanding (and hopefully repairing) the actual errors in the original models.
C. RQ3: Comparing to AlloyFL
We compare
FLACK with AlloyFL [19], which, to the best of our knowledge, is the only Alloy fault localization technique currently available. While both tools compute suspicious statements, they are very different in both assumptions and technical approaches. As discussed in Section V, AlloyFL requires AUnit tests [25], provided by the user or generated from the correct model, and adopts existing fault localization techniques from imperative programs, such as mutation testing and spectrum-based fault localization; in contrast, FLACK uses violated assertions and relies on counterexamples.

To apply AlloyFL to the 152 benchmark models, we use the best-performing configuration and test suites described in [19]. Specifically, we use the AlloyFL hybrid algorithm with the Ochiai formula and reuse the tests from [19] (automatically generated by MuAlloy [34] as described in [19]).

TABLE V: Comparison with AlloyFL.

tool      top 1   top 5   top 10   >10   failed   avg rank   time(s)
FLACK        91     126      136     6       10        2.4       0.2
AlloyFL      76     128      137     8        7        3.1      32.4

Table V compares the results of
FLACK and AlloyFL. The two approaches appear to perform similarly, with FLACK being slightly more accurate. Overall, FLACK outperforms AlloyFL: on average, the buggy expressions are ranked 2nd by FLACK and 3rd by AlloyFL. Also, in top-1 ranking, FLACK performs much better than AlloyFL (91 versus 76 models). Moreover, FLACK is much faster: the average analysis time is far less than a second for FLACK, while it takes over 30 seconds for AlloyFL to analyze the same specifications.

We were not able to apply AlloyFL to the models in Section IV-B because MuAlloy [34], which is used to generate AUnit tests for AlloyFL, does not work with these models (mostly due to unhandled Alloy operators). This is not a weakness of AlloyFL, but it suggests that it is not trivial to generate tests from existing Alloy models automatically.
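For context, the Ochiai formula used in the AlloyFL configuration above is the standard spectrum-based suspiciousness metric; a minimal generic sketch (not AlloyFL's code) is:

```python
# The standard Ochiai suspiciousness formula used in spectrum-based fault
# localization (a generic sketch, not AlloyFL's implementation).
from math import sqrt

def ochiai(failed_cover, passed_cover, total_failed):
    """failed_cover/passed_cover: failing/passing tests covering the element;
    total_failed: all failing tests in the suite."""
    denom = sqrt(total_failed * (failed_cover + passed_cover))
    return failed_cover / denom if denom else 0.0

print(ochiai(4, 0, 4))  # element covered only by failing tests → 1.0
print(ochiai(0, 5, 4))  # element never covered by a failing test → 0.0
```

Elements covered mostly by failing tests score near 1 and are ranked as most suspicious.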
D. Threats to Validity
We assume no fault in data type (sig) and field declarations, which may limit the usage of FLACK. However, none of the benchmark models we used has bugs at these locations. Moreover, we could always translate constraints on sigs and fields into facts. For example, one sig A could be translated to sig A; fact { one A }.

The models in the AlloyFL benchmark are collected from graduate students' homework and are relatively small. Thus, they may not represent faulty Alloy models in the real world. We also evaluate FLACK with large Alloy models, written by experienced Alloy developers (the surgical robot models and the Android permissions model) or generated by an automatic tool (TACO), and show that FLACK performs well on these models (Section IV-B).

We manually create assertions for models that do not have assertions. Thus, our assertions might be inaccurate and not as intended. However, for other models with assertions (e.g., those in the AlloyFL benchmark and all the case studies), we use those assertions directly, and FLACK outputs similar results.
V. RELATED WORK

FLACK is related to AlloyFL [19], which adopts spectrum-based fault localization [22], [23], [24] and mutation-based techniques [20], [21] from imperative languages. Given AUnit tests [25] labeled as should-pass or should-fail, AlloyFL computes a suspiciousness score for an expression by mutating it and giving it a higher score if the mutation increases the number of should-pass tests that pass and the number of should-fail tests that fail. AlloyFL uses MuAlloy [34] to automatically generate tests. However, MuAlloy requires the correct Alloy model to generate these tests. FLACK does not require tests and instead uses assertions, which are commonly used in Alloy.

The generation of similar instances can be viewed as a model exploration problem [35]. Bordeaux [36] uses Alloy to find pairs of SAT/UNSAT instances with minimum relative distances. In contrast, FLACK reduces the generation of an instance as close as possible to the identified counterexample to a partial max-sat problem and solves it using a PMAX-SAT solver.

Amalgam [31] explains why some tuples of a relation do or do not appear in certain instances. A user manually selects a tuple to add or delete, and Amalgam tries to explain why they can or cannot do so (typically the reason for counterexamples is due to the assertions). FLACK instead automatically identifies why a counterexample fails and finds locations that relate to this violation.

Many fault localization techniques have been developed for imperative languages. Spectrum-based techniques [22], [23], [24], [37], [38], [29], [39], [40] identify faulty statements by comparing passing and failing test executions. SAT-based techniques [41], [42] convert the fault localization problem into a SAT problem. Statistics-based methods [43], [44] collect statistical information from test executions to locate errors. Feedback-based techniques [45], [46] interactively locate errors by getting feedback from the user. Delta debugging [47], [48] identifies code changes responsible for test failures. There is also work on minimizing differences in inputs, based on the assumption that similar inputs lead to similar runs [49], [50], [51]. Program slicing [52], [53], [54] has also been used to aid debugging.

Model-based diagnosis (MBD) approaches identify faulty components of a system based on abnormal behaviors. Griesmayer [55] applied MBD to localizing faults in imperative programs using a model checker. Marques-Silva [56] converted the MBD problem into a MAXSAT problem to find the minimal diagnosis, where the system description is encoded as hard clauses and the not-abnormal predicates as soft clauses. There has also been a similar line of work on pinpointing axioms in description logic [57], [58].
VI. CONCLUSION AND FUTURE WORK
We introduce a new fault localization approach for declarative models written in Alloy. Our insight is that Alloy expressions that likely cause an assertion violation can be obtained by analyzing the counterexamples, unsat cores, and satisfying instances from the Alloy Analyzer. We present FLACK, a tool that implements these ideas to compute and rank suspicious expressions causing an assertion violation in an Alloy model. FLACK uses a PMAX-SAT solver to find satisfying instances similar to counterexamples generated by the Alloy Analyzer, analyzes satisfying instances and counterexamples to locate suspicious expressions, analyzes subexpressions to achieve a finer level of localization granularity, and uses unsat cores to help identify conflicting expressions. Preliminary results on existing Alloy benchmarks and large, real-world benchmarks show that FLACK is effective in accurately finding the expressions causing errors. We believe that FLACK takes an important step toward finding bugs in Alloy and exposes opportunities for researchers to exploit new debugging techniques for Alloy.

Currently, we are improving the accuracy and efficiency of FLACK. Specifically, instead of using a default number of instance pairs, we can search for instances incrementally until the algorithm converges. We are also exploring new approaches to effectively integrate FLACK with automatic Alloy repair techniques. Preliminary results from the recent BeAFix work [59] show that FLACK accurately identifies faults in Alloy specifications, which in turn helps BeAFix automatically analyze and repair those specifications.
VII. DATA AVAILABILITY
We make
FLACK and all research artifacts, models, and experimental data reported in the paper available to the research and education community [26].
ACKNOWLEDGMENT
We thank the anonymous reviewers for their helpful comments. This work was supported in part by award W911NF-19-1-0054 from the Army Research Office; CCF-1948536, CCF-1755890, and CCF-1618132 from the National Science Foundation; and PICT 2016-1384, 2017-1979, and 2017-2622 from the Argentine National Agency of Scientific and Technological Promotion (ANPCyT).
REFERENCES

[1] D. Jackson, "Alloy: A lightweight object modelling notation," ACM Trans. Softw. Eng. Methodol., vol. 11, no. 2, pp. 256–290, Apr. 2002.
[2] J. P. Galeotti, N. Rosner, C. G. López Pombo, and M. F. Frias, "Analysis of invariants for efficient bounded verification," in International Symposium on Software Testing and Analysis, ser. ISSTA '10. New York, NY, USA: Association for Computing Machinery, 2010, pp. 25–36.
[3] P. Abad, N. Aguirre, V. S. Bengolea, D. Ciolek, M. F. Frias, J. P. Galeotti, T. Maibaum, M. M. Moscato, N. Rosner, and I. Vissani, "Improving test generation under rich contracts by tight bounds and incremental SAT solving," in International Conference on Software Testing, Verification and Validation. IEEE Computer Society, 2013, pp. 21–30.
[4] N. Mirzaei, J. Garcia, H. Bagheri, A. Sadeghi, and S. Malek, "Reducing combinatorics in GUI testing of Android applications," in Proceedings of the 38th International Conference on Software Engineering (ICSE), 2016, pp. 559–570.
[5] H. Bagheri and K. J. Sullivan, "Bottom-up model-driven development," in , D. Notkin, B. H. C. Cheng, and K. Pohl, Eds. IEEE Computer Society, 2013, pp. 1221–1224. [Online]. Available: https://doi.org/10.1109/ICSE.2013.6606683
[6] H. Bagheri and K. J. Sullivan, "Pol: specification-driven synthesis of architectural code frameworks for platform-based applications," in Generative Programming and Component Engineering, GPCE'12, Dresden, Germany, September 26-28, 2012, K. Ostermann and W. Binder, Eds. ACM, 2012, pp. 93–102. [Online]. Available: https://doi.org/10.1145/2371401.2371416
[7] H. Bagheri, Y. Song, and K. J. Sullivan, "Architectural style as an independent variable," in ASE 2010, 25th IEEE/ACM International Conference on Automated Software Engineering, Antwerp, Belgium, September 20-24, 2010, C. Pecheur, J. Andrews, and E. Di Nitto, Eds. ACM, 2010, pp. 159–162. [Online]. Available: https://doi.org/10.1145/1858996.1859026
[8] F. A. Maldonado-Lopez, J. Chavarriaga, and Y. Donoso, "Detecting network policy conflicts using Alloy," in Proceedings of the 4th International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z - Volume 8477, ser. ABZ 2014. Berlin, Heidelberg: Springer-Verlag, 2014, pp. 314–317.
[9] T. Nelson, C. Barratt, D. J. Dougherty, K. Fisler, and S. Krishnamurthi, "The Margrave tool for firewall analysis," in Proceedings of the 24th International Conference on Large Installation System Administration, ser. LISA'10. USA: USENIX Association, 2010, pp. 1–8.
[10] N. Ruchansky and D. Proserpio, "A (not) nice way to verify the OpenFlow switch specification: Formal modelling of the OpenFlow switch using Alloy," SIGCOMM Comput. Commun. Rev., vol. 43, no. 4, pp. 527–528, Aug. 2013.
[11] M. Alhanahnah, C. Stevens, and H. Bagheri, "Scalable analysis of interaction threats in IoT systems," in ISSTA '20: 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, USA, July 18-22, 2020, S. Khurshid and C. S. Pasareanu, Eds. ACM, 2020, pp. 272–285. [Online]. Available: https://doi.org/10.1145/3395363.3397347
[12] H. Bagheri, J. Wang, J. Aerts, and S. Malek, "Efficient, evolutionary security analysis of interacting Android apps," in , 2018, pp. 357–368.
[13] H. Bagheri, E. Kang, S. Malek, and D. Jackson, "A formal approach for detection of security flaws in the Android permission system," Formal Aspects of Computing, vol. 30, no. 5, pp. 525–544, 2018. [Online]. Available: https://doi.org/10.1007/s00165-017-0445-z
[14] H. Bagheri, C. Tang, and K. Sullivan, "Trademaker: Automated dynamic analysis of synthesized tradespaces," in Proceedings of the 36th International Conference on Software Engineering, ser. ICSE 2014. New York, NY, USA: ACM, 2014, pp. 106–116. [Online]. Available: http://doi.acm.org/10.1145/2568225.2568291
[15] H. Bagheri, C. Tang, and K. J. Sullivan, "Automated synthesis and dynamic analysis of tradeoff spaces for object-relational mapping," IEEE Trans. Software Eng., vol. 43, no. 2, pp. 145–163, 2017. [Online]. Available: https://doi.org/10.1109/TSE.2016.2587646
[16] J. Brunel, D. Chemouil, A. Cunha, and N. Macedo, "The Electrum Analyzer: model checking relational first-order temporal specifications," in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France, September 3-7, 2018, M. Huchard, C. Kästner, and G. Fraser, Eds. ACM, 2018, pp. 884–887. [Online]. Available: https://doi.org/10.1145/3238147.3240475
[17] A. Cunha and N. Macedo, "Validating the hybrid ERTMS/ETCS level 3 concept with Electrum," in Abstract State Machines, Alloy, B, TLA, VDM, and Z, M. Butler, A. Raschke, T. S. Hoang, and K. Reichl, Eds. Cham: Springer International Publishing, 2018, pp. 307–321.
[18] H. Kim, E. Kang, E. A. Lee, and D. Broman, "A toolkit for construction of authorization service infrastructure for the internet of things," in , 2017, pp. 147–158.
[19] K. Wang, A. Sullivan, D. Marinov, and S. Khurshid, "Fault localization for declarative models in Alloy," in International Symposium on Software Reliability Engineering (ISSRE), 2020, pp. 391–402.
[20] S. Moon, Y. Kim, M. Kim, and S. Yoo, "Ask the mutants: Mutating faulty programs for fault localization," 2014, pp. 153–162.
[21] M. Papadakis and Y. Le Traon, "Metallaxis-FL: Mutation-based fault localization," Softw. Test. Verif. Reliab., vol. 25, no. 5-7, pp. 605–628, Aug. 2015. [Online]. Available: https://doi.org/10.1002/stvr.1509
[22] R. Abreu, P. Zoeteweij, and A. J. C. van Gemund, "On the accuracy of spectrum-based fault localization," in Proceedings of the Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION, ser. TAICPART-MUTATION '07. USA: IEEE Computer Society, 2007, pp. 89–98.
[23] J. A. Jones, M. J. Harrold, and J. Stasko, "Visualization of test information to assist fault localization," in Proceedings of the 24th International Conference on Software Engineering, ser. ICSE '02. New York, NY, USA: Association for Computing Machinery, 2002, pp. 467–477.
[24] L. Naish, H. J. Lee, and K. Ramamohanarao, "A model for spectra-based software diagnosis," ACM Trans. Softw. Eng. Methodol., vol. 20, no. 3, Aug. 2011.
[25] A. Sullivan, K. Wang, and S. Khurshid, "AUnit: A test automation tool for Alloy," in , 2018, pp. 398–403.
[26] FLACK repository, 2020. [Online]. Available: https://doi.org/10.6084/m9.figshare.13439894.v4
[27] Z. Fu and S. Malik, "On solving the partial max-sat problem," in Theory and Applications of Satisfiability Testing - SAT 2006, A. Biere and C. P. Gomes, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 252–265.
[28] A. Cunha, N. Macedo, and T. Guimarães, "Target oriented relational model finding," in Fundamental Approaches to Software Engineering, S. Gnesi and A. Rensink, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014, pp. 17–31.
[29] J. A. Jones and M. J. Harrold, "Empirical evaluation of the Tarantula automatic fault-localization technique," in Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE '05. New York, NY, USA: Association for Computing Machinery, 2005, pp. 273–282.
[30] E. Torlak and D. Jackson, "Kodkod: A relational model finder," in Tools and Algorithms for the Construction and Analysis of Systems, O. Grumberg and M. Huth, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 632–647.
[31] T. Nelson, N. Danas, D. J. Dougherty, and S. Krishnamurthi, "The power of "why" and "why not": enriching scenario exploration with provenance," in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, Paderborn, Germany, September 4-8, 2017, 2017, pp. 106–116.
[32] N. Mansoor, J. A. Saddler, B. Silva, H. Bagheri, M. B. Cohen, and S. Farritor, "Modeling and testing a family of surgical robots: An experience report," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2018. New York, NY, USA: Association for Computing Machinery, 2018, pp. 785–790. [Online]. Available: https://doi.org/10.1145/3236024.3275534
[33] H. Bagheri, A. Sadeghi, J. Garcia, and S. Malek, "COVERT: Compositional analysis of Android inter-app permission leakage," IEEE Transactions on Software Engineering, vol. 41, no. 9, pp. 866–886, 2015.
[34] K. Wang, A. Sullivan, and S. Khurshid, "MuAlloy: A mutation testing framework for Alloy," in Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings, ser. ICSE '18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 29–32. [Online]. Available: https://doi.org/10.1145/3183440.3183488
[35] N. Macedo, A. Cunha, and T. Guimarães, "Exploring scenario exploration," in Fundamental Approaches to Software Engineering, A. Egyed and I. Schaefer, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2015, pp. 301–315.
[36] V. Montaghami and D. Rayside, "Bordeaux: A tool for thinking outside the box," in Proceedings of the 20th International Conference on Fundamental Approaches to Software Engineering - Volume 10202. Berlin, Heidelberg: Springer-Verlag, 2017, pp. 22–39. [Online]. Available: https://doi.org/10.1007/978-3-662-54494-5_2
[37] R. Abreu, P. Zoeteweij, R. Golsteijn, and A. J. C. van Gemund, "A practical evaluation of spectrum-based fault localization," J. Syst. Softw., vol. 82, no. 11, pp. 1780–1792, Nov. 2009.
[38] V. Dallmeier, C. Lindig, and A. Zeller, "Lightweight defect localization for Java," in Proceedings of the 19th European Conference on Object-Oriented Programming, ser. ECOOP'05. Berlin, Heidelberg: Springer-Verlag, 2005, pp. 528–550.
[39] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan, "Scalable statistical bug isolation," in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '05. New York, NY, USA: Association for Computing Machinery, 2005, pp. 15–26. [Online]. Available: https://doi.org/10.1145/1065010.1065014
[40] S. Pearson, J. Campos, R. Just, G. Fraser, R. Abreu, M. D. Ernst, D. Pang, and B. Keller, "Evaluating and improving fault localization," in
Proceedings of the 39th International Conference on SoftwareEngineering , ser. ICSE ’17. IEEE Press, 2017, p. 609–620. [Online].Available: https://doi.org/10.1109/ICSE.2017.62 [41] M. Jose and R. Majumdar, “Cause clue clauses: Error localization usingmaximum satisfiability,”
SIGPLAN Not. , vol. 46, no. 6, pp. 437–446, Jun.2011. [Online]. Available: https://doi.org/10.1145/1993316.1993550[42] D. Gopinath, R. N. Zaeem, and S. Khurshid, “Improving theeffectiveness of spectra-based fault localization using specifications,”in
Proceedings of the 27th IEEE/ACM International Conference onAutomated Software Engineering , ser. ASE 2012. New York, NY,USA: Association for Computing Machinery, 2012, p. 40–49. [Online].Available: https://doi.org/10.1145/2351676.2351683[43] C. Liu, X. Yan, L. Fei, J. Han, and S. P. Midkiff, “Sober:Statistical model-based bug localization,”
SIGSOFT Softw. Eng. Notes ,vol. 30, no. 5, pp. 286–295, Sep. 2005. [Online]. Available:https://doi.org/10.1145/1095430.1081753[44] W. E. Wong, V. Debroy, and D. Xu, “Towards better fault localization:A crosstab-based statistical approach,”
IEEE Transactions on Systems,Man, and Cybernetics, Part C (Applications and Reviews) , vol. 42, no. 3,pp. 378–396, 2012.[45] Y. Lin, J. Sun, Y. Xue, Y. Liu, and J. Dong, “Feedback-baseddebugging,” in
Proceedings of the 39th International Conference onSoftware Engineering , ser. ICSE’17. IEEE Press, 2017, pp. 393–403.[Online]. Available: https://doi.org/10.1109/ICSE.2017.43[46] X. Li, S. Zhu, M. d’Amorim, and A. Orso, “Enlightened debugging,”in
Proceedings of the 40th International Conference on SoftwareEngineering , ser. ICSE ’18. New York, NY, USA: Associationfor Computing Machinery, 2018, p. 82–92. [Online]. Available:https://doi.org/10.1145/3180155.3180242[47] B. Ness and V. Ngo, “Regression containment through source changeisolation,” 09 1997, pp. 616 – 621.[48] A. Zeller, “Yesterday, my program worked. today, it does not. why?”in
Proceedings of the 7th European Software Engineering ConferenceHeld Jointly with the 7th ACM SIGSOFT International Symposiumon Foundations of Software Engineering , ser. ESEC/FSE-7. Berlin,Heidelberg: Springer-Verlag, 1999, pp. 253–267.[49] T. Reps, T. Ball, M. Das, and J. Larus, “The use of program profilingfor software maintenance with applications to the year 2000 problem,”in
Software Engineering — ESEC/FSE’97 , M. Jazayeri and H. Schauer,Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1997, pp. 432–449.[50] D. B. Whalley, “Automatic isolation of compiler errors,”
ACM Trans.Program. Lang. Syst. , vol. 16, no. 5, pp. 1648–1659, Sep. 1994.[51] A. Zeller and R. Hildebrandt, “Simplifying and isolating failure-inducinginput,”
IEEE Trans. Softw. Eng. , vol. 28, no. 2, pp. 183–200, Feb. 2002.[52] J. R. Lyle and M. Weiser, “Automatic program bug location by programslicing,” 1987.[53] H. Agrawal, J. Horgan, S. London, and W. Wong, “Fault localizationusing execution slices and dataflow tests,”
Proceedings of IEEE SoftwareReliability Engineering , 06 1999.[54] H. Agrawal and J. R. Horgan, “Dynamic program slicing,”
SIGPLANNot. , vol. 25, no. 6, pp. 246–256, Jun. 1990. [Online]. Available:https://doi.org/10.1145/93548.93576[55] A. Griesmayer, S. Staber, and R. Bloem, “Fault localization using amodel checker,”
Softw. Test. Verif. Reliab. , vol. 20, no. 2, p. 149–173,Jun. 2010.[56] J. Marques-Silva, M. Janota, A. Ignatiev, and A. Morgado, “Efficientmodel based diagnosis with maximum satisfiability,” in
Proceedingsof the 24th International Conference on Artificial Intelligence , ser.IJCAI’15. AAAI Press, 2015, p. 1966–1972.[57] F. Baader and R. Pe˜naloza, “Axiom pinpointing in general tableaux,”
J. Log. and Comput. , vol. 20, no. 1, p. 5–34, Feb. 2010. [Online].Available: https://doi.org/10.1093/logcom/exn058[58] F. Baader, R. Pe˜naloza, and B. Suntisrivaraporn, “Pinpointing in thedescription logic EL + ,” in KI 2007: Advances in Artificial Intelligence ,J. Hertzberg, M. Beetz, and R. Englert, Eds. Berlin, Heidelberg:Springer Berlin Heidelberg, 2007, pp. 52–67.[59] S. G. Brida, G. Regis, G. Zheng, H. Bagheri, T. Nguyen, N. Aguirre, andM. F. Frias, “Bounded exhaustive search of alloy specification repairs,”in
International Conference on Software Engineering (ICSE) . IEEE,2021, p. to appear.. IEEE,2021, p. to appear.