Do We Really Sample Right In Model-Based Diagnosis?
Patrick Rodler and Fatima Elichanova
University of Klagenfurt
[email protected] [email protected]

Abstract
Statistical samples, in order to be representative, have to be drawn from a population in a random and unbiased way. Nevertheless, it is common practice in the field of model-based diagnosis to make estimations from (biased) best-first samples. One example is the computation of a few most probable possible fault explanations for a defective system and the use of these to assess which aspect of the system, if measured, would bring the highest information gain. In this work, we scrutinize whether these statistically not well-founded conventions, that both diagnosis researchers and practitioners have adhered to for decades, are indeed reasonable. To this end, we empirically analyze various sampling methods that generate fault explanations. We study the representativeness of the produced samples in terms of their estimations about fault explanations and how well they guide diagnostic decisions, and we investigate the impact of sample size, the optimal trade-off between sampling efficiency and effectivity, and how approximate sampling techniques compare to exact ones.
1 Introduction

Suppose we intend to predict the outcome of the next election and we conduct a poll where we ask only, say, university professors for whom they are going to vote. By this strategy, we will most likely not gain insight into the real sentiment in the population regarding the election. The problem is simply that professors are most probably not representative of all people. In model-based diagnosis, however, such kinds of samples are often used as a basis for making decisions that rule the efficiency of the diagnostic process.
Model-based diagnosis [1; 2] deals with the detection, localization and repair of faults in observed systems such as programs, circuits, knowledge bases or physical devices of various kinds. One important prerequisite to achieve these goals is the generation of diagnoses, i.e., explanations for the faulty system behavior in terms of potentially faulty system components. A sample of diagnoses can be either

• directly analyzed, e.g., to manually discover or make estimations about the actual fault [3; 4], to aid proper algorithm choice [5; 6], or to support users in test case specifications or repair actions [7; 8; 9; 10; 11], or
• used as an input or guidance to diagnostic algorithms,

where we focus on the second bullet in this work. An important class of diagnostic algorithms that are guided by a set of precomputed diagnoses are sequential diagnosis (SD) approaches [2; 12]. They use a sample of diagnoses to compute optimal system measurements that allow to efficiently and systematically rule out invalid diagnoses until a single or highly probable one remains. Since achieving (global) optimality of the sequence of system measurements is intractable in general [13], state-of-the-art SD techniques usually rely on local optimization [14] using one out of numerous heuristics [2; 15; 16; 17; 18] as optimality criteria. These heuristics can be seen as functions that, based on a given sample of diagnoses, map measurement candidates to one numeric score each, and finally select the one measurement with the best score.

(∗ This work was accepted for presentation at the 31st International Workshop on Principles of Diagnosis (DX-2020).)
In most cases, these functions use two features of the sample:

(F1) the diagnoses' probabilities (to estimate the probability of each measurement outcome), and
(F2) the diagnoses' predictions of the measurement outcome (to estimate diagnosis elimination rates).

The literature offers a wide range of different methods and ways to generate samples of diagnoses, among them ones that return a specific sample (which includes exactly a predefined subset of all diagnoses) and others that compute an unspecific sample, e.g., in a heuristic [20], stochastic [21] or simply undefined way [22] (where no guarantee can be given regarding diagnosis selection for the sample). Many existing SD approaches draw on samples of the specific type in that they build upon best-first samples, such as maximum-probability or minimum-cardinality diagnoses [9; 16; 19; 23; 24; 25; 26]. While perhaps often being motivated by the desideratum that the most preferred/likely candidate(s) should be known at any stage of the diagnostic process, e.g., to allow for well-founded stopping criteria, the use of such non-random samples is highly questionable from the statistical viewpoint. In this work we challenge the validity of the following statistical law in the domain of model-based diagnosis:

A randomly chosen unbiased sample from a population allows (on average) better conclusions and estimations about the whole population than any other sample.

(Footnotes: For this work, we make the assumption that measurements have uniform costs. If that is not the case, then the measurement cost is another factor that flows into the assessment of measurement candidates [15; 19]. Note, the algorithm described in [22] can be modified for heuristic diagnosis computation, as will be explained in Sec. 3.2.)
The motivation behind this inquiry is to either understand why this fundamental principle does not apply to the particular domain of model-based diagnosis, or to rationalize the necessity of random sampling as part of diagnostic algorithms and to foster research in this direction. The particular contributions are:

• We analyze a range of real-world diagnosis problems and gain insight into the quality of three specific (best-first, random and, as a baseline, worst-first) and three unspecific (approximate best-first, approximate random, approximate worst-first) sample types.
• We assess a sample type's quality based on
  – its theoretical representativeness, i.e., how well it allows to estimate the aspects (F1 and F2) that determine the heuristic score of measurements, and
  – its practical representativeness, i.e., its performance achieved in a diagnosis session wrt. time and number of measurements.
• We investigate the impact of the
  – sample size,
  – particular used heuristic, and
  – tackled diagnosis problem
  on the sample's representativeness.

This work is organized as follows: We provide a brief account of theoretical foundations in Sec. 2. Our evaluations (dataset, sample types, sampling techniques, evaluation criteria, research questions, experiment settings, and results) are discussed in Sec. 3. In Sec. 4, we address limitations of our research, and conclude with Sec. 5.

2 Theoretical Foundations

We briefly characterize concepts from model-based diagnosis used in this work, based on the framework of [16; 9] which is more general [27] than Reiter's theory [1].

Diagnosis Problem.
We assume that the diagnosed system, consisting of a set of components {c1, . . . , ck}, is described by a finite set of logical sentences K ∪ B, where K (possibly faulty sentences) includes knowledge about the behavior of the system components, and B (correct background knowledge) comprises any additional available system knowledge and system observations. More precisely, there is a one-to-one relationship between sentences ax_i ∈ K and components c_i, where ax_i describes the nominal behavior of c_i (weak fault model). E.g., if c_i is an AND-gate in a circuit, then ax_i := out(c_i) = and(in1(c_i), in2(c_i)); B in this case might contain sentences stating, e.g., which components are connected by wires, or observed circuit outputs. The inclusion of a sentence ax_i in K corresponds to the (implicit) assumption that c_i is healthy. Evidence about the system behavior is captured by sets of positive (P) and negative (N) measurements [1; 2; 28]. Each measurement is a logical sentence; positive ones p ∈ P must be true and negative ones n ∈ N must not be true. The former can be, depending on the context, e.g., observations about the system, probes or required system properties. The latter model properties that must not hold for the system; e.g., if K is a biological knowledge base to be debugged, a negative test case might be "every bird can fly" (think of penguins).

(Footnote: The main reason for using this more general framework is its ability to handle negative measurements (things that must not be true for the diagnosed system) which are helpful, e.g., for diagnosing knowledge bases [28; 16].)

K = { ax1: A → ¬B,  ax2: A → B,  ax3: A → ¬C,  ax4: B → C,  ax5: A → B ∨ C }    B = ∅    P = ∅    N = {¬A}

Table 1:
Example DPI stated in propositional logic.

We call ⟨K, B, P, N⟩ a diagnosis problem instance (DPI).

Example 1 (DPI). Table 1 depicts a DPI stated in propositional logic. The "system" (the knowledge base itself in this case) comprises five "components" c1, . . . , c5, and the "normal behavior" of c_i is given by the respective axiom ax_i ∈ K. No background knowledge (B = ∅) or positive measurements (P = ∅) are given from the start. But there is one negative measurement (i.e., N = {¬A}), which stipulates that ¬A must not be an entailment of the correct system (knowledge base). Note, however, that K (i.e., the assumption that all "components" are normal) in this case does entail ¬A (e.g., due to the axioms ax1, ax2) and therefore some axiom ("component") in K must be faulty.

Diagnoses.
Given that the system description along with the positive measurements (under the assumption K that all components are healthy) is inconsistent, i.e., K ∪ B ∪ P |= ⊥, or some negative measurement is entailed, i.e., K ∪ B ∪ P |= n for some n ∈ N, some assumption(s) about the healthiness of components, i.e., some sentences in K, must be retracted. We call such a set of sentences D ⊆ K a diagnosis for the DPI ⟨K, B, P, N⟩ iff (K \ D) ∪ B ∪ P ⊭ x for all x ∈ N ∪ {⊥}. We say that D is a minimal diagnosis for dpi iff there is no diagnosis D′ ⊂ D for dpi. The set of minimal diagnoses is representative of all diagnoses under the weak fault model [29], i.e., the set of all diagnoses is equal to the set of all supersets of minimal diagnoses. Therefore, diagnosis approaches usually restrict their focus to only minimal diagnoses. We furthermore denote by D∗ the actual diagnosis which pinpoints the actually faulty axioms, i.e., all elements of D∗ are in fact faulty and all elements of K \ D∗ are in fact correct.

Example 2 (Diagnoses). For our DPI in Table 1 we have four minimal diagnoses, given by D1 := [ax1, ax3], D2 := [ax1, ax4], D3 := [ax2, ax3], and D4 := [ax2, ax5]. For instance, D1 is a minimal diagnosis as (K \ D1) ∪ B ∪ P = {ax2, ax4, ax5} is both consistent and does not entail the given negative measurement ¬A.
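Because the example DPI is propositional over only three variables, Example 2 can be cross-checked mechanically. The following brute-force sketch replaces the theorem prover by truth-table enumeration (adequate only for toy problems; axiom indices follow Table 1):

```python
from itertools import combinations, product

# Axioms of the example DPI (Table 1), encoded as Boolean functions.
AX = {
    1: lambda a, b, c: (not a) or (not b),   # ax1: A -> not B
    2: lambda a, b, c: (not a) or b,         # ax2: A -> B
    3: lambda a, b, c: (not a) or (not c),   # ax3: A -> not C
    4: lambda a, b, c: (not b) or c,         # ax4: B -> C
    5: lambda a, b, c: (not a) or b or c,    # ax5: A -> B or C
}

def is_diagnosis(D):
    """D is a diagnosis iff K \\ D is consistent and does not entail the
    negative measurement ¬A, i.e., some model of K \\ D makes A true
    (here B = P = ∅ and N = {¬A})."""
    rest = [AX[i] for i in AX if i not in D]
    return any(a and all(ax(a, b, c) for ax in rest)
               for a, b, c in product([True, False], repeat=3))

# Enumerate subsets by increasing size; skip supersets of found diagnoses.
minimal = []
for size in range(len(AX) + 1):
    for D in map(set, combinations(AX, size)):
        if not any(M <= D for M in minimal) and is_diagnosis(D):
            minimal.append(D)

print(sorted(sorted(D) for D in minimal))  # [[1, 3], [1, 4], [2, 3], [2, 5]]
```

Real diagnosis engines replace the model enumeration by consistency checks of a theorem prover and the subset loop by hitting-set search (e.g., HS-Tree [1]).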
Diagnosis Probability Model. In case useful meta information is available that allows to assess the likeliness of failure for system components, the probability of diagnoses (of being the actual diagnosis) can be derived. Specifically, given a function p that maps each sentence (system component) ax ∈ K to its failure probability p(ax) ∈ (0, 1), the probability p(D) of a diagnosis D ⊆ K (under the common assumption of independent component failure) is computed as the probability that all sentences in D are faulty, and all others are correct, i.e., p(D) := ∏_{ax ∈ D} p(ax) · ∏_{ax ∈ K\D} (1 − p(ax)). Each time a new measurement is added to the DPI, the probabilities of diagnoses are updated using Bayes' Theorem [2].

Example 3 (Diagnosis Probabilities). Reconsider the DPI from Table 1 and let failure probabilities ⟨p(ax1), . . . , p(ax5)⟩ be given. Then the probability of each minimal diagnosis from Example 2 follows from the formula above; e.g., p(D1) is calculated as p(ax1) · (1 − p(ax2)) · p(ax3) · (1 − p(ax4)) · (1 − p(ax5)). The diagnosis probabilities are then commonly normalized so that they sum up to 1 over all known diagnoses. Note, this normalization makes sense if only a proper subset of all diagnoses is known.
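The probability model can be sketched in a few lines; the fault probabilities used below are illustrative only, not those of the paper's example:

```python
def diagnosis_prob(D, p):
    """p(D) = prod_{ax in D} p(ax) * prod_{ax in K\\D} (1 - p(ax)),
    assuming independent component failure; p maps axiom index -> p(ax)."""
    prob = 1.0
    for ax, p_ax in p.items():
        prob *= p_ax if ax in D else (1.0 - p_ax)
    return prob

# Illustrative (hypothetical) fault probabilities for ax1..ax5:
p = {1: 0.4, 2: 0.1, 3: 0.4, 4: 0.1, 5: 0.2}
diagnoses = [{1, 3}, {1, 4}, {2, 3}, {2, 5}]
probs = [diagnosis_prob(D, p) for D in diagnoses]
normalized = [q / sum(probs) for q in probs]  # sums to 1 over known diagnoses
```

The normalization in the last line is exactly the step discussed in Example 3: it is only meaningful relative to the set of currently known diagnoses.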
Measurement Points. We call a logical sentence a measurement point (MP) if it states one (true or false) aspect of the system under consideration. E.g., if the system is a digital circuit, the statement out(c_i) = 1, which states that the output of gate c_i is high, is an MP. In case of the system being, say, a biological knowledge base, ∀X (bird(X) → canFly(X)) is an MP. Assuming an oracle orcl (e.g., an electrical engineer for a circuit, or a domain expert for a knowledge base) that is knowledgeable about the system, one can send MPs m to orcl, and orcl will classify each m as either a positive or a negative measurement, i.e., m ↦ orcl(m) where orcl(m) ∈ {P, N}.

Measurements to Discriminate between Diagnoses.
MPs come into play when there are multiple diagnoses for a DPI and the intention is to figure out the actual diagnosis. Hence, given a set of diagnoses D for a DPI between which we want to discriminate, the MPs m of particular interest are those for which each classification orcl(m) is inconsistent with some diagnosis in D [2; 9]. We call such MPs informative (wrt. D). In other words, each outcome of a measurement for some informative MP will invalidate some diagnosis. Each MP m partitions any set of (minimal) diagnoses D into subsets D+_m, D−_m and D0_m as follows:

• Each D ∈ D+_m is consistent (only) with orcl(m) = P. (diagnoses predicting the positive outcome)
• Each D ∈ D−_m is consistent (only) with orcl(m) = N. (diagnoses predicting the negative outcome)
• Each D ∈ D0_m is consistent with both outcomes orcl(m) ∈ {P, N}. (uncommitted diagnoses)

Thus, an MP m is informative iff both D+_m (diagnoses invalidated if orcl(m) = N) and D−_m (diagnoses invalidated if orcl(m) = P) are non-empty sets.

(Estimated) Properties of Measurement Points. Since not all informative MPs are equally utile, the consideration of additional properties of MPs allows a more fine-grained preference rating of MPs. In fact, if D includes all diagnoses for the given DPI, the partition ⟨D+_m, D−_m, D0_m⟩ allows to determine, for each measurement outcome c ∈ {P, N}, its diagnosis elimination rate er(orcl(m) = c) as well as its probability p(orcl(m) = c) [17]:

er+_m := er(orcl(m) = P) = |D−_m| / |D|
er−_m := er(orcl(m) = N) = |D+_m| / |D|
p+_m := p(orcl(m) = P) = P+_m + ½ P0_m
p−_m := p(orcl(m) = N) = P−_m + ½ P0_m

where PX_m := Σ_{D ∈ DX_m} p(D) for X ∈ {+, −, 0}. In practice, the calculation of all diagnoses is often infeasible and diagnosis systems rely on a subset of the minimal diagnoses D to estimate these properties of MPs. In the following, we denote by êr+_m,D and êr−_m,D the estimated elimination rates for the positive and negative measurement outcome for MP m computed based on D. Similarly, we refer by p̂+_m,D and p̂−_m,D to the estimated probability of a positive and negative measurement outcome for m and D. Importantly, these estimated values depend on both the MP m and the used sample D of diagnoses. Note that all four estimates attain values in [0, 1] for any MP m, and in (0, 1) if the MP m is informative. Moreover, p̂+_m,D + p̂−_m,D = 1 and êr+_m,D + êr−_m,D ≤ 1, where the difference 1 − (êr+_m,D + êr−_m,D) is the rate of uncommitted diagnoses, which are not affected by the measurement at m.

Example 4 (Measurement Points and their Properties)
Assume again our DPI from Table 1 and let all minimal diagnoses be known, i.e., D = {D1, . . . , D4} (cf. Example 2). Then, e.g., m1 := A → C is an informative MP wrt. D since D+_m1 = {D1, D3} ≠ ∅ and D−_m1 = {D2, D4} ≠ ∅. E.g., D1 ∈ D+_m1 holds because (K \ D1) ∪ B ∪ P = {ax2, ax4, ax5} ⊃ {A → B, B → C} |= m1 and thus m1 can be no negative measurement under the assumption D1. In a similar way, we obtain that D2 ∈ D−_m1 due to (K \ D2) ∪ B ∪ (P ∪ {m1}) = {ax2, ax3, ax5, m1} ⊃ {A → ¬C, A → C} |= ¬A where ¬A is a negative measurement; hence, m1 cannot be a positive measurement under the assumption D2. In contrast, e.g., m2 := B is a non-informative MP because D+_m2 = ∅. Since two of the four known diagnoses lie on each side of the partition, the elimination rate estimates for m1 are êr+_m1,D = 0.5 and êr−_m1,D = 0.5; the probability estimates p̂+_m1,D and p̂−_m1,D follow from the (normalized) diagnosis probabilities of Example 3. Note: (1) If we have at hand a different sample D, the estimations for one and the same MP might vary substantially. E.g., suppose D = {D1, D2, D4}; then êr+_m1,D ≈ 0.67 and êr−_m1,D ≈ 0.33, and the probability estimates change accordingly. (2) If the sample gets smaller (wrt. subset-inclusion), then the number of informative MPs might shrink, and vice versa. E.g., m1 becomes non-informative if, e.g., D = {D2, D4}, and thus might be disregarded by diagnosis systems. Consequently, larger (smaller) samples will tend to provide a richer (sparser) selection of MP candidates.
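The estimates just illustrated can be computed directly from the partition of a diagnosis sample. A minimal sketch (the even split of the uncommitted probability mass between the two outcomes is the convention implied by p̂+ + p̂− = 1; the uniform diagnosis probabilities below are illustrative, not those of the paper's example):

```python
def mp_properties(d_plus, d_minus, d_zero, p):
    """Estimated elimination rates and outcome probabilities of an MP,
    given the partition <D+, D-, D0> of a diagnosis sample and diagnosis
    probabilities p normalized over the sample."""
    n = len(d_plus) + len(d_minus) + len(d_zero)
    mass = lambda part: sum(p[d] for d in part)
    er_pos = len(d_minus) / n                 # eliminated if orcl(m) = P
    er_neg = len(d_plus) / n                  # eliminated if orcl(m) = N
    p_pos = mass(d_plus) + 0.5 * mass(d_zero)
    p_neg = mass(d_minus) + 0.5 * mass(d_zero)
    return er_pos, er_neg, p_pos, p_neg

# MP m1 from Example 4 (D+ = {D1, D3}, D- = {D2, D4}), uniform probabilities:
p = {'D1': 0.25, 'D2': 0.25, 'D3': 0.25, 'D4': 0.25}
props = mp_properties(['D1', 'D3'], ['D2', 'D4'], [], p)  # (0.5, 0.5, 0.5, 0.5)
```

With an empty D0, the two elimination rates sum to 1, mirroring the identity 1 − (êr+ + êr−) = rate of uncommitted diagnoses.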
Evaluating Measurement Points Using Heuristics. To quantitatively assess the preferability of different MPs, state-of-the-art sequential diagnosis systems rely on heuristics that perform a one-step-lookahead analysis of MPs [14]. A heuristic h is a function that maps each MP m to a real-valued score h(m) [30]. At this, h(m) quantifies the utility of the expected situation after knowing the outcome for MP m. The MP with the best score according to the used heuristic is then chosen as the next query to the oracle. Well-known heuristics incorporate exactly the two discussed features, i.e., the estimated elimination rates and estimated probabilities, into their computations [17]. So, different heuristics correspond to different functions of these estimates, e.g.:

• information gain (ENT) [2] uses solely the probabilities and prefers MPs where P0_m = 0 and |p̂+_m,D − p̂−_m,D| is minimal [30];
• split-in-half (SPL) [16] considers only the elimination rates and favors MPs with êr+_m,D + êr−_m,D = 1 and minimal |êr+_m,D − êr−_m,D| [30];
• risk optimization (RIO) [31] takes into account both features by computing a dynamically re-weighted function of ENT and SPL;
• most probable singleton (MPS) [17; 30] also regards both features by giving preference to MPs that maximize the probability of a maximal elimination rate.

For details on these and other heuristics see [17] for a theoretical analysis and [18] for an empirical evaluation.

Example 5 (Heuristics)
Reconsider our DPI from Table 1 and the MP m1 from Example 4, and let all minimal diagnoses be known, i.e., D = {D1, . . . , D4}. Further, let m3 := A ∧ ¬B → C. Note that m3 is informative (wrt. D) with D+_m3 = {D1, D2, D3} and D−_m3 = {D4}, and the elimination rate estimates are êr+_m3,D = 0.25 and êr−_m3,D = 0.75. Hence, given the two MP candidates {m1, m3}, the heuristic SPL would select m1 (since a half of the known diagnoses are eliminated for each outcome). Similarly, ENT would prefer m1 to m3 (because for m1 roughly a half of the probability mass is eliminated for each outcome). However, assume that a used diagnosis sampling technique just outputs a biased sample of three of the four diagnoses. In this case, the probability estimates for m1 and m3 deviate from the values above and, using ENT, the chosen MP can become m3 (the worse MP, as shown above). If sampling would yield D = {D2, D4}, then m1 would not even be an informative MP (wrt. D) on the one hand, and m3 would be the (theoretically) optimal MP according to SPL on the other hand. This example shows the dramatic impact the used sampling technique can have on diagnostic decisions.

Sequential Diagnosis. Sequential diagnosis (SD) aims at generating a sequence of informative MPs such that a single (highly probable) diagnosis remains for the given DPI, while minimizing the number of MPs needed (oracle inquiries are usually expensive). A generic SD process iterates through the following steps until (the Bayes-updated) p(D) for some D ∈ D exceeds a probability threshold σ:

S1 Generate a sample of minimal diagnoses D for the current DPI (initially, the given DPI).
S2 Choose a (heuristically optimal) informative MP m wrt. D (using a selection heuristic h).
S3 Ask the oracle orcl to classify m.
S4 Use the classification orcl(m) to update the DPI, by adding m to the positive measurements if orcl(m) = P and to the negative measurements if orcl(m) = N.
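To make step S2 concrete, the preference criteria of the heuristics can be turned into numeric scores. The sketch below uses simplified surrogate scores (lower is better) rather than the exact formulas from the cited literature (ENT, in particular, is really an information-gain computation); the helper names are ours:

```python
def spl_score(er_pos, er_neg):
    """Simplified split-in-half surrogate: 0 is optimal, reached when
    er+ + er- = 1 (no uncommitted diagnoses) and er+ = er- = 0.5."""
    return abs(er_pos - er_neg) + (1.0 - (er_pos + er_neg))

def ent_score(p_pos, p_neg):
    """Simplified ENT surrogate: prefers balanced outcome probabilities."""
    return abs(p_pos - p_neg)

def select_mp(candidates, score):
    """Step S2: choose the MP with the best (lowest) heuristic score."""
    return min(candidates, key=lambda m: score(*candidates[m]))

# Elimination-rate estimates of m1 and m3 from Examples 4 and 5:
candidates = {'m1': (0.5, 0.5), 'm3': (0.25, 0.75)}
best = select_mp(candidates, spl_score)  # SPL picks the balanced MP m1
```

Exchanging the sample behind the estimates changes the scores and hence the selected MP, which is precisely the sensitivity that Example 5 illustrates.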
3 Evaluation

We conducted extensive experiments using a dataset of real-world diagnosis cases (Sec. 3.1) to study six different diagnosis sample types (Sec. 3.2 and 3.3) wrt. the accuracy of estimations and diagnostic efficiency (Sec. 3.4) in different scenarios in terms of sample size (number of diagnoses computed) and measurement selection heuristic used. The concrete research questions are explicated in Sec. 3.5 and the experiments are detailed in Sec. 3.6. Finally, in Sec. 3.7, we present and discuss the obtained results.
3.1 Dataset

In our experiments we drew upon the set of real-world diagnosis problems from the domain of knowledge-base debugging shown in Table 2. Note that every model-based diagnosis problem (according to Reiter's original characterization [1]) can be represented as a knowledge-base debugging problem [27], which is why considering knowledge-base debugging problems is without loss of generality. To obtain a representative dataset we chose it in a way it covers a variety of different problem sizes, theorem proving complexities, and diagnostic metrics (number of diagnoses, their
Table 2: Dataset used in experiments (sorted by 2nd column). For each knowledge base (KB) K, including, among others, University (U), the columns give |K|, the Description Logic expressivity (e.g., ALCN, ALCH(D), ALCHF(D), SHF, SOIN(D), SROIQ), and the diagnostic metrics described in the text.
Notes on Table 2:
• Description Logic expressivity [32]; the higher the expressivity, the higher is the complexity of consistency checking (conflict computation) for this logic.
• Number and sizes of minimal diagnoses for K; same notation for conflicts.
• Sufficiently hard diagnosis problems from evaluations in [16], which were also used, e.g., in [18; 33; 34].
• Diagnosis problems studied in [3; 35].
• Faulty version of the DBpedia ontology, see https://bit.ly/2ZO2qYZ.
• Diagnosis problem used in scalability tests in [16]. The second scalability problem used in [16] was not included in the dataset since the computation of all minimal diagnoses was infeasible (within hours of computation) for it.

sizes, number of conflicts, number of components). These metrics are depicted in the columns of Table 2. In order to implement the random sampling of diagnoses, another requirement to the dataset was that all the used problems allow the computation of all minimal diagnoses within tolerable time for our experiments (single-digit number of minutes).
3.2 Sample Types

We examined the following types of diagnosis samples:

T1 best-first (bf)
T2 random (rd)
T3 worst-first (wf)
T4 approximate best-first (abf)
T5 approximate random (ard)
T6 approximate worst-first (awf)

By "best-first" / "worst-first", we mean the most / least probable minimal diagnoses. The types T3 and T6 serve as baselines. We refer to T1, T2 and T3 as specific sample types because we know the properties of the sample (exactly the k best or worst diagnoses, or k unbiased random ones) in advance by employing (expensive) sampling techniques that guarantee these properties. On the other hand, we call T4, T5 and T6 unspecific sample types and adopt (usually less costly) heuristic techniques to provide them. In the following, we denote a sample of type
T_i including k minimal diagnoses by S_{Ti,k}.

3.3 Sampling Techniques

The approaches we used for generating the samples for a given DPI dpi = ⟨K, B, P, N⟩ were:

T1: We used uniform-cost HS-Tree [9, Sec. 4.6] and stopped it after k diagnoses were computed. Due to the best-first property of the algorithm, these are provenly [9, Prop. 4.17] the k diagnoses with the highest probability among all minimal diagnoses.

T2, T3: We generated all minimal diagnoses allD for dpi. For T2, we selected k random elements from this set by means of the Java (v1.8) pseudorandom number generator. For T3, we picked the k diagnoses with the lowest probability. (Footnotes: Generating all minimal diagnoses is generally intractable [36]. So, this approach to random sampling is not viable in practice and was just used for the purpose of our evaluation. As said in Sec. 3.1, we chose our dataset so that the computation of allD was feasible within reasonable time. The generation of allD can be done by any sound and complete diagnosis computation method, e.g., HS-Tree [1].)

T4, T5, T6: We used Inv-HS-Tree [22] to supply the samples. First, we added all ax ∈ K to a list L. For T5, we randomly shuffled L. For T4 and T6, we sorted L in descending and ascending order of probability p(ax), respectively. Finally, we let plain Inv-HS-Tree operate on this list L to supply a sample of size k. Inv-HS-Tree uses k calls to a diagnosis computation method called Inverse QuickXPlain (Inv-QX) [37; 38; 39]. Each call of Inv-QX returns one well-defined minimal diagnosis D_L for dpi based on the strict total order of elements imposed by the sorting of the list L. Specifically, D_L is the minimal diagnosis with the highest rank wrt. the antilexicographic order >antilex defined on sublists of L = [l1, . . . , l_|K|] [38]. At this, for sublists X, Y of L, we have X >antilex Y (X has higher rank wrt. >antilex than Y) iff there is some k such that X ∩ {l_{k+1}, . . . , l_|K|} = Y ∩ {l_{k+1}, . . . , l_|K|} (both sublists are equal wrt. their lowest-ranked elements in L) and l_k ∈ Y \ X (the first element that differs between the sublists is in Y). E.g., if L includes the letters a, b, . . . , z in alphabetic order, then X >antilex Y for X = [b, n, r, v] and Y = [a, p, r, v] because both lists share [r, v] and, after deleting these two letters from both X and Y, the now last element (p) of Y is ranked lower in L than the one (n) of X (see also [40, Sec. 3.2.5]). That is, in the approximate best-first case (T4), the computed diagnosis D_L = [d1, . . . , d_{n−1}, d_n] has the property that there is no other minimal diagnosis D′ = [d′1, . . . , d′_r] where d′_r has a higher probability than d_n, and, among all minimal diagnoses that share the last element d_n, there is no other minimal diagnosis whose second-last element has a higher probability than d_{n−1}, and so forth. If we replace "higher probability" with "lower probability", we obtain a description of the diagnosis D_L returned in the approximate worst-first case (T6). In the approximate random case (T5), we reshuffle L before each call of Inv-QX, thereby trying to simulate a random selection. Note, Inv-HS-Tree guarantees that each Inv-QX call generates a new diagnosis by systematically "blocking" different elements in L which must not occur in the next diagnosis [22].
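The antilexicographic comparison can be sketched as follows (a hypothetical helper for illustration, not the actual Inv-QX implementation):

```python
import string

def antilex_greater(X, Y, L):
    """X >_antilex Y for sublists X, Y of L: scan L from its last element
    backwards; at the first element contained in exactly one of the two
    lists, the list NOT containing it has the higher antilex rank."""
    xs, ys = set(X), set(Y)
    for l in reversed(L):
        if (l in xs) != (l in ys):
            return l in ys
    return False  # X and Y contain the same elements

# The letters example from the text: both lists share [r, v]; the first
# differing element from the back is p, which belongs to Y, so X wins.
L = list(string.ascii_lowercase)
assert antilex_greater(['b', 'n', 'r', 'v'], ['a', 'p', 'r', 'v'], L)
```

Scanning from the back of L mirrors the definition: the shared tail l_{k+1}, …, l_|K| is skipped until the decisive element l_k ∈ Y \ X is found.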
3.4 Evaluation Criteria

We evaluate sample types based on what we call their theoretical and practical representativeness:

Theoretical Representativeness: A sample type T_i is the more representative, the better the

• probability estimates ⟨p̂+_m,D, p̂−_m,D⟩ for MPs m match the respective actual values ⟨p+_m, p−_m⟩, and
• elimination rate estimates ⟨êr+_m,D, êr−_m,D⟩ for MPs m match the respective actual values ⟨er+_m, er−_m⟩

for samples D = S_{Ti,k}.

Practical Representativeness: A sampling technique T_i is the more representative, the lower the

• number of measurements required, and
• time required for sampling (diagnosis computation)

throughout a sequential diagnosis session until the actual diagnosis is isolated from spurious ones, where D = S_{Ti,k} in each sequential diagnosis iteration.

(Footnote: Note, the naive approach to generating the least probable diagnoses by means of a best-first diagnosis computation mechanism (such as a uniform-cost version of Reiter's HS-Tree [9]) that simply uses the complementary values p′(ax) := 1 − p(ax) instead of the probabilities p(ax) for ax ∈ K (provably) does not work in general.)
3.5 Research Questions

The goal of our evaluation is to shed light on the following research questions:

RQ1 Which type of sample is best in terms of theoretical representativeness?
RQ2 Which type of sample is best in terms of practical representativeness?
RQ3 Are the results wrt. RQ1 and RQ2 consistent over different (a) sample sizes, (b) measurement selection heuristics, and (c) diagnosis problem instances?
RQ4 Does a larger sample size (more computed diagnoses) imply better representativeness?
RQ5 Does a better theoretical representativeness translate to a better practical representativeness?
3.6 Experiment Settings

We conducted two experiments, EXP1 and EXP2, to investigate our research questions. Common to both experiments are the following settings:

• We defined one DPI dpi_K := ⟨K, ∅, ∅, ∅⟩ for each K in Table 2. That is, we assumed each axiom (component) in K to be possibly faulty and left the background knowledge and the measurements void to begin with. To each ax ∈ K, we randomly assigned a fault probability p(ax) ∈ (0, 1) in a way that syntactically equally (more) complex axioms have an equal (higher) probability (cf. [9; 16]). E.g., in our DPI in Table 1, elements of {ax1, ax3} (one implication, one negation) and {ax2, ax4} (one implication), respectively, would each be allocated the same probability, and the former two would have a higher probability than the latter (cf. Example 3).
• We precomputed all minimal diagnoses allD for each DPI dpi_K.
• We used all sample types T_i for i ∈ {1, . . . , 6} (cf. Sec. 3.2).
• We used sample sizes (numbers of generated minimal diagnoses) k ∈ {2, 6, 10, 20, 50}.

The specific settings for each experiment were:
EXP1 (theoretical representativeness): For each dpi_K, for each k, and for each T_i, we computed a sample D = S_{Ti,k}. We used

• D to compute the probability and elimination rate estimates ⟨p̂+_m,D, p̂−_m,D⟩ and ⟨êr+_m,D, êr−_m,D⟩, and
• allD to compute ⟨p+_m, p−_m⟩ and ⟨er+_m, er−_m⟩

for a fixed number of (if so many, otherwise for all) randomly selected informative MPs wrt. D. For each such MP, we thus had four estimates and four corresponding actual values, that we could compare against one another.
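The comparison criterion later used for EXP1 (Table 3, criteria E and P) is the Pearson correlation coefficient between estimated and actual values. A self-contained sketch with made-up numbers:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between paired value lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical estimated vs. actual elimination rates for a few MPs:
estimated = [0.50, 0.25, 0.70, 0.40]
actual    = [0.45, 0.30, 0.65, 0.50]
r = pearson(estimated, actual)  # close to 1: sample estimates track reality
```

A value of r near 1 indicates that the sample's estimates move in lockstep with the actual values, i.e., high theoretical representativeness.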
EXP2 (practical representativeness): For each dpi_K, for each k, for each T_i, and for each of the four heuristics h ∈ {ENT, SPL, RIO, MPS} (cf. Sec. 2), we executed 10 sequential diagnosis sessions (loop S1–S4, Sec. 2) while in each session

• searching for a different randomly selected target diagnosis D∗ ∈ allD for dpi_K,
• starting from the initial problem dpi_K,
• with stop criterion σ = 1 (loop until a single minimal diagnosis remains, i.e., all others have been ruled out),

where in each iteration through the loop at step

• S1, a sample D = S_{Ti,k} is drawn for the current DPI,

Table 3:
Theoretical and practical representativeness: Rankings of sample types for various scenarios (EXP1 and EXP2).

scenario | ranking (best ... worst)       | criterion
all      | rd  wf  bf  awf  (abf ard)     | E
k = 6    | bf  wf  (rd awf)  abf  ard     | E
k = 10   | rd  wf  bf  awf  abf  ard      | E
k = 20   | rd  wf  bf  awf  (abf ard)     | E
k = 50   | rd  wf  awf  bf  ard  abf      | E
all      | bf  rd  awf  abf  ard  wf      | P
k = 6    | bf  (abf rd ard)  awf  wf      | P
k = 10   | bf  rd  (abf awf)  (ard wf)    | P
k = 20   | bf  rd  awf  (abf ard wf)      | P
k = 50   | bf  rd  awf  ard  wf  abf      | P
all      | bf  ard  abf  rd  awf  wf      | M
k = 2    | bf  abf  ard  awf  rd  wf      | M
k = 6    | bf  rd  ard  abf  awf  wf      | M
k = 10   | rd  abf  ard  bf  awf  wf      | M
k = 20   | ard  abf  awf  rd  bf  wf      | M
k = 50   | ard  rd  awf  bf  abf  wf      | M
h = ENT  | bf  abf  ard  awf  rd  wf      | M
h = SPL  | bf  ard  abf  rd  awf  wf      | M
h = RIO  | rd  ard  awf  bf  abf  wf      | M
h = MPS  | ard  abf  rd  awf  wf  bf      | M
all      | awf  bf  (abf ard)  rd  wf     | T
k = 2    | abf  (ard awf)  bf  rd  wf     | T
k = 6    | awf  (abf ard)  bf  (rd wf)    | T
k = 10   | awf  abf  (ard bf)  rd  wf     | T
k = 20   | bf  ard  awf  abf  rd  wf      | T
k = 50   | bf  wf  (awf rd)  ard  abf     | T
h = ENT  | awf  abf  bf  ard  rd  wf      | T
h = SPL  | (abf awf bf)  ard  rd  wf      | T
h = RIO  | awf  (abf ard bf)  rd  wf      | T
h = MPS  | bf  awf  ard  abf  rd  wf      | T

• S2, an informative MP that is optimal for h is selected, and
• S3, an automated oracle classifies each MP in a way the predefined target diagnosis D∗ is not ruled out.

For our analyses, we recorded (sampling) times and the number of measurements (i.e., loop iterations) throughout a session.

3.7 Results

From our experiments, we obtained two large datasets, with 6 ∗ 8 ∗ 5 = 240 (EXP1) and 6 ∗ 8 ∗ 5 ∗ 4 = 960 (EXP2) factor combinations, respectively, for the factors sample type (6 levels), diagnosis problem (8), sample size (5), and heuristic (4). Due to the high information content of our data and the paper length restrictions, we can only provide a very condensed presentation of the results, given in Tables 3 and 4.

Presentation.
Table 3 shows rankings of the sample types over different subsets of all factor combinations (referred to as scenarios; left column of the table). E.g., scenario "all" means all 240 (EXP1) / 960 (EXP2) cases aggregated, whereas "k = 20" denotes exactly the 240 : 5 = 48 (EXP1) / 960 : 5 = 192 (EXP2) cases where the sample size was set to 20. Results from EXP1 are depicted in the top part of the table (first ten rows); results from EXP2 in the bottom part. A sample type T_i being ranked prior to a type T_j (middle column of the table) means that T_i was better than T_j in more of the factor combinations of the respective scenario than vice versa. Here, the meaning of "better" (the criterion for comparison; rightmost table column) is
• a higher Pearson correlation coefficient between estimated and real values for elimination rate (E) and probability (P) estimations, respectively (cf. theoretical representativeness in Sec. 3.4), and
• a lower average number of measurements (M) and a lower average sample computation time (T), respectively, in sequential sessions (cf. practical representativeness in Sec. 3.4).
Note that the table does not indicate how much better (or worse) one T_i was than another, but only that it was the preferred choice over the other in more (or fewer) cases of a scenario. Moreover, the rankings do not mean that a higher-ranked strategy was always better than a lower-ranked one. The idea behind this representation is to give the user of a diagnosis system guidance on how to set parameters (diagnosis computation algorithm, number of computed diagnoses, heuristic used for measurement selection) in order to have the highest chance of achieving the best estimations (EXP1) / efficiency (EXP2).
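The ranking construction just described (T_i precedes T_j iff T_i beats T_j in more factor combinations than vice versa) can be aggregated into a total order, e.g., by a Copeland-style win count. The sketch below illustrates this idea; the paper does not specify its exact aggregation procedure, and all function and variable names are ours.

```python
from itertools import combinations

def rank_sample_types(scores, lower_is_better=True):
    """Rank sample types by pairwise wins over factor combinations.

    scores: dict mapping sample type -> list of criterion values,
            one value per factor combination of the scenario
            (e.g., number of measurements, where lower is better).
    """
    wins = {t: 0 for t in scores}
    for a, b in combinations(scores, 2):
        a_wins = sum((x < y) if lower_is_better else (x > y)
                     for x, y in zip(scores[a], scores[b]))
        b_wins = sum((y < x) if lower_is_better else (y > x)
                     for x, y in zip(scores[a], scores[b]))
        if a_wins > b_wins:
            wins[a] += 1       # a was better in more combinations
        elif b_wins > a_wins:
            wins[b] += 1
    # Types with equal win counts would correspond to the
    # parenthesized (tied) entries in Table 3.
    return sorted(scores, key=lambda t: -wins[t])
```

For instance, with measurement counts per factor combination `{"bf": [3, 3, 3], "rd": [5, 2, 4], "wf": [6, 6, 6]}`, the function yields the ranking `["bf", "rd", "wf"]`.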
Table 4 lists the best-ranked sample types wrt. overall time per diagnosis session in EXP2 (cumulated system computation time plus cumulated time for all measurements) for different scenarios and assumed measurement conduction times (1 min, 10 min). The two rightmost columns ("adj") show hypothetical results under the assumption that sample types T2 (rd) and T3 (wf), which we naively simulated by means of brute-force diagnosis computation in our experiments (cf. Sec. 3.3), were as efficiently computable as sample type T1 (bf). This allows us to assess the added value of, e.g., efficient random diagnosis sampling techniques.

Discussion.
We address each research question in turn:
RQ1: (Elimination rate, criterion E, Table 3) We see that rd is the sample type of choice, as one would expect. The median correlation coefficients over all cases per scenario, for both the best and the worst sample type and for each k ∈ {6, 10, 20, 50}, reveal that estimations were altogether pretty good for all sampling techniques. However, for larger k, the coefficients for rd manifested a significantly lower variance than those for all other techniques, i.e., all coefficients for rd concentrated in the interval [0.9, 1], whereas the lowest coefficients for all other techniques lay between less than 0.5 and 0.7. Moreover, it stands out that wf allowed almost as accurate estimations as rd. A possible explanation for these favorable results of wf is that there is usually a large number of minimal diagnoses with a very small probability, which is why the "sub-population" from which the wf diagnoses are "selected" tends to be larger (and thus more representative) than for other sample types, except for rd (where diagnoses are drawn at random from the full population). Finally, it is interesting that the approximate methods (awf, ard, abf) produced less representative samples than the exact ones. And, although rd comes out on top for E, its approximate counterpart ard shows the worst results.

Remarks wrt. RQ1: (1) We had to leave out the k = 2 scenarios, as there were too few informative MPs, which made these scenarios not reliably analyzable. (2) Values and rankings for other types of correlation coefficients (i.e., Spearman and Kendall) were very similar to the presented (Pearson) results. (3) Most correlation coefficients were statistically significant, except for a few k = 6 scenarios and some scattered k = 10 cases.

(Probability, criterion P, Table 3) Here, bf proved to be the predominantly superior technique in all depicted scenarios. Although the fact that rd was only the second-best method might be surprising at first sight, the likely explanation is that often few of the most probable diagnoses already account for a major part of the overall probability mass, which is why they are more reliable for estimations of P than a random sample. For the same reason, it comes as no revelation that wf samples turned out to be the least preferable means to estimate P. Again, the medians of the correlation coefficients over all cases per scenario, for both the best and the worst sample type and for each k ∈ {6, 10, 20, 50}, show that all sampling methods enabled pretty decent estimations, even for small sample sizes of only six diagnoses. Similarly as for E, the variance of the correlation coefficients was significantly lower for the best sample type, bf, than for all others. However, the spread of the correlation coefficients for the single sample types was noticeably larger for P than for E for smaller k, suggesting that only bf facilitates very reliable estimations of P for small to medium sample sizes.

RQ2: (Number of measurements, criterion M, Table 3) We find that bf was the best strategy if all data is considered; it was also the most suitable choice for the heuristics ENT and SPL and for small sample sizes (k ∈ {2, 6}). On the other hand, it was the worst choice for the MPS heuristic, where it led to substantial overheads (of up to more than 100 %) compared to other sample types, especially for large sample sizes. E.g., for the diagnosis problem U and k = 50, a diagnosis session using bf involved 58 measurements vs. 25 measurements if rd was used instead. What is somewhat surprising is that bf decidedly outperformed rd in the SPL scenarios, although the SPL function does not use any probabilities (where bf leads to better estimations), but solely the elimination rate (where rd produces better estimates). Further analyses are needed to better understand this phenomenon. Overall, rd compares favorably only against awf and wf, and its performance depends largely on the used heuristic. For RIO it is even the sample type of choice, and for MPS it clearly overcomes bf. For all four heuristics, one of the approximate methods was the second-best method, among which ard led to good performance most consistently. In comparison with rd, ard was only (slightly) outweighed for RIO, but prevails for the other three heuristics. When considering large samples (20 or 50 diagnoses), ard even turned out to be the overall winner. This indicates that the QuickXPlain-based approximate random algorithm, in spite of its rather poor estimations (cf. E and P in Table 3), tends to be no less effective than a real random strategy. Finally, observe that wf was in fact the least favorable option in quasi all scenarios.

(Time for diagnosis session, criterion T, Table 3) Due to the naive brute-force approach we used in our experiments to generate samples of type rd and wf, it comes as no surprise that these two methods perform most poorly in terms of T. When drawing our attention to the best strategies, we find that, in all but one (h = SPL) of the shown scenarios, a different sample type exhibited the lowest time (T) than the one that manifested the lowest number of measurements (M).
This prompts the conjecture of a time-information trade-off in diagnosis sampling; or: whenever the sampling process is most efficient (on avg.), the measurements arising from the sample are not most effective (on avg.). In particular, we recognize that, if an exact method is best for T (M), then an approximate method is best for M (T). And, unlike for M, ard tends to be worse than awf and abf in the case of T.

(Diagnosis session time, criteria T and M combined, Table 4)
Since the outcome for RQ2 is not at all clear-cut when viewing M and T separately, we investigate their combined effect, i.e., the overall (avg.) length of diagnosis sessions for the different sample types.

Table 4: Best sample types wrt. overall sequential diagnosis time for various scenarios (EXP2); t denotes the assumed time for each measurement.

scenario   t = 1 min   t = 10 min   t = 1 min (adj)   t = 10 min (adj)
all data   bf          bf           bf                bf
k = 2      bf          bf           bf                bf
k = 6      bf          bf           bf                bf
k = 10     abf         abf          abf               (rd, abf)
k = 20     awf         bf           awf               bf
k = 50     bf          bf           rd                rd
h = ENT    bf          bf           bf                bf
h = SPL    bf          bf           bf                bf
h = RIO    bf          ard          rd                (ard, bf)
h = MPS    ard         ard          ard               ard

In brief, the conclusions are:
• For small sample sizes below 10, go with bf.
• For sample size 10, use abf.
• For sample size 20, take awf if the expected time for conducting measurements is low, and take bf otherwise.
• Unless there is an efficient method for rd, use bf for the large sample size (50); otherwise use rd.
• For ENT or SPL, adopt bf.
• For RIO, if measurement time is short, use bf if there is no efficient algorithm for rd, otherwise use rd; if measuring takes longer, use ard.
• For MPS, use ard.

RQ3: For theoretical representativeness, we observe pretty consistent (ranking) results over all sample sizes (cf. the often equal entries in each column for the E and P criteria in Table 3). There is more variation when comparing results for different diagnosis problems. Nevertheless, the results are fairly stable concerning the winning strategy: rd is in all cases the best (75 %) or second-best (25 %) sample type for E, and bf is in all but one case the best (63 %) or second-best (25 %) sample type for P. For practical representativeness, we see more fluctuation over different sample sizes and heuristics, as discussed for RQ2 above (cf. the variation over the entries of each column for M and T in Table 3).
Examining the (ranking) results over different diagnosis problems reveals a similar picture, where, however, the rankings for T are decidedly more stable than those for M, meaning that relative sampling times are less affected by the particular problem instance than the informativeness of the samples.
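The rule-of-thumb recommendations distilled from Table 4 (bulleted above) can be condensed into a simple lookup. This sketch is our own encoding; in particular, giving the heuristic-specific rules precedence over the size-based ones is our assumption, as the paper states the two dimensions separately.

```python
def recommend_sample_type(k, heuristic=None, t_meas_min=1.0, fast_rd=False):
    """Sketch of the parameter guidance derived from Table 4.

    k:          sample size (number of computed diagnoses)
    heuristic:  'ENT', 'SPL', 'RIO', 'MPS', or None if undecided
    t_meas_min: expected time per measurement in minutes
    fast_rd:    True if an efficient random-sampling algorithm exists
    """
    if heuristic in ('ENT', 'SPL'):
        return 'bf'
    if heuristic == 'MPS':
        return 'ard'
    if heuristic == 'RIO':
        if t_meas_min >= 10:            # measuring takes long
            return 'ard'
        return 'rd' if fast_rd else 'bf'
    # No heuristic fixed yet: decide by sample size.
    if k < 10:
        return 'bf'
    if k == 10:
        return 'abf'
    if k == 20:
        return 'awf' if t_meas_min < 10 else 'bf'
    return 'rd' if fast_rd else 'bf'    # large samples (k = 50)
```

E.g., `recommend_sample_type(50, fast_rd=True)` yields `'rd'`, reflecting the bullet on large sample sizes.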
RQ4: Our data indicates a clear trend that increasing the sample size leads to better theoretical representativeness (cf. the discussion of RQ1). However, it also suggests that there is no general significant positive effect of a larger sample size on practical representativeness. While this is obvious for sampling time (T), i.e., generating more diagnoses cannot take less time, it is less so for the number of measurements (M). In fact, we even measured increases wrt. M in some cases (e.g., for MPS) as a result of drawing larger samples. This corroborates similar findings in this regard, albeit for lower sample sizes and other types of diagnosis problems, reported by [18; 25].
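The positive effect of sample size on theoretical representativeness is exactly what sampling theory predicts: the error of an estimate from a random sample shrinks roughly with the square root of the sample size. A small self-contained simulation makes this concrete; the population and all numbers below are synthetic and purely illustrative, not taken from our experiments.

```python
import random

random.seed(0)

# Synthetic population of 1000 minimal diagnoses; 30 % contain a
# hypothetical component c (True means "diagnosis contains c").
population = [True] * 300 + [False] * 700
true_rate = 0.3

def estimation_error(k, trials=2000):
    """Mean absolute error when estimating the fraction of diagnoses
    containing c (a proxy for an elimination-rate estimate) from a
    random sample of k diagnoses."""
    err = 0.0
    for _ in range(trials):
        sample = random.sample(population, k)
        err += abs(sum(sample) / k - true_rate)
    return err / trials

for k in (6, 10, 20, 50):
    print(f"k={k:2d}: mean abs. error {estimation_error(k):.3f}")
```

With the sample sizes used in our experiments, the mean error drops steadily from k = 6 to k = 50, mirroring the correlation-coefficient trend for rd in Table 3.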
RQ5: From our data, we cannot generally conclude that better theoretical representativeness implies better practical representativeness (see the discussions on RQ1, RQ2 and RQ3). E.g., observe the performance of ard in the top vs. the bottom part of Table 3. We surmise the cause of this to lie in the facts that (1) the heuristics are based on a lookahead of only one step (where the approximate character of this analysis might counteract the benefit of good estimations), and (2) the added (information) value of additional diagnoses taken into a sample, regardless of how they are selected, decreases with the sample size (cf. the law of diminishing marginal utility [41] in economics).
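Regarding (1), the one-step lookahead of a heuristic can be made concrete: an ENT-style criterion, for instance, scores a measurement point by the entropy of its predicted outcome, computed from the sampled diagnoses only. The following is a simplified sketch (components as measurement points, binary outcomes, and a binary-entropy score); the actual heuristics used in our experiments follow the cited literature [2; 15; 16; 17; 18], and all names here are ours.

```python
import math

def ent_score(mp, sample, probs):
    """One-step-lookahead, entropy-style score for a measurement point.

    mp:     candidate measurement point (here: a component index)
    sample: list of diagnoses (frozensets of component indices)
    probs:  normalized probability of each sampled diagnosis

    An ENT-style criterion prefers MPs whose outcomes split the sampled
    probability mass as evenly as possible (p close to 0.5).
    """
    # Probability (estimated from the sample) that mp turns out faulty.
    p = sum(pr for d, pr in zip(sample, probs) if mp in d)
    if p in (0.0, 1.0):
        return 0.0   # uninformative: the outcome is already determined
    # Binary entropy of the predicted outcome distribution.
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_measurement(mps, sample, probs):
    """Select the MP with the best (highest) entropy score."""
    return max(mps, key=lambda mp: ent_score(mp, sample, probs))
```

Note that the score is computed from the sample alone, so a poorly representative sample directly distorts the lookahead, which is the effect conjectured in (1).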
Research Limitations
Our evaluations do not come without limitations. In brief, there are the following threats to validity:
• For feasibility reasons, we (1) did not use all diagnoses to determine the real values (EXP1), but just all minimal ones, and (2) used only such problem instances that reasonably allow the generation of all minimal diagnoses.
• We focused on binary-outcome measurements, which are common in some, but not all diagnosis sub-domains, e.g., in ontology and KB debugging [9; 42], circuit diagnosis [2], or matrix-based methods [43].∗
• We did not evaluate bf/wf sample types including minimum/maximum-cardinality minimal diagnoses, but concentrated on most/least probable ones.∗
• To keep the size of our dataset manageable, we (1) omitted less commonly used existing heuristics, and (2) included only a subset of the available diagnosis computation methods in our analyses.∗
• In EXP2, the differences between sample types underlying the rankings were in many cases not statistically significant, due to the relatively low number of 10 diagnosis sessions we ran for each factor combination (note that EXP2 already takes several weeks of computation time with these settings). So, conclusions from our data must be treated with caution for the time being.∗

Conclusion

We tested six diagnosis sampling techniques wrt. the quality of their estimations used by measurement selection heuristics, and wrt. their achieved performance in terms of diagnostic efficiency. Whereas random sampling, in line with statistical theory, leads to highly reliable estimations, this benefit is only conditionally reflected by the performance exhibited by random samples in diagnosis sessions, e.g., if the sample size is large or one specific heuristic is adopted. It turned out that, in spite of their missing statistical foundation, fully biased best-first samples including the most probable diagnoses were the best method in more of the investigated scenarios than any other sample type. However, in the majority of scenarios one of the other sample types was better than best-first. This shows that a generally most favorable sampling technique cannot be nominated from our studies, and that the optimal sampling technique to draw on depends on the particular diagnosis scenario (e.g., sample size, measurement selection heuristic).
Acknowledgments.
This work was supported by the Austrian Science Fund (FWF), contract P-32445-N38.
∗ We plan to address the bullet points marked with a "∗" in additional experiments as part of future work.

References

[1] R. Reiter. A Theory of Diagnosis from First Principles. AIJ, 32(1):57–95, 1987.
[2] J. de Kleer, B. Williams. Diagnosing multiple faults. AIJ, 32(1):97–130, 1987.
[3] P. Rodler, D. Jannach, K. Schekotihin, P. Fleiss. Are query-based ontology debuggers really helping knowledge engineers? KBS, 179:92–107, 2019.
[4] R. Stern, M. Kalech, A. Feldman, S. Rogov, T. Zamir. Finding all diagnoses is redundant. In DX'13.
[5] D. Jannach, T. Schmitz, K. Schekotihin. Parallelized hitting set computation for model-based diagnosis. In AAAI'15.
[6] J. Slaney. Set-theoretic duality: A fundamental feature of combinatorial optimisation. In ECAI'14.
[7] A. Kalyanpur. Debugging and Repair of OWL Ontologies. PhD thesis, Univ. of Maryland, 2006.
[8] K. Schekotihin, P. Rodler, W. Schmid. OntoDebug: Interactive ontology debugging plug-in for Protégé. In FoIKS'18.
[9] P. Rodler. Interactive Debugging of Knowledge Bases. PhD thesis, Univ. Klagenfurt. CoRR, abs/1605.05950, 2015.
[10] C. Meilicke. Alignment incoherence in ontology matching. PhD thesis, Univ. Mannheim, 2011.
[11] K. Schekotihin, P. Rodler, W. Schmid, M. Horridge, T. Tudorache. Test-driven ontology development in Protégé. In ICBO'18.
[12] J. de Kleer, O. Raiman. How to diagnose well with very little information. In DX'93.
[13] K. Pattipati, M. Alexandridis. Application of heuristic search and information theory to sequential fault diagnosis. IEEE T. Syst. Man Cyb., 20(4):872–887, 1990.
[14] J. de Kleer, O. Raiman, M. Shirley. One step lookahead is pretty good. In Readings in Mod.-Based Diag., 1992.
[15] B. Moret. Decision trees and diagrams. CSUR, 14(4):593–623, 1982.
[16] K. Shchekotykhin, G. Friedrich, P. Fleiss, P. Rodler. Interactive Ontology Debugging: Two Query Strategies for Efficient Fault Localization. JWS, 12-13:88–103, 2012.
[17] P. Rodler. On active learning strategies for sequential diagnosis. In DX'17.
[18] P. Rodler, W. Schmid. On the impact and proper use of heuristics in test-driven ontology debugging. In RuleML'18.
[19] A. Gonzalez-Sanchez, R. Abreu, H. Gross, A. van Gemund. Spectrum-based sequential diagnosis. In AAAI'11.
[20] R. Abreu, A. van Gemund. A low-cost approximate minimal hitting set algorithm and its application to model-based diagnosis. In Symp. on Abstract. Reform. Approx., 2009.
[21] A. Feldman, G. Provan, A. van Gemund. Computing minimal diagnoses by greedy stochastic search. In AAAI'08.
[22] K. Shchekotykhin, G. Friedrich, P. Rodler, P. Fleiss. Sequential diagnosis of high cardinality faults in knowledge-bases by direct diagnosis generation. In ECAI'14.
[23] J. de Kleer. Focusing on probable diagnoses. In AAAI'91.
[24] J. de Kleer, B. Williams. Diagnosis with behavioral modes. In IJCAI'89.
[25] J. de Kleer, O. Raiman. Trading off the costs of inference vs. probing in diagnosis. In IJCAI'95.
[26] T. Zamir, R. Stern, M. Kalech. Using model-based diagnosis to improve software testing. In AAAI'14.
[27] P. Rodler, K. Schekotihin. Reducing model-based diagnosis to knowledge base debugging. In DX'17.
[28] A. Felfernig, G. Friedrich, D. Jannach, M. Stumptner. Consistency-based diagnosis of configuration knowledge bases. AIJ, 152(2):213–234, 2004.
[29] J. de Kleer, A. Mackworth, R. Reiter. Characterizing diagnoses and systems. AIJ, 56, 1992.
[30] P. Rodler. Towards better response times and higher-quality queries in interactive knowledge base debugging. Technical report, Univ. Klagenfurt. CoRR, abs/1609.02584v2, 2016.
[31] P. Rodler, K. Shchekotykhin, P. Fleiss, G. Friedrich. RIO: Minimizing User Interaction in Ontology Debugging. In RR'13.
[32] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, P. Patel-Schneider. The Description Logic Handbook. CUP, 2007.
[33] M. Horridge, B. Parsia, U. Sattler. Lemmas for justifications in OWL. In DL'09.
[34] Q. Ji, Z. Gao, Z. Huang, M. Zhu. Measuring effectiveness of ontology debugging systems. KBS, 71:169–186, 2014.
[35] P. Rodler. Reuse, Reduce and Recycle: Optimizing Reiter's HS-tree for sequential diagnosis. In ECAI'20.
[36] T. Bylander, D. Allemang, M. Tanner, J. Josephson. The computational complexity of abduction. AIJ, 49:25–60, 1991.
[37] A. Felfernig, M. Schubert, C. Zehentner. An efficient diagnosis algorithm for inconsistent constraint sets. AI for Engin. Design Analysis Manufact., 26(1):53–62, 2011.
[38] U. Junker. QuickXPlain: Preferred Explanations and Relaxations for Over-Constrained Problems. In AAAI'04.
[39] P. Rodler. Understanding the QuickXPlain algorithm: Simple explanation and formal proof. CoRR, abs/2001.01835, 2020.
[40] P. Rodler, W. Schmid, K. Schekotihin. A generally applicable, highly scalable measurement computation and optimization approach to sequential model-based diagnosis. CoRR, abs/1711.05508, 2017.
[41] R. Easterlin. Diminishing marginal utility of income? Caveat emptor. Soc. Indic. Res., 70(3):243–255, 2005.
[42] P. Rodler, M. Eichholzer. On the Usefulness of Different Expert Question Types for Fault Localization in Ontologies. In IEA/AIE'19.
[43] M. Shakeri, V. Raghavan, K. Pattipati, A. Patterson-Hine. Sequential testing algorithms for multiple fault diagnosis.