Explaining Violation Traces with Finite State Natural Language Generation Models

Abstract

An essential element of any verification technique is that of identifying and communicating to the user system behaviour which leads to a deviation from the expected behaviour. Such behaviours are typically made available as long traces of system actions, which would benefit from a natural language explanation, especially in the context of business logic level specifications. In this paper we present a natural language generation model which can be used to explain such traces. A key idea is that the explanation language is a controlled natural language (CNL) that is, formally speaking, a regular language susceptible to transformations that can be expressed with finite state machinery. At the same time it admits various forms of abstraction and simplification which contribute to the naturalness of the explanations that are communicated to the user.



Gordon J. Pace and Michael Rosner

University of Malta gordon.pace|[email protected]


The growth in size and complexity of computer systems has been accompanied by an increase in importance given to the application of verification techniques, attempting to avoid or at least mitigate problems arising due to errors in the system design and implementation. Given a specification of how the system should behave (or, dually, of what the system should not do), techniques ranging from testing to runtime verification and model checking attempt to answer the question of whether or not the system is correct. One common issue with all these techniques is that a negative answer is useless unless accompanied by a trace showing how the system may perform leading to a violation of the expected behaviour.

Consider, for example, the specification of a system which allows users to log in, as shown in Figure 1, which states that “after three consecutive failed user authentications, users should not be allowed to attempt another login”. A testing or runtime verification tool may deduce that the system may perform a long sequence of events which lead to a violation.
Although techniques have been developed to shorten such counter-examples [ZH02], such traces may be rather long, and using them to understand the circumstances in which the system failed to work as expected may not always be straightforward.

In the case of implementation-level properties and traces, tools such as debuggers and simulators may enable the processing of long traces by developers to understand the nature of the bug, but in the case of higher-level specifications, giving business-logic level properties, such traces may need to be processed by management personnel. For example, a fraud expert may be developing fraud rules to try to match against the behaviour of known black-listed users, and may want to understand why a trace showing the behaviour of such a user does not trigger a rule he may have just set up. In such cases, a natural language explanation of such a trace would help the expert to understand better what is going wrong and why.

[Figure 1: automaton with states Logged out (0, 1 and 2 attempts), Authentication (0, 1 and 2 attempts), Logged in, User Disabled and ERROR, and transitions labelled request password, good password, bad password, read file, write file and logout.]

Fig. 1. An automaton-based specification

In this paper, we present the use of finite state natural language generation (NLG) models to explain violation traces. We assume that the basis of the controlled natural language used to describe such behaviour is given by the person writing the specification, by articulating how the actions can be described using a natural language, and how they can be abstracted into more understandable explanations. We present a stepwise refinement of the process, explaining how a more natural feel to the generated controlled natural language text can be given using finite state techniques.

Although the work we present is still exploratory, we believe that the approach can be generalised to work on more complex systems, and it can give insight into how far out the limits of finite state NLG techniques can be pushed.
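As a rough illustration, the specification of Figure 1 can be encoded as a small monitor that reports the first position at which a trace reaches the error state. This is only a sketch under stated assumptions: the single-letter event names are borrowed from Figure 2, the transition table is our reading of the figure, and the names `TRANSITIONS` and `first_violation`, as well as the choice to ignore events that have no transition from the current state, are introduced purely for the example.

```python
# Hypothetical encoding of the automaton in Figure 1. Events use the
# single-letter names of Figure 2: l = request to log in, g = good password,
# b = bad password, r = read file, w = write file, x = log out.
TRANSITIONS = {
    ("out0", "l"): "auth0", ("out1", "l"): "auth1", ("out2", "l"): "auth2",
    ("auth0", "g"): "in", ("auth1", "g"): "in", ("auth2", "g"): "in",
    ("auth0", "b"): "out1", ("auth1", "b"): "out2", ("auth2", "b"): "disabled",
    ("in", "r"): "in", ("in", "w"): "in", ("in", "x"): "out0",
    ("disabled", "l"): "error",
}


def first_violation(trace, start="out0"):
    """Return the length of the shortest prefix reaching the error state, or None."""
    state = start
    for position, event in enumerate(trace, start=1):
        # Assumption: an event with no transition from the current state is ignored.
        state = TRANSITIONS.get((state, event), state)
        if state == "error":
            return position
    return None


if __name__ == "__main__":
    print(first_violation("lgrxlblgwwxlgrwxlgxlblblbl"))  # 26
```

Run on the example trace used later in the paper, the monitor flags the final login request, which is the event that the generated explanations describe as one that "should not have been allowed".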

In this section we illustrate a solution to the problem of generating reasonably natural explanations from sequences of the above type in a computationally efficient way. The two critical ingredients are (i) NLG, which, in a general sense, provides a set of techniques for generating text flexibly given an abstract non-linguistic representation of semantic content, and (ii) CNLs which, in a nutshell, are natural languages with a designer element: natural in the sense that they can be understood by native speakers of the “parent” language, and designed to be simpler than that language from some computational perspective such as translation to logic or, as in this paper, NLG. An excellent survey and classification scheme for CNLs appears in Kuhn [Kuh14].

The final output of NLG is clearly natural language of some kind. The nature of the process that produces that output is somewhat less clear, in that there are still many approaches, though most research in the area is consistent with the assumption that it includes at least the stages of content planning (deciding what to say), content packaging (packaging the content into sentence-sized messages), and surface realisation (constructing individual sentences). These three stages are linked together in a pipeline, according to the architecture proposed by Reiter and Dale [RD00].

The complexity of NLG arises from the fact that the input content severely underdetermines the surface realisation and that there are few guiding principles available to narrow the realisation choices. Consequently, the process is even more nondeterministic than the inverse process of natural language understanding, where at least it is possible to appeal to common sense when attempting to choose amongst competing interpretations.
With NLG, some dimensions of complexity must be sacrificed for the computation to be feasible.

In this paper, the sacrifice comes down to a choice concerning two sets of languages: (i) that which expresses the content, which we will call the C language, and (ii) a sequence of languages in which explanations are realised, which we will call the E languages (E_i for the i-th language in the sequence). C is a form of semantic representation language, whilst the E_i are CNLs in Kuhn’s sense.

We assume that both C and the E_i languages are regular languages in the formal sense. This has a number of advantages: the computational properties of such languages are well understood, and we know that algorithms for parsing and generation are of relatively low complexity. Additionally, we can express linguistic processes as relations over such languages that can be computed by finite-state transducers. Elementary transductions can be composed together to carry out complex linguistic processing tasks. Using techniques originally advocated for morphological analysis by Beesley and Karttunen [BK03], we can envisage a complex NLG process as a series of finite state transductions combined together under relational composition, thus opening up the possibility of reducing the synthesis of an explanation to efficient, finite-state machinery.

Of course, the restriction to regular languages imposes certain limitations upon what content can possibly be expressed in C, and may also impact the naturalness of the violation description expressed in the E languages. However, these are empirical issues that will not be tackled in this paper.

We are not the first to have used simplified languages in the attempt to reduce the complexity of natural language processing. In the domain of NLG, Wilcock [Wil01] proposed “the use of XML-based tools to implement existing well-known approaches to NLG”. Power [Pow12] uses finite-state representations for expressing descriptions of OWL-LITE sentences.
It is of course in the area of computational morphology where finite state methods are best known.

The main contribution of the paper is to substantiate and present the hypothesis that, according to the choice of C and E, it is possible to realise a family of efficient NLG systems that are based on steadfastly finite-state technology.

[Figure 2: the automaton of Figure 1 with its transitions relabelled by the symbols l, g, b, r, w and x, together with the following table of explanations.]

Symbol  NL explanation
b       the user gave a bad password
g       the user gave a good password
l       the user requested to log in
x       the user logged out
r       the user read from a file
w       the user wrote to a file

Fig. 2. The specification augmented with NL explanations

In what follows we first present the C language and then a sequence of E languages, progressively adding features to attain a more natural explanation of the trace. As we shall investigate in more detail in Section 4, at each stage we use further information to obtain more natural generated text.

3.1 C Language

We assume that the basic specification of the C language is given by the automaton shown in Figure 2. The following trace is a sentence of C:

    lgrxlblgwwxlgrwxlgxlblblbl

Note that although the automaton itself is not necessary for the explanations that ensue, it could in principle be used to check which trace prefix leads to an error state to allow for an explanation when such a state is reached.

Next we turn to the series of E languages. Since these are all CNLs we will refer to them as CNL0, CNL1, CNL2 and CNL3 respectively. All four languages are similar insofar as they all talk about the same underlying, domain-specific world of states and actions, and they are all finite state. At the same time they are somewhat different linguistically.

3.2 CNL0

Sentences of the CNL0 language are very simple declarative sentences of the kind that we typically associate with simple predicate-argument structures. In the example shown in Figure 3, each sentence has a subject, a verb, and possibly a direct object.

The user requested to log in. The user gave a good password. The user read from a file. The user logged out. The user requested to log in. The user gave a bad password. The user requested to log in. The user gave a good password. The user wrote to a file. The user wrote to a file. The user logged out. The user requested to log in. The user gave a good password. The user read from a file. The user wrote to a file. The user logged out. The user requested to log in. The user gave a good password. The user logged out. The user requested to log in. The user gave a bad password. The user requested to log in. The user gave a bad password. The user requested to log in. The user gave a bad password. The user requested to log in, which should not have been allowed.

Fig. 3. A naïve explanation of the trace: CNL0

In this paper, the mapping between the C language and CNL0 is given extensionally by means of a lexicon that connects the individual transition names with a sentence with a simple and fixed syntactic structure. The lexicon itself is expressed as a finite state transducer, as described in Section 4.1. For more complex systems such an approach might not be practical, and a solution could then be to derive the sentence associated with each transition from more fundamental properties of the underlying machine.

CNL0 provides for a somewhat naïve explanation of traces using the explanations provided by the domain expert directly.

Next we turn to CNL1, which offers some improvements. The main feature of CNL1 is that it is a sequence of paragraphs, where each paragraph is simply a sequence of CNL0 sentences, as shown in Figure 4.

There are two consequences to this slightly richer structure. One is that it provides the skeleton upon which to hang the numbered steps. This is a presentation issue that arguably increases the naturalness and improves comprehension. The other is that it gives a structural identity to each paragraph that could be exploited in order to attribute certain semantic properties to the associated sequence of actions. For example, we have the notion of correctness which has the potential to figure in explanations. Nevertheless, this property is not actually exploited in CNL1.

1. The user requested to log in. The user gave a good password. The user read from a file. The user logged out.
2. The user requested to log in. The user gave a bad password.
3. The user requested to log in. The user gave a good password. The user wrote to a file. The user wrote to a file. The user logged out.
4. The user requested to log in. The user gave a good password. The user read from a file. The user wrote to a file. The user logged out.
5. The user requested to log in. The user gave a good password. The user logged out.
6. The user requested to log in. The user gave a bad password.
7. The user requested to log in. The user gave a bad password.
8. The user requested to log in. The user gave a bad password.
9. The user requested to log in, which should not have been allowed.

Fig. 4. A grouped explanation: CNL1

The main novelty in CNL2 (see Figure 5), in contrast to CNL1, is the use of aggregation to reduce each multi-sentence paragraph to a single, more complex sentence. This is a technique which is used for removing redundancy (see Dalianis and Hovy [DH93]), yielding texts that are more fluid, more acceptable and generally less prone to being misunderstood by human readers than CNL1-style descriptions.

1. The user requested to log in, gave a correct password and after reading from a file logged out.
2. The user requested to log in, and gave a bad password.
3. After a log in request the user gave a correct password and wrote twice to a file before logging out.
4. The user requested to log in, gave a correct password, read from a file, wrote to a file and then logged out.
5. After requesting a log in, the user gave a good password and logged out.
6. The user requested to log in, gave a bad password, requested again to log in, gave another bad password and after requesting to log in, gave another bad password.
7. Finally, the user made a request to log in, which should not have been allowed.

Fig. 5. A better grouped explanation: CNL2
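The simplest aggregation pattern visible in Figure 5 (keep the shared subject once, then join the remaining verb phrases) can be sketched in a few lines. The sketch below is only an illustration and not the finite-state implementation used in the paper: the function name `aggregate` and the constant `SUBJECT` are hypothetical, and it assumes every input sentence starts with "The user" and ends with a full stop.

```python
SUBJECT = "The user "  # assumed fixed subject shared by all CNL0 sentences


def aggregate(sentences):
    """Merge CNL0 sentences that share a subject into a single sentence."""
    if not sentences or not all(s.startswith(SUBJECT) for s in sentences):
        raise ValueError("expected one or more sentences starting with 'The user '")
    # Drop the shared subject and the final full stop from every sentence.
    phrases = [s[len(SUBJECT):].rstrip(".") for s in sentences]
    if len(phrases) == 1:
        return SUBJECT + phrases[0] + "."
    # Reinstate the subject once and insert "and" before the last verb phrase.
    return SUBJECT + ", ".join(phrases[:-1]) + ", and " + phrases[-1] + "."


if __name__ == "__main__":
    print(aggregate(["The user requested to log in.",
                     "The user gave a bad password."]))
```

Applied to paragraph 2 of Figure 4, this reproduces paragraph 2 of Figure 5. The other paragraphs of Figure 5 rely on further connectives ("after", "then", "twice") that this sketch does not attempt.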

The linguistic renderings resulting from aggregation in CNL2 include:

1. Punctuation other than full stops
2. Temporal connectives (“after”, “then”, “finally”)
3. The use of contrastive conjunctions like “but”
4. Collective terms (“twice”)

3.5 CNL3

Finally, CNL3 (see Figure 6) is considerably more complex, because it not only contains further aggregation but also summarisation.

The user logged in a number of times, interspersed by sequences of one or two bad logins, after which she unsuccessfully attempted to log in 3 times. The user then made another request to log in, which should not have been allowed.

Fig. 6. A natural explanation: CNL3

In this example, there are only two sentences. The first sentence not only aggregates the first six sentences, but it also omits some of the information (for example, that the user read from a file, that the user logged out, etc.). It also includes the use of certain phrases whose correct interpretation, as mentioned earlier, requires consideration of the context of occurrence, as well as the use of adverbs (“she unsuccessfully attempted”) and the use of more complex tenses (“should not have been allowed”).

In this section we will look into using finite state CNLs for NLG. This is based on finite state techniques as embodied in xfst (Beesley and Karttunen [BK03]), which has already been used extensively in several other areas of language processing such as computational morphology and light parsing. xfst provides a language for the description of complex transducers together with a compiler and a user interface for running and testing transducers. Our aim is to better understand the tradeoffs involved between producing reasonably natural explanations from traces and the use of the efficient computational machinery described here.

Just as in Figure 2, our starting point is a regular input language C defined as follows:

    define SIGMA b|l|g|x|r|w;
    define C SIGMA*;

SIGMA is the alphabet of the original FSA and the entire generation mechanism accepts inputs that are arbitrary strings over this alphabet. Strings containing illegal characters yield the empty string and hence, no output.

CNL0 can be obtained more or less directly via a dictionary which links symbols in SIGMA to simple declarative sentences, as follows (some of the syntactically more obscure aspects of this definition have been omitted for the sake of clarity):

    define SP " ";
    define USR {the SP user};
    define DICT b -> [{user} SP {gives} SP {bad} SP {password}],
                l -> [{user} SP {requests} SP {login}],
                g -> [{user} SP {gives} SP {good} SP {password}],
                x -> [{user} SP {logs} SP {out}],
                r -> [{user} SP {reads} SP {from} SP {a} SP {file}],
                w -> [{user} SP {writes} SP {to} SP {a} SP {file}];
    define CNL0 C .o. DICT;

The first line defines the space character, and the second the symbol USR. The third defines the dictionary DICT, which is implemented as a finite state transducer that maps from the individual action symbols to primitive sentences, all of which have the same basic structure. The input sequence is represented as a string:

    define input {lgrxlblgwwxlgrwxlgxlblblbl};

To get the output we compose CNL0 with input using the expression [input .o. CNL0], and extract the lower side of the relation with the .l operator ([input .o. CNL0].l). The problem with the generated output is that there are no separators between the sentences. The solution is to compose the input with a transducer sentencesep that inserts a separator:

    input .o. sentencesep .o. CNL0

This turns the input into the following string:

    l.g.r.x.l.b.l.g.w.w.x.l.g.r.w.x.l.g.x.l.b.l.b.l.b.l.

Such a string can be made to yield exactly the sentences of CNL0 by arranging for the mapping of the full stops to insert a space. This is just another transducer that is composed into the pipeline. The result of this process is exactly the text shown in Figure 3.
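For readers without access to xfst, the shape of this pipeline can be imitated with ordinary string functions. The following is a sketch only, not a transducer implementation: each stage is a plain Python function, `compose` stands in for the `.o.` operator, the lexicon uses the past-tense sentences of Figure 2 rather than the entries of DICT, and all function names are our own choices for the example.

```python
from functools import reduce

# Lexicon taken from Figure 2 (action symbol -> sentence body).
LEXICON = {
    "b": "the user gave a bad password",
    "g": "the user gave a good password",
    "l": "the user requested to log in",
    "x": "the user logged out",
    "r": "the user read from a file",
    "w": "the user wrote to a file",
}


def restrict_to_sigma(trace):
    """Mimic composition with C = SIGMA*: illegal input yields no output."""
    return trace if all(symbol in LEXICON for symbol in trace) else ""


def sentence_sep(trace):
    """Insert a full stop after every action symbol (the role of sentencesep)."""
    return "".join(symbol + "." for symbol in trace)


def lexicalise(marked):
    """Map action symbols to sentences and each full stop to '. '."""
    parts = [". " if ch == "." else LEXICON[ch].capitalize() for ch in marked]
    return "".join(parts).strip()


def compose(*stages):
    """Apply the stages left to right, loosely mirroring the .o. operator."""
    return lambda text: reduce(lambda acc, stage: stage(acc), stages, text)


cnl0 = compose(restrict_to_sigma, sentence_sep, lexicalise)

if __name__ == "__main__":
    print(cnl0("lgrxlb"))
```

The design point carried over from the paper is that each stage is small and independent, so later refinements (paragraph markers, aggregation) can be added as further stages without touching the lexicon.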

At the simplest level, we can specify how the explanation may be split into an enumerated sequence of paragraphs, aiding the comprehension of the trace explanation. Consider being given the following list of subtrace specifications using regular expressions:

1. Correct login session: lg(r+w)*x.
2. Sequence of incorrect login requests: (lb)*.

In CNL1, the main feature is that we will use this information to group text. We will assume that the following paragraph definitions are supplied:

    define correct l g [r | w]* x;
    define incorrect [l b];
    define group1 correct @-> ... %|, incorrect @-> ... %|;

The group1 definition includes a piece of xfst notation that causes a vertical bar to be inserted just after whatever matched the left hand side of the rule, yielding

    lgrx|lb|lgwwx|lgrwx|lgx|lb|lb|lb|l

when applied to the input, where the vertical bar is used to delimit paragraphs. As shown earlier, we can also insert sentence separators, which gives the representation shown in Figure 7.

    l.g.r.x.|l.b.|l.g.w.w.x.|l.g.r.w.x.|l.g.x.|l.b.|l.b.|l.b.|l.

Fig. 7. CNL1 representation just prior to lexicalisation

Composing this with an augmented version of CNL0 that also handles the paragraph breaks yields exactly the paragraph structure of the CNL1 rendering shown in Figure 4. An inherent limitation of this approach is that it is impossible to produce a finite-state transducer that will output a numbering scheme for arbitrary numbers of paragraphs. Our solution is to postprocess the output, and generate, for instance, HTML or LaTeX output which will handle the enumeration as required.
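The grouping step and the numbering post-process can be approximated with Python's `re` module, as in the sketch below. It is an analogue rather than a translation of the xfst rules: `@->` performs leftmost-longest replacement, whereas Python tries alternatives in order, and the two happen to coincide for these particular patterns. The names `PARAGRAPH`, `group` and `enumerate_paragraphs` are hypothetical.

```python
import re

# Python counterparts of the xfst definitions `correct` and `incorrect`.
PARAGRAPH = re.compile(r"lg[rw]*x|lb")


def group(trace):
    """Insert '|' just after every correct session or failed login attempt."""
    return PARAGRAPH.sub(lambda m: m.group(0) + "|", trace)


def enumerate_paragraphs(grouped):
    """Post-processing step: number the '|'-delimited paragraphs."""
    paragraphs = [p for p in grouped.split("|") if p]
    return [f"{i}. {p}" for i, p in enumerate(paragraphs, start=1)]


if __name__ == "__main__":
    grouped = group("lgrxlblgwwxlgrwxlgxlblblbl")
    print(grouped)  # lgrx|lb|lgwwx|lgrwx|lgx|lb|lb|lb|l
    print("\n".join(enumerate_paragraphs(grouped)))
```

The numbering is done outside the regular-expression step, which mirrors the observation above that an unbounded enumeration cannot be produced by the transducer itself.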

We can now move on to CNL2. This involves several intermediate stages which are diagrammed below:

    A: l.g.r.x.|l.b.|l.g.w.w.x.|l.g.r.w.x.|l.g.x.|l.b.|l.b.|l.b.|l.
    B: l,g,r,x.|l,b.|l,g,w,w,x.|l,g,r,w,x.|l,g,x.|l,b.|l,b.|l,b.|l.|
    C: aggregation1
    D: aggregation2

A is as shown in Figure 7. We must now prepare for aggregation by first replacing all but the paragraph-final full stops with commas. Because the transducer that achieves this uses the paragraph marker to identify the final full stop, we must first insert that final paragraph marker, as shown in B. The next two phases of aggregation are best explained with the following example: we wish to transform “the user requested to login. the user gave a good password. the user logged out.” to the more natural “the user requested login, gave a good password, and logged out”. The first phase removes the subject (i.e. the phrase “the user”) of all sentences but reinstates the same subject at the beginning of the paragraph. The second inserts an “and” just before the final verb phrase of each aggregated sentence. In this way we are able to achieve paragraph 2 of the CNL2 example as shown in Figure 5. Similar, surface-oriented techniques can be used to obtain the other paragraphs in Figure 5. Specifically, we have composed rules for inserting the words “after”, “then”, “twice”, “another”, “finally” and “and”. However, space limitations prevent us from describing these in full.

4.4 Adding Abstraction: CNL3

We note that certain sequences of actions can be combined into a simpler explanation, abstracting away (possibly) irrelevant detail, thus aiding comprehension. For instance, consider the following rules, consisting of (i) a regular expression matching a collection of subtraces which may be explained more concisely; and (ii) a natural language explanation which may replace the detailed text one would obtain from the whole subtrace:

1. Consecutive correct login sessions: (lg(r+w)*x)^n explained as “The user successfully logged in n times”.
2. Consecutive failed login attempts: (lb)^n explained as “The user unsuccessfully attempted to log in n times”.
3. Correct login sessions interspersed with an occasional incorrect one: ((lg(r+w)*x)* lb (lg(r+w)*x)+)* explained as “The user successfully logged in a number of times, with one-off bad logins in between”.
4. Correct login sessions interspersed with an occasional incorrect one or two: ((lg(r+w)*x)* (lb + lblb) (lg(r+w)*x))* explained as “The user logged in a number of times, interspersed by sequences of one or two bad logins”.

Note that xfst allows regular expressions that are parametrised for the number of times a repeated expression matches. For example, the statement

    define success3 [l g [r | w]* x]^3;

achieves the first definition above (for n = 3) and associates it with the multicharacter symbol success3. This can be added to the dictionary DICT and associated with the string in much the same way as the strings associated with transitions, as shown above.

We will assume that these rules will be applied using a maximal length strategy: we prefer a longer match, and in case of a tie, the first rule specified is applied. xfst allows the user to choose between longest and shortest match strategies. Using appropriate xfst rules would result in the description given in Figure 6.
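The maximal length strategy can be made concrete with a small rewriting loop. The sketch below covers only the first two rules, uses Python regular expressions in place of xfst, and falls back to the Figure 2 lexicon for symbols that no rule covers. The names `RULES`, `times` and `summarise` are hypothetical, and rendering a single occurrence as "once" is our own choice rather than something specified in the paper. Greedy matching yields the longest match for these particular patterns, which is not guaranteed for arbitrary Python regular expressions.

```python
import re

# Fallback sentences from Figure 2, used when no abstraction rule applies.
LEXICON = {
    "l": "The user requested to log in.",
    "g": "The user gave a good password.",
    "b": "The user gave a bad password.",
    "r": "The user read from a file.",
    "w": "The user wrote to a file.",
    "x": "The user logged out.",
}


def times(n):
    """Render a repetition count (assumption: 'once' for a single match)."""
    return "once" if n == 1 else f"{n} times"


# (pattern, renderer) pairs in priority order; earlier rules win ties.
RULES = [
    (re.compile(r"(?:lg[rw]*x)+"),
     lambda s: f"The user successfully logged in {times(s.count('g'))}."),
    (re.compile(r"(?:lb)+"),
     lambda s: f"The user unsuccessfully attempted to log in {times(s.count('b'))}."),
]


def summarise(trace):
    """Rewrite a trace left to right, preferring the longest rule match."""
    sentences, pos = [], 0
    while pos < len(trace):
        best = None  # (end position, renderer) of the longest match at pos
        for pattern, render in RULES:
            m = pattern.match(trace, pos)
            # Strict '>' keeps the earlier rule when two matches tie.
            if m and m.end() > pos and (best is None or m.end() > best[0]):
                best = (m.end(), render)
        if best is None:
            sentences.append(LEXICON[trace[pos]])
            pos += 1
        else:
            end, render = best
            sentences.append(render(trace[pos:end]))
            pos = end
    return sentences


if __name__ == "__main__":
    for sentence in summarise("lgrxlblgwwxlgrwxlgxlblblbl"):
        print(sentence)
```

Adding rules 3 and 4 to `RULES` would let a single match absorb the interleaved sessions, which is how the first sentence of Figure 6 comes to cover most of the trace.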

To further enrich the generated explanations, we can extend the approach used in the previous section for CNL3 to allow for actions to be described using different terms in different contexts. For example, a logout action when logged in may be described as ‘the user logged out’, while a logout occurring while the user is already logged out would better be described as ‘the user attempted to log out’. We can use techniques similar to the ones presented in the previous section, using regular expressions to specify contexts in which an action will be described in a particular manner.

Consider the specification below, in which each action and natural language description pair is accompanied by two regular expressions which have to match with the part of the trace immediately preceding and following the action for that description to be used. (We use the notation ā to signify any single symbol except for a.)

Action  Pre        Post  CNL rendering
x       l x̄*       –     user logs out
        otherwise        user attempts to log out
l       –          b     the user attempts to log in
        otherwise        the user logs in

This technique can be further extended and refined to deal with repetition of actions, as shown below with repeated logins:

Action  Pre        Post  CNL rendering
l       l b l̄*     b     user attempts to log in again
        –          b     user attempts to log in
        l b̄ l̄*     –     user logs in again
        otherwise        user logs in

It is interesting to see how far this approach can be pushed and generalised to allow for the generation of more natural sounding text from the input traces.

In this paper we have presented preliminary results illustrating how finite state approaches can be used to generate controlled natural language explanations of traces. Although there is still much to be done, the results are promising and it is planned that we use such an approach to allow for the specification of natural language explanations to be used in the runtime verification tool Larva [CPS09].

Two problems underlying our task are: (i) the discovery of subsequences that are interesting for the domain in question and (ii) how to turn an interesting subsequence into a natural-sounding explanation. In this paper we have provided somewhat ad hoc solutions to both these problems. While one can use profiling techniques to discover interesting, or frequently occurring, subsequences, clearly there needs to be a strong human input in identifying which of these sequences should be used to abstract and explain traces more effectively. On the other hand, we see that many of the ad hoc solutions adopted to make explanations more natural-sounding may be generalised to work on a wide range of situations. We envisage that the person building the specification may add hints as to how to improve the explanation, such as the tables shown in Section 4.4 to improve abstraction and the ones given in Section 4.5 to add contextuality.

Given that, essentially, we are using regular grammars to specify our natural language generator, the generalisation process to reduce human input while generating more natural-sounding text is bound to hit a limit. It is of interest to us, however, to investigate how far these approaches can be taken without resorting to more sophisticated techniques usually applied to language generation.

References

[BK03] K.R. Beesley and L. Karttunen. Finite State Morphology. Number v. 1 in Studies in computational linguistics. CSLI Publications, 2003.

[CPS09] Christian Colombo, Gordon J. Pace, and Gerardo Schneider. Larva - safer monitoring of real-time java programs (tool paper). In Seventh IEEE International Conference on Software Engineering and Formal Methods (SEFM), pages 33–37. IEEE Computer Society, November 2009.

[DH93] Hercules Dalianis and Eduard H. Hovy. Aggregation in natural language generation. In Giovanni Adorni and Michael Zock, editors, EWNLG, volume 1036 of Lecture Notes in Computer Science, pages 88–105. Springer, 1993.

[Kuh14] Tobias Kuhn. A survey and classification of controlled natural languages. Computational Linguistics, 40(1):121–170, March 2014.

[Pow12] Richard Power. Owl simplified english: A finite-state language for ontology editing. In Tobias Kuhn and Norbert E. Fuchs, editors, CNL, volume 7427 of Lecture Notes in Computer Science, pages 44–60. Springer, 2012.

[RD00] Ehud Reiter and Robert Dale. Building Natural Language Generation Systems. Cambridge University Press, New York, NY, USA, 2000.

[Wil01] G. Wilcock. Pipelines, templates and transformations: Xml for natural language generation. In Proceedings of the First NLP and XML Workshop, NLPXML 2001, Lecture Notes in Computer Science, pages 1–8, Tokyo, 2001.

[ZH02] Andreas Zeller and Ralf Hildebrandt. Simplifying and isolating failure-inducing input.