Explainable Queries over Event Logs
Sylvain Hallé
Laboratoire d’informatique formelle, Université du Québec à Chicoutimi, Canada
Abstract—Added value can be extracted from event logs generated by business processes in various ways. However, although complex computations can be performed over event logs, the result of such computations is often difficult to explain; in particular, it is hard to determine what parts of an input log actually matter in the production of that result. This paper describes how an existing log processing library, called BeepBeep, can be extended in order to provide a form of provenance: individual output events produced by a query can be precisely traced back to the data elements of the log that contribute to (i.e. “explain”) the result.
I. Introduction

Various kinds of information systems generate data streams in the form of sequences of data elements called event logs. Sources of event logs are diverse: business process management engines, web servers, sensor networks and instrumented pieces of generic software can all be instructed to record information about their execution to a persistent storage medium. Added value can be extracted from the event logs generated by these systems in various ways. Logs can be checked for compliance violations of best practices or for adherence to predetermined sequences of events, monitored to detect deviations of some data point from a specified value, or used to calculate various quality metrics. This processing can take place after the system has completed its execution (offline processing), or results can be computed on-the-fly as the events from the source are ingested (streaming processing). These two modes of operation are often grouped under the generic term “event stream processing”.

Over the past decade, event stream processing systems have seen widespread use, with the advent of solutions such as Amazon Kinesis (https://aws.amazon.com/kinesis), Apache Storm (https://storm.apache.org), Flink (https://flink.apache.org), Siddhi [27] and Esper (https://espertech.com). These systems provide rich processing capabilities, making it possible to evaluate complex queries over event logs. However, although intricate computations can be performed over these sources of data, the result of such computations is often difficult to explain. For example, a Flink pipeline can calculate some quality metric over instances of a process, and check that it always lies over some given threshold; however, if the result is false, how can one identify the source of the error?

Developers of information systems in all disciplines are facing increasing pressure to come up with mechanisms to describe how a specific result is obtained, a concept called explainability.
Although the term is often tied to AI [24], explainability is desirable in other fields of computation. For instance, if a system fails to verify a given property, a counter-example is generally sought as a means of understanding the source of the problem. This pressure often comes from regulations imposing constraints on the traceability of data processing, such as GDPR and BCBS. Yet, for most of the aforementioned engines, it is hard to determine what parts of an input log actually matter in the production of a given result. A user is typically left with the manual task of querying the log in various ways in order to investigate the reason for a surprising or irregular output result.

In Section II, we shall see that various technologies and frameworks have been developed over the years in order to provide a form of “lineage” or “provenance” information about the output of some computer system. However, none of these systems consider the special problem of explainability for event stream processing; in contrast, existing event stream processing systems provide very little in the way of lineage and explainability, leaving a gap that needs to be filled. In this paper, we describe how an existing log processing library, called BeepBeep [16], can be extended in order to provide a form of explanation mechanism: the output produced by a query can be precisely traced back to the individual data elements of the log that contribute to (i.e. “explain”) the result. Section III shall introduce the basic concepts behind event stream processing in BeepBeep, and provide a few examples of simple queries that can be run on event logs. Section IV describes the data lineage mechanism that has been added to the library as part of this work.
This mechanism leverages the fact that calculations in BeepBeep are done by composing basic computation units together into event pipelines; therefore, in order to obtain end-to-end provenance, it suffices to define simple input/output relationships for each of these units separately. Finally, in Section V, the impact of the use of provenance on space and time resources is measured experimentally. These results show that, provided a user accepts some performance trade-off, the library can provide articulate and intuitive results when processors are composed to form complex computation chains.

II. Related Work

Taken in a broad sense, we call “data lineage” any activity that attempts to link the result of a computation (its outputs) to the data elements that contribute to this result (its inputs). Depending on the field of study, variants on the notion of lineage have been given different names.

A large amount of work on lineage has been done in the field of databases, where this notion is often called provenance. We can distinguish between three types of provenance. The first type is called why-provenance and has been formalized by Cui et al. [10]. To each tuple t in the output of a (relational) query, why-provenance associates a set of tuples present in the input of the query; the meaning of this set is to collect all the input data that helped to “produce” t. How-provenance, as its name implies, keeps track not only of what input tuples contribute to the output, but also of the way in which these tuples have been combined to form the result [12]. Finally, where-provenance describes where a piece of data is copied from [4]. It is typically expressed at a finer level of granularity, by linking individual values inside an output tuple to individual values of one or more input tuples.

There exist various implementations of provenance-aware database systems. Where-provenance has been implemented in Polygen [32], DBNotes [6], Mondrian [11], MXQL [31] and Orchestra [17].
The Spider system performs a slightly different task, by showing to a user the “route” from input to output that is taken by data when a specific database query is executed [5]. The foundations for all these systems are relational databases, where sets of tuples are manipulated by operators from relational algebra, or extensions of SQL.

Outside the field of databases, the W3C has standardized a data model for provenance information called Prov [13]. The standard includes an ontology that defines multiple provenance relationships, such as “was derived from” and “was revision of”. A templating system for Prov data has been proposed by Moreau et al. [20]; it resembles the graph of processors produced in the present work. However, prov-template assumes that, for a given processing task, this graph has the same structure for every input, and only differs in the actual bindings given to its various elements. On the contrary, we shall see that in BeepBeep, some processor chains produce graphs whose structure highly depends on the input given to the pipeline. Moreover, the approach assumes these templates as given, while our proposed work dynamically generates these graphs from a processor chain and an input stream at runtime.

On its side, dynamic taint analysis consists in marking and tracking certain data in a program at run-time. TaintCheck is a system where each memory byte is associated with a 4-byte pointer to a taint data structure [21]; program inputs are marked as tainted, and the system propagates taint markers to other memory locations during the execution of a program; this concept has been extended to the operating system as a whole in an implementation called Asbestos [29]. Hardware implementations of this principle have also been proposed [9], [26]. Gift is another taint analysis tool; Aussum is a compiler based on it [19]. Dytan is a generic framework for dynamic taint analysis [8].
Rifle focuses on the information flow [28]. TaintBochs is a system that has been used to track the lifetime of sensitive data inside the memory of a program [7].

On the stream processing front, few solutions have been developed to provide explanations for queries. Spline [25] is a system that works on top of Apache Spark and attempts to recover lineage information by instrumenting processing jobs; “lineage”, in this case, means the topological organization of jobs and data sources that are being used. However, this system does not work at the individual event level, and hence cannot be used to explain the value of a precise output event produced by a Spark pipeline. Apache Atlas (https://atlas.apache.org) provides similar coarse-grained functionalities for jobs running on Hadoop. To the best of our knowledge, no existing work focuses on fine-grained explainability of individual events in a stream processing pipeline.

III. Event Log Query Processing with BeepBeep

In this section, we shall first describe the basic concepts of event log processing, as implemented by the BeepBeep event stream query engine. BeepBeep is a Java library that allows users to easily ingest and transform event streams of various types; the library is free and open source (https://liflab.github.io/beepbeep-3). Over the past few years, BeepBeep has been involved in a variety of case studies [1], [3], [14], [18], [30]. A detailed description of BeepBeep is out of the scope of this paper, due to space restrictions. For further details, the reader is referred to a complete textbook describing the system [16].

A. Functions and Processors
BeepBeep is organized around the concept of processors. In a nutshell, a processor is a basic unit of computation that receives one or more event streams as its input, and produces one or more event streams as its output. A processor produces its output in a streaming fashion: it does not wait to read its entire input trace before starting to produce output events. However, a processor can require more than one input event to create an output event, and hence may not always output something when given an input.

BeepBeep’s core library provides a handful of generic processor objects performing basic tasks over traces; they can be represented graphically as boxes with input/output “pipes”, as is summarized in Figure 1.

Figure 1: Pictograms for the basic BeepBeep processors: apply a function to each event; keep one event every n; trim the first n events; cumulate values of a function; fork a stream into multiple copies; slice a stream into multiple sub-streams; apply a processor to a sliding window of events; filter events based on a control signal.

A first way to create a processor is by lifting any function f into a processor. This is done by applying f successively to each input event (or n-tuple of input events, for functions that have n arguments), producing the output events. A variant of this process is the Cumulate processor, which, as its name implies, accumulates input values according to some function; for example, providing it with the Addition function will cause it to output the cumulative sum of all events received so far. Note that Cumulate also works with non-numerical events.

A few processors can be used to alter the sequence of events received. The CountDecimate processor returns every n-th input event and discards the others. Another operation that can be applied to a trace is trimming its output: given a trace, the Trim processor returns the trace starting at its n-th input event. Events can also be discarded from a trace based on a condition. The Filter processor takes two input streams; an event of the first input stream is let through if the event at the matching position of the second stream is the value true (⊤); otherwise, no output is produced.

Another important functionality of event stream processing is the application of some computation over a window of events. If ϕ is an arbitrary processor, the Window processor of ϕ of width n sends the first n events (i.e. events numbered 0 to n−1) to an instance of ϕ, which is then queried for its n-th output event. The processor also sends events 1 to n to a second instance of ϕ, which is then also queried for its n-th output event, and so on. The resulting trace is indeed the evaluation of ϕ on a sliding window of n successive events. Any processor can be encased in a sliding window, provided it outputs at least n events when given n inputs.

In the case of business processes, a log can contain interleaved sequences of events for multiple process instances. The sub-sequence of events belonging to the same process instance is called a slice; applying a separate processing to each such sub-sequence will be called slicing. To this end, BeepBeep provides a processor called Slice, which is one of the most complex of the core library. It uses a function f to separate an input stream into several sub-streams. Each of these sub-streams is sent to a different instance of some processor P, and the output of each copy is aggregated by another function g.

B. Pipes and Palettes
In order to create complex computations, processors can be composed (or “piped”) together, by letting the output of one processor be the input of another. An important characteristic of BeepBeep is that this piping is possible as long as the type of the first processor’s output matches the second processor’s input type. Such pipes can easily be created by using Java as the glue code.

If chains of basic processors are not sufficient to accomplish the desired computation, BeepBeep makes it possible to extend its core with various packages of domain-specific processors and functions, called palettes. The main advantage of the palette system is its modularity: apart from a small core of common objects, a user is required to load only the palettes that are relevant to the computing task at hand. BeepBeep’s “standard library” offers more than a dozen such palettes; we briefly describe in the following those of particular interest in the context of business process logs.
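As an illustration of this composition principle, the following is a minimal, self-contained sketch, written in Python rather than Java for brevity; the class and function names are illustrative stand-ins, not BeepBeep’s actual API:

```python
class ApplyFunction:
    """Lifts a unary function into a stream processor."""
    def __init__(self, f):
        self.f = f

    def process(self, stream):
        return [self.f(e) for e in stream]


class Cumulate:
    """Accumulates input values with a binary function, emitting each partial result."""
    def __init__(self, f, start):
        self.f, self.start = f, start

    def process(self, stream):
        out, acc = [], self.start
        for e in stream:
            acc = self.f(acc, e)
            out.append(acc)
        return out


def pipe(*processors):
    """Composes processors: the output of each one becomes the input of the next."""
    def run(stream):
        for p in processors:
            stream = p.process(stream)
        return stream
    return run


# Running sum of the doubled input values
chain = pipe(ApplyFunction(lambda x: 2 * x), Cumulate(lambda a, b: a + b, 0))
print(chain([1, 2, 3]))  # -> [2, 6, 12]
```

In the actual library, the same effect is obtained by connecting processor objects with Java glue code; the sketch only illustrates how typed outputs feed into typed inputs.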
1) Finite-State Machines:
A frequent use of stream processing is to check whether the events inside a log follow a specific sequence, and trigger a warning as soon as a violation is observed. Specifying the allowed event sequences can be done, among other things, by means of a finite-state automaton. BeepBeep’s Fsm palette allows users to create Moore machines, a special case of automaton where each state is associated with an output symbol. This Moore machine allows its transitions to be guarded by arbitrary functions; hence it can operate on traces of events of any type.

By associating states of the FSM with, e.g., Boolean values, a Moore machine can act as a monitor: when fed events from a log, it can be instructed to output the value true (or no value at all) as long as the input sequence is a valid path, and return false when the last event received does not correspond to an acceptable transition in the current state of the automaton.
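Such a monitor can be sketched in a few lines; the following Python model is illustrative only (the states, guards and lifecycle pattern are hypothetical, and BeepBeep’s actual Fsm palette is a Java API):

```python
class MooreMonitor:
    """A Moore machine acting as a monitor: each state is associated with an
    output symbol, and transitions are guarded by arbitrary predicates."""
    def __init__(self, transitions, initial, outputs):
        self.transitions = transitions  # state -> list of (guard, next state)
        self.outputs = outputs          # state -> output symbol
        self.state = initial

    def push(self, event):
        for guard, nxt in self.transitions.get(self.state, []):
            if guard(event):
                self.state = nxt
                return self.outputs[nxt]
        self.state = "sink"             # no acceptable transition: violation
        return self.outputs["sink"]


# A hypothetical lifecycle accepting prefixes of the pattern a (b c)+ d
transitions = {
    "s0": [(lambda e: e == "a", "s1")],
    "s1": [(lambda e: e == "b", "s2")],
    "s2": [(lambda e: e == "c", "s3")],
    "s3": [(lambda e: e == "b", "s2"), (lambda e: e == "d", "s4")],
}
outputs = {"s1": True, "s2": True, "s3": True, "s4": True, "sink": False}
m = MooreMonitor(transitions, "s0", outputs)
print([m.push(e) for e in ["a", "b", "d"]])  # -> [True, True, False]
```

The monitor returns true as long as the input is a valid prefix, and false as soon as an event has no matching guarded transition.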
2) Linear Temporal Logic:
Similar to the Fsm palette, the Ltl palette makes it possible for users to write conditions on event sequences using Linear Temporal Logic (LTL) [22]. We recall that LTL, in addition to the usual Boolean connectives, provides four temporal operators that apply to an arbitrary formula ϕ. The temporal operator G means “globally”: the formula G ϕ means that formula ϕ is true in every event of the trace. The operator F means “eventually”; the formula F ϕ is true if ϕ holds for some future event of the trace. The operator X means “next”; it is true whenever ϕ holds in the next event of the trace. Finally, the U operator means “until”; the formula ϕ U ψ is true if ϕ holds for all events until some event satisfies ψ.

Each of these temporal operators is implemented as a Processor object, and chaining such processors appropriately allows users to create pipes that can be used to evaluate any arbitrary LTL formula. Each LTL processor for an LTL formula ϕ applies the following semantics: the i-th output event is the verdict produced by a monitor evaluating the input trace starting at event i.

Typically, temporal processors produce bursts of output events for multiple inputs at the same time, once a specific value (true or false) is received in the input stream. Consider the case of operator G ϕ. The processor for this operator takes as input a stream of Boolean values, corresponding to the evaluation of ϕ on each input event. Given the input stream ⊤, ⊤, ⊥, ⊤, the processor will produce the output stream ⊥, ⊥, ⊥: indeed, the property G ϕ is definitely false for the trace suffixes starting at each of the first three input events. However, those three outputs can only be produced once the input event ⊥ at position 3 has been received. Similarly, a definite verdict cannot yet be computed for the input suffix starting at event 4. A similar reasoning applies to the remaining operators.

Figure 2: The BeepBeep chain of processors for the window product query.
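The burst behaviour of the G operator described above can be modelled as follows; this is a simplified Python sketch, not BeepBeep’s implementation, and it relies on the fact that only false verdicts of G ϕ can be settled on a finite prefix:

```python
class GloballyProcessor:
    """Streaming evaluation of G phi over a Boolean stream: the i-th output is
    the verdict for the suffix starting at event i. A false input settles (to
    false) every suffix starting at or before it, producing a burst of outputs."""
    def __init__(self):
        self.pending = 0  # number of suffixes with no definite verdict yet

    def push(self, b):
        self.pending += 1
        if not b:
            burst, self.pending = [False] * self.pending, 0
            return burst
        return []  # a true event settles nothing for G phi


g = GloballyProcessor()
print([g.push(b) for b in [True, True, False, True]])
# -> [[], [], [False, False, False], []]
```

This reproduces the behaviour described in the text: nothing is emitted for the first two inputs, three ⊥ verdicts are emitted in a burst when the ⊥ at position 3 arrives, and the suffix starting at event 4 remains undecided.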
C. Examples
We now give a few examples of processor chains that can be built using the basic processors and the objects provided by the palettes just described. These examples are aimed at showing the diversity of computations that can be expressed with BeepBeep, and will be reused in the next section to illustrate how their results can be explained by the lineage tracking extensions introduced in this paper. They are by no means a complete showcase of BeepBeep’s functionalities.
1) Window Product:
As a first example, consider the processor chain illustrated in Figure 2. This chain takes as input a stream of numerical values; it computes the product of each sequence of three successive values, and checks whether this product is greater than zero. This chain introduces a special processor, not described earlier, at the bottom of the figure, which simply turns any input event into a predefined constant, in this case the value 0. Intuitively, the output of this chain can be translated as the assertion “the product of any three successive values must be greater than zero”. Consider an input stream whose first value is 3; its corresponding output is the stream ⊤, ⊥, ⊥, ⊥, ⊤. Indeed, the first window of three events has a positive product, hence the first output event is ⊤; each of the next three windows contains a zero value, hence their output is ⊥; the last window again has a positive product.
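The end-to-end behaviour of this chain can be modelled in a few lines; this Python sketch stands in for the actual processor chain of Figure 2, and the input values after the initial 3 are hypothetical, chosen to reproduce the output pattern above:

```python
from math import prod


def window_product_positive(stream, n=3):
    """For each window of n successive values, outputs whether their
    product is greater than zero, mimicking the chain of Figure 2."""
    return [prod(stream[i:i + n]) > 0 for i in range(len(stream) - n + 1)]


# Hypothetical input starting with 3; the single zero makes three windows fail
print(window_product_positive([3, 1, 4, 0, 5, 9, 2]))
# -> [True, False, False, False, True]
```

A zero anywhere in the stream makes every window containing it produce ⊥, which is precisely the kind of output a user may later want to explain.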
2) Process Lifecycle:
A second example is shown in Figure 3. This time, input events are assumed to be tuples of the form (i, a), where i is some numerical identifier and a is the name of an action. This basic format is appropriate to represent a simple kind of business process log, where multiple interleaved process instances are distinguished by their value of i, and each instance is made of a sequence of actions. This use case is a prime example of the Slice processor, which in this case is used to separate events of each process instance based on their id, and feeds each sub-sequence into a chain that first fetches the action field of each event, and updates the state of a Moore machine accordingly.

In this particular case, one can see that the Moore machine for each instance has transitions to a “sink” state that produces the value “false” (⊥). Any sequence that follows the intended pattern has the machine remain in a state that produces the value “true” (⊤). Written as a regular expression, the language accepted by this machine corresponds to the string a(bc)+d.

Figure 3: The BeepBeep chain of processors for the process lifecycle property.

The output of each Moore machine is aggregated into a Boolean conjunction; therefore, for the global processor chain to return ⊤, each currently active process instance must follow the intended lifecycle; otherwise the chain returns ⊥.

Consider for example a sequence of six events whose actions are a, a, b, b, c, d, interleaving two process instances labelled 1 and 2. The processor’s output for this prefix will be the sequence of Booleans ⊤, ⊤, ⊤, ⊤, ⊤, ⊥. The sequence of actions for process 1 follows the intended pattern (ab), while the sequence of actions for process 2 (abcd) violates the lifecycle on the last event.
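The slicing-and-aggregation behaviour of Figure 3 can be sketched as follows; this is a Python model, the event log and the transition relation of lifecycle_monitor are hypothetical, and the aggregation function g is fixed to Boolean conjunction:

```python
def slice_monitor(events, monitor_factory):
    """Dispatches each (id, action) event to a per-id monitor instance and
    aggregates the monitors' latest outputs with Boolean conjunction,
    mirroring the Slice processor of Figure 3."""
    monitors, latest, out = {}, {}, []
    for pid, action in events:
        if pid not in monitors:
            monitors[pid] = monitor_factory()
        latest[pid] = monitors[pid](action)
        out.append(all(latest.values()))
    return out


def lifecycle_monitor():
    """A hypothetical lifecycle machine with a sink state; any action with no
    matching transition sends the machine to the sink, which outputs False."""
    trans = {("s0", "a"): "s1", ("s1", "b"): "s2", ("s2", "c"): "s3",
             ("s3", "b"): "s2", ("s3", "d"): "s4"}
    state = ["s0"]

    def push(action):
        state[0] = trans.get((state[0], action), "sink")
        return state[0] != "sink"
    return push


# Hypothetical interleaving of two instances; instance 2 deviates on its last action
log = [(1, "a"), (2, "a"), (1, "b"), (2, "b"), (1, "c"), (2, "d")]
print(slice_monitor(log, lifecycle_monitor))
# -> [True, True, True, True, True, False]
```

The global output flips to False as soon as one active instance leaves its lifecycle, matching the ⊤, ⊤, ⊤, ⊤, ⊤, ⊥ pattern discussed above.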
3) LTL Property:
Our last event log query involves Boolean connectives and LTL temporal operators. Its processor chain is shown in Figure 4. In this case, we assume the input events are lines of a CSV file, each containing a tuple (action, p), where action is an action name and p is an arbitrary numerical value. The chain decomposes this tuple by fetching the value of p (top branch) and the value of action (bottom branch). The condition p < 0 is evaluated on the top branch, and the condition action = a is evaluated on the bottom branch, for some predefined action name a.

The Boolean streams corresponding to these conditions are then sent through a piping of Boolean connectives and LTL operators. The end result is also a Boolean stream, which amounts to the evaluation of the LTL formula G(p < 0 → X(action = a ∧ X(action = a))). Intuitively, this expression can be formulated as “every input event with a negative value for p must be followed by two successive events whose action is a”. The chain outputs ⊥ whenever this pattern is not being followed in the input stream.

As an example, consider an input stream made of four tuples whose actions are b, c, a and d, where only the second tuple carries a negative value of p. One can see that the output of the processor chain, after ingesting these four events, will be the sequence ⊥, ⊥. According to the semantics of LTL operators, this is caused by the fact that the sub-traces starting at the first and second event violate the condition expressed above: they both contain an event with p < 0 that is not followed by two successive events whose action is a. No definite verdict can yet be reached for the sub-traces that start at the third and fourth event; this is why no output event has been produced for these two inputs.

Figure 4: The BeepBeep chain of processors that checks the LTL property G(p < 0 → X(action = a ∧ X(action = a))).

IV. An Explanation Mechanism for Stream Queries

After this brief presentation of the BeepBeep event stream library, we describe in this section how the original system has been retrofitted with data lineage functionalities.
More precisely, in the present context, “lineage” will correspond to the association that can be established between a specific output event produced by a processor and the input events that are involved in the production of this output.

This is where BeepBeep’s design principles, based on the concept of composition, can be put to good use. Since complex processor chains are obtained by piping basic processors into graphs, it suffices to define input/output associations for each processor separately. By virtue of composition, it will then be possible to retrace output events all the way up to the original inputs of a pipe, by simply following the chain of associations from each processor to the next upstream processor.

The goal of these additions and modifications is to make lineage as transparent as possible to the end user. The implications of this requirement are twofold. First, all modifications must preserve backward compatibility: existing programs using BeepBeep without lineage should still be valid programs under the new version. Second, benefiting from data lineage in a program should require as few modifications as possible to a processor chain; that is, lineage should come at little cost in terms of added complexity to the glue code. The result of these modifications to the basic design of the library is described in what follows.

A. The Event Tracker
All data lineage functionalities in BeepBeep are centered around a singleton object called the event tracker. The sole purpose of this object is to answer lineage queries: given an output event at a specific position in an output stream computed by a processor chain, the event tracker must point to the events of the chain’s inputs that contribute to (or “explain”) the fact that this particular output event contains this particular value.

In order to do so, the event tracker must be informed, by the various processors in the chain, of the output events they produce, and also of the input events they should be associated to. To this end, the EventTracker interface declares a method called associate(), which can be called by processors during the execution of a task. A call to associate() must provide the following elements: 1) the ID of the processor instance making the call; 2) the index of the output pipe; 3) the position of the output event in the output stream; 4) the index of the input pipe; 5) the position of the input event in the input stream. As one can see, calls to this method can be used by implementations of EventTracker in order to record input/output associations. Since each processor instance in BeepBeep is given a numerical identifier that is unique across a given program, the associations for each processor of a chain can be recorded and distinguished.

However, processors must be aware of the existence of such an event tracker so that they can call it. This is why the Processor class is modified in such a way that each of these objects can now store a reference to an event tracker. By default, lineage is turned off: processors are instantiated with a null reference as their default event tracker, indicating that no call to associate() needs to be made. This default can be changed by passing a non-null implementation of EventTracker to a processor object after its creation.

Passing an event tracker to each processor instance one by one would be tedious; it would also violate our design principle of minimal modifications to the glue code. Since each processor in a chain is eventually connected to another one, an alternate approach is to use BeepBeep’s Connector object, and arrange for the event tracker to be passed to processors through calls to connect(). In such a case, a user first instantiates a Connector by specifying an event tracker, and then uses this connector’s connect() method to pipe processors, in place of the usual static method of the class. This call to connect() serves a double purpose: it makes processors aware of the existence of an event tracker, and it also allows the tracker to keep track of the connections between processors. Knowledge of these connections is necessary in order to follow lineage across the whole chain. Under such a design, a single line of glue code needs to be changed in order to enable lineage in a processor chain.

Once lineage has been properly set up in a program, a stream query can be evaluated in the usual way. At any moment during the processing, the event tracker can be asked for lineage information about a specific output event. This is done by calling a method named getProvenanceTree(). A provenance query contains three elements: the unique ID n of a processor, the index i of an output pipe, and the position p of the output event in the corresponding output stream. Intuitively, such a query can be translated into the question: “what is the explanation for the p-th event of the i-th output pipe of processor n?”

In return, the event tracker produces a directed acyclic graph (DAG) which, from the given output event, follows the input/output associations in the processor chain all the way up to the original inputs. As we shall see, the relationship between the input and the output can be many-to-many; this is why the generated structure is generally a graph, and not a linear chain of nodes.

B. I/O Associations for Common Processors
Equipped with this basic setup, supporting lineage in processors amounts to the insertion, in each class descending from the top-level Processor, of appropriate calls to a tracker’s associate() method. Since processors have a streaming mode of operation, these calls should also be made in a streaming fashion. This means that associations should be recorded progressively as the input events are ingested, as soon as such associations can be determined.

In general, all the inputs given to a computation are considered to explain the output; for example, with the function f(x, y) = x + y, one can see that any value f can produce always depends on its two operands, x and y. However, there exist exceptions to this general rule. Let us take the case of the function g(x, y) = x · y; typically, the knowledge of both x and y is required to explain the output value, but not always: when x = 0, the fact that g(x, y) = 0 can be explained by the value of x alone, regardless of y. A similar argument can be made with Boolean connectives such as disjunction and conjunction.

In the following, we describe the rules used to produce input/output associations for the various functions and processor objects present in the BeepBeep library.
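These value-dependent association rules can be sketched as follows; this is a simplified Python stand-in for the associate() mechanism described above, where the pipe labels "in0" and "in1" and the tracker’s internal representation are illustrative, not BeepBeep’s actual data structures:

```python
class EventTracker:
    """Records (processor id, output position) -> set of input positions;
    a simplified stand-in for the associate() calls described above."""
    def __init__(self):
        self.assoc = {}

    def associate(self, proc_id, out_pos, in_positions):
        self.assoc.setdefault((proc_id, out_pos), set()).update(in_positions)


def apply_product(xs, ys, tracker, proc_id=0):
    """Applies g(x, y) = x * y pairwise; when one operand is 0, that operand
    alone explains the output, so only its position is associated."""
    out = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        out.append(x * y)
        if x == 0:
            tracker.associate(proc_id, i, {("in0", i)})
        elif y == 0:
            tracker.associate(proc_id, i, {("in1", i)})
        else:
            tracker.associate(proc_id, i, {("in0", i), ("in1", i)})
    return out


t = EventTracker()
apply_product([2, 0, 3], [5, 7, 0], t)
print(t.assoc[(0, 1)])  # -> {('in0', 1)}: the zero operand alone explains output 0
```

When neither operand is zero, both input positions are recorded, which corresponds to the general rule that all inputs explain the output.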
1) Core Processors:
Most of BeepBeep’s core processorshave relatively straightforward association rules. The
Count-Decimate processor, whose task is to keep every n -th event anddiscard the others, registers an association between input eventat position i and output event at position ni . The Trim processor,which discards the first n events, registers an associationbetween input event at position i and output event at position i − n (for every i ≥ n ). The Fork processor simply replicatesthe input events to its outputs; the i -th input event is associatedto the i -th output event of every output pipe.The Window processor, which applies a processor P on asliding window of n events, introduces a level of indirection.In order to produce the i -th output event from a stream ofevents e , e , . . . , the processor instantiates a copy of P andfeeds it with the interval of events [ e i , e i + n − ] . It creates atemporary event tracker, instructed to intercept the input/outputassociations registered by P . From this tracker, the associationsrelated to the last output event produced by P are thentransferred to the main event tracker, by taking care of shifting the positions of the input events by i . That is, the k -th eventgiven to P actually corresponds to the ( k + i ) -th event ingestedby the Window processor.These processors register the same associations, regardlessof the actual content of the events they process. Some otherprocessors will actually record different associations dependingon the actual stream they receive. The I/O pairs for the
Apply-Function processor are determined by the I/O pairs of theunderlying function that is being applied on each event front;as we have seen above, some of these functions may associatetheir output to all or part of their input arguments, dependingon their values.Similarly, the
Cumulate processor generally associates the i -th output event to all input events up to the i -th: this is consistentwith the fact that the processor computes the progressive“accumulation” of all input events received so far. However,this default behaviour may be overridden depending on thecumulative function being used. Take for example an instanceof Cumulate processor applied on a stream of Boolean values,using logical conjunction as its function. On the input stream (cid:62) , (cid:62) , ⊥ , (cid:62) , the processor will return the output stream (cid:62) , (cid:62) , ⊥ , ⊥ –that is, as soon as a false value is received, the processor’soutput will be false forever. To explain why a given outputevent at position i is false, it suffices to point to an input eventat position j ≤ i whose value is false.Among all of BeepBeep’s core processors, Slice is the onewith the most complex I/O relationships. As a reminder,
Slice creates multiple instances of a processor P , and dispatches aninput event to an instance of P based on the value returnedby a slicing function f . The last output value produced byeach instance of P is then aggregated using another function g . Internally, each such copy of P is linked to its own eventtracker. To associate the i -th output event to inputs, the Slice processor first uses an internal event tracker to identify whichof the arguments given to g are involved in the production ofits return value. These arguments correspond to output eventsproduced by one or more instances of P ; the event trackerfor each of them is then queried in order to obtain the inputevents associated to that output event. Finally, as in the case ofthe Window processor, the relative event indices in each sliceare converted into their corresponding positions in the streamingested by
Slice.

C. I/O Associations for Palettes
We shall now describe the I/O associations that have been defined for processors of various palettes. As previously, our focus is on palettes that have particular relevance to the field of business processes.
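Before turning to these palettes, the Cumulate example from the previous subsection can be made concrete with a short sketch. The following Python function is purely illustrative (BeepBeep itself is a Java library, and the function name is hypothetical): it computes a running conjunction over a Boolean stream and records, for each output event, the positions of the input events that explain it.

```python
# Illustrative sketch of lineage tracking for a cumulative conjunction.
# A true output at position i is explained by all inputs up to i; a false
# output is explained by a single earlier (or equal) false input.
def cumulate_and_with_lineage(inputs):
    outputs, associations = [], []
    acc = True
    first_false = None  # position of the earliest false input seen so far
    for i, e in enumerate(inputs):
        acc = acc and e
        outputs.append(acc)
        if first_false is None and not e:
            first_false = i
        if acc:
            associations.append(list(range(i + 1)))  # default: all inputs so far
        else:
            associations.append([first_false])       # one false input suffices
    return outputs, associations

outs, assoc = cumulate_and_with_lineage([True, True, False, True])
# outs  == [True, True, False, False]
# assoc == [[0], [0, 1], [2], [2]]
```

On the stream ⊤, ⊤, ⊥, ⊤ discussed above, the two false outputs are both traced back to the single false input at the third position (index 2 here), matching the association rule just described.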
1) Moore Machines:
A Moore machine can be used to define compliance constraints related to the sequence of activities that can be seen in an instance of a process, in the form of a finite-state machine. When a violation of these compliance constraints is found in the log, existing tools, such as monitors, typically stop at the first event that makes the sequence non-compliant, and declare failure. The location in the trace where the monitor stops can already give some information to the user about the cause of the violation, but only in a fragmentary manner. Depending on the specification, the failure may be the result of the interplay between several events in the past that end up in a violation, and this information is not readily available from a classical monitor with a pass/fail verdict. In order to address this issue, BeepBeep's
MooreMachine processor has been retrofitted with lineage functionalities. Internally, each Moore machine instance records and updates a vector v⃗ = ⟨(s₁, e₁, i₁), …, (sₙ, eₙ, iₙ)⟩ whose elements are triplets (s, e, i), where s is a state of the machine, e is an input event, and i is the position of that event in the input stream. The vector is such that its last triplet (sₙ, eₙ, iₙ) always contains the current state sₙ the machine is in. (If v⃗ is empty, the machine is in its initial state.)

Upon receiving an input event eₙ₊₁, the machine updates this vector as follows. First, it takes the transition from its current state sₙ, leading to a new state sₙ₊₁. Assuming that iₙ₊₁ is the number of input events received from the beginning of the stream, it then appends to the vector v⃗ the new triplet (sₙ₊₁, eₙ₊₁, iₙ₊₁). Next, the machine performs a cleanup step: it looks for the earliest occurrence of sₙ₊₁ in another triplet at some index k ≤ n; if one is found, the triplets at positions k, …, n are deleted from the vector, so that the newly added triplet takes their place. The contents of the resulting vector are then used to record associations between the iₙ₊₁-th output event of the machine and its inputs; more precisely, the machine registers an association between the iₙ₊₁-th output event and the iⱼ-th input event, for each triplet remaining in the vector. This corresponds intuitively to the fact that every input event in the vector is necessary in order to reach state sₙ₊₁ and produce the corresponding output event.

The reason for the cleanup step is best explained on an example.
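This update-and-cleanup procedure can be sketched in a few lines of Python. The sketch is illustrative only: the names below are hypothetical and do not correspond to BeepBeep's actual Java API.

```python
# Sketch of the lineage-tracking Moore machine update. 'delta' maps
# (state, event) pairs to successor states; 'v' is the vector of
# (state, event, position) triplets representing a loop-free path.
def run_moore_with_lineage(delta, initial, events):
    state, v = initial, []
    associations = []  # associations[i]: input positions explaining output i+1
    for pos, e in enumerate(events, start=1):
        state = delta[(state, e)]
        v.append((state, e, pos))
        # cleanup: if 'state' was already visited, delete the triplets of
        # the loop leading back to it; the new triplet takes their place
        for k, (s, _, _) in enumerate(v[:-1]):
            if s == state:
                v = v[:k] + [v[-1]]
                break
        associations.append([i for (_, _, i) in v])
    return associations

# A machine where reading b after c leads back to a previously visited state
delta = {("q0", "a"): "q1", ("q1", "b"): "q2",
         ("q2", "c"): "q3", ("q3", "b"): "q2"}
assoc = run_moore_with_lineage(delta, "q0", ["a", "b", "c", "b"])
# assoc == [[1], [1, 2], [1, 2, 3], [1, 4]]
```

Note how the fourth output is associated with input positions 1 and 4 only: the loop taken through the c transition has been removed from the vector by the cleanup step.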
Consider the Moore machine shown in Figure 3. Given the input sequence a b c, the machine will produce the output sequence ⊤₁ ⊤₂ ⊤₃ (subscripts indicate event positions). According to the procedure just described, the third event of this output will be associated to the input events 1, 2 and 3. Suppose we now give the machine a new input event b. In accordance with its transition relation, the machine will output a new symbol ⊤₄; however, the input associations for this symbol will be the events at positions 1 and 4.

As the reader may have understood, the explanation produced for a given output event consists of a path from the initial state, excluding any loops that move away from and return to a previously visited state. This corresponds to the intuition that the sequence of inputs a b suffices to produce ⊤₄. In other words, the machine finds the shortest subtrace of the input that produces the output.

This mechanism can be used to provide an explanation in the case of compliance violations. Suppose that in the previous example, state 5 corresponds to an error state. An input sequence such as a b c b c b a then violates the compliance requirement; however, in order to "explain" this violation, the subtrace a b a is sufficient.
2) Linear Temporal Logic:
As we have seen, LTL is an alternate way in which compliance constraints on event sequences can be expressed. The
Ltl palette provides processors
corresponding to each LTL operator, each equipped with lineage tracking functionalities. Their implementation is actually simpler than for Moore machines, and can be explained in a few words.

Consider the case of the processor for operator G. By virtue of the semantics of LTL, we know that this processor delays the production of output events as long as its inputs are true; once a false event is received, it produces a burst of ⊥ output values. For a ⊥ event that is emitted at position j, an association is recorded with the last input event at position i ≥ j whose value is false.

This is illustrated in Figure 5 (input/output associations for the LTL G processor). The top row of the figure represents an input stream of Boolean values, with circles representing ⊤ and squares representing ⊥. The bottom row shows the output produced by the G processor. Lines record the associations established between the inputs and the outputs. As one can see, the first three output events are associated with the first false value. Indeed, the verdict produced by the monitor for these three trace prefixes is "caused" by the presence of value ⊥ at position 3. However, this event has no bearing on the output values produced for positions 4–7; they are rather caused by the presence of ⊥ at input position 7. Hence, a temporal operator separates the output stream into zones, with each event of a zone typically being associated to the same event of the input stream. A similar reasoning can be applied to the other temporal operators.

D. Examples
These basic I/O associations turn out to provide surprisingly articulate and intuitive results when processors are composed to form complex computation chains. We shall use the
Window product property to explain the operation of the event tracker. An explanation query is made of three elements: 1) the ID of a processor in a chain; 2) the index of an output pipe on this processor; 3) the position of an event in the corresponding stream. From such a starting point, the event tracker will scan the input/output associations recorded during the evaluation of a query, and recursively traverse these associations until the ultimate inputs of the chain are reached (or no upstream associations can be found to continue the chain). As we have seen earlier, on the input 3, ·, ·, ·, ·, ·, 2, the
Window product processor chain produces the output ⊤, ⊥, ⊥, ⊥, ⊤. Suppose we want an explanation for the reason the second event of this output is false. The EventTracker associated to this processor chain is queried through a method called getProvenanceTree(), which will produce a directed acyclic graph whose structure is depicted in Figure 6 (an explanation graph for the Window product query). The graph is read from bottom to top; each input or output event is represented with a number corresponding to its relative position in the stream in question. Therefore, the direct explanation for the fact that processor f returned ⊥ on the second event is that it received 0 as the second event in both its input streams. The chain can then be traversed further, and the reason for the production of each zero value can be retraced to different paths and input events in the processor chain.

Special attention should be given to the explanation for the result of the Window processor (left branch). This processor outputs a zero as its second event because the internal instance of the Cumulate processor associated to the second window returned zero. However, the reason for this null value is not explained by the whole window, but by the single 0 that corresponds, in this case, to the third event of the window. Ultimately, the whole graph converges back to a single input event, which is the zero value at position 4 in the input stream. This is in line with the intuition that output ⊥ at position 2 is indeed caused by the presence of this zero in the input. Oftentimes, only the input/output associations of the extremities of the chain are relevant; in such a case, the graph can be "flattened" by keeping only the set of original input events that are mapped to a given output.

It is important to stress that this explanation graph depends on the output event chosen and the actual input stream given to the pipeline. Mere knowledge of the processor graph can be seen as lineage (similar to the information provided by Spline or Atlas), but is too coarse-grained to count as an explanation of a result. Graphs of the same nature can be produced by the event trackers associated to the other processor chains illustrated in Section III-C. They cannot be illustrated due to lack of space; however, the intuition behind them can be briefly discussed. In the case of the Process lifecycle query, we have seen that the input stream (·, a), (·, a), (·, b), (·, b), (·, c), (·, d) produces the ⊥ output event at the sixth position, indicating that globally, not all process instances interleaved in the log are following the intended lifecycle. Again, the EventTracker can be asked to explain this result. By following the I/O association rules for each processor in the chain, the end result will point to two events of the input log: the tuples (·, a) and (·, d), corresponding to the second and sixth elements.
This result provides two interesting pieces of information: first, the ID of the process that causes the global error, in this case process abcd; second, the loop b c has no impact on the erroneous result, and is therefore not included in the explanation.

Finally, a similar reasoning can be made on explanations for the third property, which involves LTL operators. It has been shown that the input sequence (b, ·), (c, −), (a, ·), (d, ·) produces the output value ⊥ at position 4. The explanation mechanism will retrace this output event to the inputs (c, −) and (d, ·); these two events constitute a "witness" of the violated constraint relating the events' payloads and actions. Notice how event (a, ·) is not part of the explanation, as it does not cause the erroneous verdict.

V. Experimental Results

In order to assess the viability of such a system in practical situations, we performed an empirical evaluation of BeepBeep's lineage functionalities through an experimental benchmark. In this section, we report on these results, which have been obtained by running BeepBeep on various processor chains. They are aimed at measuring the impact, both in terms of computation time and memory, of the introduction of lineage functionalities inside the system. As we have seen, this is possible thanks to a switch provided by BeepBeep, which allows users to completely disable lineage tracking if desired.

The experiments were implemented using the LabPal testing framework [15], which makes it possible to bundle all the necessary code, libraries and input data within a single self-contained executable file, such that anyone can download and independently reproduce the experiments. A downloadable lab instance containing all the experiments of this paper can be obtained from Zenodo, a research data sharing platform. All the experiments were run on an Intel Core i5-7200U 2.5 GHz running Ubuntu 18.04, inside a Java 8 virtual machine with 1746 MB of memory.

A. Impact on Throughput
The first element we measured is the impact on processing speed, or throughput. Table I shows the results for various types of stream queries. Each line represents a pair of experiments, corresponding to the evaluation of a stream query both with and without the use of a tracker. The measured value in each case is the average throughput, in number of input events processed per second.

Unsurprisingly, turning lineage on incurs a non-negligible slowdown, by as much as 21.7× for the queries we considered. This is caused by the fact that, on each new event, a processor

The lab instance will be uploaded on Zenodo only for the final version of the paper. In the meantime, the latest version of the lab can be found on GitHub: https://github.com/liflab/beepbeep-explainability-lab

Query | No tracker (Hz) | With tracker (Hz)
LTL property | 9452.741 | 2128.3252
Process lifecycle | 4283.0835 | 2099.727
Window product | 333366.66 | 15386.154
Table I: Relative throughput overhead.

now calls the event tracker possibly multiple times, in order to register associations between inputs and outputs.

These results should be put in context with respect to existing works that include a form of lineage. The Mondrian system reports an average slowdown of 3× [11]; pSQL ranges between 10× and 1,000× [2]; the remaining tools do not report CPU overhead. For taint analysis tools, Dytan reports a 30–50× slowdown [8]; GIFT-compiled programs are slowed down by up to 12×; TaintCheck has a slowdown of around 20× [21], and Rifle of 1–2× [28]. Time overhead for Spline [25] is close to zero, but as we have discussed, it provides lineage information at a much coarser level of granularity. Of course, these various systems compute different types of lineage information, but these figures give an outlook of the order of magnitude one should expect from such systems.

B. Impact on Memory
A second part of the experiment consisted in measuring the amount of additional memory required by the use of an event tracker. Memory was computed using the
SizePrinter object from the Azrael serialization library. This tool performs a recursive traversal of the member fields of a Java object, down to primitive types, and computes the sum of their reported sizes. The end result is a much more accurate indication of the memory actually consumed by an object than a measurement of the JVM's memory footprint.

The results are summarized in Table II. We can see that the relative impact on memory is larger than the impact of lineage on computation time. This is consistent with the intuition that lineage tracking requires one to "remember" much more, rather than to "compute" more. This consumption is still relatively reasonable in the absolute: for example, with the Window product processor chain, it would take an input file of 86 million lines before filling up the available RAM of a 64 GB machine with lineage data.

The large relative blow-up is mostly caused by the fact that, for many processor chains, evaluating a query without lineage requires a constant amount of space, while the tracking-enabled pipeline uses a linear amount of space. This is illustrated in Figure 7. As a matter of fact, it can be observed that for all the functions considered in this paper, each element of the output contributes a constant amount of lineage data. Table III gives, for each query we considered, the average memory overhead per input event incurred by the use of an event tracker.

These figures should be put in context by comparing the overhead incurred by other lineage tracking tools. Notably, related systems for provenance in databases do not report their storage overhead for provenance data. Dynamic taint

https://github.com/sylvainhalle/Azrael
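To give an idea of how such a measurement works, here is a minimal Python sketch of a recursive size estimator. It is illustrative only: the actual experiments use the Java-based Azrael library, whose implementation differs.

```python
import sys

def deep_size(obj, seen=None):
    """Recursively estimate the memory footprint of an object graph:
    each object is counted once, and containers and member fields are
    followed down to primitive values."""
    if seen is None:
        seen = set()
    if id(obj) in seen:  # avoid double-counting shared objects
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k, seen) + deep_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_size(x, seen) for x in obj)
    elif hasattr(obj, "__dict__"):  # follow an object's member fields
        size += deep_size(obj.__dict__, seen)
    return size
```

Unlike a coarse interpreter- or JVM-level footprint, such a traversal only counts the objects actually reachable from the value being measured.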
Figure 7: Evolution of memory consumption during the evaluation of the Window product query (memory in bytes as a function of stream length, with and without tracker).
Query | No tracker (B) | With tracker (B)
LTL property | 12341 | 53027241
Process lifecycle | 24039551 | 40353151
Window product | 5294 | 7404930
Table II: Relative memory overhead.

propagation systems report a memory overhead reaching 4× for TaintCheck [21], 240× for Dytan [8], and "an enormity" of logging information for Rifle [7] (authors' quote). Although these systems operate at a different level of abstraction, this shows that lineage tracking is inherently costly regardless of the approach chosen.

VI. Conclusion and Future Work

In this paper, we have seen how an event stream processing engine called BeepBeep can be extended with functionalities for data lineage. In this particular context, lineage is the capability to link a part of the system's output all the way up to the concrete inputs that contributed to the production of that particular output. Thanks to BeepBeep's principle of composition, such lineage functionalities can be defined at the level of individual units of computation called processors, whose input/output associations can then be chained to form a provenance graph. Through a few examples, it has been shown how such lineage capabilities can provide articulate and intuitive explanations for a result. What is more, those lineage functionalities are built-in and transparent to the user: a single line of code suffices to switch the mechanism on or off. To the best of our knowledge, BeepBeep is the first event stream processing engine that provides such a simple, yet all-encompassing explanation system.

These promising results open the way to multiple research questions and improvements over this first solution. Extensions to BeepBeep have been developed to perform trend deviation detection and predictive analytics [23], among other uses; it is

Query | Memory per event (B)
LTL property | 5300
Process lifecycle | 1631
Window product | 739
Table III: Average memory overhead (in bytes) per input event incurred by the use of an event tracker.

planned to expand the basic explanation capabilities to these extensions in the near future. Currently, the system can only record associations between whole events. However, there exist situations where a finer granularity in the relationships between inputs and outputs would be required, such as when events are extracted from parts of a larger "document", such as an XML event.

The implementation of the explanation mechanism could also be optimized in a few ways. First, we can observe that some processors always record the same association for each input/output event pair. Instead of recording this fact for every event, considerable savings, both in terms of time and space, could be achieved by making the tracker replace these individual associations with a single generic rule. Based on the promising results and the lessons learned from the implementation of BeepBeep's explanation mechanism, a redesign of the lineage functionalities, taking the previous observations into account, is currently under way.

The existence of a lineage tracking system inside BeepBeep also opens the way to a myriad of exciting research questions. For example: for a given query, is there a part of the input event trace that never matters in the production of the output? Given that a part of the input is considered corrupted, are there parts of the output that are not affected by this corruption? What part of the input contributes the most to the output? All these questions could be studied both concretely (by studying a particular input-output pair) and, more interestingly, by reasoning over all the possible input-output pairs of a given processor chain.

References

[1] Q. Betti, B. Montreuil, R. Khoury, and S. Hallé.
Smart Contracts-Enabled Simulation for Hyperconnected Logistics, pages 1–41. Number 71 in Studies in Big Data. Springer, 2020. To appear in April 2020.
[2] D. Bhagwat, L. Chiticariu, W. C. Tan, and G. Vijayvargiya. An annotation management system for relational databases. VLDB J., 14(4):373–396, 2005.
[3] M. R. Boussaha, R. Khoury, and S. Hallé. Monitoring of security properties using BeepBeep. In A. Imine, J. M. Fernandez, J. Marion, L. Logrippo, and J. García-Alfaro, editors, Proc. FPS 2017, volume 10723 of LNCS, pages 160–169. Springer, 2017.
[4] P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In Proc. ICDT 2001, pages 316–330, 2001.
[5] L. Chiticariu and W. C. Tan. Debugging schema mappings with routes. In U. Dayal, K. Whang, D. B. Lomet, G. Alonso, G. M. Lohman, M. L. Kersten, S. K. Cha, and Y. Kim, editors, Proc. VLDB 2006, pages 79–90. ACM, 2006.
[6] L. Chiticariu, W. C. Tan, and G. Vijayvargiya. DBNotes: a post-it system for relational databases based on provenance. In F. Özcan, editor, Proc. SIGMOD 2005, pages 942–944. ACM, 2005.
[7] J. Chow, B. Pfaff, T. Garfinkel, K. Christopher, and M. Rosenblum. Understanding data lifetime via whole system simulation. In M. Blaze, editor, Proceedings of the 13th USENIX Security Symposium, August 9–13, 2004, San Diego, CA, USA, pages 321–336. USENIX, 2004.
[8] J. A. Clause, W. Li, and A. Orso. Dytan: a generic dynamic taint analysis framework. In D. S. Rosenblum and S. G. Elbaum, editors, Proc. ISSTA 2007, pages 196–206. ACM, 2007.
[9] J. R. Crandall and F. T. Chong. Minos: Control data attack prevention orthogonal to memory model. In Proc. MICRO-37 2004, pages 221–232, 2004.
[10] Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Syst., 25(2):179–227, 2000.
[11] F. Geerts, A. Kementsietsidis, and D. Milano. MONDRIAN: annotating and querying databases through colors and blocks. In L. Liu, A. Reuter, K. Whang, and J. Zhang, editors, Proc. ICDE 2006, page 82. IEEE Computer Society, 2006.
[12] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In L. Libkin, editor, Proc. PODS 2007
Proc. PETRA 2016, page 3. ACM, 2016.
[15] S. Hallé, R. Khoury, and M. Awesso. Streamlining the inclusion of computer experiments in a research paper. IEEE Computer, 51(11):78–89, 2018.
[16] S. Hallé. Event Stream Processing with BeepBeep 3: Log Crunching and Analysis Made Easy. Presses de l'Université du Québec, 2018. ISBN 978-2-7605-5101-5.
[17] G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. In A. K. Elmagarmid and D. Agrawal, editors, Proc. SIGMOD 2010, pages 951–962. ACM, 2010.
[18] R. Khoury, S. Hallé, and O. Waldmann. Execution trace analysis using LTL-FO+. In T. Margaria and B. Steffen, editors, Proc. ISoLA 2016, Part II, volume 9953 of LNCS, pages 356–362, 2016.
[19] L. Lam and T.-c. Chiueh. A general dynamic information flow tracking framework for security applications. In Proc. ACSAC 2006, pages 463–472, Miami Beach, FL, USA, Dec. 2006. IEEE.
[20] L. Moreau, B. V. Batlajery, T. D. Huynh, D. Michaelides, and H. Packer. A templating system to generate provenance. IEEE Transactions on Software Engineering, 44(2):103–121, Feb. 2018.
[21] J. Newsome and D. X. Song. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In Proc. NDSS 2005. The Internet Society, 2005.
[22] A. Pnueli. The temporal logic of programs. In FOCS, pages 46–57. IEEE, 1977.
[23] M. Roudjane, D. Rebaine, R. Khoury, and S. Hallé. Predictive analytics for event stream processing. In Proc. EDOC 2019, pages 171–182. IEEE, 2019.
[24] W. Samek, T. Wiegand, and K.-R. Müller. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. ITU Journal, (1), Aug. 2017. arXiv:1708.08296.
[25] J. Scherbaum, M. Novotny, and O. Vayda. Spline: Spark lineage, not only for the banking industry. In , pages 495–498. IEEE Computer Society, 2018.
[26] G. E. Suh, J. W. Lee, D. Zhang, and S. Devadas. Secure program execution via dynamic information flow tracking. In S. Mukherjee and K. S. McKinley, editors, Proc. ASPLOS 2004, pages 85–96. ACM, 2004.
[27] S. Suhothayan, K. Gajasinghe, I. L. Narangoda, S. Chaturanga, S. Perera, and V. Nanayakkara. Siddhi: a second look at complex event processing architectures. In R. Dooley, S. Fiore, M. L. Green, C. Kiddle, S. Marru, M. E. Pierce, M. Thomas, and N. Wilkins-Diehr, editors, Proc. GCE 2011, pages 43–50. ACM, 2011.
[28] N. Vachharajani, M. J. Bridges, J. Chang, R. Rangan, G. Ottoni, J. A. Blome, G. A. Reis, M. Vachharajani, and D. I. August. RIFLE: an architectural framework for user-centric information-flow security. In Proc. MICRO-37 2004, pages 243–254, 2004.
[29] S. Vandebogart, P. Efstathopoulos, E. Kohler, M. N. Krohn, C. Frey, D. Ziegler, M. F. Kaashoek, R. T. Morris, and D. Mazières. Labels and event processes in the Asbestos operating system. ACM Trans. Comput. Syst., 25(4):11, 2007.
[30] S. Varvaressos, K. Lavoie, S. Gaboury, and S. Hallé. Automated bug finding in video games: A case study for runtime monitoring. Computers in Entertainment, 15(1):1:1–1:28, 2017.
[31] Y. Velegrakis, R. J. Miller, and J. Mylopoulos. Representing and querying data transformations. In K. Aberer, M. J. Franklin, and S. Nishio, editors, Proc. ICDE 2005, pages 81–92. IEEE Computer Society, 2005.
[32] Y. R. Wang and S. E. Madnick. A polygen model for heterogeneous database systems: The source tagging perspective. In D. McLeod, R. Sacks-Davis, and H. Schek, editors,