Exploring Data and Knowledge combined Anomaly Explanation of Multivariate Industrial Data
Xiaoou Ding∗, Hongzhi Wang†, Chen Wang§, Zijue Li†, Zheng Liang†
∗† School of Computer Science and Technology, Harbin Institute of Technology
§ National Engineering Laboratory for Big Data Software, EIRI, Tsinghua University
Email: ∗[email protected], †{wangzh, lizijue, lz20}@hit.edu.cn, §wang [email protected]

Abstract—The demand for high-performance anomaly detection techniques for IoT data is becoming urgent, especially in the industrial field. Anomaly identification and explanation in time series data is an essential task in IoT data mining. Since the existing anomaly detection techniques focus on the identification of anomalies, the explanation of anomalies is not well solved. We address the anomaly explanation problem for multivariate IoT data and propose a 3-step self-contained method in this paper. We formalize and utilize domain knowledge in our method, and identify the anomalies by the violation of constraints. We propose set-cover-based anomaly explanation algorithms to discover the anomaly events reflected by violation features, and further develop knowledge update algorithms to improve the original knowledge set. Experimental results on real datasets from large-scale IoT systems verify that our method computes high-quality explanation solutions of anomalies. Our work provides a guide to navigating explicable anomaly detection in both IoT fault diagnosis and temporal data cleaning.
Index Terms—Anomaly explanation, time series data cleaning, rule-based violation detection, temporal data mining
I. INTRODUCTION
An anomaly is summarized as any unusual change in a value or a pattern which does not conform to specific expectations [1], [2]. Identifying anomalies (roughly regarded as outliers, errors, and glitches) is one of the most challenging and exciting topics in the data mining and data cleaning community [1], [3]. Researchers have gone a long way in anomaly detection studies in various fields. [4] introduces anomaly detection techniques for temporal data, covering several kinds of temporal data including time series. Anomaly detection techniques based on different methods, such as density-, window-, and constraint-based approaches, have been developed and applied in various real scenarios (see [4], [5] for surveys).

The rapid development of sensor technologies and the widespread use of sensor devices have witnessed the flowering of data management and data mining technologies for sensor data. The demand for high-performance mining techniques for Internet of Things (IoT) data also becomes urgent, especially in the industrial field. Time series is one of the most important types of IoT data [5], [6]. Thus, the anomaly identification and explanation tasks for time series data are essential. Despite the advanced anomaly detection techniques, existing studies pay much attention to the identification of anomalies and errors, leaving anomaly reasoning and explanation not well solved. The characteristics of multivariate time series data in IoT applications urgently require detection methods to go further to explain the occurrence of anomalies, and thus to achieve higher dependability and interpretability in anomaly studies.

Anomaly explanation, also called explainable anomaly identification, will promote the involvement of domain knowledge in the techniques and assist users in understanding the anomalies well, which in turn improves the identification performance.
It also complements existing data cleaning techniques by considering explanation discovery for errors and violations. Considering the limitations of the current state-of-the-art anomaly explanation approaches, the challenges of the problem are as follows.

(1) Less attention paid to the dependency of data. Since multivariate time series are collected with each sequence (i.e., attribute) coming from one sensor, the abnormal data in sequences may not be completely independent. Anomalies may result in glitches existing across multiple attributes with complex interactions. Treating all sequences independently may fail to correctly identify the actual errors in data.

(2) Under-utilized domain knowledge. Discovering the interactions and causes of anomalies requires the involvement of knowledge which comes from domain experts or professional rules, especially for IoT data. Though knowledge-driven methods, such as fault tree analysis (FTA), expert systems (ES) [7], and Dynamic Bayesian Networks (DBN) [8], have been developed for anomaly and fault diagnosis tasks, they still have limitations in knowledge modeling. Moreover, incompleteness and fuzziness add to the difficulty of utilizing the knowledge.

(3) Lack of scalability in (IoT) big data. Since current knowledge-based methods usually focus on small-scale specific scenarios, neither the knowledge nor the methods can be easily adapted to other scenarios.

Referring to the desirable properties of causality analysis in violation detection [9], [10], we summarize that a high-quality explicable anomaly detection approach in industry applications always focuses on the following objectives.

• Coverage. The solution of the method is expected to comprehensively cover the anomaly instances existing in data.

• Conciseness. The method needs to provide a concise solution rather than a redundant one, because both time and human resources are costly in the response procedure to the anomalies. In addition, the consequences of unexpected anomalies are unpredictable. That requires the method to provide a small-scale solution for decision making as much as possible.

• Self-update. The method is expected to deal with new anomaly instances whose patterns are unknown in the knowledge base or have not occurred (been detected) in the historical data.

• Less tolerance of False Negatives (FN). Though it is difficult to entirely avoid False Negatives and False Positives in practical detection tasks, high performance is demanded of the method in the industrial field, especially in electricity and manufacturing scenarios. An FN means one fails to identify an untraceable anomaly in data, which is likely to result in more serious effects compared with an FP.
Contributions. Motivated by the above, we explore the anomaly explanation problem in multivariate time series under industrial scenarios with a data and knowledge combined method in this paper. Our contributions are summarized as follows.

(1) We formalize the anomaly explanation problem in multivariate temporal data, and design a self-contained 3-step anomaly explanation method framework for multivariate data (see Figure 2), according to the aforesaid four objectives. The proposed framework provides a guide to navigating explicable anomaly detection, especially in temporal data cleaning and IoT fault diagnosis techniques.

(2) We apply the four types of constraints proposed in [6], which formalize the dependence on attributes (columns) and entities (rows), to accurately uncover the anomalies hidden in multivariate data in the violation detection step (see Section III). We devise a set-cover-based algorithm AEC to address the anomaly explanation, and provide concise and reliable explanation solutions covering all the anomaly representations (see Section IV).

(3) We formalize and utilize the domain knowledge to achieve the description and the explanation of the anomalies in data. We also provide knowledge update procedures and algorithms during the iteration of detection and explanation, which allow both manual intervention and automatic update (see Section V).

(4) We conduct thorough experiments on real-life datasets from large-scale IoT systems. Results of the comparison experiments verify that our method provides high-quality explanation solutions of anomalies.

II. FRAMEWORK OVERVIEW
A. Problem Statement
We outline the multivariate time series in Figure 1. S = ⟨s_1, ..., s_N⟩ is a sequence on sensor S, where N = |S| is the length of S, i.e., the total number of elements in S. s_n = ⟨x_n, t_n⟩ (n ∈ [1, N]), where x_n is a real-valued number with a time point t_n, and for ∀n, k ∈ [1, N], it holds that (n < k) ⇔ (t_n < t_k). Let Eq be an equipment sensor group. S_Eq = {S_1, ..., S_M} ∈ R^{N×M} is an M-dimensional time series, where M is the total number of equipment sensors, i.e., the number of dimensions. T = {t_1, ..., t_N} is the set of time points of time series S_Eq.

In this paper, we use rule-based techniques to detect anomalies from violations of the given constraints. We introduce the constraint set for one sequence in Definition 1. Accordingly, given constraint c for sequence S, S is identified to violate c if the data in S do not satisfy the condition described by c. We denote such a violation by S ⊭ c.

Fig. 1. Multivariate IoT time series. [figure omitted]

Definition 1: (Constraint set). C is the set of all constraints defined on sequence S, denoted by C(S) = {c_1, ..., c_n}, where c_i is a formulated or learned constraint or rule the data need to meet.

Definition 2: (Violation feature). Given sequence S and the constraint set of S, i.e., C(S), we maintain a 2-tuple v = ⟨S, F(c)⟩ of S w.r.t. constraint c (c ∈ C(S)), where F(c) is a degree function computed by a specified violation measurement F on c, which has two formats:
(1) If c is a qualitative constraint, F(c) = 1 if S ⊭ c, and F(c) = 0 if S ⊨ c.
(2) If c is a quantitative constraint, F(c) = [d, u], where d and u are the lower bound and the upper bound computed by the measurement F, respectively.
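To make Definitions 1-2 concrete, the following sketch evaluates a simple quantitative (value-domain) constraint and emits a violation feature. The class and function names here are illustrative assumptions of ours, not part of the paper's formalism:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ViolationFeature:
    """v = <S, F(c)> from Definition 2."""
    seq_id: str                   # the sequence S
    constraint_id: str            # the violated constraint c
    degree: Tuple[float, float]   # F(c) = [d, u] for a quantitative constraint

def check_value_domain(values: List[float], lo: float, hi: float,
                       seq_id: str, c_id: str) -> Optional[ViolationFeature]:
    """Return a violation feature if some value falls outside [lo, hi]."""
    out = [x for x in values if x < lo or x > hi]
    if not out:
        return None  # S satisfies c: no feature is recorded
    return ViolationFeature(seq_id, c_id, (min(out), max(out)))

# A 95.0 reading outside the expected domain [0, 50] yields F(c) = [95.0, 95.0]:
v = check_value_domain([20.1, 19.8, 95.0, 20.3], 0.0, 50.0, "S1", "c_domain")
```

The returned tuple plays the role of F(c) when the distance to a knowledge representation is computed later.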
v is regarded as the violation feature of S on c when S ⊭ c. V(S) = {v_1, v_2, ...} is the set of all violation features of S, and V(S_Eq) = {V(S_1), ..., V(S_M)} is the total set of all violation features in the sequences of data S.

It is acknowledged that the anomaly explanation discovery problem needs the assistance of knowledge provided by domain experts who have accumulated countless practical experiences. The knowledge supplied by industry applications has various forms, including 1) a fault ID number which is acknowledged in the diagnostic system and can be retrieved in the users' manual, 2) some general descriptions of abnormal patterns, 3) the empirical causal inference of anomaly instances, etc. In this section, we formalize several concepts w.r.t. the provided knowledge, which is the critical input of the studied problem, besides data S and the constraint set C.

In general, we aim to explain the detected anomalies by finding out the corresponding fault reasons. We consider an (acknowledged) anomaly event E to be one reason for the occurrence of anomalies. Due to the fact that each fault event will lead to a series of unexpected changes in sensor data, we consider one change in data as one anomaly representation, denoted by r in Definition 3. r is regarded as the smallest unit of the given knowledge in our problem. We briefly present examples of a knowledge set of a sensor group in a power plant in Table I.

TABLE I
EXAMPLE OF A KNOWLEDGE SET

Event E                        | Explanation R(E)         | Representation
Id-1 Sensor break              | temperature decline, ... | ⟨S, [−∞, · ]⟩, ...
Id-2 Sensor break              | pressure drop, ...       | ⟨S, F(c)⟩, ⟨S, F(c)⟩, ...
Id-1 Engine off                | zero in power, ...       | ...
Id-1 Boiler state instability  | temperature shock, ...   | ...

Definition 3: (Anomaly representation). r = ⟨S, F_r(c)⟩ is the anomaly representation of constraint c on sequence S, where F_r(c) is a formal description depicted from domain experts or professional knowledge.

F_r(c) has the same structure as F(c), referring to Definition 2. For quantitative constraints, F_r(c) = [d_r, u_r], where d_r (resp. u_r) is the lower (resp. upper) value of the description of the knowledge.

According to Definition 3, the unexpected changes caused by an event E are formally presented as a set of anomaly representations. Such a set of representations is considered as an explanation of anomaly instances in data. That is, R(E) = {r_1, ..., r_n} is a set of anomaly representations describing one event E. R(E) is the maximum set of representations r which can be provided by domain experts. The set of all explanations, denoted by R = {R(E_1), ..., R(E_N)}, is the formal description of the domain knowledge provided for the equipment sensor group S_Eq. Formally, the problem studied in this paper is stated in Definition 4.

Definition 4: (Problem description). Given the multivariate time series S of equipment Eq, the constraint set C, and the knowledge set R, the anomaly explanation problem includes the two tasks below, with the four objectives proposed in Section I: (1) to detect the violations in S according to C, locate the violated sequence IDs with time interval T, and record the violations in the set V(S), and (2) to discover an explanation set R′ ⊆ R w.r.t. V(S).

B. Overview of the Approach
Figure 2 outlines our method, which contains three phases: violation detection, anomaly explanation, and knowledge update. We will discuss the violation detection with types of constraints in temporal data in Section III, and introduce our anomaly explanation algorithms in detail in Section IV. The knowledge update step will be discussed in Section V, with the procedure of Algorithm 3 and the function in Algorithm 4 to find the candidate update explanations.

III. DISCOVERY OF ANOMALY INSTANCES
A. Constraint-based anomaly detection
Since dependence and relevance do exist among multivariate time series, we apply the four types of constraints discussed in [6] in our violation detection process. As shown in Table II, the four types of constraints embody the dependence on attributes (columns) and entities (rows) for temporal data.
TABLE II
TYPES OF CONSTRAINTS

           | Single column | Multi-column
Single row | Type 1        | Type 2
Multi-row  | Type 3        | Type 4
Accordingly, we summarize some instances of the four types of constraints in Table III.

TABLE III
EXAMPLES OF CONSTRAINT TYPES

                  | Single sequence                              | Multi-sequence
Single time point | T-1: Value domain                            | T-2: CFDs from documents; Physical Mechanism
Time interval     | T-3: SD, SC [11]; Variance Constraints [12]  | T-4: Similarity Constraints

We consider the value domain of data points in a sequence as the simple instance of Type-1 constraints. CFDs for relational data and Physical Mechanism constraints for industrial data are concluded as multi-sequence (Type-2) constraints. Constraints such as SD, SC, and VC, which formalize the dependence of data points along the time dimension in one sequence, belong to Type-3 constraints. Rules describing the similarity located in multiple sequences can be classified as Type-4 constraints. The various types of constraints assist in precisely locating the anomalies, and they have the potential to uncover the anomalies as early as possible.
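As one concrete instance, a Type-3 constraint bounds how fast consecutive values in a single sequence may change over time. The sketch below is our own illustration of the idea behind SD/SC-style constraints, with an assumed speed bound `smax`:

```python
from typing import List, Tuple

def speed_violations(points: List[Tuple[float, float]], smax: float) -> List[int]:
    """Return indices i where |x_i - x_{i-1}| exceeds smax * (t_i - t_{i-1}),
    i.e., a Type-3 (single-sequence, time-interval) constraint is violated.
    points: list of (t_n, x_n) pairs in time order."""
    bad = []
    for i in range(1, len(points)):
        t0, x0 = points[i - 1]
        t1, x1 = points[i]
        if abs(x1 - x0) > smax * (t1 - t0):
            bad.append(i)
    return bad

# A jump from 20.5 to 90.0 within one time unit violates smax = 5.0:
speed_violations([(0, 20.0), (1, 20.5), (2, 90.0)], smax=5.0)
```

Each reported index can then be turned into a violation feature for the explanation phase.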
B. Anomaly distance measurement
With the types of constraints, we can detect the violations hidden in sequences. After we obtain violation features, we need to compare them with the given anomaly representations to determine the anomaly events that really happened. We first propose the concept of "Explicable" in Definition 5, and then propose the anomaly distance function in Definition 6, according to which we are able to quantify the closeness of violations to acknowledged anomaly events.
Definition 5: (Explicable feature). Given a detected violation feature v: ⟨S, F(c)⟩, v is explicable by R, iff ∃r: ⟨S′, F_r(c′)⟩ ∈ R, R ⊆ R, S = S′ and c = c′.

Definition 6: (Anomaly distance). Given feature v: ⟨S, F(c)⟩ and the (corresponding) representation r: ⟨S, F_r(c)⟩, the distance between v and r w.r.t. constraint c is computed as follows.
If c is a quantitative constraint, then

  dist(v, r) = dist(F(c), F_r(c)) = 1 − |[d, u] ∩ [d_r, u_r]| / |[d, u] ∪ [d_r, u_r]|.   (1)

If c is a qualitative constraint, then

  dist(v, r) = dist(F(c), F_r(c)) = |F_r(c) − F(c)|.   (2)

The proposed distance function dist(·, ·) is only measurable with regard to the same constraint. It does not make sense to compute the distance between a v and an r which hold different constraints. Obviously, the anomaly distance dist(v, r) coincides with the properties of a distance function: it lies in [0, 1], and the value of dist(v, r) is lower as feature v is closer to representation r. More specifically, for qualitative constraints, dist(v, r) only has two values, where dist(v, r) = 0 shows the detected v is consistent with the representation r, and dist(v, r) = 1 otherwise. For quantitative constraints, dist(v, r) ∈ (0, 1) shows v is partially consistent with r, while dist(v, r) = 1 indicates that feature v is completely different from the representation r, with [d, u] ∩ [d_r, u_r] = ∅. Thus, we describe how to determine whether the feature is consistent with the representation in Definition 7.
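The quantitative case of Definition 6 (Eq. (1)) can be sketched as an interval overlap ratio. Measuring the union by the covering hull is an implementation choice of ours: when the intervals overlap, the hull equals the union exactly, and when they are disjoint, the overlap is 0 and the distance is 1 regardless:

```python
def anomaly_distance(obs, rep):
    """dist(v, r) = 1 - |[d,u] ∩ [d_r,u_r]| / |[d,u] ∪ [d_r,u_r]|  (Eq. (1)).
    obs and rep are (lower, upper) pairs for F(c) and F_r(c)."""
    (d, u), (dr, ur) = obs, rep
    overlap = max(0.0, min(u, ur) - max(d, dr))
    hull = max(u, ur) - min(d, dr)   # equals the union length when intervals overlap
    if hull <= 0:
        return 0.0                   # degenerate identical point-intervals
    return 1.0 - overlap / hull

# [0,10] vs [5,15]: overlap 5, union 15, so dist = 1 - 5/15
```

A feature is then judged consistent with a representation by comparing this value against the threshold θ introduced in Definition 7.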
Fig. 2. Method framework overview
Definition 7: Given feature v and the corresponding representation r, v is identified to be consistent with r if

  dist(v, r) = 0 when c is qualitative, and dist(v, r) < 1 when c is quantitative.   (3)

More generally, dist(v, r) < 1 is too loose to estimate whether the feature matches a representation. Thus, we introduce a threshold θ ∈ (0, 1) and consider v to be consistent with r when dist(v, r) < θ. θ can be learned from a number of experiments, or set manually as required.

We highlight that one may fail to find a one-to-one match between each violation feature and each representation in real industry scenarios. That is, a violation feature v obtained from the violation detection method is not necessarily identified to be explicable (by R). This results from both the precision limitations of violation detection techniques and the incompleteness of the knowledge representations provided by experts. In the next section, we introduce our anomaly explanation approach, in which we take into consideration in detail the multiple conditions between the detected features and the given representations.

IV. IDENTIFYING ANOMALY EXPLANATIONS
A. Candidate Explanations Discovery
In anomaly explanation and analysis research, especially knowledge-based study, it is acknowledged that both incompleteness and ambiguity always exist in experts' knowledge. For the former, domain experts may not provide all descriptions about one anomaly instance reflected in the sensor data. The main reasons include 1) experts' limitations in professional degree and humans' understanding of anomaly problems, and 2) deficiently deployed sensors, which may fail to record some parameters that are critical to the explanation of abnormal data. For the latter, since some knowledge is accumulated from practical experience, one cannot expect industry experts to always be definite about their explanations. Part of these explanations may hold only with some probability.

According to our investigation of the manufacturing and electricity industries, both definite and presumable knowledge are applied in anomaly explanation and fault diagnosis tasks. In this case, we divide the explanations of anomalies into two categories: exact explanations and possible explanations. Given a fault event E, the exact explanation of E is the maximum set of anomaly representations, denoted as R*(E), in which the fault event E leads to the existence of the representations {r_1, ..., r_x} in data. ∀i ∈ [1, x], r_i is called an exact representation for E. The possible explanation of E is a set of anomaly representations, denoted by R+(E) = {r_1, ..., r_y}, in which E leads to the existence of anomaly representation r_j, j ∈ [1, y], with probability Pr (Pr ∈ (0, 1)). ∀j ∈ [1, y], r_j is called a possible representation for E.

With the two categories, we denote the explanation R by the combination of R* and R+, as shown in Proposition 1.

Proposition 1: The explanation R w.r.t. event E is the union of the exact explanation and the possible explanation of E, denoted by R = R* ∪ R+, where R* ∩ R+ = ∅ and R* ≠ ∅.

In general, the exact R* is considered the key factor in the identification of E, while the possible R+ helps to describe the event in a rough way. In industry scenarios, the occurrence of a fault, especially a known one, will certainly give rise to the violation of a series of constraints, as described by R*. But on the contrary, one cannot be sure whether the event really happens when only some representations in R* have been detected. We formally describe the relationship between a fault E and its explanation R*(E) in Proposition 2. Accordingly, we are able to make further analysis in order to obtain the fault set and provide a reliable and high-quality explanation of the anomalies.

Proposition 2: Given a fault event E with R(E), E is a sufficient but unnecessary condition of its exact explanation R*(E), (R*(E) ⊆ R(E)).

From the above, we are able to narrow the computation on R by finding a subset of R in which all exact representations have appeared in data. Such a subset of R is denoted as G = {R_1, R_2, ...}, as shown in Definition 8. Thus, we first find the candidate explanation set G from R by verifying the appearance of exact explanations against all violation features detected in the previous steps, and then precisely compute the explanation set from the candidate result G.

Definition 8:
Given the violation feature set V of S w.r.t. T, and the set R, G is identified as a candidate explanation set if it satisfies: 1) G ⊆ R, and 2) ∀R ∈ G, R* ⊆ R, ∀r: ⟨S, F_r(c)⟩ ∈ R*, ∃v ∈ V such that v is consistent with r w.r.t. c.

The candidate explanation set discovery process is shown in Algorithm 1.

Algorithm 1: Compute Candidate Explanations
Input: V = {V(S_1), ..., V(S_M)}: the set of violation features in S w.r.t. T; the explanation set R
Output: the set of candidate explanations G
  initialize G ← ∅;
  foreach R ∈ R do
    cand ← 1;
    foreach r ∈ R.R* do
      if v.F(c) = 0 or dist(v, r) = 1 then
        cand ← 0; break;
    if cand = 1 then
      G ← G ∪ {R};
  return G;

After initializing an empty set G, we enumerate each explanation set R from R in the outer loop (Lines 2-9), and maintain a label cand to record whether all elements in R exist among the detected violation features. Within the outer loop, we enumerate each representation r from the exact explanation set of R, i.e., R.R*, and identify the occurrence of r in data. We let the label cand = 0 when the sequence S does not violate constraint c w.r.t. r, or the violation in S is different from r, i.e., dist(v, r) = 1 (Lines 5-7). After all exact representations in R.R* are visited, we add all Rs with label cand = 1 into G, and finally obtain the objective subset G from R.

B. Cost-based Explanation determination
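Before turning to cost-based selection, the candidate filtering of Algorithm 1 can be sketched in Python. The dictionary layout for features and knowledge, and the consistency test `dist(...) < theta`, are our own assumptions for illustration:

```python
def candidate_explanations(features, knowledge, dist, theta=0.5):
    """Algorithm 1 sketch: keep an explanation only if every exact
    representation matches some detected violation feature.
    features:  dict mapping (seq_id, constraint_id) -> F(c)
    knowledge: list of (event_id, exact_reps), where exact_reps is a list
               of (seq_id, constraint_id, F_r(c)) tuples."""
    G = []
    for event_id, exact_reps in knowledge:
        ok = all((s, c) in features and dist(features[(s, c)], fr) < theta
                 for s, c, fr in exact_reps)
        if ok:
            G.append(event_id)
    return G
```

With an exact-match distance (0 on equality, 1 otherwise), an event survives the filter exactly when all of its exact representations appear among the detected features.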
According to Algorithm 1, we can qualitatively find the candidate set which contains an overall explanation of the faults that really happen in data. However, the set G is far from a high-quality solution for industry applications. In the next step, we aim to make a further discovery of anomaly explanations considering the four objectives discussed in Section I.

Note again that an explanation includes a few representations, either exact or possible ones. We identify the explanations of anomalies by measuring the distance between the violation feature and the representation w.r.t. the same c. Intuitively, we should choose those explanations with close distance values between the detected features and the given representations. In order to model the distance degree between the vs and rs, we introduce a cost-based principle to quantitatively compute how well the violation features match an anomaly explanation in Definition 9.

Definition 9: (Explanation Cost). The cost of applying event E as an explanation of the anomalies in S is

  Cost(E) = Σ_{r_i ∈ R*} dist(v_i, r_i) / |r_i.S| + Σ_{r_j ∈ R+} ω · dist(v_j, r_j) / |r_j.S|,

where R*(E) (resp. R+(E)) is the exact (resp. possible) explanation of E, v is the detected violation feature of sequence S w.r.t. constraint c, dist(v, r) is the anomaly distance between v and r w.r.t. c, |r.S| is the number of sequences involved in r, and ω ∈ (0, 1) is the probability value of a possible representation in R+.

TABLE IV
CONFUSION MATRIX IN THE EXPLANATION PHASE

Knowledge \ Detected   | Abnormal data  | Normal data
Exist representations  | Set A: V*      | Set B: (R \ R_o)
No representation      | Set C: V \ V*  | Set D
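The explanation cost of Definition 9 can be sketched as a weighted sum over matched representations. The input layout (per-representation distance and sequence count) and the default ω are our assumptions:

```python
def explanation_cost(exact, possible, w=0.5):
    """Cost(E) = sum over exact reps of dist/|r.S|
               + sum over possible reps of w * dist/|r.S|   (Definition 9).
    exact, possible: lists of (dist, n_sequences) pairs; w plays the role
    of ω ∈ (0, 1)."""
    return (sum(d / n for d, n in exact)
            + sum(w * d / n for d, n in possible))

# One perfect exact match, one partial exact match spanning two sequences,
# and one possible representation weighted by w = 0.5:
explanation_cost(exact=[(0.0, 1), (0.4, 2)], possible=[(0.2, 1)], w=0.5)
```

Events whose representations match the detected features closely accumulate a low cost and are preferred in the set-cover step.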
According to Definition 9, a higher Cost(E) value shows that the violation features in data match the fault representations of E less well, and intuitively, fault E is less likely to be a reason for the detected anomalies. It is acknowledged that there probably exists more than one fault in the equipment system within the same time interval, which urges us to explore the multiple reasons which can cover all the anomalies detected by constraints. Here, we introduce the minimum weight set covering problem (MWSCP) into our method to derive the optimal explanations of anomalies.

Note that we are faced with different cases in the comparison of the observed (a.k.a. detected) real data and the knowledge representations. The constraint-based anomaly detection phase identifies the input data as two categories: 1) abnormal data which violate at least one constraint, and 2) normal data which do not have violations. Considering the given knowledge, there are also two cases: 1) there exists a representation which describes the violation of a certain constraint, or 2) no representation is given at all. Accordingly, there are in total four kinds of conditions to be considered in the explanation analysis phase, as shown in Table IV. From the point of view of the monitoring data, we summarize the four kinds of data as Sets A, B, C, and D, respectively.

Set D in Table IV is beyond our research, for data in Set D are neither detected to be abnormal nor described by knowledge. We focus on the analysis of Sets A, B, and C. Specifically, Set A contains the violated data which are explicable by representations from the knowledge set R. Set B contains the data which have representations in R but are identified as normal. Set C contains the data detected to have violations while there are no representations in R to explain these violations.
This mainly has two reasons: i) the constraint instances are set too strictly, so that the method falsely identifies some normal data as abnormal, or ii) some new anomaly patterns are discovered which are unknown in the present knowledge set. Thus, the cases in Set C also matter in our solution, because they can assist in updating the knowledge set.

Below we first introduce how to precisely explain Set A, and we will propose the updating method according to the detected Set C in the next section. Considering our four objectives, our solution needs to cover all the detected anomalies with a small-size result, that is, to find a concise set of explanations from R which can explain all the violations. Moreover, our method is expected to provide explanations that describe the anomalies well, with close distance values.

C. Set-Cover-based anomaly explanation
As discussed above, we aim to find a subset of R as the problem solution which covers all the violated instances in data. We denote the data in Set A as V* = {v: ⟨S, F(c)⟩ | v ∈ V(S), and v is explicable by R}. We apply the minimum weight set covering problem (MWSCP) [13], [14] to solve our AE problem. Definition 10 formalizes our problem, where the set V* is the target to be covered, and the explanation cost Cost(R) is regarded as the weight.

Definition 10: (Anomaly Explanation Problem). Given the violation set V(S) w.r.t. S, the knowledge set R, and the candidate explanation set G, our anomaly explanation problem is to find an explanation set H which satisfies

  min Σ_{R(E) ∈ H} Cost(E)
  s.t. H ⊆ G,  ∪_{R(E) ∈ H} R_o.cover = V*,  R_o(E) ⊆ R(E),

where R_o(E) = {r_1, ..., r_m} is a subset of R(E) w.r.t. anomaly event E such that ∀r_i ∈ R_o(E), i ∈ [1, m], ∃ an explicable feature v ∈ V* with dist(v, r_i) < 1, and R_o.cover represents the total set of the violation features which are consistent with the rs in R_o.

The greedy-based heuristic algorithm. Since set cover computing is NP-hard [10], [14], we introduce greedy-based algorithms for our AE problem. Considering the coverage issues, the solution of our anomaly explanation problem proposed above is first a covering of the set V*, and then it satisfies the minimum cost principle. In this case, we are able to find out whether an explanation R ∈ R is certainly contained in the solution H or certainly does not exist in H. The two cases are concluded in Proposition 3 and Proposition 4, respectively. Proposition 3 shows that if the explanation R_j covers more violations than R_i.cover while R_j has a smaller cost than R_i, then R_i will not be selected into the solution H. Proposition 4 shows that an explanation R must be selected into the solution if it is the only explanation that covers some violation v.

Proposition 3:
Given R, the explanation R_i is not valid and does not exist in the solution H, if ∃R_i, R_j ∈ R, R_i.cover ⊆ R_j.cover, and Cost(R_i) > Cost(R_j).

Proposition 4: R from R is a valid explanation and does exist in the solution H, if ∃v ∈ V* such that there is only one explanation R in R which satisfies v ∈ R.cover.

Considering both effectiveness and efficiency, we propose a greedy-based heuristic algorithm to obtain H. The general principle is to give priority to choosing the explanation i) which has a smaller cost value, and ii) which covers the violations of constraints defined on multiple sequences, specifically the Physical Mechanism constraints in this paper. For the former, it is obvious that a fault event with a smaller cost value is more reliable for explaining part of the anomalies. For the latter, we consider covering the violations existing in multiple sequences prior to the ones in a single sequence, for three main reasons: 1) violations between sequences are more likely to involve fault event(s) than violations happening in a single sequence, because the detected single-sequence violations may just occur sporadically for some reason, which does not require an explanation; 2) the faults which happen in several sensors are always more serious than the ones which only happen in one sensor; and 3) multi-sequence violations always contain many more features than single-sequence violations, so processing multi-sequence violations contributes to increasing the coverage of the solution.

Algorithm 2: Compute Explanation Set
Input: the set V* of violation features in S w.r.t. T; the candidate explanation set G
Output: the set of final explanations H
  initialize H ← ∅;
  delete Rs from G according to Proposition 3;
  insert R′s from G into H according to Proposition 4;
  G ← G \ R′s;
  V* ← V* \ R′.cover;
  select the subset V*_M from V* where V*_M = {v | v.K > 1 and v ∈ V*};
  sort all vs in V*_M in the descending order of the size of v.K;
  foreach v ∈ V*_M do
    R(E) ← argmin_{v ∈ R(E).cover} Cost(E);
    H ← H ∪ {R(E)};
  V*_un ← V* \ H.cover;
  while V*_un ≠ ∅ do
    R(E) ← argmax_{R(E)} |R.cover ∩ V*_un| / Cost(E);
    H ← H ∪ {R(E)};
    V*_un ← V*_un \ R.cover;
  return H;

Algorithm 2 outlines our heuristic algorithm, which mainly consists of three steps, as discussed below.

Global optimization (Lines 1-5). After initializing an empty set H, we first execute the global optimization according to Proposition 3 and Proposition 4. Thus, we narrow the size of the input G by deleting the invalid explanations Rs, while we insert the valid explanations R′s into set H. After that, we delete the R′s from G and correspondingly delete R′.cover from the violation set V*.

Covering multi-sequence violations (Lines 6-11). After we deal with all the valid and invalid explanations, we begin to select explanations from the present set G to cover the multi-sequence violations. We sort all multi-sequence features in the descending order of the number of sequences involved in each feature v. We then enumerate each feature v from the sorted set V*_M, and greedily find a fault E whose explanation R(E) can cover v with the minimum Cost(E) value. We put such an R into the solution set H, and finish the iteration when all the features in V*_M have been visited. We then delete all the violation features covered by H from V* and let the remaining set be V*_un, which needs to be covered in the following step.

Covering single-sequence violations (Lines 12-15). Faced with V*_un, we compute the total number of vs in V*_un covered by the same explanation R, denoted by |R.cover ∩ V*_un|, and we iteratively choose the explanation R which has the maximum ratio of the above number to Cost(E). Correspondingly, we add R into H and then delete R.cover from the present set V*_un. The iteration finishes when there are no features left in V*_un, and we finally obtain the solution H.
We put such R into the solution set H, and finish the iteration when all the features in V*_M have been visited. We then delete all the violation features covered by H from V*, and let the remaining set be V*_un, which needs to be covered in the following step.

Covering single-sequence violations (Lines 12-15). Faced with V*_un, we compute the total number of v's in V*_un covered by the same explanation R, denoted by |R.cover ∩ V*_un|, and we iteratively choose the explanation R which has the maximum ratio of this number to Cost(E). Correspondingly, we add R into H and then delete R.cover from the present set V*_un. This iteration finishes when there is no feature left in V*_un, and we finally obtain the solution H.

Complexity. The modification process in Algorithm 2 Lines 2-5 costs O(|V|·|R|) time, and the sorting in Line 6 costs O(|V|·log|V|). Generally, the loops in Lines 8-11 and Lines 12-15 both cost O(|V|·|R|) at worst. Putting it together, Algorithm 2 spends O(|V|·|R|·max{|V|, |R|}).

V. KNOWLEDGE UPDATE
Though we find a solution that explains the detected anomalies with the existing reasons in Algorithm 2, there remain some anomalies which are inexplicable by the knowledge set, i.e., the Set C in Table IV. There are mainly two reasons for the occurrence of the violated data in Set C: (1) some of the explanations w.r.t. a fault are not reliable enough to conclude the corresponding fault event, that is, the representations in such an explanation are far from precise, which fails to identify the fault from the violation features; (2) new fault events are discovered by the constraint set C which are not known in the present knowledge set R. In both cases, we aim to update and improve the present knowledge set according to the detected results. The updated knowledge set will provide more precise anomaly explanations in return. In this section, we propose our update strategies for the inexplicable violations. We first discuss the update of anomaly representations, especially the possible representations, utilizing the relevance between the detected violation features in Section V-A, and then introduce a knowledge set modification strategy in Section V-B, which assists in improving the quality of the knowledge set through the iterations of anomaly explanation and knowledge update.

A. Update of Anomaly Representations
As discussed above, the imperfect descriptions concluded in R are one of the most serious reasons for the remaining inexplicable anomalies. Faced with Set C in Table IV, we consider updating the explanation of fault events by either adding new representations to an explanation or directly adding an explanation w.r.t. a new fault event. We first consider updating the existing representations within a fault's explanation with a relevance analysis of the violation features, and then consider creating a new record of the unknown anomalies in R.

When we try to find the faults to explain the anomalies, there probably remain some violation instances for which we fail to find fault reasons to cover them. We need to update and modify the representations of faults in order to improve the description of faults. Faced with the update task, we generally consider inserting some new violation features into the explanation of a fault, where these new features are regarded as supplementary (possible) representations of the existing explanation. Thus, we are able to find a more precise solution H' to cover the detected violations.

When real faults happen, the violations w.r.t. one fault probably do not occur individually. On one hand, it is possible that the anomaly in one sequence S brings about multiple violations of different constraints on S. On the other hand, some violations w.r.t. a multi-sequence constraint c would occur at the same time in the involved sequences, i.e., c.domain. To achieve the interpretability and the dependability of the representation update, we introduce a relevance analysis between the existing knowledge and the learnt violation features. We discuss the relevance in Section V-A1, and then propose our update algorithm in Section V-A2.
1) Relevance in anomaly representations:
Suppose r_1^+ and r_2^+ are two anomaly representations learnt from violation features after many iterations of detection and update. The relevance between them is formalized in Definition 11. We consider different representations in the same sequence to be directly related to each other. Besides, representations in different sequences that come from one multi-sequence constraint are also directly related to each other.

Definition 11 (Relevance between r's): Given two anomaly representations r_1: ⟨S_i, F(c_m)⟩ and r_2: ⟨S_j, F(c_n)⟩, r_1 is related to r_2, denoted by r_1 ↔ r_2, if either
(1) S_i and S_j are the same sequence, while c_m and c_n are different constraints, i.e., i = j and m ≠ n, or
(2) S_i and S_j are different sequences, while c_m and c_n are the same constraint; specifically, c_m (a.k.a. c_n) is a multi-sequence constraint whose domain contains S_i and S_j.
Otherwise, r_1 and r_2 are not related to each other, denoted by r_1 ↮ r_2.

The relevance between r's is a symmetric relation, while it does not have transitivity, because there are two factors in the identification of the relation "↔", i.e., r's w.r.t. the same sequences and r's w.r.t. the same constraints. With the relation "↔", we are able to update R according to the relevance between a learnt representation r+ and an existing representation in R. Such learnt r+'s come from the detected violation features, as formalized in Proposition 5.

Proposition 5:
Given a feature v, let CoverAE(v) = {R | R ∈ R, v ∈ R.cover} be the set of all explanations which cover the occurrence of v. A feature v* is identified as a learnt representation of the anomaly event E described by R, if it satisfies (1) v* is uncovered by the solution H, and (2) v* is related to v.
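Definition 11 boils down to a small predicate over ⟨sequence, constraint⟩ pairs. A minimal Python sketch follows; the Repr class, the related function, and the domain map of multi-sequence constraints are illustrative names, not from the paper's implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Repr:
    """An anomaly representation r = <S, F(c)>: a sequence id plus the violated constraint."""
    seq: str         # sequence S the representation lives in
    constraint: str  # constraint c whose degree function F(c) is violated

def related(r1: Repr, r2: Repr, domain: dict) -> bool:
    """Definition 11: r1 <-> r2 iff
    (1) same sequence, different constraints, or
    (2) different sequences covered by the same multi-sequence constraint."""
    if r1.seq == r2.seq and r1.constraint != r2.constraint:
        return True
    if r1.seq != r2.seq and r1.constraint == r2.constraint:
        dom = domain.get(r1.constraint, set())   # c.domain of a multi-sequence constraint
        return r1.seq in dom and r2.seq in dom
    return False
```

The predicate is symmetric by construction but, as noted above, not transitive: two representations can each relate to a third through different factors (same sequence vs. same constraint) without relating to each other.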
2) Update Algorithm:
We propose the update process of the knowledge set R in Algorithm 3. We first construct a graph G where each vertex denotes a violation feature, and there exists an edge between two vertices v_i and v_j if they are related to each other according to Definition 11. We maintain a set of explanations CoverAE(v) for each feature v, in which each element (i.e., explanation) is able to cover v (Lines 2-3). After initializing the visit flags of features, we begin to iteratively visit uncovered features and either update them into the existing explanations or create a new anomaly explanation (Lines 5-13). Concretely, for an uncovered feature v, we compute all the features which are related to v (denoted by Candr), and obtain all the explanations which should be updated w.r.t. v (denoted by UpSet) in Line 6. This process is implemented by a function FindUp, as discussed in Algorithm 4 below. If the set UpSet is empty, which means there does not exist any anomaly event that can potentially explain the occurrence of v, we consider creating a new anomaly event described by v and its related features (Lines 7-8).

Algorithm 3: Explanation Update
Input: the knowledge set R, the set V of all detected violations, and the set V_uncover of violation features uncovered by H
Output: the updated knowledge set R
 1: construct graph G = (V, E), where each identified feature v denotes a vertex in G, and the edge (v_i, v_j) ∈ E exists if v_i is related to v_j;
 2: foreach v ∈ V do
 3:     CoverAE(v) ← {R | R ∈ R, v ∈ R.cover};
 4: initialize Flag[v] ← False for all v ∈ V;
 5: foreach v ∈ V_uncover do
 6:     Candr, UpSet ← FindUp(v);
 7:     if UpSet = ∅ then
 8:         create a new anomaly event and insert its explanation R_up into UpSet;
 9:     foreach v ∈ Candr do
10:         CoverAE(v) ← UpSet;
11:         foreach R_up ∈ UpSet do
12:             insert v into R_up as a new possible representation r+ with the initial weight w;
13:     V_uncover ← V_uncover \ Candr;
14: return R;

We enumerate each violation feature v in the set Candr, update the set CoverAE(v) with the new explanation, and insert v into R_up as a new representation r+ = ⟨S, F(c)⟩ with an initial weight w (Lines 10-12). After finishing the update process of Candr, we delete all the elements in Candr from V_uncover, and continue to process the next feature in V_uncover. We obtain the updated knowledge set R when all uncovered features have been visited.

We then introduce the proposed function FindUp in Algorithm 4, where we find (1) all uncovered features related to the input feature v, denoted by the set Candr, and (2) all the anomaly explanations which need to be updated w.r.t. v, denoted by UpSet. Given a feature v, we first mark that v has been visited, and then determine whether v has already been covered by acknowledged explanations. If so, v will not be considered as a candidate new representation; ∅ and the present set CoverAE(v) will be returned (Lines 2-3). Otherwise, when the present feature is not covered by any anomaly event, we initialize the set Candr with {v} and the set UpSet with ∅, and begin to add elements into both sets. We iteratively visit the uncovered features related to v, completing the set Candr as well as finding all the existing explanations to be updated w.r.t. v. After the loop in Lines 5-9 finishes, both sets Candr and UpSet are returned to Algorithm 3.
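Since the relation "↔" is symmetric, FindUp amounts to a depth-first traversal of the connected component of v in the relevance graph, cut short at covered features. A minimal recursive sketch, assuming adjacency lists and a CoverAE map keyed by feature (all names illustrative):

```python
def find_up(v, neighbours, cover_ae, flag):
    """Sketch of FindUp: collect Candr (uncovered features related to v) and
    UpSet (explanations that should be updated w.r.t. v)."""
    flag[v] = True
    if cover_ae.get(v):                      # v already covered: contribute its explanations only
        return set(), set(cover_ae[v])
    candr, upset = {v}, set()
    for u in neighbours.get(v, ()):          # edges of the relevance graph
        if not flag.get(u, False):
            sub_candr, sub_upset = find_up(u, neighbours, cover_ae, flag)
            candr |= sub_candr
            upset |= sub_upset
    return candr, upset
```

For example, if an uncovered feature is related to one covered feature, the covered feature's explanations land in UpSet while only the uncovered features land in Candr, matching Lines 2-3 and 5-9 of Algorithm 4.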
Algorithm 4: FindUp(v)
Input: the present violation feature v
Output: Candr, UpSet
 1: Flag[v] ← True;
 2: if CoverAE(v) ≠ ∅ then
 3:     return ∅, CoverAE(v);
 4: initialize Candr ← {v}, UpSet ← ∅;
 5: foreach v* ∈ v.neighbours do
 6:     if Flag[v*] = False then
 7:         temp_Candr, temp_UpSet ← FindUp(v*);
 8:         Candr ← Candr ∪ temp_Candr;
 9:         UpSet ← UpSet ∪ temp_UpSet;
10: return Candr, UpSet;

Complexity. In Algorithm 3, Lines 1-3 cost O(|V|) to construct the graph and O(|V|·|R|) to find the explanation sets CoverAE of all the features, where |V| and |R| denote the number of violation features and anomaly explanations, respectively. The outer loop (Lines 5-13) runs O(|V_uncover|) times, while the inner loop (Lines 9-12) spends O(|V_cover|·|UpSet| + |Candr|·|UpSet|). In practice, the size of V_uncover becomes smaller within the outer loop; thus, the outer loop is executed O(|V_uncover|/|Candr|) times on average. Together, the whole loop costs O(|V_uncover|/|Candr| · |UpSet| · max{|V_cover|, |Candr|}). Whether |V_uncover| is larger than |V_cover| or not, it always holds that |UpSet| ≤ |R|. To put it together, Algorithm 3 totally costs O(|V|·max{|V|, |R|}) to update the whole knowledge set.

We highlight that r+ is considered to be inserted into the possible explanation set of a fault with a probability w, for the reason that the updated anomaly representations are derived from multiple real detections, whose dependability is less than that of the existing knowledge. We discuss our blueprint of how to modify the knowledge set in the iteration of detections and updates in Section V-B, including the consideration of the weight w and the violation features of r.

B. Modification on the Knowledge Set
Note again that R(E) is divided into R*(E) and R+(E), where we highly trust the representations in R*(E) while we consider the representations in R+(E) with uncertainty. Actually, the possible explanation set R+(E) likely comes from the learning results of real detections, i.e., the update process in Algorithm 3. The violation features, especially the ones appearing frequently in detection phases, can be applied in the knowledge modification process.

It is practicable to return anomaly explanation results to domain experts, who are able to adjust and improve the existing knowledge system. The improved knowledge would in turn contribute to the accuracy and the reliability of the following detections. Besides the supplementary possible representations r+'s, we can also make a representation r more accurate by modifying the degree function value F(c) of r. In this section, we further discuss two kinds of potential knowledge modification strategies.

Type-1 Modification: degree function values. As proposed in Definition 2, the function value F(c) of a violation feature v measures to which degree a sequence violates the constraint c. For a violation feature of quantitative constraints denoted by v: ⟨S, F(c) = [d, u]⟩, let F'(c) = [d', u'] be the present degree function obtained from k rounds of updates. We modify the function F(c) w.r.t. v as follows:

F(c) = [d'', u''] = [(k·d' + d)/(k+1), (k·u' + u)/(k+1)],   if dist(F, F') < 1;
F(c) will be returned to manual processing,                  if dist(F, F') = 1.

Type-2 Modification: weights.
The weight w of a possible representation r+ in an anomaly explanation R is estimated by the conditional probability Pr(r+ | R) in Equation (4):

ŵ(r+) = Pr(r+ | R) = Pr(R, r+) / Pr(R) = N_positive(R, r+) / N_positive(R),   (4)

where N_positive(R) denotes the number of occurrences of the anomaly explanation R in the solution H during the learning processes, while N_positive(R, r+) denotes the number of occurrences of the condition that R exists in H and ∃v consistent with r+.

VI. EXPERIMENTAL STUDY
We now present the experimental study of the proposed methods. All experiments run on a computer with a 3.40 GHz Core i7 CPU and 32GB RAM.
A. Experimental Settings
Data source. We conduct our experiments on real-life industrial equipment data, named FPP, which has 80 sensors recording the working conditions of a fan-machine group from a large-scale fossil-fuel power plant. We have analyzed data of more than 1,620K historical time points over 5 consecutive months, together with log files and functional documents. We report our experimental results on 64 sensors after preprocessing.

Implementation. We have developed Cleanits, a data cleaning system for industrial time series, in our previous work [15], which reads and writes data from Apache IoTDB [16]. The anomaly explanation method proposed in this paper is applied as one main function of Cleanits. We implement all algorithms proposed in this paper, with the constraint-based detection method VioDetect, the explanation algorithm AEC, and the update algorithm Update. The constraints used in the experiments contain half real constraints provided by domain knowledge and half synthetic ones concluded from a long-term study of both the historical data and log files. Since not only is the industrial knowledge set far from comprehensive, but the labelled anomalies as well as explanations are also limited, we extend the existing knowledge (mostly from documents and domain experts), and manually regulate some synthetic explanations with the corresponding representations based on the acknowledged documents and fault logs.

We consider the original clean time series data as ground truth, and inject anomalies w.r.t. constraints into sequences in different time intervals. Without loss of generality, we introduce anomaly instances according to the error types in [17], [18], and carefully consider the anomaly patterns simultaneously located in multiple sequences, referring to the acknowledged real anomaly events. We totally apply 210 constraints and 60 anomaly events as the given knowledge set.

Besides the proposed method, we also implement five algorithms for comparative evaluation:
• greedyC: uses the greedy strategy for the set cover problem on the candidate explanation set G to iteratively select the event E satisfying R(E) = argmin Cost(E)/|R.cover ∩ V*_uncover|, insensitive to constraint types.
• greedynC: describes the violation of a multi-sequence constraint c with n violation features in the involved n sequences w.r.t. c, rather than applying only one feature to denote such violations, with others the same as greedyC.
• MFnC: treats the multi-sequence-constraint violation with n features in the involved sequences w.r.t. c, with others the same as the proposed AEC.
• TopK: sorts the explanations R(E)'s in the ascending order of Cost(E), and chooses the top K explanations as the result. K is determined by K = |R|·|C_vio|/|C|, where |C| is the number of applied constraints and |C_vio| is the number of detected violated constraints.
• AE: outputs all explanations satisfying Cost(E)/|R(E).cover| ≤ λ, where λ ∈ (0, 1] is a set threshold. We report results with λ = 0.4, for it provides the best results among possible threshold values.
We note that the first three algorithms are cover-based, while TopK and AE are not.

Measure. We apply the Precision (P) and Recall (R) metrics in Equation (5) to evaluate the performance of the algorithms. P measures the ratio between the number of correctly-identified anomaly events and the number of reported events, while R is the ratio between the number of correctly-identified anomaly events and the number of real anomaly events:

P = |E_report ∩ E_true| / |E_report|,   R = |E_report ∩ E_true| / |E_true|.   (5)

B. General Performance
With the condition that Constraints = 210, we perform all comparison algorithms on 4 datasets which have 10.8K time points on 64 sensors, recording data for one day. About 20 anomaly events occur in each dataset. As shown in Figure 3, VioDetect reaches high performance on both P and R on all datasets. This is the foundation for high-quality explanation computing. The proposed AEC has the best Precision scores on average, while MFnC comes second. It reveals that it is better to treat the violation of a multi-sequence constraint as only one violation feature than to maintain n features in each involved sequence. The gap in P between greedyC and greedynC also confirms this. However, both algorithms fail to provide precise explanations. This is because they treat all types of constraints equally, and fail to give priority to multi-sequence constraints, whose violations are more likely to show major features to be identified as an anomaly event.

Figure 3 shows that the four cover-based algorithms have similar Recall scores on the four datasets. It reveals that the covering solutions can capture and recall at least 85% of the anomaly events. In addition, there is almost no difference in Recall whether the constraint types are sensitive or not. As for the algorithms TopK and AE, the performance of both is not steady on different datasets. This shows that simply choosing explanations w.r.t. Cost cannot well identify the occurred anomaly events.

Fig. 3. General performance comparison on 4 datasets in FPP.
Fig. 4. Performance comparison vs. the number of constraints.
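The cover-based algorithms compared above (AEC, MFnC, greedyC, greedynC) all share the same greedy weighted set-cover core: repeatedly pick the explanation with the highest ratio of newly covered violations to cost. A minimal sketch under assumed data shapes, with Proposition-3 pruning included but the Proposition-4 step and AEC's multi-sequence priority omitted for brevity (the explain name and the tuple layout are illustrative, not from the paper's code):

```python
def explain(violations, candidates):
    """Greedy weighted set cover over candidate explanations.
    candidates: list of (name, cost, cover) tuples, cover being a set of violation features."""
    # Proposition-3 pruning: drop any candidate whose cover is contained in
    # another candidate's cover at a strictly lower cost
    pruned = [e for e in candidates
              if not any(e is not f and e[2] <= f[2] and e[1] > f[1]
                         for f in candidates)]
    uncovered, solution = set(violations), []
    while uncovered:
        # pick the explanation maximizing newly covered violations per unit cost
        name, cost, cover = max(pruned, key=lambda e: len(e[2] & uncovered) / e[1])
        if not cover & uncovered:            # remaining violations are uncoverable
            break
        solution.append(name)
        uncovered -= cover
    return solution
```

In this sketch a dominated explanation (same cover, higher cost) is never selected, and the cheapest explanation per covered violation is chosen first, mirroring the behavior that separates the cover-based methods from TopK and AE.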
C. Evaluation on Explanation Performance
We next report the performance under three vital parameters: the constraint amount (Constraints), the data amount (Time points), and the number of occurred anomaly events (AE).

Varying Constraints. Figure 4 shows the performance on the condition that Time points = 20K. It shows that VioDetect and the four cover-based algorithms achieve higher scores with the increase of Constraints, which indicates that the completeness of the constraint set affects the explaining quality. Both P and R of the cover-based algorithms tend to be stable after Constraints reaches 120. Our proposed AEC shows the best scores on both P and R; MFnC comes second, and greedyC third. It also shows that the Precision difference between AEC and MFnC becomes larger with the increasing constraint amount. This verifies again the advantage of describing the violation w.r.t. one constraint with only one feature. As for Recall, AEC and MFnC (resp. greedyC and greedynC) present quite close scores, which shows that constraint-type sensitivity does not obviously affect the Recall level.
Varying Time points. Figure 5 shows results with Constraints = 210 and varying Time points. Anomaly events are evenly located in the data. The detection performance of VioDetect is stable and only has a little drop when the data amount becomes larger, and it achieves P = 0.92 and R = 0.91 in detecting almost 2-day data (i.e., Time points = 20K). While all the comparison algorithms drop to different degrees with the increasing data amount, AEC beats the rest of the comparison algorithms in both metrics, while the drop of greedyC and greedynC is faster. It shows that the simple greedy strategy cannot provide good solutions when the data amount gets larger, where more anomaly instances exist.

Varying AE. Figure 6 reports the results with Constraints = 210 and a varying number of happened anomaly events.
Fig. 5. Performance comparison vs. the number of time points.
Fig. 6. Performance comparison vs. the number of anomaly events in Time points = 20K data.
VioDetect has stable Recall scores, while Precision has a little drop from 0.942 to 0.904 with AE varying from 5 to 30. It shows that the anomaly event amount does not obviously affect the violation detection result. AEC has the least drop in both metrics among the five comparison methods, and keeps high P and R scores even when AE reaches 30. This confirms the effectiveness of our method when faced with quite a few anomaly events. The fast decline in the performance of greedyC and greedynC shows again that the naive greedy-based algorithms compute less reliable solutions as the number of anomalies grows. The baseline algorithm TopK also drops with increasing AE, and it always provides poor results compared with the others. Note that the other baseline algorithm, AE, has closer gaps with the cover-based algorithms in P; however, it has poor Recall. We highlight that AE > 20 is a strict condition in real scenarios, even in 2-day time series data. Thus, the higher performance of our proposed AEC as well as the large gap between AEC and the other algorithms indicates that our method has the potential for effectiveness and robustness in solving real anomaly explanation problems.
D. Evaluation on Update Performance
We then introduce the performance of our knowledge update phase with three parameters: Constraints, AE, and the incomplete rate of explanations in the knowledge set R (inr%). We treat the performance of AEC with the original R as the baseline. We randomly select the explanation sets of some anomaly events in R, where we delete a percentage of possible representations. We denote the knowledge set after deletion as R−, and report the explanation results with R− by rRemove. We execute the update algorithms proposed in Section V on R−, denote the updated knowledge set as R+, and report the explanation results with R+ by Update.

Varying Constraints. Figure 7 shows the performance under R, R−, and R+ on the condition that Time points = 20K and inr% = 15%. rRemove has the worst scores on both P and R, and the score gap between AEC and rRemove tends to be larger with the growth of Constraints. It shows that the incomplete knowledge set would reduce the quality of anomaly explanation results. When it comes to Update with R+, it shows a significant improvement on both metrics. Our update method assists in recovering 93% of the original performance with R on average. Moreover, the update result becomes better with increasing constraint numbers. This is because the proposed Algorithm 4 effectively computes the relevance between uncovered violation features and the existing representations, which helps the recovery and update of the incomplete knowledge set, and further contributes to the improvement of explaining the anomalies.

Varying AE. Figure 8 presents the performance on the condition Constraints = 210, Time points = 20K, and inr% = 15%. The Precision scores of rRemove and Update are stable against the growth in AE, while the Recall scores have an obvious drop. However, the performance difference between rRemove and Update in P is much larger than the difference in R. While rRemove only reaches P = 0.7 on average, Update is able to provide more precise results and reaches 0.8 in P when faced with 30 anomaly events. Update recovers 94% of R on average. As for Recall, results on the three knowledge sets have similar scores; this indicates, to some degree, that our anomaly explanation method is robust in Recall against an incomplete knowledge set.

Varying the incomplete rate in R. Figure 9 shows the results on the condition Time points = 20K and Constraints = 210 with a varying incomplete rate of R. We put P = 0.89 and R = 0.88 of AEC with the original R along the X-axis, and focus on the method performance on R− and R+. It is obvious that the scores of both metrics drop with increasing inr%. rRemove presents only 0.69 in P and 0.73 in R with the 20%-incomplete knowledge set. Our update method is able to correctly recover the missing representations by computing and analyzing the uncovered violation features. Update recovers 96.8% in P and 98.81% in R of AEC with inr% = 4%, and 92.1% in P and 95.4% in R with inr% = 20%.

Though Update achieves an effective improvement in explaining anomalies with an incomplete knowledge set, it does not perfectly recover to the same level of AEC with the original set R. It is believed that the relation among anomaly representations within or between anomaly events is quite complex. When some representations are missing in the given knowledge, it is challenging for them to be automatically well-identified and well-recovered. Closing the performance gap between AEC and Update, and the gap between R and R+, will be addressed in our future work.

Fig. 7. Update performance vs. the number of constraints.
Fig. 8. Update performance vs. the number of anomaly events.
Fig. 9. Update performance vs. the incomplete rate of R.

TABLE V
TIME COSTS (Time points = 20K, inr% = 15%)
|C|   AE   AE Time   AE F1   In Round   UP Time   UP F1
 60   20   <1s       0.715     3K        4.69s    0.656
150   20   <1s       0.876     3K        8.41s    0.812
                               5K       10.21s    0.821
                     0.876     8K       16.39s    0.824
                     0.876    10K       20.16s    0.830
210   20   <2s       0.865     3K        9.49s    0.832
      30   <2s       0.82      3K        9.62s    0.790

E. Efficiency Results
We report the time costs of the methods in Table V, which presents results with several typical parameter values due to limited space. Here, In Round denotes the number of iterations of explanation and update. It is worth noting that the first two steps of our method (i.e., violation detection and anomaly explanation) spend little time (denoted as AE Time), and finish computing the anomalies within 2 seconds with 210 constraints on 20K data. The update time cost increases with the growth of Constraints and In Round, respectively. It shows that AE has a slight effect on the update time, for the reason that the limited AE in real scenarios will not lead to large computation in either the covering or the update algorithms. Though more iterations lead to time growth, the proposed method can present almost the same performance with fewer In Rounds. Such efficiency results show that our method has the potential to process large-scale IoT data.

VII. RELATED WORK
We summarize a few works related to the issues addressed in this paper on time series anomaly explanation.
Anomaly detection in temporal data. Anomaly detection (see [19] for a survey) is an important step in the time series management process [20], which aims to discover unexpected changes in patterns or data values in time series. Gupta et al. [4] summarize anomaly detection tasks on various kinds of temporal data and provide an overview of detection techniques (e.g., statistical techniques, distance-based approaches, classification-based approaches). Autoregression and moving-average window models (e.g., EWMA, ARIMA [21]) are widely used in outlier point detection [3]. On the other hand, anomalous subsequences are more challenging to detect, because abnormal behaviors within subsequences are difficult to distinguish from normal behaviors [1]. Sequence pattern discovery in time series is also continuously studied [22]-[24].
Rule-based temporal data cleaning. Data cleaning and repairing is of great importance in data preprocessing. With the rise of temporal data mining, effective cleaning of temporal data is gaining attention due to its valuable temporal information. Ihab F. Ilyas and Xu Chu give an overview of the end-to-end data cleaning process, including error detection and repair methods, in [10]. Both statistics-based [27], [28] and constraint-based [11], [29] cleaning are widely applied in temporal data quality improvement. [29] extends the idea of constraints from dependencies defined on relational databases (e.g., FD, CFD in [30]), and proposes sequential dependencies (SD) to describe the semantics of temporal data. Accordingly, speed constraints have been developed for sequential data and applied to time series cleaning solutions [11], [28]. Causality analysis tries to reason about the responsibility of a source in causing erroneous results. Systems like Scorpion [31] and DBRx [9] have been developed to compute the causality and responsibility of violations. DBRx discovers explanations of erroneous tuples with desirable properties, namely coverage, preciseness, and conciseness. Since the existing techniques mostly focus on relational data, we move a step further in anomaly explanation for temporal data in this paper. Our work can also complement the state-of-the-art data cleaning techniques.

VIII. CONCLUSION
We formalized the anomaly explanation problem in multivariate temporal data and constructed a self-contained 3-step method to solve it. We identified anomalies as violations of several types of constraints, and devised set-cover-based algorithms to reason about the anomaly events with the given knowledge set. Further, we proposed knowledge update methods to improve the knowledge quality, which in turn adds to the effectiveness of our method. Experiments on real IoT data showed that the proposed method computes high-quality explanation solutions of anomalies.
REFERENCES
[1] M. Toledano, I. Cohen, Y. Ben-Simhon, and I. Tadeski, "Real-time anomaly detection system for time series at scale," in Proceedings of the KDD Workshop on Anomaly Detection, 2017, pp. 56-65.
[2] T. Dasu, J. M. Loh, and D. Srivastava, "Empirical glitch explanations," in The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, New York, NY, USA, August 24-27, 2014, pp. 572-581.
[3] J. Takeuchi and K. Yamanishi, "A unifying framework for detecting outliers and change points from time series," IEEE Trans. Knowl. Data Eng., vol. 18, no. 4, pp. 482-492, 2006.
[4] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han, Outlier Detection for Temporal Data, ser. Synthesis Lectures on Data Mining and Knowledge Discovery. Morgan & Claypool Publishers, 2014.
[5] X. Wang and C. Wang, "Time series data cleaning: A survey," IEEE Access, vol. 8, pp. 1866-1881, 2020.
[6] T. Dasu, R. Duan, and D. Srivastava, "Data quality for temporal streams," IEEE Data Eng. Bull., vol. 39, no. 2, pp. 78-92, 2016.
[7] L. H. Chiang, E. L. Russell, and R. D. Braatz, "Fault detection and diagnosis in industrial systems," 2001.
[8] R. Fujimaki et al., "Mining abnormal patterns from heterogeneous time-series with irrelevant features for fault event detection," Stat. Anal. Data Min., vol. 2, no. 1, pp. 1-17, 2009.
[9] A. Chalamalla, I. F. Ilyas, M. Ouzzani, and P. Papotti, "Descriptive and prescriptive data cleaning," in International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014. ACM, 2014, pp. 445-456.
[10] I. F. Ilyas and X. Chu, Data Cleaning. ACM, 2019. [Online]. Available: https://doi.org/10.1145/3310205
[11] S. Song, A. Zhang, J. Wang, and P. S. Yu, "SCREEN: Stream data cleaning under speed constraints," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31-June 4, 2015, pp. 827-841.
[12] W. Yin, T. Yue, H. Wang, Y. Huang, and Y. Li, "Time series cleaning under variance constraints," in Database Systems for Advanced Applications, DASFAA 2018 International Workshops, ser. Lecture Notes in Computer Science, vol. 10829. Springer, 2018, pp. 108-113.
[13] R. M. Karp, "Reducibility among combinatorial problems," in 50 Years of Integer Programming 1958-2008: From the Early Years to the State-of-the-Art. Springer, 2010, pp. 219-241.
[14] J. E. Beasley and K. Jornsten, "Enhancing an algorithm for set covering problems," European Journal of Operational Research, vol. 58, no. 2, pp. 293-300, 1992.
[15] X. Ding, H. Wang, J. Su, Z. Li, J. Li, and H. Gao, "Cleanits: A data cleaning system for industrial time series," PVLDB, vol. 12, no. 12, pp. 1786-1789, 2019.
[16] C. Wang, X. Huang, J. Qiao, et al., "Apache IoTDB: Time-series database for Internet of Things," Proc. VLDB Endow., vol. 13, no. 12, pp. 2901-2904, 2020.
[17] R. S. Tsay, "Outliers, level shifts, and variance changes in time series," Journal of Forecasting, vol. 7, no. 1, pp. 1-20, 1988.
[18] R. S. Tsay, D. Pena, and A. E. Pankratz, "Outliers in multivariate time series," DES Working Papers, Statistics and Econometrics, WS, 1998.
[19] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv., vol. 41, no. 3, pp. 15:1-15:58, 2009.
[20] S. K. Jensen, T. B. Pedersen, and C. Thomsen, "Time series management systems: A survey," IEEE Trans. Knowl. Data Eng., vol. 29, no. 11, pp. 2581-2600, 2017.
[21] W. W. S. Wei, Time Series Analysis: Univariate and Multivariate Methods. Addison-Wesley, 1989.
[22] S. Papadimitriou, J. Sun, and C. Faloutsos, "Streaming pattern discovery in multiple time-series," in PVLDB.
[23] F. Mörchen, "Algorithms for time series knowledge mining," in Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, pp. 668-673.
[24] U. Rebbapragada, P. Protopapas, C. E. Brodley, and C. R. Alcock, "Finding anomalous periodic time series," Machine Learning, vol. 74, no. 3, pp. 281-313, 2009.
[25] S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie, "High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning,"
Pattern Recognition , vol. 58, pp. 121–134,2016.[26] H. Liu, X. Li, J. Li, and S. Zhang, “Efficient outlier detection forhigh-dimensional data,”
IEEE Trans. Systems, Man, and Cybernetics:Systems , vol. 48, no. 12, pp. 2451–2461, 2018.[27] M. Yakout, L. Berti- ´Equille, and A. K. Elmagarmid, “Don’t be scared:use scalable automatic repairing with maximal likelihood and boundedchanges,” in
Proceedings of the ACM SIGMOD International Conferenceon Management of Data, SIGMOD 2013 , pp. 553–564.[28] A. Zhang, S. Song, and J. Wang, “Sequential data cleaning: A statisticalapproach,” in
Proceedings of the International Conference on Manage-ment of Data, SIGMOD Conference , 2016, pp. 909–924.[29] L. Golab, H. J. Karloff, F. Korn, A. Saha, and D. Srivastava, “Sequentialdependencies,”
PVLDB , vol. 2, no. 1, pp. 574–585, 2009.[30] W. Fan and F. Geerts,
Foundations of Data Quality Management ,ser. Synthesis Lectures on Data Management. Morgan & ClaypoolPublishers, 2012.[31] E. Wu and S. Madden, “Scorpion: Explaining away outliers in aggregatequeries,”