Explaining Anomalies in Groups with Characterizing Subspace Rules
Meghanath Macha · Leman Akoglu
Abstract
Anomaly detection has numerous applications and has been studied vastly. We consider a complementary problem that has a much sparser literature: anomaly description. Interpretation of anomalies is crucial for practitioners for sense-making, troubleshooting, and planning actions. To this end, we present a new approach called x-PACS (for eXplaining Patterns of Anomalies with Characterizing Subspaces), which "reverse-engineers" the known anomalies by identifying (1) the groups (or patterns) that they form, and (2) the characterizing subspace and feature rules that separate each anomalous pattern from normal instances. Explaining anomalies in groups not only saves analyst time and gives insight into various types of anomalies, but also draws attention to potentially critical, repeating anomalies. In developing x-PACS, we first construct a desiderata for the anomaly description problem. From a descriptive data mining perspective, our method exhibits the five desired properties in our desiderata. Namely, it can unearth anomalous patterns (i) of multiple different types, (ii) hidden in arbitrary subspaces of a high-dimensional space, (iii) interpretable by human analysts, (iv) different from normal patterns of the data, and finally (v) succinct, providing a short data description. No existing work on anomaly description satisfies all of these properties simultaneously. Furthermore, x-PACS is highly parallelizable; it is linear in the number of data points and exponential in the (typically small) largest characterizing subspace size. The anomalous patterns that x-PACS finds constitute interpretable "signatures", and while it is not our primary goal, they can be used for anomaly detection. Through extensive experiments on real-world datasets, we show the effectiveness and superiority of x-PACS in anomaly explanation over various baselines, and demonstrate its competitive detection performance as compared to the state-of-the-art.

Meghanath Macha and Leman Akoglu
Heinz College, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213
E-mail: {meghanam, lakoglu}@andrew.cmu.edu
1 Introduction

Given a large dataset containing normal and labeled anomalous points, how can we characterize the anomalies? What combinations of features and feature values make the anomalies stand out? Are there anomalous patterns, that is, do anomalies form groups? How many different types of anomalies (or groups) are there, and how can we describe them succinctly for downstream investigation and decision-making by analysts?

Anomaly mining is important for numerous applications in security, medicine, finance, etc., for which many detection methods exist [1]. In this work, we consider a problem complementary to this vast body of work: the problem of anomaly description. Simply put, we aim to find human-interpretable explanations for already identified anomalies. Our goal is "reverse-engineering" known anomalies by unearthing their hidden characteristics, namely those that make them stand out.

The problem arises in a variety of scenarios in which we obtain labeled anomalies, albeit no description of the anomalies that could facilitate their interpretation. Example scenarios are those where
(1) the detection algorithm is a "black-box" and only provides labels, due to intellectual property or security reasons (e.g., Yelp's review filter [43]),
(2) the detection algorithm does not produce an interpretable output and/or cannot explicitly identify anomalous patterns (e.g., ensemble detectors like bagged LOF [36] or isolation forest [38]), and
(3) the anomalies are identified via external mechanisms (e.g., when software or compute-jobs on a cluster crash, loan customers default, credit card transactions get reported by card owners as fraudulent, products get reported by consumers as faulty, etc.). This setting also arises when security experts set up "honeypots" to attract malicious users and later study their operating mechanisms (often manually). Examples include fake followers of honeypot Twitter accounts [37] and fraudulent bot-accounts that click honeypot ads [12].

Explaining anomalies is extremely useful in practice, as anomalies are to be investigated by human analysts in almost all scenarios. Interpretation of the anomalies helps the analysts in sense-making and knowledge discovery, troubleshooting and decision making (e.g., planning and prioritizing actions), and building better prevention mechanisms (e.g., policy changes).

Our work taps into the gap between anomaly detection and its end usage by analysts, and introduces x-PACS for characterizing the anomalies in high-dimensional datasets. Our emphasis is explaining the anomalies in groups. We model the anomalies to consist of (i) various patterns (i.e., sets of clustered anomalies) and (ii) outliers (i.e., scattered anomalies different from the rest). For example, in fraud, malicious agents that follow similar strategies, or those who work together in "coalition", exhibit similar properties and form anomalous groups. Bots deployed for, e.g., click or email spam also tend to produce similar footprints as they follow the same source of command-and-control.
At the same time, there may be multiple groups of fraudsters or bots with different strategies. (In this text, the phrases 'anomalous pattern', 'clustered anomalies', and 'group of anomalies' are used interchangeably.)

Explaining anomalies in groups has three key advantages: (1) it saves investigation time by providing a compact explanation, rather than the analyst having to go through anomalies one by one, (2) it provides insights into the characteristics of different anomaly types, and (3) importantly, it draws attention to anomalies that form patterns, which are potentially more critical as they are repetitive.

To lay out the challenges from a data mining perspective, we first introduce a list of desired properties (Desiderata 1–5) that approaches to the problem of anomaly explanation should satisfy. We then summarize our contributions.

1.1 Desiderata for Anomaly Description

In a nutshell, anomaly explanation methods should effectively characterize different kinds of anomalies present in the data, handle high-dimensional datasets, and produce human-interpretable explanations that are distinct from normal patterns as well as succinct in length.
D1 Identifying different types of anomalies: Anomalies are generated by mechanisms other than the normal. Since such mechanisms can vary (e.g., different fraud schemes), it is likely for the anomalies to form multiple patterns in potentially different feature subspaces. A description algorithm should be able to characterize all types of anomalies.
D2 Handling high dimensionality: Data instances typically have tens or even hundreds of features. It is meaningful to assume that the anomalies in a pattern exhibit only a (small) fraction of features in common. In other words, anomalies are likely to "hide" in sparse subspaces of the full space.
D3 Interpretable descriptions: It is critical that the explanation of the anomalies can be easily understood by analysts. In other words, descriptions should convey what makes a group of instances anomalous in a human-interpretable way.
D4 Discriminative (or detection) power: Explanations of anomalies should not also be valid for normal points. In other words, descriptions should be discriminative and separate the anomalies from the normal points sufficiently well. As a result, they could also help detect future anomalies of the same type.
D5 Succinct descriptions: It is particularly important to have simple and concise representations, for ease of visualization and to avoid information overload. This follows the Occam's razor principle.

1.2 Limitations of Existing Techniques

Providing interpretable explanations for anomalies is a relatively new area of study compared to anomaly detection. However, the problem has similarities to description-based techniques for imbalanced datasets. Almost all existing work in the anomaly detection literature assumes anomalies to be scattered and tries to explain them one at a time [10,11,24,27,34,50]. Related work on collective data description [18,55], including rare class characterization [20,21], assumes a single pattern and/or does not look for subspaces. Other closely related areas are subgroup discovery and inductive rule learners. Subgroup discovery techniques [16,22,25,26,39,56] aim to describe individual classes, while inductive rule learners [9,8,13,15,19] focus on describing multiple classes with the aim of generalization rather than explanation. Another related line of work involves techniques for explaining black-box classifiers [14,28,42,51], where the emphasis is on explaining the prediction of
a single instance rather than a group of instances. Moreover, none of these works has an explicit emphasis on succinct, minimal descriptions.

To the best of our knowledge, and as we expand on in the related work in §
5, there is no existing work that provides a principled and general approach to the anomaly description problem that meets all of the goals in our desiderata adequately. (See Table 9 for an overview and comparison of related work.)

1.3 Summary of Contributions

Our work sets out to fill this gap, with the following main contributions:

– Desiderata for Anomaly Description:
We introduce a set of desiderata with five rules-of-thumb (D1–D5) for designing our approach.
– Description-in-Groups (DiG) Problem: We formulate the explanation problem as one of identifying the various groups that the anomalies form (D1) within low-dimensional subspaces (D2).
– Description Algorithm x-PACS: We introduce a new algorithm that produces interpretable rules (D3), i.e., intervals on the features in each subspace, that are also discriminative (D4), characterizing the anomalies in the group within a subspace but as few normal points as possible.
– A New Encoding Scheme:
We design a new encoding-based objective for describing the anomalies in groups, based on the minimum description length (MDL) principle [52]. Through non-monotone submodular optimization we carefully select the minimal subspace rules (D5) that require the fewest 'bits' to collectively describe all the anomalies.
– Reproducibility:
All of our code and datasets are open-sourced at https://github.com/meghanathmacha/xPACS .

In a nutshell, our goal is to identify a few, small micro-clusters of anomalies hidden in arbitrary feature subspaces that collectively and yet succinctly represent the anomalies and separate them from the normal points. Specifically, our proposed x-PACS finds a small set of low-dimensional hyper-ellipsoids (i.e., micro-clusters corresponding to anomalous patterns, each enclosing a subset of the anomalies), and reveals scattered anomalies (i.e., outliers not contained in any ellipsoid). Features that are part of the subspace in which a hyper-ellipsoid lies constitute its characterizing subspace. The ranges of values these features take are further characterized by the location (center and radii) of the hyper-ellipsoid within the subspace. Each hyper-ellipsoid is simply a "pack of anomalies" (hence the name x-PACS). The rest of the paper uses 'hyper-ellipsoid' or 'pack' in reference to anomalous patterns. (The 'x' refers to the number of packs, which we automatically identify via our data encoding scheme (§3.3); this is analogous to X-means [49], which finds the number of k-means clusters automatically in an information-theoretic way.)
(a) images 1–9: anomalies, image 10: typical normal image; (b) 2 anomalous patterns found by x-PACS; (c) description cost (in bits) vs. number of anomaly patterns
Fig. 1: (best in color) Example x-PACS input–output.

2.1 Example x-PACS Input–Output

In Fig. 1 we show an example of the input and output of x-PACS. We consider the face images dataset, in which x-PACS identifies a minimum-description packing with two anomalous patterns and an outlier. In Fig. 1a, we visualize the dataset, where pixels are the dimensions/features; it contains 9 labeled 'anomalies': [images 1–8] of 2 types (people w/ sunglasses or people w/ white t-shirt or both) + [image 9] an outlier (one person w/ beard). We also show [image 10], which is representative of 82 normal samples (people w/ black t-shirt w/out beard or sunglasses). In Fig. 1b, we display the anomalous patterns found by x-PACS (characterizing subspaces are 1-d; feature rules/intervals are shown at the bottom with arrows, the smaller the value, the darker the pixel), which together explain anomalies 1–8 succinctly; the 1-d pack (left) encloses images { }, and the 1-d pack (right) encloses images { }. The corresponding features/pixels are highlighted on the enclosed images. In Fig. 1c, we plot the description length (in bits, see §3.3); x-PACS automatically finds the best number of patterns (=2) that describe the anomalies and reveals the (unpacked) outlier [image 9].

2.2 Main Steps

Our x-PACS consists of three main steps, each aiming to meet various criteria in our desiderata (D1–D5).
1. First, we employ subspace clustering to automatically identify multiple clusters of anomalies (D1) embedded in various feature subspaces. Advantages of subspaces are two-fold: handling the "curse of dimensionality" (D2) and explaining each pattern with only a few features (D5).
2. In the second step, we represent the anomalies in each subspace cluster by an axis-aligned hyper-ellipsoid. Ellipsoids, in contrast to hyperballs, allow for varying spread of the anomalies in each dimension. Axis-alignment ensures interpretable explanation with the original features, which typically have real meaning to a user (D3). Moreover, we introduce a convex formulation to ensure that the ellipsoids are "pure" and enclose very few non-anomalous points, if at all, such that the characterization is discriminative (D4).
3. The final step is summarization, where we strive to generate minimal descriptions for ease of comprehension (D5). To decide which patterns describe the anomalies most succinctly, we introduce an encoding scheme based on the Minimum Description Length (MDL) principle [52]. Our encoding-based objective lends itself to non-monotone submodular function maximization. Using an algorithm with approximation guarantees, we identify a short list of patterns (hyper-ellipsoids) that are (i) compact, with small radii, i.e., the range of values that anomalies take per feature in the characterizing subspace is narrow; (ii) non-redundant, i.e., they "pack" (enclose) mostly different anomalies in various subspaces; and (iii) pure, i.e., they enclose either none or only a few normal points. Importantly, the necessary number of packs is automatically identified based on the MDL criterion.

Remark:
Note that while x-PACS identifies descriptive patterns of the anomalies, those can also be used for detection. Each pattern, along with its characterizing features and its enclosing boundary within that subspace, can be seen as a discriminative signature (or set of rules), and can be used to label future instances: a new instance that falls within any of the packs is labeled as anomalous. Instead of a single signature or an abstract classifier function or model, however, x-PACS identifies multiple, interpretable signatures.

2.3 Notation and Definitions

The input dataset is denoted by D = {(x_1, y_1), ..., (x_m, y_m)}, containing m points in d dimensions, where F depicts the feature set. A subset A ⊂ D of the points is labeled y_A = 'anomalous', |A| = a. The rest are y_{D\A} = 'normal' points, denoted by N, |N| = n, with a + n = m.

Our goal is to find "enclosing shapes", called packs, that collectively contain as many of the anomalies as possible. While arbitrary shapes would allow for higher flexibility, we restrict these shapes to the hyper-ellipsoid family for ease of interpretation. This is not a strong limitation, however, since anomalous patterns are expected to form compact micro-clusters in some feature subspaces, rather than lie on arbitrarily shaped manifolds. A pack is formally defined as follows.

Definition 1 (pack) A pack p_k is a hyper-ellipsoid in a feature subspace F_k ⊆ F, |F_k| = d_k, characterized by its center c_k ∈ R^{d_k} and matrix M_k ∈ R^{d_k×d_k}, where p_k(c_k, M_k) = {x | (x − c_k)^T M_k^{-1} (x − c_k) ≤ 1}. We denote the anomalies that p_k encloses by A_k ⊆ A, and the normal points that it encloses by N_k ⊂ N.

Definition 2 (packing) A packing P is a collection of packs as defined above; P = {p_1(c_1, M_1), ..., p_K(c_K, M_K)} with size K.
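To make the pack notation concrete, the following is a minimal sketch (ours, in Python; not taken from the released x-PACS code) of how a pack and the membership test of Definition 1 could be represented. The class and field names are illustrative only.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Pack:
    """A hyper-ellipsoid p_k(c_k, M_k) restricted to a feature subspace F_k."""
    features: list      # indices of the d_k features spanning the subspace F_k
    center: np.ndarray  # c_k, shape (d_k,)
    M: np.ndarray       # M_k, shape (d_k, d_k); diagonal for axis-aligned packs

    def contains(self, x: np.ndarray) -> bool:
        # A point is enclosed iff (x - c)^T M^{-1} (x - c) <= 1 within the subspace.
        z = x[self.features] - self.center
        return float(z @ np.linalg.solve(self.M, z)) <= 1.0

# A packing is simply a collection of packs; a point is "covered" if any pack encloses it.
def covered(x, packing):
    return any(p.contains(x) for p in packing)
```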
Problem 1 (Description-in-Groups, DiG) Given a dataset D ∈ R^{m×d} containing a anomalous points in A and n non-anomalous or normal points in N, with a ≪ n; Find a set of anomalous patterns (packs) P = {p_1, p_2, ..., p_K}, each containing/enclosing a subset of the anomalies A_k, where ∪_{1≤k≤K} A_k ⊆ A, such that P provides the minimum description length L(A|D, P) (in bits) for the anomalies in D. (We introduce our MDL-based encoding scheme and cost function L(·) later in §3.3.)

Note that packs may overlap, i.e., share anomalies: A_k ∩ A_l ≠ ∅ for some k, l. Packs can also share common features in their subspaces (as different types of anomalies may share some common characteristics), i.e., F_k ∩ F_l ≠ ∅ for some k, l. Moreover, the enclosing boundary of a pack may also contain some non-anomalous points. These issues related to the redundancy and purity of the packs play a key role in the "description cost" of the anomalies. When it comes to identifying a small set of packs out of a list of candidates, we formulate an encoding scheme as a guiding principle for selecting the smallest, least redundant, and purest collection of packs that yields the shortest description of all the anomalies.

3 x-PACS: Explaining Anomalies in Groups

Next we present the details of x-PACS, which consists of three building blocks:
§3.1 Subspace Clustering: Identify clusters of anomalies in various subspaces
§3.2 Refinement: Transform box-like subspace clusters into pure and compact hyper-ellipsoids (or packs)
§3.3 Summarization: Select the subset of packs that yields the minimum description length of the anomalies
We present our algorithms for each of these next.

3.1 Subspace Clustering: Finding Hyper-rectangles

In our formulation, we allow the anomalies to form multiple patterns, intuitively each containing anomalies of a different kind. We model anomalous patterns as compact "micro-clusters" in various feature subspaces.

In the first step, we use a subspace clustering algorithm, similar to CLIQUE [3] and ENCLUS [7], that discovers subspaces with high-density anomaly clusters in a bottom-up, Apriori fashion. There are two main differences that we introduce. First, while prior techniques focus on a density (minimum count or mass) criterion, we use two criteria: (i) mass and (ii) purity, in order to find clusters that respectively contain many anomalous points but also a low number of normal points. Second, we do not enforce a strict grid over the features but find varying-length high-density intervals through density estimation in a data-driven way.

Simply put, the search algorithm starts by identifying 1-dimensional intervals in each feature that meet a certain mass threshold. These intervals are then combined to generate 2-dimensional candidate rectangles. In general, k-dimensional hyper-rectangles are generated by merging (k−1)-dimensional ones; those that meet both the mass and purity criteria are reported as clusters.
Algorithm 1
SubClus(D, ms, µ)
Input: dataset D = A ∪ N ∈ R^{m×d} with labeled anomalous and normal points, mass threshold ms ∈ Z, purity threshold µ ∈ Z
Output: set of hyper-rectangles R = {R_1, R_2, ...}, each containing at least ms anomalous and at most µ normal points
1: Let R^(k) denote the set of k-dimensional hyper-rectangles. Initialize R^(1) by kernel density estimation with varying quantile thresholds q ∈ {80, 85, 90, 95}; set k = 1
2: for each hyper-rectangle R ∈ R^(k) do
3:   if mass(R) ≥ ms then
4:     if impurity(R) ≤ µ then R^(k)_pure := R^(k)_pure ∪ {R} else R^(k)_¬pure := R^(k)_¬pure ∪ {R}
5:   end if
6: end for
7: R := R ∪ R^(k)_pure
8: R^(k+1) := generateCandidates(R^(k)_pure ∪ R^(k)_¬pure)
9: if R^(k+1) = ∅ then return R
10: k := k + 1, go to step 2

A hyper-rectangle is formally defined as follows.

Definition 3 (hyper-rectangle)
Let F = f_1 × f_2 × ... × f_d be our original d-dimensional numerical feature space. A hyper-rectangle R = (s_1, s_2, ..., s_{d'}), d' ≤ d, resides in a space f_{t_1} × f_{t_2} × ... × f_{t_{d'}} where t_i < t_j if i < j, and has d' sides, s_z = [lb_z, ub_z], that correspond to individual intervals with lower and upper bounds in each dimension. A point x = ⟨x_1, x_2, ..., x_d⟩ is said to be contained or enclosed in hyper-rectangle R = (s_1, s_2, ..., s_{d'}) if lb_z ≤ x_{t_z} ≤ ub_z ∀ z ∈ {1, ..., d'}.

The outline of our subspace clustering is in Algorithm 1. It takes as input the dataset D with anomalous and normal points, a mass threshold ms equal to the minimum number of required anomalous points, and a purity threshold µ equal to the maximum number of allowed normal points to be contained inside, and returns the hyper-rectangles that meet the desired criteria.

To begin (line 1), we find 1-dimensional candidate hyper-rectangles, equivalent to intervals in individual features. To create promising candidate intervals initially, we find dense intervals with many anomalous points. To this end, we perform kernel density estimation (KDE) on the anomalous points and extract the intervals of significant peaks. This is achieved by extracting the contiguous intervals in each dimension with density larger than the q-th quantile of all estimated densities. q is varied in [80, 95] to obtain candidate intervals of varying length. An illustration is given in Fig. 2.
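As a concrete illustration of this initialization step, the sketch below (a simplified Python rendition, not the authors' implementation) extracts dense 1-d intervals for a single feature using Gaussian KDE and a quantile cut-off; the grid size and function names are our own choices.

```python
import numpy as np
from scipy.stats import gaussian_kde

def dense_intervals_1d(values, q=90, grid_size=512):
    """Return contiguous intervals where the KDE of `values` exceeds its q-th percentile."""
    kde = gaussian_kde(values, bw_method="silverman")   # Silverman's rule of thumb, as in the paper
    grid = np.linspace(values.min(), values.max(), grid_size)
    density = kde(grid)
    threshold = np.percentile(density, q)
    above = density >= threshold
    intervals, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = grid[i]
        elif not flag and start is not None:
            intervals.append((start, grid[i - 1]))
            start = None
    if start is not None:
        intervals.append((start, grid[-1]))
    return intervals

# Candidate 1-d hyper-rectangles: one interval set per feature and per quantile threshold q.
# anomalies = np.random.rand(50, 3)   # toy stand-in for the anomalous points
# candidates = {(f, q): dense_intervals_1d(anomalies[:, f], q)
#               for f in range(anomalies.shape[1]) for q in (80, 85, 90, 95)}
```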
Since multiple peaks may exist, multiple intervals can be generated per dimension as q is varied. (KDE involves two parameters: the number of points sampled to construct the smooth curve and the kernel bandwidth. We set the sample size to 512 points and use Silverman's rule of thumb [54] to set the bandwidth. For categorical features, we would instead use histogram density estimation.)

Fig. 2: Identifying candidate hyper-rectangles in 1-d (equivalent to intervals) by KDE for varying quantile thresholds q.

At any given level (or iteration) of the Apriori-like SubClus algorithm, we scan all the candidates at that level (lines 2–6) and filter out the ones that meet the mass criterion (line 3). Those that pass the filter are later merged to form candidates for the next level. Others with mass less than required are discarded, with no implications on accuracy. The correctness of the pruning procedure follows from the downward closure property of the mass criterion: for any k-dimensional hyper-rectangle with mass ≥ ms, its projections onto any of the (k−1)-dimensional subspaces also have mass ≥ ms.

At each level, we also keep track of the hyper-rectangles that meet both the mass and the purity criteria (line 4). Purity exhibits the upward closure property: for any (k−1)-dimensional hyper-rectangle that is pure (i.e., contains ≤ µ normal points), any k-dimensional hyper-rectangle that subsumes it is also pure. This property could help us stop growing pure candidates by excluding them from the candidate generation step and speeding up termination. While correct, however, such early termination would prevent us from finding even purer hyper-rectangles later up in the hierarchy. To obtain as many candidate packs as possible, we continue our search over all hyper-rectangles that meet the mass criterion, and use the purity criterion only for selecting the ones to be output (line 7).

The algorithm proceeds level by level. Having identified the k-dimensional hyper-rectangles that satisfy the mass criterion, denoted R^(k)_{≥ms} = R^(k)_pure ∪ R^(k)_¬pure (respectively, the pure and not-pure sets), the (k+1)-dimensional candidates are generated (line 8) in two steps: join and prune. The join step combines hyper-rectangles that share their first (k−1)
dimensions as well as the sides in those dimensions. That is, if (s_{u_1}, s_{u_2}, ..., s_{u_k}) and (s_{v_1}, s_{v_2}, ..., s_{v_k}) are two k-dimensional hyper-rectangles in R^(k)_{≥ms}, we require u_i = v_i and s_{u_i} = s_{v_i} ∀ i ∈ {1, ..., (k−1)} and u_k < v_k to form the candidate (k+1)-dimensional hyper-rectangle (s_{u_1}, s_{u_2}, ..., s_{u_k}, s_{v_k}). The prune step discards all (k+1)-dimensional candidates that have a k-dimensional projection outside R^(k)_{≥ms}. Again, the correctness of this procedure follows from the downward closure property of mass.

Choice of (ms, µ): To obtain hyper-rectangles of varying size and quality, packing potentially different anomalies (and non-anomalies), we run Algorithm 1 with "conservative" parameters, i.e., a low ms and a high µ. Specifically, to generate a good volume of candidates, we set ms and µ to the medians of the number of anomalous points and normal points, respectively, across the 1-dimensional hyper-rectangles. Setting a higher ms and a lower µ would prune more (and potentially undesirably many) candidates in exchange for reduced running time. We use the medians to strike a balance between quality and running time. As we describe later in §3.3, our summarization step ultimately selects only a small subset of these candidates.
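The following sketch (illustrative Python with hypothetical helper names, not the released implementation) shows the Apriori-style join and prune described above, where each hyper-rectangle is represented as a tuple of (feature index, interval) pairs sorted by feature index.

```python
from itertools import combinations

def generate_candidates(rects_k):
    """Join k-dim hyper-rectangles sharing their first k-1 (feature, interval) sides,
    then prune candidates that have any k-dim projection missing from rects_k."""
    rect_set = set(rects_k)
    candidates = set()
    for a, b in combinations(sorted(rects_k), 2):
        if a[:-1] == b[:-1] and a[-1][0] < b[-1][0]:       # same prefix, one new feature
            cand = a + (b[-1],)
            # downward closure: every k-dim projection must itself satisfy the mass criterion
            if all(tuple(s for j, s in enumerate(cand) if j != drop) in rect_set
                   for drop in range(len(cand))):
                candidates.add(cand)
    return candidates

def mass(rect, anomalies):
    """Number of anomalous points enclosed by the hyper-rectangle."""
    return sum(all(lb <= x[f] <= ub for f, (lb, ub) in rect) for x in anomalies)

def impurity(rect, normals):
    """Number of normal points enclosed by the hyper-rectangle."""
    return sum(all(lb <= x[f] <= ub for f, (lb, ub) in rect) for x in normals)
```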
3.2 Refinement: From Hyper-rectangles to Hyper-ellipsoids

In the second step, we refine each hyper-rectangle R output by SubClus into a hyper-ellipsoid (which we call a pack, recall Definition 1). An ellipsoid with center c is written as p(c, M) = {x | (x − c)^T M^{-1} (x − c) ≤ 1} for a positive semi-definite matrix M. For a given R, let us denote the anomalous points it contains by x_i ∈ A for i = 1, ..., a_R (see Definition 3), the anomalous points outside R by x_j ∈ A for j = a_R+1, ..., a, and the normal points by x_l ∈ N for l = 1, ..., n.

When we convert a given R to an ellipsoid, we would like all x_i's (anomalous points) it already contains to reside inside the ellipsoid. In contrast, we would like all x_l's (normal points) to remain outside the ellipsoid. The refinement is achieved by also enclosing as many of the other anomalous points (x_j's) in the vicinity of R inside the ellipsoid. Those would be the points that were left out due to the axis-aligned, interval-based box shapes that hyper-rectangles are limited to capture. An illustration is given in Fig. 3.

First we describe our approach for the x_i's and x_l's, the positive and negative points that we respectively aim to include and exclude. The goal is to find a discriminating function h(·) where h(x_i) > 0 and h(x_l) < 0. To this end, we use the quadratic function h(x) = x^T U x + w^T x + w_0, with parameters Θ = {U, w, w_0}. We solve for Θ by setting up an optimization problem based on a semi-definite program (SDP) that satisfies x_i^T U x_i + w^T x_i + w_0 > 0 ∀i and x_l^T U x_l + w^T x_l + w_0 < 0 ∀l. Most SDP solvers do not work well with strict inequalities, thus we modify it to a non-strict feasibility problem by adding a margin, and solve (for each R):

  min_{U, w, w_0}  Σ_i ε_i + λ Σ_l ε_l
  s.t.  x_i^T U x_i + w^T x_i + w_0 ≥ 1 − ε_i ,  i = 1, ..., a_R
        x_l^T U x_l + w^T x_l + w_0 ≤ −1 + ε_l ,  l = 1, ..., n
        U ⪯ −I,  ε_i ≥ 0,  ε_l ≥ 0

Fig. 3: Example illustration of refining hyper-rectangles to ellipsoids in 2-d. Anomalous points (black) captured by SubClus (Alg. 1) in a (green) rectangle, other anomalous points (blue) in the vicinity, and normal points (red).

Here U is a negative semi-definite matrix. We can show that (U, w, w_0) define an ellipsoidal enclosing boundary, wrapping the x_i's inside and leaving the x_l's outside, for which we allow some slack ε. λ accounts for the imbalance between the number of positive and negative samples. The optimization problem is convex, which we solve using an efficient off-the-shelf solver, where each hyper-rectangle output by SubClus can be processed independently.

Having set up our refinement step as a convex quadratic discrimination problem, we next describe how we incorporate the x_j's (anomalous points outside R) into the optimization. Intuitively, we would like to include as many other anomalies as possible inside the ellipsoid, but only those that are nearby the x_i's and not necessarily those that are far away. In other words, we only want to "recover" the x_j's surrounding a given R and not grow the ellipsoid to include far-away x_j's to the extent that it would end up including many normal points as well.

To this end, we treat the x_j's similarly to the x_i's but incur a lower penalty for excluding an x_j than for excluding an x_i or including an x_l. The optimization is re-written as

  min_{U, w, w_0}  Σ_i ε_i + α Σ_j ε_j + λ Σ_l ε_l
  s.t.  x_i^T U x_i + w^T x_i + w_0 ≥ 1 − ε_i ,  i = 1, ..., a_R
        x_j^T U x_j + w^T x_j + w_0 ≥ 1 − ε_j ,  j = a_R+1, ..., a
        x_l^T U x_l + w^T x_l + w_0 ≤ −1 + ε_l ,  l = 1, ..., n
        U ⪯ −I,  ε_i ≥ 0,  ε_j ≥ 0,  ε_l ≥ 0

Setting α (the penalty constant for the x_j's) smaller than both 1 and λ is likely a good choice. However, we do not know which (α, λ) pair would provide a good trade-off in general. Therefore, we sweep over a grid of possible values and generate various ellipsoids, as illustrated for the example case in Fig. 3. A last but important step is to sweep over this collection and discard dominated packs. Specifically, we output only the set of p's on the Pareto frontier w.r.t. mass versus purity. In this set there are no two packs where one strictly dominates the other, by enclosing both a higher number of anomalous points (higher mass) and a lower number of normal points (higher purity).

We refine a hyper-rectangle R = (s_1, s_2, ..., s_{d'}) into an ellipsoid within the same subspace; in other words, U ∈ R^{d'×d'} and w ∈ R^{d'}. For interpretability, we constrain U to be diagonal to obtain axis-aligned ellipsoids as shown in Fig. 3, since the original features have meaning to the user.
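Below is a minimal sketch of the axis-aligned (diagonal-U) special case of this program, written with the cvxpy modeling library (our choice; the paper does not prescribe a solver). With a diagonal U the semi-definite constraint U ⪯ −I reduces to the elementwise constraint u_z ≤ −1, so the problem can be handled by standard convex solvers. Variable and function names mirror the notation above and are otherwise ours.

```python
import cvxpy as cp
import numpy as np

def refine_to_ellipsoid(X_in, X_out_anom, X_norm, alpha=0.1, lam=1.0):
    """Fit h(x) = x^T diag(u) x + w^T x + w0 so that anomalies inside the rectangle
    (X_in) and nearby anomalies (X_out_anom) satisfy h(x) >= 1 - eps, while normal
    points (X_norm) satisfy h(x) <= -1 + eps, with slack penalties 1, alpha, lambda."""
    d = X_in.shape[1]
    u, w, w0 = cp.Variable(d), cp.Variable(d), cp.Variable()
    e_i = cp.Variable(X_in.shape[0], nonneg=True)
    e_j = cp.Variable(X_out_anom.shape[0], nonneg=True)
    e_l = cp.Variable(X_norm.shape[0], nonneg=True)

    h = lambda X: (X ** 2) @ u + X @ w + w0    # h(x) per row; affine in (u, w, w0)
    constraints = [
        h(X_in) >= 1 - e_i,
        h(X_out_anom) >= 1 - e_j,
        h(X_norm) <= -1 + e_l,
        u <= -1,                               # diagonal analogue of U <= -I
    ]
    objective = cp.Minimize(cp.sum(e_i) + alpha * cp.sum(e_j) + lam * cp.sum(e_l))
    cp.Problem(objective, constraints).solve()
    return u.value, w.value, w0.value
```

Each candidate hyper-rectangle can be refined independently (and hence in parallel), once for every (α, λ) pair in the grid.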
Our explanation consists of one rule on each feature in the subspace. A feature rule is a ±radius interval around the ellipsoid's center. Formally:

Definition 4 (Feature rules) Given an axis-aligned ellipsoid p(c, M) in a subspace f_{t_1} × ... × f_{t_{d'}}, a rule on feature t_z is the interval (c[z] − radius_z, c[z] + radius_z), where radius_z = √M_zz, ∀ z ∈ {1, ..., d'}. The conjunction of all d' feature rules constitutes the signature of p.

To wrap up, we show how to compute c and M^{-1} from (U, w, w_0) to obtain the center and radii of an ellipsoid, using which we generate the feature rules.

Obtaining c: At the boundary of the ellipsoid h(x) = 0, inside it h(x) > 0, and at the center h(x) is maximal. Hence,

  c := argmax_x  x^T U x + w^T x + w_0 = −(1/2) U^{-1} w .    (1)

Obtaining M^{-1}: For points inside the ellipsoid,

  x^T (−U) x − w^T x − w_0 < 0
  (x − c)^T (−U) (x − c) + c^T U c − w_0 < 0
  (x − c)^T [ −U / (w_0 − c^T U c) ] (x − c) < 1
  ⇒ M^{-1} = −U / (w_0 − c^T U c) .    (2)

Obtaining the radii: (x − c)^T M^{-1} (x − c) = Σ_{z=1}^{d'} (x[z] − c[z])^2 (M^{-1})_zz ≤ 1 for a diagonal M. To compute the radius in dimension z, we find the point x where x[z'] = c[z'] ∀ z' ≠ z and (x[z] − c[z])^2 / M_zz = 1. It is easy to see that radius_z = |x[z] − c[z]| = √M_zz.

(We sweep α and λ over a fixed grid of values. If the anomalous patterns are to be used for detection, we estimate a full U matrix, i.e., a possibly rotated ellipsoid.)
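Given the fitted parameters, the conversion to a center, radii, and feature rules follows Eqs. (1)–(2) directly. The short sketch below (ours, for the diagonal case) turns the (u, w, w0) returned by the previous snippet into the ±radius intervals of Definition 4; the function name and dictionary output format are illustrative.

```python
import numpy as np

def ellipsoid_rules(u, w, w0, features):
    """Convert h(x) = x^T diag(u) x + w^T x + w0 (with u < 0) into feature rules."""
    u, w = np.asarray(u), np.asarray(w)
    center = -w / (2.0 * u)                  # Eq. (1): c = -(1/2) U^{-1} w, diagonal U
    scale = w0 - np.sum(u * center ** 2)     # w0 - c^T U c (> 0 for a non-degenerate pack)
    radii = np.sqrt(scale / (-u))            # radius_z = sqrt(M_zz), with M^{-1} = -U / scale
    return {f: (c - r, c + r) for f, c, r in zip(features, center, radii)}

# Example: rules = ellipsoid_rules(u_val, w_val, w0_val, features=[3, 17])
# yields one interval per characterizing feature, i.e., the pack's signature.
```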
3.3 Summarization

In the final step, we select from among the candidate packs produced in §3.2 the subset that describes the anomalies with the fewest bits.

MDL formulation for encoding a given packing

Our encoding scheme involves a Sender (us) and a Receiver (remote). We assume both of them have access to the dataset
D ∈ R^{m×d}, but only the Sender knows the set of anomalous points A. The goal of the Sender is to transmit (over a channel) to the Receiver the information about which points are the anomalies, using as few bits as possible. Naïvely encoding all feature values of every anomalous point individually would cost |A|·d·log f bits, where the value of f is chosen according to the required floating-point precision in the normalized feature space R^d. The idea is that by encoding the enclosing boundaries of the packs (ellipsoids) found in §3.2, we describe the anomalies in groups, which could save bits.

Obviously we would want to avoid "noisy" packs that include many normal points; that would necessitate spending extra bits on encoding those exceptions (i.e., "telling" the Receiver which points in a pack are not anomalies). Moreover, we would want to avoid using packs that encode largely overlapping groups of anomalies, as bits would be wasted on redundancy. While identifying the packing that yields the fewest bits is the main problem, we first lay out our description length objective for a given packing P = {p_1(c_1, M_1), ..., p_K(c_K, M_K)}:

– Transmit the number of packs: log*(K)
– For each pack p_k ∈ P:
  – Transmit the number of dimensions: log*(d_k), d_k ≤ d
  – Transmit the identity of the dimensions: log C(d, d_k)
  – Transmit the center c_k: d_k log f
  – Transmit M_k: d_k^2 log f (d_k log f if diagonal)
  – Transmit the exceptions (i.e., non-anomalies in p_k):
    • the number of normal points in p_k: log*(n_k)
    • the identity of the normal points, by forming all possible subsets of size n_k out of the m_k points in p_k (the total number of points in p_k): log C(m_k, n_k), based on a canonical ordering of subsets where points are ordered by distance to the center

(The cost of encoding an arbitrary integer K is L_N(K) = log*(K) + log(c), where c ≈ 2.865 and log*(K) = log(K) + log(log(K)) + ..., summing only the positive terms [52]. We drop log(c) as it is constant across all packings. An alternative way to identify the normal points in a pack is to sort the points by their distance to the center and send the indices of the normal points in this list of length m_k; this costs n_k log m_k ≥ log(m_k^{n_k} / n_k!) > log C(m_k, n_k) bits, hence is more expensive.)

The total cost of encoding with packing P is then

  ℓ(P) = log*(K) + Σ_{k=1}^{K} L(p_k) ,   where    (3)

  L(p_k) = log*(d_k) + log C(d, d_k) + d_k (d_k + 1) log f + log*(n_k) + log C(m_k, n_k) .    (4)
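To make the encoding concrete, here is a small sketch (ours; base-2 logarithms, with log_f denoting log f for the chosen precision) that computes the per-pack cost L(p_k) of Eq. (4) and the total packing cost ℓ(P) of Eq. (3); the dictionary keys for the pack statistics are our own convention.

```python
from math import comb, log2

def log_star(n):
    """Universal code length for an integer: log2(n) + log2(log2(n)) + ... (positive terms only)."""
    total, x = 0.0, float(n)
    while x > 1.0:
        x = log2(x)
        if x > 0:
            total += x
    return total

def pack_cost(d, d_k, n_k, m_k, log_f):
    """L(p_k), Eq. (4): subspace size, identity of dims, center and full M_k, and exceptions."""
    return (log_star(d_k) + log2(comb(d, d_k))
            + d_k * (d_k + 1) * log_f            # d_k log f for c_k plus d_k^2 log f for M_k
            + log_star(n_k) + log2(comb(m_k, n_k)))

def packing_cost(packs, d, log_f):
    """l(P), Eq. (3): number of packs plus the per-pack costs."""
    return log_star(len(packs)) + sum(
        pack_cost(d, p["d_k"], p["n_k"], p["m_k"], log_f) for p in packs)
```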
MDL objective function

Our objective is to find a packing, that is, to identify a subset of packs, which provides the minimum encoding length. However, we do not assume that all anomalies will be covered by a packing, i.e., ∪_k A_k ⊆ A, as there can be anomalous points (outliers) that do not belong to any pattern but lie away from the others. The outliers A \ {∪_k A_k} are then encoded individually.

The description length of all anomalies A with packing P is

  L(A | D, P) = ( |A| − |∪_{p∈P} A_p| ) · d · log f + [ log*(|P|) + Σ_{p∈P} L(p) ]

where the second term [in brackets] is ℓ(P), the cost of transmitting P (and the anomalies covered by it) by Eq. (3), and the first term is the cost of individually encoding the remaining anomalies not covered by P.

Notice that the objective of finding a subset S that minimizes the description length is equivalent to selecting a packing that reduces the naïve encoding cost of |A|·d·log f the most, i.e.:

  max_S  R_ℓ(S) = |∪_{p∈S} A_p| · c_u − log*(|S|) − Σ_{p∈S} L(p) + [ log*(|E|) + Σ_{p'∈E} L(p') ]    (5)

where c_u = d log f is the constant unit cost of encoding a point, and the set E denotes all the ellipsoids returned from the second step (refinement), such that S ⊆ E. The first three terms of the objective capture the overall reduction in encoding cost due to the packing with the ellipsoids in S. We can read it as aiming to find a packing that covers as many anomalies as possible (expressive), while having small model cost (low complexity), containing only a few packs in low dimensions. The constant term [in brackets] ensures that R_ℓ(S) is a non-negative function.

Subset selection algorithm for MDL packing
To devise a subset selection algorithm, we start by studying the properties of our objective function R_ℓ, such as submodularity and monotonicity, which could enable us to use fast heuristics with approximation guarantees. Unfortunately, R_ℓ is not submodular as given in Eq. (5). However, with a slight modification where we fix the solution size (number of output packs) to |S| = K, so that the second term becomes the constant log*(K), the function becomes submodular, as we show below.

Theorem 1
Our cardinality-constrained objective set function R'_ℓ(S) is submodular. That is, for all subsets S ⊆ T ⊆ E and packs p ∈ E \ T, it holds that

  R'_ℓ(S ∪ {p}) − R'_ℓ(S) ≥ R'_ℓ(T ∪ {p}) − R'_ℓ(T).

Proof
Let Cover(S) = |∪_{p∈S} A_p| denote the number of anomalies contained in the union of the packs in S. Canceling the equivalent terms and constants on each side of the inequality, we are left with Cover(S ∪ {p}) − Cover(S) ≥ Cover(T ∪ {p}) − Cover(T). The inequality follows from the submodularity of the Cover function. ∎
It is also easy to see that R'_ℓ is not monotone.

Theorem 2
Our modified objective set function R'_ℓ(S) is non-monotone. That is, there exist S ⊆ T for which R'_ℓ(T) < R'_ℓ(S).

Proof For S ⊆ T, Cover(T) ≥ Cover(S) due to the monotonicity of the Cover function. On the other hand, the description cost of the packs in T is Σ_{p∈T} L(p) = Σ_{p'∈S} L(p') + Σ_{p''∈T\S} L(p''), and hence is strictly greater than that of S. As such, for two packings S ⊂ T with the same coverage, we have R'_ℓ(T) < R'_ℓ(S). ∎

Maximizing a submodular function is NP-hard, as it captures problems such as Max-Cut and Max k-Cover [17]. Nevertheless, the structure of submodular functions makes it possible to achieve non-trivial results. In particular, there exist approximation algorithms for non-monotone submodular functions that are non-negative, like our objective function R'_ℓ.
One can achieve an approximation factor of 0.41 for the maximization of any non-negative non-monotone submodular function without constraints [17].

In our case, we need to solve our objective under a cardinality (i.e., subset size) constraint, where |S| is fixed to some K (since only then is R_ℓ submodular). To this end, we use the Random-Greedy algorithm by Buchbinder et al. [6], which provides the best known guarantee for the cardinality-constrained setting, with approximation factors ranging from 1/e up to 1/2 − o(1). The algorithm is quite simple; in each of K iterations, it computes the marginal gain of adding a single pack p ∈ E \ S to S and selects one among the top-K highest-gain packs uniformly at random.

Choice of K: We identify K, the number of packs used to describe the anomalies, automatically, since the best value is unknown a priori. Concretely, we solve for a subset S*_K for each fixed K = |S*_K| = 1, 2, ..., a, and return the solution with the largest objective value R_ℓ(S*_K) = R'_ℓ(S*_K) − log*(K) in Eq. (5). This is analogous to model selection with regularization for increasing model size. (Intuitively, R_ℓ drops when we add to S a new pack, with positive cost, that does not cover any new anomalies.)

3.4 Overall Algorithm x-PACS

Algorithm 2 puts together the three components of x-PACS as described through §3.1–§3.3.

Algorithm 2 x-PACS(A ∪ N): Explaining Anomalous Patterns
Input: dataset D = A ∪ N with labeled anomalies
Output: set of anomalous patterns (represented as hyper-ellipsoids) P = {p_1(c_1, M_1), ..., p_K(c_K, M_K)}
1: Set of hyper-rectangles R = ∅
2: Obtain R^(1) (1-d intervals) by kernel density estimation, varying the cut-off threshold q ∈ {80, 85, 90, 95}
3: f̂_a := distribution of the number of anomalies across R^(1)
4: f̂_n := distribution of the number of normal points across R^(1)
5: R := SubClus(D, ms = median(f̂_a), µ = median(f̂_n))   (§3.1)
6: E = ∅
7: for R ∈ R do
8:   E_R = ∅
9:   for each α in the penalty grid do
10:     for each λ in the penalty grid do
11:       E_R := E_R ∪ solve the optimization problem in §3.2 for (R, α, λ)
12:     end for
13:   end for
14:   E := E ∪ ParetoFrontier(E_R)
15: end for
16: for K = 1, ..., |A|: select a subset S*_K ⊂ E of K packs using the cardinality-constrained Random-Greedy algorithm by Buchbinder et al. [6] to optimize the description-length-reduction objective R_ℓ(·) in §3.3
17: return P := argmax_{S*_K} R'_ℓ(S*_K) − log*(K)
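A compact sketch of the Random-Greedy selection used in line 16 is given below, under the assumption that each candidate pack carries the set of anomaly ids it covers and its cost L(p) in bits (both computed as above); it is our own illustrative rendition of Buchbinder et al.'s procedure, not the released code.

```python
import heapq
import random

def random_greedy(candidates, K, unit_cost):
    """Pick K packs maximizing coverage*unit_cost - sum of pack costs (the fixed-K objective R'_l).
    candidates: list of (anomaly_id_set, cost_in_bits) pairs; unit_cost: c_u = d*log(f)."""
    selected, covered = [], set()
    remaining = list(range(len(candidates)))
    for _ in range(K):
        # marginal gain of each remaining pack given the anomalies already covered
        gains = [(len(candidates[i][0] - covered) * unit_cost - candidates[i][1], i)
                 for i in remaining]
        top = heapq.nlargest(K, gains)     # the K highest-gain packs
        _, pick = random.choice(top)       # choose one uniformly at random among them
        selected.append(pick)
        covered |= candidates[pick][0]
        remaining.remove(pick)
    return selected, covered

# Sweeping K = 1, ..., |A| and keeping the K whose solution maximizes R_l = R'_l - log*(K)
# reproduces the model-selection step described above.
```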
Computational complexity: We analyze the complexity of each part separately. The main computation of §3.1 is the SubClus algorithm. The preliminary KDE used to create the 1-d intervals is done independently per dimension, in parallel, and only on the anomalous points. We use a constant number of sampling locations; as such, the KDE complexity is O(a), where a is the number of anomalies. SubClus then proceeds level by level and makes as many passes over the data as the number of levels. For a d'-dimensional hyper-rectangle that meets the mass and purity criteria, all of its 2^{d'} projections onto any subset of the dimensions also meet the mass criterion (although they may not be pure). As such, the running time of SubClus is exponential in the highest dimensionality of a hyper-rectangle that meets both criteria. The total time complexity of this step is O(c^{d_max} + m·d_max) for a constant c that accounts for the possibly multiple d_max-dimensional hyper-rectangles and the smaller ones; the second term captures the passes over the data across the d_max levels. (For instance, if we have t d_max-dimensional hyper-rectangles, the complexity would be O(t·2^{d_max} + m·d_max), which we can rewrite as O(c^{d_max} + m·d_max).)

The main computation of §3.2 is solving the convex program per hyper-rectangle, whose per-iteration cost is polynomial in (d_max + m) for an axis-aligned ellipsoid (i.e., diagonal U); in practice, the solver converges in 20–100 iterations. To speed this up, we filter out the bulk of the points beyond a certain distance of a given hyper-rectangle, since its refined hyper-ellipsoid mostly includes/excludes points inside and nearby it. Filtering takes O(m), after which we solve the SDP on a near-constant number of points. It is easy to show that finding the Pareto-frontier set of non-dominated packs (line 14), such that no pack with strictly larger mass and smaller impurity exists, can be done in two passes over all the alternative hyper-ellipsoids generated for the different (α, λ). This procedure does not change the overall complexity but is likely to yield a much smaller set of ellipsoids per rectangle. We refine each hyper-rectangle independently, in parallel.

The main computation in the last part is the
Random-Greedy algorithm, which makes K iterations for a given number of packs K. In each iteration, it makes a pass over the not-yet-selected hyper-ellipsoids, computes the marginal reduction in bits from selecting each, and picks randomly among the top K with the highest reduction. We use a size-K min-heap to maintain the top K as we make a pass over the packs. The worst-case cost is O(|E| log K), multiplied by K iterations. We run Random-Greedy for K = 1, ..., a, each of which is parallelized. The total complexity of §3.3 is O(|E| a log a).

The number of ellipsoids |E| is of the same order as the number of hyper-rectangles from §3.1, i.e., O(c^{d_max}). Thus, the overall complexity can be written as O(m·d_max + c^{d_max} a log a): linear in the number of data points m, near-linear in a, and exponential in the largest pack dimensionality d_max.

4 Experiments

Through experiments on real-world datasets we answer the following questions. A quick reference to the UCI datasets used in our experiments is in Table 1; the last column gives the % savings (in bits) achieved by x-PACS in describing/encoding the anomalies.

Q1.
Effectiveness:
How accurate, interpretable, and succinct are our explanations? How do they compare to descriptions by Decision Trees?

Q2.
Detection performance:
Do our explanations generalize? Can they be used as signatures to detect future anomalies? To this end, we compare x-PACS to 7 different baselines.

Q3.
Scalability:
How does x-PACS's running time scale in terms of data size and dimensionality?

4.1 Effectiveness of Explanations

Our primary focus is anomaly description, where we unearth interpretable characteristics of known anomalies. To this end, we present 6 case studies with ground truth, followed by a quantitative comparison to decision trees.
Our Image dataset contains gray-scale headshot images of various people. We designate the majority, who wear dark-colored t-shirts, as the normal samples. We create 3 versions containing different numbers of anomalous patterns, as we describe below. We compare x-PACS's findings to the ground truth.
Case I:
ImagesI
We label 8 images of people wearing sunglasses as anomalies, as shown in (a) below, and combine them with the normal samples, none of which has sunglasses. In this simple scenario x-PACS successfully identifies a single, 1-d pattern shown in (b), which packs all 8 anomalies but no normal samples. Also shown at the bottom of (b) is the interval of values, that is, the ±radius range around the pack's center, for the corresponding dimension (the lower the value, the darker the pixel).

(a) anomalies  (b) x-PACS packing

Table 1: Dataset statistics. x-PACS achieves significant savings (in bits) by explaining anomalies in groups.

  Name        size m   dim. d   anom. a   %-savings
  ImagesI         88      120         8       99.75
  ImagesII        91      180         9       88.53
  ImagesIII      110      180        12       99.51
  DigitI
  DigitII
  BrCancer       683        9       239       93.74
  Arrythmia      332      172        87       92.92
  Wine            95       13        24       97.04
  Yeast          592        8       129       98.04
Case II:
ImagesII
Next, we construct the dataset with the 9 anomalies shown earlier in §2.1. x-PACS finds 2 pure packs, each 1-d, that collectively describe the first 8 anomalies and none of the normal samples. The bearded image does not belong to any pack and is left out as an outlier.

Case III:
ImagesIII
We construct the third dataset with 12 anomalies: the same 9 from
ImagesII plus 3 faces (10–12) with beards, as shown below. In this case, x-PACS finds that characterizing the bearded images as a separate pattern is best for reducing the description cost, and outputs the 3 pure, 1-d packs shown in (b).

(a) +3 anomalies  (b) x-PACS packing
In all scenarios, x-PACS is able to unearth simple (low-dimensional) and pure (discriminative) characteristics of the anomalies. Also, it automatically identifies the correct number of anomalous patterns that yield the shortest data description, as shown in Fig. 4.
Fig. 4: x-PACS's description cost (in bits) of the anomalies in the image datasets (Images I, II, III) for K = 1, 2, ...; the naïve encoding cost (K = 0) is shown with a horizontal line per dataset. x-PACS finds the appropriate number of patterns automatically and significantly reduces the description cost.

Next we study a different domain. The Digit dataset contains instances of digit hand-drawings in time. The features are the x and y coordinates of the hand at 8 consecutive time ticks during which a human draws each digit on paper. As such, each drawing has 16 features.

Case IV:
DigitI
We designate all drawings of digit '0' as normal and a sample of digit '7' drawings as 'anomalous', to study the characteristics of drawing a '7' as compared to a '0'. The 8 positions of the hand in time, averaged over all corresponding samples of these two digits, are shown below (a–b).

(a) avg. '0'-drawing  (b) avg. '7'-drawing  (c) x-PACS packing

x-PACS identifies a single, 2-d pack containing all 228 instances of '7' and no '0's, as given in Table 2, where we list the ellipsoid center and the ±radius interval where the hand is positioned for the characterizing features. The anomalous pattern suggests right and bottom positioning of the hand at times t3 and t6, respectively, which follows human intuition; in contrast, the typical hand positions for '0' at those ticks are the opposite, at the left and top. The corresponding avg. hand positions are marked in (c).

Table 2: DigitI '0' vs. '7': x-PACS finds one 2-d pack.

  packID   feature   center   interval   |A_k|   |N_k|
  k = 1    x @ t3                          228       0
           y @ t6

Case V:
DigitII
We perform a second case study where we designate digit '8' drawings as normal and '2' and '3' drawings as the anomalies. Avg. drawings are illustrated in (a–b) below. x-PACS is able to describe 210 of the 211 anomalies with a single, 4-d pack, listed in Table 3 and illustrated in (c). The single unpacked drawing is shown in (d) and looks like an odd '3'.

(a) avg. '8'  (b) avg. '2','3'  (c) x-PACS packing  (d) outlier

Table 3: DigitII '8' vs. '2','3': x-PACS's one 4-d pack.

  packID   feature   center   interval   |A_k|   |N_k|
  k = 1    y @ t                           210
           y @ t
           y @ t
           y @ t

Looking at the avg. '8' vs. '2' or '3' drawings above, it appears that a single feature like y @ t8, i.e., the vertical hand position at the end, should be discriminative alone, as '8' tends to end at the top vs. the others at the bottom. Interestingly, none of the 1-d packs on y @ t8 is pure, since some '8' drawings also end at the bottom just like most '2's and '3's.

Case VI:
BrCancer
Finally, the breast cancer dataset contains 239 malign (anomalous) and 444 benign cancer instances. x-PACS finds the 5 packs listed in Table 4, covering a total of 226 anomalies while also including 17 unique normal points in the packing. Pack 2 suggests large 'clumpthickness' and 'mitoses' (related to cell division and tissue growth) for 145 cases. The smaller pure 1-d packs, 4 and 5, indicate very large 'cellsize' and 'nucleoili'. These findings are intuitive even to non-experts like us (although we lack the domain expertise to interpret every pack).

Table 4: BrCancer: x-PACS finds five 1-d or 2-d packs.

  packID   feature          center   interval       |A_k|   |N_k|
  k = 1    chromatin          0.76   (0.63, 0.88)     162      11
  k = 2    clumpthickness     0.94   (0.84, 1.00)     145       5
           mitoses            0.28   (0.00, 0.63)
  k = 3    epicellsize        0.33   (0.24, 0.42)      97       2
           barenuclei         0.11   (0.09, 0.14)
  k = 4    nucleoili          0.98   (0.93, 1.00)      75       0
  k = 5    cellsize           0.99   (0.98, 1.00)      67       0

Table 5: Interpretability measures (a)–(d): x-PACS vs. rule learners. Also given for reference is the detection performance in AUPRC (see §4.2).

  Model     (a)      (b)      (c)      (d)      AUPRC
  DT-5      4.0000   2.9889   0.0233   0.4769   0.6252
  DT-4      3.7778   2.7856   0.0422   0.4801   0.6070
  DT-3      3.0000   2.4078   0.0700   0.4812   0.6210
  DT-2      2.4444   1.8889   0.1378   0.4872   0.6236
  DT-1
  Ripper
  RuleFit
  x-PACS

x-PACS vs. Rule Learners

Since in our work we view the anomalies as an already defined class, explaining anomalies is equivalent to describing an under-represented target class [57]. Hence, we compare x-PACS to techniques that explain labeled data. To this end, we consider interpretable supervised models, specifically inductive rule-based learners that aim to extract rules from a labeled dataset that are discriminative in nature. We argue that linear classifiers like logistic regression are not comparable to x-PACS for two key reasons. First, they do not group the anomalies, but rather output a single separating hyperplane. Second, they do not provide rules on the features, but only feature coefficients, which can be negative (hard to interpret). Further, techniques aiming to explain black-box predictions are not directly comparable to our method, since most of them explain one instance at a time, in contrast to the group-wise explanations that x-PACS provides.

We compare x-PACS to the following popular rule-based learners.
1. Decision Tree (DT): DT aims to partition (or group) the labeled data into pure leaves. We treat the leaves containing at least two anomalies as analogous to our packs. Each such leaf is characterized by the feature rules (predicates) on the path from the root.
2. Ripper [9]: Ripper is a popular inductive rule learner that sequentially mines feature rules with high accuracy and coverage, with the aim of achieving generalization. We use the publicly available implementation of Ripper in the Weka repository for our experiments and consider the rules labeled anomalous.
3. RuleFit [15]: RuleFit is an ensemble learner where each base learner is a rule generated by a decision tree. A regression/classification model is set up over the base learners to identify the rules that are important for discriminating the different classes. We use the publicly available RuleFit implementation (R package pre: https://CRAN.R-project.org/package=pre) and use the rules with non-zero coefficients that contain at least two anomalies. (RuleFit results are averaged over seven datasets due to underspecified regression in Arrythmia and Yeast.)

To compare x-PACS with the rule learners, it is not fair to use description length, since the listed techniques do not explicitly optimize it. Instead, we use the following external interpretability measures proposed in [35] (all being lower the better):
(a) the number of groups (anomalous packs), (b) the avg. length of the rules (pack dimensionality), (c) the avg. fraction of normal points within the packs (impurity divided by n), and (d) the avg. interval width across the feature rules. In other words, an explanation with fewer groups, fewer rules, fewer exceptions, and a smaller spread in the features is considered more interpretable.

Fig. 5: x-PACS achieves the best balance between interpretability measures (a)–(d) [lower is better for all of them], and is significantly better at detection, shown in panel (e) AUPRC [higher is better], as compared to several rule learners.

DT has no means to choose the number of packs automatically. Therefore, we report DT results for depths 1–5 as compared to x-PACS in Table 5, averaged across all datasets. In addition to the interpretability measures, we report the detection performance in AUPRC (area under the precision-recall curve) on held-out data (80-20 split), which quantifies the generalization of the subspace rules. Results on the individual datasets per measure are shown with radar charts in Fig. 5. (See Table 6 for detailed results.) Notice the trade-offs between the measures for DT: while (c) and (d) tend to decrease with increasing depth, (a) and (b) increase. The lack of rule summarization in RuleFit is evident in the number of groups (a), where x-PACS consistently produces a smaller number of explanations across the datasets. We also note that x-PACS produces tighter intervals (d) compared to Ripper, emphasizing the concreteness of its explanations. Overall, x-PACS achieves the best trade-off, with lower overall values across the interpretability measures. Moreover, our signatures are significantly better at detecting future anomalies. We present more detailed experiments on detection next.

Table 6: Rule learners and DT (with respective depths 1–5) compared to x-PACS across datasets on interpretability measures (a)–(d) [all lower the better] as well as detection performance AUPRC [higher the better]. RuleFit leads to underspecified regression in Arrythmia and Yeast, which we denote by NA.
Measure (a): number of groups

  Dataset/Model   ImageI   ImageII   ImageIII   DigitI   DigitII   BrCancer   Arrythmia   Wine   Yeast
  DT-5              1.00      2.00       2.00     3.00      2.00      10.00        9.00    1.00    6.00
  DT-4              1.00      2.00       2.00     3.00      2.00       8.00        9.00    1.00    6.00
  DT-3              1.00      2.00       2.00     3.00      2.00       6.00        6.00    1.00    4.00
  DT-2              1.00      2.00       2.00     3.00      2.00       4.00        4.00    1.00    3.00
  DT-1              1.00      2.00       2.00     2.00      2.00       2.00        2.00    1.00    2.00
  RuleFit           3.00     15.00      10.00    13.00      8.00      24.00          NA   11.00      NA
  Ripper            1.00      2.00       2.00     2.00      2.00       3.00        5.00    1.00    4.00
  x-PACS
We study the importance of the refinement step discussed in §3.2 by removing it from x-PACS (denoted as ablated x-PACS). Recall that the primary reason we perform the refinement step is to cover more anomalous points and to reduce the number of normal points in the packs (see Fig. 3). Hence, to showcase the benefit, in Table 7 we compare the proportion of anomalous points (higher is better) and normal points (lower is better) covered by the final packs obtained using x-PACS and the ablated x-PACS. In addition, we also report the %-savings (higher is better) achieved in both cases. In the ablated x-PACS, the summarization step (see §3.3) is run directly on the hyper-rectangles produced in §3.1, rather than on the refined ellipsoids.

Table 7: Ablation study: x-PACS vs. ablated x-PACS (no refinement to ellipsoids). For each dataset (ImagesI, ImagesII, ImagesIII, DigitI, DigitII, BrCancer, Arrythmia, Wine, Yeast), the rows report the coverage of anomalous points (higher is better), the coverage of normal points (lower is better), and the %-savings (higher is better), for x-PACS and ablated x-PACS.
From Table 7, we observe that x-PACS is indeed able to cover more anomalous points while avoiding normal points in the final packs for all the datasets. These results demonstrate the utility of the refinement step.

4.2 Detection Performance

While not our primary focus, x-PACS can also be used to detect anomalies. Specifically, given the packs identified from historical/training data, a future test instance that falls in any one of the packs (i.e., is enclosed within any hyper-ellipsoid in the packing) can be flagged as an anomaly. Note that, like any supervised method, x-PACS can only detect future instances of anomalies of known types. To measure detection quality, we compare x-PACS to 7 competitive baselines on all datasets.
1. Mixture of K-Gaussians on the anomalous points, with K chosen at the “knee” of the likelihood. Anomaly score of a test instance: the maximum of the probabilities of being generated from each cluster.
2. KDE on the normal points, with the Gaussian kernel bandwidth chosen by cross-validation. Anomaly score: negative of the density at the test point. (A minimal scoring sketch for these first two baselines is given after the list.)
3. NN. Anomaly score: distance of the test point to its nearest-neighbor (nn) normal point in the training set, divided by the distance of that nn point to its own nearest normal point in the training set.
4. PCA+SVDD on all points [55]. A single hyperball that aims to enclose the anomalous points in the PCA-reduced space, for which the embedding dimensionality is chosen at the “knee” of the scree plot. (SVDD optimization diverged for some high-dimensional datasets; therefore, we performed PCA as a preprocessing step.) Anomaly score: distance of the test point from the hyperball's center.
5. DT on all points, where we balance the data for training and regularize by tree depth, chosen via cross-validation. Anomaly score: number of anomalous samples in the leaf the test point falls into, divided by the leaf size.
6-7. SVM-Lin & SVM-RBF on all points. Hyperparameters set by cross-validation. Anomaly score: “confidence”, i.e., distance from the decision boundary.
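To make the first two baselines concrete, the sketch below shows one way to compute their anomaly scores with scikit-learn; the candidate bandwidth grid and the use of per-component Gaussian densities are our own illustrative choices rather than the exact experimental setup.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# Baseline 1: mixture of K Gaussians fit on the anomalous training points.
# Score: maximum probability (density) of the test point under any component.
def gmm_scores(X_anom_train, X_test, k):
    gmm = GaussianMixture(n_components=k).fit(X_anom_train)
    dens = np.stack(
        [multivariate_normal(mean=m, cov=c).pdf(X_test)
         for m, c in zip(gmm.means_, gmm.covariances_)], axis=1)
    return dens.max(axis=1)

# Baseline 2: KDE fit on the normal training points, bandwidth by cross-validation.
# Score: negative of the estimated density at the test point.
def kde_scores(X_norm_train, X_test):
    grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                        {"bandwidth": np.logspace(-2, 1, 10)}, cv=3)
    grid.fit(X_norm_train)
    return -np.exp(grid.best_estimator_.score_samples(X_test))
```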
We create 3 folds of each dataset and in turn use 2/3 for training and 1/3 for testing, except for the Images datasets with the fewest anomalies, for which we do leave-one-out testing. All points receive an anomaly score from each method as described above. x-PACS's anomaly score for a test instance x is the maximum h_k(x) = x^T U_k x + w_k^T x + w_k over all packs p_k in the packing resulting from the training data. We rank points in decreasing order of their score and report the area under the precision-recall curve in Table 8.
Table 8: Area under the precision-recall curve (AUPRC) on anomaly detection. (Methods: K-Gaussians, KDE, NN, PCA+SVDD, DT, SVM-Lin, SVM-RBF, x-PACS; datasets: ImagesI, ImagesII, ImagesIII, DigitI, DigitII, BrCancer, Arrythmia, Wine, Yeast.)
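As a rough illustration of how the learned packs translate into detection scores and AUPRC, the sketch below assumes each pack p_k is stored as a triple (U_k, w_k, w_k0) defining the quadratic response h_k(x) above; the data layout and the use of scikit-learn's average precision as the AUPRC estimate are our own choices for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def xpacs_scores(packs, X_test):
    """Score test points with the maximum pack response max_k h_k(x).

    Each pack is assumed to be a triple (U, w, w0) encoding the quadratic
    rule h(x) = x^T U x + w^T x + w0 of one hyper-ellipsoid.
    """
    return np.array([max(x @ U @ x + w @ x + w0 for U, w, w0 in packs)
                     for x in X_test])

# Toy usage: one 2-d pack with h(x) = 1 - ||x||^2 (non-negative inside the unit ball).
packs = [(-np.eye(2), np.zeros(2), 1.0)]
X_test = np.array([[0.1, 0.1], [2.0, 2.0], [0.0, 0.5]])
y_test = np.array([1, 0, 1])                    # 1 = anomaly, 0 = normal
scores = xpacs_scores(packs, X_test)
print(average_precision_score(y_test, scores))  # average precision, an AUPRC estimate
```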
SVMs achieve the highest detection rate, as might be expected. However, the kernel SVM cannot be interpreted, and linear SVM, like LR, neither identifies anomalous patterns nor produces any explicit feature rules. Notably, x-PACS considerably outperforms all the other baselines across datasets, including DT, which produces the most interpretable output among the baselines, as discussed earlier.
Finally, we quantify the scalability of x-PACS empirically. To this end, we implement a synthetic data generator, parameterized by the data size, the total dimensionality, the maximum pack size and dimensionality, and the number of anomalous packs. Anomalies are sampled from a small range per feature within a subspace, and normal points are sampled from the reverse of the histogram densities derived from the anomalous points. Fig. 6 shows the running time w.r.t. data size m, dimensionality d, average pack dimensionality d_avg, and total number of anomalies a. All plots demonstrate near-linear scalability. Recall that we showed exponential complexity w.r.t. the maximum pack dimensionality d_max; notably, we observe linear time growth on average.
Fig. 6: x-PACS scales linearly with input size. (Running time in seconds in four settings: varying m with a = 500, avg. pack dim = 20, d = 100; varying d with m = 20000, a = 500, avg. pack dim = 20; varying avg. pack dim with m = 20000, a = 500, d = 100; varying a with m = 20000, avg. pack dim = 20, d = 100.)
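For concreteness, the sketch below shows one way such a generator could look; the value ranges, bin count, and the per-feature "inverted histogram" sampling for normal points are our own illustrative choices and not necessarily the exact generator used in the experiments.

```python
import numpy as np

def generate(m, d, n_packs, pack_dim, n_anom, seed=0):
    """Toy synthetic generator with anomalous packs in random subspaces.

    Anomalies: each pack occupies a narrow interval per feature of a randomly
    chosen subspace. Normal points: each feature is sampled preferring the
    histogram bins least occupied by the anomalies ("reversed" densities).
    """
    rng = np.random.default_rng(seed)
    anom = rng.random((n_anom, d))                     # background values in [0, 1)
    for idx in np.array_split(np.arange(n_anom), n_packs):
        feats = rng.choice(d, size=pack_dim, replace=False)
        lo = rng.random(pack_dim) * 0.9                # narrow interval [lo, lo + 0.05)
        anom[np.ix_(idx, feats)] = lo + 0.05 * rng.random((len(idx), pack_dim))
    normal = np.empty((m - n_anom, d))
    for j in range(d):                                 # inverse-density sampling per feature
        counts, edges = np.histogram(anom[:, j], bins=20, range=(0.0, 1.0))
        probs = (counts.max() - counts + 1).astype(float)
        probs /= probs.sum()
        bins = rng.choice(20, size=len(normal), p=probs)
        normal[:, j] = edges[bins] + (edges[1] - edges[0]) * rng.random(len(normal))
    return normal, anom

X_norm, X_anom = generate(m=2000, d=20, n_packs=3, pack_dim=5, n_anom=100)
```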
Related Work
Related areas of study span outlier explanation, subspace clustering and subspace outlier detection, data description, subgroup discovery, rule learning, rare class discovery, and approaches aiming to explain black-box classifiers. We highlight the related work in the context of our desiderata in Table 9.
Outlier explanation:
The seminal work [27] provides what they call “intensional knowledge”, per outlier, by identifying the minimal subspaces in which it deviates. To find the optimal subset of features that differentiates the outliers from normal points, [34] formulates a constraint programming problem and [24] takes a subspace search route. Similarly, [10,11,40] aim to explain one outlier at a time by the features that participate in projection directions that maximally separate it from the normal points.
All existing work in this area assumes the outliers are scattered and strives to explain them individually rather than in groups. Therefore, these methods cannot identify anomalous patterns. Moreover, they do not focus explicitly on the shortest description, let alone in a principled, information-theoretic way, as we do in this work.
Table 9: Comparison of related work in terms of properties D1–D5 in reference to our Desiderata. Columns: Explain as-a-group? Multiple groups? Find subspace? Rules on features? Discriminative? Minimal? Rows: subspace clustering [3,7,53,30,44]; projected clustering [2,41]; (data description) SVDD [55], SSSVDD [18]; (rare category) RACH [21]; (rare category) PALM [20]; Knorr and Ng [27]; RefOUT [24], CP [34], LODI [11], LOGP [10]; EXPREX [5]; (explaining black-box classifiers) LIME [51]; EXstream [58]; Explainer [50]; SRF [29], Krimp [56], RuleFit [15], Ripper [9]; and x-PACS [this paper], which is the only method satisfying all of the properties.
Extending earlier work [4] on explaining single outliers, [5] aims to explain groups of outlier points, or what they call sub-populations. They search for ⟨context, feature⟩ pairs, where the (single) feature can differentiate as many outliers as possible from the normal points that share the same context. It is important to note that their goal is to explain a group (or set) of outliers collectively, not to explain them with multiple groups. Similarly, [58] describes anomalies grouped in time. They construct explanatory Conjunctive Normal Form rules using features with low segmentation entropy, which quantifies how intermixed normal and anomalous points are. They heuristically discard highly correlated features from the rules to get minimal explanations. Again, they strive to explain all the anomalies as one group, and not in multiple groups.
We found that SRF (sapling random forest) [29] aims to explain and cluster outliers in a setting similar to ours. They build on their earlier work [50], which explains outliers one at a time by learning an ensemble of small decision trees (called saplings) and combining the rules (from the root to the leaf in which the outlier lies) across the trees. SRF then groups the outliers using k-means clustering based on the similarity of their explanations. However, there is no guarantee on the minimality of their overall description, since grouping is done as a post-processing step and by using a local-optima-prone clustering algorithm. Moreover, their paper offers little discussion of the choice of the number of clusters or of the format of the final description after the anomalies are clustered. We are not aware of a publicly available implementation of SRF to compare with the proposed method and hence omit it from the experimental evaluation.
Subspace clustering and Outlier detection:
There is a long list of work on subspace clustering [3,7,30,44,53] that aims to find high-density clusters in feature subspaces (see [48,32] for reviews). Some others are projection based and work in transformed feature spaces [2,41]. However, these are unsupervised methods; their goal is not to explain labeled data, nor do they focus on minimal explanations. There is also a long list of subspace-based outlier detection methods [23,31,33,45,46,47]; however, they do not address the description problem.
Data description and Rare class discovery:
Another line of related work is data description [18,55] and rare class (or category) characterization [21,20]. The main goal of all of these works is to explain the data (normal and rare-class points, respectively) within a separating hyperball. However, all of them assume that those points cluster in a single hyperball and, with the exception of [20], search for a full-dimensional enclosing hyperball. As such, they do not address the curse of dimensionality or identify multiple clusters embedded in different subspaces.
Subgroup discovery and Rule learning:
Classification rule learning algorithms have the objective of generating models consisting of a set of rules that capture properties of all the classes of the target variable, whereas in subgroup discovery the objective is to discover individual rules of interest (see [22] for an overview). The seminal rule-based learners Ripper [9] and CN2 [8] sequentially mine for rules with high accuracy and coverage. More recently, [15] proposed RuleFit, an ensemble learner whose base learners are rules generated by decision trees; a regression/classification model is set up over these base learners to identify the rules that are important in discriminating the different classes. A few other works on ensemble learners [13,19] build tree ensembles that are interpretable. While the resulting rules are interpretable, they are learned with the aim of achieving generalization. This differs from our work, where we primarily focus on describing the underrepresented class (the anomalies) without emphasizing generalizability.
SubgroupMiner [26] extends seminal works in subgroup discovery (MIDOS [57], Explora [25]) to handle numerical and categorical attributes. SD [16] proposes an interactive subgroup discovery technique based on variants of beam search guided by expert knowledge. Krimp [56] proposes a greedy MDL-based approach that mines frequent itemsets to describe the points of a given class; a classifier is further built by mining frequent itemsets on the various classes independently. Discriminative contrast pattern mining techniques [39] assume nominal features and aim to extract contrast patterns (itemsets) with a large support difference across categories. A key difference between the techniques discussed above and our work is our summarization scheme; in the experiments we compared x-PACS to rule learners on various interpretability measures.
Explaining black-box classifiers:
Approaches such as [14,28,42,51] aim to explain the decision made by a black-box predictor. LIME [51] finds nearest neighbors of a single labeled input example to construct a linear interpretable model that is locally faithful to the predictor; the authors further propose a submodular optimization framework to pick instances that are representative of the predictions of a classifier. Other works [14,28] explain the model by perturbing the features to quantify their influence on the prediction. None of these works aims to explain multiple instances collectively; as such, they do not address the summarization of explanations and hence are not comparable to the proposed method.
All in all, none of the existing related methods provides all of 1) collective, rather than individual, explanations, 2) explanations for multiple anomalous groups, 3) in characterizing subspaces, 4) using interpretable feature rules that can 5) discriminate anomalies from normal points, 6) aiming to minimize the description length.
Conclusion
We considered the problem of explaining given anomalies in high-dimensional datasets in groups. Our key idea is to describe the data by the patterns it contains. We proposed x-PACS for identifying a small number of low-dimensional anomalous patterns that “pack” similar, clustered anomalies and “compress” the data most succinctly. In designing x-PACS, we combined ideas from data mining (bottom-up algorithms with pruning), optimization (nonlinear quadratic discrimination), information theory (data encoding with bits), and the theory of algorithms (nonmonotone submodular function maximization). Our notable contributions are as follows.
– A new desiderata for the anomaly description problem, enlisting five desired properties (D1–D5),
– A new problem formulation for explaining a given set of anomalies in groups (D1),
– The description algorithm x-PACS, which provides low-dimensional (D2), interpretable (D3), and discriminative (D4) feature rules per anomalous group,
– A new anomaly encoding scheme, based on the minimum description length (MDL) principle, that lends itself to efficient optimization to produce minimal explanations (D5) with guarantees.
Through experiments on real-world datasets, we showed the effectiveness of x-PACS both in explanation and detection, and its superiority over competitive baselines. For reproducibility, all of our source code and datasets are publicly released at https://github.com/meghanathmacha/xPACS.
Acknowledgments
This research is sponsored by NSF CAREER 1452425 and IIS 1408287, the ARO Young Investigator Program under Contract No. W911NF-14-1-0029, and the PwC Risk and Regulatory Services Innovation Center at Carnegie Mellon University. Any conclusions expressed in this material are those of the authors and do not necessarily reflect the views, either expressed or implied, of the funding parties.
References
1. C. C. Aggarwal. Outlier Analysis. Springer, 2013.
2. C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In SIGMOD, pages 61–72, 1999.
3. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD, pages 94–105, 1998.
4. F. Angiulli, F. Fassetti, and L. Palopoli. Detecting outlying properties of exceptional objects. ACM Trans. Database Syst., 34(1), 2009.
5. F. Angiulli, F. Fassetti, and L. Palopoli. Discovering characterizations of the behavior of anomalous subpopulations. IEEE TKDE, 25(6):1280–1292, 2013.
6. N. Buchbinder, M. Feldman, J. Naor, and R. Schwartz. Submodular maximization with cardinality constraints. In SODA, pages 1433–1452, 2014.
7. C. H. Cheng, A. W.-C. Fu, and Y. Zhang. Entropy-based subspace clustering for mining numerical data. In KDD, pages 84–93, 1999.
8. P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261–283, 1989.
9. W. W. Cohen. Fast effective rule induction. In Machine Learning Proceedings 1995, pages 115–123. Elsevier, 1995.
10. X. H. Dang, I. Assent, R. T. Ng, A. Zimek, and E. Schubert. Discriminative features for identifying and interpreting outliers. In ICDE, pages 88–99, 2014.
11. X. H. Dang, B. Micenková, I. Assent, and R. T. Ng. Local outlier detection with interpretation. In ECML/PKDD, pages 304–320, 2013.
12. V. Dave, S. Guha, and Y. Zhang. Measuring and fingerprinting click-spam in ad networks. In SIGCOMM, pages 175–186. ACM, 2012.
13. H. Deng. Interpreting tree ensembles with inTrees. arXiv preprint arXiv:1408.5456, 2014.
14. R. C. Fong and A. Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. arXiv preprint arXiv:1704.03296, 2017.
15. J. H. Friedman, B. E. Popescu, et al. Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3):916–954, 2008.
16. D. Gamberger and N. Lavrac. Expert-guided subgroup discovery: Methodology and application. Journal of Artificial Intelligence Research, 17:501–527, 2002.
17. S. O. Gharan and J. Vondrak. Submodular maximization by simulated annealing. In SODA, pages 1098–1116. SIAM, 2011.
18. N. Görnitz, M. Kloft, and U. Brefeld. Active and semi-supervised data domain description. In ECML/PKDD, pages 407–422, 2009.
19. S. Hara and K. Hayashi. Making tree ensembles interpretable. arXiv preprint arXiv:1606.05390, 2016.
20. J. He and J. G. Carbonell. Co-selection of features and instances for unsupervised rare category analysis. In SDM, pages 525–536, 2010.
21. J. He, H. Tong, and J. G. Carbonell. Rare category characterization. In ICDM, pages 226–235, 2010.
22. F. Herrera, C. J. Carmona, P. González, and M. J. Del Jesus. An overview on subgroup discovery: foundations and applications. Knowledge and Information Systems, 29(3):495–525, 2011.
23. F. Keller, E. Müller, and K. Böhm. HiCS: High contrast subspaces for density-based outlier ranking. In ICDE, pages 1037–1048, 2012.
24. F. Keller, E. Müller, A. Wixler, and K. Böhm. Flexible and adaptive subspace search for outlier analysis. In CIKM, pages 1381–1390. ACM, 2013.
25. W. Klösgen. Explora: A multipattern and multistrategy discovery assistant. In Advances in Knowledge Discovery and Data Mining, pages 249–271. American Association for Artificial Intelligence, 1996.
26. W. Klösgen and M. May. Census data mining – an application. In Proceedings of the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Helsinki, Finland, 2002.
27. E. M. Knorr and R. T. Ng. Finding intensional knowledge of distance-based outliers. In VLDB, pages 211–222, 1999.
28. P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730, 2017.
29. M. Kopp, T. Pevný, and M. Holena. Interpreting and clustering outliers with sapling random forests. In ITAT, 2014.
30. H.-P. Kriegel, P. Kröger, M. Renz, and S. H. R. Wurst. A generic framework for efficient subspace clustering of high-dimensional data. In ICDM, 2005.
31. H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Outlier detection in axis-parallel subspaces of high dimensional data. In PAKDD, pages 831–838, 2009.
32. H.-P. Kriegel, P. Kröger, and A. Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowl. Discov. Data, 3(1):1–58, 2009.
33. H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Outlier detection in arbitrarily oriented subspaces. In ICDM, pages 379–388, 2012.
34. C.-T. Kuo and I. Davidson. A framework for outlier description using constraint programming. In AAAI, pages 1237–1243, 2016.
35. H. Lakkaraju, E. Kamar, R. Caruana, and J. Leskovec. Interpretable and explorable approximations of black box models. CoRR, abs/1707.01154, 2017.
36. A. Lazarevic and V. Kumar. Feature bagging for outlier detection. In KDD, pages 157–166, 2005.
37. K. Lee, B. D. Eoff, and J. Caverlee. Seven months with the devils: A long-term study of content polluters on Twitter. In ICWSM, 2011.
38. F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In ICDM, 2008.
39. E. Loekito and J. Bailey. Mining influential attributes that capture class and group contrast behaviour. In CIKM, pages 971–980. ACM, 2008.
40. B. Micenková, R. T. Ng, X. H. Dang, and I. Assent. Explaining outliers by subspace separability. In ICDM, pages 518–527, 2013.
41. G. Moise, J. Sander, and M. Ester. P3C: A robust projected clustering algorithm. In ICDM, pages 414–425, 2006.
42. G. Montavon, W. Samek, and K.-R. Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 2017.
43. A. Mukherjee, V. Venkataraman, B. Liu, and N. S. Glance. What Yelp fake review filter might be doing? In ICWSM, 2013.
44. E. Müller, I. Assent, S. Günnemann, R. Krieger, and T. Seidl. Relevant subspace clustering: Mining the most interesting non-redundant concepts in high dimensional data. In ICDM, pages 377–386. IEEE, 2009.
45. E. Müller, I. Assent, P. I. Sanchez, Y. Mulle, and K. Böhm. Outlier ranking via subspace analysis in multiple views of the data. In ICDM, pages 529–538, 2012.
46. E. Müller, I. Assent, U. Steinhausen, and T. Seidl. OutRank: ranking outliers in high dimensional data. In ICDE Workshops, pages 600–603, 2008.
47. E. Müller, M. Schiffer, and T. Seidl. Statistical selection of relevant subspace projections for outlier ranking. In ICDE, pages 434–445, 2011.
48. L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. 6(1):90–105, 2004.
49. D. Pelleg and A. Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In ICML, pages 727–734, 2000.
50. T. Pevný and M. Kopp. Explaining anomalies with sapling random forests. In ITAT, 2014.
51. M. T. Ribeiro, S. Singh, and C. Guestrin. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
52. J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
53. K. Sequeira and M. J. Zaki. SCHISM: A new approach for interesting subspace mining. In ICDM, pages 186–193, 2004.
54. B. W. Silverman. Density Estimation for Statistics and Data Analysis. Routledge, 2018.
55. D. M. J. Tax and R. P. W. Duin. Support vector data description. Machine Learning, 54(1):45–66, 2004.
56. J. Vreeken, M. Van Leeuwen, and A. Siebes. Krimp: mining itemsets that compress. Data Mining and Knowledge Discovery, 23(1):169–214, 2011.
57. S. Wrobel. An algorithm for multi-relational discovery of subgroups. In European Symposium on Principles of Data Mining and Knowledge Discovery, pages 78–87. Springer, 1997.
58. H. Zhang, Y. Diao, and A. Meliou. EXstream: Explaining anomalies in event stream monitoring. In