Model-based Exception Mining for Object-Relational Data
Fatemeh Riahi
School of Computing Science, Simon Fraser University, Burnaby, [email protected]
Oliver Schulte
School of Computing Science, Simon Fraser University, Burnaby, [email protected]
Abstract — This paper is based on a previous publication [29]. Our work extends exception mining and outlier detection to the case of object-relational data. Object-relational data represent a complex heterogeneous network [12], which comprises objects of different types, links among these objects, also of different types, and attributes of these links. This special structure prohibits a direct vectorial data representation. We follow the well-established Exceptional Model Mining framework, which leverages machine learning models for exception mining: an object is exceptional to the extent that a model learned for the object data differs from a model learned for the general population. Exceptional objects can be viewed as outliers. We apply state-of-the-art probabilistic modelling techniques for object-relational data that construct a graphical model (Bayesian network), which compactly represents probabilistic associations in the data. A new metric, derived from the learned object-relational model, quantifies the extent to which the individual association pattern of a potential outlier deviates from that of the whole population. The metric is based on the likelihood ratio of two parameter vectors: one that represents the population associations, and another that represents the individual associations. Our method is validated on synthetic datasets and on real-world datasets about soccer matches and movies. Compared to baseline methods, our novel transformed likelihood ratio achieved the best detection accuracy on all datasets.
I. INTRODUCTION: EXCEPTION MINING FOR RELATIONAL DATA
Exception mining is an important data analysis task in many domains. For relational data, exception mining supports outlier detection, where statistical deviations are viewed as due to a node or entity being genuinely exceptional, rather than due to statistical noise in the data. Statistical approaches to unsupervised exception/outlier detection are based on a generative model of the data [2]. The generative model represents normal behavior. An individual object is deemed an outlier if the model assigns sufficiently low likelihood to generating it. Following the well-established Exceptional Model Mining framework [10], we propose a new method for extending statistical outlier detection to the case of object-relational data using a novel likelihood-ratio comparison for generative probabilistic models. The object-relational data model is one of the main data models for structured data [18]. The main characteristics of objects that we utilize in this paper are the following. (1)
Object Identity.
Each object has a unique identifier that is the same across contexts. For example, a player has a name that identifies him in different matches. (2)
Class Membership.
An object is an instance of a class, which is a collection of similar objects. Objects in the same class share a set of attributes. For example, van Persie is a player object that belongs to the class striker, which is a subclass of the class player. Note that this use of the term "class" is different from the machine learning sense of "class" as a prediction target. (3)
Object Relationships.
Objects are linked to other objects. Both objects and their links have attributes. A common type of object relationship is a component relationship between a complex object and its parts. For example, a match links two teams, and each team comprises a set of players for that match. A difference between relational and vectorial data is therefore that an individual object is characterized not only by a list of attributes, but also by its links and by attributes of the objects linked to it. We refer to the substructure comprising this information as the object data. Equivalent terms are "egonet" from network analysis [3] and "interpretation" [19]. Relational outlier detection aims to identify objects whose data differ from the general population or class. Our approach to this problem leverages statistical-relational model discovery, as follows. a) Approach:
A class-model Bayesian network (BN) structure is learned with data for the entire population. The nodes in the BN represent attributes of links, of multiple types, and attributes of objects, also of multiple types. To learn the BN model, we apply techniques from statistical-relational learning, a recent field that combines AI and machine learning [13], [32], [9]. Given a set of parameter values and an input database, it is possible to compute a class model likelihood that quantifies how well the BN fits the object data. The class model likelihood uses BN parameter values estimated from the entire class data. This is a relational extension of the standard log-likelihood method for i.i.d. vectorial data, which uses the likelihood of a data point as its outlier score. While the class model likelihood is a good baseline score, it can be improved by comparing it to the object model likelihood, which uses BN parameter values estimated from the object data.
The model log-likelihood ratio (LR) is the log-ratio of the object model likelihood to the class model likelihood. This ratio quantifies how the probabilistic associations that hold in the general population deviate from the associations in the object data substructure. While the likelihood ratio discriminates relational outliers better than the class model likelihood alone, it can be improved further by applying two transformations: (1) a mutual information decomposition, and (2) replacing log-likelihood differences by log-likelihood distances. We refer to the resulting novel score as the log-likelihood distance.

[Extended from Riahi and Schulte, 2015 IEEE Symposium Series on Computational Intelligence.]

Fig. 1. A general schema for Exceptional Model Mining for propositional data.

b) Evaluation: Our code and datasets are available online at [28]. Our performance evaluation follows the design of previous outlier detection studies [12], [2], where the methods are scored against a test set of known outliers. We use three synthetic and two real-world datasets, from the UK Premier Soccer League and the Internet Movie Database (IMDb). On the synthetic data we have known ground truth. For the real-world data, we use a one-class design, where one object class is designated as normal and objects from outside the class are the outliers. For example, we compare goalies as outliers against the class of strikers as normal objects. On all datasets, the log-likelihood distance metric achieves the best detection accuracy compared to baseline methods. We also offer case studies where we assess whether individuals that our score ranks as highly unusual in their class are indeed unusual. The case studies illustrate that our outlier score is easy to interpret, because the Bayesian network provides a sum decomposition of the data distributions by features. Interpretability is very important for users of an outlier detection method, as there is often no ground truth to evaluate outliers suggested by the method. c) Related Work:
Section V discusses the relationship to related work in detail. Our approach applies the exceptional model mining (EMM) framework [10] to multi-relational data. Figure 1 illustrates the EMM schema. The EMM framework leverages the extensive work on model learning in machine learning for exception mining: a subgroup is exceptional to the extent that a model learned from data for the subgroup deviates from a model learned for the general population. A computational method for measuring this extent is called a quality measure; we also refer to it as an outlierness metric. For a given model type, finding an appropriate quality measure for quantifying exceptionality is the main research question in EMM. The EMM framework allows us to leverage the extensive work on statistical-relational model learning for exception mining in multi-relational data. Compared to previous EMM models, the novelty of our work is as follows. 1) EMM has so far been developed only for propositional i.i.d. data, not relational data. Accordingly, EMM has not been applied with SRL models. 2) In the propositional i.i.d. setting, each object is represented by a single data row, and it is meaningless to learn a model for a single object. Instead, EMM is applied to identify exceptional subgroups of objects. With relational data, each object is represented by its own dataset (egonet, interpretation), and it is meaningful to apply EMM to identify single exceptional objects. Compared to previous relational outlier detection work, our model-based approach is novel in that it neither summarizes the object data by a feature set (as in the Oddball system, see [3]) nor looks for rules that exceptional objects violate (e.g. [19]). d) Contributions:
Our main contributions may be summarized as follows. 1) The first EMM approach to outlier detection for structured data that is based on a probabilistic model. 2) A new outlier score based on a novel model likelihood comparison, the log-likelihood distance. e) Paper Organization:
We review background about Bayesian networks for relational data. Then we describe how we apply the EMM framework to multi-relational data. We introduce a novel log-likelihood distance outlier score as the quality or outlierness metric. After presenting the details of our approach, we review related work. Empirical evaluation compares model-based and aggregation-based approaches to relational outlier detection, with respect to three synthetic and three real-world problems.

II. BACKGROUND: BAYESIAN NETWORKS FOR RELATIONAL DATA
We adopt the Parametrized Bayes net (PBN) formalism [26] that combines Bayes nets with logical syntax for expressing relational concepts. EMM is an inclusive framework and can in principle be applied with other SRL models, such as Markov Logic networks [9]. We worked with PBNs because i) they offer the most scalable structure learning methods [33] to support our larger datasets, and ii) the PBN conditional probability parameters can be easily interpreted, which means that the resulting exceptionality metrics can be easily interpreted (see Section I-0b below).
A. Bayesian Networks

A Bayesian Network (BN) is a directed acyclic graph (DAG) whose nodes comprise a set of random variables [24]. Depending on context, we interchangeably refer to the nodes and variables of a BN. Fix a set of variables V = {V_1, ..., V_n}. The possible values of V_i are enumerated as {v_{i1}, ..., v_{ir_i}}. The notation P(V_i = v) ≡ P(v) denotes the probability of variable V_i taking on value v. We also use the vector notation P(V = v) ≡ P(v) to denote the joint probability that each variable V_i takes on value v_i. The conditional probability parameters of a Bayesian network specify the distribution of a child node given an assignment of values to its parent nodes. For an assignment of values to its nodes, a BN defines the joint probability as the product of the conditional probability of each child node value given its parent values. This means that the log-joint probability can be decomposed as the node-wise sum

ln P(V = v; B, θ) = Σ_{i=1}^{n} ln θ(v_i | v_{pa_i})    (1)

where v_i resp. v_{pa_i} is the assignment of values to node V_i resp. the parents of V_i determined by the assignment v. To avoid difficulties with ln(0), here and below we assume that joint distributions are positive everywhere. Since the parameter values for a Bayes net define a joint distribution over its nodes, they therefore entail a marginal, or unconditional, probability for a single node. We denote the marginal probability that node V has value v as P(V = v; B, θ) ≡ θ(v). a) Example.: Figure 2 shows an example of a Bayesian network and associated joint and marginal probabilities.

Fig. 2. Example of joint and marginal probabilities computed from a toy Bayesian network structure. The parameters were estimated from the Premier League dataset. (Top): A class model Bayesian network B_C for all teams with class parameters θ_C: P(shotEff=high)=0.38, P(passEff=high)=0.43; P(Result=Win | shotEff, passEff) = 0.44, 0.22, 0.18, 0.07 for (high,high), (high,low), (low,low), (low,high). (Bottom): The same Bayesian network structure with object parameters θ_o learned for Wigan Athletic (T = WA): P(shotEff=high)=0.50, P(passEff=high)=0.61; P(Result=Win | shotEff, passEff) = 0.53, 0.50, 0.00, 0.11 for (high,high), (high,low), (low,low), (low,high). Our model-based outlier scores compare the data likelihood of the class parameters and the object parameters.
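The node-wise sum in Equation (1) can be sketched in code. This is a minimal illustration, not the paper's implementation; the dictionary-based CPT representation and function names are our own, with parameter values taken from the class model of Figure 2.

```python
import math

def joint_log_prob(assignment, cpts, parents):
    """Log-joint probability of a full assignment, as the node-wise sum of
    log conditional probabilities (Equation 1).
    assignment: node -> value; parents: node -> tuple of parent nodes;
    cpts: node -> {(value, parent_values): probability}."""
    total = 0.0
    for node, value in assignment.items():
        pa_vals = tuple(assignment[p] for p in parents[node])
        total += math.log(cpts[node][(value, pa_vals)])
    return total

# Class-model parameters from Figure 2 (top).
parents = {"shotEff": (), "passEff": (), "result": ("shotEff", "passEff")}
cpts = {
    "shotEff": {("high", ()): 0.38, ("low", ()): 0.62},
    "passEff": {("high", ()): 0.43, ("low", ()): 0.57},
    "result": {("win", ("high", "high")): 0.44,
               ("loss", ("high", "high")): 0.56},
}
lp = joint_log_prob({"shotEff": "high", "passEff": "high", "result": "win"},
                    cpts, parents)
# lp = ln(0.38) + ln(0.43) + ln(0.44)
```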
B. Relational Data
We adopt a functor-based notation for combining logical and statistical concepts [26], [16]. A functor is a function or predicate symbol. Each functor has a set of values (constants) called the domain of the functor. The domain of a predicate is {T, F}. Predicates are usually written with uppercase Roman letters, other terms with lowercase letters. A predicate of arity at least two is a relationship functor. Relationship functors specify which objects are linked. Other functors represent features or attributes of an object or a tuple of objects (i.e., of a relationship). A population is a set of objects. A term is of the form f(σ_1, ..., σ_k) where f is a functor and each σ_i is a first-order variable or a constant denoting an object. A term is ground if it contains no first-order variables; otherwise it is a first-order term. In the context of a statistical model, we refer to first-order terms as Parametrized Random Variables (PRVs) [16]. A grounding replaces each first-order variable in a term by a constant; the result is a ground term. A grounding may be applied simultaneously to a set of terms. A relational database D specifies the values of all ground terms, which can be listed in data tables. Consider a joint assignment P(V = v) of values to a set of PRVs V. The grounding space of the PRVs is the set of all possible grounding substitutions, each applied to all PRVs in V. The count of groundings that satisfy the assignment with respect to a database D is denoted by D(V = v). The database frequency P_D(V = v) is the grounding count divided by the number of all possible groundings.

Fig. 3. The EMM approach for statistical-relational models. The model class we utilize in this paper is Parametrized Bayesian networks, with a log-linear likelihood function. As outlierness metrics we consider the standard Kullback-Leibler divergence, and a novel divergence introduced in this paper.

Example.
The Opta dataset represents information about premier league data (Sec. VI-B). The basic populations are teams, players, matches, with corresponding first-order variables T, P, M. Table I specifies values for some ground terms. The first three column headers show first-order variables ranging over different populations. The remaining columns represent features. Table III illustrates grounding counts. Counts are based on the 2011-2012 Premier League Season. We count only groundings (team, match) such that team plays in match. Each team, including Wigan Athletic, appears in 38 matches. The total number of team-match pairs is 38 × 20 = 760.

A novel aspect of our paper is that we learn model parameters for specific objects as well as for the entire population. The appropriate object data table is formed from the population data table by restricting the relevant first-order variable to the target object. For example, the object database for target Team WiganAthletic forms a subtable of the data table of Table I that contains only rows where TeamID = WA; see Table II. In database terminology, an object database is like a view centered on the object. The object database is an individual-centered representation [11].

TABLE I. SAMPLE POPULATION DATA TABLE (SOCCER).

MatchId M | TeamId T | PlayerId P | First_goal(P,M) | TimePlayed(P,M) | ShotEff(T,M) | result(T,M)
117       | WA       | McCarthy   | 0               | 90              | 0.53         | win
148       | WA       | McCarthy   | 0               | 85              | 0.57         | loss
15        | MC       | Silva      | 1               | 90              | 0.59         | win
...       | ...      | ...        | ...             | ...             | ...          | ...

TABLE II. SAMPLE OBJECT DATA TABLE, FOR TEAM T = WA.

MatchId M | TeamId T = WA | PlayerId P | First_goal(P,M) | TimePlayed(P,M) | ShotEff(WA,M) | result(WA,M)
117       | WA            | McCarthy   | 0               | 90              | 0.53          | win
148       | WA            | McCarthy   | 0               | 85              | 0.57          | loss
...       | WA            | ...        | ...             | ...             | ...           | ...

TABLE III. EXAMPLE OF GROUNDING COUNT AND FREQUENCY IN PREMIER LEAGUE DATA, FOR THE CONJUNCTION passEff(T,M) = hi, shotEff(T,M) = hi, Result(T,M) = win.

Database        | Count D(V = v) | Frequency P_D(V = v)
Population      | 76             | 76/760 = 0.10
Wigan Athletic  | 7              | 7/38 = 0.18

C. Bayesian Networks for Relational Data

A Parametrized Bayesian Network Structure (PBN) is a Bayesian network structure whose nodes are PRVs. The relationships and features in an object database define a set of nodes for Bayes net learning; see Figure 2.
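The database frequency P_D(V = v) of Section II-B can be sketched as a count over grounding rows. This is an illustrative sketch with made-up rows at the (team, match) grain; the row contents and helper name are our own, not part of the Opta dataset.

```python
def db_frequency(rows, assignment):
    """P_D(V = v): fraction of groundings (rows) matching every
    (variable, value) pair in the assignment."""
    count = sum(all(row[k] == v for k, v in assignment.items())
                for row in rows)
    return count / len(rows)

# Three illustrative (team, match) groundings, mimicking Table I.
rows = [
    {"team": "WA", "match": 117, "passEff": "hi", "shotEff": "hi", "result": "win"},
    {"team": "WA", "match": 148, "passEff": "lo", "shotEff": "hi", "result": "loss"},
    {"team": "MC", "match": 15,  "passEff": "hi", "shotEff": "hi", "result": "win"},
]

freq = db_frequency(rows, {"passEff": "hi", "shotEff": "hi", "result": "win"})
# 2 of the 3 rows satisfy the conjunction -> 2/3
```

On the real data this count over all 760 team-match groundings yields 76/760 = 0.10 for the population, as in Table III.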
1) Model Likelihood for Parametrized Bayesian Networks:
A standard method for applying a generative model assumes that the generative model represents normal behavior, since it was learned from the entire population. An object is deemed an outlier if the model assigns sufficiently low likelihood to generating its features [6]. This likelihood method is an important baseline for our investigation. Defining a likelihood for relational data is more complicated than for i.i.d. data, because an object is characterized not only by a feature vector, but by an object database. We employ the relational pseudo log-likelihood [31], which can be computed as follows for a given Bayesian network and database.

LOG(D, B, θ) = Σ_{i=1}^{n} Σ_{j=1}^{r_i} Σ_{pa_i} P_D(v_{ij}, pa_i) ln θ(v_{ij} | pa_i)    (2)

Equation (2) represents the standard BN log-likelihood function for the object data [8], except that parent-child instantiation counts are standardized to be proportions [31]. The equation can be read as follows. 1) For each parent-child configuration, use the conditional probability of the child given the parent. 2) Multiply the logarithm of the conditional probability by the database frequency of the parent-child configuration. 3) Sum this product over all parent-child configurations and all nodes. Schulte proves that the maximum of the pseudo-likelihood (2) is given by the empirical database frequencies [31, Prop. 3.1]. In all our experiments we use these maximum likelihood parameter estimates. Example.
The family configuration passEff(T,M) = hi, shotEff(T,M) = hi, Result(T,M) = win contributes one term to the pseudo log-likelihood for the BN of Figure 2. For the population database, this term is 0.10 × ln(0.44) ≈ −0.08. For the Wigan Athletic database, the term is 0.18 × ln(0.44) ≈ −0.15.

III. EMM FOR RELATIONAL DATA
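The pseudo log-likelihood of Equation (2) is a frequency-weighted sum of log conditional probabilities, which is straightforward to sketch. The dictionary keying and function name below are our own; the two terms reproduce the worked example, using the Table III frequencies with the class parameter θ_C(win | hi, hi) = 0.44.

```python
import math

def pseudo_log_likelihood(family_freqs, theta):
    """Relational pseudo log-likelihood (Equation 2): sum over family
    configurations of P_D(child, parents) * ln theta(child | parents).
    family_freqs and theta are keyed by (node, value, parent_values)."""
    return sum(freq * math.log(theta[key])
               for key, freq in family_freqs.items())

key = ("Result", "win", ("hi", "hi"))  # passEff=hi, shotEff=hi, Result=win
term_population = pseudo_log_likelihood({key: 76 / 760}, {key: 0.44})
term_wigan = pseudo_log_likelihood({key: 7 / 38}, {key: 0.44})
# term_population = 0.10 * ln(0.44), term_wigan = 0.18 * ln(0.44)
```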
This section describes our approach to applying the EMM framework to relational data, using the following notation.
• D_C is the database for the entire class of objects; cf. Table I. This database defines the class distribution P_C ≡ P_{D_C}.
• D_o is the restriction of the input database to the target object; cf. Table II. This database defines the object distribution P_o ≡ P_{D_o}.
• B_C is a model (e.g., Bayesian network) learned with D_C as the input database; cf. Figure 2 (top).
• θ_C resp. θ_o are parameters learned for B_C using D_C resp. D_o as the input database.

Figure 3 illustrates these concepts and the system flow for computing an outlierness score. First, we learn a Bayesian network B_C for the entire population using a previous learning algorithm (see Section VI-C below). We then evaluate how well the class model fits the target object data. For vectorial data, the standard model fit metric is the log-likelihood of the target datapoint. For relational data, the counterpart is the relational log-likelihood (2) of the target database:

LOG(D_o, B_C, θ_C).    (3)

While this is a good baseline outlier score, it can be improved by considering scores based on the likelihood ratio, or log-likelihood difference:

LR(D_o, B_C, θ_o) ≡ LOG(D_o, B_C, θ_o) − LOG(D_o, B_C, θ_C).    (4)

The log-likelihood difference compares how well the class-level parameters fit the object data, vs. how well the object parameters fit the object data. In terms of the conditional probability parameters, it measures how much the log-conditional probabilities in the class distribution differ from those in the object distribution. Note that this definition applies only for relational data, where an individual is characterized by a substructure rather than a "flat" feature vector. Assuming maximum likelihood parameter estimation, LR is equivalent to the Kullback-Leibler divergence between the class-level and object-level parameters [8].
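Equation (4) can be sketched as the difference of two pseudo log-likelihood evaluations, both over the object's family-configuration frequencies. The helper names are our own, and the parameter values below are illustrative, in the spirit of the Figure 2 comparison of Wigan Athletic against the class model.

```python
import math

def log_likelihood(freqs, theta):
    """Pseudo log-likelihood (Equation 2) over family configurations."""
    return sum(f * math.log(theta[k]) for k, f in freqs.items())

def lr_score(object_freqs, theta_object, theta_class):
    """Log-likelihood difference LR (Equation 4): object parameters vs.
    class parameters, both evaluated on the object data frequencies."""
    return (log_likelihood(object_freqs, theta_object)
            - log_likelihood(object_freqs, theta_class))

key = ("Result", "win", ("hi", "hi"))
object_freqs = {key: 0.18}          # Wigan's frequency of this configuration
score = lr_score(object_freqs,
                 {key: 0.53},        # object parameter (Figure 2, bottom)
                 {key: 0.44})        # class parameter (Figure 2, top)
# score = 0.18 * (ln 0.53 - ln 0.44) > 0: the object parameters fit better
```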
While the LR score provides more outlier information than the model log-likelihood, it can be improved further by two transformations, as follows. (1) Decompose the joint probability into a single-feature component and a mutual information component. (2) Replace log-likelihood differences by log-likelihood distances. The resulting score is the log-likelihood distance (ELD), which is the main novel score we propose in this paper. Formally it is defined as follows for each feature i. The total score is the sum of feature-wise scores. Section IV below provides example computations.

ELD_i = Σ_{j=1}^{r_i} P_o(v_{ij}) | ln( θ_o(v_{ij}) / θ_C(v_{ij}) ) |
      + Σ_{j=1}^{r_i} Σ_{pa_i} P_o(v_{ij}, pa_i) | ln( θ_o(v_{ij} | pa_i) / θ_o(v_{ij}) ) − ln( θ_C(v_{ij} | pa_i) / θ_C(v_{ij}) ) |.    (5)

The first sum is the single-feature component, where each feature is considered independently of all others. It computes the expected log-distance between the object and the class models with respect to the single feature value probabilities. The second ELD sum is the mutual information component, based on the mutual information among all features. It computes the expected log-distance between the object and the class models with respect to the mutual information of feature value assignments. Intuitively, the first sum measures how the models differ if we treat each feature in isolation. The second sum measures how the models differ in terms of how strongly parent and child features are associated with each other.
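The two components of Equation (5) can be sketched for a single feature. This is our own minimal implementation, exercised on the high-correlation thought experiment of Section IV (outlier: features independent at 0.5; class: P(F2=1|F1=1) = 0.9, P(F2=1|F1=0) = 0.1); the data structures are assumptions for illustration.

```python
import math

def eld_feature(p_o_single, p_o_family, th_o, th_c, th_o_cond, th_c_cond):
    """ELD_i (Equation 5): expected absolute log-distance of the
    single-feature parameters, plus expected absolute log-distance of
    the mutual-information (lift) terms. Family keys are (value, parent)."""
    single = sum(p * abs(math.log(th_o[v] / th_c[v]))
                 for v, p in p_o_single.items())
    mutual = sum(p * abs(math.log(th_o_cond[k] / th_o[k[0]])
                         - math.log(th_c_cond[k] / th_c[k[0]]))
                 for k, p in p_o_family.items())
    return single + mutual

# High-correlation scenario: outlier independent, class strongly correlated.
p_o_single = {1: 0.5, 0: 0.5}
p_o_family = {(1, 1): 0.25, (0, 1): 0.25, (1, 0): 0.25, (0, 0): 0.25}
th_o = {1: 0.5, 0: 0.5}
th_c = {1: 0.5, 0: 0.5}
th_o_cond = {(1, 1): 0.5, (0, 1): 0.5, (1, 0): 0.5, (0, 0): 0.5}
th_c_cond = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.1, (0, 0): 0.9}
score = eld_feature(p_o_single, p_o_family, th_o, th_c, th_o_cond, th_c_cond)
# single-feature component is 0 (uniform singles); mutual component > 0
```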
A. Motivation
The motivation for the mutual information decomposition is two-fold. (1) Interpretability, which is very important for outlier detection. The single-feature components are easy to interpret since they involve no feature interactions. Each parent-child local factor is based on the average relevance of parent values for predicting the value of the child node, where relevance is measured by ln( θ(v_{ij} | pa_i) / θ(v_{ij}) ). This relevance term is basically the same as the widely used lift measure [35], and therefore an intuitively meaningful quantity. The ELD score compares how relevant a given parent condition is in the object data with how relevant it is in the general class. (2)
Avoiding cancellations.
Each term in the log-likelihood difference (4) decomposes into a relevance difference and a marginal difference:

ln( θ_o(v_{ij} | pa_i) / θ_C(v_{ij} | pa_i) ) = [ ln( θ_o(v_{ij} | pa_i) / θ_o(v_{ij}) ) − ln( θ_C(v_{ij} | pa_i) / θ_C(v_{ij}) ) ] + ln( θ_o(v_{ij}) / θ_C(v_{ij}) ).    (6)

These differences can have different signs for different child-parent configurations and cancel each other out; see Table IV. Taking distances as in Equation (5) avoids this undesirable cancellation. Since our goal is to assess the distinctness of an object, we do not want differences to cancel out. The general point is that averaging differences is appropriate when considering costs, or utilities, but not appropriate for assessing the distinctness of an object. For instance, the components of both vectors (0,0) and (1,-1) average to 0, but their distances differ.
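The cancellation effect is easy to demonstrate numerically. The following toy arithmetic is our own illustration: two signed log-ratio terms of equal magnitude and opposite sign sum to zero, while their absolute values (distances) do not.

```python
import math

# Two child-parent configurations whose log-ratios have opposite signs:
# ln(0.9/0.5) = +0.588 and ln(0.5/0.9) = -0.588.
terms = [math.log(0.9 / 0.5), math.log(0.5 / 0.9)]

signed = sum(terms)                    # differences cancel to 0
distance = sum(abs(t) for t in terms)  # distances accumulate
# signed == 0 exactly, yet the object's associations clearly differ;
# the distance-based score keeps this information.
```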
B. Comparison Outlier Scores
Our lesion study compares our log-likelihood distance ELD score to baselines that are defined by omitting a component of ELD. In this section we define these scores. The scores increase in sophistication in the sense that they apply more transformations of the log-likelihood ratio. More sophisticated scores provide more information about outliers. Table IV defines local feature scores; the total score is the sum of feature-wise scores. All metrics are defined such that a higher score indicates a greater anomaly. The metrics are as follows.

TABLE IV. BASELINE OUTLIER SCORES FOR BAYESIAN NETWORKS

FD_i = Σ_{j=1}^{r_i} P_o(v_{ij}) | ln( θ_o(v_{ij}) / θ_C(v_{ij}) ) |

−LOG_i = − Σ_{j=1}^{r_i} Σ_{pa_i} P_o(v_{ij}, pa_i) ln θ_C(v_{ij} | pa_i)

LR_i = Σ_{j=1}^{r_i} Σ_{pa_i} P_o(v_{ij}, pa_i) ln( θ_o(v_{ij} | pa_i) / θ_C(v_{ij} | pa_i) )

|LR_i| = Σ_{j=1}^{r_i} Σ_{pa_i} P_o(v_{ij}, pa_i) | ln( θ_o(v_{ij} | pa_i) / θ_C(v_{ij} | pa_i) ) |

LR+_i = Σ_{j=1}^{r_i} P_o(v_{ij}) ln( θ_o(v_{ij}) / θ_C(v_{ij}) ) + Σ_{j=1}^{r_i} Σ_{pa_i} P_o(v_{ij}, pa_i) [ ln( θ_o(v_{ij} | pa_i) / θ_o(v_{ij}) ) − ln( θ_C(v_{ij} | pa_i) / θ_C(v_{ij}) ) ]
Feature Divergence FD is the first component of the ELD score. It considers each feature independently (no feature correlations).
Log-Likelihood Score LOG is the standard model-based outlier detection score using the data likelihood.
Log-Likelihood Difference LR is the log-likelihood difference (4) between the class-level and object-level parameters.
Log-Likelihood Difference with absolute value |LR| replaces the differences in LR by distances.
Log-Likelihood Difference with decomposition LR+ applies a mutual information decomposition to LR.

IV. EXAMPLES
We provide three simple examples with only two features that illustrate the computation of the outlier scores. They are designed so that outliers and normal objects are easy to distinguish, and so that it is easy to trace the behavior of an outlier score. The examples therefore serve as thought experiments that bring out the strengths and weaknesses of model-based outlier scores. Figure 4 describes the BN representation of the examples. Table V provides the computation of the scores. For intuition, we can think of a soccer setting, where each match assigns a value to each attribute F_i, i = 1, 2, for each player. Scores for the F_2 feature are computed conditional on F_1. Expectation terms are computed first for F_1 = 1, then F_1 = 0. The single feature distributions are uniform, so the feature component of ELD is 0 for each node in both examples. The table illustrates the undesirable cancelling effects in LR. In the high correlation scenario 4(a), the outlier object has a lower probability than the normal class distribution of Match_Result = 1 given that Shot_Efficiency = 1; specifically, 0.5 vs. 0.9. The outlier object exhibits a higher probability of Match_Result = 1 than the normal class distribution, conditional on Shot_Efficiency = 0; specifically, 0.5 vs. 0.1. In line 1, column 2 of Table V the log-ratios ln(0.5/0.9) and ln(0.5/0.1) therefore have different signs. In the low correlation scenario 4(b), the cancelling occurs in the same way, but with the normal and outlier probabilities reversed. The cancelling effect is even stronger for attributes with more than two possible values.

[Figure 4 parameters. F2 = Match_Result throughout. (a) F1 = Shot_Efficiency. Normal = Striker: P(F1=1) = 50%, P(F2=1|F1=1) = 90%, P(F2=0|F1=0) = 90%. Outlier = Midfielder: P(F1=1) = 50%, P(F2=1) = 50%. (b) F1 = Tackle_Efficiency. Normal = Striker: P(F1=1) = 50%, P(F2=1) = 50%. Outlier = Midfielder: P(F1=1) = 50%, P(F2=1|F1=1) = 90%, P(F2=0|F1=0) = 90%. (c) F1 = Shots_On_Target. Normal = Striker: P(F1=1) = 10%, P(F2=1|F1=1) = 90%, P(F2=0|F1=0) = 90%. Outlier = Midfielder: P(F1=1) = 90%, P(F2=1|F1=1) = 90%, P(F2=0|F1=0) = 90%.]
Fig. 4. Illustrative Bayesian networks. The networks are not learned from data, but hand-constructed to be plausible for the soccer domain. (a) High Correlation: Normal individuals exhibit a strong association between their features, outliers no association. Both normals and outliers have a close to uniform distribution over single features. (b) Low Correlation: Normal individuals exhibit no association between their features, outliers have a strong association. Both normals and outliers have a close to uniform distribution over single features. (c) Single Attributes: Both normal and outlier individuals exhibit a strong association between their features. In normals, 90% of the time, feature 1 has value 0. For outliers, feature 1 has value 0 only 10% of the time.

TABLE V. EXAMPLE COMPUTATION OF DIFFERENT OUTLIER SCORES.

Table V(a): High Correlation Case, Figure 4(a).
LR:   F1: 1/2 ln(0.5/0.5) + 1/2 ln(0.5/0.5) = 0.
      F2|F1: 1/2 ln(0.5/0.9) + 1/2 ln(0.5/0.1) ≈ 0.51.
|LR|: F1: (no parents) 0.
      F2|F1: 1/2 |ln(0.5/0.9)| + 1/2 |ln(0.5/0.1)| ≈ 1.10.
FD:   F1: |ln(0.5/0.5)| = 0. F2: 1/2 |ln(0.5/0.5)| + 1/2 |ln(0.5/0.5)| = 0.
ELD:  mutual information component + FD ≈ 1.10 + 0 = 1.10.

Table V(b): Low Correlation Case, Figure 4(b).
LR:   F1: 1/2 ln(0.5/0.5) + 1/2 ln(0.5/0.5) = 0.
      F2|F1: 0.9 · ln(0.9/0.5) + 0.1 · ln(0.1/0.5) ≈ 0.37.
|LR|: F1: (no parents) 0.
      F2|F1: 0.9 · |ln(0.9/0.5)| + 0.1 · |ln(0.1/0.5)| ≈ 0.69.
FD:   F1: |ln(0.5/0.5)| = 0. F2: 1/2 |ln(0.5/0.5)| + 1/2 |ln(0.5/0.5)| = 0.
ELD:  mutual information component + FD ≈ 0.69 + 0 = 0.69.
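The high-correlation thought experiment can be checked by direct arithmetic. The following is our own verification sketch: the outlier's F2|F1 terms partially cancel in LR but accumulate in |LR|, using the Figure 4(a) parameters (class conditionals 0.9 and 0.1, outlier conditionals uniform at 0.5).

```python
import math

# Each (F2, F1) log-ratio carries object probability mass 1/2 in total,
# since the outlier's distribution is uniform over both features.
p = 0.5
lr_f2 = p * math.log(0.5 / 0.9) + p * math.log(0.5 / 0.1)
alr_f2 = p * abs(math.log(0.5 / 0.9)) + p * abs(math.log(0.5 / 0.1))
# lr_f2 is shrunk by cancellation; alr_f2 keeps the full magnitude
```

The signed score lr_f2 understates the anomaly relative to the distance-based alr_f2, which is the motivation for using absolute log-distances in ELD.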
V. RELATED WORK

Outlier detection is a densely researched field; for a survey, see [2]. Figure 5 provides a tree picture of where our method is situated with respect to other outlier detection methods and other data models. Our method falls in the category of unsupervised statistical model-based approaches. To our knowledge, ours is the first model-based method tailored for object-relational data. Like other model-based approaches, it detects global outliers.
Aggarwal [2] defines a global outlier to be a data point that notably deviates from the rest of the population. We review relevant approaches from different data models: the most common atomic object model, where data is represented by vectors, and structured data models. Akoglu et al. provide an excellent recent survey of outlier detection in relational models [3]. a) Attribute Vector Data Model:
By far most work on outlier detection considers atomic objects with flat feature vectors. This leads to an impedance mismatch: the required input format for these outlier detection methods is a single data matrix, not a structured dataset. For example, one cannot provide a relational database as input. This mismatch is not simply a question of choosing a file format, but instead reflects a different underlying data model: complex objects with both attributes and component objects vs. atomic objects with attributes only. It is possible to "flatten" structured data by converting it to unstructured feature vectors, for instance by using aggregate functions. We evaluated the aggregation approach in this paper by applying three standard methods for outlier detection.

Fig. 5. A tree structure for related work on outlier detection for structured data. A path specifies an outlier detection problem; the leaves list major approaches to the problem. Approaches in italics appear in experiments.

Work on atomic contextual outliers [34] is like ours in that it considers the distinctness of a target individual from a reference class. A reference class is not specified for each object, but is constructed as part of outlier detection. Our work could be combined with a class discovery approach by providing a score of how informative the inferred classes are. b) Structured Data Models:
We discuss related techniques in three types of structured data models: SQL (relational), XML (hierarchical), and OLAP (multi-dimensional).

For relational data, many outlier detection approaches aim to discover rules that represent the presence of anomalous associations for an individual or the absence of normal associations [19], [12]. The survey by [23] unifies, within a general rule search framework, related tasks such as exception mining, which looks for associations that characterize unusual cases; subgroup mining, which looks for associations characterizing important subgroups; and contrast space mining, which looks for differences between classes. Another rule-based approach uses Inductive Logic Programming techniques [4]. While local rules are informative, they are not based on a global statistical model and do not provide a single outlier score for each individual.

A latent variable approach in information networks ranks potential outliers in reference to the latent communities inferred by network analysis [12]. Our model aggregates information from entities and links of different types, but does not assume that different communities have been identified.

Extended from Riahi and Schulte, 2015 IEEE Symposium Series on Computational Intelligence.

Koh et al. [17] propose a method for hierarchical structures represented in XML document trees. Their aim is to identify feature outliers, not class outliers as in our work. Also, they use aggregate functions to convert the object hierarchy into feature vectors. Their outlier score is based on local correlations, and they do not construct a model.

The multi-dimensional data model defines numeric measures for a set of dimensions. The differences in the two data models mean that multi-dimensional outlier detection models [30] do not carry over to object-relational outlier detection. (1) The object data model allows but does not require any numeric measures.
In our datasets, all features are discrete. Nor do we assume that it is possible to aggregate numeric measures to summarize lower-level data at higher levels. (2) In scoring a potential outlier object, our method considers other objects both below and above the target object in the component hierarchy. OLAP exploration methods consider only cells below or at the same level as the target cell. For example, in scoring a player, our method would consider features of the player’s team. Also, the
ELD outlier score of an object is not determined by the outlier scores of its components, in contrast to the approach of Sarawagi et al. [30]. (3) Our approach models a joint distribution over features, exploiting correlations among features. Most of the OLAP-based methods consider only a single numeric measure at a time, not a joint model.

Statistical data cleaning methods are related to outlier detection, in that erroneous data may be detected as outliers (e.g., [7]). Nonetheless, these data cleaning methods differ from our work in several ways. 1) Although they often originate in the database community, they are usually developed only for single-table propositional data, not relational data. (An exception is the ERACER system [20].) 2) Our work assumes that the data is (mainly) correct, and identifies exceptional entities for the given data. 3) Data cleaning methods focus on unusual values or tuples (e.g., a mistaken rating for a movie by a user), not exceptional subdatabases or egonets.

VI. EXPERIMENTAL DESIGN
All the experiments were performed on a 64-bit CentOS machine with 4 GB RAM and an Intel Core i5-480M processor. The likelihood-based outlier scores were computed with SQL queries using JDBC, JRE 1.7.0, and MySQL Server version 5.5.34. We describe the datasets used in our experiments.
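As a hedged illustration of the SQL-based computation (a toy in-memory schema with illustrative names, not the actual Opta tables or the paper's queries), the sufficient statistics behind the likelihood scores are feature-value counts, which a GROUP BY query delivers directly:

```python
import sqlite3

# Toy in-memory schema, loosely mirroring per-match player features;
# the table and column names are illustrative, not the real Opta schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PlayerMatch (player TEXT, match_id INT, DribbleEff TEXT)")
conn.executemany(
    "INSERT INTO PlayerMatch VALUES (?, ?, ?)",
    [("dzeko", m, "low" if m < 6 else "high") for m in range(38)],
)

# Sufficient statistics for one player's object database: feature-value counts.
counts = dict(conn.execute(
    "SELECT DribbleEff, COUNT(*) FROM PlayerMatch "
    "WHERE player = 'dzeko' GROUP BY DribbleEff"
).fetchall())
print(counts["low"], counts["high"])  # 6 32
```

The same pattern scales to joint counts over a feature and its parents by grouping on several columns.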
A. Synthetic Datasets
We generated three synthetic datasets with normal and outlier players, using the distributions represented in the three Bayesian networks of Figure 4. Each player participates in 38 matches, similar to the real-world data. The main goal of designing synthetic experiments is to test the methods on easy-to-detect outliers. Each match assigns a value to each feature F_i for each player.

High Correlation: see Figure 4(a).
Low Correlation: see Figure 4(b).
Single features: see Figure 4(c).

We used the mlbench package in R to generate synthetic features in matches, following these distributions for 240 normal players and 40 outliers. We followed the real-world Opta data in terms of the number of normal and outlier individuals. The scores are used to rank all 280 players.

TABLE VI. Outlier/normal objects in real-world datasets.

B. Real-World Datasets
Data tables are prepared from Opta data [21] and IMDb [14]. Our datasets and code are available online [15]. a) Soccer Data:
The Opta data were released by Manchester City. They list box scores, that is, counts of all the ball actions within each game by each player, for the 2011-2012 season. For each player in a match, our data set contains eleven player features. For each team in a match, there are five features computed as player feature aggregates, as well as the team formation and the result (win, tie, loss). There are two relationships,
Appears_Player(P, M) and Appears_Team(T, M). b) IMDB Data: The Internet Movie Database (IMDB) is an online database of information related to films, television programs, and video games. The IMDB website offers a dataset containing information on cast, crew, titles, technical details, and biographies in a set of compressed text files. We preprocessed the data as in [25] to obtain a database with seven tables: one for each population and one for each of the three relationships
Rated(User, Movie), Directs(Director, Movie), and ActsIn(Actor, Movie).

In real-world data, there is no ground truth about which objects are outliers. To address this issue, we employ a one-class design: we learn a model for the class distribution, with data from that class only. Then we rank all individuals from the normal class together with all objects from a contrast class treated as outliers, to test whether an outlier score recognizes objects from the contrast class as outliers. Table VI shows the normal and contrast classes for three different datasets. In-class outliers are possible; e.g., unusual strikers are still members of the striker class. Our case studies describe a few in-class outliers. In the soccer data, we considered only individuals who played more than 5 matches out of a maximum of 38.

C. Methods Compared
We compare two types of approaches, and within each approach several outlier detection methods. The first approach evaluates the likelihood-based outlier scores described in Section III. For relational Bayesian network structure learning we utilize the previous learn-and-join algorithm (LAJ), which is a state-of-the-art BN structure learning method for relational data [32]. The LAJ algorithm employs an iterative deepening strategy, which can be described as a search through a lattice of table joins. For each table join, different BNs are learned and the learned edges are propagated from smaller to larger table joins. For a full description, complexity analysis, and learning time measurements, please see [32]. We used the implementation of the LAJ algorithm due to its creators [15].

The second approach first “flattens” the structured data into a matrix of feature vectors, then applies standard matrix-based outlier detection methods. We refer to such methods as aggregation-based (cf. Figure 5). For example, this was the approach taken by Breunig et al. for identifying anomalous players in sports data [5]. Following their paper, for each continuous feature in the object data, we use the average over its values, and for each discrete feature, we use the occurrence count of each feature value in the object data. Aggregation tends to lose information about correlations. Our experiments address the empirical question of whether this loss of information affects outlier detection. We evaluated three standard matrix-based outlier detection methods: density-based LOF [5], distance-based KNNOutlier [27], and subspace analysis with OutRank [22]. These represent common, fundamental approaches for vectorial data. Like ELD, subspace analysis is sensitive to correlations among features. We used the available implementations of all three data matrix methods from the state-of-the-art data mining software ELKI [1]. We used PROCLUS as the clustering function for OutRank, as recommended by [22].

VII. EMPIRICAL RESULTS
We present results regarding computational feasibility, predictive performance, and case studies.
1) Computational Cost of the ELD Score: Table VII shows that the computation of the ELD value for a given target object is feasible. On average, it takes a quarter of a minute for each soccer player, and one minute for each movie. This includes the time for parameter learning from the object database. Learning the class model BN takes longer, but needs to be done only once for the entire object class.
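Parameter learning from an object database amounts to simple conditional frequency counts. The following sketch is a minimal illustration under assumptions (hypothetical feature names and a Laplace smoothing choice; not the paper's exact estimator):

```python
from collections import Counter

def estimate_cpt(rows, child, parents, child_values, alpha=1.0):
    """Estimate P(child | parent configuration) by smoothed conditional
    frequencies. The same routine can be run on the whole class database
    (class-level parameters) or on one object's database (object-level)."""
    joint = Counter()
    parent_counts = Counter()
    for r in rows:
        cfg = tuple(r[p] for p in parents)
        joint[(cfg, r[child])] += 1
        parent_counts[cfg] += 1
    cpt = {}
    for cfg in parent_counts:
        denom = parent_counts[cfg] + alpha * len(child_values)
        cpt[cfg] = {v: (joint[(cfg, v)] + alpha) / denom for v in child_values}
    return cpt

# Hypothetical per-match rows for one player (not the real Opta features).
rows = [{"ShotEff": "high", "DribbleEff": "low"},
        {"ShotEff": "high", "DribbleEff": "low"},
        {"ShotEff": "low", "DribbleEff": "high"}]
cpt = estimate_cpt(rows, "DribbleEff", ["ShotEff"], ["low", "high"])
print(cpt[("high",)]["low"])  # (2 + 1) / (2 + 2) = 0.75
```

The smoothing term keeps all parameters strictly positive, so log-parameters stay finite when likelihood scores are computed.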
The BN model provides a crucial low-dimensional representation of the distribution information in the data. Table VIII compares the number of terms required to compute the ELD score in the BN representation to the number of terms in an unfactored representation with one parameter for each joint probability.
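To see why the factored representation helps, one can count parameters directly. The sketch below uses toy cardinalities and an assumed chain structure, not the networks learned in the paper:

```python
import math

def joint_param_count(cards):
    """Parameters in an unfactored joint distribution:
    one per value combination, minus one for normalization."""
    return math.prod(cards) - 1

def bn_param_count(cards, parents):
    """Parameters in a Bayesian network: each node contributes
    (|values| - 1) free parameters per parent configuration."""
    total = 0
    for node, card in enumerate(cards):
        configs = math.prod(cards[p] for p in parents.get(node, []))
        total += (card - 1) * configs
    return total

# Toy example: five ternary features with an assumed chain structure
# F0 -> F1 -> F2 -> F3 -> F4.
cards = [3, 3, 3, 3, 3]
chain = {1: [0], 2: [1], 3: [2], 4: [3]}
print(joint_param_count(cards), bn_param_count(cards, chain))  # 242 26
```

Even in this small example the factored count is an order of magnitude smaller; the gap widens exponentially with the number of features.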
TABLE VII. Time (min) for computing the ELD score.

Dataset                 | Class Model | Average per Object
Strikers vs. Goalies    | 4.14        | 0.25
Midfielder vs. Goalies  | 4.02        | 0.25
Drama vs. Comedy        | 8.30        | 1.00
TABLE VIII. The Bayesian network representation decreases the number of terms required for computing the ELD score.
2) Detection Accuracy:
We follow the evaluation design of [12] and make each baseline method detect the same percentage of objects as outliers: sort the outlier scores obtained by the three baseline methods in descending order, and take the top r percent as outliers. Then we use precision, a.k.a. true positive rate, as the evaluation metric, which is the percentage of correct ones in the set of outliers identified by the algorithm. As in [12], we set the percentages of outliers to be 1% and 5%. In the one-class design, precision measures how many members of the outlier class were correctly recognized. We also report some AUC measurements [2], which aggregate precision values at different percentage cutoffs.

TABLE IX. Precision of outlier scores in different datasets. Columns: dataset and percentage (1% and 5%); model-based scores ELD, |LR|, LR, FD, LOG; aggregation-based methods LOF, OutRank, KNNOutlier.

TABLE X. AUC of ELD vs. |LR|. Columns: score; High-Cor.; Low-Cor.; Single-F.; Striker; Midfielder; Drama. Rows: ELD, |LR|.

a) Likelihood-Based Methods: Table IX shows the
precision values for each probabilistic ranking. Our ELD score achieves the top score on each dataset. On the synthetic data, ELD and |LR| are the only scores with 100% precision at 1% and 5%. This confirms the value of using distances rather than differences. While it ought to be easy to distinguish the outliers, Table X shows that ELD is the only score that achieves perfect detection, that is, AUC = 1.0. b) Aggregation-Based Methods vs. ELD:
Table IX shows the precision values for aggregation-based methods compared to ELD. Our ELD score outperforms all aggregation-based methods on all datasets, except for a tie with OutRank (PROCLUS) on the relatively easy problem of distinguishing strikers from goalies. The performance of the aggregation-based methods is most like that of the probabilistic score FD, which does not consider the correlations among the features. This finding reflects the fact that aggregation tends to lose information about correlations. The aggregation-based methods achieve their highest performance on the Strikers vs. Goalies dataset. In this dataset, action count features such as ShotsOnTarget and ShotEfficiency point to strikers, and the feature SavesMade points to goalies. Therefore, outliers in this dataset are easy to find by considering features in isolation.
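The feature-wise behavior that the aggregation baselines resemble can be made concrete. The sketch below is a simplified single-feature analogue of the paper's feature divergence, with toy numbers; the full ELD score additionally includes a mutual information term over parent-child correlations, which is omitted here:

```python
import math

def expected_log_distance(theta_obj, theta_cls):
    """Expected L1 distance between object-level and class-level
    log-parameters, taken under the object's distribution. A simplified,
    single-feature analogue of a feature-divergence score; it treats the
    feature in isolation, ignoring correlations with other features."""
    return sum(p * abs(math.log(p) - math.log(q))
               for p, q in zip(theta_obj, theta_cls) if p > 0)

# Hypothetical distributions over one ternary feature (toy numbers).
theta_obj = [0.16, 0.34, 0.50]   # object-level frequencies
theta_cls = [0.50, 0.30, 0.20]   # class-level frequencies
print(round(expected_log_distance(theta_obj, theta_cls), 3))  # 0.683
```

Because the absolute value replaces a signed difference, deviations in opposite directions cannot cancel, which is the "distances rather than differences" point made above.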
3) Case Studies:
For a case study, we examine three top outliers as ranked by ELD, shown in Table XI. The aim of the case study is to provide a qualitative sense of the outliers indicated by the scores. Also, we illustrate how the BN representation leads to an interpretable ranking. Specifically, we employ a feature-wise decomposition of the score combined with a drill-down analysis:
1) Find the node V_i that has the highest ELD_i divergence score for the outlier object.
2) Find the parent-child combination that contributes the most to the ELD_i score for that node.
3) Decompose the ELD score for the parent-child combination into its feature and mutual information components.
(Our ELD score also performs the best with other metrics, such as recall, to a similar degree.)
We present strong associations, indicated by ELD's mutual information component, in the intuitive format of association rules. a) Strikers vs. Goalies:
In real-world data, a rare object may be a within-class outlier, i.e., highly anomalous even within its class. In an unsupervised setting without class labels, we do not expect an outlier score to distinguish such an in-class outlier from outliers outside the class. An example is the striker Edin Dzeko. He is a highly anomalous striker who obtains the top ELD divergence score among both strikers and goalies. His ELD score is highest for the Dribble Efficiency feature. The highest ELD score for that feature occurs when Dribble Efficiency is low and its parents have the following values: Shot Efficiency high, Tackle Efficiency medium. Looking at the single feature divergence, we see that Edin Dzeko is indeed an outlier in the Dribble Efficiency subspace: his dribble efficiency is low in 16% of his matches, whereas a randomly selected striker has low dribble efficiency in 50% of their matches. Thus, Edin Dzeko is an unusually good dribbler. Looking at the mutual information component of ELD, i.e., the parent-child correlations, for Edin Dzeko the confidence of the rule ShotEff = high, TackleEff = medium → DribbleEff = low is 50%, whereas it differs markedly in the general striker class. b) Midfielders vs. Strikers: For the single feature score, Robin van Persie is recognized as a clear striker because of the ShotsOnTarget feature. It makes sense that strikers shoot on target more often than midfielders. Robin van Persie achieves a high number of shots on target in 34% of his matches, compared to 3% for a random midfielder (Table XI). The mutual information component shows that he also exhibits unusual correlations. For example, the confidence of the rule ShotEff = high, TimePlayed = high → ShotsOnTarget = high is 70% for van Persie, whereas for strikers overall it is 52%. The most anomalous midfielder is Scott Sinclair. His most unusual feature is DribbleEfficiency: for feature divergence, he achieves a high dribble efficiency 50% of the time, compared to 30% for a random midfielder (Table XI). c) Drama vs. Comedy: The top outlier rank is assigned to the within-class outlier BraveHeart. Its most unusual feature is ActorQuality: in a random drama movie, 42% of actors have the highest quality level 4, whereas for BraveHeart 93% of actors achieve the highest quality level. The ELD score identifies the comedies BluesBrothers and AustinPowers as the top out-of-class outliers. In a random drama movie, 49% of actors have casting position 3, whereas for AustinPowers 78% of actors have this casting position, and for BluesBrothers 88% of actors do.

VIII. CONCLUSION
We presented a new approach for applying Bayes nets to object-relational outlier detection, a challenging and practically important topic for machine learning. This approach follows the general framework of Exceptional Model Mining [10], and applies it to multi-relational data. The key idea is to learn one set of parameter values that represent class-level associations, another set to represent object-level associations, and compare how well each parametrization fits the relational data that characterize the target object. The classic metric for comparing two parametrized models is their log-likelihood ratio; we refined this concept to define a new relational log-likelihood distance metric via two transformations: (1) a mutual information decomposition, and (2) replacing log-likelihood differences by log-likelihood distances. This metric combines a single feature component, where features are treated as independent, with a correlation component that measures the deviation in the features' mutual information.

In experiments on three synthetic and three real-world outlier sets, the log-likelihood distance achieved the best detection accuracy. The alternative of converting the structured data to a flat data matrix via aggregation had a negative impact. Case studies showed that the log-distance score leads to easily interpreted rankings.

There are several avenues for future work. (i) A limitation of our current approach is that it ranks potential outliers, but does not set a threshold for a binary identification of outlier vs. non-outlier. (ii) Our divergence uses expected L1-distance for interpretability, but other distance scores like L2 could be investigated as well.
(iii) Extending the expected L1-distance to continuous features is a useful addition.

In sum, outlier metrics based on model likelihoods are a new type of structured outlier score for object-relational data. Our evaluation indicates that this model-based score provides informative, interpretable, and accurate rankings of objects as potential outliers.

ACKNOWLEDGEMENT
This work was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada. We are indebted to Peter Flach for referring us to the EMM framework.

REFERENCES

[1] E. Achtert, H. Kriegel, E. Schubert, and A. Zimek. Interactive data mining with 3D-parallel coordinate trees. In Proceedings of the 2013 ACM SIGMOD, New York, NY, USA, 2013.
[2] C. Aggarwal. Outlier Analysis. Springer New York, 2013.
[3] L. Akoglu, H. Tong, and D. Koutra. Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery, 29(3):626–688, 2015.
[4] F. Angiulli, G. Greco, and L. Palopoli. Outlier detection by logic programming. ACM Transactions on Computational Logic, 2004.
[5] M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of ACM SIGMOD, 2000.
[6] A. Cansado and A. Soto. Unsupervised anomaly detection in large databases using Bayes nets. Applied Artificial Intelligence, 2008.
[7] S. De, Y. Hu, V. V. Meduri, Y. Chen, and S. Kambhampati. BayesWipe: A scalable probabilistic framework for improving data quality. Journal of Data and Information Quality (JDIQ), 8(1):5, 2016.
[8] L. de Campos. A scoring function for learning Bayes nets based on mutual information and conditional independence tests. Journal of Machine Learning Research, 2006.
TABLE XI. Case study for the top outliers returned by the log-likelihood distance score ELD.

Strikers (Normal) vs. Goalies (Outlier)
PlayerName       | Position   | ELD Rank | ELD Max Node      | ELD Node Score | FD Max Feature Value | Object Probability | Class Probability
Edin Dzeko       | Striker    | 1        | DribbleEfficiency | 83.84          | DE=low               | 0.16               | 0.5
Paul Robinson    | Goalie     | 2        | SavesMade         | 49.4           | SM=Medium            | 0.3                | 0.04
Michel Vorm      | Goalie     | 3        | SavesMade         | 85.9           | SM=Medium            | 0.37               | 0.04

Midfielders (Normal) vs. Strikers (Outlier)
PlayerName       | Position   | ELD Rank | ELD Max Node      | ELD Node Score | FD Max Feature Value | Object Probability | Class Probability
Robin Van Persie | Striker    | 1        | ShotsOnTarget     | 153.18         | ST=high              | 0.34               | 0.03
Wayne Rooney     | Striker    | 2        | ShotsOnTarget     | 113.14         | ST=high              | 0.26               | 0.03
Scott Sinclair   | Midfielder | 6        | DribbleEfficiency | 71.9           | DE=high              | 0.5                | 0.3

Drama (Normal) vs. Comedy (Outlier)
MovieTitle       | Genre      | ELD Rank | ELD Max Node      | ELD Node Score | FD Max Feature Value | Object Probability | Class Probability
Brave Heart      | Drama      | 1        | ActorQuality      | 89995.4        | a_quality=4          | 0.93               | 0.42
Austin Powers    | Comedy     | 2        | Cast_Position     | 61021.28       | Cast_Num=3           | 0.78               | 0.49
Blues Brothers   | Comedy     | 3        | Cast_Position     | 24432.21       | Cast_Num=3           | 0.88               | 0.49

[9] P. Domingos and D. Lowd.
Markov Logic: An Interface Layer for Artificial Intelligence. Morgan and Claypool Publishers, 2009.
[10] W. Duivesteijn, A. J. Feelders, and A. Knobbe. Exceptional model mining. Data Mining and Knowledge Discovery, 30(1):47–98, 2016.
[11] P. A. Flach. Knowledge representation for inductive learning. In Symbolic and Quantitative Approaches to Reasoning and Uncertainty, pages 160–167. Springer, 1999.
[12] J. Gao, F. Liang, W. Fan, Y. Wang, and J. Han. On community outliers and their detection in information networks. In Proceedings of ACM SIGKDD, 2010.
[13] L. Getoor and B. Taskar. Introduction to Statistical Relational Learning. MIT Press, 2007.
[14] The Internet Movie Database (IMDb). [Online].
[15] [Online]. Available: ∼oschulte/jbn/.
[16] A. Kimmig, L. Mihalkova, and L. Getoor. Lifted graphical models: a survey. Computing Research Repository, 2014.
[17] J. L. Koh, M. L. Lee, W. Hsu, and W. T. Ang. Correlation-based attribute outlier detection in XML. In Proceedings of ICDE, 2008.
[18] D. Koller and A. Pfeffer. Object-oriented Bayes nets. In Proceedings of UAI, 1997.
[19] J. Maervoet, C. Vens, G. Vanden Berghe, H. Blockeel, and P. De Causmaecker. Outlier detection in relational data: A case study. Expert Systems with Applications, 2012.
[20] C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010.
[21] Opta Sports data.
[22] E. Müller, I. Assent, P. Iglesias, Y. Mülle, and K. Böhm. Outlier ranking via subspace analysis in multiple views of the data. In Proceedings of ICDM, 2012.
[23] P. K. Novak, G. I. Webb, and S. Wrobel. Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. Journal of Machine Learning Research, 2009.
[24] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[25] V. Peralta. Extraction and Integration of MovieLens and IMDb. Technical report, APDM project, 2007.
[26] D. Poole. First-order probabilistic inference. In Proceedings of IJCAI, 2003.
[27] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of ACM SIGMOD, 2000.
[28] F. Riahi and O. Schulte. Codes and Datasets. [Online]. Available: ftp://ftp.fas.sfu.ca/pub/cs/oschulte/CodesAndDatasets/, 2015.
[29] F. Riahi and O. Schulte. Model-based outlier detection for object-relational data. In 2015 IEEE Symposium Series on Computational Intelligence, pages 1590–1598. IEEE, 2015.
[30] S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. In Proceedings of the International Conference on Extending Database Technology. Springer-Verlag, 1998.
[31] O. Schulte. A tractable pseudo-likelihood function for Bayes nets applied to relational data. In Proceedings of SIAM SDM, 2011.
[32] O. Schulte and H. Khosravi. Learning graphical models for relational data via lattice search. Machine Learning, 2012.
[33] O. Schulte, H. Khosravi, and T. Man. Learning directed relational models with recursive dependencies. Machine Learning, 89:299–316, 2012.
[34] G. Tang, J. Bailey, J. Pei, and G. Dong. Mining multidimensional contextual outliers from categorical relational data. In Proceedings of SSDBM, 2013.
[35] S. Tuffery. Data Mining and Statistics for Decision Making. Wiley Series in Computational Statistics, 2011.