A Two-Stage Bayesian Semiparametric Model for Novelty Detection with Robust Prior Information
Francesco Denti∗, Andrea Cappozzo†, Francesca Greselin†

∗ Department of Statistics and Computer Science, University of California, Irvine — [email protected]
† Department of Statistics and Quantitative Methods, University of Milano-Bicocca — [email protected], [email protected]

Abstract
Standard novelty detection methods aim at bi-partitioning the test units into already observed and previously unseen patterns. However, two significant issues arise: there may be considerable interest in identifying specific structures within the latter subset, and contamination in the known classes could completely blur the actual separation between manifest and new groups. Motivated by these problems, we propose a two-stage Bayesian nonparametric novelty detector, building upon prior information robustly extracted from a set of complete learning units. A general-purpose multivariate methodology, as well as an extension suitable for functional data objects, are devised. Some theoretical properties of the associated semiparametric prior are investigated. Moreover, we propose a suitable ξ-sequence to construct an independent slice-efficient sampler that takes into account the difference between manifest and novelty components. We showcase our model performance through an extensive simulation study and applications on both multivariate and functional datasets, in which diverse and distinctive unknown patterns are discovered.

1 Introduction

Supervised classification with unobserved classes aims at predicting a qualitative output for a test set by training a classifier on a fully-labeled training set, where the former may contain classes not previously observed in the latter. This is usually not contemplated in a standard framework, where the learning units are assumed to be outlier-free realizations from all the sub-groups comprising the target population. However, this hypothesis may not hold in general, as in the case of evolving systems, where the presence of novel species is an important issue; in the analysis of social networks, whose configurations continuously expand and evolve; or in the presence of rare classes. Relevant examples include, but are not limited to, radar target detection (Carpenter et al., 1997), detection of masses in mammograms (Tarassenko et al., 1995), handwritten digit recognition (Tax and Duin, 1998) and e-commerce (Manikopoulos and Papavassiliou, 2002), for which labeled observations may not be available for each and every group. Within the model-based family of classifiers, adaptive methods have recently appeared in the literature to overcome this limitation. Miller and Browning (2003) pioneered a mixture model methodology for class discovery, robust classification, and outlier rejection. Bouveyron (2014) introduced an adaptive classifier in which two algorithms, based respectively on transductive and inductive learning, are devised for inference. More recently, Fop et al. (2018) extended the original work of Bouveyron (2014) by accounting for unobserved classes and extra variables in high-dimensional discriminant analysis.

Suppose we now want to distinguish between novelties, i.e., test observations displaying a specific common pattern, and anomalies, i.e., outlying units that can be regarded as noise. This further feature is addressed in Cappozzo et al. (2019), where, using impartial trimming (Gordaliza, 1991), a fixed percentage of data points is left unmodeled and no distributional assumptions are a priori imposed on the trimmed units.
While the distinction between a novel and an anomalous entity is most often apparent in practice, there exist some circumstances under which such separation is vague and somewhat philosophical. Consider, for example, an evolving ecosystem in which novel species are likely to appear over time. It may happen that at an initial instant $t_0$ a real novelty is mistaken for mere noise due to its embryonic stage. Nonetheless, if the model were to be fitted at a subsequent instant $t_1 > t_0$, the increased number of such samples could be sufficient to acknowledge an actual novel species.

Motivated by all these arguments, we propose a Bayesian Robust Adaptive Novelty Detector (BRAND), a two-stage procedure in a semiparametric Bayesian framework for simultaneously dealing with outliers and hidden classes that may be present in the test set. To sketch the idea, we first learn the known patterns from the labeled dataset using procedures that are robust against both outliers and label noise. In the second phase, we translate the training insights into informative priors, and we model the test units with a Bayesian mixture of known groups plus a novelty term. To reflect the lack of knowledge on the novelty term, we resort to a Dirichlet Process mixture model. The adoption of this nonparametric methodology allows us to overcome the problematic and somewhat unnatural specification of the number of mixture components in the novel group.

The rest of the manuscript is organized as follows: in Section 2, we present our two-stage methodology for novelty detection. We dedicate Section 3 to the investigation of the clustering properties induced by the random measure underlying our model. In Section 4, we propose a functional extension of the multivariate model, delineating a novelty detection method suitable for functional data. Section 5 discusses posterior inference, while in Section 6 we present an extensive simulation study and applications to multivariate and functional data. Concluding remarks and further research directions are outlined in Section 7.

2 A Two-Stage Bayesian procedure for Novelty Detection
Given a classification framework, consider the complete collection of learning units $X = \{(x_n, l_n)\}_{n=1}^{N}$, where $x_n$ denotes a $p$-variate observation and $l_n = j \in \{1, \ldots, J\}$ its associated group label. Both terms are directly available, and the distinct values in $l_n$, $n = 1, \ldots, N$, represent the $J$ observed classes, with subset sizes $n_1, \ldots, n_J$. Correspondingly, let $\{(y_m, z_m)\}_{m=1}^{M}$ be the test set where, differently from the standard assumptions, the unknown labels $z_m$ could belong to a set that encompasses more elements than $\{1, \ldots, J\}$. That is, a countable number of extra classes may be “hidden” in the test set, with no prior information available on their magnitude or on their structure. Therefore, it is reasonable to account for the novelty term via a single flexible component, from which a dedicated post-processing procedure may reveal circumstantial patterns (see Section 2.3). Both $x_n$ and $y_m$ are independent realizations of a continuous random vector (or function, see Section 4) $X$, whose conditional distribution varies according to the associated class labels. In the upcoming Sections, we assume that each observation in class $j$ is independent multivariate Gaussian, having density $\phi(\cdot \mid \Theta_j)$ with location-scale parameter $\Theta_j = (\mu_j, \Sigma_j)$, where $\mu_j \in \mathbb{R}^p$ denotes the mean vector and $\Sigma_j$ the corresponding covariance matrix. This allows for the automatic implementation of standard powerful methods in the training information extraction (see Section 2.1). Notwithstanding, the proposed methodology is general enough that it can be easily extended to deal with different component distributions.

The main modeling purpose is to classify the observations in the test set either into one of the $J$ observed classes or into the novel component. At the same time, we investigate the presence of homogeneous groups in the novelty term, discriminating between unseen components and outliers. In doing so, a two-stage strategy is devised. The first phase, described in Section 2.1, relies on a class-wise robust procedure for extracting prior information from the training set. Afterwards, the semiparametric Bayesian model, which is the main methodological contribution of the present paper, is fitted to the test units. A full account of its definition is reported in Section 2.2.

2.1 Robust extraction of prior information

The first step of our procedure is designed to obtain reliable estimates $\hat{\Theta}_j$ for the parameters of the observed classes from the learning set. To this aim, one could employ standard methods, such as the MLE in a classical framework or the MAP/posterior estimates in a Bayesian setting. Nonetheless, these standard approaches are not robust against contamination, and the presence of only a few outlying points could entirely bias the subsequent Bayesian model, should the informative priors be improperly set. A direct consequence of this undesirable behavior is reported in the simulation study of Section 6.1. To this end, we opt for more sophisticated alternatives that are able to deal with outliers and label noise when it comes to learning the structure of the observed classes. In particular, the selected methodologies involve the Minimum Covariance Determinant (MCD) estimator (Rousseeuw, 1984; Hubert et al., 2018) and, when facing high-dimensional data (as in the functional case of Section 6.3), the Minimum Regularized Covariance Determinant (MRCD) estimator (Boudt et al., 2020).
Clearly, at this stage any robust estimator of multivariate scatter and location may be employed to solve the problem: see, for instance, the comparison study reported in Maronna and Yohai (2017) for a (non-exhaustive) list of suitable candidates. We decide to rely on the MCD and its high-dimensional extension, the MRCD, for their well-established efficacy in the classification framework (Hubert and Van Driessen, 2004) and for the direct availability of fast algorithms for inference, readily implemented in the rrcov R package (Todorov and Filzmoser, 2009). We briefly recall the main MCD and MRCD features in the remainder of this Section. For a thorough treatment, the interested reader is referred to Hubert and Debruyne (2010) and Boudt et al. (2020), respectively.

The MCD is an affine equivariant and highly robust estimator of multivariate location and scatter, for which a fast algorithm is available (Rousseeuw and Driessen, 1999). The raw MCD estimator with parameter $\eta^{MCD} \in [0.5, 1]$, with $\lfloor (N + p + 1)/2 \rfloor \le \lfloor \eta^{MCD} N \rfloor \le N$, defines the following location and dispersion estimates:

• $\hat{\mu}^{MCD}$ is the mean of the $\lfloor \eta^{MCD} N \rfloor$ observations for which the determinant of the sample covariance matrix is minimal;

• $\hat{\Sigma}^{MCD}$ is the corresponding covariance matrix, multiplied by a consistency factor $c$ (Croux and Haesbroeck, 1999);

with $\lfloor \cdot \rfloor$ denoting the floor function. The MCD is a consistent, asymptotically normal and highly robust estimator, with bounded influence function and breakdown value equal to $(1 - \lfloor \eta^{MCD} N \rfloor / N)$ (Butler et al., 1993; Cator and Lopuhaä, 2012). However, a major drawback is its inapplicability when the data dimension $p$ exceeds the subset size $\lfloor \eta^{MCD} N \rfloor$, as the covariance matrix of any $\lfloor \eta^{MCD} N \rfloor$-subset becomes singular. This situation arises quite often in our context, as the MCD is applied group-wise to the observed classes in the training set, so that it is sufficient to have $p > \min_{j=1,\ldots,J} \lfloor \eta^{MCD} n_j \rfloor$ for the MCD solution to be ill-defined. In order to overcome this limitation, Boudt et al. (2020) introduced the MRCD estimator. The main idea is to replace the subset-based covariance estimation with a regularized one, defined as a weighted average of the sample covariance on the $\lfloor \eta^{MCD} N \rfloor$-subset and a predetermined positive definite target matrix. The MRCD estimator is defined as the multivariate location and regularized scatter based on the $\lfloor \eta^{MCD} N \rfloor$-subset that makes its overall determinant the smallest. The MRCD preserves the good breakdown properties of its non-regularized counterpart and, in addition, is applicable in high-dimensional problems where $\lfloor \eta^{MCD} N \rfloor$ is possibly smaller than $p$.

The first phase of our two-stage modeling procedure thus works as follows: considering the available labels $l_n$, $n = 1, \ldots, N$, we apply the MCD (or MRCD) estimator within each class to extract $\hat{\mu}^{MCD}_j$ and $\hat{\Sigma}^{MCD}_j$, $j = 1, \ldots, J$. For ease of notation, we let the superscript in the robust estimates be ‘MCD’ even when the regularized version is considered. Clearly, the MCD solution is preferred should the sample size be large enough. Lastly, there is no reason for $\eta^{MCD}$ to be the same in all observed classes. If a group is known to be particularly sensitive to outliers, its associated MCD subset size may be set smaller than the remaining ones. Since this type of information is seldom available, we subsequently let $\eta^{MCD}_j = \eta^{MCD}$ for all classes in the learning set. The so-obtained estimates are incorporated in the Bayesian model specification presented in Section 2.2, where the robust knowledge extracted from $X$ is accounted for as prior information and, according to an Empirical Bayes rationale, is used to set informative hyperparameters. In this way, outliers and label noise that may be present in the labeled units will not bias the initial beliefs for the known groups in the second phase.
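To fix ideas, the first stage can be sketched in a few lines of R. The snippet below is a minimal illustration, not the authors' released code: the object names X_train (an N x p numeric matrix) and labels (a vector taking values in 1, ..., J) are ours, while CovMcd, getCenter and getCov are the MCD routines of the rrcov package mentioned above.

library(rrcov)

# Stage 1 (sketch): class-wise robust location and scatter via the MCD.
robust_prior_info <- function(X_train, labels, eta_mcd = 0.75) {
  lapply(sort(unique(labels)), function(j) {
    mcd <- CovMcd(X_train[labels == j, , drop = FALSE], alpha = eta_mcd)
    list(mu_hat    = getCenter(mcd),  # robust class mean, centers the NIW prior of class j
         Sigma_hat = getCov(mcd))     # consistency-corrected robust scatter of class j
  })
}

For high-dimensional classes, CovMcd would be swapped for the regularized CovMrcd, available in recent versions of rrcov, leaving the rest of the pipeline unchanged.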
2.2 A Bayesian semiparametric model for the test set

We assume that each observation in the test set is generated according to a mixture of $J + 1$ elements: the $J$ multivariate Gaussians $\phi(\cdot \mid \Theta_j)$ that have been observed in the learning set, and an extra term $f^{nov}$ called the novelty component. In formulas:

$$ y_m \mid \boldsymbol{\pi}, \Theta_j, f^{nov} \sim \sum_{j=1}^{J} \pi_j \, \phi(\cdot \mid \Theta_j) + \pi_0 f^{nov}. \quad (1) $$

We define $\boldsymbol{\pi} = \{\pi_j\}_{j=1}^{J}$, where $\pi_j$ denotes the prior probability of the observed class $j$ (already present in the learning set), while $\pi_0$ is the probability of observing some novelty. Of course, $\sum_{j=0}^{J} \pi_j = 1$. To reflect our lack of knowledge on the novelty component $f^{nov}$, we employ a Bayesian nonparametric specification. In particular, we resort to the Dirichlet Process mixture model of Gaussians (Lo, 1984; Escobar and West, 1995), imposing the following structure:

$$ f^{nov} = \int \phi(\cdot \mid \Theta^{nov}) \, G(d\Theta^{nov}), \qquad G \sim DP(\gamma, H), \quad (2) $$

where $DP(\gamma, H)$ is the usual Dirichlet process with concentration parameter $\gamma$ and base measure $H$ (Ferguson, 1973). Adopting Sethuraman's Stick-Breaking construction (Sethuraman, 1994), we can express the likelihood as follows:

$$ \mathcal{L}(\boldsymbol{y} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\omega}) = \prod_{m=1}^{M} \left[ \sum_{j=1}^{J} \pi_j \, \phi(y_m \mid \mu_j, \Sigma_j) + \pi_0 \sum_{h=1}^{\infty} \omega_h \, \phi(y_m \mid \mu^{nov}_h, \Sigma^{nov}_h) \right]. \quad (3) $$

The term $\sum_{h=1}^{\infty} \omega_h \phi(\cdot \mid \Theta^{nov}_h)$ represents a Dirichlet Process convoluted with a Normal kernel, flexibly modeling a potentially infinite number of hidden classes and/or outlying observations. The following prior probabilities for the parameters complete the Bayesian model specification:

$$ \Theta_j = (\mu_j, \Sigma_j) \sim P^{Tr}_j, \quad j = 1, \ldots, J, \qquad \Theta^{nov}_h = (\mu^{nov}_h, \Sigma^{nov}_h) \sim H, \quad h = 1, \ldots, \infty, $$
$$ \boldsymbol{\pi} \sim Dir(a_0, a_1, \ldots, a_J), \qquad \boldsymbol{\omega} \sim SB(\gamma). \quad (4) $$

The values $a_1, \ldots, a_J$ are the hyperparameters of a Dirichlet distribution on the known classes. The learning set can be exploited to determine reasonable values of such hyperparameters by setting $a_j = n_j / N$. The quantity $a_0$ is the initial prior belief on how much novelty we expect to discover in the test set. Generally, the parameter controlling the novelty proportion, $a_0$, is a priori considered to be small.

To exploit conjugacy, we adopt Normal-inverse-Wishart (NIW) priors for both the manifest and the novel classes. For each of the known groups we assume that

$$ P^{Tr}_j \equiv NIW\left( \hat{\mu}^{MCD}_j, \lambda^{Tr}, \nu^{Tr}, \hat{\Sigma}^{MCD}_j \right), \quad j = 1, \ldots, J, $$

where $\hat{\mu}^{MCD}_j$ and $\hat{\Sigma}^{MCD}_j$ are the MCD robust estimates obtained in phase I. At the same time, the precision parameter $\lambda^{Tr}$ and the degrees of freedom $\nu^{Tr}$ are treated as tuning parameters to enforce high mass around the robust estimates. By letting these two parameters go to infinity, we can also recover the degenerate case $P^{Tr}_j = \delta_{\hat{\Theta}_j}$, where the Dirac delta denotes a point mass centered in $\hat{\Theta}_j$. That is, the prior beliefs extracted from the training set can be flexibly updated, gradually transitioning from transductive to inductive inference as $\lambda^{Tr}$ and $\nu^{Tr}$ increase (Bouveyron, 2014). Similarly, we set $H \equiv NIW(m_0, \lambda_0, \nu_0, S_0)$, where the hyperparameters are chosen to induce a flat prior for the novel components. Lastly, with $\boldsymbol{\omega} \sim SB(\gamma)$ we denote the vector of Stick-Breaking weights, composed of elements defined as

$$ \omega_k = v_k \prod_{l < k} (1 - v_l), \qquad v_k \stackrel{iid}{\sim} Beta(1, \gamma). \quad (5) $$
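For intuition, a finite truncation of the stick-breaking weights in (5) can be drawn in base R as follows. The helper name rstick and the truncation level K are ours, and the truncation is purely illustrative: the sampler of Section 5 avoids it altogether.

# Sketch: draw K truncated stick-breaking weights from SB(gamma), as in (5).
rstick <- function(K, gamma) {
  v <- rbeta(K, 1, gamma)          # v_k ~ Beta(1, gamma)
  v[K] <- 1                        # close the stick so the K weights sum to one
  v * cumprod(c(1, 1 - v[-K]))     # omega_k = v_k * prod_{l<k} (1 - v_l)
}
omega <- rstick(K = 50, gamma = 1)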
2.3 A posteriori novelty detection

Once the model is fitted, each test unit is assigned either to one of the $J$ known classes or to the novelty component. To explore the structure of the latter (i.e., of the units whose component membership indicator $\alpha_m$, formally introduced in Section 5, equals 0), we first compute the pairwise coclustering matrix $P = \{p_{m,m'}\}$, whose entry $p_{m,m'}$ denotes the probability that $y_m$ and $y_{m'}$ belong to the same cluster. We then retrieve the best partition minimizing the Variation of Information (VI) criterion, as suggested in Wade and Ghahramani (2018). More details on how to post-process the MCMC output are given in Section 5.
3 Properties of the proposed semiparametric prior

We now investigate the properties of the underlying random mixing measure induced by the model specification presented in the previous Section. Specifically, the model in (3)-(4) can be generalized in the following hierarchical form, which highlights the dependence on a discrete random measure $\tilde{p}$:

$$ y_m \mid \Theta_m \sim N(\cdot \mid \Theta_m), \qquad \Theta_m \mid \tilde{p} \stackrel{iid}{\sim} \tilde{p}, \qquad \tilde{p} = \sum_{j=1}^{J} \pi_j \delta_{\Theta_j} + \pi_0 \left[ \sum_{h=1}^{+\infty} \omega_h \delta_{\Theta^{nov}_h} \right], $$
$$ (\pi_0, \pi_1, \ldots, \pi_J) \sim Dir(a_0, a_1, \ldots, a_J), \qquad \boldsymbol{\omega} \sim SB(\gamma), \qquad \Theta_j \sim P^{Tr}_j, \qquad \Theta^{nov}_h \sim H. \quad (7) $$

From (7) it can be seen how our model is an extension of the contaminated informative priors proposed in Scarpa and Dunson (2009), where the authors propose to juxtapose a single atom to a DP. We further assume that each $P^{Tr}_j$ is a probability distribution with mean $\mu_j$, second moment $\mu_{j,2}$ and variance $\sigma^2_j$, $j = 1, \ldots, J$. Similarly, let $E[\Theta^{nov}_h] = \mu_0$ and $V[\Theta^{nov}_h] = \sigma^2_0$ for all $h \ge 1$, and define $a = \sum_{j=0}^{J} a_j$. For all $m \in \{1, \ldots, M\}$, we can prove that

$$ E[\Theta_m] = \sum_{j=0}^{J} \frac{a_j}{a} \mu_j, \qquad V[\Theta_m] = \sum_{j=0}^{J} \frac{a_j}{a} \left( \mu_{j,2} - \frac{a_j}{a} \mu_j^2 \right) - 2 \sum_{l > j \ge 0} \frac{a_j a_l}{a^2} \mu_l \mu_j. $$

The overall variance can also be written in terms of the variances of the single, observed mixture components:

$$ V[\Theta_m] = \sum_{j=0}^{J} \frac{a_j}{a} \left( \sigma^2_j + \left(1 - \frac{a_j}{a}\right) \mu_j^2 \right) - 2 \sum_{l > j \ge 0} \frac{a_j a_l}{a^2} \mu_l \mu_j. $$

Given the discrete nature of $\tilde{p}$, we can expect ties between realizations sampled from this measure, say $\Theta_m$ and $\Theta_{m'}$. Therefore, we can compute the probability of obtaining a tie as:

$$ P(\Theta_m = \Theta_{m'}) = \sum_{k=1}^{J} \frac{a_k (a_k + 1)}{a (a + 1)} + \frac{a_0 (a_0 + 1)}{a (a + 1)} \cdot \frac{1}{1 + \gamma}, \quad (8) $$

where the contribution of the novelty term to this probability is multiplicatively reduced by a factor that depends on the inverse of the concentration parameter. The proof of (8) is reported in Appendix A. Clearly, if a priori we expect a large number of clusters in the novelty term (large $\gamma$), the probability of a tie reduces. Indeed, some noticeable limiting cases arise:

$$ \lim_{\gamma \to +\infty} P(\Theta_m = \Theta_{m'}) = \sum_{k=1}^{J} \frac{a_k (a_k + 1)}{a (a + 1)}, \qquad \lim_{\gamma \to 0} P(\Theta_m = \Theta_{m'}) = \sum_{k=0}^{J} \frac{a_k (a_k + 1)}{a (a + 1)}. $$

If $\gamma \to 0$, the novelty term collapses onto a single atom, and we recover the tie probability of a classical mixture with $J + 1$ components. Instead, $\gamma \to +\infty$ leads to the case of a DP with numerous atoms characterized by similar probability, hence annihilating the contribution of the novelty term to the probability of a tie.
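The tie probability in (8) is easy to check by simulation. The sketch below, which reuses the rstick helper introduced after (5), repeatedly draws the mixture weights, computes the tie probability conditional on them, and averages. It is a numerical sanity check under an arbitrary choice of $(a_0, a_1, \ldots, a_J)$ and $\gamma$ of our making, not part of the authors' code.

# Monte Carlo check of (8); a_vec = (a_0, a_1, ..., a_J).
tie_prob_mc <- function(a_vec, gamma, R = 1e4, K = 200) {
  mean(replicate(R, {
    g <- rgamma(length(a_vec), shape = a_vec)
    p <- g / sum(g)                          # p ~ Dir(a_0, ..., a_J)
    w <- c(p[-1], p[1] * rstick(K, gamma))   # atom weights of p-tilde
    sum(w^2)                                 # P(tie | weights)
  }))
}
a_vec <- c(0.5, 2, 3); gamma <- 1; a <- sum(a_vec)
tie_prob_mc(a_vec, gamma)                    # Monte Carlo estimate
sum(a_vec[-1] * (a_vec[-1] + 1)) / (a * (a + 1)) +   # closed form (8)
  a_vec[1] * (a_vec[1] + 1) / (a * (a + 1)) / (1 + gamma)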
Moreover, suppose we rewrite the distribution of $\boldsymbol{\pi}$ as $Dir\left( \frac{a_0}{J+1}, \frac{\tilde{a}}{J+1}, \ldots, \frac{\tilde{a}}{J+1} \right)$. In this case, the hyperparameters relative to the observed groups are assumed equal to $\tilde{a}/(J+1)$. Then, we obtain $a = \frac{a_0 + J\tilde{a}}{J+1}$, and

$$ P(\Theta_m = \Theta_{m'}) = \frac{ J \frac{\tilde{a}}{J+1} \left( \frac{\tilde{a}}{J+1} + 1 \right) }{ \frac{a_0 + J\tilde{a}}{J+1} \left( \frac{a_0 + J\tilde{a}}{J+1} + 1 \right) } + \frac{ \frac{a_0}{J+1} \left( \frac{a_0}{J+1} + 1 \right) }{ \frac{a_0 + J\tilde{a}}{J+1} \left( \frac{a_0 + J\tilde{a}}{J+1} + 1 \right) } \cdot \frac{1}{1 + \gamma}. \quad (9) $$

As $J$ increases, the second part of (9) vanishes. Accordingly, if we suppose an unbounded number of observed groups, letting $J \to \infty$, we have $P(\Theta_m = \Theta_{m'}) = 1 / (1 + \tilde{a})$ as in the classical DP case, and the model loses its ability to detect novel instances.

Lastly, we investigate the covariance between the two random elements $\Theta_m$ and $\Theta_{m'}$. Consider a vector $\varrho = \{\varrho_j\}_{j=0}^{J}$, with the first entry equal to $1/(1+\gamma)$ and the remaining entries equal to 1. Then,

$$ Cov_\gamma(\Theta_m, \Theta_{m'}) = \sum_{j=0}^{J} \left( \frac{a_j (a_j + 1)}{a (a + 1)} \varrho_j \mu_{j,2} - \frac{a_j^2}{a^2} \mu_j^2 \right) - \frac{2}{a^2 (a + 1)} \sum_{j > l \ge 0} a_j a_l \mu_j \mu_l + \frac{a_0 (a_0 + 1)}{a (a + 1)} \frac{\gamma}{1 + \gamma} \mu_0^2. \quad (10) $$

It can be seen that the covariance is characterized by three terms. In the first two, the seen and unseen components have the same influence. The last component, however, is nonnegative and entirely determined by quantities linked to the novel part of the model. Notice that if $\gamma \to 0$, the covariance reduces to

$$ Cov_0(\Theta_m, \Theta_{m'}) = \sum_{j=0}^{J} \left( \frac{a_j (a_j + 1)}{a (a + 1)} \mu_{j,2} - \frac{a_j^2}{a^2} \mu_j^2 \right) - \frac{2}{a^2 (a + 1)} \sum_{j > l \ge 0} a_j a_l \mu_j \mu_l, $$

which is the same covariance we would obtain if $\tilde{p} = \tilde{p}_0 \equiv \sum_{j=0}^{J} \pi_j \delta_{\Theta_j}$, i.e., if we were dealing with a classical mixture model with $J + 1$ components. This implies that (10) can be rewritten as

$$ Cov_\gamma(\Theta_m, \Theta_{m'}) = Cov_0(\Theta_m, \Theta_{m'}) - \frac{a_0 (a_0 + 1)}{a (a + 1)} \frac{\gamma}{1 + \gamma} \sigma_0^2, $$

which leads to a nice interpretation: the introduction of the novelty atoms decreases the “standard” covariance, and this effect is stronger as the prior weight given to the novelty component $a_0$, the dispersion of the base measure $\sigma_0^2$, and/or the concentration parameter $\gamma$ increase.

4 Extension to functional data

The modeling framework introduced in Section 2 is very general and can be easily modified to handle more involved data structures. In this Section, building upon model (3)-(4), we develop a methodology for functional classification that allows the detection of functional novelties. We hereafter assume that our training and test instances are error-prone realizations of a univariate stochastic process $X(t)$, $t \in \mathcal{T}$, with $\mathcal{T} \subset \mathbb{R}$.

Recently, numerous authors have contributed to the area of Bayesian nonparametric functional clustering (see, for example, Bigelow and Dunson, 2009; Petrone et al., 2009; Rodriguez and Dunson, 2014; Rigon, 2019). The key feature of our approach is the inclusion of prior knowledge that helps discriminate among novel and observed classes of functions. Canale et al. (2017) propose a Pitman-Yor mixture with a spike-and-slab base measure to effectively model the daily basal body temperature in women, by including the a priori known distinctive biphasic trajectory that characterizes healthy subjects. Instead of modifying the base measure of the nonparametric process, Scarpa and Dunson (2009) address the same problem by contaminating a point mass with a realization from a DP. As such, our approach can be interpreted as a direct extension of the latter, where $J \ge 1$ known components are included in the specification.

Let $\Theta_m(t) = \left( f_m(t), \sigma^2_m(t) \right)$ denote the vector comprising the smooth functional mean $f_m : \mathcal{T} \to \mathbb{R}$ and the measurement noise $\sigma^2_m : \mathcal{T} \to \mathbb{R}^+$ for a generic curve $m$ in the test set, evaluated at instant $t$. Then the BRAND model, as introduced in Section 2.2 for multivariate data, can be modified as follows:

$$ y_m(t) \mid \Theta_m(t) = f_m(t) + \varepsilon_m(t), \qquad \varepsilon_m(t) \sim N(0, \sigma^2_m(t)), \qquad \Theta_m(t) \mid \tilde{p} \sim \tilde{p}, $$
$$ \tilde{p} = \sum_{j=1}^{J} \pi_j \delta_{\Theta_j} + \pi_0 \left[ \sum_{h=1}^{+\infty} \omega_h \delta_{\Theta^{nov}_h} \right], \qquad (\pi_0, \pi_1, \ldots, \pi_J) \sim Dir(a_0, a_1, \ldots, a_J), $$
$$ \boldsymbol{\omega} \sim SB(\gamma), \qquad \Theta_j \sim P^{Tr}_j, \qquad \Theta^{nov}_h \sim H, \quad (11) $$
where all the distributions $P^{Tr}_j$ and the base measure $H$ model the functional mean and the noise independently. We propose the following informative prior for $\Theta_j = \left( f_j(t), \sigma^2_j(t) \right)$:

$$ f_j(t) \stackrel{ind.}{\sim} N\left( \bar{f}_j(t), \varphi_j \right), \qquad \sigma^2_j(t) \stackrel{ind.}{\sim} IG\left( \frac{\left( \bar{\sigma}^2_j(t) \right)^2}{v_j} + 2, \; \bar{\sigma}^2_j(t) \left( \frac{\left( \bar{\sigma}^2_j(t) \right)^2}{v_j} + 1 \right) \right). \quad (12) $$

We denote by $\bar{f}_j$ and $\bar{\sigma}^2_j$, $j = 1, \ldots, J$, the estimates of the mean and variance functions obtained from the training set for each observed class $j$. The hyperparameters $\varphi_j$ define the degree of confidence we a priori assume for the information extracted from the learning set, while the Inverse Gamma ($IG$) specification ensures that $E[\sigma^2_j(t)] = \bar{\sigma}^2_j(t)$ and $Var[\sigma^2_j(t)] = v_j$. It remains to define how we compute $\bar{f}_j$ and $\bar{\sigma}^2_j$, that is, how the robust extraction of prior information is performed in this functional extension. Applying standard procedures in Functional Data Analysis (Ramsay and Silverman, 2005), we first smooth each training curve via a weighted sum of $L$ basis functions,

$$ x_n(t) = \sum_{l=1}^{L} \xi_{nl} \phi_l(t), \qquad n = 1, \ldots, N, $$

where $\phi_l(t)$ is the $l$-th basis function evaluated in $t$ and $\xi_{nl}$ its associated coefficient. Given the acyclic nature of the functional objects treated in Section 6.3, we will subsequently employ B-spline bases (de Boor, 2001). Clearly, depending on the problem at hand, any basis function system may be considered in this phase. After such a representation has been obtained, we are left with $J$ matrices of coefficients, each of dimension $n_j \times L$. By treating them as multivariate entities, as done for example in Abraham et al. (2003), we resort to the very same procedures described in Section 2.1 and we set

$$ \bar{f}_j(t) = \sum_{l=1}^{L} \hat{\xi}^{MCD}_{jl} \phi_l(t), \qquad \bar{\sigma}^2_j(t) = \frac{1}{n_j - 1} \sum_{n : l_n = j} \left( x_n(t) - \bar{f}_j(t) \right)^2, $$

where $\hat{\xi}^{MCD}_{jl}$ is the robust location estimate computed via MCD (or MRCD) on the $n_j \times L$ matrix of coefficients, $j = 1, \ldots, J$.
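In R, this robust functional extraction can be sketched as follows: each curve observed on a common grid is projected onto a B-spline basis, and the class-wise MCD is run on the resulting coefficients. The object names t_grid, Y_train (an N x T matrix of discretized curves) and labels are illustrative, the least-squares projection is only one of several possible smoothers, and for a large basis with small classes (as with L = 100 in Section 6.3) CovMrcd would replace CovMcd.

library(splines)
library(rrcov)

# Sketch of the functional stage 1: B-spline coefficients + class-wise MCD.
B  <- bs(t_grid, df = 25)                                    # basis matrix, L = 25 columns
Xi <- t(apply(Y_train, 1, function(y) coef(lm(y ~ B - 1))))  # N x L coefficient matrix

j        <- 1                                                # robust summaries for class j
mcd_j    <- CovMcd(Xi[labels == j, ], alpha = 0.75)
f_bar_j  <- as.vector(B %*% getCenter(mcd_j))                # robust mean curve of class j
res_j    <- sweep(Y_train[labels == j, ], 2, f_bar_j)
s2_bar_j <- colSums(res_j^2) / (sum(labels == j) - 1)        # pointwise variance estimate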
On the other hand, more flexibility is needed to specify the base measure $H$ for $\Theta^{nov}_h = \left( f^{nov}_h(t), \sigma^{2,nov}_h(t) \right)$. Therefore, relying on the same smoothing procedure considered for the training curves, we build a hierarchical specification for the quantities involved in the novelty term:

$$ f^{nov}_h(t) = \sum_{l=1}^{L} \xi^{nov}_{hl} \phi_l(t), \qquad \xi^{nov}_{hl} \sim N(\psi_h, \tau_h), $$
$$ \psi_h \sim N(0, s^2), \qquad \tau_h \sim IG(a_\tau, b_\tau), \qquad \sigma^{2,nov}_h(t) \sim IG(a_H, b_H). \quad (13) $$

The first line of (13) can be rewritten as $f^{nov}_h(t) \sim N\left( \psi_h \sum_{l=1}^{L} \phi_l(t), \; \tau_h \sum_{l=1}^{L} \phi_l(t)^2 \right)$. We call this model functional BRAND. It provides a powerful extension for functional novelty detection; a successful application is reported in Section 6.3.

5 Posterior inference

The distribution $p(\boldsymbol{\pi}, \boldsymbol{\omega}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\Theta}, \boldsymbol{\Theta}^{nov} \mid \boldsymbol{y})$ is analytically intractable; therefore, we rely upon MCMC techniques to carry out posterior inference. An easy sampling scheme can be constructed mimicking the blocked Gibbs sampler of Ishwaran and James (2001), where the infinite series in (3) is truncated at a pre-specified level $L < \infty$. However, this approach leads to a non-negligible truncation error if $L$ is too small, and to computational inefficiencies if $L$ is set too high. Instead, we propose a modification of the $\xi$-sequence of the Independent Slice-efficient sampler (Kalli et al., 2011), another well-known conditional algorithm that allows one to sample from the exact posterior. To adapt the algorithm to our framework, we start from the following alternative reparameterization of the model in (3)-(4):

$$ y_m \mid \tilde{\boldsymbol{\Theta}}, \zeta_m \sim \phi\left( y_m \mid \tilde{\Theta}_{\zeta_m} \right), \qquad \zeta_m \mid \tilde{\boldsymbol{\pi}} \sim \sum_{k=1}^{\infty} \tilde{\pi}_k \delta_k, \qquad \tilde{\pi}_k = \pi_k \mathbb{1}_{\{k \le J\}} + \pi_0 \, \omega_{k-J} \mathbb{1}_{\{k > J\}}, \quad (14) $$

where $\tilde{\boldsymbol{\Theta}} = (\Theta_1, \ldots, \Theta_J, \Theta^{nov}_1, \Theta^{nov}_2, \ldots)$ collects the atoms of the manifest and novel components. The membership labels are recovered as

$$ \alpha_m = \zeta_m \mathbb{1}_{\{\zeta_m \le J\}}, \qquad \beta_m = (\zeta_m - J) \mathbb{1}_{\{\zeta_m > J\}}, \quad (15) $$

so that $\alpha_m = 0$ flags an allocation to the novelty component, whose inner cluster is indexed by $\beta_m$. Following Kalli et al. (2011), we then augment the model with uniform slice variables $u_m$ and a deterministic, positive sequence $\{\xi_l\}_{l \ge 1}$, obtaining

$$ f(y_m, u_m, \zeta_m) \propto \mathbb{1}_{\{u_m < \xi_{\zeta_m}\}} \frac{\tilde{\pi}_{\zeta_m}}{\xi_{\zeta_m}} \, \phi\left( y_m \mid \tilde{\Theta}_{\zeta_m} \right), \quad (16) $$

so that, conditionally on the slice variables, only the finitely many components with $\xi_l > u_m$ need to be evaluated. The choice of the $\xi$-sequence has to take into account the difference between the manifest and the novel components. Usually, a very common choice is $\xi_l = (1 - \kappa) \kappa^{l-1}$. This option allows one to compute each truncation bound $L_m$ analytically, it being the largest integer such that

$$ L_m < 1 + \frac{\log(u_m) - \log(1 - \kappa)}{\log(\kappa)}. $$

However, the default choice of a geometrically decreasing $\xi$-sequence is inappropriate in this context, since we are dealing with a mixture where not all the components are conceptually equivalent. In fact, the default $\xi$-sequence tends to favor components that come first in the mixture specification (in our case, the manifest ones). To overcome this issue, we propose the following intuitive modification. Given a value for $\kappa \in (0, 1)$, we assign a fraction $(1 - \kappa)$ of the mass to the first $J + 1$ elements of the sequence, and we then induce a geometric decay in the remaining ones to split the residual fraction $\kappa$. We force the element in position $J + 1$ to have the same mass given to the manifest components, to avoid an under-representation of the novelty part. To do so, we define

$$ \xi_l = \begin{cases} \dfrac{1 - \kappa}{J + 1} & \text{if } l \le J + 1, \\[2mm] \dfrac{1 - \kappa}{J + 1} \left( \dfrac{(J + 1)\kappa}{J\kappa + 1} \right)^{l - J - 1} & \text{if } l > J + 1. \end{cases} \quad (17) $$

It is easy to prove that $\sum_{l=1}^{+\infty} \xi_l = 1$. According to (17), the first $J + 1$ elements of the sequence have masses equal to $(1 - \kappa)/(J + 1)$. The truncation threshold changes accordingly, becoming the largest integer $L^*$ such that

$$ L^* < J + 1 + \frac{\log(\min(\boldsymbol{u})) - \log\left( \frac{1 - \kappa}{J + 1} \right)}{\log\left( \frac{(J + 1)\kappa}{J\kappa + 1} \right)}. \quad (18) $$

Inequality (18) implies that the truncation threshold $L^*$ can only be greater than or equal to $J + 1$, ensuring that the MCMC always takes into consideration the creation of at least one cluster in the novelty distribution. A representation of the modified $\xi$-sequence is depicted in Figure 1.

Figure 1: The modified $\xi$-sequence, here with $\kappa = 0.25$. The blue rectangle highlights the weights relative to the known components.
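The modified sequence (17) and the bound (18) translate directly into code. A small sketch of our own follows, with an explicit max guard making the $J + 1$ lower bound of (18) visible:

# Sketch of the modified xi-sequence (17) and of the truncation bound (18).
xi_seq <- function(l, J, kappa) {
  r <- (J + 1) * kappa / (J * kappa + 1)          # geometric decay rate of the tail
  ifelse(l <= J + 1,
         (1 - kappa) / (J + 1),                   # flat mass on the first J + 1 slots
         (1 - kappa) / (J + 1) * r^(l - J - 1))   # geometric tail for the novelty atoms
}

L_star <- function(u, J, kappa) {                 # largest index retained by the slicer
  r <- (J + 1) * kappa / (J * kappa + 1)
  bound <- J + 1 + (log(min(u)) - log((1 - kappa) / (J + 1))) / log(r)
  max(J + 1, floor(bound))
}

sum(xi_seq(1:1000, J = 3, kappa = 0.5))           # ~ 1, as stated after (17)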
The pseudo-code for the devised Gibbs sampler is reported in Appendix B. The algorithm for the functional extension is not included for conciseness, yet its structure closely follows the one outlined for the multivariate case. Software routines, including the implementation of both methods, the simulation study, and the real data analyses of Section 6, are openly available at https://github.com/AndreaCappozzo/brand-public_repo.

Once the MCMC sample is collected, we first compute the a posteriori probability of being a novelty for every test unit $m$, $PPN_m = P[y_m \sim f^{nov} \mid \boldsymbol{Y}]$, which is estimated according to the ergodic mean:

$$ \widehat{PPN}_m = \frac{ \sum_{i=1}^{I} \mathbb{1}_{\{\alpha^{(i)}_m = 0\}} }{I}, \quad (19) $$

where $\alpha^{(i)}_m$ is the value assumed by the parameter $\alpha_m$ at the $i$-th iteration of the MCMC chain and $I$ is the total number of iterations. We remark that the inference on $\boldsymbol{\alpha}$ can be conducted directly, since the mixture between the $J$ observed components and $f^{nov}$ is not subject to label switching. In contrast, we need to take this problem into account when dealing with $\boldsymbol{\beta}$. To perform valid inference, one possibility is to rely on the posterior probability coclustering matrix (PPCM) defined in Section 2.3. Each entry of this matrix, $p_{m,m'} = P[y_m \text{ and } y_{m'} \text{ belong to the same novelty class}]$, is estimated as

$$ \hat{p}_{m,m'} = \frac{ \sum_{i=1}^{I} \mathbb{1}_{\{\beta^{(i)}_m = \beta^{(i)}_{m'}\}} }{I}. \quad (20) $$

Once the PPCM is obtained, we can employ it to estimate the best partition (BP) in the novelty subset. The BP is obtained by minimizing a loss function defined over the space of clusterings, which can be computed starting from the PPCM.
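Given stored draws of the membership vectors, estimators (19) and (20) amount to simple averages. In the sketch below, alpha_draws and beta_draws are I x M matrices of MCMC draws (our notation), and the 0.5 threshold on the PPN is an illustrative allocation rule; the final VI minimization can be delegated to, e.g., the minVI function of the mcclust.ext package accompanying Wade and Ghahramani (2018) — an assumption on tooling, not necessarily the authors' exact pipeline.

# Sketch: estimate PPN (19) and the PPCM (20) from stored membership draws.
ppn <- colMeans(alpha_draws == 0)             # eq. (19): one probability per test unit

nov <- which(ppn > 0.5)                       # units a posteriori flagged as novelties
Bm  <- beta_draws[, nov, drop = FALSE]        # novelty cluster labels, I x |nov|
ppcm <- matrix(0, length(nov), length(nov))
for (i in seq_len(nrow(Bm)))                  # eq. (20): average pairwise agreement
  ppcm <- ppcm + outer(Bm[i, ], Bm[i, ], "==")
ppcm <- ppcm / nrow(Bm)
# best_partition <- mcclust.ext::minVI(ppcm)$cl   # VI minimization (discussed below)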
A famous loss function was proposed by Binder (1978) and investigated in a BNP setting by Lau and Green (2007). However, the so-called Binder loss presents peculiar asymmetries, preferring to split rather than to merge clusters. This could result in an unnecessarily high number of estimated clusters. Therefore, we adopt the Variation of Information (VI; Meilă, 2007) as loss criterion. The associated loss function, recently proposed by Wade and Ghahramani (2018), is known to provide less fragmented results.

Finally, once the BP for the novelty component has been estimated, we can rely on a heuristic based on the clusters' sizes to discriminate anomalies from actual new classes. Let us suppose that the BP comprises $S$ novel clusters, and denote by $m^{nov}_s$ the number of instances assigned to cluster $s \in \{1, \ldots, S\}$. A cluster $s$ is labeled as an agglomerate of outlying points if its frequency $m^{nov}_s$ is sufficiently small, say below the first percentile of the entire novelty sample size. Oppositely, all clusters whose frequencies exceed this threshold are regarded as proper novel groups. A sketch of this rule is reported below.
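A literal implementation of the size-based rule could read as follows; the 1% default mirrors the first-percentile threshold suggested above, while the function name and output labels are illustrative choices of ours.

# Sketch: flag small novelty clusters as outliers, large ones as novel classes.
split_novelty <- function(best_partition, threshold = 0.01) {
  prop <- table(best_partition) / length(best_partition)  # relative cluster sizes
  ifelse(prop[as.character(best_partition)] < threshold,
         "outlier", "novel class")                         # one label per novelty unit
}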
6 Simulation study and applications

In this Section, we present a simulation study aimed at highlighting the capabilities of the new semiparametric Bayesian model in performing novelty detection, comparing it with existing methodologies. By considering different scenarios, in terms of hidden-class sample sizes and adulteration proportion in the training set, we evaluate the importance of the robust information extraction phase and how it affects the learning procedure.

6.1 Simulation study

We consider a training set formed by $J = 3$ observed classes, each distributed according to a bivariate normal density $N(\mu_j, \Sigma_j)$, $j = 1, 2, 3$. The class sample sizes are, respectively, equal to $n_1 = 300$, $n_2 = 300$ and $n_3 = 400$. The same groups are also present in the test set, together with four previously unobserved classes, likewise generated from bivariate normal densities with their own mean vectors and covariance matrices. The test set thus encompasses a total of 7 components: 3 observed and 4 novelties. Starting from the above-described data-generating process, we consider four different scenarios, varying:

• Data contamination level:
  – no contamination in the training set (Label noise = False);
  – 12% label noise between classes 2 and 3 (Label noise = True).

• Test set sample size:
  – novelty subset size equal to 30% of the test set (Novelty size = Not small), with $m_1 = 200$, $m_2 = 200$, $m_3 = 250$, $m_4 = 90$, $m_5 = 100$, $m_6 = 100$, $m_7 = 10$;
  – novelty subset size equal to 15% of the test set (Novelty size = Small), with $m_1 = 350$, $m_2 = 250$, $m_3 = 250$, $m_4 = 49$, $m_5 = 50$, $m_6 = 50$, $m_7 = 1$.
Figure 2 exemplifies the experiment structure, displaying a realization from the Label noise = True, Novelty size = Not small scenario. As is evident from the plots, the label noise is strategically included to cause a more difficult identification of the fourth class, should the parameters of the second and third classes be non-robustly learned. Further, notice that the last group presents limited sample size and variability: it could easily be regarded as pointwise contamination (i.e., an anomaly) rather than an actual new component. Nonetheless, following the reasoning outlined in the Introduction, we are interested in evaluating the ability of the nonparametric density to capture and discriminate these types of peculiar patterns as well. For each combination of contamination level and test set sample size, we simulate $B = 100$ datasets. Results are reported in what follows.

Figure 2: Simulated data for the Label noise = True, Novelty size = Not small scenario (training set, left; test set, right). Classes 4, ..., 7 constitute the novelty.

We compare the performance of the BRAND model under different hyperparameter specifications:

• the information from the training set is either non-robustly ($\eta^{MCD} = 1$) or robustly ($\eta^{MCD} = 0.75$) extracted;

• the precision parameter associated with the training prior belief is either very high ($\lambda^{Tr} = 1000$) or moderately low ($\lambda^{Tr} = 10$).

In addition, two model-based adaptive classifiers are considered in the comparison, namely the inductive RAEDDA model (Cappozzo et al., 2019), with labeled and unlabeled trimming levels respectively equal to 0.12 and 0.05, and the inductive AMDA model (Bouveyron, 2014). For each replication of the simulated experiment, a set of four metrics is recorded on the test set:

• Novelty predictive value (Precision): the proportion of units marked as novelties by a given method that truly belong to classes 4, ..., 7;

• Accuracy on the known classes: the classification accuracy of a given method within the subset of groups already observed in the training set;

• Adjusted Rand Index (ARI; Rand, 1971): measuring the similarity between the partition returned by a given method and the underlying true structure;
• PPN: the a posteriori probability of being a novelty, computed according to Equation (19) (BRAND only).

We retain 20000 MCMC iterations after discarding the first 20000 as a burn-in phase. Apart from the hyperparameters for the training components, fairly uninformative priors are employed in the base measure $H$, with $m_0 = \mathbf{0}$, a small precision $\lambda_0$, $\nu_0 = 10$ and $S_0 = 10 I$. Lastly, a Gamma hyperprior with both hyperparameters equal to 1 is placed on the DP concentration parameter.

Figure 3 and Table 1 report the results for $B = 100$ repetitions of the experiment under the different simulated scenarios.

Figure 3: Box plots for (from top to bottom) novelty predictive value, accuracy on the known classes and ARI metrics for $B = 100$ repetitions of the simulated experiment, varying data contamination level and test set sample size.

Figure 4: Hex plots of the average estimated posterior probability of being a novelty, according to formula (19), for $B = 100$ repetitions of the simulated experiment, varying data contamination level and BRAND hyperparameters, Not small novelty subset size. The brighter the color, the higher the probability of belonging to $f^{nov}$.

The Novelty predictive value metric highlights the models' capability to correctly recover and identify the previously unseen patterns. As expected, in the adulteration-free scenarios, all methodologies succeed well enough in separating known and hidden components. The worst performance is exhibited by the RAEDDA model for which, due to the fixed trimming level, a small part of the group-wise most extreme (but still genuine) observations is discarded, thus slightly overestimating the novelty percentage (the same happens for the ARI metric). Different results are displayed in those scenarios wherein the label noise complicates the learning process. Robust procedures efficiently cope with the adulteration present in the training set, while the AMDA model and the BRAND methods with $\eta^{MCD} = 1$ tend to largely overestimate the novelty component. In particular, the harmful effect caused by the mislabeled units is exacerbated in the BRAND model that sets high confidence in the priors ($\lambda^{Tr} = 1000$), while a partial mitigation, albeit feeble, emerges when $\lambda^{Tr}$ is set equal to 10. This consequence is even more apparent in the hex plots of Figure 4, where we see that the latter model tries to modify its prior belief to accommodate the (outlier-free) test units, while the former, forced to stick close to its prior distribution by the high value of $\lambda^{Tr}$, incorporates the second and third classes in the novelty term. The final output, as displayed in the
Accuracy on the known classes box plots, has an overall high misclassification error when it comes to identifying the test units belonging to the previously observed classes. Differently, setting robust informative priors prevents this undesirable behavior, as shown by both the high level of accuracy and the associated low posterior probability of being a novelty in the region of the feature space wherein the observed groups lie. It is surprising that the true partition recovery, assessed via the Adjusted Rand Index, does not seem to be influenced by the label noise, and that the BRAND model always outperforms the competing methodologies, regardless of which hyperparameters were selected.

Table 1: Accuracy on the known classes, Adjusted Rand Index and Novelty predictive value metrics for $B = 100$ repetitions of the simulated experiment, varying data contamination level and test set sample size. Standard errors are reported in parentheses.

As previously mentioned, in the BRAND ($\eta^{MCD} = 1$, $\lambda^{Tr} = 10$) and BRAND ($\eta^{MCD} = 1$, $\lambda^{Tr} = 1000$) cases the second and third classes are assimilated by the nonparametric component. Despite this undesired outcome, retrieving the novelty best partition by minimizing the VI criterion (see Section 5) allows the model to correctly identify the patterns that were originally contaminated in the training set. That is, the true group structures are nowhere to be found in the test set, so that the DP prior is forced to create them anew. Clearly, this is a sub-optimal behavior, as the separation of what is known from what is new is completely lost; yet it may raise suspicion of dealing with a contaminated learning set, suggesting the need for a robust prior information extraction.
6.2 Detecting a novel wheat variety

Sophisticated and advanced techniques like X-rays, scanning microscopy, and laser technology are increasingly employed for the automatic collection and processing of images. Within the domain of computer vision studies, novelty detection is generally portrayed as a one-class classification problem, where the aim is to separate the known patterns from the absent, poorly sampled or not well defined remainder (Khan and Madden, 2014). Thus, there is strong interest in developing methodologies that not only distinguish the already observed quantities from the new entities, but that also identify specific structures within the novelty component. The present case study involves the detection of a novel grain type by means of seven geometric parameters (area, perimeter, compactness, kernel length, width, asymmetry, and groove length), recorded by post-processing X-ray photograms of kernels (Charytanowicz et al., 2010). In more detail, for the 210 samples belonging to three different wheat varieties, a high-quality visualization of the internal kernel structure is detected using a soft X-ray technique and, subsequently, the image files are post-processed via a dedicated computer software package (Strumiłło et al., 1999). The obtained dataset is publicly available in the University of California, Irvine Machine Learning data repository. This experiment involves the random selection of 70 training units from the first two cultivars, and a test set of 105 samples, including 35 grains from the third variety. The resulting learning scenario is displayed in Figure 5. The aim of the analysis is to employ BRAND to detect the third, unobserved variety, whilst performing classification of the known grain types with high accuracy.

Figure 5: Learning scenario (only the compactness and perimeter variables are displayed) for the novelty detection of one unobserved wheat variety, seeds dataset.

Firstly, the MCD estimator with hyperparameter $\eta^{MCD} = 0.95$ is adopted for robustly learning the training structure of the two observed wheat varieties. In the second stage, our model is fitted to the test set, discarding 20000 iterations for the burn-in phase and subsequently retaining 10000 MCMC samples. As usual, fairly uninformative priors are employed in the base measure $H$, with $m_0 = \mathbf{0}$, a small precision $\lambda_0$, $\nu_0 = 10$ and $S_0 = I$. For the training components, the means and covariance matrices of the Normal-inverse-Wishart priors are directly determined by the MCD output of the first stage, while $\nu^{Tr}$ and $\lambda^{Tr}$ are specified to be respectively equal to 250 and 1000. The latter value indicates that, after having robustly extracted information for the two observed classes, high trust is placed in the prior distributions of the known components. Model results are reported in Figure 6, where the posterior probabilities of being a novelty, $PPN_m = P[y_m \sim f^{nov} \mid \boldsymbol{Y}]$, $m = 1, \ldots, M$, displayed in the plots below the main diagonal, are estimated according to the ergodic mean in (19). The a posteriori classification, computed via majority vote, is depicted in the plots above the main diagonal, where the water-green solid diamonds denote observations belonging to the novel class. The confusion matrix associated with the estimated group assignments is reported in Table 2, where the third grain variety is effectively captured by the flexible process modeling the novel component.

Table 2: Confusion matrix for the semiparametric Bayesian classifier on the test set, seeds dataset. The label “New” indicates observations that are estimated to have arisen from the novelty component.

                   Truth
  Classification    1    2    3
        1          31    0    8
        2           1   35    0
      New           3    0   27

Figure 6: Test set for the considered experimental scenario, seeds dataset. Plots below the main diagonal represent the estimated posterior probability of being a novelty: the brighter the color, the higher the probability of belonging to $f^{nov}$. Plots above the main diagonal display the associated group assignments, where the water-green solid diamonds denote observations classified as novelties.

All in all, the promising results obtained with this multivariate dataset may foster the employment of our methodology in automatic image classification procedures that supersede the one-class classification paradigm, allowing for a much more flexible anomaly and novelty detector in computer vision applications.

6.3 Functional novelty detection on NIR meat spectra

In recent years, machine learning methodologies have experienced ever-growing interest in countless fields, including food authentication research (Singh and Domijan, 2019). An authenticity study aims to characterize unknown food samples, correctly identifying their type and/or provenance. Clearly, no observation is to be trusted in a context wherein the final purpose is to detect potentially adulterated units, and in which, for example, an entire subsample may belong to a previously unseen pattern. Motivated by a dataset of Near Infrared Spectra (NIR) of meat varieties, we employ the functional model introduced in Section 4 to perform classification and novelty detection when a hidden class and four manually adulterated units are present in the test set. The considered data report the electromagnetic spectrum for a total of 231 homogenized meat samples, recorded at intervals of 2 nm (McElhinney et al., 1999). The units belong to five different meat types, with 32 beef, 55 chicken, 34 lamb, 55 pork, and 55 turkey records. For each meat sample, the amount of light absorbed at a given wavelength is recorded as $A = \log(1/R)$, where $R$ is the reflectance value. The visible part of the electromagnetic spectrum accounts for color differences in the meat types, while their chemical composition is recorded further along the spectrum. NIR data can be interpreted as a discrete evaluation of a continuous function on a bounded domain; therefore, the procedure described in Section 4 is a sensible methodological tool for modeling this type of data object (Barati et al., 2013). We randomly partition the recorded units into labeled and unlabeled sets: the former is composed of 28 chicken, 17 lamb, 28 pork, and 28 turkey samples, while the latter contains the same proportion of these four meat types, with an additional 32 beef units.
The last class is not observed in the learning set and needs, therefore, to be discovered. Also, four validation units are manually adulterated and added to the test set as follows:

• a shifted version of a pork sample, achieved by removing the first 15 data points and appending the last 15 group-mean absorbance values at the end of the spectrum;

• a noisy version of a pork sample, generated by adding Gaussian white noise to the original spectrum;

• a modified version of a turkey sample, obtained by abnormally increasing the absorbance value at a single specific wavelength to simulate a spike;

• a pork sample with an added slope, produced by multiplying the original spectrum by a positive constant.

These modifications mimic the ones considered in the “Chimiométrie 2005” chemometric contest, in which the participants were tasked to perform discrimination and outlier detection of mid-infrared spectra of four different starch types (Fernández Pierna and Dardenne, 2007). In our context, both the beef subpopulation and the adulterated units are previously unseen patterns that shall be captured by the novelty component.

Figure 7: Estimated posterior probability of being a novelty, according to formula (19): the brighter the color, the higher the probability of belonging to $f^{nov}$.

Firstly, robust prior information is recovered from the learning set. Given the non-cyclical nature of the spectra, each training unit is approximated via a linear combination of $L = 100$ B-spline bases, and the associated coefficients are retrieved. Given the high-dimensional nature of the smoothing process, the MRCD is employed to obtain robust group-wise estimates of the spline coefficients. These quantities, linearly combined with the B-spline bases, account for the training atoms $\Theta_j$, $j = 1, \ldots, J$. In the second stage, functional BRAND is fitted to the test curves, with informative priors centered on the training atoms and the remaining patterns left to be captured by the novelty term $f^{nov}$. The estimated posterior probability of being a novelty is displayed in Figure 7, while the confusion matrix associated with the a posteriori classification is reported in Table 3. The attained classification accuracy is in agreement with the ones produced by state-of-the-art classifiers in a fully-supervised scenario (see, for example, Murphy et al., 2010; Gutiérrez et al., 2014). That is, our proposal is capable of detecting previously unseen classes and outlying units, whilst maintaining competitive predictive power.

Table 3: Confusion matrix for functional BRAND on the test set, meat dataset.

                            Truth
  Classification   Beef  Chicken  Lamb  Pork  Turkey  Outliers
     Novelty        28      0       0     0      2        4
     Chicken         0     18       0     0      6        0
     Lamb            4      0      17     0      0        0
     Pork            0      5       0    20      3        0
     Turkey          0      4       0     4     15        0

Focusing on the novelty component, the model almost entirely captures the beef hidden class and the adulterated units, yet two turkey samples are also incorrectly assigned. The classification obtained for the curves identified as novelties, resulting from VI minimization, is displayed in the upper panel of Figure 8, where two distinct clusters are detected. Interestingly, the 28 beef samples (blue dashed lines) are grouped separately from the manually adulterated units and the two turkeys (solid red lines). The estimated partition highlights the presence of a novel class (i.e., the beef meat variety) and a group of outlying spectra, to which our four modifications shall belong. The only concern is given by the two units incorrectly assigned to the novel component. A closer look at the turkey subpopulation, displayed in the lower panel of Figure 8, shows how these two samples exhibit a somewhat extreme pattern within their group and can, therefore, be legitimately flagged as outlying or anomalous turkeys.

Figure 8: Upper panel: best partition of the novelty component, recovered by minimizing the Variation of Information loss function; the dashed blue curves are beef samples, while the solid red ones are the manually adulterated units and the two turkeys incorrectly assigned to the novel component. Lower panel: true turkey subpopulation in the test set; the units incorrectly assigned to the novel component are displayed with solid dark red lines.

In this Section, we have shown the effectiveness of our methodology in correctly identifying a hidden group in a functional setting, while jointly achieving good classification accuracy and recognition of outlying curves. The model's successful application seems particularly desirable in fields like food authenticity, where generally no information regarding how many modifications and/or adulteration mechanisms may be present in the samples is a priori available.
7 Conclusions and further research

In the present manuscript, we have introduced a two-stage methodology for robust Bayesian nonparametric novelty detection. In the first stage, we robustly extract the observed group structure from the training set. In the second stage, we incorporate such prior knowledge into a contaminated mixture, wherein a nonparametric component describes the novelty term. The latter could correspond either to anomalies or to actual new groups, yet the distinction is made possible by retrieving the best partition within the novel subset. We have then investigated the basic properties of the induced random measure, the underlying interpretations, and the connections with existing methods. Subsequently, the general multivariate methodology has been extended to handle functional data objects, resulting in a novelty detector for functional data analysis. A dedicated slice-efficient sampler, taking into account the difference between unseen and seen components, has been devised for posterior inference. An extensive simulation study and applications on multivariate and functional data have validated the effectiveness of our proposal, fostering its employment in diverse areas, from image analysis to chemometrics.

BRAND can be seen as the starting point for many different research avenues. Future research directions aim at providing a Bayesian interpretation of the robust MCD estimator, to propose a unified, fully Bayesian model. More versatile specifications can be adopted for the known components, weakening the Gaussianity assumption. This can be done by adopting more flexible distributions while keeping the mean and variance of the resulting densities constrained to the findings in the training set, for example via centered stick-breaking mixtures (Yang et al., 2010). Similarly, functional BRAND can be improved by adopting a more general prior specification via Gaussian Processes (Rasmussen and Williams, 2005). Lastly, it is of paramount interest to develop scalable algorithms, such as Variational Bayes (Blei et al., 2017) and Expectation-Maximization (Dempster et al., 1977), for inference on massive datasets. Such solutions will offer both increased speed and lower computational cost, which are crucial for ensuring the applicability of our proposal in the big data era.
Appendix A. Proof of (8)
Let us represent all the possible values of $\Theta$ in one vector $\tilde{\boldsymbol{\Theta}} = (\Theta_1, \ldots, \Theta_J, \Theta^{nov}_1, \Theta^{nov}_2, \ldots)$. Then,

$$ P(\Theta_m = \Theta_{m'}) = E\left[ P(\Theta_m = \Theta_{m'} \mid \tilde{p}) \right] = \sum_{j \ge 1} E\left[ P\left(\Theta_m = \tilde{\Theta}_j \mid \tilde{p}\right) \cdot P\left(\Theta_{m'} = \tilde{\Theta}_j \mid \tilde{p}\right) \right] $$
$$ = \sum_{j=1}^{J} E\left[ P\left(\Theta_m = \tilde{\Theta}_j \mid \tilde{p}\right) \cdot P\left(\Theta_{m'} = \tilde{\Theta}_j \mid \tilde{p}\right) \right] + \sum_{j \ge J+1} E\left[ P\left(\Theta_m = \tilde{\Theta}_j \mid \tilde{p}\right) \cdot P\left(\Theta_{m'} = \tilde{\Theta}_j \mid \tilde{p}\right) \right] $$
$$ = \sum_{j=1}^{J} E\left[ \pi_j^2 \right] + \sum_{j \ge J+1} E\left[ \pi_0^2 \, \omega_{j-J}^2 \right] = \sum_{j=1}^{J} E\left[ \pi_j^2 \right] + E\left[ \pi_0^2 \right] \sum_{h \ge 1} E\left[ \omega_h^2 \right] $$
$$ = \sum_{j=1}^{J} \frac{a_j (a_j + 1)}{a (a + 1)} + \frac{a_0 (a_0 + 1)}{a (a + 1)} \cdot \frac{1}{1 + \gamma}. $$
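The last equality relies on a standard computation for the second moments of stick-breaking weights, spelled out here for completeness under $v_h \stackrel{iid}{\sim} Beta(1, \gamma)$, for which $E[v^2] = \frac{2}{(1+\gamma)(2+\gamma)}$ and $E[(1-v)^2] = \frac{\gamma}{2+\gamma}$:

$$ \sum_{h \ge 1} E\left[\omega_h^2\right] = \sum_{h \ge 1} E\left[v_h^2\right] \prod_{l < h} E\left[(1 - v_l)^2\right] = \frac{2}{(1+\gamma)(2+\gamma)} \sum_{h \ge 1} \left( \frac{\gamma}{2+\gamma} \right)^{h-1} = \frac{2}{(1+\gamma)(2+\gamma)} \cdot \frac{2+\gamma}{2} = \frac{1}{1+\gamma}. $$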
Appendix B. Gibbs sampler algorithm for model (3)-(4)

Algorithm 1: Efficient Slice Sampler for the BNP-Novelty detection model

Input: initial values for the MCMC, robust estimates from $X$.
Output: posterior MCMC sample for the parameters of interest.

for $i = 1, \ldots, N_{SIM}$ do

1. Sample every $u_m$ from a uniform distribution $U(0, \xi_{\zeta_m})$.

2. Compute the stochastic truncation term $L^*$ according to (18).

3. Let $m_j = \sum_{m=1}^{M} \mathbb{1}_{\{\alpha_m = j\}}$, with $j = 0, \ldots, J$. Sample $\boldsymbol{\pi}$ from a conjugate Dirichlet distribution:
$$ \boldsymbol{\pi} \sim Dir(a_0 + m_0, a_1 + m_1, \ldots, a_J + m_J). $$

4. Sample the SB variables after integrating out $\boldsymbol{u}$:
$$ v_k \mid \cdots \sim Beta\left( 1 + \sum_{m=1}^{M} \mathbb{1}_{\{\alpha_m = 0 \,\cap\, \beta_m = k\}}, \; \gamma + \sum_{m=1}^{M} \mathbb{1}_{\{\alpha_m = 0 \,\cap\, \beta_m > k\}} \right). $$

5. Compute the SB weights according to (5).

6. Compute the one-line probability weights $\tilde{\boldsymbol{\pi}}$ according to (14).

7. Sample the atoms for the observed classes $\Theta_j$, exploiting the conjugacy between the likelihood and the prior, for $j = 1, \ldots, J$.

8. Sample the atoms for the novel classes $\Theta^{nov}_l$, exploiting the conjugacy between the likelihood and the prior, for $l = 1, \ldots, L^*$.

9. Obtain $\tilde{\boldsymbol{\Theta}}$ by concatenating the updated values of $\boldsymbol{\Theta}$ and $\boldsymbol{\Theta}^{nov}$.

10. Sample each $\zeta_m$ from the following discrete distribution:
$$ P(\zeta_m = l) \propto \frac{\tilde{\pi}_l}{\xi_l} \, \mathbb{1}_{\{u_m < \xi_l\}} \, \phi\left( y_m \mid \tilde{\Theta}_l \right), \quad l = 1, \ldots, L^*, \qquad P(\text{otherwise}) \propto 0. $$

11. Recover the values of the membership vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ using (15). Divide the elements of $\tilde{\boldsymbol{\Theta}}$ into $\boldsymbol{\Theta}$ and $\boldsymbol{\Theta}^{nov}$.

end

References