A Two-Stage Bayesian Semiparametric Model for Novelty Detection with Robust Prior Information
Francesco Denti∗, Andrea Cappozzo†, Francesca Greselin†

∗ Department of Statistics and Computer Science, University of California, Irvine — [email protected]
† Department of Statistics and Quantitative Methods, University of Milano-Bicocca — [email protected], [email protected]

Abstract
Standard novelty detection methods aim at bi-partitioning the test units into already observed and previously unseen patterns. However, two significant issues arise: there may be considerable interest in identifying specific structures within the latter subset, and contamination in the known classes could completely blur the actual separation between manifest and new groups. Motivated by these problems, we propose a two-stage Bayesian nonparametric novelty detector, building upon prior information robustly extracted from a set of complete learning units. A general-purpose multivariate methodology, as well as an extension suitable for functional data objects, are devised. Some theoretical properties of the associated semiparametric prior are investigated. Moreover, we propose a suitable ξ-sequence to construct an independent slice-efficient sampler that takes into account the difference between manifest and novelty components. We showcase our model performance through an extensive simulation study and applications on both multivariate and functional datasets, in which diverse and distinctive unknown patterns are discovered.

1 Introduction

Supervised classification with unobserved classes aims at predicting a qualitative output for a test set by training a classifier on a fully-labeled training set, where the former may contain classes not previously observed in the latter. This is usually not contemplated in a standard framework, where the learning units are assumed to be outlier-free realizations from all the sub-groups comprising the target population. However, this hypothesis may not hold in general, as in the case of evolving systems, where the presence of novel species is an important issue; in the analysis of social networks, whose configurations continuously expand and evolve; or in the presence of rare classes. Relevant examples include, but are not limited to, radar target detection (Carpenter et al., 1997), detection of masses in mammograms (Tarassenko et al., 1995), handwritten digit recognition (Tax and Duin, 1998) and e-commerce (Manikopoulos and Papavassiliou, 2002), for which labeled observations may not be available for each and every group. Within the model-based family of classifiers, adaptive methods have recently appeared in the literature to overcome this limitation. Miller and Browning (2003) pioneered a mixture model methodology for class discovery, robust classification, and outlier rejection. Bouveyron (2014) introduced an adaptive classifier in which two algorithms, based respectively on transductive and inductive learning, are devised for inference. More recently, Fop et al. (2018) extended the original work of Bouveyron (2014) by accounting for unobserved classes and extra variables in high-dimensional discriminant analysis.

Suppose we now want to distinguish between novelties, i.e., test observations displaying a specific common pattern, and anomalies, i.e., outlying units that can be regarded as noise. This further feature is addressed in Cappozzo et al. (2019), where, using impartial trimming (Gordaliza, 1991), a fixed percentage of data points is left unmodeled and no distributional assumptions are a priori imposed on the trimmed units.
While the distinction between a novel and an anomalous entity is most often apparent in practice, there exist some circumstances under which such separation is vague and somewhat philosophical. Consider, for example, an evolving ecosystem in which novel species are likely to appear over time. It may happen that at an initial instant $t_0$ a real novelty is mistaken for mere noise due to its embryonic stage. Nonetheless, if the model were to be fitted at a subsequent instant $t_1 > t_0$, the increased number of such samples could be sufficient to acknowledge an actual novel species.

Motivated by all these arguments, we propose a Bayesian Robust Adaptive Novelty Detector (BRAND), a two-stage procedure in a semiparametric Bayesian framework for simultaneously dealing with outliers and hidden classes that may be present in the test set. To sketch the idea, we first learn the known patterns from the labeled dataset using procedures that are robust against both outliers and label noise. In the second phase, we translate the training insights into informative priors, and we model the test units with a Bayesian mixture of known groups plus a novelty term. To reflect the lack of knowledge on the novelty term, we resort to a Dirichlet Process mixture model. The adoption of this nonparametric methodology allows us to overcome the problematic and somewhat unnatural specification of the number of mixture components in the novel group.

The rest of the manuscript is organized as follows: in Section 2, we present our two-stage methodology for novelty detection. We dedicate Section 3 to the investigation of the clustering properties induced by the random measure underlying our model. In Section 4, we propose a functional extension of the multivariate model, delineating a novelty detection method suitable for functional data. Section 5 discusses posterior inference, while in Section 6 we present an extensive simulation study and applications to multivariate and functional data. Concluding remarks and further research directions are outlined in Section 7.

2 A Two-Stage Bayesian procedure for Novelty Detection
Given a classification framework, consider the complete collection of learning units $X = \{(x_n, l_n)\}_{n=1}^{N}$, where $x_n$ denotes a $p$-variate observation and $l_n = j \in \{1, \ldots, J\}$ its associated group label. Both terms are directly available, and the distinct values in $l_n$, $n = 1, \ldots, N$, represent the $J$ observed classes, with subset sizes $n_1, \ldots, n_J$. Correspondingly, let $\{(y_m, z_m)\}_{m=1}^{M}$ be the test set where, differently from the standard assumptions, the unknown labels $z_m$ could belong to a set that encompasses more elements than $\{1, \ldots, J\}$. That is, a countable number of extra classes may be “hidden” in the test set, with no prior information available on their magnitude or on their structure. Therefore, it is reasonable to account for the novelty term via a single flexible component, from which a dedicated post-processing procedure may reveal circumstantial patterns (see Section 2.3). Both $x_n$ and $y_m$ are independent realizations of a continuous random vector (or function, see Section 4) $X$, whose conditional distribution varies according to the associated class labels. In the upcoming Sections, we assume that each observation in class $j$ is independent multivariate Gaussian, having density $\phi(\cdot \mid \Theta_j)$ with location-scale parameter $\Theta_j = (\mu_j, \Sigma_j)$, where $\mu_j \in \mathbb{R}^p$ denotes the mean vector and $\Sigma_j$ the corresponding covariance matrix. This allows for the automatic implementation of standard powerful methods in the training information extraction (see Section 2.1). Notwithstanding, the proposed methodology is general enough that it can be easily extended to deal with different component distributions.

The main modeling purpose is to classify the observations in the test set either into one of the $J$ observed classes or into the novel component. At the same time, we investigate the presence of homogeneous groups in the novelty term, discriminating between unseen components and outliers. In doing so, a two-stage strategy is devised. The first phase, described in Section 2.1, relies on a class-wise robust procedure for extracting prior information from the training set. Afterwards, the semiparametric Bayesian model, which is the main methodological contribution of the present paper, is fitted to the test units. A full account of its definition is reported in Section 2.2.

2.1 Robust extraction of prior information

The first step of our procedure is designed to obtain reliable estimates $\hat{\Theta}_j$ for the parameters of the observed classes from the learning set. To this aim, one could employ standard methods, such as the MLE in a classical framework or the MAP/posterior estimates in a Bayesian setting. Nonetheless, these standard approaches are not robust against contamination, and the presence of only a few outlying points could entirely bias the subsequent Bayesian model, should the informative priors be improperly set. A direct consequence of this undesirable behavior is reported in the simulation study of Section 6.1. To this end, we opt for more sophisticated alternatives that are able to deal with outliers and label noise when it comes to learning the structure of the observed classes. In particular, the selected methodologies involve the Minimum Covariance Determinant (MCD) estimator (Rousseeuw, 1984; Hubert et al., 2018) and, when facing high-dimensional data (as in the functional case of Section 6.3), the Minimum Regularized Covariance Determinant (MRCD) estimator (Boudt et al., 2020).
Clearly, at this stage any robust estimator of multivariate scatter and location may be employed to solve the problem: see, for instance, the comparison study reported in Maronna and Yohai (2017) for a (non-exhaustive) list of suitable candidates. We decide to rely on the MCD and its high-dimensional extension, the MRCD, for their well-established efficacy in the classification framework (Hubert and Van Driessen, 2004) and for the direct availability of fast algorithms for inference, readily implemented in the rrcov R package (Todorov and Filzmoser, 2009). We briefly recall the main MCD and MRCD features in the remainder of this Section. For a thorough treatment, the interested reader is referred to Hubert and Debruyne (2010) and Boudt et al. (2020), respectively.

The MCD is an affine equivariant and highly robust estimator of multivariate location and scatter, for which a fast algorithm is available (Rousseeuw and Driessen, 1999). The raw MCD estimator with parameter $\eta^{MCD} \in [0.5, 1]$, with $\lfloor (N + p + 1)/2 \rfloor \le \lfloor \eta^{MCD} N \rfloor \le N$, defines the following location and dispersion estimates:

• $\hat{\mu}^{MCD}$ is the mean of the $\lfloor \eta^{MCD} N \rfloor$ observations for which the determinant of the sample covariance matrix is minimal;

• $\hat{\Sigma}^{MCD}$ is the corresponding covariance matrix, multiplied by a consistency factor $c$ (Croux and Haesbroeck, 1999);

with $\lfloor \cdot \rfloor$ denoting the floor function. The MCD is a consistent, asymptotically normal and highly robust estimator, with bounded influence function and breakdown value equal to $(1 - \lfloor \eta^{MCD} N \rfloor / N)$ (Butler et al., 1993; Cator and Lopuhaä, 2012). However, a major drawback is its inapplicability when the data dimension $p$ exceeds the subset size $\lfloor \eta^{MCD} N \rfloor$, as the covariance matrix of any $\lfloor \eta^{MCD} N \rfloor$-subset becomes singular. This situation arises quite often in our context, as the MCD is applied group-wise to the observed classes in the training set, so that it is sufficient to have $p > \min_{j=1,\ldots,J} \lfloor \eta^{MCD} n_j \rfloor$ for the MCD solution to be ill-defined. In order to overcome this limitation, Boudt et al. (2020) introduced the MRCD estimator. The main idea is to replace the subset-based covariance estimation with a regularized one, defined as a weighted average of the sample covariance on the $\lfloor \eta^{MCD} N \rfloor$-subset and a predetermined positive definite target matrix. The MRCD estimator is defined as the multivariate location and regularized scatter based on the $\lfloor \eta^{MCD} N \rfloor$-subset that makes its overall determinant the smallest. The MRCD preserves the good breakdown properties of its non-regularized counterpart and, in addition, is applicable in high-dimensional problems where $\lfloor \eta^{MCD} N \rfloor$ is possibly smaller than $p$.

The first phase of our two-stage modeling procedure thus works as follows: considering the available labels $l_n$, $n = 1, \ldots, N$, we apply the MCD (or MRCD) estimator within each class to extract $\hat{\mu}^{MCD}_j$ and $\hat{\Sigma}^{MCD}_j$, $j = 1, \ldots, J$. For ease of notation, we let the superscript in the robust estimates be ‘MCD’ even when the regularized version is considered. Clearly, the MCD solution is preferred should the sample size be large enough. Lastly, there is no reason for $\eta^{MCD}$ to be the same in all observed classes. If a group is known to be particularly sensitive to outliers, its associated MCD subset size may be set smaller than the remaining ones. Since this type of information is seldom available, we subsequently let $\eta^{MCD}_j = \eta^{MCD}$ for all classes in the learning set. The so-obtained estimates are incorporated in the Bayesian model specification presented in Section 2.2, where the robust knowledge extracted from $X$ is accounted for as prior information and, according to an Empirical Bayes rationale, is used to set informative hyperparameters. In this way, outliers and label noise that may be present in the labeled units will not bias the initial beliefs for the known groups in the second phase.
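To fix ideas, the first stage can be sketched in a few lines of R. The snippet below is a minimal illustration, not the authors' released code: the object names X_train (an N x p numeric matrix) and labels (a vector taking values in 1, ..., J) are ours, while CovMcd, getCenter and getCov are the MCD routines of the rrcov package mentioned above.

library(rrcov)

# Stage 1 (sketch): class-wise robust location and scatter via the MCD.
robust_prior_info <- function(X_train, labels, eta_mcd = 0.75) {
  lapply(sort(unique(labels)), function(j) {
    mcd <- CovMcd(X_train[labels == j, , drop = FALSE], alpha = eta_mcd)
    list(mu_hat    = getCenter(mcd),  # robust class mean, centers the NIW prior of class j
         Sigma_hat = getCov(mcd))     # consistency-corrected robust scatter of class j
  })
}

For high-dimensional classes, CovMcd would be swapped for the regularized CovMrcd, available in recent versions of rrcov, leaving the rest of the pipeline unchanged.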
2.2 A Bayesian semiparametric model for the test set

We assume that each observation in the test set is generated according to a mixture of $J + 1$ elements: the $J$ multivariate Gaussians $\phi(\cdot \mid \Theta_j)$ that have been observed in the learning set, and an extra term $f^{nov}$ called the novelty component. In formulas:

$$ y_m \mid \boldsymbol{\pi}, \Theta_j, f^{nov} \sim \sum_{j=1}^{J} \pi_j \, \phi(\cdot \mid \Theta_j) + \pi_0 f^{nov}. \quad (1) $$

We define $\boldsymbol{\pi} = \{\pi_j\}_{j=1}^{J}$, where $\pi_j$ denotes the prior probability of the observed class $j$ (already present in the learning set), while $\pi_0$ is the probability of observing some novelty. Of course, $\sum_{j=0}^{J} \pi_j = 1$. To reflect our lack of knowledge on the novelty component $f^{nov}$, we employ a Bayesian nonparametric specification. In particular, we resort to the Dirichlet Process mixture model of Gaussians (Lo, 1984; Escobar and West, 1995), imposing the following structure:

$$ f^{nov} = \int \phi(\cdot \mid \Theta^{nov}) \, G(d\Theta^{nov}), \qquad G \sim DP(\gamma, H), \quad (2) $$

where $DP(\gamma, H)$ is the usual Dirichlet process with concentration parameter $\gamma$ and base measure $H$ (Ferguson, 1973). Adopting Sethuraman's Stick-Breaking construction (Sethuraman, 1994), we can express the likelihood as follows:

$$ \mathcal{L}(\boldsymbol{y} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\omega}) = \prod_{m=1}^{M} \left[ \sum_{j=1}^{J} \pi_j \, \phi(y_m \mid \mu_j, \Sigma_j) + \pi_0 \sum_{h=1}^{\infty} \omega_h \, \phi(y_m \mid \mu^{nov}_h, \Sigma^{nov}_h) \right]. \quad (3) $$

The term $\sum_{h=1}^{\infty} \omega_h \phi(\cdot \mid \Theta^{nov}_h)$ represents a Dirichlet Process convoluted with a Normal kernel, flexibly modeling a potentially infinite number of hidden classes and/or outlying observations. The following prior probabilities for the parameters complete the Bayesian model specification:

$$ \Theta_j = (\mu_j, \Sigma_j) \sim P^{Tr}_j, \quad j = 1, \ldots, J, \qquad \Theta^{nov}_h = (\mu^{nov}_h, \Sigma^{nov}_h) \sim H, \quad h = 1, \ldots, \infty, $$
$$ \boldsymbol{\pi} \sim Dir(a_0, a_1, \ldots, a_J), \qquad \boldsymbol{\omega} \sim SB(\gamma). \quad (4) $$

The values $a_1, \ldots, a_J$ are the hyperparameters of a Dirichlet distribution on the known classes. The learning set can be exploited to determine reasonable values of such hyperparameters by setting $a_j = n_j / N$. The quantity $a_0$ is the initial prior belief on how much novelty we expect to discover in the test set. Generally, the parameter controlling the novelty proportion, $a_0$, is a priori considered to be small.

To exploit conjugacy, we adopt Normal-inverse-Wishart (NIW) priors for both the manifest and the novel classes. For each of the known groups we assume that

$$ P^{Tr}_j \equiv NIW\left( \hat{\mu}^{MCD}_j, \lambda^{Tr}, \nu^{Tr}, \hat{\Sigma}^{MCD}_j \right), \quad j = 1, \ldots, J, $$

where $\hat{\mu}^{MCD}_j$ and $\hat{\Sigma}^{MCD}_j$ are the MCD robust estimates obtained in phase I. At the same time, the precision parameter $\lambda^{Tr}$ and the degrees of freedom $\nu^{Tr}$ are treated as tuning parameters to enforce high mass around the robust estimates. By letting these two parameters go to infinity, we can also recover the degenerate case $P^{Tr}_j = \delta_{\hat{\Theta}_j}$, where the Dirac delta denotes a point mass centered in $\hat{\Theta}_j$. That is, the prior beliefs extracted from the training set can be flexibly updated, gradually transitioning from transductive to inductive inference as $\lambda^{Tr}$ and $\nu^{Tr}$ increase (Bouveyron, 2014). Similarly, we set $H \equiv NIW(m_0, \lambda_0, \nu_0, S_0)$, where the hyperparameters are chosen to induce a flat prior for the novel components. Lastly, with $\boldsymbol{\omega} \sim SB(\gamma)$ we denote the vector of Stick-Breaking weights, composed of elements defined as

$$ \omega_k = v_k \prod_{l < k} (1 - v_l), \qquad v_k \stackrel{iid}{\sim} Beta(1, \gamma). \quad (5) $$
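For intuition, a finite truncation of the stick-breaking weights in (5) can be drawn in base R as follows. The helper name rstick and the truncation level K are ours, and the truncation is purely illustrative: the sampler of Section 5 avoids it altogether.

# Sketch: draw K truncated stick-breaking weights from SB(gamma), as in (5).
rstick <- function(K, gamma) {
  v <- rbeta(K, 1, gamma)          # v_k ~ Beta(1, gamma)
  v[K] <- 1                        # close the stick so the K weights sum to one
  v * cumprod(c(1, 1 - v[-K]))     # omega_k = v_k * prod_{l<k} (1 - v_l)
}
omega <- rstick(K = 50, gamma = 1)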
2.3 A posteriori novelty detection

Once the model is fitted, each test unit is assigned either to one of the $J$ known classes or to the novelty component. To explore the structure of the latter (i.e., of the units whose component membership indicator $\alpha_m$, formally introduced in Section 5, equals 0), we first compute the pairwise coclustering matrix $P = \{p_{m,m'}\}$, whose entry $p_{m,m'}$ denotes the probability that $y_m$ and $y_{m'}$ belong to the same cluster. We then retrieve the best partition minimizing the Variation of Information (VI) criterion, as suggested in Wade and Ghahramani (2018). More details on how to post-process the MCMC output are given in Section 5.
3 Properties of the proposed semiparametric prior

We now investigate the properties of the underlying random mixing measure induced by the model specification presented in the previous Section. Specifically, the model in (3)-(4) can be generalized in the following hierarchical form, which highlights the dependence on a discrete random measure $\tilde{p}$:

$$ y_m \mid \Theta_m \sim N(\cdot \mid \Theta_m), \qquad \Theta_m \mid \tilde{p} \stackrel{iid}{\sim} \tilde{p}, \qquad \tilde{p} = \sum_{j=1}^{J} \pi_j \delta_{\Theta_j} + \pi_0 \left[ \sum_{h=1}^{+\infty} \omega_h \delta_{\Theta^{nov}_h} \right], $$
$$ (\pi_0, \pi_1, \ldots, \pi_J) \sim Dir(a_0, a_1, \ldots, a_J), \qquad \boldsymbol{\omega} \sim SB(\gamma), \qquad \Theta_j \sim P^{Tr}_j, \qquad \Theta^{nov}_h \sim H. \quad (7) $$

From (7) it can be seen how our model is an extension of the contaminated informative priors proposed in Scarpa and Dunson (2009), where the authors propose to juxtapose a single atom to a DP. We further assume that each $P^{Tr}_j$ is a probability distribution with mean $\mu_j$, second moment $\mu_{j,2}$ and variance $\sigma^2_j$, $j = 1, \ldots, J$. Similarly, let $E[\Theta^{nov}_h] = \mu_0$ and $V[\Theta^{nov}_h] = \sigma^2_0$ for all $h \ge 1$, and define $a = \sum_{j=0}^{J} a_j$. For all $m \in \{1, \ldots, M\}$, we can prove that

$$ E[\Theta_m] = \sum_{j=0}^{J} \frac{a_j}{a} \mu_j, \qquad V[\Theta_m] = \sum_{j=0}^{J} \frac{a_j}{a} \left( \mu_{j,2} - \frac{a_j}{a} \mu_j^2 \right) - 2 \sum_{l > j \ge 0} \frac{a_j a_l}{a^2} \mu_l \mu_j. $$

The overall variance can also be written in terms of the variances of the single, observed mixture components:

$$ V[\Theta_m] = \sum_{j=0}^{J} \frac{a_j}{a} \left( \sigma^2_j + \left(1 - \frac{a_j}{a}\right) \mu_j^2 \right) - 2 \sum_{l > j \ge 0} \frac{a_j a_l}{a^2} \mu_l \mu_j. $$

Given the discrete nature of $\tilde{p}$, we can expect ties between realizations sampled from this measure, say $\Theta_m$ and $\Theta_{m'}$. Therefore, we can compute the probability of obtaining a tie as:

$$ P(\Theta_m = \Theta_{m'}) = \sum_{k=1}^{J} \frac{a_k (a_k + 1)}{a (a + 1)} + \frac{a_0 (a_0 + 1)}{a (a + 1)} \cdot \frac{1}{1 + \gamma}, \quad (8) $$

where the contribution of the novelty term to this probability is multiplicatively reduced by a factor that depends on the inverse of the concentration parameter. The proof of (8) is reported in Appendix A. Clearly, if a priori we expect a large number of clusters in the novelty term (large $\gamma$), the probability of a tie reduces. Indeed, some noticeable limiting cases arise:

$$ \lim_{\gamma \to +\infty} P(\Theta_m = \Theta_{m'}) = \sum_{k=1}^{J} \frac{a_k (a_k + 1)}{a (a + 1)}, \qquad \lim_{\gamma \to 0} P(\Theta_m = \Theta_{m'}) = \sum_{k=0}^{J} \frac{a_k (a_k + 1)}{a (a + 1)}. $$

If $\gamma \to 0$, the novelty term collapses onto a single atom, and we recover the tie probability of a classical mixture with $J + 1$ components. Instead, $\gamma \to +\infty$ leads to the case of a DP with numerous atoms characterized by similar probability, hence annihilating the contribution of the novelty term to the probability of a tie.
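The tie probability in (8) is easy to check by simulation. The sketch below, which reuses the rstick helper introduced after (5), repeatedly draws the mixture weights, computes the tie probability conditional on them, and averages. It is a numerical sanity check under an arbitrary choice of $(a_0, a_1, \ldots, a_J)$ and $\gamma$ of our making, not part of the authors' code.

# Monte Carlo check of (8); a_vec = (a_0, a_1, ..., a_J).
tie_prob_mc <- function(a_vec, gamma, R = 1e4, K = 200) {
  mean(replicate(R, {
    g <- rgamma(length(a_vec), shape = a_vec)
    p <- g / sum(g)                          # p ~ Dir(a_0, ..., a_J)
    w <- c(p[-1], p[1] * rstick(K, gamma))   # atom weights of p-tilde
    sum(w^2)                                 # P(tie | weights)
  }))
}
a_vec <- c(0.5, 2, 3); gamma <- 1; a <- sum(a_vec)
tie_prob_mc(a_vec, gamma)                    # Monte Carlo estimate
sum(a_vec[-1] * (a_vec[-1] + 1)) / (a * (a + 1)) +   # closed form (8)
  a_vec[1] * (a_vec[1] + 1) / (a * (a + 1)) / (1 + gamma)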
Moreover, suppose we rewrite the distribution of $\boldsymbol{\pi}$ as $Dir\left( \frac{a_0}{J+1}, \frac{\tilde{a}}{J+1}, \ldots, \frac{\tilde{a}}{J+1} \right)$. In this case, the hyperparameters relative to the observed groups are assumed equal to $\tilde{a}/(J+1)$. Then, we obtain $a = \frac{a_0 + J\tilde{a}}{J+1}$, and

$$ P(\Theta_m = \Theta_{m'}) = \frac{ J \frac{\tilde{a}}{J+1} \left( \frac{\tilde{a}}{J+1} + 1 \right) }{ \frac{a_0 + J\tilde{a}}{J+1} \left( \frac{a_0 + J\tilde{a}}{J+1} + 1 \right) } + \frac{ \frac{a_0}{J+1} \left( \frac{a_0}{J+1} + 1 \right) }{ \frac{a_0 + J\tilde{a}}{J+1} \left( \frac{a_0 + J\tilde{a}}{J+1} + 1 \right) } \cdot \frac{1}{1 + \gamma}. \quad (9) $$

As $J$ increases, the second part of (9) vanishes. Accordingly, if we suppose an unbounded number of observed groups, letting $J \to \infty$, we have $P(\Theta_m = \Theta_{m'}) = 1 / (1 + \tilde{a})$ as in the classical DP case, and the model loses its ability to detect novel instances.

Lastly, we investigate the covariance between the two random elements $\Theta_m$ and $\Theta_{m'}$. Consider a vector $\varrho = \{\varrho_j\}_{j=0}^{J}$, with the first entry equal to $1/(1+\gamma)$ and the remaining entries equal to 1. Then,

$$ Cov_\gamma(\Theta_m, \Theta_{m'}) = \sum_{j=0}^{J} \left( \frac{a_j (a_j + 1)}{a (a + 1)} \varrho_j \mu_{j,2} - \frac{a_j^2}{a^2} \mu_j^2 \right) - \frac{2}{a^2 (a + 1)} \sum_{j > l \ge 0} a_j a_l \mu_j \mu_l + \frac{a_0 (a_0 + 1)}{a (a + 1)} \frac{\gamma}{1 + \gamma} \mu_0^2. \quad (10) $$

It can be seen that the covariance is characterized by three terms. In the first two, the seen and unseen components have the same influence. The last component, however, is nonnegative and entirely determined by quantities linked to the novel part of the model. Notice that if $\gamma \to 0$, the covariance reduces to

$$ Cov_0(\Theta_m, \Theta_{m'}) = \sum_{j=0}^{J} \left( \frac{a_j (a_j + 1)}{a (a + 1)} \mu_{j,2} - \frac{a_j^2}{a^2} \mu_j^2 \right) - \frac{2}{a^2 (a + 1)} \sum_{j > l \ge 0} a_j a_l \mu_j \mu_l, $$

which is the same covariance we would obtain if $\tilde{p} = \tilde{p}_0 \equiv \sum_{j=0}^{J} \pi_j \delta_{\Theta_j}$, i.e., if we were dealing with a classical mixture model with $J + 1$ components. This implies that (10) can be rewritten as

$$ Cov_\gamma(\Theta_m, \Theta_{m'}) = Cov_0(\Theta_m, \Theta_{m'}) - \frac{a_0 (a_0 + 1)}{a (a + 1)} \frac{\gamma}{1 + \gamma} \sigma_0^2, $$

which leads to a nice interpretation: the introduction of the novelty atoms decreases the “standard” covariance, and this effect is stronger as the prior weight given to the novelty component $a_0$, the dispersion of the base measure $\sigma_0^2$, and/or the concentration parameter $\gamma$ increase.

4 Extension to functional data

The modeling framework introduced in Section 2 is very general and can be easily modified to handle more involved data structures. In this Section, building upon model (3)-(4), we develop a methodology for functional classification that allows the detection of functional novelties. We hereafter assume that our training and test instances are error-prone realizations of a univariate stochastic process $X(t)$, $t \in \mathcal{T}$, with $\mathcal{T} \subset \mathbb{R}$.

Recently, numerous authors have contributed to the area of Bayesian nonparametric functional clustering (see, for example, Bigelow and Dunson, 2009; Petrone et al., 2009; Rodriguez and Dunson, 2014; Rigon, 2019). The key feature of our approach is the inclusion of prior knowledge that helps discriminate among novel and observed classes of functions. Canale et al. (2017) propose a Pitman-Yor mixture with a spike-and-slab base measure to effectively model the daily basal body temperature in women, by including the a priori known distinctive biphasic trajectory that characterizes healthy subjects. Instead of modifying the base measure of the nonparametric process, Scarpa and Dunson (2009) address the same problem by contaminating a point mass with a realization from a DP. As such, our approach can be interpreted as a direct extension of the latter, where $J \ge 1$ known components are included in the specification.

Let $\Theta_m(t) = \left( f_m(t), \sigma^2_m(t) \right)$ denote the vector comprising the smooth functional mean $f_m : \mathcal{T} \to \mathbb{R}$ and the measurement noise $\sigma^2_m : \mathcal{T} \to \mathbb{R}^+$ for a generic curve $m$ in the test set, evaluated at instant $t$. Then the BRAND model, as introduced in Section 2.2 for multivariate data, can be modified as follows:

$$ y_m(t) \mid \Theta_m(t) = f_m(t) + \varepsilon_m(t), \qquad \varepsilon_m(t) \sim N(0, \sigma^2_m(t)), \qquad \Theta_m(t) \mid \tilde{p} \sim \tilde{p}, $$
$$ \tilde{p} = \sum_{j=1}^{J} \pi_j \delta_{\Theta_j} + \pi_0 \left[ \sum_{h=1}^{+\infty} \omega_h \delta_{\Theta^{nov}_h} \right], \qquad (\pi_0, \pi_1, \ldots, \pi_J) \sim Dir(a_0, a_1, \ldots, a_J), $$
$$ \boldsymbol{\omega} \sim SB(\gamma), \qquad \Theta_j \sim P^{Tr}_j, \qquad \Theta^{nov}_h \sim H, \quad (11) $$
where all the distributions $P^{Tr}_j$ and the base measure $H$ model the functional mean and the noise independently. We propose the following informative prior for $\Theta_j = \left( f_j(t), \sigma^2_j(t) \right)$:

$$ f_j(t) \stackrel{ind.}{\sim} N\left( \bar{f}_j(t), \varphi_j \right), \qquad \sigma^2_j(t) \stackrel{ind.}{\sim} IG\left( \frac{\left( \bar{\sigma}^2_j(t) \right)^2}{v_j} + 2, \; \bar{\sigma}^2_j(t) \left( \frac{\left( \bar{\sigma}^2_j(t) \right)^2}{v_j} + 1 \right) \right). \quad (12) $$

We denote by $\bar{f}_j$ and $\bar{\sigma}^2_j$, $j = 1, \ldots, J$, the estimates of the mean and variance functions obtained from the training set for each observed class $j$. The hyperparameters $\varphi_j$ define the degree of confidence we a priori assume for the information extracted from the learning set, while the Inverse Gamma ($IG$) specification ensures that $E[\sigma^2_j(t)] = \bar{\sigma}^2_j(t)$ and $Var[\sigma^2_j(t)] = v_j$. It remains to define how we compute $\bar{f}_j$ and $\bar{\sigma}^2_j$, that is, how the robust extraction of prior information is performed in this functional extension. Applying standard procedures in Functional Data Analysis (Ramsay and Silverman, 2005), we first smooth each training curve via a weighted sum of $L$ basis functions,

$$ x_n(t) = \sum_{l=1}^{L} \xi_{nl} \phi_l(t), \qquad n = 1, \ldots, N, $$

where $\phi_l(t)$ is the $l$-th basis function evaluated in $t$ and $\xi_{nl}$ its associated coefficient. Given the acyclic nature of the functional objects treated in Section 6.3, we will subsequently employ B-spline bases (de Boor, 2001). Clearly, depending on the problem at hand, any basis function system may be considered in this phase. After such a representation has been obtained, we are left with $J$ matrices of coefficients, each of dimension $n_j \times L$. By treating them as multivariate entities, as done for example in Abraham et al. (2003), we resort to the very same procedures described in Section 2.1 and we set

$$ \bar{f}_j(t) = \sum_{l=1}^{L} \hat{\xi}^{MCD}_{jl} \phi_l(t), \qquad \bar{\sigma}^2_j(t) = \frac{1}{n_j - 1} \sum_{n : l_n = j} \left( x_n(t) - \bar{f}_j(t) \right)^2, $$

where $\hat{\xi}^{MCD}_{jl}$ is the robust location estimate computed via MCD (or MRCD) on the $n_j \times L$ matrix of coefficients, $j = 1, \ldots, J$.
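In R, this robust functional extraction can be sketched as follows: each curve observed on a common grid is projected onto a B-spline basis, and the class-wise MCD is run on the resulting coefficients. The object names t_grid, Y_train (an N x T matrix of discretized curves) and labels are illustrative, the least-squares projection is only one of several possible smoothers, and for a large basis with small classes (as with L = 100 in Section 6.3) CovMrcd would replace CovMcd.

library(splines)
library(rrcov)

# Sketch of the functional stage 1: B-spline coefficients + class-wise MCD.
B  <- bs(t_grid, df = 25)                                    # basis matrix, L = 25 columns
Xi <- t(apply(Y_train, 1, function(y) coef(lm(y ~ B - 1))))  # N x L coefficient matrix

j        <- 1                                                # robust summaries for class j
mcd_j    <- CovMcd(Xi[labels == j, ], alpha = 0.75)
f_bar_j  <- as.vector(B %*% getCenter(mcd_j))                # robust mean curve of class j
res_j    <- sweep(Y_train[labels == j, ], 2, f_bar_j)
s2_bar_j <- colSums(res_j^2) / (sum(labels == j) - 1)        # pointwise variance estimate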
On the other hand, more flexibility is needed to specify the base measure $H$ for $\Theta^{nov}_h = \left( f^{nov}_h(t), \sigma^{2,nov}_h(t) \right)$. Therefore, relying on the same smoothing procedure considered for the training curves, we build a hierarchical specification for the quantities involved in the novelty term:

$$ f^{nov}_h(t) = \sum_{l=1}^{L} \xi^{nov}_{hl} \phi_l(t), \qquad \xi^{nov}_{hl} \sim N(\psi_h, \tau_h), $$
$$ \psi_h \sim N(0, s^2), \qquad \tau_h \sim IG(a_\tau, b_\tau), \qquad \sigma^{2,nov}_h(t) \sim IG(a_H, b_H). \quad (13) $$

The first line of (13) can be rewritten as $f^{nov}_h(t) \sim N\left( \psi_h \sum_{l=1}^{L} \phi_l(t), \; \tau_h \sum_{l=1}^{L} \phi_l(t)^2 \right)$. We call this model functional BRAND. It provides a powerful extension for functional novelty detection; a successful application is reported in Section 6.3.

5 Posterior inference

The distribution $p(\boldsymbol{\pi}, \boldsymbol{\omega}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\Theta}, \boldsymbol{\Theta}^{nov} \mid \boldsymbol{y})$ is analytically intractable; therefore, we rely upon MCMC techniques to carry out posterior inference. An easy sampling scheme can be constructed mimicking the blocked Gibbs sampler of Ishwaran and James (2001), where the infinite series in (3) is truncated at a pre-specified level $L < \infty$. However, this approach leads to a non-negligible truncation error if $L$ is too small, and to computational inefficiencies if $L$ is set too high. Instead, we propose a modification of the $\xi$-sequence of the Independent Slice-efficient sampler (Kalli et al., 2011), another well-known conditional algorithm that allows one to sample from the exact posterior. To adapt the algorithm to our framework, we start from the following alternative reparameterization of the model in (3)-(4):

$$ y_m \mid \tilde{\boldsymbol{\Theta}}, \zeta_m \sim \phi\left( y_m \mid \tilde{\Theta}_{\zeta_m} \right), \qquad \zeta_m \mid \tilde{\boldsymbol{\pi}} \sim \sum_{k=1}^{\infty} \tilde{\pi}_k \delta_k, \qquad \tilde{\pi}_k = \pi_k \mathbb{1}_{\{k \le J\}} + \pi_0 \, \omega_{k-J} \mathbb{1}_{\{k > J\}}, \quad (14) $$

where $\tilde{\boldsymbol{\Theta}} = (\Theta_1, \ldots, \Theta_J, \Theta^{nov}_1, \Theta^{nov}_2, \ldots)$ collects the atoms of the manifest and novel components. The membership labels are recovered as

$$ \alpha_m = \zeta_m \mathbb{1}_{\{\zeta_m \le J\}}, \qquad \beta_m = (\zeta_m - J) \mathbb{1}_{\{\zeta_m > J\}}, \quad (15) $$

so that $\alpha_m = 0$ flags an allocation to the novelty component, whose inner cluster is indexed by $\beta_m$. Following Kalli et al. (2011), we then augment the model with uniform slice variables $u_m$ and a deterministic, positive sequence $\{\xi_l\}_{l \ge 1}$, obtaining

$$ f(y_m, u_m, \zeta_m) \propto \mathbb{1}_{\{u_m < \xi_{\zeta_m}\}} \frac{\tilde{\pi}_{\zeta_m}}{\xi_{\zeta_m}} \, \phi\left( y_m \mid \tilde{\Theta}_{\zeta_m} \right), \quad (16) $$

so that, conditionally on the slice variables, only the finitely many components with $\xi_l > u_m$ need to be evaluated. The choice of the $\xi$-sequence has to take into account the difference between the manifest and the novel components. Usually, a very common choice is $\xi_l = (1 - \kappa) \kappa^{l-1}$. This option allows one to compute each truncation bound $L_m$ analytically, it being the largest integer such that

$$ L_m < 1 + \frac{\log(u_m) - \log(1 - \kappa)}{\log(\kappa)}. $$

However, the default choice of a geometrically decreasing $\xi$-sequence is inappropriate in this context, since we are dealing with a mixture where not all the components are conceptually equivalent. In fact, the default $\xi$-sequence tends to favor components that come first in the mixture specification (in our case, the manifest ones). To overcome this issue, we propose the following intuitive modification. Given a value for $\kappa \in (0, 1)$, we assign a fraction $(1 - \kappa)$ of the mass to the first $J + 1$ elements of the sequence, and we then induce a geometric decay in the remaining ones to split the residual fraction $\kappa$. We force the element in position $J + 1$ to have the same mass given to the manifest components, to avoid an under-representation of the novelty part. To do so, we define

$$ \xi_l = \begin{cases} \dfrac{1 - \kappa}{J + 1} & \text{if } l \le J + 1, \\[2mm] \dfrac{1 - \kappa}{J + 1} \left( \dfrac{(J + 1)\kappa}{J\kappa + 1} \right)^{l - J - 1} & \text{if } l > J + 1. \end{cases} \quad (17) $$

It is easy to prove that $\sum_{l=1}^{+\infty} \xi_l = 1$. According to (17), the first $J + 1$ elements of the sequence have masses equal to $(1 - \kappa)/(J + 1)$. The truncation threshold changes accordingly, becoming the largest integer $L^*$ such that

$$ L^* < J + 1 + \frac{\log(\min(\boldsymbol{u})) - \log\left( \frac{1 - \kappa}{J + 1} \right)}{\log\left( \frac{(J + 1)\kappa}{J\kappa + 1} \right)}. \quad (18) $$

Inequality (18) implies that the truncation threshold $L^*$ can only be greater than or equal to $J + 1$, ensuring that the MCMC always takes into consideration the creation of at least one cluster in the novelty distribution. A representation of the modified $\xi$-sequence is depicted in Figure 1.

Figure 1: The modified $\xi$-sequence, here with $\kappa = 0.25$. The blue rectangle highlights the weights relative to the known components.
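The modified sequence (17) and the bound (18) translate directly into code. A small sketch of our own follows, with an explicit max guard making the $J + 1$ lower bound of (18) visible:

# Sketch of the modified xi-sequence (17) and of the truncation bound (18).
xi_seq <- function(l, J, kappa) {
  r <- (J + 1) * kappa / (J * kappa + 1)          # geometric decay rate of the tail
  ifelse(l <= J + 1,
         (1 - kappa) / (J + 1),                   # flat mass on the first J + 1 slots
         (1 - kappa) / (J + 1) * r^(l - J - 1))   # geometric tail for the novelty atoms
}

L_star <- function(u, J, kappa) {                 # largest index retained by the slicer
  r <- (J + 1) * kappa / (J * kappa + 1)
  bound <- J + 1 + (log(min(u)) - log((1 - kappa) / (J + 1))) / log(r)
  max(J + 1, floor(bound))
}

sum(xi_seq(1:1000, J = 3, kappa = 0.5))           # ~ 1, as stated after (17)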
The pseudo-code for the devised Gibbs sampler is reported in Appendix B. The algorithm for the functional extension is not included for conciseness, yet its structure closely follows the one outlined for the multivariate case. Software routines, including the implementation of both methods, the simulation study, and the real data analyses of Section 6, are openly available at https://github.com/AndreaCappozzo/brand-public_repo.

Once the MCMC sample is collected, we first compute the a posteriori probability of being a novelty for every test unit $m$, $PPN_m = P[y_m \sim f^{nov} \mid \boldsymbol{Y}]$, which is estimated according to the ergodic mean:

$$ \widehat{PPN}_m = \frac{ \sum_{i=1}^{I} \mathbb{1}_{\{\alpha^{(i)}_m = 0\}} }{I}, \quad (19) $$

where $\alpha^{(i)}_m$ is the value assumed by the parameter $\alpha_m$ at the $i$-th iteration of the MCMC chain and $I$ is the total number of iterations. We remark that the inference on $\boldsymbol{\alpha}$ can be conducted directly, since the mixture between the $J$ observed components and $f^{nov}$ is not subject to label switching. In contrast, we need to take this problem into account when dealing with $\boldsymbol{\beta}$. To perform valid inference, one possibility is to rely on the posterior probability coclustering matrix (PPCM) defined in Section 2.3. Each entry of this matrix, $p_{m,m'} = P[y_m \text{ and } y_{m'} \text{ belong to the same novelty class}]$, is estimated as

$$ \hat{p}_{m,m'} = \frac{ \sum_{i=1}^{I} \mathbb{1}_{\{\beta^{(i)}_m = \beta^{(i)}_{m'}\}} }{I}. \quad (20) $$

Once the PPCM is obtained, we can employ it to estimate the best partition (BP) in the novelty subset. The BP is obtained by minimizing a loss function defined over the space of clusterings, which can be computed starting from the PPCM.
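Given stored draws of the membership vectors, estimators (19) and (20) amount to simple averages. In the sketch below, alpha_draws and beta_draws are I x M matrices of MCMC draws (our notation), and the 0.5 threshold on the PPN is an illustrative allocation rule; the final VI minimization can be delegated to, e.g., the minVI function of the mcclust.ext package accompanying Wade and Ghahramani (2018) — an assumption on tooling, not necessarily the authors' exact pipeline.

# Sketch: estimate PPN (19) and the PPCM (20) from stored membership draws.
ppn <- colMeans(alpha_draws == 0)             # eq. (19): one probability per test unit

nov <- which(ppn > 0.5)                       # units a posteriori flagged as novelties
Bm  <- beta_draws[, nov, drop = FALSE]        # novelty cluster labels, I x |nov|
ppcm <- matrix(0, length(nov), length(nov))
for (i in seq_len(nrow(Bm)))                  # eq. (20): average pairwise agreement
  ppcm <- ppcm + outer(Bm[i, ], Bm[i, ], "==")
ppcm <- ppcm / nrow(Bm)
# best_partition <- mcclust.ext::minVI(ppcm)$cl   # VI minimization (discussed below)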
A famous loss function was proposed by Binder (1978) and investigated in a BNP setting by Lau and Green (2007). However, the so-called Binder loss presents peculiar asymmetries, preferring to split rather than to merge clusters. This could result in an unnecessarily high number of estimated clusters. Therefore, we adopt the Variation of Information (VI; Meilă, 2007) as loss criterion. The associated loss function, recently proposed by Wade and Ghahramani (2018), is known to provide less fragmented results.

Finally, once the BP for the novelty component has been estimated, we can rely on a heuristic based on the clusters' sizes to discriminate anomalies from actual new classes. Let us suppose that the BP comprises $S$ novel clusters, and denote by $m^{nov}_s$ the number of instances assigned to cluster $s \in \{1, \ldots, S\}$. A cluster $s$ is labeled as an agglomerate of outlying points if its frequency $m^{nov}_s$ is sufficiently small, say below the first percentile of the entire novelty sample size. Oppositely, all clusters whose frequencies exceed this threshold are regarded as proper novel groups. A sketch of this rule is reported below.
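A literal implementation of the size-based rule could read as follows; the 1% default mirrors the first-percentile threshold suggested above, while the function name and output labels are illustrative choices of ours.

# Sketch: flag small novelty clusters as outliers, large ones as novel classes.
split_novelty <- function(best_partition, threshold = 0.01) {
  prop <- table(best_partition) / length(best_partition)  # relative cluster sizes
  ifelse(prop[as.character(best_partition)] < threshold,
         "outlier", "novel class")                         # one label per novelty unit
}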
6 Simulation study and applications

In this Section, we present a simulation study aimed at highlighting the capabilities of the new semiparametric Bayesian model in performing novelty detection, comparing it with existing methodologies. By considering different scenarios, in terms of hidden-class sample sizes and adulteration proportion in the training set, we evaluate the importance of the robust information extraction phase and how it affects the learning procedure.

6.1 Simulation study

We consider a training set formed by $J = 3$ observed classes, each distributed according to a bivariate normal density $N(\mu_j, \Sigma_j)$, $j = 1, 2, 3$. The class sample sizes are, respectively, equal to $n_1 = 300$, $n_2 = 300$ and $n_3 = 400$. The same groups are also present in the test set, together with four previously unobserved classes, likewise generated from bivariate normal densities with their own mean vectors and covariance matrices. The test set thus encompasses a total of 7 components: 3 observed and 4 novelties. Starting from the above-described data-generating process, we consider four different scenarios, varying:

• Data contamination level:
  – no contamination in the training set (Label noise = False);
  – 12% label noise between classes 2 and 3 (Label noise = True).

• Test set sample size:
  – novelty subset size equal to 30% of the test set (Novelty size = Not small), with $m_1 = 200$, $m_2 = 200$, $m_3 = 250$, $m_4 = 90$, $m_5 = 100$, $m_6 = 100$, $m_7 = 10$;
  – novelty subset size equal to 15% of the test set (Novelty size = Small), with $m_1 = 350$, $m_2 = 250$, $m_3 = 250$, $m_4 = 49$, $m_5 = 50$, $m_6 = 50$, $m_7 = 1$.
Figure 2 exemplifies the experiment structure, displaying a realization from the Label noise = True, Novelty size = Not small scenario. As is evident from the plots, the label noise is strategically included to cause a more difficult identification of the fourth class, should the parameters of the second and third classes be non-robustly learned. Further, notice that the last group presents limited sample size and variability: it could easily be regarded as pointwise contamination (i.e., an anomaly) rather than an actual new component. Nonetheless, following the reasoning outlined in the Introduction, we are interested in evaluating the ability of the nonparametric density to capture and discriminate these types of peculiar patterns as well. For each combination of contamination level and test set sample size, we simulate $B = 100$ datasets. Results are reported in what follows.

Figure 2: Simulated data for the Label noise = True, Novelty size = Not small scenario (training set, left; test set, right). Classes 4, ..., 7 constitute the novelty.

We compare the performance of the BRAND model under different hyperparameter specifications:

• the information from the training set is either non-robustly ($\eta^{MCD} = 1$) or robustly ($\eta^{MCD} = 0.75$) extracted;

• the precision parameter associated with the training prior belief is either very high ($\lambda^{Tr} = 1000$) or moderately low ($\lambda^{Tr} = 10$).

In addition, two model-based adaptive classifiers are considered in the comparison, namely the inductive RAEDDA model (Cappozzo et al., 2019), with labeled and unlabeled trimming levels respectively equal to 0.12 and 0.05, and the inductive AMDA model (Bouveyron, 2014). For each replication of the simulated experiment, a set of four metrics is recorded on the test set:

• Novelty predictive value (Precision): the proportion of units marked as novelties by a given method that truly belong to classes 4, ..., 7;

• Accuracy on the known classes: the classification accuracy of a given method within the subset of groups already observed in the training set;

• Adjusted Rand Index (ARI; Rand, 1971): measuring the similarity between the partition returned by a given method and the underlying true structure;
• PPN: the a posteriori probability of being a novelty, computed according to Equation (19) (BRAND only).

We retain 20000 MCMC iterations after discarding the first 20000 as a burn-in phase. Apart from the hyperparameters for the training components, fairly uninformative priors are employed in the base measure $H$, with $m_0 = \mathbf{0}$, a small precision $\lambda_0$, $\nu_0 = 10$ and $S_0 = 10 I$. Lastly, a Gamma hyperprior with both hyperparameters equal to 1 is placed on the DP concentration parameter.

Figure 3 and Table 1 report the results for $B = 100$ repetitions of the experiment under the different simulated scenarios.

Figure 3: Box plots for (from top to bottom) novelty predictive value, accuracy on the known classes and ARI metrics for $B = 100$ repetitions of the simulated experiment, varying data contamination level and test set sample size.

Figure 4: Hex plots of the average estimated posterior probability of being a novelty, according to formula (19), for $B = 100$ repetitions of the simulated experiment, varying data contamination level and BRAND hyperparameters, Not small novelty subset size. The brighter the color, the higher the probability of belonging to $f^{nov}$.

The Novelty predictive value metric highlights the models' capability to correctly recover and identify the previously unseen patterns. As expected, in the adulteration-free scenarios, all methodologies succeed well enough in separating known and hidden components. The worst performance is exhibited by the RAEDDA model for which, due to the fixed trimming level, a small part of the group-wise most extreme (but still genuine) observations is discarded, thus slightly overestimating the novelty percentage (the same happens for the ARI metric). Different results are displayed in those scenarios wherein the label noise complicates the learning process. Robust procedures efficiently cope with the adulteration present in the training set, while the AMDA model and the BRAND methods with $\eta^{MCD} = 1$ tend to largely overestimate the novelty component. In particular, the harmful effect caused by the mislabeled units is exacerbated in the BRAND model that sets high confidence in the priors ($\lambda^{Tr} = 1000$), while a partial mitigation, albeit feeble, emerges when $\lambda^{Tr}$ is set equal to 10. This consequence is even more apparent in the hex plots of Figure 4, where we see that the latter model tries to modify its prior belief to accommodate the (outlier-free) test units, while the former, forced to stick close to its prior distribution by the high value of $\lambda^{Tr}$, incorporates the second and third classes in the novelty term. The final output, as displayed in the
Accuracy on the known classes box plots, has an overall high misclassification error when it comes to identifying the test units belonging to the previously observed classes. Differently, setting robust informative priors prevents this undesirable behavior, as shown by both the high level of accuracy and the associated low posterior probability of being a novelty in the region of the feature space wherein the observed groups lie. It is surprising that the true partition recovery, assessed via the Adjusted Rand Index, does not seem to be influenced by the label noise, and that the BRAND model always outperforms the competing methodologies, regardless of which hyperparameters were selected.

Table 1: Accuracy on the known classes, Adjusted Rand Index and Novelty predictive value metrics for $B = 100$ repetitions of the simulated experiment, varying data contamination level and test set sample size. Standard errors are reported in parentheses.

As previously mentioned, in the BRAND ($\eta^{MCD} = 1$, $\lambda^{Tr} = 10$) and BRAND ($\eta^{MCD} = 1$, $\lambda^{Tr} = 1000$) cases the second and third classes are assimilated by the nonparametric component. Despite this undesired outcome, retrieving the novelty best partition by minimizing the VI criterion (see Section 5) allows the model to correctly identify the patterns that were originally contaminated in the training set. That is, the true group structures are nowhere to be found in the test set, so that the DP prior is forced to create them anew. Clearly, this is a sub-optimal behavior, as the separation of what is known from what is new is completely lost; yet it may raise suspicion of dealing with a contaminated learning set, suggesting the need for a robust prior information extraction.
6.2 Detecting a novel wheat variety

Sophisticated and advanced techniques like X-rays, scanning microscopy, and laser technology are increasingly employed for the automatic collection and processing of images. Within the domain of computer vision studies, novelty detection is generally portrayed as a one-class classification problem, where the aim is to separate the known patterns from the absent, poorly sampled or not well defined remainder (Khan and Madden, 2014). Thus, there is strong interest in developing methodologies that not only distinguish the already observed quantities from the new entities, but that also identify specific structures within the novelty component. The present case study involves the detection of a novel grain type by means of seven geometric parameters (area, perimeter, compactness, kernel length, width, asymmetry, and groove length), recorded by post-processing X-ray photograms of kernels (Charytanowicz et al., 2010). In more detail, for the 210 samples belonging to three different wheat varieties, a high-quality visualization of the internal kernel structure is detected using a soft X-ray technique and, subsequently, the image files are post-processed via a dedicated computer software package (Strumiłło et al., 1999). The obtained dataset is publicly available in the University of California, Irvine Machine Learning data repository. This experiment involves the random selection of 70 training units from the first two cultivars, and a test set of 105 samples, including 35 grains from the third variety. The resulting learning scenario is displayed in Figure 5. The aim of the analysis is to employ BRAND to detect the third, unobserved variety, whilst performing classification of the known grain types with high accuracy.

Figure 5: Learning scenario (only the compactness and perimeter variables are displayed) for the novelty detection of one unobserved wheat variety, seeds dataset.

Firstly, the MCD estimator with hyperparameter $\eta^{MCD} = 0.95$ is adopted for robustly learning the training structure of the two observed wheat varieties. In the second stage, our model is fitted to the test set, discarding 20000 iterations for the burn-in phase and subsequently retaining 10000 MCMC samples. As usual, fairly uninformative priors are employed in the base measure $H$, with $m_0 = \mathbf{0}$, a small precision $\lambda_0$, $\nu_0 = 10$ and $S_0 = I$. For the training components, the means and covariance matrices of the Normal-inverse-Wishart priors are directly determined by the MCD output of the first stage, while $\nu^{Tr}$ and $\lambda^{Tr}$ are specified to be respectively equal to 250 and 1000. The latter value indicates that, after having robustly extracted information for the two observed classes, high trust is placed in the prior distributions of the known components. Model results are reported in Figure 6, where the posterior probabilities of being a novelty, $PPN_m = P[y_m \sim f^{nov} \mid \boldsymbol{Y}]$, $m = 1, \ldots, M$, displayed in the plots below the main diagonal, are estimated according to the ergodic mean in (19). The a posteriori classification, computed via majority vote, is depicted in the plots above the main diagonal, where the water-green solid diamonds denote observations belonging to the novel class. The confusion matrix associated with the estimated group assignments is reported in Table 2, where the third grain variety is effectively captured by the flexible process modeling the novel component.

Table 2: Confusion matrix for the semiparametric Bayesian classifier on the test set, seeds dataset. The label “New” indicates observations that are estimated to have arisen from the novelty component.

                   Truth
  Classification    1    2    3
        1          31    0    8
        2           1   35    0
      New           3    0   27

Figure 6: Test set for the considered experimental scenario, seeds dataset. Plots below the main diagonal represent the estimated posterior probability of being a novelty: the brighter the color, the higher the probability of belonging to $f^{nov}$. Plots above the main diagonal display the associated group assignments, where the water-green solid diamonds denote observations classified as novelties.

All in all, the promising results obtained with this multivariate dataset may foster the employment of our methodology in automatic image classification procedures that supersede the one-class classification paradigm, allowing for a much more flexible anomaly and novelty detector in computer vision applications.

6.3 Functional novelty detection on NIR meat spectra

In recent years, machine learning methodologies have experienced ever-growing interest in countless fields, including food authentication research (Singh and Domijan, 2019). An authenticity study aims to characterize unknown food samples, correctly identifying their type and/or provenance. Clearly, no observation is to be trusted in a context wherein the final purpose is to detect potentially adulterated units, and in which, for example, an entire subsample may belong to a previously unseen pattern. Motivated by a dataset of Near Infrared Spectra (NIR) of meat varieties, we employ the functional model introduced in Section 4 to perform classification and novelty detection when a hidden class and four manually adulterated units are present in the test set. The considered data report the electromagnetic spectrum for a total of 231 homogenized meat samples, recorded at intervals of 2 nm (McElhinney et al., 1999). The units belong to five different meat types, with 32 beef, 55 chicken, 34 lamb, 55 pork, and 55 turkey records. For each meat sample, the amount of light absorbed at a given wavelength is recorded as $A = \log(1/R)$, where $R$ is the reflectance value. The visible part of the electromagnetic spectrum accounts for color differences in the meat types, while their chemical composition is recorded further along the spectrum. NIR data can be interpreted as a discrete evaluation of a continuous function on a bounded domain; therefore, the procedure described in Section 4 is a sensible methodological tool for modeling this type of data object (Barati et al., 2013). We randomly partition the recorded units into labeled and unlabeled sets: the former is composed of 28 chicken, 17 lamb, 28 pork, and 28 turkey samples, while the latter contains the same proportion of these four meat types, with an additional 32 beef units.
The last class is not observed in the learning set and needs, therefore, to be discovered. Also, four validation units are manually adulterated and added to the test set as follows:

• a shifted version of a pork sample, achieved by removing the first 15 data points and appending the last 15 group-mean absorbance values at the end of the spectrum;

• a noisy version of a pork sample, generated by adding Gaussian white noise to the original spectrum;

• a modified version of a turkey sample, obtained by abnormally increasing the absorbance value at a single specific wavelength to simulate a spike;

• a pork sample with an added slope, produced by multiplying the original spectrum by a positive constant.

These modifications mimic the ones considered in the “Chimiométrie 2005” chemometric contest, in which the participants were tasked to perform discrimination and outlier detection of mid-infrared spectra of four different starch types (Fernández Pierna and Dardenne, 2007). In our context, both the beef subpopulation and the adulterated units are previously unseen patterns that shall be captured by the novelty component.

Figure 7: Estimated posterior probability of being a novelty, according to formula (19): the brighter the color, the higher the probability of belonging to $f^{nov}$.

Firstly, robust prior information is recovered from the learning set. Given the non-cyclical nature of the spectra, each training unit is approximated via a linear combination of $L = 100$ B-spline bases, and the associated coefficients are retrieved. Given the high-dimensional nature of the smoothing process, the MRCD is employed to obtain robust group-wise estimates of the spline coefficients. These quantities, linearly combined with the B-spline bases, account for the training atoms $\Theta_j$, $j = 1, \ldots, J$. In the second stage, functional BRAND is fitted to the test curves, with informative priors centered on the training atoms and the remaining patterns left to be captured by the novelty term $f^{nov}$. The estimated posterior probability of being a novelty is displayed in Figure 7, while the confusion matrix associated with the a posteriori classification is reported in Table 3. The attained classification accuracy is in agreement with the ones produced by state-of-the-art classifiers in a fully-supervised scenario (see, for example, Murphy et al., 2010; Gutiérrez et al., 2014). That is, our proposal is capable of detecting previously unseen classes and outlying units, whilst maintaining competitive predictive power.

Table 3: Confusion matrix for functional BRAND on the test set, meat dataset.

                            Truth
  Classification   Beef  Chicken  Lamb  Pork  Turkey  Outliers
     Novelty        28      0       0     0      2        4
     Chicken         0     18       0     0      6        0
     Lamb            4      0      17     0      0        0
     Pork            0      5       0    20      3        0
     Turkey          0      4       0     4     15        0

Focusing on the novelty component, the model almost entirely captures the beef hidden class and the adulterated units, yet two turkey samples are also incorrectly assigned. The classification obtained for the curves identified as novelties, resulting from VI minimization, is displayed in the upper panel of Figure 8, where two distinct clusters are detected. Interestingly, the 28 beef samples (blue dashed lines) are grouped separately from the manually adulterated units and the two turkeys (solid red lines). The estimated partition highlights the presence of a novel class (i.e., the beef meat variety) and a group of outlying spectra, to which our four modifications shall belong. The only concern is given by the two units incorrectly assigned to the novel component. A closer look at the turkey subpopulation, displayed in the lower panel of Figure 8, shows how these two samples exhibit a somewhat extreme pattern within their group and can, therefore, be legitimately flagged as outlying or anomalous turkeys.

Figure 8: Upper panel: best partition of the novelty component, recovered by minimizing the Variation of Information loss function; the dashed blue curves are beef samples, while the solid red ones are the manually adulterated units and the two turkeys incorrectly assigned to the novel component. Lower panel: true turkey subpopulation in the test set; the units incorrectly assigned to the novel component are displayed with solid dark red lines.

In this Section, we have shown the effectiveness of our methodology in correctly identifying a hidden group in a functional setting, while jointly achieving good classification accuracy and recognition of outlying curves. The model's successful application seems particularly desirable in fields like food authenticity, where generally no information regarding how many modifications and/or adulteration mechanisms may be present in the samples is a priori available.
7 Conclusions and further research

In the present manuscript, we have introduced a two-stage methodology for robust Bayesian nonparametric novelty detection. In the first stage, we robustly extract the observed group structure from the training set. In the second stage, we incorporate such prior knowledge into a contaminated mixture, wherein a nonparametric component describes the novelty term. The latter could correspond either to anomalies or to actual new groups, yet the distinction is made possible by retrieving the best partition within the novel subset. We have then investigated the basic properties of the induced random measure, the underlying interpretations, and the connections with existing methods. Subsequently, the general multivariate methodology has been extended to handle functional data objects, resulting in a novelty detector for functional data analysis. A dedicated slice-efficient sampler, taking into account the difference between unseen and seen components, has been devised for posterior inference. An extensive simulation study and applications on multivariate and functional data have validated the effectiveness of our proposal, fostering its employment in diverse areas, from image analysis to chemometrics.

BRAND can be seen as the starting point for many different research avenues. Future research directions aim at providing a Bayesian interpretation of the robust MCD estimator, to propose a unified, fully Bayesian model. More versatile specifications can be adopted for the known components, weakening the Gaussianity assumption. This can be done by adopting more flexible distributions while keeping the mean and variance of the resulting densities constrained to the findings in the training set, for example via centered stick-breaking mixtures (Yang et al., 2010). Similarly, functional BRAND can be improved by adopting a more general prior specification via Gaussian Processes (Rasmussen and Williams, 2005). Lastly, it is of paramount interest to develop scalable algorithms, such as Variational Bayes (Blei et al., 2017) and Expectation-Maximization (Dempster et al., 1977), for inference on massive datasets. Such solutions will offer both increased speed and lower computational cost, which are crucial for ensuring the applicability of our proposal in the big data era.
Appendix A. Proof of (8)
Let us represent all the possible values of $\Theta$ in one vector $\tilde{\boldsymbol{\Theta}} = (\Theta_1, \ldots, \Theta_J, \Theta^{nov}_1, \Theta^{nov}_2, \ldots)$. Then,

$$ P(\Theta_m = \Theta_{m'}) = E\left[ P(\Theta_m = \Theta_{m'} \mid \tilde{p}) \right] = \sum_{j \ge 1} E\left[ P\left(\Theta_m = \tilde{\Theta}_j \mid \tilde{p}\right) \cdot P\left(\Theta_{m'} = \tilde{\Theta}_j \mid \tilde{p}\right) \right] $$
$$ = \sum_{j=1}^{J} E\left[ P\left(\Theta_m = \tilde{\Theta}_j \mid \tilde{p}\right) \cdot P\left(\Theta_{m'} = \tilde{\Theta}_j \mid \tilde{p}\right) \right] + \sum_{j \ge J+1} E\left[ P\left(\Theta_m = \tilde{\Theta}_j \mid \tilde{p}\right) \cdot P\left(\Theta_{m'} = \tilde{\Theta}_j \mid \tilde{p}\right) \right] $$
$$ = \sum_{j=1}^{J} E\left[ \pi_j^2 \right] + \sum_{j \ge J+1} E\left[ \pi_0^2 \, \omega_{j-J}^2 \right] = \sum_{j=1}^{J} E\left[ \pi_j^2 \right] + E\left[ \pi_0^2 \right] \sum_{h \ge 1} E\left[ \omega_h^2 \right] $$
$$ = \sum_{j=1}^{J} \frac{a_j (a_j + 1)}{a (a + 1)} + \frac{a_0 (a_0 + 1)}{a (a + 1)} \cdot \frac{1}{1 + \gamma}. $$
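The last equality relies on a standard computation for the second moments of stick-breaking weights, spelled out here for completeness under $v_h \stackrel{iid}{\sim} Beta(1, \gamma)$, for which $E[v^2] = \frac{2}{(1+\gamma)(2+\gamma)}$ and $E[(1-v)^2] = \frac{\gamma}{2+\gamma}$:

$$ \sum_{h \ge 1} E\left[\omega_h^2\right] = \sum_{h \ge 1} E\left[v_h^2\right] \prod_{l < h} E\left[(1 - v_l)^2\right] = \frac{2}{(1+\gamma)(2+\gamma)} \sum_{h \ge 1} \left( \frac{\gamma}{2+\gamma} \right)^{h-1} = \frac{2}{(1+\gamma)(2+\gamma)} \cdot \frac{2+\gamma}{2} = \frac{1}{1+\gamma}. $$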
Appendix B. Gibbs sampler algorithm for model (3)-(4)

Algorithm 1: Efficient Slice Sampler for the BNP-Novelty detection model

Input: initial values for the MCMC, robust estimates from $X$.
Output: posterior MCMC sample for the parameters of interest.

for $i = 1, \ldots, N_{SIM}$ do

1. Sample every $u_m$ from a uniform distribution $U(0, \xi_{\zeta_m})$.

2. Compute the stochastic truncation term $L^*$ according to (18).

3. Let $m_j = \sum_{m=1}^{M} \mathbb{1}_{\{\alpha_m = j\}}$, with $j = 0, \ldots, J$. Sample $\boldsymbol{\pi}$ from a conjugate Dirichlet distribution:
$$ \boldsymbol{\pi} \sim Dir(a_0 + m_0, a_1 + m_1, \ldots, a_J + m_J). $$

4. Sample the SB variables after integrating out $\boldsymbol{u}$:
$$ v_k \mid \cdots \sim Beta\left( 1 + \sum_{m=1}^{M} \mathbb{1}_{\{\alpha_m = 0 \,\cap\, \beta_m = k\}}, \; \gamma + \sum_{m=1}^{M} \mathbb{1}_{\{\alpha_m = 0 \,\cap\, \beta_m > k\}} \right). $$

5. Compute the SB weights according to (5).

6. Compute the one-line probability weights $\tilde{\boldsymbol{\pi}}$ according to (14).

7. Sample the atoms for the observed classes $\Theta_j$, exploiting the conjugacy between the likelihood and the prior, for $j = 1, \ldots, J$.

8. Sample the atoms for the novel classes $\Theta^{nov}_l$, exploiting the conjugacy between the likelihood and the prior, for $l = 1, \ldots, L^*$.

9. Obtain $\tilde{\boldsymbol{\Theta}}$ by concatenating the updated values of $\boldsymbol{\Theta}$ and $\boldsymbol{\Theta}^{nov}$.

10. Sample each $\zeta_m$ from the following discrete distribution:
$$ P(\zeta_m = l) \propto \frac{\tilde{\pi}_l}{\xi_l} \, \mathbb{1}_{\{u_m < \xi_l\}} \, \phi\left( y_m \mid \tilde{\Theta}_l \right), \quad l = 1, \ldots, L^*, \qquad P(\text{otherwise}) \propto 0. $$

11. Recover the values of the membership vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ using (15). Divide the elements of $\tilde{\boldsymbol{\Theta}}$ into $\boldsymbol{\Theta}$ and $\boldsymbol{\Theta}^{nov}$.

end

References