# Unobserved classes and extra variables in high-dimensional discriminant analysis

Michael Fop, Pierre-Alexandre Mattei, Charles Bouveyron, Thomas Brendan Murphy

UUnobserved Classes and Extra Variables inHigh-dimensional Discriminant Analysis

Michael Fop , Pierre-Alexandre Mattei , Charles Bouveyron , and Thomas Brendan Murphy School of Mathematics & Statistics and Insight Research Centre, University College Dublin, Ireland Université Côte d’Azur, Inria, CNRS, Laboratoire J.A. Dieudonné, Maasai team, France * [email protected] Abstract

In supervised classiﬁcation problems, the test set may contain data points belonging toclasses not observed in the learning phase. Moreover, the same units in the test data may bemeasured on a set of additional variables recorded at a subsequent stage with respect to whenthe learning sample was collected. In this situation, the classiﬁer built in the learning phaseneeds to adapt to handle potential unknown classes and the extra dimensions. We introducea model-based discriminant approach, Dimension-Adaptive Mixture Discriminant Analysis(D-AMDA), which can detect unobserved classes and adapt to the increasing dimensionality.Model estimation is carried out via a full inductive approach based on an EM algorithm.The method is then embedded in a more general framework for adaptive variable selectionand classiﬁcation suitable for data of large dimensions. A simulation study and an artiﬁcialexperiment related to classiﬁcation of adulterated honey samples are used to validate theability of the proposed framework to deal with complex situations.

Keywords:

Adaptive supervised classiﬁcation, conditional estimation, model-based discriminant analy-sis, unobserved classes, variable selection.

Standard supervised classiﬁcation approaches assume that all existing classes in the data havebeen observed during the learning phase. However, in some cases there could be the possibilityof having units in the test set belonging to classes not previously observed. In such situation,a standard classiﬁer would fail to detect the novel classes and would assign the observationsonly to the classes it is aware of from the learning stage. Moreover, the observations to beclassiﬁed may be recorded on a collection of additional variables other than the variables alreadyobserved in the learning data. An example is classiﬁcation of spectrometry data, where thetest data may be measured at a ﬁner resolution than the learning set, hence with a increasednumber of wavelengths. In this setting, the classiﬁer would also need to adapt to the increasingdimensionality. The combination of these cases together leads to a complex situation where themodel built in the learning stage is faced with two sources of criticality when classifying the newdata: unobserved classes and extra variables.In a recent work, Bouveyron (2014) introduced an adaptive method for model-based classiﬁ-cation when the test data contains unknown classes. Nonetheless, the method is not capable ofhandling the situation of additional variables. To deal with this problem, this work introducesa model-based adaptive classiﬁcation method for detection of novel classes in a set of new datathat is characterized by an expanded number of variables with respect to the learning set. The

The work of Fop M. and Murphy T. B. was supported by the Science Foundation Ireland funded InsightResearch Centre (SFI/12/RC/2289_P2). The work of Mattei P. A. and Bouveyron C. has been supported bythe French government, through the 3IA Côte d’Azur Investment in the Future project managed by the NationalResearch Agency (ANR) with the reference numbers ANR-19-P3IA-0002. a r X i v : . [ s t a t . M E ] F e b pproach is developed in conjunction with an adaptive variable selection procedure used to selectthe variables of the test set most relevant for the classiﬁcation of the observations into observedand novel classes. An EM algorithm based on an inductive approach is proposed for estimationof the model. Variable selection is performed with a greedy forward algorithm that exploits theinductive characteristics of the approach and make it suitable for high-dimensional data.The methodology presented here aims at tackling the problems arising from a mismatch inthe distributions of labels and input variables in training and test data. This problem is moregenerally denoted as "dataset shift", and we point the interested reader to Quionero-Candelaet al. (2009) and Moreno-Torres et al. (2012). In this work, the mismatch is due to unrepresentedclasses in the training data and increased dimensions of the test data. Consider a set of learning observations { x s ; ¯ ‘ s } , where x s is the observation of a vector ofrandom variables and ¯ ‘ s is the associate class label, such that ¯ ‘ sc = 1 if observation s belongsto class c , 0 otherwise; c = 1 , . . . , C . The aim of supervised classiﬁcation is to build a classiﬁerfrom the complete learning data { x s , ¯ ‘ s } and use it to assign a new observation to one of theknown classes. Model-based discriminant analysis (MDA, Bouveyron et al., 2019; McLachlan,2012, 2004; Fraley and Raftery, 2002) is a probabilistic approach for supervised classiﬁcation ofcontinuous data in which the data generating process is represented as follows:¯ ‘ s ∼ C Y c =1 τ ¯ ‘ sc c , ( x s | ¯ ‘ sc = 1) ∼ N ( µ c , Σ c ) , (1)where τ c denotes the probability of observing class c , with P c τ c = 1. Consequently, the marginaldensity of each data point corresponds to the density of a Gaussian mixture distribution: f ( x s ; Θ ) = C X c =1 τ c φ ( x s ; µ c , Σ c ) , where φ ( . ; µ c , Σ c ) is the multivariate Gaussian density, with mean µ c and covariance matrix Σ c ,and Θ is the collection of all mixture parameters. Then, using the maximum a posteriori (MAP)rule, a new observation y i is assigned to the class ‘ ic with the highest posterior probability:Pr( ‘ ic = 1 | y i ) = τ c φ ( y i ; µ c , Σ c ) P Cc =1 τ c φ ( y i ; µ c , Σ c ) . (2)The framework is closely related to other discriminant analysis methods. If the covariancematrices are constrained to be the same across the classes, then the standard linear discriminantanalysis (LDA) is recovered. On the other hand, if the covariance matrices have no constraints,the method corresponds to the standard quadratic discriminant analysis (QDA McLachlan,2004; Fraley and Raftery, 2002). Several extension of this framework have been proposed inthe literature in order to increase its ﬂexibility and scope. For example, Hastie and Tibshirani(1996) consider the case where each class density is itself a mixture of Gaussian distributionswith common covariance matrix and known number of components. Fraley and Raftery (2002)further generalize this approach, allowing the covariance matrices to be diﬀerent across thesub-groups and applying model-based clustering to the observations of each class. Anotherapproach, eigenvalue decomposition discriminant analysis (EDDA, Bensmail and Celeux, 1996),is based on the family of parsimonious Gaussian models of Celeux and Govaert (1995), whichimposes cross-constraints on the eigen-decomposition of the class covariance matrices. Thislatter approach allows more ﬂexibility than LDA, and is more structured than QDA and themethods of Fraley and Raftery (2002), which could be over-parameterized. In high-dimensionalsettings, diﬀerent approaches have been proposed based on regularization and variable selection:Friedman (1989) and Xu et al. (2009) propose regularized versions of discriminant analysis where2 shrinkage parameter is introduced to control the degree of regularization between LDA andQDA; Le et al. (2020) and Sun and Zhao (2015) deﬁne frameworks where a penalty term isintroduced and the classes are characterized by sparse inverse covariance matrices. It is alsoworth to mention that for high-dimensional data, the framework of discriminant analysis hasoften been phrased in terms of sparse discriminant vectors, see for example: Clemmensen et al.(2011), Mai et al. (2012), Safo and Ahn (2016), Jiang et al. (2018), Qin (2018). The discriminant analysis approaches pointed out earlier assume that all existing classes havebeen observed in the training set during the learning phase, not taking into account that thetest data might include observations arising from classes present in the learning phase. Initialworks in the the context of unobserved classes detection and model-based discriminant analysisare those of Miller and Browning (2003) and Frame and Jammalamadaka (2007), while exam-ples of applications include galaxy classiﬁcation (Bazell and Miller, 2005) and acoustic speciesclassiﬁcation (Woillez et al., 2012). More recently, building on Miller and Browning (2003) work,Bouveyron (2014) introduced

Adaptive Mixture Discriminant Analysis (AMDA), a frameworkfor model-based discriminant analysis which allows the modeling of data where the test set con-tains novel classes not observed in the learning phase. The AMDA model considers the dataarising from a mixture model with observed and unobserved classes and Bouveyron (2014) pro-poses two alternative approaches for model estimation. In particular, the inductive approach,where the classifying function is ﬁrst estimated on the learning set and then applied to the testdata. Crucially, the core assumption of the inductive approach is that the parameters estimatedon the training data are ﬁxed when dealing with the test set (see Chapelle et al., 2006; Pangand Kasabov, 2004, for example). The assumption makes the approach most suitable for faston-line data classiﬁcation when the data come in multiple streams. In fact, with this approach,the learning set does not need to be kept in memory for prediction on a set of new data points,only the estimated parameters need to be stored.In what follows we provide a formal description of the problem of unobserved classes in thetest data and give a brief overview of the inductive AMDA methodology, as it constitutes thestarting block of the main contribution of this paper. The learning data is composed of M ob-servations x s and the associated class labels ¯ ‘ s , while the test data contains N new observations y i . For ease of presentation, we treat the classes as sets, with C the set of all classes. The AMDAframework considers the situation where the data generating process is the same as depicted in(1) but only a subset of classes is observed in the training set, that is a subset K ⊆ C of classeshas been represented in the learning data. Therefore the test data may contain a set of extra“hidden” classes H such that K ∪ H = C . The cardinality of these sets (i.e. the number ofclasses) is denoted with K , H and C respectively, such as K + H = C .The inductive AMDA approach consists of two phases: a learning phase and a discoveryphase. The initial learning phase corresponds to the estimation of a model-based discriminantanalysis classiﬁer using the training data. The data in the learning phase are complete, and theparameters estimated in this stage are then employed in the subsequent discovery phase. The discovery phase searches for H novel classes in the set of new observations y i . In this phase,because of the inductive approach, the learning data is no longer needed and is discarded. Theonly relevant quantities to be retained are the parameter estimates obtained during the learningphase. In this stage, one needs to estimate the parameters of the unobserved classes in a partiallyunsupervised way in order to derive the classiﬁcation rule as in (2). Because the observations y i are unlabelled, the following log-likelihood is considered: L ( Y ; Θ ) = N X i =1 log K X k =1 τ k φ ( y i ; µ k , Σ k ) + C X h = K +1 τ h φ ( y i ; µ h , Σ h ) , where Θ denotes the collection of all parameters. The parameters µ k , Σ k for k = 1 , . . . , K are those of the classes observed in the training set and have been already estimated in the3earning phase; the bar in the notation indicates that at this stage these parameters have alreadybeen estimated and are ﬁxed. On the other hand, the Gaussian density parameters µ h , Σ h for h = K + 1 , . . . , C remain to be estimated. Note that quantities related to the knownclasses are denoted with subscript k , while the subscript h denotes quantities related to the newclasses; subscript c denotes both known and unknown classes. Bouveyron (2014) presents anEM algorithm (Dempster et al., 1977; McLachlan and Krishann, 2008) for optimization of theabove log-likelihood with respect to the parameters of the unobserved classes, keeping ﬁxed theparameters estimated in the learning phase. The present paper extends the inductive AMDA framework to the case where the test dataincludes not only unobserved classes, but also extra variables. The contribution of this work istwofold. First, we propose a novel inductive model-based adaptive classiﬁcation framework whichcan model the situation where the observations of the test data may contain classes unobservedduring the training stage and are recorded on an expanded set of input features. Secondly, weincorporate this framework in a computationally eﬃcient inductive variable selection procedureemployed to detect the most relevant variables for classiﬁcation into observed and unknownclasses.The paper is organized as follows. The current section 1 introduced the problem of classifca-tion with unknown classes and extra variables, also providing a short overview of model-basedclassiﬁcation via discriminant analysis. In particular, section 1.2 brieﬂy described the adaptivediscriminant analysis method which is the basis of our proposed method. The following sectionspresents the novel methodology. Section 2 introduces the novel adaptive mixture discriminantanalysis method capable to handle a the complex situation where the new observations includeinformation about unknown classes and are also measured on a set of additional variables. InSection 4, the proposed method is naturally incorporated in a variable selection approach tailoredfor classiﬁcation of high-dimensional data. Extensive simulation experiments are conducted inSection 5, in order to evaluate the performance of the proposed method for adaptive classiﬁca-tion and variable selection. Section 6 presents an application to the classiﬁcation of spectroscopydata of contaminated honey samples. The paper ends with a discussion in Section 7.

The AMDA framework combines supervised and unsupervised learning for detecting unobservedclasses in the test data. However, in a dynamic classiﬁcation setting, the new observations couldbe characterized not only by information about novel classes. In fact, the units in the testdata could also have been recorded on a set of additional variables other than the ones alreadyobserved in the learning data. An example would be spectrometry data where the samples inthe learning phase have been recorded over a set of speciﬁed wavelengths, and then additionalsamples have been collected in a subsequent phase at a ﬁner resolution. Another example isthe case where the variables correspond to points in time, and observations are recorded in acontinuous manner; a given set of observations could have been collected up to a certain datapoint, while another set of units could have been recorded up to a successive period of time. Afurther example is the case where some of the variables in the training data are corrupted andcannot be used to build the classiﬁer, while they are available in the testing stage.Formally, we describe the setting of unknown classes and extra variables as follows. Thelearning data X is composed of M observations x s with the associated class labels ¯ ‘ s , and thetest data Y is composed of N new unlabelled observations y i . As in Section 1.2, the test datamay contain a set of unobserved classes H such that K ∪ H = C . However, in this setting weconsider the case where only a subset of variables available in the test data are observed orrecorded in the training data. Hence, the test data also includes extra variables compared tothe data set used for training. We consider the collection of variables observed in learning andtest data as sets. In the case where only a subset of variables is available in the learning set, the4 a) Learning phase. (b)

Discovery phase.

Figure 1: General framework of the inductive estimation approach for Dimension-Adaptive Mixture Dis-criminant Analysis. test observations y i are realizations of the set of variables R , while the training observations arerecorded on the subset of variables P ⊂ R . Consequently, the set Q = R \ P denotes the setof additional variables observed in the test set but not in the training set. The cardinalities ofthese sets, i.e. the number of variables in each set, are indicated with P , Q , and R , respectively,with R = P + Q .The extra dimensions in the test data induce an augmented parameter space in the predictionand novel class detection stage of the classiﬁer. Discarding the additional dimensions availablein the test data can potentially damage the classiﬁcation performance of the model, especiallyif the extra variables contain useful discriminant information. In this context, the classiﬁerbuilt in the learning phase needs to adapt in order to handle the situation where the new datato be classiﬁed contains information about novel classes and extra variables. To the purpose,we introduce Dimension-Adaptive Mixture Discriminant Analysis (D-AMDA). The model is ageneralization of AMDA and is designed to classify new observations measured on additionalvariables and possibly containing information about unobserved classes. Under the model, thejoint densities of each observed and new data point together with observed and unobserved classlabels are given by: f ( x s , ¯ ‘ s ; Θ x ) = K Y k =1 { τ k φ ( x s ; µ k , Σ k ) } ¯ ‘ sk , (3) f ( y i , ‘ i ; Θ y ) = (cid:20) K Y k =1 { τ k φ ( y i ; µ ∗ k , Σ ∗ k ) } ‘ ik (cid:21) × (cid:20) C Y h = K +1 { τ h φ ( y i ; µ ∗ h , Σ ∗ h ) } ‘ ih (cid:21) , (4)with s = 1 , . . . , M , i = 1 , . . . , N , and Θ x and Θ y are the set of parameters for trainingand test observations. As earlier, the subscript k indicates quantities related to the knownclasses, while the subscript h denotes quantities related to the new classes. The parameters µ k and Σ k are the class-speciﬁc mean and covariance parameters of the observed classes in thelearning data and related to the subset of variables P . The parameters denoted with µ ∗ and Σ ∗ denotes respectively the class-speciﬁc mean and covariance parameters for both observedand unobserved classes and related to the full collection of variables of the test data. Theseparameters are deﬁned on an augmented space compared to the parameters in (3). Indeed, µ k and Σ k are P -dimensional vectors and P × P matrices, while µ ∗ and Σ ∗ are R -dimensionalvectors and R × R matrices. As such, the model takes into account that y i may be measuredon additional variables and generalizes the AMDA framework.Similarly to AMDA, model estimation for D-AMDA is carried out within an inductive es-timation framework. Figure 1 provides a sketch of the general framework. In (a) the trainingdata X and the corresponding collection of labels ¯ ‘ are observed. The aim of the learning stageis to estimate the set of parameters µ k , Σ k and τ of the density in (3). In (b) only Y is observed5nd no information about the classiﬁcation is given. The test data is partitioned into two parts: Y P , the subset of data corresponding to the variables observed in the training set, and Y Q , thesubset of data related to the additional variables (gray background). In the discovery phase theaim is to estimate the parameters µ ∗ k , µ ∗ h , Σ ∗ k , Σ ∗ h and τ of the distribution in (4), as well asto infer the classiﬁcation of the new unlabelled data points. The collection of class labels tobe inferred here is composed of the labels indicating the classes observed in the learning stageand the labels indicating the new classes (gray background). Model estimation for D-AMDA isdetailed in the next sections. The use of an inductive estimation approach is appropriate for the proposed D-AMDA frame-work, as it allows to only retain the test data once a set of mean and covariance parameterestimates have been obtained from the training data. Since the training set is lower dimensionalcompared to the test data, this can be particularly eﬃcient in high-dimensional and on-lineclassiﬁcation settings. Like in Section 1.2, the approach is composed of a learning and a discov-ery phase. In this case, the discovery phase includes a novel estimation procedure employed toaccount for the extra dimensions.

The learning phase of the inductive approach consists of estimating parameters for the observedclasses employing only the training set. From equation (3), this stage corresponds to the standardestimation of a model-based discriminant analysis classiﬁer, performed by optimization of theassociated log-likelihood: L ( X , ¯ ‘ ; Θ x ) = M X s =1 K X k =1 ¯ ‘ sk log { τ k φ ( x s ; µ k , Σ k ) } , which reduces to the separate estimation of the class density parameters. Here we considerthe eigenvalue decomposition discriminant analysis (EDDA) of Bensmail and Celeux (1996), inwhich the class covariance matrices Σ k are parameterized according to the eigen-decomposition Σ k = λ k D k A k D k , providing a collection of parsimonious models. Estimation of this model iscarried out using the mclust R package Scrucca et al. (2016), which also automatically selects thebest covariance decomposition model using the

Bayesian Information Criterion (BIC, Schwarz,1978; Fraley and Raftery, 2002). This learning phase is more general than the one in Bouveyron(2014). In fact, the author considers a QDA model in the learning phase, a particular caseof EDDA corresponding to an unconstrained covariance model (see Scrucca et al., 2016). TheEDDA classiﬁer learned in this phase is more ﬂexible and is proven to perform better than QDA(Bensmail and Celeux, 1996), although it will introduce some complications, which described inthe following section.The learning phase outputs the parameters of the EDDA model ﬁtted on the training data µ k and Σ k for k = 1 , . . . , K . We note again that we use the bar symbol a to stress the factthat the parameters estimated in the learning phase are ﬁxed during the discovery phase. Sincethe discovery phase relies only on the test data and these parameters, the training set can bediscarded. The discovery phase looks for novel classes in the test data, given the parameter estimates fromthe learning phase. Under the D-AMDA modelling framework, we also need to take into accountthe extra dimensions of the test data. Subsequently, in this phase we need to estimate two maincollections of parameters: the parameters of the additional variables corresponding to novel andknown classes, and the parameters of the already observed variables related to new and known6lasses. These characterize the distribution in (4) and are estimated keeping the parameterestimates from the learning phase ﬁxed.Because the labels of the test data are unobserved, in this stage we aim to optimize thefollowing log-likelihood: L ( Y ; Θ y ) = N X i =1 log K X k =1 τ k φ ( y i ; µ ∗ k , Σ ∗ k ) + C X h = K +1 τ h φ ( y i ; µ ∗ h , Σ ∗ h ) , (5)with P Kk =1 τ k + P Hh =1 τ h = 1. Crucially, for k = 1 , . . . , K , mean and covariance parameters of the K observed classes are partitioned into parameters ﬁxed from the learning phase correspondingto the variables observed in X and parameters corresponding to the additional variables presentin the test data: µ ∗ k = ( µ k µ Q k ) Σ ∗ k = " Σ k C k C k Σ Q k , (6)where C k are the covariance terms between additional and observed variables. Such partition ofthe parameters of the observed classes will need to be taken into account during the estimationprocedure, as it indirectly induces a constraint on the estimation of the parameters for theadditional variables; see the following Section 3.2.2.Optimization of this log-likelihood in the discovery phase is carried out by resorting to anEM algorithm Dempster et al. (1977); McLachlan and Krishann (2008). From (5) we have thecomplete log-likelihood: L ( Y , ‘ ; Θ y ) = N X i =1 K X k =1 ‘ ik log n τ k φ ( y i ; µ ∗ k , Σ ∗ k ) o + C X h = K +1 ‘ ih log n τ h φ ( y i ; µ ∗ h , Σ ∗ h ) o , (7)where, ‘ ik and ‘ ih denote the latent class membership indicators to be estimated on the testdata for known and unobserved classes. The EM algorithm alternates the following two steps.• E Step : After estimation of the parameters at the previous M step iteration, the estimatedconditional probabilities, t ic = c Pr ( ‘ ic = 1 | y i ) are computed as: t ic = b τ c φ ( y i ; b µ ∗ k , b Σ ∗ k ) P Kk =1 b τ k φ ( y i ; b µ ∗ k , b Σ ∗ k ) + P Ch = K +1 b τ h φ ( y i ; b µ ∗ h , b Σ ∗ h ) , for i = 1 , . . . , N , and c = 1 , . . . , C .• M Step : in this step of the algorithm we maximize the expectation of the completelog-likelihood computed using the estimated probabilities t ic of the E step. Due to theaugmented dimensions of the test data, this step is more involved. The optimizationprocedure of the M-step is divided into two parts: estimation of mixing proportions andmean and covariance parameters corresponding to the unobserved classes, described inSection 3.2.1, and estimation of mean and covariance parameters related to the classesalready observed in the learning phase, described in Section 3.2.2. The introduction of new variables does not aﬀect the estimation of the mixing proportions,nor the estimation of the parameters corresponding to the new classes. Hence, in this case theupdates are in line with those outlined in Bouveyron (2014). From equation (7), the estimatesof mean and covariance parameters of the H hidden classes are obtained simply by optimizingthe term involving µ ∗ h and Σ ∗ h . Therefore, the estimates of the Gaussian density parametersrelated to the unknown classes are simply given by: b µ ∗ h = 1 N h N X i =1 t ih y i , b Σ ∗ h = 1 N h N X i =1 t ih ( y i − b µ ∗ h )( y i − b µ ∗ h ) , N h = P i t ih . For the mixing proportions, two alternative updates are available. One isbased on the re-normalization of the mixing proportions τ k as outlined in Bouveyron (2014), theother on the re-estimation of the mixing proportions for both observed and unobserved classes onthe test data. The two updates correspond to very diﬀerent assumptions about the data: the re-normalization update is based on the assumption that the new classes do not aﬀect the balanceof the classes observed in the training set, while the other update is based on the assumptionthat the class proportions may have changed in the test data. We opt for this latter approach,since it is more ﬂexible and avoids the introduction of possible bias due to the re-normalization,as discussed in Lawoko and McLachlan (1989). The mixing proportions are updated as follows: b τ k = N k N b τ h = N h N for k = 1 , . . . , K and h = K + 1 , . . . , C. The estimation of mean and covariance parameters µ ∗ k and Σ ∗ k of the classes already observed inthe training data is an involved problem, due to the augmented parameter dimensions and thefact that the parameters from the learning phase needs to be kept ﬁxed. Here we need to estimatethe components µ Q k , Σ Q k and C k of the partitions in (6). As in a standard Gaussian mixturemodel, a straightforward updated for these would be computing the related in sample weightedquantities. However, this would not take into account the constraint that the parameters µ k and Σ k have already been estimated in the learning phase and need to be held ﬁxed. In particular,the covariance block Σ k has been estimated in the learning phase via the EDDA model, imposingconstraints on its eigen-decomposition. As it is often the case, if the covariance model for Σ k has a particular structure (i.e. is not the VVV using the mclust nomenclature) the approachwould not ensure a valid positive deﬁnite Σ ∗ k . A clear example is the case where the EDDAmodel estimated in the learning phase is a spherical one with diagonal matrices Σ k . In suchcase, completing the oﬀ-diagonal entries of Σ ∗ k with non-zero terms and without taking intoaccount the structure of the block Σ k would not guarantee a positive deﬁnite covariance matrix(Zhang, 2006). We propose the following procedure to obtain valid estimates.Denote an observation of the test data y i = { y P i , y Q i } , where y P i are the measurements of theset P of variables of the training data and y Q i are the measurements of the set Q of additionalvariables observed in the test data. To take into account the structure of the block Σ k , theproblem of maximizing the expectation of (7) with respect to Σ ∗ k and µ ∗ k can be interpreted asthe problem of ﬁnding estimates of µ Q k , Σ Q k , and C k such that the joint distribution of observedand extra variables ( { y P i , y Q i } | l ik = 1) ∼ N ( µ ∗ k , Σ ∗ k ) is a multivariate Gaussian density whosemarginal distributions are ( y P i | l ik = 1) ∼ N ( µ k , Σ k ) and ( y Q i | l ik = 1) ∼ N ( µ Q k , Σ Q k ), and with Σ ∗ k being positive deﬁnite. To accomplish this task, we devise the following inductive conditionalestimation procedure: Step 1.

Fix the marginal distribution of the variables observed in the learning phase, ( y P i | l ik =1) ∼ N ( µ k , Σ k ); Step 2.

Estimate the parameters of the conditional distribution ( y Q i | y P i , l ik = 1) ∼ N ( m ik , E k ),where m ik and E k are related mean and covariance parameters. Step 3.

Find estimates of the parameters of the joint distribution ( { y P i , y Q i } | l ik = 1) ∼ N ( µ ∗ k , Σ ∗ k )using the ﬁxed marginal and the conditional distribution.Since we are using an inductive approach, Step 1 corresponds in keeping µ k , Σ k ﬁxed. Next,in Step 2 the parameter estimates of the distribution of the new variables given the variablesobserved in the training set are obtained. This allows to take into account the information andthe structure of the learning phase parameters. Then, in

Step 3 these estimates are used to ﬁndthe parameters of the marginal distribution of the set of new variables Q and the joint distri-bution of R = {P , Q} , while preserving the joint association structure among all the variablesin R . The proposed method is related to the well known iterative proportional ﬁtting algorithmfor ﬁtting distributions with ﬁxed marginals (see for example Whittaker, 1990; Fienberg and8eyer, 2006), and the iterative conditional ﬁtting algorithm of Chaudhuri et al. (2007) used toestimate a multivariate Gaussian distribution with association constraints.Taking the expectation of the complete log-likelihood in (7), the term involving µ ∗ k and Σ ∗ k can be rewritten as: N X i =1 " K X k =1 t ik log n φ ( y Q i | y P i ; m ik , E k ) φ ( y P i ; µ k , Σ k ) o . (8)In Step 1 , parameters µ k , Σ k are ﬁxed from the learning phase. Therefore, the term log { φ ( y P i ; µ k , Σ k ) } is already maximized. In Step 2 and

Step 3 , we make use of the well known closure proper-ties of the multivariate Gaussian distribution (see Tong, 1990; Zhang, 2006, for example) inorder to maximize the term log { φ ( y Q i | y P i ; m k , E k ) } . In Step 2 the focus is on the conditionaldistribution; for each observation i we can rewrite: m ik = µ Q k + C k Σ − k ( y P i − µ k ) , E k = Σ Q k − C k Σ − k C k . Let us deﬁne the scattering matrix O k = P Ni =1 t ik ( y i − y k )( y i − y k ) , with y k = N k P Ni =1 t ik y i .We can partition it as: O k = " W k V k V k U k , with W k the block related to the variables observed in the learning set, U k the block associatedto the new variables and V k the crossproducts. Now we maximize (8) with respect to E k and C k . After some algebraic manipulations, we obtain the estimates: b C k = ( Σ − k W k Σ − k ) − ( Σ − k V k ) b E k = 1 N k h b C k Σ − k W k Σ − k b C k − V k Σ − k b C k + U k i , Then, in

Step 3 we obtain the estimates of the marginal distribution for the set of extra variablesas: b µ Q k = 1 N k " N X i =1 t ik y Q i − b C k Σ − k N X i =1 t ik ( y P i − µ k ) , b Σ Q k = b E k + b C k Σ − k b C k . Hence µ ∗ k and Σ ∗ K are estimated: b µ ∗ k = ( µ k b µ Q k ) , b Σ ∗ k = " Σ k b C k b C k b Σ Q k . Further details about the derivations are in Appendix A. Provided that O k is positive deﬁnite,the estimate of Σ k obtained in such way is ensured to be positive deﬁnite as well due to theproperties of the Schur complement (Zhang, 2006; Tong, 1990). In certain cases, for examplewhen the number of variables R is large compared to N and to the expected class sizes, orwhen the variables are highly correlated, this scattering matrix could be singular. To overcomethis issue, one could resort to regularization. To this purpose, we delineate a simple Bayesianregularization approach in Appendix B. In order to compute the ﬁrst E step iteration of the EM algorithm in the discovery phase,we need to initialize the parameter values. A random initialization has a fair chance of notproviding good starting points. On the other hand, the initialization based on the model-basedhierarchical clustering method discussed in (Scrucca and Raftery, 2015) and (Fraley, 1998) oftenyields good starting points, is computationally eﬃcient and works well in practice. However,we need to take care of the fact that a subset of the parameters is ﬁxed. We make use of thefollowing strategy for initialization. 9irst we obtain a hierarchical unsupervised partition of the observations in the test datausing the method of (Scrucca and Raftery, 2015) and (Fraley, 1998). Afterwards, for a ﬁxednumber C of clusters and the corresponding partition, we compute the within-cluster meansand covariance matrices, both for new and observed variables. Let us denote with ˜ µ P g and ˜ Σ P g ( g = 1 , . . . , C ) the computed cluster parameters related to the observed variables, with ˜ µ Q g and˜ Σ Q g those related to the extra variables, and with ˜ C g the covariance terms. Now, we ﬁnd which ofthe detected clusters match the classes observed in the training set over the observed variables.For each known class and each cluster we compute the Kullback-Leibler divergence:tr (cid:26)(cid:16) ˜ Σ P g (cid:17) − Σ k (cid:27) + ( ˜ µ P g − µ k ) (cid:16) ˜ Σ P g (cid:17) − ( ˜ µ P g − µ k ) + log det ˜ Σ P g det Σ k , ∀ g, k. Then, we ﬁnd the ﬁrst K clusters with the minimum divergence and thus likely correspondingto the classes observed in the training data. For these clusters, the set of parameters related tothe observed variables are initialized with the associated values µ k and Σ k , the set of parametersrelated to the new variables are initialized with the same values ˜ µ Q k and ˜ Σ Q k , and the covarianceterms with ˜ C k . The remaining clusters can be considered as hidden classes and the relatedparameters are initialized with the corresponding cluster means and covariances.Similarly to AMDA, also in the D-AMDA framework class detection corresponds correspondsto selection of the number of hidden classes in the test data. As in the learning phase, the BICis employed for this purpose. Explicitly, for a range of values of number of hidden classes H , wechoose the model that maximizes the quantity:BIC H = 2 L ( Y ; b Θ y ) − η H log N, where η H is the number of parameters estimated in the discovery phase, equal to ( H + K −

1) +2 HR + H (cid:0) R (cid:1) + 2 KQ + KP Q + K (cid:0) Q (cid:1) . Given the large amount and the variety of sources at disposition, classiﬁcation of high-dimensionaldata is becoming more and more a routine task. In this setting, variable selection has been provenbeneﬁcial for increasing accuracy, reducing the number of parameters and a better model in-terpretation (Guyon and Elisseeﬀ, 2003; Pacheco et al., 2006; Brusco and Steinley, 2011). Weadapt the variable selection method of Maugis et al. (2011) and Murphy et al. (2010) in orderto perform inductive variable selection within the context of D-AMDA. The aim is to selectthe relevant variables that contain the most useful information about both observed and novelclasses. The method is inductive in the sense that the classiﬁer model ﬁrst is built on the dataobserved in the learning phase. Then, while performing variable selection on the new test data,the classiﬁer is adapted by removing and adding variables without re-estimating the model onthe learning data.Following Maugis et al. (2011) and Murphy et al. (2010), at each step of the variable selectionprocedure we consider the partition Y = ( Y class , Y prop , Y other ), where Y class is the current setof relevant variables, Y prop is the variable proposed to be added/removed to/from Y class , and Y other are the non relevant variables. Let also ‘ be the class indicator variable. For each stageof the algorithm, we compare two models: M : p ( Y | ‘ ) = p ( Y class , Y prop | ‘ ) p ( Y other ) , M : p ( Y | ‘ ) = p ( Y class | ‘ ) p ( Y prop | Y reg ⊆ Y class ) p ( Y other ) . In model M , Y p is relevant for classiﬁcation and p ( Y class , Y prop | ‘ ) is the D-AMDA modelwhere the classiﬁer is adapted by including the proposed variable Y prop . In model M , Y prop does not depend on the labels and thus is not useful for classiﬁcation. p ( Y class | ‘ ) is the D-AMDAmodel on the current selected variables and the conditional distribution p ( Y prop | Y reg ⊆ Y class )is a regression where Y prop depends on Y class through a subset of predictors Y reg . This regression10erm encompasses the fact that some variables may be redundant given the set of already selectedones, and thus can be discarded (Murphy et al., 2010; Raftery and Dean, 2006). Relevantpredictors are chosen via a standard stepwise procedure and the selection avoids to includeunnecessary parameters that would over-penalize the model without a signiﬁcant increase in itslikelihood (Maugis et al., 2009a,b). The two models are compared by computing the diﬀerencebetween their BIC:BIC = BIC class ( Y class , Y prop ) , BIC = BIC no class ( Y class ) + BIC reg ( Y prop | Y reg ⊆ Y class ) , where BIC class ( Y class , Y p ) is the BIC of the D-AMDA model where Y prop is useful for classiﬁca-tion, BIC no class ( Y class ) is the BIC on the current set of selected variables and BIC reg ( Y prop | Y reg ⊆ Y class ) is the BIC of the regression. The diﬀerence (BIC − BIC ) is computed and if it isgreater than zero, there is evidence that Y p conveys useful information about the classes, hencevariable Y prop is added to the D-AMDA model and the classiﬁer is updated.The selection is performed using a stepwise greedy forward search where variables areadded and removed in turn. Since we adopt an inductive approach, when the variables tobe added/removed belong to the set of variables already observed in the learning phase, theclassiﬁer is updated in a fast and eﬃcient way. Indeed, if a variable observed in X needs to beadded, the classiﬁer is updated by simply augmenting the set of parameters with the parame-ters already estimated in the learning phase. Analogously, if the variable needs to be removed,the classiﬁer is updated by deleting the corresponding parameters. Only parameters related toadditional variables and novel classes need to be estimated when updating the D-AMDA model.Parameters related to known classes and observed variables are updated only via deletion oraddition. As such, the method is suitable for fast on-line variable selection.The classiﬁcation procedure is partly unsupervised because of the presence of unobservedclasses. Therefore, while searching for the relevant variables, also the number H of unknownclasses needs to be chosen. As in Maugis et al. (2009a,b); Raftery and Dean (2006), we consider arange of possible values for H . Then, at every step BIC class ( Y class , Y prop ) and BIC no class ( Y class )are computed by maximizing over this range. Therefore, the method returns both the set ofrelevant variables and the optimal number of unobserved classes.The set of relevant variables needs to be initialized at the ﬁrst stage of the variable selectionalgorithm. We suggest to start the search from a conveniently chosen subset of size S of thevariables observed in the learning phase. To determine such subset, for every variable in Y corresponding to those already observed in X , we estimate a univariate Gaussian mixture modelfor a number of components ranging from 0 to G > K . Then we compute the diﬀerence betweenthe BIC of such model and the BIC of a single univariate Gaussian distribution. The variablesare ranked according to this diﬀerence from the largest to the lowest value. The starting subsetis formed by selecting the top S variables in the list. Similar initial selection strategies have beendiscussed in McLachlan (2004) and Murphy et al. (2010). Note that one could also initialize theset of relevant variables from all the observed variables of the training data. Nonetheless, if thenumber of variables observed in X is large, it is likely that many of them would be uninformativeor redundant, therefore, initialization using such set might not provide a good starting point forthe search. In this section we evaluate the proposed modeling framework for variable selection and adap-tive classiﬁcation through diﬀerent simulated data experiments under various conditions. Theobjective is to assess the classiﬁcation performance of the method, its ability of detecting thenovel classes and its ability of discarding irrelevant variables and selecting those useful for clas-siﬁcation. 11 .1 Simulation study 1

This simulation study shows the usefulness of using all the variables available in the test datafor class prediction and detection when only a small subset of these are observed in the trainingstage.We consider the well known Italian wines dataset (Forina et al., 1986). The data consistof 27 chemical measurements from a collection of wine samples from Piedmont region, in Italy.The observations are classiﬁed into three classes indicating the type of wine. Diﬀerent scenariosare considered for diﬀerent combinations of number of variables observed in the training stageand diﬀerent test data sample sizes. Using the class-speciﬁc sample means and covariances,we generate training data sets with random subsets of the 27 variables, with the number ofvariables observed in the training set equal to 18, 9, and 3. Then, with the same class-speciﬁcparameters, a test set on all the 27 variables and diﬀerent sample sizes is generated. One class israndomly deleted from the training data, while all 3 classes are present in the test data. In eachscenario we consider the following models: the EDDA classiﬁer ﬁtted on the training data withfull information, i.e. all 3 classes and all 27 variables, tested on the full test data; the EDDAclassiﬁer ﬁtted on the training data considering only a subset of the variables, then tested on thetest data containing the same subset of training variables; the AMDA approach of Bouveyron(2014) ﬁtted on the simulated training data with a subset of the variables and tested on the testdata with the subset of variables observed in the training; the presented D-AMDA framework.Details are described Appendix C, with detailed results reported in Figures 6, 7, 8, 9, 10, and11. The variables in the wine data present a good degree of discrimination, and the EDDA modelﬁtted and tested on the complete data represents the optimal baseline performance. On the otherhand, the EDDA classiﬁer trained on the partial data cannot account for the unobserved classin the test data and provides the worst classiﬁcation performance. AMDA can detect additionalclasses in the test data, but it cannot use the discriminant information potentially availablein the additional variables, thus obtaining an inferior classiﬁcation performance compared toD-AMDA. Since the D-AMDA framework adapts to the additional dimensions and classes, allthe information available in the variables observed in the test set is exploited for classiﬁcation,of both variables observed during the training stage and the extra ones present in the test set.This extra information is beneﬁcial, especially when the number of variables present in thetraining set is small ( P = 3 in particular), attaining a classiﬁcation performance comparable tothe optimal baseline. This simulation study assesses the D-AMDA classiﬁcation performance and the eﬀectiveness ofthe inductive variable selection method at detecting variables relevant for classiﬁcation. Diﬀerentscenarios are constructed by deﬁning diﬀerent proportions of relevant and irrelevant variablesavailable in the training and the full test data.In all the experiments of this section we consider three types of variables: class-generativevariables (

Gen ), which contain the principal information about the classes, redundant variables(

Cor ), which are correlated to the generative ones, and noise variables (

Noi ), which do not conveyany information about the classes. The

Gen variables are distributed according to mixture of C = 4 multivariate Gaussian distributions. Each Cor variable is correlated to 2

Gen variablesselected at random, while the

Noi variables are independent from both

Gen and

Cor variables.In the learning set, 2 of the 4 classes are observed and they are randomly chosen. All the 4classes are observed in the test set. Further details about the parameters of the simulations arein Appendix C. Three experiments are considered, each one characterized by three scenarios.We point out the fact that, as they are generated, the

Cor variables actually contain someinformation about the classiﬁcation. Indeed, they are independent of the label variable onlyconditionally on the set

Gen , not marginally. Thus, in some cases, they could convey the bestinformation available to classify the data units if some generative variables have been discardedduring the search. Hence, the inclusion of a

Cor variable would not necessarily degenerate the12lassiﬁcation performance.

The test data consist of 100 variables, 10

Gen , 30

Cor and 60

Noi . In the learning set, 20 ofthe 100 variables are observed. Three scenarios are deﬁned according to the set of variablesobserved in the simulated X :(1.a) All the 10 Gen variables plus 10 variables picked at random among

Cor and

Noi .(1.b) 5

Gen variables selected at random, 5

Cor selected at random, plus 10 variables chosen atrandom among

Cor and

Noi .(1.c) 2

Gen selected at random, the remaining 18 variables are chosen at random among

Cor and

Noi .The sample size of the learning set is equal to the sample size of Y and takes values 100, 200 and400. In all scenarios, the forward search is initialized starting from all the variables observed in X . In (1.a), the EDDA learning model is estimated on a set containing all the classiﬁcationvariables. Furthermore, the forward search is initialized on the same set. This gives a goodstarting point to the variable selection procedure, resulting that only Gen variables are declaredas relevant and with an excellent classiﬁcation performance. The results hold regardless of thesize of the samples in practice (Appendix C, Figures 12 and 15). As less

Gen variables areavailable in the learning phase, the variable selection method declares as relevant

Cor variablesmore frequently. However, good classiﬁcation results and good selection performance are stillobtained, especially for larger sample sizes (Appendix C, Figures 13, 14, 16, 17).

Also here the test data consist of 100 variables, 10

Gen , 30

Cor and 60

Noi . In the learningset, 50 of the 100 variables are observed. Three scenarios are deﬁned according to the set ofvariables observed in the simulated X :(2.a) All the 10 Gen variables plus 40 variables randomly chosen among

Cor and

Noi .(2.b) 5

Gen variables selected at random, 15

Cor selected at random, plus 30 variables chosenat random among

Cor and

Noi .(2.c) 2

Gen selected at random, the remaining 48 variables are randomly selected among

Cor and

Noi .In this experiment, the sample size of the learning set is ﬁxed and equal to 50 for all the scenarios.The forward search is initialized from 10 of the 50 variables observed in X , selected using theprocedure described in Section 4.This setting is particularly challenging, since the learning set is high-dimensional in compar-ison with the number of data points. In practice, this results in a learning phase where onlyEDDA models with diagonal covariance matrices can be estimated. Even if all Gen variables areobserved in X , such subset of models are misspeciﬁed in relation to how the data is generated.This represents a diﬃcult starting point for the D-AMDA model and the variable selection pro-cedure. Indeed, with this experiment we want to test the robustness of the method against themisspeciﬁcation of the model in the learning stage. Nevertheless, in scenario (2.a) a selectionof reasonable quality is attained. In all three scenarios, Noi variables are never selected and themethod achieves a good classiﬁcation performance even if

Cor variables are selected as relevantalmost as many times as the variables of the

Gen set (Appendix C, Figures 18, 19, 20, 21, 22,21). This fact is likely due to the variable selection initialization: this initialization strategytends to start the selection from a set of good classiﬁcation variables, and such set may containboth

Gen and

Cor variables. 13 .2.3 Experiment 3

In this case the test data consist of 200 variables, 20

Gen , 60

Cor and 120

Noi . In the learningset, 40 of the 200 variables are observed. Three scenarios are deﬁned according to the set ofvariables observed in the simulated X :(3.a) All the 20 Gen variables plus 20 variables selected randomly among

Cor and

Noi .(3.b) 10

Gen variables selected at random, 10

Cor selected at random, plus 20 variables pickedat random among

Cor and

Noi .(3.c) 4

Gen selected at random, the remaining 36 variables are randomly chosen among

Cor and

Noi .Here, the sample size of the learning set is equal to the one of the test data and takes values100, 200 and 400. The forward search is initialized from 10 of the 40 variables observed in X ,selected using the procedure described in Section 4.The experiment is characterized by Y being high-dimensional. For larger sample sizes, thevariable selection method tends to correctly identify the relevant variables, especially as thenumber of Gen variables involved in the estimation of the EDDA model in the learning phaseincreases.

Cor are declared as relevant more often when such number is reduced. However,

Noi variables are never selected (Appendix C, Figures 24, 25, 26). The D-AMDA with variableselection obtains good classiﬁcation results in all the scenarios (Appendix C, Figures 27, 28, 29).

Food authenticity studies are concerned with establishing whether foods are authentic or not.Mid-infrared spectroscopy provides an eﬃcient method of collecting data for use in food authen-ticity studies, without destructing the sample being tested nor requiring complex preparation(Downey, 1996). In this section we consider a food authenticity data set consisting of mid-infrared spectroscopic measurements of honey samples. (Kelly et al., 2006) collected 1090 ab-sorbance spectra of artisanal Irish honey over the wavelength range 3700 nm − nm at 35 nm interval. Therefore, the data consists of 285 absorbance values (variables). Of these samples,290 are pure honey, while the remaining are contaminated with ﬁve sugar syrups: beet sucrose(120), dextrose syrup (120), partial invert cane syrup (160), fully inverted beet syrup (280) andhigh-fructose corn syrup (120). The aim is to discriminate the pure honey from the adulteratedsamples and the diﬀerent contaminants. At the same time, the purpose is in the identiﬁcation ofa small subset of absorbance values containing as much information for authentication purposesas the whole spectrum does. Figures 2 and 3 provide a graphical description of the data. Exceptfrom beet sucrose and dextrose, there is an high overlapping between the other contaminantsand the pure honey; this stems from the similar composition of honey and these syrups (Kellyet al., 2006). The principal features seem to be around the ranges 8700 nm − nm and10500 nm − nm , while the spectra overlap signiﬁcantly at lower wavelengths.In this section we test the D-AMDA method with variable selection. We construct anartiﬁcial experiment that represents the situation were the samples in the learning set werecollected at a lower resolution than the ones in test data and the information about one of thecontaminants was missing. We randomly split the whole data into learning set and test set, inproportions 2 / / nm intervals, retaining wavelengths 3700 nm , 3770 nm ,3840 nm and so on. Thus, the data observed in the learning phase are approximately recordedon half of the variables of the test data. Afterwards, we randomly chose one of the two classesrelated to the contaminants beet sucrose and dextrose syrup, and we remove from the learningset the corresponding observations. In this way we obtain a test set measured on additionalvariables and containing extra classes.We replicate the experiment for 100 times, applying the D-AMDA approach with variableselection ( varSel ) and without ( noSel ). For comparison and for evaluating the classiﬁcationperformance, we also apply the EDDA model ( orac ) to the whole learning set containing all14 . . . . . . A b s o r ban c e Wavelength (nm)

Figure 2: The Mid-infrared spectra recorded for the pure and the contaminated honey samples. . . . . . A b s o r ban c e Wavelength (nm) Pure honeyBeet sucroseDextrosePart. inv. caneHigh−fruct. corn

Figure 3: Class-conditional mean spectra for pure and contaminated honey samples. The ﬁgure is a zoomon the range nm − nm approximatively. the contaminants. Then we use the estimated classiﬁer on the test data to classify the samples.The EDDA classiﬁer uses all the information available about classes and wavelengths, thus itsclassiﬁcation performance can be considered as the “oracle” baseline. For the variable selection,we initialize the search from a set of 30 wavelengths selected as described in Section 4.Results of the variable selection procedure are reported in Figure 4. The ﬁgure displaysthe proportion of times a wavelength has been declared as relevant for separating the classes ofcontaminants and the pure honey. The frequently chosen wavelengths are mostly in the ranges10000 nm − nm and 8500 nm − nm . In particular, values 10000 nm , 10070 nm , 10140 nm ,10210 nm , 10910 nm , 10980 nm , 11050 nm , and 11120 nm are selected in all the replicates ofthe experiment. Also wavelengths in the range 5400 nm − nm are selected a signiﬁcantnumber of times. The peak in 8250 nm corresponds to a wavelength range particularly usefulto discriminate dextrose syrup from the rest (Kelly et al., 2006). The most frequently selectedwavelengths correspond to the interesting peaks and features of the spectra. Classiﬁcation results15 . . . . . . P r opo r t i on Wavelength (nm)

Figure 4: Proportions of time a wavelength has been selected as a relevant variable over a hundredreplicates of the artiﬁcial experiment. . . . . . . noSel orac varSel A R I (a) . . . . . . orac varSel E rr o r (b) Figure 5: ARI (a) and classiﬁcation error (b). The classiﬁcation error is reported for the EDDA modeland the D-AMDA with variable selection. For the error, the boxplot displays values only for the 79/100times the D-AMDA correctly selected the number of unknown classes. are presented in Figure 5. As for the simulation settings, because of the extra hidden class in thetest set we made use of the ARI (Hubert and Arabie, 1985) to compare the actual classiﬁcationand the ones estimated by varSel and noSel (see also Appendix C). noSel selects the correctnumber of classes only 34/100 of the times. varsSel chose the right number of unknown classes79 out of 100 times and panel (b) of Figure 5 reports the boxplot of the classiﬁcation error of varSel and orac in this case. The classiﬁcation performance of varsel is comparable to orac , butit makes use of information about less wavelengths and is obtained in a more complex setting.

We presented a general adaptive mixture discriminant analysis method for classiﬁcation andvariable selection when the test data contain unobserved classes and extra variables. We haveshown that our methodology eﬀectively addresses the issues generated by the presence of hidden16lasses in a test data with augmented dimensions compared to the data observed during thetraining stage. As such, the method is suitable for applications in real-time classiﬁcation prob-lems where the new data points to be labelled convey extra information thanks to the presenceof additional input features.The inductive approach had the advantage of avoiding the storing of the learning set and ofavoiding the re-estimation of the parameters already obtained in the learning stage. However,when extra variables are observed in the test data, the estimation process is a complex problem,due to the parameter constraints induced by the initial learning phase. An inductive conditionalestimation procedure has been introduced to overcome the issue and obtain valid parameterestimates related to the added dimensions. The inductive framework results in a fast andcomputationally eﬃcient procedure, which has been embedded into a variable selection methodfor dealing with high-dimensional data.The method proposed in this paper opens interesting future research directions. A limitationof the D-AMDA framework is that the discovery phase does not consider particular constraintson the estimated covariance matrices, reducing the ﬂexibility of the model. The introductionof parsimonious models as in Bensmail and Celeux (1996) with adaptive dimensions is objectof current investigation. Another limitation is that the labels observed in the training data areassumed to be noise free, as well as that no outlier observations are present in the input features.Recent work by Cappozzo et al. (2020) proposes a robust version of the AMDA framework toaddress these added sources of complexity. Future work may explore the development of arobust version of D-AMDA, with a particular focus on discarding those additional dimensionscharacterized by high levels of noisy and contaminated observations, suitable for robust on-lineclassiﬁcation. 17 ppendixA Details of the inductive conditional estimation

Neglecting the term involving the mixing proportions, the objective function to be optimized inthe M step to estimate µ ∗ k and Σ ∗ k is given by: F ( µ ∗ , Σ ∗ ) = N X i =1 " K X k =1 t ik log n φ ( y i ; µ ∗ k , Σ ∗ k ) o . Let O k = P Ni =1 t ik ( y i − y k )( y i − y k ) , with y k = N k P Ni =1 t ik y i . The above function can beexpressed in term of the covariance matrix as: F ( Σ ∗ ) = X k tr (cid:8) O k ( Σ ∗ k ) − (cid:9) + X k N k log det Σ ∗ k . Let us now consider the partitioned matrices: Σ ∗ k = " Σ k C k C k Σ Q k , O k = " W k V k V k U k . Furthermore, deﬁne E k = Σ Q k − C k Σ − k C k . Then( Σ ∗ k ) − = " Σ − k + Σ − k C k E − k C k Σ − k − Σ − k C k E − k E − k C k Σ − k E − k , and log det Σ ∗ k = log det Σ k + log det E k . It follows that F ( Σ ∗ ) can be re-expressed as functionof E k and C k as follows: F ( E , C ) = X k tr { W k Σ − k C k E − k C k Σ − k } − X k tr { V k Σ − k C k E − k } + X k tr { U k E − k } + X k N k log det E k + const.Maximization of F ( E , C ) with respect to E k and C k leads to: b C k = ( Σ − k W k Σ − k ) − ( Σ − k V k ) , b E k = 1 N k h b C k Σ − k W k Σ − k b C k − V k Σ − k b C k + U k i . Consequently we have that: b Σ Q k = b E k + b C k Σ − k b C k . Given estimates b C k and b E k , for the mean parameter µ Q k corresponding to the additionalvariables, deﬁne now m ik = µ Q k + C k Σ − k ( y P i − µ k ). Consequently, the function F ( µ ∗ , Σ ∗ ) canbe rewritten as: F ( m ) = N X i =1 " K X k =1 t ik log n φ ( y Q i | y P i ; m ik , b E k ) o + const.By plugging the m ik expression above in F ( m ), we can express the latter in terms of µ Q k as: F ( µ Q ) = − N X i =1 K X k =1 t ik n(cid:2) y Q i − µ Q k − b C k Σ − k ( y P i − µ k ) (cid:3) b E − k (cid:2) y i − µ Q k − b C k Σ − k ( y P i − µ k ) (cid:3)o + const . Taking derivatives of F ( µ Q ) and solving for µ Q k we obtain: b µ Q k = 1 N k " N X i =1 t ik y Q i − b C k Σ − k N X i =1 t ik ( y P i − µ k ) . The above passages prove the derivation of the updating equations of the M step in Section3.2.2. 18

A note on regularization

The procedure described in 3.2.2 requires the empirical class scatter matrix O k to be deﬁnitepositive. This may not be the case in situations where the expected number of observations ina class is small or the variables are highly correlated. Approaches for Bayesian regularizationin the context of ﬁnite Gaussian mixture models for clustering have already been suggestedin the literature, see in particular Baudry and Celeux (2015). We suggest a similar approach,proposing the following regularized version of O k : O reg k = O k + S det( S ) /R (cid:18) γK + H (cid:19) /R , where S = N P Ni =1 ( y i − ¯ y )( y i − ¯ y ) is the empirical covariance matrix computed on the fulltest data, and ¯ y the sample mean, ¯ y = N P Ni =1 y i . The coeﬃcient γ controls the amount ofregularization and we set it to (log R ) /N ; see Baudry and Celeux (2015) for further details. C Details and results of simulation experiments

In this section we describe in more details the settings of the simulated data experiments andwe give a graphical summary of the results.

C.1 Simulation study 1

The training data has M = 300 observations in all scenarios. A random subset of the 27variables is taken from the data with the number of variables observed in the training setequal P = { , , } . Test set on all the 27 variables and diﬀerent sample sizes equal to N = { , , , , } . Diﬀerent scenarios are deﬁned by diﬀerent combinations of P and N ,One class is randomly deleted from the training data, while all 3 classes are present in the testdata. In each scenario, the following models are considered: full , the EDDA classiﬁer ﬁttedon the training data with full information, i.e. all 3 classes and all 27 variables, tested on thefull test data; EDDA the EDDA classiﬁer ﬁtted on the training data considering only a subsetof the variables, then tested on the full test data; the

AMDA approach of Bouveyron (2014)ﬁtted on the simulated training data with a subset of the variables and tested on the test datawith the subset of variables observed in the training; the presented

D-AMDA framework. Eachexperiment is replicated 100 times for all combinations of sample sizes and number of observedtraining variables. Model selection for AMDA and D-AMDA is performed using BIC and arange of values of H from 0 to 4. Since AMDA and D-AMDA are partially unsupervised, wecompute the classiﬁcation error on the matching classes detected after tabulating the actualclassiﬁcation with the estimated one using function matchClasses of package e1071 (Meyeret al., 2019). To compare AMDA and D-AMDA, we also report the adjusted Rand index (ARI,Hubert and Arabie, 1985). Indeed, the learning in the test set is partly unsupervised, and anumber of hidden classes diﬀerent from 1 could be estimated.The next ﬁgures contain graphical summaries of the results of the various experiments forthe diﬀerent scenarios. 19 . . . . . . full EDDA AMDA D−AMDA full EDDA AMDA D−AMDA full EDDA AMDA D−AMDA full EDDA AMDA D−AMDA full EDDA AMDA D−AMDA N = 50 N = 100 N = 200 N = 300 N = 500 E rr o r Figure 6: Simulation study 1, P = 18 . Error computed on the matched classes between the actualclassiﬁcation of the test data and the estimated one. . . . . . . AMDA D−AMDA AMDA D−AMDA AMDA D−AMDA AMDA D−AMDA AMDA D−AMDA

N = 50 N = 100 N = 200 N = 300 N = 500 A R I Figure 7: Simulation study 1, P = 18 . Adjusted Rand index between the actual classiﬁcation of the testdata and the estimated one for AMDA and D-AMDA. . . . . . . full EDDA AMDA D−AMDA full EDDA AMDA D−AMDA full EDDA AMDA D−AMDA full EDDA AMDA D−AMDA full EDDA AMDA D−AMDA N = 50 N = 100 N = 200 N = 300 N = 500 E rr o r Figure 8: Simulation study 1, P = 9 . Error computed on the matched classes between the actualclassiﬁcation of the test data and the estimated one. . . . . . . AMDA D−AMDA AMDA D−AMDA AMDA D−AMDA AMDA D−AMDA AMDA D−AMDA

N = 50 N = 100 N = 200 N = 300 N = 500 A R I Figure 9: Simulation study 1, P = 9 . Adjusted Rand index between the actual classiﬁcation of the testdata and the estimated one for AMDA and D-AMDA. . . . . . . full EDDA AMDA D−AMDA full EDDA AMDA D−AMDA full EDDA AMDA D−AMDA full EDDA AMDA D−AMDA full EDDA AMDA D−AMDA N = 50 N = 100 N = 200 N = 300 N = 500 E rr o r Figure 10: Simulation study 1, P = 3 . Error computed on the matched classes between the actualclassiﬁcation of the test data and the estimated one. . . . . . . AMDA D−AMDA AMDA D−AMDA AMDA D−AMDA AMDA D−AMDA AMDA D−AMDA

N = 50 N = 100 N = 200 N = 300 N = 500 A R I Figure 11: Simulation study 1, P = 3 . Adjusted Rand index between the actual classiﬁcation of the testdata and the estimated one for AMDA and D-AMDA. C.2 Simulation study 2

Gen variables are distributed according to a mixture of C = 4 multivariate Gaussian distri-butions with mixing proportions (0.3, 0.4, 0.4, 0.3). Mean parameters are randomly chosen in(-7, 7), (-4.5, 4.5), (-0.5, 0.5), (-10, 10). For each class, the covariance matrices are randomlygenerated from the Wishart distributions W ( G, Ψ ), W ( G + 2 , Ψ ), W ( G + 1 , Ψ ), W ( G, Ψ ),where G denotes the number of generative variables. The scale matrices are respectively deﬁned:Ψ , is such that ψ jj = 1 and ψ ji = ψ ij = 0 .

7; Ψ , is such that ψ jj = 1 and ψ ji = ψ ij = 0 . = Ψ = I . Cor variables are generated as X g + X g + (cid:15) , where X g and X g are two randomlychosen Gen variables and (cid:15) ∼ N (0 , Noi variables are generated as N ( , Ψ), where Ψ is such that ψ jj = 1 and ψ ji = ψ ij = 0 .

5; thus they are correlated to eachother, but not to

Cor and

Gen variables. In Simulation 3, the

Noi variables are generated allindependent of each other. The 2 classes observed in the learning set are randomly chosen fromthe set of 4 classes with equal probabilities. We considered three diﬀerent sample sizes for thetest data, respectively 100, 200 and 400. Each scenario within each experiment and for eachsample size was replicated 50 times.The next subsections contain graphical summaries of the various experiments for the diﬀerentscenarios. Throughout the diﬀerent scenarios, we used the ARI to assess the quality of theclassiﬁcation and we compared the results of the following methods:• noSel , the D-AMDA model applied on X and the full Y without performing any variableselection.• orac , representing the “oracle” solution. This corresponds to the D-AMDA model appliedon X ∗ and Y ∗ , the learning and test set where only Gen variables are observed.• varSel , the D-AMDA model with the forward variable selection applied to the observed X and Y .The variable selection performance of the proposed method was assessed via the proportion oftimes each variable was selected as relevant out of the 50 replicated experiments.21 .2.1 Experiment 1 P r opo r t i on Figure 12: Simulation experiment 1, scenario (a). Proportions of time a variable has been selected asrelevant. P r opo r t i on Figure 13: Simulation experiment 1, scenario (b). Proportions of time a variable has been selected asrelevant. P r opo r t i on Figure 14: Simulation experiment 1, scenario (c). Proportions of time a variable has been selected asrelevant. . . . . . . noSel orac varSel noSel orac varSel noSel orac varSelN = 100 N = 200 N = 400 A R I Figure 15: Simulation experiment 1, scenario (a). ARI between the actual classiﬁcation of the test dataand the estimated one. . . . . . . noSel orac varSel noSel orac varSel noSel orac varSelN = 100 N = 200 N = 400 A R I Figure 16: Simulation experiment 1, scenario (b). ARI between the actual classiﬁcation of the test dataand the estimated one. . . . . . . noSel orac varSel noSel orac varSel noSel orac varSelN = 100 N = 200 N = 400 A R I Figure 17: Simulation experiment 1, scenario (c). ARI between the actual classiﬁcation of the test dataand the estimated one.

C.2.2 Experiment 2 P r opo r t i on Figure 18: Simulation experiment 2, scenario (a). Proportions of time a variable has been selected asrelevant. P r opo r t i on Figure 19: Simulation experiment 2, scenario (b). Proportions of time a variable has been selected asrelevant. .00.51.0 N = 100N = 200N = 400Gen Cor Noi P r opo r t i on Figure 20: Simulation experiment 2, scenario (c). Proportions of time a variable has been selected asrelevant. . . . . . . noSel orac varSel noSel orac varSel noSel orac varSelN = 100 N = 200 N = 400 A R I Figure 21: Simulation experiment 2, scenario (a). ARI between the actual classiﬁcation of the test dataand the estimated one. . . . . . . noSel orac varSel noSel orac varSel noSel orac varSelN = 100 N = 200 N = 400 A R I Figure 22: Simulation experiment 2, scenario (b). ARI between the actual classiﬁcation of the test dataand the estimated one. . . . . . . noSel orac varSel noSel orac varSel noSel orac varSelN = 100 N = 200 N = 400 A R I Figure 23: Simulation experiment 2, scenario (c). ARI between the actual classiﬁcation of the test dataand the estimated one. .2.3 Experiment 3 P r opo r t i on Figure 24: Simulation experiment 3, scenario (a). Proportions of time a variable has been selected asrelevant. P r opo r t i on Figure 25: Simulation experiment 3, scenario (b). Proportions of time a variable has been selected asrelevant. P r opo r t i on Figure 26: Simulation experiment 3, scenario (c). Proportions of time a variable has been selected asrelevant. . . . . . . noSel orac varSel noSel orac varSel noSel orac varSelN = 100 N = 200 N = 400 A R I Figure 27: Simulation experiment 3, scenario (a). ARI between the actual classiﬁcation of the test dataand the estimated one. . . . . . . noSel orac varSel noSel orac varSel noSel orac varSelN = 100 N = 200 N = 400 A R I Figure 28: Simulation experiment 3, scenario (b). ARI between the actual classiﬁcation of the test dataand the estimated one. . . . . . . noSel orac varSel noSel orac varSel noSel orac varSelN = 100 N = 200 N = 400 A R I Figure 29: Simulation experiment 3, scenario (c). ARI between the actual classiﬁcation of the test dataand the estimated one. eferences Baudry, J.-P. and Celeux, G. (2015). EM for mixtures Initialization requires special care.

Statis-tics and Computing , 25(4):713–726.Bazell, D. and Miller, D. J. (2005). Class discovery in galaxy classiﬁcation.

The AstrophysicalJournal , 618(2):723.Bensmail, H. and Celeux, G. (1996). Regularized Gaussian discriminant analysis through eigen-value decomposition.

Journal of the American Statistical Association , 91:1743–1748.Bouveyron, C. (2014). Adaptive mixture discriminant analysis for supervised learning withunobserved classes.

Journal of Classiﬁcation , 31(1):49–84.Bouveyron, C., Celeux, G., Murphy, T. B., and Raftery, A. E. (2019).

Model-based clusteringand classiﬁcation for data science: with applications in R , volume 50. Cambridge UniversityPress.Brusco, M. J. and Steinley, D. (2011). Exact and approximate algorithms for variable selectionin linear discriminant analysis.

Computational Statistics & Data Analysis , 55(1):123–131.Cappozzo, A., Greselin, F., and Murphy, T. B. (2020). Anomaly and novelty detection for robustsemi-supervised learning.

Statistics and Computing , 30(5):1545–1571.Celeux, G. and Govaert, G. (1995). Gaussian parsimonious clustering models.

Pattern Recogni-tion , 28(5):781–793.Chapelle, O., Schölkopf, B., and Zien, A., editors (2006).

Semi-Supervised Learning . MIT Press.Chaudhuri, S., Drton, M., and Richardson, T. S. (2007). Estimation of a covariance matrix withzeros.

Biometrika , 94(1):199–216.Clemmensen, L., Hastie, T., Witten, D., and Ersbøll, B. (2011). Sparse discriminant analysis.

Technometrics , 53(4):406–413.Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incompletedata via the EM algorithm.

Journal of the Royal Statistical Society, Series B , 39(1):1–38.Downey, G. (1996). Authentication of food and food ingredients by near infrared spectroscopy.

Journal of Near Infrared Spectroscopy , 4(1):47–61.Fienberg, S. E. and Meyer, M. M. (2006). Iterative proportional ﬁtting.

Encyclopedia of Statis-tical Sciences , 6:3723–3726.Forina, M., Armanino, C., Castino, M., and Ubigli, M. (1986). Multivariate data analysis as adiscriminating method of the origin of wines.

Vitis , 25(3):189–201.Fraley, C. (1998). Algorithms for model-based Gaussian hierarchical clustering.

SIAM Journalon Scientiﬁc Computing , 20(1):270–281.Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis and densityestimation.

Journal of the American Statistical Association , 97:611–631.Frame, S. J. and Jammalamadaka, S. R. (2007). Generalized mixture models, semi-supervisedlearning, and unknown class inference.

Advances in Data Analysis and Classiﬁcation , 1(1):23–38.Friedman, J. H. (1989). Regularized discriminant analysis.

Journal of the American StatisticalAssociation , 84(405):165–175. 27uyon, I. and Elisseeﬀ, A. (2003). An introduction to variable and feature selection.

Journal ofMachine Learning Research , 3:1157–1182.Hastie, T. and Tibshirani, R. (1996). Discriminant analysis by Gaussian mixtures.

Journal ofthe Royal Statistical Society. Series B (Methodological) , 58(1):155–176.Hubert, L. and Arabie, P. (1985). Comparing partitions.

Journal of Classiﬁcation , 2:193–218.Jiang, B., Wang, X., and Leng, C. (2018). A direct approach for sparse quadratic discriminantanalysis.

Journal of Machine Learning Research , 19(1):1098–1134.Kelly, J. D., Petisco, C., and Downey, G. (2006). Application of fourier transform midinfraredspectroscopy to the discrimination between irish artisanal honey and such honey adulteratedwith various sugar syrups.

Journal of Agricultural and Food Chemistry , 54(17):6166–6171.Lawoko, C. and McLachlan, G. (1989). Bias associated with the discriminant analysis approachto the estimation of mixing proportions.

Pattern Recognition , 22(6):763 – 766.Le, K. T., Chaux, C., Richard, F. J., and Guedj, E. (2020). An adapted linear discriminantanalysis with variable selection for the classiﬁcation in high-dimension, and an application tomedical data.

Computational Statistics & Data Analysis , 152:107031.Mai, Q., Zou, H., and Yuan, M. (2012). A direct approach to sparse discriminant analysis inultra-high dimensions.

Biometrika , 99(1):29–42.Maugis, C., Celeux, G., and Martin-Magniette, M. L. (2009a). Variable selection for clusteringwith Gaussian mixture models.

Biometrics , 65:701–709.Maugis, C., Celeux, G., and Martin-Magniette, M. L. (2009b). Variable selection in model-based clustering: A general variable role modeling.

Computational Statistics & Data Analysis ,53:3872–3882.Maugis, C., Celeux, G., and Martin-Magniette, M. L. (2011). Variable selection in model-baseddiscriminant analysis.

Journal of Multivariate Analysis , 102(10):1374 – 1387.McLachlan, G. (2004).

Discriminant Analysis and Statistical Pattern Recognition . Wiley.McLachlan, G. and Krishann, T. (2008).

The EM Algorithm and Extensions . Wiley.McLachlan, G. J. (2012). Discriminant analysis.

Wiley Interdisciplinary Reviews: ComputationalStatistics , 4(5):421–431.Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., and Leisch, F. (2019). e1071: MiscFunctions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TUWien . R package version 1.7-3.Miller, D. J. and Browning, J. (2003). A mixture model and EM-based algorithm for classdiscovery, robust classiﬁcation, and outlier rejection in mixed labeled/unlabeled data sets.

IEEE Transactions on Pattern Analysis and Machine Intelligence , 25(11):1468–1483.Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V., and Herrera, F. (2012).A unifying view on dataset shift in classiﬁcation.

Pattern Recognition , 45(1):521–530.Murphy, T. B., Dean, N., and Raftery, A. E. (2010). Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications.

The Annals of Applied Statistics , 4(1):396.Pacheco, J., S., C., Núñez, L., and Gómez., O. (2006). Analysis of new variable selection methodsfor discriminant analysis.

Computational Statistics & Data Analysis , 51(3):1463 – 1478.28ang, S. and Kasabov, N. (2004). Inductive vs transductive inference, global vs local mod-els: SVM, TSVM, and SVMT for gene expression classiﬁcation problems. In , volume 2, pages 1197–1202.Qin, Y. (2018). A review of quadratic discriminant analysis for high-dimensional data.

WileyInterdisciplinary Reviews: Computational Statistics .Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (2009).

DatasetShift in Machine Learning . The MIT Press.Raftery, A. E. and Dean, N. (2006). Variable selection for model-based clustering.

Journal ofthe American Statistical Association , 101:168–178.Safo, S. E. and Ahn, J. (2016). General sparse multi-class linear discriminant analysis.

Compu-tational Statistics & Data Analysis , 99:81–90.Schwarz, G. (1978). Estimating the dimension of a model.

Annals of Statistics , 6(2):461–464.Scrucca, L., Fop, M., Murphy, T. B., and Raftery, A. E. (2016). mclust 5: Clustering, classiﬁca-tion and density estimation using Gaussian ﬁnite mixture models.

The R Journal , 8(1):289–317.Scrucca, L. and Raftery, A. E. (2015). Improved initialisation of model-based clustering usingGaussian hierarchical partitions.

Advances in Data Analysis and Classiﬁcation , 9(4):447–460.Sun, J. and Zhao, H. (2015). The application of sparse estimation of covariance matrix toquadratic discriminant analysis.

BMC Bioinformatics , 16(1).Tong, Y. (1990).

The Multivariate Normal Distribution . Springer Verlag.Whittaker, J. (1990).

Graphical Models in Applied Multivariate Statistics . Wiley.Woillez, M., Ressler, P. H., Wilson, C. D., and Horne, J. K. (2012). Multifrequency speciesclassiﬁcation of acoustic-trawl survey data using semi-supervised learning with class discovery.

The Journal of the Acoustical Society of America , 131(2):184–190.Xu, P., Brock, G. N., and Parrish, R. S. (2009). Modiﬁed linear discriminant analysis ap-proaches for classiﬁcation of high-dimensional microarray data.

Computational Statistics &Data Analysis , 53(5):1674 – 1687.Zhang, F. (2006).