Mixture of Conditional Gaussian Graphical Models for unlabelled heterogeneous populations in the presence of co-factors
Thomas Lartigue [email protected]
Aramis project-team and CMAP, Inria and École Polytechnique, Paris, France
Stanley Durrleman [email protected]
Aramis project-team, Inria, Paris, France
Stéphanie Allassonière [email protected]
Université Paris Descartes, Paris, France
Abstract
Conditional correlation networks, within Gaussian Graphical Models (GGM), are widely used to describe the direct interactions between the components of a random vector. In the case of an unlabelled heterogeneous population, Expectation Maximisation (EM) algorithms for Mixtures of GGM have been proposed to estimate both each sub-population's graph and the class labels. However, we argue that, with most real data, class affiliation cannot be described with a Mixture of Gaussians, which mostly groups data points according to their geometrical proximity. In particular, there often exist external co-features whose values affect the features' average value, scattering across the feature space the data points belonging to the same sub-population. Additionally, if the co-features' effect on the features is heterogeneous, then the estimation of this effect cannot be separated from the sub-population identification. In this article, we propose a Mixture of Conditional GGM (CGGM) that subtracts the heterogeneous effects of the co-features to regroup the data points into clusters corresponding to the sub-populations. We develop a penalised EM algorithm to estimate graph-sparse model parameters. We demonstrate on synthetic and real data how this method fulfils its goal and succeeds in identifying the sub-populations where the Mixtures of GGM are disrupted by the effect of the co-features.
Keywords:
Gaussian Graphical Models, Unlabelled Heterogeneous populations, Conditional GGM, Mixture Models, EM algorithm
1. Introduction
Conditional correlation networks are a popular tool to describe the co-variations between the components of a random vector. Within the Gaussian Graphical Model (GGM) framework, introduced in Dempster (1972), the random vector of interest is modelled as a Gaussian vector $\mathcal{N}(\mu, \Sigma)$, and the conditional correlation network can be recovered from the sparsity of the inverse covariance matrix $\Lambda := \Sigma^{-1}$. In this article, we consider the case of an unlabelled heterogeneous population, in which different sub-populations (or "classes") are described by different networks. Additionally, we take into account the presence of observed co-features (discrete and/or continuous) that have a heterogeneous (class-dependent) impact on the values of the features. The absence of known class labels turns the analysis of the population into an unsupervised problem. As a result, any inference method will have to tackle the problem of cluster discovery in addition to the parameter estimation. This is a crucial task, considering that the relevance of the estimated parameters is entirely dependent on the clusters identified. The co-features, if their effects on the features are substantial, can greatly disrupt the clustering. Indeed, any unsupervised method will then be more likely to identify clusters correlated with the values of the co-features than with the hidden sub-population labels. This occurs frequently when analysing biological or medical features. To provide a simple illustration, if one runs an unsupervised method on an unlabelled population containing both healthy and obese patients, using the body fat percentage as a feature, then the unearthed clusters are very likely to be more correlated with the gender of the patients (a co-feature) than with the actual diagnosis (the hidden variable). Additionally, the fact that the effect of the gender on the average body fat is also dependent on the diagnosis (class-dependent effect) makes the situation even more complex.

Unsupervised GGM have received recent attention, with works such as Gao et al. (2016) and Hao et al. (2017) adapting the popular supervised joint Hierarchical GGM methods of Mohan et al. (2014) and Danaher et al. (2014) to the unsupervised case. When the labels are known in advance, these joint Hierarchical GGM are useful models to estimate several sparse conditional correlation matrices, and they are modular enough to allow for the recovery of many different forms of common structure between classes. However, we argue that they are not designed for efficient cluster identification in the unsupervised scenario, and will very likely miss the hidden variable and find clusters correlated with the most influential co-features instead. This in turn will result in the estimation of irrelevant parameters. Even when there are no pre-existing hidden variables to recover, and the unsupervised method is run "blindly", it is uninteresting to recover clusters describing the values of already known co-features. Instead, one would rather provide the unsupervised method beforehand with the information of the co-features' values and encourage it to recover new information from the data.

In order to take into account the effect of co-features on features, Yin and Li (2011) and Wytock and Kolter (2013) introduced the Conditional Gaussian Graphical Models (CGGM). Within this model, the average effect of the co-features is subtracted from the features, in order to leave only orthogonal effects.
Both Yin and Li (2011) and Sohn and Kim (2012) worked with homogeneous populations, but the Hierarchical form of the CGGM was introduced by Chun et al. (2013) to study labelled heterogeneous populations, with heterogeneous effects of the co-features on the features. Recent works such as Huang et al. (2018) and Ou-Yang et al. (2019) have adapted the state of the art supervised joint Hierarchical GGM methods to the CGGM. However, to the best of our knowledge, there has been no effort to make use of the CGGM in the unsupervised case.

In this article, we introduce a Mixture of Conditional GGM that models the class-dependent effect of the co-features on the features. We propose an Expectation-Maximisation (EM) procedure to estimate this model without prior knowledge of the class labels. This EM algorithm can be regularised with all the structure-inducing penalties introduced for the supervised joint Hierarchical CGGM. Hence, the recovered sparse conditional correlation graphs can present any of the desired forms of common structure. Moreover, with an additional penalty, we can also enforce structure within the parameter describing the relation between co-features and features. Thanks to the inclusion of the co-features within the model, our EM algorithm is able to avoid trivial clusters correlated with the co-features' values, and instead unearths clusters providing new information on the population. Additionally, since our model takes into account heterogeneous effects of the co-features, our EM can handle the more complex scenarios where the co-features act differently on the features in each sub-population.

Another, very different, domain of research, the "Finite Mixture Regression models" (FMR), makes use of models that exhibit some formal similarities with the Mixture of CGGM. The FMR, see DeSarbo and Cron (1988) or Khalili and Chen (2007) for early examples of unpenalised and penalised FMR respectively, consist of several parallel linear regressions between co-features and features, with unlabelled data. The clustering in FMR is focused on identifying different linear models between co-features and features. This is very different from our GGM approach, which seeks to describe the multidimensional feature vector with graphs, and uses the co-features as a tool to improve the clustering within the feature space.

We demonstrate the performance of our method on synthetic and real data. First with a 2-dimensional toy example, where we show the importance of taking into consideration the (heterogeneous) effects of the co-features for the clustering. Then, in higher dimension, we demonstrate that our EM with Mixture of CGGM consistently outperforms, both in terms of classification and parameter reconstruction, the EM with a Mixture of GGM (used in Gao et al. (2016) and Hao et al. (2017)), as well as an improved Mixture of GGM EM that takes into consideration a homogeneous co-feature effect. Finally, on real Alzheimer's Disease data, we show that our method is the best suited to recover clusters correlated with the diagnosis, from both MRI and Cognitive Score features.
2. Supervised Hierarchical GGM and CGGM
In this section, we summarise the whys and wherefores of Gaussian Graphical Modelling: the simple models for homogeneous populations, as well as the hierarchical models for heterogeneous populations. First, we explore the classical Gaussian Graphical Model techniques to describe a vector of features $Y \in \mathbb{R}^p$, then we discuss the Conditional Gaussian Graphical Models implemented in the presence of additional co-features $X \in \mathbb{R}^q$. For every parametric model, we call $\theta$ the full parameter and $p_\theta$ the probability density function. Hence, in the example of a Gaussian model, $\theta = (\mu, \Sigma)$. For hierarchical models with $K$ classes, we will have $K$ parameters $(\theta_1, ..., \theta_K)$.

In the classical GGM analysis introduced by Dempster (1972), the studied features $Y \in \mathbb{R}^p$ are assumed to follow a Multivariate Normal distribution: $Y \sim \mathcal{N}(\mu, \Sigma)$. The average $\mu$ is often ignored and set to 0. With $\Lambda := \Sigma^{-1}$, the resulting distribution is:
$$p_\theta(Y) = (2\pi)^{-p/2}\, |\Lambda|^{1/2} \exp\left(-\frac{1}{2} Y^T \Lambda Y\right). \tag{1}$$
In this case $\theta = \Lambda$. Using the property that $corr(Y_u, Y_v \mid (Y_w)_{w \neq u,v}) = -\frac{\Lambda_{uv}}{\sqrt{\Lambda_{uu}\Lambda_{vv}}}$, the conditional correlation network is obtained from a sparse estimate of the precision (or "inverse-covariance") matrix $\Lambda$. Heterogeneous populations, where a different correlation network may exist for each sub-population (or "class"), can be described with the Hierarchical version of the GGM (1). With $K$ classes, let $\theta := (\theta_1, ..., \theta_K)$ be the class parameters and $z \in [\![1, K]\!]$ the categorical variable corresponding to the class label of the observation $Y$. With $\theta_k := \Lambda_k$ and $z$ known, the Hierarchical density can be written:
$$p_\theta(Y \mid z) = \sum_{k=1}^K \mathbb{1}_{z=k}\, p_{\theta_k}(Y) = \sum_{k=1}^K \mathbb{1}_{z=k}\, (2\pi)^{-p/2} |\Lambda_k|^{1/2} \exp\left(-\frac{1}{2} Y^T \Lambda_k Y\right). \tag{2}$$
Mirroring the famous Graphical LASSO (GLASSO) approach introduced by Yuan and Lin (2007) and Banerjee et al. (2006) for homogeneous populations, many authors have chosen to estimate sparse $\widehat\Lambda_k$ as penalised Maximum Likelihood Estimators (MLE) of $\Lambda_k$. For $i = 1, ..., n$, let $Y^{(i)}$ be independent identically distributed (iid) feature vectors and $z^{(i)}$ their labels. These MLE are computed from the simple convex optimisation problem
$$\hat\theta = \underset{\theta}{\mathrm{argmin}} \; -\frac{1}{n} \sum_{k=1}^K \sum_{i=1}^n \mathbb{1}_{z^{(i)}=k} \ln p_{\theta_k}\left(Y^{(i)}\right) + pen(\theta), \tag{3}$$
where the convex penalty $pen(\theta)$ is usually designed to induce sparsity within each individual $\widehat\Lambda_k$ as well as to enforce a certain common structure between the $\widehat\Lambda_k$. This common structure is a desirable outcome when the different sub-populations are assumed to still retain core similarities. Following in the footsteps of Guo et al. (2011), most authors propose such a joint estimation of the matrices $\Lambda_k$. In the case of the penalised MLE estimation (3), the form of the resulting common structure depends on the penalty. For instance, Danaher et al. (2014) propose the "Fused Graphical LASSO" and "Group Graphical LASSO" penalties, which encourage shared values and shared sparsity patterns across the different $\Lambda_k$ respectively. Likewise, Yang et al. (2015) propose another fused penalty to incentivise common values across matrices. With their node-based penalties, Mohan et al. (2014) can encourage the recovery of common hubs in the graphs.
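As a point of reference for the homogeneous, single-class case (no co-features, no common-structure penalty), a sparse precision matrix can be obtained with an off-the-shelf GLASSO solver. The snippet below is a minimal sketch using scikit-learn; the joint penalties of Danaher et al. (2014) and Mohan et al. (2014) require dedicated solvers and are not covered by this single call. The variable names and the penalty level `alpha` are illustrative.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 10))          # n = 200 observations of a p = 10 feature vector

# l1-penalised MLE of the precision matrix Lambda (homogeneous GGM, single class)
model = GraphicalLasso(alpha=0.1).fit(Y)
Lambda_hat = model.precision_

# edges of the conditional correlation graph = non-zero off-diagonal entries of Lambda_hat
edges = np.argwhere((np.abs(Lambda_hat) > 1e-8) & ~np.eye(10, dtype=bool))
print(edges)
```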
Remark 1
Within a hierarchical model, one can also take $\theta_k := (\mu_k, \Lambda_k)$ and adapt $p_{\theta_k}(Y)$ accordingly, since it is natural to allow each sub-population to have a different average level $\mu_k$.

In some frameworks, additional variables, noted $X \in \mathbb{R}^q$ and called "co-features" or "co-factors", can be observed alongside the regular features within the Gaussian vector $Y \in \mathbb{R}^p$. In all generality, $X$ can be a mix of finite, discrete and continuous random variables. In the GGM analysis, these co-features are not included as nodes of the estimated conditional correlation graph. Instead, they serve to enrich the conditioning defining each edge: in the new graph, there is an edge between the nodes $Y_u$ and $Y_v$ iff $cov(Y_u, Y_v \mid (Y_w)_{w\neq u,v}, X) \neq 0$. The Conditional Gaussian Graphical Models (CGGM) were introduced by Yin and Li (2011) and Sohn and Kim (2012) in order to properly take into account the effect of $X$ on $Y$ and easily identify the new conditional correlation network in-between the $Y$. They propose a linear effect, expressed by the conditional probability density function (pdf):
$$p_\theta(Y \mid X) = (2\pi)^{-p/2}\, |\Lambda|^{1/2} \exp\left(-\frac{1}{2}\left(Y + \Lambda^{-1}\Theta^T X\right)^T \Lambda \left(Y + \Lambda^{-1}\Theta^T X\right)\right), \tag{4}$$
with $\Theta \in \mathbb{R}^{q\times p}$ and $\theta = \{\Lambda, \Theta\}$. In other words: $Y \mid X \sim \mathcal{N}\left(-\Lambda^{-1}\Theta^T X,\; \Lambda^{-1}\right)$. Two main branches of CGGM exist, depending on whether the pdf of $X$ is also modelled. In this work, we chose to impose no model on $X$. The lack of assumption on the density of $X$ provides far more freedom than the joint Gaussian assumption. In particular, $X$ can have categorical and even deterministic components. This allows us to integrate any observed variables into the model without restriction.

To tackle heterogeneous populations, works such as Chun et al. (2013) have introduced the Hierarchical version of the CGGM pdf:
$$p_\theta(Y \mid X, z) = \sum_{k=1}^K \mathbb{1}_{z=k} \left(\frac{|\Lambda_k|}{(2\pi)^p}\right)^{1/2} \exp\left(-\frac{1}{2}\left(Y + \Lambda_k^{-1}\Theta_k^T X\right)^T \Lambda_k \left(Y + \Lambda_k^{-1}\Theta_k^T X\right)\right). \tag{5}$$
In particular, Huang et al. (2018) have adapted the penalised MLE (3) to the Hierarchical CGGM density for some of the most popular GGM penalties. With an iid sample $(Y^{(i)}, X^{(i)}, z^{(i)})_{i=1}^n$, the corresponding penalised CGGM MLE can be written:
$$\hat\theta = \underset{\theta}{\mathrm{argmin}} \; -\frac{1}{n} \sum_{k=1}^K \sum_{i=1}^n \mathbb{1}_{z^{(i)}=k} \ln p_{\theta_k}\left(Y^{(i)} \mid X^{(i)}\right) + pen(\theta). \tag{6}$$
Remark 2
To include a regular average value for $Y$, independent of the values of $X$, one can simply add a constant component equal to 1 in $X$.
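To make the roles of $\Lambda$ and $\Theta$ in (4) concrete, the following sketch draws from the conditional law $Y \mid X \sim \mathcal{N}(-\Lambda^{-1}\Theta^T X, \Lambda^{-1})$ for given parameter values. It is a toy illustration under assumed parameter values, not part of the estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 3, 2
Lambda = np.array([[2.0, 0.5, 0.0],
                   [0.5, 2.0, 0.5],
                   [0.0, 0.5, 2.0]])          # precision matrix (p x p), assumed values
Theta = np.array([[1.0, 0.0, -1.0],
                  [0.0, 2.0,  0.0]])          # co-feature/feature interaction (q x p), assumed values
Sigma = np.linalg.inv(Lambda)

def sample_y_given_x(X, n=1):
    """Draw Y | X ~ N(-Lambda^{-1} Theta^T X, Lambda^{-1}), the CGGM conditional law (4)."""
    mean = -Sigma @ Theta.T @ X
    return rng.multivariate_normal(mean, Sigma, size=n)

X = np.array([1.0, -0.5])                      # one observed co-feature vector
print(sample_y_given_x(X, n=5))
```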
3. Mixtures of CGGM for unlabelled heterogeneous population
In this section, we tackle the problem of an unlabelled heterogeneous population. We introduce a Mixture of Conditional Gaussian Graphical Models to improve upon the state of the art unsupervised methods by taking into consideration the influential co-features that can drive the clustering. We develop a penalised EM algorithm to both identify data clusters and estimate sparse, structured, model parameters. We justify that our algorithm is usable with a wide array of penalties and provide detailed algorithmic steps for the Group Graphical LASSO (GGL) penalty.
When the labels of a heterogeneous population are missing, supervised parameter estimation methods like (3) have to be replaced by unsupervised approaches that also tackle the problem of cluster discovery. When $z$ is unknown, the Hierarchical model (2) can easily be replaced by a Mixture model with observed likelihood:
$$p_{\theta,\pi}(Y) = \sum_{k=1}^K \pi_k\, p_{\theta_k}(Y), \tag{7}$$
and complete likelihood:
$$p_{\theta,\pi}(Y, z) = \sum_{k=1}^K \mathbb{1}_{z=k}\, \pi_k\, p_{\theta_k}(Y), \tag{8}$$
where $\pi_k := \mathbb{P}(z = k)$ and $\pi := (\pi_1, ..., \pi_K)$. Then, the supervised penalised likelihood maximisation (3) can be adapted into the penalised observed likelihood optimisation:
$$\hat\theta, \hat\pi = \underset{\theta,\pi}{\mathrm{argmin}} \; -\frac{1}{n}\sum_{i=1}^n \ln\left(\sum_{k=1}^K \pi_k\, p_{\theta_k}\left(Y^{(i)}\right)\right) + pen(\theta, \pi). \tag{9}$$
This is a non-convex problem, and authors such as Zhou et al. (2009) and Krishnamurthy (2011) have proposed EM algorithms to find local solutions to (9). They omit however the common structure inducing penalties that are the signature of the supervised joint GGM methods. The works of Gao et al. (2016) and Hao et al. (2017) correct this by proposing EM algorithms that solve (9) for some of the joint-GGM penalties, such as the Fused and Group Graphical LASSO penalties.

By design, the EM algorithm must handle the cluster identification jointly with the mixture parameter estimation. The underlying assumption is that the different sub-populations can be identified as different clusters in the feature space. With real data, and especially medical data, this is generally untrue, as many factors other than the class label can have a larger impact on the position of the data points in the feature space. Even when there are no specific sub-populations to recover, and the EM is run "blindly" in order to observe which data points are more naturally grouped together by the method, the unearthed clusters have every chance to be very correlated with very influential but trivial external variables, such as the age group or the gender. In order to guide the cluster discovery of the EM algorithm, we propose a Mixture of Conditional Gaussian Graphical Models with which the overbearing effect of trivial external variables can be removed. By placing all external observed variables into $X$, we define the Mixture of CGGM with its observed likelihood:
$$p_{\theta,\pi}(Y \mid X) := \sum_{k=1}^K \pi_k\, p_{\theta_k}(Y \mid X) = \sum_{k=1}^K \pi_k \left(\frac{|\Lambda_k|}{(2\pi)^p}\right)^{1/2} \exp\left(-\frac{1}{2}\left(Y + \Lambda_k^{-1}\Theta_k^T X\right)^T \Lambda_k \left(Y + \Lambda_k^{-1}\Theta_k^T X\right)\right). \tag{10}$$
Within this model, the position of each feature vector $Y$ is corrected by its class-dependent linear prediction by the co-features $X$: $\mathbb{E}[Y \mid X, z=k] = -\Lambda_k^{-1}\Theta_k^T X$. In other words, the "Mixture of Gaussians" type clustering is done on the residual vector $Y - \mathbb{E}[Y \mid X, z=k] = Y + \Lambda_k^{-1}\Theta_k^T X$. Hence, even if the co-features $X$ have a class-dependent impact on the average level of the features $Y$, the Mixture of CGGM model is still able to regroup in the feature space the observations $Y^{(i)}$ that belong to the same class, $z^{(i)} = k$. We illustrate this dynamic in section 4.1.

Like the previous works on joint-GGM estimation, our goal is to estimate the parameters of model (10) with sparse inverse-covariance matrices $\Lambda_k$ and common structure across classes. Sparsity in the matrices $\Theta_k$ is also desirable for the sake of interpretation. Hence, we define the following penalised Maximum Likelihood problem:
$$\hat\theta, \hat\pi = \underset{\theta,\pi}{\mathrm{argmin}} \; -\frac{1}{n}\sum_{i=1}^n \ln\left(\sum_{k=1}^K \pi_k\, p_{\theta_k}\left(Y^{(i)} \mid X^{(i)}\right)\right) + pen(\theta, \pi). \tag{11}$$
As with (9), this is a non-convex problem, and we define an EM algorithm to find local minima of the optimised function.

In this section, we provide the detailed steps of a penalised EM algorithm to find local solutions of the non-convex penalised MLE (11), in order to estimate the parameters of the mixture model (10) with inverse-covariance sparsity as well as common structure. First we provide the different steps of the algorithm and justify that it can be run with a wide array of penalty functions. Then, we provide a detailed optimisation scheme for the Group Graphical Lasso (GGL) penalty specifically.
EM algorithm for Mixtures of CGGM.
With $n$ fixed $\{X^{(i)}\}_{i=1}^n$ and $n$ iid observations $\{Y^{(i)}\}_{i=1}^n$ following the mixture density $p_{\theta,\pi}(Y \mid X)$ given in (10), the penalised observed negative log-likelihood to optimise is:
$$-\frac{1}{n}\sum_{i=1}^n \ln\left(\sum_{k=1}^K \pi_k\, p_{\theta_k}\left(Y^{(i)} \mid X^{(i)}\right)\right) + pen(\theta, \pi). \tag{12}$$
We will not redo here all the calculations for the EM applied to a mixture. In the end, we get an iterative procedure updating the current parameter $(\theta^{(t)}, \pi^{(t)})$ with two steps. The Expectation (E) step is:
$$p_{i,k}^{(t)} := \mathbb{P}_{\theta^{(t)},\pi^{(t)}}\left(z^{(i)} = k \mid Y^{(i)}, X^{(i)}\right) = \frac{p_{\theta_k^{(t)}}\left(Y^{(i)} \mid X^{(i)}\right)\, \pi_k^{(t)}}{\sum_{l=1}^K p_{\theta_l^{(t)}}\left(Y^{(i)} \mid X^{(i)}\right)\, \pi_l^{(t)}}.$$
More explicitly, by replacing $p_{\theta_k}(Y \mid X)$ with its formula (4), and with the parameters taken at their current values $(\Lambda_k^{(t)}, \Theta_k^{(t)})$:
$$(E)\quad p_{i,k}^{(t)} = \frac{\pi_k^{(t)}\, |\Lambda_k|^{1/2} \exp\left(-\frac{1}{2}\left(Y^{(i)} + \Lambda_k^{-1}\Theta_k^T X^{(i)}\right)^T \Lambda_k \left(Y^{(i)} + \Lambda_k^{-1}\Theta_k^T X^{(i)}\right)\right)}{\sum_{l=1}^K \pi_l^{(t)}\, |\Lambda_l|^{1/2} \exp\left(-\frac{1}{2}\left(Y^{(i)} + \Lambda_l^{-1}\Theta_l^T X^{(i)}\right)^T \Lambda_l \left(Y^{(i)} + \Lambda_l^{-1}\Theta_l^T X^{(i)}\right)\right)}. \tag{13}$$
The M step is:
$$\theta^{(t+1)}, \pi^{(t+1)} = \underset{\theta,\pi}{\mathrm{argmin}} \; -\frac{1}{n}\sum_{k=1}^K \sum_{i=1}^n p_{i,k}^{(t)}\left(\ln p_{\theta_k}\left(Y^{(i)} \mid X^{(i)}\right) + \ln \pi_k\right) + pen(\theta, \pi).$$
Assuming that there is no coupling between $\pi$ and $\theta$ in the penalty, i.e. $pen(\pi, \theta) = pen_\pi(\pi) + pen_\theta(\theta)$, the two optimisations can be separated:
$$\theta^{(t+1)} = \underset{\theta}{\mathrm{argmin}} \; -\frac{1}{n}\sum_{k=1}^K \sum_{i=1}^n p_{i,k}^{(t)} \ln p_{\theta_k}\left(Y^{(i)} \mid X^{(i)}\right) + pen_\theta(\theta), \qquad \pi^{(t+1)} = \underset{\pi}{\mathrm{argmin}} \; -\frac{1}{n}\sum_{k=1}^K \sum_{i=1}^n p_{i,k}^{(t)} \ln \pi_k + pen_\pi(\pi).$$
Let us denote the sufficient statistics $n_k^{(t)} := \sum_{i=1}^n p_{i,k}^{(t)}$, $S_{YY}^{k,(t)} := \frac{1}{n}\sum_{i=1}^n p_{i,k}^{(t)} Y^{(i)} Y^{(i)T}$, $S_{YX}^{k,(t)} := \frac{1}{n}\sum_{i=1}^n p_{i,k}^{(t)} Y^{(i)} X^{(i)T}$ and $S_{XX}^{k,(t)} := \frac{1}{n}\sum_{i=1}^n p_{i,k}^{(t)} X^{(i)} X^{(i)T}$. Then, the M step can be formulated as:
$$(M)\quad \theta^{(t+1)} = \underset{\theta}{\mathrm{argmin}} \sum_{k=1}^K \left(\frac{1}{2}\left\langle \Lambda_k, S_{YY}^{k,(t)}\right\rangle + \left\langle \Theta_k^T, S_{YX}^{k,(t)}\right\rangle + \frac{1}{2}\left\langle \Theta_k \Lambda_k^{-1}\Theta_k^T, S_{XX}^{k,(t)}\right\rangle - \frac{n_k^{(t)}}{2n}\ln|\Lambda_k|\right) + pen_\theta(\theta), \qquad \pi^{(t+1)} = \underset{\pi}{\mathrm{argmin}} \; -\sum_{k=1}^K \frac{n_k^{(t)}}{n}\ln \pi_k + pen_\pi(\pi). \tag{14}$$
The E step in Eq (13) is in closed form. With any reasonable penalty $pen_\pi$, the optimisation on the class weights $\pi$ in Eq (14) will be trivial, and most likely in closed form as well. The update of $\theta$ in the M step (14) takes exactly the same form as the supervised penalised MLE of Eq (6), see Huang et al. (2018) for the explicit supervised CGGM formulation. As a result, as long as the supervised case (6) is solved, the M step is tractable as well. In their work on joint Hierarchical CGGM, Huang et al. (2018) show that the supervised negative log-likelihood is a convex function of $\theta$. As a consequence, the problem (14) is solvable for a very wide array of penalties $pen_\theta$, in particular all the convex differentiable penalties.

In order to provide an algorithm with more specific and detailed steps, we consider in the rest of the section the special case of the GGL penalty. The GGL penalty was noticeably used in the supervised case by Huang et al. (2018), who proposed a proximal gradient algorithm. Likewise, we can use a proximal gradient algorithm to compute the M step (14) of our EM algorithm.
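The E step (13) only requires evaluating the $K$ conditional Gaussian densities at every data point. Below is a minimal sketch, computed in the log domain for numerical stability; `Lambdas`, `Thetas` and `pis` denote the current parameters $(\Lambda_k^{(t)}, \Theta_k^{(t)}, \pi_k^{(t)})$ and the variable names are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def e_step(Y, X, Lambdas, Thetas, pis):
    """Responsibilities p_{i,k} = P(z_i = k | Y_i, X_i) for the Mixture of CGGM (E step (13))."""
    n, p = Y.shape
    K = len(pis)
    log_resp = np.empty((n, K))
    for k in range(K):
        Lk, Tk = Lambdas[k], Thetas[k]
        _, logdet = np.linalg.slogdet(Lk)
        # residual after the class-dependent correction: Y + Lambda_k^{-1} Theta_k^T X
        R = Y + X @ Tk @ np.linalg.inv(Lk)
        quad = np.einsum("ij,jk,ik->i", R, Lk, R)        # R_i^T Lambda_k R_i for each i
        log_resp[:, k] = np.log(pis[k]) + 0.5 * logdet - 0.5 * p * np.log(2 * np.pi) - 0.5 * quad
    log_resp -= logsumexp(log_resp, axis=1, keepdims=True)
    return np.exp(log_resp)                              # shape (n, K), rows sum to one
```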
Proximal gradient algorithm to solve the M step with the GGL penalty.
The Group Graphical Lasso (GGL) penalty, introduced in Danaher et al. (2014) and adapted to the hierarchical CGGM by Huang et al. (2018), can be written:
$$pen_\theta(\theta) := \sum_{1\le i\ne j\le p}\left(\lambda_1^\Lambda \sum_{k=1}^K \left|\Lambda_k^{(ij)}\right| + \lambda_2^\Lambda \sqrt{\sum_{k=1}^K \left(\Lambda_k^{(ij)}\right)^2}\right) + \sum_{(i,j)\in[\![1,q]\!]\times[\![1,p]\!]}\left(\lambda_1^\Theta \sum_{k=1}^K \left|\Theta_k^{(ij)}\right| + \lambda_2^\Theta \sqrt{\sum_{k=1}^K \left(\Theta_k^{(ij)}\right)^2}\right). \tag{15}$$
Unlike in Huang et al. (2018), where $\lambda_1^\Lambda = \lambda_1^\Theta$ and $\lambda_2^\Lambda = \lambda_2^\Theta$, we use different levels of penalisation for the parameters $\Lambda$ and $\Theta$, since both their scales and their desired sparsity levels can be very different. This penalty borrows its design from the Group Lasso, see Yuan and Lin (2006), where the $l_1$ norm induces individual sparsity of each coefficient, and the $l_2$ norm induces simultaneous sparsity of groups of coefficients. In Eq. (15), for each pair $(i,j)$ belonging to the relevant space, $\{\Lambda_k^{(ij)}\}_{k=1}^K$ constitutes a group that can be entirely put to 0. This incites the algorithm to set a given matrix coefficient to 0 over all $K$ classes. These common zeros constitute the common structure sought after by the GGL approach. In our CGGM case, the same can be said for the group $\{\Theta_k^{(ij)}\}_{k=1}^K$. Regarding the theoretical analysis, we underline that the $l_2$ part of the penalty is not separable into a sum of $K$ different penalties, which forces a joint optimisation problem to be solved, even in the supervised framework.

We detail here how to solve the M step (14) with $pen_\theta(\theta)$ defined as in Eq (15). We assume, as usual, that the optimisation in $\pi$ is both independent from the optimisation in $\theta = \{\Lambda_k, \Theta_k\}_{k=1}^K$ and trivial. The function to minimise in $\theta$ at the M step is:
$$f(\theta) := \sum_{k=1}^K \left(-\frac{n_k^{(t)}}{2n}\ln|\Lambda_k| + \frac{1}{2}\left\langle \Lambda_k, S_{YY}^{k,(t)}\right\rangle + \left\langle \Theta_k^T, S_{YX}^{k,(t)}\right\rangle + \frac{1}{2}\left\langle \Theta_k \Lambda_k^{-1}\Theta_k^T, S_{XX}^{k,(t)}\right\rangle\right) + pen_\theta(\theta).$$
As shown in Huang et al. (2018), this function is convex, infinite on the border of its set of definition, and has a unique global minimum. We write $f(\theta) =: g(\theta) + pen_\theta(\theta)$ for the sake of simplicity. The proximal gradient algorithm, see Combettes and Pesquet (2011), is an iterative method based on a quadratic approximation of $g(\theta)$. If $\theta^{(s-1)}$ is the current state of the parameter within the proximal gradient iterations, then the next state, $\theta^{(s)}$, is found by optimising the approximation:
$$f\left(\theta^{(s)}\right) = f\left(\theta^{(s-1)} + \theta^{(s)} - \theta^{(s-1)}\right) \approx g\left(\theta^{(s-1)}\right) + \nabla g\left(\theta^{(s-1)}\right)^T \left(\theta^{(s)} - \theta^{(s-1)}\right) + \frac{1}{2\alpha}\left\|\theta^{(s)} - \theta^{(s-1)}\right\|_2^2 + pen_\theta\left(\theta^{(s)}\right) \equiv \frac{1}{2\alpha}\left\|\theta^{(s)} - \left(\theta^{(s-1)} - \alpha \nabla g\left(\theta^{(s-1)}\right)\right)\right\|_2^2 + pen_\theta\left(\theta^{(s)}\right), \tag{16}$$
where we removed in the last line the constants irrelevant to the optimisation in $\theta^{(s)}$, and where $\alpha$ denotes the step size of the gradient descent. Note that we use the exponent $(s)$ to indicate the current stage of the proximal gradient iteration, to avoid confusion with the exponent $(t)$ used for the EM iterations (which are one level above).
We underline that, in addition to $g(\theta)$ itself, the second order term in the Taylor expansion of $g(\theta)$ is also approximated. Using $\frac{1}{2\alpha}\left\|\theta^{(s)} - \theta^{(s-1)}\right\|_2^2$ instead of $\frac{1}{2}\left(\theta^{(s)} - \theta^{(s-1)}\right)^T H_g\left(\theta^{(s-1)}\right)\left(\theta^{(s)} - \theta^{(s-1)}\right)$ spares us from computing the Hessian $H_g\left(\theta^{(s-1)}\right)$ and simplifies the calculations to come. The approximated formulation in Eq (16) leads to the definition of the proximal optimisation problem:
$$prox_\alpha(x) := \underset{\theta}{\mathrm{argmin}} \; \frac{1}{2\alpha}\left\|\theta - x\right\|_2^2 + pen_\theta(\theta), \tag{17}$$
so that the proximal gradient step can be written:
$$\theta^{(s)} = prox_{\alpha_s}\left(\theta^{(s-1)} - \alpha_s \nabla g\left(\theta^{(s-1)}\right)\right), \tag{18}$$
where the step size $\alpha_s$ is determined by line search. The usual proximal gradient heuristic is to take an initial step size $\alpha_0$ and a coefficient $\beta \in ]0, 1[$, and to shrink $\alpha \leftarrow \beta\alpha$ as long as:
$$g\left(\theta^{(s-1)} - \alpha G_\alpha\left(\theta^{(s-1)}\right)\right) > g\left(\theta^{(s-1)}\right) - \alpha \nabla g\left(\theta^{(s-1)}\right)^T G_\alpha\left(\theta^{(s-1)}\right) + \frac{\alpha}{2}\left\|G_\alpha\left(\theta^{(s-1)}\right)\right\|_2^2,$$
with $G_\alpha\left(\theta^{(s-1)}\right) := \frac{\theta^{(s-1)} - prox_\alpha\left(\theta^{(s-1)} - \alpha\nabla g\left(\theta^{(s-1)}\right)\right)}{\alpha}$ the generalised gradient.

To apply the proximal gradient algorithm, we need to be able to solve the proximal problem (17) with the CGGM likelihood and the GGL penalty. Thankfully, Danaher et al. (2014) found an explicit solution to this problem in the GGM case, which Huang et al. (2018) adapted to the CGGM. The proximal optimisation is separable in $\Lambda$ and $\Theta$, and the solutions $\Lambda^{(prox)}$ and $\Theta^{(prox)}$ share the same formula. As a result, we use $D$ as a placeholder name for either $\Lambda$ or $\Theta$, i.e. depending on the context either $D_k^{ij} = \Lambda_k^{ij}$ or $D_k^{ij} = \Theta_k^{ij}$. Let $S$ be the soft thresholding operator, $S(x, \lambda) := sign(x)\max(|x| - \lambda, 0)$, and let $\widetilde{D}_{k,\alpha}^{ij} := D_k^{ij,(s-1)} - \alpha \frac{\partial g}{\partial D_k^{ij}}\left(\theta^{(s-1)}\right)$. The solution of (17), with $x = \theta^{(s-1)} - \alpha\nabla g\left(\theta^{(s-1)}\right)$, is given coefficient by coefficient by:
$$D_k^{ij,(prox)} = S\left(\widetilde{D}_{k,\alpha}^{ij}, \lambda_1^D \alpha\right)\max\left(1 - \frac{\lambda_2^D \alpha}{\sqrt{\sum_k S\left(\widetilde{D}_{k,\alpha}^{ij}, \lambda_1^D \alpha\right)^2}}, 0\right). \tag{19}$$
Note that the partial derivatives $\frac{\partial g}{\partial D_k^{ij}}\left(\theta^{(s-1)}\right)$, necessary to get $\widetilde{D}_{k,\alpha}^{ij}$, are easily calculated in closed form from the likelihood formula. With the proximal problem (17) and the line search easily solvable, the proximal gradient steps can be iterated until convergence to find the global minimum of $f(\theta)$. With $f(\theta)$ optimised, the M step (14) is solved.
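The coefficient-wise solution (19) is simple enough to implement directly. Below is a small sketch of the corresponding proximal operator applied to a stack of $K$ matrices (either the $\Lambda_k$ or the $\Theta_k$); `lam1`, `lam2` and `alpha` play the roles of $\lambda_1^D$, $\lambda_2^D$ and the step size, and the function names are illustrative. In practice the diagonal entries of the $\Lambda_k$ are usually left unpenalised, which this sketch does not handle.

```python
import numpy as np

def soft_threshold(x, lam):
    """S(x, lam) = sign(x) * max(|x| - lam, 0), applied element-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ggl_prox(D_tilde, lam1, lam2, alpha):
    """Group Graphical Lasso proximal step (19).

    D_tilde : array of shape (K, p, p) or (K, q, p), the gradient step
              D^{(s-1)} - alpha * grad g(theta^{(s-1)}) stacked over the K classes.
    """
    S = soft_threshold(D_tilde, lam1 * alpha)
    group_norm = np.sqrt((S ** 2).sum(axis=0, keepdims=True))      # l2 norm of each (i, j) group over k
    shrink = np.maximum(1.0 - (lam2 * alpha) / np.maximum(group_norm, 1e-12), 0.0)
    return S * shrink                                              # the whole group is zeroed when shrink = 0

# toy usage on a stack of K = 3 random 5 x 5 matrices
D = np.random.default_rng(2).standard_normal((3, 5, 5))
print(ggl_prox(D, lam1=0.2, lam2=0.3, alpha=1.0).round(2))
```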
4. Experiments
In this section, we demonstrate the performance of our EM with Mixture of CGGM. First on a visual toy example in two dimensions, then on a higher dimensional synthetic example, and finally on real Alzheimer's Disease data. We compare the Mixture of CGGM to the regular Mixture of GGM, which ignores co-features, and to a Mixture of GGM that assumes a uniform linear effect of the co-features on the features.
In this section, we present a simple visual example to illustrate the importance of taking into account heterogeneous co-feature effects. We show that even with a single binary co-feature, and with low dimensional features, the state of the art unsupervised GGM techniques are greatly disrupted by the co-features, whereas our EM with Mixture of CGGM (which we call "Conditional EM" or "C-EM") achieves near perfect classification.

Under the Mixture of Gaussians (MoG) model, the observed data, $Y \sim \sum_{k=1}^K \pi_k \mathcal{N}(\mu_k, \Sigma_k)$, belongs to $K$ classes which can directly be represented as $K$ clusters in the feature space $\mathbb{R}^p$, each cluster being centred around a centroid at position $\mu_k$ and with an ellipsoid shape described by $\Sigma_k$. However, when there exist conditioning variables $X \in \mathbb{R}^q$ that have an effect on $Y$, this geometric description becomes more complex. Typically, the value of $Y$ could depend linearly on the value of $X$, with $\mathbb{E}[Y \mid X, z=k] = \beta_k^T X$ for some $\beta_k \in \mathbb{R}^{q\times p}$. In this case, the average position in class $k$ is not a fixed $\mu_k$ but a function of $X$. If $X$ contains categorical variables, this creates as many different centroid positions as there are possible category combinations in $X$. The number of these de facto clusters increases geometrically with the dimension $q$, which deters from simply running a clustering method with an increased number of clusters $K'$ to identify all of them. Moreover, if $X$ contains continuous variables, there is a continuum of positions for the centroid, not a finite number of de facto clusters. If $X$ mixes the two types of variables, the two effects coexist. This shatters any hope to run a traditional MoG-based EM clustering algorithm, since its success is heavily dependent on its ability to identify correctly the $K$ distinct cluster centroids $\mu_k$.

Since the $X$ are observed, a possible solution is to run the linear regression $\widehat{Y} = \hat\beta X$ beforehand, and run the EM algorithm on the residual $Y - \widehat{Y}$ to remove the effect of $X$. This is what we call the "residual EM" or "residual Mixture of GGM". However, this does not take into account the fact that this effect can be different for each class $k$: $\beta_1 \neq \beta_2 \neq ... \neq \beta_K$. Since the label is not known beforehand in the unsupervised context, the linear regression $\widehat{Y} = \hat\beta X$ can only be run on all the data indiscriminately, and hence is insufficient in general. On the other hand, the hierarchical CGGM (5), which verifies $\mathbb{E}[Y \mid X, z=k] = -\Lambda_k^{-1}\Theta_k^T X$, is designed to capture heterogeneous co-feature effects. We design a simple experiment to substantiate this intuition.

In this example, $Y \in \mathbb{R}^2$, $X \in \{-1, 1\}$ and $z \in \{1, 2\}$. $Y \mid X, z$ follows the hierarchical conditional model (5). In this simple case, this can be written as $Y = (\beta_1 X + \epsilon_1)\mathbb{1}_{z=1} + (\beta_2 X + \epsilon_2)\mathbb{1}_{z=2}$, with $\epsilon_1 \sim \mathcal{N}(0, \Lambda_1^{-1})$ and $\epsilon_2 \sim \mathcal{N}(0, \Lambda_2^{-1})$. A typical iid data sample $(Y^{(i)})_{i=1}^n$ is represented on the left sub-figure of Figure 1. The hidden variable $z$ is represented by the colour (blue or orange). The observable co-feature $X$ is represented by the shape of the data point (dot or cross). It is clear from the figure that a Mixture of Gaussians model with $K = 2$ cannot properly separate the blue and orange points into two clusters. Indeed, on the right sub-figure of Figure 1, we observe the final state of an EM that fits a Mixture of Gaussians on $Y$. The two recovered clusters are more correlated with the co-feature $X$ than with the hidden variable $z$.
However, this method did not take advantage of the knowledge of the co-feature $X$. As previously mentioned, the most one could do is first subtract the effect of $X$ from $Y$ before running the EM. On the left sub-figure of Figure 2, we represent the residual data $\widetilde{Y} := Y - \hat\beta X$, where $\hat\beta$ is the Ordinary Least Squares estimator of the linear regression between $X$ and $Y$ over the whole dataset (with balanced classes, $\hat\beta$ is approximately the average of $\beta_1$ and $\beta_2$ when $n$ is large enough). Since the linear effect between $X$ and $Y$ is not uniform over the dataset, but class dependent, the correction is imperfect, and the two class clusters remain hardly separable. This is why the residual EM, which fits a Mixture of GGM on $\widetilde{Y}$, is also expected to fail to identify clusters related to the hidden variable. This is shown on the right sub-figure of Figure 2, where we see a typical final state of the residual EM.

On the leftmost sub-figure of Figure 3, we display the proper correction for the co-features' effect, $\widetilde{Y}' = Y - \beta_1 X \mathbb{1}_{z=1} - \beta_2 X \mathbb{1}_{z=2} = \epsilon_1 \mathbb{1}_{z=1} + \epsilon_2 \mathbb{1}_{z=2}$. Under this form, a Mixture of Gaussians can separate the data by colour. This is precisely the kind of translation that each data point undergoes within a Hierarchical CGGM. Hence a Mixture of CGGM can succeed in identifying the hidden variable $z$, provided that it estimates the model parameters correctly. To illustrate this point, the two next sub-figures in Figure 3 represent the same final state of the EM fitting a Mixture of CGGM on $Y$. The middle sub-figure represents $\widetilde{Y}'$ as well as the two estimated centered distributions $\mathcal{N}(0, \widehat\Lambda_k^{-1})$ for $k = 1, 2$. We can see the two clearly identified clusters after removing the effect of $X$. The rightmost sub-figure represents the original data $Y$ as well as the four estimated distributions $\mathcal{N}(\pm\widehat\Sigma_k\widehat\Theta_k^T, \widehat\Lambda_k^{-1})$ for $k = 1, 2$. The four de facto clusters present in the data $Y$ before removing the effect of $X$ are well estimated by the method.

We confirm these illustrative results by running several simulations. We generate 50 datasets with $n = 500$ data points each. For each simulation, we make 10 random initialisations from which we run the three EMs: with GGM, residualised GGM or CGGM. Table 1 summarises the results. We follow the errors made by the estimated class probabilities or "soft labels", $\widehat{\mathbb{P}}(z_i = k)$, which we call the "soft misclassification error", as well as the errors made by the "hard labels", $\mathbb{1}_{\hat z_i = k}$, which we call the "hard misclassification error". They can be expressed as $\frac{1}{n}\sum_{i,k}\left|\mathbb{1}_{z_i = k} - \widehat{\mathbb{P}}(z_i = k)\right|$ and $\frac{1}{n}\sum_{i,k}\left|\mathbb{1}_{z_i = k} - \mathbb{1}_{\hat z_i = k}\right|$ respectively. We see that the Mixture of CGGM performs much better, with less than 10% of misclassification on average, while the two GGM methods are both above 40% of error, fairly close to the level of a random uniform classifier, 50%.
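For reference, the kind of sample used in this toy example can be generated as follows; this is a minimal sketch with assumed values for $\beta_1$, $\beta_2$, $\Lambda_1^{-1}$ and $\Lambda_2^{-1}$ (the exact values used in our simulations may differ).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
z = rng.integers(1, 3, size=n)                       # hidden class label in {1, 2}
X = rng.choice([-1.0, 1.0], size=n)                  # observed binary co-feature

# class-dependent linear effect of X on Y (illustrative values)
beta = {1: np.array([8.0, 8.0]), 2: np.array([8.0, -8.0])}
cov = {1: np.array([[4.0, 1.5], [1.5, 2.0]]),        # Lambda_1^{-1}
       2: np.array([[2.0, -1.0], [-1.0, 4.0]])}      # Lambda_2^{-1}

Y = np.stack([beta[k] * x + rng.multivariate_normal(np.zeros(2), cov[k])
              for k, x in zip(z, X)])
# Y mixes four de facto clusters (two classes x two values of X) in the 2D feature space
```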
Figure 1: (Left) Observed data $Y$ in the 2D space. The observed conditioning variable $X$ is binary: data points with $X = -1$ and $X = 1$ are represented with different markers (crosses and dots). In addition, there is an unknown "class" variable $z$: class 1 is in blue, class 2 in orange. $Y \mid X, z$ follows the hierarchical conditional model. As a result, the two classes (orange and blue) are hard to separate into two clusters. (Right) Typical clusters estimated by an EM that fits a GGM mixture on $Y$.
Figure 2: (Left) Residual data $\widetilde{Y} = Y - \hat\beta X$ after taking into account the estimated effect of $X$. Since the effect had different intensities on classes 1 and 2, only the average effect was subtracted, and the two classes are still not well separated. (Right) Typical clusters estimated by the "residual EM", which fits a GGM mixture on $\widetilde{Y}$.
Figure 3: (Left) Observations $\widetilde{Y}' = Y - \beta_1 X \mathbb{1}_{z=1} - \beta_2 X \mathbb{1}_{z=2}$ exactly corrected for the class-dependent effect of $X$. In this state the two classes appear as two distinct clusters. The Conditional-EM is designed to transform the data in this manner. (Middle) One possible representation of the C-EM results: the corrected observations $\widetilde{Y}'$ are displayed alongside centered normal distributions with the two estimated covariance matrices, $\mathcal{N}(0, \widehat\Lambda_k^{-1})$. (Right) Another possible representation of the same C-EM results: the original observations $Y$ are displayed, alongside the four de facto estimated distributions $\mathcal{N}(\pm\widehat\Sigma_k\widehat\Theta_k^T, \widehat\Lambda_k^{-1})$.

Table 1: Average and standard deviation of the misclassification error achieved on the 2-dimensional example with the EMs on the Mixture of GGM, the Mixture of GGM with residualised data, and the Mixture of CGGM. The two GGM methods are close to the threshold of random classification (0.50), while the Mixture of CGGM is on average below 10% of error.

                      EM GGM    EM resid. GGM    EM CGGM
soft misclassif.        —             —          0.08 (0.17)
hard misclassif.        —             —          0.07 (0.17)

In this section, we perform a quantitative analysis of the algorithms in a higher dimensional framework, where the matrix parameters $\Lambda$ and $\Theta$ are more naturally interpreted as sparse networks. We confirm that the Mixture of Conditional Gaussian Graphical Models is better suited to take into account the heterogeneous effects of co-features on the graph. For this experiment, the observed data follows a mixture model with $K = 3$ classes. Each class $k$ has the probability weight $\pi_k = \frac{1}{3}$. An observation $(Y, X) \in \mathbb{R}^p \times \mathbb{R}^q$ belonging to the class $k$ is described by the distribution $Y \mid X \sim \mathcal{N}\left(-\Lambda_k^{-1}\Theta_k^T X, \Lambda_k^{-1}\right)$. No model assumptions are made on $X$. In this example, $X$ contains two binary variables, two continuous variables, and a constant variable always equal to 1. The inverse-covariance matrices $\Lambda_k \in \mathbb{R}^{p\times p}$ and the transition matrices $\Theta_k \in \mathbb{R}^{q\times p}$ are both sparse, with $p = 10$ and $q = 5$. We run 20 simulations. A simulation consists of $n = 300$ generated data points. On these data points, we run the compared methods, all initialised with the same random parameters. For all simulations, we make 10 of these runs, each with a different random initialisation. We compare the same three algorithms as in section 4.1: the EM for the Mixture of GGM, the EM for the Mixture of GGM with the average effect of $X$ subtracted, and the EM applied to the Mixture of CGGM. Additionally, we also run the tempered versions of these three EM algorithms.

We follow four metrics to assess the methods' success in terms of cluster recovery and fit with the data: the classification error (both soft and hard label versions), the recovery of the network matrix $\Lambda$, and an "ABC-like" metric. The "ABC-like" metric is meant to assess how well each of the estimated solutions is able to replicate the observed data. Since each solution is the parameter of a probability distribution, at the end of each EM, we generate new data following this proposed distribution. Then, for each synthetic data point, we compute the distance to the closest neighbour among the real data points. These minimal distances constitute our "ABC-like" metric. Finally, we also compute the execution time of each EM, knowing that they all have the same stopping criteria.
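The ABC-like metric is a simple nearest-neighbour computation; below is a minimal sketch, assuming `Y_real` holds the observed data and `Y_synth` the data generated from an estimated model (both are hypothetical array names).

```python
import numpy as np
from scipy.spatial import cKDTree

def abc_like_metric(Y_real, Y_synth):
    """Distance from each synthetic point to its closest real data point."""
    tree = cKDTree(Y_real)
    distances, _ = tree.query(Y_synth, k=1)
    return distances            # one minimal distance per synthetic data point

# example: compare two candidate models by the distribution of these distances, e.g.
# np.median(abc_like_metric(Y_real, Y_synth_cggm)) vs np.median(abc_like_metric(Y_real, Y_synth_ggm))
```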
We represent in Figure 4 the empirical distribution of these four metrics, and we quantify in Table 2 the key statistics (mean, standard deviation and median) that characterise them. With $K = 3$ and balanced classes, a uniform random classifier would guess the wrong label 66.7% of the time. We observe that the two Mixture of GGM methods are dangerously close to this threshold, with more than 50% of hard misclassification. The EM on the Mixture of CGGM (C-EM), on the other hand, achieves a much better classification, with less than 15% of hard misclassification. This demonstrates that, even when faced with a more complex situation, in higher dimension, the Mixture of CGGM is better suited to correct for the effect of the co-features and discover the right clusters of data points. This also underlines once more the importance of allowing a different effect of $X$ on $Y$ in each class. Indeed, the residual Mixture of GGM, which took into account the average effect of $X$ on $Y$ over the entire population, was unable to achieve better performances than the EM that did not even use the co-features $X$. In terms of reconstruction of the observed data by the estimated model (ABC-like metric), we see that the synthetic data points generated from the estimated Mixture of CGGM model have closer nearest neighbours than the data points generated by the other estimated models. In addition to all these observations, the C-EM is also faster than the other two methods, reaching the convergence threshold sooner.

Table 2: Average, standard deviation and median of the four performance metrics over the 30 × 5 runs. The best results are in bold.

              soft misclassif.   hard misclassif.   ABC-like metric   runtimes
GGM             0.56 (0.03)            —                  —               —
resid. GGM           —                 —                  —               —
CGGM                 —                 —                  —               —

Figure 4: Empirical distribution of several performance metrics measured over many simulations. The sample is made of 30 simulations with 5 different initialisations each. Three methods are compared: the EM and EM residual algorithms estimate a Mixture of GGM, while the C-EM algorithm estimates a Mixture of CGGM. The C-EM is much better performing and faster. (Upper left) Soft misclassification error $\left|\mathbb{1}_{z_i=k} - \widehat{\mathbb{P}}(z_i=k)\right|$. (Upper right) Hard misclassification error $\left|\mathbb{1}_{z_i=k} - \mathbb{1}_{\hat z_i=k}\right|$. (Bottom left) ABC-like metric. (Bottom right) Run time.

In addition to the cluster recovery, we can also assess the parameter reconstruction of each method. Since the three clustering methods estimate different parametric models over the data, they do not actually try to estimate the same parameters. Regardless, all the methods still estimate a certain inverse covariance matrix $\Lambda_k$ (conditional or not on the $X$, depending on the model) for each sub-population that they identify. In Table 3, we can check that the $\widehat\Lambda_k$ estimated with the Mixture of CGGM are indeed a much better fit for the real $\Lambda_k$ than the estimated matrices from the other models. This is expected, since the real $\Lambda_k$ actually correspond to the CGGM model. The two metrics followed are the Kullback-Leibler (KL) divergence between the Gaussian distributions $f_{\Lambda_k} \sim \mathcal{N}(0, \Lambda_k^{-1})$ and $f_{\widehat\Lambda_k} \sim \mathcal{N}(0, \widehat\Lambda_k^{-1})$, and the $l_2$ difference given by the Frobenius norm $\left\|\Lambda_k - \widehat\Lambda_k\right\|_F$.

Table 3: Average and standard deviation of the metrics describing the reconstruction of each inverse-covariance matrix $\Lambda_k$. The matrices are consistently better reconstructed with the Mixture of CGGM.

metric                                      class    EM GGM       EM res. GGM    EM CGGM
$KL(f_{\Lambda_k}, f_{\widehat\Lambda_k})$    1      11.0 (3.0)    7.5 (6.8)        —
$\|\Lambda_k - \widehat\Lambda_k\|_F$         —          —             —            —
To illustrate the different levels of success concerning the conditional correlation graph recovery, we display in Figure 5 the conditional correlation matrices (i.e. the conditional correlation graphs with weighted edges) estimated by each method. The three columns of figures correspond to the three sub-populations. The first two rows of figures are the matrices estimated by the two Mixture of GGM methods, with and without residualisation with the co-features. The third row of figures corresponds to the matrices estimated by the Mixture of CGGM. The final row displays the real conditional correlation matrices. We observe that the two Mixtures of GGM recover way too many edges, with no particular fit with the real matrix. By contrast, the matrices from the CGGM Mixture exhibit the proper edge patterns, with few False Positives and False Negatives. This is not an easy feat to achieve, since the method was run from a random initialisation on a totally unsupervised dataset, with heavily translated data points all over the 10-dimensional space. Moreover, the matrices in Figure 5 all result from the inversion of the empirical covariance matrix, which is neither a very geometrical nor a very stable operation.

In Figure 6, we represent the regression parameters $\widehat\Theta_k$ estimated with the Mixture of CGGM alongside the real $\Theta_k$. Once again, we see that the sparsity pattern is very well identified, with no False Positives. Moreover, in this case, there are also almost no False Negatives, and all the edge intensities are correct. This is not a surprise. Indeed, the parameter $\Theta$ plays a huge role in the correct classification of the data, since it serves to define the expected position of each data point in the feature space (playing the part of the "average" parameter in Mixtures of GGM). Hence, a good estimation of $\Theta$ is mandatory to reach a good classification. Since the EM with Mixture of CGGM achieved very good classification results, it was expected that $\Theta$ would be well estimated.

Figure 5: Comparison between the several estimated and the true conditional correlation matrices for each sub-population. The three columns of figures correspond to the three sub-populations. The first two rows of figures are the matrices estimated by the two Mixture of GGM methods, with and without residualisation with the co-features. The third row of figures corresponds to the matrices estimated by the Mixture of CGGM. The final row displays the real conditional correlation matrices. Unlike the two GGM-based methods, the Mixture of CGGM recovers correct edges with very few False Positives.

Figure 6: Reconstruction of the $\Theta_k$ by the EM on the Mixture of CGGM. The three columns of figures correspond to the three sub-populations. Almost all the edges are right, with no False Positives and almost no False Negatives. Moreover, the intensities are also mostly correct.

In this section, we confirm our experimental observations with a real, high dimensional, Alzheimer's Disease dataset. We illustrate that the EM with Mixture of CGGM is better suited to identify clusters correlated with the diagnosis than the Mixture of GGM methods. We bring to light the effect of co-features such as the gender and age on the medical features.

Our dataset is composed of the parameters $\xi, \tau, (w_i)_{i=1,...,30}$ of longitudinal models estimated on real Alzheimer's Disease patients, see Schiratti et al. (2015). In summary, the evolution of several features is followed over time for each patient. The features $i = 1, ..., 10$ correspond to MRI measures of atrophy in different regions of the brain. The features $i = 11, ..., 30$ correspond to Cognitive Scores. The longitudinal model associates to each patient a geodesic trajectory within a Riemannian manifold of the parameter space that fits the patient's own evolution. More specifically, the longitudinal model parameters describe how each patient's trajectory deviates from a specific reference geodesic trajectory. The parameter $\xi$ is the time acceleration of the patient with regards to the reference. The parameter $\tau$ is the time shift, so that a smaller $\tau$ corresponds to a disease which starts earlier. Each $w_i$ describes the space shift of the trajectory with regards to its corresponding feature. They have the same sign convention as $\tau$. With $Y := (\xi, \tau, (w_i)_{i=1,...,30})$ the vector of features, we have $p = 32$. We add three co-features to describe each patient: the gender, the age at the first visit ("age baseline"), and the number of years of education. With the addition of the constant co-feature equal to 1, the vector of co-features is 4-dimensional, $X \in \mathbb{R}^4$. The dataset contains 1400 patients, half of them healthy ("Control" patients), and the other half diagnosed with Alzheimer's Disease, either from the start or after a few visits ("AD" patients). The data is centered and normalised over the entire population.

We run the three algorithms, EM, EM residual and C-EM, on this dataset. In order to check the stability of the results over several different runs, we implement a bootstrap procedure that only uses 70% of the data each time. We generate 10 such bootstrapped datasets. We initialise the algorithms with a KMeans on the $Y^{(i)}$ data points. Since KMeans is not deterministic, we make 5 different runs for each bootstrapped dataset, each starting from a different possible KMeans initialisation. Like previously, for the sake of fairness, the EM and C-EM are always provided with the same initialisation, and the residual EM is initialised with a KMeans on the residual of $Y$ after subtracting the prediction by $X$, a more relevant initialisation for this method. We make all these runs with four different feature sets: first with no space shift variables, $Y = \{\xi, \tau\}$, $p = 2$; then adding only the MRI space shifts, $Y = \{\xi, \tau, (w_i)_{i=1,...,10}\}$, $p = 12$; then only the Cognitive Scores space shifts, $Y = \{\xi, \tau, (w_i)_{i=11,...,30}\}$, $p = 22$; and finally with all the features, $Y = \{\xi, \tau, (w_i)_{i=1,...,30}\}$, $p = 32$. The classification results are summarised in Table 4. With two balanced classes, the classification error of a uniform random classifier is 50%.
On the smallest feature set, $p = 2$, we can see that the discovered clusters are somewhat correlated with the diagnosis, with classification errors below 30%. The Mixture of GGM on the uniformly residualised data and the Mixture of CGGM achieve similar levels of error, and they are both better than the regular Mixture of GGM. When the MRI features are added, all the discovered clusters become more correlated with the diagnosis. The regular Mixture of GGM achieves on average 16% of hard classification error, the residualised Mixture of GGM is at 11% of error, and the Mixture of CGGM even below, at 7%. The results with only the Cognitive Scores are very similar, simply a bit worse for every method. However, when both the MRI and Cognitive Score features are included, the performance of both GGM mixtures decreases, with both higher average error and higher variance. On the other hand, the Mixture of CGGM achieves here its best level of performance. This stability of the Mixture of CGGM's performance as the size of the feature set increases indicates that our model is the best suited to properly identify clusters correlated with the diagnosis in high dimension.

We analyse the estimated Mixture of CGGM parameters on the full feature set, $p = 32$. First, since $\mathbb{E}[Y \mid X, z=k] = -\Lambda_k^{-1}\Theta_k^T X$ in the CGGM, we display in Figure 7 the two estimated $\hat\beta_k := -\widehat\Lambda_k^{-1}\widehat\Theta_k^T$ (averaged over the bootstrap), as well as their difference. They play the role of linear regression coefficients in the model. The last column is the constant coefficient, while the first three are the gender, age baseline and years of education coefficients respectively. Since the data is centered, negative and positive values correspond to below average and above average values respectively. The cluster $k = 1$ is the one very correlated with the Control patients sub-population. Similarly, the cluster $k = 2$ is the one very correlated with the AD patients. The most noticeable differences between the two $\hat\beta_k$ are the constant vectors, which have opposite effects on all features. In particular, the "AD cluster" is very correlated with a high $\xi$, a low $\tau$, an earlier atrophy of the ventricles, as well as high space shifts for the two logical memory tests (immediate and delayed). The exact opposite is true for the "Control cluster". These are the expected effects: a high $\xi$ corresponds to a quickly progressing disease, and a low $\tau$ to an early starting disease. The non-constant linear regression coefficients are also different between the clusters, although these are often differences in intensity and not in sign. In order to visualise the differences in intensity more clearly, we represent on the rightmost sub-figure of Figure 7, with the same conventions, the difference $\hat\beta_2 - \hat\beta_1$. In particular, within the AD cluster, we observe a stronger positive effect of the age at the first visit on the space shifts corresponding to the amygdala, entorhinal cortex, hippocampus and parahippocampus cortex atrophies. On the contrary, there is a stronger positive effect of the education level on all the space shifts of MRI atrophies for the Control patients.
The age at the first visit has a stronger negative impact on the $w_i$ corresponding to the self-reported memory, language and visual spatial capacity scores for the AD patients, and a stronger negative impact on the two logical memory scores for the Control patients.

Finally, we display in Figures 8 and 9 the average conditional correlation graphs estimated for these two clusters by the Mixture of CGGM. Their only noticeable difference is the negative conditional correlation between $\xi$ and $\tau$ in the "Control cluster", which is reversed in the "AD cluster". For the AD patients, this means that a disease that appears later tends to also progress faster, which is in line with medical observations. Apart from this edge, the rest of the connections are almost identical in-between clusters. This suggests that the cluster-dependent prediction $\mathbb{E}_{\hat\theta_k}[Y^{(i)} \mid X^{(i)}, z = k] = -\widehat\Sigma_k\widehat\Theta_k^T X^{(i)}$ takes into account enough of the cluster-specific effects so that the remaining unexplained variance has almost the same form in both clusters. Hence, the conditional correlations pictured in these graphs correspond to very general effects, such as the positive correlations between related cognitive tests or related areas of the cortex. More strikingly, there are no conditional correlations between $\xi$ or $\tau$ and any of the space shifts $w_i$. This has significant medical implications, since it suggests that the earliness ($\tau$) and speed ($\xi$) of the disease are conditionally independent from the succession of degradations that the patient's imagery and cognitive scores undergo. In other words, these graphs support the idea that the disease is the same regardless of whether it appears early or late and progresses fast or slowly.

Table 4: Recovery of the diagnosis labels (AD or Control) with unsupervised methods on real longitudinal data. The three compared methods are the EM, the EM residual (both GGM) and the C-EM (CGGM). Four different feature sets are tried: only $\{\tau, \xi\}$, adding the MRI space shift coefficients $w_i$, adding the Cognitive Score (CS) space shift coefficients $w_i$, and adding both the MRI and CS space shift coefficients. The table presents the average and standard deviation of the misclassification error over 10 bootstrap iterations, with 5 different KMeans initialisations each. The best results are in bold.

                                      EM             EM resid.        C-EM
no CS, no MRI    soft misclassif.   0.31 (0.02)     0.22 (0.03)         —
(p = 2)          hard misclassif.   0.31 (0.03)         —           0.19 (0.01)
only MRI         soft misclassif.   0.15 (0.01)     0.13 (0.01)         —
(p = 12)         hard misclassif.   0.12 (0.01)     0.10 (0.01)         —
only CS          soft misclassif.   0.17 (0.02)     0.15 (0.02)         —
(p = 22)         hard misclassif.   0.14 (0.03)     0.13 (0.03)         —
CS and MRI       soft misclassif.   0.24 (0.09)     0.17 (0.04)         —
(p = 32)         hard misclassif.   0.21 (0.10)     0.15 (0.05)         —
5. Conclusion
We introduced the Mixture of Conditional Gaussian Graphical Models in order to guide the cluster discovery when estimating different Gaussian Graphical Models for an unlabelled heterogeneous population in the presence of co-features. We motivated its usage to deal with the potentially in-homogeneous and class-dependent effects of the co-features on the observed data, which would otherwise disrupt the clustering effort. To estimate our Mixture model, we proposed a penalised EM algorithm ("Conditional EM" or "C-EM") compatible with a wide array of penalties. Moreover, we provided detailed algorithmic steps in the specific case of the popular Group Graphical LASSO penalty. Then, we demonstrated the interest of the method with experiments on synthetic and real data. First, we showed on a toy example, with a 2-dimensional feature space and a 1-dimensional co-feature, that the regular Mixture of GGM methods were inadequate to deal with even the simplest in-homogeneous co-feature effect. We confirmed on a more complex simulation, in higher dimension, that Mixtures of CGGM could identify the clusters in the feature space much better, and recover the actual GGM structure of the data. Finally, we tested all the methods on a real dataset, with longitudinal model parameters describing the evolution of several Alzheimer's Disease patients. We demonstrated that our method was the best at identifying the diagnosis within an unlabelled dataset. We unearthed some in-homogeneous effects of co-features on the longitudinal parameters and recovered the conditional correlation graphs by cluster. These graphs hint at a conditional independence between the earliness and speed of the disease and the order in which the many degradations appear. This hypothesis will be tested in future studies.
Figure 7: Average $\hat{\beta}_k := -\hat{\Sigma}_k \hat{\Theta}_k^T$ over 10 bootstrap samplings of the data. For each bootstrapped dataset, 3 different runs of the C-EM are made, each with a different KMeans initialisation of the labels. (Left) $\hat{\beta}_1$: the cluster k = 1 is always very correlated with the Control patients sub-population (less than 10% deviation). (Middle) $\hat{\beta}_2$: the cluster k = 2 is likewise very correlated with the AD patients. In each figure, the last column is the constant coefficient. The largest inter-cluster differences are between the two constant terms; however, there are some noticeable differences on the other regression coefficients as well. (Right) Average difference between $\hat{\beta}_2$ and $\hat{\beta}_1$ over the 30 bootstrap runs of the C-EM. Here, the differences in intensity between the AD (k = 2) and Control (k = 1) patients are more explicit.

Figure 8: Conditional correlation graph of the estimated cluster most correlated with the “Control” diagnosis. The graph nodes are τ, ξ, the MRI space shift components (precuneus, hippocampus, entorhinal cortex, fusiform gyrus, midtemporal lobe, amygdala, temporal pole, insula, parahippocampus cortex, ventricles) and the cognitive score space shift components (ADAS-Cog, RAVLT, CF, MoCA, memory and attention, memory delayed recall, memory and recognition, GDS, FAQ, NPI, self- and partner-reported memory, language and visual spatial capacity, BNT, LM immediate and delayed recall, DSST).

Figure 9: Conditional correlation graph of the estimated cluster most correlated with the “AD” diagnosis, over the same nodes as in Figure 8.

Acknowledgments
The research leading to these results has received funding from the European Research Council (ERC) under grant agreement No 678304, the European Union's Horizon 2020 research and innovation program under grant agreements No 666992 (EuroPOND) and No 826421 (TVB-Cloud), and the French government under management of Agence Nationale de la Recherche as part of the ”Investissements d'avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and reference ANR-10-IAIHU-06 (IHU-A-ICM).