Mixture of Conditional Gaussian Graphical Models for unlabelled heterogeneous populations in the presence of co-factors
Thomas Lartigue [email protected]
Aramis project-team and CMAP, Inria and École Polytechnique, Paris, France
Stanley Durrleman [email protected]
Aramis project-team, Inria, Paris, France
Stéphanie Allassonière [email protected]
Université Paris Descartes, Paris, France
Abstract
Conditional correlation networks, within Gaussian Graphical Models (GGM), are widely used to describe the direct interactions between the components of a random vector. In the case of an unlabelled heterogeneous population, Expectation Maximisation (EM) algorithms for Mixtures of GGM have been proposed to estimate both each sub-population's graph and the class labels. However, we argue that, with most real data, class affiliation cannot be described with a Mixture of Gaussians, which mostly groups data points according to their geometrical proximity. In particular, there often exist external co-features whose values affect the features' average value, scattering across the feature space the data points belonging to the same sub-population. Additionally, if the co-features' effect on the features is heterogeneous, then the estimation of this effect cannot be separated from the sub-population identification. In this article, we propose a Mixture of Conditional GGM (CGGM) that subtracts the heterogeneous effects of the co-features to regroup the data points into clusters corresponding to the sub-populations. We develop a penalised EM algorithm to estimate graph-sparse model parameters. We demonstrate on synthetic and real data how this method fulfils its goal and succeeds in identifying the sub-populations where the Mixtures of GGM are disrupted by the effect of the co-features.
Keywords:
Gaussian Graphical Models, Unlabelled Heterogeneous populations, Conditional GGM, Mixture Models, EM algorithm
1. Introduction
Conditional correlation networks are a popular tool to describe the co-variations between the components of a random vector. Within the Gaussian Graphical Model (GGM) framework, introduced in Dempster (1972), the random vector of interest is modelled as a Gaussian vector $\mathcal{N}(\mu, \Sigma)$, and the conditional correlation network can be recovered from the sparsity of the inverse covariance matrix $\Lambda := \Sigma^{-1}$. In this article, we consider the case of an unlabelled heterogeneous population, in which different sub-populations (or "classes") are described by different networks. Additionally, we take into account the presence of observed co-features (discrete and/or continuous) that have a heterogeneous (class-dependent) impact on the values of the features. The absence of known class labels turns the analysis of the population into an unsupervised problem. As a result, any inference method will have to tackle the problem of cluster discovery in addition to the parameter estimation. This is a crucial task, considering that the relevance of the estimated parameters is entirely dependent on the clusters identified. The co-features, if their effects on the features are substantial, can greatly disrupt the clustering. Indeed, any unsupervised method will then be more likely to identify clusters correlated with the values of the co-features than with the hidden sub-population labels. This occurs frequently when analysing biological or medical features. To provide a simple illustration, if one runs an unsupervised method on an unlabelled population containing both healthy and obese patients, using the body fat percentage as a feature, then the unearthed clusters are very likely to be more correlated with the gender of the patients (a co-feature) than with the actual diagnosis (the hidden variable). Additionally, the fact that the effect of the gender on the average body fat is also dependent on the diagnosis (class-dependent effect) makes the situation even more complex.

Unsupervised GGM have received recent attention, with works such as Gao et al. (2016) and Hao et al. (2017) adapting the popular supervised joint Hierarchical GGM methods of Mohan et al. (2014) and Danaher et al. (2014) to the unsupervised case. When the labels are known in advance, these joint Hierarchical GGM are useful models to estimate several sparse conditional correlation matrices, and they are modular enough to allow for the recovery of many different forms of common structure between classes. However, we argue that they are not designed for efficient cluster identification in the unsupervised scenario, and will very likely miss the hidden variable and find clusters correlated with the most influential co-features instead. This in turn will result in the estimation of irrelevant parameters. Even when there are no pre-existing hidden variables to recover, and the unsupervised method is run "blindly", it is uninteresting to recover clusters describing the values of already known co-features. Instead, one would rather provide the unsupervised method beforehand with the information of the co-features' values and encourage it to recover new information from the data.

In order to take into account the effect of co-features on features, Yin and Li (2011) and Wytock and Kolter (2013) introduced the Conditional Gaussian Graphical Models (CGGM). Within this model, the average effect of the co-features is subtracted from the features, in order to leave only orthogonal effects.
Both Yin and Li (2011) and Sohn and Kim (2012) worked with homogeneous populations, but the Hierarchical form of the CGGM was introduced by Chun et al. (2013) to study labelled heterogeneous populations, with heterogeneous effects of the co-features on the features. Recent works such as Huang et al. (2018) and Ou-Yang et al. (2019) have adapted the state of the art supervised joint Hierarchical GGM methods to the CGGM. However, to the best of our knowledge, there has been no effort to make use of the CGGM in the unsupervised case.

In this article, we introduce a Mixture of Conditional GGM that models the class-dependent effect of the co-features on the features. We propose an Expectation-Maximisation (EM) procedure to estimate this model without prior knowledge of the class labels. This EM algorithm can be regularised with all the structure-inducing penalties introduced for the supervised joint Hierarchical CGGM. Hence, the recovered sparse conditional correlation graphs can present any of the desired forms of common structure. Moreover, with an additional penalty, we can also enforce structure within the parameter describing the relation between co-features and features. Thanks to the inclusion of the co-features within the model, our EM algorithm is able to avoid trivial clusters correlated with the co-features' values, and instead unearths clusters providing new information on the population. Additionally, since our model takes into account heterogeneous effects of the co-features, our EM can handle the more complex scenarios where the co-features act differently on the features in each sub-population.

Another, very different, domain of research, the "Finite Mixture Regression models" (FMR), makes use of models that exhibit some formal similarities with the Mixture of CGGM. The FMR, see DeSarbo and Cron (1988) or Khalili and Chen (2007) for early examples of unpenalised and penalised FMR respectively, consist of several parallel linear regressions between co-features and features, with unlabelled data. The clustering in FMR is focused on identifying different linear models between co-features and features. This is very different from our GGM approach, which seeks to describe the multidimensional feature vector with graphs, and uses the co-features as a tool to improve the clustering within the feature space.

We demonstrate the performance of our method on synthetic and real data. First with a 2-dimensional toy example, where we show the importance of taking into consideration the (heterogeneous) effects of the co-features for the clustering. Then, in higher dimension, we demonstrate that our EM with Mixture of CGGM consistently outperforms, both in terms of classification and parameter reconstruction, the EM with a Mixture of GGM (used in Gao et al. (2016) and Hao et al. (2017)), as well as an improved Mixture of GGM EM that takes into consideration a homogeneous co-feature effect. Finally, on real Alzheimer's Disease data, we show that our method is the best suited to recover clusters correlated with the diagnosis, from both MRI and Cognitive Score features.
2. Supervised Hierarchical GGM and CGGM
In this section, we summarise the whys and wherefores of Gaussian Graphical Modelling: the simple models for homogeneous populations, as well as the hierarchical models for heterogeneous populations. First, we explore the classical Gaussian Graphical Model techniques to describe a vector of features $Y \in \mathbb{R}^p$, then we discuss the Conditional Gaussian Graphical Models implemented in the presence of additional co-features $X \in \mathbb{R}^q$. For every parametric model, we call $\theta$ the full parameter and $p_\theta$ the probability density function. Hence, in the example of a Gaussian model, $\theta = (\mu, \Sigma)$. For hierarchical models with $K$ classes, we will have $K$ parameters $(\theta_1, ..., \theta_K)$.

In the classical GGM analysis introduced by Dempster (1972), the studied features $Y \in \mathbb{R}^p$ are assumed to follow a Multivariate Normal distribution: $Y \sim \mathcal{N}(\mu, \Sigma)$. The average $\mu$ is often ignored and set to 0. With $\Lambda := \Sigma^{-1}$, the resulting distribution is:
$$p_\theta(Y) = (2\pi)^{-p/2}\, |\Lambda|^{1/2} \exp\left(-\frac{1}{2} Y^T \Lambda Y\right). \tag{1}$$
In this case $\theta = \Lambda$. Using the property that $corr(Y_u, Y_v \mid (Y_w)_{w \neq u,v}) = -\frac{\Lambda_{uv}}{\sqrt{\Lambda_{uu}\Lambda_{vv}}}$, the conditional correlation network is obtained from a sparse estimate of the precision (or "inverse-covariance") matrix $\Lambda$. Heterogeneous populations, where a different correlation network may exist for each sub-population (or "class"), can be described with the Hierarchical version of the GGM (1). With $K$ classes, let $\theta := (\theta_1, ..., \theta_K)$ be the class parameters and $z \in [\![1, K]\!]$ the categorical variable corresponding to the class label of the observation $Y$. With $\theta_k := \Lambda_k$ and $z$ known, the Hierarchical density can be written:
$$p_\theta(Y \mid z) = \sum_{k=1}^K \mathbb{1}_{z=k}\, p_{\theta_k}(Y) = \sum_{k=1}^K \mathbb{1}_{z=k}\, (2\pi)^{-p/2} |\Lambda_k|^{1/2} \exp\left(-\frac{1}{2} Y^T \Lambda_k Y\right). \tag{2}$$
Mirroring the famous Graphical LASSO (GLASSO) approach introduced by Yuan and Lin (2007) and Banerjee et al. (2006) for homogeneous populations, many authors have chosen to estimate sparse $\widehat\Lambda_k$ as penalised Maximum Likelihood Estimators (MLE) of $\Lambda_k$. For $i = 1, ..., n$, let $Y^{(i)}$ be independent identically distributed (iid) feature vectors and $z^{(i)}$ their labels. These MLE are computed from the simple convex optimisation problem
$$\hat\theta = \underset{\theta}{\mathrm{argmin}} \; -\frac{1}{n} \sum_{k=1}^K \sum_{i=1}^n \mathbb{1}_{z^{(i)}=k} \ln p_{\theta_k}\left(Y^{(i)}\right) + pen(\theta), \tag{3}$$
where the convex penalty $pen(\theta)$ is usually designed to induce sparsity within each individual $\widehat\Lambda_k$ as well as to enforce a certain common structure between the $\widehat\Lambda_k$. This common structure is a desirable outcome when the different sub-populations are assumed to still retain core similarities. Following in the footsteps of Guo et al. (2011), most authors propose such a joint estimation of the matrices $\Lambda_k$. In the case of the penalised MLE estimation (3), the form of the resulting common structure depends on the penalty. For instance, Danaher et al. (2014) propose the "Fused Graphical LASSO" and "Group Graphical LASSO" penalties, which encourage shared values and shared sparsity patterns across the different $\Lambda_k$ respectively. Likewise, Yang et al. (2015) propose another fused penalty to incentivise common values across matrices. With their node-based penalties, Mohan et al. (2014) can encourage the recovery of common hubs in the graphs.
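As a point of reference for the homogeneous, single-class case (no co-features, no common-structure penalty), a sparse precision matrix can be obtained with an off-the-shelf GLASSO solver. The snippet below is a minimal sketch using scikit-learn; the joint penalties of Danaher et al. (2014) and Mohan et al. (2014) require dedicated solvers and are not covered by this single call. The variable names and the penalty level `alpha` are illustrative.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 10))          # n = 200 observations of a p = 10 feature vector

# l1-penalised MLE of the precision matrix Lambda (homogeneous GGM, single class)
model = GraphicalLasso(alpha=0.1).fit(Y)
Lambda_hat = model.precision_

# edges of the conditional correlation graph = non-zero off-diagonal entries of Lambda_hat
edges = np.argwhere((np.abs(Lambda_hat) > 1e-8) & ~np.eye(10, dtype=bool))
print(edges)
```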
Remark 1
Within a hierarchical model, one can also take $\theta_k := (\mu_k, \Lambda_k)$ and adapt $p_{\theta_k}(Y)$ accordingly, since it is natural to allow each sub-population to have a different average level $\mu_k$.

In some frameworks, additional variables, noted $X \in \mathbb{R}^q$ and called "co-features" or "co-factors", can be observed alongside the regular features within the Gaussian vector $Y \in \mathbb{R}^p$. In all generality, $X$ can be a mix of finite, discrete and continuous random variables. In the GGM analysis, these co-features are not included as nodes of the estimated conditional correlation graph. Instead, they serve to enrich the conditioning defining each edge: in the new graph, there is an edge between the nodes $Y_u$ and $Y_v$ iff $cov(Y_u, Y_v \mid (Y_w)_{w\neq u,v}, X) \neq 0$. The Conditional Gaussian Graphical Models (CGGM) were introduced by Yin and Li (2011) and Sohn and Kim (2012) in order to properly take into account the effect of $X$ on $Y$ and easily identify the new conditional correlation network in-between the $Y$. They propose a linear effect, expressed by the conditional probability density function (pdf):
$$p_\theta(Y \mid X) = (2\pi)^{-p/2}\, |\Lambda|^{1/2} \exp\left(-\frac{1}{2}\left(Y + \Lambda^{-1}\Theta^T X\right)^T \Lambda \left(Y + \Lambda^{-1}\Theta^T X\right)\right), \tag{4}$$
with $\Theta \in \mathbb{R}^{q\times p}$ and $\theta = \{\Lambda, \Theta\}$. In other words: $Y \mid X \sim \mathcal{N}\left(-\Lambda^{-1}\Theta^T X,\; \Lambda^{-1}\right)$. Two main branches of CGGM exist, depending on whether the pdf of $X$ is also modelled. In this work, we chose to impose no model on $X$. The lack of assumption on the density of $X$ provides far more freedom than the joint Gaussian assumption. In particular, $X$ can have categorical and even deterministic components. This allows us to integrate any observed variables into the model without restriction.

To tackle heterogeneous populations, works such as Chun et al. (2013) have introduced the Hierarchical version of the CGGM pdf:
$$p_\theta(Y \mid X, z) = \sum_{k=1}^K \mathbb{1}_{z=k} \left(\frac{|\Lambda_k|}{(2\pi)^p}\right)^{1/2} \exp\left(-\frac{1}{2}\left(Y + \Lambda_k^{-1}\Theta_k^T X\right)^T \Lambda_k \left(Y + \Lambda_k^{-1}\Theta_k^T X\right)\right). \tag{5}$$
In particular, Huang et al. (2018) have adapted the penalised MLE (3) to the Hierarchical CGGM density for some of the most popular GGM penalties. With an iid sample $(Y^{(i)}, X^{(i)}, z^{(i)})_{i=1}^n$, the corresponding penalised CGGM MLE can be written:
$$\hat\theta = \underset{\theta}{\mathrm{argmin}} \; -\frac{1}{n} \sum_{k=1}^K \sum_{i=1}^n \mathbb{1}_{z^{(i)}=k} \ln p_{\theta_k}\left(Y^{(i)} \mid X^{(i)}\right) + pen(\theta). \tag{6}$$
Remark 2
To include a regular average value for $Y$, independent of the values of $X$, one can simply add a constant component equal to 1 in $X$.
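To make the roles of $\Lambda$ and $\Theta$ in (4) concrete, the following sketch draws from the conditional law $Y \mid X \sim \mathcal{N}(-\Lambda^{-1}\Theta^T X, \Lambda^{-1})$ for given parameter values. It is a toy illustration under assumed parameter values, not part of the estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 3, 2
Lambda = np.array([[2.0, 0.5, 0.0],
                   [0.5, 2.0, 0.5],
                   [0.0, 0.5, 2.0]])          # precision matrix (p x p), assumed values
Theta = np.array([[1.0, 0.0, -1.0],
                  [0.0, 2.0,  0.0]])          # co-feature/feature interaction (q x p), assumed values
Sigma = np.linalg.inv(Lambda)

def sample_y_given_x(X, n=1):
    """Draw Y | X ~ N(-Lambda^{-1} Theta^T X, Lambda^{-1}), the CGGM conditional law (4)."""
    mean = -Sigma @ Theta.T @ X
    return rng.multivariate_normal(mean, Sigma, size=n)

X = np.array([1.0, -0.5])                      # one observed co-feature vector
print(sample_y_given_x(X, n=5))
```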
3. Mixtures of CGGM for unlabelled heterogeneous population
In this section, we tackle the problem of an unlabelled heterogeneous population. We introduce a Mixture of Conditional Gaussian Graphical Models to improve upon the state of the art unsupervised methods by taking into consideration the influential co-features that can drive the clustering. We develop a penalised EM algorithm to both identify data clusters and estimate sparse, structured, model parameters. We justify that our algorithm is usable with a wide array of penalties and provide detailed algorithmic steps for the Group Graphical LASSO (GGL) penalty.
When the labels of a heterogeneous population are missing, supervised parameter estimation methods like (3) have to be replaced by unsupervised approaches that also tackle the problem of cluster discovery. When $z$ is unknown, the Hierarchical model (2) can easily be replaced by a Mixture model with observed likelihood:
$$p_{\theta,\pi}(Y) = \sum_{k=1}^K \pi_k\, p_{\theta_k}(Y), \tag{7}$$
and complete likelihood:
$$p_{\theta,\pi}(Y, z) = \sum_{k=1}^K \mathbb{1}_{z=k}\, \pi_k\, p_{\theta_k}(Y), \tag{8}$$
where $\pi_k := \mathbb{P}(z = k)$ and $\pi := (\pi_1, ..., \pi_K)$. Then, the supervised penalised likelihood maximisation (3) can be adapted into the penalised observed likelihood optimisation:
$$\hat\theta, \hat\pi = \underset{\theta,\pi}{\mathrm{argmin}} \; -\frac{1}{n}\sum_{i=1}^n \ln\left(\sum_{k=1}^K \pi_k\, p_{\theta_k}\left(Y^{(i)}\right)\right) + pen(\theta, \pi). \tag{9}$$
This is a non-convex problem, and authors such as Zhou et al. (2009) and Krishnamurthy (2011) have proposed EM algorithms to find local solutions to (9). They omit however the common structure inducing penalties that are the signature of the supervised joint GGM methods. The works of Gao et al. (2016) and Hao et al. (2017) correct this by proposing EM algorithms that solve (9) for some of the joint-GGM penalties, such as the Fused and Group Graphical LASSO penalties.

By design, the EM algorithm must handle the cluster identification jointly with the mixture parameter estimation. The underlying assumption is that the different sub-populations can be identified as different clusters in the feature space. With real data, and especially medical data, this is generally untrue, as many factors other than the class label can have a larger impact on the position of the data points in the feature space. Even when there are no specific sub-populations to recover, and the EM is run "blindly" in order to observe which data points are more naturally grouped together by the method, the unearthed clusters have every chance to be very correlated with very influential but trivial external variables, such as the age group or the gender. In order to guide the cluster discovery of the EM algorithm, we propose a Mixture of Conditional Gaussian Graphical Models with which the overbearing effect of trivial external variables can be removed. By placing all external observed variables into $X$, we define the Mixture of CGGM with its observed likelihood:
$$p_{\theta,\pi}(Y \mid X) := \sum_{k=1}^K \pi_k\, p_{\theta_k}(Y \mid X) = \sum_{k=1}^K \pi_k \left(\frac{|\Lambda_k|}{(2\pi)^p}\right)^{1/2} \exp\left(-\frac{1}{2}\left(Y + \Lambda_k^{-1}\Theta_k^T X\right)^T \Lambda_k \left(Y + \Lambda_k^{-1}\Theta_k^T X\right)\right). \tag{10}$$
Within this model, the position of each feature vector $Y$ is corrected by its class-dependent linear prediction by the co-features $X$: $\mathbb{E}[Y \mid X, z=k] = -\Lambda_k^{-1}\Theta_k^T X$. In other words, the "Mixture of Gaussians" type clustering is done on the residual vector $Y - \mathbb{E}[Y \mid X, z=k] = Y + \Lambda_k^{-1}\Theta_k^T X$. Hence, even if the co-features $X$ have a class-dependent impact on the average level of the features $Y$, the Mixture of CGGM model is still able to regroup in the feature space the observations $Y^{(i)}$ that belong to the same class, $z^{(i)} = k$. We illustrate this dynamic in section 4.1.

Like the previous works on joint-GGM estimation, our goal is to estimate the parameters of model (10) with sparse inverse-covariance matrices $\Lambda_k$ and common structure across classes. Sparsity in the matrices $\Theta_k$ is also desirable for the sake of interpretation. Hence, we define the following penalised Maximum Likelihood problem:
$$\hat\theta, \hat\pi = \underset{\theta,\pi}{\mathrm{argmin}} \; -\frac{1}{n}\sum_{i=1}^n \ln\left(\sum_{k=1}^K \pi_k\, p_{\theta_k}\left(Y^{(i)} \mid X^{(i)}\right)\right) + pen(\theta, \pi). \tag{11}$$
As with (9), this is a non-convex problem, and we define an EM algorithm to find local minima of the optimised function.

In this section, we provide the detailed steps of a penalised EM algorithm to find local solutions of the non-convex penalised MLE (11), in order to estimate the parameters of the mixture model (10) with inverse-covariance sparsity as well as common structure. First we provide the different steps of the algorithm and justify that it can be run with a wide array of penalty functions. Then, we provide a detailed optimisation scheme for the Group Graphical Lasso (GGL) penalty specifically.
EM algorithm for Mixtures of CGGM.
With $n$ fixed $\{X^{(i)}\}_{i=1}^n$ and $n$ iid observations $\{Y^{(i)}\}_{i=1}^n$ following the mixture density $p_{\theta,\pi}(Y \mid X)$ given in (10), the penalised observed negative log-likelihood to optimise is:
$$-\frac{1}{n}\sum_{i=1}^n \ln\left(\sum_{k=1}^K \pi_k\, p_{\theta_k}\left(Y^{(i)} \mid X^{(i)}\right)\right) + pen(\theta, \pi). \tag{12}$$
We will not redo here all the calculations for the EM applied to a mixture. In the end, we get an iterative procedure updating the current parameter $(\theta^{(t)}, \pi^{(t)})$ with two steps. The Expectation (E) step is:
$$p_{i,k}^{(t)} := \mathbb{P}_{\theta^{(t)},\pi^{(t)}}\left(z^{(i)} = k \mid Y^{(i)}, X^{(i)}\right) = \frac{p_{\theta_k^{(t)}}\left(Y^{(i)} \mid X^{(i)}\right)\, \pi_k^{(t)}}{\sum_{l=1}^K p_{\theta_l^{(t)}}\left(Y^{(i)} \mid X^{(i)}\right)\, \pi_l^{(t)}}.$$
More explicitly, by replacing $p_{\theta_k}(Y \mid X)$ with its formula (4), and with the parameters taken at their current values $(\Lambda_k^{(t)}, \Theta_k^{(t)})$:
$$(E)\quad p_{i,k}^{(t)} = \frac{\pi_k^{(t)}\, |\Lambda_k|^{1/2} \exp\left(-\frac{1}{2}\left(Y^{(i)} + \Lambda_k^{-1}\Theta_k^T X^{(i)}\right)^T \Lambda_k \left(Y^{(i)} + \Lambda_k^{-1}\Theta_k^T X^{(i)}\right)\right)}{\sum_{l=1}^K \pi_l^{(t)}\, |\Lambda_l|^{1/2} \exp\left(-\frac{1}{2}\left(Y^{(i)} + \Lambda_l^{-1}\Theta_l^T X^{(i)}\right)^T \Lambda_l \left(Y^{(i)} + \Lambda_l^{-1}\Theta_l^T X^{(i)}\right)\right)}. \tag{13}$$
The M step is:
$$\theta^{(t+1)}, \pi^{(t+1)} = \underset{\theta,\pi}{\mathrm{argmin}} \; -\frac{1}{n}\sum_{k=1}^K \sum_{i=1}^n p_{i,k}^{(t)}\left(\ln p_{\theta_k}\left(Y^{(i)} \mid X^{(i)}\right) + \ln \pi_k\right) + pen(\theta, \pi).$$
Assuming that there is no coupling between $\pi$ and $\theta$ in the penalty, i.e. $pen(\pi, \theta) = pen_\pi(\pi) + pen_\theta(\theta)$, the two optimisations can be separated:
$$\theta^{(t+1)} = \underset{\theta}{\mathrm{argmin}} \; -\frac{1}{n}\sum_{k=1}^K \sum_{i=1}^n p_{i,k}^{(t)} \ln p_{\theta_k}\left(Y^{(i)} \mid X^{(i)}\right) + pen_\theta(\theta), \qquad \pi^{(t+1)} = \underset{\pi}{\mathrm{argmin}} \; -\frac{1}{n}\sum_{k=1}^K \sum_{i=1}^n p_{i,k}^{(t)} \ln \pi_k + pen_\pi(\pi).$$
Let us denote the sufficient statistics $n_k^{(t)} := \sum_{i=1}^n p_{i,k}^{(t)}$, $S_{YY}^{k,(t)} := \frac{1}{n}\sum_{i=1}^n p_{i,k}^{(t)} Y^{(i)} Y^{(i)T}$, $S_{YX}^{k,(t)} := \frac{1}{n}\sum_{i=1}^n p_{i,k}^{(t)} Y^{(i)} X^{(i)T}$ and $S_{XX}^{k,(t)} := \frac{1}{n}\sum_{i=1}^n p_{i,k}^{(t)} X^{(i)} X^{(i)T}$. Then, the M step can be formulated as:
$$(M)\quad \theta^{(t+1)} = \underset{\theta}{\mathrm{argmin}} \sum_{k=1}^K \left(\frac{1}{2}\left\langle \Lambda_k, S_{YY}^{k,(t)}\right\rangle + \left\langle \Theta_k^T, S_{YX}^{k,(t)}\right\rangle + \frac{1}{2}\left\langle \Theta_k \Lambda_k^{-1}\Theta_k^T, S_{XX}^{k,(t)}\right\rangle - \frac{n_k^{(t)}}{2n}\ln|\Lambda_k|\right) + pen_\theta(\theta), \qquad \pi^{(t+1)} = \underset{\pi}{\mathrm{argmin}} \; -\sum_{k=1}^K \frac{n_k^{(t)}}{n}\ln \pi_k + pen_\pi(\pi). \tag{14}$$
The E step in Eq (13) is in closed form. With any reasonable penalty $pen_\pi$, the optimisation on the class weights $\pi$ in Eq (14) will be trivial, and most likely in closed form as well. The update of $\theta$ in the M step (14) takes exactly the same form as the supervised penalised MLE of Eq (6), see Huang et al. (2018) for the explicit supervised CGGM formulation. As a result, as long as the supervised case (6) is solved, the M step is tractable as well. In their work on joint Hierarchical CGGM, Huang et al. (2018) show that the supervised negative log-likelihood is a convex function of $\theta$. As a consequence, the problem (14) is solvable for a very wide array of penalties $pen_\theta$, in particular all the convex differentiable penalties.

In order to provide an algorithm with more specific and detailed steps, we consider in the rest of the section the special case of the GGL penalty. The GGL penalty was noticeably used in the supervised case by Huang et al. (2018), who proposed a proximal gradient algorithm. Likewise, we can use a proximal gradient algorithm to compute the M step (14) of our EM algorithm.
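The E step (13) only requires evaluating the $K$ conditional Gaussian densities at every data point. Below is a minimal sketch, computed in the log domain for numerical stability; `Lambdas`, `Thetas` and `pis` denote the current parameters $(\Lambda_k^{(t)}, \Theta_k^{(t)}, \pi_k^{(t)})$ and the variable names are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def e_step(Y, X, Lambdas, Thetas, pis):
    """Responsibilities p_{i,k} = P(z_i = k | Y_i, X_i) for the Mixture of CGGM (E step (13))."""
    n, p = Y.shape
    K = len(pis)
    log_resp = np.empty((n, K))
    for k in range(K):
        Lk, Tk = Lambdas[k], Thetas[k]
        _, logdet = np.linalg.slogdet(Lk)
        # residual after the class-dependent correction: Y + Lambda_k^{-1} Theta_k^T X
        R = Y + X @ Tk @ np.linalg.inv(Lk)
        quad = np.einsum("ij,jk,ik->i", R, Lk, R)        # R_i^T Lambda_k R_i for each i
        log_resp[:, k] = np.log(pis[k]) + 0.5 * logdet - 0.5 * p * np.log(2 * np.pi) - 0.5 * quad
    log_resp -= logsumexp(log_resp, axis=1, keepdims=True)
    return np.exp(log_resp)                              # shape (n, K), rows sum to one
```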
Proximal gradient algorithm to solve the M step with the GGL penalty.
The Group Graphical Lasso (GGL) penalty, introduced in Danaher et al. (2014) and adapted to the hierarchical CGGM by Huang et al. (2018), can be written:
$$pen_\theta(\theta) := \sum_{1\le i\ne j\le p}\left(\lambda_1^\Lambda \sum_{k=1}^K \left|\Lambda_k^{(ij)}\right| + \lambda_2^\Lambda \sqrt{\sum_{k=1}^K \left(\Lambda_k^{(ij)}\right)^2}\right) + \sum_{(i,j)\in[\![1,q]\!]\times[\![1,p]\!]}\left(\lambda_1^\Theta \sum_{k=1}^K \left|\Theta_k^{(ij)}\right| + \lambda_2^\Theta \sqrt{\sum_{k=1}^K \left(\Theta_k^{(ij)}\right)^2}\right). \tag{15}$$
Unlike in Huang et al. (2018), where $\lambda_1^\Lambda = \lambda_1^\Theta$ and $\lambda_2^\Lambda = \lambda_2^\Theta$, we use different levels of penalisation for the parameters $\Lambda$ and $\Theta$, since both their scales and their desired sparsity levels can be very different. This penalty borrows its design from the Group Lasso, see Yuan and Lin (2006), where the $l_1$ norm induces individual sparsity of each coefficient, and the $l_2$ norm induces simultaneous sparsity of groups of coefficients. In Eq. (15), for each pair $(i,j)$ belonging to the relevant space, $\{\Lambda_k^{(ij)}\}_{k=1}^K$ constitutes a group that can be entirely put to 0. This incites the algorithm to set a given matrix coefficient to 0 over all $K$ classes. These common zeros constitute the common structure sought after by the GGL approach. In our CGGM case, the same can be said for the group $\{\Theta_k^{(ij)}\}_{k=1}^K$. Regarding the theoretical analysis, we underline that the $l_2$ part of the penalty is not separable into a sum of $K$ different penalties, which forces a joint optimisation problem to be solved, even in the supervised framework.

We detail here how to solve the M step (14) with $pen_\theta(\theta)$ defined as in Eq (15). We assume, as usual, that the optimisation in $\pi$ is both independent from the optimisation in $\theta = \{\Lambda_k, \Theta_k\}_{k=1}^K$ and trivial. The function to minimise in $\theta$ at the M step is:
$$f(\theta) := \sum_{k=1}^K \left(-\frac{n_k^{(t)}}{2n}\ln|\Lambda_k| + \frac{1}{2}\left\langle \Lambda_k, S_{YY}^{k,(t)}\right\rangle + \left\langle \Theta_k^T, S_{YX}^{k,(t)}\right\rangle + \frac{1}{2}\left\langle \Theta_k \Lambda_k^{-1}\Theta_k^T, S_{XX}^{k,(t)}\right\rangle\right) + pen_\theta(\theta).$$
As shown in Huang et al. (2018), this function is convex, infinite on the border of its set of definition, and has a unique global minimum. We write $f(\theta) =: g(\theta) + pen_\theta(\theta)$ for the sake of simplicity. The proximal gradient algorithm, see Combettes and Pesquet (2011), is an iterative method based on a quadratic approximation of $g(\theta)$. If $\theta^{(s-1)}$ is the current state of the parameter within the proximal gradient iterations, then the next state, $\theta^{(s)}$, is found by optimising the approximation:
$$f\left(\theta^{(s)}\right) = f\left(\theta^{(s-1)} + \theta^{(s)} - \theta^{(s-1)}\right) \approx g\left(\theta^{(s-1)}\right) + \nabla g\left(\theta^{(s-1)}\right)^T \left(\theta^{(s)} - \theta^{(s-1)}\right) + \frac{1}{2\alpha}\left\|\theta^{(s)} - \theta^{(s-1)}\right\|_2^2 + pen_\theta\left(\theta^{(s)}\right) \equiv \frac{1}{2\alpha}\left\|\theta^{(s)} - \left(\theta^{(s-1)} - \alpha \nabla g\left(\theta^{(s-1)}\right)\right)\right\|_2^2 + pen_\theta\left(\theta^{(s)}\right), \tag{16}$$
where we removed in the last line the constants irrelevant to the optimisation in $\theta^{(s)}$, and where $\alpha$ denotes the step size of the gradient descent. Note that we use the exponent $(s)$ to indicate the current stage of the proximal gradient iteration, to avoid confusion with the exponent $(t)$ used for the EM iterations (which are one level above).
We underline that, in addition to $g(\theta)$ itself, the second order term in the Taylor expansion of $g(\theta)$ is also approximated. Using $\frac{1}{2\alpha}\left\|\theta^{(s)} - \theta^{(s-1)}\right\|_2^2$ instead of $\frac{1}{2}\left(\theta^{(s)} - \theta^{(s-1)}\right)^T H_g\left(\theta^{(s-1)}\right)\left(\theta^{(s)} - \theta^{(s-1)}\right)$ spares us from computing the Hessian $H_g\left(\theta^{(s-1)}\right)$ and simplifies the calculations to come. The approximated formulation in Eq (16) leads to the definition of the proximal optimisation problem:
$$prox_\alpha(x) := \underset{\theta}{\mathrm{argmin}} \; \frac{1}{2\alpha}\left\|\theta - x\right\|_2^2 + pen_\theta(\theta), \tag{17}$$
so that the proximal gradient step can be written:
$$\theta^{(s)} = prox_{\alpha_s}\left(\theta^{(s-1)} - \alpha_s \nabla g\left(\theta^{(s-1)}\right)\right), \tag{18}$$
where the step size $\alpha_s$ is determined by line search. The usual proximal gradient heuristic is to take an initial step size $\alpha_0$ and a coefficient $\beta \in ]0, 1[$, and to shrink $\alpha \leftarrow \beta\alpha$ as long as:
$$g\left(\theta^{(s-1)} - \alpha G_\alpha\left(\theta^{(s-1)}\right)\right) > g\left(\theta^{(s-1)}\right) - \alpha \nabla g\left(\theta^{(s-1)}\right)^T G_\alpha\left(\theta^{(s-1)}\right) + \frac{\alpha}{2}\left\|G_\alpha\left(\theta^{(s-1)}\right)\right\|_2^2,$$
with $G_\alpha\left(\theta^{(s-1)}\right) := \frac{\theta^{(s-1)} - prox_\alpha\left(\theta^{(s-1)} - \alpha\nabla g\left(\theta^{(s-1)}\right)\right)}{\alpha}$ the generalised gradient.

To apply the proximal gradient algorithm, we need to be able to solve the proximal problem (17) with the CGGM likelihood and the GGL penalty. Thankfully, Danaher et al. (2014) found an explicit solution to this problem in the GGM case, which Huang et al. (2018) adapted to the CGGM. The proximal optimisation is separable in $\Lambda$ and $\Theta$, and the solutions $\Lambda^{(prox)}$ and $\Theta^{(prox)}$ share the same formula. As a result, we use $D$ as a placeholder name for either $\Lambda$ or $\Theta$, i.e. depending on the context either $D_k^{ij} = \Lambda_k^{ij}$ or $D_k^{ij} = \Theta_k^{ij}$. Let $S$ be the soft thresholding operator, $S(x, \lambda) := sign(x)\max(|x| - \lambda, 0)$, and let $\widetilde{D}_{k,\alpha}^{ij} := D_k^{ij,(s-1)} - \alpha \frac{\partial g}{\partial D_k^{ij}}\left(\theta^{(s-1)}\right)$. The solution of (17), with $x = \theta^{(s-1)} - \alpha\nabla g\left(\theta^{(s-1)}\right)$, is given coefficient by coefficient by:
$$D_k^{ij,(prox)} = S\left(\widetilde{D}_{k,\alpha}^{ij}, \lambda_1^D \alpha\right)\max\left(1 - \frac{\lambda_2^D \alpha}{\sqrt{\sum_k S\left(\widetilde{D}_{k,\alpha}^{ij}, \lambda_1^D \alpha\right)^2}}, 0\right). \tag{19}$$
Note that the partial derivatives $\frac{\partial g}{\partial D_k^{ij}}\left(\theta^{(s-1)}\right)$, necessary to get $\widetilde{D}_{k,\alpha}^{ij}$, are easily calculated in closed form from the likelihood formula. With the proximal problem (17) and the line search easily solvable, the proximal gradient steps can be iterated until convergence to find the global minimum of $f(\theta)$. With $f(\theta)$ optimised, the M step (14) is solved.
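The coefficient-wise solution (19) is simple enough to implement directly. Below is a small sketch of the corresponding proximal operator applied to a stack of $K$ matrices (either the $\Lambda_k$ or the $\Theta_k$); `lam1`, `lam2` and `alpha` play the roles of $\lambda_1^D$, $\lambda_2^D$ and the step size, and the function names are illustrative. In practice the diagonal entries of the $\Lambda_k$ are usually left unpenalised, which this sketch does not handle.

```python
import numpy as np

def soft_threshold(x, lam):
    """S(x, lam) = sign(x) * max(|x| - lam, 0), applied element-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ggl_prox(D_tilde, lam1, lam2, alpha):
    """Group Graphical Lasso proximal step (19).

    D_tilde : array of shape (K, p, p) or (K, q, p), the gradient step
              D^{(s-1)} - alpha * grad g(theta^{(s-1)}) stacked over the K classes.
    """
    S = soft_threshold(D_tilde, lam1 * alpha)
    group_norm = np.sqrt((S ** 2).sum(axis=0, keepdims=True))      # l2 norm of each (i, j) group over k
    shrink = np.maximum(1.0 - (lam2 * alpha) / np.maximum(group_norm, 1e-12), 0.0)
    return S * shrink                                              # the whole group is zeroed when shrink = 0

# toy usage on a stack of K = 3 random 5 x 5 matrices
D = np.random.default_rng(2).standard_normal((3, 5, 5))
print(ggl_prox(D, lam1=0.2, lam2=0.3, alpha=1.0).round(2))
```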
4. Experiments
In this section, we demonstrate the performance of our EM with Mixture of CGGM. First on a visual toy example in two dimensions, then on a higher dimensional synthetic example, and finally on real Alzheimer's Disease data. We compare the Mixture of CGGM to the regular Mixture of GGM, which ignores co-features, and to a Mixture of GGM that assumes a uniform linear effect of the co-features on the features.
In this section, we present a simple visual example to illustrate the importance of taking into account heterogeneous co-feature effects. We show that even with a single binary co-feature, and with low dimensional features, the state of the art unsupervised GGM techniques are greatly disrupted by the co-features, whereas our EM with Mixture of CGGM (which we call "Conditional EM" or "C-EM") achieves near perfect classification.

Under the Mixture of Gaussians (MoG) model, the observed data, $Y \sim \sum_{k=1}^K \pi_k \mathcal{N}(\mu_k, \Sigma_k)$, belongs to $K$ classes which can directly be represented as $K$ clusters in the feature space $\mathbb{R}^p$, each cluster being centred around a centroid at position $\mu_k$ and with an ellipsoid shape described by $\Sigma_k$. However, when there exist conditioning variables $X \in \mathbb{R}^q$ that have an effect on $Y$, this geometric description becomes more complex. Typically, the value of $Y$ could depend linearly on the value of $X$, with $\mathbb{E}[Y \mid X, z=k] = \beta_k^T X$ for some $\beta_k \in \mathbb{R}^{q\times p}$. In this case, the average position in class $k$ is not a fixed $\mu_k$ but a function of $X$. If $X$ contains categorical variables, this creates as many different centroid positions as there are possible category combinations in $X$. The number of these de facto clusters increases geometrically with the dimension $q$, which deters from simply running a clustering method with an increased number of clusters $K'$ to identify all of them. Moreover, if $X$ contains continuous variables, there is a continuum of positions for the centroid, not a finite number of de facto clusters. If $X$ mixes the two types of variables, the two effects coexist. This shatters any hope to run a traditional MoG-based EM clustering algorithm, since its success is heavily dependent on its ability to identify correctly the $K$ distinct cluster centroids $\mu_k$.

Since the $X$ are observed, a possible solution is to run the linear regression $\widehat{Y} = \hat\beta X$ beforehand, and run the EM algorithm on the residual $Y - \widehat{Y}$ to remove the effect of $X$. This is what we call the "residual EM" or "residual Mixture of GGM". However, this does not take into account the fact that this effect can be different for each class $k$: $\beta_1 \neq \beta_2 \neq ... \neq \beta_K$. Since the label is not known beforehand in the unsupervised context, the linear regression $\widehat{Y} = \hat\beta X$ can only be run on all the data indiscriminately, and hence is insufficient in general. On the other hand, the hierarchical CGGM (5), which verifies $\mathbb{E}[Y \mid X, z=k] = -\Lambda_k^{-1}\Theta_k^T X$, is designed to capture heterogeneous co-feature effects. We design a simple experiment to substantiate this intuition.

In this example, $Y \in \mathbb{R}^2$, $X \in \{-1, 1\}$ and $z \in \{1, 2\}$. $Y \mid X, z$ follows the hierarchical conditional model (5). In this simple case, this can be written as $Y = (\beta_1 X + \epsilon_1)\mathbb{1}_{z=1} + (\beta_2 X + \epsilon_2)\mathbb{1}_{z=2}$, with $\epsilon_1 \sim \mathcal{N}(0, \Lambda_1^{-1})$ and $\epsilon_2 \sim \mathcal{N}(0, \Lambda_2^{-1})$. A typical iid data sample $(Y^{(i)})_{i=1}^n$ is represented on the left sub-figure of Figure 1. The hidden variable $z$ is represented by the colour (blue or orange). The observable co-feature $X$ is represented by the shape of the data point (dot or cross). It is clear from the figure that a Mixture of Gaussians model with $K = 2$ cannot properly separate the blue and orange points into two clusters. Indeed, on the right sub-figure of Figure 1, we observe the final state of an EM that fits a Mixture of Gaussians on $Y$. The two recovered clusters are more correlated with the co-feature $X$ than with the hidden variable $z$.
However, this method did not take advantage of the knowledge of the co-feature $X$. As previously mentioned, the most one could do is first subtract the effect of $X$ from $Y$ before running the EM. On the left sub-figure of Figure 2, we represent the residual data $\widetilde{Y} := Y - \hat\beta X$, where $\hat\beta$ is the Ordinary Least Squares estimator of the linear regression between $X$ and $Y$ over the whole dataset (with balanced classes, $\hat\beta$ is approximately the average of $\beta_1$ and $\beta_2$ when $n$ is large enough). Since the linear effect between $X$ and $Y$ is not uniform over the dataset, but class dependent, the correction is imperfect, and the two class clusters remain hardly separable. This is why the residual EM, which fits a Mixture of GGM on $\widetilde{Y}$, is also expected to fail to identify clusters related to the hidden variable. This is shown on the right sub-figure of Figure 2, where we see a typical final state of the residual EM.

On the leftmost sub-figure of Figure 3, we display the proper correction for the co-features' effect, $\widetilde{Y}' = Y - \beta_1 X \mathbb{1}_{z=1} - \beta_2 X \mathbb{1}_{z=2} = \epsilon_1 \mathbb{1}_{z=1} + \epsilon_2 \mathbb{1}_{z=2}$. Under this form, a Mixture of Gaussians can separate the data by colour. This is precisely the kind of translation that each data point undergoes within a Hierarchical CGGM. Hence a Mixture of CGGM can succeed in identifying the hidden variable $z$, provided that it estimates the model parameters correctly. To illustrate this point, the two next sub-figures in Figure 3 represent the same final state of the EM fitting a Mixture of CGGM on $Y$. The middle sub-figure represents $\widetilde{Y}'$ as well as the two estimated centered distributions $\mathcal{N}(0, \widehat\Lambda_k^{-1})$ for $k = 1, 2$. We can see the two clearly identified clusters after removing the effect of $X$. The rightmost sub-figure represents the original data $Y$ as well as the four estimated distributions $\mathcal{N}(\pm\widehat\Sigma_k\widehat\Theta_k^T, \widehat\Lambda_k^{-1})$ for $k = 1, 2$. The four de facto clusters present in the data $Y$ before removing the effect of $X$ are well estimated by the method.

We confirm these illustrative results by running several simulations. We generate 50 datasets with $n = 500$ data points each. For each simulation, we make 10 random initialisations from which we run the three EMs: with GGM, residualised GGM or CGGM. Table 1 summarises the results. We follow the errors made by the estimated class probabilities or "soft labels", $\widehat{\mathbb{P}}(z_i = k)$, which we call the "soft misclassification error", as well as the errors made by the "hard labels", $\mathbb{1}_{\hat z_i = k}$, which we call the "hard misclassification error". They can be expressed as $\frac{1}{n}\sum_{i,k}\left|\mathbb{1}_{z_i = k} - \widehat{\mathbb{P}}(z_i = k)\right|$ and $\frac{1}{n}\sum_{i,k}\left|\mathbb{1}_{z_i = k} - \mathbb{1}_{\hat z_i = k}\right|$ respectively. We see that the Mixture of CGGM performs much better, with less than 10% of misclassification on average, while the two GGM methods are both above 40% of error, fairly close to the level of a random uniform classifier, 50%.
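For reference, the kind of sample used in this toy example can be generated as follows; this is a minimal sketch with assumed values for $\beta_1$, $\beta_2$, $\Lambda_1^{-1}$ and $\Lambda_2^{-1}$ (the exact values used in our simulations may differ).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
z = rng.integers(1, 3, size=n)                       # hidden class label in {1, 2}
X = rng.choice([-1.0, 1.0], size=n)                  # observed binary co-feature

# class-dependent linear effect of X on Y (illustrative values)
beta = {1: np.array([8.0, 8.0]), 2: np.array([8.0, -8.0])}
cov = {1: np.array([[4.0, 1.5], [1.5, 2.0]]),        # Lambda_1^{-1}
       2: np.array([[2.0, -1.0], [-1.0, 4.0]])}      # Lambda_2^{-1}

Y = np.stack([beta[k] * x + rng.multivariate_normal(np.zeros(2), cov[k])
              for k, x in zip(z, X)])
# Y mixes four de facto clusters (two classes x two values of X) in the 2D feature space
```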
Figure 1: (Left) Observed data $Y$ in the 2D space. The observed conditioning variable $X$ is binary: data points with $X = -1$ and $X = 1$ are represented with different markers (crosses and dots). In addition, there is an unknown "class" variable $z$: class 1 is in blue, class 2 in orange. $Y \mid X, z$ follows the hierarchical conditional model. As a result, the two classes (orange and blue) are hard to separate into two clusters. (Right) Typical clusters estimated by an EM that fits a GGM mixture on $Y$.
Figure 2: (Left) Residual data $\widetilde{Y} = Y - \hat\beta X$ after taking into account the estimated effect of $X$. Since the effect had different intensities on classes 1 and 2, only the average effect was subtracted, and the two classes are still not well separated. (Right) Typical clusters estimated by the "residual EM", which fits a GGM mixture on $\widetilde{Y}$.
Figure 3: (Left) Observations $\widetilde{Y}' = Y - \beta_1 X \mathbb{1}_{z=1} - \beta_2 X \mathbb{1}_{z=2}$ exactly corrected for the class-dependent effect of $X$. In this state the two classes appear as two distinct clusters. The Conditional-EM is designed to transform the data in this manner. (Middle) One possible representation of the C-EM results: the corrected observations $\widetilde{Y}'$ are displayed alongside centered normal distributions with the two estimated covariance matrices, $\mathcal{N}(0, \widehat\Lambda_k^{-1})$. (Right) Another possible representation of the same C-EM results: the original observations $Y$ are displayed, alongside the four de facto estimated distributions $\mathcal{N}(\pm\widehat\Sigma_k\widehat\Theta_k^T, \widehat\Lambda_k^{-1})$.

Table 1: Average and standard deviation of the misclassification error achieved on the 2-dimensional example with the EMs on the Mixture of GGM, the Mixture of GGM with residualised data, and the Mixture of CGGM. The two GGM methods are close to the threshold of random classification (0.50), while the Mixture of CGGM is on average below 10% of error.

                      EM GGM    EM resid. GGM    EM CGGM
soft misclassif.        —             —          0.08 (0.17)
hard misclassif.        —             —          0.07 (0.17)

In this section, we perform a quantitative analysis of the algorithms in a higher dimensional framework, where the matrix parameters $\Lambda$ and $\Theta$ are more naturally interpreted as sparse networks. We confirm that the Mixture of Conditional Gaussian Graphical Models is better suited to take into account the heterogeneous effects of co-features on the graph. For this experiment, the observed data follows a mixture model with $K = 3$ classes. Each class $k$ has the probability weight $\pi_k = \frac{1}{3}$. An observation $(Y, X) \in \mathbb{R}^p \times \mathbb{R}^q$ belonging to the class $k$ is described by the distribution $Y \mid X \sim \mathcal{N}\left(-\Lambda_k^{-1}\Theta_k^T X, \Lambda_k^{-1}\right)$. No model assumptions are made on $X$. In this example, $X$ contains two binary variables, two continuous variables, and a constant variable always equal to 1. The inverse-covariance matrices $\Lambda_k \in \mathbb{R}^{p\times p}$ and the transition matrices $\Theta_k \in \mathbb{R}^{q\times p}$ are both sparse, with $p = 10$ and $q = 5$. We run 20 simulations. A simulation consists of $n = 300$ generated data points. On these data points, we run the compared methods, all initialised with the same random parameters. For all simulations, we make 10 of these runs, each with a different random initialisation. We compare the same three algorithms as in section 4.1: the EM for the Mixture of GGM, the EM for the Mixture of GGM with the average effect of $X$ subtracted, and the EM applied to the Mixture of CGGM. Additionally, we also run the tempered versions of these three EM algorithms.

We follow four metrics to assess the methods' success in terms of cluster recovery and fit with the data: the classification error (both soft and hard label versions), the recovery of the network matrix $\Lambda$, and an "ABC-like" metric. The "ABC-like" metric is meant to assess how well each of the estimated solutions is able to replicate the observed data. Since each solution is the parameter of a probability distribution, at the end of each EM, we generate new data following this proposed distribution. Then, for each synthetic data point, we compute the distance to the closest neighbour among the real data points. These minimal distances constitute our "ABC-like" metric. Finally, we also compute the execution time of each EM, knowing that they all have the same stopping criteria.
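The ABC-like metric is a simple nearest-neighbour computation; below is a minimal sketch, assuming `Y_real` holds the observed data and `Y_synth` the data generated from an estimated model (both are hypothetical array names).

```python
import numpy as np
from scipy.spatial import cKDTree

def abc_like_metric(Y_real, Y_synth):
    """Distance from each synthetic point to its closest real data point."""
    tree = cKDTree(Y_real)
    distances, _ = tree.query(Y_synth, k=1)
    return distances            # one minimal distance per synthetic data point

# example: compare two candidate models by the distribution of these distances, e.g.
# np.median(abc_like_metric(Y_real, Y_synth_cggm)) vs np.median(abc_like_metric(Y_real, Y_synth_ggm))
```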
We represent in Figure 4 the empirical distribution of these four metrics, and we quantify in Table 2 the key statistics (mean, standard deviation and median) that characterise them. With $K = 3$ and balanced classes, a uniform random classifier would guess the wrong label 66.7% of the time. We observe that the two Mixture of GGM methods are dangerously close to this threshold, with more than 50% of hard misclassification. The EM on the Mixture of CGGM (C-EM), on the other hand, achieves a much better classification, with less than 15% of hard misclassification. This demonstrates that, even when faced with a more complex situation, in higher dimension, the Mixture of CGGM is better suited to correct for the effect of the co-features and discover the right clusters of data points. This also underlines once more the importance of allowing a different effect of $X$ on $Y$ in each class. Indeed, the residual Mixture of GGM, which took into account the average effect of $X$ on $Y$ over the entire population, was unable to achieve better performances than the EM that did not even use the co-features $X$. In terms of reconstruction of the observed data by the estimated model (ABC-like metric), we see that the synthetic data points generated from the estimated Mixture of CGGM model have closer nearest neighbours than the data points generated by the other estimated models. In addition to all these observations, the C-EM is also faster than the other two methods, reaching the convergence threshold sooner.

Table 2: Average, standard deviation and median of the four performance metrics over the 30 × 5 runs. The best results are in bold.

              soft misclassif.   hard misclassif.   ABC-like metric   runtimes
GGM             0.56 (0.03)            —                  —               —
resid. GGM           —                 —                  —               —
CGGM                 —                 —                  —               —

Figure 4: Empirical distribution of several performance metrics measured over many simulations. The sample is made of 30 simulations with 5 different initialisations each. Three methods are compared: the EM and EM residual algorithms estimate a Mixture of GGM, while the C-EM algorithm estimates a Mixture of CGGM. The C-EM is much better performing and faster. (Upper left) Soft misclassification error $\left|\mathbb{1}_{z_i=k} - \widehat{\mathbb{P}}(z_i=k)\right|$. (Upper right) Hard misclassification error $\left|\mathbb{1}_{z_i=k} - \mathbb{1}_{\hat z_i=k}\right|$. (Bottom left) ABC-like metric. (Bottom right) Run time.

In addition to the cluster recovery, we can also assess the parameter reconstruction of each method. Since the three clustering methods estimate different parametric models over the data, they do not actually try to estimate the same parameters. Regardless, all the methods still estimate a certain inverse covariance matrix $\Lambda_k$ (conditional or not on the $X$, depending on the model) for each sub-population that they identify. In Table 3, we can check that the $\widehat\Lambda_k$ estimated with the Mixture of CGGM are indeed a much better fit for the real $\Lambda_k$ than the estimated matrices from the other models. This is expected, since the real $\Lambda_k$ actually correspond to the CGGM model. The two metrics followed are the Kullback-Leibler (KL) divergence between the Gaussian distributions $f_{\Lambda_k} \sim \mathcal{N}(0, \Lambda_k^{-1})$ and $f_{\widehat\Lambda_k} \sim \mathcal{N}(0, \widehat\Lambda_k^{-1})$, and the $l_2$ difference given by the Frobenius norm $\left\|\Lambda_k - \widehat\Lambda_k\right\|_F$.

Table 3: Average and standard deviation of the metrics describing the reconstruction of each inverse-covariance matrix $\Lambda_k$. The matrices are consistently better reconstructed with the Mixture of CGGM.

metric                                      class    EM GGM       EM res. GGM    EM CGGM
$KL(f_{\Lambda_k}, f_{\widehat\Lambda_k})$    1      11.0 (3.0)    7.5 (6.8)        —
$\|\Lambda_k - \widehat\Lambda_k\|_F$         —          —             —            —
To illustrate the different levels of success concerning the conditional correlation graph recovery, we display in Figure 5 the conditional correlation matrices (i.e. the conditional correlation graphs with weighted edges) estimated by each method. The three columns of figures correspond to the three sub-populations. The first two rows of figures are the matrices estimated by the two Mixture of GGM methods, with and without residualisation with the co-features. The third row of figures corresponds to the matrices estimated by the Mixture of CGGM. The final row displays the real conditional correlation matrices. We observe that the two Mixtures of GGM recover way too many edges, with no particular fit with the real matrix. By contrast, the matrices from the CGGM Mixture exhibit the proper edge patterns, with few False Positives and False Negatives. This is not an easy feat to achieve, since the method was run from a random initialisation on a totally unsupervised dataset, with heavily translated data points all over the 10-dimensional space. Moreover, the matrices in Figure 5 all result from the inversion of the empirical covariance matrix, which is neither a very geometrical nor a very stable operation.

In Figure 6, we represent the regression parameters $\widehat\Theta_k$ estimated with the Mixture of CGGM alongside the real $\Theta_k$. Once again, we see that the sparsity pattern is very well identified, with no False Positives. Moreover, in this case, there are also almost no False Negatives, and all the edge intensities are correct. This is not a surprise. Indeed, the parameter $\Theta$ plays a huge role in the correct classification of the data, since it serves to define the expected position of each data point in the feature space (playing the part of the "average" parameter in Mixtures of GGM). Hence, a good estimation of $\Theta$ is mandatory to reach a good classification. Since the EM with Mixture of CGGM achieved very good classification results, it was expected that $\Theta$ would be well estimated.

Figure 5: Comparison between the several estimated and the true conditional correlation matrices for each sub-population. The three columns of figures correspond to the three sub-populations. The first two rows of figures are the matrices estimated by the two Mixture of GGM methods, with and without residualisation with the co-features. The third row of figures corresponds to the matrices estimated by the Mixture of CGGM. The final row displays the real conditional correlation matrices. Unlike the two GGM-based methods, the Mixture of CGGM recovers correct edges with very few False Positives.

Figure 6: Reconstruction of the $\Theta_k$ by the EM on the Mixture of CGGM. The three columns of figures correspond to the three sub-populations. Almost all the edges are right, with no False Positives and almost no False Negatives. Moreover, the intensities are also mostly correct.

In this section, we confirm our experimental observations with a real, high dimensional, Alzheimer's Disease dataset. We illustrate that the EM with Mixture of CGGM is better suited to identify clusters correlated with the diagnosis than the Mixture of GGM methods. We bring to light the effect of co-features such as the gender and age on the medical features.

Our dataset is composed of the parameters $\xi, \tau, (w_i)_{i=1,...,30}$ of longitudinal models estimated on real Alzheimer's Disease patients, see Schiratti et al. (2015). In summary, the evolution of several features is followed over time for each patient. The features $i = 1, ..., 10$ correspond to MRI measures of atrophy in different regions of the brain. The features $i = 11, ..., 30$ correspond to Cognitive Scores. The longitudinal model associates to each patient a geodesic trajectory within a Riemannian manifold of the parameter space that fits the patient's own evolution. More specifically, the longitudinal model parameters describe how each patient's trajectory deviates from a specific reference geodesic trajectory. The parameter $\xi$ is the time acceleration of the patient with regards to the reference. The parameter $\tau$ is the time shift, so that a smaller $\tau$ corresponds to a disease which starts earlier. Each $w_i$ describes the space shift of the trajectory with regards to its corresponding feature. They have the same sign convention as $\tau$. With $Y := (\xi, \tau, (w_i)_{i=1,...,30})$ the vector of features, we have $p = 32$. We add three co-features to describe each patient: the gender, the age at the first visit ("age baseline"), and the number of years of education. With the addition of the constant co-feature equal to 1, the vector of co-features is 4-dimensional, $X \in \mathbb{R}^4$. The dataset contains 1400 patients, half of them healthy ("Control" patients), and the other half diagnosed with Alzheimer's Disease, either from the start or after a few visits ("AD" patients). The data is centered and normalised over the entire population.

We run the three algorithms, EM, EM residual and C-EM, on this dataset. In order to check the stability of the results over several different runs, we implement a bootstrap procedure that only uses 70% of the data each time. We generate 10 such bootstrapped datasets. We initialise the algorithms with a KMeans on the $Y^{(i)}$ data points. Since KMeans is not deterministic, we make 5 different runs for each bootstrapped dataset, each starting from a different possible KMeans initialisation. Like previously, for the sake of fairness, the EM and C-EM are always provided with the same initialisation, and the residual EM is initialised with a KMeans on the residual of $Y$ after subtracting the prediction by $X$, a more relevant initialisation for this method. We make all these runs with four different feature sets: first with no space shift variables, $Y = \{\xi, \tau\}$, $p = 2$; then adding only the MRI space shifts, $Y = \{\xi, \tau, (w_i)_{i=1,...,10}\}$, $p = 12$; then only the Cognitive Scores space shifts, $Y = \{\xi, \tau, (w_i)_{i=11,...,30}\}$, $p = 22$; and finally with all the features, $Y = \{\xi, \tau, (w_i)_{i=1,...,30}\}$, $p = 32$. The classification results are summarised in Table 4. With two balanced classes, the classification error of a uniform random classifier is 50%.
On the smallest feature set, $p = 2$, we can see that the discovered clusters are somewhat correlated with the diagnosis, with classification errors below 30%. The Mixture of GGM on the uniformly residualised data and the Mixture of CGGM achieve similar levels of error, and they are both better than the regular Mixture of GGM. When the MRI features are added, all the discovered clusters become more correlated with the diagnosis. The regular Mixture of GGM achieves on average 16% of hard classification error, the residualised Mixture of GGM is at 11% of error, and the Mixture of CGGM even below, at 7%. The results with only the Cognitive Scores are very similar, simply a bit worse for every method. However, when both the MRI and Cognitive Score features are included, the performance of both GGM mixtures decreases, with both higher average error and higher variance. On the other hand, the Mixture of CGGM achieves here its best level of performance. This stability of the Mixture of CGGM's performance as the size of the feature set increases indicates that our model is the best suited to properly identify clusters correlated with the diagnosis in high dimension.

We analyse the estimated Mixture of CGGM parameters on the full feature set, $p = 32$. First, since $\mathbb{E}[Y \mid X, z=k] = -\Lambda_k^{-1}\Theta_k^T X$ in the CGGM, we display in Figure 7 the two estimated $\hat\beta_k := -\widehat\Lambda_k^{-1}\widehat\Theta_k^T$ (averaged over the bootstrap), as well as their difference. They play the role of linear regression coefficients in the model. The last column is the constant coefficient, while the first three are the gender, age baseline and years of education coefficients respectively. Since the data is centered, negative and positive values correspond to below average and above average values respectively. The cluster $k = 1$ is the one very correlated with the Control patients sub-population. Similarly, the cluster $k = 2$ is the one very correlated with the AD patients. The most noticeable differences between the two $\hat\beta_k$ are the constant vectors, which have opposite effects on all features. In particular, the "AD cluster" is very correlated with a high $\xi$, a low $\tau$, an earlier atrophy of the ventricles, as well as high space shifts for the two logical memory tests (immediate and delayed). The exact opposite is true for the "Control cluster". These are the expected effects: a high $\xi$ corresponds to a quickly progressing disease, and a low $\tau$ to an early starting disease. The non-constant linear regression coefficients are also different between the clusters, although these are often differences in intensity and not in sign. In order to visualise the differences in intensity more clearly, we represent on the rightmost sub-figure of Figure 7, with the same conventions, the difference $\hat\beta_2 - \hat\beta_1$. In particular, within the AD cluster, we observe a stronger positive effect of the age at the first visit on the space shifts corresponding to the amygdala, entorhinal cortex, hippocampus and parahippocampus cortex atrophies. On the contrary, there is a stronger positive effect of the education level on all the space shifts of MRI atrophies for the Control patients.
The age at the first visit has a stronger negative impact on the $w_i$ corresponding to the self-reported memory, language and visual spatial capacity scores for the AD patients, and a stronger negative impact on the two logical memory scores for the Control patients.

Finally, we display in Figures 8 and 9 the average conditional correlation graphs estimated for these two clusters by the Mixture of CGGM. Their only noticeable difference is the negative conditional correlation between $\xi$ and $\tau$ in the "Control cluster", which is reversed in the "AD cluster". For the AD patients, this means that a disease that appears later tends to also progress faster, which is in line with medical observations. Apart from this edge, the rest of the connections are almost identical in-between clusters. This suggests that the cluster-dependent prediction $\mathbb{E}_{\hat\theta_k}[Y^{(i)} \mid X^{(i)}, z = k] = -\widehat\Sigma_k\widehat\Theta_k^T X^{(i)}$ takes into account enough of the cluster-specific effects so that the remaining unexplained variance has almost the same form in both clusters. Hence, the conditional correlations pictured in these graphs correspond to very general effects, such as the positive correlations between related cognitive tests or related areas of the cortex. More strikingly, there are no conditional correlations between $\xi$ or $\tau$ and any of the space shifts $w_i$. This has significant medical implications, since it suggests that the earliness ($\tau$) and speed ($\xi$) of the disease are conditionally independent from the succession of degradations that the patient's imagery and cognitive scores undergo. In other words, these graphs support the idea that the disease is the same regardless of whether it appears early or late and progresses fast or slowly.

Table 4: Recovery of the diagnosis labels (AD or Control) with unsupervised methods on real longitudinal data. The three compared methods are the EM, the EM residual (both GGM) and the C-EM (CGGM). Four different feature sets are tried: only $\{\tau, \xi\}$, adding the MRI space shift coefficients $w_i$, adding the Cognitive Score (CS) space shift coefficients $w_i$, and adding both the MRI and CS space shift coefficients. The table presents the average and standard deviation of the misclassification error over 10 bootstrap iterations, with 5 different KMeans initialisations each. The best results are in bold.

                                      EM             EM resid.        C-EM
no CS, no MRI    soft misclassif.   0.31 (0.02)     0.22 (0.03)         —
(p = 2)          hard misclassif.   0.31 (0.03)         —           0.19 (0.01)
only MRI         soft misclassif.   0.15 (0.01)     0.13 (0.01)         —
(p = 12)         hard misclassif.   0.12 (0.01)     0.10 (0.01)         —
only CS          soft misclassif.   0.17 (0.02)     0.15 (0.02)         —
(p = 22)         hard misclassif.   0.14 (0.03)     0.13 (0.03)         —
CS and MRI       soft misclassif.   0.24 (0.09)     0.17 (0.04)         —
(p = 32)         hard misclassif.   0.21 (0.10)     0.15 (0.05)         —
5. Conclusion
We introduced the Mixture of Conditional Gaussian Graphical Models in order to guide the cluster discovery when estimating different Gaussian Graphical Models for an unlabelled heterogeneous population in the presence of co-features. We motivated its usage to deal with the potentially in-homogeneous and class-dependent effects of the co-features on the observed data, which would otherwise disrupt the clustering effort. To estimate our Mixture model, we proposed a penalised EM algorithm ("Conditional EM" or "C-EM") compatible with a wide array of penalties. Moreover, we provided detailed algorithmic steps in the specific case of the popular Group Graphical LASSO penalty. Then, we demonstrated the interest of the method with experiments on synthetic and real data. First, we showed on a toy example, with a 2-dimensional feature space and a 1-dimensional co-feature, that the regular Mixture of GGM methods were inadequate to deal with even the simplest in-homogeneous co-feature effect. We confirmed on a more complex simulation, in higher dimension, that Mixtures of CGGM could identify the clusters in the feature space much better, and recover the actual GGM structure of the data. Finally, we tested all the methods on a real dataset, with longitudinal model parameters describing the evolution of several Alzheimer's Disease patients. We demonstrated that our method was the best at identifying the diagnosis within an unlabelled dataset. We unearthed some in-homogeneous effects of co-features on the longitudinal parameters and recovered the conditional correlation graphs by cluster. These graphs hint at a conditional independence between the earliness and speed of the disease and the order in which the many degradations appear. This hypothesis will be tested in future studies.
Figure 7: Average $\hat{\beta}_k := -\hat{\Sigma}_k \hat{\Theta}_k^T$ over 10 bootstrap samplings of the data. For each bootstrapped dataset, 3 different runs of the C-EM are made, each with a different KMeans initialisation of the labels. (Left) $\hat{\beta}_1$: the cluster k = 1 is always very correlated with the Control patients sub-population (less than 10% deviation). (Middle) $\hat{\beta}_2$: the cluster k = 2 is likewise very correlated with the AD patients. In each figure, the last column is the constant coefficient. The largest inter-cluster differences are between the two constant terms; however, there are some noticeable differences on the other regression coefficients as well. (Right) Average difference between $\hat{\beta}_2$ and $\hat{\beta}_1$ over the 30 bootstrap runs of the C-EM. Here, the differences in intensity between the AD (k = 2) and Control (k = 1) patients are more explicit.

Figure 8: Conditional correlation graph of the estimated cluster most correlated with the “Control” diagnosis. The graph nodes are τ, ξ, the MRI space shift components (precuneus, hippocampus, entorhinal cortex, fusiform gyrus, midtemporal lobe, amygdala, temporal pole, insula, parahippocampus cortex, ventricles) and the cognitive score space shift components (ADAS-Cog, RAVLT, CF, MoCA, memory and attention, memory delayed recall, memory and recognition, GDS, FAQ, NPI, self- and partner-reported memory, language and visual spatial capacity, BNT, LM immediate and delayed recall, DSST).

Figure 9: Conditional correlation graph of the estimated cluster most correlated with the “AD” diagnosis, over the same nodes as in Figure 8.

Acknowledgments
The research leading to these results has received funding from the European Research Council (ERC) under grant agreement No 678304, the European Union's Horizon 2020 research and innovation program under grant agreements No 666992 (EuroPOND) and No 826421 (TVB-Cloud), and the French government under management of Agence Nationale de la Recherche as part of the ”Investissements d'avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and reference ANR-10-IAIHU-06 (IHU-A-ICM).