The Sparse Latent Position Model for nonnegative weighted networks
Riccardo Rastelli ([email protected])
Institute for Statistics and Mathematics, Vienna University of Economics and Business, Austria.
Abstract
This paper introduces a new methodology to analyse bipartite and unipartite networks with nonnegative edge values. The proposed approach combines and adapts a number of ideas from the literature on latent variable network models. The resulting framework is a new type of latent position model which exhibits great flexibility, and is able to capture important features that are generally exhibited by observed networks, such as sparsity and heavy tailed degree distributions. A crucial advantage of the proposed method is that the number of latent dimensions is automatically deduced from the data in one single algorithmic framework. In addition, the model attaches a weight to each of the latent dimensions, hence providing a measure of their relative importance. A fast variational Bayesian algorithm is proposed to estimate the parameters of the model. Finally, applications of the proposed methodology are illustrated on both artificial and real datasets, and comparisons with other existing procedures are provided.
Keywords:
Weighted networks; Latent position models; Sparse finite mixture models; Variational inference; Stochastic blockmodels.
1 Introduction

Statistical models for networks are most often designed for binary interactions between entities. However, in many cases, the edges carry additional information that may provide crucial insights in the analyses. One common situation is that of nonnegative interaction values, where each edge carries a positive real number representing, for example, the intensity or the length of the interaction. This type of scenario is extremely common and it tends to arise in a variety of fields: in biology, microarray data measure the expression level of genes among a set of samples (Brunet et al. 2004); gene co-expression networks characterise the similarities between genes based on their expression level (Zhang and Horvath 2005); and mutualistic networks describe how often a pollinator visits certain plants (Dormann et al. 2008). In neuroscience, nonnegative networks may be used to represent functional connectivity levels between different areas of the brain (Bullmore and Sporns 2009). In finance, the nonnegative values may refer to the liabilities of one financial institution towards others (Gandy and Veraart 2016).

The available literature on the analysis of weighted networks, and, in particular, of nonnegative weighted networks, is limited. This is especially true for the family of Latent Position Models (LPM), which have been widely used in statistical network analyses since the seminal work of Hoff et al. (2002). In these network models, one postulates that the nodes are points embedded in a latent space, and that the propensity to create edges is determined by the Euclidean distance between the two nodes involved. To the best of my knowledge, in the stream of literature that originated from Hoff et al. (2002), the only extension of the original LPM to the analysis of weighted interactions is due to Sewell and Chen (2016).

Another fundamental model used in the statistical analyses of networks is the Stochastic Blockmodel (SBM) of Wang and Wong (1987), or its equivalent bipartite variant, called Latent Blockmodel (LBM, Govaert and Nadif 2003). These two models assume a partition on the set of nodes of the graph, and postulate that edge probabilities are determined by the cluster memberships of the nodes at its extremities. Extensions of the SBM and LBM to the analysis of weighted networks are also few. In this context, two relevant papers are those of Mariadassou et al. (2010) and Aicher et al. (2014). Recent reviews of the LPM, SBM and their extensions may be found in Salter-Townshend et al. (2012), Matias and Robin (2014), and Raftery (2017).

This paper introduces a new modelling approach for the analysis of nonnegative interaction data. The model proposed, called the Sparse Latent Position Model (SLPM), modifies the original LPM to account for nonnegative weighted edges. The model adapts a number of ideas from the literature on blockmodels and finite mixture models: this results in a hybrid framework that exhibits great flexibility, and that is able to elegantly address some of the crucial issues and research questions encountered in LPMs and SBMs.

Similarly to the LPM, the SLPM asserts that nodes are characterised by a K-dimensional vector of coordinates. However, while in the LPM the edge values are determined by the latent Euclidean distances, the SLPM postulates that each dimension gives a separate contribution to the edge values, and that each of these contributions is, in fact, determined by the distance between the two nodes in the corresponding dimension.
Essentially, the SLPM may be interpreted as a finite mixture of K unidimensional LPMs in a weighted network context. This paper illustrates that the structure of the SLPM allows great flexibility, and that it can naturally capture relevant features of the topologies of observed networks, such as sparsity of edges and heavy tailed degree distributions.

In addition, one important facet of the SLPM is the sparsity with respect to the mixture components, inherited from recent works on Gaussian finite mixtures (Malsiner-Walli et al. 2016; Malsiner-Walli et al. 2017) and other types of latent variable models. In a sparse mixture model, one deliberately creates overfitting by considering a relatively large number of latent groups; however, once an appropriate shrinkage prior is defined on the mixing proportions, the superfluous groups are emptied during the estimation. This approach is strongly motivated by the theoretical results of Rousseau and Mengersen (2011), where the authors establish the necessary conditions for the posterior distribution to concentrate around the correct number of groups.

In the SLPM, the sparse representation ensures that the number of mixture components is deduced from the data. Since these groups correspond to the latent dimensions, the dimensionality of the latent space is automatically estimated from the data. Furthermore, the mixing proportions of the mixture acquire a rather interesting interpretation, since they measure how useful each latent dimension is in explaining the observed data.

Regarding the estimation of the SLPM, a variational Bayesian method (Attias 1999) is proposed. Variational approximations have been largely used in the last two decades in a variety of latent variable models (a recent review may be found in Blei et al. 2017). Although, from a theoretical perspective, the implications of the approximation are not yet completely understood, a variety of applications (Blei et al. 2017) have proved that these methods can offer a number of appealing advantages during the estimation process.

Variational approximations have been proposed both for LPM frameworks (Salter-Townshend and Murphy 2013; Gollini and Murphy 2016) and for the SBM and its extensions (Daudin et al. 2008; Airoldi et al. 2008; Latouche 2010; Matias and Miele 2017; Bouveyron et al. 2018). In the SLPM context, the variational Bayesian method is rather convenient, since, differently from sampling-based approaches, it relies on a fast optimisation framework. Hence, it allows one to bypass the non-identifiability issues associated with the cluster membership variables and with the latent positions, while, at the same time, providing a rough approximation of the posterior uncertainty. The variational algorithm described in this paper has been implemented in the R package SparseLPM, which is publicly available from
CRAN (R Core Team 2017).

The paper is organised as follows: in Section 2 the SLPM is formally introduced and studied. The section contains a detailed description of its main properties, and a number of comparisons with other well known models and with the existing literature. Section 3 describes the variational algorithm, and the details regarding the optimisation procedure, including a useful initialisation strategy. Finally, Sections 4 and 5 illustrate applications of the methodology to artificial and real datasets, respectively.
2 The Sparse Latent Position Model

2.1 Model definition

The observed data matrix X has M rows and N columns, and its entries are nonnegative. In a network context, X defines the nonnegative adjacency matrix of a bipartite graph, whereby the nodes are separated into two sets (of cardinalities M and N, respectively), and only the interactions between nodes of different types are allowed. It is important to note that, as a special case, the unipartite weighted graph structure is recovered if X is a square matrix and its rows and columns are indexed with labels from the same set.

The observations X = { x_ij : i = 1, ..., M; j = 1, ..., N } are assumed to be independent realisations from a finite mixture of K exponential densities:

    p(x_ij | λ, θ_ij) = Σ_{k=1}^{K} λ_k f_E(x_ij; θ_ijk)    (1)

where f_E(·; θ) denotes an exponential density with rate θ, and the vector θ_ij = { θ_ij1, ..., θ_ijK } indicates the component-specific rates, for every edge (i, j).

In the SLPM, the K groups of the mixture distribution are re-interpreted as the K latent dimensions of a LPM, where each dimension gives a separate contribution in determining the probabilities of the edges. In particular, the latent variable θ_ijk corresponds to the squared Euclidean distance between i and j in the k-th dimension, i.e.:

    θ_ijk = (U_ik − V_jk)^2    (2)

where U_ik (resp. V_jk) indicates the k-th coordinate of i (resp. j) in the latent space R^K. Since the expectation of an exponential random variable is equal to its inverse rate, (2) guarantees that the generated edge weight is expected to be large (resp. small) if the two nodes are close (resp. far apart) in every latent dimension.

As is common in mixture modelling, an equivalent representation of the same model can be obtained using data augmentation. An allocation variable Z_ij is attached to the edge (i, j), for all 1 ≤ i ≤ M and 1 ≤ j ≤ N, such that:

    Z_ij = k with probability λ_k, for k = 1, ..., K.    (3)

Now, the conditional log-likelihood reads as follows:

    ℓ_X(Z, U, V) = Σ_{i=1}^{M} Σ_{j=1}^{N} Σ_{k=1}^{K} Z_ijk { log[(U_ik − V_jk)^2] − x_ij (U_ik − V_jk)^2 }    (4)

where Z_ijk = 1 if Z_ij = k and Z_ijk = 0 otherwise.

Finally, additional hierarchical levels are considered for the mixing proportions and for the latent positions:

    (λ_1, ..., λ_K) ~ Dirichlet(δ_1, ..., δ_K)
    U_ik ~ Gaussian(0, 1/γ_k)
    V_jk ~ Gaussian(0, 1/γ_k)
    γ_k ~ Gamma(a_k, b_k)    (5)

for all 1 ≤ i ≤ M, 1 ≤ j ≤ N and 1 ≤ k ≤ K. Here, Gaussian refers to the univariate normal density. The dependencies between the model parameters are summarised in the graphical representation of Figure 1.
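To make the generative mechanism in (1)-(5) concrete, the following R lines simulate a small bipartite SLPM network. This is only a minimal sketch: the network sizes, the value of K and the hyperparameter values are arbitrary illustrative choices, not settings taken from the paper.

set.seed(1)
M <- 20; N <- 30; K <- 3              # illustrative sizes and number of latent dimensions
delta <- rep(0.5, K); a <- 1; b <- 1  # illustrative hyperparameters
w <- rgamma(K, shape = delta)         # Dirichlet draw via normalised Gamma variables
lambda <- w / sum(w)                  # mixing proportions
gamma_k <- rgamma(K, shape = a, rate = b)  # dimension-specific precisions
U <- sapply(1:K, function(k) rnorm(M, mean = 0, sd = 1 / sqrt(gamma_k[k])))
V <- sapply(1:K, function(k) rnorm(N, mean = 0, sd = 1 / sqrt(gamma_k[k])))
Z <- matrix(sample(K, M * N, replace = TRUE, prob = lambda), nrow = M, ncol = N)
X <- matrix(0, M, N)
for (i in 1:M) for (j in 1:N) {
  k <- Z[i, j]                        # dimension allocated to edge (i, j)
  X[i, j] <- rexp(1, rate = (U[i, k] - V[j, k])^2)
}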
Figure 1: Graphical summary of the dependencies in the SLPM.

2.2 Interpretation and properties

A flexible mixture model.
Mixtures of exponential distributions satisfy a number of interesting properties, and have been well studied and widely applied in a variety of fields. The reader may find an overview of these properties in Frühwirth-Schnatter (2006) and references therein. In particular, one key property of these models is their flexibility: Bernstein's theorem states that any completely monotone probability density function may be written as a continuous mixture of exponentials (Feldmann and Whitt 1998, Theorem 3.1). In the present paper, this argument motivates the structure proposed in (1), asserting that the SLPM may be particularly appropriate to represent extremely complex network structures, when K is large enough.

Weight and degree distributions.
The SLPM is able to capture relevant features that are usually exhibited by observed networks. Figure 2 uses simulated data to demonstrate that the SLPM can naturally represent highly skewed and extremely heavy tailed weight and degree distributions. In fact, both panels show that the tails of the weight and degree distributions decay to zero more slowly than a power law. This is a desirable feature in many types of applications: for example, the topologies of social networks and financial networks are known to be characterised by the presence of hubs and by core-periphery structures (Barabási and Albert 1999; Newman 2001; Boss et al. 2004).

Figure 2: Log-log scale empirical distributions for the edge weights (left panel) and out-degrees (right panel) of a simulated SLPM network.
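Empirical log-log summaries of this kind are straightforward to produce from a simulated network. The R sketch below continues the simulation given after the model definition; the object X and the binning choices are illustrative assumptions rather than the code used for Figure 2.

w <- as.vector(X)
hist_w <- hist(log(w[w > 0]), breaks = 30, plot = FALSE)
plot(hist_w$mids, log(hist_w$density + 1e-12),
     xlab = "log edge value", ylab = "log proportion")
deg <- rowSums(X)                      # weighted out-degrees of the sender nodes
hist_d <- hist(log(deg), breaks = 15, plot = FALSE)
plot(hist_d$mids, log(hist_d$density + 1e-12),
     xlab = "log degree", ylab = "log proportion")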
Sparsity.

In recent years, sparse finite mixtures have received a lot of attention (see Malsiner-Walli et al. 2016 and references therein) since they are able to mimic the properties displayed by other common models (for example by the Dirichlet process mixture of Quintana and Iglesias 2003, as shown in Frühwirth-Schnatter and Malsiner-Walli 2018), while maintaining a very simple computational framework to perform inference. More importantly, sparse finite mixtures are able to efficiently assess many competing models in one single run of the estimation algorithm. In fact, they do not require grid searches over all the possible values of K, nor trans-dimensional samplers such as the reversible jump of Richardson and Green (1997). This generally leads to substantial gains in terms of computational efficiency.

A sparse finite mixture is a deliberately overfitted clustering model where one specifies a shrinkage prior on the mixture weights and on the component-specific parameters to ensure that superfluous groups are emptied (Malsiner-Walli et al. 2016; Malsiner-Walli et al. 2017). The SLPM adapts this idea in a weighted latent position network model context: overfitting is achieved by considering a large number of mixture components, while imposing a sparse Dirichlet prior distribution on the mixing proportions to ensure that superfluous groups become empty. In practice, such a sparse Dirichlet prior is attained through δ_1 = · · · = δ_K = δ < 1/K, as illustrated by Rousseau and Mengersen (2011).

As a consequence, the SLPM is a type of LPM where a large number of latent dimensions can be considered, hence guaranteeing the theoretical flexibility argued earlier in this section. At the same time, parsimony is attained using the shrinkage prior on the mixing proportions, making sure that only the relevant latent dimensions are actually used. The final product is a LPM where not only the number of latent dimensions is estimated from the data, but also where the dimensions are automatically ranked according to their importance, as indicated by the mixing proportions. This feature becomes very convenient when plotting the latent space: for example, one can select and show only the two most relevant latent dimensions, or the two latent dimensions that characterise the large edge values.

Another aspect of the sparsity regards the edge values: most observed weighted networks exhibit zero-inflated empirical distributions, where the majority of edges are absent or carry a null weight. The SLPM can account for this type of sparsity in a very natural way, as edge weights are shrunk towards zero when generated by latent dimensions which are characterised by a small precision γ_k, for 1 ≤ k ≤ K. In practice, in a fitted SLPM, one or more latent dimensions may be exploited to separate the null weights from the rest, hence obtaining a spike-and-slab type of sparsity.
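The effect of such a shrinkage Dirichlet prior is easy to visualise by simulation. The small R sketch below is an illustration only (the values of K and δ are arbitrary); it draws a vector of mixing proportions from a symmetric Dirichlet with a small concentration parameter and shows that most of the sampled weights are negligible.

K <- 10
delta <- 0.05                   # small concentration parameter, below 1/K
w <- rgamma(K, shape = delta)   # Dirichlet draw via normalised Gamma variables
lambda <- w / sum(w)
round(sort(lambda, decreasing = TRUE), 3)
# typically only one or two components carry a non-negligible weight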
Outliers.

In the data augmented representation of the SLPM, each edge is assigned to and characterised by one dimension of the latent space. It is possible that only one edge is assigned to a certain latent dimension. Since the shrinkage Dirichlet prior heavily penalises this type of behaviour, the only circumstance where a fitted SLPM may exhibit this feature is when the value carried by such an edge is an outlier that cannot be explained through the other relevant dimensions. That is, the SLPM may exploit the empty components to separate those edges that cannot be represented by the dimensions that are being used. This mechanism naturally highlights outliers in the data, or, equivalently, it may signal that, to some extent, the latent space representation is not particularly appropriate for the dataset considered.
Identifiability.
The likelihood of a latent position network model is known to be invariant under rotations, translations and reflections of the latent positions (Hoff et al. 2002). Markov chain Monte Carlo techniques may be used to obtain a sample from the posterior distribution of these models; however, post-processing then becomes necessary to deal with the identifiability issues and, thus, to attain meaningful interpretations. The SLPM is no exception in this regard, as its likelihood is invariant under translations and reflections. In addition, the likelihood is also unaffected by any relabelling of the latent dimensions, which, in turn, leads to the same label switching issues studied by Stephens (2000). The combination of these two sources of non-identifiability makes the available post-processing methods inadequate, and discourages the use of sampling-based methods, as their output cannot be efficiently interpreted. This paper considers an optimisation framework which bypasses all identifiability problems and makes the label switching and the translation invariance irrelevant.
Univariate networks.
Section 2.1 introduces the SLPM as a statistical model for bipartite networks. The canonical directed unipartite framework is obtained as a special case, when M = N and the nodes' labels vary in the same set. The SLPM can be easily modified to neglect self-edges, if these are not of interest.

If a directed unipartite scenario is considered, the SLPM separates the latent positions of nodes based on whether they are sending or receiving an edge. For example, the latent position of node i is given by U_i = { U_i1, ..., U_iK } when sending edges and by V_i = { V_i1, ..., V_iK } when receiving them. Although unusual, such a distinction may give interesting advantages when disassortative patterns are observed.

Finally, with some additional work, it may also be possible to extend the SLPM framework to handle undirected unipartite networks, where the latent positions U and V coincide. However, this scenario is not explored in the present paper and it is left to future work.

Latent position models.
Rastelli et al. (2016) illustrate the properties exhibited by the basic LPM, and show that this model can become very flexible when additional nodal random effects are added. In particular, they show that the basic LPM and its extensions are naturally designed to capture transitivity, homophily and heavy tailed degree distributions. In the SLPM there are no nodal random effects, and, similarly to the LPM, each of the nodes in the graph is characterised by a number of latent coordinates. However, the generative mechanism behind the SLPM makes use of these quantities in a different way, as each edge is determined by the coordinates in only one of the latent dimensions.

On the one hand, this modification makes the model much more flexible: for example, differently from the canonical LPM, two nodes located in the same position may not be stochastically equivalent conditionally on the allocation variables, in that the distributions generating their edges may be completely different. Lack of transitivity, which is common in unipartite networks with strong core-periphery structures, may also be represented by the SLPM: conditionally on Z, the variables X_ij and X_i'j are dependent only if they are allocated to the same latent group, for any i, i', j. In addition, the SLPM can capture homophily and, to some extent, the lack of it. In fact, two nodes that are close in the latent space are expected to be strongly connected, regardless of which dimension determines the edge weight. Nevertheless, if two nodes are close in at least one dimension and far apart in at least another, a wide range of scenarios become possible.

Disassortative mixing refers to the tendency of nodes to connect to those that are not similar to them. For example, disassortative mixing arises when many nodes are connected only to a large hub (core-periphery structure). In the SLPM context, the distinction between sender and receiver position may be useful to capture the fact that certain nodes act as bridges, receiving edges from a set of nodes and sending edges into a different set.
On the other hand, the SLPM is less interpretable than the LPM and its extensions. In fact, in the SLPM, the Euclidean distance between two K-dimensional coordinate vectors simply has no relevant meaning, as one should rather study the distances between nodes for each latent dimension separately. In addition, a fitted SLPM may be characterised by a large number of latent dimensions, making graphical representations impractical. So, generally speaking, extracting meaningful interpretations from a two-dimensional representation of the latent space of a SLPM may not be as straightforward as in other LPM frameworks.

One significant novelty introduced by the SLPM is that the number of latent dimensions is estimated from the data, and, in addition, a weight is assigned to each dimension to represent its importance. The estimation of K is a long standing problem that has rarely been directly addressed in the LPM literature, the main reason being that the arbitrary choice K = 2 provides a convenient framework to clearly show the results. One relevant work in this context is Handcock et al. (2007), where the authors introduce a variant of the well-known Bayesian Information Criterion (BIC) to perform model choice. However, in the discussion they also point out that this method may not be formally correct and appropriate when choosing the number of latent dimensions. Another work is that of Friel et al. (2016), where the authors use instead the Deviance Information Criterion (DIC). One drawback of this approach is that it requires the Markov chain Monte Carlo sampler to be run for every value of K considered. By contrast, in the SLPM, the model-choice problem is addressed in one single algorithmic framework, since the superfluous groups are automatically emptied during the estimation.

Another crucial difference with the existing literature relates to the presence of weights on the edges: the aforementioned LPMs can only deal with binary edges. The only paper that proposes an adaptation of the LPM framework to a weighted network context is Sewell and Chen (2016). The authors consider a dynamic weighted network scenario, and propose several types of models based on the type of weights carried by the edges. For the nonnegative case, they propose a Tobit model (Tobin 1958), and estimate model parameters using Markov chain Monte Carlo sampling. The SLPM is fundamentally different, as it uses a mixture model to characterise the distribution of the edges, with the goal of naturally representing the sparsity of network connections, and estimating the number of latent dimensions. While the model of Sewell and Chen (2016) may be better suited to clearly summarise the network data in a two-dimensional space, dynamic extensions of the proposed SLPM should provide a more flexible framework to predict missing or future values.

Weighted blockmodels.
The stochastic blockmodel (SBM) of Wang and Wong (1987) has been extended to account for weighted interactions by Mariadassou et al. (2010) and by Aicher et al. (2014). In the SBM, the nodes are characterised by a latent cluster membership, and any two nodes assigned to the same group are said to be stochastically equivalent, as their connections are generated from the same parametric distribution. Equivalently, one may say that each edge of the network is characterised by a latent cluster membership, which is itself determined by the allocations of the nodes at its extremities: the edges assigned to the same group are generated from the same parametric distribution. This mechanism also characterises the well-known mixed-membership stochastic blockmodel (MMSBM) of Airoldi et al. (2008), where, again, the cluster membership characterising each edge is determined by node-specific attributes. One crucial advantage of the MMSBM with respect to the SBM is that it allows heterogeneity to be present within any subset of nodes.

The SLPM may be seen as a variant of the SBM and the MMSBM, where the cluster membership assigned to an edge is not determined by nodal information, but rather by some global parameters (the mixing proportions λ). In addition, once these cluster memberships are set, the distributions generating the edge weights are chosen not only according to such clustering variables, but also using cluster-specific nodal information, i.e. the latent coordinates. Hence, the SLPM relates to the literature originated from Airoldi et al. (2008) and it extends the classic blockmodels by introducing an additional ingredient, the latent distance representation, to explain the observed heterogeneity and to increase the flexibility of the mixture.

Tensor factorisation models.
Tensors are array objects that generalise adjacency matrices to more than two dimensions, and are generally used to represent multiview networks, dynamic networks, and other types of multivariate interaction data. One common approach used to model these objects is inspired by factor models (Bartholomew et al. 2011), where the tensor (or, in our context, the adjacency matrix X) is decomposed using a low-rank factorisation:

    X = UV' + E    (6)

Here, U has size M × K, V has size N × K, E has size M × N, and the dash denotes matrix transposition. A very rich machine learning literature is available on this topic (see, for example, Lim and Teh 2007; Mnih and Salakhutdinov 2008; Salakhutdinov and Mnih 2008); however, since this paper deals with nonnegative matrices, our interest mainly focuses on the nonnegative matrix factorisation (NMF) approach of Lee and Seung (1999) and Lee and Seung (2001).

In statistics, similar models have been considered for the analysis of binary networks by Hoff et al. (2002), under the name of projection models. Since each edge is assumed to be generated through a transformation of the scalar product of K-dimensional latent vectors, the latent space may be interpreted as a K-dimensional sphere, where a small angle between two nodes implies a higher probability of connection. While this generative mechanism shares similarities with LPMs and other distance based models, it provides a vastly different geometrical representation of the network; hence, it may lead to qualitatively different statistical analyses and interpretations.

Projection models have been intensely studied and extended in a number of ways (Hoff 2005; Durante and Dunson 2014), and have also been considered as models for weighted interactions (Hoff 2018). The work of Durante et al. (2017) has several similarities with the approach proposed in this paper. The authors consider a multiview network and propose a mixture of projection models to explain its binary edges. They exploit the sparse Dirichlet prior on the mixture weights to ensure that non-relevant mixture components are emptied. They propose a Markov chain Monte Carlo sampler and focus their analysis only on the characterisation of the edge probabilities, hence bypassing the noted identifiability issues.

The SLPM relies instead on a sparse finite mixture representation of the latent space within a distance-based framework, which also leads to an estimation of K. The new model proposed in this paper is specifically designed for nonnegative weighted bipartite networks, and, hence, it could be seen as a competitor of the NMF algorithms. Here, the main novelty introduced by the SLPM with respect to NMF and the other tensor factorisation models is the latent distance framework, where interaction values are determined by how close (in the Euclidean sense) the two entities are in the latent space.
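For later comparisons it is useful to recall how a rank-K factorisation such as (6) is obtained in practice. The R sketch below uses the NMF package, which is also employed in Section 4; the rank and the toy matrix are illustrative assumptions.

library(NMF)
set.seed(1)
X <- matrix(rexp(20 * 30), nrow = 20, ncol = 30)  # a toy nonnegative matrix
fit <- nmf(X, rank = 3)                           # nonnegative matrix factorisation with K = 3
W <- basis(fit)                                   # M x K nonnegative factor
H <- coef(fit)                                    # K x N nonnegative factor
X_hat <- W %*% H                                  # low-rank reconstruction of X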
3 Inference

According to Section 2.1 and Figure 1, the posterior density of a SLPM factorises as follows:

    π(Z, U, V, λ, γ | X, δ, a, b) ∝ L_X(Z, U, V) π(Z | λ) π(λ | δ) π(U | γ) π(V | γ) π(γ | a, b)    (7)

However, as noted in the previous section, posterior samples for the model parameters are not interpretable due to the multiple sources of non-identifiability. This means that a traditional Markov chain Monte Carlo sampler cannot be used to characterise the posterior and, hence, the uncertainty around the parameter estimates. To overcome this limitation, an approach based on Variational Bayes (VB, Attias 1999; Attias 2000) is proposed. This means that sampling techniques are replaced by optimisation, which allows much faster computations while still providing a reasonable characterisation of the posterior uncertainty. In this paper, the theory behind VB is not discussed in detail: the interested reader may find more complete illustrations of these methods, for example, in Latouche (2010) and Blei et al. (2017) and references therein.

Denote the collection of model parameters {Z, U, V, λ, γ} = φ ∈ Φ: the goal of VB is to perform a search in a family Q of densities over Φ, to find the density q*(·) ∈ Q that is closest, in the sense of Kullback-Leibler (KL) divergence, to the posterior density π(φ | X) defined by (7). Generally, the densities in the family Q are, by construction, smoother, more tractable, and easier to interpret than π(· | X). As is common practice, the variational densities in Q are assumed to satisfy a mean-field assumption:

    q(Z, U, V, λ, γ) = { ∏_{i,j} q_Z(Z_ij) } { ∏_{i,k} q_U(U_ik) } { ∏_{j,k} q_V(V_jk) } × q_λ(λ) { ∏_k q_γ(γ_k) }    (8)
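As a reminder of why this is a well-posed optimisation problem (this is the standard variational Bayes identity, not a result specific to the SLPM), the marginal likelihood of the data decomposes as

    log p(X) = E_q[log p(X, φ)] − E_q[log q(φ)] + KL( q(φ) || π(φ | X) ),

so that, since log p(X) does not depend on q, maximising the first two terms (the evidence lower bound, or free energy) is equivalent to minimising the KL divergence between q and the posterior.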
13n (8) and from here onwards: ≤ i ≤ M , ≤ j ≤ N and ≤ k ≤ K , unlessotherwise specified. Similarly to most other LPMs, the SLPM does not have all of itsfull-conditional distributions in standard form, mainly due to the likelihood specification.In a MCMC context, this means that non-standard parameter updates must be performedusing Metropolis-within-Gibbs (Hoff et al. 2002). In a VB context, this implies that theparametric forms of the variational densities in Q cannot be automatically derived, andthat closed forms updates will not be available for the variational parameters.Hence, the following variational densities are assumed: q Z (cid:16) Z ij , . . . , Z ijK (cid:12)(cid:12)(cid:12) ˜ λ ij , . . . , ˜ λ ijK (cid:17) ∼ M ultinomial (cid:16)
1; ˜ λ ij , . . . , ˜ λ ijK (cid:17) q U (cid:16) U ik (cid:12)(cid:12)(cid:12) ˜ α Uik , ˜ β Uik (cid:17) ∼ Gaussian (cid:16) ˜ α Uik , ˜ β Uik (cid:17) q V (cid:16) V jk (cid:12)(cid:12)(cid:12) ˜ α V jk , ˜ β V jk (cid:17) ∼ Gaussian (cid:16) ˜ α V jk , ˜ β V jk (cid:17) q λ ( λ ij , . . . , λ ijK ) ∼ Dirichlet (cid:16) ˜ δ , . . . , ˜ δ K (cid:17) q γ ( γ k ) ∼ Gamma (cid:16) ˜ a k , ˜ b k (cid:17) (9)Here, Gaussian refers to the univariate normal density. It should be noted that all thevariational parameters are indicated with a tilde: ˜ φ = (cid:110) ˜ λ , ˜ α U , ˜ β U , ˜ α V , ˜ β V , ˜ δ , ˜ a , ˜ b (cid:111) Proposition 1.
Minimising the KL divergence between q(·) ∈ Q and the SLPM posterior π(· | X) is (approximately) equivalent to maximising the following functional with respect to the variational parameters φ̃:

    F(φ̃) = const
        + Σ_{i,j,k} λ̃_ijk [ ψ(η̃_ijk / ζ̃_ijk) − log(η̃_ijk / ζ̃_ijk) − x_ij η̃_ijk − log(λ̃_ijk) ]
        + Σ_k ( δ_k − δ̃_k + Σ_{i,j} λ̃_ijk ) [ ψ(δ̃_k) − ψ(Σ_h δ̃_h) ]
        + Σ_k ( a_k − ã_k + (M + N)/2 ) [ ψ(ã_k) − log b̃_k ] − Σ_k (ã_k / b̃_k) ( b_k + S̃_k / 2 )
        + (1/2) Σ_{i,k} log(β̃^U_ik) + (1/2) Σ_{j,k} log(β̃^V_jk)
        − log Γ(Σ_k δ̃_k) + Σ_k [ log Γ(δ̃_k) + ã_k − ã_k log b̃_k + log Γ(ã_k) ]    (10)

where:

    η̃_ijk = β̃^U_ik + β̃^V_jk + (α̃^U_ik − α̃^V_jk)^2
    ζ̃_ijk = 2 η̃_ijk − (α̃^U_ik − α̃^V_jk)^2
    S̃_k = Σ_i ( β̃^U_ik + (α̃^U_ik)^2 ) + Σ_j ( β̃^V_jk + (α̃^V_jk)^2 )    (11)

and ψ(·) indicates the first derivative of the log-gamma function.

The functional F is often called the variational free energy. The proof of Proposition 1, and a description of the approximations introduced to obtain this result, are provided in Appendix A.1.

Maximisation of F

The maximisation of (10) is performed using a coordinate ascent algorithm (CAVI, Bishop 2006; Blei et al. 2017), where each of the variational densities composing q(·) is updated in turn in a greedy fashion. This section shows that some of these updates are available in closed form, whereas iterative procedures must be used for the remaining densities. The proofs are all shown in the appendix.

Proposition 2.
For any given configuration of variational parameters φ̃ and indexes 1 ≤ i ≤ M and 1 ≤ j ≤ N, the values of (λ̃_ij1, ..., λ̃_ijK) that maximise F are given by:

    λ̃*_ijk = exp{ ψ(η̃_ijk / ζ̃_ijk) − log(η̃_ijk / ζ̃_ijk) − x_ij η̃_ijk + ψ(δ̃_k) − ψ(Σ_h δ̃_h) }
              / Σ_{k'} exp{ ψ(η̃_ijk' / ζ̃_ijk') − log(η̃_ijk' / ζ̃_ijk') − x_ij η̃_ijk' + ψ(δ̃_k') − ψ(Σ_h δ̃_h) }    (12)

for 1 ≤ k ≤ K.
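Since the exponents in (12) can be very large in magnitude, implementations typically normalise these responsibilities on the log scale. The R sketch below illustrates the standard log-sum-exp device; the vector logw is a hypothetical input holding, for one edge (i, j), the K exponents appearing in (12).

softmax <- function(logw) {
  m <- max(logw)            # subtract the maximum for numerical stability
  w <- exp(logw - m)
  w / sum(w)
}
# example with illustrative values of the log-weights
lambda_tilde_ij <- softmax(c(-1050, -1052, -1060))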
Proposition 3.

For any given configuration of variational parameters φ̃, the values of δ̃ that maximise F are given by:

    δ̃*_k = δ_k + Σ_{i,j} λ̃_ijk    (13)

for 1 ≤ k ≤ K.

Proposition 4.
For any given configuration of variational parameters φ̃, the values of ã and b̃ that maximise F are given by:

    ã*_k = a_k + (M + N)/2
    b̃*_k = b_k + (1/2) Σ_i ( β̃^U_ik + (α̃^U_ik)^2 ) + (1/2) Σ_j ( β̃^V_jk + (α̃^V_jk)^2 )    (14)

for 1 ≤ k ≤ K.

Similar closed form updates are not available for the remaining variational densities q_U and q_V. Hence, the updates for the corresponding parameters aim at improving the value of F using a greedy rule in the direction of the natural gradient (Amari 1998).

Proposition 5. The update defined by:

    α̃*^U_ik = α̃^U_ik + ε β̃^U_ik ∂F(φ̃)/∂α̃^U_ik
    β̃*^U_ik = β̃^U_ik exp{ ε β̃^U_ik ∂F(φ̃)/∂β̃^U_ik }    (15)

is in the direction of the natural gradient of F with respect to α̃^U_ik and β̃^U_ik, and, for a small enough ε > 0, it does not decrease the value of F.

A result equivalent to Proposition 5 is also provable for the variational density associated to the latent position V_jk, for every 1 ≤ j ≤ N and 1 ≤ k ≤ K.

In practice, the algorithm keeps track of the learning rate ε associated to the variational densities of each latent position. To update one of these densities, an initial learning rate ε is considered, and new proposed parameter values are calculated according to (15). If the new value of F is not smaller, the new values are retained and the algorithm proceeds to the following update. Otherwise, the learning rate is repeatedly halved until it is sufficiently small to guarantee a non-decrease of F. The pseudocode for the optimisation routine is shown in Algorithm 1.
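The step-halving rule described above can be sketched as follows. This is a generic illustration of the backtracking idea, with hypothetical helper functions free_energy() and proposed_update() standing in for the evaluation of F and for the natural-gradient step (15); it is not the SparseLPM implementation.

guarded_update <- function(params, eps, min_eps = 1e-8) {
  f_old <- free_energy(params)
  while (eps > min_eps) {
    candidate <- proposed_update(params, eps)   # step (15) with learning rate eps
    if (free_energy(candidate) >= f_old) {
      return(list(params = candidate, eps = eps))
    }
    eps <- eps / 2                              # halve the step until F does not decrease
  }
  list(params = params, eps = eps)              # give up: keep the current values
}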
Algorithm 1: CAVI algorithm for the SLPM
INPUT: tol and starting values φ̃
set F* = F(φ̃)
set stop = false
while !stop do
    for every i and j: update λ̃_ij using (12)
    update δ̃ using (13)
    update ã and b̃ using (14)
    for every i and k: update α̃^U_ik and β̃^U_ik using (15)
    for every j and k: update α̃^V_jk and β̃^V_jk using (15)
    if F(φ̃) ≤ F* + tol then stop = true end if
    set F* = F(φ̃)
end while
OUTPUT: F* and φ̃

It should be noted that, since each of the updates cannot decrease the value of the objective function, the algorithm is greedy; hence, due to the nonlinearity of the objective function, the procedure is guaranteed to converge only to a local optimum. The iterative routine is stopped when a full iteration yields an increase smaller than a prefixed positive quantity denoted tol.

Regarding the hyperparameters, these may be used to introduce prior knowledge in the framework, when this is available. More generally, as a default, the following values are used: the entries of δ are all set to 0.001, while the entries of a and b are all set to 1. With these choices, the Dirichlet distribution on the mixing proportions becomes a shrinkage prior, as already discussed.
The Gamma(1, 1) prior on the precisions yields, instead, a rather flexible structure, which supports a wide range of scenarios. In this paper, the hyperparameters are set to the default values in every application considered.

3.3 Initialisation

Due to its iterative nature, the greedy algorithm proposed requires the variational parameters to be initialised. As is the case for other similar algorithms, such as the expectation-maximisation of Dempster et al. (1977) or the iterated conditional modes of Besag (1986), parameter initialisation plays a fundamental role in avoiding convergence to irrelevant local optima. Here, a Euclidean-distance-based heuristic method is proposed to initialise the latent positions of the nodes. Once the value of K is set, the goal is to locate the nodes in R^K in a way such that the nodes that are close to each other have larger observed interaction values.

First, recall that, given a binary adjacency matrix A, the generic element indexed by (i, j) of the matrix AA' identifies the number of paths that allow one to move from i to j in two steps. Similarly, in a weighted bipartite network context, the matrix XX' has size M × M, and the value of its generic entry (i, i') increases if both senders i and i' are strongly interacting with the same nodes. Hence, a measure of similarity between sender nodes and, respectively, between receiver nodes is introduced using the following matrices:

    S^U = √( XX' / N )        S^V = √( X'X / M )    (16)

As a consequence, the matrix D^U (resp. D^V), obtained by inverting the elements of S^U (resp. S^V), gauges the dissimilarity between any two sender nodes (resp. receiver nodes). Gathering together all the available information, a dissimilarity matrix for all the nodes of the network is defined as follows:

    D = [ D^U   X  ]
        [ X'    D^V ]    (17)

and nonmetric multidimensional scaling is used on D to initialise the latent positions for all of the senders and receivers. Since some of the elements of X, S^U and S^V may be zero, the initialisation procedure is performed on X + E, where the matrix E has all its elements equal to an arbitrary small positive constant ε > 0. The variational variances β̃^U (resp. β̃^V) are all set to a fixed multiple of the empirical variance of α̃^U (resp. α̃^V). The remaining variational parameters are all initialised to a common default value, with the exception of the λ̃s, which are initialised during the first step of the optimisation conditionally on all the other parameters.

4 Applications to artificial data

In this section, the proposed methodology is tested on artificial data, demonstrating that it can serve as a computationally efficient tool to summarise the information provided by nonnegative matrices.
4.1 Model fit

The goal of these experiments is to demonstrate that the SLPM leads to meaningful results and that it can achieve a good fit to the data. The NMF method is also used on the same generated datasets for comparison purposes.
Experiment 1: SLPM data.
In the first experiment, the algorithms are tested on simulated SLPMs in two scenarios: one corresponding to a small network with M = N = 25, and the other on a larger network with M = N = 100. Here, the SLPM modelling assumption is correct; hence, the corresponding method is expected to perform generally well, and to achieve better results than NMF. The true number of dimensions is fixed to K_true = 3 and the mixing proportions are all set to 1/3. The latent positions are all sampled from a standard Gaussian distribution, and, finally, the average weighted network X is generated according to:

    x_ij = Σ_{k=1}^{K} λ_k (U_ik − V_jk)^{−2}    (18)

This generating process is repeated independently, hence obtaining a collection of simulated SLPMs for each scenario.

The NMF model is estimated using the method of Brunet et al. (2004), as implemented in the R package NMF (Gaujoux and Seoighe 2010), while the variational method of Section 3 is used for the SLPM. Both methods consider K = 8 latent dimensions, and are run once on each dataset. Additionally, the SLPM hyperparameters are fixed to δ = 0.001 and a = b = 1, and the variational parameters are initialised using the distance-based approach described in Section 3.3. The tolerance threshold tol for the SLPM is set to a small fixed value.

Once the parameters of the SLPM and NMF are estimated for all of the generated networks, they are used to reconstruct the weighted adjacency matrices X^(SLPM) and X^(NMF). Then, to assess the performance of the two models, the following loss function is considered:

    L(X, X^(·)) = (1 / MN) Σ_{i,j} log | x_ij − x^(·)_ij |    (19)

for any two M × N matrices X and X^(·). The presence of the log is necessary to counterbalance the extremely heavy tailed shape of the weights documented in Figure 2. In fact, without the log transformation, the value of the loss would be almost exclusively determined by the prediction error on the heaviest weight. By contrast, it should be noted that a drawback of this transformation is that the magnitude of the errors on the small edge values is inflated.

The two plots in Figure 3 compare the losses for SLPM and NMF on the small and large networks, respectively. Clearly, the estimated SLPM yields smaller prediction log-errors in the vast majority of cases. This is especially true in the scenario with larger networks (right panel of Figure 3), as the additional available information guarantees a reasonably good recovery of the true latent structure.

To estimate the value of K, one may consider:

    K̂ = #{ k : k = 1, ..., K; Σ_{i,j} λ̃_ijk > 0 }    (20)

that is, the number of groups with a size which is strictly greater than zero. However, due to the presence of extreme edge values, nearly empty groups most often appear. These naturally inflate K̂ without adding any substantial changes in the results obtained. For this reason, studying more directly the mixing proportions may, in general, be advisable.

Regarding the estimation of K in this experiment, Figure 4 shows the estimated mixing proportions in decreasing order. In both scenarios, and for the vast majority of generated datasets, two latent dimensions stand out, since most edges are assigned to either of these two. The third most relevant dimension generally contains only few edges which correspond to outliers.
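The loss (19) and the estimator (20) are straightforward to compute once a reconstruction of X is available. The R sketch below is an illustration only; X_hat and lambda_tilde are hypothetical objects holding, respectively, a reconstructed matrix and the estimated edge-by-dimension responsibilities (an M x N x K array), as these would be produced by a fitted model.

# average prediction log-error, as in (19)
log_loss <- function(X, X_hat) mean(log(abs(X - X_hat)))

# number of non-empty latent dimensions, as in (20)
estimate_K <- function(lambda_tilde) {
  sizes <- apply(lambda_tilde, 3, sum)   # total responsibility of each dimension
  sum(sizes > 0)
}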
Average log−error with NMF A v e r age l og − e rr o r w i t h S L P M l l lll ll l l lll ll l lll ll ll l lll lll l lll lll ll l l lll llll ll l lll lll ll ll ll l lll lllll ll llll l ll ll l l llll lllll ll lll ll M = N = 25 . . . . . . . Average log−error with NMF A v e r age l og − e rr o r w i t h S L P M ll ll lll ll lll l llll ll ll ll ll lll lll ll ll l ll l ll ll ll l lll lll l ll ll lll lll llll llll ll l lll l ll ll ll lll lll ll lll l ll l l M = N = 100
Figure 3:
Model fit, experiment 1 . Prediction losses of the estimated matrices X ( SLP M ) and X ( NMF ) with respect to the true X , for each of the generated datasets. The left panel correspondsto the small datasets ( M = N = 25 ), whereas the right panel corresponds to the large datasets( M = N = 100 ). . . . . . . K p r opo r t i on o f edge s M = N = 25 . . . . . . K p r opo r t i on o f edge s M = N = 100
Figure 4:
Model fit, experiment 1 . In both panels, the k -th boxplot describes the k -th mixingweight (in decreasing order) across datasets. In both small networks (left panel) and largenetworks (right panel), two latent dimensions stand out. xperiment 2: homogeneous data. In this second experiment, the data is gener-ated from a homogeneous process: each edge weight is assumed to be an exponentialrandom variable, whose rate is distributed according to a
Gamma (1 , . This leads toa homogeneous network model whose empirical degree distribution is characterised by aheavy tail (Figure 5). l l ll llllllllllllllllllllllllllllllllllllllll llll l l l −10 −5 0 5 − − − − log−edge−value l og − p r opo r t i on y - l l l lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l l l l −10 −5 0 5 10 − − − − − − log−edge−value l og − p r opo r t i on y - Figure 5:
Model fit, experiment 2 . Empirical log - log distribution of the edge weights for twonetworks generated using the homogeneous model, with sizes M = N = 25 and M = N = 100 (respectively on the left and right panel). Similarly to the previous experiment, the SLPM and NMF are estimated on dataset with sizes M = N = 25 or M = N = 100 . The hyperparameters, number oflatent dimensions, and optimisation parameters are also set exactly to the same valuesused in the previous experiment.The predictive log -errors are shown in Figure 6. While the NMF model seems toachieve slightly better results on the small datasets, the SLPM again outperforms NMFon the larger datasets. This suggests that the distance-based generative process behindthe SLPM may be more flexible than that of the NMF, and that it may be better suitedto represent the topologies exhibited by many types of observed networks.As concerns the estimation of K , Figure 7 shows the estimated mixing proportions indecreasing order. The number of relevant components corresponds to the correct numberof dimensions in the small datasets, since we see from the left panel that the third largestgroup contains approximately five to of the edges in more than half of the datasets.However, an overestimation of K is observed on the larger dataset. Here, it seems that,21 .8 1.0 1.2 1.4 1.6 . . . . . Average log−error with NMF A v e r age l og − e rr o r w i t h S L P M ll ll l llllll ll lll lll l ll llll ll llll ll ll lll ll ll lll ll l ll lll l llll lll l llll llll ll ll lll ll ll ll ll l ll llll l ll ll ll M = N = 25 . . . . . . . Average log−error with NMF A v e r age l og − e rr o r w i t h S L P M llll lll lllllllllll ll ll l llllll l llllll l lll ll llll ll lll llll lll ll lllll llll lllllll llll lll l llll lllll ll llll l M = N = 100
Figure 6:
Model fit, experiment 2 . Prediction losses of the estimated matrices X ( SLP M ) and X ( NMF ) with respect to the true X , for each of the generated datasets. The left panel correspondsto the small datasets ( M = N = 25 ), whereas the right panel corresponds to the large datasets( M = N = 100 ). . . . . . . . . K p r opo r t i on o f edge s M = N = 25 . . . . K p r opo r t i on o f edge s M = N = 100
Figure 7:
Model fit, experiment 2 . In both panels, the k -th boxplot describes the k -th mixingweight (in decreasing order) across datasets. In the small networks (left panel) the estimatedmodels are characterised by three latent dimensions, whereas in the large networks (right panel)more dimensions are required to fit the data. The goal of this section is to further assess the performance of the SLPM in recoveringthe correct number of latent dimensions.Four scenarios are considered: K true = 2 and M = N = 40 in the first scenario, K true = 4 and M = N = 40 in the second, K true = 2 and M = N = 80 in the third,and K true = 4 and M = N = 80 in the fourth. In each scenario the hyperparametersare all fixed to , and artificial networks are generated independently. The mixingproportions, the precisions, and the latent positions are all generated according to thehierarchical structure described in Section 2.1. Then, the average weighted network X isgenerated according to (18).Similarly to the previous section, the SLPM is estimated on all of the datasets us-ing VB with the same set-up for hyperparameters, number of latent dimensions, andoptimisation parameters. In addition, in order to assess the sensitivity of the SLPM toits hyperparameters, the variational algorithm is run once with δ = 0 . and once with δ = 0 . .As in the previous simulation studies, the number of revelant groups is deduced bythe mixing proportions once they are sorted in decreasing order. Figures 8, 9, 10, and 11refer to scenarios one to four, respectively. It appears that, overall, the value of thehyperparameter δ does not have any relevant effect on the estimate of K . While in thefirst scenario ( M = N = 40 , K = 2 , Figure 8) the number of relevant latent dimensionsis generally overestimated to , in the second scenario ( M = N = 40 , K = 4 , Figure 9) K is well-recovered. In fact, in more than half of the datasets, the fourth group containsmore than of the edges, whereas the fifth group has negligible size.Similarly to the previous experiment, K is overestimated in the larger datasets ( M = N = 80 , scenarios and in Figures 10 and 11, respectively), in that in more than halfof the datasets the sixth group has a non-negligible size.These simulations illustrate that, while shrinkage and parsimony are generally achieved,the estimated number of relevant groups does not always coincide exactly with the true23 . . . . K p r opo r t i on o f edge s delta = 0.001M = N = 40K_true = 2 . . . . . K p r opo r t i on o f edge s delta = 0.1M = N = 40K_true = 2 Figure 8:
Model choice, scenario 1 . In both panels, the k -th boxplot describes the k -th mixingweight (in decreasing order) across datasets. The hyperparameter δ seems to have no effecton the results. Three latent dimensions stand out with a weight approximately equal to . in atleast half of the datasets. The fourth dimension has a small weight in all of the datasets. . . . . . K p r opo r t i on o f edge s delta = 0.001M = N = 40K_true = 4 . . . . . K p r opo r t i on o f edge s delta = 0.1M = N = 40K_true = 4 Figure 9:
Model choice, scenario 2 . In both panels, the k -th boxplot describes the k -th mixingweight (in decreasing order) across datasets. The hyperparameter δ seems to have no effecton the results. In this scenario, the number of relevant latent dimensions is generally equal tothe true value of K . . . . . . K p r opo r t i on o f edge s delta = 0.001M = N = 80K_true = 2 . . . . . K p r opo r t i on o f edge s delta = 0.1M = N = 80K_true = 2 Figure 10:
Model choice, scenario 3 . In both panels, the k -th boxplot describes the k -thmixing weight (in decreasing order) across datasets. The hyperparameter δ seems to have noeffect on the results. Although the true value of K is two, five latent dimensions stand out. Inthis scenario, the VB algorithm struggles to reduce the dimensionality and to recover the correctnumber of latent dimensions. . . . . . . . . K p r opo r t i on o f edge s delta = 0.001M = N = 80K_true = 4 . . . . . . . K p r opo r t i on o f edge s delta = 0.1M = N = 80K_true = 4 Figure 11:
Model choice, scenario 4 . In both panels, the k -th boxplot describes the k -thmixing weight (in decreasing order) across datasets. The hyperparameter δ seems to haveno effect on the results. Similarly to the third scenario, the VB algorithm struggles to reduce thedimensionality and it tends to overestimate the number of relevant latent dimensions. , but rather lives in a neighourhood of this value. In particular, the number of relevantgroups tends to be overestimated when the dataset is large and, hence, more heteroge-neous. Similarly to the other available estimation methods for LPMs, the computational com-plexity of the variational algorithm is determined by the updates of the latent positions.One such update requires a summation over all possible neighbours, hence it implies ei-ther N or M operations, depending on whether the updated node is a sender or receiver,respectively. In addition, since the step size must be tuned at every update attempt, the N (or M ) evaluations are actually performed an unspecified number of times, dependingon the rule determining the learning rates. However, since the step sizes remain approxi-mately constant as the algorithm progresses, these re-evaluations are generally performedvery few times.The number of iterations required by the algorithm to converge impacts the overallcomputing time by a much greater extent. In general, although the algorithm is greedy,there are no guarantees regarding the number of iterations required for convergence, and,most likely, such number depends on the number of nodes and on the topology of thenetwork. In practice, a few hundreds iterations are often sufficient to achieve the threshold tol = 0 . .Figure 12 compares the computing time required by the variational algorithm of SLPMwith that of NMF. NMF exploits parallel computing and is generally much faster, as itconverges with very few iterations. The figure also highlights that SLPM scales quitepoorly with the number of dimensions.These results suggest that, while SLPM generally achieves a very good fit to the data,it may also struggle in dealing with very large datasets, hence forcing one to increasethe tolerance threshold or to reduce the maximum number of iterations, in order to getusable results in a reasonable time. The goal of the two studies shown in this section is to propose an application of themethodology proposed on real datasets, and to illustrate how the SLPM represents andsummarises the observed data. 26 network size s e c ond s l l l l l l l l l l l SLPM with K = 2SLPM with K = 10NMF with K = 2NMF with K = 10
Figure 12:
Computational efficiency . Number of seconds required to perform the first iterations of the VB algorithm (black and red lines), and one run of the NMF algorithm (greenand blue lines). The two algorithms appear to scale similarly with respect to the size of thenetworks, especially when the number of latent dimensions is small. For a larger number of latentdimensions, the NMF algorithm scales much better than the VB algorithm used to estimate theSLPM. The dataset used in this example has been first studied by Vanhems et al. (2013), and it isavailable from the R package igraphdata . The data record proximity interactions between patients, nurses, doctors and administrative staff at a university Hospital inLyon, France, from a Monday to the following Friday in December . Wearablesensors based on active Radio-Frequency IDentification (RFID) are used to probe thearea surrounding each individual at seconds intervals, and proximity interactions arerecorded whenever two individuals are closer than . meters. The analysis of such datamay provide insights on the dynamics of the spread of Hospital-acquired infections, and,hence, lead to valid and more effective preventive measures.In this analysis, the interaction matrix X denotes the cumulated proximity time, i.e.the generic entry x ij indicates the number of seconds the staff member i has spent inproximity of patient j . The length of stay of the patients can be deduced from theproximity information, and, clearly, those who stayed longer have a higher degree (notshown).The latent number of dimensions was set to K = 10 , and the VB algorithm was run27nce using the initialisation of Section 3.3. The algorithm reached convergence after iterations and . seconds.The rounded mixing proportions are ˆ λ = 0 . , ˆ λ = 0 . , and ˆ λ k < . for i =2 , , , , , , ; two latent groups stand out with a large mixing weight. As illustratedby the two plots in Figure 13, the two relevant latent dimensions fulfil two very differentroles. In fact, the larger one has a very high empirical variance, meaning that it is l − − dimension l og edge v a l ue s l l l l l l l l l . . . . . . empirical variance m i x i ng p r opo r t i on l l l l l l l l l Figure 13:
Hospital data . The left panel shows the distribution of the weight log-values assignedto each of the two most relevant dimensions. As null-weights cannot be shown, a small positiveconstant is added to all the entries of X to improve the readability of this plot. The rightpanel compares the mixing proportions with the empirical variances of the corresponding latentdimensions. used to represent all of the zero or nearly-zero weights. By contrast, dimension numberthree is instead used for almost all of the remaining edges. The same conclusion is alsoapparent from Figure 14, where dimension number three, plotted on the vertical axis,clearly exhibits more heterogeneity than dimension number one, which is plotted on thehorizontal axis.The results show that there are no substantial differences between the nurses, doctors,and administrative staff in how they interact with the patients. It should be noted that,since dimension three is able to capture almost all the heterogeneity in the data, theunidimensional projection of this SLPM has features very similar to those of the originalLPM of Hoff et al. (2002). For example, for both patients and staff, nodes with higherdegree tend to be positioned closer to the centre of the space.28 − . − . − . . . . . dimension 1 d i m en s i on l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l ll l l l l l l Figure 14:
Figure 14: Hospital data. Latent space representation based on the two most relevant latent dimensions. The black squares correspond to the patients, whereas the circles correspond to the staff members. Different circle colours indicate the known staff member occupations (blue for nurses, green for doctors, red for administrative staff). The sizes of the circles and squares are proportional to the degrees of the corresponding nodes.

The second example is an analysis of the dataset first studied by Khan et al. (2001). Here, the matrix X represents the measured expression levels of M = 2308 genes in N = 64 samples, for patients who tested positive for the presence of small round blue cell tumours of childhood. The goal is to characterise the connectivity profiles of genes and samples, and to provide a latent space representation of these data. There are no zero-weighted edges in this dataset.

The VB algorithm was initialised and run with K = 20, and it converged in approximately one hour. Four latent groups have nonempty estimated mixing proportions. As illustrated in Figure 15, the three most relevant groups all have a relatively small empirical variance, and they are used to characterise the edges with both large and small weights. The remaining group is instead used only for large weights, yet it has a much smaller size and may be considered a group of outliers.
Figure 15: Microarray data. The left panel shows the distribution of the weight log-values assigned to each of the two most relevant dimensions. The right panel compares the mixing proportions with the empirical variances of the corresponding latent dimensions.

The two largest groups are used in the latent space representation of Figure 16.
Figure 16: Microarray data. Latent space representation based on the two most relevant latent dimensions. The black circles correspond to the genes, whereas the squares correspond to the samples. Different square colours indicate the known tumour classes. The sizes of the circles and squares are proportional to the degrees of the corresponding nodes.

In both dimensions, the latent positions highlight groups of samples which do not correspond to the tumour types included in the metadata. This is not unexpected, since the separation of the tumour types is known to be a challenging problem, and the SLPM does not involve the optimisation of a clustering criterion on the nodes of the graph. The results of Khan et al. (2001) are of a different nature, since the authors use the known tumour classes to train a neural network classifier, and then use this tool to predict the classes on additional test data.

Differently from the hospital data, the genes located close to the origin of the space are those farthest from the samples, hence they have the smallest degrees. The high-degree genes are instead located at the periphery of the cloud, where they are at close distance from at least three of the samples' point clouds.

Conclusions

The SLPM introduced in this paper represents a new research direction for the statistical modelling of networks. This new framework modifies and extends the LPM approach, introducing elements from the literature on blockmodels to enhance the overall flexibility of the proposed model.

Differently from blockmodels and tensor factorisation models, the SLPM projects the nodes into a latent space where a weighted transformed Euclidean distance determines the edge values. This leads to a tool that provides a distance-based representation of the rows and columns of nonnegative interaction matrices, where the number of latent dimensions is automatically deduced from the data.

The approach proposed prompts a number of interesting new extensions and research possibilities. Matrix completion and matrix reconstruction frameworks stand out as potentially useful applications, in that the SLPM may represent an effective alternative to NMF methods. These extensions and the related software are currently being developed. Furthermore, modifications of the SLPM may also be considered to account for other types of network data, such as binary networks, dynamic networks, or multiview networks.

In this paper, a VB framework is proposed to perform inference. This method is faster than sampling-based approaches, and it allows one to bypass identifiability issues. However, the true impact of the approximations introduced is not assessed and remains unknown. Future work may consider more effective strategies that lead to better results in shorter time, either using sampling approaches or approximations of a different nature.
Software
The R package SparseLPM accompanies this paper and provides an implementation of the variational algorithm described. Parts of the code have been written in C++ to reduce overall computing times. The package is publicly available from CRAN (R Core Team 2017). All of the computations described in this paper have been performed on an 8-core (2.2 GHz) Debian machine.
Acknowledgements
This research was partially supported through the Vienna Science and Technology Fund (WWTF) Project MA14-031.
References
Aicher, C., A. Z. Jacobs, and A. Clauset (2014). “Learning latent block structure in weighted networks”. In: Journal of Complex Networks.
Journal of Machine Learning Research.
Amari, S. (1998). “Natural gradient works efficiently in learning”. In: Neural Computation 10.2, pp. 251–276.
Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., pp. 21–30.
Attias, H. (2000). “A variational Bayesian framework for graphical models”. In: Advances in Neural Information Processing Systems, pp. 209–215.
Barabási, A. L. and R. Albert (1999). “Emergence of scaling in random networks”. In: Science.
Latent Variable Models and Factor Analysis: A Unified Approach. Vol. 904. John Wiley & Sons.
Besag, J. (1986). “On the statistical analysis of dirty pictures”. In: Journal of the Royal Statistical Society. Series B (Methodological), pp. 259–302.
Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer.
Blei, D. M., A. Kucukelbir, and J. D. McAuliffe (2017). “Variational inference: A review for statisticians”. In: Journal of the American Statistical Association.
Quantitative Finance.
Statistics and Computing.
Proceedings of the National Academy of Sciences.
Nature Reviews Neuroscience.
Statistics and Computing.
Journal of the Royal Statistical Society. Series B (Methodological), pp. 1–38.
Dormann, C. F., B. Gruber, and J. Fründ (2008). “Introducing the bipartite package: analysing ecological networks”. In: interaction 1, pp. 0–2413793.
Durante, D. and D. B. Dunson (2014). “Nonparametric Bayes dynamic modelling of relational data”. In: Biometrika.
Journal of the American Statistical Association.
Performance Evaluation.
Proceedings of the National Academy of Sciences.
Finite Mixture and Markov Switching Models. Springer Science & Business Media.
Frühwirth-Schnatter, S. and G. Malsiner-Walli (2018). “From here to infinity: sparse finite versus Dirichlet process mixtures in model-based clustering”. In: Advances in Data Analysis and Classification. issn: 1862-5355. doi: 10.1007/s11634-018-0329-y. url: https://doi.org/10.1007/s11634-018-0329-y.
Gandy, A. and L. A. M. Veraart (2016). “A Bayesian methodology for systemic risk assessment in financial networks”. In: Management Science.
BMC Bioinformatics.
Gollini, I. and T. B. Murphy (2016). “Joint modelling of multiple network views”. In: Journal of Computational and Graphical Statistics.
Pattern Recognition.
Journal of the Royal Statistical Society: Series A (Statistics in Society).
Journal of the American Statistical Association.
arXiv preprint arXiv:1807.08038.
Hoff, P. D., A. E. Raftery, and M. S. Handcock (2002). “Latent space approaches to social network analysis”. In: Journal of the American Statistical Association.
Khan, J. et al. (2001). “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks”. In: Nature Medicine 7, pp. 673–679.
Nature.
Advances in Neural Information Processing Systems, pp. 556–562.
Lim, Y. J. and Y. W. Teh (2007). “Variational Bayesian approach to movie rating prediction”. In: Proceedings of KDD Cup and Workshop. Vol. 7, pp. 15–21.
Malsiner-Walli, G., S. Frühwirth-Schnatter, and B. Grün (2016). “Model-based clustering based on sparse finite Gaussian mixtures”. In: Statistics and Computing.
Journal of Computational and Graphical Statistics.
The Annals of Applied Statistics.
Journal of the Royal Statistical Society: Series B (Statistical Methodology).
ESAIM: Proceedings and Surveys 47, pp. 55–74.
Mnih, A. and R. R. Salakhutdinov (2008). “Probabilistic matrix factorization”. In: Advances in Neural Information Processing Systems, pp. 1257–1264.
Newman, M. E. J. (2001). “The structure of scientific collaboration networks”. In: Proceedings of the National Academy of Sciences.
Journal of the Royal Statistical Society: Series B (Statistical Methodology).
R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria.
Raftery, A. E. (2017). “Comment: Extending the Latent Position Model for Networks”. In: Journal of the American Statistical Association.
Network Science.
Journal of the Royal Statistical Society, Series B.
Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Proceedings of the 25th International Conference on Machine Learning. ACM, pp. 880–887.
Salter-Townshend, M. and T. B. Murphy (2013). “Variational Bayesian inference for the latent position cluster model for network data”. In: Computational Statistics & Data Analysis.
Statistical Analysis and Data Mining: The ASA Data Science Journal.
Social Networks 44, pp. 105–116.
Stephens, M. (2000). “Dealing with label switching in mixture models”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Econometrica: Journal of the Econometric Society, pp. 24–36.
Vanhems, P., A. Barrat, C. Cattuto, J. F. Pinton, N. Khanafer, C. Régis, B. Kim, B. Comte, and N. Voirin (2013). “Estimating potential infection transmission routes in hospital wards using wearable proximity sensors”. In: PLoS ONE.
Journal of the American Statistical Association 82, pp. 8–19.
Zhang, B. and S. Horvath (2005). “A general framework for weighted gene co-expression network analysis”. In: Statistical Applications in Genetics and Molecular Biology.
Appendix
Throughout the appendix: $1 \leq i \leq M$, $1 \leq j \leq N$ and $1 \leq k, h \leq K$.

A.1 Proof of Proposition 1
Proof.
We can rewrite the model evidence $p(X)$ as follows:

$p(X) = p(X) \, \dfrac{p(X, \phi)}{p(X, \phi)} = \dfrac{p(X, \phi)}{p(\phi \mid X)}$    (21)

Now, we apply a log transformation, and add and subtract $\log q(\phi)$ to the right hand side:

$\log p(X) = \log p(X, \phi) - \log p(\phi \mid X) + \log q(\phi) - \log q(\phi)$    (22)

On the left hand side we have the model log-evidence, which does not depend on $\phi$. Hence, taking the expectations of both sides with respect to $q$ leads to the following:

$\log p(X) = E_q[\log p(X, \phi)] - E_q[\log q(\phi)] + KL[\, q(\cdot) \,\|\, \pi(\cdot \mid X) \,]$    (23)

It is clear that, if we define the variational free energy as follows:

$F(\tilde{\phi}) = E_q[\log p(X, \phi)] - E_q[\log q(\phi)]$    (24)

minimising the KL divergence is equivalent to maximising $F$, as $\log p(X)$ does not depend on $\phi$. Also, since the KL divergence is always nonnegative, $F$ functions as a lower bound for the model evidence.

What remains to be proved is that the definition of the variational free energy given in (10) corresponds to the one derived in (24), up to addition by a constant. The functional $F$ can be decomposed into a number of terms as follows:

$F(\tilde{\phi}) = E_q[\log p(X, \phi)] - E_q[\log q(\phi)]$
$\quad = E_q[\ell_X(Z, U, V)] + E_q[\log \pi(Z \mid \lambda)] + E_q[\log \pi(\lambda \mid \delta)] + E_q[\log \pi(U \mid \gamma)] + E_q[\log \pi(V \mid \gamma)] + E_q[\log \pi(\gamma \mid a, b)]$
$\quad\;\; - E_q[\log q(Z \mid \tilde{\lambda})] - E_q[\log q(\lambda \mid \tilde{\delta})] - E_q[\log q(U \mid \tilde{\alpha}^U, \tilde{\beta}^U)] - E_q[\log q(V \mid \tilde{\alpha}^V, \tilde{\beta}^V)] - E_q[\log q(\gamma \mid \tilde{a}, \tilde{b})]$    (25)

Before characterising the expectations on the right hand side of (25), we focus our attention on the random variables

$\Delta_{ijk} = U_{ik} - V_{jk}, \qquad \theta_{ijk} = \Delta_{ijk}^2 = (U_{ik} - V_{jk})^2$    (26)

According to $q$, we know that $\Delta_{ijk} \sim \mathrm{Gaussian}\!\left( \tilde{\alpha}^U_{ik} - \tilde{\alpha}^V_{jk},\; \tilde{\beta}^U_{ik} + \tilde{\beta}^V_{jk} \right)$, and that the variational density on $\theta_{ijk}$ does not have, in general, an explicit form. However, we do know its expectation and variance:

$\tilde{\eta}_{ijk} := E_q[\theta_{ijk}] = E_q[\Delta_{ijk}^2] = \mathrm{Var}_q(\Delta_{ijk}) + (E_q[\Delta_{ijk}])^2 = \tilde{\beta}^U_{ik} + \tilde{\beta}^V_{jk} + (\tilde{\alpha}^U_{ik} - \tilde{\alpha}^V_{jk})^2$

$\tilde{\zeta}_{ijk} := \mathrm{Var}_q(\theta_{ijk}) = E_q[\Delta_{ijk}^4] - (E_q[\Delta_{ijk}^2])^2$
$\quad = (\tilde{\alpha}^U_{ik} - \tilde{\alpha}^V_{jk})^4 + 6 (\tilde{\alpha}^U_{ik} - \tilde{\alpha}^V_{jk})^2 (\tilde{\beta}^U_{ik} + \tilde{\beta}^V_{jk}) + 3 (\tilde{\beta}^U_{ik} + \tilde{\beta}^V_{jk})^2 - \left[ (\tilde{\beta}^U_{ik} + \tilde{\beta}^V_{jk}) + (\tilde{\alpha}^U_{ik} - \tilde{\alpha}^V_{jk})^2 \right]^2$
$\quad = 4 (\tilde{\alpha}^U_{ik} - \tilde{\alpha}^V_{jk})^2 (\tilde{\beta}^U_{ik} + \tilde{\beta}^V_{jk}) + 2 (\tilde{\beta}^U_{ik} + \tilde{\beta}^V_{jk})^2 = 2 \tilde{\eta}_{ijk}^2 - 2 (\tilde{\alpha}^U_{ik} - \tilde{\alpha}^V_{jk})^4$    (27)

These two quantities play a fundamental role and are used to approximate the expectation of the log-likelihood with respect to $q$. Now, we evaluate the expectations in (25).
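As a quick sanity check of the moment formulas in (27), one can compare them with Monte Carlo estimates. The following R snippet is purely illustrative and is not part of the estimation algorithm; the values of m and s are arbitrary.

    # Monte Carlo check (illustrative) of the mean and variance of
    # theta = Delta^2, where Delta ~ Normal(mean = m, variance = s).
    set.seed(1)
    m <- 0.7      # plays the role of alpha^U_ik - alpha^V_jk
    s <- 0.3      # plays the role of beta^U_ik + beta^V_jk
    delta <- rnorm(1e6, mean = m, sd = sqrt(s))
    theta <- delta^2

    eta_formula  <- s + m^2                  # E[theta] as in (27)
    zeta_formula <- 4 * m^2 * s + 2 * s^2    # Var[theta] as in (27)

    c(mc_mean = mean(theta), formula_mean = eta_formula)
    c(mc_var  = var(theta),  formula_var  = zeta_formula)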
Expectation of the log-likelihood.

$E_q[\ell_X(Z, U, V)] = E_q\!\left[ \sum_{i,j,k} Z_{ijk} \{ \log \theta_{ijk} - x_{ij} \theta_{ijk} \} \right]$
$\quad = \sum_{i,j,k} E_q[Z_{ijk}] \{ E_q[\log \theta_{ijk}] - x_{ij} E_q[\theta_{ijk}] \}$
$\quad = \sum_{i,j,k} \tilde{\lambda}_{ijk} \{ E_q[\log \theta_{ijk}] - x_{ij} \tilde{\eta}_{ijk} \}$
$\quad \approx \sum_{i,j,k} \tilde{\lambda}_{ijk} \left\{ \psi\!\left( \tilde{\eta}_{ijk}^2 / \tilde{\zeta}_{ijk} \right) - \log\!\left( \tilde{\eta}_{ijk} / \tilde{\zeta}_{ijk} \right) - x_{ij} \tilde{\eta}_{ijk} \right\}$    (28)

The approximation in the last line of (28) comes from the fact that we have replaced the density of $\theta_{ijk}$ with a gamma density calibrated to have the same mean and variance. In fact, the true density does not have an explicit form, nor can we evaluate the expectation of $\log(\theta_{ijk})$ exactly. Once the replacement is made, the expectation of the log of a gamma variable is readily available. Other types of approximations may be considered to evaluate this expectation, such as Monte Carlo estimators or quadrature methods; however, these require a much larger computational cost. Taylor approximations are more often used (see for example Salter-Townshend and Murphy 2013 and Gollini and Murphy 2016). The choice of this gamma calibration is purely pragmatic: in this particular framework, the approximation proposed leads to much smaller errors than a second order Taylor approximation (not shown).

Expectation of the log prior on Z.

$E_q[\log \pi(Z \mid \lambda)] = E_q\!\left[ \sum_{i,j,k} Z_{ijk} \log \lambda_k \right] = \sum_{i,j,k} E_q[Z_{ijk}] \, E_q[\log \lambda_k] = \sum_{i,j,k} \tilde{\lambda}_{ijk} \left[ \psi(\tilde{\delta}_k) - \psi\!\left( \sum_h \tilde{\delta}_h \right) \right]$    (29)

Expectation of the log prior on $\lambda$.

$E_q[\log \pi(\lambda \mid \delta)] = \log \Gamma\!\left( \sum_k \delta_k \right) - \sum_k \log \Gamma(\delta_k) + \sum_k (\delta_k - 1) E_q[\log \lambda_k]$
$\quad = \mathrm{const} + \sum_k (\delta_k - 1) \left[ \psi(\tilde{\delta}_k) - \psi\!\left( \sum_h \tilde{\delta}_h \right) \right]$    (30)

Expectation of the log prior on U.

$E_q[\log \pi(U \mid \gamma)] = \sum_{i,k} E_q\!\left[ -\tfrac{1}{2} \log(2\pi) + \tfrac{1}{2} \log \gamma_k - \tfrac{\gamma_k}{2} U_{ik}^2 \right]$
$\quad = \mathrm{const} + \tfrac{M}{2} \sum_k E_q[\log \gamma_k] - \tfrac{1}{2} \sum_{i,k} E_q[\gamma_k] \, E_q[U_{ik}^2]$
$\quad = \mathrm{const} + \tfrac{M}{2} \sum_k \left[ \psi(\tilde{a}_k) - \log \tilde{b}_k \right] - \tfrac{1}{2} \sum_{i,k} \frac{\tilde{a}_k}{\tilde{b}_k} \left( \tilde{\beta}^U_{ik} + (\tilde{\alpha}^U_{ik})^2 \right)$    (31)

An equivalent result holds for the expectation of the prior on V.

Expectation of the log prior on $\gamma$.

$E_q[\log \pi(\gamma \mid a, b)] = \sum_k E_q\!\left[ a_k \log b_k - \log \Gamma(a_k) + (a_k - 1) \log \gamma_k - b_k \gamma_k \right]$
$\quad = \mathrm{const} + \sum_k (a_k - 1) E_q[\log \gamma_k] - \sum_k b_k E_q[\gamma_k]$
$\quad = \mathrm{const} + \sum_k (a_k - 1) \left[ \psi(\tilde{a}_k) - \log \tilde{b}_k \right] - \sum_k b_k \frac{\tilde{a}_k}{\tilde{b}_k}$    (32)
Entropy terms. The remaining expectations correspond to the negative entropies of common distributions, whose formulas are well known.

$E_q[\log q(Z \mid \tilde{\lambda})] = \sum_{i,j,k} \tilde{\lambda}_{ijk} \log \tilde{\lambda}_{ijk}$

$E_q[\log q(\lambda \mid \tilde{\delta})] = \log \Gamma\!\left( \sum_k \tilde{\delta}_k \right) + \sum_k \left\{ (\tilde{\delta}_k - 1) \left[ \psi(\tilde{\delta}_k) - \psi\!\left( \sum_h \tilde{\delta}_h \right) \right] - \log \Gamma(\tilde{\delta}_k) \right\}$

$E_q[\log q(U \mid \tilde{\alpha}^U, \tilde{\beta}^U)] = \mathrm{const} - \tfrac{1}{2} \sum_{i,k} \log\!\left( \tilde{\beta}^U_{ik} \right)$

$E_q[\log q(\gamma \mid \tilde{a}, \tilde{b})] = \sum_k \left\{ (\tilde{a}_k - 1) \psi(\tilde{a}_k) - \tilde{a}_k - \log \Gamma(\tilde{a}_k) + \log(\tilde{b}_k) \right\}$    (33)

The variational free energy in (10) is obtained by combining all the formulas and rearranging the terms.
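The gamma calibration used in (28) can also be checked numerically. The short R snippet below compares the moment-matched gamma approximation of the expectation of the log with a Monte Carlo estimate; it is purely illustrative and does not refer to the package code.

    # Illustrative check of the moment-matched gamma approximation in (28):
    # approximate E[log theta], theta = Delta^2 with Delta ~ Normal(m, s), using
    # a Gamma(shape = eta^2 / zeta, rate = eta / zeta) with the same mean and variance.
    set.seed(2)
    m <- 0.7
    s <- 0.3
    eta  <- s + m^2
    zeta <- 4 * m^2 * s + 2 * s^2

    approx_Elog <- digamma(eta^2 / zeta) - log(eta / zeta)

    delta   <- rnorm(1e6, mean = m, sd = sqrt(s))
    mc_Elog <- mean(log(delta^2))

    c(gamma_approx = approx_Elog, monte_carlo = mc_Elog)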
A.2 Proof of Proposition 2

Proof.
The maximisation of $F$ with respect to $(\tilde{\lambda}_{ij1}, \ldots, \tilde{\lambda}_{ijK})$, under the constraint $\sum_k \tilde{\lambda}_{ijk} = 1$, is performed using the method of Lagrange multipliers. Define the following Lagrangian:

$H_{ij}(\tilde{\lambda}_{ij1}, \ldots, \tilde{\lambda}_{ijK}, \xi_{ij}) = F(\tilde{\phi}) + \xi_{ij} \left( \sum_k \tilde{\lambda}_{ijk} - 1 \right) = \mathrm{const} + \sum_k \tilde{\lambda}_{ijk} \left\{ C_k - \log \tilde{\lambda}_{ijk} + \xi_{ij} \right\} - \xi_{ij}$    (34)

where

$C_k = \psi\!\left( \tilde{\eta}_{ijk}^2 / \tilde{\zeta}_{ijk} \right) - \log\!\left( \tilde{\eta}_{ijk} / \tilde{\zeta}_{ijk} \right) - x_{ij} \tilde{\eta}_{ijk}$    (35)

The derivatives of $H_{ij}$ are equal to the following:

$\dfrac{\partial H_{ij}}{\partial \tilde{\lambda}_{ijk}} (\tilde{\lambda}_{ij1}, \ldots, \tilde{\lambda}_{ijK}, \xi_{ij}) = C_k - \log \tilde{\lambda}_{ijk} + \xi_{ij} - 1$    (36)

Setting the derivative equal to zero and exponentiating gives the following solution:

$\tilde{\lambda}^*_{ijk} = \exp\{ C_k + \xi_{ij} - 1 \}$    (37)

Now, we do the same with the derivative with respect to $\xi_{ij}$:

$\dfrac{\partial H_{ij}}{\partial \xi_{ij}} (\tilde{\lambda}_{ij1}, \ldots, \tilde{\lambda}_{ijK}, \xi_{ij}) = \sum_k \tilde{\lambda}_{ijk} - 1 = \exp\{ \xi_{ij} \} \sum_k \exp\{ C_k - 1 \} - 1$    (38)

and obtain:

$\xi_{ij} = - \log\!\left( \sum_k \exp\{ C_k - 1 \} \right)$    (39)

which leads to (12). A study of the sign of the partial derivatives confirms that this point is a maximum.
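The resulting update is a softmax over the quantities C_k. A numerically stable way to compute it is sketched below in R; this illustrates the update implied by (37)-(39) and is not the package implementation. The vector C_example is arbitrary.

    # Illustrative computation of the optimal lambda_tilde[i, j, ] given a vector C
    # of length K, using the log-sum-exp trick for numerical stability.
    # The normalised result equals exp(C) / sum(exp(C)), i.e. a softmax over C.
    update_lambda <- function(C) {
      C_shifted <- C - max(C)      # avoids overflow in exp()
      w <- exp(C_shifted)
      w / sum(w)
    }

    # Example with arbitrary values of C:
    C_example <- c(-3.2, 0.5, -0.1, -7.4)
    update_lambda(C_example)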
A.3 Proof of Proposition 3

Proof.

The first derivative of $F$ with respect to $\tilde{\delta}_k$, for some $k$, is equal to:

$\dfrac{\partial F}{\partial \tilde{\delta}_k} (\tilde{\phi}) = \left[ \psi'(\tilde{\delta}_k) - \psi'\!\left( \sum_h \tilde{\delta}_h \right) \right] \sum_{i,j} \tilde{\lambda}_{ijk} + (\delta_k - \tilde{\delta}_k) \left[ \psi'(\tilde{\delta}_k) - \psi'\!\left( \sum_h \tilde{\delta}_h \right) \right] - \left[ \psi(\tilde{\delta}_k) - \psi\!\left( \sum_h \tilde{\delta}_h \right) \right] + \left[ \psi(\tilde{\delta}_k) - \psi\!\left( \sum_h \tilde{\delta}_h \right) \right]$
$\quad = \left[ \psi'(\tilde{\delta}_k) - \psi'\!\left( \sum_h \tilde{\delta}_h \right) \right] \left( \sum_{i,j} \tilde{\lambda}_{ijk} + \delta_k - \tilde{\delta}_k \right)$    (40)

which is equal to zero if and only if $\tilde{\delta}_k = \delta_k + \sum_{i,j} \tilde{\lambda}_{ijk}$. A study of the sign of the partial derivatives confirms that this point is a maximum.
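In R, this update is a one-line computation once the variational probabilities are available as an array. The sketch below uses placeholder values purely so that it runs; neither the object names nor the hyperparameter value come from the paper or the package.

    # Illustration of the Proposition 3 update: delta_tilde_k = delta_k + sum_{i,j} lambda_tilde_{ijk}.
    # 'lambda_tilde' and 'delta_prior' are placeholders, not SparseLPM objects.
    M <- 5; N <- 4; K <- 3
    delta_prior  <- rep(0.1, K)                       # assumed symmetric Dirichlet hyperparameter
    lambda_tilde <- array(1 / K, dim = c(M, N, K))    # placeholder variational probabilities
    delta_tilde  <- delta_prior + apply(lambda_tilde, 3, sum)
    delta_tilde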
A.4 Proof of Proposition 4

Proof.
The first derivatives of $F$ with respect to $\tilde{a}_k$ and $\tilde{b}_k$, for some $k$, are equal to:

$\dfrac{\partial F}{\partial \tilde{a}_k} (\tilde{\phi}) = \psi'(\tilde{a}_k) \left( \tfrac{M + N}{2} + a_k - \tilde{a}_k \right) + 1 - \dfrac{b_k + S_k}{\tilde{b}_k}$

$\dfrac{\partial F}{\partial \tilde{b}_k} (\tilde{\phi}) = - \dfrac{a_k + \tfrac{M + N}{2}}{\tilde{b}_k} + \dfrac{\tilde{a}_k S_k + b_k \tilde{a}_k}{\tilde{b}_k^2} = \dfrac{\tilde{a}_k S_k + b_k \tilde{a}_k - \tilde{b}_k \left( a_k + \tfrac{M + N}{2} \right)}{\tilde{b}_k^2}$    (41)

Setting these equal to zero and solving for $\tilde{a}_k$ and $\tilde{b}_k$ gives (14). A study of the sign of the partial derivatives confirms that this point is a maximum.

A.5 Proof of Proposition 5
Proof.
Let us simplify the notation by denoting $\alpha = \tilde{\alpha}^U_{ik}$, $\beta = \tilde{\beta}^U_{ik}$, $F(\alpha, \beta) = F(\tilde{\phi})$, and let $q(\alpha, \beta)$ denote the corresponding variational Gaussian density. To avoid the fact that $\beta$ is constrained to be positive, we derive the parameter updates for its log and then transform back. Hence, let us define the function

$H(\alpha, \xi) = F(\alpha, e^{\xi})$    (42)

where we have used the transformation $\beta = \beta(\xi) = e^{\xi}$. Using the chain rule:

$\dfrac{\partial H}{\partial \xi} (\alpha, \xi) = \dfrac{\partial F}{\partial \beta} (\alpha, e^{\xi}) \, \dfrac{\partial \beta}{\partial \xi} (\xi) = e^{\xi} \, \dfrac{\partial F}{\partial \beta} (\alpha, e^{\xi})$    (43)

Thus, a standard gradient ascent update for $(\alpha_t, \xi_t)$ simply reads as follows:

$\alpha_{t+1} = \alpha_t + \varepsilon \, \dfrac{\partial F}{\partial \alpha} (\alpha_t, e^{\xi_t}), \qquad \xi_{t+1} = \xi_t + \varepsilon \, e^{\xi_t} \, \dfrac{\partial F}{\partial \beta} (\alpha_t, e^{\xi_t})$    (44)

Now, following Amari (1998), we propose instead a natural gradient ascent, which essentially corrects the direction of the gradient to account for the geometry of the space of functions we optimise over. In practice, the direction of the gradient is corrected by premultiplying it by the inverse Fisher information associated to the likelihood $q(\alpha, \xi)$. The log-likelihood and its second derivatives read as follows:

$\log[ q(\alpha, \xi) ] = \mathrm{const} - \dfrac{\xi}{2} - \dfrac{(x - \alpha)^2}{2 e^{\xi}}$
$\dfrac{\partial^2 \log[ q(\alpha, \xi) ]}{\partial \alpha^2} = - e^{-\xi}, \qquad \dfrac{\partial^2 \log[ q(\alpha, \xi) ]}{\partial \alpha \, \partial \xi} = - (x - \alpha) e^{-\xi}, \qquad \dfrac{\partial^2 \log[ q(\alpha, \xi) ]}{\partial \xi^2} = - \dfrac{(x - \alpha)^2}{2} e^{-\xi}$    (45)

Hence, the Fisher information and its inverse are equal to the following:

$I(\alpha, \xi) = \begin{pmatrix} e^{-\xi} & 0 \\ 0 & \tfrac{1}{2} \end{pmatrix}, \qquad I^{-1}(\alpha, \xi) = \begin{pmatrix} e^{\xi} & 0 \\ 0 & 2 \end{pmatrix}$    (46)

Using these results, we can state the natural gradient ascent update for $(\alpha, \xi)$, which reads as follows:

$\alpha_{t+1} = \alpha_t + \varepsilon \, e^{\xi_t} \, \dfrac{\partial F}{\partial \alpha} (\alpha_t, e^{\xi_t}), \qquad \xi_{t+1} = \xi_t + 2 \varepsilon \, e^{\xi_t} \, \dfrac{\partial F}{\partial \beta} (\alpha_t, e^{\xi_t})$    (47)

Now, we simply take the exp of the second line, obtaining:

$\alpha_{t+1} = \alpha_t + \varepsilon \, e^{\xi_t} \, \dfrac{\partial F}{\partial \alpha} (\alpha_t, e^{\xi_t}), \qquad e^{\xi_{t+1}} = e^{\xi_t} \exp\left\{ 2 \varepsilon \, e^{\xi_t} \, \dfrac{\partial F}{\partial \beta} (\alpha_t, e^{\xi_t}) \right\}$    (48)

which is equivalent to (15) once the log transformation is undone:

$\alpha_{t+1} = \alpha_t + \varepsilon \, \beta_t \, \dfrac{\partial F}{\partial \alpha} (\alpha_t, \beta_t), \qquad \beta_{t+1} = \beta_t \exp\left\{ 2 \varepsilon \, \beta_t \, \dfrac{\partial F}{\partial \beta} (\alpha_t, \beta_t) \right\}$    (49)

Note that the auxiliary variable $\xi$ does not appear in (15). Finally, we shall prove that a small enough $\varepsilon$ leads to a non-decrease of $F$. Let us consider again (47), and let us define:

$\Delta_\alpha = e^{\xi_t} \, \dfrac{\partial F}{\partial \alpha} (\alpha_t, e^{\xi_t}), \qquad \Delta_\xi = 2 \, e^{\xi_t} \, \dfrac{\partial F}{\partial \beta} (\alpha_t, e^{\xi_t})$    (50)

Using a first order Taylor expansion around $(\alpha_t, \xi_t)$, there exists a small enough $\varepsilon > 0$ such that:

$H(\alpha_t + \varepsilon \Delta_\alpha, \, \xi_t + \varepsilon \Delta_\xi) \approx H(\alpha_t, \xi_t) + \varepsilon \, \nabla H(\alpha_t, \xi_t)' \begin{pmatrix} \Delta_\alpha \\ \Delta_\xi \end{pmatrix}$
$\quad = H(\alpha_t, \xi_t) + \varepsilon \, e^{\xi_t} \left[ \dfrac{\partial F}{\partial \alpha} (\alpha_t, e^{\xi_t}) \right]^2 + 2 \varepsilon \, e^{2 \xi_t} \left[ \dfrac{\partial F}{\partial \beta} (\alpha_t, e^{\xi_t}) \right]^2 \geq H(\alpha_t, \xi_t)$    (51)

or, equivalently: $F(\alpha_{t+1}, \beta_{t+1}) \geq F(\alpha_t, \beta_t)$.
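A toy R sketch of the natural-gradient step in (49) is given below, using a simple synthetic objective in place of the actual free energy. The function F_toy, the numerical gradients and the step size are illustrative assumptions, not quantities from the paper; the multiplicative update keeps beta positive by construction.

    # Illustrative natural-gradient ascent step for (alpha, beta), beta > 0, as in (49).
    # 'F_toy' stands in for the variational free energy restricted to one (alpha, beta)
    # pair; it is an arbitrary smooth function chosen only to make the example runnable.
    F_toy <- function(alpha, beta) {
      -(alpha - 1)^2 - (log(beta) - 0.5)^2   # maximised at alpha = 1, beta = exp(0.5)
    }

    # Numerical partial derivatives of F_toy.
    dF_dalpha <- function(alpha, beta, h = 1e-6) (F_toy(alpha + h, beta) - F_toy(alpha - h, beta)) / (2 * h)
    dF_dbeta  <- function(alpha, beta, h = 1e-6) (F_toy(alpha, beta + h) - F_toy(alpha, beta - h)) / (2 * h)

    natural_gradient_step <- function(alpha, beta, eps = 0.1) {
      alpha_new <- alpha + eps * beta * dF_dalpha(alpha, beta)
      beta_new  <- beta * exp(2 * eps * beta * dF_dbeta(alpha, beta))  # stays positive
      c(alpha = alpha_new, beta = beta_new)
    }

    # A few iterations from an arbitrary starting point:
    alpha <- 0; beta <- 1
    for (iter in 1:50) {
      step  <- natural_gradient_step(alpha, beta)
      alpha <- step[["alpha"]]
      beta  <- step[["beta"]]
    }
    c(alpha = alpha, beta = beta)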