The Sparse Latent Position Model for nonnegative weighted networks
Riccardo Rastelli ([email protected])
Institute for Statistics and Mathematics, Vienna University of Economics and Business, Austria.
Abstract
This paper introduces a new methodology to analyse bipartite and unipartite networks with nonnegative edge values. The proposed approach combines and adapts a number of ideas from the literature on latent variable network models. The resulting framework is a new type of latent position model which exhibits great flexibility, and is able to capture important features that are generally exhibited by observed networks, such as sparsity and heavy tailed degree distributions. A crucial advantage of the proposed method is that the number of latent dimensions is automatically deduced from the data in one single algorithmic framework. In addition, the model attaches a weight to each of the latent dimensions, hence providing a measure of their relative importance. A fast variational Bayesian algorithm is proposed to estimate the parameters of the model. Finally, applications of the proposed methodology are illustrated on both artificial and real datasets, and comparisons with other existing procedures are provided.
Keywords:
Weighted networks; Latent position models; Sparse finite mixture models; Variational inference; Stochastic blockmodels.
1 Introduction

Statistical models for networks are most often designed for binary interactions between entities. However, in many cases, the edges carry additional information that may provide crucial insights in the analyses. One common situation is that of nonnegative interaction values, where each edge carries a positive real number representing, for example, the intensity or the length of the interaction. This type of scenario is extremely common and it tends to arise in a variety of fields: in biology, microarray data measure the expression level of genes among a set of samples (Brunet et al. 2004); gene co-expression networks characterise the similarities between genes based on their expression level (Zhang and Horvath 2005); and mutualistic networks describe how often a pollinator visits certain plants (Dormann et al. 2008). In neuroscience, nonnegative networks may be used to represent functional connectivity levels between different areas of the brain (Bullmore and Sporns 2009). In finance, the nonnegative values may refer to the liabilities of one financial institution towards others (Gandy and Veraart 2016).

The available literature on the analysis of weighted networks, and, in particular, of nonnegative weighted networks, is limited. This is especially true for the family of Latent Position Models (LPM), which have been widely used in statistical network analyses since the seminal work of Hoff et al. (2002). In these network models, one postulates that the nodes are points embedded in a latent space, and that the propensity to create edges is determined by the Euclidean distance between the two nodes involved. To the best of my knowledge, in the stream of literature that originated from Hoff et al. (2002), the only extension of the original LPM to the analysis of weighted interactions is due to Sewell and Chen (2016).

Another fundamental model used in the statistical analyses of networks is the Stochastic Blockmodel (SBM) of Wang and Wong (1987), or its equivalent bipartite variant, called Latent Blockmodel (LBM, Govaert and Nadif 2003). These two models assume a partition on the set of nodes of the graph, and postulate that edge probabilities are determined by the cluster memberships of the nodes at its extremities. Extensions of the SBM and LBM to the analysis of weighted networks are also few. In this context, two relevant papers are those of Mariadassou et al. (2010) and Aicher et al. (2014). Recent reviews of the LPM, SBM and their extensions may be found in Salter-Townshend et al. (2012), Matias and Robin (2014), and Raftery (2017).

This paper introduces a new modelling approach for the analysis of nonnegative interaction data. The model proposed, called the Sparse Latent Position Model (SLPM), modifies the original LPM to account for nonnegative weighted edges. The model adapts a number of ideas from the literature on blockmodels and finite mixture models: this results in a hybrid framework that exhibits great flexibility, and that is able to elegantly address some of the crucial issues and research questions encountered in LPMs and SBMs.

Similarly to the LPM, the SLPM asserts that nodes are characterised by a K-dimensional vector of coordinates. However, while in the LPM the edge values are determined by the latent Euclidean distances, the SLPM postulates that each dimension gives a separate contribution to the edge values, and that each of these contributions is, in fact, determined by the distance between the two nodes in the corresponding dimension.
Essentially, the SLPM may be interpreted as a finite mixture of K unidimensional LPMs in a weighted network context. This paper illustrates that the structure of the SLPM allows great flexibility, and that it can naturally capture relevant features of the topologies of observed networks, such as sparsity of edges and heavy tailed degree distributions.

In addition, one important facet of the SLPM is the sparsity with respect to the mixture components, inherited from recent works on Gaussian finite mixtures (Malsiner-Walli et al. 2016; Malsiner-Walli et al. 2017) and other types of latent variable models. In a sparse mixture model, one deliberately creates overfitting by considering a relatively large number of latent groups; however, once an appropriate shrinkage prior is defined on the mixing proportions, the superfluous groups are emptied during the estimation. This approach is strongly motivated by the theoretical results of Rousseau and Mengersen (2011), where the authors establish the necessary conditions for the posterior distribution to concentrate around the correct number of groups.

In the SLPM, the sparse representation ensures that the number of mixture components is deduced from the data. Since these groups correspond to the latent dimensions, the dimensionality of the latent space is automatically estimated from the data. Furthermore, the mixing proportions of the mixture acquire a rather interesting interpretation, since they measure how useful each latent dimension is in explaining the observed data.

Regarding the estimation of the SLPM, a variational Bayesian method (Attias 1999) is proposed. Variational approximations have been largely used in the last two decades in a variety of latent variable models (a recent review may be found in Blei et al. 2017). Although, from a theoretical perspective, the implications of the approximation are not yet completely understood, a variety of applications (Blei et al. 2017) have proved that these methods can offer a number of appealing advantages during the estimation process.

Variational approximations have been proposed both for LPM frameworks (Salter-Townshend and Murphy 2013; Gollini and Murphy 2016) and for the SBM and its extensions (Daudin et al. 2008; Airoldi et al. 2008; Latouche 2010; Matias and Miele 2017; Bouveyron et al. 2018). In the SLPM context, the variational Bayesian method is rather convenient, since, differently from sampling-based approaches, it relies on a fast optimisation framework. Hence, it allows one to bypass the non-identifiability issues associated with the cluster membership variables and with the latent positions, while, at the same time, providing a rough approximation of the posterior uncertainty. The variational algorithm described in this paper has been implemented in the R package SparseLPM, which is publicly available from
CRAN (R Core Team 2017).

The paper is organised as follows: in Section 2 the SLPM is formally introduced and studied. The section contains a detailed description of its main properties, and a number of comparisons with other well known models and with the existing literature. Section 3 describes the variational algorithm, and the details regarding the optimisation procedure, including a useful initialisation strategy. Finally, Sections 4 and 5 illustrate applications of the methodology to artificial and real datasets, respectively.
2 The Sparse Latent Position Model

2.1 Model definition

The observed data matrix X has M rows and N columns, and its entries are nonnegative. In a network context, X defines the nonnegative adjacency matrix of a bipartite graph, whereby the nodes are separated into two sets (of cardinalities M and N, respectively), and only the interactions between nodes of different types are allowed. It is important to note that, as a special case, the unipartite weighted graph structure is recovered if X is a square matrix and its rows and columns are indexed with labels from the same set.

The observations X = { x_ij : i = 1, ..., M; j = 1, ..., N } are assumed to be independent realisations from a finite mixture of K exponential densities:

    p(x_ij | λ, θ_ij) = Σ_{k=1}^{K} λ_k f_E(x_ij; θ_ijk)    (1)

where f_E(·; θ) denotes an exponential density with rate θ, and the vector θ_ij = { θ_ij1, ..., θ_ijK } indicates the component-specific rates, for every edge (i, j).

In the SLPM, the K groups of the mixture distribution are re-interpreted as the K latent dimensions of a LPM, where each dimension gives a separate contribution in determining the probabilities of the edges. In particular, the latent variable θ_ijk corresponds to the squared Euclidean distance between i and j in the k-th dimension, i.e.:

    θ_ijk = (U_ik − V_jk)^2    (2)

where U_ik (resp. V_jk) indicates the k-th coordinate of i (resp. j) in the latent space R^K. Since the expectation of an exponential random variable is equal to its inverse rate, (2) guarantees that the generated edge weight is expected to be large (resp. small) if the two nodes are close (resp. far apart) in every latent dimension.

As is common in mixture modelling, an equivalent representation of the same model can be obtained using data augmentation. An allocation variable Z_ij is attached to the edge (i, j), for all 1 ≤ i ≤ M and 1 ≤ j ≤ N, such that:

    Z_ij = k with probability λ_k, for k = 1, ..., K.    (3)

Now, the conditional log-likelihood reads as follows:

    ℓ_X(Z, U, V) = Σ_{i=1}^{M} Σ_{j=1}^{N} Σ_{k=1}^{K} Z_ijk { log[(U_ik − V_jk)^2] − x_ij (U_ik − V_jk)^2 }    (4)

where Z_ijk = 1 if Z_ij = k and Z_ijk = 0 otherwise.

Finally, additional hierarchical levels are considered for the mixing proportions and for the latent positions:

    (λ_1, ..., λ_K) ~ Dirichlet(δ_1, ..., δ_K)
    U_ik ~ Gaussian(0, 1/γ_k)
    V_jk ~ Gaussian(0, 1/γ_k)
    γ_k ~ Gamma(a_k, b_k)    (5)

for all 1 ≤ i ≤ M, 1 ≤ j ≤ N and 1 ≤ k ≤ K. Here, Gaussian refers to the univariate normal density. The dependencies between the model parameters are summarised in the graphical representation of Figure 1.
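To make the generative mechanism in (1)-(5) concrete, the following R lines simulate a small bipartite SLPM network. This is only a minimal sketch: the network sizes, the value of K and the hyperparameter values are arbitrary illustrative choices, not settings taken from the paper.

set.seed(1)
M <- 20; N <- 30; K <- 3              # illustrative sizes and number of latent dimensions
delta <- rep(0.5, K); a <- 1; b <- 1  # illustrative hyperparameters
w <- rgamma(K, shape = delta)         # Dirichlet draw via normalised Gamma variables
lambda <- w / sum(w)                  # mixing proportions
gamma_k <- rgamma(K, shape = a, rate = b)  # dimension-specific precisions
U <- sapply(1:K, function(k) rnorm(M, mean = 0, sd = 1 / sqrt(gamma_k[k])))
V <- sapply(1:K, function(k) rnorm(N, mean = 0, sd = 1 / sqrt(gamma_k[k])))
Z <- matrix(sample(K, M * N, replace = TRUE, prob = lambda), nrow = M, ncol = N)
X <- matrix(0, M, N)
for (i in 1:M) for (j in 1:N) {
  k <- Z[i, j]                        # dimension allocated to edge (i, j)
  X[i, j] <- rexp(1, rate = (U[i, k] - V[j, k])^2)
}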
Figure 1: Graphical summary of the dependencies in the SLPM.

2.2 Interpretation and properties

A flexible mixture model.
Mixtures of exponential distributions satisfy a number of interesting properties, and have been well studied and widely applied in a variety of fields. The reader may find an overview of these properties in Frühwirth-Schnatter (2006) and references therein. In particular, one key property of these models is their flexibility: Bernstein's theorem states that any completely monotone probability density function may be written as a continuous mixture of exponentials (Feldmann and Whitt 1998, Theorem 3.1). In the present paper, this argument motivates the structure proposed in (1), asserting that the SLPM may be particularly appropriate to represent extremely complex network structures, when K is large enough.

Weight and degree distributions.
The SLPM is able to capture relevant features that are usually exhibited by observed networks. Figure 2 uses simulated data to demonstrate that the SLPM can naturally represent highly skewed and extremely heavy tailed weight and degree distributions. In fact, both panels show that the tails of the weight and degree distributions decay to zero more slowly than a power law. This is a desirable feature in many types of applications: for example, the topologies of social networks and financial networks are known to be characterised by the presence of hubs and by core-periphery structures (Barabási and Albert 1999; Newman 2001; Boss et al. 2004).

Figure 2: Log-log scale empirical distributions for the edge weights (left panel) and out-degrees (right panel) of a simulated SLPM network.
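Empirical log-log summaries of this kind are straightforward to produce from a simulated network. The R sketch below continues the simulation given after the model definition; the object X and the binning choices are illustrative assumptions rather than the code used for Figure 2.

w <- as.vector(X)
hist_w <- hist(log(w[w > 0]), breaks = 30, plot = FALSE)
plot(hist_w$mids, log(hist_w$density + 1e-12),
     xlab = "log edge value", ylab = "log proportion")
deg <- rowSums(X)                      # weighted out-degrees of the sender nodes
hist_d <- hist(log(deg), breaks = 15, plot = FALSE)
plot(hist_d$mids, log(hist_d$density + 1e-12),
     xlab = "log degree", ylab = "log proportion")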
Sparsity.

In recent years, sparse finite mixtures have received a lot of attention (see Malsiner-Walli et al. 2016 and references therein) since they are able to mimic the properties displayed by other common models (for example by the Dirichlet process mixture of Quintana and Iglesias 2003, as shown in Frühwirth-Schnatter and Malsiner-Walli 2018), while maintaining a very simple computational framework to perform inference. More importantly, sparse finite mixtures are able to efficiently assess many competing models in one single run of the estimation algorithm. In fact, they do not require grid searches over all the possible values of K, nor trans-dimensional samplers such as the reversible jump of Richardson and Green (1997). This generally leads to substantial gains in terms of computational efficiency.

A sparse finite mixture is a deliberately overfitted clustering model where one specifies a shrinkage prior on the mixture weights and on the component-specific parameters to ensure that superfluous groups are emptied (Malsiner-Walli et al. 2016; Malsiner-Walli et al. 2017). The SLPM adapts this idea in a weighted latent position network model context: overfitting is achieved by considering a large number of mixture components, while imposing a sparse Dirichlet prior distribution on the mixing proportions to ensure that superfluous groups become empty. In practice, such a sparse Dirichlet prior is attained through δ_1 = · · · = δ_K = δ < 1/K, as illustrated by Rousseau and Mengersen (2011).

As a consequence, the SLPM is a type of LPM where a large number of latent dimensions can be considered, hence guaranteeing the theoretical flexibility argued earlier in this section. At the same time, parsimony is attained using the shrinkage prior on the mixing proportions, making sure that only the relevant latent dimensions are actually used. The final product is a LPM where not only the number of latent dimensions is estimated from the data, but also where the dimensions are automatically ranked according to their importance, as indicated by the mixing proportions. This feature becomes very convenient when plotting the latent space: for example, one can select and show only the two most relevant latent dimensions, or the two latent dimensions that characterise the large edge values.

Another aspect of the sparsity regards the edge values: most observed weighted networks exhibit zero-inflated empirical distributions, where the majority of edges are absent or carry a null weight. The SLPM can account for this type of sparsity in a very natural way, as edge weights are shrunk towards zero when generated by latent dimensions which are characterised by a small precision γ_k, for 1 ≤ k ≤ K. In practice, in a fitted SLPM, one or more latent dimensions may be exploited to separate the null weights from the rest, hence obtaining a spike-and-slab type of sparsity.
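The effect of such a shrinkage Dirichlet prior is easy to visualise by simulation. The small R sketch below is an illustration only (the values of K and δ are arbitrary); it draws a vector of mixing proportions from a symmetric Dirichlet with a small concentration parameter and shows that most of the sampled weights are negligible.

K <- 10
delta <- 0.05                   # small concentration parameter, below 1/K
w <- rgamma(K, shape = delta)   # Dirichlet draw via normalised Gamma variables
lambda <- w / sum(w)
round(sort(lambda, decreasing = TRUE), 3)
# typically only one or two components carry a non-negligible weight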
Outliers.

In the data augmented representation of the SLPM, each edge is assigned to and characterised by one dimension of the latent space. It is possible that only one edge is assigned to a certain latent dimension. Since the shrinkage Dirichlet prior heavily penalises this type of behaviour, the only circumstance where a fitted SLPM may exhibit this feature is when the value carried by such an edge is an outlier that cannot be explained through the other relevant dimensions. That is, the SLPM may exploit the empty components to separate those edges that cannot be represented by the dimensions that are being used. This mechanism naturally highlights outliers in the data, or, equivalently, it may signal that, to some extent, the latent space representation is not particularly appropriate for the dataset considered.
Identifiability.
The likelihood of a latent position network model is known to be invariant under rotations, translations and reflections of the latent positions (Hoff et al. 2002). Markov chain Monte Carlo techniques may be used to obtain a sample from the posterior distribution of these models; however, post-processing then becomes necessary to deal with the identifiability issues and, thus, to attain meaningful interpretations. The SLPM is no exception in this regard, as its likelihood is invariant under translations and reflections. In addition, the likelihood is also unaffected by any relabelling of the latent dimensions, which, in turn, leads to the same label switching issues studied by Stephens (2000). The combination of these two sources of non-identifiability makes the available post-processing methods inadequate, and discourages the use of sampling-based methods, as their output cannot be efficiently interpreted. This paper considers an optimisation framework which bypasses all identifiability problems and makes the label switching and the translation invariance irrelevant.
Univariate networks.
Section 2.1 introduces the SLPM as a statistical model for bipartite networks. The canonical directed unipartite framework is obtained as a special case, when M = N and the nodes' labels vary in the same set. The SLPM can be easily modified to neglect self-edges, if these are not of interest.

If a directed unipartite scenario is considered, the SLPM separates the latent positions of nodes based on whether they are sending or receiving an edge. For example, the latent position of node i is given by U_i = { U_i1, ..., U_iK } when sending edges and by V_i = { V_i1, ..., V_iK } when receiving them. Although unusual, such a distinction may give interesting advantages when disassortative patterns are observed.

Finally, with some additional work, it may also be possible to extend the SLPM framework to handle undirected unipartite networks, where the latent positions U and V coincide. However, this scenario is not explored in the present paper and it is left to future work.

Latent position models.
Rastelli et al. (2016) illustrate the properties exhibited by the basic LPM, and show that this model can become very flexible when additional nodal random effects are added. In particular, they show that the basic LPM and its extensions are naturally designed to capture transitivity, homophily and heavy tailed degree distributions. In the SLPM there are no nodal random effects, and, similarly to the LPM, each of the nodes in the graph is characterised by a number of latent coordinates. However, the generative mechanism behind the SLPM makes use of these quantities in a different way, as each edge is determined by the coordinates in only one of the latent dimensions.

On the one hand, this modification makes the model much more flexible: for example, differently from the canonical LPM, two nodes located in the same position may not be stochastically equivalent conditionally on the allocation variables, in that the distributions generating their edges may be completely different. Lack of transitivity, which is common in unipartite networks with strong core-periphery structures, may also be represented by the SLPM: conditionally on Z, the variables X_ij and X_i'j are dependent only if they are allocated to the same latent group, for any i, i', j. In addition, the SLPM can capture homophily and, to some extent, the lack of it. In fact, two nodes that are close in the latent space are expected to be strongly connected, regardless of which dimension determines the edge weight. Nevertheless, if two nodes are close in at least one dimension and far apart in at least another, a wide range of scenarios become possible.

Disassortative mixing refers to the tendency of nodes to connect to those that are not similar to them. For example, disassortative mixing arises when many nodes are connected only to a large hub (core-periphery structure). In the SLPM context, the distinction between sender and receiver position may be useful to capture the fact that certain nodes act as bridges, receiving edges from a set of nodes and sending edges into a different set.
On the other hand, the SLPM is less interpretable than the LPM and its extensions. In fact, in the SLPM, the Euclidean distance between two K-dimensional coordinate vectors simply has no relevant meaning, as one should rather study the distances between nodes for each latent dimension separately. In addition, a fitted SLPM may be characterised by a large number of latent dimensions, making graphical representations impractical. So, generally speaking, extracting meaningful interpretations from a two-dimensional representation of the latent space of a SLPM may not be as straightforward as in other LPM frameworks.

One significant novelty introduced by the SLPM is that the number of latent dimensions is estimated from the data, and, in addition, a weight is assigned to each dimension to represent its importance. The estimation of K is a long standing problem that has rarely been directly addressed in the LPM literature, the main reason being that the arbitrary choice K = 2 provides a convenient framework to clearly show the results. One relevant work in this context is Handcock et al. (2007), where the authors introduce a variant of the well-known Bayesian Information Criterion (BIC) to perform model choice. However, in the discussion they also point out that this method may not be formally correct and appropriate when choosing the number of latent dimensions. Another work is that of Friel et al. (2016), where the authors use instead the Deviance Information Criterion (DIC). One drawback of this approach is that it requires the Markov chain Monte Carlo sampler to be run for every value of K considered. By contrast, in the SLPM, the model-choice problem is addressed in one single algorithmic framework, since the superfluous groups are automatically emptied during the estimation.

Another crucial difference with the existing literature relates to the presence of weights on the edges: the aforementioned LPMs can only deal with binary edges. The only paper that proposes an adaptation of the LPM framework to a weighted network context is Sewell and Chen (2016). The authors consider a dynamic weighted network scenario, and propose several types of models based on the type of weights carried by the edges. For the nonnegative case, they propose a Tobit model (Tobin 1958), and estimate model parameters using Markov chain Monte Carlo sampling. The SLPM is fundamentally different, as it uses a mixture model to characterise the distribution of the edges, with the goal of naturally representing the sparsity of network connections, and estimating the number of latent dimensions. While the model of Sewell and Chen (2016) may be better suited to clearly summarise the network data in a two-dimensional space, dynamic extensions of the proposed SLPM should provide a more flexible framework to predict missing or future values.

Weighted blockmodels.
The stochastic blockmodel (SBM) of Wang and Wong (1987) has been extended to account for weighted interactions by Mariadassou et al. (2010) and by Aicher et al. (2014). In the SBM, the nodes are characterised by a latent cluster membership, and any two nodes assigned to the same group are said to be stochastically equivalent, as their connections are generated from the same parametric distribution. Equivalently, one may say that each edge of the network is characterised by a latent cluster membership, which is itself determined by the allocations of the nodes at its extremities: the edges assigned to the same group are generated from the same parametric distribution. This mechanism also characterises the well-known mixed-membership stochastic blockmodel (MMSBM) of Airoldi et al. (2008), where, again, the cluster membership characterising each edge is determined by node-specific attributes. One crucial advantage of the MMSBM with respect to the SBM is that it allows heterogeneity to be present within any subset of nodes.

The SLPM may be seen as a variant of the SBM and the MMSBM, where the cluster membership assigned to an edge is not determined by nodal information, but rather by some global parameters (the mixing proportions λ). In addition, once these cluster memberships are set, the distributions generating the edge weights are chosen not only according to such clustering variables, but also using cluster-specific nodal information, i.e. the latent coordinates. Hence, the SLPM relates to the literature originated from Airoldi et al. (2008) and it extends the classic blockmodels by introducing an additional ingredient, the latent distance representation, to explain the observed heterogeneity and to increase the flexibility of the mixture.

Tensor factorisation models.
Tensors are array objects that generalise adjacency matrices to more than two dimensions, and are generally used to represent multiview networks, dynamic networks, and other types of multivariate interaction data. One common approach used to model these objects is inspired by factor models (Bartholomew et al. 2011), where the tensor (or, in our context, the adjacency matrix X) is decomposed using a low-rank factorisation:

    X = UV' + E    (6)

Here, U has size M × K, V has size N × K, E has size M × N, and the dash denotes matrix transposition. A very rich machine learning literature is available on this topic (see, for example, Lim and Teh 2007; Mnih and Salakhutdinov 2008; Salakhutdinov and Mnih 2008); however, since this paper deals with nonnegative matrices, our interest mainly focuses on the nonnegative matrix factorisation (NMF) approach of Lee and Seung (1999) and Lee and Seung (2001).

In statistics, similar models have been considered for the analysis of binary networks by Hoff et al. (2002), under the name of projection models. Since each edge is assumed to be generated through a transformation of the scalar product of K-dimensional latent vectors, the latent space may be interpreted as a K-dimensional sphere, where a small angle between two nodes implies a higher probability of connection. While this generative mechanism shares similarities with LPMs and other distance based models, it provides a vastly different geometrical representation of the network; hence, it may lead to qualitatively different statistical analyses and interpretations.

Projection models have been intensely studied and extended in a number of ways (Hoff 2005; Durante and Dunson 2014), and have also been considered as models for weighted interactions (Hoff 2018). The work of Durante et al. (2017) has several similarities with the approach proposed in this paper. The authors consider a multiview network and propose a mixture of projection models to explain its binary edges. They exploit the sparse Dirichlet prior on the mixture weights to ensure that non-relevant mixture components are emptied. They propose a Markov chain Monte Carlo sampler and focus their analysis only on the characterisation of the edge probabilities, hence bypassing the noted identifiability issues.

The SLPM relies instead on a sparse finite mixture representation of the latent space within a distance-based framework, which also leads to an estimation of K. The new model proposed in this paper is specifically designed for nonnegative weighted bipartite networks, and, hence, it could be seen as a competitor of the NMF algorithms. Here, the main novelty introduced by the SLPM with respect to NMF and the other tensor factorisation models is the latent distance framework, where interaction values are determined by how close (in the Euclidean sense) the two entities are in the latent space.
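For later comparisons it is useful to recall how a rank-K factorisation such as (6) is obtained in practice. The R sketch below uses the NMF package, which is also employed in Section 4; the rank and the toy matrix are illustrative assumptions.

library(NMF)
set.seed(1)
X <- matrix(rexp(20 * 30), nrow = 20, ncol = 30)  # a toy nonnegative matrix
fit <- nmf(X, rank = 3)                           # nonnegative matrix factorisation with K = 3
W <- basis(fit)                                   # M x K nonnegative factor
H <- coef(fit)                                    # K x N nonnegative factor
X_hat <- W %*% H                                  # low-rank reconstruction of X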
3 Inference

According to Section 2.1 and Figure 1, the posterior density of a SLPM factorises as follows:

    π(Z, U, V, λ, γ | X, δ, a, b) ∝ L_X(Z, U, V) π(Z | λ) π(λ | δ) π(U | γ) π(V | γ) π(γ | a, b)    (7)

However, as noted in the previous section, posterior samples for the model parameters are not interpretable due to the multiple sources of non-identifiability. This means that a traditional Markov chain Monte Carlo sampler cannot be used to characterise the posterior and, hence, the uncertainty around the parameter estimates. To overcome this limitation, an approach based on Variational Bayes (VB, Attias 1999; Attias 2000) is proposed. This means that sampling techniques are replaced by optimisation, which allows much faster computations while still providing a reasonable characterisation of the posterior uncertainty. In this paper, the theory behind VB is not discussed in detail: the interested reader may find more complete illustrations of these methods, for example, in Latouche (2010) and Blei et al. (2017) and references therein.

Denote the collection of model parameters {Z, U, V, λ, γ} = φ ∈ Φ: the goal of VB is to perform a search in a family Q of densities over Φ, to find the density q*(·) ∈ Q that is closest, in the sense of Kullback-Leibler (KL) divergence, to the posterior density π(φ | X) defined by (7). Generally, the densities in the family Q are, by construction, smoother, more tractable, and easier to interpret than π(· | X). As is common practice, the variational densities in Q are assumed to satisfy a mean-field assumption:

    q(Z, U, V, λ, γ) = { ∏_{i,j} q_Z(Z_ij) } { ∏_{i,k} q_U(U_ik) } { ∏_{j,k} q_V(V_jk) } × q_λ(λ) { ∏_k q_γ(γ_k) }    (8)
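As a reminder of why this is a well-posed optimisation problem (this is the standard variational Bayes identity, not a result specific to the SLPM), the marginal likelihood of the data decomposes as

    log p(X) = E_q[log p(X, φ)] − E_q[log q(φ)] + KL( q(φ) || π(φ | X) ),

so that, since log p(X) does not depend on q, maximising the first two terms (the evidence lower bound, or free energy) is equivalent to minimising the KL divergence between q and the posterior.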
13n (8) and from here onwards: ≤ i ≤ M , ≤ j ≤ N and ≤ k ≤ K , unlessotherwise specified. Similarly to most other LPMs, the SLPM does not have all of itsfull-conditional distributions in standard form, mainly due to the likelihood specification.In a MCMC context, this means that non-standard parameter updates must be performedusing Metropolis-within-Gibbs (Hoff et al. 2002). In a VB context, this implies that theparametric forms of the variational densities in Q cannot be automatically derived, andthat closed forms updates will not be available for the variational parameters.Hence, the following variational densities are assumed: q Z (cid:16) Z ij , . . . , Z ijK (cid:12)(cid:12)(cid:12) ˜ λ ij , . . . , ˜ λ ijK (cid:17) ∼ M ultinomial (cid:16)
1; ˜ λ ij , . . . , ˜ λ ijK (cid:17) q U (cid:16) U ik (cid:12)(cid:12)(cid:12) ˜ α Uik , ˜ β Uik (cid:17) ∼ Gaussian (cid:16) ˜ α Uik , ˜ β Uik (cid:17) q V (cid:16) V jk (cid:12)(cid:12)(cid:12) ˜ α V jk , ˜ β V jk (cid:17) ∼ Gaussian (cid:16) ˜ α V jk , ˜ β V jk (cid:17) q λ ( λ ij , . . . , λ ijK ) ∼ Dirichlet (cid:16) ˜ δ , . . . , ˜ δ K (cid:17) q γ ( γ k ) ∼ Gamma (cid:16) ˜ a k , ˜ b k (cid:17) (9)Here, Gaussian refers to the univariate normal density. It should be noted that all thevariational parameters are indicated with a tilde: ˜ φ = (cid:110) ˜ λ , ˜ α U , ˜ β U , ˜ α V , ˜ β V , ˜ δ , ˜ a , ˜ b (cid:111) Proposition 1.
Minimising the KL divergence between q(·) ∈ Q and the SLPM posterior π(· | X) is (approximately) equivalent to maximising the following functional with respect to the variational parameters φ̃:

    F(φ̃) = const
        + Σ_{i,j,k} λ̃_ijk [ ψ(η̃_ijk / ζ̃_ijk) − log(η̃_ijk / ζ̃_ijk) − x_ij η̃_ijk − log(λ̃_ijk) ]
        + Σ_k ( δ_k − δ̃_k + Σ_{i,j} λ̃_ijk ) [ ψ(δ̃_k) − ψ(Σ_h δ̃_h) ]
        + Σ_k ( a_k − ã_k + (M + N)/2 ) [ ψ(ã_k) − log b̃_k ] − Σ_k (ã_k / b̃_k) ( b_k + S̃_k / 2 )
        + (1/2) Σ_{i,k} log(β̃^U_ik) + (1/2) Σ_{j,k} log(β̃^V_jk)
        − log Γ(Σ_k δ̃_k) + Σ_k [ log Γ(δ̃_k) + ã_k − ã_k log b̃_k + log Γ(ã_k) ]    (10)

where:

    η̃_ijk = β̃^U_ik + β̃^V_jk + (α̃^U_ik − α̃^V_jk)^2
    ζ̃_ijk = 2 η̃_ijk − (α̃^U_ik − α̃^V_jk)^2
    S̃_k = Σ_i ( β̃^U_ik + (α̃^U_ik)^2 ) + Σ_j ( β̃^V_jk + (α̃^V_jk)^2 )    (11)

and ψ(·) indicates the first derivative of the log-gamma function.

The functional F is often called the variational free energy. The proof of Proposition 1, and a description of the approximations introduced to obtain this result, are provided in Appendix A.1.

Maximisation of F

The maximisation of (10) is performed using a coordinate ascent algorithm (CAVI, Bishop 2006; Blei et al. 2017), where each of the variational densities composing q(·) is updated in turn in a greedy fashion. This section shows that some of these updates are available in closed form, whereas iterative procedures must be used for the remaining densities. The proofs are all shown in the appendix.

Proposition 2.
For any given configuration of variational parameters φ̃ and indexes 1 ≤ i ≤ M and 1 ≤ j ≤ N, the values of (λ̃_ij1, ..., λ̃_ijK) that maximise F are given by:

    λ̃*_ijk = exp{ ψ(η̃_ijk / ζ̃_ijk) − log(η̃_ijk / ζ̃_ijk) − x_ij η̃_ijk + ψ(δ̃_k) − ψ(Σ_h δ̃_h) }
              / Σ_{k'} exp{ ψ(η̃_ijk' / ζ̃_ijk') − log(η̃_ijk' / ζ̃_ijk') − x_ij η̃_ijk' + ψ(δ̃_k') − ψ(Σ_h δ̃_h) }    (12)

for 1 ≤ k ≤ K.
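Since the exponents in (12) can be very large in magnitude, implementations typically normalise these responsibilities on the log scale. The R sketch below illustrates the standard log-sum-exp device; the vector logw is a hypothetical input holding, for one edge (i, j), the K exponents appearing in (12).

softmax <- function(logw) {
  m <- max(logw)            # subtract the maximum for numerical stability
  w <- exp(logw - m)
  w / sum(w)
}
# example with illustrative values of the log-weights
lambda_tilde_ij <- softmax(c(-1050, -1052, -1060))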
Proposition 3.

For any given configuration of variational parameters φ̃, the values of δ̃ that maximise F are given by:

    δ̃*_k = δ_k + Σ_{i,j} λ̃_ijk    (13)

for 1 ≤ k ≤ K.

Proposition 4.
For any given configuration of variational parameters φ̃, the values of ã and b̃ that maximise F are given by:

    ã*_k = a_k + (M + N)/2
    b̃*_k = b_k + (1/2) Σ_i ( β̃^U_ik + (α̃^U_ik)^2 ) + (1/2) Σ_j ( β̃^V_jk + (α̃^V_jk)^2 )    (14)

for 1 ≤ k ≤ K.

Similar closed form updates are not available for the remaining variational densities q_U and q_V. Hence, the updates for the corresponding parameters aim at improving the value of F using a greedy rule in the direction of the natural gradient (Amari 1998).

Proposition 5. The update defined by:

    α̃*^U_ik = α̃^U_ik + ε β̃^U_ik ∂F(φ̃)/∂α̃^U_ik
    β̃*^U_ik = β̃^U_ik exp{ ε β̃^U_ik ∂F(φ̃)/∂β̃^U_ik }    (15)

is in the direction of the natural gradient of F with respect to α̃^U_ik and β̃^U_ik, and, for a small enough ε > 0, it does not decrease the value of F.

A result equivalent to Proposition 5 is also provable for the variational density associated to the latent position V_jk, for every 1 ≤ j ≤ N and 1 ≤ k ≤ K.

In practice, the algorithm keeps track of the learning rate ε associated to the variational densities of each latent position. To update one of these densities, an initial learning rate ε is considered, and new proposed parameter values are calculated according to (15). If the new value of F is not smaller, the new values are retained and the algorithm proceeds to the following update. Otherwise, the learning rate is repeatedly halved until it is sufficiently small to guarantee a non-decrease of F. The pseudocode for the optimisation routine is shown in Algorithm 1.
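The step-halving rule described above can be sketched as follows. This is a generic illustration of the backtracking idea, with hypothetical helper functions free_energy() and proposed_update() standing in for the evaluation of F and for the natural-gradient step (15); it is not the SparseLPM implementation.

guarded_update <- function(params, eps, min_eps = 1e-8) {
  f_old <- free_energy(params)
  while (eps > min_eps) {
    candidate <- proposed_update(params, eps)   # step (15) with learning rate eps
    if (free_energy(candidate) >= f_old) {
      return(list(params = candidate, eps = eps))
    }
    eps <- eps / 2                              # halve the step until F does not decrease
  }
  list(params = params, eps = eps)              # give up: keep the current values
}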
Algorithm 1: CAVI algorithm for the SLPM
INPUT: tol and starting values φ̃
set F* = F(φ̃)
set stop = false
while !stop do
    for every i and j: update λ̃_ij using (12)
    update δ̃ using (13)
    update ã and b̃ using (14)
    for every i and k: update α̃^U_ik and β̃^U_ik using (15)
    for every j and k: update α̃^V_jk and β̃^V_jk using (15)
    if F(φ̃) ≤ F* + tol then stop = true end if
    set F* = F(φ̃)
end while
OUTPUT: F* and φ̃

It should be noted that, since each of the updates cannot decrease the value of the objective function, the algorithm is greedy; hence, due to the nonlinearity of the objective function, the procedure is guaranteed to converge only to a local optimum. The iterative routine is stopped when a full iteration yields an increase smaller than a prefixed positive quantity denoted tol.

Regarding the hyperparameters, these may be used to introduce prior knowledge in the framework, when this is available. More generally, as a default, the following values are used: the entries of δ are all set to 0.001, while the entries of a and b are all set to 1. With these choices, the Dirichlet distribution on the mixing proportions becomes a shrinkage prior, as already discussed.
The Gamma(1, 1) prior on the precisions yields, instead, a rather flexible structure, which supports a wide range of scenarios. In this paper, the hyperparameters are set to the default values in every application considered.

3.3 Initialisation

Due to its iterative nature, the greedy algorithm proposed requires the variational parameters to be initialised. As is the case for other similar algorithms, such as the expectation-maximisation of Dempster et al. (1977) or the iterated conditional modes of Besag (1986), parameter initialisation plays a fundamental role in avoiding convergence to irrelevant local optima. Here, a Euclidean-distance-based heuristic method is proposed to initialise the latent positions of the nodes. Once the value of K is set, the goal is to locate the nodes in R^K in a way such that the nodes that are close to each other have larger observed interaction values.

First, recall that, given a binary adjacency matrix A, the generic element indexed by (i, j) of the matrix AA' identifies the number of paths that allow one to move from i to j in two steps. Similarly, in a weighted bipartite network context, the matrix XX' has size M × M, and the value of its generic entry (i, i') increases if both senders i and i' are strongly interacting with the same nodes. Hence, a measure of similarity between sender nodes and, respectively, between receiver nodes is introduced using the following matrices:

    S^U = √( XX' / N )        S^V = √( X'X / M )    (16)

As a consequence, the matrix D^U (resp. D^V), obtained by inverting the elements of S^U (resp. S^V), gauges the dissimilarity between any two sender nodes (resp. receiver nodes). Gathering together all the available information, a dissimilarity matrix for all the nodes of the network is defined as follows:

    D = [ D^U   X  ]
        [ X'    D^V ]    (17)

and nonmetric multidimensional scaling is used on D to initialise the latent positions for all of the senders and receivers. Since some of the elements of X, S^U and S^V may be zero, the initialisation procedure is performed on X + E, where the matrix E has all its elements equal to an arbitrary small positive constant ε > 0. The variational variances β̃^U (resp. β̃^V) are all set to a fixed multiple of the empirical variance of α̃^U (resp. α̃^V). The remaining variational parameters are all initialised to a common default value, with the exception of the λ̃s, which are initialised during the first step of the optimisation conditionally on all the other parameters.

4 Applications to artificial data

In this section, the proposed methodology is tested on artificial data, demonstrating that it can serve as a computationally efficient tool to summarise the information provided by nonnegative matrices.
4.1 Model fit

The goal of these experiments is to demonstrate that the SLPM leads to meaningful results and that it can achieve a good fit to the data. The NMF method is also used on the same generated datasets for comparison purposes.
Experiment 1: SLPM data.
In the first experiment, the algorithms are tested on simulated SLPMs in two scenarios: one corresponding to a small network with M = N = 25, and the other on a larger network with M = N = 100. Here, the SLPM modelling assumption is correct; hence, the corresponding method is expected to perform generally well, and to achieve better results than NMF. The true number of dimensions is fixed to K_true = 3 and the mixing proportions are all set to 1/3. The latent positions are all sampled from a standard Gaussian distribution, and, finally, the average weighted network X is generated according to:

    x_ij = Σ_{k=1}^{K} λ_k (U_ik − V_jk)^{−2}    (18)

This generating process is repeated independently, hence obtaining a collection of simulated SLPMs for each scenario.

The NMF model is estimated using the method of Brunet et al. (2004), as implemented in the R package NMF (Gaujoux and Seoighe 2010), while the variational method of Section 3 is used for the SLPM. Both methods consider K = 8 latent dimensions, and are run once on each dataset. Additionally, the SLPM hyperparameters are fixed to δ = 0.001 and a = b = 1, and the variational parameters are initialised using the distance-based approach described in Section 3.3. The tolerance threshold tol for the SLPM is set to a small fixed value.

Once the parameters of the SLPM and NMF are estimated for all of the generated networks, they are used to reconstruct the weighted adjacency matrices X^(SLPM) and X^(NMF). Then, to assess the performance of the two models, the following loss function is considered:

    L(X, X^(·)) = (1 / MN) Σ_{i,j} log | x_ij − x^(·)_ij |    (19)

for any two M × N matrices X and X^(·). The presence of the log is necessary to counterbalance the extremely heavy tailed shape of the weights documented in Figure 2. In fact, without the log transformation, the value of the loss would be almost exclusively determined by the prediction error on the heaviest weight. By contrast, it should be noted that a drawback of this transformation is that the magnitude of the errors on the small edge values is inflated.

The two plots in Figure 3 compare the losses for SLPM and NMF on the small and large networks, respectively. Clearly, the estimated SLPM yields smaller prediction log-errors in the vast majority of cases. This is especially true in the scenario with larger networks (right panel of Figure 3), as the additional available information guarantees a reasonably good recovery of the true latent structure.

To estimate the value of K, one may consider:

    K̂ = #{ k : k = 1, ..., K; Σ_{i,j} λ̃_ijk > 0 }    (20)

that is, the number of groups with a size which is strictly greater than zero. However, due to the presence of extreme edge values, nearly empty groups most often appear. These naturally inflate K̂ without adding any substantial changes in the results obtained. For this reason, studying more directly the mixing proportions may, in general, be advisable.

Regarding the estimation of K in this experiment, Figure 4 shows the estimated mixing proportions in decreasing order. In both scenarios, and for the vast majority of generated datasets, two latent dimensions stand out, since most edges are assigned to either of these two. The third most relevant dimension generally contains only few edges which correspond to outliers.
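The loss (19) and the estimator (20) are straightforward to compute once a reconstruction of X is available. The R sketch below is an illustration only; X_hat and lambda_tilde are hypothetical objects holding, respectively, a reconstructed matrix and the estimated edge-by-dimension responsibilities (an M x N x K array), as these would be produced by a fitted model.

# average prediction log-error, as in (19)
log_loss <- function(X, X_hat) mean(log(abs(X - X_hat)))

# number of non-empty latent dimensions, as in (20)
estimate_K <- function(lambda_tilde) {
  sizes <- apply(lambda_tilde, 3, sum)   # total responsibility of each dimension
  sum(sizes > 0)
}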
Average log−error with NMF A v e r age l og − e rr o r w i t h S L P M l l lll ll l l lll ll l lll ll ll l lll lll l lll lll ll l l lll llll ll l lll lll ll ll ll l lll lllll ll llll l ll ll l l llll lllll ll lll ll M = N = 25 . . . . . . . Average log−error with NMF A v e r age l og − e rr o r w i t h S L P M ll ll lll ll lll l llll ll ll ll ll lll lll ll ll l ll l ll ll ll l lll lll l ll ll lll lll llll llll ll l lll l ll ll ll lll lll ll lll l ll l l M = N = 100
Figure 3:
Model fit, experiment 1 . Prediction losses of the estimated matrices X ( SLP M ) and X ( NMF ) with respect to the true X , for each of the generated datasets. The left panel correspondsto the small datasets ( M = N = 25 ), whereas the right panel corresponds to the large datasets( M = N = 100 ). . . . . . . K p r opo r t i on o f edge s M = N = 25 . . . . . . K p r opo r t i on o f edge s M = N = 100
Figure 4:
Model fit, experiment 1 . In both panels, the k -th boxplot describes the k -th mixingweight (in decreasing order) across datasets. In both small networks (left panel) and largenetworks (right panel), two latent dimensions stand out. xperiment 2: homogeneous data. In this second experiment, the data is gener-ated from a homogeneous process: each edge weight is assumed to be an exponentialrandom variable, whose rate is distributed according to a
Gamma (1 , . This leads toa homogeneous network model whose empirical degree distribution is characterised by aheavy tail (Figure 5). l l ll llllllllllllllllllllllllllllllllllllllll llll l l l −10 −5 0 5 − − − − log−edge−value l og − p r opo r t i on y - l l l lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l l l l −10 −5 0 5 10 − − − − − − log−edge−value l og − p r opo r t i on y - Figure 5:
Model fit, experiment 2 . Empirical log - log distribution of the edge weights for twonetworks generated using the homogeneous model, with sizes M = N = 25 and M = N = 100 (respectively on the left and right panel). Similarly to the previous experiment, the SLPM and NMF are estimated on dataset with sizes M = N = 25 or M = N = 100 . The hyperparameters, number oflatent dimensions, and optimisation parameters are also set exactly to the same valuesused in the previous experiment.The predictive log -errors are shown in Figure 6. While the NMF model seems toachieve slightly better results on the small datasets, the SLPM again outperforms NMFon the larger datasets. This suggests that the distance-based generative process behindthe SLPM may be more flexible than that of the NMF, and that it may be better suitedto represent the topologies exhibited by many types of observed networks.As concerns the estimation of K , Figure 7 shows the estimated mixing proportions indecreasing order. The number of relevant components corresponds to the correct numberof dimensions in the small datasets, since we see from the left panel that the third largestgroup contains approximately five to of the edges in more than half of the datasets.However, an overestimation of K is observed on the larger dataset. Here, it seems that,21 .8 1.0 1.2 1.4 1.6 . . . . . Average log−error with NMF A v e r age l og − e rr o r w i t h S L P M ll ll l llllll ll lll lll l ll llll ll llll ll ll lll ll ll lll ll l ll lll l llll lll l llll llll ll ll lll ll ll ll ll l ll llll l ll ll ll M = N = 25 . . . . . . . Average log−error with NMF A v e r age l og − e rr o r w i t h S L P M llll lll lllllllllll ll ll l llllll l llllll l lll ll llll ll lll llll lll ll lllll llll lllllll llll lll l llll lllll ll llll l M = N = 100
Figure 6:
Model fit, experiment 2 . Prediction losses of the estimated matrices X ( SLP M ) and X ( NMF ) with respect to the true X , for each of the generated datasets. The left panel correspondsto the small datasets ( M = N = 25 ), whereas the right panel corresponds to the large datasets( M = N = 100 ). . . . . . . . . K p r opo r t i on o f edge s M = N = 25 . . . . K p r opo r t i on o f edge s M = N = 100
Figure 7:
Model fit, experiment 2 . In both panels, the k -th boxplot describes the k -th mixingweight (in decreasing order) across datasets. In the small networks (left panel) the estimatedmodels are characterised by three latent dimensions, whereas in the large networks (right panel)more dimensions are required to fit the data. The goal of this section is to further assess the performance of the SLPM in recoveringthe correct number of latent dimensions.Four scenarios are considered: K true = 2 and M = N = 40 in the first scenario, K true = 4 and M = N = 40 in the second, K true = 2 and M = N = 80 in the third,and K true = 4 and M = N = 80 in the fourth. In each scenario the hyperparametersare all fixed to , and artificial networks are generated independently. The mixingproportions, the precisions, and the latent positions are all generated according to thehierarchical structure described in Section 2.1. Then, the average weighted network X isgenerated according to (18).Similarly to the previous section, the SLPM is estimated on all of the datasets us-ing VB with the same set-up for hyperparameters, number of latent dimensions, andoptimisation parameters. In addition, in order to assess the sensitivity of the SLPM toits hyperparameters, the variational algorithm is run once with δ = 0 . and once with δ = 0 . .As in the previous simulation studies, the number of revelant groups is deduced bythe mixing proportions once they are sorted in decreasing order. Figures 8, 9, 10, and 11refer to scenarios one to four, respectively. It appears that, overall, the value of thehyperparameter δ does not have any relevant effect on the estimate of K . While in thefirst scenario ( M = N = 40 , K = 2 , Figure 8) the number of relevant latent dimensionsis generally overestimated to , in the second scenario ( M = N = 40 , K = 4 , Figure 9) K is well-recovered. In fact, in more than half of the datasets, the fourth group containsmore than of the edges, whereas the fifth group has negligible size.Similarly to the previous experiment, K is overestimated in the larger datasets ( M = N = 80 , scenarios and in Figures 10 and 11, respectively), in that in more than halfof the datasets the sixth group has a non-negligible size.These simulations illustrate that, while shrinkage and parsimony are generally achieved,the estimated number of relevant groups does not always coincide exactly with the true23 . . . . K p r opo r t i on o f edge s delta = 0.001M = N = 40K_true = 2 . . . . . K p r opo r t i on o f edge s delta = 0.1M = N = 40K_true = 2 Figure 8:
Model choice, scenario 1 . In both panels, the k -th boxplot describes the k -th mixingweight (in decreasing order) across datasets. The hyperparameter δ seems to have no effecton the results. Three latent dimensions stand out with a weight approximately equal to . in atleast half of the datasets. The fourth dimension has a small weight in all of the datasets. . . . . . K p r opo r t i on o f edge s delta = 0.001M = N = 40K_true = 4 . . . . . K p r opo r t i on o f edge s delta = 0.1M = N = 40K_true = 4 Figure 9:
Model choice, scenario 2 . In both panels, the k -th boxplot describes the k -th mixingweight (in decreasing order) across datasets. The hyperparameter δ seems to have no effecton the results. In this scenario, the number of relevant latent dimensions is generally equal tothe true value of K . . . . . . K p r opo r t i on o f edge s delta = 0.001M = N = 80K_true = 2 . . . . . K p r opo r t i on o f edge s delta = 0.1M = N = 80K_true = 2 Figure 10:
Model choice, scenario 3 . In both panels, the k -th boxplot describes the k -thmixing weight (in decreasing order) across datasets. The hyperparameter δ seems to have noeffect on the results. Although the true value of K is two, five latent dimensions stand out. Inthis scenario, the VB algorithm struggles to reduce the dimensionality and to recover the correctnumber of latent dimensions. . . . . . . . . K p r opo r t i on o f edge s delta = 0.001M = N = 80K_true = 4 . . . . . . . K p r opo r t i on o f edge s delta = 0.1M = N = 80K_true = 4 Figure 11:
Model choice, scenario 4 . In both panels, the k -th boxplot describes the k -thmixing weight (in decreasing order) across datasets. The hyperparameter δ seems to haveno effect on the results. Similarly to the third scenario, the VB algorithm struggles to reduce thedimensionality and it tends to overestimate the number of relevant latent dimensions. , but rather lives in a neighourhood of this value. In particular, the number of relevantgroups tends to be overestimated when the dataset is large and, hence, more heteroge-neous. Similarly to the other available estimation methods for LPMs, the computational com-plexity of the variational algorithm is determined by the updates of the latent positions.One such update requires a summation over all possible neighbours, hence it implies ei-ther N or M operations, depending on whether the updated node is a sender or receiver,respectively. In addition, since the step size must be tuned at every update attempt, the N (or M ) evaluations are actually performed an unspecified number of times, dependingon the rule determining the learning rates. However, since the step sizes remain approxi-mately constant as the algorithm progresses, these re-evaluations are generally performedvery few times.The number of iterations required by the algorithm to converge impacts the overallcomputing time by a much greater extent. In general, although the algorithm is greedy,there are no guarantees regarding the number of iterations required for convergence, and,most likely, such number depends on the number of nodes and on the topology of thenetwork. In practice, a few hundreds iterations are often sufficient to achieve the threshold tol = 0 . .Figure 12 compares the computing time required by the variational algorithm of SLPMwith that of NMF. NMF exploits parallel computing and is generally much faster, as itconverges with very few iterations. The figure also highlights that SLPM scales quitepoorly with the number of dimensions.These results suggest that, while SLPM generally achieves a very good fit to the data,it may also struggle in dealing with very large datasets, hence forcing one to increasethe tolerance threshold or to reduce the maximum number of iterations, in order to getusable results in a reasonable time. The goal of the two studies shown in this section is to propose an application of themethodology proposed on real datasets, and to illustrate how the SLPM represents andsummarises the observed data. 26 network size s e c ond s l l l l l l l l l l l SLPM with K = 2SLPM with K = 10NMF with K = 2NMF with K = 10
Figure 12:
Computational efficiency . Number of seconds required to perform the first iterations of the VB algorithm (black and red lines), and one run of the NMF algorithm (greenand blue lines). The two algorithms appear to scale similarly with respect to the size of thenetworks, especially when the number of latent dimensions is small. For a larger number of latentdimensions, the NMF algorithm scales much better than the VB algorithm used to estimate theSLPM. The dataset used in this example has been first studied by Vanhems et al. (2013), and it isavailable from the R package igraphdata . The data record proximity interactions between patients, nurses, doctors and administrative staff at a university Hospital inLyon, France, from a Monday to the following Friday in December . Wearablesensors based on active Radio-Frequency IDentification (RFID) are used to probe thearea surrounding each individual at seconds intervals, and proximity interactions arerecorded whenever two individuals are closer than . meters. The analysis of such datamay provide insights on the dynamics of the spread of Hospital-acquired infections, and,hence, lead to valid and more effective preventive measures.In this analysis, the interaction matrix X denotes the cumulated proximity time, i.e.the generic entry x ij indicates the number of seconds the staff member i has spent inproximity of patient j . The length of stay of the patients can be deduced from theproximity information, and, clearly, those who stayed longer have a higher degree (notshown).The latent number of dimensions was set to K = 10 , and the VB algorithm was run27nce using the initialisation of Section 3.3. The algorithm reached convergence after iterations and . seconds.The rounded mixing proportions are ˆ λ = 0 . , ˆ λ = 0 . , and ˆ λ k < . for i =2 , , , , , , ; two latent groups stand out with a large mixing weight. As illustratedby the two plots in Figure 13, the two relevant latent dimensions fulfil two very differentroles. In fact, the larger one has a very high empirical variance, meaning that it is l − − dimension l og edge v a l ue s l l l l l l l l l . . . . . . empirical variance m i x i ng p r opo r t i on l l l l l l l l l Figure 13:
Hospital data . The left panel shows the distribution of the weight log-values assignedto each of the two most relevant dimensions. As null-weights cannot be shown, a small positiveconstant is added to all the entries of X to improve the readability of this plot. The rightpanel compares the mixing proportions with the empirical variances of the corresponding latentdimensions. used to represent all of the zero or nearly-zero weights. By contrast, dimension numberthree is instead used for almost all of the remaining edges. The same conclusion is alsoapparent from Figure 14, where dimension number three, plotted on the vertical axis,clearly exhibits more heterogeneity than dimension number one, which is plotted on thehorizontal axis.The results show that there are no substantial differences between the nurses, doctors,and administrative staff in how they interact with the patients. It should be noted that,since dimension three is able to capture almost all the heterogeneity in the data, theunidimensional projection of this SLPM has features very similar to those of the originalLPM of Hoff et al. (2002). For example, for both patients and staff, nodes with higherdegree tend to be positioned closer to the centre of the space.28 − . − . − . . . . . dimension 1 d i m en s i on l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l ll l l l l l l Figure 14:
Figure 14: Hospital data. Latent space representation based on the two most relevant latent dimensions. The black squares correspond to the patients, whereas the circles correspond to the staff members. Different circle colours indicate the known staff member occupations (blue for nurses, green for doctors, red for administrative staff). The sizes of the circles and squares are proportional to the degrees of the corresponding nodes.

The second example is an analysis of the dataset first studied by Khan et al. (2001). Here, the matrix X represents the measured expression levels of M = 2308 genes in N = 64 samples, for patients who tested positive for the presence of small round blue cell tumours of childhood. The goal is to characterise the connectivity profiles of genes and samples, and to provide a latent space representation of these data. There are no zero-weighted edges in this dataset.

The VB algorithm was initialised and run with K = 20, and it converged in approximately one hour. Four latent groups have nonempty estimated mixing proportions. As illustrated in Figure 15, the three most relevant groups all have a relatively small empirical variance, and they are used to characterise the edges with both large and small weights. The remaining group is instead used only for large weights, yet it has a much smaller size and may be considered a group of outliers.
Figure 15: Microarray data. The left panel shows the distribution of the weight log-values assigned to each of the two most relevant dimensions. The right panel compares the mixing proportions with the empirical variances of the corresponding latent dimensions.

The two largest groups are used in the latent space representation of Figure 16.
Figure 16: Microarray data. Latent space representation based on the two most relevant latent dimensions. The black circles correspond to the genes, whereas the squares correspond to the samples. Different square colours indicate the known tumour classes. The sizes of the circles and squares are proportional to the degrees of the corresponding nodes.

In both dimensions, the latent positions highlight groups of samples which do not correspond to the tumour types included in the metadata. This is not unexpected, since the separation of the tumour types is known to be a challenging problem, and the SLPM does not involve the optimisation of a clustering criterion on the nodes of the graph. The results of Khan et al. (2001) are of a different nature, since the authors use the known tumour classes to train a neural network classifier, and then use this tool to predict the classes on additional test data.

Differently from the hospital data, the genes located close to the origin of the space are those farthest from the samples, hence they have the smallest degrees. The high-degree genes are instead located at the periphery of the cloud, where they are at close distance from at least three of the samples' point clouds.

Conclusions

The SLPM introduced in this paper represents a new research direction for the statistical modelling of networks. This new framework modifies and extends the LPM approach, introducing elements from the literature on blockmodels to enhance the overall flexibility of the proposed model.

Differently from blockmodels and tensor factorisation models, the SLPM projects the nodes into a latent space where a weighted transformed Euclidean distance determines the edge values. This leads to a tool that provides a distance-based representation of the rows and columns of nonnegative interaction matrices, where the number of latent dimensions is automatically deduced from the data.

The approach proposed prompts a number of interesting new extensions and research possibilities. Matrix completion and matrix reconstruction frameworks stand out as potentially useful applications, in that the SLPM may represent an effective alternative to NMF methods. These extensions and the related software are currently being developed. Furthermore, modifications of the SLPM may also be considered to account for other types of network data, such as binary networks, dynamic networks, or multiview networks.

In this paper, a VB framework is proposed to perform inference. This method is faster than sampling-based approaches, and it allows one to bypass identifiability issues. However, the true impact of the approximations introduced is not assessed and remains unknown. Future work may consider more effective strategies that lead to better results in shorter time, either using sampling approaches or approximations of a different nature.
Software
The R package SparseLPM accompanies this paper and provides an implementation of the variational algorithm described. Parts of the code have been written in C++ to reduce overall computing times. The package is publicly available from CRAN (R Core Team 2017). All of the computations described in this paper have been performed on an 8-core (2.2 GHz) Debian machine.
Acknowledgements
This research was partially supported through the Vienna Science and Technology Fund (WWTF) Project MA14-031.
References
Aicher, C., A. Z. Jacobs, and A. Clauset (2014). “Learning latent block structure in weighted networks”. In: Journal of Complex Networks.
Journal of Machine Learning Research.
Amari, S. (1998). “Natural gradient works efficiently in learning”. In: Neural Computation 10.2, pp. 251–276.
Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., pp. 21–30.
Attias, H. (2000). “A variational Bayesian framework for graphical models”. In: Advances in Neural Information Processing Systems, pp. 209–215.
Barabási, A. L. and R. Albert (1999). “Emergence of scaling in random networks”. In: Science.
Latent Variable Models and Factor Analysis: A Unified Approach. Vol. 904. John Wiley & Sons.
Besag, J. (1986). “On the statistical analysis of dirty pictures”. In: Journal of the Royal Statistical Society. Series B (Methodological), pp. 259–302.
Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer.
Blei, D. M., A. Kucukelbir, and J. D. McAuliffe (2017). “Variational inference: A review for statisticians”. In: Journal of the American Statistical Association.
Quantitative Finance.
Statistics and Computing.
Proceedings of the National Academy of Sciences.
Nature Reviews Neuroscience.
Statistics and Computing.
Journal of the Royal Statistical Society. Series B (Methodological), pp. 1–38.
Dormann, C. F., B. Gruber, and J. Fründ (2008). “Introducing the bipartite package: analysing ecological networks”. In: interaction 1, pp. 0–2413793.
Durante, D. and D. B. Dunson (2014). “Nonparametric Bayes dynamic modelling of relational data”. In: Biometrika.
Journal of the American Statistical Association.
Performance Evaluation.
Proceedings of the National Academy of Sciences.
Finite Mixture and Markov Switching Models. Springer Science & Business Media.
Frühwirth-Schnatter, S. and G. Malsiner-Walli (2018). “From here to infinity: sparse finite versus Dirichlet process mixtures in model-based clustering”. In: Advances in Data Analysis and Classification. issn: 1862-5355. doi: 10.1007/s11634-018-0329-y. url: https://doi.org/10.1007/s11634-018-0329-y.
Gandy, A. and L. A. M. Veraart (2016). “A Bayesian methodology for systemic risk assessment in financial networks”. In: Management Science.
BMC Bioinformatics.
Gollini, I. and T. B. Murphy (2016). “Joint modelling of multiple network views”. In: Journal of Computational and Graphical Statistics.
Pattern Recognition.
Journal of the Royal Statistical Society: Series A (Statistics in Society).
Journal of the American Statistical Association.
arXiv preprint arXiv:1807.08038.
Hoff, P. D., A. E. Raftery, and M. S. Handcock (2002). “Latent space approaches to social network analysis”. In: Journal of the American Statistical Association.
Khan, J. et al. (2001). “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks”. In: Nature Medicine 7, pp. 673–679.
Nature.
Advances in Neural Information Processing Systems, pp. 556–562.
Lim, Y. J. and Y. W. Teh (2007). “Variational Bayesian approach to movie rating prediction”. In: Proceedings of KDD Cup and Workshop. Vol. 7, pp. 15–21.
Malsiner-Walli, G., S. Frühwirth-Schnatter, and B. Grün (2016). “Model-based clustering based on sparse finite Gaussian mixtures”. In: Statistics and Computing.
Journal of Computational and Graphical Statistics.
The Annals of Applied Statistics.
Journal of the Royal Statistical Society: Series B (Statistical Methodology).
ESAIM: Proceedings and Surveys 47, pp. 55–74.
Mnih, A. and R. R. Salakhutdinov (2008). “Probabilistic matrix factorization”. In: Advances in Neural Information Processing Systems, pp. 1257–1264.
Newman, M. E. J. (2001). “The structure of scientific collaboration networks”. In: Proceedings of the National Academy of Sciences.
Journal of the Royal Statistical Society: Series B (Statistical Methodology).
R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria.
Raftery, A. E. (2017). “Comment: Extending the Latent Position Model for Networks”. In: Journal of the American Statistical Association.
Network Science.
Journal of the Royal Statistical Society, Series B.
Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Proceedings of the 25th International Conference on Machine Learning. ACM, pp. 880–887.
Salter-Townshend, M. and T. B. Murphy (2013). “Variational Bayesian inference for the latent position cluster model for network data”. In: Computational Statistics & Data Analysis.
Statistical Analysis and Data Mining: The ASA Data Science Journal.
Social Networks 44, pp. 105–116.
Stephens, M. (2000). “Dealing with label switching in mixture models”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Econometrica: Journal of the Econometric Society, pp. 24–36.
Vanhems, P., A. Barrat, C. Cattuto, J. F. Pinton, N. Khanafer, C. Régis, B. Kim, B. Comte, and N. Voirin (2013). “Estimating potential infection transmission routes in hospital wards using wearable proximity sensors”. In: PLoS ONE.
Journal of the American Statistical Association 82, pp. 8–19.
Zhang, B. and S. Horvath (2005). “A general framework for weighted gene co-expression network analysis”. In: Statistical Applications in Genetics and Molecular Biology.
Appendix
Throughout the appendix: $1 \leq i \leq M$, $1 \leq j \leq N$ and $1 \leq k, h \leq K$.

A.1 Proof of Proposition 1
Proof.
We can rewrite the model evidence $p(X)$ as follows:

$p(X) = p(X) \, \dfrac{p(X, \phi)}{p(X, \phi)} = \dfrac{p(X, \phi)}{p(\phi \mid X)}$    (21)

Now, we apply a log transformation, and add and subtract $\log q(\phi)$ to the right hand side:

$\log p(X) = \log p(X, \phi) - \log p(\phi \mid X) + \log q(\phi) - \log q(\phi)$    (22)

On the left hand side we have the model log-evidence, which does not depend on $\phi$. Hence, taking the expectations of both sides with respect to $q$ leads to the following:

$\log p(X) = E_q[\log p(X, \phi)] - E_q[\log q(\phi)] + KL[\, q(\cdot) \,\|\, \pi(\cdot \mid X) \,]$    (23)

It is clear that, if we define the variational free energy as follows:

$F(\tilde{\phi}) = E_q[\log p(X, \phi)] - E_q[\log q(\phi)]$    (24)

minimising the KL divergence is equivalent to maximising $F$, as $\log p(X)$ does not depend on $\phi$. Also, since the KL divergence is always nonnegative, $F$ functions as a lower bound for the model evidence.

What remains to be proved is that the definition of the variational free energy given in (10) corresponds to the one derived in (24), up to addition by a constant. The functional $F$ can be decomposed into a number of terms as follows:

$F(\tilde{\phi}) = E_q[\log p(X, \phi)] - E_q[\log q(\phi)]$
$\quad = E_q[\ell_X(Z, U, V)] + E_q[\log \pi(Z \mid \lambda)] + E_q[\log \pi(\lambda \mid \delta)] + E_q[\log \pi(U \mid \gamma)] + E_q[\log \pi(V \mid \gamma)] + E_q[\log \pi(\gamma \mid a, b)]$
$\quad\;\; - E_q[\log q(Z \mid \tilde{\lambda})] - E_q[\log q(\lambda \mid \tilde{\delta})] - E_q[\log q(U \mid \tilde{\alpha}^U, \tilde{\beta}^U)] - E_q[\log q(V \mid \tilde{\alpha}^V, \tilde{\beta}^V)] - E_q[\log q(\gamma \mid \tilde{a}, \tilde{b})]$    (25)

Before characterising the expectations on the right hand side of (25), we focus our attention on the random variables

$\Delta_{ijk} = U_{ik} - V_{jk}, \qquad \theta_{ijk} = \Delta_{ijk}^2 = (U_{ik} - V_{jk})^2$    (26)

According to $q$, we know that $\Delta_{ijk} \sim \mathrm{Gaussian}\!\left( \tilde{\alpha}^U_{ik} - \tilde{\alpha}^V_{jk},\; \tilde{\beta}^U_{ik} + \tilde{\beta}^V_{jk} \right)$, and that the variational density on $\theta_{ijk}$ does not have, in general, an explicit form. However, we do know its expectation and variance:

$\tilde{\eta}_{ijk} := E_q[\theta_{ijk}] = E_q[\Delta_{ijk}^2] = \mathrm{Var}_q(\Delta_{ijk}) + (E_q[\Delta_{ijk}])^2 = \tilde{\beta}^U_{ik} + \tilde{\beta}^V_{jk} + (\tilde{\alpha}^U_{ik} - \tilde{\alpha}^V_{jk})^2$

$\tilde{\zeta}_{ijk} := \mathrm{Var}_q(\theta_{ijk}) = E_q[\Delta_{ijk}^4] - (E_q[\Delta_{ijk}^2])^2$
$\quad = (\tilde{\alpha}^U_{ik} - \tilde{\alpha}^V_{jk})^4 + 6 (\tilde{\alpha}^U_{ik} - \tilde{\alpha}^V_{jk})^2 (\tilde{\beta}^U_{ik} + \tilde{\beta}^V_{jk}) + 3 (\tilde{\beta}^U_{ik} + \tilde{\beta}^V_{jk})^2 - \left[ (\tilde{\beta}^U_{ik} + \tilde{\beta}^V_{jk}) + (\tilde{\alpha}^U_{ik} - \tilde{\alpha}^V_{jk})^2 \right]^2$
$\quad = 4 (\tilde{\alpha}^U_{ik} - \tilde{\alpha}^V_{jk})^2 (\tilde{\beta}^U_{ik} + \tilde{\beta}^V_{jk}) + 2 (\tilde{\beta}^U_{ik} + \tilde{\beta}^V_{jk})^2 = 2 \tilde{\eta}_{ijk}^2 - 2 (\tilde{\alpha}^U_{ik} - \tilde{\alpha}^V_{jk})^4$    (27)

These two quantities play a fundamental role and are used to approximate the expectation of the log-likelihood with respect to $q$. Now, we evaluate the expectations in (25).
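As a quick sanity check of the moment formulas in (27), one can compare them with Monte Carlo estimates. The following R snippet is purely illustrative and is not part of the estimation algorithm; the values of m and s are arbitrary.

    # Monte Carlo check (illustrative) of the mean and variance of
    # theta = Delta^2, where Delta ~ Normal(mean = m, variance = s).
    set.seed(1)
    m <- 0.7      # plays the role of alpha^U_ik - alpha^V_jk
    s <- 0.3      # plays the role of beta^U_ik + beta^V_jk
    delta <- rnorm(1e6, mean = m, sd = sqrt(s))
    theta <- delta^2

    eta_formula  <- s + m^2                  # E[theta] as in (27)
    zeta_formula <- 4 * m^2 * s + 2 * s^2    # Var[theta] as in (27)

    c(mc_mean = mean(theta), formula_mean = eta_formula)
    c(mc_var  = var(theta),  formula_var  = zeta_formula)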
Expectation of the log-likelihood.

$E_q[\ell_X(Z, U, V)] = E_q\!\left[ \sum_{i,j,k} Z_{ijk} \{ \log \theta_{ijk} - x_{ij} \theta_{ijk} \} \right]$
$\quad = \sum_{i,j,k} E_q[Z_{ijk}] \{ E_q[\log \theta_{ijk}] - x_{ij} E_q[\theta_{ijk}] \}$
$\quad = \sum_{i,j,k} \tilde{\lambda}_{ijk} \{ E_q[\log \theta_{ijk}] - x_{ij} \tilde{\eta}_{ijk} \}$
$\quad \approx \sum_{i,j,k} \tilde{\lambda}_{ijk} \left\{ \psi\!\left( \tilde{\eta}_{ijk}^2 / \tilde{\zeta}_{ijk} \right) - \log\!\left( \tilde{\eta}_{ijk} / \tilde{\zeta}_{ijk} \right) - x_{ij} \tilde{\eta}_{ijk} \right\}$    (28)

The approximation in the last line of (28) comes from the fact that we have replaced the density of $\theta_{ijk}$ with a gamma density calibrated to have the same mean and variance. In fact, the true density does not have an explicit form, nor can we evaluate the expectation of $\log(\theta_{ijk})$ exactly. Once the replacement is made, the expectation of the log of a gamma variable is readily available. Other types of approximations may be considered to evaluate this expectation, such as Monte Carlo estimators or quadrature methods; however, these require a much larger computational cost. Taylor approximations are more often used (see for example Salter-Townshend and Murphy 2013 and Gollini and Murphy 2016). The choice of this gamma calibration is purely pragmatic: in this particular framework, the approximation proposed leads to much smaller errors than a second order Taylor approximation (not shown).

Expectation of the log prior on Z.

$E_q[\log \pi(Z \mid \lambda)] = E_q\!\left[ \sum_{i,j,k} Z_{ijk} \log \lambda_k \right] = \sum_{i,j,k} E_q[Z_{ijk}] \, E_q[\log \lambda_k] = \sum_{i,j,k} \tilde{\lambda}_{ijk} \left[ \psi(\tilde{\delta}_k) - \psi\!\left( \sum_h \tilde{\delta}_h \right) \right]$    (29)

Expectation of the log prior on $\lambda$.

$E_q[\log \pi(\lambda \mid \delta)] = \log \Gamma\!\left( \sum_k \delta_k \right) - \sum_k \log \Gamma(\delta_k) + \sum_k (\delta_k - 1) E_q[\log \lambda_k]$
$\quad = \mathrm{const} + \sum_k (\delta_k - 1) \left[ \psi(\tilde{\delta}_k) - \psi\!\left( \sum_h \tilde{\delta}_h \right) \right]$    (30)

Expectation of the log prior on U.

$E_q[\log \pi(U \mid \gamma)] = \sum_{i,k} E_q\!\left[ -\tfrac{1}{2} \log(2\pi) + \tfrac{1}{2} \log \gamma_k - \tfrac{\gamma_k}{2} U_{ik}^2 \right]$
$\quad = \mathrm{const} + \tfrac{M}{2} \sum_k E_q[\log \gamma_k] - \tfrac{1}{2} \sum_{i,k} E_q[\gamma_k] \, E_q[U_{ik}^2]$
$\quad = \mathrm{const} + \tfrac{M}{2} \sum_k \left[ \psi(\tilde{a}_k) - \log \tilde{b}_k \right] - \tfrac{1}{2} \sum_{i,k} \frac{\tilde{a}_k}{\tilde{b}_k} \left( \tilde{\beta}^U_{ik} + (\tilde{\alpha}^U_{ik})^2 \right)$    (31)

An equivalent result holds for the expectation of the prior on V.

Expectation of the log prior on $\gamma$.

$E_q[\log \pi(\gamma \mid a, b)] = \sum_k E_q\!\left[ a_k \log b_k - \log \Gamma(a_k) + (a_k - 1) \log \gamma_k - b_k \gamma_k \right]$
$\quad = \mathrm{const} + \sum_k (a_k - 1) E_q[\log \gamma_k] - \sum_k b_k E_q[\gamma_k]$
$\quad = \mathrm{const} + \sum_k (a_k - 1) \left[ \psi(\tilde{a}_k) - \log \tilde{b}_k \right] - \sum_k b_k \frac{\tilde{a}_k}{\tilde{b}_k}$    (32)
Entropy terms. The remaining expectations correspond to the negative entropies of common distributions, whose formulas are well known.

$E_q[\log q(Z \mid \tilde{\lambda})] = \sum_{i,j,k} \tilde{\lambda}_{ijk} \log \tilde{\lambda}_{ijk}$

$E_q[\log q(\lambda \mid \tilde{\delta})] = \log \Gamma\!\left( \sum_k \tilde{\delta}_k \right) + \sum_k \left\{ (\tilde{\delta}_k - 1) \left[ \psi(\tilde{\delta}_k) - \psi\!\left( \sum_h \tilde{\delta}_h \right) \right] - \log \Gamma(\tilde{\delta}_k) \right\}$

$E_q[\log q(U \mid \tilde{\alpha}^U, \tilde{\beta}^U)] = \mathrm{const} - \tfrac{1}{2} \sum_{i,k} \log\!\left( \tilde{\beta}^U_{ik} \right)$

$E_q[\log q(\gamma \mid \tilde{a}, \tilde{b})] = \sum_k \left\{ (\tilde{a}_k - 1) \psi(\tilde{a}_k) - \tilde{a}_k - \log \Gamma(\tilde{a}_k) + \log(\tilde{b}_k) \right\}$    (33)

The variational free energy in (10) is obtained by combining all the formulas and rearranging the terms.
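The gamma calibration used in (28) can also be checked numerically. The short R snippet below compares the moment-matched gamma approximation of the expectation of the log with a Monte Carlo estimate; it is purely illustrative and does not refer to the package code.

    # Illustrative check of the moment-matched gamma approximation in (28):
    # approximate E[log theta], theta = Delta^2 with Delta ~ Normal(m, s), using
    # a Gamma(shape = eta^2 / zeta, rate = eta / zeta) with the same mean and variance.
    set.seed(2)
    m <- 0.7
    s <- 0.3
    eta  <- s + m^2
    zeta <- 4 * m^2 * s + 2 * s^2

    approx_Elog <- digamma(eta^2 / zeta) - log(eta / zeta)

    delta   <- rnorm(1e6, mean = m, sd = sqrt(s))
    mc_Elog <- mean(log(delta^2))

    c(gamma_approx = approx_Elog, monte_carlo = mc_Elog)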
A.2 Proof of Proposition 2

Proof.
The maximisation of $F$ with respect to $(\tilde{\lambda}_{ij1}, \ldots, \tilde{\lambda}_{ijK})$, under the constraint $\sum_k \tilde{\lambda}_{ijk} = 1$, is performed using the method of Lagrange multipliers. Define the following Lagrangian:

$H_{ij}(\tilde{\lambda}_{ij1}, \ldots, \tilde{\lambda}_{ijK}, \xi_{ij}) = F(\tilde{\phi}) + \xi_{ij} \left( \sum_k \tilde{\lambda}_{ijk} - 1 \right) = \mathrm{const} + \sum_k \tilde{\lambda}_{ijk} \left\{ C_k - \log \tilde{\lambda}_{ijk} + \xi_{ij} \right\} - \xi_{ij}$    (34)

where

$C_k = \psi\!\left( \tilde{\eta}_{ijk}^2 / \tilde{\zeta}_{ijk} \right) - \log\!\left( \tilde{\eta}_{ijk} / \tilde{\zeta}_{ijk} \right) - x_{ij} \tilde{\eta}_{ijk}$    (35)

The derivatives of $H_{ij}$ are equal to the following:

$\dfrac{\partial H_{ij}}{\partial \tilde{\lambda}_{ijk}} (\tilde{\lambda}_{ij1}, \ldots, \tilde{\lambda}_{ijK}, \xi_{ij}) = C_k - \log \tilde{\lambda}_{ijk} + \xi_{ij} - 1$    (36)

Setting the derivative equal to zero and exponentiating gives the following solution:

$\tilde{\lambda}^*_{ijk} = \exp\{ C_k + \xi_{ij} - 1 \}$    (37)

Now, we do the same with the derivative with respect to $\xi_{ij}$:

$\dfrac{\partial H_{ij}}{\partial \xi_{ij}} (\tilde{\lambda}_{ij1}, \ldots, \tilde{\lambda}_{ijK}, \xi_{ij}) = \sum_k \tilde{\lambda}_{ijk} - 1 = \exp\{ \xi_{ij} \} \sum_k \exp\{ C_k - 1 \} - 1$    (38)

and obtain:

$\xi_{ij} = - \log\!\left( \sum_k \exp\{ C_k - 1 \} \right)$    (39)

which leads to (12). A study of the sign of the partial derivatives confirms that this point is a maximum.
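The resulting update is a softmax over the quantities C_k. A numerically stable way to compute it is sketched below in R; this illustrates the update implied by (37)-(39) and is not the package implementation. The vector C_example is arbitrary.

    # Illustrative computation of the optimal lambda_tilde[i, j, ] given a vector C
    # of length K, using the log-sum-exp trick for numerical stability.
    # The normalised result equals exp(C) / sum(exp(C)), i.e. a softmax over C.
    update_lambda <- function(C) {
      C_shifted <- C - max(C)      # avoids overflow in exp()
      w <- exp(C_shifted)
      w / sum(w)
    }

    # Example with arbitrary values of C:
    C_example <- c(-3.2, 0.5, -0.1, -7.4)
    update_lambda(C_example)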
A.3 Proof of Proposition 3

Proof.

The first derivative of $F$ with respect to $\tilde{\delta}_k$, for some $k$, is equal to:

$\dfrac{\partial F}{\partial \tilde{\delta}_k} (\tilde{\phi}) = \left[ \psi'(\tilde{\delta}_k) - \psi'\!\left( \sum_h \tilde{\delta}_h \right) \right] \sum_{i,j} \tilde{\lambda}_{ijk} + (\delta_k - \tilde{\delta}_k) \left[ \psi'(\tilde{\delta}_k) - \psi'\!\left( \sum_h \tilde{\delta}_h \right) \right] - \left[ \psi(\tilde{\delta}_k) - \psi\!\left( \sum_h \tilde{\delta}_h \right) \right] + \left[ \psi(\tilde{\delta}_k) - \psi\!\left( \sum_h \tilde{\delta}_h \right) \right]$
$\quad = \left[ \psi'(\tilde{\delta}_k) - \psi'\!\left( \sum_h \tilde{\delta}_h \right) \right] \left( \sum_{i,j} \tilde{\lambda}_{ijk} + \delta_k - \tilde{\delta}_k \right)$    (40)

which is equal to zero if and only if $\tilde{\delta}_k = \delta_k + \sum_{i,j} \tilde{\lambda}_{ijk}$. A study of the sign of the partial derivatives confirms that this point is a maximum.
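In R, this update is a one-line computation once the variational probabilities are available as an array. The sketch below uses placeholder values purely so that it runs; neither the object names nor the hyperparameter value come from the paper or the package.

    # Illustration of the Proposition 3 update: delta_tilde_k = delta_k + sum_{i,j} lambda_tilde_{ijk}.
    # 'lambda_tilde' and 'delta_prior' are placeholders, not SparseLPM objects.
    M <- 5; N <- 4; K <- 3
    delta_prior  <- rep(0.1, K)                       # assumed symmetric Dirichlet hyperparameter
    lambda_tilde <- array(1 / K, dim = c(M, N, K))    # placeholder variational probabilities
    delta_tilde  <- delta_prior + apply(lambda_tilde, 3, sum)
    delta_tilde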
A.4 Proof of Proposition 4

Proof.
The first derivatives of $F$ with respect to $\tilde{a}_k$ and $\tilde{b}_k$, for some $k$, are equal to:

$\dfrac{\partial F}{\partial \tilde{a}_k} (\tilde{\phi}) = \psi'(\tilde{a}_k) \left( \tfrac{M + N}{2} + a_k - \tilde{a}_k \right) + 1 - \dfrac{b_k + S_k}{\tilde{b}_k}$

$\dfrac{\partial F}{\partial \tilde{b}_k} (\tilde{\phi}) = - \dfrac{a_k + \tfrac{M + N}{2}}{\tilde{b}_k} + \dfrac{\tilde{a}_k S_k + b_k \tilde{a}_k}{\tilde{b}_k^2} = \dfrac{\tilde{a}_k S_k + b_k \tilde{a}_k - \tilde{b}_k \left( a_k + \tfrac{M + N}{2} \right)}{\tilde{b}_k^2}$    (41)

Setting these equal to zero and solving for $\tilde{a}_k$ and $\tilde{b}_k$ gives (14). A study of the sign of the partial derivatives confirms that this point is a maximum.

A.5 Proof of Proposition 5
Proof.
Let us simplify the notation by denoting $\alpha = \tilde{\alpha}^U_{ik}$, $\beta = \tilde{\beta}^U_{ik}$, $F(\alpha, \beta) = F(\tilde{\phi})$, and let $q(\alpha, \beta)$ denote the corresponding variational Gaussian density. To avoid the fact that $\beta$ is constrained to be positive, we derive the parameter updates for its log and then transform back. Hence, let us define the function

$H(\alpha, \xi) = F(\alpha, e^{\xi})$    (42)

where we have used the transformation $\beta = \beta(\xi) = e^{\xi}$. Using the chain rule:

$\dfrac{\partial H}{\partial \xi} (\alpha, \xi) = \dfrac{\partial F}{\partial \beta} (\alpha, e^{\xi}) \, \dfrac{\partial \beta}{\partial \xi} (\xi) = e^{\xi} \, \dfrac{\partial F}{\partial \beta} (\alpha, e^{\xi})$    (43)

Thus, a standard gradient ascent update for $(\alpha_t, \xi_t)$ simply reads as follows:

$\alpha_{t+1} = \alpha_t + \varepsilon \, \dfrac{\partial F}{\partial \alpha} (\alpha_t, e^{\xi_t}), \qquad \xi_{t+1} = \xi_t + \varepsilon \, e^{\xi_t} \, \dfrac{\partial F}{\partial \beta} (\alpha_t, e^{\xi_t})$    (44)

Now, following Amari (1998), we propose instead a natural gradient ascent, which essentially corrects the direction of the gradient to account for the geometry of the space of functions we optimise over. In practice, the direction of the gradient is corrected by premultiplying it by the inverse Fisher information associated to the likelihood $q(\alpha, \xi)$. The log-likelihood and its second derivatives read as follows:

$\log[ q(\alpha, \xi) ] = \mathrm{const} - \dfrac{\xi}{2} - \dfrac{(x - \alpha)^2}{2 e^{\xi}}$
$\dfrac{\partial^2 \log[ q(\alpha, \xi) ]}{\partial \alpha^2} = - e^{-\xi}, \qquad \dfrac{\partial^2 \log[ q(\alpha, \xi) ]}{\partial \alpha \, \partial \xi} = - (x - \alpha) e^{-\xi}, \qquad \dfrac{\partial^2 \log[ q(\alpha, \xi) ]}{\partial \xi^2} = - \dfrac{(x - \alpha)^2}{2} e^{-\xi}$    (45)

Hence, the Fisher information and its inverse are equal to the following:

$I(\alpha, \xi) = \begin{pmatrix} e^{-\xi} & 0 \\ 0 & \tfrac{1}{2} \end{pmatrix}, \qquad I^{-1}(\alpha, \xi) = \begin{pmatrix} e^{\xi} & 0 \\ 0 & 2 \end{pmatrix}$    (46)

Using these results, we can state the natural gradient ascent update for $(\alpha, \xi)$, which reads as follows:

$\alpha_{t+1} = \alpha_t + \varepsilon \, e^{\xi_t} \, \dfrac{\partial F}{\partial \alpha} (\alpha_t, e^{\xi_t}), \qquad \xi_{t+1} = \xi_t + 2 \varepsilon \, e^{\xi_t} \, \dfrac{\partial F}{\partial \beta} (\alpha_t, e^{\xi_t})$    (47)

Now, we simply take the exp of the second line, obtaining:

$\alpha_{t+1} = \alpha_t + \varepsilon \, e^{\xi_t} \, \dfrac{\partial F}{\partial \alpha} (\alpha_t, e^{\xi_t}), \qquad e^{\xi_{t+1}} = e^{\xi_t} \exp\left\{ 2 \varepsilon \, e^{\xi_t} \, \dfrac{\partial F}{\partial \beta} (\alpha_t, e^{\xi_t}) \right\}$    (48)

which is equivalent to (15) once the log transformation is undone:

$\alpha_{t+1} = \alpha_t + \varepsilon \, \beta_t \, \dfrac{\partial F}{\partial \alpha} (\alpha_t, \beta_t), \qquad \beta_{t+1} = \beta_t \exp\left\{ 2 \varepsilon \, \beta_t \, \dfrac{\partial F}{\partial \beta} (\alpha_t, \beta_t) \right\}$    (49)

Note that the auxiliary variable $\xi$ does not appear in (15). Finally, we shall prove that a small enough $\varepsilon$ leads to a non-decrease of $F$. Let us consider again (47), and let us define:

$\Delta_\alpha = e^{\xi_t} \, \dfrac{\partial F}{\partial \alpha} (\alpha_t, e^{\xi_t}), \qquad \Delta_\xi = 2 \, e^{\xi_t} \, \dfrac{\partial F}{\partial \beta} (\alpha_t, e^{\xi_t})$    (50)

Using a first order Taylor expansion around $(\alpha_t, \xi_t)$, there exists a small enough $\varepsilon > 0$ such that:

$H(\alpha_t + \varepsilon \Delta_\alpha, \, \xi_t + \varepsilon \Delta_\xi) \approx H(\alpha_t, \xi_t) + \varepsilon \, \nabla H(\alpha_t, \xi_t)' \begin{pmatrix} \Delta_\alpha \\ \Delta_\xi \end{pmatrix}$
$\quad = H(\alpha_t, \xi_t) + \varepsilon \, e^{\xi_t} \left[ \dfrac{\partial F}{\partial \alpha} (\alpha_t, e^{\xi_t}) \right]^2 + 2 \varepsilon \, e^{2 \xi_t} \left[ \dfrac{\partial F}{\partial \beta} (\alpha_t, e^{\xi_t}) \right]^2 \geq H(\alpha_t, \xi_t)$    (51)

or, equivalently: $F(\alpha_{t+1}, \beta_{t+1}) \geq F(\alpha_t, \beta_t)$.
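A toy R sketch of the natural-gradient step in (49) is given below, using a simple synthetic objective in place of the actual free energy. The function F_toy, the numerical gradients and the step size are illustrative assumptions, not quantities from the paper; the multiplicative update keeps beta positive by construction.

    # Illustrative natural-gradient ascent step for (alpha, beta), beta > 0, as in (49).
    # 'F_toy' stands in for the variational free energy restricted to one (alpha, beta)
    # pair; it is an arbitrary smooth function chosen only to make the example runnable.
    F_toy <- function(alpha, beta) {
      -(alpha - 1)^2 - (log(beta) - 0.5)^2   # maximised at alpha = 1, beta = exp(0.5)
    }

    # Numerical partial derivatives of F_toy.
    dF_dalpha <- function(alpha, beta, h = 1e-6) (F_toy(alpha + h, beta) - F_toy(alpha - h, beta)) / (2 * h)
    dF_dbeta  <- function(alpha, beta, h = 1e-6) (F_toy(alpha, beta + h) - F_toy(alpha, beta - h)) / (2 * h)

    natural_gradient_step <- function(alpha, beta, eps = 0.1) {
      alpha_new <- alpha + eps * beta * dF_dalpha(alpha, beta)
      beta_new  <- beta * exp(2 * eps * beta * dF_dbeta(alpha, beta))  # stays positive
      c(alpha = alpha_new, beta = beta_new)
    }

    # A few iterations from an arbitrary starting point:
    alpha <- 0; beta <- 1
    for (iter in 1:50) {
      step  <- natural_gradient_step(alpha, beta)
      alpha <- step[["alpha"]]
      beta  <- step[["beta"]]
    }
    c(alpha = alpha, beta = beta)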