Efficient Bayesian Community Detection using Non-negative Matrix Factorisation
Ioannis Psorakis∗†, Stephen Roberts‡ and Ben Sheldon§

October 24, 2018
Abstract
Identifying overlapping communities in networks is a challenging task. In this work we present a novel approach to community detection that utilises the Bayesian non-negative matrix factorisation (NMF) model to produce a probabilistic output for node memberships. The scheme has the advantages of computational efficiency, soft community membership and an intuitive foundation. We present the performance of the method against a variety of benchmark problems and compare and contrast it to several other algorithms for community detection. Our approach performs favourably compared to other methods at a fraction of the computational cost.
Keywords: Community detection, non-negative matrix factorisation, Bayesian inference.
The network paradigm is widely used to model real-world complex systems, by focusing on the pattern of associations between their structural components. A system is captured as a mathematical graph, where nodes (or vertices) denote the presence of an entity and edges (or links) signify some sort of association (or interaction). In contrast to other data manipulation approaches where each element is described by a set of attributes (for example x ∈ R^D), here our data is captured in a relational form and inferences are made primarily based on connectivity patterns.

Real-world networks differ from the classic Erdős–Rényi random graph because the presence of a link between two nodes is not generated by a Bernoulli trial with the same success probability across all possible pairs. Instead, real-world networks exhibit an inhomogeneous distribution of edges among vertices [12], creating 'hotspots' of heightened connectivity. These modules or communities are densely connected, relatively independent compartments [12] [20] of the network that account for its form and function as a system [26]. The intuition behind this mesoscopic organisation of networks is straightforward, with many examples from everyday life: human social networks consist of cliques of friends, Web pages can be grouped into collections with similar topics, etc.

As an example, consider the simple undirected graph of Fig. 1, described by an adjacency matrix A ∈ R^{N×N}, where we can immediately distinguish the two densely connected compartments C1 and C2. As we are not sure of the membership of node 5, it is fairly reasonable to consider it as an overlap of C1 and C2. Therefore, we can express our community partition as an expansion of our network to a bipartite graph with incidence matrix B ∈ R^{N×K}, so that b_ik = 1 denotes that node i belongs to group k and is zero otherwise. Although the human eye is an excellent analytic tool for simple visualised data [24], the algorithmic process of identifying the number of groups, classifying their members and spotting overlaps in any given network is far from straightforward.

Figure 1: The problem of community detection can be viewed as mapping our network to a bipartite graph, where the first class corresponds to individual nodes and the second to communities. Links connecting nodes to multiple communities help us capture overlapping phenomena, in which an individual can participate in many groups.

Extracting community structure from a network is a considerably challenging task [26], both as an inference and as a computational efficiency problem. One of the main reasons is that there is no formal, application-independent definition of what a community actually is [12]; we simply accept that communities are node subsets with a 'statistically surprising' link density, usually measured by the 'modularity' Q of Newman and Girvan [27].

∗ Authors in alphabetic order.
† Pattern Analysis & Machine Learning Research Group, Department of Engineering Science, University of Oxford. Parks Road, Oxford, OX1 3PJ, U.K. Email [email protected]
‡ Pattern Analysis & Machine Learning Research Group. Email [email protected]
§ Dept Zoology, University of Oxford
Nevertheless, this definition lies at the heart of modern community detection algorithms, manifested either as an exploration of configurations (such as B described above) that seek to approximately maximise Q (as direct optimisation is NP-hard [12]), or as other techniques that exploit characteristics of the network topology to group together nodes with high mutual connectivity (and use Q as a performance measure). A fairly comprehensive review of existing methods is provided in [30] and [12], while [9] provides a table summarising the computational complexity of popular algorithms.

The most significant drawbacks of modern community detection algorithms are hard partitioning and/or computational complexity. Many widely popular methods such as Girvan–Newman [27], Extremal Optimisation [10] or Spectral Partitioning [25] cannot account for the overlapping nature of communities, which is an important characteristic of complex systems [12]. On the other hand, the Clique Percolation Method (CPM) [28] allows node assignment to multiple modules but does not provide a 'degree of belief' on how strongly an individual belongs to a certain group. Therefore, our aim is not only to model group overlaps but also to capture the likelihood of node memberships in a disciplined way, expressing each row of the incidence matrix B as a membership distribution over communities. Additionally, we want to avoid the computational complexity issues of traditional combinatorial methods.

Towards these goals, community detection can be seen as a generative model in a probabilistic framework. This has the advantage that, in principle, fully Bayesian models may be formulated in which priors exist over all the model parameters. This enables, for example, model selection (to determine e.g. how many communities there are) and the principled handling of uncertainty, noise and missing data.

In this paper we consider the case in which constraints exist in our beliefs regarding the generative process for the observed data. In particular we investigate the issue of enforcing positivity onto the model. As the model we consider is evaluated in a Bayesian manner, model selection may be applied to infer the complexity of representation in the solution space of inferred communities.

This paper is organised as follows. We first introduce the basic concepts of the theory and our data decomposition goals. Details of the Bayesian paradigm under which model inference is performed are then presented. Representative results are given in the next sections, followed by conclusions and discussion.
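Since the modularity Q is the quality measure used throughout the later benchmarks, it is worth making it concrete. The sketch below is our own minimal NumPy implementation of the standard Newman–Girvan definition Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j); the function name and the toy graph are ours, not from the paper.

```python
import numpy as np

def modularity(A, labels):
    """Newman-Girvan modularity Q for a hard partition.

    A: symmetric (weighted) adjacency matrix; labels[i]: community of node i.
    """
    A = np.asarray(A, dtype=float)
    two_m = A.sum()                        # total edge weight, counted twice
    k = A.sum(axis=1)                      # node degrees (strengths)
    same = np.equal.outer(labels, labels)  # delta(c_i, c_j)
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

# Two triangles joined by a single edge: Q is high for the natural split.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
print(round(modularity(A, [0, 0, 0, 1, 1, 1]), 3))  # → 0.357
```

Note that assigning every node to a single community always yields Q = 0, which is why 'statistically surprising' partitions are those with substantially positive Q.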
We consider a matrix of observed interactions between a set of N atoms, which in the context of community detection we consider to be individuals. We consider an interaction matrix denoted V ∈ R^{N×N}_+ such that v_ij represents a count process detailing the number of interactions between atoms (individuals) i and j.

We consider the decomposition of the observed interaction matrix V as a linear combination of K canonical communities, each of which can be seen as a latent (hidden) generator of interactions between atoms. Hence,

    v̂_ij = Σ_{k=1}^{K} w_ik h_kj,   (1)

in which we regard the w_ik as mixing coefficients and the h_kj as elements forming a basis set of community structures. The above equation may be re-written in matrix form by defining W ∈ R^{N×K} and H ∈ R^{K×N}:

    V̂ = WH.   (2)

Without constraint, Equation 2 is ill-posed, i.e. an infinite number of equivalent solutions exist. Many well-known decomposition methods can be seen as members of a family of approaches which impose constraints on the solution space to make the matrix-product decomposition well-posed.

PCA:
Principal Component Analysis [17] avoids the problem of an ill-posed solution space by imposing several constraints. The first is that the observed data distribution and the basis are Gaussian distributed (in that only second-order statistics are employed in the PCA formalism) and that the basis is orthogonal. This still leaves a permutation degree of freedom, which is removed by sorting the basis in order of variance (in effect an ordering of the eigenvalues).

A brief note on nomenclature: as the method described here has its roots in the work on non-negative matrix factorization (NMF), we keep to the historic nomenclature of the latter to ease access for the reader who wishes to follow up primary references.

ICA:
Independent Component Analysis [8, 31] makes the solutions to Equation 2 well-posed by forcing statistical independence between the components of the basis without making strong assumptions regarding the Gaussianity of the component distributions, unlike PCA. This gives rise to a methodology which allows projective decompositions similar to PCA for non-Gaussian data.
NMF:
Non-negative Matrix Factorization makes the assumption that all the elements of the matrices V, W, H lie in R_+. This is enough, up to an arbitrary scaling degree of freedom, to ensure that solutions to Equation 2 are well-posed. These assumptions of non-negativity match our prior beliefs regarding the generation of the observed interaction count data, V, and we extend our discussion of NMF in the following sections.

We consider the observed data to be modelled as a Poisson process with expectations given by the elements of V̂ = WH. This Poisson model lies at the core of the NMF methodology first developed by Lee & Seung [21]. The standard maximum-likelihood solution is to find W, H such that p(V | W, H) is maximized or, alternately, the energy function −log p(V | W, H) is minimized.

Consider an element v of V and the associated expected counts from the model, v̂. The negative log likelihood of v under a Poisson model is:

    −log p(v | v̂) = −v log v̂ + v̂ + log v!.   (3)

Using the Stirling approximation to second order, namely

    log v! ≈ v log v − v + (1/2) log(2πv),   (4)

Equation 3 can be written as

    −log p(v | v̂) = v log(v / v̂) + v̂ − v + (1/2) log(2πv),   (5)

and the full negative log-likelihood for all the observed data as

    −log p(V | V̂) = −Σ_i Σ_j log p(v_ij | v̂_ij).   (6)

As each of the k ∈ {1, ..., K} columns of W and rows of H represents the contribution from a single latent community (as per Equation 1), we allow for different shrinkage hyperparameters, defined as the set {β_k}. Following the development of this model in [33], and similar models for probabilistic PCA [34] and ICA [7, 32], we place independent half-normal priors over the columns of W and rows of H, in which the β_k may be seen as precision (inverse variance) parameters:

    p(w_ik | β_k) = HN(w_ik | 0, β_k^{−1}),   p(h_kj | β_k) = HN(h_kj | 0, β_k^{−1}),   (7)

where

    HN(x | 0, β^{−1}) = (2β/π)^{1/2} exp(−β x² / 2).   (8)

Defining the vector β as [β_1, ..., β_K], this leads to negative log priors over W and H:

    −log p(W | β) = Σ_i Σ_k (1/2) β_k w_ik² − (F/2) Σ_k log β_k + const,
    −log p(H | β) = Σ_k Σ_j (1/2) β_k h_kj² − (N/2) Σ_k log β_k + const,   (9)

where F denotes the number of rows of W and N the number of columns of H (here both equal the number of nodes). The net effect of β on the elements of W and H may be considered as follows. As the negative log probability may be regarded as an error or energy function, our goal is to descend its surface to a point of minimum. Consider the negative derivative w.r.t. a single element of W or H of the above equations, e.g.

    −∂(−log p(W | β)) / ∂w_ik = −β_k w_ik.   (10)

Incremental changes in w_ik, as we iterate towards a solution, are proportional to the negative gradient of the energy function, so the effect of the prior is to promote a shrinkage to zero of w_ik with a rate constant proportional to β_k. A large β_k represents a belief that the half-normal distribution over w_ik has small variance, and hence w_ik is expected to lie close to zero. As we shall see, the priors and the likelihood function (quantifying how well we explain the data) are combined with the net effect that columns of W (and rows of H) which have little effect in changing how well we explain the observed data will shrink close to zero. This generic approach is well known in the statistics literature as shrinkage or ridge regression [2], and in the machine learning community as automatic relevance determination [4].

Finally we must place prior distributions over the β_k. We assume the set of β_k are independent and, as these are scale hyperparameters, we place a standard Gamma distribution over them [2]:

    p(β_k | a_k, b_k) = (b_k^{a_k} / Γ(a_k)) β_k^{a_k − 1} exp(−β_k b_k),   (11)

in which the hyper-hyperparameters a_k, b_k defining the Gamma distribution over β_k are fixed. The negative log of the probability distribution over β is hence

    −log p(β) = Σ_k [β_k b_k − (a_k − 1) log β_k] + const.   (12)

Figure 2 shows the generative graphical model for the NMF method. The joint distribution over all variables is

    p(V, W, H, β) = p(V | W, H) p(W | β) p(H | β) p(β),   (13)

and the model posterior over all parameters, given the observations, is:

    p(W, H, β | V) = p(V, W, H, β) / p(V).   (14)

The independence of the β_k corresponds to the belief that the existence of one community is not dependent upon others. Clearly, there will be situations in which this can be extended to allow for a full inter-dependency between communities. We do not consider this here, however. Allowing dependency is similar to the notion of structure priors discussed in [29].

Figure 2: The generative graphical model, in which the observed data V are generated from the latent structure W and H, the components of which have scale hyperparameters β. The hyper-hyperparameters a, b are fixed in the model.

Noting that p(V) is a constant w.r.t. the inference over the model's free parameters, we aim hence to maximize the model posterior given the observations. This is equivalent to minimizing the negative log posterior, which we may regard as an energy (or error) function, U say. We hence define

    U = −log p(V | W, H) − log p(W | β) − log p(H | β) − log p(β).   (15)

Expanding this expression using the results from Equations 5, 6, 9 and 12, and collating all terms independent of the model parameters into a constant, gives:

    U = Σ_i Σ_j [v_ij log(v_ij / v̂_ij) + v̂_ij]
      + (1/2) Σ_k (β_k Σ_i w_ik² + β_k Σ_j h_kj²)
      − Σ_k ((F + N)/2) log β_k
      + Σ_k [β_k b_k − (a_k − 1) log β_k] + const.   (16)

There are a variety of approaches one could take to infer W, H, β given Equation 16. In this paper we follow [21, 22, 3, 33] and utilize a rapid fixed-point maximum a posteriori (MAP) algorithm which is guaranteed to preserve the non-negativity of all parameters in the model. At each iteration the following re-estimations are made:

    H ← [H / (Wᵀ1 + BH)] ⊙ [Wᵀ (V / WH)],   (17)
    W ← [W / (1Hᵀ + WB)] ⊙ [(V / WH) Hᵀ],   (18)

in which we define B as the diagonal matrix with the elements β_k along its diagonal and zeros elsewhere, and 1 as a matrix of ones. We keep to the notation convention that (X/Y) represents element-by-element division (it does not mean XY^{−1}) and ⊙ represents element-by-element multiplication. The values of β_k are re-estimated by setting to zero the derivative w.r.t. β_k of the energy function in Equation 16. This gives an estimate for β_k as

    β_k ← [(F + N)/2 + a_k − 1] / [(1/2)(Σ_i w_ik² + Σ_j h_kj²) + b_k],   (19)

which is the same as the update equation detailed in [33]. The algorithm proceeds by cycling through Equations 17, 18, 19 until a convergence criterion or maximum number of iterations is reached. We note that a fully Bayesian approach to NMF is developed using variational inference in [6]. This offers certain potential improvements over the maximum a posteriori (MAP) solution at the expense of computational speed. The latter we regard as a very important feature of any algorithm for community detection, and the current emphasis of our work is directed by computational efficiency.
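The fixed-point scheme of Equations 17–19 can be sketched in a few lines of NumPy. This is our own minimal implementation, not the authors' code: the function name `bayes_nmf`, the fixed iteration count and the small `eps` guard against division by zero are all our assumptions.

```python
import numpy as np

def bayes_nmf(V, K, a=1.0, b=2.0, n_iter=200, eps=1e-9, seed=0):
    """MAP fixed-point updates for Bayesian NMF (sketch of Equations 17-19).

    V: non-negative F x N interaction matrix. Returns W (F x K), H (K x N)
    and the shrinkage precisions beta (length K).
    """
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.uniform(0, 1, (F, K))
    H = rng.uniform(0, 1, (K, N))
    beta = np.full(K, a / b)          # arbitrary positive initialisation
    ones = np.ones((F, N))
    for _ in range(n_iter):
        R = V / (W @ H + eps)                                      # V / (WH)
        H *= (W.T @ R) / (W.T @ ones + beta[:, None] * H + eps)    # Eq 17
        R = V / (W @ H + eps)
        W *= (R @ H.T) / (ones @ H.T + W * beta[None, :] + eps)    # Eq 18
        beta = (0.5 * (F + N) + a - 1) / (                         # Eq 19
            0.5 * ((W ** 2).sum(axis=0) + (H ** 2).sum(axis=1)) + b + eps)
    return W, H, beta
```

Note how `beta[:, None] * H` and `W * beta[None, :]` realise the products BH and WB for diagonal B, and that components whose β_k grows large are multiplicatively driven toward zero, which is how the effective number of communities K* is pruned down from K.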
As we may write the observed data as a linear combination of community basis structures, and the mixing fractions are strictly non-negative, the model is identical to a mixture model in which the elements w_ik denote the relative importance of community k in explaining the observed interactions associated with member i. Under the assumption that the i-th member's interactions are explained by some community memberships, it is reasonable to define degrees of community membership, π_ik, which sum to unity for each member, as:

    π_ik = w_ik / Σ_{k'} w_ik'.   (20)

A greedy community allocation scheme for member i is easily achieved, if desired, by choosing the community k* which is the argmax of either the w_ik or the π_ik.

In this section we demonstrate the performance of NMF-based community detection against a variety of benchmark problems. We start with a toy network to illustrate the intuition behind our clustering methodology using graphical examples. Afterwards, we continue by testing our method against artificial problems with observed community structure. Finally, we test our method against popular real-world networks of various sizes and levels of community cohesiveness.
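The normalisation of Equation 20, the greedy allocation, and the membership entropy used later to quantify node 'fuzziness' can be sketched as follows; the helper name `memberships` and the tiny example matrix are ours, not from the paper.

```python
import numpy as np

def memberships(W, tol=1e-8):
    """Rows of W -> membership distributions pi_ik (Equation 20),
    plus a greedy hard assignment (argmax over communities)."""
    W = np.asarray(W, dtype=float)
    totals = W.sum(axis=1, keepdims=True)
    pi = W / np.where(totals > tol, totals, 1.0)  # avoid division by zero
    hard = pi.argmax(axis=1)                      # greedy allocation k*
    return pi, hard

W = np.array([[0.9, 0.1, 0.0],    # confident member of community 0
              [0.4, 0.4, 0.2]])   # 'broker' node with entropic membership
pi, hard = memberships(W)
# Membership entropy in bits: high for boundary nodes, low for core nodes.
ent = -(pi * np.log2(np.clip(pi, 1e-12, 1.0))).sum(axis=1)
print(hard.tolist())  # → [0, 0]
```

The entropy of each row of π is the per-node confidence measure: the broker node's distribution is flatter, so its entropy is higher than that of the confidently assigned node.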
In all the results presented in this paper we allow the maximum number of possible communities, K, to equal N, the number of members. The hyper-hyperparameters a and b, which govern the scale of the shrinkage hyperparameters β_k, are fixed at a = 1, b = 2, so that the β_k all have vague distributions over them. The initial matrices W, H each have elements drawn independently at random from a uniform distribution on the interval [0, 1]. The interaction matrix V we use for NMF is derived from the (weighted) adjacency matrix of each network, with diagonal elements set to the strengths of each node (the sum of each row or column of the adjacency matrix).

An Illustrative Example

Consider the simple toy graph of Fig. 3, with N = 16 nodes and M = 25 edges of varying weights. We extract the mesoscopic (community) structure of this network using NMF, along with the popular Extremal Optimisation (EO) [10], Spectral Partitioning (SP) [25] and weighted Clique Percolation Method (wCPM) [11].

Figure 3: An undirected weighted toy graph with 16 nodes. Each pair of nodes has a different interaction strength, as denoted by the different lines.

Although a trivial problem at first glance, each community detection method we applied yielded different modules and node allocations, as seen in Fig. 4. Hard-partitioning methods such as EO and SP produce such inconsistencies mainly due to the 'broker' nature of certain nodes, which lie on high-flow paths in the network, making them difficult to assign to one module or the other [12]. Although this issue is addressed by wCPM, which allows node membership to multiple modules, it does not provide any measure of 'participation strength' or 'degree of belief in membership'.

In NMF, communities are viewed as basis structures, captured in the model as the K columns of our basis matrix W (see Section 2). In this framework, we consider each basis structure or community k to have a total binding energy that is allocated to the atoms (nodes) based on w_k ∈ R^{N×1}_+.
For example, in Figure 5 we take a column of W and draw a colormap (left frame) based on the intensity of its elements. Components with non-zero energy correspond to nodes that participate in that basis structure and form a subset of the whole network (right frame). We can see that the basis community in Fig. 5 is dominated by nodes 6, 7 and 8, which contribute most of the binding energy, while the peripheral nodes 4, 5, 9, 10 have some minor participation.

We applied NMF to our synthetic graph, from which we extracted K* = 4 communities (bases to which at least one member is allocated), as seen from the four numbered plates in Figure 6. For illustrative purposes, we assigned nodes to communities using greedy allocation, i.e. we put each individual in the community to which it contributes most of its energy. The contribution of each atom i to the total binding energy of each structure k, as denoted by w_i ∈ R^{1×K}_+, can be viewed as the individual's degree of participation or, when normalised, probability of membership to that community. From our example, in Figure 7 we show the membership distributions for four different nodes in the graph. In accordance with our intuition, we see that node 6, which acts as a mediator between different communities, has a more entropic membership distribution, while nodes such as 4 or 14 have more confident assignments.

Figure 4: Node allocations to communities for three different community detection methodologies.

Figure 5: One of the NMF basis structures, as extracted from our toy graph. Each atom has a different degree of participation, as can be seen from the colourmap on the left. Node 6 is a focal individual, contributing the most energy to the structure along with nodes 7 and 8, while nodes 4, 5, 9 and 10 are peripheral.

Figure 6: The community structure of our toy graph, where each coloured plate represents a different basis structure. For purposes of illustration, we assigned each node to the community with the highest membership probability.

Figure 7: For each node of the toy graph our method gives a probability membership distribution; on the horizontal axis we enumerate each community appearing in Fig. 6, and each bar represents the likelihood that node i belongs to the respective module. For nodes that weakly communicate with other groups (such as 4 and 14) we see confident allocations, while individuals that lie on the boundary between communities (such as 6) have more entropic membership distributions.

We can also view real-world systems, such as social networks, under the framework described above; groups of individuals are structures bound together with a given energy (time spent together, genetic relatedness, tendency for cooperation, etc.). Every individual contributes a certain amount of such energy to a range of communities, which can also be seen as his/her degree of membership. High-energy members can be regarded as focal individuals in a group, while social structures with members of uniform contribution can be regarded as teams that are held together by the equal participation of their members. Finally, under this framework we can identify highly social individuals that belong to many groups with a high amount of participation.

Having used a simple graph to illustrate the intuition behind NMF-based community detection, we proceed in the following section to demonstrate its performance on artificial problems of larger scale and complexity.

A very popular evaluation methodology for a community detection algorithm is to test it against an artificial network with "observed" community structure and measure how well the algorithm extracts the underlying mesoscopic organisation.
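Before moving to the benchmarks, it is worth making concrete how the interaction matrix V is built from a network in all of these experiments: as described earlier, V is the (weighted) adjacency matrix with diagonal elements set to the node strengths. A minimal sketch (the helper name is ours):

```python
import numpy as np

def interaction_matrix(A):
    """Build V from a (weighted) adjacency matrix: off-diagonal entries are
    the edge weights, diagonal entries the node strengths (row sums)."""
    A = np.asarray(A, dtype=float)
    V = A.copy()
    np.fill_diagonal(V, A.sum(axis=1))  # strength of each node
    return V

A = np.array([[0., 2., 1.],
              [2., 0., 0.],
              [1., 0., 0.]])
V = interaction_matrix(A)
print(V.diagonal().tolist())  # → [3.0, 2.0, 1.0]
```

For a symmetric adjacency matrix, V remains symmetric, and the diagonal carries each node's total interaction count, which the Poisson model treats on the same footing as the off-diagonal counts.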
The partition quality is usually compared with the original using the popular Normalised Mutual Information (NMI) criterion [9].

We start with arguably the most popular type of benchmark problem: realisations of the Newman–Girvan random graph [14] (NG graph). We generate networks with N = 128 nodes and C = 4 communities of n = 32 nodes each, where each node has an average degree of ⟨k⟩ = 16. By manipulating the expected inter-community degree ⟨k_out⟩ of nodes, we test our algorithm against various levels of community cohesiveness. As seen in Figures 8 and 9, NMF produces state-of-the-art performance in extracting the original modules for any degree of fuzziness in the artificial network, outperforming the popular Spectral Partitioning and Hierarchical Clustering (complete linkage, angular distance) methods and having similar performance to Extremal Optimisation.

Figure 8: Normalised Mutual Information and modularity across different levels of community cohesion in a Newman–Girvan random network. We compare our method (NMF) against other popular community detection methodologies such as Extremal Optimization (EO), Spectral Partitioning (Spectral) and Hierarchical Clustering (Hierarchical).

Figure 9: Mean entropy (in bits) of NMF node membership probabilities for decreasing levels of community cohesion in a Newman–Girvan random graph. We notice that NMF can describe the ever-increasing fuzziness of the NG graph in terms of decreasing node allocation confidence.

Although the NG graph is a very popular benchmark problem, it has been heavily criticised [20] for not reflecting the properties of real-world networks; NG graph realisations are small in size, with a fixed number of communities and fixed community populations, while the degree distributions are uniform.
For those reasons, Lancichinetti and Fortunato proposed a new class of benchmark problems [20] (which we shall refer to as LF graphs) that produce networks of any size, with power-law degree and community size distributions. The community cohesiveness is controlled by a mixing parameter µ_t that signifies the expected fraction of inter-community links per node. For the case of weighted LF graphs, we have a similar parameter µ_w that controls the strength allocation of a node between same-community members and outsiders.

For the purposes of our experiments, we generated a variety of such networks with N = 1000 nodes and different parameters regarding the average degree ⟨k⟩ and the exponents of the degree and community size distributions. Starting with a small mixing parameter µ_t and increasing it in steps of 0.1 (from 'clear' to 'fuzzy' community structure), we generate 100 realisations of the LF graph at each step and monitor the module recognition performance of NMF using the popular normalised mutual information criterion. For the case of weighted LF graphs, we manipulate both mixing parameters at the same time, so that µ_t = µ_w. The results, both for binary and weighted networks and for different configuration parameters, are shown in Figures 10 and 11.

In this section we present the performance of NMF on a variety of popular community detection problems. For these networks we have no "observed solution", therefore we measure the performance of our algorithm using the very popular Newman–Girvan modularity Q [27]. In Table 1 we present a list of our datasets, along with their numbers of nodes N and edges M. Our algorithm is compared on the same data with the Extremal Optimisation (EO) [10] and Louvain [5] methods. We note that other methods such as Spectral Partitioning and Hierarchical Clustering give significantly worse performance than NMF, EO or Louvain, and these results are not presented here.

Figure 10: Normalised Mutual Information across different levels of community cohesion in binary LF random networks. We start with a very cohesive network (low mixing parameter µ) and proceed by making the network fuzzier. Each point in the graph represents the average over 100 realisations of an LF graph with the given parameters. The error bars represent one standard deviation.

Figure 11: Normalised Mutual Information across different levels of community cohesion in weighted LF random networks. Again, we start with a very cohesive network and proceed by making the network fuzzier. In this case of weighted networks, we have the same mixing parameter for both inter-community degrees and strengths (µ_t = µ_w). Each point in the graph represents the average over 100 realisations of an LF graph with the given parameters, and the error bars represent one standard deviation.

For each dataset we ran NMF and EO 100 times with different random initialisations and monitored the values of the modularity Q along with the number K* of extracted communities. As previously detailed, for NMF initialisation we assume a possible maximum number of communities, K, equal to the number of nodes, K = N (which is the maximum possible partition size for any network, though we find that running with a lower value is preferred, as this reduces computation), and the 'effective' number of communities K* is then inferred from the data. The Louvain method has very stable behaviour across different runs, so we have omitted the standard deviation of the modularity and community sizes for each dataset. The algorithmic complexity of our approach is O(N²K), as compared to O(N² log² N) for EO [9]. We also note that, in practice, as EO requires stochastic steps, the run times of the two algorithms differ even more significantly.
In the majority of applications the maximum likely number of communities K ≪ N, and so our approach can be very efficient and competitive against the Louvain method.

Table 1: Real-world datasets

Dataset                   N     M      weighted?
Dolphins [23]             62    159    no
Books US Politics [19]    105   441    no
Les Miserables [18]       77    254    yes
College Football [14]     115   613    no
Jazz Musicians [15]       198   2742   no
C. elegans metabolic [1]  453   2025   no
Network Science [27]      1589  2742   yes
Facebook Caltech [35]     769   16656  no

Table 2: Modularity Q of NMF, Extremal Optimisation and the Louvain method on each dataset (mean ± standard deviation over runs).

We use Q as a performance measure of partition quality and we also present the number of identified communities. As Q cannot account for the overlapping nature of communities, we use 'greedy allocation', i.e. we assign each node to the module with the highest probability of membership. The results of NMF are presented alongside the very popular Extremal Optimisation and Louvain methods for comparative analysis. From Table 2 we can see that our approach performs competitively, yet it is not an algorithm designed with the aim of maximising modularity, unlike either the EO or Louvain methods. Additionally, it has the advantages of providing probabilistic outputs for community membership (therefore achieving soft partitioning) and having low computational overhead. Finally, NMF does not suffer from the resolution limit [13] of modularity optimisation methods such as EO, where smaller groups are merged together [30] [13], ending up with a smaller number of communities, as seen in Table 3.

In this work we described a novel approach to community detection that adopts the Bayesian non-negative matrix factorisation model of [33] to achieve soft partitioning of the network, assigning each node a probability of membership over all the extracted communities.
This allows us not only to capture the fuzziness of the network (via the entropy of the membership distribution) but also to improve network cartography techniques [16] by identifying central and peripheral nodes in modules. Network visualization tools can also be improved in this manner. The approach is computationally efficient and offers performance comparable to state-of-the-art methods. Indeed, the performance advantages for large data sets allow the NMF approach to be run many times in the time taken by a single run of competing approaches. Clearly this allows for the selection of the best performing run, or a small ensemble of high-modularity solutions.
Future application work in this area addresses the analysis of a large zoological data set of interactions between members of a population of wild birds. As significant data exist and secondary verification data have been collated, such as breeding pair identifications, this offers a unique chance to verify any relationships that our approach detects.

Work is currently underway to extend this model to incorporate dynamics, such that non-stationary, time-varying community relationships may be tracked. We have not discussed the handling of missing data in this paper, but taking missing observations into account may be readily handled in our approach, as detailed in [6]. As mentioned in the paper, more complex priors over the set of β_k would allow for domain knowledge to be incorporated and for correlated structures to be correctly dealt with.

The authors would like to thank Nick Jones, Mason Porter and Mark Ebden for valuable comments. Ioannis Psorakis is funded by a grant via Microsoft Research, for which we are most grateful.
References

[1] http://deim.urv.cat/~aarenas/data/xarxes/celegans_metabolic.
[2] J. M. Bernardo and A. F. M. Smith. Bayesian Theory. John Wiley, 1994.
[3] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, pages 155–173, 2006.
[4] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, first edition, 2007.
[5] V. Blondel, J. L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. J. Stat. Mech., 2008:P10008, October 2008.
[6] A. T. Cemgil. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, 2009(785152), 2009.
[7] R. A. Choudrey and S. J. Roberts. Variational mixture of Bayesian independent component analyzers. Neural Computation, 15(1):213–252, 2003.
[8] P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.
[9] L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas. Comparing community structure identification. J. Stat. Mech., 2005(09):P09008, 2005.
[10] J. Duch and A. Arenas. Community detection in complex networks using extremal optimization. Phys. Rev. E, 72(2):027104, 2005.
[11] I. Farkas, D. Abel, G. Palla, and T. Vicsek. Weighted network modules. New Journal of Physics, 9(6):180, 2007.
[12] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, February 2010.
[13] S. Fortunato and M. Barthelemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences, 104(1):36–41, 2007.
[14] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. PNAS, 99(12):7821–7826, 2002.
[15] P. Gleiser and L. Danon. Community structure in jazz. Advances in Complex Systems, 6(4):565–573, 2003.
[16] R. Guimera and L. A. N. Amaral. Cartography of complex networks: modules and universal roles. J. Stat. Mech., 2005(2):P02001, 2005.
[17] I. T. Jolliffe. Principal Component Analysis. Springer.
[20] Phys. Rev. E, 80(1):016118, 2009.
[21] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorisation. Nature, 401:788–791, October 1999.
[22] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562. MIT Press, 2000.
[23] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology, 54(4):396–405, 2003.
[24] M. E. J. Newman. The structure and function of complex networks. SIAM Rev., 45(2):167–256, 2003.
[25] M. E. J. Newman. Modularity and community structure in networks. PNAS, 103(23):8577–8582, 2006.
[26] M. E. J. Newman. Networks: An Introduction. Oxford University Press, 2010.
[27] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69(2):026113, 2004.
[28] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043):814–818, 2005.
[29] W. Penny and S. J. Roberts. Bayesian multivariate autoregressive models with structured priors. IEE Proceedings on Vision, Image and Signal Processing, 149(1):33–41, 2002.
[30] M. A. Porter, J. P. Onnela, and P. J. Mucha. Communities in networks. Notices of the American Mathematical Society, 56(9):1082–1097 and 1164–1166, 2009.
[31] S. Roberts and R. Everson. Independent Component Analysis: Principles and Practice. Cambridge University Press, 2001.
[32] S. J. Roberts and R. A. Choudrey. Bayesian independent component analysis with prior constraints: An application in biosignal analysis. In Joab Winkler, Mahesan Niranjan, and Neil Lawrence, editors, Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Computer Science, pages 159–179. Springer Berlin / Heidelberg, 2005.
[33] V. Tan and C. Févotte. Automatic relevance determination in nonnegative matrix factorization. In SPARS'09 - Signal Processing with Adaptive Sparse Structured Representations, pages 1–19, 2009.
[34] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis.