Robust Bayesian inference of network structure from unreliable data
Jean-Gabriel Young, George T. Cantwell, and M. E. J. Newman
Center for the Study of Complex Systems, University of Michigan, Ann Arbor, Michigan 48109, USA
Department of Physics, University of Michigan, Ann Arbor, Michigan 48109, USA
Most empirical studies of complex networks do not return direct, error-free measurements of network structure. Instead, they typically rely on indirect measurements that are often error-prone and unreliable. A fundamental problem in empirical network science is how to make the best possible estimates of network structure given such unreliable data. In this paper we describe a fully Bayesian method for reconstructing networks from observational data in any format, even when the data contain substantial measurement error and when the nature and magnitude of that error is unknown. The method is introduced through pedagogical case studies using real-world example networks, and specifically tailored to allow straightforward, computationally efficient implementation with a minimum of technical input. Computer code implementing the method is publicly available.
I. INTRODUCTION
Networks are widely used as a convenient mathematical representation of the relationships between elements of a complex system [1]. Network methods have been fruitfully applied to aid our understanding of systems in physics, biology, computer and information sciences, the social sciences, and many other areas. A typical empirical network study will first determine the structure of a network of interest, using experiments, field observations, or archival data, then analyze that structure to reveal features of interest, using any of the many quantitative analysis techniques developed for this purpose [2].

In this paper we focus on the first part of this process, on how we determine the structure of a network from appropriate empirical observations. At first sight, this appears to be a straightforward task. Most studies aim to measure the presence or absence of edges in a network, either singly or in some collective fashion, and then assemble a picture of the complete network by aggregating many such measurements. Upon further consideration, however, it is apparent that the situation is not so simple, because most network measurements do not tell us unambiguously about the presence or absence of edges. At best, they give us a noisy indication, and in many cases they merely hint obliquely at the network structure [3].

As an example, consider a protein-protein interaction network representing the pattern of physical interactions between proteins in the cell. Such networks can be measured in a relatively direct manner. Techniques such as affinity purification and two-hybrid screens can be used to determine whether a given protein interacts with others [4, 5]. These techniques are notoriously unreliable, however, returning high rates of both false positives and false negatives, so that individual measurements do not themselves tell us the structure of the network [6].
Measurements may have to be repeated many times in order to separate the signal from the noise [7].

A more complex example is the measurement of a social network, such as a network of who is friends with whom. Such networks can be measured by conducting surveys in which people are asked to identify their friends [8]. Here, however, things can get complicated. For instance, it happens often that person A identifies person B as a friend, but person B does not identify person A. In some communities fully a half of all friendships are "one-way" friendships of this kind [9, 10]. Should two such people be considered friends? Why do they disagree? Is one of the individuals mistaken, or forgetful, or not telling the truth? Perhaps the two are using different definitions of friendship? Things become harder still when we consider other data that are commonly gathered in such surveys, such as age, gender, race, occupation, income, or educational level of participants. Friendships are well known to depend strongly on such personal characteristics, and in cases where we are uncertain about a hypothesized friendship between two people we may be able to use their traits to make a better informed decision about whether they are really friends [11]. The best estimate of whether two people are friends may thus be the result of a complex calculation that combines many inputs.

Most empirical studies of network structure, however, do not take such an approach. Even though network data are known to be noisy and imperfect [6, 12-17], many experimenters nonetheless rely on simple direct measurements of the presence of edges and effectively make the assumption that the resulting data are the network. In a protein-protein interaction network they assume an interaction to be present if, for example, it is observed to exist enough times when measured. In a friendship network two people are assumed to be friends if one identifies the other as a friend.
And yet it is clear from the discussion above that the true situation is more complicated than this. In realistic circumstances, we have a heterogeneous body of data, potentially of many different types, potentially unreliable, and varying in its relationship to the actual network structure. In such circumstances, traditional methods for inferring network structure from data may be inadequate and in some cases outright misleading.

In this paper we present a general framework for inferring network structure from measurements using methods of Bayesian inference, along with a complete software pipeline implementing that framework. The methods we describe take empirical measurements of a system and return a posterior distribution over possible network structures, thereby telling us not only what the likely structure is but also giving us an estimate of our certainty about that structure. Our approach requires a minimum of input from the experimenter, the only information they need supply (other than the data) being a description of how the measurements depend on the underlying network structure, which is specified at a high level in the form of probability distributions; we give a number of examples in this paper. The rest of the calculation, from data to posterior distribution and final network structure, is performed automatically. Application of Bayesian inference methods often requires experimentation to find the best approach for a particular application. The level of automation in our framework makes this a straightforward task, allowing one to easily experiment with different strategies and quickly see the results [18].

The problem we address falls in the general area of network reconstruction, the problem of determining the structure of a network from the available data.
There has been a considerable amount of previous work on network reconstruction, much of it in the subject-specific literature and directed at the reconstruction of particular types of networks or using particular types of data. Methods have been proposed, for example, for geographic co-location data [19-22], social networks [23], brain scans [24-26], and biochemical networks [5, 27-29]. Perhaps the closest precursor to our work is that of Butts [23], who developed Bayesian methods and Gibbs sampling techniques for estimating social network structure from unreliable social surveys. Priebe and collaborators [30] have developed similar ideas, incorporating structured priors over networks to improve inference, an approach that has been echoed in statistics [24-26] and network science [31]. A key difference between these approaches and our own, however, is that they were developed with particular measurement processes in mind, whereas our methods can accommodate almost arbitrary ones. The approach we propose also differs from our own related previous work [3, 32, 33] in putting forward an estimation algorithm that works essentially automatically with arbitrary models, using ideas borrowed from the literature on finite mixture models [34, 35].

Looking more broadly, there is also a considerable volume of work that tackles other problems concerning the reliability of network data, many of which could also be addressed using variants or extensions of the methods proposed here. The problem of link prediction, for instance, involves predicting missing edges in partially observed networks, and a range of algorithms have been proposed that achieve good performance [36-40]. Also much studied is the problem of inferring network structure from non-network data [41], such as time-series [42], gene expression data [43], the spread of information or disease [44, 45], and various local network features and node properties [46, 47].
A separate literature deals with problems of network sampling, addressing how choices such as which nodes or edges we measure can affect our estimates of network properties and how best to make those choices [2, 48-50]. Disambiguation or entity resolution is the process of accurately identifying the nodes in a network in the presence of potential sources of error such as duplicate nodes or nodes that have been inadvertently combined [16, 51, 52]. Some methods have also been proposed that aim to perform more than one of these tasks at once, such as simultaneous link prediction and disambiguation [53]. Finally, there is also a growing literature on making best estimates of derived network metrics starting from probabilistic representations of network structure [54-58]. These methods use the kinds of structural estimates we develop in this paper as input to calculations of further quantities of interest, such as centrality measures, path lengths, correlations, community structure, or core decompositions [55, 56].

This manuscript is organized as follows. In Section II we first give an overview of the framework we propose for inference of network structure from unreliable data. Then in Section III we describe in depth two example applications, illustrating the construction of appropriate measurement models and the practical implementation of the method. In Section IV we give our conclusions and suggest some directions for future work. An appendix gives technical details of the workings of our method.

II. DESCRIPTION OF THE METHOD
Suppose we have made some measurements of a networked system. They may include direct measurements of network structure, such as measurements of the presence or absence of edges, but they may also include other quantities that could have indirect bearing on network structure, such as measurements of node properties or traits. Our objective is to make the best possible estimate of the structure of the network from these measurements and to do so efficiently and almost automatically, requiring a minimum of technical effort, so that practitioners can focus on making the measurements and interpreting the results. In this section we give a broad overview of the concepts and essential equations underlying our method. A complete derivation and accompanying technical discussion can be found in Appendix A.

In a nutshell, we posit that there exists some underlying network whose structure affects measurements made on the system of interest. This network is represented by its adjacency matrix A, which is initially unknown. Our goal is to estimate this matrix. We assume that the observational data depend on the adjacency matrix but in a potentially noisy way: even exact repetition of the same experiment could produce different observations. This means that, even though the network has a well-defined and unambiguous structure, it will not in general be possible to tell exactly what that structure is from the measurements. To accommodate this uncertainty, we adopt a Bayesian point of view. Instead of inferring the exact network structure itself from measurements, we instead infer a probability distribution over possible structures compatible with the observations.

Apart from the data themselves, the only input our method requires of the user is the specification of a model that represents, in general terms, how the data depend on the network structure.
This model can take a variety of different forms: networks and the experiments used to measure them vary widely, so no single model applies in all cases. Our method allows the user to specify the model that is most appropriate to their situation. The model may contain parameters (sometimes many of them) but it is not necessary to know the values of these parameters: our method automatically infers the best values from the data.

For the sake of concreteness, we concentrate in this paper primarily on the most common situation encountered in network studies, in which the network is simple, undirected, and unweighted, and the data consist of individual measurements of the presence or absence of edges. The methods we describe are applicable to other situations as well, but this one covers many cases of interest and will allow us to use a more transparent and explicit notation. The measurements themselves can be as simple as observed interactions between pairs of nodes, but can also take more complex forms, such as paths across the network, time-series, tags associated with relationships, or any of a variety of other possibilities. We encode the measurements in an array X, whose entry X_ij contains all the information we have about the interaction of nodes i and j.

We also make a further crucial assumption about the model, namely that observations of different node pairs are conditionally independent, which in this case means that the observations X_ij of the interaction between i and j depend only on the adjacency matrix element A_ij and not on any other elements. This assumption is not strictly necessary for the method to work but, as shown in Appendix A, it improves the computational efficiency substantially. And, since it is true of most commonly used models anyway, it is in practice not a significant restriction. There are exceptions: in certain ("fixed choice") social network studies, for instance, participants are limited to naming a maximum number of friends or contacts, such as ten.
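The practical payoff of conditional independence is that the likelihood of the whole data set factorizes into a product of per-pair terms, so it can be evaluated pair by pair. The following minimal sketch illustrates this; the function name `log_likelihood` and the generic callback `mu` are our own illustration, not part of the published code:

```python
import numpy as np

def log_likelihood(X, A, mu, theta):
    """Log-likelihood of pairwise data X given adjacency matrix A.

    Conditional independence lets us sum log mu_ij(X_ij, A_ij, theta)
    over distinct node pairs i < j (undirected network).
    mu(x, a, theta) returns P(X_ij = x | A_ij = a, theta).
    """
    n = len(A)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.log(mu(X[i][j], A[i][j], theta))
    return total
```

Summing log-probabilities rather than multiplying probabilities avoids numerical underflow when the number of node pairs is large.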
This means that every time a participant names a contact, contacts with other individuals become less likely, because the participant has fewer "slots" left to fill, and hence contacts are (weakly) negatively correlated. In this paper we neglect such dependencies and assume that contacts are independent. (However, see Refs. [3, 32] for a discussion of methods that can handle dependencies.)

With these assumptions, selecting a model boils down to making three basic choices. The first and most crucial of these is specifying how the pairwise measurements X_ij depend on the underlying network, as represented by the adjacency matrix A. Specifically, for a given pair of nodes i, j we need to specify the expected distribution of values X_ij for the case where i and j are connected by an edge and for the case where they are not. We will write these two distributions respectively as

    µ_ij(1, θ) = P(X_ij | A_ij = 1, θ)    (1)

and

    µ_ij(0, θ) = P(X_ij | A_ij = 0, θ),    (2)

where θ denotes any parameters of the distribution. (We will in some cases drop θ from the notation where it is clearly understood.) We will refer to Eqs. (1) and (2) as the data model.

The second modeling choice is the specification of the prior probability ascribed to the edge between i and j. What is the probability that the edge exists before we make any measurements of it? Do all edges have an equal chance of existing a priori, or are some more likely than others? Again we assume that different edges are independent, although again this assumption, while computationally convenient, is not strictly necessary. Mimicking the notation introduced for the data model, we write the prior probability of an edge between i and j as

    ν_ij(1, θ) = P(A_ij = 1 | θ)    (3)

and the probability of no edge is ν_ij(0, θ) = 1 − ν_ij(1, θ). Again θ collectively denotes the set of parameters (if any).
We call this second set of probabilities the network model.

The third and last modeling choice is the specification of the prior distribution on the parameters θ. Our framework does not place any restriction on the possible form of the prior on θ. In many cases a simple uniform (maximum entropy) prior works well, but the prior can also be chosen, for example, to encode specific prior knowledge about the system or to rule out unphysical values of the parameters.

Once these three choices are made, the rest of the procedure is essentially automatic. Given the model choices and a set of measurements, our method will generate a string of pairs (A_r, θ_r) of networks and parameters compatible with the measurements. By inspecting these networks and parameters we can determine any other network properties we might care about. For example, if we want to determine whether i and j are connected by an edge, we can inspect the matrix element A_ij^(r) for all r and compute the fraction of the time that A_ij^(r) = 1, which gives us (a Monte Carlo estimate of) the posterior probability of the edge's existence.

More precisely, the sample networks and parameter sets returned by our algorithm are drawn from the joint posterior distribution P(A, θ | X), which allows us to compute an estimate of the expected value of any function f(A, θ) of the network and parameters thus:

    ⟨f(A, θ)⟩ = Σ_A ∫ f(A, θ) P(A, θ | X) dθ ≃ (1/N) Σ_{r=1}^{N} f(A_r, θ_r),    (4)

where N is the number of samples generated.

Our computer code implementing these calculations is freely available online with accompanying tutorials explaining its use.
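The Monte Carlo average of Eq. (4) is simple to compute once samples are in hand. The sketch below (our own illustration, with hypothetical function names) shows the two uses described above: estimating per-edge posterior probabilities by taking f(A, θ) = A_ij, and averaging an arbitrary function over samples:

```python
import numpy as np

def posterior_edge_probabilities(sampled_networks):
    """Monte Carlo estimate of P(A_ij = 1 | X): Eq. (4) with
    f(A, theta) = A_ij, i.e., the fraction of posterior samples
    in which each edge is present."""
    return np.mean(np.asarray(sampled_networks, dtype=float), axis=0)

def posterior_mean(f, samples):
    """Generic Monte Carlo average <f(A, theta)> over posterior
    samples, given as a list of (A_r, theta_r) pairs."""
    return sum(f(A, theta) for A, theta in samples) / len(samples)
```

For example, averaging 4000 sampled adjacency matrices in this way yields the kind of posterior edge-probability matrix shown later in Fig. 2b.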
To demonstrate how our method operates in practice, we present two detailed case studies. The first involves a network of animal interactions.
A. Network of dolphin companionship
Many animals form lasting social networks, including monkeys, deer, horses, cattle, dolphins, and kangaroos [14]. Here we analyze data from Connor et al. [59] of interactions among a small (n = 13) group of male bottlenose dolphins as they swim in a shallow lagoon. This is a typical example of an animal observational study that aims to determine social ties indirectly by observing behavior. Standard techniques of social network analysis would typically be used to transform the observations into association indices that quantify the level of interaction between pairs of individuals [60]. These indices, however, can be hard to interpret and their definition relies on somewhat ad hoc assumptions. Our methods give us a more principled way to infer connections by interpreting the data as noisy measurements of an underlying social network.
1. Basic model
In this particular study the recorded data X_ij, shown in Fig. 1, represent the number of times each pair of dolphins is observed swimming in close proximity. We can use these data to reconstruct the underlying network as follows. First, it is reasonable to assume that the number of interactions between two dolphins will depend on whether they have a network connection or not. But also we expect there to be some randomness in the numbers, both because of circumstances and because of observational error. We thus model the number of interactions as a Poisson random variable with mean λ_1 or λ_0 depending on whether there is or is not a network connection, respectively. That is,

    µ_ij(0, λ_0) = (λ_0^{X_ij} / X_ij!) e^{−λ_0},    (5a)
    µ_ij(1, λ_1) = (λ_1^{X_ij} / X_ij!) e^{−λ_1}.    (5b)

We assume that λ_1 > λ_0, i.e., that individuals interact more often if they have a network connection than if they do not.

We also need to choose our network model and prior on the model parameters. Having no prior information about the probabilities of individual network edges, we assume that all edges are a priori equally likely, which implies

    ν_ij(0, ρ) = 1 − ρ,    (6a)
    ν_ij(1, ρ) = ρ,        (6b)

where ρ is the probability of an edge.

For the priors on the parameters we make minimal assumptions. For ρ, which is a probability, we assume a uniform prior on the interval [0, 1]. For λ_0 and λ_1, which have semi-infinite support, we cannot use a uniform prior, so instead we assume a slowly varying probability over a wide range of plausible values.

(The code can be downloaded from https://github.com/jg-you/noisy-networks-measurements.)

FIG. 1. Data on interactions among a group of dolphins, from Connor et al. [59]. Thirteen male dolphins were observed as they swam in a shallow lagoon and tabulations were made of pairs that swam together. (a) Observed frequency of interaction for each pair. (b) Histogram of frequencies.
In keeping with modern Bayesian practice we use a semi-normal distribution with large variance (i.e., the right half of a normal distribution centered on zero):

    P(λ_k) ∝ e^{−λ_k² / 2σ²},    (7)

where σ ≫ 1 and P(λ_k) = 0 if λ_k < 0. In our work we use σ = 100.

With the model specified, we now run the algorithm described in Appendix A and obtain a series of samples (A_r, θ_r) from the joint distribution P(A, θ | X), where θ in this case collectively refers to the parameters λ_0, λ_1, and ρ, and θ_r is one realization of these parameters. Two examples of sampled network structures are shown in Fig. 2a, out of thousands generated. As the figure shows, the two structures are similar but not identical. This is expected: we should see variability from sample to sample. When looking over the entire sample set, for example, the edge between nodes 9 and 10, highlighted in orange, appears almost always, whereas the edge between 1 and 8, in blue, appears only rarely. To capture this variability, we average over samples and compute the posterior probability of every edge as the fraction of samples in which the edge occurs. The result is shown in matrix form in Fig. 2b. Comparing with Fig. 1, we see that these probabilities are quite different from the distribution of number of interactions between dolphin pairs. Moreover, while the numbers of interactions span a wide range of values, the probabilities of edges are clustered around zero and one. This is good news. Edge probabilities near zero and one indicate edges about which we are relatively certain, either that they exist or that they do not. Thus we have turned a data set with considerable variability into a network structure about which we have high confidence.

FIG. 2. Reconstruction of the network of dolphins from the data shown in Fig. 1. (a) Two examples of sampled network structures. (b) Matrix of posterior edge probabilities, obtained by averaging over 4000 network samples. Entries in the matrix corresponding to the edges highlighted in panel (a) are shown in matching colors.
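For fixed parameter values, the conditional independence of the model in Eqs. (5)-(6) gives every edge a closed-form conditional posterior probability via Bayes' rule, which is the quantity a Gibbs-style sampler draws each A_ij from. A minimal sketch of that computation (our own illustration, not the published code; the parameter values in the usage note are only indicative):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Poisson probability mass function: lam^x e^{-lam} / x!."""
    return lam ** x * exp(-lam) / factorial(x)

def edge_posterior(x_ij, lam0, lam1, rho):
    """Posterior probability that A_ij = 1 given the observed
    interaction count x_ij and fixed parameters, combining the
    data model of Eq. (5) with the prior of Eq. (6)."""
    on = rho * poisson_pmf(x_ij, lam1)          # nu_ij(1) * mu_ij(1)
    off = (1 - rho) * poisson_pmf(x_ij, lam0)   # nu_ij(0) * mu_ij(0)
    return on / (on + off)
```

With parameters of roughly the magnitude inferred for the dolphin data (λ_0 near 0.6, λ_1 near 14), a pair observed together ten times gets an edge probability very close to one, while a pair never observed together gets a probability very close to zero, which is why the posterior matrix in Fig. 2b is so sharply polarized.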
We discuss this point further below.

In conjunction with these estimates of the network's structure, we also compute estimates of the parameters θ, by averaging the parameter samples, just as we did with the network samples. For instance, we find that λ_0 ≈ 0.6 and λ_1 ≈ 14.4, meaning that dolphin pairs interacted an average of 14.4 times during the experiment if they shared a network connection, but only 0.6 times if they did not. In other words, the effect of network connections is very pronounced and highly statistically significant. We also obtain a corresponding estimate of the prior edge probability ρ.

FIG. 3. Statistics of samples drawn from the posterior distribution P(λ_0, λ_1, ρ | X) for the data shown in Fig. 1 and the measurement model given in Eqs. (5)-(6), over four runs with randomly chosen initial conditions. Only 500 of the 4000 total samples taken are shown for the sake of visual clarity. Colors indicate the four runs. (a) The logarithm of the posterior probability. Different runs are separated by vertical dotted lines. (b) Pair plot showing the relation between sampled values of parameters, as well as the distribution of individual parameters on the diagonal.
2. Sampling quality and goodness of fit
While these results are promising, there is further work to be done if we are to be confident in them. In particular, as is standard with Bayesian calculations, there are two specific things we need to check [61]. First, we need to be sure that the Monte Carlo algorithm is sampling correctly from the posterior distribution and, second, we need to test whether our proposed model is in fact a good fit to the data.

In Monte Carlo calculations of this kind the posterior probability distribution is typically rugged, meaning it has multiple local optima, and the sampling algorithm can as a result get stuck for periods of time in suboptimal portions of the sampling space. To test for this issue we plot in Fig. 3a the log probability of our samples as a function of time over four different runs of the algorithm. The plot shows that the distribution of values appears roughly consistent within each run and across different runs, which we take as a sign that the samples have not failed in obvious ways. In Fig. 3b we compare how the sampled values for the parameters λ_0, λ_1, and ρ relate to one another. These plots, colloquially known as pair plots, are conventional in Bayesian analysis. In addition to showing the distribution of values for the individual parameters on the diagonal (which are more informative than the simple averages reported in the text above), these plots can help verify the quality of the samples and provide some sanity checks. In our case, the plots reveal that all the samples come from approximately the same region of parameter space regardless of the different initialization of the four runs, which tallies with the uniform sample quality found in Fig. 3a and gives further evidence that the algorithm is sampling consistently.

FIG. 4. Posterior-predictive tests of the model used for the dolphin network. (a) Average predicted number of interactions between dolphin pairs as a function of the actual observed number of interactions, for five random data sets generated from the model. Colors of the data points indicate the numbers of samples, and R = 0.50. (b) Scatter plot of discrepancy values. Each point in this plot corresponds to one of 500 sets of parameter values, selected at random from a total of 4000 such sets drawn from the posterior distribution P(λ_0, λ_1, ρ | X) during the calculation. Sets having higher model-model discrepancy D(X̃; λ_0, λ_1, ρ) than data-model discrepancy D(X; λ_0, λ_1, ρ) are highlighted in blue, above the diagonal. The fraction of such sets gives us the p-value discussed in the text.

Another function of pair plots is to diagnose problems with the model specification. Two parameters that cannot be independently determined from the data, for instance, will have strong correlation, and other issues like permutation symmetries will also show up in these plots. In the case of Fig. 3b we find the parameters to be only modestly correlated and there are no major red flags at this stage.

Having confirmed that the samples are plausibly drawn from a correctly specified posterior distribution, the other thing we need to check is whether our model is actually a good fit to the data. If the model is a poor one, then even the best fit it provides may not actually be a good fit. To test the goodness of fit we use two Bayesian tests of the type known as posterior-predictive assessments. (See Appendix A and Refs. [18, 62] for discussions.) In these tests we use the data model P(X | Â, θ̂) with parameter values Â and θ̂ drawn from our Monte Carlo sample to generate new data X̃, as if we were making measurements on a system that obeyed the fitted model exactly. Then we compare these new data to the original inputs. If the model is a good fit, the two should look similar. An example of such a comparison is shown in Fig. 4. Panel (a) in the figure shows the number of times a pair of dolphins interacts in the simulated data as a function of the number of times they are observed to interact in the original data, in five artificial data sets X̃ generated as above.
If data and model agreed well, most of the points in this scatter plot would concentrate along the diagonal line, but in this case they do not. This is our first indication that the fit we have found may not be as good as we would like. We will see shortly how to make the fit much better, but let us proceed for the moment with what we have, as an illustration of our goodness-of-fit analysis.

To more accurately quantify the performance of the model we can calculate a discrepancy between data sets [62]:

    D(X; θ) = Σ_ij X_ij log( X_ij / X̃_ij(θ) ),    (8)

where X̃_ij(θ) are the data generated from the model with parameters θ. The discrepancy is essentially a Kullback-Leibler divergence between the distribution of real and synthetic data for one sampled value of the parameters. It functions in this situation as a goodness-of-fit measure: the smaller the discrepancy, the closer the synthetic data are to the input.

The discrepancy is not very informative by itself, however, since it is not clear what kind of values we should expect to see. For example, even if the model is a perfect fit, we should not expect the discrepancy to be zero, since the randomness of both the data and the model mean that they are unlikely to agree exactly. To obtain a point of comparison, therefore, we compute discrepancies between a large number of pairs of simulated data sets drawn from realizations of the model with the same parameter values used for the observed data. If the model were correct, so that simulated and observed data have the same distribution, then these values would tell us the typical magnitude we should expect the discrepancy to have. If the values are mostly smaller than the observed discrepancy then it indicates the model is unlikely to be correct; if they are larger then the model is not ruled out.
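The discrepancy of Eq. (8) is a one-line computation. A minimal sketch (our own illustration; we adopt the usual convention that terms with X_ij = 0 contribute zero, and we assume X̃_ij > 0 wherever X_ij > 0):

```python
import numpy as np

def discrepancy(X, X_tilde):
    """Discrepancy of Eq. (8): sum over node pairs of
    X_ij * log(X_ij / X_tilde_ij), a Kullback-Leibler-style
    distance between observed data X and synthetic data X_tilde
    generated from the model. Zero-count entries of X are skipped."""
    X = np.asarray(X, dtype=float)
    X_tilde = np.asarray(X_tilde, dtype=float)
    mask = X > 0
    return float(np.sum(X[mask] * np.log(X[mask] / X_tilde[mask])))
```

When the synthetic data exactly reproduce the observations the discrepancy is zero, and it grows as the two diverge, which is what makes it usable as a goodness-of-fit score.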
The fraction of generated discrepancy values that are larger than the observed discrepancy gives us the p-value for the model, i.e., the probability that the observed discrepancy would have been generated if the model were correct.

Note that the use of the p-value in this situation is different from the way it is used in traditional frequentist statistics, where it represents the probability of getting a particular observed value if a null hypothesis were true. In the traditional scenario a small p-value leads us to reject the null hypothesis, and, since this is often the goal of an experiment, small p-values are "good." In the present case, the p-value is applied to the model we are fitting (there is no null/alternative hypothesis) and it just counts the fraction of artificial data sets for which we find discrepancies more extreme than the value found when fitting the model to the true data. So in this situation small p-values are "bad."

Figure 4b shows values of the discrepancy for both artificial and real data, for 500 sampled values of the model parameters. Instances where the artificial discrepancy is greater than the observed one appear above the diagonal in the plot and the fraction of such points tells us the p-value. Though there is no universal threshold for what constitutes a good p-value [62], the value we find in this case is not as high as we would like it to be, and though we cannot completely rule out the model the evidence suggests that the fit is not ideal.
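The p-value computation just described can be sketched as follows. For each sampled parameter set we draw synthetic data from the fitted model, compare a model-model discrepancy against the data-model discrepancy, and report the fraction of exceedances. This is our own schematic reading of the procedure, with hypothetical function names; the published code may organize it differently:

```python
def posterior_predictive_p_value(X, parameter_samples, simulate, discrepancy):
    """Posterior-predictive p-value. For each sampled parameter set
    theta, draw two independent synthetic data sets from the fitted
    model; count how often the model-model discrepancy exceeds the
    data-model discrepancy. The fraction of exceedances estimates p."""
    exceed = 0
    for theta in parameter_samples:
        X_tilde = simulate(theta)    # synthetic data from fitted model
        X_tilde_2 = simulate(theta)  # second, independent synthetic draw
        if discrepancy(X_tilde_2, X_tilde) > discrepancy(X, X_tilde):
            exceed += 1
    return exceed / len(parameter_samples)
```

Because the comparison is repeated across many posterior parameter samples, the resulting p-value accounts for parameter uncertainty as well as sampling noise in the data.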
3. Improved model
What can we do to improve the fit? The standard approach is to adopt a more elaborate model that is capable of representing a wider range of data distributions. In the model we have used so far connections are binary: dolphin pairs either have a connection or they don't. We can create a more nuanced model by allowing three levels of connection, corresponding to no tie, a weak tie, or a strong tie. Denoting the three levels by adjacency matrix elements A_ij = 0, 1, and 2, we introduce a new distribution for X_ij when A_ij = 2, which is Poisson as before but with mean λ_2:

    µ_ij(2, λ_2) = (λ_2^{X_ij} / X_ij!) e^{−λ_2},    (9)

where λ_2 ≥ λ_1 ≥ λ_0, and the prior of Eq. (6) becomes

    ν_ij(0, ρ) = ρ_0 = 1 − ρ_1 − ρ_2,    (10a)
    ν_ij(1, ρ) = ρ_1,                    (10b)
    ν_ij(2, ρ) = ρ_2.                    (10c)

The fitting and model verification procedures follow the same lines as previously.

This modified model now fits the data significantly better, as shown in Fig. 5a. It divides the observed numbers of pair interactions into three clear groups centered around values of about 0, 5, and 25, and the p-value is now very respectable. (A p-value significantly larger than this could be a sign of problems, indicating overfitting. Unless the distribution of the discrepancy is strongly skewed, the expected p-value will normally be around 0.5 when the model is a perfect fit.)

Having found a model that fits the data well, we examine the inferred network structure, which is shown in Fig. 5b. The network has three disconnected subgroups of dolphins, two comprised of strong connections only and one, the largest of the three, having a mix of strong and weak connections. The posterior probabilities on all of the interaction types are close to one, indicating high confidence in the structure of the network. For instance, the model predicts that nodes 8 and 9 are not connected with probability close to one.

FIG. 5. Network inference with multiple edge types for the data shown in Fig. 1 and the measurement model given in Eqs. (5)-(10). The estimated mean numbers of interactions are λ_0 ≃ 0.02 when there is no edge between a pair of dolphins, λ_1 = 5.13 when the pair shares a weak tie, and λ_2 = 21.97 when they share a strong one. The corresponding prior probabilities of edge types are ρ_0 = 0.58, ρ_1 = 0.28, and ρ_2 = 0.14. (a) Posterior predicted number of interactions between dolphins as a function of observed number, for five random data sets generated from the fitted model. (b) The inferred network structure; one pair has an observed number of interactions that falls between the weak and strong domains in the fitted model (see Fig. 5a).

B. Friendship network of school students
For our second example, we revisit a network analyzed previously using different methods in Refs. [3, 32], a network of friendships between high-school students taken from the US National Longitudinal Study of Adolescent to Adult Health (the "AddHealth" study). Although the method proposed in this paper is mathematically more complex than that of [3, 32], it is arguably easier to apply, since our analysis pipeline performs most of the calculation automatically. The most demanding step is the formulation of the model, but once we have a plausible model the rest of the process is straightforward and mechanical.

The AddHealth study is a large study of networks of social contact among students in schools across the United States. Students in participating schools were asked to identify their friends, and the basic unit of the resulting data is a friendship nomination: one student says they are friends with another. Our data matrix X in this case is thus a non-symmetric one: X_ij = 1 if i names j as a friend and 0 otherwise. There is no guarantee that j will also name i, and in fact there are many instances in which reported friendships run in one direction only. If we assume that friendship is fundamentally a bidirectional interaction, then this lack of symmetry indicates that the data are necessarily unreliable.

As is done in Refs. [3, 23, 32], we fit the data using a model in which each student i has an individual true-positive rate α_i and false-positive rate β_i. The true-positive rate is the probability that i names as a friend another student who is in fact a friend, as determined by the adjacency matrix.
The false-positive rate is the probability of naming someone who is not actually a friend. We explicitly allow for different true- and false-positive rates for different individuals, since it is widely accepted that survey respondents vary in the accuracy of their responses. In the notation of this paper the equations for the model are:

µ_ij(1, α_i, α_j) = α_i^{X_ij} (1 − α_i)^{1−X_ij} α_j^{X_ji} (1 − α_j)^{1−X_ji},   (11a)
µ_ij(0, β_i, β_j) = β_i^{X_ij} (1 − β_i)^{1−X_ij} β_j^{X_ji} (1 − β_j)^{1−X_ji}.   (11b)

For instance, supposing that i and j truly are friends, the probability of i saying that they are (X_ij = 1) while j says they are not (X_ji = 0) is µ_ij(1) = α_i(1 − α_j). Conversely, if they are not in fact friends then we instead get µ_ij(0) = β_i(1 − β_j).

For the priors we again make the assumption of Eq. (6) that all edges are a priori equally likely, and assume a uniform prior on the edge probability ρ and a uniform distribution over all values of α_i and β_i that satisfy β_i < ρ < α_i. (One could simply assume a uniform prior on both α_i and β_i in the range [0, 1], but this leaves some ambiguity in the model because of the inherent symmetry between edges and non-edges: if we exchange the values of all α_i and all β_i and set ρ to 1 − ρ, the model remains the same. By making the reasonable assumption that α_i > β_i we break this symmetry. The assumption that β_i < ρ < α_i is not strictly necessary, but turns out to be helpful for narrowing down the parameter space and hence improving the speed and convergence of the calculation [23].)

Figures 6 and 7 show the results of fitting this model to the data for a single school from the AddHealth data set. We use one of the smaller schools as our example, with 521 students who completed a survey and 2182 declared ties, primarily in order to make visualization of the results easier. We find that the Monte Carlo algorithm converges well and gives samples that appear to accurately characterize the posterior distribution. Figure 6 shows discrepancy values in a manner analogous to Fig. 3b for the dolphin network, and all values are well above the diagonal, indicating a good fit to the data.

The inferred network structure is shown in Fig. 7a. By contrast with the dolphin network example, the posterior probabilities of edges now vary more widely, as represented by the thickness of the edges in the figure. Figure 7b shows the distribution of edge probabilities as a histogram, and many probabilities are again close to either 1 or 0, indicating a high degree of certainty that

FIG. 6. Goodness-of-fit testing for the AddHealth model. Each point in this plot corresponds to a random parameter set drawn from the posterior distribution P(α, β, ρ | X). Samples associated with a higher model–model discrepancy D(X̃, θ) than data–model discrepancy D(X, θ) appear above the dotted line, indicating a good fit between data and model.
these edges either exist or do not, but there are also a significant number of edges with intermediate probabilities, edges about which we are less certain.

The fit also returns values of the true- and false-positive rates α_i and β_i for each node, which allow us to make quantitative statements about how accurately each individual reports his or her friendships. The average value of α_i over all individuals and all samples is 0.76, meaning that an estimated 24% of friendships are going unreported. The average false-positive rate is 0.0065, which sounds small, but this result is somewhat misleading. The network is very sparse, meaning that almost all edges that could exist do not. It takes only a small fraction of false positives among these many non-edges to generate a significant number of errors. Arguably a more informative measure of false positives is the precision, which is the fraction of reported friendships that are actually present and is given in this case by ρα_i/[ρα_i + (1 − ρ)β_i] [32]. The distribution of values of the precision is shown in Fig. 7c and ranges from a little under 0.2 to a little over 0.75, indicating that in fact a significant fraction of reported friendships (anywhere from 25% to 80%) are false positives.

These results are largely in agreement with previous work [32], although there are modest differences in estimated parameter values and network structure, due to the different methodology. We would argue that the fully Bayesian methodology employed here is more correct, in that it accounts for intrinsic uncertainty in the parameter values. The methodology of [32], which makes use of an expectation–maximization (EM) algorithm, might be described as "semi-Bayesian," computing a full posterior distribution over the network structure but relying on point estimates of the parameters.
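The precision formula above is simple enough to evaluate directly from sampled parameter values. The following Python sketch is our own illustration (the function name is ours, and the edge density ρ = 0.01 used in the example is purely illustrative, not a fitted value); the α and β values are the averages quoted above.

```python
def precision(rho, alpha_i, beta_i):
    """Fraction of node i's reported friendships that actually exist:
    precision = rho * alpha_i / (rho * alpha_i + (1 - rho) * beta_i)."""
    return rho * alpha_i / (rho * alpha_i + (1 - rho) * beta_i)

# With the average rates quoted above (alpha = 0.76, beta = 0.0065) and an
# illustrative edge density rho = 0.01, the precision is only about 0.54:
# the loss comes from the many non-edges, not from a large beta.
print(round(precision(0.01, 0.76, 0.0065), 2))
```

This makes concrete why a tiny false-positive rate can still produce many spurious reports in a sparse network.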
Because the model used here is a large one, having O(n) parameters, we expect there to be significant uncertainty in the parameter values, which is captured by our Bayesian sampling method. That said, in practice the two methods do lead to qualitatively consistent conclusions. The key benefits of the current approach in this case are that it is simpler

FIG. 7. Inference of a school friendship network from noisy data. (a) The 521 nodes in this figure represent the students at a single school in the AddHealth study, and inferred friendships are shown as edges whose thickness indicates the estimated probability that they exist. The size and color of the nodes indicate the estimated precision of friendship reports by the corresponding individual, i.e., the fraction of their reported friendships that are inferred to actually exist. Darker shades indicate less precise individuals and correspond to the shades in the histogram in panel (c). (b) Histogram of the estimated edge probabilities. (c) Histogram of the estimated precision across nodes.

to implement using standard software, is formally more correct, and incorporates a natural means for checking the goodness of fit.

One potential issue with the results is the fact that the discrepancy values in Fig. 6 are all well above the dotted line, indicating close fits of the model to the data and a p-value near 1. A p-value this large can be a warning sign of overfitting, which is a possibility given the large number of parameters in the model. Such an issue could not be diagnosed with the methods previously used in Refs. [3, 32], but our approach makes this possible.
One could address the problem by changing the model, say by using a more complex model in which, instead of fitting the true- and false-positive parameters directly, we draw them from a hyperprior distribution, such as a beta distribution, with an associated (small) set of hyperparameters that are fit using Monte Carlo. This approach can reduce the chances of overfitting and would be a good direction for future work.

IV. CONCLUSIONS
In this paper we have introduced a general Bayesian framework for reconstructing networks from observational data in the case where the data are error-prone, even when the magnitude of the errors is unknown. Our methods work by fitting a suitable model of the measurement process to the data, and there is a large class of models that is both expressive enough to represent real data sets accurately and yet simple enough to allow for easy and automatic statistical inference. The output of the fitting process is a complete Bayesian posterior distribution over possible network structures and possible values of model parameters. We have demonstrated our methods with two case studies showing how to formulate suitable models, fit them, assess goodness of fit, and infer reliable estimates of network structure.

With this work, we hope to promote the adoption of more rigorous methods for handling measurement error in network data in a principled manner. The methods we propose not only achieve this but do so in a manner that is straightforward and requires a minimum of technical expertise on the part of the user. Practitioners can use the framework we propose to apply appropriate, application-specific models to their data and obtain estimates of network structure in a matter of minutes.
ACKNOWLEDGMENTS
We thank Alec Kirkley for helpful discussions. This work was funded in part by the James S. McDonnell Foundation (JGY) and the US National Science Foundation under grant DMS–1710848 (MEJN).
Appendix A: Methods
In this appendix we describe the mathematical and statistical foundations of our method in detail.
1. Generative models of measurement
Consider an experimental setting in which we have measurements X of a network's structure. The measurements could be as simple as a number of observed interactions between pairs of nodes, but could also incorporate time series, vector measurements, etc. In general these measurements do not tell us the exact structure of the network, but instead give us indirect and potentially noisy information. Our goal is to make the best estimate we can of the true network structure given the measurements.

In the general framework we consider here, two nodes i and j can share connections of various types. In the simplest case there are just two types: nodes can be either connected by an edge (type 1) or not (type 0). In a more complex three-type case the connection could be absent (type 0), weak (type 1), or strong (type 2), and so on. For a network of n nodes we encode these connections by an n × n adjacency matrix A, where the matrix element A_ij records the type of connection between nodes i and j. We can also represent directed networks using an asymmetric adjacency matrix, with A_ij being the type of the directed connection from j to i and A_ji being the type from i to j.

Our approach rests on the hypothesis that the matrix X of pairwise measurements is dependent, in a probabilistic fashion, on the adjacency matrix A. Both A and X can be either symmetric (for undirected networks) or asymmetric (for directed ones), and they need not be of the same type. In friendship networks, for example, the symmetric relationship of being friends is commonly probed using asymmetric measurements (person i says they are friends with person j).

It is this dependence between network and measurement that we exploit to estimate A from X. We formalize the relation using a generative model that specifies the probability P(X | A, θ) of making the measurements given the network, plus optionally some additional parameters represented collectively by θ.
Then, applying Bayes' rule, we can write the probability of the unknown quantities A and θ given the measurements as

P(A, θ | X) = P(X | A, θ) P(A | θ) P(θ) / P(X).   (A1)

Our goal is to use this equation to infer the network structure A from the measurements X and to quantify the errors we might make in doing so.
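The generative direction of such a model is easy to sketch in ordinary code. Our published code is written in Stan (see Section A 4); as a language-neutral illustration only, the following Python function (ours, not part of the published code) simulates a measurement matrix X from a known network A under a simple Poisson data model of the kind listed in Table I, with illustrative mean parameters lam1 on edges and lam0 on non-edges. Inference via Eq. (A1) amounts to inverting this process.

```python
import math
import random

def simulate_measurements(A, lam0, lam1, seed=0):
    """Simulate a symmetric measurement matrix X from a known network A:
    the observed count X_ij for each unordered node pair is Poisson with
    mean lam1 if the pair is connected (A_ij = 1) and lam0 otherwise."""
    rng = random.Random(seed)
    n = len(A)
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            lam = lam1 if A[i][j] == 1 else lam0
            # Knuth's algorithm for Poisson sampling (fine for small means)
            k, p, threshold = 0, 1.0, math.exp(-lam)
            while True:
                p *= rng.random()
                if p <= threshold:
                    break
                k += 1
            X[i][j] = X[j][i] = k
    return X
```

Running this on a small network yields noisy counts whose edges stand out only statistically, which is exactly the situation the posterior of Eq. (A1) is designed to untangle.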
2. A flexible class of models
To further simplify the discussion and improve the efficiency of the numerical calculations, we make some additional assumptions about the model, while keeping the approach as broad as possible to allow users to easily adapt it to various types of data and experimental settings.

Of the four probabilities that appear on the right-hand side of Eq. (A1), one of them, P(X), is a constant (since it depends only on X, which is fixed by the experiment) and hence will play no part in our calculations. The others must be specified to define our model. We refer to these three probabilities as the data model P(X | A, θ), the network model P(A | θ), and the prior on the parameters P(θ). Let us consider each of these in turn.

a. Data model

The data model P(X | A, θ) specifies the probability of making a particular set of measurements X given the network and the model parameters. In specifying this probability we will make two key assumptions. First, we assume that the measurement X_ij is influenced only by the corresponding element A_ij of the adjacency matrix and not by any other elements. Second, we assume that, conditioned on the network structure A and parameters θ, the measurements X_ij for different node pairs are independent. Thus, for instance,

P(X_ij, X_kl | A, θ) = P(X_ij | A_ij, θ) P(X_kl | A_kl, θ).

The notation here is a bit unwieldy, so for clarity we introduce the notation µ_ij(A_ij, θ) to denote the probability P(X_ij | A_ij, θ) of making the measurement X_ij given the type A_ij of the connection between nodes i and j and given the parameter values θ. (Where the meaning is clear we may drop the explicit dependence on θ to simplify our expressions.) With this notation and our assumption of conditional independence, the probability P(X | A, θ) for the data model is simply

P(X | A, θ) = ∏_{(i,j)} µ_ij(A_ij, θ).   (A2)

The product over (i, j) is taken over all unordered pairs of nodes when the network is undirected and over all ordered pairs when it is directed.

Table I gives a selection of possible forms for the data model for networks with only a single edge type. Generalization to multiple edge types is straightforward. (See also Ref. [32] for a discussion of a range of models.)

b. Network model

The network model P(A | θ) can be thought of as our prior expectation of what the network should look like, before we make the measurements. By analogy with the factorized form of the data model in Eq. (A2), we consider network models with the factorized form

P(A | θ) = ∏_{(i,j)} ν_ij(A_ij, θ),   (A3)

where we define ν_ij(A_ij, θ) in a similar manner to µ_ij(A_ij, θ), as the prior probability P(A_ij | θ) that nodes i and j share a connection of type A_ij, given the parameters θ. Many standard network models can be written in this form, including the Erdős–Rényi random graph, the configuration model, and the stochastic block model. Some examples of network models are given in Table II and Ref. [32].

c. Prior on the parameters

The third component of our generative model, the prior P(θ) on the parameters, is the simplest. Our method does not place any significant constraints on the form of this probability, so one is free to choose almost any form appropriate to the problem at hand, ranging from simple flat priors or factorized forms to ones that incorporate complex correlations between parameters. The only stipulation we make is that the parameters should be continuous-valued variables (not discrete-valued), which allows for more efficient sampling procedures (see Section A 4).
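To make Eq. (A2) concrete, the following Python sketch (our illustration, not the published Stan code; the function names are ours) evaluates log P(X | A, θ) for the "Poisson with uniform errors" data model of Table I, in which an edge produces counts with mean lam1 and a non-edge with mean lam0.

```python
import math

def poisson_logpmf(x, lam):
    """log of the Poisson probability lam^x e^{-lam} / x!."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def log_data_model(X, A, lam0, lam1):
    """Log-probability log P(X | A, theta) of Eq. (A2) for an undirected
    network, factorized over unordered node pairs (i, j) with i < j."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            lam = lam1 if A[i][j] == 1 else lam0
            total += poisson_logpmf(X[i][j], lam)
    return total
```

Because the likelihood factorizes over pairs, each term can be evaluated independently, which is what makes the sampling schemes of Section A 4 efficient.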
3. Inference in theory
Gathering the elements defined above and substituting them into Eq. (A1), we obtain the complete joint posterior distribution for the model:

P(A, θ | X) = P(X | A, θ) P(A | θ) P(θ) / P(X) ∝ P(θ) ∏_{(i,j)} µ_ij(A_ij) ν_ij(A_ij).   (A4)

This distribution tells us the probability of a network structure and a set of parameter values given the observed measurements. From it we can derive a variety of further useful quantities, such as the probability of the network structure independent of the parameters, which is given by

P(A | X) = ∫ P(A, θ | X) dθ.   (A5)

Even more useful, perhaps, is the probability of having an edge of a given type between two specific nodes i, j:

P(A_ij = k | X) = ∫ P(A_ij = k, θ | X) dθ ∝ ∫ µ_ij(k, θ) ν_ij(k, θ) P(θ) dθ,   (A6)

where we have used Eq. (A4). If we instead want to learn something about a parameter φ ∈ θ, then we can compute its distribution as

P(φ | X) = Σ_A ∫ P(θ′, φ, A | X) dθ′,   (A7)

where θ′ is the parameter set with φ excluded.

Each of these quantities can be considered a special case of the posterior average of a general function f(A, θ) of network structure and parameters:

⟨f(A, θ)⟩ = Σ_A ∫ f(A, θ) P(A, θ | X) dθ.   (A8)

There are a number of approaches we could take to computing expectations of this form [35]. One possibility is to use an expectation–maximization (EM) algorithm to compute the distribution over potential networks P(A | θ, X) as well as a point estimate of θ [3, 32]. Alternatively, following [31, 63], we can integrate out the parameters θ analytically to derive the marginal distribution P(A | X) over the networks alone. Both these approaches, however, only allow us to compute averages over network structures and not over parameters.
They are moreover not in line with our goal of providing almost automatic inference for arbitrary choices of models: the EM approach because it calls for the solution of (often nonlinear) equations specific to the model, and the marginal-based approach because it works only for models amenable to closed-form integration. Instead, therefore, we employ a generalization of a method introduced in [33], which harnesses standard mixture-modeling techniques, adapting them to the network context.
4. Inference in practice
The general idea behind our method is to compute expectations of the form (A8) in two manageable steps, by factorizing the joint posterior as

P(A, θ | X) = P(A | θ, X) P(θ | X).   (A9)

This factorization tells us that we can draw samples from the joint posterior by first sampling sets of parameter values θ from the marginal distribution P(θ | X) and then sampling networks A from P(A | θ, X) with these parameter values. If we sample m different parameter sets and then n networks for each set, we end up with mn network/parameter pairs, which we number r = 1, ..., mn. Then we can estimate the average in Eq. (A8) as

⟨f(A, θ)⟩ = Σ_A ∫ f(A, θ) P(A | θ, X) P(θ | X) dθ ≃ (1/mn) Σ_{r=1}^{mn} f(A_r, θ_r).   (A10)

This expression is completely general and holds for any posterior, but for the class of models we consider here there are, as we now show, particularly efficient methods that can help us quickly generate the samples we need.

TABLE I. Example data models for undirected networks with one edge type. Here N_ij represents the number of times the node pair i, j was measured and X_ij represents how many of those times an edge was observed to exist.

Model | Parameters | Data probability
Binomial with uniform errors | True-positive rate α ∈ [0, 1]; false-positive rate β ∈ [0, 1] | µ_ij(1) = α^{X_ij} (1 − α)^{N_ij − X_ij}; µ_ij(0) = β^{X_ij} (1 − β)^{N_ij − X_ij}
Binomial with node-dependent errors | True-positive rate α_i ∈ [0, 1] and false-positive rate β_i ∈ [0, 1] for node i | µ_ij(1) = α_i^{X_ij} (1 − α_i)^{N_ij − X_ij}; µ_ij(0) = β_i^{X_ij} (1 − β_i)^{N_ij − X_ij}
Poisson with uniform errors | Means λ_1, λ_0 for edges and non-edges | µ_ij(1) = λ_1^{X_ij} e^{−λ_1} / X_ij!; µ_ij(0) = λ_0^{X_ij} e^{−λ_0} / X_ij!
Poisson with node propensity | Normalized node propensities 0 < η_i < 1 (with Σ_i η_i = 1) and base rates λ_1, λ_0 | µ_ij(1) = (λ_1 η_i η_j)^{X_ij} e^{−λ_1 η_i η_j} / X_ij!; µ_ij(0) = (λ_0 η_i η_j)^{X_ij} e^{−λ_0 η_i η_j} / X_ij!

TABLE II. Network models for the prior probability ν_ij of an edge between nodes i and j.

Model | Parameters | Edge probability
Random graph | Edge probability ρ | ν_ij(1) = ρ
"Soft" configuration model | Node pseudo-degree λ_i | ν_ij(1) = 1 / (1 + e^{−λ_i λ_j})
Stochastic block model | Node i belongs to group g_i; edge probability between groups r and s is ω_rs | ν_ij(1) = ω_{g_i g_j}
Random graph with multiple edge types | Probability ρ_k of a type-k edge | ν_ij(k) = ρ_k
Poisson multigraph | Mean edge number ω | ν_ij(k) = ω^k e^{−ω} / k!

a. Generating parameter samples

The first step of the sampling algorithm draws values of the parameters θ from the marginal distribution

P(θ | X) = Σ_A P(θ, A | X),   (A11)

where the sum runs over all the possible matrices A. For models with the factorized form (A4) we have

P(θ | X) ∝ P(θ) Σ_A ∏_{(i,j)} µ_ij(A_ij, θ) ν_ij(A_ij, θ) ∝ P(θ) ∏_{(i,j)} Σ_k µ_ij(k, θ) ν_ij(k, θ).   (A12)

Modern probabilistic programming languages make it easy to generate random samples from factorized marginals of this kind. Our code is written in the probabilistic language Stan, which implements the technique known as Hamiltonian Monte Carlo to generate samples automatically and efficiently; see Refs. [64, 65] for an introduction. Evaluating P(θ | X) involves a product over pairs (i, j) of nodes, of which there are O(n^2), meaning that in general generating a sample takes O(n^2) time. In many cases, however, the time complexity can be reduced to O(n) by pooling terms in the product, as discussed in Sec. A 6 b.

b. Generating network samples

Given sampled values θ_1, ..., θ_m of the parameters, the next step is to generate samples of the network A from the distribution P(A | θ, X) for these parameter values. This is straightforward for the factorized model assumed here, since node pairs are independent and we can sample each one separately. Specifically, using Eqs. (A4) and (A12), we have

P(A | θ, X) = P(θ, A | X) / P(θ | X) = [∏_{(i,j)} µ_ij(A_ij) ν_ij(A_ij)] / [∏_{(i,j)} Σ_k µ_ij(k) ν_ij(k)] = ∏_{(i,j)} Q_ij(A_ij, θ),   (A13)

where

Q_ij(k, θ) = µ_ij(k) ν_ij(k) / Σ_{k′} µ_ij(k′) ν_ij(k′)   (A14)

is the posterior probability that nodes i and j are joined by an edge of type k. Generating networks is simply a matter of drawing a value A_ij = k for each node pair independently from the distribution over k implied by Q_ij(k). Again, naively this takes time O(n^2) for all node pairs, but on a sparse network the speed can be improved by sampling only those edges with k > 0 and setting k = 0 for all others.

To estimate the average ⟨f(A, θ)⟩, we generate a series of parameter sets θ using Eq. (A12) and for each of these a series of networks using Eq. (A13), then evaluate the average with Eq. (A10).
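The network-sampling step of Eqs. (A13) and (A14) is simple enough to sketch directly. The following Python fragment is our own illustration (not the published Stan pipeline; function names are ours): it forms the pairwise posterior Q_ij from per-type data probabilities µ_ij(k) and priors ν_ij(k), then draws one network by sampling each pair independently.

```python
import random

def edge_type_posterior(mu, nu):
    """Q_ij(k) of Eq. (A14): posterior probability that a node pair has a
    connection of type k, given data probabilities mu[k] = mu_ij(k, theta)
    and prior probabilities nu[k] = nu_ij(k, theta)."""
    w = [m * v for m, v in zip(mu, nu)]
    Z = sum(w)
    return [x / Z for x in w]

def sample_network(Q, seed=0):
    """Draw one network A from P(A | theta, X), Eq. (A13): each unordered
    pair (i, j) gets a type sampled independently from Q[i][j]."""
    rng = random.Random(seed)
    n = len(Q)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            k = rng.choices(range(len(Q[i][j])), weights=Q[i][j])[0]
            A[i][j] = A[j][i] = k
    return A
```

Because the pairs are independent, this is trivially parallelizable, and on sparse networks it can be accelerated further as described in Sec. A 6 a.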
5. Assessing goodness of fit
The method described above is simple, efficient, and often gives good results. As described in the main text, however, the method can fail if the model itself is faulty, that is, if the model is a poor representation of the system, failing to fit the data for any parameter values. It is important therefore to verify that the fit between model and data is good, which can be done with the standard technique of posterior-predictive assessment. As described in the main text, this involves generating synthetic data X̃ from the distribution implied by the fitted model:

P(X̃ | X) = ∫ Σ_A P(X̃ | θ, A) P(θ, A | X) dθ.   (A15)

This distribution weights all the possible parameters θ and networks A with their appropriate posterior probabilities and tells us the probability that a new data set X̃ would have if it were truly generated by the model with these inputs. The idea of the posterior-predictive assessment is to compare these synthetic data with the original input X. If the two look alike then the model has captured the data well; otherwise, it has not.

There are a number of ways to quantify the similarity of X̃ and X. For instance, one can compute the average

⟨X̃_ij⟩ = Σ_{X̃} P(X̃ | X) X̃_ij,   (A16)

and compare the result with X_ij. Visualizing the matrix of residues ⟨X̃⟩ − X, the distribution of these residues, or how they depend on X_ij allows one to easily spot systematic issues with the model [18]. Such calculations are not costly in practice: the distribution (A15) is just an average of a known function of A, θ over the posterior distribution and has the same general form as Eq. (A10), so it can be evaluated numerically by the same methods. In this particular case, however, we can do even better, skipping the network sampling step altogether and making an estimate directly from the parameter samples. To do this, we write the distribution of Eq. (A15) in the form

P(X̃ | X) = ∫ Σ_A P(X̃ | θ, A) P(A | θ, X) P(θ | X) dθ = ∫ P(θ | X) Σ_A ∏_{(i,j)} µ̃_ij(A_ij, θ) Q_ij(A_ij, θ) dθ = ∫ P(θ | X) ∏_{(i,j)} Σ_k µ̃_ij(k, θ) Q_ij(k, θ) dθ,   (A17)

where we have used Eqs. (A2) and (A13) in the second line, and µ̃_ij(k, θ) is the probability of generating a synthetic measurement X̃_ij given that (i, j) is an edge of type k. This expression is now independent of A and only requires an average over θ to evaluate.

Using this expression for P(X̃ | X), we can write the average ⟨X̃_ij⟩ in Eq. (A16) as

⟨X̃_ij⟩ = ∫ P(θ | X) Σ_k ⟨µ̃_ij(k, θ)⟩ Q_ij(k, θ) dθ,   (A18)

which we evaluate numerically as

⟨X̃_ij⟩ ≃ (1/m) Σ_{r=1}^{m} Σ_k ⟨µ̃_ij(k, θ_r)⟩ Q_ij(k, θ_r).   (A19)

Note that ⟨µ̃_ij(k, θ_r)⟩ usually has a simple closed form, since it is just the mean of X̃_ij within the data model with parameters θ_r.

A visual inspection of the residues between X̃ and X is often enough to reveal issues with goodness of fit, but one can carry out a more formal model assessment using any of a variety of discrepancy measures that quantify the distance between the synthetic data X̃ and the original X [62]. The average value of such a discrepancy will always be greater than zero, since one does not expect the synthetic and original data to agree perfectly even with a perfect model. To obtain a baseline against which discrepancy values can be compared, we therefore compute the discrepancy between synthetic measurements X̃ and their associated predictions, calculating a model-versus-model discrepancy distribution.

In the calculations presented here we make use of the log-likelihood-ratio discrepancy:

D(X, θ_r) = Σ_{(i,j)} X_ij log[X_ij / X̃_ij(θ_r)],   (A20)

where X̃_ij(θ_r) is evaluated using Eq. (A19) with the sampled parameter values θ_r. This discrepancy is reminiscent of a Kullback–Leibler divergence, with the primary difference being that it compares unnormalized quantities rather than normalized probability distributions. That said, the norm of the two sets of measurements should be similar, since the whole purpose of the calculation is to reproduce the original observations. Hence one can usually interpret the discrepancy in more or less the same way as a divergence: the smaller the value the better the fit (although values slightly less than zero can occur, which is not true of a true divergence).

We compute the distribution of the discrepancy and the reference distribution X̃ simultaneously using the method introduced in Ref. [62]. We go through each network/parameter sample A_r, θ_r and generate a single realization X̃ of the synthetic data from the data model, then compute the two discrepancies D(X̃, θ_r) and D(X, θ_r) using the analog of Eq. (A19). From the resulting sets of discrepancy values one can then compute the p-value p = P[D(X̃, θ_r) > D(X, θ_r)], which is the fraction of artificial data sets with discrepancy at least as large as the observed value. The model is rejected if the p-value is too small. This calculation does not cost much computation time, since we are merely reusing the samples already generated for estimation purposes.
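As an illustration (ours, with hypothetical function names, not the published code), the discrepancy of Eq. (A20) and the resulting posterior-predictive p-value can be computed as follows:

```python
import math

def discrepancy(X, X_pred):
    """Log-likelihood-ratio discrepancy of Eq. (A20):
    D = sum over unordered pairs of X_ij * log(X_ij / X~_ij), with the
    convention 0 log 0 = 0.  X_pred holds the posterior-predictive means
    X~_ij of Eq. (A19) for one sampled parameter set theta_r."""
    n = len(X)
    D = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if X[i][j] > 0:
                D += X[i][j] * math.log(X[i][j] / X_pred[i][j])
    return D

def predictive_p_value(D_data, D_synth):
    """p = P[D(X~, theta_r) > D(X, theta_r)]: the fraction of samples for
    which the synthetic (model-model) discrepancy exceeds the observed
    (data-model) discrepancy.  Small p indicates a poor fit."""
    return sum(ds > dd for dd, ds in zip(D_data, D_synth)) / len(D_data)
```

Since both discrepancy lists are by-products of samples already drawn for estimation, this check adds essentially no computational cost.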
6. Implementation
In this section we discuss details of the implementation of the algorithm, including a number of techniques for improving speed and numerical accuracy which can be useful with large data sets.

a. Sampling networks
One of the more computationally costly steps in the algorithm is the generation of sample networks from the conditional posterior distribution P(A | θ, X). Naively generating the network by flipping a biased coin for every node pair i, j takes time O(n^2) on a network of n nodes. For some models on sparse networks this time can be reduced by explicitly sampling only the edges that exist. That is, all edges are assumed not to exist, except for a sparse sample that are generated in accordance with the fitted model. For instance, with the simple "uniform error" model of Table I, the posterior probability Q_ij of an edge is a unique function Q(X) of the number of observations X_ij of the edge in question. With this in mind we define Σ = Σ_{(i,j)} Q_ij = Σ_X n(X) Q(X), where n(X) = Σ_{(i,j)} δ(X, X_ij) is the number of node pairs with X observations and δ(x, y) is the Kronecker delta. The value of Σ can be calculated rapidly once n(X) is known; then we can generate a sampled network by first drawing an integer M ∼ Poisson(Σ) to represent the number of edges in the network, and then generating M random edges with probabilities Q_ij using standard "roulette wheel" proportional sampling with binary search. The complete process takes time O(M log n), which on a sparse network will be much faster than the O(n^2) of the naive algorithm.

In other cases we may be able to skip the process of network sampling altogether, although at the price of still having to perform O(n^2) operations. Specifically, when we want to calculate the average of a function f that factorizes over node pairs,

f(A, θ) = ∏_{(i,j)} g_ij(A_ij, θ),   (A21)

we can write the average as

⟨f(A, θ)⟩ = Σ_A ∫ f(A, θ) P(A | θ, X) P(θ | X) dθ = ∫ P(θ | X) ∏_{(i,j)} Σ_k [g_ij(k, θ) Q_ij(k, θ)] dθ.   (A22)

Now we sample m sets of parameter values θ_r as usual, but generate no networks A, and the average we want is given by

⟨f(A, θ)⟩ ≃ (1/m) Σ_{r=1}^{m} ∏_{(i,j)} Σ_k g_ij(k, θ_r) Q_ij(k, θ_r).   (A23)

b. Sampling parameters

Generating sample values of the parameters also takes time O(n^2) in general, because the right-hand side of Eq. (A12) involves a product over pairs of nodes. For some models, however, we may be able to evaluate this product more rapidly by methods similar to those described for sampling networks above. Taking again the example of the "uniform error" model from Table I, the probability µ_ij(k) is a function µ(X, k, θ) only of the number of observations X_ij of the corresponding edge (and of k and θ), and ν_ij(k) is a function of k and θ only. This means we can group terms in the product and write

∏_{(i,j)} Σ_k µ_ij(k, θ) ν_ij(k, θ) = ∏_X [Σ_k µ(X, k, θ) ν(k, θ)]^{n(X)},   (A24)

which saves considerable time.

[1] M. Newman, Networks. Oxford University Press, Oxford, 2nd edition (2018).
[2] E. D. Kolaczyk,
Statistical Analysis of Network Data. Springer, New York, NY (2009).
[3] M. E. J. Newman, Network structure from rich but noisy data. Nat. Phys., 542–545 (2018).
[4] T. Ito et al., Toward a protein–protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc. Natl. Acad. Sci. USA, 1143–1147 (2000).
[5] N. J. Krogan et al., Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature, 637–643 (2006).
[6] E. Sprinzak, S. Sattath, and H. Margalit, How reliable are experimental protein-protein interaction data? J. Mol. Biol., 919–923 (2003).
[7] T. Rolland et al., A proteome-scale map of the human interactome network. Cell, 1212–1226 (2014).
[8] S. Wasserman and K. Faust, Social Network Analysis. Cambridge University Press, Cambridge (1994).
[9] E. Vaquera and G. Kao, Do you like me as much as I like you? Friendship reciprocity and its effects on school outcomes among adolescents. Soc. Sci. Res., 55–72 (2008).
[10] B. Ball and M. E. J. Newman, Friendship networks and social status. Network Science, 16–30 (2013).
[11] M. McPherson, L. Smith-Lovin, and J. M. Cook, Birds of a feather: Homophily in social networks. Annual Review of Sociology, 415–444 (2001).
[12] P. W. Holland and S. Leinhardt, The structural implications of measurement error in sociometry. J. Math. Sociol., 85–111 (1973).
[13] L. Amini, A. Shaikh, and H. Schulzrinne, Issues with inferring internet topological attributes. Comput. Commun., 557–567 (2004).
[14] H. Whitehead, Analyzing Animal Societies: Quantitative Methods for Vertebrate Social Analysis. University of Chicago Press, Chicago, IL (2008).
[15] O. Sporns, Networks of the Brain. MIT Press, Cambridge, MA (2010).
[16] D. J. Wang, X. Shi, D. A. McFarland, and J. Leskovec, Measurement error in network data: A re-classification. Soc. Netw., 396–409 (2012).
[17] J. Wiese, J.-K. Min, J. I. Hong, and J. Zimmerman, You never call, you never write: Call and SMS logs do not always indicate tie strength. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 765–774 (2015).
[18] A. Gelman and C. R. Shalizi, Philosophy and the practice of Bayesian statistics. Br. J. Math. Stat. Psychol., 8–38 (2013).
[19] N. Eagle and A. Pentland, Reality mining: Sensing complex social systems. J. Pers. Ubiquitous Comput., 255–268 (2006).
[20] N. Eagle, A. S. Pentland, and D. Lazer, Inferring friendship network structure by using mobile phone data. Proc. Natl. Acad. Sci. USA, 15274–15278 (2009).
[21] D. J. Crandall, L. Backstrom, D. Cosley, S. Suri, D. Huttenlocher, and J. Kleinberg, Inferring social ties from geographic coincidences. Proc. Natl. Acad. Sci. USA, 22436–22441 (2010).
[22] J. Cranshaw, E. Toch, J. Hong, A. Kittur, and N. Sadeh, Bridging the gap between physical location and online social networks. In Proceedings of the 12th ACM International Conference on Ubiquitous Computing, pp. 119–128 (2010).
[23] C. T. Butts, Network inference, error, and informant (in)accuracy: A Bayesian approach. Soc. Netw., 103–140 (2003).
[24] C. M. Le, K. Levin, and E. Levina, Estimating a network from multiple noisy realizations. Electron. J. Stat., 4697–4740 (2018).
[25] R. Tang et al., Connectome smoothing via low-rank approximations. IEEE Trans. Med. Imaging, 1446–1456 (2018).
[26] L. Wang, Z. Zhang, and D. Dunson, Common and individual structure of brain networks. Ann. Appl. Stat., 85–112 (2019).
[27] R. Jansen et al., A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 449–453 (2003).
[28] X. Jiang and E. D. Kolaczyk, A latent eigenprobit model with link uncertainty for prediction of protein-protein interactions. Stat. Biosci., 84–104 (2012).
[29] A. Birlutiu, F. d'Alché-Buc, and T. Heskes, A Bayesian framework for combining protein and network topology information for predicting protein-protein interactions. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 538–550 (2014).
[30] C. E. Priebe, D. L. Sussman, M. Tang, and J. T. Vogelstein, Statistical inference on errorfully observed graphs. J. Comput. Graph. Stat., 930–953 (2015).
[31] T. P. Peixoto, Reconstructing networks with unknown and heterogeneous errors. Phys. Rev. X, 041011 (2018).
[32] M. E. J. Newman, Estimating network structure from unreliable measurements. Phys. Rev. E, 062321 (2018).
[33] J.-G. Young, F. S. Valdovinos, and M. E. J. Newman, Reconstruction of plant–pollinator networks from observational data. Preprint bioRxiv:754077 (2019).
[34] D. M. Titterington, A. Smith, and U. E. Makov, Statistical Analysis of Finite Mixture Distributions. Wiley (1985).
[35] G. J. McLachlan and D. Peel, Finite Mixture Models. Wiley (2004).
[36] D. Liben-Nowell and J. Kleinberg, The link-prediction problem for social networks. J. Assoc. Inf. Sci. Technol., 1019–1031 (2007).
[37] A. Clauset, C. Moore, and M. E. J. Newman, Hierarchical structure and the prediction of missing links in networks. Nature, 98–101 (2008).
[38] R. Guimerà and M. Sales-Pardo, Missing and spurious interactions and the reconstruction of complex networks. Proc. Natl. Acad. Sci. USA, 22073–22078 (2009).
[39] M. Huisman, Imputation of missing network data: Some simple procedures. J. Soc. Struct., 1–29 (2009).
[40] M. Kim and J. Leskovec, The network completion problem: Inferring missing nodes and edges in networks. In B. Liu, H. Liu, C. Clifton, T. Washio, and C. Kamath (eds.), Proceedings of the 2011 SIAM International Conference on Data Mining, pp. 47–58, Society for Industrial and Applied Mathematics, Philadelphia, PA (2011).
[41] I. Brugere, B. Gallagher, and T. Y. Berger-Wolf, Network structure inference, a survey: Motivations, methods, and applications. ACM Comput. Surv., 24:1–24:39 (2018).
[42] Q. Li, Y. Zheng, X. Xie, Y. Chen, W. Liu, and W.-Y. Ma, Mining user similarity based on location history. In Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (2008).
[43] M. Bansal, V. Belcastro, A. Ambesi-Impiombato, and D. Di Bernardo, How to infer gene networks from expression profiles. Mol. Syst. Biol., 78 (2007).
[44] M. Gomez-Rodriguez, J. Leskovec, and A. Krause, Inferring networks of diffusion and influence. ACM Trans. Knowl. Discov. Data, 21 (2012).
[45] P. Netrapalli and S. Sanghavi, Learning the graph of epidemic cascades. In Proceedings of the 12th ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pp. 211–222, Association for Computing Machinery, New York (2012).
[46] T. Squartini and D. Garlaschelli, Maximum-Entropy Networks: Pattern Detection, Network Reconstruction and Graph Combinatorics. Springer, Berlin (2017).
[47] M. Yuan and Y. Lin, Model selection and estimation in the Gaussian graphical model. Biometrika, 19–35 (2007).
[48] P. Orbanz, Subsampling large graphs and invariance in networks. Preprint arXiv:1710.04217 (2017).
[49] M. P. H. Stumpf, C. Wiuf, and R. M. May, Subnets of scale-free networks are not scale-free: Sampling properties of networks. Proc. Natl. Acad. Sci. USA, 4221–4224 (2005).
[50] S. H. Lee, P.-J. Kim, and H. Jeong, Statistical properties of sampled networks. Phys. Rev. E, 016102 (2006).
[51] C. T. Butts, Revisiting the foundations of network analysis. Science, 414–416 (2009).
[52] A. A. Ferreira, M. A. Gonçalves, and A. H. F. Laender, A brief survey of automatic methods for author name disambiguation. SIGMOD Record, 15–26 (2012).
[53] G. M. Namata, B. London, and L. Getoor, Collective graph identification. ACM Trans. Knowl. Discov. Data, 25 (2016).
[54] J. J. Pfeiffer and J. Neville, Methods to determine node centrality and clustering in graphs with uncertain structure. In Fifth International AAAI Conference on Weblogs and Social Media (2011).
[55] F. Bonchi, F. Gullo, A. Kaltenbrunner, and Y. Volkovich, Core decomposition of uncertain graphs. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1316–1325 (2014).
[56] T. Martin, B. Ball, and M. E. J. Newman, Structural inference for uncertain networks. Phys. Rev. E, 012306 (2016).
[57] T. Poisot, A. R. Cirtwill, K. Cazelles, D. Gravel, M.-J. Fortin, and D. B. Stouffer, The structure of probabilistic networks. Methods Ecol. Evol., 303–312 (2016).
[58] A. Khan, Y. Ye, and L. Chen, On Uncertain Graphs. Morgan & Claypool, San Rafael, CA (2018).
[59] R. C. Connor, R. A. Smolker, and A. F. Richards, Dolphin alliances and coalitions. In Coalitions and Alliances in Humans and Other Animals, p. 443. Oxford University Press, Oxford (1992).
[60] J. B. Brask, S. Ellis, and D. P. Croft, Animal social networks: An introduction for complex systems scientists. Preprint arXiv:2005.09598 (2020).
[61] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian Data Analysis, 3rd edition. Chapman and Hall/CRC Press, New York, NY (2013).
[62] A. Gelman, X.-L. Meng, and H. Stern, Posterior predictive assessment of model fitness via realized discrepancies. Stat. Sin., 733–760 (1996).
[63] T. P. Peixoto, Network reconstruction and community detection from dynamics. Phys. Rev. Lett., 128301 (2019).
[64] M. Betancourt, A conceptual introduction to Hamiltonian Monte Carlo. Preprint arXiv:1701.02434 (2017).
[65] B. Carpenter et al., Stan: A probabilistic programming language. J. Stat. Softw. 76
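As an illustrative companion to the fast edge-sampling scheme of the appendix above (draw $M \sim \text{Poisson}(\Sigma)$, then place $M$ edges by roulette-wheel selection with binary search), here is a minimal sketch in Python. The function names (`sample_network`, `poisson_draw`) and the toy inputs are assumptions for illustration, not the authors' published code; duplicate draws are collapsed into a set, a good approximation when the probabilities $Q_{ij}$ are all small, as on a sparse network.

```python
import bisect
import itertools
import math
import random

def poisson_draw(lam):
    # Knuth's Poisson sampler; adequate for the modest values of Sigma
    # used here (exp(-lam) underflows for very large lam).
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

def sample_network(Q, X):
    """Draw one sparse network from the conditional posterior.

    Q : dict mapping an observation count to the posterior edge
        probability Q(X_ij), as in the "uniform error" model.
    X : dict mapping each node pair (i, j) to its observation count.
    Returns the set of sampled edges.
    """
    pairs = list(X)
    # Cumulative weights over node pairs, so that a uniform draw in
    # [0, Sigma) maps to a pair by binary search ("roulette wheel").
    cum = list(itertools.accumulate(Q[X[p]] for p in pairs))
    sigma = cum[-1]              # Sigma = sum over pairs of Q_ij
    M = poisson_draw(sigma)      # number of edges, M ~ Poisson(Sigma)
    edges = set()
    for _ in range(M):
        idx = bisect.bisect(cum, random.random() * sigma)
        edges.add(pairs[idx])
    return edges
```

Each draw costs one binary search over the cumulative weights, so a sample takes time $\mathrm{O}(M \log n)$ once $n(X)$ and the cumulative sums are in hand, as stated in the text.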
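The grouping identity of Eq. (A24) in the appendix can be exercised in the same spirit: instead of a product over all $\mathrm{O}(n^2)$ node pairs, one takes a product over the distinct observation counts $X$, each raised to the power $n(X)$. The sketch below uses hypothetical functions `mu` and `nu` standing in for $\mu(X,k,\theta)$ and $\nu(k,\theta)$ (with $\theta$ fixed) and checks the grouped evaluation against the naive pair-by-pair product; accumulating in log space guards against underflow when the number of pairs is large.

```python
import math
from collections import Counter

def grouped_product(n_of_X, mu, nu, ks):
    """Evaluate prod over pairs of sum_k mu(X_ij, k) nu(k) by grouping
    node pairs with equal observation counts, as in Eq. (A24).

    n_of_X : mapping from an observation count X to n(X), the number
             of node pairs observed X times.
    mu, nu : the per-pair model factors (theta held fixed).
    ks     : the values of k summed over.
    """
    log_total = 0.0
    for X_val, nX in n_of_X.items():
        inner = sum(mu(X_val, k) * nu(k) for k in ks)
        log_total += nX * math.log(inner)
    return math.exp(log_total)
```

The saving is exactly as in the text: the cost is proportional to the number of distinct observation counts rather than to the number of node pairs, which is a large reduction whenever many pairs share the same count.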