Inferring the minimum spanning tree from a sample network
Jonathan Larson and Jukka-Pekka Onnela
February 22, 2021
Abstract
Minimum spanning trees (MSTs) are used in a variety of fields, from computer science to geography. Infectious disease researchers have used them to infer the transmission pathway of certain pathogens. However, these are often the MSTs of sample networks, not population networks, and surprisingly little is known about what can be inferred about a population MST from a sample MST. We prove that if n nodes (the sample) are selected uniformly at random from a complete graph with N nodes and unique edge weights (the population), the probability that an edge is in the population graph's MST given that it is in the sample graph's MST is n/N. We use simulation to investigate this conditional probability for G(N, p) graphs, Barabási–Albert (BA) graphs, graphs whose nodes are distributed in R² according to a bivariate standard normal distribution, and an empirical HIV genetic distance network. Broadly, results for the complete, G(N, p), and normal graphs are similar, and results for the BA and empirical HIV graphs are similar. We recommend that researchers use an edge-weighted random walk to sample nodes from the population so that they maximize the probability that an edge is in the population MST given that it is in the sample MST.
Keywords: minimum spanning tree, MST, inference, sampling
A graph consists of nodes (also called vertices) and edges, with each edge connecting a pair of nodes. A tree is a subset of the edges of a connected graph that has no cycles, i.e., there is only one path from one node to any other node. A spanning tree connects all the vertices of the graph, and the minimum spanning tree (MST) is the spanning tree with the lowest total edge weight. If the original graph is not connected, its minimum spanning forest (MSF) consists of the MSTs of its connected components. If the edge weights are unique, there is only one MST; if there are duplicate edge weights, there may be more than one MST. Given a weighted graph, there are a variety of algorithms for obtaining its MST. The classic algorithms are Borůvka's [1], Prim's [2], and Kruskal's [3].

Researchers in a variety of fields have used MSTs to analyze network data. For example, neurologists have used MSTs to compare brain networks, in which regions of the brain are nodes and edges denote structural or functional connections [4, 5]. Computer scientists have used MSTs to segment video into meaningful partitions [6] and decompose images into a base layer and a detail layer [7]. Geographers have used MSTs to describe local building patterns [8]. The use of MSTs to study the hierarchical structure of financial markets using correlation-based networks was first proposed in [9]; some of these concepts were later expanded in a series of papers [10, 11] that included the application of MSTs to a subset of 116 of the 500 stocks in the S&P 500 index. In 2020, PNAS published at least six articles in which researchers used an MST [12, 13, 14, 15, 16, 17].

The MSTs we construct are usually only the MSTs of sample networks, whereas our interest lies in characterizing the MST of the population. Despite the importance of the problem, we are not aware of any published work on what may be inferred about the MST of the population network from the MST of a sample network.
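Since MST computation recurs throughout, a concrete reference implementation may help. Below is a minimal sketch of Kruskal's algorithm in Python; this is our own illustration, and the function name and graph representation are not from the paper.

```python
# Minimal sketch of Kruskal's algorithm (our own illustration).
def kruskal_mst(n, edges):
    """n: number of nodes labeled 0..n-1; edges: list of (weight, u, v).
    Returns the edge set of a minimum spanning forest."""
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = set()
    for w, u, v in sorted(edges):      # consider edges in order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                   # adding this edge creates no cycle
            parent[ru] = rv
            forest.add((w, u, v))
    return forest
```

For a connected graph with unique weights this returns the unique MST; on a disconnected graph it returns the minimum spanning forest.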
(We should clarify that when we say sampling, we mean the sampling of nodes.) Instead, research has focused on other issues, such as using knowledge of the population network to predict the behavior of trees that span sample subgraphs [18]; using knowledge of the population network and sampling process to find a set of edges that contains the sample MST with high probability [19]; or finding the MST when edge weights are random [20].

This paper aims to answer the following related questions:

1. Given that an edge is in the sample graph but not the sample MST, what is the probability that it is not in the population MST? We can think of this probability as the negative predictive value (NPV).
2. Given that an edge appears in the sample MST, what is the probability that it appears in the population MST? We can think of this probability as the positive predictive value (PPV).
3. How well can we estimate these probabilities by bootstrapping from the sample graph?
4. How strong is the relationship between the number of bootstrap MSTs that an edge is in and whether or not that edge is in the population MST?

Theory
A graph G = (V, E) consists of a set of nodes (or vertices) V and a set of edges E; each edge e ∈ E connects a pair of nodes u, v ∈ V, u ≠ v, so that we may write e = (u, v). Here, we assume all graphs are undirected, so (u, v) = (v, u). Each edge e ∈ E has a weight w(e) > 0. A cycle C is a set of edges {e_1, . . . , e_k} ⊂ E such that for all i ∈ {1, . . . , k − 1}, e_i = (v_i, v_{i+1}), where i ≠ j implies v_i ≠ v_j, and e_k = (v_k, v_1). In some texts a cycle also contains the associated vertices, but here, for ease of exposition, the terms "cycle", "tree", "spanning tree", "MST", and "cut" all refer to sets of edges. A tree T is a subset of the edges of a connected graph that contains no cycles; T is spanning if for every v ∈ V there exists e ∈ T such that v is an endpoint of e; and a minimum spanning tree (MST) has the lowest total edge weight of all spanning trees. If the original graph is not connected, its minimum spanning forest (MSF) consists of the MSTs of its connected components. If V_1 and V_2 are non-empty sets of nodes satisfying V_1 ∪ V_2 = V and V_1 ∩ V_2 = ∅, then the associated cut D consists of edges connecting one node from V_1 and one node from V_2. In symbols, D = {(v_1, v_2) ∈ E : v_1 ∈ V_1, v_2 ∈ V_2}.

Throughout, we will assume that distances are measured without error. The symbol |E| denotes the number of elements in the set E and A △ B = (A \ B) ∪ (B \ A). Theorems 2 and 3 are known facts but are proved here for completeness. Theorem 4 is a version of the cycle property and Theorem 5 is a version of the cut property. Theorem 9 is based on [21].

Lemma 1. If G is a connected graph with N ∈ {2, 3, 4, . . .} nodes and T is a spanning tree of G, then at least one node in G is the endpoint of only one edge in T.

Proof. Suppose the contrary, that each node in G is the endpoint of at least two edges in T. (No node in G can be the endpoint of zero edges in T because T is spanning, and thus connects all nodes.) Start at any node in G and walk along an edge in T. From the next node, walk along a different edge in T. Continue this walk, leaving each node by a different edge than the one by which you arrived. Since N < ∞, at some point you will arrive at a node you have already visited. This means that T contains a cycle, which is a contradiction. So at least one node in G is the endpoint of only one edge in T.

Theorem 2. If G is a connected graph with N ∈ {2, 3, 4, . . .} nodes and T is a spanning tree of G, then |T| = N − 1.

Proof. Suppose N = 2. Then G has only one edge, T = E, and |T| = 1 = N − 1. Now suppose the theorem holds for graphs with N − 1 ≥ 2 nodes and that G has N nodes. By Lemma 1, find a node v in G that is the endpoint of only one edge e in T. Remove v from G to create the new graph G′ and remove e from T to create the new tree T′. Since T′ is a spanning tree of G′, and G′ has N − 1 nodes, |T′| = N − 2. Since |T′| = |T| − 1, |T| = N − 1. Thus, through induction, we have shown that the theorem is true for N ∈ {2, 3, 4, . . .}.

We can extend Theorem 2 to conclude that, if G is a graph with N ∈ {2, 3, 4, . . .} nodes and K components, and T is a spanning tree of G, then |T| = N − K.

Theorem 3. If G = (V, E) is a connected graph with N < ∞ nodes and unique positive edge weights then G has exactly one MST.

Proof. If N ≤ 1 then E = ∅ and the MST is empty. If N = 2 then |E| = 1 and the MST is E. Suppose N ≥ 3. Suppose the opposite of the statement of the theorem, that G has more than one MST. Let A and B denote two distinct MSTs of G. Let a = arg min_{e ∈ A △ B} w(e). Since the edge weights are unique, so is a. Without loss of generality, assume a ∈ A. Since B is a spanning tree, B ∪ {a} contains a cycle C containing a. Since A is a tree, A cannot contain C, meaning C must contain an edge b ∉ A. Since a, b ∈ A △ B and a = arg min_{e ∈ A △ B} w(e), w(b) > w(a). This means B ∪ {a} \ {b} is a spanning tree with lower total edge weight than B, which is a contradiction. Thus G has exactly one MST.

Theorem 4.
Suppose G = (V, E) is a connected graph with N < ∞ nodes and edges with unique positive weights. Let T be the (unique) MST of G. Then e ∈ E \ T if and only if e belongs to a cycle C in G and e has greater weight than every other edge in C.

Proof. Suppose e ∈ E \ T. Then T ∪ {e} contains a cycle C. The weight of e cannot be equal to the weight of any other edge in C because all the edge weights are unique. If there exists e′ ∈ C such that w(e′) > w(e), then T ∪ {e} \ {e′} would be a spanning tree with smaller total edge weight than T, and T would not be an MST. Thus e has edge weight greater than every other edge in C.

Conversely, suppose e belongs to a cycle C in G and e has greater weight than every other edge in C. Suppose e ∈ T. If e′ is any other edge in C then T ∪ {e′} \ {e} would be a spanning tree with smaller total edge weight than T, which is a contradiction. Thus e ∉ T.

Theorem 5.
Suppose G = (V, E) is a connected graph with N < ∞ nodes and edges with unique positive weights. Let T be the (unique) MST of G. Then e ∈ T if and only if e belongs to a cut D in G and e has lower weight than every other edge in D.

Proof. Suppose e ∈ T but each cut containing e contains another edge with lower weight than e. Removing e from T would split T into two components, T_1 and T_2, where T = T_1 ∪ T_2 ∪ {e}. Let V_1 denote the set of endpoints of edges in T_1, let V_2 denote the set of endpoints of edges in T_2, and let D denote the set of edges in E with one endpoint in V_1 and the other endpoint in V_2. Let e′ denote another edge in D with lower weight than e. Then T ∪ {e′} \ {e} is a spanning tree with lower total edge weight than T, a contradiction. Thus if e ∈ T then e belongs to a cut D and has the lowest weight of any edge in D.

Conversely, suppose e = (v_1, v_2) belongs to a cut D in G and e has lower weight than every other edge in D. Let V_1 and V_2 denote the two sets of vertices separated by this cut, with v_1 ∈ V_1 and v_2 ∈ V_2. If e ∉ T then T ∪ {e} contains a cycle C. C \ {e} is a path from v_1 ∈ V_1 to v_2 ∈ V_2, and thus contains an edge e′ ∈ D. By assumption, w(e) < w(e′). Thus T ∪ {e} \ {e′} is a spanning tree with lower total edge weight than T, a contradiction. Thus e ∈ T.

At this point it is necessary to define more rigorously our first quantity of interest, the negative predictive value (NPV). We consider two related but not necessarily equivalent approaches. First, let G_n = (V_n, E_n) be the subgraph of G = (V, E) induced by sampling n nodes from V. (At this point we do not specify the sampling mechanism.) Assuming G has unique edge weights, let T be the unique MST of G and let T_n be the unique MST of G_n. Finally, let e be an edge selected uniformly at random from E. We want to know P(e ∈ E \ T | e ∈ E_n \ T_n). Of course, this quantity is only defined if P(e ∈ E_n \ T_n) > 0. The second approach is to find

E[ |E_n \ (T ∪ T_n)| / |E_n \ T_n| · I(|E_n \ T_n| > 0) ],

where I(A) = 1 if event A transpires and 0 otherwise.

Theorem 6.
Let G = (V, E) be a graph with N < ∞ nodes, unique positive edge weights, and MSF T. Let G_n = (V_n, E_n) be a subgraph of G with n ∈ {2, 3, . . . , N} nodes and MSF T_n. Then T ∩ (E_n \ T_n) = ∅.

Proof. Suppose there exists e ∈ T ∩ (E_n \ T_n). Then {e} ∪ T_n contains a cycle C. If there exists an edge e′ ∈ C with greater weight than e then T_n ∪ {e} \ {e′} is a spanning tree with smaller weight than T_n, a contradiction. Thus e has the largest weight of any edge in C. Since C ⊂ E_n ⊂ E, Theorem 4 implies e ∉ T. This is a contradiction, so T ∩ (E_n \ T_n) = ∅.

Theorem 6 implies that E_n \ T_n = E_n \ (T ∪ T_n), meaning

P(e ∈ E \ T | e ∈ E_n \ T_n) = P(e ∈ E_n \ (T ∪ T_n)) / P(e ∈ E_n \ T_n) = 1

and

E[ |E_n \ (T ∪ T_n)| / |E_n \ T_n| · I(|E_n \ T_n| > 0) ] = 1.

In other words, for both approaches, the NPV is 1. This holds irrespective of how G is generated or how the nodes in V_n are sampled; we just require that the edge weights be unique and that the appropriate denominators are non-zero.

Our next task is to find the positive predictive value, or PPV. Just as with the NPV, we take two approaches. We want to find P(e ∈ T | e ∈ T_n) and

E[ |T ∩ T_n| / |T_n| · I(|T_n| > 0) ].

Unlike with the NPV, these values will depend on how G is generated and how the nodes in V_n are sampled. We begin with the case where G is a complete graph and the nodes in V_n are sampled uniformly at random from V.
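Theorem 6, and hence the NPV-equals-1 result, is easy to confirm numerically. The stdlib-only Python sketch below (our own code; all names are ours) repeatedly builds a complete graph with unique random weights, samples nodes uniformly, and checks that no edge of the sample graph outside the sample MST ever appears in the population MST.

```python
import itertools, random

def mst(nodes, weights):
    """Kruskal's algorithm; weights maps frozenset({u, v}) -> unique weight."""
    parent = {v: v for v in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = set()
    for e in sorted(weights, key=weights.get):
        u, v = e
        if find(u) != find(v):
            parent[find(u)] = find(v)
            tree.add(e)
    return tree

random.seed(0)
violations = 0
for _ in range(200):
    N, n = 10, 6
    V = range(N)
    w = {frozenset(e): random.random() for e in itertools.combinations(V, 2)}
    T = mst(V, w)                                    # population MST
    Vn = random.sample(V, n)                         # uniform node sample
    wn = {e: w[e] for e in w if set(e) <= set(Vn)}   # induced subgraph
    Tn = mst(Vn, wn)
    # Theorem 6: edges of the sample graph outside the sample MST
    # are never in the population MST.
    if (set(wn) - Tn) & T:
        violations += 1
print(violations)  # Theorem 6 predicts 0
```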
Theorem 7. Let G = (V, E) be a complete graph with N < ∞ nodes and positive, unique edge weights. Let G_n = (V_n, E_n), where V_n contains n ∈ {2, 3, . . . , N} nodes selected uniformly at random from V, and where E_n contains those edges from E that have both endpoints in V_n. (In other words, G_n is the subgraph of G induced by the nodes in V_n.) Define T to be the unique MST of G and define T_n to be the unique MST of G_n. Then

P(e ∈ T | e ∈ T_n) = E[ |T ∩ T_n| / |T_n| · I(|T_n| > 0) ] = n/N.

Proof. Since G is complete, G_n must be connected, so |T_n| = n − 1 and

P(e ∈ T_n) = |T_n| / |E| = (n − 1) / C(N, 2),

where C(N, 2) = N(N − 1)/2 is the number of edges of the complete graph. Since the nodes in V_n are selected uniformly at random from V, without respect to whether they are the endpoints of edges in T, T ⊥⊥ E_n. Thus

P(e ∈ T ∩ T_n) = P(e ∈ T ∩ T_n ∩ E_n)    (1)
= P(e ∈ T ∩ E_n)    (2)
= P(e ∈ T) P(e ∈ E_n)
= (|T| / |E|)(|E_n| / |E|)
= [(N − 1) / C(N, 2)] · [C(n, 2) / C(N, 2)].

Note that we used Theorem 6 to move from (1) to (2). So

P(e ∈ T | e ∈ T_n) = P(e ∈ T ∩ T_n) / P(e ∈ T_n) = (N − 1) C(n, 2) / [(n − 1) C(N, 2)] = n/N.

Since G is connected and n ≥ 2, |T_n| = n − 1 > 0, so

E[ |T ∩ T_n| / |T_n| · I(|T_n| > 0) ] = E[ |T ∩ T_n| / |T_n| ] = E(|T ∩ T_n|) / (n − 1).

By Theorem 6, E(|T ∩ T_n|) = E(|T ∩ E_n|). If e_1, . . . , e_{N−1} is an enumeration of the edges in T, then

E(|T ∩ E_n|) = Σ_{i=1}^{N−1} P(e_i ∈ E_n) = Σ_{i=1}^{N−1} C(n, 2) / C(N, 2) = (N − 1) C(n, 2) / C(N, 2)

and

E[ |T ∩ T_n| / |T_n| · I(|T_n| > 0) ] = [(N − 1) / (n − 1)] · C(n, 2) / C(N, 2) = n/N.

In other words, under the conditions of Theorem 7, the probability that an edge is in the population MST given that it is in the sample MST is equal to the proportion of the population that has been sampled. Applied researchers can increase this probability by recruiting more participants, and the increase is linear in sample size.

Theorem 7 relies on two key facts: The first is that |T|, |T_n|, and |E_n| are known constants, which results from G being complete. The second is that T is independent of E_n, which results from sampling the nodes uniformly at random. In Theorem 8, we consider a scenario where |T|, |T_n|, and |E_n| are random, but T is still independent of E_n.
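Theorem 7 can likewise be checked by simulation. The sketch below (again our own stdlib-only code) averages |T ∩ T_n|/|T_n| over many random complete graphs; by Theorem 7 the average should be close to n/N.

```python
import itertools, random

def mst(nodes, weights):
    """Kruskal's algorithm over a weight map frozenset({u, v}) -> weight."""
    parent = {v: v for v in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = set()
    for e in sorted(weights, key=weights.get):
        u, v = e
        if find(u) != find(v):
            parent[find(u)] = find(v)
            tree.add(e)
    return tree

random.seed(1)
N, n, reps = 20, 10, 2000
total = 0.0
for _ in range(reps):
    V = range(N)
    w = {frozenset(e): random.random() for e in itertools.combinations(V, 2)}
    T = mst(V, w)                                    # population MST
    Vn = random.sample(V, n)                         # uniform node sample
    wn = {e: w[e] for e in w if set(e) <= set(Vn)}
    Tn = mst(Vn, wn)                                 # sample MST
    total += len(T & Tn) / len(Tn)                   # |T ∩ T_n| / |T_n|
print(total / reps)  # Theorem 7: the expectation is n/N = 0.5 here
```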
Theorem 8. Let G and G_n be defined as in Theorem 7. Let G′ = (V, E′), where each edge from E is included in E′ independently and with probability p. Let G′_n = (V_n, E′_n), where E′_n contains those edges from E′ that have both endpoints in V_n. (In other words, E′_n = E_n ∩ E′.) Define T′ to be the unique MSF of G′ and T′_n to be the unique MSF of G′_n. Let K′ denote the number of components in G′ and K′_n denote the number of components in G′_n. Then

P(e ∈ T′ | e ∈ T′_n) = (n/N) · [(n − 1)/(N − 1)] · [N − E(K′)] / [n − E(K′_n)].

Proof. Note that T′_n ⊂ E′_n and (T′ ⊥⊥ E′_n) | E′. That is, if an edge is in E′, whether or not it is in T′ has no bearing on whether or not it will be included in E′_n. Using this fact and Theorem 6,

P(e ∈ T′ ∩ T′_n) = P(e ∈ T′ ∩ E′_n)
= P(e ∈ T′ ∩ E′_n ∩ E′)
= P(e ∈ T′ ∩ E′_n | e ∈ E′) P(e ∈ E′)
= P(e ∈ T′ | e ∈ E′) P(e ∈ E′_n | e ∈ E′) P(e ∈ E′)
= P(e ∈ T′ ∩ E′) P(e ∈ E′_n | e ∈ E′)
= P(e ∈ T′) P(e ∈ E′_n | e ∈ E′)
= P(e ∈ T′) P(e ∈ E_n ∩ E′ | e ∈ E′)
= P(e ∈ T′) P(e ∈ E_n ∩ E′) / P(e ∈ E′)
= P(e ∈ T′) P(e ∈ E_n) P(e ∈ E′) / P(e ∈ E′)
= P(e ∈ T′) P(e ∈ E_n).

Next,

P(e ∈ T′) = Σ_{k=1}^{N} P(e ∈ T′ | K′ = k) P(K′ = k) = Σ_{k=1}^{N} [(N − k) / C(N, 2)] P(K′ = k) = [N − E(K′)] / C(N, 2).

Similarly,

P(e ∈ T′_n) = [n − E(K′_n)] / C(N, 2).

Thus,

P(e ∈ T′ | e ∈ T′_n) = P(e ∈ T′ ∩ T′_n) / P(e ∈ T′_n)
= P(e ∈ T′) P(e ∈ E_n) / P(e ∈ T′_n)
= {[N − E(K′)] / [n − E(K′_n)]} · {n(n − 1) / [N(N − 1)]}.

If E(K′) and E(K′_n) approach 1 as n and N increase toward infinity then P(e ∈ T′ | e ∈ T′_n) → n/N, which is the result for the complete graph. We were unable to determine E[ |T ∩ T_n|/|T_n| · I(|T_n| > 0) ] analytically, so we explored it through simulation (described in the next section).

At this point we turn to graphs with more than one MST. Theorem 9 is based on [21].
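Theorem 8 can also be probed numerically. The sketch below (our own code; parameter values are arbitrary) uses the fact that, for e chosen uniformly from E, P(e ∈ A) = E|A|/|E|, so the conditional probability can be estimated as a ratio of Monte Carlo sums and compared with the theorem's formula evaluated at the Monte Carlo means of K′ and K′_n.

```python
import itertools, random

def msf(nodes, weights):
    """Kruskal's algorithm; returns the minimum spanning forest."""
    parent = {v: v for v in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    forest = set()
    for e in sorted(weights, key=weights.get):
        u, v = e
        if find(u) != find(v):
            parent[find(u)] = find(v)
            forest.add(e)
    return forest

random.seed(4)
N, n, p, reps = 12, 7, 0.5, 3000
hit = tot = k_sum = kn_sum = 0
nodes = list(range(N))
for _ in range(reps):
    w = {frozenset(e): random.random() for e in itertools.combinations(nodes, 2)}
    wp = {e: w[e] for e in w if random.random() < p}    # keep each edge w.p. p
    Vn = random.sample(nodes, n)
    wpn = {e: wp[e] for e in wp if set(e) <= set(Vn)}   # induced sample graph
    Tp, Tpn = msf(nodes, wp), msf(Vn, wpn)
    hit += len(Tp & Tpn)
    tot += len(Tpn)
    k_sum += N - len(Tp)     # number of components = nodes - forest edges
    kn_sum += n - len(Tpn)
EK, EKn = k_sum / reps, kn_sum / reps
lhs = hit / tot                                   # estimate of P(e in T' | e in T'_n)
rhs = (n * (n - 1)) / (N * (N - 1)) * (N - EK) / (n - EKn)
print(lhs, rhs)   # the two numbers should be close
```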
Theorem 9. Suppose G = (V, E) is a connected, weighted graph with finitely many nodes and more than one MST. Let A and B denote two of these MSTs. Then there exists a bijective function g : A \ B → B \ A such that the weight of a is equal to the weight of g(a), a and g(a) belong to a cycle C ⊂ B ∪ {a} in which they are the maximum weight edges, and a and g(a) belong to a cut D ⊂ E in which they are the minimum weight edges.

Proof. Take a_1 = (v_1, v_2) ∈ A \ B. (If A \ B = ∅ then A ⊂ B, and since |A| = |B|, that would imply that A = B.) Removing a_1 from A would split A into two components, A_1 and A_2, where A = A_1 ∪ A_2 ∪ {a_1}. Let V_1 denote the set of endpoints of edges in A_1, let V_2 denote the set of endpoints of edges in A_2, with v_1 ∈ V_1 and v_2 ∈ V_2, and let D denote the set of edges in E with one endpoint in V_1 and the other endpoint in V_2. Since a_1 ∉ B, B ∪ {a_1} contains a cycle C. Since C \ {a_1} consists of a path from v_1 ∈ V_1 to v_2 ∈ V_2, there is at least one edge b_1 ∈ C \ {a_1} that is also in D. Since D ∩ A = {a_1} and b_1 ≠ a_1, b_1 ∉ A. So b_1 ∈ B \ A.

1. Suppose the weight of b_1 is strictly greater than the weight of a_1. Then B ∪ {a_1} \ {b_1} would be a spanning tree with total edge weight less than B, and B would not be an MST.
2. Suppose the weight of b_1 is strictly less than the weight of a_1. Then A ∪ {b_1} \ {a_1} would be a spanning tree with total edge weight less than A, and A would not be an MST.

So w(a_1) = w(b_1).

1. Suppose C contains an edge c with weight greater than that of a_1 and b_1. Then B ∪ {a_1} \ {c} is a spanning tree with lower total edge weight than B, which is a contradiction. Thus a_1 and b_1 have the maximum weight of any edge in C.
2. Suppose D contains an edge d with weight less than that of a_1 and b_1. Then A ∪ {d} \ {a_1} is a spanning tree with lower total edge weight than A, which is a contradiction. Thus a_1 and b_1 have the minimum weight of any edge in D.

Define T_1 = A ∪ {b_1} \ {a_1}. It is a spanning tree with total edge weight equal to that of A, so it is an MST. If T_1 ≠ B, repeat the process: select a_2 ∈ T_1 \ B and b_2 ∈ B \ T_1 such that a_2 and b_2 belong to

1. a cycle C ⊂ B ∪ {a_2} in which they have the maximum weight of any edge; and
2. a cut D ⊂ E in which they have the minimum weight of any edge.

Note that the edges a_1, a_2, b_1, and b_2 are all distinct. Define T_2 = T_1 ∪ {b_2} \ {a_2}, which is another MST. If T_2 ≠ B, repeat the process. If k = |A \ B| = |B \ A|, then T_k = B.

Theorem 9 implies that any graph with multiple MSTs must contain a cycle with two edges sharing the maximum weight. The converse is not true, as demonstrated in Figure 1(a). Even if a graph contains a cycle with two edges sharing the maximum weight, it may only have one MST.
Figure 1: Counterexamples. (a) H contains a cycle with two edges sharing the maximum weight, but it only has one MST T. (b) For the graph G, there are 4! = 24 different orderings of the edges, but 5 MSTs (labeled A through E). Since 24/5 is not an integer, the orderings cannot be split evenly among the MSTs: A has four orderings leading to it but each of the other MSTs has five orderings leading to it. (c) If we chain G to create G′, we see that the MST A′ has 4 × 4 = 16 orderings leading to it while B′ has 5 × 5 = 25, so the ratio of the number of orderings leading to A′ to the number of orderings leading to B′ is 16/25 = (4/5)². Chaining G in this way indefinitely demonstrates that even asymptotically the ratio of orderings leading to each MST does not approach 1.

Let G = (V, E) be a connected, weighted graph with N vertices, m edges, and at least two MSTs, A and B. Then m ≥ N and there must be at least two edges in E with the same weight. Label the edges in E from lowest to highest weight so that, if w(e_i) is the weight of the i-th edge, w(e_1) ≤ w(e_2) ≤ · · · ≤ w(e_m). For edges with the same weight, order them arbitrarily. Suppose there are K ≥ 1 distinct edge weights and that k_i ≥ 1 edges share the i-th weight for i = 1, . . . , K. Then there are Π_{i=1}^{K} k_i! different orderings σ of the edges such that w(e_{σ(1)}) ≤ w(e_{σ(2)}) ≤ · · · ≤ w(e_{σ(m)}). Note that the identity function σ(i) = i is counted as one of these orderings.

In order to find the MST of a graph with unique edge weights, the actual values of the weights are not important. It is only their order that matters. If we think of each ordering σ as treating edge e_{σ(i)} as having weight w_σ(e_{σ(i)}) = i, then the edge weights are unique and σ results in a unique MST. If the number of orderings leading to each MST were equal, we could sample uniformly from the set of MSTs of a graph by sampling uniformly from the set of edge orderings. Unfortunately, Figure 1(b) demonstrates that the number of orderings leading to each MST may not be equal. For the graph G, there are 4! = 24 different orderings of the edges, but 5 MSTs (labeled A through E). Since 24/5 is not an integer, the orderings cannot be split evenly among the MSTs: A has four orderings leading to it but each of the other MSTs has five orderings leading to it. It is tempting to assume that as the number of edges and/or cycles approaches infinity, the proportion of orderings leading to each MST will approach the same value, but Figure 1(c) provides a counterexample. If we chain G to create G′, we see that the MST A′ has 4 × 4 = 16 orderings leading to it while B′ has 5 × 5 = 25, so the ratio of the number of orderings leading to A′ to the number of orderings leading to B′ is 16/25 = (4/5)². As the number of instances of G chained together approaches infinity, the ratio of the number of orderings leading to the A chain to the number of orderings leading to the B chain approaches zero, not one. So the proportion of orderings leading to each MST does not approach the same value even as the number of edges and/or cycles approaches infinity.
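The counting of orderings can be made concrete in code. The toy multigraph below is our own construction (it happens to reproduce the 4-versus-5 ordering counts described for Figure 1(b), though it is not necessarily the graph in the figure): we enumerate all 4! orderings of four equal-weight edges, run Kruskal's algorithm under each ordering, and count how many orderings lead to each MST.

```python
import itertools

# Toy multigraph with equal weights: nodes a, b, c; two parallel
# a-b edges plus b-c and c-a. (Our own example, not the paper's figure.)
edges = {"ab1": ("a", "b"), "ab2": ("a", "b"),
         "bc": ("b", "c"), "ca": ("c", "a")}

def kruskal(order):
    """Run Kruskal's algorithm treating `order` as the weight ranking."""
    parent = {v: v for v in "abc"}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    tree = []
    for name in order:
        u, v = edges[name]
        if find(u) != find(v):      # add edge unless it closes a cycle
            parent[find(u)] = find(v)
            tree.append(name)
    return frozenset(tree)

counts = {}
for order in itertools.permutations(edges):
    t = kruskal(order)
    counts[t] = counts.get(t, 0) + 1

print(sorted(counts.values()))  # prints [4, 5, 5, 5, 5]: 5 MSTs, unequal counts
```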
Methods

Other types of graphs and sampling methods are not as tractable as complete graphs and sampling uniformly at random. In order to estimate the probability that an edge is in the population MST given that it is in the sample MST for more complex situations, a simulation study was conducted. For the simulation study, we targeted the parameter E[ |T ∩ T_n|/|T_n| · I(|T_n| > 0) ].

Simulation Study

The following algorithm was repeated for i = 1, . . . , 1000:

1. Generate a graph g_i with N = 100 nodes. (More information on the type of graph is included below.) This graph g_i is the population graph.
2. Find t_{g_i}, the MST of g_i. Thus t_{g_i} is the population MST.
3. Sample n nodes from g_i, yielding the induced subgraph h_i. (More information on the sampling process and the value of n is included below.) Thus h_i is the sample graph.
4. Find t_{h_i}, the MST of h_i. Thus t_{h_i} is the sample MST.
5. Calculate the positive predictive value PPV_i = |t_{g_i} ∩ t_{h_i}| / |t_{h_i}|.
6. Repeat the following for j = 1, . . . , 100:
   (a) Sample n²/N nodes from h_i, yielding the induced subgraph h_{i,j}. In other words, sample the same proportion of nodes from h_i as were sampled from g_i. Thus h_{i,j} is the bootstrap sample graph.
   (b) Find t_{h_{i,j}}, the MST of h_{i,j}. Thus t_{h_{i,j}} is the bootstrap MST.
   (c) Calculate the bootstrap positive predictive value BPPV_{i,j} = |t_{h_i} ∩ t_{h_{i,j}}| / |t_{h_{i,j}}|.
7. Calculate \overline{BPPV}_i = (1/100) Σ_{j=1}^{100} BPPV_{i,j}.
8. Calculate the area under the ROC curve (AUC_i) using the number of times that an edge appears in a bootstrap MST (i.e., one of the t_{h_{i,j}}) as the predictor and whether that edge appears in t_{g_i} as the outcome.

The following statistics were calculated to summarize the 1,000 replications:

\overline{PPV} = (1/1000) Σ_{i=1}^{1000} PPV_i,  \overline{BPPV} = (1/1000) Σ_{i=1}^{1000} \overline{BPPV}_i,  \overline{AUC} = (1/1000) Σ_{i=1}^{1000} AUC_i.

Confidence intervals were calculated as the mean plus or minus z_{0.975} times the estimated standard error of that mean, e.g.,

\overline{PPV} ± z_{0.975} √{ [1/(1000 · 999)] Σ_{i=1}^{1000} (PPV_i − \overline{PPV})² },

and analogously for \overline{BPPV} and \overline{AUC}.

An entire simulation, with 1,000 replications, was run for each of the following types of graphs (with N = 100 nodes):

1. Complete: Complete graph with weights uniformly distributed on (0, 1).
2. G(N, p): First, a complete graph was generated with weights uniformly distributed on (0, 1); each edge was then retained independently with probability p.
3. Normal: Vertices were distributed in R² according to a bivariate standard normal distribution, and the weight of an edge connecting two vertices was equal to the Euclidean distance between them.
4. Barabási–Albert: Barabási–Albert (BA) graph with each new node attaching to three existing nodes and with weights uniformly distributed on (0, 1).

For each replication, n was set to 25, 50, and 75, and for each replication and each value of n, the following types of sampling were used:

1. Uniform: Nodes were sampled uniformly at random.
2. Near: For complete graphs, node i's probability of being sampled was proportional to max{s_1, . . . , s_N} − s_i + min{s_1, . . . , s_N}, where s_i is the total weight of all edges adjacent to node i. This simulates preferentially selecting nodes that are close to other nodes, while ensuring that every node has positive probability of being selected. For non-complete graphs, node i's probability of being sampled was proportional to d_i or d_i + 1, where d_i is the degree of node i, if the minimum degree was positive or zero, respectively. This simulates preferentially selecting nodes with many neighbors, while ensuring that every node has positive probability of being selected.
3. Far: For complete graphs, node i's probability of being sampled was proportional to s_i. This simulates preferentially selecting nodes that are far from other nodes. For non-complete graphs, node i's probability of being sampled was proportional to max{d_1, . . . , d_N} − d_i + max{1, min{d_1, . . . , d_N}}. This simulates preferentially selecting nodes with few neighbors, while ensuring that every node has positive probability of being selected.
4. Random Walk: The following algorithm was repeated until n nodes were recorded in the vector v: A node was selected uniformly at random from nodes not already in v and recorded. Suppose it was node i. If node i had no neighbors, the process was restarted. If node i had neighbors, with labels j_1, . . . , j_{d_i}, one of these neighbors was selected at random to be the next recorded node. The probabilities were not uniform; node j_k's probability of being selected was proportional to max{s_{j_1}, . . . , s_{j_{d_i}}} − s_{j_k} + min{s_{j_1}, . . . , s_{j_{d_i}}}. This simulates preferentially selecting a neighbor that is close to the current node, while ensuring that every neighbor has positive probability of being selected.

Figure 2: Illustration of one replication.

One additional sampling method was used only for the "normal" graph.
For each replication, all nodes in the first quadrant were sampled (i.e., nodes with x and y coordinates greater than or equal to 0); then, all nodes in the first and fourth quadrants were sampled (i.e., nodes with x coordinate greater than or equal to 0); finally, all nodes in the first, second, and fourth quadrants were sampled (i.e., nodes with x or y coordinate greater than or equal to 0). Note that for each replication, approximately (but not necessarily exactly) 25, 50, and 75 nodes are sampled. No bootstrapping was performed for this sampling method.
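One replication of the simulation algorithm (steps 1 through 7) can be sketched as follows for the complete-graph case with uniform sampling. This is our own stdlib-only illustration; for brevity it uses 20 bootstrap draws rather than the paper's 100, and all helper names are ours.

```python
import itertools, random

def mst(nodes, weights):
    """Kruskal's algorithm over a weight map frozenset({u, v}) -> weight."""
    parent = {v: v for v in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = set()
    for e in sorted(weights, key=weights.get):
        u, v = e
        if find(u) != find(v):
            parent[find(u)] = find(v)
            tree.add(e)
    return tree

random.seed(2)
N, n, B = 100, 50, 20          # population size, sample size, bootstrap draws
V = range(N)
w = {frozenset(e): random.random() for e in itertools.combinations(V, 2)}
t_g = mst(V, w)                                    # step 2: population MST
Vn = random.sample(V, n)                           # step 3: sample graph
wn = {e: w[e] for e in w if set(e) <= set(Vn)}
t_h = mst(Vn, wn)                                  # step 4: sample MST
ppv = len(t_g & t_h) / len(t_h)                    # step 5

bppv = []
for _ in range(B):                                 # step 6
    Vb = random.sample(Vn, n * n // N)             # same sampling proportion
    wb = {e: wn[e] for e in wn if set(e) <= set(Vb)}
    t_b = mst(Vb, wb)
    bppv.append(len(t_h & t_b) / len(t_b))
bppv_bar = sum(bppv) / B                           # step 7
print(ppv, bppv_bar)
```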
HIV Genetic Distance Network

Infectious disease researchers have used MSTs to infer the transmission pathway of infectious disease [22, 23]. They typically sequence a portion of the pathogen's genome from human tissue samples; calculate distances between samples based on those sequences; construct networks in which each sample is a node and two nodes are connected by an edge if the distance between them is below a specified cut-off; weight each edge by the distance between the two nodes; and then construct an MST from this weighted graph. The MST is a natural starting point when trying to determine the transmission pathway of the pathogen. The researchers are usually only interested in the first time a person was infected, so they want to eliminate cycles, and the person most likely to have infected a given individual is assumed to be whoever has the pathogen with the most similar genetic makeup.

Figure 3: The number of components of each size in the HIV graph.

The Primary Infection Research Consortium at UC San Diego (PIRC) [24, 25] provided an edgelist for an HIV genetic distance network. Each year, the PIRC recruits up to 100 people who are newly diagnosed with HIV. Both specimens and clinical data are collected upon recruitment and then at regular intervals thereafter. Participants with chronic HIV infection are followed for twelve weeks and participants with acute HIV infection are followed for several years.

Each of the 1,234 nodes in the edgelist corresponded to an HIV sample; edge weights were genetic distances calculated using the HIV-TRACE method [26]. This method aligns a sample sequence to a reference sequence and then calculates distances between each pair of sample sequences. As in [22] and [27], edges with distances greater than 1.5% were deleted. This yielded a graph with 588 nodes, 984 edges, and 171 components. Figure 3 displays the number of components of each size.

Fifty-three edges had a weight of 0. Unfortunately, due to uncertainty in the distance estimation process [28], even if the edges between samples A and B and between B and C both had weights of 0, the edge between samples A and C was not always 0. Thus, weights of 0 were set to one-half the minimum positive edge weight. Regarding the edge weights, 746 were unique, 69 were shared by two edges, 11 were shared by three edges, two were shared by four edges, one was shared by six edges, and one was shared by fifty-three edges. Figure 4 displays the number of edges that had each edge weight.
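The preprocessing described above (the 1.5% distance cut-off and the handling of zero weights) can be sketched as follows. The toy edgelist and its distances are made up for illustration; the real PIRC data and HIV-TRACE distances are not reproduced here.

```python
# Toy edgelist: (node_a, node_b, genetic distance); the values are made up.
raw = [("s1", "s2", 0.004), ("s2", "s3", 0.0), ("s3", "s4", 0.021),
       ("s1", "s4", 0.012), ("s4", "s5", 0.0)]

# Keep only edges with distance at most 1.5% (0.015), as in the paper.
kept = [e for e in raw if e[2] <= 0.015]

# Replace zero weights by half the minimum positive weight.
min_pos = min(d for _, _, d in kept if d > 0)
clean = [(a, b, d if d > 0 else min_pos / 2) for a, b, d in kept]
print(clean)
```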
Figure 4: Number of edges with each edge weight in the HIV graph.

The minimum positive difference between any two edge weights was 10 − . In order to ensure unique edge weights, one of the orderings described in the Theory section was selected uniformly at random and used throughout the analysis. In other words, the edges were ordered from smallest to greatest weight, with ties broken arbitrarily.

The same algorithm described in the Simulation Study subsection was used to analyze the empirical data, with the following modifications:

• For each of the 1,000 replications, g_i = g was the empirical HIV genetic distance network. That is, the same graph was used each time.

• In order to sample the same proportion of nodes as in the simulation study (25%, 50%, and 75%), n was set to 147, 294, and 441, respectively.

Figure 5 displays the entire HIV genetic distance network, its MST, a subgraph induced by sampling 50% of the nodes uniformly at random, and the MST of the induced subgraph.

The PIRC also provided three-digit zip codes for 564 of the 588 nodes in the graph. For each of the three zip codes with the most nodes, an MST was created using only nodes from that zip code, and the proportion of edges in each MST that were also in the population MST was calculated. These three zip codes accounted for 546 nodes, or 92.9% of the nodes in the graph. The next-most-represented zip code had only seven nodes, or 1.2% of the graph.

All graphs were undirected.
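The two preprocessing steps above — replacing zero weights with one-half the minimum positive weight, and selecting one ordering of tied edges uniformly at random — can be sketched as follows. This is an illustrative Python sketch (the paper's analysis was done in R); the variable names are placeholders:

```python
import random

def preprocess(edges):
    """edges: list of (u, v, weight) triples. Returns the edges sorted
    by weight, with zero weights replaced by half the minimum positive
    weight and ties broken uniformly at random."""
    min_pos = min(w for _, _, w in edges if w > 0)
    adjusted = [(u, v, w if w > 0 else min_pos / 2) for u, v, w in edges]
    # Shuffling first and then applying Python's stable sort on weight
    # selects one of the orderings of tied edges uniformly at random.
    random.shuffle(adjusted)
    adjusted.sort(key=lambda e: e[2])
    return adjusted

edges = [("a", "b", 0.0), ("b", "c", 0.008),
         ("a", "c", 0.008), ("c", "d", 0.015)]
print(preprocess(edges)[0])  # the zero-weight edge, now with weight 0.004
```

Because Kruskal's algorithm processes edges in weight order, fixing one random tie-broken ordering up front is equivalent to perturbing tied weights into a strict order, which guarantees a unique MST for that ordering.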
Figure 5: (a) The HIV genetic distance network. It has 588 nodes, 984 edges, and 171 components. (b) The population MST of the HIV genetic distance network. (c) A subgraph induced by sampling 50% of the nodes in the HIV genetic distance network uniformly at random. (d) The sample MST of the induced subgraph. For this sample, the PPV is 0.737.

All simulations were run in R version 3.6.1, using the package igraph [29], on the O2 High Performance Compute Cluster, supported by the Research Computing Group at Harvard Medical School. See http://rc.hms.harvard.edu for more information. The package igraph uses Prim's algorithm [2] to find the MST. Code is available at https://github.com/onnela-lab/mst.

The PIRC was approved by the University of California–San Diego's Human Research Protection Program (Project
Results
Results for the simulation study and empirical data are in Tables 1 and 2. The results for quadrant sampling of normal graphs are as follows: for n = 25, PPV = 0.246 (95% CI 0.240–0.252); for n = 50, PPV = 0.497 (95% CI 0.492–0.503); and for n = 75, PPV = 0.747 (95% CI 0.743–0.752).

For complete, G(n, p), and normal graphs, PPV ≈ n/N when nodes are sampled uniformly at random; PPV > n/N when nodes that have many neighbors or low total edge weight are preferentially sampled, or when an edge-weighted random walk is used to sample nodes; and PPV < n/N when nodes that have few neighbors or high total edge weight are preferentially sampled. For normal graphs, PPV ≈ n/N when nodes are sampled by quadrant.

For BA graphs and uniform sampling, PPV > n/N. "Near" sampling increases PPV whereas "far" sampling decreases it. The random walk produces the highest values of PPV of any of the sampling methods. The simulations using the PIRC data have similar results to the BA graphs, but with even higher values of PPV.

Across graph types and sampling methods, BPPV does not have a consistent relationship with PPV. Sometimes they have overlapping confidence intervals, sometimes BPPV > PPV, and sometimes BPPV < PPV; it is hard to generalize about when each scenario arises. That said, BPPV is closer to PPV at higher sample sizes, for all graphs and all types of sampling.

For complete, G(n, p), and normal graphs, AUC is almost always above 0.75, with many values above 0.90. For BA graphs and the PIRC graph, AUC is still always above 0.50, with many values above 0.90.
Graph            n    Statistic  Uniform              Near                 Far                  Random Walk
Complete         25   PPV        0.245 (0.240-0.251)  0.255 (0.249-0.261)  0.247 (0.241-0.252)  0.268 (0.262-0.274)
                      BPPV       0.240 (0.239-0.241)  0.251 (0.250-0.252)  0.229 (0.228-0.230)  0.300 (0.298-0.301)
                      AUC        0.862 (0.858-0.865)  0.871 (0.868-0.874)  0.842 (0.838-0.845)  0.908 (0.906-0.910)
                 50   PPV        0.503 (0.499-0.508)  0.508 (0.504-0.513)  0.494 (0.490-0.499)  0.513 (0.509-0.518)
                      BPPV       0.500 (0.500-0.501)  0.510 (0.509-0.510)  0.491 (0.491-0.492)  0.523 (0.522-0.523)
                      AUC        0.983 (0.983-0.983)  0.984 (0.983-0.984)  0.981 (0.981-0.982)  0.984 (0.984-0.984)
                 75   PPV        0.748 (0.745-0.751)  0.754 (0.751-0.757)  0.744 (0.740-0.747)  0.761 (0.758-0.764)
                      BPPV       0.747 (0.746-0.747)  0.753 (0.753-0.754)  0.740 (0.740-0.741)  0.756 (0.756-0.757)
                      AUC        0.996 (0.996-0.996)  0.996 (0.996-0.997)  0.996 (0.996-0.996)  0.997 (0.996-0.997)
G(n, p)          25   PPV        0.255 (0.250-0.261)  0.254 (0.248-0.260)  0.245 (0.239-0.250)  0.298 (0.293-0.304)
                      BPPV       0.249 (0.248-0.251)  0.258 (0.257-0.259)  0.242 (0.241-0.244)  0.397 (0.395-0.399)
                      AUC        0.728 (0.722-0.734)  0.744 (0.738-0.750)  0.683 (0.676-0.690)  0.892 (0.889-0.894)
                 50   PPV        0.497 (0.493-0.502)  0.504 (0.500-0.509)  0.495 (0.491-0.500)  0.538 (0.533-0.542)
                      BPPV       0.500 (0.499-0.500)  0.511 (0.510-0.511)  0.490 (0.489-0.490)  0.564 (0.563-0.564)
                      AUC        0.965 (0.964-0.965)  0.965 (0.964-0.965)  0.959 (0.959-0.960)  0.973 (0.972-0.973)
                 75   PPV        0.752 (0.748-0.755)  0.756 (0.753-0.759)  0.742 (0.738-0.745)  0.772 (0.769-0.776)
                      BPPV       0.747 (0.746-0.747)  0.754 (0.754-0.755)  0.740 (0.739-0.740)  0.774 (0.774-0.775)
                      AUC        0.993 (0.992-0.993)  0.993 (0.992-0.993)  0.992 (0.992-0.992)  0.993 (0.993-0.994)
Normal           25   PPV        0.246 (0.241-0.252)  0.254 (0.249-0.260)  0.251 (0.245-0.256)  0.267 (0.261-0.272)
                      BPPV       0.240 (0.239-0.241)  0.251 (0.249-0.252)  0.228 (0.227-0.230)  0.300 (0.299-0.302)
                      AUC        0.864 (0.861-0.868)  0.873 (0.870-0.876)  0.842 (0.839-0.846)  0.906 (0.904-0.909)
                 50   PPV        0.497 (0.492-0.502)  0.503 (0.498-0.507)  0.494 (0.489-0.498)  0.512 (0.508-0.517)
                      BPPV       0.500 (0.499-0.500)  0.509 (0.509-0.510)  0.491 (0.490-0.491)  0.522 (0.522-0.523)
                      AUC        0.983 (0.983-0.983)  0.983 (0.983-0.984)  0.981 (0.981-0.981)  0.984 (0.984-0.984)
                 75   PPV        0.748 (0.744-0.751)  0.755 (0.751-0.758)  0.745 (0.741-0.748)  0.758 (0.755-0.762)
                      BPPV       0.747 (0.746-0.747)  0.753 (0.753-0.754)  0.740 (0.740-0.741)  0.756 (0.756-0.757)
                      AUC        0.996 (0.996-0.996)  0.996 (0.996-0.997)  0.996 (0.996-0.996)  0.997 (0.996-0.997)
Barabási–Albert  25   PPV        0.389 (0.381-0.396)  0.459 (0.453-0.466)  0.377 (0.368-0.385)  0.704 (0.698-0.709)
                      BPPV       0.905 (0.899-0.910)  0.639 (0.633-0.646)  0.974 (0.971-0.977)  0.746 (0.743-0.749)
                      AUC        0.510 (0.500-0.520)  0.511 (0.504-0.518)  0.523 (0.511-0.534)  0.872 (0.869-0.874)
                 50   PPV        0.538 (0.532-0.543)  0.701 (0.697-0.706)  0.465 (0.460-0.470)  0.852 (0.849-0.855)
                      BPPV       0.722 (0.718-0.726)  0.734 (0.732-0.736)  0.830 (0.825-0.835)  0.838 (0.837-0.839)
                      AUC        0.684 (0.679-0.689)  0.849 (0.847-0.852)  0.554 (0.549-0.559)  0.946 (0.945-0.947)
                 75   PPV        0.755 (0.751-0.759)  0.880 (0.878-0.883)  0.649 (0.645-0.654)  0.943 (0.941-0.945)
                      BPPV       0.775 (0.775-0.776)  0.888 (0.887-0.889)  0.707 (0.705-0.709)  0.932 (0.932-0.933)
                      AUC        0.912 (0.910-0.915)  0.969 (0.968-0.969)  0.719 (0.715-0.723)  0.984 (0.984-0.985)
HIV Network      147  PPV        0.578 (0.572-0.583)  0.688 (0.684-0.693)  0.635 (0.630-0.641)  0.990 (0.989-0.991)
                      BPPV       0.800 (0.796-0.804)  0.653 (0.651-0.655)  0.937 (0.935-0.940)  0.975 (0.974-0.975)
                      AUC        0.598 (0.593-0.603)  0.786 (0.784-0.789)  0.626 (0.620-0.632)  0.997 (0.997-0.998)
                 294  PPV        0.727 (0.724-0.730)  0.888 (0.886-0.890)  0.729 (0.726-0.731)  0.995 (0.995-0.995)
                      BPPV       0.795 (0.794-0.796)  0.847 (0.847-0.848)  0.871 (0.870-0.873)  0.992 (0.992-0.992)
                      AUC        0.855 (0.853-0.857)  0.963 (0.963-0.964)  0.788 (0.786-0.790)  0.999 (0.998-0.999)
                 441  PPV        0.868 (0.866-0.869)  0.972 (0.972-0.973)  0.836 (0.834-0.838)  0.998 (0.998-0.998)
                      BPPV       0.880 (0.879-0.880)  0.960 (0.959-0.960)  0.872 (0.871-0.873)  0.997 (0.997-0.997)
                      AUC        0.944 (0.943-0.944)  0.997 (0.997-0.997)  0.919 (0.919-0.920)  0.999 (0.999-0.999)

Table 1: Results, by type of sampling, for complete, G(n, p), normal (excluding quadrant sampling), Barabási–Albert, and HIV genetic distance network graphs. Data are presented as mean (95% CI). Each simulated graph had 100 nodes. The HIV genetic distance network is from the Primary Infection Resource Consortium [24, 25]. PPV = positive predictive value; BPPV = bootstrap positive predictive value; AUC = area under the receiver operating characteristic curve.
Zip Code  Nodes  % of Graph  Proportion of MST Edges in Population MST
921       433    73.6%       87.5%
920       63     10.7%       92.9%
919       50     8.5%        44.4%

Table 2: Results from creating MSTs from nodes belonging to each of the most-represented zip codes in the data.
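The PPV reported in the tables above is, for a given sample, the proportion of sample-MST edges that also appear in the population MST. A minimal sketch of that calculation — in Python with networkx for illustration, whereas the paper's simulations used R and igraph — applied to the complete-graph population of the Theory section:

```python
import random
import networkx as nx

def mst_ppv(population, sample_nodes):
    """Fraction of edges in the MST of the induced subgraph that are
    also edges of the population MST."""
    pop_mst = nx.minimum_spanning_tree(population, weight="weight")
    sub = population.subgraph(sample_nodes)
    samp_mst = nx.minimum_spanning_tree(sub, weight="weight")
    samp_edges = list(samp_mst.edges())
    if not samp_edges:
        return float("nan")
    hits = sum(pop_mst.has_edge(u, v) for u, v in samp_edges)
    return hits / len(samp_edges)

# Complete population graph with unique random weights, as in the theorem.
random.seed(1)
N, n = 100, 50
population = nx.complete_graph(N)
for u, v in population.edges():
    population[u][v]["weight"] = random.random()
sample = random.sample(range(N), n)
print(mst_ppv(population, sample))  # tends to be near n/N = 0.5
```

For this complete-graph setting, averaging this quantity over many uniform samples recovers the theoretical value n/N; for other graph types and sampling methods, the tables above show how far the average departs from n/N.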
Discussion
In spite of the wide use of the MST on sample networks, little was known about what could be inferred from it about the MST of the population network. This study examined exactly that. Returning to the questions posed in the Introduction, we can say the following:

1. Given that an edge is in the sample graph but not the sample MST, what is the probability that it is not in the population MST?

This probability is 1, regardless of the type of graph or sampling method, provided the edge weights are unique.

2. Given that an edge appears in the sample MST, what is the probability that it appears in the population MST?

This depends on the number of nodes sampled, the type of graph, and the type of sampling. This conditional probability is maximized by increasing the sample size; starting with an underlying BA graph; and either preferentially sampling nodes that are "near" other nodes or using an edge-weighted random walk. Of course, applied researchers will not be able to choose their underlying graph type, or to tell before sampling which nodes have high degree or low total edge weight. Thus, an edge-weighted random walk may be their best bet. The probability that an edge appears in the population MST given that it is in the sample MST is minimized by decreasing n; starting with an underlying complete, G(n, p), or normal graph; and preferentially sampling nodes that are "far" from other nodes.

Fortunately, applied researchers may already be using the edge-weighted random walk. In that sampling method, the neighbor of a selected node is more likely to be selected as well if its viral genome is closer to that of the first node. In the real world, this is achieved by contact tracing or partner notification. For example, if someone tests positive for HIV, efforts are made to test anyone with whom they recently had sexual contact or shared a needle.
The idea is to find those individuals who either transmitted the pathogen to the original patient or who contracted the disease from the original patient. Both of these groups of people are likely to be carrying pathogens that are genetically similar to the pathogens carried by the original patient.
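One way to sketch such an edge-weighted random walk in code is below. This is an illustrative Python sketch, not the paper's exact specification: the inverse-weight transition rule (favoring genetically similar neighbors) and all names are assumptions, and the demo graph is a toy complete graph with random weights.

```python
import random
import networkx as nx

def edge_weighted_walk_sample(g, n, seed_node):
    """Sample n distinct nodes by walking the graph, stepping to a
    neighbor with probability inversely proportional to the edge weight,
    so low-distance (genetically similar) neighbors are favored.
    Assumes the walk can actually reach n distinct nodes."""
    sampled = {seed_node}
    current = seed_node
    while len(sampled) < n:
        nbrs = list(g.neighbors(current))
        if not nbrs:  # stuck on an isolated node: restart elsewhere
            current = random.choice(list(g.nodes()))
            sampled.add(current)
            continue
        probs = [1.0 / g[current][v]["weight"] for v in nbrs]
        current = random.choices(nbrs, weights=probs, k=1)[0]
        sampled.add(current)
    return sampled

random.seed(0)
g = nx.complete_graph(10)
for u, v in g.edges():
    g[u][v]["weight"] = random.random() + 0.01  # keep weights positive
print(sorted(edge_weighted_walk_sample(g, 5, seed_node=0)))
```

The contact-tracing analogy maps onto this directly: the seed node is the index patient, and each step preferentially follows a low-distance edge, just as partner notification preferentially reaches people carrying genetically similar virus.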
Figure 6: Degree distributions for the G(n, p), Barabási–Albert, and PIRC graphs. Values for the G(n, p) and Barabási–Albert graphs are averages across 1,000 replications. Values for the complete and normal graphs are not shown because in those graphs, each node has degree N − 1.

The results for the complete, G(n, p), and normal graphs were very similar to each other and differed from the results for BA graphs. This makes sense because sampling nodes uniformly from each of the first three graph types yields a graph of a similar type; in contrast, sampling nodes uniformly from a BA graph does not yield a BA graph [30]. The results for the G(n, p) graph differed somewhat from the results for the complete and normal graphs when a random walk was used to sample nodes. For this sampling method, the average PPV was higher for G(n, p) than for complete and normal graphs. This may indicate that an edge-weighted random walk leads to an increased PPV when the underlying graph is not complete, i.e., when not all possible edges are present. The results for the PIRC data were similar to the results for the BA graphs. This makes sense given the similarity in degree distribution (see Figure 6).

When sampling by zip code, the proportion of edges in the sample MST that are also in the population MST is much, much higher than n/N. This is good news for applied researchers, because it implies that sampling can be limited to a single geographic area and still identify most of the edges that are in the population MST. It is interesting to note that this contrasts with the location-based sampling that was performed with the simulated normal graphs. With the normal graphs, sampling by quadrant yielded conditional probabilities approximately equal to n/N.

One limitation of this study is that the PIRC data have non-unique edge weights, meaning the MST may not be unique. Future studies could examine whether the number of times an edge appears in sample MSTs is indicative of the number of times it appears in population MSTs.
Further research could also examine the impact of measuring edge weights with error.

Declarations
Acknowledgments
The authors would like to thank Susan Little, Christy Anderson, Martin Furey, Felix Torres, SergeiKosakovsky Pond, and the Primary Infection Resource Consortium for sharing and explaining the empiricaldata. The authors would also like to thank Rui Wang, Alessandro Vespignani, and Edoardo Airoldi for theirhelpful suggestions.
Funding
Jonathan Larson is supported by NIH T32 AI007358. Jukka-Pekka Onnela is supported by NIAID R01AI138901.
Affiliations
Jonathan Larson is a student and Jukka-Pekka Onnela is an Associate Professor in the Department ofBiostatistics at Harvard T.H. Chan School of Public Health.
Author Contributions
J.L. designed the research, performed the research, and analyzed the data. J.P.O. supervised the research.J.L. and J.P.O. wrote the paper.
Competing Interests
The authors declare no competing interests.

References

[1] Nešetřil J, Milková E, Nešetřilová H. Otakar Borůvka on minimum spanning tree problem: Translation of both the 1926 papers, comments, history. Discrete Mathematics. 2001;233(1):3–36.

[2] Prim RC. Shortest Connection Networks And Some Generalizations. Bell System Technical Journal. 1957;36(6):1389–1401. Available from: https://onlinelibrary.wiley.com/doi/abs/10.1002/j.1538-7305.1957.tb01515.x.

[3] Kruskal JB. On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem. Proceedings of the American Mathematical Society. 1956;7(1):48–50.

[4] Tewarie P, van Dellen E, Hillebrand A, Stam CJ. The minimum spanning tree: An unbiased method for brain network analysis. NeuroImage. 2015;104:177–188.

[5] van Dellen E, Sommer IE, Bohlken MM, Tewarie P, Draaisma L, Zalesky A, et al. Minimum spanning tree analysis of the human connectome. Human Brain Mapping. 2018;39(6):2455–2471. Available from: https://onlinelibrary.wiley.com/doi/abs/10.1002/hbm.24014.

[6] Wang B, Chen Y, Liu W, Qin J, Du Y, Han G, et al. Real-time hierarchical supervoxel segmentation via a minimum spanning tree. IEEE Transactions on Image Processing. 2020;29:9665–9677.

[7] Jin Y, Zhao H, Gu F, Bu P, Na M. A spatial minimum spanning tree filter. Measurement Science and Technology. 2020 Jan;32(1):015204. Available from: https://doi.org/10.1088/1361-6501/abaa65.

[8] Wu B, Yu B, Wu Q, Chen Z, Yao S, Huang Y, et al. An extended minimum spanning tree method for characterizing local urban patterns. International Journal of Geographical Information Science. 2018;32(3):450–475.

[9] Mantegna RN. Hierarchical structure in financial markets. The European Physical Journal B – Condensed Matter and Complex Systems. 1999;11(1):193–197.

[10] Onnela JP, Chakraborti A, Kaski K, Kertész J. Dynamic asset trees and portfolio analysis. The European Physical Journal B – Condensed Matter and Complex Systems. 2002;30(3):285–288.

[11] Onnela JP, Chakraborti A, Kaski K, Kertész J, Kanto A. Dynamics of market correlations: Taxonomy and portfolio analysis. Physical Review E. 2003 Nov;68:056110. Available from: https://link.aps.org/doi/10.1103/PhysRevE.68.056110.

[12] Li K, Zhang S, Song X, Weyrich A, Wang Y, Liu X, et al. Genome evolution of blind subterranean mole rats: Adaptive peripatric versus sympatric speciation. Proceedings of the National Academy of Sciences. 2020;117(51):32499–32508.

[13] Steinbrenner AD, Muñoz-Amatriaín M, Chaparro AF, Aguilar-Venegas JM, Lo S, Okuda S, et al. A receptor-like protein mediates plant immune responses to herbivore-associated molecular patterns. Proceedings of the National Academy of Sciences. 2020;117(49):31510–31518.

[14] Manning CD, Clark K, Hewitt J, Khandelwal U, Levy O. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences. 2020;117(48):30046–30054.

[15] Saul LK. A tractable latent variable model for nonlinear dimensionality reduction. Proceedings of the National Academy of Sciences. 2020;117(27):15403–15408.

[16] Matsumura H, Hsiao MC, Lin YP, Toyoda A, Taniai N, Tarora K, et al. Long-read bitter gourd (Momordica charantia) genome and the genomic architecture of nonclassic domestication. Proceedings of the National Academy of Sciences. 2020;117(25):14543–14551.

[17] Hahn M, Jurafsky D, Futrell R. Universals of word order reflect optimization of grammars for efficient communication. Proceedings of the National Academy of Sciences. 2020;117(5):2347–2353.

[18] Bertsimas DJ. The probabilistic minimum spanning tree problem. Networks. 1990;20:245–275.

[19] Goemans MX, Vondrák J. Covering minimum spanning trees of random subgraphs. Random Structures & Algorithms. 2006;29(3):257–276. Available from: https://onlinelibrary.wiley.com/doi/abs/10.1002/rsa.20115.

[20] Torkestani JA, Meybodi MR. A learning automata-based heuristic algorithm for solving the minimum spanning tree problem in stochastic graphs. The Journal of Supercomputing. 2012;59:1035–1054.

[21] Raphael (https://cs.stackexchange.com/users/98/raphael). Do the minimum spanning trees of a weighted graph have the same number of edges with a given weight? Computer Science Stack Exchange (version: 2019-05-21). Available from: https://cs.stackexchange.com/q/2211.

[22] Campbell EM, Jia H, Shankar A, Hanson D, Luo W, Masciotra S, et al. Detailed Transmission Network Analysis of a Large Opiate-Driven Outbreak of HIV Infection in the United States. The Journal of Infectious Diseases. 2017;216(9):1053–1062. Available from: https://doi.org/10.1093/infdis/jix307.

[23] Spada E, Sagliocca L, Sourdis J, Garbuglia AR, Poggi V, De Fusco C, et al. Use of the Minimum Spanning Tree Model for Molecular Epidemiological Investigation of a Nosocomial Outbreak of Hepatitis C Virus Infection. Journal of Clinical Microbiology. 2004;42(9):4230–4236. Available from: https://jcm.asm.org/content/42/9/4230.

[24] Le T, Wright EJ, Smith DM, He W, Catano G, Okulicz JF, et al. Enhanced CD4+ T-Cell Recovery with Earlier HIV-1 Antiretroviral Therapy. New England Journal of Medicine. 2013;368(3):218–230. PMID: 23323898. Available from: https://doi.org/10.1056/NEJMoa1110187.

[25] Morris SR, Little SJ, Cunningham T, Garfein RS, Richman DD, Smith DM. Evaluation of an HIV Nucleic Acid Testing Program With Automated Internet and Voicemail Systems to Deliver Results. Annals of Internal Medicine. 2010;152(12):778–785. PMID: 20547906.

[26] Kosakovsky Pond SL, Weaver S, Leigh Brown AJ, Wertheim JO. HIV-TRACE (TRAnsmission Cluster Engine): a Tool for Large Scale Molecular Epidemiology of HIV-1 and Other Rapidly Evolving Pathogens. Molecular Biology and Evolution. 2018;35(7):1812–1819. Available from: https://doi.org/10.1093/molbev/msy016.

[27] Little SJ, Kosakovsky Pond SL, Anderson CM, Young JA, Wertheim JO, Mehta SR, et al. Using HIV Networks to Inform Real Time Prevention Interventions. PLOS ONE. 2014;9(6):1–8. Available from: https://doi.org/10.1371/journal.pone.0098443.

[28] Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution. 1993;10(3):512–526. Available from: https://doi.org/10.1093/oxfordjournals.molbev.a040023.

[29] Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. 2006;Complex Systems:1695. Available from: https://igraph.org.

[30] Stumpf MPH, Wiuf C, May RM. Subnets of scale-free networks are not scale-free: Sampling properties of networks. Proceedings of the National Academy of Sciences. 2005;102(12):4221–4224.