Network estimation via graphon with node features
Yi Su∗, Raymond K. W. Wong†, Thomas C. M. Lee‡

September 5, 2018

∗ PhD candidate, Department of Statistics, University of California at Davis, Davis, CA 95616 (email: [email protected]).
† Assistant Professor, Department of Statistics, Texas A&M University, College Station, TX 77843 (email: [email protected]).
‡ Professor, Department of Statistics, University of California at Davis, Davis, CA 95616 (email: [email protected]).
Abstract
Estimating the probabilities of linkages in a network has gained increasing interest in recent years. One popular model for network analysis is the exchangeable graph model (ExGM), characterized by a two-dimensional function known as a graphon. Estimating the underlying graphon thus becomes the key of such analyses. Several nonparametric estimation methods have been proposed, and some are provably consistent. However, if certain useful features of the nodes (e.g., age and school in a social network context) are available, none of these methods was designed to incorporate this source of information to help with the estimation. This paper develops a consistent graphon estimation method that integrates the information from both the adjacency matrix itself and the node features. We show that properly leveraging the features can improve the estimation. A cross-validation method is proposed to automatically select the tuning parameter of the method.
Keywords: consistency, exchangeable graph model, feature assisted neighborhood smoothing (FANS), generative model, nonparametric.
1 Introduction

A network (undirected simple graph) can be modeled as a partial observation of an infinite random graph. The exchangeable graph model (ExGM) is a popular nonparametric model for infinite graphs whose node indices are exchangeable (e.g., Hoff, 2008; Kallenberg, 2006; Lovász, 2012; Orbanz and Roy, 2015); i.e., the joint distribution of edges is invariant under permutation of node indices. For instance, in a social network where a node represents a person, the assignment of node to person does not carry any information, and swapping the node indices between any two people (i.e., relabeling) defines the same network. An ExGM is characterized by a symmetric measurable function $w$ known as a graphon (Aldous, 1981; Hoover, 1979), which therefore plays a central role in model-based inference and prediction of network data under an ExGM. Based on the Aldous–Hoover theorem, we assume the following generative model of the network via a graphon: a set of latent labels $\{u_i\}$, each associated with a node, are first drawn independently from Uniform(0, 1). Given $u_i$ and $u_j$, the probability that there is a connection between the $i$-th node and the $j$-th node is $w(u_i, u_j)$. According to these probabilities, edges are then generated independently of each other conditional on $\{u_i\}$.

In general, the graphon provides a unified and solid framework for modeling networks. For instance, the community structures widely used in the modeling of social networks correspond to a parametric, piecewise-constant model of the graphon. More importantly, the graphon opens up an opportunity for more flexible but challenging nonparametric modeling, which has sparked a recent surge of interest among researchers (e.g., Airoldi et al., 2013; Wolfe and Olhede, 2013; Chan and Airoldi, 2014; Gao et al., 2015; Zhang et al., 2017; Klopp et al., 2017), and which is also the focus of the present work. Since knowledge of the graphon facilitates our understanding of the underlying network, nonparametric graphon estimation helps discover unknown patterns in the corresponding network generation mechanism. Besides, statistical inference for networks can also be conducted via the graphon (e.g., Lloyd et al., 2012; Yang et al., 2014).

A typical assumption adopted in nonparametric graphon estimation is the smoothness of the underlying graphon $w$. If we were given the latent labels $\{u_i\}$, graphon estimation would simply be a nonparametric regression problem. However, the $\{u_i\}$ are not observed, which poses a unique challenge. Due to the smoothness assumption, various researchers have made use of the idea that "similar labels" produce "similar graphon slices", where a graphon slice at a label $u \in [0, 1]$ is the one-dimensional function $w(u, \cdot)$.
In other words, if the labels $u_i$ and $u_j$ of two nodes are close, $w(u_i, \cdot)$ and $w(u_j, \cdot)$ should be similar. As the graphon is unknown, several methods (e.g., Airoldi et al., 2013; Chan and Airoldi, 2014; Gao et al., 2015) instead use the rows and/or columns of the adjacency matrix as a proxy for the graphon slices to describe the distance between nodes, upon which smoothing procedures can be constructed. Despite the success of many existing methods that rely solely on the adjacency matrix, very often features (or attributes) of the nodes are available aside from the network (adjacency matrix) itself, and these could potentially provide important information for network estimation. Take the Facebook friendship network as an example. It is conceivable that users who share similar values of certain features (e.g., age and school) will have similar connection behaviors. These additional features can be a valuable resource for better estimating the underlying probabilities of linkages and the graphon.

The main contribution of this paper is the proposal of a nonparametric graphon estimation method that is capable of utilizing the information hidden in the node features for better network estimation. This is realized by relating node features to graphon slices through the latent labels. That is, close labels should correspond to both similar graphon slices and similar node features.

The rest of this paper is organized as follows. In Section 2, we review some basic definitions and related work on graphon estimation, and summarize our contribution. We introduce the proposed framework and estimation method in Section 3. Theoretical results are provided in Section 4, while numerical experiments on synthetic graphons are shown in Section 5. Finally, we apply our method to a real-world friendship network in Section 6; supplementary material is deferred to the appendix.

2 Background

This section presents the necessary background material. In the sequel, for any matrix $M$, we use $M_{ij}$, $M_{i\cdot}$ and $M_{\cdot j}$ to denote its $(i,j)$-th element, $i$-th row and $j$-th column, respectively.

2.1 Graphons and exchangeability

Let $A \in \{0,1\}^{n \times n}$ be the adjacency matrix of an undirected simple graph with $n$ nodes ($n$ can be infinite); i.e., $A_{ij} = 1$ if the $i$-th node and the $j$-th node are connected, and 0 otherwise. For an infinitely sized graph, we say it is exchangeable if the distribution over $A$ is invariant under any permutation of nodes. The Aldous–Hoover theorem (Aldous, 1981; Hoover, 1979) guarantees that every ExGM can be represented by a graphon.
Definition 2.1 (Graphon). A graphon is a symmetric measurable function $w : [0,1]^2 \to [0,1]$ such that

$$\Pr(A_{ij} = 1 \mid u_i, u_j) = w(u_i, u_j), \quad \text{where } u_i \overset{\text{iid}}{\sim} \mathrm{Uniform}(0,1) \text{ for } i \in \mathbb{N}.$$

A network of size $n$ can be modeled as a partial observation of an ExGM, and thereby generated by the following two-step sampling scheme:

$$u_i \overset{\text{iid}}{\sim} \mathrm{Uniform}(0,1), \quad i = 1, \dots, n; \qquad A_{ij} \mid u_i, u_j \overset{\text{ind}}{\sim} \mathrm{Bernoulli}(w(u_i, u_j)), \quad i < j. \qquad (1)$$
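To make the two-step scheme (1) concrete, here is a minimal Python sketch (our own illustration; the function and variable names are not from any published implementation) that samples a network from a user-supplied graphon:

```python
import numpy as np

def sample_network(w, n, rng=None):
    """Sample an n-node undirected simple graph from graphon w via scheme (1).

    w : callable taking two arrays in [0, 1] and returning edge probabilities.
    """
    rng = np.random.default_rng(rng)
    u = rng.uniform(0.0, 1.0, size=n)             # latent labels u_i ~ Uniform(0, 1)
    P = w(u[:, None], u[None, :])                 # P_ij = w(u_i, u_j)
    upper = np.triu(rng.random((n, n)) < P, k=1)  # Bernoulli draws for i < j only
    A = (upper | upper.T).astype(int)             # symmetrize; diagonal stays zero
    return A, P, u

# Example: a smooth monotonic graphon w(u, v) = (u + v) / 2.
A, P, u = sample_network(lambda u, v: (u + v) / 2, n=200, rng=0)
```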
Identifiability is a well-known issue for graphons: different graphons can give rise to the same ExGM. Specifically, for any measure preserving transformation $\varphi$, $w'(u,v) := w(\varphi(u), \varphi(v))$ and $w$ define the same ExGM; that is, the distributions of the two random arrays are the same. To guarantee a unique representation, one can impose the strict monotonicity of degree condition (Bickel and Chen, 2009; Yang et al., 2014), which assumes that there exists a measure preserving transformation $\varphi$ such that $w_{\mathrm{can}}(u,v) := w(\varphi(u), \varphi(v))$ and

$$g_{\mathrm{can}}(u) = \int_0^1 w_{\mathrm{can}}(u, v)\, dv \qquad (2)$$

is strictly increasing in $u$. Here $w_{\mathrm{can}}$ is called the canonical form of the graphon $w$, and is a unique representation of the underlying ExGM. However, this assumption is restrictive because it excludes commonly used models such as the stochastic block model. In our framework, we do not enforce strictly monotonic node degrees.

In principle, one cannot determine, based on the adjacency matrix alone, which graphon from an equivalence class generated the underlying network. There are two layers of estimation: the first is the estimation of the graphon $w$, while the second is the estimation of the latent labels $\{u_i\}$. Since it is unrealistic to estimate the labels without strong assumptions, the main purpose of estimating a graphon is sometimes to obtain the probabilities of linkages at the observed nodes, $(w(u_i, u_j))_{1 \le i,j \le n}$. This is also the goal of this paper.

2.2 Related work

In the graphon estimation literature, a commonly adopted strategy is to employ the graphon slices (which can be estimated by the rows and columns of the adjacency matrix) to define distances between nodes. With this, one can group similar nodes into blocks and estimate the graphon value within a block by averaging the number of edges in it.

Airoldi et al. (2013) proposed the Stochastic Blockmodel Approximation (SBA) algorithm, which approximates the graphon by a piecewise constant function. Their estimator is consistent in mean squared error, but a key assumption is that there are at least $2T$ ($T \in \mathbb{N}_+$) independent realizations generated from $w$, which is unlikely to hold in reality. They group the nodes into $K$ blocks, and the estimated graphon is a piecewise constant function over the $K \times K$ blocks.

Some other methods are based on the strong assumption of strict monotonicity of degree (Chan and Airoldi, 2014; Yang et al., 2014), under which a canonical graphon is well defined and hence treated as the estimand of interest. One representative of this category is the Sorting-and-Smoothing (SAS) algorithm of Chan and Airoldi (2014). It first sorts the nodes according to their empirical degrees, then computes a local histogram estimator $\hat H \in [0,1]^{k \times k}$ for some bandwidth $h = n/k$, and finally applies a smoothing technique to obtain the final estimate. The SAS estimator is consistent and attains the rate of convergence $\log n / n$, which matches the optimal rate in general graphon estimation without the assumption of strict monotonicity of degree (Gao et al., 2015).
Another popular method is Universal Singular Value Thresholding (USVT), proposed by Chatterjee (2015) for general matrix denoising problems with missing values. Since this method is not designed specifically for graphons, its rate of convergence is not competitive.

More recently, Zhang et al. (2017) proposed a novel Neighborhood Smoothing (NBS) method for estimating the underlying probability matrix $P_{ij}$, which is equivalent to estimating the graphon values $w(u_i, u_j)$. Different from the SBA and SAS algorithms, these authors proposed an adaptive neighborhood selection method that allows each node to have its own neighborhood. The NBS method performs very well for a wide range of graphons in both low-rank and high-rank situations, and the only assumption on the graphon is that it is piecewise Lipschitz. These authors also showed that the error rate of NBS is the smallest among all existing non-combinatorial methods.

2.3 Our contribution

To the best of our knowledge, none of the existing methods is designed to utilize information other than the adjacency matrix itself for nonparametric network/graphon estimation. With additional node features, the estimation could be much improved. In this paper, we propose a generative model of node features that allows borrowing information from the features in an adaptive manner to improve network/graphon estimation. If similar node features correspond to similar graphon slices, these features are valuable, especially when the graphon itself has weak or local signals. In contrast, it can happen that two nodes with identical attributes behave very differently. In such scenarios, it is unwise to contaminate the estimation with these unhelpful features. We avoid this contamination by adaptively selecting a tuning parameter that controls the weight given to the features' information.

3 Methodology
We begin with some notation. Recall that $A \in \{0,1\}^{n \times n}$ is an observed adjacency matrix generated from the graphon $w(u, v)$; that is,

$$A_{ij} \mid u_i, u_j \overset{\text{ind}}{\sim} \mathrm{Bernoulli}(w(u_i, u_j)),$$

where $u_i \overset{\text{iid}}{\sim} \mathrm{Uniform}(0,1)$ for $i = 1, \dots, n$. For each node $i$, we also observe a feature vector $X_i \in \mathbb{R}^p$, $i = 1, \dots, n$. We assume that these $X_i$'s are also generated from the (unknown) latent labels:

$$X_i = f(u_i) + e_i, \qquad (3)$$

where $f = (f_1, \dots, f_p)^T : \mathbb{R} \to \mathbb{R}^p$ is an unknown function, and $e_i$ is a random vector with independent entries of mean zero and variance $\sigma^2$. Also, $\{e_1, \dots, e_n, u_1, \dots, u_n\}$ are mutually independent.
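Continuing the earlier sampling sketch, the feature model (3) can be simulated as follows (again our own illustration; the particular $f$ and noise level are arbitrary choices for demonstration):

```python
def sample_features(u, f_list, sigma, rng=None):
    """Generate X_i = f(u_i) + e_i as in (3).

    u      : array of latent labels in [0, 1].
    f_list : list of p univariate functions (f_1, ..., f_p).
    sigma  : standard deviation of the entries of e_i.
    """
    rng = np.random.default_rng(rng)
    F = np.column_stack([fk(u) for fk in f_list])   # F[i, k] = f_k(u_i)
    return F + sigma * rng.standard_normal(F.shape)

# Example: one monotonic and one periodic (non-monotonic) feature.
X = sample_features(u, [lambda t: np.cos(np.pi * t),
                        lambda t: np.cos(4 * np.pi * t)], sigma=0.1, rng=1)
```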
We note that $X_{i1}, \dots, X_{ip}$, the elements of $X_i$, are in general dependent due to the shared $u_i$ entering through $f_1, \dots, f_p$; the assumption of constant variance $\sigma^2$ can be relaxed easily. In addition, we assume that $w$ and $f$ are piecewise Lipschitz functions, as defined formally in Section 4.

Although we aim to utilize the features for better graphon estimation, our feature model (3) is fairly general and does not assume a priori that the features are useful for graphon estimation. To see this, note that the essential information in $X_i$ for graphon estimation is captured by the hidden label $u_i$. When $f$ is monotonic, close feature vectors correspond to close latent labels, which generate similar slices in the adjacency matrix (under the smoothness assumption on $w$). In this case, feature similarity is a helpful source to borrow information from. On the other hand, when $f$ is non-monotonic, close features do not necessarily imply similar graphon slices. For example, if $f(u_1) = f(u_2)$ for very different $u_1$ and $u_2$, then whether feature similarity is useful depends on whether the graphon slice $w(u_1, \cdot)$ is close to $w(u_2, \cdot)$. Given that we do not know whether the latter is true, the use of features could worsen the graphon estimation. In the subsequent sections, we develop a method that allows adaptive incorporation of feature information via a tuning parameter, as well as a data-adaptive method for choosing this parameter.

In what follows, we use $P$ to denote the underlying (conditional) probability matrix with $P_{ij} = w(u_i, u_j)$, i.e., $P = E(A \mid \{u_i\}_{i=1}^n)$. Our goal is to estimate $P$.

3.1 General description of the method

This subsection provides a general description of the proposed method for graphon estimation. The method is called FANS, short for Feature Assisted Neighborhood Smoothing.

Since the latent labels $\{u_i\}$ are unavailable, the key to estimating a graphon is to define a measure of node dissimilarity. Here we define a (squared) dissimilarity function $d(i,j)$ between the $i$-th node and the $j$-th node ($i \neq j$) as the weighted sum of two terms:

$$d(i, j) = d_0(i, j) + \lambda s(i, j), \qquad (4)$$

where the relative weights are determined by a tuning parameter $\lambda \ge 0$.
In (4), $d_0(i,j)$ is a distance measure for graphon slices, while $s(i,j)$ is a distance measure for features; the exact forms of these two measures are given in Section 3.2. The parameter $\lambda$ controls how much information we want to borrow from the node features. When the features are not helpful, we can avoid using them by setting $\lambda = 0$. We discuss a data-driven choice of $\lambda$ later.

The first step of the proposed estimation method is to estimate $d(i,j)$. Once such an estimate $\hat d(i,j)$ is obtained, the next step is to obtain a neighborhood for each node. Similar to the NBS method (Zhang et al., 2017), we define the neighborhood of the $i$-th node as

$$N_i = \{ i' \neq i : \hat d(i, i') \le q_i(h) \}, \qquad (5)$$

where $q_i(h)$ is the $h$-th sample quantile of the set $\{\hat d(i, i') : i' \neq i\}$, and $h = C\sqrt{\log n / n}$ with a global constant $C > 0$.
From our experience, the performance of the proposed method is not sensitive to the choice of $C$ within a mild range between 0.5 and 1.5; in practice, we set $C = 1$. Unlike the SBA and SAS algorithms, this neighborhood differs from node to node. Finally, the estimated graphon evaluated at $(u_i, u_j)$ is given by

$$\hat w(u_i, u_j) = \hat P_{ij} = \frac{1}{2}\left( \frac{\sum_{i' \in N_i} A_{i'j}}{|N_i|} + \frac{\sum_{j' \in N_j} A_{ij'}}{|N_j|} \right). \qquad (6)$$

To sum up, the proposed FANS method consists of the following three major steps:

1. Obtain an estimate of $d(i,j)$ in (4), where $d_0(i,j)$ and $s(i,j)$ are estimated by (9) and (8) respectively, and $\lambda$ is chosen by Algorithm 2.
2. Calculate $N_i$ for all $i$ using (5).
3. Compute the estimate of $w(u_i, u_j)$ with (6).

Details for these three steps are given below; see also Algorithm 1.
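Steps 2 and 3 are straightforward to express in code. Below is a minimal sketch (our own illustration, assuming the dissimilarity matrix $\hat d$ has already been computed as described in Section 3.2 below):

```python
import numpy as np

def fans_smooth(A, d_hat, C=1.0):
    """Steps 2-3 of FANS: neighborhoods (5) and neighborhood averaging (6)."""
    n = A.shape[0]
    h = C * np.sqrt(np.log(n) / n)
    P_row = np.empty((n, n))
    for i in range(n):
        d_i = np.delete(d_hat[i], i)                  # {d_hat(i, i') : i' != i}
        q_i = np.quantile(d_i, h)                     # h-th sample quantile
        N_i = [j for j in range(n) if j != i and d_hat[i, j] <= q_i]
        P_row[i] = A[N_i].mean(axis=0)                # row average over neighbors
    return (P_row + P_row.T) / 2                      # symmetrization as in (6)
```

Since $A$ is symmetric, averaging `P_row` with its transpose reproduces the two-term average in (6).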
3.2 The distances d and s

Following the ideas of Airoldi et al. (2013) and Zhang et al. (2017), we define $d_0$ using the $L_2$ distance of graphon slices. To be more specific, for any $i \neq j$, define

$$d_0(i,j) = \int_0^1 |w(u_i, v) - w(u_j, v)|^2\, dv = \langle w(u_i,\cdot), w(u_i,\cdot)\rangle + \langle w(u_j,\cdot), w(u_j,\cdot)\rangle - 2\langle w(u_i,\cdot), w(u_j,\cdot)\rangle,$$

where $\langle g_1, g_2\rangle := \int_0^1 g_1(x)g_2(x)\, dx$. With slight notational abuse, we use $\langle\cdot,\cdot\rangle$ to denote both the $L_2$ inner product of two functions and the Euclidean inner product of two vectors.

We immediately notice that the cross term $\langle w(u_i,\cdot), w(u_j,\cdot)\rangle$ can be estimated by $\langle A_{i\cdot}, A_{j\cdot}\rangle / n$, because the entries of $A_{i\cdot}$ and $A_{j\cdot}$ are "almost" independent (except for $A_{ij}$ and $A_{ji}$). However, $\langle w(u_i,\cdot), w(u_i,\cdot)\rangle$ cannot be well estimated by $\langle A_{i\cdot}, A_{i\cdot}\rangle / n$ (consider estimating $p^2$ from a single Bernoulli($p$) observation as an analogy), and similarly for $\langle w(u_j,\cdot), w(u_j,\cdot)\rangle$. The SBA algorithm resolves this issue by requiring that at least two independent copies of the network be observed.

Following Zhang et al. (2017), we instead use an approximate upper bound of $d_0(i,j)$, motivated by the following heuristic argument. First,

$$d_0(i,j) = \langle w(u_i,\cdot) - w(u_j,\cdot), w(u_i,\cdot)\rangle - \langle w(u_i,\cdot) - w(u_j,\cdot), w(u_j,\cdot)\rangle. \qquad (7)$$

With a large sample, it is likely that there exist $\tilde i, \tilde j$ such that $|u_{\tilde i} - u_i| \le \varepsilon$ and $|u_{\tilde j} - u_j| \le \varepsilon$ for small $\varepsilon$. Suppose $w(u, u')$, as a function of $u'$, has Lipschitz constant $L$ for every $u \in [0,1]$. Then

$$|\langle w(u_i,\cdot) - w(u_j,\cdot), w(u_i,\cdot)\rangle| = |\langle w(u_i,\cdot) - w(u_j,\cdot), w(u_{\tilde i},\cdot)\rangle + \langle w(u_i,\cdot) - w(u_j,\cdot), w(u_i,\cdot) - w(u_{\tilde i},\cdot)\rangle| \le \max_{k \neq i,j}|\langle w(u_i,\cdot) - w(u_j,\cdot), w(u_k,\cdot)\rangle| + L\varepsilon,$$

since $\int_0^1 (w(u_i,u') - w(u_j,u'))^2\, du' \le 1$ and $\int_0^1 (w(u_i,u') - w(u_{\tilde i},u'))^2\, du' \le L^2\varepsilon^2$, so the second inner product is at most $L\varepsilon$ by the Cauchy–Schwarz inequality. The second term of (7) is bounded similarly. Therefore,

$$d_0(i,j) \le 2\max_{k \neq i,j}|\langle w(u_i,\cdot) - w(u_j,\cdot), w(u_k,\cdot)\rangle| + 2L\varepsilon.$$

Disregarding the multiplicative constant and the small term $\varepsilon$, this can be estimated by

$$\tilde d(i,j) := \max_{k \neq i,j} |\langle A_{i\cdot} - A_{j\cdot}, A_{k\cdot}\rangle| / n.$$

In the same vein, we define $s(i,j) = \|f(u_i) - f(u_j)\|_2^2$, and the estimator of its upper bound (up to a multiplicative constant) is

$$\hat s(i,j) := \max_{k \neq i,j} |\langle X_i - X_j, X_k\rangle| / p, \qquad (8)$$

where $X_i \in \mathbb{R}^p$ is the feature vector of the $i$-th node. The use of these upper bounds and their estimates will be justified both theoretically and empirically in subsequent sections.
3.3 An adjusted $\tilde d$

Because it is built from an upper bound and the $A_{ij}$'s are binary, $\tilde d(i,j)$ suffers from ties. For now suppose $\lambda = 0$, i.e., no node features are used. When $w$ is a piecewise constant function (the stochastic block model being an example), $D_i := \{\tilde d(i,j) : j \neq i\}$ may contain a large number of ties. In our simulations we found that, when $n = 500$, the number of unique values in $D_i$ can be as small as 65. These ties cause a problem when defining $N_i$, because the set of boundary points $B_i := \{i' \neq i : \tilde d(i, i') = q_i(h)\}$ can be large. Hence $N_i$ does not change continuously as $h$ changes, and it may include too many nodes on the boundary. Clearly, not all the nodes in $B_i$ are as useful as those in $D_i \setminus B_i$, but there is no mechanism to distinguish which nodes in $B_i$ are worth including; on the other hand, we do not want to exclude or include all of them wholesale. To solve this problem, we propose an adjusted $\tilde d$, obtained by applying a random perturbation to the original definition:

$$\hat d_0(i,j) := \max_{k \neq i,j}\left( |\langle A_{i\cdot} - A_{j\cdot}, A_{k\cdot}\rangle| + (t/n) \right)/n, \qquad (9)$$

with $t \sim \mathrm{Uniform}(0,1)$. Since $|\langle A_{i\cdot} - A_{j\cdot}, A_{k\cdot}\rangle|$ is always an integer, this randomization does not change the order of any other points that are not tied. As for $\lambda > 0$, the randomization is not necessary if the $X_i$'s are continuous random variables, since ties are then unlikely.
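A direct NumPy transcription of (9) and (8) might look as follows (illustrative code; it exploits the fact that all inner products $\langle A_{i\cdot}, A_{k\cdot}\rangle$ and $\langle X_i, X_k\rangle$ are entries of the Gram matrices $AA^T$ and $XX^T$):

```python
import numpy as np

def fans_dissimilarity(A, X, lam, rng=None):
    """Compute d_hat(i, j) = d0_hat(i, j) + lam * s_hat(i, j) per (8)-(9)."""
    rng = np.random.default_rng(rng)
    n, p = A.shape[0], X.shape[1]
    G, H = A @ A.T, X @ X.T                       # all pairwise inner products
    d_hat = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mask = np.ones(n, dtype=bool)
            mask[[i, j]] = False                  # exclude k = i, j from the max
            d0 = (np.abs(G[i, mask] - G[j, mask]).max()
                  + rng.uniform() / n) / n        # tie-breaking perturbation (9)
            s = np.abs(H[i, mask] - H[j, mask]).max() / p   # feature term (8)
            d_hat[i, j] = d_hat[j, i] = d0 + lam * s
    return d_hat
```

Combining this with `fans_smooth` above gives the full Algorithm 1: `P_hat = fans_smooth(A, fans_dissimilarity(A, X, lam))`.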
Algorithm 1 — The FANS method
Input: $A \in \{0,1\}^{n \times n}$, $X = (X_1, \dots, X_n)^T \in \mathbb{R}^{n \times p}$, $\lambda$.
Output: $\hat P$.
Step 1: Calculate $\hat d_0(i,j) = \max_{k \neq i,j}(|\langle A_{i\cdot} - A_{j\cdot}, A_{k\cdot}\rangle| + t/n)/n$ with $t \sim \mathrm{Uniform}(0,1)$.
Step 2: Calculate $\hat s(i,j) = \max_{k \neq i,j}|\langle X_i - X_j, X_k\rangle|/p$.
Step 3: Compute $\hat d(i,j) = \hat d_0(i,j) + \lambda \hat s(i,j)$.
Step 4: Define $N_i = \{i' \neq i : \hat d(i, i') \le q_i(h)\}$, where $h = C\sqrt{\log n / n}$.
Step 5: Output $\hat P_{ij} = \frac{1}{2}\left( \frac{\sum_{i' \in N_i} A_{i'j}}{|N_i|} + \frac{\sum_{j' \in N_j} A_{ij'}}{|N_j|} \right)$.

3.4 Cross-validation for selecting λ

Selecting a tuning parameter in network estimation is generally challenging: popular data-splitting strategies such as cross-validation have no trivial extension to network data. Recently, Chen and Lei (2018) proposed a node-pair splitting technique for cross-validation to determine the number of communities $K$ in a stochastic block model. Unpublished work by Li et al. (2016) proposed a two-stage network cross-validation by edge splitting, also for stochastic block models. A key assumption of both methods is that $P$ is low-rank, which does not fit the general graphon framework.

One advantage of having node features is that, by comparing features alone, it is possible to predict edge connections. If a new node $i$ joins an existing network, we can find its nearest neighbor $i^*$ based on its features, and then use the estimated graphon slice of $i^*$ as a prediction of $i$'s connections. Assuming the features are useful (i.e., the optimal $\lambda \neq 0$), a good model should predict with the slice of $i^*$ at a small error.

Our cross-validation method is outlined in Algorithm 2. It would be natural to use the $\ell_2$ norm or the negative binomial log-likelihood as the loss function. However, we found that the $\ell_2$ norm fails easily, since it is much less robust than the $\ell_1$ error. The reason we do not use the log-likelihood is that we may occasionally have $\hat P_{ij} = 0$ or $1$. Simulation results suggest that our method works well and will set $\lambda \approx 0$ when the features are unhelpful.

Algorithm 2 — Cross-validation for choosing λ

Input: $A \in \{0,1\}^{n \times n}$, $X = (X_1, \dots, X_n)^T \in \mathbb{R}^{n \times p}$, $\Lambda = \{\lambda_1, \dots, \lambda_Q\}$.
Output: Optimal $\lambda_{\mathrm{opt}}$.
for $m = 1, \dots, M$ do
  Randomly sample $\lfloor 0.1\, n \rfloor$ nodes $V \subset \{1, \dots, n\}$ as the validation set; let $T = \{1, \dots, n\} \setminus V$ be the training set.
  Split $A$ and $X$ into the blocks $A^{(T)}, A^{(T \times V)}, A^{(V \times T)}, A^{(V)}$ and $X^{(T)}, X^{(V)}$.
  for $q = 1, \dots, Q$ do
    Fit $\hat P_{ij}$ for $(i,j) \in T \times T$ using $A^{(T)}$, $X^{(T)}$ and $\lambda_q$.
    for $i \in V$ do
      Find $i^* = \arg\min_{i' \in T} \|X_{i'} - X_i\|_2$, and let $\hat P_{ij} = \hat P_{i^*j}$ for $j \in T$.
    end for
    Compute the loss for model $q$: $L_q^{(m)} = \frac{1}{|V||T|}\sum_{(i,j) \in V \times T} |A_{ij} - \hat P_{ij}|$.
  end for
end for
Let $L_q = \frac{1}{M}\sum_{m=1}^M L_q^{(m)}$, and return $\lambda_{\mathrm{opt}} = \arg\min_q L_q$.
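A compact sketch of Algorithm 2, reusing the illustrative functions above (the number of splits $M$, the validation fraction and the candidate grid are user choices):

```python
def fans_cv(A, X, lambdas, M=10, val_frac=0.10, rng=None):
    """Algorithm 2: choose lambda by node splitting and l1 prediction loss."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    losses = np.zeros((M, len(lambdas)))
    for m in range(M):
        V = rng.choice(n, size=int(val_frac * n), replace=False)
        T = np.setdiff1d(np.arange(n), V)
        A_T, X_T = A[np.ix_(T, T)], X[T]
        for q, lam in enumerate(lambdas):
            P_T = fans_smooth(A_T, fans_dissimilarity(A_T, X_T, lam, rng))
            for i in V:
                i_star = np.argmin(((X_T - X[i]) ** 2).sum(axis=1))  # nearest neighbor in T
                losses[m, q] += np.abs(A[i, T] - P_T[i_star]).mean() # l1 loss over j in T
        losses[m] /= len(V)
    return lambdas[int(np.argmin(losses.mean(axis=0)))]
```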
3.5 Feature screening

When the number of features is large, there are two potentially undesirable consequences: the computational cost of the proposed estimation method becomes large, and some of the features may be useless and thus worsen the estimation quality. Therefore, we propose a feature screening procedure that removes some useless features before the proposed graphon estimation method is applied.

Let $D = \{\tilde d(i,j)\}$ be the distance matrix based on the adjacency matrix $A$, and $S = \{\hat s(i,j)\}$ be the dissimilarity matrix based on a single feature. We use the correlation between $\{D_{ij}\}_{i \neq j}$ and $\{S_{ij}\}_{i \neq j}$ to describe the coherence between the feature and the graphon slices. If a feature is irrelevant, the correlation between $D$ and $S$ will be small. Since $\{D_{ij}\}_{i \neq j}$ and $\{S_{ij}\}_{i \neq j}$ are not linearly related, we use Kendall's $\tau$ correlation (or Spearman correlation). We discuss the practical choice of the threshold in Section 5.2.

4 Theoretical results

In this section, we establish the asymptotic convergence of the proposed estimator. For simplicity, we study the estimator without node feature screening (Section 3.5). To accommodate important models such as the stochastic block model, we do not assume the graphon $w$ to be completely smooth. Instead, we focus on the following family of functions.
Definition 3.1 (Bivariate piecewise Lipschitz graphon family). For any $\delta, L > 0$, let $\mathcal{W}_{\delta;L}$ denote the family of piecewise Lipschitz graphon functions $w : [0,1]^2 \to [0,1]$ such that (i) there exist an integer $K \ge 1$ and points $0 = x_0 < x_1 < \cdots < x_K = 1$ with $\min_{1 \le i \le K}(x_i - x_{i-1}) \ge \delta$; and (ii) for any $(u_1, v_1)$ and $(u_2, v_2) \in [x_{i-1}, x_i] \times [x_{j-1}, x_j]$, $|w(u_1, v_1) - w(u_2, v_2)| \le L(|u_1 - u_2| + |v_1 - v_2|)$.
Similarly, our theory allows a piecewise Lipschitz form for the feature function $f$, in the following sense.

Definition 3.2 (Piecewise Lipschitz feature family). For any $\delta, L > 0$, let $\mathcal{F}_{\delta;L}$ denote the family of piecewise Lipschitz functions $f : [0,1] \to \mathbb{R}^p$ such that (i) there exist an integer $D \ge 1$ and points $0 = x_0 < x_1 < \cdots < x_D = 1$ with $\min_{1 \le i \le D}(x_i - x_{i-1}) \ge \delta$; and (ii) for each $k = 1, \dots, p$ and any $u_1, u_2 \in [x_{i-1}, x_i]$, $|f_k(u_1) - f_k(u_2)| \le L|u_1 - u_2|$.
Definition 3.3 (Sub-Gaussian distribution). We say that $X$ is sub-Gaussian($\sigma$) if $E(X) = 0$ and $E[e^{sX}] \le e^{\sigma^2 s^2 / 2}$ for all $s \in \mathbb{R}$.

We have the following theorem on the rate of convergence of the estimated probability matrix $\hat P$.

Theorem 3.1 (Consistency of $\hat P$). Assume that (i) $w \in \mathcal{W}_{\delta; L_w}$ and $f \in \mathcal{F}_{\delta; L_f}$ with global constants $L_w$ and $L_f$; (ii) the length of the smallest common Lipschitz piece, $\delta_{f \cap w} = \delta_{f \cap w}(n) := \min_{i,j}\{|I^f_i \cap I^w_j| : I^f_i \cap I^w_j \neq \emptyset\}$, satisfies $\delta_{f \cap w} \big/ \sqrt{\log n / n} \to \infty$ as $n \to \infty$; (iii) $E(e_{ik}) = 0$ and the $e_{ik}$ are iid sub-Gaussian($\sigma$) over all $i$ and $k$; (iv) $\|f(u)\|_2^2 / p \le M$ for any $u \in [0,1]$; and (v) $h = C\sqrt{\log n / n}$ for a global constant $C$. Then, for all $w \in \mathcal{W}_{\delta; L_w}$ and $f \in \mathcal{F}_{\delta; L_f}$,

$$\frac{1}{n^2}\|\hat P - P\|_F^2 = O_P\left(\sqrt{\frac{\log n}{n}}\right) + \lambda\, O_P(M(\sigma, p, n)),$$

where $M(\sigma, p, n)$ is given as follows:

(a) if $p > n$, $M(\sigma, p, n) = \max\left\{\sqrt{\frac{\log n}{n}},\; \sigma\sqrt{\frac{\log n}{p}},\; \sigma^2\sqrt{\frac{\log n}{p}}\right\}$;

(b) if $p \le n$, $M(\sigma, p, n) = \max\left\{\sqrt{\frac{\log n}{n}},\; \sigma\sqrt{\frac{\log n}{p}},\; \sigma^2\frac{\log n}{p}\right\}$.

In our theorem, the best convergence rate of $\|\hat P - P\|_F^2/n^2$ is $\sqrt{\log n/n}$, which is sub-optimal compared with the minimax rate $\log n/n$. However, as noted in recent work (Zhang et al., 2017), $\sqrt{\log n/n}$ is the best rate obtained among existing non-combinatorial methods for general graphon estimation that avoid strong assumptions such as the strict monotonicity of degree condition. In our theorem, this rate can be achieved with both zero and nonzero $\lambda$ (of appropriate order). When $\lambda = 0$, the node features play no role in the proposed procedure; thus, in terms of the rate of convergence, we do not obtain any gain from the additional usage of features. We hypothesize that this is largely due to the flexible modeling between features and hidden labels. A similar conclusion was reached in recent work on community detection with node features (Zhang et al., 2016), which suggests that the real benefit of the features is revealed in finite-sample performance. As indicated by our empirical studies in Sections 5 and 6, the usage of features (i.e., $\lambda > 0$) does improve the quality of estimation. It is then of theoretical interest to understand how large a $\lambda$ the proposed method can accommodate without compromising the overall rate of convergence. Our theorem sheds light on this question: the interplay among $\lambda$, $p$, $\sigma$ and $n$, and their effect on the rate, can be seen by enforcing $\lambda M(\sigma, p, n) = O_P(\sqrt{\log n/n})$. Here we give two examples under the setting of bounded $\sigma$. When $p = \Omega(\log n)$ (the high-dimensional setting), $\lambda$ can be as large as $O_P(\sqrt{p/n})$ for $p \le n$, and $O_P(1)$ for $p > n$; when $p = O(\log n)$ (the low-dimensional setting), we need $\lambda = O_P(p/\sqrt{n\log n})$.
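To make the low-dimensional claim explicit (a one-line check under the stated assumptions, with constants absorbed into $O_P$): in case (b) with $p = O(\log n)$ and bounded $\sigma$, the dominant term of $M(\sigma, p, n)$ is $\sigma^2(\log n)/p$, so

$$\lambda\,\sigma^2\,\frac{\log n}{p} \;\lesssim\; \sqrt{\frac{\log n}{n}} \quad\Longleftrightarrow\quad \lambda \;\lesssim\; \frac{p}{\sigma^2\sqrt{n\log n}},$$

which is the stated bound $\lambda = O_P(p/\sqrt{n\log n})$ for bounded $\sigma$.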
5 Numerical experiments

5.1 The effect of node features

Setup: We generate a network $A \in \{0,1\}^{n \times n}$ from a graphon $w(u,v)$ by the two-step procedure described in (1). For each node, we generate its features $X_i \in \mathbb{R}^p$ from $f$ via (3). If close $X_i$'s imply close graphon slices, our method can leverage the information in the features and effectively improve the estimation; one assumption that guarantees this is that $f$ is smooth and monotonic. When $f$ and the graphon have totally different structures, however, including $s$ in the dissimilarity measure (i.e., choosing a large $\lambda$) will worsen the estimation.

The noise level $\sigma$ also influences the effect of the features. In order to compare different noise levels on an equal footing, we first standardize each feature $f_j$ by its standard deviation before adding noise. To be more specific,

$$X_{ij} = f_j(u_i)/\mathrm{sd}(f_j(U)) + e_{ij}, \quad i = 1, \dots, n, \; j = 1, \dots, p,$$

where $U \sim \mathrm{Uniform}(0,1)$ and $e_{ij} \overset{\text{iid}}{\sim} N(0, \sigma^2)$.

Finally, we define the MSE and MAE to measure the performance of the estimation:

$$\mathrm{MSE} = \frac{1}{n^2}\sum_{i,j}(\hat P_{ij} - P_{ij})^2 \quad\text{and}\quad \mathrm{MAE} = \frac{1}{n^2}\sum_{i,j}|\hat P_{ij} - P_{ij}|.$$

An illustrative example: We first use an illustrative example to show the effects of different node features. Consider the case of a single feature $f \in \mathbb{R}$ and the simple monotonic graphon $w(u,v) = (u+v)/2$ of Figure 1. A monotonic feature will be useful; for instance, $f_1 = f_1(u) = \cos(\pi u)$ recovers the ordering of the true labels well. However, if $f_2 = f_2(u) = \cos(2\pi u)$ or even $f_3 = f_3(u) = \cos(4\pi u)$ (which are non-monotonic and periodic), then close $X_i$ and $X_j$ may not describe the similarity of the corresponding graphon slices well, so $f_2$ and $f_3$ may not be as helpful. Nevertheless, it turns out that using $f_2$ or $f_3$ can still help locally and improve the estimation when $\lambda$ is small, since they are smooth. More details are shown in Figure 2.

[Figure 1: The graphon $w(u,v) = (u+v)/2$.]

[Figure 2: The effect of the different features $f_i(u) = \cos(2^{i-1}\pi u)$, $i = 1, 2, 3$. Legends correspond to $f_1$, $f_2$ and $f_3$ from top to bottom.]

General graphons: We now study the effect of node features in more general cases. We consider networks generated from the following four graphons used in the literature; see Figure 3 and Table 1. Graphon $g_1$ is a stochastic block model (SBM) with $\lfloor \log n \rfloor$ blocks. Graphon $g_2$ is periodic and low-rank. Graphons $g_3$ and $g_4$ are more general and both full-rank. Here the rank of a graphon is evaluated numerically on $P$.

Table 1: The four graphons used in the simulations.

  $g_1$: $w(u,v) = k/(K+1)$ if $u, v \in ((k-1)/K,\, k/K)$ for some $k$, and a small constant over $K+1$ otherwise, with $K = \lfloor \log n \rfloor$. Rank: $\lfloor \log n \rfloor$; local structure: no.
  $g_2$: $w(u,v) = \sin(5\pi(u+v-1)+1)/2 + 0.5$. Rank: low; local structure: no.
  $g_3$: $w(u,v) = \left[1 + \exp\{15(0.8|u-v|)^{4/5} - 0.1\}\right]^{-1}$. Rank: full; local structure: no.
  $g_4$: $w(u,v) = (u^2+v^2)\cos(1/(u^2+v^2))/3 + 0.15$. Rank: full; local structure: yes.

As for the node features, we select two non-monotonic functions and two monotonic ones:

$$f_1(u) = \cos(2\pi(1-u)^2), \quad f_2(u) = 10u^3 - u + 5, \quad f_3(u) = \cos(\pi u), \quad f_4(u) = \Phi^{-1}(u),$$

and $f = (f_1, f_2, f_3, f_4)^T$, where $\Phi$ is the CDF of the standard normal distribution.

Figure 4 shows the curves of MSE against $\lambda$ under different feature noise levels $\sigma$ (starting from the noiseless case $\sigma = 0$) for $g_1$ to $g_4$; each curve is an average over 20 repeated experiments.

Since $g_1$ is block-structured, using an appropriate amount of information from $f$ should improve the results, as smooth features make the neighborhoods more concentrated except near the block boundaries. When $\lambda$ is large, however, nodes on the boundaries are pulled into wrong neighborhoods, and the MSE can deteriorate rapidly. For $g_2$, although the graphon is smooth, close $X_i$'s do not correspond well to similar graphon slices because of the periodic structure of $g_2$. In this case the adjacency matrix $A$ itself carries such strong information that adding features does not help much; still, the MSE is barely affected when $\lambda$ is close to 0. For $g_3$ and $g_4$, the effect of node features is significant, since close features correspond to close graphon slices: there is a 20–30% improvement in MSE ($p$-value numerically 0). In particular, smooth node features help capture the local structure in $g_4$; see Appendix A for further details.

As expected, the performance generally worsens as the noise level of the features increases. In practice, neither the pattern of the features nor the noise level is known; this is why the cross-validation method is proposed for selecting an appropriate $\lambda$.
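Putting the illustrative pieces together, one replication of such an experiment can be sketched as follows (the graphon, feature list and $\lambda$ grid shown are arbitrary choices for demonstration, reusing the functions defined earlier):

```python
# One simulation replication: generate data, select lambda by CV, evaluate.
w = lambda u, v: (u + v) / 2                      # the illustrative graphon
A, P, u = sample_network(w, n=200, rng=42)
feats = [lambda t: np.cos(np.pi * t)]             # a monotonic feature
X = sample_features(u, feats, sigma=0.1, rng=43)
lam = fans_cv(A, X, lambdas=np.linspace(0.0, 2.0, 9), rng=44)
P_hat = fans_smooth(A, fans_dissimilarity(A, X, lam, rng=45))
mse = ((P_hat - P) ** 2).mean()                   # MSE as defined above
mae = np.abs(P_hat - P).mean()                    # MAE as defined above
```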
5.2 Feature screening threshold

Recall that the feature screening procedure developed in Section 3.5 requires the specification of a threshold. Our numerical experiments show that 0.03 is a reasonable choice. We tested it under Graphons 1 to 4 with the feature being pure Gaussian noise, performing 1000 independent trials to calculate the proportion of successful screen-outs; the results are summarized in Table 2. The probability of a false positive (i.e., keeping a useless feature) is well controlled (approximately under 0.05) when the threshold is set to 0.03. In practice, the proposed feature screening mechanism should be combined with field knowledge to perform feature selection.

[Table 2: Feature screening with a Gaussian noise feature: for each of $g_1$ to $g_4$, the proportion of the 1000 trials in which Kendall's $\tau$ fell below the threshold.]

[Figure 3: Graphon visualizations when $n = 2000$. From top-left to bottom-right (by row): $g_1$ to $g_4$.]

5.3 Comparison with existing methods

In this subsection we compare the performance of the proposed FANS method with other methods from the literature:

- SBA: Stochastic Blockmodel Approximation of Airoldi et al. (2013);
- SAS: Sorting-and-Smoothing of Chan and Airoldi (2014);
- USVT: Universal Singular Value Thresholding of Chatterjee (2015); and
- NBS: Neighborhood Smoothing of Zhang et al. (2017).

[Figure 4: Effect of $\lambda$, with 20 independent experiments for each curve. From top-left to bottom-right (by row): $g_1$ to $g_4$. The starting point of each curve is $\lambda = 0$, which is equivalent to the case without using features. Legends (from top to bottom) correspond to increasing feature noise levels $\sigma$, starting from $\sigma = 0$.]

For each $n$, we randomly generated 100 independent realizations, and for each realization we applied the above five methods to obtain the corresponding estimated graphons. Since the SBA algorithm requires at least two independent graphs, we followed the way its authors conducted their simulations and generated two $n \times n$ graphs to make the comparison fair. For the SAS algorithm, we set the bandwidth $h = \log n$ as suggested in their paper. For NBS and FANS, we set $C = 1$. For FANS specifically, feature screening (Section 3.5) was performed and $\lambda$ was automatically chosen by cross-validation (Section 3.4) before every fit.

We calculated the MSE and MAE of each estimated graphon. The results for $n = 200$ and $500$ are summarized in Table 3, which reports the averages and standard errors of the calculated MSEs and MAEs, together with the $p$-value when the FANS method is compared with NBS using a paired one-sided $t$-test. From this table one can see that, when node features were available, FANS gave the best results (while noting that NBS sometimes gave similarly good results). This is not surprising, as FANS is the only method designed to incorporate node feature information into graphon estimation. When no node features are present, FANS behaves similarly to NBS, which compared extremely favorably with the remaining methods.

6 Application to a friendship network

In this section, we apply the proposed FANS method to a real-life dataset and illustrate its usefulness via both visualization and a leave-one-out link prediction problem.
This dataset concerns a friendship network collected by the National Longitudinal Study of Adolescent Health (the AddHealth study), and can be downloaded from http://moreno.ss.uci.edu/data.html. In this study, students were asked to list the friends they recently chatted with. The original data form directed graphs, but we consider two students as friends if either one named the other. The whole dataset consists of 81 sub-datasets, each containing either one or two schools. We analyzed one of them ("comm10"), which contains 587 students from a single school. The three covariates recorded are gender, race and grade. We treated grade as an ordinal variable. As for gender and race, we converted each to a 0/1 dummy vector whose $j$-th coordinate is 1 if and only if the student belongs to the $j$-th category.

[Table 3: Comparison with existing methods (SBA, SAS, USVT, NBS, FANS) for $n = 200$ and $n = 500$. Reported are the means and the standard errors (in parentheses) of MSEs and MAEs, based on 100 independent trials. Numbers in brackets are the $p$-values of paired $t$-tests comparing the mean values between NBS and FANS. The lowest values are listed in bold.]

6.1 Visualization

The observed adjacency matrix is displayed in the bottom-right plot of Figure 5. The block structure of the six communities that correspond to the six grades, from 7 (upper-right) to 12 (bottom-left), is apparent. Here we also include a recently proposed community detection algorithm that makes use of node features, JCDC (Zhang et al., 2016), for which we chose the number of communities $K = 6$. We applied the five methods mentioned in Section 5.3 to fit the underlying graphon, and JCDC to estimate the communities. The comparison is visualized in Figure 5, where the nodes are sorted by grade. For the SAS algorithm, in addition to nodes sorted by grade, we also present the estimated graphon with nodes sorted by the empirical node degrees (the direct output of that method). Neither of the two SAS plots reveals any pattern of interest. The USVT method fails here, as it discards all the singular values. We can observe the block structures recovered by the NBS, JCDC and FANS methods, but the latter two provide a much clearer view. However, JCDC seems to miss the two small communities near the lower corner. In contrast, the proposed FANS with a positive $\lambda$ chosen by cross-validation recovers the community structure most clearly.
Even with $\lambda = 0$, FANS still outperformed NBS, since FANS resolves the tie issue discussed in Section 3.3.

We further sorted the nodes by gender; the estimated graphon by FANS is shown in Figure 6. The bottom-left block is the sub-network among females, and the upper-right corner forms the community among males; the other two parts are the cross-community connections. Within each block there are six sub-blocks that correspond to the six grades, from 7 (upper-right) to 12 (bottom-left). One can see that the behaviors of male and female students are similar, and that students are more likely to know each other in higher grades. This figure also illustrates that different graphons, defined up to a measure preserving transformation, correspond to the same network.

6.2 Leave-one-out link prediction

Since the true graphon is unknown for this problem, it is hard to quantitatively evaluate the quality of an estimated graphon. Therefore, we used leave-one-out link prediction to compare FANS with NBS. In general, graphon estimation methods are not applicable to link prediction because they require a fully observed network. However, FANS can be slightly modified so that the estimate $\hat P_{ij}$ is independent of the observed value $A_{ij}$. This can be done by, in Step 1 of Algorithm 1, modifying $\langle\cdot,\cdot\rangle$ to $\langle A_{i\cdot}, A_{j\cdot}\rangle_{\mathrm{mod}} := \sum_{k \neq i,j} A_{ik}A_{jk}$. Then $\hat P_{ij}$ is predicted completely independently of $A_{ij}$.

This is an element-wise procedure that predicts one value at a time. Thus, we masked one pair of entries ($A_{ij} = A_{ji}$) at a time and conducted leave-one-out link prediction. We calculated the area under the receiver operating characteristic (ROC) curve to evaluate the predictions. From Figure 7, we can see that FANS with the cross-validated $\lambda > 0$ outperforms both NBS and FANS with $\lambda = 0$.

[Figure 7: ROC curves for NBS, FANS with $\lambda = 0$, and FANS with cross-validated $\lambda > 0$.]
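The following is a minimal sketch of this procedure (our own illustrative code, reusing the functions from Section 3; masking the held-out pair to zero is one simple way to make the prediction independent of the observed $A_{ij}$):

```python
def leave_one_out_predict(A, X, lam, i, j, rng=None):
    """Predict P_ij while holding out the observed pair A_ij = A_ji."""
    A_masked = A.copy()
    A_masked[i, j] = A_masked[j, i] = 0      # hide the held-out pair
    d_hat = fans_dissimilarity(A_masked, X, lam, rng)
    return fans_smooth(A_masked, d_hat)[i, j]
```

Scanning over pairs $(i, j)$ and comparing the predictions with the held-out $A_{ij}$ values yields the ROC curves of Figure 7 (e.g., via `sklearn.metrics.roc_auc_score`).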
7 Conclusion

This paper developed a graphon estimation method that is capable of utilizing the information from both the observed adjacency matrix and the node features. Under some mild regularity conditions, the consistency of the proposed method is established. The rate of convergence is the same as without using node features, but in practice the proposed method improves the estimation results in most cases. Lastly, on a real-world dataset the proposed method shows its benefits, as it reflects more meaningful structures of the network and yields a higher link prediction accuracy.

A Local Structure in Graphon 4

This appendix provides an example demonstrating the potential usefulness of incorporating node features into graphon estimation. The bottom-left corner of $g_4$ presents a rich local structure that is difficult to capture from the adjacency matrix alone. As shown in Figure 8, the estimate on the left fails to detect any local structure, since the signal carried by the adjacency matrix is relatively weak compared to the Bernoulli noise. On the other hand, the fitted graphon on the right of Figure 8 recovers this local structure.

[Figure 8: Estimates for $g_4$. Left: estimated graphon without using node features, by the NBS method; right: estimated graphon with smooth node features, obtained by the FANS method.]

B Proof of Theorem 3.1

This appendix presents the technical arguments leading to the theoretical results of the paper. To prove Theorem 3.1, write

$$\hat P = (\tilde P + \tilde P^T)/2, \qquad \tilde P_{ij} = \frac{\sum_{i' \in N_i} A_{i'j}}{|N_i|};$$

then it suffices to prove the consistency of $\tilde P$. In other words, we need to prove that

$$\frac{1}{n^2}\|\tilde P - P\|_F^2 = O_P\left(\sqrt{\frac{\log n}{n}}\right) + \lambda\, O_P(M(\sigma, p, n)). \qquad (10)$$

By the triangle inequality for the Frobenius norm, (10) implies Theorem 3.1. Clearly,

$$\frac{1}{n^2}\|\tilde P - P\|_F^2 \le \frac{1}{n}\max_i \|\tilde P_{i\cdot} - P_{i\cdot}\|_2^2,$$

so we only need to obtain a bound for the right-hand side. Consider the decomposition

$$\frac{1}{n}\|\tilde P_{i\cdot} - P_{i\cdot}\|_2^2 \le \frac{2}{n}\sum_{j=1}^n\left[\frac{\sum_{i'\in N_i}(A_{i'j} - P_{i'j})}{|N_i|}\right]^2 + \frac{2}{n}\sum_{j=1}^n\left[\frac{\sum_{i'\in N_i}(P_{i'j} - P_{ij})}{|N_i|}\right]^2 =: \frac{2}{n}\sum_{j=1}^n J_1(i,j) + \frac{2}{n}\sum_{j=1}^n J_2(i,j).$$

Consider $J_1$ first. When $n > 2$,

$$\frac{1}{n}\sum_{j=1}^n J_1(i,j) = \frac{1}{n|N_i|^2}\sum_{j=1}^n\Bigg[\sum_{i'\in N_i}(A_{i'j}-P_{i'j})^2 + \sum_{\substack{i',i''\in N_i \\ i''\neq i'}}(A_{i'j}-P_{i'j})(A_{i''j}-P_{i''j})\Bigg] \le \frac{1}{|N_i|} + \frac{1}{|N_i|^2}\sum_{\substack{i',i''\in N_i \\ i'\neq i''}}\left|\frac{1}{n}\sum_{j=1}^n(A_{i'j}-P_{i'j})(A_{i''j}-P_{i''j})\right| \le \frac{1}{|N_i|} + \frac{1}{|N_i|^2}\sum_{\substack{i',i''\in N_i \\ i'\neq i''}}\left[\frac{1}{n-2}\left|\sum_{j\neq i',i''}(A_{i'j}-P_{i'j})(A_{i''j}-P_{i''j})\right| + \frac{2}{n}\right].$$

For fixed $i' \neq i''$, Bernstein's inequality gives, for $\varepsilon \in (0,1]$ and $n \ge 6$,

$$\Pr\left\{\frac{1}{n-2}\left|\sum_{j\neq i',i''}(A_{i'j}-P_{i'j})(A_{i''j}-P_{i''j})\right| \ge \varepsilon\right\} \le 2\exp\left\{-\frac{(n-2)\varepsilon^2}{2+2\varepsilon/3}\right\} \le 2e^{-n\varepsilon^2/4}.$$

Taking a union bound over all $i' \neq i''$,

$$\Pr\left\{\max_{i;\, i'\neq i''\in N_i}\frac{1}{n-2}\left|\sum_{j\neq i',i''}(A_{i'j}-P_{i'j})(A_{i''j}-P_{i''j})\right| \ge \varepsilon\right\} \le 2n^2 e^{-n\varepsilon^2/4}.$$

Let $\varepsilon = \sqrt{(C_1+8)\log n/n}$ for some $C_1 > 0$ such that $\varepsilon \le 1$; then

$$\Pr\left\{\max_{i;\, i',i''\in N_i}\frac{1}{n-2}\left|\sum_{j\neq i',i''}(A_{i'j}-P_{i'j})(A_{i''j}-P_{i''j})\right| \ge \sqrt{\frac{(C_1+8)\log n}{n}}\right\} \le 2n^{-C_1/4}.$$

Note that, by the definition of $N_i$, we have $|N_i| \ge C(n-1)\sqrt{\log n/n} \ge C'\sqrt{n\log n}$ for some constant $C' > 0$.
Thus, when $n$ is large, with probability at least $1 - 2n^{-C_1/4}$, we have

$$\max_i \frac{1}{n}\sum_{j=1}^n J_1(i,j) \le \frac{1}{C'\sqrt{n\log n}} + \sqrt{\frac{(C_1+8)\log n}{n}} + \frac{2}{n} \le \tilde C_1\sqrt{\frac{\log n}{n}}.$$

We see that the bound for $J_1$ is guaranteed by the size of $N_i$ together with the nature of Bernoulli random variables. As for $J_2$, by the Cauchy–Schwarz inequality,

$$\frac{1}{n}\sum_{j=1}^n J_2(i,j) = \frac{1}{n}\sum_{j=1}^n\left\{\frac{\sum_{i'\in N_i}(P_{i'j}-P_{ij})}{|N_i|}\right\}^2 \le \frac{1}{|N_i|}\sum_{i'\in N_i}\frac{1}{n}\|P_{i'\cdot} - P_{i\cdot}\|_2^2.$$

So it suffices to bound $\max_{i;\, i'\in N_i}\|P_{i'\cdot} - P_{i\cdot}\|_2^2/n$. To do so, we need the following lemmas.

Lemma 1. Let $I^f_1, \dots, I^f_D$ and $I^w_1, \dots, I^w_K$ be the Lipschitz pieces of $f$ and $w$, respectively. Define a neighborhood (different from $N_i$) of the label $u_i$ as

$$S_i(\Delta_n) = [u_i - \Delta_n,\, u_i + \Delta_n] \cap I^f(u_i) \cap I^w(u_i),$$

where $I^f(u_i)$ is the Lipschitz piece of $f$ containing $u_i$, $I^w(u_i)$ is the Lipschitz piece of $w$ containing $u_i$, and $\Delta_n = (\tilde C_2 + \sqrt{C_2+4})\sqrt{\log n/n}$ for any $C_2, \tilde C_2 > 0$. Suppose $2\Delta_n < \min_{i,j}\{|I^f_i \cap I^w_j| : I^f_i \cap I^w_j \neq \emptyset\}$ (so that $|S_i(\Delta_n)| > \Delta_n$) and $\sqrt{(C_2+4)\log n/n} \le 1$. Then

$$\Pr\left\{\min_i \frac{|\{\tilde i \neq i : u_{\tilde i} \in S_i(\Delta_n)\}|}{n-1} \ge \tilde C_2\sqrt{\frac{\log n}{n}}\right\} \ge 1 - 2n^{-C_2/4}.$$

Proof: By Bernstein's inequality, for any $\varepsilon \in (0,1]$ and $n \ge 6$,

$$\Pr\left\{\left|\frac{|\{\tilde i\neq i: u_{\tilde i}\in S_i(\Delta_n)\}|}{n-1} - |S_i(\Delta_n)|\right| \ge \varepsilon\right\} \le 2\exp\left\{-\frac{(n-1)\varepsilon^2}{2+2\varepsilon/3}\right\} \le 2e^{-n\varepsilon^2/4}.$$

Taking a union bound over all $i = 1, \dots, n$,

$$\Pr\left\{\max_i\left|\frac{|\{\tilde i\neq i: u_{\tilde i}\in S_i(\Delta_n)\}|}{n-1} - |S_i(\Delta_n)|\right| \ge \varepsilon\right\} \le 2ne^{-n\varepsilon^2/4}.$$

Let $\varepsilon = \sqrt{(C_2+4)\log n/n}$, which satisfies $\varepsilon \le 1$ by assumption. Then, with probability at least $1 - 2n^{-C_2/4}$,

$$\max_i\left|\frac{|\{\tilde i\neq i: u_{\tilde i}\in S_i(\Delta_n)\}|}{n-1} - |S_i(\Delta_n)|\right| \le \sqrt{\frac{(C_2+4)\log n}{n}},$$

which implies

$$\min_i \frac{|\{\tilde i\neq i: u_{\tilde i}\in S_i(\Delta_n)\}|}{n-1} \ge \min_i |S_i(\Delta_n)| - \sqrt{\frac{(C_2+4)\log n}{n}} \ge \Delta_n - \sqrt{\frac{(C_2+4)\log n}{n}} = \tilde C_2\sqrt{\frac{\log n}{n}}. \qquad \square$$

Lemma 2. Let $h = C\sqrt{\log n/n}$ with $0 < C \le \tilde C_2$, where $\tilde C_2$, $\Delta_n$ and $S_i(\Delta_n)$ are as in Lemma 1. Recall that $\hat d(i, i') = \hat d_0(i, i') + \lambda\hat s(i, i')$ and $N_i := \{i' \neq i : \hat d(i, i') \le q_i(h)\}$. Then, when $n$ is large enough, with probability tending to one,

$$\max_{i;\, i'\in N_i}\frac{1}{n}\|P_{i'\cdot} - P_{i\cdot}\|_2^2 \le \tilde C_4\sqrt{\frac{\log n}{n}} + 2\lambda M(\sigma, p, n),$$

where $M(\sigma, p, n)$ is the bound on $\hat s$ whose explicit form is given in Lemma 3.

Proof: First, we show that for $i \neq j$, $\langle P_{i\cdot}, P_{j\cdot}\rangle/n$ and $\langle A_{i\cdot}, A_{j\cdot}\rangle/n$ are close. Since $A$ and $P$ are symmetric, we write $(A^2/n)_{ij}$ and $(P^2/n)_{ij}$ for the $(i,j)$-th entries of $A^2/n$ and $P^2/n$, respectively.
Suppose $n \ge 6$. Then

$$\left|(A^2/n)_{ij} - (P^2/n)_{ij}\right| = \left|\frac{1}{n}\sum_{k=1}^n (A_{ik}A_{jk} - P_{ik}P_{jk})\right| \le \left|\frac{1}{n}\sum_{k\neq i,j}(A_{ik}A_{jk} - P_{ik}P_{jk})\right| + \frac{2}{n}.$$

By Bernstein's inequality, if $\varepsilon \in (0,1]$ and $n \ge 6$,

$$\Pr\left\{\frac{1}{n-2}\left|\sum_{k\neq i,j}(A_{ik}A_{jk} - P_{ik}P_{jk})\right| \ge \varepsilon\right\} \le 2\exp\left\{-\frac{(n-2)\varepsilon^2}{2+2\varepsilon/3}\right\} \le 2e^{-n\varepsilon^2/4}.$$

Taking a union bound over all $i \neq j$ and setting $\varepsilon = \sqrt{(C_3+8)\log n/n}$ for some $C_3 > 0$ with $\varepsilon \le 1$, we get, with probability at least $1 - 2n^{-C_3/4}$,

$$\max_{i\neq j}\left|(A^2/n)_{ij} - (P^2/n)_{ij}\right| \le \sqrt{\frac{(C_3+8)\log n}{n}} + \frac{2}{n} \le 2\sqrt{\frac{(C_3+8)\log n}{n}} \quad (\text{when } n \text{ is large}).$$

Next, we claim that for $\tilde i$ with $u_{\tilde i} \in S_i(\Delta_n)$ we have $\langle P_{i\cdot}, P_{k\cdot}\rangle/n \approx \langle P_{\tilde i\cdot}, P_{k\cdot}\rangle/n$; this is ensured by the Lipschitz condition on $w$. Indeed (writing $\tilde i \in S_i(\Delta_n)$ for $u_{\tilde i} \in S_i(\Delta_n)$),

$$\max_{i;\,\tilde i\in S_i(\Delta_n)}\left|(P^2/n)_{ik} - (P^2/n)_{\tilde ik}\right| = \max_{i;\,\tilde i\in S_i(\Delta_n)}\frac{|\langle P_{i\cdot} - P_{\tilde i\cdot}, P_{k\cdot}\rangle|}{n} \le \max_{i;\,\tilde i\in S_i(\Delta_n)}\frac{\|P_{i\cdot} - P_{\tilde i\cdot}\|_2\|P_{k\cdot}\|_2}{n} \le \frac{\sqrt{n}\,L_w|u_i - u_{\tilde i}|\cdot\sqrt{n}}{n} \le L_w\Delta_n.$$

Recall that $\tilde d(i, \tilde i) = \max_{k\neq i,\tilde i}|(A^2/n)_{ik} - (A^2/n)_{\tilde ik}|$. Hence, with probability at least $1 - 2n^{-C_3/4}$,

$$\max_{i;\,\tilde i\in S_i(\Delta_n)}\tilde d(i,\tilde i) \le \max_{i;\,\tilde i\in S_i(\Delta_n)}\max_{k\neq i,\tilde i}\left|(P^2/n)_{ik} - (P^2/n)_{\tilde ik}\right| + 2\max_{i\neq j}\left|(A^2/n)_{ij} - (P^2/n)_{ij}\right| \le L_w\Delta_n + 4\sqrt{\frac{(C_3+8)\log n}{n}} =: \tilde C_3\sqrt{\frac{\log n}{n}}.$$

By Lemma 3, with probability at least $1 - 2n^{-C_4/2} - 2n^{-C_5/32}$ (for constants $C_4, C_5 > 0$),

$$\max_{i;\,\tilde i\in S_i(\Delta_n)}\hat s(i,\tilde i) \le M(\sigma, p, n). \qquad (11)$$

Since the perturbation in (9) adds at most $1/n^2$ to $\tilde d$, it is absorbed into the constant, so with probability at least $1 - 2n^{-C_3/4} - 2n^{-C_4/2} - 2n^{-C_5/32}$,

$$\max_{i;\,\tilde i\in S_i(\Delta_n)}\hat d(i,\tilde i) \le \tilde C_3\sqrt{\frac{\log n}{n}} + \lambda M(\sigma, p, n). \qquad (12)$$

By the definition of $N_i$ and $h = C\sqrt{\log n/n} \le \tilde C_2\sqrt{\log n/n}$, Lemma 1 implies that, on an additional event of probability at least $1 - 2n^{-C_2/4}$, each node $i$ has at least $(n-1)h$ indices $\tilde i$ with $u_{\tilde i} \in S_i(\Delta_n)$; hence $q_i(h) \le \max_{\tilde i\in S_i(\Delta_n)}\hat d(i,\tilde i)$. Combining this with (12), with probability at least $1 - 2n^{-C_2/4} - 2n^{-C_3/4} - 2n^{-C_4/2} - 2n^{-C_5/32}$,

$$\max_{i;\, i'\in N_i}\hat d(i, i') \le \tilde C_3\sqrt{\frac{\log n}{n}} + \lambda M(\sigma, p, n).$$
Finally, on the same event, for all $i$ and $i' \in N_i$ uniformly, pick $\tilde i \in S_i(\Delta_n)$ and $\tilde i' \in S_{i'}(\Delta_n)$ with $\tilde i, \tilde i' \notin \{i, i'\}$. Then

$$\frac{1}{n}\|P_{i'\cdot} - P_{i\cdot}\|_2^2 \le \left|(P^2/n)_{ii} - (P^2/n)_{ii'}\right| + \left|(P^2/n)_{i'i'} - (P^2/n)_{i'i}\right| \le \left|(P^2/n)_{\tilde ii} - (P^2/n)_{\tilde ii'}\right| + \left|(P^2/n)_{\tilde i'i'} - (P^2/n)_{\tilde i'i}\right| + 4L_w\Delta_n \le \left|(A^2/n)_{\tilde ii} - (A^2/n)_{\tilde ii'}\right| + \left|(A^2/n)_{\tilde i'i'} - (A^2/n)_{\tilde i'i}\right| + 4\tilde C_3\sqrt{\frac{\log n}{n}} + 4L_w\Delta_n \le 2\max_{k\neq i,i'}\left|(A^2/n)_{ik} - (A^2/n)_{i'k}\right| + 4\tilde C_3\sqrt{\frac{\log n}{n}} + 4L_w\Delta_n = 2\tilde d(i,i') + 4\tilde C_3\sqrt{\frac{\log n}{n}} + 4L_w\Delta_n \le 2\hat d(i,i') + 4\tilde C_3\sqrt{\frac{\log n}{n}} + 4L_w\Delta_n \le \tilde C_4\sqrt{\frac{\log n}{n}} + 2\lambda M(\sigma, p, n).$$

This completes the proof of Lemma 2. $\square$

Lemma 3. Let $X_i = f(u_i) + e_i \in \mathbb{R}^p$ with $E(e_{ik}) = 0$ and $e_{ik} \overset{\text{iid}}{\sim}$ sub-Gaussian($\sigma$). Suppose $f$ is piecewise Lipschitz with global constant $L_f$, and $\|f(u)\|_2^2/p \le M$ for any $u \in [0,1]$ ($M$ a global constant). Then, with probability tending to one:

(i) when $p > n$,

$$\max_{i;\,\tilde i\in S_i(\Delta_n)}\hat s(i,\tilde i) \le \tilde C_5\max\left\{\sqrt{\frac{\log n}{n}},\; \sigma\sqrt{\frac{\log n}{p}},\; \sigma^2\sqrt{\frac{\log n}{p}}\right\};$$

(ii) when $p \le n$,

$$\max_{i;\,\tilde i\in S_i(\Delta_n)}\hat s(i,\tilde i) \le \tilde C_5\max\left\{\sqrt{\frac{\log n}{n}},\; \sigma\sqrt{\frac{\log n}{p}},\; \sigma^2\frac{\log n}{p}\right\}.$$

Proof: Denote $\mu_i = f(u_i)$. As in the proof of Lemma 2, we first show that $\langle X_i, X_j\rangle/p$ and $\langle\mu_i,\mu_j\rangle/p$ are close:

$$|\langle X_i, X_j\rangle - \langle\mu_i,\mu_j\rangle| \le |\langle\mu_i, e_j\rangle| + |\langle\mu_j, e_i\rangle| + |\langle e_i, e_j\rangle|.$$

Since $e_{jk} \sim$ sub-Gaussian($\sigma$), we have $\mu_{ik}e_{jk} \mid u_i \sim$ sub-Gaussian($|f_k(u_i)|\sigma$). By Hoeffding's inequality, for any $\varepsilon > 0$,

$$\Pr\left\{\left|\frac{1}{p}\langle\mu_i, e_j\rangle\right| > \varepsilon \;\Big|\; u_i\right\} \le 2\exp\left\{-\frac{p^2\varepsilon^2}{2\sigma^2\sum_{k=1}^p f_k^2(u_i)}\right\} \le 2\exp\left\{-\frac{p\varepsilon^2}{2\sigma^2 M}\right\},$$

where the right-hand side does not involve $u_i$, so the same bound holds unconditionally. Taking a union bound over all $i$ and $j$ (allowing $i = j$) and setting $\varepsilon = \sqrt{(C_4+4)\sigma^2 M\log n/p}$ for any $C_4 > 0$,

$$\Pr\left\{\max_{i,j}\left|\frac{1}{p}\langle\mu_i, e_j\rangle\right| > \sqrt{\frac{(C_4+4)\sigma^2 M\log n}{p}}\right\} \le 2n^2\exp\left\{-\frac{(C_4+4)\log n}{2}\right\} = 2n^{-C_4/2}.$$

As for the term $|\langle e_i, e_j\rangle|/p$, Lemma 4 shows that, for $i \neq j$, $e_{ik}e_{jk}$ follows a sub-exponential distribution with parameters $\nu = 4\sigma^2$ and $\alpha = 4\sigma^2$.

[Case I:] When $0 < \varepsilon \le \nu^2/\alpha = 4\sigma^2$, Bernstein's inequality for sub-exponential variables gives

$$\Pr\left\{\left|\frac{1}{p}\langle e_i, e_j\rangle\right| > \varepsilon\right\} \le 2e^{-p\varepsilon^2/(2\nu^2)} = 2e^{-p\varepsilon^2/(32\sigma^4)}.$$

Taking a union bound over all $i \neq j$ and setting $\varepsilon = \sqrt{(C_5+64)\sigma^4\log n/p}$, we have

$$\Pr\left\{\max_{i\neq j}\left|\frac{1}{p}\langle e_i, e_j\rangle\right| > \sqrt{\frac{(C_5+64)\sigma^4\log n}{p}}\right\} \le 2n^{-C_5/32}.$$
The above inequality holds if $p > n$ and $C' \in (0, 16p/\log n - 64]$, which ensures $\varepsilon \le 4\sigma^2$. Therefore, with probability at least $1 - 2n^{-C_2/2} - 2n^{-C'/32}$, we have
\[
\max_{i \neq j} |\langle X_i, X_j\rangle - \langle \mu_i, \mu_j\rangle|/p
\le 2\sqrt{\frac{(C_2+4)\sigma^2 M_0 \log n}{p}} + \sqrt{\frac{(C'+64)\sigma^4 \log n}{p}}.
\]
[Case II:] If $\varepsilon > \nu^2/\alpha = 4\sigma^2$, Bernstein's inequality gives
\[
\Pr\bigg\{ \frac{1}{p} |\langle e_i, e_j\rangle| > \varepsilon \bigg\}
\le 2 e^{-p\varepsilon/(2\alpha)} = 2 e^{-p\varepsilon/(8\sigma^2)}.
\]
Taking a union bound over all $i \neq j$, and setting $\varepsilon = (C''+16)\sigma^2 \log n / p$, we have
\[
\Pr\bigg\{ \max_{i \neq j} \frac{1}{p} |\langle e_i, e_j\rangle| > \frac{(C''+16)\sigma^2 \log n}{p} \bigg\} \le 2n^{-C''/8}.
\]
The above inequality holds if $p \le n$ and for any $C'' > 0$. Therefore, with probability at least $1 - 2n^{-C_2/2} - 2n^{-C''/8}$, we have
\[
\max_{i \neq j} |\langle X_i, X_j\rangle - \langle \mu_i, \mu_j\rangle|/p
\le 2\sqrt{\frac{(C_2+4)\sigma^2 M_0 \log n}{p}} + \frac{(C''+16)\sigma^2 \log n}{p}.
\]
Second, we show that for $\tilde i$ such that $u_{\tilde i} \in S_i(\Delta_n)$, we have $\langle \mu_i, \mu_k\rangle/p \approx \langle \mu_{\tilde i}, \mu_k\rangle/p$. This is ensured by the Lipschitz condition on $f$. In fact, conditional on $\{u_i\}$,
\[
\max_{i;\, \tilde i \in S_i(\Delta_n)} |\langle \mu_i, \mu_k\rangle/p - \langle \mu_{\tilde i}, \mu_k\rangle/p|
= \max_{i;\, \tilde i \in S_i(\Delta_n)} |\langle \mu_i - \mu_{\tilde i},\, \mu_k\rangle|/p
\le \max_{i;\, \tilde i \in S_i(\Delta_n)} \|\mu_i - \mu_{\tilde i}\| \cdot \|\mu_k\|/p
\le \max_{i;\, \tilde i \in S_i(\Delta_n)} \sqrt{p (L_f |u_i - u_{\tilde i}|)^2} \cdot \sqrt{p M_0}/p
\le \sqrt{M_0}\, L_f \Delta_n,
\]
where $\sqrt{M_0}\, L_f \Delta_n$ does not depend on $u_i$. Now we are ready to bound $\hat s(i, \tilde i)$.

Under Case I, when $p > n$ and $C' \in (0, 16p/\log n - 64]$, with probability at least $1 - 2n^{-C_2/2} - 2n^{-C'/32}$, we have
\[
\max_{i;\, \tilde i \in S_i(\Delta_n)} \hat s(i, \tilde i)
= \max_{i;\, \tilde i \in S_i(\Delta_n)} \Big\{ \max_{k \neq i, \tilde i} |\langle X_i, X_k\rangle - \langle X_{\tilde i}, X_k\rangle|/p \Big\}
\le \max_{i;\, \tilde i \in S_i(\Delta_n)} \Big\{ \max_{k \neq i, \tilde i} |\langle \mu_i, \mu_k\rangle - \langle \mu_{\tilde i}, \mu_k\rangle|/p \Big\}
+ 2 \max_{i \neq j} |\langle X_i, X_j\rangle - \langle \mu_i, \mu_j\rangle|/p
\le \sqrt{M_0}\, L_f \Delta_n + 2 \bigg( 2\sqrt{\frac{(C_2+4)\sigma^2 M_0 \log n}{p}} + \sqrt{\frac{(C'+64)\sigma^4 \log n}{p}} \bigg)
\le \tilde C_a \sqrt{\frac{\log n}{n}} + 4\sigma\sqrt{(C_2+4)M_0} \sqrt{\frac{\log n}{p}} + 2\sigma^2 \sqrt{C'+64} \sqrt{\frac{\log n}{p}}
\le \tilde C_3 \max\bigg\{ \sqrt{\frac{\log n}{n}},\; \sigma \sqrt{\frac{\log n}{p}},\; \sigma^2 \sqrt{\frac{\log n}{p}} \bigg\},
\]
where the penultimate step again uses $\Delta_n \asymp \sqrt{\log n/n}$.

Under Case II, when $p \le n$ and $C'' > 0$, with probability at least $1 - 2n^{-C_2/2} - 2n^{-C''/8}$, we have
\[
\max_{i;\, \tilde i \in S_i(\Delta_n)} \hat s(i, \tilde i)
= \max_{i;\, \tilde i \in S_i(\Delta_n)} \Big\{ \max_{k \neq i, \tilde i} |\langle X_i, X_k\rangle - \langle X_{\tilde i}, X_k\rangle|/p \Big\}
\le \sqrt{M_0}\, L_f \Delta_n + 2 \bigg( 2\sqrt{\frac{(C_2+4)\sigma^2 M_0 \log n}{p}} + \frac{(C''+16)\sigma^2 \log n}{p} \bigg)
\le \tilde C_a \sqrt{\frac{\log n}{n}} + 4\sigma\sqrt{(C_2+4)M_0} \sqrt{\frac{\log n}{p}} + 2(C''+16)\sigma^2\, \frac{\log n}{p}
\le \tilde C_3 \max\bigg\{ \sqrt{\frac{\log n}{n}},\; \sigma \sqrt{\frac{\log n}{p}},\; \sigma^2 \bigg( \frac{\log n}{p} \bigg) \bigg\}.
\]
This completes the proof of Lemma 3.
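Remark (numerical illustration). The two concentration steps just combined can be checked by simulation. In the sketch below (ours, not from the paper), Gaussian noise stands in for a generic sub-Gaussian($\sigma$) distribution, and $f_k(u) = \cos(u + k)$ is an arbitrary bounded Lipschitz feature map with $\|f(u)\|^2/p \le 1$, so $M_0 = 1$; the seed, $\sigma$, and the $(n, p)$ pairs are illustrative choices. It compares the maximum deviation of the feature inner products with the $\sigma\sqrt{\log n/p}$ scale appearing in Lemma 3.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

def feature_deviation(n, p, sigma=0.5):
    """One draw of the feature model X_i = f(u_i) + e_i: compare
    max_{i != j} |<X_i, X_j> - <mu_i, mu_j>| / p with sigma * sqrt(log(n) / p)."""
    u = rng.uniform(size=n)
    k = np.arange(1, p + 1)
    mu = np.cos(np.add.outer(u, k))      # f_k(u) = cos(u + k): Lipschitz, ||f(u)||^2/p <= 1
    X = mu + sigma * rng.standard_normal((n, p))   # Gaussian noise: a sub-Gaussian(sigma) instance
    G = np.abs(X @ X.T - mu @ mu.T) / p
    np.fill_diagonal(G, 0.0)
    return G.max(), sigma * np.sqrt(np.log(n) / p)

for n, p in ((200, 50), (200, 5000)):    # p <= n and p > n, the two cases of Lemma 3
    dev, scale = feature_deviation(n, p)
    print(f"n = {n}, p = {p:4d}   max deviation = {dev:.4f}   "
          f"sigma*sqrt(log n/p) = {scale:.4f}")
\end{verbatim}

The ratio of the observed maximum to $\sigma\sqrt{\log n/p}$ should stay bounded across both regimes, consistent with the lemma.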
Lemma 4. Let $Y_1, Y_2 \overset{\mathrm{iid}}{\sim}$ sub-Gaussian($\sigma$) with $E(Y_i) = 0$; that is, $E[e^{sY_i}] \le e^{\sigma^2 s^2/2}$ for all $s \in \mathbb{R}$. Then $Z = Y_1 Y_2$ follows a sub-Exponential distribution with $(\nu, \alpha) = (4\sigma^2, 4\sigma^2)$.

Proof:
\[
E[e^{sZ}] = E[e^{sY_1 Y_2}]
= 1 + s E(Y_1 Y_2) + \sum_{k=2}^{\infty} \frac{s^k E[(Y_1 Y_2)^k]}{k!}
= 1 + \sum_{k=2}^{\infty} \frac{s^k E(Y_1^k) E(Y_2^k)}{k!}
\le 1 + \sum_{k=2}^{\infty} \frac{s^k [E(|Y_1|^k)]^2}{k!}
\le 1 + \sum_{k=2}^{\infty} \frac{s^k (2\sigma^2)^k k!}{k!}
= 1 + (2\sigma^2 s)^2 \sum_{k=0}^{\infty} (2\sigma^2 s)^k
\le 1 + 8\sigma^4 s^2 \quad \text{(when $|s| \le 1/(4\sigma^2)$)}
\le e^{8\sigma^4 s^2} = e^{(4\sigma^2)^2 s^2 / 2},
\]
where the second equality uses independence and $E(Y_1 Y_2) = 0$, and the second inequality uses a standard sub-Gaussian moment bound. Therefore, $Z = Y_1 Y_2$ is sub-Exponential with $\alpha = 4\sigma^2$ and $\nu^2 = 16\sigma^4$.

Finally, we are ready to complete the proof of the main theorem. With probability at least $1 - 2n^{-C_0/4} - 2n^{-C_1/4} - 2n^{-C_2/2} - 2n^{-C_3/4} - 2n^{-C^*/32}$, we have
\[
\max_i \frac{1}{n} \| \tilde P_{i\cdot} - P_{i\cdot} \|^2 \le \tilde C_2 \sqrt{\frac{\log n}{n}} + 2\lambda M(\sigma, p, n),
\]
which completes the proof of the main theorem.
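Remark (numerical illustration). Lemma 4 can be sanity-checked by Monte Carlo. The sketch below is ours and purely illustrative: standard normal variables serve as sub-Gaussian($1$) instances, so the claimed parameters are $\nu = \alpha = 4$. It estimates the moment generating function of $Z = Y_1 Y_2$ on the range $|s| \le 1/\alpha$ and compares it with $e^{\nu^2 s^2/2}$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo check of Lemma 4 with sigma = 1: for Y1, Y2 iid standard normal
# (hence sub-Gaussian(1)), Z = Y1 * Y2 should satisfy
#   E[exp(s Z)] <= exp(nu^2 s^2 / 2),  nu = 4 sigma^2,  for |s| <= 1/(4 sigma^2).
Z = rng.standard_normal(10**6) * rng.standard_normal(10**6)
for s in (0.05, 0.10, 0.20, 0.25):     # all within |s| <= 0.25 = 1/(4 sigma^2)
    mgf = np.exp(s * Z).mean()         # empirical moment generating function
    bound = np.exp(16.0 * s**2 / 2.0)  # exp(nu^2 s^2 / 2) with nu = 4
    print(f"s = {s:.2f}   E[exp(sZ)] ~ {mgf:.4f}   bound = {bound:.4f}")
\end{verbatim}

For $Y_i$ standard normal, the MGF of $Z$ is available in closed form, $E[e^{sZ}] = (1-s^2)^{-1/2}$ for $|s| < 1$, which indeed lies below $e^{8s^2}$ on the whole range checked.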
References

Airoldi, E. M., Costa, T. B., and Chan, S. H. (2013). Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In Advances in Neural Information Processing Systems, pages 692–700.

Aldous, D. J. (1981). Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis, 11(4), 581–598.

Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50), 21068–21073.

Chan, S. and Airoldi, E. (2014). A consistent histogram estimator for exchangeable graph models. In International Conference on Machine Learning, pages 208–216.

Chatterjee, S. (2015). Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1), 177–214.

Chen, K. and Lei, J. (2018). Network cross-validation for determining the number of communities in network data. Journal of the American Statistical Association, 113(521), 241–251.

Gao, C., Lu, Y., and Zhou, H. H. (2015). Rate-optimal graphon estimation. The Annals of Statistics, 43(6), 2624–2652.

Hoff, P. (2008). Modeling homophily and stochastic equivalence in symmetric relational data. In Advances in Neural Information Processing Systems, pages 657–664.

Hoover, D. N. (1979). Relations on probability spaces and arrays of random variables. Preprint, Institute for Advanced Study, Princeton.

Kallenberg, O. (2006). Probabilistic Symmetries and Invariance Principles. Springer Science & Business Media, New York.

Klopp, O., Tsybakov, A. B., and Verzelen, N. (2017). Oracle inequalities for network models and sparse graphon estimation. The Annals of Statistics, 45(1), 316–354.

Li, T., Levina, E., and Zhu, J. (2016). Network cross-validation by edge sampling. arXiv preprint arXiv:1612.04717.

Lloyd, J., Orbanz, P., Ghahramani, Z., and Roy, D. M. (2012). Random function priors for exchangeable arrays with applications to graphs and relational data. In Advances in Neural Information Processing Systems, pages 998–1006.

Lovász, L. (2012). Large Networks and Graph Limits. American Mathematical Society, Providence, Rhode Island.

Orbanz, P. and Roy, D. M. (2015). Bayesian models of graphs, arrays and other exchangeable random structures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2), 437–461.

Wolfe, P. J. and Olhede, S. C. (2013). Nonparametric graphon estimation. arXiv preprint arXiv:1309.5936.

Yang, J., Han, C., and Airoldi, E. (2014). Nonparametric estimation and testing of exchangeable graph models. In Artificial Intelligence and Statistics, pages 1060–1067.

Zhang, Y., Levina, E., and Zhu, J. (2016). Community detection in networks with node features. Electronic Journal of Statistics, 10(2), 3153–3178.

Zhang, Y., Levina, E., and Zhu, J. (2017). Estimating network edge probabilities by neighbourhood smoothing. Biometrika, 104(4), 771–783.