Exploration of Large Networks with Covariates via Fast and Universal Latent Space Model Fitting
EExploration of Large Networks with Covariates via Fast andUniversal Latent Space Model Fitting
Zhuang Ma and Zongming Ma ∗ University of Pennsylvania
Abstract
Latent space models are effective tools for statistical modeling and exploration of networkdata. These models can effectively model real world network characteristics such as degreeheterogeneity, transitivity, homophily, etc. Due to their close connection to generalized linearmodels, it is also natural to incorporate covariate information in them. The current paperpresents two universal fitting algorithms for networks with edge covariates: one based on nuclearnorm penalization and the other based on projected gradient descent. Both algorithms aremotivated by maximizing likelihood for a special class of inner-product models while workingsimultaneously for a wide range of different latent space models, such as distance models, whichallow latent vectors to affect edge formation in flexible ways. These fitting methods, especiallythe one based on projected gradient descent, are fast and scalable to large networks. We obtaintheir rates of convergence for both inner-product models and beyond. The effectiveness of themodeling approach and fitting algorithms is demonstrated on five real world network datasetsfor different statistical tasks, including community detection with and without edge covariates,and network assisted learning.
Keywords : community detection, network with covariates, non-convex optimization, projectedgradient descent.
Network is a prevalent form of data for quantitative and qualitative analysis in a number of fields,including but not limited to sociology, computer science, neuroscience, etc. Moreover, due toadvances in science and technology, the sizes of the networks we encounter are ever increasing.Therefore, to explore, to visualize and to utilize the information in large networks poses significantchallenges to Statistics. Unlike traditional datasets in which a number of features are recorded foreach subject, network datasets provide information on the relation among all subjects under study,sometimes together with additional features. In this paper, we focus on the modeling, visualizationand exploration of networks in which additional features might be observed for each node pair.On real world networks, people oftentimes observe the following characteristics. First, thedegree distributions of nodes are often right-skewed and so networks exhibit degree heterogeneity. ∗ Email: [email protected] . a r X i v : . [ s t a t . M E ] A ug n addition, connections in networks often demonstrate transitivity, that is nodes with commonneighbors are more likely to be connected. Moreover, nodes that are similar in certain ways(students in the same grade, brain regions that are close physically, etc.) are more likely to formbonds. Such a phenomenon is usually called homophily in network studies. Furthermore, nodes insome networks exhibit clustering effect and in such cases it is desirable to partition the nodes intodifferent communities.An efficient way to explore network data and to extract key information is to fit appropriatestatistical models on them. To date, there have been a collection of network models proposedby researchers in various fields. These models aim to catch different subsets of the foregoingcharacteristics, and Goldenberg et al. [26] provides a comprehensive overview. An important classof network models are latent space models [31]. Suppose there are n nodes in the observed network.The key idea underlying latent space modeling is that each node i can be represented by a vector z i in some low dimensional Euclidean space (or some other metric space of choice, see e.g. [37, 5]for latent spaces with negative curvature) that is sometimes called the social space, and nodes thatare “close” in the social space are more likely to be connected. Hoff et al. [31] considered two typesof latent space models: distance models and projection models. In both cases, the latent vectors t z i u ni “ were treated as fixed effects. Later, a series of papers [28, 27, 30, 39] generalized the originalproposal in [31] for better modeling of other characteristics of social networks, such as clustering,degree heterogeneity, etc. In these generalizations, the z i ’s were treated as random effects generatedfrom certain multivariate Gaussian mixtures. Moreover, model fitting and inference in these modelshas been carried out via Markov Chain Monte Carlo, and it is difficult to scale these methodologiesto handle large networks [26]. In addition, one needs to use different likelihood function based onchoice of model and there is little understanding of the quality of fitting when the model is mis-specified. 
Albeit these disadvantages, latent space models are attractive due to their friendlinessto interpretation and visualization.For concreteness, assume that we observe an undirected network represented by a symmetricadjacency matrix A on n nodes with A ij “ A ji “ i and j are connected and zero otherwise.In addition, we may also observe a symmetric pairwise covariate matrix X which measures certaincharacteristics of node pairs. We do not allow self-loop and so we set the diagonal elements of thematrices A and X to be zeros. The covariate X ij can be binary, such as an indicator of whethernodes i and j share some common attribute (e.g. gender, location, etc) or it can be continuous,such as a distance/similarity measure (e.g. difference in age, income, etc). It is straightforward togeneralize the methods and theory in this paper to a finite number of covariates.In this paper, we aim to tackle the following two key issues in latent space modeling of networkdata: • First, we seek a class of latent space models that is special enough so that we can design fastfitting algorithms for them and hence be able to handle networks of very large sizes; • In addition, we would like to be able to fit a class of models that are flexible enough to wellapproximate a wide range of latent space models of interest so that fitting methods for thisflexible class continue to work even when the model is mis-specified.From a practical viewpoint, if one is able to find such a class of models and design fast algorithmsfor fitting them, then one would be able to use this class as working models and to use the associatedfast algorithms to effectively explore large networks.2 .1 Main contributions
We make progress on tackling the foregoing two issues simultaneously in the present paper, whichwe summarize as the following main contributions:1. We first consider a special class of latent space models, called inner-product models, anddesign two fast fitting algorithms for this class. Let the observed n -by- n adjacency matrixand covariate matrix be A and X , respectively. The inner-product model assumes that forany i ă j , A ij “ A ji ind. „ Bernoulli p P ij q , withlogit p P ij q “ Θ ij “ α i ` α j ` βX ij ` z J i z j , (1)where for any x P p , q , logit p x q “ log r x {p ´ x qs . Here, α i , 1 ď i ď n , are parametersmodeling degree heterogeneity. The parameter β is the coefficient for the observed covariate,and z J i z j is the inner-product between the latent vectors. As we will show later in Section2, this class of models can incorporate degree heterogeneity, transitivity and homophilyexplicitly. From a matrix estimation viewpoint, the matrix G “ p G ij q “ p z J i z j q is of rank atmost k that can be much smaller than n . Motivated by recent advances in low rank matrixestimation, we design two fast algorithms for fitting (1). One algorithm is based on liftingand nuclear norm penalization of the negative log-likelihood function. The other is based ondirectly optimizing the negative log-likelihood function via projected gradient descent. Forboth algorithms, we establish high probability error bounds for inner-product models. Theconnection between model (1) and the associated algorithms and other related work in theliterature will be discussed immdediately in next subsection.2. We further show that these two fitting algorithms are “universal” in the sense that theycan work simultaneously for a wide range of latent space models beyond the inner-productmodel class. For example, they work for the distance model and the Gaussian kernelmodel in which the inner-product term z J i z j in (1) is replaced with ´} z i ´ z j } and c exp p´} z i ´ z j } { σ q , respectively. Thus, the class of inner-product models is flexible and canbe used to approximate many other latent space models of interest. In addition, the associatedalgorithms can be applied to networks generated from a wide range of mis-specified modelsand still yield reasonable results. The key mathematical insight that enables such flexibilityis introduced in Section 2 as the Schoenberg Condition (7).3. We demonstrate the effectiveness of the model and algorithms on real data examples. Inparticular, we fit inner-product models by the proposed algorithms on five different realnetwork datasets for several different tasks, including visualization, clustering and network-assisted classification. On three popular benchmark datasets for testing community detectionon networks, a simple k -means clustering on the estimated latent vectors obtained by ouralgorithm yields as good result on one dataset and better results on the other two whencompared with four state-of-the-art methods. The same “model fitting followed by k -meansclustering” approach also yields nice clustering of nodes on a social network with edgecovariates. Due to the nature of latent space models, for all datasets on which we fit the model,we obtain natural visualizations of the networks by plotting latent vectors. Furthermore, weillustrate how network information can be incorporated in traditional learning problems usinga document classification example. 3 Matlab implementation of the methodology in the present paper is available upon request. 
The current form of the inner-product model (1) has previously appeared in Hoff [28] and Hoff[29], though the parameters were modeled as random effects rather than fixed values, and Bayesianapproaches were proposed for estimating variance parameters of the random effects. Hoff [30]proposed a latent eigenmodel which has a probit link function as opposed to the logistic linkfunction in the present paper. As in [28] and [29], parameters were modeled as random effects andmodel fitting was through Bayesian methods. It was shown that the eigenmodel weakly generalizesthe distance model in the sense that the order of the entries in the latent component can bepreserved. This is complementary to our results which aim to approximate the latent componentdirectly in some matrix norm. An advantage of the eigenmodel is its ability to generalize the latentclass model, whereas the inner-product model (1) and the more general model we shall considergeneralize a subset of latent class models due to the constraint that the latent component (aftercentering) is positive semi-definite. We shall return to this point later in Section 7. Young andScheinerman [57] proposed a random dot product model, which can be viewed as an inner-productmodel with identity link function. The authors studied a number of properties of the model, suchas diameter, clustering and degree distribution. Tang et al. [51] studied properties of the leadingeigenvectors of the adjacency matrices of latent positions graphs (together with its implication onclassification in such models) where the connection probability of two nodes is the value that someuniversal kernel [46] takes on the associated latent positions and hence generalizes the randomdot product model. This work is close in spirit to the present paper. However, there are severalimportant differences. First, the focus here is model fitting/parameter estimation as opposed toclassification in [51]. In addition, any universal kernel considered in [51] satisfies the Schoenbergcondition (7) and thus is covered by the methods and theory of the present paper, and so we covera broader range of models that inner-product models can approximate. This is also partially dueto the different inference goals. Furthermore, we allow the presence of observed covariates while[51] did not.When fitting a network model, we are essentially modeling and estimating the edge probabilitymatrix. From this viewpoint, the present paper is related to the literautre on graphon estimationand edge probability matrix estimation for block models. See, for instance, [6, 4, 55, 22, 35, 24]and the references therein. However, the block models have stronger structural assumptions thanthe latent space models we are going to investigate.The algorithmic and theoretical aspects of the paper is also closely connected to the line ofresearch on low rank matrix estimation, which plays an important role in many applications suchas phase retrieval [12, 13] and matrix completion [11, 33, 34, 10, 36]. Indeed, the idea of nuclearnorm penalization has originated from matrix completion for both general entries [11] and binaryentries [19]. In particular, our convex approach can be viewed as a Lagrangian form of the proposalin [19] when there is no covariate and the matrix is fully observed. We have nonetheless decided tospell out details on both method and theory for the convex approach because the matrix completionliterature typically does not take into account the potential presence of observed covariates. 
On theother hand, the idea of directly optimizing a non-convex objective function involving a low rankmatrix has been studied recently in a series of papers. See, for instance, [42, 9, 50, 54, 15, 60, 25]and the references therein. Among these papers, the one that is the most related to the projected4radient descent algorithm we are to propose and analyze is [15] which focused on estimating apositive semi-definite matrix of exact low rank in a collection of interesting problems. Anotherrecent and related work [56] has appeared after the initial posting of the present manuscript.However, we will obtain tighter error bounds for latent space models and we will go beyond theexact low rank scenario.
After a brief introduction of standard notation used throughout the paper, the rest of the paper isorganized as follows. Section 2 introduces both inner-product models and a broader class of latentspace models on which our fitting methods work. The two fitting methods are described in detailin Section 3, followed by their theoretical guarantees in Section 4 under both inner-product modelsand the general class. The theoretical results are further corroborated by simulated examples inSection 5. Section 6 demonstrates the competitive performance of the modeling approach andfitting methods on five different real network datasets. We discuss interesting related problemsin Section 7 and present proofs of the main results in Section 8. Technical details justifying theinitialization methods for the project gradient descent approach are deferred to the appendix.
Notation
For A “ p A ij q P R n ˆ n , Tr p A q “ ř ni “ A ii stands for its trace. For X, Y P R m ˆ n , @ X, Y D “ Tr p X J Y q defines an inner product between them. If m ě n , for any matrix X withsingular value decomposition X “ ř ni “ σ i u i v J i , } X } ˚ “ ř ni “ σ i , } X } F “ bř ni “ σ i and } X } op “ max ni “ σ i stand for the nuclear norm, the Frobenius norm and the operator norm of the matrix,respectively. Moreover, X i ˚ and X ˚ j denote the i -th row and j -th column of X , and for anyfunction f , f p X q is the shorthand for applying f p¨q element-wisely to X , that is f p X q P R m ˆ n and r f p X qs ij “ f p X ij q . Let S n ` be the set of all n ˆ n positive semidefinite matrices and O p m, n q be theset of all m ˆ n orthonormal matrices. For any convex set C , P C p¨q is the projection onto the C . In this section, we first give a detailed introduction of the inner-product model (1) and conditionsfor its identifiability. In addition, we introduce a more general class of latent space models thatincludes the inner-product model as a special case. The methods we propose later will be motivatedby the inner-product model and can also be applied to the more general class.
Recall the inner-product model defined in (1), i.e., for any observed A and X and any i ă j , A ij “ A ji ind. „ Bernoulli p P ij q , with logit p P ij q “ Θ ij “ α i ` α j ` βX ij ` z J i z j . Fixing all other parameters, if we increase α i , then node i has higher chances of connecting withother nodes. Therefore, the α i ’s model degree heterogeneity of nodes and we call them degreeheterogeneity parameters. Next, the regression coefficient β moderates the contribution of covariateto edge formation. For instance, if X ij indicates whether nodes i and j share some common attributesuch as gender, then a positive β value implies that nodes that share common attribute are more5ikely to connect. Such a phenomenon is called homophily in the social network literaute. Last butnot least, the latent variables t z i u ni “ enter the model through their inner-product z J i z j , and henceis the name of the model. We impose no additional structural/distributional assumptions on thelatent variables for the sake of modeling flexibility.We note that model (1) also allows the latent variables to enter the second equation in the formof g p z i , z j q “ ´ } z i ´ z j } . To see this, note that g p z i , z j q “ ´ } z i } ´ } z j } ` z J i z j , and we mayre-parameterize by setting ˜ α i “ α i ´ } z i } for all i . Then we haveΘ ij “ α i ` α j ` βX ij ´ } z i ´ z j } “ ˜ α i ` ˜ α j ` βX ij ` z J i z j . An important implication of this observation is that the function g p z i , z j q “ ´ } z i ´ z j } directlymodels transitivity , i.e., nodes with common neighbors are more likely to connect since their latentvariables are more likely to be close to each other in the latent space. In view of the foregoingdiscussion, the inner-product model (1) also enjoys this nice modeling capacity.In matrix form, we have Θ “ α n J ` n α J ` βX ` G (2)where 1 n is the all one vector in R n and G “ ZZ J with Z “ p z , ¨ ¨ ¨ , z n q J P R n ˆ k . Since there isno self-edge and Θ is symmetric, only the upper diagonal elements of Θ are well defined, which wedenote by Θ u . Nonetheless we define the diagonal element of Θ as in (2) since it is inconsequential.To ensure identifiability of model parameters in (1), we assume the latent variables are centered,that is J Z “ Z where J “ I n ´ n n n J . (3)Note that this condition uniquely identifies Z up to a common orthogonal transformation of itsrows while G “ ZZ J is now directly identifiable. Model (1) is a special case of a more general class of latent space models, which can be defined by A ij “ A ji ind. „ Bernoulli p P ij q , withlogit p P ij q “ Θ ij “ ˜ α i ` ˜ α j ` βX ij ` (cid:96) p z i , z j q (4)where (cid:96) p¨ , ¨q is a smooth symmetric function on R k ˆ R k . We shall impose an additional constrainton (cid:96) following the discussion below. In matrix form, for ˜ α “ p ˜ α , . . . , ˜ α n q and L “ p (cid:96) p z i , z j qq , wecan write Θ “ ˜ α n J ` n ˜ α J ` βX ` L. To better connect with (2), let G “ J LJ, and α n J ` n α J “ ˜ α n J ` n ˜ α J ` L ´ J LJ. (5)Note that the second equality in the last display holds since the expression on its righthand side issymmetric and of rank at most two. Then we can rewrite the second last display asΘ “ α n J ` n α J ` βX ` G (6)6hich reduces to (2) and G satisfies J G “ G . Our additional constraint on (cid:96) is the followingSchoenberg condition:For any positive integer n ě z , . . . , z n P R k , G “ J LJ is positive semi-definite for L “ p (cid:96) p z i , z j qq and J “ I n ´ n n n J . 
(7)Condition (7) may seem abstract, while the following lemma elucidates two important classes ofsymmetric functions for which it is satisfied. Lemma 2.1.
Condition (7) is satisfied in the following cases:1. (cid:96) is a positive semi-definite kernel function on R k ˆ R k ;2. (cid:96) p x, y q “ ´} x ´ y } qp for some p P p , s and q P p , p s where } ¨ } p is the p -norm (or p -seminormwhen p ă ) on R k . The first claim of Lemma 2.1 is a direct consequence of the definition of positive semi-definitekernel function which ensures that the matrix L itself is positive semi-definite and so is G “ J LJ since J is also positive semi-definite. The second claim is a direct consequence of the Hilbert spaceembedding result by Schoenberg [48, 49]. See, for instance, Theorems 1 and 2 of [48]. This section presents two methods for fitting models (1) and (4)–(7). Both methods are motivatedby minimizing the negative log-likelihood function of the inner-product model, and can be regardedas pseudo-likelihood approaches for more general models. From a methodological viewpoint, a keyadvantage of these methods, in particular the projected gradient descent method, is the scalabilityto networks of large sizes.
In either the inner-product model or the general model we suppose the parameter matrix Θ in (2)or (6) satisfies ´ M ď Θ ij ď ´ M for 1 ď i ‰ j ď n, and | Θ ii | ď M for 1 ď i ď n. (8)where M ě M are non-negative. Then for any Θ satisfying (8), the corresponding edgeprobabilities satisfy12 e ´ M ď ` e M ď P ij ď ` e M ď e ´ M , ď i ‰ j ď n. (9)Thus M controls the conditioning of the problem and M controls the sparsity of the network.Let σ p x q “ {p ` e ´ x q be the sigmoid function, i.e. the inverse of logit function, then for any i ‰ j , P ij “ σ p Θ ij q and the log-likelihood function of the inner-product model (1) can be writtenas ÿ i ă j ! A ij log ´ σ p Θ ij q ¯ ` p ´ A ij q log ´ ´ σ p Θ ij q ¯) “ ÿ i ă j ! A ij Θ ij ` log ´ ´ σ p Θ ij q ¯) . G “ ZZ J in inner-product models. The MLE of Θ u is the solution to the followingrank constrained optimization problem:min Θ u ,α,β,G ´ ÿ i ă j ! A ij Θ ij ` log ´ ´ σ p Θ ij q ¯) , subject to Θ “ α n J ` n α J ` βX ` G, ´ M ď Θ ij ď ´ M ,GJ “ G, G P S n ` , rank p G q ď k. (10)This optimization problem is non-convex and generally intractable. To overcome this difficulty, weconsider a convex relaxation that replaces the rank constraint on G in (10) with a penalty termon its nuclear norm. Since G is positive semi-definite, its nuclear norm equals its trace. Thus, ourfirst model fitting scheme solves the following convex program:min α,β,G ´ ÿ i,j ! A ij Θ ij ` log ´ ´ σ p Θ ij q ¯) ` λ n Tr p G q subject to Θ “ α n J ` n α J ` βX ` G, GJ “ G, G P S n ` , ´ M ď Θ ij ď ´ M . (11)The convex model fitting method (11) is motivated by the nuclear norm penalization ideaoriginated from the matrix completion literature. See, for instance, [11], [10], [36], [19] and thereferences therein. In particular, it can be viewed as a Lagrangian form of the proposal in [19]when there is no covariate and the matrix is fully observed. However, we have decided to makethis proposal and study the theoretical properties as the existing literature, such as [19], does nottake in consideration the potential presence of observed covariates. Furthermore, one can still solve(11) when the true underlying model is one of the general models introduced in Section 2.2. Theappropriate choice of λ n will be discussed in Section 4. Remark . In addition to the introduction of the trace penalty, the first term in the objectivefunction in (11) now sums over all p i, j q pairs. Due to symmetry, after scaling, the difference fromthe sum in (10) lies in the inclusion of all diagonal terms in Θ. This slight modification leadsto neither theoretical consequence nor noticeable difference in practice. However, it allows easierimplementation and simplifies the theoretical investigation. We would note that the constraint ´ M ď Θ ij ď ´ M is included partially for obtaining theoretical guarantees. In simulated examplesreported in Section 5, we found that the convex program worked equally well without this constraint. Although the foregoing convex relaxation method is conceptually neat, state-of-the-art algorithmsto solve the nuclear (trace) norm minimization problem (11), such as iterative singular valuethresholding, usually require computing a full singular value decomposition at every iteration,which can still be time consuming when fitting very large networks.To further improve scalability of model fitting, we propose an efficient first order algorithm thatdirectly tackles the following non-convex objective function:min
Z,α,β g p Z, α, β q “ ´ ÿ i,j ! A ij Θ ij ` log ´ ´ σ p Θ ij q ¯) where Θ “ α n J ` n α J ` βX ` ZZ J . (12)The detailed description of the method is presented in Algorithm 1.8 lgorithm 1 A projected gradient descent model fitting method. Input:
Adjacency matrix: A ; covariate matrix: X ; latent space dimension: k ě
1; initialestimates: Z , α , β ; step sizes: η Z , η α , η β ; constraint sets: C Z , C α , C β . Output: p Z “ Z T , p α “ α T , p β “ β T . for t “ , , ¨ ¨ ¨ , T ´ do r Z t ` “ Z t ´ η Z ∇ Z g p Z, α, β q “ Z t ` η Z ` A ´ σ p Θ t q ˘ Z t ; r α t ` “ α t ´ η α ∇ α g p Z, α, β q “ α t ` η α p A ´ σ p Θ t qq n ; r β t ` “ β t ´ η β ∇ β g p Z, α, β q “ β t ` η β @ A ´ σ p Θ t q , X D ; Z t ` “ P C Z p r Z t ` q , α t ` “ P C α p r α t ` q , β t ` “ P C β p r β t ` q ; end for After initialization, Algorithm 1 iteratively updates the estimates for the three parameters,namely Z , α and β . In each iteration, for each parameter, the algorithm first descends along thegradient direction by a pre-specified step size. The descent step is then followed by an additionalprojection step which projects the updated estimates to pre-specified constraint sets. We proposeto set the step sizes as η Z “ η { ›› Z ›› , η α “ η {p n q , and η β “ η {p } X } q (13)for some small numeric constant η ą
0. To establish the desired theoretical guarantees, we makea specific choice of the constraint sets later in the statement of Theorem 4.2 and Theorem 4.4. Inpractice, one may simply set Z t ` “ J r Z t ` , α t ` “ r α t ` , and β t ` “ r β t ` . (14)Here and after, when there is no covariate, i.e. X “
0, we skip the update of β in each iteration.For each iteration, the update on the latent part is performed in the space of Z (that is R n ˆ k )rather than the space of all n ˆ n Gram matrices as was required in the convex approach. In thisway, it reduces the computational cost per iteration from O p n q to O p n k q . Since we are mostinterested in cases where k ! n , such a reduction leads to improved scalability of the non-convexapproach to large networks. To implement this non-convex algorithm, we need to specify the latentspace dimension k , which was not needed for the convex program (11). We defer the discussion onthe data-driven choice of k to Section 7.We note that Algorithm 1 is not guaranteed to find any global minimizer, or even any localminimizer, of the objective function (12). However, as we shall show later in Section 4, underappropriate conditions, the iterates generated by the algorithm will quickly enter a neighborhood ofthe true parameters ( Z ‹ , α ‹ , β ‹ ) and any element in this neighborhood is statistically at least as goodas the estimator obtained from the convex method (11). This approach has close connection to theinvestigation of various non-convex methods for other statistical and signal processing applications.See for instance [13], [15] and the references therein. Our theoretical analysis of the algorithm isgoing to provide some additional insight as we shall establish its high probability error bounds forboth the exact and the approximate low rank scenarios. In the rest of this section, we discussinitialization of Algorithm 1. 9 .2.1 Initialization Appropriate initialization is the key to success for Algorithm 1. We now present two ways toinitialize it which are theoretically justifiable under different circumstances.
Initialization by projected gradient descent in the lifted space
The first initializationmethod is summarized in Algorithm 2, which is essentially running the projected gradient descentalgorithm on the following regularized objective function for a small number of steps: f p G, α, β q “ ´ ÿ i,j t A ij Θ ij ` log p ´ σ p Θ ij qqu ` λ n Tr p G q ` γ n ` } G } ` ›› α n J ›› ` } Xβ } ˘ . (15)Except for the third term, this is the same as the objective function in (11). However, theinclusion of the additional proximal term ensures that one obtains the desired initializers after aminimal number of steps. Algorithm 2
Initialization of Algorithm 1 by Projected Gradient Descent Input:
Adjacency matrix: A ; covariate matrix X ; initial values: G “ , α “ , β “ η ; constraint set: C G , C α , C β ; regularization parameter: λ n , γ n ; latent dimension: k ;number of steps: T . for t “ , , ¨ ¨ ¨ , T do r G t ` “ G t ´ η ∇ G f p G, α, β q “ G t ` η p A ´ σ p Θ t q ´ λ n I n ´ γ n G t q ; r α t ` “ α t ´ η ∇ α f p G, α, β q{ n “ α t ` η pp A ´ σ p Θ t qq n { n ´ γ n α t q ; r β t ` “ β t ´ η ∇ β f p G, α, β q{} X } “ β t ` η p @ A ´ σ p Θ t q , X D {} X } ´ γ n β t q ; G t ` “ P C G p r G t ` q , α t ` “ P C α p r α t ` q , β t ` “ P C β p r β t ` q ; end for Set Z T “ U k D { k where U k D k U J k is the top- k eigen components of G T ; Output: Z T , α T , β T .The appropriate choices of λ n and γ n will be spelled out in Theorem 4.5 and Corollary 4.1.The step size η in Algorithm 2 can be set at a small positive numeric constant, e.g. η “ .
2. Theprojection sets that lead to theoretical justification will be specified later in Theorem 4.5 while inpractice, one may simply set G t ` “ J r G t ` , α t ` “ r α t ` , and β t ` “ r β t ` . Initialization by universal singular value thresholding
Another way to construct theinitialization is to first estimate the probability matrix P by universal singular value thresholding(USVT) proposed by [14] and then compute the initial estimates of α, Z, β heuristically by invertingthe logit transform. The procedure is summarized as Algorithm 3.Intuitively speaking, the estimate of P by USVT is consistent when } P } ˚ is “small”. Followingthe arguments in Theorems 2.6 and 2.7 of [14], such a condition is satisfied when the covariatematrix X “ X has “simple” structure. Such “simple” structure could be X ij “ κ p x i , x j q where x , ¨ ¨ ¨ , x n P R d are feature vectors associated with the n nodes and κ p¨ , ¨q characterizesthe distance/similarity between node i and node j . For instance, one could have X ij “ t x i “ x j u where x , ¨ ¨ ¨ , x n P t , ¨ ¨ ¨ , K u is a categorical variable such as gender, race, nationality, etc; or X ij “ s p| x i ´ x j |q where s p¨q is a continuous monotone link function and x , ¨ ¨ ¨ , x n P R is acontinuous node covariate such as age, income, years of education, etc.10 lgorithm 3 Initialization of Algorithm 1 by Singular Value Thresholding Input:
Adjacency matrix: A ; covariate matrix X ; latent dimension k ; threshold τ . Let r P “ ř σ i ě τ σ i u i v J i where ř ni “ σ i u i v J i is the SVD of A . Elementwisely project r P to theinterval r e ´ M , s to obtain p P . Let p Θ “ logit pp p P ` p P J q{ q ; Let α , β “ arg min α,β } p Θ ´ ` α n J ` n α J ` βX ˘ } ; Let p G “ P S n ` p R q where R “ J p p Θ ´ p α n J ` n p α q J ` β X qq J ; Set Z “ U k D { k where U k D k U J k is the top- k singular value components of p G ; Output: α , Z , β . Remark . The least squares problem in step 2 of Algorithm 3 has closed form solution and canbe computed in O p n q operations. The computational cost of Algorithm 3 is dominated by matrixdecompositions in step 1 and step 3. In this section, we first present error bounds for both fitting methods under inner-product models,followed by their generalizations to the more general models satisfying the Schoenberg condition(7). In addition, we give theoretical justifications of the two initialization methods for Algorithm1.
We shall establish uniform high probability error bounds for inner-product models belonging to thefollowing parameter space: F p n, k, M , M , X q “ ! Θ | Θ “ α n J ` n α J ` βX ` ZZ J , J Z “ Z, max ď i ď n } Z i ˚ } , } α } , | β | max ď i ă j ď n | X ij | ď M , max ď i ‰ j ď n Θ ij ď ´ M ) . (16)When X “
0, we replace the first inequality in (16) with max ď i ď n } Z i ˚ } , } α } ď M {
2. For theresults below, k , M , M and X are all allowed to change with n . Results for the convex approach
We first present theoretical guarantees for the optimizer of(11). When X in nonzero, we make the following assumption for the identifiability of β . Assumption 4.1.
The stable rank of the covariate matrix X satisfies r stable p X q “ } X } { } X } ě M k for some large enough constant M . The linear dependence on k of r stable p X q is in some sense necessary for β to be identifiable asotherwise the effect of the covariates could be absorbed into the latent component ZZ J .Let p p α, p β, p G q be the solution to (11) and p α ‹ , β ‹ , G ‹ q be the true parameter that governs thedata generating process. Let p Θ and Θ ‹ be defined as in (2) but with the estimates and thetrue parameter values for the components respectively. Define the error terms ∆ p Θ “ p Θ ´ Θ ‹ ,∆ p α “ p α ´ α ‹ , ∆ p β “ ˆ β ´ β ‹ and ∆ p G “ p G ´ G ‹ . The following theorem gives both deterministic11nd high probability error bounds for estimating both the latent vectors and logit-transformedprobability matrix. Theorem 4.1.
Under Assumption 4.1, for any λ n satisfying λ n ě max t } A ´ P } op , |x A ´ P, X { } X } F y|{? k, u , there exists a constant C such that ›› ∆ p G ›› , ›› ∆ p Θ ›› ď Ce M λ n k. Specifically, setting λ n “ C a max t ne ´ M , log n u for a large enough positive constant C , thereexist positive constants c, C such that uniformly over F p n, k, M , M , X q , with probability at least ´ n ´ c , ›› ∆ p G ›› , ›› ∆ p Θ ›› ď Cψ n where ψ n “ e M nk ˆ max (cid:32) e ´ M , log nn ( . (17)If we turn the error metrics in Theorem 4.1 to mean squared errors, namely } ∆ p G } { n and } ∆ p Θ } { n , then we obtain the familiar k { n rate in low rank matrix estimation problems [36, 2, 19,43]. In particular, the theorem can viewed as a complementary result to the result in [19] in the casewhere there are observed covariate and the 1 ´ bit matrix is fully observed. When e ´ M ě log nn , thesparsity of the network affects the rate through the multiplier e ´ M . As the network gets sparser,the multiplier will be no smaller than log nn . Remark . Note that the choice of the penalty parameter λ n depends on e ´ M which by (9)controls the sparsity of the observed network. In practice, we do not know this quantity and wepropose to estimate M with x M “ ´ logit p ř ij A ij { n q . Results for the non-convex approach
A key step toward establishing the statistical propertiesof the outputs of Algorithm 1 is to characterize the evolution of its iterates. To start with, weintroduce an error metric that is equivalent to } ∆ Θ t } “ } Θ t ´ Θ ‹ } while at the same time is moreconvenient for establishing an inequality satisfied by all iterates. Note that the latent vectors areonly identifiable up to an orthogonal transformation of R k , for any Z , Z P R n ˆ k , we define thedistance measure dist p Z , Z q “ min R P O p k q } Z ´ Z R } F where O p k q collects all k ˆ k orthogonal matrices. Let R t “ arg min R P O p k q ›› Z t ´ Z ‹ R ›› F and ∆ Z t “ Z t ´ Z ‹ R t , and further let ∆ α t “ α t ´ α ‹ , ∆ G t “ Z t p Z t q J ´ Z ‹ Z J‹ and ∆ β t “ β t ´ β ‹ . Then theerror metric we use for characterizing the evolution of iterates is e t “ } Z ‹ } } ∆ Z t } ` ›› ∆ α t n J ›› ` ›› ∆ β t X ›› . (18)Let κ Z ‹ be the condition number of Z ‹ (i.e., the ratio of the largest to the smallest singularvalues). The following lemma shows that the two error metrics e t and } ∆ Θ t } are equivalent up toa multiplicative factor of order κ Z ‹ . 12 emma 4.1. Under Assumption 4.1, there exists constant ď c ă such that e t ď κ Z ‹ p? ´ q } ∆ G t } ` ›› ∆ α t n J ›› ` ›› ∆ β t X ›› ď κ Z ‹ p? ´ qp ´ c q } ∆ Θ t } . Moreover, if dist p Z t , Z ‹ q ď c } Z ‹ } op , e t ě p c ` q } ∆ G t } ` ›› ∆ α t n J ›› ` ›› ∆ β t X ›› ě p c ` q p ` c q } ∆ Θ t } . In addition, our error bounds depend on the following assumption on the initializers.
Assumption 4.2.
The initializers Z , α , β in Algorithm 1 satisfy e ď ce ´ M } Z ‹ } { κ Z ‹ for asufficiently small positive constant c . Note that the foregoing assumption is not very restrictive. If k ! n , M is a constant and theentries of Z ‹ are i.i.d. random variables with mean zero and bounded variance, then } Z ‹ } op — ? n and κ Z ‹ —
1. In view of Lemma 4.1, this only requires n } Θ ´ Θ ‹ } to be upper bounded by someconstant. We defer verification of this assumption for initial estimates constructed by Algorithm 2and Algorithm 3 to Section 4.3.The following theorem states that under such an initialization, errors of the iterates convergelinearly till they reach the same statistical precision ψ n as in Theorem 4.1 modulo a multiplicativefactor that depends only on the condition number of Z ‹ . Theorem 4.2.
Let Assumptions 4.1 and 4.2 be satisfied. Set the constraint sets as C Z “ t Z P R n ˆ k , J Z “ Z, max ď i ď n } Z i ˚ } ď M { u , C α “ t α P R n , } α } ď M { u , C β “ t β P R , β } X } ď M { u . Set the step sizes as in (13) for any η ď c where c ą is a universal constant. Let ζ n “ max t } A ´ P } op , |x A ´ P, X { } X } F y|{? k, u . Then we have1. Deterministic errors of iterates: If } Z ‹ } ě C κ Z ‹ e M ζ n ˆ max !a ηke M , ) for a sufficientlylarge constant C , there exist positive constants ρ and C such that e t ď ˆ ´ ηe M κ Z ‹ ρ ˙ t e ` Cκ Z ‹ ρ e M ζ n k. Probabilistic errors of iterates: If } Z ‹ } ě C κ Z ‹ ? ne M ´ M { max !a ηke M , ) for asufficiently large constant C , there exist positive constants ρ, c and C such that uniformlyover F p n, k, M , M , X q with probability at least ´ n ´ c , e t ď ˆ ´ ηe M κ Z ‹ ρ ˙ t e ` C κ Z ‹ ρ ψ n . When X “ C β “ H and we replace M { C Z and C α with M { or any T ą T “ log p M κ Z ‹ e M ´ M nk q{ log p ´ ηe M κ Z ‹ ρ q , } ∆ G T } , } ∆ Θ T } ď C κ Z ‹ ψ n . for some constant C ą .Remark . In view of Lemma 4.1, the rate obtained by the non-convex approach in terms of ›› ∆ p Θ ›› matches the upper bound achieved by the convex method, up to a multiplier of κ Z ‹ . Assuggested by Lemma 4.1, the extra factor comes partly from the fact that e t is a slightly strongerloss function than } ∆ Θ t } and in the worst case can be cκ Z ‹ times larger than } ∆ Θ t } . Remark . Under the setting of Theorem 4.2, the projection steps for α, β in Algorithm 1 arestraightforward and have the following closed form expressions: α t ` i “ r α t ` i min t , M {p | r α t ` i |qu , β t ` “ r β t ` min t , M {p | r β t ` | max i,j | X ij |qu . The projection step for Z is slightly more involved.Notice that C Z “ C Z Ş C Z where C Z “ t Z P R n ˆ k , J Z “ Z u , C Z “ t Z P R n ˆ k , max ď i ď n } Z i ˚ } ď M { u . Projecting to either of them has closed form solution, that is P C Z p Z q “ J Z , r P C Z p Z qs i ˚ “ Z i ˚ min t , a M {p } Z i ˚ } qu . Then Dykstra’s projection algorithm [20] (or alternating projectionalgorithm) can be applied to obtain P C Z p r Z t ` q . We note that projections induced by theboundedness constraints for Z, α, β are needed for establishing the error bounds theoretically.However, when implementing the algorithm, users are at liberty to drop these projections andto only center the columns of the Z iterates as in (14). We did not see any noticeable differencethus caused on simulated examples reported in Section 5. Remark . When both M and M are constants and the covariate matrix X is absent, the resultin Section 4.5 of [15], in particular Corollary 5, implies the error rate of O p nk q in Theorem 4.2.However, when M Ñ 8 and M remains bounded as n Ñ 8 , the error rate in [15] becomes O p e M M nk q , which can be much larger than the rate O p e M nk q in Theorem 4.2 even when X is absent. We feel that this is a byproduct of the pursuit of generality in [15] and so the analysishas not been fine-tuned for latent space models. In addition, Algorithm 1 enjoys nice theoreticalguarantees on its performance even when the model is mis-specified and the Θ matrix is onlyapproximately low rank. See Theorem 4.4 below. This case which is important from a modelingviewpoint was not considered in [15] as its focus was on generic non-convex estimation of low rankpositive semi-definite matrices rather than fittings latent space models. 
We now investigate the performances of the fitting approaches on more general models satisfyingthe Schoenberg condition (7). To this end, we consider the following parameter space for the moregeneral class of latent space models F g p n, M , M , X q “ ! Θ | Θ “ α n J ` n α J ` βX ` G, G P S n ` , J G “ G, max ď i ď n G ii , } α } , | β | max ď i ă j ď n | X ij | ď M { , max ď i ‰ j ď n Θ ij ď ´ M ) . (19) One can verify that in this case we can identify the quantities in Corollary 5 of [15] as σ “ p “ d “ n , r “ k , ν — M , L ν — (cid:96) ν — e M .
14s before, when X “
0, we replace the first inequality in (19) with max ď i ď n } Z i ˚ } , } α } ď M { M , M and X are all allowed to change with n . Note that the latent spacedimension k is no longer a parameter in (19). Then for any positive integer k , let U k D k U J k be thebest rank- k approximation to G ‹ . In this case, with slight abuse of notation, we let Z ‹ “ U k D { k and s G k “ G ‹ ´ U k D k U J k . (20)Note that (19) does not specify the spectral behavior of G which will affect the performance ofthe fitting methods as the theorems in this section will later reveal. We choose not to make suchspecification due to two reasons. First, the spectral behavior of distance matrices resulting fromdifferent kernel functions and manifolds is by itself a very rich research topic. See, for instance,[45, 7, 21, 17] and the references therein. In addition, the high probability error bounds we are todevelop in this section is going to work uniformly for all models in (19) and can be specialized toany specific spectral decay pattern of G of interest. Results for the convex approach
The following theorem is a generalization of Theorem 4.1to the general class (19).
Theorem 4.3.
For any k P N ` such that Assumption 4.1 holds and any λ n satisfying λ n ě max t } A ´ P } op , |x A ´ P, X { } X } F y|{? k, u , there exists a constant C such that the solution tothe convex program (11) satisfies ›› ∆ p Θ ›› ď C ` e M λ n k ` e M λ n } s G k } ˚ ˘ . Specifically, setting λ n “ C a max t ne ´ M , log n u for a large enough constant C , there existspositive constants c, C such that uniformly over F g p n, M , M , X q with probability at least ´ n ´ c , ›› ∆ p Θ ›› ď C p ψ n ` e M ´ M { ? n } s G k } ˚ q . (21)The upper bound in (21) has two terms. The first is the same as that for the inner-productmodel. The second can be understood as the effect of model mis-specification, since the estimatoris essentially based on the log-likelihood of the inner-product model. We note that the bound holdsfor any k such that Assumption 4.1 holds while the choice of the tuning parameter λ n does notdepend on k . Therefore, we can take the infimum over all admissible values of k , depending on thestable rank of X . When X “
0, we can further improve the bound on the right side of (21) to bethe infimum of the current expression over all 0 ď k ď n . Results for the non-convex approach
Under the definition in (20), we continue to use theerror metric e t defined in equation (18). The following theorem is a generalization of Theorem 4.2to the general class (19). Theorem 4.4.
Under Assumptions 4.1 and 4.2, set the constraint sets C Z , C α , C β and the stepsizes η Z , η α and η β as in Theorem 4.2. Let ζ n “ max t } A ´ P } op , |x A ´ P, X { } X } F y|{? k, u .Then we have . Deterministic errors of iterates: If } G ‹ } op ě C κ Z ‹ e M ζ n ˆ max !a ηke M , b η } s G k } { ζ n , ) ,there exist positive constants ρ and C such that e t ď ˆ ´ ηe M κ Z ‹ ρ ˙ t e ` Cκ Z ‹ ρ ´ e M ζ n k ` e M ›› s G k ›› ¯ . Probabilistic errors of iterates: If } G ‹ } op ě C κ Z ‹ ? ne M ´ M { max !a ηke M , b η } s G k } { ζ n , ) for a sufficiently large constant C , there exist positive constants ρ, c and C such thatuniformly over F g p n, M , M , X q with probability at least ´ n ´ c , the iterates generated byAlgorithm 1 satisfying e t ď ˆ ´ ηe M κ Z ‹ ρ ˙ t e ` Cκ Z ‹ ρ ´ ψ n ` e M ›› s G k ›› ¯ . For any T ą T “ log p M κ Z ‹ e M ´ M nk q{ log p ´ ηe M κ Z ‹ ρ q , } ∆ G T } , } ∆ Θ T } ď C κ Z ‹ ´ ψ n ` e M ›› s G k ›› ¯ for some constant C . Compared with Theorem 4.2, the upper bound here has an additional term e M } ¯ G k } that canbe understood as the effect of model mis-specification. Such a term can result from mis-specifyingthe latent space dimension in Algorithm 1 when the inner-product model holds, or it can arise whenthe inner-product model is not the true underlying data generating process. This term is differentfrom its counterpart in Theorem 4.3 which depends on ¯ G k through its nuclear norm. In eithercase, the foregoing theorem guarantees that Algorithm 1 continues to yield reasonable estimate ofΘ and G as long as } ¯ G k } “ O p e M ´ M nk q , i.e., when the true underlying model can be reasonablyapproximated by an inner-product model with latent space dimension k . We conclude this section with error bounds for the two initialization methods in Section 3. Thesebounds justify that the methods yield initial estimates satisfying Assumption 4.2 under differentcircumstances.
Error bounds for Algorithm 2
The following theorem indicates that Algorithm 2 yields goodinitial estimates after a small number of iterates as long as the latent effect G ‹ is substantial andthe remainder ¯ G k is well controlled. Theorem 4.5.
Suppose that Assumption 4.1 holds and that } α ‹ n J } F , } β ‹ X } F ď C } G ‹ } F for anumeric constant C ą . Let λ n satisfy C a max t ne ´ M , log n u ď λ n ď c } G ‹ } op {p e M ? kκ Z ‹ q for sufficiently large constant C and sufficiently small constant c , let γ n satisfy γ n ď δλ n { } G ‹ } op or sufficiently small constant δ . Choose step size η ď { and set the constraint sets as C G “ t G P S n ˆ n ` , J G “ G, max ď i,j ď n | G ij | ď M { u , C α “ t α P R n , } α } ď M { u , C β “ t β P R , β } X } ď M { u . If the latent vectors contain strong enough signal in the sense that } G ‹ } ě Cκ Z ‹ e M max ! φ n , } s G k } ˚ { k, ›› s G k ›› ) , (22) for some sufficiently large constant C , there exist positive constants c, C such that with probabilityat least ´ n ´ c , for any given constant c ą , e T ď c e ´ M } Z ‹ } { κ Z ‹ as long as T ě T , where T “ log ˜ C e M kκ Z ‹ c ¸ ˆ log ˆ ´ γ n η ˙˙ ´ . (23)We note that the theorem holds for both inner-product models and more general modelssatisfying condition (7). In addition, it gives the ranges of λ n and γ n for implementing Algorithm 2.Note that the choices λ n and γ n affect the number of iterations needed. To go one step further, thefollowing corollary characterizes the ideal choices of γ n and λ n in Algorithm 2. It is worth notingthat the choice of λ n here does not coincide with that in Theorem 4.1 and Theorem 4.3. So this isslightly different from the conventional wisdom that to initialize the non-convex approach, it wouldsuffice to simply run the convex optimization algorithm for a small number of steps. Interestingly,the corollary shows that when M , k and κ Z ‹ are all upper bounded by universal constants, forappropriate choices of γ n and λ n in Algorithm 2, the number of iterations needed does not dependon the graph size n . Corollary 4.1.
Specifically in Theorem 4.5, if we choose γ n “ c {p e M ? kκ Z ‹ q for some sufficientlysmall constant c , and λ n “ C γ n } G ‹ } op for some sufficiently large constant C , there exist positiveconstants c, C such that with probability at least ´ n ´ c , for any given constant c ą , e T ď c e ´ M } Z ‹ } { κ Z ‹ as long as T ě T , where T “ log ˜ C e M kκ Z ‹ c ¸ ˆ log ˆ ´ γ n η ˙˙ ´ . (24) Remark . Similar to computing P C Z p¨q in Algorithm 1, P C G p¨q could also be implemented byDykstra’s projection algorithm since C G is the intersection of two convex sets. The boundednessconstraint max i,j | G ij | ď M { G t ` will have closed form solution G t ` “ P S n ` p J r G t ` J q where P S n ` p¨q can be computedby singular value thresholding. When X “ C β “ H and we replace M { C G and C α with M { rror bounds for Algorithm 3 The following result justifies the singular value thresholdingapproach to initialization for inner-product models with no edge covariate.
Proposition 4.1.
If no covariates are included in the latent space model and } G ‹ } F ě c n for somenumeric constant c ą , then there exists constant c such that with probability at least ´ n c , forany n ě C p k, M , κ Z ‹ q where C p k, M , κ Z ‹ q is a constant depending on k, M and κ Z ‹ , the outputsof Algorithm 3 with τ ě . ? n satisfies the initialization condition in Assumption 4.2. Although we do not have further theoretical results, Algorithm 3 worked well on all the simulateddata examples reported in Section 5.
In this section, we present results of simulation studies on three different aspects of the proposedmethods: (1) scaling of estimation errors with network sizes, (2) impact of initialization onAlgorithm 1, and (3) performance of the methods on general models.
Estimation errors
We first investigate how estimation errors scale with network size. To thisend, we fix β ‹ “ ´? p n, k q P t , , , , u ˆ t , , u , we set the othermodel parameters randomly following these steps:1. Generate the degree heterogeneity parameters: p α ‹ q i “ ´ α i { ř nj “ α j for 1 ď i ď n , where α , ¨ ¨ ¨ , α n iid „ U r , s .2. Generate µ , µ P R k with coordinates iid following U r´ , s as two latent vector centers;3. Generate latent vectors: for i “ , . . . , k , let p z q i , ¨ ¨ ¨ , p z t n { u q i iid „ p µ q i ` N r´ , s p , q and p z t n { u ` q i , ¨ ¨ ¨ , p z n q i iid „ p µ q i ` N r´ , s p , q where N r´ , s p , q is the standard normaldistribution restricted onto the interval r´ , s , then set Z ‹ “ J Z where Z “ r z , ¨ ¨ ¨ , z n s J and J is as defined in (3). Finally, we normalize Z ‹ such that } G ‹ } F “ n ;4. Generate the covariate matrix: X “ n r X {} r X } F where r X ij iid „ min t| N p , q| , u .For each generated model, we further generated 30 independent copies of the adjacency matrix foreach model configuration. Unless otherwise specified, for all experiments in this section, with given p n, k q , the model parameters are set randomly following the above four steps and algorithms arerun on 30 independent copies of the adjacency matrix.The results of the estimation error for varying p n, k q are summarized in the log-log boxplots inFigure 1, where “Relative Error - Z ” is defined as } p Z p Z J ´ Z ‹ Z J‹ } {} Z ‹ Z J‹ } and “Relative Error -Θ” is defined as } p Θ ´ Θ ‹ } {} Θ ‹ } . From the boxplots, for each fixed latent space dimension k , therelative estimation errors for both Z ‹ and Θ ‹ scale at the order of 1 {? n . This agrees well with thetheoretical results in Section 3. For different latent space dimension k , the error curve (in log-logscale) with respect to network size n only differs in the intercept. Impact of initialization on Algorithm 1
We now turn to the comparison of three differentinitialization methods for Algorithm 1: the convex method (Algorithm 2), singular valuethresholding (Algorithm 3), and random initialization. To this end, we fixed n “ , k “ ll l ll l
500 1000 2000 4000 8000
Number of Nodes R e l a t i v e E rr o r − Z k = 2k = 4k = 8 l lll ll l ll ll
500 1000 2000 4000 8000
Number of Nodes R e l a t i v e E rr o r − Q k = 2k = 4k = 8 Figure 1: log-log boxplot for relative estimation errors with varying network size and latent spacedimension.In Algorithm 2, we choose T “
10 and λ n “ a n p p where p p “ ř ij A ij { n . In Algorithm 3, weset M “ τ “ a n p p . The relative estimation errors are summarized as boxplotsin Figure 2. Clearly, the non-convex algorithm is very robust to the initial estimates. Similarphenomenon is observed in real data analysis where different initializations yield nearly the sameclustering accuracy. Algorithm 2 Algorithm 3 Random R e l a t i v e E rr o r − Z Algorithm 2 Algorithm 3 Random R e l a t i v e E rr o r − Q Figure 2: Boxplot for relative estimation error with different initialization methods.
Performance on general models
Finally, to investigate the performance of the proposedmethod under the general model (4), we try two frequently used kernel functions, distance kernel (cid:96) d p z i , z j q “ ´} z i ´ z j } and Gaussian kernel (cid:96) g p z i , z j q “ p´} z i ´ z j } { q . In this part, we use19 to represent the dimension of the latent vectors (that is, z , ¨ ¨ ¨ , z n P R d ) and k to represent thefitting dimension in Algorithm 1. We fix d “ n “ i “ , . . . , d , let p z q i , ¨ ¨ ¨ , p z t n { u q i iid „ p µ q i ` N r´ , s p , q and p z t n { u ` q i , ¨ ¨ ¨ , p z n q i iid „ p µ q i ` N r´ , s p , q where N r´ , s p , q is the standardnormal distribution restricted onto the interval r´ , s . Finally for given kernel function (cid:96) p¨ , ¨q , set G ‹ “ J LJ where L ij “ (cid:96) p z i , z j q .We run both the convex approach and Algorithm 1 with different fitting dimensions. The boxplotfor the relative estimation errors and the singular value decay of the kernel matrix under distancekernel and Gaussian kernel are summarized in Figure 3 and Figure 4 respectively.As we can see, under the generalized model, the non-convex algorithm exhibits bias-variancetradeoff with respect to the fitting dimension, which depends on the singular value decay of thekernel matrix. The advantage of the convex method is the adaptivity to the unknown kernelfunction. ll lll k = 3 k = 5 k = 7 convex R e l a t i v e E rr o r − Q l l l l l l l l l l llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll S i ngu l a r V a l ue s Figure 3: The log-log plot of relative estimation errors of both convex and non-convex approachunder the distance kernel (cid:96) d p z i , z j q “ ´} z i ´ z j } (left panel). The log-log plot of ordered eigenvaluesof G ‹ (right panel).When the true underlying model is not the inner-product model, Theorem 4.4 indicates that theoptimal choice of fitting dimension k should depend on the size of the network. To illustrate suchdependency, we vary both network size and fitting dimension, of which the results are summarizedin Figure 5. As the size of the network increases, the optimal choice of fitting dimension increasesas well. In this section, we demonstrate how the model and fitting methods can be used to explore real worlddatasets that involve large networks. In view of the discussion in Section 4.2 and Section 5, we can20 l l k = 3 k = 6 k = 9 convex R e l a t i v e E rr o r − Q l l l l l l l l l l l lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll S i ngu l a r V a l ue s Figure 4: log-log plot for the relative estimation errors of both convex and non-convex approachunder the Gaussian kernel (cid:96) g p z i , z j q “ p´} z i ´ z j } { q (left panel). The log-log plot of orderedeigenvalues of G ‹ (right panel). l l l l l
500 1000 2000 4000 8000
Number of Nodes R e l a t i v e E rr o r − Q l k=3k=6k=9 l l l l l
500 1000 2000 4000 8000
Number of Nodes R e l a t i v e E rr o r − Q l k=3k=6k=9 Figure 5: log-log plot for the relative estimation error with varying network size under distancekernel (cid:96) d p z i , z j q “ ´} z i ´ z j } (left panel) and under the Gaussian kernel (cid:96) g p z i , z j q “ p´} z i ´ z j } { q (right panel).always employ the inner-product model (1) as the working model. In particular, we illustrate threedifferent aspects. First, we consider community detection on networks without covariate. To thisend, we compare the performance of simple k -means clustering on fitted latent variables with severalstate-of-the-art methods. Next, we investigate community detection on networks with covariates.In this case, we could still apply k -means clustering on fitted latent variables. Whether there iscovariate or not, we can always visualize the network by plotting fitted latent variables in someappropriate way. Furthermore, we study how fitting the model can generate new feature variablesto aid content-based classification of documents. The ability of feature generation also makes21he model and the fitting methods potentially useful in other learning scenarios when additionalnetwork information among both training and test data is present. Community detection on networks without covariate has been intensively studied from boththeoretical and methodological viewpoints. Thus, it naturally serves as a test example for theeffectiveness of the model and fitting methods we have proposed in previous sections. To adaptour method to community detection, we propose to partition the nodes by the following two stepprocedure:1. Fit the inner-product model to data with Algorithm 1;2. Apply a simple k -means clustering on the fitted latent variables.In what follows, we call this two step procedure LSCD (Latent Space based Community Detection).We shall compare it with three state-of-the-art methods: (i) SCORE [32]: a normalized spectralclustering method developed under degree-corrected block models (DCBM); (2) OCCAM [58]:a normalized and regularized spectral clustering method for potentially overlapping communitydetection; (3) CMM [16]: a convexified modularity maximization method developed under DCBM.(4) Latentnet [38]: a hierachical bayesian method based on the latent space clustering model [27].To avoid biasing toward our own method, we compare these methods on three datasets thathave been previously used in the original papers to justify the first three methods at comparison:a political blog dataset [1] that was studied in [32] and two Facebook datasets (friendship networksof Simmons College and Caltech) [53] that were studied in [16]. To make fair comparison, for allthe methods, we supplied the true number of communities in each data. When fitting our model,we set the latent space dimension to be the same as the number of communities.In the latentnet package [38], there are three different ways to predict the communitymembership. Using the notation of the R package [38], they are mkl$Z.K , mkl$mbc$Z.K and mle$Z.K . We found that mkl$mbc$Z.K consistently outperformed the other two on these dataexamples and we thus used it as the outcome of Latentnet. Due to the stochastic nature of theBayesian approach, we repeated it 20 times on each dataset and reported both the average errorsas well as the standard deviations (numbers in parentheses).Table 1 summarizes the performance of all five methods on the three datasets. 
Among all themethods at comparison, all methods performed well on the political blog dataset with Latentnetbeing the best, and LSCD outperformed all other methods on the two Facebook datasets. On theCaltech friendship dataset, it improved the best result out of the other four methods by almost15% in terms of proportion of mis-clustered nodes.Dataset % (0.117%)Simmons College 4 olitical Blog This well-known dataset was recorded by [1] during the 2004 U.S. PresidentialElection. The original form is a directed network of hyperlinks between 1490 political blogs. Theblogs were manually labeled as either liberal or conservative according to their political leanings.The labels were treated as true community memberships. Following the literature, we removed thedirection information and focused on the largest connected component which contains 1222 nodesand 16714 edges. All five methods performed comparably on this dataset with Latentnet achievingthe smallest mis-clustered proportion.
Simmons College
The Simmons College Facebook network is an undirected graph that contains1518 nodes and 32988 undirected edges. For comparison purpose, we followed the same pre-processing steps as in [16] by considering the largest connected component of the students withgraduation year between 2006 and 2009, which led to a subgraph of 1137 nodes and 24257 edges.It was observed in [53] that the class year has the highest assortativity values among all availabledemographic characteristics, and so we treated the class year as the true community label. On thisdataset, LSCD achieved the lowest mis-clustered proportion among these methods, with CMM aclose second lowest.An important advantage of model (1) is that it can provide a natural visualization of thenetwork. To illustrate, the left panel of Figure 6 is a 3D visualization of the network with the firstthree coordinates of the estimated latent variables. From the plot, one can immediately see threebig clusters: class year 2006 and 2007 combined (red), class year 2008 (green) and class year 2009(blue). The right panel zooms into the cluster that includes class year 2006 and 2007 by projectingthe the estimated four dimensional latent vectors onto a two dimensional discriminant subspacethat was estimated from the fitted latent variables and the clustering results of LSCD. It turnedout that class year 2006 and 2007 could also be reasonably distinguished by the latent vectors. lllll ll l ll ll l l l ll l ll lll l ll lll ll l lll lll lll llll ll lll l lll llll ll l l lll l l l ll ll ll l ll l ll ll l lll ll llll l lllll ll l lll lll ll ll ll l ll ll lll ll ll lll l lll l ll lllll lll ll lll lll l ll l lll llll l lll l ll l ll lll ll l llll ll ll ll l llll llll ll l ll lll ll llll ll l lll lll ll ll l
Class Year l Figure 6: The left panel is a visualization of the network with the first three coordinates of theestimated latent vectors. The right panel is a visualization of students in class year 2006 and 2007by projecting the four dimensional latent vectors to an estimated two dimensional discriminantsubspace. 23 altech Data
In contrast to the Simmons College network in which communities are formedaccording to class years, communities in the Caltech friendship network are formed according todorms [52, 53]. In particular, students spread across eight different dorms which we treated as truecommunity labels. Following the same pre-processing steps as in [16], we excluded the studentswhose residence information was missing and considered the largest connected component of theremaining graph, which contained 590 nodes and 12822 undirected edges. This dataset is morechallenging than the Simmons College network. Not only the size of the network halves but thenumber of communities doubles. In some sense, it serves the purpose of testing these methods whenthe signal is weak. LSCD also achieved the highest overall accuracy on this dataset, reducing thesecond best error rate (achieved by CMM) by nearly 15%. See the last row of Table 1. Moreover,LSCD achieved the lowest maximum community-wise misclustering error among the four methods.See Figure 7 for a detailed comparison of community-wise misclustering rates of the five methods.It is worth noting that the two spectral methods, SCORE and OCCAM, fell behind on the twoFacebook datasets. One possible explanation is that the structures of these Facebook networks aremore complex than the political blog network and so DCBM suffers more under-fitting on them. Incontrast, the latent space model (1) is more expressive and goes well beyond simple block structure.The Latentnet approach did not perform well on the Facebook datasets, either. One possible reasonis the increased numbers of communities compared to the political blog dataset, which substantiallyincreased the difficulty of sampling from posterior distributions. P r ed i c t ed D o r m s b y L S CD P r ed i c t ed D o r m s b y S C O R E P r ed i c t ed D o r m s b y O CC A M P r ed i c t ed D o r m s b y C MM P r ed i c t ed D o r m s b y La t en t ne t Clustering Error
Figure 7: Comparison of community-wise misclustering errors in Caltech friendship network. Toprow, left to right: LSCD, SCORE and OCCAM; bottom row, left to right: CMM and Latentnet.24 .2 Community detection with covariate
We now further demonstrate the power of the model and our proposed fitting methods byconsidering community detection on networks with covariates. Again, we used the LSCD procedurelaid out in the previous subsection for community detection.To this end, we consider a lawyer network dataset which was introduced in [40] that studiedthe relations among 71 lawyers in a New England law firm. The lawyers were asked to checkthe names of those who they socialized with outside work, who they knew their family and viceversa. There are also several node attributes contained in the dataset: status (partner or associate),gender, office, years in the firm, age, practice (litigation or corporate), and law school attended,among which status is most assortative. Following [59], we took status as the true community label.Furthermore, we symmetrized the adjacency matrix, excluded two isolated nodes and finally endedup with 69 lawyers connected by 399 undirected edges.Visualization and clustering results with and without covariate are shown in Figure 8. On theleft panel, as we can see, the latent vectors without adjustment by any covariate worked reasonablywell in separating the lawyers of different status and most of the 12 errors (red diamonds) werecommitted on the boundary. On the right panel, we included a covariate ‘practice’ into the latentspace model: we set X ij “ X ji “ i ‰ j and the i th and the j th lawyers shared the same practice,and X ij “ X ji “ βX . The predicted community memberships of lawyers indexed by orange numbers(39, 43, 45, 46, 51, 58) were successfully corrected after introducing this covariate. So the numberof mis-clustered nodes was reduced by 50%. We also observed that lawyer 37, though still mis-clustered, was significantly pushed towards the right cluster. True Status ll associatepartner Predicted associatepartner
Without Covariate
True Status ll associatepartner Predicted associatepartner
Practice type as Covariate
Figure 8: Visualization of the lawyer network using the estimated two dimensional latent vectors.The left panel shows results without including any covariate while the right panel shows resultsthat used practice type information. 25 .3 Network assisted learning
In this section, we demonstrate how fitting model (1) can generate new features to be used inmachine learning applications when additional network information is available. Consider a networkwith n nodes and observed adjacency matrix A . Suppose the profile of the nodes is represented by d dimensional features, denoted by x , ¨ ¨ ¨ , x n P R d . Assume each node is associated with a label(or say, variable of interest), denoted by y , either continuous or categorical. Suppose the labelsare only observed for a subset of the nodes in the network. Without loss of generality, we assume y , ¨ ¨ ¨ , y m are observed for some m ă n . The goal here is to predict the labels y m ` , ¨ ¨ ¨ , y n basedon the available information. Without considering the network information, this is the typicalsetup of supervised learning with labeled training set p x , y q , ¨ ¨ ¨ , p x m , y m q and unlabeled test set x m ` , ¨ ¨ ¨ , x n . As one way to utilize the network information, we propose to supplement the existingfeatures in the prediction task with the latent vectors estimated by Algorithm 1 (without any edgecovariates).To give a specific example, we considered the Cora dataset [44]. It contains 2708 machinelearning papers which were manually classified into 7 categories: Neural Networks, Rule Learning,Reinforcement Learning, Probabilistic Methods, Theory, Genetic Algorithms and Case Based. Thedataset also includes the contents of the papers and a citation network, which are represented bya document-word matrix (the vocabulary contains 1433 frequent words) and an adjacency matrixrespectively. The task is to predict the category of the papers based on the available information.For demonstration purpose, we only distinguish neural network papers from the other categories,and so the label y is binary.Let W be the document-word matrix. In the present example, W is of size 2708 ˆ W ij equals 1 if the i th document contains the j th word.Otherwise W ij equals zero. As a common practice in latent semantic analysis, to represent thetext information as vectors, we extract leading- d principal component loadings from W W J as thefeatures. We chose d “
100 by maximizing the prediction accuracy using cross-validation.However, how to utilize the information contained in the citation network for the desired learningproblem is less straightforward. We propose to augment the latent semantic features with the latentvectors estimated from the citation network. Based on the simple intuition that papers in thesame category are more likely to cite each other, we expect the latent vectors, as low dimensionalsummary of the network, to contain information about the paper category. The key message wewant to convey here is that with vector representation of the nodes obtained from fitting thelatent space model, network information can be incorporated in many supervised and unsupervisedlearning problems and other exploratory data analysis tasks.Back to the Cora dataset, for illustration purpose, we fitted standard logistic regressions withthe following three sets of features:1. the leading 100 principal component loadings;2. estimated degree parameters ˆ α i and latent vectors ˆ z i obtained from Algorithm 1;3. the combination of features in 1 and 2.We considered three different latent space dimensions: k “ , ,
10. As we can see from Figure 9,the latent vectors contained a considerable amount of predictive power for the label. Adding thelatent vectors to the principal components of the word-document matrix could substantially reducemisclassification rate. 26 ll lll lllllll llll lllll lllllllllll lllllllllllllllllllllllllllllllllll C l a ss i f i c a t i on E rr o r Figure 9: Boxplots of misclassification rates using logistic regression with different feature sets. Werandomly split the dataset into training and test sets with size ratio 3:1 for 500 times and computedmisclassification errors for each configuration. PC represents the leading 100 principal componentloadings of the document-word matrix. Z p k q represents the feature matrix where the i th row is theconcatenation of the estimated degree parameter p α i and the estimated latent vector p z i with latentdimension k . PC+ Z p k q means the combination of the two sets of features. In this section, we discuss a number of related issues and potential problems for future research.
Data-driven choice of latent space dimension
For the projected gradient descent method,i.e., Algorithm 1, one needs to specify the latent space dimension k as an input. AlthoughTheorem 4.4 suggests that the algorithm could still work reasonably well if the specified latentspace dimension is slightly off the target, it is desirable to have a systematic approach to selecting k based on data. One possibility is to inspect the eigenvalues of G T in Algorithm 2 and set k tobe the number of eigenvalues larger than the parameter λ n used in the algorithm. Undirected networks with multiple covariates and weighted edges
The model (1) andthe fitting methods can easily be extended to handle multiple covariates. When the number ofcovariates is fixed, error bounds analogous to those in Section 4 can be expected. We omit thedetails. Moreover, as pointed out in [26], latent space models for binary networks such as (1)can readily be generalized to weighted networks, i.e., networks with non-binary edges. We referinterested readers to the general recipe spelled out in Section 3.9 of [26]. If the latent variablesenter a model for weighted networks in the same way as in model (1), we expect the key ideasbehind our proposed fitting methods to continue to work.
Directed networks
In many real world networks, edges are directed. Thus, it is a natural nextstep to generalize model (1) to handle such data. Suppose for any i ‰ j , A ij “ i to node j , and A ij “ i ‰ j , A ij ind. „ Bernoulli p P ij q , with logit p P ij q “ Θ ij “ α i ` γ j ` βX ij ` z J i w j . (25)Here, the α i ’s P R model degree heterogeneity of outgoing edges while the γ j ’s P R modelheterogeneity of incoming edges. The meaning of β is the same as in model (1). To furtheraccommodate asymmetry, we associate with each node two latent vectors z i , w i P R k , where the z i ’s are latent variables influencing outgoing edges and the w i ’s incoming edges. Such a model hasbeen proposed and used in the study of recommender system [3] and it is also closely connectedto the latent eigenmodel proposed in [30] if one further restricts z i P t w i , ´ w i u for each i . Underthis model, the idea behind the convex fitting method in Section 3.1 can be extended. However,it is more challenging to devise a non-convex fitting method with similar theoretical guarantees towhat we have in the undirected case. On the other hand, it should be relatively straightforwardto further extend the ideas to directed networks with multiple covariates and weighted edges. Arecent paper [56] has appeared after the initial posting of the present manuscript, which obtainedsome interesting results along these directions. We present here the proofs of Theorem 4.3 and Theorem 4.4 since Theorem 4.1 is a corollary of theformer and Theorem 4.2 a corollary of the latter. Throughout the proof, let P “ p σ p Θ ‹ ,ij qq and P “ p P ij i ‰ j q . Thus, E p A q “ P . Moreover, for any Θ P R n ˆ n , define h p Θ q “ ´ n ÿ i,j “ t A ij Θ ij ` log p ´ σ p Θ ij qqu . (26)For conciseness, we denote F g p n, M , M , X q by F g throughout the proof. We focus on the casewhere X is nonzero, and the case of X “ Let Z ‹ P R n ˆ k such that Z ‹ Z J‹ is the best rank k approximation to G ‹ . For any matrix M , letcol p M q be the subspace spanned by the column vectors of M and row p M q “ col p M J q . For anysubspace S of R n (or R n ˆ n ), let S K be its orthogonal complement, and P S the projection operatoronto the subspace. The proof relies on the following two lemmas. Lemma 8.1.
Let M K k “ t M P R n ˆ n : row p M q Ă col p Z ‹ q K and col p M q Ă col p Z ‹ q K u and M k be its orthogonal complement in R n ˆ n under trace inner product. If λ n ě } A ´ P } op , then for s G k “ P M K k G ‹ , we have } ∆ p G } ˚ ď ? k } P M k ∆ p G } F ` } ∆ p α n J } F ` λ n |x A ´ P, ∆ p β X y| ` } s G k } ˚ . Proof.
See Section 8.1.1. 28 emma 8.2.
For any k ě such that Assumption 4.1 holds. Choose λ n ě max t } A ´ P } op , u and |x A ´ P, X y| ď λ n ? k } X } F . There exist constants C ą and ď c ă such that } ∆ p Θ } ě p ´ c q ` } ∆ p G } ` } ∆ p α n J } ` } ∆ p β X } ˘ ´ C } s G k } ˚ { k, and } ∆ p Θ } ď p ` c q ` } ∆ p G } ` } ∆ p α n J } ` } ∆ p β X } ˘ ` C } s G k } ˚ { k. Proof.
See Section 8.1.2.
Lemma 8.3.
There exist absolute constants c, C such that for any Θ P F g with probability at least ´ n ´ c , the following inequality holds } A ´ P } op , @ A ´ P, X D ? k } X } F ď C b max t ne ´ M , log n u . Proof.
For any Θ in the parameter space, the off diagonal elements of Θ are uniformly boundedfrom above by ´ M , and so max i,j P ij ď e ´ M . Moreover, max i P ii ď } A ´ P } op ď } A ´ P } op ` } P ´ P } op ď } A ´ P } op `
1. Together with Lemma 8.12, thisimplies that there exist absolute constants c , C ą P ˆ } A ´ P } op ď C b max t ne ´ M , log n u ˙ ě ´ n ´ c . (27)Since the diagonal entries of X are all zeros, we have x A ´ P, X y “ x A ´ P , X y . Hence, Lemma 8.13implies that uniformly over the parameter space, P ˜ @ A ´ P, X D ? k } X } F ď C b max t ne ´ M , log n u ¸ ě ´ ` ´ C max (cid:32) ne ´ M , log n ( k { ˘ ě ´ n ´ C k { . (28)Combining (27) and (27) finishes the proof. Proof of Theorem 4.3 ˝ We first establish the deterministic bound. Observe that p Θ “ p α n J ` n p α J ` p βX ` p G is the optimal solution to (11), and that the true parameter Θ ‹ “ α ‹ n J ` n α J‹ ` β ‹ X ` G ‹ is feasible. Thus, we have the basic inequality h p p Θ q ´ h p Θ ‹ q ` λ n p} p G } ˚ ´ } G ‹ } ˚ q ď , (29)where h is defined in (26). For any Θ P F g , | Θ ij | ď M for all i, j and so for τ “ e M {p ` e M q ,the Hessian ∇ h p Θ q “ diag ` vec ` σ p Θ q ˝ p ´ σ p Θ qq ˘˘ ľ τ I n ˆ n . For any vector b , diag p b q is the diagonal matrix with elements of a on its diagonals. For any matrix B “ r b , . . . , b n s P R n ˆ n , vec p B q P R n is obtained by stacking b , . . . , b n in order. For any squarematrices A and B , A ľ B if and only if A ´ B is positive semi-definite. With the last display,Taylor expansion gives h p p Θ q ´ h p Θ ‹ q ě x ∇ Θ h p Θ ‹ q , ∆ p Θ y ` τ } ∆ p Θ } .
29n the other hand, triangle inequality implies λ n p} p G } ˚ ´ } G ‹ } ˚ q ě ´ λ n } ∆ G } ˚ . Together with (29), the last two displays imply x ∇ Θ h p Θ ‹ q , ∆ p Θ y ` τ } ∆ p Θ } F ´ λ n } ∆ p G } ˚ ď . Triangle inequality further implies τ } ∆ Θ } F ď λ n } ∆ G } ˚ ` |x ∇ Θ h p Θ ‹ q , ∆ p G ` ∆ p α n J ` n ∆ J p α y| ` | ∆ p β x ∇ Θ h p Θ ‹ q , X y|“ λ n } ∆ G } ˚ ` |x A ´ P, ∆ p G ` p α n J y| ` | ∆ p β x A ´ P, X y|ď λ n } ∆ G } ˚ ` |x A ´ P, ∆ p G ` p α n J y| ` λ n ? k } ∆ p β X } F . (30)Here the equality is due to the symmetry of A ´ P and the last inequality is due to the conditionimposed on λ n . We now further upper bound the first two terms on the rightmost side. First, byLemma 8.1 and the assumption that |x A ´ P, X y| ď λ n ? k } X } F , we have } ∆ G } ˚ ď ? k } P M k ∆ p G } F ` } ∆ p α n J } F ` ? k } ∆ p β X } F ` } s G k } ˚ . (31)Moreover, H¨older’s inequality implies |x A ´ P, ∆ p G ` p α n J y| ď } A ´ P } op p} ∆ p G } ˚ ` } ∆ p α n J } ˚ q“ } A ´ P } op p} ∆ p G } ˚ ` } ∆ p α n J } F qď λ n p} ∆ p G } ˚ ` } ∆ p α n J } F q . (32)Here the equality holds since ∆ p α n J is a rank one matrix. Substituting (31) and (32) into (30), weobtain that τ } ∆ p Θ } ď λ n } ∆ p G } ˚ ` λ n } ∆ p α n J } F ` λ n ? k } ∆ p β X } F ď λ n p ? k } P M k ∆ p G } F ` } ∆ p α n J } F ` ? k } ∆ p β X } F ` } s G k } ˚ q` λ n } ∆ p α n J } F ` λ n ? k } ∆ p β X } F ď C λ n ` ? k p} P M k ∆ p G } F ` } ∆ p α n J } F ` } ∆ p β X } F q ` } s G k } ˚ ˘ . By Lemma 8.2, we can further bound the righthand side as τ } ∆ p Θ } ď C λ n ? k p} ∆ p Θ } F ` } s G k } ˚ {? k q ` C λ n } s G k } ˚ ď C λ n ? k } ∆ p Θ } F ` p C ` C q λ n } s G k } ˚ . Solving the quadratic inequality, we obtain } ∆ p Θ } ď C ˆ λ n kτ ` λ n } s G k } ˚ τ ˙ . τ ě ce ´ M for some positive constant c . Therefore, } ∆ p Θ } ď C ` e M λ n k ` e M λ n } s G k } ˚ ˘ . ˝ We now turn to the probabilistic bound. By Lemma 8.3, there exist constants c , C suchthat for any λ n ě C a max t ne ´ M , log n u , we have uniformly over the parameter space that P ˜ λ n ě } A ´ P } op , @ A ´ P, X D ? k } X } F +¸ ě ´ n ´ c . Denote this event as E . Since the conditions on λ n in the first part of Theorem 4.3 are satisfied on E , it follows that there exists an absolute constant C ą ´ n ´ c , } ∆ p Θ } ď Cφ n . This completes the proof. By the convexity of h p Θ q , h p p Θ q ´ h p Θ ‹ q ě x ∇ Θ h p Θ ‹ q , ∆ p Θ y“ ´x A ´ P, ∆ p G ` p α n J ` ∆ p β X yě ´ } A ´ P } op ` } ∆ p G } ˚ ` } ∆ p α n J } ˚ ˘ ´ |x A ´ P, ∆ p β X y|ě ´ λ n ` } P M k ∆ p G } ˚ ` } P M K k ∆ p G } ˚ ` } ∆ p α n J } F ˘ ´ |x A ´ P, ∆ p β X y| . The last inequality holds since λ n ě } A ´ P } op and P M k ` P M K k equals identity. On the otherhand, by the definition of s G k , } p G } ˚ ´ } G ‹ } ˚ “ } P M k G ‹ ` s G k ` P M k ∆ p G ` P M K k ∆ p G } ˚ ´ } P M k G ‹ ` s G k } ˚ ě } P M k G ‹ ` P M K k ∆ p G } ˚ ´ } s G k } ˚ ´ } P M k ∆ p G } ˚ ´ } P M k G ‹ } ˚ ´ } s G k } ˚ “ } P M k G ‹ } ˚ ` } P M K k ∆ p G } ˚ ´ } s G k } ˚ ´ } P M k ∆ p G } ˚ ´ } P M k G ‹ } ˚ “ } P M K k ∆ p G } ˚ ´ } P M k ∆ p G } ˚ ´ } s G k } ˚ . Here, the second last equality holds since P M k G ‹ and P M K k ∆ p G have orthogonal column and rowspaces. 
Furthermore, since p Θ is the optimal solution to (11), and Θ ‹ is feasible, the basic inequalityand the last two displays imply0 ě h p p Θ q ´ h p Θ ‹ q ` λ n ` } p G } ˚ ´ } G ‹ } ˚ ˘ ě ´ λ n ` } P M k ∆ p G } ˚ ` } P M K k ∆ p G } ˚ ` } ∆ p α n J } F ˘ ´ |x A ´ P, ∆ p β X y| ` λ n ` } P M K k ∆ p G } ˚ ´ } P M k ∆ p G } ˚ ´ } s G k } ˚ ˘ “ λ n ` } P M K k ∆ p G } ˚ ´ } P M k ∆ p G } ˚ ´ } s G k } ˚ ´ } ∆ p α n J } F ˘ ´ |x A ´ P, ∆ p β X y| . Rearranging the terms leads to } P M K k ∆ p G } ˚ ď } P M k ∆ p G } ˚ ` } ∆ p α n J } F ` λ n |x A ´ P, ∆ p β X y| ` } s G k } ˚ , } ∆ p G } ˚ ď } P M k ∆ p G } ˚ ` } ∆ p α n J } F ` λ n | @ A ´ P, ∆ p β X D | ` } s G k } ˚ . Last but not least, note that the rank of P M k ∆ p G is at most 2 k , and so we complete the proof byfurther bounding the first term on the righthand side of the last display by 4 ? k } P M k ∆ p G } F . By definition, we have the decomposition } ∆ p Θ } “ } ∆ p G ` ∆ p α n J ` n ∆ J p α ` ∆ p β X } “ } ∆ p G ` ∆ p α n J ` n ∆ J p α } ` } ∆ p β X } ` x ∆ p G ` ∆ p α n J ` n ∆ J p α , ∆ p β X y“ } ∆ p G } ` } ∆ p α n J } ` p ∆ p α n J ∆ p α n J q ` } ∆ p β X } ` x ∆ p G ` p α n J , ∆ p β X y . Here the last equality is due to the symmetry of X and the fact that ∆ p G n “
0. SinceTr p ∆ p α n J ∆ p α n J q “ Tr p n J ∆ p α n J ∆ p α q “ | n J ∆ p α | ě
0, the last display implies } ∆ p Θ } ě } ∆ p G } ` } ∆ p α n J } ` } ∆ p β X } ` x ∆ p G ` p α n J , ∆ p β X y . (33)Furthermore, we have |x ∆ p G ` p α n J , ∆ p β X y|ď } ∆ p G } ˚ } ∆ p β X } op ` } ∆ p α n J } ˚ } ∆ p β X } op ď ` ? k } P M k ∆ p G } F ` } ∆ p α n J } F ` λ n |x A ´ P, ∆ p β X y| ` } s G k } ˚ ˘ } ∆ p β X } op ď ` ? k } P M k ∆ p G } F ` } ∆ p α n J } F ` ? k } ∆ p β X } F ` } s G k } ˚ ˘ } ∆ p β X } F a r stable p X qď C ? k a r stable p X q ` } ∆ p G } ` } ∆ p α n J } ` } ∆ p β X } ˘ ` } s G k } ˚ a r stable p X q } ∆ p β X } F ď C ? k a r stable p X q ` } ∆ p G } ` } ∆ p α n J } ` } ∆ p β X } ˘ ` } s G k } ˚ c r stable p X q ` c } ∆ p β X } for any constant c ě
0. Here, the first inequality holds since the operator norm and the nuclearnorm are dual norms under trace inner product. The second inequality is due to Lemma 8.1 andthe fact that } ∆ p α n J } ˚ “ } ∆ p α n J } F since ∆ p α n J is of rank one. The third inequality is due tothe definition of r stable p X q and that |x A ´ P, X y| ď λ n ? k } X } F by assumption and ∆ p β is a scalar.The fourth inequality is due to Assumption 4.1 and the last due to 2 ab ď a ` b for any a, b P R .Substituting these inequalities into (33) leads to } ∆ p Θ } ě ˜ ´ C ? k a r stable p X q ¸ } ∆ p G } ` ˜ ´ C ? k a r stable p X q ¸ } ∆ p α n J } ` ˜ ´ C ? k a r stable p X q ´ c ¸ } ∆ p β X } ´ } s G k } ˚ c r stable p X q .
32n the other hand, notice that Tr p ∆ p α n J ∆ p α n J q ď } ∆ p α n J } F , we have } ∆ p Θ } ď ˜ ` C ? k a r stable p X q ¸ } ∆ p G } ` ˜ ` C ? k a r stable p X q ¸ } ∆ p α n J } ` ˜ ` C ? k a r stable p X q ` c ¸ } ∆ p β X } ` } s G k } ˚ c r stable p X q . Together with Assumption 4.1, the last two displays complete the proof.
Again, we directly prove the results under the general model. Recall that G ‹ « U k D k U J k is thetop- k eigen-decomposition of G ‹ , Z ‹ “ U k D { k , s G k “ G ‹ ´ U k D k U J k and ∆ G t “ Z t p Z t q J ´ Z ‹ Z J‹ .For the convenience of analysis, we will instead analyze the following quantity, r e t “ ›› Z ›› } ∆ Z t } ` } ∆ α t n J } ` } ∆ β t X } . Under Assumption 4.2, } ∆ Z } op ď δ } Z ‹ } op , p ´ δ q e t ď r e t ď p ` δ q e t . (34)for some sufficiently small constant δ P p , q . The rest of the proof relies on the following lemmas. Lemma 8.4.
For any Θ ‹ P F g p n, M , M , X q , max ď i ď n }p Z ‹ q i } ď M { .Proof. By definition, G ‹ ´ Z ‹ Z J‹ P S n ` , which implies, e J i ` G ‹ ´ Z ‹ Z J‹ ˘ e i “ G ii ´ }p Z ‹ q i } ě
0, thatis }p Z ‹ q i } ď G ii ď M { ď i ď n . Lemma 8.5.
If Assumption 4.1 holds, there exist constants ď c ă and C such that } ∆ Θ t } ě p ´ c q ` } Z t p Z t q J ´ Z ‹ Z J‹ } ` } ∆ α t n J } ` } ∆ β t X } ˘ ´ C } s G k } , } ∆ Θ t } ď p ` c q ` } Z t p Z t q J ´ Z ‹ Z J‹ } ` } ∆ α t n J } ` } ∆ β t X } ˘ ` C } s G k } . Proof.
See Section 8.2.1.
Lemma 8.6.
Under Assumption 4.1, let ζ n “ max t } A ´ P } op , |x A ´ P, X { } X } F y|{? k, u , if } ∆ Z t } F ď c e ´ M } Z ‹ } op { κ Z ‹ and } Z ‹ } ě C e M κ Z ‹ ζ n for sufficiently small constant c andsufficiently large constant C , there exist a constant c such that, for any η ď c , there exist positiveconstants ρ and C , r e t ` ď ´ ´ ηe M κ ρ ¯ r e t ` ηC ` } s G k } ` e M ζ n k ˘ . Proof.
See Section 8.2.2. 33 emma 8.7.
Under Assumption 4.1, let ζ n “ max t } A ´ P } op , |x A ´ P, X { } X } F y|{? k, u ,if } Z ‹ } ě C κ Z ‹ ζ n e M max !b η } s G k } { ζ n , a ηke M , ) for a sufficiently large constant C and r e ď c e ´ M } Z ‹ } { κ Z ‹ , then for all t ě , } ∆ Z t } F ď c e M κ Z ‹ } Z ‹ } op . Proof.
See Section 8.2.3.
Proof of Lemma 4.1
By Lemma 8.5, notice that s G k “ } ∆ Θ t } ě p ´ c q ` } Z t p Z t q J ´ Z ‹ Z J‹ } ` } ∆ α t n J } ` } ∆ β t X } ˘ , } ∆ Θ t } ď p ` c q ` } Z t p Z t q J ´ Z ‹ Z J‹ } ` } ∆ α t n J } ` } ∆ β t X } ˘ . (35)By Lemma 8.9, } Z t p Z t q J ´ Z ‹ Z J‹ } ě p? ´ q κ ´ Z ‹ } Z ‹ } } ∆ Z t } which implies, e t ď κ Z ‹ p? ´ q } Z t p Z t q J ´ Z ‹ Z J‹ } ` } ∆ α t n J } ` } ∆ β t X } ď κ Z ‹ p? ´ qp ´ c q } ∆ Θ t } . Similarly, by Lemma 8.10, when dist p Z t , Z ‹ q ď c } Z ‹ } op , } Z t p Z t q J ´ Z ‹ Z J‹ } ď p ` c q } Z ‹ } } ∆ Z t } , and this implies, e t ě p ` c q } Z t p Z t q J ´ Z ‹ Z J‹ } ` } ∆ α t n J } ` } ∆ β t X } ě p ` c q p ` c q } Z ‹ } } ∆ Z t } . Proof of Theorem 4.4
Consider the deterministic bound first. By Lemma 8.7, for all t ě } ∆ Z t } F ď c e M κ Z ‹ } Z ‹ } op . Then apply Lemma 8.6, there exists positive constants ρ and M such that for all t ě r e t ` ď ˜ ´ ηe M κ Z ‹ ρ ¸ r e t ` ηC ` } s G k } ` e M ζ n k ˘ . r e t ď ˜ ´ ηe M κ Z ‹ ρ ¸ t r e ` t ÿ i “ ηC ` } s G k } ` e M ζ n k ˘ ˜ ´ ηe M κ Z ‹ ρ ¸ i ď ˜ ´ ηe M κ Z ‹ ρ ¸ t r e ` Cκ ρ ` e M ζ n k ` e M } s G k } ˘ . Notice that 0 . e t ď r e t ď . e t , e t ď ˜ ´ ηe M κ Z ‹ ρ ¸ t e ` Cκ ρ ` e M ζ n k ` e M } s G k } ˘ . Given the last display, the proof of the probabilistic bound is nearly the same as that of thecounterpart in Theorem 4.3 and we leave out the details.
By definition, } ∆ G t } “ } Z t p Z t q J ´ Z ‹ Z J‹ ´ s G k } ě } Z t p Z t q J ´ Z ‹ Z J‹ } ` } s G k } ´ | @ Z t p Z t q J ´ Z ‹ Z J‹ , s G k D |ě } Z t p Z t q J ´ Z ‹ Z J‹ } ` } s G k } ´ } Z t p Z t q J ´ Z ‹ Z J‹ } F } s G k } F ě } Z t p Z t q J ´ Z ‹ Z J‹ } ` } s G k } ´ c } Z t p Z t q J ´ Z ‹ Z J‹ } ´ c ´ } s G k } ě p ´ c q} Z t p Z t q J ´ Z ‹ Z J‹ } ´ p c ´ ´ q} s G k } where the second last inequality comes from a ` b ě ab and holds for any c ě
0. Similarly, itcould be shown that } ∆ G t } ď p ` c q} Z t p Z t q J ´ Z ‹ Z J‹ } ` p ` c ´ q} s G k } . Expanding the term } ∆ Θ t } , we obtain } ∆ Θ t } “ } ∆ G t ` ∆ α t n J ` n ∆ J α t ` ∆ β t X } “ } ∆ G t ` ∆ α t n J ` n ∆ J α t } ` } ∆ β t X } ` @ ∆ G t ` ∆ α t n J ` n ∆ J α t , ∆ β t X D “ } ∆ G t } ` } ∆ α t n J } ` p ∆ α t n J ∆ α t n J q ` } ∆ β t X } ` @ ∆ G t ` α t n J , ∆ β t X D , where the last equality is due to the symmetry of X . Notice that Tr p ∆ p α n J ∆ p α n J q “ Tr p n J ∆ p α n J ∆ p α q “ | n J ∆ p α | ě } ∆ Θ t } ě p ´ c q} Z t p Z t q J ´ Z ‹ Z J‹ } ´ p c ´ ´ q} s G k } ` } ∆ α t n J } ` } ∆ β t X } ` @ Z t p Z t q J ´ Z ‹ Z J‹ ` α t n J , ∆ β t X D ´ @ s G k , ∆ β t X D . (36)35y H¨older’s inequality, | @ Z t p Z t q J ´ Z ‹ Z J‹ ` α t n J , ∆ β t X D | ď ` } Z t p Z t q J ´ Z ‹ Z J‹ } ˚ ` } ∆ α t n J } ˚ ˘ ›› ∆ β t X ›› op ď ´ ? k } Z t p Z t q J ´ Z ‹ Z J‹ } F ` } ∆ α t n J } F ¯ ›› ∆ β t X ›› op ď ´ ? k } Z t p Z t q J ´ Z ‹ Z J‹ } F ` } ∆ α t n J } F ¯ } ∆ β t X } F { a r stable p X qď C d k r stable p X q ` } Z t p Z t q J ´ Z ‹ Z J‹ } ` } ∆ α t n J } ` } ∆ β t X } ˘ , and for any c ą | @ s G k , ∆ β t X D | ď } s G k } F } ∆ β t X } F ď c } ∆ β t X } ` c } s G k } . Substitute these inequalities into (36), } ∆ p Θ } ě ˜ ´ C d k r stable p X q ´ c ¸ } Z t p Z t q J ´ Z ‹ Z J‹ } ` ˜ ´ C d k r stable p X q ¸ } ∆ α t n J } ` ˜ ´ C d k r stable p X q ´ c ¸ } ∆ β t X } ´ p c ´ ` { c q} s G k } . On the other hand, notice that Tr p ∆ α t n J ∆ α t n J q ď } ∆ α t n } F , we have } ∆ p Θ } ď ˜ ` C d k r stable p X q ` c ¸ } Z t p Z t q J ´ Z ‹ Z J‹ } ` ˜ ` C d k r stable p X q ¸ } ∆ p α n J } ` ˜ ` C d k r stable p X q ` c ¸ } ∆ β t X } ` p c ´ ` { c q} s G k } . This completes the proof.
Let Θ t “ α t n J ` n p α t q J ` β t X ` Z t p Z t q J , R t “ arg min R P R r ˆ r ,RR J “ I r } Z t ´ Z ‹ R } F , r R t “ arg min R P R r ˆ r ,RR J “ I r } r Z t ´ Z ‹ R } F and ∆ Z t “ Z t ´ Z ‹ R t , then } Z t ` ´ Z ‹ R t ` } ď } Z t ` ´ Z ‹ r R t ` } ď } r Z t ` ´ Z ‹ r R t ` } ď } r Z t ` ´ Z ‹ R t } . The first and the last inequalities are due to the definition of R t ` and r R t ` , and the secondinequality is due to the projection step. Plugging in the definition of r Z t ` , we obtain } Z t ` ´ Z ‹ R t ` } ď } Z t ´ Z ‹ R t } ` η Z } ∇ h p Θ t q Z t } ´ η Z @ ∇ h p Θ t q Z t , Z t ´ Z ‹ R t D “ } Z t ´ Z ‹ R t } ` η Z } ∇ h p Θ t q Z t } ´ η Z @ ∇ h p Θ t q , p Z t ´ Z ‹ R t qp Z t q J D . Z t p Z t q J ´ Z ‹ R t p Z t q J “ p Z t p Z t q J ´ Z ‹ Z J‹ q ` p Z t p Z t q J ` Z ‹ Z J‹ q ´ Z ‹ R p Z t q J . Also due to the symmetry of ∇ h p Θ t q , @ ∇ h p Θ t q , p Z t p Z t q J ` Z ‹ Z J‹ q ´ Z ‹ R p Z t q J D “ @ ∇ h p Θ t q , ∆ Z t ∆ J Z t D . Therefore, combine the above three equations, } Z t ` ´ Z ‹ R t ` } ď } Z t ´ Z ‹ R t } ` η Z } ∇ h p Θ t q Z t } ´ η Z @ ∇ h p Θ t q , ∆ Z t ∆ J Z t D ´ η Z @ ∇ h p Θ t q , p Z t p Z t q J ´ Z ‹ Z J‹ q D . (37)By similar and slightly simpler arguments, we also obtain } α t ` ´ α ‹ } ď } r α t ` ´ α ‹ } “ } α t ´ α ‹ } ` η α } ∇ h p Θ t q n } ´ η α @ ∇ h p Θ t q n , α t ´ α ‹ D . (38) } β t ` ´ β ‹ } ď } r β t ` ´ β ‹ } “ } β t ´ β ‹ } ` η β @ ∇ h p Θ t q , X D ´ η β @ ∇ h p Θ t q , p β t ´ β ‹ q X D . (39)For h p Θ q in (26), define H p Θ q “ E Θ ‹ r h p Θ qs ´ n ÿ i “ Θ ii σ p Θ ‹ ,ii q . Then it is straightforward to verify that ∇ H p Θ q “ σ p Θ q ´ σ p Θ ‹ q and so ∇ H p Θ ‹ q “
0. With η Z “ η {} Z } , η α “ η { n, η β “ η { } X } , the weighted sum ›› Z ›› ˆ (37)+2 n ˆ (38)+ } X } ˆ (39)is equivalent to r e t ` ď r e t ´ η @ ∇ h p Θ t q , Z t p Z t q J ´ Z ‹ Z J‹ ` p α t ´ α ‹ q n J ` p β t ´ β ‹ q X ` ∆ Z t ∆ J Z t D ` ´›› Z ›› η Z } ∇ h p Θ t q Z t } ` nη α } ∇ h p Θ t q n } ` } X } η β @ ∇ h p Θ t q , X D ¯ ď r e t ´ η @ ∇ h p Θ t q , ∆ s Θ t D ´ η @ ∇ h p Θ t q , ∆ Z t ∆ J Z t D ` ˜ η } Z } } ∇ h p Θ t q Z t } ` η n } ∇ h p Θ t q n } ` η } X } @ ∇ h p Θ t q , X D ¸ , where ∆ s Θ t “ Z t p Z t q J ´ Z ‹ Z J‹ ` ∆ α t n J ` n p ∆ α t q J ` ∆ β t X “ ∆ Θ t ´ s G k . Then, simple algebrafurther leads to r e t ` ď r e t ´ η @ ∇ h p Θ t q ´ ∇ H p Θ t q , ∆ s Θ t D ´ η @ ∇ H p Θ t q , ∆ Θ t D ´ η @ ∇ H p Θ t q , s G k D ´ η @ ∇ h p Θ t q , ∆ Z t ∆ J Z t D ` ˜ η } Z } } ∇ h p Θ t q Z t } ` η n } ∇ h p Θ t q n } ` η } X } @ ∇ h p Θ t q , X D ¸ ď r e t ´ η @ ∇ H p Θ t q , ∆ Θ t D ` η | @ ∇ h p Θ t q ´ ∇ H p Θ t q , ∆ s Θ t D | ` η | @ ∇ h p Θ q , ∆ Z t ∆ J Z t D |` η | @ ∇ H p Θ t q , s G k D | ` η ˜ } Z } } ∇ h p Θ t q Z t } ` n } ∇ h p Θ t q n } ` } X } @ ∇ h p Θ t q , X D ¸ “ r e t ´ ηD ` ηD ` ηD ` ηD ` η D . (40)37n what follows, we control Note that for any Θ P F g ,14 I n ˆ n ľ ∇ H p Θ q “ diag ´ vec ` σ p Θ q ˝ p ´ σ p Θ qq ˘¯ ľ τ I n ˆ n where τ “ e M {p ` e M q — e ´ M . Hence H p¨q is τ -strongly convex and -smooth. Further noticethat ∇ H p Θ ‹ q “
0, then by Lemma 8.11, D “ @ ∇ H p Θ t q , ∆ Θ t D ě τ { τ ` { } ∆ Θ t } ` τ ` { } σ p Θ t q ´ σ p Θ ‹ q} . By triangle inequality, D ď | @ σ p Θ ‹ q ´ A, Z t p Z t q J ´ Z ‹ Z J‹ D | ` | @ σ p Θ ‹ q ´ A, ∆ α t n J D | ` | @ σ p Θ ‹ q ´ A, ∆ β t X D | . Recall that ζ n “ max t } A ´ P } op , |x A ´ P, X { } X } F y|{? k, u , and so D ď ζ n } Z t p Z t q J ´ Z ‹ Z J‹ } ˚ ` ζ n } ∆ α t n J } ˚ ` ζ n ? k } ∆ β t X } F . Notice that Z t p Z t q J ´ Z ‹ Z J‹ has rank at most 2 k , D ď ζ n ? k } Z t p Z t q J ´ Z ‹ Z J‹ } F ` ζ n } ∆ α t n J } F ` ζ n ? k } ∆ β t X } F . Further by Cauchy-Schwarz inequality, there exists constant C such that for any positive constant c which we will specify later, D ď c ` } Z t p Z t q J ´ Z ‹ Z J‹ } ` } ∆ α t n J } ` } ∆ β t X } ˘ ` C c ζ n k. By Lemma 8.5, there exist constants c , C such that D ´ D ě ˆ p ´ c q τ τ ` ´ c ˙ ` } Z t p Z t q J ´ Z ‹ Z J‹ } ` } ∆ α t n J } F ` } ∆ β t X } ˘ ` τ ` { } σ p Θ t q ´ σ p Θ ‹ q} ´ C } s G k } ´ C c ζ n k. (41)By Lemma 8.9, D ´ D ě p? ´ q κ ˆ p ´ c q τ τ ` ´ c ˙ e t ` τ ` { } σ p Θ t q ´ σ p Θ ‹ q} ´ C } s G k } ´ C c ζ n k. To bound D , notice that ∆ Z t ∆ J Z t is a positive semi-definite matrix, D ď | @ ∇ h p Θ t q , ∆ Z t ∆ J Z t D | ď ›› ∇ h p Θ t q ›› op ›› ∆ Z t ∆ J Z t ›› ˚ “ ›› ∇ h p Θ t q ›› op Tr p ∆ Z t ∆ J Z t q ď ›› ∇ h p Θ t q ›› op } ∆ Z t } “ ›› ∇ h p Θ t q ´ ∇ H p Θ t q ` ∇ H p Θ t q ›› op } ∆ Z t } ď ›› ∇ h p Θ t q ´ ∇ H p Θ t q ›› op } ∆ Z t } ` ›› ∇ H p Θ t q ›› op } ∆ Z t } “ } σ p Θ ‹ q ´ A } op } ∆ Z t } ` ›› σ p Θ t q ´ σ p Θ ‹ q ›› op } ∆ Z t } ď ζ n } ∆ Z t } ` } σ p Θ t q ´ σ p Θ ‹ q} F } ∆ Z t } .
38y the assumption that } ∆ Z t } F ď c e M κ } Z ‹ } op , } σ p Θ t q ´ σ p Θ ‹ q} F } ∆ Z t } ď c e M κ } σ p Θ t q ´ σ p Θ ‹ q} F } ∆ Z t } F } Z ‹ } op ď c } σ p Θ t q ´ σ p Θ ‹ q} ` c c e M κ } ∆ Z t } } Z ‹ } . for any constant c to be specified later. Then D ď ˜ ζ n } Z ‹ } ` c c e M κ ¸ e t ` c } σ p Θ t q ´ σ p Θ ‹ q} . By the assumption that } Z ‹ } ě C κ ζ n e M for sufficiently large constant C , D ď ˆ C e M κ ` c c e M κ ˙ e t ` c } σ p Θ t q ´ σ p Θ ‹ q} . (42)For D simple algebra leads to D “ | @ ∇ H p Θ t q , s G k D | “ | @ σ p Θ t q ´ σ p Θ ‹ q , s G k D | ď } σ p Θ t q ´ σ p Θ ‹ q} F } s G k } F ď c } σ p Θ t q ´ σ p Θ ‹ q} ` c } s G k } (43)for any constant c to be specified later.We now turn to bounding D . To this end, we upper bound its three terms separately as follows.First, } ∇ h p Θ t q Z t } “ } ` ∇ h p Θ t q ´ ∇ H p Θ t q ˘ Z t ` ∇ H p Θ t q Z t } ď p}p ∇ h p Θ t q ´ ∇ H p Θ t qq Z t } ` } ∇ H p Θ t q Z t } qď p}p σ p Θ ‹ q ´ A q Z t } ` }p σ p Θ t q ´ σ p Θ ‹ qq Z t } qď p} σ p Θ ‹ q ´ A } } Z t } ` } σ p Θ t q ´ σ p Θ ‹ q} } Z t } qď ´ ζ n } Z t } ` ›› Z t ›› } σ p Θ t q ´ σ p Θ ‹ q} ¯ . Next, } ∇ h p Θ t q n } “ ››` ∇ h p Θ t q ´ ∇ H p Θ t q ˘ n ` ∇ H p Θ t q n ›› ď ´ ››` ∇ h p Θ t q ´ ∇ H p Θ t q ˘ n ›› ` } ∇ H p Θ t q n } ¯ ď ´ }p σ p Θ ‹ q ´ A q n } ` ›› p σ p Θ t q ´ σ p Θ ‹ qq n ›› ¯ ď n ´ ζ n ` } σ p Θ t q ´ σ p Θ ‹ q} ¯ . Furthermore, @ ∇ H p Θ t q , X D “ ´@ ∇ h p Θ t q ´ ∇ H p Θ t q , X D ` @ ∇ H p Θ t q , X D¯ ď ´@ σ p Θ ‹ q ´ A, X D ` @ σ p Θ t q ´ σ p Θ ‹ q , X D ¯ ď ´ ζ n k } X } ` } σ p Θ t q ´ σ p Θ ‹ q} } X } ¯ . p Z t , Z ‹ q ď c } Z ‹ } op , combining these inequalities yields D ď ¨˝ ›› Z t ›› } Z ‹ } ζ n k ` ζ n ` ζ n k ˛‚ ` ¨˝ ›› Z t ›› } Z ‹ } ` ˛‚ } σ p Θ t q ´ σ p Θ ‹ q} . By the assumption that } ∆ Z t } F ď c e M κ } Z ‹ } op for some sufficiently small c , D ď C ` ζ n k ` } σ p Θ t q ´ σ p Θ ‹ q} ˘ . (44)Combining (40), (41), (42), (43) and (44), we obtain r e t ` ď r e t ´ η ˆ p? ´ q κ ˆ p ´ c q τ τ ` ´ c ˙ ´ C e M κ ` c c e M κ ˙ e t ` η ˆ C ` c ˙ } s G k } ´ ˆ τ ` { ´ c ´ c ´ C η ˙ } σ p Θ t q ´ σ p Θ ‹ q} ` η C c ζ n k ` η C ζ n k where c , c , c are arbitrary constants, c is a sufficiently small constant, and C is a sufficientlylarge constant. Notice that τ — e ´ M . Choose c “ cτ and c, c , c , η small enough such that2 p? ´ q ˆ p ´ c q τ τ ` ´ c ˙ ´ e M C ´ c c e M ą r ρe ´ M , and1 τ ` { ´ c ´ c ´ C η ě , for some positive constant r ρ . Recall that r e t ě p ´ δ q e t . Then there exists a universal constant C ą r e t ` ď ´ ´ ηe M κ r ρ p ´ δ q ¯ r e t ` ηC ` } s G k } ` e M ζ n k ˘ . The proof is completed by setting ρ “ p ´ δ q r ρ . Note the claim is deterministic in nature and we prove by induction. At initialization we have } ∆ Z } F ď ˜ r e } Z } ¸ ď ˜ c e M κ } Z ‹ } } Z } ¸ “ c e M κ } Z ‹ } op } Z ‹ } op } Z } op ď c e M κ } Z ‹ } op , where the last inequality is obtained from ›› Z ›› op ě } Z ‹ } op ´ } ∆ Z } op ě ´ ´ c e M κ ¯ } Z ‹ } op ě } Z ‹ } op , where the second the the last inequalities are due to Assumption 4.2.40uppose the claim is true for all t ď t , by Lemma 8.6, r e t ` ď ´ ´ ηe M κ ρ ¯ t r e ` ηC ` } s G k } ` e M ζ n k ˘ ď r e ` ηC ` } s G k } ` e M ζ n k ˘ ď c e M κ } Z ‹ } ` ηC ` } s G k } ` e M ζ n k ˘ “ c e M κ } Z ‹ } ˜ ` η Ce M ζ n κ c } Z ‹ } ˆ } s G k } ζ n ` e M k ˙¸ ď c e M κ } Z ‹ } ˆ ` Cc C ˙ . Choosing C large enough such that C ě Cc , then r e t ` ď c e M κ } Z ‹ } and therefore, } ∆ Z t ` } F ď ˜ r e t ` } Z ‹ } ¸ ď c ? e M κ } Z ‹ } op } Z ‹ } op } Z } op ď c e M κ } Z ‹ } op . 
This completes the proof.
We state below additional technical lemmas used in the proofs.
Lemma 8.8 ([18]) . Let X , ¨ ¨ ¨ , X n be independent Bernoulli random variables with P p X i “ q “ p i . For S n “ ř ni “ a i X i and ν “ ř ni “ a i p i . Then we have P p S n ´ E S n ă ´ λ q ď exp p´ λ { ν q ,P p S n ´ E S n ą λ q ď exp ˆ ´ λ p ν ` aλ { q ˙ , where a “ max t| a | , ¨ ¨ ¨ , | a n |u . Lemma 8.9 ([54]) . For any Z , Z P R n ˆ k , we have dist p Z , Z q ď p? ´ q σ k p Z q } Z Z J ´ Z Z J } . Lemma 8.10 ([54]) . For any Z , Z P R n ˆ k such that dist p Z , Z q ď c } Z } op , we have } Z Z J ´ Z Z J } F ď p ` c q } Z } op dist p Z , Z q . emma 8.11 ([47]) . For a continuously differentiable function f , if it is µ -strongly convex and L -smooth on a convex domain D , say for any x, y P D , µ } x ´ y } ď f p y q ´ f p x q ´ @ f p x q , y ´ x D ď L } x ´ y } , then @ f p x q ´ f p y q , x ´ y D ě µLµ ` L } x ´ y } ` µ ` L } f p x q ´ f p y q} , and also @ f p x q ´ f p y q , x ´ y D ě µ } x ´ y } . Lemma 8.12 ([41], [23]) . Let A be the symmetric adjacency matrix of a random graph on n nodesin which edges occur independently. Let E r A ij s “ P ij for all i ‰ j and P ii P r , s . Assume that n max i,j P ij ď d . Then for any C , there is a constant C “ C p C q such that } A ´ P } op ď C a d ` log n with probability at least ´ n ´ C . Lemma 8.13.
Let A be the symmetric adjacency matrix of a random graph of n nodes in whichedges occur independently. Let E r A ij s “ P ij for all i ‰ j and P ii P r , s for all i and X bedeterministic with X ii “ for all i . Then, |x A ´ P, X y| ď C } X } F with probability at least ´ p´ C { p max q ´ exp p´ C } X } F { } X } q , where p max “ max i ‰ j P ij .Proof. Observe that x A ´ P, X y “ ř i ă j p A ij ´ P ij q X ij and A ij are independent Bernoulli randomvariables with E r A ij s “ P ij . Apply Lemma 8.8 to ř i ă j p A ij ´ P ij q X ij with λ “ C } X } F {
2, we have ν “ ř i ă j X ij P ij ď p max } X } and P ` |x A ´ P, X y| ď C } X } F ˘ ď exp p´ C } X } { ν q ` exp ˆ ´ C } X } t ν, C } X } } X } F u ˙ ď p´ C } X } { ν q ` exp p´ C } X } F { } X } qď p´ C { p max q ` exp p´ C } X } F { } X } q . This completes the proof.
A Proofs of results for initialization
This section presents the proofs of Theorem 4.5, Corollary 4.1 and Proposition 4.1.42 .1 Preliminaries
We introduce two technical results used repeatedly in the proofs.
Lemma A.1. If } G ‹ } ě C } s G k } for some constant C ą , then } G ‹ } ě c } G ‹ } { k for someconstant c ą .Proof. By definition, } G ‹ } ď ` } s G k } ` } Z ‹ Z J‹ } ˘ ď } s G k } ` k } Z ‹ } ď p k ` { C q } Z ‹ } . Therefore, } Z ‹ } ě c } G ‹ } { k for some constant c ą Theorem A.1.
Under Assumption 4.1, choose λ n , γ n such that λ n ě } A ´ P } op ` γ n } G ‹ } op , } A ´ P } op ` γ n ? k } α ‹ n J } F , @ A ´ P, X D ? k } X } F ` γ n ? k } Xβ ‹ } F + . (45) Let the constant step size η ď { and the constraint sets C G , C α and C β as specified in Theorem4.5. If the latent vectors contain strong enough signal in the sense that } G ‹ } ě Cκ Z ‹ e M max ! e M λ n k, } s G k } ˚ { k, } s G k } ) (46) for some sufficiently large constant C , then for any given constant c ą , there exists a universalconstant C such that for any T ě T , the error will satisfy e T ď c e ´ M } Z ‹ } { κ Z ‹ , where T “ log ˜ C e M kκ Z ‹ c } G ‹ } ` } α ‹ n J } ` } Xβ ‹ } } G ‹ } ¸ ˆ log ˆ ´ γ n η ˙˙ ´ . Proof.
See Section A.5.
A.2 Proof of Theorem 4.5
We focus on the case where X is nonzero, and the case of X “ C , c such that with probability at least 1 ´ n c , } A ´ P } op , @ A ´ P, X D ? k } X } F ď C b max t ne ´ M , log n u . All the following analysis is conditional on this event. Since } α ‹ n J } F , } β ‹ X } F ď C } G ‹ } F , byLemma A.1, } α ‹ n J } F ` } Xβ ‹ } F ď C ? k } G ‹ } op . for some constant C . Combining these two inequalities leads tomax } A ´ P } op ` γ n } G ‹ } op , } A ´ P } op ` γ n ? k } α ‹ n J } F , @ A ´ P, X D ? k } X } F ` γ n ? k } Xβ ‹ } F + ď C b max t ne ´ M , log n u ` p ` C q γ n } G ‹ } op ď C { C λ n ` p ` C q δλ n ď λ n { . C is sufficiently large and δ is sufficiently small.Furthermore, r Cκ Z ‹ e M λ n k ď r Cc } G ‹ } ď } G ‹ } , since c is a sufficient small constant. Therefore, the inequality (46) holds. Apply Theorem A.1,there exists a universal constant C such that for any given constant c ą e T ď c e ´ M } Z ‹ } { κ Z ‹ , as long as T ě T , where T “ log ˜ C e M kκ Z ‹ c } x ‹ } D } G ‹ } ¸ ˆ log ˆ ´ γ n η ˙˙ ´ . Notice that when } α ‹ n J } F , } β ‹ X } F ď C } G ‹ } F , } x ‹ } D ď C } G ‹ } for some constant C . Therefore, T ď log ˜ C C e M kκ Z ‹ c ¸ ˆ log ˆ ´ γ n η ˙˙ ´ . This completes the proof.
A.3 Proof of Corollary 4.1
By Lemma 8.3, there exist constants C , c such that with probability at least 1 ´ n c , } A ´ P } op , @ A ´ P, X D ? k } X } F ď C b max t ne ´ M , log n u . All the following analysis is conditional on this event. Since } α ‹ n J } F , } β ‹ X } F ď C } G ‹ } F , byLemma A.1, } α ‹ n J } F ` } Xβ ‹ } F ď C ? k } G ‹ } op . for some constant C . Combining these two inequalities leads tomax } A ´ P } op ` γ n } G ‹ } op , } A ´ P } op ` γ n ? k } α ‹ n J } F , @ A ´ P, X D ? k } X } F ` γ n ? k } Xβ ‹ } F + ď C b max t ne ´ M , log n u ` p ` C q γ n } G ‹ } op ď C γ n } G ‹ } op ` p ` C q γ n } G ‹ } op “ p ` C ` C q γ n } G ‹ } op , where the last inequality is due to equation (22). Since we choose λ n “ C γ n } Z ‹ } for somesufficiently large constant C , inequality (45) holds. Further, notice that γ n “ γ “ c {p e M ? kκ Z ‹ q for some sufficiently small constant c , r Ce M κ Z ‹ λ n k “ r Cc } G ‹ } ď } G ‹ } . Therefore, the inequality (46) holds. Apply Theorem A.1, there exists a universal constant C suchthat for any given constant c ą e T ď c e ´ M } Z ‹ } { κ Z ‹ , as long as T ě T , where T “ log ˜ C e M kκ Z ‹ c } x ‹ } D } G ‹ } ¸ ˆ log ˆ ´ γη ˙˙ ´ . } α ‹ n J } F , } β ‹ X } F ď C } G ‹ } F , } x ‹ } D ď C } G ‹ } for some constant C . Therefore, T ď log ˜ C C e M kκ Z ‹ c ¸ ˆ log ˆ ´ γη ˙˙ ´ . This completes the proof.
A.4 Proof of Proposition 4.1
Applying Theorem 2.7 in [14] we obtain1 n } p P ´ P } ď C p k, M , κ Z ‹ q n ´ k ` . where the constant C p k, M , κ Z ‹ q depends on k, M , κ Z ‹ . Notice that Θ ij “ logit p P ij q and logit p¨q is 4 e M -Lipchitz continuous in the interval “ e ´ M , ‰ , and so1 n } p Θ ´ Θ } ď C p k, M , κ Z ‹ q n ´ k ` . Let ∆ p Θ “ p Θ ´ Θ ‹ It is easy to verify, α “ ` nI n ` n n J ˘ ´ p Θ1 n “ α ‹ ` n ˆ I n ´ n n n J ˙ ∆ p Θ n , and hence } α n J ´ α ‹ n J } F “ n } ˆ I n ´ n n n J ˙ ∆ p Θ n n J } F ď n ›››› I n ´ n n n J ›››› op } ∆ p Θ } F } n n J } F ď } ∆ p Θ } F . Notice that G ‹ P S n ` , } p G ´ G ‹ } F ď } p G ´ J p Θ J ` J p Θ J ´ G ‹ } F ď } J p Θ J ´ G ‹ } F ď } ∆ p Θ } F . Further notice that r p G ‹ q “ k , } Z p Z q J ´ G ‹ } F ď } Z p Z q J ´ p G ` p G ´ G ‹ } F ď } p G ´ G ‹ } F ď } ∆ p Θ } F . Then, by Lemma 8.9, e ď } Z ‹ } dist p Z , Z ‹ q ` n ›› α ´ α ‹ ›› ď κ Z ‹ p? ´ q } Z p Z q J ´ G ‹ } ` n ›› α ´ α ‹ ›› ď κ Z ‹ } ∆ p Θ } ` } ∆ p Θ } ď κ Z ‹ C p k, M , κ Z ‹ q n ˆ n ´ k ` ď kκ Z ‹ C p k, M , κ Z ‹ q c } G ‹ } k ˆ n ´ k ` ď C p k, M , κ Z ‹ q } Z ‹ } ˆ n ´ k ` . Therefore, the initialization condition in Assumption 4.2 will hold for large enough n .45 .5 Proof of Theorem A.1 A.5.1 Preparations
Recall the definition of f p G, α, β q in (15) where p G, α, β q P D “ " p G, α, β q| GJ “ G, G P S ` , max i,j | G ij | , max i | α i | ď M , | β | ď M i,j | X ij | * . Define the norm } ¨ } D in the domain D by }p G, α, β q} D “ ´ } G } ` } α n J } ` } Xβ } ¯ { . Lemma A.2.
The function f is γ n -strongly convex and p γ n ` { q -smooth in the convex domain D with respect to the norm } ¨ } D , that is, for p G i , α i , β i q P D , i “ , , let p ∆ G , ∆ α , ∆ β q “ p G ´ G , α ´ α , β ´ β q , then γ n }p ∆ G , ∆ α , ∆ β q} D ď f p G , α , β q ´ f p G , α , β q ´ @ ∇ G f p G , α , β q , ∆ G D ´ @ ∇ α f p G , α , β q , ∆ α D ´ @ ∇ β f p G , α , β q , ∆ β D ď γ n ` { }p ∆ G , ∆ α , ∆ β q} D . Proof.
With slight abuse of notation, let h p G, α, β q “ ´ ÿ i,j ! A ij Θ ij ` log ´ ´ σ p Θ ij q ¯) (47)which is a convex function of G, α and β . In addition, let r p G, α, β q “ γ n ` } G } ` } α n J } ` } Xβ } ˘ ` λ n Tr p G q (48)which is γ n -strongly convex w.r.t. the norm } ¨ } D . Thus f p G, α, β q is γ n -strongly convex. On theother hand, r p¨ , ¨ , ¨q is γ n smooth and h p G , α , β q ´ h p G , α , β q ´ @ ∇ G h p G , α , β q , ∆ G D ´ @ ∇ α h p G , α , β q , ∆ α D ´ @ ∇ β h p G , α , β q , ∆ β D “ h p Θ q ´ h p Θ q ´ @ ∇ Θ h p Θ q , ∆ G D ´ @ ∇ Θ h p Θ q n , ∆ α D ´ @ ∇ Θ h p Θ q , X D ∆ β “ h p Θ q ´ h p Θ q ´ @ ∇ Θ h p Θ q , Θ ´ Θ D ď } Θ ´ Θ } “ } ∆ G ` α n J ` X ∆ β } ď ` } ∆ G } ` } ∆ α n J } ` } X ∆ β } ˘ ď ` } ∆ G } ` } ∆ α n J } ` } X ∆ β } ˘ . This finishes the proof. 46efine p r G, r α, r β q “ arg min p G,α,β qP D f p G, α, β q , ∆ r G “ r G ´ G ‹ , ∆ r α “ r α ´ α ‹ , ∆ r β “ r β ´ β ‹ , ∆ r Θ “ r Θ ´ Θ ‹ . Similar to the analysis of the convex programming in (11), one can obtain the followingresults. Lemma A.3.
Let M K k “ t M P R n ˆ n : row p M q Ă col p Z ‹ q K and col p M q Ă col p Z ‹ q K u and M k beits orthogonal complement in R n ˆ n under trace inner product. If λ n ě } A ´ P } op ` γ n } G ‹ } op , } A ´ P } op ` γ n ? k } α ‹ n J } F , @ A ´ P, X D ? k } X } F ` γ n ? k } Xβ ‹ } F + , then for s G k “ P M K k G ‹ , we have } ∆ r G } ˚ ď ? k } P M k ∆ r G } F ` ? k } ∆ r α n J } F ` ? k } X ∆ r β } F ` } s G k } ˚ . Proof.
Let r h p G, α, β q “ ´ ÿ ď i,j ď n t A ij Θ ij ` log p ´ σ p Θ ij qqu ` γ n ` } G } ` } α n J } ` } Xβ } ˘ . By the convexity of r h , r h p r G, r α, r β q ´ r h p G ‹ , α ‹ , β ‹ qě @ ∇ G r h p G ‹ , α ‹ , β ‹ q , ∆ r G D ` @ ∇ α r h p G ‹ , α ‹ , β ‹ q , ∆ r α D ` @ ∇ β r h p G ‹ , α ‹ , β ‹ q , ∆ r β D “ ´x A ´ P, ∆ r G ` r α n J ` ∆ r β X y ` γ n ´@ G ‹ , ∆ r G D ` n @ α ‹ , ∆ r α D ` } X } @ β ‹ , ∆ r β D¯ ě ´ } A ´ P } op ` } ∆ r G } ˚ ` } ∆ r α n J } ˚ ˘ ´ |x A ´ P, ∆ r β X y|´ γ n ´ } G ‹ } op } G ‹ } ˚ ` } α ‹ n J } F } ∆ r α n J } F ` } Xβ ‹ } F } X ∆ r β } F ¯ ě ´ ´ } A ´ P } op ` γ n } G ‹ } op ¯ ›› ∆ r G ›› ˚ ´ ´ } A ´ P } op ` γ n {? k } α ‹ n J } F ¯ ? k } ∆ r α n J } F ´ ´@ A ´ P, X D { ´ ? k } X } F ¯ ` γ n {? k } Xβ ‹ } F ¯ ? k } X ∆ r β } F ě ´ λ n ` ›› ∆ r G ›› ˚ ` ? k } ∆ r α n J } F ` ? k } X ∆ r β } F ˘ ě ´ λ n ` } P M k ∆ r G } ˚ ` } P M K k ∆ r G } ˚ ` ? k } ∆ r α n J } F ` ? k } X ∆ r β } F ˘ . The last inequality holds since P M k ` P M K k equals identity and λ n ě } A ´ P } op ` γ n } G ‹ } op , } A ´ P } op ` γ n ? k } α ‹ n J } F , @ A ´ P, X D ? k } X } F ` γ n ? k } Xβ ‹ } F + . On the other hand, by the definition of s G k , } r G } ˚ ´ } G ‹ } ˚ “ } P M k G ‹ ` s G k ` P M k ∆ r G ` P M K k ∆ r G } ˚ ´ } P M k G ‹ ` s G k } ˚ ě } P M k G ‹ ` P M K k ∆ r G } ˚ ´ } s G k } ˚ ´ } P M k ∆ r G } ˚ ´ } P M k G ‹ } ˚ ´ } s G k } ˚ “ } P M k G ‹ } ˚ ` } P M K k ∆ r G } ˚ ´ } s G k } ˚ ´ } P M k ∆ r G } ˚ ´ } P M k G ‹ } ˚ “ } P M K k ∆ r G } ˚ ´ } P M k ∆ r G } ˚ ´ } s G k } ˚ . P M k G ‹ and P M K k ∆ r G have orthogonal column and rowspaces. Furthermore, since p Θ is the optimal solution to (11), and Θ ‹ is feasible, the basic inequalityand the last two displays imply0 ě r h p r G, r α, r β q ´ r h p G ‹ , α ‹ , β ‹ q ` λ n ` } r G } ˚ ´ } G ‹ } ˚ ˘ ě ´ λ n ` } P M k ∆ r G } ˚ ` } P M K k ∆ r G } ˚ ` ? k } ∆ r α n J } F ` ? k } X ∆ r β } F ˘ ` λ n ` } P M K k ∆ r G } ˚ ´ } P M k ∆ r G } ˚ ´ } s G k } ˚ ˘ “ λ n ` } P M K k ∆ r G } ˚ ´ } P M k ∆ r G } ˚ ´ } s G k } ˚ ´ ? k } ∆ r α n J } F ´ ? k } X ∆ r β } F ˘ . Rearranging the terms leads to } P M K k ∆ r G } ˚ ď } P M k ∆ r G } ˚ ` ? k } ∆ r α n J } F ` ? k } X ∆ r β } F ` } s G k } ˚ , and triangle inequality further implies } ∆ r G } ˚ ď } P M k ∆ r G } ˚ ` ? k } ∆ r α n J } F ` ? k } X ∆ r β } F ` } s G k } ˚ . Finally, note that the rank of P M k ∆ r G is at most 2 k , } ∆ r G } ˚ ď ? k } P M k ∆ r G } F ` ? k } ∆ r α n J } F ` ? k } X ∆ r β } F ` } s G k } ˚ . This completes the proof.
Lemma A.4.
For any k ě such that Assumption 4.1 holds. Choose λ n ě max t } A ´ P } op , u and |x A ´ P, X y| ď λ n ? k } X } F . There exist constants C ą and ď c ă such that } ∆ r Θ } ě p ´ c q ` } ∆ r G } ` } ∆ r α n J } ` } ∆ r β X } ˘ ´ C } s G k } ˚ { k, and } ∆ r Θ } ď p ` c q ` } ∆ r G } ` } ∆ r α n J } ` } ∆ r β X } ˘ ` C } s G k } ˚ { k. Proof.
The proof is the same as the proof of Lemma 8.2 and we leave out the details.
Theorem A.2.
Under Assumption 4.1, for any λ n satisfying λ n ě } A ´ P } op ` γ n } G ‹ } op , } A ´ P } op ` γ n ? k } α ‹ n J } F , @ A ´ P, X D ? k } X } F ` γ n ? k } Xβ ‹ } F + , there exists a constant C such that ´ } ∆ r G } F ` } ∆ r α n J } F ` } ∆ r β X } F ¯ ď C ˆ e M λ n k ` } s G k } ˚ k ˙ . Proof.
Recall the definition of h p G, α, β q in (47). Observe that p Θ “ p α n J ` n p α J ` p βX ` p G is theoptimal solution to (11), and that the true parameter Θ ‹ “ α ‹ n J ` n α J‹ ` β ‹ X ` G ‹ is feasible.Thus, we have the basic inequality r h p r G, r α, r β q ´ r h p G ‹ , α ‹ , β ‹ q ` λ n p} r G } ˚ ´ } G ‹ } ˚ q ď . (49)48y definition, r h p G, α, β q “ h p G, α, β q ` γ n ` } G } ` } α n J } ` } Xβ } ˘ . On the one hand, h p r G, r α, r β q ´ h p G ‹ , α ‹ , β ‹ q´ @ ∇ G h p G ‹ , α ‹ , β ‹ q , ∆ r G D ´ @ ∇ α h p G ‹ , α ‹ , β ‹ q , ∆ r α D ´ @ ∇ β h p G ‹ , α ‹ , β ‹ q , ∆ r β D “ h p r Θ q ´ h p Θ ‹ q ´ @ ∇ Θ h p Θ ‹ q , ∆ r Θ D ě τ } ∆ r Θ } , where the last inequality is by the strong convexity of h p¨q with respect to Θ in the domain F g and τ “ e M {p ` e M q as in the proof of Theorem 4.1. Further by Lemma A.4, τ } ∆ r Θ } ě τ p ´ c q ` } ∆ r G } ` } ∆ r α n J } ` } ∆ r β X } ˘ ´ Cτ } s G k } ˚ { k. On the other hand, the l regularization term is strongly convex with respect to p G, α, β q . Thenwe have r h p r G, r α, r β q ´ r h p G ‹ , α ‹ , β ‹ qě @ ∇ G r h p G ‹ , α ‹ , β ‹ q , ∆ r G D ` @ ∇ α r h p G ‹ , α ‹ , β ‹ q , ∆ r α D ` @ ∇ β r h p G ‹ , α ‹ , β ‹ q , ∆ r β D ` τ p ´ c q ` } ∆ r G } ` } ∆ r α n J } ` } ∆ r β X } ˘ ´ Cτ } s G k } ˚ { k ě ´ λ n ` ›› ∆ r G ›› ˚ ` ? k } ∆ r α n J } F ` ? k } X ∆ r β } F ˘ ` τ p ´ c q ` } ∆ r G } ` } ∆ r α n J } ` } ∆ r β X } ˘ ´ Cτ } s G k } ˚ { k. By triangle inequality, λ n p} p G } ˚ ´ } G ‹ } ˚ q ě ´ λ n } ∆ G } ˚ . Together with (49), the last two inequalities imply τ p ´ c q ´ } ∆ r G } ` } ∆ r α n J } ` } ∆ r β X } ¯ ď λ n ´›› ∆ r G ›› ˚ ` ? k } ∆ r α n J } F ` ? k } X ∆ r β } F ¯ ` λ n } ∆ r G } ˚ ` Cτ } s G k } ˚ { k. By Lemma A.3, τ p ´ c q ´ } ∆ r G } ` } ∆ r α n J } ` } ∆ r β X } ¯ ď C λ n ? k ´ } ∆ r G } F ` } ∆ r α n J } F ` } X ∆ r β } F ¯ ` C λ n } s G k } ˚ ` Cτ } s G k } ˚ { k. This implies that there exists some constant c such that c τ ´ } ∆ r G } F ` } ∆ r α n J } F ` } ∆ r β X } F ¯ ď C λ n ? k ´ } ∆ r G } F ` } ∆ r α n J } F ` } X ∆ r β } F ¯ ` C λ n } s G k } ˚ ` Cτ } s G k } ˚ { k. C such that ´ } ∆ r G } F ` } ∆ r α n J } F ` } ∆ r β X } F ¯ ď C ˆ λ n kτ ` λ n } s G k } ˚ τ ` } s G k } ˚ k ˙ . Note that τ ě c e ´ M and e M λ n } s G k } ˚ ď c ´ e M λ n k ` } s G k } ˚ k ¯ for positive constants c , c .Therefore, ´ } ∆ r G } F ` } ∆ r α n J } F ` } ∆ r β X } F ¯ ď C ˆ e M λ n k ` } s G k } ˚ k ˙ , which completes the proof. Lemma A.5 ([8]) . Let x P D and y P R n , then @ π D p y q ´ x, π D p y q ´ y D ď where D is a convex set and π D p x q “ arg min y P D } x ´ y } . Lemma A.6.
With η G “ η, η α “ η { n, η β “ η {} X } , @ G t ´ G t ` , G t ´ r G D ` n @ α t ´ α t ` , α t ´ r α D ` @ β t ´ β t ` , β t ´ r β D } X } ě ηµ } x t ´ r x } D ` ˆ ´ ηL ˙ ! } G t ` ´ G t } ` } ` α t ` ´ α t ˘ n J } ` }p β t ` ´ β t q X } ) where µ “ γ n and L “ γ n ` { . Proof.
Let x t “ p G t , α t , β t q and r x “ p r G, r α, r β q . Then f p x t ` q ´ f p r x q “ f p x t ` q ´ f p x t q ` f p x t q ´ f p r x qď @ ∇ f p x t q , x t ` ´ x t D ` L } x t ` ´ x t } D ` @ ∇ f p x t q , x t ´ r x D ´ µ } x t ´ r x } D ď @ ∇ f p x t q , x t ` ´ r x D ` L } x t ` ´ x t } D ´ µ } x t ´ r x } D “ @ ∇ G f p G t , α t , β t q , G t ` ´ r G D ` @ ∇ α f p G t , α t , β t q , α t ` ´ r α D ` @ ∇ β f p G t , α t , β t q , β t ` ´ r β D ` L } x t ` ´ x t } D ´ µ } x t ´ r x } D . Notice that r G t ` “ G t ´ η G B f B G | G “ G t and G t ` is the projection of r G t ` to the convex set t G | GJ “ G, G P S ` , max i,j } G ij } ď M u . Therefore by Lemma A.5, @ G t ` ´ r G t ` , G t ` ´ r G D ď @ B f B G | G “ G t , G t ` ´ r G D ď η G @ G t ´ G t ` , G t ` ´ r G D “ η G @ G t ´ G t ` , G t ´ r G D ´ η G } G t ´ G t ` } . @ B f B α | α “ α t ` , α t ` ´ r α D ď η α @ α t ´ α t ` , α t ´ r α D ´ η α } α t ´ α t ` } , @ B f B β | β “ β t ` , β t ` ´ r β D ď η β @ β t ´ β t ` , β t ´ r β D ´ η β } β t ´ β t ` } . Also notice that f p x t ` q ´ f p r x q ě
0, therefore0 ď η p f p x t ` q ´ f p r x qq ď @ G t ´ G t ` , G t ´ r G D ` n @ α t ´ α t ` , α t ´ r α D ` } X } @ β t ´ β t ` D ´ } x t ´ x t ` } D ` ηL } x t ´ x t ` } D ´ ηµ } x t ´ r x } D . This completes the proof.
A.5.2 Proof of the theorem
Let x t “ p G t , α t , β t q , r x “ p r G, r α, r β q . By definition, } x t ` ´ r x } D “ } G t ` ´ r G } ` } ` α t ` ´ α t ˘ n J } ` }p β t ` ´ r β q X } . Notice that for each component, the error can be decomposed as (with G as an example), } G t ` ´ r G } “ } G t ´ r G } ´ @ G t ´ G t ` , G t ´ r G D ` } G t ` ´ G t } . Summing up these equations leads to } x t ` ´ r x } D “ } x t ´ r x } D ´ !@ G t ´ G t ` , G t ´ r G D ` n @ α t ´ α t ` , α t ´ r α D ` } X } @ β t ´ β t ` , β t ´ r β D) ` ! } G t ` ´ G t } ` } ` α t ` ´ α t ˘ n J } ` }p β t ` ´ β t q X } ) . By Lemma A.6, } x t ` ´ r x } D ď p ´ ηµ q} x t ´ r x } D ´ p ´ ηL q} x t ´ x t ` } D . Then for any η ď { L , } x t ` ´ r x } D ď p ´ ηµ q} x t ´ r x } D . By Lemma 8.5, and repeatedly using the inequality p a ` b q ď p a ` b q , e t ď κ Z ‹ p? ´ q } Z t p Z t q J ´ Z ‹ Z J‹ } ` } ∆ α t n J } ` } ∆ β t X } ď κ Z ‹ p? ´ q ` } Z t p Z t q J ´ G t } ` } G t ´ Z ‹ Z J‹ } ˘ ` } ∆ α t n J } ` } ∆ β t X } ď κ Z ‹ p? ´ q } G t ´ Z ‹ Z J‹ } ` } ∆ α t n J } ` } ∆ β t X } ď κ Z ‹ p? ´ q ` } G t ´ G ‹ } ` } G ‹ ´ Z ‹ Z J‹ } ˘ ` } ∆ α t n J } ` } ∆ β t X } ď κ Z ‹ p? ´ q } G t ´ G ‹ } ` κ Z ‹ p? ´ q } s G k } ` } ∆ α t n J } ` } ∆ β t X } .
51y the definition of } ¨ } D , we further have e t ď κ Z ‹ p? ´ q ` } x t ´ x ‹ } D ` } s G k } ˘ ď κ Z ‹ p? ´ q ` } x t ´ r x } D ` } r x ´ x ‹ } D ` } s G k } ˘ ď κ Z ‹ p? ´ q ` p ´ ηγ n q t } x ´ r x } D ` } r x ´ x ‹ } D ` } s G k } ˘ ď κ Z ‹ p? ´ q ` p ´ ηγ n q t } x ´ x ‹ } D ` p ´ ηγ n q t } r x ´ x ‹ } D ` } r x ´ x ‹ } D ` } s G k } ˘ . According to Theorem A.2, there exists constant C ą } r x ´ x ‹ } D ď C ˆ e M λ n k ` } s G k } ˚ k ˙ . Therefore, e t ď C κ Z ‹ ` p ´ ηγ n q t } x ´ x ‹ } D ` e M λ n k ` } s G k } ˚ { k ` } s G k } ˘ . Since x “ e t ď C κ Z ‹ ` p ´ ηγ n q t } x ‹ } D ` e M λ n k ` } s G k } ˚ { k ` } s G k } ˘ ď c κ Z ‹ e M } Z ‹ } ˆ C κ Z ‹ e M c } Z ‹ } ` p ´ ηγ n q t } x ‹ } D ` e M λ n k ` } s G k } ˚ { k ` } s G k } ˘ . Under our assumptions, there exists some sufficiently large constant C such that } Z ‹ } ě C κ Z ‹ e M max ! e M λ n k, } s G k } ˚ { k, } s G k } ) . Therefore, e t ď c κ Z ‹ e M } Z ‹ } ˆ ˜ C e M κ Z ‹ } Θ ‹ } c τ } Z ‹ } p ´ ηγ n q t ` C c C ¸ . Choose large enough C ą C { c , then e t ď c κ Z ‹ e M } Z ‹ } ˆ ˜ Ce M κ Z ‹ } x ‹ } D c } Z ‹ } p ´ ηγ n q t ` ¸ . Therefore, e t ď c τ κ } Z ‹ } when C e M κ Z ‹ } x ‹ } D c } G ‹ } p ´ ηγ n q t ď . By Lemma A.1, } G ‹ } ě c } G ‹ } { k . Therefore, it suffices to have t ě log ˜ k } x ‹ } D } G ‹ } C e M κ Z ‹ c c ¸ ˆ log ˆ ´ ηγ n ˙˙ ´ . This completes the proof. 52 eferences [1] L. A. Adamic and N. Glance. The political blogosphere and the 2004 us election: divided theyblog. In