Detecting Latent Communities in Network Formation Models
Shujie Ma†  Liangjun Su‡  Yichong Zhang§

May 18, 2020

∗ Ma's research was partially supported by NSF grant DMS 1712558 and a UCR Academic Senate CoR grant. Zhang acknowledges the financial support from Singapore Ministry of Education Tier 2 grant under grant MOE2018-T2-2-169 and the Lee Kong Chian fellowship. Any and all errors are our own.
† Department of Statistics, University of California, Riverside. E-mail address: [email protected].
‡ Tsinghua University and Singapore Management University. E-mail address: [email protected].
§ Singapore Management University. E-mail address: [email protected].
Abstract
This paper proposes a logistic undirected network formation model which allows for assortative matching on observed individual characteristics and the presence of edge-wise fixed effects. We model the coefficients of observed characteristics to have a latent community structure and the edge-wise fixed effects to be of low rank. We propose a multi-step estimation procedure involving nuclear norm regularization, sample splitting, iterative logistic regression, and spectral clustering to detect the latent communities. We show that the latent communities can be exactly recovered when the expected degree of the network is of order $\log n$ or higher, where $n$ is the number of nodes in the network. The finite sample performance of the new estimation and inference methods is illustrated through both simulated and real datasets.

Keywords:
Community detection, homophily, spectral clustering, strong consistency, unobserved heterogeneity
1 Introduction

In real-world social and economic networks, individuals tend to form links with others who are similar to themselves, resulting in assortative matching on observed individual characteristics (homophily). In addition, network data often exhibit natural communities such that individuals in the same community may share similar preferences for a certain type of homophily, while those in different communities tend to have quite distinctive preferences. In many cases, such a community structure is latent and has to be identified from the data. The detection of such community structures is challenging yet crucial for network analyses. It prompts a couple of important questions that need to be addressed: How do we formulate a network formation model with individual ...

To address the first issue, we model the coefficients of the observed characteristics to have a latent community structure, so that the coefficient of the $l$-th covariate in the network formation model is $B^*_{l,kk'}$ for any nodes $i$ and $j$ in communities $k$ and $k'$, respectively. The edge-wise fixed effects are assumed to have a low-rank structure; this includes the commonly used discretized fixed effects and additive fixed effects as special cases. To address the second issue, we note that the estimation of this latent model is challenging and has to involve a multi-step procedure. In the first step, we estimate the coefficient matrices by a nuclear norm regularized logistic regression, given their low-rank structures; we then obtain the estimators of their singular vectors, which contain information about the community memberships, via the singular value decomposition (SVD). Such singular vector estimates are only consistent in the Frobenius norm but not in the uniform row-wise Euclidean norm, so a refined estimation is needed for accurate community detection. In the second step, we use the singular vector estimates from the first step as initial values and iteratively run row-wise and column-wise logistic regressions to reestimate the singular vectors. The efficiency of the resulting estimator can be improved through this iterative procedure. In the third step, we apply the standard K-means algorithm to the singular vector estimates obtained in the second step. For technical reasons, we have to resort to sample-splitting techniques to estimate the singular vectors, and for numerical stability, both iterative procedures and multiple splits are called upon. We establish the exact recovery of the latent communities (strong consistency) under the condition that the expected degree of the network diverges to infinity at the rate $\log n$ or a higher order, where $n$ is the number of nodes. Under the exact recovery property, we can treat the estimated community memberships as the truth and further estimate the community-specific regression coefficients.

Our paper is closely related to three strands of literature in statistics and econometrics. First, our paper is closely tied to the large literature on the application of spectral clustering to detect communities in stochastic block models (SBMs).
Since the pioneering work of Holland, Laskey, and Leinhardt (1983), the SBM has become the most popular model for community detection. The statistical properties of spectral clustering in such models have been studied by Jin (2015), Joseph and Yu (2016), Lei and Rinaldo (2015), Paul and Chen (2020), Qin and Rohe (2013), Rohe, Chatterjee, and Yu (2011), Sarkar and Bickel (2015), Sengupta and Chen (2015), Vu (2018), Wang and Wong (1987), Yun and Proutiere (2014), and Yun and Proutiere (2016), among others. From an information theory perspective, Abbe and Sandon (2015), Abbe, Bandeira, and Hall (2016), Mossel, Neeman, and Sly (2014), and Vu (2018) establish the phase transition threshold for the exact recovery of communities in SBMs, which requires the expected degree to diverge to infinity at a rate no slower than $\log n$. Su, Wang, and Zhang (2020) show that spectral clustering can achieve this information-theoretic minimum rate for exact recovery. Nevertheless, existing SBMs either do not include covariates or include covariates in a non-regression fashion (see, e.g., Binkiewicz, Vogelstein, and Rohe, 2017), which makes them too simple for practical use. For more complicated models that can incorporate both covariates and community structures, researchers often resort to the variational EM algorithm, whose performance hinges heavily on a proper choice of initial values. In contrast, the network formation model proposed in this paper extends the SBM to a complex logistic regression model with both latent community structures and covariates, and our multi-step procedure provides an effective and reliable tool for the estimation of such a complex network model. Despite the fact that the regression coefficient matrices have to be estimated from the data in order to obtain the associated singular vectors for spectral clustering, we are able to obtain the exact recovery of the community structures at the minimal rate on the expected node degree.

Second, our paper is closely tied to the literature on network formation models; see, for example, Chatterjee, Diaconis, and Sly (2011), Graham (2017), Holland and Leinhardt (1981), Jochmans (2019), Leung (2015), Mele (2017a), Rinaldo, Petrović, and Fienberg (2013), and Yan and Xu (2013). We complement these works by allowing for community structures on the regression coefficients, which can capture a rich set of unobserved heterogeneity in network data. In a working paper, Mele (2017b) also considers a network formation model with heterogeneous players and a latent community structure. He assumes that the community structure follows an i.i.d. multinomial distribution and imposes a prior distribution over communities and parameters before conducting Bayesian estimation and inference. In contrast, we treat the community memberships as fixed parameters and aim to recover them from a single observation of a large network.
Our idea of introducing the community structure in a network formation model is also inspired by the recent works of Bonhomme and Manresa (2015) and Su, Shi, and Phillips (2016), who introduce latent group structures into panel data analyses.

Last, our paper is related to the literature on the use of nuclear norm regularization in various contexts; see Alidaee, Auerbach, and Leung (2020), Belloni, Chen, and Padilla (2019), Chernozhukov, Hansen, Liao, and Zhu (2018), Fan, Gong, and Zhu (2019), Feng (2019), Koltchinskii, Lounici, and Tsybakov (2011), Moon and Weidner (2018), Negahban and Wainwright (2011), Negahban, Ravikumar, Wainwright, and Yu (2012), and Rohde and Tsybakov (2011), among others. All these previous works focus on the error bounds (in Frobenius norm) for the nuclear norm regularized estimates, except Moon and Weidner (2018) and Chernozhukov et al. (2018), who study the inference problem in linear panel data models with a low-rank structure. Like Moon and Weidner (2018) and Chernozhukov et al. (2018), we simply use the nuclear norm regularization to obtain consistent initial estimates. Unlike Moon and Weidner (2018) and Chernozhukov et al. (2018), we study a logistic network formation model with a latent community structure and propose the iterative row- and column-wise logistic regressions to improve the error bounds (in row-wise Euclidean norm) for the singular vectors of the nuclear norm regularized estimates. Relying on such an improvement, we can fully recover the community memberships. Then, we can estimate the community-specific parameters and conduct statistical inference.

The rest of the paper is organized as follows. In Section 2, we introduce the model and several basic assumptions. In Section 3, we provide our multi-step estimation procedure. In Section 4, we establish the statistical properties of our proposed estimators of the community memberships and regression coefficients. Section 5 reports simulation results. In Section 6, we apply the new methods to study the community structure of Facebook friendship networks at one hundred American colleges and universities at a single point in time. Section 7 concludes. We provide the proofs of all theoretical results in the appendix. Some additional technical results are contained in the online supplement.

Notation. Throughout the paper, we write $M = \{M_{ij}\}$ for a matrix with its $(i,j)$-th entry denoted $M_{ij}$. We use $\|\cdot\|_{op}$, $\|\cdot\|_F$, and $\|\cdot\|_*$ to denote the matrix spectral, Frobenius, and nuclear norms, respectively. We use $[n]$ to denote $\{1, \cdots, n\}$ for a positive integer $n$. For a vector $u$, $\|u\|$ and $u^\top$ denote its $L_2$ norm and transpose, respectively. For a vector $a = (a_1, \cdots, a_n)$, let $\mathrm{diag}(a)$ be the diagonal matrix whose diagonal is $a$. For a symmetric matrix $B \in \mathbb R^{K \times K}$, we define $\mathrm{vech}(B) = (B_{11}, B_{12}, B_{22}, \cdots, B_{1K}, \cdots, B_{K-1,K}, B_{KK})^\top$. We define $\max(u,v) = u \vee v$ and $\min(u,v) = u \wedge v$ for two real numbers $u$ and $v$. We write $1\{A\}$ for the usual indicator function that takes value 1 if event $A$ happens and 0 otherwise.

2 Model and Basic Assumptions

In this section, we introduce the model and basic assumptions.
2.1 The Model

For $i \neq j \in [n]$, let $Y_{ij}$ denote the dummy variable for a link between nodes $i$ and $j$. It takes value 1 if nodes $i$ and $j$ are linked and 0 otherwise. Let $W_{ij} = (W_{1,ij}, ..., W_{p,ij})^\top$ denote a $p$-vector of measurements of homophily between nodes $i$ and $j$. Researchers observe the network adjacency matrix $\{Y_{ij}\}$ and covariates $\{W_{ij}\}$. We model the link formation between $i$ and $j$ as
$$Y_{ij} = 1\Big\{\varepsilon_{ij} \le \log\zeta_n + \sum_{l=0}^{p} W_{l,ij}\Theta^*_{l,ij}\Big\}, \quad i < j, \qquad (2.1)$$
where $\{\zeta_n\}_{n \ge 1}$ is a deterministic sequence that may decay to zero and is used to control the expected degree in the network, $W_{0,ij} = 1$, and $W_{l,ij} = W_{l,ji}$ for $j \neq i$ and $l \in [p]$. For clarity, we consider an undirected network so that $Y_{ij} = Y_{ji}$ and $\Theta^*_{l,ij} = \Theta^*_{l,ji}$ for all $l$ if $i \neq j$; $\varepsilon_{ij}$ follows the standard logistic distribution for $i < j$, and $\varepsilon_{ij} = \varepsilon_{ji}$. Let $Y_{ii} = 0$ for all $i \in [n]$.

Apparently, without making any assumptions on $\Theta^*_l = \{\Theta^*_{l,ij}\}$ for $l \in [p] \cup \{0\}$, one cannot estimate all the parameters in (2.1), as the number of parameters can easily exceed the number of observations in the model. Specifically, we will follow the literature on reduced rank regression and assume that each $\Theta^*_l$ exhibits a certain low-rank structure. Even so, it is easy to see that our model in (2.1) is fairly general, and it includes a variety of network formation models as special cases.

1. If $\log(\zeta_n) = 2\bar a$ with $\bar a = n^{-1}\sum_{i=1}^n a_i$, $\alpha_i = a_i - \bar a$, $\Theta^*_{0,ij} = \alpha_i + \alpha_j$, and $p = 0$, then
$$Y_{ij} = 1\{\varepsilon_{ij} \le a_i + a_j\}. \qquad (2.2)$$
Under the standard logistic distribution assumption on $\varepsilon_{ij}$, $P(Y_{ij} = 1) = \frac{\exp(a_i + a_j)}{1 + \exp(a_i + a_j)}$ for all $i \neq j$, and we have the simplest exponential graph model (Beta model) considered in the literature; see, e.g., Lusher, Koskinen, and Robins (2013).

2. If $\log(\zeta_n)$ and $\Theta^*_{0,ij}$ are defined as above and $\Theta^*_{l,ij} = \beta_l$ for $l \in [p]$, then
$$Y_{ij} = 1\{\varepsilon_{ij} \le a_i + a_j + W_{ij}^\top\beta\}, \qquad (2.3)$$
where $\beta = (\beta_1, ..., \beta_p)^\top$. Apparently, (2.3) is the undirected dyadic link formation model with degree heterogeneity studied in Graham (2017). See also Yan, Jiang, Fienberg, and Leng (2019) for the case of a directed network.

3. Let $\Theta_{0,ij} = \Theta^*_{0,ij} + \log\zeta_n$. If $p = 0$ and $\Theta_0 = \{\Theta_{0,ij}\}$ is assumed to exhibit the stochastic block structure such that $\Theta_{0,ij} = b_{kl}$ if nodes $i$ and $j$ belong to communities $k$ and $l$, respectively, then we have
$$Y_{ij} = 1\{\varepsilon_{ij} \le \Theta_{0,ij}\}. \qquad (2.4)$$
Corresponding to the simple SBM with $K$ communities, the probability matrix $P = \{P_{ij}\}$ with $P_{ij} = P(Y_{ij} = 1)$ can be written as $P = ZBZ^\top$, where $Z = \{Z_{ik}\}$ denotes an $n \times K$ binary matrix providing the cluster membership of each node, i.e., $Z_{ik} = 1$ if node $i$ is in community $k$ and $Z_{ik} = 0$ otherwise, and $B = \{B_{kl}\}$ denotes the block probability matrix that depends on $b_{kl}$. See Holland et al. (1983) and the references cited in the introduction section.

4. Let $\Theta_{0,ij} = \Theta^*_{0,ij} + \log\zeta_n$. If $\Theta_0 = \{\Theta_{0,ij}\}$ is assumed to exhibit the stochastic block structure such that $\Theta_{0,ij} = b_{kl}$ if nodes $i$ and $j$ belong to communities $k$ and $l$, respectively, and $\Theta^*_{l,ij} = \beta_l$ for $l \in [p]$, then
$$Y_{ij} = 1\{\varepsilon_{ij} \le \Theta_{0,ij} + W_{ij}^\top\beta\}. \qquad (2.5)$$
Then (2.5) defines a stochastic block model with covariates considered in Sweet (2015), Leger (2016), and Roy, Atchade, and Michailidis (2019).

Under the assumptions specified in the next subsection, it is easy to see that the expected degree of the network is of order $n\zeta_n$.
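To fix ideas, the following minimal numpy sketch simulates one network from a special case of (2.1) in which both $\Theta^*_0$ and $\Theta^*_1$ have the block structure of Examples 3–4. All names and numeric values are our own illustration, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 300, 2
zeta_n = 0.5 * np.log(n) / np.sqrt(n)      # sparsity sequence, an illustrative choice

# community labels and block coefficient matrices (illustrative values)
g = rng.integers(0, K, size=n)             # community of each node
Z = np.eye(K)[g]                           # n x K membership matrix
B0 = np.array([[0.5, 0.1], [0.1, 0.6]])    # block structure for Theta0*
B1 = np.array([[0.8, 0.2], [0.2, 0.9]])    # block structure for Theta1*

X = rng.normal(size=n)
W1 = np.abs(X[:, None] - X[None, :])       # pairwise homophily covariate
Theta0 = Z @ B0 @ Z.T
Theta1 = Z @ B1 @ Z.T

logit = np.log(zeta_n) + Theta0 + W1 * Theta1
P = 1.0 / (1.0 + np.exp(-logit))           # standard logistic CDF
U = rng.random((n, n))
Y = np.triu((U < P).astype(int), k=1)      # draw the upper triangle, i < j
Y = Y + Y.T                                # symmetrize; Y_ii = 0

print("average degree:", Y.sum(axis=1).mean())
```

As the sketch illustrates, the average degree scales with $n\zeta_n$, which is the quantity the theory below controls.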
In the theory to be developed below, we allow $\zeta_n$ to shrink to zero at a rate as slow as $n^{-1}\log n$, so that the expected degree can be as small as $C\log n$ for some sufficiently large constant $C$ and the network is semi-dense. Of course, if $\zeta_n$ is fixed or converges to a positive constant as $n \to \infty$, the network becomes dense.

To proceed, let $\tau_n = \log(\zeta_n)$, $\Gamma^*_{0,ij} = \tau_n + \Theta^*_{0,ij}$, $\Gamma^*_{ij} = (\Gamma^*_{0,ij}, \Theta^*_{1,ij}, ..., \Theta^*_{p,ij})^\top$, and $W_{ij} = (W_{0,ij}, W_{1,ij}, ..., W_{p,ij})^\top$, where $W_{0,ij} = 1$. Let $\Gamma^* = (\Gamma^*_0, \Theta^*_1, ..., \Theta^*_p)$, where $\Gamma^*_0 = \{\Gamma^*_{0,ij}\}$ and $\Theta^*_l = \{\Theta^*_{l,ij}\}$ for $l \in [p]$. Then, we can rewrite the model in (2.1) as
$$Y_{ij} = 1\{\varepsilon_{ij} \le W_{ij}^\top\Gamma^*_{ij}\}. \qquad (2.6)$$
Below, we impose some basic assumptions on the model in order to propose a multi-step procedure to estimate the parameters of interest.

2.2 Basic Assumptions

Now, we state a set of basic assumptions to characterize the model in (2.1). The first assumption is about the data generating process (DGP).
Assumption 1.
1. For $l \in [p]$, $\{W_{l,ij}\}_{1 \le i < j \le n}$ ...
Example 1. Let $\Theta^*_{0,ij} = \alpha_i + \alpha_j$, where $\sum_{i=1}^n \alpha_i = 0$. In this case, $K_0 = 2$ and $n^{-1}\Theta^*_0 = \tilde U_0 \Sigma_0 \tilde V_0^\top$, where
$$\tilde U_0 = \begin{pmatrix} \sqrt{\tfrac{1}{2n}}\big(1 + \tfrac{\alpha_1}{s_n}\big) & -\sqrt{\tfrac{1}{2n}}\big(1 - \tfrac{\alpha_1}{s_n}\big) \\ \vdots & \vdots \\ \sqrt{\tfrac{1}{2n}}\big(1 + \tfrac{\alpha_n}{s_n}\big) & -\sqrt{\tfrac{1}{2n}}\big(1 - \tfrac{\alpha_n}{s_n}\big) \end{pmatrix}, \quad \tilde V_0 = \begin{pmatrix} \sqrt{\tfrac{1}{2n}}\big(1 + \tfrac{\alpha_1}{s_n}\big) & \sqrt{\tfrac{1}{2n}}\big(1 - \tfrac{\alpha_1}{s_n}\big) \\ \vdots & \vdots \\ \sqrt{\tfrac{1}{2n}}\big(1 + \tfrac{\alpha_n}{s_n}\big) & \sqrt{\tfrac{1}{2n}}\big(1 - \tfrac{\alpha_n}{s_n}\big) \end{pmatrix},$$
$$\Sigma_0 = \begin{pmatrix} s_n & 0 \\ 0 & s_n \end{pmatrix}, \quad \text{and} \quad s_n^2 = n^{-1}\sum_{i=1}^n \alpha_i^2.$$
Similarly, it is easy to verify that
$$U_0 = \begin{pmatrix} \tfrac{1}{\sqrt 2}(s_n + \alpha_1) & -\tfrac{1}{\sqrt 2}(s_n - \alpha_1) \\ \vdots & \vdots \\ \tfrac{1}{\sqrt 2}(s_n + \alpha_n) & -\tfrac{1}{\sqrt 2}(s_n - \alpha_n) \end{pmatrix} \quad \text{and} \quad V_0 = \sqrt n\,\tilde V_0.$$
Note that we allow $\{\alpha_i\}_{i=1}^n$ to depend on $\{W_{ij}\}_{1 \le i < j \le n}$.

Example 2. Let $\Theta^*_1 = Z_1 B^*_1 Z_1^\top$, where $Z_1 \in \mathbb R^{n \times K_1}$ is the membership matrix, $K_1$ denotes the number of distinctive communities for $\Theta^*_1$, and $B^*_1 \in \mathbb R^{K_1 \times K_1}$ is symmetric with rank $K_1$. Let $\iota_n$ denote an $n \times 1$ vector of ones. For normalization, we assume $\iota_n^\top Z_1 B^*_1 Z_1^\top \iota_n = p_1^\top B^*_1 p_1 = 0$, where $p_1^\top = (n_{1,1}, n_{1,2}, \cdots, n_{1,K_1})$ and $n_{1,k}$ denotes the size of $\Theta^*_1$'s $k$-th community for $k \in [K_1]$. Then, as Lemma 2.1 below shows, $U_1 = Z_1(\Pi_{1,n})^{-1/2}S_1'\Sigma_1$ and $V_1 = Z_1(\Pi_{1,n})^{-1/2}S_1$, where $S_1$ and $S_1'$ are two $K_1 \times K_1$ matrices such that $S_1^\top S_1 = I_{K_1} = (S_1')^\top S_1'$, $\Pi_{1,n} = n^{-1}\mathrm{diag}(p_1)$, and $\Sigma_1$ is the singular value matrix of $\Pi_{1,n}^{1/2}B^*_1\Pi_{1,n}^{1/2}$. Note that in this example, we allow the group structures $Z_0$ and $Z_l$, $l \in [p]$, to be different.

Let $n_{l,k}$ denote the number of nodes in $\Theta^*_l$'s $k$-th group for $k \in [K_l]$ and $l \in [p]$. Let $\pi_{l,kn} = n_{l,k}/n$ and $\Pi_{l,n} = \mathrm{diag}(\pi_{l,1n}, \cdots, \pi_{l,K_l n})$ for $l \in [p]$. The next assumption imposes some conditions on the community sizes.

Assumption 3. 1. There exist some constants $C_\sigma$ and $c_\sigma$ such that
$$\infty > C_\sigma \ge \limsup_n \max_{l \in [p] \cup \{0\}} \sigma_{1,l} \ge \liminf_n \min_{l \in [p] \cup \{0\}} \sigma_{K_l,l} \ge c_\sigma > 0.$$
2. There exist some constants $C_0$ and $c_0$ such that
$$\infty > C_0 \ge \limsup_n \max_{k \in [K_l],\, l \in [p]} \pi_{l,kn} \ge \liminf_n \min_{k \in [K_l],\, l \in [p]} \pi_{l,kn} \ge c_0 > 0.$$

Two remarks are in order. First, Assumption 3 implies that the size of each community of $\Theta^*_l$ is proportional to the number of nodes $n$. Such an assumption is common in the literature on network community detection and panel data latent structure detection. Second, it is possible to allow $\pi_{l,kn}$ and/or $\sigma_{k,l}$ to vary with $n$; in this case, one just needs to keep track of all these terms in the proofs.

To proceed, we state a lemma that lays down the foundation for our estimation procedure in the next section.

Lemma 2.1. Suppose that Assumptions 2 and 3 hold. Then,
1. $V_l = Z_l(\Pi_{l,n})^{-1/2}S_l$ and $U_l = Z_l(\Pi_{l,n})^{-1/2}S_l'\Sigma_l$ for $l \in [p]$, where $S_l$ and $S_l'$ are two $K_l \times K_l$ matrices such that $S_l^\top S_l = I_{K_l} = (S_l')^\top S_l'$.
2. $\max_{j \in [n]}\|v_{j,l}\| \le c_0^{-1/2} < \infty$ and $\max_{i \in [n]}\|u_{i,l}\| \le c_0^{-1/2}C_\sigma < \infty$ for $l \in [p]$.
3. If $z_{i,l} \neq z_{j,l}$, then $\Big\|\frac{v_{i,l}}{\|v_{i,l}\|} - \frac{v_{j,l}}{\|v_{j,l}\|}\Big\| = \|(z_{i,l} - z_{j,l})S_l\| = \sqrt 2$ for $l \in [p]$.

Lemma 2.1 implies that the singular vectors $\{v_{j,l}\}_{j \in [n]}$ of $\Theta^*_l$ contain information about the community structure. A similar result has been established in the community detection literature; see, e.g., Rohe et al. (2011, Lemma 3.1) and Su et al. (2020, Theorem II.1). If we assume $\Theta^*_0$ exhibits a community structure, similar results also hold for it.
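The content of Lemma 2.1 is easy to check numerically: the top right singular vectors of $Z_1 B^*_1 Z_1^\top$ have normalized rows that coincide within a community, so K-means applied to those rows recovers the membership exactly. Below is a small sketch under assumed values; the matrix `B1` and all names are ours, not the paper's.

```python
import numpy as np
from itertools import permutations
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
n, K1 = 200, 3
g = rng.integers(0, K1, size=n)                  # community labels
Z1 = np.eye(K1)[g]                               # membership matrix
B1 = np.array([[0.7, 0.1, 0.2],
               [0.1, 0.8, 0.1],
               [0.2, 0.1, 0.6]])                 # symmetric, full rank
Theta1 = Z1 @ B1 @ Z1.T

# top-K1 right singular vectors of n^{-1} Theta1*
_, s, Vt = np.linalg.svd(Theta1 / n)
V1 = np.sqrt(n) * Vt[:K1].T                      # V_1 = sqrt(n) x (first K1 columns)

# normalized rows of V1 are identical within a community (Lemma 2.1.3)
V1n = V1 / np.linalg.norm(V1, axis=1, keepdims=True)
_, labels = kmeans2(V1n, K1, minit='++', seed=2)

# up to a label permutation, K-means recovers g exactly here
acc = max(np.mean(np.array(perm)[labels] == g) for perm in permutations(range(K1)))
print("exact recovery:", acc == 1.0)
```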
3 The Estimation

For notational simplicity, we will focus on the case of $p = 1$ and study the recovery of the latent community structure in $\Theta^*_1$ below. The general case with multiple covariates involves fundamentally no new ideas but more complicated notation.

First, we recognize that $\Gamma^*_0$ and $\Gamma^*_1$ are both low-rank matrices with ranks bounded from above by $K_0 + 1$ and $K_1$, respectively. We can obtain their preliminary estimates via the nuclear norm penalized logistic regression. Second, based on the normalization imposed in Assumption 2.1, we can estimate $\tau_n$ and $\Theta^*_0$ separately. We then apply the SVD to the preliminary estimates of $\Theta^*_0$ and $\Theta^*_1$ and obtain the estimates of $U_l$, $\Sigma_l$, and $V_l$, $l = 0, 1$. Third, we plug back the second-step estimates of $\{V_l\}_{l=0,1}$ and re-estimate each row of $U_l$ by a row-wise logistic regression. We can further iterate this procedure and estimate $U_l$ and $V_l$ alternately. Last, we apply the K-means algorithm to the final estimate of $V_1$ to recover the community memberships. We rely on a sample-splitting technique along with the estimation. Throughout, we assume the ranks $K_0$ and $K_1$ are known; we will propose a singular-value-ratio-based criterion to select them in Section 4.6.

Below is an overview of the multi-step estimation procedure that we propose.

1. Using the full sample, run the nuclear norm regularized estimation twice as detailed in Section 3.1 and obtain $\hat\tau_n$ and $\{\hat\Sigma_l\}_{l=0,1}$, the preliminary estimates of $\tau_n$ and $\{\Sigma_l\}_{l=0,1}$.

2. Randomly split the nodes into two subsets, denoted $I_1$ and $I_2$. Using edges $(i,j) \in I_1 \times [n]$, run the nuclear norm estimation twice as detailed in Section 3.2 and obtain $\{\hat V^{(1)}_l\}_{l=0,1}$, a preliminary estimate of $\{V_l\}_{l=0,1}$. For $j \in [n]$, denote the $j$-th row of $\hat V^{(1)}_l$ as $(\hat v^{(1)}_{j,l})^\top$, which is a preliminary estimate of $v_{j,l}^\top$.

3. For each $i \in I_2$, take $\{\hat v^{(1)}_{j,l}\}_{j \in I_2, l=0,1}$ as regressors and run the row-wise logistic regression to obtain $\{\hat u^{(1)}_{i,l}\}_{l=0,1}$, the estimates of $\{u_{i,l}\}_{l=0,1}$. For each $j \in [n]$, take $\{\hat u^{(1)}_{i,l}\}_{i \in I_2, l=0,1}$ as regressors and run the column-wise logistic regression to obtain updated estimates $\{\dot v^{(0,1)}_{j,l}\}_{l=0,1}$ of $\{v_{j,l}\}_{l=0,1}$. See Section 3.3 for details.

4. Based on $\{\dot v^{(0,1)}_{j,l}\}_{j \in [n], l=0,1}$, obtain the iterative estimates $(\dot u^{(h,1)}_{i,0}, \dot u^{(h,1)}_{i,1})_{i \in [n]}$ and $(\dot v^{(h,1)}_{j,0}, \dot v^{(h,1)}_{j,1})_{j \in [n]}$ of the singular vectors as in Step 3 for $h = 1, 2, \cdots, H$. See Section 3.4 for details.

5. Switch the roles of $I_1$ and $I_2$ and repeat Steps 2–4 to obtain $(\dot u^{(h,2)}_{i,0}, \dot u^{(h,2)}_{i,1})_{i \in [n]}$ and $(\dot v^{(h,2)}_{j,0}, \dot v^{(h,2)}_{j,1})_{j \in [n]}$ for $h \in [H]$. Let
$$\bar v_{j,1} = \left(\frac{(\dot v^{(H,1)}_{j,1})^\top}{\|\dot v^{(H,1)}_{j,1}\|}, \frac{(\dot v^{(H,2)}_{j,1})^\top}{\|\dot v^{(H,2)}_{j,1}\|}\right)^\top.$$
Then, apply the K-means algorithm to $\{\bar v_{j,1}\}_{j \in [n]}$ to recover the community memberships in $\Theta^*_1$, as detailed in Section 3.5.

Several remarks are in order. First, $\hat\tau_n$ and $\{\hat\Sigma_l\}_{l=0,1}$ obtained in Step 1 are used in Steps 3–5 and to determine $\{K_l\}_{l=0,1}$ in Section 4.6, respectively. Second, we employ the sample-splitting technique to create independence between the edges used in Steps 2 and 3. As $\hat V^{(1)}_l$ in Step 2 is estimated by the nuclear-norm regularized logistic regression, we can only control its estimation error in the Frobenius norm, as shown in Theorem 4.1. On the other hand, to analyze the row-wise estimator, we need to control the estimation error of $\hat V^{(1)}_l$ in the row-wise $L_2$ norm (denoted $\|\cdot\|_{2\to\infty}$). We overcome the discrepancy between $\|\cdot\|_F$ and $\|\cdot\|_{2\to\infty}$ through the independence structure. Third, one may propose to use each row of the full-sample low-rank estimator $\hat V_l$ as $\{\dot v_{j,l}\}_{j \in [n]}$, the initial estimates in Step 4.
However, as $\hat V_l$ is estimated using the full sample, it is not independent of, say, the $i$-th row of the edges if we want to estimate $(u_{i,0}^\top, u_{i,1}^\top)$. Fourth, in the literature, researchers overcome this difficulty by using the "leave-one-out" technique; see, for example, Abbe, Fan, Wang, and Zhong (2017), Bean, Bickel, El Karoui, and Yu (2013), Javanmard and Montanari (2018), Su et al. (2020), and Zhong and Boumal (2018), among others. Denote by $\hat\Theta^{(i)}_l$ the low-rank estimator of $\Theta^*_l$ using all the edges except those on the $i$-th row and column, and let $\hat V^{(i)}_l$ be obtained by applying the SVD to $\hat\Theta^{(i)}_l$. The key step of the "leave-one-out" technique is to establish a perturbation theory to bound $\hat\Theta^{(i)}_l - \hat\Theta_l$, and thus $\hat V^{(i)}_l - \hat V_l$. However, unlike in the community detection literature, $\hat\Theta_l$ and $\hat\Theta^{(i)}_l$ are not directly observed but estimated by the nuclear-norm regularized logistic regression. It is interesting but very challenging, if possible at all, to establish such a perturbation theory. Fifth, although sample splitting can result in information loss, we compensate for it in three ways: (1) we treat the sample-split estimator $\dot v^{(0,1)}_{j,l}$ only as an initial value and, in Step 4, update it via an iterative algorithm which uses all the edges; (2) we switch the roles of $I_1$ and $I_2$ and obtain $\dot v^{(H,1)}_{j,l}$ and $\dot v^{(H,2)}_{j,l}$ after $H$ iterations; (3) to mitigate the concern about the randomness caused by a single sample split, in Section 3.5 we propose to repeat the sample splitting $R$ times to obtain $R$ classifications and select one of them based on the maximum-likelihood principle.

3.1 Full-Sample Nuclear-Norm Regularized Estimation

Recall that $\Gamma^*_0 = \tau_n + \Theta^*_0$ and $\Gamma^*_1 = \Theta^*_1$. Let $\Gamma^* = (\Gamma^*_0, \Gamma^*_1)$. Let $\Lambda(u) = (1 + \exp(-u))^{-1}$ denote the standard logistic cumulative distribution function. Let
$$\ell_{ij}(\Gamma_{ij}) = Y_{ij}\log(\Lambda(W_{ij}^\top\Gamma_{ij})) + (1 - Y_{ij})\log(1 - \Lambda(W_{ij}^\top\Gamma_{ij}))$$
denote the conditional logistic log-likelihood associated with nodes $i$ and $j$. Let
$$T(\tau, c_n) = \{(\Gamma_0, \Gamma_1) \in \mathbb R^{n \times n} \times \mathbb R^{n \times n} : |\Gamma_{0,ij} - \tau| \le c_n,\ |\Gamma_{1,ij}| \le c_n\}.$$
We propose to estimate $\Gamma^*$ by $\tilde\Gamma = (\tilde\Gamma_0, \tilde\Gamma_1)$ via minimizing the negative logistic log-likelihood with nuclear norm regularization:
$$\tilde\Gamma = \arg\min_{\Gamma \in T(0, \log n)} Q_n(\Gamma) + \lambda_n\sum_{l=0}^1 \|\Gamma_l\|_*, \qquad (3.1)$$
where $Q_n(\Gamma) = -\frac{1}{n(n-1)}\sum_{i,j \in [n], i \neq j}\ell_{ij}(\Gamma_{ij})$ and $\lambda_n > 0$ is a tuning parameter. Recall that we allow $\zeta_n$ to shrink to zero at a rate as slow as $n^{-1}\log n$, so that $\tau_n = \log(\zeta_n)$ is slightly smaller than $\log n$ in magnitude; it therefore suffices to consider a parameter space $T(0, \log n)$ that expands at rate $\log n$. Later on, we specify $\lambda_n = \frac{C_\lambda(\sqrt{\zeta_n n} + \sqrt{\log n})}{n(n-1)}$ for some constant tuning parameter $C_\lambda$. Throughout the paper, we assume $W_{1,ij}$ has been rescaled so that its standard error is one; therefore, we do not need to consider different penalty loadings for $\|\Gamma_0\|_*$ and $\|\Gamma_1\|_*$. Many statistical software packages automatically normalize the regressors when estimating a generalized linear model; we recommend this normalization in practice before using our algorithm.

Let $\tilde\tau_n = \frac{1}{n(n-1)}\sum_{i \neq j}\tilde\Gamma_{0,ij}$. We will show that $\tilde\tau_n$ lies within a $c_\tau\sqrt{\log n}$-neighborhood of the true value $\tau_n$, where $c_\tau$ can be made arbitrarily small provided the expected degree is larger than $C\log n$ for some sufficiently large $C$. This rate is insufficient and needs to be refined. Given $\tilde\tau_n$, we propose to reestimate $\Gamma^*$ by $\hat\Gamma = (\hat\Gamma_0, \hat\Gamma_1)$, where
$$\hat\Gamma = \arg\min_{\Gamma \in T(\tilde\tau_n, C_M\sqrt{\log n})} Q_n(\Gamma) + \lambda_n\sum_{l=0}^1 \|\Gamma_l\|_*,$$
and $C_M$ is some constant to be specified later.
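A standard way to compute estimators such as (3.1) is proximal gradient descent, where the proximal map of the nuclear norm is singular value soft-thresholding. The sketch below is a generic solver under simplifying assumptions ($p = 1$, the box constraint $T(0, \log n)$ omitted); it is not the authors' implementation, and all function names are ours.

```python
import numpy as np

def svt(M, t):
    """Singular value soft-thresholding: the prox of t*||.||_* at M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ (np.maximum(s - t, 0.0)[:, None] * Vt)

def nuclear_logistic(Y, W1, lam, step=1.0, iters=300):
    """Proximal-gradient sketch for (3.1); Gamma0, Gamma1 are the two
    n x n coefficient matrices (Gamma0 absorbs the intercept tau_n)."""
    n = Y.shape[0]
    mask = 1.0 - np.eye(n)                           # exclude diagonal terms
    G0, G1 = np.zeros((n, n)), np.zeros((n, n))
    for _ in range(iters):
        P = 1.0 / (1.0 + np.exp(-(G0 + W1 * G1)))    # fitted logistic CDF
        R = (P - Y) * mask / (n * (n - 1))           # gradient of Q_n w.r.t. Gamma0
        G0 = svt(G0 - step * R, step * lam)          # prox step for Gamma0
        G1 = svt(G1 - step * W1 * R, step * lam)     # gradient carries W1 for Gamma1
        G0, G1 = (G0 + G0.T) / 2, (G1 + G1.T) / 2    # keep symmetry
    return G0, G1
```

In practice the tuning constant $C_\lambda$ enters through `lam`; the two-stage refit around $\tilde\tau_n$ described above is omitted here for brevity.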
Note that we now restrict the parameter space to expand at rate $\sqrt{\log n}$ only. (The proof of Theorem 4.1.1 suggests that $\tilde\tau_n - \tau_n = O_p(\eta_n\sqrt{\log n})$, which is $o_p(\sqrt{\log n})$ (resp. $o_p(1)$) if one assumes that the magnitude $n\zeta_n$ of the expected degree is of order higher than $\log n$ (resp. $(\log n)^2$). But we will only assume that $\eta_{1n} \le c_F \le 1$ for some sufficiently small constant $c_F$ below.)

Let $\hat\tau_n = \frac{1}{n(n-1)}\sum_{i \neq j}\hat\Gamma_{0,ij}$. Since the $\Theta^*_l = \{\Theta^*_{l,ij}\}$ are symmetric, we define their preliminary low-rank estimators as $\hat\Theta_l = \{\hat\Theta_{l,ij}\}$, where
$$\hat\Theta_{l,ij} = \begin{cases} f_M\big((\hat\Gamma_{l,ij} + \hat\Gamma_{l,ji})/2 - \hat\tau_n\delta_l\big) & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases} \quad \text{for } l = 0, 1,$$
$\delta_l = 1\{l = 0\}$, $f_M(u) = u \cdot 1\{|u| \le M\} + M \cdot 1\{u > M\} - M \cdot 1\{u < -M\}$, and $M$ is some positive constant. For $l = 0, 1$, we denote the SVD of $n^{-1}\hat\Theta_l$ as $n^{-1}\hat\Theta_l = \hat{\tilde U}_l\hat{\tilde\Sigma}_l(\hat{\tilde V}_l)^\top$, where $\hat{\tilde\Sigma}_l = \mathrm{diag}(\hat\sigma_{1,l}, ..., \hat\sigma_{n,l})$, $\hat\sigma_{1,l} \ge \cdots \ge \hat\sigma_{n,l} \ge 0$, and both $\hat{\tilde U}_l$ and $\hat{\tilde V}_l$ are $n \times n$ unitary matrices. Let $\hat{\bar V}_l$ consist of the first $K_l$ columns of $\hat{\tilde V}_l$, so that $(\hat{\bar V}_l)^\top\hat{\bar V}_l = I_{K_l}$, and let $\hat\Sigma_l = \mathrm{diag}(\hat\sigma_{1,l}, \cdots, \hat\sigma_{K_l,l})$. Then $\hat V_l = \sqrt n\,\hat{\bar V}_l$. Let $\eta_{1n} = \sqrt{\frac{\log n}{n\zeta_n}}$ and $\eta_n = \eta_{1n} + \eta_{1n}^2$.

3.2 Split-Sample Low-Rank Estimation

We divide the $n$ nodes into two roughly equal-sized subsets $(I_1, I_2)$. Let $n_\ell = |I_\ell|$ denote the cardinality of the set $I_\ell$. If $n$ is even, one can simply set $n_\ell = n/2$ for $\ell = 1, 2$. Now, we only use the pairs of observations $(i,j) \in I_1 \times [n]$ to conduct the low-rank estimation. Let $\Gamma^*_l(I_1)$ consist of the $i$-th rows of $\Gamma^*_l$ for $i \in I_1$, $l = 0, 1$. Let $\Gamma^*(I_1) = (\Gamma^*_0(I_1), \Gamma^*_1(I_1))$. Define
$$T^{(1)}(\tau, c_n) = \{(\Gamma_0, \Gamma_1) \in \mathbb R^{n_1 \times n} \times \mathbb R^{n_1 \times n} : |\Gamma_{0,ij} - \tau| \le c_n,\ |\Gamma_{1,ij}| \le c_n\}.$$
We estimate $\Gamma^*(I_1)$ via the following nuclear-norm regularized estimation:
$$\tilde\Gamma^{(1)} = \arg\min_{\Gamma \in T^{(1)}(0, \log n)} Q^{(1)}_n(\Gamma) + \lambda^{(1)}_n\sum_{l=0}^1\|\Gamma_l\|_*, \qquad (3.2)$$
where $Q^{(1)}_n(\Gamma) = -\frac{1}{n_1(n-1)}\sum_{i \in I_1, j \in [n], i \neq j}\ell_{ij}(\Gamma_{ij})$ and $\lambda^{(1)}_n = \frac{C_\lambda(\sqrt{\zeta_n n} + \sqrt{\log n})}{n_1(n-1)}$.

Let $\tilde\tau^{(1)}_n = \frac{1}{n_1(n-1)}\sum_{i \in I_1, j \in [n], i \neq j}\tilde\Gamma^{(1)}_{0,ij}$. As above, this estimate lies within a $c_\tau\sqrt{\log n}$-neighborhood of the true value $\tau_n$. To refine it, we reestimate $\Gamma^*(I_1)$ by $\hat\Gamma^{(1)} = (\hat\Gamma^{(1)}_0, \hat\Gamma^{(1)}_1)$:
$$\hat\Gamma^{(1)} = \arg\min_{\Gamma \in T^{(1)}(\tilde\tau^{(1)}_n, C_M\sqrt{\log n})} Q^{(1)}_n(\Gamma) + \lambda^{(1)}_n\sum_{l=0}^1\|\Gamma_l\|_*.$$
Let $\hat\tau^{(1)}_n = \frac{1}{n_1(n-1)}\sum_{i \in I_1, j \in [n], i \neq j}\hat\Gamma^{(1)}_{0,ij}$. Noting that the $\Gamma^*_l$, $l = 0, 1$, are symmetric, we define the preliminary low-rank estimates of the $n_1 \times n$ matrices $\Theta^*_l(I_1)$ by $\hat\Theta^{(1)}_l$ for $l = 0, 1$, where
$$\hat\Theta^{(1)}_{l,ij} = \begin{cases} f_M\big((\hat\Gamma^{(1)}_{l,ij} + \hat\Gamma^{(1)}_{l,ji})/2 - \hat\tau^{(1)}_n\delta_l\big) & \text{if } (i,j) \in I_1 \times I_1,\ i \neq j \\ 0 & \text{if } (i,j) \in I_1 \times I_1,\ i = j \\ f_M\big(\hat\Gamma^{(1)}_{l,ij} - \hat\tau^{(1)}_n\delta_l\big) & \text{if } i \in I_1,\ j \notin I_1, \end{cases}$$
and $\delta_l$, $f_M(u)$, and $M$ are defined in Step 1. For $l = 0, 1$, we denote the SVD of $n^{-1}\hat\Theta^{(1)}_l$ as $n^{-1}\hat\Theta^{(1)}_l = \hat{\tilde U}^{(1)}_l\hat{\tilde\Sigma}^{(1)}_l(\hat{\tilde V}^{(1)}_l)^\top$, where $\hat{\tilde\Sigma}^{(1)}_l$ is a rectangular $(n_1 \times n)$ diagonal matrix with $\hat\sigma^{(1)}_{i,l}$ in the $(i,i)$-th position and zeros elsewhere, $\hat\sigma^{(1)}_{1,l} \ge \cdots \ge \hat\sigma^{(1)}_{n_1,l} \ge 0$, and $\hat{\tilde U}^{(1)}_l$ and $\hat{\tilde V}^{(1)}_l$ are $n_1 \times n_1$ and $n \times n$ unitary matrices, respectively. Let $\hat{\bar V}^{(1)}_l$ consist of the first $K_l$ columns of $\hat{\tilde V}^{(1)}_l$, so that $(\hat{\bar V}^{(1)}_l)^\top\hat{\bar V}^{(1)}_l = I_{K_l}$, and let $\hat\Sigma^{(1)}_l = \mathrm{diag}(\hat\sigma^{(1)}_{1,l}, \cdots, \hat\sigma^{(1)}_{K_l,l})$. Then $\hat V^{(1)}_l = \sqrt n\,\hat{\bar V}^{(1)}_l$, and $(\hat v^{(1)}_{j,l})^\top$ is the $j$-th row of $\hat V^{(1)}_l$ for $j \in [n]$.
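The symmetrization, truncation by $f_M$, and SVD extraction in Section 3.2 can be summarized as in the following sketch, which assumes $I_1$ is given as an integer index array; all names are ours, not the authors' code.

```python
import numpy as np

def split_sample_svd(Gamma0_hat, Gamma1_hat, tau_hat, I1, K0, K1, M=2.0):
    """Build Theta_hat^{(1)}_l from the split-sample estimates (n1 x n matrices
    with rows indexed by I1) and extract V_hat^{(1)}_l, following Section 3.2."""
    n1, n = Gamma0_hat.shape
    f_M = lambda u: np.clip(u, -M, M)               # the truncation f_M
    V_hats = []
    for l, G in enumerate([Gamma0_hat, Gamma1_hat]):
        delta = 1.0 if l == 0 else 0.0
        Theta = G - tau_hat * delta                  # subtract tau_hat for l = 0 only
        sub = Theta[:, I1]                           # entries with both ends in I1
        Theta[:, I1] = (sub + sub.T) / 2.0           # symmetrize those entries
        Theta = f_M(Theta)
        Theta[np.arange(n1), I1] = 0.0               # zero out the (i, i) entries
        _, _, Vt = np.linalg.svd(Theta / n, full_matrices=False)
        K = K0 if l == 0 else K1
        V_hats.append(np.sqrt(n) * Vt[:K].T)         # n x K_l matrix V_hat^{(1)}_l
    return V_hats
```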
3.3 Split-Sample Row- and Column-Wise Logistic Regressions

Let $\mu = (\mu_0^\top, \mu_1^\top)^\top$, $\Lambda^{left}_{ij}(\mu) = \Lambda(\hat\tau_n + \sum_{l=0}^1\mu_l^\top\hat v^{(1)}_{j,l}W_{l,ij})$, and $\ell^{left}_{ij}(\mu) = Y_{ij}\log(\Lambda^{left}_{ij}(\mu)) + (1 - Y_{ij})\log(1 - \Lambda^{left}_{ij}(\mu))$. Given the preliminary estimates $\{\hat v^{(1)}_{j,l}\}$ obtained in Step 2, we can estimate the left singular vectors $\{u_{i,0}, u_{i,1}\}$ for each $i \in I_2$ by $\{\hat u^{(1)}_{i,0}, \hat u^{(1)}_{i,1}\}$ via the row-wise logistic regression
$$((\hat u^{(1)}_{i,0})^\top, (\hat u^{(1)}_{i,1})^\top)^\top = \arg\min_{\mu = (\mu_0^\top, \mu_1^\top)^\top \in \mathbb R^{K_0 + K_1}} Q^{(0)}_{in,U}(\mu),$$
where $Q^{(0)}_{in,U}(\mu) = -\frac{1}{n}\sum_{j \in I_2, j \neq i}\ell^{left}_{ij}(\mu)$.

Let $\nu = (\nu_0^\top, \nu_1^\top)^\top$, $\Lambda^{right}_{ij}(\nu) = \Lambda(\hat\tau_n + \sum_{l=0}^1\nu_l^\top\hat u^{(1)}_{i,l}W_{l,ij})$, and $\ell^{right}_{ij}(\nu) = Y_{ij}\log(\Lambda^{right}_{ij}(\nu)) + (1 - Y_{ij})\log(1 - \Lambda^{right}_{ij}(\nu))$. Given $(\hat u^{(1)}_{i,0}, \hat u^{(1)}_{i,1})$, we update the estimates of the right singular vectors $\{v_{j,0}, v_{j,1}\}$ for each $j \in [n]$ by $\{\dot v^{(0,1)}_{j,0}, \dot v^{(0,1)}_{j,1}\}$ via the column-wise logistic regression
$$((\dot v^{(0,1)}_{j,0})^\top, (\dot v^{(0,1)}_{j,1})^\top)^\top = \arg\min_{\nu = (\nu_0^\top, \nu_1^\top)^\top \in \mathbb R^{K_0 + K_1}} Q^{(0)}_{jn,V}(\nu),$$
where $Q^{(0)}_{jn,V}(\nu) = -\frac{1}{n}\sum_{i \in I_2, i \neq j}\ell^{right}_{ij}(\nu)$.

3.4 Full-Sample Iteration

Our final objective is to obtain accurate estimates of $\{v_{j,l}\}_{j \in [n], l=0,1}$. To this end, we treat $\{\dot v^{(0,1)}_{j,0}, \dot v^{(0,1)}_{j,1}\}_{j \in [n]}$ as the initial estimates in the following full-sample iteration procedure. For $h = 1, 2, ..., H$, let $\Lambda^{left,h}_{ij}(\mu) = \Lambda(\hat\tau_n + \sum_{l=0}^1\mu_l^\top\dot v^{(h-1,1)}_{j,l}W_{l,ij})$ and $\ell^{left,h}_{ij}(\mu) = Y_{ij}\log(\Lambda^{left,h}_{ij}(\mu)) + (1 - Y_{ij})\log(1 - \Lambda^{left,h}_{ij}(\mu))$. Given $\{\dot v^{(h-1,1)}_{j,0}, \dot v^{(h-1,1)}_{j,1}\}$, we compute $\{\dot u^{(h,1)}_{i,0}, \dot u^{(h,1)}_{i,1}\}$ via
$$((\dot u^{(h,1)}_{i,0})^\top, (\dot u^{(h,1)}_{i,1})^\top)^\top = \arg\min_{\mu \in \mathbb R^{K_0 + K_1}} Q^{(h)}_{in,U}(\mu),$$
where $Q^{(h)}_{in,U}(\mu) = -\frac{1}{n}\sum_{j \in [n], j \neq i}\ell^{left,h}_{ij}(\mu)$. Given $\{\dot u^{(h,1)}_{i,0}, \dot u^{(h,1)}_{i,1}\}$, by letting $\Lambda^{right,h}_{ij}(\nu) = \Lambda(\hat\tau_n + \sum_{l=0}^1\nu_l^\top\dot u^{(h,1)}_{i,l}W_{l,ij})$ and $\ell^{right,h}_{ij}(\nu) = Y_{ij}\log(\Lambda^{right,h}_{ij}(\nu)) + (1 - Y_{ij})\log(1 - \Lambda^{right,h}_{ij}(\nu))$, we compute $\{\dot v^{(h,1)}_{j,0}, \dot v^{(h,1)}_{j,1}\}$ via
$$((\dot v^{(h,1)}_{j,0})^\top, (\dot v^{(h,1)}_{j,1})^\top)^\top = \arg\min_{\nu \in \mathbb R^{K_0 + K_1}} Q^{(h)}_{jn,V}(\nu),$$
where $Q^{(h)}_{jn,V}(\nu) = -\frac{1}{n}\sum_{i \in [n], i \neq j}\ell^{right,h}_{ij}(\nu)$. We can stop the iteration when a certain convergence criterion is met for sufficiently large $H$.

Switching the roles of $I_1$ and $I_2$ and repeating the procedure in the last three steps, we can obtain the iterative estimates $\{\dot u^{(h,2)}_{i,0}, \dot u^{(h,2)}_{i,1}\}_{i \in [n]}$ and $\{\dot v^{(h,2)}_{j,0}, \dot v^{(h,2)}_{j,1}\}_{j \in [n]}$ for $h = 1, 2, \cdots, H$.

3.5 K-means Classification

Recall that $\bar v_{j,1} = \Big(\frac{(\dot v^{(H,1)}_{j,1})^\top}{\|\dot v^{(H,1)}_{j,1}\|}, \frac{(\dot v^{(H,2)}_{j,1})^\top}{\|\dot v^{(H,2)}_{j,1}\|}\Big)^\top$, a $2K_1 \times 1$ vector, and consider $\{\bar v_{j,1}\}_{j \in [n]}$. Let $\mathcal B = \{\beta_1, \ldots, \beta_{K_1}\}$ be a set of $K_1$ arbitrary $2K_1 \times 1$ vectors $\beta_1, \ldots, \beta_{K_1}$. Define
$$\hat Q_n(\mathcal B) = \frac{1}{n}\sum_{j=1}^n \min_{1 \le k \le K_1}\|\bar v_{j,1} - \beta_k\|^2$$
and $\hat{\mathcal B}_n = \{\hat\beta_1, \ldots, \hat\beta_{K_1}\}$, where $\hat{\mathcal B}_n = \arg\min_{\mathcal B}\hat Q_n(\mathcal B)$. For each $j \in [n]$, we estimate the group identity by
$$\hat g_j = \arg\min_{1 \le k \le K_1}\big\|\bar v_{j,1} - \hat\beta_k\big\|, \qquad (3.3)$$
where, if there are multiple $k$'s that achieve the minimum, $\hat g_j$ takes the value of the smallest one. As mentioned previously, we can repeat Steps 2–5 $R$ times to obtain $R$ membership estimates, denoted $\{\hat g_{j,r}\}_{j \in [n], r \in [R]}$.
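Each row-wise (or column-wise) fit in Steps 3–4 is a small unpenalized logistic regression with the intercept fixed at $\hat\tau_n$, which can be solved by a few Newton steps. A minimal numpy sketch of the row-wise update, assuming $p = 1$ and with all names ours:

```python
import numpy as np

def logistic_refit(y, X, offset=0.0, iters=25, ridge=1e-8):
    """Newton solver for logistic regression with a fixed offset (here tau_hat);
    a minimal sketch of one row-wise fit in Step 3."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(offset + X @ beta)))
        grad = X.T @ (y - p)                                   # score vector
        H = (X * (p * (1 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta += np.linalg.solve(H, grad)                       # Newton step
    return beta

def row_wise_update(Y, W1, V0hat, V1hat, tau_hat, rows, cols):
    """For each node i in `rows`, regress {Y_ij}_{j in cols, j != i} on the
    stacked regressors (v_j0, W1_ij * v_j1), as in Q_{in,U}; returns the
    stacked (u_i0, u_i1) estimates."""
    out = []
    for i in rows:
        js = np.array([j for j in cols if j != i])
        X = np.hstack([V0hat[js], W1[i, js][:, None] * V1hat[js]])
        out.append(logistic_refit(Y[i, js], X, offset=tau_hat))
    return np.vstack(out)
```

The column-wise update is symmetric in the roles of $u$ and $v$, and alternating the two over $h = 1, \ldots, H$ on the full sample gives the iteration of Section 3.4.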
Recall that $\mathrm{vech}(B^*_1) = (B^*_{1,11}, B^*_{1,12}, B^*_{1,22}, \cdots, B^*_{1,1K_1}, \cdots, B^*_{1,K_1-1,K_1}, B^*_{1,K_1K_1})^\top$, which is a $K_1(K_1+1)/2 \times 1$ vector. Let $\chi_{1,ij}$ be the $K_1(K_1+1)/2 \times 1$ vector whose $\big((g_i \vee g_j)((g_i \vee g_j) - 1)/2 + (g_i \wedge g_j)\big)$-th element is one while the rest are zeros, where $g_i \in [K_1]$ denotes the true group membership of the $i$-th node in $\Theta^*_1$. By construction, $\chi_{1,ij}^\top\mathrm{vech}(B^*_1) = B^*_{1,g_ig_j}$. Analogously, for the $r$-th split, denote by $\hat\chi_{r,ij}$ the $K_1(K_1+1)/2 \times 1$ vector whose $\big((\hat g_{i,r} \vee \hat g_{j,r})((\hat g_{i,r} \vee \hat g_{j,r}) - 1)/2 + (\hat g_{i,r} \wedge \hat g_{j,r})\big)$-th element is one while the rest are zeros. We then estimate $B^*_1$ by $\hat B_{1,r}$, a symmetric matrix constructed from $\hat b_r$ by reversing the vech operator: $\hat b_r = \arg\max_b L_{n,r}(b)$, where $L_{n,r}(b)$ is the logistic log-likelihood $\sum_{i<j}\ell_{ij}(\cdot)$ evaluated with $\Theta^*_{1,ij}$ replaced by $\hat\chi_{r,ij}^\top b$ and with $\Gamma^*_{0,ij}$ replaced by its estimate based on $(\dot u^{(H,1)}_{i,0}, \dot v^{(H,1)}_{j,0}, \dot u^{(H,2)}_{i,0}, \dot v^{(H,2)}_{j,0})$ obtained in Step 4. Then, the likelihood of the $r$-th split is defined as $\hat L(r) = L_{n,r}(\hat b_r)$. Our final estimator $\{\hat g_{i,r^*}\}_{i \in [n]}$ of the memberships corresponds to the $r^*$-th split, where
$$r^* = \arg\max_{r \in [R]}\hat L(r). \qquad (3.4)$$

4 Statistical Properties

In this section, we study the asymptotic properties of the estimators proposed in the last section. To study the asymptotic properties of the first two-step estimators, we add two assumptions.

Assumption 4. For some positive constant $\tilde c$, let
$$\mathcal C(\tilde c) = \Big\{(\Delta_0, \Delta_1) : \Delta_l = \Delta_l' + \Delta_l'' \text{ for } l = 0, 1,\ \sum_{l=0}^1\|\Delta_l''\|_* \le \tilde c\sum_{l=0}^1\|\Delta_l'\|_*,\ \mathrm{rank}(\Delta_0') \le 2K_0 + 2,\ \mathrm{rank}(\Delta_1') \le 2K_1\Big\}.$$
If $(\Delta_0, \Delta_1) \in \mathcal C(\tilde c)$, then there is a constant $\kappa > 0$ that potentially depends on $\tilde c$ such that
$$\sum_{1 \le i,j \le n}(\Delta_{0,ij} + W_{1,ij}\Delta_{1,ij})^2 \ge \kappa\sum_{1 \le i,j \le n}\big(\Delta_{0,ij}^2 + \Delta_{1,ij}^2\big) \quad a.s.$$

Assumption 5. 1. $C_\lambda > C_\Upsilon M_W$, where $C_\Upsilon$ is a constant defined in Lemma S1.1 in the online supplement.
2. There exist constants $0 < \underline c \le \bar c < \infty$ such that $\zeta_n\underline c \le \Lambda_{n,ij} \le \zeta_n\bar c$, where $\Lambda_{n,ij} \equiv \Lambda(W_{ij}^\top\Gamma^*_{ij})$.
3. $\sqrt{\frac{\log n}{n\zeta_n}} \le c_F \le 1$ for some sufficiently small constant $c_F$.
4. There exists a constant $C_{0,u}$ such that $\max_{i \in [n]}\|u_{i,0}\| \le C_{0,u}$.
5. $\frac{1}{n_1 n}\sum_{i \in I_1, j \in [n]}\Theta^*_{0,ij} = o\big(\sqrt{\frac{\log n}{n\zeta_n}}\big)$.

Assumption 4 is the restricted strong convexity condition commonly assumed in the literature; see, e.g., Negahban and Wainwright (2011), Negahban et al. (2012), Chernozhukov et al. (2018), and Moon and Weidner (2018), among others. Assumption 5 collects regularity conditions. In particular, Assumption 5.2 implies that the order of the average degree in the network is $n\zeta_n$. Assumption 5.3 means that the average degree diverges to infinity at a rate not slower than $\log n$. Such a rate is the slowest possible for exact recovery in the SBM, as established by Abbe et al. (2016), Abbe and Sandon (2015), Mossel et al. (2014), and Vu (2018). As our model incorporates the SBM as a special case, this rate is also the minimal requirement for the exact recovery of $Z_1$, which is established in Theorem 4.4 below. Assumption 5.5 usually holds because the sample is split randomly and $\Theta^*_0$ satisfies the normalization condition in Assumption 2.1. For the specification in Example 1, this assumption is satisfied if $\frac{1}{n_1}\sum_{i \in I_1}\alpha_i = o\big(\sqrt{\frac{\log n}{n\zeta_n}}\big)$. Such a requirement holds almost surely (a.s.) if $\alpha_i = a_i - \frac{1}{n}\sum_{i \in [n]}a_i$ and $\{a_i\}_{i=1}^n$ is a sequence of i.i.d. random variables with finite second moments. For the specification in Example 2, Assumption 5.5 is satisfied if $p_1^\top(I_1)B^*_1p_1 = o\big(\sqrt{\frac{\log n}{n\zeta_n}}\big)$, where $p_1^\top(I_1) = \big(\frac{n_{1,1}(I_1)}{n_1}, \cdots, \frac{n_{1,K_1}(I_1)}{n_1}\big)$ and $n_{1,k}(I_1)$ denotes the size of $\Theta^*_1$'s $k$-th community within the subsample of nodes with index $i \in I_1$.
As $p_1^\top B^*_1 p_1 = 0$, the requirement holds almost surely if the community memberships are generated from a multinomial distribution so that $\|p_1 - p_1(I_1)\| = o_{a.s.}\big(\sqrt{\frac{\log n}{n\zeta_n}}\big)$.

Theorem 4.1. Let Assumptions 1–5 hold and let $\eta_n = \sqrt{\frac{\log n}{n\zeta_n}} + \frac{\log n}{n\zeta_n}$. Then for $l = 0, 1$, we have, a.s.:
1. $|\hat\tau_n - \tau_n| \le C_{F,1}\eta_n$ and $|\hat\tau^{(1)}_n - \tau_n| \le C_{F,1}\eta_n$;
2. $\frac{1}{n}\|\hat\Theta_l - \Theta^*_l\|_F \le C_{F,1}\eta_n$ and $\frac{1}{n}\|\hat\Theta^{(1)}_l - \Theta^*_l(I_1)\|_F \le C_{F,1}\eta_n$;
3. $\max_{k \in [K_l]}|\hat\sigma_{k,l} - \sigma_{k,l}| \le C_{F,1}\eta_n$ and $\max_{k \in [K_l]}|\hat\sigma^{(1)}_{k,l} - \sigma_{k,l}| \le C_{F,1}\eta_n$;
4. $\|V_l - \hat V_l\hat O_l\|_F \le C_{F,2}\sqrt n\,\eta_n$ and $\|V_l - \hat V^{(1)}_l\hat O^{(1)}_l\|_F \le C_{F,2}\sqrt n\,\eta_n$,
where $\hat O_l$ and $\hat O^{(1)}_l$ are two $K_l \times K_l$ orthogonal matrices that depend on $(V_l, \hat V_l)$ and $(V_l, \hat V^{(1)}_l)$, respectively, and $C_{F,1}$ and $C_{F,2}$ are two constants defined respectively after (A.12) and (A.13) in the Appendix.

Part 1 of Theorem 4.1 indicates that, despite the possible divergence of the grand intercept $\tau_n$, we can estimate it consistently up to rate $\eta_n$. In a dense network, $\zeta_n \asymp 1$, where $a \asymp b$ denotes that both $a/b$ and $b/a$ are stochastically bounded; in this case, $\eta_n \asymp \sqrt{(\log n)/n}$. Note that the convergence rate of $\hat\Theta_l$ and $\hat\Theta^{(1)}_l$ in terms of the Frobenius norm is also driven by $\eta_n$, and similarly for $\hat\sigma_{k,l}$, $\hat\sigma^{(1)}_{k,l}$, $\hat V_l/\sqrt n$, and $\hat V^{(1)}_l/\sqrt n$. In part 4 of Theorem 4.1, the orthogonal matrices $\hat O_l$ and $\hat O^{(1)}_l$ are present because the singular values of $\Theta^*_l$ can be equal, so its singular vectors can only be identified up to some rotation.

Define two $(K_0 + K_1) \times (K_0 + K_1)$ matrices:
$$\Psi_j(I_2) = \frac{1}{n}\sum_{i \in I_2, i \neq j}\begin{pmatrix}u_{i,0}\\ u_{i,1}W_{1,ij}\end{pmatrix}\begin{pmatrix}u_{i,0}\\ u_{i,1}W_{1,ij}\end{pmatrix}^\top \quad\text{and}\quad \Phi_i(I_2) = \frac{1}{n}\sum_{j \in I_2, j \neq i}\begin{pmatrix}v_{j,0}\\ v_{j,1}W_{1,ij}\end{pmatrix}\begin{pmatrix}v_{j,0}\\ v_{j,1}W_{1,ij}\end{pmatrix}^\top.$$
To study the asymptotic properties of the third-step estimator, we assume that both matrices are well behaved uniformly in $i$ and $j$ in the following assumption.

Assumption 6. There exist constants $C_\phi$ and $c_\phi$ such that, a.s.,
$$\infty > C_\phi \ge \limsup_n\max_{j \in [n]}\lambda_{\max}(\Psi_j(I_2)) \ge \liminf_n\min_{j \in [n]}\lambda_{\min}(\Psi_j(I_2)) \ge c_\phi > 0$$
and
$$\infty > C_\phi \ge \limsup_n\max_{i \in I_2}\lambda_{\max}(\Phi_i(I_2)) \ge \liminf_n\min_{i \in I_2}\lambda_{\min}(\Phi_i(I_2)) \ge c_\phi > 0,$$
where $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ denote the maximum and minimum eigenvalues, respectively.

Assumption 6 requires that $\Phi_i(I_2)$ and $\Psi_j(I_2)$ be positive definite (p.d.) uniformly in $i$ and $j$ asymptotically. Suppose there are $K_1$ equal-sized communities and $B^*_1 = I_{K_1}$ in Assumption 2; then $\Pi_{1,n} = \mathrm{diag}(1/K_1, \cdots, 1/K_1)$. By Lemma S1.4 in the online supplement, if node $j$ is in community $k$, then $v_{j,1} = \sqrt n\sqrt{K_1/n}\,z_{j,1} = \sqrt{K_1}\,e_{K_1,k}$, where $e_{K_1,k}$ denotes a $K_1 \times 1$ vector with the $k$-th entry being 1 and all other entries being 0. For the specification in Example 1,
$$\Phi_i(I_2) = \frac{1}{n}\sum_{j \in I_2}\begin{pmatrix}\frac{1}{\sqrt 2}(1 + \alpha_j/s_n)\\ \frac{1}{\sqrt 2}(1 - \alpha_j/s_n)\\ z_{j,1}W_{1,ij}\end{pmatrix}\begin{pmatrix}\frac{1}{\sqrt 2}(1 + \alpha_j/s_n)\\ \frac{1}{\sqrt 2}(1 - \alpha_j/s_n)\\ z_{j,1}W_{1,ij}\end{pmatrix}^\top.$$
Suppose that $\alpha_i = a_i - \bar a$ for some i.i.d. sequence $\{a_i\}_{i=1}^n$ with $\bar a = \frac{1}{n}\sum_{i=1}^n a_i$, $E(W_{1,ij}a_j \mid X_i) = 0$, and $E(W_{1,ij}^2 \mid X_i) \ge c > 0$; then $\frac{2}{n}\sum_{j \in I_2}a_j^2/s_n^2 \to 1$ a.s. Therefore, we can expect that, uniformly over $i \in I_2$,
$$\Phi_i(I_2) \to \mathrm{diag}\big(1, 1, E(W_{1,ij}^2 \mid X_i), \cdots, E(W_{1,ij}^2 \mid X_i)\big) \quad a.s.,$$
which implies that Assumption 6 holds. For the specification in Example 2, if we further assume $\Theta^*_0$ and $\Theta^*_1$ share the same group structure $Z_1$, then
$$\Phi_i(I_2) = \frac{1}{n}\sum_{j \in I_2}\begin{pmatrix}z_{j,1}\\ z_{j,1}W_{1,ij}\end{pmatrix}\begin{pmatrix}z_{j,1}\\ z_{j,1}W_{1,ij}\end{pmatrix}^\top.$$
Suppose $E(W_{1,ij} \mid X_i) = 0$ and $E(W_{1,ij}^2 \mid X_i) \ge c > 0$; then one can expect that $\Phi_i(I_2)$ has the same a.s.
limit as above uniformly over $i \in I_2$.

The following theorem studies the asymptotic properties of $\hat u^{(1)}_{i,l}$ and $\dot v^{(0,1)}_{j,l}$ defined in Step 3.

Theorem 4.2. Suppose that Assumptions 1–6 hold. Then,
$$\max_{i \in I_2}\|(\hat O^{(1)}_l)^\top\hat u^{(1)}_{i,l} - u_{i,l}\| \le C^*\eta_n \quad\text{and}\quad \max_{j \in [n]}\|(\hat O^{(1)}_l)^\top\dot v^{(0,1)}_{j,l} - v_{j,l}\| \le C_{0,v}\eta_n \quad a.s.,$$
where $C^*$ and $C_{0,v}$ are constants defined respectively in (A.25) and (A.29) in the Appendix.

Theorem 4.2 establishes a uniform bound for the estimation error of $\dot v^{(0,1)}_{j,l}$ up to some rotation. Since Lemma 2.1 shows that $\{v_{j,1}\}_{j \in [n]}$ contains information about the community memberships, it is intuitive to expect that we can use $\dot v^{(0,1)}_{j,l}$ to recover the memberships as long as the estimation error is sufficiently small. However, we only use half of the edges to estimate $\dot v^{(0,1)}_{j,l}$, which may result in information loss. As described in Section 3.4, we therefore treat $\dot v^{(0,1)}_{j,l}$ as an initial value and iteratively re-estimate $\{u_{i,l}\}_{i \in [n]}$ and $\{v_{j,l}\}_{j \in [n]}$ using all the edges in the network. We will show that the iteration preserves the error bound established in Theorem 4.2.

Define two $(K_0 + K_1) \times (K_0 + K_1)$ matrices:
$$\Psi_j = \frac{1}{n}\sum_{i \in [n], i \neq j}\begin{pmatrix}u_{i,0}\\ u_{i,1}W_{1,ij}\end{pmatrix}\begin{pmatrix}u_{i,0}\\ u_{i,1}W_{1,ij}\end{pmatrix}^\top \quad\text{and}\quad \Phi_i = \frac{1}{n}\sum_{j \in [n], j \neq i}\begin{pmatrix}v_{j,0}\\ v_{j,1}W_{1,ij}\end{pmatrix}\begin{pmatrix}v_{j,0}\\ v_{j,1}W_{1,ij}\end{pmatrix}^\top.$$
To study the asymptotic properties of the fourth-step estimators, we add an assumption.

Assumption 7. There exist constants $C_\phi$ and $c_\phi$ such that, a.s.,
$$\infty > C_\phi \ge \limsup_n\max_{j \in [n]}\lambda_{\max}(\Psi_j) \ge \liminf_n\min_{j \in [n]}\lambda_{\min}(\Psi_j) \ge c_\phi > 0$$
and
$$\infty > C_\phi \ge \limsup_n\max_{i \in [n]}\lambda_{\max}(\Phi_i) \ge \liminf_n\min_{i \in [n]}\lambda_{\min}(\Phi_i) \ge c_\phi > 0.$$
The above assumption parallels Assumption 6 and is now imposed on the full sample.

Theorem 4.3. Suppose that Assumptions 1–7 hold. Then, for $h = 1, \cdots, H$ and $l = 0, 1$,
$$\max_{i \in [n]}\|(\hat O^{(1)}_l)^\top\dot u^{(h,1)}_{i,l} - u_{i,l}\| \le C_{h,u}\eta_n \quad\text{and}\quad \max_{i \in [n]}\|(\hat O^{(1)}_l)^\top\dot v^{(h,1)}_{i,l} - v_{i,l}\| \le C_{h,v}\eta_n \quad a.s.,$$
where $\{C_{h,u}\}_{h=1}^H$ and $\{C_{h,v}\}_{h=1}^H$ are two sequences of constants defined in the proof of this theorem.

Theorem 4.3 establishes the uniform bound for the estimation error of the iterated estimators $\{\dot u^{(h,1)}_{i,l}\}$ and $\{\dot v^{(h,1)}_{i,l}\}$. By switching the roles of $I_1$ and $I_2$, we have, similarly to Theorem 4.1, that
$$\|V_l - \hat V^{(2)}_l\hat O^{(2)}_l\|_F \le C_{F,2}\sqrt n\,\eta_n,$$
where $\hat O^{(2)}_l$ is a $K_l \times K_l$ rotation matrix that depends on $V_l$ and $\hat V^{(2)}_l$. Then, following the same derivations as in Theorems 4.2 and 4.3, we have, for $h = 1, \cdots, H$,
$$\max_{i \in [n]}\|(\hat O^{(2)}_l)^\top\dot u^{(h,2)}_{i,l} - u_{i,l}\| \le C_{h,u}\eta_n \quad\text{and}\quad \max_{i \in [n]}\|(\hat O^{(2)}_l)^\top\dot v^{(h,2)}_{i,l} - v_{i,l}\| \le C_{h,v}\eta_n \quad a.s.$$

Let $g_i \in [K_1]$ denote the true group identity of the $i$-th node. To establish the strong consistency of the membership estimator $\hat g_i$ defined in (3.3), we add the following side condition.

Assumption 8. Suppose $K_1^{1/2}C_{H,v}C\,\eta_n$ is bounded by a sufficiently small constant, where $C_{H,v}$ is the constant defined in the proof of Theorem 4.3.

Apparently, Assumption 8 is automatically satisfied in large samples if $\eta_n = o(1)$.

Theorem 4.4. If Assumptions 1–8 hold, then, up to some label permutation,
$$\max_{1 \le i \le n} 1\{\hat g_i \neq g_i\} = 0 \quad a.s.$$

Several remarks are in order. First, Theorem 4.4 implies that the K-means algorithm can exactly recover the latent group structure of $\Theta^*_1$ a.s. Second, if we repeat the sample split $R$ times, we need to maintain Assumption 6 for each split; then we can show the exact recovery of $\hat g_{i,r}$ for $r \in [R]$ in exactly the same manner, as long as $R$ is fixed.
This implies that $\hat g_{i,r^*}$, with $r^*$ selected in (3.4), also enjoys the property that $\max_{1 \le i \le n}1\{\hat g_{i,r^*} \neq g_i\} = 0$ a.s. Third, if $\Theta^*_0$ also has a latent group structure as in Example 2, we can apply the same K-means algorithm to $\{\bar v_{j,0}\}_{j \in [n]}$, with $\bar v_{j,0} \equiv ((\dot v^{(H,1)}_{j,0})^\top/\|\dot v^{(H,1)}_{j,0}\|, (\dot v^{(H,2)}_{j,0})^\top/\|\dot v^{(H,2)}_{j,0}\|)^\top$, to recover the group identities of $\Theta^*_0$. Last, if we further assume $Z_0 = Z_1 = Z$ (which implies $K_0 = K_1$), then we can concatenate $\bar v_{j,0}$ and $\bar v_{j,1}$ into a $4K_1 \times 1$ vector and apply the K-means algorithm to it.

Given the exact recovery of the community memberships asymptotically, we can just treat $\hat g_i$ as $g_i$. In this case, the inference on $B^*_1$ for the model in Example 1 has been studied by Graham (2017). The model in Example 2 boils down to a standard logistic regression with a finite number of parameters, whose inference theory is established in Appendix S2. In the following, we discuss the two specifications in Examples 1 and 2.

Example 1 (cont.). Suppose the model is specified as in (2.6) with $\Gamma^*_{0,ij} = \tau_n + \alpha_i + \alpha_j$ and $\Gamma^*_1 = \Theta^*_1 = Z_1B^*_1Z_1^\top$. Recall the definitions of $\chi_{1,ij}$, $\hat\chi_{r,ij}$, and $\mathrm{vech}(B^*_1)$ in Section 3.5, under which $\chi_{1,ij}^\top\mathrm{vech}(B^*_1) = B^*_{1,g_ig_j}$. We further denote by $\hat\chi_{1,ij}$ either $\hat\chi_{1,ij}$ if a single split is used or $\hat\chi_{r^*,ij}$ if $R$ splits are used and the $r^*$-th split is selected.

Corollary 4.1. Suppose that Assumptions 1–8 hold. Then $\hat\chi_{1,ij} = \chi_{1,ij}$ for all $i < j$ a.s.

Corollary 4.1 directly follows from Theorem 4.4 and implies that we can treat $\chi_{1,ij}$ as observed. Then, (2.6) can be written as
$$Y_{ij} = 1\{\varepsilon_{ij} \le \tau_n + \alpha_i + \alpha_j + \omega_{1,ij}^\top\mathrm{vech}(B^*_1)\}, \quad \omega_{1,ij} = W_{1,ij}\chi_{1,ij}.$$
This model has already been studied by Graham (2017); we can directly apply his Tetrad logit regression to estimate $\mathrm{vech}(B^*_1)$. We provide more details on the estimation and inference in Section S2 of the online supplement.

Example 2 (cont.). Let $g_{i,0}$ be the true membership of node $i$ for $\Theta^*_0$ and $\hat g_{i,0}$ its estimator, which can be computed by applying the K-means algorithm to $\{\bar v_{j,0}\}_{j \in [n]}$. Further note that $Z_0\iota_{K_0} = \iota_n$, where recall that $\iota_b$ denotes a $b \times 1$ vector of ones, so that
$$\Gamma^*_0 = \tau_n\iota_n\iota_n^\top + Z_0B^*_0Z_0^\top = Z_0(B^*_0 + \tau_n\iota_{K_0}\iota_{K_0}^\top)Z_0^\top \equiv Z_0B^{**}_0Z_0^\top.$$
As above, we define $\chi_{0,ij}$ to be the $K_0(K_0+1)/2 \times 1$ vector whose $\big((g_{i,0} \vee g_{j,0})((g_{i,0} \vee g_{j,0}) - 1)/2 + (g_{i,0} \wedge g_{j,0})\big)$-th element is one while the rest are zeros, and $\hat\chi_{0,ij}$ to be the $K_0(K_0+1)/2 \times 1$ vector whose $\big((\hat g_{i,0} \vee \hat g_{j,0})((\hat g_{i,0} \vee \hat g_{j,0}) - 1)/2 + (\hat g_{i,0} \wedge \hat g_{j,0})\big)$-th element is one while the rest are zeros. Similarly, we have the following corollary.

Corollary 4.2. If Assumptions 1–8 hold and the model is specified as in Example 2, then $\hat\chi_{l,ij} = \chi_{l,ij}$ for all $i < j$ and $l = 0, 1$ a.s.

We propose to estimate $\mathrm{vech}(B^*) \equiv (\mathrm{vech}(B^{**}_0)^\top, \mathrm{vech}(B^*_1)^\top)^\top$ by $\hat b \equiv (\hat b_0^\top, \hat b_1^\top)^\top$:
$$\hat b = \arg\min_{b = (b_0^\top, b_1^\top)^\top \in \mathbb R^{K_0(K_0+1)/2} \times \mathbb R^{K_1(K_1+1)/2}}\bar Q_n(b),$$
where $\bar Q_n(b) = -\frac{1}{n(n-1)}\sum_{1 \le i < j \le n}\big[Y_{ij}\log\Lambda\big(\hat\chi_{0,ij}^\top b_0 + W_{1,ij}\hat\chi_{1,ij}^\top b_1\big) + (1 - Y_{ij})\log\big(1 - \Lambda\big(\hat\chi_{0,ij}^\top b_0 + W_{1,ij}\hat\chi_{1,ij}^\top b_1\big)\big)\big]$.

5 Simulations

We generate data from the following two models.

Model 1. We simulate the responses $Y_{ij}$ from the Bernoulli distribution with mean $\Lambda(\log(\zeta_n) + \Theta^*_{0,ij} + W_{1,ij}\Theta^*_{1,ij})$ for $i < j$, where $\Theta^*_{0,ij} = \alpha_i + \alpha_j$ and $\Theta^*_1 = ZB^*_1Z^\top$. We generate $\alpha_i \overset{i.i.d.}{\sim} U(-1/2, 1/2)$ for $i = 1, ..., n$, and $W_{1,ij} = |X_i - X_j|$ for $i \neq j$, where $X_i \overset{i.i.d.}{\sim} N(0, 1)$. In the $i$-th row of the membership matrix $Z \in \mathbb R^{n \times K_1}$, the $C_i$-th component is 1 and the other entries are 0, where $C = (C_1, ..., C_n)^\top \in \mathbb R^n$ is the membership vector with $C_i \in [K_1]$.

Case 1. Let $K_1 = 2$ and let $B^*_1$ be a symmetric $2 \times 2$ matrix. The membership vector $C = (C_1, ..., C_n)^\top$ is generated by sampling each entry independently from $\{1, 2\}$ with fixed probabilities. Let $\zeta_n$ be proportional to $n^{-1/2}\log n$.
Case 2. Let $K_1 = 3$ and let $B^*_1$ be a symmetric $3 \times 3$ matrix. The membership vector $C = (C_1, ..., C_n)^\top$ is generated by sampling each entry independently from $\{1, 2, 3\}$ with fixed probabilities. Let $\zeta_n$ be proportional to $n^{-1/2}\log n$.

Model 2. We simulate the responses $Y_{ij}$ from the Bernoulli distribution with mean $\Lambda(\log(\zeta_n) + \Theta^*_{0,ij} + W_{1,ij}\Theta^*_{1,ij})$ for $i < j$, where $\Theta^*_0 = ZB^*_0Z^\top$, $\Theta^*_1 = ZB^*_1Z^\top$, and $W_{1,ij}$ is simulated in the same way as in Model 1. Note that here we impose that the latent community structures for $\Theta^*_0$ and $\Theta^*_1$ are the same. We then apply the K-means algorithm to the $4K_1 \times 1$ vectors $\{(\bar v_{j,0}^\top, \bar v_{j,1}^\top)^\top\}_{j \in [n]}$ to recover the community memberships, as described in Section 4.4.

Case 1. Let $K_0 = K_1 = 2$ and let $B^*_0$ and $B^*_1$ be symmetric $2 \times 2$ matrices. The membership vector $C = (C_1, ..., C_n)^\top$ is generated by sampling each entry independently from $\{1, 2\}$ with fixed probabilities. Let $\zeta_n$ be proportional to $n^{-1/2}\log n$.

Case 2. Let $K_0 = K_1 = 3$ and let $B^*_0$ and $B^*_1$ be symmetric $3 \times 3$ matrices. The membership vector is generated in the same way as in Case 2 of Model 1. Let $\zeta_n$ be proportional to $n^{-1/2}\log n$.

We consider $n = 500$, $1000$, and $1500$. All simulation results are based on 200 realizations.

We select the number of communities $K_1$ by an eigenvalue ratio method, as follows. Let $\hat\sigma_{1,1} \ge \cdots \ge \hat\sigma_{K_{\max},1}$ be the first $K_{\max}$ singular values from the SVD of $\hat\Theta_1$ obtained by the nuclear norm penalization method given in Section 3.1. We estimate $K_1$ by $\hat K_1$ defined in (4.1), with the constant $c$ in (4.1) set to a small value and $K_{\max} = 10$. We set the tuning parameter $\lambda_n = C_\lambda\{\sqrt{n\bar Y} + \sqrt{\log n}\}/\{n(n-1)\}$ with $C_\lambda = 2$, and similarly for $\lambda^{(1)}_n$. To require that the estimator $\hat\Theta_{l,ij}$ be bounded by finite constants, we let $M = 2$ and $C_M = 2$. The performance of the method is not sensitive to the choice of these finite constants.

Define the mean squared error (MSE) of the nuclear norm estimator $\hat\Theta_l$ for $\Theta^*_l$ as $\sum_{i \neq j}(\hat\Theta_{l,ij} - \Theta^*_{l,ij})^2/\{n(n-1)\}$ for $l = 0, 1$. Table 1 reports the MSEs of $\hat\Theta_l$, the mean of $\hat K_1$, and the percentage of correctly estimating $K_1$ based on the 200 realizations. We observe that the mean value of $\hat K_1$ gets closer to the true number of communities $K_1$, and the percentage of correctly estimating $K_1$ approaches 1, as the sample size $n$ increases. When $n$ is large enough ($n = 1500$), the mean value of $\hat K_1$ equals $K_1$ and the percentage of correctly estimating $K_1$ is exactly 1.

Next, we use three commonly used criteria to evaluate the accuracy of membership estimation for our proposed method: the Normalized Mutual Information (NMI), the Rand Index (RI), and the proportion (PROP) of nodes whose memberships are correctly identified. They all take values between 0 and 1, where 1 means perfect membership estimation. Table 2 presents the means of the NMI, RI, and PROP values based on the 200 realizations for Models 1 and 2. The values of NMI, RI, and PROP increase to 1 as the sample size increases in all cases. These results demonstrate that our method is quite effective for membership estimation in both models, and they corroborate our large-sample theory.

Last, we estimate the parameters $B^*_0$ and $B^*_1$ by our proposed method given in Section 4.5 for Model 2.
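For reference, the three membership-accuracy criteria above can be computed as in the sketch below, with NMI and RI taken from scikit-learn and PROP computed after aligning labels by the Hungarian algorithm; the toy labels and function name are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, rand_score

def prop_correct(g_true, g_hat, K):
    """Share of correctly classified nodes, maximized over label permutations
    (solved as an assignment problem on the confusion matrix)."""
    conf = np.zeros((K, K))
    for t, h in zip(g_true, g_hat):
        conf[t, h] += 1
    rows, cols = linear_sum_assignment(-conf)    # best label matching
    return conf[rows, cols].sum() / len(g_true)

g_true = np.array([0, 0, 1, 1, 2, 2])
g_hat  = np.array([1, 1, 0, 0, 2, 2])            # a relabeled, perfect clustering
print(normalized_mutual_info_score(g_true, g_hat))  # NMI  = 1.0
print(rand_score(g_true, g_hat))                    # RI   = 1.0
print(prop_correct(g_true, g_hat, K=3))             # PROP = 1.0
```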
Tables 3 and 4 show the empirical coverage rate (coverage) of the 95% confidence intervals, the absolute value of the bias (bias), the empirical standard deviation (emp sd), and the average value of the estimated asymptotic standard deviation (asym sd) of the estimates of $B^*_0$ and $B^*_1$ in Cases 1 and 2 of Model 2, respectively, based on 200 realizations. We observe that the emp sd and asym sd decrease, and the empirical coverage rate gets close to the nominal level 0.95, as the sample size increases. Moreover, the value of emp sd is similar to that of asym sd for each parameter. This result confirms our formula (established in the online supplement) for the asymptotic variances of the estimators of the parameters. When the sample size is large enough ($n = 1500$), the bias is very small compared to asym sd, so it is negligible for constructing confidence intervals for the parameters.

Table 1: The MSEs for $\hat\Theta_l$, the mean of $\hat K_1$ and the percentage of correctly estimating $K_1$ based on the 200 realizations for Models 1 and 2.

Table 2: The means of the NMI, RI and PROP values based on the 200 realizations for Models 1 and 2.

                   K1 = 2                    K1 = 3
 n                 500     1000    1500      500     1000    1500
 Model 1
   NMI             0.9247  0.9976  0.9978    0.5494  0.7867  0.8973
   RI              0.9807  0.9995  0.9996    0.7998  0.9062  0.9593
   PROP            0.9903  0.9999  0.9999    0.8063  0.9089  0.9670
 Model 2
   NMI             0.9488  0.9977  0.9984    0.9664  0.9843  0.9977
   RI              0.9881  0.9966  0.9998    0.9790  0.9909  0.9987
   PROP            0.9940  0.9978  0.9999    0.9838  0.9928  0.9988

Table 3: The empirical coverage rate (coverage), the absolute bias (bias), empirical standard deviation (emp sd) and asymptotic standard deviation (asym sd) of the estimators of $B^*_0$ and $B^*_1$ in Case 1 of Model 2 based on 200 realizations.

 n                 B*_{0,11} B*_{0,12} B*_{0,22} B*_{1,11} B*_{1,12} B*_{1,22}
 500   coverage    0.880    0.860    0.975    0.960    0.915    0.955
       bias        0.023    0.020    0.003    0.002    0.007    0.001
       emp sd      0.042    0.036    0.014    0.021    0.018    0.009
       asym sd     0.035    0.029    0.015    0.020    0.017    0.009
 1000  coverage    0.960    0.940    0.945    0.944    0.944    0.940
       bias        0.004    0.001    <0.001   0.002    0.002    <0.001

Table 4: The empirical coverage rate (coverage), the absolute bias (bias), empirical standard deviation (emp sd) and asymptotic standard deviation (asym sd) of the estimators of $B^*_0$ and $B^*_1$ in Case 2 of Model 2 based on 200 realizations.

 n                 B*_{0,11} B*_{0,12} B*_{0,22} B*_{0,13} B*_{0,23} B*_{0,33}
 500   coverage    0.910    0.920    0.900    0.875    0.925    0.960
       bias        0.018    0.025    <0.001   0.008    0.002    0.009
       emp sd      0.033    0.029    0.035    0.030    0.028    0.032
       asym sd     0.033    0.031    0.032    0.028    0.027    0.032
 1000  coverage    0.915    0.935    0.955    0.930    0.950    0.925
       bias        0.005    0.005    0.001    0.004    0.006    0.006
       emp sd      0.018    0.016    0.015    0.014    0.014    0.017
       asym sd     0.017    0.015    0.017    0.013    0.014    0.016
 1500  coverage    0.940    0.945    0.940    0.960    0.940    0.955
       bias        0.001    0.001    <0.001   0.001    0.002    <0.001
 (Table 4, continued)

 n                 B*_{1,11} B*_{1,12} B*_{1,22} B*_{1,13} B*_{1,23} B*_{1,33}
 500   coverage    0.885    0.900    0.915    0.900    0.960    0.925
       bias        0.020    0.005    0.001    0.016    <0.001   0.005
       emp sd      0.023    0.019    0.020    0.021    0.017    0.022
       asym sd     0.025    0.019    0.019    0.020    0.016    0.022
 1000  coverage    0.930    0.905    0.945    0.925    0.940    0.930
       bias        0.003    0.001    0.006    0.007    0.002    0.002
       emp sd      0.011    0.011    0.011    0.009    0.008    0.011
       asym sd     0.012    0.009    0.010    0.009    0.008    0.011
 1500  coverage    0.940    0.955    0.940    0.960    0.960    0.950
       bias        <0.001   <0.001   <0.001   0.001    <0.001   0.001
       emp sd      0.009    0.006    0.007    0.005    0.005    0.007
       asym sd     0.008    0.006    0.007    0.006    0.006    0.007

6 Empirical Application

In this section, we apply the proposed method to study the community structure of a social network dataset. The dataset contains Facebook friendship networks at one hundred American colleges and universities at a single point in time. It was provided and analyzed by Traud, Mucha, and Porter (2012) and can be downloaded from https://archive.org/details/oxford-2005-facebook-matrix. Traud et al. (2012) used the dataset to illustrate the relative importance of different characteristics of individuals across different institutions, and showed that gender, dormitory residence, and class year may play a role in network partitions, using assortativity coefficients. We therefore use these three user attributes as the covariates $X_i = (X_{i1}, X_{i2}, X_{i3})^\top$, where $X_{i1}$ is a binary indicator for gender, $X_{i2}$ is a multi-category variable for dorm number (e.g., "202", "203", etc.), and $X_{i3}$ is an integer-valued variable for class year (e.g., "2004", "2005", etc.). We use the dataset of Rice University to identify the latent community structure interacted with the covariates by our proposed method.

We use the dataset to fit the model
$$Y_{ij} = 1\{\varepsilon_{ij} \le \tau_n + \Theta^*_{0,ij} + W_{1,ij}\Theta^*_{1,ij}\}, \quad i > j, \qquad (6.1)$$
where $Y_{ij}$ is the observed value (0 or 1) of the adjacency matrix in the dataset, and $W_{1,ij} = \{\sum_{k=1}^3(2D_{ij,k}/\Delta_k)^2\}^{1/2}$ with $\Delta_k = \max_{i,j}(D_{ij,k}) - \min_{i,j}(D_{ij,k})$ and $D_{ij,k} = |X_{ik} - X_{jk}|$ for $k = 1, 2, 3$; $(\tau_n, \Theta^*_{0,ij}, \Theta^*_{1,ij})$ are unknown parameters, and $\Theta^*_0$ and $\Theta^*_1$ have the latent group structures $\Theta^*_0 = ZB^*_0Z^\top$ and $\Theta^*_1 = ZB^*_1Z^\top$, respectively. Following Model 2 in the simulation, we impose that $\Theta^*_0$ and $\Theta^*_1$ share the same community structure. It is worth noting that Roy et al. (2019) fit a regression model similar to (6.1) but let the coefficient of the pairwise covariate be an unknown constant in $(i,j)$, so that $\Theta^*_{1,ij} = \Theta^*_1$. Although Roy et al.'s (2019) model can take the covariate effect into account for community detection, it does not consider possible interaction effects between the observed covariates and the latent community structure; as a result, it may inflate the number of estimated groups. In the dataset of Rice University, we delete the nodes with missing values or with degree less than 10, and consider the class years from 2004 to 2009. After the cleanup, there are $n = 3073$ nodes and 279916 edges in the dataset for our analysis. We first use the eigenvalue ratio method to obtain the estimated numbers of groups for $\Theta^*_0$ and $\Theta^*_1$: $\hat K_0 = 4$ and $\hat K_1 = 4$.

Next, we use our proposed method to obtain the estimated membership of each node. Table 5 presents the number of students in each estimated group for female and male students, for different class years, and for different dorm numbers. It is interesting to observe that most female students belong to either group 2 or group 4, and most male students belong to either group 1 or group 3. There is a clear community division between female and male students; within each gender category, the students are further separated into two large groups. Moreover, most students in the class years of 2004 and 2005 are in either group 1 or group 2, while most students in the class years of 2008 and 2009 are in either group 3 or group 4. Students in the class years of 2006 and 2007 are almost evenly distributed across the four groups, with a tendency for more students to join groups 3 and 4 in later class years. This result indicates that students tend to be in different groups as the gap between their class years becomes larger. Last, Table 6 shows the estimates of $B^*_0$ and $B^*_1$ and their standard errors (s.e.). We obtain p-values $< 0.01$ when testing each coefficient in $B^*_1$ against zero, indicating that the three covariates are useful for identifying the community structure.
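The pairwise covariate $W_{1,ij}$ in (6.1) can be built from the raw attributes as in the following sketch, which assumes numerically encoded columns and $D_{ij,k} = |X_{ik} - X_{jk}|$; the paper's exact encoding of the categorical dorm variable may differ, and the function name is ours.

```python
import numpy as np

def pairwise_homophily(X):
    """Construct W1[i, j] = sqrt(sum_k (2 * D_ijk / Delta_k)^2) from an
    n x 3 attribute matrix X (gender, dorm, class year), as in Section 6;
    a sketch assuming D_ijk = |X_ik - X_jk| on numerically encoded columns."""
    n, p = X.shape
    W1 = np.zeros((n, n))
    for k in range(p):
        D = np.abs(X[:, [k]] - X[:, k][None, :])   # n x n distance for attribute k
        Delta = D.max() - D.min()                  # range normalization Delta_k
        W1 += (2.0 * D / Delta) ** 2
    return np.sqrt(W1)
```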
Table 5: The number of students in each estimated group for female and male students, for different class years, and for different dorm numbers.

           gender           class year
           female  male     2004  2005  2006  2007  2008  2009
 group 1      1     515      112   139   147   110    37     1
 group 2    540       4      103   135   116   165    50     2
 group 3      4    1050       38    79   152   178   277   300
 group 4    958       1       30    62   125   156   288   271

           dorm number
           202   203   204   205   206   207   208   209   210
 group 1    71    67    36    42    41    50    57    59    93
 group 2    65    98    53    46    20    63    56    56    84
 group 3    94   116   142   138   129   130   121   101    83
 group 4    92    72   124   125   139    95   122   110    83

Table 6: The estimates of $B^*_0$ and $B^*_1$ and their standard errors (s.e.).

 vech(B*_0)  B*_{0,11} B*_{0,12} B*_{0,22} B*_{0,13} B*_{0,23} B*_{0,33} B*_{0,14} B*_{0,24} B*_{0,34} B*_{0,44}
 estimate    -0.730   4.912   -1.543   6.197   -0.751   4.123   -1.624   -1.702   5.933   -1.419
 s.e.         0.018   0.112    0.024   0.171    0.017   0.195    0.024    0.017   0.207    0.016

 vech(B*_1)  B*_{1,11} B*_{1,12} B*_{1,22} B*_{1,13} B*_{1,23} B*_{1,33} B*_{1,14} B*_{1,24} B*_{1,34} B*_{1,44}
 estimate    -3.397   -6.381   -4.398   -5.656   -3.600   -5.628   -4.387   -6.384   -6.704   -7.567
 s.e.         0.042    0.102    0.057    0.155    0.042    0.180    0.059    0.059    0.196    0.060

7 Conclusion

In this paper, we propose a network formation model which can capture heterogeneous effects of homophily via a latent community structure. When the expected degree diverges at a rate no slower than $\log n$, we establish that the proposed method can exactly recover the latent community memberships almost surely. Treating the estimated community memberships as the truth, we can then estimate the regression coefficients in the model by existing methods in the literature.

Appendix A  Proofs of the Main Results

In this appendix, we prove the main results of the paper. Because the proofs involve many constants defined in the assumptions and proofs, we first provide a list of these constants in Appendix A.1. We then prove Lemma 2.1 and Theorems 4.1–4.4 in Appendices A.2–A.6, respectively.

A.1 List of constants

Before proving the main results, we list all the constants in Table 7. We specify each constant to illustrate that all our results hold as long as $\sqrt{\log n/(n\zeta_n)} \le c_F \le 1$ for some sufficiently small constant $c_F$. Apparently, if $\log n/(n\zeta_n) \to 0$, then $c_F$ can be made arbitrarily small as long as $n$ is sufficiently large, and all the rate requirements in the proofs hold automatically.

Table 7: List of constants.
 $M_W$: $|W_{1,ij}| \le M_W$.
 $M$: $\max_{i,j \in [n],\, l=0,1}|\Theta^*_{l,ij}| \le M$; used in the definition of $f_M(\cdot)$.
 $C_\lambda$: used in the definition of $\lambda^{(1)}_n$.
 $C_M$: used in the definition of $T^{(1)}$.
 $C_\sigma$, $c_\sigma$, $C_0$, $c_0$: defined in Assumption 3.
 $\kappa$: defined in Assumption 4.
 $\underline c$, $\bar c$, $C_{0,u}$, $c_F$: defined in Assumption 5.
 $C_\phi$, $c_\phi$: defined in Assumption 6.
 $C_F$, $C_{F,1}$, $C_{F,2}$: defined in Theorem 4.1.
 $C^*$: defined in Theorem 4.2.
 $C_{h,u}$, $C_{h,v}$: defined in Theorem 4.3.
 $C_\Upsilon$: defined in Lemma S1.1.

A.2 Proof of Lemma 2.1

We prove the results for $U_l$ first. Let $\Pi_{l,n} = Z_l^\top Z_l/n = \mathrm{diag}(\pi_{l,1n}, \cdots, \pi_{l,K_ln})$. Then
$$(n^{-1}\Theta^*_l)(n^{-1}\Theta^*_l)^\top = n^{-1}Z_lB^*_l\Pi_{l,n}B^*_lZ_l^\top.$$
Consider the spectral decomposition of $\chi_l \equiv \Pi_{l,n}^{1/2}B^*_l\Pi_{l,n}B^*_l\Pi_{l,n}^{1/2}$: $\chi_l = S_l'\tilde\Omega_l(S_l')^\top$. Let $\tilde U_l = Z_l(Z_l^\top Z_l)^{-1/2}S_l'$, where $S_l'$ is a $K_l \times K_l$ matrix such that $(S_l')^\top S_l' = I_{K_l}$. Then, we have
$$\tilde U_l\tilde\Omega_l\tilde U_l^\top = n^{-1}Z_l\Pi_{l,n}^{-1/2}S_l'\tilde\Omega_l(S_l')^\top\Pi_{l,n}^{-1/2}Z_l^\top = n^{-1}Z_lB^*_l\Pi_{l,n}B^*_lZ_l^\top = (n^{-1}\Theta^*_l)(n^{-1}\Theta^*_l)^\top.$$
In addition, note that $\tilde U_l^\top\tilde U_l = I_{K_l}$ and $\tilde\Omega_l$ is a diagonal matrix. This implies $\tilde\Omega_l = \Sigma_l^2$ (after reordering the eigenvalues) and $\tilde U_l$ is the corresponding singular vector matrix. Then, by definition, $U_l = \sqrt n\,\tilde U_l\Sigma_l = Z_l(\Pi_{l,n})^{-1/2}S_l'\Sigma_l$. Similarly, by considering the spectral decomposition of $(n^{-1}\Theta^*_l)^\top(n^{-1}\Theta^*_l)$, we can show that $V_l = Z_l(\Pi_{l,n})^{-1/2}S_l$ for some rotation matrix $S_l$.
A.3 Proof of Theorem 4.1

We focus on the split-sample low-rank estimators; the full-sample results can be derived in the same manner. Denote $Q_{n,ij}(\Gamma_{ij}) = -[Y_{ij}\log(\Lambda(W_{ij}^\top\Gamma_{ij})) + (1-Y_{ij})\log(1-\Lambda(W_{ij}^\top\Gamma_{ij}))]$, which is a convex function of each element of $\Gamma_{ij} = (\Gamma_{0,ij}, \Gamma_{1,ij})^\top$. In addition, we note that the true parameter $\Gamma^*(I_1) \in \mathcal{T}^{(1)}(0, \log n)$. Denote $\widetilde\Gamma^{(1)} = \{\widetilde\Gamma^{(1)}_{ij}\}_{i\in I_1, j\in[n]}$ with $\widetilde\Gamma^{(1)}_{ij} = (\widetilde\Gamma^{(1)}_{0,ij}, \widetilde\Gamma^{(1)}_{1,ij})^\top$, and $\Delta_{ij} = \widetilde\Gamma^{(1)}_{ij} - \Gamma^*_{ij} \equiv (\Delta_{0,ij}, \Delta_{1,ij})^\top$ for $i \in I_1$, $j \in [n]$. Then we have
$$\lambda_n^{(1)}\sum_{l=0}^{1}\big(\|\Gamma_l^*(I_1)\|_* - \|\widetilde\Gamma_l^{(1)}\|_*\big) \geq \frac{1}{n(n-1)}\sum_{i\in I_1, j\in[n], i\neq j}\big(Q_{n,ij}(\widetilde\Gamma^{(1)}_{ij}) - Q_{n,ij}(\Gamma^*_{ij})\big) \geq \frac{1}{n(n-1)}\sum_{i\in I_1, j\in[n], i\neq j}\big(\partial_{\Gamma_{ij}}Q_{n,ij}(\Gamma^*_{ij})\big)^\top\Delta_{ij} = -\frac{1}{n(n-1)}\sum_{i\in I_1, j\in[n], i\neq j}\big(Y_{ij} - \Lambda(W_{ij}^\top\Gamma^*_{ij})\big)W_{ij}^\top\Delta_{ij} \equiv -\frac{1}{n(n-1)}\sum_{l=0}^{1}\mathrm{trace}(\Upsilon_l^\top\Delta_l), \qquad (A.1)$$
where $\partial_{\Gamma_{ij}}Q_{n,ij}(\Gamma^*_{ij}) = \partial Q_{n,ij}(\Gamma^*_{ij})/\partial\Gamma_{ij}$, $\Upsilon_l$ is an $n \times n$ matrix whose $(i,j)$-th entry is $\Upsilon_{l,ij} = (Y_{ij} - \Lambda(W_{ij}^\top\Gamma^*_{ij}))W_{l,ij}$ if $i \in I_1$, $j \in [n]$, $j \neq i$, and zero otherwise, and $\mathrm{trace}(\cdot)$ is the trace operator. By (A.1), we have
$$0 \leq \lambda_n^{(1)}\sum_{l=0}^{1}\big(\|\Gamma^*_l(I_1)\|_* - \|\widetilde\Gamma^{(1)}_l\|_*\big) + \frac{1}{n(n-1)}\Big|\sum_{l=0}^{1}\mathrm{trace}(\Upsilon_l^\top\Delta_l)\Big| \leq \lambda_n^{(1)}\sum_{l=0}^{1}\big(\|\Gamma^*_l(I_1)\|_* - \|\widetilde\Gamma^{(1)}_l\|_*\big) + \frac{1}{n(n-1)}\sum_{l=0}^{1}\|\Upsilon_l\|_{op}\|\Delta_l\|_*. \qquad (A.2)$$
By Chernozhukov et al. (2018, Lemma C.2) and the fact that $\Gamma^*_0$ and $\Gamma^*_1$ are exact low-rank matrices with ranks upper bounded by $K_0+1$ and $K_1$, respectively, there exist $\{\Delta_l', \Delta_l''\}_{l=0,1}$ such that $\Delta_l = \Delta_l' + \Delta_l''$, $\mathrm{rank}(\Delta_0') \leq 2K_0+2$, $\mathrm{rank}(\Delta_1') \leq 2K_1$, and for $l = 0,1$,
$$\|\Delta_l\|_F^2 = \|\Delta_l'\|_F^2 + \|\Delta_l''\|_F^2 \quad\text{and}\quad \|\Gamma^*_l(I_1) + \Delta_l''\|_* = \|\Gamma^*_l(I_1)\|_* + \|\Delta_l''\|_*. \qquad (A.3)$$
This implies that
$$\|\Gamma^*_l(I_1)\|_* - \|\widetilde\Gamma^{(1)}_l\|_* = \|\Gamma^*_l(I_1)\|_* - \|\Gamma^*_l(I_1) + \Delta_l' + \Delta_l''\|_* \leq \|\Delta_l'\|_* - \|\Delta_l''\|_*, \quad l = 0, 1. \qquad (A.4)$$
Therefore, combining (A.2), Lemma S1.1, and (A.4), we have
$$0 \leq \lambda_n^{(1)}\sum_{l=0}^{1}\big(\|\Delta_l'\|_* - \|\Delta_l''\|_*\big) + \frac{C_\Upsilon M_W(\sqrt{\zeta_n n} + \sqrt{\log n})}{n(n-1)}\sum_{l=0}^{1}\big(\|\Delta_l'\|_* + \|\Delta_l''\|_*\big).$$
Noting that $\lambda_n^{(1)} = \frac{C_\lambda(\sqrt{\zeta_n n} + \sqrt{\log n})}{n(n-1)}$ and $C_\lambda > C_\Upsilon M_W$, the last inequality implies
$$(C_\lambda - C_\Upsilon M_W)\sum_{l=0}^{1}\|\Delta_l''\|_* \leq (C_\lambda + C_\Upsilon M_W)\sum_{l=0}^{1}\|\Delta_l'\|_*, \qquad (A.5)$$
and that $(\Delta_0, \Delta_1) \in \mathcal{C}(\tilde c)$ for $\tilde c = \frac{C_\lambda + C_\Upsilon M_W}{C_\lambda - C_\Upsilon M_W} > 1$, with a slight abuse of notation. Although $\Delta_l'$ and $\Delta_l''$ are $|I_1| \times n$ matrices, we can make them square by adding rows of zeros. This does not affect the matrices' nuclear norms, and thus (A.5) still holds for the associated square matrices.

Next, we consider the second-order Taylor expansion of $Q_{n,ij}(\Gamma_{ij})$, following the argument in Belloni, Chernozhukov, Fernández-Val, and Hansen (2017). Let $f_{ij}(t) = \log\{1 + \exp(W_{ij}^\top(\Gamma^*_{ij} + t\Delta_{ij}))\}$, where $\Delta_{ij} = (\Delta_{0,ij}, \cdots, \Delta_{p,ij})^\top$. Then
$$Q_{n,ij}(\widetilde\Gamma^{(1)}_{ij}) - Q_{n,ij}(\Gamma^*_{ij}) - \partial_{\Gamma_{ij}}Q_{n,ij}(\Gamma^*_{ij})^\top\Delta_{ij} = f_{ij}(1) - f_{ij}(0) - f_{ij}'(0).$$
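This display is an exact algebraic identity for the logistic loss rather than an approximation, and it is easy to verify numerically. A minimal sketch with hypothetical scalar inputs (all names are ours):

```python
import numpy as np

def Lambda(x):                        # logistic CDF
    return 1.0 / (1.0 + np.exp(-x))

def Q(y, w, gamma):                   # per-edge negative log likelihood
    x = w @ gamma
    return -(y * np.log(Lambda(x)) + (1 - y) * np.log(1 - Lambda(x)))

rng = np.random.default_rng(2)
w = rng.normal(size=2)                # plays the role of W_ij
g_star, delta = rng.normal(size=2), rng.normal(size=2)
y = 1.0

def f(t):                             # f_ij(t) = log(1 + exp(w'(g* + t delta)))
    return np.log1p(np.exp(w @ (g_star + t * delta)))

grad = (Lambda(w @ g_star) - y) * w   # dQ/dgamma evaluated at g*
lhs = Q(y, w, g_star + delta) - Q(y, w, g_star) - grad @ delta
rhs = f(1) - f(0) - Lambda(w @ g_star) * (w @ delta)   # f'(0)
print(np.isclose(lhs, rhs))          # -> True
```

The self-concordance bounds that follow control exactly this remainder $f_{ij}(1) - f_{ij}(0) - f_{ij}'(0)$.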
Note that $f_{ij}(\cdot)$ is a three times differentiable convex function such that for all $t \in \mathbb{R}$,
$$|f_{ij}'''(t)| = |W_{ij}^\top\Delta_{ij}|^3\,\Lambda(W_{ij}^\top(\Gamma^*_{ij} + t\Delta_{ij}))\big(1-\Lambda(W_{ij}^\top(\Gamma^*_{ij} + t\Delta_{ij}))\big)\,\big|1 - 2\Lambda(W_{ij}^\top(\Gamma^*_{ij} + t\Delta_{ij}))\big| \leq |W_{ij}^\top\Delta_{ij}|\,f_{ij}''(t).$$
Then, by Bach (2010, Lemma 1),
$$f_{ij}(1) - f_{ij}(0) - f_{ij}'(0) \geq \frac{f_{ij}''(0)}{(W_{ij}^\top\Delta_{ij})^2}\big[\exp(-|W_{ij}^\top\Delta_{ij}|) + |W_{ij}^\top\Delta_{ij}| - 1\big] = \Lambda(W_{ij}^\top\Gamma^*_{ij})\big(1-\Lambda(W_{ij}^\top\Gamma^*_{ij})\big)\big[\exp(-|W_{ij}^\top\Delta_{ij}|) + |W_{ij}^\top\Delta_{ij}| - 1\big] \geq c\zeta_n\big[\exp(-|W_{ij}^\top\Delta_{ij}|) + |W_{ij}^\top\Delta_{ij}| - 1\big] \geq c\zeta_n\,\frac{(W_{ij}^\top\Delta_{ij})^2}{4\big(\max_{i,j}|W_{ij}^\top\Delta_{ij}| \vee \log 2\big)} \geq \frac{c\zeta_n(W_{ij}^\top\Delta_{ij})^2}{8(M_W+1)\log n}, \qquad (A.6)$$
where the third inequality holds by Lemma S1.2, and the last inequality holds because of Assumption 5 and the fact that $|W_{ij}^\top\Delta_{ij}| \leq |\widetilde\Gamma^{(1)}_{0,ij} - \Gamma^*_{0,ij}| + M_W|\widetilde\Gamma^{(1)}_{1,ij} - \Gamma^*_{1,ij}| \leq 2(M_W+1)\log n$. Therefore,
$$F_n(\Delta_0, \Delta_1) \equiv \frac{1}{n(n-1)}\sum_{i\in I_1, j\in[n], j\neq i}\big[Q_{n,ij}(\widetilde\Gamma^{(1)}_{ij}) - Q_{n,ij}(\Gamma^*_{ij}) - \partial_{\Gamma_{ij}}Q_{n,ij}(\Gamma^*_{ij})^\top\Delta_{ij}\big] \geq \frac{c\zeta_n}{8n(n-1)(M_W+1)\log n}\sum_{i\in I_1, j\in[n], j\neq i}(W_{ij}^\top\Delta_{ij})^2 \geq \frac{c\zeta_n}{8n(n-1)(M_W+1)\log n}\Big[\kappa\sum_{l=0}^{1}\|\Delta_l\|_F^2 - 4n(M_W+1)^2(\log n)^2\Big], \qquad (A.7)$$
where, by an abuse of notation, we still view $\Delta_l$ as a square matrix with some rows filled with zeros, and the last inequality holds by Assumption 4 and the fact that $|\Delta_{l,ii}| \leq 2\log n$ for $i \in I_1$.

On the other hand, by (A.1),
$$F_n(\Delta_0, \Delta_1) \leq \lambda_n^{(1)}\sum_{l=0}^{1}\big(\|\Gamma^*_l(I_1)\|_* - \|\widetilde\Gamma^{(1)}_l\|_*\big) + \frac{1}{n(n-1)}\Big|\sum_{l=0}^{1}\mathrm{trace}(\Upsilon_l^\top\Delta_l)\Big| \leq \lambda_n^{(1)}\sum_{l=0}^{1}\big(\|\Delta_l'\|_* - \|\Delta_l''\|_*\big) + \frac{1}{n(n-1)}\sum_{l=0}^{1}\|\Upsilon_l\|_{op}\|\Delta_l\|_* \leq \frac{\sqrt{\zeta_n n}+\sqrt{\log n}}{n(n-1)}\Big[\sum_{l=0}^{1}(C_\lambda + C_\Upsilon M_W)\|\Delta_l'\|_* - \sum_{l=0}^{1}(C_\lambda - C_\Upsilon M_W)\|\Delta_l''\|_*\Big] \leq \frac{\sqrt{\zeta_n n}+\sqrt{\log n}}{n(n-1)}(C_\lambda + C_\Upsilon M_W)\sum_{l=0}^{1}\|\Delta_l'\|_* \leq \frac{\sqrt{\zeta_n n}+\sqrt{\log n}}{n(n-1)}(C_\lambda + C_\Upsilon M_W)\sqrt{2\bar K}\sum_{l=0}^{1}\|\Delta_l'\|_F \leq \frac{\sqrt{\zeta_n n}+\sqrt{\log n}}{n(n-1)}(C_\lambda + C_\Upsilon M_W)\,2\sqrt{\bar K}\Big(\sum_{l=0}^{1}\|\Delta_l\|_F^2\Big)^{1/2}, \qquad (A.8)$$
where $\bar K = \max(K_0+1, K_1)$, the first inequality is due to (A.1), the second is due to (A.4) and the trace inequality, the third holds by the definition of $\lambda_n^{(1)}$ and Lemma S1.1, the fourth is due to the fact that $C_\lambda - C_\Upsilon M_W > 0$, the fifth is due to the fact that $\mathrm{rank}(\Delta_l') \leq 2\bar K$, and the last is due to (A.3) and the Cauchy-Schwarz inequality.

Combining (A.7) and (A.8) and completing the square, we have
$$\Big[\Big(\sum_{l=0}^{1}\|\Delta_l\|_F^2\Big)^{1/2} - \frac{C_0\sqrt{\bar K}(M_W+1)(C_\lambda + C_\Upsilon M_W)}{c\kappa}\,\frac{\log n\,(\sqrt{n\zeta_n}+\sqrt{\log n})}{\zeta_n}\Big]^2 \leq \bar K\Big[\frac{C_0(M_W+1)(C_\lambda + C_\Upsilon M_W)}{c\kappa}\Big]^2\Big(\frac{\log n\,(\sqrt{n\zeta_n}+\sqrt{\log n})}{\zeta_n}\Big)^2 + \frac{4n(M_W+1)^2(\log n)^2}{\kappa}$$
for an absolute constant $C_0$, and thus
$$\frac{1}{n}\Big(\sum_{l=0}^{1}\|\Delta_l\|_F^2\Big)^{1/2} \leq C_F\Big(\frac{\log n}{\sqrt{n\zeta_n}} + \frac{(\log n)^{3/2}}{n\zeta_n}\Big) \quad a.s., \qquad (A.9)$$
where $C_F$ is a multiple of $\sqrt{\bar K}(M_W+1)(C_\lambda + C_\Upsilon M_W)/(c\kappa)$ by an absolute constant. Then,
$$|\widetilde\tau_n^{(1)} - \tau_n| = \Big|\frac{2}{n^2}\sum_{i\in I_1, j\in[n]}(\widetilde\Gamma^{(1)}_{0,ij} - \tau_n)\Big| \leq \Big|\frac{2}{n^2}\sum_{i\in I_1, j\in[n]}(\widetilde\Gamma^{(1)}_{0,ij} - \Gamma^*_{0,ij})\Big| + \Big|\frac{2}{n^2}\sum_{i\in I_1, j\in[n]}\Theta^*_{0,ij}\Big| \leq \frac{\sqrt{2}}{n}\|\Delta_0\|_F + o(\eta_n) \leq C_F\Big(\frac{\log n}{\sqrt{n\zeta_n}} + \frac{(\log n)^{3/2}}{n\zeta_n}\Big)$$
$$\leq 30C_F(c_F + c_F^2)\sqrt{\log n} \quad a.s., \qquad (A.10)$$
where the last inequality follows from Assumption 5.3.

Next, we rerun the nuclear norm regularized logistic regression with the parameter space restriction $\mathcal{T}^{(1)}(0, \log n)$ replaced by $\mathcal{T}^{(1)}(\widetilde\tau_n^{(1)}, C_M\sqrt{\log n})$. First, we note that the true parameter $\Gamma^*(I_1) \in \mathcal{T}^{(1)}(\widetilde\tau_n^{(1)}, C_M\sqrt{\log n})$ because $|\Gamma^*_{1,ij}| \leq C_M\sqrt{\log n}$ and
$$|\Gamma^*_{0,ij} - \widetilde\tau_n^{(1)}| \leq |\Theta^*_{0,ij}| + |\widetilde\tau_n^{(1)} - \tau_n| \leq |\Theta^*_{0,ij}| + 30C_F(c_F + c_F^2)\sqrt{\log n} \leq C_M\sqrt{\log n}, \qquad (A.11)$$
where we use the fact that $c_F$, and thus $30(c_F + c_F^2)C_F$, is sufficiently small.

Therefore, following the same arguments used to obtain (A.5), we can show that $\widehat\Delta \equiv (\widehat\Delta_0, \widehat\Delta_1) \in \mathcal{C}(\tilde c)$, where $\widehat\Delta_l = \widehat\Gamma^{(1)}_l - \Gamma^*_l(I_1)$. Let $\widehat\Delta_{ij} = (\widehat\Delta_{0,ij}, \widehat\Delta_{1,ij})^\top$. Now let $f_{ij}(t) = \log(1 + \exp(W_{ij}^\top(\Gamma^*_{ij} + t\widehat\Delta_{ij})))$. Then, following (A.6),
$$f_{ij}(1) - f_{ij}(0) - f_{ij}'(0) \geq c\zeta_n\,\frac{(W_{ij}^\top\widehat\Delta_{ij})^2}{4(\max_{i,j}|W_{ij}^\top\widehat\Delta_{ij}| \vee \log 2)} \geq \frac{c\zeta_n(W_{ij}^\top\widehat\Delta_{ij})^2}{4C_1(C_M + M_W)\sqrt{\log n}},$$
where the last inequality holds because of (A.11) and, uniformly in $(i,j)$,
$$|W_{ij}^\top\widehat\Delta_{ij}| \leq |\widehat\Gamma^{(1)}_{0,ij} - \Gamma^*_{0,ij}| + M_W|\widehat\Gamma^{(1)}_{1,ij} - \Theta^*_{1,ij}| \leq |\widehat\Gamma^{(1)}_{0,ij} - \widetilde\tau_n^{(1)}| + |\widetilde\tau_n^{(1)} - \Gamma^*_{0,ij}| + M_W(C_M\sqrt{\log n} + M) \leq C_1(C_M + M_W)\sqrt{\log n}$$
for an absolute constant $C_1$. Then, similar to (A.7) and (A.8),
$$F_n(\widehat\Delta_0, \widehat\Delta_1) \equiv \frac{1}{n(n-1)}\sum_{i\in I_1, j\in[n], j\neq i}\big(Q_{n,ij}(\widehat\Gamma^{(1)}_{ij}) - Q_{n,ij}(\Gamma^*_{ij}) - \partial_{\Gamma_{ij}}Q_{n,ij}(\Gamma^*_{ij})^\top\widehat\Delta_{ij}\big) \geq \frac{c\zeta_n}{4C_1 n(n-1)(M_W + C_M)\sqrt{\log n}}\Big[\kappa\sum_{l=0}^{1}\|\widehat\Delta_l\|_F^2 - 4C_1^2 n(M_W + C_M)^2\log n\Big]$$
and
$$F_n(\widehat\Delta_0, \widehat\Delta_1) \leq \frac{\sqrt{\zeta_n n}+\sqrt{\log n}}{n(n-1)}(C_\lambda + C_\Upsilon M_W)\,2\sqrt{\bar K}\Big(\sum_{l=0}^{1}\|\widehat\Delta_l\|_F^2\Big)^{1/2}.$$
Therefore, we have
$$\Big[\Big(\sum_{l=0}^{1}\|\widehat\Delta_l\|_F^2\Big)^{1/2} - \frac{C_0\sqrt{\bar K}(M_W + C_M)(C_\lambda + C_\Upsilon M_W)}{c\kappa}\,\frac{\sqrt{\log n}\,(\sqrt{n\zeta_n}+\sqrt{\log n})}{\zeta_n}\Big]^2 \leq \bar K\Big[\frac{C_0(M_W + C_M)(C_\lambda + C_\Upsilon M_W)}{c\kappa}\Big]^2\Big(\frac{\sqrt{\log n}\,(\sqrt{n\zeta_n}+\sqrt{\log n})}{\zeta_n}\Big)^2 + \frac{4C_1^2(M_W + C_M)^2 n\log n}{\kappa},$$
and thus
$$\frac{1}{n}\Big(\sum_{l=0}^{1}\|\widehat\Delta_l\|_F^2\Big)^{1/2} \leq C_{F,1}\eta_n, \qquad (A.12)$$
where $C_{F,1}$ is a multiple of $\sqrt{\bar K}(M_W + C_M)(C_\lambda + C_\Upsilon M_W)/(c\kappa)$ by an absolute constant and $\eta_n = \sqrt{\frac{\log n}{n\zeta_n}} + \frac{\log n}{n\zeta_n}$. Then, similar to (A.10) and by Assumption 5.5, we have $|\widehat\tau_n^{(1)} - \tau_n| \leq \frac{\sqrt{2}}{n}\|\widehat\Delta_0\|_F + o(\eta_n) \leq 2C_{F,1}\eta_n$. This establishes the first result in Theorem 4.1.

In addition,
$$\frac{1}{n}\|\widehat\Theta^{(1)}_1 - \Theta^*_1(I_1)\|_F \leq \frac{1}{n}\Big[\sum_{(i,j)\in I_1\times I_1, i\neq j}\Big(\tfrac12(\widehat\Gamma^{(1)}_{1,ij} + \widehat\Gamma^{(1)}_{1,ji}) - \Theta^*_{1,ij}\Big)^2 + \sum_{(i,j):\,i\in I_1, j\notin I_1}(\widehat\Gamma^{(1)}_{1,ij} - \Theta^*_{1,ij})^2\Big]^{1/2} + \frac{1}{n}\Big[\sum_{i\in I_1}\Theta^{*2}_{1,ii}\Big]^{1/2} \leq \frac{1}{n}\Big[\sum_{i\in I_1, j\in[n], i\neq j}(\widehat\Gamma^{(1)}_{1,ij} - \Theta^*_{1,ij})^2\Big]^{1/2} + \frac{1}{n}\Big[\sum_{i\in I_1}\Theta^{*2}_{1,ii}\Big]^{1/2} \leq \frac{1}{n}\Big(\sum_{l=0}^{1}\|\widehat\Delta_l\|_F^2\Big)^{1/2} + \sqrt{\frac{M^2}{n}} \leq C_{F,2}\eta_n \quad a.s.,$$
where the first inequality holds because $f_M(\cdot)$ is 1-Lipschitz continuous, $\Theta^*_1 = (\Theta^*_1)^\top$, and $|\Theta^*_{1,ij}| \leq M$, and the last inequality holds because $\frac{1}{n}\big(\sum_{i\in I_1}(\Theta^*_{1,ii})^2\big)^{1/2} \leq M/\sqrt{n} = o(\eta_n)$. Similarly,
$$\frac{1}{n}\|\widehat\Theta^{(1)}_0 - \Theta^*_0(I_1)\|_F \leq \frac{1}{n}\Big[\sum_{(i,j)\in I_1\times I_1, i\neq j}\Big(\tfrac12(\widehat\Gamma^{(1)}_{0,ij} + \widehat\Gamma^{(1)}_{0,ji}) - \Theta^*_{0,ij} - \widehat\tau_n^{(1)}\Big)^2 + \sum_{(i,j):\,i\in I_1, j\notin I_1}(\widehat\Gamma^{(1)}_{0,ij} - \widehat\tau_n^{(1)} - \Theta^*_{0,ij})^2\Big]^{1/2} + \frac{1}{n}\Big[\sum_{i\in I_1}\Theta^{*2}_{0,ii}\Big]^{1/2} \leq \frac{1}{n}\Big[\sum_{i\in I_1, j\in[n], i\neq j}(\widehat\Gamma^{(1)}_{0,ij} - \Gamma^*_{0,ij})^2\Big]^{1/2} + |\widehat\tau_n^{(1)} - \tau_n| + \sqrt{\frac{M^2}{n}} \leq C_{F,2}\eta_n \quad a.s.$$
Then, by Weyl's inequality, $\max_{k=1,\cdots,K_l}|\widehat\sigma^{(1)}_{k,l} - \sigma_{k,l}| \leq C_{F,2}\eta_n$ a.s. for $l = 0, 1$. Last, noting that $\widehat{\mathcal{V}}^{(1)}_l$ consists of the first $K_l$ eigenvectors of $(n^{-1}\widehat\Theta^{(1)}_l)^\top(n^{-1}\widehat\Theta^{(1)}_l)$, we have
$$\Big\|\Big(\frac{\widehat\Theta^{(1)}_l}{n}\Big)^\top\Big(\frac{\widehat\Theta^{(1)}_l}{n}\Big) - \Big(\frac{\Theta^*_l(I_1)}{n}\Big)^\top\Big(\frac{\Theta^*_l(I_1)}{n}\Big)\Big\|_{op} \leq \frac{2C_\sigma}{n}\|\widehat\Theta^{(1)}_l - \Theta^*_l(I_1)\|_F \leq 2C_{F,2}C_\sigma\eta_n.$$
Then, by the Davis-Kahan $\sin\Theta$ theorem (Su et al. (2020, Lemma C.1)), we have
$$\|\mathcal{V}_l - \widehat{\mathcal{V}}^{(1)}_l\widehat O^{(1)}_l\|_F \leq \sqrt{K_l}\,\|\mathcal{V}_l - \widehat{\mathcal{V}}^{(1)}_l\widehat O^{(1)}_l\|_{op} \leq \frac{2\sqrt{K_l}\,C_{F,2}C_\sigma\eta_n}{\sigma^2_{K_l,l} - 2C_{F,2}C_\sigma\eta_n} \leq \frac{2\sqrt{K_l}\,C_{F,2}C_\sigma\eta_n}{c^2_\sigma - 2C_{F,2}C_\sigma\eta_n} \leq \frac{4\sqrt{K_l}\,C_{F,2}C_\sigma\eta_n}{c^2_\sigma} \leq C_{F,3}\eta_n, \qquad (A.13)$$
where $C_{F,3} = \max_{l=0,1}4\sqrt{K_l}\,C_{F,2}C_\sigma c_\sigma^{-2}$, the third inequality holds due to Assumption 5, and the second-to-last inequality holds because we can set $c_F$ sufficiently small to ensure that $1 - 2C_{F,2}C_\sigma(c_F + c_F^2)c_\sigma^{-2} \geq \frac12$. Since $\widehat V^{(1)}_l = \sqrt{n}\,\widehat{\mathcal{V}}^{(1)}_l$ and $V_l = \sqrt{n}\,\mathcal{V}_l$, we have the desired result that $\|V_l - \widehat V^{(1)}_l\widehat O^{(1)}_l\|_F \leq C_{F,3}\sqrt{n}\,\eta_n$. ∎

A.4 Proof of Theorem 4.2

First, we prove the first result in the theorem. Let $\Delta_{i,l} = (\widehat O^{(1)}_l)^\top\widehat u^{(1)}_{i,l} - u_{i,l}$ for $l = 0, 1$, and $\Delta_{iu} = (\Delta^\top_{i,0}, \Delta^\top_{i,1})^\top$. Denote
$$\widehat\Lambda_{n,ij} = \Lambda\Big(\widehat\tau_n + \sum_{l=0}^{1}u^\top_{i,l}(\widehat O^{(1)}_l)^\top\widehat v^{(1)}_{j,l}W_{l,ij}\Big). \qquad (A.14)$$
Recall that $\Lambda_{n,ij} = \Lambda(\tau_n + \sum_{l=0}^{1}u^\top_{i,l}v_{j,l}W_{l,ij}) = \Lambda(\tau_n + \Theta^*_{0,ij} + \Theta^*_{1,ij}W_{1,ij})$. Let
$$\widetilde\Lambda_{n,ij} = \Lambda(\dot a_{n,ij}), \qquad (A.15)$$
where $\dot a_{n,ij}$ is an intermediate value between $\tau_n + \Theta^*_{0,ij} + \Theta^*_{1,ij}W_{1,ij}$ and $\widehat\tau_n + \sum_{l=0}^{1}u^\top_{i,l}(\widehat O^{(1)}_l)^\top\widehat v^{(1)}_{j,l}W_{l,ij}$. Define
$$\widehat\phi^{(1)}_{ij} = \begin{bmatrix}(\widehat O^{(1)}_0)^\top\widehat v^{(1)}_{j,0}\\ (\widehat O^{(1)}_1)^\top\widehat v^{(1)}_{j,1}W_{1,ij}\end{bmatrix} \quad\text{and}\quad \widehat\Phi^{(1)}_i = \frac{1}{n}\sum_{j\in I_1, j\neq i}\widehat\phi^{(1)}_{ij}(\widehat\phi^{(1)}_{ij})^\top.$$
Let $\widetilde\Lambda^{(1)}_{ij}(\mu) = \Lambda(\widehat\tau_n + \sum_{l=0}^{1}\mu_l^\top(\widehat O^{(1)}_l)^\top\widehat v^{(1)}_{j,l}W_{l,ij})$ and $\ell^{(1)}_{ij}(\mu) = Y_{ij}\log(\widetilde\Lambda^{(1)}_{ij}(\mu)) + (1-Y_{ij})\log(1-\widetilde\Lambda^{(1)}_{ij}(\mu))$. Define $\widetilde Q^{(1)}_{in}(\mu) = -\frac{1}{n}\sum_{j\in I_1, j\neq i}\ell^{(1)}_{ij}(\mu)$. Then,
$$0 \geq Q^{(1)}_{in,U}(\widehat u^{(1)}_{i,0}, \widehat u^{(1)}_{i,1}) - Q^{(1)}_{in,U}(\widehat O^{(1)}_0 u_{i,0}, \widehat O^{(1)}_1 u_{i,1}) = \widetilde Q^{(1)}_{in}(u_{i,0} + \Delta_{i,0}, u_{i,1} + \Delta_{i,1}) - \widetilde Q^{(1)}_{in}(u_{i,0}, u_{i,1})$$
$$\geq -\frac{1}{n}\sum_{j\in I_1, j\neq i}(Y_{ij} - \widehat\Lambda_{n,ij})(\widehat\phi^{(1)}_{ij})^\top\Delta_{iu} + \frac{1}{n}\sum_{j\in I_1, j\neq i}\widehat\Lambda_{n,ij}(1-\widehat\Lambda_{n,ij})\big[\exp(-|(\widehat\phi^{(1)}_{ij})^\top\Delta_{iu}|) + |(\widehat\phi^{(1)}_{ij})^\top\Delta_{iu}| - 1\big]$$
$$\geq -\frac{1}{n}\sum_{j\in I_1, j\neq i}(Y_{ij} - \widehat\Lambda_{n,ij})(\widehat\phi^{(1)}_{ij})^\top\Delta_{iu} + \frac{c'\zeta_n}{n}\sum_{j\in I_1, j\neq i}\big[\exp(-|(\widehat\phi^{(1)}_{ij})^\top\Delta_{iu}|) + |(\widehat\phi^{(1)}_{ij})^\top\Delta_{iu}| - 1\big]$$
$$\geq -\frac{1}{n}\sum_{j\in I_1, j\neq i}(Y_{ij} - \widehat\Lambda_{n,ij})(\widehat\phi^{(1)}_{ij})^\top\Delta_{iu} + \frac{c'\zeta_n}{n}\sum_{j\in I_1, j\neq i}\Big[\frac{((\widehat\phi^{(1)}_{ij})^\top\Delta_{iu})^2}{2} - \frac{|(\widehat\phi^{(1)}_{ij})^\top\Delta_{iu}|^3}{6}\Big], \qquad (A.16)$$
where the second inequality is due to Bach (2010, Lemma 1), the third is due to the fact that $\widehat\Lambda_{n,ij}(1-\widehat\Lambda_{n,ij}) \geq c'\zeta_n$, where $c'$ is defined in Lemma S1.3, and the last is due to the fact that $\exp(-t) + t - 1 \geq \frac{t^2}{2} - \frac{t^3}{6}$. The following argument follows Belloni et al. (2017). Let
$$F(\Delta_{iu}) = \widetilde Q^{(1)}_{in}(u_{i,0} + \Delta_{i,0}, u_{i,1} + \Delta_{i,1}) - \widetilde Q^{(1)}_{in}(u_{i,0}, u_{i,1}) + \frac{1}{n}\sum_{j\in I_1, j\neq i}(Y_{ij} - \widehat\Lambda_{n,ij})(\widehat\phi^{(1)}_{ij})^\top\Delta_{iu},$$
which is convex in $\Delta_{iu}$. Let
$$q_{in} = \inf_\Delta\frac{\big[\frac{1}{n}\sum_{j\in I_1, j\neq i}((\widehat\phi^{(1)}_{ij})^\top\Delta)^2\big]^{3/2}}{\frac{1}{n}\sum_{j\in I_1, j\neq i}|(\widehat\phi^{(1)}_{ij})^\top\Delta|^3} \quad\text{and}\quad \delta_{in} = \Big[\frac{1}{n}\sum_{j\in I_1, j\neq i}((\widehat\phi^{(1)}_{ij})^\top\Delta_{iu})^2\Big]^{1/2}. \qquad (A.17)$$
If $\delta_{in} \leq q_{in}$, then $\frac{1}{n}\sum_{j\in I_1, j\neq i}|(\widehat\phi^{(1)}_{ij})^\top\Delta_{iu}|^3 \leq \delta_{in}^3/q_{in} \leq \delta_{in}^2$, and thus $F(\Delta_{iu}) \geq \frac{c'\zeta_n\delta_{in}^2}{3}$. On the other hand, if $\delta_{in} > q_{in}$, let $\widetilde\Delta_{iu} = \Delta_{iu}\frac{q_{in}}{\delta_{in}}$; then $\big[\frac{1}{n}\sum_{j\in I_1, j\neq i}((\widehat\phi^{(1)}_{ij})^\top\widetilde\Delta_{iu})^2\big]^{1/2} \leq q_{in}$, and by convexity,
$$F(\Delta_{iu}) = F\Big(\frac{\delta_{in}}{q_{in}}\widetilde\Delta_{iu}\Big) \geq \frac{\delta_{in}}{q_{in}}F(\widetilde\Delta_{iu}) \geq \frac{c'\zeta_n\delta_{in}}{3q_{in}}\cdot\frac{1}{n}\sum_{j\in I_1, j\neq i}((\widehat\phi^{(1)}_{ij})^\top\widetilde\Delta_{iu})^2 = \frac{c'\zeta_n q_{in}\delta_{in}}{3}.$$
Therefore, by Lemma S1.4,
$$F(\Delta_{iu}) \geq \min\Big(\frac{c'\zeta_n\delta_{in}^2}{3}, \frac{c'\zeta_n q_{in}\delta_{in}}{3}\Big) \geq \min\Big(\frac{c'c_\phi\zeta_n\|\Delta_{iu}\|^2}{6}, \frac{c'\zeta_n q_{in}\sqrt{c_\phi}\,\|\Delta_{iu}\|}{3\sqrt{2}}\Big). \qquad (A.18)$$
On the other hand, we have $|F(\Delta_{iu})| \leq \big|\frac{1}{n}\sum_{j\in I_1, j\neq i}(Y_{ij} - \widehat\Lambda_{n,ij})(\widehat\phi^{(1)}_{ij})^\top\Delta_{iu}\big| \leq I_i + II_i$, where
$$I_i = \Big|\frac{1}{n}\sum_{j\in I_1, j\neq i}(Y_{ij} - \Lambda_{n,ij})(\widehat\phi^{(1)}_{ij})^\top\Delta_{iu}\Big| \quad\text{and}\quad II_i = \Big|\frac{1}{n}\sum_{j\in I_1, j\neq i}(\widehat\Lambda_{n,ij} - \Lambda_{n,ij})(\widehat\phi^{(1)}_{ij})^\top\Delta_{iu}\Big|.$$
We aim to bound $I_i$ and $II_i$ from above uniformly in $i$ below.

We first bound $II_i$. Note that
$$II_i \leq \frac{1}{n}\sum_{j\in I_1, j\neq i}\widetilde\Lambda_{n,ij}(1-\widetilde\Lambda_{n,ij})\Big(|\widehat\tau_n - \tau_n| + \sum_{l=0}^{1}\big|u^\top_{i,l}\big((\widehat O^{(1)}_l)^\top\widehat v^{(1)}_{j,l} - v_{j,l}\big)W_{l,ij}\big|\Big)\big|(\widehat\phi^{(1)}_{ij})^\top\Delta_{iu}\big| \leq \frac{2c''M(1+M_W)\zeta_n\|\Delta_{iu}\|}{n c_\sigma}\sum_{j\in I_1, j\neq i}\Big(|\widehat\tau_n - \tau_n| + \sum_{l=0}^{1}\big|u^\top_{i,l}\big((\widehat O^{(1)}_l)^\top\widehat v^{(1)}_{j,l} - v_{j,l}\big)W_{l,ij}\big|\Big) \leq \frac{2c''M(1+M_W)\zeta_n\|\Delta_{iu}\|}{c_\sigma}\Big[2C_{F,1}\eta_n + c_{II}\sum_{l=0}^{1}\frac{1}{n}\sum_{j\in I_1, j\neq i}\big\|(\widehat O^{(1)}_l)^\top\widehat v^{(1)}_{j,l} - v_{j,l}\big\|\Big] \leq \frac{2c''M(1+M_W)\zeta_n\|\Delta_{iu}\|}{c_\sigma}\Big[2C_{F,1}\eta_n + c_{II}\sum_{l=0}^{1}\frac{1}{\sqrt{n}}\big\|\widehat V^{(1)}_l\widehat O^{(1)}_l - V_l\big\|_F\Big] \leq C_{II}\|\Delta_{iu}\|\zeta_n\eta_n, \qquad (A.19)$$
where $c_{II} = \max(C_{2,u}, c_\sigma^{-1/2}M_W)C_\sigma$, $C_{II} = 2c''M(1+M_W)(48C_{F,1} + 136c_{II}C_{F,3})c_\sigma^{-1}$, the first inequality holds by the Taylor expansion, the second inequality holds by Lemma S1.3 and
$$\max_{i,j\in I_1, i\neq j}\|\widehat\phi^{(1)}_{ij}\| \leq \max_{i,j\in I_1, i\neq j}\big(\|(\widehat O^{(1)}_0)^\top\widehat v^{(1)}_{j,0}\| + M_W\|(\widehat O^{(1)}_1)^\top\widehat v^{(1)}_{j,1}\|\big) \leq 2M\sigma^{-1}_{K_0,0} + 2M_WM\sigma^{-1}_{K_1,1} \leq 2M(1+M_W)c_\sigma^{-1}, \qquad (A.20)$$
the third inequality is due to Theorem 4.1 and the fact that $\|u^\top_{i,l}W_{l,ij}\| \leq c_{II}$, the fourth is due to the Cauchy-Schwarz inequality, and the last is due to Theorem 4.1. Because the constant $C_{II}$ does not depend on $i$, the above upper bound for $II_i$ holds uniformly over $i$.

Next, we turn to the upper bound for $I_i$. Let $\mathcal{F}_n$ be the $\sigma$-field generated by $\{X_i\}_{i=1}^n \cup \{\varepsilon_{ij}\}_{i\in I_2, j\in[n], j\neq i} \cup \{e_{ij}\}_{1\leq i,j\leq n}$ and $H_{ij} = (Y_{ij} - \Lambda_{n,ij})\widehat\phi^{(1)}_{ij}$. Then, conditional on $\mathcal{F}_n$, $\{H_{ij}\}_{j\in I_1, j\neq i}$ depends only on $\{\varepsilon_{ij}\}_{j\in I_1, j\neq i}$ and is thus a sequence of independent random vectors. Note that $I_i \leq \|\frac{1}{n}\sum_{j\in I_1, j\neq i}H_{ij}\|\,\|\Delta_{iu}\|$. Let $H_{k,ij}$ be the $k$-th coordinate of $H_{ij}$, where $k \in [K_0 + K_1]$. By Lemma S1.3, (A.20), and Assumption 5,
$$\max_{1\leq i,j\leq n}|H_{k,ij}| \leq \big[2M(1+M_W)c_\sigma^{-1} + 1\big](1 + c) \equiv C_H \qquad (A.21)$$
and $\sum_{j\in I_1, j\neq i}E(H^2_{k,ij}\mid\mathcal{F}_n) \leq C_H^2\zeta_n n$. Therefore, by the Bernstein inequality, for any $t > 0$,
$$P\Big(\max_{i\in I_1}\Big|\sum_{j\in I_1, j\neq i}H_{k,ij}\Big| \geq nt \,\Big|\, \mathcal{F}_n\Big) \leq \sum_{i\in I_1}2\exp\Big(-\frac{n^2t^2}{2C_H^2\zeta_n n + \frac{2}{3}C_Htn}\Big).$$
Taking $t = 4C_H\sqrt{\frac{\zeta_n\log n}{n}}$, we have
$$P\Big(\max_{i\in I_1}\frac{1}{n}\Big|\sum_{j\in I_1, j\neq i}H_{k,ij}\Big| \geq t \,\Big|\, \mathcal{F}_n\Big) \leq 2n\exp\Big(-\frac{16C_H^2\zeta_n n\log n}{2C_H^2\zeta_n n + \frac{8}{3}C_H^2\sqrt{\zeta_n n\log n}}\Big) \leq 2n\exp(-4\log n) \leq n^{-2.5},$$
where the second inequality holds because $\log n/(n\zeta_n) \leq c_F^2$ is sufficiently small and $C_H > 1$. Taking expectations of both sides, we have
$$P\Big(\max_{i\in I_1}\frac{1}{n}\Big|\sum_{j\in I_1, j\neq i}H_{k,ij}\Big| \geq t\Big) \leq n^{-2.5}.$$
Therefore, by the Borel-Cantelli lemma (absorbing the fixed dimension $K_0+K_1$ into $C_H$),
$$\max_{i\in I_1}I_i \leq \|\Delta_{iu}\|\max_{i\in I_1}\frac{1}{n}\Big\|\sum_{j\in I_1, j\neq i}H_{ij}\Big\| \leq 4C_H\sqrt{\frac{\zeta_n\log n}{n}}\,\|\Delta_{iu}\| \quad a.s. \qquad (A.22)$$
Combining (A.19) and (A.22) and noting that $\sqrt{\zeta_n\log n/n} \leq \zeta_n\eta_n$, we have
$$|F(\Delta_{iu})| \leq (4C_H + C_{II})\zeta_n\eta_n\|\Delta_{iu}\|. \qquad (A.23)$$
Then, (A.18) and (A.23) imply
$$(4C_H + C_{II})\zeta_n\eta_n\|\Delta_{iu}\| \geq \min\Big(\frac{c'c_\phi\zeta_n\|\Delta_{iu}\|^2}{6}, \frac{c'\zeta_n q_{in}\sqrt{c_\phi}\,\|\Delta_{iu}\|}{3\sqrt{2}}\Big). \qquad (A.24)$$
On the other hand, by Lemma S1.5, we have
$$\liminf_n\min_{i\in I_1}\frac{c'\sqrt{c_\phi}\,\zeta_n q_{in}\|\Delta_{iu}\|}{3\sqrt{2}} \geq \frac{c'c_\phi c_\sigma\zeta_n\|\Delta_{iu}\|}{12M(1+M_W)} > (4C_H + C_{II})\zeta_n\eta_n\|\Delta_{iu}\|,$$
where the first inequality holds by Lemma S1.5 and the second holds because $c_F$ is sufficiently small so that
$$(4C_H + C_{II})(c_F + c_F^2) < \frac{c'c_\phi c_\sigma}{12M(1+M_W)}.$$
Therefore, (A.24) implies
$$\|(\widehat O^{(1)}_l)^\top\widehat u^{(1)}_{i,l} - u_{i,l}\| \leq \|\Delta_{iu}\| \leq \frac{6(4C_H + C_{II})}{c'c_\phi}\eta_n \leq C^*\eta_n \quad a.s., \qquad (A.25)$$
where $C^* = \max\big(\frac{6(4C_H + C_{II})}{c'c_\phi}, \frac{6(4C_H + C_{II})}{c'c_\phi c_\sigma} + C_\sigma C_uC_\Sigma c_\sigma^{-2} + 48C_{F,3}C_u\big)$. Because the constant $C^*$ does not depend on the index $i$, the above inequality holds uniformly over $i \in I_1$.

Now, we prove the second result in the theorem. The proof follows that of the first result with one notable difference: $\{\widehat u^{(1)}_{i,l}\}_{i\in I_1, l=0,1}$ are not independent of the observations $\{Y_{ij}\}$ given the covariates, and thus the conditional Bernstein inequality argument cannot be directly applied. Recall that $(\dot v^{(0,1)}_{j,0}, \dot v^{(0,1)}_{j,1}) = \arg\min Q^{(0)}_{jn,V}(\nu_0, \nu_1)$, where $Q^{(0)}_{jn,V}(\nu)$ with $\nu = (\nu_0^\top, \nu_1^\top)^\top$ is defined in Section 3.3. Let $\widetilde\Lambda^{(0)}_{ij}(\nu) = \Lambda(\widehat\tau_n + \sum_{l=0}^{1}\nu_l^\top(\widehat O^{(1)}_l)^\top\widehat u^{(1)}_{i,l}W_{l,ij})$ and $\ell^{(0)}_{ij}(\nu) = Y_{ij}\log(\widetilde\Lambda^{(0)}_{ij}(\nu)) + (1-Y_{ij})\log(1-\widetilde\Lambda^{(0)}_{ij}(\nu))$. Define $\widetilde Q^{(0)}_{jn,V}(\nu) = -\frac{1}{n}\sum_{i\in I_1, i\neq j}\ell^{(0)}_{ij}(\nu)$. Then $Q^{(0)}_{jn,V}(\nu_0, \nu_1) = \widetilde Q^{(0)}_{jn,V}((\widehat O^{(1)}_0)^\top\nu_0, (\widehat O^{(1)}_1)^\top\nu_1)$. Recall that $\Lambda_{n,ij} = \Lambda(\tau_n + \sum_{l=0}^{1}u^\top_{i,l}v_{j,l}W_{l,ij}) = \Lambda(\tau_n + \Theta^*_{0,ij} + \Theta^*_{1,ij}W_{1,ij})$. Let $\dot\Lambda_{n,ij} = \Lambda(\widehat\tau_n + \sum_{l=0}^{1}v^\top_{j,l}(\widehat O^{(1)}_l)^\top\widehat u^{(1)}_{i,l}W_{l,ij})$ and $\widetilde\Lambda_{n,ij} = \Lambda(\dot a_{n,ij})$, where $\dot a_{n,ij}$ is an intermediate value between $\tau_n + \Theta^*_{0,ij} + \Theta^*_{1,ij}W_{1,ij}$ and $\widehat\tau_n + \sum_{l=0}^{1}v^\top_{j,l}(\widehat O^{(1)}_l)^\top\widehat u^{(1)}_{i,l}W_{l,ij}$. Define
$$\dot\psi_{ij} = \begin{bmatrix}(\widehat O^{(1)}_0)^\top\widehat u^{(1)}_{i,0}\\ (\widehat O^{(1)}_1)^\top\widehat u^{(1)}_{i,1}W_{1,ij}\end{bmatrix} \quad\text{and}\quad \dot\Psi_j = \frac{1}{n}\sum_{i\in I_1, i\neq j}\dot\psi_{ij}\dot\psi_{ij}^\top.$$
Let $\Delta_{jv} \equiv (\Delta^\top_{j,0}, \Delta^\top_{j,1})^\top$, where $\Delta_{j,l} = (\widehat O^{(1)}_l)^\top\dot v^{(0,1)}_{j,l} - v_{j,l}$ for $l = 0, 1$. Then we have
$$0 \geq Q^{(0)}_{jn,V}(\dot v^{(0,1)}_{j,0}, \dot v^{(0,1)}_{j,1}) - Q^{(0)}_{jn,V}(\widehat O^{(1)}_0v_{j,0}, \widehat O^{(1)}_1v_{j,1}) = \widetilde Q^{(0)}_{jn,V}((\widehat O^{(1)}_0)^\top\dot v^{(0,1)}_{j,0}, (\widehat O^{(1)}_1)^\top\dot v^{(0,1)}_{j,1}) - \widetilde Q^{(0)}_{jn,V}(v_{j,0}, v_{j,1}) \geq -\frac{1}{n}\sum_{i\in I_1, i\neq j}(Y_{ij} - \dot\Lambda_{n,ij})(\dot\psi_{ij})^\top\Delta_{jv} + \frac{c'\zeta_n}{n}\sum_{i\in I_1, i\neq j}\Big[\frac{((\dot\psi_{ij})^\top\Delta_{jv})^2}{2} - \frac{|(\dot\psi_{ij})^\top\Delta_{jv}|^3}{6}\Big].$$
By the first result that $\max_{i\in I_1}\|(\widehat O^{(1)}_l)^\top\widehat u^{(1)}_{i,l} - u_{i,l}\| \leq C^*\eta_n$, we have
$$\max_{i\in I_1}\|(\widehat O^{(1)}_l)^\top\widehat u^{(1)}_{i,l}W_{l,ij}\| \leq M_W\max_{i\in I_1}\big[\|(\widehat O^{(1)}_l)^\top\widehat u^{(1)}_{i,l} - u_{i,l}\| + \|u_{i,l}\|\big] \leq M_W(C^*\eta_n + C_\sigma C_u) < \infty.$$
Therefore, similar to (S1.1), we have
$$\|\dot\Psi_j - \Psi_j(I_1)\| \leq \frac{2M_W(C^*\eta_n + C_\sigma C_u)}{n}\sum_{l=0}^{1}\sum_{i\in I_1}\|(\widehat O^{(1)}_l)^\top\widehat u^{(1)}_{i,l} - u_{i,l}\| \leq 2M_W(C^*\eta_n + C_\sigma C_u)C^*\eta_n \quad a.s.$$
As $c_F$ is sufficiently small so that $4M_W(C^*\eta_n + C_\sigma C_u)C^*(c_F + c_F^2) \leq c_\phi/2$, we have $\min_{j\in[n]}\lambda_{\min}(\dot\Psi_j) \geq c_\phi/2$ a.s. Let
$$F(\Delta_{jv}) = \widetilde Q^{(0)}_{jn,V}(v_{j,0} + \Delta_{j,0}, v_{j,1} + \Delta_{j,1}) - \widetilde Q^{(0)}_{jn,V}(v_{j,0}, v_{j,1}) + \frac{1}{n}\sum_{i\in I_1, i\neq j}(Y_{ij} - \dot\Lambda_{n,ij})(\dot\psi_{ij})^\top\Delta_{jv}.$$
Following the same argument as in the proof of the first result, we have
$$F(\Delta_{jv}) \geq \min\Big(\frac{c'c_\phi\zeta_n\|\Delta_{jv}\|^2}{6}, \frac{c'\zeta_n q_{jn}\sqrt{c_\phi}\,\|\Delta_{jv}\|}{3\sqrt{2}}\Big), \quad\text{where}\quad q_{jn} = \inf_\Delta\frac{\big[\frac{1}{n}\sum_{i\in I_1, i\neq j}((\dot\psi_{ij})^\top\Delta)^2\big]^{3/2}}{\frac{1}{n}\sum_{i\in I_1, i\neq j}|(\dot\psi_{ij})^\top\Delta|^3}.$$
For the upper bound of $F(\Delta_{jv})$, we can show that
$$F(\Delta_{jv}) \leq \Big|\frac{1}{n}\sum_{i\in I_1, i\neq j}(Y_{ij} - \Lambda_{n,ij})(\dot\psi_{ij})^\top\Delta_{jv}\Big| + \Big|\frac{1}{n}\sum_{i\in I_1, i\neq j}(\dot\Lambda_{n,ij} - \Lambda_{n,ij})(\dot\psi_{ij})^\top\Delta_{jv}\Big| \equiv \widetilde I_j + \widetilde{II}_j.$$
We first bound $\widetilde{II}_j$. Following Lemma S1.3(1), we have $\|v^\top_{j,l}(\widehat O^{(1)}_l)^\top\widehat u^{(1)}_{i,l}W_{l,ij}\| \lesssim \|(\widehat O^{(1)}_l)^\top\widehat u^{(1)}_{i,l} - u_{i,l}\| + \|u_{i,l}\| \leq C < \infty$. Then, by the same argument as in the proof of Lemma S1.3(2), we have $c'\zeta_n \leq \dot\Lambda_{n,ij} \leq c''\zeta_n$ and $c'\zeta_n \leq \widetilde\Lambda_{n,ij} \leq c''\zeta_n$ for some constants $\infty > c'' > c' > 0$. Following (A.19) and noticing that $\frac{1}{n}\sum_{i\in I_1, i\neq j}\|(\widehat O^{(1)}_l)^\top\widehat u^{(1)}_{i,l} - u_{i,l}\| \leq C^*\eta_n$, we have
$$\widetilde{II}_j \leq C_{II}'\zeta_n\eta_n\|\Delta_{jv}\| \qquad (A.26)$$
for some constant $C_{II}' > 0$. The bound for $\widetilde I_j$ differs from that for $I_i$, as we no longer have independence between $\dot\psi_{ij}$ and $Y_{ij} - \Lambda_{n,ij}$ given $\{W_{1,ij}\}_{1\leq i<j\leq n}$.

References

Abbe, E., A. S. Bandeira, and G. Hall (2016). Exact recovery in the stochastic block model. IEEE Transactions on Information Theory 62(1), 471-487.

Abbe, E., J. Fan, K. Wang, and Y. Zhong (2017). Entrywise eigenvector analysis of random matrices with low expected rank. arXiv preprint arXiv:1709.09565.

Abbe, E. and C. Sandon (2015). Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pp. 670-688. IEEE.

Ahn, S. C. and A. R. Horenstein (2013). Eigenvalue ratio test for the number of factors. Econometrica 81(3), 1203-1227.

Alidaee, H., E. Auerbach, and M. P. Leung (2020). Recovering network structure from aggregated relational data using penalized regression. arXiv preprint arXiv:2001.06052.

Bach, F. (2010). Self-concordant analysis for logistic regression. Electronic Journal of Statistics 4, 384-414.

Bandeira, A. S. and R. van Handel (2016). Sharp nonasymptotic bounds on the norm of random matrices with independent entries. The Annals of Probability 44(4), 2479-2506.

Bean, D., P. J. Bickel, N. El Karoui, and B. Yu (2013). Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences 110(36), 14563-14568.

Belloni, A., M. Chen, and O. H. M. Padilla (2019). High dimensional latent panel quantile regression with an application to asset pricing. arXiv preprint arXiv:1912.02151.

Belloni, A., V. Chernozhukov, I. Fernández-Val, and C. Hansen (2017). Program evaluation with high-dimensional data. Econometrica 85(1), 233-298.

Binkiewicz, N., J. T. Vogelstein, and K. Rohe (2017). Covariate-assisted spectral clustering. Biometrika 104(2), 361-377.

Bonhomme, S. and E. Manresa (2015). Grouped patterns of heterogeneity in panel data. Econometrica 83(3), 1147-1184.

Cabral, R., F. De la Torre, J. P. Costeira, and A. Bernardino (2013). Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition. IEEE International Conference on Computer Vision, 2488-2495.

Chatterjee, S., P. Diaconis, and A. Sly (2011). Random graphs with a given degree sequence. The Annals of Applied Probability 21(4), 1400-1435.

Chernozhukov, V., C. Hansen, Y. Liao, and Y. Zhu (2018). Inference for heterogeneous effects using low-rank estimations. arXiv preprint arXiv:1812.08089.
Fan, J., W. Gong, and Z. Zhu (2019). Generalized high-dimensional trace regression via nuclear norm regularization. Journal of Econometrics 212(1), 177-202.

Feng, J. (2019). Regularized quantile regression with interactive fixed effects. arXiv preprint arXiv:1911.00166.

Graham, B. S. (2017). An econometric model of network formation with degree heterogeneity. Econometrica 85(4), 1033-1063.

Holland, P. W., K. B. Laskey, and S. Leinhardt (1983). Stochastic blockmodels: First steps. Social Networks 5(2), 109-137.

Holland, P. W. and S. Leinhardt (1981). An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association 76(373), 33-50.

Javanmard, A. and A. Montanari (2018). De-biasing the lasso: Optimal sample size for Gaussian designs. The Annals of Statistics 46(6A), 2593-2622.

Jin, J. (2015). Fast community detection by SCORE. The Annals of Statistics 43(1), 57-89.

Jochmans, K. (2019). Modified-likelihood estimation of fixed-effect models for dyadic data.

Joseph, A. and B. Yu (2016). Impact of regularization on spectral clustering. The Annals of Statistics 44(4), 1765-1791.

Koltchinskii, V., K. Lounici, and A. B. Tsybakov (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics 39(5), 2302-2329.

Lam, C. and Q. Yao (2012). Factor modeling for high-dimensional time series: Inference for the number of factors. The Annals of Statistics 40(2), 694-726.

Leger, J.-B. (2016). Blockmodels: An R package for estimating in latent block model and stochastic block model, with various probability functions, with or without covariates. arXiv preprint arXiv:1602.07587.

Lei, J. and A. Rinaldo (2015). Consistency of spectral clustering in stochastic block models. The Annals of Statistics 43(1), 215-237.

Leung, M. P. (2015). Two-step estimation of network-formation models with incomplete information. Journal of Econometrics 188(1), 182-195.

Lusher, D., J. Koskinen, and G. Robins (2013). Exponential Random Graph Models for Social Networks: Theory, Methods, and Applications. Cambridge University Press.

Mele, A. (2017a). A structural model of dense network formation. Econometrica 85(3), 825-850.

Mele, A. (2017b). A structural model of homophily and clustering in social networks. Available at SSRN 3031489.

Moon, H. R. and M. Weidner (2018). Nuclear norm regularized estimation of panel regression models. arXiv preprint arXiv:1810.10987.

Mossel, E., J. Neeman, and A. Sly (2014). Consistency thresholds for binary symmetric block models. arXiv preprint arXiv:1407.1591. In Proc. of STOC'15.

Negahban, S. and M. J. Wainwright (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics 39(2), 1069-1097.

Negahban, S. N., P. Ravikumar, M. J. Wainwright, and B. Yu (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science 27(4), 538-557.

Paul, S. and Y. Chen (2020). Spectral and matrix factorization methods for consistent community detection in multi-layer networks. The Annals of Statistics 48(1), 230-250.

Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory 7(2), 186-199.

Qin, T. and K. Rohe (2013). Regularized spectral clustering under the degree-corrected stochastic blockmodel. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, Volume 26, pp. 3120-3128. Curran Associates, Inc.
Rinaldo, A., S. Petrović, and S. E. Fienberg (2013). Maximum likelihood estimation in the β-model. The Annals of Statistics 41(3), 1085-1110.

Rohde, A. and A. B. Tsybakov (2011). Estimation of high-dimensional low-rank matrices. The Annals of Statistics 39(2), 887-930.

Rohe, K., S. Chatterjee, and B. Yu (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics 39(4), 1878-1915.

Roy, S., Y. Atchadé, and G. Michailidis (2019). Likelihood inference for large scale stochastic blockmodels with covariates based on a divide-and-conquer parallelizable algorithm with communication. Journal of Computational and Graphical Statistics 28(3), 609-619.

Sarkar, P. and P. J. Bickel (2015). Role of normalization in spectral clustering for stochastic blockmodels. The Annals of Statistics 43(3), 962-990.

Sengupta, S. and Y. Chen (2015). Spectral clustering in heterogeneous networks. Statistica Sinica 25(3), 1081-1106.

Su, L., Z. Shi, and P. C. B. Phillips (2016). Identifying latent structures in panel data. Econometrica 84(6), 2215-2264.

Su, L., W. Wang, and Y. Zhang (2020). Strong consistency of spectral clustering for stochastic block models. IEEE Transactions on Information Theory 66(1), 324-338.

Sweet, T. M. (2015). Incorporating covariates into stochastic blockmodels. Journal of Educational and Behavioral Statistics 40(6), 635-664.

Traud, A., P. Mucha, and M. Porter (2012). Social structure of Facebook networks. Physica A: Statistical Mechanics and its Applications 391(16), 4165-4180.

Vu, V. (2018). A simple SVD algorithm for finding hidden partitions. Combinatorics, Probability and Computing 27(1), 124-140.

Wang, Y. J. and G. Y. Wong (1987). Stochastic blockmodels for directed graphs. Journal of the American Statistical Association 82(397), 8-19.

Yan, T., B. Jiang, S. E. Fienberg, and C. Leng (2019). Statistical inference in a directed network model with covariates. Journal of the American Statistical Association 114(526), 857-868.

Yan, T. and J. Xu (2013). A central limit theorem in the β-model for undirected random graphs with a diverging number of vertices. Biometrika 100(2), 519-524.

Yun, S.-Y. and A. Proutiere (2014). Accurate community detection in the stochastic block model via spectral algorithms. arXiv preprint arXiv:1412.7335.

Yun, S.-Y. and A. Proutiere (2016). Optimal cluster recovery in the labeled stochastic block model. In Advances in Neural Information Processing Systems, pp. 965-973.

Zhong, Y. and N. Boumal (2018). Near-optimal bounds for phase synchronization. SIAM Journal on Optimization 28(2), 989-1016.

Online Supplement to "Detecting Latent Communities in Network Formation Models"

Shujie Ma(a), Liangjun Su(b,c), and Yichong Zhang(c)
(a) Department of Statistics, University of California, Riverside
(b) School of Economics and Management, Tsinghua University
(c) School of Economics, Singapore Management University

This supplement is composed of three parts. Section S1 contains some technical lemmas used in the proofs of the main results in the paper. Section S2 provides more details on the inference for $B^*$. Section S3 describes the algorithm used to implement the nuclear norm regularized estimation.

S1 Some Technical Lemmas

Lemma S1.1. Let $C_\Upsilon$ be a sufficiently large fixed constant. Suppose that the assumptions in Theorem 4.1 hold. Then $\max_{l=0,1}\|\Upsilon_l\|_{op} \leq C_\Upsilon M_W(\sqrt{\zeta_n n} + \sqrt{\log n})$ a.s.

Proof.
Let $\mathcal{C} = \{X_i\}_{i=1}^n \cup \{e_{ij}\}_{1\leq i,j\leq n}$. … ∎

Lemma S1.2. Suppose $M \geq t \geq 0$ for some $M \geq \log 2$. Then $\exp(-t) + t - 1 \geq \frac{t^2}{4M}$.

Proof. Let $f(t) = \exp(-t) + t - 1 - \frac{t^2}{4M}$. Then $f'(t) = 1 - \exp(-t) - \frac{t}{2M}$. We want to show $f'(t) \geq 0$ for $t \in [0, M]$, which implies that $\min_{t\in[0,M]}f(t) = f(0) = 0$. Note that $f'(M) = 0.5 - \exp(-M) \geq 0$ because $M \geq \log 2$. In addition, $f'(t)$ is concave with $f'(0) = 0$, so for any $t \in [0, M]$, $f'(t) \geq f'(M)\frac{t}{M} \geq 0$. This concludes the proof. ∎

Lemma S1.3. Suppose that the assumptions in Theorem 4.1 hold. Then:
1. $\max_{j\in[n]}\|(\widehat O^{(1)}_l)^\top\widehat v^{(1)}_{j,l}\| \leq 2M\sigma^{-1}_{K_l,l}$ a.s.;
2. there exist some constants $\infty > c'' > c' > 0$ such that $c'\zeta_n \leq \widehat\Lambda_{n,ij} \leq c''\zeta_n$ and $c'\zeta_n \leq \widetilde\Lambda_{n,ij} \leq c''\zeta_n$ a.s., where $\widehat\Lambda_{n,ij}$ and $\widetilde\Lambda_{n,ij}$ are defined in (A.14) and (A.15), respectively.

Proof. 1. Note that
$$\|(\widehat O^{(1)}_l)^\top\widehat v^{(1)}_{j,l}\| = \|\widehat v^{(1)}_{j,l}\| \leq \widehat\sigma^{-1}_{K_l,l}\|\widehat\Sigma^{(1)}_l\widehat v^{(1)}_{j,l}\| = n^{-1/2}\widehat\sigma^{-1}_{K_l,l}\big\|[(\widehat U^{(1)}_l)^\top\widehat\Theta^{(1)}_l]_{\cdot j}\big\| \leq n^{-1/2}\widehat\sigma^{-1}_{K_l,l}\big\|[\widehat\Theta^{(1)}_l]_{\cdot j}\big\| \leq M\widehat\sigma^{-1}_{K_l,l} \leq 2M\sigma^{-1}_{K_l,l},$$
where the first equality holds because $\widehat O^{(1)}_l$ is unitary, the second equality holds because $n^{-1/2}(\widehat U^{(1)}_l)^\top\widehat\Theta^{(1)}_l = \widehat\Sigma^{(1)}_l\frac{1}{\sqrt{n}}(\widehat V^{(1)}_l)^\top \equiv \widehat\Sigma^{(1)}_l(\widehat{\mathcal{V}}^{(1)}_l)^\top$, the first inequality holds because $(\widehat U^{(1)}_l)^\top\widehat U^{(1)}_l = I_{K_l}$, and the last two inequalities hold because $|\widehat\Theta^{(1)}_{l,ij}| \leq M$ by construction and, by Theorem 4.1 and the fact that $c_F$ is sufficiently small so that $2C_{F,2}\eta_n \leq \sigma_{K_l,l}/2$,
$$|\widehat\sigma^{-1}_{K_l,l} - \sigma^{-1}_{K_l,l}| \leq \frac{|\widehat\sigma_{K_l,l} - \sigma_{K_l,l}|}{\sigma_{K_l,l}(\sigma_{K_l,l} - |\widehat\sigma_{K_l,l} - \sigma_{K_l,l}|)} \leq \sigma^{-1}_{K_l,l} \quad a.s.$$
As the constant $M$ does not depend on $j$, the result holds uniformly over $j = 1, \cdots, n$.

2. By Theorem 4.1 and the previous result,
$$\Big|\widehat\tau_n + \sum_{l=0}^{1}u^\top_{i,l}(\widehat O^{(1)}_l)^\top\widehat v^{(1)}_{j,l}W_{l,ij} - \tau_n\Big| \leq |\widehat\tau_n - \tau_n| + \Big|\sum_{l=0}^{1}u^\top_{i,l}(\widehat O^{(1)}_l)^\top\widehat v^{(1)}_{j,l}W_{l,ij}\Big| \leq 2C_{F,1}\eta_n + C,$$
and thus there exist some constants $\infty > c'' > c' > 0$ such that $c'\zeta_n \leq \widehat\Lambda_{n,ij} \leq c''\zeta_n$. For the same reason, we have $c'\zeta_n \leq \widetilde\Lambda_{n,ij} \leq c''\zeta_n$. ∎

Lemma S1.4. Suppose Assumptions 1-6 hold. Recall that
$$\widehat\Phi^{(1)}_i = \frac{1}{n}\sum_{j\in I_1, j\neq i}\begin{bmatrix}(\widehat O^{(1)}_0)^\top\widehat v^{(1)}_{j,0}\\(\widehat O^{(1)}_1)^\top\widehat v^{(1)}_{j,1}W_{1,ij}\end{bmatrix}\begin{bmatrix}(\widehat O^{(1)}_0)^\top\widehat v^{(1)}_{j,0}\\(\widehat O^{(1)}_1)^\top\widehat v^{(1)}_{j,1}W_{1,ij}\end{bmatrix}^\top.$$
Then, for the constant $c_\phi$ defined in Assumption 6, $\min_{i\in I_1}\lambda_{\min}(\widehat\Phi^{(1)}_i) \geq c_\phi/2$ a.s.

Proof. By Lemma S1.3(1), $\|(\widehat O^{(1)}_l)^\top\widehat v^{(1)}_{j,l}\| \leq 2M\sigma^{-1}_{K_l,l}$ for $l = 0, 1$. Then we have
$$\|\widehat\Phi^{(1)}_i - \Phi_i(I_1)\| \leq \frac{4M}{n}\sum_{l=0}^{1}\sum_{j\in I_1}\sigma^{-1}_{K_l,l}\|(\widehat O^{(1)}_l)^\top\widehat v^{(1)}_{j,l} - v_{j,l}\| \leq 4M\sum_{l=0}^{1}\sigma^{-1}_{K_l,l}\,n^{-1/2}\|\widehat V^{(1)}_l\widehat O^{(1)}_l - V_l\|_F \leq \big(\sqrt{K_0} + \sqrt{K_1}\big)4C_\sigma MC_{F,3}c_\sigma^{-2}\eta_n \quad a.s., \qquad (S1.1)$$
where the second inequality holds due to the Cauchy-Schwarz inequality and the last holds due to Theorem 4.1. As $c_F$ is sufficiently small so that $1088\sqrt{K_l}\,C_\sigma MC_{F,3}c_\sigma^{-2}(c_F + c_F^2) \leq c_\phi/2$, we have
$$\min_{i\in I_1}\lambda_{\min}(\widehat\Phi^{(1)}_i) \geq \min_{i\in I_1}\lambda_{\min}(\Phi_i(I_1)) - \big(\sqrt{K_0} + \sqrt{K_1}\big)4C_\sigma MC_{F,3}c_\sigma^{-2}\eta_n \geq c_\phi/2 \quad a.s. \ ∎$$

Lemma S1.5. Let $q_{in}$ be defined in (A.17). Suppose that Assumptions 1-6 hold. Then
$$\liminf_n\min_{i\in I_1}q_{in} \geq \frac{\sqrt{c_\phi}\,c_\sigma}{2\sqrt{2}\,M(1+M_W)} > 0 \quad a.s.,$$
where $c_\phi$ and $M$ are the constants in Assumption 6 and Lemma S1.3, respectively.

Proof.
Note that
$$q_{in} \geq \inf_\Delta\sqrt{\frac{c_\sigma^2\,\frac{1}{n}\sum_{j\in I_1, j\neq i}((\widehat\phi^{(1)}_{ij})^\top\Delta)^2}{4M^2(1+M_W)^2\|\Delta\|^2}} \geq \frac{c_\sigma\sqrt{\liminf_n\min_{i\in I_1}\lambda_{\min}(\widehat\Phi^{(1)}_i)}}{2M(1+M_W)} \geq \frac{\sqrt{c_\phi}\,c_\sigma}{2\sqrt{2}\,M(1+M_W)} > 0,$$
where the first inequality is due to Lemma S1.3(1), which gives $\max_j\|\widehat\phi^{(1)}_{ij}\| \leq 2M(1+M_W)c_\sigma^{-1}$, and the second inequality is due to Lemma S1.4. ∎

S2 More Details on the Inference for $B^*$

In this appendix, we provide more details on the inference for $B^*$ discussed in Section 4.5 via two examples.

S2.1 Example 1

In this example, we consider the tetrad logit regression of Graham (2017). Let
$$S_{ij,i'j'} = Y_{ij}Y_{i'j'}(1 - Y_{ii'})(1 - Y_{jj'}) - (1 - Y_{ij})(1 - Y_{i'j'})Y_{ii'}Y_{jj'}.$$
Then, for an arbitrary $K_1(K_1+1)/2$-dimensional vector $B$, the conditional likelihood of $S_{ij,i'j'}$ given $S_{ij,i'j'} \in \{-1, 1\}$ is
$$\ell_{ij,i'j'}(B) = |S_{ij,i'j'}|\Big[S_{ij,i'j'}\,\widetilde\omega^\top_{1,ij,i'j'}B - \log\big(1 + \exp(S_{ij,i'j'}\,\widetilde\omega^\top_{1,ij,i'j'}B)\big)\Big],$$
where $\widetilde\omega_{1,ij,i'j'} = \omega_{1,ij} + \omega_{1,i'j'} - (\omega_{1,ii'} + \omega_{1,jj'})$. Further denote
$$\bar\ell_{ij,i'j'}(B) = \frac{1}{3}\big(\ell_{ij,i'j'}(B) + \ell_{ij,j'i'}(B) + \ell_{ii',j'j}(B)\big).$$
Following Graham (2017), we define the tetrad regression estimator $\widehat B$ of $\mathrm{vech}(B^*)$ as
$$\widehat B = \arg\max_B\sum_{i<j<i'<j'}\bar\ell_{ij,i'j'}(B).$$
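As an illustration of these tetrad objects, the statistic $S_{ij,i'j'}$, the differenced covariate $\widetilde\omega_{1,ij,i'j'}$, and the conditional log likelihood can be coded directly from the displays above. The following is a minimal sketch assuming an $n \times n$ 0-1 adjacency array `Y` and an $n \times n \times d$ edge-covariate array `omega1` (both names are ours), following our reading of the likelihood display rather than any published implementation:

```python
import numpy as np

def tetrad_S(Y, i, j, ip, jp):
    """S_{ij,i'j'} = Y_ij Y_i'j' (1-Y_ii')(1-Y_jj')
                     - (1-Y_ij)(1-Y_i'j') Y_ii' Y_jj'."""
    return (Y[i, j] * Y[ip, jp] * (1 - Y[i, ip]) * (1 - Y[j, jp])
            - (1 - Y[i, j]) * (1 - Y[ip, jp]) * Y[i, ip] * Y[j, jp])

def tetrad_omega(omega1, i, j, ip, jp):
    """omega_tilde_{1,ij,i'j'} = omega_{1,ij} + omega_{1,i'j'}
                                 - (omega_{1,ii'} + omega_{1,jj'})."""
    return omega1[i, j] + omega1[ip, jp] - (omega1[i, ip] + omega1[j, jp])

def tetrad_loglik(B, Y, omega1, i, j, ip, jp):
    """l_{ij,i'j'}(B); nonzero only on the conditioning event |S| = 1."""
    S = tetrad_S(Y, i, j, ip, jp)
    idx = S * (tetrad_omega(omega1, i, j, ip, jp) @ B)
    return abs(S) * (idx - np.log1p(np.exp(idx)))
```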
Two log likelihoods $\bar\ell$ based on different tetrads are dependent only if the two tetrads share 2, 3, or 4 nodes in common. Then, we make the following assumption on the Hessian matrix.

Assumption 9. Suppose that $\Gamma_0 \equiv \lim_{n\to\infty}t_n^{-1}\sum_{i<j<i'<j'}E\big[-\partial^2\bar\ell_{ij,i'j'}(B^*)/\partial B\partial B^\top\big]$ is well defined and nonsingular.
S2.2 Example 2

In this example, we consider the logistic maximum likelihood estimation. Let $\Lambda_{n,ij}(u) = \Lambda(\omega_{ij}^\top[\mathrm{vech}(B^*) + u(n^2\zeta_n)^{-1/2}])$ and $\Lambda_{n,ij} \equiv \Lambda_{n,ij}(0)$, where $\omega_{ij} = (\chi^\top_{0,ij}, \chi^\top_{1,ij}W_{1,ij})^\top$ is a $\widetilde K$-vector with $\widetilde K = \sum_{l=0}^{1}K_l(K_l+1)/2$. Note that $\Lambda_{n,ij} = \Lambda(W_{ij}^\top\Gamma^*_{ij})$. Now, we consider an alternative assumption.

Assumption 10. $\displaystyle\sup_{\|u\|\leq C}\ \frac{1}{n^2\zeta_n}\sum_{1\leq i<j\leq n}\cdots$

Suppose that Assumptions 1-8 and 10 hold. Let $\widehat H_n = \sum_{1\leq i<j\leq n}\cdots$. Then, for any $\epsilon > 0$, there exists $n_0$ sufficiently large so that for all $n \geq n_0$ and $k \in [\widetilde K]$,
$$\frac{1}{n^2\zeta_n}\sum_{1\leq i<j\leq n}\cdots$$

S3 Algorithm for the Nuclear Norm Regularized Estimation

We apply the optimization algorithm proposed in Cabral, De la Torre, Costeira, and Bernardino (2013) to obtain the nuclear norm penalized estimator given in (3.2). For any given $r_l$ with $K_l \leq r_l \leq n$, $\Gamma_l$ can be written as $\Gamma_l = U_lV_l^\top$, where $U_l \in \mathbb{R}^{n\times r_l}$ and $V_l \in \mathbb{R}^{n\times r_l}$, for $l = 0, \ldots, p$. We consider the optimization problem
$$\min\ Q^{(1)}_n(\Gamma) + \frac{\lambda^{(1)}_n}{2}\sum_{l=0}^{p}\gamma_l\big(\|U_l\|_F^2 + \|V_l\|_F^2\big), \qquad (S3.1)$$
where $\Gamma = (\Gamma_l,\ l = 0, \ldots, p)$ and
$$Q^{(1)}_n(\Gamma) = \sum_{i\in I_1, j\in[n], i\neq j}\big[-Y_{ij}(W_{ij}^\top\Gamma_{ij}) + \log\{1 + \exp(W_{ij}^\top\Gamma_{ij})\}\big],$$
subject to $\Gamma_l = U_lV_l^\top$ for $l = 0, \ldots, p$. Let $\lambda^{(1)}_n = C_\lambda(\sqrt{\zeta_n n} + \sqrt{\log n})$.

Let $\Gamma^*_l$, $l = 0, \ldots, p$, be an optimal solution of (3.2) with $\mathrm{rank}(\Gamma^*_l) = K^*_l$. Cabral et al. (2013) show that any solution $\Gamma_l = U_lV_l^\top$, $l = 0, \ldots, p$, of (S3.1) with $r_l \geq K^*_l$ is a solution of (3.1). Next, we apply the Augmented Lagrange Multiplier (ALM) method of Cabral et al. (2013) to solve (S3.1). The augmented Lagrangian of (S3.1) is
$$Q^{(1)}_n(\Gamma) + \frac{\lambda^{(1)}_n}{2}\sum_{l=0}^{p}\gamma_l\big(\|U_l\|_F^2 + \|V_l\|_F^2\big) + \sum_{l=0}^{p}\big\langle\Delta_l, \Gamma_l - U_lV_l^\top\big\rangle + \frac{\rho}{2}\sum_{l=0}^{p}\|\Gamma_l - U_lV_l^\top\|_F^2,$$
where the $\Delta_l$ are Lagrange multipliers and $\rho$ is a penalty parameter introduced to improve convergence. The algorithm iterates the following steps.

1. At step $m+1$, given $(U^m_l, V^m_l, \Delta^m_l, \Gamma^m,\ l = 0, \ldots, p)$, $\Gamma^{m+1}$ minimizes
$$L_n(\Gamma) = Q^{(1)}_n(\Gamma) + \sum_{l=0}^{p}\big\langle\Delta^m_l, \Gamma_l - U^m_lV^{m\top}_l\big\rangle + \frac{\rho}{2}\sum_{l=0}^{p}\|\Gamma_l - U^m_lV^{m\top}_l\|_F^2 + C.$$
Moreover, for $i \in I_1$, $j \in [n]$, $i \neq j$,
$$\frac{\partial L_n(\Gamma)}{\partial\Gamma_{l,ij}} = (\mu_{ij} - Y_{ij})W_{l,ij} + \Delta^m_{l,ij} + \rho\big(\Gamma_{l,ij} - [U^m_lV^{m\top}_l]_{ij}\big),$$
where $\mu_{ij} = \exp(\sum_{l=0}^{p}W_{l,ij}\Gamma_{l,ij})\{1 + \exp(\sum_{l=0}^{p}W_{l,ij}\Gamma_{l,ij})\}^{-1}$, and
$$\frac{\partial^2L_n(\Gamma)}{\partial\Gamma^2_{l,ij}} = \mu_{ij}(1-\mu_{ij})W^2_{l,ij} + \rho, \qquad \frac{\partial^2L_n(\Gamma)}{\partial\Gamma_{l,ij}\partial\Gamma_{l',ij}} = \mu_{ij}(1-\mu_{ij})W_{l,ij}W_{l',ij} \quad\text{for } l \neq l'.$$
For $i = j \in I_1$, $\frac{\partial L_n(\Gamma)}{\partial\Gamma_{l,ij}} = \Delta^m_{l,ij} + \rho(\Gamma_{l,ij} - [U^m_lV^{m\top}_l]_{ij})$, $\frac{\partial^2L_n(\Gamma)}{\partial\Gamma^2_{l,ij}} = \rho$, and $\frac{\partial^2L_n(\Gamma)}{\partial\Gamma_{l,ij}\partial\Gamma_{l',ij}} = 0$. Then,
$$\Gamma^{m+1}_{ij} = \Gamma^m_{ij} - \Big(\frac{\partial^2L_n(\Gamma^m)}{\partial\Gamma_{ij}\partial\Gamma^\top_{ij}}\Big)^{-1}\frac{\partial L_n(\Gamma^m)}{\partial\Gamma_{ij}},$$
where $\Gamma_{ij} = (\Gamma_{0,ij}, \ldots, \Gamma_{p,ij})^\top$. Update
$$\Gamma^{m+1}_{l,ij} = \Gamma^{m+1}_{l,ij}\,1\{|\Gamma^{m+1}_{l,ij}| \leq \log n\} + \mathrm{sgn}(\Gamma^{m+1}_{l,ij})\log n\,1\{|\Gamma^{m+1}_{l,ij}| > \log n\}.$$

2. Given $(U^m_l, V^m_l, \Delta^m_l, \Gamma^{m+1})$, $U^{m+1}_l$ minimizes
$$\frac{\lambda^{(1)}_n}{2}\gamma_l\big(\|U_l\|_F^2 + \|V^m_l\|_F^2\big) + \big\langle\Delta^m_l, \Gamma^{m+1}_l - U_lV^{m\top}_l\big\rangle + \frac{\rho}{2}\|\Gamma^{m+1}_l - U_lV^{m\top}_l\|_F^2 + C,$$
so that $U^{m+1}_l = (\Delta^m_l + \rho\Gamma^{m+1}_l)V^m_l\big(\lambda^{(1)}_n\gamma_lI_{r_l} + \rho V^{m\top}_lV^m_l\big)^{-1}$. Similarly, $V^{m+1}_l = (\Delta^m_l + \rho\Gamma^{m+1}_l)^\top U^{m+1}_l\big(\lambda^{(1)}_n\gamma_lI_{r_l} + \rho U^{m+1\top}_lU^{m+1}_l\big)^{-1}$.

3. Let $\Delta^{m+1}_l = \Delta^m_l + \rho\big(\Gamma^{m+1}_l - U^{m+1}_lV^{m+1\top}_l\big)$.

4. Let $\rho = \min(\rho\mu, \rho_{\max})$ for a scale factor $\mu > 1$ and a prespecified upper bound $\rho_{\max}$.
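To make the iteration concrete, the following is a compact Python sketch of one ALM sweep for the case $p = 1$ (so $W_{0,ij} \equiv 1$), using the full sample and treating the diagonal like the off-diagonal entries for simplicity. The step constants ($\mu = 1.05$ and the cap $10^6$) and all names are illustrative choices of ours, not values from the paper.

```python
import numpy as np

def alm_step(Y, W1, Gam, U, V, Lam, lam_n, gamma, rho):
    """One ALM sweep for the factorized nuclear-norm logistic problem (S3.1).
    Gam, U, V, Lam are length-2 lists for l = 0, 1; W_{0,ij} = 1."""
    n = Y.shape[0]
    W = [np.ones((n, n)), W1]
    # Step 1: elementwise Newton update of Gamma (a 2x2 Hessian per (i,j)).
    eta = Gam[0] + W1 * Gam[1]
    mu = 1.0 / (1.0 + np.exp(-eta))
    g = [(mu - Y) * W[l] + Lam[l] + rho * (Gam[l] - U[l] @ V[l].T)
         for l in (0, 1)]
    h00 = mu * (1 - mu) * W[0] ** 2 + rho
    h11 = mu * (1 - mu) * W[1] ** 2 + rho
    h01 = mu * (1 - mu) * W[0] * W[1]
    det = h00 * h11 - h01 ** 2
    Gam = [Gam[0] - (h11 * g[0] - h01 * g[1]) / det,
           Gam[1] - (h00 * g[1] - h01 * g[0]) / det]
    Gam = [np.clip(G, -np.log(n), np.log(n)) for G in Gam]  # |Gamma| <= log n
    # Step 2: closed-form ridge updates of the factors U_l and V_l.
    r = U[0].shape[1]
    for l in (0, 1):
        A = Lam[l] + rho * Gam[l]
        U[l] = A @ V[l] @ np.linalg.inv(lam_n * gamma * np.eye(r)
                                        + rho * V[l].T @ V[l])
        V[l] = A.T @ U[l] @ np.linalg.inv(lam_n * gamma * np.eye(r)
                                          + rho * U[l].T @ U[l])
    # Step 3: dual update; Step 4: increase rho (capped).
    Lam = [Lam[l] + rho * (Gam[l] - U[l] @ V[l].T) for l in (0, 1)]
    rho = min(rho * 1.05, 1e6)
    return Gam, U, V, Lam, rho
```

In practice one repeats `alm_step` until each `Gam[l]` and `U[l] @ V[l].T` agree within a tolerance, at which point the truncated `Gam` plays the role of the penalized estimator.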