Sparse Logistic Regression Learns All Discrete Pairwise Graphical Models
Shanshan Wu, Sujay Sanghavi, Alexandros G. Dimakis
[email protected], [email protected], [email protected]
Department of Electrical and Computer Engineering, University of Texas at Austin
Abstract
We characterize the effectiveness of a classical algorithm for recovering the Markov graph of a general discrete pairwise graphical model from i.i.d. samples. The algorithm is (appropriately regularized) maximum conditional log-likelihood, which involves solving a convex program for each node; for Ising models this is ℓ1-constrained logistic regression, while for more general alphabets an ℓ2,1 group-norm constraint needs to be used. We show that this algorithm can recover any arbitrary discrete pairwise graphical model, and we characterize its sample complexity as a function of model width, alphabet size, edge parameter accuracy, and the number of variables. We show that along every one of these axes it matches or improves on all existing results and algorithms for this problem. Our analysis applies a sharp generalization error bound for logistic regression when the weight vector has an ℓ1 constraint (or ℓ2,1 constraint) and the sample vector has an ℓ∞ constraint (or ℓ2,∞ constraint). We also show that the proposed convex programs can be solved in Õ(n²) running time (where n is the number of variables) under the same statistical guarantees. We provide experimental results to support our analysis.

Undirected graphical models provide a framework for modeling high-dimensional distributions with dependent variables and have many applications, including in computer vision (Choi et al., 2010), bio-informatics (Marbach et al., 2012), and sociology (Eagle et al., 2009). In this paper we characterize the effectiveness of a natural, and already popular, algorithm for the structure learning problem. Structure learning is the task of finding the dependency graph of a Markov random field (MRF) given i.i.d. samples; typically one is also interested in estimating the edge weights. We consider the structure learning problem in general (non-binary) discrete pairwise graphical models. These are MRFs where the variables take values in a discrete alphabet, but all interactions are pairwise. This includes the Ising model as a special case (which corresponds to a binary alphabet).

The natural and popular algorithm we consider is (appropriately regularized) maximum conditional log-likelihood for finding the neighborhood set of any given node. For the Ising model, this becomes ℓ1-constrained logistic regression; more generally, for non-binary graphical models the regularizer becomes an ℓ2,1 norm. We show that this algorithm can recover all discrete pairwise graphical models, and characterize its sample complexity as a function of the parameters of interest: model width, alphabet size, edge parameter accuracy, and the number of variables. We match or improve the dependence on each of these parameters over all existing results for the general alphabet case when no additional assumptions are made on the model (see Table 1). For the specific case of Ising models, some recent work has better dependence on some parameters (see Table 2 in Appendix A).

We now describe the related work, and then outline our contributions.

Related Work
In a classic paper, Ravikumar et al. (2010) considered the structure learning problem for Ising models. They showed that ℓ1-regularized logistic regression provably recovers the correct dependency graph with a very small number of samples by solving a convex program for each variable. This algorithm was later generalized to multi-class logistic regression with group-sparse regularization, which can learn MRFs with higher-order interactions and non-binary variables (Jalali et al., 2011). A well-known limitation of (Ravikumar et al., 2010; Jalali et al., 2011) is that their theoretical guarantees only hold for a restricted class of models. Specifically, they require that the underlying model satisfies technical incoherence assumptions that are difficult to validate or check.

Greedy algorithm (Hamilton et al., 2017)
  Assumptions: 1. Alphabet size k ≥ 2; 2. Model width ≤ λ; 3. Degree ≤ d; 4. Minimum edge weight ≥ η > 0; 5. Probability of success ≥ 1 − ρ
  Sample complexity N: O( exp( k^{O(d)} exp(O(dλ)) / η^{O(1)} ) ln(nk/ρ) )

Sparsitron (Klivans and Meka, 2017)
  Assumptions: 1. Alphabet size k ≥ 2; 2. Model width ≤ λ; 3. Minimum edge weight ≥ η > 0; 4. Probability of success ≥ 1 − ρ
  Sample complexity N: O( λ²k⁵ exp(14λ) ln(nk/(ρη)) / η⁴ )

ℓ2,1-constrained logistic regression [this paper]
  Assumptions: 1. Alphabet size k ≥ 2; 2. Model width ≤ λ; 3. Minimum edge weight ≥ η > 0; 4. Probability of success ≥ 1 − ρ
  Sample complexity N: O( λ²k⁴ exp(14λ) ln(nk/ρ) / η⁴ )

Table 1: Sample complexity comparison for different graph recovery algorithms. The pairwise graphical model has alphabet size k. For k = 2 (i.e., Ising models), our algorithm reduces to ℓ1-constrained logistic regression (see Table 2 in Appendix A for related work on learning Ising models). Our sample complexity has a better dependency on the alphabet size (Õ(k⁴) versus Õ(k⁵)) than that in (Klivans and Meka, 2017). Theorem 8.4 in (Klivans and Meka, 2017) has a typo: the correct dependence should be k⁵ instead of k³. In Section 8 of (Klivans and Meka, 2017), after re-writing the conditional distribution as a sigmoid function, the weight vector w is a vector of length (n − 1)k + 1. Their derivation uses an incorrect bound ‖w‖₁ ≤ 2λ, while it should be ‖w‖₁ ≤ 2kλ. This gives rise to an additional k² factor in the final sample complexity.

Let n be the number of variables and k be the alphabet size; define the model width λ as the maximum neighborhood weight (see Definitions 1 and 2 for the precise definitions). For structure learning algorithms, a popular approach is to focus on the sub-problem of finding the neighborhood of a single node. Once this is correctly learned, the overall graph structure follows by a simple union bound. Indeed, all the papers we now discuss are of this type. As shown in Table 1, Hamilton et al. (2017) proposed a greedy algorithm to learn pairwise (and higher-order) MRFs with general alphabet. Their algorithm generalizes the approach of Bresler (2015) for learning Ising models. The sample complexity in (Hamilton et al., 2017) grows logarithmically in n, but doubly exponentially in the width λ. Note that an information-theoretic lower bound for learning Ising models (Santhanam and Wainwright, 2012) only has a single-exponential dependence on λ. Klivans and Meka (2017) provided a different algorithmic and theoretical approach by setting this up as an online learning problem and leveraging results on the Hedge algorithm. Their algorithm, Sparsitron, achieves single-exponential dependence on the width λ.

Our Contributions

• Our main result: We show that ℓ2,1-constrained logistic regression can be used to estimate the edge weights of a discrete pairwise graphical model from i.i.d. samples (see Theorem 2). For the special case of Ising models (see Theorem 1), this reduces to ℓ1-constrained logistic regression. We make no incoherence assumption on the graphical models. As shown in Table 1, our sample complexity scales as Õ(k⁴), which improves the previous best result with Õ(k⁵) dependency. The analysis applies a sharp generalization error bound for logistic regression when the weight vector has an ℓ2,1 constraint (or ℓ1 constraint) and the sample vector has an ℓ2,∞ constraint (or ℓ∞ constraint) (see Lemma 7 and Lemma 10 in Appendix B). Our key insight is that a generalization bound can be used to control the squared distance between the predicted and true logistic functions (see Lemma 1 and Lemma 2 in Section 3.2), which then implies an ℓ∞ norm bound between the weight vectors (see Lemma 5 and Lemma 6).

• We show that the proposed algorithms can run in Õ(n²) time without affecting the statistical guarantees (see Section 2.3). Note that Õ(n²) is an efficient runtime for graph recovery over n nodes.
Previous algorithms in (Hamilton et al., 2017; Klivans and Meka, 2017) also require Õ(n²) runtime for structure learning of pairwise graphical models.

• We construct examples that violate the incoherence condition proposed in (Ravikumar et al., 2010) (see Figure 1). We then run ℓ1-constrained logistic regression and show that it still recovers the underlying graph, which indicates that our conditions for graph recovery are weaker than those in (Ravikumar et al., 2010).

• We empirically compare the proposed algorithm with the Sparsitron algorithm in (Klivans and Meka, 2017) over different alphabet sizes, and show that our algorithm needs fewer samples for graph recovery (see Figure 2).

Two remarks on our main result are in order. First, it may be possible to prove a similar result for the regularized version of the optimization problem using techniques from (Negahban et al., 2012). One would need to prove that the objective function satisfies restricted strong convexity (RSC) when the samples are from a graphical model distribution (Vuffray et al., 2016; Lokhov et al., 2018); it is interesting to ask whether the proof presented in this paper is related to the RSC condition. Second, the improvement in the alphabet-size dependence essentially comes from the fact that we use an ℓ2,1 norm constraint instead of an ℓ1 norm constraint for learning general (i.e., non-binary) pairwise graphical models (see our remark after Theorem 2). The Sparsitron algorithm proposed by Klivans and Meka (2017) learns an ℓ1-constrained generalized linear model, and this ℓ1 constraint gives rise to a k⁵ dependency for learning non-binary pairwise graphical models.
Notation.
We use [n] to denote the set {1, 2, · · · , n}. For a vector x ∈ Rⁿ, we use x_i or x(i) to denote its i-th coordinate. The ℓp norm of a vector is ‖x‖_p = (∑_i |x_i|^p)^{1/p}. We use x_{−i} ∈ R^{n−1} to denote the vector obtained by deleting the i-th coordinate. For a matrix A ∈ R^{n×k}, we use A_ij or A(i, j) to denote its (i, j)-th entry, and A(i, :) ∈ R^k and A(:, j) ∈ Rⁿ to denote the i-th row vector and the j-th column vector. The ℓ_{p,q} norm of a matrix A ∈ R^{n×k} is defined as ‖A‖_{p,q} = ‖[‖A(1, :)‖_p, ..., ‖A(n, :)‖_p]‖_q. We define ‖A‖_∞ = max_{ij} |A(i, j)| throughout the paper (note that this is different from the induced matrix norm). We use σ(z) = 1/(1 + e^{−z}) to denote the sigmoid function. We use ⟨·, ·⟩ to denote the dot product between two vectors, ⟨x, y⟩ = ∑_i x_i y_i, or two matrices, ⟨A, B⟩ = ∑_{ij} A(i, j)B(i, j). A small code illustration of these norms follows.
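As a quick illustration of the matrix norms and the sigmoid function defined above, the following NumPy snippet (ours, not part of the paper) computes ‖A‖_{p,q} as the ℓq norm of the row-wise ℓp norms, the entrywise ‖A‖_∞, and σ(z).

```python
# Minimal sketch of the notation above (NumPy); names norm_pq / sigmoid are ours.
import numpy as np

def norm_pq(A, p, q):
    # ||A||_{p,q}: take the l_p norm of every row, then the l_q norm of that vector.
    row_norms = np.linalg.norm(A, ord=p, axis=1)
    return np.linalg.norm(row_norms, ord=q)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

A = np.array([[1.0, -2.0], [0.0, 3.0]])
print(norm_pq(A, 2, 1))   # ||A||_{2,1} = sqrt(5) + 3
print(np.abs(A).max())    # ||A||_inf as defined above (entrywise maximum)
print(sigmoid(0.0))       # 0.5
```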
We start with the special case of binary variables (i.e., Ising models), and then move to the general case with non-binary variables. We first give a definition of an Ising model distribution.
Definition 1.
Let A ∈ R^{n×n} be a symmetric weight matrix with A_ii = 0 for i ∈ [n]. Let θ ∈ Rⁿ be a mean-field vector. The n-variable Ising model is a distribution D(A, θ) on {−1, 1}ⁿ that satisfies

P_{Z∼D(A,θ)}[Z = z] ∝ exp( ∑_{1≤i<j≤n} A_ij z_i z_j + ∑_{i∈[n]} θ_i z_i ).   (1)

The dependency graph of D(A, θ) is the undirected graph G = ([n], E) with edge set E = {(i, j) : A_ij ≠ 0}. The width of D(A, θ) is defined as

λ(A, θ) = max_{i∈[n]} ( ∑_{j∈[n]} |A_ij| + |θ_i| ),   (2)

and the minimum edge weight is η(A, θ) = min_{(i,j)∈E} |A_ij|.

Fact 1. Let Z ∼ D(A, θ), Z ∈ {−1, 1}ⁿ. For any i ∈ [n], the conditional probability of the i-th variable Z_i ∈ {−1, 1} given the states of all other variables Z_{−i} ∈ {−1, 1}^{n−1} is

P[Z_i = 1 | Z_{−i} = x] = exp(∑_{j≠i} A_ij x_j + θ_i) / ( exp(∑_{j≠i} A_ij x_j + θ_i) + exp(−∑_{j≠i} A_ij x_j − θ_i) ) = σ(⟨w, x′⟩),   (3)

where x′ = [x, 1] ∈ {−1, 1}ⁿ and w = 2[A_{i1}, · · · , A_{i(i−1)}, A_{i(i+1)}, · · · , A_{in}, θ_i] ∈ Rⁿ. Moreover, w satisfies ‖w‖₁ ≤ 2λ(A, θ), where λ(A, θ) is the model width defined in Definition 1.

Following Fact 1, the natural approach to estimating the edge weights A_ij is to solve a logistic regression problem for each variable. For ease of notation, let us focus on the n-th variable (the algorithm applies directly to the remaining variables). Given N i.i.d. samples {z¹, · · · , z^N}, where z^i ∈ {−1, 1}ⁿ, from an Ising model D(A, θ), we first transform the samples into {(x^i, y^i)}_{i=1}^N, where x^i = [z^i_1, · · · , z^i_{n−1}, 1] ∈ {−1, 1}ⁿ and y^i = z^i_n ∈ {−1, 1}. By Fact 1, P[y^i = 1 | x^i = x] = σ(⟨w*, x⟩), where w* = 2[A_{n1}, · · · , A_{n(n−1)}, θ_n] ∈ Rⁿ satisfies ‖w*‖₁ ≤ 2λ(A, θ). Suppose that λ(A, θ) ≤ λ; we are then interested in recovering w* by solving the following ℓ1-constrained logistic regression problem:

ŵ ∈ arg min_{w∈Rⁿ} (1/N) ∑_{i=1}^N ℓ(y^i ⟨w, x^i⟩)  s.t. ‖w‖₁ ≤ 2λ,   (4)

where ℓ : R → R is the loss function

ℓ(y^i ⟨w, x^i⟩) = ln(1 + e^{−y^i⟨w,x^i⟩}) = −ln σ(⟨w, x^i⟩) if y^i = 1, and −ln(1 − σ(⟨w, x^i⟩)) if y^i = −1.   (5)

Eq. (5) is essentially the negative log-likelihood of observing y^i given x^i at the current w. Let ŵ be a minimizer of (4). It is worth noting that in the high-dimensional regime (N < n), ŵ may not be unique; in this case, we will show that any one of them works. After solving the convex problem in (4), the edge weight is estimated as Â_nj = ŵ_j/2.

The pseudocode of the above algorithm is given in Algorithm 1, and a minimal implementation sketch follows the pseudocode. Solving the ℓ1-constrained logistic regression problem gives an estimate of the true edge weights. We then form the graph by keeping every edge whose estimated weight is larger than η/2 in absolute value.

Algorithm 1: Learning an Ising model via ℓ1-constrained logistic regression
Input: N i.i.d. samples {z¹, · · · , z^N}, where z^m ∈ {−1, 1}ⁿ for m ∈ [N]; an upper bound λ on the width, λ(A, θ) ≤ λ; a lower bound η on the minimum edge weight, η(A, θ) ≥ η > 0.
Output: Â ∈ R^{n×n}, and an undirected graph Ĝ on n nodes.
for i ← 1 to n do
  ∀m ∈ [N], x^m ← [z^m_{−i}, 1], y^m ← z^m_i
  ŵ ← arg min_{w∈Rⁿ} (1/N) ∑_{m=1}^N ln(1 + e^{−y^m⟨w,x^m⟩})  s.t. ‖w‖₁ ≤ 2λ
  ∀j ∈ [n]\{i}, Â_ij ← ŵ_{j̃}/2, where j̃ = j if j < i and j̃ = j − 1 if j > i
end
Form an undirected graph Ĝ on n nodes with edges {(i, j) : |Â_ij| ≥ η/2, i < j}.
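The following is a minimal NumPy sketch of Algorithm 1 (ours, for illustration only). It solves each node's ℓ1-constrained logistic regression by projected gradient descent onto the ℓ1 ball of radius 2λ, which is a stand-in for the exact minimization analyzed in the paper (or the mirror descent variant of Section 2.3); the step size and iteration count are arbitrary choices.

```python
# Minimal sketch of Algorithm 1 (illustrative solver, not the paper's code).
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {w : ||w||_1 <= radius} (Duchi et al., 2008)."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - radius))[0][-1]
    tau = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_logistic_regression(X, y, radius, steps=2000, lr=0.1):
    """Minimize (1/N) sum log(1 + exp(-y <w, x>)) subject to ||w||_1 <= radius."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        # gradient of the average logistic loss at w
        grad = -(X * (y * (1.0 / (1.0 + np.exp(margins))))[:, None]).mean(axis=0)
        w = project_l1_ball(w - lr * grad, radius)
    return w

def learn_ising(Z, lam, eta):
    """Z: N x n matrix of +/-1 samples. Returns (A_hat, recovered edge set)."""
    N, n = Z.shape
    A_hat = np.zeros((n, n))
    for i in range(n):
        X = np.hstack([np.delete(Z, i, axis=1), np.ones((N, 1))])  # [z_{-i}, 1]
        y = Z[:, i]
        w_hat = l1_logistic_regression(X, y, radius=2.0 * lam)
        A_hat[i, np.arange(n) != i] = w_hat[:-1] / 2.0             # drop the intercept
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)
             if abs(A_hat[i, j]) >= eta / 2.0}
    return A_hat, edges
```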
Theorem 1. Let D(A, θ) be an unknown n-variable Ising model distribution with dependency graph G. Suppose that D(A, θ) has width λ(A, θ) ≤ λ. Given ρ ∈ (0, 1) and ε > 0, if the number of i.i.d. samples satisfies N = O(λ² exp(12λ) ln(n/ρ)/ε⁴), then with probability at least 1 − ρ, Algorithm 1 produces Â that satisfies

max_{i,j∈[n]} |A_ij − Â_ij| ≤ ε.   (6)

Corollary 1. In the setup of Theorem 1, suppose that the Ising model distribution D(A, θ) has minimum edge weight η(A, θ) ≥ η > 0. If we set ε < η/2 in (6), which corresponds to sample complexity N = O(λ² exp(12λ) ln(n/ρ)/η⁴), then with probability at least 1 − ρ, Algorithm 1 recovers the dependency graph, i.e., Ĝ = G.

Definition 2. Let k be the alphabet size. Let W = {W_ij ∈ R^{k×k} : i ≠ j ∈ [n]} be a set of weight matrices satisfying W_ij = W_ji^T. Without loss of generality, we assume that every row (and column) vector of W_ij has zero mean. Let Θ = {θ_i ∈ R^k : i ∈ [n]} be a set of external field vectors. Then the n-variable pairwise graphical model D(W, Θ) is a distribution over [k]ⁿ where

P_{Z∼D(W,Θ)}[Z = z] ∝ exp( ∑_{1≤i<j≤n} W_ij(z_i, z_j) + ∑_{i∈[n]} θ_i(z_i) ).   (7)

The dependency graph of D(W, Θ) is the undirected graph G = ([n], E) with edge set E = {(i, j) : W_ij ≠ 0}. The width of D(W, Θ) is defined as

λ(W, Θ) = max_{i∈[n], a∈[k]} ( ∑_{j≠i} max_{b∈[k]} |W_ij(a, b)| + |θ_i(a)| ),   (8)

and the minimum edge weight is η(W, Θ) = min_{(i,j)∈E} max_{a,b∈[k]} |W_ij(a, b)|.

The assumption that W_ij has centered rows and columns (i.e., ∑_b W_ij(a, b) = 0 and ∑_a W_ij(a, b) = 0 for any a, b ∈ [k]) is without loss of generality (see Fact 8.2 in (Klivans and Meka, 2017)). If the a-th row of W_ij is not centered, i.e., ∑_b W_ij(a, b) ≠ 0, we can define W′_ij(a, b) = W_ij(a, b) − ∑_b W_ij(a, b)/k and θ′_i(a) = θ_i(a) + ∑_b W_ij(a, b)/k, and notice that D(W, Θ) = D(W′, Θ′). Because the sets of matrices with centered rows and columns (i.e., {M ∈ R^{k×k} : ∑_b M(a, b) = 0, ∀a ∈ [k]} and {M ∈ R^{k×k} : ∑_a M(a, b) = 0, ∀b ∈ [k]}) are two linear subspaces, alternately projecting W_ij onto the two sets converges to their intersection (Von Neumann, 1949). As a result, the condition of centered rows and columns is necessary for recovering the underlying weight matrices, since otherwise different parameters can give the same distribution. Note that for k = 2, Definition 2 coincides with Definition 1 for Ising models: simply define W_ij ∈ R^{2×2} by W_ij(1, 1) = W_ij(2, 2) = A_ij and W_ij(1, 2) = W_ij(2, 1) = −A_ij.

For a pairwise graphical model distribution D(W, Θ), the conditional distribution of any variable (when restricted to a pair of values) given all the other variables follows a logistic function, as shown in Fact 2. This is analogous to Fact 1 for the Ising model distribution.

Fact 2. Let Z ∼ D(W, Θ), Z ∈ [k]ⁿ. For any i ∈ [n], any α ≠ β ∈ [k], and any x ∈ [k]^{n−1},

P[Z_i = α | Z_i ∈ {α, β}, Z_{−i} = x] = σ( ∑_{j≠i} (W_ij(α, x_j) − W_ij(β, x_j)) + θ_i(α) − θ_i(β) ).   (9)

Given N i.i.d. samples {z¹, · · · , z^N}, where z^m ∈ [k]ⁿ and z^m ∼ D(W, Θ) for m ∈ [N], the goal is to estimate the matrices W_ij for all i ≠ j ∈ [n]. For ease of notation and without loss of generality, let us consider the n-th variable; the goal is then to estimate W_nj for all j ∈ [n − 1]. To use Fact 2, fix a pair of values α ≠ β ∈ [k], and let S be the set of samples satisfying z_n ∈ {α, β}. We transform the samples in S to {(x^t, y^t)}_{t=1}^{|S|} as follows: x^t = OneHotEncode([z^t_{−n}, 1]) ∈ {0, 1}^{n×k}, y^t = 1 if z^t_n = α, and y^t = −1 if z^t_n = β.
Here OneHotEncode(·) : [k]ⁿ → {0, 1}^{n×k} is the function that maps each value t ∈ [k] to the standard basis vector e_t ∈ {0, 1}^k, where e_t has a single 1 at the t-th entry. For each sample (x, y) in the set S, Fact 2 implies that P[y = 1 | x] = σ(⟨w*, x⟩), where w* ∈ R^{n×k} satisfies

w*(j, :) = W_nj(α, :) − W_nj(β, :), ∀j ∈ [n − 1];  w*(n, :) = [θ_n(α) − θ_n(β), 0, ..., 0].   (10)

Suppose that the width of D(W, Θ) satisfies λ(W, Θ) ≤ λ; then w* defined in (10) satisfies ‖w*‖_{2,1} ≤ 2λ√k, where ‖w*‖_{2,1} := ∑_j ‖w*(j, :)‖₂. We can now form an ℓ2,1-constrained logistic regression over the samples in S:

w^{α,β} ∈ arg min_{w∈R^{n×k}} (1/|S|) ∑_{t=1}^{|S|} ln(1 + e^{−y^t⟨w,x^t⟩})  s.t. ‖w‖_{2,1} ≤ 2λ√k.   (11)

Let w^{α,β} be a minimizer of (11). Without loss of generality, we can assume that the first n − 1 rows of w^{α,β} are centered, i.e., ∑_a w^{α,β}(j, a) = 0 for j ∈ [n − 1]. Otherwise, we can always define a new matrix U^{α,β} ∈ R^{n×k} by centering the first n − 1 rows of w^{α,β}:

U^{α,β}(j, b) = w^{α,β}(j, b) − (1/k) ∑_{a∈[k]} w^{α,β}(j, a), ∀j ∈ [n − 1], ∀b ∈ [k];   (12)
U^{α,β}(n, b) = w^{α,β}(n, b) + (1/k) ∑_{j∈[n−1], a∈[k]} w^{α,β}(j, a), ∀b ∈ [k].

Since each row of the matrix x in (11) is a standard basis vector (i.e., all zeros except a single one), ⟨U^{α,β}, x⟩ = ⟨w^{α,β}, x⟩, which implies that U^{α,β} is also a minimizer of (11).

The key step in our proof is to show that given enough samples, the obtained matrix U^{α,β} ∈ R^{n×k} is close to the w* defined in (10). Specifically, we will prove that

|W_nj(α, b) − W_nj(β, b) − U^{α,β}(j, b)| ≤ ε, ∀j ∈ [n − 1], ∀α, β, b ∈ [k].   (13)

Recall that our goal is to estimate the original matrices W_nj for all j ∈ [n − 1]. Summing (13) over β ∈ [k] (with the convention U^{α,α} = 0) and using the fact that ∑_β W_nj(β, b) = 0 gives

|W_nj(α, b) − (1/k) ∑_{β∈[k]} U^{α,β}(j, b)| ≤ ε, ∀j ∈ [n − 1], ∀α, b ∈ [k].   (14)

In other words, Ŵ_nj(α, :) = (1/k) ∑_{β∈[k]} U^{α,β}(j, :) is a good estimate of W_nj(α, :).

Suppose that η(W, Θ) ≥ η. Once we obtain the estimates Ŵ_ij, the last step is to form a graph by keeping every edge (i, j) that satisfies max_{a,b} |Ŵ_ij(a, b)| ≥ η/2. The pseudocode of the above algorithm is given in Algorithm 2, and a sketch of the per-pair data preparation follows the pseudocode.

Theorem 2. Let D(W, Θ) be an n-variable pairwise graphical model distribution with width λ(W, Θ) ≤ λ. Given ρ ∈ (0, 1) and ε > 0, if the number of i.i.d. samples satisfies N = O(λ²k⁴ exp(14λ) ln(nk/ρ)/ε⁴), then with probability at least 1 − ρ, Algorithm 2 produces Ŵ_ij ∈ R^{k×k} that satisfies

|W_ij(a, b) − Ŵ_ij(a, b)| ≤ ε, ∀i ≠ j ∈ [n], ∀a, b ∈ [k].   (15)

Algorithm 2: Learning a pairwise graphical model via ℓ2,1-constrained logistic regression
Input: alphabet size k; N i.i.d. samples {z¹, · · · , z^N}, where z^m ∈ [k]ⁿ for m ∈ [N]; an upper bound λ on the width, λ(W, Θ) ≤ λ; a lower bound η on the minimum edge weight, η(W, Θ) ≥ η > 0.
Output: Ŵ_ij ∈ R^{k×k} for all i ≠ j ∈ [n]; an undirected graph Ĝ on n nodes.
for i ← 1 to n do
  for each pair α ≠ β ∈ [k] do
    S ← {z^m, m ∈ [N] : z^m_i ∈ {α, β}}
    ∀z^t ∈ S, x^t ← OneHotEncode([z^t_{−i}, 1]), y^t ← 1 if z^t_i = α; y^t ← −1 if z^t_i = β
    w^{α,β} ← arg min_{w∈R^{n×k}} (1/|S|) ∑_{t=1}^{|S|} ln(1 + e^{−y^t⟨w,x^t⟩})  s.t. ‖w‖_{2,1} ≤ 2λ√k
    Define U^{α,β} ∈ R^{n×k} by centering the first n − 1 rows of w^{α,β} (see (12)).
  end
  for j ∈ [n]\{i} and α ∈ [k] do
    Ŵ_ij(α, :) ← (1/k) ∑_{β∈[k]} U^{α,β}(j̃, :), where j̃ = j if j < i and j̃ = j − 1 if j > i.
  end
end
Form graph Ĝ on n nodes with edges {(i, j) : max_{a,b} |Ŵ_ij(a, b)| ≥ η/2, i < j}.
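To make the sample transformation in Algorithm 2 concrete, here is a minimal NumPy sketch (ours, not the paper's code) of the per-pair data preparation for one node i and one pair (α, β): it keeps the samples with z_i ∈ {α, β}, one-hot encodes [z_{−i}, 1], and assigns labels y = ±1. The helper names one_hot and prepare_pair_dataset are ours.

```python
# Minimal sketch of the data preparation inside Algorithm 2 (illustration only).
import numpy as np

def one_hot(v, k):
    """Map a vector with entries in {1, ..., k} to a len(v) x k 0/1 matrix of basis rows."""
    E = np.zeros((len(v), k))
    E[np.arange(len(v)), np.asarray(v) - 1] = 1.0
    return E

def prepare_pair_dataset(Z, i, alpha, beta, k):
    """Z: N x n array over [k]. Returns a list of (x, y) with x in {0,1}^{n x k}."""
    samples = []
    for z in Z:
        if z[i] not in (alpha, beta):
            continue
        z_rest = np.concatenate([np.delete(z, i), [1]])  # [z_{-i}, 1]; the constant 1
        x = one_hot(z_rest, k)                           # encodes to the basis vector e_1
        y = 1 if z[i] == alpha else -1
        samples.append((x, y))
    return samples
```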
Corollary 2. In the setup of Theorem 2, suppose that the pairwise graphical model distribution D(W, Θ) satisfies η(W, Θ) ≥ η > 0. If we set ε < η/2 in (15), which corresponds to sample complexity N = O(λ²k⁴ exp(14λ) ln(nk/ρ)/η⁴), then with probability at least 1 − ρ, Algorithm 2 recovers the dependency graph, i.e., Ĝ = G.

Remark (ℓ2,1 versus ℓ1 constraint). The matrix w* ∈ R^{n×k} defined in (10) satisfies ‖w*‖_{∞,1} ≤ 2λ(W, Θ). This implies that ‖w*‖_{2,1} ≤ 2λ(W, Θ)√k and ‖w*‖_{1,1} ≤ 2λ(W, Θ)k. Instead of solving the ℓ2,1-constrained logistic regression defined in (11), we could solve an ℓ1-constrained logistic regression with ‖w‖_{1,1} ≤ 2λ(W, Θ)k. However, this would lead to a sample complexity that scales as Õ(k⁵), which is worse than the Õ(k⁴) sample complexity achieved by the ℓ2,1-constrained logistic regression. The reason why we use the ℓ2,1 constraint instead of the tighter ℓ∞,1 constraint is that our proof relies on a sharp generalization bound for ℓ2,1-constrained logistic regression (see Lemma 10 in the appendix); it is unclear whether a similar generalization bound exists for the ℓ∞,1 constraint.

Remark (lower bound on the alphabet size). A simple lower bound is Ω(k²). To see why, consider a graph with two nodes (i.e., n = 2). Let W be a k-by-k weight matrix between the two nodes, defined as follows: W(1, 1) = W(2, 2) = 1, W(1, 2) = W(2, 1) = −1, and W(i, j) = 0 otherwise. This definition satisfies the condition that every row and column is centered (Definition 2). Besides, we have λ = 1 and η = 1, so these two quantities do not scale with k. To distinguish W from the zero matrix, we need to observe samples in the set {(1, 1), (2, 2), (1, 2), (2, 1)}. This requires Ω(k²) samples because any specific sample (a, b) (where a ∈ [k] and b ∈ [k]) shows up with probability approximately 1/k².

Solving the convex programs in Õ(n²) time (Section 2.3)

Our results so far assume that the ℓ1-constrained logistic regression (in Algorithm 1) and the ℓ2,1-constrained logistic regression (in Algorithm 2) are solved exactly. This would require a higher-order polynomial complexity in n if, e.g., an interior-point based method is used (Koh et al., 2007). The goal of this section is to reduce the runtime to Õ(n²) via a first-order optimization method. Note that Õ(n²) is an efficient time complexity for graph recovery over n nodes. Previous structure learning algorithms for Ising models require either Õ(n²) complexity (e.g., (Bresler, 2015; Klivans and Meka, 2017)) or a worse complexity (e.g., (Ravikumar et al., 2010; Vuffray et al., 2016)). We would like to remark that our goal here is not to give the fastest first-order optimization algorithm (see our remark after Theorem 4).
Instead, our contribution is to provably show that it is possible to run Algorithm 1 and Algorithm 2 in Õ(n²) time without affecting the original statistical guarantees.

To better exploit the problem structure, we use the mirror descent algorithm with a properly chosen distance generating function (a.k.a. the mirror map). Following the standard mirror descent setup, we use negative entropy as the mirror map for the ℓ1-constrained logistic regression and a scaled group norm for the ℓ2,1-constrained logistic regression (see Sections 5.3.3.2 and 5.3.3.3 in (Ben-Tal and Nemirovski, 2013) for more details); the pseudocode is given in Appendix H. The loss functions are well suited to this setup: for the ℓ1-constrained logistic regression defined in (4), since the input samples satisfy ‖x‖_∞ = 1, the loss is O(1)-Lipschitz w.r.t. ‖·‖₁; similarly, for the ℓ2,1-constrained logistic regression defined in (11), the loss is O(1)-Lipschitz w.r.t. ‖·‖_{2,1} because the input samples satisfy ‖x‖_{2,∞} = 1. The main advantage of mirror descent is that its convergence rate scales logarithmically in the dimension (see Lemma 12 in Appendix I). Specifically, let w̄ be the output after O(ln(n)/γ²) mirror descent iterations; then w̄ satisfies

L̂(w̄) − L̂(ŵ) ≤ γ,   (16)

where L̂(w) = (1/N) ∑_{i=1}^N ln(1 + e^{−y^i⟨w,x^i⟩}) is the empirical logistic loss and ŵ is the actual minimizer of L̂(w). Since each mirror descent update requires O(nN) time, where the number of samples N scales as O(ln(n)), and we have to solve n regression problems (one for each variable in [n]), the total runtime scales as Õ(n²), which is our desired runtime. Other approaches include standard projected gradient descent and coordinate descent; their convergence rates depend on either the smoothness or the Lipschitz constant (w.r.t. ‖·‖₂) of the objective function (Bubeck, 2015), which leads to a worse total runtime for our problem setting. Another option would be the composite gradient descent method, whose analysis relies on the restricted strong convexity of the objective function (Agarwal et al., 2010). For other variants of mirror descent algorithms, see the remark after Theorem 4.

There is still one problem left: we have to show that ‖w̄ − w*‖_∞ ≤ ε (where w* is the minimizer of the true loss L(w) = E_{(x,y)∼D} ln(1 + e^{−y⟨w,x⟩})) in order to conclude that Theorems 1 and 2 still hold when mirror descent is used. Since L̂(w) is not strongly convex, (16) alone does not necessarily imply that ‖w̄ − ŵ‖_∞ is small. Our key insight is that in the proofs of Theorems 1 and 2, the definition of ŵ (as a minimizer of L̂(w)) is only used to show that L̂(ŵ) ≤ L̂(w*) (see inequality (b) of (28) in Appendix B). It is then possible to replace this step with (16) in the original proof, and show that Theorems 1 and 2 still hold as long as γ is small enough (see (60) in Appendix I).

Our key results in this section are Theorem 3 and Theorem 4, which show that Algorithm 1 and Algorithm 2 can run in Õ(n²) time without affecting the original statistical guarantees. A minimal sketch of a mirror descent update for the ℓ1-constrained problem is given below.
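The following is a minimal sketch (ours) of one standard way to realize mirror descent with a negative-entropy mirror map for the ℓ1-constrained logistic loss: write w = u⁺ − u⁻ with (u⁺, u⁻) on a simplex scaled to radius 2λ and perform multiplicative-weights updates. The paper's Algorithm 3 in Appendix H may differ in details such as the step size schedule and the form of averaging; the step size lr below is a free parameter.

```python
# Minimal sketch of entropic mirror descent (exponentiated gradient) on the l1 ball.
import numpy as np

def eg_l1_logistic(X, y, radius, steps=500, lr=0.1):
    """Approximately minimize (1/N) sum log(1+exp(-y<w,x>)) s.t. ||w||_1 <= radius."""
    N, d = X.shape
    u = np.full(2 * d, radius / (2 * d))     # point on the simplex scaled to `radius`
    avg_w = np.zeros(d)
    for _ in range(steps):
        w = u[:d] - u[d:]                    # current iterate w = u_plus - u_minus
        margins = y * (X @ w)
        g = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        g2 = np.concatenate([g, -g])         # gradient in the doubled coordinates
        u = u * np.exp(-lr * g2)             # multiplicative (entropic) update
        u *= radius / u.sum()                # renormalize back onto the scaled simplex
        avg_w += w
    return avg_w / steps                     # averaged iterate
```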
Theorem 3. In the setup of Theorem 1, suppose that the ℓ1-constrained logistic regression in Algorithm 1 is optimized by the mirror descent method (Algorithm 3) given in Appendix H. Given ρ ∈ (0, 1) and ε > 0, if the number of mirror descent iterations satisfies T = O(λ² exp(12λ) ln(n)/ε⁴), and the number of samples satisfies N = O(λ² exp(12λ) ln(n/ρ)/ε⁴), then (6) still holds with probability at least 1 − ρ. The total time complexity of Algorithm 1 is O(TNn²).

Theorem 4. In the setup of Theorem 2, suppose that the ℓ2,1-constrained logistic regression in Algorithm 2 is optimized by the mirror descent method (Algorithm 4) given in Appendix H. Given ρ ∈ (0, 1) and ε > 0, if the number of mirror descent iterations satisfies T = O(λ²k³ exp(12λ) ln(n)/ε⁴), and the number of samples satisfies N = O(λ²k⁴ exp(14λ) ln(nk/ρ)/ε⁴), then (15) still holds with probability at least 1 − ρ. The total time complexity of Algorithm 2 is O(TNn²k³).

Remark. It is possible to improve the time complexity given in Theorems 3 and 4 (especially the dependence on ε and λ) by using stochastic or accelerated versions of mirror descent instead of the batch version given in Appendix H. In fact, the Sparsitron algorithm proposed by Klivans and Meka (2017) can be seen as an online mirror descent algorithm for optimizing the ℓ1-constrained logistic regression (see Algorithm 3 in Appendix H). Furthermore, Algorithms 1 and 2 can be parallelized, as every node solves an independent regression problem.

Proof outline. We give a proof outline for Theorem 1; the proof of Theorem 2 follows a similar outline. Let D be a distribution over {−1, 1}ⁿ × {−1, 1}, where (x, y) ∼ D satisfies P[y = 1 | x] = σ(⟨w*, x⟩). Let L(w) = E_{(x,y)∼D} ln(1 + e^{−y⟨w,x⟩}) and L̂(w) = (1/N) ∑_{i=1}^N ln(1 + e^{−y^i⟨w,x^i⟩}) be the expected and empirical logistic losses. Suppose ‖w*‖₁ ≤ 2λ, and let ŵ ∈ arg min_w L̂(w) s.t. ‖w‖₁ ≤ 2λ. Our goal is to prove that ‖ŵ − w*‖_∞ is small when the samples are constructed from an Ising model distribution. Our proof can be summarized in three steps:

1. If the number of samples satisfies N = O(λ² ln(n/ρ)/γ²), then L(ŵ) − L(w*) ≤ O(γ). This is obtained using a sharp generalization bound for the case ‖w‖₁ ≤ 2λ and ‖x‖_∞ ≤ 1 (see Lemma 7 in Appendix B).

2. For any w, we show that L(w) − L(w*) ≥ 2 E_x[(σ(⟨w, x⟩) − σ(⟨w*, x⟩))²] (see Lemma 9 and Lemma 8 in Appendix B). Hence, Step 1 implies that E_x[(σ(⟨ŵ, x⟩) − σ(⟨w*, x⟩))²] ≤ O(γ) (see Lemma 1 in the next subsection).

3. We now use a result from (Klivans and Meka, 2017) (see Lemma 5 in the next subsection), which says that if the samples are from an Ising model and if γ = O(ε² exp(−6λ)), then E_x[(σ(⟨ŵ, x⟩) − σ(⟨w*, x⟩))²] ≤ O(γ) implies ‖ŵ − w*‖_∞ ≤ ε. The required number of samples is N = O(λ² ln(n/ρ)/γ²) = O(λ² exp(12λ) ln(n/ρ)/ε⁴).

For the general setting with a non-binary alphabet (i.e., Theorem 2), the proof is similar to that of Theorem 1. The main difference is that we need a sharp generalization bound for the case ‖w‖_{2,1} ≤ 2λ√k and ‖x‖_{2,∞} ≤ 1 (see Lemma 10 in Appendix B). This gives us Lemma 2 (instead of Lemma 1 for Ising models).
The last step is to use Lemma 6 to bound the ℓ∞ distance between the two weight matrices.

3.2 Supporting lemmas

Lemma 1 and Lemma 2 are the key results in our proof. They essentially say that, given enough samples, solving the corresponding constrained logistic regression problem yields a prediction σ(⟨ŵ, x⟩) that is close to the true σ(⟨w*, x⟩) in expected squared distance.

Lemma 1. Let D be a distribution on {−1, 1}ⁿ × {−1, 1} where for (X, Y) ∼ D, P[Y = 1 | X = x] = σ(⟨w*, x⟩). We assume that ‖w*‖₁ ≤ 2λ for a known λ ≥ 0. Given N i.i.d. samples {(x^i, y^i)}_{i=1}^N, let ŵ be any minimizer of the following ℓ1-constrained logistic regression problem:

ŵ ∈ arg min_{w∈Rⁿ} (1/N) ∑_{i=1}^N ln(1 + e^{−y^i⟨w,x^i⟩})  s.t. ‖w‖₁ ≤ 2λ.   (17)

Given ρ ∈ (0, 1) and ε > 0, if the number of samples satisfies N = O(λ² ln(n/ρ)/ε²), then with probability at least 1 − ρ over the samples, E_{(x,y)∼D}[(σ(⟨w*, x⟩) − σ(⟨ŵ, x⟩))²] ≤ ε.

Lemma 2. Let D be a distribution on X × {−1, 1}, where X = {x ∈ {0, 1}^{n×k} : ‖x‖_{2,∞} ≤ 1}. Furthermore, (X, Y) ∼ D satisfies P[Y = 1 | X = x] = σ(⟨w*, x⟩), where w* ∈ R^{n×k}. We assume that ‖w*‖_{2,1} ≤ 2λ√k for a known λ ≥ 0. Given N i.i.d. samples {(x^i, y^i)}_{i=1}^N from D, let ŵ be any minimizer of the following ℓ2,1-constrained logistic regression problem:

ŵ ∈ arg min_{w∈R^{n×k}} (1/N) ∑_{i=1}^N ln(1 + e^{−y^i⟨w,x^i⟩})  s.t. ‖w‖_{2,1} ≤ 2λ√k.   (18)

Given ρ ∈ (0, 1) and ε > 0, if the number of samples satisfies N = O(λ²k ln(n/ρ)/ε²), then with probability at least 1 − ρ over the samples, E_{(x,y)∼D}[(σ(⟨w*, x⟩) − σ(⟨ŵ, x⟩))²] ≤ ε.

The proofs of Lemma 1 and Lemma 2 are given in Appendix B. Note that in the setup of both lemmas, we form a pair of dual norms for x and w: ‖x‖_{2,∞} and ‖w‖_{2,1} in Lemma 2, and ‖x‖_∞ and ‖w‖₁ in Lemma 1. This duality allows us to use a sharp generalization bound whose sample complexity scales logarithmically in the dimension (see Lemma 7 and Lemma 10 in Appendix B).

Definition 3 defines a δ-unbiased distribution; this notion of δ-unbiasedness was proposed by Klivans and Meka (2017).

Definition 3. Let S be the alphabet set, e.g., S = {−1, 1} for Ising models and S = [k] for an alphabet of size k. A distribution D on Sⁿ is δ-unbiased if for X ∼ D, any i ∈ [n], and any assignment x ∈ S^{n−1} to X_{−i}, min_{α∈S} P[X_i = α | X_{−i} = x] ≥ δ.

For a δ-unbiased distribution, any of its marginal distributions is also δ-unbiased (see Lemma 3).

Lemma 3. Let D be a δ-unbiased distribution on Sⁿ, where S is the alphabet set. For X ∼ D and any i ∈ [n], the distribution of X_{−i} is also δ-unbiased.

Lemma 4 describes the δ-unbiased property of graphical models; this property has been used in previous papers (e.g., (Klivans and Meka, 2017; Bresler, 2015)).

Lemma 4. Let D(W, Θ) be a pairwise graphical model distribution with alphabet size k and width λ(W, Θ). Then D(W, Θ) is δ-unbiased with δ = e^{−2λ(W,Θ)}/k. Specifically, an Ising model distribution D(A, θ) is e^{−2λ(A,θ)}/2-unbiased.
In Lemma 1 and Lemma 2, we give a sample complexity bound for achieving a small ℓ2 error between σ(⟨ŵ, x⟩) and σ(⟨w*, x⟩). The following two lemmas show that if the sample distribution is δ-unbiased, then a small ℓ2 error implies a small ℓ∞ distance between ŵ and w*.

Lemma 5. Let D be a δ-unbiased distribution on {−1, 1}ⁿ. Suppose that for two vectors u, w ∈ Rⁿ and θ′, θ″ ∈ R, E_{X∼D}[(σ(⟨w, X⟩ + θ′) − σ(⟨u, X⟩ + θ″))²] ≤ ε, where ε < δ·e^{−2‖w‖₁ − 2|θ′| − 6}. Then ‖w − u‖_∞ ≤ O(1) · e^{‖w‖₁ + |θ′|} · √(ε/δ).

Lemma 6. Let D be a δ-unbiased distribution on [k]ⁿ. For X ∼ D, let X̃ ∈ {0, 1}^{n×k} be the one-hot encoding of X. Let u, w ∈ R^{n×k} be two matrices satisfying ∑_a u(i, a) = 0 and ∑_a w(i, a) = 0 for i ∈ [n]. Suppose that for some u, w and θ′, θ″ ∈ R, we have E_{X∼D}[(σ(⟨w, X̃⟩ + θ′) − σ(⟨u, X̃⟩ + θ″))²] ≤ ε, where ε < δ·e^{−2‖w‖_{∞,1} − 2|θ′| − 6}. Then ‖w − u‖_∞ ≤ O(1) · e^{‖w‖_{∞,1} + |θ′|} · √(ε/δ). (Recall that for a matrix w, ‖w‖_∞ = max_{ij} |w(i, j)|, which is different from the induced matrix norm.)

The proofs of Lemma 5 and Lemma 6 can be found in (Klivans and Meka, 2017) (see Claim 8.6 and Lemma 4.3 in their paper). We give a slightly different proof of these two lemmas in Appendix E.

We now provide proof sketches for Theorem 1 and Theorem 2 using the supporting lemmas; the detailed proofs can be found in Appendices F and G.

Proof sketch of Theorem 1. Without loss of generality, let us consider the n-th variable. Let Z ∼ D(A, θ), and X = [Z_{−n}, 1] = [Z₁, Z₂, · · · , Z_{n−1}, 1] ∈ {−1, 1}ⁿ. By Fact 1 and Lemma 1, if N = O(λ² ln(n/ρ)/γ²), then E_X[(σ(⟨w*, X⟩) − σ(⟨ŵ, X⟩))²] ≤ γ with probability at least 1 − ρ/n. By Lemma 4 and Lemma 3, Z_{−n} is δ-unbiased with δ = e^{−2λ}/2. We can then apply Lemma 5 to show that if N = O(λ² exp(12λ) ln(n/ρ)/ε⁴), then max_{j∈[n]} |A_nj − Â_nj| ≤ ε with probability at least 1 − ρ/n. Theorem 1 then follows by a union bound over all n variables.

Proof sketch of Theorem 2. Let us again consider the n-th variable, since the proof is the same for all other variables. As described before, the key step is to show that (13) holds. Fix a pair α ≠ β ∈ [k], and let N_{α,β} be the number of samples in which the n-th variable is either α or β. By Fact 2 and Lemma 2, if N_{α,β} = O(λ²k ln(n/ρ′)/γ²), then with probability at least 1 − ρ′, the matrix U^{α,β} ∈ R^{n×k} satisfies E_x[(σ(⟨w*, x⟩) − σ(⟨U^{α,β}, x⟩))²] ≤ γ, where w* ∈ R^{n×k} is defined in (10). By Lemma 6 and Lemma 4, if N_{α,β} = O(λ²k³ exp(12λ) ln(n/ρ′)/ε⁴), then with probability at least 1 − ρ′, |W_nj(α, b) − W_nj(β, b) − U^{α,β}(j, b)| ≤ ε for all j ∈ [n − 1] and all b ∈ [k]. Since D(W, Θ) is δ-unbiased with δ = e^{−2λ}/k, in order to have N_{α,β} samples for a given (α, β) pair, the total number of samples needs to satisfy N = O(N_{α,β}/δ). Theorem 2 then follows by setting ρ′ = ρ/(nk²) and taking a union bound over all (α, β) pairs and all n variables.

Experiments
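The simulations below draw samples by exactly computing the model distribution. For small n this can be done by enumerating all 2ⁿ states; the following is a minimal sketch of such exact sampling (ours, not the paper's code), where passing theta = 0 corresponds to the zero external field used in the experiments.

```python
# Minimal sketch of exact sampling from a small Ising model D(A, theta) by enumeration.
import itertools
import numpy as np

def exact_ising_samples(A, theta, num_samples, seed=0):
    rng = np.random.default_rng(seed)
    n = len(theta)
    states = np.array(list(itertools.product([-1, 1], repeat=n)))  # all 2^n states
    # unnormalized log-probability: sum_{i<j} A_ij z_i z_j + sum_i theta_i z_i
    logp = 0.5 * np.einsum('si,ij,sj->s', states, A, states) + states @ theta
    p = np.exp(logp - logp.max())
    p /= p.sum()
    idx = rng.choice(len(states), size=num_samples, p=p)
    return states[idx]
```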
In both simulations below, we assume that the external field is zero. Sampling is done by exactly computing the distribution, as in the sketch above.

Learning Ising models. In Figure 1 we construct a diamond-shaped graph and show that the incoherence value at Node 1 becomes bigger than 1 (and hence violates the incoherence condition of (Ravikumar et al., 2010)) as we increase the number of nodes n and the edge weight a. We then perform 100 runs of Algorithm 1 and plot the fraction of runs that exactly recover the underlying graph structure; in each run we generate a fresh set of samples. The result shown in Figure 1 is consistent with our analysis and also indicates that our conditions for graph recovery are weaker than those in (Ravikumar et al., 2010).

Figure 1: Left: The graph structure used in this simulation. It has n nodes and 2(n − 2) edges, and every edge has the same weight a > 0. Middle: Incoherence value at Node 1, plotted against the number of nodes n for a = 0.15, 0.20, 0.25. Right: Probability of successful graph recovery over 100 runs of Algorithm 1 versus the number of samples N, for the diamond graph with a fixed edge weight a and n = 6, 8, 10, 12, 14.

Learning general pairwise graphical models. We compare our algorithm (Algorithm 2) with the Sparsitron algorithm of (Klivans and Meka, 2017) on a two-dimensional 3-by-3 grid (shown in Figure 2). We experiment with two alphabet sizes, k = 4 and k = 6. For each value of k, we simulate both algorithms for 100 runs, and in each run we generate random W_ij matrices with random-sign entries of equal magnitude. As shown in Figure 2, our algorithm requires fewer samples for successfully recovering the graphs. More details about this experiment can be found in Appendix J.

Figure 2: Left: A two-dimensional 3-by-3 grid graph used in the simulation. Middle and right: Probability of successful graph recovery versus the number of samples N for alphabet sizes k = 4 and k = 6; our algorithm needs fewer samples than the Sparsitron algorithm for graph recovery.

Conclusion

We have shown that ℓ2,1-constrained logistic regression can recover the Markov graph of any discrete pairwise graphical model from i.i.d. samples. For the Ising model, it reduces to ℓ1-constrained logistic regression. This algorithm has a better sample complexity than the previous state-of-the-art result (k⁴ versus k⁵), and can run in Õ(n²) time. One interesting direction for future work is to see whether the dependency on the minimum edge weight η in the sample complexity can be improved. Another interesting direction is to consider MRFs with higher-order interactions.

References

Agarwal, A., Negahban, S., and Wainwright, M. J. (2010). Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37–45.
Aurell, E. and Ekeberg, M. (2012). Inverse ising inference using all the data. Physical review letters, 108(9):090201.
Banerjee, O., Ghaoui, L. E., and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. Journal of Machine Learning Research, 9(Mar):485–516.
Bartlett, P. L. and Mendelson, S. (2002). Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482.
Ben-Tal, A. and Nemirovski, A. (Fall 2013). Lectures on modern convex optimization.
Bento, J. and Montanari, A. (2009). Which graphical models are difficult to learn? In Advances in Neural Information Processing Systems, pages 1303–1311.
Bresler, G. (2015). Efficiently learning ising models on arbitrary graphs.
In Proceedings ofthe forty-seventh annual ACM symposium on Theory of computing (STOC) , pages 771–782.ACM.Bubeck, S. (2015). Convex optimization: Algorithms and complexity. Foundations and Trends R (cid:13) in Machine Learning , 8(3-4):231–357.Choi, M. J., Lim, J. J., Torralba, A., and Willsky, A. S. (2010). Exploiting hierarchical contexton a large database of object categories. In Computer vision and pattern recognition (CVPR),2010 IEEE conference on , pages 129–136. IEEE.Eagle, N., Pentland, A. S., and Lazer, D. (2009). Inferring friendship network structure by usingmobile phone data. Proceedings of the national academy of sciences , 106(36):15274–15278.Hamilton, L., Koehler, F., and Moitra, A. (2017). Information theoretic properties of markovrandom fields, and their algorithmic applications. In Advances in Neural InformationProcessing Systems , pages 2463–2472.Jalali, A., Ravikumar, P., Vasuki, V., and Sanghavi, S. (2011). On learning discrete graphicalmodels using group-sparse regularization. In Proceedings of the Fourteenth InternationalConference on Artificial Intelligence and Statistics , pages 378–387.Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. (2012). Regularization techniques forlearning with matrices. Journal of Machine Learning Research , 13(Jun):1865–1890.Kakade, S. M., Sridharan, K., and Tewari, A. (2009). On the complexity of linear prediction:Risk bounds, margin bounds, and regularization. In Advances in neural information processingsystems , pages 793–800. 14livans, A. R. and Meka, R. (2017). Learning graphical models using multiplicative weights.In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science(FOCS) , pages 343–354. IEEE.Koh, K., Kim, S.-J., and Boyd, S. (2007). An interior-point method for large-scale (cid:96) -regularizedlogistic regression. Journal of Machine learning research , 8(Jul):1519–1555.Lee, S.-I., Ganapathi, V., and Koller, D. (2007). Efficient structure learning of markov networksusing l _ -regularization. In Advances in neural Information processing systems , pages817–824.Lokhov, A. Y., Vuffray, M., Misra, S., and Chertkov, M. (2018). Optimal structure andparameter learning of ising models. Science advances , 4(3):e1700791.Marbach, D., Costello, J. C., Küffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., Allison,K. R., Consortium, T. D., Kellis, M., Collins, J. J., and Stolovitzky, G. (2012). Wisdom ofcrowds for robust gene network inference. Nature methods , 9(8):796.Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A unified frameworkfor high-dimensional analysis of m -estimators with decomposable regularizers. StatisticalScience , 27(4):538–557.Ravikumar, P., Wainwright, M. J., and Lafferty, J. D. (2010). High-dimensional ising modelselection using (cid:96) -regularized logistic regression. The Annals of Statistics , 38(3):1287–1319.Rigollet, P. and Hütter, J.-C. (Spring 2017). Lectures notes on high dimensional statistics. .Santhanam, N. P. and Wainwright, M. J. (2012). Information-theoretic limits of selectingbinary graphical models in high dimensions. IEEE Transactions on Information Theory ,58(7):4117–4134.Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theoryto algorithms . Cambridge university press.Von Neumann, J. (1949). On rings of operators. reduction theory. Annals of Mathematics ,pages 401–485.Vuffray, M., Misra, S., Lokhov, A., and Chertkov, M. (2016). Interaction screening: Efficientand sample-optimal learning of ising models. 
In Advances in Neural Information ProcessingSystems , pages 2595–2603.Yang, E., Allen, G., Liu, Z., and Ravikumar, P. K. (2012). Graphical models via generalizedlinear models. In Advances in Neural Information Processing Systems , pages 1358–1366.Yuan, M. and Lin, Y. (2007). Model selection and estimation in the gaussian graphical model. Biometrika , 94(1):19–35. 15 Related work on learning Ising models For the special case of learning Ising models (i.e., binary variables), we compare the samplecomplexity among different graph recovery algorithms in Table 2.Note that the algorithms in (Ravikumar et al., 2010; Bresler, 2015; Vuffray et al., 2016;Lokhov et al., 2018) are designed for learning Ising models instead of general pairwise graphicalmodels. Hence, they are not presented in Table 1.As mentioned in the Introduction, Ravikumar et al. (2010) consider (cid:96) -regularized logisticregression for learning Ising models in the high-dimensional setting. They require incoherenceassumptions that ensure, via conditions on sub-matrices of the Fisher information matrix,that sparse predictors of each node are hard to confuse with a false set. Their analysisobtains significantly better sample complexity compared to what is possible when these extraassumptions are not imposed (see (Bento and Montanari, 2009)). Others have also considered (cid:96) -regularization (e.g., (Lee et al., 2007; Yuan and Lin, 2007; Banerjee et al., 2008; Jalali et al.,2011; Yang et al., 2012; Aurell and Ekeberg, 2012)) for structure learning of Markov randomfields but they all require certain assumptions about the graphical model and hence theirmethods do not work for general graphical models. The analysis of (Ravikumar et al., 2010)is of essentially the same convex program as this work (except that we have an additionalthresholding procedure). The main difference is that they obtain a better sample guaranteebut require significantly more restrictive assumptions.In the general setting with no restrictions on the model, Santhanam and Wainwright (2012)provide an information-theoretic lower bound on the number of samples needed for graphrecovery. This lower bound depends logarithmically on n , and exponentially on the width λ ,and (somewhat inversely) on the minimum edge weight η . We will find these general broadtrends, but with important differences, in the other algorithms as well.Bresler (2015) provides a greedy algorithm and shows that it can learn with samplecomplexity that grows logarithmically in n , but doubly exponentially in the width λ and alsoexponentially in /η . It is thus suboptimal with respect to its dependence on λ and η .Vuffray et al. (2016) propose a new convex program (i.e. different from logistic regression),and for this they are able to show a single-exponential dependence on λ . There is also low-orderpolynomial dependence on λ and /η . Note that given λ and η , the degree is bounded by d ≤ λ/η (the equality is achieved when every edge has the same weight and there is noexternal field). Therefore, their sample complexity can scale as worse as /η . Later, the sameauthors (Lokhov et al., 2018) prove a similar result for the (cid:96) -regularized logistic regressionusing essentially the same proof technique as (Vuffray et al., 2016).Rigollet and Hütter (2017) analyze the (cid:96) -constrained logistic regression for learning Isingmodels. Their sample complexity has a better dependence on /η ( /η vs /η ) than (Lokhovet al., 2018). 
However, naïvely extending their analysis to the (cid:96) , -constrained logistic regressionwill give a sample complexity exponential in the alphabet size .In this paper, we analyze the (cid:96) , -constrained logistic regression for learning discrete pairwisegraphical models with general alphabet. Our proof uses a sharp generalization bound forconstrained logistic regression, which is different from (Lokhov et al., 2018; Rigollet and Hütter, Lemma 5.21 in (Rigollet and Hütter, 2017) has a typo: The upper bound should depend on exp(2 λ ) .Accordingly, Theorem 5.23 should depend on exp(4 λ ) rather than exp(3 λ ) . This is because the Hessian of the population loss has a lower bound that depends on exp( − λ √ k ) for (cid:107) w (cid:107) , ≤ λ √ k and (cid:107) x (cid:107) , ∞ ≤ . N )Information-theoretic lowerbound (Santhanamand Wainwright,2012) 1. Model width ≤ λ , and λ ≥ { ln( n )2 η tanh( η ) ,2. Degree ≤ d d ln( n d ) ,3. Minimum edge weight ≥ η > exp( λ ) ln( nd/ − ηd exp( η ) } 4. External field = 0 (cid:96) -regularizedlogistic regres-sion (Ravikumaret al., 2010) Q ∗ is the Fisher information matrix, O ( d ln( n )) S is set of neighbors of a given variable.1. Dependency: ∃ C min > such thateigenvalues of Q ∗ SS ≥ C min 2. Incoherence: ∃ α ∈ (0 , such that (cid:107) Q ∗ S c S ( Q ∗ SS ) − (cid:107) ∞ ≤ − α 3. Regularization parameter: λ N ≥ − α ) α (cid:113) ln( n ) N 4. Minimum edge weight ≥ √ dλ N /C min 5. External field = 0 6. Probability of success ≥ − e − O ( λ N N ) Greedyalgorithm (Bresler,2015) 1. Model width ≤ λ O (exp( exp( O ( dλ )) η O (1) ) ln( nρ )) 2. Degree ≤ d 3. Minimum edge weight ≥ η > 4. Probability of success ≥ − ρ InteractionScreening (Vuffrayet al., 2016) 1. Model width ≤ λ 2. Degree ≤ d O (max { d, η } 3. Minimum edge weight ≥ η > d exp(6 λ ) ln( nρ )) 4. Regularization parameter = 4 (cid:113) ln(3 n /ρ ) N 5. Probability of success ≥ − ρ(cid:96) -regularizedlogisticregression (Lokhovet al., 2018) 1. Model width ≤ λ 2. Degree ≤ d O (max { d, η } 3. Minimum edge weight ≥ η > d exp(8 λ ) ln( nρ )) 4. Regularization parameter O ( (cid:113) ln( n /ρ ) N ) 5. Probability of success ≥ − ρ(cid:96) -constrainedlogisticregression (Rigolletand Hütter, 2017) 1. Model width ≤ λ O ( λ exp(8 λ ) η ln( nρ )) 2. Minimum edge weight ≥ η > 3. Probability of success ≥ − ρ Sparsitron (Klivansand Meka, 2017) 1. Model width ≤ λ O ( λ exp(12 λ ) η ln( nρη )) 2. Minimum edge weight ≥ η > 3. Probability of success ≥ − ρ(cid:96) -constrainedlogistic regression[ this paper ] 1. Model width ≤ λ O ( λ exp(12 λ ) η ln( nρ )) 2. Minimum edge weight ≥ η > 3. Probability of success ≥ − ρ Table 2: Sample complexity comparison for learning Ising models. The second column lists theassumptions in their analysis. Given λ and η , the degree is bounded by d ≤ λ/η , with equalityachieved when every edge has the same weight and there is no external field.17017). For Ising models (shown in Table 2), our sample complexity matches that of (Klivansand Meka, 2017). For non-binary pairwise graphical models (shown in Table 1), our samplecomplexity improves the state-of-the-art result. B Proof of Lemma 1 and Lemma 2 The proof of Lemma 1 relies on the following lemmas. The first lemma is a generalization errorbound for any Lipschitz loss of linear functions with bounded (cid:107) w (cid:107) and (cid:107) x (cid:107) ∞ . Lemma 7. 
(see, e.g., Corollary 4 of (Kakade et al., 2009) and Theorem 26.15 of (Shalev-Shwartz and Ben-David, 2014)) Let D be a distribution on X × Y , where X = { x ∈ R n : (cid:107) x (cid:107) ∞ ≤ X ∞ } , and Y = {− , } . Let (cid:96) : R → R be a loss function with Lipschitz constant L (cid:96) .Define the expected loss L ( w ) and the empirical loss ˆ L ( w ) as L ( w ) = E ( x,y ) ∼D (cid:96) ( y (cid:104) w, x (cid:105) ) , ˆ L ( w ) = 1 N N (cid:88) i =1 (cid:96) ( y i (cid:10) w, x i (cid:11) ) , (19) where { x i , y i } Ni =1 are i.i.d. samples from distribution D . Define W = { w ∈ R n : (cid:107) w (cid:107) ≤ W } .Then with probability at least − ρ over the samples, we have that for all w ∈ W , L ( w ) ≤ ˆ L ( w ) + 2 L (cid:96) X ∞ W (cid:114) n ) N + L (cid:96) X ∞ W (cid:114) /ρ ) N . (20) Lemma 8. (Pinsker’s inequality) Let D KL ( a || b ) := a ln( a/b ) + (1 − a ) ln((1 − a ) / (1 − b )) denotethe KL-divergence between two Bernoulli distributions ( a, − a ) , ( b, − b ) with a, b ∈ [0 , .Then ( a − b ) ≤ D KL ( a || b ) . (21) Lemma 9. Let D be a distribution on X × {− , } . For ( X, Y ) ∼ D , P [ Y = 1 | X = x ] = σ ( (cid:104) w ∗ , x (cid:105) ) , where σ ( x ) = 1 / (1 + e − x ) is the sigmoid function. Let L ( w ) be the expected logisticloss: L ( w ) = E ( x,y ) ∼D ln(1 + e − y (cid:104) w,x (cid:105) ) = E ( x,y ) ∼D [ − y + 12 ln( σ ( (cid:104) w, x (cid:105) )) − − y − σ ( (cid:104) w, x (cid:105) ))] . (22) Then for any w , we have L ( w ) − L ( w ∗ ) = E ( x,y ) ∼D [ D KL ( σ ( (cid:104) w ∗ , x (cid:105) ) || σ ( (cid:104) w, x (cid:105) ))] , (23) where D KL ( a || b ) := a ln( a/b ) + (1 − a ) ln((1 − a ) / (1 − b )) denotes the KL-divergence betweentwo Bernoulli distributions ( a, − a ) , ( b, − b ) with a, b ∈ [0 , . roof. Simply plugging in the definition of the expected logistic loss L ( · ) gives L ( w ) − L ( w ∗ ) = E ( x,y ) ∼D [ − y + 12 ln( σ ( (cid:104) w, x (cid:105) )) − − y − σ ( (cid:104) w, x (cid:105) ))]+ E ( x,y ) ∼D [ y + 12 ln( σ ( (cid:104) w ∗ , x (cid:105) )) + 1 − y − σ ( (cid:104) w ∗ , x (cid:105) ))]= E x E y | x [ − y + 12 ln( σ ( (cid:104) w, x (cid:105) )) − − y − σ ( (cid:104) w, x (cid:105) ))]+ E x E y | x [ y + 12 ln( σ ( (cid:104) w ∗ , x (cid:105) )) + 1 − y − σ ( (cid:104) w ∗ , x (cid:105) ))] ( a ) = E x [ − σ ( (cid:104) w ∗ , x (cid:105) ) ln( σ ( (cid:104) w, x (cid:105) )) − (1 − σ ( (cid:104) w ∗ , x (cid:105) )) ln(1 − σ ( (cid:104) w, x (cid:105) ))]+ E x [ σ ( (cid:104) w ∗ , x (cid:105) ) ln( σ ( (cid:104) w ∗ , x (cid:105) )) + (1 − σ ( (cid:104) w ∗ , x (cid:105) )) ln(1 − σ ( (cid:104) w ∗ , x (cid:105) ))]= E x (cid:20) σ ( (cid:104) w ∗ , x (cid:105) ) ln (cid:18) σ ( (cid:104) w ∗ , x (cid:105) ) σ ( (cid:104) w, x (cid:105) ) (cid:19) + (1 − σ ( (cid:104) w ∗ , x (cid:105) )) ln (cid:18) − σ ( (cid:104) w ∗ , x (cid:105) )1 − σ ( (cid:104) w, x (cid:105) ) (cid:19)(cid:21) = E ( x,y ) ∼D [ D KL ( σ ( (cid:104) w ∗ , x (cid:105) ) || σ ( (cid:104) w, x (cid:105) ))] , where (a) follows from the fact that E y | x [ y ] = 1 · P [ y = 1 | x ] + ( − · P [ y = − | x ] = 2 σ ( (cid:104) w ∗ , x (cid:105) ) − . We are now ready to prove Lemma 1 (which is restated below): Lemma. Let D be a distribution on {− , } n × {− , } where for ( X, Y ) ∼ D , P [ Y = 1 | X = x ] = σ ( (cid:104) w ∗ , x (cid:105) ) . We assume that (cid:107) w ∗ (cid:107) ≤ λ for a known λ ≥ . Given N i.i.d. 
samples { ( x i , y i ) } Ni =1 , let ˆ w be any minimizer of the following (cid:96) -constrained logistic regression problem: ˆ w ∈ arg min w ∈ R n N N (cid:88) i =1 ln(1 + e − y i (cid:104) w,x i (cid:105) ) s.t. (cid:107) w (cid:107) ≤ λ. (24) Given ρ ∈ (0 , and (cid:15) > , suppose that N = O ( λ (ln( n/ρ )) /(cid:15) ) , then with probability at least − ρ over the samples, we have that E ( x,y ) ∼D [( σ ( (cid:104) w ∗ , x (cid:105) ) − σ ( (cid:104) ˆ w, x (cid:105) )) ] ≤ (cid:15) .Proof. We first apply Lemma 7 to the setup of Lemma 1. The loss function (cid:96) ( z ) = ln(1 + e − z ) defined above has Lipschitz constant L (cid:96) = 1 . The input sample x ∈ {− , } n satisfies (cid:107) x (cid:107) ∞ ≤ .Let W = { w ∈ R n × k : (cid:107) w (cid:107) ≤ λ } . According to Lemma 7, with probability at least − ρ/ over the draw of the training set, we have that for all w ∈ W , L ( w ) ≤ ˆ L ( w ) + 4 λ (cid:114) n ) N + 2 λ (cid:114) /ρ ) N . (25)where L ( w ) = E ( x,y ) ∼D ln(1 + e − y (cid:104) w,x (cid:105) ) and ˆ L ( w ) = (cid:80) Ni =1 ln(1 + e − y i (cid:104) w,x i (cid:105) ) /N are the expectedloss and empirical loss.Let N = C · λ ln(8 n/ρ ) /(cid:15) for a global constant C , then (25) implies that with probabilityat least − ρ/ , L ( w ) ≤ ˆ L ( w ) + (cid:15), for all w ∈ W . (26)19e next prove a concentration result for ˆ L ( w ∗ ) . Here w ∗ is the true regression vector andis assumed to be fixed. First notice that ln(1 + e − y (cid:104) w ∗ ,x (cid:105) ) is bounded because | y (cid:104) w ∗ , x (cid:105) | ≤ λ .Besides, the ln(1 + e − z ) has Lipschitz 1, so | ln(1 + e − λ ) − ln(1 + e λ ) | ≤ λ . Hoeffding’sinequality gives that P [ ˆ L ( w ∗ ) − L ( w ∗ ) ≥ t ] ≤ e − Nt / (4 λ ) . Let N = C (cid:48) · λ ln(2 /ρ ) /(cid:15) for aglobal constant C (cid:48) , then with probability at least − ρ/ over the samples, ˆ L ( w ∗ ) ≤ L ( w ∗ ) + (cid:15). (27)Then the following holds with probability at least − ρ : L ( ˆ w ) ( a ) ≤ ˆ L ( ˆ w ) + (cid:15) ( b ) ≤ ˆ L ( w ∗ ) + (cid:15) ( c ) ≤ L ( w ∗ ) + 2 (cid:15), (28)where (a) follows from (26), (b) follows from the fact ˆ w is the minimizer of ˆ L ( w ) , and (c)follows from (27).So far we have shown that L ( ˆ w ) − L ( w ∗ ) ≤ (cid:15) with probability at least − ρ . The laststep is to lower bound L ( ˆ w ) − L ( w ∗ ) by E ( x,y ) ∼D ( σ ( (cid:104) w ∗ , x (cid:105) ) − σ ( (cid:104) w, x (cid:105) )) using Lemma 8 andLemma 9. E ( x,y ) ∼D ( σ ( (cid:104) w ∗ , x (cid:105) ) − σ ( (cid:104) w, x (cid:105) )) d ) ≤ E ( x,y ) ∼D D KL ( σ ( (cid:104) w ∗ , x (cid:105) ) || σ ( (cid:104) w, x (cid:105) )) / ( e ) = ( L ( ˆ w ) − L ( w ∗ )) / ( f ) ≤ (cid:15), where (d) follows from Lemma 8, (e) follows from Lemma 9, and (f) follows from (28). Therefore,we have that E ( x,y ) ∼D ( σ ( (cid:104) w ∗ , x (cid:105) ) − σ ( (cid:104) w, x (cid:105) )) ≤ (cid:15) with probability at least − ρ , if the numberof samples satisfies N = O ( λ ln( n/ρ ) /(cid:15) ) .The proof of Lemma 2 is identical to the proof of Lemma 1, except that it relies on thefollowing generalization error bound for Lipschitz loss functions with bounded (cid:96) , -norm. Lemma 10. Let D be a distribution on X × Y , where X = { x ∈ R n × k : (cid:107) x (cid:107) , ∞ ≤ X , ∞ } , and Y = {− , } . Let (cid:96) : R → R be a loss function with Lipschitz constant L (cid:96) . 
The proof of Lemma 2 is identical to the proof of Lemma 1, except that it relies on the following generalization error bound for Lipschitz loss functions with a bounded $\ell_{2,1}$-norm constraint.

Lemma 10. Let $\mathcal{D}$ be a distribution on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} = \{x \in \mathbb{R}^{n\times k} : \|x\|_{2,\infty} \le X_{2,\infty}\}$ and $\mathcal{Y} = \{-1, 1\}$. Let $\ell : \mathbb{R} \to \mathbb{R}$ be a loss function with Lipschitz constant $L_\ell$. Define the expected loss $L(w)$ and the empirical loss $\hat{L}(w)$ as
\[
L(w) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\, \ell(y\langle w, x\rangle), \qquad \hat{L}(w) = \frac{1}{N}\sum_{i=1}^N \ell(y^i\langle w, x^i\rangle), \tag{29}
\]
where $\{x^i, y^i\}_{i=1}^N$ are i.i.d. samples from distribution $\mathcal{D}$. Define $\mathcal{W} = \{w \in \mathbb{R}^{n\times k} : \|w\|_{2,1} \le W_{2,1}\}$. Then with probability at least $1-\rho$ over the draw of the $N$ samples, we have that for all $w \in \mathcal{W}$,
\[
L(w) \le \hat{L}(w) + 2 L_\ell X_{2,\infty} W_{2,1}\sqrt{\frac{2e\ln(n)}{N}} + L_\ell X_{2,\infty} W_{2,1}\sqrt{\frac{2\ln(2/\rho)}{N}}. \tag{30}
\]
Lemma 10 can be readily derived from existing results. First, notice that the dual norm of $\|\cdot\|_{2,1}$ is $\|\cdot\|_{2,\infty}$. Using Corollary 14 in (Kakade et al., 2012), Theorem 1 in (Kakade et al., 2009), and the fact that $\|w\|_{2,q} \le \|w\|_{2,1}$ for $q \ge 1$, we conclude that the Rademacher complexity of the function class $\mathcal{F} := \{x \mapsto \langle w, x\rangle : \|w\|_{2,1} \le W_{2,1}\}$ is at most $X_{2,\infty} W_{2,1}\sqrt{2e\ln(n)/N}$. We can then obtain the standard Rademacher-based generalization bound (see, e.g., (Bartlett and Mendelson, 2002) and Theorem 26.5 in (Shalev-Shwartz and Ben-David, 2014)) for bounded Lipschitz loss functions.

We omit the proof of Lemma 2 since it is the same as that of Lemma 1.

C Proof of Lemma 3

Lemma 3 is restated below.

Lemma. Let $\mathcal{D}$ be a $\delta$-unbiased distribution on $\mathcal{S}^n$, where $\mathcal{S}$ is the alphabet set. For $X \sim \mathcal{D}$ and any $i \in [n]$, the distribution of $X_{-i}$ is also $\delta$-unbiased.

Proof. For any $j \ne i \in [n]$, any $a \in \mathcal{S}$, and any $x \in \mathcal{S}^{n-2}$, we have
\begin{align*}
\mathbb{P}[X_j = a \mid X_{[n]\setminus\{i,j\}} = x] &= \sum_{b\in\mathcal{S}} \mathbb{P}[X_j = a, X_i = b \mid X_{[n]\setminus\{i,j\}} = x] \\
&= \sum_{b\in\mathcal{S}} \mathbb{P}[X_i = b \mid X_{[n]\setminus\{i,j\}} = x]\cdot \mathbb{P}[X_j = a \mid X_i = b, X_{[n]\setminus\{i,j\}} = x] \\
&\overset{(a)}{\ge} \delta\sum_{b\in\mathcal{S}} \mathbb{P}[X_i = b \mid X_{[n]\setminus\{i,j\}} = x] = \delta, \tag{31}
\end{align*}
where (a) follows from the fact that $X \sim \mathcal{D}$ and $\mathcal{D}$ is a $\delta$-unbiased distribution. Since (31) holds for any $j \ne i \in [n]$, any $a \in \mathcal{S}$, and any $x \in \mathcal{S}^{n-2}$, by definition, the distribution of $X_{-i}$ is $\delta$-unbiased.

D Proof of Lemma 4

The lemma is restated below, followed by its proof.

Lemma. Let $\mathcal{D}(\mathcal{W}, \Theta)$ be a pairwise graphical model distribution with alphabet size $k$ and width $\lambda(\mathcal{W}, \Theta)$. Then $\mathcal{D}(\mathcal{W}, \Theta)$ is $\delta$-unbiased with $\delta = e^{-2\lambda(\mathcal{W},\Theta)}/k$. Specifically, an Ising model distribution $\mathcal{D}(A, \theta)$ is $e^{-2\lambda(A,\theta)}/2$-unbiased.

Proof. Let $X \sim \mathcal{D}(\mathcal{W}, \Theta)$, and assume that $X \in [k]^n$. For any $i \in [n]$, any $a \in [k]$, and any $x \in [k]^{n-1}$, we have
\begin{align*}
\mathbb{P}[X_i = a \mid X_{-i} = x] &= \frac{\exp\big(\sum_{j\ne i} W_{ij}(a, x_j) + \theta_i(a)\big)}{\sum_{b\in[k]}\exp\big(\sum_{j\ne i} W_{ij}(b, x_j) + \theta_i(b)\big)} \\
&= \frac{1}{\sum_{b\in[k]}\exp\big(\sum_{j\ne i}(W_{ij}(b, x_j) - W_{ij}(a, x_j)) + \theta_i(b) - \theta_i(a)\big)} \\
&\overset{(a)}{\ge} \frac{1}{k\cdot\exp(2\lambda(\mathcal{W},\Theta))} = e^{-2\lambda(\mathcal{W},\Theta)}/k, \tag{32}
\end{align*}
where (a) follows from the definition of model width. The lemma then follows (the Ising model corresponds to the special case $k = 2$).
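Lemma 4 is easy to check numerically on tiny models by brute-force enumeration. The sketch below (illustrative only, with arbitrary small dimensions) compares the smallest conditional probability against $e^{-2\lambda}/k$, taking the model width to be $\max_{i,a}\big(\sum_{j\ne i}\max_b |W_{ij}(a,b)| + |\theta_i(a)|\big)$, which is the quantity the proof of Lemma 4 actually uses.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k = 3, 3
W = rng.uniform(-0.3, 0.3, size=(n, n, k, k))
W = (W + W.transpose(1, 0, 3, 2)) / 2          # symmetry: W[i, j, a, b] = W[j, i, b, a]
for i in range(n):
    W[i, i] = 0.0                              # no self-interactions
theta = rng.uniform(-0.2, 0.2, size=(n, k))

def unnorm(x):
    # unnormalized probability of a configuration x in [k]^n
    pair = sum(W[i, j, x[i], x[j]] for i in range(n) for j in range(i + 1, n))
    return np.exp(pair + sum(theta[i, x[i]] for i in range(n)))

probs = {x: unnorm(x) for x in itertools.product(range(k), repeat=n)}

# smallest conditional probability P[X_i = a | X_{-i}] over all i, a, and contexts
min_cond = 1.0
for x in probs:
    for i in range(n):
        denom = sum(probs[x[:i] + (b,) + x[i + 1:]] for b in range(k))
        min_cond = min(min_cond, probs[x] / denom)

lam = max(sum(np.abs(W[i, j, a]).max() for j in range(n) if j != i) + abs(theta[i, a])
          for i in range(n) for a in range(k))
print(min_cond, np.exp(-2 * lam) / k)          # the first number should be >= the second
```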
E Proof of Lemma 5 and Lemma 6

The proof relies on the following basic property of the sigmoid function (see Claim 4.2 of (Klivans and Meka, 2017)):
\[
|\sigma(a) - \sigma(b)| \ge e^{-|a|-3}\cdot\min(1, |a-b|), \quad \forall\, a, b \in \mathbb{R}. \tag{33}
\]
We first prove Lemma 5 (which is restated below).

Lemma. Let $\mathcal{D}$ be a $\delta$-unbiased distribution on $\{-1,1\}^n$. Suppose that for two vectors $u, w \in \mathbb{R}^n$ and $\theta', \theta'' \in \mathbb{R}$, $\mathbb{E}_{X\sim\mathcal{D}}[(\sigma(\langle w, X\rangle + \theta') - \sigma(\langle u, X\rangle + \theta''))^2] \le \epsilon$, where $\epsilon < \delta e^{-2\|w\|_1 - 2|\theta'| - 6}$. Then $\|w - u\|_\infty \le O(1)\cdot e^{\|w\|_1 + |\theta'|}\cdot\sqrt{\epsilon/\delta}$.

Proof. For any $i \in [n]$ and any $X \in \{-1,1\}^n$, let $X_i \in \{-1,1\}$ be the $i$-th variable and $X_{-i} \in \{-1,1\}^{n-1}$ be the $[n]\setminus\{i\}$ variables. Let $X^{i,+} \in \{-1,1\}^n$ (respectively $X^{i,-}$) be the vector obtained from $X$ by setting $X_i = 1$ (respectively $X_i = -1$). Then we have
\begin{align*}
\epsilon &\ge \mathbb{E}_{X\sim\mathcal{D}}\big[(\sigma(\langle w, X\rangle + \theta') - \sigma(\langle u, X\rangle + \theta''))^2\big] \\
&= \mathbb{E}_{X_{-i}}\Big[\mathbb{E}_{X_i\mid X_{-i}}\big(\sigma(\langle w, X\rangle + \theta') - \sigma(\langle u, X\rangle + \theta'')\big)^2\Big] \\
&= \mathbb{E}_{X_{-i}}\big[(\sigma(\langle w, X^{i,+}\rangle + \theta') - \sigma(\langle u, X^{i,+}\rangle + \theta''))^2\cdot\mathbb{P}[X_i = 1\mid X_{-i}] \\
&\qquad\; + (\sigma(\langle w, X^{i,-}\rangle + \theta') - \sigma(\langle u, X^{i,-}\rangle + \theta''))^2\cdot\mathbb{P}[X_i = -1\mid X_{-i}]\big] \\
&\overset{(a)}{\ge} \delta\cdot\mathbb{E}_{X_{-i}}\big[(\sigma(\langle w, X^{i,+}\rangle + \theta') - \sigma(\langle u, X^{i,+}\rangle + \theta''))^2 + (\sigma(\langle w, X^{i,-}\rangle + \theta') - \sigma(\langle u, X^{i,-}\rangle + \theta''))^2\big] \\
&\overset{(b)}{\ge} \delta e^{-2\|w\|_1 - 2|\theta'| - 6}\cdot\mathbb{E}_{X_{-i}}\big[\min\big(1, ((\langle w, X^{i,+}\rangle + \theta') - (\langle u, X^{i,+}\rangle + \theta''))^2\big) \\
&\qquad\; + \min\big(1, ((\langle w, X^{i,-}\rangle + \theta') - (\langle u, X^{i,-}\rangle + \theta''))^2\big)\big] \\
&\overset{(c)}{\ge} \delta e^{-2\|w\|_1 - 2|\theta'| - 6}\cdot\mathbb{E}_{X_{-i}}\min\big(1, (2w_i - 2u_i)^2/2\big) \\
&\overset{(d)}{=} \delta e^{-2\|w\|_1 - 2|\theta'| - 6}\cdot\min\big(1, 2(w_i - u_i)^2\big). \tag{34}
\end{align*}
Here (a) follows from the fact that $\mathcal{D}$ is a $\delta$-unbiased distribution, which implies that $\mathbb{P}[X_i = 1\mid X_{-i}] \ge \delta$ and $\mathbb{P}[X_i = -1\mid X_{-i}] \ge \delta$. Inequality (b) is obtained by substituting (33). Inequality (c) uses the following fact:
\[
\min(1, a^2) + \min(1, b^2) \ge \min(1, (a-b)^2/2), \quad \forall\, a, b \in \mathbb{R}. \tag{35}
\]
To see why (35) holds, note that if both $|a|, |b| \le 1$, then (35) is true since $a^2 + b^2 \ge (a-b)^2/2$. Otherwise, (35) is true because the left-hand side is at least 1 while the right-hand side is at most 1. The last equality (d) follows from the fact that $X_{-i}$ is independent of $\min(1, 2(w_i - u_i)^2)$.

Since $\epsilon < \delta e^{-2\|w\|_1 - 2|\theta'| - 6}$, (34) implies that $|w_i - u_i| \le O(1)\cdot e^{\|w\|_1 + |\theta'|}\cdot\sqrt{\epsilon/\delta}$. Because (34) holds for any $i \in [n]$, we have that $\|w - u\|_\infty \le O(1)\cdot e^{\|w\|_1 + |\theta'|}\cdot\sqrt{\epsilon/\delta}$.
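Inequality (33), taken from (Klivans and Meka, 2017), is the only analytic fact about the sigmoid that the proofs of Lemma 5 and Lemma 6 use, so a quick numerical check is a cheap safeguard; the snippet below is purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
a = rng.uniform(-10, 10, size=100_000)
b = rng.uniform(-10, 10, size=100_000)
lhs = np.abs(sigmoid(a) - sigmoid(b))
rhs = np.exp(-np.abs(a) - 3) * np.minimum(1.0, np.abs(a - b))
print("inequality (33) violated anywhere:", bool((lhs < rhs).any()))  # expect False
```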
We now prove Lemma 6 (which is restated below).

Lemma. Let $\mathcal{D}$ be a $\delta$-unbiased distribution on $[k]^n$. For $X \sim \mathcal{D}$, let $\tilde{X} \in \{0,1\}^{n\times k}$ be the one-hot encoding of $X$. Let $u, w \in \mathbb{R}^{n\times k}$ be two matrices satisfying $\sum_j u(i,j) = 0$ and $\sum_j w(i,j) = 0$ for all $i \in [n]$. Suppose that for some $u, w$ and $\theta', \theta'' \in \mathbb{R}$, we have $\mathbb{E}_{X\sim\mathcal{D}}[(\sigma(\langle w, \tilde{X}\rangle + \theta') - \sigma(\langle u, \tilde{X}\rangle + \theta''))^2] \le \epsilon$, where $\epsilon < \delta e^{-2\|w\|_{\infty,1} - 2|\theta'| - 6}$. Then $\|w - u\|_\infty \le O(1)\cdot e^{\|w\|_{\infty,1} + |\theta'|}\cdot\sqrt{\epsilon/\delta}$.

Proof. Fix an $i \in [n]$ and $a \ne b \in [k]$. Let $X^{i,a} \in [k]^n$ (respectively $X^{i,b}$) be the vector obtained from $X$ by setting $X_i = a$ (respectively $X_i = b$). Let $\tilde{X}^{i,a} \in \{0,1\}^{n\times k}$ be the one-hot encoding of $X^{i,a} \in [k]^n$. Then we have
\begin{align*}
\epsilon &\ge \mathbb{E}_{X\sim\mathcal{D}}\big[(\sigma(\langle w, \tilde{X}\rangle + \theta') - \sigma(\langle u, \tilde{X}\rangle + \theta''))^2\big] \\
&= \mathbb{E}_{X_{-i}}\Big[\mathbb{E}_{X_i\mid X_{-i}}\big(\sigma(\langle w, \tilde{X}\rangle + \theta') - \sigma(\langle u, \tilde{X}\rangle + \theta'')\big)^2\Big] \\
&\ge \mathbb{E}_{X_{-i}}\big[(\sigma(\langle w, \tilde{X}^{i,a}\rangle + \theta') - \sigma(\langle u, \tilde{X}^{i,a}\rangle + \theta''))^2\cdot\mathbb{P}[X_i = a\mid X_{-i}] \\
&\qquad\; + (\sigma(\langle w, \tilde{X}^{i,b}\rangle + \theta') - \sigma(\langle u, \tilde{X}^{i,b}\rangle + \theta''))^2\cdot\mathbb{P}[X_i = b\mid X_{-i}]\big] \\
&\overset{(a)}{\ge} \delta e^{-2\|w\|_{\infty,1} - 2|\theta'| - 6}\cdot\mathbb{E}_{X_{-i}}\big[\min\big(1, ((\langle w, \tilde{X}^{i,a}\rangle + \theta') - (\langle u, \tilde{X}^{i,a}\rangle + \theta''))^2\big) \\
&\qquad\; + \min\big(1, ((\langle w, \tilde{X}^{i,b}\rangle + \theta') - (\langle u, \tilde{X}^{i,b}\rangle + \theta''))^2\big)\big] \\
&\overset{(b)}{\ge} \delta e^{-2\|w\|_{\infty,1} - 2|\theta'| - 6}\cdot\mathbb{E}_{X_{-i}}\min\big(1, ((w(i,a) - w(i,b)) - (u(i,a) - u(i,b)))^2/2\big) \\
&= \delta e^{-2\|w\|_{\infty,1} - 2|\theta'| - 6}\cdot\min\big(1, ((w(i,a) - w(i,b)) - (u(i,a) - u(i,b)))^2/2\big). \tag{36}
\end{align*}
Here (a) follows from the fact that $\mathcal{D}$ is a $\delta$-unbiased distribution and from (33). Inequality (b) follows from (35). Because $\epsilon < \delta e^{-2\|w\|_{\infty,1} - 2|\theta'| - 6}$, (36) implies that
\[
(w(i,a) - w(i,b)) - (u(i,a) - u(i,b)) \le O(1)\cdot e^{\|w\|_{\infty,1} + |\theta'|}\cdot\sqrt{\epsilon/\delta}, \tag{37}
\]
\[
(u(i,a) - u(i,b)) - (w(i,a) - w(i,b)) \le O(1)\cdot e^{\|w\|_{\infty,1} + |\theta'|}\cdot\sqrt{\epsilon/\delta}. \tag{38}
\]
Since (37) and (38) hold for any $a \ne b \in [k]$, we can sum over $b \in [k]$ and use the fact that $\sum_j u(i,j) = 0$ and $\sum_j w(i,j) = 0$ to get
\[
w(i,a) - u(i,a) = \frac{1}{k}\sum_b \big[(w(i,a) - w(i,b)) - (u(i,a) - u(i,b))\big] \le O(1)\cdot e^{\|w\|_{\infty,1} + |\theta'|}\cdot\sqrt{\epsilon/\delta},
\]
\[
u(i,a) - w(i,a) = \frac{1}{k}\sum_b \big[(u(i,a) - u(i,b)) - (w(i,a) - w(i,b))\big] \le O(1)\cdot e^{\|w\|_{\infty,1} + |\theta'|}\cdot\sqrt{\epsilon/\delta}.
\]
Therefore, we have $|w(i,a) - u(i,a)| \le O(1)\cdot e^{\|w\|_{\infty,1} + |\theta'|}\cdot\sqrt{\epsilon/\delta}$ for any $i \in [n]$ and $a \in [k]$.
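Lemma 6 is stated for one-hot encoded samples and row-centered weight matrices. Both operations are simple; a small numpy sketch (with illustrative helper names, not the paper's code) makes the conventions concrete.

```python
import numpy as np

def one_hot(x, k):
    # x in [k]^n encoded as an n-by-k 0/1 matrix: row i has a single 1 at column x[i]
    X = np.zeros((len(x), k))
    X[np.arange(len(x)), x] = 1.0
    return X

def center_rows(w):
    # enforce sum_j w(i, j) = 0 for every row i, as required by Lemma 6
    return w - w.mean(axis=1, keepdims=True)

x = np.array([2, 0, 1])            # a sample in [3]^3 (0-indexed alphabet)
w = np.random.randn(3, 3)
print(one_hot(x, 3))
print(center_rows(w).sum(axis=1))  # each row sum is ~0 up to floating point error
```

With these conventions, $\langle w, \tilde{X}\rangle = \sum_i w(i, x_i)$, which is how the inner products in (36) are evaluated.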
F Proof of Theorem 1

We first restate Theorem 1 and then give the proof.

Theorem. Let $\mathcal{D}(A, \theta)$ be an unknown $n$-variable Ising model distribution with dependency graph $G$. Suppose that $\mathcal{D}(A, \theta)$ has width $\lambda(A, \theta) \le \lambda$. Given $\rho \in (0,1)$ and $\epsilon > 0$, if the number of i.i.d. samples satisfies $N = O(\lambda^2\exp(12\lambda)\ln(n/\rho)/\epsilon^4)$, then with probability at least $1-\rho$, Algorithm 1 produces $\hat{A}$ that satisfies
\[
\max_{i,j\in[n]} |A_{ij} - \hat{A}_{ij}| \le \epsilon. \tag{39}
\]

Proof. For ease of notation, we consider the $n$-th variable. The goal is to prove that Algorithm 1 is able to recover the $n$-th row of the true weight matrix $A$. Specifically, we will show that if the number of samples satisfies $N = O(\lambda^2\exp(O(\lambda))\ln(n/\rho)/\epsilon^4)$, then with probability at least $1-\rho/n$,
\[
\max_{j\in[n]} |A_{nj} - \hat{A}_{nj}| \le \epsilon. \tag{40}
\]
We then use a union bound to conclude that with probability at least $1-\rho$, $\max_{i,j\in[n]} |A_{ij} - \hat{A}_{ij}| \le \epsilon$.

Let $Z \sim \mathcal{D}(A, \theta)$, $X = [Z_{-n}, 1] = [Z_1, Z_2, \cdots, Z_{n-1}, 1] \in \{-1,1\}^n$, and $Y = Z_n \in \{-1,1\}$. By Fact 1, $\mathbb{P}[Y = 1\mid X = x] = \sigma(\langle w^*, x\rangle)$, where $w^* = 2[A_{n1}, \cdots, A_{n(n-1)}, \theta_n]$. Further, $\|w^*\|_1 \le 2\lambda$. Let $\hat{w}$ be the solution of the $\ell_1$-constrained logistic regression problem defined in (4). By Lemma 1, if the number of samples satisfies $N = O(\lambda^2\ln(n/\rho)/\gamma^2)$, then with probability at least $1-\rho/n$, we have
\[
\mathbb{E}_X\big[(\sigma(\langle w^*, X\rangle) - \sigma(\langle \hat{w}, X\rangle))^2\big] \le \gamma. \tag{41}
\]
By Lemma 4, $Z_{-n} \in \{-1,1\}^{n-1}$ is $\delta$-unbiased (Definition 3) with $\delta = e^{-2\lambda}/2$. By Lemma 5, if $\gamma < C_1\delta e^{-4\lambda}$ for some constant $C_1 > 0$, then (41) implies that
\[
\|w^*_{1:n-1} - \hat{w}_{1:n-1}\|_\infty \le O(1)\cdot e^{2\lambda}\cdot\sqrt{\gamma/\delta}. \tag{42}
\]
Note that $w^*_{1:n-1} = 2[A_{n1}, \cdots, A_{n(n-1)}]$ and $\hat{w}_{1:n-1} = 2[\hat{A}_{n1}, \cdots, \hat{A}_{n(n-1)}]$. Let $\gamma = C_2\delta e^{-4\lambda}\epsilon^2$ for some constant $C_2 > 0$ and $\epsilon \in (0,1)$; (42) then implies that
\[
\max_{j\in[n]} |A_{nj} - \hat{A}_{nj}| \le \epsilon. \tag{43}
\]
The number of samples needed is $N = O(\lambda^2\ln(n/\rho)/\gamma^2) = O(\lambda^2 e^{12\lambda}\ln(n/\rho)/\epsilon^4)$.

We have proved that (40) holds with probability at least $1-\rho/n$. Using a union bound over all $n$ variables gives that with probability at least $1-\rho$, $\max_{i,j\in[n]} |A_{ij} - \hat{A}_{ij}| \le \epsilon$.
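The proof is constructive, and the per-node procedure it describes is short in code. The sketch below is not the paper's implementation: it reuses the hypothetical `l1_constrained_logreg` helper from the earlier snippet, regresses each variable on the remaining ones plus an intercept column, and reads edge estimates off the halved coefficients; the final symmetrization is simply one natural way to combine the two per-edge estimates.

```python
import numpy as np

def estimate_ising_row(Z, i, lam):
    # Z: (N, n) array of +/-1 samples; returns estimates of A_{ij} for j != i.
    N, n = Z.shape
    X = np.hstack([np.delete(Z, i, axis=1), np.ones((N, 1))])  # features [Z_{-i}, 1]
    y = Z[:, i]
    w_hat = l1_constrained_logreg(X, y, radius=2 * lam)        # approximately solves (4)
    return w_hat[:-1] / 2.0                                    # w* = 2[A_{i,-i}, theta_i]

def estimate_ising_matrix(Z, lam):
    n = Z.shape[1]
    A_hat = np.zeros((n, n))
    for i in range(n):
        A_hat[i, np.arange(n) != i] = estimate_ising_row(Z, i, lam)
    return (A_hat + A_hat.T) / 2.0   # combine the two estimates of each edge
```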
G Proof of Theorem 2

The following lemma will be used in the proof.

Lemma 11. Let $Z \sim \mathcal{D}$, where $\mathcal{D}$ is a $\delta$-unbiased distribution on $[k]^n$. Given $\alpha \ne \beta \in [k]$, conditioned on $Z_n \in \{\alpha, \beta\}$, $Z_{-n} \in [k]^{n-1}$ is also $\delta$-unbiased.

Proof. For any $i \in [n-1]$, $a \in [k]$, and $x \in [k]^{n-2}$, we have
\begin{align*}
\mathbb{P}[Z_i = a \mid Z_{[n]\setminus\{i,n\}} = x, Z_n \in \{\alpha,\beta\}] &= \frac{\mathbb{P}[Z_i = a, Z_{[n]\setminus\{i,n\}} = x, Z_n = \alpha] + \mathbb{P}[Z_i = a, Z_{[n]\setminus\{i,n\}} = x, Z_n = \beta]}{\mathbb{P}[Z_{[n]\setminus\{i,n\}} = x, Z_n = \alpha] + \mathbb{P}[Z_{[n]\setminus\{i,n\}} = x, Z_n = \beta]} \\
&\overset{(a)}{\ge} \min\left(\frac{\mathbb{P}[Z_i = a, Z_{[n]\setminus\{i,n\}} = x, Z_n = \alpha]}{\mathbb{P}[Z_{[n]\setminus\{i,n\}} = x, Z_n = \alpha]},\ \frac{\mathbb{P}[Z_i = a, Z_{[n]\setminus\{i,n\}} = x, Z_n = \beta]}{\mathbb{P}[Z_{[n]\setminus\{i,n\}} = x, Z_n = \beta]}\right) \\
&= \min\big(\mathbb{P}[Z_i = a \mid Z_{[n]\setminus\{i,n\}} = x, Z_n = \alpha],\ \mathbb{P}[Z_i = a \mid Z_{[n]\setminus\{i,n\}} = x, Z_n = \beta]\big) \\
&\overset{(b)}{\ge} \delta, \tag{44}
\end{align*}
where (a) follows from the fact that $(a+b)/(c+d) \ge \min(a/c, b/d)$ for $a, b, c, d > 0$, and (b) follows from the fact that $Z$ is $\delta$-unbiased.

Now we are ready to prove Theorem 2, which is restated below.

Theorem. Let $\mathcal{D}(\mathcal{W}, \Theta)$ be an $n$-variable pairwise graphical model distribution with width $\lambda(\mathcal{W}, \Theta) \le \lambda$ and alphabet size $k$. Given $\rho \in (0,1)$ and $\epsilon > 0$, if the number of i.i.d. samples satisfies $N = O(\lambda^2 k^4\exp(14\lambda)\ln(nk/\rho)/\epsilon^4)$, then with probability at least $1-\rho$, Algorithm 2 produces $\hat{W}_{ij} \in \mathbb{R}^{k\times k}$ that satisfy
\[
|W_{ij}(a,b) - \hat{W}_{ij}(a,b)| \le \epsilon, \quad \forall\, i \ne j \in [n],\ \forall\, a, b \in [k]. \tag{45}
\]

Proof. To ease notation, let us consider the $n$-th variable (i.e., set $i = n$ inside the first "for" loop of Algorithm 2). The proof directly applies to the other variables. We will prove the following result: if the number of samples satisfies $N = O(\lambda^2 k^4\exp(14\lambda)\ln(nk/\rho)/\epsilon^4)$, then with probability at least $1-\rho/n$, the matrices $U^{\alpha,\beta} \in \mathbb{R}^{n\times k}$ produced by Algorithm 2 satisfy
\[
|W_{nj}(\alpha,b) - W_{nj}(\beta,b) - U^{\alpha,\beta}(j,b)| \le \epsilon, \quad \forall\, j \in [n-1],\ \forall\, \alpha, \beta, b \in [k]. \tag{46}
\]
Suppose that (46) holds; summing over $\beta \in [k]$ and using the fact that $\sum_\beta W_{nj}(\beta,b) = 0$ gives
\[
\Big|W_{nj}(\alpha,b) - \frac{1}{k}\sum_{\beta\in[k]} U^{\alpha,\beta}(j,b)\Big| \le \epsilon, \quad \forall\, j \in [n-1],\ \forall\, \alpha, b \in [k]. \tag{47}
\]
Theorem 2 then follows by taking a union bound over the $n$ variables.

The only thing left is to prove (46). Now fix a pair $\alpha, \beta \in [k]$ and let $N_{\alpha,\beta}$ be the number of samples in which the $n$-th variable is either $\alpha$ or $\beta$. By Lemma 2 and Fact 2, if $N_{\alpha,\beta} = O(\lambda^2 k\ln(n/\rho')/\gamma^2)$, then with probability at least $1-\rho'$, the minimizer $w^{\alpha,\beta} \in \mathbb{R}^{n\times k}$ of the $\ell_{2,1}$-constrained logistic regression satisfies
\[
\mathbb{E}_X\big[(\sigma(\langle w^*, X\rangle) - \sigma(\langle w^{\alpha,\beta}, X\rangle))^2\big] \le \gamma. \tag{48}
\]
Recall that $X \in \{0,1\}^{n\times k}$ is the one-hot encoding of the vector $[Z_{-n}, 1] \in [k]^n$, where $Z \sim \mathcal{D}(\mathcal{W}, \Theta)$ and $Z_n \in \{\alpha, \beta\}$. Besides, $w^* \in \mathbb{R}^{n\times k}$ satisfies
\[
w^*(j, :) = W_{nj}(\alpha, :) - W_{nj}(\beta, :), \ \forall\, j \in [n-1], \qquad w^*(n, :) = [\theta_n(\alpha) - \theta_n(\beta), 0, \cdots, 0]. \tag{49}
\]
Let $U^{\alpha,\beta} \in \mathbb{R}^{n\times k}$ be formed by centering the first $n-1$ rows of $w^{\alpha,\beta}$. Since each row of $X$ is a standard basis vector (i.e., all 0's except a single 1), $\langle U^{\alpha,\beta}, X\rangle = \langle w^{\alpha,\beta}, X\rangle$. Hence, (48) implies
\[
\mathbb{E}_X\big[(\sigma(\langle w^*, X\rangle) - \sigma(\langle U^{\alpha,\beta}, X\rangle))^2\big] \le \gamma. \tag{50}
\]
By Lemma 4, we know that $Z \sim \mathcal{D}(\mathcal{W}, \Theta)$ is $\delta$-unbiased with $\delta = e^{-2\lambda}/k$. By Lemma 11, conditioned on $Z_n \in \{\alpha, \beta\}$, $Z_{-n}$ is also $\delta$-unbiased. Hence, the condition of Lemma 6 holds. Applying Lemma 6 to (50), we get that if $N_{\alpha,\beta} = O(\lambda^2 k^3\exp(12\lambda)\ln(n/\rho')/\epsilon^4)$, the following holds with probability at least $1-\rho'$:
\[
|W_{nj}(\alpha,b) - W_{nj}(\beta,b) - U^{\alpha,\beta}(j,b)| \le \epsilon, \quad \forall\, j \in [n-1],\ \forall\, b \in [k]. \tag{51}
\]
So far we have proved that (46) holds for a fixed $(\alpha, \beta)$ pair. This requires $N_{\alpha,\beta} = O(\lambda^2 k^3\exp(12\lambda)\ln(n/\rho')/\epsilon^4)$. Recall that $N_{\alpha,\beta}$ is the number of samples in which the $n$-th variable takes value $\alpha$ or $\beta$. We next derive the total number of samples needed in order to have $N_{\alpha,\beta}$ samples for a given $(\alpha,\beta)$ pair. Since $\mathcal{D}(\mathcal{W}, \Theta)$ is $\delta$-unbiased with $\delta = e^{-2\lambda(\mathcal{W},\Theta)}/k$, for $Z \sim \mathcal{D}(\mathcal{W}, \Theta)$ we have $\mathbb{P}[Z_n \in \{\alpha,\beta\}\mid Z_{-n}] \ge 2\delta$, and hence $\mathbb{P}[Z_n \in \{\alpha,\beta\}] \ge 2\delta$. By the Chernoff bound, if the total number of samples satisfies $N = O(N_{\alpha,\beta}/\delta + \log(1/\rho'')/\delta)$, then with probability at least $1-\rho''$, we have $N_{\alpha,\beta}$ samples for the given $(\alpha,\beta)$ pair.

To ensure that (51) holds for all $(\alpha,\beta)$ pairs with probability at least $1-\rho/n$, we can set $\rho' = \rho/(2nk^2)$ and $\rho'' = \rho/(2nk^2)$ and take a union bound over all $(\alpha,\beta)$ pairs. The total number of samples required is $N = O(\lambda^2 k^4\exp(14\lambda)\ln(nk/\rho)/\epsilon^4)$.

We have shown that (46) holds for the $n$-th variable with probability at least $1-\rho/n$. By the discussion at the beginning of the proof, Theorem 2 then follows by a union bound over the $n$ variables.
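Equation (47) is also the estimation rule: each edge block is recovered by averaging the centered regression matrices over the second label. A short numpy sketch of that step is given below, with a hypothetical dictionary `U[(alpha, beta)]` holding the $U^{\alpha,\beta}$ matrices for one fixed node; treating the $\alpha = \beta$ term as zero is our convention here (its target in (46) is zero) and is not taken from the paper.

```python
import numpy as np

def recover_row_block(U, j, k):
    """W_hat_{nj}(alpha, b) = (1/k) * sum_beta U^{alpha,beta}(j, b), as in (47).

    U maps pairs (alpha, beta) with alpha != beta to (n x k) arrays;
    the alpha == beta term is taken to be zero.
    """
    W_hat_nj = np.zeros((k, k))
    for alpha in range(k):
        for beta in range(k):
            if alpha != beta:
                W_hat_nj[alpha] += U[(alpha, beta)][j]
    return W_hat_nj / k
```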
H Mirror descent algorithms for constrained logistic regression

Algorithm 3 gives a mirror descent algorithm for the following $\ell_1$-constrained logistic regression:
\[
\min_{w\in\mathbb{R}^n} \frac{1}{N}\sum_{i=1}^N \ln\big(1 + e^{-y^i\langle w, x^i\rangle}\big) \quad \text{s.t. } \|w\|_1 \le W_1. \tag{52}
\]
We use the doubling trick to expand the dimension and re-scale the samples (Step 2 in Algorithm 3). The original problem then becomes a logistic regression problem over the probability simplex $\Delta_{2n+1} = \{w \in \mathbb{R}^{2n+1} : \sum_{i=1}^{2n+1} w_i = 1,\ w_i \ge 0\ \forall\, i \in [2n+1]\}$:
\[
\min_{w\in\Delta_{2n+1}} \frac{1}{N}\sum_{i=1}^N -\hat{y}^i\ln\big(\sigma(\langle w, \hat{x}^i\rangle)\big) - (1-\hat{y}^i)\ln\big(1 - \sigma(\langle w, \hat{x}^i\rangle)\big), \tag{53}
\]
where $(\hat{x}^i, \hat{y}^i) \in \mathbb{R}^{2n+1}\times\{0,1\}$. In Steps 4-11 of Algorithm 3, we follow the standard simplex setup for the mirror descent algorithm (see Section 5.3.3.2 of (Ben-Tal and Nemirovski, 2013)). Specifically, the negative entropy is used as the distance-generating function (a.k.a. the mirror map). The projection step (Step 9) can be done by a simple $\ell_1$ normalization. After that, we transform the solution back to the original space (Step 12).

Algorithm 3: Mirror descent algorithm for $\ell_1$-constrained logistic regression
Input: $\{(x^i, y^i)\}_{i=1}^N$ where $x^i \in \{-1,1\}^n$, $y^i \in \{-1,1\}$; constraint on the $\ell_1$ norm $W_1 \in \mathbb{R}_+$; number of iterations $T$.
Output: $\bar{w} \in \mathbb{R}^n$.
1: for sample $i \leftarrow 1$ to $N$ do  // Form samples $(\hat{x}^i, \hat{y}^i) \in \mathbb{R}^{2n+1}\times\{0,1\}$.
2:   $\hat{x}^i \leftarrow [x^i, -x^i, 1]\cdot W_1$, $\hat{y}^i \leftarrow (y^i+1)/2$
3: end
4: $w^1 \leftarrow [\tfrac{1}{2n+1}, \tfrac{1}{2n+1}, \cdots, \tfrac{1}{2n+1}] \in \mathbb{R}^{2n+1}$  // Initialize $w$ as the uniform distribution.
5: $\gamma \leftarrow \tfrac{1}{W_1}\sqrt{\tfrac{2\ln(2n+1)}{T}}$  // Set the step size.
6: for iteration $t \leftarrow 1$ to $T$ do
7:   $g^t \leftarrow \tfrac{1}{N}\sum_{i=1}^N \big(\sigma(\langle w^t, \hat{x}^i\rangle) - \hat{y}^i\big)\hat{x}^i$  // Compute the gradient.
8:   $w^{t+1}_i \leftarrow w^t_i\exp(-\gamma g^t_i)$, for $i \in [2n+1]$  // Coordinate-wise update.
9:   $w^{t+1} \leftarrow w^{t+1}/\|w^{t+1}\|_1$  // Projection step.
10: end
11: $\bar{w} \leftarrow \sum_{t=1}^T w^t/T$  // Aggregate the updates.
12: $\bar{w} \leftarrow (\bar{w}_{1:n} - \bar{w}_{(n+1):2n})\cdot W_1$  // Transform $\bar{w}$ back to $\mathbb{R}^n$ and the actual scale.

Algorithm 4 gives a mirror descent algorithm for the $\ell_{2,1}$-constrained logistic regression:
\[
\min_{w\in\mathbb{R}^{n\times k}} \frac{1}{N}\sum_{i=1}^N \ln\big(1 + e^{-y^i\langle w, x^i\rangle}\big) \quad \text{s.t. } \|w\|_{2,1} \le W_{2,1}. \tag{54}
\]
For simplicity, we assume that $n \ge 3$ (for $n \le 2$ one needs to switch to a different mirror map; see Section 5.3.3.3 of (Ben-Tal and Nemirovski, 2013) for more details). We then follow Section 5.3.3.3 of (Ben-Tal and Nemirovski, 2013) and use the following function as the mirror map $\Phi : \mathbb{R}^{n\times k} \to \mathbb{R}$:
\[
\Phi(w) = \frac{e\ln(n)}{p}\,\|w\|^2_{2,p}, \qquad p = 1 + 1/\ln(n). \tag{55}
\]
The update step (Step 8) can be computed efficiently in $O(nk)$ time; see the discussion in Section 5.3.3.3 of (Ben-Tal and Nemirovski, 2013) for more details.

Algorithm 4: Mirror descent algorithm for $\ell_{2,1}$-constrained logistic regression
Input: $\{(x^i, y^i)\}_{i=1}^N$ where $x^i \in \{0,1\}^{n\times k}$, $y^i \in \{-1,1\}$; constraint on the $\ell_{2,1}$ norm $W_{2,1} \in \mathbb{R}_+$; number of iterations $T$.
Output: $\bar{w} \in \mathbb{R}^{n\times k}$.
1: for sample $i \leftarrow 1$ to $N$ do  // Form samples $(\hat{x}^i, \hat{y}^i) \in \mathbb{R}^{n\times k}\times\{0,1\}$.
2:   $\hat{x}^i \leftarrow x^i\cdot W_{2,1}$, $\hat{y}^i \leftarrow (y^i+1)/2$
3: end
4: $w^1 \leftarrow [\tfrac{1}{n\sqrt{k}}, \tfrac{1}{n\sqrt{k}}, \cdots, \tfrac{1}{n\sqrt{k}}] \in \mathbb{R}^{n\times k}$  // Initialize $w$ as a constant matrix.
5: $\gamma \leftarrow \tfrac{1}{W_{2,1}}\sqrt{\tfrac{2e\ln(n)}{T}}$  // Set the step size.
6: for iteration $t \leftarrow 1$ to $T$ do
7:   $g^t \leftarrow \tfrac{1}{N}\sum_{i=1}^N \big(\sigma(\langle w^t, \hat{x}^i\rangle) - \hat{y}^i\big)\hat{x}^i$  // Compute the gradient.
8:   $w^{t+1} \leftarrow \arg\min_{\|w\|_{2,1}\le 1}\ \Phi(w) - \langle\nabla\Phi(w^t) - \gamma g^t, w\rangle$  // $\Phi(w)$ is defined in (55).
9: end
10: $\bar{w} \leftarrow \big(\sum_{t=1}^T w^t/T\big)\cdot W_{2,1}$  // Aggregate the updates.
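For concreteness, here is a direct numpy transcription of Algorithm 3. It is a sketch of the pseudocode above (the step-size formula follows the box, which reflects our reading of the source), not the authors' released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mirror_descent_l1_logreg(X, y, W1, T):
    # X: (N, n) with entries in {-1, +1}; y: (N,) in {-1, +1}; W1: l1 radius.
    N, n = X.shape
    X_hat = np.hstack([X, -X, np.ones((N, 1))]) * W1   # doubling trick + rescaling (Step 2)
    y_hat = (y + 1) / 2.0                              # labels mapped to {0, 1}
    d = 2 * n + 1
    w = np.full(d, 1.0 / d)                            # uniform point of the simplex
    gamma = np.sqrt(2.0 * np.log(d) / T) / W1          # step size
    w_bar = np.zeros(d)
    for _ in range(T):
        w_bar += w / T                                 # running average of w^1, ..., w^T
        g = X_hat.T @ (sigmoid(X_hat @ w) - y_hat) / N  # gradient of the objective (53)
        w = w * np.exp(-gamma * g)                     # entropic (exponentiated) update
        w /= w.sum()                                   # projection back to the simplex
    return (w_bar[:n] - w_bar[n:2 * n]) * W1           # map back to R^n and rescale
```

With $W_1 = 2\lambda$, this is the subroutine that Theorem 3 analyzes as the solver inside Algorithm 1.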
I Proof of Theorem 3 and Theorem 4

Lemma 12. Let $\hat{L}(w) = \frac{1}{N}\sum_{i=1}^N \ln(1 + e^{-y^i\langle w, x^i\rangle})$ be the empirical loss. Let $\hat{w}$ be a minimizer of the ERM defined in (52). The output $\bar{w}$ of Algorithm 3 satisfies
\[
\hat{L}(\bar{w}) - \hat{L}(\hat{w}) \le W_1\sqrt{\frac{2\ln(2n+1)}{T}}. \tag{56}
\]
Similarly, let $\hat{w}$ be a minimizer of the ERM defined in (54). Then the output $\bar{w}$ of Algorithm 4 satisfies
\[
\hat{L}(\bar{w}) - \hat{L}(\hat{w}) \le O(1)\cdot W_{2,1}\sqrt{\frac{\ln(n)}{T}}. \tag{57}
\]
Lemma 12 follows from the standard convergence result for the mirror descent algorithm (see, e.g., Theorem 4.2 of (Bubeck, 2015)) and the fact that the gradient $g^t$ in Step 7 of Algorithm 3 satisfies $\|g^t\|_\infty \le W_1$ (resp. the gradient $g^t$ in Step 7 of Algorithm 4 satisfies $\|g^t\|_{2,\infty} \le W_{2,1}$). This implies that the objective function after rescaling the samples is $W_1$-Lipschitz w.r.t. $\|\cdot\|_1$ (resp. $W_{2,1}$-Lipschitz w.r.t. $\|\cdot\|_{2,1}$).

We are now ready to prove Theorem 3, which is restated below.

Theorem. In the setup of Theorem 1, suppose that the $\ell_1$-constrained logistic regression in Algorithm 1 is optimized by the mirror descent method (Algorithm 3) given in Appendix H. Given $\rho \in (0,1)$ and $\epsilon > 0$, if the number of mirror descent iterations satisfies $T = O(\lambda^2\exp(12\lambda)\ln(n)/\epsilon^4)$ and the number of i.i.d. samples satisfies $N = O(\lambda^2\exp(12\lambda)\ln(n/\rho)/\epsilon^4)$, then (6) still holds with probability at least $1-\rho$. The total run-time of Algorithm 1 is $O(TNn^2)$.

Proof. We first note that in the proof of Theorem 1, we only use $\hat{w}$ in order to apply the result from Lemma 1. In the proof of Lemma 1 (given in Appendix B), there is only one place where we use the definition of $\hat{w}$: inequality (b) in (28). As a result, if we can show that (28) still holds after replacing $\hat{w}$ by $\bar{w}$, i.e.,
\[
L(\bar{w}) \le L(w^*) + O(\gamma), \tag{58}
\]
then Lemma 1 still holds, and so does Theorem 1. By Lemma 12, if the number of iterations satisfies $T = O(W_1^2\ln(n)/\gamma^2)$, then
\[
\hat{L}(\bar{w}) - \hat{L}(\hat{w}) \le \gamma. \tag{59}
\]
As a result, we have
\[
L(\bar{w}) \overset{(a)}{\le} \hat{L}(\bar{w}) + \gamma \overset{(b)}{\le} \hat{L}(\hat{w}) + 2\gamma \overset{(c)}{\le} \hat{L}(w^*) + 2\gamma \overset{(d)}{\le} L(w^*) + 3\gamma, \tag{60}
\]
where (a) follows from (26), (b) follows from (59), (c) follows from the fact that $\hat{w}$ is the minimizer of $\hat{L}(w)$, and (d) follows from (27). The number of mirror descent iterations needed for (58) to hold is $T = O(W_1^2\ln(n)/\gamma^2)$. In the proof of Theorem 1, we need to set $\gamma = O(1)\cdot\epsilon^2\exp(-6\lambda)$ (see the discussion following (42)), so the number of mirror descent iterations needed is $T = O(\lambda^2\exp(12\lambda)\ln(n)/\epsilon^4)$.

To analyze the runtime of Algorithm 1, note that for each variable in $[n]$, transforming the samples takes $O(Nn)$ time, solving the $\ell_1$-constrained logistic regression via Algorithm 3 takes $O(TNn)$ time, and updating the edge weight estimates takes $O(n)$ time. Forming the graph $\hat{G}$ over the $n$ nodes takes $O(n^2)$ time. The total runtime is therefore $O(TNn^2)$.

The proof of Theorem 4 is identical to that of Theorem 3 and is omitted here. The key step is to show that (58) holds after replacing $\hat{w}$ by $\bar{w}$; this can be done by using the convergence result in Lemma 12 and applying the same logic as in (60). The runtime of Algorithm 2 can be analyzed in the same way as above. The $\ell_{2,1}$-constrained logistic regression dominates the total runtime: it requires $O(TN_{\alpha,\beta}nk)$ time for each pair $(\alpha,\beta)$ and each variable in $[n]$, where $N_{\alpha,\beta}$ is the number of samples in which the given variable takes value $\alpha$ or $\beta$. Since $N \ge kN_{\alpha,\beta}$, the total runtime is $O(TNn^2k^2)$.
J More experimental results

We compare our algorithm (Algorithm 2) with the Sparsitron algorithm in (Klivans and Meka, 2017) on a two-dimensional 3-by-3 grid (shown in Figure 2). We experiment with three alphabet sizes: $k = 2, 4, 6$. For each value of $k$, we simulate both algorithms for 100 runs, and in each run we generate the $W_{ij}$ matrices with entries $\pm 0.2$. To ensure that each row (as well as each column) of $W_{ij}$ is centered (i.e., has zero mean), we randomly choose $W_{ij}$ between two options; as an example for $k = 2$, $W_{ij} = [0.2, -0.2;\, -0.2, 0.2]$ or $W_{ij} = [-0.2, 0.2;\, 0.2, -0.2]$. The external field is zero. Sampling is done by exactly computing the distribution. The Sparsitron algorithm requires two sets of samples: 1) to learn a set of candidate weights; and 2) to select the best candidate. We use $\max\{c_0, c_1\cdot N\}$ samples for the second part (a fixed minimum count $c_0$ and a fixed fraction $c_1$ of the sample budget). We plot the estimation error $\max_{ij}\|W_{ij} - \hat{W}_{ij}\|_\infty$ and the fraction of successful runs (i.e., runs that exactly recover the graph) in Figure 3. Compared to the Sparsitron algorithm, our algorithm requires fewer samples for successful recovery.

Figure 3: Comparison of our algorithm and the Sparsitron algorithm in (Klivans and Meka, 2017) on a two-dimensional 3-by-3 grid. The top row shows the average estimation error $\max_{ij}\|W_{ij} - \hat{W}_{ij}\|_\infty$ and the bottom row plots the fraction of successful runs (i.e., runs that exactly recover the graph), both as a function of the number of samples $N$. Each column corresponds to an alphabet size: $k = 2, 4, 6$.
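The $\pm 0.2$ construction with centered rows and columns generalizes to larger even $k$ via any sign pattern whose rows and columns sum to zero. The sketch below is our illustration of the description above (not the authors' script); whether the paper used exactly this pattern for $k > 2$ is not stated here.

```python
import numpy as np

def random_edge_block(k, rng, magnitude=0.2):
    # k x k block with entries +/- magnitude whose rows and columns all sum to zero.
    assert k % 2 == 0, "this simple construction assumes an even alphabet size"
    s = rng.permutation([1.0] * (k // 2) + [-1.0] * (k // 2))
    sign = rng.choice([-1.0, 1.0])        # the random choice between the two options
    return sign * magnitude * np.outer(s, s)

rng = np.random.default_rng(0)
W_ij = random_edge_block(2, rng)
print(W_ij)                               # one of the two k = 2 blocks described above
print(W_ij.sum(axis=0), W_ij.sum(axis=1))  # row and column sums are zero
```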