Ising Model Selection Using \ell_{1}-Regularized Linear Regression
Xiangming Meng, Tomoyuki Obuchi, and Yoshiyuki Kabashima

Institute for Physics of Intelligence and Department of Physics, The University of Tokyo, Japan; Department of Systems Science, Kyoto University, Japan. Correspondence to: Xiangming Meng <[email protected]>.

Abstract
We theoretically investigate the performance of ℓ₁-regularized linear regression (ℓ₁-LinR) for the problem of Ising model selection, using the replica method from statistical mechanics. A regular random graph is considered under the paramagnetic assumption. Our results show that, despite model misspecification, the ℓ₁-LinR estimator can successfully recover the graph structure of an Ising model with N variables using M = O(log N) samples, which is of the same order as that of ℓ₁-regularized logistic regression. Moreover, we provide a computationally efficient method to accurately predict the non-asymptotic performance of the ℓ₁-LinR estimator with moderate M and N. Simulations show an excellent agreement between theoretical predictions and experimental results, which supports our findings.
1. Introduction
The advent of massive data across various scientific disciplines has led to the widespread use of undirected graphical models, also known as Markov random fields (MRFs), as a tool for discovering and visualizing dependencies among covariates in multivariate data (Wainwright & Jordan, 2008). The Ising model, originally proposed in statistical physics, is one special class of binary MRFs with pairwise potentials and has been widely used in different domains such as image analysis, social networking, and gene network analysis (Nguyen et al., 2017; Aurell & Ekeberg, 2012; Bachschmid-Romano & Opper, 2015; Berg, 2017; Bachschmid-Romano & Opper, 2017; Abbara et al., 2020). Among various applications, one fundamental problem of interest is called Ising model selection, which refers to recovering the underlying graph structure of the original Ising model from independent, identically distributed (i.i.d.) samples.

A variety of methods have been proposed (Wainwright et al., 2007; Ravikumar et al., 2010). One representative approach is ℓ₁-regularized logistic regression (ℓ₁-LogR), for which M = O(log N) samples suffice for an Ising model with N spins under certain assumptions (Ravikumar et al., 2010). The use of the logistic loss in ℓ₁-LogR stems from its consistency with the underlying conditional distribution of the Ising model. However, in practice the model generating the data is usually unknown a priori, i.e., model mismatch or misspecification is inevitable. In this paper, we focus on one popular linear estimator called ℓ₁-regularized linear regression (ℓ₁-LinR), also widely known as the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) in statistics and machine learning, and ask whether or not the misspecified ℓ₁-LinR estimator can recover the graph structure using the same order of samples as ℓ₁-LogR. Interestingly, though ℓ₁-LinR naively ignores the nonlinear relations within the spins of the Ising model, our theoretical analysis reveals an affirmative answer in the case of the regular random (RR) graph G_{N,d,K} with constant node degree d and coupling strength K under the paramagnetic assumption.

Apart from the well-known theoretical results from the statistics community (Ravikumar et al., 2010; Santhanam & Wainwright, 2012), there is another line of research on Ising model selection (also known as the inverse Ising problem in the physics community) using the replica method from statistical mechanics (Opper & Saad, 2001; Mezard & Montanari, 2009), including theoretical analyses of the pseudo-likelihood (PL) method (Bachschmid-Romano & Opper, 2017; 2015; Berg, 2017; Abbara et al., 2020; Meng et al., 2020). For example, in Bachschmid-Romano & Opper (2017), given i.i.d. samples from an equilibrium Ising model, the performance of the PL method was studied. However, instead of graph structure learning, Bachschmid-Romano & Opper (2017) focused on the problem of parameter learning, since only the fully-connected Ising model was considered. Then, Abbara et al. (2020) extended the analysis to Ising models with sparse couplings using logistic regression without regularization. The recent work of Meng et al.
(2020) analyzed the performance of ℓ₂-regularized linear regression, but the techniques invented there are not applicable to ℓ₁-LinR since the ℓ₁-norm breaks the rotational invariance property that the ℓ₂-norm satisfies.

Regarding the study of ℓ₁-LinR (LASSO) under model misspecification, the past few years have seen a line of research in the field of signal processing with a specific focus on the single-index model (Brillinger, 1982; Plan & Vershynin, 2016; Thrampoulidis et al., 2015; Zhang et al., 2016; Genzel, 2016). These studies are closely related to ours, but there are several important differences. First, in our study the covariates are generated from an Ising model rather than a Gaussian distribution. Second, we focus on the model selection consistency of ℓ₁-LinR, while most previous studies considered estimation consistency, with the exception of Zhang et al. (2016). However, Zhang et al. (2016) only considered the classical asymptotic regime, while we are interested in the high-dimensional setting where M ≪ N. Finally, we would like to mention two additional related works, Meinshausen et al. (2006) and Zhao & Yu (2006), which also studied model selection using ℓ₁-LinR, but both of them only focused on Gaussian graphical models.

The main contribution is that, using the replica method from statistical mechanics, we demonstrate that despite model misspecification, the ℓ₁-LinR estimator is consistent for high-dimensional Ising model selection with M = O(log N) samples, which is the same (up to a constant factor) as ℓ₁-LogR. Specifically, for a RR graph G ∈ G_{N,d,K} under the paramagnetic assumption (Mezard & Montanari, 2009), we obtain a lower bound on the number of samples M > c log N / tanh²(K) for some constant c, which coincides with the information-theoretic lower bound M > c′ log N / K² (Santhanam & Wainwright, 2012) for some constant c′ at high temperatures, since tanh²(K) = O(K²) as K → 0. Our second contribution is to provide sharp predictions of the non-asymptotic behavior of ℓ₁-LinR for Ising model selection with moderate M and N, including the precision rate, the recall rate, and the residual sum of squares (RSS). It is worth pointing out that such precise non-asymptotic results have not been previously obtained for Ising model selection even with ℓ₁-LogR, and they are different from former precise asymptotic results for ℓ₁-LinR, which assumed a fixed ratio α ≡ M/N (Bayati & Montanari, 2011; Rangan et al., 2012; Thrampoulidis et al., 2015; Gerbelot et al., 2020), though a good match is also achieved there for moderate M and N. While this paper focuses on ℓ₁-LinR, our method can be easily generalized to any ℓ₁-regularized estimator with a general loss function, e.g., the regularized interaction screening estimator (Lokhov et al., 2018). Thus, an additional technical contribution is to provide a generic approach for investigating various ℓ₁-regularized estimators for Ising model selection. Although the replica method is a non-rigorous method from statistical mechanics, our result is conjectured to be exact, which is supported not only by the excellent agreement between experimental results and theoretical predictions, but also by its consistency with the rigorous information-theoretic result at high temperatures. It remains an open problem to derive a rigorous mathematical proof of our results.
2. Background and Problem Setup
The Ising model is one special class of MRFs with pairwise potentials in which each variable takes binary values (Opper & Saad, 2001; Abbara et al., 2020); it is a classical model from statistical physics. The joint probability distribution of an Ising model with N variables (spins) s = (s_i)_{i=0}^{N−1} ∈ {−1, +1}^N has the form

P_Ising(s | J*) = (1/Z(J*)) exp( Σ_{i<j} J*_{ij} s_i s_j ),   (1)

where Z(J*) is the partition function and the nonzero entries of the teacher couplings J* = (J*_{ij}) define the edges of the underlying graph. Given a dataset D^M = {s^(μ)}_{μ=1}^M of M i.i.d. samples from (1), Ising model selection aims at recovering the support of J*. A common strategy is neighborhood-based selection: one focuses on one spin, say s₀, and estimates its couplings J = (J_j) to the remaining spins through the local field h^(μ) = Σ_j J_j s_j^(μ); repeating this for every spin and aggregating the estimated neighborhoods yields the full graph. The ℓ₁-LogR estimator reads

Ĵ = argmin_J { (1/M) Σ_{μ=1}^M log(1 + e^{−2 s₀^(μ) h^(μ)}) + λ ‖J‖₁ },   (3)

which matches the conditional distribution of the Ising model, while the misspecified ℓ₁-LinR (LASSO) estimator reads

Ĵ = argmin_J { (1/(2M)) Σ_{μ=1}^M ( s₀^(μ) − h^(μ) )² + λ ‖J‖₁ },   (5)

where λ > 0 is the regularization parameter.

3. Statistical Mechanics Analysis

In this section, the statistical mechanics analysis of the ℓ₁-LinR estimator is presented. For simplicity and without loss of generality, we focus on spin s₀ and drop the corresponding subscript for notational convenience. Following the terminology in Abbara et al. (2020); Meng et al. (2020), we refer to the Ising model generating the dataset D^M with couplings J* as the teacher model. To characterize the performance of the estimator, the Precision, Recall, and RSS are considered:

Precision = TP/(TP + FP),   (6)
Recall = TP/(TP + FN),   (7)
RSS = ‖Ĵ − J*‖₂²,   (8)

where TP, FP, FN denote the numbers of true positives, false positives, and false negatives in the estimate Ĵ, respectively. The Precision and Recall characterize the performance of structure recovery, while the RSS describes the performance of parameter learning.

The basic idea of the statistical mechanical approach is to introduce the following Hamiltonian and Boltzmann distribution induced by the loss function ℓ(·):

H(J | D^M) = Σ_{μ=1}^M ℓ( s₀^(μ) h^(μ) ) + λM ‖J‖₁,   (9)
P(J | D^M) = (1/Z) e^{−βH(J|D^M)},   (10)

where Z = ∫ dJ e^{−βH(J|D^M)} is the partition function and β (> 0) is the inverse temperature. In the zero-temperature limit β → +∞, the Boltzmann distribution converges to a point-wise measure on the estimator

Ĵ = argmin_J [ (1/M) Σ_{μ=1}^M ℓ( s₀^(μ) h^(μ) ) + λ ‖J‖₁ ].   (11)

In particular, the estimator Ĵ in (11) corresponds to ℓ₁-LinR (5) and ℓ₁-LogR (3) when ℓ(x) = (x−1)²/2 and ℓ(x) = log(1 + e^{−2x}), respectively.

In statistical mechanics, macroscopic properties of (10) can be analyzed by assessing the free energy density f(D^M) = −(1/(Nβ)) log Z, which, in the current case, depends on the predetermined randomness D^M. However, as N, M → ∞, f(D^M) is expected to show the self-averaging property (Nishimori, 2001): for typical datasets D^M, f(D^M) converges to its average

f = −(1/(Nβ)) [log Z]_{D^M},   (12)

where [·]_{D^M} denotes the expectation over the dataset D^M, i.e., [·]_{D^M} = Σ_{s^(1),…,s^(M)} (·) Π_{μ=1}^M P_Ising(s^(μ)|J*). Consequently, one can analyze the typical performance of (10), and hence of the estimator (11), via the assessment of (12).

Unfortunately, computing (12) rigorously is difficult. To practically overcome this difficulty, we resort to the replica method (Opper & Saad, 2001; Nishimori, 2001; Mezard & Montanari, 2009) from statistical mechanics, which is symbolized by the identity

f = −(1/(Nβ)) [log Z]_{D^M} = −lim_{n→0} (1/(Nβ)) ∂ log [Z^n]_{D^M} / ∂n.   (13)

The basic idea is as follows. One replaces the average of log Z by that of Z^n, which is analytically tractable for n ∈ N in the large-N limit, constructs an expression analytically continuable from N to R, and then takes the limit n → 0 using that expression.
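To make the estimator (5) concrete before turning to the replica computation, the following minimal sketch (ours, not the authors' code) implements the misspecified ℓ₁-LinR neighborhood estimator with a plain ISTA (proximal gradient) loop; the function names, step size, and iteration count are illustrative choices only.

    import numpy as np

    def soft_threshold(x, tau):
        """Element-wise soft-thresholding: sign(x) * max(|x| - tau, 0)."""
        return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

    def l1_linr_neighborhood(S, lam, n_iter=2000):
        """l1-LinR (LASSO) estimate of the couplings of spin 0, eq. (5).

        S : (M, N) array of +/-1 spin samples; column 0 is the focused spin.
        lam : regularization parameter lambda.
        Returns the estimated coupling vector of length N-1.
        """
        s0, X = S[:, 0].astype(float), S[:, 1:].astype(float)
        M = X.shape[0]
        J = np.zeros(X.shape[1])
        # Step size 1/L, with L an upper bound on the Hessian spectral norm.
        L = np.linalg.norm(X, 2) ** 2 / M
        for _ in range(n_iter):
            grad = X.T @ (X @ J - s0) / M              # gradient of the square loss
            J = soft_threshold(J - grad / L, lam / L)  # proximal (ISTA) step
        return J

Running this for every spin and combining the estimated neighborhoods (e.g., by AND/OR of their supports) gives a full graph estimate.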
Although the replica method is not rigorous, it has been empirically verified through extensive studies of disordered systems in statistical physics (Opper & Saad, 2001; Mezard & Montanari, 2009) and has also been found useful in the study of high-dimensional statistical models in machine learning (Gerace et al., 2020). In several cases, the results derived by the replica method have been rigorously proved to be exact, e.g., Reeves & Pfister (2019).

Specifically, with the Hamiltonian H(J|D^M), assuming n ∈ N is a positive integer, the replicated partition function [Z^n]_{D^M} in (13) can be written as

[Z^n]_{D^M} = ∫ Π_{a=1}^n dJ^a e^{−βλM Σ_{a=1}^n ‖J^a‖₁} × { Σ_s P_Ising(s|J*) exp[ −β Σ_{a=1}^n ℓ(s₀ h^a) ] }^M,   (14)

where h^a = Σ_j J_j^a s_j will be termed the local field hereafter. The analysis below essentially depends on the distribution of the local field, which is nontrivial. To resolve this problem, we take an approach similar to Abbara et al. (2020); Meng et al. (2020) and introduce the following assumption.

Assumption 1: Denote by Ψ = {j | j ∈ N(0)} and Ψ̄ = {j | j = 1, …, N−1, j ∉ N(0)} the active and inactive sets of spin s₀, respectively, where N(0) is the neighborhood of s₀. Then, for a RR graph G ∈ G_{N,d,K} under the paramagnetic assumption, i.e., (d−1) tanh²(K) < 1, the ℓ₁-LinR estimator in (5) obeys the following form:

Ĵ_j = J̄_j + w_j/√N, j ∈ Ψ;   Ĵ_j = w_j/√N, j ∈ Ψ̄,   (15)

where J̄_j is the mean value of the estimator and w_j is a random variable that is asymptotically zero-mean with variance scaled as O(1). This assumption is verified in Appendix B. Under Assumption 1, the local fields h^a can be decomposed as

h^a = Σ_{j∈Ψ} J̄_j s_j + h_w^a,   (16)

where h_w^a ≡ (1/√N) Σ_{j∈Ψ̄} w_j^a s_j is the "noise" part. According to the central limit theorem, the noise part h_w^a can be approximated as a multivariate Gaussian variable, which, under the replica symmetric (RS) ansatz (Nishimori, 2001), can be fully described by the following two order parameters:

Q ≡ (1/N) Σ_{i,j} w_i^a C^{\0}_{ij} w_j^a,   q ≡ (1/N) Σ_{i,j} w_i^a C^{\0}_{ij} w_j^b, (a ≠ b),   (17)

where C^{\0} ≡ {C^{\0}_{ij}} is the covariance matrix of the teacher Ising model without the spin s₀. Since the difference between C^{\0} and the covariance matrix including s₀ is not essential in the limit N → ∞, the superscript \0 is hereafter discarded. As shown in Appendix A, the average free energy density (13) in the limit β → ∞ can be computed as

f(β→∞) = −Extr{ −ξ + S },   (18)

where ξ, S are the corresponding energy and entropy terms:

S = lim_{n→0} (1/(Nβ)) ∂/∂n log I,   (19)

I = ∫ Π_{a=1}^n dw^a e^{−λβM/√N Σ_{a=1}^n ‖w^a‖₁} Π_{a=1}^n δ( Σ_{i,j} w_i^a C_{ij} w_j^a − NQ ) Π_{a<b} δ( Σ_{i,j} w_i^a C_{ij} w_j^b − Nq ),   (20)

and, for the square loss of ℓ₁-LinR, the energy term evaluates to

ξ = α/(2(1+χ)) E_{s,z}[ ( s₀ − Σ_{j∈Ψ} J̄_j s_j − √Q z )² ] + λα Σ_{j∈Ψ} |J̄_j|.   (21)
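The order parameters in (17) are quadratic forms in the spin covariance matrix C, whose eigenvalue distribution ρ(γ) will enter the entropy term below. As a small self-contained illustration (ours, not part of the paper), the covariance and its spectrum can be estimated directly from samples:

    import numpy as np

    def empirical_covariance_spectrum(S):
        """Empirical covariance C = E[s s^T] of +/-1 samples and its eigenvalues.

        S : (M, N) array of spin samples. In the paramagnetic phase the
            magnetizations vanish, so no mean subtraction is performed.
        Returns (C, eigenvalues sorted ascending).
        """
        S = S.astype(float)
        C = S.T @ S / S.shape[0]       # E[s_i s_j]; diagonal entries equal 1
        gamma = np.linalg.eigvalsh(C)  # spectrum approximating rho(gamma)
        return C, gamma

A histogram of the returned eigenvalues approximates the eigenvalue distribution ρ(γ) used in G(x) and in the equations of state below; note that the mean eigenvalue is exactly 1 because the diagonal of a spin covariance matrix is 1.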
Assumption 2: Denote by C ≡ E_s[ss^T], where E_s[·] = Σ_s P_Ising(s|J*)(·), the covariance matrix of the spin configurations s. Suppose that the eigendecomposition of C is C = OΛO^T, where O is an orthogonal matrix; then O can be seen as a random sample generated from the Haar orthogonal measure, and thus, for typical graph realizations from G_{N,d,K}, I in (20) is equal to the average [I]_O.

This assumption is partly verified in Appendix C. Under Assumption 2, the entropy term S in (19) can be alternatively computed as lim_{n→0} (1/(Nβ)) ∂/∂n log [I]_O, as shown in Appendix A. Finally, under the RS ansatz, the average free energy density (13) in the limit β → ∞ associated with the ℓ₁-LinR estimator is calculated to be

f(β→∞) = −Extr_Θ{ −α/(2(1+χ)) E_{s,z}[ ( s₀ − Σ_{j∈Ψ} J̄_j s_j − √Q z )² ] − λα Σ_{j∈Ψ} |J̄_j| + (−ER + Fη) G′(−Eη) + (EQ − Fχ)/2 + (KR − Hη)/2 − E_z min_w { (K/2)w² − √H z w + (λM/√N)|w| } },   (22)

where z ∼ N(0,1), and G(x) is a function defined as

G(x) = −(1/2) log x − 1/2 + Extr_Λ{ −(1/2) ∫ log(Λ−γ) ρ(γ) dγ + Λx/2 },   (23)

with ρ(γ) the eigenvalue distribution (EVD) of the covariance matrix C, and Θ is a collection of macroscopic parameters Θ = {χ, Q, E, R, F, η, K, H, {J̄_j}_{j∈Ψ}}. For details of these macroscopic parameters and the EVD ρ(γ), please refer to Appendices A and F, respectively. Although there are no analytic solutions, these macroscopic parameters can be obtained by numerically solving the following equations, termed hereafter the equations of state (EOS), employing the physics terminology:

E = α/(1+χ),
F = α/(1+χ)² [ E_s( ( s₀ − Σ_{j∈Ψ} s_j J̄_j )² ) + Q ],
R = (1/K²)[ ( H + λ²M²/N ) erfc( λM/√(2HN) ) − λM √(2H/(πN)) e^{−λ²M²/(2HN)} ],
Eη = −∫ ρ(γ)/(Λ̃−γ) dγ,
Q = F/E² + RΛ̃ − (−ER + Fη) η / ∫ ρ(γ)/(Λ̃−γ)² dγ,
K = EΛ̃ + 1/η,
χ = 1/E + ηΛ̃,
H = R/η² + FΛ̃ + (−ER + Fη) E / ∫ ρ(γ)/(Λ̃−γ)² dγ,
η = (1/K) erfc( λM/√(2HN) ),
J̄_j = soft( tanh(K), λ(1+χ) ) / ( 1 + (d−1) tanh²(K) ), j ∈ Ψ,   (24)

where Λ̃, satisfying Eη = −∫ ρ(γ)/(Λ̃−γ) dγ, is determined by the extremization condition in (23), and soft(z, τ) = sign(z)(|z|−τ)₊ is the soft-thresholding function. Note that in (22), apart from the ratio α ≡ M/N, N and M also appear through λM/√N in the free energy result, which is different from previous results (Abbara et al., 2020; Meng et al., 2020; Gerace et al., 2020). The reason is that, thanks to the ℓ₁-regularization term λM‖J‖₁ in the Hamiltonian H(J|D^M) (9), the mean estimates {J̄_j}_{j∈Ψ} in the active set Ψ and the noise w in the inactive set Ψ̄ give essentially different scaling contributions to the free energy density, i.e., the factor λα in front of Σ_{j∈Ψ}|J̄_j| and the factor λM/√N in front of |w| in (22). Consequently, the two different scaling factors cannot be simultaneously absorbed by any rescaling of λ. For example, if λ = O(1/√N), the coefficient of Σ_{j∈Ψ}|J̄_j| scales as O(1/√N) while that of |w| scales as O(1) when α = O(1), implying that non-negligible false positives appear in the noise estimates while the bias from the ℓ₁ penalty disappears from the mean estimates.
Meanwhile, if λ = O(1), the coefficient of Σ_{j∈Ψ}|J̄_j| is O(1) while that of |w| scales as O(√N), implying a strong penalty that completely suppresses false positives in the large-N limit.

From the free energy result (22) and assumption (15), we obtain explicit expressions for the ℓ₁-LinR estimator. Specifically, the mean estimates {J̄_j}_{j∈Ψ} in the active set Ψ are

J̄_j = soft( tanh(K), λ(1+χ) ) / ( 1 + (d−1) tanh²(K) ), j ∈ Ψ,   (25)

while the noise estimates {Ĵ_j}_{j∈Ψ̄} in the inactive set Ψ̄ are

Ĵ_j = (√H/(K√N)) soft( z_j, λM/√(HN) ), j ∈ Ψ̄,   (26)

where z_j ∼ N(0,1), j ∈ Ψ̄, are i.i.d. standard Gaussian random variables. The results (25) and (26) assert that the ℓ₁-LinR estimator is decoupled and its asymptotic behavior can be described by two scalar soft-thresholding estimators, one for the active set and one for the inactive set. Consequently, the statistical properties of ℓ₁-LinR can be readily obtained once the EOS (24) is solved. The derivation and interpretation of (25) and (26) are given in Appendix A.3.

In the high-dimensional setting, where the number of vertices N in the graph is allowed to grow as a function of the number of samples M, one important question for Ising model selection is what number of samples M is required to successfully recover the graph structure as N → ∞. As defined in (6), successful recovery is achieved if and only if Precision = 1 and Recall = 1, which can be evaluated with the two scalar estimators (25) and (26). However, there is no analytical solution to the EOS (24), which makes it difficult to derive an explicit condition. To overcome this difficulty, we perform a perturbation analysis of the EOS (24) and obtain the asymptotic relation H ≃ F⟨γ⟩, where ⟨γ⟩ is the mean eigenvalue of the covariance matrix C. Then we obtain that, for a RR graph G ∈ G_{N,d,K}, given M i.i.d. samples D^M, the ℓ₁-LinR estimator (5) can successfully recover the graph structure G as N → ∞ if

M > c(λ, K) log N / λ²,   0 < λ < tanh(K),   (27)

where c(λ, K) is a constant dependent on the regularization parameter λ and coupling strength K, and a sharp prediction (as verified in Sec. 4) is obtained as

c(λ, K) = 2 ( 1 − tanh²(K) + dλ² ) ⟨γ⟩ / ( 1 + (d−1) tanh²(K) ).   (28)

For details of the analysis, including that of ℓ₁-LogR, which yields a result similar to (27) but with a different value of c(λ, K), see Appendix D. Consequently, despite model misspecification, the ℓ₁-LinR estimator is model-selection consistent with M = O(log N) samples, which is of the same order as ℓ₁-LogR. Note that an analytical result for c(λ, K) is not available for ℓ₁-LogR, but numerical results show that there is only a slight difference in c̄(λ, K) ≡ c(λ, K)/λ² between the ℓ₁-LinR and ℓ₁-LogR estimators; see Fig. 2.

The result in (27) is derived for ℓ₁-LinR with a fixed regularization parameter λ. Since the value of λ is upper-bounded by tanh(K) (otherwise false negatives occur, as discussed in Appendix D), a universal lower bound on the number of samples M for ℓ₁-LinR is obtained as

M > 2⟨γ⟩ log N / tanh²(K).   (29)
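The threshold (27)-(28) is simple enough to evaluate directly. The following minimal sketch (ours; the function names and the example values d = 3, K = 0.4, λ = 0.3 are illustrative) computes c(λ, K) and the smallest sample size satisfying (27), using the fact that ⟨γ⟩ = 1 for a spin covariance matrix:

    import numpy as np

    def c_lambda_K(lam, K, d, gamma_mean=1.0):
        """Sharp constant c(lambda, K) of eq. (28) for the RR graph G_{N,d,K}."""
        t2 = np.tanh(K) ** 2
        return 2.0 * (1.0 - t2 + d * lam ** 2) * gamma_mean / (1.0 + (d - 1) * t2)

    def min_samples(lam, K, d, N):
        """Smallest M satisfying (27): M > c(lambda, K) * log(N) / lambda^2."""
        if not 0.0 < lam < np.tanh(K):
            raise ValueError("need 0 < lambda < tanh(K) so that Recall -> 1")
        return int(np.ceil(c_lambda_K(lam, K, d) * np.log(N) / lam ** 2))

    # Illustrative example: d = 3, K = 0.4, lambda = 0.3, N = 10**4.
    print(min_samples(0.3, 0.4, 3, 10_000))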
Interestingly, the lower bound in (29) coincides (up to a constant factor) with the information-theoretic result M > c log N / K² obtained in Santhanam & Wainwright (2012), since tanh²(K) = O(K²) as K → 0 at high temperatures. This encouraging result demonstrates that, even under model misspecification, the simple ℓ₁-LinR estimator can approach the optimal performance for Ising model selection up to a constant factor.

It is desirable in practice to predict the non-asymptotic performance of the ℓ₁-LinR estimator for finite M, N. However, it is found that solving (25) and (26) simply by inserting finite values of M, N does not provide good consistency with some of the experimental results, in particular the recall rate. This is reasonable since, in obtaining the energy term ξ in (21), the fluctuations around the mean estimates {J̄_j}_{j∈Ψ} due to the finite-size effect of M are not taken into account correctly by the expectation E_{s,z}(·). To address this problem, we replace E_{s,z}(·) with a finite sample average, and the estimates {Ĵ_j}_{j∈Ψ} in the active set can be obtained by solving the following d-dimensional optimization problem:

min_{J_j, j∈Ψ} [ Σ_{μ=1}^M ( s₀^μ − Σ_{j∈Ψ} s_j^μ J_j − √Q z^μ )² / (2(1+χ)M) + λ Σ_{j∈Ψ} |J_j| ],   (30)

where s₀^μ, s^μ_{j,j∈Ψ} ∼ P(s₀, s_Ψ|J*) and z^μ ∼ N(0,1), μ = 1, …, M. The solution to (30) is equivalent to (25) as M → ∞, but (30) enables us to capture the fluctuations of {Ĵ_j}_{j∈Ψ} when M is finite. Note that, due to the modification in (30), the solutions to the EOS (24) also need to be modified accordingly to take into account the finite-size effect. One can solve them iteratively, as sketched in Algorithm 1; the implementation details are shown in Appendix E.1.

Algorithm 1 Numerical method to solve the EOS (24) together with (30) for moderate M, N for the ℓ₁-LinR estimator.
Initialization: χ, Q, E, R, F, η, K, H.
repeat
  for t = 1 to T_MC do
    Draw random samples s₀^μ, s^μ_{j,j∈Ψ} ∼ P(s₀, s_Ψ|J*) and z^μ ∼ N(0,1), μ = 1, …, M.
    Obtain the solutions {Ĵ_j}_{j∈Ψ} of (30).
    Compute Δ^(t) = (1/M) Σ_{μ=1}^M ( s₀^μ − Σ_{j∈Ψ} s_j^μ Ĵ_j )².
  end for
  Update the values of χ, Q, E, R, F, η, K, H by solving the EOS (24) with the substitution E_s( (s₀ − Σ_{j∈Ψ} s_j J̄_j)² ) = (1/T_MC) Σ_{t=1}^{T_MC} Δ^(t).
until convergence.

Consequently, given the solutions of R, H, Q, χ from Algorithm 1 for moderate M, N, the non-asymptotic statistical properties of the ℓ₁-LinR estimator can be evaluated using computationally tractable MC simulations of the reduced d-dimensional ℓ₁-LinR estimator (30) and the scalar estimator (26). Denote by {Ĵ_j^t}, t = 1, …, T_MC, the estimates in the t-th MC simulation, where Ĵ^t_{j,j∈Ψ} and Ĵ^t_{j,j∈Ψ̄} are solutions of (30) and (26), respectively, and T_MC is the total number of MC simulations.
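To make the inner step of Algorithm 1 concrete, here is a minimal sketch (ours, not the authors' code) of one MC run of the reduced problem (30). It assumes that, under the paramagnetic assumption, the joint distribution of the focused spin and its d neighbors reduces to the star form P(s₀, s_Ψ) ∝ exp(K s₀ Σ_{j∈Ψ} s_j), and it uses a plain ISTA loop:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_star(M, d, K):
        """Draw M samples (s0, s_Psi) from P ~ exp(K * s0 * sum(s_Psi)) by
        exact enumeration of the 2^(d+1) star configurations."""
        confs = np.array([[1 if (c >> k) & 1 else -1 for k in range(d + 1)]
                          for c in range(2 ** (d + 1))])
        weights = np.exp(K * confs[:, 0] * confs[:, 1:].sum(axis=1))
        idx = rng.choice(len(confs), size=M, p=weights / weights.sum())
        return confs[idx, 0], confs[idx, 1:]

    def reduced_l1_linr(M, d, K, lam, Q, chi, n_iter=500):
        """Solve the reduced d-dimensional problem (30) by ISTA for one MC run."""
        s0, S = sample_star(M, d, K)
        S = S.astype(float)
        y = s0 - np.sqrt(Q) * rng.standard_normal(M)   # effective response
        J = np.zeros(d)
        L = np.linalg.norm(S, 2) ** 2 / ((1 + chi) * M)
        for _ in range(n_iter):
            grad = S.T @ (S @ J - y) / ((1 + chi) * M)
            u = J - grad / L
            J = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)
        return J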
Then, from the definitions (6)-(8), the Precision, Recall, and RSS are computed as

Precision = (1/T_MC) Σ_{t=1}^{T_MC} ‖Ĵ^t_{j,j∈Ψ}‖₀ / ( ‖Ĵ^t_{j,j∈Ψ}‖₀ + ‖Ĵ^t_{j,j∈Ψ̄}‖₀ ),   (31)

Recall = (1/T_MC) Σ_{t=1}^{T_MC} ‖Ĵ^t_{j,j∈Ψ}‖₀ / d,   (32)

RSS = (1/T_MC) Σ_{t=1}^{T_MC} Σ_{j∈Ψ} |Ĵ_j^t − K|² + R,   (33)

where ‖·‖₀ is the ℓ₀-norm, indicating the number of nonzero elements.
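The formulas (31)-(33) amount to simple counting over the MC runs; a minimal sketch (ours, with illustrative argument names) is:

    import numpy as np

    def mc_metrics(J_active_runs, J_inactive_runs, K, R, tol=0.0):
        """Precision/Recall/RSS of eqs. (31)-(33) from T_MC Monte Carlo runs.

        J_active_runs   : (T_MC, d) estimates on the active set from (30)
        J_inactive_runs : (T_MC, N-1-d) scalar-estimator draws from (26)
        K, R            : teacher coupling strength and the EOS solution R
        """
        tp = (np.abs(J_active_runs) > tol).sum(axis=1)    # ||.||_0, active set
        fp = (np.abs(J_inactive_runs) > tol).sum(axis=1)  # ||.||_0, inactive set
        precision = np.mean(tp / (tp + fp))
        recall = np.mean(tp / J_active_runs.shape[1])
        rss = np.mean(((J_active_runs - K) ** 2).sum(axis=1)) + R
        return precision, recall, rss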
4. Experimental results

In this section, we conduct numerical experiments to verify the accuracy of the theoretical analysis. The setup is as follows. The RR graph G ∈ G_{N,d,K} with node degree d = 3 and coupling strength K = 0.4 is considered, which satisfies the paramagnetic condition (d−1) tanh²(K) < 1. The active couplings {J_ij}_{(i,j)∈E} take both signs, +K or −K, with equal probability. (Though this setting differs from the analysis, where the nonzero teacher couplings take a uniform sign, the results can be directly compared thanks to gauge symmetry (Nishimori, 2001).) The experimental procedure is as follows. First, a random graph G ∈ G_{N,d,K} is generated and the Ising model is defined on it. Then, spin snapshots are obtained using MC sampling, yielding the dataset D^M. We randomly choose a center spin s₀ and infer its neighborhood using the ℓ₁-LinR (5) and ℓ₁-LogR (3) estimators. To obtain standard error bars, we repeat this sequence of operations many times.

We first verify the precise non-asymptotic predictions described in Sec. 3.4. Fig. 1 shows the replica and experimental results of RSS, Precision, and Recall for both ℓ₁-LinR (5) and ℓ₁-LogR (3) when λ = 0.3 and N = 200, 400, with different values of α ≡ M/N.

Figure 1. Theoretical and experimental results of RSS, Precision, and Recall for both ℓ₁-LinR and ℓ₁-LogR when λ = 0.3, N = 200, 400, with different values of α ≡ M/N. The standard error bars are obtained from 5 random runs, each consisting of multiple MC simulations. An excellent agreement between theory and experiment is achieved, even for small N = 200 and small α (small M).

For both ℓ₁-LinR and ℓ₁-LogR, there is an excellent agreement between the theoretical predictions and experimental results, even for small N = 200 and small α (equivalently small M), verifying the correctness of the replica analysis. Interestingly, quantitatively similar behavior between ℓ₁-LinR and ℓ₁-LogR is observed in Precision and Recall. The results for λ = 0.1 in the same setting as Fig. 1 are shown in Appendix G.

Subsequently, the asymptotic result and the sharpness of c(λ, K) in (27) are evaluated. As it is intractable to simulate the limit N → ∞, we investigate the trend of Precision and Recall as N increases. Based on the replica analysis in Sec. 3.3, both the Precision and the Recall will increase as N increases with M = c log N samples when c > c̄(λ, K) ≡ c(λ, K)/λ²; otherwise, the Precision will decrease as N increases when c < c̄(λ, K). Note that the Recall will increase with N as long as λ takes a valid value.

Figure 2. Critical scaling value c̄(λ, K) ≡ c(λ, K)/λ² of ℓ₁-LinR and ℓ₁-LogR for the RR graph G ∈ G_{N,d,K} with d = 3, K = 0.4. The value of c̄(λ, K) of the ℓ₁-LogR estimator is about the same as that of the ℓ₁-LinR estimator, though it is slightly smaller. Note that c(λ, K) of ℓ₁-LinR is obtained from (28), while that of ℓ₁-LogR is numerically obtained by Algorithm 3 in Appendix E with N = 800, M = 4000, since there is no analytical solution.

Fig. 2 shows a comparison of the critical scaling value c̄(λ, K) between ℓ₁-LinR and ℓ₁-LogR for the RR graph G ∈ G_{N,d,K} when d = 3, K = 0.4. It can be seen that the value c̄(λ, K) of ℓ₁-LogR is only slightly smaller than that of ℓ₁-LinR. Then, we conducted experiments with M = c log N for different values of c around c̄(λ, K). Two typical values, λ = 0.1 and λ = 0.3, are evaluated; from (28), c̄(λ = 0.1, K) ≈ 137.4 and c̄(λ = 0.3, K) ≈ 19.4 for the ℓ₁-LinR estimator, while the corresponding values for the ℓ₁-LogR estimator are slightly smaller (see Fig. 2). Experimental results are simulated for several values of N starting from N = 200.

Figure 3. Precision and Recall versus N when M = c log N and K = 0.4 for ℓ₁-LinR and ℓ₁-LogR when λ = 0.3, where c̄(λ, K) ≈ 19.4. When c > c̄(λ, K), the Precision increases consistently with N and approaches 1 as N → ∞, while it decreases consistently with N when c < c̄(λ, K).

Figure 4. Precision and Recall versus N when M = c log N and K = 0.4 for ℓ₁-LinR and ℓ₁-LogR when λ = 0.1, where c̄(λ, K) ≈ 137.4. When c > c̄(λ, K), the Precision increases consistently with N and approaches 1 as N → ∞, while it decreases consistently with N when c < c̄(λ, K). The Recall increases consistently and approaches 1 as N → ∞.

As shown in Fig. 3 and Fig. 4, apart from the good agreement between theoretical predictions and experimental results, when c > c̄(λ, K) the Precision increases consistently with N and approaches 1 as N → ∞, and it decreases consistently with N when c < c̄(λ, K), while the Recall increases consistently and approaches 1 as N → ∞.

5. Conclusion

In this paper, we theoretically analyzed the performance of ℓ₁-regularized linear regression (ℓ₁-LinR) for Ising model selection using the replica method from statistical mechanics. It is demonstrated that, in the case of the RR graph under the paramagnetic assumption, although there is model misspecification, one can still successfully recover the graph structure of the Ising model using the simple ℓ₁-LinR estimator with M = O(log N) samples, which is of the same order as for the matched ℓ₁-LogR estimator. This implies the robustness of the ℓ₁-LinR estimator to model misspecification. Moreover, we provide a computationally tractable method to obtain sharp predictions of the non-asymptotic behavior of ℓ₁-LinR for moderate M, N. There is an excellent agreement between the theoretical predictions and experimental results, which supports our findings.

Several key assumptions are made in our theoretical analysis of the Ising model, such as the paramagnetic assumption, which implies that the coupling strength should be sufficiently small. These assumptions restrict the applicability of the presented results, and overcoming such limitations will be an important direction for future work.
Another important direction is to investigate the performance of the ℓ₁-LinR estimator for general Ising models beyond the RR graph (Bresler, 2015), e.g., graphs with unbounded degree.

References

Abbara, A., Kabashima, Y., Obuchi, T., and Xu, Y. Learning performance in inverse Ising problems with sparse teacher couplings. Journal of Statistical Mechanics: Theory and Experiment, 2020(7):073402, 2020.

Aurell, E. and Ekeberg, M. Inverse Ising inference using all the data. Physical Review Letters, 108(9):090201, 2012.

Bachschmid-Romano, L. and Opper, M. Learning of couplings for random asymmetric kinetic Ising models revisited: random correlation matrices and learning curves. Journal of Statistical Mechanics: Theory and Experiment, 2015(9):P09016, 2015.

Bachschmid-Romano, L. and Opper, M. A statistical physics approach to learning curves for the inverse Ising problem. Journal of Statistical Mechanics: Theory and Experiment, 2017(6):063406, 2017.

Bayati, M. and Montanari, A. The LASSO risk for Gaussian matrices. IEEE Transactions on Information Theory, 58(4):1997-2017, 2011.

Berg, J. Statistical mechanics of the inverse Ising problem and the optimal objective function. Journal of Statistical Mechanics: Theory and Experiment, 2017(8):083402, 2017.

Besag, J. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society: Series D (The Statistician), 24(3):179-195, 1975.

Bresler, G. Efficiently learning Ising models on arbitrary graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 771-782, 2015.

Brillinger, D. R. A generalized linear model with Gaussian regressor variables. In A Festschrift for Erich L. Lehmann, pp. 97-114, 1982.

Decelle, A. and Ricci-Tersenghi, F. Pseudolikelihood decimation algorithm improving the inference of the interaction network in a general class of Ising models. Physical Review Letters, 112(7):070603, 2014.

Diaconis, P. and Shahshahani, M. On the eigenvalues of random matrices. Journal of Applied Probability, 31(A):49-62, 1994.

Genzel, M. High-dimensional estimation of structured signals from non-linear observations with general convex loss functions. IEEE Transactions on Information Theory, 63(3):1601-1619, 2016.

Gerace, F., Loureiro, B., Krzakala, F., Mézard, M., and Zdeborová, L. Generalisation error in learning with random features and the hidden manifold model. arXiv preprint arXiv:2002.09339, 2020.

Gerbelot, C., Abbara, A., and Krzakala, F. Asymptotic errors for convex penalized linear regression beyond Gaussian matrices. arXiv preprint arXiv:2002.04372, 2020.

Höfling, H. and Tibshirani, R. Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. Journal of Machine Learning Research, 10(4):883-906, 2009.

Johansson, K. On random matrices from the compact classical groups. Annals of Mathematics, pp. 519-545, 1997.

Lokhov, A. Y., Vuffray, M., Misra, S., and Chertkov, M. Optimal structure and parameter learning of Ising models. Science Advances, 4(3):e1700791, 2018.

McKay, B. D. The expected eigenvalue distribution of a large regular graph. Linear Algebra and its Applications, 40:203-216, 1981.

Meinshausen, N. and Bühlmann, P. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436-1462, 2006.

Meng, X., Obuchi, T., and Kabashima, Y. Structure learning in inverse Ising problems using ℓ₂-regularized linear estimator. arXiv preprint arXiv:2008.08342, 2020.
Mezard, M. and Montanari, A. Information, Physics, and Computation. Oxford University Press, 2009.

Nguyen, H. C. and Berg, J. Bethe-Peierls approximation and the inverse Ising problem. Journal of Statistical Mechanics: Theory and Experiment, 2012(03):P03004, 2012.

Nguyen, H. C., Zecchina, R., and Berg, J. Inverse statistical problems: from the inverse Ising problem to data science. Advances in Physics, 66(3):197-261, 2017.

Nishimori, H. Statistical Physics of Spin Glasses and Information Processing: An Introduction. Number 111. Clarendon Press, 2001.

Opper, M. and Saad, D. Advanced Mean Field Methods: Theory and Practice. MIT Press, 2001.

Plan, Y. and Vershynin, R. The generalized lasso with non-linear observations. IEEE Transactions on Information Theory, 62(3):1528-1537, 2016.

Prasad, A., Srinivasan, V., Balakrishnan, S., and Ravikumar, P. On learning Ising models under Huber's contamination model. Advances in Neural Information Processing Systems, 33, 2020.

Rangan, S., Fletcher, A. K., and Goyal, V. K. Asymptotic analysis of MAP estimation via the replica method and applications to compressed sensing. IEEE Transactions on Information Theory, 58(3):1902-1923, 2012.

Ravikumar, P., Wainwright, M. J., and Lafferty, J. D. High-dimensional Ising model selection using ℓ₁-regularized logistic regression. The Annals of Statistics, 38(3):1287-1319, 2010.

Reeves, G. and Pfister, H. D. The replica-symmetric prediction for random linear estimation with Gaussian matrices is exact. IEEE Transactions on Information Theory, 65(4):2252-2283, 2019.

Ricci-Tersenghi, F. The Bethe approximation for solving the inverse Ising problem: a comparison with other inference methods. Journal of Statistical Mechanics: Theory and Experiment, 2012(08):P08015, 2012.

Santhanam, N. P. and Wainwright, M. J. Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory, 58(7):4117-4134, 2012.

Thrampoulidis, C., Abbasi, E., and Hassibi, B. LASSO with non-linear measurements is equivalent to one with linear measurements. Advances in Neural Information Processing Systems, 28:3420-3428, 2015.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267-288, 1996.

Vuffray, M., Misra, S., Lokhov, A., and Chertkov, M. Interaction screening: efficient and sample-optimal learning of Ising models. In Advances in Neural Information Processing Systems, pp. 2595-2603, 2016.

Wainwright, M. J. and Jordan, M. I. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc, 2008.

Wainwright, M. J., Lafferty, J. D., and Ravikumar, P. K. High-dimensional graphical model selection using ℓ₁-regularized logistic regression. In Advances in Neural Information Processing Systems, pp. 1465-1472, 2007.

Zhang, Y., Guo, W., and Ray, S. On the consistency of feature selection with lasso for non-linear targets. In International Conference on Machine Learning, pp. 183-191. PMLR, 2016.

Zhao, P. and Yu, B. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541-2563, 2006.

Ising Model Selection Using ℓ₁-Regularized Linear Regression: Supplementary Material

A. Free energy density f computation

The detailed derivation of the average free energy density f = −(1/(Nβ)) [log Z]_{D^M} in (13) using the replica method is presented.
For generality, an arbitrary loss function ℓ(·) is adopted in the following derivation. Afterwards, specific results are provided for both the ℓ₁-LinR estimator (5) with the square loss ℓ(x) = (x−1)²/2 and the ℓ₁-LogR estimator (3) with the logistic loss ℓ(x) = log(1 + e^{−2x}).

A.1. Energy term ξ of f

The key of the replica method is to compute the replicated partition function [Z^n]_{D^M}. According to the definition in (14) and Assumption 1 in Sec. 3.2, writing α ≡ M/N, the average replicated partition function [Z^n]_{D^M} can be rewritten as

[Z^n]_{D^M} = ∫ Π_{a=1}^n dJ^a e^{−βλM Σ_{a=1}^n Σ_j |J_j^a|} { Σ_s P_Ising(s|J*) exp[ −β Σ_{a=1}^n ℓ(s₀ h^a) ] }^M
≈ ∫ Π_{a=1}^n dw^a e^{−βλM( n Σ_{j∈Ψ} |J̄_j| + Σ_{a=1}^n ‖w^a‖₁/√N )} × { Σ_s P_Ising(s|J*) Π_a ∫ dh_w^a δ( h_w^a − (1/√N) Σ_{j∈Ψ̄} w_j^a s_j ) e^{−β Σ_{a=1}^n ℓ( s₀( Σ_{j∈Ψ} J̄_j s_j + h_w^a ) )} }^{αN}
= ∫ Π_{a=1}^n dw^a e^{−βλM( n Σ_{j∈Ψ} |J̄_j| + Σ_{a=1}^n ‖w^a‖₁/√N )} × { Σ_{s₀,s_Ψ} ∫ Π_{a=1}^n dh_w^a P( s₀, s_Ψ, {h_w^a}_a | J*, {w^a}_a ) e^{−β Σ_{a=1}^n ℓ( s₀( Σ_{j∈Ψ} J̄_j s_j + h_w^a ) )} }^{αN}
≈ ∫ Π_{a=1}^n dw^a e^{−βλM( n Σ_{j∈Ψ} |J̄_j| + Σ_{a=1}^n ‖w^a‖₁/√N )} × { Σ_{s₀,s_Ψ} P(s₀, s_Ψ | J*) ∫ Π_{a=1}^n dh_w^a P_noise( {h_w^a}_a | {w^a}_a ) e^{−β Σ_{a=1}^n ℓ( s₀( Σ_{j∈Ψ} J̄_j s_j + h_w^a ) )} }^{αN},   (34)

where the contributions {w_j^a/√N, j ∈ Ψ} on the finite active set Ψ are neglected in the second line when N is large, P(s₀, s_Ψ|J*) = Σ_{s_Ψ̄} P_Ising(s|J*) is the marginal distribution of s₀, s_Ψ, which can be computed as in Abbara et al. (2020), and P_noise({h_w^a}_a|{w^a}_a) is the distribution of the "noise" part h_w^a ≡ (1/√N) Σ_{j∈Ψ̄} w_j^a s_j of the local field. In the last line, the asymptotic independence between h_w^a and (s₀, s_Ψ) is applied, as discussed in Abbara et al. (2020).

To proceed with the calculation, according to the central limit theorem (CLT), the noise parts {h_w^a}_{a=1}^n can be regarded as Gaussian variables, so that P_noise({h_w^a}_a|{w^a}_a) can be approximated as a multivariate Gaussian distribution.
Jjsj + √ qz )] Q − q ) e − β(cid:96) ( y ) , (46)where in the last line, a change of variable y = s (cid:16)(cid:80) j ∈ Ψ ¯ J j s j + √ Q − qv + √ qz (cid:17) is used.As a result, from (13), the average free energy density in the limit β → ∞ reads f ( β → ∞ ) = lim β →∞ − N β (cid:26) lim n → ∂∂n log I + M lim n → ∂∂n log L (cid:27) = − Extr {− ξ + S } , (47)where Extr {·} denotes extremization w.r.t. some relevant variables, and ξ, S are the corresponding energy and entropyterms of f , respectively: S = lim n → N β ∂∂n log I, (48) I = (cid:90) n (cid:89) a =1 dw a e − λβ (cid:80) na =1 (cid:107) w a (cid:107) n (cid:89) a =1 δ (cid:88) i,j w ai C ij w aj − N Q (cid:89) a
To evaluate I under Assumption 2, the delta functions are expressed in Fourier representation with conjugate parameters (Q̂, q̂), and two additional order parameters R ≡ (1/N)(w^a)^T w^a and r ≡ (1/N)(w^a)^T w^b (a ≠ b) are introduced with conjugates (R̂, r̂). The Haar average over O then takes the form

[I]_O ∝ ∫ dQ̂ dq̂ exp{ (Nn/2) Q̂Q − (Nn(n−1)/2) q̂q } × ⟨ e^{Tr(C L_n)} ⟩_O,   (54)

with

L_n = −((Q̂+q̂)/2) Σ_{a=1}^n w^a (w^a)^T + (q̂/2) ( Σ_{a=1}^n w^a ) ( Σ_{b=1}^n w^b )^T.   (55)

To proceed with the computation, the eigendecomposition of the matrix L_n is performed. After some algebra, for configurations of w^a that satisfy both (w^a)^T w^a = NR and (w^a)^T w^b = Nr, the eigenvalues and associated eigenvectors of L_n are

λ₁ = −(N/2)(Q̂ + q̂ − nq̂)(R − r + nr),   u₁ = Σ_{a=1}^n w^a,
λ₂ = −(N/2)(Q̂ + q̂)(R − r),   u_a = w^a − (1/n) Σ_{b=1}^n w^b,   a = 2, …, n,   (56)

where λ₁ is the eigenvalue corresponding to the eigenvector u₁, while λ₂ is the (n−1)-fold degenerate eigenvalue corresponding to the eigenvectors u_a, a = 2, …, n. To compute ⟨e^{Tr(CL_n)}⟩_O, we define a function G(x) as

G(x) ≡ (1/N) log ⟨ exp( (x/2) u^T C u ) ⟩_O = Extr_Λ{ −(1/2) ∫ log(Λ−γ) ρ(γ) dγ + Λx/2 } − (1/2) log x − 1/2,   (57)

for a vector u with u^T u = N whose direction is Haar-distributed, where ρ(γ) is the eigenvalue distribution (EVD) of C. Then, combined with (56), after some algebra we obtain

(1/N) log ⟨ e^{Tr(CL_n)} ⟩_O = G( −(Q̂ + q̂ − nq̂)(R − r + nr) ) + (n−1) G( −(Q̂ + q̂)(R − r) ).   (58)

Furthermore, replacing the original delta functions in (53) as δ( (w^a)^T w^a − NR ) = ∫ dR̂ e^{−(R̂/2)( (w^a)^T w^a − NR )} and δ( (w^a)^T w^b − Nr ) = ∫ dr̂ e^{(r̂/2)( (w^a)^T w^b − Nr )}, we obtain

[I]_O = ∫ dR dr dQ̂ dq̂ dR̂ dr̂ Π_{a=1}^n dw^a exp{ −Σ_{a=1}^n (λβM/√N) ‖w^a‖₁ − ((R̂+r̂)/2) Σ_{a=1}^n (w^a)^T w^a + (r̂/2) Σ_{a,b} (w^a)^T w^b } × exp{ (Nn/2) R̂R − (Nn(n−1)/2) r̂r + (Nn/2) Q̂Q − (Nn(n−1)/2) q̂q } × ⟨ e^{Tr(CL_n)} ⟩_O.   (59)

In addition, using a Gaussian integral (Hubbard-Stratonovich transformation), the w-integral can be linearized as

∫ Π_{a=1}^n dw^a exp{ −Σ_{a=1}^n Σ_{i=1}^N (λβM/√N) |w_i^a| − ((R̂+r̂)/2) Σ_{a=1}^n Σ_{i=1}^N (w_i^a)² + (r̂/2) Σ_{i=1}^N ( Σ_{a=1}^n w_i^a )² }
= Π_i ∫ Dz_i { ∫ dw exp[ −((R̂+r̂)/2) w² + ( √r̂ z_i − (λβM/√N) sign(w) ) w ] }^n,

where Dz_i = dz_i/√(2π) e^{−z_i²/2}. Consequently, the entropy term S of the free energy density f is computed from

lim_{n→0} (1/N) ∂/∂n log [I]_O = ( q̂(R−r) − (Q̂+q̂)r ) G′( −(Q̂+q̂)(R−r) ) + G( −(Q̂+q̂)(R−r) ) + ( R̂R + r̂r + Q̂Q + q̂q )/2 + ∫ Dz log ∫ dw exp[ −((R̂+r̂)/2) w² + ( √r̂ z − (λβM/√N) sign(w) ) w ].
For β → ∞, according to the characteristics of the Boltzmann distribution, the following scaling relations are assumed to hold:

Q̂ + q̂ ≡ βE,   q̂ ≡ β²F,   R̂ + r̂ ≡ βK,   r̂ ≡ β²H,   β(Q − q) ≡ χ,   β(R − r) ≡ η.   (60)

Finally, the entropy term is computed as

S = (−ER + Fη) G′(−Eη) + (EQ − Fχ)/2 + (KR − Hη)/2 − ∫ Dz min_w { (K/2) w² − ( √H z − (λM/√N) sign(w) ) w }.   (61)

A.3. Free energy density result

Combining the results (50) and (61), the free energy density for a general loss function ℓ(·) in the limit β → ∞ is obtained as

f(β→∞) = −Extr_Θ{ −α E_{s,z}( min_y [ ( y − s₀(√Q z + Σ_{j∈Ψ} J̄_j s_j) )²/(2χ) + ℓ(y) ] ) − αλ Σ_{j∈Ψ} |J̄_j| + (−ER + Fη) G′(−Eη) + (EQ − Fχ)/2 + (KR − Hη)/2 − E_z( min_w { (K/2) w² − ( √H z − (λM/√N) sign(w) ) w } ) },   (62)

where the values of the parameters Θ = {χ, Q, E, R, F, η, K, H, {J̄_j}_{j∈Ψ}} are determined by the extremization condition, i.e., by solving the equations of state (EOS). For a general loss function ℓ(y), the EOS for (62) read

ŷ(s, z, χ, Q, J) = argmin_y { ( y − s₀(√Q z + Σ_{j∈Ψ} J̄_j s_j) )²/(2χ) + ℓ(y) },
E = (α/√Q) E_{s,z}( s₀ z dℓ(y)/dy |_{y=ŷ(s,z,χ,Q,J)} ),
F = α E_{s,z}( ( dℓ(y)/dy |_{y=ŷ(s,z,χ,Q,J)} )² ),
R = (1/K²)[ ( H + λ²M²/N ) erfc( λM/√(2HN) ) − λM √(2H/(πN)) e^{−λ²M²/(2HN)} ],
Eη = −∫ ρ(γ)/(Λ̃−γ) dγ,
Q = F/E² + RΛ̃ − (−ER + Fη) η / ∫ ρ(γ)/(Λ̃−γ)² dγ,
K = EΛ̃ + 1/η,
χ = 1/E + ηΛ̃,
H = R/η² + FΛ̃ + (−ER + Fη) E / ∫ ρ(γ)/(Λ̃−γ)² dγ,
η = (1/K) erfc( λM/√(2HN) ),
J̄_{j,j∈Ψ} = argmin_{J_j, j∈Ψ} { E_{s,z}( ( ŷ(s,z,χ,Q,J) − s₀(√Q z + Σ_{j∈Ψ} J_j s_j) )²/(2χ) + ℓ(ŷ(s,z,χ,Q,J)) ) + λ Σ_{j∈Ψ} |J_j| },   (63)

where Λ̃, satisfying Eη = −∫ ρ(γ)/(Λ̃−γ) dγ, is determined by the extremization condition in (57) combined with the free energy result (62). In general, there is no analytic solution of the EOS (63), but it can be solved numerically.
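The inner extremization over Λ in (57) enters the EOS only through Λ̃, defined by Eη = −∫ ρ(γ)/(Λ̃−γ) dγ. A minimal numerical sketch (ours; the discretized toy spectrum and the bisection bracket are illustrative assumptions) is:

    import numpy as np
    from scipy.optimize import brentq

    def lambda_tilde(E_eta, gamma):
        """Solve E*eta = -mean(1 / (Lam - gamma)) for Lam on a discretized
        spectrum `gamma` (eigenvalues of C). The relevant root lies below
        min(gamma), where the right-hand side is positive."""
        g = lambda lam: -np.mean(1.0 / (lam - gamma)) - E_eta
        lo = gamma.min() - 1.0 / E_eta - 1.0   # far below the spectrum
        hi = gamma.min() - 1e-9                # just below the spectrum edge
        return brentq(g, lo, hi)

    # Example with a toy spectrum; for spin covariances mean(gamma) = 1.
    gamma = np.linspace(0.2, 1.8, 1000)        # stand-in for rho(gamma)
    lam_t = lambda_tilde(0.05, gamma)
    assert np.isclose(-np.mean(1.0 / (lam_t - gamma)), 0.05)
    g2 = np.mean(1.0 / (lam_t - gamma) ** 2)   # enters the Q and H equations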
A.3.1. Square loss ℓ(y) = (y−1)²/2

In the case of the square loss ℓ(y) = (y−1)²/2 for the ℓ₁-LinR estimator, there is an analytic solution for y in min_y [ ( y − s₀(√Q z + Σ_{j∈Ψ} J̄_j s_j) )²/(2χ) + ℓ(y) ], and thus the results can be further simplified. Specifically, the free energy can be written as

f(β→∞) = −Extr_Θ{ −α/(2(1+χ)) E_{s,z}[ ( s₀ − Σ_{j∈Ψ} s_j J̄_j − √Q z )² ] − αλ Σ_{j∈Ψ} |J̄_j| + (−ER + Fη) G′(−Eη) + (EQ − Fχ)/2 + (KR − Hη)/2 − E_z[ min_w { (K/2) w² − ( √H z − (λM/√N) sign(w) ) w } ] },   (64)

and the corresponding EOS can be written as

E = α/(1+χ),   (a)
F = α/(1+χ)² [ E_s( ( s₀ − Σ_{j∈Ψ} s_j J̄_j )² ) + Q ],   (b)
R = (1/K²)[ ( H + λ²M²/N ) erfc( λM/√(2HN) ) − λM √(2H/(πN)) e^{−λ²M²/(2HN)} ],   (c)
Eη = −∫ ρ(γ)/(Λ̃−γ) dγ,   (d)
Q = F/E² + RΛ̃ − (−ER + Fη) η / ∫ ρ(γ)/(Λ̃−γ)² dγ,   (e)
K = EΛ̃ + 1/η,   (f)
χ = 1/E + ηΛ̃,   (g)
H = R/η² + FΛ̃ + (−ER + Fη) E / ∫ ρ(γ)/(Λ̃−γ)² dγ,   (h)
η = (1/K) erfc( λM/√(2HN) ),   (i)
J̄_j = soft( tanh(K), λ(1+χ) ) / ( 1 + (d−1) tanh²(K) ), j ∈ Ψ.   (j)   (65)

Note that the mean estimates {J̄_j, j ∈ Ψ} in (65) are obtained by solving the following reduced optimization problem:

argmin_{{J̄_j}} { 1/(2(1+χ)) E_{s,z}[ ( s₀ − Σ_{j∈Ψ} s_j J̄_j − √Q z )² ] + λ Σ_{j∈Ψ} |J̄_j| },   (66)

where the corresponding fixed-point equation associated with any J̄_k, k ∈ Ψ, can be written as

(1/(1+χ)) E_s[ s_k ( s₀ − Σ_{j∈Ψ} s_j J̄_j ) ] − λ sign(J̄_k) = 0,   ∀k ∈ Ψ,   (67)

where sign(·) denotes an element-wise application of the standard sign function. For a RR graph G ∈ G_{N,d,K} with degree d and coupling strength K, without loss of generality assuming that all the active couplings are positive, we have E_s(s₀ s_k) = tanh(K), ∀k ∈ Ψ, and E_s(s_k s_j) = tanh²(K), ∀k, j ∈ Ψ, k ≠ j. Given these results, and thanks to the symmetry, we obtain

J̄_j = soft( tanh(K), λ(1+χ) ) / ( 1 + (d−1) tanh²(K) ), j ∈ Ψ,   (68)

where soft(z, τ) = sign(z)(|z|−τ)₊ is the soft-thresholding function, i.e.,

soft(z, τ) ≡ z − τ for z > τ;   0 for |z| ≤ τ;   z + τ for z < −τ.   (69)

On the other hand, in the inactive set Ψ̄, each component of the scaled noise estimates can be statistically described as the solution of the scalar estimator min_w { (K/2) w² − ( √H z − (λM/√N) sign(w) ) w } in (62). Consequently, recalling the definition of w in (15), the estimates {Ĵ_j, j ∈ Ψ̄} in the inactive set Ψ̄ are

Ĵ_j = (√H/(K√N)) soft( z_j, λM/√(HN) ) = argmin_{J_j} { (1/2)( J_j − (1/K)√(H/N) z_j )² + (λM/(KN)) |J_j| },   j ∈ Ψ̄,   (70)

where z_j ∼ N(0,1), j ∈ Ψ̄, are i.i.d. standard Gaussian variables. Consequently, it can be seen from (68) and (70) that, statistically, the ℓ₁-LinR estimator is decoupled into two scalar thresholding estimators, one for the active set Ψ and one for the inactive set Ψ̄.

A.3.2. Logistic loss ℓ(y) = log(1 + e^{−2y})

In the case of the logistic loss ℓ(y) = log(1 + e^{−2y}) for the ℓ₁-LogR estimator, however, there is no analytic solution for y in min_y [ ( y − s₀(√Q z + Σ_{j∈Ψ} J̄_j s_j) )²/(2χ) + ℓ(y) ], and we have to solve it iteratively together with the other parameters Θ.
After some algebra, we obtain the EOS for the ℓ₁-LogR estimator:

( ŷ(s,z,χ,Q,J) − s₀(√Q z + Σ_{j∈Ψ} J̄_j s_j) )/χ = 1 − tanh( ŷ(s,z,χ,Q,J) ),
E = (α/√Q) E_{s,z}( s₀ z tanh( ŷ(s,z,χ,Q,J) ) ),
F = α E_{s,z}( ( 1 − tanh( ŷ(s,z,χ,Q,J) ) )² ),
J̄_j = soft( E_{s,z}( ŷ(s,z,χ,Q,J) s₀ Σ_{j∈Ψ} s_j ), λdχ ) / ( d (1 + (d−1) tanh²(K)) ), j ∈ Ψ,   (71)

together with the loss-independent equations (c)-(i) of (65) for R, Λ̃, Q, K, χ, H, and η. In the active set Ψ, the mean estimates {J̄_j, j ∈ Ψ} can be obtained by solving a reduced ℓ₁-regularized optimization problem:

min_{{J̄_j}_{j∈Ψ}} E_{s,z}[ min_y ( ( y − s₀(√Q z + Σ_{j∈Ψ} J̄_j s_j) )²/(2χ) + log(1 + e^{−2y}) ) ] + λ Σ_{j∈Ψ} |J̄_j|.   (72)

In contrast to the ℓ₁-LinR estimator, the mean estimates {J̄_j, j ∈ Ψ} in (72) for the ℓ₁-LogR estimator have no analytic solution and also have to be solved numerically. For a RR graph G ∈ G_{N,d,K} with degree d and coupling strength K, after some algebra, the corresponding fixed-point equation for {J̄_j = J, j ∈ Ψ} is obtained as

J = soft( E_{s,z}( ŷ(s,z,χ,Q,J) s₀ Σ_{j∈Ψ} s_j ), λdχ ) / ( d (1 + (d−1) tanh²(K)) ),   (73)

which can be solved iteratively. The estimates in the inactive set Ψ̄ are the same as those (70) of ℓ₁-LinR and can be described by a scalar thresholding estimator once the EOS is solved.
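The scalar fixed point for ŷ in the first line of (71) is one-dimensional and strictly monotone, so a few Newton steps suffice; a minimal sketch (ours, with illustrative tolerances) is:

    import numpy as np

    def solve_y_hat(u, chi, tol=1e-12, max_iter=50):
        """Newton solver for the logistic-loss scalar fixed point:
            (y - u) / chi = 1 - tanh(y),
        where u = s0 * (sqrt(Q) z + sum_j Jbar_j s_j). The residual is
        strictly increasing in y, so iteration from y = u converges fast."""
        y = u
        for _ in range(max_iter):
            t = np.tanh(y)
            g = y - u - chi * (1.0 - t)        # residual
            dg = 1.0 + chi * (1.0 - t * t)     # derivative, always > 0
            step = g / dg
            y -= step
            if abs(step) < tol:
                break
        return y

    # Example: verify the fixed-point residual vanishes for u = 0.7, chi = 0.5.
    y = solve_y_hat(0.7, 0.5)
    assert abs((y - 0.7) / 0.5 - (1 - np.tanh(y))) < 1e-10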
B. Verification of Assumption 1

To verify Assumption 1, we first categorize the estimates according to the distance, or generation, from the focused spin s₀. Considering the original Ising model whose coupling network is a tree-like graph, we can naturally define generations of spins according to the distance from the focused spin s₀. We categorize the spins directly connected to s₀ as the first generation and denote the corresponding index set as Ω₁ = {i | J*_{0i} ≠ 0, i ∈ {1, …, N−1}}. Each spin in Ω₁ is connected to some other spins except for s₀, and those spins constitute the second generation, whose index set is denoted as Ω₂. This recursive construction of generations can be continued unambiguously on the tree-like graph, and we denote the index set of the g-th generation from spin s₀ as Ω_g. The overall construction of generations is graphically represented in Fig. 5. Generally, assume that the set of nonzero values of the ℓ₁-LinR estimator is {Ω₁, …, Ω_g}. Then, Assumption 1 means that the correct active set of the mean estimates is Ψ = {Ω₁}.

Figure 5. Schematic of generations of spins. In general, the g-th generation of spin s₀ is denoted as Ω_g, whose distance from spin s₀ is g.

To verify this, we examine the values of the mean estimates based on (64). Due to the symmetry, it is expected that for each a = 1, …, g, the values of the mean estimates J̄_{j∈Ω_a} = J_a are identical to each other within the same set Ω_a. In addition, if the solutions satisfy Assumption 1 in (15), i.e., J₁ = J, J_a = 0, a ≥ 2, from (64) we obtain

(1/(1+χ))[ tanh(K) − (1 + (d−1) tanh²(K)) J ] − λ = 0,   j ∈ Ω₁;
| (1/(1+χ))[ tanh^a(K) − tanh^{a−1}(K) (1 + (d−1) tanh²(K)) J ] | ≤ λ,   j ∈ Ω_a, a ≥ 2,   (74)

where the result E_s(s_i s_j) = tanh^{d_{ij}}(K) is used for any two spins s_i, s_j whose distance is d_{ij} in the RR graph G ∈ G_{N,d,K}. Note that the solution of the first equation in (74) automatically satisfies the second inequality (sub-gradient condition) since |tanh(K)| ≤ 1, which indicates that J₁ = J, J_a = 0, a ≥ 2, is one valid solution. Moreover, the convexity of the square loss function indicates that this is the unique and correct solution, which verifies Assumption 1.

C. Verification of Assumption 2

We here verify one part of Assumption 2 in Sec. 3.2: that the orthogonal matrix O diagonalizing the covariance matrix C is distributed according to the Haar orthogonal measure. To achieve this, we compare certain properties of the orthogonal matrix obtained from the diagonalization of the covariance matrix C with those of an orthogonal matrix actually generated from the Haar orthogonal measure. Specifically, we compute the cumulants of the traces of powers k of the orthogonal matrix. All cumulants of degree r ≥ 3 are known to vanish in the large-N limit (Diaconis & Shahshahani, 1994; Johansson, 1997); the nontrivial cumulants are only the second-order cumulants between traces of the same power k. We have computed these cumulants for the orthogonal matrix obtained from the covariance matrix C and found that they exhibit the same behavior as those generated from the true Haar measure, as shown in Fig. 6.

D. Details of the high-dimensional asymptotic result

Here the asymptotic performance of Precision and Recall is considered for both the ℓ₁-LinR and ℓ₁-LogR estimators. Recall that perfect Ising model selection is achieved if and only if Precision = 1 and Recall = 1.

D.1. Recall rate

According to the definition in (6), the recall rate is only related to the statistical properties of the estimates in the active set Ψ, and thus the mean estimates {J̄_j}_{j∈Ψ} in the limit M → ∞ are considered.

Figure 6. The RR graph G ∈ G_{N,d,K} with N = 1000, d = 3, K = 0.4 is generated, and we compute the associated covariance matrix C and then diagonalize it as C = OΛO^T, obtaining the orthogonal matrix O. Then Tr(O^k), Tr(O^{−k}) for several small k are computed, where Tr(·) is the trace operation. This procedure is repeated 200 times with different random numbers, from which we obtain the ensemble of Tr(O^k) and Tr(O^{−k}). Consequently, the cumulants of 1st, 2nd, and 3rd order are computed. All of them exhibit the expected theoretical behavior.

D.1.1. Square loss

In this case, in the limit M → ∞, the mean estimates {J̄_j = J}_{j∈Ψ} in the active set Ψ are shown in (68) and rewritten as follows for ease of reference:

J = soft( tanh(K), λ(1+χ) ) / ( 1 + (d−1) tanh²(K) ).   (75)

As a result, as long as λ(1+χ) < tanh(K), J > 0, and thus we can successfully recover the active set so that Recall = 1. In addition, when M = O(log N), χ → 0 as N → ∞, as demonstrated later by the relation in (85). As a result, the regularization parameter needs to satisfy 0 < λ < tanh(K).
D.1.2. Logistic loss

In this case, in the limit $M \to \infty$, the mean estimates $\{\bar{J}_j = J\}_{j\in\Psi}$ in the active set $\Psi$ are given in (73) and rewritten here for ease of reference:

  $J = \frac{\mathrm{soft}\big(\mathbb{E}_{s,z}\big[\hat{y}(s,z,\chi,Q,J)\,s\sum_{j\in\Psi}s_j\big],\ \lambda d\chi\big)}{d\big(1+(d-1)\tanh^2(K)\big)}.    (76)

There is no analytic solution for $\hat{y}(s,z,\chi,Q,J)$, and the following fixed-point equation has to be solved numerically:

  $\frac{\hat{y}(s,z,\chi,Q,J) - s\big(\sqrt{Q}z + J\sum_{j\in\Psi}s_j\big)}{\chi} = 1 - \tanh\big(\hat{y}(s,z,\chi,Q,J)\big).    (77)

One can then determine the valid choices of $\lambda$ that ensure $J > 0$. Numerical results show that the valid range of $\lambda$ is similar to that of the square loss.

D.2. Precision rate

According to the definition in (6), to compute the Precision, the numbers of true positives $TP$ and false positives $FP$ are needed. On the one hand, as discussed in Appendix D.1, in the limit $M \to \infty$ the recall rate approaches one, so that $TP = d$ for a RR graph $G \in \mathcal{G}_{N,d,K}$. On the other hand, the number of false positives can be computed as $FP = FPR \cdot N$, where $FPR$ is the false positive rate (FPR).

As shown in Appendix A.3, the estimator in the inactive set $\bar{\Psi}$ can be statistically described by the scalar estimator (70), and thus the FPR can be computed as

  $FPR = \mathrm{erfc}\Big(\frac{\lambda M}{\sqrt{2HN}}\Big),    (78)

which depends on $\lambda, M, N, H$. However, for both the square loss and the logistic loss, there is no analytic result for $H$ in (63). Nevertheless, we can obtain asymptotic results using a perturbative analysis.

Specifically, we focus on the asymptotic behavior of the macroscopic parameters, e.g., $\chi, Q, \tilde{K}, E, H, F$, in the regime $FPR \to 0$, which is necessary for successful Ising model selection. From $\eta = \frac{1}{\tilde{K}}\mathrm{erfc}\big(\frac{\lambda M}{\sqrt{2HN}}\big)$ in the EOS (63) and the FPR in (78), there is $FPR = \tilde{K}\eta$. Moreover, by combining $E\eta = -\int\frac{\rho(\gamma)}{\tilde{\Lambda}-\gamma}d\gamma$ and $\tilde{K} = E\tilde{\Lambda} + \frac{1}{\eta}$, the following relation can be obtained:

  $\mathrm{erfc}\Big(\frac{\lambda M}{\sqrt{2HN}}\Big) = 1 - \int\frac{\rho(\gamma)}{1-\gamma/\tilde{\Lambda}}\,d\gamma.    (79)

Thus, as $FPR = \mathrm{erfc}\big(\frac{\lambda M}{\sqrt{2HN}}\big) \to 0$, there is $\int\frac{\rho(\gamma)}{1-\gamma/\tilde{\Lambda}}d\gamma \to 1$, implying that the magnitude of $\tilde{\Lambda}$ diverges, $|\tilde{\Lambda}| \to \infty$. Consequently, using the truncated series expansion, we obtain

  $E\eta = -\int\frac{\rho(\gamma)}{\tilde{\Lambda}-\gamma}\,d\gamma = -\frac{1}{\tilde{\Lambda}}\sum_{k=0}^{\infty}\frac{\langle\gamma^k\rangle}{\tilde{\Lambda}^{k}} \simeq -\frac{1}{\tilde{\Lambda}} - \frac{\langle\gamma\rangle}{\tilde{\Lambda}^{2}},    (80)

where $\langle\gamma^k\rangle \equiv \int\rho(\gamma)\gamma^k\,d\gamma$. Then, solving the quadratic equation (80), we obtain the solution (the other solution is not considered since it is the smaller one in magnitude) of $\tilde{\Lambda}$ as

  $\tilde{\Lambda} = \frac{-1-\sqrt{1-4E\eta\langle\gamma\rangle}}{2E\eta} \simeq \langle\gamma\rangle - \frac{1}{E\eta}.    (81)

To compute $\int\frac{\rho(\gamma)}{(\tilde{\Lambda}-\gamma)^2}\,d\gamma$, we use the following relations:

  $f(\tilde{\Lambda}) = -\int\frac{\rho(\gamma)}{\tilde{\Lambda}-\gamma}\,d\gamma \simeq -\frac{1}{\tilde{\Lambda}} - \frac{\langle\gamma\rangle}{\tilde{\Lambda}^{2}},    (82)
  $\frac{df(\tilde{\Lambda})}{d\tilde{\Lambda}} = \int\frac{\rho(\gamma)}{(\tilde{\Lambda}-\gamma)^{2}}\,d\gamma \simeq \frac{1}{\tilde{\Lambda}^{2}} + \frac{2\langle\gamma\rangle}{\tilde{\Lambda}^{3}}.    (83)

Substituting the results (81)-(83) into (63), after some algebra, we obtain

  $\tilde{K} \simeq E\langle\gamma\rangle,    (84)
  $\chi \simeq \eta\langle\gamma\rangle,    (85)
  $Q \simeq \frac{\langle\gamma\rangle^{2}E^{2}\eta^{2}R - \langle\gamma\rangle^{3}E\eta^{3}F + 3\langle\gamma\rangle^{2}F\eta^{2} - R\langle\gamma\rangle}{3E\eta\langle\gamma\rangle - 1},    (86)
  $H \simeq \frac{\langle\gamma\rangle^{2}E^{2}\eta^{2}F - \langle\gamma\rangle^{3}E^{3}\eta R + 3\langle\gamma\rangle^{2}E^{2}R - F\langle\gamma\rangle}{3E\eta\langle\gamma\rangle - 1}.    (87)
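The quadratic approximation (81) can be checked against a direct numerical solution of $E\eta = -\int\rho(\gamma)/(\tilde{\Lambda}-\gamma)\,d\gamma$. The Python sketch below is our own illustration; the spectrum $\rho(\gamma)$ is replaced by an assumed empirical sample rather than the RR-graph law of Appendix F:

# Sketch (illustrative only): solve E*eta = -mean(1/(L - gamma)) for L = tilde_Lambda
# over an empirical spectrum, and compare with the approximation of Eq. (81),
# L ~ <gamma> - 1/(E*eta), which holds in the FPR -> 0 regime.
import numpy as np

rng = np.random.default_rng(0)
gammas = rng.uniform(0.5, 1.5, size=5000)   # stand-in empirical spectrum rho(gamma)
mean_gamma = gammas.mean()

def resolvent(L):
    return -np.mean(1.0 / (L - gammas))

E_eta = 1e-3                                 # small E*eta, i.e. FPR -> 0 regime
lo, hi = -1e7, -1e-6                         # resolvent is increasing in L on (-inf, 0)
for _ in range(200):                         # bisection
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if resolvent(mid) < E_eta else (lo, mid)

print("numerical tilde_Lambda :", lo)
print("approximation (81)     :", mean_gamma - 1.0 / E_eta)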
In addition, as $FPR = \mathrm{erfc}\big(\frac{\lambda M}{\sqrt{2HN}}\big) \to 0$, from (63) we obtain

  $R = \frac{1}{\tilde{K}^{2}}\Big[\big(H + \frac{\lambda^{2}M^{2}}{N}\big)\,\mathrm{erfc}\big(\frac{\lambda M}{\sqrt{2HN}}\big) - \frac{\lambda M}{N}\sqrt{\frac{2HN}{\pi}}\,e^{-\frac{\lambda^{2}M^{2}}{2HN}}\Big] \simeq \frac{H}{\tilde{K}^{2}}\,\mathrm{erfc}\Big(\frac{\lambda M}{\sqrt{2HN}}\Big) \simeq \frac{H}{\tilde{K}}\eta \simeq \frac{H}{E\langle\gamma\rangle}\eta,    (88)

where the first $\simeq$ uses the asymptotic relation $\mathrm{erfc}(x) \simeq \frac{1}{x\sqrt{\pi}}e^{-x^{2}}$ as $x \to \infty$, and the last $\simeq$ results from the asymptotic relation in (84). Then, substituting (88) into (87) leads to the following relation:

  $(3E\eta\langle\gamma\rangle - 1)H \simeq \langle\gamma\rangle^{2}E^{2}\eta^{2}F - \langle\gamma\rangle^{2}E^{2}\eta^{2}H + 3E\eta\langle\gamma\rangle H - F\langle\gamma\rangle.    (89)

Interestingly, the common terms $3E\eta\langle\gamma\rangle H$ on both sides of (89) cancel each other. Therefore, neglecting the remaining higher-order terms in $\eta$, the key result for $H$ is obtained as follows:

  $H \simeq F\langle\gamma\rangle.    (90)

In addition, from (90) and (86), $Q$ can be simplified as

  $Q \simeq R\langle\gamma\rangle.    (91)

As shown in (63), $F = \alpha\,\mathbb{E}_{s,z}\big[\big(\frac{d\ell(y)}{dy}\big|_{y=\hat{y}(s,z,\chi,Q,J)}\big)^{2}\big]$; thus the result $H \simeq F\langle\gamma\rangle$ in (90) implies a linear relation between $H$ and $\alpha \equiv M/N$. The relations between $E, F, H$ and $\alpha$ are also verified numerically in Fig. 7 for $M = 50\log N$ using the $\ell_1$-LinR estimator.

Figure 7. $E$, $F$, and $H$ versus $\alpha$ when $\alpha = 50(\log N)/N$ for increasing $N$, for a RR graph $G \in \mathcal{G}_{N,d,K}$ with $d = 3$. Note that in this case $\langle\gamma\rangle = 1$.

Denote $H \simeq F\langle\gamma\rangle \equiv \alpha\Delta$, where $\Delta = \mathbb{E}_{s,z}\big[\big(\frac{d\ell(y)}{dy}\big|_{y=\hat{y}(s,z,\chi,Q,J)}\big)^{2}\big]\langle\gamma\rangle = O(1)$. Then the FPR in (78) can be rewritten as follows:

  $FPR = \mathrm{erfc}\Big(\frac{\lambda M}{\sqrt{2\alpha\Delta N}}\Big) = \mathrm{erfc}\Big(\lambda\sqrt{\frac{M}{2\Delta}}\Big) \le \frac{1}{\sqrt{\pi}}\,e^{-\frac{\lambda^{2}M}{2\Delta} - \frac{1}{2}\log\frac{\lambda^{2}M}{2\Delta}},    (92)

where the last inequality uses the upper bound of the erfc function, i.e., $\mathrm{erfc}(x) \le \frac{1}{x\sqrt{\pi}}e^{-x^{2}}$. Consequently, the number of false positives $FP$ satisfies

  $FP \le \frac{N}{\sqrt{\pi}}\,e^{-\frac{\lambda^{2}M}{2\Delta} - \frac{1}{2}\log\frac{\lambda^{2}M}{2\Delta}} = \frac{1}{\sqrt{\pi}}\,e^{-\frac{\lambda^{2}M}{2\Delta} - \frac{1}{2}\log\frac{\lambda^{2}M}{2\Delta} + \log N} < \frac{1}{\sqrt{\pi}}\,e^{-\frac{\lambda^{2}M}{2\Delta} + \log N},    (93)

where the last inequality holds when $\frac{\lambda^{2}M}{2\Delta} > 1$, which is necessary for $FP \to 0$ as $N \to \infty$. Consequently, to ensure $FP \to 0$ as $N \to \infty$, from (93) the term $\frac{\lambda^{2}M}{2\Delta}$ should grow at least as fast as $\log N$, i.e.,

  $M > \frac{2\Delta\log N}{\lambda^{2}}.    (94)

Meanwhile, the number of false positives $FP$ decays as $O(e^{-c\log N})$ for some constant $c > 0$.

D.2.1. Square loss

In this case, when $0 < \lambda < \tanh(K)$, from (65) we can obtain an analytic result for $\Delta$ as follows:

  $\Delta \simeq \mathbb{E}_s\Big[\Big(s_0 - \sum_{j\in\Psi}s_j\bar{J}_j\Big)^{2}\Big]\langle\gamma\rangle = \frac{1 - \tanh^{2}(K) + d\lambda^{2}}{1+(d-1)\tanh^{2}(K)}\,\langle\gamma\rangle.    (95)

On the other hand, from the discussion in Appendix D.1, the recall rate satisfies Recall $\to 1$ as $M \to \infty$ when $0 < \lambda < \tanh(K)$.
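For concreteness, the following sketch (our own illustration, with assumed parameter values) evaluates the scaling factor $\Delta$ of (95) and the corresponding sample-size bound (94); $\langle\gamma\rangle = 1$ is used here, as in the setting of Fig. 7:

# Sketch (illustrative, assumed parameters): Delta of Eq. (95) for the square
# loss and the sample-size bound of Eq. (94), M > 2*Delta*log(N)/lambda^2.
import numpy as np

def delta_square_loss(K, lam, d, mean_gamma=1.0):
    th2 = np.tanh(K) ** 2
    return (1.0 - th2 + d * lam ** 2) / (1.0 + (d - 1) * th2) * mean_gamma

K, d, lam = 0.4, 3, 0.1          # illustrative coupling, degree, regularization
for N in (10 ** 3, 10 ** 4, 10 ** 5):
    delta = delta_square_loss(K, lam, d)
    M_min = 2.0 * delta * np.log(N) / lam ** 2
    print(f"N={N:>6d}: Delta={delta:.3f}, perfect recovery needs M > {M_min:.0f}")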
Overall, for a RR graph $G \in \mathcal{G}_{N,d,K}$ with degree $d$ and coupling strength $K$, given $M$ i.i.d. samples $\mathcal{D}_M = \{\boldsymbol{s}^{(1)},\dots,\boldsymbol{s}^{(M)}\}$ and using the $\ell_1$-LinR estimator (5) with regularization parameter $\lambda$, perfect recovery of the graph structure $G$ is achieved as $N \to \infty$ if the number of samples $M$ satisfies

  $M > \frac{c(\lambda,K)\log N}{\lambda^{2}}, \qquad \lambda\in(0,\tanh(K)),    (96)

where $c(\lambda,K)$ is a value depending on the regularization parameter $\lambda$ and the coupling strength $K$, which can be approximated in the limit $N \to \infty$ as

  $c(\lambda,K) = \frac{2\big(1-\tanh^{2}(K)+d\lambda^{2}\big)\langle\gamma\rangle}{1+(d-1)\tanh^{2}(K)}.    (97)

D.2.2. Logistic loss

In this case, from (71), the value of $\Delta$ can be computed as

  $\Delta \simeq \mathbb{E}_{s,z}\Big[\big(1-\tanh(\hat{y}(s,z,\chi,Q,J))\big)^{2}\Big]\langle\gamma\rangle.    (98)

However, different from the case of the $\ell_1$-LinR estimator, there is no analytic solution, and $\Delta$ has to be calculated numerically. It can be seen that, for Ising model selection, the $\ell_1$-LinR estimator differs from the $\ell_1$-LogR estimator only in the value of the scaling factor $\Delta$.

E. Details of the non-asymptotic result for moderate $M, N$

As demonstrated in Appendix A.3, from the replica analysis both the $\ell_1$-LinR and $\ell_1$-LogR estimators are decoupled, and their asymptotic behavior can be described by two scalar estimators, one for the active set and one for the inactive set. It is desirable to obtain non-asymptotic results for moderate $M, N$. However, simply inserting finite values of $M, N$ into the EOS does not always yield behavior of the two scalar estimators consistent with the experimental results, especially for the Recall when $M$ is small. This can be explained by the derivation of the free energy density: in calculating the energy term $\xi$, the limit $M \to \infty$ is taken implicitly when assuming the limit $N \to \infty$ with $\alpha \equiv M/N$ fixed. As a result, the scalar estimator associated with the active set can only describe the asymptotic performance in the limit $M \to \infty$; it cannot describe the fluctuating behavior of the active-set estimates, such as the recall rate, at finite $M$. To characterize the non-asymptotic behavior of the estimates in the active set $\Psi$, we replace the expectation $\mathbb{E}_s(\cdot)$ in (62) by the sample average over $M$ samples, and the corresponding estimates are obtained as

  $\{\hat{J}_j\}_{j\in\Psi} = \arg\min_{\{J_j\}_{j\in\Psi}} \frac{1}{M}\sum_{\mu=1}^{M}\min_{y^{\mu}}\Big[\frac{\big(y^{\mu} - s_0^{\mu}(\sqrt{Q}z^{\mu} + \sum_{j\in\Psi}J_j s_j^{\mu})\big)^{2}}{2\chi} + \ell(y^{\mu})\Big] + \lambda\sum_{j\in\Psi}|J_j|,    (99)

where $z^{\mu} \sim \mathcal{N}(0,1)$ and $s_0^{\mu}, s_j^{\mu}\ (j\in\Psi) \sim P(s_0, \boldsymbol{s}_{\Psi}\,|\,\boldsymbol{J}^{*})$ are random samples, $\mu = 1,\dots,M$. Note that the mean estimates $\{\bar{J}_j\}_{j\in\Psi}$ are replaced by $\{\hat{J}_j\}_{j\in\Psi}$ in (99), as we now focus on their fluctuating behavior due to finite-size effects. In the limit $M \to \infty$, the sample average converges to the expectation, and thus (99) is equivalent to (72).

E.1. Square loss $\ell(y) = (y-1)^{2}/2$

In the case of the square loss $\ell(y) = (y-1)^{2}/2$, the inner minimization over $y$ in $\min_y\big[\frac{(y - s_0(\sqrt{Q}z + \sum_{j\in\Psi}\bar{J}_j s_j))^{2}}{2\chi} + \ell(y)\big]$ has an analytic solution. Consequently, similarly to (66), the result of (99) for the $\ell_1$-LinR estimator becomes

  $\{\hat{J}_j\}_{j\in\Psi} = \arg\min_{\{J_j\}_{j\in\Psi}} \frac{1}{2(1+\chi)M}\sum_{\mu=1}^{M}\Big(s_0^{\mu} - \sum_{j\in\Psi}s_j^{\mu}J_j - \sqrt{Q}z^{\mu}\Big)^{2} + \lambda\sum_{j\in\Psi}|J_j|.    (100)
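One Monte Carlo draw of the decoupled finite-$M$ problem (100) can be solved with any standard LASSO solver. The sketch below is our own illustration (not the paper's implementation): it uses plain proximal-gradient (ISTA) updates and fixed, assumed values of $\chi$ and $Q$; in Algorithm 2 these would instead come from the current EOS iterate:

# Sketch (our own Monte Carlo illustration of Eq. (100), not the paper's code).
import numpy as np

rng = np.random.default_rng(1)
K, d, lam, M = 0.4, 3, 0.1, 200
chi, Q = 0.05, 0.01                       # assumed EOS values (illustrative only)

# Paramagnetic tree marginal: s0 uniform; each neighbor equals s0 w.p. (1+tanh K)/2.
s0 = rng.choice([-1, 1], size=M)
same = rng.random((M, d)) < (1 + np.tanh(K)) / 2
sj = np.where(same, s0[:, None], -s0[:, None])
z = rng.standard_normal(M)
target = s0 - np.sqrt(Q) * z              # residual target in Eq. (100)

def ista(A, b, lam, scale, steps=2000):
    """Minimize scale * ||b - A J||^2 + lam * ||J||_1 by proximal gradient."""
    J = np.zeros(A.shape[1])
    step = 0.9 / (2 * scale * np.linalg.norm(A, 2) ** 2)   # 1/Lipschitz, with margin
    for _ in range(steps):
        grad = -2 * scale * A.T @ (b - A @ J)
        J = J - step * grad
        J = np.sign(J) * np.maximum(np.abs(J) - step * lam, 0.0)  # soft threshold
    return J

J_hat = ista(sj, target, lam, scale=1.0 / (2 * (1 + chi) * M))
print("finite-M active-set estimates:", J_hat)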
As the mean estimates $\{\bar{J}_j\}_{j\in\Psi}$ are modified as in (100), the corresponding solution of the EOS in (65) also needs to be modified; it can be solved iteratively as sketched in Algorithm 1. The details of a practical implementation of Algorithm 1 are described in the following.

First, in the EOS (24), we need to obtain $\tilde{\Lambda}$ satisfying the relation

  $E\eta = -\int\frac{\rho(\gamma)}{\tilde{\Lambda}-\gamma}\,d\gamma,    (101)

which is difficult to solve directly. To obtain $\tilde{\Lambda}$, we introduce an auxiliary variable $\Gamma \equiv -1/\tilde{\Lambda}$, by which (101) can be rewritten as

  $\Gamma = \frac{E\eta}{\int\frac{\rho(\gamma)}{1+\Gamma\gamma}\,d\gamma},    (102)

which can be solved iteratively. Accordingly, $\chi, Q, \tilde{K}, H$ in the EOS (24) can be equivalently rewritten in terms of $\Gamma$.

Second, when solving the EOS (24) iteratively using numerical methods, it is helpful for convergence to introduce a damping factor $damp \in [0,1)$ for $\chi, Q, E, R, F, \eta, \tilde{K}, H, \Gamma$ in each iteration. The detailed implementation of Algorithm 1 is shown in Algorithm 2.

E.2. Logistic loss $\ell(y) = \log(1+e^{-2y})$

In the case of the logistic loss $\ell(y) = \log(1+e^{-2y})$, since there is no analytic solution for the inner minimization over $y$, the result of (99) for the $\ell_1$-LogR estimator becomes

  $\{\hat{J}_j\}_{j\in\Psi} = \arg\min_{\{J_j\}_{j\in\Psi}} \frac{1}{M}\sum_{\mu=1}^{M}\min_{y^{\mu}}\Big[\frac{\big(y^{\mu} - s_0^{\mu}(\sqrt{Q}z^{\mu} + \sum_{j\in\Psi}J_j s_j^{\mu})\big)^{2}}{2\chi} + \log\big(1+e^{-2y^{\mu}}\big)\Big] + \lambda\sum_{j\in\Psi}|J_j|.    (103)

Similarly to the square-loss case, as the mean estimates $\{\bar{J}_j\}_{j\in\Psi}$ are modified as in (103), the corresponding solutions of the EOS in (71) also need to be modified; they can be solved iteratively as shown in Algorithm 3.

Algorithm 2 Detailed implementation of Algorithm 1 for the $\ell_1$-LinR estimator with moderate $M, N$.

Input: $M, N, \lambda, K$, $\rho(\gamma)$, and $T_{MC}, T_{EOS}, T_{\Gamma}$.
Initialization: $\chi, Q, E, R, F, \eta, \tilde{K}, H, \Gamma$.
repeat
  for $t = 1$ to $T_{MC}$ do
    - Draw $M$ random samples $s_0^{\mu}, s_j^{\mu}\ (j\in\Psi) \sim P(s_0, \boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*})$ and $z^{\mu} \sim \mathcal{N}(0,1)$, $\mu = 1,\dots,M$.
    - Solve $\hat{J}_{j,j\in\Psi} = \arg\min_{J_{j,j\in\Psi}}\big[\frac{1}{2(1+\chi)M}\sum_{\mu=1}^{M}(s_0^{\mu} - \sum_{j\in\Psi}s_j^{\mu}J_j - \sqrt{Q}z^{\mu})^{2} + \lambda\sum_{j\in\Psi}|J_j|\big]$.
    - Compute $\Delta(t) = \frac{1}{M}\sum_{\mu=1}^{M}\big(s_0^{\mu} - \sum_{j\in\Psi}s_j^{\mu}\hat{J}_j\big)^{2}$.
  end for
  Set $\bar{\Delta} = \frac{1}{T_{MC}}\sum_{t=1}^{T_{MC}}\Delta(t)$.
  for $t = 1$ to $T_{EOS}$ do
    $E = (1-damp)\,\frac{\alpha}{1+\chi} + damp\cdot E$
    $F = (1-damp)\,\frac{\alpha(\bar{\Delta}+Q)}{(1+\chi)^{2}} + damp\cdot F$
    $R = (1-damp)\,\frac{1}{\tilde{K}^{2}}\big[\big(H+\frac{\lambda^{2}M^{2}}{N}\big)\mathrm{erfc}\big(\frac{\lambda M}{\sqrt{2HN}}\big) - \frac{\lambda M}{N}\sqrt{\frac{2HN}{\pi}}e^{-\frac{\lambda^{2}M^{2}}{2HN}}\big] + damp\cdot R$
    for $t' = 1$ to $T_{\Gamma}$ do
      $\Gamma = (1-damp)\,\frac{E\eta}{\int\frac{\rho(\gamma)}{1+\Gamma\gamma}d\gamma} + damp\cdot\Gamma$
    end for
    $\tilde{K} = (1-damp)\big(-\frac{E}{\Gamma}+\frac{1}{\eta}\big) + damp\cdot\tilde{K}$
    $\chi = (1-damp)\big(-\frac{\eta}{\Gamma}+\frac{1}{E}\big) + damp\cdot\chi$
    $Q = (1-damp)\big(\frac{F}{E^{2}} - \frac{R}{\Gamma} - (-E^{2}R+F\eta^{2})\,\eta^{2}\Gamma^{2}\int\frac{\rho(\gamma)}{(1+\Gamma\gamma)^{2}}d\gamma\big) + damp\cdot Q$
    $H = (1-damp)\big(\frac{R}{\eta^{2}} - \frac{F}{\Gamma} + (-E^{2}R+F\eta^{2})\,E^{2}\Gamma^{2}\int\frac{\rho(\gamma)}{(1+\Gamma\gamma)^{2}}d\gamma\big) + damp\cdot H$
    $\eta = (1-damp)\,\frac{1}{\tilde{K}}\mathrm{erfc}\big(\frac{\lambda M}{\sqrt{2HN}}\big) + damp\cdot\eta$
  end for
until convergence.
Return: $\chi, Q, E, R, F, \eta, \tilde{K}, H, \Gamma$.
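As an illustration of the inner $\Gamma$ loop of Algorithm 2, the sketch below (our own, with an assumed empirical spectrum in place of $\rho(\gamma)$ and illustrative values of $E$ and $\eta$) iterates the damped update of (102) and then checks it against (101):

# Sketch (illustrative, not the paper's implementation): damped fixed-point
# iteration for Gamma in Eq. (102), Gamma = E*eta / mean(1/(1+Gamma*gamma)).
import numpy as np

rng = np.random.default_rng(2)
gammas = rng.uniform(0.5, 1.5, size=5000)   # stand-in for rho(gamma)
E, eta, damp = 2.0, 5e-4, 0.5               # assumed EOS values (illustrative)

Gamma = E * eta                              # initialization from the small-Gamma limit
for t in range(200):
    new = E * eta / np.mean(1.0 / (1.0 + Gamma * gammas))
    Gamma = (1 - damp) * new + damp * Gamma  # damping improves convergence

# Consistency check: E*eta should equal Gamma * mean(1/(1+Gamma*gamma)).
print(Gamma, Gamma * np.mean(1.0 / (1.0 + Gamma * gammas)), E * eta)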
F. Eigenvalue distribution $\rho(\gamma)$

From the replica analysis presented above, the learning performance depends on the eigenvalue distribution (EVD) $\rho(\gamma)$ of the covariance matrix $C$ of the teacher Ising model. In general, it is difficult to obtain this EVD analytically; however, for sparse tree-like graphs such as the RR graph $G \in \mathcal{G}_{N,d,K}$ with constant node degree $d$ and sufficiently small coupling strength $K$ yielding the paramagnetic state ($\mathbb{E}_s(s_i) = 0$), it can be computed analytically. For this, we express the covariances as

  $C_{ij} = \mathbb{E}_s(s_is_j) - \mathbb{E}_s(s_i)\mathbb{E}_s(s_j) = \frac{\partial^{2}\log Z(\boldsymbol{\theta})}{\partial\theta_i\partial\theta_j},    (104)

where $Z(\boldsymbol{\theta}) = \int d\boldsymbol{s}\,P_{\rm Ising}(\boldsymbol{s}|\boldsymbol{J}^{*})\exp(\sum_{i=0}^{N-1}\theta_is_i)$ and the assessment is carried out at $\boldsymbol{\theta} = 0$. In addition, for technical convenience we introduce the Gibbs free energy

  $A(\boldsymbol{m}) = \max_{\boldsymbol{\theta}}\big\{\boldsymbol{\theta}^{T}\boldsymbol{m} - \log Z(\boldsymbol{\theta})\big\}.    (105)

The definition (105) implies that the following two relations hold:

  $\frac{\partial m_i}{\partial\theta_j} = \frac{\partial^{2}\log Z(\boldsymbol{\theta})}{\partial\theta_i\partial\theta_j} = C_{ij}, \qquad \frac{\partial\theta_i}{\partial m_j} = [C^{-1}]_{ij} = \frac{\partial^{2}A(\boldsymbol{m})}{\partial m_i\partial m_j},    (106)

where the evaluations are performed at $\boldsymbol{\theta} = 0$ and $\boldsymbol{m} = \arg\min_{\boldsymbol{m}}A(\boldsymbol{m})$ ($= 0$ under the paramagnetic assumption).

Algorithm 3 Detailed implementation of solving the EOS (71) together with (103) for $\ell_1$-LogR with moderate $M, N$.

Input: $M, N, \lambda, K$, $\rho(\gamma)$, and $T_{MC}, T_{EOS}, T_{\Gamma}, T_{active}$.
Initialization: $\chi, Q, E, R, F, \eta, \tilde{K}, H, \Gamma$.
repeat
  for $t = 1$ to $T_{MC}$ do
    Draw $M$ random samples $s_0^{\mu}, s_j^{\mu}\ (j\in\Psi) \sim P(s_0,\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*})$ and $z^{\mu}\sim\mathcal{N}(0,1)$, $\mu = 1,\dots,M$.
    Initialize $\hat{J}_{j,j\in\Psi}$.
    for $t' = 1$ to $T_{active}$ do
      Solve $\hat{y}^{\mu} = \arg\min_{y^{\mu}}\big[\frac{(y^{\mu} - s_0^{\mu}(\sqrt{Q}z^{\mu}+\sum_{j\in\Psi}\hat{J}_js_j^{\mu}))^{2}}{2\chi} + \log(1+e^{-2y^{\mu}})\big]$, $\mu = 1,\dots,M$.
      Solve $\hat{J}_{j,j\in\Psi} = \arg\min_{J_{j,j\in\Psi}}\big[\frac{1}{M}\sum_{\mu=1}^{M}\frac{(\hat{y}^{\mu} - s_0^{\mu}(\sqrt{Q}z^{\mu}+\sum_{j\in\Psi}J_js_j^{\mu}))^{2}}{2\chi} + \lambda\sum_{j\in\Psi}|J_j|\big]$.
    end for
    Compute $\Delta_1(t) = \frac{1}{M}\sum_{\mu=1}^{M}\frac{s_0^{\mu}z^{\mu}}{\sqrt{Q}}\tanh(\hat{y}^{\mu})$.
    Compute $\Delta_2(t) = \frac{1}{M}\sum_{\mu=1}^{M}\big(1-\tanh(\hat{y}^{\mu})\big)^{2}$.
  end for
  Set $\bar{\Delta}_1 = \frac{1}{T_{MC}}\sum_{t=1}^{T_{MC}}\Delta_1(t)$ and $\bar{\Delta}_2 = \frac{1}{T_{MC}}\sum_{t=1}^{T_{MC}}\Delta_2(t)$.
  for $t = 1$ to $T_{EOS}$ do
    $E = (1-damp)\cdot\alpha\bar{\Delta}_1 + damp\cdot E$
    $F = (1-damp)\cdot\alpha\bar{\Delta}_2 + damp\cdot F$
    $R = (1-damp)\,\frac{1}{\tilde{K}^{2}}\big[\big(H+\frac{\lambda^{2}M^{2}}{N}\big)\mathrm{erfc}\big(\frac{\lambda M}{\sqrt{2HN}}\big) - \frac{\lambda M}{N}\sqrt{\frac{2HN}{\pi}}e^{-\frac{\lambda^{2}M^{2}}{2HN}}\big] + damp\cdot R$
    for $t' = 1$ to $T_{\Gamma}$ do
      $\Gamma = (1-damp)\,\frac{E\eta}{\int\frac{\rho(\gamma)}{1+\Gamma\gamma}d\gamma} + damp\cdot\Gamma$
    end for
    $\tilde{K} = (1-damp)\big(-\frac{E}{\Gamma}+\frac{1}{\eta}\big) + damp\cdot\tilde{K}$
    $\chi = (1-damp)\big(-\frac{\eta}{\Gamma}+\frac{1}{E}\big) + damp\cdot\chi$
    $Q = (1-damp)\big(\frac{F}{E^{2}} - \frac{R}{\Gamma} - (-E^{2}R+F\eta^{2})\,\eta^{2}\Gamma^{2}\int\frac{\rho(\gamma)}{(1+\Gamma\gamma)^{2}}d\gamma\big) + damp\cdot Q$
    $H = (1-damp)\big(\frac{R}{\eta^{2}} - \frac{F}{\Gamma} + (-E^{2}R+F\eta^{2})\,E^{2}\Gamma^{2}\int\frac{\rho(\gamma)}{(1+\Gamma\gamma)^{2}}d\gamma\big) + damp\cdot H$
    $\eta = (1-damp)\,\frac{1}{\tilde{K}}\mathrm{erfc}\big(\frac{\lambda M}{\sqrt{2HN}}\big) + damp\cdot\eta$
  end for
until convergence.
Return: $\chi, Q, E, R, F, \eta, \tilde{K}, H, \Gamma$.

Consequently, we can focus on the computation of $A(\boldsymbol{m})$ to obtain the EVD of $C^{-1}$. The inverse covariance matrix of a RR graph $G \in \mathcal{G}_{N,d,K}$ can be computed from the Hessian of the Gibbs free energy (Abbara et al., 2020; Ricci-Tersenghi, 2012; Nguyen & Berg, 2012) as

  $[C^{-1}]_{ij} = \frac{\partial^{2}A(\boldsymbol{m})}{\partial m_i\partial m_j} = \Big(\frac{d}{1-\tanh^{2}(K)} - d + 1\Big)\delta_{ij} - \frac{\tanh(J_{ij})}{1-\tanh^{2}(J_{ij})}(1-\delta_{ij}),    (107)

and in matrix form we have

  $C^{-1} = \Big(\frac{d}{1-\tanh^{2}(K)} - d + 1\Big)I - \frac{\tanh(\boldsymbol{J})}{1-\tanh^{2}(\boldsymbol{J})},    (108)

where $I$ is an identity matrix of proper size, and the operations $\tanh(\cdot)$ and $\tanh^{2}(\cdot)$ on the matrix $\boldsymbol{J}$ are defined in a component-wise manner.
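Before specializing to the analytic McKay-law computation below, (108) can be checked numerically: build $C^{-1}$ for a sampled RR graph and inspect the spectrum of $C$. A minimal sketch of our own (it assumes the networkx package for random regular graph generation; not the paper's code):

# Sketch (illustrative): construct C^{-1} via Eq. (108) on a sampled RR graph
# and read off the empirical eigenvalue distribution rho(gamma) of C.
import numpy as np
import networkx as nx   # assumed available for random regular graph generation

N, d, K = 500, 3, 0.4   # illustrative sizes and coupling (paramagnetic regime)
A = nx.to_numpy_array(nx.random_regular_graph(d, N, seed=0))  # adjacency matrix
J = K * A                                                     # uniform couplings

t = np.tanh(K)
diag = d / (1 - t ** 2) - d + 1
C_inv = diag * np.eye(N) - np.tanh(J) / (1 - np.tanh(J) ** 2)  # Eq. (108)

gammas = 1.0 / np.linalg.eigvalsh(C_inv)   # gamma = 1/eta, eigenvalues of C
print("support of rho(gamma):", gammas.min(), gammas.max())
print("<gamma> =", gammas.mean())          # should be close to 1, since C_ii = 1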
For a RR graph $G \in \mathcal{G}_{N,d,K}$, $\boldsymbol{J}$ is a sparse matrix; therefore the matrix $\frac{\tanh(\boldsymbol{J})}{1-\tanh^{2}(\boldsymbol{J})}$ also corresponds to a sparse coupling matrix (whose nonzero coupling positions are the same as those of $\boldsymbol{J}$) with constant coupling strength $K' = \frac{\tanh(K)}{1-\tanh^{2}(K)}$ and fixed connectivity $d$. The corresponding eigenvalue (denoted as $\zeta$) distribution can be calculated as (McKay, 1981)

  $\rho_{\zeta}(\zeta) = \frac{d\sqrt{4K'^{2}(d-1) - \zeta^{2}}}{2\pi(K'^{2}d^{2} - \zeta^{2})}, \qquad |\zeta| \le 2K'\sqrt{d-1}.    (109)

From (108), the eigenvalues $\eta$ of $C^{-1}$ are

  $\eta_i = \frac{d}{1-\tanh^{2}(K)} - d + 1 - \zeta_i,    (110)

which, when combined with (109), readily yields the EVD of $\eta$ as $N \to \infty$:

  $\rho_{\eta}(\eta) = \rho_{\zeta}\Big(\frac{d}{1-\tanh^{2}(K)} - d + 1 - \eta\Big) = \frac{d\sqrt{4K'^{2}(d-1) - \big(\frac{d}{1-\tanh^{2}(K)}-d+1-\eta\big)^{2}}}{2\pi\Big(K'^{2}d^{2} - \big(\frac{d}{1-\tanh^{2}(K)}-d+1-\eta\big)^{2}\Big)},    (111)

where $\eta \in \big[\frac{d}{1-\tanh^{2}(K)}-d+1-2K'\sqrt{d-1},\ \frac{d}{1-\tanh^{2}(K)}-d+1+2K'\sqrt{d-1}\big]$. Consequently, since $\gamma = 1/\eta$, we obtain the EVD $\rho(\gamma)$ as

  $\rho(\gamma) = \frac{1}{\gamma^{2}}\,\rho_{\eta}\Big(\eta = \frac{1}{\gamma}\Big),    (112)

where $\gamma \in \big[1/\big(\frac{d}{1-\tanh^{2}(K)}-d+1+2K'\sqrt{d-1}\big),\ 1/\big(\frac{d}{1-\tanh^{2}(K)}-d+1-2K'\sqrt{d-1}\big)\big]$.

G. Additional Experimental Results

Fig. 8 shows the results under the same setting as Fig. 1 except for a larger value of $\lambda$. Good agreement between the replica predictions and the experimental results is also achieved in Fig. 8. As in Fig. 1, there is a negligible difference in Precision and Recall between $\ell_1$-LinR and $\ell_1$-LogR. Meanwhile, compared with Fig. 1, the difference in RSS between $\ell_1$-LinR and $\ell_1$-LogR is reduced for the larger $\lambda$. In addition, by comparing Fig. 1 and Fig. 8, it can be seen that under the same setting, as $\lambda$ increases the Precision becomes larger while the Recall becomes smaller, implying a tradeoff in choosing $\lambda$ in practice for Ising model selection with finite $M, N$.

Figure 8. Theoretical and experimental results of RSS, Precision, and Recall.