Belief-Propagation and replicas for inference and learning in a kinetic Ising model with hidden spins
C Battistin, J Hertz, J Tyrcha and Y Roudi
Kavli Institute for Systems Neuroscience and Centre for Neural Computation, Olav Kyrres gate 9, 7030 Trondheim, Norway
NORDITA, KTH Royal Institute of Technology and Stockholm University, 10691 Stockholm, Sweden
Niels Bohr Institute, Blegdamsvej 17, 2100 Copenhagen, Denmark
Mathematical Statistics, Stockholm University, S106 91 Stockholm, Sweden
E-mail: [email protected]
December 2014
Abstract.
We propose a new algorithm for inferring the state of hidden spins and reconstructing the connections in a synchronous kinetic Ising model, given the observed history. Focusing on the case in which the hidden spins are conditionally independent of each other given the state of observable spins, we show that calculating the likelihood of the data can be simplified by introducing a set of replicated auxiliary spins. Belief Propagation (BP) and Susceptibility Propagation (SusP) can then be used to infer the states of hidden variables and learn the couplings. We study the convergence and performance of this algorithm for networks with both Gaussian-distributed and binary bonds. We also study how the algorithm behaves as the fraction of hidden nodes and the amount of data are changed, showing that it outperforms the TAP equations for reconstructing the connections.
1. Introduction
Reconstructing interactions in a complex network and building statistical models for their stochastic dynamics are important and active areas of research in statistical physics and machine learning. In recent years, questions related to this topic have received extra attention given the applications in analyzing high-throughput biological (e.g. gene expression or neural) as well as financial data. The statistical physics community has contributed significantly to this area in recent years, offering tools to implement approximate and efficient inference algorithms as well as theoretical insight into various aspects of the problem [1, 2, 3, 4, 5].

Although the amount of data that one can record from, e.g., biological networks is rapidly increasing, even in the best cases we still do not have access to recordings from the whole network. For example, with the most advanced recording technologies today, we can still only measure the activity of a fraction of neurons in a local cortical network. A standard approach to learning in the presence of such hidden units is Expectation Maximization (EM) [21, 22], a two-step recursive algorithm that alternates between computing hidden moments at fixed connectivity and updating the connectivity given the moments. In the case of the kinetic Ising model with hidden nodes, as shown in [18, 19], this EM algorithm can be done approximately using mean-field and TAP approximations: approximations that can be thought of as small-coupling expansions. Another class of approximations, more appropriate for sparse and strong connectivity, similar to some biological networks, is cavity/Belief-Propagation (BP) approximations. For the kinetic Ising model without hidden nodes, these approximations have been studied in [23, 24]. In this paper, we focus our attention on the inference of hidden states and the learning of network connectivity with hidden nodes, using BP and its extension to calculating susceptibilities, Susceptibility Propagation (SusP), in a replica-based approach. We derive an algorithm based on these approximations for the case where there are no hidden-to-hidden connections and numerically evaluate its performance under different conditions.

The paper is organized as follows. In the next section we define the dynamical model and introduce a replicated system of auxiliary hidden nodes that simplifies calculating the likelihood. We then derive the update rules for the couplings in a gradient-ascent fashion on the likelihood, emphasizing the required expectation values to be taken over the distribution of hidden node values. These are expressed in terms of cavity messages, which we compute using BP and SusP. We then present numerical results on the performance of the resulting algorithm.
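To fix ideas, a minimal Python skeleton of the EM loop just described could read as follows; e_step and m_step are placeholders for the moment computation and the coupling update that the rest of the paper makes concrete.

```python
# Minimal EM skeleton (helper names are ours, not the paper's).
def em(data, couplings, e_step, m_step, n_iter=100):
    for _ in range(n_iter):
        moments = e_step(data, couplings)              # E: hidden moments at fixed couplings
        couplings = m_step(data, moments, couplings)   # M: update couplings given the moments
    return couplings
```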
2. Formulation of the model
We consider a network of N = N_v + N_h binary (±1) spins: N_v visible units, whose states at time t we denote as s(t) = {s_i(t)}, and N_h hidden units, with states σ(t) = {σ_a(t)}. The dynamics is Markovian and the spins are updated synchronously at each time step, according to the transition probability

P[s(t+1), σ(t+1) | s(t), σ(t)] = { exp[Σ_i s_i(t+1) h_i(t)] / Π_i 2cosh[h_i(t)] } × { exp[Σ_a σ_a(t+1) b_a(t)] / Π_a 2cosh[b_a(t)] },   (1)

where we choose the fields acting on visible and hidden units respectively as follows:

h_i(t) = Σ_j J_ij s_j(t) + Σ_b K_ib σ_b(t)   (2)
b_a(t) = Σ_j L_aj s_j(t).   (3)

Notice that the units are conditionally independent given the state of the network at the previous time step and that we restrict ourselves to a system without hidden-to-hidden interactions. This choice is important for the following derivation. We do not make any further assumption on the connectivity; in particular, the couplings need not be either symmetric or fully asymmetric. We observe, however, that because there are no hidden-to-hidden connections the likelihood factorizes in time, P[s] = Π_t P_t[s], where each factor involves a trace over single-time hidden states σ(t):

P_t[s] = Σ_σ exp[ Σ_i s_i^+ ( Σ_j J_ij s_j + Σ_b K_ib σ_b ) + Σ_{a,j} σ_a L_aj s_j^- ] / ( Π_i 2cosh[ Σ_j J_ij s_j + Σ_b K_ib σ_b ] · Π_a 2cosh[ Σ_j L_aj s_j^- ] ).   (4)

Here we have dropped the argument t on the σs and written s_j^± for s_j(t ± 1). The likelihood of the observed history under the stochastic process in (1) is obtained by tracing out the hidden states from the joint distribution of the histories of hidden and visible states. Given the Js, Ks and Ls this operation is exponentially complex in N_h and can therefore only be performed exactly for very small systems (in practice N_h of order 10). For larger networks, approximate methods, such as those in [18] and [19], have to be introduced. In this paper, we introduce an approximation scheme that uses the BP and SusP algorithms. To formulate it, we will need to introduce replicas.
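A minimal sketch of the synchronous dynamics (1)-(3) in Python follows; the sizes and the coupling statistics here are placeholders, chosen only to make the example run.

```python
import numpy as np

rng = np.random.default_rng(0)
Nv, Nh, T = 20, 4, 1000
J = rng.normal(0, 1 / np.sqrt(Nv), (Nv, Nv))   # visible -> visible couplings
K = rng.normal(0, 1 / np.sqrt(Nh), (Nv, Nh))   # hidden  -> visible couplings
L = rng.normal(0, 1 / np.sqrt(Nv), (Nh, Nv))   # visible -> hidden couplings

s = rng.choice([-1, 1], Nv)        # visible spins s(0)
sigma = rng.choice([-1, 1], Nh)    # hidden spins sigma(0)
s_history = [s]
for _ in range(T):
    h = J @ s + K @ sigma          # fields on visible units, eq. (2)
    b = L @ s                      # fields on hidden units, eq. (3)
    # each spin is set to +1 with probability e^h / (2 cosh h), cf. eq. (1)
    s = np.where(rng.random(Nv) < (1 + np.tanh(h)) / 2, 1, -1)
    sigma = np.where(rng.random(Nh) < (1 + np.tanh(b)) / 2, 1, -1)
    s_history.append(s)
```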
On the right-hand side of (4) we are dealing with a system of hidden units interacting with future, past and present observations. From now on we are going to treat our observations as sources of an external field, rather than as states of reacting units. One can then notice that the only term that makes the distribution we are integrating over different from the usual equilibrium Ising distribution for the hidden units is the first hyperbolic cosine in the denominator.

Consider now the following equality:

(2cosh[f_i])^n = Σ_{τ_i} exp[ Σ_{α=1}^n τ_i^α f_i ]   (5)

with f_i = Σ_j J_ij s_j + Σ_b K_ib σ_b, which introduces a trace over n replicas of N_v auxiliary spins τ_i^α = ±1. Defining

P_t^(n)[s] ≡ Σ_{σ,τ} exp[ Σ_i ( s_i^+ + Σ_{α=1}^n τ_i^α )( Σ_j J_ij s_j + Σ_b K_ib σ_b ) + Σ_{a,j} σ_a L_aj s_j^- ] / Π_a 2cosh[ Σ_j L_aj s_j^- ],   (6)

using Eq. (5) and then taking the limit of the number of replicas to −1, we have, for the likelihood of the data,

P_t[s] = lim_{n→−1} P_t^(n)[s].   (7)
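As a quick sanity check, the identity (5) can be verified numerically at integer n; the analytic continuation to n → −1 is then taken as in the text. The snippet below is ours.

```python
import numpy as np
from itertools import product

# Tracing over n replicated +-1 spins coupled to a field f reproduces (2 cosh f)^n.
f, n = 0.7, 3
lhs = (2 * np.cosh(f)) ** n
rhs = sum(np.exp(f * sum(tau)) for tau in product([-1, 1], repeat=n))
assert np.isclose(lhs, rhs)   # holds for any integer n
```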
In the next section, using the derivatives of P^(n), we will derive learning rules for the couplings. Assuming that the order of these derivatives and the limit n → −1 can be interchanged, we can work at integer n and take the limit at the end. In the replicated system defined by (6), each auxiliary spin τ_i^α feels the field Σ_j J_ij s_j generated by the data and couples to the σ_b via the K_ib. These couplings are now the only interaction left in the problem: everything else can just be thought of as external fields, as far as the σs and τs are concerned. We can therefore now use standard methods for Ising spin systems on this rather conventional (except for having n → −1 replicas of the τ spins) problem. Figure 1 illustrates the new system.

The EM algorithm provides a simple and robust tool for parameter estimation in models with incomplete data [25]. In the maximization step, the model parameters are updated in order to maximize the likelihood given the statistics of the hidden variables. Here we implement this step using the gradient-based learning rule ΔJ_ij ∝ ∂log P(s)/∂J_ij. Using (6) and (7) one gets

ΔJ_ij ∝ lim_{n→−1} ( s_i^+ + Σ_{α=1}^n ⟨τ_i^α⟩ ) s_j,   (8)
Figure 1: Time slicing of the dynamics. Equilibrium bipartite system of hidden nodes σ and n replicas of auxiliary hidden nodes τ at time t, interacting via the Ks. Observed nodes at t − 1, t and t + 1 provide a different external input at each time step.

where the average denoted by ⟨···⟩ is taken over the conditional distribution P^(n)(σ, τ | s) = P^(n)(σ, τ, s)/P^(n)(s). (This is, of course, the contribution to ΔJ_ij from a single time step; the full ΔJ_ij is obtained by summing over steps. In what follows this summation will be implicit.)

In the replica-symmetric Ansatz the replicated systems are indistinguishable and consequently the magnetizations ⟨τ_i^α⟩ are all equal for α = 1 ... n. Denoting their common value by m_i, that is,

m_i ≡ ⟨τ_i^1⟩ = ⟨τ_i^2⟩ = ··· = ⟨τ_i^n⟩   (9)

and taking the limit n → −1, we have

ΔJ_ij ∝ s_i^+ s_j − m_i s_j   (10a)
ΔK_ib ∝ s_i^+ µ_b − ⟨τ_i σ_b⟩   (10b)
ΔL_aj ∝ µ_a s_j^- − tanh[ Σ_k L_ak s_k^- ] s_j^-,   (10c)

where µ_a ≡ ⟨σ_a⟩, and for deriving ΔK_ib and ΔL_aj we followed the same procedure as exemplified for ΔJ_ij. From (10a)-(10c) we see that, for performing the expectation step, we need to evaluate the mean values of τ and σ, i.e. m_i and µ_a, and the pair averages ⟨τ_i σ_b⟩.
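A hedged sketch of one M-step sweep implementing (10a)-(10c) is given below. The hidden-unit statistics m (shape (T, Nv)), mu (shape (T, Nh)) and tau_sigma (shape (T, Nv, Nh)) are assumed to come from the E step; s_prev, s_now and s_next are (T, Nv) arrays holding s(t−1), s(t) and s(t+1) for the T time slices. The function name and learning rate are ours.

```python
import numpy as np

def m_step(J, K, L, s_prev, s_now, s_next, m, mu, tau_sigma, lr=0.01):
    dJ = np.einsum('ti,tj->ij', s_next - m, s_now)                      # (10a), summed over t
    dK = np.einsum('ti,tb->ib', s_next, mu) - tau_sigma.sum(axis=0)     # (10b)
    dL = np.einsum('ta,tj->aj', mu - np.tanh(s_prev @ L.T), s_prev)     # (10c)
    return J + lr * dJ, K + lr * dK, L + lr * dL
```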
These learning rules are exact and could have been derived without replicas: with the identification τ_i = tanh f_i, they are just those of ref. [19], Eqs. (23-26). What requires approximation (except for very small systems) is evaluating the averages. In the next subsection, we will show how they can be calculated approximately using the BP and SusP algorithms, within the replica framework.

2.1. Computing m_i, µ_a and ⟨τ_i σ_b⟩

Consider first the marginals m_i and µ_a. These can be obtained from the cavity magnetizations m_i^a and µ_a^{iα}, which are the magnetizations calculated as if the connectivity were tree-like, with the connection to the unit labeled by the superscript removed. Standard cavity arguments [10] lead to the following closed set of equations for the cavity magnetizations:

µ_a^i = tanh[ g_a − Σ_j tanh^{-1}( t_ja m_j^a ) − tanh^{-1}( t_ia m_i^a ) ]   (11a)
m_i^a = tanh[ Σ_j J_ij s_j + Σ_{b≠a} tanh^{-1}( t_ib µ_b^i ) ]   (11b)

where we have defined g_a = Σ_i K_ia s_i^+ + Σ_j L_aj s_j^- and t_ja = tanh[K_ja]. In (11a)-(11b) we have got rid of the replica index α, since by replica symmetry µ_a^{iα} is independent of α; the limit n → −1 can then be taken explicitly and is responsible for the minus signs multiplying the τ contributions in (11a). To compute m_i one has to reintroduce the removed bond, through the relation tanh^{-1}[m_i] = tanh^{-1}[m_i^a] + tanh^{-1}[t_ia µ_a^i], and similarly for the µ_a.

How can we estimate the ⟨τ_i σ_b⟩? Inspired by the Susceptibility Propagation algorithm [26, 27], we exploit the fluctuation-response theorem that matches the correlation between two nodes with the response of the magnetization at one site to a perturbation at the other site. In our problem it translates into the relations

⟨τ_i σ_b⟩ = χ_ib + m_i µ_b   (12)
χ_ib = ∂m_i / ∂b_b^-   (13)

where χ_ib is the susceptibility. Defining

b_a^- ≡ Σ_j L_aj s_j^-,   (14)

we derive a closed set of equations for the derivatives v_iab and g_ajb of the cavity magnetizations:

v_iab ≡ ∂µ_a^i / ∂b_b^- = ( 1 − (µ_a^i)^2 ) { δ_ab − Σ_j [ t_ja / ( 1 − t_ja^2 (m_j^a)^2 ) ] g_ajb }   (15a)
g_ajb ≡ ∂m_j^a / ∂b_b^- = ( 1 − (m_j^a)^2 ) Σ_{c≠a} [ t_jc / ( 1 − t_jc^2 (µ_c^j)^2 ) ] v_jcb.   (15b)

From the stationary points of (15a)-(15b) one can recover the susceptibility using

χ_ib = ( 1 − m_i^2 ) Σ_a [ t_ia / ( 1 − t_ia^2 (µ_a^i)^2 ) ] v_iab.   (16)
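The following is a hedged sketch, in Python, of the BP part of the E step at a single time slice: the cavity equations (11a)-(11b) are iterated by plain parallel updates (no damping), and the removed bonds are then reintroduced to obtain the full magnetizations. The susceptibility equations (15a)-(15b) and (16) iterate analogously on the arrays v and g and are omitted for brevity. All names and the update schedule are ours; s_prev, s_now, s_next stand for s(t−1), s(t), s(t+1).

```python
import numpy as np

def e_step_bp(J, K, L, s_prev, s_now, s_next, n_sweeps=200, tol=1e-10):
    Nv, Nh = K.shape
    t = np.tanh(K)                          # t_{ia} = tanh K_{ia}
    g = s_next @ K + L @ s_prev             # g_a = sum_i K_ia s_i^+ + sum_j L_aj s_j^-
    Jfield = J @ s_now                      # sum_j J_ij s_j
    mu_cav = np.zeros((Nv, Nh))             # mu_a^i: cavity magnetization of sigma_a
    m_cav = np.zeros((Nv, Nh))              # m_i^a:  cavity magnetization of tau_i
    for _ in range(n_sweeps):
        atm = np.arctanh(t * m_cav)         # tanh^{-1}(t_ja m_j^a)
        mu_new = np.tanh(g - atm.sum(0) - atm)                                # (11a)
        atmu = np.arctanh(t * mu_new)       # tanh^{-1}(t_ib mu_b^i)
        m_new = np.tanh(Jfield[:, None] + atmu.sum(1, keepdims=True) - atmu)  # (11b)
        done = np.abs(m_new - m_cav).max() < tol
        mu_cav, m_cav = mu_new, m_new
        if done:
            break
    # full magnetizations: reintroduce the removed bond (any cavity index works)
    m = np.tanh(np.arctanh(m_cav[:, 0]) + np.arctanh(t[:, 0] * mu_cav[:, 0]))
    mu = np.tanh(np.arctanh(mu_cav[0]) + np.arctanh(t[0] * m_cav[0]))
    return m, mu, mu_cav, m_cav
```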
3. Numerical Results
We tested the reconstruction power of our algorithm on two bond distributions: Gaussian and binary. In both cases we investigated the impact of the sparseness on the performance of the algorithm, by varying the sparsity parameter c = Pr[node i is connected to node j].

Once we have decided which nodes are connected to which (each bond is non-zero with probability c), we choose the non-zero couplings to be independently drawn from a normal distribution with zero mean and standard deviation

σ_J = g / √(N_v c)   (17a)
σ_L = g / √(N_v c)   (17b)
σ_K = g / √(N_h c)   (17c)

where g is the coupling-strength parameter. This choice ensures that the fields exerted on hidden and visible units are of the same size, despite the absence of hidden-to-hidden interactions.
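A short sketch of this generative choice of couplings is given below; the function and variable names are ours.

```python
import numpy as np

# Each bond is present with probability c and, if present, Gaussian with the
# standard deviations (17a)-(17c).
def make_couplings(Nv, Nh, c, g, rng):
    def draw(rows, cols, scale):
        present = rng.random((rows, cols)) < c
        return present * rng.normal(0.0, scale, (rows, cols))
    J = draw(Nv, Nv, g / np.sqrt(Nv * c))   # visible -> visible
    L = draw(Nh, Nv, g / np.sqrt(Nv * c))   # visible -> hidden
    K = draw(Nv, Nh, g / np.sqrt(Nh * c))   # hidden  -> visible
    return J, K, L

J, K, L = make_couplings(Nv=20, Nh=4, c=0.1, g=1.0, rng=np.random.default_rng(0))
```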
In section 2 we set up our learning protocol in two steps: expectation and maximization. In the expectation (E) step we estimate means and correlations of the hidden units given the observations and the current estimate of the couplings. Here we check the performance of the E step, i.e., equations (11a)-(11b) and (15a)-(15b), given the true connectivity. First we study the convergence of the BP equations and the correlation between the hidden states σ_a(t) and the estimator σ̂_a(t) ≡ sign(µ_a(t)).

Figure 2A shows that the correlation ⟨σ_a(t) σ̂_a(t)⟩ increases monotonically with the fraction of visible units, on both fully-connected and sparse graphs. The change of concavity at N_v = N_h has to be ascribed to the fact that more visible units provide more information about the states of the hidden units.

Figure 2: BP performance in inferring the statistics of hidden units vs fraction of visible units. (A) Correlation between hidden states and the sign of the inferred magnetizations σ̂_a(t). (B) Fraction of converged time steps. N = 100, T = 100, averages over 10 realizations of the couplings, sparse graphs with c = 0.1 (blue) and fully connected ones (green). (C) RMSE between BP and exact gradient-ascent (equations (18a)-(18c)) magnetizations and correlation functions. N = 20, T = 10, averages over 10 realizations of the couplings, c = 0.1.

We would like, however, to point out that for such a small and highly dilute random network, hidden units have few connections with visible units, and this can contribute to the low performance at N_v < N_h for sparse networks. For small systems, we were able to compare our BP-based results with "exact" ones (i.e., the best possible ones, given the finite data set used to base the inference on).

Recall that our gradient-ascent learning rules (10a)-(10c) are exact when the averages m_i, µ_b and ⟨τ_i σ_b⟩ are evaluated using the true distribution P_t[σ|s]:

m_i = Tr_σ { tanh[ Σ_j J_ij s_j + Σ_b K_ib σ_b ] P_t[σ|s] }   (18a)
µ_b = Tr_σ { σ_b P_t[σ|s] }   (18b)
⟨τ_i σ_b⟩ = Tr_σ { tanh[ Σ_j J_ij s_j + Σ_c K_ic σ_c ] σ_b P_t[σ|s] }   (18c)

These averages can be computed by direct summation over states for small systems. Comparing the BP-based estimates of the parameters with those obtained in this way enabled us to see how much of the error we found (relative to ground truth) was due to BP (Figure 2C) and how much was due to the finiteness of the data set.

If one removes the cases for which SusP diverged, the RMSE between the correlations ⟨σ_a τ_i⟩ inferred using BP and those inferred using exact enumeration drops as the fraction of visible units increases. In the divergent cases, typically, correlations inferred using BP grow exponentially while iterating (15a)-(15b). The lack of convergence of SusP has already been pointed out in the literature [27], and [28] suggests an alternative formula to compute the two-point correlation function that provides finite estimates, without, however, solving the issue. In the next section we study the convergence of the learning algorithm and propose a simple but effective method to estimate the correlations between hidden units.
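For these small-system comparisons, the averages (18a)-(18c) at one time slice can be evaluated by brute force over the 2^{N_h} hidden configurations; a sketch (names are ours) follows.

```python
import numpy as np
from itertools import product

def exact_moments(J, K, L, s_prev, s_now, s_next):
    sigmas = np.array(list(product([-1, 1], repeat=K.shape[1])))
    f = (J @ s_now)[None, :] + sigmas @ K.T            # fields f_i per hidden configuration
    log_w = f @ s_next - np.log(2 * np.cosh(f)).sum(1) + sigmas @ (L @ s_prev)
    w = np.exp(log_w - log_w.max()); w /= w.sum()      # P_t[sigma | s], cf. eq. (4)
    mu = w @ sigmas                                    # (18b)
    m = w @ np.tanh(f)                                 # (18a)
    tau_sigma = np.einsum('c,ci,cb->ib', w, np.tanh(f), sigmas)  # (18c)
    return m, mu, tau_sigma
```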
We studied the performance of our algorithm in learning the connectivity for sparse and fully connected networks with Gaussian-distributed non-zero bonds with the standard deviations defined in (17a)-(17c). Since the system is invariant with respect to permutations of the hidden node indices, we initialised our algorithm with initial couplings having the same sign as the true ones. Figure 3A shows examples of the qualitative agreement between the reconstructed couplings J^rec_ij and the generative-model couplings J_ij, while Figure 3B shows that the RMSE on the reconstructed couplings,

RMSE ≡ (1/σ_J) √( Σ_{i,j} ( J^rec_ij − J_ij )^2 / N_v^2 ),   (19)

scales as 1/√T with the data length T.

Figure 3: Learning on graphs with N_v = 80 visible units and N_h = 8 hidden units, sparsity c = 0.1, g = 1. Js (green), Ks (red), Ls (blue). (A) An example showing the reconstructed couplings versus the true ones used in generating the data, at a data length T = 10. Crosses correspond to the initial values for the reconstruction. (B) RMSE on couplings vs data length T, N_v = 80, N_h = 8 (left) and N_h = 16 (right). Error bars correspond to one standard deviation over 20 realizations of the generative model.

Figure 4 shows that, as expected, the weaker the couplings, the longer the data length required for convergence. Figure 4 also points out that for fully connected topologies a 100% convergence rate is never reached when the couplings are strong. Inspecting the behavior of the algorithm during learning, one realizes that the divergence develops in a systematic way. First, after some iterations of the EM algorithm, equations (11a)-(11b) for the magnetizations stop converging. Then, after a few iterations, the equations for the correlations (15a)-(15b) stop converging, and eventually the correlations explode and the couplings follow. Fig. 5A displays a typical example of the evolution of the divergence in terms of the hidden-unit statistics and the RMSE on the couplings. Notice that we considered both sets of equations, (11a)-(11b) and (15a)-(15b), as having converged when the average square distance between cavity magnetizations and correlations at consecutive iterations falls below a fixed small threshold.

We have tried to solve the problem starting from its source, improving the convergence of the BP equations for the magnetizations by setting initial values to the inferred values at the previous EM iteration, using naïve mean-field theory and initializing m_i(t) at s_i(t+1), as well as damping the learning adaptively. However, we did not observe any significant effect from doing this. We therefore intervened on the correlations directly, exploiting the Cauchy–Schwarz inequality to detect diverging two-point correlation functions (⟨σ_b τ_i⟩ > 1) and replacing them by their corresponding independent-spin values (µ_b m_i); a sketch of this correction is given after Figure 4 below. The correction prevents the correlations from diverging and consequently allows further improvement of the learning, as shown in Fig. 5B. Figure 6 compares RMSE vs data length for the original uncorrected algorithm and the corrected one: the improvement in convergence rate affects the quality of the reconstruction as well. Figure 7 shows that the RMSE on the reconstructed couplings decreases with the system size, if the N_v to N_h ratio and the data length T are fixed. In order to be conclusive on the scaling of the RMSE with the system size, the problem needs further numerical exploration.

Figure 4: Fraction of converged instances vs data length for a network of N_v = 20 visible units and N_h = 4 hidden. (A) c = 0.1, g = 1; (B) c = 1, g = 1; (C) c = 1, g = 0.5; (D) c = 1, g = 0.1. Convergence of Js (green), Ks (red) and Ls (blue) is considered separately.
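A standalone sketch of the correction described above: any two-point function violating the Cauchy–Schwarz bound |⟨σ_b τ_i⟩| ≤ 1 for ±1 variables is replaced by its independent-spin value. The function name is ours.

```python
import numpy as np

def confine_correlations(tau_sigma, m, mu):
    independent = np.outer(m, mu)          # m_i * mu_b
    diverged = np.abs(tau_sigma) > 1.0
    return np.where(diverged, independent, tau_sigma)
```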
We tested our algorithm on networks with binary-valued connections in addition to the Gaussian-distributed ones that we discussed above. The sparsity is controlled by the parameter c, as in the previous section. That is, each bond is non-zero with probability c, in which case it is positive or negative with equal probability, taking the values

J_ij = ± g / √(N_v c)   (20a)
L_aj = ± g / √(N_v c)   (20b)
K_ia = ± g / √(N_h c)   (20c)

Figure 8 shows how the RMSE scales with the data length for dilute and fully connected systems.

In order to investigate how much prior knowledge of the degree of sparsity affects the reconstruction, we attempted an a posteriori pruning of the reconstructed couplings (sketched at the end of this section). This consisted of classifying the reconstructed couplings according to their strength as zero or non-zero couplings, consistently with the sparsity: the first group are set to zero, while the remaining couplings are sorted by sign and then set to their respective inferred means. Unfortunately the procedure did not improve the quality of the reconstruction, due to misclassification: the distribution of the reconstructed couplings is not trimodal, and the three distributions (zero, positive and negative couplings) overlap.

Figure 5: Inference and learning vs iteration of our EM algorithm. T = 10, N_v = 20, N_h = 4, g = 1, c = 1. RMSE on the couplings (RMSE(J)·σ_J, RMSE(K)·σ_K, RMSE(L)·σ_L) and average absolute values of the statistics (⟨|⟨σ_a(t) τ_i(t)⟩|⟩, ⟨|m_i|⟩, ⟨|µ_a|⟩) during learning of a single connectivity. Left and right plots share the same legend (displayed on the right figure). Left: uncorrected SusP. Right: corrected SusP. In this example the BP equations (11a)-(11b) stop converging at the 11th iteration of the algorithm. After a few iterations the equations for the correlations (15a)-(15b) follow, and our correction starts affecting the learning. Confinement of the inferred correlations ensures the convergence of the learning algorithm.

We also compared the performance to that obtained with the algorithms recently developed in [18, 19]. In the absence of hidden-to-hidden interactions and for weak couplings, the authors of those papers derived the same set of TAP-like equations: in the first paper from the saddle-point approximation to the path integral of the likelihood in equation (4), and in the second through a variational approach. We find here that BP outperforms TAP on both fully connected and sparse topologies, as can be seen in Figure 9, where we plot the RMSE of the reconstructed couplings as a function of the coupling strength g.

Figure 6: RMSE on the couplings vs data length for networks of N_v = 90 visible units and N_h = 10 hidden, g = 1. Top: the original algorithm; bottom: the corrected algorithm. (A) and (C): c = 0.1; (B) and (D): c = 1. Convergence of Js (green), Ks (red) and Ls (blue) is considered separately. Errors correspond to one standard deviation around the mean for 10 realizations of the couplings.
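The a-posteriori pruning described above can be sketched as follows; the threshold choice (keeping the fraction c of strongest bonds) and the resetting of the survivors to the mean of their sign group are our reading of the procedure in the text.

```python
import numpy as np

def prune_binary(J_rec, c):
    strengths = np.sort(np.abs(J_rec).ravel())
    n_zero = int(round((1 - c) * J_rec.size))          # bonds expected to be absent
    thresh = strengths[n_zero - 1] if n_zero > 0 else -np.inf
    J = np.where(np.abs(J_rec) > thresh, J_rec, 0.0)   # weakest couplings set to zero
    pos, neg = J > 0, J < 0
    if pos.any(): J[pos] = J[pos].mean()               # positive group -> its inferred mean
    if neg.any(): J[neg] = J[neg].mean()               # negative group -> its inferred mean
    return J
```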
4. Discussion
In this paper we presented a novel approach for studying the dynamics of a kinetic Ising model with hidden nodes and learning the connections in such a network. For a system without hidden-to-hidden connections, the likelihood of the data can be written as a set of independent equilibrium problems, one for each time step. In each such problem, calculating the partition function requires performing a trace over hidden variables. We showed that the form of the action we get allows performing this trace at the cost of introducing auxiliary replicated spins that mediate the interaction between hidden and observed nodes at the same time. Combined further with Belief Propagation and Susceptibility Propagation, we derived learning rules for inferring the state of the hidden nodes and the couplings in the network, given the observed data.

The algorithm we developed here was derived in the replica-symmetric framework, which may make one wonder whether a form of replica symmetry breaking might be introduced in our approach. However, it is important to note that, due to the lack of terms of the form τ^α τ^β, there is no interaction between the replicas and thus no possibility of breaking the replica symmetry. Consequently, in the limit where the number of replicas goes to −1, the replica-symmetric solution is the true solution.

Figure 7: RMSE of the reconstructed couplings vs system size N. Data length T = 10, coupling strength g = 1, sparsity c = 0.1, N_v/N_h = 4.

Figure 8: Learning on graphs with binary couplings: RMSE of couplings vs data length. N_v = 20 visible units and N_h = 4 hidden units, sparsity c = 0.1 (left) and c = 1 (right). Js (green), Ks (red), Ls (blue).

We assessed the accuracy of our approximate method in both estimating the hidden units' statistics and solving the inverse problem. The choice of the BP algorithm for the inference step was motivated by its good performance on equilibrium spin glasses [28]. We noticed that when trying to learn the connections, for which we used the SusP algorithm, the instability of that algorithm affects the reconstruction of strong couplings. We thus suggested a simple but effective correction to improve convergence, finding that it ensured that the RMSE on the couplings scales as 1/√T. Finally, we compared our algorithm to the TAP approach [18], finding that the algorithm described here systematically outperforms the latter, though by a small margin.

Figure 9: BP-TAP comparison: RMSE of the reconstructed couplings vs coupling strength g. Solid (dashed) lines correspond to N_v = 90 (N_v = 80) visible units and N_h = 10 (N_h = 20) hidden units, sparsity c = 0.1 (left) and c = 1 (right); TAP (green), BP (blue), T = 10. Error bars correspond to one standard deviation over 10 realizations of the generative model.

Although the algorithm proposed here improves on TAP even for weak couplings, it is important to note that TAP is polynomial in the number of hidden nodes, while our algorithm is polynomial in the total system size. Furthermore, TAP learning allows for the reconstruction of the couplings in the presence of hidden-to-hidden connections. It would therefore be interesting to combine the two algorithms in the following way: during learning, one can initially set the hidden-to-hidden couplings to zero, learning all the other connections using the BP-replica method proposed here, and then move on to learn the hidden-to-hidden couplings using the TAP equations.

For the case of sparse networks, there are some modifications of the algorithm that we have not yet tried, but that may improve the method. For instance, it would be interesting to try an online decimation, that is, forcing the known sparsity at every step of learning, instead of the off-line one that we tried. It is also well known that, particularly for small data sets, one can mistakenly infer that absent connections are present but with small values. It would thus be interesting to test non-trivial standard methods for optimally pruning the reconstructed networks, such as L1 regularization.

Acknowledgments
This work has been partially supported by the Marie Curie Initial Training Network NETADIS (FP7, grant 290038).

References

[1] Lezon T R, Banavar J R, Cieplak M, Maritan A and Fedoroff N V 2006 Proceedings of the National Academy of Sciences
[2] Journal of Statistical Mechanics: Theory and Experiment P12001
[3] Cocco S, Leibler S and Monasson R 2009 Proceedings of the National Academy of Sciences
[4] Physica A: Statistical Mechanics and its Applications
[5] Physical Review E
[6] Physical Review E
[7] Journal of Statistical Physics
[8] Philosophical Magazine
[9] Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences
[10] The European Physical Journal B - Condensed Matter and Complex Systems
[11] Journal of Statistical Physics
[12] Physical Review Letters
[13] Journal of Statistical Mechanics: Theory and Experiment L07001
[14] Aurell E and Mahmoudi H 2011 Journal of Statistical Mechanics: Theory and Experiment P04014
[15] Aurell E and Mahmoudi H 2012 Phys. Rev. E 85(3) 031119 URL http://link.aps.org/doi/10.1103/PhysRevE.85.031119
[16] Roudi Y and Hertz J 2011 Journal of Statistical Mechanics: Theory and Experiment P03031
[17] Mahmoudi H and Saad D 2014 Journal of Statistical Mechanics: Theory and Experiment P07001
[18] Dunn B and Roudi Y 2013 Physical Review E
[19] Mathematical Biosciences and Engineering
[20] Journal of Statistical Mechanics: Theory and Experiment P06013
[21] Sundberg R 1974 Scandinavian Journal of Statistics
[22] Journal of the Royal Statistical Society. Series B (Methodological)
[23] Journal of Statistical Mechanics: Theory and Experiment P08009
[24] Aurell E and Mahmoudi H 2011 Communications in Theoretical Physics
[25] The Annals of Statistics
[26] Journal of Physiology-Paris
[27] The European Physical Journal B - Condensed Matter and Complex Systems
[28] Journal of Statistical Mechanics: Theory and Experiment 2012