Expectation propagation on the diluted Bayesian classifier

Alfredo Braunstein,1,2,3 Thomas Gueudré,∗ Andrea Pagnani,1,2,3 and Mirko Pieropan†

1 Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca Degli Abruzzi 24, Torino, Italy
2 Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060 Candiolo (TO), Italy
3 INFN Sezione di Torino, Via P. Giuria 1, I-10125 Torino, Italy
Efficient feature selection from high-dimensional datasets is a very important challenge in many data-driven fields of science and engineering. We introduce a statistical mechanics inspired strategy that addresses the problem of sparse feature selection in the context of binary classification by leveraging a computational scheme known as expectation propagation (EP). The algorithm is used in order to train a continuous-weights perceptron learning a classification rule from a set of (possibly partly mislabeled) examples provided by a teacher perceptron with diluted continuous weights. We test the method in the Bayes optimal setting under a variety of conditions and compare it to other state-of-the-art algorithms based on message passing and on expectation maximization approximate inference schemes. Overall, our simulations show that EP is a robust and competitive algorithm in terms of variable selection properties, estimation accuracy and computational complexity, especially when the student perceptron is trained from correlated patterns that prevent other iterative methods from converging. Furthermore, our numerical tests demonstrate that the algorithm is capable of learning online the unknown values of prior parameters, such as the dilution level of the weights of the teacher perceptron and the fraction of mislabeled examples, quite accurately. This is achieved by means of a simple maximum likelihood strategy that consists in minimizing the free energy associated with the EP algorithm.
I. INTRODUCTION
The problem of extracting sparse information from high dimensional data is among the most interesting challenges in theoretical computer science, with many applications ranging from computational biology to combinatorial chemistry, neuroscience and natural language processing [1, 2]. As a specific example, next generation sequencing and, in general, the ongoing technological revolution related to high-throughput technologies in biology pose very stringent requirements to the algorithmic techniques that are supposed to analyze the data that are produced and made publicly available through easily accessible databases. Just to give some orders of magnitude, a typical genetic screening for cancer-related pathologies – freely available from The Cancer Genome Atlas website [3] – involves measurement of activity or genetic sequence variation over ∼ 20,000 genes measured on patient cohorts that typically count around 1,000 individuals divided into cases and controls (lung and colorectal cancer are an exception, with considerably larger cohorts), the aim being to extract a signature of O(10) genes from the 23,000 measured probes. Such a problem can be simply formulated in terms of the following classification problem: given the activity and/or the genetic alterations of an individual, find a simple rule involving a small – possibly the smallest – subset of genes to assess the probability for the individual to develop the disease. There are two main difficulties in this task: (i) typically genes act in a combinatorial and non linear manner, (ii) individual samples turn out to be statistically very correlated.

Historically, the problem of sparse feature selection in classification tasks has been divided into two complementary computational methods [2]: (i) wrappers, which exploit the learning mechanism to produce a prediction-value-related score for the sought signature, (ii) filters, where the signature extraction is a data pre-processing step, typically unrelated to the classification task.

From the point of view of information theory, the problem of sparse feature selection in classification is strictly related to Compressive Sensing (CS), one of the most studied methods for data acquisition, with interesting applications in several other research fields [4, 5]. CS was originally proposed as a new low-rate signal acquisition technique for compressible signals [4, 6, 7] and is formulated as follows: given M < N, a vector y ∈ R^M and a linear operator of maximal rank F ∈ R^{M×N}, often referred to as the measurement or sensing matrix, the CS problem consists in determining the unknown sparse vector w ∈ R^N that is linked to its compressed projection y by means of the linear transformation y = F w, where F and y are assumed to be known. Although research in CS still has many open challenges to face, very stringent results are known about the general conditions for the existence and uniqueness of the solution. Among the different algorithms that have been proposed in order to reconstruct the signal efficiently, many use techniques borrowed from the statistical mechanics of disordered systems [8–11].

More recently, the so-called 1-bit CS (1BCS) has been proposed as a strategy to deal with the problem of inferring a sparse signal knowing only the sign of the data of the linear measurements: y = sign(F w), where sign(x) is a vector with elements x_i/|x_i| for x_i ≠ 0. Besides being of interest for signal transmission related problems, where discarding the amplitude of the signal can significantly reduce the amount of information to be stored/relayed [12, 13], this problem can also be interpreted in terms of sparse boolean classification tasks. The most widely adopted inference scheme in CS is the so-called LASSO regression or L1-norm minimization [14], as originally proposed in the context of 1BCS in [12]. However, it is clear that the most efficient solution from the point of view of optimal dilution of the problem should be achieved by an L0-pseudonorm, where nonzero parameters are penalized independently of their nonzero value. Unfortunately, dealing with the non-convex L0-regularization is not so simple, as it typically leads to phase transitions that make the problem computationally intractable. A practical solution to the problem is to restrict the space of parameters to a discrete set, where effectively the L0-pseudonorm is indeed equivalent to the more amenable L1 case [15–19]. As far as continuous parameters are concerned, different strategies have been proposed. First, from the statistical physics community side, an approach pursuing this direction consists in a perceptron whose continuous parameters are masked by boolean variables mimicking dilution [20–23].

∗ Currently at Amazon Alexa, Torino
† Author to whom any correspondence should be addressed. Email: [email protected]
Attempts to characterize theoretically the phase space diagram and the structure of the transition through the replica method have been reported in [24–26]. Variations of the generalized approximate message passing technique (GAMP) were employed in [27], as it provides a tractable and efficient way to perform minimum mean squared error (MMSE) estimation on the variables to be retrieved when the matrix of patterns is large and Gaussian i.i.d. However, for more general pattern matrices, GAMP convergence is not guaranteed, which has led to the extension of algorithms of the vector AMP (VAMP) type [28] to generalized linear models [29, 30], including perceptron learning.

On the computer science side, many other algorithms for 1BCS combining the enforcement of sparsity and of sign consistency constraints were also proposed, building upon analogous algorithms developed for standard CS. Examples of methods for error-free sparse signal retrieval from one-bit quantized measurements include greedy approaches which iteratively determine the most appropriate sparse support given the sign measurements, such as matching sign pursuit [31], as well as binary iterative hard thresholding (BIHT) [32], where the minimization of an L1-based convex consistency-enforcing objective function is alternated with a thresholding operation that selects the K largest elements. The problem of noisy 1BCS was addressed, for instance, in [33–35]. However, among these examples, only [35] proposes an algorithm which does not require the prior knowledge of the number of corrupted sign measurements. Here, the one-bit measurement errors are modeled by introducing a sparse vector s whose nonzero components produce the sign mismatches as y = sign(F w + s).
The algorithm attempts to identify the sign errors and to retrieve the sparse signal w using a variational expectation-maximization based inference scheme.

In this work we propose a new wrapper strategy where both the variable selection and the classification tasks are simultaneously performed through Expectation Propagation (EP), an iterative scheme to approximate intractable distributions that was introduced first in the field of statistical physics [36, 37] and shortly after in the field of theoretical computer science [38]. In analogy to what is presented in [39] in the context of sampling the space of high dimensional polytopes, we show that, by approximating the computationally intractable posterior distribution P(w | y, F) through a tractable multivariate probability density Q(w | y, F), we are able to solve the 1BCS problem both efficiently and accurately. One of the main strengths of this EP-based approach is that it does not require any statistical properties for the measurement matrix F, whereas other improved belief propagation methods typically require some kind of i.i.d. property for the measurement matrix.

The paper has the following structure: after this introduction, in Sec. II we define the problem and introduce the EP algorithm. In Sec. III we present extensive numerical simulations, both in the noiseless and in the noisy case. Here both i.i.d. and correlated measurement matrices are analyzed. Finally, in Sec. IV we summarize the results of the paper and draw the conclusions.

II. METHODS

A. The diluted perceptron as a linear estimation problem and its statistical mechanics setup
We consider a student perceptron with N input units and continuous weights w ∈ R^N. We assume that the connections are diluted and that only a fraction ρ of them are nonzero. We also assume that M real-valued patterns x_τ ∈ R^N are presented to the perceptron and that a binary label σ_τ, τ = 1, ..., M has already been assigned to each of them as a result of the classification performed by a teacher perceptron with sparse continuous weights B. The task of the student perceptron is to learn the input/output association based on the examples (x_τ, σ_τ), τ = 1, ..., M provided by the teacher:

σ_τ = sign(w^T x_τ), τ = 1, ..., M, (1)

where we use the convention that sign(0) = 1. For each example τ, the rule (1) is equivalent to the condition:

(σ_τ x_τ^T) w ≥ 0. (2)

We now introduce the auxiliary variables y_τ := (σ_τ x_τ^T) w and the data matrix:

X_σ = (σ_1 x_1^T; σ_2 x_2^T; ...; σ_M x_M^T). (3)

Through the previous definitions, we can define the following linear estimation problem:

y = X_σ w, (4)

where the variables to be inferred are both y and w. As we will show below, the positivity constraints in Eq. (2) will be enforced in terms of a prior distribution on the y variables. The linear estimation problem expressed in Eq. (4) can be addressed in a Bayesian setting: by introducing the variable vector h = (w_1, ..., w_N, y_1, ..., y_M)^T and the energy function:

E(w, y) = ‖y − X_σ w‖² = h^T E⁻¹ h,  E⁻¹ = ( X_σ^T X_σ  −X_σ^T ; −X_σ  I ), (5)

the likelihood of the set of N weights of the perceptron can be expressed as the Boltzmann distribution associated with E(w, y), which reads:

L(w) = P(σ_1, ..., σ_M | w) = (1/Z) e^{−β E(w, y)}, (6)

where, from a statistical physics standpoint, β plays the role of an inverse temperature.
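As an illustration of the teacher-student setup of Eqs. (1)-(2), the following sketch generates a diluted teacher and a set of labeled examples. All function names are our own, and the Gaussian distribution of the nonzero teacher weights (variance 1/λ) is an assumption chosen to match the spike-and-slab prior introduced below.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_teacher(N, rho, lam=1.0):
    """Sparse teacher weights: each entry is nonzero with probability rho,
    with nonzero values drawn from a Gaussian of variance 1/lam (assumption)."""
    mask = rng.random(N) < rho
    return np.where(mask, rng.normal(0.0, 1.0 / np.sqrt(lam), size=N), 0.0)

def sample_examples(B, M):
    """M i.i.d. standard Gaussian patterns and their teacher labels, Eq. (1);
    the convention sign(0) = +1 is enforced."""
    X = rng.normal(size=(M, len(B)))
    sigma = np.where(X @ B >= 0.0, 1, -1)
    return X, sigma

B = sample_teacher(N=128, rho=0.25)
X, sigma = sample_examples(B, M=256)
# Eq. (2): every example satisfies (sigma_tau x_tau^T) B >= 0 by construction
assert np.all(sigma * (X @ B) >= 0.0)
```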
In the absence of noise, it is convenient to consider the zero temperature limit of this likelihood, L(w) → δ(y − X_σ w) as β → ∞, where δ(x) denotes the Dirac delta distribution. We also introduce prior distributions in order to encode the constraints to which the variables w_i, i = 1, ..., N, and y_τ, τ = 1, ..., M, are subject. The sparsity assumption on the weights w is expressed in terms of a spike-and-slab prior [40]:

Γ(w_i) = (1 − ρ) δ(w_i) + ρ √(λ/2π) e^{−λ w_i²/2}, i = 1, ..., N. (7)

If the labels of the teacher are not corrupted by noise, then the auxiliary variables y need to fulfill the positivity constraint (2), which can be expressed in terms of the pseudoprior:

Λ(y_τ) = Θ(y_τ), τ = 1, ..., M. (8)

On the other hand, if noise at the output of the teacher perceptron is present, one may assume that the labels provided by the teacher perceptron are assigned according to the following process [41]:

σ̃ = sign(B^T x) with probability η,  σ̃ = −sign(B^T x) with probability 1 − η, (9)

and that the student receives the altered examples (x_μ, σ̃_μ), μ = 1, ..., M. In this case, if the process that flips the labels is known, then it may be encoded in the pseudoprior Λ as follows:

Λ(y_μ) = η Θ(y_μ) + (1 − η) Θ(−y_μ). (10)

In general, the parameters ρ, λ and η are not known and need to be learned by the student perceptron in the training phase. Finally, by Bayes' rule, the posterior distribution of both weights and auxiliary variables reads:

P(w, y) = (1/Z_P) δ(y − X_σ w) ∏_{i=1}^N Γ_i(w_i) ∏_{τ=1}^M Λ_τ(y_τ). (11)

B. Learning the weights via expectation propagation
1. Zero temperature formulation
We wish to infer the values of the weights by estimating the expectation values of the marginals of the distribution (11), as this strategy minimizes the associated mean squared error. However, the latter marginalizations are intractable and we need to resort to approximation methods. Here we propose an expectation propagation scheme based on the zero temperature formulation presented in [42] in order to solve the problem. Starting from the linear system X_σ w = y, we notice that it can be written in a row-echelon form. We express it as:

G h = 0, (12)

where G = (−X_σ | I) and I is the M × M identity matrix. The intractable posterior distribution reads:

P(h) = (1/Z_P) δ_M(G h) ∏_{i∈W} Γ_i(h_i) ∏_{τ∈Y} Λ_τ(h_τ), (13)

where W = {1, ..., N}, Y = {N + 1, ..., N + M} and δ_M(z) denotes the M-dimensional Dirac delta distribution. We introduce Gaussian approximating factors:

φ_i(w_i) = exp(−(w_i − a_i)²/(2 d_i)), (14)

and a fully Gaussian approximation of the posterior distribution (13), in which all priors Γ and Λ are replaced by factors of the form (14):

Q(h) = (1/Z_Q) δ_M(G h) ∏_{i∈W} φ(h_i; a_i, d_i) ∏_{τ∈Y} φ(h_τ; a_τ, d_τ). (15)

Q(h) is a multivariate Gaussian distribution:

Q(h) = (1/Z_Q) δ_M(G h) exp( −(1/2) (h − h̄)^T Σ⁻¹ (h − h̄) ), (16)

whose covariance matrix and mean are given, respectively, by:

Σ⁻¹ = Σ_{i∈W} (1/d_i) e_i e_i^T + X_σ^T ( Σ_{i∈Y} (1/d_i) e_i e_i^T ) X_σ, (17)

and by:

h̄ = Σ ( Σ_{i∈W} (a_i/d_i) e_i + Σ_{i∈Y} (a_i/d_i) X_σ^T e_i ), (18)

where e_i denotes the i-th basis vector of the standard basis of R^{N+M} (for i ∈ Y, the products with X_σ are understood to involve the corresponding basis vector of R^M). We also introduce tilted distributions associated with the variables h_i, for all i = 1, ..., N + M. In particular, if i ∈ W, we have:

Q^{(i)}(h) = (1/Z_{Q^{(i)}}) δ_M(G h) Γ_i(h_i) ∏_{j∈W\{i}} φ(h_j; a_j, d_j) ∏_{τ∈Y} φ(h_τ; a_τ, d_τ), (19)

whereas, if i ∈ Y:

Q^{(i)}(h) = (1/Z_{Q^{(i)}}) δ_M(G h) Λ_i(h_i) ∏_{j∈W} φ(h_j; a_j, d_j) ∏_{τ∈Y\{i}} φ(h_τ; a_τ, d_τ). (20)

The tilted distributions can be expressed as the product of one of the priors and a Gaussian cavity distribution:

Q^{(i)}(h) ∝ ψ_i(h_i) Q̃^{(i)}(h), (21)

where ψ ∈ {Γ, Λ} and we have denoted the cavity distribution associated with the i-th variable by Q̃^{(i)}:

Q̃^{(i)}(h) = (1/Z_{Q̃^{(i)}}) δ_M(G h) exp( −(1/2) (h − h̄^{(i)})^T (Σ^{(i)})⁻¹ (h − h̄^{(i)}) ). (22)

The cavity covariance matrices are given by the following expressions:

(Σ^{(i)})⁻¹ = Σ_{j∈W\{i}} (1/d_j) e_j e_j^T + X_σ^T ( Σ_{j∈Y} (1/d_j) e_j e_j^T ) X_σ, if i ∈ W,
(Σ^{(i)})⁻¹ = Σ_{j∈W} (1/d_j) e_j e_j^T + X_σ^T ( Σ_{j∈Y\{i}} (1/d_j) e_j e_j^T ) X_σ, if i ∈ Y, (23)

whereas the cavity means read:

h̄^{(i)} = Σ^{(i)} ( Σ_{j∈W\{i}} (a_j/d_j) e_j + Σ_{j∈Y} (a_j/d_j) X_σ^T e_j ), if i ∈ W,
h̄^{(i)} = Σ^{(i)} ( Σ_{j∈W} (a_j/d_j) e_j + Σ_{j∈Y\{i}} (a_j/d_j) X_σ^T e_j ), if i ∈ Y. (24)

The means a and the variances d of the Gaussian factors are determined via matching of the first and second moments of the tilted and of the fully Gaussian approximated distributions for all i = 1, ..., N + M:

⟨h_i⟩_{Q^{(i)}} = ⟨h_i⟩_Q,  ⟨h_i²⟩_{Q^{(i)}} = ⟨h_i²⟩_Q. (25)

The EP update equations follow from the moment matching conditions (25). In particular, notice that the marginals of Q(h) are Gaussian distributions. This allows one to express a_i and d_i in terms of the means and variances of the cavity distributions and in terms of the means and variances of the associated tilted distributions. Indeed, using the fact that the product of Gaussians is a Gaussian and the moment matching conditions, we obtain the EP update rules for the variances d and the means a:

d_i = ( [⟨h_i²⟩_{Q^{(i)}} − ⟨h_i⟩²_{Q^{(i)}}]⁻¹ − [Σ^{(i)}_{ii}]⁻¹ )⁻¹, (26)

a_i = h̄_i + (d_i/Σ^{(i)}_{ii}) (h̄_i − h̄^{(i)}_i), (27)

for all i = 1, ..., N + M. EP repeatedly estimates the vectors a and d until a fixed point is eventually reached.
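To make the update concrete, here is a minimal sketch of the single-site refinement of Eqs. (26)-(27) (our own naming; the matrix bookkeeping of Eqs. (17)-(24) is omitted): given the moments of the tilted marginal and the cavity mean and variance of one variable, it returns the refreshed factor parameters a_i and d_i.

```python
def ep_site_update(m_tilt, v_tilt, h_cav, s_cav):
    """Single-site EP update. m_tilt, v_tilt: mean and variance of the tilted
    marginal Q^(i); h_cav, s_cav: cavity mean h^(i)_i and variance Sigma^(i)_ii.
    Returns the updated Gaussian factor parameters (a_i, d_i)."""
    d = 1.0 / (1.0 / v_tilt - 1.0 / s_cav)        # Eq. (26)
    a = m_tilt + (d / s_cav) * (m_tilt - h_cav)   # Eq. (27)
    return a, d

# Consistency check: if the exact site factor is the Gaussian N(1, 1) and the
# cavity is N(0, 1), the tilted marginal has precision 2 and mean 1/2, and the
# update recovers the factor N(1, 1) exactly.
a, d = ep_site_update(m_tilt=0.5, v_tilt=0.5, h_cav=0.0, s_cav=1.0)
assert abs(a - 1.0) < 1e-12 and abs(d - 1.0) < 1e-12
```

In practice the update is damped (see the damping factors quoted in Sec. III) to stabilize the iteration.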
From a practical point of view, the algorithm returns the means and the variances of the marginal tilted distributions as soon as the convergence criterion:

ε_t := max_i { |⟨h_i⟩_{Q^{(i)}_t} − ⟨h_i⟩_{Q^{(i)}_{t−1}}| + |⟨h_i²⟩_{Q^{(i)}_t} − ⟨h_i²⟩_{Q^{(i)}_{t−1}}| } < ε_stop (28)

is fulfilled, where t denotes the current iteration and ε_stop is a small convergence threshold. In particular, the posterior mean value of each weight learned by the student perceptron is estimated as ⟨w_i⟩_{Q^{(i)}}, with a standard deviation equal to √(⟨w_i²⟩_{Q^{(i)}} − ⟨w_i⟩²_{Q^{(i)}}). The zero temperature formulation of EP presented in this section is computationally advantageous compared to the finite temperature one presented in Appendix A, as its complexity is dominated by the inversion of the N × N matrix (23) rather than of an (N + M) × (N + M) matrix. We refer to Appendix A for more details about the finite temperature formulation of EP.

C. Moments of the tilted distributions
1. Moments of the spike and slab prior
In this section, we shall compute the first and second moments of the leave-one-out distributions when the prior is of the spike-and-slab type. We recall the expression of the spike-and-slab prior already introduced in Eq. (7) for the sake of convenience:

Γ(h_k) = (1 − ρ) δ(h_k) + ρ √(λ/2π) e^{−λ h_k²/2}, k = 1, ..., N. (29)

The marginal tilted distribution of each weight of the student perceptron reads:

Q^{(k)}(h_k) = (1/Z_{Q^{(k)}}) Q̃_k(h_k) Γ(h_k), (30)

where we have introduced the marginalized cavity Gaussian distribution Q̃_k:

Q̃_k(h_k; μ_k, Σ_k) = (1/√(2π Σ_k)) e^{−(h_k − μ_k)²/(2 Σ_k)}. (31)

From Eq. (30), computing the partition function of the tilted distribution Q^{(k)} yields:

Z_{Q^{(k)}} = (1 − ρ) (1/√(2π Σ_k)) e^{−μ_k²/(2 Σ_k)} + (ρ/√(2π)) √(λ/(1 + λ Σ_k)) e^{−λ μ_k²/(2(1 + λ Σ_k))}. (32)

Finally, the first moment and the second moment of the same distribution are given by:

⟨h_k⟩_{Q^{(k)}} = (1/Z_{Q^{(k)}}) (ρ/√(2π)) √(λ/(1 + λ Σ_k)) e^{−λ μ_k²/(2(1 + λ Σ_k))} · μ_k/(1 + λ Σ_k), (33)

and by:

⟨h_k²⟩_{Q^{(k)}} = (1/Z_{Q^{(k)}}) (ρ/√(2π)) √(λ/(1 + λ Σ_k)) e^{−λ μ_k²/(2(1 + λ Σ_k))} ( Σ_k/(1 + λ Σ_k) + μ_k²/(1 + λ Σ_k)² ), (34)

respectively.
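Eqs. (32)-(34) are straightforward to check numerically. The sketch below (our own code, assuming NumPy) evaluates the closed forms and compares the slab contribution against a brute-force quadrature; the delta component of the spike contributes only to Z.

```python
import numpy as np

def spike_slab_moments(mu, S, rho, lam):
    """Closed-form Z, <h> and <h^2> of the tilted distribution, Eqs. (32)-(34)."""
    c = 1.0 + lam * S
    g0 = np.exp(-mu**2 / (2.0 * S)) / np.sqrt(2.0 * np.pi * S)   # spike mass
    gs = rho * np.sqrt(lam / c) * np.exp(-lam * mu**2 / (2.0 * c)) / np.sqrt(2.0 * np.pi)
    Z = (1.0 - rho) * g0 + gs
    m1 = gs * (mu / c) / Z
    m2 = gs * (S / c + (mu / c)**2) / Z
    return Z, m1, m2

# Brute-force quadrature over the slab part
mu, S, rho, lam = 0.7, 0.5, 0.3, 2.0
h = np.linspace(-12.0, 12.0, 200001)
dh = h[1] - h[0]
cav = np.exp(-(h - mu)**2 / (2.0 * S)) / np.sqrt(2.0 * np.pi * S)
slab = rho * np.sqrt(lam / (2.0 * np.pi)) * np.exp(-lam * h**2 / 2.0)
Z, m1, m2 = spike_slab_moments(mu, S, rho, lam)
assert abs(np.sum(h * cav * slab) * dh / Z - m1) < 1e-6
assert abs(np.sum(h**2 * cav * slab) * dh / Z - m2) < 1e-6
```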
2. Moments of the theta pseudoprior
We now repeat the same reasoning for the case of the theta pseudoprior, which was defined as:

Λ(h_k) = Θ(h_k), k = N + 1, ..., N + M. (35)

The associated tilted distribution of the k-th variable is given by:

Q^{(k)}(h_k) = (1/Z_{Q^{(k)}}) Q̃_k(h_k) Λ(h_k), k = N + 1, ..., N + M, (36)

where the expression for Q̃_k is the same as in Eq. (31). The normalization of (36) is the partition function of the tilted distribution and reads:

Z_{Q^{(k)}} = (1/2) [1 + erf( μ_k/√(2 Σ_k) )], (37)

where erf denotes the error function, defined as:

erf(x) = (2/√π) ∫₀ˣ e^{−z²} dz. (38)

Computing the first moment of the marginal tilted distribution leads to the expression:

⟨h_k⟩ = μ_k + √(Σ_k/(2π)) e^{−μ_k²/(2 Σ_k)} / Φ(μ_k/√Σ_k) = μ_k ( 1 + R(α_k)/α_k ), (39)

where Φ(x) = (1/2)[1 + erf(x/√2)] is the cumulative distribution function (CDF) of the standard normal distribution, R(x) = (1/√(2π)) e^{−x²/2} / Φ(x) and α_k = μ_k/√Σ_k. Finally, concerning the second moment of the marginal tilted distribution, one obtains:

⟨h_k²⟩ = μ_k² + Σ_k + μ_k √Σ_k R(α_k), (40)

implying that the variance of h_k w.r.t. the k-th marginal tilted distribution can be expressed in a compact way by:

Var(h_k) = Σ_k ( 1 − α_k R(α_k) − R(α_k)² ). (41)
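A quick numerical sanity check of Eqs. (37)-(41), using only the standard library (names are ours): the moments of a Gaussian restricted to the positive half-line, with the half-normal case μ_k = 0 as a known reference.

```python
from math import erf, exp, pi, sqrt

def theta_tilted_moments(mu, S):
    """Moments of N(mu, S) restricted to h >= 0: Z (Eq. 37), <h> (Eq. 39),
    <h^2> (Eq. 40) and the variance (Eq. 41)."""
    alpha = mu / sqrt(S)
    Phi = 0.5 * (1.0 + erf(alpha / sqrt(2.0)))           # Z = Phi(alpha)
    R = exp(-alpha * alpha / 2.0) / (sqrt(2.0 * pi) * Phi)
    m1 = mu + sqrt(S) * R
    m2 = mu * mu + S + mu * sqrt(S) * R
    var = S * (1.0 - alpha * R - R * R)
    return Phi, m1, m2, var

Z, m1, m2, var = theta_tilted_moments(0.0, 1.0)
assert abs(Z - 0.5) < 1e-12                 # half of the mass survives
assert abs(m1 - sqrt(2.0 / pi)) < 1e-12     # half-normal mean
assert abs(var - (m2 - m1 * m1)) < 1e-12    # Eq. (41) agrees with Eqs. (39)-(40)
```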
3. Moments of the theta mixture pseudoprior
When the pseudoprior Λ is of the theta mixture type:

Λ̃(h_k) = η Θ(h_k) + (1 − η) Θ(−h_k), 0 ≤ η ≤ 1, k = N + 1, ..., N + M, (42)

we have for the partition function Z_{Q^{(k)}} of the tilted distributions (36):

Z_{Q^{(k)}} = η (1/2) erfc( −μ_k/√(2 Σ_k) ) + (1 − η) (1/2) erfc( μ_k/√(2 Σ_k) ) = 1/2 + (η − 1/2) erf( μ_k/√(2 Σ_k) ). (43)

For the first moment, one obtains:

⟨h_k⟩_{Q^{(k)}} = μ_k + √(2 Σ_k/π) (2η − 1) e^{−μ_k²/(2 Σ_k)} / [ η erfc( −μ_k/√(2 Σ_k) ) + (1 − η) erfc( μ_k/√(2 Σ_k) ) ], (44)

and the second moment w.r.t. the marginal tilted distribution (36) reads:

⟨h_k²⟩_{Q^{(k)}} = μ_k² + Σ_k + μ_k √(2 Σ_k/π) (2η − 1) e^{−μ_k²/(2 Σ_k)} / [ η erfc( −μ_k/√(2 Σ_k) ) + (1 − η) erfc( μ_k/√(2 Σ_k) ) ]. (45)

III. RESULTS

A. Sparse perceptron learning from noiseless examples
In this section, we will present some results obtained from numerical simulations in the presence of noiseless examples, both in the case where patterns are independent and identically distributed (i.i.d.) and in a simple case of correlated patterns. For the sake of simplicity, in all the situations described in the following, we have chosen a Bayes-optimal setting, where the prior information provided by the spike-and-slab prior mirrors the actual distribution of the weights to be retrieved. First, we performed numerical experiments with i.i.d. patterns drawn from a Gaussian distribution having zero mean and unit variance. As a performance measure, we consider the mean squared error between the normalized weights of the student perceptron at the end of the learning process and those of the teacher perceptron:

MSE(w̃, B̃) = (1/N) Σ_{k=1}^N (w̃_k − B̃_k)², (46)

where w̃ = w/‖w‖ are the rescaled weights of the student and B̃ = B/‖B‖ denote those of the teacher. In our results, this metric is expressed in dB and used to compare expectation propagation to the 1-bit Approximate Message Passing (1-bit AMP) algorithm introduced in [27] and to the grVAMP algorithm proposed within the unified Bayesian framework of general linear models published in reference [30], which the authors show to yield equivalent results to the VAMP algorithm for the generalized linear model described in [29]. The per iteration computational cost of 1-bit AMP and of grVAMP is O(N²), as it is dominated by a matrix-vector product in both cases. However, since grVAMP
requires that the singular vectors and the singular values of the pattern matrix X are provided as input, the cubic complexity of performing a one-time initial singular value decomposition of X should be taken into account as well.

FIG. 1: MSE resulting from sparse weight learning from i.i.d. patterns using EP, 1-bit AMP and grVAMP based estimation as a function of α. The parameters considered for the perceptron are N = 128 and ρ = 0.25 and the number of instances is N_samples = 100. All simulations converged and the MSE shown is averaged over all the considered instances. The error bars are estimated as σ/√N_samples, where σ is the sample standard deviation of the MSE computed over all the instances.

We considered the average of the MSE (46) over N_samples = 100 simulations. The simulations correspond to sparse perceptron learning of different instances of the weights of the teacher perceptron, each from a different set of i.i.d. Gaussian patterns fed to the student perceptron. We considered the case in which the total number of weights is N = 128 and their density level is fixed to ρ = 0.25. The convergence threshold was set to a small value ε_stop and the value of the damping parameter of the EP algorithm was set equal to 0.9995 (although good results can be obtained using a lower damping too, e.g. 0.99). The results of the simulations for different values of α are reported in Fig. 1 and show that EP, 1-bit AMP and grVAMP based learning from i.i.d. Gaussian patterns have roughly the same performance regardless of the specific value of α. Similarly small convergence criteria were adopted in the 1-bit AMP and in the grVAMP simulations. All the simulations performed using EP, 1-bit AMP and grVAMP converged within the thresholds we considered. The error bars in Fig. 1 were estimated as σ/√N_samples, where σ denotes the sample standard deviation of the MSE.

We also considered the problem of sparse perceptron learning from correlated patterns drawn from a multivariate normal distribution, in the simple case where the mean is m = 0 and the covariance matrix is constructed according to:

S = Y^T Y + Δ, (47)

where Y ∈ R^{u×N} is an i.i.d. matrix with entries drawn from a standard univariate Gaussian distribution and Δ is a diagonal matrix whose eigenvalues are given by the absolute value of i.i.d. random entries drawn from the same distribution. By construction, this matrix is symmetric and positive definite and, therefore, is a proper covariance matrix. The diagonal matrix Δ is added in order to ensure that S has full rank. As an extreme case, we choose u = 1 for the matrix Y. We find that even in this case the student perceptron is able to estimate the weights of the teacher, although, under the same values of the parameters N, ρ and α of the model and under the same values of the EP parameters (i.e. damping, ε_stop and maximum number of iterations), the accuracy of the estimation is lower than the one achieved by learning from i.i.d. Gaussian patterns, as one might expect.
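Two ingredients of these experiments are easy to reproduce in code (a sketch with our own naming): the error metric of Eq. (46) expressed in dB, and the covariance construction of Eq. (47), including the extreme case u = 1 used here.

```python
import numpy as np

rng = np.random.default_rng(1)

def mse_db(w, B):
    """Mean squared error of Eq. (46) between normalized weight vectors, in dB;
    invariant under rescaling of either argument."""
    w_t = w / np.linalg.norm(w)
    B_t = B / np.linalg.norm(B)
    return 10.0 * np.log10(np.mean((w_t - B_t) ** 2))

def correlated_covariance(N, u=1):
    """Covariance of Eq. (47): S = Y^T Y + Delta, with Y a u x N standard
    Gaussian matrix and Delta diagonal with |standard Gaussian| entries."""
    Y = rng.normal(size=(u, N))
    Delta = np.diag(np.abs(rng.normal(size=N)))
    return Y.T @ Y + Delta

N = 64
S = correlated_covariance(N)
assert np.allclose(S, S.T) and np.all(np.linalg.eigvalsh(S) > 0.0)  # valid covariance
patterns = rng.multivariate_normal(np.zeros(N), S, size=32)  # correlated patterns

B = np.array([1.0, 0.0, -2.0, 0.0])
w = np.array([0.9, 0.1, -2.1, 0.0])
assert abs(mse_db(3.0 * w, B) - mse_db(w, B)) < 1e-12  # scale invariance
```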
Still, in the presence of the correlated patterns considered here, expectation propagation based learning proves to be advantageous as compared with other algorithms for 1-bit compressed sensing such as 1-bit AMP, whose estimates of the means and variances of the weights to be retrieved diverge. In addition, in the same situation, EP outperforms grVAMP based learning, as shown in Fig. 5 for the set of parameters N = 128 and ρ = 0.25.

FIG. 2: Sparse perceptron learning from correlated patterns sampled from a multivariate Gaussian distribution. The values of the parameters are specified in each panel and we set u = 1. Comparison between grVAMP and EP based learning. (a) ROC curves. (b) Sensitivity plots. For reference, in (a,b) the case of ideal variable selection by the teacher perceptron that provided the examples is also shown. (c) Mean squared error in dB. In each plot, the mean values and the standard deviations are computed over the set of the N_conv instances for which convergence was achieved. The error bars are estimated as σ/√N_conv, where σ is the sample standard deviation over the same set of instances.

The convergence thresholds of the grVAMP and of the EP algorithms were set to the same values as in the case of learning from i.i.d. Gaussian patterns and further lowering the value of the threshold parameter of grVAMP did not result in a noticeable improvement of the grVAMP results. In the case of EP, the damping factor was set to 0.999 and the maximum number of iterations for convergence was 50000. The fraction of converged trials is shown in Table I for both algorithms in the case where N = 128 and ρ = 0.25.

TABLE I: Fraction of converged trials f_EP and f_grVAMP (with the corresponding standard deviations σ_{f_EP} and σ_{f_grVAMP}) as a function of α, over a set of 100 different instances of the weights of the teacher perceptron and of the training set of examples, for (a) N = 128 and (b) N = 256. The patterns were sampled from the multivariate Gaussian distribution with covariance matrix (47) and the density of the weights of the teacher is ρ = 0.25.

The EP led student perceptron is more accurate at determining the nonzero weights than the grVAMP led counterpart, as shown by the ROC curves in Fig. 2a and by the sensitivity plots of Fig. 2b. In order to construct these curves, each weight of the teacher was assigned a score given by its probability of being nonzero as estimated by EP and grVAMP. The weights were ranked according to:

P^{≠0}_k = ( 1 + (1/ρ − 1) √((1 + λ Σ_k)/(λ Σ_k)) e^{−μ_k²/(2 Σ_k (1 + λ Σ_k))} )⁻¹. (48)

In the case of EP, μ_k and Σ_k, for k = 1, ..., N, are the EP cavity means and variances, whereas, in the case of grVAMP, μ_k corresponds to the VAMP quantity r_k and Σ_k = γ_k⁻¹, where γ_k is the quantity that parameterizes the denoiser in VAMP. In both cases, ρ denotes the density parameter of the spike-and-slab prior. Interestingly, the discrepancy between the accuracy of the two algorithms becomes larger as the number of patterns increases. This behavior is reflected in the difference between the mean squared errors of the two algorithms, as shown in Fig. 2c. In each plot in Fig. 2, we have shown the average of the quantities considered over the set of the N_conv instances for which each algorithm achieved convergence.
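The selection score of Eq. (48) is simple to evaluate in code (our own helper): it is the posterior mass of the slab component given the cavity parameters, and it increases monotonically with |μ_k|.

```python
from math import exp, sqrt

def prob_nonzero(mu, S, rho, lam):
    """Posterior probability that a weight is nonzero, Eq. (48), given the
    cavity mean mu and variance S (EP) or their grVAMP analogues."""
    ratio = sqrt((1.0 + lam * S) / (lam * S))
    return 1.0 / (1.0 + (1.0 / rho - 1.0) * ratio
                  * exp(-mu * mu / (2.0 * S * (1.0 + lam * S))))

p0 = prob_nonzero(0.0, 1.0, rho=0.25, lam=1.0)
p3 = prob_nonzero(3.0, 1.0, rho=0.25, lam=1.0)
assert 0.0 < p0 < p3 < 1.0  # a larger cavity mean is stronger evidence for a nonzero weight
```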
The error bars in Fig. 2 were accordingly estimated as σ/√N_conv, where σ denotes the standard deviation over the set of converged instances. A useful additional feature of the EP-based learning approach is the possibility to learn iteratively the value of ρ during the estimation of the weights of the teacher, as, unlike EP, most algorithms for 1-bit compressed sensing – including those with which we compared our results in this section – assume the density of the signal to be given a priori. The estimation of the density parameter is achieved by minimizing the EP free energy with respect to ρ and yields good results as long as the number of the patterns presented to the student is large enough. We refer to the Appendix for details concerning the EP free energy, its expression for the sparse perceptron learning problem and free energy optimization based learning of the parameters of the prior.

In order to show that our approach allows to estimate the dilution level of the teacher perceptron, we performed a set of N_samples = 100 EP simulations on a system with N = 128 and ρ = 0.25, where the density parameter ρ of the spike-and-slab prior assigned to each weight variable was randomly initialized by sampling its value from a uniform distribution over a fixed interval bounded above by 0.95, and where a small learning rate δρ was used. We show our results in Table II. For each value of α, we show the average value ρ_L of the density estimate over all samples and its associated statistical uncertainty, which was computed as δρ_L = σ_{ρ_L}/√N_samples, as for these values of the parameters all simulations converged. We also show the relative difference Δρ/ρ between the true value of the density and the estimated one. Since Δρ ≫ δρ_L, we omit the statistical uncertainty associated with Δρ. Finally, we notice that, even when learning from correlated patterns constructed as described above, the student is able to estimate the density level of the weights of the teacher perceptron quite accurately, provided that a sufficient number of patterns is provided to the student perceptron. In Table II, we give an example of this fact when the teacher perceptron has N = 128 weights and density ρ = 0.25.

B. Sparse perceptron learning from a noisy teacher
We analyzed the performance of EP based sparse perceptron learning when a fraction of the examples is mislabeled. The student perceptron is given the a priori information that a certain fraction of the labels is wrongly assigned. As a benchmark, we compared EP with R1BCS, whose convergence threshold was set to ε_R1BCS = 10−. Thus, the R1BCS iterations stop when the estimate w_R1BCS of the weights of the teacher is such that ‖w_R1BCS − w_R1BCS^old‖ < ε_R1BCS. In each experiment, a given number K_label = (1 − η)M of labels were flipped, where η is the fraction of unchanged labels, which was set equal to 0.9. In the EP simulations, a damping factor equal to 0.999 and a convergence threshold ε_stop = 10− were used.

TABLE II: Learning of the density ρ of the weights of the teacher for a perceptron with parameters N = 128 and ρ = 0.25. The average and the standard deviation of the learned value of ρ at convergence over all the trials for which convergence was achieved during the training process are denoted by ρ_L and δρ_L, respectively. In each trial, the initial condition ρ was drawn uniformly from the interval 0. ≤ ρ ≤ 0.95. Columns: α; ρ_L ± δρ_L and ∆ρ/ρ for i.i.d. patterns; ρ_L ± δρ_L and ∆ρ/ρ for patterns from the multivariate Gaussian distribution.

We first considered the case of i.i.d. patterns drawn from a Gaussian distribution with zero mean and unit variance. The results provided by the two algorithms are shown in Fig. 3. In order to assess the variable selection capabilities of the two algorithms when attempting to learn the weights of the teacher from this kind of patterns in the presence of mislabeled examples, we computed the ROC curves (Fig. 3a) and the sensitivity plots (Fig. 3b) associated with the weights of the student after the training phase was completed, in the case where the number of weights was N = 128 and the density of the weights B of the teacher was set to ρ = 0.
25. The ordering criterion for the weights B adopted in the ROC curves in Fig. 3a and in the sensitivity plots in Fig. 3b was based on the absolute value of the weights w of the student. In the case of EP, we also plotted the ROC and sensitivity curves according to the sorting criterion based on the score expressed in Eq. (48), but the EP results did not exhibit noticeable differences with respect to those obtained using the other sorting criterion in the case of i.i.d. Gaussian patterns. While, on the one hand, the ROC curves and sensitivity plots associated with EP exhibit systematically lower values of the true positive rate than those related to R1BCS, which leads to larger values of the MSE, as shown in Fig. 3c, it is worth noticing that the areas under the curves (AUC) are not very different for the two algorithms and the relative discrepancy (AUC_max − AUC_min)/AUC_max between the largest and the smallest of the two at any given value of α does not exceed 6%, as can be deduced from table IIIa. Furthermore, EP tends to be more accurate than R1BCS in the low α regime, as can be seen by considering large enough dilutions of the weights. We give an example of this fact in Fig. 3d, where we show the MSE as a function of α for a perceptron with parameters N = 512 and ρ = 0.0625. We also allowed the student to learn the parameter η of the theta mixture pseudoprior, whose initial value was drawn uniformly at random, and obtained good estimates of the consistency level η of the labels in the case where the true density of the weights is assigned to the spike-and-slab prior factors and kept fixed throughout the simulations.

FIG. 3: Sparse perceptron learning from i.i.d. patterns sampled from a standard Gaussian distribution with (1 − η)M mislabeled examples. Comparison between R1BCS and EP in terms of (a) their ROC curves, (b) their sensitivity plots, (c, d) their mean squared error (expressed in dB). For reference, in (a, b) the case of ideal variable selection by the teacher perceptron that provided the examples is shown. The values of the parameters are N = 128, ρ = 0.25, η = 0.9 in (a, b, c) and N = 512, ρ = 0.0625, η = 0.9 in (d). The plotted quantities are averages over N_samples = 100 instances, as convergence was achieved in each simulation, and the error bars are estimated as σ/√N_samples, where σ denotes the sample standard deviation over all the instances considered.

We performed analogous experiments with patterns drawn from the zero mean multivariate Gaussian distribution with covariance matrix (47) that we had already considered in the noiseless case. For both algorithms, the numerical experiments were conducted on a set of 100 different instances. Convergence was achieved for all instances in the case of R1BCS. The same was true for EP, except for α = 1 and α = 2, where the fraction of converged trials was 95% and 99%, respectively. We show the results of our tests in Fig. 4, where, analogously to the case of i.i.d. Gaussian patterns presented above, we computed the ROC curves (Fig. 4a), the sensitivity plots (Fig. 4b) and the MSE curves (Fig. 4c) related to the weights learned by the student using EP and using R1BCS. Although the performance of EP drops significantly compared with R1BCS as a result of combining this kind of correlated patterns with the noise on the labels, it is worth noticing that the variable selection performance displayed by the EP-related ROC curves in Fig. 4a is still quite good and that the teacher-student MSE – together with the corresponding teacher-student overlaps (not shown) – can also be considered acceptable, as long as the considered values of α are large enough. Moreover, in the case of the specific Gaussian correlated patterns considered here, one can notice from the sensitivity plots in Fig. 4b that, in contrast with the case of Gaussian i.i.d. patterns, the accuracy of EP in selecting the most relevant weights is significantly larger if the criterion adopted for sorting the student weights is given by the score (48) rather than by their absolute value, suggesting that the probabilities P_k^{≠0,EP}, k = 1, ..., N, which EP assigns to each weight of the student perceptron, remain a reliable indicator of which variables are most likely to determine the outcome of the classification performed by the teacher perceptron prior to noise corruption. Analogously to the case of i.i.d. patterns, we allowed the student to learn the parameter η while the true value of the parameter ρ of the spike-and-slab priors Γ was provided and kept fixed during the training phase, and we were able to obtain good estimates of the consistency level of the labels for all values of α, as shown in table IIId.

FIG. 4: Sparse perceptron learning from correlated patterns sampled from a multivariate Gaussian distribution with (1 − η)M mislabeled examples in the case N = 128, ρ = 0.25, u = 1 and η = 0.9. Comparison between R1BCS and EP in terms of (a) their ROC curves, (b) their sensitivity plots, (c) their mean squared error (expressed in dB). For reference, in (a, b) the case of ideal variable selection by the teacher perceptron that provided the examples is shown. The plotted quantities are the mean values computed over the set of N_conv instances for which convergence was achieved, i.e. all instances in the case of R1BCS and, in the case of EP, all instances for each value of α except α = 1 and α = 2, where the fraction of converged trials was 95% and 99%, respectively. The error bars are estimated as σ/√N_conv, where σ denotes the sample standard deviation over all the instances considered.

TABLE III: (a) AUC scores associated with the ROC curves shown in Fig. 3a, which correspond to EP and R1BCS based classification from i.i.d. Gaussian patterns in the presence of label noise, with η = 0.9. (b) AUC scores associated with the ROC curves shown in Fig. 4a, which correspond to EP and R1BCS based classification from multivariate Gaussian patterns in the presence of label noise, with η = 0.9; here AUC^a_EP and AUC^p_EP denote the values obtained by sorting the weights of the EP-trained student according to their absolute value and according to the score (48), respectively. (c, d) Values of the η parameter of the theta mixture pseudoprior estimated by the student perceptron during the training phase when using EP to learn the weights of the teacher from i.i.d. Gaussian patterns (c) and from multivariate Gaussian patterns (d) in the case N = 128. The estimated value of η is denoted as η_L; the true value is η = 0.9. The density parameter of the spike-and-slab prior was given and fixed to the true value ρ = 0.25.

One important limitation of the R1BCS algorithm as compared with EP is that it involves the inversion of both an N × N matrix and an M × M matrix at each iteration. As a consequence, the computational complexity of R1BCS is dominated by O(max(N, M)³) operations. Therefore, from a computational point of view, EP is especially advantageous when the number of patterns in the training set is large, as, in the EP formulation proposed in this work, the computational cost is dominated by the inversion of an N × N matrix regardless of the number of patterns of the training set, making EP faster for all values of α > 1. This is confirmed by the running times of the two algorithms, which are presented in table IV.

TABLE IV: Running times (in seconds) of EP and R1BCS based sparse perceptron learning from i.i.d. Gaussian patterns and from Gaussian patterns drawn from the multivariate normal distribution with covariance matrix given by Eq. (47) (u = 1), in the presence of label noise, with η = 0.9. Columns: α; t_EP and t_R1BCS for i.i.d. patterns; t_EP and t_R1BCS for patterns from the multivariate normal distribution. The error bars are estimated as σ/√N_conv, where σ is the sample standard deviation over the set of converged trials and N_conv is the number of converged simulations.

C. Correlated patterns generated by a recurrent neural network
As an example of a diluted network with correlated inputs, we consider a network of N randomly diluted perceptrons without self-loops. We will denote the i-th row of the weight matrix W ∈ R^{N×(N−1)} as w_i. Each entry of w_i is the weight of an incoming link of the i-th perceptron. Each perceptron receives binary inputs x generated according to a Glauber dynamics at zero temperature. We considered both the case of synchronous update of the patterns at each time step and the case where the binary inputs are updated asynchronously.

In the case of synchronous update, starting from an initial random vector x⁰ = sign(ξ), where ξ ∼ N(ξ; 0, I), at discrete time t = 0, and given a pattern x^t at time t, each perceptron computes its output at time t + 1 according to:

z_i^t = w_i^T x_{\i}^t,  (49)

x_i^{t+1} = sign(z_i^t),  (50)

where x_{\i}^t denotes the vector of outputs produced by all perceptrons except the i-th one at time t. The patterns at time t + 1 are given by the set of outputs resulting from Eq. (49) and Eq. (50). While in principle such a recurrent network dynamics could become trapped in a limit cycle when coupled to a synchronous update rule for the patterns, in practice this never happened when generating such patterns from a recurrent network of N = 128 diluted perceptrons, each of which having dilution level ρ = 0.25. In the case of asynchronous update, a perceptron i is selected at random and, given a pattern x^t at time t, the i-th component of the pattern vector at time t + 1 is computed according to Eq. (49) and Eq. (50), while all other components are left unchanged. In order to tune the degree of correlation of the patterns, we generated them in such a way that two sets of patterns at consecutive times have a given Hamming distance.
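The synchronous and asynchronous dynamics of Eqs. (49)-(50) can be sketched as follows. This is illustrative NumPy code, not the authors' implementation; we store W as an N × N matrix with zero diagonal, which is equivalent to the paper's N × (N − 1) rows without self-loops.

```python
import numpy as np

def glauber_patterns(W, T, rng, asynchronous=False):
    """Zero-temperature Glauber dynamics of N diluted perceptrons,
    following Eqs. (49)-(50). Returns the T+1 generated binary
    patterns, one per row. Hypothetical helper for illustration."""
    N = W.shape[0]
    x = np.where(rng.standard_normal(N) >= 0, 1.0, -1.0)  # x^0 = sign(xi)
    patterns = [x.copy()]
    for _ in range(T):
        if asynchronous:
            i = rng.integers(N)                      # update one random unit
            x[i] = 1.0 if W[i] @ x >= 0 else -1.0    # z_i = w_i^T x_{\i} (zero diagonal)
        else:
            x = np.where(W @ x >= 0, 1.0, -1.0)      # synchronous update of all units
        patterns.append(x.copy())
    return np.array(patterns)
```

To fix the Hamming distance between consecutive stored patterns, as done in the text, one would iterate the asynchronous branch and store the configuration only once the desired number of components has flipped.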
In order to do so, we ran the Glauber dynamics of the recurrent network as described and updated the patterns asynchronously, but only stored their configuration when the desired Hamming distance between the candidate set of patterns at time t + 1 and the set of patterns at time t was achieved.

First, we consider the noiseless case and analyze the performance of training all the perceptrons of a student network of the same type as the network that generated the patterns, using both EP and grVAMP. We tested the two methods on a network with N = 128 perceptrons and density parameter of each perceptron given by ρ = 0.25. We considered both the case where the patterns are generated with the synchronous update rule presented above and the case of asynchronous update. In both cases, the damping parameter of EP was set to 0.999 and the convergence threshold was set to 10− for both EP and grVAMP. When considering patterns generated with a synchronous update, both algorithms achieved convergence during the training task for all perceptrons of the recurrent network. We evaluated the ROC curves and the reconstruction error associated with the correct/incorrect selection of the nonzero weights at the level of a single perceptron by considering the whole network and computing the average and the standard deviation of the relevant quantities (i.e. the MSE, the fraction of false positives and the fraction of true positives) over all the perceptrons of the student network. Both algorithms were able to select the same fraction of relevant weights by assigning them a large probability of being nonzero, as shown by the vertical portion of the corresponding ROC curves in Fig. 5a, which were constructed by sorting the weights according to the probabilities expressed in Eq. (48), in decreasing order. As a consequence, both grVAMP and EP showed similar values of the MSE between the weights estimated by the student perceptrons and the ones of the corresponding teachers as a function of α, as one can notice in Fig. 5b.

We repeated the same numerical experiments with patterns generated by performing an asynchronous update of the perceptron outputs and fixing d_H(x^{t+1}, x^t) = 10, corresponding to a Pearson correlation coefficient r_Pearson = 0.84 between pattern vectors at consecutive times. In this case, the convergence rate within the student network was significantly lower for grVAMP than for EP, as plotted in Fig. 5c, which shows the fraction of perceptrons of the recurrent network whose training task successfully converged. The mean values of the MSE and their associated uncertainties σ/√N, as evaluated over the N perceptrons of the student network regardless of their convergence status at the end of the training task, are displayed as a function of α in Fig. 5d and show that grVAMP is not able to accurately estimate the weights of the teacher when it fails to converge. However, the performances of EP and grVAMP were comparable when considering the subset of perceptrons whose grVAMP guided learning tasks converged (not shown).

We verified that the student perceptrons were able to estimate the density of the weights of the teacher perceptrons quite accurately while learning the classification rule of the teacher from this kind of patterns using EP. We provide some examples of the estimated values of the density at large values of α in Fig. 5e.

Finally, we analyzed the noisy case and considered EP and R1BCS in our tests, both in the presence of patterns generated using a synchronous update rule and using an asynchronous update rule. As in the noiseless case, we considered a recurrent network with N = 128 perceptrons. In particular, the set of weights of the network was the same as the one used in the noiseless case and the fraction of non-flipped labels presented to each perceptron was η = 0.9. The convergence thresholds of R1BCS and of EP were set to 10− and both EP-based and R1BCS-based learning converged for all the student perceptrons of the network within these thresholds. The goodness of the estimation of the weights of the teacher was not very different for the two learning algorithms, both in the case of synchronously updated patterns and in the case of asynchronously updated ones (not shown). However, as in the previous cases, EP was considerably faster for large values of α (see table V); the error bars were estimated as σ/√N_conv, where σ is the sample standard deviation over the set of the N_conv perceptrons for which convergence was attained. In the noisy regime, convergence was achieved for all perceptrons, except in the case of asynchronously updated patterns, where, for α = 0.5, α = 1 and α = 2, one of the N = 128 perceptrons failed to converge using EP. The numerical tests we performed show that it is possible to estimate the parameter η of the theta mixture pseudoprior during the training phase, provided that the noise level 1 − η is not too large, similarly to what we observed in the noisy case with i.i.d. patterns.

FIG. 5: (a, b) Comparison between grVAMP and EP based sparse perceptron learning from correlated patterns generated by a recurrent network according to a Glauber dynamics with synchronous update. The number of perceptrons in the network is N = 128 and the density of the weights of each perceptron is ρ = 0.25. All N perceptrons achieved convergence during the training task. (a) ROC curves related to the estimated topology of the learned weights of each perceptron for several values of α. (b) Mean squared error in dB. (c, d) Comparison between grVAMP and EP based sparse perceptron learning from correlated patterns generated by a whole recurrent network according to a Glauber dynamics with asynchronous update and Hamming distance between sets of patterns at consecutive times equal to 10. (c) Fraction of perceptrons of the recurrent network whose training tasks achieved convergence. (d) Mean squared error in dB averaged over the set of all perceptrons of the recurrent network, including both the perceptrons for which convergence was achieved during the training task and those for which it was not. In each plot, the mean values and the uncertainties were evaluated over the whole set of N perceptrons, with error bars estimated as σ/√N, where σ is the sample standard deviation computed over the set of N trained student perceptrons. (e) Estimated value of the density of the weights of single perceptrons in the network at large values of α, both in the case of synchronously updated patterns and in the case of asynchronous update with fixed Hamming distance d_H = 10 between patterns at consecutive time steps.

TABLE V: (a) Running times of EP and R1BCS based sparse perceptron learning from correlated patterns generated from a Glauber dynamics at zero temperature with a synchronous and with an asynchronous perceptron update rule, in the presence of label noise, with η = 0.9. Columns: α; t_EP and t_R1BCS (in seconds) for synchronous update; t_EP and t_R1BCS (in seconds) for asynchronous update (d_H = 10). (b) Estimated values of the parameter η of the theta mixture pseudoprior resulting from perceptron learning from the same patterns with label noise. Columns: α; η_L and ∆η/η for synchronous update; η_L and ∆η/η for asynchronous update (d_H = 10). The true unknown value of η is 0.9. In both (a) and (b), N = 128 and ρ = 0.25.

IV. CONCLUSIONS
In this paper we have proposed an expectation propagation based strategy for efficient 1-bit compressed sensing reconstruction, whose computational complexity is dominated by a matrix inversion requiring O(N³) elementary operations. We analyzed the behavior of the algorithm in the zero temperature case, assuming that the patterns are generated by a teacher having the same structure as the student.

The performance of the algorithm has been extensively tested under several conditions. For i.i.d. patterns generated by a Gaussian distribution of zero mean and unit variance, the performance of the algorithm is on par with that of two other state-of-the-art algorithms: 1-bit AMP [27] and grVAMP [30]. However, in the correlated pattern case, where 1-bit AMP fails to converge, EP outperforms grVAMP both in terms of accuracy and specificity. Moreover, both in the i.i.d. and in the correlated case, EP is able to learn with remarkable accuracy the density ρ of the weights of the teacher during the retrieval task. This feature of EP puts it at an advantage over other algorithms for 1-bit compressed sensing, which require that the dilution level be known and provided among their inputs.

We then tested the robustness of EP reconstruction against noise. To do so, we mislabeled a fraction (1 − η) of the examples. As both 1-bit AMP and grVAMP, at least in their standard implementations, are not able to cope with noise, we compared EP with the R1BCS algorithm [35] for 1-bit compressed sensing with sign-flip noise. Again, as in the noiseless case discussed above, we first considered the i.i.d. pattern case and then the correlated pattern case. Although the ROC curves and the sensitivity plots associated with EP were systematically lower than those related to R1BCS in terms of true positive rate, the relative difference between the areas under the curves (AUC) for the two algorithms never exceeded 6%, both for i.i.d. patterns and for correlated patterns, which allows us to conclude that the variable selection properties displayed by the two algorithms are not very different. Analogously to the estimation of the density parameter when learning the weights from noiseless examples, by including in the algorithm a simple optimization strategy consisting of one step of gradient descent on the EP free energy at each iteration, EP was able to successfully retrieve the noise level affecting the labels.

It is perhaps worth noticing that, if we concentrate on the variable selection performance of the algorithm in terms of the posterior probability of each weight w_i being nonzero, rather than using the absolute value of its mean as a predictor, then the accuracy of EP in selecting the most relevant weights becomes comparable to, and in some cases even better than, that of R1BCS, as shown in Fig. 4b. In other terms, EP seems to be marginally better than R1BCS at predicting the most relevant variables. One important limitation of R1BCS in comparison with EP is its computational complexity: as discussed in the introduction, the computational complexity of EP is O(N³) (at least in the zero temperature case), whereas R1BCS scales as O(max(N, M)³). Therefore, from a computational point of view, EP turns out to be especially advantageous in the large α regime, as shown in table IV.

Finally, we explored a more realistic scenario of "temporally" correlated patterns generated by a recurrent network of N randomly diluted perceptrons, both in the case of synchronous and asynchronous update schemes. First, we compared the performance of EP and grVAMP in the noiseless case. The first striking observation is the poor performance of grVAMP in terms of the percentage of perceptrons for which the iterative strategy converges: while EP converges on essentially all presented sets of patterns, for grVAMP the convergence rate drops to less than 20% over a large interval of α values. EP in this regime shows a remarkable performance both in terms of accuracy and sensitivity. In the noisy case, we compared the performance of EP and R1BCS. The goodness of the estimation of the weights of the teacher was not very different for the two learning algorithms, both in the case of synchronously updated patterns and in the case of asynchronously updated ones; however, for large α, EP was considerably faster. Moreover, EP is able to learn online the optimal dilution of the problem using a maximum likelihood data-driven iterative strategy, whereas, for most competitive strategies, the dilution is a fixed parameter to be set on the basis of some prior knowledge about the problem. In addition, we have shown that the same maximum likelihood online strategy can be used to learn the consistency level of the labels when they are affected by noise. As for the limitations of EP, the main disadvantage is given by its cubic computational complexity, which makes it slower as compared with algorithms such as 1-bit AMP and implementations of generalized VAMP, where the complexity is quadratic. Furthermore, we found that, while EP was able to deal with the task of learning from noisy examples, a not too large level of noise was needed in order for the algorithm to learn the target classification rule with an acceptable accuracy.

ACKNOWLEDGMENTS
AB, AP, and MP acknowledge funding from INFERNET, a European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 734439, and from the SmartData@Polito interdepartmental center for data science of Politecnico di Torino.
Appendix A: Finite temperature formulation of Expectation Propagation
We here recall the finite temperature EP scheme used in [39] and state the update equations for the sparse perceptron learning problem. In this case, the posterior distribution is given by:

P(h) = (1/Z_P) exp(−(β/2) h^T E^{−1} h) ∏_{i=1}^{N} Γ_i(h_i) ∏_{τ=N+1}^{N+M} Λ_τ(h_τ),  (A1)

where we recall that:

E^{−1} = [ X_σ^T X_σ , −X_σ^T ; −X_σ , I ].  (A2)

As in the previous section, we introduce Gaussian approximating factors (14) and write a fully Gaussian approximation of the posterior distribution (A1):

Q(h) = (1/Z_Q) exp(−(β/2) h^T E^{−1} h) ∏_{i∈W} φ(h_i; a_i, d_i) ∏_{τ∈Y} φ(h_τ; a_τ, d_τ) = (1/Z_Q) exp(−(1/2) (h − h̄)^T Σ^{−1} (h − h̄)).  (A3)

The covariance matrix Σ and the mean h̄ in Eq. (A3) are given, respectively, by:

Σ^{−1} = β E^{−1} + D,  (A4)

and by:

h̄ = Σ D a.  (A5)

Moreover, we define the tilted distributions Q^{(i)} for all i = 1, ..., N + M as:

Q^{(i)}(h) = (1/Z_{Q^{(i)}}) exp(−(β/2) h^T E^{−1} h) ψ_i(h_i) ∏_{k≠i} φ(h_k; a_k, d_k) = (1/Z_{Q^{(i)}}) ψ_i(h_i) exp(−(1/2) (h − h̄^{(i)})^T (Σ^{(i)})^{−1} (h − h̄^{(i)})),  (A6)

where ψ_i = Γ_i if i ∈ W and ψ_i = Λ_i if i ∈ Y.

As in the previous section, the EP update equations for the means a_i and for the variances d_i are obtained by imposing the matching of the first and second moments of each variable h_i w.r.t. the tilted marginal distributions Q^{(i)}(h_i) and the fully approximated marginal distributions Q(h_i):

⟨h_i⟩_Q = ⟨h_i⟩_{Q^{(i)}},  ⟨h_i²⟩_Q = ⟨h_i²⟩_{Q^{(i)}},  (A7)

which leads, again, to Eq. (26) and to Eq. (27).

Appendix B: EP free energy for the diluted perceptron problem
Let us recall the definition of the EP free energy for a system with N + M variables:

F_EP = (N + M − 1) log Z_Q − Σ_{k=1}^{N+M} log Z_{Q^{(k)}},  (B1)

where:

log Z_Q = ((N + M)/2) log(2π) + (1/2) log(det Σ),  (B2)

and the expression of log Z_{Q^{(k)}} depends on the type of prior considered. If k = 1, ..., N, the prior is a spike-and-slab and one has:

log Z_{Q^{(k)}} = log( (1 − ρ) exp(−µ_k²/(2Σ_k))/√(2πΣ_k) + ρ √(λ/(2π(1 + λΣ_k))) exp(−λµ_k²/(2(1 + λΣ_k))) ),  k = 1, ..., N.  (B3)

For the remaining variables, the expression of Z_{Q^{(k)}} either reads:

Z_{Q^{(k)}} = √(2πΣ_k) (1/2)(1 + erf(µ_k/√(2Σ_k))),  k = N + 1, ..., N + M,  (B4)

if Λ(h_k) = Θ(h_k), or it reads:

Z_{Q^{(k)}} = √(2πΣ_k) [ η (1/2)(1 + erf(µ_k/√(2Σ_k))) + (1 − η)(1/2)(1 − erf(µ_k/√(2Σ_k))) ] = √(2πΣ_k) [ 1/2 + (η − 1/2) erf(µ_k/√(2Σ_k)) ],  k = N + 1, ..., N + M,  (B5)

if Λ(h_k) = η Θ(h_k) + (1 − η) Θ(−h_k).

As a consequence, when Λ(h_k) = Θ(h_k), the EP free energy of the problem is given by:

F_EP = (N + M − 1) [ ((N + M)/2) log(2π) + (1/2) log(det Σ) ] − Σ_{k=1}^{N} log( (1 − ρ) exp(−µ_k²/(2Σ_k))/√(2πΣ_k) + ρ √(λ/(2π(1 + λΣ_k))) exp(−λµ_k²/(2(1 + λΣ_k))) ) − (M/2) log(2π) − (1/2) Σ_{k=N+1}^{N+M} log Σ_k − Σ_{k=N+1}^{N+M} log( (1/2)(1 + erf(µ_k/√(2Σ_k))) ),  (B6)

whereas, when Λ(h_k) = η Θ(h_k) + (1 − η) Θ(−h_k), it is given by:

F_EP = (N + M − 1) [ ((N + M)/2) log(2π) + (1/2) log(det Σ) ] − Σ_{k=1}^{N} log( (1 − ρ) exp(−µ_k²/(2Σ_k))/√(2πΣ_k) + ρ √(λ/(2π(1 + λΣ_k))) exp(−λµ_k²/(2(1 + λΣ_k))) ) − (M/2) log(2π) − (1/2) Σ_{k=N+1}^{N+M} log Σ_k − Σ_{k=N+1}^{N+M} log( 1/2 + (η − 1/2) erf(µ_k/√(2Σ_k)) ).  (B7)

As a practical remark, an efficient way to compute F_EP numerically involves using the Cholesky decomposition of the covariance matrix in log(det Σ), that is, Σ = LL^T, where L is a lower triangular matrix with real positive diagonal entries L_kk for all k = 1, ..., N + M. The log determinant contribution to the EP free energy is then efficiently computed as log(det Σ) = 2 Σ_{k=1}^{N+M} log(L_kk).

Appendix C: Learning the density level of the weights of the teacher and the noise on the labels
The parameters of the prior distributions, such as the density ρ of the weights of the teacher signal and the fraction η of labels fulfilling the consistency constraints, can be iteratively learned by the student perceptron by minimizing the free energy associated with the EP algorithm. We follow the reasoning laid out in reference [42], which we here recall and adapt to the sparse perceptron learning problem.

Let θ denote the set of parameters of the prior distribution P_θ(h). For example, the density ρ appears in the prior distribution in the factors Γ(w_i), for i = 1, ..., N, whereas the consistency level η of the labels appears in the factors Λ in the noisy case. Such parameters can be estimated by the student perceptron by maximizing the following likelihood function:

P(σ^1, ..., σ^M | θ, x^1, ..., x^M) = ∫ dh P(σ^1, ..., σ^M, h | θ, x^1, ..., x^M) = ∫ dh P(X_σ | h, x^1, ..., x^M) P(h | θ) = ∫ dh P(X_σ | h, x^1, ..., x^M) P(h_1, ..., h_N | ρ, λ) P(h_{N+1}, ..., h_{N+M} | η) = Z(θ),  (C1)

which is nothing but the normalization of the posterior distribution in Eq. (11).

It is possible to associate a free energy to the partition function (C1) by using the definition F = −log Z(θ). When EP reaches its fixed point, F is approximated by the EP free energy (B1) and the student perceptron can attempt to minimize the latter via gradient descent:

θ_j^{(t+1)} = θ_j^{(t)} − δθ_j ∂F_EP/∂θ_j,  (C2)

where t denotes the current iteration, θ_j denotes the j-th component of the parameter vector θ and δθ_j is its corresponding learning rate, which was taken equal to 10− in our numerical experiments, both when learning ρ and when learning η.

Notice that, while the only contributions to F_EP depending explicitly on the parameters θ of the prior are given by the terms F_{Q^{(k)}}, k = 1, ..., N, the components of the gradient include other terms as well.
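The update (C2) for the density parameter, with the explicit derivative written out in Eq. (C3) below, can be sketched as follows. This NumPy snippet is an illustrative sketch under our own conventions: the function names, the default learning rate and the clipping of ρ to (0, 1) are our choices, not the paper's.

```python
import numpy as np

def grad_rho(mu, sigma, rho, lam):
    """dF_EP/drho at fixed cavity parameters mu_k, Sigma_k, Eq. (C3):
    each term is (A_k - B_k) / ((1 - rho) A_k + rho B_k), with A_k the
    spike evidence and B_k the slab evidence of the k-th tilted marginal."""
    A = np.exp(-mu**2 / (2.0 * sigma)) / np.sqrt(2.0 * np.pi * sigma)
    v = 1.0 / lam + sigma            # slab variance plus cavity variance
    B = np.exp(-mu**2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)
    return float(np.sum((A - B) / ((1.0 - rho) * A + rho * B)))

def update_rho(mu, sigma, rho, lam, lr=1e-3):
    """One gradient-descent step on rho, Eq. (C2), clipped to stay in (0, 1)."""
    return float(np.clip(rho - lr * grad_rho(mu, sigma, rho, lam),
                         1e-6, 1.0 - 1e-6))
```

In the alternating scheme described in this appendix, one such step is interleaved with each EP update of the cavity parameters; since the second derivative with respect to ρ is positive for λ > 0, repeated steps at fixed µ_k and Σ_k approach a unique minimizer.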
However, these terms depend on the derivatives of the free energy with respect to the cavity parameters, which vanish at the EP fixed point. The optimization strategy we employ consists in iteratively alternating an EP update step at fixed prior parameters and an update of $\theta$ performed via gradient descent at fixed EP parameters, similarly to an expectation maximization (EM) scheme, where the optimization over the EP parameters corresponds to the expectation step and the minimization of the free energy with respect to $\theta$ corresponds to the maximization step. A ‘proper’ EM procedure would also be possible: it would alternate a complete EP estimation of the approximating posterior distributions at fixed prior parameters until convergence is reached (E-step) with a maximum likelihood update of the prior parameters (M-step). Employing an alternating minimization procedure of this kind allows us to consider only the explicit dependence of the free energy on the prior parameters. In particular, the derivative $\partial F_{\mathrm{EP}} / \partial \rho$ in Eq. (C2) reads:
\[
\frac{\partial F_{\mathrm{EP}}}{\partial \rho} = \sum_{k=1}^{N} \frac{\frac{1}{\sqrt{2\pi\Sigma_k}} e^{-\frac{\mu_k^2}{2\Sigma_k}} - \frac{1}{\sqrt{2\pi(\lambda+\Sigma_k)}} e^{-\frac{\mu_k^2}{2(\lambda+\Sigma_k)}}}{(1-\rho) \frac{1}{\sqrt{2\pi\Sigma_k}} e^{-\frac{\mu_k^2}{2\Sigma_k}} + \rho \frac{1}{\sqrt{2\pi(\lambda+\Sigma_k)}} e^{-\frac{\mu_k^2}{2(\lambda+\Sigma_k)}}} .
\tag{C3}
\]
Moreover, as highlighted in [42], the second derivative with respect to $\rho$ is strictly positive if $\lambda >$
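The expression (C3) can be verified numerically by comparing it with a central finite-difference derivative of the $\rho$-dependent terms of the free energy, which have the spike-and-slab form implied by Eq. (C3). The values of $\mu_k$, $\Sigma_k$ and $\lambda$ below are synthetic stand-ins for illustration only; in the algorithm they come from the converged EP parameters and the slab prior.

```python
import math

def gauss(m, v):
    # zero-mean Gaussian density of variance v, evaluated at m
    return math.exp(-m * m / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

# Synthetic stand-ins for the EP parameters and the slab variance lambda
mu    = [0.9, -0.2, 0.05, 1.4, -0.6]
Sigma = [0.3,  0.5, 0.2,  0.4,  0.6]
lam   = 1.0

def F(rho):
    # rho-dependent terms of F_EP: -log[(1-rho) N(0; mu_k, Sigma_k) + rho N(0; mu_k, lam+Sigma_k)]
    return -sum(math.log((1 - rho) * gauss(m, s) + rho * gauss(m, lam + s))
                for m, s in zip(mu, Sigma))

def dF(rho):
    # analytic derivative, Eq. (C3)
    return sum((gauss(m, s) - gauss(m, lam + s)) /
               ((1 - rho) * gauss(m, s) + rho * gauss(m, lam + s))
               for m, s in zip(mu, Sigma))

rho, eps = 0.3, 1e-6
fd = (F(rho + eps) - F(rho - eps)) / (2 * eps)   # central finite difference
print(dF(rho), fd)
```

The two printed values should agree to within the finite-difference error.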
0, which guarantees the uniqueness of the value of the density that minimizes $F_{\mathrm{EP}}$, provided that the EP parameters $\mu_k$ and $\Sigma_k$ are fixed for all $k = 1, \dots, N$.

The same line of reasoning applies to the estimation of the parameter $\eta$. In this case, the only contributions to $F_{\mathrm{EP}}$ in which $\eta$ appears are the terms $F_{Q^{(k)}}$, $k = N+1, \dots, N+M$, and taking the derivative with respect to $\eta$ yields:
\[
\frac{\partial F_{\mathrm{EP}}}{\partial \eta} = \sum_{k=N+1}^{N+M} \frac{- \operatorname{erf}\left(\frac{\mu_k}{\sqrt{2\Sigma_k}}\right)}{\frac{1}{2} + \left(\eta - \frac{1}{2}\right) \operatorname{erf}\left(\frac{\mu_k}{\sqrt{2\Sigma_k}}\right)} .
\tag{C4}
\]

[1] Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lotfi A. Zadeh, Feature extraction: foundations and applications, Vol. 207 (Springer, 2008).
[2] Trevor Hastie, Robert Tibshirani, and Jerome Friedman,
The elements of statistical learning: data mining, inference, and prediction (Springer Science & Business Media, 2009).
[3] “The cancer genome atlas web site,” (2020, accessed February 16, 2020).
[4] E. J. Candès and Michael B. Wakin, “An introduction to compressive sampling [a sensing/sampling paradigm that goes against the common knowledge in data acquisition],” IEEE Signal Processing Magazine, 21–30 (2008).
[5] Richard G. Baraniuk, “More is less: signal processing and the data deluge,” Science, 717–719 (2011).
[6] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, 1289–1306 (2006).
[7] E. J. Candès and T. Tao, “Near-optimal signal recovery from random projections: Universal encoding strategies?” IEEE Transactions on Information Theory, 5406–5425 (2006).
[8] David L. Donoho, Arian Maleki, and Andrea Montanari, “Message-passing algorithms for compressed sensing,” Proceedings of the National Academy of Sciences, 18914–18919 (2009).
[10] Surya Ganguli and Haim Sompolinsky, “Statistical mechanics of compressed sensing,” Phys. Rev. Lett., 188701 (2010).
[11] F. Krzakala, M. Mézard, F. Sausset, Y. Sun, and L. Zdeborová, “Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices,” Journal of Statistical Mechanics: Theory and Experiment, P08009 (2012).
[12] Petros T. Boufounos and Richard G. Baraniuk, “1-bit compressive sensing,” in (IEEE, 2008) pp. 16–21.
[13] Doohwan Lee, Tatsuya Sasaki, Takayuki Yamada, Kazunori Akabane, Yo Yamaguchi, and Kazuhiro Uehara, “Spectrum sensing for networked system using 1-bit compressed sensing with partial random circulant measurement matrices,” in (IEEE, 2012) pp. 1–5.
[14] Robert Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), 267–288 (1996).
[15] A. Braunstein, A. Pagnani, M. Weigt, and R. Zecchina, “Gene-network inference by message passing,” in Journal of Physics: Conference Series, Vol.
95 (IOP Publishing, 2008) p. 012016.
[16] A. Braunstein, A. Pagnani, M. Weigt, and R. Zecchina, “Inference algorithms for gene networks: a statistical mechanics analysis,” Journal of Statistical Mechanics: Theory and Experiment, P12001 (2008).
[17] Andrea Pagnani, Francesca Tria, and Martin Weigt, “Classification and sparse-signature extraction from gene-expression data,” Journal of Statistical Mechanics: Theory and Experiment, P05001 (2009).
[18] Marc Bailly-Bechet, Alfredo Braunstein, Andrea Pagnani, Martin Weigt, and Riccardo Zecchina, “Inference of sparse combinatorial-control networks from gene-expression data: a message passing approach,” BMC Bioinformatics, 355 (2010).
[19] Evan J. Molinelli, Anil Korkut, Weiqing Wang, Martin L. Miller, Nicholas P. Gauthier, Xiaohong Jing, Poorvi Kaushik, Qin He, Gordon Mills, David B. Solit, Christine A. Pratilas, Martin Weigt, Alfredo Braunstein, Andrea Pagnani, Riccardo Zecchina, and Chris Sander, “Perturbation biology: Inferring signaling networks in cellular systems,” PLOS Computational Biology, 1–23 (2013).
[20] Yoshiyuki Kabashima, “A CDMA multiuser detection algorithm on the basis of belief propagation,” Journal of Physics A: Mathematical and General, 11111–11121 (2003).
[21] Shinsuke Uda and Yoshiyuki Kabashima, “Statistical mechanical development of a sparse Bayesian classifier,” Journal of the Physical Society of Japan, 2233–2242 (2005).
[22] Y. Kabashima, “Inference from correlated patterns: a unified theory for perceptron learning and linear vector channels,” Journal of Physics: Conference Series, 012001 (2008).
[23] Yingying Xu and Yoshiyuki Kabashima, “Statistical mechanics approach to 1-bit compressed sensing,” Journal of Statistical Mechanics: Theory and Experiment, P02041 (2013).
[24] Alejandro Lage-Castellanos, Andrea Pagnani, and Martin Weigt, “Statistical mechanics of sparse generalization and graphical model selection,” Journal of Statistical Mechanics: Theory and Experiment, P10009 (2009).
[25] A. Lage-Castellanos, A.
Pagnani, and M. Weigt, “The importance of dilution in the inference of biological networks,” in (2009) pp. 531–538.
[26] Alejandro Lage-Castellanos, Andrea Pagnani, and Gretel Quintero Angulo, “Stability of the replica symmetric solution in diluted perceptron learning,” Journal of Statistical Mechanics: Theory and Experiment, L02002 (2013).
[27] Yingying Xu, Yoshiyuki Kabashima, and Lenka Zdeborová, “Bayesian signal reconstruction for 1-bit compressed sensing,” Journal of Statistical Mechanics: Theory and Experiment, P11015 (2014).
[28] S. Rangan, P. Schniter, and A. K. Fletcher, “Vector approximate message passing,” IEEE Transactions on Information Theory, 6664–6684 (2019).
[29] Philip Schniter, Sundeep Rangan, and Alyson K. Fletcher, “Vector approximate message passing for the generalized linear model,” (2016), arXiv:1612.01186 [cs.IT].
[30] X. Meng, S. Wu, and J. Zhu, “A unified Bayesian inference framework for generalized linear models,” IEEE Signal Processing Letters, 398–402 (2018).
[31] P. T. Boufounos, “Greedy sparse signal reconstruction from sign measurements,” in (2009) pp. 1305–1309.
[32] L. Jacques, J. N. Laska, P. T. Boufounos, and R. G. Baraniuk, “Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors,” IEEE Transactions on Information Theory, 2082–2102 (2013).
[33] M. Yan, Y. Yang, and S. Osher, “Robust 1-bit compressive sensing using adaptive outlier pursuit,” IEEE Transactions on Signal Processing, 3868–3875 (2012).
[34] A. Movahed, A. Panahi, and G. Durisi, “A robust RFPI-based 1-bit compressive sensing reconstruction algorithm,” in (2012) pp. 567–571.
[35] F. Li, J. Fang, H. Li, and L. Huang, “Robust one-bit Bayesian compressed sensing with sign-flip errors,” IEEE Signal Processing Letters, 857–861 (2015).
[36] M. Opper and O. Winther, “Gaussian processes for classification: mean-field algorithms,” Neural Computation, 2655–2684 (2000).
[37] M. Opper and O.
Winther, “Adaptive and self-averaging Thouless-Anderson-Palmer mean-field theory for probabilistic modeling,” Physical Review E, 056131 (2001).
[38] T. P. Minka, “Expectation propagation for approximate Bayesian inference,” in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (Morgan Kaufmann Publishers Inc., 2001) pp. 362–369.
[39] A. Braunstein, A. P. Muntoni, and A. Pagnani, “An analytic approximation of the feasible space of metabolic networks,” Nature Communications, 14915 (2017).
[40] T. J. Mitchell and J. J. Beauchamp, “Bayesian variable selection in linear regression,” Journal of the American Statistical Association (1988).
[41] Hidetoshi Nishimori, Statistical Physics of Spin Glasses and Information Processing: An Introduction, International Series of Monographs on Physics (Oxford University Press, 2001).
[42] Alfredo Braunstein, Anna Paola Muntoni, Andrea Pagnani, and Mirko Pieropan, “Compressed sensing reconstruction using expectation propagation,” Journal of Physics A: Mathematical and Theoretical 53 (2020).