Sparse online variational Bayesian regression
Kody J. H. Law and Vitaly Zankin

February 25, 2021
Abstract
This work considers variational Bayesian inference as an inexpensive and scalable alternative to a fully Bayesian approach in the context of sparsity-promoting priors. In particular, the priors considered arise from scale mixtures of Normal distributions with a generalized inverse Gaussian mixing distribution. This includes the variational Bayesian LASSO as an inexpensive and scalable alternative to the Bayesian LASSO introduced in [56]. It also includes priors which more strongly promote sparsity. For linear models the method requires only the iterative solution of deterministic least squares problems. Furthermore, for n → ∞ data points and p unknown covariates the method can be implemented exactly online with a cost of O(p³) in computation and O(p²) in memory. For large p an approximation is able to achieve promising results for a cost of O(p) in both computation and memory. Strategies for hyper-parameter tuning are also considered. The method is implemented for real and simulated data. It is shown that the performance in terms of variable selection and uncertainty quantification of the variational Bayesian LASSO can be comparable to the Bayesian LASSO for problems which are tractable with that method, and for a fraction of the cost. The present method comfortably handles n = p = 131,073 on a laptop in minutes, and n = 10⁵, p = 10⁶ overnight.

1 Introduction

Regression is a quintessential and ubiquitous task of machine learning. The simplest method one can use to solve regression tasks is a linear model with Gaussian noise and prior [11]. The most attractive feature of linear Gaussian models is analytical tractability, from both frequentist and Bayesian viewpoints. However, once one employs basis function expansions, they also become quite flexible. There are numerous methods in which linear models can be embedded, such as total least squares [75] and mixtures of regressions [45], for example. Sparsity-promoting priors have proven to be very useful for identifying useful features and avoiding overfitting, perhaps most notably the LASSO [68] and its incarnation as total variation (TV) regularization in imaging [71, 66]. However, as soon as a non-Gaussian prior is introduced then analytical tractability is lost and, in particular, the Bayesian solution becomes very expensive [56], requiring Markov chain Monte Carlo (MCMC) methods [60, 36]. Furthermore, sparsity-promoting priors are not differentiable, which prevents the use of simple Gaussian approximations such as the Laplace approximation [7].

Here we introduce variational approximations of the posterior associated to sparsity priors given by scale mixtures of Normals, with a generalized inverse Gaussian mixing distribution. This includes the variational Bayesian LASSO (VBL) – a principled Gaussian approximation to the Bayesian LASSO presented in [56]. It is derived through a variational Bayesian expectation maximization (VBEM) method [6, 4], which requires only the solution of (unconstrained) linear systems, and provides approximations of the mean and covariance of the target. Additionally, in parallel we perform classical expectation maximization (EM) [24] to obtain the maximum a posteriori (MAP) estimator.

The approach presented here provides fast uncertainty quantification (UQ) in the context of sparse regression and linear inverse problems. The key points are:

• Fast and inexpensive variational Bayesian solution with sparsity-promoting priors, such as π(β) ∝ exp(−λ|β|_q^q), where |β|_q^q = Σ_{i=1}^p |β_i|^q with q ≤ 1.
The Bayesian formulation provides UQ in the estimate, which is becoming increasingly important in science and engineering applications [64, 34].
• Online Bayesian solution means infinite data (n → ∞) can be learned from, with an O(p²) memory cost and O(p³) compute cost per update, where p is the width of the design matrix (see below). This trivially reduces to the standard Kalman filter [44] in the case of a Gaussian prior, and is hence exact (the compute cost also reduces to O(p²) in this case). Otherwise it provides an exact MAP estimate (for a convex prior) and a variational approximation to the posterior, with little more than a pair of parallel Kalman filters. We note the results presented are constrained to the case of regression, but will be extended to non-trivial dynamics in future work.
• Adaptive online learning of hyper-parameters via EM strategies provides improved accuracy at a marginal additional cost.

• A further "low rank + diagonal" approximation can provide a reduction in cost to O(p) in case p is prohibitively large, in a similar spirit to the ensemble Kalman filter (EnKF) [29].

The rest of the paper is organized as follows. The basic model is introduced in Section 2: after introducing linear models in 2.1, sparsity priors in 2.2, the Bayesian formulation in 2.3, EM in 2.4, and VBEM in 2.5, this culminates in our Gaussian approximation of sparsity-promoting posteriors and VBL in 2.6. Further enhancements to the basic model are introduced in Section 3. In particular, the online version is introduced in 3.1, and hyper-parameter tuning is considered in 3.2. Numerical results are presented in Section 4, including a comparison with the Bayesian LASSO on real data and a high-dimensional image deconvolution example. (See Remark 2.4 for discussion of some recent related work which we discovered after writing this paper.)
2 Background

First, some background and the basic model will be introduced in this section.

2.1 Linear models
Let D_n = {(x_i, y_i)}_{i=1}^n, with x_i ∈ R^p and y_i ∈ R (for simplicity), and let X_n = [x_1, . . . , x_n]^T and Y_n = (y_1, . . . , y_n)^T. Consider the following statistical model for Bayesian linear regression:

y_i = x_i^T β + ε_i ,   ε_i ⊥ β .   (1)

Online inference of β corresponds to the case n → ∞. If β ∼ N(m_0, C_0) and ε_i ∼ N(0, γ²), possibly depending on some parameters θ ∈ Θ ⊂ R^l which are suppressed in the notation, then conditioned on θ, the posterior distribution β | D_n ∼ N(m_n, C_n) is given in closed form as follows [11]:

m_n = ( γ^{-2} X_n^T X_n + C_0^{-1} )^{-1} ( γ^{-2} X_n^T Y_n + C_0^{-1} m_0 )
    = m_0 + C_0 X_n^T ( γ² I_n + X_n C_0 X_n^T )^{-1} ( Y_n − X_n m_0 ) ,   (2)

C_n = ( γ^{-2} X_n^T X_n + C_0^{-1} )^{-1}
    = C_0 − C_0 X_n^T ( γ² I_n + X_n C_0 X_n^T )^{-1} X_n C_0 .   (3)

The mean m_n minimizes the quadratic objective function

(1/(2γ²)) |X_n β − Y_n|² + (1/2) |C_0^{-1/2} (β − m_0)|² .   (4)

By choosing the appropriate version above, the computation cost is O(p n min{p, n}) and the memory cost is O(pn). In case n > p, and in particular if n → ∞, the Kalman filter [44] provides an exact solution online with a memory and (per iteration) computation cost of O(p²). See Appendix D.

A useful observation in what follows is that equation (2) can be rewritten recursively as

m_n = ( γ^{-2}(x_n x_n^T + X_{n−1}^T X_{n−1}) + C_0^{-1} )^{-1} ( γ^{-2}(x_n y_n + X_{n−1}^T X_{n−1} m_{n−1}) + C_0^{-1} m_{n−1} )
    = m_{n−1} + C_0 X_n^T ( γ² I_n + X_n C_0 X_n^T )^{-1} ( Ŷ_n − X_n m_{n−1} ) ,   (5)

where Ŷ_n := ((X_{n−1} m_{n−1})^T, y_n)^T. The above is presented for the assimilation of one observation at a time, but it is easily modified to update a batch of M observations at a time, which will be considered later. This observation is obviously not useful by itself, as it incurs a cost of O(min{p, n}³) per iteration, whereas the Kalman filter delivers O(p²) updates in primal/covariance form (see (55)). However, in what follows, the precision will be required. The latter representation will facilitate an approximate update which relies on recursive "low-rank + diagonal" approximations of γ^{-2} X_n^T X_n + C_0^{-1}, in a similar spirit to the ensemble Kalman filter [15].

Linear models of the form x^T β are quite flexible, once one considers basis function expansions. In other words, x = (1, ψ_1(s), ψ_2(s), . . . , ψ_p(s)) for data s ∈ R^d and some functions {ψ_i}, which can be a subset of polynomials [34, 20, 38], wavelets and other "x-lets" [19], radial basis functions [14], random feature models [59], or any number of other choices. The book [7] provides a concise and easy-to-read summary for regression applications. In fact, there are complete bases for many function spaces. For example, if Ω = [0, 1]^d then the Fourier series forms a complete basis for L²(Ω) = {f : Ω → R ; ∫_Ω f(s)² ds < ∞} [33], and a subset of such features can therefore be used to construct a convergent approximation. In fact, since radial basis function kernel operators are diagonalized by the Fourier basis, the expectation of the product of two such features with a random frequency is equal to the kernel evaluation. Monte Carlo approximation of such expectations is the basis of the random feature models [59].
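To make the recursion concrete, the following is a minimal NumPy sketch (with hypothetical synthetic data; all variable names are ours) checking that rank-one Kalman-style assimilation reproduces the monolithic posterior (2)-(3):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, gamma = 50, 5, 0.1                    # gamma is the noise std in (1)
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + gamma * rng.normal(size=n)
m0, C0 = np.zeros(p), np.eye(p)             # Gaussian prior N(m0, C0)

# Monolithic posterior, precision form of (2)-(3).
Cn = np.linalg.inv(X.T @ X / gamma**2 + np.linalg.inv(C0))
mn = Cn @ (X.T @ Y / gamma**2 + np.linalg.solve(C0, m0))

# Same posterior, one observation at a time (Kalman filter, O(p^2) per update).
m, C = m0.copy(), C0.copy()
for x, y in zip(X, Y):
    S = gamma**2 + x @ C @ x                # innovation variance (scalar)
    K = C @ x / S                           # Kalman gain
    m = m + K * (y - x @ m)
    C = C - np.outer(K, x @ C)

assert np.allclose(m, mn) and np.allclose(C, Cn)
```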
An issue is how many terms to include and, perhaps more importantly, how to select a subset of the important terms from a sufficiently rich set, without incurring a loss in accuracy due to overfitting, as can occur with too much flexibility. The latter issue is often referred to as the "variance-bias tradeoff": a model which is too flexible (negligible bias) may be too strongly influenced by a particular data set, hence incurring a large variance over a range of different data sets [32]. This well-known issue can be dealt with by beginning with a sufficiently rich class of approximating functions (e.g. a large enough basis) and then introducing prior assumptions, or regularization, in order to let the model select the most parsimonious representation [69, 40, 68, 54, 53, 67].

2.2 Sparsity priors

In the context of prior selection, the Gaussian assumption is often considered too restrictive. In particular, it became very popular at the end of the last millennium and the beginning of this one to utilize a sparsity-promoting prior. In this case the Gaussian with quadratic density, which gives rise to the second (Tikhonov regularization) term in equation (4), is replaced with another density of the form exp(−R(β)), where for example R(β) = |Lβ|_q^q, for q ∈ (0, 2], L : R^{p̃} → R^p, and |β|_q^q = Σ_{i=1}^p |β_i|^q [68, 19, 25, 56]. This is often extended to the case q = 0, which corresponds to counting measure on the non-zero elements, |β|_0 = Σ_{i=1}^p 1_{{β_i ≠ 0}}. As we will see, more general sparsity-promoting priors are also possible. Note that if there is a W : R^p → R^{p̃} such that W L = I_{p̃}, then one can always redefine X_n → X_n W and β → Lβ. Therefore, we assume without loss of too much generality that L = I_p. This will be discussed further in the examples.

There is a computational burden to performing inference with these more exotic priors, even in the case when we abandon UQ and settle for a MAP estimator. In the best case of q = 1 we have a convex optimization problem, which can be solved efficiently by a number of modern methods, such as iterative soft thresholding [22, 13] and the alternating direction method of multipliers [31, 12]. These iterative methods incur the same cost as (2) at each iteration.

Considering the full Bayesian posterior, the situation is even more daunting. Indeed, once one adopts such a sparsity prior the posterior is no longer characterized in closed form with finitely many parameters, as in the Gaussian case (in the finite/discrete/parametric case). The Laplace approximation [7] requires the derivative and Hessian of the log-posterior, which may not exist. Computationally intensive methods such as Markov chain Monte Carlo (MCMC) [60] are required for consistent estimation, as proposed for the Bayesian LASSO in [56]. There has been a lot of activity in this direction in the past 10 years – see, e.g., [49, 57, 58, 50, 16, 72, 9]. Here we propose to employ a variational approach to recover the best Gaussian approximation to the sparsity-promoting posterior, in a sense to be defined precisely below. It is worth mentioning that the Bayesian formulation allows for more general priors than the continuous, or "shrinkage", priors considered here. In recent years, "spike and slab" priors [51] have become very popular, consisting of a mixture distribution with a Dirac mass at 0 (spike) and a continuous distribution such as the ones considered here (slab). See [5] for a recent review, focused on the spike-and-slab LASSO.

2.3 Bayesian formulation

Assume that the prior consists of p independent random variables β_j, each with Laplace distribution L(β_j; λ). The MAP estimator associated to this model corresponds to L1-regularized regression, or the LASSO [68].
It is well-known that the Laplace distribution can be expressed as a scale mixture of zero-mean Gaussians N(β_j; 0, θ_j) [2, 30, 61], where the mixing distribution follows a generalized inverse Gaussian law

GIG(θ_j; ν, δ, λ) ∝ θ_j^{ν−1} exp[ −(1/2)( δ²/θ_j + λ² θ_j ) ] ,   (6)

with δ = 0 and ν = 1, i.e.

L(β_j; λ) = ∫_{R_+} N(β_j; 0, θ_j) GIG(θ_j; 1, 0, λ) dθ_j .   (7)

We consider priors defined by general scale mixtures of Normals with mixing distribution given by (6). This includes other sparsity-promoting distributions such as the Normal inverse Gaussian and the Normal Gamma [70, 18, 37, 17]. Furthermore, the conditionals of the joint are known exactly:

β | θ, X_n, Y_n ∼ N(β; m_n^θ, C_n^θ) ,   (8)
θ_j | β_j, X_n, Y_n ∼ GIG(θ_j; ν − 1/2, √(δ² + β_j²), λ) ,   (9)

where m_n^θ, C_n^θ are given by (2), with C_0 = D(θ) := diag(θ_1, . . . , θ_p) the diagonal matrix with θ on the diagonal, and m_0 = 0. Therefore, Gibbs sampling can be used to sample from the monolithic posterior (2) [63, 56]. Sequential Monte Carlo methods [26, 52] have also been designed to sample from the full posterior sequentially in [63, 17], and a sequential expectation maximization (EM) [24] method has been used to approximate MAP estimates of β | (X_n, Y_n) in [18, 17].
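As a quick sanity check of (7), the following sketch uses SciPy quadrature (with δ = 0 and ν = 1 the mixing law (6) reduces to an exponential density with rate λ²/2):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

lam, beta = 1.5, 0.7
mix = lambda t: (lam**2 / 2) * np.exp(-lam**2 * t / 2)   # GIG(1, 0, lam)
marginal, _ = quad(lambda t: norm.pdf(beta, scale=np.sqrt(t)) * mix(t), 0, np.inf)
laplace = (lam / 2) * np.exp(-lam * abs(beta))           # L(beta; lam)
assert abs(marginal - laplace) < 1e-8
```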
Remark 2.1 One may also keep an original C_0, so that C_θ = (C_0^{-1} + D(1/θ))^{-1} and we have P(β | θ) = N(β; 0, (C_0^{-1} + D(1/θ))^{-1}). For δ = 0 and ν = 1 this would marginally yield an elastic net prior: P(β) ∝ N(β; 0, C_0) L(β; λ) [61].

2.4 Expectation maximization (EM)

For the next sections we suppress X and n in the notation where convenient. Suppose we want to maximize

log P(Y, β) = log ∫ P(Y, β, θ) dθ ≥ ∫ log( P(Y, β, θ) / q(θ) ) q(θ) dθ ,   (10)

where the inequality arises (for any probability density q(θ) > 0) from an application of Jensen's inequality, and P here denotes a probability density. The expectation maximization (EM) algorithm [24] proceeds as follows. Define

q_t(θ) = P(θ | β^t, Y) ,   Q(β | β^t) = ∫ log( P(Y, β, θ) / P(θ | β^t, Y) ) P(θ | β^t, Y) dθ ,   (11)

and let β^{t+1} = argmax_β Q(β | β^t). In our context this entails iteratively computing

Q(β | β^t) = −(1/2) β^T D(1/θ^{t+1}) β − (1/(2γ²)) |Y_n − X_n β|² + K(β^t, X_n, Y_n) ,   (12)

where we recall that D(1/θ^{t+1}) is the diagonal matrix with 1/θ_j^{t+1} on the diagonal. For example, in the case of δ ≥ 0 and ν = 1, θ^{t+1} is defined element-wise as

1/θ_j^{t+1} := E[ 1/θ_j | β^t, X_n, Y_n ] = λ ( (β_j^t)² + δ² )^{-1/2} .   (13)

The calculation of (12) is given in Appendix A along with a slightly lengthier explanation of EM. Note we have assumed ν = 1 but allowed δ > 0 in the hyperprior (6), which relaxes the marginal Laplace identity (7). The general form of (13) is given in equation (49), and the case ν = 0 is given in (39). One then has the iteration

µ_n^{t+1} = ( X_n^T X_n + γ² λ D( ((µ_n^t)² + δ²)^{-1/2} ) )^{-1} X_n^T Y_n .   (14)

These analytical calculations have been shown and used before in several works, including [30, 18, 37]. From this, one obtains the MAP estimator at convergence, µ_n^t → µ_n.

Remark 2.2 An iteratively reweighted least squares (IRLS) algorithm [41] can be derived in order to approximate regularization with q ∈ (0, 2] by a sequence of problems with q = 2, based on the following observation:

|β|_q^q = Σ_{i=1}^p |β_i|^q = Σ_{i=1}^p |β_i|^{q−2} β_i² .

The resulting iteration for q = 1 is exactly as in (14), where δ > 0 is interpreted as a regularization parameter. It can be shown under appropriate assumptions that µ_n^t → µ_n as t → ∞, where µ_n is sparse if such a solution exists, and convergence is linear (exponentially fast) for µ_n^t sufficiently close to µ_n [23].
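A minimal sketch of the resulting EM/IRLS iteration for the MAP estimator (the function name is ours; δ acts as the regularization parameter of Remark 2.2):

```python
import numpy as np

def lasso_map_em(X, Y, gamma, lam, delta=1e-6, iters=200):
    """MAP estimate under the Laplace prior via the EM iteration (14):
    mu <- (X^T X + gamma^2 lam D((mu^2 + delta^2)^{-1/2}))^{-1} X^T Y."""
    mu = np.linalg.lstsq(X, Y, rcond=None)[0]       # least-squares warm start
    A, v = X.T @ X, X.T @ Y
    for _ in range(iters):
        w = lam / np.sqrt(mu**2 + delta**2)         # E[1/theta_j | mu], eq. (13)
        mu = np.linalg.solve(A + gamma**2 * np.diag(w), v)
    return mu
```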
2.5 Variational Bayesian EM (VBEM)

Here we propose to use the variational Bayesian expectation maximization (VBEM) algorithm, introduced in the context of graphical models in [4, 6]. We show how it works elegantly in our context to provide a Gaussian approximation to problems with sparsity priors, which is optimal in a certain sense. Suppose we return to (10), and this time multiply/divide by some density q(β) > 0 and integrate over β as well. Then we have the evidence lower bound

log P(Y) = log ∫ P(Y, β, θ) dθ dβ ≥ ∫ log( P(Y, β, θ) / (q(θ) q(β)) ) q(θ) q(β) dθ dβ .   (15)

Maximizing this with respect to the densities q(θ) q(β) coincides with minimizing the KL divergence between this variational approximation and the joint posterior, i.e.

log P(Y) − ∫ log( P(Y, β, θ) / (q(θ) q(β)) ) q(θ) q(β) dθ dβ = − ∫ log( P(β, θ | Y) / (q(θ) q(β)) ) q(θ) q(β) dθ dβ =: KL[ q(θ) q(β) || P(β, θ | Y) ] .   (16)

The objective functions for each of q(θ) and q(β) given the other are convex and can be minimized exactly, as observed in [4, 6], leading to the iterative algorithm

q^{t+1}(θ) ∝ exp( ∫ log P(Y, β, θ) q^t(β) dβ ) ,   (17)
q^{t+1}(β) ∝ exp( ∫ log P(Y, β, θ) q^{t+1}(θ) dθ ) .   (18)

Furthermore, following from convexity of the intermediate targets, this gives a descent direction for (16): KL[ q^{t+1}(θ) q^{t+1}(β) || P(β, θ | Y) ] ≤ KL[ q^t(θ) q^t(β) || P(β, θ | Y) ]. Observe that constraining to q^{t+1}(β) = δ_{β^{t+1}}(β), where δ denotes the Dirac delta function and β^{t+1} is the point of maximum probability above, yields the original EM algorithm. Also, observe that (17) may itself be intractable in general, although it is shown in [6] that it is simplified somewhat for conjugate exponential models and may be analytically soluble. Fortunately, the present situation is the best case, where it is analytically soluble. Notice that the objective function (16) corresponds to an independence assumption between θ and β; however, from (17) it is clear that the solution solves a coupled system, and in fact probabilistic dependence is replaced with a deterministic dependence on each others' summary statistics, as noted in [6].

Figure 1: Illustration of the variational Bayesian LASSO (VBL). The top left figure shows the contours of the prior, the likelihood, and the posterior. The top right figure shows the contours of the posterior P(β|D) and the variational approximation q(β|D), along with means and MAPs. For this example, the posterior median is very close to the mean, but slightly towards the MAP. The bottom row shows the density of VBL (left, with MAP estimate µ) and the posterior (right).

2.6 Gaussian approximation to a sparsity promoting posterior

Following instead the variational Bayesian approach of Sec. 2.5 gives (17), i.e.

q^{t+1}(θ) ∝ ( Π_{j=1}^p θ_j^{-1/2} exp( −E_t[β_j²]/(2θ_j) ) ) P(θ | λ) ,   (19)
q^{t+1}(β) ∝ exp( −(1/2) Σ_{j=1}^p β_j² E_{t+1}[1/θ_j] − (1/(2γ²)) |Y_n − X_n β|² ) ,

where E_t is used to (degenerately) denote expectation with respect to the iteration-t intermediate variational distribution, with respect to its argument, β or θ. The first equation looks similar to the EM algorithm, however with the important difference

q^{t+1}(θ_j) = GIG( θ_j ; ν − 1/2 , √( δ² + C_{n,jj}^t + (m_{n,j}^t)² ) , λ ) ,   (20)

where (m_n^t, C_n^t) are the mean and covariance of q^t(β) = N(β; m_n^t, C_n^t) (note the appearance of the variance C_{n,jj}^t instead of just (m_{n,j}^t)²). In turn this means that for the case ν = 1 we have

1/θ_{n,j}^{t+1} := E_{t+1}[1/θ_j] = λ ( C_{n,jj}^t + (m_{n,j}^t)² + δ² )^{-1/2} .   (21)

The update equations are (e.g. in primal/monolithic form (2))

m_n^{t+1} = C_n^{t+1} ( γ^{-2} X_n^T Y_n ) ,   (22)
C_n^{t+1} = ( γ^{-2} X_n^T X_n + D(1/θ_n^{t+1}) )^{-1} .   (23)

Here we can explicitly observe the deterministic dependence between the marginally optimal θ and β distributions via each others' summary statistics. The general form of the update (21) is given in (49), and the case ν = 0 is given in (39).

Note that this algorithm runs for approximately the same cost as the former, and provides a Gaussian approximation (m_n^*, C_n^*) of the posterior. The former provides an approximation of the MAP estimator µ_n^*.
In the case ν = 1 we refer to this triple ( µ ∗ n , m ∗ n , C ∗ n ) as the variational Bayesian LASSO (VBL).All above is comprised in Algorithm 1. It is well-known that the mean ofsparsity priors, such as TV-regularization, may not in fact promote sparsity[47]. Indeed even the Bayesian LASSO point estimator [56], which returnsmedian instead of mean, is not as sparse as the standard LASSO, i.e. itsMAP estimator. In the context of UQ, one is more concerned with uncer-tainty measures, such as confidence intervals. For example, one may considerthe sparse solution to be reasonable if, for an index j such that µ n,j = 0 , one9 lgorithm 1 Monolithic Variational Bayesian LASSO.
Input : Design matrix X , labels Y , parameters γ, λ, δ, ν > , initial guess ( µ , m , C ) ∈ R p + p , convergence criteria ǫ, T > and distance function d : R (2 p + p ) × (2 p + p ) → R + .1. Specify functional forms θ t +1VBEM ( β ) and θ t +1EM ( β ) based on δ, ν , as given in(49).2. Set t = 0 , and ( µ − , m − , C − ) = p + p (all zeros).3. While t ≤ T and d (( µ t , m t , C t ) , ( µ t − , m t − , C t − )) > ǫ ;(a) Compute θ t +1VBEM and θ t +1EM (arguments suppressed);(b) Compute K t +1VBEM = D ( θ t +1VBEM ) X T ( XD ( θ t +1VBEM ) X T + γ I n ) − , (24) K t +1EM = D ( θ t +1EM ) X T ( XD ( θ t +1EM ) X T + γ I n ) − . (25)(c) Compute m t +1 = K t +1VBEM Y , (26) C t +1 = ( I − K t +1VBEM X ) D ( θ t +1VBEM ) , (27) µ t +1 = K t +1EM Y . (28)(d) t=t+1.
Output : ( µ ∗ , m ∗ , C ∗ ) ∈ R p + p .has ∈ ( m n,j − p C n,jj , m n,j + 2 p C n,jj ) , i.e. the origin is within the con-fidence interval of the Bayesian marginal for those coordinates which are pre-dicted to vanish. In general, one may flag as unusual any circumstances where µ n,j / ∈ ( m n,j − p C n,jj , m n,j + 2 p C n,jj ) . See Figure 1 for an illustration of theVBL applied to a simple example with p = n = 2 (“exact” values are calculatedwith numerical quadrature on the domain [ − , with grid points in eachdirection). Remark 2.3
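For illustration, here is a compact sketch of Algorithm 1 for ν = 1, using the precision form (22)-(23) rather than the gain form (24)-(28) (equivalent, but the gain form is cheaper when n < p; names are ours):

```python
import numpy as np

def vbl(X, Y, gamma, lam, delta=1e-6, iters=50):
    """Monolithic variational Bayesian LASSO: the VBEM updates (21)-(23) for
    (m, C) run in parallel with the EM update (14) for the MAP estimate mu."""
    n, p = X.shape
    A, v = X.T @ X / gamma**2, X.T @ Y / gamma**2
    m = mu = np.zeros(p)
    C = np.eye(p)
    for _ in range(iters):
        w_vb = lam / np.sqrt(np.diag(C) + m**2 + delta**2)   # eq. (21)
        C = np.linalg.inv(A + np.diag(w_vb))                  # eq. (23)
        m = C @ v                                             # eq. (22)
        w_em = lam / np.sqrt(mu**2 + delta**2)                # eq. (13)
        mu = np.linalg.solve(A + np.diag(w_em), v)            # eq. (14)
    return mu, m, C
```

The flagging rule above is then a one-liner: `flag = np.abs(mu - m) > 2 * np.sqrt(np.diag(C))`.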
Step 3 of Algorithm 1 requires a stopping criterion. It is sensible, for example, to let this be determined by the misfit, d((µ^t, m^t, C^t), (µ^{t−1}, m^{t−1}, C^{t−1})) = ||X m^t − Y||², or by max{ ||m^t − m^{t−1}||, ||µ^t − µ^{t−1}||, ||C^t − C^{t−1}|| }. In the first case, a good choice may be ε = ργ², for ρ < 1. We will see that the algorithm returns a good estimator very quickly, but may take a long time to converge, and may even return a worse estimator at convergence. Therefore, a maximum number of iterations is often also employed as an alternative in practice.

Remark 2.4
The VBEM has come to be known more generally as mean-field variational Bayes. See [10] for a recent review, where the quite natural name coordinate ascent variational inference is given to the iterations (19). After completing this work, we made several discoveries. The special case of λ = 0, referred to as automatic relevance determination [54], has been introduced already, independently, in [8] and [27]. Also, the work [3] considers variational inference in the context of the stable mixing distribution. This distribution cannot be explicitly represented in general, but it yields marginal priors of the form exp(−λ|β|_q^q) for q ∈ (0, 2], and leads to EM updates of the IRLS form with δ = 0, as in Remark 2.2. This is referred to as Bayesian bridge regression, and clearly includes the Bayesian LASSO (δ = 0, ν = 1, i.e. q = 1) as a special case. The focus of that work is on the resulting point estimator for very small q, however, and the VBL case is not considered.

3 Enhancements

This section focuses on enhancing the model by enabling sequential/online inference and hyper-parameter optimization.
3.1 Online inference

In the following we focus the description on (m_n, C_n) of the VBEM formulation (22), (23), but note that analogous equations hold for µ_n (14). There are two distinct scenarios to consider here. First we will consider the case of moderate p and n ≫ p, or in particular n → ∞, where the online method reproduces EM and VBEM exactly at a cost of O(p³) per iteration. The second case we will consider is that of very large p, and possibly n ≤ p, where it is necessary to impose an approximation in order to control the cost by O(p). The approximate approach is also amenable to n → ∞.

3.1.1 n ≫ p: exact method

Suppose we are assimilating batches of size M, and denote X̄_n = X_{nM}, Ȳ_n = Y_{nM}, X̃_n = [x_{(n−1)M+1}, . . . , x_{nM}]^T, Ỹ_n = [y_{(n−1)M+1}, . . . , y_{nM}]^T, so that e.g. X̃_n^T X̃_n = Σ_{i=(n−1)M+1}^{nM} x_i x_i^T. Sequential batches are presented, but this may not always be a sensible choice and some permutation of the indices may make sense; see Remark 3.2. We can compute batch updates of the required matrices in (2) exactly with a total of at most O(p²M) operations:

X̄_n^T X̄_n = X̄_{n−1}^T X̄_{n−1} + X̃_n^T X̃_n ,
X̄_n^T Ȳ_n = X̄_{n−1}^T Ȳ_{n−1} + X̃_n^T Ỹ_n ,
Ȳ_n^T Ȳ_n = Ȳ_{n−1}^T Ȳ_{n−1} + Ỹ_n^T Ỹ_n .

The cost may be smaller if the intermediate quantities are sparse (many zeros) or low-rank. We have the following equation for the precision, (C_n^t)^{-1} = γ^{-2} X̄_n^T X̄_n + D(1/θ_n^t), and we can proceed directly with iterating (22) and (23). In the worst-case scenario, the inversion required to compute (22) will incur a cost of O(p³), so one would aim to take M = p. We reiterate that the focus here is the case n ≫ p. For large p, which will be discussed now, X̄_n^T X̄_n must be sparse or low-rank. In this case, inversion can be done approximately with a cost of as little as O(p). In case O(p²) is prohibitive for computation or memory, then online computation and storage of the component matrices must be controlled as well. This is discussed further in the following section.
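A sketch of the exact online bookkeeping for the n ≫ p regime (a hypothetical helper class of our own; batches may be streamed indefinitely while memory stays O(p²)):

```python
import numpy as np

class SufficientStats:
    """Accumulates A = X^T X, v = X^T Y, s = Y^T Y over batches, so that
    (22)-(23) can be iterated at any time without revisiting the data."""
    def __init__(self, p):
        self.A, self.v, self.s, self.n = np.zeros((p, p)), np.zeros(p), 0.0, 0

    def assimilate(self, Xb, Yb):          # O(p^2 M) per batch of size M
        self.A += Xb.T @ Xb
        self.v += Xb.T @ Yb
        self.s += Yb @ Yb
        self.n += len(Yb)

    def vbem_step(self, m, C, gamma, lam, delta=1e-6):
        # One sweep of (21)-(23) given the current statistics; O(p^3).
        w = lam / np.sqrt(np.diag(C) + m**2 + delta**2)
        C = np.linalg.inv(self.A / gamma**2 + np.diag(w))
        return C @ (self.v / gamma**2), C
```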
3.1.2 Large p: approximate method

In the case of large p and/or n ≤ p, the problem is different. It is preferable to directly confront the monolithic problem (2) if possible, for example in case n < ∞ is fixed and X_n is sufficiently sparse and/or low-rank to allow the direct use of an iterative Krylov-type solver [39, 62]. On the other hand, if this cannot be handled directly, then a sequential/online strategy can be adopted as follows. Suppose we have X̂_{n−1} ∈ R^{M×p} such that X̄_{n−1}^T X̄_{n−1} ≈ X̂_{n−1}^T X̂_{n−1}, so that

(C_{n−1}^*)^{-1} ≈ γ^{-2} X̂_{n−1}^T X̂_{n−1} + D(1/θ_{n−1}^*) .

The idea is to preserve a "rank M + diagonal" type approximation, while effectively passing information forward in an online fashion. Recall equation (5) and define

Ŷ_n := ( X̂_{n−1} m_{n−1} ; Ỹ_n ) ∈ R^{2M} ,   X̄_n := ( X̂_{n−1} ; X̃_n ) ∈ R^{2M×p} .

The O(2M²p) update which replaces (22) and (23) is given by

m_n^{t+1} = m_{n−1} + D(θ_n^{t+1}) X̄_n^T ( γ² I_{2M} + X̄_n D(θ_n^{t+1}) X̄_n^T )^{-1} ( Ŷ_n − X̄_n m_{n−1} ) ,   (29)
C_n^{t+1} = ( I_p − D(θ_n^{t+1}) X̄_n^T ( γ² I_{2M} + X̄_n D(θ_n^{t+1}) X̄_n^T )^{-1} X̄_n ) D(θ_n^{t+1}) .   (30)

Finally, we need X̂_n ∈ R^{M×p} such that X̂_n^T X̂_n ≈ X̄_n^T X̄_n in order to proceed to the next iteration. This is achieved by (i) computing a reduced rank-M eigendecomposition X̄_n X̄_n^T ≈ U Σ U^T, with Σ ∈ R^{M×M} diagonal and U ∈ R^{2M×M} orthogonal, and (ii) defining X̂_n := U^T X̄_n. This approximation in principle costs O(2M²(p + 4M)), but with a memory cost of only O(Mp). All the steps above are summarized in Algorithm 2.
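The core of the "rank M + diagonal" recursion, as a sketch (one outer step: inner sweeps of (29)-(30) followed by the rank-M compression; we track only the diagonal of C, since only it enters the θ update (21); names are ours):

```python
import numpy as np

def algorithm2_step(Xhat, Xt, Yt, m_prev, gamma, lam, delta=1e-6, sweeps=10):
    """One outer step of Algorithm 2. Xhat (M x p) carries X^T X forward via
    Xhat^T Xhat; (Xt, Yt) is the new size-M batch. Cost O(M^2 p), memory O(M p)."""
    Xbar = np.vstack([Xhat, Xt])                       # (2M x p)
    Yhat = np.concatenate([Xhat @ m_prev, Yt])         # as in Section 3.1.2
    m, Cdiag = m_prev, np.ones(Xbar.shape[1])
    for _ in range(sweeps):
        D = np.sqrt(Cdiag + m**2 + delta**2) / lam     # theta, from (21)
        G = Xbar * D                                   # Xbar D(theta)
        S = gamma**2 * np.eye(len(Yhat)) + G @ Xbar.T  # (2M x 2M)
        K = np.linalg.solve(S, G).T                    # D Xbar^T S^{-1}, eq. (29)
        m = m_prev + K @ (Yhat - Xbar @ m_prev)        # eq. (29)
        Cdiag = D - np.einsum('ij,ij->i', K, G.T)      # diag of eq. (30)
    # Compress: top-M eigenvectors of Xbar Xbar^T give the new Xhat = U^T Xbar.
    _, U = np.linalg.eigh(Xbar @ Xbar.T)
    return U[:, -Xhat.shape[0]:].T @ Xbar, m, Cdiag
```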
Remark 3.1 (EnKF) We note the similarity between the above procedure and a (square-root) EnKF [48] for the solution of (55) and (56), which proceeds with a low-rank (or low-rank plus diagonal) approximation of the covariance C_n. In our scenario, the above framework is more natural and is expected to provide a better approximation. This is described further in Appendix D.
Remark 3.2 (Batching strategy) In the offline scenario where the data size n > 0 is fixed and the sequential method is employed, the choice of batches is important. If the inputs are i.i.d. random samples then sequential batching is sensible, i.e. (1, 2, . . . , M), (M + 1, . . . , 2M), etc. Otherwise, if there is structure in the inputs (e.g. in the context of inverse problems), then it makes more sense to use random sampling without replacement, or evenly spread-out batches (1, b+1, 2b+1, . . .), (2, b+2, 2b+2, . . .), etc., where b = p/M.

Note that in practice one would hope that after some iterations (µ_n, m_n, C_n) will not be changing very much with the iterative re-weighting, and few inner updates will be required, if any. If the model is stationary, then one may also not need to allow n → ∞. There are some other modifications which could be made along the way to further improve efficiency, such as thresholding, i.e. m_n → 1_{{ε < |m_n|}} m_n, for small ε > 0, where 1_A is the indicator function on the set A, and it acts elementwise on the entries of m_n (and similarly for µ_n). Suppose that m_n has essentially converged, and p′ ≪ p parameters are non-zero. We can then discard the 0-valued parameters, thereby either vastly speeding up the algorithm or making way for the inclusion of p − p′ new parameters. Similar things have been done before; see e.g. [74].

All of the present technology is well-suited to an online scenario, where one assumes a fixed static problem but data arrives sequentially in time, and may continue indefinitely. If this model is meant to emulate a computer simulation, for example one which is called by another computer simulation for a particular value of inputs, as in the common case of coupled multi-physics systems and whole device modelling, then one can decide whether the emulator is suitable for a given query input, for example by evaluating the uncertainty under the current approximation, x(s)^T C_n x(s). If this is below a certain level then the model returns m_n^T x(s) (and possibly also x(s)^T C_n x(s) if the requesting program is capable of handling uncertainty); otherwise the emulator requests a label from the computer simulation and is updated accordingly, as above. It may also be of interest in an offline scenario to build up a database of labeled data and revise the emulator as this is done. Such a task is called experimental design, and greedy or myopic sequential experimental design can be posed elegantly within the sequential framework above.

Algorithm 2 Online Approximate Variational Bayesian LASSO (large p).
Input: Design matrix X, labels Y (possibly infinite and arriving online), parameters γ, λ, δ, ν > 0, initial guess (µ^0, m^0, C^0) ∈ R^{2p+p²}, inner convergence criteria ε, T > 0, distance function d : R^{(2p+p²)×(2p+p²)} → R_+, batch size M and rule for batching (see Remark 3.2).
1. Set n = 1, X̂_1 = X̃_1. Do Algorithm 1, and output (µ_1^*, m_1^*, C_1^*) and functional forms θ_VBEM^{t+1}(β) and θ_EM^{t+1}(β), as given in (49).
2. For n = 2, . . .
   (a) Set t = 0, (µ_n^0, m_n^0, C_n^0) = (µ_{n−1}^*, m_{n−1}^*, C_{n−1}^*), (µ_n^{−1}, m_n^{−1}, C_n^{−1}) = 0_{2p+p²},
       Ŷ_n^VBEM = ( X̂_{n−1} m_{n−1}^* ; Ỹ_n ) ,  Ŷ_n^EM = ( X̂_{n−1} µ_{n−1}^* ; Ỹ_n ) ,  X̄_n = ( X̂_{n−1} ; X̃_n ) .
   (b) While t ≤ T and d((µ_n^t, m_n^t, C_n^t), (µ_n^{t−1}, m_n^{t−1}, C_n^{t−1})) > ε:
       i. Compute θ_VBEM^{t+1} and θ_EM^{t+1} (arguments suppressed);
       ii. Compute
           K_VBEM^{t+1} = D(θ_VBEM^{t+1}) X̄_n^T ( X̄_n D(θ_VBEM^{t+1}) X̄_n^T + γ² I_{2M} )^{-1} ,   (31)
           K_EM^{t+1} = D(θ_EM^{t+1}) X̄_n^T ( X̄_n D(θ_EM^{t+1}) X̄_n^T + γ² I_{2M} )^{-1} .   (32)
       iii. Compute
           m_n^{t+1} = m_{n−1}^* + K_VBEM^{t+1} ( Ŷ_n^VBEM − X̄_n m_{n−1}^* ) ,   (33)
           C_n^{t+1} = ( I_p − K_VBEM^{t+1} X̄_n ) D(θ_VBEM^{t+1}) ,   (34)
           µ_n^{t+1} = µ_{n−1}^* + K_EM^{t+1} ( Ŷ_n^EM − X̄_n µ_{n−1}^* ) .   (35)
       iv. t = t + 1.
   (c) Compute the rank-M approximation U Σ U^T ≈ X̄_n X̄_n^T, and set X̂_n := U^T X̄_n.
Output: (µ_n^*, m_n^*, C_n^*) ∈ R^{2p+p²}, at any time n (or the rank-M version of the latter).

3.2 Hyper-parameter tuning

Here it will be assumed that the regularization parameter δ > 0 is fixed at some small value. However, the hyper-parameters γ and λ still need to be learned. We will consider several approaches.

The first approach is only applicable to VBEM for (m_n, C_n) and is a version of an iterated nested VBEM algorithm, whereby for each τ = 1, 2, . . . the inner VBEM algorithm is as in (19) and Algorithm 1, and yields

q^{t,τ}(β | (λ, γ)^τ) → q^{*,τ}(β | (λ, γ)^τ) = N(m_n^{*,τ}, C_n^{*,τ}) ,

which is then itself used in an outer standard EM on (λ, γ) to find

(λ, γ)^{τ+1} = argmax_{λ,γ} ∫ log q^{*,τ}(β, Y | (λ, γ)) q^{*,τ}(β | Y, (λ, γ)^τ) dβ .

The details of how this is done will be described in Section 3.2.2 below. The procedure is iterated until convergence. The second approach is a variant on the first, in which single steps of the outer and inner algorithm are executed iteratively:

(λ, γ)^{t+1} = argmax_{λ,γ} ∫ log q^t(β, Y | (λ, γ)) q^t(β | Y, (λ, γ)^t) dβ .   (36)

The third approach is applicable to both EM (i.e. the problem for µ_n) and VBEM (i.e. the problem for m_n, C_n) independently; however, there is a philosophical disconnect if they return different optimal hyper-parameters, since then our MAP estimate and our moment estimates correspond to different posteriors. In this version we augment the variational distribution with an additional factor q(λ, γ), which must be learned in an additional step after (19). Furthermore, we will constrain the distribution q^t(γ², λ) to the form δ_{(γ²)^t, λ^t}, where the distribution on the parameters is a Dirac at the maximizer, i.e.

((γ²)^{t+1}, λ^{t+1}) = argmax_{(γ²),λ} ∫ log P(Y, β, θ, (γ²), λ) q^t(β) q^t(θ) dβ dθ .   (37)

This is achieved by including an additional M step within the VBEM algorithm at each iteration, where we compute the optimal ((γ²)^{t+1}, (λ²)^{t+1}), similar to what was done in a monolithic context in [30] for γ². The third approach seems somewhat more elegant, but it turns out to be messy, and for that reason is not considered further. However, the practical difficulties may be largely due to our choice of prior on θ, which can of course be adjusted, e.g. as in [56]. Therefore, a detailed description of the third approach is included in Appendix C.1. The first approach is clean but expensive, and the second is clean and cheap. Indeed, the second can also be viewed as an approximation to the third, where we use the variational conditional q^t(β, Y | (λ, γ)) in the integrand instead of the joint distribution.

Note that all three approaches discussed above can be easily incorporated into Algorithm 1 and Algorithm 2 as optional steps that change the values of γ and λ between consecutive iterations of the main algorithm.
3.2.1 Conditional expectations of θ

Before describing the method, it will be useful to recall some basic results relating to P(θ | β, λ) = GIG(θ_j; ν − 1/2, √(δ² + β_j²), λ) (where β_j = µ_{n,j}^t corresponds to EM and β_j² = C_{n,jj}^t + (m_{n,j}^t)² corresponds to VBEM), which can be found for instance in [1]:

E(θ_j | β_j, ν = 1) = √(δ² + β_j²)/λ + 1/λ² ,   E(θ_j | β_j, ν = 0) = √(δ² + β_j²)/λ ,   (38)

E(θ_j^{-1} | β_j, ν = 1) = λ/√(δ² + β_j²) ,   E(θ_j^{-1} | β_j, ν = 0) = λ/√(δ² + β_j²) + 1/(δ² + β_j²) .   (39)

See Appendix C for further details. In the case ν = 0, if one lets λ, δ → 0 then the resulting hyperprior on θ would be an improper Jeffreys prior P(θ) ∝ 1/θ, and the marginal prior on β would also be improper and of the same form, P(β) ∝ 1/β. The equations (50) are no longer valid, as K_{ν+1/2}(0)/K_{ν−1/2}(0) is not defined, but the results can be computed directly using integration by parts. This case has been explored for example in [37, 18, 30]. The case λ = 0 corresponds to an inverse gamma prior and leads to D(β) = 1/(δ² + β_j²), and has been considered in [8, 27]; see also Remark 2.4. This is a relaxation of the "q = 0" pseudo-norm from the perspective of IRLS, as in Remark 2.2, and hence approximates best subset selection.

An interesting extension is obtained by taking δ = λ = 0 and allowing P(θ) ∝ θ^{ν−1}, for ν < 1/2. This is an improper prior, but it has conditional moments given by

E(θ_j | β_j) = −β_j² (2ν + 1)^{-1} ,   E(θ_j^{-1} | β_j) = β_j^{-2} (1 − 2ν) .   (40)

The first calculation is valid only for ν < −1/2. Details are given in (51) in Appendix C. The case δ > 0 can be considered as well, with (β_j² + δ²) replacing β_j² above. For this prior we will abuse notation and redefine λ := 1 − 2ν, for the sake of a uniform presentation of hyper-parameter optimization below.

3.2.2 EM for the hyper-parameters

As mentioned above, the first approach is less elegant than the third, but cleaner/simpler for our model. Assume we run the algorithm of Sec. 2.6 for a fixed value of the hyper-parameters (λ, γ)^τ, resulting in a joint variational distribution

q^{*,τ}(β, Y | (λ, γ)) = P(Y | β, γ) N(β; 0, D(1/θ^{*,τ}(λ))) ,

where q^{*,τ}(β | Y, (λ, γ)^τ) = N(β; m_n^{*,τ}, C_n^{*,τ}) is the variational posterior, and now θ^{*,τ}(λ) encodes all relevant information about (λ, γ)^τ, via m_n^{*,τ}, C_n^{*,τ}, while λ is considered explicitly. In particular, (θ^τ(λ))^{-1} is given in general by (49), or for the particular cases of ν = 0 and ν = 1 by (39). An EM step for the MLE of (λ, γ) is

Q(λ, γ | (λ, γ)^τ) = ∫ log( q^{*,τ}(β, Y | (λ, γ)) ) q^{*,τ}(β | Y, (λ, γ)^τ) dβ ,
(λ, γ)^{τ+1} = argmax_{λ,γ} Q(λ, γ | (λ, γ)^τ) .   (41)

The objective function (to be minimized) for γ is given by

f(γ) := n log γ + (1/(2γ²)) ( s_n − 2 v_n^T m_n^{*,τ} + tr[ A_n ( C_n^{*,τ} + m_n^{*,τ} (m_n^{*,τ})^T ) ] ) ,   (42)

where

A_n := X_n^T X_n ,   v_n := X_n^T Y_n ,   s_n := Y_n^T Y_n .   (43)

This can be optimized independently of λ, giving

(γ^{τ+1})² = (1/n) E_τ |Y_n − X_n β|² = (1/n) ( s_n − 2 v_n^T m_n^{*,τ} + tr[ A_n ( C_n^{*,τ} + m_n^{*,τ} (m_n^{*,τ})^T ) ] ) ,

where the expectation is with respect to q^{*,τ}(β | Y, (λ, γ)^τ). Note that in cases where n < p, one can use the identity tr[A_n (C_n^{*,τ} + m_n^{*,τ}(m_n^{*,τ})^T)] = tr[X_n C_n^{*,τ} X_n^T] + |X_n m_n^{*,τ}|². These computations are easily adapted to the online case described in Section 3.1.

The objective function for λ is equally simple via this formulation (in contrast to the third method, described in Appendix C.1), and is given by

Σ_{j=1}^p ( E_τ(β_j²)/θ_j^τ(λ) − log(1/θ_j^τ(λ)) ) ,

where we recall again that (θ^τ(λ))^{-1} is given in general by (49), or for the particular cases of ν = 0 and ν = 1 by (39). In the case of ν = 1, one has

(λ^{τ+1})^{-1} = (1/p) Σ_{j=1}^p ( C_{n,jj}^{*,τ} + (m_{n,j}^{*,τ})² ) / √( δ² + C_{n,jj}^{*,τ} + (m_{n,j}^{*,τ})² ) .   (44)
While in the case ν = 0, one has

(λ^{τ+1})^{-1} = ( 1 / ( p − Σ_{j=1}^p ( C_{n,jj}^{*,τ} + (m_{n,j}^{*,τ})² ) / ( δ² + C_{n,jj}^{*,τ} + (m_{n,j}^{*,τ})² ) ) ) Σ_{j=1}^p ( C_{n,jj}^{*,τ} + (m_{n,j}^{*,τ})² ) / √( δ² + C_{n,jj}^{*,τ} + (m_{n,j}^{*,τ})² ) .   (45)

Finally, the case of ν ≤ 1/2, λ = 0, δ ≥ 0, renaming λ := 1 − 2ν, leads to

(λ^{τ+1})^{-1} = (1/p) Σ_{j=1}^p ( C_{n,jj}^{*,τ} + (m_{n,j}^{*,τ})² ) / ( δ² + C_{n,jj}^{*,τ} + (m_{n,j}^{*,τ})² ) .   (46)

It is interesting to note that if δ = 0 then λ^{τ+1} ≡ 1, just as in the case of the Jeffreys prior.

Despite less attractive theoretical properties in comparison to the full VBEM, this is a clean and simple approach for optimizing the hyper-parameters. The objective function (41) is convex and analytically soluble (for the cases above). An obvious issue is the nested EM algorithms, which is undesirable. Hence the second option, which is to simply iterate between a single iteration of (41) and a single iteration of VBEM, as described in (36).
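A sketch of the resulting single-step hyper-parameter update ((42)-(44), ν = 1 case), as used in the second approach:

```python
import numpy as np

def hyper_update(A, v, s, n, m, C, delta=1e-6):
    """One outer EM step (41): gamma^2 from the minimizer of (42), lambda
    from (44). Here A = X^T X, v = X^T Y, s = Y^T Y, and (m, C) come from VBEM."""
    gamma2 = (s - 2 * v @ m + np.trace(A @ (C + np.outer(m, m)))) / n
    e2 = np.diag(C) + m**2                       # E[beta_j^2] under q(beta)
    lam = 1.0 / np.mean(e2 / np.sqrt(delta**2 + e2))
    return gamma2, lam
```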
4 Numerical Results

In this section we will explore the approach presented on some simulated and real data.

4.1 Diabetes data
Here we present the VBL model and compare to the Bayesian LASSO (BL) of [56]. We use the simple diabetes data set from [28], with n = 442 and p = 10, which was used in [56]. The fully Bayesian methodology is quite expensive, and yet tractable for this very small problem, which allows us to compare our very cheap variational approach. In turn, the VBL is applicable to problems with several orders of magnitude larger values of n and p, where even the mightiest supercomputers will struggle to achieve the full Bayesian solution. The results are shown in Figure 2.

The estimates of the hyperparameter λ for the VBEM and EM models are obtained by respectively using the second approach from Sec. 3.2 with ν = 1 (44) and the third approach with ν = 1 described in Appendix C.1. The resulting noise estimates are γ̂_VBEM ≈ γ̂_EM ≈ 53, together with corresponding estimates λ̂_VBEM and λ̂_EM. For the Bayesian LASSO we take the optimal value λ_BL selected by maximum marginal likelihood as in [56]. It should also be noted that the estimate λ̂_EM practically coincides with the one chosen by leave-one-out cross-validation.

Another aspect of comparison between the BL and VBL (more precisely, its VBEM part) involves computational cost. Table 1 displays the inference time (averaged over 30 runs) and the root-mean-squared error (obtained by 5-fold cross-validation) for VBEM and BL, each run with fixed hyperparameters λ̂_VBEM and λ̂_BL, respectively. The number of consecutive iterations of the BL's Gibbs sampler is 10,000 (after 1,000 burn-in) as in [56], and the maximum number of iterations of the VBEM is limited to 10. It is worth highlighting that while the RMSEs are very similar (and correspond well with the noise estimates γ̂_VBEM and γ̂_EM), the difference in inference time reaches an impressive 1,000 times speed-up. (This comparison was made on a laptop with an Intel Core i7-7700HQ CPU and 16 GB of RAM.)
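For a hypothetical usage sketch, one can reuse the vbl function sketched in Section 2.6 with scikit-learn's copy of the diabetes data (the standardization and the hyper-parameter values below are illustrative choices of ours, not necessarily those of [56]):

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, Y = load_diabetes(return_X_y=True)        # n = 442, p = 10
X = (X - X.mean(0)) / X.std(0)               # standardize covariates
Y = Y - Y.mean()                             # center the response
mu, m, C = vbl(X, Y, gamma=53.0, lam=0.1)    # hypothetical hyper-parameters
ci = 2 * np.sqrt(np.diag(C))                 # 95% marginal credible half-widths
print(np.c_[mu, m - ci, m + ci])
```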
Table 1: VBEM and BL inference time (ms) and 5-fold cross-validation RMSE, each at its respective fixed λ.
Figure 2: Illustration of the variational Bayesian LASSO (VBL) on the diabetes dataset. The top figure is analogous to Fig. 1 of [56], except it adds VBL in the last 2 columns. The bottom row is analogous to Fig. 2 of [56], except with VBL (and 95% credible interval) added. The right panel is a zoomed-in version of the left panel.

4.2 Image deconvolution

Now we will consider an image deconvolution problem. Suppose that we would like to recover an image β̃ ∈ R^{p̃} comprised of p̃ = p_0² pixels, from observations Y = [y_1, . . . , y_n]^T ∈ R^n with n ≤ p̃. The design matrix is defined as follows, for l = 1, . . . , n:

x̃_l^T β̃ := z_{i_l, j_l} ,   z = F^{-1} D(exp(−ω² |k|²)) F β̃ ,   (47)

where {(i_l, j_l)}_{l=1}^n is a subset of pairs of indices associated to spatial observations of the degraded image/signal β̃, denoted z (both represented as 2-dimensional p_0 × p_0 arrays here), F denotes the Fourier transform (which will be computed with the fast Fourier transform (FFT) [21] at a cost of O(p̃ log p̃)), and k = (k_1, k_2) ∈ {−p_0/2, . . . , p_0/2 − 1}² is the multi-index of wave-numbers associated to the transformed signal. This simply corresponds to convolution in physical space with a Gaussian kernel with variance proportional to ω². The observations are then defined as usual: y = X̃ β̃ + ε, ε ∼ N(0, γ² I_n).

Recall the discussion in Section 2.2. We are now interested not in sparse signals per se, but rather in edge-preservation, or in other words a sparse gradient. For this purpose a popular choice is the (non-isotropic) total variation prior given by Π_{j=1}^{p̃−1} L((D_1 β̃)_j; λ) L((D_2 β̃)_j; λ), where D_1 = F^{-1}(−i k_1) F, D_2 = F^{-1}(−i k_2) F, and the missing degree of freedom corresponds to the constant wavenumber k = (0, 0). This can be constructed as a marginal just like (7), using a pair of Normals N((D_1 β̃)_j; 0, θ_{1j}) and N((D_2 β̃)_j; 0, θ_{2j}) for each j = 1, . . . , p̃ − 1. This change of variables proves to be messy within the VBEM (although it works just fine for EM/TV, even in the spatial domain).

We adopt an alternative approach as follows, which we have found cleaner and more computationally expedient. Note that the TV prior can be alternatively written as a standard LASSO prior on β̄ := D β̃ ∈ R^{p−1}, where p = 2p̃ + 1 and D = (D_1^T, D_2^T)^T ∈ R^{(p−1)×p̃}. This matrix also has the vector 1_{p̃} ∈ R^{p̃} of ones in its kernel. Denote the coefficient of 1_{p̃} as β_0. We can redefine the forward model on β := (β̄^T, β_0)^T as X := X̃ (D^†, 1_{p̃}), where D^† = (D^T D)^{-1} D^T is the left pseudo-inverse of D. Our data for the transformed model is Y = Xβ + ε.

Typically p̃ will be large, for example p̃ = 256², or even larger, which precludes O(p³) calculations. In terms of computation, when n is small, then X^T can be computed explicitly with n FFTs, which allows explicit computation of K := C_0 X^T (X C_0 X^T + γ² I_n)^{-1} for small enough n. This allows computation of (2). In order to compute the diagonal of C_n in (3) we observe that the second term can be written as (K ◦ (C_0 X^T)) 1_n, where ◦ denotes the element-wise product of two matrices. We conduct several experiments. First we consider a toy model with n = p̃ = p_0² and p_0 = 28, with strong blurring and small noise. With a large λ we obtain very impressive reconstructions (see Figure 3). It is notable that the uncertainty is significantly underestimated, although relatively correct (in the sense that it is large and small where the error is). This is due to the fact that λ has been chosen too large, which, on the other hand, provides the impressive reconstruction of the edges.
Notice here the top right-hand plot, which shows the (relative L²) error as a function of iteration: its minimum may yield a much smaller error in comparison to the value at convergence, in particular for the MAP estimator (TV-EM). The plot also shows the data misfit, where we can observe the classical "L-curve" and see that an appropriate stopping criterion can be derived from convergence of the misfit (in the more realistic scenario where we do not know the error).
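For reference, a sketch of the forward blur (47) via the FFT (assuming the exponent is −ω²|k|², consistent with a Gaussian kernel; names are ours):

```python
import numpy as np

def gaussian_blur(beta, omega):
    """z = F^{-1} D(exp(-omega^2 |k|^2)) F beta for a p0 x p0 image beta;
    two FFTs, cost O(p~ log p~)."""
    p0 = beta.shape[0]
    k = np.fft.fftfreq(p0, d=1.0 / p0)               # wavenumbers -p0/2..p0/2-1
    k1, k2 = np.meshgrid(k, k, indexing="ij")
    mult = np.exp(-omega**2 * (k1**2 + k2**2))       # Gaussian kernel in Fourier
    return np.real(np.fft.ifft2(mult * np.fft.fft2(beta)))
```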
Figure 3: VBL TV-denoising illustrated on a simple toy image. The top row features (from left to right) the truth, blurry observations, the Tikhonov-regularized (least squares) solution, and a plot of the data misfit and reconstruction error over the EM/VBEM iterations for the various methods. The bottom row features (from left to right) the TV-regularized MAP estimator at convergence µ_n, the TV-regularized MAP estimator at the minimum of the error (around iteration 100 – see top right), the VBL mean m_n, and the VBL standard deviation.

Now we will move to a higher-dimensional example, with p_0 = 256. Observe that the case n = p̃ cannot be handled directly, which provides a testing ground for our sequential method. However, it will be useful to have a ground truth, which is possible for appropriate choices of parameters. The calculation is provided in Appendix E. The first set of experiments in Figure 4 shows the reconstruction corresponding to full observations, and illustrates that VBL is capable of achieving edge sparsity as well as UQ.

The next set of experiments is intended to illustrate two things. First, in the case where the monolithic problem can be solved, as above, the sequential version does a good job of getting close to the full monolithic solution. For the first choice of ω = γ and λ = 1, we recover a ground truth with ñ = 1514 (see Appendix E). The relative L² errors of the monolithic solution, of a single iteration of M = 1490 observations with the recursive Algorithm 2, and of the completed recursive iterations are all close. The results are shown in Figure 5. Second, in the case of smaller values of ω = γ, where the monolithic problem cannot be solved and one must settle for either sparse observations or a truncation of X above the desired threshold described in Appendix E, the sequential version does significantly better than the monolithic approximation. In this case, the desired threshold with ρ = 1 would require an ñ which is not feasible.
Figure 4: VBL TV-denoising illustrated on a high-resolution image of the Shepp-Logan phantom. The top row features (from left to right) the truth, blurry observations (every pixel), the error w.r.t. the VBL mean, and UQ in the form of the standard deviation of the gradient norm of VBL (i.e. ( E(β_{1:p̃} − E β_{1:p̃})² + E(β_{p̃+1:2p̃} − E β_{p̃+1:2p̃})² )^{1/2} – see the text). The second row features (from left to right) the VBL mean (E β̃), the x-gradient mean (E β_{1:p̃}), the y-gradient mean (E β_{p̃+1:2p̃}), and the norm of the gradient mean ( ((E β_{1:p̃})² + (E β_{p̃+1:2p̃})²)^{1/2} ). The next two rows correspond to the same quantities for the TV MAP and the Tikhonov-regularized problem.
Figure 5: VBL TV-denoising (top row, from left to right: observations, mean, gradient norm mean, and gradient norm standard deviation) in comparison to the recursive version as in Algorithm 2 (bottom row, from left to right: mean after 1 iteration, mean, gradient norm mean, and gradient norm standard deviation).
Figure 6: Truncated VBL TV-denoising (top row, from left to right: truncated observations, mean, gradient norm mean, and gradient norm standard deviation) in comparison to the recursive version as in Algorithm 2 (bottom row, from left to right: observations, mean, gradient norm mean, and gradient norm standard deviation).

We use the coarse approximation with ñ = 1640 dominant modes; the relative L² error in the truncated observations is then much larger than γ. For the recursive implementation we let M = 1640, and achieve a smaller reconstruction error. The results are shown in Figure 6. We notice in this case that, despite the fact that the reconstruction error is better and the edges are more crisp, there is some strange radiation/noise in the recursive reconstruction. It is a topic of further investigation to understand this better (and remove it). The method is able to handle a larger problem with p = 10⁶, and we were able to assimilate n = 500 batches of size M = 200 overnight, reducing the relative error substantially from its value after a single pair of batches and yielding a reasonable-looking reconstruction. These results are not shown.
Remark 4.1 Note that we impose a sparsity constraint on β_0 as well, which is slightly different from TV. This constraint could easily be removed, but our aim is not to belabour the finer points of TV-denoising but rather to illustrate our method on this example. Also note that this can all be easily adjusted for non-LASSO cases, which would be expected to more strongly enforce sparsity of edges.

Acknowledgements.
KJHL and VZ were supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. KJHL and VZ were also supported in part by the U.S. Department of Energy, Office of Science, Office of Fusion Energy Sciences and Office of Advanced Scientific Computing Research, through the Scientific Discovery through Advanced Computing (SciDAC) project on Advanced Tokamak Modeling, under a contract with Oak Ridge National Laboratory.
References

[1] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, vol. 55, US Government Printing Office, 1948.
[2] D. F. Andrews and C. L. Mallows, Scale mixtures of normal distributions, Journal of the Royal Statistical Society: Series B (Methodological), 36 (1974), pp. 99–102.
[3] A. Armagan, Variational bridge regression, in Artificial Intelligence and Statistics, PMLR, 2009, pp. 17–24.
[4] H. Attias, A variational Bayesian framework for graphical models, in Advances in Neural Information Processing Systems, 2000, pp. 209–215.
[5] R. Bai, V. Rockova, and E. I. George, Spike-and-slab meets lasso: A review of the spike-and-slab lasso, arXiv preprint arXiv:2010.06451, (2020).
[6] M. J. Beal and Z. Ghahramani, The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures, Bayesian Statistics, 7 (2003), p. 210.
[7] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[8] C. M. Bishop and M. E. Tipping, Variational relevance vector machines, in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000, pp. 46–53.
[9] N. Biswas, A. Bhattacharya, P. E. Jacob, and J. E. Johndrow, Coupled Markov chain Monte Carlo for high-dimensional regression with half-t priors, arXiv preprint arXiv:2012.04798, (2020).
[10] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association, 112 (2017), pp. 859–877.
[11] G. E. Box and G. C. Tiao, Bayesian Inference in Statistical Analysis, vol. 40, John Wiley & Sons, 2011.
[12] S. Boyd, N. Parikh, and E. Chu, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, Now Publishers Inc, 2011.
[13] K. Bredies and D. A. Lorenz, Linear convergence of iterative soft-thresholding, Journal of Fourier Analysis and Applications, 14 (2008), pp. 813–837.
[14] D. S. Broomhead and D. Lowe, Radial basis functions, multi-variable functional interpolation and adaptive networks, tech. rep., Royal Signals and Radar Establishment Malvern (United Kingdom), 1988.
[15] G. Burgers, P. Jan van Leeuwen, and G. Evensen, Analysis scheme in the ensemble Kalman filter, Monthly Weather Review, 126 (1998), pp. 1719–1724.
[16] D. Calvetti, E. Somersalo, and A. Strang, Hierarchical Bayesian models and sparsity: ℓ2-magic, Inverse Problems, 35 (2019), p. 035003.
[17] F. Caron, L. Bornn, and A. Doucet, Sparsity-promoting Bayesian dynamic linear models, (2012).
[18] F. Caron and A. Doucet, Sparse Bayesian nonparametric regression, in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 88–95.
[19] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM Review, 43 (2001), pp. 129–159.
[20] A. Chkifa, A. Cohen, G. Migliorati, F. Nobile, and R. Tempone, Discrete least squares polynomial approximation with random evaluations - application to parametric and stochastic elliptic PDEs, ESAIM: Mathematical Modelling and Numerical Analysis, 49 (2015), pp. 815–837.
[21] J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, 19 (1965), pp. 297–301.
[22] I. Daubechies, M. Defrise, and C. De Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, Communications on Pure and Applied Mathematics, 57 (2004), pp. 1413–1457.
[23] I. Daubechies, R. DeVore, M. Fornasier, and C. S. Güntürk, Iteratively reweighted least squares minimization for sparse recovery, Communications on Pure and Applied Mathematics, 63 (2010), pp. 1–38.
[24] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society: Series B (Methodological), 39 (1977), pp. 1–22.
[25] D. L. Donoho, Compressed sensing, IEEE Transactions on Information Theory, 52 (2006), pp. 1289–1306.
[26] A. Doucet, N. de Freitas, and N. Gordon, Sequential Monte Carlo Methods in Practice, Springer, 2001.
[27] J. Drugowitsch, Variational Bayesian inference for linear and logistic regression, arXiv preprint arXiv:1310.5438, (2013).
[28] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Least angle regression, The Annals of Statistics, 32 (2004), pp. 407–499.
[29] G. Evensen, Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics, Journal of Geophysical Research: Oceans, 99 (1994), pp. 10143–10162.
[30] M. A. Figueiredo, Adaptive sparseness for supervised learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, 25 (2003), pp. 1150–1159.
[31] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximation, Computers & Mathematics with Applications, 2 (1976), pp. 17–40.
[32] S. Geman, E. Bienenstock, and R. Doursat, Neural networks and the bias/variance dilemma, Neural Computation, 4 (1992), pp. 1–58.
[33] T. P. Georgi, Fourier Series, 1976.
[34] R. Ghanem, D. Higdon, and H. Owhadi, Handbook of Uncertainty Quantification, vol. 6, Springer, 2017.
[35] R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik, SGD: General analysis and improved rates, in International Conference on Machine Learning, PMLR, 2019, pp. 5200–5209.
[36] P. J. Green, K. Łatuszyński, M. Pereyra, and C. P. Robert, Bayesian computation: a summary of the current state, and samples backwards and forwards, Statistics and Computing, 25 (2015), pp. 835–862.
[37] J. E. Griffin and P. J. Brown, Bayesian hyper-lassos with non-convex penalization, Australian & New Zealand Journal of Statistics, 53 (2011), pp. 423–442.
[38] L. Guo, A. Narayan, and T. Zhou, Constructing least-squares polynomial approximations, SIAM Review, 62 (2020), pp. 483–508.
[39] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, vol. 49, NBS Washington, DC, 1952.
[40] A. E. Hoerl and R. W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics, 12 (1970), pp. 55–67.
[41] P. W. Holland and R. E. Welsch, Robust regression using iteratively reweighted least-squares, Communications in Statistics - Theory and Methods, 6 (1977), pp. 813–827.
[42] D. R. Hunter and K. Lange, A tutorial on MM algorithms, The American Statistician, 58 (2004), pp. 30–37.
[43] S. Kaczmarz, Angenäherte Auflösung von Systemen linearer Gleichungen, Bull. Int. Acad. Pol. Sci. Lett., Cl. Sci. Math. Nat., (1937), pp. 355–357.
[44] R. E. Kalman, A new approach to linear filtering and prediction problems, Journal of Basic Engineering, 82 (1960), pp. 35–45.
[45] A. Khalili and J. Chen, Variable selection in finite mixture of regression models, Journal of the American Statistical Association, 102 (2007), pp. 1025–1038.
[46] H. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35, Springer Science & Business Media, 2003.
[47] M. Lassas and S. Siltanen, Can one use total variation prior for edge-preserving Bayesian inversion?, Inverse Problems, 20 (2004), p. 1537.
[48] K. Law, A. Stuart, and K. Zygalakis, Data Assimilation, Cham, Switzerland: Springer, (2015).
[49]
F. Lucka , Fast Markov chain Monte Carlo sampling for sparse Bayesianinference in high-dimensional inverse problems using L1-type priors , In-verse Problems, 28 (2012), p. 125012.[50]
M. Markkanen, L. Roininen, J. M. Huttunen, and S. Lasanen , Cauchy difference priors for edge-preserving bayesian inversion , Journal ofInverse and Ill-posed Problems, 27 (2019), pp. 225–240.[51]
T. J. Mitchell and J. J. Beauchamp , Bayesian variable selection inlinear regression , Journal of the american statistical association, 83 (1988),pp. 1023–1032.[52]
P. Moral , Feynman-Kac formulae: Genealogical and interacting particlesystems with applications, Probability and its applications , Springer, NewYork, 2004.[53]
P. Müller and F. A. Quintana , Nonparametric Bayesian data analysis ,Statistical science, (2004), pp. 95–110.[54]
R. M. Neal , Bayesian learning for neural networks , vol. 118, SpringerScience & Business Media, 1996.[55]
J. Nocedal and S. Wright , Numerical optimization , Springer Science& Business Media, 2006.[56]
T. Park and G. Casella , The Bayesian LASSO , Journal of the Ameri-can Statistical Association, 103 (2008), pp. 681–686.[57]
M. Pereyra , Proximal Markov chain Monte Carlo algorithms , Statisticsand Computing, 26 (2016), pp. 745–760.[58] ,
Maximum-a-posteriori estimation with Bayesian confidence regions ,SIAM Journal on Imaging Sciences, 10 (2017), pp. 285–302.[59]
A. Rahimi, B. Recht, et al. , Random features for large-scale kernelmachines. , in NIPS, vol. 3, Citeseer, 2007, p. 5.[60]
C. Robert and G. Casella , Monte Carlo statistical methods , SpringerScience & Business Media, 2013.[61]
V. Roy, S. Chakraborty, et al. , Selection of tuning parameters, solu-tion paths and standard errors for Bayesian lassos , Bayesian Analysis, 12(2017), pp. 753–778.[62]
Y. Saad , Krylov subspace methods for solving large unsymmetric linearsystems , Mathematics of computation, 37 (1981), pp. 105–126.2863]
D. Sejdinović, C. Andrieu, and R. Piechocki , Bayesian sequentialcompressed sensing in sparse dynamical systems , in 2010 48th Annual Aller-ton Conference on Communication, Control, and Computing (Allerton),IEEE, 2010, pp. 1730–1736.[64]
R. C. Smith , Uncertainty quantification: theory, implementation, and ap-plications , vol. 12, Siam, 2013.[65]
T. Strohmer and R. Vershynin , A randomized kaczmarz algorithm withexponential convergence , Journal of Fourier Analysis and Applications, 15(2009), pp. 262–278.[66]
D. Strong and T. Chan , Edge-preserving and scale-dependent propertiesof total variation regularization , Inverse problems, 19 (2003), p. S165.[67]
A. M. Stuart , Inverse problems: a Bayesian perspective , Acta numerica,19 (2010), pp. 451–559.[68]
R. Tibshirani , Regression shrinkage and selection via the lasso , Jour-nal of the Royal Statistical Society: Series B (Methodological), 58 (1996),pp. 267–288.[69]
A. N. Tikhonov , On the solution of ill-posed problems and the method ofregularization , in Doklady Akademii Nauk, vol. 151, Russian Academy ofSciences, 1963, pp. 501–504.[70]
M. E. Tipping et al. , The relevance vector machine. , in NIPS, vol. 12,1999.[71]
C. R. Vogel and M. E. Oman , Iterative methods for total variationdenoising , SIAM Journal on Scientific Computing, 17 (1996), pp. 227–238.[72]
Z. Wang, J. M. Bardsley, A. Solonen, T. Cui, and Y. M. Mar-zouk , Bayesian inverse problems with l priors: a randomize-then-optimizeapproach , SIAM Journal on Scientific Computing, 39 (2017), pp. S140–S166.[73] A. Winkelbauer , Moments and absolute moments of the normal distri-bution , arXiv preprint arXiv:1209.4340, (2012).[74]
I. E.-H. Yen, T.-W. Lin, S.-D. Lin, P. K. Ravikumar, and I. S.Dhillon , Sparse random feature algorithm as coordinate descent in hilbertspace , in Advances in Neural Information Processing Systems, Citeseer,2014, pp. 2456–2464.[75]
H. Zhu, G. Leus, and G. B. Giannakis , Sparsity-cognizant total least-squares for perturbed compressive sampling , IEEE Transactions on SignalProcessing, 59 (2011), pp. 2002–2016.29
A Derivations related to EM and VBEM
Following from (11), note that (i) $Q(\beta|\beta^t) \le \log P(Y,\beta)$ and (ii) $Q(\beta^t|\beta^t) = \log P(Y,\beta^t)$. Therefore

$$\begin{aligned}
\log P(Y, \beta^{t+1}) &= Q(\beta^{t+1}|\beta^t) + \log P(Y, \beta^{t+1}) - Q(\beta^{t+1}|\beta^t)\\
&\ge Q(\beta^{t+1}|\beta^t)\\
&= Q(\beta^{t+1}|\beta^t) - Q(\beta^t|\beta^t) + \log P(Y, \beta^t)\\
&\ge \log P(Y, \beta^t).
\end{aligned}$$

The first inequality arises from property (i), the equality comes from property (ii), and the final inequality is due to the optimality of $\beta^{t+1}$. This shows that the EM algorithm produces a non-decreasing sequence of values $\log P(Y, \beta^t)$. It is a particular case of what have come to be known as majorization-minimization algorithms [42].

The full calculation of (12) is given by

$$\begin{aligned}
Q(\beta|\beta^t) &:= -\int \log\big(P(\beta, Y_n \,|\, \theta, X_n)\big)\, P(\theta \,|\, \beta^t, X_n, Y_n)\, d\theta\\
&= -\int \big(\log P(\beta|\theta) + \log P(Y_n|\beta, X_n)\big)\, P(\theta \,|\, \beta^t, X_n, Y_n)\, d\theta\\
&= \frac{1}{2}\sum_j \Big( \beta_j^2\, \mathbb{E}\big[\theta_j^{-1} \,\big|\, \beta^t, X_n, Y_n\big] + \mathbb{E}\big[\log(\theta_j) \,\big|\, \beta^t, X_n, Y_n\big] \Big) + \frac{\gamma}{2}\,|Y_n - X_n\beta|^2 + K(\beta^t, X_n, Y_n)\\
&= \frac{1}{2}\beta^T D(1/\theta^t)\,\beta + \frac{\gamma}{2}\,|Y_n - X_n\beta|^2 + \widetilde{K}(\beta^t, X_n, Y_n),
\end{aligned}$$

where $1/\theta^t_j := \mathbb{E}(\theta_j^{-1} \,|\, \beta^t, X_n, Y_n)$ and the $\beta$-independent terms have been absorbed into $\widetilde{K}$.
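For concreteness, minimizing the quadratic $Q(\beta|\beta^t)$ above over $\beta$ amounts to solving the ridge-type normal equations $\big(D(1/\theta^t) + \gamma X_n^T X_n\big)\beta^{t+1} = \gamma X_n^T Y_n$, i.e. each EM step is a reweighted least squares solve. A minimal sketch of one such step follows (Python/NumPy with dense linear algebra; the name em_step and its interface are illustrative, not the paper's code):

```python
import numpy as np

def em_step(X, Y, theta, gamma):
    """One EM step for the MAP estimator: minimize
    0.5 * beta^T D(1/theta) beta + 0.5 * gamma * |Y - X beta|^2,
    whose unique minimizer solves a ridge-type linear system."""
    A = gamma * (X.T @ X) + np.diag(1.0 / theta)  # D(1/theta) + gamma X^T X
    b = gamma * (X.T @ Y)
    return np.linalg.solve(A, b)
```

In practice this solve alternates with the mixing-moment update of $\theta^t$; for large $p$ the direct solve can be replaced by an iterative method such as conjugate gradient.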
B Discussion of other iterative methods for MAP estimation

It is worth briefly discussing other standard iterative optimization algorithms for solving (4). In particular, gradient descent and quasi-Newton methods are very promising alternatives in the case where the design matrix $X_n$ is sparse. This is not the primary context of interest in the present work, so the general case is discussed. Gradient descent methods achieve linear convergence, which means that in terms of iterations the complexity is logarithmic in the desired accuracy [55]. Computation of the gradient of (4) incurs a cost of $O(np)$, and if one uses a conjugate gradient approach [55, 39] (ensuring that each successive search direction is orthogonal to all previous ones) then the number of iterations required for convergence to the exact solution is bounded above by $p$, i.e. the memory and computational complexity is no worse than the monolithic approach. Stochastic gradient descent alleviates the $n$-dependence per iteration by using an unbiased estimate of the gradient, i.e. a batch of $b = O(1)$ data points is used at each iteration with a cost of $O(p)$. Under appropriate assumptions this approach converges [46, 35], but there are no tight theoretical complexity bounds. In the machine learning literature one often refers to epochs, or sweeps through the full data set, so one can expect a complexity of at least $O(np)$. An alternative in a similar spirit is the (randomized) Kaczmarz algorithm [43, 65], which enjoys an impressive per-iteration cost of $O(p)$ but would also typically require $O(n)$ iterations until convergence; see the sketch after this paragraph. The latter may be improved with an $O(np)$ pre-processing step which replaces a uniform distribution on the data with one scaled by the row norms of $X_n$. Of course, none of these methods provides an uncertainty estimate. Quasi-Newton methods, such as BFGS, provide an approximation of the covariance $C_n$ from equations (3), (56), as well as super-linear convergence. In this case one has a per-iteration complexity cost of $O(p(n+p))$ and a memory requirement of $O((n+2k)p)$ for $k$ iterations (a rank-2 update to the approximation of the Hessian and its inverse is performed at each iteration). The limited-memory alternative limits $k \le k_{\max}$. One expects the method to converge very rapidly, with $k = O(1)$, so it can still be competitive. A one-off cost of $O(np^2)$ to compute $X_n^T X_n$ can reduce the $n$-dependence of either method to a $p$-dependence. Furthermore, the computation of $X_n^T X_n$ can be split into $n/m$ batches of size $m$ to be computed in parallel, yielding $O(mp)$ memory and $O(p^2 m)$ computation cost for each, for the price of an additional $O(p^2 n/m)$ cost to combine at the end.
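To make the Kaczmarz option concrete, the following is a minimal sketch of the randomized variant of [65] applied to the least-squares part of the problem (the sparsity-inducing reweighting is omitted, and all names are illustrative):

```python
import numpy as np

def randomized_kaczmarz(X, Y, n_iter=10_000, rng=None):
    """Randomized Kaczmarz [65]: at each iteration, project the iterate
    onto the solution set of one equation x_i^T beta = y_i, sampling rows
    with probability proportional to their squared norms."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    row_norms2 = np.einsum('ij,ij->i', X, X)   # O(np) pre-processing
    probs = row_norms2 / row_norms2.sum()
    beta = np.zeros(p)
    for _ in range(n_iter):
        i = rng.choice(n, p=probs)
        beta += (Y[i] - X[i] @ beta) / row_norms2[i] * X[i]  # O(p) per step
    return beta
```

The one-off computation of the row norms is exactly the $O(np)$ pre-processing step mentioned above.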
C Hyper-parameter optimization

The general calculations for GIG distributions, which lead to (38) and (39), are given by

$$\mathbb{E}(\theta_j) = \frac{\sqrt{\delta + \beta_j^2}}{\lambda}\, \frac{K_{\nu+1/2}\big(\lambda\sqrt{\delta + \beta_j^2}\big)}{K_{\nu-1/2}\big(\lambda\sqrt{\delta + \beta_j^2}\big)}, \qquad (48)$$

$$\mathbb{E}(\theta_j^{-1}) = \frac{\lambda}{\sqrt{\delta + \beta_j^2}}\, \frac{K_{\nu+1/2}\big(\lambda\sqrt{\delta + \beta_j^2}\big)}{K_{\nu-1/2}\big(\lambda\sqrt{\delta + \beta_j^2}\big)} - \frac{2\nu - 1}{\delta + \beta_j^2}, \qquad (49)$$

where $K_\alpha(z)$ denotes the modified Bessel function of the second kind. Note $\theta^{t+1}_{n,j} := 1/\mathbb{E}^t(\theta^{-1}_{n,j})$ as in (21) and (13). Important special cases with analytically tractable expressions include $\nu = 1$ and $\nu = 0$, in which case

$$\frac{K_{3/2}(z)}{K_{1/2}(z)} = \frac{z+1}{z}, \qquad \frac{K_{1/2}(z)}{K_{-1/2}(z)} = 1. \qquad (50)$$

Evaluating the expressions above at $z = \lambda\sqrt{\delta + \beta_j^2}$ gives the special cases described in equations (38) and (39).

We now compute the relevant integrals for $P(\theta) \propto \theta^{\nu-1}$, with $\nu < 1/2$, as shown in (40). We have

$$\mathbb{E}(\theta_j^{-1} \,|\, \beta_j) = \frac{\int_0^\infty \theta_j^{\nu - 5/2} \exp\big(-\beta_j^2/(2\theta_j)\big)\, d\theta_j}{\int_0^\infty \theta_j^{\nu - 3/2} \exp\big(-\beta_j^2/(2\theta_j)\big)\, d\theta_j} = \frac{\int_{-\infty}^\infty u^{2(1-\nu)} \exp(-\beta_j^2 u^2/2)\, du}{\int_{-\infty}^\infty u^{-2\nu} \exp(-\beta_j^2 u^2/2)\, du} = |\beta_j|^{-2}(1 - 2\nu). \qquad (51)$$

The last equation follows from equation (18) of [73]. A similar calculation can be done for $\mathbb{E}(\theta_j)$, which is valid only for $\nu < -1/2$. So now $\nu$ takes the role of $\lambda$. It is easy to see how $\delta > 0$ fits into this calculation, by replacing $|\beta_j|$ with $\sqrt{\beta_j^2 + \delta}$.
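Numerically, (48) and (49) can be evaluated with standard Bessel routines; the following sketch assumes SciPy (the exponentially scaled function kve is used since the scaling cancels in the ratio, which avoids underflow for large arguments; the name gig_moments is illustrative):

```python
import numpy as np
from scipy.special import kve  # kve(a, z) = K_a(z) * exp(z)

def gig_moments(beta, lam, delta, nu):
    """Posterior moments (48)-(49) of the mixing variables theta_j.
    The exp(z) factors of kve cancel in the Bessel ratio."""
    s = np.sqrt(delta + beta**2)
    z = lam * s
    ratio = kve(nu + 0.5, z) / kve(nu - 0.5, z)
    E_theta = (s / lam) * ratio                         # equation (48)
    E_theta_inv = (lam / s) * ratio - (2*nu - 1) / s**2 # equation (49)
    return E_theta, E_theta_inv
```

For $\nu = 1$ the second output reduces, via (50), to $\mathbb{E}(\theta_j^{-1}) = \lambda/\sqrt{\delta + \beta_j^2}$, recovering the familiar Lasso-type reweighting.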
C.1 Third approach

Here we consider expanding the VBEM approach and constraining the hyper-parameter distribution to be a Dirac at the maximizer, which results in an EM-extended version of VBEM. For the remainder of the section we assume VBEM is used, with the obvious adjustment for EM.

The objective function (37) has the form (suppressing $X_n$)

$$\begin{aligned}
Q(\gamma, \lambda \,|\, (\gamma)^t, \lambda^t) &:= -\int \log\big(P(\beta, \gamma, \lambda, Y_n, \theta)\big)\, q^t(\theta)\, q^t(\beta)\, d\theta\, d\beta\\
&= -\int \Big( \log P(\beta|\theta) + \log P(\theta|\lambda) + \log P(\lambda) + \log P(Y_n \,|\, \beta, X_n, \gamma) + \log P(\gamma) \Big)\, q^t(\theta)\, q^t(\beta)\, d\theta\, d\beta\\
&= -p\nu \log(\lambda) + p \log\big(K_\nu(\lambda\delta)\big) + \sum_{j=1}^p \frac{\lambda^2}{2}\, \mathbb{E}^t[\theta_j] + \frac{\gamma}{2}\Big( s_n - 2 v_n^T \mathbb{E}^t(\beta) + \mathbb{E}^t(\beta^T A_n \beta) \Big) + K\big(\beta^t, (\gamma)^t, \lambda^t, X_n, Y_n\big)\\
&= -p\nu \log(\lambda) + p \log\big(K_\nu(\lambda\delta)\big) + \sum_{j=1}^p \frac{\lambda^2}{2}\, \mathbb{E}^t[\theta_j] + f(\gamma), \qquad (52)
\end{aligned}$$

where $f(\gamma)$ is defined in (42) and can be optimized independently.

For $\nu = 1$ we have

$$Q(\gamma, \lambda \,|\, \beta^t, (\gamma)^t, \lambda^t) = -p\log(\lambda) + p\log\big(K_1(\lambda\delta)\big) + \frac{1}{2}\sum_{j=1}^p \lambda^2 \left( \frac{\sqrt{\delta + C^t_{n,jj} + (m^t_{n,j})^2}}{\lambda^t} + \frac{1}{(\lambda^t)^2} \right) + K(\beta^t, X_n, Y_n).$$

Define $D^t_n = D\Big( \lambda^t_n \big/ \sqrt{\delta + \mathrm{diag}(C^t_n) + (m^t_n)^2} \Big)$ and

$$E^t_n = \sum_{j=1}^p \Big( \lambda^t_n \sqrt{\delta + C^t_{n,jj} + (m^t_{n,j})^2} + 1 \Big) \big/ (\lambda^t_n)^2 = \mathrm{tr}\big((D^t_n)^{-1}\big) + p\,(\lambda^t_n)^{-2}.$$

Then, up to $\lambda$-independent constants and higher-order terms in $\delta$,

$$\lambda^{t+1}_n = \operatorname*{argmin}_\lambda \Big\{ p\log\big(K_1(\lambda\delta)\big) - p\log(\lambda) + \frac{E^t_n}{2}\lambda^2 \Big\} = \operatorname*{argmin}_\lambda \Big\{ \frac{E^t_n}{2}\lambda^2 - 2p\log(\lambda) \Big\} \approx \sqrt{\frac{2p}{E^t_n}} = \lambda^t_n \sqrt{\frac{2p}{(\lambda^t_n)^2\, \mathrm{tr}\big((D^t_n)^{-1}\big) + p}}. \qquad (53)$$

Digressing for a moment, this tends to $\sqrt{2}\,\lambda^t_n$ or $0$ as $\mathrm{tr}((D^t_n)^{-1})/p$ tends to $0$ or $\infty$; when $\mathrm{tr}((D^t_n)^{-1})/p \gg 1$ we have $(\lambda^{t+1}_n)^2 \approx 2p \big/ \mathrm{tr}\big((D^t_n)^{-1}\big)$.

Now we consider the case $\nu = 0$. We have

$$Q(\gamma, \lambda \,|\, \beta^t, (\gamma)^t, \lambda^t) = p\log\big(K_0(\lambda\delta)\big) + \frac{1}{2}\sum_{j=1}^p \lambda^2\, \frac{\sqrt{\delta + C^t_{n,jj} + (m^t_{n,j})^2}}{\lambda^t} + f(\gamma).$$

Interestingly, in this case we redefine the precision as an "elastic net" between $q = 0$ and $q = 1$, i.e.

$$D^t_n = D\left( \frac{\lambda^t}{\sqrt{\delta + \mathrm{diag}(C^t_n) + (m^t_n)^2}} + \frac{1}{\delta + \mathrm{diag}(C^t_n) + (m^t_n)^2} \right).$$

Let $F^t_n = \sum_{j=1}^p \sqrt{\delta + C^t_{n,jj} + (m^t_{n,j})^2}\,\big/\,\lambda^t$. We then have

$$\lambda^{t+1}_n = \operatorname*{argmin}_\lambda \Big\{ p\log\big(K_0(\lambda\delta)\big) + \frac{F^t_n}{2}\lambda^2 \Big\} \approx \frac{2}{\delta} \exp\left( -\Gamma + \frac{1}{2} W\!\left[ -\frac{p\,\delta^2 e^{2\Gamma}}{2 F^t_n} \right] \right), \qquad (54)$$

where $\Gamma \approx 0.5772$ is the Euler-Mascheroni constant and $W$ is the product log function (the solution of $z = w e^w$), i.e. this is the solution of the equation

$$\lambda^2 F^t_n \big( \log(\delta\lambda) + \Gamma - \log(2) \big) + p = 0.$$
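The two updates above admit short closed-form(ish) implementations; the following sketch follows the reconstructed expressions (53) and (54) as written here, so the constants should be checked against the paper's own derivation (SciPy provides the Lambert $W$ function; all names are illustrative):

```python
import numpy as np
from scipy.special import lambertw

GAMMA_EM = 0.57721566490153286  # Euler-Mascheroni constant

def lambda_update_nu1(lam_t, C_diag, m, delta):
    """lambda update (53) for nu = 1."""
    p = m.shape[0]
    s = np.sqrt(delta + C_diag + m**2)
    E_t = np.sum(lam_t * s + 1.0) / lam_t**2   # tr((D_n^t)^{-1}) + p/lam_t^2
    return np.sqrt(2.0 * p / E_t)

def lambda_update_nu0(lam_t, C_diag, m, delta):
    """lambda update (54) for nu = 0, via the Lambert W function."""
    p = m.shape[0]
    F_t = np.sum(np.sqrt(delta + C_diag + m**2)) / lam_t
    w = lambertw(-p * delta**2 * np.exp(2.0 * GAMMA_EM) / (2.0 * F_t))
    return (2.0 / delta) * np.exp(-GAMMA_EM + 0.5 * w.real)
```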
D Ensemble Kalman filter formulation

In an online context, the Kalman filter provides the recursive equations below, analogous to (2) and (3), for either the covariance or the precision:

$$m_n = \Big( \gamma\, x_n x_n^T + C_{n-1}^{-1} \Big)^{-1} \Big( \gamma\, x_n y_n + C_{n-1}^{-1} m_{n-1} \Big) = m_{n-1} + C_{n-1} x_n \big( \gamma^{-1} + x_n^T C_{n-1} x_n \big)^{-1} \big( y_n - x_n^T m_{n-1} \big), \qquad (55)$$

$$C_n = \Big( \gamma\, x_n x_n^T + C_{n-1}^{-1} \Big)^{-1} = C_{n-1} - C_{n-1} x_n \big( \gamma^{-1} + x_n^T C_{n-1} x_n \big)^{-1} x_n^T C_{n-1}. \qquad (56)$$

We have the following incremental update formula for (23), which incurs a cost of $O(p)$:

$$\big( C^\theta_n \big)^{-1} = \big( C^{\theta'}_n \big)^{-1} - D(1/\theta') + D(1/\theta). \qquad (57)$$

Unfortunately, the solution of (22) requires inversion of a $p \times p$ matrix to compute (55), at a premium cost of $O(p^3)$ (for exact solution and in the absence of sparsity). In this context it is natural to consider the ensemble Kalman filter (EnKF) as a low-rank and cost-efficient alternative. The EnKF was introduced in [29, 15] and has since exploded in popularity, largely due to its remarkable success in providing an efficient approximation to the Kalman filter in very high dimensional geophysical applications. Many versions of the EnKF exist, but in this case the version which makes the most sense is the deterministic, or square root, EnKF [48]. The method is initialized with an ensemble $\beta_0^{(1)}, \dots, \beta_0^{(K)} \sim N(m_0, C_0)$, and then the Kalman filter equations (55) are replaced with the following, for $n \ge 1$:

$$m_{n-1} = \frac{1}{K} \sum_{i=1}^K \beta_{n-1}^{(i)}, \qquad C_{n-1} = \frac{1}{K} \sum_{i=1}^K \big( \beta_{n-1}^{(i)} - m_{n-1} \big)\big( \beta_{n-1}^{(i)} - m_{n-1} \big)^T, \qquad (58)$$

$$\begin{aligned}
\widehat{m}_n &= m_{n-1} + C_{n-1} x_n \big( \gamma^{-1} + x_n^T C_{n-1} x_n \big)^{-1} \big( y_n - x_n^T m_{n-1} \big),\\
\widehat{C}_n &= \Big( I_p - C_{n-1} x_n \big( \gamma^{-1} + x_n^T C_{n-1} x_n \big)^{-1} x_n^T \Big) C_{n-1},\\
\beta_n^{(i)} &\sim N\big( \beta;\, \widehat{m}_n, \widehat{C}_n \big), \quad i = 1, \dots, K. \qquad (59)
\end{aligned}$$

The most common regime of application is $K \ll p$, which admittedly looks dubious from a statistical perspective. However, the cost of this method is now $O(Kp)$ in both computation and memory, so the impetus is clear from a purely computational perspective. The remarkable thing is that it actually often works quite well, although we note that the more common regime of application is dynamical systems in which some particle-wise (often nonlinear) forward propagation occurs in between (59) and (58). The stochastic version can be used directly in the absence of the sparsity considerations of Section 2.2. However, in order to use the identity (57) we need the precision. One potential, and common, solution is to modify/inflate (58) for some small $\epsilon > 0$ with

$$C_n = \frac{1}{K} \sum_{i=1}^K \big( \beta_n^{(i)} - m_n \big)\big( \beta_n^{(i)} - m_n \big)^T + \epsilon I_p. \qquad (60)$$

In our case, however, there is by design a more sensible choice of approximation by a diagonal matrix plus a low-rank correction. The whole program can be carried out, but due to this fact we will not consider the EnKF further here. Note that such adjustments, known generally as covariance inflation in the data assimilation literature [48], prevent convergence of the method to the Kalman filter in the limit of an infinite sample size, so exactness is lost.
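For concreteness, the following sketch implements the stochastic EnKF variant mentioned above (perturbed observations in the sense of [15]) for a single regression datum; the anomaly factorization means the $p \times p$ covariance is never formed, so the cost is $O(Kp)$. The function name and the use of NumPy are illustrative:

```python
import numpy as np

def enkf_update(ensemble, x, y, gamma, rng=None):
    """One stochastic-EnKF assimilation of a single regression observation
    y = x^T beta + N(0, 1/gamma), for an ensemble of shape (K, p)."""
    rng = np.random.default_rng() if rng is None else rng
    K, p = ensemble.shape
    m = ensemble.mean(axis=0)                 # ensemble mean, as in (58)
    A = (ensemble - m) / np.sqrt(K)           # anomalies: C_{n-1} = A^T A
    Cx = A.T @ (A @ x)                        # C_{n-1} x_n in O(Kp)
    S = 1.0 / gamma + x @ Cx                  # innovation variance
    # each member assimilates a perturbed observation [15]
    y_pert = y + rng.normal(0.0, 1.0 / np.sqrt(gamma), size=K)
    innov = y_pert - ensemble @ x             # per-member innovations
    return ensemble + np.outer(innov / S, Cx) # Kalman-gain update
```

The deterministic square-root variant of (59) replaces the perturbed observations with an exact transform of the anomalies; the stochastic form is shown here for brevity.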
E Full observations Fourier truncation for TV denoising

Thanks to the diagonalization of $\tilde{X}$ we can identify an approximation as follows. Let $\mathcal{I} := \{ k \,;\, \exp(-\omega|k|) > \rho\gamma \}$, for $\rho < 1$, such that for $z = \tilde{X}\tilde{\beta}$ and $\widehat{z} = \tilde{X}_{\mathcal{I}}\tilde{\beta}$ we have $|z - \widehat{z}| < \gamma$, i.e. the truncated part of the signal is smaller than the observational noise. Here $\tilde{X}_{\mathcal{I}}$ is shorthand notation for the rank $\tilde{n} = |\mathcal{I}|$ approximation of $\tilde{X}$ obtained by truncating the wavenumbers $k \notin \mathcal{I}$. For appropriate choices of $\omega, \gamma > 0$, this provides a tractable scenario for full observations (in the sense that the solution is close to the actual full-observation case). The situation is slightly complicated, however, since $X \in \mathbb{R}^{\tilde{p} \times p}$ despite being of rank $\tilde{n}$. We therefore approximate $X \approx X_\ell X_r$, where $X_\ell \in \mathbb{C}^{\tilde{p} \times \tilde{n}}$ and $X_r \in \mathbb{C}^{\tilde{n} \times p}$ are defined as follows:

$$X_\ell := F^{-1}\big( \exp(-\omega|k|) \big)_{k \in \mathcal{I}}, \qquad X_r^\dagger := F^{-1}\big( ik\,|k|^{-2} + \tilde{p}\,\delta_{k=0} \big)_{k \in \mathcal{I}}.$$

Now $X^\dagger X = X_r^\dagger \big( X_\ell^\dagger X_\ell \big) X_r$, and we simply redefine (3) with an alternative application of the Sherman-Morrison-Woodbury matrix identity:

$$C_n = \big( I_p - \tilde{K} X_r \big) C_0, \qquad \tilde{K} = C_0 X_r^\dagger \Big( X_r C_0 X_r^\dagger + \gamma^{-1} \big( X_\ell^\dagger X_\ell \big)^{-1} \Big)^{-1},$$

and (2) becomes $m_n = \tilde{K} \big( X_\ell^\dagger X_\ell \big)^{-1} X_\ell^\dagger Y$.
We note that $X_\ell^\dagger X_\ell \in \mathbb{R}^{\tilde{n} \times \tilde{n}}$ can be easily computed and inverted at a cost of $O(\tilde{p}\,\tilde{n}^2)$.
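As an illustration of the truncation rule, the following sketch selects the retained wavenumbers $\mathcal{I}$ and applies the rank-$\tilde{n}$ operator $\tilde{X}_{\mathcal{I}}$ (NumPy; the assumption that $\tilde{X}$ has the diagonal symbol $\exp(-\omega|k|)$ in the discrete Fourier basis follows the text above, and all names are illustrative):

```python
import numpy as np

def apply_truncated_Xtilde(beta_tilde, omega, gamma, rho=0.5):
    """Apply the rank n_tilde approximation of X_tilde: keep only the
    wavenumbers I = {k : exp(-omega*|k|) > rho*gamma}, rho < 1, so the
    discarded signal is below the observational noise level."""
    p = beta_tilde.shape[0]
    k = np.fft.fftfreq(p, d=1.0 / p)        # integer wavenumbers
    symbol = np.exp(-omega * np.abs(k))     # diagonal symbol of X_tilde
    keep = symbol > rho * gamma             # the index set I
    B = np.fft.fft(beta_tilde)
    B[~keep] = 0.0                          # truncate wavenumbers k not in I
    return np.real(np.fft.ifft(symbol * B))
```

The cardinality of the kept set grows only logarithmically in $1/(\rho\gamma)$ for fixed $\omega$, which is what makes the truncated full-observation problem tractable.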