Overcomplete representation in a hierarchical Bayesian framework
Monica Pragliola^a, Daniela Calvetti^b, Erkki Somersalo^b

^a Department of Mathematics, University of Bologna, Piazza di Porta San Donato 5, Bologna, IT
^b Department of Mathematics, Applied Mathematics and Statistics, Case Western Reserve University
Abstract
A common task in inverse problems and imaging is finding a solution that is sparse, in the sense that most of its components vanish. In the framework of compressed sensing, general results guaranteeing exact recovery have been proven. In practice, sparse solutions are often computed by combining ℓ1-penalized least squares optimization with an appropriate numerical scheme to accomplish the task. A computationally efficient alternative for finding sparse solutions to linear inverse problems is provided by Bayesian hierarchical models, in which the sparsity is encoded by defining a conditionally Gaussian prior model with the prior parameter obeying a generalized gamma distribution. An iterative alternating sequential (IAS) algorithm has been demonstrated to lead to a computationally efficient scheme, and combined with Krylov subspace iterations with an early termination condition, the approach is particularly well suited for large scale problems. Here the Bayesian approach to sparsity is extended to problems whose solution allows a sparse coding in an overcomplete system such as composite frames. It is shown that among the multiple possible representations of the unknown, the IAS algorithm, and in particular a hybrid version of it, effectively identifies the most sparse solution. Computed examples show that the method is well suited not only for traditional imaging applications but also for dictionary learning problems in the framework of machine learning.

1 Introduction

Sparsity promoting methods and algorithms for inverse problems and imaging applications have been extensively studied in the past decades, and they continue to be a very active field of research. The interest in compressed sensing has motivated a significant part of the work on the topic. The starting point of many sparse reconstruction problems is a dictionary, intended as a collection of elements in the ambient space referred to as atoms [15], used to represent the unknown quantity of interest. The dictionary may be selected according to some a priori information available on the problem of interest or, alternatively, its formation can be data-driven; see, e.g., [16, 2, 12]. Recently, approaches aimed at learning the dictionary while jointly recovering the signal have also been developed [11]. Typically, the cardinality of the dictionary is significantly larger than the dimension of the ambient space. When the atoms of the dictionary W do not form a basis for the ambient space, the dictionary is called redundant or overcomplete. The use of redundant dictionaries has proved to be a useful strategy in terms of artifact reduction, especially in the framework of signal denoising problems [18, 17].

Let x ∈ R^n be an unknown signal, and let W = {w_i}_{i=1}^N be a dictionary with atoms w_i ∈ R^n. We arrange the atoms as columns of the dictionary matrix, W ∈ R^{n×N}, with n ≪ N, and refer to this matrix as the dictionary. In a synthesis perspective, a sparse reconstruction problem is the task of recovering a sparse vector α ∈ R^N, with most of its components vanishing, that represents the original signal in terms of W, x = Wα, starting from a corrupted and possibly poorly sampled indirect observation b ∈ R^m of x, with m ≤ n. Assuming that the observation is linear in x and that the noise is additive, the sparse dictionary representation can be formulated as an optimization problem of the form

    minimize ‖α‖_0   such that b = Ax + ε, with x = Wα,        (1)
where ‖α‖_0 = card(supp(α)), A ∈ R^{m×n} is the forward model operator, and ε ∈ R^m is an additive noise vector. Addressing the minimization problem (1) directly is a challenge due to its NP-hardness, thus explaining the need for alternative approaches. A strategy which has been widely explored, and goes under the name of basis pursuit, replaces the ℓ0-(semi)norm with its ℓ1 convex relaxation [13].

When the signal itself is known to be compressible, i.e., card({x_i : |x_i| > ϵ}) ≪ n for some small ϵ > 0, and the matrix A satisfies the restricted isometry property (RIP) condition, an optimal bound for the error ‖x − x̂‖, with x̂ denoting the recovered signal, has been derived. Moreover, if x is sufficiently sparse, the signal can be recovered exactly [9]. When the signal itself is not sparse, but it allows a sparse or compressible representation in a given dictionary, the exact recovery results still hold, provided that A satisfies a restricted isometry property adapted to a dictionary (D-RIP) condition [10]. Given the theoretical motivation, a significant amount of research is devoted to identifying classes of operators to which these results can be applied.

Besides the convex approaches, ℓp-norms with p < 1 have been considered to replace the ℓ0-(semi)norm in problem (1), as they are known to promote sparsity more strongly than the case p = 1. Nonetheless, the presence of local minima is a clear limitation to the reliability of non-convex strategies.

The sparse reconstruction problem for linear inverse problems allows a natural formulation in the Bayesian computational framework, with the notion of sparsity promoting priors. In a number of previous works [8, 7, 6], the recovery of a sparse signal x, or of a signal admitting a sparse representation in a given basis, has been addressed by modeling its entries in a hierarchical Bayesian framework as conditionally Gaussian random variables with unknown variances, with a generalized gamma hyperprior distribution. The sparsity promotion and the convexity properties of the corresponding class of hypermodels have been studied in [8, 7]. The derived results, and in particular the considerations on the convexity properties of the resulting maximum a posteriori estimation problems, motivated the introduction of a hybrid hypermodel combining the strong sparsity promotion that typically characterizes non-convex settings with the convexity guarantees [6].

In this article, the hierarchical Bayesian framework outlined in the previous articles, and in particular the use of the hybrid algorithm of [6], is extended to address sparse recovery problems in the presence of redundant dictionaries. We consider a version of the iterated alternating sequential (IAS) algorithm that combines ideas from Bayesian inference and iterative Krylov subspace methods, is suitable for large scale problems, and is therefore particularly attractive for problems with large dictionaries. Numerical examples demonstrate the computational efficiency of the approach and, most importantly, show that for composite frame dictionaries, where each subdictionary would provide a sufficient representation of the signal, the method is capable of identifying an optimally sparse representation.

2 The hierarchical Bayesian model

Consider the linear inverse problem

    b = Ax + ε,   ε ∼ N(0, Σ),   such that x = Wα,

where A ∈ R^{m×n}, with m ≤ n, is the known forward model operator, x ∈ R^n is the unknown of interest, and Σ ∈ R^{m×m} is the symmetric positive definite covariance matrix of the additive Gaussian noise.
In addition, we assume that x admits a representation in the redundant dictionary W ∈ R^{n×N}, with n ≪ N, where the unknown vector α ∈ R^N is sparse, i.e., ‖α‖_0 ≪ N, either because x can be naturally described by few atoms in the dictionary or because x needs to be compressed.

In a number of previous contributions [8, 7, 6], the a priori sparsity belief on the unknown has been exploited by modeling its entries as independent random variables following a conditionally Gaussian distribution, i.e.,

    α_j | θ_j ∼ N(0, θ_j),   1 ≤ j ≤ N,

or, in an equivalent compact form,

    α | θ ∼ N(0, D_θ),   D_θ = diag(θ_1, ..., θ_N) ∈ R^{N×N}.

The conditionally Gaussian prior on α given the vector θ takes the form

    π_{α|θ}(α | θ) ∝ ∏_{j=1}^N (1/√θ_j) exp(−α_j²/(2θ_j)) = exp(−(1/2)‖D_θ^{−1/2}α‖² − (1/2)∑_{j=1}^N log θ_j).

In the hierarchical framework, θ is also modeled as a random variable. The a priori beliefs about θ are encoded in the hyperprior π_θ(θ), and the joint prior on the coupled vector of unknowns (α, θ) reads

    π_{(α,θ)}(α, θ) = π_{α|θ}(α | θ) π_θ(θ).        (2)

In [7], the authors propose to model the unknown variances θ_j as mutually independent random variables following a generalized gamma distribution,

    π_θ(θ) = π_θ(θ | r, β, ϑ) = (|r|/Γ(β))^N ∏_{j=1}^N (1/ϑ_j)(θ_j/ϑ_j)^{rβ−1} exp(−(θ_j/ϑ_j)^r),        (3)

where r ∈ R \ {0}, β > 0, and ϑ_j > 0.
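To make the role of the hyperprior concrete, the following minimal Python sketch draws samples from (3); it relies on the elementary fact that if u follows a Gamma(β, 1) distribution, then θ = ϑ u^{1/r} follows the generalized gamma distribution with parameters (r, β, ϑ). The specific parameter values below are illustrative only, not the values used in the computed examples.

import numpy as np

def sample_generalized_gamma(r, beta, vartheta, size, rng=None):
    # Samples from (3): if u ~ Gamma(beta, 1), then theta = vartheta * u**(1/r)
    # has density |r|/(vartheta*Gamma(beta)) * (theta/vartheta)**(r*beta - 1)
    #             * exp(-(theta/vartheta)**r).
    rng = np.random.default_rng() if rng is None else rng
    u = rng.gamma(shape=beta, scale=1.0, size=size)
    return vartheta * u ** (1.0 / r)

# Heavier tails for r < 1: compare extreme quantiles of two hypermodels
theta_r1 = sample_generalized_gamma(r=1.0, beta=2.0, vartheta=1.0, size=200000)
theta_r05 = sample_generalized_gamma(r=0.5, beta=4.0, vartheta=1.0, size=200000)
print(np.quantile(theta_r1, 0.999), np.quantile(theta_r05, 0.999))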
This choice is motivated by the observation that generalized gamma distributions tend to favor values which are close to the expected value, while also allowing for a few outliers very far from the mean. Presumably, the outlier variances give rise to the few non-zero values of α, or values above a tiny threshold.

The information about the observation process is encoded in the likelihood distribution, which, in view of the additive Gaussian noise model, takes the form

    π_{b|α}(b | α) ∝ exp(−(1/2)‖S(AWα − b)‖²),

where S is the Cholesky factor of the precision matrix Σ^{−1}, i.e., Σ^{−1} = SᵀS. If the matrix Σ, and thereby S, are known, without loss of generality we can assume the noise to be white, i.e., Σ = I, because it can be whitened by a linear transform on A and b, namely

    A → SA,   b → Sb.

Under the white noise assumption, the likelihood distribution is of the form

    π_{b|α}(b | α) ∝ exp(−(1/2)‖AWα − b‖²).

The conditional prior and the hyperprior are coupled with the likelihood via Bayes' formula, yielding the following expression for the posterior distribution:

    π_{(α,θ)|b}(α, θ | b) ∝ π_{b|α}(b | α) π_{α|θ}(α | θ) π_θ(θ).

In the Bayesian framework, the posterior distribution is the complete solution to the inverse problem, and it can be used to produce representative estimates of the unknown of interest and to quantify the uncertainty. Here, we choose to summarize the posterior with the maximum a posteriori (MAP) estimate,

    (α*, θ*) ∈ arg max_{α,θ} { π_{(α,θ)|b}(α, θ | b) },

or equivalently, by taking the negative logarithm of the density and ignoring the additive constants,

    (α*, θ*) ∈ arg min_{α,θ} { F(α, θ) },        (4)

where

    F(α, θ) = F(α, θ | r, β, ϑ) = (1/2)‖b − AWα‖² + (1/2)∑_{j=1}^N α_j²/θ_j − η ∑_{j=1}^N log(θ_j/ϑ_j) + ∑_{j=1}^N (θ_j/ϑ_j)^r,   η = rβ − 3/2,        (5)

and the last three terms are denoted by P(α, θ | r, β, ϑ). Echoing the terminology of classical regularization schemes, we refer to P(α, θ | r, β, ϑ) as the penalty term.
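For reference, the following Python sketch evaluates the MAP objective (5) after noise whitening. It is a direct transcription of the formulas above; the matrix AW, the data b, and the hyperparameters are assumed given, and the function names are introduced here for illustration.

import numpy as np

def whiten(A, b, Sigma):
    # Whitening transform: with Sigma^{-1} = S^T S, map A -> S A, b -> S b,
    # so that the transformed noise is white.
    L = np.linalg.cholesky(np.linalg.inv(Sigma))  # lower triangular, L L^T = Sigma^{-1}
    S = L.T                                       # S^T S = Sigma^{-1}
    return S @ A, S @ b

def objective(alpha, theta, AW, b, r, beta, vartheta):
    # MAP objective F(alpha, theta) of (5) for white noise, with eta = r*beta - 3/2
    eta = r * beta - 1.5
    fidelity = 0.5 * np.sum((b - AW @ alpha) ** 2)
    penalty = (0.5 * np.sum(alpha ** 2 / theta)
               - eta * np.sum(np.log(theta / vartheta))
               + np.sum((theta / vartheta) ** r))
    return fidelity + penalty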
3 The IAS algorithm

The search for the minimizer of the MAP objective function in (5) is carried out with the global hybrid scheme introduced in [6], based on the iterative alternating sequential (IAS) algorithm described below. Details of the hybrid scheme are reviewed in Section 5.

Given a suitable initialization θ^0 of the variances, at each iteration step the IAS algorithm updates the iterates α^t, θ^t by solving the minimization problem in alternating directions, that is,

    α^{t+1} ∈ arg min_α { F(α, θ^t) },   θ^{t+1} ∈ arg min_θ { F(α^{t+1}, θ) }.

Because of the particular form of the objective function, both variables can be updated efficiently as follows.
Update of α
The α-update reduces to the solution of a quadratic minimization problem, i.e.,

    α^{t+1} ∈ arg min_α { ‖b − AWα‖² + ‖D_θ^{−1/2}α‖² },   θ = θ^t,

which can be recast in an equivalent form as finding the solution in the least squares sense of the linear system

    ⎡ AW          ⎤        ⎡ b ⎤
    ⎢             ⎥ α  =   ⎢   ⎥ .        (6)
    ⎣ D_θ^{−1/2}  ⎦        ⎣ 0 ⎦

After performing the change of variables γ = D_θ^{−1/2}α, we can write (6) as

    ⎡ AW D_θ^{1/2} ⎤        ⎡ b ⎤
    ⎢              ⎥ γ  =   ⎢   ⎥ ,
    ⎣ I            ⎦        ⎣ 0 ⎦

where I is the N × N identity matrix. The solution of this least squares problem is also the Tikhonov regularized solution of

    AW D_θ^{1/2} γ = b,   α = D_θ^{1/2} γ,

with regularization parameter equal to one. An alternative to Tikhonov regularization, yielding a similar solution, is to solve the underlying linear system with an iterative solver equipped with an early stopping criterion. The stopping condition is usually based on a variant of Morozov's discrepancy principle, stipulating that the iterations are terminated as soon as the discrepancy reaches an estimated level of the observation noise. In the statistical framework, under the Gaussian noise assumption, the noise level can be expressed in terms of the standard deviation of the noise; in our case, where we assume m-dimensional white noise, this quantity is equal to √m. Following [5, 7], we solve the linear system using the Conjugate Gradient method for Least Squares (CGLS) with early stopping at the noise level √m; see [5] for more details.
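A minimal sketch of the α-update is given below: a plain CGLS iteration applied to AW D_θ^{1/2} γ = b, terminated by the discrepancy criterion at the level √m. It illustrates the idea of the priorconditioned solver of [5] rather than reproducing its implementation; the function names are ours.

import numpy as np

def cgls(M, y, discrepancy, max_iter=None):
    # CGLS for min ||y - M x||, stopped as soon as the residual norm reaches
    # the estimated noise level (Morozov-type early stopping).
    m, n = M.shape
    max_iter = n if max_iter is None else max_iter
    x = np.zeros(n)
    res = y.copy()                # residual y - M x
    s = M.T @ res
    p = s.copy()
    gamma = s @ s
    k = 0
    for k in range(max_iter):
        if np.linalg.norm(res) <= discrepancy:
            break
        q = M @ p
        step = gamma / (q @ q)
        x += step * p
        res -= step * q
        s = M.T @ res
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x, k

def update_alpha(AW, b, theta):
    # alpha-update: solve AW D_theta^{1/2} gamma = b by CGLS with early
    # stopping at sqrt(m), then map back alpha = D_theta^{1/2} gamma.
    M = AW * np.sqrt(theta)                   # column scaling by D_theta^{1/2}
    gamma, _ = cgls(M, b, discrepancy=np.sqrt(len(b)))
    return np.sqrt(theta) * gamma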
Update of θ
Due to the mutual independence of the entries of θ, each variance θ_j can be updated separately by imposing the component-wise first order optimality condition on (5). More specifically, θ_j^{t+1} is the solution of the non-linear equation

    ∂F/∂θ_j = −α_j²/(2θ_j²) − η/θ_j + r θ_j^{r−1}/ϑ_j^r = 0,   α = α^{t+1}.        (7)

For some values of r, e.g., r = ±1, (7) admits an analytic solution; in general, however, it needs to be solved numerically. It was shown in [7] that after the changes of variables θ_j = ϑ_j ξ_j, α_j = √ϑ_j z_j, we may write ξ_j = φ(|z_j|), where, via implicit differentiation, the function φ satisfies the initial value problem

    φ′(z) = 2zφ(z) / (2r²φ(z)^{r+1} + z²),   φ(0) = (η/r)^{1/r}.        (8)

Therefore the updated value of θ_j can be computed by a numerical time integrator. Since the same differential equation is satisfied by all components, an efficient way to update θ is to sort the current values z_j in ascending order, and to integrate sequentially over the gaps between the values with a suitable time integrator.

We point out that, unlike in the formally similar alternating direction method of multipliers (ADMM) [1], often used to solve regularized inverse problems with sparsity promoting priors, the IAS algorithm does not require the introduction of an artificial term decoupling the fidelity and penalty terms, as the partial decoupling in IAS is automatic and exact.
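The sketch below completes the picture. The θ-update solves (7) component-wise: in closed form for the gamma hypermodel r = 1 and, for illustration, with a generic scalar root-finder in place of the ODE-based integrator described above. The outer IAS loop then alternates the two updates, reusing update_alpha from the previous sketch; the root bracket and the initialization θ = ϑ are crude assumptions that may need adjusting in practice.

import numpy as np
from scipy.optimize import brentq

def update_theta(alpha, r, beta, vartheta):
    # Component-wise solution of the first order condition (7).
    eta = r * beta - 1.5
    if r == 1.0:
        # closed form: theta_j = vartheta_j*(eta/2 + sqrt(eta^2/4 + alpha_j^2/(2*vartheta_j)))
        return vartheta * (eta / 2 + np.sqrt(eta ** 2 / 4 + alpha ** 2 / (2 * vartheta)))
    theta = np.empty_like(alpha)
    for j, (a, v) in enumerate(zip(alpha, vartheta)):
        f = lambda t: -a ** 2 / (2 * t ** 2) - eta / t + r * t ** (r - 1) / v ** r
        theta[j] = brentq(f, 1e-12, 1e12)     # assumes the root lies in this bracket
    return theta

def ias(AW, b, r, beta, vartheta, n_outer=30):
    # IAS: alternate the alpha- and theta-updates (update_alpha as sketched above).
    theta = vartheta.copy()                    # one common initialization choice
    for _ in range(n_outer):
        alpha = update_alpha(AW, b, theta)
        theta = update_theta(alpha, r, beta, vartheta)
    return alpha, theta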
4 Selection of the hyperparameters

Before presenting the details of the hybrid scheme used in the numerical tests, we briefly review some of the main results related to the selection of the hyperparameters (r, β, ϑ) appearing in the expression of the hyperprior (3). We start by recalling a theorem, whose proof can be found in [7], summarizing how r and β affect the convexity properties of the functional F.

Theorem 1. Let r ≠ 0 and β > 0, and let F(α, θ) be the objective function for the minimization problem in (4).

(a) If r ≥ 1 and η = rβ − 3/2 > 0, the function F(α, θ) is globally convex.

(b) If 0 < r < 1 and η = rβ − 3/2 > 0, or if r < 0 and β > 0, the function F(α, θ) is convex provided that

    θ_j < θ̄_j = ϑ_j ( η / (r|r − 1|) )^{1/r}.        (9)
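As a small utility, the bound (9) can be computed as follows; the function returns infinity in the globally convex regime r ≥ 1. This is a direct transcription of the theorem, under the reconstruction of (9) given above.

import numpy as np

def convexity_bound(r, beta, vartheta):
    # Threshold of (9): F is convex as long as theta_j < theta_bar_j (r < 1, r != 0);
    # for r >= 1 with eta > 0 the objective is globally convex.
    eta = r * beta - 1.5
    vartheta = np.asarray(vartheta, dtype=float)
    if r >= 1.0:
        return np.full_like(vartheta, np.inf)
    return vartheta * (eta / (r * abs(r - 1.0))) ** (1.0 / r)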
The convexity of the MAP objective function, guaranteed for r ≥ 1, is very convenient; however, some of the configurations attained for r < 1 promote sparsity more strongly. In the following, we express the variance θ_j at the minimizer as a function of α_j as

    θ_j = g_j(α_j) = ϑ_j φ( |α_j| / √ϑ_j ).

We review some recent results [3, 8, 7] about the connections that can be drawn between the generalized gamma hyperpriors and classical sparsity promoting penalty terms, assuming that (θ_j, α_j) satisfies the above identity.

(i) For the gamma hypermodel, i.e., r = 1, as η = rβ − 3/2 → 0+, the penalty term approaches a weighted ℓ1-penalty term [3, 8],

    lim_{η→0+} P(α, g(α) | 1, 3/2 + η, ϑ) = √2 ∑_{j=1}^N |α_j| / √ϑ_j.
(ii) If rβ = 3/2, i.e., η = 0, the penalty term coincides with a weighted ℓp-penalty with p = 2r/(r + 1) [7],

    P(α, g(α) | r, 3/(2r), ϑ) = C_r ∑_{j=1}^N |α_j|^p / ϑ_j^{p/2},   C_r = (r + 1)/(2r)^{r/(r+1)}.
(iii) For the inverse gamma hypermodel, corresponding to r = −1, the corresponding prior approaches a Student distribution, a prominently fat tailed distribution favoring large outliers, leading to a greedy algorithm that strongly promotes sparsity [7].
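Item (ii) is easy to verify numerically. With η = 0 and ϑ_j = 1, the stationarity condition (7) reduces to rθ^{r+1} = α²/2, and substituting the resulting θ into the penalty reproduces C_r |α|^p exactly; the script below checks this for the illustrative value r = 1/2.

import numpy as np

r = 0.5                                    # illustrative choice, beta = 3/(2r)
p = 2 * r / (r + 1)
C_r = (r + 1) / (2 * r) ** (r / (r + 1))
alpha = np.linspace(0.1, 5.0, 9)
theta = (alpha ** 2 / (2 * r)) ** (1.0 / (r + 1))  # root of (7) with eta = 0, vartheta = 1
P = alpha ** 2 / (2 * theta) + theta ** r          # penalty; the log term vanishes
print(np.allclose(P, C_r * np.abs(alpha) ** p))    # True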
To summarize, the above results indicate that the hyperpriors for which the global convexity of the corresponding hypermodel is not guaranteed (r < 1) are expected to promote sparsity more effectively than the limit case r = 1, which can be seen as a counterpart of the ℓ1-penalized case.

While the hyperparameters r and β determine the strength of the sparsity promotion and the convexity properties of the MAP objective function, the vector of the scale parameters ϑ can be set automatically once the operator AW is given. More specifically, for each j, ϑ_j can be related to the sensitivity of the data to α_j, given by the quantity ‖AW e_j‖, where e_j ∈ R^N denotes the j-th canonical Cartesian unit vector. It was proven, for r = 1 in [5, 8] and in more general settings in [7], that under the assumption that the signal-to-noise ratio is given, and that the prior satisfies an exchangeability condition guaranteeing that no particular sparse combinations of components of α are favored over others, the entries of ϑ must be chosen as

    ϑ_j = C / ‖AW e_j‖²,

where C > 0. Sensitivity weights play an important role in, e.g., inverse source problems, in which sources near the observation points may be favored over far away sources unless the exchangeability condition is imposed. In the current setting, when the dictionary consists of sub-frames with possibly different column norms, we expect the weights ϑ_j to prevent the sub-frames with larger column norms from dominating the representation of the signal.

5 Hybrid IAS schemes

In the following discussion, we write the penalty function P(α, θ | r, β, ϑ) in terms of its components,

    P(α, θ | r, β, ϑ) = ∑_{j=1}^N ( α_j²/(2θ_j) − η log(θ_j/ϑ_j) + (θ_j/ϑ_j)^r ) = ∑_{j=1}^N P_j(α_j, θ_j | r, β, ϑ_j).

In [6], two different hybrid strategies to speed up and enhance sparsity promotion in the IAS algorithm were proposed. In both versions, the IAS iterations are initiated by selecting a conservative set of hyperparameters for which the objective function is convex, thus guaranteeing global convergence to a unique minimizer. We denote this set of parameters by (r^{(1)}, β^{(1)}, ϑ^{(1)}). For the second phase of the hybrid algorithm, we select another set of parameters, (r^{(2)}, β^{(2)}, ϑ^{(2)}), for which the global convexity of the objective function is not guaranteed. To match the models so that they express coherent prior beliefs, we adjust the scale parameters ϑ^{(j)} so as to satisfy the compatibility condition

    (η^{(1)}/r^{(1)})^{1/r^{(1)}} ϑ_j^{(1)} = (η^{(2)}/r^{(2)})^{1/r^{(2)}} ϑ_j^{(2)},        (10)

which guarantees that the parameter θ_j computed at α_j = 0 returns the same value regardless of the model. For further discussion, we refer to [6].
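In practice, the scale parameters can be set as follows: the sensitivity scaling ϑ_j = C/‖AW e_j‖² is computed from the column norms of AW, and the second-phase scales ϑ^{(2)} follow from the compatibility condition (10). A minimal sketch, with function names introduced here for illustration:

import numpy as np

def sensitivity_scales(AW, C=1.0):
    # vartheta_j = C / ||AW e_j||^2, computed from the squared column norms
    return C / np.sum(AW ** 2, axis=0)

def match_scales(vartheta1, r1, beta1, r2, beta2):
    # Compatibility condition (10): the value of theta_j at alpha_j = 0,
    # namely vartheta_j * (eta/r)^(1/r), must agree for the two models.
    eta1 = r1 * beta1 - 1.5
    eta2 = r2 * beta2 - 1.5
    return vartheta1 * (eta1 / r1) ** (1.0 / r1) / (eta2 / r2) ** (1.0 / r2)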
In the local hybrid version, the IAS algorithm is initially run with the hyperparameters (r^{(1)}, β^{(1)}, ϑ^{(1)}), and after each iteration step we check which θ_j, if any, satisfy the condition (9), where θ̄_j is computed using the hyperparameter set (r^{(2)}, β^{(2)}, ϑ^{(2)}). For those components which do, we modify the local objective function so that

    P_j(α_j, θ_j | r^{(1)}, β^{(1)}, ϑ_j^{(1)}) → P_j(α_j, θ_j | r^{(2)}, β^{(2)}, ϑ_j^{(2)}).

The global hybrid scheme is based on the idea that after a number of IAS iteration rounds, the iterate of the globally convex objective function with hyperparameters (r^{(1)}, β^{(1)}, ϑ^{(1)}) is near the unique global minimum of that objective function. Restarting the IAS from the current point with the parameters (r^{(2)}, β^{(2)}, ϑ^{(2)}) may quickly find a local minimizer near the global minimizer of the original objective function. While the two minimizers are likely not far apart, the local minimizer is typically sparser, and the convergence to it is faster.

In the computed examples of the next section, we restrict ourselves to the global hybrid strategy, with the gamma hyperprior for the first model and a generalized gamma hyperprior with r < 1 for the second.

6 Computed examples

In this section, we demonstrate the viability of the hybrid IAS algorithm in the context of overcomplete representations. The main goal of these examples is to demonstrate that the algorithm is capable of selecting, from a dictionary of sub-frames where several representations are possible, a set of atoms that makes the representation as sparse as possible.
The first test case is a one-dimensional deconvolutionproblem. The generative model is a piecewise constant signal f : [0 , → R , f (0) = 0, and the dataconsist of a few discrete observations, b j = Z A ( s j − t ) f ( t ) dt + ε j , ≤ j ≤ m, A ( s j , t ) = 1 √ πw e − ( sj − t )22 w , corrupted by Gaussian blur with w = 0 .
02 and additive scaled white Gaussian noise, with standarddeviation σ set to 2% of the maximum of the noiseless signal. The data has been generated using adiscretization of the unit interval with n = n dense = 1253 nodes, while in the forward model used forsolving the inverse problem, we set n = 500. The number of equidistant observation points in the signaldomain is m = 46. The generative signal and the data are shown in Figure 1.The generative signal admits a natural sparse representation in terms of its increments z j = z j − z j − over the interval of definition. Assuming x = 0, then z = B x , B = . . . − . . .
0. . .0 . . . − ∈ R n × n , (11)hence x = L z with L = B − = . . .
01 1 . . . . . . ∈ R n × n . Our goal is to test the effectiveness of the outlined framework in recovering the most natural sparserepresentation of the given signal. Let C denote the discrete cosine transform matrix, providing analternative and accurate way of representing the signal, x = C T y, y = C x, which is, however, not sparse. To test wether the algorithm is able to identify the frame that allows asparse representation, we consider the overcomplete dictionary, W = [ W , W ] ∈ R n × n with W = L ∈ R n × n and W = C T ∈ R n × n , and formulate the underlying linear inverse problem as b = AW α + ε = A [ W , W ] (cid:20) α α (cid:21) + ε, ε ∼ N (0 , σ I m ) , where A is the discrete blur operator. 7igure 2: Reconstruction of the signal x (left) and the count of CGLS steps per outer iteration of theglobal hybrid IAS (right).Figure 3: From left to right: representation vector α i , contribution of the signal W i α i and scaledvariances corresponding to α i for i = 1 (top row) and i = 2 (bottom row). In the global hybrid IAS weset η (1) = 10 − and η (2) = 10 − .The signal reconstructed by the global hybrid IAS scheme is shown in Figure 2. The restored α and α and their contribution in the estimated signal are shown in Figure 3, together with the scaledvariances corresponding to α and α , i.e. θ j ϑ (2) j , ≤ j ≤ n and θ j ϑ (2) j , n + 1 ≤ j ≤ n . Notice that the output variances are scaled by the sensitivities corresponding to the second hyperpriorused to design the hybrid scheme.Despite the relatively high level of degradation (blur and noise) and down-sampling in the observed data b ,the algorithm has no problem detecting the basis that provides a more natural and sparse representationfor the original signal. In fact, the coefficients α are five to six orders of magnitude smaller than thenon-vanishing components of α . The degree of sparsity in the final representation is also reflected inthe number of CGLS steps per outer iteration of the global hybrid IAS - see Figure 2 - which quicklysettles around the cardinality of the support of α . Image denoising
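For concreteness, the composite dictionary of this example can be assembled in a few lines; the blur matrix A and the data b are assumed to be given, after which the IAS machinery of the previous sketches applies to the matrix AW.

import numpy as np
from scipy.fft import dct

n = 500
B = np.eye(n) - np.eye(n, k=-1)              # increment matrix of (11)
L = np.tril(np.ones((n, n)))                 # L = B^{-1}: cumulative sums
C = dct(np.eye(n), axis=0, norm='ortho')     # orthonormal DCT matrix, y = C x
W = np.hstack([L, C.T])                      # composite dictionary W = [W_1, W_2]
assert np.allclose(B @ L, np.eye(n))         # sanity check: L inverts B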
Image denoising
In the second example, we consider the problem of denoising a blocky gray scale test image of size n × n, n = 200. The pixel values, which are between 0 and 1, are corrupted by scaled white Gaussian noise with standard deviation σ set to 10% of the maximum of the noiseless image; see Figure 4.

[Figure 4: Original image (left), observed data (middle) and reconstructed image (right).]

[Figure 5: Representation vectors α_i (left panels), logarithmic plot of the corresponding scaled variances (middle panels) and vectors W_i α_i contributing to the final restoration (right panels) for i = 1 (top row) and i = 2 (bottom row). The global hybrid scheme has been run with parameters η^{(1)} = 10^{−…} and η^{(2)} = 10^{−…}.]

The test image presents sharp edges lying along the horizontal and vertical axes. Therefore, it admits a sparse representation both in the vertical and in the horizontal increment basis, the latter being slightly less sparse than the former. After representing the image in vector form x ∈ R^{n²} by stacking the pixel values columnwise, we introduce the redundant dictionary

    W = [W_1, W_2] ∈ R^{n²×2n²},   W_1 = (I_n ⊗ B)⁻¹ ∈ R^{n²×n²},   W_2 = (B ⊗ I_n)⁻¹ ∈ R^{n²×n²},

where B is defined as in (11), and ⊗ stands for the Kronecker product. Homogeneous Dirichlet boundary conditions are assumed on the left and top edges of the image.
We want to estimate the sparse vector α = [α_1, α_2]ᵀ, with α_i ∈ R^{n²}, i = 1, 2, from the data vector b ∈ R^{n²}, given the forward model

    b = Wα + ε = [W_1, W_2] [α_1; α_2] + ε,   ε ∼ N(0, σ² I_{n²}).

It is worth remarking here that we require α to be not only sparse, but as sparse as possible. The restoration of the image via the global hybrid IAS is shown in Figure 4, while the contributions of the vertical and horizontal increment bases, together with the output scaled variances corresponding to the vectors α_1 and α_2, are shown in Figure 5. We observe that the image is almost completely restored in terms of the basis vectors corresponding to increments in the vertical direction (α_1), whereas the contribution of α_2, corresponding to increments in the horizontal direction, is negligible. The representation in terms of W_1 is indeed sparser than that in terms of W_2, due to the shorter horizontal boundary of the white inclusion compared to the vertical boundary.
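Since (I_n ⊗ B)⁻¹ and (B ⊗ I_n)⁻¹ are dense, it is preferable to keep the dictionaries in factored form and apply them through sparse triangular solves; both Kronecker factors are lower triangular. A minimal sketch, assuming the columnwise stacking described above:

import scipy.sparse as sp
from scipy.sparse.linalg import spsolve_triangular

n = 200
B = sp.eye(n, format='csr') - sp.eye(n, k=-1, format='csr')  # increments, as in (11)
IkB = sp.kron(sp.eye(n), B, format='csr')   # vertical increments of the stacked image
BkI = sp.kron(B, sp.eye(n), format='csr')   # horizontal increments

def apply_W1(alpha1):
    # W_1 alpha_1 = (I kron B)^{-1} alpha_1 via a sparse triangular solve
    return spsolve_triangular(IkB, alpha1, lower=True)

def apply_W2(alpha2):
    # W_2 alpha_2 = (B kron I)^{-1} alpha_2
    return spsolve_triangular(BkI, alpha2, lower=True)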
Image restoration
In the third example, we consider the restoration of the n × n generative image in Figure 6, with n = 100 and values in [0, 1], corrupted by Gaussian blur with w = 0.006 and additive scaled white Gaussian noise with standard deviation σ set to 1% of the maximum of the noiseless signal; see Figure 6.

[Figure 6: Original image (left), observed data (middle) and reconstructed image (right).]

The test image presents three distinctive features, namely the point-wise stars, the blocky moon, and the smooth cloud. After re-arranging the original image in vectorized form x ∈ R^{n²} by stacking its entries in columnwise order, we hypothesize that a suitable dictionary for the problem of interest is

    W = [W_1, W_2, W_3, W_4] ∈ R^{n²×4n²},

with

    W_1 = I_{n²},   W_2 = (I_n ⊗ B)⁻¹ ∈ R^{n²×n²},   W_3 = (B ⊗ I_n)⁻¹ ∈ R^{n²×n²},   W_4 = Cᵀ,

where B is as in (11) and C is the 2D cosine transform matrix. The problem is to estimate the sparse vector α = [α_1, α_2, α_3, α_4]ᵀ, with α_i ∈ R^{n²} for i = 1, 2, 3, 4, from the data vector b ∈ R^{n²}, given the forward model

    b = AWα + ε = A [W_1, W_2, W_3, W_4] [α_1; α_2; α_3; α_4] + ε,   ε ∼ N(0, σ² I_{n²}),

with A representing the discrete blur operator. The image restored via the global hybrid IAS algorithm is shown in Figure 6, while Figure 7 shows the reconstructions of the representation vectors α_i, the corresponding variances scaled by the sensitivities, and the contributions of the vectors W_i α_i to the final restoration, for i = 1, 2, 3, 4.

[Figure 7: Representation vectors α_i (left panels), logarithmic plot of the corresponding scaled variances (middle panels) and vectors W_i α_i contributing to the final restoration (right panels) for i = 1 (first row), i = 2 (second row), i = 3 (third row) and i = 4 (fourth row). In the global hybrid IAS we set η^{(1)} = 10^{−…} and η^{(2)} = 10^{−…}.]

We point out that, as in the previous example, the representation vectors in both the vertical and the horizontal increment bases are sparse. Nonetheless, the hybrid hypermodel selects the one with fewer non-zero entries.
Dictionary learning
The final example, coming from machine learning, is concerned with the sparse identification of hand-written digits based on a dictionary of annotated data. Consider the MNIST data set of hand-written digits 0, 1, ..., 9, rendered as 16 × 16 black-and-white images. Denoting by w^{(j)} ∈ R^{16²}, 1 ≤ j ≤ N, the vectorized images of the handwritten digits constituting the atoms of the dictionary, and by c_j ∈ {0, 1, ..., 9} the corresponding annotations, we form the dictionary matrix

    W = [ w^{(1)} ··· w^{(N)} ] ∈ R^{n×N},   n = 256,   N = 1707.

To identify a handwritten digit b drawn from an independent set of handwritten digits, we seek to represent it in a sparse manner in terms of the given dictionary,

    b = Wα + ε,

where α ∈ R^N is a sparse vector, and ε represents the discrepancy between the data and its representation. The idea is represented schematically in Figure 8. We point out that in the dictionary consisting of all handwritten digits, the digits with the same annotation can each be thought of as representing a sub-dictionary, and since the proposed algorithm seeks the most economical representation, it is natural that the representation corresponds to picking the atoms from the sub-dictionary with the greatest affinity with the digit represented by the data.

[Figure 8: A schematic representation of the dictionary learning example. The digit on the right is the non-annotated image b, which is approximated in terms of the annotated atoms w^{(i)}. The coefficients α_i can then be used to identify the digit.]

In this example, we run the global hybrid IAS algorithm using the parameters (r^{(1)}, β^{(1)}, ϑ^{(1)}) = (1, …, …), where all components of the vector ϑ^{(1)} are assumed equal, as sensitivity is not an issue in this example, and (r^{(2)}, β^{(2)}) = (…, …), with ϑ^{(2)} determined from the compatibility condition (10). Furthermore, since the digit images are non-negative, after each update step of the pair (α, θ), we project the image to the positive cone; a theoretical justification of the projection step was given in [7]. We switch from the first to the second model in the hybrid IAS scheme when either the relative change of θ in the ℓ2-norm falls below 10^{−…} or 80 iterations have been completed.

Figures 9, 10 and 11 show the results with different choices of the standard deviation of the likelihood. Observe that here the noise term ε represents the discrepancy between the data b and its representation in terms of the dictionary, and can be chosen according to how much fidelity is required. Choosing σ large allows a very sparse representation, as the required quality of the approximation is low; however, a poor approximation easily leads to mis-labeling of the digit. On the other hand, decreasing σ forces the approximation to be better, and more atoms are required. The labeling can be done using the majority vote principle. In the computed examples of Figures 9 and 10, the labeling with majority vote is correct in all cases.
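The labeling rule described above is straightforward to implement. The following sketch assumes a coefficient vector alpha returned by the hybrid IAS and an integer array of atom annotations c_j, and uses the threshold τ of the figure captions below; the function name is ours.

import numpy as np

def classify_digit(alpha, annotations, tau=0.01):
    # Majority vote: keep atoms with |alpha_j| > tau and return the most
    # frequent annotation among them (the largest bin of the histogram).
    active = np.abs(alpha) > tau
    if not np.any(active):
        return None                       # no active atoms: no label
    counts = np.bincount(annotations[active], minlength=10)
    return int(np.argmax(counts))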
[Figure 9: Dictionary learning results. The first row shows the test images of the digits to be classified by the dictionary learning algorithm (vector b), with the true annotation indicated in the figure; the second row shows the vectors θ after the IAS iteration with the first hyperprior, and the third row after the iteration with the second hyperprior. The fourth row represents the synthesis Wα approximating the original digit, and finally, the fifth row gives the histogram of the annotations of the atoms corresponding to coefficients above a threshold τ = 0.01. The annotation is done by majority vote, choosing the largest of the bins. In this example, the standard deviation of the noise representing the mismatch was σ = 0.…]

[Figure 10: The rows are as in Figure 9. In this example, the standard deviation of the noise representing the mismatch was σ = 0.05. Observe that the approximation becomes sparser.]

[Figure 11: The rows are as in Figure 9. In this example, the standard deviation of the noise representing the mismatch was σ = 0.1. The increased sparsity here is traded for an increased number of misclassifications, such as in the first and the third columns.]

7 Conclusions

The hierarchical Bayesian framework combined with Krylov subspace iterative solvers for large linear systems is well suited for the design of computationally efficient methods to solve large scale ill-posed inverse problems with sparsity constraints. Here we have shown that the framework can be naturally adapted to deal with overcomplete systems, consisting of, e.g., combined frames or bases. The approach has significant potential when it may not be known a priori which frame is best suited for representing the unknown, leaving it up to the algorithm to find the most parsimonious representation. In order to avoid that one frame is favored over another, however, it is important that the data are equally sensitive to components in every frame. Fortunately, the sensitivity analysis developed by the authors in [4, 8, 7] naturally provides such a scaling. The proposed sensitivity weights are rooted in the very natural Bayesian principle of exchangeability, stating that no set of non-zero components with a given cardinality should be favored over any other. In light of this principle, the scaling guarantees the same explanation power for every sub-frame, so the one leading to the most sparse solution is automatically selected. This feature may turn out to be particularly useful in machine learning, with applications such as MRI fingerprinting (see, e.g., [14]). In [8], a connection between the proposed IAS algorithm and the compressed sensing literature [9] was considered, suggesting that when the forward model guarantees perfect sparse recovery, the IAS algorithm effectively finds a good approximation of it. It is reasonable to believe that the results can be extended to overcomplete dictionaries, for which similar recovery results are known [10].
Acknowledgements
The work of DC was partly supported by NSF grants DMS-1522334 and DMS-1951446, and the work of ES was partly supported by NSF grant DMS-1714617.
References

[1] Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
[2] Bruckstein, A. M., Donoho, D. L. and Elad, M. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34-81, 2009.
[3] Calvetti, D., Hakula, H., Pursiainen, S. and Somersalo, E. Conditionally Gaussian hypermodels for cerebral source localization. SIAM Journal on Imaging Sciences, 2(3):879-909, 2009.
[4] Calvetti, D., Pascarella, A., Pitolli, F., Somersalo, E. and Vantaggi, B. Brain activity mapping from MEG data via a hierarchical Bayesian algorithm with automatic depth weighting. Brain Topography, 32(3):363-393, 2019.
[5] Calvetti, D., Pitolli, F., Somersalo, E. and Vantaggi, B. Bayes meets Krylov: statistically inspired preconditioners for CGLS. SIAM Review, 60(2):429-461, 2018.
[6] Calvetti, D., Pragliola, M. and Somersalo, E. Sparsity promoting hybrid solvers for hierarchical Bayesian inverse problems. arXiv:2003.06532, 2020.
[7] Calvetti, D., Pragliola, M., Somersalo, E. and Strang, A. Sparse reconstructions from few noisy data: analysis of hierarchical Bayesian models with generalized gamma hyperpriors. Inverse Problems, 36(2), 2020.
[8] Calvetti, D., Somersalo, E. and Strang, A. Hierachical Bayesian models and sparsity: ℓ2-magic. Inverse Problems, 35(3), 2019.
[9] Candes, E. J., Romberg, J. and Tao, T. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489-509, 2006.
[10] Candes, E. J., Eldar, Y. C., Needell, D. and Randall, P. Compressed sensing with coherent and redundant dictionaries. Applied and Computational Harmonic Analysis, 31(1):59-73, 2011.
[11] Chambolle, A., Holler, M. and Pock, T. A convex variational model for learning convolutional image atoms from incomplete data. Journal of Mathematical Imaging and Vision, 62:417-444, 2020.
[12] Chen, G. and Needell, D. Compressed sensing and dictionary learning. Finite Frame Theory, Proceedings of Symposia in Applied Mathematics, 73:201-241, 2016.
[13] Chen, S. S., Donoho, D. L. and Saunders, M. A. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33-61, 1998.
[14] Ma, D., Gulani, V., Seiberlich, N., Liu, K., Sunshine, J. L., Duerk, J. L. and Griswold, M. A. Magnetic resonance fingerprinting. Nature, 495(7440):187-192, 2013.
[15] Mallat, S. G. and Zhang, Z. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.
[16] Rubinstein, R., Bruckstein, A. M. and Elad, M. Dictionaries for sparse representation modeling. Proceedings of the IEEE, 98(6):1045-1057, 2010.
[17] Starck, J., Fadili, J. and Murtagh, F. The undecimated wavelet decomposition and its reconstruction. IEEE Transactions on Image Processing, 16(2):297-309, 2007.
[18] Starck, J. L., Elad, M. and Donoho, D. Image decomposition via the combination of sparse representations and a variational approach. IEEE Transactions on Image Processing, 14(10):1570-1582, 2005.