DAEs for Linear Inverse Problems: Improved Recovery with Provable Guarantees
Jasjeet Dhaliwal, Kyle Hambrook
Department of Mathematics, San Jose State University, San Jose, California. Correspondence to: Jasjeet Dhaliwal <[email protected]>, Kyle Hambrook <[email protected]>.

Abstract
Generative priors have been shown to provide improved results over sparsity priors in linear inverse problems. However, current state of the art methods suffer from one or more of the following drawbacks: (a) speed of recovery is slow; (b) reconstruction quality is deficient; (c) reconstruction quality is contingent on a computationally expensive process of tuning hyperparameters. In this work, we address these issues by utilizing Denoising Auto Encoders (DAEs) as priors and a projected gradient descent algorithm for recovering the original signal. We provide rigorous theoretical guarantees for our method and experimentally demonstrate its superiority over existing state of the art methods in compressive sensing, inpainting, and super-resolution. We find that our algorithm speeds up recovery by two orders of magnitude (over 100x), improves quality of reconstruction by an order of magnitude (over 10x), and does not require tuning hyperparameters.
1. Introduction
Linear inverse problems can be formulated mathematically as y = Ax + e, where y ∈ R^m is the observed vector, A ∈ R^{m×N} is the measurement process, e ∈ R^m is a noise vector, and x ∈ R^N is the original signal. The problem is to recover the signal x, given the observation y and the measurement matrix A. Such problems arise naturally in a wide variety of fields including image processing, seismic and medical tomography, geophysics, and magnetic resonance imaging. In this paper, we focus on three linear inverse problems encountered in image processing: compressive sensing, inpainting, and super-resolution. We motivate our method using the compressive sensing problem.
Figure 1. Compressive sensing (w/out noise) on CelebA for m = 1000. DAE-PGD reconstructions show a 10x improvement in quality.

Sparsity Prior
The problem of compressive sensing assumes the matrix A ∈ R^{m×N} is fat, i.e., m < N. Even when no noise is present (y = Ax), the system is underdetermined and the recovery problem is intractable. However, it has been shown that if the matrix A satisfies certain conditions such as the Restricted Isometry Property (RIP) and if x is known to be approximately sparse in some fixed basis, then x can typically be recovered even when m ≪ N (Tibshirani, 1996; Donoho et al., 2006; Candes et al., 2006). However, sparsity (or approximate sparsity) is a very restrictive condition to impose on the signal as it limits the applicability of recovery methods to a small subset of input domains. In order to ease this constraint, there has been considerable effort in using other forms of structured priors such as structured sparsity (Baraniuk et al., 2010), sparsity in tree-structured dictionaries (Peyre, 2010), and low-rank mixture of Gaussians (Chen et al., 2010). Although these efforts improve on the sparsity prior, they do not cater to signals that are not naturally sparse or structured-sparse.
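To make the sparsity-prior approach concrete, the following is a minimal sketch of sparse recovery with Lasso in a DCT basis, in the spirit of the L-DCT baseline used later in the experiments. The problem sizes, sparsity level, and regularization weight are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.fft import idct
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, m, s = 256, 80, 10                                # ambient dim, measurements, sparsity

Psi = idct(np.eye(N), axis=0, norm="ortho")          # columns = orthonormal DCT basis vectors
coeffs = np.zeros(N)
coeffs[rng.choice(N, s, replace=False)] = rng.normal(size=s)
x = Psi @ coeffs                                     # signal sparse in the DCT basis

A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, N))  # random Gaussian measurement matrix
y = A @ x

# Lasso on the composed design A @ Psi recovers the sparse DCT coefficients.
lasso = Lasso(alpha=1e-4, fit_intercept=False, max_iter=50_000)
lasso.fit(A @ Psi, y)
x_hat = Psi @ lasso.coef_
print("relative error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
```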
Generative Prior
Bora et al. (Bora et al., 2017) address this issue by replacing the sparsity prior on x with a generative prior. In particular, the authors first train a generative model f : R^k → R^N with k < N that maps a lower dimensional latent space to the higher dimensional ambient space. This model is referred to as the generator. Next, they impose the prior that the original signal x lies in (or near) the range of f. Hence, the recovery problem reduces to finding the best approximation to x in f(R^k).

It is crucial to note that the quality of the generative prior depends on how well the training set captures the data distribution. Bora et al. (Bora et al., 2017) used a Generative Adversarial Network (GAN) as the generator, G : R^k → R^N, where k < N, to model the distribution of the training data and posed the following non-convex optimization problem

  ẑ = arg min_{z ∈ R^k} ( ‖AG(z) − y‖² + λ‖z‖² )

such that G(ẑ) is treated as the approximation to x. The authors provided recovery guarantees for their method and validated the efficacy of using generative priors by showing that their method required 5-10x fewer measurements than Lasso (with a sparsity constraint) (Tibshirani, 1996) while yielding the same accuracy in recovery. However, since the problem is non-convex and requires a search over R^k, it is computationally expensive and the reconstruction quality depends on the initialization vector z ∈ R^k.

Since then, there have been significant efforts to improve recovery results using neural networks as generative priors (Adler & Öktem, 2017; Fan et al., 2017; Gupta et al., 2018; Liu et al., 2017; Mardani et al., 2018; Metzler et al., 2017; Mousavi et al., 2017; Rick Chang et al., 2017a; Shah & Hegde, 2018; Yeh et al., 2017; Raj et al., 2019; Heckel & Hand, 2018). Shah et al. (Shah & Hegde, 2018) extended the work of (Bora et al., 2017) by training a generator G and using a projected gradient descent algorithm that consists of a gradient descent step w_t = x_t − ηAᵀ(Ax_t − y) followed by a projection step x_{t+1} = G(arg min_{z ∈ R^k} ‖G(z) − w_t‖). The core idea is that the estimate w_t is improved by projecting it onto the range of G. However, since their method requires solving a non-convex optimization problem at every update step, it also leads to slow recovery.

Raj et al. (Raj et al., 2019) enhanced the results of (Shah & Hegde, 2018) by replacing the expensive non-convex optimization based projection step with one that is an order of magnitude cheaper. In particular, they trained a GAN G to model the data distribution and also trained a pseudo-inverse GAN G‡ that learned a mapping from the ambient space to the latent space. Next, they used the projection step x_{t+1} = G(G‡(w_t)). By eliminating the need to solve a non-convex optimization problem to update x_{t+1}, they were able to attain a significant speed up in the running time of the recovery algorithm.

However, the recovery algorithm of (Raj et al., 2019) has two main drawbacks. First, training two networks, G and G‡, makes the training process and the projection step unnecessarily convoluted. Second, their recovery guarantees only hold when the learning rate η = β, where β is a RIP-style constant of the matrix A. Since it is NP-hard to estimate the constant β (Bandeira et al., 2013), it follows that setting η = β is NP-hard as well. (We observed this problem when trying to reproduce the experimental results of (Raj et al., 2019). Specifically, we tried an exhaustive grid-search for η, but each value led to poor reconstruction quality.)
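For concreteness, here is a toy sketch of the CSGM-style recovery: gradient descent over the latent vector z on ‖AG(z) − y‖² + λ‖z‖². In Bora et al. the generator G is a trained GAN and gradients come from backpropagation; a fixed linear map stands in for G here so the example is self-contained, and all sizes and the step-size rule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
k, N, m = 5, 50, 20
W = rng.normal(size=(N, k))
G = lambda z: W @ z                                   # stand-in "generator"
A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, N))   # random Gaussian measurements
x_true = G(rng.normal(size=k))                        # signal assumed to lie in range(G)
y = A @ x_true

lam = 1e-4
lr = 0.9 / np.linalg.norm(A @ W, 2) ** 2              # conservative step size for this toy
z = rng.normal(size=k)
for _ in range(500):
    grad = 2 * W.T @ (A.T @ (A @ G(z) - y)) + 2 * lam * z   # chain rule through linear G
    z -= lr * grad
x_hat = G(z)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```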
DAE Prior

In an effort to address the aforementioned issues, we propose to use a DAE (Vincent et al., 2008) prior in lieu of the generative prior introduced by Bora et al. (Bora et al., 2017). It has previously been shown that DAEs not only capture useful structure of the data distribution (Vincent et al., 2010) but also implicitly capture properties of the data-generating density (Alain & Bengio, 2014; Bengio et al., 2013). Moreover, as DAEs are trained to remove noise from vectors sampled from the input distribution, they integrate naturally with gradient descent algorithms that lead to noisy approximations at each time step. In consideration of the above, we hypothesize that DAEs are viable candidates for projection operators in a gradient descent based recovery algorithm.

We therefore replace the generator G used in Bora et al. (Bora et al., 2017) with a DAE F : R^N → R^N such that the range of F contains the vectors from the original data generating distribution. We then impose the prior that the original signal x lies in the range of F and utilize Algorithm 1 to recover an approximation to x. We provide theoretical recovery guarantees and find that our framework is able to address the shortcomings of previous works noted above. Our contributions can be summarized as:

• We provide rigorous theoretical guarantees for convergence in Algorithm 1.
• We experimentally demonstrate orders of magnitude (over 100x) speed up in recovery compared to state of the art methods.
• We experimentally demonstrate order of magnitude (over 10x) improvement in recovery quality compared to state of the art methods.
2. Algorithm and Results
Given a vector x ∈ R^N, we use ‖x‖ to denote the ℓ₂-norm of x. Similarly, for a matrix A ∈ R^{m×N}, ‖A‖ denotes the matrix norm induced by the ℓ₂-norm.

A DAE is a non-linear mapping F : R^N → R^N that can be written as a composition of two non-linear mappings: an encoder E : R^N → R^k, where k < N, and a decoder D : R^k → R^N. Therefore, F(x) = (D ∘ E)(x). Given a set of n samples {x_i}_{i=1}^n from a domain of interest, the training set X is created by adding Gaussian noise to the original samples. That is, X = {x'_i}_{i=1}^n, where x'_i = x_i + e_i and e_i ∼ N(μ_i, σ_i). The loss function for training F is the Mean Squared Error (MSE) loss defined as

  L_F(X) = (1/n) Σ_{i=1}^n ‖F(x'_i) − x_i‖².

The training procedure uses gradient descent to minimize L_F(X) with back-propagation.

Recall that in the linear inverse problem y = Ax + e our goal is to recover an approximation x̂ to x such that x̂ lies in the range of F. Thus we aim to find x̂ such that

  x̂ = arg min_{z ∈ F(R^N)} ‖Az − y‖.

As in (Shah & Hegde, 2018; Raj et al., 2019), we use a projected gradient descent algorithm. Given an estimate x_t at iteration t, we compute a gradient descent step for solving the unrestricted problem minimize_{z ∈ R^N} ‖Az − y‖² as

  w_t ← x_t − ηAᵀ(Ax_t − y).

Next we project w_t onto the range of F to satisfy our prior:

  x_{t+1} = F(w_t).

Note that, compared to (Shah & Hegde, 2018; Raj et al., 2019), the projection step does not require solving a non-convex optimization problem.

Now suppose that the domain of interest is represented by the set D ⊆ R^N. Then, given a vector x' = x + e, where x ∈ D and e ∈ R^N is an unknown noise vector, the success of our method depends on how small the error ‖F(x') − x‖ is. If the training set X captures the domain of interest well and if the training procedure utilizes a diverse enough set of noise vectors {e_i}_{i=1}^n, then we expect ‖F(x') − x‖ to be small. Consequently, we expect the projection step of Algorithm 1 to yield vectors in or close to D. We provide the complete algorithm below.
Algorithm 1: DAE-PGD
Input: y ∈ R^m, A ∈ R^{m×N}, f : R^N → R^N, T ∈ Z⁺, η ∈ R_{>0}
Output: x_T
  t ← 0, x_0 ← 0
  while t < T do
    w_t ← x_t − ηAᵀ(Ax_t − y)
    x_{t+1} ← f(w_t)
    t ← t + 1
  return x_T
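The following is a minimal NumPy sketch of Algorithm 1. To keep it self-contained and easy to verify, the trained DAE F is replaced by an exact projector onto a known subspace S; all problem sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, k = 400, 200, 8

B, _ = np.linalg.qr(rng.normal(size=(N, k)))          # orthonormal basis of S
project = lambda w: B @ (B.T @ w)                     # stand-in for the DAE F

A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, N))   # random Gaussian measurements
x_true = B @ rng.normal(size=k)                       # original signal lies in S
y = A @ x_true                                        # noiseless observation

def dae_pgd(y, A, f, T=100, eta=1.0):
    """Algorithm 1: gradient step on ||Az - y||^2 followed by projection via f."""
    x = np.zeros(A.shape[1])
    for _ in range(T):
        w = x - eta * A.T @ (A @ x - y)               # gradient descent step
        x = f(w)                                      # projection step x_{t+1} = f(w_t)
    return x

x_hat = dae_pgd(y, A, project)
print("recovery error:", np.linalg.norm(x_hat - x_true))
```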
Figure 2. Compressive sensing (w/out noise) on CelebA for various m. Reconstructions capture finer grained details as m increases.

We begin by introducing two standard definitions required to provide recovery guarantees.
Definition 1 (RIP(S, δ)). Given S ⊆ R^N and δ > 0, a matrix A ∈ R^{m×N} satisfies the RIP(S, δ) property if

  (1 − δ)‖x₁ − x₂‖² ≤ ‖A(x₁ − x₂)‖² ≤ (1 + δ)‖x₁ − x₂‖²

for all x₁, x₂ ∈ S.

A variation of the RIP(S, δ) property for sparse vectors was first introduced by Candes et al. in (Candes & Tao, 2005) and has been shown to be a sufficient condition in proving recovery guarantees using ℓ₁-minimization methods (Foucart & Rauhut, 2017).
Next, we define an Approximate Projection (AP) property and provide an interpretation that elucidates its role in the results of Theorem 3.

Definition 2 (AP(S, α)). Let α ≥ 0. A mapping f : R^N → S ⊆ R^N satisfies AP(S, α) if

  ‖w − f(w)‖² ≤ ‖w − x‖² + α‖f(w) − x‖

for every w ∈ R^N and x ∈ S.

We now explain the significance of Def. 2. Let x* = arg min_{z ∈ S} ‖w − z‖ and observe

  ‖w − f(w)‖² ≤ (‖w − x*‖ + ‖f(w) − x*‖)².    (1)

Hence, α ≤ ‖f(w) − x*‖ + 2‖w − x*‖ is needed to ensure the RHS of Def. 2 is bounded by the RHS of (1). In other words, for α to be small, the projection error ‖f(w) − x*‖ as well as the distance of w to S need to be small. Since the DAE F learns to minimize ‖F(w) − x*‖ (Section 2), we expect a small projection error. Moreover, if the image of F approximates the data distribution well, we expect a small value for ‖w − x*‖ at every gradient descent step of Algorithm 1. (Various flavors of the AP(S, α) property have been used in previous works, such as Shah et al. (Shah & Hegde, 2018) and Raj et al. (Raj et al., 2019). A small value for α is verified experimentally by using the results of Theorem 3 and observing small recovery error in our experiments.)
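As a sanity check on Definition 2, the snippet below verifies numerically that an exact orthogonal projection onto a subspace S satisfies AP(S, α) with α = 0; the trained DAE only approximates such a projection, which is what a small positive α encodes. The sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 50, 5
B, _ = np.linalg.qr(rng.normal(size=(N, k)))   # orthonormal basis of S
f = lambda w: B @ (B.T @ w)                    # exact projector onto S

violations = 0
for _ in range(1000):
    w = rng.normal(size=N)
    x = B @ rng.normal(size=k)                 # arbitrary point of S
    lhs = np.linalg.norm(w - f(w)) ** 2
    rhs = np.linalg.norm(w - x) ** 2           # alpha = 0, so the alpha term vanishes
    violations += lhs > rhs + 1e-9
print("violations of Def. 2 with alpha = 0:", violations)   # expected: 0
```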
Theorem 3. Let f : R^N → S ⊆ R^N satisfy AP(S, α) and let A ∈ R^{m×N} be a matrix with ‖A‖ ≤ M that satisfies RIP(S, δ). If y = Ax with x ∈ S, the recovery error of Algorithm 1 is bounded as

  ‖x_T − x‖ ≤ (2γ)^T ‖x_0 − x‖ + α (1 − (2γ)^T) / (1 − 2γ),    (2)

where γ = √( η²M²(1 + δ) + 2η(δ − 1) + 1 ).
Proof of Theorem 3. Using the notation of Algorithm 1 and the fact that f satisfies AP(S, α) we have

  ‖(w_t − x) − (x_{t+1} − x)‖² ≤ ‖w_t − x‖² + α‖x_{t+1} − x‖.

Noting ‖a − b‖² = ‖a‖² + ‖b‖² − 2⟨a, b⟩ and re-arranging terms we get

  ‖x_{t+1} − x‖² ≤ 2⟨w_t − x, x_{t+1} − x⟩ + α‖x_{t+1} − x‖.

Now we expand the inner product using w_t = x_t − ηAᵀ(Ax_t − y) and y = Ax to get

  ‖x_{t+1} − x‖² ≤ 2⟨(I − ηAᵀA)(x_t − x), x_{t+1} − x⟩ + α‖x_{t+1} − x‖.    (3)

Using the Cauchy–Schwarz inequality we have

  |⟨(I − ηAᵀA)(x_t − x), x_{t+1} − x⟩| ≤ ‖(I − ηAᵀA)(x_t − x)‖ ‖x_{t+1} − x‖.    (4)

By setting u = x_t − x, expanding, and using the RIP(S, δ) property of A, we see that

  ‖(I − ηAᵀA)u‖² = ‖u‖² − 2η‖Au‖² + η²‖Aᵀ(Au)‖²
                 ≤ ‖u‖² − 2η(1 − δ)‖u‖² + η²(1 + δ)M²‖u‖² = γ²‖u‖².    (5)

We substitute the results of (4) and (5) into (3) and divide both sides by ‖x_{t+1} − x‖ to get

  ‖x_{t+1} − x‖ ≤ 2γ‖x_t − x‖ + α.    (6)

Using induction on (6) gives (2).

Theorem 3 tells us that, if 2γ < 1, then for large T, the recovery error is essentially α/(1 − 2γ). Note that the requirement 2γ < 1 is satisfied for a large range of values of η as long as δ is sufficiently small (for instance, random Gaussian matrices yield small values of δ with high probability (Foucart & Rauhut, 2017)). Hence, as long as the value of α is small, we expect to see a small recovery error.

We now compare the above results to Theorem 1 of (Raj et al., 2019) and Theorem 2.2 of (Shah & Hegde, 2018). As mentioned in Section 1, convergence in Theorem 1 of (Raj et al., 2019) is only guaranteed when η = β, which is a much more restrictive condition on η than Theorem 3 provides. In fact, β is a RIP-style constant that is NP-hard to find (Bandeira et al., 2013), which makes setting the value η = β NP-hard as well. Even though the results of Theorem 2.2 from (Shah & Hegde, 2018) require a less restrictive constraint on η, their guarantees only hold for random Gaussian matrices. Moreover, they require ‖A‖ ≤ ω, where ω is a RIP-style constant for A. Once again, it is NP-hard to estimate ω, hence making the constraint very strict. In contrast, the results of Theorem 3 apply to arbitrary matrices that satisfy the RIP(S, δ) property and without imposing a strict condition on ‖A‖.
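The snippet below evaluates the convergence factor γ from Theorem 3 and the resulting error floor α/(1 − 2γ); the values of η, M, δ, and α are illustrative assumptions, not quantities measured in the paper.

```python
import numpy as np

def gamma(eta, M, delta):
    """Convergence factor gamma from Theorem 3."""
    return np.sqrt(eta**2 * M**2 * (1 + delta) + 2 * eta * (delta - 1) + 1)

eta, M, delta, alpha = 1.0, 1.0, 0.05, 1e-3    # illustrative values
g = gamma(eta, M, delta)
if 2 * g < 1:
    print(f"2*gamma = {2 * g:.3f} < 1, asymptotic error floor ~ {alpha / (1 - 2 * g):.4f}")
else:
    print(f"2*gamma = {2 * g:.3f} >= 1, the bound in (2) does not contract")
```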
3. Experiments
We provide experimental results for the problems of compressive sensing, inpainting, and super-resolution. We refer to the results of Algorithm 1 as DAE-PGD and compare its results to the methods of Bora et al. (Bora et al., 2017), which we refer to as CSGM, and Shah et al. (Shah & Hegde, 2018), which we refer to as PGD-GAN.
Figure 3. Compressive sensing (with noise) on CelebA for m = 1000. DAE-PGD reconstructions show a 10x improvement in quality.

Although the work of Raj et al. (Raj et al., 2019) is the closest to our method, we do not include comparisons to their work as we were unable to reproduce their results. (We used their code, their trained models, their recovery algorithm, and a grid search for η, but the reconstructed images were of very poor quality. We also reached out to the authors, but they did not have the exact values of η that were used in their experiments.) PGD-GAN results are only provided for compressive sensing on CelebA, as per (Shah & Hegde, 2018).

Our experiments are conducted on the MNIST (LeCun) and CelebA (Liu et al., 2015) datasets. The MNIST dataset consists of 28 × 28 greyscale images of digits with 50,000 training and 10,000 test samples. We report results for a random subset of the test set. The CelebA dataset consists of more than 200,000 celebrity images. We pre-process each image to a size of 64 × 64 × 3 and use the first 160,000 images as the training set and a random subset of the remaining 40,000+ images as the test set.
Network Architecture

The network architectures for our DAEs are inspired by the Variational Auto Encoder architecture from Fig. 2 of (Hou et al., 2017) with a few key changes. We replace the Leaky ReLU activation with ReLU, we add the two outputs of the encoder to get the latent representation z, and we alter the kernel sizes as well as the convolution strides of the network as described in Table 2.
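The Keras sketch below only illustrates the overall shape of such a DAE (strided convolutional encoder, two dense heads summed to form the latent code z, transposed-convolution decoder); the channel widths, kernel sizes, latent dimension, and output activation here are assumptions, not the values from Table 2.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dae(img_shape=(64, 64, 3), k=128):
    # Encoder: strided convolutions, then two Dense heads whose outputs are
    # summed to form z (replacing the VAE's sampled latent).
    inp = keras.Input(shape=img_shape)
    h = inp
    for ch in (32, 64, 128, 256):                       # assumed channel widths
        h = layers.Conv2D(ch, 4, strides=2, padding="same", activation="relu")(h)
    h = layers.Flatten()(h)
    z = layers.Add()([layers.Dense(k)(h), layers.Dense(k)(h)])

    # Decoder: mirror of the encoder with transposed convolutions.
    g = layers.Dense(4 * 4 * 256, activation="relu")(z)
    g = layers.Reshape((4, 4, 256))(g)
    for ch in (128, 64, 32):
        g = layers.Conv2DTranspose(ch, 4, strides=2, padding="same", activation="relu")(g)
    out = layers.Conv2DTranspose(img_shape[-1], 4, strides=2, padding="same",
                                 activation="sigmoid")(g)
    return keras.Model(inp, out)

model = build_dae()
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss="mse")
```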
Training

We use the Adam optimizer (Kingma & Ba, 2014) to minimize the MSE loss function with learning rate 0.01 and a batch size of 128. We train the CelebA network for 400 epochs and the MNIST network for 100 epochs.

In an effort to ensure that the error ‖F(x') − x‖ defined in Section 2 is small, we split the training set into 5 equal sized subsets. For each distinct subset, we sample the noise vectors from a Gaussian distribution N(μ, σ) with a distinct value of σ for each subset.

All of our experiments were conducted on a Tesla M40 GPU with 12 GB of memory using the Keras (Chollet, 2015) and Tensorflow (Abadi et al., 2015) libraries. The code to reproduce our results is available here.

We consider the problem of compressive sensing without noise, y = Ax, and with noise, y = Ax + e, with e drawn from a zero-mean Gaussian. We use m to denote the number of observed measurements in our results (i.e., y ∈ R^m). As done in previous works (Bora et al., 2017; Shah & Hegde, 2018; Raj et al., 2019), the matrix A ∈ R^{m×N} is chosen to be a random Gaussian matrix with A_ij ∼ N(0, 1/m). Finally, we set the learning rate of Algorithm 1 as η = 1. Note that in both (with and w/out noise) cases, we also include recovery results for the Lasso algorithm (Tibshirani, 1996) with a DCT basis (L-DCT) and with a wavelet basis (L-Wavelet).

We begin with CelebA without noise. Figure 1 provides a qualitative comparison of reconstruction results for m = 1000. We observe that DAE-PGD provides the best quality reconstructions and is able to reproduce even fine grained details of the original images such as eyes, nose, lips, hair, texture, etc. Indeed, the high quality reconstructions support the case that the DAE has a small α as per Def. 2. For a quantitative comparison, we turn to Figure 4, which plots the average squared reconstruction error ‖x − x̂‖² for each algorithm at different values of m. Note that DAE-PGD provides more than 10x improvement in the squared reconstruction error.

In order to capture how the quality of reconstruction degrades as the number of measurements decreases, we add Figure 2, which shows reconstructions for different values of m. We observe that even though reconstructions with a small number of measurements capture the essence of the original images, the fine grained details are captured only as the number of measurements increases.

We now turn to the speed of reconstruction. Table 1 shows that our method provides speedups of over 100x as compared to PGD-GAN and CSGM (CSGM is executed for 500 max iterations with 2 restarts, and PGD-GAN is executed for 100 max iterations and 1 restart).
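For concreteness, the snippet below sketches the subset-wise noise construction for the DAE training set described under Training above; the σ values and array shapes are placeholders, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.random((50_000, 28 * 28))        # stand-in for the clean training images
sigmas = [0.05, 0.1, 0.2, 0.3, 0.4]          # placeholder noise levels, one per subset

# Split the training set into 5 equal subsets and corrupt each with its own sigma.
subsets = np.array_split(clean, len(sigmas))
noisy = np.concatenate(
    [xs + rng.normal(scale=s, size=xs.shape) for xs, s in zip(subsets, sigmas)]
)
# The DAE is then trained to map each noisy image back to its clean counterpart,
# e.g. model.fit(noisy, clean, batch_size=128, epochs=100) with an MSE loss.
```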
Figure 4. Compressive sensing recovery error ‖x − x̂‖². Left: CelebA without noise; DAE-PGD shows over 10x improvement. Middle: CelebA with noise; DAE-PGD shows over 10x improvement. Right: MNIST without noise; DAE-PGD beats CSGM for sufficiently large m.
Figure 5. Compressive sensing on MNIST. Left: reconstructions for m = 100 without noise. Middle: DAE-PGD reconstructions for different m. Right: reconstructions for m = 100 with noise. DAE-PGD and CSGM yield high reconstruction quality for the no-noise case, but DAE-PGD outperforms both Lasso and CSGM in the presence of noise.

m    | CSGM   | PGD-GAN | DAE-PGD | Speedup
250  | 53.78  | 48.40   | 0.07    | 692x
500  | 59.81  | 48.46   | 0.09    | 538x
1000 | 81.08  | 48.46   | 0.11    | 440x
1500 | 92.68  | 48.50   | 0.14    | 346x
2000 | 107.41 | 48.56   | 0.21    | 230x

Table 1. Average running times (in seconds) for the compressive sensing problem (w/out noise) on the CelebA dataset.
Next, we provide qualitative reconstruction results for CelebA with additive noise in Figure 3 and note that DAE-PGD clearly outperforms the other methods. Moreover, we find that the reconstructions of DAE-PGD once again capture fine-grained details despite the presence of noise in the measurements. We perform a similar comparison for the MNIST dataset and report results in Figures 4 and 5.
Inpainting is the problem of recovering the original image, given an occluded version of it. Specifically, the observed image y consists of occluded (or masked) regions created by applying a pixel-wise mask A to the original image x.
We use m to refer to the size of the mask that occludes an m × m region of the original image x. We present recovery results for CelebA with m = 10 in Figure 6 and observe that DAE-PGD is able to recover a high quality approximation to the original image and outperforms CSGM in all cases. Figure 6 also captures how recovery is affected by different mask sizes. As in the compressive sensing problem, we find that DAE-PGD reconstructions capture the fine-grained details of each image. Figure 6 also reports the results for the MNIST dataset. Even though DAE-PGD outperforms CSGM, we see that the recovery quality of DAE-PGD degrades considerably when m = 15. We hypothesize this is due to the structure of MNIST images. In particular, since MNIST images are grayscale with most of the pixels being black, putting a 15 × 15 black patch on the small area displaying the digit makes the reconstruction problem considerably more difficult. This causes considerable degradation in reconstruction quality for larger mask sizes.
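A minimal sketch of the inpainting measurement: a pixel-wise binary mask that zeros out an m × m block of the image. The image size and mask location below are illustrative assumptions.

```python
import numpy as np

def occlusion_mask(height, width, m, top, left):
    """Binary mask with an m x m block of zeros starting at (top, left)."""
    mask = np.ones((height, width))
    mask[top:top + m, left:left + m] = 0.0
    return mask

img = np.random.default_rng(0).random((64, 64, 3))   # stand-in image
mask = occlusion_mask(64, 64, m=10, top=27, left=27)
y = img * mask[..., None]                             # observed, occluded image
```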
Figure 6. Inpainting. Left: CelebA reconstructions for m = 10. Middle-Left: DAE-PGD CelebA reconstructions for different m. Middle-Right: MNIST reconstructions for m = 5. Right: DAE-PGD MNIST reconstructions for different m.

Super-resolution is the problem of recovering the original image from a smaller and lower-resolution version. We create this smaller and lower-resolution image by taking the spatial averages of f × f pixel values, where f is the ratio of downsampling. This amounts to blurring each f × f region followed by downsampling the image. We test our algorithm with several values of f (including f = 2 and f = 4), each corresponding to a proportionally smaller image size.

The reconstruction results are provided in Figure 7. We see that DAE-PGD provides higher quality reconstruction for f = 2 for both CelebA and MNIST. Moreover, reconstruction quality degrades gracefully for CelebA for increasing values of f. However, in the case of MNIST, reconstruction quality degrades considerably when f = 4. Noting that f = 4 only gives 16 measurements (i.e., y ∈ R^16), we hypothesize that the measurements may not contain enough signal to accurately reconstruct the original images. (Compare compressive sensing with sparsity constraints, where recovery guarantees hold when m ≥ Cs ln(N/s) (Foucart & Rauhut, 2017).)
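A minimal sketch of the super-resolution measurement described above: each non-overlapping f × f block is replaced by its average (blur followed by downsampling). Sizes are illustrative, and the example is shown for a single-channel image.

```python
import numpy as np

def block_average(img, f):
    """Downsample a (H, W) image by averaging non-overlapping f x f blocks."""
    H, W = img.shape
    assert H % f == 0 and W % f == 0
    return img.reshape(H // f, f, W // f, f).mean(axis=(1, 3))

x = np.random.default_rng(0).random((28, 28))   # stand-in MNIST-sized image
y = block_average(x, f=2)                        # 14 x 14 low-resolution observation
```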
4. Related Work
Compressive Sensing
The field of compressive sensing was essentially initiated with the work of (Candès et al., 2006) and (Donoho et al., 2006), which provided recovery results for sparse signals with a random measurement matrix. Some of the earlier work in extending compressive sensing to perform stable recovery with deterministic matrices was done by (Candes & Tao, 2005) and (Candes et al., 2006), where a sufficient condition for recovery was satisfaction of a restricted isometry hypothesis. (Blumensath & Davies, 2009) introduced IHT (Iterative Hard Thresholding) as an algorithm to recover sparse signals, which was later modified in (Baraniuk et al., 2010) to reduce the search space as long as the sparsity was structured.
Generative Priors
Following the lead of (Bora et al., 2017), there have been significant efforts to improve on previous recovery results using neural networks as generative models (Adler & Öktem, 2017; Fan et al., 2017; Gupta et al., 2018; Liu et al., 2017; Mardani et al., 2018; Metzler et al., 2017; Mousavi et al., 2017; Rick Chang et al., 2017a; Shah & Hegde, 2018; Yeh et al., 2017; Raj et al., 2019; Heckel & Hand, 2018). One line of work (Jagatap & Hegde, 2019; Heckel & Hand, 2018) extends the efforts of Bora et al. (Bora et al., 2017) by using an untrained neural network G and solving the CSGM optimization problem stated in Section 1. However, the optimization problem is highly non-convex and requires a large number of iterations with multiple restarts. Another line of work (Mousavi et al., 2017; Mousavi & Baraniuk, 2017) trains a neural network to model the transformation f(y) = x̂, where x̂ is the approximation to the original input x. This approach is limited because (a) the inverse mapping is non-trivial to learn and (b) it only works for a fixed measurement mechanism. Peng et al. (Peng et al., 2020) follow the projected gradient descent method of (Shah & Hegde, 2018) and replace the inner optimization step by two projection steps: (1) mapping the approximation of the gradient descent step into a latent space; (2) mapping the latent space vector back to the original space.

DAEs in Linear Inverse Problems
DAEs have been previously used in image processing tasks such as image denoising (Wang et al., 2018; Guo et al., 2019; Rick Chang et al., 2017b) and image super-resolution (Sønderby et al., 2016) to yield good results. However, these approaches utilize different recovery algorithms and none impose the DAE prior. Wang et al. (Wang et al., 2018) use gradient descent to minimize the mean shift vector and the mean squared error at each update step. Guo et al. (Guo et al., 2019) utilize an Expectation Maximization style update step at each iteration to recover the original signal. Sønderby et al. (Sønderby et al., 2016) deploy Bayes-optimal denoising to take a gradient step along the log-probability of the data distribution. Chang et al. (Rick Chang et al., 2017b) use the alternating direction method of multipliers to solve a Lagrangian formulation that involves the prior as a constraint.
5. Conclusion
We introduced DAEs as priors for general linear inverse problems and provided experimental results for the problems of compressive sensing, inpainting, and super-resolution on the CelebA and MNIST datasets. Utilizing a projected gradient descent algorithm for recovery, we provided rigorous theoretical guarantees for our framework and showed that our recovery algorithm does not impose strict constraints on the learning rate and hence eliminates the need to tune hyperparameters. We compared our framework to state of the art methods experimentally and found that our recovery algorithm provided a speed up of over two orders of magnitude and an order of magnitude improvement in reconstruction quality.
Figure 7. Super-resolution. Left: CelebA reconstructions for f = 2. Middle-Left: DAE-PGD CelebA reconstructions for different f. Middle-Right: MNIST reconstructions for f = 2. Right: DAE-PGD MNIST reconstructions for different f.

Table 2. Network architectures for CelebA and MNIST. C-K, C-S, M-K, and M-S report CelebA kernel sizes, CelebA strides, MNIST kernel sizes, and MNIST strides, respectively.
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
Adler, J. and Öktem, O. Solving ill-posed inverse problems using iterative deep neural networks. Inverse Problems, 33(12):124007, 2017.
Alain, G. and Bengio, Y. What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research, 15(1):3563–3593, 2014.
Bandeira, A. S., Dobriban, E., Mixon, D. G., and Sawin, W. F. Certifying the restricted isometry property is hard. IEEE Transactions on Information Theory, 59(6):3448–3450, 2013.
Baraniuk, R. G., Cevher, V., Duarte, M. F., and Hegde, C. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.
Bengio, Y., Yao, L., Alain, G., and Vincent, P. Generalized denoising auto-encoders as generative models. Advances in Neural Information Processing Systems, 26:899–907, 2013.
Blumensath, T. and Davies, M. E. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
Bora, A., Jalal, A., Price, E., and Dimakis, A. G. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.
Candes, E. and Tao, T. Decoding by linear programming. arXiv preprint math/0502327, 2005.
Candès, E. J., Romberg, J., and Tao, T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
Candes, E. J., Romberg, J. K., and Tao, T. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8):1207–1223, 2006.
Chen, M., Silva, J., Paisley, J., Wang, C., Dunson, D., and Carin, L. Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds. IEEE Transactions on Signal Processing, 58(12):6140–6155, 2010.
Chollet, F. keras. https://github.com/fchollet/keras, 2015.
Donoho, D. L. et al. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
Fan, K., Wei, Q., Carin, L., and Heller, K. A. An inner-loop free solution to inverse problems using deep neural networks. Advances in Neural Information Processing Systems, 30:2370–2380, 2017.
Foucart, S. and Rauhut, H. A Mathematical Introduction to Compressive Sensing. 2017.
Guo, B., Han, Y., and Wen, J. AGEM: Solving linear inverse problems via deep priors and sampling. Advances in Neural Information Processing Systems, 32:547–558, 2019.
Gupta, H., Jin, K. H., Nguyen, H. Q., McCann, M. T., and Unser, M. CNN-based projected gradient descent for consistent CT image reconstruction. IEEE Transactions on Medical Imaging, 37(6):1440–1453, 2018.
Heckel, R. and Hand, P. Deep decoder: Concise image representations from untrained non-convolutional networks. arXiv preprint arXiv:1810.03982, 2018.
Hou, X., Shen, L., Sun, K., and Qiu, G. Deep feature consistent variational autoencoder. In , pp. 1133–1141. IEEE, 2017.
Jagatap, G. and Hegde, C. Algorithmic guarantees for inverse imaging with untrained network priors. In Advances in Neural Information Processing Systems, pp. 14832–14842, 2019.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
Liu, D., Wen, B., Liu, X., Wang, Z., and Huang, T. S. When image denoising meets high-level vision tasks: A deep learning approach. arXiv preprint arXiv:1706.04284, 2017.
Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.
Mardani, M., Sun, Q., Donoho, D., Papyan, V., Monajemi, H., Vasanawala, S., and Pauly, J. Neural proximal gradient descent for compressive imaging. In Advances in Neural Information Processing Systems, pp. 9573–9583, 2018.
Metzler, C., Mousavi, A., and Baraniuk, R. Learned D-AMP: Principled neural network based compressive image recovery. In Advances in Neural Information Processing Systems, pp. 1772–1783, 2017.
Mousavi, A. and Baraniuk, R. G. Learning to invert: Signal recovery via deep convolutional networks. In , pp. 2272–2276. IEEE, 2017.
Mousavi, A., Dasarathy, G., and Baraniuk, R. G. DeepCodec: Adaptive sensing and recovery via deep convolutional neural networks. arXiv preprint arXiv:1707.03386, 2017.
Peng, P., Jalali, S., and Yuan, X. Solving inverse problems via auto-encoders. IEEE Journal on Selected Areas in Information Theory, 1(1):312–323, 2020.
Peyre, G. Best basis compressed sensing. IEEE Transactions on Signal Processing, 58(5):2613–2622, 2010.
Raj, A., Li, Y., and Bresler, Y. GAN-based projector for faster recovery with convergence guarantees in linear inverse problems. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5602–5611, 2019.
Rick Chang, J., Li, C.-L., Poczos, B., Vijaya Kumar, B., and Sankaranarayanan, A. C. One network to solve them all – solving linear inverse problems using deep projection models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5888–5897, 2017a.
Rick Chang, J. H., Li, C.-L., Poczos, B., Vijaya Kumar, B. V. K., and Sankaranarayanan, A. C. One network to solve them all – solving linear inverse problems using deep projection models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017b.
Shah, V. and Hegde, C. Solving linear inverse problems using GAN priors: An algorithm with provable guarantees. In , pp. 4609–4613. IEEE, 2018.
Sønderby, C. K., Caballero, J., Theis, L., Shi, W., and Huszár, F. Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103, 2008.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A., and Bottou, L. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12), 2010.
Wang, Y., Liu, Q., Zhou, H., and Wang, Y. Learning multi-denoising autoencoding priors for image super-resolution. Journal of Visual Communication and Image Representation, 57:152–162, 2018.
Yeh, R. A., Chen, C., Yian Lim, T., Schwing, A. G., Hasegawa-Johnson, M., and Do, M. N. Semantic image inpainting with deep generative models. In