An Interpretation of Regularization by Denoising and its Application with the Back-Projected Fidelity Term
Einav Yogev-Ofer, Tom Tirer, and Raja Giryes
School of Electrical Engineering, Tel Aviv University
ABSTRACT
The vast majority of image recovery tasks are ill-posed problems. As such, methods that are based on optimization use cost functions that consist of both fidelity and prior (regularization) terms. A recent line of works imposes the prior by the Regularization by Denoising (RED) approach, which exploits the good performance of existing image denoising engines. Yet, the relation of RED to explicit prior terms is still not well understood, as previous work requires overly strong assumptions on the denoisers. In this paper, we make two contributions. First, we show that the RED gradient can be seen as a (sub)gradient of a prior function, but taken at a denoised version of the point. As RED is typically applied with a relatively small noise level, this interpretation indicates a similarity between RED and traditional gradients. This leads to our second contribution: we propose to combine RED with the Back-Projection (BP) fidelity term rather than the common Least Squares (LS) term that is used in previous works. We show that the advantages of BP over LS for image deblurring and super-resolution, which have been demonstrated for traditional gradients, carry on to the RED approach.
Index Terms: Inverse problems, image deblurring, super-resolution, Regularization by Denoising, Back-Projection
1. INTRODUCTION
Image recovery tasks aim to restore an original image from its observed degraded version. The observed image can be degraded in many ways, such as blurring, subsampling (lower resolution), noise addition, or all together. In many tasks, the observed image can be described by a linear model

$y = Ax^* + e,$  (1)

where $x^* \in \mathbb{R}^n$ is the original image, $y \in \mathbb{R}^m$ is the degraded observed image, $e \in \mathbb{R}^m$ is a noise vector, and $A \in \mathbb{R}^{m \times n}$ is the degradation matrix. For example, in the deblurring task $A$ represents the blurring operation, and in the super-resolution task $A$ is an operator that represents blurring (typically, anti-aliasing filtering) followed by subsampling.

Most of the methods for recovering $x^*$ from the observed $y$ are based on optimization. They involve minimization of a cost function that is composed of a data fidelity term and a prior term

$f(x) = \ell(x) + \lambda s(x),$  (2)

where $\ell(\cdot)$ is the fidelity term, $s(\cdot)$ is the prior term, and $\lambda$ is a positive parameter that controls the level of regularization. The fidelity term encourages complying with the observation model (1), while the prior represents assumptions on the original image, and is inevitable due to the ill-posed nature of image reconstruction tasks.

Over the years, different prior functions have been proposed, such as the $\ell_1$-norm of the wavelet coefficients vector [1, 2] and the Total Variation (TV) prior [3]. Interestingly, there are also very successful priors, such as BM3D [4], that are based on a series of operations rather than on an explicit prior function. Traditionally, a different algorithm has been designed for each task (structure of $A$ in (1)) and each prior $s(\cdot)$. However, it has been suggested in [5] to exploit the good performance of existing image denoisers for solving tasks other than denoising. This "plug-and-play denoisers" concept encourages one to use a denoiser for imposing the prior even if it is not explicitly clear what function $s(\cdot)$ is associated with this denoiser (as in the case of BM3D). Follow-up papers of [5] include [6-12], and many more.

A popular line of works (see, e.g., [8, 9, 13-17]) that follows this plug-and-play concept is based on the Regularization by Denoising (RED) approach. The original RED paper [8] proposed a gradient-based regularization of inverse problems using an off-the-shelf Gaussian denoiser $D(\cdot; \sigma)$ ($\sigma$ is the noise level of the denoiser, which is not necessarily the noise level in $y$). Specifically, within optimization schemes, it is proposed to replace $\nabla s(x)$ with the "RED gradient"

$g_{RED}(x; D) := x - D(x; \sigma).$  (3)

The paper [8] showed that under strong assumptions on the denoiser, which as pointed out in [13] include also a symmetric Jacobian condition, the expression in (3) is the gradient of the prior term

$s_{[8]}(x) = \tfrac{1}{2} x^T \left( x - D(x; \sigma) \right).$  (4)

However, as shown in [13], these assumptions do not hold for widely used denoisers. Therefore, using RED may be better understood as a prior that is imposed directly on the gradient of a given optimization scheme rather than on the optimization objective [9, 13].

In this paper, we show that the RED gradient can be seen as a (sub)gradient of a prior function (beyond the class of functions that can be expressed as (4)), but taken at a denoised version of the point. As RED is typically applied with a relatively small noise level, this interpretation indicates a similarity between the RED gradient and traditional gradients.
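To make (3) concrete, here is a minimal sketch of the RED gradient computation. It is our own illustration, not code from any cited work: `red_gradient` and `box_denoiser` are hypothetical names, and the box filter is a toy stand-in used only so the snippet runs; in practice one plugs in TV, BM3D, or a CNN denoiser.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def red_gradient(x, denoiser, sigma):
    # RED "gradient" from (3): the residual between x and its denoised version.
    return x - denoiser(x, sigma)

def box_denoiser(x, sigma):
    # Toy stand-in for an off-the-shelf Gaussian denoiser D(. ; sigma).
    # (A 3x3 box filter ignores sigma; real denoisers such as TV/BM3D use it.)
    return uniform_filter(x, size=3)

x = np.random.rand(32, 32)                    # a random "image"
g = red_gradient(x, box_denoiser, sigma=0.1)  # plug-and-play regularization direction
```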
Following this interpretation, we propose to combine RED with the Back-Projection (BP) fidelity term [18] rather than with the common Least Squares (LS) fidelity term that is used in previous works [8, 9, 13-17]. We show that the advantages of BP over LS for image deblurring and super-resolution, which have been demonstrated for traditional gradients [11, 18-21], carry on to the RED approach.
2. AN INTERPRETATION FOR RED
We start by providing an interpretation of RED that may give additional insight into its applicability and success. Note that a Gaussian denoiser $D(x; \sigma)$ of an image $x$, under the prior $s(\cdot)$ and noise level $\sigma > 0$, can be expressed as the minimizer of the following optimization problem

$D(x; \sigma) = \operatorname{argmin}_z \; \tfrac{1}{2\sigma^2} \|z - x\|_2^2 + s(z).$  (5)

For a proper lower-semicontinuous convex function $s(\cdot)$ this minimizer is unique, and in the optimization literature it is often referred to as the proximal mapping of $\sigma^2 s(\cdot)$ at $x$.

For a convex $s(\cdot)$, the first-order optimality condition of $\hat{x} = D(x; \sigma)$ implies that

$0 \in \hat{x} - x + \sigma^2 \partial s(\hat{x}),$  (6)

where $\partial s(\hat{x}) := \{ g : s(z) \geq s(\hat{x}) + g^T (z - \hat{x}), \; \forall z \}$ is the subdifferential (the set of subgradients) of $s(\cdot)$ at $\hat{x}$. Thus, $x - \hat{x} = x - D(x; \sigma) \in \sigma^2 \partial s(\hat{x})$, and the RED gradient (see (3)) obeys

$g_{RED}(x; D) \in \sigma^2 \partial s(D(x; \sigma)).$  (7)

If $s(\cdot)$ is smooth (i.e., continuously differentiable), then $\partial s(\hat{x}) = \{ \nabla s(\hat{x}) \}$ (a singleton), so

$g_{RED}(x; D) = \sigma^2 \nabla s(D(x; \sigma)).$  (8)

As the factor $\sigma^2$ can be absorbed in the hyper-parameter $\lambda$ (cf. (2)), we see that for a general prior $s(\cdot)$, the RED gradient $g_{RED}(x)$ (for brevity, from now on we omit the dependency on $D$) can be seen as a gradient of $s(\cdot)$ at a cleaner point, i.e., after denoising using the prior $s(\cdot)$. Therefore, $g_{RED}(x)$ enhances the effect of the prior compared to using the gradient $\nabla s(x)$ at the current point.
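The relation (5)-(8) is easy to check numerically. Below is a small sketch (our own construction, not from the paper) that implements the denoiser (5) for a smooth convex prior, here $s(z) = \sum_i \log\cosh(z_i)$ chosen purely for illustration, and verifies the identity (8).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, sigma = 8, 0.5

# A smooth convex prior and its gradient (log-cosh, chosen for illustration).
s      = lambda z: np.sum(np.log(np.cosh(z)))
grad_s = lambda z: np.tanh(z)

x = rng.standard_normal(n)

# Denoiser (5), written in the equivalent scaled form:
# D(x; sigma) = argmin_z 0.5*||z - x||^2 + sigma^2 * s(z).
obj = lambda z: 0.5 * np.sum((z - x) ** 2) + sigma**2 * s(z)
jac = lambda z: (z - x) + sigma**2 * grad_s(z)
x_hat = minimize(obj, x, jac=jac, method="L-BFGS-B", tol=1e-14).x

g_red = x - x_hat  # RED gradient (3)
print(np.allclose(g_red, sigma**2 * grad_s(x_hat), atol=1e-6))  # identity (8): True
```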
Let us illustrate this relationship in two concrete examples.

Tikhonov regularization. For the simple Tikhonov regularization prior $s(x) = \tfrac{1}{2} \|Rx\|_2^2$, we have $\nabla s(x) = R^T R x$ and $D(x; \sigma) = (\sigma^2 R^T R + I_n)^{-1} x$. Thus, from (8) we get

$g_{RED}(x) = \sigma^2 \nabla s(D(x; \sigma)) = \sigma^2 R^T R (\sigma^2 R^T R + I_n)^{-1} x.$  (9)

As a sanity check, notice that this result indeed coincides with the definition in (3), i.e.,

$g_{RED}(x) = \sigma^2 R^T R (I_n + \sigma^2 R^T R)^{-1} x = x - \left( I_n - \sigma^2 R^T R (I_n + \sigma^2 R^T R)^{-1} \right) x = x - (\sigma^2 R^T R + I_n)^{-1} x = x - D(x; \sigma),$  (10)

where the second equality uses the Woodbury identity.

Since the outer $\sigma^2$ in (9) can be absorbed in the hyper-parameter $\lambda$, we are left with comparing $R^T R$ (which appears in $\nabla s(x)$) and $R^T R (\sigma^2 R^T R + I_n)^{-1}$ (which appears in $g_{RED}(x)$ in (9)). Clearly $R^T R (\sigma^2 R^T R + I_n)^{-1} \preceq R^T R$ (in the sense that $R^T R - R^T R (\sigma^2 R^T R + I_n)^{-1}$ is positive semi-definite). This demonstrates the enhanced effect of the prior that is inherent in $g_{RED}(x)$. Yet, observe that for small values of $\sigma$ (as typically used in RED) the directions of $\nabla s(x)$ and $g_{RED}(x)$ will be close.

$\ell_1$-norm prior. Let us observe the relation between $\partial s(x)$ and $g_{RED}(x)$ for the (nonsmooth) prior $s(x) = \|W^T x\|_1$, where $W$ is an orthonormal basis, i.e., $W^T W = W W^T = I_n$ (note that this prior cannot be expressed in the form of (4); see [8, 13]). For example, $W$ can be an orthonormal wavelet basis under which the coefficients vector $\alpha = W^T x$ is sparse. In this case, the subdifferential $\partial s(x)$ is given by

$\partial s(x) = W \, \partial \|\alpha\|_1 \big|_{\alpha = W^T x} = \left\{ Wz : z_i = 1 \text{ if } [W^T x]_i > 0; \; z_i \in [-1, 1] \text{ if } [W^T x]_i = 0; \; z_i = -1 \text{ if } [W^T x]_i < 0 \right\}.$  (11)

For the $\ell_1$-norm prior, note that (5) turns into the soft-thresholding denoiser [1, 2]. Namely,

$D(x; \sigma) = \operatorname{argmin}_z \; \tfrac{1}{2} \|z - x\|_2^2 + \sigma^2 \|W^T z\|_1 = W \left( \operatorname{argmin}_{\tilde{\alpha}} \; \tfrac{1}{2} \|\tilde{\alpha} - W^T x\|_2^2 + \sigma^2 \|\tilde{\alpha}\|_1 \right) = W \, T_{\sigma^2}(W^T x),$  (12)

where $[T_{\sigma^2}(\alpha)]_i = \operatorname{sign}(\alpha_i) \max(|\alpha_i| - \sigma^2, 0)$. The subdifferential $\sigma^2 \partial s(D(x; \sigma))$ is then given by

$\sigma^2 \partial s(D(x; \sigma)) = \sigma^2 W \, \partial \|\alpha\|_1 \big|_{\alpha = T_{\sigma^2}(W^T x)} = \sigma^2 \left\{ Wz : z_i = 1 \text{ if } [T_{\sigma^2}(W^T x)]_i > 0; \; z_i \in [-1, 1] \text{ if } [T_{\sigma^2}(W^T x)]_i = 0; \; z_i = -1 \text{ if } [T_{\sigma^2}(W^T x)]_i < 0 \right\}.$  (13)

Comparing (13) and (11), the enhanced effect of the nonsmoothness of the prior is clear: $\partial s(D(x; \sigma)) \supseteq \partial s(x)$.

Let us verify (7) for the RED gradient defined in (3):

$g_{RED}(x) = x - D(x; \sigma) = x - W \, T_{\sigma^2}(W^T x) = \sigma^2 W \tfrac{1}{\sigma^2} \left( W^T x - T_{\sigma^2}(W^T x) \right) =: \sigma^2 W z_{RED}.$  (14)
If $[T_{\sigma^2}(W^T x)]_i > 0$, we have

$[z_{RED}]_i = \tfrac{1}{\sigma^2} \left( [W^T x]_i - ([W^T x]_i - \sigma^2) \right) = 1.$  (15)

Similarly, if $[T_{\sigma^2}(W^T x)]_i < 0$, we have $[z_{RED}]_i = -1$. Lastly, if $[T_{\sigma^2}(W^T x)]_i = 0$, or equivalently $-\sigma^2 \leq [W^T x]_i \leq \sigma^2$, we have $[z_{RED}]_i = \tfrac{1}{\sigma^2} [W^T x]_i \in [-1, 1]$. Thus, $g_{RED}(x) \in \sigma^2 \partial s(D(x; \sigma))$.

We believe that choosing $g_{RED}(x)$ rather than other subgradients in (13) is beneficial, in the sense that it is closer (in angle) to subgradients from (11), and performs gradient-based restoration similarly to subgradients from (11) and better than other subgradients in (13). We defer proving this to future research.
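The soft-thresholding case can likewise be checked numerically. The following sketch is our own illustration: a random orthogonal matrix stands in for the wavelet basis $W$, and we verify, coordinate by coordinate, that the RED gradient coefficients $z_{RED}$ from (14) satisfy the subgradient conditions in (13).

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 6, 0.3  # sigma2 plays the role of sigma^2

# Random orthonormal basis W (stand-in for an orthonormal wavelet basis).
W, _ = np.linalg.qr(rng.standard_normal((n, n)))

soft = lambda a, t: np.sign(a) * np.maximum(np.abs(a) - t, 0.0)  # T_{sigma^2}

x     = rng.standard_normal(n)
alpha = W.T @ x
a_hat = soft(alpha, sigma2)
x_hat = W @ a_hat                     # D(x; sigma), eq. (12)

z_red = (W.T @ (x - x_hat)) / sigma2  # z_RED from eq. (14)

# Subgradient conditions of (13): z_i = sign([T(W^T x)]_i) where it is nonzero,
# and z_i in [-1, 1] where it is zero.
ok = all(np.isclose(z, np.sign(a)) if a != 0 else -1.0 <= z <= 1.0
         for z, a in zip(z_red, a_hat))
print(ok)  # True: g_RED(x) lies in sigma^2 * ds(D(x; sigma))
```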
3. THE PROPOSED METHOD: BP-RED
In the previous section we saw that the RED gradient differs from a traditional (sub)gradient of $s(x)$ only in the fact that it is computed at a denoised version of $x$ (associated with the prior $s(x)$). As RED is typically applied with a relatively small noise level, this indicates a similarity between the RED gradient and traditional gradients, which implies that approaches that are useful for traditional gradient descent are likely to be beneficial also for RED. Following this perspective, we propose to incorporate the RED approach with the Back-Projection (BP) fidelity term [18].

Previous works on RED use the LS term $\ell_{LS}(x) = \tfrac{1}{2} \|y - Ax\|_2^2$ as their fidelity term. Formally, many of them are based on the following "gradient descent" reconstruction scheme

$x_{k+1} = x_k - \mu \left( \nabla \ell_{LS}(x_k) + \lambda g_{RED}(x_k) \right) = x_k - \mu \left( A^T (A x_k - y) + \lambda (x_k - D(x_k; \sigma)) \right),$  (16)

where $\mu$ is the step-size. Recent works on ill-posed linear inverse problems [11, 18-21] have shown the benefits of using the BP fidelity term $\ell_{BP}(x) = \tfrac{1}{2} \|A^{\dagger} (y - Ax)\|_2^2$, where $A^{\dagger} := A^T (A A^T)^{-1}$ is the pseudoinverse of $A$, instead of LS. Following this approach, we propose to combine the RED prior with the BP term. Thus, we propose using

$x_{k+1} = x_k - \mu \left( \nabla \ell_{BP}(x_k) + \lambda g_{RED}(x_k) \right) = x_k - \mu \left( A^{\dagger} (A x_k - y) + \lambda (x_k - D(x_k; \sigma)) \right).$  (17)

Note that in important tasks, such as deblurring and super-resolution, the operator $A^{\dagger}$ can be implemented efficiently using FFT or conjugate gradients (CG) with no need for matrix inversion, and the computational complexity of (17) is similar to (16). For sophisticated denoisers (e.g., TV [3], BM3D [4], or deep CNNs), the complexity of both (16) and (17) is dominated by the complexity of the denoising operation $D(\cdot; \sigma)$.
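As a concrete sketch of (17) for the deblurring case, the following code applies $A$, $A^T$, and $A^{\dagger}$ as circular convolutions in the frequency domain. This is our own illustrative implementation, not the authors' released code; `eps` (which regularizes the pseudoinverse denominator) and the default hyper-parameters are placeholder values.

```python
import numpy as np
fft2, ifft2 = np.fft.fft2, np.fft.ifft2

def bp_red(y, k_fft, denoiser, sigma, lam=0.1, mu=1.0, eps=1e-3, iters=100):
    """Sketch of the BP-RED iteration (17) for deblurring.

    y       : blurred (and possibly noisy) image
    k_fft   : 2-D FFT of the blur kernel, padded/shifted to the image size
    denoiser: off-the-shelf Gaussian denoiser, called as denoiser(x, sigma)
    """
    # A^dagger in the frequency domain: conj(K) / (|K|^2 + eps).
    pinv_fft = np.conj(k_fft) / (np.abs(k_fft) ** 2 + eps)
    x = y.copy()
    for _ in range(iters):
        residual = np.real(ifft2(k_fft * fft2(x))) - y        # A x - y
        bp_grad  = np.real(ifft2(pinv_fft * fft2(residual)))  # A^dagger (A x - y)
        x = x - mu * (bp_grad + lam * (x - denoiser(x, sigma)))
    return x
```

Replacing `bp_grad` with `np.real(ifft2(np.conj(k_fft) * fft2(residual)))`, i.e., $A^T (A x_k - y)$, recovers the LS-RED scheme (16) under the same assumptions.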
4. EXPERIMENTS
In this section, we concentrate on two image recovery tasks: deblurring and super-resolution, which are performed on eight classical test images: cameraman, house, peppers, Lena, Barbara, boat, hill, and couple. We examine the performance of the two gradient-based methods in (16) and (17), denoted by LS-RED and BP-RED, respectively. As the off-the-shelf denoisers, we use the (convex) isotropic total-variation (TV) denoiser [3] and the (non-convex) BM3D [4].
Deblurring. In the deblurring task, the linear operator $A$ represents filtering with some blur kernel. We examine two common blur kernels, a uniform kernel and a Gaussian kernel with standard deviation of 1.6 (both normalized to unit sum), and two levels of Gaussian noise $\sigma_e$.

Both LS-RED and BP-RED have similar per-iteration complexity, which is dominated by the complexity of the denoisers, since the operators $A$, $A^T$, and $A^{\dagger}$ are implemented very efficiently using the Fast Fourier Transform (FFT) (see [11]). In the FFT implementation of $A^{\dagger}$, we regularize the denominator with a small term that scales with the noise level $\sigma_e$. We initialize both methods with $x_0 = y$. We use the typical step-size of 1 over the Lipschitz constant of $\nabla \ell(\cdot)$ for both methods, i.e., $\mu = 1 / \|A^{\dagger} A\| = 1$ for BP-RED, and $\mu = 1 / \|A^T A\|$, computed by the power method, for LS-RED. For LS-RED, we also present results for the step-size used in [8], which depends on $\sigma_e$ and $\lambda$. The two hyper-parameters of LS-RED and BP-RED (namely, the regularization level $\lambda$ and the noise level $\sigma$ that is used in the denoiser) are tuned uniformly for each scenario by choosing the best $\lambda$ and the best $\sigma$ from dense grids.

The restoration performance of the two methods for all the deblurring experiments is presented in Fig. 1. The graphs show the Peak Signal to Noise Ratio (PSNR) metric (averaged over all the test images) vs. the iteration number.

[Fig. 1: Deblurring results: PSNR (averaged over 8 test images) vs. iteration number, for LS-RED (red) and BP-RED (blue). Panels (a)-(h) cover the Gaussian and uniform blur kernels, the two noise levels $\sigma_e$, and the BM3D and TV priors.]

One can see that the BP-RED method achieves better performance than the LS-RED method in terms of (a) higher PSNR and (b) faster convergence. This behavior is consistent with the results in [18], which are not based on the RED approach.

Super-resolution. In the super-resolution (SR) task, the linear operator $A$ represents filtering with an anti-aliasing kernel followed by subsampling. We examine scale factors of 3 and 4. For scale factor 3 we use a Gaussian kernel with standard deviation of 1.6, and for scale factor 4 we use a Gaussian kernel with standard deviation of 2.2 (both normalized to unit sum).

Here, $A^{\dagger}$ in BP-RED is implemented using CG [22], which has shown extremely fast convergence. Thus, again, the per-iteration complexity of LS-RED and BP-RED is dominated by the denoisers. We initialize both methods with bicubic upsampling of $y$. As before, we use the typical step-size of 1 over the Lipschitz constant of $\nabla \ell(\cdot)$ for both methods (here we do not examine the step-size from [8], as it is not suitable for the noiseless case). For each method, the hyper-parameters $\lambda$ and $\sigma$ are tuned uniformly per scenario using dense grids.

The restoration performance (average PSNR vs. iteration number) of the two methods for all the SR experiments is presented in Fig. 2. The PSNR advantage of BP-RED over LS-RED is more moderate than in the deblurring case, but still clear and consistent.

[Fig. 2: SR results: PSNR (averaged over 8 test images) vs. iteration number, for LS-RED (red) and BP-RED (blue). Panels (a)-(d) cover SRx3 (Gaussian kernel, std 1.6) and SRx4 (Gaussian kernel, std 2.2), with the BM3D and TV priors.]
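For super-resolution, where $A$ (blur followed by subsampling) has no convenient frequency-domain diagonalization, $A^{\dagger}$ can be applied with CG as described above. Below is a minimal sketch under our own naming conventions (`A` and `AT` are operator callbacks; the toy operator is plain subsampling, chosen only to make the example self-contained):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def apply_pinv(A, AT, v, m):
    # Apply A^dagger = A^T (A A^T)^{-1} to v without forming any matrix:
    # solve (A A^T) u = v with conjugate gradients [22], then return A^T u.
    AAT = LinearOperator((m, m), matvec=lambda u: A(AT(u)), dtype=float)
    u, info = cg(AAT, v)
    return AT(u)

# Toy example: A subsamples a length-8 signal by a factor of 2.
n, m = 8, 4
A = lambda x: x[::2]
def AT(u):
    x = np.zeros(n)
    x[::2] = u
    return x

v = np.arange(1.0, m + 1)
x_bp = apply_pinv(A, AT, v, m)
print(np.allclose(A(x_bp), v))  # True: A A^dagger = I_m
```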
5. CONCLUSION
We considered the RED approach, which exploits existing image denoising engines for solving tasks other than denoising. We showed that the RED gradient can be seen as a (sub)gradient of a prior function that is computed at a (typically slightly) denoised version of the point. This similarity between RED and traditional gradients motivated us to combine RED with the BP fidelity term, which has demonstrated improved results for traditional gradient schemes. Various experiments demonstrated the advantages of our BP-RED method over the combination of RED with least squares, used in previous works.
Acknowledgment. This research was supported by ERC-StG grant no. 757497 (SPADE).
6. REFERENCES

[1] D. L. Donoho and I. M. Johnstone, "Ideal spatial adaptation by wavelet shrinkage," Biometrika, vol. 81, no. 3, pp. 425-455, 1994.

[2] D. L. Donoho, "De-noising by soft-thresholding," IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 613-627, 1995.

[3] L. I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D, vol. 60, no. 1-4, pp. 259-268, Nov. 1992.

[4] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, "Image denoising by sparse 3-D transform-domain collaborative filtering," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2080-2095, 2007.

[5] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, "Plug-and-play priors for model based reconstruction," in 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2013, pp. 945-948.

[6] S. Sreehari, S. V. Venkatakrishnan, B. Wohlberg, G. T. Buzzard, L. F. Drummy, J. P. Simmons, and C. A. Bouman, "Plug-and-play priors for bright field electron tomography and sparse interpolation," IEEE Transactions on Computational Imaging, vol. 2, no. 4, pp. 408-423, 2016.

[7] C. A. Metzler, A. Maleki, and R. G. Baraniuk, "From denoising to compressed sensing," IEEE Transactions on Information Theory, vol. 62, no. 9, pp. 5117-5144, 2016.

[8] Y. Romano, M. Elad, and P. Milanfar, "The little engine that could: Regularization by denoising (RED)," SIAM Journal on Imaging Sciences, vol. 10, no. 4, pp. 1804-1844, 2017.

[9] S. A. Bigdeli, M. Zwicker, P. Favaro, and M. Jin, "Deep mean-shift priors for image restoration," in Advances in Neural Information Processing Systems, 2017, pp. 763-772.

[10] K. Zhang, W. Zuo, S. Gu, and L. Zhang, "Learning deep CNN denoiser prior for image restoration," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3929-3938.

[11] T. Tirer and R. Giryes, "Image restoration by iterative denoising and backward projections," IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1220-1234, 2018.

[12] G. T. Buzzard, S. H. Chan, S. Sreehari, and C. A. Bouman, "Plug-and-play unplugged: Optimization-free reconstruction using consensus equilibrium," SIAM Journal on Imaging Sciences, vol. 11, no. 3, pp. 2001-2020, 2018.

[13] E. T. Reehorst and P. Schniter, "Regularization by denoising: Clarifications and new interpretations," IEEE Transactions on Computational Imaging, vol. 5, no. 1, pp. 52-67, 2018.

[14] G. Mataev, P. Milanfar, and M. Elad, "DeepRED: Deep image prior powered by RED," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.

[15] Y. Sun, J. Liu, and U. S. Kamilov, "Block coordinate regularization by denoising," IEEE Transactions on Computational Imaging, vol. 6, pp. 908-921, 2020.

[16] Y. Sun, J. Liu, Y. Sun, B. Wohlberg, and U. S. Kamilov, "Async-RED: A provably convergent asynchronous block parallel stochastic method using deep denoising priors," arXiv preprint arXiv:2010.01446, 2020.

[17] R. Cohen, M. Elad, and P. Milanfar, "Regularization by denoising via fixed-point projection (RED-PRO)," arXiv preprint arXiv:2008.00226, 2020.

[18] T. Tirer and R. Giryes, "Back-projection based fidelity term for ill-posed linear inverse problems," IEEE Transactions on Image Processing, vol. 29, pp. 6164-6179, 2020.

[19] T. Tirer and R. Giryes, "Super-resolution via image-adapted denoising CNNs: Incorporating external and internal learning," IEEE Signal Processing Letters, vol. 26, no. 7, pp. 1080-1084, 2019.

[20] T. Tirer and R. Giryes, "On the convergence rate of projected gradient descent for a back-projection based objective," arXiv preprint arXiv:2005.00959, 2020.

[21] J. Zukerman, T. Tirer, and R. Giryes, "BP-DIP: A backprojection based deep image prior," in 2020 28th European Signal Processing Conference (EUSIPCO), 2020, pp. 675-679.

[22] M. R. Hestenes and E. Stiefel, "Methods of conjugate gradients for solving linear systems," Journal of Research of the National Bureau of Standards, vol. 49, no. 6, pp. 409-436, 1952.