Deep Equilibrium Architectures for Inverse Problems in Imaging
Davis Gilton∗, Gregory Ongie†, and Rebecca Willett‡

February 17, 2021

Abstract
Recent efforts on solving inverse problems in imaging via deep neural networks use architectures inspired by a fixed number of iterations of an optimization method. The number of iterations is typically quite small due to difficulties in training networks corresponding to more iterations; the resulting solvers cannot be run for more iterations at test time without incurring significant errors. This paper describes an alternative approach corresponding to an infinite number of iterations, yielding up to a 4dB PSNR improvement in reconstruction accuracy above state-of-the-art alternatives and where the computational budget can be selected at test time to optimize context-dependent trade-offs between accuracy and computation. The proposed approach leverages ideas from Deep Equilibrium Models, where the fixed-point iteration is constructed to incorporate a known forward model and insights from classical optimization-based reconstruction methods.
Introduction

A collection of recent efforts surveyed by [1] consider the problem of using training data to solve inverse problems in imaging. Specifically, imagine we observe a corrupted set of measurements y of an image x* according to a measurement operator A with some noise ε according to

    y = Ax* + ε.   (1)

Our task is to compute an estimate of x* given measurements y and knowledge of A. This task is particularly challenging when the inverse problem is ill-posed, i.e., when the system is underdetermined or ill-conditioned, in which case simple methods such as least squares estimation (i.e., x̂ = (A⊤A)^{-1} A⊤y) may not exist or may produce estimates that are highly sensitive to noise.

Decades of research have explored geometric models of image structure that can be used to regularize solutions to this inverse problem, including [2, 3, 4] and many others. More recent efforts have focused instead on using large collections of training images, {x*_i}_{i=1}^n, to learn effective regularizers.

One particularly popular and effective approach involves augmenting standard iterative inverse problem solvers with learned deep networks. This approach, which we refer to as deep unrolling (DU), is reviewed in §2.1. The basic idea is to build an architecture that mimics a small number of steps of an iterative procedure. In practice, the number of steps is quite small (typically 5-10) because of stability, memory, and numerical issues arising in backpropagation. This paper sidesteps this key limitation of deep unrolling methods with a novel approach based on deep equilibrium models (DEMs) [5], which are designed for training arbitrarily deep networks. The result is a novel approach to training networks to solve inverse problems in imaging that yields up to a 4dB improvement in performance above state-of-the-art alternatives and where the computational budget can be selected at test time to optimize context-dependent tradeoffs between accuracy and computation. The key empirical findings, which are detailed in §6.4, are illustrated in Fig. 1(a).

∗ D. Gilton is with the Department of Electrical and Computer Engineering at the University of Wisconsin-Madison, 1415 Engineering Dr, Madison, WI 53706 USA. † G. Ongie is with the Department of Mathematical and Statistical Sciences at Marquette University, 1250 W Wisconsin Ave, Milwaukee, WI 53233 USA. ‡ R. Willett is with the Departments of Computer Science and Statistics at the University of Chicago, 5747 S Ellis Ave, Chicago, IL 60637 USA. This material is based upon work supported by the National Science Foundation under Grant Nos. 1740707, 1447449, and 0353079, as well as AFOSR FA9550-18-1-0166.

Figure 1: (a) PSNR of reconstructed images for an MRI reconstruction problem as a function of iterations used to compute the reconstruction. Unrolled methods are optimized for a fixed computational budget during training, and running additional steps at test time yields a significant drop in performance. Our deep equilibrium methods can achieve the same PSNR for the optimal computational budget of an unrolled method, but can trade slightly more computation time for a significant increase in PSNR, allowing a user to choose the desired computational budget and reconstruction quality. (b) Standard unrolled deep optimization networks typically require choosing some fixed number of iterates during training. Deviating from this fixed number at test time incurs a significant penalty in PSNR. The forward model here is 8× accelerated single-coil MRI reconstruction, and the unrolled algorithm is unrolled proximal gradient descent with K iterates, labeled PROX-K (Fig. 3). For further experimental details see §6.4.
Contributions

This paper presents a fundamentally new approach to machine-learning-based methods for solving linear inverse problems in imaging. Unlike most state-of-the-art methods, which are based on unrolling a small number of iterations of an iterative reconstruction scheme ("deep unrolling"), our method is based on deep equilibrium models that correspond to a potentially infinite number of iterations. This framework yields more accurate reconstructions than the current state of the art across a range of inverse problems and gives users the ability to navigate a tradeoff between reconstruction computation time and accuracy at test time; specifically, we observe up to a 4dB improvement in PSNR. Furthermore, because our formulation is based on finding fixed points, we can use standard accelerated fixed-point methods to speed test-time computations – something that is not possible with deep unrolling methods. In addition, our approach inherits provable convergence guarantees depending on the "base" algorithm used to select a fixed-point equation for the deep equilibrium framework. Experimental results also show that our proposed initialization based on pre-training is superior to random initialization, and the proposed approach is more robust to noise than past methods. Overall, the proposed approach is a unique bridge between conventional fixed-point methods in numerical analysis, deep learning, and optimization-based solvers for inverse problems.
Relationship to Prior Work
Deep Unrolling

Deep unrolling methods describe approaches to solving inverse problems which consist of a fixed number of architecturally identical "blocks," which are often inspired by a particular optimization algorithm. These methods represent the current state of the art in MRI reconstruction, with most top submissions to the fastMRI challenge [6] being some sort of unrolled network. Unrolled networks have seen success in other imaging tasks, e.g., low-dose CT [7], light-field photography [8], and emission tomography [9].

We describe here a specific variant of deep unrolling methods based on gradient descent, although many variants exist based on alternative optimization or fixed-point iteration schemes [1]. Suppose we have a known regularization function r that could be applied to an image x; e.g., in Tikhonov regularization, r(x) = λ‖x‖₂² for some scalar λ > 0. Then we could compute an image estimate by solving the optimization problem

    x̂ = arg min_x ½‖y − Ax‖₂² + r(x).   (2)

If r is differentiable, this can be accomplished via gradient descent. That is, we start with an initial estimate x^(0) such as x^(0) = A⊤y and choose a step size η > 0, such that for iteration k = 0, 1, 2, ..., we set

    x^(k+1) = x^(k) + ηA⊤(y − Ax^(k)) − η∇r(x^(k)),

where ∇r is the gradient of the regularizer.

The basic idea behind deep unrolled methods is to fix some number of iterations K (typically K ranges from 5 to 10), declare that x^(K) will be our estimate x̂, and model ∇r with a neural network, denoted R_θ(x), that can be learned with training data. We assume that all R_θ have identical weights, although other works also explore non-weight-tied variants [10]. For example, we may define the unrolled gradient descent estimate to be x̂^(K)(y; θ) := x^(K) where x^(0) = A⊤y and for k = 0, ..., K − 1,

    x^(k+1) = x^(k) + ηA⊤(y − Ax^(k)) − ηR_θ(x^(k)).   (3)

Training attempts to minimize the cost function ∑_{i=1}^n ‖x̂^(K)(y_i; θ) − x*_i‖₂² with respect to the network parameters θ. This form of training is often called "end-to-end"; that is, we do not train the network representing ∇r in isolation, but rather on the quality of the resulting estimate x̂^(K), which depends on the forward model A.

The number of iterations is kept small for two reasons. First, at deployment, these systems are optimized to compute image estimates quickly – a desirable property we wish to retain in developing new methods. Second, it is challenging to train deep unrolled networks for many iterations due to memory limitations of GPUs, because the memory required to calculate the backpropagation updates scales linearly with the number of unrolled iterations.

As a workaround, consider training such systems for a small number of iterations K (e.g., K = 10), then extracting the learned regularizer gradient ∇r, and using it within a gradient descent algorithm until convergence (i.e., for more iterations than used in training). Our numerical results highlight how poorly this method performs in practice (§6.4). Choosing the number of iterations K (and hence the test-time computational budget) at training time is essential. As we illustrate in Fig. 1(b), one cannot deviate from this choice after training and expect good performance.
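To make the unrolled structure concrete, the following PyTorch-style sketch implements the K weight-tied iterations of (3). It is an illustration rather than the authors' released code; the callables `A`, `A_adj`, and `R_theta` (forward operator, its adjoint, and the learned network) are assumed to be supplied by the user.

```python
def unrolled_grad_descent(y, A, A_adj, R_theta, eta=0.1, K=10):
    """Unrolled gradient descent, Eq. (3): K weight-tied iterations.

    A and A_adj are callables applying the forward operator and its adjoint;
    R_theta is a learned network standing in for the regularizer gradient.
    """
    x = A_adj(y)                      # initialization x^(0) = A^T y
    for _ in range(K):                # fixed, small number of unrolled steps
        data_grad = A_adj(y - A(x))   # A^T (y - A x): descent direction for the data-fit term
        x = x + eta * data_grad - eta * R_theta(x)
    return x
```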
Deep Equilibrium Models

In [5], the authors propose a method for training arbitrarily deep networks defined by repeated application of a single layer. Imagine an L-layer network with input y and weights θ. Letting x^(k) denote the output of the k-th hidden layer, we may write

    x^(k+1) = f^(k)_θ(x^(k); y)   for k = 0, ..., L − 1,

where k is the layer index and f^(k)_θ is a nonlinear transformation such as inner products followed by the application of a nonlinear activation function. Recent prior work explored forcing this transformation at each layer to be the same (i.e., weight tying), so that f^(k)_θ = f_θ for all k, and showed that such networks still yield competitive performance [11, 12]. Under weight tying, we have the recursion

    x^(k+1) = f_θ(x^(k); y)   (4)

and the output x^(K) as K → ∞ is a fixed point of the operator f_θ(·; y). [5] show this fixed point can be computed without explicitly building an infinitely deep network, and that the network weights can be learned using implicit differentiation and constant memory, bypassing computation and numerical stability issues associated with related techniques on large-scale problems [13, 14]. This past work focused on sequence models and time-series tasks, assuming that each f_θ was a single layer of a neural network, and did not explore the image reconstruction task that is the focus of this paper.
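A minimal forward solver for the weight-tied recursion (4) can be written as a plain fixed-point loop. The sketch below is illustrative; the iteration map `f_theta` and the stopping tolerance are assumptions, not something specified by [5].

```python
import torch

def forward_fixed_point(f_theta, y, x0, max_iter=100, tol=1e-3):
    """Plain fixed-point iteration x <- f_theta(x; y) from Eq. (4),
    stopped when the relative change between iterates is small."""
    x = x0
    for _ in range(max_iter):
        x_next = f_theta(x, y)
        rel_change = torch.norm(x_next - x) / (torch.norm(x_next) + 1e-8)
        x = x_next
        if rel_change < tol:
            break
    return x
```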
Plug-and-Play and Regularization by Denoising

Initiated by [15], a collection of methods based on the plug-and-play (PnP) framework have been proposed, allowing denoising algorithms to be used as priors for model-based image reconstruction. Given an inverse problem setting, one writes the reconstructed image as the solution to an optimization problem in which the objective function is the sum of a data-fit term and a regularization term. Applying the alternating direction method of multipliers (ADMM, [16, 17]) to this optimization problem, we arrive at a collection of update equations, one of which has the form

    arg min_x ½‖z − x‖₂² + r(x),

where r(x) is the regularization function; the optimization problem in this update equation can be considered as "denoising" the image z. PnP methods replace this explicit optimization step with a "plugged-in" denoising method. Notably, some state-of-the-art denoisers (e.g., BM3D [4] and U-nets [18]) do not have an explicit r associated with them, but nevertheless empirically work well within the PnP framework. [19] propose
Regularization by Denoising (RED) based on a similar philosophy, but consider a regularizer of the form r(x) = ½ x⊤(x − ρ(x)), where ρ(x) corresponds to an image denoising function.

Recent PnP and RED efforts focus on using training data to learn denoisers [20, 21, 22, 23, 24]. In contrast to the unrolling methods described in §2.1, these methods are not trained end-to-end; rather, the denoising module is trained independently of the inverse problem (i.e., A) at hand. As described by [1], decoupling the training of the learned component from A results in a reconstruction system that is flexible and does not need to be re-trained for each new A, but can require substantially more training samples to achieve the reconstruction accuracy of a method trained end-to-end for a specific A.

Deep Equilibrium Models for Inverse Problems

Our approach is to choose a function f_θ so that a fixed point of the recursion in (4) is a good estimate of x for a given y. To the best of our knowledge, this is the first example of the application of DEMs to image reconstruction tasks. We describe choices of f_θ (and hence of the implicit infinite-depth neural network architecture) that explicitly account for the forward model A and generally for the inverse problem at hand. Specifically, we propose choosing f_θ based on the ideas of deep unrolling (§2.1) extended to an infinite number of iterations – a paradigm that has been beyond the reach of all previous deep unrolling methods.

Let x^(k) denote the image estimate after k rounds of an iterative algorithm (as in §2.1) and y denote the observation from (1) under forward operator A. We consider three specific choices of f_θ below, but note that many other options are possible.
Deep equilibrium gradient descent (DE-Grad): Connecting the gradient descent iterations in (3) with the deep equilibrium model in (4), we let

    f_θ(x; y) = x + ηA⊤(y − Ax) − ηR_θ(x)   (5)

and note that a fixed point of f_θ(·; y), denoted x^(∞), corresponds to the limit of x̂^(K)_θ as K → ∞ in (3).

Figure 2: Deep Equilibrium Gradient Descent (
DE-Grad)
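Assuming the same placeholder callables `A`, `A_adj`, and `R_theta` as in the earlier sketches, the DE-Grad iteration map (5) is a one-line composition; the reconstruction is then its fixed point, which can be computed with the forward solver sketched above.

```python
def de_grad_map(x, y, A, A_adj, R_theta, eta=0.1):
    """DE-Grad iteration map f_theta(x; y) from Eq. (5): a gradient step on the
    data-fit term plus a step along the learned regularizer gradient R_theta."""
    return x + eta * A_adj(y - A(x)) - eta * R_theta(x)

# The DE-Grad reconstruction is the fixed point of this map, e.g.:
#   x_hat = forward_fixed_point(lambda x, y: de_grad_map(x, y, A, A_adj, R_theta),
#                               y, x0=A_adj(y))
```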
Deep equilibrium proximal gradient (
DE-Prox): Proximal gradient methods [25] use a proximal operator associated with a function h:

    prox_h(x) = arg min_u ½‖u − x‖₂² + h(u)   (6)

and use this to solve the optimization problem in (2) via the iterates

    x^(k+1) = prox_{ηr}(x^(k) + ηA⊤(y − Ax^(k))),

where η > 0 is a step size. Inspired by this optimization framework, we choose f_θ in the DEM framework as

    f_θ(x; y) = prox_{ηr}(x + ηA⊤(y − Ax)).   (7)

Following [26], we replace prox_{ηr} with a trainable network R_θ : ℝⁿ → ℝⁿ, leading to the fixed-point iterations:

    f_θ(x; y) = R_θ(x + ηA⊤(y − Ax)).   (8)

Figure 3: Deep Equilibrium Proximal Gradient (
DE-Prox)
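A corresponding sketch of the DE-Prox iteration map (8), again assuming placeholder callables for the forward model and the learned network:

```python
def de_prox_map(x, y, A, A_adj, R_theta, eta=0.1):
    """DE-Prox iteration map f_theta(x; y) from Eq. (8): a gradient step on the
    data-fit term followed by the learned network R_theta, which stands in for
    the proximal operator prox_{eta*r}."""
    return R_theta(x + eta * A_adj(y - A(x)))
```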
Deep equilibrium alternating direction method of multipliers (DE-ADMM): The alternating direction method of multipliers (ADMM, [16]) reformulates the optimization problem (2) as

    min_{x,z} ½‖y − Ax‖₂² + r(z)   subject to z = x.

The augmented Lagrangian (in its "scaled form" – see [16]) associated with this problem is given by

    L_α(x, z, u) := ½‖y − Ax‖₂² + r(z) + (1/2α)‖z − x + u‖₂²,

where u is an additional auxiliary variable and α > 0 is a user-defined parameter. The ADMM iterates are then

    z^(k+1) = arg min_z L_α(x^(k), z, u^(k))
    x^(k+1) = arg min_x L_α(x, z^(k+1), u^(k))
    u^(k+1) = u^(k) + z^(k+1) − x^(k+1),   (9)

where the z- and x-updates simplify as

    z^(k+1) = prox_{αr}(x^(k) − u^(k))
    x^(k+1) = (I + αA⊤A)^{-1}(αA⊤y + z^(k+1) + u^(k)).

As in the
DE-Prox approach, prox_{αr}(·) can be replaced with a learned network, denoted R_θ. Making this replacement, and substituting z^(k+1) directly into the expressions for x^(k+1) and u^(k+1), gives:

    x^(k+1) = (I + αA⊤A)^{-1}(αA⊤y + R_θ(x^(k) − u^(k)) + u^(k))
    u^(k+1) = u^(k) + R_θ(x^(k) − u^(k)) − x^(k+1).   (10)

Note that the updates for x^(k+1) and u^(k+1) depend only on the previous iterates x^(k) and u^(k). Therefore, the above updates can be interpreted as fixed-point iterations on the joint variable q = (x, u), where the iteration map f_θ(q; y) is implicitly defined as the map that satisfies

    q^(k+1) = f_θ(q^(k); y)   with q^(k) := (x^(k), u^(k)).   (11)

Here we take the estimated image to be x^(∞), where q^(∞) = (x^(∞), u^(∞)) is the fixed point of f_θ(·; y), i.e., the limit of q^(K) as K → ∞.

Figure 4: Deep Equilibrium Alternating Direction Method of Multipliers (DE-ADMM)
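For DE-ADMM, the iteration map acts on the joint variable q = (x, u) from (11). The sketch below assumes a helper `solve_data` that applies (I + αA⊤A)^{-1}, which in practice might be a conjugate-gradient routine or a closed-form solve for convolutional A; this helper is an assumption introduced for illustration, not something specified in the text.

```python
def de_admm_map(q, y, A_adj, R_theta, solve_data, alpha=1.0):
    """DE-ADMM iteration map on the joint variable q = (x, u), Eqs. (10)-(11).

    solve_data(v) is assumed to apply (I + alpha * A^T A)^{-1} to v, e.g. via
    conjugate gradient or an FFT-based solve when A is a convolution.
    """
    x, u = q
    z_next = R_theta(x - u)                                # learned proximal step
    x_next = solve_data(alpha * A_adj(y) + z_next + u)     # data-consistency solve
    u_next = u + z_next - x_next                           # scaled dual update
    return (x_next, u_next)
```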
Given a choice of f_θ(·; y), we confront the following obstacles. (1) Forward calculation: given an observation y and weights θ, we need to compute the fixed point of f_θ(·; y) efficiently. (2) Training: given a collection of training samples {x_i}_{i=1}^n, we need to find the optimal θ.

Calculating Fixed-Points

Both training and inference in a DEM require calculating a fixed point of the iteration map f_θ(·; y) given an observation y and an initial point. The most straightforward approach is to use the fixed-point iterations given in (4). Convergence of this scheme for specific f_θ designs is discussed in Section 5.

However, fixed-point iterations may not converge quickly. By viewing unrolled deep networks as fixed-point iterations, we inherit the ability to accelerate inference with standard fixed-point accelerators. To our knowledge, this work is the first time iterative inversion methods incorporating deep networks have been accelerated using fixed-point accelerators.

Anderson acceleration:
Anderson acceleration [27] utilizes past iterates to identify promising directions to move during the iterations. This takes the form of identifying a vector α^(k) ∈ ℝ^m and setting

    x^(k+1) = (1 − β) ∑_{i=0}^{m−1} α^(k)_i x^(k−i) + β ∑_{i=0}^{m−1} α^(k)_i f_θ(x^(k−i); y).

We find α^(k) by solving the optimization problem

    arg min_α ‖Gα‖₂²   s.t. 1⊤α = 1,   (12)

where G is a matrix with m columns, whose i-th column is the (vectorized) residual f_θ(x^(k−i); y) − x^(k−i). (Anderson acceleration for deep equilibrium models was introduced in a NeurIPS tutorial by [28].) The optimization problem in (12) admits a least-squares solution, adding negligible computational overhead when m is small (e.g., m = 5).

An important practical consideration is that accelerating fixed-point iterations arising from optimization algorithms with auxiliary variables (like ADMM) is non-trivial. In these cases, standard fixed-point iterations may be preferred for their simplicity of implementation. This is the approach we take in finding fixed points of our proposed DE-ADMM model.
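For completeness, a simplified Anderson-acceleration loop in the spirit of the tutorial implementation of [28] is sketched below. The memory size `m` and regularization `lam` are illustrative defaults, and the constrained least-squares problem (12) is solved via a small regularized linear system followed by normalization of the weights.

```python
import torch

def anderson_solver(f, x0, m=5, max_iter=50, beta=1.0, lam=1e-4, tol=1e-3):
    """Anderson-accelerated fixed-point iteration for x <- f(x) (cf. Eq. (12)).

    Keeps the last m iterates X and their images F = f(X), solves a small
    regularized least-squares problem for the mixing weights alpha subject to
    1^T alpha = 1, and mixes past iterates and their images under f.
    """
    bsz, dim = x0.shape[0], x0[0].numel()
    X = torch.zeros(bsz, m, dim, dtype=x0.dtype, device=x0.device)
    F = torch.zeros_like(X)
    X[:, 0] = x0.reshape(bsz, -1)
    F[:, 0] = f(x0).reshape(bsz, -1)
    X[:, 1] = F[:, 0]
    F[:, 1] = f(F[:, 0].reshape_as(x0)).reshape(bsz, -1)
    k = 1
    for k in range(2, max_iter):
        n = min(k, m)
        G = F[:, :n] - X[:, :n]                            # residuals f(x) - x
        H = torch.bmm(G, G.transpose(1, 2))                # n x n Gram matrix
        H = H + lam * torch.eye(n, dtype=x0.dtype, device=x0.device)[None]
        rhs = torch.ones(bsz, n, 1, dtype=x0.dtype, device=x0.device)
        alpha = torch.linalg.solve(H, rhs)
        alpha = alpha / alpha.sum(dim=1, keepdim=True)     # enforce 1^T alpha = 1
        x_new = beta * (alpha.transpose(1, 2) @ F[:, :n])[:, 0] \
            + (1 - beta) * (alpha.transpose(1, 2) @ X[:, :n])[:, 0]
        X[:, k % m] = x_new
        F[:, k % m] = f(x_new.reshape_as(x0)).reshape(bsz, -1)
        res = torch.norm(F[:, k % m] - X[:, k % m]) / (torch.norm(F[:, k % m]) + 1e-8)
        if res < tol:
            break
    return X[:, k % m].reshape_as(x0)
```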
Training

In this section, we provide a brief overview of the training procedure used to train all networks in Section 6.4. We use stochastic gradient descent to find network parameters θ that (locally) minimize a cost function of the form (1/n) ∑_{i=1}^n ℓ(x^(∞)(y_i; θ), x*_i), where ℓ(·,·) is a given loss function, x*_i is the i-th training image with paired measurements y_i, and x^(∞)(y_i; θ) denotes the reconstructed image given as the fixed point of f_θ(·; y_i). For our image reconstruction experiments, we use the mean-squared error (MSE) loss:

    ℓ(x, x*) = ½‖x − x*‖₂².   (13)

To simplify the calculations below, we consider gradients of the cost function with respect to a single training measurement/image pair, which we denote (y, x*). Following [5], we leverage the fact that x^(∞) := x^(∞)(y; θ) is a fixed point of f_θ(·; y) to find the gradient of the loss with respect to the network parameters θ without backpropagating through an arbitrarily large number of fixed-point iterations. We summarize this approach below.

First, abbreviating ℓ(x^(∞), x*) by ℓ, the chain rule gives

    ∂ℓ/∂θ⊤ = (∂x^(∞)/∂θ⊤) (∂ℓ/∂x^(∞)).   (14)

Since we assume ℓ is the MSE loss, the gradient ∂ℓ/∂x^(∞) is simply the residual between x* and the equilibrium point: ∂ℓ/∂x^(∞) = x^(∞) − x*.

In order to compute ∂x^(∞)/∂θ we start with the fixed-point equation x^(∞) = f_θ(x^(∞); y). Implicitly differentiating this equation with respect to θ and rearranging gives

    ∂x^(∞)/∂θ = (I − ∂f_θ(x; y)/∂x |_{x=x^(∞)})^{-1} ∂f_θ(x^(∞); y)/∂θ.   (15)

Plugging this expression into (14) gives

    ∂ℓ/∂θ⊤ = (∂f_θ(x^(∞); y)/∂θ⊤) (I − ∂f_θ(x; y)/∂x |_{x=x^(∞)})^{−⊤} (∂ℓ/∂x^(∞)).

This converts the memory-intensive task of backpropagating through many iterations of f_θ(·; y) to the problem of calculating an inverse Jacobian-vector product. To approximate the inverse Jacobian-vector product, first we define the vector β^(∞) by

    β^(∞) = (I − ∂f_θ(x; y)/∂x |_{x=x^(∞)})^{−⊤} (x^(∞) − x*).

Following [28], we note that β^(∞) is a fixed point of the equation

    β^(k+1) = (∂f_θ(x; y)/∂x |_{x=x^(∞)})⊤ β^(k) + (x^(∞) − x*),   (16)

and the same machinery used to calculate the fixed point x^(∞) may be used to calculate β^(∞). For analysis purposes, we note that if β^(0) = 0, the simple fixed-point iterations (16) may be represented by the Neumann series:

    β^(∞) = ∑_{n=0}^∞ ((∂f_θ(x; y)/∂x |_{x=x^(∞)})⊤)^n (x^(∞) − x*).   (17)

Convergence of the above Neumann series is discussed in Section 5. Conventional autodifferentiation tools permit quickly computing the vector-Jacobian products in (16) and (17). Once an accurate approximation to β^(∞) is calculated, the gradient in (14) is given by

    ∂ℓ/∂θ⊤ = (∂f_θ(x^(∞); y)/∂θ⊤) β^(∞).   (18)

The gradient calculation process is summarized in the following steps, assuming a fixed point x^(∞) of f_θ is known (a code sketch follows the list):

1. Compute the residual r = x^(∞) − x*.
2. Compute an approximate fixed point β^(∞) of the equation β = (∂f_θ(x; y)/∂x |_{x=x^(∞)})⊤ β + r.
3. Compute ∂ℓ/∂θ⊤ = (∂f_θ(x^(∞); y)/∂θ⊤) β^(∞).
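The following PyTorch sketch carries out these three steps using autograd vector-Jacobian products. It assumes the fixed point `x_star` has already been computed (e.g., with the forward solver above) and that `f_theta` is an `nn.Module`; it is an illustration of the procedure rather than the authors' implementation.

```python
import torch

def deq_gradient_step(f_theta, x_star, y, x_true, max_iter=50, tol=1e-4):
    """Compute d(loss)/d(theta) at a known fixed point x_star via Eqs. (14)-(18),
    accumulating gradients into the parameters of f_theta.

    x_star is a pre-computed (detached) fixed point of f_theta(., y), x_true is
    the ground-truth image, and the loss is the MSE of Eq. (13)."""
    x_star = x_star.detach().requires_grad_()
    f_val = f_theta(x_star, y)                 # one extra application of f, kept in the graph
    residual = (x_star - x_true).detach()      # dl/dx at the fixed point

    # Solve beta = (df/dx)^T beta + residual by fixed-point iteration (Eq. (16)),
    # using vector-Jacobian products from autograd in place of explicit Jacobians.
    beta = torch.zeros_like(x_star)
    for _ in range(max_iter):
        jtv = torch.autograd.grad(f_val, x_star, grad_outputs=beta,
                                  retain_graph=True)[0]
        beta_next = jtv + residual
        if torch.norm(beta_next - beta) / (torch.norm(beta_next) + 1e-8) < tol:
            beta = beta_next
            break
        beta = beta_next

    # dl/dtheta = (df/dtheta)^T beta (Eq. (18)); backward() accumulates into .grad.
    f_val.backward(beta)
```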
Convergence Analysis

Here we study convergence of the proposed deep equilibrium models to a fixed point at inference time, i.e., given the iteration map f_θ(·; y) : ℝ^d → ℝ^d we give conditions that guarantee the convergence of x^(k+1) = f_θ(x^(k); y) to a fixed point x^(∞) as k → ∞.

Classical fixed-point theory ensures that the iterates converge to a unique fixed point if the iteration map f_θ(·; y) is contractive, i.e., if there exists a constant 0 < c < 1 such that ‖f_θ(x; y) − f_θ(x′; y)‖ ≤ c‖x − x′‖. Below we give conditions on the learned component R_θ : ℝ^d → ℝ^d (replacing the gradient or proximal mapping of a regularizer) used in the DE-Grad, DE-Prox, and
DE-ADMM models that ensure the resulting iteration map is contractive and thus the fixed-point iterations for these models converge.

In particular, following [21], we assume that the learned component R_θ satisfies the following condition: there exists an ε > 0 such that for all x, x′ ∈ ℝ^d we have

    ‖(R_θ − I)(x) − (R_θ − I)(x′)‖ ≤ ε‖x − x′‖,   (19)

where (R_θ − I)(x) := R_θ(x) − x. In other words, we assume the map R_θ − I is ε-Lipschitz.

If we interpret R_θ as a denoising or de-artifacting network, then R_θ − I is the map that outputs the noise or artifacts present in a degraded image. In practice, R_θ is often implemented with a residual "skip connection", such that R_θ = I + N_θ, where N_θ is, e.g., a deep U-net. Therefore, in this case, (19) is equivalent to assuming the trained network N_θ is ε-Lipschitz.

We prove the following theorem in the supplement.

Theorem 1 (Convergence of
DE-Grad). Assume that R_θ − I is ε-Lipschitz (19), and let L = λ_max(A⊤A) and µ = λ_min(A⊤A), where λ_max(·) and λ_min(·) denote the maximum and minimum eigenvalue, respectively. If the step-size parameter η > 0 is such that η < 1/(L + 1), then the DE-Grad iteration map f_θ(·; y) defined in (5) satisfies

    ‖f_θ(x; y) − f_θ(x′; y)‖ ≤ (1 − η(1 + µ) + ηε)‖x − x′‖ =: γ‖x − x′‖

for all x, x′ ∈ ℝ^d. The coefficient γ is less than 1 if ε < 1 + µ, in which case the iterates of DE-Grad converge.
Proof:
Let f_θ(x; y) be the iteration map for DE-Grad. The Jacobian of f_θ with respect to x ∈ ℝ^d is given by

    J(x) = (I − ηA⊤A) − η∂_x R_θ(x) ∈ ℝ^{d×d},

where ∂_x R_θ(x) ∈ ℝ^{d×d} is the Jacobian of R_θ : ℝ^d → ℝ^d with respect to x ∈ ℝ^d. To prove f_θ(·; y) is contractive it suffices to show ‖J(x)‖ < 1 for all x ∈ ℝ^d, where ‖·‖ denotes the spectral norm. Towards this end, we have

    ‖J(x)‖ = ‖(I − ηA⊤A) − η∂_x R_θ(x)‖
           = ‖ηI + (1 − η)I − ηA⊤A − η∂_x R_θ(x)‖
           = ‖(1 − η)I − ηA⊤A − η(∂_x R_θ(x) − I)‖
           ≤ ‖(1 − η)I − ηA⊤A‖ + η‖∂_x R_θ(x) − I‖
           ≤ max_i |(1 − η) − ηλ_i| + ηε,   (20)

where λ_i denotes the i-th eigenvalue of A⊤A, and in the final inequality (20) we used our assumption that the map (R_θ − I)(x) := R_θ(x) − x is ε-Lipschitz, and therefore the spectral norm of its Jacobian ∂_x R_θ(x) − I is bounded by ε.

Finally, by our assumption η < 1/(L + 1) where L := max_i λ_i, we have η < 1/(λ_i + 1) for all i, which implies (1 − η) − ηλ_i > 0 for all i. Therefore, the maximum in (20) is obtained at µ := min_i λ_i, which gives

    ‖J(x)‖ ≤ 1 − η(1 + µ) + ηε.

This shows f_θ is γ-Lipschitz with γ = 1 − η(1 + µ) + ηε, proving the claim.
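In practice one may want to sanity-check the ε-Lipschitz assumption (19) on a trained R_θ. The following sketch, which is our illustration rather than part of the original analysis, computes a simple Monte-Carlo lower bound on the local Lipschitz constant of R_θ − I around a given input.

```python
import torch

def lipschitz_lower_bound(R_theta, x, n_samples=100, radius=1e-2):
    """Monte-Carlo lower bound on the local Lipschitz constant of R_theta - I
    around x, as a rough empirical check of condition (19). Values well below
    1 + mu are consistent with the contraction bound of Theorem 1."""
    with torch.no_grad():
        base = R_theta(x) - x
        best = 0.0
        for _ in range(n_samples):
            delta = torch.randn_like(x)
            delta = radius * delta / torch.norm(delta)
            pert = R_theta(x + delta) - (x + delta)
            best = max(best, (torch.norm(pert - base) / torch.norm(delta)).item())
    return best
```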
Convergence of the PnP approaches PnP-Prox and PnP-ADMM was studied in [21]. At inference time, the proposed DE-Prox and DE-ADMM methods are equivalent to the corresponding PnP method but with a retrained denoising network R_θ. Therefore, the convergence results in [21] apply directly to DE-Prox and DE-ADMM. To keep the paper self-contained, we restate these results below, specialized to the case of the quadratic data-fidelity term assumed in (2).
Theorem 2 (Convergence of DE-Prox). Assume that R_θ − I is ε-Lipschitz (19), and let L = λ_max(A⊤A) and µ = λ_min(A⊤A) > 0, where λ_max(·) and λ_min(·) denote the maximum and minimum eigenvalue, respectively. Then the DE-Prox iteration map f_θ(·; y) defined in (8) is contractive if the step-size parameter η satisfies

    1/(µ(1 + 1/ε)) < η < 2/L − 1/(L(1 + 1/ε)).

Such an η exists if ε < 2µ/(L − µ). See Theorem 1 of [21].
Theorem 3 (Convergence of DE-ADMM). Assume that R_θ − I is ε-Lipschitz (19), and let L = λ_max(A⊤A) and µ = λ_min(A⊤A) > 0, where λ_max(·) and λ_min(·) denote the maximum and minimum eigenvalue, respectively. Then the iteration map f_θ(·; y) for DE-ADMM defined in (11) is contractive if the ADMM step-size parameter α satisfies

    ε/((1 + ε − ε²)µ) < α.

See Corollary 1 of [21].

Unlike the convergence result for
DE-Grad in Theorem 1, the convergence results for
DE-Prox and
DE-ADMM in Theorem 2 and Theorem 3 make the assumption that λ_min(A⊤A) > 0, i.e., that A has a trivial nullspace. This condition is satisfied for certain inverse problems, such as denoising or deblurring, but violated in many others, including compressed sensing and undersampled MRI. However, in practice we observe that the iterates of DE-Prox and
DE-ADMM still appear to converge even in situations where A has a nontrivial nullspace, indicating this assumption may be stronger than necessary.

Finally, an important practical concern when training deep equilibrium models is whether the fixed-point iterates used to compute gradients (as detailed in Section 4.2) will converge. Specifically, the gradient of the loss at the training pair (y, x*) involves computing the truncated Neumann series in (17). This series converges if the Jacobian ∂f_θ(x; y)/∂x has spectral norm less than 1 when evaluated at any x ∈ ℝ^d, which is true if and only if f_θ(·; y) is contractive. Therefore, the same conditions in Theorems 1-3 that ensure the iteration map f_θ(·; y) is contractive also ensure that the Neumann series in (17) used to compute gradients converges.

Table 1: Median test PSNR and SSIM comparison across reconstruction approaches and problems; for each setting, the best PSNR and SSIM are in bold.
Columns (left to right): TV | Plug-and-Play (U-net denoiser): Prox, ADMM | RED (U-net denoiser): ADMM | Deep Unrolled methods trained end-to-end: Grad, Prox, ADMM, Precond. Neumann | Deep Equilibrium (ours): Grad, Prox, ADMM

Deblur (1), PSNR: 26.79  24.86  25.95  25.35  28.17  27.57  28.14  28.73  29.26
Deblur (2), PSNR: 31.31  33.14  33.43  33.25  34.17  34.22  35.38  35.47  38.74
CS (8x),    PSNR: 15.30   4.61   4.04   5.32  18.45  20.38  19.81  20.25  21.89
Experimental Results

Our numerical experiments include comparisons with a variety of models and methods. Total-variation regularized least squares (TV) is an important baseline that does not use any training data but rather leverages geometric models of image structure [3, 29, 30]. The PnP and RED methods are described in §2.3; we consider both the original ADMM variant of [15] (PnP-ADMM) and a proximal gradient (PnP-Prox) method as described in [21]. We utilize the ADMM formulation of
RED. Deep unrolled methods (DU) are described in §2.1; we consider DU using gradient descent, proximal gradient, and ADMM. The preconditioned Neumann network [31] represents the state of the art in unrolled approaches but does not have simple Deep Equilibrium or Plug-and-Play analogues.

We compare the above approaches across three inverse problems: Gaussian deblurring (Deblur), 8× Gaussian compressed sensing (CS), and 8× accelerated Cartesian single-coil MRI reconstruction (MRI). The compressed sensing and single-coil MRI measurements are corrupted with additive Gaussian noise with σ = 0 . ; Deblur (1) is blurred with additive Gaussian noise with σ = 0 . , and Deblur (2) is corrupted with additive Gaussian noise with σ = 0 . . For deblurring and compressed sensing, we utilize a subset of the Celebrity Faces with Attributes (CelebA) dataset [32], which consists of centered human faces. We train on a subset of 10000 of the training images. All images are resized to 128 × 128.

For our learned network, we utilize a U-Net architecture [18] with some modifications. First, we have removed all instance normalization layers. For both the CelebA and fastMRI datasets, we train six U-Net denoisers at a range of noise levels σ on the training split. For the CelebA set, to ensure contractivity of the learned component, we add spectral normalization to all layers [5], ensuring that each layer has a Lipschitz constant bounded above by 1. This normalization is enforced during pretraining as well as during the Deep Equilibrium training phase. We found that adding spectral normalization resulted in significant PSNR drops on the fastMRI dataset, and so did not use spectral normalization for the fastMRI U-Nets. Instead, we initialized the U-Nets before pretraining by sampling kernel weights from a random Gaussian distribution with σ = 0 . . Empirically, this initialization provides sufficient expressive power without causing a lack of contractivity, which can cause significant problems during training. Further details on settings, parameter choices, and data may be found in the appendix and in our publicly-available code (available at https://github.com/dgilton/deep_equilibrium_inverse).

The proposed approaches to solving inverse problems via Deep Equilibrium are all based on some iterative optimization algorithm. Each of these iterative optimization algorithms has its own set of hyperparameters to choose, e.g., the step size η in DE-Grad, plus any parameters used to calculate the initial estimate x^(0). We choose to fix all algorithm-specific hyperparameters prior to training a Deep Equilibrium network, for clarity and ease of training. We perform a grid search over algorithm-specific hyperparameters, testing the performance of the untrained Deep Equilibrium network on a held-out test set.

Tuning hyperparameters requires choosing a particular R_θ during tuning. We use an R_θ that has been pretrained for Gaussian denoising. Pretraining can be done on the target dataset (e.g., training on MRI images directly) or using an independent dataset (e.g., the BSD500 image dataset [34]). We use the former approach in our experiments. In the supplemental materials we show that pretraining provides a small improvement in reconstruction accuracy over random initialization. Because we initialize our learned components with denoisers, the initial setup of our method exactly corresponds to tuning a PnP approach with a deep denoiser. Training adapts the iteration map f_θ to a particular inverse problem and data distribution.
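The sketch below illustrates one way such a denoiser could be constructed and pretrained: spectral normalization is attached to the convolutional layers (as described above for the CelebA denoisers) and the network is trained to remove additive Gaussian noise. The helper names, noise level, and optimizer settings are illustrative assumptions, not the authors' released training configuration.

```python
import torch
import torch.nn as nn

def add_spectral_norm(module):
    """Recursively wrap convolutional layers with spectral normalization so each
    layer's Lipschitz constant is bounded above by 1 (as for the CelebA denoisers)."""
    for name, child in module.named_children():
        if isinstance(child, (nn.Conv2d, nn.ConvTranspose2d)):
            setattr(module, name, nn.utils.spectral_norm(child))
        else:
            add_spectral_norm(child)
    return module

def pretrain_denoiser(R_theta, loader, sigma=0.05, epochs=10, lr=1e-4):
    """Pretrain R_theta as a Gaussian denoiser: given x + noise, recover x."""
    opt = torch.optim.Adam(R_theta.parameters(), lr=lr)
    for _ in range(epochs):
        for x in loader:               # loader yields batches of clean images
            noisy = x + sigma * torch.randn_like(x)
            loss = nn.functional.mse_loss(R_theta(noisy), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return R_theta
```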
We note that our approach may be used to adapt any iterative optimization framework satisfying the conditions in Section 4, e.g., solvers for RED [19].

Results

We present the main reconstruction accuracy comparison in Table 1. Each entry for Deep Equilibrium (DE), Regularization by Denoising (RED), and Plug-and-Play (PnP) approaches is the result of running fixed-point iterations until the relative change between iterations falls below a fixed tolerance. During training, all DE models were limited to 50 forward updates. The DU models are tested at the number of iterations for which they were trained, and all parameters for TV reconstructions (including the number of TV iterations) are cross-validated to maximize PSNR. Performance as a function of iteration is shown in Figs. 1(a), 5(a), and 5(b), with example reconstructions in Fig. 6. Further example reconstructions are available for qualitative evaluation in the appendix. We observe our DE reconstructions improve reconstruction quality beyond the number of iterations they were trained for.

We observe our DE-based approaches consistently outperform DU approaches across f_θ choices. Among choices of iterative reconstruction architectures for f_θ, DE-Prox is a front-runner. However, some of the differences may be due to
DE-ADMM not being accelerated, while
DE-Prox and
DE-Grad leverage Anderson acceleration.

Across problems, the DE approach is generally competitive with the end-to-end trained DU solvers. For two of three problems, our approach requires no more computation to be competitive with a DU network, with increasing advantage to our approach with further computation.
Pretraining versus Random Initialization

Here we compare the effect of initializing the learned component R_θ in our deep equilibrium models with a pretrained denoiser versus initializing with random weights. We use identical hyperparameters for both initialization methods. We present our results on Deep Equilibrium Proximal Gradient Descent (DE-Prox) and Deep Equilibrium ADMM (
DE-ADMM) in Figure 7(a). We observe an improvement in reconstruction quality when utilizing our pretraining method compared to a random initialization. We also note that pretraining enables a simpler choice of algorithm-specific hyperparameters. For example, with random initialization, choosing the proper internal step size for
DE-Prox would require training several different
DE-Prox instances, and choosing the correct step size based on a validation set, in addition to any other parameters, such as the learning rate used during training.

Figure 5: Iterations vs. reconstruction PSNR for
DE-Prox and competing methods for (a) deblurring and (b) compressed sensing. MRI results are in Fig. 1(a). The deep unrolled prox grad was trained for 10 iterations. In (b), the PSNR of Plug-and-Play ProxGrad was below 5dB for all iterations. In all examples, deep unrolling is only effective at the number of iterations for which it is trained, whereas deep equilibrium achieves higher PSNR across a broad range of iterations, allowing a user to trade off computation time and accuracy.

Figure 6: MRI reconstruction example: (a) ground truth; (b) IFFT (A⊤y), PSNR = 22.7 dB; (c) DU-Prox, PSNR = 30.1 dB; (d) DE-Prox, PSNR = 32.2 dB. Best viewed digitally.
Robustness to Noise

We observe empirically that the Deep Equilibrium approach to training achieves competitive reconstruction quality and increased flexibility with respect to allocating computation budget. Recent work in deep inversion has questioned these methods' robustness to noise and unexpected inputs [35, 36, 37]. To examine whether the Deep Equilibrium approach is brittle to simple changes in the noise distribution, we varied the level of Gaussian noise added to the observations at test time and observed the effect on reconstruction quality. Fig. 7(b) demonstrates that the Deep Equilibrium model
DE-Prox is more robust to variation in the noise level than the analogous Deep Unrolled approach
DU-Prox. The forward model used in Fig. 7(b) is MRI reconstruction.
Figure 7: (a) Comparison of learned DE-Prox reconstruction quality across three different inverse problems: deblurring (Blur), compressed sensing (CS), and undersampled MRI reconstruction (MRI). In our experiments, initializing with a pretrained denoiser routinely offered as good or better reconstruction quality (in terms of PSNR) than a random initialization. (b) Noise sensitivity comparison between DU-Prox and DE-Prox. The forward model used is MRI reconstruction, and σ here corresponds to the level of Gaussian noise added to observations.

Conclusions

This paper illustrates non-trivial quantitative benefits to using implicitly defined infinite-depth networks for solving linear inverse problems in imaging. Other recent work has focused on such implicit networks akin to the deep equilibrium models considered here (e.g., [38]). Whether these models could lead to additional advances in image reconstruction remains an open question for future work. Furthermore, while the exposition in this work focused on linear inverse problems, nonlinear inverse problems may be solved with iterative approaches just as well. The conditions under which the deep equilibrium methods proposed here may be used on such iterative approaches are an active area of investigation.

References

[1] G. Ongie, A. Jalal, C. A. Metzler, R. G. Baraniuk, A. G. Dimakis, and R. Willett, “Deep learning techniques for inverse problems in imaging,” arXiv preprint arXiv:2005.06001, 2020. [2] A. N. Tikhonov, “On the stability of inverse problems,” in
Dokl. Akad. Nauk SSSR , vol. 39, 1943, pp.195–198.[3] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,”
Physica D: Nonlinear Phenomena, vol. 60, no. 1-4, pp. 259–268, 1992. [4] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,”
IEEE Transactions on image processing , vol. 16, no. 8, pp. 2080–2095, 2007.[5] S. Bai, J. Z. Kolter, and V. Koltun, “Deep equilibrium models,” in
Advances in Neural Information Processing Systems, 2019, pp. 690–701. [6] M. J. Muckley, B. Riemenschneider, A. Radmanesh, S. Kim, G. Jeong, J. Ko, Y. Jun, H. Shin, D. Hwang, M. Mostapha et al., “State-of-the-art machine learning MRI reconstruction in 2020: Results of the second fastMRI challenge,” arXiv preprint arXiv:2012.06318, 2020. [7] D. Wu, K. Kim, and Q. Li, “Computationally efficient deep neural network for computed tomography image reconstruction,” Medical Physics, vol. 46, no. 11, pp. 4763–4776, 2019. [8] I. Y. Chun, Z. Huang, H. Lim, and J. Fessler, “Momentum-net: Fast and convergent iterative neural network for inverse problems,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. [9] A. Mehranian and A. J. Reader, “Model-based deep learning PET image reconstruction using forward-backward splitting expectation maximisation,” IEEE Transactions on Radiation and Plasma Medical Sciences, 2020. [10] J. Adler and O. Öktem, “Learned primal-dual reconstruction,”
IEEE transactions on medical imaging ,vol. 37, no. 6, pp. 1322–1332, 2018.[11] R. Dabre and A. Fujita, “Recurrent stacking of layers for compact neural machine translation models,”in
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 6292–6299. [12] S. Bai, J. Z. Kolter, and V. Koltun, “Trellis networks for sequence modeling,” arXiv preprint arXiv:1810.06682, 2018. [13] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, “Neural ordinary differential equations,” in
Advances in neural information processing systems , 2018, pp. 6571–6583.[14] E. Haber and L. Ruthotto, “Stable architectures for deep neural networks,”
Inverse Problems, vol. 34, no. 1, p. 014004, 2017. [15] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, “Plug-and-play priors for model based reconstruction,” in 2013 IEEE Global Conference on Signal and Information Processing. IEEE, 2013, pp. 945–948. [16] S. Boyd, N. Parikh, and E. Chu, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Now Publishers Inc, 2011. [17] S. H. Chan, X. Wang, and O. A. Elgendy, “Plug-and-play ADMM for image restoration: Fixed-point convergence and applications,” IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 84–98, 2016. [18] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241. [19] Y. Romano, M. Elad, and P. Milanfar, “The little engine that could: Regularization by denoising (RED),” SIAM Journal on Imaging Sciences, vol. 10, no. 4, pp. 1804–1844, 2017. [20] T. Meinhardt, M. Moller, C. Hazirbas, and D. Cremers, “Learning proximal operators: Using denoising networks for regularizing inverse imaging problems,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1781–1790. [21] E. Ryu, J. Liu, S. Wang, X. Chen, Z. Wang, and W. Yin, “Plug-and-play methods provably converge with properly trained denoisers,” in
International Conference on Machine Learning, 2019, pp. 5546–5557. [22] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3929–3938. [23] T. Tirer and R. Giryes, “Super-resolution via image-adapted denoising CNNs: Incorporating external and internal learning,” IEEE Signal Processing Letters, vol. 26, no. 7, pp. 1080–1084, 2019. [24] J. Liu, Y. Sun, C. Eldeniz, W. Gan, H. An, and U. S. Kamilov, “RARE: Image reconstruction using deep priors learned without groundtruth,”
IEEE Journal of Selected Topics in Signal Processing , vol. 14, no. 6,pp. 1088–1099, 2020.[25] N. Parikh and S. Boyd, “Proximal algorithms,”
Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127–239, 2014. [26] M. Mardani, Q. Sun, D. Donoho, V. Papyan, H. Monajemi, S. Vasanawala, and J. Pauly, “Neural proximal gradient descent for compressive imaging,” in Advances in Neural Information Processing Systems, 2018, pp. 9573–9583. [27] H. F. Walker and P. Ni, “Anderson acceleration for fixed-point iterations,” SIAM Journal on Numerical Analysis, vol. 49, no. 4, pp. 1715–1735, 2011. [28] Z. Kolter, D. Duvenaud, and M. Johnson, “Deep implicit layers - neural ODEs, deep equilibrium models, and beyond,” 2020. [Online]. Available: http://implicit-layers-tutorial.org/ [29] D. Strong and T. Chan, “Edge-preserving and scale-dependent properties of total variation regularization,” Inverse Problems, vol. 19, no. 6, p. S165, 2003. [30] A. Beck and M. Teboulle, “Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems,”
IEEE transactions on image processing , vol. 18, no. 11, pp. 2419–2434, 2009.[31] D. Gilton, G. Ongie, and R. Willett, “Neumann networks for linear inverse problems in imaging,”
IEEE Transactions on Computational Imaging, 2019. [32] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of International Conference on Computer Vision (ICCV), December 2015. [33] J. Zbontar, F. Knoll, A. Sriram, M. J. Muckley, M. Bruno, A. Defazio, M. Parente, K. J. Geras, J. Katsnelson, H. Chandarana, Z. Zhang, M. Drozdzal, A. Romero, M. Rabbat, P. Vincent, J. Pinkerton, D. Wang, N. Yakubova, E. Owens, C. L. Zitnick, M. P. Recht, D. K. Sodickson, and Y. W. Lui, “fastMRI: An open dataset and benchmarks for accelerated MRI,” 2018. [34] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proceedings Eighth IEEE International Conference on Computer Vision (ICCV 2001), vol. 2. IEEE, 2001, pp. 416–423. [35] V. Antun, F. Renna, C. Poon, B. Adcock, and A. C. Hansen, “On instabilities of deep learning in image reconstruction and the potential costs of AI,” Proceedings of the National Academy of Sciences, 2020. [36] A. Raj, Y. Bresler, and B. Li, “Improving robustness of deep-learning-based image reconstruction,” arXiv preprint arXiv:2002.11821, 2020. [37] M. Genzel, J. Macdonald, and M. März, “Solving inverse problems with deep neural networks – robustness included?” arXiv preprint arXiv:2011.04268, 2020. [38] L. El Ghaoui, F. Gu, B. Travacca, and A. Askari, “Implicit deep learning,” arXiv preprint arXiv:1908.06315, 2019.
Appendix
In this section, we provide further visualizations of the reconstructions produced by Deep Equilibrium models and the corresponding Deep Unrolled approaches, beyond those shown in the main body. Figures 10, 9, and 8 are best viewed electronically, and contain the ground-truth images, the measurements (projected back to image space in the case of MRI and compressed sensing), and reconstructions by
DU-Prox and
DE-Prox. We also visualize the intermediate iterates in the fixed-point iterations, to further demonstrate the convergence properties of DEMs for image reconstruction. We find that DEMs converge quickly to reasonable reconstructions, and maintain high-quality reconstructions after more than one hundred iterations.

Figure 8: Sample images and reconstructions (columns: ground truth, IFFT A⊤y, DU-Prox, DE-Prox (ours)) for 8× accelerated MRI reconstruction with additive noise of σ = 0 . . Best viewed electronically.

In Figures 13, 12, and 11 we visualize the outputs of the K-th iteration of the mapping f_θ in DE-Prox. Recall that during training,
DE-Prox was limited to 50 forward iterations. We observe that across forward problems, the reconstructions continue converging after this number of iterations. We illustrate 90 iterations for deblurring and MRI reconstructions, and 190 iterations for compressed sensing, since that problem requires additional iterations to converge. For further illustration we also demonstrate the qualitative effects of running
DU-Prox for more iterations than it was trained for.

Figure 9: Sample images and reconstructions (columns: ground truth, A⊤y, DU-Prox, DE-Prox (ours)) for 8× Gaussian compressed sensing with additive noise of σ = 0 . . Best viewed electronically.

In this section we provide further details related to the experimental setup. The input to every learned algorithm is the preconditioned measurement (A⊤A + λI)^{-1} A⊤y, where λ is generally set to be equal to the noise level σ. For MRI reconstruction experiments, λ = 0 . was used.

The masks used in the MRI reconstruction experiments are based on a Cartesian sampling pattern, as in the standard fastMRI setting. The center 4% of frequencies are fully sampled, and further frequencies are sampled according to a Gaussian distribution centered at 0 frequency with σ = 1.

The compressed sensing design matrices have entries sampled and scaled so that each entry is drawn from a Gaussian distribution with variance 1/m, where A ∈ ℝ^{m×n}. The same design matrix is used for all learned methods.

Optimization algorithm parameters for RED, Plug-and-Play, and all Deep Equilibrium approaches are all chosen via a logarithmic grid search from − to with 20 elements in each dimension of the grid. All DU methods were trained for 10 iterations. All testing was done on an NVidia Titan X. All networks were trained on a cluster with a variety of computing resources (see: https://slurm.ttic.edu/). Every experiment was run utilizing a single-GPU, single-CPU setup with less than 12 GB of GPU memory.

Figure 10: Sample images and reconstructions (columns: ground truth, A⊤y, DU-Prox, DE-Prox (ours)) for Gaussian deblurring with additive noise of σ = 0 . . Best viewed electronically.

Figure 11: Sample images and reconstructions for MRI reconstruction with acceleration 8× and additive Gaussian noise with σ = 0 . . Each image represents the output of iterate number K (iterates K = 0, 10, ..., 90 shown). Below each image is the residual between iterate K and the previously-visualized iterate, or in the case of K = 0, between the input to the network and the output of the initial iterate. Each residual is magnified for easier visualization. The ground truth may be viewed in the initial column of Figure 8. Best viewed electronically.

Figure 12: Sample images and reconstructions from DE-Prox reconstructions, with the forward model 8× Gaussian compressed sensing with Gaussian noise with σ = 0 . . Each image represents the output of iterate number K (iterates K = 0, 10, ..., 190 shown). Below each image is the residual between iterate K and the previously-visualized iterate, or in the case of K = 0, between the input to the network and the output of the initial iterate. The ground truth may be viewed in the final column of Figure 9. Each residual is magnified for easier visualization. Best viewed electronically.

Figure 13: Sample images and reconstructions from DE-Prox performing Gaussian deblurring with Gaussian noise with σ = 0 . . Each image represents the output of iterate number K (iterates K = 0, 10, ..., 90 shown). Below each image is the residual between iterate K and the previously-visualized iterate, or in the case of K = 0, between the input to the network and the output of the initial iterate. Each residual is magnified for easier visualization. The ground truth may be viewed in the final column of Figure 10.
Best viewed electronically.

Figure 14: Sample images and reconstructions from the standard DU-Prox method performing MRI reconstruction with acceleration 8× and additive Gaussian noise with σ = 0 . . Each image represents the output of iterate number K (iterates K = 0, 10, ..., 90 shown). Unlike our approach, iterates beyond the number DU-Prox was trained for decrease reconstruction accuracy, adding nonphysical artifacts. The ground truth may be viewed in the initial column of Figure 8. Best viewed electronically.

Figure 15: Sample images and reconstructions from the standard