Deep Learning Methods for Solving Linear Inverse Problems: Research Directions and Paradigms
Yanna Bai, Wei Chen*, Jie Chen and Weisi Guo
State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing, China; Northwestern Polytechnical University, Xi'an, China; Cranfield University, Milton Keynes, UK; Alan Turing Institute, London, UK
*Corresponding author
Abstract
The linear inverse problem is fundamental to the development of various scientific areas. Innumerable attempts have been carried out to solve different variants of the linear inverse problem in different applications. Nowadays, the rapid development of deep learning provides a fresh perspective for solving the linear inverse problem, where various well-designed network architectures result in state-of-the-art performance in many applications. In this paper, we present a comprehensive survey of the recent progress in the development of deep learning for solving various linear inverse problems. We review how deep learning methods are used in solving different linear inverse problems, and explore the structured neural network architectures that incorporate knowledge used in traditional methods. Furthermore, we identify open challenges and potential future directions along this research line.
Keywords:
Deep learning, Linear inverse problems, Neural networks
1. Introduction
The study of the inverse problem begins in the early 20th century and is still attractive today. The inverse problem refers to using the results of actual
observations to infer the values of the parameters that characterize the system and to estimate data that are not easily directly observed.

The inverse problem exists in many applications. In geophysics, the inverse problem is solved to detect mineral deposits such as underground oil based on the observations of an acoustic wave which is sent from the surface of the earth. In medical imaging, the inverse problem is solved to reconstruct an image of the internal structure of the human body based on the X-ray signal passing through the human body. In mechanical engineering, the inverse problem is solved to perform nondestructive testing by processing the scattered field on the surface, which avoids expensive and destructive evaluation. In imaging, the inverse problem is solved to recover images of high quality from lossy observations, for example, in image denoising and image super-resolution (SR).

Mathematically, the inverse problem can be described as the estimation of hidden parameters of the model m ∈ R^N from the observed data d ∈ R^M, where N (possibly infinite) is the number of model parameters and M is the dimension of the observed data. A general description of the inverse problem is

d = A(m),    (1)

where A is the forward operator mapping the model space to the data space. An inverse problem is well-posed if it satisfies the following three properties [1].

• Existence. For any data d, there exists an m that satisfies (1), which means there exists a model that fits the observed data.

• Uniqueness. For every d, if there are m_1 and m_2 that satisfy (1), then m_1 = m_2, which means the model that fits the observed data is unique.

• Stability. A^{-1} is a continuous map, which means small changes in the observed data d lead to small changes in the estimated model parameters m.

If any of the three properties does not hold, the inverse problem is ill-posed.

1.1. The Linear Inverse Problem

In linear inverse problems (LIPs), the forward operator A in (1) is linear and can be written as a matrix A ∈ R^{M×N}. When M = N and the matrix A has full rank, the solution of the LIP is unique, and the model parameters are given by multiplying the matrix inverse A^{-1} with the data d. In the situation M > N, it becomes an over-determined problem that may have no solution. In situations where M < N, the LIP is underdetermined, and the solution of the underdetermined LIP is not unique, which means this LIP is ill-posed. To solve the ill-posed LIP, extra knowledge of the system model is usually needed, which is also known as prior information.

In the presence of noisy observed data d, the LIP can be expressed as an optimization problem as follows:

min_m ‖d − Am‖_2^2 + J(m),    (2)

where J(·) incorporates the prior information. For example, the Tikhonov regularization is popularly used, where J(m) = ‖Γm‖_2^2 and Γ represents the Tikhonov matrix (e.g., Γ = αI).
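For concreteness, the Tikhonov-regularized problem (2) admits a closed-form solution, m = (A^T A + αI)^{-1} A^T d. The following is a minimal NumPy sketch; the dimensions and the value of α are arbitrary illustrative choices, not tied to any cited work.

```python
import numpy as np

# Minimal sketch of (2) with Tikhonov regularization J(m) = alpha*||m||_2^2,
# whose minimizer is m = (A^T A + alpha I)^{-1} A^T d.
rng = np.random.default_rng(0)
M, N, alpha = 20, 50, 0.1            # underdetermined case: M < N
A = rng.standard_normal((M, N))
m_true = rng.standard_normal(N)
d = A @ m_true + 0.01 * rng.standard_normal(M)

m_hat = np.linalg.solve(A.T @ A + alpha * np.eye(N), A.T @ d)
print(np.linalg.norm(d - A @ m_hat))  # small data-fitting residual
```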
Based on the different prior information and the structure of the operator A, the LIP can be classified into different categories [2]. In the following two subsections, we review LIPs that have attracted extensive interest in recent years.

In this subsection, we introduce LIPs with various parameterized models, which correspond to different prior information.
In LIPs, one popular form of prior information is the sparsity of the parameters, which has been applied in communication systems [3, 4], sensor networks [5, 6] and many other applications [7, 8]. In sparse LIPs, m is a sparse vector where only several elements of m are non-zero, and the prior information is J(·) = α‖m‖_0, where α is a regularization parameter and ‖m‖_0 denotes the ℓ_0 norm of the vector m, which counts the number of non-zeros in m. Since the optimization problem in sparse LIPs has a non-continuous objective function and is NP-hard, one usually resorts to solving an alternative problem with a smoothed objective function [9]. The regularizer J(·) is replaced by a sparsity-enforcing function, e.g., the ℓ_1 norm function J(·) = ‖m‖_1 or the log penalty function J(·) = Σ_{i=1}^N log(1 + |m_i|) in [10]. Under certain conditions on the matrix A and the sparsity level of m, the solution of the new optimization problem is equivalent to that of the original problem [11].

In addition to the sparse structure, real-world signals exhibit many other structures, e.g., block-sparsity [12], group-sparsity [13], tree-sparsity [14] and others [15, 16], which can be exploited in solving for m from the observations d. Considering the block-sparsity or the group-sparsity, m can be written as m = [m_1; m_2; ...; m_r] with m_i ∈ R^L (i = 1, ..., r) for N = Lr, where only several of the m_i are non-zero vectors. For the tree-sparsity, the non-zeros cluster along the branches of the tree. That means, if a node is non-zero, then the other nodes on the branch from the root to that node are non-zero. The tree-sparsity widely exists in the wavelet coefficients of natural signals and images.

Another popular structure exists in the multiple measurement vector (MMV) problem, which is an extension of the basic sparse LIP. The hidden parameter is M = [m_1, m_2, ..., m_L] ∈ R^{N×L}, and the measurements are D = [d_1, d_2, ..., d_L] ∈ R^{M×L}. In many MMV problems, the columns of M are considered to be jointly sparse [17]. The simplest MMV structure is row-sparsity, where the non-zeros of each column share the same support (Fig. 1(b)). There are various jointly sparse structures in MMV problems, some of which are illustrated in Fig. 1 [18]. More structures can be formed by combining the jointly sparse structure in the MMV and the structure in each vector, e.g., the forest sparsity [19], which combines the joint sparsity and the tree-sparsity.

Figure 1: Various structured sparse models [18]. (a) sparsity, (b) row-sparsity, (c) row-sparsity with embedded element-sparsity, (d) row-sparsity plus element-sparsity.

Low-rank matrix recovery is another rapidly developing research topic with broad applications, such as saliency detection [20], face recognition [21] and others [22, 23]. Low-rank matrix recovery aims to estimate a low-rank matrix M ∈ R^{N_1×N_2} from the observed data d, which is obtained by using a linear operator A: R^{N_1×N_2} → R^M (M < N_1 N_2). In the low-rank matrix recovery problem, the prior information is J(·) = α · rank(M), where rank(·) denotes the matrix rank and α denotes the regularization parameter. This optimization problem is also NP-hard. Alternatively, under certain conditions on the linear mapping and the matrix rank, one can replace J(·) with J(·) = α‖M‖_*, where ‖·‖_* denotes the matrix nuclear norm, which sums the singular values of the matrix. As the tightest convex relaxation of rank minimization, the nuclear norm minimization problem can be solved via various convex optimization algorithms [24].
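In such convex algorithms, the nuclear norm typically enters through its proximal operator, singular value thresholding. Below is a hedged NumPy sketch; the matrix sizes, rank and threshold are illustrative assumptions, not taken from any cited method.

```python
import numpy as np

def svt(Y, tau):
    """Singular value thresholding: the proximal operator of the
    nuclear norm, which soft-thresholds the singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Shrinking the singular values of a noisy low-rank matrix
rng = np.random.default_rng(0)
L = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 30))  # rank 2
Y = L + 0.1 * rng.standard_normal((30, 30))
print(np.linalg.matrix_rank(svt(Y, tau=2.0)))  # close to 2
```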
In real-world signals, the low-rank structure can be combined with other structures. In the simultaneously sparse and low-rank matrix reconstruction problem, which arises in sub-wavelength optical imaging, hyperspectral image unmixing and graph denoising, the matrix M is simultaneously sparse and low-rank [25]. The corresponding regularizer is J(·) = α‖M‖_0 + β · rank(M), where α and β are positive parameters that balance the sparsity, the matrix rank, and the data-fitting term. A popular convex relaxation of this problem is to replace the ℓ_0 norm and the rank function with the ℓ_1 norm and the nuclear norm, respectively. The sparse plus low-rank matrix reconstruction aims to recover a matrix M which is the superposition of a low-rank matrix L and a sparse matrix S. This problem arises in applications such as network anomaly detection, magnetic resonance imaging (MRI) and single voice extraction. An alternative optimization problem with convex relaxed terms can be used to facilitate algorithm development, e.g., the robust principal component analysis (RPCA) with an identity matrix as the mapping A [26].

The low-rank structure also exists in tensors. The tensor is a higher-dimensional generalization of the matrix that has attracted great attention in recent years. Low-rank tensor recovery aims to recover the low-rank tensor M ∈ R^{N_1×...×N_n} from a limited number of observations, where A: R^{N_1×...×N_n} → R^M (typically M ≪ ∏_{i=1}^n N_i). The corresponding prior information is J(·) = rank(M), where rank(M) denotes some form of tensor rank. One popular approach is to use the tensor nuclear norm ‖M‖_*, which is a convex combination of the nuclear norms of all matrices unfolded along different modes [27]. There also exist nonconvex methods; for example, in [28], Chen et al. propose an empirical Bayes method that has state-of-the-art performance in sparse and low-rank matrix recovery.

In this subsection, we introduce LIPs with various linear operators A, which arise in different applications.

The linear operator A is an identity matrix in denoising. Solving the inverse problem in denoising is to remove the noise n from the observed data d. In LIPs, the observed data may contain noise that comes from the measurement process, the transmission process, or the quantization or compression process for storage. Imperfect instruments and interfering natural phenomena can also introduce noise. There are various types of noise in different applications. For example, images may be corrupted by Gaussian noise, salt-and-pepper noise, speckle noise, Brownian noise and others [29]. Denoising is the process of removing the noise from the observed data, which is an essential and important problem that can be found in astronomy, medical imaging and many other applications. Existing algorithms for denoising include non-local means [30], the curvelet transform [31], statistical modeling [32], and nonlocal self-similarity (NSS) models [33]. The NSS models are popular in advanced methods such as BM3D [33], NCSR [34] and WNNM [35]. For blind denoising, techniques based on dictionary learning and transform learning are popular [36, 37, 38, 39].

Image SR is another typical LIP, where the linear operator A = DBM refers to the image acquisition process, which contains a set of degradations that involve warping (M), blurring (B), down-sampling (D) and noise [40].
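A toy NumPy sketch of such a degradation operator follows, with the warping M taken as the identity and a 3×3 box blur standing in for B; both are simplifying assumptions made here for illustration only.

```python
import numpy as np

# A toy sketch of the SR forward model d = DBM(m): blur (B) followed by
# down-sampling (D); the warping M is taken as the identity.
def blur_downsample(img, scale=2):
    # 3x3 box blur with edge padding, then decimation by `scale`
    padded = np.pad(img, 1, mode="edge")
    blurred = sum(padded[i:i + img.shape[0], j:j + img.shape[1]]
                  for i in range(3) for j in range(3)) / 9.0
    return blurred[::scale, ::scale]

hr = np.random.rand(64, 64)
lr = blur_downsample(hr)   # 32x32 observation of a 64x64 scene
print(hr.shape, lr.shape)
```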
Image SR aims to reconstruct a high-resolution (HR) image from a single low-resolution (LR) image or multiple LR images. Since the number of unknown variables in the HR image exceeds the number of known parameters in the LR images, image SR is an ill-posed LIP. Classic methods for image SR include edge-based methods [41], image statistical methods [42], sparse coding [43] and example-based methods [44].

Compressed sensing (CS) is a LIP whose linear operator A has more columns than rows. CS is a sampling paradigm that goes beyond the Nyquist theory and can recover the entire desired signal from fewer measured values by exploiting sparse signal characteristics. In CS, the linear operator A has fewer rows than columns, i.e., M < N, which leads to an underdetermined system. To reconstruct the signal m from a reduced number of observations, the reconstructed signal m is required to be sparse, or to be represented as a sparse vector under certain transformations, e.g., the wavelet transform, the Fourier transform and the discrete cosine transform.

Feature selection (FS) is a LIP whose linear operator A has fewer columns than rows. FS is the process that finds the features having the most contribution to the prediction or output we are interested in. It is a useful tool to simplify models for interpretation, reduce overfitting and avoid the curse of dimensionality in machine learning and signal processing. FS has been applied in many applications such as text categorization, bioinformatics and data mining. One approach to conducting FS is to formulate the problem as a LIP. For example, to classify handwritten digits, each row of A includes the feature coefficients of one data sample [45]. Since the number of data samples could be large, the linear operator A has M > N. A key premise of FS is that the data contains redundant or irrelevant features, and thus removing those features does not result in loss of information in the prediction [46].

Dictionary learning denotes a LIP whose linear operator A and its representation m are learned from the observed data d, which exists in many applications such as image classification [47], outlier detection [48], and distributed CS [49]. With the learned dictionary A, the high-dimensional signal can be reduced in dimensionality to remove redundant information generated in the sampling process. Generally, only some of the atoms in the dictionary are used to construct the sparse representation of the high-dimensional signal. Compared with a predefined dictionary, e.g., wavelets, the learned one is more appropriate for signals of the same ensemble and can lead to improved performance in various tasks, e.g., denoising and classification. We refer interested readers to [50] for more details on various dictionary learning methods, including the probabilistic learning methods, the learning methods based on clustering or vector quantization, and the methods for learning dictionaries with a particular construction. While traditional dictionary learning relies on one level of the dictionary, the new deep dictionary learning (DDL), which combines the concepts of dictionary learning and deep learning (DL), uses multiple layers of dictionaries to represent the signal [51]. Dictionary learning can also be combined with other techniques; for example, Gong et al. propose a simultaneously sparse and low-rank tensor representation model to enhance the capability of dictionary learning for hyperspectral image denoising [52], and Xin et al. jointly optimize the sensing matrix and sparsifying dictionary for tensor CS [53].
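To give a feel for the alternating structure common to dictionary learning methods, here is a rough NumPy sketch. The sparse coding step is a simple hard-thresholded least squares and the dictionary update is MOD-style; neither is taken from the cited methods, and all sizes are illustrative.

```python
import numpy as np

# A hedged sketch of dictionary learning by alternating minimization.
rng = np.random.default_rng(0)
N, K, T = 20, 40, 200                      # signal dim, atoms, samples
D = rng.standard_normal((N, K))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
X = rng.standard_normal((N, T))            # training signals

for _ in range(10):
    # Sparse coding: least squares followed by hard thresholding
    S = np.linalg.lstsq(D, X, rcond=None)[0]
    S[np.abs(S) < np.quantile(np.abs(S), 0.9)] = 0.0  # keep largest 10%
    # MOD-style dictionary update: D = X * pinv(S), then renormalize
    D = X @ np.linalg.pinv(S)
    D /= np.linalg.norm(D, axis=0) + 1e-12
```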
2. DL and LIPs
In this section, we first illustrate the motivation and advantages of using DL in solving LIPs. We then summarize earlier efforts on using DL in inverse problems and clarify the novelty of this review. Finally, we briefly introduce the categorization of the different methods in Section 3.
As a long-standing problem, plenty of algorithms have been proposed in the literature to solve LIPs. For example, in CS, under certain conditions on the sensing matrix A, e.g., the restricted isometry property (RIP) [54], the LIP has a unique solution and can be solved with algorithms of relatively low computational complexity, e.g., iterative hard thresholding [55], orthogonal matching pursuit [56], message-passing algorithms [57] and sparse Bayesian learning based algorithms [58]. However, in applications, these conditions are often unattainable.

In recent years, DL has attracted wide attention as a promising approach to solve the LIP. For example, by unfolding an iterative algorithm into a neural network (NN), we can learn the parameters of the iterative algorithm from training data, which differs from traditional algorithms that employ predetermined parameters.

Using DL to solve inverse problems has several advantages. Firstly, in comparison to traditional iterative algorithms, DL can significantly increase the speed of convergence. For example, Gregor and LeCun validate that their DL based method is 10 times faster than the iterative coordinate descent method with the same approximation error [59]. In addition, DL based methods are capable of decreasing the average recovery error. As shown in Fig. 2, the recovery error of all algorithms comes from several sources. Imperfect modeling of the problem leads to the model error, the approximation (e.g., using convex relaxation) of the original objective function leads to the structure error, and the sub-optimal solution of algorithms leads to the convergence error. Instead of dealing with imperfect mathematical models and approximated optimization problems, the DL based method learns the mapping from the input to the output directly and has the potential to overcome or relieve the challenges brought by the model error, the structure error and the convergence error in traditional algorithms. The success of DL methods for inverse problems has been observed in a number of works [60, 61, 62, 59, 63, 64, 65].

Figure 2: The decomposition of the error in the solution of inverse problems.

To unveil the advantages of the DL based method in solving LIPs, we show the performance of a DL model and state-of-the-art traditional algorithms in real-world image denoising.
Table 1: The denoising results of real-world images.

Criterion   Method        DND      PolyU
PSNR (dB)   BM3D [66]     34.51    -
            KSVD          36.49    -
            MCWNNM        37.38    -
            TWSC          37.94    -
            CBDNet [70]   38.06    37.00
SSIM        CBDNet [70]   0.9421   0.9457

Table 2: Test time for different methods on a single image denoising (methods compared: CBDNet, KSVD, BM3D, MCWNNM and TWSC).

The results on the DND [71] dataset come from the work of Guo et al. [70] and the results on the PolyU [72] dataset are from our experiments. As shown in Table 1 and Table 2, the CBDNet outperforms most of the traditional algorithms in both PSNR/SSIM and computing time. These simulations are conducted on a computer with a quad-core 4.2 GHz CPU, 16 GB RAM, a GTX 1080Ti GPU, and the Microsoft Windows 10 operating system.

Several remarkable works have compiled comprehensive reviews on using DL in inverse problems. However, existing reviews mainly focus on the application of imaging [73, 74, 75, 76, 77]. In [73], McCann et al. summarize the use of convolutional NNs (CNNs) to solve imaging problems such as denoising, SR, and reconstruction. They focus on the design of the CNNs, including the training data, the architecture, and the problem formulation. Lucas et al. also focus on imaging problems, but they summarize a wide range of NNs, including multilayer perceptrons (MLPs), CNNs, autoencoders (AEs), and generative adversarial networks (GANs) [74]. In the recent work [77], Ongie et al. propose a taxonomy for DL in imaging according to the forward model and the learning process. Other reviews include the review of using DL for MRI image reconstruction [76] and image SR [75], which also focus on a specific application of inverse problems. A survey of data-driven methods in inverse problems is given in [78], which aims to promote more theoretical research.

In this paper, we categorize the LIPs according to various parameterized models arising from different prior information in the linear operator A and the data d, and then we focus on the innovation of constructing a specialized NN for various parameterized models, instead of considering the NN as a black box. We aim to provide a comprehensive review of state-of-the-art DL techniques for solving LIPs, not limited to imaging problems. Our hope is that this article can provide guidance for designing NNs for various LIPs. Finally, we discuss the existing challenges and promising directions for further research, which are not all covered in the literature.

In Fig. 3, we show the structure of Section 3. Our taxonomy in Section 3 is according to the type of NNs, as the architecture of the NN is the most pivotal element of DL and determines whether the NN can effectively capture the deep features of the training data. We summarize the use of fully connected NNs (FNNs), CNNs, recurrent NNs (RNNs), AEs, and GANs in dealing with various LIPs, including CS, denoising, image SR, and others. In addition to the generic NNs, we summarize various structured NNs, which combine the prior information in various forms. Among the structured NNs, the most famous are the deep unfolding methods, which unfold the iterations of an iterative inference method into a layer-wise structure analogous to a NN [79]. In addition to the deep unfolding networks, we also consider structured networks that draw inspiration from traditional analysis-based methods. For example, the DDL combines the concepts of DL and dictionary learning.
3. DL in Solving LIPs
In this section, we introduce how DL is exploited to handle LIPs in different applications and provide detailed instructions on the construction of the NNs and the training process. The different settings in DL based methods are summarized in Tables 3-9, which include the input/output, loss function, learning rate, initialization and training algorithms. With these settings, we can easily train NNs using popular DL platforms such as TensorFlow [80] and PyTorch [81].

Figure 3: Schematic diagram of the structure of Section 3.
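As an illustration of how these settings fit together, a generic PyTorch training loop might look as follows; the network, data and hyper-parameters below are placeholders for the entries in the tables, not taken from any cited work.

```python
import torch

# A generic training loop: input/output pairs, loss function, learning
# rate and optimizer, matching the columns of Tables 3-9.
net = torch.nn.Sequential(torch.nn.Linear(50, 128), torch.nn.ReLU(),
                          torch.nn.Linear(128, 100))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

d_batch = torch.randn(32, 50)      # observed data (network input)
m_batch = torch.randn(32, 100)     # ground-truth model (label)
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(net(d_batch), m_batch)
    loss.backward()
    optimizer.step()
```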
3.1. FNN

The FNN, also known as the MLP, is one of the most basic structures in DL and a powerful tool for solving LIPs. In addition to the basic FNN, some modifications can be employed to enhance the performance, such as skip connections between layers [59], well-designed activation functions [63] and weight constraints [82]. Here we introduce common FNNs and structured FNNs related to LIPs. Various FNNs for LIPs are summarized in Table 3.

Perhaps the most straightforward DL based method for LIPs is the use of common FNNs, especially for image denoising [60, 61] and sparse LIPs [62]. Considering that the ordinary MLP can approximate more functions than the CNN, Burger et al. first apply an ordinary MLP to image denoising and obtain competitive results compared to the classical BM3D [60]. To achieve state-of-the-art performance, they adopt a large network that consists of sufficiently many parameters, a large patch size and a large training set. The network is effective on noisy images that contain additive white Gaussian noise. However, the accuracy of this method is sensitive to a mismatch between the noise distributions of the training data set and the testing data set.
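A minimal sketch in the spirit of [60] follows: noisy patches in, clean patches out, trained with the quadratic error. The patch size, layer widths and noise level here are our own illustrative assumptions, not the exact configuration of the paper.

```python
import torch

# A patch-based MLP denoiser sketch: quadratic error between the
# network output and the clean patches.
patch = 17 * 17
mlp = torch.nn.Sequential(
    torch.nn.Linear(patch, 2047), torch.nn.Tanh(),
    torch.nn.Linear(2047, 2047), torch.nn.Tanh(),
    torch.nn.Linear(2047, patch))

clean = torch.rand(64, patch)
noisy = clean + 25 / 255 * torch.randn(64, patch)   # AWGN, sigma = 25
loss = torch.nn.functional.mse_loss(mlp(noisy), clean)
```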
Table 3: FNNs for LIPs.

Ref.   Application       Input                                      Output                 Loss Function                                Initialization        Learning Rate                                 Optimizer
[60]   Image Denoising   Normalized overlapping patches             Clean patches          The quadratic error                          Normal distribution   Scaled by 1/N (N is the number of layer units)   SGD
[61]   Image Denoising   Pre-processed grey image and depth image   Denoised image         Proposed edge based weighted loss function   Not given             Not given                                     Not given
[62]   Sparse Coding     The observed signal                        Non-zero probability   Softmax loss function                        Follows [83]          -                                             -

To handle varying noise levels, Wang and Morel employ a linear mean shift before the denoising network to improve the robustness of the network [61]. To solve the sparse LIP, Xin et al. incorporate some powerful techniques such as batch normalization and residual connections into the FNN, and use the support of the vector as labels to train the network, which reduces the burden of the NN in solving the sparse inverse problem [62].

In addition to image denoising and sparse LIPs, the FNN is also used in low-rank matrix recovery. One of the typical low-rank matrix recovery problems is the matrix completion problem, where the matrix to be completed is assumed to be low-rank. In [92], Fan and Cheng propose a deep-structure NN named deep matrix factorization (DMF) for matrix completion, which is more computationally efficient than the nuclear norm and truncated nuclear norm related methods. In DMF, the input consists of low-dimensional unknown latent variables and is jointly optimized with the parameters. The output of the network is the recovered low-rank matrix. The DMF aims to recover an incomplete low-rank matrix by learning a nonlinear latent variable model. Exploring the implicit regularization, Arora et al. prove that a deeper DMF can lead to more accurate low-rank solutions [93].

Figure 4: Various deep unfolding FNNs for sparse LIPs. (a) Left: block diagram of the ISTA. Right: the structure of the LISTA, a time-unfolded version of the ISTA block diagram with three iterations (the network can have an arbitrary number of layers). (b) Left: block diagram of the IHT algorithm. Right: the time-unfolded version of the IHT algorithm. In the deep ℓ_0 encoder, the hard thresholding function is decomposed into two linear scaling operators plus a HELU. (c) A different form of LISTA, with learnable parameters A_t, B_t and θ_t. (d) The structure of LAMP, with learnable parameters A_t, B_t and θ_t.

FNNs can also benefit from the unfolding of traditional iterative algorithms, which leads to deep unfolding FNNs [79]. Generally, the t-th iteration of an iterative algorithm can be written as

m̂_{t+1} = g(Wd + Sm̂_t),    (3)

where W and S are algorithm-dependent parameters, and g is a nonlinear function. In view of the fact that the update rule in (3) shares great similarities with one layer of an FNN, various iterative algorithms are unfolded and transformed into different deep unfolding FNNs for solving LIPs.

As the high computation time of traditional sparse coding methods fails to meet the requirements of real-time applications, Gregor and LeCun unfold the iterative shrinkage and thresholding algorithm (ISTA) [82], and propose a new network for fast sparse coding, namely the learned ISTA (LISTA), which is shown in Fig. 4(a) [59]. The iterative step of the ISTA is given by

v_t = d − Am̂_t,
m̂_{t+1} = g_θ(m̂_t + A^T v_t),    (4)

where v_t is the residual error, g_θ(x) = sign(x) max{|x| − θ, 0} is the element-wise soft-thresholding function and θ is the shrinkage parameter. Equation (4) can be rewritten as (3) with W and S given by

W = A^T,  S = I − A^T A.    (5)

The LISTA adopts the element-wise soft-thresholding function of the ISTA as the activation function and limits the parameters of all layers to share the same weights, as in the unfolded ISTA. Different from the hand-designed parameters in the ISTA, the parameters W, S, and θ in the LISTA are learned from the training data. The parameters in (5) can be used as a good initialization for training the LISTA. To generate the labels m̃_i in the training data, Gregor and LeCun use the coordinate descent (CoD) algorithm to solve the ℓ_1 norm minimization problem for each d_i, which may not yield the sparsest solution owing to the structure error illustrated in Fig. 2.
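A compact PyTorch sketch of a LISTA with shared weights, initialized from (5), is given below. This is our own illustration rather than the authors' code; in particular, the ISTA step size is absorbed into the learned weights.

```python
import torch

class LISTA(torch.nn.Module):
    """Shared-weight LISTA sketch: learned W, S and threshold theta,
    initialized as in (5)."""
    def __init__(self, A, n_layers=3):
        super().__init__()
        N = A.shape[1]
        self.W = torch.nn.Parameter(A.t().clone())             # W = A^T
        self.S = torch.nn.Parameter(torch.eye(N) - A.t() @ A)  # S = I - A^T A
        self.theta = torch.nn.Parameter(torch.tensor(0.1))
        self.n_layers = n_layers

    def forward(self, d):
        m = torch.zeros(d.shape[0], self.S.shape[0])
        for _ in range(self.n_layers):
            z = d @ self.W.t() + m @ self.S.t()
            # element-wise soft thresholding, as in (4)
            m = torch.sign(z) * torch.clamp(z.abs() - self.theta, min=0)
        return m
```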
To improve the performance of the LISTA, various variants have been proposed. In [84], Zhang et al. propose the cascade LISTA and the cascade learned CoD (LCoD), which are used to reconstruct sparse signals and predict image sparse codes. In the cascade LISTA and cascade LCoD, several individual LISTA and LCoD networks are trained in parallel to decrease the accumulated error, and at test time those networks are applied in series. To obtain linear convergence, Chen et al. introduce a partial weight coupling structure into the LISTA [85]. While the LISTA is trained for a certain A, it lacks scalability to varying models; even a small deviation in A can deteriorate its performance. To this end, Aberdam et al. propose Ada-LISTA, which uses both signals and their dictionaries as inputs [86]. In Ada-LISTA, the input dictionaries are embedded into the network, and two auxiliary learned matrices are used to wrap the dictionary. In addition to learned weight matrices, deep unfolding FNNs can also be designed to only learn the step-size and threshold parameters, for example, the analytic LISTA (ALISTA) in [87], where the weight matrix is obtained from the analysis of the corresponding optimization problem. In [88], Ablin et al. choose to only learn the step-sizes of the LISTA, which is confirmed to be competitive in sufficiently sparse cases.

To avoid the structure error produced in generating the training data, Wang et al. propose the deep ℓ_0 encoder to solve the ℓ_0 norm minimization problem directly, where the label m̃_i is the original sparse signal [63]. The deep ℓ_0 encoder is obtained by unfolding the iterative hard thresholding (IHT) [55] algorithm (Fig. 4(b)), which is similar to the ISTA except for the nonlinear function. The nonlinear function in the IHT algorithm is the hard thresholding function g_θ(x) = x · sign(max{|x| − θ, 0}), where θ is the activation threshold. To update θ, the authors decompose the original hard thresholding function g_θ(x) into two linear scaling operators plus a hard thresholding linear unit (HELU):

HELU_θ(x) = { 0,  |x/θ| < 1;  x,  |x/θ| ≥ 1 }.    (6)

However, the HELU is a discontinuous function that destroys the universal approximation capability of the network and is hard to train. To this end, a novel continuous function HELU_σ is proposed:

HELU_σ(x) = { 0,  |x| ≤ 1 − σ;  (x − 1 + σ)/σ,  1 − σ < x < 1;  (x + 1 − σ)/σ,  −1 < x < σ − 1;  x,  |x| ≥ 1 }.    (7)

Obviously, HELU_σ is equivalent to the HELU in (6) as σ → 0.
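Assuming the reconstruction of (7) above is accurate, a direct PyTorch implementation of this activation might read as follows.

```python
import torch

def helu_sigma(x, sigma):
    """Smoothed hard-thresholding unit HELU_sigma from (7); sigma is
    annealed toward 0 during training to recover the HELU."""
    pos_ramp = (x - 1 + sigma) / sigma        # branch: 1 - sigma < x < 1
    neg_ramp = (x + 1 - sigma) / sigma        # branch: -1 < x < sigma - 1
    out = torch.where(x.abs() >= 1, x, torch.zeros_like(x))
    out = torch.where((x > 1 - sigma) & (x < 1), pos_ramp, out)
    out = torch.where((x > -1) & (x < sigma - 1), neg_ramp, out)
    return out                                # 0 on |x| <= 1 - sigma
```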
At the beginning of the training, σ can be set to a small constant and then gradually decreased during the training phase. Besides, for the case with a known sparsity level k, the HELU layer can be replaced by a max-k pooling layer and a max-k unpooling layer. Similar to the LISTA, the weights of the deep ℓ_0 encoder are learned and shared among layers.

Based on the approximate message passing (AMP) algorithm [57], a network that adopts independent weights among layers is proposed by Borgerding and Schniter [64]. Compared with the ISTA (4), the residual error of the AMP algorithm depends on both the t-th and the (t − 1)-th iterations. The t-th iteration of the AMP algorithm is given by

v_t = d − Am̂_t + b_t v_{t−1},
m̂_{t+1} = g_{θ_t}(m̂_t + A^T v_t),    (8)

where b_t = ‖m̂_t‖_0 / M, θ_t = (α/√M)‖v_t‖_2 and α is a tuning parameter. The difference between the LISTA and the learned AMP (LAMP) can be seen in Fig. 4(c) and Fig. 4(d). In [65], Borgerding et al. further extend the vector AMP (VAMP) algorithm [94] into the learned VAMP (LVAMP) network. Compared with the LAMP network, the LVAMP network offers increased robustness to deviations of the matrix A from i.i.d. Gaussian.

The deep unfolding method can also be used in low-rank models. In [95], Pu et al. design a specific deep unfolding network based on the alternating direction method of multipliers (ADMM) for sparse and low-rank matrices. In particular, to make the network differentiable and learnable, they use a special non-linear activation function f(x) = ReLU(x − θ) − ReLU(−x − θ) to replace the shrinkage operator in the ADMM, and use the online RPCA for the low-rank term.

In addition to drawing inspiration from unfolding iterative algorithms that follow (3), the NN can be combined with traditional algorithms in other forms. By using a NN to perform each step of the traditional K-SVD algorithm, Scetbon et al. unfold the K-SVD into an end-to-end deep architecture and train it in a supervised manner [89]. The proposed scheme boosts the performance of the famous K-SVD denoising algorithm. By embedding the minimum mean squared error (MMSE) estimator into the NN, Ito et al. propose the trainable iterative soft thresholding algorithm (TISTA) [90], where the MMSE estimator is used as a shrinkage function to improve the speed of convergence. Similar to TISTA, Yao et al. combine Stein's unbiased risk estimate with the ISTA in the SURE-TISTA network [91]. Both TISTA and SURE-TISTA use fewer learnable variables while achieving a performance close to that of LAMP.

DDL is another type of structured FNN that combines the knowledge of traditional algorithms. It can be used in inverse problems such as image reconstruction [96, 97, 98] and image SR [99]. Solving inverse problems in imaging with DDL, Lewis et al. reformulate the entire inversion process with the variable splitting augmented Lagrangian approach, segregate it into several subproblems, and solve all the variables jointly [96]. To reconstruct multi-echo MRI images with DDL, Singhal and Majumdar propose two variants of DDL, including a joint-sparse dictionary learning based DDL and a low-rank based DDL [97]. In [98], they introduce the coupled dictionary learning technique into DDL, and propose a domain adaptation approach for different imaging tasks. For image SR, Huang and Dragotti design an L-layer FNN which includes L − 1 layers of dictionaries [99].
CNN inspired by thebiological visual cortex can capture the local similarity of images and thus isemployed as a key technique in most image-related applications. Various CNNsfor LIPs are summarized in Table 4.
For image denoising, the FNNs introduced in the previous subsection require a predetermined input image size, while CNNs are more flexible in dealing with images of arbitrary sizes. In [100], Wang et al. propose a two-layer CNN, where they use the ReLU activation function for the first layer and the sigmoid activation function for the second layer. Besides, inspired by lateral inhibition in real neurons and computational neuroscience models, a novel local response normalization is employed after the output of the ReLU.
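A sketch of such a two-layer denoising CNN follows; the channel widths and kernel sizes are illustrative assumptions, and the local response normalization is omitted for brevity.

```python
import torch

# Two-layer denoising CNN sketch: ReLU after the first convolution,
# sigmoid after the second, as described above.
net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 64, kernel_size=5, padding=2),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 1, kernel_size=5, padding=2),
    torch.nn.Sigmoid())

noisy = torch.rand(8, 1, 40, 40)   # fully convolutional: any image size
denoised = net(noisy)
```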
Table 4: CNNs for LIPs.

Ref.    Application       Input                 Output           Loss Function        Initialization          Learning Rate   Optimizer
[100]   Image Denoising   Cropped noisy image   Denoised image   Mean squared error   Gaussian distribution   -               -

In the FFDNet [101], a given noise level σ is first extended to a noise level map. The noise level map is then concatenated with the down-sampled sub-images to form a tensor that is used as the input of the network (Fig. 5(b)).

Various CNNs are designed for different denoising applications. Zhang et al. extend the CNN to depth image denoising and propose a denoising and enhancement CNN (DE-CNN) [102]. In the DE-CNN, the input of the network contains both the depth image and a pre-processed gray image, as shown in Fig. 5(a). The authors also propose a novel edge based weighted loss function and a data augmentation strategy that expands the set of useful depth images. For hyperspectral image denoising, Chang et al. use the CNN to extract the spectral and the spatial information, where the spectral correlation is captured by the multiple channels [103]. In [105], Yuan et al. use the spatial and spectral information as input. They capture and fuse multiscale spatial-spectral features for the final restoration. For medical image denoising, Panda et al. propose a wide residual CNN [106]. To address the problem that the use of the squared Euclidean distance leads to over-smoothed images, they combine the perceptual loss and the squared Euclidean distance for training, which is confirmed to be helpful in keeping structural or anatomical details. Wang et al. design a local receptive field smoothing network which retains the smoothing properties of the receptive field by weighting its local neighborhoods [107].

Instead of expecting the clean image as the network output, Zhang et al. propose a denoising CNN (DnCNN) that outputs the residual between a clean image and a noisy image [108]. By using residual learning, the network is able to handle unknown noise levels and can also be transferred to other tasks such as single image SR and image deblocking. Wang et al. further combine the dilated convolution [127] with residual learning to improve computational efficiency and enlarge the receptive field [109]. In [110], Su et al. propose a deep multi-scale cross-path concatenation residual network (MC RNet) for Poisson denoising, where they use cross-path concatenation and skip connections to obtain multi-scale context representations of images.
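A reduced sketch of the residual learning strategy used by the DnCNN is given below; the original network is deeper, and the depth and width here are illustrative only.

```python
import torch

# DnCNN-style residual sketch: the network predicts the noise, and the
# clean image is recovered by subtracting it from the noisy input.
layers = [torch.nn.Conv2d(1, 64, 3, padding=1), torch.nn.ReLU()]
for _ in range(5):
    layers += [torch.nn.Conv2d(64, 64, 3, padding=1),
               torch.nn.BatchNorm2d(64), torch.nn.ReLU()]
layers += [torch.nn.Conv2d(64, 1, 3, padding=1)]
residual_net = torch.nn.Sequential(*layers)

noisy = torch.rand(4, 1, 48, 48)
clean_hat = noisy - residual_net(noisy)   # residual learning
```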
Figure 5: Common CNNs for image denoising. (a) The network structure of the DE-CNN. (b) The network structure of the FFDNet. (c) The network structure of the DnCNN.

Table 5: The SR results: average PSNR/SSIM for scale factors 2 and 4.

Algorithm   Scale   PSNR (Set5)   SSIM (Set5)   PSNR (Set14)   SSIM (Set14)
Bicubic     2       33.66         0.93          30.32          -
SRCNN       2       36.65         0.95          32.42          -
FSRCNN      2       37.00         0.96          32.65          -
ESPCN       2       37.26         0.95          32.88          -
LapSRN      2       37.52         0.96          33.08          -
Bicubic     4       28.42         0.81          26.00          -
SRCNN       4       30.49         0.86          27.50          -
FSRCNN      4       30.72         0.86          27.62          -
ESPCN       4       30.90         0.86          27.73          -
LapSRN      4       31.54         0.88          28.19          -

Different from image denoising, in image SR the dimension of the output is higher than that of the input. To explore the information in the different dimension spaces, various network architectures have been designed. In the super-resolution convolutional neural network (SRCNN) [112], the input of the network is an interpolated LR image (Fig. 6(a)). The SRCNN uses a relatively large filter size to utilize the information from more pixels and simultaneously processes multiple channels, which leads to superior performance in comparison to traditional example-based approaches.

Figure 6: Common CNNs for image SR. (a) The structure of the SRCNN. (b) The structure of the ESPCN. (c) The structure of the FSRCNN. (d) The structure of the LapSRN. (e) The structure of the DBPN.

Considering that the SRCNN is sub-optimal and computationally inefficient owing to the use of the interpolated image as input, more efficient networks such as the efficient sub-pixel CNN (ESPCN) [113] and the fast SRCNN (FSRCNN) [114] have been proposed. Both ESPCN and FSRCNN use the LR image as input and perform the upsampling in the last layer. The last layer of the ESPCN is a sub-pixel convolution layer, which first generates multiple feature maps and then conducts a periodic shuffling of the pixels to produce the final HR image.
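This periodic shuffling can be expressed with PyTorch's built-in PixelShuffle operator; a minimal sketch for a scale factor r = 2 and a single output channel follows.

```python
import torch

# Sub-pixel convolution sketch: r^2 feature maps produced in LR space
# are rearranged into one HR channel by periodic shuffling (r = 2).
r = 2
conv = torch.nn.Conv2d(64, 1 * r * r, kernel_size=3, padding=1)
shuffle = torch.nn.PixelShuffle(r)

features = torch.rand(1, 64, 32, 32)   # LR-space feature maps
hr = shuffle(conv(features))           # shape: (1, 1, 64, 64)
```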
The last layer of the FSRCNN is a deconvolution layer, and the FSRCNN uses smaller filter sizes and a specially designed shrinking layer to accelerate the network. While the ESPCN and FSRCNN produce the HR image in the last layer, Lai et al. propose the Laplacian pyramid SR network (LapSRN), which progressively increases the dimension of the output of each layer (Fig. 6(d)) [115]. The deep back-projection network (DBPN) uses iterative up- and down-sampling layers to explore the mutual dependencies of LR images and HR images, as shown in Fig. 6(e) [116]. Each pair of sampling layers represents a type of degradation and the corresponding components. Furthermore, Haris et al. propose the dense DBPN (D-DBPN), which adds skip connections to allow the concatenation of features between layers. It is observed that the dense DBPN can further improve the performance of the SR, especially at large scaling factors.

In Table 5 and Table 6, we compare the performance and the design of different CNNs for image SR on the datasets Set5 [128] and Set14 [129]. Compared with the SRCNN, the FSRCNN is deeper, but uses fewer filters and smaller filter sizes. Thus, the FSRCNN has fewer parameters and is faster (about 41×) without performance degradation. The ESPCN uses the same filter sizes as the SRCNN, but decreases the number of filters and extracts the features in the LR space to reduce the computational complexity and obtain real-time speed. Compared with the previous networks, the LapSRN is much deeper (27 layers) and uses residual learning to assist the training. The Charbonnier loss function used in the LapSRN has a higher gradient magnitude than the ℓ_2 loss and decreases the ringing artifacts (a minimal sketch of this loss follows below). For the D-DBPN, the network has a depth of up to 48 layers and uses smaller filter sizes than the SRCNN, FSRCNN and LapSRN.
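The Charbonnier loss is simple to state: it is a differentiable variant of the ℓ_1 loss, sqrt((x − y)^2 + ε^2), where ε is a small constant (1e-3 is a common choice; the exact value used in LapSRN is not given here).

```python
import torch

def charbonnier(x, y, eps=1e-3):
    """Charbonnier penalty: a smooth l1-like loss used for SR training."""
    return torch.sqrt((x - y) ** 2 + eps ** 2).mean()
```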
Even with a shallow depth (18 layers), the DBPN outperforms the LapSRN (31.54 dB).

Table 6: Comparisons among various CNNs for SR.

Network   Parameters   Training data                                                                             Loss function                  Network architecture
SRCNN     57k          ImageNet subset (over 5 million sub-images)                                               Mean squared error             Conv(9,64,1)-Conv(5,32,64)-Conv(5,1,32)
FSRCNN    12k          General-100 dataset and 91-image dataset (19 times more images after data augmentation)   Mean squared error             Conv(5,56,1)-Conv(1,12,56)-4Conv(3,12,12)-Conv(1,56,12)-Conv(9,1,56)
ESPCN     20k          ImageNet subset                                                                           Mean squared error             -
LapSRN    -            Berkeley segmentation dataset and 91-image dataset                                        Charbonnier penalty function   Conv(3,64,3)-2(10Conv(3,64,64)-Conv(3,256,64)-Conv(3,3,64)-Conv(3,12,3))
DBPN      10M          DIV2K, Flickr and ImageNet subset                                                         Mean squared error             Conv(256,3,3)-Conv(32,1,1)-7(Conv(32,2,2)-Conv(32,6,6)-Conv(32,2,2)-Conv(32,6,6)-Conv(32,2,2)-Conv(32,6,6))-Conv(32,2,2)-Conv(32,6,6)-Conv(32,2,2)-Conv(3,3,3)
Figure 7: Deep unfolding CNNs for image CS. (a) The network structure of the ADMM-Net. (b) The network structure of the ISTA-Net.
Akin to the structured FNNs, structures from traditional algorithms can also be employed in the design of structured CNNs.

One example of the deep unfolding networks is in image reconstruction. Following the iterative procedure of the ADMM algorithm, Yang et al. construct a CNN based ADMM-Net for CS-MRI, where each layer represents a subproblem of the ADMM optimization problem (Fig. 7(a)) [122]. Notably, in the ADMM-Net, all the parameters are learned, including the transforms, penalty parameters and shrinkage functions. Furthermore, in [142], they redesign the ADMM algorithm and unfold it into the more powerful ADMM-CSNet. Another deep unfolding CNN is the ISTA-Net, which is also designed for CS imaging. Similar to the ADMM-Net, the parameters in the ISTA-Net are all learned [123]. The ISTA-Net contains several phases, each of which represents an iteration of the ISTA (Fig. 7(b)). Each phase of the ISTA-Net includes a forward transform and a symmetric backward transform, where the forward transform is used to replace the hand-crafted sparse transform of the original image in the ISTA, and the backward transform is designed to exhibit a structure symmetric to that of the forward transform. The AMP algorithm can also be used for image denoising, which leads to the denoising AMP (D-AMP) algorithm [143]. By unfolding the D-AMP algorithm, Metzler et al. design the learned D-AMP (LDAMP) [144], which can be used to recover images from different measurement matrices. In LDAMP, the DnCNN is embedded into the network as a denoiser. Following the deep unfolding principle, Solomon et al. unfold the low-rank plus sparse ISTA to solve the RPCA problem [145] more efficiently. Instead of using a fully connected layer for matrix multiplications, they use convolutional layers to reduce the number of parameters. The proposed convolutional robust principal component analysis (CORONA) is further used in SR ultrasound to remove the clutter signal.

Another example is the application to image SR. Most related work derives the network from sparse coding methods [146, 147]. Dong et al. use linear transforms to project image patches onto a dictionary and replace the sparse coding solver with a nonlinear transform (Fig. 8(a)) [124]. Liu et al. propose the sparse coding based network (SCN) (Fig. 8(b)), which consists of a patch extraction layer, a LISTA sub-network for sparse coding, an HR patch recovery layer, and a patch combination layer [148]. In the SCN, the LISTA sub-network is employed to enforce the sparsity of the representation. In addition, the authors propose a cascade of SCNs (CSCN) (Fig. 8(c)) so that the network can be extended to deal with different scaling factors. In the practical scene where the LR images suffer from various types of corruption, Liu et al. fine-tune the learned SCN with a small amount of training data to adapt the model to the new scenario [148].

Structured CNNs have also been proposed for image denoising [149] and image restoration [150]. For example, to exploit the non-local self-similarity property of natural images, Lefkimmiatis proposes a CNN based network that uses an extra regularization term in the loss function [149]. The key idea is unfolding the proximal gradient method to construct a network graph, where each layer represents one proximal gradient iteration.
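To make the unfolded-phase idea concrete, here is a hedged sketch of one ISTA-Net-style phase: a learned forward transform, soft thresholding, and a backward transform trained to be approximately the inverse of the forward one via a symmetry penalty. This is our illustration only; the measurement-consistency gradient step of the full method is omitted, and the layer widths are arbitrary.

```python
import torch

class Phase(torch.nn.Module):
    """One unfolded phase: forward transform F, soft threshold,
    backward transform B, plus a symmetry term encouraging B(F(x)) = x."""
    def __init__(self):
        super().__init__()
        self.F = torch.nn.Conv2d(1, 32, 3, padding=1)
        self.B = torch.nn.Conv2d(32, 1, 3, padding=1)
        self.theta = torch.nn.Parameter(torch.tensor(0.01))

    def forward(self, x):
        z = self.F(x)
        z = torch.sign(z) * torch.clamp(z.abs() - self.theta, min=0)
        x_new = self.B(z)
        sym = torch.mean((self.B(self.F(x)) - x) ** 2)  # symmetry constraint
        return x_new, sym
```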
In [150], Chen and Pock construct the trainable nonlinear reaction diffusion (TNRD) network based on nonlinear reaction diffusion models for image restoration, which can be thought of as a feed-forward convolutional network. Besides, they add a reaction term to adapt to various image processing problems.

Multimodal DL [151] is another promising technique for solving image SR problems and has driven plenty of structured CNNs. In multimodal DL for image SR, the input of the network generally includes an LR image and an HR image in a different modality. For example, Marivani et al. use LR near-infrared images and HR RGB images to super-resolve HR near-infrared images [152, 153]. In [152], they design their learned multimodal convolutional sparse coding (LMCSC) model by unfolding the proximal method used for solving convolutional sparse coding with side information. In [153], they turn to solving the appropriate ℓ1-ℓ1 minimization problem for multimodal image SR and design their deep multimodal sparse coding network (DMSC) based on a deep unfolding FNN named the learned side-information-driven iterative soft thresholding algorithm (LeSITA). To capture the cross-modality dependency, Deng and Dragotti design a special joint multi-modal dictionary learning (JMDL) algorithm and unfold it into a deep coupled ISTA network [154]. In particular, they use a layer-wise optimization algorithm (LOA) to solve the multi-layer dictionary learning problem for initialization. In addition to image SR, multimodal DL can also be used in image reconstruction [155, 156].

Compared with FNNs and CNNs, RNNs are more appropriate for dealing with sequential inputs, such as time-varying signals [157]. Thus, an RNN can be used to solve a sequence of correlated LIPs. Various RNNs for LIPs are summarized in Table 7.

One example is the sparse LIP, especially the structured sparse LIP. In [62], Xin et al. use a long short-term memory (LSTM) network as an adaptive variant of IHT to allow a longer flow of information to explore the structure of A in a general sparse LIP. In the MMV setting, where the supports of the columns are not totally consistent due to noise or a partly innovative sparse pattern in the source, Palangi et al. design an LSTM to capture the unknown dependency between the sparse vectors [158, 159].

(a) A sparse coding based CNN. (b) The structure of the SCN. (c) The structure of the CSCN.
Figure 8: Structured CNNs for image SR.
Table 7: Details of training some RNNs for LIPs.

Ref. | Application | Input | Output | Loss Function | Initialization | Learning Rate | Optimizer
[158] | MMV problems | The observed signal | The recovered signal | The quadratic error | Small random numbers | Not given | Not given
[159] | MMV problems | The observed signal | The recovered signal | Cross entropy | Small random numbers | Not given | Backpropagation through time and ADAM
[160] | Block-sparsity recovery | The sequence of residual vectors | One-hot vectors | Cross entropy | Not given | 3 × 10⁻... | ...

(a) The structure of DRCN, which consists of an embedding network, an inference network and a reconstruction network. (b) The final model of DRCN with recursive supervision and skip connections; the reconstruction network is shared for recursive predictions.
Figure 9: The network structure of the deeply-recursive convolutional network (DRCN).
For example, Yang et al. propose the deep recurrent fusion network (DRFN) for image SR [168]. In DRFN, they use the transposed convolution as the upsampling layer and combine different-level features to reconstruct high-quality images. A similar method can also be found in [169], where Wang et al. use convolutional LSTM (ConvLSTM) in the residual block to form their multi-memory CNN (MMCNN) for video SR. In [170], Wang et al. propose a bidirectional recurrent convolutional NN named LFNet for light-field image SR, which uses an implicit multi-scale fusion to exploit the spatial relations in light-field images. For image denoising, considering that the feature fusion of common CNNs is coarse, Wang et al. use the gated recurrent unit (GRU) to select and combine the features of different layers [171].

For MRI image reconstruction, Qin et al. use a convolutional RNN to explore the dependencies of the temporal sequences [172]. In addition, they also combine the network with traditional optimization algorithms, which forms a structured RNN. In [173], Putzky and Welling propose the recurrent inference machines (RIM) for image restoration, which is the unrolling of the inference algorithm. Yang et al. further use the RIM in accelerated photoacoustic tomography (PAT) reconstruction [178], where the forward operator A is used in the training process.

Structured RNNs are also common when solving sparse LIPs. Similar to structured FNNs, structured RNNs take inspiration from traditional iterative algorithms, such as the ISTA. Intuitively, an RNN can be used to deal with a sequence of correlated observations in sparse LIPs. For example, in [174], Wisdom et al. solve the sequential sparse LIP with a structured RNN inspired by the sequential ISTA. Different from a generic stacked RNN, the input of the proposed SISTA-RNN is connected to every iteration layer. In [175], Le et al. design their RNNs for the sequential sparse LIP by unfolding the proximal gradient method that solves the ℓ1-ℓ1 minimization problem. Compared with the stacked RNN, the designed ℓ1-ℓ1-RNN has additional connections between the layers.

In addition, in sparse LIPs, the support of the nonzero elements can be thought of as a sequence, and it has been shown that a known part of the support can be used to speed up convergence. While LISTA uses a fixed learning rate to learn the parameters, Zhou et al. add an adaptive momentum vector to the network and design their adaptive ISTA [176]. They further improve the efficiency of the adaptive ISTA by reformulating it as an RNN, which can be thought of as a variant of the well-known LSTM. In addition to simple, one-step iterative algorithms such as the ISTA, in [179], He et al. map the complex, multi-loop, majorization-minimization algorithm sparse Bayesian learning (SBL) to an RNN. The proposed network exhibits significantly improved performance in comparison to existing structured FNNs. This method can be applied to many applications including direction-of-arrival estimation and 3D photometric stereo recovery.
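As an illustration of this family of structured RNNs, the following is a minimal sketch of a SISTA-style recurrent cell: each time step runs a few unfolded soft-thresholding iterations on the current measurement, initialized from the previous time step's code. The layer sizes, learned matrices and iteration count are illustrative assumptions and do not reproduce the exact SISTA-RNN of [174].

```python
import torch
import torch.nn as nn

def soft(x, theta):
    # element-wise soft-thresholding
    return torch.sign(x) * torch.clamp(x.abs() - theta, min=0.0)

class SISTACell(nn.Module):
    """One layer of an unfolded sequential-ISTA RNN (illustrative sketch).

    At time t, the sparse code h_t is refined from the previous estimate and
    the current measurement y_t; unlike a generic stacked RNN, y_t feeds
    every unfolded iteration.
    """
    def __init__(self, m, n):
        super().__init__()
        self.We = nn.Linear(m, n, bias=False)  # measurement-to-code map (plays the role of A^T / L)
        self.S = nn.Linear(n, n, bias=False)   # learned lateral weights (plays the role of I - A^T A / L)
        self.F = nn.Linear(n, n, bias=False)   # learned temporal transition from h_{t-1}
        self.theta = nn.Parameter(torch.full((n,), 0.1))  # learned thresholds

    def forward(self, y_t, h_prev, K=3):
        h = self.F(h_prev)                     # propagate the previous time step's code
        for _ in range(K):                     # K unfolded ISTA iterations per time step
            h = soft(self.We(y_t) + self.S(h), self.theta)
        return h
```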
AEs are self-supervised feedforward NNs that are usually used for dimension reduction and feature learning [180, 181]. An AE consists of an encoder and a decoder, and learns an efficient coding of the data. The AE aims to learn useful properties of the data, rather than merely reproduce the input at the output. Different variants of the basic AE have been proposed to encourage the learning of useful features, such as the regularized AE and the sparse AE. AEs have been used in denoising [182, 183], modulation classification in communication systems [184, 185] and image classification [186, 187, 188]. Various AEs for LIPs are summarized in Table 8.
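The following is a minimal sketch of the denoising AE discussed next: an encoder-decoder pair trained with a quadratic error to map noisy patches back to clean ones. The patch size, code size, noise level and toy data are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """Minimal denoising autoencoder: reconstruct clean patches from noisy input."""
    def __init__(self, patch_dim=64, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(patch_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, patch_dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.rand(128, 64)                    # a batch of clean patches (toy data)
noisy = clean + 0.1 * torch.randn_like(clean)  # corrupt the input, keep the target clean

loss = nn.functional.mse_loss(model(noisy), clean)  # quadratic error
opt.zero_grad()
loss.backward()
opt.step()
```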
Table 8: Details of training some AEs for LIPs.

Ref. | Application | Input | Output | Loss Function | Initialization | Learning Rate | Optimizer
[189] | Image denoising | Overlapping patches | Clean patches | The quadratic error with sparsity regularization | Pre-trained stacked denoising auto-encoder | Not given | Quasi-Newton
[190], [191] | Image denoising | Overlapping patches | Clean patches | The quadratic error with sparsity regularization | Pre-trained SSDAs | Not given | Quasi-Newton
[192] | Image denoising | Overlapping patches | Clean patches | The quadratic error | Pre-trained single-layer SSDAs | 10⁻... | ...

The denoising AE (DAE) is the most commonly used AE for solving inverse problems; it was first proposed in [201] to obtain robust features. The DAE tries to reconstruct the signal from its noisy input. In [189], Xie et al. propose the stacked sparse denoising AE (SSDA) for image denoising and blind inpainting, which stacks multiple DAEs and forces the parameters to be sparse by employing sparsity regularization. In the training phase, Xie et al. initialize the SSDA with stacked DAs, where each DA is trained one by one, and the input of each successor DA is the output of its predecessor rather than the original noisy image. To improve the robustness of the SSDA, Agostinelli et al. propose the adaptive multi-column SSDA (AMC-SSDA), where several SSDAs are learned under different noise levels, and a weight prediction module is learned to combine the results of all SSDAs with different weights [191]. As the sparsity regularizer in [189] is not computationally efficient for DAEs with multiple hidden layers, Cho improves the performance of the network by forcing the output of the encoder to be sparse [192]. The proposed DAE performs well even without sparsity regularization and does not use any prior information about the noise. To enhance the robustness of the AE to hybrid noise, Ye et al. add a KL penalty to the loss function, which brings the average activation of the hidden layer close to zero [193]. In addition to fully connected AEs, convolutional layers can also be used in AEs. In [194], Gondara uses a DAE constructed from convolutional layers for medical image denoising. However, the work in [189, 191, 192, 193, 194] is inductive. In [182], the AE is further extended to blind image denoising.

The AE can also be used in image SR and reconstruction. In [195], Zeng et al. develop a coupled deep AE (CDA) for single image SR. The CDA contains three parts: two AEs, which extract the hidden representations of LR and HR image patches respectively, and a hidden layer, which learns the mapping between the two representations. The training of the CDA comprises the training of these three parts and the fine-tuning of the entire network. Considering that the inconsistency between the sparse coefficients of the LR and HR images degrades the SR results, Shao et al. propose the coupled sparse AE (CSAE) to learn the mapping between the sparse coefficients of the LR image and the HR image [196]. The proposed CSAE is used for the spatial resolution enhancement of remote sensing images. For image reconstruction, Mehta et al. propose to use an AE for CS-based medical image reconstruction to cut the reconstruction time [202]. Instead of using the Euclidean norm as the cost function, Mehta et al. use a robust ℓ1 norm.
Similar to the work in [202], Gupta and Bhowmick also consider the time-consumption problem in real-time image reconstruction [197]. They propose the coupled AE (CAE) to learn the mapping from the measurements to the representation of the target images.

Besides, AEs are also popular in sparse coding. In [203], Barello et al. design the sparse coding variational AE (SVAE), which is neurally plausible for modeling the neural response to an image patch. To solve the computational problem of using LISTA for convolutional sparse coding, Sreter and Giryes propose the convolutional LISTA, which serves as the sparse encoder in an AE [198]. Based on sparse coding, Jalali and Yuan analyze the performance of AEs for such recovery problems and propose a projected gradient descent based algorithm [200].

In addition to the common AEs, AEs can also benefit from the deep unfolding method. In [204], Sprechmann et al. unfold proximal descent algorithms and then learn the pursuit process to solve low-rank models, including RPCA and non-negative matrix factorization.

The GAN was originally proposed as a form of generative model for unsupervised learning, and it can also be used in applications involving LIPs. Various GANs for LIPs are summarized in Table 9.

Table 9: Details of training some GANs for LIPs.

Ref. | Application | Input | Output | Loss Function | Initialization | Learning Rate | Optimizer
[205] | Image denoising | Noisy image patch | Clean image patch | Mean squared error | Not given | 10⁻... | ...

The main motivation for using GANs for denoising is that GANs can better preserve high-frequency components and image details, whereas CNNs easily over-smooth the edges of the image. For image denoising, the generator network is expected to generate the denoised signal, and the discriminator network is used to distinguish the denoised output from the ground truth, which provides feedback for the training of the generator network. The applications of GANs in denoising are diverse. For example, Chen et al. propose a GAN-CNN based blind denoiser, where the generator network is used to estimate the distribution of noisy images and generate paired training data for the training of the denoising CNN [205]. The network structure of the generator and discriminator can be inspired by various FNNs or CNNs, such as LISTA-GAN [206], VGG-GAN [207] and ResNet-GAN [208], or specially designed [209].

Another main innovation lies in the design of various loss functions. Wolterink et al. find that a network trained with a voxel-wise loss has a higher peak signal-to-noise ratio, while a network trained with an adversarial loss better captures image statistics [210]. In [207], Yang et al. add the Wasserstein distance and a perceptual loss to GANs. The Wasserstein distance, which comes from optimal transport theory, is used as the discrepancy measure to improve the performance of GANs. The perceptual loss, which calculates the discrepancy between images in an established feature space, is used to suppress noise. Alsaiari et al. use the weighted sum of a pixel-to-pixel Euclidean loss, a feature loss, a smoothness loss and an adversarial loss [208], while Li and Xiao use a combination of the denoising loss and the reconstruction loss. In Fig. 10, we compare the performance of different loss functions under the same training set and the same network structure. The adversarial loss adopts the binary cross-entropy that comes from the discriminator, and helps to generate images that can deceive the discriminator.
It is found that the network trained with the adversarial loss alone is hard to converge, and the generated image has higher noise levels. The pixel loss calculates the pixel-to-pixel Euclidean distance between the output and the clean image, and is helpful for correctly recovering the colors corrupted by noise. However, the network trained with the pixel loss leads to an over-smooth image. The feature loss, which depends on the features extracted from the convolutional layers, helps to extract features accurately. Thus, the network trained with the combination of the adversarial loss, pixel loss and feature loss has the best visual quality.

(a) (b) (c) (d)
Figure 10: The denoising results with different loss functions. (a) noisy image, (b) denoised image with the adversarial loss, (c) denoised image with the adversarial loss and pixel loss, (d) denoised image with the adversarial loss, pixel loss and feature loss.

GANs have also been employed for image SR, which leads to different innovative designs. A common problem is that the LR images may contain noise, such as the speckle and smudges in synthetic aperture radar images [211]. The general method is to first perform image denoising on the LR images, and then reconstruct the HR images. The denoising and SR can be performed with a joint generator network [211, 212] or two generator networks [213]. Compared with image denoising, the network structures of generator networks for image SR are more diverse. In Fig. 11, we show several novel network structures in GANs for image SR, including an hourglass CNN model [217], a Cycle-in-Cycle network [218, 219] and a dense block network [220].

The innovations in loss functions also exist in image SR for finer texture details, and most loss functions are a weighted sum of several losses. The losses can be classified into the adversarial loss, pixel-based losses and feature-map-based losses. For example, Ledig et al. use an adversarial loss and a content loss [214], while Chen et al. use an MSE loss, the generative loss and the VGG loss [212]. Other loss functions include the sum of the perceptual loss, an MSE-based content loss and an adversarial loss, used by Gopan and Kumar [215], the sum of the pixel-wise loss and the adversarial loss, used by Jiang et al. [216], and the sum of a joint sparsifying transform loss and a supervision loss, used by You et al. [213].
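Many of the GAN-based methods above optimize a weighted sum of an adversarial loss, a pixel loss and a feature (perceptual) loss. The following is a minimal sketch of such a combined generator objective; the weights and the feature extractor feat_net are hypothetical placeholders, not the settings of any cited work.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
mse = nn.MSELoss()

def generator_loss(generated, clean, disc_logits, feat_net,
                   w_adv=1e-3, w_pix=1.0, w_feat=0.1):
    """Weighted sum of adversarial, pixel and feature losses for the generator.

    disc_logits: discriminator output on the generated image (pre-sigmoid).
    feat_net: a fixed feature extractor (e.g., a pretrained CNN) for the
    feature/perceptual loss. The weights w_* are illustrative assumptions.
    """
    adv = bce(disc_logits, torch.ones_like(disc_logits))  # fool the discriminator
    pix = mse(generated, clean)                           # pixel-wise Euclidean loss
    feat = mse(feat_net(generated), feat_net(clean))      # feature-space discrepancy
    return w_adv * adv + w_pix * pix + w_feat * feat
```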
4. Challenges and Future Research Directions
In the previous section, we explored several research directions and paradigms for using DL to solve LIPs. It has been observed that DL has brought breakthroughs in many applications. However, there are still many open challenges that require further investigation. In this section, we discuss several potential future research directions in using DL to solve LIPs.

(a) An hourglass CNN model [217]. (b) A Cycle-in-Cycle network [218, 219]. (c) A dense block network [220].
Figure 11: The structure of generator networks of GANs for image SR.

In solving LIPs, the performance of DL based methods relies greatly on the data (the input and the label) seen during training, which reflects the functional relationship between the model parameters m and the observed data d. However, just as imperfect mathematical modeling of complex scenarios leads to model error in traditional methods, imperfect training data leads to recovery error in DL methods.

The recovery error caused by the training data may come from the generating process of the training data. In practical scenarios where we do not have access to the real m, a popular method is to artificially generate the training data. However, the artificially generated data may have a distribution that differs from the distribution of the real m. For example, in the sparse LIP, the sparse data m may have different sparsity levels and sparse patterns. In cases where we cannot obtain general data m, we may resort to a traditional algorithm; e.g., in LISTA, the sparse data m are generated by the traditional CoD algorithm [59]. However, the traditional algorithm may converge to a non-optimal solution, which results in errors in the training data. A synthetic data-generation pipeline of this kind is sketched below.
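As a minimal sketch of the artificial data-generation process discussed above, the following draws a random k-sparse m and forms d = Am + noise. The Gaussian amplitudes, noise level and sparsity level are illustrative assumptions; any mismatch between such choices and the distribution of the real m is precisely the source of the recovery error in question.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sparse_training_pair(A, k, noise_std=0.01):
    """Generate one synthetic training pair (m, d) for the sparse LIP d = A m + noise."""
    M, N = A.shape
    m = np.zeros(N)
    support = rng.choice(N, size=k, replace=False)  # random sparse support
    m[support] = rng.standard_normal(k)             # random nonzero amplitudes
    d = A @ m + noise_std * rng.standard_normal(M)  # noisy linear observation
    return m, d

A = rng.standard_normal((50, 100)) / np.sqrt(50)    # random measurement matrix
m, d = make_sparse_training_pair(A, k=5)
```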
Therefore, a potential research direction is to study the errors contained in the training data and methods to reduce or even eliminate the recovery error caused by the training data.

The recovery error may also result from the mismatch between the training data and the test data. For example, in image denoising, a mismatch between the noise distributions of the training data and the testing data leads to performance degradation [60]. In [191], Agostinelli et al. address this problem by connecting networks that are trained under different noise distributions in parallel according to learned weights. However, such methods increase the model complexity and lead to heavy computation. A more straightforward solution is to increase the diversity of the training data. In [108], Zhang et al. construct their training data set with different noise distributions and train a single NN to deal with multiple noise distributions. However, this method is not suitable for training-time-limited scenarios, since it is impossible to include all possible m in a limited training data set.

The process of using a DL based method to solve an LIP can be seen as choosing an optimal function, from a class of functions defined by a NN, for the mapping relationship between the model parameters m and the observed data d. By carefully designing the network architecture, we are designing a class of functions that is closer to the mapping relationship, which helps with faster convergence and finding the optimal solution. However, the design of the network architecture still lacks theoretical support and thus remains intractable; more theoretical exploration is needed.

In LIPs, there usually exists prior knowledge about the model parameters, e.g., their spatial distribution or mutual dependence. We expect to gain further improvements in convergence speed and performance by incorporating this prior knowledge into network designs. Structured networks also have benefits in other respects. For example, through weight sharing between network layers, LISTA has fewer parameters than a common NN, and thus is less likely to over-fit [59]. A popular method is to design networks based on the unfolding of iterative algorithms [59, 64]. Since the traditional iterative algorithm already provides an estimate for the LIP, the time-unfolded network can directly obtain a sub-optimal solution without training. As such, a structured network can obtain a better solution than the iterative algorithm after training, and needs less training data and time to obtain the same performance compared with a common network. However, unfolding based networks may also converge to a local optimum under the misleading influence of the iterative algorithm. Therefore, a potential research direction is to investigate the theoretical bound that a structured network can achieve for a specific inverse problem, such as the maximum convergence speed and the highest accuracy. Further research is also needed on the design of structured networks that achieve performance close to the theoretical bound, beyond unfolding based methods. Another potential research direction is the tradeoff between convergence speed and accuracy. For example, in [62], Xin et al. demonstrate that an FNN with independent weights has better estimation accuracy at the cost of slower convergence in cases where the linear operator A has coherent columns.
Modern inverse problems increasingly involve high dimensional data such as tensors [221, 222, 223, 224], which usually exhibit inter-dimension correlations [225]. However, at present, most DL based methods for solving LIPs operate on low-dimensional data, e.g., vectors and matrices. One way to use existing models to process high dimensional data is to first reduce the dimension of the input data, for example by flattening a three-dimensional tensor into a two-dimensional matrix. However, the dimensionality reduction process is usually accompanied by the loss of inter-dimension correlation information. A potential solution is to design specialized networks for high dimensional data processing. For example, 3-D convolutions can be used to explore the spatial and spectral characteristics of hyperspectral images [226, 105] (a minimal example is sketched at the end of this subsection). Another popular method for high dimensional tensor processing is deep tensor factorization (DTF), which takes temporal or spatial information into account. DTF can extract hierarchical and meaningful features of multi-channel images such as hyperspectral images, and is thus popular in image classification and pattern classification [227, 228, 229]. DTF can also be used in recommender systems [230], scene decomposition [231], and fault diagnosis [232].

Another problem is that processing high dimensional data needs larger and deeper networks, which means a rapid increase in the number of network parameters and a surge in the demand for hardware with high computational capability. However, DL heavily relies on the highly parallel computation of GPUs for training, while GPUs have limited memory, which makes DL based methods encounter computational difficulties when processing high dimensional data. A possible solution is distributed DL, such as model parallelism and data parallelism. In model parallelism, the whole network is partitioned into small components that are trained on different machines. In data parallelism, each machine holds a complete copy of the entire model and a portion of the training data, and the updates computed on each machine are then aggregated, e.g., by averaging gradients. Model parallelism and data parallelism can be combined to accelerate training [233]. Besides, there are several methods to train distributed NNs, and each method has many variants [234, 235, 236, 237]. One potential research direction is the maximum accuracy that distributed DL can attain with a specific training algorithm under given conditions such as limited training time or limited training data. Besides, one could also consider the tradeoff between model accuracy and runtime [238].
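A minimal sketch of the 3-D convolution idea mentioned above: treating a hyperspectral cube as a 3-D volume lets one kernel slide jointly over the two spatial axes and the spectral axis, so inter-dimension correlations are not discarded by flattening. The cube size and kernel shape are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

# Treat a hyperspectral cube as a single-channel 3-D volume:
# (batch, channels=1, spectral bands, height, width).
cube = torch.randn(1, 1, 31, 64, 64)

# The kernel spans 7 spectral bands and a 3x3 spatial window, so spatial and
# spectral correlations are captured jointly.
conv3d = nn.Conv3d(in_channels=1, out_channels=16,
                   kernel_size=(7, 3, 3), padding=(3, 1, 1))

features = conv3d(cube)
print(features.shape)  # torch.Size([1, 16, 31, 64, 64])
```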
In general, DL based methods with more complex networks have better accuracy. However, complex models usually involve a great number of parameters, which increases the difficulty of training and limits their usage in computing-resource-constrained applications. Therefore, an important research direction is the design of light and efficient network architectures, which helps to effectively deploy DL models on various hardware platforms [239, 240, 241, 242].

A carefully designed network architecture can effectively reduce the redundancy and computation of DL models, and thus speed up the solving process without sacrificing reconstruction accuracy. Representative work includes SqueezeNet [243] and MobileNet [240]. Another method is compressing an existing network to decrease the number of parameters and the required computational resources, under a guarantee of reconstruction accuracy [244, 245, 246, 247, 248]. For example, the model pruning method compresses the model by cutting unimportant connections of a trained model according to some effective evaluation criteria [249]. The network quantization method cuts the redundancy of the data by reducing the code length and the number of bits, according to the data distribution in the trained model [250]. Another efficient method is network binarization, where the original floating-point weights are forced to be +1 and −1.
For a specific LIP, it remains a challenge to choose a suitable method to balance accuracy and computation speed.

Future AI-driven automation will bring about a step-change in the ability to create efficient, resilient, and user-centric services. However, the very same algorithms may also cause irreversible environmental damage due to their high energy consumption, and lead to serious global sustainability issues. To achieve the UN sustainable development goals in the context of lightweight and green AI, we need to reduce computation and energy consumption.

Model compression approaches aim at reducing the size of DNN target operations and the data access overhead in both training and inference, which is highly related to the number of neurons and the associated weights. Theoretical results on the optimal DNN architecture are lacking, although neuroevolution does offer a numerical pathway to finding optimal architectures. Previous studies have revealed that NNs are typically over-parameterized, and there is significant redundancy that can be exploited [251]. Therefore, it is possible to achieve similar function approximation performance by removing redundant network structure (e.g., pruning the network) and only retaining the useful parts, with a greatly reduced model size; a minimal pruning sketch is given below. The second method is architectural innovation, such as replacing fully-connected layers with convolutional layers, which are relatively more compact. Another method is weight quantization. Some of the aforementioned DNN compression practices have already emerged in recent mobile DL applications.
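As a toy illustration of the pruning idea, the following removes the smallest-magnitude weights of a layer; real compression pipelines typically prune iteratively and fine-tune afterwards, and the layer size and sparsity level here are arbitrary.

```python
import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Linear, sparsity: float = 0.9):
    """Zero out the smallest-magnitude weights of a layer (toy magnitude pruning)."""
    with torch.no_grad():
        w = layer.weight
        k = int(sparsity * w.numel())                    # number of weights to remove
        threshold = w.abs().flatten().kthvalue(k).values # k-th smallest magnitude
        mask = (w.abs() > threshold).float()             # keep only the large weights
        w.mul_(mask)
    return mask

layer = nn.Linear(512, 512)
mask = magnitude_prune(layer, sparsity=0.9)
print(f"remaining nonzero weights: {int(mask.sum())} / {mask.numel()}")
```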
In practical applications, there is a contradiction between the limited training data and training time on the one hand, and the infinite real data and various application scenarios on the other. Although DL methods succeed in specific scenarios, it is very costly to train different DL models for different application scenarios. Thus, research on the generalization of DL models is important and essential.

While this article focuses on the applications of DL in solving LIPs, there also exist several works using DL to solve various nonlinear inverse problems, especially CS problems with quantized measurements [259]. For example, in [260], Takabe et al. propose a complex-field trainable ISTA (C-TISTA) based on the concept of deep unfolding, which aims to solve complex-field nonlinear inverse problems. In C-TISTA, they use a trainable shrinkage function to exploit various prior information such as sparsity. While Mahabadi et al. try to learn the sampling process of quantized CS [261], Leinonen and Codreanu directly optimize the whole sampling and recovery process jointly with an encoder and a decoder implemented via NNs [262]. A similar method for the joint optimization of measurement and recovery in quantized CS can also be found in [263], where the NN consists of a binary measurement matrix, a non-uniform quantizer, and a non-iterative recovery solver. Considering their high computing and expressive power, the use of NNs in nonlinear inverse problems is a promising direction.
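As a simple illustration of where the nonlinearity enters, the following sketches a generic one-bit quantized CS forward model in which only the signs of the linear measurements are kept; this is a textbook-style setup, not the specific model of [259, 261, 262, 263].

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, k = 100, 256, 8

A = rng.standard_normal((M, N)) / np.sqrt(M)                 # random measurement matrix
m = np.zeros(N)
m[rng.choice(N, k, replace=False)] = rng.standard_normal(k)  # k-sparse signal

d = np.sign(A @ m)  # one-bit quantization: only the signs survive,
                    # so the mapping from m to d is nonlinear
```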
5. Conclusion

In this paper, we presented a comprehensive survey of recent achievements in using DL to solve LIPs. We summarized the use of various DL architectures, optimization algorithms, loss functions and techniques in solving LIPs. For LIPs with structured information, we presented how this structure is used in the design of various DL models. Our hope is that this article can provide guidance for designing NNs to solve various LIPs. In addition to the recent progress, there are still many open challenges and promising future directions, including the construction of training datasets, the design of structured networks, techniques for high dimensional data processing in NNs, the design of light and efficient network architectures, and the problems arising in practical applications.
References

[1] G. Backus, F. Gilbert, The resolving power of gross earth data, Geophysical Journal International 16 (2) (1968) 169–205.
[2] S. I. Kabanikhin, Definitions and examples of inverse and ill-posed problems, Journal of Inverse and Ill-Posed Problems 16 (4) (2008) 317–357.
[3] X. Li, J. Fang, H. Duan, Z. Chen, H. Li, Fast beam alignment for millimeter wave communications: A sparse encoding and phaseless decoding approach, IEEE Transactions on Signal Processing 67 (17) (2019) 4402–4417.
[4] X. Li, J. Fang, H. Li, P. Wang, Millimeter wave channel estimation via exploiting joint sparse and low-rank structures, IEEE Transactions on Wireless Communications 17 (2) (2018) 1123–1133.
[5] W. Chen, I. J. Wassell, Cost-aware activity scheduling for compressive sleeping wireless sensor networks, IEEE Transactions on Signal Processing 64 (9) (2016) 2314–2323.
[6] W. Chen, I. J. Wassell, Optimized node selection for compressive sleeping wireless sensor networks, IEEE Transactions on Vehicular Technology 65 (2) (2016) 827–836.
[7] W. Chen, I. J. Wassell, A decentralized Bayesian algorithm for distributed compressive sensing in networked sensing systems, IEEE Transactions on Wireless Communications 15 (2) (2016) 1282–1292.
[8] J. F. Murray, K. Kreutz-Delgado, An improved FOCUSS-based learning algorithm for solving sparse linear inverse problems, in: Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers (Cat. No. 01CH37256), Vol. 1, 2001, pp. 347–351.
[9] X. Shen, Y. Gu, Nonconvex sparse logistic regression with weakly convex regularization, IEEE Transactions on Signal Processing 66 (12) (2018) 3199–3211.
[10] H. Lee, A. Battle, R. Raina, A. Y. Ng, Efficient sparse coding algorithms, in: Advances in Neural Information Processing Systems, 2007, pp. 801–808.
[11] G. Li, Y. Gu, Restricted isometry property of Gaussian random projection for finite set of subspaces, IEEE Transactions on Signal Processing 66 (7) (2018) 1705–1720.
[12] W. U. Bajwa, M. F. Duarte, R. Calderbank, Conditioning of random block subdictionaries with applications to block-sparse recovery and regression, IEEE Transactions on Information Theory 61 (7) (2015) 4060–4079.
[13] P. Chen, I. W. Selesnick, Group-sparse signal denoising: Non-convex regularization, convex optimization, IEEE Transactions on Signal Processing 62 (13) (2014) 3464–3478.
[14] R. G. Baraniuk, V. Cevher, M. F. Duarte, C. Hegde, Model-based compressive sensing, IEEE Transactions on Information Theory 56 (4) (2010) 1982–2001.
[15] J. Fang, Y. Shen, H. Li, P. Wang, Pattern-coupled sparse Bayesian learning for recovery of block-sparse signals, IEEE Transactions on Signal Processing 63 (2) (2015) 360–372.
[16] J. Fang, F. Wang, Y. Shen, H. Li, R. S. Blum, Super-resolution compressed sensing for line spectral estimation: An iterative reweighted approach, IEEE Transactions on Signal Processing 64 (18) (2016) 4649–4662.
[17] W. Chen, Simultaneous sparse Bayesian learning with partially shared supports, IEEE Signal Processing Letters 24 (11) (2017) 1641–1645.
[18] W. Chen, D. Wipf, Y. Wang, Y. Liu, I. J. Wassell, Simultaneous Bayesian sparse approximation with structured sparse models, IEEE Transactions on Signal Processing 64 (23) (2016) 6145–6159.
[19] C. Chen, Y. Li, J. Huang, Forest sparsity for multi-channel compressive sensing, IEEE Transactions on Signal Processing 62 (11) (2014) 2803–2813.
[20] W. Zou, K. Kpalma, Z. Liu, J. Ronsin, Segmentation driven low-rank matrix recovery for saliency detection, in: 24th British Machine Vision Conference (BMVC), 2013, pp. 1–13.
[21] R. Basri, D. W. Jacobs, Lambertian reflectance and linear subspaces, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2) (2003) 218–233.
[22] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer (8) (2009) 30–37.
[23] Z. Zhou, J. Fang, L. Yang, H. Li, Z. Chen, R. S. Blum, Low-rank tensor decomposition-aided channel estimation for millimeter wave MIMO-OFDM systems, IEEE Journal on Selected Areas in Communications 35 (7) (2017) 1524–1538.
[24] L. Yang, J. Fang, H. Duan, H. Li, B. Zeng, Fast low-rank Bayesian matrix completion with hierarchical Gaussian prior models, IEEE Transactions on Signal Processing 66 (11) (2018) 2804–2817.
[25] W. Chen, Simultaneously sparse and low-rank matrix reconstruction via nonconvex and nonseparable regularization, IEEE Transactions on Signal Processing 66 (20) (2018) 5313–5323.
[26] E. J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis?, Journal of the ACM (JACM) 58 (3) (2011) 11.
[27] J. Liu, P. Musialski, P. Wonka, J. Ye, Tensor completion for estimating missing values in visual data, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1) (2013) 208–220.
[28] W. Chen, X. Gong, N. Song, Nonconvex robust low-rank tensor reconstruction via an empirical Bayes method, IEEE Transactions on Signal Processing 67 (22) (2019) 5785–5797.
[29] C. Saxena, D. Kourav, Noises and image denoising techniques: a brief survey, International Journal of Emerging Technology and Advanced Engineering 4 (3) (2014) 878–885.
[30] M. Chen, H. Zhang, G. Lin, An adaptive directional non-local means algorithm with size-adaptive search window for image denoising, in: 2018 3rd International Conference on Smart City and Systems Engineering (ICSCSE), 2018, pp. 834–839.
[31] T. Qiao, J. Ren, Z. Wang, J. Zabalza, M. Sun, H. Zhao, S. Li, J. A. Benediktsson, Q. Dai, S. Marshall, Effective denoising and classification of hyperspectral images using curvelet transform and singular spectrum analysis, IEEE Transactions on Geoscience and Remote Sensing 55 (1) (2017) 119–133.
[32] Hyung Il Koo, Nam Ik Cho, Image denoising based on a statistical model for wavelet coefficients, in: 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 1269–1272.
[33] K. Dabov, A. Foi, V. Katkovnik, K. Egiazarian, Image denoising by sparse 3-D transform-domain collaborative filtering, IEEE Transactions on Image Processing 16 (8) (2007) 2080–2095.
[34] W. Dong, L. Zhang, G. Shi, X. Li, Nonlocally centralized sparse representation for image restoration, IEEE Transactions on Image Processing 22 (4) (2013) 1620–1630.
[35] S. Gu, L. Zhang, W. Zuo, X. Feng, Weighted nuclear norm minimization with application to image denoising, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2862–2869.
[36] X. Zeng, W. Bian, W. Liu, J. Shen, D. Tao, Dictionary pair learning on Grassmann manifolds for image denoising, IEEE Transactions on Image Processing 24 (11) (2015) 4556–4569.
[37] S. K. Sahoo, A. Makur, Enhancing image denoising by controlling noise incursion in learned dictionaries, IEEE Signal Processing Letters 22 (8) (2015) 1123–1126.
[38] S. Ravishankar, Y. Bresler, Learning doubly sparse transforms for images, IEEE Transactions on Image Processing 22 (12) (2013) 4598–4612.
[39] B. Wen, S. Ravishankar, Y. Bresler, VIDOSAT: High-dimensional sparsifying transform learning for online video denoising, IEEE Transactions on Image Processing 28 (4) (2019) 1691–1704.
[40] Sung Cheol Park, Min Kyu Park, Moon Gi Kang, Super-resolution image reconstruction: a technical overview, IEEE Signal Processing Magazine 20 (3) (2003) 21–36.
[41] W. Wang, J. Dong, S. Niu, Y. Chen, Edge-guided semi-coupled dictionary learning super resolution for retina image, in: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), 2019, pp. 1631–1634.
[42] X. Tian, J. Chen, A fast algorithm for single image super-resolution reconstruction via revised statistical prediction model, in: 2016 International Conference on Information System and Artificial Intelligence (ISAI), 2016, pp. 305–309.
[43] Z. Hu, T. Li, Y. Yang, X. Liu, H. Zheng, D. Liang, Super-resolution PET image reconstruction with sparse representation, in: 2017 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC), 2017, pp. 1–3.
[44] J. Choi, S. Bae, M. Kim, Single image super-resolution based on self-examples using context-dependent subpatches, in: 2015 IEEE International Conference on Image Processing (ICIP), 2015, pp. 2835–2839.
[45] A. Jalali, P. Ravikumar, S. Sanghavi, A dirty model for multiple sparse regression, IEEE Transactions on Information Theory 59 (12) (2013) 7947–7968.
[46] A. H. Shahana, V. Preeja, Survey on feature subset selection for high dimensional data, in: 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), 2016, pp. 1–4.
[47] X. Wang, Y. Gu, Cross-label suppression: A discriminative and fast dictionary learning with group regularization, IEEE Transactions on Image Processing 26 (8) (2017) 3859–3873.
[48] J. Qi, W. Chen, Learning a discriminative dictionary for classification with outliers, Signal Processing 152 (2018) 255–264.
[49] W. Chen, I. J. Wassell, M. R. D. Rodrigues, Dictionary design for distributed compressive sensing, IEEE Signal Processing Letters 22 (1) (2015) 95–99.
[50] I. Tošić, P. Frossard, Dictionary learning, IEEE Signal Processing Magazine 28 (2) (2011) 27–38.
[51] S. Tariyal, A. Majumdar, R. Singh, M. Vatsa, Deep dictionary learning, IEEE Access 4 (2016) 10096–10109.
[52] X. Gong, W. Chen, J. Chen, A low-rank tensor dictionary learning method for hyperspectral image denoising, IEEE Transactions on Signal Processing 68 (2020) 1168–1180.
[53] X. Ding, W. Chen, I. J. Wassell, Joint sensing matrix and sparsifying dictionary optimization for tensor compressive sensing, IEEE Transactions on Signal Processing 65 (14) (2017) 3632–3646.
[54] E. J. Candès, The restricted isometry property and its implications for compressed sensing, Comptes Rendus Mathematique 346 (9-10) (2008) 589–592.
[55] T. Blumensath, M. E. Davies, Iterative hard thresholding for compressed sensing, Applied and Computational Harmonic Analysis 27 (3) (2009) 265–274.
[56] J. A. Tropp, A. C. Gilbert, Signal recovery from random measurements via orthogonal matching pursuit, IEEE Transactions on Information Theory 53 (12) (2007) 4655–4666.
[57] D. L. Donoho, A. Maleki, A. Montanari, Message-passing algorithms for compressed sensing, Proceedings of the National Academy of Sciences 106 (45) (2009) 18914–18919.
[58] M. Al-Shoukairi, P. Schniter, B. D. Rao, A GAMP-based low complexity sparse Bayesian learning algorithm, IEEE Transactions on Signal Processing 66 (2) (2018) 294–308.
[59] K. Gregor, Y. LeCun, Learning fast approximations of sparse coding, in: Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 399–406.
[60] H. C. Burger, C. J. Schuler, S. Harmeling, Image denoising: Can plain neural networks compete with BM3D?, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2392–2399.
[61] Y. Wang, J. Morel, Can a single image denoising neural network handle all levels of Gaussian noise?, IEEE Signal Processing Letters 21 (9) (2014) 1150–1153.
[62] B. Xin, Y. Wang, W. Gao, D. Wipf, B. Wang, Maximal sparsity with deep networks?, in: Advances in Neural Information Processing Systems, 2016, pp. 4340–4348.
[63] Z. Wang, Q. Ling, T. Huang, Learning deep ℓ0 encoders, in: AAAI Conference on Artificial Intelligence, 2016, pp. 2194–2200.
[64] M. Borgerding, P. Schniter, Onsager-corrected deep learning for sparse linear inverse problems, in: 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2016, pp. 227–231.
[65] M. Borgerding, P. Schniter, S. Rangan, AMP-inspired deep networks for sparse linear inverse problems, IEEE Transactions on Signal Processing 65 (16) (2017) 4293–4308.
[66] K. Dabov, A. Foi, V. Katkovnik, K. Egiazarian, Color image denoising via sparse 3D collaborative filtering with grouping constraint in luminance-chrominance space, in: 2007 IEEE International Conference on Image Processing, Vol. 1, 2007, pp. I-313–I-316.
[67] M. Aharon, M. Elad, A. Bruckstein, K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation, IEEE Transactions on Signal Processing 54 (11) (2006) 4311–4322.
[68] J. Xu, L. Zhang, D. Zhang, X. Feng, Multi-channel weighted nuclear norm minimization for real color image denoising, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1105–1113.
69] J. Xu, L. Zhang, D. Zhang, A trilateral weighted sparse coding scheme for real-world imagedenoising, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018,pp. 20–36.[70] S. Guo, Z. Yan, K. Zhang, W. Zuo, L. Zhang, Toward convolutional blind denoising of realphotographs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-nition, 2019.[71] T. Pl¨otz, S. Roth, Benchmarking denoising algorithms with real photographs, in: 2017 IEEEConference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2750–2759.[72] J. Xu, H. Li, Z. Liang, D. Zhang, L. Zhang, Real-world noisy image denoising: A new bench-mark, Real-world noisy image denoising: A new benchmark.[73] M. T. McCann, K. H. Jin, M. Unser, Convolutional neural networks for inverse problems inimaging: A review, IEEE Signal Processing Magazine 34 (6) (2017) 85–95.[74] A. Lucas, M. Iliadis, R. Molina, A. K. Katsaggelos, Using deep neural networks for inverseproblems in imaging: Beyond analytical methods, IEEE Signal Processing Magazine 35 (1)(2018) 20–36.[75] W. Yang, X. Zhang, Y. Tian, W. Wang, J. Xue, Q. Liao, Deep learning for single image super-resolution: A brief review, IEEE Transactions on Multimedia 21 (12) (2019) 3106–3121.[76] D. Liang, J. Cheng, Z. Ke, L. Ying, Deep magnetic resonance image reconstruction: Inverseproblems meet neural networks, IEEE Signal Processing Magazine 37 (1) (2020) 141–151.[77] G. Ongie, A. Jalal, C. A. Metzler, R. G. Baraniuk, A. G. Dimakis, R. Willett, Deep learningtechniques for inverse problems in imaging, IEEE Journal on Selected Areas in InformationTheory 1 (1) (2020) 39–56.[78] S. Arridge, P. Maass, O. ¨Oktem, C. Sch¨onlieb, Solving inverse problems using data-drivenmodels, Acta Numerica 28 (2019) 1–174.[79] J. R. Hershey, J. L. Roux, F. Weninger, Deep unfolding: Model-based inspiration of noveldeep architectures, arXiv preprint arXiv:1409.2574.[80] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving,M. Isard, et al., Tensorflow: a system for large-scale machine learning, in: OSDI, Vol. 16,2016, pp. 265–283.[81] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,N. Gimelshein, L. Antiga, et al., Pytorch: An imperative style, high-performance deep learn-ing library, in: Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.[82] I. Daubechies, M. Defrise, C. De Mol, An iterative thresholding algorithm for linear inverseproblems with a sparsity constraint, Communications on Pure and Applied Mathematics: AJournal Issued by the Courant Institute of Mathematical Sciences 57 (11) (2004) 1413–1457.[83] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level per-formance on imagenet classification, in: 2015 IEEE International Conference on ComputerVision (ICCV), 2015, pp. 1026–1034.[84] H. Zhang, H. Shi, W. Wang, Cascade deep networks for sparse linear inverse problems, in:2018 24th International Conference on Pattern Recognition (ICPR), 2018, pp. 812–817.[85] X. Chen, J. Liu, Z. Wang, W. Yin, Theoretical linear convergence of unfolded ista and itspractical weights and thresholds, in: Advances in Neural Information Processing Systems,2018, pp. 9061–9071.[86] A. Aberdam, A. Golts, M. Elad, Ada-lista: Learned solvers adaptive to varying models, arXivpreprint arXiv:2001.08456.[87] J. Liu, X. Chen, Z. Wang, W. Yin, Alista: Analytic weights are as good as learned weights n lista.[88] P. Ablin, T. Moreau, M. Massias, A. 
Gramfort, Learning step sizes for unfolded sparse coding,in: Advances in Neural Information Processing Systems, 2019, pp. 13100–13110.[89] M. Scetbon, M. Elad, P. Milanfar, Deep k-svd denoising, arXiv preprint arXiv:1909.13164.[90] D. Ito, S. Takabe, T. Wadayama, Trainable ista for sparse signal recovery, IEEE Transactionson Signal Processing 67 (12) (2019) 3113–3125.[91] M. Yao, J. Dang, Z. Zhang, L. Wu, Sure-tista: A signal recovery network for compressedsensing, in: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech andSignal Processing (ICASSP), 2019, pp. 3832–3836.[92] J. Fan, J. Cheng, Matrix completion by deep matrix factorization, Neural Networks 98 (2018)34–41.[93] S. Arora, N. Cohen, W. Hu, Y. Luo, Implicit regularization in deep matrix factorization, in:Advances in Neural Information Processing Systems, 2019, pp. 7413–7424.[94] S. Rangan, P. Schniter, A. K. Fletcher, Vector approximate message passing, in: 2017 IEEEInternational Symposium on Information Theory (ISIT), 2017, pp. 1588–1592.[95] J. Pu, Y. Panagakis, M. Pantic, Learning differentiable sparse and low rank networks foraudio-visual object localization, in: ICASSP 2020 - 2020 IEEE International Conference onAcoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8668–8672.[96] J. Lewis D., V. Singhal, A. Majumdar, Solving inverse problems in imaging via deep dictionarylearning, IEEE Access 7 (2019) 37039–37049.[97] V. Singhal, A. Majumdar, Reconstructing multi-echo magnetic resonance images via struc-tured deep dictionary learning, Neurocomputing.[98] V. Singhal, A. Majumdar, A domain adaptation approach to solve inverse problems in imagingvia coupled deep dictionary learning, Pattern Recognition (2019) 107163.[99] J. Huang, P. L. Dragotti, A deep dictionary model for image super-resolution, in: 2018 IEEEInternational Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp.6777–6781.[100] X. Wang, Q. Tao, L. Wang, D. Li, M. Zhang, Deep convolutional architecture for naturalimage denoising, in: 2015 International Conference on Wireless Communications Signal Pro-cessing (WCSP), 2015, pp. 1–4.[101] K. Zhang, W. Zuo, L. Zhang, FFDNet: Toward a fast and flexible solution for cnn-basedimage denoising, IEEE Transactions on Image Processing 27 (9) (2018) 4608–4622.[102] X. Zhang, R. Wu, Fast depth image denoising and enhancement using a deep convolutionalnetwork, in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP), 2016, pp. 2499–2503.[103] Y. Chang, L. Yan, H. Fang, S. Zhong, W. Liao, Hsi-denet: Hyperspectral image restoration viaconvolutional neural network, IEEE Transactions on Geoscience and Remote Sensing 57 (2)(2019) 667–682.[104] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEEConference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.[105] Q. Yuan, Q. Zhang, J. Li, H. Shen, L. Zhang, Hyperspectral image denoising employing aspatialcspectral deep residual convolutional neural network, IEEE Transactions on Geoscienceand Remote Sensing 57 (2) (2019) 1205–1218.[106] A. Panda, R. Naskar, S. Rajbans, S. Pal, A 3d wide residual network with perceptual loss forbrain mri image denoising, in: 2019 10th International Conference on Computing, Communi-cation and Networking Technologies (ICCCNT), 2019, pp. 1–7. ecognition, 2018, pp. 1828–1837.[124] C. Dong, C. C. Loy, K. He, X. 
[125] Z. Wang, D. Liu, J. Yang, W. Han, T. Huang, Deep networks for image super-resolution with sparse prior, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 370–378.
[126] S. Lefkimmiatis, Non-local color image denoising with convolutional neural networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5882–5891.
[127] F. Yu, V. Koltun, Multi-scale context aggregation by dilated convolutions, arXiv preprint arXiv:1511.07122.
[128] M. Bevilacqua, A. Roumy, C. Guillemot, M. L. Alberi-Morel, Low-complexity single-image super-resolution based on nonnegative neighbor embedding, in: British Machine Vision Conference (BMVC), 2012.
[129] R. Zeyde, M. Elad, M. Protter, On single image scale-up using sparse-representations, in: International Conference on Curves and Surfaces, 2010, pp. 711–730.
[130] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, Y. Fu, Residual dense network for image super-resolution, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 2472–2481.
[131] V. Lempitsky, A. Vedaldi, D. Ulyanov, Deep image prior, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 9446–9454.
[132] O. Sidorov, J. Y. Hardeberg, Deep hyperspectral prior: Single-image denoising, inpainting, super-resolution, in: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019, pp. 3844–3851.
[133] K. Gong, C. Catana, J. Qi, Q. Li, PET image reconstruction using deep image prior, IEEE Transactions on Medical Imaging 38 (7) (2019) 1655–1665.
[134] K. Gong, K. Kim, D. Wu, M. K. Kalra, Q. Li, Low-dose dual energy CT image reconstruction using non-local deep image prior, in: 2019 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC), 2019, pp. 1–2.
[135] D. Van Veen, A. Jalal, M. Soltanolkotabi, E. Price, S. Vishwanath, A. G. Dimakis, Compressed sensing with deep image prior and learned regularization, arXiv preprint arXiv:1806.06438.
[136] J. Ren, J. Liang, Y. Zhao, Soil pH measurement based on compressive sensing and deep image prior, IEEE Transactions on Emerging Topics in Computational Intelligence 4 (1) (2020) 74–82.
[137] J. Liu, Y. Sun, X. Xu, U. S. Kamilov, Image restoration using total variation regularized deep image prior, in: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 7715–7719.
[138] G. Jagatap, C. Hegde, High dynamic range imaging using deep image priors, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 9289–9293.
[139] G. Jagatap, C. Hegde, Algorithmic guarantees for inverse imaging with untrained network priors, in: Advances in Neural Information Processing Systems, 2019, pp. 14832–14842.
[140] S. Dittmer, T. Kluth, P. Maass, D. O. Baguer, Regularization by architecture: A deep prior approach for inverse problems, Journal of Mathematical Imaging and Vision (2019) 1–15.
[141] R. Heckel, M. Soltanolkotabi, Compressive sensing with un-trained neural networks: Gradient descent finds the smoothest approximation, arXiv preprint arXiv:2005.03991.
[142] Y. Yang, J. Sun, H. Li, Z. Xu, ADMM-CSNet: A deep learning approach for image compressive sensing, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (3) (2020) 521–538.
[143] C. A. Metzler, A. Maleki, R. G. Baraniuk, From denoising to compressed sensing, IEEE Transactions on Information Theory 62 (9) (2016) 5117–5144.
[144] C. Metzler, A. Mousavi, R. Baraniuk, Learned D-AMP: Principled neural network based compressive image recovery, in: Advances in Neural Information Processing Systems, 2017, pp. 1772–1783.
[145] O. Solomon, R. Cohen, Y. Zhang, Y. Yang, Q. He, J. Luo, R. J. G. van Sloun, Y. C. Eldar, Deep unfolded robust PCA with application to clutter suppression in ultrasound, IEEE Transactions on Medical Imaging 39 (4) (2020) 1051–1063.
[146] J. Yang, J. Wright, T. Huang, Y. Ma, Image super-resolution as sparse representation of raw image patches, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[147] J. Yang, J. Wright, T. S. Huang, Y. Ma, Image super-resolution via sparse representation, IEEE Transactions on Image Processing 19 (11) (2010) 2861–2873.
[148] D. Liu, Z. Wang, B. Wen, J. Yang, W. Han, T. S. Huang, Robust single image super-resolution via deep networks with sparse prior, IEEE Transactions on Image Processing 25 (7) (2016) 3194–3207.
[149] S. Lefkimmiatis, Universal denoising networks: A novel CNN architecture for image denoising, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 3204–3213.
[150] Y. Chen, T. Pock, Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6) (2017) 1256–1272.
[151] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A. Y. Ng, Multimodal deep learning, in: ICML, 2011.
[152] I. Marivani, E. Tsiligianni, B. Cornelis, N. Deligiannis, Learned multimodal convolutional sparse coding for guided image super-resolution, in: 2019 IEEE International Conference on Image Processing (ICIP), 2019, pp. 2891–2895.
[153] I. Marivani, E. Tsiligianni, B. Cornelis, N. Deligiannis, Multimodal image super-resolution via deep unfolding with side information, in: 2019 27th European Signal Processing Conference (EUSIPCO), 2019, pp. 1–5.
[154] X. Deng, P. L. Dragotti, Deep coupled ISTA network for multi-modal image super-resolution, IEEE Transactions on Image Processing 29 (2020) 1683–1698.
[155] A. Falvo, D. Comminiello, S. Scardapane, G. Finesi, M. Scarpiniti, A. Uncini, A multimodal deep network for the reconstruction of T2W MR images, arXiv preprint arXiv:1908.03009.
[156] E. Tsiligianni, N. Deligiannis, Deep coupled-representation learning for sparse linear inverse problems with side information, IEEE Signal Processing Letters 26 (12) (2019) 1768–1772.
[157] K. Qiu, X. Mao, X. Shen, X. Wang, T. Li, Y. Gu, Time-varying graph signal reconstruction, IEEE Journal of Selected Topics in Signal Processing 11 (6) (2017) 870–883.
[158] H. Palangi, R. Ward, L. Deng, Distributed compressive sensing: A deep learning approach, IEEE Transactions on Signal Processing 64 (17) (2016) 4504–4518.
[159] H. Palangi, R. Ward, L. Deng, Reconstruction of sparse vectors in compressive sensing with multiple measurement vectors using bidirectional long short-term memory, in: 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2016, pp. 192–196.
[160] C. Lyu, Z. Liu, L. Yu, Block-sparsity recovery via recurrent neural network, Signal Processing 154 (2019) 129–135.
[161] D. Li, Y. Liu, Z. Wang, Video super-resolution using motion compensation and residual bidirectional recurrent convolutional network, in: 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 1642–1646.
[162] G. Hinton, N. Srivastava, K. Swersky, Neural networks for machine learning, Lecture 6a: Overview of mini-batch gradient descent, 2012.
[163] B. Lim, K. M. Lee, Deep recurrent ResNet for video super-resolution, in: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017, pp. 1452–1455.
[164] Y. Huang, W. Wang, L. Wang, Video super-resolution via bidirectional recurrent convolutional networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4) (2018) 1015–1028.
[165] D. Li, Y. Liu, Z. Wang, Video super-resolution using non-simultaneous fully recurrent convolutional network, IEEE Transactions on Image Processing 28 (3) (2019) 1342–1355.
[166] M. Haris, G. Shakhnarovich, N. Ukita, Recurrent back-projection network for video super-resolution, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3892–3901.
[167] J. Kim, J. K. Lee, K. M. Lee, Deeply-recursive convolutional network for image super-resolution, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1637–1645.
[168] X. Yang, H. Mei, J. Zhang, K. Xu, B. Yin, Q. Zhang, X. Wei, DRFN: Deep recurrent fusion network for single-image super-resolution with large factors, IEEE Transactions on Multimedia 21 (2) (2019) 328–337.
[169] Z. Wang, P. Yi, K. Jiang, J. Jiang, Z. Han, T. Lu, J. Ma, Multi-memory convolutional neural network for video super-resolution, IEEE Transactions on Image Processing 28 (5) (2019) 2530–2544.
[170] Y. Wang, F. Liu, K. Zhang, G. Hou, Z. Sun, T. Tan, LFNet: A novel bidirectional recurrent convolutional neural network for light-field image super-resolution, IEEE Transactions on Image Processing 27 (9) (2018) 4274–4286.
[171] W. Wang, C. Pang, Z. Liu, R. Lan, X. Luo, SRGNet: A GRU-based feature fusion network for image denoising, in: 2019 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), 2019, pp. 1–2.
[172] C. Qin, J. Schlemper, J. Caballero, A. N. Price, J. V. Hajnal, D. Rueckert, Convolutional recurrent neural networks for dynamic MR image reconstruction, IEEE Transactions on Medical Imaging 38 (1) (2019) 280–290.
[173] P. Putzky, M. Welling, Recurrent inference machines for solving inverse problems, arXiv preprint arXiv:1706.04008.
[174] S. Wisdom, T. Powers, J. Pitton, L. Atlas, Building recurrent networks by unfolding iterative thresholding for sequential sparse recovery, in: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4346–4350.
[175] H. D. Le, H. Van Luong, N. Deligiannis, Designing recurrent neural networks by unfolding an l1-l1 minimization algorithm, in: 2019 IEEE International Conference on Image Processing (ICIP), 2019, pp. 2329–2333.
[176] J. Zhou, K. Di, J. Du, X. Peng, H. Yang, S. Pan, I. W. Tsang, Y. Liu, Z. Qin, R. S. M. Goh, SC2Net: Sparse LSTMs for sparse coding, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[229] … Conference (EUSIPCO), 2019, pp. 1–5.
[230] Z. Chen, S. Gai, D. Wang, Deep tensor factorization for multi-criteria recommender systems, in: 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 1046–1051.
[231] J. Casebeer, M. Colomb, P. Smaragdis, Deep tensor factorization for spatially-aware scene decomposition, in: 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019, pp. 180–184.
[232] L. Luo, L. Xie, H. Su, Deep learning with tensor factorization layers for sequential fault diagnosis and industrial process monitoring, IEEE Access 8 (2020) 105494–105506.
[233] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al., Large scale distributed deep networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1223–1231.
[234] M. Langer, A. Hall, Z. He, W. Rahayu, MPCA SGD: A method for distributed training of deep learning models on Spark, IEEE Transactions on Parallel and Distributed Systems 29 (11) (2018) 2540–2556.
[235] W. Zhang, S. Gupta, X. Lian, J. Liu, Staleness-aware async-SGD for distributed deep learning, arXiv preprint arXiv:1511.05950.
[236] N. Strom, Scalable distributed DNN training using commodity GPU cloud computing, in: Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[237] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, E. P. Xing, More effective distributed ML via a stale synchronous parallel parameter server, in: Advances in Neural Information Processing Systems, 2013, pp. 1223–1231.
[238] S. Gupta, W. Zhang, F. Wang, Model accuracy and runtime tradeoff in distributed deep learning: A systematic study, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), 2016, pp. 171–180.
[239] J. M. Alvarez, M. Salzmann, Learning the number of neurons in deep networks, in: Advances in Neural Information Processing Systems, 2016, pp. 2270–2278.
[240] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861.
[241] G. Huang, S. Liu, L. v. d. Maaten, K. Q. Weinberger, CondenseNet: An efficient DenseNet using learned group convolutions, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 2752–2761.
[242] X. Lin, C. Zhao, W. Pan, Towards accurate binary convolutional neural network, in: Advances in Neural Information Processing Systems, 2017, pp. 345–353.
[243] X. Zhang, X. Zhou, M. Lin, J. Sun, ShuffleNet: An extremely efficient convolutional neural network for mobile devices, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
[244] Y. Cheng, D. Wang, P. Zhou, T. Zhang, Model compression and acceleration for deep neural networks: The principles, progress, and challenges, IEEE Signal Processing Magazine 35 (1) (2018) 126–136.
[245] I. Oguntola, S. Olubeko, C. Sweeney, SlimNets: An exploration of deep model compression and acceleration, in: 2018 IEEE High Performance Extreme Computing Conference (HPEC), 2018, pp. 1–6.
[246] J. Cheng, J. Wu, C. Leng, Y. Wang, Q. Hu, Quantized CNN: A unified approach to accelerate and compress convolutional networks, IEEE Transactions on Neural Networks and Learning Systems 29 (10) (2018) 4730–4743.
[247] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, W. J. Dally, EIE: Efficient inference engine on compressed deep neural network, in: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 243–254.
[248] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan, X. Ma, Y. Zhang, J. Tang, Q. Qiu, X. Lin, B. Yuan, CirCNN: Accelerating and compressing deep neural networks using block-circulant weight matrices, in: 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2017, pp. 395–408.
[249] S. Han, H. Mao, W. J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, arXiv preprint arXiv:1510.00149.
[250] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural networks: Training neural networks with low precision weights and activations, The Journal of Machine Learning Research 18 (1) (2017) 6869–6898.
[251] W. Wen, C. Wu, Y. Wang, Y. Chen, H. Li, Learning structured sparsity in deep neural networks, in: Advances in Neural Information Processing Systems, 2016, pp. 2082–2090.
[252] A. Azulay, Y. Weiss, Why do deep convolutional networks generalize so poorly to small image transformations?, arXiv preprint arXiv:1805.12177.
[253] J. Su, D. V. Vargas, K. Sakurai, One pixel attack for fooling deep neural networks, IEEE Transactions on Evolutionary Computation (2019) 1–1.
[254] G. Raskutti, M. J. Wainwright, B. Yu, Early stopping and non-parametric regression: An optimal data-dependent stopping rule, The Journal of Machine Learning Research 15 (1) (2014) 335–366.
[255] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (1) (2014) 1929–1958.
[256] H. Inoue, Data augmentation by pairing samples for images classification, arXiv preprint arXiv:1801.02929.
[257] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning requires rethinking generalization, arXiv preprint arXiv:1611.03530.
[258] B. Neyshabur, S. Bhojanapalli, D. McAllester, N. Srebro, Exploring generalization in deep learning, in: Advances in Neural Information Processing Systems, 2017, pp. 5947–5956.
[259] A. Zymnis, S. Boyd, E. Candes, Compressed sensing with quantized measurements, IEEE Signal Processing Letters 17 (2) (2010) 149–152.
[260] S. Takabe, T. Wadayama, Y. C. Eldar, Complex trainable ISTA for linear and nonlinear inverse problems, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 5020–5024.
[261] R. K. Mahabadi, J. Lin, V. Cevher, A learning-based framework for quantized compressed sensing, IEEE Signal Processing Letters 26 (6) (2019) 883–887.
[262] M. Leinonen, M. Codreanu, Quantized compressed sensing via deep neural networks, in: 2020 2nd 6G Wireless Summit (6G SUMMIT), 2020, pp. 1–5.
[263] B. Sun, H. Feng, K. Chen, X. Zhu, A deep learning framework of quantized compressed sensing for wireless neural recording, IEEE Access 4 (2016) 5169–5178.