Blind Image Restoration with Flow Based Priors
Leonhard Helminger∗, Michael Bernasconi∗, Abdelaziz Djelouah, Markus Gross, Christopher Schroers
∗ Equal contribution
Department of Computer Science, ETH Zurich, Switzerland · DisneyResearch|Studios, Zurich, Switzerland
Abstract
Image restoration has seen great progress in the last years thanks to the advances in deep neural networks. Most of these existing techniques are trained using full supervision with suitable image pairs to tackle a specific degradation. However, in a blind setting with unknown degradations this is not possible and a good prior remains crucial. Recently, neural network based approaches have been proposed to model such priors by leveraging either denoising autoencoders or the implicit regularization captured by the neural network structure itself. In contrast to this, we propose using normalizing flows to model the distribution of the target content and to use this as a prior in a maximum a posteriori (MAP) formulation. By expressing the MAP optimization process in the latent space through the learned bijective mapping, we are able to obtain solutions through gradient descent. To the best of our knowledge, this is the first work that explores normalizing flows as a prior in image enhancement problems. Furthermore, we present experimental results for a number of different degradations on data sets varying in complexity and show competitive results when comparing with the deep image prior approach.
1 Introduction

In today's digitized world, there is an increased demand to process existing older content. Examples are the archival of photo prints (Liu et al.) for more reliable long-term data storage, preparing heritage footage (Ame) for more engaging documentaries, and making classic films and existing catalog contents available to large new audiences through streaming services. This old content is, however, often of low quality and may be deteriorated in complex ways, which creates a need for blind image restoration methods that are generic and able to address a wide range of possibly combined degradations. Blind image restoration can be formulated as solving the following energy minimization problem:

$$x^\star = \arg\min_x \left[ \mathcal{L}_{\mathrm{data}}(\hat{x}, x) + \mathcal{L}_{\mathrm{reg}}(x) \right], \qquad (1)$$

where $\hat{x}$ is the observed image and $x^\star$ the restored image to be estimated. The first term, $\mathcal{L}_{\mathrm{data}}$, is a data fidelity term which can be problem dependent and ensures that the solution agrees with the observation; the second term, $\mathcal{L}_{\mathrm{reg}}(x)$, is a regularizer that typically encodes certain smoothness assumptions on the expected solution and thus pushes it to lie within a given space. From a Bayesian viewpoint, the posterior distribution of the restored image is $p(x \mid \hat{x}) \propto p(\hat{x} \mid x)\,p(x)$. This allows rewriting the above restoration problem into the following equivalent maximum a posteriori (MAP) estimate:

$$x^\star = \arg\max_x \log p(x \mid \hat{x}) = \arg\max_x \Big[ \underbrace{\log p(\hat{x} \mid x)}_{\text{data}} + \underbrace{\log p(x)}_{\text{reg}} \Big]. \qquad (2)$$

Preprint. Under review.

Figure 1: Comparative results with Deep Image Prior (Ulyanov et al., 2018) on different image restoration tasks (columns: degraded input, ours, Ulyanov et al. (2018), ground truth).
The first example corresponds to denoising, whereas the second is image inpainting. Our approach is able to remove the degradation and produces visually more pleasing results in some regions like the text and the teeth.

Equation (2) makes it more explicit that the regularizer should model prior knowledge about the unknown solution. Many handcrafted priors have been proposed, reflecting desired properties based on total variation (Rudin et al., 1992), gradient sparsity (Fergus et al., 2006) or the dark pixel prior (He et al., 2010). More recently, learning based priors have been explored, in particular the usage of denoising autoencoders (DAEs) as regularizers for inverse imaging problems (Meinhardt et al., 2017). Building on DAEs, Bigdeli et al. (2017) propose to use a Gaussian smoothed natural image distribution as prior. In a different direction, Ulyanov et al. (2018) showed that an important part of the image statistics is captured by the structure of a convolutional image generator, even independent of any learning.

All existing methods proposed alternatives and approximations to the true image prior $p(x)$ in Equation 2. However, with deep normalizing flows, we have an approach for tractable and exact log-likelihood computation (Dinh et al., 2017). Therefore, we propose to use normalizing flows for capturing the distribution of target high quality content to serve as a prior in the MAP formulation. In addition to this, the inference of the latent value that corresponds to a data point can be done exactly, without any approximation, since our generative model is invertible. We use this learned bijective mapping to express the MAP optimization process in the latent space and are able to obtain solutions through gradient descent.
In a number of experiments, we explore our approach for different degradations on data sets of varying complexity and we show that we can achieve competitive results, as illustrated in Figure 1.

The contribution of this paper is threefold: 1) to the best of our knowledge, our work is the first using normalizing flows to learn a prior for blind image restoration; 2) we take advantage of the bijective mapping learned by our model to express the MAP problem of image reconstruction in latent space, where gradient descent can be used to estimate the solution; 3) we propose using new loss terms during model training for regularizing the latent space, which yields a better behavior during the MAP inference.

Our paper is organized as follows. In Section 2, we recap important background regarding normalizing flows before describing our method in Section 3. Section 4 covers important related work and Section 5 discusses our experimental results. We give our conclusions in Section 6.

2 Background

Borrowing the notation from Papamakarios et al. (2019), let us consider two random variables $X$ and $U$ that are related through the invertible transformation $T : \mathbb{R}^d \to \mathbb{R}^d$, $x = T(u)$. In this case, the distributions of the two variables are related as follows:

$$p_X(x) = p_U(u)\, |\det J_T(u)|^{-1}, \qquad (3)$$

where $u = T^{-1}(x)$ and $J_T(u)$ is the Jacobian of $T$. Here, the determinant preserves total probability and can be understood as the amount of squeezing and stretching of the space induced by the transformation $T$. The objective of normalizing flows (Rezende & Mohamed, 2015) is to map a base distribution to an arbitrary distribution through a change of variables. In practice, a series $T_1, \ldots, T_K$ of such mappings is applied to transform the base distribution into a more complex, multi-modal one:

$$x \;\xleftrightarrow{\;T_K^{-1}\;}\; h_{K-1} \;\xleftrightarrow{\;T_{K-1}^{-1}\;}\; h_{K-2} \;\cdots\; h_1 \;\xleftrightarrow{\;T_1^{-1}\;}\; u, \qquad (4)$$

$$p_X(x) = p_U\big(T^{-1}(x)\big) \prod_{k=1}^{K} \left| \det \frac{d h_{k-1}}{d h_k} \right|, \qquad (5)$$

where we define $h_K \triangleq x$ and $h_0 \triangleq u$. It is clear that computing the determinant of these Jacobian matrices, as well as the function inverses, must remain easy to allow their integration as part of a neural network. This is not the case for arbitrary Jacobians, and recent successes in normalizing flows are due to the proposition of invertible transformations with easy to compute determinants.

Normalizing flows as generative model.
Recent works (Kingma & Dhariwal, 2018; Dinh et al., 2017) have shown the great potential of using normalizing flows as generative models, where an image observation $x$ is generated from a latent representation $u$:

$$x = T_\theta(u) \quad \text{with} \quad u \sim p(u). \qquad (6)$$

Here $x \in \mathcal{X}$ is a high-dimensional vector, $T_\theta$ denotes a composition of invertible transformations, and $p(u)$ is the base distribution, e.g. a normal distribution. Considering a discrete set $\mathcal{X}$ of $N$ natural images, the flow based model is trained by minimizing the following log-likelihood objective:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} -\log p_\theta\big(x^{(i)}\big). \qquad (7)$$

In the next section, we will describe our approach for leveraging flow based models for various image restoration applications.

3 Method

By training a generative flow model as described in the previous section, we learn a mapping $T_\theta$ from a latent space $\mathcal{U}$, with a known base distribution $p(u)$, to the complex image space $\mathcal{X}$. In this work, we propose to use the capacity of normalizing flows to compute the exact likelihood of images, $p_\theta(x)$, as prior in the image restoration problem:

$$x^\star = \arg\min_x \big[ -\log p(\hat{x} \mid x) - \log p_\theta(x) \big]. \qquad (8)$$

In addition to the prior, we also take advantage of the bijective mapping in normalizing flows to rewrite the optimization with respect to the latent $u$:

$$u^\star = \arg\min_u \big[ -\log p(\hat{x} \mid T_\theta(u)) - \log p_\theta(T_\theta(u)) \big]. \qquad (9)$$

With this new formulation, we are leveraging the learned mapping between the complex input space (the image space) and the base space (the latent space) that follows a simpler distribution. This new space is better adapted for such an optimization problem. In this work, we solve it through an iterative gradient descent, where each step is applied to the latents according to

$$u^{t+1} = u^t - \eta \nabla_u \mathcal{L}(\theta, u, \hat{x}). \qquad (10)$$

Here $\mathcal{L}(\theta, u, \hat{x})$ abbreviates the objective defined in Equation 9 and $\eta$ is the weighting applied to the gradient. We used the Adam optimizer (Kingma & Ba, 2015) to compute the gradient steps.
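To make the latent-space optimization of Equations 9 and 10 concrete, the following sketch runs it on a hypothetical one-dimensional affine flow, where the MAP solution is available in closed form. All parameter values and the flow itself are illustrative assumptions, not the paper's trained model or settings.

```python
import numpy as np

# Toy sketch of the latent-space MAP optimization (Eqs. 9-10), assuming a
# hypothetical 1-D affine flow T(u) = s*u + b with base distribution
# p(u) = N(0, 1), so that -log p_theta(T(u)) = u^2/2 + const.
s, b = 2.0, 5.0                    # assumed flow parameters
lam = 0.5                          # data-fidelity weight lambda
eta = 0.05                         # gradient step size (Eq. 10)
rng = np.random.default_rng(0)

x_clean = b + s * rng.standard_normal(1000)   # samples from the flow prior
x_hat = x_clean + rng.standard_normal(1000)   # degraded (noisy) observations

u = np.zeros_like(x_hat)           # start at the base-distribution mean
for _ in range(500):
    # gradient of  lam*(T(u) - x_hat)^2 + u^2/2  with respect to u
    grad = 2.0 * lam * (s * u + b - x_hat) * s + u
    u -= eta * grad
x_star = s * u + b                 # map the optimized latents back to images

# For this quadratic toy problem the MAP solution has a closed form,
# which lets us check that the gradient descent converged:
u_opt = 2.0 * lam * s * (x_hat - b) / (2.0 * lam * s**2 + 1.0)
assert np.allclose(u, u_opt, atol=1e-6)
```

The restored values shrink the noisy observations toward the prior, which is exactly the trade-off that the weight $\lambda$ controls.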
The model is generic and, once trained on target quality images, different applications can be considered by adapting the data loss term. In this work we use a generic data fidelity term between the input image $\hat{x}$ and the restored result $x = T_\theta(u)$:

$$\mathcal{L}_{\mathrm{data}}(\hat{x}, T_\theta(u)) = -\log p(\hat{x} \mid T_\theta(u)) = \lambda\, \| m \odot (\hat{x} - T_\theta(u)) \|^2, \qquad (11)$$

where $\odot$ is the Hadamard product. The mask $m$ is a binary mask that indicates pixel locations with valid color values and allows handling the inpainting scenario. The parameter $\lambda$ controls the deviation tolerance from the original degraded input $\hat{x}$. Next, we provide details on the normalizing flow architecture used, the training losses, and our coarse to fine optimization procedure.

Figure 2: Overview of the normalizing flow architecture. The input image $x$ is processed by an $L = 3$ level network, where each level consists of a squeeze operation followed by a series of $K$ steps. Each step is a succession of ActNorm, $1 \times 1$ convolution and an affine layer. The image latent representation is $(u_1, u_2, u_3)$. The number of levels and steps can be adapted to the complexity of the data.

3.1 Architecture

The proposed generative model is based on the architecture described by Kingma & Dhariwal (2018). We first present the individual building layers:

• Activation normalization.
Proposed by Kingma & Dhariwal (2018), this is an alternative to batch normalization. It performs an affine transformation on the activations using a learned scale and bias parameter per channel.

• Invertible $1 \times 1$ convolution. Kingma & Dhariwal (2018) also proposed to replace the random permutation of channels, in coupling layers between the transformations, with a learned invertible $1 \times 1$ convolution.

• Affine transformation.
This is a coupling layer introduced by Dinh et al. (2015). The input is split into two partitions, where one is the input of the conditioner, a neural network that modifies the channels of the second partition. Here, the transformation is affine.

• Factor-out layers.
The objective of factoring out parts of the base distribution (Dinh et al., 2017) is to allow coarse to fine modeling by introducing conditional distributions and dependencies on deeper levels.

Using these layers, we propose the model illustrated in Figure 2. It consists of $L$ levels, each one a succession of $K$ steps, where a step is defined as the composition of the layers ActNorm, $1 \times 1$ convolution and Affine. At the end of each intermediate level, the transformed values (latents) are split in two parts, $h_i$ and $u_i$, with the factor-out layer. The parameters $(\mu_i, \sigma_i)$ of the conditional distribution $p(u_i \mid h_i)$ are predicted by a neural network. In our case, this is a zero initialized 2D convolution as proposed in (Kingma & Dhariwal, 2018). In the experimental part and in the supplementary material, we provide more details about the architecture used for each dataset.

3.2 Training losses

When using normalizing flows to learn a continuous distribution, the input images have to be dequantized. Following common practice in generative flows, we redefine the negative log-likelihood objective (nll) of Equation 7 as

$$\mathcal{L}_{nll} = \frac{1}{N} \sum_{i=1}^{N} -\log p_\theta\big(x^{(i)} + \epsilon\big). \qquad (12)$$

Here $\epsilon$ is uniformly sampled from $[0, 1)$. This model is sufficient for simple datasets, as we show in the experimental section with the MNIST examples (see Figure 3). However, for more complex data, a regularization of the learned latent space is needed. The main objective is to structure this space in a way that is beneficial for the optimization.

Latent-Noise loss. In order to enforce some regularization of the latent space, we add uniform noise to the latents, $u_\xi = u + \xi$ where $\xi \sim \mathcal{U}(-0.5, 0.5)$. The proposed loss term

$$\mathcal{L}_{ln} = \| T_\theta(u_\xi) - x \|^2 \qquad (13)$$

penalizes parameters $\theta$ that would map $u_\xi$ back far from the initial input image $x$.
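As a toy illustration (again using a hypothetical one-dimensional affine flow rather than the trained model), the dequantized likelihood of Equation 12 and the latent-noise loss of Equation 13 can be evaluated as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
s, b = 2.0, 5.0                  # hypothetical 1-D affine flow T(u) = s*u + b

def T(u):
    return s * u + b

def T_inv(x):
    return (x - b) / s

def neg_log_p(x):
    # exact negative log-likelihood via the change of variables (Eq. 3)
    u = T_inv(x)
    return 0.5 * u**2 + 0.5 * np.log(2.0 * np.pi) + np.log(s)

x = np.floor(T(rng.standard_normal(1000)))   # "quantized" training images

# Eq. (12): add uniform dequantization noise before the likelihood
eps = rng.uniform(0.0, 1.0, size=x.shape)
L_nll = neg_log_p(x + eps).mean()

# Eq. (13): perturb the latents with xi ~ U(-0.5, 0.5) and penalize the
# distance of the decoded perturbed latents from the original image
u = T_inv(x)
xi = rng.uniform(-0.5, 0.5, size=u.shape)
L_ln = np.mean((T(u + xi) - x) ** 2)

# For this fixed affine flow T(u + xi) - x = s * xi, so E[L_ln] = s^2 / 12.
assert abs(L_ln - s**2 / 12.0) < 0.05
```

For a trained multi-level model, the same two terms would be computed with the network in place of the closed-form `T` and `neg_log_p`.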
It is interesting to note that this loss does not make any assumption regarding the degraded images, but it still results in a latent space better suited for our optimization problem.

Auto-Encoder loss.
If we consider the model illustrated in Figure 2, the image $x$ is mapped to its representation $(u_1, u_2, u_3)$. From only the latent value $u_1$, we compute $\tilde{x}$ by sampling the most likely intermediate values $\tilde{u}_l \sim p(u_l \mid h_l)$. Since we use a Gaussian distribution, this corresponds to the mean value of the predicted distribution. The proposed loss

$$\mathcal{L}_{ae} = \| \tilde{x} - x \|^2 \qquad (14)$$

forces the model to store sufficient information in the deepest level to reconstruct the image. This allows a more robust coarse-to-fine strategy during the optimization.

The final training loss for the normalizing flow is

$$\mathcal{L} = \mathcal{L}_{nll} + \beta_{ln} \mathcal{L}_{ln} + \beta_{ae} \mathcal{L}_{ae}, \qquad (15)$$

where $\beta_{ln}$ and $\beta_{ae}$ are the weightings for each loss term. We used $\beta_{ln} = 100$ and $\beta_{ae} = 1$. The ablation study in the experimental section shows the necessity of training the generative flow model with all these loss terms.

3.3 Coarse-to-fine optimization

The optimization procedure described in Equation 10 is iterative and we need to set its initial value $u_0$. In order to choose a good starting point, we leverage the introduced multi-scale architecture. Our starting point is

$$u_0 = (\hat{u}_1, \tilde{u}_2, \tilde{u}_3) \quad \text{with} \quad \hat{u}_1 \text{ given by } T_\theta^{-1}(\hat{x}). \qquad (16)$$

The values of the other components, $\tilde{u}_2$ and $\tilde{u}_3$, are sampled as the mean values of the respective predicted distributions $p(u_2 \mid h_2)$ and $p(u_3 \mid h_3)$. As our auto-encoder loss enforces the possibility to reconstruct the image from $\hat{u}_1$ only, this lowest level contains coarse image information while details are stored in the upper levels. This is advantageous for image restoration tasks, where the degradation often affects the details of an image.

Given this starting point, the optimization is done in a coarse-to-fine fashion. First, only the lowest level variables are optimized while the upper levels are respectively sampled from the predicted means.
These are then progressively included in the optimization:

$$u_1^{t+1} = u_1^t - \eta \nabla_{u_1} \mathcal{L}(\theta, u, \hat{x}), \qquad (17)$$
$$(u_1, u_2)^{t+1} = (u_1, u_2)^t - \eta \nabla_{(u_1, u_2)} \mathcal{L}(\theta, u, \hat{x}), \qquad (18)$$
$$(u_1, u_2, u_3)^{t+1} = (u_1, u_2, u_3)^t - \eta \nabla_{(u_1, u_2, u_3)} \mathcal{L}(\theta, u, \hat{x}). \qquad (19)$$

With this coarse-to-fine scheme, we are able to incrementally refine the reconstructed images by making sure that the lower level information is correct first.

4 Related Work

Despite the success of supervised deep learning approaches for dedicated image restoration problems such as super-resolution (Wang et al., 2018; Zhang et al., 2018), denoising (Zhang et al., 2017a), inpainting (Pathak et al., 2016) or a combination of them (Park & Mu Lee, 2017), one important drawback is the need for retraining whenever the specific degradation or its parameters change. Some recent works (Cornillère et al., 2019; Bell-Kligler et al., 2019) have investigated the blind setting for super-resolution. However, that concerns the parameters of the degradation only, and such solutions are not applicable to an unknown degradation.

When addressing the blind restoration problem, the common approach is to consider the Bayesian perspective where recovering the original image is expressed as solving a maximum a posteriori (MAP) problem. The objective function consists of a fidelity term and a regularization term. The fidelity term can be problem specific and easier to express than the prior that is supposed to reflect desired properties of the reconstructed image. Existing handcrafted priors are based on total variation (Rudin et al., 1992), gradient sparsity (Fergus et al., 2006) or the dark pixel prior (He et al., 2010).

Figure 3: Results produced by a single-level normalizing flow trained on the MNIST dataset. Each column corresponds to a different type of degradation (Gaussian noise, JPEG compression at quality 30, 10 and 5, uniform noise, masked regions, and combinations thereof). From top to bottom, the ground truth, the degraded image and the reconstructed image are shown.
Recently, several works have investigated the usage of CNNs as priors. For example, Rick Chang et al. (2017) and Zhang et al. (2017b) show how a deep CNN trained for image denoising can effectively be used as prior in various image restoration tasks. Additionally, Meinhardt et al. (2017) provide new insights on how the denoising strength of the neural network relates to the weight on the data fidelity term. Bigdeli et al. (2017) define a utility function that includes the smoothed natural image distribution and relate this to denoising autoencoders. In a different direction, Ulyanov et al. (2018) showed that an important part of the image statistics is already captured by the structure of a convolutional image generator itself, independent of any learning. This work was further analyzed from a Bayesian perspective (Cheng et al., 2019) and combined with a denoising autoencoder prior (Mataev et al., 2019).

The idea presented in our work stems from recent developments in normalizing flows (Dinh et al., 2015, 2017; Kingma & Dhariwal, 2018) and their promising capacity of learning a bijective mapping from a space with a prescribed distribution to the complex space of images, additionally providing exact log-likelihood tractability. Using a learned prior that only depends on properties of high quality images is an exciting direction, as this removes the need to rely on other assumptions that are either explicit, in the case of handcrafted solutions, or implicit, in the case of denoising autoencoders. This work is a first step demonstrating the potential of normalizing flows in image restoration tasks. We believe this is an exciting new direction that is furthermore expected to benefit from improvements and research that generally explores normalizing flows as generative models.
5 Experiments

In this section we explore the usage of our proposed solution for different blind image restoration tasks. We show results on two synthetic datasets, MNIST and the self-generated Sprites, and on real images. We also include comparisons with the Deep Image Prior (DIP) (Ulyanov et al., 2018). Since we do not focus on a specific degradation during training, our proposed approach can be applied to various types of restoration problems. In this work we present results on different types of image degradation: noise (uniform and normal), JPEG compression artifacts, and missing regions. The noisy images are generated by adding i.i.d. samples of noise to the pixel values, with noise distributed according to $\mathcal{U}(min, max)$ or $\mathcal{N}(0, \sigma)$. The varying degrees of JPEG artifacts are generated by using different quality levels (10 to 70) for the JPEG compression. For the inpainting task, we masked multiple square regions of the image. An overview of the used degradations is visualized in Figure 3.

MNIST results.
As a first step, we tested our flow based image prior on the well studied MNIST dataset (LeCun et al., 1998). Given the simplicity of this dataset, the model used for this experiment consists of a single level ($L = 1$) with $K = 16$ steps. We choose the base distribution $p(u)$ to be a Gaussian with unit variance and a trainable mean. Further, a ResNet (He et al., 2016) with two blocks and $C = 128$ intermediate channels was used to learn the parameters for the affine transformations. Given a degraded image $\hat{x}$, the goal is to find the most likely image $x^\star$ by solving the optimization problem of Equation 9. Given the simplicity of the data set, we use the mean of the base distribution $p(u)$ as starting point $u_0$. It can be seen in Figure 3 that this is sufficient to enhance the binary digits for any degradation. A related experiment was conducted by Dinh et al. (2015), where the degraded digits were enhanced by maximizing the probability of the image through back propagation to the pixel values. This is equivalent to only considering the prior term in Equation 9.

Figure 4: Restoration of degraded Sprites: the first row corresponds to Gaussian noise with $\sigma = 5$, the second row is inpainting and the last row combines denoising, inpainting and JPEG artifact removal. Columns (ground truth, input, $\mathcal{L}_{nll}$, $\mathcal{L}_{nll} + \mathcal{L}_{ln}$, $\mathcal{L}_{nll} + \mathcal{L}_{ae}$, all) correspond to different normalizing flow models, each one trained with the indicated loss terms. Results show the importance of using all the proposed loss terms (see text for details).

Sprites results.
To handle this larger and more complex data set, we increased the capacity of our flow based prior. We use $L = 3$ levels, with $K = 8$ steps each. In the optimization, the learning rate $\eta$ and the data weighting term $\lambda$ are kept fixed. The gradient descent is done in a coarse to fine way (see Section 3.3), each time with a fixed number of update steps per level before including the next one. When all latent levels are included, additional optimization steps are performed.

Figure 4 shows image restoration results on this data set: the first row corresponds to a denoising task, the second is image inpainting and the last combines both in addition to compression artifact removal. Note that these images were not observed during training. As the data becomes more complex, we can see the importance of the regularization losses proposed in Section 3.2. Using the negative log-likelihood loss ($\mathcal{L}_{nll}$) alone is clearly not sufficient, and a prior trained only with this term is not suited for the latent space optimization. The most important improvement comes from using the latent-noise loss ($\mathcal{L}_{ln}$). This regularization enforces neighboring elements in latent space to be mapped back to similar images. This is highly beneficial to the gradient descent procedure in latent space, and a prior trained with this loss already leads to good restoration results. Finally, a coarse-to-fine approach is able to handle most cases, in particular high intensity noise levels. This requires training the normalizing flow model with the additional auto-encoder loss ($\mathcal{L}_{ae}$).

Blind image restoration.
We show that the proposed model is applicable to the restoration of generic images. In order to do so, the model must generalize to patches of high resolution, good quality images. For this we use the DIV2K dataset (Agustsson & Timofte, 2017), which serves as training and test set for most image super-resolution works, and we keep the same train/test split. Training is done on random image patches. The normalizing flow architecture used here is very similar to the one described for the Sprites (see supplementary material for details). The restoration of full images of arbitrary size can be done by reconstructing each patch individually. A margin is used to avoid boundary artifacts between patches. Restoration results are presented in Figure 5 for different image degradations.

Figure 5: Results on the DIV2K dataset. The proposed prior is used to restore arbitrary size images. Degradations include: (a) JPEG compression artifacts; (b) noise; and (c) a combination of masked regions, noise and compression artifacts.

Figure 6: Compared to DIP, restoring large missing regions is not possible (green), but on this example our approach produced better denoising results (red).
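The patch-wise restoration with a margin described above can be sketched as follows. Here `restore_patch` is a hypothetical stand-in for the actual per-patch latent-space optimization, and the patch size and margin are illustrative values, not the paper's settings:

```python
import numpy as np

def restore_patch(patch):
    # Hypothetical stand-in: in the paper's setting this would run the
    # latent-space MAP optimization on the patch.
    return patch

def _positions(size, patch, step):
    # top-left coordinates so that patches tile the axis and reach the border
    pos = list(range(0, size - patch + 1, step))
    if pos[-1] != size - patch:
        pos.append(size - patch)
    return pos

def restore_image(img, patch=64, margin=8):
    """Patch-wise restoration of an arbitrary-size grayscale image
    (at least `patch` pixels per side).  Patches overlap by twice the
    margin and overlapping outputs are averaged, which suppresses
    boundary artifacts between patches."""
    h, w = img.shape
    acc = np.zeros((h, w))
    cnt = np.zeros((h, w))
    step = patch - 2 * margin
    for y in _positions(h, patch, step):
        for x in _positions(w, patch, step):
            acc[y:y + patch, x:x + patch] += restore_patch(
                img[y:y + patch, x:x + patch])
            cnt[y:y + patch, x:x + patch] += 1
    return acc / cnt
```

Averaging the overlap bands is one simple way to realize the margin; cropping each patch to its interior before stitching would be an equally valid variant.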
Figure 7: Quantitative evaluation on DIV2K using PSNR (see text for details).

    Type of degradation      DIP    Ours
    JPEG artifacts            ·      ·
    Noise                     ·      ·
    Multiple degradations     ·      ·

For each example in Figure 5, we show the full resolution result, then focus on a part of the image, illustrating the change.
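For reference, the PSNR metric used in this evaluation can be computed as below (assuming pixel values in $[0, 1]$; the exact implementation is not specified in the text):

```python
import numpy as np

def psnr(reference, restored, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((np.asarray(reference) - np.asarray(restored)) ** 2)
    return 10.0 * np.log10(max_val**2 / mse)
```

For example, a restored image whose pixels are uniformly off by 0.1 has an MSE of 0.01 and thus a PSNR of 20 dB.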
Comparison with Deep Image Prior (DIP).
We first compare the two methods on the images presented in the original DIP paper (Ulyanov et al., 2018), using the same model trained on the DIV2K dataset. We show competitive restoration results (Figure 1), producing even visually more pleasing reconstructions than DIP in some regions (such as the text and the mouth). The main limitation in our case is the patch size used during training. Because of this, it is not possible to inpaint large masked regions such as in the library image (Figure 6). Interestingly, however, in this case background regions are better denoised. We also conduct a quantitative evaluation, with results presented in Figure 7. Using the test set from DIV2K, we try to restore different degradations: noise, JPEG artifacts, and a combination of artifact removal, denoising and inpainting. For this comparison it is unclear how to best set the number of iterations for DIP. To handle this, we started from the observation that our method converges to the result in approximately one hour of computation. Using the DIP online implementation, this corresponds to a fixed number of optimization steps on the denoising task, which we used as the threshold for all images and degradations of the test set. The evaluation using PSNR as error metric (Figure 7) demonstrates that our approach is able to achieve competitive results and even outperforms DIP on some of the restoration tasks.

6 Conclusion

In this paper, we explored using normalizing flows for capturing the distribution of target high quality content to serve as a prior in blind image restoration. To the best of our knowledge, this is the first time such a direction is explored. One advantage of this formulation is the learned bijective mapping from image to latent space that we use to express the MAP problem of image reconstruction in latent space. We also show the importance of using regularizing losses during training.
Finally, we present experimental results illustrating the capacity of the proposed solution to handle different degradations on data sets of varying complexity. We believe this is an exciting new direction, as there is still a lot of potential for improvement.

References
America In Color. Accessed: 2018-03-12.

Agustsson, E. and Timofte, R. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In CVPR Workshops, pp. 1122–1131. IEEE Computer Society, 2017. doi: 10.1109/CVPRW.2017.150.

Bell-Kligler, S., Shocher, A., and Irani, M. Blind super-resolution kernel estimation using an internal-GAN. In Advances in Neural Information Processing Systems, pp. 284–293, 2019.

Bigdeli, S. A., Zwicker, M., Favaro, P., and Jin, M. Deep mean-shift priors for image restoration. In Advances in Neural Information Processing Systems, pp. 763–772, 2017.

Cheng, Z., Gadelha, M., Maji, S., and Sheldon, D. A Bayesian perspective on the deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5443–5451, 2019.

Cornillère, V., Djelouah, A., Yifan, W., Sorkine-Hornung, O., and Schroers, C. Blind image super-resolution with spatially variant degradations. ACM Transactions on Graphics (SIGGRAPH Asia Conference Proceedings), 2019.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. In ICLR (Workshop Track), 2015. URL http://arxiv.org/abs/1410.8516.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. In ICLR, 2017. URL https://openreview.net/forum?id=HkpbnH9lx.

Fergus, R., Singh, B., Hertzmann, A., Roweis, S. T., and Freeman, W. T. Removing camera shake from a single photograph. In ACM SIGGRAPH 2006 Papers, pp. 787–794, 2006.

He, K., Sun, J., and Tang, X. Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2341–2353, 2010.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015. URL http://arxiv.org/abs/1412.6980.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 10236–10245, 2018. URL http://papers.nips.cc/paper/8224-glow-generative-flow-with-invertible-1x1-convolutions.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Liu, C., Rubinstein, M., Krainin, M., and Freeman, B. PhotoScan: Taking glare-free pictures of pictures. https://ai.googleblog.com/2017/04/photoscan-taking-glare-free-pictures-of.html. Accessed: 2020-05-25.

Mataev, G., Milanfar, P., and Elad, M. DeepRED: Deep image prior powered by RED. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.

Meinhardt, T., Moller, M., Hazirbas, C., and Cremers, D. Learning proximal operators: Using denoising networks for regularizing inverse imaging problems. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1781–1790, 2017.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.

Park, H. and Mu Lee, K. Joint estimation of camera pose, depth, deblurring, and super-resolution from a blurred image sequence. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4613–4621, 2017.

Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and Efros, A. A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544, 2016.

Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 1530–1538, 2015. URL http://proceedings.mlr.press/v37/rezende15.html.

Rick Chang, J., Li, C.-L., Poczos, B., Vijaya Kumar, B., and Sankaranarayanan, A. C. One network to solve them all: solving linear inverse problems using deep projection models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5888–5897, 2017.

Rudin, L. I., Osher, S., and Fatemi, E. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.

Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454, 2018.

Wang, Y., Perazzi, F., McWilliams, B., Sorkine-Hornung, A., Sorkine-Hornung, O., and Schroers, C. A fully progressive approach to single-image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 864–873, 2018.

Zhang, K., Zuo, W., Chen, Y., Meng, D., and Zhang, L. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017a.

Zhang, K., Zuo, W., Gu, S., and Zhang, L. Learning deep CNN denoiser prior for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3929–3938, 2017b.

Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., and Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 286–301, 2018.
Supplementary Material
A.1 Additional Comparison with Deep Image Prior
We provide an additional comparison with Deep Image Prior for the task of compression artifact removal.
Figure 8: JPEG artifact removal (columns, left to right: degraded input, Deep Image Prior, ours). We can observe that our results are sharper around the eyes.
A.2 MNIST
For MNIST the network architecture is kept simple, consisting of only a single level with K = 16 steps. Because squeezing layers require the input's height and width to be divisible by two, the input images are zero-padded to size × . As coupling transform we use the one depicted in Figure 9, with two blocks (N = 2) and C_inter = 128 intermediate channels. Finally, we choose a Gaussian with unit variance as our base distribution; the Gaussian's mean is a trainable parameter. All other parameters are listed in Table 1.

Figure 9: Details of the affine coupling transform. The 3×3 Conv2d and 1×1 Conv2d blocks refer to standard 2D convolutions using kernel sizes of 3×3 and 1×1, respectively. The "+" at the end of the block is an element-wise addition.

A.3 Sprites
Each image in the Sprites dataset consists of a figure performing some pose in front of a random background. Figures are centered in the image, with varying colors for hair and clothing. Each image is of size 64×64. The dataset will be made available upon acceptance.
Architecture.
For this experiment, the number of levels is set to L = 3 and each level has K = 8 steps. The distributions p(u_i | h_i) depend on a function that computes the mean µ(h_i) and variance σ(h_i); we call this function the context encoder. A single 2D convolution with kernel size 3×3 and twice as many output channels as input channels is used as the context encoder. The context encoder's output is then split in half along the channel dimension: one half is used as µ(h_i), the other as σ(h_i). The convolution's weight and bias are initialized to zero for stability reasons. The other parameters for the Sprites dataset are listed in Table 2.

Table 1: Details of architecture and training for the MNIST experiments.

Parameter                   Value
N                           2
C_inter                     128
Base distribution p(u)      N(µ, 1)
optimizer                   Adam
learning rate
batch size                  50
max gradient value
max gradient L-norm

Table 2: Sprites training specification.

Parameter                   Value
Levels (L)                  3
Steps per level (K)         8
Affine coupling C_inter
p(u_i | h_i)                N(µ(h_i), Diag(σ(h_i)))
Base distribution p(u)      N(µ, Diag(σ))
Context encoder             zero-initialized 2D convolution, kernel size 3×3
optimizer                   Adam
learning rate
batch size                  20
max gradient value
max gradient L-norm
latent noise magnitude      ± .
latent noise loss (β_ln)    100
autoencoder loss (β_ae)     1

A.4 DIV2K
The number of levels in the architecture is set to L = 8 with K = 4 steps per level. The number of intermediate channels in the coupling transforms is given in Table 3. The context encoder is deepened from a single convolutional layer to N = 5 convolutional layers, as illustrated in Figure 10, and a dropout layer is added at the beginning. All architecture parameters are listed in Table 3.
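To make the context-encoder mechanism concrete, the following is a minimal NumPy sketch, not our actual implementation: a single linear map stands in for the convolutional layers, its output has twice as many channels as its input, and the output is split in half along the channel dimension into µ(h_i) and σ(h_i). The zero initialization of the weight and bias mirrors the stability trick used in our models; all shapes are illustrative.

```python
import numpy as np

def context_encoder(h, weight, bias):
    # Stand-in for the context-encoder convolution(s): a map whose output
    # has twice as many channels as its input, split into mean and scale.
    out = h @ weight + bias
    mu, sigma = np.split(out, 2, axis=-1)  # split along the channel axis
    return mu, sigma

c = 4  # number of channels (illustrative)
h = np.random.default_rng(0).normal(size=(10, c))
# Zero initialization of weight and bias, as used for stability.
w, b = np.zeros((c, 2 * c)), np.zeros(2 * c)
mu, sigma = context_encoder(h, w, b)
assert mu.shape == sigma.shape == (10, c)
assert np.allclose(mu, 0.0)  # a zero-initialized encoder starts at mu = 0
```

In a full implementation a positivity transform (for example an exponential) would typically be applied to the σ half; we leave that detail out of the sketch.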
Figure 10: Architecture of the context encoder used for the DIV2K example. A dropout layer with p = 0. is used as the first layer to prevent overfitting. The last convolution's weight and bias are initialized to zero for stability reasons.

In addition to this, we found that at test time the optimization was faster when the model had been trained with additional noise on the images. The Image-Noise loss L_in works analogously to the Latent-Noise loss L_ln (see Equation 13 in the main paper), except that the noise is added to the image x and the distortion is measured on the encoding u = T_θ⁻¹(x):

L_in = || T_θ⁻¹(x) − T_θ⁻¹(x + η) ||    (20)

Table 3: DIV2K training specification.

Parameter                   Value
N
C_inter
p(u_i | h_i)                N(µ(h_i), Diag(σ(h_i)))
Base distribution p(u)      N(µ, Diag(σ))
Context encoder             N = 5 convolutional layers
optimizer                   Adam
learning rate
batch size                  15
max gradient value
max gradient L-norm
latent noise magnitude      ± .
latent noise loss (β_ln)    100
autoencoder loss (β_ae)     1
image noise loss (β_in)     100
image noise magnitude       ±
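As an illustration of the Image-Noise loss in Equation 20, the sketch below uses a toy invertible map `T_inv` standing in for the trained flow's inverse T_θ⁻¹, and the noise magnitude is a hypothetical value, not the one used in training:

```python
import numpy as np

def T_inv(x):
    # Toy stand-in for the flow's inverse mapping T_theta^{-1};
    # in the actual model this is the full normalizing flow.
    return np.tanh(x) * 3.0

def image_noise_loss(x, noise_magnitude=0.1, seed=0):
    # Equation 20: perturb the image with noise eta and measure the
    # distortion between the encodings of x and x + eta.
    eta = np.random.default_rng(seed).uniform(
        -noise_magnitude, noise_magnitude, size=x.shape)
    return np.linalg.norm(T_inv(x) - T_inv(x + eta))

x = np.random.default_rng(1).normal(size=(8, 8))
loss = image_noise_loss(x)
assert loss > 0.0  # any non-trivial perturbation moves the encoding
```

Minimizing this distance encourages the learned encoding to vary smoothly under small image perturbations.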
Patch-wise Reconstruction.
A full image of arbitrary size can be reconstructed by restoring each patch individually. To avoid boundary artifacts between patches, a margin is used as illustrated in Figure 11: the margin causes overlap between adjacent patches, yielding more consistent results in boundary regions.

Figure 11: Illustration of the tiles used for patch-wise reconstruction. H and W refer to the patches' height and width, respectively, and M refers to the margin. Neighboring patches overlap in a region of width 2·M; the same pattern extends analogously in the vertical direction. In our work we use H = W = 64 and M = 4.
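The tile layout described above can be sketched as follows; this is a simple raster-scan reading of the construction (a stride of patch size minus twice the margin is our interpretation of the overlap), not our exact implementation:

```python
import numpy as np

def patch_coords(height, width, patch=64, margin=4):
    """Return (top, left) coordinates of overlapping tiles covering an image.

    The stride is patch - 2*margin, so neighboring tiles overlap in a band
    of width 2*margin, which hides boundary artifacts after stitching.
    """
    stride = patch - 2 * margin
    tops = list(range(0, max(height - patch, 0) + 1, stride))
    lefts = list(range(0, max(width - patch, 0) + 1, stride))
    # Make sure the last row/column of tiles reaches the image border.
    if tops[-1] + patch < height:
        tops.append(height - patch)
    if lefts[-1] + patch < width:
        lefts.append(width - patch)
    return [(t, l) for t in tops for l in lefts]

coords = patch_coords(128, 128, patch=64, margin=4)
# Every pixel of a 128x128 image is covered by at least one 64x64 tile.
covered = np.zeros((128, 128), dtype=bool)
for t, l in coords:
    covered[t:t + 64, l:l + 64] = True
assert covered.all()
```

After restoring each tile, only the inner (non-margin) region needs to be written back, which keeps the stitched result consistent across tile boundaries.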