Reblur2Deblur: Deblurring Videos via Self-Supervised Learning
Huaijin Chen, Jinwei Gu, Orazio Gallo, Ming-Yu Liu, Ashok Veeraraghavan, Jan Kautz
NVIDIA Research      Rice University
{jinweig, ogallo, mingyul, jkautz}@nvidia.com      {huaijin.chen, vashok}@rice.edu

[Figure 1 panels: (a) Input, (b) DVD [22], (c) DeblurGAN [13], (d) Proposed, (e) Ground Truth. PSNR values as annotated in the figure: 25.07, 21.76, 25.56, 22.24, 26.91, 23.82.]
Figure 1. We propose a novel method for video motion deblurring via self-supervised learning. Compared to prior methods such as DVD [22] and DeblurGAN [13], we enforce a physics-based blur formation model during Deep Neural Network (DNN) training, which effectively reduces image artifacts and improves the generalization ability of DNN-based video deblurring.
Abstract
Motion blur is a fundamental problem in computer vision as it impacts image quality and hinders inference. Traditional deblurring algorithms leverage the physics of the image formation model and use hand-crafted priors: they usually produce results that better reflect the underlying scene, but present artifacts. Recent learning-based methods implicitly extract the distribution of natural images directly from the data and use it to synthesize plausible images. Their results are impressive, but they are not always faithful to the content of the latent image. We present an approach that bridges the two. Our method fine-tunes existing deblurring neural networks in a self-supervised fashion by enforcing that the output, when blurred based on the optical flow between subsequent frames, matches the input blurry image. We show that our method significantly improves the performance of existing methods on several datasets, both visually and in terms of image quality metrics.
1. Introduction
Cameras integrate the scene radiance over a finite exposure time, which causes motion blur when the scene, the camera itself, or both move. Motion blur, in turn, affects the frequency content of the resulting image, thus hindering virtually any computer vision task. As a consequence, a number of deblurring algorithms have been proposed, which seek to estimate the latent, sharp image from one or more blurry observations of a scene. At a high level, we can identify two classes of deblurring algorithms.
Physics-based methods observe that blur can be modeled by the convolution of the latent image with a spatially varying filter, usually referred to as the blur kernel. Deblurring, then, is reduced to estimating the blur kernel and deconvolving it from the observed, blurry image. Both the kernel estimation and the deconvolution process, however, are severely ill-posed and introduce artifacts, such as ringing [21].

A more recent trend is to use deep-learning methods to synthesize, from blurry observations, an image that best reflects the priors learned from training data [22, 13]. While neural networks have shown superior results in several fields of computer vision, they may require more training data than is available, and become brittle when the training examples fail to capture the full distribution of the real-world data. Finally, and perhaps most importantly, the synthesized images may be aesthetically pleasing, e.g., sharper than the observations, but may not match the appearance of the latent image. Figure 1(c) shows one such example: the method by Kupyn et al. [13] reconstructs a sharp number "1" in place of the true number "3". This significantly reduces the benefit of using deep-learning-based deblurring as a pre-processing step for computer vision algorithms.

Our method bridges the two approaches. Like previous methods, we use deep learning to synthesize sharp frames from a blurry video. However, we also explicitly enforce the solution to lie on the manifold of the sharp images that explain the observed blurry frames. To achieve this, we propose the first self-supervised, end-to-end deblurring method. Our differentiable pipeline can be used to fine-tune any existing pre-trained network.

From multiple consecutive blurry video frames, we produce the corresponding sharp frames and the optical flow between them. We use this information to compute a per-pixel blur kernel with which we reblur the sharp frames back onto the input images. By minimizing the distance between the synthesized blurry images and the input images, our approach allows us to fine-tune the parameters of our system for specific inputs, without the need for the corresponding ground truth data. We show that our method outperforms the state-of-the-art methods based on both image quality metrics and visual inspection. Figure 1 shows two examples on which existing methods either fail or synthesize sharp estimates that do not reproduce the actual content of the latent sharp image. On the contrary, our method successfully recovers the ground truth image while deblurring the input image.
2. Related Work
Traditionally, deblurring algorithms model the image formation process as the convolution of a blur kernel with a sharp image, which is then estimated by means of deconvolution [15]. The blur kernel, however, is anything but straightforward to estimate. A common assumption is that the kernel is space-invariant [6, 21, 15], which is only valid when the scene is static (camera shake is the only source of motion blur) and planar. Even under this simplifying assumption, the problem is severely ill-posed, and thus requires regularization. Indeed, several priors have been successfully employed for the latent image, e.g., total variation [1], heavy-tailed gradient distributions [19], Gaussian distributions [14], and smoothness [21], and for the shape of the kernel, e.g., sparsity of the kernel [6] or parametric kernel modeling [27]. Alternatively, the kernel can be estimated with a deep-learning approach [20, 2]. A few methods relax this assumption by estimating non-uniform blur kernels [25, 9, 24]. Several approaches move even further in this direction by leveraging optical flow to estimate a per-pixel blur [10, 7].

The strength of these methods lies in their ability to explicitly model the physics of the image formation process. On the other hand, hand-crafted priors are not always realistic, and may result in visual artifacts in the estimated image.

An alternative way to tackle this problem is to learn directly from the data, which is possible thanks to the success of learning-based image synthesis approaches. Rather than estimating the blur kernel and explicitly deconvolving it from the observed image, these methods propose an end-to-end learning strategy to estimate the latent image. To compensate for the lack of priors, some methods rely on video data [22, 26, 3]: because camera shake is an idiosyncratic motion, each frame can be thought of as an independent observation of the scene.

In the absence of temporal information, a way to learn the priors is needed: Generative Adversarial Networks (GANs) are a powerful method for this task. Indeed, a number of approaches successfully employ GANs to perform single-image deblurring [13, 18, 17]. Standard GAN-based methods have shown an extraordinary ability to learn natural distributions. However, the images they synthesize, while realistic, may not reflect the input data accurately, as shown in Figure 1(c).

We propose to leverage the benefits of both approaches: after a kernel-free estimation of the latent image from multiple frames, we estimate the optical flow, and the per-pixel blur it induces. This allows us to reblur the estimated sharp images back onto the observed blurry images, and to constrain the solution to the manifold of images that capture the actual input content.
3. Method
Our work stems from the observation that GAN-based methods produce excellent results thanks to their ability to capture the natural distribution of sharp images. However, we also note that the synthesized images, while generally sharp, may deviate from the content of the latent image underlying the blurry observation; see Figure 1(c).

We propose to improve the performance of existing end-to-end deblurring methods by encouraging solutions that more closely capture the content of the input image, in addition to being sharp. Our key idea is to introduce a physics-based blur formation model into the DNN training, with which we reblur the estimated sharp images and minimize the difference with the input blurry images.
Given a deblur network $d(\cdot\,;\Theta_d)$, either pre-trained or trained from scratch, we estimate the latent sharp frame at time $t$ from the blurry observation $I_B^{(t)}$:

$$\hat{I}_S^{(t)} = d\big(I_B^{(t)}; \Theta_d\big). \quad (1)$$

The weights $\Theta_d$ of the deblurring network can be learned by minimizing a loss $\mathcal{L}_S$ over a dataset $S = \{I_B; I_S\}$ with supervision:

$$\mathcal{L}_S(\Theta_d) = \sum_{S} h\big(\hat{I}_S, I_S\big), \quad (2)$$

where $h(\cdot,\cdot)$ measures the distance between the estimated sharp images and the ground-truth sharp images. The supervised loss $\mathcal{L}_S(\Theta_d)$ can be computed with different choices of the distance function $h(\cdot,\cdot)$ [28], with multiple input frames [22], or at multiple scales [16]. Recent work [18, 13, 16] introduced an additional GAN loss $\mathcal{L}_G(\Theta_d)$, which achieved better performance by implicitly learning the distribution of sharp images. Relying solely on supervised learning, however, these methods still often produce image artifacts, especially when they are applied to images whose distribution differs from that of the training images.

[Figure 2 diagram: Blurry Video → Deblur Network → Estimated Sharp Video → Optical Flow Network → Estimated Optical Flow → Blur Kernel Estimation → Pixel-wise Blur Kernels → Reblurring → Re-blurred Image → Self-Supervised Loss.]

Figure 2. System Overview. Our proposed deblurring framework takes three consecutive blurry images as inputs. We first deblur each input image through the deblur network. After that, we compute the optical flow between the three recovered sharp images. We then estimate the per-pixel blur kernel and reconstruct the blurry input; this "reblurring" step offers an additional training signal for self-supervised learning to improve the deblur network and remove image artifacts.

Assume now that we are given a motion-blurred video, rather than a single frame. By exploiting the motion information from videos, we incorporate a physics-based blur formation model into DNN learning to minimize image artifacts. Specifically, suppose we deblur three consecutive frames independently, obtaining $\{\hat{I}_S^{(t-1)}, \hat{I}_S^{(t)}, \hat{I}_S^{(t+1)}\}$. Since the frames are adjacent in time, we can compute the optical flow $F$ between them. For this task we can use a second pre-trained neural network $f$:

$$F_{t-1\to t} = f\big(\hat{I}_S^{(t-1)}, \hat{I}_S^{(t)}; \Theta_f\big) \quad (3)$$

$$F_{t+1\to t} = f\big(\hat{I}_S^{(t+1)}, \hat{I}_S^{(t)}; \Theta_f\big). \quad (4)$$

Recall that our method aims at removing motion blur: the optical flow $F = (u, v)$, which expresses the motion at each pixel $p$, carries the necessary information to estimate a per-pixel blur kernel

$$K(p) = k\big(F_{t-1\to t}(p), F_{t+1\to t}(p)\big), \quad (5)$$

which serves an important function: it allows us to close the loop with the original observation. If the image $\hat{I}_S^{(t)}$ is estimated correctly, in fact, the corresponding reblurred image,

$$\hat{I}_B^{(t)} = b\big(\hat{I}_S^{(t)}; K\big), \quad (6)$$

should be close to the input blurry image $I_B^{(t)}$, where $b(\cdot\,; K)$ is the physics-based blur formation model. This pipeline allows us to fine-tune the deblur network $\Theta_d$ over new blurry videos $U = \{I_B^{(t)}\}$ by minimizing

$$\mathcal{L}_U(\Theta_d) = \sum_{U} h\big(\hat{I}_B^{(t)}, I_B^{(t)}\big), \quad (7)$$

where $\mathcal{L}_U(\Theta_d)$ is the self-supervised loss over the unlabeled videos $U$. Equation (7) is precisely the mechanism by which we enforce that the estimated image $\hat{I}_S^{(t)}$ is consistent with the observed image $I_B^{(t)}$, in addition to following the distribution of natural, sharp images.

We implement all three modules, i.e., the deblur network $d(\cdot\,;\Theta_d)$, the optical flow network $f(\cdot\,;\Theta_f)$, and the physics-based blur formation model $b(\cdot\,; K)$, to be differentiable, which allows end-to-end training. Figure 2 shows the complete system pipeline. In the following sections we describe the implementation details of each module.

Our system can flexibly choose the deblur network $d(\cdot\,;\Theta_d)$ and the optical flow network $f(\cdot\,;\Theta_f)$. In this paper, for the deblur network, we evaluated two network architectures: the first is DVD [22], an encoder-decoder architecture with skip connections; the second is the generator part of DeblurGAN [13], which builds on residual blocks [8]. For the optical flow network, one can start from existing pre-trained flow estimators such as FlowNet [5], FlowNet 2.0 [11], or PWC-Net [23]; the optical flow network $\Theta_f$ can also be fine-tuned simultaneously with $\Theta_d$. For the distance function $h(\cdot,\cdot)$ between two images, we used the MSE distance in this paper.
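Before detailing the blur formation model, a minimal PyTorch sketch may help make the data flow of Equations (1) and (3)-(7) concrete. The callables `deblur_net`, `flow_net`, and `reblur` stand in for $d$, $f$, and $b$; their names and signatures are our assumptions, not the paper's released code.

```python
import torch.nn.functional as F_nn

def self_supervised_step(deblur_net, flow_net, reblur, I_prev, I_t, I_next):
    """One self-supervised step: deblur, estimate flow, reblur, compare.

    I_prev, I_t, I_next: consecutive blurry frames, each of shape (B, 3, H, W).
    deblur_net, flow_net: pre-trained networks d(.; Theta_d) and f(.; Theta_f).
    reblur: differentiable blur formation model b(.; K).
    """
    # Eq. (1): deblur the three frames independently.
    S_prev = deblur_net(I_prev)
    S_t = deblur_net(I_t)
    S_next = deblur_net(I_next)

    # Eqs. (3)-(4): optical flow from the neighboring sharp estimates to frame t.
    flow_prev_to_t = flow_net(S_prev, S_t)
    flow_next_to_t = flow_net(S_next, S_t)

    # Eqs. (5)-(6): per-pixel blur kernels from the two flows, then reblurring.
    I_t_reblurred = reblur(S_t, flow_prev_to_t, flow_next_to_t)

    # Eq. (7): self-supervised loss, here with the MSE distance h(., .).
    return F_nn.mse_loss(I_t_reblurred, I_t)
```

Since every stage is differentiable, calling `.backward()` on the returned loss propagates gradients to $\Theta_d$ and, optionally, to $\Theta_f$.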
The physics-based blur formation model $b(\hat{I}_S; K)$ performs a per-pixel convolution with a spatially varying blur kernel $K(p)$ that is derived from the computed optical flows $F_{t-1\to t}(p) = [u_{t-1\to t}(p), v_{t-1\to t}(p)]$ and $F_{t+1\to t}(p) = [u_{t+1\to t}(p), v_{t+1\to t}(p)]$. We assumed the same piecewise-linear motion blur kernel as proposed in [10], which consists of two line segments:

$$K(p)[x, y] = \begin{cases} \dfrac{\delta\big(x\,v_{t+1\to t}(p) - y\,u_{t+1\to t}(p)\big)}{2\tau\,\lVert F_{t+1\to t}(p)\rVert} & \text{if } (x, y) \in R_1, \\[4pt] \dfrac{\delta\big(x\,v_{t-1\to t}(p) - y\,u_{t-1\to t}(p)\big)}{2\tau\,\lVert F_{t-1\to t}(p)\rVert} & \text{if } (x, y) \in R_2, \\[4pt] 0 & \text{otherwise,} \end{cases} \quad (8)$$

where $R_1: x \in [0, \tau u_{t+1\to t}(p)],\; y \in [0, \tau v_{t+1\to t}(p)]$ and $R_2: x \in [0, \tau u_{t-1\to t}(p)],\; y \in [0, \tau v_{t-1\to t}(p)]$, and $\tau$ is the exposure cycle as defined in [10], which is set to $\tau = 1$ in this paper.
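As an illustration of Equation (8), the sketch below rasterizes the two-segment kernel for a single pixel on a discrete grid, approximating the continuous delta function by nearest-pixel line sampling. This discretization is our assumption for clarity; the paper instead handles the non-differentiable delta with the lookup table described next.

```python
import numpy as np

def rasterize_kernel(flow_prev, flow_next, radius=16, tau=1.0, samples=64):
    """Discretize the two-segment kernel of Eq. (8) for a single pixel.

    flow_prev: (u, v) components of F_{t-1 -> t}(p).
    flow_next: (u, v) components of F_{t+1 -> t}(p).
    Returns a (2*radius+1, 2*radius+1) kernel normalized to sum to 1.
    """
    size = 2 * radius + 1
    kernel = np.zeros((size, size), dtype=np.float64)
    for u, v in (flow_next, flow_prev):
        # Sample the line segment from (0, 0) to (tau*u, tau*v); the uniform
        # sampling plays the role of the delta-function line density.
        for s in np.linspace(0.0, 1.0, samples):
            x = int(np.rint(s * tau * u))
            y = int(np.rint(s * tau * v))
            if abs(x) <= radius and abs(y) <= radius:
                kernel[y + radius, x + radius] += 1.0
    kernel /= kernel.sum()  # the segments always contain the origin, so sum > 0
    return kernel
```

A table of such kernels, one per quantized flow value, is exactly the kind of lookup table used in Equation (9) below.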
[Figure 3 diagram: Optical Flow → Blur Kernel Look-up Table → Bilinear Interpolation → Pixel-wise Blur Kernel.]

Figure 3. Optical Flow to Blur Kernel. We use a pre-computed lookup table and bilinear interpolation to convert the optical flow $(u(p), v(p))$ into the per-pixel blur kernel $K(p)[x, y]$. This operation is differentiable and can therefore be instantiated as DNN layers.

The per-pixel blur kernel model defined in Equation (8) cannot be used directly in DNN training, because the delta function $\delta(\cdot)$ is non-differentiable. To solve this issue, we use a precomputed lookup table that maps a set of optical flow values $(u_i, v_i)$ to blur kernels $k_i[x, y]$, where $i = 1, \dots, N$. For a given optical flow $(u, v)$ at pixel $p$, we use bilinear interpolation to compute the blur kernel $K(p)[x, y]$ from the lookup table:

$$K(p)[x, y] = \sum_{i=1}^{N} \omega_i(u, v)\, k_i[x, y]. \quad (9)$$

Since the bilinear interpolation is differentiable with respect to the weights $\omega_i(u, v)$, the gradient can be back-propagated to the optical flow network $f(\cdot\,;\Theta_f)$ and, subsequently, to the deblur network $d(\cdot\,;\Theta_d)$. In this paper, we set $N = 33 \times 33$ when computing the lookup table, thus limiting the range of the optical flow used to compute the motion blur kernels to $\pm 16$ pixels in both directions. Figure 3 shows a diagram of this procedure, and a code sketch is given at the end of this section.

For the fine-tuning step, we found that minimizing the self-supervised loss $\mathcal{L}_U$ alone leads to degenerate solutions. Therefore, we use the hybrid loss

$$\mathcal{L}(\Theta_d) = \mathcal{L}_S(\Theta_d) + \alpha\, \mathcal{L}_U(\Theta_d), \quad (10)$$

which also includes the supervised loss $\mathcal{L}_S$ from the original supervised datasets $S$. Each mini-batch is sampled partially from the original supervised datasets $S$ and partially from the new unlabeled videos $U$. The weight coefficient $\alpha$ balances the contributions of the self-supervised and supervised losses, and was kept fixed across all our experiments.

We implemented our algorithm in PyTorch. For both the deblur network $d(\cdot\,;\Theta_d)$ and the optical flow network $f(\cdot\,;\Theta_f)$, we started from pre-trained models, which we refer to as baseline networks. We performed 200 self-supervised learning iterations on the baseline networks. We used the ADAM solver [12] with a learning rate of 0.0001 for all our experiments, and a learning rate decay of 0.5 applied at 30, 50, and 100 epochs. We set the training mini-batch size between 8 and 20, depending on the memory footprint of the DNN. Finally, we used NVIDIA TitanX GPUs for training and testing.

Figure 4 shows an example of the proposed self-supervised learning. By fine-tuning with the physics-based blur formation model, we improve both the deblur network $\Theta_d$ and the optical flow network $\Theta_f$. The blur in this example is caused only by camera motion, and thus the true optical flow should be smooth; the fine-tuning removes the artifacts due to the scene's texture from the original optical flow. The proposed DeblurGAN-Reblur also outperforms the baseline DeblurGAN.
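To illustrate how Equation (9) stays differentiable with respect to the flow, the sketch below performs the bilinear lookup in PyTorch. It assumes the table `lut` was precomputed offline (e.g., with a rasterizer like the one above) on the 33 × 33 grid of integer flows in [−16, 16]; the names, shapes, and indexing convention are our assumptions.

```python
import torch

def kernels_from_flow(flow, lut, radius=16):
    """Differentiable Eq. (9): bilinear interpolation into a kernel lookup table.

    flow: (B, 2, H, W) per-pixel optical flow (u, v), in pixels.
    lut:  (2*radius+1, 2*radius+1, k, k) table of blur kernels, one per
          integer flow (u_i, v_i) in [-radius, radius]^2, indexed [u, v].
    Returns per-pixel kernels of shape (B, H, W, k, k).
    """
    # Continuous table coordinates in [0, 2*radius].
    u = flow[:, 0].clamp(-radius, radius - 1e-4) + radius
    v = flow[:, 1].clamp(-radius, radius - 1e-4) + radius
    u0, v0 = u.floor().long(), v.floor().long()
    # Fractional offsets: these carry the gradient with respect to the flow.
    du = (u - u0.float())[..., None, None]
    dv = (v - v0.float())[..., None, None]

    # Four neighboring table entries, blended with bilinear weights w_i(u, v).
    k00 = lut[u0, v0]
    k10 = lut[u0 + 1, v0]
    k01 = lut[u0, v0 + 1]
    k11 = lut[u0 + 1, v0 + 1]
    return ((1 - du) * (1 - dv) * k00 + du * (1 - dv) * k10
            + (1 - du) * dv * k01 + du * dv * k11)
```

The floor operation detaches the integer indices, but the fractional weights keep the lookup differentiable with respect to the flow, which is what lets the self-supervised loss reach $f(\cdot\,;\Theta_f)$. Under the two-segment model of Equation (8), the kernels looked up for $F_{t-1\to t}$ and $F_{t+1\to t}$ would then be combined before reblurring.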
4. Experiments and Results
In this section we describe the experiments we performed, including the datasets we used, our quantitative and qualitative results, as well as a thorough ablation study. More results, including full-size images, are available in the supplementary material: https://goo.gl/nYPjEQ

[Figure 4 panels: Input; DeblurGAN Baseline; Proposed DeblurGAN-Reblur; Optical Flow (before fine-tuning); Optical Flow (after fine-tuning).]
Figure 4. A closer look at the results of the proposed self-supervised learning. By fine-tuning via the physics-based blur formation model, we improve both the deblur network $\Theta_d$ and the optical flow network $\Theta_f$. The blur in this example image is caused only by camera motion; the fine-tuning removes the artifacts in the optical flow (due to scene texture). The deblurred image of DeblurGAN-Reblur is also better than that of the baseline DeblurGAN.

Table 1. Datasets used in our self-supervised fine-tuning experiments.
We evaluated the proposed method extensively on four datasets, as summarized in Table 1. Both the MCD [16] and DVD [22] datasets were captured with high-speed cameras, such as the GoPro Hero 5 and the Sony RX10, at 240 fps, and thus have ground-truth sharp images for quantitative evaluation. In addition, in order to test the generalization ability of deblurring algorithms beyond natural scenes, we also used a GoPro camera and captured a small dataset of an ISO-resolution chart moving in front of the camera. We refer to this as the ISOCHART set, which we will release upon publication. For these three datasets, we average consecutive frames to create each blurry image and use the center frame as the sharp ground truth (sketched in code at the end of this subsection). Finally, the WFA dataset [4, 3] is widely used for evaluating deblurring algorithms. It does not offer ground-truth images and thus can only be evaluated qualitatively. Note that all four datasets are used as the unsupervised dataset $U$ in our experiments, which means we use only the blurry images as the input for self-supervised learning.

As mentioned earlier, we compared against two recent methods as our baselines: DVD [22], an auto-encoder with skip connections for deblurring, and DeblurGAN [13], the generator part of a GAN network. These two methods are representative, since one is purely supervised learning from blur-sharp pairs and the other incorporates the GAN loss. We applied the proposed self-supervised learning method on top of these two baselines and fine-tuned the networks. We refer to the resulting networks as DVD-Reblur and DeblurGAN-Reblur, respectively. In addition, we also compared with the MCD [16] method, which is similar to DeblurGAN.

Figure 5 shows several examples from the four datasets. Table 2 shows the average PSNR and SSIM for the three datasets with ground truth. As shown, our proposed self-supervised learning method brings significant improvement over the two baseline networks (especially over the DeblurGAN baseline, where the improvement is about 1 dB). Both proposed methods, DeblurGAN-Reblur and DVD-Reblur, effectively remove the artifacts introduced by the networks by enforcing the physics-based blur formation model.

We also compared with the MCD network [16], which is a multi-scale network with a GAN loss. The average PSNRs of MCD on the three datasets are 28.53, 31.21, and 32.30, respectively, which are slightly better than ours (DeblurGAN-Reblur). However, as shown in Figure 5 and the supplementary material, we found that DeblurGAN-Reblur often achieves better visual quality than MCD (see the ISOCHART example in Figure 5). In addition, our proposed methods are computationally more efficient than MCD: MCD takes 4.33 seconds to deblur an image, while DeblurGAN-Reblur takes 0.85 seconds and DVD-Reblur takes 0.84 seconds at the same image resolution, about 5× faster than MCD.

We also performed an ablation study to analyze several aspects of the proposed method. For computational efficiency, we ran all the studies over a randomly picked subset of the MCD dataset comprising 50 images. The results are reported below.
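Returning to the dataset construction above: the blurry/sharp pairs are synthesized by temporal averaging of high-speed frames. A minimal sketch follows; it is our illustration, not the authors' data-preparation code, and the frame count is an assumption since the exact number did not survive in the text.

```python
import numpy as np

def make_blur_pair(frames, center, num_avg=7):
    """Synthesize a blurry/sharp training pair from high-speed video frames.

    frames: sequence of consecutive sharp frames (H, W, 3), e.g., from 240 fps video.
    center: index of the frame used as sharp ground truth.
    num_avg: odd number of frames averaged to simulate a long exposure
             (assumed value; the paper does not state the exact count here).
    """
    half = num_avg // 2
    window = np.stack(frames[center - half : center + half + 1]).astype(np.float64)
    blurry = window.mean(axis=0)  # temporal average approximates motion blur
    sharp = frames[center]        # center frame serves as ground truth
    return blurry, sharp
```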
Choices of the Blur Formation Model $b(\hat{I}_S^{(t)}; K)$. In addition to the blur formation model described in Section 3.3, which is based on per-pixel convolution, one can warp the estimated sharp image $\hat{I}_S^{(t)}$ towards $t+1$ and $t-1$ directly, using the optical flows $F_{t+1\to t}$ and $F_{t-1\to t}$, and average the resulting images. The warping can be implemented with bilinear interpolation, which makes this formation model also differentiable; a sketch follows below. Table 3 summarizes the results. As shown, the per-pixel convolution blur formation model performs slightly better in terms of PSNR and SSIM, while the warping-based blur formation runs much faster during fine-tuning. Nevertheless, both blur formation models produce significant improvements over the two baseline methods.
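The warping-based variant can be sketched with PyTorch's differentiable `grid_sample`. Warping the central sharp estimate along fractions of the two flows and averaging is our reading of the description above; the step count and sampling scheme are assumptions.

```python
import torch
import torch.nn.functional as F_nn

def backward_warp(img, flow):
    """Warp img with a dense flow field via bilinear sampling (differentiable).

    img: (B, C, H, W); flow: (B, 2, H, W) in pixels.
    """
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img.device)  # (2, H, W)
    coords = grid.unsqueeze(0) + flow                           # sampling positions
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)       # (B, H, W, 2)
    return F_nn.grid_sample(img, norm_grid, align_corners=True)

def warp_reblur(sharp_t, flow_prev_to_t, flow_next_to_t, steps=5):
    """Approximate the exposure-time average by warping sharp_t along
    fractions of the two flows and averaging the warped images."""
    warped = []
    for s in torch.linspace(0.0, 1.0, steps):
        warped.append(backward_warp(sharp_t, s * flow_prev_to_t))
        warped.append(backward_warp(sharp_t, s * flow_next_to_t))
    return torch.stack(warped).mean(dim=0)
```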
Self-supervised Learning on a Single Image. Since the proposed method is self-supervised, in theory one can fine-tune the deblur network for each individual image separately, despite the high computational cost. Interestingly, we found this customized, single-image fine-tuning to be not necessarily better, and in fact slightly worse, than fine-tuning over a group of test images. Table 4 summarizes the results for five randomly picked images, each of which was fine-tuned separately with DeblurGAN-Reblur. One possible explanation is that fine-tuning over a set of test images may produce a more stable gradient based on the underlying image distribution, and may thus be less likely to get stuck in local minima. The full-resolution results of these five images are provided in the supplementary material.
Figure 5. Comparison of several deblurring methods on images from different datasets. The images are from the DVD [22], MCD [16], WFA [4, 3], and our own ISOCHART datasets. The insets on the right show the detailed input and deblurred results for the bounding-box areas in the input images on the left. The deblurred results follow the order: (a) DVD baseline [22], (b) DeblurGAN [13], (c) MCD [16], (d) DVD-Reblur (ours), (e) DeblurGAN-Reblur (ours), (f) blurry input, (g) ground truth. (The ground truth is not available for the WFA dataset; therefore the last column of the last row is missing.) The complete, full-resolution images are available in the supplementary material.

Table 2. Comparison of several state-of-the-art deblurring techniques (in terms of average PSNR, SSIM, and average run-time) on three datasets. DVD-Reblur and DeblurGAN-Reblur are the proposed methods, which integrate the reblurring framework within existing DNN-based deblurring algorithms.

Dataset   | DVD [22] PSNR / SSIM | DVD-Reblur PSNR / SSIM | DeblurGAN [13] PSNR / SSIM | DeblurGAN-Reblur PSNR / SSIM
MCD       | 25.36 / 0.8380       | – / –                  | – / –                      | – / –
DVD       | 29.15 / 0.9218       | – / –                  | – / –                      | – / –
ISOCHART  | 29.85 / 0.9632       | – / –                  | – / –                      | – / –
Table 3. Results of two different blur formation models. The column "Time" is the average running time for fine-tuning a mini-batch of 10 RGB images.

Network                      | PSNR   | SSIM  | Time (s)
DVD [22]                     | 25.067 | 0.872 | -
DVD-Reblur, Conv-Based       | –      | –     | –
DeblurGAN [13]               | 26.884 | 0.913 | -
DeblurGAN-Reblur, Conv-Based | –      | –     | –
Table 4. Fine-tuning the deblur network on a single image vs. on a group of images. Each approach's resulting PSNR and SSIM are shown.
Image No. | Individual PSNR / SSIM | Group PSNR / SSIM
1         | 26.538 / 0.878         | – / –
2         | – / –                  | – / –
3         | – / –                  | – / –
4         | – / –                  | – / –
5         | – / –                  | – / –

Effect of the Number of Frames in the Blur Formation Model.
When we construct the physics-based blur formation model, we need to compute two optical flow maps, $F_{t+1\to t}$ and $F_{t-1\to t}$, which requires at least three frames as input. In addition, we also experimented with only two frames as input, making the assumption that $F_{t+1\to t} \approx -F_{t-1\to t}$ (a minimal sketch of this variant follows Table 5). We evaluated this with the DVD-Reblur network. As expected, we found that the three-frame method performs better than the two-frame method. Table 5 summarizes the results.

Table 5. Deblurring performance of the three-frame vs. the two-frame blur formation model.
Network               | PSNR   | SSIM
DVD [22]              | 25.067 | 0.872
DVD-Reblur (2 frames) | 25.163 | 0.869
DVD-Reblur (3 frames) | –      | –
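For completeness, a sketch of the two-frame variant, which reuses the pipeline of Section 3 by mirroring the single available flow; `flow_net` and the sharp estimates are as in the earlier sketches.

```python
def two_frame_flows(flow_net, S_prev, S_t):
    """Two-frame variant: with only frames t-1 and t available, approximate
    the missing flow from t+1 by negating the backward flow, assuming
    locally linear motion: F_{t+1 -> t} ~= -F_{t-1 -> t}."""
    flow_prev_to_t = flow_net(S_prev, S_t)
    flow_next_to_t = -flow_prev_to_t
    return flow_prev_to_t, flow_next_to_t
```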
5. Conclusion and Discussion
In this paper, we proposed a novel deep-learning-based method for video motion deblurring. In order to improve the generalization ability and overcome the image artifacts of prior supervised-learning-based methods, we propose to incorporate a physics-based blur formation model to reblur the estimated sharp images, which allows us to fine-tune the deblur network via self-supervised learning. We evaluated our method over multiple datasets and found that the proposed approach effectively removes image artifacts and improves performance.

There are several limitations in the current approach that we plan to address in the future. While the piecewise-linear blur kernel based on optical flow is applicable to most motion blur in videos, this assumption does not hold for large amounts of motion blur, which often result in nonlinear, complex blur kernels. Moreover, GAN-based training has been shown to learn the statistical distribution of images well; it could also be incorporated with the physics-based blur formation model for self-supervised learning.
References

[1] José M. Bioucas-Dias, Mário A. T. Figueiredo, and João Pedro Oliveira. Total variation-based image deconvolution: a majorization-minimization approach. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 2, pages II–II. IEEE, 2006.
[2] Ayan Chakrabarti. A neural approach to blind motion deblurring. In European Conference on Computer Vision, pages 221–235. Springer, 2016.
[3] Sunghyun Cho, Jue Wang, and Seungyong Lee. Video deblurring for hand-held cameras using patch-based synthesis. ACM Transactions on Graphics (TOG), 31(4):64, 2012.
[4] Mauricio Delbracio and Guillermo Sapiro. Hand-held video deblurring via efficient Fourier aggregation. IEEE Transactions on Computational Imaging, 1(4):270–283, Dec 2015.
[5] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
[6] Rob Fergus, Barun Singh, Aaron Hertzmann, Sam T. Roweis, and William T. Freeman. Removing camera shake from a single photograph. In ACM Transactions on Graphics (TOG), volume 25, pages 787–794. ACM, 2006.
[7] Dong Gong, Jie Yang, Lingqiao Liu, Yanning Zhang, Ian Reid, Chunhua Shen, Anton van den Hengel, and Qinfeng Shi. From motion blur to motion flow: a deep learning solution for removing heterogeneous motion blur. arXiv preprint arXiv:1612.02583, 2016.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[9] Michael Hirsch, Christian J. Schuler, Stefan Harmeling, and Bernhard Schölkopf. Fast removal of non-uniform camera shake. In IEEE International Conference on Computer Vision (ICCV), pages 463–470. IEEE, 2011.
[10] Tae Hyun Kim and Kyoung Mu Lee. Generalized video deblurring for dynamic scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5426–5434, 2015.
[11] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. arXiv preprint arXiv:1612.01925, 2016.
[12] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiri Matas. DeblurGAN: Blind motion deblurring using conditional adversarial networks. arXiv preprint arXiv:1711.07064, 2017.
[14] Anat Levin, Rob Fergus, Frédo Durand, and William T. Freeman. Image and depth from a conventional camera with a coded aperture. ACM Transactions on Graphics (TOG), 26(3):70, 2007.
[15] Anat Levin, Yair Weiss, Fredo Durand, and William T. Freeman. Understanding and evaluating blind deconvolution algorithms. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1964–1971. IEEE, 2009.
[16] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. arXiv preprint arXiv:1612.02177, 2016.
[17] T. M. Nimisha, Akash Kumar Singh, and A. N. Rajagopalan. Blur-invariant deep learning for blind deblurring. In IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[18] Sainandan Ramakrishnan, Shubham Pachori, Aalok Gangopadhyay, and Shanmuganathan Raman. Deep generative filter for motion deblurring. arXiv preprint arXiv:1709.03481, 2017.
[19] Stefan Roth and Michael J. Black. Fields of experts: A framework for learning image priors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 860–867. IEEE, 2005.
[20] Christian J. Schuler, Michael Hirsch, Stefan Harmeling, and Bernhard Schölkopf. Learning to deblur. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1439–1451, 2016.
[21] Qi Shan, Jiaya Jia, and Aseem Agarwala. High-quality motion deblurring from a single image. In ACM Transactions on Graphics (TOG), volume 27, page 73. ACM, 2008.
[22] Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring. arXiv preprint arXiv:1611.08387, 2016.
[23] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. arXiv preprint arXiv:1709.02371, 2017.
[24] Jian Sun, Wenfei Cao, Zongben Xu, and Jean Ponce. Learning a convolutional neural network for non-uniform motion blur removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 769–777, 2015.
[25] Oliver Whyte, Josef Sivic, Andrew Zisserman, and Jean Ponce. Non-uniform deblurring for shaken images. International Journal of Computer Vision, 98(2):168–186, 2012.
[26] Patrick Wieschollek, Michael Hirsch, Bernhard Schölkopf, and Hendrik P. A. Lensch. Learning blind motion deblurring. In IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[27] Ruomei Yan and Ling Shao. Blind image blur estimation via deep learning. IEEE Transactions on Image Processing, 25(4):1910–1921, 2016.
[28] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1):47–57, 2017.