Learning Model-Blind Temporal Denoisers without Ground Truths
Yanghao Li, Bichuan Guo, Jiangtao Wen, Zhen Xia, Shan Liu, Yuxing Han
Tsinghua University, Beijing, China · Tencent Media Lab · Research Institute of Tsinghua University in Shenzhen, Shenzhen, China
[email protected]
Abstract.
Denoisers trained with synthetic data often fail to cope with the diversity of unknown noises, giving way to methods that can adapt to existing noise without knowing its ground truth. The previous image-based method leads to noise overfitting if directly applied to video denoisers, and its temporal information management is inadequate, especially in terms of occlusion and lighting variation, which considerably hinders its denoising performance. In this paper, we propose a general framework for video denoising networks that successfully addresses these challenges. A novel twin sampler assembles training data by decoupling inputs from targets without altering semantics, which not only effectively solves the noise overfitting problem, but also generates better occlusion masks efficiently by checking optical flow consistency. An online denoising scheme and a warping loss regularizer are employed for better temporal alignment. Lighting variation is quantified based on the local similarity of aligned frames. Our method consistently outperforms the prior art by 0.6-3.2dB PSNR on multiple noises, datasets and network architectures, achieving state-of-the-art results on reducing model-blind video noises. Extensive ablation studies demonstrate the significance of each technical component.
Keywords: temporal denoising, model-blind, optical flow
Noise reduction is a crucial first step in video processing pipelines. Despite the steady advancements in sensor technology, visible noises still occur when recording in low lighting conditions [9] or using mobile devices [1]. Therefore, effective denoisers are essential for achieving satisfactory results in downstream applications [26,27].

While there is a vast literature on reducing synthetic noises, reducing noises without explicit models (i.e. model-blind) remains an essential and challenging problem. On one hand, until recently, most traditional [14,16,19,53] and data-driven methods [11,51,52] assume additive white Gaussian noise (AWGN). However, Plötz and Roth [36] showed that denoisers trained with synthetic AWGN often perform poorly on real data.
Fig. 1.
An overview of existing methods and ours. Notations: $y_i$ (noisy frames), $w_f/w_b$ (forward/backward flow), $\hat{x}_i$ (denoised $y_i$), $f_{i\to j}$ ($f_i$ warped towards $f_j$). (a) Ehret et al. [18]. Training inputs and targets are constructed by aligning adjacent frames. Occluded regions inferred from flow divergence are excluded. (b) The naive extension of [18] to video denoisers with multi-frame inputs. Noise overfitting occurs because pixels of $y_{i-1}$ appear in both inputs and targets. (c) Our method. By construction, any input and its target have no overlapping sources, hence overfitting is avoided.

On the other hand, creating training data by synthesizing realistic noise [12,20] is non-trivial and prone to bias. Alternatively, one may estimate the ground truths (GT) of real photographs by adjusting exposure times [3,36], but this method is time-consuming and not viable for videos. Also, such synthesis and training for all possible noises can be computationally prohibitive. As a result, self-adaptive methods that do not require explicit noise modeling or expensive GT datasets have drawn considerable attention lately [6,17,23,25].

One of such methods is frame2frame [18] recently proposed by Ehret et al., which adapts an image denoiser to existing video noise. It is built upon the noise2noise framework [25], which trains an image denoising network with noisy-noisy pairs (as opposed to the usual noisy-clean pairs). Since the optimal weights under L1 loss are invariant to zero-median output noises, noise2noise only requires the noisy pairs to have the same GT and independent median-preserving noise realizations. By aligning video frames using optical flow, frame2frame constructs such pairs from the to-be-denoised video (Fig. 1(a)) and fine-tunes an image denoiser. It can cope with a wide range of noises and is shown to outperform many image denoisers in frame-by-frame blind denoising.

Despite the success of frame2frame as a model-blind image denoiser, several challenges remain that hinder its video denoising performance. First, its performance is reliant on the optical flow quality. However, many optical flow estimators only care about the flow error on clean image pairs. This criterion does not prioritize warped results and can perform sub-optimally in aligning noisy frames. Second, the key assumption of noise2noise, that noisy pairs have the same GT, is easily violated due to occlusion and lighting variation. Finer correspondence management is needed. Third, it cannot be directly applied to temporal denoising where adjacent noisy frames are taken as inputs: since these frames are also used in frame2frame to construct training targets, this dual presence in both inputs and targets causes noise overfitting in static regions (Fig. 1(b)).

In this paper we propose a general framework for video denoising networks that successfully addresses all these challenges. An overview of our method is shown in Fig. 1(c). The main contributions of this paper are:

– Temporal alignment is improved by employing an online denoising scheme, as well as a warping loss regularizer aiming to improve the content awareness of optical flow estimation networks.

– Correspondence management is enhanced by aggregating two components: better occlusion masks are produced based on optical flow consistency; lighting variation is measured based on local similarity.
– We reveal the noise overfitting problem due to dual presence suffered by the naive extension of the image-based method, and propose a novel twin sampler that not only decouples inputs from targets to prevent noise overfitting, but also enables better occlusion inference as a free by-product.
Image and video denoising.
Over the years, a myriad of image denoising algorithms have been proposed: bilateral filters [46], domain-transform [34,37], variational [39,40] and patch-based methods [14,16,24,29,53]. The first attempt based on neural networks utilizes fully connected layers [7]. More recently, CNN-based methods have been proposed, including DnCNN [51], FFDnet [52] and many others [9,30]. These data-driven methods are commonly trained with noisy-clean pairs, where noisy images are synthesized from clean images with a known noise model. Our method is also data-driven, but tackles a more difficult case where the noise model is unknown.

Compared to image denoising, the literature addressing video denoising is more limited. VBM3D [13] extends the image-based BM3D [14] by searching for similar patches in adjacent frames. VBM4D [28] further generalizes this idea to spatio-temporal volumes. VNLB [5] extends the image-based NLB algorithm [24] in a similar manner. The first data-driven method [10] exploits recurrent neural networks for temporal information, but its performance is not satisfactory. Recently, Davy et al. proposed VNLnet [15], which augments DnCNN with a spatio-temporal nearest neighbor module, allowing non-local patch search and CNN to be combined. Tassano et al. proposed FastDVDnet [45], which employs a cascade of U-shaped encoder-decoder architectures [38] and performs implicit motion estimation. However, these methods are restricted to their training noises
and generalize poorly on other noises. In [17,33], the related problem of burst denoising is addressed, but these methods are not tailored for videos containing large motions.
Blind denoising.
Many efforts have been devoted to blind denoising lately. The first line of research constructs noisy-clean pairs to train deep architectures. Methods were developed for acquiring the GT of real photographs: many datasets [1,3,36,48] were proposed, and deep architectures tailored for realistic noises were trained [4,42,50]. However, high quality GT videos are harder to obtain than images, preventing these methods from being applied to temporal denoising. Meanwhile, analytic models were proposed for simulating realistic noises: CBDNet [20] incorporates in-camera processing pipelines into noise modeling; ViDeNN [12] considers photon shot and read noises. Nevertheless, these methods are still dependent on their respective analytic models, which exhibit inevitable bias and may not generalize well to other noises. Our method does not require expensive video GT datasets, and only imposes weak statistical assumptions on noise attributes.

The second line of research trains self-adaptive denoisers with noisy data alone, without clean counterparts.
Noise2noise [25] observes that under mild conditions on the noise distribution, an image denoiser can be trained with noisy pairs that have the same GT and independent noise realizations.
Frame2frame [18] constructs such noisy pairs from videos using optical flow. Taking a step further, noise2self [6] and noise2void [23] propose to train image denoisers using single noisy images as both input and target, where part of the input is removed from the output's receptive field to avoid learning a degenerate identity function. However, these methods do not outperform noise2noise as less information is available during training. Our proposed twin sampler is reminiscent of these methods in the decoupling of inputs from targets; but instead of removing elements, we replace elements without changing semantic content. It tackles noise overfitting due to the temporal redundancy in static scenes, a problem that previous image-based methods do not encounter.
In the discriminative learning paradigm, noisy inputs $y_i$ are synthesized from clean images $x_i$ as $y_i = x_i + n_i$, where $n_i$ follows some analytic noise distribution, e.g. AWGN. The noise2noise framework uses noisy-noisy pairs $(y_i, y_i')$ instead, dispensing with the need for explicit noise modeling or noisy-clean datasets. Specifically, it assumes the noisy pair satisfies

$$ y_i = x_i + n_i, \qquad y_i' = x_i + n_i', \qquad (1) $$

i.e. they share the same GT. A neural network $g_\theta$ with weights $\theta$ is then trained by minimizing the empirical risk:

$$ \arg\min_\theta \; \mathbb{E}_{y_i, y_i'}\big[ \ell\big( g_\theta(y_i), y_i' \big) \big], \qquad (2) $$

where $\ell$ is, say, the L2 loss. With a sufficiently large training set, the network $g_\theta$ learns to approximate the optimal estimator $g^*$, which is $\mathbb{E}[y_i' \mid y_i]$ according to Bayesian decision theory. If the noise distribution satisfies

$$ \mathbb{E}[y_i' \mid y_i] = \mathbb{E}[x_i \mid y_i], \qquad (3) $$

i.e. the noise $n_i'$ preserves the mean, the optimal estimator $g^*$ would be the same as if $y_i'$ were replaced by $x_i$ in the training criterion (2). In other words, the network has the same optimal weights $\theta$ as if it was trained using noisy-clean pairs $(y_i, x_i)$. The same property holds for the L1/L0 loss under median/mode-preserving noises.

Such noisy pairs are still difficult to obtain from real data. Noise2void proposes to use $(y_i, y_i)$ to train a "blind-spot" architecture that removes input pixels from the output's receptive field at the same spatial coordinates. However, it is incompatible with many state-of-the-art methods [15,45] that directly add inputs to outputs for residual prediction, and is shown in [23] to perform worse than noise2noise and traditional methods, e.g. BM3D.

Alternatively, frame2frame proposes to use videos to construct such noisy pairs. It assumes that consecutive frames $\{y_{i-1}, y_i\}$ are both observations of the same clean signal $x_i$, except that $y_{i-1}$ is transformed by motion. Optical flow is computed from $\{y_{i-1}, y_i\}$ and used to warp $y_{i-1}$ to become $y_i'$, such that $y_i'$ and $y_i$ are aligned. Occluded pixels are inferred from the divergence of optical flow and then excluded from the loss (2). Image denoisers trained with noisy pairs from a video can be used to denoise that specific video in a frame-by-frame manner.
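As a concrete illustration of the criterion (2), the following PyTorch sketch trains a denoiser on noisy pairs alone; the two-layer network, optimizer settings and tensor shapes are placeholders rather than the configurations used in this paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of the noise2noise criterion (2): the denoiser g_theta is trained
# on noisy pairs (y, y_prime) that share the same ground truth but carry independent
# noise realizations; no clean target is used. The tiny CNN below is only a placeholder.
g_theta = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(g_theta.parameters(), lr=1e-4)
criterion = nn.L1Loss()  # L1: the optimum is invariant to median-preserving target noise


def noise2noise_step(y, y_prime):
    """One update on a noisy pair; y and y_prime are (N, 3, H, W) tensors."""
    optimizer.zero_grad()
    loss = criterion(g_theta(y), y_prime)
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage: two independent noisy observations of the same clean signal.
clean = torch.rand(4, 3, 64, 64)
noise2noise_step(clean + 0.1 * torch.randn_like(clean),
                 clean + 0.1 * torch.randn_like(clean))
```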
Frame2frame learns image denoisers that take single frames $y_i$ as inputs. To incorporate temporal information, video denoising networks include the adjacent frames of $y_i$ in their inputs as well, denoted by $Y_i$:

$$ Y_i := \{\, \dots, y_{i-1}, y_i, y_{i+1}, \dots \,\}, \qquad (4) $$

where $:=$ denotes definition/assignment. The optical flow estimator plays a critical role in improving denoising performance. There are two major drawbacks of using existing optical flow estimators. (i) They are usually trained with clean image pairs, and can perform worse on noisy inputs. (ii) Most optical flow estimators are designed to match the GT flow, regardless of the image content. As the goal of optical flow warping is to correctly align adjacent frames, we would like to penalize misalignments that result in large pixel differences. For example, flow error in homogeneous regions (e.g. a whiteboard) will not cause violation of the noise2noise assumption, and should not be penalized as much as in heterogeneous regions.

To solve (i), one might use synthetic noise to train a noise-robust optical flow estimator. However, in our problem setup such a method is not viable as noise models are unknown. Instead, we perform online denoising before estimating optical flow, using the very denoiser $g_\theta$ we are training.
Fig. 2.
Comparison between the original loss and (6) on PWC-net [43]. (a) and (b): warped images with inferred occlusion (black) using (6) and the original loss. (a) has smaller inferred occlusion. (c): reference frame. (d) and (e): warped frames using (6) and the original loss. (d) matches (c) more faithfully.

Intuitively, the denoiser $g_\theta$ and the optical flow quality evolve with each other: throughout training, $g_\theta$ produces progressively cleaner frames that improve optical flow estimation, which in turn helps to train $g_\theta$ via better alignment. Formally, suppose an optical flow estimator $\Gamma$ computes the optical flow from $a$ to $b$ as $\Gamma(a, b)$. The forward and backward flow $w_f, w_b$ between $y_{i-1}$ and $y_i$ are computed as

$$ w_f := \Gamma\big(g_\theta(Y_{i-1}), g_\theta(Y_i)\big), \qquad w_b := \Gamma\big(g_\theta(Y_i), g_\theta(Y_{i-1})\big). \qquad (5) $$

To solve (ii), we use a warping loss to regularize the training of $\Gamma$. This loss directly penalizes pixel differences after alignment. Suppose the GT flow from $a$ to $b$ is $w$, and the original loss function is $L_{orig} = L(\Gamma(a, b), w)$, which only considers flow error. We use the following hybrid loss instead (a comparison is shown in Fig. 2):

$$ L_{orig} + \lambda \,\big\| (1 - o_a) \odot \big( a - \mathrm{warp}(b, \Gamma(a, b)) \big) \big\|, \qquad (6) $$

where "warp" represents the inverse warping function, and $\lambda$ is a hyper-parameter that controls the balance between the two terms. The GT occlusion map $o_a$ is used to exclude occluded regions where no alignment can be achieved. As such, training with this loss requires datasets that contain GT occlusion maps, e.g. FlyingChairs2 [22] and Sintel [8]. Meister et al. [32] used the warping loss to train $\Gamma$ without GT flow. In our scenario, we found the GT flow to be a useful guidance, hence the warping loss is only used as a regularization term in (6).
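The objective (6) amounts to adding an occlusion-masked photometric term on top of the usual flow supervision. A minimal PyTorch sketch follows; the endpoint-error stand-in for $L_{orig}$, the norm choices and the mean reduction are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Inverse-warp img using flow (N, 2, H, W): the output at pixel p samples img
    at p + flow(p). flow[:, 0] is the horizontal, flow[:, 1] the vertical component."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(img.device)   # pixel coordinate grid (2, H, W)
    coords = base.unsqueeze(0) + flow                      # sampling positions p + flow(p)
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0                # normalize to [-1, 1] for grid_sample
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                   # (N, H, W, 2)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)

def hybrid_flow_loss(pred_flow, gt_flow, a, b, occ_a, lam=0.06):
    """Sketch of (6): a stand-in endpoint-error term plus the occlusion-masked
    photometric term that penalizes misalignment of warp(b, flow) against a."""
    l_orig = (pred_flow - gt_flow).norm(dim=1).mean()
    residual = (1.0 - occ_a) * (a - backward_warp(b, pred_flow))
    return l_orig + lam * residual.abs().mean()
```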
Let us further analyze the scenarios where the noise2noise assumption (3) fails in the case of the multi-frame input $Y_i$. The assumption requires that $\mathbb{E}[y_i' \mid Y_i] = \mathbb{E}[x_i \mid Y_i]$, where $y_i'$ is obtained by warping $y_{i-1}$ towards $y_i$ using optical flow. It fails if (i) a pixel in $y_{i-1}$ has no correspondence in $y_i$, or (ii) the corresponding pixels in $y_{i-1}$ and $y_i$ have different GT values. Occlusion causes (i), while lighting variation leads to (ii), see Fig. 3 (b) and (f).

In frame2frame, occlusion is detected by checking if the divergence of optical flow exceeds a threshold. As it turns out, we can use the forward and backward optical flow computed in the previous subsection to derive a better occlusion mask, see Fig. 3 (d) and (e).

Fig. 3. An illustration of correspondence management: (a) frame $y_{i-1}$; (b) frame $y_i$; (f) $y_i'$ (frame $y_{i-1}$ warped towards $y_i$); (d) inferred occlusion mask (solid black) and lighting variation (gray) from Sec. 3.3; (c) and (g): (b) and (f) multiplied with mask (d), respectively; (e) inferred occlusion mask based on flow divergence; (h) GT occlusion. Compare (b) and (f) to observe occlusion and lighting variation. Further compare with (c) and (g) to see that the mask (d) effectively covers these outlier pixels.

The forward-backward consistency assumption [44] states that the forward flow of a non-occluded pixel and the backward flow of its corresponding pixel in the next frame should be opposite numbers. Meister et al. [32] used this property to regularize unsupervised training of optical flow. Here we use it to directly infer occlusion. Specifically, let $p$ denote a pixel coordinate in $y_i$; we can compute a binary map $o_i$ to mark if $p$ is occluded in the previous frame $y_{i-1}$: let $o_i(p) := 0$ (not occluded) if

$$ \big\| w_b(p) + w_f\big(p + w_b(p)\big) \big\| < \alpha_1 \Big( \big\| w_b(p) \big\| + \big\| w_f\big(p + w_b(p)\big) \big\| \Big) + \alpha_2, \qquad (7) $$

otherwise $o_i(p) := 1$ (occluded), where $\alpha_1, \alpha_2$ are hyper-parameters specifying relative and absolute thresholds.

While occlusion masks are binary, lighting variation is quantitative by nature. We can use the difference between corresponding pixels, e.g. pixels at the same coordinates of $x_i$ and $x_i'$ ($x_{i-1}$ warped towards $x_i$), to quantify lighting variation. However, individual pixels can have large variance due to noise and randomness. To improve robustness, we instead compare the average intensity of patches centered at corresponding pixels. Using a $5 \times 5$ averaging kernel $\kappa$, the patch difference can be computed efficiently (e.g. on GPUs) as $|\kappa * (x_i - x_i')|$, where $*$ denotes convolution. Again, occluded pixels should be excluded from the patch. This can be done by a point-wise product between $x_i - x_i'$ and the non-occlusion map $1 - o_i$, followed by proper normalization. Formally, the lighting variation $l_i$ of pixels in $x_i$ with respect to corresponding pixels in $x_{i-1}$ is computed as

$$ l_i := \frac{\big| \kappa * [(\hat{x}_i - \hat{x}_i') \odot (1 - o_i)] \big|}{\kappa * (1 - o_i) + \epsilon}, \qquad (8) $$

where $\hat{x}_i$ and $\hat{x}_i'$ represent our estimates of the GT quantities $x_i$ and $x_i'$, $\odot$ and the fraction denote point-wise product and division, and the denominator is a normalization factor that contains a small positive $\epsilon$ to prevent division by zero. The evaluation of lighting variation (8) can also benefit from online denoising, so as to prevent pixel differences due to noise from being misinterpreted as lighting variation. To do so, we define the clean signal estimates $\hat{x}_i$ and $\hat{x}_i'$ in (8) as:

$$ \hat{x}_i := g_\theta(Y_i), \qquad \hat{x}_i' := \mathrm{warp}\big(g_\theta(Y_{i-1}), w_b\big). \qquad (9) $$

This reduces the amount of noise in the clean signal estimates to further improve robustness.
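Both maps reduce to a few tensor operations. The sketch below reuses the backward_warp helper from the previous snippet; the depthwise 5x5 box filter follows the text, while the value of eps and the default thresholds (taken from the hyper-parameter search reported later) are assumptions.

```python
import torch
import torch.nn.functional as F

def occlusion_mask(w_f, w_b, alpha1=0.0064, alpha2=1.4):
    """Sketch of (7): a pixel p of frame i is marked occluded (1) when the backward
    flow at p and the forward flow at p + w_b(p) fail the consistency check.
    Flows are (N, 2, H, W); backward_warp is the helper defined earlier."""
    w_f_moved = backward_warp(w_f, w_b)                 # w_f evaluated at p + w_b(p)
    lhs = (w_b + w_f_moved).norm(dim=1)
    rhs = alpha1 * (w_b.norm(dim=1) + w_f_moved.norm(dim=1)) + alpha2
    return (lhs >= rhs).float().unsqueeze(1)            # (N, 1, H, W), 1 = occluded

def lighting_variation(x_hat, x_hat_warped, occ, eps=1e-6):
    """Sketch of (8): patch-averaged difference of the aligned clean estimates with
    occluded pixels excluded. The 5x5 box kernel follows the text; eps is an assumed value."""
    c = x_hat.shape[1]
    kappa = torch.ones(c, 1, 5, 5, device=x_hat.device) / 25.0        # depthwise box filter
    diff = (x_hat - x_hat_warped) * (1.0 - occ)
    num = F.conv2d(diff, kappa, padding=2, groups=c).abs()
    den = F.conv2d((1.0 - occ).expand_as(x_hat), kappa, padding=2, groups=c) + eps
    return num / den
```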
A naive extension of frame2frame to video denoisers is to train with $(Y_i, y_i')$. Unfortunately, this leads to noise overfitting in static regions. Since the optical flow is almost zero in these regions ($y_i' \approx y_{i-1}$), the target $y_i'$ becomes a part of the input $Y_i$. The network can simply learn to reproduce that part (the previous frame $y_{i-1}$), leading to noisy predictions. A visual example is provided later in Fig. 5 (right). To avoid this, one might consider using a frame $y_j \notin Y_i$ for computing $y_i'$. However, this is impractical as state-of-the-art video denoising networks can have $Y_i$ spanning a large temporal window (up to 7 frames in both directions [15]), and frames that are too far from $y_i$ simply cannot be aligned with it.

We propose a twin sampler that not only effectively solves this problem, but also brings additional benefits. Our first step is to replace $y_{i-1}$ in $Y_i$ with a warped frame as well. A toy example that illustrates this idea is given below. Suppose the network $g_\theta$ originally takes three frames $Y_2 = \{y_1, y_2, y_3\}$ as input to denoise the middle frame $y_2$. Using estimated optical flow, we warp $y_2$ to align with $y_3$, yielding $y_{2\to 3}$; similarly, $y_3$ is warped to align with $y_2$, yielding $y_{3\to 2}$. The new input is $Y_2' = \{y_1, y_2, y_{2\to 3}\}$, which resembles the original input $\{y_1, y_2, y_3\}$ semantically. The target is still $y_{3\to 2}$, which is another noisy observation of $x_2$. The key is that the input $Y_2'$ and the target $y_{3\to 2}$ do not share sources: pixels in $Y_2'$ originate from $y_1$ and $y_2$, and pixels in $y_{3\to 2}$ originate from $y_3$. As such, a degenerate mapping that produces part of the input will not be learned. Also, since $Y_2'$ keeps the semantic form of the original input $Y_2$, the network's interpretation remains the same, thus no change is required at inference time.

As simple as it may seem, this method comes with two convenient byproducts. Firstly, since the forward and backward flow between $y_2$ and $y_3$ have been computed during the above process, the occlusion mask in Section 3.3 can be derived with little additional cost. Secondly, another noisy pair, $(Y_3' = \{y_{3\to 2}, y_3, y_4\}, \; y_{2\to 3})$, can be immediately constructed without additional optical flow estimation/warping. The input $Y_3'$ resembles $Y_3 = \{y_2, y_3, y_4\}$ semantically, the target $y_{2\to 3}$ is another noisy observation of $x_3$, and no sources are shared between them. This method is dubbed a "twin sampler" due to the fact that constructed samples are always grouped in pairs corresponding to consecutive frames.
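A minimal sketch of this construction for a three-frame window is given below; flow_fn is a placeholder for the flow estimator of (5) run on online-denoised frames, backward_warp is the helper defined earlier, and the pairing corresponds to the construction formalized in (10)-(12) below.

```python
def twin_sampler(frames, i, flow_fn):
    """Sketch of the twin sampler for a 3-frame window. frames is a list of
    (N, C, H, W) tensors; flow_fn(a, b) stands in for the flow estimator of (5);
    backward_warp is the helper defined earlier."""
    y_prev, y_cur, y_next = frames[i - 1], frames[i], frames[i + 1]
    w_f = flow_fn(y_prev, y_cur)                  # forward flow,  frame i-1 -> i
    w_b = flow_fn(y_cur, y_prev)                  # backward flow, frame i   -> i-1
    y_prev_to_cur = backward_warp(y_prev, w_b)    # y_{(i-1)->i}, sourced from y_{i-1}
    y_cur_to_prev = backward_warp(y_cur, w_f)     # y_{i->(i-1)}, sourced from y_i
    # Pair for frame i: input sources are {y_i, y_{i+1}}, target source is y_{i-1}.
    pair_i = ((y_cur_to_prev, y_cur, y_next), y_prev_to_cur)
    # Twin pair for frame i-1: input sources are {y_{i-2}, y_{i-1}}, target source is y_i.
    pair_prev = ((frames[i - 2], y_prev, y_prev_to_cur), y_cur_to_prev)
    return pair_prev, pair_i
```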
Algorithm 1 Our training procedure for each mini-batch. The optical flow network $\Gamma$ has been trained with loss (6).

1: while the batch is not full, select a random $i$ and do
2:   Construct original inputs $Y_{i-1}$ and $Y_i$ by (4).
3:   Compute optical flow $w_f, w_b$ and clean signal estimates $\hat{x}_i$ and $\hat{x}_i'$ by (5)(9).
4:   Compute the backward occlusion map $o_i$ and lighting variation $l_i$ by (7)(8). The forward $o_{i-1}$ and $l_{i-1}$ are computed similarly with $i-1$ and $i$ exchanged.
5:   Construct the final input and target from (10)(11)(12), crop the above quantities at the same spatial locations and add them to the mini-batch.
   end while
6: Compute loss (13) and update weights $\theta$ with backprop.

Formally, let $y_{i\to j}$ denote the frame obtained by warping $y_i$ towards $y_j$. Suppose the network input takes the general form (4). The twin sampler first computes

$$ y_{(i-1)\to i} := \mathrm{warp}(y_{i-1}, w_b), \qquad y_{i\to(i-1)} := \mathrm{warp}(y_i, w_f). \qquad (10) $$

Then, two noisy pairs are constructed as

$$ \Big( Y_{i-1}' := Y_{i-1} \setminus \{y_i\} \cup \{y_{(i-1)\to i}\}, \;\; y_{i\to(i-1)} \Big), \qquad (11) $$

$$ \Big( Y_i' := Y_i \setminus \{y_{i-1}\} \cup \{y_{i\to(i-1)}\}, \;\; y_{(i-1)\to i} \Big). \qquad (12) $$

The occlusion map $o_i$ and lighting variation $l_i$ are used to adjust the loss $\ell$ in the training criterion (2). For the noisy pair (12), its associated loss is

$$ \ell\big( g_\theta(Y_i') \odot \gamma, \; y_{(i-1)\to i} \odot \gamma \big) \quad \text{where} \quad \gamma = (1 - o_i) \odot \xi(l_i), \qquad (13) $$

and $\xi$ is a non-linear function with a hyper-parameter $\alpha_3$ that maps its input to the range $(0, 1]$:

$$ \xi(l) := \exp(-\alpha_3 l). \qquad (14) $$

Intuitively, occluded pixels do not contribute to the loss, and pixels with drastic lighting variation contribute less to the loss. Therefore, our loss function guides the network to learn from pixels that are properly aligned and satisfy the assumption (3). For the other noisy pair (11), its occlusion map $o_{i-1}$, lighting variation $l_{i-1}$ and associated loss are computed similarly, with $w_f/w_b$ exchanged in (7) and $i$, $i-1$ exchanged.

The pseudocode summarizing the above procedures is shown in Algorithm 1. We train the network $g_\theta$ using mini-batches, each consisting of multiple noisy pairs. Since the noisy video can have very high resolutions, in line 5 we crop these pairs as well as their associated occlusion maps and lighting variations to a fixed size. This allows us to use large batch sizes regardless of the video's resolution. All related computations can efficiently run on GPUs.

Following [18], $\theta$ is initialized by pretraining with synthetic AWGN on clean datasets. Since GT is utilized in this pretraining but not in our method, if the actual noise is very close to AWGN, the initial model operates in an ideal test situation and is likely to perform very well. Our final trick is to use a denoising autoencoder (DAE) to detect if this happens. According to Alain et al. [2], the reconstruction error of a DAE $r(y)$ trained with infinitesimal AWGN satisfies $r(y) - y \approx \sigma^2 \nabla_y \log \Pr(y)$, where $\Pr(y)$ is the data distribution of the GT. If $y$ contains little noise, it is close to the GT manifold, which implies $\Pr(y)$ is close to its local maximum and the reconstruction error will be small. Therefore, we can use $\| r(y) - y \|$ as a rough indicator of the cleanliness of $y$. Using a DAE trained on ImageNet [41], we compute the reconstruction error of denoised video frames before and after noise2noise training. If the average error magnitude does not decrease significantly (by 50%), the initial model will be kept.
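Putting the pieces together, the weighting of (13)-(14) can be applied directly to an L1 criterion. A minimal sketch, with the mean reduction as an assumption:

```python
import torch

def masked_noise2noise_loss(pred, target, occ, lv, alpha3=5.0):
    """Sketch of (13)-(14): occluded pixels are excluded from the L1 loss and pixels
    with strong lighting variation are down-weighted by xi(l) = exp(-alpha3 * l).
    pred/target are (N, C, H, W); occ and lv are (N, 1, H, W); the default alpha3
    follows the value reported in the experiments."""
    gamma = (1.0 - occ) * torch.exp(-alpha3 * lv)      # per-pixel weight in [0, 1]
    return torch.abs((pred - target) * gamma).mean()
```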
Due to the lack of reliable methods for obtaining the GT of noisy videos, we use synthetic noises for quantitative experiments as in [18], and demonstrate real noise reduction visually. Five distinct synthetic noises are used for testing: AWGN20 (AWGN with standard deviation $\sigma$=20), MG (multiplicative Gaussian, where each pixel's value is multiplied by a Gaussian with mean 1), CG (correlated Gaussian, where AWGN with $\sigma$=25 is blurred with a 3$\times$3 kernel), as well as IR and JPEG noises. To mimic realistic scenarios, all pixel values are clipped to the valid range. Sequences from Sintel [8] are divided into three subsets used for optical flow fine-tuning (sintel-tr), hyper-parameter tuning (sintel-val) and denoising performance evaluation (sintel-8), respectively. All 30 sequences from the "test-dev" split of DAVIS (davis-30) and 7 selected sequences [15] from Derf's collection (derf-7) are also used for performance evaluation.
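For illustration, the test noises above can be synthesized roughly as follows (PyTorch, operating on 0-255 intensities); the MG strength and the uniform 3x3 CG blur kernel are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def awgn20(clean):
    """AWGN20: additive white Gaussian noise with sigma = 20 on a 0-255 scale."""
    return (clean + 20.0 * torch.randn_like(clean)).clamp(0.0, 255.0)

def mg(clean, sigma=0.2):
    """MG: each pixel is multiplied by a Gaussian with mean 1; sigma = 0.2 is an assumed strength."""
    return (clean * (1.0 + sigma * torch.randn_like(clean))).clamp(0.0, 255.0)

def cg(clean, sigma=25.0):
    """CG: AWGN with sigma = 25 blurred by a small kernel; the uniform 3x3 kernel is an assumption."""
    c = clean.shape[1]
    noise = sigma * torch.randn_like(clean)
    kernel = torch.ones(c, 1, 3, 3, device=clean.device) / 9.0
    noise = F.conv2d(noise, kernel, padding=1, groups=c)
    return (clean + noise).clamp(0.0, 255.0)
```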
Implementation.
To demonstrate the generality of our framework, we apply it to the latest video denoising networks with distinct architectures (VNLnet [15] and FastDVDnet [45]), see Fig. 4. The weights used to initialize VNLnet are the publicly released version trained on color sequences with AWGN. The authors of FastDVDnet included the noise strength in the network input for non-blind denoising. We train a blind version by removing the noise strength input and repeating the same training procedure. The noise strength used for training is $\sigma$=20 for both, which allows the test noise AWGN20 to cover the case where the initial model matches the test noise, as discussed in Sec. 3.5.

Fig. 4. (a) FastDVDnet takes 5 frames as input and performs two-stage denoising. (b) VNLnet takes 15 frames as input, which are converted to features using a non-local search module.

Fig. 5.
Left: occlusion masking ($\alpha_1, \alpha_2 < \infty$) is essential for sequences with large motions. A moderate penalty on lighting variation ($0 < \alpha_3 < \infty$) is optimal. Regularized optical flow ($\lambda > 0$) boosts denoising performance. Overall, the hyper-parameters are not sensitive to small variations. Middle: denoising a sequence in sintel-8. The performance of FastDVDnet+ours converges steadily as training continues. Right: derf-7 with CG noise. Our method avoids the noise overfitting problem suffered by the naive extension of [18].

We perform random search to determine the best hyper-parameters. A "validation noise" (AWGN with $\sigma$=30) is used to prevent previous "test noises" from being seen. The combination that achieves the best average PSNR on sintel-val is: $\alpha_1$=0.0064, $\alpha_2$=1.4 in (7); $\alpha_3$=5.0 in (14); and $\lambda$=0.06 in (6). Within sintel-val, 3 sequences with different motion scales are selected; individual hyper-parameters are varied to study their sensitivities, see Fig. 5. The loss function $\ell$ in (2) is the L1 loss, which can cope with a wide range of noises according to [18]. We use the Adam optimizer to update the weights (Algorithm 1, line 6); the learning rate is 5 × 10− for FastDVDnet and 2 × 10− for VNLnet; a fixed batch size of 32 is set for both, and the iteration stops after 100 mini-batches. The fixed crop size in Algorithm 1, line 5 is 96 by 96. To compute optical flow, $\Gamma$ is selected as the recently proposed PWC-net [43], which outperforms many traditional methods and is also faster. We fine-tune the publicly released model (pretrained with FlyingThings3D [31]) using FlyingChairs2 [22], ChairsSDHom [21] and sintel-tr with loss (6).

Regarding overall performance, we primarily compare with frame2frame, which uses the image denoiser DnCNN as its backbone. The naive extension of frame2frame to video denoising networks, as described at the beginning of Section 3.4, also serves as a baseline. Traditional methods such as VBM4D and VNLB, as well as some recent blind denoising methods including CBDNet and ViDeNN, are also compared. Note that these recent methods are still trained with noisy-clean pairs, whose performances are bounded by their training data and noise model assumptions. For frame2frame, the backbone DnCNN is also initialized by pretraining with AWGN $\sigma$=20. Since our task is model-blind denoising, using a specialized pretrained model for each test noise is not allowed. Therefore, for methods that require pretrained weights, the same publicly released model is used for all noises.

Table 1.
Average PSNR/SSIM on derf-7 and davis-30. DnCNN+f2f is the original implementation of [18]. FastDVDnet+f2f and VNLnet+f2f are naive extensions of [18] to video denoisers as described in Sec. 3.4. "X initial" is the initial model of X pretrained with AWGN, "X+ours" is our proposed framework applied to X. For image denoisers, frame-by-frame denoising is performed.

dataset            | derf-7: AWGN20 | MG         | CG         | IR         | JPEG       | davis-30: AWGN20 | MG | CG | IR | JPEG
VBM4D [28]         | 33.23/.896     | 30.11/.850 | –          | –          | –          | –                | –  | –  | –  | –
FastDVDnet initial | 33.28/.904     | 22.44/.561 | 23.03/.504 | 21.83/.485 | 27.95/.705 | 34.39/.927       | –  | –  | –  | –
Table 1 shows the overall results on derf-7 and davis-30. More details are given in the table caption. The following observations are clear from the results. (1) The naive extension of frame2frame to video denoising networks performs even worse than its DnCNN-based version (compare rows with suffix "+f2f"). This is due to the noise overfitting problem discussed before, see Fig. 5 (right). (2) Our method consistently outperforms DnCNN-based frame2frame on both architectures, achieving 0.6-3.2dB PSNR gain ("DnCNN+f2f" vs. rows with suffix "+ours"). This proves that our method successfully leverages the capability of video denoising networks to utilize temporal information. (3) Compared to other existing methods, our method achieves state-of-the-art results on removing model-blind noises. This is expected as those other methods are designed for their respective noise models, and do not have the capability to adapt to existing video noise if their models are violated. (4) According to the columns "AWGN20", our method can effectively detect if the actual noise is close to the training noise used to initialize weights, and select the appropriate model.
Fig. 6.
Denoising real videos captured with a front-facing camera (left) or under a low lighting condition (right). Model-based methods, even though targeting realistic noises (middle column), fail in this case. Our method outperforms the image-based frame2frame by incorporating temporal information.

We also use mobile devices to capture several real sequences, whose subjective denoising results are shown in Fig. 6. Models trained with synthetic data, even though equipped with realistic noise models, fail to remove these real noises, revealing the limitation of model-based approaches. Our method produces clean results and outperforms [18] on both architectures.
Table 2 shows the detailed breakdown of our method's performance on the dataset sintel-8. Due to space constraints, the complete breakdown for other noises is given in the supplementary. Since sintel-8 contains GT optical flow and occlusion, we can compare with two oracles that exploit these GT: row 8 employs frame2frame, while row 9 employs our twin sampler.

The twin sampler offers the most significant contribution, as PSNR is improved by 1.1-3.7dB (row 1 vs. 2). It also benefits the oracle by 1.8-3.7dB (row 8 vs. 9). The gap between rows 2 and 9 (0.8-1.5dB) results from inaccurate correspondence. For occlusion masking, flow consistency clearly outperforms flow divergence, achieving 0.4-0.7dB PSNR gain (row 2 vs. 3). By considering lighting variation, PSNR is improved by 0.06-0.2dB (row 3 vs. 4). By improving optical flow quality, online denoising and the warping loss contribute roughly equally, providing 0.1-0.4dB PSNR gain in total (rows 4/5/6). The
DAE module (row 7) selects the better model between rows 6 and 10. When the test noise matches the pretraining noise (AWGN20), the initial model is indeed selected. Even the oracle (row 9) does not match the initial model, as these state-of-the-art video denoisers are very powerful in ideal test situations once trained. More notably, for JPEG noise the PSNR of row 7 is higher than both rows 6 and 10. This is due to the JPEG noise being moderately close to AWGN20 (see Fig. 5, middle), and the better model can be either row 6 or 10, depending on the video content.
Table 2.
Average PSNR/SSIM on sintel-8. "ts": twin sampler. If the twin sampler is disabled, the naive extension of frame2frame is used. "occ": occlusion inference method. "div"/"ofc": occlusion is inferred based on optical flow divergence/consistency. "lv": lighting variation. If it is disabled, lighting variations $l_{i-1}, l_i$ are set to 0. "od": online denoising. "wl": warping loss regularizer. If it is disabled, $\Gamma$ is fine-tuned on the same datasets, but with the original loss. "dae": DAE module as described in Sec. 3.5.

row | ts | occ | lv | od      | wl | dae   (PSNR/SSIM reported for VNLnet and FastDVDnet under AWGN20, AWGN40, MG and JPEG)
1   | ✗  | div | ✗  | ✗       | ✗  | ✗
2   | ✓  | div | ✗  | ✗       | ✗  | ✗
3   | ✓  | ofc | ✗  | ✗       | ✗  | ✗
4   | ✓  | ofc | ✓  | ✗       | ✗  | ✗
5   | ✓  | ofc | ✓  | ✓       | ✗  | ✗
6   | ✓  | ofc | ✓  | ✓       | ✓  | ✗
7   | ✓  | ofc | ✓  | ✓       | ✓  | ✓
8   | ✗  | GT  | ✗  | GT flow | ✗  | ✗
9   | ✓  | GT  | ✗  | GT flow | ✗  | ✗

Overall, by combining all components, the gap between rows 2 and 9 is significantly reduced on FastDVDnet and even surpassed on VNLnet (due to lighting variation).

To demonstrate our method's robustness to noise levels, Table 2 also lists the detailed breakdown on two different Gaussian noises: AWGN20 ($\sigma$=20) and AWGN40 ($\sigma$=40). It can be seen that our proposed method achieves consistent improvement across different noise strengths.

We have presented a general framework for adapting video denoising networks to model-blind noises without utilizing clean signals. The twin sampler not only resolves the overfitting problem suffered by the naive extension of image-based methods, but also operates efficiently by reusing estimated optical flow. The remaining components further boost denoising performance via occlusion masking, lighting variation penalties and optical flow refinement. Our results indicate that in order to train a video denoiser with only noisy data, one should look at frame differences and similarities simultaneously: noise attributes can be learned from the former, while temporal information can be extracted from the latter. Our method consistently outperforms the prior art by 0.6-3.2dB PSNR on multiple noises and datasets. The significance of our method is also reflected in its generality, as it is successfully applied to multiple latest architectures.
References
1. Abdelhamed, A., Lin, S., Brown, M.S.: A high-quality denoising dataset for smartphone cameras. In: CVPR. pp. 1692–1700 (2018)
2. Alain, G., Bengio, Y.: What regularized auto-encoders learn from the data-generating distribution. JMLR (1), 3563–3593 (2014)
3. Anaya, J., Barbu, A.: RENOIR – a dataset for real low-light image noise reduction. Journal of Visual Communication and Image Representation, 144–154 (2018)
4. Anwar, S., Barnes, N.: Real image denoising with feature attention. In: ICCV (2019)
5. Arias, P., Morel, J.M.: Towards a Bayesian video denoising method. In: International Conference on Advanced Concepts for Intelligent Vision Systems. pp. 107–117 (2015)
6. Batson, J., Royer, L.: Noise2self: Blind denoising by self-supervision. In: ICML. pp. 524–533 (2019)
7. Burger, H.C., Schuler, C.J., Harmeling, S.: Image denoising: Can plain neural networks compete with BM3D? In: CVPR. pp. 2392–2399. IEEE (2012)
8. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: ECCV. pp. 611–625. Springer (Oct 2012)
9. Chen, C., Chen, Q., Xu, J., Koltun, V.: Learning to see in the dark. In: CVPR. pp. 3291–3300 (2018)
10. Chen, X., Song, L., Yang, X.: Deep RNNs for video denoising. In: Applications of Digital Image Processing (2016)
11. Chen, Y., Pock, T.: Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE TPAMI (2016)
12. Claus, M., van Gemert, J.: ViDeNN: Deep blind video denoising. In: CVPR Workshops (2019)
13. Dabov, K., Foi, A., Egiazarian, K.: Video denoising by sparse 3D transform-domain collaborative filtering. In: European Signal Processing Conference. pp. 145–149 (2007)
14. Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE TIP (2007)
15. Davy, A., Ehret, T., Morel, J.M., Arias, P., Facciolo, G.: A non-local CNN for video denoising. In: ICIP. pp. 2409–2413 (2019)
16. Dong, W., Zhang, L., Shi, G., Li, X.: Nonlocally centralized sparse representation for image restoration. IEEE TIP (2012)
17. Ehret, T., Davy, A., Arias, P., Facciolo, G.: Joint demosaicing and denoising by overfitting of bursts of raw images. ICCV (2019)
18. Ehret, T., Davy, A., Morel, J.M., Facciolo, G., Arias, P.: Model-blind video denoising via frame-to-frame training. In: CVPR. pp. 11369–11378 (2019)
19. Gu, S., Zhang, L., Zuo, W., Feng, X.: Weighted nuclear norm minimization with application to image denoising. In: CVPR. pp. 2862–2869 (2014)
20. Guo, S., Yan, Z., Zhang, K., Zuo, W., Zhang, L.: Toward convolutional blind denoising of real photographs. In: CVPR. pp. 1712–1722 (2019)
21. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: CVPR (2017)
22. Ilg, E., Saikia, T., Keuper, M., Brox, T.: Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation. In: ECCV (2018)
23. Krull, A., Buchholz, T.O., Jug, F.: Noise2void – learning denoising from single noisy images. In: CVPR. pp. 2129–2137 (2019)
24. Lebrun, M., Buades, A., Morel, J.M.: A nonlocal Bayesian image denoising algorithm. SIAM Journal on Imaging Sciences pp. 1665–1688 (2013)
25. Lehtinen, J., Munkberg, J., Hasselgren, J., Laine, S., Karras, T., Aittala, M., Aila, T.: Noise2noise: Learning image restoration without clean data. In: ICML. pp. 2971–2980 (2018)
26. Liu, D., Cheng, B., Wang, Z., Zhang, H., Huang, T.S.: Enhance visual recognition under adverse conditions via deep networks. IEEE TIP (2019)
27. Liu, D., Wen, B., Liu, X., Wang, Z., Huang, T.S.: When image denoising meets high-level vision tasks: a deep learning approach. In: IJCAI. pp. 842–848 (2018)
28. Maggioni, M., Boracchi, G., Foi, A., Egiazarian, K.: Video denoising, deblocking, and enhancement through separable 4-D nonlocal spatiotemporal transforms. IEEE TIP (9), 3952–3966 (2012)
29. Mairal, J., Bach, F.R., Ponce, J., Sapiro, G., Zisserman, A.: Non-local sparse models for image restoration. In: ICCV. pp. 54–62 (2009)
30. Mao, X., Shen, C., Yang, Y.B.: Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In: NeurIPS. pp. 2802–2810 (2016)
31. Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: CVPR (2016)
32. Meister, S., Hur, J., Roth, S.: UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. In: AAAI (2018)
33. Mildenhall, B., Barron, J.T., Chen, J., Sharlet, D., Ng, R., Carroll, R.: Burst denoising with kernel prediction networks. In: CVPR. pp. 2502–2510 (2018)
34. Moulin, P., Liu, J.: Analysis of multiresolution image denoising schemes using generalized Gaussian and complexity priors. IEEE Transactions on Information Theory pp. 909–919 (1999)
35. Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: CVPR (2016)
36. Plötz, T., Roth, S.: Benchmarking denoising algorithms with real photographs. In: CVPR. pp. 1586–1595 (2017)
37. Portilla, J., Strela, V., Wainwright, M.J., Simoncelli, E.P.: Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE TIP (2003)
38. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-assisted Intervention. pp. 234–241 (2015)
39. Roth, S., Black, M.J.: Fields of experts: A framework for learning image priors. In: CVPR (2005)
40. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena pp. 259–268 (1992)
41. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. IJCV (3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
42. Song, Y., Zhu, Y., Du, X.: Dynamic residual dense network for image denoising. Sensors (17), 3809 (2019)
43. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In: CVPR. pp. 8934–8943 (2018)
44. Sundaram, N., Brox, T., Keutzer, K.: Dense point trajectories by GPU-accelerated large displacement optical flow. In: ECCV. pp. 438–451 (2010)
45. Tassano, M., Delon, J., Veit, T.: FastDVDnet: Towards real-time video denoising without explicit motion estimation. arXiv preprint arXiv:1907.01361 (2019)
46. Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: ICCV (1998)
47. Xiph.org: Derf's Test Media Collection (accessed Nov 7, 2019), https://media.xiph.org/video/derf/
48. Xu, J., Li, H., Liang, Z., Zhang, D., Zhang, L.: Real-world noisy image denoising: A new benchmark. arXiv preprint arXiv:1804.02603 (2018)
49. Xue, T., Chen, B., Wu, J., Wei, D., Freeman, W.T.: Video enhancement with task-oriented flow. IJCV (8), 1106–1125 (2019)
50. Yue, Z., Yong, H., Zhao, Q., Zhang, L., Meng, D.: Variational denoising network: Toward blind noise modeling and removal. In: NeurIPS (2019)
51. Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE TIP (2017)
52. Zhang, K., Zuo, W., Zhang, L.: FFDNet: Toward a fast and flexible solution for CNN-based image denoising. IEEE TIP (2018)