NFCNN: Toward a Noise Fusion Convolutional Neural Network for Image Denoising
Maoyuan Xu† and Xiaoping Xie‡
School of Mathematics, Sichuan University, Chengdu 610064, China

∗ This work was supported in part by the National Natural Science Foundation of China (11771312).
† Email: [email protected]
‡ Corresponding author. Email: [email protected]
Abstract
Deep learning based methods have achieved the state-of-the-art performance in image denoising. In this paper, a deep learning based denoising method is proposed and a module called the fusion block is introduced in the convolutional neural network. For this so-called Noise Fusion Convolutional Neural Network (NFCNN), there are two branches in its multi-stage architecture. One branch aims to predict the latent clean image, while the other one predicts the residual image. A fusion block is contained between every two stages: it takes the predicted clean image and the predicted residual image as a part of its inputs and outputs a fused result to the next stage. NFCNN has an attractive texture preserving ability because of the fusion block. To train NFCNN, a stage-wise supervised training strategy is adopted to avoid the vanishing gradient and exploding gradient problems. Experimental results show that NFCNN achieves competitive denoising results when compared with some state-of-the-art algorithms.
Keywords: Deep Learning, Fusion Block, Two-Branch Network Architecture, Image Denoising.
1 Introduction

Image denoising is a low-level computer vision task whose output serves as the basic input of mid- and high-level tasks, e.g. image segmentation [2, 18, 36, 42, 46], pose estimation [7, 13, 16, 32, 41, 47] and image classification [10, 17, 24, 27, 34, 38]. Since the very first digital image appeared, algorithms for image denoising have been extensively studied, aiming to obtain images of higher visual quality with less noise. Nowadays, such methods are in general divided into two classes. One class, also regarded as model-driven approaches, are conventional methods based on traditional operations. The other class, called data-driven approaches, apply Artificial Neural Networks (ANNs) to achieve the state-of-the-art performance in the image denoising field. The conventional methods model noise by employing some mathematical theory to construct the denoising function. Nevertheless, such methods may suffer from some unexpected effects such as the staircase effect, the speckle effect and the over-smoothing effect. The data-driven algorithms utilize the data-learning ability of ANNs to remove noise from noisy images.

There is an a priori hypothesis for image denoising, i.e. the noise n, the clean image x and the noisy observation y have the following relationship:

\[
y = x + n,
\]

where the noise n has many modes, such as Additive White Gaussian Noise (AWGN) with a given standard deviation, Poisson noise and salt & pepper noise. The goal of image denoising is to recover the clean image x from a noisy observation y. A conventional method builds a mathematical model to approximate the process (y − n) so as to obtain an approximate latent clean image x̂. In recent decades, a considerable number of conventional methods have done well on image denoising, e.g. sparse models [12, 30], nonlocal self-similarity (NSS) models [5, 11, 44] and PDE based models [3, 9, 22, 33, 45].

For a data-driven method, thanks to the universal approximation ability of ANNs, a function f can be learned from data to obtain, by x̂ = f(y), an approximate latent clean image x̂ from the noisy observation y. There has been significant progress in this field with the development of deep ANNs [6, 8, 15, 31, 43, 48, 49], especially deep CNNs [8, 15, 31, 48, 49].

Inspired by the PM model [33], a Trainable Nonlinear Reaction Diffusion (TNRD) network of good performance was presented in [8]; it has a multi-stage structure and applies a stage-wise supervised training strategy during the model training phase. After TNRD, another powerful network, called DnCNN, was developed in [48]. Different from TNRD, DnCNN directly creates its model. The architecture of DnCNN is a single chain of stacked convolutional blocks, where one convolutional block, except for the first and last blocks, contains one convolutional layer, one batch normalization layer [21] and one nonlinear activation layer, and residual learning [20] is employed to predict the residual image so as to boost the performance. As shown in [48], the denoising performance of DnCNN is enhanced by combining batch normalization with residual learning.
Later, a fast and flexible denoising convolutional neural network (FFDNet) was proposed in [49] to denoise images with different noise levels and spatially variant noise, where a noise level map M is used to guide the model to a better denoising performance. It should be mentioned that all of the above algorithms are designed for additive white Gaussian noise and their performance is limited for real-world photograph denoising tasks. To address this limitation, a convolutional blind denoising network (CBDNet) of outstanding performance was invented in [15] by incorporating network architecture, noise modelling, and asymmetric learning. CBDNet contains a noise estimator sub-network to shrink the gap between real-world data and synthetic data, and combines real-world noisy data with synthetic data in the network training so as to make the learned model applicable to real images.

To the best of our knowledge, so far no method fuses noise with the predicted clean image to generate a better result, although some of the above algorithms utilize noise as their input or output. In this work, we propose a Noise Fusion Convolutional Neural Network (NFCNN) to fuse the intermediate predicted noise with the intermediate predicted clean image and the original input through a fusion block. The information contained in the noise of an image is abundant, and noise is not always harmful to the model: it can boost the generalization performance of an ANN if the information inside the noise is incorporated appropriately. Through the fusion block, NFCNN is able to excavate the information of the noise to generate a better denoised result. The fusion block mixes the intermediate predicted clean image, the intermediate predicted noise and the original input to output a fused image. Since the fusion block needs three inputs, our proposed NFCNN has to simultaneously output the predicted clean image and the predicted noise. With this two-branch architecture, a stage-wise supervised training is taken as our training strategy to boost the generalization performance and to avoid the vanishing gradient and exploding gradient problems. A batch normalization layer is contained in each convolutional block of NFCNN. The branch of NFCNN that outputs the predicted noise can be regarded as an application of residual learning. We note that only AWGN is considered to train the model in our work, as it is one of the most universal noise modes.

Extensive experiments demonstrate that our NFCNN, trained by the stage-wise supervised training strategy, yields competitive denoising performance when compared with state-of-the-art methods in terms of the PSNR metric, such as BM3D [11], WNNM [14], TNRD [8], DnCNN [48] and FFDNet [49]. NFCNN surpasses FFDNet by at least 0.02dB when the noise levels δ are 15 and 25, DnCNN outperforms FFDNet by 0.02dB, and NFCNN exceeds DnCNN by at least 0.01dB when δ = 50 and 75. DnCNN has better performance than FFDNet when δ = 15, and the best performance still belongs to NFCNN. Experiments are also conducted on the Kodak24 and McMaster datasets and give similar results. More detailed training settings and experimental results are described in Section 4.

The rest of this work is organized as follows. Section 2 gives a brief summarization of recent related works. Section 3 presents a high level concept of our proposed NFCNN network architecture and a detailed description of the fusion block. Extensive experimental results are displayed in Section 4 to demonstrate the denoising performance of our method. Finally, Section 5 is devoted to concluding remarks.
2 Related Works

As mentioned before, there are mainly two classes of methods aimed at image denoising: conventional methods based on traditional operations, and data-driven methods based on ANNs. Since the proposed method is related to ANNs, some of the important works in this field are summarized in this section.
2.1 TNRD

Figure 1: The architecture of TNRD.

The basic idea of TNRD by Chen & Pock [8] is to design a learning based nonlinear reaction diffusion model for image processing. Before TNRD, Perona and Malik [33] proposed the following classic nonlinear diffusion PDE, called the PM model, for image denoising:

\[
\frac{\partial u}{\partial t} = \operatorname{div}\!\big(g(|\nabla u|)\,\nabla u\big), \qquad u|_{t=0} = f,
\tag{1}
\]

where ∇ denotes the gradient operator, div is the divergence operator, t represents the time, and f is an initial image to be processed. Here g(·) is regarded as an edge-stopping function [4]; a typical g-function is of the form g(z) = 1/(1 + z²).

Evolved from the discretization scheme of (1), Chen & Pock proposed a multi-stage trainable nonlinear reaction diffusion model of the form

\[
\frac{u_t - u_{t-1}}{\Delta t} = -\underbrace{\sum_{i=1}^{N_k} (K_i^t)^\top \phi_i^t\big(K_i^t u_{t-1}\big)}_{\text{diffusion term}} \; - \; \underbrace{\psi^t(u_{t-1}, f)}_{\text{reaction term}},
\]

where K_i ∈ R^{N×N} is a highly sparse matrix and Δt is set to 1 in practice. The functions φ_i^t and ψ^t describe the diffusion and reaction processes, respectively. The matrix K_i can be regarded as a convolutional operator and can be replaced by a learnable module, i.e. an artificial neural network, and N_k is the number of filters. The concept graph of the TNRD architecture is summarized in Fig. 1.

Due to its deep network architecture, TNRD easily encounters the vanishing gradient issue during the training process. Thus, a stage-wise scheme of greedy training is used as a training trick to optimize the cost function

\[
\mathcal{L}(\Theta^t) = \sum_{s=1}^{S} \ell\big(u_t^s, u_{gt}^s\big),
\]

where u_t^s is the output of stage t for the s-th sample, u_gt^s is the corresponding ground-truth label, and ℓ(·,·) denotes the L2 loss.

Extensive experiments show that TNRD has better performance than the conventional methods not only in image denoising, but also in some other aspects like single image super-resolution and JPEG deblocking. There are three possible points that explain why TNRD surpasses other methods in terms of the PSNR metric [8]:

• Anisotropy. Convolutional filters are obtained by training, which may lead to different kernels with anisotropy along different directions.

• Higher order. The learned filters can have assorted (even fractional) orders of derivatives.

• Adaptive forward/backward diffusion through the learned nonlinear functions. Nonlinear mappings or functions are also acquired by training to enrich the smoothing ability of the diffusion process.

In addition, the trained model is lightweight to run on a GPU.
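As a concrete illustration of the PM model (1), the following NumPy sketch (our own, not code from [33] or [8]) performs explicit diffusion steps on a periodic grid with g(z) = 1/(1 + z²); the step size dt and the iteration count are illustrative.

    import numpy as np

    def pm_step(u, dt=0.1):
        # forward differences approximate grad(u) (periodic boundary)
        ux = np.roll(u, -1, axis=1) - u
        uy = np.roll(u, -1, axis=0) - u
        g = 1.0 / (1.0 + ux ** 2 + uy ** 2)        # edge-stopping g(|grad u|)
        # backward differences give the divergence of g * grad(u)
        flux_x, flux_y = g * ux, g * uy
        div = (flux_x - np.roll(flux_x, 1, axis=1)
               + flux_y - np.roll(flux_y, 1, axis=0))
        return u + dt * div                         # explicit Euler step of (1)

    u = np.random.rand(64, 64)                      # stand-in noisy image
    for _ in range(20):
        u = pm_step(u)

Because g decays where the gradient is large, diffusion is suppressed across edges and strong inside smooth regions, which is exactly the behavior TNRD makes trainable.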
2.2 DnCNN

Figure 2: High level architecture of DnCNN.

DnCNN [48] is a method that incorporates residual learning [20] and batch normalization [21] so as to achieve good performance in image restoration. According to the experiments executed with DnCNN, residual learning and batch normalization not only speed up the training process, but also largely boost the generalization performance of DnCNN. Unlike residual learning schemes that employ a large number of residual units in the network, DnCNN uses a single residual unit to predict the residual image. Batch normalization has the ability to speed up the training phase and enhance performance in the inference process, and can also avoid overfitting to some extent.
It has been shown in the experiments of DnCNN that the training becomes more stable and leads to better testing performance after combining residual learning and batch normalization in the training process. DnCNN employs 3 × 3 convolutional kernels and minimizes the loss function

\[
\ell(\Theta) = \frac{1}{2N}\sum_{i=1}^{N} \big\| \mathcal{R}(y_i;\Theta) - (y_i - x_i) \big\|_F^2,
\]

where Θ is the trainable parameters vector, N represents the number of training samples, F(y) = x is the underlying denoising mapping, and R(·) is the residual mapping.

The depth of the DnCNN network is set to 17 for gray scale image denoising and 20 for color image denoising, respectively. Extensive experiments show that DnCNN can achieve state-of-the-art performance. DnCNN performs well not only in image denoising, but also in some other fields like single image super-resolution and JPEG image deblocking.
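As a concrete reading of this residual-learning setup, here is a simplified PyTorch sketch in the spirit of DnCNN (our paraphrase of [48], not the released code): a first Conv+ReLU block, middle Conv+BN+ReLU blocks, and a final Conv layer, trained to predict the residual y − x.

    import torch
    import torch.nn as nn

    class DnCNNLike(nn.Module):
        def __init__(self, depth=17, width=64, image_channels=1):
            super().__init__()
            layers = [nn.Conv2d(image_channels, width, 3, padding=1),
                      nn.ReLU(inplace=True)]
            for _ in range(depth - 2):            # middle blocks: Conv + BN + ReLU
                layers += [nn.Conv2d(width, width, 3, padding=1, bias=False),
                           nn.BatchNorm2d(width),
                           nn.ReLU(inplace=True)]
            layers.append(nn.Conv2d(width, image_channels, 3, padding=1))
            self.net = nn.Sequential(*layers)

        def forward(self, y):
            return self.net(y)                    # predicts the residual R(y)

    model = DnCNNLike()
    y, x = torch.randn(4, 1, 40, 40), torch.randn(4, 1, 40, 40)   # noisy / clean
    residual = model(y)
    loss = 0.5 * ((residual - (y - x)) ** 2).sum() / y.shape[0]   # matches l(Theta)

At inference time the clean estimate is recovered as x̂ = y − R(y).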
2.3 FFDNet

Figure 3: The architecture of FFDNet for image denoising.

Aiming to handle cases with different levels of noise and spatially variant noise, FFDNet [49] adopts a non-uniform noise level map M as one of its inputs. For an original image of size W × H × C, four downsampled sub-images of size (W/2) × (H/2) × C are utilized to improve the efficiency of the network. Here W is the width of the image, H the height, and C the number of channels, with C = 1 for a gray scale image and C = 3 for a color image.

When the noise level in the testing process has a large gap to the one used in the training process, the result will be over-smoothed or under-smoothed. With the help of the noise level map M, FFDNet is able to deal with the spatially variant noise problem, which is regarded as a challenge in the image denoising field, with high performance. Similar to DnCNN, FFDNet contains residual learning and batch normalization so as to boost its generalization performance, and the convolutional kernel size is set to 3 × 3. The number of convolutional layers is empirically set to 15 for a gray scale image and 12 for a color image, respectively. It should be noted that FFDNet does not predict the residual image. The whole architecture of FFDNet is illustrated by Fig. 3.

Since FFDNet uses the noise level map M as one of its inputs and does not predict the residual image, its loss function, different from that of DnCNN, is of the form

\[
\ell(\Theta) = \frac{1}{2N}\sum_{i=1}^{N} \big\| \mathcal{F}(y_i, M_i;\Theta) - x_i \big\|_2^2,
\]

where Θ is the trainable parameters vector, N the number of training samples, M_i the corresponding noise level map for sample y_i, and F(·,·) the nonlinear mapping modelled by FFDNet. Notice that the Adam algorithm [23] is employed as the optimizer in the training process of FFDNet to minimize ℓ(Θ).
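To make the input preparation concrete, here is a small PyTorch sketch (ours, not the authors' code) of FFDNet-style downsampling into four sub-images plus a uniform noise level map; tensor shapes follow the description above.

    import torch
    import torch.nn.functional as F

    y = torch.randn(1, 3, 128, 128)                   # noisy color image (C = 3)
    sub = F.pixel_unshuffle(y, downscale_factor=2)    # (1, 12, 64, 64): 4 sub-images
    sigma = 25.0 / 255.0                              # noise level on a [0, 1] scale
    M = torch.full((1, 1, 64, 64), sigma)             # uniform noise level map
    net_input = torch.cat([sub, M], dim=1)            # (1, 13, 64, 64) fed to the CNN
    # after the CNN, F.pixel_shuffle(output, 2) restores the original resolution

For spatially variant noise, M simply stops being constant: each pixel of M carries the local noise level.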
2.4 CBDNet

Figure 4: Illustration of CBDNet for blind denoising of real-world noisy photographs.

Though having achieved impressive performance in removing additive white Gaussian noise, deep CNNs are limited on real-world noise. CBDNet [15] was produced to improve the robustness and practicability of deep denoising models. On one hand, the denoising performance of a CNN largely depends on the noise level gap between the real-world data and the synthetic data. On the other hand, the distributions of real-world noise and synthetic noise are actually different, which indicates that training operates over two different domains. CBDNet utilizes a subnetwork as a noise estimation extractor to decrease the difference between the two domains, and uses an asymmetric loss to make it robust. An illustration of the CBDNet architecture is shown in Fig. 4.

The loss function L of CBDNet consists of three parts, i.e.

\[
\mathcal{L} = \mathcal{L}_{rec} + \lambda_{asymm} \mathcal{L}_{asymm} + \lambda_{TV} \mathcal{L}_{TV},
\]

where L_rec, L_asymm and L_TV denote respectively the reconstruction loss, the asymmetric loss and the TV loss, and λ_asymm and λ_TV are the corresponding weight factors. The reconstruction loss is defined as

\[
\mathcal{L}_{rec} = \| \hat{x} - x \|_2^2,
\]

where x̂ represents the output of CBDNet. According to the estimated noise level σ̂(y_i) and the ground-truth noise level σ(y_i) at the i-th pixel, the asymmetric loss L_asymm and the TV loss L_TV can be formulated as

\[
\mathcal{L}_{asymm} = \sum_i \big| \alpha - \mathbb{I}_{(\hat{\sigma}(y_i) - \sigma(y_i)) < 0} \big| \cdot \big( \hat{\sigma}(y_i) - \sigma(y_i) \big)^2,
\qquad
\mathcal{L}_{TV} = \| \nabla_h \hat{\sigma}(y) \|_2^2 + \| \nabla_v \hat{\sigma}(y) \|_2^2,
\]

respectively, where 𝕀_e = 1 for e < 0 and 𝕀_e = 0 for e ≥ 0, α is a hyperparameter, and ∇_h and ∇_v are gradient operators along the horizontal and vertical directions.
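The two noise-level losses translate directly into code; below is our PyTorch transcription of the formulas above (the value α = 0.3 is only an example, not taken from this paper).

    import torch

    def asymmetric_loss(sigma_hat, sigma, alpha=0.3):
        # with alpha < 0.5, under-estimation (sigma_hat < sigma) is penalized more
        diff = sigma_hat - sigma
        indicator = (diff < 0).float()               # I_e = 1 where e < 0
        return (torch.abs(alpha - indicator) * diff ** 2).sum()

    def tv_loss(sigma_hat):
        dh = sigma_hat[..., :, 1:] - sigma_hat[..., :, :-1]   # horizontal gradient
        dv = sigma_hat[..., 1:, :] - sigma_hat[..., :-1, :]   # vertical gradient
        return (dh ** 2).sum() + (dv ** 2).sum()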
3 The Proposed NFCNN

This section constructs a noise fusion convolutional neural network (NFCNN) for image denoising. NFCNN has a hierarchical structure that builds a deep neural network (DNN). As we know, one of the drawbacks of DNNs is the difficulty of training. We therefore take a stage-wise supervised training strategy during the training phase to avert vanishing gradient and exploding gradient issues. The fusion block fuses the intermediate predicted clean image, the intermediate predicted noise and the original input to generate a mixed output. The hint for the fusion block comes from the fact that the exchange of information in a network is always helpful for deep learning. The output of the fusion block is taken as an input of the next stage.

3.1 Network Architecture

Fig. 5 illustrates the high level concept of the network architecture of NFCNN. There are multiple stages contained in NFCNN to simultaneously output intermediate predicted noise and intermediate predicted clean images, and there is a fusion block between every two stages. Each fusion block blends the data of the predicted noise, the predicted clean image and the original input together and outputs a fused result to the next stage.
Figure 5: A high level concept of the network architecture of the proposed NFCNN.

Let S_i denote the i-th stage for 0 < i ≤ T, with T > 1. There are two branches in S_i to predict the latent clean image and the noise by convolutional blocks, respectively. The detailed design of this convolutional block is shown in Fig. 6; nine convolutional layers with 3 × 3 kernels are stacked in the block.

Figure 6: Detailed design of the convolutional block.

3.2 Fusion Block

As displayed in Fig. 5, there is a fusion block between every two stages, except for the last stage, to achieve information fusion in our network. Though a few methods use residual learning, i.e. let the ANN learn the noise of the image, they do not make enough use of the noise. The information contained in noise is abundant, and integrating it into the network may help the model obtain a better result. To absorb the texture of the image contained in the noise, NFCNN takes the original noisy image, the intermediate predicted clean image and the intermediate predicted noise as inputs of the fusion block to generate a fused result. Thus, the fusion block can be regarded as an encoder and the next stage is treated as a decoder: the next stage directly takes the outputs of the fusion block as inputs to generate its own outputs.

The architecture of the fusion block is illustrated in Fig. 7, where ⊕ denotes the concatenating operation along the channel axis. In the first layer of the fusion block, every two of the inputs are combined without repetition as the input of convolutional block B, and the remaining one is taken as the input of convolutional block A. The goal of the first layer is to separately encode the predicted noise, the predicted clean image and the original noisy image. Convolutional block A handles the encoding of a single input and convolutional block B the encoding of a composed input. Fig. 8 and Fig. 9 show detailed designs, similar to Fig. 6, for convolutional blocks A and B, respectively.

Figure 7: Architecture of the fusion block. ⊕ denotes the concatenating operation along the channel axis.
Figure 8: Detailed design of convolutional block A.

In the second layer, the encoded outputs of the first layer are concatenated along the channel axis as the input of convolutional block C, whose architecture is shown in Fig. 10. The outputs of C are composed again before convolutional block D, whose architecture is given in Fig. 11. Note that the number of convolutional layers in blocks A, B, C and D increases gradually from three to six, so as to make the encoding process of the noise more abstract as the number of layers grows. Through the fusion block, the information contained in the noise is extracted and encoded into the outputs.
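Since the wiring of Figs. 7-11 may be easier to grasp in code, here is a schematic PyTorch sketch of the fusion block. Only the three-input concatenation pattern and the 3/4/5/6 layer counts follow the text; channel widths, the pairing order and the placement of activations are our guesses.

    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch, n_layers):
        layers = []
        for i in range(n_layers):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3,
                                 padding=1, padding_mode='replicate'),
                       nn.LeakyReLU(0.25)]
        return nn.Sequential(*layers)

    class FusionBlock(nn.Module):
        def __init__(self, c=1, ch=32):
            super().__init__()
            # first layer: each single input -> block A, each pair -> block B
            self.enc_a = nn.ModuleList([conv_block(c, ch, 3) for _ in range(3)])
            self.enc_b = nn.ModuleList([conv_block(2 * c, ch, 4) for _ in range(3)])
            # second layer: concatenated encodings -> block C, then block D
            self.enc_c = conv_block(6 * ch, ch, 5)
            self.dec_d = nn.Sequential(conv_block(ch, ch, 6),
                                       nn.Conv2d(ch, c, 3, padding=1))

        def forward(self, y, clean, noise):
            singles = [y, clean, noise]
            pairs = [(clean, noise), (y, noise), (y, clean)]  # pairs w/o repetition
            feats = [a(s) for a, s in zip(self.enc_a, singles)]
            feats += [b(torch.cat(p, dim=1)) for b, p in zip(self.enc_b, pairs)]
            fused = self.enc_c(torch.cat(feats, dim=1))
            return self.dec_d(fused)                 # input of the next stage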
Figure 9: Detailed design of convolutional block B.

Figure 10: Detailed design of convolutional block C.

Figure 11: Detailed design of convolutional block D.

3.3 Training Strategy

As mentioned before, our NFCNN has a deep structure due to the staged network architecture and the fusion blocks. It is thus possible to encounter during training issues like vanishing gradients, exploding gradients, overfitting and being caught in a poor local minimum. A possible way to avoid such situations is the stage-wise supervised training strategy.

The input of NFCNN is a noisy observation y = x + n, where x is the clean image and n denotes the noise. The goal of NFCNN is to learn a mapping F(y) = x̂ that generates a latent clean image x̂ from the noisy observation y. Let F_i denote the i-th fusion block; then the whole pipeline can be formulated as

\[
\begin{cases}
y_1 = y, & i = 0,\\[2pt]
(C_i, N_i) = S_i(y_i), \quad y_{i+1} = F_i(y, C_i, N_i), & 0 < i < T,\\[2pt]
(C_T, N_T) = S_T(y_T), & i = T,
\end{cases}
\tag{2}
\]

where C_i and N_i denote the intermediate predicted clean image and the intermediate predicted noise, respectively.

The loss function of NFCNN is of the form

\[
\mathcal{L} = \mathcal{L}_C + \alpha \, \mathcal{L}_N,
\tag{3}
\]

where α is a weight factor to balance L_C and L_N, the loss functions for the intermediate predicted clean images and the intermediate predicted noises, which are respectively given by

\[
\mathcal{L}_C(\Theta) = \frac{1}{2K} \sum_{i=1}^{K} \big\| \mathcal{F}_C(y_i; \Theta) - \hat{C}_i \big\|_2^2 + \frac{\beta}{K} \sum_{i=1}^{K} \big\| \mathcal{F}_C(y_i; \Theta) - \hat{C}_i \big\|_1,
\tag{4}
\]

\[
\mathcal{L}_N(\Theta) = \frac{1}{2K} \sum_{i=1}^{K} \big\| \mathcal{F}_N(y_i; \Theta) - \hat{N}_i \big\|_2^2 + \frac{\beta}{K} \sum_{i=1}^{K} \big\| \mathcal{F}_N(y_i; \Theta) - \hat{N}_i \big\|_1.
\tag{5}
\]

Here Θ is the trainable parameters vector, K the number of training samples, F_C the predicted clean image, F_N the predicted noise, Ĉ the ground-truth clean image, N̂ the ground-truth noise, and β a nonnegative factor. Note that the case β = 0 means that we purely use the L2 loss during training.

In the loss function (3), we combine the L2 loss with the L1 loss to boost the generalization performance of NFCNN. Using the L2 loss during the training phase aims to match the labels of the ground-truth clean image and noise in the sense of least squares, and the L2 loss is related to the PSNR metric, which ensures the visual quality of our predicted results. In contrast, the L1 loss forces the predicted result to approximate the label at the pixel level. However, since the L1 loss is related to the SSIM metric, it may reduce the convergence speed and the performance of our model if it occupies an excessive proportion. Hence, the factor β is utilized to control the influence of the L1 loss.

Since the convolutional operation reduces the size of its input, there are many padding modes to keep the size unchanged. The most common and popular mode is zero padding, which, however, might result in artificial boundary or synthetic effects. Besides, zero padding is found to reduce the generalization performance of NFCNN, though it may be serviceable for higher level computer vision tasks like image classification [24, 34, 38], pose estimation [7, 32, 41] and object detection [19, 26, 35]. As a result, we apply the replication padding mode in our method, whose core idea is to copy and extend the boundary values rather than to pad with zeros.
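A compact sketch of the objective (3)-(5), under the assumption that the per-sample norms are averaged over the batch:

    import torch

    def branch_loss(pred, target, beta=0.01):
        # (1/2K) * sum of squared L2 norms  +  (beta/K) * sum of L1 norms
        l2 = 0.5 * ((pred - target) ** 2).flatten(1).sum(1).mean()
        l1 = (pred - target).abs().flatten(1).sum(1).mean()
        return l2 + beta * l1

    def nfcnn_loss(pred_clean, pred_noise, clean, noise, alpha=1.0, beta=0.01):
        return (branch_loss(pred_clean, clean, beta)             # L_C, Eq. (4)
                + alpha * branch_loss(pred_noise, noise, beta))  # alpha * L_N, Eq. (5)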
4 Experiments

To train our NFCNN with adequate data, we combine several existing datasets into our training set.

• BSDS500 [1]. The BSDS500 dataset from Berkeley is contained in our training set. It provides 500 color images originally designed for the contour detection task. For an image denoising task only the clean images are needed, so the ground-truth samples of this dataset are suitable for our purpose.

• Waterloo Exploration Database [28]. This part contributes 4,744 high quality images from the Waterloo Exploration Database, a large-scale dataset built for testing the generalization performance of image quality assessment (IQA) models.

• Flickr2K [25]. The last part of our training data comes from the Flickr2K dataset. It has 2,650 2K images intended for the training of super-resolution models. However, 2K resolution is too large for training our denoising network, so we crop the images with a fixed patch size.

Note that the test samples are not contained in the training dataset. The whole training dataset contains 7,794 samples, and we adopt the validation set of BSDS500 as our validation dataset.
To avoid overfitting as far as possible, some data augmentation methods are employed in the training of NFCNN for both gray scale and color images; a code sketch of the pipeline is given after the list.

• Random Cropping. The input of NFCNN is randomly cropped with a fixed patch size, e.g. 180 × 180.

• Flipping. The input is flipped in three modes, i.e. horizontal flipping, vertical flipping and the combination of the two. By this data augmentation method, NFCNN can learn different patterns of image and noise.

• Image Blurring. The image blurring augmentation is applied before adding the noise to the original clean image, aiming to simulate the issue of shooting jitter.
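A minimal sketch (ours) of this augmentation pipeline for a gray scale image stored as a NumPy array; the probabilities and the blur strength are illustrative.

    import random
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def augment(img, patch=180):
        # img: clean gray scale image as a (H, W) float array, with H, W >= patch
        h, w = img.shape
        i, j = random.randint(0, h - patch), random.randint(0, w - patch)
        img = img[i:i + patch, j:j + patch]           # random cropping
        if random.random() < 0.5:
            img = img[:, ::-1]                        # horizontal flip
        if random.random() < 0.5:
            img = img[::-1, :]                        # vertical flip (both firing
                                                      # yields the combined mode)
        if random.random() < 0.3:
            img = gaussian_filter(img, sigma=1.0)     # blur before noise is added,
                                                      # simulating shooting jitter
        return np.ascontiguousarray(img)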
Table 1: Results of different noise levels (δ = 15, 25, 50, 75) on Kodak24 for NFCNN with 2, 3 and 4 stages, with and without the fusion block (the latter denoted NFCNN(∗)).

To show the effectiveness of the fusion block, experiments are conducted in which the fusion block between two neighbouring stages is deleted. NFCNNs with 2, 3 and 4 stages are taken as the target models. According to Section 3.2, information contained in the predicted clean image and the predicted noise can be exchanged through the fusion block. The noise level δ is set to 15, 25, 50 and 75 during the experiments, and the Kodak24 dataset is employed as the test set to verify the effectiveness of the fusion block.

Experimental results are displayed in Table 1, where NFCNN(∗) denotes the NFCNN model without the fusion block and the metric is PSNR(dB). We can see that NFCNN outperforms NFCNN(∗), i.e. NFCNN with the fusion block behaves better than the one without it. The visualization results in Fig. 12 also show that the denoised results are refined by the fusion block. This means that the fusion block is helpful for the exchange of information between the predicted clean image and the predicted noise and has the ability to decrease artificial effects. We also see that NFCNN performs best when the number of stages is set to 2. Based on this observation, we set the number of stages to 2 in all the following experiments.

Figure 12: Visualization results on the effectiveness of the fusion block under noise level δ = 25, where the first, second, third and last columns are the original images, the noisy images, the results of NFCNN(∗) and the results of NFCNN, respectively.

During training, we set α = 1 in (3) and β = 0.01 in (4) and (5) to control the influence of the L1 loss. LeakyReLU [29] with slope 0.25 is utilized in NFCNN as the nonlinear activation function. According to the receptive field, the number of stages in NFCNN should be set to an appropriate value to keep a good balance between the size of the receptive field and the patch size 180 × 180. We add AWGN with noise level δ to the clean images to generate synthetic data for training; when using normally distributed noise with the cropping operation, i.e. clipping the pixel values of the image to keep them in the range [0, 255], the synthetic data change as illustrated in Fig. 13. The noise level is set as δ = 15, 25, 50, 75 for the different denoising tasks. The batch size is set to 6 to keep a balance between the training speed and the GPU memory.

Figure 13: Comparison of synthetic data with and without the cropping operation when the noise level is δ = 25: (a) original image, (b) noisy image with the cropping operation, (c) noisy image without the cropping operation.

To give NFCNN a faster convergence speed, we employ the Adam optimizer [23] with the default setting of its momentum parameters; the learning rate is divided by 10 after 300,000 steps of training. The kernel size of all convolutional layers is set to 3 × 3.
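The training configuration above can be summarized in a short PyTorch sketch; the stand-in model, the initial learning rate and the loop length are illustrative placeholders only.

    import torch
    import torch.nn as nn

    model = nn.Conv2d(1, 1, 3, padding=1, padding_mode='replicate')  # stand-in net
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)              # lr illustrative
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=300_000, gamma=0.1)

    def synthesize(clean, delta=25.0):
        noisy = clean + torch.randn_like(clean) * delta   # AWGN with std delta
        return noisy.clamp(0.0, 255.0)                    # cropping (clipping) step

    for step in range(1000):                              # shortened demo loop
        clean = 255.0 * torch.rand(6, 1, 180, 180)        # batch 6, 180x180 patches
        noisy = synthesize(clean)
        loss = ((model(noisy) - clean) ** 2).mean()       # placeholder objective
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()                                      # lr / 10 after 300,000 steps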
In this subsection, we compare our model with some state-of-the-art methods: BM3D [11], WNNM [14], MLP [6], TNRD [8], DnCNN [48] and FFDNet [49]. Experimental results are listed in Tables 2-6 to demonstrate the competitive performance of our NFCNN. Since some methods are not open source, we can only borrow results from their corresponding original papers. Note that we do not compare NFCNN with CBDNet [15], which addresses real-world denoising tasks.

For the BSD68 test dataset, all the algorithms are tested with added noise of levels δ = 15, 25, 50 and 75 for gray scale images. Results based on PSNR(dB) are displayed in Table 2. We can see that our proposed NFCNN with 2 stages behaves well in all cases. NFCNN surpasses FFDNet by 0.16dB when the noise level is δ = 15, and outperforms DnCNN when δ = 15, 25 and 75, while the two have similar generalization performance when δ = 50. The reason for the similar performance when δ = 50 may be that DnCNN reached a better minimum point during its training in this case than in the other cases.

Table 2: The average PSNR(dB) of different methods on the BSD68 dataset with noise level δ for gray scale images.

    Methods   BM3D    WNNM    MLP     TNRD    DnCNN   FFDNet   NFCNN
    δ = 15    31.07   31.37   -       31.42   31.72   31.63    31.79
    δ = 25    28.57   28.83   28.96   28.92   29.23   29.19
    δ = 50    25.62   25.87   26.03   25.97
    δ = 75    24.21   24.40   24.59   -       24.64   24.79

CBSD68 is the color version of BSD68, and the corresponding results are shown in Table 3. When the noise level δ is 15 or 25, NFCNN exceeds DnCNN by at least 0.01dB, and DnCNN outperforms FFDNet by 0.02dB. However, NFCNN is defeated by FFDNet when δ = 50.

Table 3: The average PSNR(dB) of different methods on the CBSD68 dataset with noise level δ for color images.

    Methods   BM3D    DnCNN   FFDNet   NFCNN
    δ = 15    33.52   33.89   33.87
    δ = 25    30.71   31.23   31.21
    δ = 50    27.38   27.92
    δ = 75    25.74   24.47   26.24

Set12 is another well known dataset for image denoising. According to Table 4, NFCNN surpasses FFDNet by at least 0.03dB when δ = 50 and 75, and behaves better than DnCNN and FFDNet when δ = 15. When δ = 25, DnCNN and FFDNet have similar denoising performance and surpass NFCNN by 0.01dB.

Table 4: The average PSNR(dB) of different methods on the Set12 dataset with noise level δ for gray scale images.

    Methods   BM3D    WNNM    MLP     TNRD    DnCNN   FFDNet   NFCNN
    δ = 15    32.37   32.70   -       32.50   32.86   32.75
    δ = 25    29.97   30.26   30.03   30.06
    δ = 50    26.72   27.05   26.78   26.81   27.18   27.32
    δ = 75    24.91   25.23   25.07   -       25.20   25.49

Table 5 gives the average PSNR(dB) of the different methods on the Kodak24 dataset. We can see that in the cases of δ = 15, 25 and 75, NFCNN still has the best performance in terms of PSNR, but FFDNet gives a better result when δ = 50.

Table 5: The average PSNR(dB) of different methods on the Kodak24 dataset with noise level δ for color images.

    Methods   BM3D    DnCNN   FFDNet   NFCNN
    δ = 15    34.28   34.48   34.63
    δ = 25    31.68   32.03   32.13
    δ = 50    28.46   28.85
    δ = 75    26.82   25.04   27.27

Table 6 shows the results on the McMaster dataset. Thanks to the use of the fusion block, NFCNN has the best performance; note that NFCNN reaches the same performance as FFDNet when δ = 50.

Table 6: The average PSNR(dB) of different methods on the McMaster dataset with noise level δ for color images.

    Methods   BM3D    DnCNN   FFDNet   NFCNN
    δ = 15    34.06   33.44   34.66
    δ = 25    31.66   31.51   32.35
    δ = 50    28.51   28.61
    δ = 75    26.79   25.10   27.33

To demonstrate the effectiveness of NFCNN, some denoised results are displayed in Fig. 14. We note that the textures of the objects in the pictures remain sharp after denoising. As an example, Fig. 15 helps verify the texture preserving performance of NFCNN: the eye of the baboon is cropped and resized to make a detailed comparison between the methods, with the pictures corrupted by AWGN with δ = 25. We can see that CBM3D leads to an over-smoothed result; FFDNet has better denoising performance than CBM3D but fails to preserve the circle around the iris, while NFCNN keeps a good balance between denoising and texture preserving. Notice that artificial effects appear in the result of FFDNet, whereas NFCNN generates a more natural result. The pleasant texture preserving performance of NFCNN mostly benefits from the information exchange in the fusion block, which helps the model extract details from the residual image.

Figure 14: Results of NFCNN with noise level δ = 25. The first, second and last columns are the original images, the noisy images and the denoised results, respectively.

Figure 15: Baboon with noise level δ = 25: (a) original image, (b) noisy image, (c) CBM3D, (d) FFDNet, (e) NFCNN.

5 Conclusion

We have proposed a deep learning based denoising method, called NFCNN, with a fusion block module and a multi-stage structure. Thanks to the information exchange through the fusion block, NFCNN has a pleasant texture preserving ability. A stage-wise training strategy has been adopted in NFCNN to avoid the vanishing gradient and exploding gradient problems. Experimental results have verified the effectiveness of NFCNN and demonstrated its competitive denoising performance when compared with state-of-the-art algorithms.
References

[1] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
[3] Jian Bai and Xiang-Chu Feng. Fractional-order anisotropic diffusion for image denoising. IEEE Transactions on Image Processing, 16(10):2492–2502, 2007.
[4] Michael Black, Guillermo Sapiro, David Marimont, and David Heeger. Robust anisotropic diffusion and sharpening of scalar and vector images. In Proceedings of International Conference on Image Processing, volume 1, pages 263–266. IEEE, 1997.
[5] Antoni Buades, Bartomeu Coll, and J-M Morel. A non-local algorithm for image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 60–65. IEEE, 2005.
[6] Harold C Burger, Christian J Schuler, and Stefan Harmeling. Image denoising: Can plain neural networks compete with BM3D? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2392–2399. IEEE, 2012.
[7] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008, 2018.
[8] Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1256–1272, 2016.
[9] Yunmei Chen, Baba C Vemuri, and Li Wang. Image denoising and segmentation via nonlinear diffusion. Computers & Mathematics with Applications, 39(5-6):131–149, 2000.
[10] Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3642–3649. IEEE, 2012.
[11] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
[12] Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.
[13] Ali Erol, George Bebis, Mircea Nicolescu, Richard D Boyle, and Xander Twombly. Vision-based hand pose estimation: A review. Computer Vision and Image Understanding, 108(1-2):52–73, 2007.
[14] Shuhang Gu, Lei Zhang, Wangmeng Zuo, and Xiangchu Feng. Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2869, 2014.
[15] Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, and Lei Zhang. Toward convolutional blind denoising of real photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1712–1722, 2019.
[16] Robert M Haralick, Hyonam Joo, Chung-Nan Lee, Xinhua Zhuang, Vinay G Vaidya, and Man Bae Kim. Pose estimation from corresponding point data. IEEE Transactions on Systems, Man, and Cybernetics, 19(6):1426–1446, 1989.
[17] Robert M Haralick, Karthikeyan Shanmugam, and Its'Hak Dinstein. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, (6):610–621, 1973.
[18] Robert M Haralick and Linda G Shapiro. Image segmentation techniques. Computer Vision, Graphics, and Image Processing, 29(1):100–132, 1985.
[19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[21] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[22] Marko Janev, Stevan Pilipović, Teodor Atanacković, Radovan Obradović, and Nebojša Ralević. Fully fractional anisotropic diffusion for image denoising. Mathematical and Computer Modelling, 54(1-2):729–741, 2011.
[23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[25] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 136–144, 2017.
[26] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[27] Dengsheng Lu and Qihao Weng. A survey of image classification methods and techniques for improving classification performance. International Journal of Remote Sensing, 28(5):823–870, 2007.
[28] Kede Ma, Zhengfang Duanmu, Qingbo Wu, Zhou Wang, Hongwei Yong, Hongliang Li, and Lei Zhang. Waterloo Exploration Database: New challenges for image quality assessment models. IEEE Transactions on Image Processing, 26(2):1004–1016, 2017.
[29] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3, 2013.
[30] Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Non-local sparse models for image restoration. In Proceedings of the IEEE International Conference on Computer Vision, pages 2272–2279. IEEE, 2009.
[31] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems, pages 2802–2810, 2016.
[32] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
[33] Pietro Perona and Jitendra Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629–639, 1990.
[34] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the Fisher kernel for large-scale image classification. In European Conference on Computer Vision, pages 143–156. Springer, 2010.
[35] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[36] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[37] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[38] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[39] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[40] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656, 2015.
[41] Alexander Toshev and Christian Szegedy. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014.
[42] Guotai Wang, Wenqi Li, Maria A Zuluaga, Rosalind Pratt, Premal A Patel, Michael Aertsen, Tom Doel, Anna L David, Jan Deprest, Sébastien Ourselin, et al. Interactive medical image segmentation using deep learning with image-specific fine tuning. IEEE Transactions on Medical Imaging, 37(7):1562–1573, 2018.
[43] Junyuan Xie, Linli Xu, and Enhong Chen. Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems, pages 341–349, 2012.
[44] Jun Xu, Lei Zhang, Wangmeng Zuo, David Zhang, and Xiangchu Feng. Patch group based nonlocal self-similarity prior learning for image denoising. In Proceedings of the IEEE International Conference on Computer Vision, pages 244–252, 2015.
[45] Maoyuan Xu and Xiaoping Xie. An efficient feature-preserving PDE algorithm for image denoising based on a spatial-fractional anisotropic diffusion equation. arXiv preprint arXiv:2101.01496, 2021.
[46] Lin Yang, Yizhe Zhang, Jianxu Chen, Siyuan Zhang, and Danny Z Chen. Suggestive annotation: A deep active learning framework for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 399–407. Springer, 2017.
[47] Yi Yang and Deva Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR 2011, pages 1385–1392. IEEE, 2011.
[48] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
[49] Kai Zhang, Wangmeng Zuo, and Lei Zhang. FFDNet: Toward a fast and flexible solution for CNN-based image denoising. IEEE Transactions on Image Processing, 27(9):4608–4622, 2018.