Contextual colorization and denoising for low-light ultra high resolution sequences
N. Anantrasirichai and David Bull
Visual Information Laboratory, University of Bristol, UK
ABSTRACT
Low-light image sequences generally suffer from spatio-temporally incoherent noise, flicker and blurring of moving objects. These artefacts significantly reduce visual quality and, in most cases, post-processing is needed to achieve acceptable quality. Most state-of-the-art enhancement methods based on machine learning require ground truth data, but this is not usually available for naturally captured low-light sequences. We tackle these problems with an unpaired-learning method that offers simultaneous colorization and denoising. Our approach is an adaptation of the CycleGAN structure. To overcome the excessive memory limitations associated with ultra high resolution content, we propose a multiscale patch-based framework that captures both local and contextual features. Additionally, an adaptive temporal smoothing technique is employed to remove flickering artefacts. Experimental results show that our method outperforms existing approaches in terms of subjective quality, and that it is robust to variations in brightness levels and noise.
Index Terms — colorization, denoising, GAN
1. INTRODUCTION
Low-light conditions can be problematic for video acquisition, causing poor scene visibility (Fig. 1b), focusing difficulties, blurring of moving objects due to limited shutter speeds, and noise due to high ISO values (Fig. 1c). These impairments are not only visually unpleasant, but they also impact the performance of automated tasks such as classification, detection and tracking.

Traditional enhancement techniques typically wash out details, flatten appearance and amplify noise. In professional low-light applications, such as natural history filmmaking, a specialist will manually apply colour grading and noise reduction as part of the post-production workflow (e.g. Fig. 1d). The final results may nevertheless be unsatisfactory, as the information contained in the source sequence is limited.

Recently, deep learning algorithms have demonstrated their effectiveness for image enhancement, segmentation, detection and denoising [2].
This work was supported by Bristol+Bath Creative R+D under AHRC grant AH/S002936/1.
Fig. 1. (a-c) The 5K 'Macro' scenes. (d-f) Enhancement results for the low-light scene (b) using (d) manual editing by an expert, (e) CycleGAN [1], and (f) our model. The inset shows a magnified object.

Typical algorithms use Convolutional Neural Networks (CNNs) to extract semantic meaning from low-level features, effectively working as an encoder. A convolutional decoder is then appended to produce a new image output [3, 4]. This module can be further employed as the generator in a Generative Adversarial Network (GAN) [5], where a second module, the discriminator, improves the generator's performance by checking whether the received image is 'real' or 'fake'. Despite the success of these methods, their application to low-light data is challenging due to the absence of ground truth data (replicating the same scene, registered at the pixel level, with appropriate lighting is practically impossible).

This paper presents a new end-to-end enhancement framework based on a generative model. It does not require paired training samples, but instead performs a mapping based on learnt common statistics. This approach, commonly referred to as a CycleGAN [1], does not manipulate the low-light input directly, but instead generates entirely new images. In our case, the aim is to transform noisy, low-light images (Fig. 1b) into clean, sharp daylight versions (Fig. 1a). It should be noted that an expert-edited version could in principle be used as a target for the training process; however, this process is hugely time consuming and expensive.

Our framework is specifically tailored to ultra high resolution (UHR) sequences, which impose excessive memory requirements. To address this, we propose a patch-based strategy in which a local patch is concatenated with a resized version of the region it belongs to. The local patch contains localised features and noise characteristics, whilst the region patch contains contextual information. We hence calculate the training losses of the local and region patches separately, using an $\ell_1$ loss for the local patches to minimise noise and preserve textures, and a perceptual loss [6] for the region patches to learn context. Finally, we propose an adaptive temporal smoothing technique to handle brightness changes and mitigate temporal inconsistencies.

The remainder of this paper is organised as follows. A summary of related work is presented in Section 2. Details of the proposed framework and our contributions are described in Section 3. The performance of the method is evaluated in Section 4, followed by conclusions in Section 5.
2. RELATED WORK
This section reviews state-of-the-art methods for image-to-image translation and denoising. A survey of recent techniques can also be found in [2].
Image-to-image translation aims to produce a new image that has a different appearance to the input but similar semantic content. Early algorithms employed CNNs to perform tasks such as converting grayscale tones to natural colours [7] or photographs to artistic paintings [8]. Subsequently, conditional GANs such as Pix2Pix [9] were proposed, further extending the range of possible applications to include converting road maps to aerial photographs, or a sketch into a coloured object. These methods invariably exploit supervised learning, requiring a paired training dataset. The CycleGAN [1], DualGAN [10] and DiscoGAN [11] architectures were then proposed to overcome this limitation by training two GANs with two groups of unpaired images, mapping the characteristics of one group onto the other. More recently, unpaired GAN-based methods have been developed to produce diverse outputs from a single input [12].
Denoising techniques are now almost entirely based on deep learning. For example, a residual noise map of an image can be estimated using a Denoising CNN (DnCNN) [13], while for video, spatial and temporal networks are concatenated in [14]. FFDNet [15] works on reversibly downsampled subimages. VNLnet combines a non-local patch search module with DnCNN [16]. TOFlow [17] offers an end-to-end framework that performs motion analysis and video processing simultaneously. GANs have also been employed to estimate a noise distribution, which is subsequently used to augment clean data for training CNN-based denoising networks such as DnCNN [18]. GANs have also been employed for denoising medical images [19], but they are not popular in the natural image domain due to the limited resolution that current GANs can process.
Fig. 2. Workflow. Left patches are the low-light inputs and top-right patches are the targets.
3. METHODOLOGIES
A diagram of the proposed framework is illustrated in Fig. 2. The process comprises patch generation, image enhancement, patch merging and temporal smoothing.

Patches are cropped from the UHR images at the maximum size allowed by GPU memory. However, these may not contain sufficient contextual detail to be learnt by the CNNs. We therefore use both local and region-based patches. The sizes of the local and region patches are $N_l \times N_l$ and $N_r \times N_r$ pixels respectively, with $N_l < N_r$. Region patches are cropped with the same centre point as the corresponding local patches, and are then resized to $N_l \times N_l$ pixels for concatenation with the local patches. The input to the GAN hence has six channels for an RGB colour format. The region size is not restricted, but it has to be large enough to capture the semantic meaning of the object in the local patch; close-up content, for example, requires larger region patches than landscape content.
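To make this construction concrete, the sketch below builds one six-channel training sample from a UHR frame. This is a minimal PyTorch illustration under our own assumptions (the helper name, bilinear resizing and the omitted border handling are not specified in the paper); the default sizes follow the training settings reported in Section 4.

```python
import torch
import torch.nn.functional as F

def make_patch_pair(frame: torch.Tensor, cx: int, cy: int,
                    n_local: int = 360, n_region: int = 1000) -> torch.Tensor:
    """Build a six-channel sample: an RGB local patch stacked with the
    resized RGB region patch sharing the same centre (cx, cy).

    `frame` is a (3, H, W) tensor; the centre must lie at least
    n_region // 2 pixels from every border (border handling is omitted).
    """
    def crop(size: int) -> torch.Tensor:
        half = size // 2
        return frame[:, cy - half:cy + half, cx - half:cx + half]

    local = crop(n_local)                    # fine detail and noise statistics
    region = crop(n_region).unsqueeze(0)     # wider context around the same centre
    region = F.interpolate(region, size=(n_local, n_local),
                           mode="bilinear", align_corners=False)[0]
    return torch.cat([local, region], dim=0) # (6, N_l, N_l) network input
```

Stacking the resized region behind the local patch lets the network see surrounding context without increasing the spatial size of the input.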
We model our image-to-image translation problem following the concept of CycleGAN [1]. The first training group (group A) comprises the low-light patches and the second group (group B) comes from the target image. The patches of both groups are translated twice, i.e. from group A to group B with the generator $G_A$, then back to the original group A with the generator $G_B$. The loss function then compares the input image with its reconstruction.

Generators: Both $G_A$ and $G_B$ have three sequential modules: i) an encoder with three convolutional blocks (3×3 kernels), ii) a translator built from residual blocks [20], with dense connections [21] and deep shrinkage for adaptive noise reduction [22], and iii) a convolutional decoder that restores the output image.

Discriminators: Following the original CycleGAN, the discriminators comprise five convolutional blocks (4×4 kernels).

Loss functions: The training process aims to minimise a loss function $\mathcal{L}_{final}$ comprising i) an adversarial loss $\mathcal{L}_{GAN}$, ii) a cycle consistency loss $\mathcal{L}_{cyc}$, and iii) an identity loss $\mathcal{L}_{idt}$, as shown in Eq. (1), where $\lambda_{GAN}$, $\lambda_{cyc}$ and $\lambda_{idt}$ are weights:

$$\mathcal{L}_{final} = \lambda_{GAN}\mathcal{L}_{GAN} + \lambda_{cyc}\mathcal{L}_{cyc} + \lambda_{idt}\mathcal{L}_{idt}. \qquad (1)$$

$\mathcal{L}_{GAN}$ combines a generator loss, which aims to fool the discriminator, and a discriminator loss, which distinguishes between real and translated samples. $\mathcal{L}_{cyc}$ enforces forward-backward consistency, and $\mathcal{L}_{idt}$ requires that $G_A$ behave as the identity when fed a target patch, and similarly $G_B$ when fed a low-light patch. We calculate the losses of the local patches ($A_l$, $B_l$) and region patches ($A_r$, $B_r$) separately, and weight towards the loss of the local patches. This is because the local patches are what we actually want to translate, whilst the region patches provide contextual information as guidance. That is,

$$\mathcal{L}_t = w\mathcal{L}_t^l + (1-w)\mathcal{L}_t^r, \quad w > 0.5, \quad t \in \{\mathrm{GAN}, \mathrm{cyc}, \mathrm{idt}\}. \qquad (2)$$

For $\mathcal{L}_{GAN}$, in addition to the least squares GAN (LSGAN) loss used by the original CycleGAN, we employ a relativistic average LSGAN (RaLSGAN) [23], which measures the probability that the input data is, on average, more realistic than the opposing type. This improves training stability and visual quality.

For $\mathcal{L}_{cyc}$ and $\mathcal{L}_{idt}$, we employ an $\ell_1$ loss for the local patches. This is a pixel-wise loss that is robust to noise and capable of preserving textures. For the region patches, we use a perceptual loss computed from feature maps $\phi$ extracted with a pretrained VGG19 [6], which has proven performance for measuring contextual similarity. We however employ the $\ell_1$-norm instead of the $\ell_2$-norm used in [6], as it is more robust to outliers. $\mathcal{L}_{cyc}$ and $\mathcal{L}_{idt}$ are computed as follows:

$$\mathcal{L}_{cyc}^l = \|G_B(G_A(A_l)) - A_l\|_1 + \|G_A(G_B(B_l)) - B_l\|_1, \qquad (3a)$$

$$\mathcal{L}_{cyc}^r = \|\phi(G_B(G_A(A_r))) - \phi(A_r)\|_1 + \|\phi(G_A(G_B(B_r))) - \phi(B_r)\|_1, \qquad (3b)$$

$$\mathcal{L}_{idt}^l = \|G_A(B_l) - B_l\|_1 + \|G_B(A_l) - A_l\|_1, \qquad (4a)$$

$$\mathcal{L}_{idt}^r = \|\phi(G_A(B_r)) - \phi(B_r)\|_1 + \|\phi(G_B(A_r)) - \phi(A_r)\|_1. \qquad (4b)$$
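The weighted loss of Eqs. (2) and (3) can be sketched in PyTorch as follows, assuming the generators consume and produce the six-channel local-plus-region tensors described above. The VGG19 cut-off layer and the omission of ImageNet normalisation are our simplifications, not choices stated in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

l1 = nn.L1Loss()

class VGGFeatures(nn.Module):
    """Feature maps phi from a pretrained VGG19 [6]; the cut-off layer is an assumption."""
    def __init__(self, depth: int = 21):
        super().__init__()
        self.net = vgg19(pretrained=True).features[:depth].eval()
        for p in self.net.parameters():
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

phi = VGGFeatures()

def cycle_loss(G_A, G_B, A: torch.Tensor, B: torch.Tensor, w: float = 0.9) -> torch.Tensor:
    """Weighted cycle-consistency loss, Eqs. (2)-(3). A, B: (batch, 6, N_l, N_l)."""
    rec_A = G_B(G_A(A))   # A -> B -> A
    rec_B = G_A(G_B(B))   # B -> A -> B
    # Local RGB channels: pixel-wise l1, Eq. (3a).
    loss_l = l1(rec_A[:, :3], A[:, :3]) + l1(rec_B[:, :3], B[:, :3])
    # Region RGB channels: l1 between VGG19 feature maps, Eq. (3b).
    loss_r = l1(phi(rec_A[:, 3:]), phi(A[:, 3:])) + l1(phi(rec_B[:, 3:]), phi(B[:, 3:]))
    return w * loss_l + (1.0 - w) * loss_r   # Eq. (2), with w > 0.5
```

The identity terms of Eq. (4) follow the same local/region split, with $G_A(B)$ and $G_B(A)$ in place of the cycle reconstructions, and the three weighted terms are combined as in Eq. (1).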
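For the adversarial term, a sketch of the RaLSGAN objective [23] is given below; `critic_real` and `critic_fake` denote raw discriminator outputs on real and translated patches. This follows the published RaLSGAN formulation, not code released with this paper.

```python
import torch

def ralsgan(critic_real: torch.Tensor, critic_fake: torch.Tensor):
    """RaLSGAN losses [23]: each sample is judged relative to the average
    score of the opposing type. Detach the opposing batch as appropriate
    when updating the generator and discriminator separately."""
    rel_real = critic_real - critic_fake.mean()   # C(x_r) - E[C(x_f)]
    rel_fake = critic_fake - critic_real.mean()   # C(x_f) - E[C(x_r)]

    # Discriminator: real patches should score above the average fake, and vice versa.
    d_loss = ((rel_real - 1.0) ** 2).mean() + ((rel_fake + 1.0) ** 2).mean()
    # Generator: the symmetric objective, pushing fakes above the average real.
    g_loss = ((rel_fake - 1.0) ** 2).mean() + ((rel_real + 1.0) ** 2).mean()
    return d_loss, g_loss
```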
We also tried a hinge adversarial loss [24], style loss [8], a wavelet-based loss, gradient loss [25] and total variation loss [26]; none of these improved enhancement performance.

For inference, we divide each frame of the UHR sequence into overlapping patches. All results in this paper were reconstructed from patches shifted by $N_l/2$ pixels. The input and the output of the network have six channels, comprising the RGB local and RGB region patches, but only the RGB local patches are used to reconstruct a frame. The patches are merged with Gaussian weights centred on each patch (mean $\mu = N_l/2$, with standard deviation $\sigma$ proportional to $N_l$).

Due to the memory limitations associated with UHR videos, the learning process can only be performed on a frame-by-frame basis. Since this can lead to temporal inconsistency, a pixel-wise average across a temporal sliding window is used to smooth brightness and colour. The window size for each pixel is adapted according to the magnitude of its motion. Firstly, each frame in the sliding window is warped and registered to the current frame. We adapt the warping process using multi-scale gradient matching from [27] to reduce large displacements amongst frames, and then apply a wavelet-based registration [28] to mitigate micro-misalignment. Motion estimation is performed using coarser-level wavelet coefficients to determine large motion components, and then finer-level coefficients to refine the motion field. The sliding window is defined at the pixel level, extending at most $N_{max}$ frames backward and $N_{max}$ frames forward. Pixels with larger motion are constructed from fewer frames, whilst stable pixels are averaged over all $2N_{max}+1$ frames in the sliding window. We set $N_{max}$ to 6 frames; if the motion exceeds 256 pixels, no neighbouring frames are used.
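The adaptive temporal averaging can be sketched as follows, assuming the neighbouring frames have already been warped and registered to the current frame as described above. The linear mapping from motion magnitude to window radius is our assumption; the paper specifies only the endpoints (the full window for static pixels, no neighbours beyond 256 pixels of motion).

```python
import numpy as np

N_MAX = 6            # maximum frames on each side of the current frame
MOTION_CAP = 256.0   # motion (pixels) above which no neighbouring frames are used

def adaptive_temporal_smooth(warped: np.ndarray, motion_mag: np.ndarray) -> np.ndarray:
    """Pixel-wise average over a motion-adaptive temporal window.

    `warped`: (2 * N_MAX + 1, H, W, 3) frames already warped and registered
    to the current frame (stored at index N_MAX).
    `motion_mag`: (H, W) per-pixel motion magnitude in pixels.
    """
    # Larger motion -> smaller temporal radius, from N_MAX down to 0.
    radius = np.round(N_MAX * (1.0 - np.clip(motion_mag / MOTION_CAP, 0.0, 1.0)))

    out = np.zeros(warped.shape[1:], dtype=np.float64)
    count = np.zeros(warped.shape[1:3], dtype=np.float64)
    for k in range(-N_MAX, N_MAX + 1):
        use = np.abs(k) <= radius                 # (H, W) boolean mask
        out += warped[N_MAX + k] * use[..., None]
        count += use
    return (out / count[..., None]).astype(warped.dtype)
```

Since the current frame (k = 0) is always included, the per-pixel count is never zero and pixels with motion above the cap simply pass through unaveraged.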
4. EXPERIMENTS AND DISCUSSION
The method was tested with six UHR sequences: i) three at 8K resolution (7680×4320): 'Static', 'Fly', and 'Horse', and ii) three at 5K resolution (5120×2880): 'Macro', 'Woods', and 'River'. The 'Static' sequence is a static indoor scene, whilst the others contain moving objects and dynamic backgrounds.
Training parameters: Local patches are memory-limited to 360×360 pixels and region patches to 1,000×1,000 pixels, except for 'Horse', where the region patches are 2,500×2,500 pixels.

Fig. 3. Enhancement results for the low-light 'Macro' sequence using (a) traditional histogram matching to the target, (b) Neural Style [8], (c) DeepPrior [29] followed by histogram matching, and (d) Learning to See in the Dark [30].
Fig. 4. Results for frame 200 of (left-right) 'Static', 'Fly', 'Horse', 'Woods', and 'River'. (Top-bottom) Low-light scene with the target scene in the inset, results of CycleGAN, and results of our proposed method, respectively.
Fig. 5. Robustness test with various intensity changes on the 'Static' sequence. The columns with the red blocks are the training sets. The 1st and 3rd rows are inputs; the 2nd and 4th rows are the outputs of our method.

We randomly cropped 1,000 patches from the first frame of each sequence for training. Only the first frame was used, as we wanted to investigate the robustness of the model. The training parameters were: $\lambda_{GAN}=1$, $\lambda_{cyc}=10$, $\lambda_{idt}=0.5$, $w=0.9$.

Results and comparison: The results in Fig. 1 and Fig. 4 reveal that our method creates better contrast with lower noise than CycleGAN, and that the contextual information from region patches assists the formation of local detail (e.g. the lizard head appears much more clearly in our result). We also provide comparisons with other automated methods: DeepPrior [29] is an unsupervised denoising technique (requiring subsequent histogram matching), and Learning to See in the Dark [30] is a supervised low-light image enhancement method. We retrained the latter model with our 'Static' sequence combined with their datasets. The results in Fig. 3 clearly show that residual noise remains a problem for these methods.
Robustness: Fig. 5 shows results for four brightness values: two between the training low-light and target datasets, and two darker than the training version. The model can be seen to be robust to intensity changes. The effect of noise is noticeable in the darker sequences but remains subtle, because the convolutional layers behave like low-pass filters. We also see that, as the input noise level reduces, the outputs become sharper and more vivid.
5. CONCLUSIONS
We present a novel end-to-end framework for joint colorization and denoising of low-light UHR sequences. Since registered ground truth is unavailable, we use a CycleGAN that learns the statistics of the source and target groups. To address the issue of memory load, we propose a patch-based technique in which local and region patches are concatenated to form the input to the network. The architectures of both the generators and discriminators, as well as the loss functions, are modified to suit UHR images. Finally, we use an adaptive temporal smoothing technique to mitigate flickering artefacts. Our proposed framework clearly outperforms existing methods, providing evident benefits in terms of subjective quality.
6. ACKNOWLEDGEMENT
We would like to thank Esprit film and television, and BBC Bristol for providing datasets.

7. REFERENCES

[1] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in IEEE International Conference on Computer Vision (ICCV), 2017.
[2] N. Anantrasirichai and D. Bull, "Artificial intelligence in the creative industries: A review," arXiv:2007.12391, 2020.
[3] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[4] N. Anantrasirichai and D. Bull, "DefectNet: Multi-class fault detection on highly-imbalanced datasets," in IEEE International Conference on Image Processing (ICIP), 2019, pp. 2481–2485.
[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems 27, 2014, pp. 2672–2680.
[6] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 105–114.
[7] Richard Zhang, Phillip Isola, and Alexei A. Efros, "Colorful image colorization," in European Conference on Computer Vision (ECCV), 2016, pp. 649–666.
[8] Leon Gatys, Alexander Ecker, and Matthias Bethge, "A neural algorithm of artistic style," Journal of Vision, vol. 16, no. 12, 2016.
[9] P. Isola, J. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 5967–5976.
[10] Z. Yi, H. Zhang, P. Tan, and M. Gong, "DualGAN: Unsupervised dual learning for image-to-image translation," in IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 2868–2876.
[11] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim, "Learning to discover cross-domain relations with generative adversarial networks," in International Conference on Machine Learning (ICML), 2017, pp. 1857–1865.
[12] H.-Y. Lee, H.-Y. Tseng, Q. Mao, et al., "DRIT++: Diverse image-to-image translation via disentangled representations," International Journal of Computer Vision, vol. 128, pp. 2402–2417, 2020.
[13] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
[14] Michele Claus and Jan van Gemert, "ViDeNN: Deep blind video denoising," in CVPR Workshops, 2019.
[15] K. Zhang, W. Zuo, and L. Zhang, "FFDNet: Toward a fast and flexible solution for CNN-based image denoising," IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4608–4622, 2018.
[16] A. Davy, T. Ehret, J. Morel, P. Arias, and G. Facciolo, "A non-local CNN for video denoising," in IEEE International Conference on Image Processing (ICIP), Sep. 2019, pp. 2409–2413.
[17] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T. Freeman, "Video enhancement with task-oriented flow," International Journal of Computer Vision, vol. 127, pp. 1106–1125, 2019.
[18] J. Chen, J. Chen, H. Chao, and M. Yang, "Image blind denoising with generative adversarial network based noise modeling," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 3155–3164.
[19] Q. Yang, P. Yan, Y. Zhang, H. Yu, Y. Shi, X. Mou, M. K. Kalra, Y. Zhang, L. Sun, and G. Wang, "Low-dose CT image denoising using a generative adversarial network with Wasserstein distance and perceptual loss," IEEE Transactions on Medical Imaging, vol. 37, no. 6, pp. 1348–1357, 2018.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[21] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 2261–2269.
[22] K. Isogawa, T. Ida, T. Shiodera, and T. Takeguchi, "Deep shrinkage convolutional neural network for adaptive noise reduction," IEEE Signal Processing Letters, vol. 25, no. 2, pp. 224–228, 2018.
[23] Alexia Jolicoeur-Martineau, "The relativistic discriminator: A key element missing from standard GAN," in International Conference on Learning Representations (ICLR), 2019.
[24] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena, "Self-attention generative adversarial networks," in International Conference on Machine Learning (ICML), June 2019, vol. 97, pp. 7354–7363.
[25] R. Muhammad Umer, G. Luca Foresti, and C. Micheloni, "Deep generative adversarial residual convolutional networks for real-world super-resolution," in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 1769–1777.
[26] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, "Generative image inpainting with contextual attention," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5505–5514.
[27] N. Anantrasirichai, A. Achim, and D. Bull, "Atmospheric turbulence mitigation for sequences with moving objects using recursive image fusion," in IEEE International Conference on Image Processing (ICIP), 2018, pp. 2895–2899.
[28] N. Anantrasirichai, A. Achim, N. G. Kingsbury, and D. R. Bull, "Atmospheric turbulence mitigation using complex wavelet-based fusion," IEEE Transactions on Image Processing, vol. 22, no. 6, pp. 2398–2408, 2013.
[29] V. Lempitsky, A. Vedaldi, and D. Ulyanov, "Deep image prior," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9446–9454.
[30] C. Chen, Q. Chen, J. Xu, and V. Koltun, "Learning to see in the dark," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.