NO-REFERENCE DENOISING OF LOW-DOSE CT PROJECTIONS
Elvira Zainulina‡†, Alexey Chernyavskiy†, Dmitry V. Dylov‡
† Philips AI Research, ‡ Skolkovo Institute of Science and Technology, Moscow, Russia
Correspondence to: Elvira Zainulina – [email protected]
ABSTRACT
Low-dose computed tomography (LDCT) became a clear trend in radiology with an aspiration to refrain from delivering excessive X-ray radiation to the patients. The reduction of the radiation dose decreases the risks to the patients but raises the noise level, affecting the quality of the images and their ultimate diagnostic value. One mitigation option is to consider pairs of low-dose and high-dose CT projections to train a denoising model using deep learning algorithms; however, such pairs are rarely available in practice. In this paper, we present a new self-supervised method for CT denoising. Unlike existing self-supervised approaches, the proposed method requires only noisy CT projections and exploits the connections between adjacent images. The experiments carried out on an LDCT dataset demonstrate that our method is almost as accurate as the supervised approach, while also outperforming several modern self-supervised denoising methods.
Index Terms — Self-supervised learning, blind denoising, convolutional neural networks (CNN), convolutional long short-term memory (ConvLSTM), computed tomography (CT), CT projections.
1. INTRODUCTION
The potential health risks for patients caused by the X-ray radiation of computed tomography (CT) have led to the emergence of low-dose CT [1]. Although the reduction of the radiation dose decreases the risks, it also increases the level of noise. The deterioration of the quality of CT projections leads to streaks and blurriness in the reconstructed CT slices. While there has been progress in applying convolutional neural networks (CNNs) for denoising the reconstructed CT slices [2], the raw projections contain much more information which could and should be used to increase the signal-to-noise ratio and to contribute to a more accurate reconstruction.

Traditional denoising algorithms rely on regularization with special penalty functions such as total variation. These algorithms often require careful case-by-case tuning of parameters. On the other hand, the most recent algorithms based on deep learning (DL) are supervised, i.e. they require pairs of low-noise and high-noise images of the same scene for training.
In a clinical setting, for example in a radiology suite, the acquisition of such paired data can take at least twice as long as the regular CT exam and would increase the exposure to the X-ray dose. Generating the noisy data artificially is an option to alleviate the lack of image pairs, but the addition of noise to the images, especially medical ones, can bias the predictions of a CNN, because no synthetic noise can completely emulate the real one and portray all the physical phenomena encountered in a CT machine.

Recently, self-supervised methods for reference-free image denoising have been proposed. In these methods, the noiseless signal is deduced from the noisy image itself. Research has shown that self-supervised methods produce images with quality comparable to that of images denoised by supervised methods, without requiring reference images. In [3], Lehtinen et al. proposed a CNN that considers pairs of noisy images of the same scene with two independent realizations of noise. Given the first image, the CNN is trained to produce the second image, and learns an approximation to the noise-free image as a result. This Noise2Noise method was applied for denoising X-ray projections and CT images and compared to a supervised model in [4]. It gave acceptable results, but the images were over-smoothed. Unless there is a way of getting two co-registered images of the same scene, e.g. from a still camera, this approach is not practical. Most often, only one noisy image of a particular scene is available. Custom addition of noise to images before applying this approach is also questionable, since the mismatch between the statistical properties of the synthetic data and those of the real data may lead to a decrease in image quality at test time compared to the performance during training.

Other current self-supervised denoising methods require only noisy images, but they add complexity to the model. In [5], Krull et al. described Noise2Void, an algorithm that uses pixel masking during the training process, making it less efficient. A solution to overcome this drawback was proposed by Laine et al. in [6]. They developed an architecture based on the U-Net [7] that includes special convolution and downsampling layers. This architecture makes it possible to obtain a blind spot at the center of the patch instead of masking. Also, the authors of [6] introduced loss functions that take into account the Gaussian or Poisson distribution of the noise. However, their method requires passing one image through a CNN four times to remove the noise, or making the model four times bigger. In [8], a more complicated statistical approach was applied for denoising CT images, but it was applied only to reconstructed scans and sinograms, not CT projections.

To our knowledge, none of the existing self-supervised reference-free denoising methods use the information contained in sequences of images. Each image is denoised separately, which leads to sub-optimal image quality. We propose Noise2NoiseTD, a time-distributed method that exploits the information redundancy in sequences of images such as CT projections. It requires neither pairs of low-noise and high-noise images nor pairs of high-noise images of the same scene. It makes the denoising process simpler compared to existing DL-based self-supervised approaches.
2. METHODOLOGY

2.1. Noise2NoiseTD approach overview
The proposed approach is based on the assumption that, given the middle projection p(θ), where θ is the angle of rotation of the X-ray source around the patient, and its adjacent image frames (CT projections) p(θ ± iΔθ), i = 1, ..., k, the content of the frames can be distinguished from the noise using similarities found among them by a neural network. Since the noise is independent in each CT projection, the features extracted from the sequence of frames will reflect only the information about the projected structures (the anatomy) and allow the model to recover the noise-free projection p(θ). The choice of the number of adjacent frames k depends on how much the content of the frames overlaps, and on the computational and memory capacity of the device.

We propose to use bidirectional convolutional memory units (Bi-ConvLSTM) for carrying information about the adjacent images. These units extract the features corresponding to the slight consecutive change of the structures that are observed from the first to the last viewing angle, and vice versa, and then combine these features. Thus, Bi-ConvLSTM units provide a stable restoration of the middle frame in the sequence that is being denoised. These units work with features that are extracted from each frame independently by some CNN. Finally, the features extracted from the frames of the sequence, with the exception of the middle projection, are summed up and processed by another CNN. The task of this network is to process and fuse the results to obtain the denoised middle projection; therefore it can have a simpler architecture.

Since no noiseless ground truth is available, the training of Noise2NoiseTD proceeds in a self-supervised way. We use a no-reference loss function that is based on the loss function proposed in [6]. The network architecture and the loss function are described in detail in the next subsections.
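To make the fusion idea above concrete, the following PyTorch sketch illustrates a bidirectional ConvLSTM layer that sums the forward and backward passes for each frame and then aggregates every frame except the middle one. This is our illustration rather than the authors' implementation: the class names ConvLSTMCell and BiConvLSTMFusion, the channel counts, and the kernel size are assumptions.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Basic ConvLSTM cell in the spirit of [10]; the gates depend only on the input
    and the hidden state (no peephole connection to the past cell state)."""

    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2, bias=False)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class BiConvLSTMFusion(nn.Module):
    """Run ConvLSTM cells over the frame features in both directions, sum the two
    outputs per frame, then aggregate every frame except the middle one."""

    def __init__(self, ch: int):
        super().__init__()
        self.fwd = ConvLSTMCell(ch, ch)
        self.bwd = ConvLSTMCell(ch, ch)

    def forward(self, feats):                       # feats: (B, T, C, H, W), T odd
        B, T, C, H, W = feats.shape
        zeros = feats.new_zeros(B, C, H, W)
        hf, cf = zeros, zeros
        hb, cb = zeros, zeros
        fwd_h, bwd_h = [None] * T, [None] * T
        for t in range(T):                          # pass over increasing viewing angles
            hf, cf = self.fwd(feats[:, t], (hf, cf))
            fwd_h[t] = hf
        for t in reversed(range(T)):                # pass over decreasing viewing angles
            hb, cb = self.bwd(feats[:, t], (hb, cb))
            bwd_h[t] = hb
        mid = T // 2                                # the projection being denoised
        # exclude the middle frame so the network stays "blind" to its own noise
        return sum(fwd_h[t] + bwd_h[t] for t in range(T) if t != mid)
```

For a window of, say, seven projections, `BiConvLSTMFusion(ch)(feats)` returns one fused feature map for the middle view, which a small head CNN can then map to the denoised projection.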
Fig. 1. Architecture of the Noise2NoiseTD model.

2.2. Network architecture

Figure 1 shows the proposed Noise2NoiseTD model. The backbone of our implementation of the Noise2Noise approach is based on the U-Net architecture, except that it does not have any pooling layers and all levels preserve the width and height of the input tensors using replicate padding. Since some features can contain more information related to the anatomical structures, and others may contain more information about the corruptions, we used a feature attention mechanism. Channel attention makes it possible to focus on more informative features. Channel attention layers [9] are added after each block that is on the second or third level of the model.

To account for connections between subsequent frames, Bi-ConvLSTM blocks are included in the model. These blocks are based on the ConvLSTM layers introduced in [10]. We do not use the past cell state for gate computation; this reduces the number of model parameters without noticeably affecting image quality. A Bi-ConvLSTM layer consists of two ConvLSTM cells that process image frames in two opposite directions. The forward and backward outputs are combined by a summation operator. As suggested in [11], the first Bi-ConvLSTM layer is inserted into the bottleneck and the second one is plugged into the end of the network.

The output of the last Bi-ConvLSTM layer is summed up along the time axis, as we want to predict only the middle projection in the sequence. With the purpose of preventing overfitting to the noisy middle projection, we make the network "blind" to it by excluding the corresponding Bi-ConvLSTM output from the summation. Next, the aggregated output is processed by a 1×1 convolution in order to obtain the proper number of output channels. Following [12], all the convolutions in the model are made bias-free.

2.3. Loss function

It was stated in [13] and [14] that the noise in CT projections is more likely to be Gaussian. However, for low-dose CT the noise at each pixel can be more accurately modeled as an independent random variable sampled from a mixed Poisson-Gaussian distribution. We found from our experiments that this assumption gave better results than when the noise was assumed to be purely Gaussian or Poisson. Based on this assumption, one can derive a reference-free loss function suitable for self-supervised model training that accounts for the noise distribution.

According to [6], the distribution of the noisy data $y = (y_1, \dots, y_n)$ given its neighbourhood $\Omega_y = (\Omega_{y_1}, \dots, \Omega_{y_n})$ relates to the distribution of the clean data $x = (x_1, \dots, x_n)$ in the following way:
$$\underbrace{p(y_i \mid \Omega_{y_i})}_{\text{Noisy observation}} = \int \underbrace{p(y_i \mid x_i)}_{\text{Noise model}} \, \underbrace{p(x_i \mid \Omega_{y_i})}_{\text{Clean prior}} \, dx_i .$$
As only noisy data is available, the network that models the prior distribution can be trained by minimizing the following negative log-likelihood:
$$L = -\sum_{i=1}^{n} \log p(y_i \mid \Omega_{y_i}).$$
For predicting $x$, in addition to $\Omega_y$, $y$ can be included using Bayes rule:
$$p(x_i \mid y_i, \Omega_{y_i}) \propto p(y_i \mid x_i)\, p(x_i \mid \Omega_{y_i}).$$
The algorithm for training a CNN with noisy data is the following [6] (for convenience, we omit the index $i$):
1. Train the model to map the pixels in the patch into the mean $\mu_x$ and standard deviation $\sigma_x$ of the Gaussian approximation of the distribution of the clean data $p(x \mid \Omega_y)$.
2. During the test phase, obtain $\mu_x$ and $\sigma_x$ using the trained network.
Then compute $\mathbb{E}_x[p(x \mid y, \Omega_y)]$.

Assuming a mixed Poisson-Gaussian distribution,
$$y = \mathrm{Poisson}(\lambda x)/\lambda + \mathcal{N}(0, a),$$
where $\lambda$ is the maximum event count and $a$ is the variance of the additive Gaussian noise. After this approximation, the noisy data is modeled as $y = \mathcal{N}(\mu_x,\; \sigma_x^2 + \mu_x/\lambda + a)$; $a$ and $\lambda$ are the unknown parameters that are learned together with the main model. The variance of the noise is $\sigma_n^2 = \mu_x/\lambda + a$.

The loss function for CNN training is therefore given as
$$L = \sum_i \left( \frac{(y_i - \mu_i)^2}{\sigma_i^2} + \frac{1}{2}\log \sigma_i^2 - c\,\sigma_i \right), \qquad (1)$$
where $\sigma_i^2 = \sigma_{x_i}^2 + \sigma_{n_i}^2$ is the total predictive variance and $-c\,\sigma_i$ is a small regularization term proposed in [6] for the case of unknown noise parameters, which encourages explaining the observed noise as corruption instead of uncertainty about the clean signal. The posterior mean estimate, i.e. the prediction of the denoised image, is computed as follows:
$$\mathbb{E}_x[p(x \mid y, \Omega_y)] = \frac{\mu_x \sigma_n^2 + y\, \sigma_x^2}{\sigma_n^2 + \sigma_x^2}. \qquad (2)$$
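For concreteness, the loss (1) and the posterior-mean estimate (2) under the mixed Poisson-Gaussian model can be written roughly as in the sketch below. The function names n2n_td_loss and posterior_mean, the log-parameterization of $a$ and $\lambda$, the default regularization weight `reg`, and the variance clamping are our assumptions, not details reported in the paper.

```python
import torch


def n2n_td_loss(y, mu_x, sigma_x, log_lambda, log_a, reg=0.1):
    """Reference-free loss in the spirit of Eq. (1) for mixed Poisson-Gaussian noise."""
    lam = torch.exp(log_lambda)                  # maximum event count (kept positive)
    a = torch.exp(log_a)                         # variance of the additive Gaussian noise
    var_n = (mu_x / lam + a).clamp(min=1e-6)     # noise variance sigma_n^2
    var = sigma_x ** 2 + var_n                   # total predictive variance sigma^2
    return ((y - mu_x) ** 2 / var + 0.5 * torch.log(var) - reg * torch.sqrt(var)).mean()


def posterior_mean(y, mu_x, sigma_x, log_lambda, log_a):
    """Eq. (2): combine the network prediction mu_x with the noisy observation y."""
    var_n = (mu_x / torch.exp(log_lambda) + torch.exp(log_a)).clamp(min=1e-6)
    var_x = sigma_x ** 2
    return (mu_x * var_n + y * var_x) / (var_n + var_x)
```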
Fig. 2. SSIM between the denoised and the full-dose images: (a) projection domain, (b) image domain.

2.4. Dataset

The experiments were carried out on the projection data from the LDCT dataset [15] published by the Mayo clinic. CT projection data is provided for both full and simulated lower dose ( of the routine dose) levels. Since the dose levels are different for the head and the abdomen, we trained two independent models for denoising the corresponding projections. They were also tested independently. For our experiments we selected sets of CT projections coming from seven randomly chosen patients: five containing head data, and two consisting of abdomen projections.

For abdominal projections, we used data from one patient and randomly selected  and  projections for the train and validation sets, respectively, in such a way that at least  projections were adjacent. We used the first  substantive adjacent projections of the other patient with abdomen projections as the test set; we denote it A-test.

For the models that denoise head projections, the train, validation and test datasets consisted of , , and , correspondingly, of the data from three patients ( projections in total). We call the test part of this data H-test. Since the acquisition geometry for these projections is axial, we split the data so that the projections from the same acquisition circle were included into one dataset. We also tested the models on the full data from the two other patients (patient- and patient-) in order to check the generalizability of the models and to calculate the metrics on the full set of projections.
3. EXPERIMENTS AND RESULTS
Besides using the loss in (1), we also trained our model with the MSE loss. In this case we computed the loss between the middle projection and its denoised version. We chose to compare results against the recent self-supervised approach of [6]. We observed that the algorithm of [6] produced results of higher quality when its original backbone CNN was replaced by DnCNN [16], so we decided to compare our approach with this modification of [6] that we call Noise2Void-4R (4R stands for the four rotations used to denoise an image patch).
Fig. 3. Fragments of (1) abdominal CT projections (A-test) and (2) abdominal CT scans (A-test) reconstructed from original projections: (a) low-dose, (d) full-dose; from projections denoised by self-supervised models: (b) our model, MSE loss, (c) Noise2Void-4R, using one projection, (e) our model, loss given by (1); from projections denoised by DnCNN (supervised): (f) using seven projections.

Finally, we performed denoising using DnCNN trained in supervised mode. Because our approach uses adjacent projections, in order to make the comparison more fair, we included adjacent projections as additional input channels for Noise2Void-4R and DnCNN. A reasonable trade-off between complexity and level of detail for these models was achieved by taking three adjacent projections from each side, so the input consisted of seven projections.

All self-supervised models were trained in PyTorch using the Adam optimizer with default parameters, a learning rate of 10−, and a minibatch size of . The noise model (parameters a and λ) was trained together with the main denoising model. The minibatches consisted of random × crops. We assumed that full-dose projections were available for the validation sets, and used them only to decide when to stop training. Namely, we trained our models until the moment when the PSNR began to decrease and a weighted combination of SSIM and the L1 loss [17] began to grow on the validation set.

Supervised models were trained using the MSE loss and the Adam optimizer with default parameters, a learning rate of ·10−, and a minibatch size of . The minibatches were formed as in the case of the self-supervised training. The training was performed until the moment the training curve reached a plateau.

We compared the denoising results in the projection domain, and also in the image domain, which is usually of greater diagnostic and practical interest. We used the TIGRE toolbox [18] to reconstruct the CT slices from the projections. We computed the SSIM metric between the high-dose projections and, first, the low-dose projections and, next, the denoised projections (we did the same with the corresponding CT slices). The SSIM metrics are shown in Figure 2. The effect of our denoising is more pronounced for the abdomen data, probably because the dose (and the image quality) for head CT is higher than for the abdomen area. Interestingly, for Noise2Void-4R the use of several frames resulted in a smaller improvement of SSIM than when using only one frame.

We also evaluated the models visually, as image similarity metrics are not always demonstrative. The reconstructed CT scans are shown in Figure 3. The results demonstrate that our model performs better denoising than the Noise2Void-4R model, and the image quality is comparable with the supervised DnCNN model. Whereas the Noise2Void-4R model produced overly smooth reconstructions even if it used several adjacent projections as input, our model was better at preserving edges and fine details.
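As a rough illustration of how the noise parameters a and λ can be optimized jointly with the main model, the following sketch wires a toy stand-in network to the n2n_td_loss function from the earlier sketch. The TinyDenoiser class, the random input data, the loop length, the crop size, and the learning rate are placeholders for illustration only, not the actual training code.

```python
import torch
import torch.nn as nn


class TinyDenoiser(nn.Module):
    """Toy stand-in for the network of Fig. 1: predicts (mu_x, sigma_x) for the middle
    projection from the mean of the other frames, keeping the sketch self-contained."""

    def __init__(self):
        super().__init__()
        self.head = nn.Conv2d(1, 2, 3, padding=1, bias=False)   # bias-free, as in Sec. 2.2

    def forward(self, frames):                         # frames: (B, T, 1, H, W)
        mid = frames.shape[1] // 2
        ctx = torch.cat([frames[:, :mid], frames[:, mid + 1:]], dim=1).mean(dim=1)
        mu_x, log_sigma = self.head(ctx).chunk(2, dim=1)
        return mu_x, torch.exp(log_sigma)


model = TinyDenoiser()
log_lambda = nn.Parameter(torch.zeros(()))             # noise parameters, learned jointly
log_a = nn.Parameter(torch.zeros(()))
optimizer = torch.optim.Adam(list(model.parameters()) + [log_lambda, log_a], lr=1e-4)

for _ in range(10):                                     # placeholder loop over minibatches
    batch = torch.rand(4, 7, 1, 64, 64)                 # e.g. 7 adjacent projections, random crops
    mu_x, sigma_x = model(batch)
    y = batch[:, batch.shape[1] // 2]                   # the noisy middle projection
    loss = n2n_td_loss(y, mu_x, sigma_x, log_lambda, log_a)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

At test time, the denoised middle projection would be obtained from the same outputs with the posterior_mean function of Eq. (2).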
4. CONCLUSION AND DISCUSSIONS
We demonstrated a new approach for denoising CT projections that supports training the model in a self-supervised mode and denoises sequences of images depending only on the features extracted from these sequences. We compared our Noise2NoiseTD model to the state-of-the-art self-supervised denoising model [6] and to the popular supervised DnCNN network [16]. Our model outperformed the self-supervised denoising model, and although we did not use high-quality ground truth images during training, it produced results that are comparable to those of the supervised model.

Since the experiments were carried out on simulated data, the method should be tested on real data as well. Also, we plan to refine the learning process to consider projection normalization for more accurate noise approximation and estimation of its parameters. The image quality metrics that we used in this work may not correlate well with the opinion of radiologists. A qualitative comparison of denoising methods should be carried out with the help of domain experts.

Because our method is not specific to the imaging modality, except for the chosen noise model, it can be adopted for denoising sequences of other medical images, such as dynamic PET, dynamic MR, spectral CT, ultrasonograms, etc.

5. COMPLIANCE WITH ETHICAL STANDARDS
This research study was conducted retrospectively using human subject data made available in open access by The Cancer Imaging Archive (TCIA). Ethical approval was not required, as confirmed by the license attached with the open access data.
6. ACKNOWLEDGMENTS
The authors declare no conflicts of interest.
7. REFERENCES

[1] D.R. Aberle, A.M. Adams, C.D. Berg, W.C. Black, J.D. Clapp, R.M. Fagerstrom, I.F. Gareen, C. Gatsonis, P.M. Marcus, and J.D. Sicks, "Reduced lung-cancer mortality with low-dose computed tomographic screening," New England Journal of Medicine, vol. 365, no. 5, pp. 395–409, Aug. 2011.

[2] H. Shan, Y. Zhang, Q. Yang, U. Kruger, M.K. Kalra, L. Sun, W. Cong, and G. Wang, "3-D convolutional encoder-decoder network for low-dose CT via transfer learning from a 2-D trained network," IEEE Transactions on Medical Imaging, vol. 37, no. 6, pp. 1522–1534, 2018.

[3] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, "Noise2Noise: Learning image restoration without clean data," in Proceedings of Machine Learning Research, vol. 80, pp. 2965–2974, PMLR, 2018.

[4] P. Gnudi, B. Schweizer, M. Kachelrieß, and Y. Berker, "Denoising of X-ray projections and computed tomography images using convolutional neural networks without clean data," in The 6th International Conference on Image Formation in X-Ray Computed Tomography, 2020, pp. 590–593.

[5] A. Krull, T. Buchholz, and F. Jug, "Noise2Void - learning denoising from single noisy images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2124–2132.

[6] S. Laine, T. Karras, J. Lehtinen, and T. Aila, "High-quality self-supervised deep image denoising," in Advances in Neural Information Processing Systems 32, pp. 6970–6980, Curran Associates, Inc., 2019.

[7] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.

[8] K. Kim, S. Soltanayev, and S.Y. Chun, "Unsupervised training of denoisers for low-dose CT reconstruction without full-dose ground truth," IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 6, pp. 1112–1125, 2020.

[9] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[10] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1 (NIPS'15), Cambridge, MA, USA, 2015, pp. 802–810, MIT Press.

[11] A. Novikov, D. Major, M. Wimmer, D. Lenis, and K. Bühler, "Deep sequential segmentation of organs in volumetric medical scans," IEEE Transactions on Medical Imaging, vol. 38, no. 5, pp. 1207–1215, May 2019.

[12] S. Mohan, Z. Kadkhodaie, E.P. Simoncelli, and C. Fernandez-Granda, "Robust and interpretable blind image denoising via bias-free convolutional neural networks," in International Conference on Learning Representations, 2020.

[13] M. Diwakar and M. Kumar, "A review on CT image noise and its denoising," Biomedical Signal Processing and Control, vol. 42, pp. 73–88, 2018.

[14] H. Lu, I.-T. Hsiao, X. Li, and Z. Liang, "Noise properties of low-dose CT projections and noise treatment by scale transformations," in IEEE Nuclear Science Symposium Conference Record, 2001, vol. 3, pp. 1662–1666.

[15] C. McCollough, B. Chen, D. Holmes, X. Duan, Z. Yu, L. Yu, S. Leng, and J. Fletcher, "Low Dose CT image and projection data (LDCT-and-Projection-data)," The Cancer Imaging Archive, 2020.

[16] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.

[17] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, "Loss functions for image restoration with neural networks," IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47–57, 2017.

[18] A. Biguri, M. Dosanjh, S. Hancock, and M. Soleimani, "TIGRE: a MATLAB-GPU toolbox for CBCT image reconstruction," Biomedical Physics & Engineering Express, vol. 2, no. 5, 2016.