A Lightweight Structure Aimed to Utilize Spatial Correlation for Sparse-View CT Reconstruction
Yitong Liu [email protected]
Ken Deng [email protected]
Chang Sun [email protected]
Hongwen Yang [email protected]
Abstract
Sparse-view computed tomography (CT) is a widely used approach to reduce radiation dose while accelerating imaging through a lowered number of projection views and the correspondingly reduced computation. However, severe imaging noise and streaking artifacts remain a major issue in this low-dose protocol. In this paper, we propose a dual-domain deep learning-based method that breaks through the limitation of currently prevailing algorithms that merely process single image slices. Since the scanned object usually contains a high degree of spatial continuity, the obtained consecutive imaging slices embody rich information that is largely unexplored. Therefore, we establish a cascade model named LS-AAE to tackle this problem. In addition, to adapt to the trend toward lightweight, mobile medical care, our model adopts the inverted residual with linear bottleneck in its module design, making it mobile and lightweight (reducing model parameters to one-eighth of the original) without sacrificing performance. In our experiments, sparse sampling is conducted at intervals of 4°, 8° and 16°, a challenging sparsity that few scholars have attempted. Nevertheless, our method still exhibits robustness and achieves state-of-the-art performance, reaching a PSNR of 40.305 and an SSIM of 0.948 while ensuring high model mobility. In particular, it still exceeds other current methods when the sampling rate is one-fourth of theirs, demonstrating its remarkable superiority.
1 Introduction

Over the last few decades [1, 2], X-ray Computed Tomography (CT) has demonstrated its prominent practical value across a wide range of applications, including clinical diagnosis, safety inspection and industrial detection [3]. Especially in the past year, due to the global spread of the Corona Virus Disease 2019 (COVID-19), the term CT has become well known to the public as an essential auxiliary technology. However, the radiation dose delivered by CT has a non-negligible side effect on the human body. Since it carries a latent risk of inducing cancers, radiation dose reduction is becoming more and more crucial under the principle of ALARA (as low as reasonably achievable) [4, 5, 6, 7]. Generally speaking, there are two approaches to reducing radiation dose. Reducing the tube current (or voltage) [8, 9] lowers the X-ray exposure in each view but suffers from increased noise in the projections. Although reducing the number of projection views [10, 11] (also known as sparse-view CT) avoids the former problem and brings the additional benefit of accelerated scanning and computation, it leads to severe image quality degradation in the form of streaking artifacts caused by the missing projections. In this paper, we focus on effectively repairing and reconstructing sparse-view CT so as to acquire high-quality CT images.

Sparse-view CT reconstruction has always been a classic inverse problem that has attracted wide attention [12]. In the past few decades, iterative reconstruction methods have become the dominant approach to solving inverse problems [13, 14, 15, 16].
With the advent of compressed sensing [17] and its related regularizers, the quality of reconstructed images has improved to a certain extent. One of the most typical regularizers is total variation (TV); algorithms based on it include TV-POCS [18], the TGV method [19], SART [13] and SART-TV [20], etc. In addition, dictionary learning is also commonly used as a regularizer: for example, [21] constructs a global dictionary and an iterative adaptive dictionary to solve the problem of low-dose CT reconstruction.

In recent years, with the improvement of computing power, deep learning has grown rapidly [22]. Neural networks have subsequently been widely applied to image analysis tasks such as image classification [23] and image segmentation [24, 25, 26], and especially to inverse problems in image reconstruction such as artifact reduction [27, 28], denoising [29] and inpainting [30]. Since GANs (Generative Adversarial Networks) were elaborately designed by Goodfellow in 2014 [31], they have been adopted in many image processing tasks owing to their prominent performance in realistically predicting image details. GANs have therefore also been applied to improving the quality of low-dose CT images [32, 33, 34]. In addition, Ye et al. explored the relationship between deep learning and classical signal processing methods in [35], explained why deep learning can be employed in imaging inverse problems, and provided a theoretical basis for the application of deep learning to low-dose CT reconstruction.

Some researchers adopt deep learning-based architectures to complement and restore limited-view Radon data [36, 37, 32, 38, 39, 40]. Dong et al. [36] used U-Net [25] to predict the missing Radon data, then reconstructed it into an image through FBP [41]. Jian Fu et al. [37] built a network that tightly couples a deep neural network with the DPC-CT (differential phase-contrast CT) reconstruction algorithm in the domain of DPC projection sinograms; the estimated result is a complete phase-contrast projection sinogram. Rushil Anirudh et al. established CTNet [38], a system of 1D and 2D convolutional neural networks that operates on the limited-view sinogram to predict the full-view sinogram, which is then fed to standard analytical and iterative reconstruction algorithms to obtain the final result.

Other researchers carried out post-processing on reconstructed images with deep learning models, removing artifacts and noise to upgrade image quality [42, 43, 44, 33, 45, 46, 47]. In 2016, a deep convolutional neural network [44] was proposed to learn an end-to-end mapping between FBP and artifact-free images. In 2018, Yoseob Han and Jong Chul Ye designed dual frame and tight frame U-Nets [42] that satisfy the frame condition and perform better at recovering high-frequency edges in sparse-view CT. In 2019, Xie et al. [33] built an end-to-end cGAN model with a joint loss function for removing artifacts from limited-angle CT reconstruction images. In 2020, Wang et al. [45] developed a limited-angle TCT image reconstruction algorithm based on U-Net, which suppresses artifacts while preserving structures.
Experiments have shown that U-Net-like structures are efficacious for image artifact removal and texture restoration [35, 36, 42, 45, 47].

Since neural networks are capable of predicting unknown data in both the Radon and image domains, a natural idea is to combine these two domains [48, 49, 34, 50, 51, 52] to acquire better restoration results. Specifically, such a method first complements the Radon data, then removes the residual artifacts and noise from the images converted from the full-view Radon data. In 2018, Zhao et al. proposed SVGAN [34], an artifact reduction method for low-dose and sparse-view CT via a single model trained by a GAN. In 2019, Liang et al. [49] proposed a comprehensive network combining the projection and image domains, where the projection estimation network is based on a Res-CNN structure and the image domain network takes advantage of U-Net. In 2020, Zhu et al. designed ADAPTIVE-NET [50] to conduct joint denoising on the acquired sinogram and the reconstructed CT image while reconstructing the CT image in an end-to-end manner. Over the past three years, experiments have proved that this sort of two-stage algorithm is quite conducive to image quality improvement.

All the current mainstream methods mentioned above solely process each single CT image, while neglecting the solid fact that the scanned object is always highly continuous. Consequently, abundant spatial information lies in the obtained consecutive CT images, which is largely left to be exploited. This inspires us to propose a novel cascade model called LS-AAE (Lightweight Spatial Adversarial Autoencoder) that mainly focuses on effectively utilizing the spatial information between strongly correlated images. Our experiments show that this structural design efficaciously removes streaking artifacts in sparse-view CT images and outruns other prevailing methods with its remarkable performance.

It is now a social trend to make healthcare mobile and portable. In many deep learning-based methods, however, scholars improve accuracy at the expense of computing resources; such computational complexity usually exceeds the capabilities of many mobile and embedded applications. This paper adopts the inverted residual with linear bottleneck [53] in its module design to propose a mobile structure that reduces model parameters to one-eighth of the original without sacrificing performance.

Although enhancing the sparsity of sparse-view CT brings the benefits of accelerated scanning and computation, it causes additional imaging damage; balancing image quality and X-ray dose level is a well-known trade-off problem. Thus, in order to explore the limit of sparsity in sparse-view CT reconstruction, we conduct sparse sampling at intervals of 4°, 8° and, most importantly, 16°. Even under such sampling sparsity, our model still exhibits remarkable robustness and state-of-the-art performance.

We introduce our proposed method exhaustively in Section II, describe the experimental results and corresponding discussion in Section III, and state the conclusion in Section IV.
2 Method

2.1 Motivation

2.1.1 Spatial Correlation between Consecutive CT Images

As is known to all, consecutive CT images usually exhibit high spatial coherency, since the scanned object is usually spatially continuous. On that account, we can regard these CT images as adjacent frames of a video, which contain much more information than a still image. This high correlation within the sequence of images can improve artifact removal in two ways. Firstly, extending the search region from a two-dimensional image neighborhood to a three-dimensional spatial neighborhood provides extra information that can be used to denoise the reference image. Secondly, using spatial neighbors helps to reduce streaking artifacts, as the residual error in each image is correlated.

We also cannot help but notice that the task of artifact removal between consecutive images is similar to video denoising. After investigating a large body of research on video denoising [54, 55, 56, 57, 58, 59, 60, 61, 62, 63], we find that current state-of-the-art methods lay great emphasis on motion estimation, due to the strong redundancy along motion trajectories. In conclusion, to more effectively remove streaking artifacts from sparse-view CT images, we need to design a structure that can not only look into the three-dimensional spatial neighborhood but also capture the motion between consecutive images.
2.1.2 Depthwise Separable Convolutions

In recent years, much research has been invested into tuning deep neural networks to achieve an optimal balance between efficiency and performance. Among the resulting techniques, the depthwise separable convolution [64] exhibits extraordinary capability and has gradually become an essential building block of numerous lightweight neural networks [64, 65, 66]. It decomposes the standard convolutional layer into two separate layers, namely a depthwise convolutional layer and a pointwise convolutional layer. The former performs lightweight filtering by employing a single convolutional filter per input channel; the latter conducts a 1×1 convolution to construct new features by computing linear combinations of the input channels.

For a standard convolutional layer with input tensor size (c_in, h, w), kernel size (c_out, c_in, k, k) and output tensor size (c_out, h, w), the computational cost equals c_in · h · w · (k² · c_out). In a depthwise separable convolution, however, the depthwise convolutional layer has a computational cost of only c_in · h · w · k², since it operates on a single input channel at a time, and the pointwise convolutional layer has a computational cost of c_in · h · w · c_out. A depthwise separable convolution therefore only requires a computational cost of h · w · c_in · (k² + c_out), which is almost one-ninth of that of the standard convolution (k equals 3 in our case). Most importantly, depthwise separable convolutions lower the computational complexity to a large extent without sacrificing accuracy, which makes them perfect for insertion into our module design.

2.2 Overall Structure

2.2.1 Structure Overview

Figure 1: Structure overview. The sparse-view Radon data X is first sent to the neural network F for completion; the restored full-view Radon data X′ is then converted to the image Y, which is fed into the neural network G for artifact removal, finally yielding the ideal high-quality image Y′.

We can learn from the universal approximation theorem [67] that multilayer feedforward networks are capable of approximating various continuous functions. This inspires us to think that neural networks can be used to learn complex mappings that are difficult to solve through mathematical analysis. Thus, in this paper, we utilize a deep learning-based structure that combines the Radon domain and the image domain (Figure 1) to solve the task of sparse-view CT reconstruction and inpainting.

Firstly, we want to make full use of the prior information in the Radon domain by converting the sparse-view Radon data X to the full-view Radon data X′, so as to complement the missing data at the unsampled scanning angles. According to the universal approximation theorem, this process can be represented by the mapping X → X′ under a function f, which can be approximated by our proposed neural network F. After we obtain the full-view Radon data X′, we transform it into the image Y through FBP. Although this first stage manages to alleviate the severe imaging damage of the original sparse-view CT image, many streaking artifacts still exist in Y and need to be removed to acquire the high-quality restored result Y′. We represent this restoration process by the mapping Y → Y′ under a function g, which can be approximated by our proposed neural network G. Through this two-stage structure that combines the Radon domain with the image domain, we finally obtain the ideal restored results.
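To make the cost comparison of Section 2.1.2 concrete, the following minimal PyTorch sketch contrasts a standard 3×3 convolution with its depthwise separable counterpart; the channel sizes are illustrative and not tied to any specific layer of our network.

```python
import torch
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

c_in, c_out, k = 64, 128, 3

# Standard convolution: k * k * c_in * c_out weights.
standard = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)

# Depthwise separable convolution: one k x k filter per input channel
# (groups=c_in), followed by a 1x1 pointwise combination of channels.
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False),  # depthwise
    nn.Conv2d(c_in, c_out, 1, bias=False),                         # pointwise
)

x = torch.randn(1, c_in, 96, 256)
assert standard(x).shape == separable(x).shape  # identical output shape

print(count_params(standard))   # 73728
print(count_params(separable))  # 8768
```

For c_in = 64, c_out = 128 and k = 3, the parameter counts are 73,728 versus 8,768, i.e., roughly the (k² + c_out)/(k² · c_out) ≈ 1/8.4 ratio derived above.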
2.2.2 Stage One: Restoration in the Radon Domain

We first adopt linear interpolation to convert the original sparse-view Radon data to full-view Radon data, so as to satisfy the structural characteristics of our proposed neural network, which requires the input and output images to have the same resolution. Then we build a lightweight adversarial autoencoder (L-AAE, Figure 2) to restore the Radon data. The structure of its autoencoder (L-AE) can be seen in Figure 3 and Table 1; it is composed of an encoder and a decoder that are highly symmetrical.

Figure 2: The diagram of our proposed L-AAE, which is composed of an L-AE and a discriminator that helps restore the image texture.

Figure 3: The detailed structure of the L-AE. Input images are first fed into the encoder for feature extraction and then sent into the decoder for texture restoration, where skip connections are added to merge low-level features.

Table 1: Parametric structure of the L-AE, listing each layer's input channels (IC), output channels (OC), stride, input size and output size. The first layer, Conv1, maps 1 channel to 32 with a stride of 2; the encoder downsamples the feature maps from 192×512 through 96×256, 48×128 and 24×64 down to 12×32, and the decoder upsamples them symmetrically back to 192×512.

The first layer of each building block expands (by an expansion factor exp) a low-dimensional compressed representation to a higher-dimensional one with a kernel size of 1×1. The intermediate expansion layer adopts the lightweight depthwise convolutions mentioned above, so as to significantly decrease the number of operations (ops) and the memory needed while sustaining the same performance. The last layer projects the features back to a low-dimensional representation with a linear convolution, like the first layer. All these layers are followed by batch normalization [69] and ReLU [70] as the non-linear activation, except for the last layer, which is followed only by a batch normalization layer.
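As a sketch of such a building block (a 1×1 expansion, a 3×3 depthwise convolution and a 1×1 linear projection, with batch normalization after every convolution and ReLU after all but the last), one possible PyTorch implementation follows; the class name, defaults and exact shortcut handling are our own illustrative choices rather than the paper's released code.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted residual with linear bottleneck: 1x1 expansion ->
    3x3 depthwise convolution -> 1x1 linear projection, with BN after
    each conv and ReLU after all but the last, plus a skip connection
    when the input and output resolutions match."""

    def __init__(self, ic: int, oc: int, stride: int = 1, exp: int = 6):
        super().__init__()
        hidden = ic * exp
        self.conv = nn.Sequential(
            nn.Conv2d(ic, hidden, 1, bias=False),          # expansion
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),          # depthwise
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, oc, 1, bias=False),          # linear projection
            nn.BatchNorm2d(oc),
        )
        # Shortcut only when the resolution is unchanged; a 1x1 conv
        # aligns the channel counts when IC and OC differ.
        self.use_skip = stride == 1
        self.proj = (nn.Conv2d(ic, oc, 1, bias=False)
                     if self.use_skip and ic != oc else nn.Identity())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv(x)
        return out + self.proj(x) if self.use_skip else out

block = InvertedResidual(ic=32, oc=64, stride=1, exp=6)
print(block(torch.randn(1, 32, 48, 128)).shape)  # torch.Size([1, 64, 48, 128])
```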
Figure 4: The detailed diagram of the building blocks in L-AAE, which adopt the inverted residual with linear bottleneck.

In Figure 4, IC and OC stand for the input and output channels of the building blocks respectively. All convolutional layers in all building blocks have a stride of 1, except for Block2_1, Block3_1 and Block4_1, which have a stride of 2 to conduct downsampling. The expansion factor exp is 1 for Block1, Block7 and Block8 to avoid large ops and memory cost; we set exp to 3 for Block5_1 and Block6_1, and every block except those mentioned above has an exp of 6. Besides, shortcut connections are implemented in blocks whose input and output feature maps have the same resolution, to enhance information flow and improve the ability of gradients to propagate across multiple layers. We adopt a 1×1 convolution in the shortcut when the numbers of channels in the input and output feature maps differ.

The discriminator in our L-AAE aims to strengthen the model's ability to restore the detailed texture of images. Its structure is almost the same as that of the encoder above, except that its Block4_3 and Block4_4 have an OC of 64 and 1 respectively. The output of Block4_4 is flattened and sent to a sigmoid function for probability prediction, which we average to obtain the final output representing the probability that the input image is real. This novel lightweight AAE enables us to acquire well-restored Radon data that are complete at every scanning angle, with a computational cost about 8 times smaller than that of standard convolutions and no sacrifice in accuracy.

2.2.3 Stage Two: Restoration in the Image Domain

After stage one, we transform the acquired full-view Radon data into images and find that we have successfully enriched the information in the Radon domain and alleviated the streaking artifacts of the original sparse-view CT imaging. In stage two, we mainly focus on removing the remaining artifacts and restoring the image to an ideal level. As mentioned above, we need a neural network that can not only look into the three-dimensional spatial neighborhood but also capture the motion between consecutive images, so as to efficaciously utilize the abundant spatial information between consecutive images to remove artifacts from the input image.

Generally speaking, motion estimation brings an additional degree of complexity that is adverse to a model's deployment in practice. This means we need a structure that can deploy motion estimation without much resource cost. We refer to [71], whose general structure is a cascaded two-step architecture that inherently embeds the motion of objects. Inspired by this, we propose a model named Lightweight Spatial Adversarial Autoencoder (LS-AAE), shown in Figure 5. It slightly modifies the L-AE from Figure 3 as its inpainting block; the modification is detailed in Table 2. The replacement of 2D convolution with 3D convolution enables our model to look into the three-dimensional spatial neighborhood for extra information.

Table 2: From 2D convolution to 3D convolution

                 Layer    IC  OC  KernelSize  Stride   Padding
2D Convolution   Conv1    1   16  (3,3)       (2,2)    (1,1)
3D Convolution   Conv1_1  1   16  (3,3,3)     (1,2,2)  (1,0,0)
                 Conv1_2  16  32  (3,3,3)     (2,1,1)  (0,1,1)

As shown in Figure 5, five consecutive images {I_{i-2}, I_{i-1}, I_i, I_{i+1}, I_{i+2}} are sent into the LS-AAE to restore the middle one. We first treat these inputs as the triplets of consecutive images {I_{i-2}, I_{i-1}, I_i}, {I_{i-1}, I_i, I_{i+1}} and {I_i, I_{i+1}, I_{i+2}}, then feed them into the Inpainting Blocks 1. Subsequently, we obtain the outputs of these blocks and combine them into the triplet {I′_{i-1}, I′_i, I′_{i+1}}, which is sent into Inpainting Block 2 to acquire the ultimate estimation I″_i corresponding to the central image I_i. The LS-AAE digs deep into the three-dimensional space and implicitly handles motion without any explicit motion compensation stage, owing to the traits of its architecture. Besides, the three Inpainting Blocks in step one share the same weights so as to avoid extra memory cost. We also add a discriminator in stage two to better restore the image texture: the predicted image I″_i and its corresponding ground truth (the full-view CT image) I^GT_i are both sent into this discriminator, whose structure is exactly the same as in stage one.

Figure 5: The diagram of our proposed LS-AAE. It combines an autoencoder that fully utilizes the spatial correlation between consecutive CT images and a discriminator that helps refine image details.
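The data flow of this cascaded two-step design can be sketched as follows; the inpainting blocks are replaced by plain 2D convolutions purely to show the tensor shapes, whereas the actual blocks are the 3D-convolutional L-AE variants described above.

```python
import torch
import torch.nn as nn

class TwoStepCascade(nn.Module):
    """Step one: one weight-shared inpainting block restores the middle
    slice of each of the three overlapping triplets. Step two: a second
    block fuses the three intermediate estimates into the final output,
    implicitly handling motion as in FastDVDnet-style cascades [71]."""

    def __init__(self, block1: nn.Module, block2: nn.Module):
        super().__init__()
        self.block1 = block1  # shared across all three triplets
        self.block2 = block2

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 5, H, W) holding I_{i-2} ... I_{i+2}
        estimates = [self.block1(frames[:, j:j + 3]) for j in range(3)]
        return self.block2(torch.cat(estimates, dim=1))  # (B, 1, H, W)

# Shape-only stand-ins for the real inpainting blocks.
model = TwoStepCascade(nn.Conv2d(3, 1, 3, padding=1),
                       nn.Conv2d(3, 1, 3, padding=1))
print(model(torch.randn(2, 5, 192, 512)).shape)  # torch.Size([2, 1, 192, 512])
```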
2.3 Loss Functions and Training

Stage one and stage two are trained separately. For the autoencoders in these two stages, we employ the multi-loss function below, which consists of three parts, l_{MSE}, l_{Adv} and l_{Reg}, with their respective hyperparameters α_1, α_2 and α_3:

l_{AE} = \alpha_1 l_{MSE} + \alpha_2 l_{Adv} + \alpha_3 l_{Reg}    (1)

l_{MSE} calculates the mean square error between the restored image and its corresponding ground truth; it is widely used in various image inpainting tasks since it provides an intuitive evaluation of the model's prediction. The expression of l_{MSE} is given in Equation (2):

l_{MSE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left( I^{GT}_{x,y} - G_{AE}(I^{Input})_{x,y} \right)^2    (2)

where the function G_{AE} stands for the autoencoder, I^{Input} and I^{GT} are the input image and its corresponding ground truth, and W and H are the width and height of the input image respectively.

l_{Adv} refers to the adversarial loss. The autoencoder tries to fool the discriminator by making its prediction as close to the ground truth as possible, so as to achieve the ideal image restoration outcome. Its expression is given in Equation (3):

l_{Adv} = 1 - D\left( G_{AE}(I^{Input}) \right)    (3)

where the functions D and G_{AE} stand for the discriminator and the autoencoder respectively, and I^{Input} is the model's input image.

l_{Reg} is the regularization term of our multi-loss function. Since noise has a side effect on the restoration result, we add a regularization term to maintain the smoothness of the image and to prevent overfitting. The TV loss is widely used in image analysis tasks; it reduces the variation between adjacent pixels to a certain extent. Its expression is given in Equation (4):

l_{Reg} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left\| \nabla G_{AE}(I^{Input})_{x,y} \right\|    (4)

where \nabla calculates the gradient and \|\cdot\| denotes the norm.

To optimize the discriminators of the two stages, their loss function should enable them to better distinguish between real and fake inputs. The loss function l_{Dis} is shown in Equation (5):

l_{Dis} = 1 - D\left( I^{GT} \right) + D\left( G_{AE}(I^{Input}) \right)    (5)

The discriminator outputs a scalar between 0 and 1 that represents the probability that the input image is real. Therefore, minimizing 1 - D(I^{GT}) (i.e., maximizing D(I^{GT})) enables the discriminator to recognize real images, while minimizing D(G_{AE}(I^{Input})) enables it to distinguish the fake images generated by the autoencoder from all input images.

During the training process, we adopt the Adam algorithm [72] for optimization; the learning rate is initially set to 1e-4. For the multi-loss function, α_1, α_2 and α_3 are set to 1, 1e-3 and 2e-8 respectively. We implement our whole structure using PyTorch [73] on two GeForce RTX 2080 Ti GPUs.

3 Experiments

We adopt LIDC-IDRI [74] as our dataset, which includes 1018 cases and approximately 240,000 DCM files of corresponding CT images. Cases 1 to 100 form the test set, cases 101 to 200 form the validation set, and the rest form the training set. Such a large amount of data allows us to train our models from scratch without overfitting. We utilize NumPy to read these DCM files and conduct sparse sampling at intervals of 4, 8 and 16 (the corresponding full-view Radon data has 180 projections).
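For reference, sparse-view data of this kind can be simulated from a single slice with scikit-image's Radon transform; the snippet below keeps every fourth of 180 projections and builds the linearly interpolated full-view sinogram used as the stage-one input. The phantom and sizes are illustrative, and `filter_name` assumes a recent scikit-image release.

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, resize

image = resize(shepp_logan_phantom(), (192, 192))
theta_full = np.arange(180, dtype=float)      # full view: 180 projections
sino_full = radon(image, theta=theta_full)    # shape: (detectors, 180)

interval = 4                                  # keep every 4th angle
theta_sparse = theta_full[::interval]
sino_sparse = sino_full[:, ::interval]

# Linear interpolation along the angular axis restores the full-view grid,
# so the network input and output share the same resolution.
sino_interp = np.stack(
    [np.interp(theta_full, theta_sparse, row) for row in sino_sparse]
)

fbp_result = iradon(sino_interp, theta=theta_full, filter_name='ramp')
print(sino_sparse.shape, sino_interp.shape, fbp_result.shape)
```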
Subsequently, we first analyze our overall structure through a series of ablation studies, and then compare our experimental results with those of other current methods to prove its superiority and robustness.
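Before turning to the ablation studies, the multi-loss of Section 2.3 can be sketched in PyTorch as follows; `G_AE` and `D` are assumed to be the autoencoder and discriminator modules defined elsewhere, and the TV term uses a finite-difference approximation, one common reading of Equation (4).

```python
import torch

def tv_loss(img: torch.Tensor) -> torch.Tensor:
    """Total-variation regularizer: mean norm of the image gradient,
    approximated with horizontal/vertical finite differences."""
    dh = img[..., :, 1:] - img[..., :, :-1]
    dv = img[..., 1:, :] - img[..., :-1, :]
    return dh.abs().mean() + dv.abs().mean()

def autoencoder_loss(G_AE, D, x_in, x_gt,
                     a1: float = 1.0, a2: float = 1e-3, a3: float = 2e-8):
    """l_AE = a1 * l_MSE + a2 * l_Adv + a3 * l_Reg (Equation 1)."""
    pred = G_AE(x_in)
    l_mse = torch.mean((x_gt - pred) ** 2)   # Equation (2)
    l_adv = 1.0 - D(pred).mean()             # Equation (3)
    l_reg = tv_loss(pred)                    # Equation (4)
    return a1 * l_mse + a2 * l_adv + a3 * l_reg

def discriminator_loss(G_AE, D, x_in, x_gt):
    """l_Dis = 1 - D(I_GT) + D(G_AE(I_Input)) (Equation 5); the
    generator output is detached so only D receives gradients."""
    return 1.0 - D(x_gt).mean() + D(G_AE(x_in).detach()).mean()
```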
3.1 Ablation Studies

With all the innovations we make in our overall structure design, it is appropriate to conduct corresponding ablation studies to prove their necessity. In this part, unless otherwise specified, all experimental results are acquired from sparse-view CT data with a sampling interval of 4.

3.1.1 The L-AE's Trade-off between Mobility and Performance
As is well known, U-Net shows extraordinary performance in numerous medical image processing tasks; [42] implemented it for sparse-view CT image restoration and obtained outstanding results. To verify that our proposed autoencoder achieves a good balance between performance and mobility, we replace it with U-Net in the first stage and compare the restoration results and model parameters of this stage with ours, as shown in Table 3. The images mentioned in Table 3 are reconstructed from the Radon data restored through stage one.

Table 3: U-Net vs. L-AE

        Radon PSNR  Radon SSIM  Image PSNR  Image SSIM  Parameters
U-Net   57.582      0.998       29.598      0.874       10.401M
L-AE    57.660      0.998       29.609      0.874       1.675M

As we can see from Table 3, whether in the Radon domain or in the image domain, the L-AE has competitive performance compared with U-Net. Moreover, it significantly reduces the number of model parameters, making it suitable for situations where computational resources are extremely limited. This exhibits our model's ability to efficiently restore CT images, thus fitting the social trend of deploying portable medical devices.
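The PSNR and SSIM values reported in these tables can be computed, for instance, with scikit-image; a sketch assuming slices normalized to [0, 1]:

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_slice(pred, gt):
    """PSNR/SSIM of a restored slice against its full-view ground truth,
    with both given as float arrays normalized to [0, 1]."""
    return (peak_signal_noise_ratio(gt, pred, data_range=1.0),
            structural_similarity(gt, pred, data_range=1.0))
```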
3.1.2 The Role of the Discriminator

We establish discriminators in both stages, hoping to further improve our model's performance in restoring sparse-view CT data through adversarial learning between the autoencoders and the discriminators. To verify this point, we send the test set into a version of stage one that contains only the autoencoder and compare its restoration results with ours, as shown in Table 4. The images mentioned in Table 4 are reconstructed from the Radon data restored through stage one.

Table 4: The Role of the Discriminator

            Radon PSNR  Radon SSIM  Image PSNR  Image SSIM
L-AE Only   48.904      0.985       28.448      0.871
L-AAE       57.660      0.998       29.609      0.874
The table above shows the significance of our proposed discriminator: it indeed helps our model achieve a better level of restoration under the evaluation of PSNR and SSIM. Its compact structure (see Section II) also ensures a high degree of mobility, which keeps our overall structure portable and accurate at the same time.
3.1.3 The Cascaded Two-Step Architecture

As we state in Section II, this sort of cascaded two-step structure inherently embeds the motion of objects, which largely helps to remove image artifacts due to the strong redundancy between consecutive images. Consequently, we design an experiment with reference to [71] to prove this view. In the second stage, instead of sending five consecutive images into the two-step LS-AAE, we directly input them into a single Inpainting Block (SIB) whose three-dimensional convolution part is slightly modified to handle five images, i.e., we adopt a stride of 2 in the Conv1_1 layer (see Table 2). The experimental results are shown in Table 5.

Without the built-in cascade structure that implicitly conducts motion estimation, the SIB suffers an obvious drop in PSNR and SSIM. We can therefore conclude that the LS-AAE effectively improves the model's capability of restoring CT images with its cascaded two-step architecture, which inherently captures the motion between consecutive images.

Table 5: Restoration Results of SIB and LS-AAE

         Image PSNR  Image SSIM
SIB      38.972      0.941
LS-AAE   40.305      0.948

3.1.4 The 3D Convolution in LS-AAE
We mention in Section II that extending the search region from a two-dimensional image neighborhood to a three-dimensional spatial neighborhood provides extra information for image restoration, and that extracting spatial features is conducive to removing streaking artifacts since the residual error in each image is correlated. To realize this extension of the search region, three-dimensional convolution is employed in every Inpainting Block of the LS-AAE. To verify the cruciality of these three-dimensional convolutions, we conduct an experiment in which the 3D convolutions are replaced by 2D convolutions, with the number of input images regarded as the number of input channels (see Table 2). The inpainting results of the two models are shown in Table 6.

Table 6: Restoration Results of 2D and 3D LS-AAE

            Image PSNR  Image SSIM
2D LS-AAE   39.472      0.944
3D LS-AAE   40.305      0.948
We can see that the 2D variant suffers a drop of about 0.9 dB in PSNR, proving that three-dimensional convolutions assist the model in restoring CT images to a certain extent without significantly consuming computational resources.
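To illustrate how the two 3D layers of Table 2 collapse the slice dimension, a quick shape check follows; the kernel, stride and padding values come from Table 2, while the 96×256 spatial input is illustrative.

```python
import torch
import torch.nn as nn

# 3D variant of the stem, with the kernel/stride/padding of Table 2.
conv1_1 = nn.Conv3d(1, 16, kernel_size=(3, 3, 3),
                    stride=(1, 2, 2), padding=(1, 0, 0))
conv1_2 = nn.Conv3d(16, 32, kernel_size=(3, 3, 3),
                    stride=(2, 1, 1), padding=(0, 1, 1))

x = torch.randn(1, 1, 3, 96, 256)  # (B, C, depth = 3 slices, H, W)
h = conv1_1(x)
print(h.shape)            # torch.Size([1, 16, 3, 47, 127]): spatial roughly halved
print(conv1_2(h).shape)   # torch.Size([1, 32, 1, 47, 127]): depth 3 -> 1
```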
3.1.5 The Image Interval between Input Images

In all the experiments above, we set the image interval between the input consecutive CT images of the LS-AAE to the default value of 1. However, we cannot help but wonder whether increasing the interval can help the model obtain more spatial information, thereby enhancing its ability to remove image artifacts. In the following experiment, we set this image interval T to 1, 2, 3, 4 and 5 respectively; the corresponding results are shown in Table 7.

Table 7: The Image Interval's Effect on Restoration Results (Image PSNR and Image SSIM for T = 1, 2, 3, 4 and 5).

3.2 The Role of Each Domain

In this paper, we adopt a two-stage structure that combines the Radon domain and the image domain to obtain high-quality sparse-view CT images. Since each stage of the overall structure conducts restoration in its own domain and both remarkably upgrade the restoration results, this leads us to ask: what role does each of the two domains play? We therefore feed our test set into three structures: the L-AAE of stage one, which concentrates on the Radon domain; the LS-AAE of stage two, which focuses on the image domain; and, of course, our overall structure that contains both stages. The quantitative inpainting results of the three structures are given in Table 8, and the visual outcome can be seen in Figure 6.

Figure 6: The intuitive restoration results obtained by different domains.

Table 8: Restoration Results Obtained by Different Domains

                   Image PSNR  Image SSIM
The Radon Domain   30.310      0.905
The Image Domain   34.135      0.888
Dual Domains       40.305      0.948
It can be seen that restoration in each single domain has its pros and cons. The Radon domain demonstrates superiority in enhancing the structural similarity of images and thus performs well under the evaluation of SSIM, whereas the image domain exhibits great ability in alleviating distortion and thus performs relatively well under the evaluation of PSNR. Naturally, we acquire extraordinary restoration results when combining the two domains to merge their respective advantages. Besides, we only exploit spatial correlation in the image domain, owing to our discovery that the spatial information between continuous Radon slices has little impact on the final inpainting outcome. We suppose this is because the texture of Radon slices bears little similarity to that of CT images and thus cannot be restored in this way.
3.3 Comparison with Other Methods

After verifying the rationality of our overall structural design, we test its robustness by applying it to sparse-view CT data with higher levels of sparsity, that is, conducting sparse sampling at intervals of 4, 8 and even 16 (the corresponding full-view Radon data has 180 projections). In addition, we compare our method with other current ones to prove its prominent capability of restoring sparse-view CT images and removing streaking artifacts. The experimental results are shown in Table 9, and the visual outcome can be seen in Figure 7.

As we can see, our method exhibits extraordinary capability in restoring sparse-view CT images, effectively removes streaking artifacts and outruns the other methods by a large margin. It can also be applied at extreme sparsity while still obtaining a prominent inpainting outcome. In particular, our method still exceeds the others when its sampling rate is one-fourth of theirs, demonstrating its remarkable robustness and superiority.

Figure 7: The intuitive restoration results of various methods at different sampling intervals.

Table 9: Methods Comparison

          Interval=4          Interval=8          Interval=16
          PSNR     SSIM       PSNR     SSIM       PSNR     SSIM
FBP       12.080   0.498      12.065   0.485      12.032   0.471
SART-TV   19.179   0.665      19.061   0.632      18.777   0.602
U-Net     34.018   0.885      31.944   0.843      28.767   0.798
Ours      40.305   0.948      37.633   0.937      34.052   0.910
4 Conclusion

In this paper, we propose a lightweight structure that efficaciously restores sparse-view CT with a two-stage architecture combining the Radon domain and the image domain. Most importantly, we exploit the abundant spatial information existing between consecutive CT images, achieving a remarkable restoration outcome even when our method encounters extreme sparsity. In the first stage, a mobile model named L-AAE is proposed to complement the original sparse-view CT data in the Radon domain; it adopts the inverted residual with linear bottleneck to significantly reduce computational resource requirements while maintaining outstanding performance. In the second stage, after reconstructing the restored full-view Radon data into images through FBP, we establish a lightweight model called LS-AAE. It is designed to implicitly conduct motion estimation and dig into the three-dimensional spatial neighborhood at a relatively low memory cost. It therefore manages to fully utilize the strong spatial correlation between continuous CT images, productively removing streaking artifacts and finally acquiring high-quality restoration results.

Ultimately, for sparse-view CT with a sampling interval of 4, we achieve a PSNR of 40.305 and an SSIM of 0.948, realizing a remarkable restoration result that effectively eliminates image artifacts. In addition, our method also performs well at extreme sparsity (a sampling interval of 8 or even 16), exhibiting its prominent robustness.

References
[1] A. M. Cormack. Representation of a function by its line integrals, with some radiological applications. II. J. Appl. Phys., 35(10):2908–2913, Nov. 1964.
[2] G. Hounsfield. Computerized transverse axial scanning (tomography): I. Description of system. Br. J. Radiol., 46(552):1016–1022, Jan. 1974.
[3] G. Wang, H. Yu, and B. De Man. An outlook on X-ray CT research and development. Med. Phys., 35(3):1051–1064, Apr. 2008.
[4] R. Krishnamoorthi, M. N. Ramarajan, N. E. Wang, M. B. Newman, M. E. Rubesova, C. M. Mueller, and R. A. Barth. Effectiveness of a staged US and CT protocol for the diagnosis of pediatric appendicitis: Reducing radiation exposure in the age of ALARA. Radiology, 259(1):231–239, Apr. 2011.
[5] T. L. Slovis. CT and computed radiography: The pictures are great, but is the radiation dose greater than required? Am. J. Roentgenol., 179(1):39–41, Aug. 2002.
[6] C. H. McCollough, A. N. Primak, N. Braun, J. Kofler, L. Yu, and J. Christner. Strategies for reducing radiation dose in CT. Radiol. Clin. N. Am., 47(1):27–40, Feb. 2009.
[7] C. H. McCollough, M. R. Bruesewitz, and J. M. Kofler. CT dose reduction and dose management tools: Overview of available options. Radiographics, 26(2):503–512, Mar. 2006.
[8] P. A. Poletti, A. Platon, O. Rutschmann, F. Schmidlin, C. Iselin, and C. Becker. Low-dose versus standard-dose CT protocol in patients with clinically suspected renal colic. Am. J. Roentgenol., 188(4):927–933, May 2007.
[9] D. Tack, V. De Maertelaer, and P. Gevenois. Dose reduction in multidetector CT using attenuation-based online tube current modulation. Am. J. Roentgenol., 181(2):331–334, Sep. 2003.
[10] J. Bian, J. Siewerdsen, X. Han, E. Sidky, J. Prince, C. Pelizzari, and X. Pan. Evaluation of sparse-view reconstruction from flat-panel-detector cone-beam CT. Phys. Med. Biol., 55(22):6575–6599, Oct. 2010.
[11] J. Bian, J. Wang, X. Han, E. Sidky, L. Shao, and X. Pan. Optimization-based image reconstruction from sparse-view data in offset-detector CBCT. Phys. Med. Biol., 58(2):205–230, Dec. 2012.
[12] K. Jin, M. McCann, E. Froustey, and M. Unser. Deep convolutional neural network for inverse problems in imaging. IEEE Trans. Image Process., PP:99, Nov. 2016.
[13] A. H. Andersen and A. C. Kak. Simultaneous algebraic reconstruction technique (SART): A superior implementation of the ART algorithm. Ultrason. Imaging, 6(1):81–94, 1984.
[14] D. Wu, K. Kim, G. E. Fakhri, and Q. Li. Iterative low-dose CT reconstruction with priors trained by artificial neural network. IEEE Trans. Med. Imag., PP(12):1–1, 2017.
[15] Z. Hu, J. Gao, N. Zhang, Y. Yang, X. Liu, H. Zheng, and D. Liang. An improved statistical iterative algorithm for sparse-view and limited-angle CT image reconstruction. Sci. Rep., 7(1), Dec. 2017.
[16] H. Zhang, B. Dong, and B. Liu. JSR-Net: A deep network for joint spatial-radon domain CT reconstruction from incomplete data. In IEEE Int. Conf. Acoust. Speech Signal Process. Proc., 2019.
[17] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory, 52(2):489–509, Mar. 2006.
[18] E. Y. Sidky and X. Pan. Image reconstruction in circular cone-beam computed tomography by constrained, total-variation minimization. Phys. Med. Biol., 53(17):4777–4807, Oct. 2008.
[19] S. Niu, Y. Gao, Z. Bian, J. Huang, W. Chen, G. Yu, Z. Liang, and J. Ma. Sparse-view X-ray CT reconstruction via total generalized variation regularization. Phys. Med. Biol., 59(12):2997, May 2014.
[20] E. Y. Sidky, C. M. Kao, and X. Pan. Accurate image reconstruction from few-views and limited-angle data in divergent-beam CT. J. X-Ray Sci. Technol., 14(2):119–139, 2009.
[21] Q. Xu, H. Y. Yu, X. Q. Mou, L. Zhang, J. Hsieh, and G. Wang. Low-dose X-ray CT reconstruction via dictionary learning. IEEE Trans. Med. Imag., 31(9):1682–1697, Apr. 2012.
[22] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision, 115:1–42, Jan. 2015.
[24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, 2015.
[25] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Proc. Int. Conf. Med. Image Comput. Comput. Assist. Intervent., pages 234–241, 2015.
[26] M. Soltaninejad, C. J. Sturrock, M. Griffiths, T. P. Pridmore, and M. P. Pound. Three dimensional root CT segmentation using multi-resolution encoder-decoder networks. IEEE Trans. Image Process., 29:6667–6679, 2020.
[27] C. Dong, Y. Deng, C. C. Loy, and X. Tang. Compression artifacts reduction by a deep convolutional network. In Proc. IEEE Int. Conf. Comput. Vision, pages 576–584, 2015.
[28] J. Guo and H. Chao. Building dual-domain representations for compression artifacts reduction. In Proc. Europ. Conf. Comp. Visi., pages 628–644, Sep. 2016.
[29] J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. Adv. Neural Inf. Proces. Syst., 1, 2012.
[30] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok. ReconNet: Non-iterative reconstruction of images from compressively sensed measurements. In Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., 2016.
[31] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Adv. Neural Inf. Proces. Syst., volume 27, pages 2672–2680, 2014.
[32] J. Bai, X. Dai, Q. Wu, and L. Xie. Limited-view CT reconstruction based on autoencoder-like generative adversarial networks with joint loss. Pages 5570–5574, Jul. 2018.
[33] S. Xie, H. Xu, and H. Li. Artifact removal using GAN network for limited-angle CT reconstruction. 2019.
[34] Z. Zhao, Y. Sun, and P. Cong. Sparse-view CT reconstruction via generative adversarial networks. 2018.
[35] J. C. Ye and Y. S. Han. Deep convolutional framelets: A general deep learning framework for inverse problems. SIAM J. Imaging Sci., 11(2), 2017.
[36] J. Dong, J. Fu, and Z. He. A deep learning reconstruction framework for X-ray computed tomography with incomplete data. PLoS One, 14:e0224426, Nov. 2019.
[37] J. Fu, J. Dong, and F. Zhao. A deep learning reconstruction framework for differential phase-contrast computed tomography with incomplete data. IEEE Trans. Image Process., 29(1):2190–2202, 2020.
[38] R. Anirudh, H. Kim, J. J. Thiagarajan, K. A. Mohan, K. Champley, and T. Bremer. Lose the views: Limited angle CT reconstruction via implicit sinogram completion. In Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., 2018.
[39] X. Dai, J. Bai, T. Liu, and L. Xie. Limited-view cone-beam CT reconstruction based on an adversarial autoencoder network with joint loss. IEEE Access, 7:7104–7116, 2019.
[40] M. U. Ghani and W. C. Karl. Deep learning-based sinogram completion for low-dose CT. In IEEE Image, Video, Multidimens. Signal Process. Workshop (IVMSP) Proc., pages 1–5, 2018.
[41] A. Katsevich. Theoretically exact filtered backprojection-type inversion algorithm for spiral CT. SIAM J. Appl. Math., 62(6):2012–2026, 2002.
[42] Y. Han and J. C. Ye. Framing U-Net via deep convolutional framelets: Application to sparse-view CT. IEEE Trans. Med. Imaging, 37(6):1418–1428, 2018.
[43] Z. Zhang, X. Liang, X. Dong, Y. Xie, and G. Cao. A sparse-view CT reconstruction method based on combination of DenseNet and deconvolution. IEEE Trans. Med. Imaging, 37(6):1–1, 2018.
[44] H. Zhang, L. Li, K. Qiao, L. Wang, B. Yan, L. Li, and G. Hu. Image prediction for limited-angle tomography via deep learning with convolutional neural network. arXiv preprint arXiv:1607.08707, 2016.
[45] J. Wang, J. Liang, J. Cheng, Y. Guo, and L. Zeng. Deep learning based image reconstruction algorithm for limited-angle translational computed tomography. PLoS One, 15(1):e0226963, 2020.
[46] S. Kuanar, V. Athitsos, D. Mahapatra, K. R. Rao, Z. Akhtar, and D. Dasgupta. Low dose abdominal CT image reconstruction: An unsupervised learning based approach. Pages 1351–1355, 2019.
[47] S. Guan, A. A. Khan, S. Sikdar, and P. V. Chitnis. Fully Dense UNet for 2-D sparse photoacoustic tomography artifact removal. IEEE J. Biomed. Health Inform., 24(2):568–576, 2020.
[48] D. Lee, S. Choi, and H. J. Kim. High quality imaging from sparsely sampled computed tomography data with deep learning and wavelet transform in various domains. Med. Phys., 2018.
[49] K. Liang, H. Yang, and Y. Xing. Comparison of projection domain, image domain, and comprehensive deep learning for sparse-view X-ray CT image reconstruction. arXiv preprint arXiv:1804.04289, 2018.
[50] J. Zhu, T. Su, X. Deng, X. Sun, and Y. Ge. Low-dose CT reconstruction with simultaneous sinogram and image domain denoising by deep neural network. In SPIE Med. Imag., volume 11312, pages 1007–1012, 2020.
[51] K. Hammernik, T. Würfl, T. Pock, and A. K. Maier. A deep learning architecture for limited-angle computed tomography reconstruction. Inf. aktuell, pages 92–97, 2017.
[52] Q. Zhang, Z. Hu, C. Jiang, H. Zheng, Y. Ge, and D. Liang. Artifact removal using a hybrid-domain convolutional neural network for limited-angle computed tomography imaging. Phys. Med. Biol., 65(15):155010, 2020.
[53] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., 2018.
[54] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. Pages 1310–1318, 2013.
[55] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., pages 2848–2857, 2017.
[56] M. Maggioni, G. Boracchi, A. Foi, and K. Egiazarian. Video denoising, deblocking, and enhancement through separable 4-D nonlocal spatiotemporal transforms. IEEE Trans. Image Process., 21(9):3952–3966, 2012.
[57] P. Arias and J. M. Morel. Video denoising via empirical Bayesian estimation of space-time patches. J. Math. Imaging Vis., 60(1):70–93, 2018.
[58] T. Vogels, F. Rousselle, B. McWilliams, G. Röthlin, A. Harvill, D. Adler, M. Meyer, and J. Novák. Denoising with kernel prediction and asymmetric loss functions. ACM Trans. Graph., 37(4):124, 2018.
[59] T. Ehret, A. Davy, J. M. Morel, G. Facciolo, and P. Arias. Model-blind video denoising via frame-to-frame training. Pages 11369–11378, 2019.
[60] M. Claus and J. V. Gemert. ViDeNN: Deep blind video denoising. 2019.
[61] A. Davy, T. Ehret, G. Facciolo, J. M. Morel, and P. Arias. Non-local video denoising by CNN. arXiv preprint arXiv:1811.12758, 2018.
[62] M. Tassano, J. Delon, and T. Veit. DVDNET: A fast network for deep video denoising. Pages 1805–1809, 2019.
[63] X. Chen, L. Song, and X. Yang. Deep RNNs for video denoising. In Proc. 39th Appl. Digit. Image Process., volume 9971, 2016.
[64] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[65] F. Chollet. Xception: Deep learning with depthwise separable convolutions. Pages 1800–1807, 2017.
[66] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. Pages 6848–6856, 2018.
[67] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Netw., 2(5):359–366, 1989.
[68] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In Proc. Europ. Conf. Comp. Visi., pages 630–645, 2016.
[69] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. 32nd Int. Conf. Mach. Learn., pages 448–456, 2015.
[70] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proc. 14th Int. Conf. Artif. Intell. Stat., volume 15, pages 315–323, 2011.
[71] M. Tassano, J. Delon, and T. Veit. FastDVDnet: Towards real-time deep video denoising without flow estimation. Pages 1354–1363, 2020.
[72] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In Int. Conf. Learn. Represent., 2015.
[73] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NeurIPS, 2017.
[74] S. G. Armato et al. The lung image database consortium (LIDC) and image database resource initiative (IDRI): A completed reference database of lung nodules on CT scans. Med. Phys., 38(2):915–931, 2011.