iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction
Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka
Peng Chen
Tokyo Institute of Technology
AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory, National Institute of Advanced Industrial Science and Technology
[email protected]

Mohamed Wahib
AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory, National Institute of Advanced Industrial Science and Technology
[email protected]

Shinichiro Takizawa
AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory, National Institute of Advanced Industrial Science and Technology
[email protected]

Ryousei Takano
National Institute of Advanced Industrial Science and Technology
[email protected]

Satoshi Matsuoka
Tokyo Institute of Technology
RIKEN Center for Computational Science, Hyogo, Japan
[email protected]
ABSTRACT
Computed Tomography (CT) is a widely used technology that requires compute-intensive algorithms for image reconstruction. We propose a novel back-projection algorithm that reduces the projection computation cost to 1/6 of the standard algorithm. We also propose an efficient implementation that takes advantage of the heterogeneity of GPU-accelerated systems by overlapping the filtering and back-projection stages on CPUs and GPUs, respectively. Finally, we propose a distributed framework for high-resolution image reconstruction on state-of-the-art GPU-accelerated supercomputers. The framework relies on an elaborate interleave of MPI collective communication steps to achieve scalable communication. Evaluation on a single Tesla V100 GPU demonstrates that our back-projection kernel performs up to 1.6x faster than the standard FDK implementation. We also demonstrate the scalability and instantaneous CT capability of the distributed framework by using up to 2,048 V100 GPUs to solve 4K and 8K problems within 30 seconds and 2 minutes, respectively (including I/O).

CCS CONCEPTS
- Computing methodologies -> Massively parallel algorithms; Image processing.

KEYWORDS
Computed Tomography, GPU, Distributed, Heterogeneous Computing
ACM Reference Format:
Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, and Satoshi Matsuoka. 2019. iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction. In The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '19), November 17-22, 2019, Denver, CO, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3295500.3356163
1 INTRODUCTION
High-resolution Computed Tomography (CT) is a technology used in a wide variety of fields, e.g. medical diagnosis, non-invasive inspection [62], and reverse engineering [17, 50]. In the past decades, the size of a single three-dimensional (3D) volume generated by CT systems has increased from hundreds of megabytes (typical volume sizes of 256^3 and 512^3) to several gigabytes (i.e. 2048^3 and 4096^3) [7, 42, 66]. The increased demand for rapid tomography reconstruction and the associated high computational cost have attracted heavy attention and effort from the HPC community [8, 11, 19, 25, 28, 47, 54, 55, 66, 68, 76]. As illustrated in [48], the FDK algorithm (presented by Feldkamp, Davis, and Kress [23] in 1984 as a convolution-backprojection formulation, and also known as the Filtered Back Projection (FBP) algorithm) is widely regarded as the primary method to reconstruct 3D images (or volumes) from projections, i.e. X-ray images. The FDK algorithm includes a filtering stage (also known as convolution) and a back-projection stage. The computational complexities of those two stages are O(N^3 log N) and O(N^4), respectively. Researchers are increasingly relying on the latest accelerators to improve the computational performance of FDK, e.g. Application-Specific Integrated Circuits (ASIC) [72], Field-Programmable Gate Arrays (FPGA) [16, 27, 64, 75], Digital Signal Processors (DSP) [37], Intel Xeon Phi [53], multi-core CPUs [68], and Graphics Processing Units (GPU) [51, 73, 77, 78]. This paper focuses on GPU-accelerated supercomputers for two reasons. First, GPUs are dominantly used for tomographic image reconstruction [20, 28, 33, 55, 59, 74]. Second, GPU-accelerated supercomputers are increasingly gaining ground in top-tier HPC systems.
Instantaneous high-resolution image reconstruction, i.e. generating a volume moments after processing the scanned image projections, has long been the holy grail of CT technologies. The following are some of the challenges one has to consider when targeting instant high-resolution image reconstruction. First, FDK is a well-researched algorithm, yet it remains a fact that FDK, with its high compute intensity, can benefit from innovation at the algorithm level in order to reduce the cost of computing the projections (while preserving computational precision). Second, CPUs and GPUs have different architectures. A fully heterogeneous solution necessitates careful assignment and orchestration of computing tasks to CPUs and GPUs. Third, high-resolution image reconstruction is limited by GPU memory capacity. Taking a volume of size 4096^3 for instance, the required storage is 256GB, which largely exceeds the memory capacity of a single GPU. Hence a distributed implementation is essential to avoid the GPU memory capacity bottleneck. Fourth, the effective use of MPI over the system hierarchy is critical to optimize data movement within the system. Finally, non-trivial optimizations are required to design an end-to-end pipeline for the computation of the filtering and back-projection stages using thousands of CPUs/GPUs (and this includes the I/O bottleneck of the parallel file system).

We target enabling instantaneous high-resolution image reconstruction while also providing the technical capacity required for advancing to unprecedented resolutions, e.g. 8192^3. We propose a scalable framework, called iFDK, for computing FDK on GPU-accelerated supercomputers. Optimizing the back-projection stage is crucial since back-projection is the computational bottleneck in most of the practical CT image reconstruction algorithms. We propose a novel back-projection algorithm that reduces the number of operations for computing the projections to 1/6 of the standard FDK algorithm. The algorithm is also general and thus can be adopted by iterative reconstruction methods, in which the back-projection is repeated dozens of times, e.g. ART [24], SART [3], MLEM [61], and MBIR [2].

Prevalent approaches in the literature execute both the filtering and back-projection stages on GPUs. Contrarily, we improve the efficiency by taking full advantage of the heterogeneity in GPU-accelerated supercomputers. The filtering stage is executed on CPUs with optimizations for multi-threading and SIMD vectorization [15]. The back-projection stage is executed on GPUs with optimizations at the algorithmic level to reduce the cost of the projection computations while also improving the data locality.

The proposed framework is further optimized for reducing communication. More specifically, we propose a scalable problem decomposition scheme in which several independent sub-tasks are decomposed onto a two-dimensional mesh of MPI ranks. The image reconstruction problem is decomposed such that the horizontal MPI ranks handle the input while the vertical ranks generate the output volume. Most importantly, the sub-tasks are overlapped in a pipelined fashion to be computed in parallel.

Using more than 2,000 Nvidia Tesla V100 GPUs, we solve the 4K (2048 x 2048 x 4096 -> 4096^3) and 8K (2048 x 2048 x 8192 -> 8192^3) high-resolution image reconstruction problems (the image reconstruction problem notation is defined in Section 2.3) within 30 seconds and 2 minutes, respectively. This includes the end-to-end processing
time: loading projections from the Parallel File System (PFS), the filtering stage, the back-projection stage, MPI communication, and finally storing the output 3D volume to the PFS.

The contributions of this paper are as follows:
- We propose a novel back-projection algorithm that reduces the cost of computing the projections and improves cache locality.
- We propose a scalable and distributed framework for high-resolution image reconstruction on heterogeneous supercomputers.
- We demonstrate that high-resolution image reconstruction problems can be solved within tens of seconds by iFDK. To the authors' knowledge, this is the first attempt to achieve instant distributed CT image reconstruction for 4K and 8K resolutions.

The rest of this paper is organized as follows. In Section 2, we review the background of CUDA and the FDK algorithm. In Section 3, we propose the novel FDK algorithm and its CUDA implementation. In Section 4, we propose the distributed framework (namely iFDK) and present the performance model. Section 5 describes the evaluation results. Section 6 discusses the potential impact of iFDK in real-world applications. In Section 7, we introduce the related work. Finally, Section 8 concludes.
2 BACKGROUND
In this section, we introduce the basics of CUDA and describe the details of the FDK algorithm.
Nvidia Compute Unified Device Architecture (CUDA) is a parallel computing platform and application programming model. We briefly introduce the concepts of CUDA's architecture and memory hierarchy; more details can be found in [18].
CUDA architecture.
CUDA is built on an array of multithreaded Streaming Multiprocessors (SMs). Massive thread-level parallelism is abstracted into a hierarchy of threads running in a single-instruction multiple-thread (SIMT) fashion. The threads in CUDA are grouped into warps (32 threads execute as a warp), blocks, and grids. Thousands of threads are created, scheduled, and executed concurrently.
CUDA memory hierarchy.
To approach the peak performance of the GPU, it is essential to implement applications that efficiently utilize CUDA's memory hierarchy. (I) Global memory: the largest off-chip memory. A coalesced access pattern is required for CUDA kernels to achieve the highest bandwidth. (II) Shared memory: a fast on-chip scratchpad memory, which shares space with the L1 cache. The access scope is limited to a single CUDA block. (III) Constant memory: a read-only constant cache shared by all of the SMs. (IV) Texture memory: a read-only on-chip cache which is optimized for spatial locality. The operation of reading a texture is also called a texture fetch. Texture fetches [56] can support efficient sub-pixel interpolation (as will be shown in Algorithm 3; a minimal texture-fetch sketch follows this list). (V) Register files: the fastest on-chip and thread-private memory.
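As a concrete illustration of item (IV), the following minimal CUDA sketch (our own toy example, not code from the paper; names such as sampleKernel are hypothetical) binds a small 2D array to a texture object with cudaFilterModeLinear and fetches a value at a sub-pixel coordinate, which is the hardware facility behind the sub-pixel interpolation used later in Algorithm 3:

#include <cuda_runtime.h>
#include <cstdio>

// Reads a 2D array through the texture unit with hardware bilinear filtering.
__global__ void sampleKernel(cudaTextureObject_t tex, float u, float v, float *out) {
    // Texture coordinates are offset by 0.5 so that integer (u, v) hits texel centers.
    *out = tex2D<float>(tex, u + 0.5f, v + 0.5f);
}

int main() {
    const int W = 4, H = 4;
    float h_img[H * W];
    for (int i = 0; i < H * W; ++i) h_img[i] = (float)i;

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, W, H);
    cudaMemcpy2DToArray(arr, 0, 0, h_img, W * sizeof(float),
                        W * sizeof(float), H, cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = arr;
    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModeLinear;   // hardware bilinear interpolation
    texDesc.readMode = cudaReadModeElementType;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    float *d_out, h_out;
    cudaMalloc(&d_out, sizeof(float));
    sampleKernel<<<1, 1>>>(tex, 1.5f, 0.0f, d_out);   // halfway between texels 1 and 2
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("interpolated value: %f\n", h_out);        // expect 1.5

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(d_out);
    return 0;
}

With the 0.5-texel offset, sampling at x = 2.0 in the first row [0, 1, 2, 3] returns 1.5, i.e. the bilinear blend of the two neighboring texels.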
CUDA shuffle intrinsic.
CUDA provides efficient intra-warp communication instructions, namely shuffle. Without using shared or global memory, threads in a single warp can exchange registers directly. In terms of computing efficiency, using shuffle for in-register computation is superior in performance to the other memory types, due to its low latency and high throughput [14].
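To make the shuffle mechanism concrete, here is a minimal CUDA sketch (ours, not the paper's; the kernel and variable names are hypothetical) in which every lane of a warp reads lane 0's register through __shfl_sync, with no shared or global memory involved:

#include <cstdio>

// Each lane computes a private value; lane 0's value is then broadcast
// to the whole warp purely through registers (no shared/global memory).
__global__ void warpBroadcast(float *out) {
    int lane = threadIdx.x & 31;          // lane index within the warp
    float v = static_cast<float>(lane);   // thread-private register value
    // Broadcast the register of lane 0 to all 32 lanes of the warp.
    float v0 = __shfl_sync(0xffffffffu, v, 0);
    out[threadIdx.x] = v + v0;            // use the shuffled value
}

int main() {
    float *d_out, h_out[32];
    cudaMalloc(&d_out, 32 * sizeof(float));
    warpBroadcast<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("lane 5 result: %f\n", h_out[5]); // 5 + 0 = 5
    cudaFree(d_out);
    return 0;
}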
Figure 1: CBCT geometry and trajectory.

Table 1: CBCT parameter list.
Param | Description | Unit
N_p | the number of 2D projections | N/A
N_u, N_v | the width and height of a 2D projection, respectively | pixel
D_u, D_v | FPD pixel pitch in the U and V direction, respectively | mm/pixel
E_i, Q_i | the i-th projection and its filtered result, respectively | N/A
F_ramp, F_cos | the ramp filter and the cosine weighting of size (N_v, N_u) [23] | N/A
P_i | the i-th projection matrix, of size 3 x 4 | N/A
d | distance of the X-ray source to the rotation Z-axis | pixel
D | distance of the X-ray source to the FPD center | pixel
N_x, N_y, N_z | the number of voxels in the X, Y, Z dimension, respectively | voxel
D_x, D_y, D_z | the pitch of the volume in the X, Y, Z dimension, respectively | mm/voxel
theta | rotation step angle, theta = 2*pi/N_p | Rad
I | 3D volume | N/A
In this section, we revisit the 3D image reconstruction method (namely FDK) for Cone-Beam Computed Tomography (CBCT) as introduced by Feldkamp et al. [23]. We briefly introduce the geometry, the arithmetic computation, and the convolution. More details can be found in [34].
CBCT Geometry.
Figure 1 illustrates the CBCT geometry in detail and Table 1 lists all of the related parameters. S is a micro-focus X-ray source; the FPD (Flat Panel Detector) is a class of X-ray digital radiography detectors, which is principally similar to the image sensors used in digital photography. The X-ray source and the FPD are fixed relative to each other in position, as Figure 1a shows. Both of them rotate around the Z-axis while scanning an object placed at the center O, so that a 3D volume can be reconstructed, as shown in Figure 1b.
FDK Computation. FDK is widely employed to build tomographic images in clinical and medical practice [48]. FDK comprises a filtering stage (or convolution stage), shown in Algorithm 1, and a back-projection stage, shown in Algorithm 2. In Algorithm 1, the cosine weighting (F_cos) and the ramp filter (F_ramp) are applied to the projections (E_i) to generate the filtered result (Q_i). The details of F_cos and F_ramp (including improved versions) can be found in [23, 34]. The shape of the F_ramp filter strongly affects the final image quality, yet it has no effect on the compute intensity of the filtering stage.

Algorithm 2 shows the back-projection algorithm. The projection matrix P_i is a well-aligned 3 x 4 matrix computed from the geometry parameters (i.e. d, D, theta). More details on computing the P matrix are presented in the literature [52, 69]. It is important to mention the embarrassingly parallel nature of the back-projection computation when utilizing the projection matrix, in comparison to other methods in the literature [38, 57]. As Algorithm 3 shows, the bilinear interpolation method is adopted by most FDK implementations to fetch the intensity value of a 2D matrix with sub-pixel precision when updating each value of I in Algorithm 2 [52, 53].

Algorithm 1 Filtering stage [34].
Input: E, F_cos, F_ramp, N_p, N_v, N_u
Output: Q
1: for i in [0, N_p) do
2:   E~_i <- E_i . F_cos                          // . means point-wise multiplication
3:   for j in [0, N_v) do
4:     Q_i(j, ...) <- E~_i(j, ...) (x) F_ramp     // (x) means convolution
5:   end for
6: end for

Algorithm 2 Back-projection stage. This scheme is implemented in RTK [52], RabbitCT [53], etc.
Input: P, Q, N_p, N_x, N_y, N_z
Output: I                                          // generated 3D volume
1: I <- 0                                          // initialization
2: for s in [0, N_p) do
3:   for k in [0, N_z) do
4:     for j in [0, N_y) do
5:       for i in [0, N_x) do
6:         [x, y, z]^T <- P_s . [i, j, k, 1]^T
7:         f <- 1/z
8:         W_dis <- f^2                            // distance weight
9:         [u, v]^T <- [x, y]^T . f                // coordinates in FPD
10:        I(i, j, k) <- I(i, j, k) + W_dis . interp(Q_s, u, v)
11:       end for
12:     end for
13:   end for
14: end for

Algorithm 3 Bilinear interpolation with sub-pixel precision [32].
1: function interp2(X, u, v)          // X is a 2D matrix, (u, v) is a sub-pixel coordinate
2:   [n_u, n_v]^T <- [int(u), int(v)]^T             // n_u, n_v are integers
3:   [d_u, d_v]^T <- [u - n_u, v - n_v]^T           // distances to the left points
4:   t_0 <- X(n_u, n_v).(1 - d_u) + X(n_u + 1, n_v).d_u          // a sub-pixel value
5:   t_1 <- X(n_u, n_v + 1).(1 - d_u) + X(n_u + 1, n_v + 1).d_u  // a sub-pixel value
6:   return t_0.(1 - d_v) + t_1.d_v                 // final sub-pixel value
7: end function

Convolution via FFT. FFT [13] is essential to the filtering stage since a one-dimensional FFT is used to perform the convolution operation in Algorithm 1. For large problem sizes, FFT is typically the method of choice for the convolution computation [12]. Optimized FFT primitives are typically provided by CPU/GPU vendors, e.g. Intel IPP (Intel Integrated Performance Primitives) [65] and the Nvidia cuFFT library [18]. In Algorithm 1 line 4, the FFT primitive is employed to perform the convolution as follows: given the two one-dimensional arrays, according to the Convolution Theorem [4], convolution in the time domain equals point-wise multiplication in the frequency domain. The reader can refer to a detailed illustration in [41].
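As a sketch of how such an FFT-based convolution can be assembled with cuFFT (this is our own illustration under simplifying assumptions, not the filtering code of iFDK; zero-padding, batching over rows, and plan reuse are omitted, and all names are hypothetical), one detector row can be filtered by a forward R2C transform, a point-wise multiplication with a pre-computed filter spectrum, and an inverse C2R transform:

#include <cufft.h>
#include <cuda_runtime.h>

// Point-wise complex multiplication in the frequency domain,
// including the 1/N normalization of the inverse FFT.
__global__ void pointwiseMul(cufftComplex *row, const cufftComplex *filt,
                             int n_half, float inv_n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_half) {
        cufftComplex a = row[i], b = filt[i];
        row[i].x = (a.x * b.x - a.y * b.y) * inv_n;
        row[i].y = (a.x * b.y + a.y * b.x) * inv_n;
    }
}

// Convolve one detector row (length n, device pointer d_row) with a ramp
// filter whose spectrum d_filtSpec was pre-computed once per problem size.
void filterRow(float *d_row, const cufftComplex *d_filtSpec, int n) {
    cufftComplex *d_freq;
    cudaMalloc(&d_freq, (n / 2 + 1) * sizeof(cufftComplex));

    cufftHandle fwd, inv;
    cufftPlan1d(&fwd, n, CUFFT_R2C, 1);
    cufftPlan1d(&inv, n, CUFFT_C2R, 1);

    cufftExecR2C(fwd, d_row, d_freq);                 // time -> frequency
    int n_half = n / 2 + 1;
    pointwiseMul<<<(n_half + 255) / 256, 256>>>(d_freq, d_filtSpec,
                                                n_half, 1.0f / n);
    cufftExecC2R(inv, d_freq, d_row);                 // frequency -> time

    cufftDestroy(fwd);
    cufftDestroy(inv);
    cudaFree(d_freq);
}

In practice the row would be zero-padded to at least twice its length to avoid circular-convolution artifacts, and the plans would be created once and batched over all rows of all projections.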
Throughout this paper, the image reconstruction problem and the performance metric GUPS are defined as follows:
(I) N_u x N_v x N_p -> N_x x N_y x N_z is defined as the image reconstruction problem, where N_u x N_v x N_p denotes the size of the projections (Input) and N_x x N_y x N_z is the size of the volume (Output).
(II) GUPS is defined as giga-updates per second and is used as a performance metric. It is computed as GUPS = (N_x * N_y * N_z * N_p) / (T * 10^9), where T is the execution time (in seconds).
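For illustration only, the metric can be computed directly from this definition; the helper below is ours and the numbers are hypothetical, not measurements from the paper:

#include <cstdio>

// Giga-updates per second for an Nx x Ny x Nz volume reconstructed
// from Np projections in t seconds (definition from Section 2.3).
double gups(double nx, double ny, double nz, double np, double t_seconds) {
    return nx * ny * nz * np / (t_seconds * 1e9);
}

int main() {
    // Hypothetical numbers: a 4096^3 volume from 4096 projections in 30 s.
    printf("%.0f GUPS\n", gups(4096, 4096, 4096, 4096, 30.0));
    return 0;
}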
Figure 2: Overview of the proposed framework. Multiple nodes with GPU accelerators are used such that a 3D volume is generated from 2D projections.
The next section discusses the execution of the filtering stage on the CPUs, and the following section discusses the execution of the novel FDK algorithm (the back-projection stage) on the GPUs.
Figure 2 shows our heterogeneous computational flow, which is different from the typical method of using only the GPU for all the computation [10, 38, 46]. Utilizing the CPU to perform the filtering stage can be more efficient in comparison to using the GPUs to compute the entire FDK pipeline. We list the three considerations that make the fully heterogeneous model more efficient (a sketch of the resulting CPU/GPU overlap follows the list): (I) Dedication of the GPU to the compute-intensive stage: the latency of the filtering stage on the CPU can be hidden by overlapping it with the back-projection offloaded to the GPU. Hence, the GPU can be fully dedicated to the back-projection kernel (see Section 3.3). (II) Reduced memory pressure on GPU device memory: if the GPU were used for filtering, a batch of images would have to be filtered at a time in order to fully utilize the GPU. The additional memory required to store the filtered projections would take away from the already limited GPU memory and would force the use of more GPUs in high-resolution problems to obtain enough aggregate memory capacity. (III) Efficient communication: in our communication scheme, an AllGather collective is required after the filtering stage to fetch the filtered projections. If the filtering stage were applied on the GPU, the AllGather collective would be applied to data residing in GPU memory rather than to data residing in CPU memory (as is the case when the filtering is applied on the CPU). Applying AllGather on data residing on the GPU incurs the extra cost of moving data across the PCIe interconnect, even when GPUDirect [36, 49] technology is enabled.
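A minimal single-GPU sketch of the overlap idea in item (I) is shown below. It is our own toy code rather than iFDK itself (iFDK uses pthreads and circular buffers, and the real filtering and back-projection bodies are far more involved; every name and size here is a placeholder): a CPU producer thread "filters" projections and a consumer thread copies each one to the GPU and launches a stand-in back-projection kernel, so the CPU and GPU stages run concurrently:

#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Toy sizes; a real run would use the projection sizes from Table 1.
constexpr int kNumProj = 64;
constexpr int kProjPixels = 256 * 256;

std::queue<int> g_ready;            // indices of filtered projections
std::mutex g_mtx;
std::condition_variable g_cv;

__global__ void backProject(const float *proj, float *vol, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) vol[i] += proj[i];   // stand-in for the real accumulation
}

// CPU producer: "filter" projections and announce them to the GPU consumer.
void filterThread(std::vector<float> &projs) {
    for (int p = 0; p < kNumProj; ++p) {
        float *row = &projs[(size_t)p * kProjPixels];
        for (int i = 0; i < kProjPixels; ++i) row[i] *= 0.5f; // placeholder filter
        { std::lock_guard<std::mutex> lk(g_mtx); g_ready.push(p); }
        g_cv.notify_one();
    }
}

// GPU consumer: copy each filtered projection and back-project it.
void bpThread(std::vector<float> &projs, float *d_proj, float *d_vol) {
    for (int done = 0; done < kNumProj; ++done) {
        int p;
        { std::unique_lock<std::mutex> lk(g_mtx);
          g_cv.wait(lk, [] { return !g_ready.empty(); });
          p = g_ready.front(); g_ready.pop(); }
        cudaMemcpy(d_proj, &projs[(size_t)p * kProjPixels],
                   kProjPixels * sizeof(float), cudaMemcpyHostToDevice);
        backProject<<<(kProjPixels + 255) / 256, 256>>>(d_proj, d_vol, kProjPixels);
    }
    cudaDeviceSynchronize();
}

int main() {
    std::vector<float> projs((size_t)kNumProj * kProjPixels, 1.0f);
    float *d_proj, *d_vol;
    cudaMalloc(&d_proj, kProjPixels * sizeof(float));
    cudaMalloc(&d_vol, kProjPixels * sizeof(float));
    cudaMemset(d_vol, 0, kProjPixels * sizeof(float));
    std::thread t1(filterThread, std::ref(projs));
    std::thread t2(bpThread, std::ref(projs), d_proj, d_vol);
    t1.join(); t2.join();
    cudaFree(d_proj); cudaFree(d_vol);
    return 0;
}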
We propose a general back-projection algorithm that reduces the computation and improves the data locality (regardless of the target architecture).
Theorems for Back-projection Algorithm.
This section introduces three theorems which we use to propose a novel version of the original FDK algorithm in Algorithm 2.
- Theorem-1: As Figure 1 shows, when two 3D points a(i~, j~, k~) and b(i~, j~, N_z - k~ - 1) are symmetrical with respect to the XY plane, then the corresponding projection points A(u~_A, v~_A) and B(u~_B, v~_B) on the FPD are symmetrical with respect to the horizontal center line, namely u~_A = u~_B and v~_A + v~_B = N_v - 1. (proven by [77])
- Theorem-2: For points on the vertical line ab (parallel to the Z-axis), their projection line AB is parallel to the V-axis in the FPD plane, namely the u~ of line AB has a constant value. (the mathematical proof is trivial; we do not include it due to space constraints)
- Theorem-3: When points are on the vertical line ab (parallel to the Z-axis), such that the projection points are computed using Equation 1, then their z is a constant value equal to d + y_ab, where y_ab is the Y coordinate of line ab. (proof follows)

Proof of Theorem-3.
We prove that for the specified rotation angle beta (or i * theta), if i~ and j~ are fixed, then z in Equation 1 is a constant value. As Figure 1 shows, given a point a of coordinate (i~, j~, k~) in the volume coordinate system, its projection point A of coordinate (u, v) on the FPD can be computed using the projection equation

  [x, y, z]^T = P_i . [i~, j~, k~, 1]^T,    [u, v]^T = [x, y]^T . (1/z)    (1)

where x, y, z are temporary variables. P_i is a 3 x 4 matrix determined by the rotation angle beta. Hence, P_i may be written as

  P^_i = M_2 . M_rot . M_1,    P_i = P^_i[0:3]    (2)

where the shapes of P^_i and P_i are 4 x 4 and 3 x 4, respectively, and P^_i[0:3] denotes the first three rows of P^_i. Here M_1 scales the voxel indices by the volume pitch (D_x, D_y, D_z) and shifts the origin to the volume center (-(N_x - 1)/2, -(N_y - 1)/2, -(N_z - 1)/2); M_rot applies the gantry rotation around the Z-axis by the angle beta together with the translation by the distance d; and M_2 applies the perspective scaling by D and converts to detector pixel coordinates using the pixel pitches D_u, D_v and the detector center. In other words, M_1 represents the transformation of the coordinate system from volume to gantry [71], M_rot denotes the gantry rotation along the Z-axis at the angle beta plus the translation distance d, and M_2 maps a point onto the FPD plane. Note that all of the variables are listed in Table 1. By expanding Equation 1, z can be written as

  z = d + sin(beta) . (i~ - (N_x - 1)/2) . D_x - cos(beta) . (j~ - (N_y - 1)/2) . D_y    (3)

Clearly, z is independent of k~ and equals d + y_ab, where y_ab is the Y coordinate of the line ab, which is parallel to the Z-axis in Figure 1a.

Reducing the Cost of Computing the Projections. In this section, we present a method to improve the performance of back-projection by reducing the computational cost of the projections. Algorithm 2 is adopted in a number of CBCT applications, e.g. open-source libraries (OSCaR [51], RabbitCT [53], RTK [52]) and the literature [31, 46, 74, 77]. Based on Theorem-2 and Theorem-3, the improved back-projection is illustrated in Algorithm 4.

Algorithm 4 Proposed back-projection algorithm. The optimized variables are highlighted in gray color.
Input: P_i, Q_i, N_p, N_x, N_y, N_z, i in [0, N_p)
Output: I                                  // reconstructed 3-dimensional volume
1:  I~ <- 0                                // I~ initialization
2:  for s in [0, N_p) do
3:    Q~_s <- Q_s^T                        // transpose the 2D matrix
4:    for j in [0, N_y) do
5:      for i in [0, N_x) do
6:        t <- [i, j, 0, 1]
7:        [x, z] <- [<P_s[0], t>, <P_s[2], t>]   // two inner products
8:        f <- 1.0/z
9:        u <- x . f
10:       W_dis <- f^2                     // weighting
11:       for k in [0, N_z/2) do           // symmetric geometry
12:         y <- <P_s[1], [i, j, k, 1]>    // one inner product
13:         v <- y . f                     // compute v only
14:         I~(k, j, i) <- I~(k, j, i) + W_dis . interp(Q~_s, v, u)
15:         k~ <- N_z - 1 - k              // symmetric geometry
16:         v~ <- N_v - 1 - v              // symmetric geometry
17:         I~(k~, j, i) <- I~(k~, j, i) + W_dis . interp(Q~_s, v~, u)
18:       end for
19:     end for
20:   end for
21: end for
22: I <- reshape(I~)                       // reshape means changing the data layout

The computational cost for computing the projections becomes a factor of 1/6 of the original FDK algorithm (Algorithm 2 line 6): we only compute half of the N_z dimension (Algorithm 4 line 11), and only one of the inner products with the 1 x 4 rows of P_s remains in the innermost loop, since u and W_dis are reused N_z times and only half of the y values are computed directly. Zhao et al. [77] discussed a rotational symmetry in the projection layout. We do not adopt this methodology since it is impractical for pipeline processing in terms of the high latency of collecting the four projections.

Improving Data Locality: Data Layout & Loops.
This section discusses how we leverage the proposed algorithm to derive an implementation that improves the data locality. To increase the cache hits for accessing the volume (I) and the projections (Q) in Algorithm 2, the proposed Algorithm 4 adjusts the memory layout and re-organizes the loops over N_p, N_x, N_y, and N_z. The original I uses an i-major layout, as Figure 1b shows; I becomes k-major in the proposed algorithm. Based on Theorem-2, the proposed memory layout is more cache-friendly for data access, since the data buffers of both the projections and the volume can be accessed contiguously, as shown by the marked variables Q~_s and I~ in Algorithm 4. Note that this memory layout is general to all kinds of processors, i.e. CPUs, GPUs, and Xeon Phi. To improve the data locality, the authors in [38, 78] implemented the back-projection kernel on GPUs by organizing the loops as in Algorithm 4, such that they compute along the z-axis first. However, that method does not optimize the layout of the arrays Q and I. Hence, their implementation becomes more complex since one has to rely on CUDA's 2D-texture cache to improve the data locality. It is noteworthy that the time required to transpose a projection (Algorithm 4 line 3) is a small fraction of the overall kernel execution time.

Listing 1: The proposed back-projection CUDA kernel. The constant-memory ProjMat is defined to store the 3 x 4 projection matrices, the dot function computes an inner product, mad is the fused-multiply-add intrinsic, and the interp2 function is implemented as in Algorithm 3. The batch of projections (defined as N_batch) is 32.

__constant__ float4 ProjMat[32][3];   // 32 3x4 projection matrices
__global__ void shflBP(float* vol, int3 vol_size,
                       const float* img, int3 img_dim)
{
    int laneId = threadIdx.x & 31;
    // (k, j, i) is the position of a voxel
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int k = blockIdx.z*blockDim.z + threadIdx.z;
    float4 vec = make_float4(k, j, i, 1);
    // Z, U are 2 registers
    float Z, U;
    if (laneId < img_dim.z) {
        Z = 1/dot(ProjMat[laneId][2], vec);   // inner product
        U = dot(ProjMat[laneId][0], vec)*Z;   // inner product
    }
    float sum = 0, _sum = 0, v, _v;
    int _i = vol_dim.z - 1 - i;               // geometry symmetric
    for (int s = 0; s ...

CUDA Implementation. This section describes the proposed CUDA implementation. We implement the proposed back-projection kernel (called shflBP in Listing 1), which can be used in all generations of Nvidia GPUs from the Kepler to the Volta architecture. In Listing 1, the detailed CUDA kernel is presented. We use global memory (as introduced in Section 2.1) to store the 3D volume (see the variable vol in Listing 1) due to its huge size. Though CUDA unified memory [18] is also an attractive choice for storing the 3D volume, we avoid using it due to its unstable performance, which varies from CUDA version to version as reported in [6, 39]. To further reduce the computation cost by sharing data between threads, we take advantage of the shuffle instruction to perform intra-warp communication as introduced in Section 2.1. The shflBP kernel processes a batch of projections in one pass. This benefits the overall performance in many aspects: (I) decreasing the access count of the volume data, which is stored in global memory; (II) increasing in-register accumulation for the back-projection computation; (III) eliminating the overhead of launching multiple CUDA kernels. Note that this strategy is also applied in the widely used image library RTK [52]. However, as explained earlier, the proposed algorithm outperforms RTK by reducing the computational cost and improving data locality.

Table 2: iFDK parameter list.

Parameter | Description
R, C | the rows and columns of the 2D grid of MPI ranks, respectively
R_i | the i-th row of the 2D grid of MPI ranks, where i in [0, R)
C_i | the i-th column of the 2D grid of MPI ranks, where i in [0, C)
N_gpu_per_node | the number of GPUs per compute node
N_cpu_per_node | the number of CPUs per compute node
N_nodes | the total number of compute nodes
N_ranks | the total number of launched ranks
N_gpus | the total number of GPUs
N_cpus | the total number of CPUs
N_PCIe | the number of PCIe connectors per compute node
N_proj_per_rank | the number of loaded and filtered projections per rank
N_cpu_core | the total number of cores per CPU
N_gpu_mem_size | the memory capacity of a single GPU

Shuffle-based CUDA kernel: shflBP. In this section, we explain the details of applying shuffle to our CUDA kernel. We employ two registers (named Z and U) in the shflBP kernel (Listing 1 line 10) to store the values of f and u (Algorithm 4 lines 7-9). The distance weight W_dis (Algorithm 4 line 10) can then be easily obtained as Wdis in Listing 1. We avoid the use of shared or global memory altogether by taking advantage of the shuffle instruction to realize the data communication between threads in Listing 1.

This section presents a distributed framework, namely iFDK, for instant high-resolution image reconstruction. The parameters used in iFDK are summarized in Table 2. We combine the CPU filtering stage, the GPU-optimized back-projection stage, and MPI as a communication library to scale iFDK to O(10^3) GPUs. This section elaborates on the design choices and the implementation.

2D Grid of MPI Ranks. The compute capability and device memory capacity of a single GPU are limited. It is impractical to use a single GPU to generate large volumes (e.g. volumes of size 4096^3 or 8192^3). Therefore, we scale the proposed method to take advantage of GPU-accelerated supercomputers to solve those problems. This section presents the problem decomposition and the orchestration of MPI ranks. To fully utilize the computing resources, i.e. CPUs, GPUs, and interconnects, we launch multiple MPI ranks within each node (one rank per GPU) to perform the computation and communication concurrently.

The MPI ranks are managed as a 2D grid of R rows and C columns. The total number of ranks may be expressed as

  N_ranks = C * R    (4)

In Figure 3a, an example of using iFDK with 32 MPI ranks is presented, where R=8 and C=4. In iFDK, the ranks in each column of the 2D grid load a subset of projections from the PFS independently. Next, we perform the filtering stage on those projections using the CPUs. The number of projections processed by the ranks in each column of the 2D grid is N_p/C. We use MPI-AllGather to send the filtered projections to the neighbor ranks in the same group (namely the same column). Hence, the number of projections which are loaded and filtered by a single rank is

  N_proj_per_rank = N_p / N_ranks = N_p / (C * R)    (5)

The ranks in the same row of the 2D grid (the i-th row R_i) compute the same sub-volume, as Figure 3b shows; the final sub-volume that is reconstructed by the i-th row can be obtained by reducing all of the sub-volumes that are generated by the ranks in that row.

Figure 3: Illustrative example of the iFDK framework. 32 MPI ranks are arranged in a 2D grid of size 4 x 8 (C=4, R=8). (a) iFDK framework. The input is the 2D projections, the output is the generated 3D volume; vol denotes a sub-volume. C_i and R_i are defined in Table 2. (b) AllGather is performed across the ranks in one column. Reduction is applied only once across the ranks in one row.

Multi-GPU Management. We discuss how the GPUs are managed by the MPI ranks in this section. Commonly, there are multiple GPUs in a single compute node (e.g. ORNL's Summit has six GPUs; LLNL's Sierra and TokyoTech's Tsubame have four GPUs), which are connected to the CPUs by PCIe or NVLink [36]. We launch a number of MPI ranks per compute node equivalent to the number of GPUs, i.e. one MPI rank per GPU. N_gpus may be written as

  N_gpus = N_ranks    (6)

Here, N_nodes = N_ranks / N_gpu_per_node is the number of required compute nodes. Each MPI rank manages a single GPU as follows: first, we gather the filtered projections that are processed on the CPU, as explained in Section 3.1. Second, the processed projections are copied from the host to device memory. Third, the back-projection kernel is launched to generate the specified subset of the 3D volume. Finally, the computed volume is copied from the device memory to the host. The detailed operations performed inside each rank, by multiple threads, are presented in the next section.

Multiple Threads in a Rank. This section presents how multiple threads are orchestrated in each MPI rank. Each rank in iFDK processes several tasks in parallel, e.g. loading projections from the PFS, filtering the projections, collective communication, back-projection, and storing the volume to the PFS. As Figure 4 shows, we use three threads to execute those tasks, namely the Main-thread, the Filtering-thread, and the Bp-thread (back-projection thread). Those threads are created using the pthread library; they execute independently and exchange data with each other using circular buffers [70]. The Filtering-thread launches a number of OpenMP threads (derived from N_cpu_core * N_cpu_per_node divided by the number of ranks per node, minus the threads already in use by the rank) to load projections and execute the filtering in parallel. For each projection, the load and filtering operations are executed within the same OpenMP thread, in a sequence that enables immediate processing.

As Figure 3b shows, we use the MPI-AllGather collective to gather the specified subset of filtered projections in the Main-thread. At the same time, those filtered projections are dispatched to the designated GPU by the Bp-thread for the back-projection computation. Note that for each MPI rank, N_proj_per_rank AllGather operations are required, since we process one projection at a time by AllGather. When all of the filtered projections are finally processed by the GPUs, the generated sub-volume in the GPU device memory is copied back to the host. Next, using the MPI-Reduce collective as in Figure 4b, we do a single reduction step to generate the final resulting volume in host memory using the Main-thread (a real example can be found in Figure 7). Finally, the 3D volume is stored in the PFS using multiple ranks with multiple threads. Note that the volume of size N_x x N_y x N_z is stored as N_z slices, where the size of each slice is N_x x N_y. There is room for improvement by tuning the size of each slice to optimize the throughput of storing to the PFS (i.e. tune the slice size to optimize for file striping).
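The communication pattern of the 2D grid can be sketched with plain MPI as follows. This is our own illustration, not iFDK source code: the grid shape, the buffer sizes, and the rank-to-(row, column) mapping are arbitrary example choices. Each column of ranks shares its filtered projections with MPI_Allgather over a column communicator, and each row reduces its partial sub-volumes with MPI_Reduce over a row communicator, mirroring Figure 3:

#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int R = 4;                 // rows of the 2D grid (example value)
    const int C = nranks / R;        // columns of the 2D grid
    int row = rank % R;              // this rank's row index
    int col = rank / R;              // this rank's column index

    MPI_Comm colComm, rowComm;       // ranks sharing a column / a row
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &colComm);
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &rowComm);

    // Each rank filtered `localProj`; gather all of the column's projections.
    const int projLen = 1024;        // stand-in for Nu*Nv pixels
    std::vector<float> localProj(projLen, 1.0f);
    std::vector<float> colProjs((size_t)projLen * R);
    MPI_Allgather(localProj.data(), projLen, MPI_FLOAT,
                  colProjs.data(), projLen, MPI_FLOAT, colComm);

    // Each rank back-projected a partial sub-volume; reduce it across the row.
    const int subVolLen = 4096;      // stand-in for the sub-volume size
    std::vector<float> partial(subVolLen, 1.0f), finalVol(subVolLen, 0.0f);
    MPI_Reduce(partial.data(), finalVol.data(), subVolLen, MPI_FLOAT,
               MPI_SUM, 0, rowComm);

    MPI_Comm_free(&colComm);
    MPI_Comm_free(&rowComm);
    MPI_Finalize();
    return 0;
}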
Orchestration and Overlapping. In order to achieve the optimal overlap, we use three threads to pipeline the computation, as Figure 4a shows. Figure 4c presents a real example of solving a 4K problem using 128 V100 GPUs. To give further insight into the effect of pipelining the computation, the breakdown of the overlapped computation (namely T_compute, as defined in Section 4.2) is listed in Table 5. The value of delta > 1 indicates that we achieve the goal of improving the overall performance by overlapping the different stages, i.e. filtering, collective communication, and back-projection. This overlapping scheme improves the overall performance since back-projection is the main bottleneck, and hence we get a streaming benefit from the overlapping. On the other hand, overlapping the tasks after the back-projection (i.e. the device-to-host copy, the reduction, and storing to the PFS) does not guarantee any performance improvement (for the price of the complexity introduced). Nonetheless, overlapping after the back-projection remains one of the points of investigation for future work.

Configuration of Parameter R. This section discusses how to select the optimal value for the parameter R: the number of rows in the 2D mesh of MPI ranks. Based on the design of iFDK (in Figure 3a), R may be expressed as

  R = sizeof(float) * N_x * N_y * N_z / N_sub_vol    (7)

where N_sub_vol is the size of the sub-volume. For a specified number of GPUs (or ranks), we minimize the value of R and maximize the value of C for three reasons: (I) We can efficiently use the limited device memory, since each GPU computes a sub-volume of size sizeof(float) * N_x * N_y * N_z / R. (II) We can achieve higher computational performance by generating larger volumes; as Table 4 shows, for a specified input, a bigger output (smaller alpha) results in better performance for the back-projection kernel. (III) For the specified workload N_u x N_v x N_p -> N_x x N_y x N_z, there is a linear relationship between the runtime and N_p. Since we decompose the workload into C sub-tasks (N_u x N_v x N_p/C -> N_x x N_y x N_z), it is essential to maximize the value of C to decrease the runtime of each sub-task.

In addition, the value of R is often a power of two and is constrained by the memory capacity of a GPU as follows:

  sizeof(float) * (N_x * N_y * N_z / R + N_u * N_v * N_batch) <= N_gpu_mem_size

where N_batch = 32, as Listing 1 shows. Given N_gpu_mem_size = 16GB in the GPU generation we use, N_sub_vol = 8GB is adopted for the high-resolution reconstruction problems we target.

We discuss a performance model intended to analyze the impact of different parameters on performance and to predict the potential peak performance.

Micro-benchmarks. We use micro-benchmarks to measure the peak throughput parameters, which are constant values in our model (for a given system). BW_load and BW_store are the aggregate throughputs of reading from and writing to the PFS, respectively. Both of them are measured by LLNL IOR [58]. TH_flt is the throughput of the filtering computation, which we measure by running the filtering kernel on the target CPU. TH_bp is the throughput of our back-projection kernel as measured on the target GPU (we also report it in Table 4 to give perspective to readers interested only in the back-projection kernel). TH_trans is the throughput of transposing a volume on the GPU. TH_AllGather and TH_Reduce are the throughputs of the MPI-AllGather and MPI-Reduce APIs, respectively. Both of them are measured by the Intel MPI benchmarks.
BW_PCIe is the throughput of data transfer between the host and device memory via a single PCIe connector and is measured by Nvidia's bandwidthTest tool.

Figure 4: Orchestration and overlapping in iFDK. (a) Three threads execute in parallel in a pipeline fashion and exchange data via two queue-buffers; MPI-AllGather is performed in the Main thread. (b) All the sub-volumes generated in the previous step are reduced to a single final volume, which is stored in the PFS by the Main thread. (c) Example of the pipeline solving the 2048 x 2048 x 4096 -> 4096^3 problem using 128 V100 GPUs, with R=32 and C=4; the measured execution time of each task is displayed. The Filtering thread processes 32 projections. The Main thread sends 32 projections and gathers 1024 projections using MPI-AllGather. The Bp-thread processes 1024 projections.

iFDK Performance Model. Let T_load be the execution time required for reading the projections from storage, T_flt for filtering the projections, T_AllGather for communicating all projections by MPI-AllGather, T_H2D for copying the filtered projections from host to device, T_bp for the back-projection, T_trans for transposing the sub-volume, T_D2H for moving the sub-volume from device memory to the host, T_reduce for reducing the sub-volumes, and T_store for storing the volume to the PFS. Those variables can be written as

  T_load = sizeof(float) * N_u * N_v * N_p / BW_load    (8)
  T_flt = N_p / (N_nodes * TH_flt) = N_p * N_gpu_per_node / (C * R * TH_flt)    (9)
  T_AllGather = N_p / (C * R * TH_AllGather)    (10)
  T_H2D = sizeof(float) * N_gpu_per_node * N_u * N_v * N_p / (C * BW_PCIe * N_PCIe)    (11)
  T_bp = T_H2D + N_p / (C * TH_bp)    (12)
  T_trans = sizeof(float) * N_x * N_y * N_z / (R * TH_trans)    (13)
  T_D2H = sizeof(float) * N_gpu_per_node * N_x * N_y * N_z / (R * BW_PCIe * N_PCIe)    (14)
  T_reduce = sizeof(float) * N_x * N_y * N_z / (R * TH_Reduce)    (15)
  T_store = sizeof(float) * N_x * N_y * N_z / BW_store    (16)

As Figure 4 shows, the three threads compute in parallel such that most of the computation and data movement is overlapped. Additionally, the filtering thread launches multiple OpenMP threads to perform the loading and filtering operations concurrently, as Section 4.1 explained. The filtering operation can be perfectly overlapped since T_load + T_flt << T_bp, as the example in Figure 4c shows. Here, the execution time required by the three threads may be expressed as

  T_compute = max(T_load, T_flt, T_AllGather, T_bp)    (17)

The time required to transpose, copy, and reduce the volume is

  T_post = T_trans + T_D2H + T_reduce + T_store ~= T_D2H + T_reduce + T_store    (18)

Note that T_trans is a small value (T_trans << T_D2H is observed) that can be ignored. The total execution time is

  T_runtime = T_compute + T_post ~= T_compute + T_D2H + T_reduce + T_store    (19)

Conclusions from the Performance Model. We conclude the discussion of our performance model as follows.
(I) Scalability: the performance of iFDK scales with the number of GPUs (N_gpus), since T_flt, T_AllGather and T_bp are inversely proportional to C (as in Equations 9-12) and C grows linearly with N_gpus (as in Equation 4 and Equation 6), where R is minimized to meet the constraints described in Section 4.1.5 and T_post is a constant term in T_runtime. (II) Potential peak performance: according to Equation 19, the potential peak performance of the computation (T_runtime) can be used to quantify the efficiency of our implementation.

This section lists the experimental environment, reports the performance of the proposed algorithms, and discusses the scalability of the framework.

This section lists the experimental environment and discusses how the performance was measured.

HPC system and environment. AIST's ABCI supercomputer is used for our evaluation. Each compute node is equipped with two Intel Xeon Gold 6148 CPUs (20 cores each), 384GB of memory, four Tesla V100 GPUs (16GB RAM each) attached through PCIe gen3 x16, and two InfiniBand EDR HCAs. ABCI uses CentOS 7.4 as the operating system and mounts a 6.6PB GPFS shared storage. The iFDK framework is implemented with Nvidia CUDA 9.0 (CUDA driver version 410.104) and the Intel Performance Libraries 2018.2.199 (which include MPI and IPP). The compilers nvcc-9.0 and mpicc (included in the MPI library) are used to compile the CUDA kernels and the host code, respectively. The compiler option (-gencode arch=compute_70,code=sm_70) is applied in nvcc-9.0 for Nvidia's Volta architecture. (The ABCI system is ranked 8th on the TOP500 list as of June 2019.)

Measurement methodology. Since the performance of image reconstruction is independent of the content of the projections or the volume, we apply the standard Shepp-Logan phantom [60] to generate a variety of projections using the forward-projection tool of the RTK library. An example reconstructed by our framework can be found in Figure 7. We use single precision for all projections, volumes, and runs. For output verification, we use the image processing tool ImageJ [1] to render the generated 3D volumes, then inspect them manually. We also use profiled runs to investigate the density value of each voxel. Finally, we compare the output with the volumes generated by the RTK library on the CPU; the Root Mean Square Error (RMSE) is less than 10e-5. We use the OS-independent cudaEvent functions to measure the execution time of the CUDA kernels and employ MPI_Wtime to measure the application's runtime. Each of the reported results is averaged over 100 runs. Finally, we report the performance in the unit of GUPS, as defined in Section 2.3.

Table 3: Back-projection kernel characteristics. "Texture cache" and "L1 cache" mean accessing the projections via the 2D-Layered texture and the L1 cache, respectively. "Transpose projection" and "Transpose volume" mean transposing the projections and the volume, respectively. RTK-32 is the improved RTK kernel. The other kernels are the shflBP kernel (as in Listing 1) with different characteristics.

Kernel | Texture cache | L1 cache | Transpose projection | Transpose volume
RTK-32 | yes | no | no | no
Bp-Tex | yes | no | no | yes
Tex-Tran | yes | no | yes | yes
Bp-L1 | no | no | yes | yes
L1-Tran | no | yes | yes | yes

This section reports the performance of the proposed back-projection kernel on a Tesla V100 GPU by comparison with a collection of kernels (listed in Table 3). The latest RTK 1.4.0 implementation (called RTK-32 in Table 4) is used with 32-bit precision (versus the default 8-bit precision). Note that our target is to instantly generate high-resolution volumes. We demonstrate that we could achieve this goal while preserving high image quality: we do not sacrifice quality by using lower precision.

The kernel function kernel_fdk_3Dgrid of RTK is strictly implemented as defined in Algorithm 2. The original RTK limits the maximum count of projections to 16; we extend it to 32. Also, we adjust its interpolation function (Algorithm 3) to 32-bit precision using the 2D-Layered cache without linear interpolation, namely using the cudaFilterModePoint parameter for the texture function. Regarding the L1 cache-optimized access to the projections in Table 3, the __ldg intrinsic is applied.

The characteristics of the proposed kernels are listed in Table 3. A variety of image reconstruction problems are evaluated with those kernels; the performance in GUPS is listed in Table 4. The time required to move data between the host and the device is not included in the execution time used for the calculation of GUPS. Note that in most applications, the value of alpha is typically very small, often less than 1. As seen in Table 4, the proposed CUDA kernel, namely "L1-Tran", outperforms the other kernels. The performance advantage is due to the improved data locality and the efficient intra-warp communication. As Table 4 shows, the size of the output of RTK cannot be bigger than 8GB since RTK employs a dual-buffer technique to store the volume, while the maximum memory capacity of a Tesla V100 is 16GB in our testbed. By inspecting the performance in Table 4, we can observe the following: (I) When comparing Bp-Tex and Tex-Tran, it appears that the transpose operation on the projections has a minor effect on the hit rate of the 2D-Layered texture cache. (II) However, the transpose operation noticeably contributes to the L1 cache hit rate, as observed by comparing the performance of Bp-L1 and L1-Tran. (III) The proposed kernel outperforms the most commonly used production library for high-resolution image reconstruction.

Table 4: Back-projection kernel performance on a Tesla V100 GPU. 1k and 2k mean 1024 and 2048, respectively. alpha is defined as the ratio of input to output problem size. The characteristics of the evaluated CUDA kernels are listed in Table 3. Columns: FDK problem (pixel -> voxel), alpha, RTK-32, Bp-Tex, Tex-Tran, Bp-L1, and L1-Tran (all in GUPS).

This section presents the performance and scalability of iFDK. In the scaling experiments, each GPU computes a sub-volume of 8GB (namely N_sub_vol). According to Equation 7, R=32 and R=256 are used for generating volumes of 4096^3 and 8192^3, respectively. In the special case of iFDK where C=1, T_reduce becomes zero since the volume reduction is not required (shown as "N/A" in each sub-figure of Figure 5).

Note that in the following sections we focus on discussing the impact of the volume parameters N_x, N_y, N_z and the number of projections N_p on the performance and scalability. The reason is that we cannot target weak scaling by increasing the input image sizes N_u or N_v. As Algorithm 1 shows, only the filtering computation is dependent on N_u and N_v.
However, the back-projection, which is the main kernel that we scale, is independent of N_u or N_v (see Algorithm 2). This also becomes clear from the definition of the GUPS performance metric in Section 2.3, i.e. there is no dependency on N_u or N_v.

Strong Scaling. In Figures 5a and 5b, the stacked execution times of T_compute, T_post (namely T_reduce + T_D2H), and T_store are displayed. Note that T_load, T_flt and T_AllGather are all included in T_compute, as Equation 17 shows. Since the value of R is fixed, the value of C in iFDK increases in proportion to N_gpus. This results in T_compute decreasing in inverse proportion to N_gpus. To give further insight into the computational behaviour, a breakdown of T_compute is listed in Table 5. As the figure demonstrates, iFDK follows the same scaling behavior as the potential peak performance.

Weak Scaling. Figures 5c and 5d show the weak scaling. The evaluated number of GPUs (namely N_gpus) goes up to 2,048. In the two figures, each rank loads and processes 16 and 4 projections, respectively. We use an MPI-AllGather operation to get the filtered projections from the ranks in the same column group (as in Figure 3b). In the back-projection stage, each rank processes 128 and 1024 projections, respectively.

Figure 5: Scaling iFDK. (a) Strong scaling for 2048 x 2048 x 4096 -> 4096^3; R=32, C=N_gpus/32. (b) Strong scaling for 2048 x 2048 x 8192 -> 8192^3; R=256, C=N_gpus/256. (c) Weak scaling for 2048 x 2048 x N_p -> 4096^3; N_p=16*N_gpus, R=32, C=N_gpus/32. (d) Weak scaling for 2048 x 2048 x N_p -> 8192^3; N_p=4*N_gpus, R=256, C=N_gpus/256. "Reduce" denotes the volume reduction. "store" indicates storing the volume to the PFS. "D2H" means the GPU -> CPU copy. Loading projections from the PFS and the AllGather communication are overlapped with T_compute. The CPU -> GPU copy is included in T_compute. T_reduce is N/A when C=1 (no inter-rank reduction occurs). The high performance of the proposed back-projection kernel exposes other bottlenecks (e.g. I/O).

Figure 6: Performance in units of GUPS for problems with input size 2048 x 2048 and three different output volume sizes.

Performance. The potential peak and the achieved performance are displayed in Figure 5. Note that T_trans is as small as 0.29s and is thus included in T_D2H to simplify the stacked bar figures. The peak performance is projected by our performance model. For example, the overall potential peak performance (namely T_runtime) is obtained from the benchmark values as follows: the peak bandwidth of a single PCIe x16 connector (namely BW_PCIe) is 11.9GB/s, from which the projected time (T_D2H) to copy 32GB of data (8GB x 4) from device memory to the host over the dual PCIe connectors follows; the projected time (T_reduce) to reduce 8GB of data over the dual InfiniBand HCAs per node follows likewise; and the peak sequential write bandwidth of GPFS (namely BW_store) is 28.5GB/s, which gives the projected times (T_store) to store the 256GB and 2TB volumes, respectively.

Figure 7: Result of the volume reduction in an experiment done with iFDK, generating a 2048^3 volume. The iFDK parameters are R=4 and C=4. N_gpus is 16; the performance is 1,134 GUPS.

On average, we achieve 76% of the potential peak performance. Upon analyzing the performance gap, we found the following. For T_compute, the data exchange between the three threads orchestrating the workflow, as Figure 4 shows, can have some overhead that contributes to the gap. The intra-thread data movement within the back-projection thread also adds overhead: before copying data from the host to GPU memory, it is necessary to gather at least 32 projections as a batch. In addition, memory management and building the circular buffers (as in Figure 4a) also contribute to the gap in a minor way. For T_reduce, we confirmed that the gap is due to MPI-Reduce being called only once: the first call to a collective is typically slower, which is why benchmarks, like the one we use, pre-run a few iterations before measuring. For T_D2H, the architecture of the compute node can cause contention on the PCIe switch feeding two GPUs (there are two PCIe switches for the four GPUs). For T_store, it appears from our investigation that the minor gap is the effect of the volume slices written to the PFS not being tuned to the ideal stripe size.

Figure 6 shows the overall performance of iFDK in GUPS. We use the entire end-to-end execution time as defined in Equation 19. Figure 7 shows an example that uses the MPI-Reduce primitive to generate a volume of 2048^3. iFDK scales better when generating the 8192^3 volume than the 4096^3 volume, due to better device utilization in the former. This becomes clear from observing the relationship between the coefficient alpha and the performance in Table 4. Note that the 2048 x 2048 x 4096 -> 4096^3 and 2048 x 2048 x 8192 -> 8192^3 problems can be solved instantly, within 30 seconds and 2 minutes, respectively. Finally, we report that the performance of the back-projection kernel on a single GPU is about 200 GUPS (without using mixed precision), as shown in Table 4. This performance of the back-projection kernel is the main reason a time-to-solution in O(10) seconds for high-resolution problems on O(10^3) GPUs becomes possible. It is noteworthy that all computations, running on both CPUs and GPUs, use single precision to maintain high-fidelity solutions.

Scaling to 8K FDK problems. iFDK is general to FDK problems of any size, due to the two-dimensional problem decomposition methodology.
At present, the common perspective in CT imaging is that processing 8192^3 volumes is highly demanded but not yet feasible (as discussed extensively by Martz et al. [21]). Nonetheless, we conducted experiments with iFDK on 8K volumes. Using iFDK with 2,048 GPUs, the 2048 x 2048 x 8192 -> 8192^3 problem is solved within 2 minutes. This time-to-solution is inclusive of I/O: it includes storing the 2TB volume to the PFS in 79s. Note that it is rare to find high-precision scanners capable of capturing images with the quality necessary for 8K resolution, and hence it is rare to find a CT system generating 3D volumes at 8K. Yet we conducted the 8K experiments for the sake of evaluating the scalability of iFDK and demonstrating that 8K image reconstruction is technically attainable.

Performance Model Accuracy. Table 5 lists the detailed execution time of each component in the framework. Our analysis of the results and the resulting observations are as follows:
(i) delta > 1. This indicates that the orchestration and pipelining methodology introduced in Section 4.1.4 is effective in improving the overall performance.
(ii) T_AllGather < T_bp. MPI-AllGather is employed to share the filtered projections within the C_i group, as in Figure 3a, so T_AllGather can be considered the overhead of the filtering computation. Fortunately, it can be overlapped with the back-projection stage, as Figure 4 shows. Furthermore, this justifies the strategy of minimizing R (as discussed in Section 4.1.5) to generate sub-volumes on the GPUs that are as large as possible.

Table 5: Details of T_compute in Fig. 5a and Fig. 5b. T_load is included in T_flt. delta is defined as delta = (T_flt + T_AllGather + T_bp) / T_compute.

volume (voxel) | N_gpus | N_cpus | T_flt (s) | T_AllGather (s) | T_bp (s) | T_compute (s)
4096^3 | 32 | 16 | 1.4 | 31.4 | 54.8 | 70.2
4096^3 | 64 | 32 | 0.8 | 20.7 | 27.5 | 35.6
4096^3 | 128 | 64 | <0.7 | 15.2 | 14.0 | 18.9
4096^3 | 256 | 128 | <0.7 | 7.4 | 7.0 | 10.2
8192^3 | 256 | 128 | <0.7 | 46.9 | 83.0 | 101.3
8192^3 | 512 | 256 | <0.7 | 26.9 | 41.5 | 53.1

In this section, we discuss the impact of iFDK on several real-world applications and show the platforms available for iFDK.

In one of the main reference textbooks of image reconstruction [21] (Section 10.7.1), Martz et al. address in detail the necessity of high-resolution image reconstruction and pose an open question about the challenge of high-resolution image reconstruction (we quote): "What happens if we start manipulating [...] volumes?". The work in this paper tackles this challenge by paving the way for generating high-resolution volumes instantly, using algorithmic innovation and HPC best practices.

High-resolution image reconstruction is essential since it can expose more detailed information (i.e. geometry, intensity) and improve the throughput of industrial inspection and scientific research. The remainder of this section gives concrete examples of state-of-the-art innovation in commercial products for high-resolution image reconstruction.

In [17], the industrial CT scanner (named GOM CT) is equipped with an FPD of resolution 3008 x 2512 and is built for turbine blade and casting inspection. In [62], the CT system (named inspeXio SMX-225CT) is equipped with an FPD of 4096-pixel resolution and is used for defect inspection.

Scientific research is also witnessing a boom of interest in high-resolution image reconstruction. For instance, Bicer et al. [9] proposed rapid tomographic image reconstruction through large-scale parallelization (up to 32k cores) to meet the critical demands of scientists, who require quasi-instant feedback in their experiments for rapidly checking results and adjusting experimental configurations while using CT images. This instant capability is in critical demand when scanning objects with huge volumes (e.g. a motor engine, the human brain [44]) and complex details (e.g. the bodies of insects).

In the industrial field, CT is widely used for defect inspection and reverse engineering [22, 29]. Asadizanjani et al. [5] employed micro-CT for non-destructive PCB reverse engineering.
The importance of high resolution in reverse engineering is critical for huge objects with complex structures.

This section discusses potential practical use cases of iFDK in several fields, i.e. medical, industrial, and scientific. The proposed back-projection algorithm and CUDA implementation can be applied in a number of iterative solvers (i.e. ART, MLEM, MBIR), which are popular methodologies in medical imaging for low-dose image reconstruction. In addition, it can provide benefits for real-time CT systems, e.g. 4D-CT [35].

The methods proposed by iFDK are not limited in use to top-tier HPC systems accessible only to a limited number of researchers. A simple calculation shows that generating a 4K volume as in Figure 5a can be done, for example, on Amazon's AWS HPC offerings for less than $100, using 256 p3.8xlarge EC2 instances (with four V100 GPUs per node, similar to the system used in this paper) at on-demand pricing of $12.24 per hour, billed by the second (March 2019, US East Ohio region): 256 instances × $12.24/hour ≈ $3,133 per hour, i.e. roughly $0.87 per second of cluster time, so a $100 budget covers about two minutes and leaves room for a several-fold slowdown over the reported performance to account for the lower-performance network in AWS (10Gbps). Note that we do not consider moving the volumes out of the cloud, since volume visualization and data analysis can be performed on the same cloud platform. We acknowledge that policy and privacy issues would complicate the use of public clouds in medical and industrial CT. Yet we emphasize that, from a technical perspective, the methods used in iFDK make practical and affordable instant high-resolution image reconstruction feasible.

In order to avoid the privacy issues mentioned earlier, one could use an on-premise dense GPU box to achieve the performance reported by iFDK for reasonably high resolutions. For instance, the Nvidia DGX-2 [45] is an appropriate alternative considering it is equipped with 16 Tesla V100 GPUs, 512GB of total GPU memory, 1.5TB of system memory, and 30TB of internal storage. Those specifications would allow iFDK, for instance, to tackle 4K problems within a minute (projected from the results shown in Figure 5a) without privacy concerns. In addition, the DGX-2 has the fast NVSwitch [36] interconnect between GPUs and a high-capacity SSD: iFDK would perform even better due to improvements in communication and I/O. Finally, it is important to mention that the price of the DGX-2 is relatively low when compared to the prices of high-end CT instruments.

Researchers have long worked to optimize CT image reconstruction algorithms. Wu et al. presented an Application-Specific Integrated Circuit (ASIC) solution, which is efficient in computation but expensive in development [72]. The works in [16, 27, 64, 75] applied Field-Programmable Gate Arrays (FPGA) to accelerate the image reconstruction computation. Programming FPGAs in an HDL is complex and thus the trend in recent years is to use high-level synthesis approaches, e.g. OpenCL [26, 40, 63]. Both Treibig et al. and Hofmann et al. optimized FDK with SIMD instruction-set extensions and achieved record performance for CPUs [30, 67]. The performance engineering for RabbitCT [53] on the Intel Xeon Phi accelerator by Hofmann et al. [31] demonstrated a promising approach for optimizing back-projection algorithms. Mostly, the prior work focused on parallelizing image reconstruction algorithms on specific accelerators.
The prior work in [38, 77] focused on decomposing the output problem into several sub-volumes in order to solve it out-of-core. Wang et al. [68] used a 69,632-core distributed system to accelerate iterative reconstruction and achieved a 1,665× speed-up over the baseline. The authors in [10] use up to 6 GPUs to reconstruct 2K and 4K volumes via the FDK method. A lack of details, e.g. computational precision and I/O, complicates direct comparisons with the work mentioned above. An iterative image reconstruction algorithm called Distributed MLEM [19] scaled up to 16 GPUs on multi-GPU clusters. Hidayetoglu et al. [28] present a massively parallel iterative solver for tomographic image reconstruction; their solution scales up to 4,096 GPUs.

The filtered back-projection method is indispensable in most practical CT systems. In this work, we propose a novel and general FDK algorithm that reduces the computational cost and improves data locality. Using CUDA, we implement an efficient back-projection kernel. We further propose a distributed framework that leverages the heterogeneity of modern systems to solve high-resolution image reconstruction problems (e.g. 4K, 8K) in tens of seconds. More specifically, we optimize the filtering computation on CPUs and the back-projection on GPUs, and we perform the inter-node hierarchical computation with MPI collective communication primitives. The experimental results demonstrate the performance and scalability of iFDK and also validate our performance model. For future work, we intend to investigate compression and visualization of the high-resolution volumes. We also plan to provide a real-time image reconstruction cloud service.

ACKNOWLEDGMENT
This work was partially supported by JST-CREST under Grant Number JPMJCR1687. The computational resource of the AI Bridging Cloud Infrastructure (ABCI), provided by the National Institute of Advanced Industrial Science and Technology (AIST), was used. A part of the numerical calculations was carried out on the TSUBAME3.0 supercomputer at Tokyo Institute of Technology. We would like to thank the Endo Lab at Tokyo Institute of Technology for providing a Tesla V100 GPU server.

REFERENCES
[1] Michael D Abràmoff, Paulo J Magalhães, and Sunanda J Ram. 2004. Image processing with ImageJ. Biophotonics International 11, 7 (2004), 36–42.
[2] K. Aditya Mohan, S. V. Venkatakrishnan, J. W. Gibbs, E. B. Gulsoy, X. Xiao, M. De Graef, P. W. Voorhees, and C. A. Bouman. 2015. TIMBIR: A Method for Time-Space Reconstruction From Interlaced Views. IEEE Transactions on Computational Imaging 1, 2 (June 2015), 96–111.
[3] Anders H Andersen and Avinash C Kak. 1984. Simultaneous algebraic reconstruction technique (SART): a superior implementation of the ART algorithm. Ultrasonic Imaging 6, 1 (1984), 81–94.
[4] George B Arfken and Hans J Weber. 1999. Mathematical methods for physicists.
[5] Navid Asadizanjani, Sina Shahbazmohamadi, Mark Tehranipoor, and Domenic Forte. 2015. Non-destructive PCB reverse engineering using X-ray micro computed tomography. In . 1–5.
[6] Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu, and Dhabaleswar K Panda. 2018. OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training. In . IEEE, 143–152.
[7] Benjamin Betz, Steffen Kie, Michael Krumm, Gunnar Knupe, Tsegaye Eshete, and Sven Simon. 2018. Efficient Data Structures for the Fast 3D Reconstruction of Voxel Volumes with Inhomogeneous Spatial Resolution. (2018).
[8] Tekin Bicer, Doga Gursoy, Raj Kettimuthu, Francesco De Carlo, Gagan Agrawal, and Ian Foster. 2015. Rapid Tomographic Image Reconstruction via Large-Scale Parallelization, Vol. 9233. 289–302. https://doi.org/10.1007/978-3-662-48096-0_23
[9] Tekin Bicer, Doga Gursoy, Rajkumar Kettimuthu, Francesco De Carlo, Gagan Agrawal, and Ian T Foster. 2015. Rapid tomographic image reconstruction via large-scale parallelization. In European Conference on Parallel Processing. Springer, 289–302.
[10] Javier Garcia Blas, Monica Abella, Florin Isaila, Jesus Carretero, and Manuel Desco. 2014. Surfing the optimization space of a multiple-GPU parallel implementation of a X-ray tomography reconstruction algorithm. Journal of Systems and Software 95 (2014), 166–175.
[11] Javier Garcia Blas, Florin Isaila, Monica Abella, Jesus Carretero, Ernesto Liria, and Manuel Desco. 2013. Parallel Implementation of a X-ray Tomography Reconstruction Algorithm Based on MPI and CUDA. In Proceedings of the 20th European MPI Users' Group Meeting (EuroMPI '13). ACM, New York, NY, USA, 217–222.
[12] Ronald Newbold Bracewell. 1995. Two-dimensional imaging. Vol. 247. Prentice Hall, Englewood Cliffs.
[13] E Oran Brigham. 1988. The fast Fourier transform and its applications. Vol. 448. Prentice Hall, Englewood Cliffs, NJ.
[14] Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, and Satoshi Matsuoka. 2018. Efficient Algorithms for the Summed Area Tables Primitive on GPUs. In . IEEE, 482–493.
[15] Paul Cockshott and Kenneth Renfrew. 2013. SIMD programming manual for Linux and Windows. Springer Science & Business Media.
[16] Srdjan Coric, Miriam Leeser, Eric Miller, and Marc Trepanier. 2002. Parallel-beam backprojection: an FPGA implementation optimized for medical imaging. In Proceedings of the 2002 ACM/SIGDA tenth international symposium on Field-programmable gate arrays.
NVIDIA Developer Zone. http://docs.nvidia.com/cuda/index.html (2019).
[19] Jingyu Cui, Guillem Pratx, Bowen Meng, and Craig S Levin. 2013. Distributed MLEM: An iterative tomographic image reconstruction algorithm for distributed memory architectures. IEEE Transactions on Medical Imaging 32, 5 (2013), 957–967.
[20] Philippe Despres and Xun Jia. 2017. A review of GPU-based medical image reconstruction. Physica Medica 42 (2017), 76–92.
[21] Harry E. Martz, Clint M. Logan, Daniel J. Schneberk, and Peter J. Shull. 2017. X-ray imaging: fundamentals, industrial techniques, and applications. CRC Press, Taylor & Francis Group, Boca Raton.
[22] Liyong Fang, Hui Li, Jinping Bai, and Bailin Li. 2013. Application of industrial CT in reverse engineering technology. High Power Laser and Particle Beams 25, 7 (2013), 1620–1624.
[23] LA Feldkamp, LC Davis, and JW Kress. 1984. Practical cone-beam algorithm. JOSA A 1, 6 (1984), 612–619.
[24] Richard Gordon, Robert Bender, and Gabor T Herman. 1970. Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and X-ray photography. Journal of Theoretical Biology 29, 3 (1970), 471–481.
[25] Jens Gregor. 2011. Distributed CPU multi-core implementation of SIRT with vectorized matrix kernel for micro-CT. Proceedings of the 11th Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine (2011).
[26] Devin Held. 2016. Analysis of 3D Cone-Beam CT Image Reconstruction Performance on a FPGA. (2016).
[27] I Henry and Ming Chen. 2012. An FPGA Architecture for Real-Time 3-D Tomographic Reconstruction. Ph.D. Dissertation. University of California, Los Angeles.
[28] M. Hidayetoglu, C. Pearson, I. El Hajj, L. Gurel, W. C. Chew, and W. Hwu. 2018. A Fast and Massively-Parallel Inverse Solver for Multiple-Scattering Tomographic Image Reconstruction. In . 64–74. https://doi.org/10.1109/IPDPS.2018.00017
[29] Akira Hirakimoto. 2002. Microfocus X-ray computed tomography and its industrial applications. In Analytical Sciences/Supplements: Proceedings of the IUPAC International Congress on Analytical Sciences 2001 (ICAS 2001). The Japan Society for Analytical Chemistry, i123–i125.
[30] Johannes Hofmann, Jan Treibig, Georg Hager, and Gerhard Wellein. 2014. Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips. In Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing. ACM, 57–64.
[31] Johannes Hofmann, Jan Treibig, Georg Hager, and Gerhard Wellein. 2014. Performance engineering for a medical imaging application on the Intel Xeon Phi accelerator. In Architecture of Computing Systems (ARCS), 2014 Workshop Proceedings. VDE, 1–8.
[32] Ramesh Jain, Rangachar Kasturi, and Brian G Schunck. 1995. Machine vision. Vol. 5. McGraw-Hill, New York.
[33] Xun Jia, Bin Dong, Yifei Lou, and Steve B Jiang. 2011. GPU-based iterative cone-beam CT reconstruction using tight frame regularization. Physics in Medicine & Biology 56, 13 (2011), 3787.
[34] Avinash C. Kak and Malcolm Slaney. 1988. Principles of computerized tomographic imaging. IEEE Press, New York.
[35] PJ Keall, G Starkschall, HEE Shukla, KM Forster, V Ortiz, CW Stevens, SS Vedam, R George, T Guerrero, and R Mohan. 2004. Acquiring 4D thoracic CT scans using a multislice helical method. Physics in Medicine & Biology 49, 10 (2004), 2053.
[36] Ang Li, Shuaiwen Leon Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan Tallent, and Kevin Barker. 2019. Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect. arXiv preprint arXiv:1903.04611 (2019).
[37] Wenxuan Liang, Hui Zhang, and Guangshu Hu. 2010. Optimized implementation of the FDK algorithm on one digital signal processor. Tsinghua Science and Technology 15, 1 (2010), 108–113.
[38] Yuechao Lu, Fumihiko Ino, and Kenichi Hagihara. 2016. Cache-aware GPU optimization for out-of-core cone beam CT reconstruction of high-resolution volumes. IEICE Transactions on Information and Systems 99, 12 (2016), 3060–3071.
[39] KV Manian, AA Ammar, A Ruhela, C-H Chu, H Subramoni, and DK Panda. 2019. Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures. In Proceedings of the 12th Workshop on General Purpose Processing Using GPUs. ACM, 43–52.
[40] Maxime Martelli, Nicolas Gag, Alain Mérigot, and Cyrille Enderli. 2017. 3D Tomography back-projection parallelization on FPGAs using OpenCL. In Design and Architectures for Signal and Image Processing (DASIP), 2017 Conference on. IEEE, 1–6.
[41] Wolfram MathWorld. 2019. Convolution Theorem. http://mathworld.wolfram.com/ConvolutionTheorem.html [Online; accessed 13-May-2019].
[42] Glenn R Myers, Shane J Latham, Andrew M Kingston, Jan Kolomazník, Václav Krajíček, Tomáš Krupka, Trond K Varslot, and Adrian P Sheppard. 2016. High cone-angle x-ray computed micro-tomography with 186 gigavoxel datasets. In Developments in X-Ray Tomography X.
Parallel Comput. Proc. Fully Three-Dimensional Image Reconstruct. Radiol. Nucl. Med.
Inverse Problems 25, 12 (2009), 123009.
[49] Sreeram Potluri, Khaled Hamidouche, Akshay Venkatesh, Devendar Bureddy, and Dhabaleswar K Panda. 2013. Efficient inter-node MPI communication using GPUDirect RDMA for InfiniBand clusters with NVIDIA GPUs. In . IEEE, 80–89.
[50] CT ProScan. 2019. Reverse Engineering Services. https://3dproscan.com/industrial-ct-scanning-services-reverse-engineering/.
[51] N Rezvani, D Aruliah, K Jackson, D Moseley, and J Siewerdsen. 2007. SU-FF-I-16: OSCaR: An open-source cone-beam CT reconstruction tool for imaging research. Medical Physics 34, 6Part2 (2007), 2341–2341.
[52] Simon Rit, M Vila Oliva, Sébastien Brousmiche, Rudi Labarbe, David Sarrut, and Gregory C Sharp. 2014. The Reconstruction Toolkit (RTK), an open-source cone-beam CT reconstruction toolkit based on the Insight Toolkit (ITK). In Journal of Physics: Conference Series, Vol. 489. IOP Publishing, 012079.
[53] Christopher Rohkohl, Benjamin Keck, HG Hofmann, and Joachim Hornegger. 2009. RabbitCT - an open platform for benchmarking 3D cone-beam reconstruction algorithms. Medical Physics 36, 9Part1 (2009), 3940–3944.
[54] Jeffrey M Rosen, Junjie Wu, TF Wenisch, and JA Fessler. 2013. Iterative helical CT reconstruction in the cloud for ten dollars in five minutes. In Proc. Intl. Mtg. on Fully 3D Image Recon. in Rad. and Nuc. Med. 241–4.
[55] Amit Sabne, Xiao Wang, Sherman J Kisner, Charles A Bouman, Anand Raghunathan, and Samuel P Midkiff. 2017. Model-based iterative CT image reconstruction on GPUs. ACM SIGPLAN Notices 52, 8 (2017), 207–220.
[56] Ing Karl Schwarz. 2011. The Texture Unit as Performance Booster in CT-Reconstruction. In NVIDIA Tesla GPU Computing at the International Supercomputing Conference 2011.
[57] Estefania Serrano, Guzman Bermejo, Javier Garcia Blas, and Jesus Carretero. 2014. High-performance X-ray tomography reconstruction algorithm based on heterogeneous accelerated computing systems. In Cluster Computing (CLUSTER), 2014 IEEE International Conference on. IEEE, 331–338.
[58] Hongzhang Shan and John Shalf. 2007. Using IOR to analyze the I/O performance for HPC platforms. (2007).
[59] GC Sharp, N Kandasamy, H Singh, and Michael Folkert. 2007. GPU-based streaming architectures for fast cone-beam CT image reconstruction and demons deformable registration. Physics in Medicine & Biology 52, 19 (2007), 5771.
[60] Lawrence A Shepp and Benjamin F Logan. 1974. The Fourier reconstruction of a head section. IEEE Transactions on Nuclear Science 21, 3 (1974), 21–43.
[61] Lawrence A Shepp and Yehuda Vardi. 1982. Maximum likelihood reconstruction for emission tomography. IEEE Transactions on Medical Imaging.
Medical Imaging 2011: Physics of Medical Imaging, Vol. 7961. International Society for Optics and Photonics, 79612Q.
[64] Nikhil Subramanian. 2009. A C-to-FPGA solution for accelerating tomographic reconstruction. Ph.D. Dissertation. University of Washington.
[65] Stewart Taylor. 2007. Optimizing applications for multi-core processors, using the Intel integrated performance primitives. Intel Press.
[66] Darren Thompson, Ya I Nesterets, TE Gureyev, Arthur Sakellariou, Alex Khassapov, and John Taylor. 2011. Rapid CT reconstruction on GPU-enabled HPC clusters. MODSIM 2011, December 12-16 2011, Perth, Australia (2011).
[67] Jan Treibig, Georg Hager, Hannes G Hofmann, Joachim Hornegger, and Gerhard Wellein. 2013. Pushing the limits for medical image reconstruction on recent standard multicore processors. The International Journal of High Performance Computing Applications 27, 2 (2013), 162–177.
[68] Xiao Wang, Amit Sabne, Putt Sakdhnagool, Sherman J Kisner, Charles A Bouman, and Samuel P Midkiff. 2017. Massively parallel 3D image reconstruction. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 3.
[69] Karl Wiesent, Karl Barth, Nassir Navab, Peter Durlak, Thomas Brunner, Oliver Schuetz, and Wolfgang Seissler. 2000. Enhanced 3-D-reconstruction algorithm for C-arm systems suitable for interventional procedures. IEEE Transactions on Medical Imaging 19, 5 (2000), 391–403.
[70] Wikipedia. 2019. Circular buffer — Wikipedia, The Free Encyclopedia. http://en.wikipedia.org/w/index.php?title=Circular%20buffer&oldid=888744485. [Online; accessed 02-April-2019].
[71] Wikipedia. 2019. Gantry (medical) — Wikipedia, The Free Encyclopedia. http://en.wikipedia.org/w/index.php?title=Gantry%20(medical)&oldid=881118523. [Online; accessed 23-March-2019].
[72] Michael A Wu. 1991. ASIC applications in computed tomography systems. In ASIC Conference and Exhibit, 1991. Proceedings., Fourth Annual IEEE International. IEEE, P1–3.
[73] Fang Xu and Klaus Mueller. 2005. Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware. IEEE Transactions on Nuclear Science 52, 3 (2005), 654–663.
[74] Fang Xu and Klaus Mueller. 2007. Real-time 3D computed tomographic reconstruction using commodity graphics hardware. Physics in Medicine & Biology.
Medical Imaging 2006: Physics of Medical Imaging, Vol. 6142. International Society for Optics and Photonics, 61424L.
[76] Jiansheng Yang, Xiaohu Guo, Qiang Kong, Tie Zhou, and Ming Jiang. 2006. Parallel implementation of Katsevich's FBP algorithm. International Journal of Biomedical Imaging.