High performance volume ray casting: A branchless generalized Joseph projector
MManuscript for peer review · [email protected] · September 4, 2016
Efficient ray tracing on 3D regular grids for fastgeneration of digitally reconstructed radiographs initerative tomographic reconstruction techniques
Jonas Dittmann a, ∗ , a Lehrstuhl f¨ur R¨ontgenmikroskopie, Universit¨at W¨urzburg,Josef Martin Weg 63, 97074 W¨urzburg, Germany
Abstract
Cone beam projection is an essential and particularly time consuming part of anyiterative tomographic reconstruction algorithm. On current graphics hardwareespecially the amount and pattern of memory accesses is a limiting factor whenread-only textures cannot be used. With the final objective of acceleratingiterative reconstruction techniques, a non-oversampling Joseph-like raytracingprojection algorithm for three dimensions featuring both a branchless samplingloop and a cache friendly memory access pattern is presented.An interpretation of the employed interpolation scheme is given with respectto the effective beam and voxel models implied. The method is further comparedto existing techniques, and the modifications required to implement further voxeland beam shape models are outlined.Both memory access rates and total run time are benchmarked on a currentconsumer grade graphics processing unit and explicitly compared to the perfor-mance of a classic Digital Differential Analyzer (DDA) algorithm. The presentedraytracer achieves memory access rates of 292 GB/s in read-and-write memoryand 502 GB/s in read-only texture memory. It outperforms the DDA in termsof total run time by a factor of up to five and achives 170 to 300 projections of a512 voxel volume per second. Keywords: ray tracing; ray casting; forward projection; X-ray transform; houghtransform; digitally reconstructed radiograph; volume visualization; voxel basisfunction; computed tomography; tomographic reconstruction; iterativereconstruction; ∗ Corresponding Author J. Dittmann, Josef Martin Weg 63, 97074 W¨urzburg, Germany,Phone: +49 931 31 88830
Email address: [email protected] (Jonas Dittmann) a r X i v : . [ phy s i c s . m e d - ph ] S e p anuscript for peer review · [email protected] · September 4, 2016
1. Introduction
X-ray Computed Tomography is an important instrument for non invasiveimaging of the interior of opaque objects by means of a multitude of X-ray imagesat different orientations of the object in question. Classically, the reconstructionof the seeked image out of the cumulative absorption images is done by FilteredBackprojection, a method which analytically derives from the relation betweenthe Fourier transform of a tomographic slice and the Fourier transform of itsX-ray projections (the Fourier slice theorem).There are many cases though in which iterative methods as e.g. SART(Simultaneous Algebraic Reconstruction Technique, [1]) are required, i.e. meth-ods that simulate X-ray projections of a preliminary solution and manipulatethis solution until the calculated projections match the observed ones. Thesetechniques allow for more general imaging models and can be combined withiterative regularization techniques that are required in cases of incomplete data.The major drawback of iterative methods is their high demand for computationalpower, as any iterative method consists of many repetitions of forward projections(i.e. simulations of the X-ray projections) and corresponding backprojections ofcorrections to the solution.The task of forward projection is equivalent to evaluating the intersection ofX-ray beams hitting the detector bins with the pixels or voxels of the tomogram.The problem can be stated in terms of either of two underlying approaches. First,one may regard the connecting lines between X-ray source and detector pixelsand then determine intersections with image pixels or voxels ( ray driven or raytracing methods). Secondly, one can as well regard the connecting lines betweensource and image voxels and rather determine the affected detector pixels ( voxeldriven methods).Although voxel driven methods inherently allow for optimal memory accesspatterns within the volume image to be projected, ray tracing methods are bettersuited for parallel architectures as no explicit synchronization between differentthreads handling independent rays is necessary. Their mostly non-contiguousmemory access pattern typically is a speed limiting factor though. Ray tracing has been studied since the advent of raster graphics. Besidesiterative tomographic reconstruction, its applications range from renderinglines on a raster grid over volume graphics visualization to efficient calculationof “digitally reconstructed radiographs” (DRR) for 3D image registration andradiation therapy treatment planning.The key concept in the majority of fast ray tracing algorithms on regulargrids is the notion of a “driving axis” [2–6] as already introduced in 1962 byBresenham in the context of line drawing. The line or ray will be traced inunit steps of the designated driving axis and the intersected pixels or voxels aredetermined by evaluating the line equation at each step. Effective limiting of2 anuscript for peer review · [email protected] · September 4, 2016 any line’s slope to a range within ± ° by choice of the driving axis ensures thatno relevant pixels or voxels are skipped.This concept is, in the context of DRR, explicitly or implicitly used e.g. byJosephs’ algorithm [2], by shear-warp techniques [7, 8] or ray-driven splattingalgorithms [9, 10] as well as by the recently popular “Distance Driven Method” [11].It also emerges naturally from general sampling considerations, as interpolationcan thus be avoided along the driving axis.Prominent alternative techniques are the algorithm by Siddon [12] andvariants thereof [13–16], which trace lines in irregular steps from intersection tointersection with any of the planes perpendicular to the coordinate axes. Theirfinal objective of calculating the exact line-voxel intersection though can as wellbe achieved with driving-axis based algorithms [3, 17].An important factor for the design of a ray tracer for DRR purposes is theassumed underlying system model. Most prevalent is the assumption that imagedobjects can be exactly modeled by cubic voxels of homogeneous density andincident radiation by rectangular beam profiles of finite extent. Much effort hasbeen put into the development of exact ray tracers in this respect [3, 17–25].When arguing that there is no outstanding reason to assume homogeneouscubic voxels, the complexity for an “exact” ray tracer can be considerably reducedby assuming isotropic basis functions instead of box-shaped voxels [10, 26–28] asmodeling of both voxel and beam profiles can then be merged into an interpolationfunction (footprint) acting on ray-voxel distances. Other footprint based methodsreplace the latter distance by algorithmically more convenient approximations[2, 11, 29]. Joseph’s projector in particular uses an implicit footprint emergingfrom the interpolated sampling only within the immediate neighborhood of asampling point. It has been adopted in the past e.g. to trace X-rays throughparallel stacks of textured polygons [30, 31] or for list mode PET reconstructions[32].Beam width modeling can, with Joseph’s method, easliy approximated bysufficiently matching the voxel grid size to the beam width. Both oversamplingof rays or sampling an extended neighborhood of each sampling point canthus be avoided and the total amount of required memory accesses kept to aneccessary minimum. Although fourier based projection methods (for parallelbeam geometries) [33] or divide and conquer approaches exploiting partialoverlaps of rays at close by viewing angles allow to systematically reduce theamount of memory operations even further [34], those methods only apply whenmany angular densely sampled projections are to be calculated simultaneously.Independent of employed model and sampling technique, researchers havetried to utilize the high processing power of dedicated graphics hardware inorder to speed up CT reconstruction. Starting with SGI graphics workstationsin the 1990s, the tomographic problem began to be restated in terms of graphicsoperations that were efficiently implemented in hardware. Detailed reviews of thedevelopments up to now are given in [35, 36]. Current general purpose graphicsprocessors are much more versatile and although efficient algorithms still needto be tailored to the specific strengths of the hardware, they are not limited toclassic graphics operations such as texture mapping anymore.3 anuscript for peer review · [email protected] · September 4, 2016 v o x e l v o l u m e p i x e l d e t e c t o r (0,0,0) d → s → r a y r → ~ r s a m p li n g i n c r e m e n t (source) (detector pixel) driving axis Figure 1:
Left:
Illustration of the ray tracing problem for a ray (cid:126)r between source (cid:126)s and detectorpixel (cid:126)d through a voxel volume in steps of the sampling increment vector ˜ r . Coordinates (cid:126)s and (cid:126)d are defined with respect to the origin of the voxel grid, and distance units are normalized to thevoxel grid spacing. ˜ r is defined as ˜ r = (cid:126)r/r m (with m = argmax i ( | r i | )) and thus 1 ≤ (cid:107) ˜ r (cid:107) ≤ √ r ensures unit increments along at least one axis (the “driving axis”). Right:
The driving axis concept is based on the observation that a line with slope ≤ (cid:107) ˜ r (cid:107) along the line. The present article will contribute a particularly short and instructive for-mulation of a Joseph-like ray tracer for three dimensions, which will thus bereferred to as “Generalized Joseph Projector” (GJP) in the following. Smoothalternatives to the linear sampling kernel as well as a suiting choice of voxelgrid spacing in case of cone beam geometries will be discussed. Although thedriving axis concept will be applied to derive the sampling pattern, the presentedalgorithm does not actually use a driving axis in the sense of dedicated codebranches, i.e. handles all cases in a single code branch. It is designed withmodern graphics processor architectures in mind and addresses three typicalbottlenecks: kernel complexity, branching and memory access. The proposedalgorithm does in particular not base its efficiency on fast read-only GPU texturememory and is thus well suited for iterative methods that require to constantlymodify the voxel volume.
2. Method
The basic problem is illustrated in Fig. 1. A ray (cid:126)r emanating from a sourceposition (cid:126)s traverses a voxel volume and hits a detector pixel at location (cid:126)d . Alongits intersection with the volume, the latter will be sampled in steps of ˜ r , alsotaking into account the finite width of an X-ray beam. For convenience, thecoordinate system is aligned with the voxel grid, i.e. the origin is at the cornerof the voxel volume and all distance units will be expressed in terms of gridspacing. Classically and particularly in 2D line drawing or ray tracing problems, thepath to trace is expressed as a linear equation for which the driving axis definesthe input parameter. The driving axis is chosen such that the derivative orslope of the line equation lies within − anuscript for peer review · [email protected] · September 4, 2016 steps of the driving axis (cf. Fig. 1). While this scheme is optimal in the sensethat it neither regards any affected voxels twice nor skips any, the particularformulation in terms of line equations is unfavorable on SIMD architectures(single instruction, multiple data; also “vector processor”) due to the requiredcase differentiation depending on the current driving axis.We will thus switch to a more convenient parametric vector representation,based on the positions of source (cid:126)s and detector pixel (cid:126)d relative to the volumeorigin. All points (cid:126)p on the line are then characterized by (cid:126)p = (cid:126)s + l(cid:126)r with (cid:126)r = (cid:126)d − (cid:126)s (1)and l beeing the free parameter. The driving axis m is then defined by thelargest coefficient of (cid:126)r : m = argmax i ( | r i | ) . (2)For convenience, we will define the indices¯ m and ¯ m (3)to refer to the remaining non-driving axes.An increment vector ˜ r between successive sampling points will be chosensuch that the resulting sampling points are always aligned with the driving axis,which holds for ˜ r = (cid:126)rr m . (4)The first possible sampling point is defined by the intersection of the raywith the plane perpendicular to the driving axis m , i.e.[ (cid:126)s + o ˜ r ] m ! = 0 ⇒ o = − s m (5)where o is the distance between source and first sampling plane in units of thesampling increment (cid:107) ˜ r (cid:107) . o will thus be called “sampling offset” in the following.The volume can now be sampled at points (cid:126)p ( i ) along the defined path in unitsteps of axis m by evaluating (cid:126)p ( i ) = ( (cid:126)s + o ˜ r ) + i ˜ r (6)for integer i ∈ [0 , i max ], where i max is defined by the extent of the voxel volumealong axis m .The driving axis component p ( i ) m is always integer by construction of samplingincrement ˜ r , sampling offset o and sampling index i . When sampling the volumeat (cid:126)p ( i ) , interpolation is thus only required along the non-driving axes, i.e. withina group of 2 × anuscript for peer review · [email protected] · September 4, 2016 d r i v i n g a x i s → n + n + n + n r a y → r a y → i n t e r p o l a t e d a x i s → i n t e r p o l a t e d a x i s → ~ r ~ r Figure 2:
Left:
Illustration of the proposed ray tracing and sampling scheme. The volumeis sampled at integer positions of the driving axis defined by the largest coefficient of (cid:126)r orequivalently in intervals of ˜ r (Eq. 4). Interpolation is thus only necessary within 4-voxel blocksalong the remaining axes which are marked in changing color with increasing sampling index.Above, the intersection points (cid:126)p ( i ) are shown in their respective sampling planes describedby their nearest neighbor voxel locations (cid:126)v ( i, − . Right:
The concept of a ray piercingconsecutive 4-voxel planes is illustrated for all possible driving axes. The sampling planes’orientation automatically emerges from the nearest neighbor selection according to Eq. 15without requiring case differentiations. p (i)m_ → v (i,2) → v (i,1) → v (i,3) → v (i,4) → p (i) └ p (i)m_ ┘ p (i)m_ – └ p (i)m_ ┘ – − . . . . . − . . . . . − . . . . . − . . . . . Figure 3:
Left: relation between sampling point (cid:126)p ( i ) and nearest neighbor voxels (cid:126)v ( i, − .The distances p ( i )¯ m − (cid:4) p ( i )¯ m (cid:5) and p ( i )¯ m − (cid:4) p ( i )¯ m (cid:5) are used to calculate interpolation weights(with (cid:4) (cid:5) designating the floor operation). Center and right:
Examples of separable splineinterpolation and bi-linear interpolation (cf. Eqs. 11, 13, 14) anuscript for peer review · [email protected] · September 4, 2016
Given the minimal set of 2 × interpolation indeedresults in a 3D generalization of Joseph’s method.To avoid aliasing artifacts caused by high frequency components of the kinkedlinear interpolation kernel, more bandlimited smooth kernels such as the raisedcosine or Hann(ing) function have been proposed for image interpolation [26, 37].Sunneg˚ardh and Danielsson motivated a similarly bell shaped kernel as stripintegral model over a linearly interpolated image [29].All of the above mentioned first order methods are described by a onedimensional interpolation kernel w ( d ); d ∈ [0 ,
1) (7)acting on the fractional parts (i.e. normalized distances to grid planes) d / = p ( i )¯ m / − (cid:4) p ( i )¯ m / (cid:5) of p ( i )¯ m and p ( i )¯ m (cf. Fig. 3). The following properties are requiredfor interpolation: w (1 − d ) = 1 − w ( d ) (8) w (0) = 1 , (9)i.e. interpolation is symmetric and short distances result in high weight. Asensible interpolation kernel should also be monotonic: w (cid:48) ( d ) ≤ ∀ d ∈ [0 , . (10)Interpolation in higher dimensions is achieved by multiplication: w ( d , . . . , d n ) = (cid:89) i w ( d i ) . (11)The more bandlimiting kernels have in common that they further ensure smooth-ness at the boundaries, which for first order methods can be guaranteed byrequiring the derivative w (cid:48) ( d ) to vanish on grid points: w (cid:48) ( d ) | d ∈{ , } = 0 . (12)In the following we will regard two examples, namely linear interpolation aswell as a smooth spline kernel modeled in the style of the Hanning function (yetless computationally expensive): w lin ( d ) = 1 − d (13) w spl ( d ) = 1 − d + 2 d , (14) As the sampling points are aligned with one axis, tri-linear interpolation is identical tobi-linear interpolation within the remaining two dimensions here anuscript for peer review · [email protected] · September 4, 2016
By construction, every sampling point (cid:126)p ( i ) as defined by Eqs. 4–6 has (in 3D)at most two non-integer components. These non-integer coordinates necessarilylie between two integer ones along their respective axes. For each sampling point,the group of four nearest neighbor voxels (shown as 1 × × p ( i )¯ m and p ( i )¯ m : (cid:126)v ( i, = ( p ( i ) m , (cid:4) p ( i )¯ m (cid:5) , (cid:4) p ( i )¯ m (cid:5) ) (cid:126)v ( i, = ( p ( i ) m , (cid:6) p ( i )¯ m (cid:7) , (cid:6) p ( i )¯ m (cid:7) ) (cid:126)v ( i, = ( p ( i ) m , (cid:4) p ( i )¯ m (cid:5) , (cid:6) p ( i )¯ m (cid:7) ) (cid:126)v ( i, = ( p ( i ) m , (cid:6) p ( i )¯ m (cid:7) , (cid:4) p ( i )¯ m (cid:5) )where (cid:4) (cid:5) and (cid:6) (cid:7) designate the floor and ceiling operators respectively. In thisform, the determination of (cid:126)v ( i, − requires explicit treatment of three differentcases depending on m ∈ { , , } . This can however for algorithmic conveniencebe formulated in a completely driving axis (i.e. m ) agnostic way: (cid:126)v ( i, = ( (cid:4) p ( i )1 (cid:5) , (cid:4) p ( i )2 (cid:5) , (cid:4) p ( i )3 (cid:5) ) (cid:126)v ( i, = ( (cid:4) p ( i )1 (cid:5) , (cid:6) p ( i )2 (cid:7) , (cid:6) p ( i )3 (cid:7) ) (cid:126)v ( i, = ( (cid:6) p ( i )1 (cid:7) , (cid:4) p ( i )2 (cid:5) , (cid:6) p ( i )3 (cid:7) ) (cid:126)v ( i, = ( (cid:6) p ( i )1 (cid:7) , (cid:6) p ( i )2 (cid:7) , (cid:4) p ( i )3 (cid:5) ) (15)by exploiting that (cid:4) p ( i ) m (cid:5) = (cid:6) p ( i ) m (cid:7) = p ( i ) m . Independent of m ∈ { , , } , the vectors (cid:126)v ( i, − define a group of at mostfour voxels in a plane perpendicular to the driving axis. Special cases arise wheneither of p ( i )¯ m or p ( i )¯ m is also integer, which leads to only one or two different (cid:126)v ( i, − . Lines 13 and 14 in Algorithm 1 ensure that duplicate voxels within (cid:126)v ( i, − will receive zero weight. Sampling in the context of X-ray projection (equivalently DRR) impliesintegration within the sampled volume over a box shaped beam profile of finitewidth perpendicular to the ray direction. The proposed sampling kernels (Eqs.13 and 14) themselves thus represents a convolution of a kernel describing theprojection of a voxel basis function (its “footprint”) and a box shaped kerneldescribing the X-ray beam (cf. Fig. 4). The effective voxel footprint f ( d )implicitly modeled by Eqs. 13 and 14 can thus be obtained by differentiation8 anuscript for peer review · [email protected] · September 4, 2016 − . − . . . . . . . . − . − . . . . . . . . Figure 4: 1D interpolation kernel w spl ( | d | ) (right) and interpretation as convolution of effectivevoxel footprint and beam profile (left). For w lin ( | d | ) (not shown), the voxel footprint is equallybox shaped as the beam profile. (a) (b) (c) Figure 5: Effective range of influence resulting from interpolation among the nearest neighborpixels at a single sampling point. The sampling point (black square) is chosen to be infinitesi-mally close to a grid point. The voxels considered by the sampling procedure are marked inblue. Assuming a voxel grid spacing matched to the modeled beam width, the modeling iscorrect when the ray is axis-aligned (a). In the worst case, when the ray runs diagonal throughthe grid, the effective modeled beam width and voxel diameter is 1 / √ and coordinate shifting: f lin ( d ) = (cid:40) − . < d ≤ .
50 else f spl ( d ) = (cid:40) (1 − d ) − . < d ≤ .
50 else . These results can conversely be verified by convolution with a box function ofunit width and heights centered about the origin.Note that d is always measured along grid axes, i.e. typically not perpendicularto the ray direction. This implies that both the footprint and the implicitlymodeled beam width become narrower when the ray is oriented diagonal inthe voxel grid. Figure 5 gives an illustration of this orientation dependenteffective voxel and beam width modeling, which arises from the nearest neighborevaluation based on the sampling points (cid:126)p ( i ) rather than on the true geometricdistance d geom ( (cid:126)p ( i ) , (cid:126)v ) = (cid:13)(cid:13) ( (cid:126)p ( i ) − (cid:126)v ) × ˜ r (cid:13)(cid:13) / (cid:107) ˜ r (cid:107) (16)between voxel centers (cid:126)v in the vicinity of (cid:126)p ( i ) and an arbitrarily oriented line˜ r . The effective implicit voxel and beam widths can be defined as the smallest9 anuscript for peer review · [email protected] · September 4, 2016 geometric distance between center ray and voxels not considered by the samplingscheme. This distance is given bymin { (cid:126)v } (cid:18) (cid:107) (cid:126)v × ˜ r (cid:107)(cid:107) ˜ r (cid:107) (cid:19) { (cid:126)v | v ¯ m , v ¯ m ∈ {− , , } ∧ ( v ¯ m , v ¯ m ) (cid:54) = (0 , ∧ v m = 0 } , where { (cid:126)v } represents the set of 8 possibly unconsidered nearest neighbors withinthe sampling plane to a sampling point infinitely close to one grid point. In 2D,this reduces to 1 / (cid:13)(cid:13) ˜ r (cid:13)(cid:13) , which has a minimum of 1 / √ / √ r = (1 , , (cid:112) / In order to realize truly isotropic footprints and beam widths without sacri-ficing the branchless sampling scheme defined by Eqs. 6 and 15, oversamplingcan be employed in combination with evaluation of the true geometric distance(Eq. 16) between (cid:126)v ( i, − and the ray. By modeling two (or four, in 3D) beams of1 / Algorithm 1 provides a commented pseudo code implementation of theproposed GJP algorithm, including explicit sampling and interpolation. Thealgorithm consists of four parts: the initilization phase sets up the major axis m ,the sampling increment ˜ r , the sampling offset o and the maximum sampling index i max . Based on these, the sampling points (cid:126)p ( i ) are evaluated at the beginningof the sampling loop. For each (cid:126)p ( i ) the affected voxels and and their respectiveinterpolation weights are explicitly determined. Finally, interpolated samples areweighted by the sampling interval (cid:107) ˜ r (cid:107) and accumulated in the reduction variable a . For presentation purposes, only lower and upper bounds for sampling offset o and sample point count i max are used in combination with an in bounds test inline 8. The latter is not essential to the algorithm though and becomes obsoletewhen entry and exit points of the ray with respect to the voxel volume arecalculated exactly. The sampling loop does in particular not require branchingdepending on the driving axis m .Based on the employed interpolation kernel w , the method will be referredto as GJP spl (GJP with spline interpolation kernel) or GJP lin (GJP with linearinterpolation kernel). Linear interpolation is on graphics processing hardwareusually directly provided by dedicated texture units and needs not always beexplicitly implemented. In these cases, the evaluation of (cid:126)v ( i, − and correspond-ing interpolation weights can be ommited and the volume image may be directlyevalutated at the sampling points (cid:126)p ( i ) . The linear interpolating GJP lin utilizinghardware provided interpolation will be termed GJP hwlin .10 anuscript for peer review · [email protected] · September 4, 2016
Algorithm 1
The proposed ray tracing and sampling algorithm as illustratedin Figs. 1–5. All coordinates are expressed in units of the Cartesian grid that isought to be sampled. The in bounds test in line 8 is not essential and becomesobsolete when o and i max are calculated exactly. The sampling loop iteratingover indices i is thus generally branchless and further, as it needs not be executedin any particular order, is itself parallelizable. m ← argmax i ( | r i | ) (cid:46) m : major (driving) axis ˜ r ← (cid:126)r/r m (cid:46) ˜ r : sampling increment vector o ← − s m (cid:46) o : sampling offset, (cid:126)s : ray source point i max ← volumeDimensions[ m ] (cid:46) i max : maximal sampling points a ← (cid:46) a : accumulator variable for i = 0 .. i max do (cid:46) iterate over sampling points (cid:126)p ← (cid:126)s + ( o + i ) · ˜ r (cid:46) (cid:126)p : current sampling point if (cid:126)p is in volume then (cid:46) alternatively: precise choices for o and i max (cid:126)p fl ← floor( (cid:126)p ) (cid:46) find lower voxel indices (cid:126)p cl ← ceil( (cid:126)p ) (cid:46) find upper voxel indices assert: p fl ,m = p cl ,m = p m = i (cid:46) by design of o and ˜ r , p m = i (cid:126)w fl ← w ( (cid:126)p − (cid:126)p fl ) (cid:46) elementwise mapping of distances tointerpolation weights (cf. Fig. 3, Eqs. 13, 14) (cid:126)w cl ← − (cid:126)w fl (cid:46) calculate complementary weights w fl ,m ← w cl ,m ← (cid:46) special case: driving axis a ← a +volume[ p fl , , p fl , , p fl , ] · (cid:107) ˜ r (cid:107) · w fl , · w fl , · w fl , (cid:46) (cf. Eqs. 11, 15) a ← a +volume[ p fl , , p cl , , p cl , ] · (cid:107) ˜ r (cid:107) · w fl , · w cl , · w cl , a ← a +volume[ p cl , , p fl , , p cl , ] · (cid:107) ˜ r (cid:107) · w cl , · w fl , · w cl , a ← a +volume[ p cl , , p cl , , p fl , ] · (cid:107) ˜ r (cid:107) · w cl , · w cl , · w fl , end if end for . anuscript for peer review · [email protected] · September 4, 2016 N / Δ v N/2 Δ d α /2 S D Δ d Δ v α . . . . . . ∆ d / ∆ v min. rayspacingmax. rayspacing Figure 6:
Left:
Relation between voxel grid size and detector size. The voxel grid size is chosensuch that it exactly contains the circular field of view defined by the irradiation cone (withcone angle α ) between source and detector. The grid sampling is chosen to be N × N for adetector of N equispaced pixels. Voxel spacing ∆ v and detector pixel size ∆ d are then relatedby arcsin( N v S ) = arctan( N d S + D ). Right: minimal and maximal projected width ∆ (cid:48) d (atthe source- and detector facing edges respectively) of the divergent beams within the voxelgrid in units of ∆ v for different cone angles α . The critical value of ∆ (cid:48) d / ∆ v > √ is exceededonly for very large α > . ° (equivalently: S < N ∆ v ). A fundamental assumption of the proposed ray tracing (and particular sam-pling) method is that the voxel grid is approximately matched to the modeledbeam width. In case of parallel beams, the voxel grid spacing ∆ v would con-sequently be chosen equal to the detector pixel spacing ∆ d , while it becomessuccessively smaller with increasing cone angle. From the sketch in Fig. 6, thetrigonometric relation arcsin( N v S ) = arctan( N d S + D )between detector size N ∆ d and voxel volume diameter N ∆ v can be derived,resulting in an optimal choice for ∆ v dependent on source and detector distances S and D : ∆ v = ∆ d S (cid:113) ( D + S ) + ( N ∆ d ) , (17)which reduces to ∆ v = ∆ d S ( D + S ) (linear scaling) for D + S (cid:29) N ∆ d (small coneangles).Although for divergent beams the projected detector pixel (or beam) width∆ (cid:48) d obviously varies within the field of view, the maximal mismatch of ∆ (cid:48) d and∆ v at the source- and detector-facing sides of the volume stays within 20% evenfor rather large cone angles of 20 ° (cf. Fig. 6). Only above 38 . ° (equivalently: S < N ∆ v ) the maximal mismatch becomes so large that the ray tracingprocedure would begin to skip voxels at the detector-facing side of the volumewhen the grid further is in 45 ° orientation to the ray direction, as the projectedray spacing ∆ (cid:48) d exceeds the minimal effective grid spacing of cos(45 ° )∆ v = √ ∆ v by a factor of 2 in that case. 12 anuscript for peer review · [email protected] · September 4, 2016
Sequentially reading subsequent memory addresses classically is the optimalmemory access strategy for serial algorithms. This is due to the caching behaviourof CPUs or memory controllers, which, when physically accessing memory, willfetch a whole “cache line” covering a range of memory addresses. Any subsequentmemory accesses that lie within this cache line can be served efficiently. Forstraight forward serial raytracing this implies that efficient cache usage is onlypossible for a narrow range of ray orientations aligned with the memory layout,and tracing along other directions will result in very non-contiguous memoryaccess orders. De Man and Basu addressed this issue with their Distance DrivenMethod [11] which serializes memory accesses by means of interleaving thetracing of adjacent rays such that subsequent iterations of the algorithm willfollow the memory layout rather than a particular ray. Instead of adapting thealgorithm to the memory layout, the latter may as well be made more isotropicby mapping close by spatial coordinates to close by linear memory addresses bymeans of e.g. Morton coding [38]. Although this will notably reduce the averagedistance between subsequently accessed memory addresses for arbitrary rays, itwill also prohibit perfectly serial memory accesses.For parallel GPU algorithms, the temporal memory access order of an in-dividual thread becomes less important. Instead, requested memory addressesby simultaneous threads should be consecutive or close by. In contrast to serialprogramming, optimal memory access is thus achieved when rays run perpen-dicular to the memory layout so that each thread will step through memory ina non-contiguous fashion when regarded isolated, yet parallel (both temporaland spatial) threads handling adjacent rays will, in each step, access consecutivememory addresses that can be efficiently served. The conebeam geometry (incontrast to 2D fanbeam geometry) is highly beneficial in this respect as it featuresa third dimension parallel to the rotational axis which is always perpendicular tothe possible driving axes occuring for circular source trajectories. The optimalmemory layout is therefore, somewhat counterintuitive, contiguous across tomo-graphic slices, i.e. along the rotational axis. Parallel GJP threads will coherentlystep through the voxel grid along a driving axis parallel to the axial plane andaccess contiguous “columns” in memory.
3. Results
Both performance with respect to system modeling as well as efficiency interms of run time and memory bandwidth of the presented algorithm are testedon an Nvidia GeForce GTX 970 graphics processor (GPU) for a 10 ° conebeamconfiguration and compared to an implementation of the popular pencil beammodel based on the 3D Digital Differential Analyzer variant by Amanatides andWoo [13]. The performance with respect to system modeling is demonstrated onconebeam projections of the three dimensional modified Shepp Logan phan-13 anuscript for peer review · [email protected] · September 4, 2016 tom based on the definition reproduced in [39]. 10 ° conebeam projections ontoa 512 pixel detector are generated by averaging 64 analytic line integrals perdetector pixel. In total, 803 ( ≈ π ° are computed. For comparison withnumeric projections, the phantom is further rendered on a 512 voxel grid using5 fold oversampling for edge anti-aliasing.Figure 7 shows error measures based on the differences of analytic and numericprojections. The DDA (Digital Differential Analyzer) or equivalently the Siddonor pencil beam projector model exhibit most artifacts, particularly in cases whererays run roughly parallel to grid axes. In these situations the pencil beam modelis equivalent to nearest neighbor sampling. When oversampling the DDA by afactor of two in each dimension, i.e. tracing four rays per detector pixel, theresulting projection quality becomes comparable to that of the non oversampledGJP spl using spline interpolation. The apparently best results are achieved bythe liner interpolating GJP lin despite the presumed high frequency artifacts ofthe linear interpolation kernel.As iterative reconstruction techniques as SART subsequently enforce consis-tency of the solution with each provided projection based on a given forwardmodel, model errors of the discrete forward projectors will directly translateto artifacts in reconstruction results. SART is known to sufficiently convergewithin less than 10 iterations and Figure 8 shows respective reconstructions fromthe analytic projections using different forward models on a 512 voxel grid. Inaccordance with the visual quality of the projections as shown in Fig. 7, thereconstruction quality is worst for the non-oversampling DDA, comparable for2-fold oversampled DDA and GJP spl and best for GJP lin . Run time performance is evaluated for projections of a cylindric volume withina cubic bounding box of 512 voxels onto a 512 pixel detector (cf. Fig. 9) . Thevolume is stored in 32bit floating point format in either main- or texture memoryof the graphics processing unit. As typical for computed tomography setups,projections are performed for a multitude of source and detector orientationsover the full angular range of 360 ° on a circular trajectory around the volumecenter. The rotational axis is aligned parallel to the fastest index of the memorylayout. For each individual configuration of source and detector, the run timeis optimized over a wide range of possible thread block or work group sizeparameters (CUDA and OpenCL terminology respectively). This eliminates theinfluence of technicalities introduced by the parallelization schemes of graphicsprocessors. Measured execution times further exhibit a variance of up to 10%when running the same code multiple times. We report the fastest measuredtimes for each algorithm.Table 1 lists the so evaluated run times as averages over 360 equidistantprojection angles. As a measure for GPU occupancy it further lists averagememory access rates based on the total runtime and the amount of accessedvoxels by each raytracing algorithm respectively. Although the latter is not14 anuscript for peer review · [email protected] · September 4, 2016
DDA DDA 2 × GJP spl
GJP lin . . . . . . R e l a ti v e ‘ e rr o r[ % ] GJP lin
GJP spl
DDA DDA 2 × Figure 7: Approximation errors of different projection algorithms for a 10 ° conebeam geometry.Numeric projections of the rasterized modified Shepp Logan phantom (on a 512 grid) onto a512 detector are compared to respective analytic projections. The top row shows exemplarydifference images for a frontal view. Below, the difference (cid:96) norm normalized to the (cid:96) normof the reference projection is plotted for all projection angles. DDA DDA 2 × GJP spl
GJP lin
Figure 8: Axial and sagittal central slices of iterative SART reconstructions (10 iterations) fromanalytic projections of the modified Shepp Logan phantom using different numeric projectionmethods within the iterative process. The grayscale window is [0.16,0.32]. Limitations of thediscrete forward models (cf. Fig. 7) manifest themselves in the final reconstruction result. anuscript for peer review · [email protected] · September 4, 2016 p x p x p x p x Figure 9: Sketch of the benchmark configuration. A cylindrical volume of both 512 voxeldiameter and height is projected onto a 512 pixel detector. The projection cone has an openingangle of 10 ° . Source and detector are rotated about the center axis to acquire projections fromdifferent orientations. Memory DDA DDA 2 × GJP lin
GJP hwlin
Texture 4.59 ms 15.2 ms 4.97 ms 3.28 ms118 GB/s 143 GB/s 334 GB/s 502 GB/sGPU RAM 4.39 ms 15.7 ms 5.69 ms —123 GB/s 139 GB/s 292 GB/s —
Table 1: Average execution times in milliseconds and memory access rates in gigabytes persecond for 2D projections of a cylindrical volume within a 512 bounding box onto a 512 detector in a 10 ° conebeam setup (cf. Fig. 9) measured on an Nvidia GTX 970 GPU. The voxeldata is stored in 32bit floating point format either in texture or main GPU memory. Timingsare measured for the Digital Differential Analyzer (DDA, equivalent to Siddon’s method),2-fold oversampled DDA (i.e. 2 × lin ) or implicit (GJP hwlin ) linear interpolation. strictly kown in the case of GJP hwlin due to unkown implementation detailswithin the GPU, it is reasonably assumed to be the same as for GJP lin .
4. Discussion
Finite beam width modeling, as required for realistic modeling of X-rayprojections, is already implied by interpolated sampling from a discrete voxelgrid. Computationally costly explicit beam shape modeling by oversampling, i.e.tracing of multiple rays per detector pixel, can thus be avoided by matching thevoxel grid spacing to the projected beam width within the voxel volume. Althoughthe exactness of the implicit beam width modeling in the non-oversampling case islimited as it inherits the anisotropic nature of the Cartesian voxel grid, the qualityof forward projections is visibly improved both with respect to the widespreadpencil beam model and its oversampled variant. Although the intrinsic beamwidth modeling further does not account for the divergent beam width withinthe field of view, this error remains within ±
10% for typical cone angles of 10 ° .Absolutely exact ray modeling is in fact mostly not required as Hofmann et al.report [40] in agreement with our own observations.16 anuscript for peer review · [email protected] · September 4, 2016
If desired though, the presented voxel volume traversal scheme may as well beused to implement a Siddon like pencil beam model by replacing the interpolationkernel with a function evaluating the line-box intersection length (cf. [3, 17]).It can further be interpreted as an approximate radial voxel basis functionand finite beam width model [26–29] with simplified nearest neighbors andfootprint evaluation. The required modifications in order to implement a trulyisotropic voxel basis function and beam shape model at the expense of highercomputational cost have been outlined. Such an extended model may alsoconsider the finite beam divergence, i.e. the varying beam width along the paththrough the voxel volume.
The projection quality as empirically assessed by means of comparing numericprojections of a modified Shepp Logan phantom with analytic ones indicates thatGJP lin performs best among the tested techniques despite its non-bandlimitedlinear interpolation kernel. This is also confirmed by iterative reconstructionresults. The theoretical advantage of reduced high frequency aliasing by theHanning-like kernel used in GJP spl is apparently outweighted by its inferior abilityto model gradients. Both GJP methods are however considerably better than theclassic Digital Differential Analyzer or Siddon method. These findings seem tostand in contrast to [31] who conclude from both projection and reconstructionroot mean square errors that slice-interpolating (Joseph-like) schemes performsimilar to line-box integration (Siddon-like) schemes. To further improve overlinear interpolation, true higher order interpolation methods or equivalentlyoverlapping voxel footprints are required, in accord with [26, 31]. Either optioncontradicts the restriction of the sampling scheme to a 4-neighborhood though.
The choice of sampling points and sampling distances respectively based onthe well known driving axis concept ensures that no relevant voxels will be skippednor sampled twice. The problem is concisely formulated as parametric vectorequation and a commented pseudo code implementation is given. In contrastto previous literature [2–4, 6, 15, 17, 22, 32], it does not require branchingdependent on the driving axis. While the similar 3D Wu approach by Schretterfor list mode backprojection [32] profited from fixed point arithmetic and cachingof a precomputed weights table within the CPU, the present GJP algorithmparticularly profits from the matching of voxel grid and ray spacing in combinationwith the alignment of memory layout and rotational axis. This results in coherentmemory accesses by parallel GPU threads tracing adjacent rays and furtherallows to notably exceed the nominal memory bandwidth (224 GB/s) of theGPU owing to a high cache hit rate. The GJP is thus able to consider morethan three times as many voxels as the DDA within the same time frame andoutperforms the oversampled DDA of comparable ray modeling capability by afactor of three to five.In contrast to the 3D generalization of Joseph’s projector discussed by[30, 31] for stacks of textured 2D planes, the present approach does not require17 anuscript for peer review · [email protected] · September 4, 2016 resampling when the driving axis changes and is also highly efficient usingread-and-write memory (RAM) of the GPU. This is particularly advantageousfor iterative reconstruction methods that inherently require constant read-and-write operations and do otherwise need to maintain copies of the reconstructionvolume.Other driving axis oriented (also “slice-based”) approaches [9, 21, 22, 29, 34] aswell as general ray-driven splatting methods [10] applicable to arbitrary footprintmodels [19, 26–28] focus on more elaborate ray modeling at the expense of anadditional inner loop dynamically identifying all intersected (by some definition)voxels within each sampling plane. This is also true for the Distance Drivenmethod [11] which further interleaves looping over voxels and detector pixelsand thus complicates parallelization.Most similar to Joseph-like approaches in terms of sampling is the shear-warptechnqiue presented in the early 1990s [7, 8], which however requires the storageof a temporary copy of the sheared volume as well as an additional resamplingstep (“warp”).
5. Conclusion and outlook
A both simple and efficient non-oversampling ray tracing forward projectorfor three dimensions based on a parametric vector formulation of the driving axisconcept has been presented. Matching of detector and volume sampling ensuresimplicit modeling of finite beam widths while optimizing the amount of requiredmemory accesses in the ray tracing process. The method is a generalization ofJoseph’s 2D projector [2], and different interpolation kernels were discussed. Aninterpretation of the interpolation schemes with respect to the effective modeledvoxel footprint and beam profile has been given and required modifications forthe implementation of other voxel and beam shape models have been outlined.The ray modeling capabilities were demonstrated on a Shepp Logan phantomexample.The presented ray tracing and sampling algorithm is, due to its branchlessdesign and predictable memory access pattern, particularly well suited for theSIMD architectures of modern general purpose GPUs. The scheme easily par-allelizes both in a single as well as in a multiple threads per ray configuration.Compared to the classic Digital Differential Analyzer (DDA) algorithm, a threeto five times higher memory access rate is achieved on an Nvidia GTX 970 GPU,allowing to outperform the DDA also in terms of total run time despite thehigher total amount of memory accesses.In the context of computed tomography, improved ray tracing performancedirectly translates to shorter image reconstruction times for most iterative tomo-graphic reconstruction techniques as the forward modeling of X-ray projectionsoften constitutes the largest time factor of such algorithms. Its high efficiencyalso when operating on read-and-write memory eliminates the need to constantlyhold and update copies of the reconstruction volume.18 anuscript for peer review · [email protected] · September 4, 2016
6. Acknowledgements
The author is supported by Fraunhofer IIS and further thanks the anonymousreviewers for their valuable feedback.
ReferencesReferences [1] A. H. Andersen, A. C. Kak:
Simultaneous Algebraic Reconstruction Tech-nique (SART): A Superior Implementation of the ART Algorithm . UltrasonicImaging 6, 81–94 (1984)[2] P. M. Joseph:
An Improved Algorithms for Reprojecting Rays Through PixelImages . IEEE Transactions on Medical Imaging MI-1 (3) 192–196 (1982)[3] S. B. Lo:
Strip and Line Path Integrals with a Square Pixel Matrix: AUnified Theory for Computational CT Projections . IEEE Transactions onMedical Imaging 7 (4) 355–363 (1988)[4] A. Fujimoto, T. Tanaka and K. Iwata:
ARTS: Accelerated Ray-TracingSystem . Computer Graphics and Applications, IEEE 6 (4) 16–26 (1986)[5] R. Endl, M. Sommer:
Classification of Ray-Generators in Uniform Subdivi-sions and Octrees for Ray Tracing . Computer Graphics Forum 13 (1) 3–19(1994)[6] Y. K. Liu, B. ˇZalik:
A General Multi-step Algorithm for Voxel TraversingAlong a Line.
Computer Graphics Forum 27 (1) 73–80 (2008)[7] G. G. Cameron, P. E. Undrill:
Rendering volumetric medical image dataon a SIMD-architecture computer . Proceedings of the Third EurograhpicsWorkshop on Rendering, 135–145, Bristol, UK, May 1992[8] P. Lacroute, M. Levoy:
Fast Volume Rendering Using a Shear-Warp Factor-ization of the Viewing Transformation . SIGGRAPH 1994, Orlando, Florida,USA[9] S. Matej, R. M. Lewitt:
Practical Considerations for 3-D Image Reconstruc-tion Using Spherically Symmetric Volume Elements . IEEE Transactions onMedical Imaging 15 (1) 68–78 (1996)[10] K. Mueller, R. Yagel:
Fast Perspective Volume Rendering with Splattingby Utilizing a Ray-Driven Approach . Proceedings of the 7th Conference onVisualization ’96, 65–72 (1996)[11] B. De Man, S. Basu:
Distance-driven projection and backprojection in threedimensions . Physics in Medicine and Biology 49 (11) 2463–2475 (2004)19 anuscript for peer review · [email protected] · September 4, 2016 [12] R. Siddon:
Fast calculation of the exact radiological path for a three-dimensional CT array . Medical Physics 12 (2) 252–255 (1985)[13] J. Amanatides, A. Woo:
A Fast Voxel Traversal Algorithm for Ray Tracing .Eurographics 87 (3) 10 (1987)[14] F. Jacobs, E. Sundermann, B. De Sutter, M. Christiaens, I. Lemahieu:
Afast algorithm to calculate the exact radiological path through a pixel or voxelspace . Journal of computing and information technology 6 (1) 89–94 (1998)[15] H. Zhao, A. J. Reader:
Fast Ray-Tracing Technique to Calculate LineIntegral Paths in Voxel Arrays . Nuclear Science Symposium ConferenceRecord, 2003 IEEE 4 2808–2812 (2003).[16] M. de Greef, J. Crezee, J. C. van Eijk, R. Pool, A. Bel:
Accelerated raytracing for radiotherapy dose calculations on a GPU . Medical Physics 36(9) 4095–4102 (2009)[17] H. Gao:
Fast parallel algorithms for the x-ray transform and its adjoint .Medical Physics 39 (11) 7110–7120 (2012)[18] W. Yao, K. : A nalytically derived weighting factors for transmission to-mography cone beam projections . Physics in Medicine and Biology 54 (3)513–533 (2009)[19] Y. Long, J. A. Fessler, J. M. Balter:
3D forward and back-projection forx-ray CT using seperable footprints . IEEE Transactions on Medical Imaging29 (11) 1839–1850 (2010)[20] M. Wu, J. Fessler:
GPU acceleration of 3D forward and backward projectionusing separable footprints for X-ray CT image reconstruction . Proceedingsof the International Meeting on Fully 3D Image Reconstruction in Radiologyand Nuclear Medicine 6, 021911 (2011)[21] V. Ngyen, S. Lee:
Graphics processing unit-accelerated iterative tomographicreconstruction with strip-integral system model.
Optical Engineering 51 (9)093203-1–11 (2012)[22] S. Zhang, D. Zhang, H. Gong, O. Ghasemalizadeh, G. Wang, G. Cao:
Fastand accurate computation of system matrix for area integral model-basedalgebraic reconstruction technique . Optical Engineering 53 (11) 113101-1–9(2014)[23] S. Ha, A. Kumar, K. Mueller:
A Study of Volume Integration Modelsfor Iterative Cone-Beam Computed Tomography . Proceedings of the 13thMeeting on Fully 3D Image Reconstruction (2015)[24] S. Ha, H. Li, K. Mueller:
Efficient area-based ray integration using summedarea tables and regression models . Proceedings of the 4th InternationalMeeting on image formation in X-ray CT, 507–510 (2016)20 anuscript for peer review · [email protected] · September 4, 2016 [25] R. Sampson, M. G. McGaffin, T. F. Wenisch, J. A. Fessler:
Investigatingmulti-threaded SIMD for helical CT reconstruction on a CPU . Proceedingsof the 4th International Meeting on image formation in X-ray CT, 275–278(2016)[26] K. M. Hanson, G. W. Wecksung:
Local basis-function approach to computedtomography . Applied Optics 24 (23) 4028–4039 (1985)[27] R. M. Lewitt:
Alternatives to voxels for image representation in iterativeroconstruction algorithms . Physics in Medicine and Biology 37 (3) 705–716(1992)[28] A. Ziegler, T. K¨ohler, T. Nielsen, R. Proksa:
Efficient projection andbackprojection scheme for spherically symmetric basis functions in divergentbeam geometry . Medical Physics 33 (12) 4653–4663 (2006)[29] J. Sunneg˚ardh, P. Danielsson:
A New Anti-Aliased Projection Operatorfor Iterative CT Reconstruction . Proceedings of the Ninth InternationalMeeting on Fully Three-dimensional Image Reconstruction in Radiologyand Nuclear Medicine, Lindau, Germany, July 9–13, 2007[30] F. Xu, K. Mueller:
Accelerating Popular Tomographic Reconstruction Al-gorithms on Commodity PC Graphics Hardware . IEEE Transactions onNuclear Science 52 (3) 654–663 (2005)[31] F. Xu, K. Mueller:
A Comparative Study of Popular Interpolation and Inte-gration Methods for use in Computed Tomography . 3rd IEEE InternationalSymposium on Biomedical Imaging: Nano to Macro, 1252–1255 (2006)[32] C. Schretter:
A Fast Tube-of-Response Raytracer . Medical Physics 33 (12)4744–4748 (2006)[33] S. Matej, J. A. Fessler, I. G. Kazantsev:
Iterative Tomographic ImageReconstruction Using Fourier-Based Forward and Back-Projectors . IEEETransactions on Medical Imaging 23 (4) 401–412 (2004)[34] J. Brokish, D. B. Keesing, Y. Bresler:
Iterative circular conebeam CTreconstruction using fast hierarchical backprojection/reprojection operators .SPIE Medical Imaging. International Society for Optics and Photonics,2010.[35] K. Mueller, F. Xu, N. Neophytou:
Why do Commodity Graphics HardwareBoards (GPUs) work so well for acceleration of Computed Tomography? .SPIE Electronic Imaging 2007[36] G. Pratx, L. Xing:
GPU computing in medical physics: A review . MedicalPhysics 38 (5) 2685–2697 (2011)[37] H. B. Kekre, S. C. Sahasrabudhe, N. C. Goyal:
Raised Cosine Functionfor Image Data Interpolation . Computers & Electrical Engineering 9 (3-4)131–152 (1982) 21 anuscript for peer review · [email protected] · September 4, 2016 [38] G. M. Morton:
A computer Oriented Geodetic Data Base; anda New Technique in File Sequencing . IBM Technical Report,https://domino.research.ibm.com/library/cyberdig.nsf/0/0dabf9473b9c86d48525779800566a39(1966)[39] M. Schabel:
3D Shepp-Logan phantom . MATLAB script,http://de.mathworks.com/matlabcentral/fileexchange/9416-3d-shepp-logan-phantom/content/phantom3d.m (2005–2006)[40] C. Hofmann, M. Knaup, M. Kachelrieß: