Deep Learning compatible Differentiable X-ray Projections for Inverse Rendering

Karthik Shetty, Annette Birkhold, Norbert Strobel, Bernhard Egger, Srikrishna Jaganathan, Markus Kowarschik, Andreas Maier

Pattern Recognition Lab, FAU Erlangen-Nürnberg; Siemens Healthcare GmbH, Forchheim; Fakultät Elektrotechnik, HS für angewandte Wissenschaften Würzburg-Schweinfurt; MIT - BCS, CSAIL & CBMM, USA

[email protected]
Abstract.
Many minimally invasive interventional procedures still rely on 2D fluoroscopic imaging. Generating a patient-specific 3D model from these X-ray projection data would allow to improve the procedural workflow, e.g. by providing assistance functions such as automatic positioning. To accomplish this, two things are required: first, a statistical shape model of the human anatomy, and second, a differentiable X-ray renderer. In this work, we propose a differentiable renderer by deriving the distance travelled by a ray inside mesh structures to generate a distance map. To demonstrate its functioning, we use it for simulating X-ray images from human shape models. Then we show its application by solving the inverse problem, namely reconstructing 3D models from real 2D fluoroscopy images of the pelvis, which is an ideal anatomical structure for patient registration. This is accomplished by an iterative optimization strategy using gradient descent. With the majority of the pelvis being in the fluoroscopic field of view, we achieve a mean Hausdorff distance of 30 mm between the reconstructed model and the ground truth segmentation.
Over the last decades, the number of X-ray guided interventional procedures has increased steadily, raising the awareness for optimized procedural workflows and radiation-dose induced adverse effects. Assistance systems based on a 3D digital twin of the patient have the potential to speed up procedures and minimize the required radiation. However, precisely representing the individual human anatomy based on the commonly acquired 2D projection images is an ill-posed problem. Ehlke et al. showed that an accurate and fast 3D reconstruction of the human pelvis from 2D projection images is possible by learning its shape space in the form of tetrahedral mesh structures from a large set of pelvis scans including bone density information [1]. In theory, this approach has the potential to generate digital twins of patients from live fluoroscopic imaging data and may allow the development of advanced assistance applications in the interventional environment.
The process of generating a 3D model from 2D projection images is known as inverse rendering and has received considerable attention in computer vision [2,3]. However, achieving this usually means solving an ill-posed problem, and rendering pipelines are not necessarily differentiable. This is usually overcome by softening the non-differentiabilities via soft rasterization [2]. Usually, a 3D scene is represented by a polygon mesh, which can be expressed by a set of vertices and faces. Such a representation is usually chosen for statistical shape and pose modeling. The Skinned Multi-Person Linear Model (SMPL) is an example of such a parameterized model, which is deformable and accurately represents a large variation of human shapes [4]. Employing such a model enables more precise and faster 3D reconstruction from a 2D projection image space due to the incorporation of prior knowledge [3]. To carry out the 3D reconstruction from X-ray projections, we need to calculate artificial X-ray projections from a 3D model, e.g., a Statistical Shape Model (SSM). Generation of transmission images from mesh structures has been previously proposed [1,5]. For a viable reconstruction, we seek to optimize the shape β and pose θ parameters by minimizing a similarity function between the simulated image and the acquired image. To this end, we propose a framework that combines existing X-ray rasterization approaches with a differentiable rasterizer, creating a setup that can be iteratively optimized by gradient descent techniques. The proposed renderer enables future end-to-end frameworks.

A watertight manifold surface can be represented as a triangular mesh by a set of faces f ∈ N^(N_f × 3) and vertices v ∈ R^(N_v × 3), where the object has N_f faces and N_v vertices. Given a projection matrix, it is possible to transform the mesh from world coordinates to detector coordinates while maintaining the actual depth of the vertex.
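As a concrete illustration, the world-to-detector transform that keeps the vertex depth can be sketched as follows. This is a minimal NumPy sketch under a pinhole model with a 3x4 projection matrix P; `project_vertices` is a hypothetical helper, not the authors' code:

```python
import numpy as np

def project_vertices(vertices, P):
    """Project mesh vertices (N, 3) in world coordinates to detector pixel
    coordinates with a 3x4 projection matrix P, keeping each vertex's depth.
    Hypothetical helper for illustration only."""
    homo = np.hstack([vertices, np.ones((len(vertices), 1))])  # (N, 4) homogeneous
    proj = homo @ P.T                                          # (N, 3)
    depth = proj[:, 2]                  # homogeneous scale, i.e. depth along the view axis
    uv = proj[:, :2] / depth[:, None]   # perspective divide -> detector coordinates
    return uv, depth
```

Keeping `depth` alongside the 2D coordinates is what later allows the z-buffer entries to carry metric distances rather than only a visibility ordering.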
In traditional rasterization of 3D scenes, each pixel P_xy on the screen is affected by only the face f_j nearest to the ray from the camera. This is usually achieved by storing the depth information for all the faces influenced by a given pixel in the form of a z-buffer. However, in the case of transmission imaging we need to know the distance traveled by a ray through an object, in the form of an l-buffer [5]. For each face f_j, the normal N_fj in the projective space is computed. The sign of the dot product

    D_fj^Rxy = sign(N_fj · N_view)

between the face normal N_fj and the detector plane normal N_view = (0, 0, 1)^T determines whether the ray R_xy is entering or exiting the object at the given face. Hence, the total path length L_xy traveled by a ray for a given object p from the source to the detector pixel position P_xy, visualized in Fig. 1, can be formulated as

    L_xy = sum_{j=1..K} D_fj^Rxy Z_fj^xy    (1)

Here Z_fj^xy represents the distance between the face f_j and the detector pixel position P_xy, while j and K represent the current valid intersection and the fixed number of mesh overlaps stored in the z-buffer, respectively. As a result, there is no substantial requirement for the elements of the z-buffer to be stored in a sorted manner; the only constraint is the number of faces K influencing a given pixel. To determine the gradients for the backward pass we follow the same procedure as implemented by Ravi et al. [2], with an exception for the gradients for the z-buffer:

    ∂C/∂z_j^xy = (∂C/∂L_xy) D_fj^Rxy    (2)

Here, z_j^xy represents the z-buffer distance from the pixel P_xy corresponding to face f_j, with C as some cost function, e.g., the normalized gradient correlation (NGC) explained below. It is generally assumed that the number of incoming rays for a given object equals the number of outgoing rays.
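The forward pass of Eq. (1) and the custom z-buffer gradient of Eq. (2) can be sketched as follows. This is a NumPy sketch, not the authors' implementation; the array shapes and helper names are illustrative assumptions:

```python
import numpy as np

def l_buffer_forward(z_buffer, face_normals, n_view=np.array([0.0, 0.0, 1.0])):
    """Eq. (1): per-pixel path length L_xy = sum_j D_j * Z_j.

    z_buffer:     (H, W, K) distances Z from the K buffered faces to each pixel
    face_normals: (H, W, K, 3) normals of those faces in projective space
    """
    signs = np.sign(face_normals @ n_view)      # D_j = sign(N_fj . N_view), (H, W, K)
    L = np.sum(signs * z_buffer, axis=-1)       # signed sum over the K buffer entries
    return L, signs

def l_buffer_backward(dC_dL, signs):
    """Eq. (2): dC/dz_j = (dC/dL) * D_j, broadcast over the K buffered faces."""
    return dC_dL[..., None] * signs
```

Because Eq. (1) is a plain signed sum over the K entries, an unsorted z-buffer suffices, matching the observation in the text.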
However, artifacts occur when the face normal is perpendicular to the detector plane or in the presence of a non-manifold surface. Such a situation can be detected when the summation over the pixel sign function, sum_{j=1..K} D_fj^Rxy, is non-zero. To overcome this, we set the image pixel value to that of a neighbouring pixel with zero sum and set the gradient for that particular pixel to zero.

Fig. 1.
Visualization for the determination of the ray distance for a simplified structure. Here, the total distance covered by the ray is L_xy = d_1 − d_2 + d_3 − d_4.

The pipeline for generating the X-ray projection image for a human patient model is depicted in Fig. 2. Distance projection maps L_p are generated individually for air (L_air), body (L_body), bones (L_bones) and organs (L_organs). L_air represents the Euclidean distance from a detector pixel P_xy in world coordinates to the X-ray source. L_body, L_bones and L_organs are generated as described in Sec. 2.1 from the individual meshes, respectively. For ease of explanation, we describe L_organs as a single organ distance map; however, it is a subset of lungs, heart, liver, kidney and spleen. As L_air is inclusive of L_body, we define the actual air distance map as L_air − L_body. For the body we proceed similarly, as shown in Fig. 2.

Fig. 2.
Left: Projection performed around the thorax section from a mesh representation of a patient model. Right: Pipeline to generate a transmission image from the given mesh structures and projection matrix.

The X-ray projection is generated using a primary signal I_xy for a pixel P_xy, described using the Beer-Lambert law as shown in Eq. 3, with photon energy E, X-ray spectrum I_0(E), and linear attenuation coefficients μ(p, E) for objects p:

    I_xy = sum_E I_0(E) exp(− sum_p μ(p, E) L_p^xy)    (3)

Making use of the aforementioned distance maps L_p, the simulation can be extended to polychromatic X-ray beam spectra.

We build a statistical parametric model M(β, θ; Φ), similar to SMPL [4], extended for internal anatomy using principal component analysis to characterize the human anatomy shape space from a large set of segmented whole-body CT scans. This makes it possible to generate realistic X-ray projections with the entire pipeline being differentiable in nature. Here, Φ represents the learned parameters of the model. To reconstruct a 3D model from a projection image, we fix the camera projection matrix of the simulation system as defined by the target, along with an initial mean shape β_m and rest pose θ_r. A gradient descent based optimization with backpropagation is applied to the simulation model. NGC is used as a similarity measure [6], which is defined by the normalized cross-correlation of the image gradient between two images. The model is optimized until convergence to obtain the optimal pose θ and shape β parameters.

For evaluation, we used the dataset provided by Grupp et al., made up of the hip anatomy of 6 patients [7]. For each patient, a 3D segmentation of the pelvis along with 14 landmarks and a total of 366 fluoroscopy images in varying orientations with their respective projection matrices are available. The SSM is registered to each patient from the dataset.
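Returning to the projection model, the polychromatic primary signal of Eq. (3) can be sketched from the per-object distance maps as follows. This is a NumPy sketch; `primary_intensity`, the object keys, the energy binning and the coefficient values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def primary_intensity(L_maps, mu, I0):
    """Eq. (3): Beer-Lambert primary signal summed over a discretized spectrum.

    L_maps: dict of per-object path-length maps (H, W), e.g. {"body": ..., "bones": ...}
            (the air map would enter here as L_air - L_body, as described in the text)
    mu:     dict mapping each object to its attenuation coefficients, one per energy bin
    I0:     source intensity per energy bin
    """
    H, W = next(iter(L_maps.values())).shape
    I = np.zeros((H, W))
    for e, i0 in enumerate(I0):                            # sum over energy bins E
        line_integral = sum(mu[p][e] * L_maps[p] for p in L_maps)
        I += i0 * np.exp(-line_integral)                   # attenuated primary signal
    return I
```

Every operation here is differentiable with respect to the distance maps, which is what lets the renderer sit inside a gradient descent loop.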
This resulted in a base score with an average Hausdorff distance of 17.79 mm and a mean landmark error of 11.81 mm. The rotation and translation parameters obtained are assumed as the ground truth irrespective of the shape parameters. The 2D-3D reconstruction is tested by applying a random translation in the range of [-10, 10] mm and a rotation in the range of [-5, 5]° on all three axes for the SSM. The randomization is small due to the nature of the loss function being sensitive to changes only around a small region; such precision of initialization could either be provided manually or achieved with a network learning those initialization parameters. Fig. 3 shows sample outputs of our proposed method, along with the quantitative results in Table 1. For evaluation, we consider only the images where the visible region of the pelvis is greater than 50%.

Table 1. Overview of the reconstruction and landmark error. All distances are measured in mm. The lower bound error is dependent on the shape model, with a mean Hausdorff distance of 17.79 mm and a mean landmark error of 11.81 mm.

Initial Hausdorff Distance | Initial Landmark Error | Reconstructed Hausdorff Distance | Reconstructed Landmark Error
46.– ± –.64                | 30.– ± –.30            | 29.– ± –.56                      | 22.– ± –.–

Fig. 3.
Samples for 3D reconstruction from 2D projection images. (a) Target image, (b) Projection image with a random translation and rotation, (c) Projection image after registration, (d) Initial 3D overlay of the template mesh, (e) 3D overlay of the meshes after registration, (f) NGC map after registration.
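The NGC similarity used as the cost C throughout the optimization can be sketched as follows. This is a minimal sketch of normalized cross-correlation of image gradients; the exact formulation in [6] may differ in details such as the gradient operator and patch handling:

```python
import numpy as np

def ngc(a, b, eps=1e-8):
    """Normalized gradient correlation between two images a, b of shape (H, W):
    the normalized cross-correlation of their x- and y-gradients, averaged."""
    score = 0.0
    for axis in (0, 1):                      # vertical and horizontal gradients
        ga = np.gradient(a, axis=axis)
        gb = np.gradient(b, axis=axis)
        ga = ga - ga.mean()                  # zero-mean before correlating
        gb = gb - gb.mean()
        score += np.sum(ga * gb) / (np.linalg.norm(ga) * np.linalg.norm(gb) + eps)
    return score / 2.0                       # in [-1, 1]; 1 for identical gradients
```

Because NGC compares gradients rather than intensities, it is insensitive to global brightness offsets between the simulated and acquired images, but, as noted in the text, it only responds to structure in a small region around the current pose.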
We presented an approach for differentiable rendering and reconstructing a patient-specific 3D mesh model from a single 2D X-ray projection image. We demonstrated that, using a deformable human model parameterized by shape and pose, we were able to approximately register patients to an X-ray projection. We found that, given a visibility above 50% and an initialization of a generic model shape positioned close to the actual image, the 3D reconstruction error is significantly reduced with our optimization approach. The remaining error may be due to the simultaneous unconstrained optimization of shape and pose. This
is because changes in shape and translation across depth, i.e., in the z-direction, result in a large range of solutions causing the simulated image to appear similar to the target image. For now, we ignore the case when less than 50% of the pelvis is visible, due to the intrinsic nature of the loss function being sensitive to the initial position. In general, the error could be reduced with an additional prior to determine a good initial pose [8].

As the registration is independent of the patient pose, we can overcome possible pose mismatches resulting from employing a preoperative CT scan for registration. Still, the pipeline has limitations. First, the X-ray simulation is an approximation of a real X-ray image, as it does not take into account any scattering mechanisms and other effects causing noise. This can possibly be improved by employing a deep learning model for predicting patient-specific scatter [9]. Second, the renderer assumes a constant density for any given material, resulting in a pseudo-realistic rendering. One way to overcome this problem would be to use a tetrahedral mesh with learned bone density distributions. Third, due to ambiguous depth information in fluoroscopy images, accurate registration becomes a challenging task. This error could be minimized with multi-view images or by using the table position as a constraint. To summarize, we proposed a differentiable renderer by deriving the distance travelled by a ray inside mesh structures. This was further extended to generate X-ray images from human models. Finally, we showed that it could be successfully used to reconstruct 3D models from 2D fluoroscopy images.
Disclaimer:
The concepts and information presented in this paper are based on research and are not commercially available.
References
1. Ehlke M, Ramm H, Lamecker H, et al. Fast generation of virtual X-ray images for reconstruction of 3D anatomy. IEEE Trans Vis Comput Graph. 2013; p. 2673–2682.
2. Ravi N, Reizenstein J, Novotny D, et al. Accelerating 3D deep learning with PyTorch3D. arXiv preprint arXiv:2007.08501. 2020.
3. Liu S, Li T, Chen W, et al. Soft rasterizer: A differentiable renderer for image-based 3D reasoning. In: Proc IEEE Int Conf Comput Vis; 2019. p. 7708–7717.
4. Loper M, Mahmood N, Romero J, et al. SMPL: A Skinned Multi-Person Linear Model. ACM Trans Graph (Proc SIGGRAPH Asia). 2015 Oct;34(6):248:1–248:16.
5. Vidal FP, Garnier M, Freud N, et al. Simulation of X-ray attenuation on the GPU. In: Tang W, Collomosse J, editors. Theory and Practice of Computer Graphics. The Eurographics Association; 2009.
6. Penney GP, Weese J, Little JA, et al. A comparison of similarity measures for use in 2-D-3-D medical image registration. IEEE Trans Med Imaging. 1998;17(4):586–595.
7. Grupp RB, Unberath M, Gao C, et al. Automatic annotation of hip anatomy in fluoroscopy for robust and efficient 2D/3D registration. Int J Comput Assist Radiol Surg. 2020 May;15(5):759–769.
8. Bier B, Unberath M, Zaech JN, et al. X-ray-transform invariant anatomical landmark detection for pelvic trauma surgery. In: Med Image Comput Comput Assist Interv. Springer; 2018. p. 55–63.
9. Roser P, Zhong X, Birkhold A, et al. Physics-driven learning of X-ray skin dose distribution in interventional procedures. Med Phys. 2019;46:4654–4665.