Deep Learning compatible Differentiable X-ray Projections for Inverse Rendering

Karthik Shetty, Annette Birkhold, Norbert Strobel, Bernhard Egger, Srikrishna Jaganathan, Markus Kowarschik, Andreas Maier

Pattern Recognition Lab, FAU Erlangen-Nürnberg; Siemens Healthcare GmbH, Forchheim; Fakultät Elektrotechnik, HS für angewandte Wissenschaften Würzburg-Schweinfurt; MIT - BCS, CSAIL & CBMM, USA

[email protected]
Abstract.
Many minimally invasive interventional procedures still rely on 2D fluoroscopic imaging. Generating a patient-specific 3D model from these X-ray projection data would allow to improve the procedural workflow, e.g. by providing assistance functions such as automatic positioning. To accomplish this, two things are required: first, a statistical shape model of the human anatomy, and second, a differentiable X-ray renderer. In this work, we propose a differentiable renderer by deriving the distance travelled by a ray inside mesh structures to generate a distance map. To demonstrate its functioning, we use it for simulating X-ray images from human shape models. Then we show its application by solving the inverse problem, namely reconstructing 3D models from real 2D fluoroscopy images of the pelvis, which is an ideal anatomical structure for patient registration. This is accomplished by an iterative optimization strategy using gradient descent. With the majority of the pelvis being in the fluoroscopic field of view, we achieve a mean Hausdorff distance of 30 mm between the reconstructed model and the ground truth segmentation.
Over the last decades, the number of X-ray guided interventional procedures has increased steadily, raising the awareness for optimized procedural workflows and radiation-dose induced adverse effects. Assistance systems based on a 3D digital twin of the patient have the potential to speed up procedures and minimize the required radiation. However, precisely representing the individual human anatomy based on the commonly acquired 2D projection images is an ill-posed problem. Ehlke et al. showed that an accurate and fast 3D reconstruction of the human pelvis from 2D projection images is possible by learning its shape space in the form of tetrahedral mesh structures from a large set of pelvis scans including bone density information [1]. In theory, this approach has the potential to generate digital twins of patients from live fluoroscopic imaging data and may allow the development of advanced assistance applications in the interventional environment.
The process of generating a 3D model from 2D projection images is known as inverse rendering and has received considerable attention in computer vision [2,3]. However, achieving this usually means solving an ill-posed problem, and rendering pipelines are not necessarily differentiable. This is usually overcome by softening the non-differentiabilities via soft rasterization [2]. Usually, a 3D scene is represented by a polygon mesh, which can be expressed by a set of vertices and faces. Such a representation is usually chosen for statistical shape and pose modeling. The Skinned Multi-Person Linear Model (SMPL) is an example of such a parameterized model, which is deformable and accurately represents a large variation of human shapes [4]. Employing such a model enables more precise and faster 3D reconstruction from a 2D projection image space due to the incorporation of prior knowledge [3]. To carry out the 3D reconstruction from X-ray projections, we need to calculate artificial X-ray projections from a 3D model, e.g., a Statistical Shape Model (SSM). Generation of transmission images from mesh structures has been previously proposed [1,5]. For a viable reconstruction, we seek to optimize the shape β and pose θ parameters by minimizing a similarity function between the simulated image and the acquired image. To this end, we propose a framework that combines existing X-ray rasterization approaches with a differentiable rasterizer, creating a setup that can be iteratively optimized by gradient descent techniques. The proposed renderer enables future end-to-end frameworks.

A watertight manifold surface can be represented as a triangular mesh by a set of faces f ∈ N^(N_f × 3) and vertices v ∈ R^(N_v × 3), where the object has N_f faces and N_v vertices. Given a projection matrix, it is possible to transform the mesh from world coordinates to detector coordinates while maintaining the actual depth of the vertex.
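As a concrete illustration, the world-to-detector transform that keeps the vertex depth can be sketched as follows. This is a minimal NumPy sketch under a pinhole model with a 3x4 projection matrix P; `project_vertices` is a hypothetical helper, not the authors' code:

```python
import numpy as np

def project_vertices(vertices, P):
    """Project mesh vertices (N, 3) in world coordinates to detector pixel
    coordinates with a 3x4 projection matrix P, keeping each vertex's depth.
    Hypothetical helper for illustration only."""
    homo = np.hstack([vertices, np.ones((len(vertices), 1))])  # (N, 4) homogeneous
    proj = homo @ P.T                                          # (N, 3)
    depth = proj[:, 2]                  # homogeneous scale, i.e. depth along the view axis
    uv = proj[:, :2] / depth[:, None]   # perspective divide -> detector coordinates
    return uv, depth
```

Keeping `depth` alongside the 2D coordinates is what later allows the z-buffer entries to carry metric distances rather than only a visibility ordering.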
In traditional rasterization of 3D scenes, each pixel P_xy on the screen is affected by only the face f_j nearest to the ray from the camera. This is usually achieved by storing the depth information for all the faces influenced by a given pixel in the form of a z-buffer. However, in the case of transmission imaging we need to know the distance traveled by a ray through an object, in the form of an l-buffer [5]. For each face f_j, the normal N_fj in the projective space is computed. The sign of the dot product

    D_fj^Rxy = sign(N_fj · N_view)

between the face normal N_fj and the detector plane normal N_view = (0, 0, 1)^T determines whether the ray R_xy is entering or exiting the object at the given face. Hence, the total path length L_xy traveled by a ray for a given object p from the source to the detector pixel position P_xy, visualized in Fig. 1, can be formulated as

    L_xy = sum_{j=1..K} D_fj^Rxy Z_fj^xy    (1)

Here Z_fj^xy represents the distance between the face f_j and the detector pixel position P_xy, while j and K represent the current valid intersection and the fixed number of mesh overlaps stored in the z-buffer, respectively. As a result, there is no substantial requirement for the elements of the z-buffer to be stored in a sorted manner; the only constraint is the number of faces K influencing a given pixel. To determine the gradients for the backward pass we follow the same procedure as implemented by Ravi et al. [2], with an exception for the gradients for the z-buffer:

    ∂C/∂z_j^xy = (∂C/∂L_xy) D_fj^Rxy    (2)

Here, z_j^xy represents the z-buffer distance from the pixel P_xy corresponding to face f_j, with C as some cost function, e.g., the normalized gradient correlation (NGC) explained below. It is generally assumed that the number of incoming rays for a given object equals the number of outgoing rays.
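The forward pass of Eq. (1) and the custom z-buffer gradient of Eq. (2) can be sketched as follows. This is a NumPy sketch, not the authors' implementation; the array shapes and helper names are illustrative assumptions:

```python
import numpy as np

def l_buffer_forward(z_buffer, face_normals, n_view=np.array([0.0, 0.0, 1.0])):
    """Eq. (1): per-pixel path length L_xy = sum_j D_j * Z_j.

    z_buffer:     (H, W, K) distances Z from the K buffered faces to each pixel
    face_normals: (H, W, K, 3) normals of those faces in projective space
    """
    signs = np.sign(face_normals @ n_view)      # D_j = sign(N_fj . N_view), (H, W, K)
    L = np.sum(signs * z_buffer, axis=-1)       # signed sum over the K buffer entries
    return L, signs

def l_buffer_backward(dC_dL, signs):
    """Eq. (2): dC/dz_j = (dC/dL) * D_j, broadcast over the K buffered faces."""
    return dC_dL[..., None] * signs
```

Because Eq. (1) is a plain signed sum over the K entries, an unsorted z-buffer suffices, matching the observation in the text.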
However, artifacts occur when the face normal is perpendicular to the detector plane or in the presence of a non-manifold surface. Such a situation can be detected when the summation over the pixel sign function, sum_{j=1..K} D_fj^Rxy, is non-zero. To overcome this, we set the image pixel value to that of a neighbouring pixel with zero sum and set the gradient for that particular pixel to zero.

Fig. 1.
Visualization for the determination of the ray distance for a simplified structure. Here, the total distance covered by the ray is L_xy = d_1 − d_2 + d_3 − d_4.

The pipeline for generating the X-ray projection image for a human patient model is depicted in Fig. 2. Distance projection maps L_p are generated individually for air (L_air), body (L_body), bones (L_bones) and organs (L_organs). L_air represents the Euclidean distance from a detector pixel P_xy in world coordinates to the X-ray source. L_body, L_bones and L_organs are generated as described in Sec. 2.1 from the individual meshes, respectively. For ease of explanation, we describe L_organs as a single organ distance map; however, it is a subset of lungs, heart, liver, kidney and spleen. As L_air is inclusive of L_body, we define the actual air distance map as L_air − L_body. For the body we proceed similarly, as shown in Fig. 2.

Fig. 2.
Left: Projection performed around the thorax section from a mesh representation of a patient model. Right: Pipeline to generate a transmission image from the given mesh structures and projection matrix.

The X-ray projection is generated using a primary signal I_xy for a pixel P_xy, described using the Beer-Lambert law as shown in Eq. 3, with photon energy E, X-ray spectrum I_0(E), and linear attenuation coefficients μ(p, E) for objects p:

    I_xy = sum_E I_0(E) exp(− sum_p μ(p, E) L_p^xy)    (3)

Making use of the aforementioned distance maps L_p, the simulation can be extended to polychromatic X-ray beam spectra.

We build a statistical parametric model M(β, θ; Φ), similar to SMPL [4], extended for internal anatomy using principal component analysis to characterize the human anatomy shape space from a large set of segmented whole-body CT scans. This makes it possible to generate realistic X-ray projections with the entire pipeline being differentiable in nature. Here, Φ represents the learned parameters of the model. To reconstruct a 3D model from a projection image, we fix the camera projection matrix of the simulation system as defined by the target, along with an initial mean shape β_m and rest pose θ_r. A gradient descent based optimization with backpropagation is applied to the simulation model. NGC is used as a similarity measure [6], which is defined by the normalized cross-correlation of the image gradient between two images. The model is optimized until convergence to obtain the optimal pose θ and shape β parameters.

For evaluation, we used the dataset provided by Grupp et al., made up of the hip anatomy of 6 patients [7]. For each patient, a 3D segmentation of the pelvis along with 14 landmarks and a total of 366 fluoroscopy images in varying orientations with their respective projection matrices are available. The SSM is registered to each patient from the dataset.
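Returning to the projection model, the polychromatic primary signal of Eq. (3) can be sketched from the per-object distance maps as follows. This is a NumPy sketch; `primary_intensity`, the object keys, the energy binning and the coefficient values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def primary_intensity(L_maps, mu, I0):
    """Eq. (3): Beer-Lambert primary signal summed over a discretized spectrum.

    L_maps: dict of per-object path-length maps (H, W), e.g. {"body": ..., "bones": ...}
            (the air map would enter here as L_air - L_body, as described in the text)
    mu:     dict mapping each object to its attenuation coefficients, one per energy bin
    I0:     source intensity per energy bin
    """
    H, W = next(iter(L_maps.values())).shape
    I = np.zeros((H, W))
    for e, i0 in enumerate(I0):                            # sum over energy bins E
        line_integral = sum(mu[p][e] * L_maps[p] for p in L_maps)
        I += i0 * np.exp(-line_integral)                   # attenuated primary signal
    return I
```

Every operation here is differentiable with respect to the distance maps, which is what lets the renderer sit inside a gradient descent loop.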
This resulted in a base score with an average Hausdorff distance of 17.79 mm and a mean landmark error of 11.81 mm. The rotation and translation parameters obtained are assumed as the ground truth irrespective of the shape parameters. The 2D-3D reconstruction is tested by applying a random translation in the range of [-10, 10] mm and a rotation in the range of [-5, 5]° on all three axes for the SSM. The randomization is small due to the nature of the loss function being sensitive to changes only around a small region; such precision of initialization could either be provided manually or achieved with a network learning those initialization parameters. Fig. 3 shows sample outputs of our proposed method, along with the quantitative results in Table 1. For evaluation, we consider only the images where the visible region of the pelvis is greater than 50%.

Table 1. Overview of the reconstruction and landmark error. All distances are measured in mm. The lower bound error is dependent on the shape model, with a mean Hausdorff distance of 17.79 mm and a mean landmark error of 11.81 mm.

Initial Hausdorff Distance | Initial Landmark Error | Reconstructed Hausdorff Distance | Reconstructed Landmark Error
46.– ± –.64                | 30.– ± –.30            | 29.– ± –.56                      | 22.– ± –.–

Fig. 3.
Samples for 3D reconstruction from 2D projection images. (a) Target image, (b) Projection image with a random translation and rotation, (c) Projection image after registration, (d) Initial 3D overlay of the template mesh, (e) 3D overlay of the meshes after registration, (f) NGC map after registration.
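The NGC similarity used as the cost C throughout the optimization can be sketched as follows. This is a minimal sketch of normalized cross-correlation of image gradients; the exact formulation in [6] may differ in details such as the gradient operator and patch handling:

```python
import numpy as np

def ngc(a, b, eps=1e-8):
    """Normalized gradient correlation between two images a, b of shape (H, W):
    the normalized cross-correlation of their x- and y-gradients, averaged."""
    score = 0.0
    for axis in (0, 1):                      # vertical and horizontal gradients
        ga = np.gradient(a, axis=axis)
        gb = np.gradient(b, axis=axis)
        ga = ga - ga.mean()                  # zero-mean before correlating
        gb = gb - gb.mean()
        score += np.sum(ga * gb) / (np.linalg.norm(ga) * np.linalg.norm(gb) + eps)
    return score / 2.0                       # in [-1, 1]; 1 for identical gradients
```

Because NGC compares gradients rather than intensities, it is insensitive to global brightness offsets between the simulated and acquired images, but, as noted in the text, it only responds to structure in a small region around the current pose.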
We presented an approach for differentiable rendering and reconstructing a patient-specific 3D mesh model from a single 2D X-ray projection image. We demonstrated that, using a deformable human model parameterized by shape and pose, we were able to approximately register patients to an X-ray projection. We found that, given a visibility above 50% and an initialization of a generic model shape positioned close to the actual image, the 3D reconstruction error is significantly reduced with our optimization approach. The remaining error may be due to the simultaneous unconstrained optimization of shape and pose. This
is because changes in shape and translation across depth, i.e., in the z-direction, result in a large range of solutions causing the simulated image to appear similar to the target image. For now, we ignore the case when less than 50% of the pelvis is visible, due to the intrinsic nature of the loss function being sensitive to the initial position. In general, the error could be reduced with an additional prior to determine a good initial pose [8].

As the registration is independent of the patient pose, we can overcome possible pose mismatches resulting from employing a preoperative CT scan for registration. Still, the pipeline has limitations. First, the X-ray simulation is an approximation of a real X-ray image, as it does not take into account any scattering mechanisms and other effects causing noise. This can possibly be improved by employing a deep learning model for predicting patient-specific scatter [9]. Second, the renderer assumes a constant density for any given material, resulting in a pseudo-realistic rendering. One way to overcome this problem would be to use a tetrahedral mesh with learned bone density distributions. Third, due to ambiguous depth information in fluoroscopy images, accurate registration becomes a challenging task. This error could be minimized with multi-view images or by using the table position as a constraint. To summarize, we proposed a differentiable renderer by deriving the distance travelled by a ray inside mesh structures. This was further extended to generate X-ray images from human models. Finally, we showed that it could be successfully used to reconstruct 3D models from 2D fluoroscopy images.
Disclaimer:
The concepts and information presented in this paper are based on research and are not commercially available.
References
1. Ehlke M, Ramm H, Lamecker H, et al. Fast generation of virtual X-ray images for reconstruction of 3D anatomy. IEEE Trans Vis Comput Graph. 2013; p. 2673–2682.
2. Ravi N, Reizenstein J, Novotny D, et al. Accelerating 3D deep learning with PyTorch3D. arXiv preprint arXiv:2007.08501. 2020.
3. Liu S, Li T, Chen W, et al. Soft rasterizer: A differentiable renderer for image-based 3D reasoning. In: Proc IEEE Int Conf Comput Vis; 2019. p. 7708–7717.
4. Loper M, Mahmood N, Romero J, et al. SMPL: A Skinned Multi-Person Linear Model. ACM Trans Graph (Proc SIGGRAPH Asia). 2015 Oct;34(6):248:1–248:16.
5. Vidal FP, Garnier M, Freud N, et al. Simulation of X-ray attenuation on the GPU. In: Tang W, Collomosse J, editors. Theory and Practice of Computer Graphics. The Eurographics Association; 2009.
6. Penney GP, Weese J, Little JA, et al. A comparison of similarity measures for use in 2-D-3-D medical image registration. IEEE Trans Med Imaging. 1998;17(4):586–595.
7. Grupp RB, Unberath M, Gao C, et al. Automatic annotation of hip anatomy in fluoroscopy for robust and efficient 2D/3D registration. Int J Comput Assist Radiol Surg. 2020 May;15(5):759–769.
8. Bier B, Unberath M, Zaech JN, et al. X-ray-transform invariant anatomical landmark detection for pelvic trauma surgery. In: Med Image Comput Comput Assist Interv. Springer; 2018. p. 55–63.
9. Roser P, Zhong X, Birkhold A, et al. Physics-driven learning of X-ray skin dose distribution in interventional procedures. Med Phys. 2019;46:4654–4665.