Distributed-memory large deformation diffeomorphic 3D image registration
Andreas Mang, Amir Gholami, and George Biros
The Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, Texas 78712
[email protected]; [email protected]; [email protected]
Abstract—We present a parallel distributed-memory algorithm for large deformation diffeomorphic registration of volumetric images that produces large isochoric deformations (locally volume preserving). Image registration is a key technology in medical image analysis. Our algorithm uses a partial differential equation constrained optimal control formulation. Finding the optimal deformation map requires the solution of a highly nonlinear problem that involves pseudo-differential operators, biharmonic operators, and pure advection operators both forward and backward in time. A key issue is the time to solution, which poses the demand for efficient optimization methods as well as an effective utilization of high performance computing resources. To address this problem we use a preconditioned, inexact, Gauss-Newton-Krylov solver. Our algorithm integrates several components: a spectral discretization in space, a semi-Lagrangian formulation in time, analytic adjoints, different regularization functionals (including volume-preserving ones), a spectral preconditioner, a highly optimized distributed Fast Fourier Transform, and a cubic interpolation scheme for the semi-Lagrangian time-stepping. We demonstrate the scalability of our algorithm on high-resolution images on the "Maverick" and "Stampede" systems at the Texas Advanced Computing Center (TACC). The critical problem in the medical imaging application domain is strong scaling, that is, solving registration problems of a moderate size that is typical for medical images. We are able to solve the registration problem for images of this size in less than five seconds on 64 x86 nodes of TACC's "Maverick" system.

Index Terms—Diffeomorphic Image Registration, Optimal Control, Newton-Krylov Methods, Scientific Computing, High Performance Computing.
I. INTRODUCTION
Deformable registration (also known as image alignment, warping, or matching) refers to methods that find point correspondences between images by comparing image intensities.
This material is based upon work supported by AFOSR grants FA9550-12-10484 and FA9550-11-10339; by NSF grant CCF-1337393; by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Numbers DE-SC0010518 and DE-SC0009286; by NIH grant 10042242; by DARPA grant W911NF-115-2-0121; and by the Technische Universität München, Institute for Advanced Study, funded by the German Excellence Initiative (and the European Union Seventh Framework Programme under grant agreement 291763). Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the AFOSR, the DOE, the NIH, the DARPA, or the NSF. Computing time on the Texas Advanced Computing Center's Stampede system was provided by an allocation from TACC and the NSF.
We refer to these point correspondences as the deformation map (see Figure 1 in § II). An example of a low-dimensional image registration problem is affine registration; it creates simple maps consisting of rotations, translations, and scalings [49]. Typically, affine registration is used as an initialization step for large deformation diffeomorphic registration (LDDR), which is the problem we are concerned with in the present work. In LDDR, we typically search for a deformation map for which the number of degrees of freedom is the dimension of the ambient space times the number of grid points defined in the image space. LDDR is much more flexible than affine registration and thus, in general, more informative in clinical studies [55], [38]. Such high-dimensional transformations can be defined in many different ways [49], [50], [55]. Image registration is an ill-posed inverse problem; it does not have a unique solution. Not all large deformation maps are admissible since they can shuffle the points arbitrarily to match intensities. It is crucial to impose constraints on the deformation while allowing for flexibility. The most important constraint is that the map is diffeomorphic (see Figure 2 in § II).

Solving an LDDR problem in a rigorous way requires the solution of a non-convex partial differential equation (PDE) constrained optimal control problem [10], [26], [31], [41]. This problem is ill-posed and involves non-linear and ill-conditioned operators. Most state-of-the-art packages circumvent these issues by sacrificing scalability and settling for crude solutions using simple but suboptimal algorithms. In many cases this works sufficiently well, but in several other cases, it does not. There is significant activity in trying to improve the existing algorithms. With regards to performance optimizations, most codes use open multi-processing (OpenMP) or graphics processing unit (GPU) acceleration; there are very few codes that utilize distributed memory parallelism. As a result they are not scalable to the full resolution; to solve problems for large images most codes use subsampling. This is limiting considering the current imaging resolutions. Seven Tesla magnetic resonance imaging (MRI) scanners can reach sub-millimeter resolution [44]. Ultra-high resolution computed tomography (CT) captures comparably fine resolutions [36].

Accepted for publication at SC16; Salt Lake City, Utah, USA; November 2016.

Fig. 1: The image registration problem (data taken from [50], [3]). The input (original data) are image intensities ρ_R and ρ_T. The output is y, the deformation map. Our goal is to find y so that ρ_T(y) (the deformed ρ_T) is as close as possible to ρ_R (with respect to some appropriate measure). One way to achieve this is to use rigid registration (i.e., searching for a map that is entirely described by rotations and translations). The result of a rigid registration is shown in the third image from the left; the fourth image shows the difference between the two images ρ_R and ρ_T after rigid registration. As one can see, there are still significant differences in the intensity values. If we use a deformable registration instead, we can compute a much more flexible y, which results in a much smaller misfit |ρ_R − ρ_T(y)|. The deformation map y for this method is visualized in the last figure to the right. The grid lines, superimposed on top of ρ_T(y), were a Cartesian grid before the deformation. We use them to visualize the overall deformation.

Beyond the need for strong scaling of image registration algorithms for clinical applications, there is also a need for weak scaling for imaging in biology, biophysics, and neuroscience. Animal micro-CT reaches O(µm) resolution [56], [64]. In small animal neuroimaging, CLARITY [58], a novel optical imaging technique, can deliver sub-micron resolution for the whole brain of the animal, resulting in 10-100 billion-voxel images. To our knowledge, none of the existing schemes for LDDR allow for the registration of such large volumetric images [5], [9].

Contributions: The design goals for our 3D LDDR scheme are the following: (1) ability to represent large diffeomorphic deformations; (2) algorithms based on rigorous mathematical foundations; (3) algorithmic optimality with respect to both the deformation map resolution and the image resolution; and (4) parallel scalability.
Here, we propose an algorithm that achieves these goals and has the following characteristics:
• It is based on optimal control theory. Our formulation allows the control of the registration quality in terms of image correspondence and different quality metrics for the diffeomorphism/deformation map (see § II-A).
• It uses a semi-Lagrangian approach for solving the transport equations that govern the deformation of the image. This approach leads to algorithmic optimality (see § III-B2).
• It uses a spectral discretization in space (see § III-B1). This discretization enables flexibility in the choice of regularization operators for the deformation map. Such flexibility is necessary since different image registration applications have different requirements. It also allows for efficient solvers of saddle-point linear systems.
• It uses optimal algorithms based on an adjoint-based formulation solved via a line-search globalized, inexact, preconditioned Gauss-Newton-Krylov scheme (see § III-A).
• It uses distributed-memory parallelism for scalability, employs several performance optimizations specific to our problem, and uses a parallel FFT for elliptic solvers and differentiations that has been shown to scale to hundreds of thousands of cores (see § III-C). It employs several optimizations for the most expensive part of the computation (cubic interpolation) (see § III-C2). It also supports GPU acceleration (not discussed here).
In addition, we analyze the overall complexity of our method in terms of communication, computation, and storage. The class of deformations we consider here is one of the most challenging since we enable locally volume preserving maps, which find many applications [54], [63], [29], [12]. We present results for synthetic and neurological images and demonstrate the performance of our algorithm (see § IV) for both volume-preserving and more generic deformation maps.
Related work on high performance computing methods for 3D image registration: A rich literature survey on high performance computing (HPC) in image registration can be found in [53], [18], [35]. General surveys on image registration can be found in [49], [55]. Formulations related to the one discussed in this work are reviewed in [45], [46], [47].

State-of-the-art registration packages that are used in the medical imaging and medical image computing community include ELASTIX [39], ANTS [6], DARTEL [4], and DEMONS [60], [61], [43]. All of these offer some kind of diffeomorphic registration scheme. These packages mostly support OpenMP, but do not use GPU or Message Passing Interface (MPI) acceleration (exceptions to be discussed below). An important distinction should be made between the image resolution (number of voxels) and the map resolution (number of degrees of freedom for the map parameterization). In general, the higher the map resolution, the better the registration quality [38], [39], but the harder the optimization problem since it has more degrees of freedom. Most existing codes downsample the map resolution significantly. (In the medical imaging jargon, volume-preserving maps are referred to as "mass preserving" maps.)

The majority of researchers have used GPUs to accelerate the calculations. For example, the solver for the LDDR scheme in [28] uses a preconditioned gradient descent (not Newton) algorithm with a hardware-provided trilinear interpolation on a GPU architecture. It supports the distributed solution of multiple independent LDDR image registration problems (in an embarrassingly parallel way), but does not support distributed memory parallelism for a single LDDR problem.

Two popular packages that exploit GPU acceleration are NIFTYREG [48] and PLASTIMATCH [52]. They use B-spline parameterized low-resolution maps, a tri-linear interpolation scheme, and gradient descent type optimization; NIFTYREG supports soft constraints to penalize volume change. An MPI version of NIFTYREG for bigger images [33] exists, but the map resolution remains the same. The pioneering works [62], [42] support MPI. Their formulation is based on elastic deformation maps (not LDDR). A GPU-LDDR scheme that supports somewhat high-resolution maps is [59]. It uses steepest descent (not Newton) and does not support MPI; no timings are reported.

To summarize, existing schemes do not support scalable LDDR algorithms and no scaling studies have been reported.

Limitations: In multiframe volume registration (e.g., 4D Cine-MRI) one seeks to register multiple images using a smooth, continuous mapping [13], [9].
Our solver can be used as is, but our diffeomorphic map parameterization is better suited for registering two images. Our parameterization can be extended without any major algorithmic changes, but the software engineering would require some work. Another missing piece is a preconditioner that is insensitive to the regularization parameter. There are several techniques for doing so, e.g., grid continuation and multilevel preconditioning [10], [1], [47]. Here we focus on the single-level solver. The single-node performance of the interpolation can be improved by more sophisticated blocking, manual vectorization, and possibly prefetching. Similar considerations hold true for the GPU version of the interpolation. This is ongoing work. Another limitation is that we only consider a discretization on Cartesian grids. This is not always the best grid [62]. The structure of our algorithm changes significantly for unstructured grids.

II. BACKGROUND
Let Ω = [0, 2π)³ ⊂ R³ be the spatial domain in which we define functions (images). ∂Ω denotes the boundary of Ω and x a point in Ω. Let ρ(x) ∈ R be a function defined on Ω. In imaging, ρ(x) is the image intensity at a point x; in optimal control, ρ(x) is the state field. In the registration problem, we have a reference image, denoted by ρ_R(x), and a template image, denoted by ρ_T(x); the goal is to find a vector-valued deformation map, denoted by y(x), that maps a point of the template image ρ_T to a corresponding point in the reference image ρ_R [49], [50].

Let v(x) be the velocity field that generates the map y. In our formulation, we introduce a pseudotime t and denote the deformation of the template image at time t by ρ(x, t). We define ρ(x, t = 0) to be the undeformed template image ρ_T, and ρ(x, t = 1) to be the result of applying the deformation map (which needs to be compared to ρ_R(x)). For the optimal control problem, λ(x, t) is the adjoint field, H is the reduced Hessian operator, g is the gradient field, and β > 0 is the scalar regularization parameter. We use periodic boundary conditions for all differential operators. For the discretization, N_i is the number of grid points in the i-th dimension; N = N_1 N_2 N_3 is the total number of unknowns in space; n_t is the number of time steps and δt the time step size. We use p for the number of MPI tasks, t_s for the latency in seconds, and t_w for the reciprocal of the bandwidth when we do complexity analysis. Boldface lowercase symbols indicate vectors in R³.

Fig. 2: Here we illustrate diffeomorphic and non-diffeomorphic transformations in 2D by considering an infinitesimal area Ω̃. We can think of Ω̃ as a single grid cell (pixel). The first figure on the left depicts the undeformed area. The second figure shows an admissible deformation that shrinks the original area. The third figure depicts an area-preserving deformation (in 3D it would be volume preserving). The fourth figure shows the result of a deformation that expands the area. The fifth illustration is not diffeomorphic since material lines that did not cross each other in the original (leftmost illustration) cross now. The sixth, rightmost, illustration corresponds to a singular deformation in which all the spatial information is lost by shrinking Ω̃ to a single point. The last two deformations are not useful in image analysis. However, without any constraints on the deformation map, pixels with non-diffeomorphic behavior almost always appear. For this reason appropriate regularization of the problem is necessary.

A. Image registration
The image registration problem can be abstractly defined as follows. Given two functions (i.e., images) ρ_T(x) and ρ_R(x), we seek a vector function y(x) such that the L²-distance (i.e., the residual) between ρ_T(y(x)) and ρ_R(x) is minimal. (Other types of distance measures can be used; see, e.g., [49], [50], [55]. There are no significant changes in our formulation or algorithm if we consider other popular distance measures.) We can think of y(x) as deforming an (infinite) grid of points in the template image ρ_T so that their intensity after the deformation matches the reference image ρ_R in the L²-norm. It can be shown that this problem has an infinite number of solutions, most of which are not useful. To resolve this, one needs to impose additional constraints, such as smoothness (i.e., ∇y(x) exists), although this alone may not be sufficient to ensure that the map is plausible. Note that we require det(∇y(x)) > 0 for all x ∈ Ω to guarantee that y(x) is a diffeomorphism (for an illustration, see Figure 2). In velocity-based LDDR it can be shown that such a diffeomorphic map y exists if the generating velocity field v(x, t) is adequately smooth [9], [15], [13]; a typical requirement is that v lies in an appropriate Sobolev space [9]. The deformation map y can be computed from v by solving

∂_t y(x, t) + v(x, t) · ∇y(x, t) = 0,  y(x, 0) = x,  (1)

where y(x) = y(x, 1). It can be shown that if the velocity is incompressible (i.e., div v = 0 for every x in Ω), then det ∇y(x, t) = 1 and the diffeomorphism is referred to as volume preserving [27]. Here, we consider both the general and the incompressible velocity cases. The latter case is more challenging. We only consider stationary velocity fields, that is, v(x, t) = v(x). In the following we drop the dependence of the functions on the spatial position x for notational convenience.
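The volume-preservation property can be checked numerically. The following 2D sketch (not the paper's 3D code; the velocity field and all parameters are illustrative) traces the characteristics of (1) for the divergence-free stationary field v = (sin(x2), sin(x1)) and verifies that det ∇y stays close to 1:

```python
import numpy as np

# Periodic 2D grid on [0, 2*pi)^2; v = (sin(x2), sin(x1)) satisfies div v = 0.
N, nt = 64, 40
h, dt = 2 * np.pi / N, 1.0 / 40
x1, x2 = np.meshgrid(np.arange(N) * h, np.arange(N) * h, indexing="ij")

def v(X1, X2):
    return np.sin(X2), np.sin(X1)

# Trace the characteristics of (1) at every grid point with an RK2 scheme.
# Since v is given analytically, no interpolation is needed in this sketch.
Y1, Y2 = x1.copy(), x2.copy()
for _ in range(nt):
    v1, v2 = v(Y1, Y2)
    Y1s, Y2s = Y1 - dt * v1, Y2 - dt * v2        # predictor step
    w1, w2 = v(Y1s, Y2s)
    Y1 = Y1 - 0.5 * dt * (v1 + w1)               # corrector step
    Y2 = Y2 - 0.5 * dt * (v2 + w2)

# det(grad y) via spectral derivatives of the periodic displacement u = y - x.
k = np.fft.fftfreq(N, d=1.0 / N)                 # integer wavenumbers
K1, K2 = np.meshgrid(k, k, indexing="ij")
def dd(u, K):
    return np.real(np.fft.ifft2(1j * K * np.fft.fft2(u)))
u1, u2 = Y1 - x1, Y2 - x2
det = (1 + dd(u1, K1)) * (1 + dd(u2, K2)) - dd(u1, K2) * dd(u2, K1)
print(np.abs(det - 1).max())   # small: only the time-discretization error
```

The deviation of det ∇y from 1 is governed by the RK2 time-stepping error, consistent with the claim that incompressible velocities generate volume-preserving maps.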
The solution to the image registration problem can be found by solving the following PDE-constrained optimization problem [30], [13], [45]:

min_v J[v] = 1/2 ‖ρ_1 − ρ_R‖²_{L²(Ω)} + β/2 ‖∇²v‖²_{L²(Ω)}  (2a)

subject to

∂_t ρ(t) + v · ∇ρ(t) = 0 in Ω × (0, 1],  (2b)
ρ(0) = ρ_T in Ω,  (2c)
div v = 0 in Ω.  (2d)

The second term of J enforces smoothness of v and β > 0 is the regularization parameter. In this formulation, ρ_1(x) = ρ(x, 1) ≡ ρ_T(y), where y is the solution of (1) at t = 1. The constraint (2b) defines an implicit function between ρ and v; given v we solve (2b) for ρ.

a) Computing the gradient of J: Given v, we need several steps to compute the gradient g = ∂_v J. First we compute ρ(1) by solving (2b) (with initial condition defined by (2c)). Then we compute the adjoint function λ(t) by solving the backward-in-time adjoint equation [34], [10]

−∂_t λ(t) − div(v λ(t)) = 0 in Ω × [0, 1),  λ(1) = ρ_R − ρ(1) in Ω.  (3)

Once we have the state and adjoint fields, we can evaluate the gradient given by

g(v) := β ∇⁴v + (I − ∇∇⁻² div) ∫₀¹ λ(t) ∇ρ(t) dt.  (4)

(It can be shown that the space of possible diffeomorphisms generated by time-varying velocities is strictly larger than the space generated by stationary velocities. This does not have practical implications when we register two images; see, e.g., [45]. However, it is restrictive if we want to register a sequence of images, as in optical flow problems.)
The operator P = I − ∇∇⁻² div, also known as the Leray operator [57], eliminates the incompressibility constraint for v. Furthermore, we define the vector field

b(x) := ∫₀¹ λ(x, t) ∇ρ(x, t) dt,

so that g(v) = β∇⁴v + P b. The gradient g is a nonlinear elliptic integro-differential operator where the state (forward in time) and adjoint (backward in time) transport PDEs are "hidden" in b.

Evaluating the gradient requires solving two transport equations, inverting the Laplacian, and applying gradient, divergence, and biharmonic operators. (If we do not wish to compute a volume preserving map, we can drop, i.e., not enforce, the incompressibility constraint. Then, in the gradient calculation, we only need to replace the P operator with an identity operator.) The first-order optimality condition for (2) requires that g(v) = 0. Most registration packages use steepest descent (first order) methods to find an optimal point (minimizer) [6], [9], [13]. However, steepest descent methods only have a linear convergence rate. Using Newton methods, which provide a much better convergence rate, is considered to be prohibitive, especially for LDDR, for two main reasons. First, it is cumbersome to derive the equations for the second order optimality conditions. Second, a naive implementation of Newton methods can be very costly if not done carefully.

b) The Newton and Gauss-Newton Hessian operators H: To solve g(v) = 0 for v we use an Armijo line-search globalized Newton method [51]. The key operation is the action of the Hessian H(v) on a vector field ṽ, which is commonly referred to as the Hessian matvec. This matvec is computed by performing the following steps: First, we need to solve (2b) and (3) to compute the state and adjoint variables ρ and λ, respectively.
After computing these fields, we need to solve (5a) for the incremental state variable ρ̃ and (5c) for the incremental adjoint variable λ̃, accumulate them in time to compute b̃, and finally evaluate (5e):

∂_t ρ̃(t) + v · ∇ρ̃(t) + ṽ · ∇ρ(t) = 0 in Ω × (0, 1],  (5a)
ρ̃(0) = 0 in Ω,  (5b)
−∂_t λ̃(t) − div(λ̃(t) v + λ(t) ṽ) = 0 in Ω × [0, 1),  (5c)
λ̃(1) + ρ̃(1) = 0 in Ω,  (5d)
H(v)ṽ := β∇⁴ṽ + P b̃ in Ω,  (5e)

where

b̃ = ∫₀¹ λ̃(t) ∇ρ(t) + λ(t) ∇ρ̃(t) dt.

Notice that (5a) and (5c) require storing ρ(t), ρ̃(t), and λ(t) for all t. Also notice that certain terms in (5) drop if we enforce div v = 0. The Newton step ṽ is obtained by solving the linear system H(v)ṽ = −g(v). For a Gauss-Newton Hessian, we drop the two terms that involve λ(t), i.e., the last term in (5c) and the second term in b̃.

Fig. 3:
Illustration of the semi-Lagrangian scheme (figure modified from [47]). The square domain corresponds to a grid block assigned to a single MPI task. Point x is a regular grid point and X is the material point at t = 0 that lands at x at time δt. The semi-Lagrangian algorithm requires interpolation of v, ρ, ρ̃, λ, and λ̃ at these off-grid points; as illustrated here, these points can lie in other MPI tasks. We communicate these points, interpolate their values, and then communicate them back.
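Both the gradient (4) and the Hessian matvec (5e) apply the Leray projector P. A minimal 2D spectral sketch (illustrative only; the paper works in 3D with a distributed FFT) shows how P = I − ∇∇⁻²div is applied with FFTs and that the projected field is divergence free:

```python
import numpy as np

N = 64
k = np.fft.fftfreq(N, d=1.0 / N)                 # integer wavenumbers
K1, K2 = np.meshgrid(k, k, indexing="ij")
K2norm = K1**2 + K2**2
inv_lap = np.zeros_like(K2norm)                  # spectral inverse Laplacian,
mask = K2norm > 0                                # zero mean mode excluded
inv_lap[mask] = -1.0 / K2norm[mask]

def leray(v1, v2):
    V1, V2 = np.fft.fft2(v1), np.fft.fft2(v2)
    div_hat = 1j * K1 * V1 + 1j * K2 * V2        # spectral divergence
    phi = inv_lap * div_hat                      # solve Laplacian(phi) = div v
    W1 = V1 - 1j * K1 * phi                      # subtract the gradient part
    W2 = V2 - 1j * K2 * phi
    return np.real(np.fft.ifft2(W1)), np.real(np.fft.ifft2(W2))

# Smooth random test field: white noise filtered by a Gaussian in Fourier space.
rng = np.random.default_rng(0)
def smooth(a):
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.exp(-0.1 * K2norm)))
v1, v2 = smooth(rng.standard_normal((N, N))), smooth(rng.standard_normal((N, N)))
w1, w2 = leray(v1, v2)
div_w = np.real(np.fft.ifft2(1j * K1 * np.fft.fft2(w1) + 1j * K2 * np.fft.fft2(w2)))
print(np.abs(div_w).max())   # near machine precision: P v is divergence free
```

Since P is diagonal in the Fourier basis apart from the coupling between components, its cost is a handful of FFTs, which is why enforcing incompressibility adds little to the per-iteration cost.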
ETHODS
Given two images ρ R and ρ T our goal is to find y . Weuse an optimize-then-discretize approach for (2). Our schemecan be summarized as follows: • We solve (2) for v . • Once we have v , we use (1) to compute y . • To find v we solve g ( v ) = (where g ( v ) is given by(4)) using a preconditioned Newton-Krylov method. • In space we use Fourier expansions (regular grids withperiodic boundary conditions) and in time we use a semi-Lagrangian scheme. • We use data parallelism in space (the regular grid). • All spatial differential operators (and their inverses ifneeded) are computed spectrally using our parallel FFT. • All spatial algebraic operators are done in parallel.We provide more details for our algorithm in the followingsubsections.
A. Newton-Krylov solver and preconditioning
For the optimization we use a Newton method globalized with an Armijo line-search. We use a preconditioned Conjugate-Gradient (PCG) method to compute the Newton step. The linear solves using PCG are done inexactly using a tolerance that depends on the relative norm of the gradient [17]. The preconditioner is the inverse of the biharmonic operator (∇⁻⁴) and can be applied in nearly linear time using FFTs (with a logarithmic factor). This preconditioner delivers mesh-independence, but not β-independence (see § IV). Since the problem is highly nonlinear, we use parameter continuation on β. The target value for β is application dependent and, in our algorithm, determined by various metrics defined on ∇y [45]. We use the TAO module from the PETSc library [8], [7] for numerical optimization, which supports user-defined, matrix-free PCG.
TAO provides interfaces that allow one to control two main parameters in the Newton-Krylov solver: (i) the accuracy of the solution of the linear system to compute the Newton step (the relative tolerance of the PCG method used to solve the Hessian equation); and (ii) the nonlinear termination criteria. We provide the algorithms to determine these parameters and, given v and ṽ, efficient routines for the function evaluation J(v), the gradient evaluation g(v), the Hessian matvec H(v)ṽ, and the action of the preconditioner ∇⁻⁴ṽ.
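The structure of a matrix-free PCG solve with a spectral preconditioner can be sketched in a few lines. The following 1D toy problem (not the paper's solver; the operator and all parameters are illustrative stand-ins) uses a Hessian-like operator β d⁴/dx⁴ + a(x) and preconditions with the constant-coefficient part inverted in Fourier space, mirroring the inverse-biharmonic preconditioner described above:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

N, beta = 256, 1e-2
x = 2 * np.pi * np.arange(N) / N
k = np.fft.fftfreq(N, d=1.0 / N)               # integer wavenumbers
a = 1.0 + 0.5 * np.sin(x)                      # smooth, positive coefficient

def Hmat(u):                                   # H u = beta u'''' + a(x) u
    return beta * np.real(np.fft.ifft(k**4 * np.fft.fft(u))) + a * u

def Minv(r):                                   # (beta k^4 + mean(a))^{-1} r
    return np.real(np.fft.ifft(np.fft.fft(r) / (beta * k**4 + a.mean())))

H = LinearOperator((N, N), matvec=Hmat, dtype=np.float64)
M = LinearOperator((N, N), matvec=Minv, dtype=np.float64)
g = np.sin(3 * x) * np.exp(np.cos(x))          # some smooth right-hand side

iters = []
u, info = cg(H, g, M=M, callback=lambda xk: iters.append(1))
print(info, len(iters), np.abs(Hmat(u) - g).max())
```

Because the preconditioner captures the stiff high-frequency part of the operator exactly, PCG converges in a small, mesh-independent number of iterations; only the variable coefficient a(x) is left for the Krylov method to resolve.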
We use Cartesian grids to approximate spatial functions and spatial differential operators. We use an explicit second-order Runge-Kutta semi-Lagrangian method to discretize in time.
1) Space:
We use a spectral projection scheme for all spatial operations on a regular grid defined on Ω with periodic boundary conditions. For simplicity, we consider the isotropic case, in which the grid spacing is the same in all directions; that is, the number of points per direction is given by N_1 = N_2 = N_3 = N. Our actual implementation does not require this (see § IV for an example). We approximate

ρ(x) = Σ_k ρ̂_k exp(−i k · x),

where k = (k_1, k_2, k_3) ∈ Z³ is a multi-index with −N/2 ≤ k_j ≤ N/2 − 1, j = 1, 2, 3. The corresponding regular grid points are given by x_i = 2π i/N, where i = (i_1, i_2, i_3) ∈ N³ and 0 ≤ i_j ≤ N − 1, j = 1, 2, 3. We refer to {ρ̂_k} as the spectral coefficients of ρ. Mappings between {ρ_i} and {ρ̂_k} are done using the forward and inverse Fast Fourier Transform (FFT). We use a similar spectral discretization for λ and for each component of the velocity field v. All derivatives are performed by first taking the FFT and then filtering the spectral coefficients appropriately. In general, the input images ρ_R and ρ_T may not be periodic functions. In that case a spectral approximation will create excessively high aliasing errors. To address this, we use zero-padding for ρ_R and ρ_T. Also, in general, images will have discontinuities and thus are not differentiable, creating similar aliasing problems. So, before applying our algorithm, we smooth them spectrally with a Gaussian filter whose bandwidth is the grid size 2π/N. Notice that our spectral representation with periodic boundary conditions allows us to apply all the different spatial operators, including ∇⁻² and ∇⁻⁴, in a stable, accurate, and extremely efficient manner. As a result, the main cost of the computation will be solving the transport equations, not applying and inverting elliptic differential operators.
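The two spectral grid operations described here, differentiation and Gaussian smoothing, amount to a forward FFT, a diagonal filter, and an inverse FFT. A 1D sketch (the filter bandwidth is illustrative):

```python
import numpy as np

N = 64
x = 2 * np.pi * np.arange(N) / N
k = np.fft.fftfreq(N, d=1.0 / N)           # integer wavenumbers on [0, 2*pi)

def ddx(f):                                # differentiate: filter by i*k
    return np.real(np.fft.ifft(1j * k * np.fft.fft(f)))

def gauss(f, sigma):                       # smooth: damp high wavenumbers
    return np.real(np.fft.ifft(np.exp(-0.5 * (sigma * k) ** 2) * np.fft.fft(f)))

err = np.abs(ddx(np.sin(x)) - np.cos(x)).max()
print(err)                                 # spectrally accurate derivative
g = gauss(np.sign(np.sin(x)), 2 * np.pi / N)   # smooth a discontinuous "image"
```

For band-limited functions the derivative is exact up to roundoff, which is the accuracy claim behind applying ∇, div, ∇⁻², and ∇⁻⁴ spectrally.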
2) Time:
We choose a semi-Lagrangian method since it is unconditionally stable [19] and allows us to take a small number of time steps n_t. This is critical since we store several space-time fields. For example, when solving (5a) for ρ̃(t), we need ρ(t) for all t. For large n_t the storage requirements become excessive and more sophisticated checkpointing schemes [2] are required, which are more expensive. If we were using a Courant-Friedrichs-Lewy (CFL) restricted scheme for solving the transport equations (the CFL condition defines an upper bound on the time step size to ensure a stable solution of stiff, time-dependent PDEs [40]), storing the time history would have been impossible, since we would have had to store hundreds of time steps (see, e.g., [45], [46], [47]).

To explain the semi-Lagrangian method we reintroduce the notational dependence on space. We consider the general transport equation for a scalar field ν(x, t) (with a stationary velocity)

∂_t ν(x, t) + v(x) · ∇ν(x, t) = f(ν(x, t), x).

For example, for (5c) we have ν = λ̃ and f = λ̃ div v + div(λ ṽ). Then, for each x, to compute ν(x, δt) given ν(x, 0), we first compute a new point X, the semi-Lagrangian point, using the scheme

X* = x − δt v(x);  X = x − (δt/2) (v(x) + v(X*)),  (6)

and then we set

ν⁰(X) = ν(X, 0);  f⁰(X) = f(ν⁰(X), X);
ν*(x) = ν⁰(X) + δt f⁰(X);  f*(x) = f(ν*(x), x);
ν(x, δt) = ν⁰(X) + (δt/2) (f⁰(X) + f*(x)).  (7)

This scheme is fully explicit and unconditionally stable. Recall that x is a regular grid point and thus ν(x, 0) and v(x) are known. X* and X are not regular grid points. Computing v and ν at these off-grid locations requires multiple interpolations: three interpolations for v(X*), further interpolations for the f terms that depend on the semi-Lagrangian point X, and finally one interpolation for ν(X, 0).
If f depends on derivatives of ν, we first differentiate on the regular grid and then we interpolate the derivatives. The same scheme is used for the adjoint equations by changing the time variable from t to τ = 1 − t, so that −∂_t λ(t) = ∂_τ λ(τ) and λ(t = 1) = λ(τ = 0). Note that the interpolation cannot be done using an FFT, since the interpolation points can be spaced irregularly between grid points. Cubic interpolation is typically preferred over linear interpolation because the interpolation errors are accumulated throughout the time stepping without a time-step factor [11]. We use a tricubic interpolation scheme, which we discuss in § III-C2.
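One step of (6)-(7) with f = 0 can be sketched in 2D. This is illustrative only: scipy's cubic spline interpolation with periodic wrapping stands in for the paper's hand-optimized tricubic kernel, and a constant velocity is chosen so the exact solution is a pure translation:

```python
import numpy as np
from scipy.ndimage import map_coordinates

N, dt = 128, 0.1
h = 2 * np.pi / N
x1, x2 = np.meshgrid(np.arange(N) * h, np.arange(N) * h, indexing="ij")
v1, v2 = 1.0, 0.5                                    # constant velocity field
nu0 = np.exp(-4 * ((x1 - np.pi) ** 2 + (x2 - np.pi) ** 2))  # smooth blob

# (6): trace back to the semi-Lagrangian point X; the RK2 corrector is
# trivial here because the velocity does not vary in space.
X1s, X2s = x1 - dt * v1, x2 - dt * v2                # predictor
X1 = x1 - 0.5 * dt * (v1 + v1)                       # corrector
X2 = x2 - 0.5 * dt * (v2 + v2)

# (7) with f = 0: nu(x, dt) = nu(X, 0), evaluated by cubic interpolation
# at the off-grid points X (coordinates given in grid-index units).
nu1 = map_coordinates(nu0, [X1 / h, X2 / h], order=3, mode="grid-wrap")

exact = np.exp(-4 * ((x1 - dt * v1 - np.pi) ** 2 + (x2 - dt * v2 - np.pi) ** 2))
print(np.abs(nu1 - exact).max())   # only the interpolation error remains
```

Note the step is stable for any δt; the accuracy is set by the interpolation and the RK2 trajectory error, not by a CFL bound.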
C. Parallel algorithms
The main computational kernels are the 3D FFTs to compute derivatives, elliptic operators, and their inverses; the interpolation to off-grid points needed for the semi-Lagrangian time stepping; the Krylov solver (PCG) for the Hessian; and the Newton solver. The 3D FFT has well-known algorithmic complexity. The interpolation at semi-Lagrangian points is the most expensive part of the computation, despite being local. As it turns out, about 60% of the overall time for the image registration problem is spent on interpolation. The Krylov and Newton solvers are sequential across iterations, whereas all the function, gradient, and Hessian evaluations are done using data parallelism. Below we give more details on the FFT, the interpolation, and how we put everything together.

Fig. 4: Here we explain the data partitioning, which is based on the pencil decomposition for 3D FFTs [25]. The colors indicate the data partitioning; each color corresponds to the data assigned to an MPI task. Subfigure (a) represents the input distribution of the (volumetric) image. After an FFT in the first coordinate, in (b) we do the FFT in the second coordinate. This requires √p concurrent alltoall exchanges between groups of √p MPI tasks. This process is repeated for the third direction in (c) and has the same communication costs as (b) (image modified from [23]).
1) Partitioning and FFT:
Let N_i, i = 1, 2, 3, be the number of grid points in the i-th dimension. Also assume we have p = p_1 p_2 MPI tasks. We partition the data using the pencil decomposition of the 3D FFT (see Figure 4). Each MPI task gets (N_1/p_1) × (N_2/p_2) × N_3 grid points. There is no partitioning in time and all the time steps are stored in memory. The scalability of the 3D FFT has been well studied [16], [25]. The 3D FFT requires O((N/p) log N) computations and O(t_s √p + t_w N/p) communications. We use the open-source package AccFFT [24], [22] that supports both GPU and CPU FFTs and is based on the 1D FFTs implemented in the FFTW package [21]. Our code features optimizations for the ∇ and div operators that allow us to avoid multiple 3D FFTs. For example, ∇ρ = (∂_1 ρ, ∂_2 ρ, ∂_3 ρ) requires N_2 N_3 1D FFTs across the first coordinate, diagonal scaling, and then the same number of inverse FFTs, again across the first dimension. A similar process is required for the other components, but they require collective all-to-all communications for rearranging the data. The remaining operators, P and the elliptic operators and their inverses, are diagonal in the spectral basis and require standard 3D FFTs.
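The partial-derivative optimization described here can be verified in serial numpy (illustrative only; the parallel code works on distributed pencils): ∂_1 ρ needs FFTs only along the first axis, which in the pencil decomposition is local to each task, and it agrees with the full 3D-FFT route.

```python
import numpy as np

N = 32
grid = 2 * np.pi * np.arange(N) / N
x1, x2, x3 = np.meshgrid(grid, grid, grid, indexing="ij")
rho = np.sin(x1) * np.cos(2 * x2) * np.sin(3 * x3)
k = np.fft.fftfreq(N, d=1.0 / N)           # integer wavenumbers

# Route 1: N2*N3 one-dimensional FFTs along axis 0, scale by i*k1, invert.
d1 = np.real(np.fft.ifft(1j * k[:, None, None] * np.fft.fft(rho, axis=0), axis=0))

# Route 2: full 3D FFT, the same diagonal scaling by i*k1, inverse 3D FFT.
d1_3d = np.real(np.fft.ifftn(1j * k[:, None, None] * np.fft.fftn(rho)))

err_routes = np.abs(d1 - d1_3d).max()
err_exact = np.abs(d1 - np.cos(x1) * np.cos(2 * x2) * np.sin(3 * x3)).max()
print(err_routes, err_exact)
```

The FFTs along the untouched axes cancel exactly with their inverses, so route 1 saves both the extra 1D transforms and, in the distributed setting, the all-to-all exchanges that the 3D FFT would need.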
2) Interpolation:
For every grid point x_i we have to find the departure points X_i required in (7) using (6). In the distributed case, every processor interpolates all the points that fall into the region defined by its pencil (subfigure (a) in Figure 4). This is essentially an alltoallv operation. We refer to this step as the "scatter" phase. Note that the points need to be constructed only when the velocity field changes. In a Newton iteration for a given v we have to compute these points only once for v (forward transport) and once for −v (adjoint equations); that is, the scatter phase needs to be done once per field per Newton iteration. This results in speedups due to savings in communication and computation. After the scatter phase is completed, each process has to perform a cubic interpolation on the points that it owns locally, as well as on the points it received from the other processors. After this step is done, an alltoallv operation is necessary to send/recv all the interpolation results. This needs to be done once per time step.

The computation is organized as follows. For every forward or adjoint solve, we invoke an interpolation planner, which performs the scatter phase, stores the semi-Lagrangian points, and creates the communication plans for the transport equation. Then the actual transport (7), which involves multiple interpolations at every time step, is performed. For divergence-free v, the computation of (2b) and (3) involves only interpolations. For (5a) and (5c) it also involves the gradient and divergence operators that appear on the right-hand side.

Note that it is possible for an interpolation point to fall in between the locally owned domains of the processes, because the local domain of each process is disjoint from the others. For this reason, every processor maintains a layer of ghost points, regular grid points that belong to other processors. The values of ν at these points must be synchronized before interpolation takes place.
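The owner computation used in the scatter phase can be sketched as follows. This is a simplified model (not the code's actual API) that assumes a p1 × p2 pencil grid over the first two coordinates of a domain normalized to a periodic unit cube:

```cpp
#include <cmath>

// Hypothetical model of the pencil decomposition: p1 x p2 MPI tasks partition
// the first two coordinates of a periodic domain normalized to [0,1)^3.
// Returns the rank of the task whose pencil contains the point (x, y).
int owner_rank(double x, double y, int p1, int p2) {
    x -= std::floor(x);                  // wrap into the periodic unit cube
    y -= std::floor(y);
    int i = (int)(x * p1); if (i == p1) i = p1 - 1;  // guard against rounding to 1.0
    int j = (int)(y * p2); if (j == p2) j = p2 - 1;
    return i * p2 + j;                   // row-major rank within the pencil grid
}
```

In the scatter phase, each task buckets its departure points by owner rank and exchanges the buckets with a single alltoallv.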
Notice that for every point we have to bring in 64 scalar values and perform roughly 5 × 64 floating point operations. The constant is related to the 64 coefficients (4³) required to build and evaluate the tricubic interpolant, times five flops per coefficient. Therefore, the computation-to-memory-traffic ratio is O(1); the computation is memory bound. Blocking, prefetching, and vectorization can be used to improve the performance.

Algorithm 1
Parallel tricubic interpolation.
Input: {X_i′} ∈ Ω_r, owner(X_i′), {x_i} ∈ Ω_r, worker(x_i), ν(x_i), MPI task r
Output: ν(X_i′)
1: Communicate ν values for ghost points.
2: Send/recv X_i′ from r to/from owner(X_i′).
3: Locally interpolate to compute ν(X_i′).
4: Send/recv ν(X_i′) from worker(x_i) to r.

The execution flow of the interpolation algorithm is explained in Algorithm 1 for an MPI task r. Let ν(x_i) be the value of the scalar field ν at a regular grid point x_i. Also, let X_i′ be the corresponding semi-Lagrangian points for each i′ computed by (6). Let Ω_j be the spatial domain assigned to MPI task j. As shown in Figure 3, X_i′ ∈ Ω_r does not imply that x_i′ ∈ Ω_r, and vice versa. For an off-grid point X_i′, owner(X_i′) computes the MPI task that owns that point; that is, X_i′ ∈ Ω_r but x_i′ ∈ Ω_owner(X_i′). Similarly, x_i ∈ Ω_r but X_i ∈ Ω_worker(x_i). Of course, for most points both the owner and worker domains will be identical to Ω_r. In line 1 every task sends the values of ν that it owns to its four neighbors (recall that we use a pencil decomposition), with a communication cost of O(t_w N/p + t_s) (the four corner neighbors can be combined with the messages of the edge neighbors by an appropriate ordering of the messages). In line 2, for X_i′ whose corresponding regular grid point x_i belongs to a different MPI task, r sends X_i′ to that task. In line 3 the interpolation takes place. This phase requires roughly O(600 N/p) floating point operations. This is followed by an alltoall communication in line 4 to send/recv the interpolation results. Our GPU implementation follows a similar strategy.
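A hedged sketch of the local interpolation kernel (line 3): a tensor-product cubic that gathers the 4 × 4 × 4 = 64 neighbor values per point, as discussed in the text. We use a Catmull-Rom kernel here for concreteness; the production code's basis and memory layout may differ:

```cpp
#include <cmath>
#include <vector>

// Catmull-Rom cubic weights for the four neighbors of a point with
// fractional offset t in [0,1). One possible tensor-product kernel.
static void cubic_weights(double t, double w[4]) {
    w[0] = 0.5 * (-t * t * t + 2.0 * t * t - t);
    w[1] = 0.5 * (3.0 * t * t * t - 5.0 * t * t + 2.0);
    w[2] = 0.5 * (-3.0 * t * t * t + 4.0 * t * t + t);
    w[3] = 0.5 * (t * t * t - t * t);
}

// Tricubic interpolation of a periodic N^3 field f (flattened as
// f[(z*N + y)*N + x]); (x, y, z) are given in grid units. The 64 neighbor
// loads dominate the memory traffic, as discussed in the text.
double tricubic(const std::vector<double>& f, int N,
                double x, double y, double z) {
    int ix = (int)std::floor(x), iy = (int)std::floor(y), iz = (int)std::floor(z);
    double wx[4], wy[4], wz[4];
    cubic_weights(x - ix, wx); cubic_weights(y - iy, wy); cubic_weights(z - iz, wz);
    double s = 0.0;
    for (int a = 0; a < 4; ++a)
        for (int b = 0; b < 4; ++b)
            for (int c = 0; c < 4; ++c) {
                int gx = ((ix - 1 + a) % N + N) % N;  // periodic wrap
                int gy = ((iy - 1 + b) % N + N) % N;
                int gz = ((iz - 1 + c) % N + N) % N;
                s += wx[a] * wy[b] * wz[c] * f[(gz * N + gy) * N + gx];
            }
    return s;
}
```

Because this kernel reproduces linear functions, interpolating the field f(x, y, z) = x at (2.5, 3.5, 4.5) returns 2.5 exactly.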
3) Algorithm for incremental state equation:
We summarize the steps needed to solve the incremental forward problem (5a) for one time step in Algorithm 2 to illustrate the communication and computation pattern.
Algorithm 2
One time step of the solution of the incremental forward problem.

Input: v(x_i), ṽ(x_i), ρ(x_i, 0), ρ(x_i, δt), ρ̃(x_i, 0), X_i
Output: ρ̃(x_i, δt)
1: ρ̃(X_i) = ρ̃(X_i, 0) using Algorithm 1.
2: ∇ρ(x_i, 0) using FFT.
3: f(x_i) = −ṽ(x_i) · ∇ρ(x_i, 0).
4: f(X_i) using f(x_i) and Algorithm 1.
5: ρ̃*(x_i) = ρ̃(X_i) + δt f(X_i).
6: ∇ρ(x_i, δt) using FFT.
7: f*(x_i) = −ṽ(x_i) · ∇ρ(x_i, δt).
8: ρ̃(x_i, δt) = ρ̃(X_i) + (δt/2)(f(X_i) + f*(x_i)).

This calculation requires four interpolation steps using Algorithm 1: one for the scalar interpolation in line 1 and three for the vector interpolation in line 4. It also requires four FFTs: two for line 2 (two because we need to go to the spectral domain, differentiate, and then transform back to the spatial domain) and two for line 6. The other parts are triple "for loops" over all the grid points i in Ω_r. The FFTs require global synchronizations. The total cost of the incremental state solve per time step is four 3D FFTs and two interpolations. The incremental adjoint solve requires the same computations (for divergence-free velocity fields).
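The backward-characteristic pattern underlying Algorithm 2 can be illustrated in 1D for the pure transport equation ∂ₜρ + v ∂ₓρ = 0 with constant velocity. This is a hedged sketch: the production solver uses tricubic interpolation and a second-order scheme for the characteristics, while this sketch uses linear interpolation and an exact trace-back:

```cpp
#include <cmath>
#include <vector>

// Linear interpolation of a periodic grid function at point x (grid units).
// The production code uses tricubic interpolation; linear keeps this short.
static double interp(const std::vector<double>& f, double x) {
    int N = (int)f.size();
    x = std::fmod(std::fmod(x, (double)N) + N, (double)N);
    int i = (int)x;
    double t = x - i;
    return (1.0 - t) * f[i] + t * f[(i + 1) % N];
}

// One semi-Lagrangian time step: trace the characteristic backward from each
// grid point x_i to the departure point X_i = x_i - v*dt, then interpolate.
std::vector<double> semi_lagrangian_step(const std::vector<double>& rho,
                                         double v, double dt) {
    int N = (int)rho.size();
    std::vector<double> out(N);
    for (int i = 0; i < N; ++i)
        out[i] = interp(rho, (double)i - v * dt);  // rho at departure point X_i
    return out;
}
```

Because the departure point is traced backward, the step remains stable even when v·δt exceeds the grid spacing, which is why the number of time steps can be kept small.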
4) Complexity of Hessian matvec and overall algorithm:
Every Hessian matvec requires one incremental forward and one incremental adjoint solve, i.e., O(n_t) FFTs and O(n_t) interpolations. The remaining operations, applying the regularization and the preconditioner, are negligible, since they each involve just two FFTs. The gradient is also cheaper, since (2b) and (3) are simpler than the equations in (5). Regarding memory, every task needs to store n_t N/p + 5 N/p values for the incremental adjoint and state variables. Therefore, accounting for the complexities of the FFT and the interpolation, we obtain

T_flop ≈ n_t ( (N/p) log N + 4 · 600 N/p ),
T_mpi ≈ n_t ( t_s √p + t_w N/p ) + 4 n_t ( t_s + t_w N/p ).

This estimate assumes that the semi-Lagrangian points are uniformly distributed across processors; this is not guaranteed and depends on the velocity field and the CFL number. In practice the interpolation is the predominant cost of the calculation, at least for the problem sizes we have tested. For fixed β the number of Newton iterations is independent of the mesh size; the inversion of these highly ill-conditioned operators is done in linear time.

Fig. 5:
3D visualization of a synthetic registration problem (volume rendering). From left to right: (i) reference image ρ_R, (ii) template image ρ_T, and (iii) initial (before registration) residual differences between ρ_R and ρ_T. The reference image ρ_R is generated from ρ_T by solving the forward problem with a known velocity v* (details can be found in the text). Dark areas indicate large residual differences and white areas zero residual differences.

IV. RESULTS
A. Experimental setup
In this section, we give details on the experimental setup we used to test our solver.
1) Images:
We use one real-world and one synthetic image to test our algorithm. For the synthetic case we construct the template image as ρ_T(x) = (sin(x1) + sin(x2) + sin(x3))/3; the velocity is given by v*(x) = (cos(x1) sin(x1), cos(x2) sin(x2), cos(x3) sin(x3))ᵀ; the reference image ρ_R is the solution of (2b) with the exact velocity v* (see Figure 5 for an illustration of this problem). We use a synthetic case to perform the scaling studies, since medical images come with a fixed resolution/grid size. To test our scheme on real medical images, we use two 3D MRI brain images of different individuals ("multi-subject registration problem"). This data is from the Non-rigid Registration Evaluation Project (NIREP) [14] (see Figure 6 for an illustration).
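The synthetic template construction can be sketched as follows, assuming the sinusoidal template given above, sampled on an N³ periodic grid over [0, 2π)³; the grid size and function name are illustrative:

```cpp
#include <cmath>
#include <vector>

// Sample the synthetic template rho_T(x) = (sin(x1) + sin(x2) + sin(x3)) / 3
// on an N^3 periodic grid over [0, 2*pi)^3, flattened as rho[(k*N + j)*N + i].
std::vector<double> synthetic_template(int N) {
    const double h = 2.0 * std::acos(-1.0) / N;  // grid spacing
    std::vector<double> rho((size_t)N * N * N);
    for (int k = 0; k < N; ++k)
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                rho[((size_t)k * N + j) * N + i] =
                    (std::sin(i * h) + std::sin(j * h) + std::sin(k * h)) / 3.0;
    return rho;
}
```

The reference image would then be obtained by advecting this template with the exact velocity v* using the semi-Lagrangian forward solver.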
2) Implementation and Hardware:
Our code is implemented in C++ and uses MPI and the OpenMP library for multithreading. The code is compiled with the Intel C++ compiler using the -O3 flag. Although we have GPU implementations both for the FFT and the interpolation, we have not used accelerators for the results we report here. We carry out runtime experiments on TACC's "Maverick" system. Each compute node contains dual ten-core Intel Xeon E5-2680 v2 (Ivy Bridge) processors running at 2.8GHz with 12.8GB/core of memory. Each node also has an NVIDIA Tesla K40 GPU accelerator. We also report large-scale runs on TACC's "Stampede" system (two eight-core Xeon E5-2680 v1 (Sandy Bridge) processors with 32GB host memory per node).

(For the incompressible case we use a similar but divergence-free velocity field v*. The NIREP data is available at http://nirep.org; the interested reader is referred to [14] for more details. We consider the first two datasets, na01 and na02, from this repository.)

As mentioned before, we use PETSc's TAO for the nonlinear optimization, the vector operations of PETSc for the linear algebra, and AccFFT for the Fourier transforms. The basic interface to TAO consists of the functional, gradient, Hessian matvec, and preconditioner, as well as routines to select the tolerances for the nonlinear solver and for the Newton steps.
3) Parameters:
The regularization parameter β is fixed for the scalability runs (for both the synthetic and the real-world brain example). The number of time steps n_t controls the accuracy and should be related to the CFL number. For simplicity, and to be able to compare different cases, we have kept it fixed at n_t = 4. The gradient tolerance g_tol is fixed unless otherwise stated. We use an inexact Newton method with quadratic forcing. We do not report continuation results; all the runs are done for a single (experimentally determined) value of β. We report (i) wall-clock times, (ii) communication times, as well as (iii) the time to solution for our method with respect to different registration problems and parameter settings. Since the problem is non-convex and we are not interested in high-accuracy solutions, we opt for a Gauss-Newton approximation.

B. Scalability using synthetic images
We use a fixed set of parameters, which we experimentally determined to yield a good balance between computational complexity and computational performance. We illustrate the registration problem in Figure 5. We report results for different grid resolutions and different numbers of cores and MPI task configurations. The results are reported in Table I ("Maverick" runs) and Table II (large-scale "Stampede" runs). First, we interpret the "Maverick" runs. Similar conclusions can be drawn for the larger set; again, going from 16 tasks to 256 tasks, we observe 50% efficiency.

(Many image registration codes do not even compute gradients, and the termination criterion is the number of iterations. Given the limitations in the resolution and the image quality, a relative reduction of the gradient by 1% is typically considered quite excessive. We use FFTs for the discretization of differential operators, since this allows us to invert them at the cost of a spectral diagonal scaling. This offers the opportunity to exactly fulfill and efficiently eliminate the incompressibility constraint from the optimality system. It also allows for an efficient preconditioning of the Hessian with essentially no construction cost; see § III for details.)

Fig. 6:
3D visualization of the registration problem for the brain images. From left to right: (i) reference image ρ_R, (ii) template image ρ_T, (iii) the residual differences between ρ_R and ρ_T (before registration), and (iv) the residual differences between ρ_R and ρ_T(y) (deformed template image; after registration). Dark areas indicate a large residual and white areas no residual differences.

Table I:
Computational performance of our solver for the synthetic registration problem illustrated in Figure 5 on TACC's "Maverick" computing system. We neglect the incompressibility constraint for these runs. We report the time to solution, and the communication and execution times for the FFT and the interpolation, respectively (in seconds). We report timings as a function of the number of unknowns (in space) and the number of nodes and tasks. We use 16 tasks per node.
The overall timings are 15.2 seconds, 23 seconds, and 32 seconds, respectively, which again is not perfect. If we look more closely at how the time is allocated, we observe that the execution time for the FFT scales perfectly in these three runs (1.35 seconds, 1.56 seconds, and 1.77 seconds, respectively). The interpolation execution also scales well, both in terms of communication and computation. The deterioration of the overall time is due to the FFT communication costs. The largest problem we solved for the synthetic case was run on "Stampede" (Table II).

C. Real-world registration problem
We report exemplary results for the brain data sets illustrated in Figure 6. We set the gradient tolerance as before, limit the maximal number of outer iterations (Newton steps) to 50, and use a fixed regularization weight β. We study strong scaling and the sensitivity of the convergence of our solver with respect to changes in the regularization weight β. We report scalability results for the brain images in Table IV. We display exemplary results for the considered datasets in Figure 7. We report results for varying choices of the regularization parameter β in Table V.

Table V: Sensitivity of the computational workload with respect to varying regularization weights β. We report results for four Newton iterations for the brain images. We report the number of Hessian matvecs and the time to solution (in seconds) and, in parentheses, its relative increase from the base case (1.0, 4.6, and 35.0 for decreasing β).

We observe that we can significantly reduce the computational timings if we switch to parallel architectures. The scaling results are consistent with what we observed for the synthetic data sets. We can reduce the wall-clock time by two orders of magnitude if we change from one task on one node to 64 MPI tasks on 32 nodes. We can fit the entire problem on one node. This demonstrates the practicability of our solver. The communication and execution times of the FFT and the interpolator drop significantly as we increase

Table II: Computational performance of our solver for the synthetic registration problem illustrated in Figure 5 on TACC's "Stampede" computing system. We neglect the incompressibility constraint for these runs. We report the time to solution, and the communication and execution times for the FFT and the interpolation, respectively (in seconds). We report timings as a function of the number of unknowns (in space) and the number of nodes and tasks. We use 2 tasks per node.
Table III: Computational performance of our solver for a synthetic registration problem similar to the one illustrated in Figure 5 on TACC's "Maverick" computing system. We use the incompressibility constraint for these runs (mass-preserving diffeomorphism). We report the time to solution, and the communication and execution times for the FFT and the interpolation, respectively (in seconds). We report results for a fixed grid size as a function of the number of nodes and tasks. We use 2 tasks per node.
Fig. 7:
Exemplary registration results for the brain data sets. We display, from left to right, axial slices of (i) the reference image ρ_R, (ii) the template image ρ_T, the residual differences between ρ_R and ρ_T (iii) before registration and (iv) after registration, (v) a point-wise map of the determinant of the deformation gradient (the color map represents volume change ranging from 0 to 2, where det(∇y) = 0 is black, det(∇y) ∈ (0, 1) corresponds to different shades of orange (from dark to bright), and det(∇y) ≥ 1 is white), and (vi) the deformed template image ρ_T(y) with a grid in overlay (closeup to the right). The values for the determinant of the deformation gradient are strictly positive (i.e., the deformation map is diffeomorphic).

Table IV: Strong scaling results for the brain images computed on "Maverick". We set the regularization parameter β to the same value as in the other brain runs. We perform two Newton iterations for these scalability runs. We report the number of nodes, the number of MPI tasks, and the communication and execution times for the FFT and the interpolation (in seconds).
the number of nodes. The interpolation time again contributes critically (about or more than 50% of the time to solution). As for the sensitivity with respect to the regularization parameter, we can see that the number of Hessian matvecs (a proxy for the overall number of Newton-Krylov iterations) increases as we reduce the regularization parameter β. The time to solution increases by a factor of 35 for the smallest β reported here. This clearly demonstrates that the performance of our preconditioner is not ideal; it deteriorates with a reduction in β. As we have seen in the former section, the solver behaves independently of the mesh size. Implementing an improved scheme for preconditioning the Hessian requires more work.

V. CONCLUSION
We presented a complete algorithm for large deformation diffeomorphic medical image registration. We were able to solve problems of unprecedented scale. One may ask how such runs translate to a clinical setting. As the cost of computing drops, we are hopeful that 32- and 256-task calculations will be possible at a modest cost.

The proposed algorithm is flexible and scalable. It supports different types of regularization functionals and can be extended to different image distance measures. Our approach can be easily extended to vector images and, with some additional work, can also be extended to non-stationary (time-varying) velocities [30], [9]. This will be necessary to register time series of images or to solve optical flow problems. All the parallelism-related issues remain the same. A major remaining challenge is the design of preconditioners that are insensitive to the regularization parameter.

Finally, our algorithm relates to other applications besides medical imaging, for example, applications in weather prediction and ocean physics (for tracking Lagrangian tracers in the oceans) [37], the reconstruction of porous media flows [20], and the registration of micro-CTs for material science and biology [32]. Although our method is highly optimized for regular grids with periodic boundary conditions, many aspects of our algorithm carry over.
REFERENCES

[1] S. S. Adavani and G. Biros, "Fast algorithms for source identification problems with elliptic PDE constraints," SIAM Journal on Imaging Sciences, vol. 3, no. 4, pp. 791–808, 2008.
[2] V. Akcelik, G. Biros, and O. Ghattas, "Parallel multiscale Gauss-Newton-Krylov methods for inverse wave propagation," in Proc ACM/IEEE Conference on Supercomputing, 2002, pp. 1–15.
[3] Y. Amit, "A nonlinear variational problem for image matching," SIAM Journal on Scientific Computing, vol. 15, no. 1, pp. 207–224, 1994.
[4] J. Ashburner, "A fast diffeomorphic image registration algorithm," NeuroImage, vol. 38, no. 1, pp. 95–113, 2007.
[5] J. Ashburner and K. J. Friston, "Diffeomorphic registration using geodesic shooting and Gauss-Newton optimisation," NeuroImage, vol. 55, no. 3, pp. 954–967, 2011.
[6] B. B. Avants, N. J. Tustison, G. Song, P. A. Cook et al., "A reproducible evaluation of ANTs similarity metric performance in brain image registration," NeuroImage, vol. 54, pp. 2033–2044, 2011.
[7] S. Balay, S. Abhyankar, M. F. Adams, J. Brown et al.
[9] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes, "Computing large deformation metric mappings via geodesic flows of diffeomorphisms," International Journal of Computer Vision, vol. 61, no. 2, pp. 139–157, 2005.
[10] A. Borzì and V. Schulz, Computational optimization of systems governed by partial differential equations. Philadelphia, Pennsylvania, US: SIAM, 2012.
[11] J. P. Boyd, Chebyshev and Fourier spectral methods. Mineola, New York, US: Dover, 2000.
[12] M. Burger, J. Modersitzki, and L. Ruthotto, "A hyperelastic regularization energy for image registration," SIAM Journal on Scientific Computing, vol. 35, no. 1, pp. B132–B148, 2013.
[13] K. Chen and D. A. Lorenz, "Image sequence interpolation using optimal control," Journal of Mathematical Imaging and Vision, vol. 41, pp. 222–238, 2011.
[14] G. E. Christensen, X. Geng, J. G. Kuhl, J. Bruss et al., "Introduction to the non-rigid image registration evaluation project," in Proc Biomedical Image Registration, vol. LNCS 4057, 2006, pp. 128–135.
[15] G. Crippa, "The flow associated to weakly differentiable vector fields," Ph.D. dissertation, University of Zurich, 2007.
[16] K. Czechowski, C. Battaglino, C. McClanahan, K. Iyer et al., "On the communication complexity of 3D FFTs and its implications for exascale," in Proc ACM/IEEE Conference on Supercomputing, 2012, pp. 205–214.
[17] S. C. Eisenstat and H. F. Walker, "Choosing the forcing terms in an inexact Newton method," SIAM Journal on Scientific Computing, vol. 17, no. 1, pp. 16–32, 1996.
[18] A. Eklund, P. Dufort, D. Forsberg, and S. M. LaConte, "Medical image processing on the GPU–past, present and future," Medical Image Analysis, vol. 17, no. 8, pp. 1073–1094, 2013.
[19] M. Falcone and R. Ferretti, "Convergence analysis for a class of high-order semi-Lagrangian advection schemes," SIAM Journal on Numerical Analysis, vol. 35, no. 3, pp. 909–940, 1998.
[20] J. Fohring, E. Haber, and L. Ruthotto, "Geophysical imaging of fluid flow in porous media," SIAM Journal on Scientific Computing, 2016, in review (arXiv preprint: http://arxiv.org/abs/1506.07933).
[23] A. Gholami, A. Mang, and G. Biros, "An inverse problem formulation for parameter estimation of a reaction-diffusion model of low grade gliomas," Journal of Mathematical Biology.
[25] A. Grama, A. Gupta, G. Karypis, and V. Kumar, An Introduction to parallel computing: Design and analysis of algorithms, 2nd ed. Addison Wesley, 2003.
[26] M. D. Gunzburger, Perspectives in flow control and optimization. Philadelphia, Pennsylvania, US: SIAM, 2003.
[27] M. E. Gurtin, An introduction to continuum mechanics, ser. Mathematics in Science and Engineering. Academic Press, 1981, vol. 158.
[28] L. Ha, J. Krüger, S. Joshi, and T. C. Silva, "Multi-scale unbiased diffeomorphic atlas construction on multi-GPUs," GPU Computing Gems Emerald Edition, vol. 1, pp. 771–791, 2010.
[29] E. Haber and J. Modersitzki, "Numerical methods for volume preserving image registration," Inverse Problems, vol. 20, pp. 1621–1638, 2004.
[30] G. L. Hart, C. Zach, and M. Niethammer, "An optimal control approach for deformable registration," in Proc IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 9–16.
[31] M. Hinze, R. Pinnau, M. Ulbrich, and S. Ulbrich, Optimization with PDE constraints. Berlin, DE: Springer, 2009.
[32] S. T. Ho and W. D. Hutmacher, "A comparison of micro CT with other techniques used in the characterization of scaffolds," Biomaterials, vol. 27, no. 8, pp. 1362–1376, 2006.
[33] F. Ino, K. Ooyama, and K. Hagihara, "A data distributed parallel algorithm for nonrigid image registration," Parallel Computing, vol. 31, no. 1, pp. 19–43, 2005.
[34] K. Ito and K. Kunisch, Lagrange multiplier approach to variational problems and applications. Philadelphia, Pennsylvania, US: SIAM, 2008, vol. 15.
[35] J. Shackleford, N. Kandasamy, and G. Sharp, High performance deformable image registration algorithms for manycore processors. Morgan Kaufmann, 2013.
[36] R. Kakinuma, N. Moriyama, Y. Muramatsu, S. Gomi et al., "Ultra-high-resolution computed tomography of the lung: Image quality of a prototype scanner," PLoS ONE, vol. 10, no. 9, p. e0137165, 2015.
[37] E. Kalnay, Atmospheric modeling, data assimilation and predictability. Oxford University Press, 2002.
[38] A. Klein, J. Andersson, B. A. Ardekani, J. Ashburner et al., "Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI registration," NeuroImage, vol. 46, no. 3, pp. 786–802, 2009.
[39] S. Klein, M. Staring, K. Murphy, M. A. Viergever, and J. P. W. Pluim, "ELASTIX: A toolbox for intensity-based medical image registration," Medical Imaging, IEEE Transactions on, vol. 29, no. 1, pp. 196–205, 2010.
[40] R. J. LeVeque, Numerical methods for conservation laws. Springer, 1992, vol. 132.
[41] J.-L. Lions, Some aspects of the optimal control of distributed parameter systems. Philadelphia, Pennsylvania, US: SIAM, 1972.
[42] Y. Liu, A. Fedorov, R. Kikinis, and N. Chrisochoides, "Real-time non-rigid registration of medical images on a cooperative parallel architecture," in Proc IEEE International Conference on Bioinformatics and Biomedicine, 2009, pp. 401–404.
[43] M. Lorenzi, N. Ayache, G. B. Frisoni, and X. Pennec, "LCC-demons: a robust and accurate symmetric diffeomorphic registration algorithm," NeuroImage, vol. 81, pp. 470–483, 2013.
[44] F. Lüsebrink, A. Wollrab, and O. Speck, "Cortical thickness determination of the human brain using high resolution 3T and 7T MRI data," NeuroImage, vol. 70, pp. 122–131, 2013.
[45] A. Mang and G. Biros, "An inexact Newton–Krylov algorithm for constrained diffeomorphic image registration," SIAM Journal on Imaging Sciences, vol. 8, no. 2, pp. 1030–1069, 2015.
[46] ——, "Constrained H-regularization schemes for diffeomorphic image registration," SIAM Journal on Imaging Sciences, 2016, to appear (arXiv preprint: http://arxiv.org/abs/1503.00757).
[47] ——, "A Semi-Lagrangian two-level preconditioned Newton–Krylov solver for constrained diffeomorphic image registration," arXiv e-prints, 2016, in review (arXiv preprint: http://arxiv.org/abs/1604.02153).
[48] M. Modat, G. R. Ridgway, Z. A. Taylor, M. Lehmann et al., "Fast free-form deformation using graphics processing units," Computer Methods and Programs in Biomedicine, vol. 98, no. 3, pp. 278–284, 2010.
[49] J. Modersitzki, Numerical methods for image registration. New York: Oxford University Press, 2004.
[50] ——, FAIR: Flexible algorithms for image registration. Philadelphia, Pennsylvania, US: SIAM, 2009.
[51] J. Nocedal and S. J. Wright, Numerical Optimization. New York, New York, US: Springer, 2006.
[52] J. A. Schackleford, N. Kandasamy, and G. C. Sharp, "On developing B-spline registration algorithms for multi-core processors," Physics in Medicine and Biology, vol. 55, no. 21, pp. 6329–6351, 2010.
[53] R. Shams, P. Sadeghi, R. A. Kennedy, and R. I. Hartley, "A survey of medical image registration on multicore and the GPU," Signal Processing Magazine, IEEE, vol. 27, no. 2, pp. 50–60, 2010.
[54] D. Shen and C. Davatzikos, "Very high-resolution morphometry using mass-preserving deformations and HAMMER elastic registration," NeuroImage, vol. 18, no. 1, pp. 28–41, 2003.
[55] A. Sotiras, C. Davatzikos, and N. Paragios, "Deformable medical image registration: A survey," Medical Imaging, IEEE Transactions on, vol. 32, no. 7, pp. 1153–1190, 2013.
[56] Z. Starosolski, C. A. Villamizar, D. Rendon, M. J. Paldino et al., "Ultra high-resolution in vivo computed tomography imaging of mouse cerebrovasculature using a long circulating blood pool contrast agent," Scientific Reports, vol. 5, no. 10178, 2015.
[57] R. Temam, Navier–Stokes equations: Theory and numerical analysis. North-Holland Pub. Co., 1977.
[58] R. Tomer, L. Ye, B. Hsueh, and K. Deisseroth, "Advanced CLARITY for rapid and high-resolution imaging of intact tissues," Nature Protocols, vol. 9, no. 7, pp. 1682–1697, 2014.
[59] T. ur Rehman, E. Haber, G. Pryor, J. Melonakos, and A. Tannenbaum, "3D nonrigid registration via optimal mass transport on the GPU," Medical Image Analysis, vol. 13, no. 6, pp. 931–940, 2009.
[60] T. Vercauteren, X. Pennec, A. Perchant, and N. Ayache, "Symmetric log-domain diffeomorphic registration: A demons-based approach," in Proc Medical Image Computing and Computer-Assisted Intervention, vol. LNCS 5241, 2008, pp. 754–761.
[61] ——, "Diffeomorphic demons: Efficient non-parametric image registration," NeuroImage, vol. 45, no. 1, pp. S61–S72, 2009.
[62] S. K. Warfield, M. Ferrant, X. Gallez, A. Nabavi et al., "Real-time biomechanical simulation of volumetric brain deformation for image guided neurosurgery," in Proc ACM/IEEE Conference on Supercomputing, 2000, pp. 23–23.
[63] Y. Yin, E. A. Hoffman, and C.-L. Lin, "Mass preserving nonrigid registration of CT lung images using cubic B-spline," Medical Physics, vol. 36, no. 9, pp. 4213–4222, 2009.
[64] M.-Q. Zhang, L. Zhou, Q.-F. Deng, Y.-Y. Xie et al., "Ultra-high-resolution 3D digitalized imaging of the cerebral angioarchitecture in rats using synchrotron radiation."