Distributed-memory large deformation diffeomorphic 3D image registration
Andreas Mang, Amir Gholami, and George Biros
The Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, Texas 78712
[email protected]; [email protected]; [email protected]
Abstract—We present a parallel distributed-memory algorithm for large deformation diffeomorphic registration of volumetric images that produces large isochoric deformations (locally volume preserving). Image registration is a key technology in medical image analysis. Our algorithm uses a partial differential equation constrained optimal control formulation. Finding the optimal deformation map requires the solution of a highly nonlinear problem that involves pseudo-differential operators, biharmonic operators, and pure advection operators both forward and backward in time. A key issue is the time to solution, which poses the demand for efficient optimization methods as well as an effective utilization of high performance computing resources. To address this problem we use a preconditioned, inexact, Gauss-Newton-Krylov solver. Our algorithm integrates several components: a spectral discretization in space, a semi-Lagrangian formulation in time, analytic adjoints, different regularization functionals (including volume-preserving ones), a spectral preconditioner, a highly optimized distributed Fast Fourier Transform, and a cubic interpolation scheme for the semi-Lagrangian time-stepping. We demonstrate the scalability of our algorithm on high-resolution images on the "Maverick" and "Stampede" systems at the Texas Advanced Computing Center (TACC). The critical problem in the medical imaging application domain is strong scaling, that is, solving registration problems of a moderate size that is typical for medical images. We are able to solve the registration problem for images of this size in less than five seconds on 64 x86 nodes of TACC's "Maverick" system.

Index Terms—Diffeomorphic Image Registration, Optimal Control, Newton-Krylov Methods, Scientific Computing, High Performance Computing.
I. INTRODUCTION
Deformable registration (also known as image alignment, warping, or matching) refers to methods that find point correspondences between images by comparing image intensities.
This material is based upon work supported by AFOSR grants FA9550-12-10484 and FA9550-11-10339; by NSF grant CCF-1337393; by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Numbers DE-SC0010518 and DE-SC0009286; by NIH grant 10042242; by DARPA grant W911NF-115-2-0121; and by the Technische Universität München, Institute for Advanced Study, funded by the German Excellence Initiative (and the European Union Seventh Framework Programme under grant agreement 291763). Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the AFOSR, the DOE, the NIH, the DARPA, or the NSF. Computing time on the Texas Advanced Computing Center's Stampede system was provided by an allocation from TACC and the NSF.
We refer to these point correspondences as the deformation map (see Figure 1 in § II). An example of a low-dimensional image registration problem is affine registration; it creates simple maps consisting of rotations, translations, and scalings [49]. Typically, affine registration is used as an initialization step for large deformation diffeomorphic registration (LDDR), which is the problem we are concerned with in the present work. In LDDR, we typically search for a deformation map for which the number of degrees of freedom is the dimension of the ambient space times the number of grid points defined in the image space. LDDR is much more flexible than affine registration and thus, in general, more informative in clinical studies [55], [38]. Such high-dimensional transformations can be defined in many different ways [49], [50], [55]. Image registration is an ill-posed inverse problem; it does not have a unique solution. Not all large deformation maps are admissible since they can shuffle the points arbitrarily to match intensities. It is crucial to impose constraints on the deformation while allowing for flexibility. The most important constraint is that the map is diffeomorphic (see Figure 2 in § II).

Solving an LDDR problem in a rigorous way requires the solution of a non-convex partial differential equation (PDE) constrained optimal control problem [10], [26], [31], [41]. This problem is ill-posed and involves non-linear and ill-conditioned operators. Most state-of-the-art packages circumvent these issues by sacrificing scalability and settling for crude solutions using simple but suboptimal algorithms. In many cases this works sufficiently well, but in several other cases, it does not. There is significant activity in trying to improve the existing algorithms. With regards to performance optimizations, most codes use open multi-processing (OpenMP) or graphics processing unit (GPU) acceleration; there are very few codes that utilize distributed memory parallelism. As a result they are not scalable to the full resolution; to solve problems for large images most codes use subsampling. This is limiting considering the current imaging resolutions. Seven Tesla magnetic resonance imaging (MRI) scanners can reach sub-millimeter resolution [44]. Ultra-high resolution computed tomography (CT) captures comparably fine resolutions [36].

Accepted for publication at SC16; Salt Lake City, Utah, USA; November 2016.

Fig. 1: The image registration problem (data taken from [50], [3]). The input (original data) are image intensities ρ_R and ρ_T. The output is y, the deformation map. Our goal is to find y so that ρ_T(y) (the deformed ρ_T) is as close as possible to ρ_R (with respect to some appropriate measure). One way to achieve this is to use rigid registration (i.e., searching for a map that is entirely described by rotations and translations). The result of a rigid registration is shown in the third image from the left; the fourth image shows the difference between the two images ρ_R and ρ_T after rigid registration. As one can see, there are still significant differences in the intensity values. If we use a deformable registration instead, we can compute a much more flexible y, which results in a much smaller misfit |ρ_R − ρ_T(y)|. The deformation map y for this method is visualized in the last figure to the right. The grid lines, superimposed on top of ρ_T(y), were a Cartesian grid before the deformation. We use them to visualize the overall deformation.

Beyond the need for strong scaling of image registration algorithms for clinical applications, there is also a need for weak scaling for imaging in biology, biophysics, and neuroscience. Animal micro-CT reaches O(µm) resolution [56], [64]. In small animal neuroimaging, CLARITY [58], a novel optical imaging technique, can deliver sub-micron resolution for the whole brain of the animal, resulting in 10-100 billion-voxel images. To our knowledge, none of the existing schemes for LDDR allow for the registration of such large volumetric images [5], [9].

Contributions: The design goals for our 3D LDDR scheme are the following: (1) ability to represent large diffeomorphic deformations; (2) algorithms based on rigorous mathematical foundations; (3) algorithmic optimality with respect to both the deformation map resolution and the image resolution; and (4) parallel scalability.
Here, we propose an algorithm that achieves these goals and has the following characteristics:
• It is based on optimal control theory. Our formulation allows the control of the registration quality in terms of image correspondence and different quality metrics for the diffeomorphism/deformation map (see § II-A).
• It uses a semi-Lagrangian approach for solving the transport equations that govern the deformation of the image. This approach leads to algorithmic optimality (see § III-B2).
• It uses a spectral discretization in space (see § III-B1). This discretization enables flexibility in the choice of regularization operators for the deformation map. Such flexibility is necessary since different image registration applications have different requirements. It also allows for efficient solvers of saddle-point linear systems.
• It uses optimal algorithms based on an adjoint-based formulation solved via a line-search globalized, inexact, preconditioned Gauss-Newton-Krylov scheme (see § III-A).
• It uses distributed-memory parallelism for scalability, employs several performance optimizations specific to our problem, and uses a parallel FFT for elliptic solvers and differentiations that has been shown to scale to hundreds of thousands of cores (see § III-C). It employs several optimizations for the most expensive part of the computation (cubic interpolation) (see § III-C2). It also supports GPU acceleration (not discussed here).
In addition, we analyze the overall complexity of our method in terms of communication, computation, and storage. The class of deformations we consider here is one of the most challenging since we enable locally volume preserving maps, which find many applications [54], [63], [29], [12]. We present results for synthetic and neurological images and demonstrate the performance of our algorithm (see § IV) for both volume-preserving and more generic deformation maps.
Related work on high performance computing methods for 3D image registration: A rich literature survey on high performance computing (HPC) in image registration can be found in [53], [18], [35]. General surveys on image registration can be found in [49], [55]. Formulations related to the one discussed in this work are reviewed in [45], [46], [47].

State-of-the-art registration packages that are used in the medical imaging and medical image computing community include ELASTIX [39], ANTS [6], DARTEL [4], and DEMONS [60], [61], [43]. All of these offer some kind of diffeomorphic registration scheme. These packages mostly support OpenMP, but do not use GPU or Message Passing Interface (MPI) acceleration (exceptions to be discussed below). An important distinction should be made between the image resolution (number of voxels) and the map resolution (number of degrees of freedom for the map parameterization). In general, the higher the map resolution, the better the registration quality [38], [39], but the harder the optimization problem since it has more degrees of freedom. Most existing codes downsample the map resolution significantly. (In the medical imaging jargon, volume-preserving maps are referred to as "mass preserving" maps.)

The majority of researchers have used GPUs to accelerate the calculations. For example, the solver for the LDDR scheme in [28] uses a preconditioned gradient descent (not Newton) algorithm with a hardware-provided trilinear interpolation on a GPU architecture. It supports the distributed solution of multiple independent LDDR image registration problems (in an embarrassingly parallel way), but does not support distributed memory parallelism for a single LDDR problem.

Two popular packages that exploit GPU acceleration are NIFTYREG [48] and PLASTIMATCH [52]. They use B-spline parameterized low-resolution maps, a tri-linear interpolation scheme, and gradient descent type optimization; NIFTYREG supports soft constraints to penalize volume change. An MPI version of NIFTYREG for bigger images [33] exists, but the map resolution remains the same. The pioneering works [62], [42] support MPI. Their formulation is based on elastic deformation maps (not LDDR). A GPU-LDDR scheme that supports somewhat high-resolution maps is [59]. It uses steepest descent (not Newton) and does not support MPI; no timings are reported.

To summarize, existing schemes do not support scalable LDDR algorithms and no scaling studies have been reported.

Limitations: In multiframe volume registration (e.g., 4D Cine-MRI) one seeks to register multiple images using a smooth, continuous mapping [13], [9].
Our solver can be used as is, but our diffeomorphic map parameterization is better suited for registering two images. Our parameterization can be extended without any major algorithmic changes, but the software engineering would require some work. Another missing piece is a preconditioner that is insensitive to the regularization parameter. There are several techniques for doing so, e.g., grid continuation and multilevel preconditioning [10], [1], [47]. Here we focus on the single-level solver. The single-node performance of the interpolation can be improved by more sophisticated blocking, manual vectorization, and possibly prefetching. Similar considerations hold true for the GPU version of the interpolation. This is ongoing work. Another limitation is that we only consider a discretization on Cartesian grids. This is not always the best grid [62]. The structure of our algorithm changes significantly for unstructured grids.

II. BACKGROUND
Let Ω = [0, 2π)³ ⊂ R³ be the spatial domain in which we define functions (images). ∂Ω denotes the boundary of Ω and x a point in Ω. Let ρ(x) ∈ R be a function defined on Ω. In imaging, ρ(x) is the image intensity at a point x; in optimal control, ρ(x) is the state field. In the registration problem, we have a reference image, denoted by ρ_R(x), and a template image, denoted by ρ_T(x); the goal is to find a vector-valued deformation map, denoted by y(x), that maps a point of the template image ρ_T to a corresponding point in the reference image ρ_R [49], [50].

Let v(x) be the velocity field that generates the map y. In our formulation, we introduce a pseudotime t and denote the deformation of the template image at time t by ρ(x, t). We define ρ(x, t = 0) to be the undeformed template image ρ_T, and ρ(x, t = 1) to be the result of applying the deformation map (which needs to be compared to ρ_R(x)). For the optimal control problem, λ(x, t) is the adjoint field, H is the reduced Hessian operator, g is the gradient field, and β > 0 is the scalar regularization parameter. We use periodic boundary conditions for all differential operators. For the discretization, N_i is the number of grid points in the i-th dimension; N = N_1 N_2 N_3 is the total number of unknowns in space; n_t is the number of time steps and δt the time step size. We use p for the number of MPI tasks, t_s for the latency in seconds, and t_w for the reciprocal of the bandwidth when we do complexity analysis. Boldface lowercase symbols indicate vectors in R³.

Fig. 2: Here we illustrate diffeomorphic and non-diffeomorphic transformations in 2D by considering an infinitesimal area Ω̃. We can think of Ω̃ as a single grid cell (pixel). The first figure on the left depicts the undeformed area. The second figure shows an admissible deformation that shrinks the original area. The third figure depicts an area-preserving deformation (in 3D it would be volume preserving). The fourth figure shows the result of a deformation that expands the area. The fifth illustration is not diffeomorphic since material lines that did not cross each other in the original (leftmost illustration) cross now. The sixth, rightmost, illustration corresponds to a singular deformation in which all the spatial information is lost by shrinking Ω̃ to a single point. The last two deformations are not useful in image analysis. However, without any constraints on the deformation map, pixels with non-diffeomorphic behavior almost always appear. For this reason appropriate regularization of the problem is necessary.

A. Image registration
The image registration problem can be abstractly defined as follows. Given two functions (i.e., images) ρ_T(x) and ρ_R(x), we seek a vector function y(x) such that the L²-distance (i.e., the residual) between ρ_T(y(x)) and ρ_R(x) is minimal. (Other types of distance measures can be used; see, e.g., [49], [50], [55]. There are no significant changes in our formulation or algorithm if we consider other popular distance measures.) We can think of y(x) as deforming an (infinite) grid of points in the template image ρ_T so that their intensity after the deformation matches the reference image ρ_R in the L²-norm. It can be shown that this problem has an infinite number of solutions, most of which are not useful. To resolve this, one needs to impose additional constraints, such as smoothness (i.e., ∇y(x) exists), although this alone may not be sufficient to ensure that the map is plausible. Note that we require det(∇y(x)) > 0 for all x ∈ Ω to guarantee that y(x) is a diffeomorphism (for an illustration, see Figure 2). In velocity-based LDDR it can be shown that such a diffeomorphic map y exists if the generating velocity field v(x, t) is adequately smooth [9], [15], [13]; a typical requirement is that v lies in an appropriate Sobolev space [9]. The deformation map y can be computed from v by solving

∂_t y(x, t) + v(x, t) · ∇y(x, t) = 0,  y(x, 0) = x,  (1)

where y(x) = y(x, 1). It can be shown that if the velocity is incompressible (i.e., div v = 0 for every x in Ω), then det ∇y(x, t) = 1 and the diffeomorphism is referred to as volume preserving [27]. Here, we consider both the general and the incompressible velocity cases. The latter case is more challenging. We only consider stationary velocity fields, that is, v(x, t) = v(x). In the following we drop the dependence of the functions on the spatial position x for notational convenience.
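The volume-preservation property can be checked numerically. The following 2D sketch (not the paper's 3D code; the velocity field and all parameters are illustrative) traces the characteristics of (1) for the divergence-free stationary field v = (sin(x2), sin(x1)) and verifies that det ∇y stays close to 1:

```python
import numpy as np

# Periodic 2D grid on [0, 2*pi)^2; v = (sin(x2), sin(x1)) satisfies div v = 0.
N, nt = 64, 40
h, dt = 2 * np.pi / N, 1.0 / 40
x1, x2 = np.meshgrid(np.arange(N) * h, np.arange(N) * h, indexing="ij")

def v(X1, X2):
    return np.sin(X2), np.sin(X1)

# Trace the characteristics of (1) at every grid point with an RK2 scheme.
# Since v is given analytically, no interpolation is needed in this sketch.
Y1, Y2 = x1.copy(), x2.copy()
for _ in range(nt):
    v1, v2 = v(Y1, Y2)
    Y1s, Y2s = Y1 - dt * v1, Y2 - dt * v2        # predictor step
    w1, w2 = v(Y1s, Y2s)
    Y1 = Y1 - 0.5 * dt * (v1 + w1)               # corrector step
    Y2 = Y2 - 0.5 * dt * (v2 + w2)

# det(grad y) via spectral derivatives of the periodic displacement u = y - x.
k = np.fft.fftfreq(N, d=1.0 / N)                 # integer wavenumbers
K1, K2 = np.meshgrid(k, k, indexing="ij")
def dd(u, K):
    return np.real(np.fft.ifft2(1j * K * np.fft.fft2(u)))
u1, u2 = Y1 - x1, Y2 - x2
det = (1 + dd(u1, K1)) * (1 + dd(u2, K2)) - dd(u1, K2) * dd(u2, K1)
print(np.abs(det - 1).max())   # small: only the time-discretization error
```

The deviation of det ∇y from 1 is governed by the RK2 time-stepping error, consistent with the claim that incompressible velocities generate volume-preserving maps.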
The solution to the image registration problem can be found by solving the following PDE-constrained optimization problem [30], [13], [45]:

min_v J[v] = 1/2 ‖ρ_1 − ρ_R‖²_{L²(Ω)} + β/2 ‖∇²v‖²_{L²(Ω)}  (2a)

subject to

∂_t ρ(t) + v · ∇ρ(t) = 0 in Ω × (0, 1],  (2b)
ρ(0) = ρ_T in Ω,  (2c)
div v = 0 in Ω.  (2d)

The second term of J enforces smoothness of v and β > 0 is the regularization parameter. In this formulation, ρ_1(x) = ρ(x, 1) ≡ ρ_T(y), where y is the solution of (1) at t = 1. The constraint (2b) defines an implicit function between ρ and v; given v we solve (2b) for ρ.

a) Computing the gradient of J: Given v, we need several steps to compute the gradient g = ∂_v J. First we compute ρ(1) by solving (2b) (with initial condition defined by (2c)). Then we compute the adjoint function λ(t) by solving the backward-in-time adjoint equation [34], [10]

−∂_t λ(t) − div(v λ(t)) = 0 in Ω × [0, 1),  λ(1) = ρ_R − ρ(1) in Ω.  (3)

Once we have the state and adjoint fields, we can evaluate the gradient given by

g(v) := β ∇⁴v + (I − ∇∇⁻² div) ∫₀¹ λ(t) ∇ρ(t) dt.  (4)

(It can be shown that the space of possible diffeomorphisms generated by time-varying velocities is strictly larger than the space generated by stationary velocities. This does not have practical implications when we register two images; see, e.g., [45]. However, it is restrictive if we want to register a sequence of images, as in optical flow problems.)
The operator P = I − ∇∇⁻² div, also known as the Leray operator [57], eliminates the incompressibility constraint for v. Furthermore, we define the vector field

b(x) := ∫₀¹ λ(x, t) ∇ρ(x, t) dt,

so that g(v) = β∇⁴v + P b. The gradient g is a nonlinear elliptic integro-differential operator where the state (forward in time) and adjoint (backward in time) transport PDEs are "hidden" in b.

Evaluating the gradient requires solving two transport equations, inverting the Laplacian, and applying gradient, divergence, and biharmonic operators. (If we do not wish to compute a volume preserving map, we can drop, i.e., not enforce, the incompressibility constraint. Then, in the gradient calculation, we only need to replace the P operator with an identity operator.) The first-order optimality condition for (2) requires that g(v) = 0. Most registration packages use steepest descent (first order) methods to find an optimal point (minimizer) [6], [9], [13]. However, steepest descent methods only have a linear convergence rate. Using Newton methods, which provide a much better convergence rate, is considered to be prohibitive, especially for LDDR, for two main reasons. First, it is cumbersome to derive the equations for the second order optimality conditions. Second, a naive implementation of Newton methods can be very costly if not done carefully.

b) The Newton and Gauss-Newton Hessian operators H: To solve g(v) = 0 for v we use an Armijo line-search globalized Newton method [51]. The key operation is the action of the Hessian H(v) on a vector field ṽ, which is commonly referred to as the Hessian matvec. This matvec is computed by performing the following steps: First, we need to solve (2b) and (3) to compute the state and adjoint variables ρ and λ, respectively.
After computing these fields, we need to solve (5a) for the incremental state variable ρ̃ and (5c) for the incremental adjoint variable λ̃, accumulate them in time to compute b̃, and finally evaluate (5e):

∂_t ρ̃(t) + v · ∇ρ̃(t) + ṽ · ∇ρ(t) = 0 in Ω × (0, 1],  (5a)
ρ̃(0) = 0 in Ω,  (5b)
−∂_t λ̃(t) − div(λ̃(t) v + λ(t) ṽ) = 0 in Ω × [0, 1),  (5c)
λ̃(1) + ρ̃(1) = 0 in Ω,  (5d)
H(v)ṽ := β∇⁴ṽ + P b̃ in Ω,  (5e)

where

b̃ = ∫₀¹ λ̃(t) ∇ρ(t) + λ(t) ∇ρ̃(t) dt.

Notice that (5a) and (5c) require storing ρ(t), ρ̃(t), and λ(t) for all t. Also notice that certain terms in (5) drop if we enforce div v = 0. The Newton step ṽ is obtained by solving the linear system H(v)ṽ = −g(v). For a Gauss-Newton Hessian, we drop the two terms that involve λ(t), i.e., the last term in (5c) and the second term in b̃.

Fig. 3:
Illustration of the semi-Lagrangian scheme (figure modified from [47]). The square domain corresponds to a grid block assigned to a single MPI task. Point x is a regular grid point and X is the material point at t = 0 that lands at x at time δt. The semi-Lagrangian algorithm requires interpolation of v, ρ, ρ̃, λ, and λ̃ at these off-grid points; as illustrated here, these points can lie in other MPI tasks. We communicate these points, interpolate their values, and then communicate them back.
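Both the gradient (4) and the Hessian matvec (5e) apply the Leray projector P. A minimal 2D spectral sketch (illustrative only; the paper works in 3D with a distributed FFT) shows how P = I − ∇∇⁻²div is applied with FFTs and that the projected field is divergence free:

```python
import numpy as np

N = 64
k = np.fft.fftfreq(N, d=1.0 / N)                 # integer wavenumbers
K1, K2 = np.meshgrid(k, k, indexing="ij")
K2norm = K1**2 + K2**2
inv_lap = np.zeros_like(K2norm)                  # spectral inverse Laplacian,
mask = K2norm > 0                                # zero mean mode excluded
inv_lap[mask] = -1.0 / K2norm[mask]

def leray(v1, v2):
    V1, V2 = np.fft.fft2(v1), np.fft.fft2(v2)
    div_hat = 1j * K1 * V1 + 1j * K2 * V2        # spectral divergence
    phi = inv_lap * div_hat                      # solve Laplacian(phi) = div v
    W1 = V1 - 1j * K1 * phi                      # subtract the gradient part
    W2 = V2 - 1j * K2 * phi
    return np.real(np.fft.ifft2(W1)), np.real(np.fft.ifft2(W2))

# Smooth random test field: white noise filtered by a Gaussian in Fourier space.
rng = np.random.default_rng(0)
def smooth(a):
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.exp(-0.1 * K2norm)))
v1, v2 = smooth(rng.standard_normal((N, N))), smooth(rng.standard_normal((N, N)))
w1, w2 = leray(v1, v2)
div_w = np.real(np.fft.ifft2(1j * K1 * np.fft.fft2(w1) + 1j * K2 * np.fft.fft2(w2)))
print(np.abs(div_w).max())   # near machine precision: P v is divergence free
```

Since P is diagonal in the Fourier basis apart from the coupling between components, its cost is a handful of FFTs, which is why enforcing incompressibility adds little to the per-iteration cost.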
ETHODS
Given two images ρ R and ρ T our goal is to find y . Weuse an optimize-then-discretize approach for (2). Our schemecan be summarized as follows: • We solve (2) for v . • Once we have v , we use (1) to compute y . • To find v we solve g ( v ) = (where g ( v ) is given by(4)) using a preconditioned Newton-Krylov method. • In space we use Fourier expansions (regular grids withperiodic boundary conditions) and in time we use a semi-Lagrangian scheme. • We use data parallelism in space (the regular grid). • All spatial differential operators (and their inverses ifneeded) are computed spectrally using our parallel FFT. • All spatial algebraic operators are done in parallel.We provide more details for our algorithm in the followingsubsections.
A. Newton-Krylov solver and preconditioning
For the optimization we use a Newton method globalized with an Armijo line-search. We use a preconditioned Conjugate-Gradient (PCG) method to compute the Newton step. The linear solves using PCG are done inexactly using a tolerance that depends on the relative norm of the gradient [17]. The preconditioner is the inverse of the biharmonic operator (∇⁻⁴) and can be applied in nearly linear time using FFTs (with a logarithmic factor). This preconditioner delivers mesh-independence, but not β-independence (see § IV). Since the problem is highly nonlinear, we use parameter continuation on β. The target value for β is application dependent and, in our algorithm, determined by various metrics defined on ∇y [45]. We use the TAO module from the PETSc library [8], [7] for numerical optimization, which supports user-defined, matrix-free PCG.
TAO provides interfaces that allow one to control two main parameters in the Newton-Krylov solver: (i) the accuracy of the solution of the linear system to compute the Newton step (the relative tolerance of the PCG method used to solve the Hessian equation); and (ii) the nonlinear termination criteria. We provide the algorithms to determine these parameters and, given v and ṽ, efficient routines for the function evaluation J(v), the gradient evaluation g(v), the Hessian matvec H(v)ṽ, and the action of the preconditioner ∇⁻⁴ṽ.
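The structure of a matrix-free PCG solve with a spectral preconditioner can be sketched in a few lines. The following 1D toy problem (not the paper's solver; the operator and all parameters are illustrative stand-ins) uses a Hessian-like operator β d⁴/dx⁴ + a(x) and preconditions with the constant-coefficient part inverted in Fourier space, mirroring the inverse-biharmonic preconditioner described above:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

N, beta = 256, 1e-2
x = 2 * np.pi * np.arange(N) / N
k = np.fft.fftfreq(N, d=1.0 / N)               # integer wavenumbers
a = 1.0 + 0.5 * np.sin(x)                      # smooth, positive coefficient

def Hmat(u):                                   # H u = beta u'''' + a(x) u
    return beta * np.real(np.fft.ifft(k**4 * np.fft.fft(u))) + a * u

def Minv(r):                                   # (beta k^4 + mean(a))^{-1} r
    return np.real(np.fft.ifft(np.fft.fft(r) / (beta * k**4 + a.mean())))

H = LinearOperator((N, N), matvec=Hmat, dtype=np.float64)
M = LinearOperator((N, N), matvec=Minv, dtype=np.float64)
g = np.sin(3 * x) * np.exp(np.cos(x))          # some smooth right-hand side

iters = []
u, info = cg(H, g, M=M, callback=lambda xk: iters.append(1))
print(info, len(iters), np.abs(Hmat(u) - g).max())
```

Because the preconditioner captures the stiff high-frequency part of the operator exactly, PCG converges in a small, mesh-independent number of iterations; only the variable coefficient a(x) is left for the Krylov method to resolve.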
We use Cartesian grids to approximate spatial functions and spatial differential operators. We use an explicit second-order Runge-Kutta semi-Lagrangian method to discretize in time.
1) Space:
We use a spectral projection scheme for all spatial operations on a regular grid defined on Ω with periodic boundary conditions. For simplicity, we consider the isotropic case, in which the grid spacing is the same in all directions; that is, the number of points per direction is given by N_1 = N_2 = N_3 = N. Our actual implementation does not require this (see § IV for an example). We approximate

ρ(x) = Σ_k ρ̂_k exp(−i k · x),

where k = (k_1, k_2, k_3) ∈ Z³ is a multi-index with −N/2 ≤ k_j ≤ N/2 − 1, j = 1, 2, 3. The corresponding regular grid points are given by x_i = 2π i/N, where i = (i_1, i_2, i_3) ∈ N³ and 0 ≤ i_j ≤ N − 1, j = 1, 2, 3. We refer to {ρ̂_k} as the spectral coefficients of ρ. Mappings between {ρ_i} and {ρ̂_k} are done using the forward and inverse Fast Fourier Transform (FFT). We use a similar spectral discretization for λ and for each component of the velocity field v. All derivatives are performed by first taking the FFT and then filtering the spectral coefficients appropriately. In general, the input images ρ_R and ρ_T may not be periodic functions. In that case a spectral approximation will create excessively high aliasing errors. To address this, we use zero-padding for ρ_R and ρ_T. Also, in general, images will have discontinuities and thus are not differentiable, creating similar aliasing problems. So, before applying our algorithm, we smooth them spectrally with a Gaussian filter whose bandwidth is the grid size 2π/N. Notice that our spectral representation with periodic boundary conditions allows us to apply all the different spatial operators, including ∇⁻² and ∇⁻⁴, in a stable, accurate, and extremely efficient manner. As a result, the main cost of the computation will be solving the transport equations, not applying and inverting elliptic differential operators.
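The two spectral grid operations described here, differentiation and Gaussian smoothing, amount to a forward FFT, a diagonal filter, and an inverse FFT. A 1D sketch (the filter bandwidth is illustrative):

```python
import numpy as np

N = 64
x = 2 * np.pi * np.arange(N) / N
k = np.fft.fftfreq(N, d=1.0 / N)           # integer wavenumbers on [0, 2*pi)

def ddx(f):                                # differentiate: filter by i*k
    return np.real(np.fft.ifft(1j * k * np.fft.fft(f)))

def gauss(f, sigma):                       # smooth: damp high wavenumbers
    return np.real(np.fft.ifft(np.exp(-0.5 * (sigma * k) ** 2) * np.fft.fft(f)))

err = np.abs(ddx(np.sin(x)) - np.cos(x)).max()
print(err)                                 # spectrally accurate derivative
g = gauss(np.sign(np.sin(x)), 2 * np.pi / N)   # smooth a discontinuous "image"
```

For band-limited functions the derivative is exact up to roundoff, which is the accuracy claim behind applying ∇, div, ∇⁻², and ∇⁻⁴ spectrally.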
2) Time:
We choose a semi-Lagrangian method since it is unconditionally stable [19] and allows us to take a small number of time steps n_t. This is critical since we store several space-time fields. For example, when solving (5a) for ρ̃(t), we need ρ(t) for all t. For large n_t the storage requirements become excessive and more sophisticated checkpointing schemes [2] are required, which are more expensive. If we were using a Courant-Friedrichs-Lewy (CFL) restricted scheme for solving the transport equations (the CFL condition defines an upper bound on the time step size to ensure a stable solution of stiff, time-dependent PDEs [40]), storing the time history would have been impossible, since we would have had to store hundreds of time steps (see, e.g., [45], [46], [47]).

To explain the semi-Lagrangian method we reintroduce the notational dependence on space. We consider the general transport equation for a scalar field ν(x, t) (with a stationary velocity)

∂_t ν(x, t) + v(x) · ∇ν(x, t) = f(ν(x, t), x).

For example, for (5c) we have ν = λ̃ and f = λ̃ div v + div(λ ṽ). Then, for each x, to compute ν(x, δt) given ν(x, 0), we first compute a new point X, the semi-Lagrangian point, using the scheme

X* = x − δt v(x);  X = x − (δt/2) (v(x) + v(X*)),  (6)

and then we set

ν⁰(X) = ν(X, 0);  f⁰(X) = f(ν⁰(X), X);
ν*(x) = ν⁰(X) + δt f⁰(X);  f*(x) = f(ν*(x), x);
ν(x, δt) = ν⁰(X) + (δt/2) (f⁰(X) + f*(x)).  (7)

This scheme is fully explicit and unconditionally stable. Recall that x is a regular grid point and thus ν(x, 0) and v(x) are known. X* and X are not regular grid points. Computing v and ν at these off-grid locations requires multiple interpolations: three interpolations for v(X*), further interpolations for the f terms that depend on the semi-Lagrangian point X, and finally one interpolation for ν(X, 0).
If f depends on derivatives of ν, we first differentiate on the regular grid and then we interpolate the derivatives. The same scheme is used for the adjoint equations by changing the time variable from t to τ = 1 − t, so that −∂_t λ(t) = ∂_τ λ(τ) and λ(t = 1) = λ(τ = 0). Note that the interpolation cannot be done using an FFT, since the interpolation points can be spaced irregularly between grid points. Cubic interpolation is typically preferred over linear interpolation because the interpolation errors are accumulated throughout the time stepping without a time-step factor [11]. We use a tricubic interpolation scheme, which we discuss in § III-C2.
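One step of (6)-(7) with f = 0 can be sketched in 2D. This is illustrative only: scipy's cubic spline interpolation with periodic wrapping stands in for the paper's hand-optimized tricubic kernel, and a constant velocity is chosen so the exact solution is a pure translation:

```python
import numpy as np
from scipy.ndimage import map_coordinates

N, dt = 128, 0.1
h = 2 * np.pi / N
x1, x2 = np.meshgrid(np.arange(N) * h, np.arange(N) * h, indexing="ij")
v1, v2 = 1.0, 0.5                                    # constant velocity field
nu0 = np.exp(-4 * ((x1 - np.pi) ** 2 + (x2 - np.pi) ** 2))  # smooth blob

# (6): trace back to the semi-Lagrangian point X; the RK2 corrector is
# trivial here because the velocity does not vary in space.
X1s, X2s = x1 - dt * v1, x2 - dt * v2                # predictor
X1 = x1 - 0.5 * dt * (v1 + v1)                       # corrector
X2 = x2 - 0.5 * dt * (v2 + v2)

# (7) with f = 0: nu(x, dt) = nu(X, 0), evaluated by cubic interpolation
# at the off-grid points X (coordinates given in grid-index units).
nu1 = map_coordinates(nu0, [X1 / h, X2 / h], order=3, mode="grid-wrap")

exact = np.exp(-4 * ((x1 - dt * v1 - np.pi) ** 2 + (x2 - dt * v2 - np.pi) ** 2))
print(np.abs(nu1 - exact).max())   # only the interpolation error remains
```

Note the step is stable for any δt; the accuracy is set by the interpolation and the RK2 trajectory error, not by a CFL bound.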
C. Parallel algorithms
The main computational kernels are the 3D FFTs to compute derivatives, elliptic operators, and their inverses; the interpolation to off-grid points needed for the semi-Lagrangian time stepping; the Krylov solver (PCG) for the Hessian; and the Newton solver. The 3D FFT has well-known algorithmic complexity. The interpolation at semi-Lagrangian points is the most expensive part of the computation, despite being local. As it turns out, about 60% of the overall time for the image registration problem is spent on interpolation. The Krylov and Newton solvers are sequential across iterations, whereas all the function, gradient, and Hessian evaluations are done using data parallelism. Below we give more details on the FFT, the interpolation, and how we put everything together.

Fig. 4: Here we explain the data partitioning, which is based on the pencil decomposition for 3D FFTs [25]. The colors indicate the data partitioning; each color corresponds to the data assigned to an MPI task. Subfigure (a) represents the input distribution of the (volumetric) image. After an FFT in the first coordinate, in (b) we do the FFT in the second coordinate. This requires √p concurrent alltoall exchanges between groups of √p MPI tasks. This process is repeated for the third direction in (c) and has the same communication costs as (b) (image modified from [23]).
1) Partitioning and FFT:
Let N_i, i = 1, 2, 3, be the number of grid points in the i-th dimension. Also assume we have p = p_1 p_2 MPI tasks. We partition the data using the pencil decomposition of the 3D FFT (see Figure 4). Each MPI task gets (N_1/p_1) × (N_2/p_2) × N_3 grid points. There is no partitioning in time and all the time steps are stored in memory. The scalability of the 3D FFT has been well studied [16], [25]. The 3D FFT requires O((N/p) log N) computations and O(t_s √p + t_w N/p) communications. We use the open-source package AccFFT [24], [22] that supports both GPU and CPU FFTs and is based on the 1D FFTs implemented in the FFTW package [21]. Our code features optimizations for the ∇ and div operators that allow us to avoid multiple 3D FFTs. For example, ∇ρ = (∂_1 ρ, ∂_2 ρ, ∂_3 ρ) requires N_2 N_3 1D FFTs across the first coordinate, diagonal scaling, and then the same number of inverse FFTs, again across the first dimension. A similar process is required for the other components, but they require collective all-to-all communications for rearranging the data. The remaining operators, P and the elliptic operators and their inverses, are diagonal in the spectral basis and require standard 3D FFTs.
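The partial-derivative optimization described here can be verified in serial numpy (illustrative only; the parallel code works on distributed pencils): ∂_1 ρ needs FFTs only along the first axis, which in the pencil decomposition is local to each task, and it agrees with the full 3D-FFT route.

```python
import numpy as np

N = 32
grid = 2 * np.pi * np.arange(N) / N
x1, x2, x3 = np.meshgrid(grid, grid, grid, indexing="ij")
rho = np.sin(x1) * np.cos(2 * x2) * np.sin(3 * x3)
k = np.fft.fftfreq(N, d=1.0 / N)           # integer wavenumbers

# Route 1: N2*N3 one-dimensional FFTs along axis 0, scale by i*k1, invert.
d1 = np.real(np.fft.ifft(1j * k[:, None, None] * np.fft.fft(rho, axis=0), axis=0))

# Route 2: full 3D FFT, the same diagonal scaling by i*k1, inverse 3D FFT.
d1_3d = np.real(np.fft.ifftn(1j * k[:, None, None] * np.fft.fftn(rho)))

err_routes = np.abs(d1 - d1_3d).max()
err_exact = np.abs(d1 - np.cos(x1) * np.cos(2 * x2) * np.sin(3 * x3)).max()
print(err_routes, err_exact)
```

The FFTs along the untouched axes cancel exactly with their inverses, so route 1 saves both the extra 1D transforms and, in the distributed setting, the all-to-all exchanges that the 3D FFT would need.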
2) Interpolation:
For every grid point x_i we have to find the departure points X_i required in (7) using (6). In the distributed case, every processor interpolates all the points that fall into the region defined by its pencil (subfigure (a) in Figure 4). This is essentially an alltoallv operation. We refer to this step as the "scatter" phase. Note that the points need to be constructed only when the velocity field changes. In a Newton iteration for a given v we have to compute these points only once for v (forward transport) and once for −v (adjoint equations); that is, the scatter phase needs to be done once per field per Newton iteration. This results in speedups due to savings in communication and computation. After the scatter phase is completed, each process has to perform a cubic interpolation on the points that it owns locally, as well as on the points it received from the other processors. After this step is done, an alltoallv operation is necessary to send/recv all the interpolation results. This needs to be done once per time step.

The computation is organized as follows. For every forward or adjoint solve, we invoke an interpolation planner, which performs the scatter phase, stores the semi-Lagrangian points, and creates the communication plans for the transport equation. Then the actual transport (7), which involves multiple interpolations at every time step, is performed. For divergence-free v, the computation of (2b) and (3) involves only interpolations. For (5a) and (5c) it also involves the gradient and divergence operators that appear on the right-hand side.

Note that it is possible for an interpolation point to fall in between the locally owned domains of the processes, because the local domain of each process is disjoint from the others. For this reason, every processor maintains a layer of ghost points, regular grid points that belong to other processors. The values of ν at these points must be synchronized before interpolation takes place.
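The owner computation used in the scatter phase can be sketched as follows. This is a simplified model (not the code's actual API) that assumes a p1 × p2 pencil grid over the first two coordinates of a domain normalized to a periodic unit cube:

```cpp
#include <cmath>

// Hypothetical model of the pencil decomposition: p1 x p2 MPI tasks partition
// the first two coordinates of a periodic domain normalized to [0,1)^3.
// Returns the rank of the task whose pencil contains the point (x, y).
int owner_rank(double x, double y, int p1, int p2) {
    x -= std::floor(x);                  // wrap into the periodic unit cube
    y -= std::floor(y);
    int i = (int)(x * p1); if (i == p1) i = p1 - 1;  // guard against rounding to 1.0
    int j = (int)(y * p2); if (j == p2) j = p2 - 1;
    return i * p2 + j;                   // row-major rank within the pencil grid
}
```

In the scatter phase, each task buckets its departure points by owner rank and exchanges the buckets with a single alltoallv.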
Notice that for every point we have to bring in 64 scalar values and perform roughly 5 × 64 floating point operations. The constant is related to the 64 coefficients (4³) required to build and evaluate the tricubic interpolant, times five flops per coefficient. Therefore, the computation-to-memory-traffic ratio is O(1); the computation is memory bound. Blocking, prefetching, and vectorization can be used to improve the performance.

Algorithm 1
Parallel tricubic interpolation.
Input: {X_i′} ∈ Ω_r, owner(X_i′), {x_i} ∈ Ω_r, worker(x_i), ν(x_i), MPI task r
Output: ν(X_i′)
1: Communicate ν values for ghost points.
2: Send/recv X_i′ from r to/from owner(X_i′).
3: Locally interpolate to compute ν(X_i′).
4: Send/recv ν(X_i′) from worker(x_i) to r.

The execution flow of the interpolation algorithm is explained in Algorithm 1 for an MPI task r. Let ν(x_i) be the value of the scalar field ν at a regular grid point x_i. Also, let X_i′ be the corresponding semi-Lagrangian points for each i′ computed by (6). Let Ω_j be the spatial domain assigned to MPI task j. As shown in Figure 3, X_i′ ∈ Ω_r does not imply that x_i′ ∈ Ω_r, and vice versa. For an off-grid point X_i′, owner(X_i′) computes the MPI task that owns that point; that is, X_i′ ∈ Ω_r but x_i′ ∈ Ω_owner(X_i′). Similarly, x_i ∈ Ω_r but X_i ∈ Ω_worker(x_i). Of course, for most points both the owner and worker domains will be identical to Ω_r. In line 1 every task sends the values of ν that it owns to its four neighbors (recall that we use a pencil decomposition), with a communication cost of O(t_w N/p + t_s) (the four corner neighbors can be combined with the messages of the edge neighbors by an appropriate ordering of the messages). In line 2, for X_i′ whose corresponding regular grid point x_i belongs to a different MPI task, r sends X_i′ to that task. In line 3 the interpolation takes place. This phase requires roughly O(600 N/p) floating point operations. This is followed by an alltoall communication in line 4 to send/recv the interpolation results. Our GPU implementation follows a similar strategy.
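A hedged sketch of the local interpolation kernel (line 3): a tensor-product cubic that gathers the 4 × 4 × 4 = 64 neighbor values per point, as discussed in the text. We use a Catmull-Rom kernel here for concreteness; the production code's basis and memory layout may differ:

```cpp
#include <cmath>
#include <vector>

// Catmull-Rom cubic weights for the four neighbors of a point with
// fractional offset t in [0,1). One possible tensor-product kernel.
static void cubic_weights(double t, double w[4]) {
    w[0] = 0.5 * (-t * t * t + 2.0 * t * t - t);
    w[1] = 0.5 * (3.0 * t * t * t - 5.0 * t * t + 2.0);
    w[2] = 0.5 * (-3.0 * t * t * t + 4.0 * t * t + t);
    w[3] = 0.5 * (t * t * t - t * t);
}

// Tricubic interpolation of a periodic N^3 field f (flattened as
// f[(z*N + y)*N + x]); (x, y, z) are given in grid units. The 64 neighbor
// loads dominate the memory traffic, as discussed in the text.
double tricubic(const std::vector<double>& f, int N,
                double x, double y, double z) {
    int ix = (int)std::floor(x), iy = (int)std::floor(y), iz = (int)std::floor(z);
    double wx[4], wy[4], wz[4];
    cubic_weights(x - ix, wx); cubic_weights(y - iy, wy); cubic_weights(z - iz, wz);
    double s = 0.0;
    for (int a = 0; a < 4; ++a)
        for (int b = 0; b < 4; ++b)
            for (int c = 0; c < 4; ++c) {
                int gx = ((ix - 1 + a) % N + N) % N;  // periodic wrap
                int gy = ((iy - 1 + b) % N + N) % N;
                int gz = ((iz - 1 + c) % N + N) % N;
                s += wx[a] * wy[b] * wz[c] * f[(gz * N + gy) * N + gx];
            }
    return s;
}
```

Because this kernel reproduces linear functions, interpolating the field f(x, y, z) = x at (2.5, 3.5, 4.5) returns 2.5 exactly.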
3) Algorithm for incremental state equation:
We summarize the steps needed to solve the incremental forward problem (5a) for one time step in Algorithm 2 to illustrate the communication and computation pattern.
Algorithm 2
One time step of the solution of the incremental forward problem.

Input: v(x_i), ṽ(x_i), ρ(x_i, 0), ρ(x_i, δt), ρ̃(x_i, 0), X_i
Output: ρ̃(x_i, δt)
1: ρ̃(X_i) = ρ̃(X_i, 0) using Algorithm 1.
2: ∇ρ(x_i, 0) using FFT.
3: f(x_i) = −ṽ(x_i) · ∇ρ(x_i, 0).
4: f(X_i) using f(x_i) and Algorithm 1.
5: ρ̃*(x_i) = ρ̃(X_i) + δt f(X_i).
6: ∇ρ(x_i, δt) using FFT.
7: f*(x_i) = −ṽ(x_i) · ∇ρ(x_i, δt).
8: ρ̃(x_i, δt) = ρ̃(X_i) + (δt/2)(f(X_i) + f*(x_i)).

This calculation requires four interpolation steps using Algorithm 1: one for the scalar interpolation in line 1 and three for the vector interpolation in line 4. It also requires four FFTs: two for line 2 (two because we need to go to the spectral domain, differentiate, and then transform back to the spatial domain) and two for line 6. The other parts are triple "for loops" over all the grid points i in Ω_r. The FFTs require global synchronizations. The total cost of the incremental state solve per time step is four 3D FFTs and two interpolations. The incremental adjoint solve requires the same computations (for divergence-free velocity fields).
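The backward-characteristic pattern underlying Algorithm 2 can be illustrated in 1D for the pure transport equation ∂ₜρ + v ∂ₓρ = 0 with constant velocity. This is a hedged sketch: the production solver uses tricubic interpolation and a second-order scheme for the characteristics, while this sketch uses linear interpolation and an exact trace-back:

```cpp
#include <cmath>
#include <vector>

// Linear interpolation of a periodic grid function at point x (grid units).
// The production code uses tricubic interpolation; linear keeps this short.
static double interp(const std::vector<double>& f, double x) {
    int N = (int)f.size();
    x = std::fmod(std::fmod(x, (double)N) + N, (double)N);
    int i = (int)x;
    double t = x - i;
    return (1.0 - t) * f[i] + t * f[(i + 1) % N];
}

// One semi-Lagrangian time step: trace the characteristic backward from each
// grid point x_i to the departure point X_i = x_i - v*dt, then interpolate.
std::vector<double> semi_lagrangian_step(const std::vector<double>& rho,
                                         double v, double dt) {
    int N = (int)rho.size();
    std::vector<double> out(N);
    for (int i = 0; i < N; ++i)
        out[i] = interp(rho, (double)i - v * dt);  // rho at departure point X_i
    return out;
}
```

Because the departure point is traced backward, the step remains stable even when v·δt exceeds the grid spacing, which is why the number of time steps can be kept small.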
4) Complexity of Hessian matvec and overall algorithm:
Every Hessian matvec requires one incremental forward and one incremental adjoint solve, i.e., O(n_t) FFTs and O(n_t) interpolations. The remaining operations, applying the regularization and the preconditioner, are negligible, since they each involve just two FFTs. The gradient is also cheaper, since (2b) and (3) are simpler than the equations in (5). Regarding memory, every task needs to store n_t N/p + 5 N/p values for the incremental adjoint and state variables. Therefore, accounting for the complexities of the FFT and the interpolation, we obtain

T_flop ≈ n_t ( (N/p) log N + 4 · 600 N/p ),
T_mpi ≈ n_t ( t_s √p + t_w N/p ) + 4 n_t ( t_s + t_w N/p ).

This estimate assumes that the semi-Lagrangian points are uniformly distributed across processors; this is not guaranteed and depends on the velocity field and the CFL number. In practice the interpolation is the predominant cost of the calculation, at least for the problem sizes we have tested. For fixed β the number of Newton iterations is independent of the mesh size; the inversion of these highly ill-conditioned operators is done in linear time.

Fig. 5:
3D visualization of a synthetic registration problem (volume rendering). From left to right: (i) reference image ρ_R, (ii) template image ρ_T, and (iii) initial (before registration) residual differences between ρ_R and ρ_T. The reference image ρ_R is generated from ρ_T by solving the forward problem with a known velocity v* (details can be found in the text). Dark areas indicate large residual differences and white areas zero residual differences.

IV. RESULTS
A. Experimental setup
In this section, we give details on the experimental setup we used to test our solver.
1) Images:
We use one real-world and one synthetic image to test our algorithm. For the synthetic case we construct the template image as ρ_T(x) = (sin(x1) + sin(x2) + sin(x3))/3; the velocity is given by v*(x) = (cos(x1) sin(x1), cos(x2) sin(x2), cos(x3) sin(x3))ᵀ; the reference image ρ_R is the solution of (2b) with the exact velocity v* (see Figure 5 for an illustration of this problem). We use a synthetic case to perform the scaling studies, since medical images come with a fixed resolution/grid size. To test our scheme on real medical images, we use two 3D MRI brain images of different individuals ("multi-subject registration problem"). This data is from the Non-rigid Registration Evaluation Project (NIREP) [14] (see Figure 6 for an illustration).
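The synthetic template construction can be sketched as follows, assuming the sinusoidal template given above, sampled on an N³ periodic grid over [0, 2π)³; the grid size and function name are illustrative:

```cpp
#include <cmath>
#include <vector>

// Sample the synthetic template rho_T(x) = (sin(x1) + sin(x2) + sin(x3)) / 3
// on an N^3 periodic grid over [0, 2*pi)^3, flattened as rho[(k*N + j)*N + i].
std::vector<double> synthetic_template(int N) {
    const double h = 2.0 * std::acos(-1.0) / N;  // grid spacing
    std::vector<double> rho((size_t)N * N * N);
    for (int k = 0; k < N; ++k)
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                rho[((size_t)k * N + j) * N + i] =
                    (std::sin(i * h) + std::sin(j * h) + std::sin(k * h)) / 3.0;
    return rho;
}
```

The reference image would then be obtained by advecting this template with the exact velocity v* using the semi-Lagrangian forward solver.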
2) Implementation and Hardware:
Our code is implemented in C++ and uses MPI and the OpenMP library for multithreading. The code is compiled with the Intel C++ compiler using the -O3 flag. Although we have GPU implementations both for the FFT and the interpolation, we have not used accelerators for the results we report here. We carry out runtime experiments on TACC's "Maverick" system. Each compute node contains dual ten-core Intel Xeon E5-2680 v2 (Ivy Bridge) processors running at 2.8GHz with 12.8GB/core of memory. Each node also has an NVIDIA Tesla K40 GPU accelerator. We also report large-scale runs on TACC's "Stampede" system (two eight-core Xeon E5-2680 v1 (Sandy Bridge) processors with 32GB host memory per node).

(For the incompressible case we use a similar but divergence-free velocity field v*. The NIREP data is available at http://nirep.org; the interested reader is referred to [14] for more details. We consider the first two datasets, na01 and na02, from this repository.)

As mentioned before, we use PETSc's TAO for the nonlinear optimization, the vector operations of PETSc for the linear algebra, and AccFFT for the Fourier transforms. The basic interface to TAO consists of the functional, gradient, Hessian matvec, and preconditioner, as well as routines to select the tolerances for the nonlinear solver and for the Newton steps.
3) Parameters:
The regularization parameter β is fixed for the scalability runs (for both the synthetic and the real-world brain example). The number of time steps n_t controls the accuracy and should be related to the CFL number. For simplicity, and to be able to compare different cases, we have kept it fixed at n_t = 4. The gradient tolerance g_tol is fixed unless otherwise stated. We use an inexact Newton method with quadratic forcing. We do not report continuation results; all the runs are done for a single (experimentally determined) value of β. We report (i) wall-clock times, (ii) communication times, as well as (iii) the time to solution for our method with respect to different registration problems and parameter settings. Since the problem is non-convex and we are not interested in high-accuracy solutions, we opt for a Gauss-Newton approximation.

B. Scalability using synthetic images
We use a fixed set of parameters, which we experimentally determined to yield a good balance between computational complexity and computational performance. We illustrate the registration problem in Figure 5. We report results for different grid resolutions and different numbers of cores and MPI task configurations. The results are reported in Table I ("Maverick" runs) and Table II (large-scale "Stampede" runs). First, we interpret the "Maverick" runs. Similar conclusions can be drawn for the larger set; again, going from 16 tasks to 256 tasks, we observe 50% efficiency.

(Many image registration codes do not even compute gradients, and the termination criterion is the number of iterations. Given the limitations in the resolution and the image quality, a relative reduction of the gradient by 1% is typically considered quite excessive. We use FFTs for the discretization of differential operators, since this allows us to invert them at the cost of a spectral diagonal scaling. This offers the opportunity to exactly fulfill and efficiently eliminate the incompressibility constraint from the optimality system. It also allows for an efficient preconditioning of the Hessian with essentially no construction cost; see § III for details.)

Fig. 6:
3D visualization of the registration problem for the brain images. From left to right: (i) reference image ρ_R, (ii) template image ρ_T, (iii) the residual differences between ρ_R and ρ_T (before registration), and (iv) the residual differences between ρ_R and ρ_T(y) (deformed template image; after registration). Dark areas indicate a large residual and white areas no residual differences.

Table I:
Computational performance of our solver for the synthetic registration problem illustrated in Figure 5 on TACC's "Maverick" computing system. We neglect the incompressibility constraint for these runs. We report the time to solution, and the communication and execution times for the FFT and the interpolation, respectively (in seconds). We report timings as a function of the number of unknowns (in space) and the number of nodes and tasks. We use 16 tasks per node.
The overall timings are 15.2 seconds, 23 seconds, and 32 seconds, respectively, which again is not perfect. If we look more closely at how the time is allocated, we observe that the execution time for the FFT scales perfectly in these three runs (1.35 seconds, 1.56 seconds, and 1.77 seconds, respectively). The interpolation execution also scales well, both in terms of communication and computation. The deterioration of the overall time is due to the FFT communication costs. The largest problem we solved for the synthetic case was run on "Stampede" (Table II).

C. Real-world registration problem
We report exemplary results for the brain data sets illustrated in Figure 6. We set the gradient tolerance as before, limit the maximal number of outer iterations (Newton steps) to 50, and use a fixed regularization weight β. We study strong scaling and the sensitivity of the convergence of our solver with respect to changes in the regularization weight β. We report scalability results for the brain images in Table IV. We display exemplary results for the considered datasets in Figure 7. We report results for varying choices of the regularization parameter β in Table V.

Table V: Sensitivity of the computational workload with respect to varying regularization weights β. We report results for four Newton iterations for the brain images. We report the number of Hessian matvecs and the time to solution (in seconds) and, in parentheses, its relative increase from the base case (1.0, 4.6, and 35.0 for decreasing β).

We observe that we can significantly reduce the computational timings if we switch to parallel architectures. The scaling results are consistent with what we observed for the synthetic data sets. We can reduce the wall-clock time by two orders of magnitude if we change from one task on one node to 64 MPI tasks on 32 nodes. We can fit the entire problem on one node. This demonstrates the practicability of our solver. The communication and execution times of the FFT and the interpolator drop significantly as we increase

Table II: Computational performance of our solver for the synthetic registration problem illustrated in Figure 5 on TACC's "Stampede" computing system. We neglect the incompressibility constraint for these runs. We report the time to solution, and the communication and execution times for the FFT and the interpolation, respectively (in seconds). We report timings as a function of the number of unknowns (in space) and the number of nodes and tasks. We use 2 tasks per node.
Table III: Computational performance of our solver for a synthetic registration problem similar to the one illustrated in Figure 5 on TACC's "Maverick" computing system. We use the incompressibility constraint for these runs (mass-preserving diffeomorphism). We report the time to solution, and the communication and execution times for the FFT and the interpolation, respectively (in seconds). We report results for a fixed grid size as a function of the number of nodes and tasks. We use 2 tasks per node.
Fig. 7:
Exemplary registration results for the brain data sets. We display, from left to right, axial slices of (i) the reference image ρ_R, (ii) the template image ρ_T, the residual differences between ρ_R and ρ_T (iii) before registration and (iv) after registration, (v) a point-wise map of the determinant of the deformation gradient (the color map represents volume change ranging from 0 to 2, where det(∇y) = 0 is black, det(∇y) ∈ (0, 1) corresponds to different shades of orange (from dark to bright), and det(∇y) ≥ 1 is white), and (vi) the deformed template image ρ_T(y) with a grid in overlay (closeup to the right). The values for the determinant of the deformation gradient are strictly positive (i.e., the deformation map is diffeomorphic).

Table IV: Strong scaling results for the brain images computed on "Maverick". We set the regularization parameter β to the same value as in the other brain runs. We perform two Newton iterations for these scalability runs. We report the number of nodes, the number of MPI tasks, and the communication and execution times for the FFT and the interpolation (in seconds).
the number of nodes. The interpolation time again contributes critically (about or more than 50% of the time to solution). As for the sensitivity with respect to the regularization parameter, we can see that the number of Hessian matvecs (a proxy for the overall number of Newton-Krylov iterations) increases as we reduce the regularization parameter β. The time to solution increases by a factor of 35 for the smallest β reported here. This clearly demonstrates that the performance of our preconditioner is not ideal; it deteriorates with a reduction in β. As we have seen in the former section, the solver behaves independently of the mesh size. Implementing an improved scheme for preconditioning the Hessian requires more work.

V. CONCLUSION
We presented a complete algorithm for large deformation diffeomorphic medical image registration. We were able to solve problems of unprecedented scale. One may ask how such runs translate to a clinical setting. As the cost of computing drops, we are hopeful that 32- and 256-task calculations will be possible at a modest cost.

The proposed algorithm is flexible and scalable. It supports different types of regularization functionals and can be extended to different image distance measures. Our approach can be easily extended to vector images and, with some additional work, can also be extended to non-stationary (time-varying) velocities [30], [9]. This will be necessary to register time series of images or to solve optical flow problems. All the parallelism-related issues remain the same. A major remaining challenge is the design of preconditioners that are insensitive to the regularization parameter.

Finally, our algorithm relates to other applications besides medical imaging, for example, applications in weather prediction and ocean physics (for tracking Lagrangian tracers in the oceans) [37], the reconstruction of porous media flows [20], and the registration of micro-CTs for material science and biology [32]. Although our method is highly optimized for regular grids with periodic boundary conditions, many aspects of our algorithm carry over.
REFERENCES

[1] S. S. Adavani and G. Biros, "Fast algorithms for source identification problems with elliptic PDE constraints," SIAM Journal on Imaging Sciences, vol. 3, no. 4, pp. 791–808, 2008.
[2] V. Akcelik, G. Biros, and O. Ghattas, "Parallel multiscale Gauss-Newton-Krylov methods for inverse wave propagation," in Proc ACM/IEEE Conference on Supercomputing, 2002, pp. 1–15.
[3] Y. Amit, "A nonlinear variational problem for image matching," SIAM Journal on Scientific Computing, vol. 15, no. 1, pp. 207–224, 1994.
[4] J. Ashburner, "A fast diffeomorphic image registration algorithm," NeuroImage, vol. 38, no. 1, pp. 95–113, 2007.
[5] J. Ashburner and K. J. Friston, "Diffeomorphic registration using geodesic shooting and Gauss-Newton optimisation," NeuroImage, vol. 55, no. 3, pp. 954–967, 2011.
[6] B. B. Avants, N. J. Tustison, G. Song, P. A. Cook et al., "A reproducible evaluation of ANTs similarity metric performance in brain image registration," NeuroImage, vol. 54, pp. 2033–2044, 2011.
[7] S. Balay, S. Abhyankar, M. F. Adams, J. Brown et al.
[9] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes, "Computing large deformation metric mappings via geodesic flows of diffeomorphisms," International Journal of Computer Vision, vol. 61, no. 2, pp. 139–157, 2005.
[10] A. Borzì and V. Schulz, Computational optimization of systems governed by partial differential equations. Philadelphia, Pennsylvania, US: SIAM, 2012.
[11] J. P. Boyd, Chebyshev and Fourier spectral methods. Mineola, New York, US: Dover, 2000.
[12] M. Burger, J. Modersitzki, and L. Ruthotto, "A hyperelastic regularization energy for image registration," SIAM Journal on Scientific Computing, vol. 35, no. 1, pp. B132–B148, 2013.
[13] K. Chen and D. A. Lorenz, "Image sequence interpolation using optimal control," Journal of Mathematical Imaging and Vision, vol. 41, pp. 222–238, 2011.
[14] G. E. Christensen, X. Geng, J. G. Kuhl, J. Bruss et al., "Introduction to the non-rigid image registration evaluation project," in Proc Biomedical Image Registration, vol. LNCS 4057, 2006, pp. 128–135.
[15] G. Crippa, "The flow associated to weakly differentiable vector fields," Ph.D. dissertation, University of Zurich, 2007.
[16] K. Czechowski, C. Battaglino, C. McClanahan, K. Iyer et al., "On the communication complexity of 3D FFTs and its implications for exascale," in Proc ACM/IEEE Conference on Supercomputing, 2012, pp. 205–214.
[17] S. C. Eisenstat and H. F. Walker, "Choosing the forcing terms in an inexact Newton method," SIAM Journal on Scientific Computing, vol. 17, no. 1, pp. 16–32, 1996.
[18] A. Eklund, P. Dufort, D. Forsberg, and S. M. LaConte, "Medical image processing on the GPU–past, present and future," Medical Image Analysis, vol. 17, no. 8, pp. 1073–1094, 2013.
[19] M. Falcone and R. Ferretti, "Convergence analysis for a class of high-order semi-Lagrangian advection schemes," SIAM Journal on Numerical Analysis, vol. 35, no. 3, pp. 909–940, 1998.
[20] J. Fohring, E. Haber, and L. Ruthotto, "Geophysical imaging of fluid flow in porous media," SIAM Journal on Scientific Computing, 2016, in review (arXiv preprint: http://arxiv.org/abs/1506.07933).
[23] A. Gholami, A. Mang, and G. Biros, "An inverse problem formulation for parameter estimation of a reaction-diffusion model of low grade gliomas," Journal of Mathematical Biology.
[25] A. Grama, A. Gupta, G. Karypis, and V. Kumar, An Introduction to parallel computing: Design and analysis of algorithms, 2nd ed. Addison Wesley, 2003.
[26] M. D. Gunzburger, Perspectives in flow control and optimization. Philadelphia, Pennsylvania, US: SIAM, 2003.
[27] M. E. Gurtin, An introduction to continuum mechanics, ser. Mathematics in Science and Engineering. Academic Press, 1981, vol. 158.
[28] L. Ha, J. Krüger, S. Joshi, and T. C. Silva, "Multi-scale unbiased diffeomorphic atlas construction on multi-GPUs," GPU Computing Gems Emerald Edition, vol. 1, pp. 771–791, 2010.
[29] E. Haber and J. Modersitzki, "Numerical methods for volume preserving image registration," Inverse Problems, vol. 20, pp. 1621–1638, 2004.
[30] G. L. Hart, C. Zach, and M. Niethammer, "An optimal control approach for deformable registration," in Proc IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 9–16.
[31] M. Hinze, R. Pinnau, M. Ulbrich, and S. Ulbrich, Optimization with PDE constraints. Berlin, DE: Springer, 2009.
[32] S. T. Ho and W. D. Hutmacher, "A comparison of micro CT with other techniques used in the characterization of scaffolds," Biomaterials, vol. 27, no. 8, pp. 1362–1376, 2006.
[33] F. Ino, K. Ooyama, and K. Hagihara, "A data distributed parallel algorithm for nonrigid image registration," Parallel Computing, vol. 31, no. 1, pp. 19–43, 2005.
[34] K. Ito and K. Kunisch, Lagrange multiplier approach to variational problems and applications. Philadelphia, Pennsylvania, US: SIAM, 2008, vol. 15.
[35] J. Shackleford, N. Kandasamy, and G. Sharp, High performance deformable image registration algorithms for manycore processors. Morgan Kaufmann, 2013.
[36] R. Kakinuma, N. Moriyama, Y. Muramatsu, S. Gomi et al., "Ultra-high-resolution computed tomography of the lung: Image quality of a prototype scanner," PLoS ONE, vol. 10, no. 9, p. e0137165, 2015.
[37] E. Kalnay, Atmospheric modeling, data assimilation and predictability. Oxford University Press, 2002.
[38] A. Klein, J. Andersson, B. A. Ardekani, J. Ashburner et al., "Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI registration," NeuroImage, vol. 46, no. 3, pp. 786–802, 2009.
[39] S. Klein, M. Staring, K. Murphy, M. A. Viergever, and J. P. W. Pluim, "ELASTIX: A toolbox for intensity-based medical image registration," Medical Imaging, IEEE Transactions on, vol. 29, no. 1, pp. 196–205, 2010.
[40] R. J. LeVeque, Numerical methods for conservation laws. Springer, 1992, vol. 132.
[41] J.-L. Lions, Some aspects of the optimal control of distributed parameter systems. Philadelphia, Pennsylvania, US: SIAM, 1972.
[42] Y. Liu, A. Fedorov, R. Kikinis, and N. Chrisochoides, "Real-time non-rigid registration of medical images on a cooperative parallel architecture," in Proc IEEE International Conference on Bioinformatics and Biomedicine, 2009, pp. 401–404.
[43] M. Lorenzi, N. Ayache, G. B. Frisoni, and X. Pennec, "LCC-demons: a robust and accurate symmetric diffeomorphic registration algorithm," NeuroImage, vol. 81, pp. 470–483, 2013.
[44] F. Lüsebrink, A. Wollrab, and O. Speck, "Cortical thickness determination of the human brain using high resolution 3T and 7T MRI data," NeuroImage, vol. 70, pp. 122–131, 2013.
[45] A. Mang and G. Biros, "An inexact Newton–Krylov algorithm for constrained diffeomorphic image registration," SIAM Journal on Imaging Sciences, vol. 8, no. 2, pp. 1030–1069, 2015.
[46] ——, "Constrained H-regularization schemes for diffeomorphic image registration," SIAM Journal on Imaging Sciences, 2016, to appear (arXiv preprint: http://arxiv.org/abs/1503.00757).
[47] ——, "A Semi-Lagrangian two-level preconditioned Newton–Krylov solver for constrained diffeomorphic image registration," arXiv e-prints, 2016, in review (arXiv preprint: http://arxiv.org/abs/1604.02153).
[48] M. Modat, G. R. Ridgway, Z. A. Taylor, M. Lehmann et al., "Fast free-form deformation using graphics processing units," Computer Methods and Programs in Biomedicine, vol. 98, no. 3, pp. 278–284, 2010.
[49] J. Modersitzki, Numerical methods for image registration. New York: Oxford University Press, 2004.
[50] ——, FAIR: Flexible algorithms for image registration. Philadelphia, Pennsylvania, US: SIAM, 2009.
[51] J. Nocedal and S. J. Wright, Numerical Optimization. New York, New York, US: Springer, 2006.
[52] J. A. Schackleford, N. Kandasamy, and G. C. Sharp, "On developing B-spline registration algorithms for multi-core processors," Physics in Medicine and Biology, vol. 55, no. 21, pp. 6329–6351, 2010.
[53] R. Shams, P. Sadeghi, R. A. Kennedy, and R. I. Hartley, "A survey of medical image registration on multicore and the GPU," Signal Processing Magazine, IEEE, vol. 27, no. 2, pp. 50–60, 2010.
[54] D. Shen and C. Davatzikos, "Very high-resolution morphometry using mass-preserving deformations and HAMMER elastic registration," NeuroImage, vol. 18, no. 1, pp. 28–41, 2003.
[55] A. Sotiras, C. Davatzikos, and N. Paragios, "Deformable medical image registration: A survey," Medical Imaging, IEEE Transactions on, vol. 32, no. 7, pp. 1153–1190, 2013.
[56] Z. Starosolski, C. A. Villamizar, D. Rendon, M. J. Paldino et al., "Ultra high-resolution in vivo computed tomography imaging of mouse cerebrovasculature using a long circulating blood pool contrast agent," Scientific Reports, vol. 5, no. 10178, 2015.
[57] R. Temam, Navier–Stokes equations: Theory and numerical analysis. North-Holland Pub. Co., 1977.
[58] R. Tomer, L. Ye, B. Hsueh, and K. Deisseroth, "Advanced CLARITY for rapid and high-resolution imaging of intact tissues," Nature Protocols, vol. 9, no. 7, pp. 1682–1697, 2014.
[59] T. ur Rehman, E. Haber, G. Pryor, J. Melonakos, and A. Tannenbaum, "3D nonrigid registration via optimal mass transport on the GPU," Medical Image Analysis, vol. 13, no. 6, pp. 931–940, 2009.
[60] T. Vercauteren, X. Pennec, A. Perchant, and N. Ayache, "Symmetric log-domain diffeomorphic registration: A demons-based approach," in Proc Medical Image Computing and Computer-Assisted Intervention, vol. LNCS 5241, 2008, pp. 754–761.
[61] ——, "Diffeomorphic demons: Efficient non-parametric image registration," NeuroImage, vol. 45, no. 1, pp. S61–S72, 2009.
[62] S. K. Warfield, M. Ferrant, X. Gallez, A. Nabavi et al., "Real-time biomechanical simulation of volumetric brain deformation for image guided neurosurgery," in Proc ACM/IEEE Conference on Supercomputing, 2000, pp. 23–23.
[63] Y. Yin, E. A. Hoffman, and C.-L. Lin, "Mass preserving nonrigid registration of CT lung images using cubic B-spline," Medical Physics, vol. 36, no. 9, pp. 4213–4222, 2009.
[64] M.-Q. Zhang, L. Zhou, Q.-F. Deng, Y.-Y. Xie et al., "Ultra-high-resolution 3D digitalized imaging of the cerebral angioarchitecture in rats using synchrotron radiation."