Bayesian Functional Registration of fMRI Data
Submitted to the Annals of Applied Statistics
BAYESIAN FUNCTIONAL REGISTRATION OF FMRI DATA

BY GUOQING WANG, ABHIRUP DATTA AND MARTIN A. LINDQUIST

Department of Biostatistics, Johns Hopkins University, [email protected]
Functional magnetic resonance imaging (fMRI) has provided invaluable insight into our understanding of human behavior. However, large inter-individual differences in both brain anatomy and functional localization after anatomical alignment remain a major limitation in conducting group analyses and performing population-level inference. This paper addresses this problem by developing and validating a new computational technique for reducing misalignment across individuals in functional brain systems by spatially transforming each subject's functional data to a common reference map. Our proposed Bayesian functional registration approach allows us to assess differences in brain function across subjects and individual differences in activation topology. It combines intensity-based and feature-based information into an integrated framework, and allows inference to be performed on the transformation via the posterior samples. We evaluate the method in a simulation study and apply it to data from a study of thermal pain. We find that the proposed approach provides increased sensitivity for group-level inference.
1. Introduction.
In recent years functional magnetic resonance imaging (fMRI) has provided invaluable insight into understanding the neurophysiological underpinnings of human behavior. Many of these gains have been driven by the combination of increasingly sophisticated statistical and computational techniques and the ability to capture brain data at finer spatial and temporal resolution. Together, these advances are allowing researchers to study brain-behavior correlations and develop population-level models of the functional brain representations underlying behavior, performance, clinical status and prognosis. However, a fundamental obstacle to using emerging analytic techniques effectively in these settings is individual variation in functional brain anatomy; that is, the fact that the precise locations of brain activation are not perfectly aligned across subjects. Addressing this obstacle will greatly extend the capability to perform group-level hypothesis testing, use machine learning to develop population-level brain models that predict behavior, and take advantage of the full spatial resolution of fMRI.

In a task-based fMRI study, each subject is typically administered one or more stimuli and observed at hundreds of time points. At each time point, the subject's blood oxygenation level dependent (BOLD) response is recorded at roughly 100,000 spatial locations (voxels), giving rise to multivariate time series data. In order to properly perform statistical analysis over multiple subjects, it is necessary for each voxel to lie within the same brain structure for each subject. This does not occur naturally, as all subjects have brains of different sizes and shapes. Therefore, to achieve this goal all subjects' brains are normalized to a stereotaxic space prior to analysis.

Typically, normalization is performed using a high-resolution anatomical scan that is acquired in the same session and spatially aligned with the fMRI data.
Current standard practice is to use nonlinear transformations to warp each individual subject's anatomical scan to an average, anatomically based 3D reference space (e.g., the "Montreal Neurological Institute" (MNI) space). These transformations are then applied to the fMRI data to ensure they lie in MNI space, so that data can be compared across subjects (Lindquist et al., 2008; Ombao et al., 2016).
Keywords and phrases: functional magnetic resonance imaging, group-level analysis, registration, Bayesian methods, inter-individual differences, pain.

While this procedure efficiently makes the brains of all subjects overlap with regard to gross anatomical landmarks, it does not implicitly take into consideration residual differences in brain anatomy, or the location and distribution of functional regions relative to these landmarks. For example, primary visual cortex can vary in size by as much as 2-fold across different subjects' brains (Rademacher et al., 1993) and in location relative to other anatomical landmarks (Amunts et al., 2000), as can sulcal locations (Thompson et al., 1996). In some cases, inter-individual differences can be accounted for by normalization, but in other cases they cannot. For example, the cingulate sulcus is structurally dimorphic (Vogt et al., 1995): a substantial minority of individuals possesses a double cingulate sulcus, precluding anatomical alignment using current methods. In addition, considerable variability in functional localization persists even after anatomical alignment (Duncan et al., 2009). For example, the location of visual motion-sensitive area MT can vary across individuals by more than a centimeter. The location and extent of lateral occipital cortex (Iordan et al., 2016) and fusiform face area (Allison et al., 1999; McCarthy et al., 1999) are also highly variable across individuals and located in different places relative to anatomical sulcal landmarks.

Large inter-individual differences in both brain anatomy and functional localization after anatomical alignment are a major limitation in conducting group analyses and population-level inference. This is true because a large number of statistical models are performed voxel-wise (i.e., subjects are compared using a separate model at each voxel).
This can lead to a situation where voxels from different functional brain regions are compared to one another in a model, significantly complicating interpretation. In addition, voxels are often used as features in prediction models, and any variation in the location of functional regions across subjects would lead to feature misalignment. Finally, it degrades the quality of the data. Though individual-subject, whole-brain fMRI data can now be collected at a voxel resolution of a few millimeters or smaller, once functional maps are combined across individuals to construct a population-level model, the effective spatial resolution is dramatically lower.

Recent approaches have sought to circumvent this problem by mapping individual brains into a functional population-level reference space rather than an anatomically based brain space. The first such model is the 'hyperalignment' procedure (Haxby et al., 2011). Here brain activity patterns corresponding to stimuli and other cognitive events are represented as vectors in a neural representational space spanned by the voxels in a local neighborhood. Hyperalignment rotates each participant's local voxel-wise activity patterns through multivariate voxel space using a Procrustes transformation to align the representational geometry across subjects. Mathematically, this is similar to Canonical Correlation Analysis (CCA).
A number of refinements of hyperalignment have recently appeared, including: kernel hyperalignment (Lorbert and Ramadge, 2012), which performs nonlinear hyperalignment in an embedding space; regularized hyperalignment (Xu et al., 2012), which uses a ridge CCA formulation; the two-phase joint SVD-Hyperalignment algorithm (Chen et al., 2014), where singular value decomposition (SVD) provides a lower-dimensional feature space in which hyperalignment is applied; the shared response model (Chen et al., 2014), which casts the model in a probabilistic framework; and searchlight hyperalignment (Guntupalli et al., 2016), which allows for whole-brain coverage. Other methods have been developed that align subjects based on inter-subject correlations in time-series data during movie viewing. One method is functional time series alignment (Sabuncu et al., 2010), which matches voxels across subjects using a 2D 'rubber sheet' warping and maximizing inter-subject correlations across voxels (Hasson et al., 2004). Another is functional connectivity alignment (Conroy et al., 2013), which matches voxels to a functional reference by minimizing the Frobenius norm of the difference between a subject's connectivity matrix and a reference matrix with a shape-preserving penalty function. Both methods use warping and penalization strategies specific to the cortex, so it is unclear how easily they can be adapted to include subcortical regions.

This body of work shows great promise for allowing participants to vary in their functional activation patterns (Haxby et al., 2011), but suffers from several shortcomings. First, hyperalignment requires subjects to watch a long film (up to 2 hours) to align subjects into a common representational space, and functional connectivity-based methods require substantial resting-state data. Second, the choice of reference data influences which types of functional patterns can be appropriately aligned.
For example, movie reference data works well for audio-visual representations but not prefrontal and limbic networks. Functional connectivity-based measures work better for limbic cortex but are expected to perform worse on object and semantic representations. Neither type of method has been tested on subcortical representations, or on clinically relevant functions such as pain and emotion. Third, these methods do not include an explicit spatial model able to make inferences on the location or extent of activation, though a few recent studies have taken steps in this direction (Nenning et al., 2017).

In this work, we focus on performing local registration of brain activation using subject-specific functional activation maps. These maps are the output of a series of voxel-wise models that quantify the activation in the BOLD signal that arises due to the stimulus of interest. We seek to warp (i.e., spatially transform) a chosen floating map (i.e., a subject-specific activation map) to some predefined reference map. In service of this goal we propose a generalized Bayes approach that is capable of cropping the subject-specific activation maps to match certain pre-defined regions of interest (ROIs). The prior distribution is estimated by matching the features of the activation maps, which are commonly defined by local peaks. The pseudo-likelihood is a representation of the similarity measure between the reference map and the warped floating map, which is given by the sum of squared differences (SSD). Furthermore, our work simultaneously models an intensity correction term which compensates for differences in the level of activation between the reference and floating maps. As the images are defined on a regular lattice, the discreteness leads to the need to utilize Gaussian process based kriging interpolation techniques (Banerjee et al., 2014; Cressie and Wikle, 2015) while realigning images.

This paper is organized as follows.
Section 2 describes the data, which come from an fMRI study of thermal pain. In Section 3 we introduce our generalized Bayes framework for performing functional image registration. In Section 4 we describe a simulation study used to evaluate the performance of the method, and in Section 5 we apply the method to data from the thermal pain study. The paper concludes with a discussion.
2. An fMRI Study of Thermal Pain.
The data are from an fMRI study of thermal pain; see Woo et al. (2015) for an in-depth discussion of the data set. In brief, healthy, right-handed subjects completed the study. All subjects gave informed consent, and the Columbia University Institutional Review Board approved the study.

Seven runs were administered to each subject during a single session, with each run consisting of between 58 and 75 trials. In our analysis the runs were concatenated and analyzed together. In each trial, thermal stimulation was delivered to the left inner forearm. Each stimulus consisted of a ramp-up period, a plateau at the target temperature, and a ramp-down period. In total, stimuli were administered at six different temperatures, increasing in fixed increments. Each stimulus was followed by a pre-rating period, after which subjects rated the intensity of their pain, and each trial ended with a resting period. (Exact stimulus timings, temperatures, and rating scales are given in Woo et al., 2015.)

For each subject, 1845 images were acquired using a 3T Philips Achieva TX scanner at Columbia University. Structural images were acquired using high-resolution T1 spoiled gradient recall (SPGR) images. Functional echo planar images (EPIs) were acquired with interleaved slices, parallel imaging, and sensitivity encoding (SENSE); acquisition parameters are detailed in Woo et al. (2015). For each subject, structural images were co-registered to the mean functional image using the iterative mutual information-based algorithm in SPM8. Thereafter, images were normalized to Montreal Neurological Institute (MNI) space using SPM8's generative segment-and-normalize algorithm. Prior to preprocessing of the functional images, the first four volumes were removed to allow for image intensity stabilization. Outliers were identified using the Mahalanobis distance for the matrix of slice-wise mean and standard deviation values.
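This outlier screen can be sketched as follows. The array shapes, injected artifact, and simple z-score cutoff on the distances are illustrative assumptions for exposition, not the study's actual implementation:

```python
import numpy as np

def mahalanobis_outliers(X, cutoff=3.0):
    """Flag rows of X (one row per volume, columns = slice-wise means/SDs)
    whose Mahalanobis distance from the sample mean is extreme, using a
    simple z-score rule on the distances."""
    mu = X.mean(axis=0)
    icov = np.linalg.pinv(np.cov(X, rowvar=False))   # pinv guards against rank deficiency
    d = np.sqrt(np.einsum('ij,jk,ik->i', X - mu, icov, X - mu))
    return d > d.mean() + cutoff * d.std()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))    # 200 volumes x 12 slice-wise summary values
X[50] += 8.0                      # inject one artifact-like volume
flags = mahalanobis_outliers(X)
print(int(flags.sum()), bool(flags[50]))
```

The flagged time points feed into the indicator regressors described in the GLM below.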
The functional images were corrected for differences in slice timing and motion corrected using SPM8. These images were warped to SPM's normative atlas using warping parameters estimated from the co-registered high-resolution structural images, and smoothed with a full width at half maximum (FWHM) Gaussian kernel. A high-pass filter was applied to the time series data.

For each subject, a voxel-wise general linear model (GLM) analysis was performed using SPM8. For each temperature, boxcar regressors convolved with the canonical hemodynamic response function (Lindquist et al., 2009) were constructed to model periods corresponding to the thermal stimulation and the rating periods. Other regressors of non-interest (i.e., nuisance variables) included (a) a run-specific intercept; (b) linear drift across time within each run; (c) the six estimated head movement parameters (x, y, z, roll, pitch, and yaw), their mean-centered squares, their derivatives, and squared derivatives for each run; (d) indicator vectors for outlier time points identified based on their multivariate distance from the other images in the sample; (e) indicator vectors for the first two images in each run; (f) signals from white matter and ventricles. For each temperature, the regression coefficients corresponding to the temperature-specific regressor at each voxel were combined into a single activation map. Hence, for each subject we had six different brain maps depicting the functional response across the brain to each of the temperatures.

In this work, we primarily focus on the subject-specific mean across the six brain maps to analyze the inter-subject variability. We use these data to perform functional alignment. In particular, we focus our analysis on a somatosensory region of the brain; Figure 7 shows the region across subjects. Within this region one can identify several local peaks within individual participants that conform to known somatosensory areas.
However, there clearly exists substantial inter-subject variability in their exact location.
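The temperature-specific regressors described above (boxcar stimulus functions convolved with a canonical hemodynamic response function) can be sketched as follows. The double-gamma HRF shape, TR, and onset times are illustrative assumptions rather than the study's exact SPM settings:

```python
import numpy as np
from math import gamma

def canonical_hrf(t):
    """Double-gamma HRF (peak near 6 s, undershoot near 16 s) -- a common
    approximation to SPM's canonical HRF, not its exact implementation."""
    return t**5 * np.exp(-t) / gamma(6) - (1.0 / 6.0) * t**15 * np.exp(-t) / gamma(16)

tr = 2.0                                  # assumed repetition time in seconds
t = np.arange(0, 32, tr)                  # 32 s of HRF support, sampled at the TR
hrf = canonical_hrf(t)

n_scans = 150
boxcar = np.zeros(n_scans)
onsets, dur = [10, 50, 90], 6             # hypothetical trial onsets/duration (in scans)
for o in onsets:
    boxcar[o:o + dur] = 1.0               # boxcar stimulus function

regressor = np.convolve(boxcar, hrf)[:n_scans]   # temperature-specific regressor
print(int(regressor.argmax()))
```

The resulting regressor peaks a few scans after each stimulus onset, reflecting the hemodynamic delay.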
3. Methods.
The goal of image registration is to find a mapping from each voxel in one image (the floating map) to a corresponding position in another image (the reference map). This mapping involves operations on the floating map (e.g., translation, scaling, rotation) described by a transformation operator T. To formalize the problem, let Y = (Y(s_1), Y(s_2), ..., Y(s_V))^T and R = (R(s_1), R(s_2), ..., R(s_V))^T be the floating map and reference map, respectively, located at the same set of voxel locations S = {s_1, s_2, ..., s_V}. A conventional approach for such a registration problem is to use a generative model of the form R ~ F(Y(T), w), where Y(T) = (Y(T(s_1)), ..., Y(T(s_V)))^T and w denotes all the parameters involved in the transformation operator T and the distribution F. This requires the specification of a full probability distribution F. Also, such a probability model will typically posit the reference map R to be noisier than the floating map. For example, one can specify a generative model F as

(1)  R(s) = b Y(T(s, w)) + e(s),  for s in S,

where Y(T(s, w)) is the activation of the floating map at the transformed location T(s, w), w are the parameters characterizing the transformation, the factor b adjusts for the discrepancy of intensities between the floating map and the reference map, and e(s) is additive noise. It is evident from (1) that R is modeled as noisier than Y due to the errors e. This is undesirable, as in practice the observed floating map is expected to be equally or more noisy than the reference, and hence generative models of this form are likely to be misspecified.

Furthermore, the generative model approach becomes problematic when working with multiple floating maps Y_i, as it requires multiple generative models for the common reference map R.
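As a toy illustration of model (1) in one dimension, the following sketch builds a reference map from a translated, intensity-rescaled floating map plus noise. The Gaussian activation profile, shift, intensity factor, and noise level are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
V = 64
s = np.arange(V)

Y = np.exp(-0.5 * ((s - 30) / 5.0) ** 2)    # floating map: a smooth activation bump

def T(s, w):
    return s - w                             # transformation: pure translation by w voxels

b, w, sigma = 1.5, 4, 0.05                   # illustrative intensity factor, shift, noise
YT = np.interp(T(s, w), s, Y)                # Y evaluated at the transformed locations
R = b * YT + rng.normal(0, sigma, V)         # model (1): R(s) = b * Y(T(s, w)) + e(s)

# The reference inherits the noise e on top of b*Y(T): its peak sits at the
# shifted location, and the residual spread matches sigma.
print(int(R.argmax()), round(float((R - b * YT).std()), 2))
```

Note that R, not Y, carries the noise here, which is exactly the misspecification concern raised above.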
A possible model-based solution to the multiple registration problem would be to switch the roles of Y and R in (1), expressing Y(s) = b R(T(s, w)) + error, which then easily extends to multiple images Y_i. This formulation naturally models all the floating maps using the common reference map R, and mitigates the issue of the reference map being modeled as more noisy. However, the model fit b̂ R(T(s, ŵ)) removes all subject-specific features that are not present in the reference. Also, this switched model does not offer any direct way to register the floating image to the reference. As these properties are undesirable, alternative approaches must be explored.

3.1. Generalized Bayes framework for registration.
For the above-mentioned reasons, we abandon a fully model-based formulation and instead adopt a more flexible, general loss function based approach to measure and correct for relative misalignment of the individual map Y with respect to the reference map R. We consider a loss l(R, Y(T), w), where Y(T) = (Y(T(s_1)), ..., Y(T(s_V)))^T and w denotes all the parameters involved in the transformation operator T and the loss function l. If pi(w) denotes a prior on the parameters, then the posterior distribution given the loss function l is

(2)  pi(w | data) = exp(-l(R, Y(T), w)) pi(w) / Integral exp(-l(R, Y(T), w)) pi(w) dw.

Posteriors of the form (2) that do not rely on a probability model for the data but simply on a loss function are called Gibbs posteriors or generalized posteriors, and are becoming increasingly common in Bayesian inference. They feature in early works of Vovk (1990); Shawe-Taylor and Williamson (1997); McAllester (1999); Walker and Hjort (2001), and their posterior summaries are referred to as Laplace Type Estimators (LTE) in Chernozhukov and Hong (2003), who developed extensive theoretical results on posterior concentration, asymptotic normality and valid confidence intervals for such posteriors and estimators. Bissiri et al. (2016) argued that updates of the form (2) are the only coherent update of a prior belief pi given the data and the choice of the loss function l.

The loss function approach is compatible with multiple registration to a common template. When working with multiple floating maps, the total loss is simply the sum of the losses for the individual maps. For example, registering K images Y_i, i = 1, . . .
, K to a common R entails a simple extension of the loss function to Sum_{i=1}^{K} l(R, Y_i(T), w_i), where the parameters w_i can be image-specific or shared across images (i.e., w_i = w), depending on the context.

In this manuscript, we define the loss function as the distance between the warped image Y(T) and R,

(3)  l(R, Y, b, phi, w) = (1/(2 phi)) ||R - b Y(T(., w))||^2_{L2},

and update prior beliefs about the parameters using this loss. While there are many potential image similarity measures (Crum et al., 2004), we use the sum of squared differences (SSD), as it has been demonstrated to be appropriate for single-modal brain registration (Simpson et al., 2012). However, the general framework outlined here is agnostic to the choice of loss function, and other losses, such as the mutual information loss (Viola and Wells, 1997; Thevenaz and Unser, 1998) or correlation-based losses (Gruen and Baltsavias, 1987), can be used depending on the application.

To implement this approach, decisions need to be made regarding how spatial interpolation is performed on the discrete images to obtain the Y(T) used by the loss function, as well as the choices of transformation operator, reference map, prior and hyperparameters. We discuss these decisions in detail below.

3.2. Kriging Interpolation.
It is important to note that updating the parameters via optimization, or sampling using the loss function (3), requires evaluating the warped image Y(T) for a given value of the transformation parameters w. As the activation maps Y are only observed on a predefined set of V voxels, the warped images Y(T), at the transformed set of locations, are acquired through spatial interpolation. Kriging is a geostatistical smoothing and interpolation method that predicts the unobserved values of the outcome at new locations given observations at current locations (Matheron, 1963; Banerjee et al., 2014; Cressie and Wikle, 2015).

To apply kriging as an exact interpolator, Y is assumed to be a realization of a Gaussian process with mean mu and spatial covariance C, where C is V x V with C_{l,m} = sigma^2 exp{-rho ||s_l - s_m||}. Note that we do not introduce a nugget effect term in the covariance function, as the purpose here is exact interpolation. The spatial parameters sigma^2 and rho can be estimated within the Bayesian framework jointly with the other parameters, although, increasingly, a priori estimation of these parameters is being adopted as a pragmatic strategy (Finley et al., 2019). Here we estimate them based on the marginal likelihood of Y using maximum likelihood estimation.

The ordinary kriging interpolator is the best linear unbiased predictor (BLUP) under the squared error loss, and can be expressed as the conditional mean of the warped images given the original images, taking the form

(4)  Y(T)-hat = E(Y(T) | Y) = [(1^T C^{-1} Y)/(1^T C^{-1} 1)] 1 + K^T C^{-1} ( Y - [(1^T C^{-1} Y)/(1^T C^{-1} 1)] 1 ),

where K = Cov(Y, Y(T)) is the covariance matrix between Y(T) and Y, with K_{i,j} = sigma^2 exp{-rho ||s_i - T(s_j)||} the covariance function evaluated at the original voxels and the warped voxels, and 1 is a vector of ones of length V.

3.3. Transformation Operator.
We here choose the similarity transformation (Cederberg, 2001), which consists of translation, scaling and rotation. For the case of 2D images this can be expressed as follows:

(5)  T(s) = A s + theta,  where  A = ( cos(w)  -sin(w) ; sin(w)  cos(w) ) ( sigma_x  0 ; 0  sigma_y ),

where s = (s_x, s_y)^T, theta = (theta_x, theta_y)^T in R^2, w in (-pi, pi), sigma_x > 0, sigma_y > 0. The similarity transformation in (5) is a special case of the general 2D affine transformation. To see this, we decompose the transformation matrix A as

(6)  A = ( A_11  A_12 ; A_21  A_22 )
       = ( cos(w)  -sin(w) ; sin(w)  cos(w) ) ( 1  [A_12 cos(w) + A_22 sin(w)] / [A_22 cos(w) - A_12 sin(w)] ; 0  1 ) ( sign(A_11) sqrt(A_11^2 + A_21^2)  0 ; 0  A_22 cos(w) - A_12 sin(w) ),

where w = arctan(A_21 / A_11). We note that the expression above is the result of the QR decomposition. The first term is the rotation matrix, followed by the shearing matrix; the last term is the scaling matrix. The equivalence between Eqns. (5) and (6) can now be derived by adding a constraint to eliminate the shearing effect, i.e., setting [A_12 cos(w) + A_22 sin(w)] / [A_22 cos(w) - A_12 sin(w)] = 0, or equivalently A_11 A_12 + A_21 A_22 = 0. Under this constraint, the decomposition simplifies to

(7)  A = ( A_11  A_12 ; A_21  A_22 ) = ( cos(w)  -sin(w) ; sin(w)  cos(w) ) ( sign(A_11) sqrt(A_11^2 + A_21^2)  0 ; 0  A_22 / cos(w) ).

Therefore, there exists a bijective mapping between (sigma_x, sigma_y, w)^T and vec(A) under the constraint that A_11 A_12 + A_21 A_22 = 0. Furthermore, if we are interested in positive scalings, sigma_x > 0, sigma_y > 0, this can be achieved by requiring A_11 > 0 and A_22 > 0. We will exploit this connection to formulate the prior for the transformation parameters in the next section.

3.4. Regularization Prior.
There are several obstacles to optimizing the transformation parameters in the general registration problem. First, it is an inherently ill-posed problem. A classic example where two distinct registration routines can lead to the same registration result was discussed in Fischer and Modersitzki (2008), who showed a subtle example in which either a translation or a rotation can yield the same optimum. Second, it is difficult to avoid unwanted local optima. Both of these difficulties can be addressed by including a regularization, or penalty, term. In this section we adapt this idea by introducing regularization priors for our generalized Bayes framework.

The generalized (Gibbs) posterior distribution of the warping parameters has the form

(8)  p(T, phi, b | Y, R) proportional to phi^{-V/2} exp{ -(1/(2 phi)) ||R - b Y(T)||^2_{L2} } pi(T, phi, b),

where pi(T, phi, b) is the regularization prior. We specify pi(T, b, phi) = pi(T | phi) pi(b | phi) pi(phi) for choices of pi(T | phi) and pi(b | phi) that lead to the following form of the negative log-posterior (up to additive constants):

(9)  (1/(2 phi)) ||R - b Y(T)||^2_{L2} + (lambda_T / (2 phi)) Reg(T) + (lambda_b / (2 phi)) (log(b) - log(b_0))^2,

where the regularizer Reg can be chosen in different ways. The terms lambda_b and lambda_T are tuning parameters, which control the trade-off between the fit of the registration likelihood and the regularization priors. The term (log(b) - log(b_0))^2 regularizes the intensity-correction factor, and b_0 is one of the hyper-parameters whose choice will be described in the next section. To gain intuition on the necessity of this term, note that without penalizing values of b away from some b_0, the MCMC may become stuck in loss function valleys centered around a local minimum. For example, around b = 0, the loss function is flat for all values of the transformation parameters w. The prior for b leading to (9) is given by log(b) | phi ~ N(log(b_0), phi / lambda_b).
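Sampling from a Gibbs posterior of this form can be sketched with random-walk Metropolis on a toy one-dimensional translation problem. Everything below is illustrative: the map, noise level, fixed scale phi, penalty weight, and prior centre are arbitrary choices, and a quadratic penalty on the shift stands in for the full regularizer Reg(T):

```python
import numpy as np

rng = np.random.default_rng(2)
V = 64
s = np.arange(V)
truth = 4.0                                   # true (unknown) shift

Y = np.exp(-0.5 * ((s - 30) / 5.0) ** 2)      # floating map: smooth 1-D bump
R = np.interp(s - truth, s, Y) + rng.normal(0, 0.05, V)   # reference = shifted Y + noise

phi, lam, w0 = 0.05, 0.01, 0.0                # fixed scale, penalty weight, prior centre

def neg_log_post(w):
    """Negative log Gibbs posterior: SSD loss plus a quadratic penalty on w."""
    YT = np.interp(s - w, s, Y)               # warped floating map (linear interpolation)
    return np.sum((R - YT) ** 2) / (2 * phi) + lam * (w - w0) ** 2 / (2 * phi)

# Random-walk Metropolis targeting exp{-neg_log_post(w)}
w, samples = 0.0, []
cur = neg_log_post(w)
for it in range(4000):
    prop = w + rng.normal(0, 0.3)
    new = neg_log_post(prop)
    if np.log(rng.uniform()) < cur - new:     # accept with prob min(1, exp(cur - new))
        w, cur = prop, new
    if it >= 1000:                            # discard burn-in
        samples.append(w)

print(round(float(np.mean(samples)), 1))      # posterior mean shift, near the true value
```

The posterior mean concentrates near the true shift, pulled slightly toward the prior centre by the penalty, illustrating the fit-versus-regularization trade-off controlled by the tuning parameters.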
The regularization of T, Reg(T), is taken to measure the distance between the coordinates transformed by the proposed T and some initial warping map T_0 defined by hyper-parameters A_0 and theta_0. Formally, the regularizer is

(10)  Reg(T) = Sum_{s in S} ||T(s) - T_0(s)||^2_{L2} = Sum_{s in S} ||(A - A_0)s + theta - theta_0||^2_{L2}.

The expression of the regularizer can be further simplified as

Reg(T) = trace{ (M - M_0)^T Sigma_s (M - M_0) },

where M = [A theta]^T = ( A_11  A_12  theta_x ; A_21  A_22  theta_y )^T, M_0 = [A_0 theta_0]^T, and

Sigma_s = ( s_x^T s_x  s_x^T s_y  s_x^T 1 ; s_y^T s_x  s_y^T s_y  s_y^T 1 ; 1^T s_x  1^T s_y  1^T 1 ),

with s_x and s_y denoting the vectors of x- and y-coordinates of the voxels. Hence, using the prior vec(M) | phi ~ N( vec(M_0), (phi / lambda_T) I (Kronecker product) Sigma_s^{-1} ), where I is the 2 x 2 identity matrix, ensures that the Gibbs posterior is of the form (9). Note that M has 6 parameters while our transformation is only defined by the 5 parameters w, sigma_x, sigma_y, theta_x, theta_y. As discussed in Section 3.3, there is a one-to-one relationship between our parametrization and the general 2D affine transformation parametrized by M under the constraint A_11 A_12 + A_21 A_22 = 0. Hence, we only use the multivariate normal prior for the 3-dimensional subset of A along with theta to correspond to the priors on the rotation, scaling and translation parameters. Finally, we use the standard reference prior for the scale parameter phi, i.e., pi(phi) proportional to 1/phi.

Combining all the pieces in the generalized Bayes framework, the Gibbs posterior distribution of the warping parameters has the form

(11)  p(theta, sigma_x, sigma_y, w, b, phi | Y, R) proportional to phi^{-V/2 - 1} exp{ -(1/(2 phi)) ||R - b Y(T)||^2_{L2} - (lambda_b / (2 phi)) (log(b) - log(b_0))^2 - (lambda_T / (2 phi)) trace{ (M - M_0)^T Sigma_s (M - M_0) } }.

3.5. Corresponding Feature-based Prior Estimation.
Informed selection of the regularization hyper-parameters b_0 and M_0 is critical to ensure identifiability of the registration problem. In this section we outline an algorithm for prior (hyper-parameter) estimation based on corresponding features. The loss function (3) is based solely on the intensity information of the images, and does not use the landmark information of the activation maps, which forms an alternative branch of image registration. The prior (hyper-parameters) of the transformation can be estimated by incorporating information about the landmarks. This approach does not encounter computational burden with a moderate number of landmarks. In this work, we define the landmarks as the regional peaks (maxima) in clusters consisting of eight-adjacent voxels.

Let p_R = {s_i^R in R^2 | i = 1, ..., N_R} and p_Y = {s_i^Y in R^2 | i = 1, ..., N_Y} be the landmarks of the reference map and the floating map, respectively, where N_R and N_Y are the numbers of landmarks in the reference map and floating map. As a first step, a coarse sub-region of interest of the reference map is determined manually based on scientific beliefs. This sub-region contains a subset of landmarks, p_R-hat, of size denoted by N_R-hat. We aim to find the corresponding subset of landmarks p_Y-hat from p_Y. Many conventional methods, including Geometric Hashing (Mian et al., 2006) and Iterative Closest Point (Yang and Medioni, 1992; Besl and McKay, 1992), work for point set matching. However, these methods fail to capture the intensity features of the maps. In our application, as the sizes of the sets of landmarks are relatively small, it is computationally efficient to apply a brute-force search that tries all possible N_R-hat dimensional subsets of p_Y to determine the matching landmarks.

The similarity transformation can be estimated using Procrustes analysis.
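Before turning to the Procrustes step, the brute-force correspondence search can be sketched in a small example. Everything below is illustrative: the landmark coordinates are made up, a plain least-squares affine fit stands in for the Procrustes estimation, and candidate triples are ranked only by how far the fitted linear part is from the identity, a simple regularity score in the spirit of the constraint discussed later in this section:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(3)

p_R = np.array([[10.0, 12.0], [20.0, 15.0], [14.0, 24.0]])   # reference landmarks

# Floating landmarks: a mild similarity transform of p_R plus two spurious peaks
w = np.deg2rad(5.0)
A_true = np.array([[np.cos(w), -np.sin(w)],
                   [np.sin(w),  np.cos(w)]]) @ np.diag([1.1, 0.9])
p_match = p_R @ A_true.T + np.array([2.0, -1.0])
p_Y = np.vstack([p_match, [[40.0, -10.0], [-10.0, 40.0]]])
order = rng.permutation(len(p_Y))            # hide the true correspondence
p_Y = p_Y[order]

def fit_affine(src, dst):
    """Least-squares 2-D affine fit: dst ~ src @ A.T + theta (exact for 3 points)."""
    X = np.hstack([src, np.ones((len(src), 1))])
    coef, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return coef[:2].T, coef[2]

def regularity(idx):
    """Distance of the fitted linear part from the identity."""
    A, _ = fit_affine(p_R, p_Y[list(idx)])
    return np.linalg.norm(np.eye(2) - A, 'fro')

best = min(permutations(range(len(p_Y)), 3), key=regularity)
print(order[list(best)].tolist())            # indices of the recovered correspondence
```

Because any three candidate points can be fit exactly by an affine map, a regularity term of this kind is what discriminates plausible correspondences from spurious ones.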
Here we apply a modified version of Procrustes analysis, namely the ABC Procrustean Algorithm (Awange et al., 2008). This version is preferable to the ordinary Procrustes algorithm as it is capable of finding distinct scale parameters in each axial direction, as opposed to the estimation of a uniform scale.

The objective function used to search for correspondence is the sum of squared distances between the query landmarks and the transformed candidates. Again, we need to add a constraint term to address the ill-posed registration problem, as discussed at the beginning of this section. Formally, the mechanism can be formulated as follows:

(12)  ||p_R-hat - T_alpha(p_Y-hat)||^2 <= d  subject to  S(p_Y-hat - T_alpha(p_Y-hat)) < alpha,

using the transformation operator T_alpha(s) = A_alpha s + theta_alpha. The fact that the activation maps are discrete realizations of continuous maps motivates the use of a distance less than a threshold d, instead of minimizing the distance between landmarks. The discreteness of a map results in bias in the coordinates of the landmarks, as the underlying location of a landmark can be anywhere within a pixel. We suggest taking d = N_R-hat, which allows each landmark to deviate by up to one voxel. The objective function creates a set of candidate correspondences, and we choose the optimal correspondence based on the intensity loss.

The constraint S is motivated by diffusion registration (Fischer and Modersitzki, 2003) and has been applied to affine models by Chumchob and Chen (2009). By defining a function D: R^2 -> R^2 with D(p) = (D_1(p), D_2(p))^T = p - T(p), it can be expressed as follows:

(13)  S(p - T(p)) = (1/2) Integral ( |grad D_1(p)|^2 + |grad D_2(p)|^2 ) dp = ||I - A||^2_F,

where ||.||_F denotes the Frobenius norm, I is the identity matrix, and A is defined in equation (5). The other application of the constraint defined in Eq. (13) is to determine the orientation of p_Y-hat.
For example, in the case of N̂_R = 3 and p̂_Y = {s₁^Y, s₂^Y, s₃^Y}, there are six possible correspondences to p̂_R, and the transformation can be derived for each of them. A mismatched correspondence may result in an excessive transformation, which can cause redundant rotation and scaling. Therefore, the correspondence minimizing S(p̂_Y − T_α(p̂_Y)) is used to determine the orientation of p̂_Y.

The threshold α in Eq. (12) is chosen automatically using the Photometric Error Criterion (PEC) (Brunet et al., 2010). For each α, the optimization problem in Eq. (12) is used to determine the corresponding features as well as the associated transformation. To evaluate the hyper-parameter α, the criterion takes advantage of the intensity information, and is defined as

(14)  C(α) = (1/|G|) Σ_{p∈G} ‖R(p) − b̂ Y(T_α(p))‖².

Here G is the region of interest and |G| is its size. The discrepancy in scale between the target map and the reference map is corrected by the scaling factor b̂. Note that b̂ can be estimated as the regression coefficient of R(p) against Y(T_α(p)) without an intercept, as the criterion in Eq. (14) is the mean squared error of this regression. This estimate b̂ is used as a hyper-parameter in the prior of b in Section 3.5. We may further define a robust photometric error criterion (RPEC) by making a heavy-tailed distribution assumption on the residuals, for instance, a Student-t distribution with a small number of degrees of freedom. The selected α minimizes the criterion in Eq. (14), and produces the feature correspondence and the estimate of the transformation M̂, which is used as the hyper-parameter in the prior for M in Section 3.5. We summarize the full steps of prior estimation in Algorithm 1.

3.6. Choice of Reference Map.
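As a small numerical illustration of Eq. (14) (our sketch, not the authors' code): given the reference intensities over G and the floating-map intensities at the transformed voxels, the scale factor b̂ is the no-intercept regression coefficient and C(α) is the resulting mean squared error.

```python
import numpy as np

def photometric_error(R_vals, Y_vals):
    """R_vals: reference intensities R(p) for p in the ROI G.
    Y_vals: floating-map intensities Y(T_alpha(p)) at the same voxels.
    Returns (b_hat, C): the no-intercept regression coefficient of R on Y
    and the photometric error C(alpha) of Eq. (14)."""
    b_hat = np.dot(Y_vals, R_vals) / np.dot(Y_vals, Y_vals)
    return b_hat, np.mean((R_vals - b_hat * Y_vals) ** 2)
```

When the floating map is an exact rescaling of the reference (R = 2Y, say), the criterion recovers b̂ = 2 and a photometric error of zero, so the α producing that correspondence would be selected.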
A critical decision when performing registration is choosing an appropriate reference map. One option is to use the group mean map across all floating maps. However, this will generally make the reference map blurry and the estimated transformation highly variable. In our application, the reference subject is chosen based on scientific knowledge. For example, researchers may choose the functional map that best represents the areas of interest to be the reference. In general, we suggest smoothing the reference map prior to model fitting, as the raw map may contain outliers that unduly influence the fit through the introduction of local modes.

It is worth noting that instead of aligning the whole-brain functional map, we instead focus on estimating the transformation with respect to local regions of interest. In this way, estimation will not be unduly influenced by other 'nuisance' regions that do not contain signal of interest. The query map, cropped by a rectangular bounding box from the reference map, describes the local area of interest; see Figure 1 for an example. The transformed bounding boxes crop the floating maps to match and register to the query map. As the affine transformation we focus on is linear, the variability of the bounding boxes on the floating maps is equivalent to the variability of the registration.

3.7. Model Fitting.
Inference on the parameters is based on draws from the posterior distribution. We use the Markov chain Monte Carlo (MCMC) algorithm offered in the Stan software to simulate from the posterior distributions. Stan (Carpenter et al., 2017) implements the No-U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014); we refer readers interested in the details of the algorithm to the original work. Three distinct chains are fitted and initialized at values close to the prior estimates in order to accelerate the model fitting.

The only hyper-parameters not determined in Section 3.5 are λ_b and λ_T, which control the amount of regularization. It is common to choose such hyper-parameters using cross-validation. However, cross-validation embedded into MCMC runs is computationally intensive, and it requires a large amount of parallel computation to cover all possible splits of the data. Vehtari et al. (2017) proposed an approach for approximating leave-one-out cross-validation using Pareto smoothed importance sampling (PSIS-LOO) which can be easily computed. Compared to the Watanabe-Akaike information criterion (WAIC; Watanabe (2010)), we prefer PSIS-LOO as it is more robust in the finite-sample case with weak priors or influential observations.

Along with PSIS-LOO, we also calculate the potential scale reduction statistic (Gelman and Rubin, 1992), R̂, as a metric of sampling convergence. Starting with relatively high values of both λ_b and λ_T, we gradually fit the model with decreasing λ until R̂ indicates convergence. This step keeps the model away from unwanted local modes. For each choice of (λ_b, λ_T) we compute the PSIS-LOO estimate and select the pair minimizing it.
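For intuition on the convergence check, R̂ compares within- and between-chain variability of the posterior draws. The following is our own minimal sketch of the split-R̂ variant for a single parameter (Stan reports this diagnostic automatically; this is purely illustrative).

```python
import numpy as np

def split_rhat(chains):
    """chains: (n_chains, n_draws) array of posterior draws for one parameter.
    Returns the split potential scale reduction statistic R-hat."""
    m, n = chains.shape
    half = n // 2
    # Split each chain into two halves to detect within-chain drift.
    sub = chains[:, :2 * half].reshape(2 * m, half)
    W = sub.var(axis=1, ddof=1).mean()          # within-chain variance
    B = half * sub.mean(axis=1).var(ddof=1)     # between-chain variance
    var_hat = (half - 1) / half * W + B / half  # pooled variance estimate
    return np.sqrt(var_hat / W)
```

Well-mixed chains targeting the same distribution yield R̂ near 1; values noticeably above 1 indicate that the chains have not converged, which in our fitting schedule triggers a larger regularization value λ.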
4. Simulation.
To illustrate the validity of our method we perform a simulation study using maps synthesized from the real data. Figure 1 shows an example functional activation map. The query part of the map is manually cropped and shown within the dashed rectangle. The left panel of Figure 2 shows the query map. The floating map shown in the right panel is generated by warping the reference map with a similarity transformation (including translation, scaling and rotation). The underlying parameters are translations θ = (2, − ), scalings σ = (0. , . ), and rotation ω = π/ . White noise ε, with ε ∼ N(0, ), is subsequently added to the map. We compare the learned prior information with the posterior in Figure 3, where the region surrounded by the dashed box represents the prior information and the solid box crops the matched region transformed by the posterior parameters. The boxes are generated by transforming the border box of the reference map using, respectively, the prior estimates and the medians of the posterior samples of the transformation parameters.

The models were fitted using a sequence of regularization parameters (λ_b, λ_T) in Eq. (11). The expected log predictive density (ELPD) was estimated for each model. Figure 4 plots −2 × ELPD, which puts it on the deviance scale. The samples produced by the model with the lowest −2 × ELPD were used to estimate the posterior distribution. The uncertainty of the transformation is summarized by the posterior distributions of the translation, scaling and rotation parameters, shown in Figure 5 with the truth superimposed as dashed vertical lines; the posterior samples successfully recover the truth. We further ran 100 simulations with different underlying transformations. Figure 6 plots the posterior means against the true transformations; the method recovered the truth in each case.
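The map synthesis above can be sketched as follows. This is our own illustration using scipy's resampling routine; the parameter values in the test are placeholders, not those of the simulation study.

```python
import numpy as np
from scipy.ndimage import affine_transform

def warp_similarity(img, scales, angle, shift):
    """Warp a 2D map by a similarity transformation with per-axis scales.
    Note scipy's affine_transform uses the inverse (output-to-input) mapping:
    out[p] = img(A p + shift), with A = R(angle) @ diag(scales)."""
    c, s = np.cos(angle), np.sin(angle)
    A = np.array([[c, -s], [s, c]]) @ np.diag(scales)
    return affine_transform(img, A, offset=shift, order=1)
```

White noise would then be added to the warped map before fitting, mimicking the measurement error in real activation maps.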
5. Real Data Results.
We use the data described in Section 2 to perform functional alignment. Figure 7 shows functional activation maps of the region of interest across the subjects. Note that within this region there are multiple peaks. However, there also exists substantial inter-subject variability in their exact locations. For example, comparing subjects 14 and 20 suggests a potential transformation consisting of rotation and shifting. Each of these functional activation maps is processed through the adaptive thresholding introduced by Bradley and Roth (2007). The landmarks are then identified as the regional maxima over neighborhoods of 8 adjacent locations.

The same reference map is utilized as described in the simulation; see Figure 1. The subject-specific maps are shown in Figure 8, and the estimated corresponding areas are indicated using the dashed boxes. Figure 9 shows the matching area warped by the posterior mean of the transformation, with the 95% credible region of the transformations superimposed as the gray-shaded region. The credible region of the transformation is found using the density-based clustering algorithm DBSCAN, introduced by Ester et al. (1996). DBSCAN is an unsupervised clustering algorithm that separates clusters based on density; we refer readers to the original work for further details. To apply DBSCAN, we require at least 5 samples per epsilon neighborhood and perform a grid search over the size of the epsilon neighborhood to ensure that the single resulting cluster captures about 95% of the posterior samples. The shaded region in the figure is then obtained by warping the search box with each posterior sample of the transformation. The inter-subject variability in registration can be visualized using the estimated credible regions.
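The grid search just described can be sketched as follows, using scikit-learn's DBSCAN. This is our own illustration; the epsilon grid is an assumption, not the values used in the analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def credible_cluster(samples, target=0.95, min_samples=5):
    """Grid-search the epsilon radius until a single DBSCAN cluster
    captures at least `target` of the posterior samples; return that
    cluster (the approximate 95% credible set of transformations)."""
    for eps in np.linspace(0.05, 5.0, 100):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(samples)
        # labels.max() == 0 means exactly one cluster (label 0), rest noise.
        if labels.max() == 0 and np.mean(labels == 0) >= target:
            return samples[labels == 0]
    return samples  # fallback: no suitable single cluster found
```

Warping the search box by each retained posterior sample then traces out the shaded credible region shown in Figure 9.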
For instance, subject 25 exhibits a high degree of uncertainty, while subject 14 exhibits relatively low uncertainty.

To evaluate the performance of the functional alignment procedure, we performed a second-level (across-subjects) analysis to test for significant temperature-related effects. Figure 10 shows results for both the warped and unwarped maps. Studying the voxel-wise group means and standard deviations, as well as the t-statistics, shows increased similarity in the activation profiles across subjects. After functional alignment, three peaks corresponding to the posterior insula, SII, and the operculum are readily apparent. In addition, the variation is decreased and more concentrated on the three regions of interest. This indicates that while the locations of the peaks are now consistent across subjects, the inter-subject differences in intensity remain, which is important for testing for brain-behavior correlations across subjects. Finally, the t-statistics are higher, indicating increased sensitivity for group-level inference.
6. Discussion.
In this work, we propose a framework for reducing misalignment across individuals in functional brain systems. Our approach has the ability to register local brain features, defined as local peaks in the activation maps, providing a means to align functional features prior to inter-subject statistical analysis. Our generalized Bayes framework combines intensity-based information through the loss function and feature-based information through the prior estimation, allowing us to conduct inference on the transformation via the posterior samples. In an application to fMRI data from a study of thermal pain, we found that the registration contributes to increased sensitivity for group-level inference.

A major limitation of our work is the choice of the reference image. We currently choose it subjectively or based on the results of a meta-analysis. Therefore, in order to apply the method, researchers need to verify that the chosen reference pattern is shared across subjects. In future work, we will expand our approach to model the target image as a latent map to be estimated in a data-driven manner.

Furthermore, as our data experiment measures multiple brain maps at six different temperatures, in future work we will model the within-subject variability using the registration and explore its association with other effects such as temperature. This will allow us to explore how the spatial extent of brain activation changes as a function of stimulus intensity, providing a means to move past simply looking at the spatial intensity of a specific voxel as a measure of brain activation. Another possible extension is evaluating multilevel variability through inference on the within-subject and between-subject transformations.

In this paper we have presented a 2-dimensional implementation of our proposed approach. However, we note that a 3-dimensional version could be constructed in an analogous manner.
We began with the 2D version for illustration purposes, as it allows us to present a simplified description of the approach. In practice, when applying the proposed approach to 3D data, we would recommend a 3D implementation.

Finally, in future work we will seek to explore how functional registration can help improve population-level models of the functional brain representations that can be used to predict behavior, performance, clinical status and prognosis. We hypothesize that due to functional misalignment, features used in predictive models may not be properly aligned, negatively impacting predictive accuracy. Using our approach as a preliminary feature alignment step could potentially rectify this shortcoming.

Acknowledgments.
The work presented in this paper was supported in part by NIH grants R01 EB016061 and R01 EB026549 from the National Institute of Biomedical Imaging and Bioengineering and R01 MH116026 from the National Institute of Mental Health.

REFERENCES
Allison, T., Puce, A., Spencer, D. D., and McCarthy, G. (1999). Electrophysiological studies of human face perception. I: Potentials generated in occipitotemporal cortex by face and non-face stimuli. Cerebral Cortex, 9(5):415–430.
Amunts, K., Malikovic, A., Mohlberg, H., Schormann, T., and Zilles, K. (2000). Brodmann's areas 17 and 18 brought into stereotaxic space—where and how variable? NeuroImage, 11(1):66–84.
Awange, J. L., Bae, K. H., and Claessens, S. J. (2008). Procrustean solution of the 9-parameter transformation problem. Earth, Planets and Space, 60(6):529–537.
Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2014). Hierarchical Modeling and Analysis for Spatial Data, Second Edition. CRC Press, Philadelphia, PA.
Besl, P. J. and McKay, N. D. (1992). A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239–256.
Bissiri, P. G., Holmes, C. C., and Walker, S. G. (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103.
Bradley, D. and Roth, G. (2007). Adaptive thresholding using the integral image. Journal of Graphics Tools, 12(2):13–21.
Brunet, F., Bartoli, A., Navab, N., and Malgouyres, R. (2010). Pixel-based hyperparameter selection for feature-based image registration. VMV 2010 - Vision, Modeling and Visualization, pages 33–40.
Carpenter, B., Lee, D., Brubaker, M. A., Riddell, A., Gelman, A., Goodrich, B., Guo, J., Hoffman, M., Betancourt, M., and Li, P. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1):1–32.
Cederberg, J. N. (2001). A Course in Modern Geometries. Undergraduate Texts in Mathematics. Springer New York, New York, NY.
Chen, P. H., Guntupalli, J. S., Haxby, J. V., and Ramadge, P. J. (2014). Joint SVD-Hyperalignment for multi-subject fMRI data alignment. IEEE International Workshop on Machine Learning for Signal Processing, MLSP, pages 1–6.
Chernozhukov, V. and Hong, H. (2003). An MCMC approach to classical estimation. Journal of Econometrics, 115(2):293–346.
Chumchob, N. and Chen, K. (2009). A robust affine image registration method. International Journal of Numerical Analysis and Modeling, 6(2):311–334.
Conroy, B. R., Singer, B. D., Guntupalli, J. S., Ramadge, P. J., and Haxby, J. V. (2013). Inter-subject alignment of human cortical anatomy using functional connectivity. NeuroImage, 81:400–411.
Cressie, N. and Wikle, C. K. (2015). Statistics for Spatio-Temporal Data. John Wiley & Sons.
Crum, W. R., Hartkens, T., and Hill, D. L. (2004). Non-rigid image registration: Theory and practice. British Journal of Radiology, 77(SPEC. ISS. 2).
Duncan, K. J., Pattamadilok, C., Knierim, I., and Devlin, J. T. (2009). Consistency and variability in functional localisers. NeuroImage, 46(4):1018–1026.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226–231.
Finley, A. O., Datta, A., Cook, B. D., Morton, D. C., Andersen, H. E., and Banerjee, S. (2019). Efficient algorithms for Bayesian nearest neighbor Gaussian processes. Journal of Computational and Graphical Statistics, 28(2):401–414.
Fischer, B. and Modersitzki, J. (2003). Curvature based image registration. Journal of Mathematical Imaging and Vision, 18(1):81–85.
Fischer, B. and Modersitzki, J. (2008). Ill-posed medicine - An introduction to image registration. Inverse Problems, 24(3):1–16.
Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472.
Gruen, A. W. and Baltsavias, E. P. (1987). High-precision image matching for digital terrain model generation. Photogrammetria, 42(3):97–112.
Guntupalli, J. S., Hanke, M., Halchenko, Y. O., Connolly, A. C., Ramadge, P. J., and Haxby, J. V. (2016). A model of representational spaces in human cortex. Cerebral Cortex, 26(6):2919–2934.
Hasson, U., Nir, Y., Levy, I., Fuhrmann, G., and Malach, R. (2004). Intersubject synchronization of cortical activity during natural vision. Science, 303(5664):1634–1640.
Haxby, J. V., Guntupalli, J. S., Connolly, A. C., Halchenko, Y. O., Conroy, B. R., Gobbini, M. I., Hanke, M., and Ramadge, P. J. (2011). A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron, 72(2):404–416.
Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1593–1623.
Iordan, M. C., Joulin, A., Beck, D. M., and Fei-Fei, L. (2016). Locally-optimized inter-subject alignment of functional cortical regions. arXiv preprint arXiv:1606.02349.
Lindquist, M. A. et al. (2008). The statistical analysis of fMRI data. Statistical Science, 23(4):439–464.
Lindquist, M. A., Loh, J. M., Atlas, L. Y., and Wager, T. D. (2009). Modeling the hemodynamic response function in fMRI: Efficiency, bias and mis-modeling. NeuroImage, 45(1):S187–S198.
Lorbert, A. and Ramadge, P. J. (2012). Kernel hyperalignment. Advances in Neural Information Processing Systems, 3(1):1790–1798.
Matheron, G. (1963). Principles of geostatistics. Economic Geology, 58(8):1246–1266.
McAllester, D. A. (1999). Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363.
McCarthy, G., Puce, A., Belger, A., and Allison, T. (1999). Electrophysiological studies of human face perception. II: Response properties of face-specific potentials generated in occipitotemporal cortex. Cerebral Cortex, 9(5):431–444.
Mian, A. S., Bennamoun, M., and Owens, R. (2006). Three-dimensional model-based object recognition and segmentation in cluttered scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1584–1601.
Nenning, K. H., Liu, H., Ghosh, S. S., Sabuncu, M. R., Schwartz, E., and Langs, G. (2017). Diffeomorphic functional brain surface alignment: Functional demons. NeuroImage, 156:456–465.
Ombao, H., Lindquist, M., Thompson, W., and Aston, J. (2016). Handbook of Neuroimaging Data Analysis. CRC Press.
Rademacher, J., Caviness Jr, V., Steinmetz, H., and Galaburda, A. (1993). Topographical variation of the human primary cortices: Implications for neuroimaging, brain mapping, and neurobiology. Cerebral Cortex, 3(4):313–329.
Sabuncu, M. R., Singer, B. D., Conroy, B., Bryan, R. E., Ramadge, P. J., and Haxby, J. V. (2010). Function-based intersubject alignment of human cortical anatomy. Cerebral Cortex, 20(1):130–140.
Shawe-Taylor, J. and Williamson, R. C. (1997). A PAC analysis of a Bayesian estimator. In Proceedings of the Tenth Annual Conference on Computational Learning Theory, pages 2–9.
Simpson, I. J., Schnabel, J. A., Groves, A. R., Andersson, J. L., and Woolrich, M. W. (2012). Probabilistic inference of regularisation in non-rigid registration. NeuroImage, 59(3):2438–2451.
Thevenaz, P. and Unser, M. (1998). Efficient mutual information optimizer for multiresolution image registration. IEEE International Conference on Image Processing, 1:833–837.
Thompson, P. M., Schwartz, C., Lin, R. T., Khan, A. A., and Toga, A. W. (1996). Three-dimensional statistical analysis of sulcal variability in the human brain. Journal of Neuroscience, 16(13):4261–4274.
Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5):1413–1432.
Viola, P. and Wells, W. M. (1997). Alignment by maximization of mutual information. International Journal of Computer Vision, 24(2):137–154.
Vogt, B. A., Nimchinsky, E. A., Vogt, L. J., and Hof, P. R. (1995). Human cingulate cortex: Surface features, flat maps, and cytoarchitecture. Journal of Comparative Neurology, 359(3):490–506.
Vovk, V. G. (1990). Aggregating strategies. Proc. of Computational Learning Theory, 1990.
Walker, S. and Hjort, N. L. (2001). On Bayesian consistency. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(4):811–821.
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11:3571–3594.
Woo, C.-W., Roy, M., Buhle, J. T., and Wager, T. D. (2015). Distinct brain systems mediate the effects of nociceptive input and self-regulation on pain. PLoS Biology, 13(1):e1002036.
Xu, H., Lorbert, A., Ramadge, P. J., Guntupalli, J. S., and Haxby, J. V. (2012). Regularized hyperalignment of multi-set fMRI data. pages 229–232.
Yang, C. and Medioni, G. (1992). Object modelling by registration of multiple range images. Image and Vision Computing, 10(3):145–155.

Fig. 1. An example of a functional activation map. When used as a reference map, a query map indicating landmarks of interest is manually determined and shown within the dashed rectangle.
Fig. 2. The query map extracted from the reference map (left column). The target map generated by warping the reference map (right column).
Fig. 3. Cropping box comparing the prior information (dashed box) with the posterior (solid box).
Fig. 4. A plot of −2 × expected log predictive density (ELPD) versus the regularization parameter λ_T given a fixed λ_b. The vertical bar (red) shows the selected λ_T, which reaches the minimum of −2 × ELPD.
Fig. 5. The estimated posterior density plot for each of the transformation parameters: translations (first row), scalings (second row), and rotations (third row). Vertical bars indicate the true parameter values.
Fig. 6. A plot of the posterior means against the true parameters for each transformation parameter: translations (first row), scalings (second row), and rotations (third row).
Fig. 7. Subject-specific activation maps of the region of interest across the subjects. Note the differences in location of peaks across subjects, indicating significant inter-subject differences.
Fig. 8. The corresponding cropped area (bounding box) based on prior information in each subject-specific map. The features (local peaks) are circled.
Fig. 9. The warped bounding box using the transformation computed with the posterior mean is shown in solid red. The 95% credible interval of the transformation map is illustrated in the shaded area.
Fig. 10. A comparison of the unwarped maps (left column) and warped maps (right column). Group-level statistics (t-statistics, group means, group standard deviations) are calculated for both raw maps and the warped maps.

Algorithm 1:
Corresponding Feature-based Prior (Hyper-parameter) Estimation

Input: reference image R, reference landmarks p_R, floating image Y, floating landmarks p_Y, bound d on the distance between landmarks, bound α_max for the tuning parameter.

1. Construct the candidate set S_p = {p̂_Y ⊆ p_Y : |p̂_Y| = |p_R|}.
2. Construct the corresponding transformation set S_T = {T̂ : T̂ = argmin_T ‖p_R − T(p̂_Y)‖², for p̂_Y ∈ S_p}.
3. For each α ∈ (0, α_max):
   a. Find the subset Ŝ_p^α ⊆ S_p such that Ŝ_p^α = {p̂_Y ∈ S_p : S(p̂_Y − T_α(p̂_Y)) < α and ‖p̂_R − T(p̂_Y)‖² < d}.
   b. Collect the corresponding transformation subset Ŝ_T^α ⊆ S_T.
   c. Compute the criterion C(α) = min_{T ∈ Ŝ_T^α} (1/|G|) Σ_{p∈G} ‖R(p) − b̂ Y(T(p))‖².

Output: the α that minimizes C(α), together with the corresponding transformation T_α.