[PDF] A Restaurant Process Mixture Model for Connectivity Based Parcellation of the Cortex

Abstract

One of the primary objectives of human brain mapping is the division of the cortical surface into functionally distinct regions, i.e. parcellation. While it is generally agreed that at macro-scale different regions of the cortex have different functions, the exact number and configuration of these regions is not known. Methods for the discovery of these regions are thus important, particularly as the volume of available information grows. Towards this end, we present a parcellation method based on a Bayesian non-parametric mixture model of cortical connectivity.

Full PDF

AA Restaurant Process Mixture Model forConnectivity Based Parcellation of the Cortex

Daniel Moyer, Boris A. Gutman, Neda Jahanshad, andPaul M. Thompson

Imaging Genetics Center, University of Southern California [email protected]

Abstract.

One of the primary objectives of human brain mapping isthe division of the cortical surface into functionally distinct regions, i.e.parcellation. While it is generally agreed that at macro-scale diﬀerentregions of the cortex have diﬀerent functions, the exact number and con-ﬁguration of these regions is not known. Methods for the discovery ofthese regions are thus important, particularly as the volume of availableinformation grows. Towards this end, we present a parcellation methodbased on a Bayesian non-parametric mixture model of cortical connec-tivity.

Keywords:

Human Connectome, Cortical Parcellation, Bayesian Non-Parametrics

Historically, researchers proposed and investigated regional brain parcellationsthrough manual dissection and qualitative description [25]. The rise of non-invasive neuro-imaging coupled with advances in computing and computer vi-sion allowed for the exploration of automated parcellation methods, both forﬁtting existing atlases to data and for data-driven discovery of functionally andstructurally cohesive parcels [23]. The success of the former propelled the risein interest and analyses of brain connectomics in the last decade [20]. Connec-tomics is a topic of interest within all scales of neuroscience; at the macro-scale,it is often deﬁned by discrete networks of cortical gray matter regions as nodeswith weighted or binary edges connecting them. In structural connectomics–thefocus of this paper–the edge weights are usually based on counts of estimatedstructural connections recovered using diﬀusion MRI tractography, sometimesweighted by a microstructural measure.A number of papers have focused on the converse, using the connection proﬁleof either voxels or vertices, or in the case of functional MRI, the pairwise signalcorrelation for each vertex pair to deﬁne the parcellation of the cortex (see [4]and the references therein). For connectivity based parcellation (CBP) methodsusing structural connectivity, two modeling choices are required: 1) a spatialresolution of the grid on which connections are deﬁned, and 2) the criteria onwhich clusters should be formed. In almost every existing method, connectivity a r X i v : . [ q - b i o . N C ] M a r easures are ﬁrst estimated for a high resolution grid of atomic units (choiceone), e.g. voxels or vertices, and the atomic units then combined under spatialconstraints optimizing a desiderata (choice two).The ﬁrst decision is essentially a division of connectivity into two scales: themacro-region level and higher resolution voxel/vertex level, where our task isto learn from the former the regions of the latter. The second modeling choiceis the criteria on which these atoms are clustered. Many popular choices comefrom more general clustering literature, e.g. within-group sum of squares (ex-plained variance), within-group statistical distances, and mixture model likeli-hoods [4,15,24]. These criteria use the connectivity proﬁles of each meso-scaleatom without regard to the network structure they induce at the macro-scale.In other words, they treat each vertex or voxel as a data point with an associ-ated vector (its row in the meso-scale adjacency matrix), and then cluster basedon this vector space. If one vertex is changed from one group to another, thesemethods generally do not re-evaluate the quality of all the groups, though onthe macro-scale each of their connective proﬁles would have changed.In the present work, we will address both these choices. We present a methodframed in the context of generative models, speciﬁcally Bayesian non-parametricmixture models which place priors over all possible partitions of the higher res-olution grid, and do not require the number of clusters to be predeﬁned. Onelarge classes of such priors are the so-called “restaurant processes”, used here.We implement a continuum form of connectivity for our mixture components,and further leverage a conjugate likelihood-prior relationship to produce closedform marginal likelihoods for network interactions, allowing eﬃcient sampling.Our paper is organized as follows: we ﬁrst deﬁne terminology, rigorouslydeﬁne the parcellation task, and describe the model as a whole. We then describeeach of its components in closer detail. We then present results on two datasets,and discuss the model in relation to existing models and methods. Let Ω be the white matter/gray matter interface (the inner cortical surface),with the acknowledgment that Ω is in general composed of two disjoint sheets,each with a boundary at the medial wall. Fix a coordinate system over Ω , anddeﬁne a parcellation P as any set of regions { E i } where E i ⊂ Ω , (cid:83) E i = Ω , and | (cid:84) E i | = 0 (i.e. the regions E i are almost disjoint). We assume there exists alatent parcellation P ∗ that accurately describes the cortical surface with respectto its underlying neuroanatomical structure. Our objective is the recovery of P ∗ , speciﬁcally using structural connectivity information, and without specifyingthe exact number of regions. In order to accomplish this, we construct a jointgenerative model of parcellations and connectivity.We start by choosing a model of partitions. We use the distance depen-dent Chinese Restaurant Process (ddCRP) [2], a variant of the popular ChineseRestaurant Process (CRP) non-parametric Bayesian models. CRP models aremost commonly used in mixture models, providing a prior over all possible la-el assignments for any number of label-parameter pairs. A main assumption ofthe CRP is the exchangeability of the data; the ddCRP removes this exchange-ability assumption, allowing for non-trivial topologies of dependence betweendata points. This is discussed in Section 2.1 in detail. Brieﬂy, ddCRP allows usto use non-parametric style mixture models on mesh grids, where we assume apriori that neighboring patches are dependent. For example, we assume thereis spatial auto-correlation over the discrete manifold of the mesh. Practicallyspeaking, the ddCRP is the component responsible for merging or splitting theparcels (clusters), and in general for their conﬁguration.We next choose a mixture component model; the distribution chosen here willgenerate the observed network between estimated regions from the ddCRP. Wechoose to follow the style of the Inﬁnite Relational Model [12,1], where we modelinteractions between clusters instead of the proﬁles of the clusters themselves.Thus, we need a separate parameter for each pair of regions. Before diving intothis however, it is important to consider the form of our connectivity data.Structural connectivity is estimated using streamlines (tractography), usuallyvia identifying tracts which intersect the cortical surface at two locations. Thus,the evidence of connectivity is a set of endpoint pairs on Ω .In traditional connectivity analysis, these endpoints are counted by regionpair, and a graph is formed from the resulting count statistics. These represen-tations abstract away both knowledge of region geometry such as surface area,curvature etc, as well as topological information, i.e. region adjacency; this is theinformation we will be using in the ddCRP model. While it is possible to ignorethese conﬂicting motives and directly kluge a graph to a spatial patch model, weinstead attempt to retain spatial intuition in our connectivity representation.Consider Ω × Ω , the set of all possible tract endpoints intersecting the corticalsurface. We model the observation of these pairs of endpoints as a spatial pointprocess on Ω × Ω . Assuming that each tract is independently recovered , this pro-cess is the Poisson point process. That is, for any region pair E i × E j ⊂ Ω × Ω , thenumber of tract endpoint pairs observed in that region pair is Poisson distributedwith parameter (cid:82) E i (cid:82) E j λ ( x, y ) dydx . Here λ : Ω × Ω → R + is a non-negative ratefunction assumed to be integrable over all Ω × Ω . For Poisson processes, λ com-pletely characterizes the process. While we discuss further the Poisson pointprocess in Section 2.2, in the view of the overall model it is important to noteone convenient property: disjoint regions have independent counts.Moving back to the mixture components, we make the following simplifyingassumption on the form of the tract endpoint process: each region pair interactsin a homogenous manner. That is, we assume λ is constant over any pair ofparcels. Thus, for any ﬁnite conﬁguration of K parcels we have on order K non-negative scalar parameters to estimate, and these parameters are the rateparameters for Poisson spatial processes generating the evidence of connectivity.We choose to use the Gamma distribution to model these parameters (i.e., eachpair of parcels ( g i , g j ) draws a rate λ ij from a Gamma prior). As shown in It is important to make the distinction between physical fascicles and recoveredtracts. Here, we deﬁne the latter to be the reconstructed tractography. ection 2.2, the conjugacy of the Gamma distribution with the homogenousPoisson process allows for closed form marginal distributions, and thus eﬃcientcollapsed sampling methods. We also choose to use the mesh faces { f m } Mm =1 as the elements of our ddCRP-mixture. This is because connectivity is usuallydeﬁned over intersections of tracts with areal units, and both the ddCRP as wellas the Poisson process naturally operate over such regions.Putting this all together, and leaving the meaning of c m for the next section,the model is as follows: c m , g i ∼ ddCRP( α, Adj ) λ ij ∼ Gamma( a, b ) D ij ∼ Poisson Point Process( λ ij ) As suggested by its name, distance directed Chinese Restaurant Process is avariant of the Chinese Restaurant Process (CRP), often used in non-parametricBayesian mixture models as a prior over possible mixture components (i.e. adistribution over distributions). Let α be some positive constant concentrationparameter, and let G be a prior distribution over mixture component param-eters (for us, a gamma distribution). The original CRP mixture model [17] de-scribes an endless stream of customers (data) entering a restaurant with an inﬁ-nite number of tables (clusters). Each customer either chooses (with prescribedprobability dependent on α ) to sit at an existing table (which has a particularcomponent distribution) or sit an unoccupied table (draw a new component dis-tribution from the prior). Up to the indexing of the tables, for any ﬁnite numberof observations any number of clusters and conﬁguration of cluster associationsis possible.In the original CRP, the data are assumed to be exchangeable; that is, thejoint likelihood of any observations is invariant under permutations of observa-tion indices. However, in our spatial context we have a topology of face adja-cencies. Permutations of the face indexes are non-trivial, and thus the faces arenot exchangeable. To model this, the ddCRP allows each customer to chooseanother customer (possibly itself) to sit with based on its dependencies. Thisforms a directed graph of seating choices; table assignments are then made toeach group of customers who have chosen to sit with each other, i.e. each con-nected component of the seating choice graph. Mixture components are drawnfor each table from G , and only then are the actual data drawn from eachmixture component. Clearly this is a two stage procedure. In our context, thismeans each face will choose to be in a cluster with one of its neighbors or itself.As above, let { f m } Mm =1 be the set of mesh faces, and let { c m } Mm =1 be the cor-responding assignments, where each c m ∈ { , . . . , M } . We draw c m conditionedadjacency information Adj as follows: ( c m = j | Adj ) ∝  j is adjacent to iα if m = j g k for k ∈ { , . . . , K } , where K is thenumber of clusters. Due to our restriction of c m to the indices of faces adjacent to f m , each g k is a contiguous region, and the set of groups forms a valid parcellationof Ω . We note that the original ddCRP is deﬁned for more general distancefunctions. The evidence of pairwise interaction between regions in structural connectivityis the set of tract endpoints D = { ( x t , y t ) } Tt =1 . Since the regional clusters aredeﬁned over discrete grids of areal atoms (mesh faces), these are naturally ag-gregated to count measures over each pair of sub-regions. For any pair of regions( g i , g j ) ⊂ Ω × Ω , deﬁne D ij = { ( x t , y t ) ∈ g i × g j } . We model the counts | D ij | using the Poisson process with ﬁxed intensity λ ij , where the area g i × g j containsa random count | D ij | distributed | D ij | ∼ P oisson ( (cid:90) g i × g j λdxdy )Using the independence assumption of the tract endpoints, the likelihood of anyconﬁguration of tract endpoints can then be written L ( D ) = (cid:89) g i ,g j exp (cid:40) − (cid:90) g i × g j λ ij dxdy (cid:41) λ | D ij | ij We use a Gamma prior for the λ ij parameters, the conjugate prior of the Poissondistribution. Using the Gamma distribution allows us derive a simple closed formmarginal distribution for D ij that “integrates out” the λ ij ’s, leaving a likelihoodin terms of prior parameters a, b . It is as follows: P ( D ij | a, b ) = (cid:90) P ( D ij , µ | a, b ) dµ = (cid:90) P ( D ij | µ ) P ( µ | a, b ) dµ = (cid:90) exp (cid:40) − (cid:90) g i (cid:90) g j λdA (cid:41) | D ij | (cid:89) t =1 λ (cid:124) (cid:123)(cid:122) (cid:125) Homogeneous Point Process × b a Γ ( a ) exp( − bλ ) λ a − (cid:124) (cid:123)(cid:122) (cid:125) Gamma Prior dλ = Z ( a, b ) (cid:90) exp {− ( | g i × g j | + b ) λ } λ | D ij | + a − (cid:124) (cid:123)(cid:122) (cid:125) Un-normalized Gamma Posterior dλ = Z ( a, b ) Z ( a (cid:48) , b (cid:48) ) = (cid:18) β | g i × g j | + b (cid:19) a (cid:18) | g i × g j | + b (cid:19) | D ij | Γ ( a + | D ij | ) Γ ( a )Here, Z ( a, b ) = b a Γ ( a ) .3 Combined Model and Collapsed Sampling Scheme We will estimate the model via Collapsed Gibbs Sampling, speciﬁcally usingthe closed form integral over λ ij to avoid sampling the interaction parameters.Starting at iteration (cid:96) = 0, we update each c (cid:96)m by the following conditionallikelihood: P ( c (cid:96) +1 m = k | D ) ∝ P ( c m = k ) (cid:89) i,j P ( D ij | a, b, c (cid:96) +1 m = k, { c (cid:96) +1 r } rm )Since we assume c m is restricted by our mesh topology, we have only a smallnumber of options to evaluate. We denote the seating graph edge c (cid:96)m as a “criticaledge” if for any other node f m (cid:48) such that c m (cid:48) = m there exists a path to the facewith index c (cid:96)m . Let g old be f m ’s previous component, and g crit be the componentof f m without its own edge (its critical component). Without loss of generalitywe may further order each neighbor of f m as f n for n = 1 , . . . , N , and theirgroups as g ( n ). Using these deﬁnitions, for triangle meshes we can write out allpossible scenarios:1. If c m is not critical, and all neighbors are of the same component beforethe update, then we can simply choose c m via P ( c m = k ), as there is nodiﬀerence with respect to the induced components.2. If c m is not critical, but not surrounded by the same component, then weare asking essentially “Should c m ’s previously induced component join oneof its as of yet independent neighbors”. Thus, P ( c (cid:96) +1 m = n | D ) ∝ P ( c m = n ) K (cid:89) k P ( D ∗ ,k | g new = g old ∪ g n , g k ) × N (cid:89) g ˆ n :ˆ n (cid:54) = n ˆ n (cid:54) = old K (cid:89) k P ( D g ( n ) ,k | g ˆ n , g k )for each neighbor n . Here D ∗ ,k = { ( x t , y t ) ∈ g old ∪ g n × g k }

3. If c m is critical, then for each neighboring component (including the compo-nent g old \ g crit in the neighbors) we have P ( c (cid:96) +1 m = n | D ) ∝ P ( c m = n ) K (cid:89) k P ( D ∗ ,k | g new = g crit ∪ g n , g k ) × (cid:89) ˆ n (cid:54) = n K (cid:89) k P ( D ˆ n,k | g ˆ n , g k ) . We iteratively update the face associates using P ( c (cid:96) +1 m = k | D ), collectingsamples after every pass. While this generates a posterior distribution over c m ,we simply take the maximum a posteriori (MAP) estimate as our selected par-cellation.n general, updates made in Gibbs sampling algorithms are done sequentially;this is because strong dependencies between concurrent updates will destabilizesome samplers. However, in cases of low dependence approximate asynchronousparallel updates have been used with empirically strong results (so called “Hog-wild” updates [11]). In our case, most updates are either within components, orbetween small components (with correspondingly small interdependencies), so asmall degree of parallelism is possible. In practice we use a compromise betweenthe serial algorithm and the parallel version: we use a shared memory parallelsampler for calculating the likelihoods of a small batch c i , then make a serialupdates based on these likelihoods. This allows a roughly linear speed-up in thenumber of threads used, though there is a slowly scaling cost of the serial update. In ﬁxing a coordinate system, it is common to split the white matter/gray matterinterface into two spheres, each with a null region where the corpus callosumbridges the longitudinal ﬁssure. Thus, an easy system can be constructed usingspherical coordinates and a marker for hemisphere.The symmetry of each tract’s endpoints requires careful consideration toavoid double counting; while the intuition of the model can be understood with-out thinking about the symmetry of the data, when evaluating joint probabilitiesit is important to only include each data point once. This can be achieved byonly evaluating P ( D ij | a, b ) for i ≤ j . When computing the parallel updates, inour experience it is much more eﬃcient to keep the threads active but idle, andsimply have a single thread do the serial update. This avoids the overhead ofrepeated thread spawns, which for some implementations/architectures can becostly. In order to test our proposed model, we use two open datasets, one composedof 20 subjects each scanned twice from the Institute of Psychology, ChineseAcademy of Sciences (IPCAS) subset of the Consortium for Reliability and Re-producibility (CoRR) dataset [26], and the other composed of 30 subjects fromthe Human Connectome Project (HCP) S900 release [22]. The pre-processingdiﬀers slightly between the datasets, to account for the diﬀerent imaging pa-rameters. In general the HCP dataset has higher resolution (both in voxel sizeand angular resolution) leading to diﬀerent tractographies. On each dataset wecompare the performance of the proposed method against two recommendedalternatives: Ward’s method, a greedy hierarchical clustering method [4], andSpectral Clustering [15]. : T1-weighted (T1w) and diﬀusion weighted (DWI) images were obtainedon 3T Siemens TrioTim by the original investigators [26] using an 8-channel headoil and 60 directions. Each subject was scanned twice, roughly two weeks apart.T1w images were processed with Freesufer’s [5] recon-all pipeline to obtain a tri-angle mesh of the grey-white matter boundary registered to a shared sphericalspace [6]. We resample this space to a geodesic grid (where each face has ap-proximately equal area) with 10,000 total faces, doing so only after computingtract intersections with the surface. Probabilistic streamline tractography wasconducted using the DWI in 2mm isotropic MNI 152 space, using Dipy’s [7] im-plementation of constrained spherical deconvolution (CSD) [21] with a harmonicorder of 6. Tractography streamlines were seeded at 2 random locations in eachwhite matter voxel labeled by FSL’s FAST. Streamline tracking followed direc-tions randomly in proportion to the orientation function at each sample pointat 0.5mm steps, starting bidirectionaly from each seed point with 8 restartsper seed. As per Dipy’s Anatomically Constrained Tractography (ACT) [19], weretained only tracts longer than 5mm with endpoints in likely gray matter.

HCP:

We used the minimally preprocessed T1-weighted (T1w) and diﬀusionweighted (DWI) images rigidly aligned to MNI space. Brieﬂy, the preprocessingof these images included motion correction and eddy current correction (DWI),and linear and nonlinear alignment (betweek T1w and DWI). We used the HCPPipeline (version 3.13.1) FreeSurfer protocol to run an optimized version of therecon-all pipeline that computes surface meshes in a higher resolution (0.7mmisotropic) space. We again resample this space to an geodesic grid after com-puting tract intersections with the surface. Tractography was conducted usingthe DWI in the native 1.25mm isotropic voxel size in MNI space. Probabilisticstreamline tractography was performed as in IPCAS above.

We ﬁt the proposed method using our parallel sampling scheme, using 60 passesof the sampler with 8 parallel threads (approximately 600,000 updates per sub-ject), using a = 1, b = 1, and α = 0 .

01. We use the MAP estimate as ourresults. We ﬁtted Ward clustering by maximizing Explained Variance over ana¨ıve search of every possible merge. For the Spectral Clustering method we usean exponential kernel, using the normalized cosine distance as a metric. We usea number of eigenvectors equal to the number of clusters. For both baselines wetake the vector of connections as our feature vector. In both clustering schemes,we specify the number of clusters to be equal to that of the proposed method.We assess cluster quality using a KL-divergence based measure. We take thenumber of tracts from each face f m to each region g i as the objective distri-bution, and measure how well this is approximated by the average number oftracts from g ( f m ) to g i . These form two matrices of dimension M × K . We thennormalize these matrices to sum to one and measure their KL divergence as in[15]. If a cluster is well represented by its average connectivity proﬁle, then thisdivergence will be low. For the IPCAS dataset we have an additional measureof Test-Retest reproducibility. This is measured by Normalized Mutual Infor-mation (NMI)[4,1], which measures cluster similarity without requiring similarnumbers of clusters. Let Z be a binary matrix of cluster assignments, where, for ig. 1. Plots of the KL Divergence based goodness of ﬁt measure, for the three methods,on both datasets. Here, lower is better. each row i , each entry Z ij is 1 if f i is in cluster j and zero otherwise. NMI isdeﬁned as I ( Z , Z ) / (cid:112) ( H ( Z ) H ( Z ), where I ( · , · ) is mutual information, H ( · )is entropy, and Z and Z are the cluster assignments for the ﬁrst and secondscan respectively. (This uses the convention customary in information theorythat 0 log 0 = 0). NMI is also invariant under permutations of labels. As can beseen in Figure 1 and Figure 2, the proposed method is performing well comparedto the baseline methods. HCP dataset uses more clusters (around 250 per hemi-sphere) than the IPCAS dataset (around 175-200 per hemisphere). The diﬀerencehere may be due in part to the higher resolution of the HCP dataset, leading togreater resolving power with respect to the regional connections. These averagesare at the upper range of the number suggested by Van Essen et al. [23]. This model draws on the wide range of previously proposed methods in connec-tivity based parcellation. Several non-parametric Bayesian methods have beenproposed, in particular two excellent works Jbabdi et al. [10] and Baldassano etal. [1], both of whom use Normal-inverse-Wilshart conjugations as their mixturecomponents (Baldassano et al. use a special case, the Normal-inverse- χ ). Thesemodels also enjoy closed form marginal distributions, but do not have inﬁnitedivisibility (the distributions they model are not spatial processes). Jbabdi etal., whose work predates the ddCRP, use a Dirichlet Process with spatial priorsas their partition prior. They then further deﬁne a hierarchical process on topof this that links multiple subjects. Baldassano et al. use the ddCRP directly,but model voxel connections, again without the aid of a spatial process. Instead,they model the aggregate connectivity as coming from a normal distribution. ig. 2. Left: Normalized Mutual Information between Test-Retest scans. Here, higher is better. Right: histograms of the number of clusters selected for each subject.

The ddCRP is similar to a Markov Random Field model with a very strongspatial prior. These models have been successful in obtaining parcellations fromfunctional connectivity [9,18], though few if any have used Bayesian non-parametrics.This frame of reference leads us toward more traditional computer vision taskssuch as pixel labeling, where as in many cases surface parcellation has beenframed as vertex parcellation [3,15,24]. This is a small but relatively importantconceptual diﬀerence; the pixel and mesh-face models have areal units, but ver-tex parcellations are graphs of inﬁnitesimal points. The intuition of the formerleads us toward the use of spatial processes.A similar spatial process viewpoint of connectivity is proposed in Moyer etal. [14], but the discovery of new parcellations is not discussed. Poisson countprocesses for network interactions have also been explored in the literature [13],as have inﬁnite relational variants [8] though usually in the context of networkclustering via the stochastic blockmodel (i.e. clustering the regions themselves).These usually ignore spatial constraints.Alternative methods to Bayesian models usually specify the number of clus-ters. Of note is Parisot et al. [15] and a subsequent work by the same authors[16], which propose spectral methods for the parcellation task, augmented witha pre-processing local agglomeration. These papers note the propensity for Spec-tral Clustering to form equi-areal clusters; as can be seen in Figure 3, our methoddoes not form equi-areal groups. Thus, it may be the case that a lower numberof clusters for spectral clustering may perform better.As there is a rich body of functional and anatomical knowledge regardingthe cortex, parcellations based on connectivity information alone would needproper neuroanatomical, histological and functional validation, and more infor-mation from these sources would ideally be used to optimize parcellations. Themodel presented here uses only spatial constraints and connectivity to estimate ig. 3.

An exemplar parcellation from an HCP subject. Region colors are random. feasible parcellations based on recoverable structural connections from imaging.However, we believe that the modeling techniques explored here can easily beimputed into larger, multi-modal models, and in general the improvements mademay increase the accuracy and reproducibility of studies of connectivity patterns.These are critical to furthering our understanding of the living human brain.

Acknowledgements

This work was supported by NIH Grant U54 EB020403, as well as the NSFGraduate Research Fellowship Program. The authors would like to thank thereviewers as well as Greg Ver Steeg for multiple helpful conversations.