Finding Differentially Covarying Needles in a Temporally Evolving Haystack: A Scan Statistics Perspective
Ronak Mehta, Hyunwoo J. Kim, Shulei Wang, Sterling C. Johnson, Ming Yuan, Vikas Singh
Ronak Mehta [email protected]
Hyunwoo J. Kim [email protected]
Shulei Wang [email protected]
Sterling C. Johnson [email protected]
Ming Yuan [email protected]
Vikas Singh [email protected]
Computer Sciences Department, Department of Statistics, Department of Medicine, Department of Biostatistics
University of Wisconsin-Madison, Madison, WI 53706, USA
Project webpage: pages.cs.wisc.edu/~ronakrm/research/covtraj/
Video summary: vimeo.com/205606140
Abstract
Recent results in coupled or temporal graphical models offer schemes for estimating the relationship structure between features when the data come from related (but distinct) longitudinal sources. A novel application of these ideas is for analyzing group-level differences, i.e., in identifying if trends of estimated objects (e.g., covariance or precision matrices) are different across disparate conditions (e.g., gender or disease). Often, poor effect sizes make detecting the differential signal over the full set of features difficult: for example, dependencies between only a subset of features may manifest differently across groups. In this work, we first give a parametric model for estimating trends in the space of SPD matrices as a function of one or more covariates. We then generalize scan statistics to graph structures, to search over distinct subsets of features (graph partitions) whose temporal dependency structure may show statistically significant group-wise differences. We theoretically analyze the Family Wise Error Rate (FWER) and bounds on Type 1 and Type 2 error. On a cohort of individuals with risk factors for Alzheimer's disease (but otherwise cognitively healthy), we find scientifically interesting group differences where the default analysis, i.e., models estimated on the full graph, do not survive reasonable significance thresholds.
Keywords:
Manifold Statistics, Scan Statistics, Longitudinal Analysis

1. Introduction

Multivariate data analysis exploiting the conditional independence structure between features or covariates using undirected graphical models is now standard within any data analysis toolbox. When the data are multivariate Gaussian, the zeros in the inverse covariance (precision) matrix give conditional independences among the variables (Lauritzen, 1996). Further, if the precision matrix is sparse, we can derive dependencies between features when the data are high-dimensional and/or the number of measurements is small. The estimation of a graphical model has been extensively studied and a rich literature is available describing its statistical and algorithmic properties (Koller and Friedman, 2009; Jordan, 1998). For instance, the so-called graphical lasso formulation uses an ℓ1-norm penalty on the precision matrix and is widely used, and consistency properties in the large p regime (Cai et al., 2011; Friedman et al., 2008; Yuan, 2010) are now well understood. These formulations have also been extended to various transformations of Gaussian distributions (e.g., non-paranormal) using rank statistics (Liu et al., 2009; Xue and Zou, 2012; Liu et al., 2012).

Coupled and Temporal Graphical Models.
Often, data come from two (or more) disparate sources or multiple timepoints. Within the last few years, a few proposals have described strategies for linking the sparsity patterns of multiple graphical models, e.g., using a fused lasso penalty (Danaher et al., 2014; Yang et al., 2015). Observe that if the data sources correspond to longitudinal acquisitions, we should expect the 'structure' to gradually evolve. Several authors have offered generalizations to address this problem: (Zhou et al., 2010) removes the assumption that each graph is independent and structurally 'close'. Instead, (Zhou et al., 2010) can be thought of as a growth model (McArdle and Bell, 2000) defined on these structures: they show how non-identically distributed graphs can be learned over time. Recently, the nonparametric procedure in (Qiu et al., 2015) extends these ideas to handle multiple sources, each with multiple samples.

The ideas in the literature so far to "couple" multiple graphical model estimation modules are mostly nonparametric. While such a formulation offers benefits, in many estimation problems, parametric models may be more convenient for downstream statistical analysis, particularly for hypothesis testing (Hardle and Mammen, 1993; Geer, 2000; Roehrig, 1988). Given that the topic of coupled graphical models, by itself, is fairly recent, algorithms for parametric estimation of temporal or coupled Gaussian graphical models have not yet been heavily studied. This will involve parameterizing trends in the highly structured nature of the 'response' variable (SPD matrices).
We find that parametric formulations for manifold-valued data have been proposed recently (Kim et al., 2014; Cornea et al., 2016). Because SPD matrices form a Riemannian manifold, algorithms that estimate a parametric model respecting the underlying Riemannian metric are more suitable in many applications, as opposed to assuming a Euclidean metric on positively or negatively curved spaces (Xie et al., 2010; Fletcher and Joshi, 2007; Jayasumana et al., 2013). We will make a few simple modifications (for efficiency purposes) to such algorithms and make use of the estimated parameters for follow-up analysis.
Finding Group-wise Differences.
Assuming that we have a black-box procedure to estimate a parametric model on the SPD manifold available, in many tasks, such an estimation is merely a segue to other analyses designed to answer scientifically meaningful questions. For example, we are often interested in asking whether the temporally coupled model estimated using the procedure above differs in meaningful ways across groups induced by a stratification or dichotomous variable (e.g., gender or disease). For instance, is the 'slope' in structured response space statistically different across education level or body mass index? While the body of work for graphical model estimation is mature, the literature describing hypothesis tests in this regime (Städler and Mukherjee, 2012; Belilovsky et al., 2015) is sparse at best. Given that such questions are simpler to answer with alternative schemes (with assumptions on the distributional properties of the data), e.g., structural equation modeling, latent growth models and so on (Ullman and Bentler, 2003; McArdle and Bell, 2000), it seems that the unavailability of such tools is limiting the adoption of such ideas in a broader cross-section of science. We will seek to address this gap.
Needles in Temporal Haystacks.
If we temporarily set aside the potential value of a hypothesis test framework for temporal trajectories in graphical models, we see that from an operational viewpoint, such procedures are most effective when a practitioner already has a precise scientific question in mind. In reality, however, many data analysis tools are deployed for exploratory analyses to inform an investigator as to which questions to ask. Being able to "localize" which parts of the model are different across groups can be very valuable. This ability actually benefits statistical power as well. Notice that when the stratified groups are not very different to begin with, e.g., healthy individuals with presence or absence of a genetic mutation, the effect sizes (statistical difference between two groups) are likely to be poor. Here, while the trends identified on the full precision matrix may still be different (i.e., there may be a real signal associated with a grouping variable), they may not be strong enough to survive significance thresholds. Ideally, what we need here are analogs of the widely used "scan statistics" for our hypothesis testing formulations for temporal graphical models, to identify which parts of the signal are promising. Then, even if only a small subset of features were different across groups, we may be able to identify these differential effects efficiently. This benefits Type 2 error, provides a practical turnkey product for an experimental scientist, and makes up the key technical results of our work.

Briefly, we provide (i) a simple and efficient parametric procedure for modeling temporally evolving graphical models, (ii) a hypothesis test for identifying differences between group-wise estimated models, and (iii) a scan algorithm to identify those subsets of the features which contribute to the group-wise differences. Together, these ideas offer a framework for identifying group-wise differences in temporally coupled graphical models.
From the experimental perspective, we find scientifically plausible results on a unique longitudinally tracked cohort of middle-aged (and young elderly) persons at risk for Alzheimer's disease due to family history, but who are otherwise completely cognitively healthy.

The rest of the paper is organized as follows. In Section 2 we present an efficient manifold regression procedure for modeling covariance trajectories, which serves as a blackbox module in our hypothesis testing framework. In Section 3, we define our main hypothesis test for group difference analysis over covariance trajectories. In Section 4, we present a set of technical results describing our localization procedure based on scan statistics, as well as derive suitable size corrections to compare across feature subsets. Sections 5, 6, and 7 conclude with empirical evaluations of our model on synthetic data, various types of demographics/behavior data collected longitudinally in the United States from publicly available resources, and finally, our analysis on a unique longitudinal dataset (followed since 2001) from a preclinical Alzheimer's disease study involving approximately 1500 individuals.

2. Characterizing Covariance Trajectories
Our main statistical testing framework, to be described shortly, needs an efficient means for calculating a "trajectory" of the feature-by-feature interaction graphs over time for the given longitudinal data. We now describe a scheme which offers this capability. Let X_t ∈ R^{n_t × p} be the design matrix of all n_t samples at time t, where t ∈ {1, . . . , T}, and T is the total number of distinct timepoints. We wish to capture the trends in the relationships between the features as a function of t. To evaluate the groupwise differences in changes of such interactions, we make use of the fact that these interactions are commonly captured by correlation or conditional independence, represented by the covariance matrix (with normalized features) and the precision matrix (the inverse of the covariance matrix). Here we simply use the covariance matrix for each timepoint t to denote the interaction between features, C_t = cov(X_t). Our goal now is to estimate the parameters of the function t → C_t. We may vectorize the covariance matrix and apply a linear model; its parameters will give the trajectory in "vectorized covariance space" as we scan through t. But these predictions are not guaranteed to be valid SPD matrices, and even if a projection is performed to obtain a covariance estimate, distortions introduced by the process may be significant (Fletcher, 2013). It is well known that classical vector space models tend to be suboptimal in the manifold setting (covariance matrices live on the SPD manifold) since they use Euclidean metrics which are defined in the ambient space. For manifold-valued data, Riemannian metrics are shown to be superior in many applications (Fletcher and Joshi, 2007; Banerjee et al., 2015; Jayasumana et al., 2013; Tuzel et al., 2007), and are increasingly being deployed in machine learning/statistics.
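The per-timepoint construction C_t = cov(X_t) described above is straightforward; a minimal sketch (my own helper, not the authors' code):

```python
# A minimal sketch of the covariance trajectory: C_t = cov(X_t) for
# per-timepoint design matrices X_t of shape (n_t, p).
import numpy as np

def covariance_trajectory(X_by_time):
    """Return [C_1, ..., C_T], one p x p covariance matrix per timepoint."""
    # rowvar=False: rows of each X_t are samples, columns are the p features
    return [np.cov(X_t, rowvar=False) for X_t in X_by_time]
```

Each C_t is symmetric positive (semi-)definite by construction; the modeling question is how these matrices evolve with t.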
We will utilize an appropriate statistical model informed by the manifold structure of the data and then derive a hypothesis testing procedure to detect groupwise differences in the changes of interactions between features in longitudinal analysis. To do so, we first summarize basic differential geometry notation (Do Carmo, 1992; Lee, 2012) and then describe our models. If desired, any other (efficient) manifold-valued linear model (Fletcher, 2013) can be substituted in; no change in the workflow is needed. A reader familiar with manifold regression algorithms may consider this module as a black-box and skip ahead to Section 3, which uses the parameter estimates from this procedure.

Let M be a differentiable (smooth) manifold in arbitrary dimensions. A differentiable manifold M is a topological space that is locally similar to Euclidean space and has a globally defined differential structure. A Riemannian manifold is a differentiable manifold M equipped with a smoothly varying inner product. The geodesic curve is the locally shortest path, analogous to a straight line in R^p; this geodesic curve will be the object that defines the trajectory of our covariance matrices in SPD space. Unlike in Euclidean space, note that there may exist multiple geodesic curves between two points on a curved manifold. So, the geodesic distance between two points on M is defined as the length of the shortest geodesic curve connecting the two points. The geodesic distance helps in measuring the error of our trajectory estimation (analogous to a Frobenius or ℓ2 norm based loss in the Euclidean setting). The geodesic curve from y_i to y_j is parameterized by a tangent vector in the tangent space anchored at y_i with an exponential map Exp(y_i, ·) : T_{y_i}M → M. The inverse of the exponential map is the logarithm map, Log(y_i, ·) : M → T_{y_i}M. These two operations

    Operation     Euclidean                           Riemannian
    Subtraction   →x_i x_j = x_j − x_i                →x_i x_j = Log(x_i, x_j)
    Addition      x_i + →x_j x_k                      Exp(x_i, →x_j x_k)
    Distance      ‖→x_i x_j‖                          ‖Log(x_i, x_j)‖_{x_i}
    Mean          Σ_{i=1}^n →x̄ x_i = 0               Σ_{i=1}^n Log(x̄, x_i) = 0
    Covariance    E[(x_i − x̄)(x_i − x̄)^T]           E[Log(x̄, x) Log(x̄, x)^T]

Table 1: Basic operations in Euclidean space and Riemannian manifolds.
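The Riemannian operations in Table 1 have closed forms on SPD(p) under the affine-invariant metric; the following is a sketch with my own helper names (computed via eigendecompositions, which keep everything real-valued), not the paper's code:

```python
# Closed-form Exp/Log maps and geodesic distance on SPD(p), affine-invariant
# metric: Log(a, q) = a^{1/2} log(a^{-1/2} q a^{-1/2}) a^{1/2}, and its inverse.
import numpy as np

def _sym_fun(m, f):
    """Apply a scalar function f to a symmetric matrix via eigendecomposition."""
    w, v = np.linalg.eigh((m + m.T) / 2.0)   # symmetrize against round-off
    return (v * f(w)) @ v.T

def spd_log(base, q):
    """Logarithm map Log(base, q): SPD point q -> tangent vector at base."""
    s = _sym_fun(base, np.sqrt)                       # base^{1/2}
    s_inv = _sym_fun(base, lambda w: 1 / np.sqrt(w))  # base^{-1/2}
    return s @ _sym_fun(s_inv @ q @ s_inv, np.log) @ s

def spd_exp(base, v):
    """Exponential map Exp(base, v): tangent vector at base -> SPD point."""
    s = _sym_fun(base, np.sqrt)
    s_inv = _sym_fun(base, lambda w: 1 / np.sqrt(w))
    return s @ _sym_fun(s_inv @ v @ s_inv, np.exp) @ s

def spd_dist(a, b):
    """Geodesic distance: Frobenius norm of the 'whitened' matrix logarithm."""
    s_inv = _sym_fun(a, lambda w: 1 / np.sqrt(w))
    return np.linalg.norm(_sym_fun(s_inv @ b @ s_inv, np.log), 'fro')
```

By construction, spd_exp and spd_log are mutual inverses at any base point, which is the property the regression model below relies on.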
Figure 1: Group-wise MMGLM: The left and right figures represent two linear models on the SPD(p) manifold. Points x_i in the tangent space are our covariate or predictor, and points y_i in the manifold space represent SPD(p) matrices. In our regression setting, we wish to minimize the error (brown curves) between the estimation and the sample points. Because each linear model has a different base point, the trajectories cannot be directly compared as in the Euclidean setting.

move us back and forth between the manifold and the tangent space. For completeness, Table 1 shows corresponding operations in the Euclidean space and Riemannian manifolds. Separate from the above notation, the matrix exponential (and logarithm) are simply exp(·) (and log(·)). Finally, parallel transport is a generalized parallel translation on manifolds. Given a differentiable curve γ : I → M, where I is an open interval, the parallel transport of v ∈ T_{γ(t)}M along the curve γ can be interpreted as the parallel translation of v on the manifold preserving its length and the angle between v(t) and γ. The parallel transport of v from y to y′ is denoted Γ_{y→y′} v.

Several regression models for manifold-valued data have been proposed recently, a majority of which are nonparametric (Jayasumana et al., 2013; Banerjee et al., 2015). Because of the longitudinal nature of our dataset (and recruitment considerations in neuroimaging studies), sample sizes do not exceed a few hundred participants (typically much smaller). We have found that generally, in this regime, parametric methods are better suited and also offer other benefits for downstream applications. Next, we will give a simple parametric model for this problem. Let x and y be vectors in R^p and R^{p′} respectively.

Definition 1 (Standard GLM.) The Euclidean multivariate multilinear model is

    y = β_0 + β_1 x_1 + β_2 x_2 + . . . + β_p x_p + ε    (1)

where β_0, the β_i, and the error ε are in R^{p′}, and x = [x_1 . . .
x_p]^T are the predictor variables.

Henceforth, we will use the terms covariate and predictor interchangeably to describe those specific features we wish to control for in our model (e.g., time-points in our experiments). For manifold-valued data, we adapt the formulation proposed by (Kim et al., 2014).
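The Euclidean model in Definition 1 has the familiar closed-form least-squares fit; a hedged sketch (the helper name fit_mglm is mine, not from the paper):

```python
# Ordinary least squares fit of the Euclidean multivariate multilinear
# model (Definition 1): y = beta_0 + sum_j beta_j x_j + noise.
import numpy as np

def fit_mglm(x, y):
    """x: (n, p) predictors; y: (n, p') responses.
    Returns the intercept beta_0 of shape (p',) and slopes of shape (p, p')."""
    X = np.hstack([np.ones((x.shape[0], 1)), x])   # prepend intercept column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves min ||X B - y||^2
    return coef[0], coef[1:]
```

The manifold model below replaces exactly these two objects: the intercept becomes a base point on M, and each slope vector becomes a tangent vector at that base point.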
Definition 2
The Manifold Multivariate General Linear Model (MMGLM) is defined as

    min_{b ∈ M, V_j ∈ T_b M ∀j}  Σ_{i=1}^N d(Exp(b, V x_i), y_i)²    (2)

where V x_i := Σ_{j=1}^p V_j x_i^j and d(·,·) is the geodesic distance between ŷ_i := Exp(b, V x_i) and y_i.

This formulation generalizes (1) by replacing the intercept β_0 and each vector β_j for a covariate with a base point b ∈ M and a geodesic basis V_j ∈ T_b M respectively. The geodesic basis V_j at b parameterizes a geodesic curve Exp(b, V_j x_j). Intuitively, this model is a 'generalized' linear model with the inverse exponential map Exp^{-1} (or logarithm map Log) as a 'link' function (Kim et al., 2014; Cornea et al., 2016). When the covariate/predictors are univariate, we will obtain a single geodesic curve, modeled via the so-called Geodesic Regression (Fletcher, 2013).

The objective in (2) can be solved by both gradient descent (Fletcher, 2013; Kim et al., 2014) and MCMC methods (Cornea et al., 2016). Unfortunately, these schemes can be expensive, especially when the dimension of the manifold is large. Further, if the algorithm needs to be run a large number of times, the computational footprint quickly becomes prohibitive. Motivated by these considerations, we use a so-called log-Euclidean approximate algorithm introduced in (Kim et al., 2014) with some adaptations, which requires mild assumptions on the manifold-valued data.

Recall that in classical ordinary least squares (OLS), the regression curve goes through the mean of the covariates and response variables, i.e., y − ȳ = β(x − x̄). Similarly, we assume that geodesic curves go through the mean of the response variables on the manifold. Then, the base point, or intercept, "b" in (2) can be approximated by the manifold-valued mean of the sample points, the Karcher mean (Karcher, 1977). The propositions derived from (Kim et al., 2014) lead directly to the following.

Proposition 3
Let C̄ be the unique Karcher mean of a sufficiently close set of covariance matrices that lie on a curve Ω. Then C̄ ∈ Ω, and for some tangent vector V ∈ T_{C̄}M and each C, there exists x ∈ R such that C = Exp(C̄, V x).

This allows us to bypass the fairly involved variational procedure to estimate the base point b. With this approximation of b̂ via ȳ, the remaining variables to optimize are the tangent vectors V. We do so by taking advantage of log-Euclidean schemes. Once the base point is established as the Karcher mean, each data point on the manifold is projected into the tangent space at that point: Log(ȳ, y). These "centered" points ỹ are now Euclidean, and if the covariates are centered as well (x̃), a closed form solution exists in the standard form of V = ỹ x̃^T (x̃ x̃^T)^{-1}.

In this setting, it is often assumed that two points y_1, y_2 have a distance defined as d(y_1, y_2) := ‖Log(y_1, y_2)‖_{y_1} ≈ ‖Log(b, y_1) − Log(b, y_2)‖_b. However, on SPD manifolds with an affine invariant metric, each tangent space has a different inner product varying as a function of the base point b, i.e., ⟨u, v⟩_b := tr(b^{-1/2} u b^{-1} v b^{-1/2}). This makes comparison of trajectories difficult without moving to tangent bundle formulations. This issue is discussed in some detail in (Muralidharan and Fletcher, 2012; Hong et al., 2015). However, note that

Remark 4
When the base point b is the identity I, then the inner product is exactly the Euclidean metric: ⟨u, v⟩_I := tr(I^{-1/2} u I^{-1} v I^{-1/2}) = tr(uv) = tr(u^T v).

This follows from the fact that u and v are symmetric matrices on SPD(p). We take advantage of this property through parallel transport. Specifically, we can bring all of the data to T_I M, which will allow for a meaningful comparison of two tangent vectors from different base points. Similar schemes have been used for projection on submanifolds in (Xie et al., 2010) and other problems (Sommer et al., 2014). With a fast algorithm to compute (2) available, we can now accurately model longitudinal trajectories of covariance matrices. Our statistical procedure described next simply assumes the availability of some suitable scheme to solve the manifold regression as defined in (2) efficiently and does not depend on particular properties of the foregoing algorithm.
3. Test Statistics for SPD(p) Trajectories
With an algorithm to construct a regression model for covariance matrix responses in hand, we can now describe a key component of our contribution: a test statistic which allows addressing the main question of interest:
Is the progression/trajectory of covariance matrices (over time) different across two groups?
In the standard two-sample testing problem, a hypothesis test is set up to check if the parameters of each group are significantly different:

    H_0 : θ_1 = θ_2  vs.  H_A : θ_1 ≠ θ_2    (3)

Recall that in a general linear model (GLM), when testing for mean group differences, the test parameters are the regression slopes from a standard GLM fit. In our setting, the parameters of interest are the population covariance trajectories estimated from the manifold regression in (2), see Fig. 1. While the trajectories and the slopes are related, note that our parameters are estimated on the manifold. Two unique manifold trajectories, when projected as simple multivariate responses in Euclidean space, may not be significantly different under the GLM hypothesis testing framework, as has been observed by (Du et al., 2014). Returning to our longitudinal trajectory formulation, we have the following naïve Covariance GLM:

Definition 5 Let vec(C_{g,t}) be the vectorized covariance matrix at timepoint t for group g ∈ {1, 2}. Then the naïve Covariance GLM is defined as

    vec(C_{g,t}) = β_0^g + β_1^g t + ε    (4)

with the slope θ_g = β_1^g in the hypothesis test in (3), and vec(·) is the vectorized form of the input matrix.

With this model, our hypothesis testing reduces to a simple difference of slopes, which is well-studied in the classical statistics literature.
Definition 6 (Seber and Lee, 2003) Let β_1, β_2 be the multivariate slopes calculated from estimating (4). Then an α-level hypothesis test rejects the null hypothesis β_1 = β_2 when L > χ²_{p, 1−α}, where

    L = (β̂_1 − β̂_2)^T Σ^{-1} (β̂_1 − β̂_2)    (5)

Knowing that the response space is structured, i.e., our covariance matrices lie on the SPD manifold, we seek a more appropriate test and corresponding test statistic which adequately captures this knowledge. Observe that we can directly apply the manifold regression in §2.

Definition 7
Let C_{g,t} be the covariance matrix at timepoint t for group g ∈ {1, 2}. Then the Longitudinal-Covariance GLM (LCGLM) is defined as

    C_{g,t} = Exp(b_g, V_g t)    (6)

with b_g and V_g being the base point and tangent vector respectively, as described in §2. But the two estimated tangent vectors V_1 and V_2 live in different tangent spaces. To accurately compare two tangent vectors, we must parallel transport both vectors to the same tangent space. Once they are both in the same space, we can construct a simple test statistic for the trajectory difference:

    L = ‖Γ_{b_1→I} V_1 − Γ_{b_2→I} V_2‖²_I    (7)

Recall that the inner product at the identity I coincides with the Euclidean metric. This can now be naturally interpreted as a difference of slopes, and together with a standard Euclidean Normal noise assumption yields the following hypothesis test.
Assume that Γ_{b→I}V is normally distributed N(0, I). Then the statistic defined in (7) follows a χ² distribution whose degrees of freedom equal the number of free entries of the symmetric tangent vector, and the corresponding threshold test is an α-level hypothesis test for the covariate trajectory group difference.

3.1 Incorporating First-Order Differences

In many real world situations, first-order information in the data is often valuable in identifying group differences. Restricting our analysis to only the second-order interactions, i.e., covariances, may be inefficient (or sub-optimal) when the mean signal difference between groups is large. Our construction easily extends to these cases. In particular, the product space over both means and covariances is R^p × SPD(p).
The typical GLM on the first order information is defined in the standard Euclidean space. So, computing the regression in the product space R^p × SPD(p) amounts to simply computing the regression on the first and second order statistics (mean and covariance) separately.

The above statement suggests that by applying the manifold regression to the covariances and the standard regression model for the means, we are directly solving the product space regression problem, incorporating both first and second order statistics. However, in these cases, the statistic defined above in (7) does not directly take into account the potential difference in means. Given our Normal noise assumption, we can easily invoke the standard Gaussian multivariate likelihood statistic for group differences.
Definition 10
Let μ̂_t, Σ̂_t be the estimated mean and covariance from the standard linear model and our manifold-covariance GLM respectively. Then the Gaussian likelihood of our data X is

    P(X | μ̂, Σ̂) = Π_{t=1}^T Π_{i=1}^{n_t} P(X_{t,i} | N(μ̂_t, Σ̂_t)),    (8)

where X_t is the subset of our data collected at timepoint t. Additionally, we can define a standard likelihood ratio test statistic as:

    L_prod = [ P(X_1 | μ̂_1, Σ̂_1) P(X_2 | μ̂_2, Σ̂_2) ] / P(X_{1,2} | μ̂_{1,2}, Σ̂_{1,2})    (9)

This statistic is again χ²-distributed (Seber and Lee, 2003), and an α-level hypothesis test for group difference analysis can be defined in the same way as above. While our manifold regression modeling is focused on the case of centered data (where the mean signal may not be significantly different between the groups), we use the product space construction, wherever appropriate, in experimental evaluations.
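The trajectory comparison of this section can be sketched end to end. Two ingredients are my assumptions where the text is terse: the closed form Γ_{b→I}V = b^{-1/2} V b^{-1/2} for transport to the identity under the affine-invariant metric, and p(p+1)/2 degrees of freedom (the free entries of a symmetric p×p tangent vector):

```python
# Sketch of the trajectory-difference test: transport both group slopes to
# the tangent space at the identity, take the squared norm of the difference,
# and compare against a chi-square critical value.
import numpy as np
from scipy.stats import chi2

def _sym_pow(m, p):
    w, v = np.linalg.eigh((m + m.T) / 2.0)
    return (v * w**p) @ v.T

def transport_to_identity(b, V):
    """Gamma_{b->I} V = b^{-1/2} V b^{-1/2}: 'whiten' the tangent vector."""
    s_inv = _sym_pow(b, -0.5)
    return s_inv @ V @ s_inv

def trajectory_diff_test(b1, V1, b2, V2, alpha=0.05):
    """Statistic in the style of (7), with a chi-square threshold test."""
    d = transport_to_identity(b1, V1) - transport_to_identity(b2, V2)
    L = np.sum(d * d)                 # squared norm at I = squared Frobenius
    p = b1.shape[0]
    dof = p * (p + 1) // 2            # free entries of a symmetric matrix
    return L, bool(L > chi2.ppf(1 - alpha, dof))
```

Identical group trajectories give a statistic of zero regardless of the shared base point, while a large slope difference at the identity crosses the critical value.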
4. Localizing Group Differences for SPD(p) Trajectories
The above procedure provides a precise mechanism to derive a statistic from the group-wise covariance matrix trajectories. However, when the effect sizes are poor, any scheme operating on the trajectories of the full covariance matrix may still fail to identify group differences (as is the case in our experiments). To improve statistical power, localizing the process of computing the trajectories only to the relevant features is critical. To this end, we consider the following global hypothesis testing problem

    H_0 : ∀R, β_R^1 = β_R^2  vs.  H_1 : ∃R, β_R^1 ≠ β_R^2,

where β denotes the slope and R is the region of the covariance matrix which only includes the relevant features, see Fig. 2. It turns out that by adapting scan statistics (Fan et al., 2012; Arias-Castro et al., 2011), we will be able to exclude the effect of irrelevant regions of the covariance matrix in the calculated trajectories. By extending this concept to graphs, we obtain an algorithm to identify subsets of features of the covariance matrix which show group differences that are otherwise unidentifiable, in a statistically rigorous way.
Scan statistics are a valuable tool for structured multiple testing. In its simplest form, we can consider a setting where we place a window (or box) over a region R in an image and calculate a local statistic L_R, e.g., an average or a response to a convolution filter. Then, the window can be raster scanned to various locations in the image, and the maximum over the set of local statistics can be called the scan statistic. Intuitively, if the image is assumed to be a Gaussian random field, we can set up a null hypothesis using a critical value, and finding a statistically significant signal (i.e., regions) corresponds to comparing the local region-wise statistic with the critical value. Of course, there is flexibility in terms of specifying properties of the regions, as described next.

Definition 11
Let R be the collection of all possible structured regions, and L_R be some statistic over a region R ∈ R. The scan statistic is defined as L* = max_{R ∈ R} L_R.

Recent results in scan statistics show how size corrections can be used to increase detection power in multi-scale analysis with nice guarantees (Walther et al., 2010; Wang et al., 2016). To utilize these ideas for our hypothesis test, we must extend scan statistics and these size corrections to a graph setting, where the graph is induced by a sparse estimation of the precision matrix, e.g., graphical lasso (or any other algorithm of choice) over the features. To do so, structured regions R and a statistic L_R on each region must be defined on the graph. Intuitively, in our case, L_R must capture the "difference" in group-wise covariance trajectories. As we will describe shortly, it is in the context of this statistic where we utilize the LCGLM (6), which will be invoked at the level of individual regions R, one by one.

Let G := (V, E) be a graph over the features (represented in the covariance matrix) with vertex set V and edge set E. We define the structured region R ⊆ G as a connected subgraph of G corresponding to the selection of those vertices as our feature subset (a block of the covariance matrix, see Fig. 2). A natural question is whether such an enumeration is tractable, since the number of connected subgraphs R is exponential. It turns out that if we make a mild assumption on the graph, the number of induced regions can be shown to be polynomially bounded. Further, it then naturally provides a size correction, the analog of a multiple testing adjustment.

Remarks.
In our motivating application, the group differences we seek to identify will involve a cohesive set of features that will be connected to each other, by definition (large changes in covariances indicate dependent features). Based on this observation, we assume that the true localized subgraph is a "ball" subgraph.

Figure 2: (left) A region of the sparse precision matrix, (center) the corresponding subgraph of that region, along with balls of varying radius from the root node E, e.g., B(E, 1) = {F, C, E}, (right) the ball subgraph constructed with r = 1. These subgraphs with bounded radius act as the structured regions on which scan statistics can be applied.
A ball subgraph consists of nodes within a given radius r from a particular node (see Fig. 2). The collection of ball subgraphs is defined as

    R = {B(v, r) : v ∈ V and r ∈ N}    (10)

where the ball subgraph B(v, r) := {v′ ∈ V : d(v, v′) ≤ r}, and d(v, v′) is the minimum length path connecting v and v′.

With this assumption, it can be verified that we now only need to search a polynomially bounded set of regions.
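The regions of Definition 12 can be enumerated with a breadth-first search from each vertex; a sketch (the adjacency-dict representation and helper name are mine):

```python
# Enumerate all distinct ball subgraphs B(v, r) of a graph via BFS from
# each vertex, deduplicating by vertex set.
from collections import deque

def ball_subgraphs(adj):
    """adj: dict vertex -> set of neighbours (undirected graph).
    Returns the set of distinct balls B(v, r) as frozensets of vertices."""
    balls = set()
    for root in adj:
        dist = {root: 0}                 # BFS distances from the root
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for r in range(max(dist.values()) + 1):
            balls.add(frozenset(v for v, d in dist.items() if d <= r))
    return balls
```

Since each vertex contributes at most one new ball per radius, the count stays small, consistent with the polynomial bound stated in Remark 13.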
Remark 13
The number of unique ball subgraphs in any graph G is bounded above by D|V|, where D is the diameter (longest chain) of the graph G.

On these regions (i.e., blocks of the covariance matrix), we will invoke the LCGLM to provide us a statistic L_R. This is just the difference in slopes of the calculated manifold regression across groups in (7). We will iteratively obtain this statistic for distinct regions R and find subgraphs that differ in their trajectories across groups using a size correction for hypothesis tests.

Let us revisit the standard linear model setting and assume that our slopes β_g^R correspond to the subset of slopes from features in R, and β̂_g^R is an estimate of that slope. In this case, we have the following statistic (see e.g. (Seber and Lee, 2003)),

    (β̂_1^R − β̂_2^R)^T Σ_R^{-1} (β̂_1^R − β̂_2^R) ∼ χ²_{|E(R)|},    (11)

where Σ_R is the covariance matrix of β̂_1^R − β̂_2^R. With a normal noise assumption, this covariance will be identity and the statistic would simply be the ℓ2-norm difference as in the classical analysis. To make the statistics comparable across different sizes, we use the standardized version of a χ²_{|E(R)|} distribution,

    L_R = [ (β̂_1^R − β̂_2^R)^T Σ_R^{-1} (β̂_1^R − β̂_2^R) − |E(R)| ] / √(2|E(R)|).    (12)

We can extend this analysis to our manifold setting.

Definition 14
For a given structured region $R$, the region-based LCGLM is written as
$$(b_g^R, V_g^R) = \operatorname*{argmin}_{(b^R, V^R) \in T\mathcal{M}^R} \; \mathbb{E}\left[ d\big( \mathrm{Exp}(b^R, V^R t_g), C_g^R \big)^2 \right], \qquad (13)$$
where $C_g^R$ is the covariance matrix subblock defined by the features included in $R$ for group $g$ ($t_g$ is our univariate predictor, i.e., time).

To compare the group trajectories, we first parallel transport each tangent vector to the identity as described in §3 and compute $\| \Gamma_{b_1^R \to I} V_1^R - \Gamma_{b_2^R \to I} V_2^R \|_I$. In the case of the product space construction, we apply the test in (8) to the data subset corresponding to the features in region $R$, with the same correction as in (12).

Summary.
We now have a region-based statistic for the manifold regression setting that is approximately distributed as $N(0, 1)$ under the null.

A final unresolved yet important issue is that we must correct $L_R$ based on the number of edges $|E(R)|$ in $R$. This has a direct consequence on detection power. Observe that the normalization for size correction should be determined by the null distribution of $L_R$, i.e., when there is no slope difference in the trajectories between groups. In order to derive a correction, we need to characterize the behavior of the scan statistic within roughly similar regions, $\max_{R \in \mathcal{R}(A)} L_R$, where $\mathcal{R}(A)$ is the collection of regions $R$ with $|E(R)|$ of size similar to $A$,
$$\mathcal{R}(A) = \{ R \in \mathcal{R} : A/2 < |E(R)| \le A \}. \qquad (14)$$
Clearly, the behavior of $\max_{R \in \mathcal{R}(A)} L_R$ depends on the "complexity" of $\mathcal{R}(A)$. A clear understanding of how similar subgraphs relate to each other leads directly to a correction tied to their relative sizes. To investigate the complexity of $\mathcal{R}(A)$, we define the following quantities.

Definition 15
The distance between subgraphs $R_1$ and $R_2$ is given by
$$d(R_1, R_2) = 1 - \frac{|E(R_1) \cap E(R_2)|}{\sqrt{|E(R_1)|\,|E(R_2)|}}. \qquad (15)$$

Definition 16
Let the $\epsilon$-covering number of $\mathcal{R}(A)$, denoted by $N(A, \epsilon)$, be the smallest integer such that there is a subset $\mathcal{R}_{\mathrm{approx}}(A, \epsilon)$ of $\mathcal{R}$ such that
$$\sup_{R_1 \in \mathcal{R}(A)} \; \inf_{R_2 \in \mathcal{R}_{\mathrm{approx}}(A, \epsilon)} d(R_1, R_2) \le \epsilon, \qquad (16)$$
where $|\mathcal{R}_{\mathrm{approx}}(A, \epsilon)| = N(A, \epsilon)$.

We can verify that all regions in $\mathcal{R}(A)$ can be approximated by regions in $\mathcal{R}_{\mathrm{approx}}(A, \epsilon)$ with reasonably small error. From the definitions, notice that the complexity of $\mathcal{R}(A)$ is reflected by $N(A, \epsilon)$. If $N(A, \epsilon)$ is nicely bounded (as is the case here), scan statistics can be calculated very efficiently (Lemma 18). Before stating this result, we make a mild assumption on our graph: for any ball subgraph, the edges around its center are not too sparse compared to the edges in the outer region of the ball subgraph, i.e., hard on the inside, soft on the outside. This yields,

Figure 3: (Left) Chain, ring and 2D lattice graphs that satisfy the Avocado Assumption. (Right) Star graph that does not satisfy the property: from the center node the graph is "too dense on the outside."
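The "hard on the inside, soft on the outside" property formalized in Assumption 17 (stated next) is easy to check numerically on small graphs. A hedged sketch (helper names are ours; graphs are plain adjacency dicts, assumed connected) confirming that a ring satisfies inequality (17) with H = S = 1 while a star rooted at a leaf does not, matching Figure 3:

```python
from collections import deque

def ring_adj(n):
    """Ring graph on n nodes."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def ball_edges(adj, root, r):
    """Number of edges in the subgraph induced by the ball B(root, r)."""
    dist = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        if dist[u] < r:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
    ball = set(dist)
    return sum(1 for u in ball for v in adj[u] if v in ball) // 2

def satisfies_avocado(adj, H, S, r_max):
    """Check inequality (17) for all centers v and all r/2 <= r' <= r <= r_max."""
    for v in adj:
        for r in range(1, r_max + 1):
            er = ball_edges(adj, v, r)
            for rp in range((r + 1) // 2, r + 1):
                lhs = ball_edges(adj, v, rp) / er
                rhs = H * (1 - ball_edges(adj, v, r - rp) / er) ** S
                if lhs < rhs - 1e-12:
                    return False
    return True
```

For a 20-node ring the inequality holds with equality throughout, while a 6-leaf star fails already at radius 2 when rooted at a leaf: the ball around the leaf is "too dense on the outside."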
Assumption 17 (Avocado) There exist constants $S$ and $H$ such that, for any $r/2 \le r' \le r$ and $v \in V$,
$$\frac{|E(B(v, r'))|}{|E(B(v, r))|} \ge H \left( 1 - \frac{|E(B(v, r - r'))|}{|E(B(v, r))|} \right)^S. \qquad (17)$$
We see that this assumption holds for many classes of graphs: a ring graph satisfies this condition with $H = 1$ and $S = 1$, and the 2-d lattice satisfies it with $H = 1/4$ and $S = 2$ (see Fig. 3). With this assumption, we have the following result for the $\epsilon$-covering number $N(A, \epsilon)$.

Lemma 18
Let $|E|$ be the total number of edges in $G$. If (17) holds and $A$ is given, then, for a constant $C_{H,S}$ which depends only on $H$ and $S$ in (17),
$$N(A, \epsilon) \le C_{H,S} \frac{|E|}{A} \left( \frac{1}{\epsilon} \right)^{S+1}. \qquad (18)$$
The proof of this result follows from our ball-subgraph construction and our Avocado assumption, and is provided in the Appendix.

Intuitively, this result upper bounds the number of subgraphs that need to be searched over to completely exhaust the search space of subgraphs. With this result, we can now construct a suitable size correction. Following the work of (Davies and Kovac, 2001) and (Wang et al., 2016), we can increase the power of our test by using the following statistic:
$$T^* = \max_{R \in \mathcal{R}} \left( L_R - \sqrt{\log \frac{|E|}{|E(R)|}} \right). \qquad (19)$$
The significance of this size correction is that we now have a single critical value for every candidate subgraph, regardless of the subgraph size. Our final test is defined as $\mathbb{I}[T^* > q_\alpha]$, where $q_\alpha$ is the $\alpha$-level quantile of $T^*$ under the null hypothesis (that no region is truly significant across groups). By construction, we can control the Type 1 error at a specified $\alpha$-level.

Under the alternative hypothesis of this framework, it is important to note that in many cases, large subgraphs that subsume smaller significant subgraphs may also have large test statistics, and our hypothesis test only indicates the existence of some significant region. To identify or localize the smaller subsets, we follow the procedure from (Jeng et al., 2010): beginning with the subgraph with the largest test statistic, we iteratively remove overlapping subsets from the total set of subgraphs. This requires testing each regional/local statistic, $L_R - \sqrt{\log(|E|/|E(R)|)}$, against $q_\alpha$. Under this procedure, we can control the weak family-wise error rate (wFWER) if we view our problem through the lens of multiple testing. The weak FWER is the probability of false discovery under the null hypothesis.
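Once the per-region statistics $L_R$ are available, the size-corrected maximum in Eq. (19) and the greedy localization step of (Jeng et al., 2010) reduce to a few lines. A hedged sketch (function name and data layout are ours): regions are (edge-set, $L_R$) pairs; each region is scored, and the highest-scoring regions that exceed $q_\alpha$ and do not overlap previously kept regions are retained.

```python
import math

def localize(regions, total_edges, q_alpha):
    """Greedy localization of significant regions.

    regions: list of (edge_set, L_R) pairs; total_edges = |E| of the full graph.
    Returns disjoint regions whose size-corrected score from Eq. (19),
    L_R - sqrt(log(|E| / |E(R)|)), exceeds q_alpha, taken from the largest
    score downward while discarding overlapping candidates."""
    scored = [(L - math.sqrt(math.log(total_edges / len(e))), e)
              for e, L in regions]
    scored.sort(key=lambda t: t[0], reverse=True)
    selected = []
    for score, e in scored:
        if score <= q_alpha:
            break  # scores are sorted, so no later region can pass
        if all(not (e & s) for s in selected):  # keep only disjoint regions
            selected.append(e)
    return selected
```

The statistic $T^*$ itself is simply the largest entry of `scored` when `regions` ranges over the full collection $\mathcal{R}$.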
To see that this is inherently controlled, note
$$P(F_N \ge 1 \mid H_0) = P(T^* > q_\alpha \mid H_0) \le \alpha, \qquad (20)$$
where $F_N$ is the number of false discoveries under the null hypothesis. With this correction at the group-difference level, we completely avoid the multiple comparisons issues that would arise with a separate test for each subgraph. In addition to controlling the false positive rate, we have the following guarantee on identifying truly significant regions under the normal noise assumption.
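The critical value $q_\alpha$ is the $\alpha$-level quantile of $T^*$ under the null; in practice it can be estimated by Monte Carlo, e.g., by recomputing the per-region statistics under permuted group labels. A hedged sketch of the quantile step only (array layout and function name are ours; the null draws of $L_R$ are assumed supplied by the user):

```python
import numpy as np

def null_quantile(null_LR, edge_counts, total_edges, alpha=0.05):
    """(1 - alpha)-quantile of T* (Eq. 19) from null simulations.

    null_LR: (n_sim, n_regions) array of region statistics simulated under H0.
    edge_counts: |E(R)| for each region; total_edges: |E| of the full graph."""
    correction = np.sqrt(np.log(total_edges / np.asarray(edge_counts, float)))
    T_star = (null_LR - correction).max(axis=1)  # one T* per null simulation
    return np.quantile(T_star, 1 - alpha)
```

The resulting value then serves as the single threshold for every candidate subgraph.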
Theorem 19 If (17) holds and the number of edges in every candidate subgraph is much larger than $\log |E|$, i.e.,
$$|E(R)| \gg \log |E| \quad \forall R \in \mathcal{R}, \qquad (21)$$
then the critical value $q_\alpha$ satisfies
$$q_\alpha = O(1). \qquad (22)$$
Moreover, as $|E| \to \infty$, if a subgraph $R$ obeys
$$\frac{(\beta_1^R - \beta_2^R)^\top \Sigma_R^{-1} (\beta_1^R - \beta_2^R)}{\sqrt{|E(R)|}} \gg \sqrt{\log \frac{|E|}{|E(R)|}}, \qquad (23)$$
then
$$P\left( L_R - \sqrt{\log \frac{|E|}{|E(R)|}} > q_\alpha \right) \to 1. \qquad (24)$$
The full proof of this result follows a generic chaining argument (see, e.g., (Talagrand, 2006)) along with applications of concentration inequalities and union bounds, and can be found in the Appendix.

Summary.
At a high level, this result directly characterizes the behavior of $T^*$ under the null hypothesis $H_0$ and the alternative hypothesis $H_1$, respectively. We see that (22) implies that $T^*$ can roughly be seen as a constant under the null hypothesis, and under the alternative hypothesis, when (23) is satisfied, the test based on $T^*$ is consistent; see (24).

Figure 4: (left,top) States identified as having significantly different time-varying tobacco usage across gender from 2001 to 2015. (left,bottom) States identified as having significantly different time-varying heavy drinking across gender from 2010 to 2015. (right) Linear regressions of tobacco usage fitted to the four states defined by the ball subgraph centered at Louisiana. Best viewed in color.

With these guarantees, our full workflow is as follows. First, we use an oracle procedure to generate a graph over our features that roughly captures the conditional independences. Any procedure that provides a conditional independence graph is sufficient. Next, for each ball subgraph over this graph, we compute the Longitudinal-Covariance GLM over these features for both groups, and compute the statistics outlined in §
3. We then compute the size-corrected statistic and compare it against the single critical value. For all regions that pass this threshold, we apply the procedure from (Jeng et al., 2010). This workflow shows how to conduct hypothesis tests on temporal trends of large covariance matrices, with improved power and bounded Type 1 error. Additional implementation details can be found in the Appendix.
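In the Euclidean case, the per-region statistic computed in this workflow is Eq. (12): a $\chi^2$ slope-difference standardized by its null mean and standard deviation. A minimal numpy sketch (function name is ours; under the normal-noise model `sigma` is the identity):

```python
import numpy as np

def region_statistic(beta1, beta2, sigma):
    """Standardized slope-difference statistic of Eq. (12).

    beta1, beta2: estimated slope vectors for the two groups on region R.
    sigma: covariance of (beta1 - beta2)."""
    diff = beta1 - beta2
    chi2 = diff @ np.linalg.solve(sigma, diff)  # Eq. (11): ~ chi^2_{|E(R)|} under H0
    d = len(diff)                               # degrees of freedom, |E(R)|
    return (chi2 - d) / np.sqrt(2 * d)          # approx. mean 0, variance 1 under H0
```

Because the statistic has approximately mean 0 and variance 1 under the null regardless of $|E(R)|$, scores become comparable across regions of different size, which is exactly what the size correction in Eq. (19) requires.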
5. Localization Evaluation: Trends of Tobacco Usage Across Gender
We begin our empirical analysis of the model by first applying the subgraph localization procedure by itself (standalone), separate from our manifold regression scheme. In this case, our statistic is derived only from
Generalized Linear Model (GLM) constructions, where $\hat{\beta}_g^R$ in equation (12) is the slope estimated from fitting standard first-order linear models. Identifying the differentially varying subgraphs across groups in this way resembles a simpler version of the planted clique identification problem (Arora and Barak, 2009), where the clique we are trying to identify corresponds to those nodes whose slopes vary significantly across groups.

Data.
The Centers for Disease Control and Prevention (CDC) provides extensive statistics on tobacco and alcohol usage across the US. These data have been collected systematically for the last few decades and are publicly available (including demographic information and gender). As a simple application of our proposed framework, we may pose the following question: which "sub-groups" of states tend to evolve differently in their correlation (pertaining to tobacco/alcohol usage) over time? Our framework extends easily to answer this question. In this setup, the oracle graph is simply the adjacency graph of the continental US, which arises naturally and is used directly in our scanning procedure. For this dataset, we have direct observations of node measures: the percentage of males and females who reported smoking or drinking heavily in each state. Using gender as the group, we fit standard linear models for each candidate subgraph, and compute the difference of gender-wise slopes statistic as described above. In Figure 4, we show the regions identified using our method, and interpret some of the tobacco usage findings here.

In the northeast, we see that women have reduced their tobacco usage at a significantly faster rate than men compared to the rest of the country. We suspect that this may be at least partly tied to the development of women's cigarette brands in the late 1960s and 1970s, followed by subsequent aggressive public policy campaigns in the 1990s and 2000s highlighting health risks beyond pulmonary or cardiovascular diseases for women (e.g., infertility, reduced bone density in post-menopausal women). We also note that state-wide indoor smoking bans were put in place in the Northeast ahead of many other states in the union. In the South, the trends among men and women also seem to differ significantly (see Fig. 4).
Apart from health factors, the group-wise differences in trends may also be explained by factors identified in a 2007 study (Stehr, 2007), which found that as state sales taxes on cigarettes increased, women were significantly more price elastic than men. Between 2006 and 2008, the cigarette tax increased dramatically for all of the 4 states identified except for Louisiana, whose tax rate remained constant. Additionally, while Arkansas did increase its cigarette tax in 2009, it did not increase taxes in locations near borders shared with higher-taxing states. These intricate relationships among states lend credibility to the claim that our scan statistics framework is indeed identifying interesting sub-regions, and suggest that the full covariance-trajectory pipeline may be more appropriate when effects beyond the means are relevant to an analysis.
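The standalone procedure of this section needs only ordinary least squares per group per candidate region: fit one line to each group's node measures and difference the slopes. A toy sketch with synthetic numbers (the CDC figures are not reproduced here; function names are ours):

```python
import numpy as np

def slope(t, y):
    """OLS slope of y regressed on t (first-order linear model)."""
    return np.polyfit(np.asarray(t, float), np.asarray(y, float), 1)[0]

def slope_difference(t, y_group1, y_group2):
    """Difference of group-wise trend slopes for one region's node measure,
    the per-feature ingredient entering the statistic of Eq. (12)."""
    return slope(t, y_group1) - slope(t, y_group2)

# synthetic yearly smoking rates: the second group declines one point/year faster
t = [0, 1, 2, 3]
men = [20.0, 19.0, 18.0, 17.0]    # slope -1
women = [22.0, 20.0, 18.0, 16.0]  # slope -2
```

Here `slope_difference(t, men, women)` is 1.0: the second group's usage declines one percentage point per year faster, which is precisely the kind of signal the scan aggregates over a subgraph.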
6. Pipeline Evaluation on Simulations and Baby Name Trends Over Time
We next evaluate the ability of our entire analysis pipeline to identify group differences across temporally evolving covariance trajectories. In many existing analyses, the effect of the mean differences may be stronger than the effect of the interaction matrix. However, in cases where the mean signal is weak, we expect the covariance effect to be important. To evaluate our model in this regime, we perform a set of simulation studies and also analyze a publicly available longitudinal dataset.
Simulations.
We randomly generate SPD matrices from a 'path' of 4 discrete points along the manifold, and use these as population covariance matrices to generate 0-mean sample data. Table 2 shows the results of the hypothesis testing procedure with 50 features, averaged over 100 runs, where both the true number of features with covariance trajectory differences, $p_t$, and the number of samples per group, $n$, were varied. As expected, our recovery rate increases nicely as a function of the number of samples $n$, and decreases as the size of the region of change $p_t$ is increased with $n$ held constant.

We compare our model to baseline methods that may be used in practice for the foregoing group difference hypothesis test. In standard applications, general linear models (GLMs) are often the first line of attack. When the covariates are assumed to be independent, a simple linear model as in (6) may be suitable. However, when the group difference is influenced

Table 2: Detection accuracy of the hypothesis test scheme (100 runs).

           p_t = 5   p_t = 8   p_t = 10   p_t = 15
n = 10      0.06      0.02      0.04       0.03
n = 20      0.75      0.75      0.53       0.29
n = 50      0.99      1.00      1.00       0.80
n = 100     1.00      1.00      1.00       0.95
n = 200     1.00      1.00      1.00       0.98
n = 1000    1.00      1.00      1.00       1.00

(Figure 5 legend: GLM w/ Interactions, Manifold Trajectory GLM, Naive GLM; vertical axis: % Correct Null Rejections.)
Figure 5:
Correct null hypothesis rejections over 100 runs for three models. For $p = 50$ features, each plot shows the rejection rate for increasing values of $p_t$ (from left to right) as a function of the number of sample points.

by specific interactions between covariates, such linear models require additional care. A typical solution is to introduce pairwise interaction terms into the model, a choice between all possible interactions or specific interactions specified by an expert. The first model is problematic since the number of samples is small relative to the number of interaction terms ($n \ll p^2$). In the second model, we depend completely on the user's choice of interactions, and must correct for multiple testing when testing different models, at least partly reducing the power of the final test. Figure 5 shows the value of our method over these models. For the interaction GLM case, we randomly select interaction terms to include in the GLM, with size $p_t$ (the ground truth number of variables in the interaction). In this way, we approximate the effect of an oracle specifying to the GLM which terms may describe the underlying interaction. We report the fraction of significance tests where a significance threshold of $p \le 0.05$ was reached for each model, averaged over 100 runs. We see that our proposed scheme consistently achieves near-perfect results in terms of the percentage of null hypotheses that were correctly rejected (i.e., there was a significant group-difference signal). The power of scan statistics on graphs is particularly evident in the needle-in-haystack setting, where the true differential signal is small ($p_t \le 8$) and the sample size is small to medium. When both the sample size and $p_t$ are large, the standard linear model with additional interaction terms starts to approach the statistical performance of our algorithm.

Figure 6: Contiguous states identified as having significantly different time-varying co-occurrences between boys' and girls' baby names from 1910 to 2015. Best viewed in color.

Longitudinal trends in Baby Names.
In addition to the simulations above, we report results from a simple analysis of how male/female baby names evolve over the last century. The United States Social Security Administration provides a publicly available dataset listing the frequency of the top 1000 baby names in each state for the last 106 years. We evaluate our model in this context to examine which "sub-groups" of states tend to evolve (or change) in their "name agreement" (or correlation) over time between boys' names and girls' names. Here, rather than calculating a sample covariance at each timepoint, we calculate a rank correlation matrix instead. For example, if two neighboring Gulf Coast states, say Georgia and Alabama, substantially agreed on both boys' and girls' names in the period following the Second World War, but this agreement gradually declined over time for girls (but not boys), we expect that our scan-statistics-on-graphs hypothesis test will segment out this differential signal (in slope trends) from the planar graph induced by the states sharing a border. Shown in Figure 6 are the regions identified using our method, applied to only the rank correlations of the top 10 names for both genders per state per year. Each highlighted region indicates a sub-group whose "trends of correlation (or agreement/disagreement)" in preferred baby names over the last century vary between boys and girls. For states not identified by our model (in gray), we can conclude that the state-to-state name preference interactions may still have evolved over time, but we have insufficient statistical evidence to conclude that such trends (slopes) differ between boys and girls.

7. Identifying Differentially Covarying Features in Preclinical Alzheimer's Disease
We now describe experiments and results focused on the key motivation of this work: facilitating the analysis of a longitudinal study of individuals at risk for Alzheimer's disease (AD), where the statistical signal is weak (with small to medium sample sizes). We describe the dataset details, followed by the analysis, and then interpret our conclusions in the context of scientific results that have been published in the aging and dementia literature.
Study background.
We analyzed data from a cohort of individuals who have been longitudinally tracked for at least three visits over multiple years, as part of an ongoing study (since 2001) to understand disease processes in the brain before an individual exhibits signs of cognitive decline due to Alzheimer's Disease (AD) (Sager et al., 2005). The study, the Wisconsin Registry for Alzheimer's Prevention (WRAP), is among the largest of its kind in existence focused on "preclinical" AD, i.e., when individuals are still cognitively healthy, offering a window into the early disease processes where treatments, drugs and interventions are likely to be most effective. WRAP and its ancillary studies acquire neuroimaging data (MRI, PET with different tracers, diffusion MRI), various clinical test scores, genetic and demographic data, as well as clinical measures such as Cerebrospinal Fluid (CSF) assays. Our analysis seeks to understand subtle group-wise differences in longitudinal patterns of dependencies between these measures at this early stage of the disease.
Dataset.
The dataset consisted of 114 subjects with imaging data from at least two imaging modalities: positron emission tomography (PET) and diffusion weighted Magnetic Resonance (MR) imaging. PET images were used to calculate, using well-validated pre-processing pipelines, the mean amyloid plaque load (an important biomarker for AD) in 16 different anatomical regions of interest in the brain. Amyloid plaque is a known AD-related pathology and generally precedes the onset of cognitive symptoms. Separately, diffusion tensor MR imaging (DTI) data were processed and used to calculate both Fractional Anisotropy (FA) and Mean Diffusivity (MD) in 48 distinct regions (Mori et al., 2008). DTI images provide information about structural connectivity between gray matter regions in the brain. In addition to these 108 (48 ×

Is analysis of second order statistics necessary?
In Figure 7, we present histograms detailing the distribution of two critical cognitive tests, stratified across various groups of scientific interest. Evaluating these distributions was the key motivation for our exploration into the methods described in this paper. The small differences in means across groups regardless of the stratification variable, the saturation that occurs at the ceiling of cognitive test scores, and other preliminary experiments conducted by us suggest that standard analyses are not sensitive enough to identify subtle higher-order differences.
Figure 7: Histograms of the Boston Naming Test scores and RAVLT total scores at all timepoints for the 114 individuals, across different group separations (PiB positivity, ApoE, gender, family history). The means of each test score are not significantly different across the stratification variables.
We now describe, one by one, the components of the largest feature subset discovered for each stratification scheme, and highlight the main scientific findings. In most cases, we provide a brief scientific interpretation of the results for the interested reader. Additional details and results are available in the appendix.
A) Graph Scan Statistics on slope differences across gender.
The most significant (based on region score) subset identified under the gender grouping linked the FA DTI measurement in the left cingulum gyrus with scores on the Rey Auditory Verbal Learning Test (RAVLT). In recent AD research, gender has been identified as a factor in the progression of various pathology measures (e.g., the incidence and prevalence of AD are higher in women (Fratiglioni et al., 1991; Rimol et al., 2010)), and has contributed to a formal NIH notice (NOT-OD-15-102). However, we note that previous work in the field has not identified gender-related differences when looking only at diffusion measures in the cingulum (Lin et al., 2014). Our algorithm successfully identified longitudinal changes in the interaction between these variables, which supports the earlier results and provides some evidence that as men and women age, their cognitive decline as measured by RAVLT manifests differently in relation to the cingulum gyrus.
B) Graph Scan Statistics on slope differences across genotype.
Next, we stratified the cohort based on the genotype known to be most closely linked with AD, the APOE (Apolipoprotein E) gene (Corder et al., 1993): we inherit one APOE allele from each parent, and having one or two copies of the e4 allele increases a person's risk of developing AD.
Gender
  Set 1: RAVLT Total (1-5); FA Cingulum L
  Set 2: FA Medial lemniscus L; FA Cingulum (hippocampus) L; FA Posterior thalamic radiation L
  Set 3: FA Corticospinal tract R; FA Superior O.F. fasciculus R

Genotype: APOE4
  Digit Span Backward Raw Score; Stroop Color-Word Score;
  PiB Cingulum Post L; PiB Cingulum Post R; PiB Frontal Med Orb L; PiB Frontal Med Orb R;
  PiB Precuneus L; PiB Precuneus R; PiB SupraMarginal; PiB Temporal Mid R

Table 3: Group difference across Gender (left) and Genotype APOE4 expression (right). Three disjoint sets of features were identified as covarying significantly differently across gender, while one larger set was identified in the genotype stratification.
Amyloid Load (PiB Positivity)
  Set 1: PiB Angular L/R; PiB Cingulum Ant L/R; PiB Cingulum Post L/R; PiB Frontal Med Orb L/R; PiB Precuneus L/R; PiB Temporal Sup L/R; PiB Temporal Mid L/R; PiB SupraMarginal L
  Set 2: FA Cerebral peduncle R; FA Cerebral peduncle L; MD Corticospinal tract R; MD Corticospinal tract L; Trail-Making Test Part A Score; MD Cerebral peduncle R; PET Cingulum Post R
Table 4: Group difference across Amyloid Load (PiB positivity).

The rarer e2 allele, by contrast, is associated with a lower risk of AD. Using this stratification, we obtain a low-risk and an at-risk group of individuals. Here, we identified amyloid-load regions within the medial and lateral parietal lobes, and find that in the "low-risk" group the covariances between the Digit Span and Stroop Color-Word scores (attention and concentration measures) and amyloid load move from strongly negative towards 0 as a function of age (Table 3). In the "at-risk" (APOE4) group, however, we find that as a function of age the features become more and more positively correlated. Existing studies have shown that the accumulation of amyloid differs significantly across APOE4 gene expression (Mormino et al., 2014), and our results provide some evidence that the expression of the genotype may interact with cognitive scores as well, even at this early stage of the disease, when the individuals in our cohort are cognitively healthy. The sets of features showing a differential signal are presented in Table 3.
C) Graph Scan Statistics on slope differences across amyloid load positivity.
As briefly described above, amyloid load is an important biomarker for AD. For our analysis, amyloid (or PiB) positivity is calculated using the mean amyloid PiB measures across all brain regions from a PiB PET scan of the participant. When we used this measure for stratification (with the threshold set at 1.18, following (Darst et al., 2017)), our model identified fifteen of the sixteen PiB regions that were input to the model when the density of the oracle graph was set to be high. This result is as expected, but interestingly we find that, controlling for the linear combination of the features (through centering), the residual error still has significant signal with the PiB positivity measure, indicating that amyloid burden interactions across brain regions play a very important role in AD progression (Hardy and Selkoe, 2002; Hardy and Higgins, 1992; Tanzi and Bertram, 2005; Jack Jr et al., 2010). When the sparsity of the oracle graph was increased, however, four neighboring regions, the left and right corticospinal tract and the left and right cerebral peduncle, were identified on both PiB and DTI measures (supported by the literature (Douaud et al., 2011)), together with Part A of the Trail-Making Test (see Table 4), which happens to be used in AD diagnosis (Albert et al., 2011). This suggests that changes in atrophy within these regions, as measured by DTI, co-occur with changes in amyloid burden. Additionally, because these regions are highly correlated with gross and fine motor ability (Naidich et al., 2009), it seems plausible that amyloid positivity will lead to higher 'covariation' in these regions together with a measure of fine motor speed, i.e., the Trail-Making Test.

Expert Consensus Diagnosis
  WAIS-3 LNS Raw Score; Boston Naming Test Total Score; RAVLT A2 Raw Score; RAVLT A3 Raw Score; RAVLT A4 Raw Score; RAVLT A5 Raw Score; RAVLT A6 Raw Score; RAVLT Delayed Recall Raw Score; Trail-Making Test Part A; Trail-Making Test Part B; Clock Drawing Test Score; CES Depression Scale Score

Table 5: Group difference localization across expert clinical diagnosis. With significantly more samples and a larger set of cognitive tests, those above were identified as significantly different across the expert consensus measure.
In addition to the dataset presented above, we apply our method to a much larger dataset consisting of approximately 1500 individuals with only cognitive testing data, collected longitudinally. Each individual was administered these tests at between two and three time points, yielding approximately n = 4000 samples for our model. For each assessment, a conference of experts applied a diagnostic label indicating normal cognition or mild cognitive impairment. Using this binary classification, we can stratify our population for group difference analysis. We find that, among many different significant subsets, the covariance trajectory among the scores on both parts of the Trail-Making Test and on all trials of the RAVLT explains a significant group difference. These have previously been shown to be the most sensitive tests for early cognitive decline (Albert et al., 2001). Table 5 displays the other tests identified by our algorithm, and additional experiments on this larger cohort can be found in the appendix.

7.3 Baseline.

In various experiments on this dataset, when the MMGLM procedure is performed on the entire feature set in totality (not utilizing any of the proposed ideas based on scan statistics), with the null distribution derived using permutation testing, the procedure yields no significance across any scientifically interesting group stratification. This implies that the ability to search over different blocks of the covariance matrix is critical to identifying meaningful group differences in the trajectories, and is unavailable using alternate schemes. For instance, simpler strategies work well enough for datasets such as ADNI, which includes diseased subjects as well as controls, where the signal is stronger and even temporal modeling may be unnecessary.
While the scientific results need to be interpreted with caution, and reproducibility experiments on other similar datasets (both within the US and internationally) are in the planning phase, we believe that the ability to localize differences in these interaction patterns in a statistically rigorous manner is valuable, and these findings can be investigated standalone via more classical schemes (e.g., structural equation modeling).
8. Conclusions
The analysis of datasets to identify where clinically disparate groups differ is pervasive in biology, neuroscience, genomics and epidemiological studies. We find that graphical models are an ideal tool for analyzing high-dimensional data in these areas, but they have been used sparingly for the analysis of group-wise differences, especially in the longitudinal setting. Motivated by an application involving the longitudinal analysis of imaging and clinical/cognitive data from otherwise healthy individuals who are at risk for Alzheimer's disease (AD), we show how a combination of manifold regression with a generalization of scan statistics to the graph setting yields tools that can be directly deployed. We present an efficient algorithm and develop theoretical results showing the regimes where its application is appropriate. In various experiments, while the standard schemes are not sufficiently powered to detect the signal, our proposed formulation detects meaningful group difference patterns, many of which have a clear scientific interpretation. We believe these results are promising both for the neuroimaging application described and for other regimes where group-wise analysis is desired but the number of features is large.
Acknowledgments
This research was supported in part by NIH grants R01 AG040396, AG021155, EB022883 and NSF grants DMS 1265202 and CAREER award 1252725. The authors were also supported by the UW Center for Predictive Computational Phenotyping (via BD2K award AI117924) and the Wisconsin Alzheimer's Disease Research Center (AG033514). Mehta was supported by a fellowship via training grant award T32LM012413.

Appendix A. Technical Proofs
A.1 Proof of Lemma 18
To remind the reader, this result was necessary to reduce the number of subgraphs (regions) that need to be evaluated over the graph. By bounding the covering number, we have a guarantee that we do not need to consider an exponential number of subgraphs in order to find a localization.
Proof
To upper bound $N(A, \epsilon)$, we first construct an $\epsilon$-covering set of $\mathcal{R}(A)$ under the metric $d$. To this end, we decompose $\mathcal{R}(A)$ into several disjoint sets
$$\mathcal{R}_j(A) = \left\{ B(v, r) \in \mathcal{R}(A) : \big(1 - (j+1)\epsilon\big) A < |E(B(v, r))| \le (1 - j\epsilon) A \right\},$$
for $j = 0, 1, \ldots, \lceil 1/(2\epsilon) \rceil$. Our strategy is to construct an $\epsilon$-covering set for each set $\mathcal{R}_j(A)$. We only construct the $\epsilon$-covering set for $\mathcal{R}_0(A)$; $\mathcal{R}_j(A)$ ($j \ge 1$) can be treated similarly. To construct the $\epsilon$-covering set for $\mathcal{R}_0(A)$, we denote by $d_{v,r}$ the largest positive number such that
$$\frac{|E(B(v, r - d_{v,r}))|}{|E(B(v, r))|} \ge 1 - \epsilon, \qquad (25)$$
for every $v \in V$ and $r \in \mathbb{N}$. Let $D_1$ be the collection of $d_{v,r}$ such that $B(v, r) \in \mathcal{R}_0(A)$, i.e., $D_1 = \{ d_{v,r} : B(v, r) \in \mathcal{R}_0(A) \}$, and $V_1$ the collection of nodes such that $B(v, r) \in \mathcal{R}_0(A)$, i.e., $V_1 = \{ v : B(v, r) \in \mathcal{R}_0(A) \}$.

We pick the largest number in $D_1$, denoted by $d_{v_1, r_1}$, i.e., $d_{v_1, r_1} \ge d_{v,r}$ for all $d_{v,r} \in D_1$, and define $\tilde{V}_1$ as $\tilde{V}_1 = \{ v \in V_1 : v \in B(v_1, d_{v_1, r_1}/2) \}$. After defining $\tilde{V}_1$, $D_2$ and $V_2$ can be defined as $D_2 = D_1 \setminus \{ d_{v,r} : v \in \tilde{V}_1 \}$ and $V_2 = V_1 \setminus \tilde{V}_1$. Then we pick the largest number in $D_2$, denoted $d_{v_2, r_2}$, and $\tilde{V}_2$ can be defined similarly. We repeat the above process until $D_{M+1}$ and $V_{M+1}$ are empty for some $M$. We thereby obtain a partition of $V_1$,
$$\bigcup_{i=1}^M \tilde{V}_i = V_1 \quad \text{and} \quad \tilde{V}_{i_1} \cap \tilde{V}_{i_2} = \emptyset, \quad 1 \le i_1 < i_2 \le M.$$
Based on $d_{v_1, r_1}, \ldots, d_{v_M, r_M}$, we are ready to prove that the set $\mathcal{R}_0(A, \epsilon) = \{ B(v_i, r_i) : 1 \le i \le M \}$ is in fact an $\epsilon$-covering set for $\mathcal{R}_0(A)$. To this end, it is equivalent to show that for arbitrary $B(v', r') \in \mathcal{R}_0(A)$, we have
$$d\big( B(v', r'), B(v_i, r_i) \big) \le \epsilon \qquad (26)$$
when $v' \in \tilde{V}_i$. To show (26), we consider two cases: $r' > r_i - d_{v_i, r_i}/2$ and $r' \le r_i - d_{v_i, r_i}/2$. If $r' > r_i - d_{v_i, r_i}/2$, then
$$B(v_i, r_i - d_{v_i, r_i}) \subset B(v', r').$$
Combining the above result, (25), and the definition of $\mathcal{R}_0(A)$ yields
$$\frac{|E(B(v', r')) \cap E(B(v_i, r_i))|}{\sqrt{|E(B(v', r'))| \, |E(B(v_i, r_i))|}} \ge \frac{|E(B(v_i, r_i - d_{v_i, r_i}))|}{\sqrt{|E(B(v', r'))| \, |E(B(v_i, r_i))|}} \ge \sqrt{(1 - \epsilon)\, \frac{|E(B(v_i, r_i - d_{v_i, r_i}))|}{|E(B(v', r'))|}} \ge 1 - \epsilon.$$
On the other hand, if $r' \le r_i - d_{v_i, r_i}/2$, then
$$B(v', r') \subset B(v_i, r_i). \qquad (27)$$
By the definition of $\mathcal{R}_0(A)$, we then get
$$\frac{|E(B(v', r')) \cap E(B(v_i, r_i))|}{\sqrt{|E(B(v', r'))| \, |E(B(v_i, r_i))|}} \ge \sqrt{\frac{|E(B(v', r'))|}{|E(B(v_i, r_i))|}} \ge 1 - \epsilon.$$
Therefore, (26) is proved and $\mathcal{R}_0(A, \epsilon)$ is an $\epsilon$-covering set for $\mathcal{R}_0(A)$.

The rest of the proof bounds the cardinality of $\mathcal{R}_0(A, \epsilon)$, i.e., $M$. Note that (17) implies there exists some constant $D_{H,S}$, depending only on $H$ and $S$, such that for any $v \in V$ and $r \in \mathbb{N}$, $|E(B(v, r/4))| \ge D_{H,S} |E(B(v, r))|$. By the definition of $d_{v_i, r_i}$, we can ensure the balls $B(v_i, d_{v_i, r_i}/4)$ are disjoint. Hence,
$$|E(\tilde{V}_i)| \ge |E(B(v_i, d_{v_i, r_i}/4))| \ge D_{H,S} |E(B(v_i, d_{v_i, r_i}))| \ge D_{H,S} \frac{H A \epsilon^S}{2^{S+1}},$$
where the last inequality follows from (17) and (25). A volume argument then yields
$$M \le \frac{|E|}{D_{H,S}\, H A \epsilon^S / 2^{S+1}} = \frac{2^{S+1}}{D_{H,S}\, H} \cdot \frac{|E|}{A} \left( \frac{1}{\epsilon} \right)^S,$$
and (18) is obtained upon application of the above to each $\mathcal{R}_j(A)$.

A.2 Proof of Theorem 19
Before proving Theorem 19, we need the following result.

Lemma 20
Let Y_1, ..., Y_d be i.i.d. standard Gaussian variables, i.e., N(0, 1), and let a_1, ..., a_d be a sequence of numbers. If

Z = Σ_{i=1}^{d} a_i (Y_i² − 1),    (28)

then

P( |Z| ≥ 2|a|√x + 2|a|_∞ x ) ≤ 2 exp(−x),    (29)

where |a| = √( Σ_{i=1}^{d} a_i² ) and |a|_∞ = max_{i=1,...,d} |a_i|.

Proof
This is a direct extension of Lemma 1 in (Laurent and Massart, 2000) to the negative case, and we follow arguments similar to theirs. Let φ(x) be the logarithm of the Laplace transform of Y_i² − 1. For any −1/2 < x < 1/2,

φ(x) = log( E( exp( x (Y_i² − 1) ) ) ) = −x − (1/2) log(1 − 2x) ≤ x² / (1 − 2|x|).

This leads to

log( E(e^{xZ}) ) = Σ_{i=1}^{d} log( E( exp( a_i x (Y_i² − 1) ) ) ) ≤ Σ_{i=1}^{d} a_i² x² / (1 − 2|a_i||x|) ≤ |a|² x² / (1 − 2|a|_∞ |x|).

With the same arguments as in (Laurent and Massart, 2000), we can then show that

P( Z ≥ 2|a|_∞ x + 2|a|√x ) ≤ exp(−x).

The other direction follows by applying the same argument to −Z. With this in hand, we proceed to prove Theorem 19.

Proof
In the following proof, C always refers to some constant, although its value may change from place to place. First, we prove (22). To this end, we prove concentration inequalities for L_R for a fixed R, and for L_{R_1} − L_{R_2} for R_1 ≠ R_2. Since we assume the noise follows a normal distribution, the standardized quantity

( β̂_{R,1} − β̂_{R,2} )^T Σ_R^{−1} ( β̂_{R,1} − β̂_{R,2} )

follows a χ²_{|E(R)|} distribution under the null. By the tail bound for χ² random variables (see, e.g., (Laurent and Massart, 2000)), we obtain

P( L_R > √(2t) + 2t/√(2|E(R)|) ) ≤ exp(−t).    (30)

By definition, L_{R_1} − L_{R_2} can be written as

L_{R_1} − L_{R_2} = Σ_{i ∈ R_1\R_2} Z_i / √(2|E(R_1)|) + ( 1/√(2|E(R_1)|) − 1/√(2|E(R_2)|) ) Σ_{i ∈ R_1∩R_2} Z_i − Σ_{i ∈ R_2\R_1} Z_i / √(2|E(R_2)|),

where the Z_i are independent random variables following the distribution χ²_1 − 1. Lemma 20 implies

P( |L_{R_1} − L_{R_2}| > 2√( d(R_1, R_2) t ) + 2t / √( min(|E(R_1)|, |E(R_2)|) ) ) ≤ 2 exp(−t).    (31)

We now proceed to prove (22) by applying a chaining argument (see (Talagrand, 2006)) together with the concentration inequalities (30) and (31). Recall that R_app(A, ε) is the smallest ε-covering set of R(A) and N(A, ε) is the covering number of R(A). For any subgraph candidate R, we denote

π_l(R) = arg min_{R′ ∈ R_app(A, e^{−l})} d(R, R′).

For l* > l_*, which will be specified later, we decompose max_{R ∈ R(A)} L_R into three parts:

max_{R ∈ R(A)} L_R ≤ max_{R ∈ R(A)} |L_R − L_{π_{l*}(R)}| + Σ_{l = l_*}^{l* − 1} max_{R ∈ R(A)} |L_{π_{l+1}(R)} − L_{π_l(R)}| + max_{R ∈ R(A)} L_{π_{l_*}(R)}.

Now, we bound these three terms separately.
Term 1. Let l* = 2 log|E|, so that d(R, π_{l*}(R)) ≤ |E|^{−2}. By the concentration inequality (31), applied with t replaced by x + 2 log|E|, and a union bound over the candidates in R(A),

P( max_{R ∈ R(A)} |L_R − L_{π_{l*}(R)}| > 2√(x + 2 log|E|)/|E| + (2x + 4 log|E|)/√A ) ≤ 2|R(A)| |E|^{−2} exp(−x) ≤ 2 exp(−x)

for x < log|E|. Therefore, we have

P( max_{R ∈ R(A)} |L_R − L_{π_{l*}(R)}| > C (x + log|E|)/√A ) ≤ 2 exp(−x),   for x < log|E|.

Term 2. Let l_* = log log(|E|/A). Recall that the avocado assumption (17) implies

N(A, ε) ≤ C_{H,S} (|E|/A) (1/ε)^{S+1}.    (32)

Applying the concentration inequality (30) with

t = log(|E|/A) + (S + 1) log log(|E|/A) + x + C    (33)

and the union bound, we have

P( max_{R ∈ R(A)} L_{π_{l_*}(R)} > √(2t) + 2t/√(2A) )
  ≤ N( A, 1/log(|E|/A) ) P( L_{π_{l_*}(R)} > √(2t) + 2t/√(2A) )
  ≤ C_{H,S} (|E|/A) ( log(|E|/A) )^{S+1} P( L_{π_{l_*}(R)} > √(2t) + 2t/√(2A) )
  ≤ exp(−x)

for x < log|E|, where we also applied condition (21). Therefore, we obtain

P( max_{R ∈ R(A)} L_{π_{l_*}(R)} > √( 2 log(|E|/A) + 2(S+1) log log(|E|/A) + 2x + C ) ) ≤ exp(−x)

for x < log|E|.

Term 3. For any given l, applying the concentration inequality (31), the covering number condition (32), and the union bound yields

P( max_{R ∈ R(A)} |L_{π_{l+1}(R)} − L_{π_l(R)}| > √( C (log(|E|/A) + l + x) e^{−l} ) + C (log(|E|/A) + l + x)/√A )
  ≤ C_{H,S} (|E|/A) e^{(l+1)(S+1)} P( |L_{π_{l+1}(R)} − L_{π_l(R)}| > √( C (log(|E|/A) + l + x) e^{−l} ) + C (log(|E|/A) + l + x)/√A )
  ≤ exp(−x − l)

for any x < log|E|.

With another standard application of the union bound, we have

P( Σ_{l = l_*}^{l* − 1} max_{R ∈ R(A)} |L_{π_{l+1}(R)} − L_{π_l(R)}| > √( C (log(|E|/A) + x) / log(|E|/A) ) + C (log|E| + x) log|E| / √A )
  ≤ Σ_{l = l_*}^{l* − 1} P( max_{R ∈ R(A)} |L_{π_{l+1}(R)} − L_{π_l(R)}| > √( C (log(|E|/A) + l + x) e^{−l} ) + C (log(|E|/A) + l + x)/√A )
  ≤ Σ_{l = l_*}^{l* − 1} exp(−x − l) ≤ 2 exp(−x).

Putting the three terms together yields

P( max_{R ∈ R(A)} L_R > √( 2 log(|E|/A) ) + C (x + 1) ) ≤ C log(e|E|/A) exp(−x),

using A ≫ log|E| and the inequalities √(a + b) ≤ √a + √b and √(a + b) ≤ √a + b/(2√a).

Now we apply this bound to A = |E| 2^{−k}, k ≥ 1, to obtain

P( max_{R} ( L_R − √( 2 log( |E| / |E(R)| ) ) ) > C (x + 1) ) ≤ C exp(−x).

This immediately suggests that q_α = O(1).

Now we turn to the case when a subgraph is significant, that is, we prove (24). Assume the significant region is R. Standard calculations give the mean and variance of L_R:

E(L_R) = (β_{R,1} − β_{R,2})^T Σ_R^{−1} (β_{R,1} − β_{R,2}) / √(2|E(R)|)

and

Var(L_R) = ( 2|E(R)| + 4 (β_{R,1} − β_{R,2})^T Σ_R^{−1} (β_{R,1} − β_{R,2}) ) / (2|E(R)|).

By Chebyshev's inequality, we have

P( |L_R − E(L_R)| / √(Var(L_R)) > x ) ≤ 1/x².    (34)

If (β_{R,1} − β_{R,2})^T Σ_R^{−1} (β_{R,1} − β_{R,2}) ≥ 2|E(R)|, then (34) suggests

P( L_R > √(|E(R)|) ) → 1,   as |E| → ∞,

by taking x in (34) as a sequence (e.g., log log(|E(R)|)) that increases slowly enough. This leads to (24). If (β_{R,1} − β_{R,2})^T Σ_R^{−1} (β_{R,1} − β_{R,2}) < 2|E(R)|, then Var(L_R) < 5, and (23) and (34) imply

P( L_R − √( 2 log( |E| / |E(R)| ) ) > q_α ) → 1,   as |E| → ∞.

Appendix B. Implementation Details.
The workflow below describes one run of our model, given a specified sparsity level for the oracle graph procedure.

1.
Oracle Graph.
As noted in the main paper, we use the graphical lasso (glasso) to generate an oracle graph, which allows us to define structured regions (subgraphs) for scan statistics on graphs. Each element of the input matrix C in (35) is generated by calculating the slope, across the predictors, of each position of the covariance matrix for each group, and then taking the difference between the groups. The following inverse covariance estimation problem (glasso) is then solved using existing MATLAB interfaces to fast C implementations:

Θ̂ = arg min_{Θ ⪰ 0} − log|Θ| + tr(CΘ) + λ ‖Θ‖_1    (35)

With sparsity parameter λ, this procedure generates a reasonably sparse oracle graph.
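One way to assemble the input matrix C is sketched below (function names and array shapes are our own assumptions, not the paper's MATLAB pipeline): fit an entrywise least-squares slope of each group's covariance trajectory over time, then difference the two groups.

```python
import numpy as np

def slope_matrix(covs, times):
    # Entrywise least-squares slope of a time-indexed stack of covariance
    # matrices; `covs` has shape (T, p, p) and `times` has length T.
    t = np.asarray(times, dtype=float)
    tc = t - t.mean()  # centered times make the slope a simple projection
    return np.tensordot(tc, np.asarray(covs, dtype=float), axes=(0, 0)) / (tc ** 2).sum()

def glasso_input(covs_a, times_a, covs_b, times_b):
    # Input C for (35): the difference of the two groups' entrywise slopes.
    return slope_matrix(covs_a, times_a) - slope_matrix(covs_b, times_b)
```

The resulting C can then be handed to any glasso solver with penalty λ to produce the sparse oracle graph.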
2. Candidate Subgraphs.
With the oracle graph in hand, we then construct the set of all ball subgraphs, as defined in Section 4 of our main paper. By limiting ourselves to only a few (D|V|) subgraphs, we can perform the scan statistics procedure more efficiently.
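A minimal enumeration of the candidate ball subgraphs (a sketch; representing the oracle graph as a dict adjacency list is our choice, not the paper's data structure):

```python
from collections import deque

def ball_subgraphs(adj, max_radius):
    # All distinct balls B(v, r): the set of nodes within graph distance r
    # of center v, for every v and every radius r <= max_radius.
    balls = set()
    for v in adj:
        dist = {v: 0}
        queue = deque([v])
        while queue:  # breadth-first search out to max_radius
            u = queue.popleft()
            if dist[u] < max_radius:
                for w in adj[u]:
                    if w not in dist:
                        dist[w] = dist[u] + 1
                        queue.append(w)
        for r in range(max_radius + 1):
            balls.add(frozenset(u for u, d in dist.items() if d <= r))
    return balls
```

Deduplicating with frozensets keeps the candidate count well below the exponential number of arbitrary subgraphs.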
3. Characterizing the Null Distribution.
In the case where we have few samples, we cannot directly apply the χ² result. In these cases, the null distribution is instead characterized using permutation testing over all candidate subgraphs. For each subgraph, the input data are permuted a number of times to generate a good representation of the distribution at that subgraph. All normalized (but not size-corrected) scan statistics are then calculated for all permutations across all subsets, and combined to create the null distribution.
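Schematically, the permutation step looks as follows (the statistic function is a placeholder for the paper's normalized scan statistic, and pooling the per-permutation maxima is one common way to combine statistics across subsets; both are our assumptions):

```python
import numpy as np

def permutation_null(data, labels, subgraphs, stat_fn, n_perm=1000, seed=0):
    # Shuffle the group labels and recompute the normalized statistic on
    # every candidate subgraph; keep the per-permutation maximum.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(labels)
        null[b] = max(stat_fn(data, perm, sg) for sg in subgraphs)
    return null

def threshold(null, alpha=0.05):
    # Empirical (1 - alpha) quantile of the permutation null.
    return np.quantile(null, 1.0 - alpha)
```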
4. Calculating the Test Statistic.
For a specific subset of the data, the scan statistic is calculated and corrected as described in Section 4 of the main paper, over the original grouping of the data. For each group, the longitudinal-covariance GLM (7) is computed using the procedures in §

5. Region Identification.
We first identify all subsets whose statistic falls above the specified α-level threshold. Then the subset-collection procedure outlined in the main paper, developed by (Jeng et al., 2010), is applied, and the non-overlapping critical regions are output.
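The subset-collection step can be sketched as a greedy pass (our simplification in the spirit of (Jeng et al., 2010), not their exact procedure; `scored_regions` pairs each candidate node set with its corrected statistic):

```python
def select_regions(scored_regions, q_alpha):
    # scored_regions: iterable of (frozenset_of_nodes, corrected_statistic).
    # Keep the best significant region, discard overlapping candidates, repeat.
    significant = sorted((r for r in scored_regions if r[1] > q_alpha),
                         key=lambda r: r[1], reverse=True)
    chosen = []
    for nodes, score in significant:
        if all(nodes.isdisjoint(c) for c, _ in chosen):
            chosen.append((nodes, score))
    return chosen
```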
Numerical Considerations

In practice, the empirical covariance matrices calculated on the sample data may not be positive definite. The matrix can be rank deficient when we do not have enough linearly independent samples; in addition, we may use a rank correlation matrix in its place, which also may not be PD. To resolve this issue, we project the empirical covariance matrix onto the symmetric positive definite manifold SPD(n). We first apply a standard procedure for transforming a symmetric matrix into a symmetric positive semidefinite (SPSD) one. As described in (Wu et al., 2005), standard eigenvalue thresholding, or clipping, λ_SPSD = max(0, λ), is sensible since it provides the optimal projection of any symmetric matrix onto the SPSD manifold. Let Σ = UΛU^T be the eigenvalue decomposition of the matrix Σ. The SPSD projection of Σ is then

proj_SPSD(Σ) = U diag( max(λ_1, 0), ..., max(λ_n, 0) ) U^T.

To project onto the SPD(n) manifold, we can then simply add a small epsilon to each diagonal element:

proj_SPD(Σ) = U diag( max(λ_1, 0), ..., max(λ_n, 0) ) U^T + εI    (36)

A remark on the term εI is useful here. We find in experiments that numerical problems can arise if the smallest eigenvalue of the projected matrix is too small. By iteratively adding a small ε until the smallest eigenvalue is above our threshold, we ensure that the matrix is positive definite for the exponential and logarithmic maps, which are necessary for moving back and forth between the manifold and the tangent space.
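A minimal sketch of the clipping-plus-εI projection (36) (function and parameter names are ours):

```python
import numpy as np

def project_to_spd(sigma, min_eig=1e-8):
    # Eigenvalue clipping (the optimal SPSD projection), then add eps*I,
    # growing eps until the smallest eigenvalue clears min_eig, as in (36).
    sym = (sigma + sigma.T) / 2.0          # symmetrize against round-off
    w, u = np.linalg.eigh(sym)
    spsd = (u * np.maximum(w, 0.0)) @ u.T  # U diag(max(lambda, 0)) U^T
    eps = min_eig
    out = spsd + eps * np.eye(sigma.shape[0])
    while np.linalg.eigvalsh(out).min() < min_eig:
        eps *= 10.0
        out = spsd + eps * np.eye(sigma.shape[0])
    return out
```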
A note on localization accuracy

In addition to checking whether we correctly answer the hypothesis test of group difference, it is important that when significance is found, it is found in the features that were originally used to generate the data. Using the same simulation setup as before, we take the union of all subsets returned as significant and check whether each of the truly changing features p_t is contained within this superset.

In this particular case, we find that our localization depends only on the graphical lasso procedure used to generate the oracle graph. As long as the specified sparsity is large enough to include at least p_t edges, we find that in every simulation where a significant difference is detected, the features that express the difference are a superset of the true features.

Appendix C. Preclinical AD Extended Details and Results.
Data and Variable Descriptions
In our neuroimaging experiments, a large number of our features describe specific, localized regions of the brain across multiple imaging modalities. Below we list and describe the regions for each modality, and give a brief background on each of the methods used to acquire the data. We also include the list of cognitive scores used in our analysis.
PET Imaging
Positron emission tomography has become an increasingly popular method of imaging the brain, specifically in areas where cognitive decline can be strongly correlated with the specific matter being imaged. Pittsburgh compound B (PiB) was used as the tracer for these images, and the 16 mirrored (left and right) regions listed below were selected as strongly correlated with the development and progression of Alzheimer's disease.
1. PiB Angular L/R
2. PiB Cingulum Ant L/R
3. PiB Cingulum Post L/R
4. PiB Frontal Med Orb L/R
5. PiB Precuneus L/R
6. PiB SupraMarginal L/R
7. PiB Temporal Mid L/R
8. PiB Temporal Sup L/R
The average of the voxel values in each ROI (region of interest) of the brain is used as an imaging feature. The 16 regions are highlighted in Figure 8.

Figure 8: 16 Positron Emission Tomography (PET) regions.
DTI Imaging
Diffusion tensor imaging is used to measure the restricted diffusion of water through and about regions of the brain. The 48 regions here are the aggregated measurements of the total rates of diffusion for each voxel in that region. The two measurements, Fractional Anisotropy (FA) and Mean Diffusivity (MD), collectively describe the diffusion in a specific region well. The following is the full list of regions used in our analysis. Regions that spanned both the left and right sides of the brain are indicated as such, and were treated as separate and independent in our analyses.
1. Middle cerebellar peduncle
2. Pontine crossing tract (a part of MCP)
3. Genu of corpus callosum
4. Body of corpus callosum
5. Splenium of corpus callosum
6. Fornix (column and body of fornix)
7. Corticospinal tract R/L
8. Medial lemniscus R/L
9. Inferior cerebellar peduncle R/L
10. Superior cerebellar peduncle R/L
11. Cerebral peduncle R/L
12. Anterior limb of internal capsule R/L
13. Posterior limb of internal capsule R/L
14. Retrolenticular part of internal capsule R/L
15. Anterior corona radiata R/L
16. Superior corona radiata R/L
17. Posterior corona radiata R/L
18. Posterior thalamic radiation (include optic radiation) R/L
19. Sagittal stratum (include inferior longitudinal fasciculus and inferior fronto-occipital fasciculus) R/L
20. External capsule R/L
21. Cingulum (cingulate gyrus) R/L
22. Cingulum (hippocampus) R/L
23. Fornix (cres) / Stria terminalis (cannot be resolved with current resolution) R/L
24. Superior longitudinal fasciculus R/L
25. Superior fronto-occipital fasciculus (could be a part of anterior internal capsule) R/L
26. Uncinate fasciculus R/L
27. Tapetum R/L
Cognitive Evaluations
The battery of cognitive test scores in our analysis included a breadth of evaluations chosen specifically for their coverage of various measures of cognition. Among all tests given to the cohort, the following 17 were selected by expert clinicians and researchers in the field for their coverage and their potential value in understanding trends across groups.
1. WAIS-III Digit Span Forward Raw Score
2. WAIS-III Digit Span Backward Raw Score
3. WAIS-III Letter-Number Sequencing Raw Score
4. COWAT CFL Score
5. Boston Naming Test Total Score
6. RAVLT Learning Trial A1 Raw Score
7. RAVLT Learning Trial A2 Raw Score
8. RAVLT Learning Trial A3 Raw Score
9. RAVLT Learning Trial A4 Raw Score
10. RAVLT Learning Trial A5 Raw Score
11. RAVLT Learning Trial A6 Raw Score
12. RAVLT Delayed Recall Raw Score
13. Stroop Word/Color-Word Scaled Score
14. Trail-Making Test Part A
15. Trail-Making Test Part B
16. Clock Drawing Test Score
17. Center for Epidemiologic Studies Depression Scale Score
WAIS-III.
This is the most widely used IQ test. The Digit Span examination is specifically meant to evaluate an individual's working memory: participants are required to recall a series of numbers in order, both forwards and backwards. Letter-Number Sequencing reflects a similar idea, but with a mix of numbers and letters recalled in increasing and alphabetical order, and is meant to be an indicator of more complex mental control (Wechsler, 2014).
Rey Auditory Visual Learning Test.
This test is specifically meant to evaluate all aspects of memory. Each trial evaluates a different type of memory, ranging from short-term and working memory to procedural and episodic memory (Schmidt et al., 1996).
Trail-Making Test.
This is a very popular test for providing information about executive function in the brain. The test consists of drawing lines among a randomly generated set of points in a square, where each point is labeled with a number. In Part A, participants must 'connect the dots' in increasing numerical order, and in Part B in increasing numerical and alphabetical order. The score on the test is primarily dictated by the time in seconds it takes to complete the task for 25 of these 'dots.' More background information and normative analyses can be found in (Tombaugh, 2004).

Other tests similarly measure various aspects of cognitive function. While the Depression Scale Score did not crop up in any of our analyses here, it has been shown that depression is strongly associated with AD-related decline (Wragg and Jeste, 1989).

Amyloid Load (PiB Positivity)

Table 6: Group difference across Amyloid Load (PiB Positivity)
Set 1: PiB Angular L/R; PiB Cingulum Ant L/R; PiB Cingulum Post L/R; PiB Frontal Med Orb L/R; PiB Precuneus L/R; PiB Temporal Sup L/R; PiB Temporal Mid L/R; PiB SupraMarginal L
Set 2: FA Cerebral peduncle R; FA Cerebral peduncle L; MD Corticospinal tract R; MD Corticospinal tract L; Trail-Making Test Part A Score; MD Cerebral peduncle R; PET Cingulum Post R

Gender

Table 7: Group difference in gender
Set 1: Rey Audio and Verbal Learning Test; FA Cingulum L; FA Medial lemniscus L; FA Cingulum (hippocampus) L
Set 2: FA Posterior thalamic radiation (include optic radiation) L
Set 3: FA Corticospinal tract R; FA Superior fronto-occipital fasciculus R
Detailed Imaging with Cognitive Tests Results
In the following tables we provide additional details of the statistical tests we performed on the preclinical AD cohort. Each set contains the features found to display a significant group difference (at the p ≤ 0.05 level) along the covariance trajectory, divided by the group variable indicated.

While some of these associations are well known, a few have been indicated by AD researchers and clinicians as novel and of interest for further analysis.
Genotype: APOE4

Table 8: Group difference across Genotype APOE4 expression
Set 1: Digit Span Backward Raw Score; Stroop Color-Word; PiB Cingulum Post L; PiB Cingulum Post R; PiB Frontal Med Orb L; PiB Frontal Med Orb R; PiB Precuneus L; PiB Precuneus R; PiB SupraMarginal; PiB Temporal Mid R

Consensus Conference

Table 9: Group difference across Expert MCI Diagnosis
Set 2: Digit Span Backward Raw Score; Stroop Color-Word; PiB Cingulum Post L; PiB Cingulum Post R; PiB Frontal Med Orb L; PiB Frontal Med Orb R; PiB Precuneus L; PiB Precuneus R; PiB SupraMarginal; PiB Temporal Mid R

Detailed results on larger cohort with only cognitive scores

We also applied our method to a larger cohort consisting of approximately 1500 subjects with varying temporal measurements on the battery of cognitive tests. Each individual had approximately 3 visits' worth of data, so our total number of measurements was approximately n = 4000. In addition to the groupings used above, we were able to use an algorithmic cognitive impairment (ACI) measure to further evaluate the model against a factor known to be group-separating. Below are the tabulated feature sets identified by our model for each of the group separations described in the main paper. In this case, to increase the interpretability of the results, we limited our search to groups of 3-6 features.

When grouped by genotype, the most indicative subset is shown in Table 11. These tests are most closely associated with memory, and we see that no tests of executive function or spatial ability (Trail-Making or Clock Drawing) were included.

In addition to the algorithmic measure of impairment, a conference of expert clinicians and researchers gave each individual a clinical impairment diagnosis each time they underwent the cognitive battery. Using this as a group separator, we found a large number of overlapping subsets that displayed a significant group difference at the p = 0.05 level; these are shown in Table 12. Trail-Making Test Parts A and B appeared in all identified subsets.
Algorithmic Cognitive Impairment

Table 10: Group Difference Localization Across Algorithmic Impairment
Set 1: Boston Naming Test Total Score; RAVLT Learning Trial A1 Raw Score; RAVLT Learning Trial A6 Raw Score

Genotype: ApoE4

Table 11: Group Difference Localization Across ApoE4 Genotype
Set 1: WAIS-III Digit Span Backward Raw Score; RAVLT Learning Trial A3 Raw Score; RAVLT Learning Trial A4 Raw Score; RAVLT Learning Trial A5 Raw Score
Expert Consensus Measure

Table 12: Group Difference Localization Across Expert Clinical Diagnosis
WAIS-III Letter-Number Sequencing Raw Score; Boston Naming Test Total Score; RAVLT Learning Trial A2 Raw Score; RAVLT Learning Trial A3 Raw Score; RAVLT Learning Trial A4 Raw Score; RAVLT Learning Trial A5 Raw Score; RAVLT Learning Trial A6 Raw Score; RAVLT Delayed Recall Raw Score; Trail-Making Test Part A; Trail-Making Test Part B; Clock Drawing Test Score; Center for Epidemiologic Studies Depression Scale Score

Figure 10: Histograms of the RAVLT Delayed Recall Raw Scores across all time points for the ~1500-subject cohort, split by ApoE status, gender, family history (FH), and clinical impairment status. [Plot data omitted.]

Appendix D. Differential Geometry Basics and Notes.

We briefly introduce notions used in the main paper. For more details, we refer the reader to (Do Carmo, 1992; Lee, 2003; Spivak, 1981).
Differentiable manifold. A differentiable (smooth) manifold of dimension n is a set M together with a maximal family of injective mappings ϕ_i : U_i ⊂ R^n → M of open sets U_i of R^n into M such that:

1. ∪_i ϕ_i(U_i) = M;
2. for any pair i, j with ϕ_i(U_i) ∩ ϕ_j(U_j) = W ≠ ∅, the sets ϕ_i^{-1}(W) and ϕ_j^{-1}(W) are open sets in R^n and the mappings ϕ_j^{-1} ∘ ϕ_i are differentiable, where ∘ denotes function composition;
3. the family {(U_i, ϕ_i)} is maximal relative to conditions (1) and (2).

Roughly speaking, a differentiable (smooth) manifold M is a topological space that is locally similar to Euclidean space and has a globally defined differential structure.

Tangent space (T_p M). The tangent space at p ∈ M is the vector space consisting of the tangent vectors of all possible curves passing through p.

Tangent bundle (TM). The tangent bundle of M is the disjoint union of the tangent spaces at all points of M, TM = ⨿_{p ∈ M} T_p M. The tangent bundle is equipped with a natural projection map π : TM → M.

Riemannian manifold. A Riemannian manifold is a manifold equipped with a smoothly varying metric (inner product), which is called the Riemannian metric. Various geometric notions, e.g., the angle between two curves or the length of a curve, can then be extended to the manifold.
Geodesic curves.
A geodesic curve on a Riemannian manifold is a locally shortest (distance-minimizing) curve. Geodesics are analogous to straight lines in Euclidean space and are the main object used to generalize linear models to Riemannian manifolds.
Geodesic distance.
The geodesic distance between two points on M is the length of the shortest geodesic curve connecting the two points. More generally, the distance between two points on a Riemannian manifold is defined as the infimum of the lengths of all differentiable curves connecting the two points. Let γ : [a, b] → M be a continuously differentiable curve between p and q in M, and let g be the metric tensor on M. Then, formally, the distance between p and q is defined as

d(p, q) := inf_γ ∫_a^b √( g_{γ(t)}( γ̇(t), γ̇(t) ) ) dt    (37)

where γ(a) = p and γ(b) = q.

Exponential map. An exponential map is a map from a tangent space T_p M to M, which is usually only locally defined, owing to the existence and uniqueness of the ordinary differential equation defining it. The geodesic curve from y_i to y_j can be parameterized by a tangent vector in the tangent space at y_i via the exponential map Exp(y_i, ·) : T_{y_i} M → M.

Logarithm map.
The inverse of the exponential map is the logarithm map, Log(y_i, ·) : M → T_{y_i} M. For completeness, Table 13 shows corresponding operations in Euclidean space and on Riemannian manifolds. In the main paper, for readability when operations are multiply nested, the exponential map and its inverse logarithm map are denoted by Exp(p, v) and Log(p, x) respectively, where p, x ∈ M and v ∈ T_p M. They are usually denoted exp_p(v) and log_p(x) in most differential geometry books.

Separate from the above notation, the matrix exponential, i.e., exp(X) := Σ_k (1/k!) X^k, where 0! = 1 and X^0 = I, and the matrix logarithm are denoted by exp(·) and log(·).

Intrinsic mean.
Let d(·, ·) denote the distance between two points. The intrinsic (or Karcher) mean is the minimizer of

ȳ = arg min_{y ∈ M} Σ_{i=1}^{N} d²(y, y_i),    (38)

which may be an arithmetic, geometric, or harmonic mean depending on d(·, ·). A Karcher mean is a local minimum of (38); a global minimum is referred to as a Fréchet mean. On manifolds, the Karcher mean satisfies Σ_{i=1}^{N} Log(ȳ, y_i) = 0.

Algorithm 1: Karcher mean
  Input: y_1, ..., y_N ∈ M, step size α
  Output: ȳ ∈ M
  ȳ_0 = y_1
  while ‖ Σ_{i=1}^{N} Log(ȳ_k, y_i) ‖ > ε do
    Δȳ = (α/N) Σ_{i=1}^{N} Log(ȳ_k, y_i)
    ȳ_{k+1} = Exp(ȳ_k, Δȳ)
  end while

Figure 11: Karcher mean on manifolds.

The identity Σ_{i=1}^{N} Log(ȳ, y_i) = 0 is the first-order necessary condition of (38), i.e., ȳ is a local minimum with a zero-norm gradient (Karcher, 1977). In general, on manifolds, the existence and uniqueness of the Karcher mean is not guaranteed unless we assume, for uniqueness, that the data lie in a small neighborhood.
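For SPD matrices, Algorithm 1 can be made concrete as follows (a sketch with our own function names; the Exp and Log maps are the closed forms for SPD(n) given further below, implemented via eigendecomposition):

```python
import numpy as np

def _eig_apply(p, fn):
    # Apply a scalar function to the eigenvalues of a symmetric matrix:
    # returns U diag(fn(w)) U^T for p = U diag(w) U^T.
    w, u = np.linalg.eigh(p)
    return (u * fn(w)) @ u.T

def spd_exp(p, v):
    # Exp(p, v) = p^{1/2} exp(p^{-1/2} v p^{-1/2}) p^{1/2}
    ph = _eig_apply(p, np.sqrt)
    phi = _eig_apply(p, lambda w: 1.0 / np.sqrt(w))
    return ph @ _eig_apply(phi @ v @ phi, np.exp) @ ph

def spd_log(p, q):
    # Log(p, q) = p^{1/2} log(p^{-1/2} q p^{-1/2}) p^{1/2}
    ph = _eig_apply(p, np.sqrt)
    phi = _eig_apply(p, lambda w: 1.0 / np.sqrt(w))
    return ph @ _eig_apply(phi @ q @ phi, np.log) @ ph

def karcher_mean(mats, alpha=1.0, tol=1e-10, max_iter=200):
    # Algorithm 1: gradient descent on the sum of squared geodesic distances.
    y = mats[0]
    for _ in range(max_iter):
        delta = sum(spd_log(y, m) for m in mats) / len(mats)
        if np.linalg.norm(delta) < tol:
            break
        y = spd_exp(y, alpha * delta)
    return y
```

For two commuting SPD matrices, this iteration converges to their matrix geometric mean, as expected under the affine-invariant metric.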
Parallel transport. Let M be a differentiable manifold with an affine connection ∇, and let I be an open interval. Let c : I → M be a differentiable curve in M, and let V_0 be a tangent vector in T_{c(t_0)} M, where t_0 ∈ I. Then there exists a unique parallel vector field V along c such that V(t_0) = V_0. Here, V(t) is called the parallel transport of V_0 along c.

Table 13: Basic operations in Euclidean space and on Riemannian manifolds.
Operation   | Euclidean                        | Riemannian
Subtraction | x_i x_j = x_j − x_i              | x_i x_j = Log(x_i, x_j)
Addition    | x_i + x_j x_k                    | Exp(x_i, x_j x_k)
Distance    | ‖x_i x_j‖                        | ‖Log(x_i, x_j)‖
Mean        | Σ_{i=1}^n (x_i − x̄) = 0          | Σ_{i=1}^n Log(x̄, x_i) = 0
Covariance  | E[(x_i − x̄)(x_i − x̄)^T]          | E[Log(x̄, x) Log(x̄, x)^T]

Geometry of SPD manifolds
Covariance matrices are symmetric positive definite matrices. Let SPD(n) be the manifold of symmetric positive definite matrices of size n × n. It forms the quotient space GL(n)/O(n), where GL(n) denotes the general linear group (the group of n × n nonsingular matrices) and O(n) the orthogonal group (the group of n × n orthogonal matrices). The inner product of two tangent vectors u, v ∈ T_p M is given by

⟨u, v⟩_p = tr( p^{-1/2} u p^{-1} v p^{-1/2} )    (39)

This plays the role of the Fisher-Rao metric in the statistical model of multivariate distributions. Here, T_p M, the tangent space at p (which is a vector space), is the space of symmetric matrices of dimension (n+1)n/2. The geodesic distance is

d(p, q) = √( tr( log²( p^{-1/2} q p^{-1/2} ) ) ).

The exponential map and logarithm map are given as

Exp(p, v) = p^{1/2} exp( p^{-1/2} v p^{-1/2} ) p^{1/2},   Log(p, q) = p^{1/2} log( p^{-1/2} q p^{-1/2} ) p^{1/2}.    (40)

Let p, q be in SPD(n) and w ∈ T_p M a tangent vector. The tangent vector in T_q M that is the parallel transport of w along the shortest geodesic from p to q is given by

Γ_{p→q}(w) = p^{1/2} r p^{-1/2} w p^{-1/2} r p^{1/2}, where r = exp( p^{-1/2} (v/2) p^{-1/2} ) and v = Log(p, q)    (41)
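As a numerical sanity check of (39)-(41) (helper names are ours), the closed-form transport should preserve the metric, i.e., ⟨Γ_{p→q}(w), Γ_{p→q}(w)⟩_q = ⟨w, w⟩_p:

```python
import numpy as np

def _eig_fn(p, fn):
    # U diag(fn(w)) U^T for a symmetric matrix p = U diag(w) U^T.
    w, u = np.linalg.eigh(p)
    return (u * fn(w)) @ u.T

def inner(p, u_vec, v_vec):
    # <u, v>_p = tr(p^{-1/2} u p^{-1} v p^{-1/2}) = tr(p^{-1} u p^{-1} v), eq. (39).
    pinv = np.linalg.inv(p)
    return np.trace(pinv @ u_vec @ pinv @ v_vec)

def transport(p, q, w_vec):
    # Gamma_{p->q}(w) = p^{1/2} r p^{-1/2} w p^{-1/2} r p^{1/2},
    # with r = exp(p^{-1/2} (v/2) p^{-1/2}) and v = Log(p, q), eq. (41).
    ph = _eig_fn(p, np.sqrt)
    phi = _eig_fn(p, lambda w: 1.0 / np.sqrt(w))
    v = ph @ _eig_fn(phi @ q @ phi, np.log) @ ph  # Log(p, q), eq. (40)
    r = _eig_fn(phi @ (v / 2.0) @ phi, np.exp)
    return ph @ r @ phi @ w_vec @ phi @ r @ ph
```

Transporting a vector from p back to p returns it unchanged, and the inner product is invariant under transport, matching the isometry property of the affine-invariant connection.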
References

Marilyn S Albert, Mark B Moss, Rudolph Tanzi, and Kenneth Jones. Preclinical prediction of AD using neuropsychological tests. Journal of the International Neuropsychological Society, 7(05):631-639, 2001.

Marilyn S Albert, Steven T DeKosky, Dennis Dickson, Bruno Dubois, Howard H Feldman, Nick C Fox, Anthony Gamst, David M Holtzman, William J Jagust, Ronald C Petersen, et al. The diagnosis of mild cognitive impairment due to Alzheimer's disease: Recommendations from the National Institute on Aging-Alzheimer's Association workgroups on diagnostic guidelines for Alzheimer's disease. Alzheimer's & Dementia, 7(3):270-279, 2011.

E. Arias-Castro, E. J. Candès, et al. Detection of an anomalous cluster in a network. The Annals of Statistics, pages 278-304, 2011.

Sanjeev Arora and Boaz Barak. Computational Complexity: A Modern Approach. Cambridge University Press, 2009.

M. Banerjee, R. Chakraborty, et al. Nonlinear regression on Riemannian manifolds and its applications to neuro-image analysis. In MICCAI, pages 719-727, 2015.

Eugene Belilovsky, Gaël Varoquaux, and Matthew B Blaschko. Hypothesis testing for differences in Gaussian graphical models: Applications to brain connectivity. arXiv preprint arXiv:1512.08643, 2015.

T. Cai, W. Liu, et al. A constrained ℓ1 minimization approach to sparse precision matrix estimation. JASA, 106(494):594-607, 2011.

EH Corder, AM Saunders, WJ Strittmatter, DE Schmechel, PC Gaskell, GW Small, AD Roses, JL Haines, and MA Pericak-Vance. Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset families. Science, 261(5123):921-923, 1993.

E. Cornea, H. Zhu, et al. Regression models on Riemannian symmetric spaces. JRSS-B, 2016.

P. Danaher, P. Wang, et al. The joint graphical LASSO for inverse covariance estimation across multiple classes. JRSS-B, 76(2):373-397, 2014.

Burcu F Darst, Rebecca L Koscik, Annie M Racine, Jennifer M Oh, Rachel A Krause, Cynthia M Carlsson, Henrik Zetterberg, Kaj Blennow, Bradley T Christian, Barbara B Bendlin, et al. Pathway-specific polygenic risk scores as predictors of amyloid-β deposition and cognitive function in a sample at increased risk for Alzheimer's disease. Journal of Alzheimer's Disease, 55(2):473-484, 2017.

P Laurie Davies and Arne Kovac. Local extremes, runs, strings and multiresolution. Annals of Statistics, pages 1-48, 2001.

M. P. Do Carmo. Riemannian Geometry. Springer, 1992.

Gwenaëlle Douaud, Saâd Jbabdi, Timothy EJ Behrens, Ricarda A Menke, Achim Gass, Andreas U Monsch, Anil Rao, Brandon Whitcher, Gordon Kindlmann, Paul M Matthews, et al. DTI measures in crossing-fibre areas: increased diffusion anisotropy reveals early white matter alteration in MCI and mild Alzheimer's disease. NeuroImage, 55(3):880-890, 2011.

J. Du, A. Goh, S. Kushnarev, and A. Qiu. Geodesic regression on orientation distribution functions with its application to an aging study. NeuroImage, 87:416-426, 2014.

J. Fan, X. Han, et al. Control of the FDR under arbitrary covariance dependence. JASA, 2012.

P. T. Fletcher. Geodesic regression and the theory of least squares on Riemannian manifolds. IJCV, 105(2):171-185, 2013.

T. P. Fletcher and S. Joshi. Riemannian geometry for the statistical analysis of diffusion tensor data. Signal Processing, 87(2):250-262, 2007.

Laura Fratiglioni, M Grut, Y Forsell, M Viitanen, M Grafström, K Holmen, K Ericsson, L Bäckman, A Ahlbom, and B Winblad. Prevalence of Alzheimer's disease and other dementias in an elderly urban population: relationship with age, sex, and education. Neurology, 41(12):1886-1886, 1991.

J. Friedman, T. Hastie, et al. Sparse inverse covariance estimation with the graphical LASSO. Biostatistics, 9(3):432-441, 2008.

S. A. Geer. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000.

Wolfgang Hardle and Enno Mammen. Comparing nonparametric versus parametric regression fits. The Annals of Statistics, pages 1926-1947, 1993.

John Hardy and Dennis J Selkoe. The amyloid hypothesis of Alzheimer's disease: progress and problems on the road to therapeutics. Science, 297(5580):353-356, 2002.

John A Hardy and Gerald A Higgins. Alzheimer's disease: the amyloid cascade hypothesis. Science, 256(5054):184, 1992.

Yi Hong, Nikhil Singh, Roland Kwitt, and Marc Niethammer. Group testing for longitudinal data. In International Conference on Information Processing in Medical Imaging, pages 139-151. Springer, 2015.

Clifford R Jack Jr, Heather J Wiste, Prashanthi Vemuri, Stephen D Weigand, Matthew L Senjem, Guang Zeng, Matt A Bernstein, Jeffrey L Gunter, Vernon S Pankratz, Paul S Aisen, et al. Brain beta-amyloid measures and magnetic resonance imaging atrophy both predict time-to-progression from mild cognitive impairment to Alzheimer's disease. Brain, 133(11):3336-3348, 2010.

S. Jayasumana et al. Kernel methods on the Riemannian manifold of SPD matrices. In CVPR, 2013.

Jessie Jeng, T Tony Cai, and Hongzhe Li. Optimal sparse segment identification with application in copy number variation analysis.
Journal of the American StatisticalAssociation , 105(491):1156–1166, 2010.M.I. Jordan.
Learning in graphical models , volume 89. Springer Science & Business Media,1998.Hermann Karcher. Riemannian center of mass and mollifier smoothing.
Communications onpure and applied mathematics , 30(5):509–541, 1977.Hyunwoo J Kim, Nagesh Adluru, et al. Multivariate general linear models on Riemannianmanifolds with applications to statistical analysis of diffusion weighted images. In
CVPR ,pages 2705–2712, 2014.D. Koller and N. Friedman.
Probabilistic graphical models: principles and techniques . MITpress, 2009.Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional bymodel selection.
Annals of Statistics , pages 1302–1338, 2000.S.L. Lauritzen.
Graphical models . Clarendon Press, 1996.John M Lee. Smooth manifolds. In
Introduction to Smooth Manifolds , pages 1–29. Springer,2003.John M Lee.
Introduction to smooth manifolds . Springer, 2012.Muriel Deutsch Lezak.
Neuropsychological assessment . Oxford University Press, USA, 2004.Yi-Cheng Lin, Yao-Chia Shih, Wen-Yih I Tseng, Yu-Hsiu Chu, Meng-Tien Wu, Ta-FuChen, Pei-Fang Tang, and Ming-Jang Chiu. Cingulum correlates of cognitive functions inpatients with mild cognitive impairment and early alzheimers disease: a diffusion spectrumimaging study.
Brain topography , 27(3):393–402, 2014.H. Liu, J. Lafferty, et al. The nonparanormal: Semiparametric estimation of high dimensionalundirected graphs.
JMLR , 10:2295–2328, 2009.Han Liu, Fang Han, et al. High-dimensional semiparametric gaussian copula graphicalmodels.
The Annals of Statistics , 40(4):2293–2326, 2012.J.J. McArdle and R.Q. Bell. An introduction to latent growth models for developmentaldata analysis. 2000.S. Mori, K. Oishi, et al. Stereotaxic white matter atlas based on diffusion tensor imaging inan icbm template.
Neuroimage , 40(2):570–582, 2008.Elizabeth C Mormino, Rebecca A Betensky, Trey Hedden, Aaron P Schultz, Andrew Ward,Willem Huijbers, Dorene M Rentz, Keith A Johnson, Reisa A Sperling, Alzheimer’sDisease Neuroimaging Initiative, et al. Amyloid and apoe ε Neurology , 82(20):1760–1767, 2014.43rasanna Muralidharan and P Thomas Fletcher. Sasaki metrics for analysis of longitudinaldata on manifolds. In
Computer Vision and Pattern Recognition (CVPR), 2012 IEEEConference on , pages 1027–1034. IEEE, 2012.Thomas P Naidich, Henri M Duvernoy, Bradley N Delman, A Gregory Sorensen, Spyros SKollias, and E Mark Haacke.
Duvernoy’s atlas of the human brain stem and cerebellum:high-field MRI, surface anatomy, internal structure, vascularization and 3 D sectionalanatomy . Springer Science & Business Media, 2009.H. Qiu, F. Han, et al. Joint estimation of multiple graphical models from high dimensionaltime series.
JRSS-B , 2015.A. M Racine, N. Adluru, et al. Associations between white matter microstructure andamyloid burden in preclinical AD: a multimodal imaging investigation.
NeuroImage:Clinical , 4:604–614, 2014.Lars M Rimol, Ingrid Agartz, Srdjan Djurovic, Andrew A Brown, J Cooper Roddey, Anna KK¨ahler, Morten Mattingsdal, Lavinia Athanasiu, Alexander H Joyner, Nicholas J Schork,et al. Sex-dependent association of common variants of microcephaly genes with brainstructure.
Proceedings of the National Academy of Sciences , 107(1):384–388, 2010.Charles S Roehrig. Conditions for identification in nonparametric and parametric models.
Econometrica: Journal of the Econometric Society , pages 433–447, 1988.Mark A Sager, Bruce Hermann, and Asenath La Rue. Middle-aged children of persons withalzheimers disease: Apoe genotypes and cognitive function in the wisconsin registry foralzheimers prevention.
Journal of geriatric psychiatry and neurology , 18(4):245–249, 2005.Michael Schmidt et al.
Rey auditory verbal learning test: a handbook . Western PsychologicalServices Los Angeles, 1996.George AF Seber and Alan J Lee. Linear regression analysis. hoboken.
NJ: Wiley. doi , 10:9780471722199, 2003.S. Sommer, F. Lauze, et al. Optimization over geodesics for exact principal geodesic analysis.
Adv. in Comp. Math. , 40(2):283–313, 2014.M. Spivak.
Comprehensive introduction to differential geometry . Publish or Perish, Inc.,1981.Mark Stehr. The effect of cigarette taxes on smoking among men and women.
HealthEconomics , 16(12):1333–1343, 2007. ISSN 1099-1050. doi: 10.1002/hec.1223. URL http://dx.doi.org/10.1002/hec.1223 .Nicolas Stdler and Sach Mukherjee. Two-sample testing in high-dimensional models, 2012.Michel Talagrand.
The generic chaining: upper and lower bounds of stochastic processes .Springer Science & Business Media, 2006.Rudolph E Tanzi and Lars Bertram. Twenty years of the alzheimers disease amyloidhypothesis: a genetic perspective.
Cell , 120(4):545–555, 2005.44om N Tombaugh. Trail making test a and b: normative data stratified by age and education.
Archives of clinical neuropsychology , 19(2):203–214, 2004.Oncel Tuzel, Fatih Porikli, and Peter Meer. Human detection via classification on Rieman-nian manifolds. In
Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEEConference on , pages 1–8. IEEE, 2007.J.B. Ullman and P.M. Bentler.
Structural equation modeling . Wiley Online Library, 2003.G. Walther et al. Optimal and fast detection of spatial clusters with scan statistics.
TheAnnals of Statistics , 38(2):1010–1033, 2010.S. Wang, J. Fan, et al. Structured correlation detection with application to colocalizationanalysis in dual-channel fluorescence microscopic imaging. arXiv preprint arXiv:1604.02158 ,2016.David Wechsler. Wechsler adult intelligence scale–fourth edition (wais–iv), 2014.Robin E Wragg and Dilip V Jeste. Overview of depression and psychosis in alzheimersdisease.
Am J Psychiatry , 146(5):577–587, 1989.G. Wu, E.Y. Chang, et al. An analysis of transformation on non-positive semidefinitesimilarity matrix for kernel machines. In
ICML , volume 8, 2005.Y. Xie, B.C. Vemuri, et al. Statistical analysis of tensor fields. In
MICCAI , pages 682–689.Springer, 2010.L. Xue and H.and others Zou. Regularized rank-based estimation of high-dimensionalnonparanormal graphical models.
The Annals of Statistics , 40(5):2541–2571, 2012.S. Yang, Z. Lu, et al. Fused multiple graphical lasso.
SIAM J. Opt. , 25(2):916–943, 2015.M. Yuan. High dimensional inverse covariance matrix estimation via LP.