Finding Differentially Covarying Needles in a Temporally Evolving Haystack: A Scan Statistics Perspective
Ronak Mehta, Hyunwoo J. Kim, Shulei Wang, Sterling C. Johnson, Ming Yuan, Vikas Singh
Ronak Mehta [email protected]
Hyunwoo J. Kim [email protected]
Shulei Wang [email protected]
Sterling C. Johnson [email protected]
Ming Yuan [email protected]
Vikas Singh [email protected]
Computer Sciences Department, Department of Statistics, Department of Medicine, Department of Biostatistics
University of Wisconsin-Madison, Madison, WI 53706, USA
Project webpage: pages.cs.wisc.edu/~ronakrm/research/covtraj/
Video summary: vimeo.com/205606140
Abstract
Recent results in coupled or temporal graphical models offer schemes for estimating the relationship structure between features when the data come from related (but distinct) longitudinal sources. A novel application of these ideas is for analyzing group-level differences, i.e., in identifying if trends of estimated objects (e.g., covariance or precision matrices) are different across disparate conditions (e.g., gender or disease). Often, poor effect sizes make detecting the differential signal over the full set of features difficult: for example, dependencies between only a subset of features may manifest differently across groups. In this work, we first give a parametric model for estimating trends in the space of SPD matrices as a function of one or more covariates. We then generalize scan statistics to graph structures, to search over distinct subsets of features (graph partitions) whose temporal dependency structure may show statistically significant group-wise differences. We theoretically analyze the Family Wise Error Rate (FWER) and bounds on Type 1 and Type 2 error. On a cohort of individuals with risk factors for Alzheimer's disease (but otherwise cognitively healthy), we find scientifically interesting group differences where the default analysis, i.e., models estimated on the full graph, do not survive reasonable significance thresholds.
Keywords:
Manifold Statistics, Scan Statistics, Longitudinal Analysis

1. Introduction

Multivariate data analysis exploiting the conditional independence structure between features or covariates using undirected graphical models is now standard within any data analysis toolbox. When the data are multivariate Gaussian, the zeros in the inverse covariance (precision) matrix give conditional independences among the variables (Lauritzen, 1996). Further, if the precision matrix is sparse, we can derive dependencies between features when the data are high-dimensional and/or the number of measurements is small. The estimation of a graphical model has been extensively studied and a rich literature is available describing its statistical and algorithmic properties (Koller and Friedman, 2009; Jordan, 1998). For instance, the so-called graphical lasso formulation uses an ℓ1-norm penalty on the precision matrix and is widely used, and consistency properties in the large p regime (Cai et al., 2011; Friedman et al., 2008; Yuan, 2010) are now well understood. These formulations have also been extended to various transformations of Gaussian distributions (e.g., non-paranormal) using rank statistics (Liu et al., 2009; Xue and Zou, 2012; Liu et al., 2012).

Coupled and Temporal Graphical Models.
Often, data come from two (or more) disparate sources or multiple timepoints. Within the last few years, a few proposals have described strategies for linking the sparsity patterns of multiple graphical models, e.g., using a fused lasso penalty (Danaher et al., 2014; Yang et al., 2015). Observe that if the data sources correspond to longitudinal acquisitions, we should expect the 'structure' to gradually evolve. Several authors have offered generalizations to address this problem: (Zhou et al., 2010) removes the assumption that each graph is independent and structurally 'close'. Instead, (Zhou et al., 2010) can be thought of as a growth model (McArdle and Bell, 2000) defined on these structures: they show how non-identically distributed graphs can be learned over time. Recently, the nonparametric procedure in (Qiu et al., 2015) extends these ideas to handle multiple sources, each with multiple samples.

The ideas in the literature so far to "couple" multiple graphical model estimation modules are mostly nonparametric. While such a formulation offers benefits, in many estimation problems, parametric models may be more convenient for downstream statistical analysis, particularly for hypothesis testing (Hardle and Mammen, 1993; Geer, 2000; Roehrig, 1988). Given that the topic of coupled graphical models, by itself, is fairly recent, algorithms for parametric estimation of temporal or coupled Gaussian graphical models have not yet been heavily studied. This will involve parameterizing trends in the highly structured nature of the 'response' variable (SPD matrices).
We find that parametric formulations for manifold-valued data have been proposed recently (Kim et al., 2014; Cornea et al., 2016). Because SPD matrices form a Riemannian manifold, algorithms that estimate a parametric model respecting the underlying Riemannian metric are more suitable in many applications, as opposed to assuming a Euclidean metric on positively or negatively curved spaces (Xie et al., 2010; Fletcher and Joshi, 2007; Jayasumana et al., 2013). We will make a few simple modifications (for efficiency purposes) to such algorithms and make use of the estimated parameters for follow-up analysis.
Finding Group-wise Differences.
Assuming that we have a black-box procedure to estimate a parametric model on the SPD manifold available, in many tasks, such an estimation is merely a segue to other analyses designed to answer scientifically meaningful questions. For example, we are often interested in asking whether the temporally coupled model estimated using the procedure above differs in meaningful ways across groups induced by a stratification or dichotomous variable (e.g., gender or disease). For instance, is the 'slope' in structured response space statistically different across education level or body mass index? While the body of work for graphical model estimation is mature, the literature describing hypothesis tests in this regime (Städler and Mukherjee, 2012; Belilovsky et al., 2015) is sparse at best. Given that such questions are simpler to answer with alternative schemes (with assumptions on the distributional properties of the data), e.g., structural equation modeling, latent growth models and so on (Ullman and Bentler, 2003; McArdle and Bell, 2000), it seems that the unavailability of such tools is limiting the adoption of such ideas in a broader cross-section of science. We will seek to address this gap.
Needles in Temporal Haystacks.
If we temporarily set aside the potential value of a hypothesis test framework for temporal trajectories in graphical models, we see that from an operational viewpoint, such procedures are most effective when a practitioner already has a precise scientific question in mind. In reality, however, many data analysis tools are deployed for exploratory analyses to inform an investigator as to which questions to ask. Being able to "localize" which parts of the model are different across groups can be very valuable. This ability actually benefits statistical power as well. Notice that when the stratified groups are not very different to begin with, e.g., healthy individuals with presence or absence of a genetic mutation, the effect sizes (statistical difference between two groups) are likely to be poor. Here, while the trends identified on the full precision matrix may still be different (i.e., there may be a real signal associated with a grouping variable), they may not be strong enough to survive significance thresholds. Ideally, what we need here are analogs of the widely used "scan statistics" for our hypothesis testing formulations for temporal graphical models, to identify which parts of the signal are promising. Then, even if only a small subset of features were different across groups, we may be able to identify these differential effects efficiently. This benefits Type 2 error, provides a practical turnkey product for an experimental scientist, and makes up the key technical results of our work.

Briefly, we provide (i) a simple and efficient parametric procedure for modeling temporally evolving graphical models, (ii) a hypothesis test for identifying differences between group-wise estimated models, and (iii) a scan algorithm to identify those subsets of the features which contribute to the group-wise differences. Together, these ideas offer a framework for identifying group-wise differences in temporally coupled graphical models.
From the experimental perspective, we find scientifically plausible results on a unique longitudinally tracked cohort of middle-aged (and young elderly) persons at risk for Alzheimer's disease due to family history, but who are otherwise completely cognitively healthy.

The rest of the paper is organized as follows. In Section 2 we present an efficient manifold regression procedure for modeling covariance trajectories, which serves as a blackbox module in our hypothesis testing framework. In Section 3, we define our main hypothesis test for group difference analysis over covariance trajectories. In Section 4, we present a set of technical results describing our localization procedure based on scan statistics, as well as derive suitable size corrections to compare across feature subsets. Sections 5, 6, and 7 conclude with empirical evaluations of our model on synthetic data, various types of demographics/behavior data collected longitudinally in the United States from publicly available resources, and finally, our analysis on a unique longitudinal dataset (followed since 2001) from a preclinical Alzheimer's disease study involving approximately 1500 individuals.

2. Characterizing Covariance Trajectories
Our main statistical testing framework, to be described shortly, needs an efficient means for calculating a "trajectory" of the feature-by-feature interaction graphs over time for the given longitudinal data. We now describe a scheme which offers this capability. Let X_t ∈ R^{n_t × p} be the design matrix of all n_t samples at time t, where t ∈ {1, . . . , T}, and T is the total number of distinct timepoints. We wish to capture the trends in the relationships between the features as a function of t. To evaluate the groupwise differences in changes of such interactions, we make use of the fact that these interactions are commonly captured by correlation or conditional independence, represented by the covariance matrix (with normalized features) and the precision matrix (the inverse of the covariance matrix). Here we simply use the covariance matrix for each timepoint t to denote the interaction between features, C_t = cov(X_t). Our goal now is to estimate the parameters of the function t → C_t. We may vectorize the covariance matrix and apply a linear model; its parameters will give the trajectory in "vectorized covariance space" as we scan through t. But these predictions are not guaranteed to be valid SPD matrices, and even if a projection is performed to obtain a covariance estimate, distortions introduced by the process may be significant (Fletcher, 2013). It is well known that classical vector space models tend to be suboptimal in the manifold setting (covariance matrices live on the SPD manifold) since they use Euclidean metrics which are defined in the ambient space. For manifold-valued data, Riemannian metrics are shown to be superior in many applications (Fletcher and Joshi, 2007; Banerjee et al., 2015; Jayasumana et al., 2013; Tuzel et al., 2007), and are increasingly being deployed in machine learning/statistics.
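The per-timepoint construction C_t = cov(X_t) described above is straightforward; a minimal sketch (my own helper, not the authors' code):

```python
# A minimal sketch of the covariance trajectory: C_t = cov(X_t) for
# per-timepoint design matrices X_t of shape (n_t, p).
import numpy as np

def covariance_trajectory(X_by_time):
    """Return [C_1, ..., C_T], one p x p covariance matrix per timepoint."""
    # rowvar=False: rows of each X_t are samples, columns are the p features
    return [np.cov(X_t, rowvar=False) for X_t in X_by_time]
```

Each C_t is symmetric positive (semi-)definite by construction; the modeling question is how these matrices evolve with t.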
We will utilize an appropriate statistical model informed by the manifold structure of the data and then derive a hypothesis testing procedure to detect groupwise differences in the changes of interactions between features in longitudinal analysis. To do so, we first summarize basic differential geometry notation (Do Carmo, 1992; Lee, 2012) and then describe our models. If desired, any other (efficient) manifold-valued linear model (Fletcher, 2013) can be substituted in; no change in the workflow is needed. A reader familiar with manifold regression algorithms may consider this module as a black-box and skip ahead to Section 3, which uses the parameter estimates from this procedure.

Let M be a differentiable (smooth) manifold in arbitrary dimensions. A differentiable manifold M is a topological space that is locally similar to Euclidean space and has a globally defined differential structure. A Riemannian manifold is a differentiable manifold M equipped with a smoothly varying inner product. The geodesic curve is the locally shortest path, analogous to a straight line in R^p; this geodesic curve will be the object that defines the trajectory of our covariance matrices in SPD space. Unlike in Euclidean space, note that there may exist multiple geodesic curves between two points on a curved manifold. So, the geodesic distance between two points on M is defined as the length of the shortest geodesic curve connecting the two points. The geodesic distance helps in measuring the error of our trajectory estimation (analogous to a Frobenius or ℓ2 norm based loss in the Euclidean setting). The geodesic curve from y_i to y_j is parameterized by a tangent vector in the tangent space anchored at y_i with an exponential map Exp(y_i, ·) : T_{y_i}M → M. The inverse of the exponential map is the logarithm map, Log(y_i, ·) : M → T_{y_i}M. These two operations

    Operation     Euclidean                           Riemannian
    Subtraction   →x_i x_j = x_j − x_i                →x_i x_j = Log(x_i, x_j)
    Addition      x_i + →x_j x_k                      Exp(x_i, →x_j x_k)
    Distance      ‖→x_i x_j‖                          ‖Log(x_i, x_j)‖_{x_i}
    Mean          Σ_{i=1}^n →x̄ x_i = 0               Σ_{i=1}^n Log(x̄, x_i) = 0
    Covariance    E[(x_i − x̄)(x_i − x̄)^T]           E[Log(x̄, x) Log(x̄, x)^T]

Table 1: Basic operations in Euclidean space and Riemannian manifolds.
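The Riemannian operations in Table 1 have closed forms on SPD(p) under the affine-invariant metric; the following is a sketch with my own helper names (computed via eigendecompositions, which keep everything real-valued), not the paper's code:

```python
# Closed-form Exp/Log maps and geodesic distance on SPD(p), affine-invariant
# metric: Log(a, q) = a^{1/2} log(a^{-1/2} q a^{-1/2}) a^{1/2}, and its inverse.
import numpy as np

def _sym_fun(m, f):
    """Apply a scalar function f to a symmetric matrix via eigendecomposition."""
    w, v = np.linalg.eigh((m + m.T) / 2.0)   # symmetrize against round-off
    return (v * f(w)) @ v.T

def spd_log(base, q):
    """Logarithm map Log(base, q): SPD point q -> tangent vector at base."""
    s = _sym_fun(base, np.sqrt)                       # base^{1/2}
    s_inv = _sym_fun(base, lambda w: 1 / np.sqrt(w))  # base^{-1/2}
    return s @ _sym_fun(s_inv @ q @ s_inv, np.log) @ s

def spd_exp(base, v):
    """Exponential map Exp(base, v): tangent vector at base -> SPD point."""
    s = _sym_fun(base, np.sqrt)
    s_inv = _sym_fun(base, lambda w: 1 / np.sqrt(w))
    return s @ _sym_fun(s_inv @ v @ s_inv, np.exp) @ s

def spd_dist(a, b):
    """Geodesic distance: Frobenius norm of the 'whitened' matrix logarithm."""
    s_inv = _sym_fun(a, lambda w: 1 / np.sqrt(w))
    return np.linalg.norm(_sym_fun(s_inv @ b @ s_inv, np.log), 'fro')
```

By construction, spd_exp and spd_log are mutual inverses at any base point, which is the property the regression model below relies on.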
Figure 1: Group-wise MMGLM: The left and right figures represent two linear models on the SPD(p) manifold. Points x_i in the tangent space are our covariate or predictor, and points y_i in the manifold space represent SPD(p) matrices. In our regression setting, we wish to minimize the error (brown curves) between the estimation and the sample points. Because each linear model has a different base point, the trajectories cannot be directly compared as in the Euclidean setting.

move us back and forth between the manifold and the tangent space. For completeness, Table 1 shows corresponding operations in the Euclidean space and Riemannian manifolds. Separate from the above notation, the matrix exponential (and logarithm) are simply exp(·) (and log(·)). Finally, parallel transport is a generalized parallel translation on manifolds. Given a differentiable curve γ : I → M, where I is an open interval, the parallel transport of v ∈ T_{γ(t)}M along the curve γ can be interpreted as the parallel translation of v on the manifold preserving its length and the angle between v(t) and γ. The parallel transport of v from y to y′ is denoted Γ_{y→y′} v.

Several regression models for manifold-valued data have been proposed recently, a majority of which are nonparametric (Jayasumana et al., 2013; Banerjee et al., 2015). Because of the longitudinal nature of our dataset (and recruitment considerations in neuroimaging studies), sample sizes do not exceed a few hundred participants (typically much smaller). We have found that generally, in this regime, parametric methods are better suited and also offer other benefits for downstream applications. Next, we will give a simple parametric model for this problem. Let x and y be vectors in R^p and R^{p′} respectively.

Definition 1 (Standard GLM.) The Euclidean multivariate multilinear model is

    y = β_0 + β_1 x_1 + β_2 x_2 + . . . + β_p x_p + ε    (1)

where β_0, the β_i, and the error ε are in R^{p′}, and x = [x_1 . . .
x_p]^T are the predictor variables.

Henceforth, we will use the terms covariate and predictor interchangeably to describe those specific features we wish to control for in our model (e.g., time-points in our experiments). For manifold-valued data, we adapt the formulation proposed by (Kim et al., 2014).
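The Euclidean model in Definition 1 has the familiar closed-form least-squares fit; a hedged sketch (the helper name fit_mglm is mine, not from the paper):

```python
# Ordinary least squares fit of the Euclidean multivariate multilinear
# model (Definition 1): y = beta_0 + sum_j beta_j x_j + noise.
import numpy as np

def fit_mglm(x, y):
    """x: (n, p) predictors; y: (n, p') responses.
    Returns the intercept beta_0 of shape (p',) and slopes of shape (p, p')."""
    X = np.hstack([np.ones((x.shape[0], 1)), x])   # prepend intercept column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves min ||X B - y||^2
    return coef[0], coef[1:]
```

The manifold model below replaces exactly these two objects: the intercept becomes a base point on M, and each slope vector becomes a tangent vector at that base point.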
Definition 2
The Manifold Multivariate General Linear Model (MMGLM) is defined as

    min_{b ∈ M, V_j ∈ T_b M ∀j}  Σ_{i=1}^N d(Exp(b, V x_i), y_i)²    (2)

where V x_i := Σ_{j=1}^p V_j x_i^j and d(·,·) is the geodesic distance between ŷ_i := Exp(b, V x_i) and y_i.

This formulation generalizes (1) by replacing the intercept β_0 and each vector β_j for a covariate with a base point b ∈ M and a geodesic basis V_j ∈ T_b M respectively. The geodesic basis V_j at b parameterizes a geodesic curve Exp(b, V_j x_j). Intuitively, this model is a 'generalized' linear model with the inverse exponential map Exp^{-1} (or logarithm map Log) as a 'link' function (Kim et al., 2014; Cornea et al., 2016). When the covariate/predictors are univariate, we will obtain a single geodesic curve, modeled via the so-called Geodesic Regression (Fletcher, 2013).

The objective in (2) can be solved by both gradient descent (Fletcher, 2013; Kim et al., 2014) and MCMC methods (Cornea et al., 2016). Unfortunately, these schemes can be expensive, especially when the dimension of the manifold is large. Further, if the algorithm needs to be run a large number of times, the computational footprint quickly becomes prohibitive. Motivated by these considerations, we use a so-called log-Euclidean approximate algorithm introduced in (Kim et al., 2014) with some adaptations, which requires mild assumptions on the manifold-valued data.

Recall that in classical ordinary least squares (OLS), the regression curve goes through the mean of the covariates and response variables, i.e., y − ȳ = β(x − x̄). Similarly, we assume that geodesic curves go through the mean of the response variables on the manifold. Then, the base point, or intercept, "b" in (2) can be approximated by the manifold-valued mean of the sample points, the Karcher mean (Karcher, 1977). The propositions derived from (Kim et al., 2014) lead directly to the following.

Proposition 3
Let C̄ be the unique Karcher mean of a sufficiently close set of covariance matrices that lie on a curve Ω. Then C̄ ∈ Ω, and for some tangent vector V ∈ T_{C̄}M and each C, there exists x ∈ R such that C = Exp(C̄, V x).

This allows us to bypass the fairly involved variational procedure to estimate the base point b. With this approximation of b̂ via ȳ, the remaining variables to optimize are the tangent vectors V. We do so by taking advantage of log-Euclidean schemes. Once the base point is established as the Karcher mean, each data point on the manifold is projected into the tangent space at that point: Log(ȳ, y). These "centered" points ỹ are now Euclidean, and if the covariates are centered as well (x̃), a closed form solution exists in the standard form of V = ỹ x̃^T (x̃ x̃^T)^{-1}.

In this setting, it is often assumed that two points y_1, y_2 have a distance defined as d(y_1, y_2) := ‖Log(y_1, y_2)‖_{y_1} ≈ ‖Log(b, y_1) − Log(b, y_2)‖_b. However, on SPD manifolds with an affine invariant metric, each tangent space has a different inner product varying as a function of the base point b, i.e., ⟨u, v⟩_b := tr(b^{-1/2} u b^{-1} v b^{-1/2}). This makes comparison of trajectories difficult without moving to tangent bundle formulations. This issue is discussed in some detail in (Muralidharan and Fletcher, 2012; Hong et al., 2015). However, note that

Remark 4
When the base point b is the identity I, then the inner product is exactly the Euclidean metric: ⟨u, v⟩_I := tr(I^{-1/2} u I^{-1} v I^{-1/2}) = tr(uv) = tr(u^T v).

This follows from the fact that u and v are symmetric matrices on SPD(p). We take advantage of this property through parallel transport. Specifically, we can bring all of the data to T_I M, which will allow for a meaningful comparison of two tangent vectors from different base points. Similar schemes have been used for projection on submanifolds in (Xie et al., 2010) and other problems (Sommer et al., 2014). With a fast algorithm to compute (2) available, we can now accurately model longitudinal trajectories of covariance matrices. Our statistical procedure described next simply assumes the availability of some suitable scheme to solve the manifold regression as defined in (2) efficiently and does not depend on particular properties of the foregoing algorithm.
3. Test Statistics for SPD(p) Trajectories
With an algorithm to construct a regression model for covariance matrix responses in hand, we can now describe a key component of our contribution: a test statistic which allows addressing the main question of interest:
Is the progression/trajectory of covariance matrices (over time) different across two groups?
In the standard two-sample testing problem, a hypothesis test is set up to check if the parameters of each group are significantly different:

    H_0 : θ_1 = θ_2  vs.  H_A : θ_1 ≠ θ_2    (3)

Recall that in a general linear model (GLM), when testing for mean group differences, the test parameters are the regression slopes from a standard GLM fit. In our setting, the parameters of interest are the population covariance trajectories estimated from the manifold regression in (2), see Fig. 1. While the trajectories and the slopes are related, note that our parameters are estimated on the manifold. Two unique manifold trajectories, when projected as simple multivariate responses in Euclidean space, may not be significantly different under the GLM hypothesis testing framework, as has been observed by (Du et al., 2014). Returning to our longitudinal trajectory formulation, we have the following naïve Covariance GLM:

Definition 5 Let vec(C_{g,t}) be the vectorized covariance matrix at timepoint t for group g ∈ {1, 2}. Then the naïve Covariance GLM is defined as

    vec(C_{g,t}) = β_0^g + β_1^g t + ε    (4)

with the slope θ_g = β_1^g in the hypothesis test in (3), and vec(·) is the vectorized form of the input matrix.

With this model, our hypothesis testing reduces to a simple difference of slopes, which is well-studied in the classical statistics literature.
Definition 6 (Seber and Lee, 2003) Let β_1, β_2 be the multivariate slopes calculated from estimating (4). Then an α-level hypothesis test rejects the null hypothesis β_1 = β_2 when L > χ²_{p, 1−α}, where

    L = (β̂_1 − β̂_2)^T Σ^{-1} (β̂_1 − β̂_2)    (5)

Knowing that the response space is structured, i.e., our covariance matrices lie on the SPD manifold, we seek a more appropriate test and corresponding test statistic which adequately captures this knowledge. Observe that we can directly apply the manifold regression in §2.

Definition 7
Let C_{g,t} be the covariance matrix at timepoint t for group g ∈ {1, 2}. Then the Longitudinal-Covariance GLM (LCGLM) is defined as

    C_{g,t} = Exp(b_g, V_g t)    (6)

with b_g and V_g being the base point and tangent vector respectively, as described in §2. But the two estimated tangent vectors V_1 and V_2 live in different tangent spaces. To accurately compare two tangent vectors, we must parallel transport both vectors to the same tangent space. Once they are both in the same space, we can construct a simple test statistic for the trajectory difference:

    L = ‖Γ_{b_1→I} V_1 − Γ_{b_2→I} V_2‖²_I    (7)

Recall that the inner product at the identity I coincides with the Euclidean metric. This can now be naturally interpreted as a difference of slopes, and together with a standard Euclidean Normal noise assumption yields the following hypothesis test.
Assume that Γ_{b→I}V is normally distributed N(0, I). Then the statistic defined in (7) follows a χ² distribution whose degrees of freedom equal the number of free entries of the symmetric tangent vector, and the corresponding threshold test is an α-level hypothesis test for the covariate trajectory group difference.

3.1 Incorporating First-Order Differences

In many real world situations, first-order information in the data is often valuable in identifying group differences. Restricting our analysis to only the second-order interactions, i.e., covariances, may be inefficient (or sub-optimal) when the mean signal difference between groups is large. Our construction easily extends to these cases. In particular, the product space over both means and covariances is R^p × SPD(p).
The typical GLM on the first order information is defined in the standard Euclidean space. So, computing the regression in the product space R^p × SPD(p) amounts to simply computing the regression on the first and second order statistics (mean and covariance) separately.

The above statement suggests that by applying the manifold regression to the covariances and the standard regression model for the means, we are directly solving the product space regression problem, incorporating both first and second order statistics. However, in these cases, the statistic defined above in (7) does not directly take into account the potential difference in means. Given our Normal noise assumption, we can easily invoke the standard Gaussian multivariate likelihood statistic for group differences.
Definition 10
Let μ̂_t, Σ̂_t be the estimated mean and covariance from the standard linear model and our manifold-covariance GLM respectively. Then the Gaussian likelihood of our data X is

    P(X | μ̂, Σ̂) = Π_{t=1}^T Π_{i=1}^{n_t} P(X_{t,i} | N(μ̂_t, Σ̂_t)),    (8)

where X_t is the subset of our data collected at timepoint t. Additionally, we can define a standard likelihood ratio test statistic as:

    L_prod = [ P(X_1 | μ̂_1, Σ̂_1) P(X_2 | μ̂_2, Σ̂_2) ] / P(X_{1,2} | μ̂_{1,2}, Σ̂_{1,2})    (9)

This statistic is again χ²-distributed (Seber and Lee, 2003), and an α-level hypothesis test for group difference analysis can be defined in the same way as above. While our manifold regression modeling is focused on the case of centered data (where the mean signal may not be significantly different between the groups), we use the product space construction, wherever appropriate, in experimental evaluations.
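The trajectory comparison of this section can be sketched end to end. Two ingredients are my assumptions where the text is terse: the closed form Γ_{b→I}V = b^{-1/2} V b^{-1/2} for transport to the identity under the affine-invariant metric, and p(p+1)/2 degrees of freedom (the free entries of a symmetric p×p tangent vector):

```python
# Sketch of the trajectory-difference test: transport both group slopes to
# the tangent space at the identity, take the squared norm of the difference,
# and compare against a chi-square critical value.
import numpy as np
from scipy.stats import chi2

def _sym_pow(m, p):
    w, v = np.linalg.eigh((m + m.T) / 2.0)
    return (v * w**p) @ v.T

def transport_to_identity(b, V):
    """Gamma_{b->I} V = b^{-1/2} V b^{-1/2}: 'whiten' the tangent vector."""
    s_inv = _sym_pow(b, -0.5)
    return s_inv @ V @ s_inv

def trajectory_diff_test(b1, V1, b2, V2, alpha=0.05):
    """Statistic in the style of (7), with a chi-square threshold test."""
    d = transport_to_identity(b1, V1) - transport_to_identity(b2, V2)
    L = np.sum(d * d)                 # squared norm at I = squared Frobenius
    p = b1.shape[0]
    dof = p * (p + 1) // 2            # free entries of a symmetric matrix
    return L, bool(L > chi2.ppf(1 - alpha, dof))
```

Identical group trajectories give a statistic of zero regardless of the shared base point, while a large slope difference at the identity crosses the critical value.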
4. Localizing Group Differences for SPD(p) Trajectories
The above procedure provides a precise mechanism to derive a statistic from the group-wise covariance matrix trajectories. However, when the effect sizes are poor, any scheme operating on the trajectories of the full covariance matrix may still fail to identify group differences (as is the case in our experiments). To improve statistical power, localizing the process of computing the trajectories only to the relevant features is critical. To this end, we consider the following global hypothesis testing problem

    H_0 : ∀R, β_R^1 = β_R^2  vs.  H_1 : ∃R, β_R^1 ≠ β_R^2,

where β denotes the slope and R is the region of the covariance matrix which only includes the relevant features, see Fig. 2. It turns out that by adapting scan statistics (Fan et al., 2012; Arias-Castro et al., 2011), we will be able to exclude the effect of irrelevant regions of the covariance matrix in the calculated trajectories. By extending this concept to graphs, we obtain an algorithm to identify subsets of features of the covariance matrix which show group differences that are otherwise unidentifiable, in a statistically rigorous way.
Scan statistics are a valuable tool for structured multiple testing. In its simplest form, we can consider a setting where we place a window (or box) over a region R in an image and calculate a local statistic L_R, e.g., an average or a response to a convolution filter. Then, the window can be raster scanned to various locations in the image, and the maximum over the set of local statistics can be called the scan statistic. Intuitively, if the image is assumed to be a Gaussian random field, we can set up a null hypothesis using a critical value, and finding a statistically significant signal (i.e., regions) corresponds to comparing the local region-wise statistic with the critical value. Of course, there is flexibility in terms of specifying properties of the regions, as described next.

Definition 11
Let R be the collection of all possible structured regions, and L_R be some statistic over a region R ∈ R. The scan statistic is defined as L* = max_{R ∈ R} L_R.

Recent results in scan statistics show how size corrections can be used to increase detection power in multi-scale analysis with nice guarantees (Walther et al., 2010; Wang et al., 2016). To utilize these ideas for our hypothesis test, we must extend scan statistics and these size corrections to a graph setting, where the graph is induced by a sparse estimation of the precision matrix, e.g., graphical lasso (or any other algorithm of choice) over the features. To do so, structured regions R and a statistic L_R on each region must be defined on the graph. Intuitively, in our case, L_R must capture the "difference" in group-wise covariance trajectories. As we will describe shortly, it is in the context of this statistic where we utilize the LCGLM (6), which will be invoked at the level of individual regions R, one by one.

Let G := (V, E) be a graph over the features (represented in the covariance matrix) with vertex set V and edge set E. We define the structured region R ⊆ G as a connected subgraph of G corresponding to the selection of those vertices as our feature subset (a block of the covariance matrix, see Fig. 2). A natural question is whether such an enumeration is tractable, since the number of connected subgraphs R is exponential. It turns out that if we make a mild assumption on the graph, the number of induced regions can be shown to be polynomially bounded. Further, it then naturally provides a size correction, the analog of a multiple testing adjustment.

Remarks.
In our motivating application, the group differences we seek to identify will involve a cohesive set of features that will be connected to each other, by definition (large changes in covariances indicate dependent features). Based on this observation, we assume that the true localized subgraph is a "ball" subgraph.

Figure 2: (left) A region of the sparse precision matrix, (center) the corresponding subgraph of that region, along with balls of varying radius from the root node E, e.g., B(E, 1) = {F, C, E}, (right) the ball subgraph constructed with r = 1. These subgraphs with bounded radius act as the structured regions on which scan statistics can be applied.
A ball subgraph consists of nodes within a given radius r from a particular node (see Fig. 2). The collection of ball subgraphs is defined as

    R = {B(v, r) : v ∈ V and r ∈ N}    (10)

where the ball subgraph B(v, r) := {v′ ∈ V : d(v, v′) ≤ r}, and d(v, v′) is the minimum length path connecting v and v′.

With this assumption, it can be verified that we now only need to search a polynomially bounded set of regions.
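The regions of Definition 12 can be enumerated with a breadth-first search from each vertex; a sketch (the adjacency-dict representation and helper name are mine):

```python
# Enumerate all distinct ball subgraphs B(v, r) of a graph via BFS from
# each vertex, deduplicating by vertex set.
from collections import deque

def ball_subgraphs(adj):
    """adj: dict vertex -> set of neighbours (undirected graph).
    Returns the set of distinct balls B(v, r) as frozensets of vertices."""
    balls = set()
    for root in adj:
        dist = {root: 0}                 # BFS distances from the root
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for r in range(max(dist.values()) + 1):
            balls.add(frozenset(v for v, d in dist.items() if d <= r))
    return balls
```

Since each vertex contributes at most one new ball per radius, the count stays small, consistent with the polynomial bound stated in Remark 13.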
Remark 13
The number of unique ball subgraphs in any graph G is bounded above by D|V|, where D is the diameter (longest chain) of the graph G.

On these regions (i.e., blocks of the covariance matrix), we will invoke the LCGLM to provide us a statistic L_R. This is just the difference in slopes of the calculated manifold regression across groups in (7). We will iteratively obtain this statistic for distinct regions R and find subgraphs that differ in their trajectories across groups using a size correction for hypothesis tests.

Let us revisit the standard linear model setting and assume that our slopes β_g^R correspond to the subset of slopes from features in R, and β̂_g^R is an estimate of that slope. In this case, we have the following statistic (see e.g. (Seber and Lee, 2003)),

    (β̂_1^R − β̂_2^R)^T Σ_R^{-1} (β̂_1^R − β̂_2^R) ∼ χ²_{|E(R)|},    (11)

where Σ_R is the covariance matrix of β̂_1^R − β̂_2^R. With a normal noise assumption, this covariance will be identity and the statistic would simply be the ℓ2-norm difference as in the classical analysis. To make the statistics comparable across different sizes, we use the standardized version of a χ²_{|E(R)|} distribution,

    L_R = [ (β̂_1^R − β̂_2^R)^T Σ_R^{-1} (β̂_1^R − β̂_2^R) − |E(R)| ] / √(2|E(R)|).    (12)

We can extend this analysis to our manifold setting.

Definition 14
For a given structured region $R$, the region-based LCGLM is written as
$$(b_g^R, V_g^R) = \operatorname*{argmin}_{(b^R, V^R) \in T\mathcal{M}^R} \; \mathbb{E}\left[ d\big( \mathrm{Exp}(b^R, V^R t_g), C_g^R \big)^2 \right], \qquad (13)$$
where $C_g^R$ is the covariance matrix subblock defined by the features included in $R$ for group $g$ ($t_g$ is our univariate predictor, i.e., time).

To compare the group trajectories, we first parallel transport each tangent vector to the identity as described in §3 and compute $\| \Gamma_{b_1^R \to I} V_1^R - \Gamma_{b_2^R \to I} V_2^R \|_I$. In the case of the product space construction, we apply the test in (8) to the data subset corresponding to the features in region $R$, with the same correction as in (12).

Summary.
We now have a region-based statistic for the manifold regression setting that is approximately distributed as $N(0, 1)$ under the null.

A final unresolved yet important issue is that we must correct $L_R$ based on the number of edges $|E(R)|$ in $R$. This has a direct consequence on detection power. Observe that the normalization for size correction should be determined by the null distribution of $L_R$, i.e., when there is no slope difference in the trajectories between groups. In order to derive a correction, we need to characterize the behavior of the scan statistic within roughly similar regions, $\max_{R \in \mathcal{R}(A)} L_R$, where $\mathcal{R}(A)$ is the collection of regions $R$ with $|E(R)|$ of size similar to $A$,
$$\mathcal{R}(A) = \{ R \in \mathcal{R} : A/2 < |E(R)| \le A \}. \qquad (14)$$
Clearly, the behavior of $\max_{R \in \mathcal{R}(A)} L_R$ depends on the "complexity" of $\mathcal{R}(A)$. A clear understanding of how similar subgraphs relate to each other leads directly to a correction tied to their relative sizes. To investigate the complexity of $\mathcal{R}(A)$, we define the following quantities.

Definition 15
The distance between subgraphs $R_1$ and $R_2$ is given by
$$d(R_1, R_2) = 1 - \frac{|E(R_1) \cap E(R_2)|}{\sqrt{|E(R_1)|\,|E(R_2)|}}. \qquad (15)$$

Definition 16
Let the $\epsilon$-covering number of $\mathcal{R}(A)$, denoted by $N(A, \epsilon)$, be the smallest integer such that there is a subset $\mathcal{R}_{\mathrm{approx}}(A, \epsilon)$ of $\mathcal{R}$ such that
$$\sup_{R_1 \in \mathcal{R}(A)} \; \inf_{R_2 \in \mathcal{R}_{\mathrm{approx}}(A, \epsilon)} d(R_1, R_2) \le \epsilon, \qquad (16)$$
where $|\mathcal{R}_{\mathrm{approx}}(A, \epsilon)| = N(A, \epsilon)$.

We can verify that all regions in $\mathcal{R}(A)$ can be approximated by regions in $\mathcal{R}_{\mathrm{approx}}(A, \epsilon)$ with reasonably small error. From the definitions, notice that the complexity of $\mathcal{R}(A)$ is reflected by $N(A, \epsilon)$. If $N(A, \epsilon)$ is nicely bounded (as is the case here), scan statistics can be calculated very efficiently (Lemma 18). Before stating this result, we make a mild assumption on our graph: for any ball subgraph, the edges around its center are not too sparse compared to the edges in the outer region of the ball subgraph, i.e., hard on the inside, soft on the outside. This yields,

Figure 3: (Left) Chain, ring and 2D lattice graphs that satisfy the Avocado Assumption. (Right) Star graph that does not satisfy the property: from the center node the graph is "too dense on the outside."
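The "hard on the inside, soft on the outside" property formalized in Assumption 17 (stated next) is easy to check numerically on small graphs. A hedged sketch (helper names are ours; graphs are plain adjacency dicts, assumed connected) confirming that a ring satisfies inequality (17) with H = S = 1 while a star rooted at a leaf does not, matching Figure 3:

```python
from collections import deque

def ring_adj(n):
    """Ring graph on n nodes."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def ball_edges(adj, root, r):
    """Number of edges in the subgraph induced by the ball B(root, r)."""
    dist = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        if dist[u] < r:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
    ball = set(dist)
    return sum(1 for u in ball for v in adj[u] if v in ball) // 2

def satisfies_avocado(adj, H, S, r_max):
    """Check inequality (17) for all centers v and all r/2 <= r' <= r <= r_max."""
    for v in adj:
        for r in range(1, r_max + 1):
            er = ball_edges(adj, v, r)
            for rp in range((r + 1) // 2, r + 1):
                lhs = ball_edges(adj, v, rp) / er
                rhs = H * (1 - ball_edges(adj, v, r - rp) / er) ** S
                if lhs < rhs - 1e-12:
                    return False
    return True
```

For a 20-node ring the inequality holds with equality throughout, while a 6-leaf star fails already at radius 2 when rooted at a leaf: the ball around the leaf is "too dense on the outside."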
Assumption 17 (Avocado) There exist constants $S$ and $H$ such that, for any $r/2 \le r' \le r$ and $v \in V$,
$$\frac{|E(B(v, r'))|}{|E(B(v, r))|} \ge H \left( 1 - \frac{|E(B(v, r - r'))|}{|E(B(v, r))|} \right)^S. \qquad (17)$$
We see that this assumption holds for many classes of graphs: a ring graph satisfies this condition with $H = 1$ and $S = 1$, and the 2-d lattice satisfies it with $H = 1/4$ and $S = 2$ (see Fig. 3). With this assumption, we have the following result for the $\epsilon$-covering number $N(A, \epsilon)$.

Lemma 18
Let $|E|$ be the total number of edges in $G$. If (17) holds and $A$ is given, then, for a constant $C_{H,S}$ which depends only on $H$ and $S$ in (17),
$$N(A, \epsilon) \le C_{H,S} \frac{|E|}{A} \left( \frac{1}{\epsilon} \right)^{S+1}. \qquad (18)$$
The proof of this result follows from our ball-subgraph construction and our Avocado assumption, and is provided in the Appendix.

Intuitively, this result upper bounds the number of subgraphs that need to be searched over to completely exhaust the search space of subgraphs. With this result, we can now construct a suitable size correction. Following the work of (Davies and Kovac, 2001) and (Wang et al., 2016), we can increase the power of our test by using the following statistic:
$$T^* = \max_{R \in \mathcal{R}} \left( L_R - \sqrt{\log \frac{|E|}{|E(R)|}} \right). \qquad (19)$$
The significance of this size correction is that we now have a single critical value for every candidate subgraph, regardless of the subgraph size. Our final test is defined as $\mathbb{I}[T^* > q_\alpha]$, where $q_\alpha$ is the $\alpha$-level quantile of $T^*$ under the null hypothesis (that no region is truly significant across groups). By construction, we can control the Type 1 error at a specified $\alpha$-level.

Under the alternative hypothesis of this framework, it is important to note that in many cases, large subgraphs that subsume smaller significant subgraphs may also have large test statistics, and our hypothesis test only indicates the existence of some significant region. To identify or localize the smaller subsets, we follow the procedure from (Jeng et al., 2010): beginning with the subgraph with the largest test statistic, we iteratively remove overlapping subsets from the total set of subgraphs. This requires testing each regional/local statistic, $L_R - \sqrt{\log(|E|/|E(R)|)}$, against $q_\alpha$. Under this procedure, we can control the weak family-wise error rate (wFWER) if we view our problem through the lens of multiple testing. The weak FWER is the probability of false discovery under the null hypothesis.
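Once the per-region statistics $L_R$ are available, the size-corrected maximum in Eq. (19) and the greedy localization step of (Jeng et al., 2010) reduce to a few lines. A hedged sketch (function name and data layout are ours): regions are (edge-set, $L_R$) pairs; each region is scored, and the highest-scoring regions that exceed $q_\alpha$ and do not overlap previously kept regions are retained.

```python
import math

def localize(regions, total_edges, q_alpha):
    """Greedy localization of significant regions.

    regions: list of (edge_set, L_R) pairs; total_edges = |E| of the full graph.
    Returns disjoint regions whose size-corrected score from Eq. (19),
    L_R - sqrt(log(|E| / |E(R)|)), exceeds q_alpha, taken from the largest
    score downward while discarding overlapping candidates."""
    scored = [(L - math.sqrt(math.log(total_edges / len(e))), e)
              for e, L in regions]
    scored.sort(key=lambda t: t[0], reverse=True)
    selected = []
    for score, e in scored:
        if score <= q_alpha:
            break  # scores are sorted, so no later region can pass
        if all(not (e & s) for s in selected):  # keep only disjoint regions
            selected.append(e)
    return selected
```

The statistic $T^*$ itself is simply the largest entry of `scored` when `regions` ranges over the full collection $\mathcal{R}$.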
To see that this is inherently controlled, note
$$P(F_N \ge 1 \mid H_0) = P(T^* > q_\alpha \mid H_0) \le \alpha, \qquad (20)$$
where $F_N$ is the number of false discoveries under the null hypothesis. With this correction at the group-difference level, we completely avoid the multiple comparisons issues that would arise with a separate test for each subgraph. In addition to controlling the false positive rate, we have the following guarantee on identifying truly significant regions under the normal noise assumption.
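The critical value $q_\alpha$ is the $\alpha$-level quantile of $T^*$ under the null; in practice it can be estimated by Monte Carlo, e.g., by recomputing the per-region statistics under permuted group labels. A hedged sketch of the quantile step only (array layout and function name are ours; the null draws of $L_R$ are assumed supplied by the user):

```python
import numpy as np

def null_quantile(null_LR, edge_counts, total_edges, alpha=0.05):
    """(1 - alpha)-quantile of T* (Eq. 19) from null simulations.

    null_LR: (n_sim, n_regions) array of region statistics simulated under H0.
    edge_counts: |E(R)| for each region; total_edges: |E| of the full graph."""
    correction = np.sqrt(np.log(total_edges / np.asarray(edge_counts, float)))
    T_star = (null_LR - correction).max(axis=1)  # one T* per null simulation
    return np.quantile(T_star, 1 - alpha)
```

The resulting value then serves as the single threshold for every candidate subgraph.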
Theorem 19 If (17) holds and the number of edges in every candidate subgraph is much larger than $\log |E|$, i.e.,
$$|E(R)| \gg \log |E| \quad \forall R \in \mathcal{R}, \qquad (21)$$
then the critical value $q_\alpha$ satisfies
$$q_\alpha = O(1). \qquad (22)$$
Moreover, as $|E| \to \infty$, if a subgraph $R$ obeys
$$\frac{(\beta_1^R - \beta_2^R)^\top \Sigma_R^{-1} (\beta_1^R - \beta_2^R)}{\sqrt{|E(R)|}} \gg \sqrt{\log \frac{|E|}{|E(R)|}}, \qquad (23)$$
then
$$P\left( L_R - \sqrt{\log \frac{|E|}{|E(R)|}} > q_\alpha \right) \to 1. \qquad (24)$$
The full proof of this result follows a generic chaining argument (see, e.g., (Talagrand, 2006)) along with applications of concentration inequalities and union bounds, and can be found in the Appendix.

Summary.
At a high level, this result directly characterizes the behavior of $T^*$ under the null hypothesis $H_0$ and the alternative hypothesis $H_1$, respectively. We see that (22) implies that $T^*$ can roughly be seen as a constant under the null hypothesis, and under the alternative hypothesis, when (23) is satisfied, the test based on $T^*$ is consistent; see (24).

Figure 4: (left,top) States identified as having significantly different time-varying tobacco usage across gender from 2001 to 2015. (left,bottom) States identified as having significantly different time-varying heavy drinking across gender from 2010 to 2015. (right) Linear regressions of tobacco usage fitted to the four states defined by the ball subgraph centered at Louisiana. Best viewed in color.

With these guarantees, our full workflow is as follows. First, we use an oracle procedure to generate a graph over our features that roughly captures the conditional independences. Any procedure that provides a conditional independence graph is sufficient. Next, for each ball subgraph over this graph, we compute the Longitudinal-Covariance GLM over these features for both groups, and compute the statistics outlined in §
3. We then compute the size-corrected statistic and compare it against the single critical value. For all regions that pass this threshold, we apply the procedure from (Jeng et al., 2010). This workflow shows how to conduct hypothesis tests on temporal trends of large covariance matrices, with improved power and bounded Type 1 error. Additional implementation details can be found in the Appendix.
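In the Euclidean case, the per-region statistic computed in this workflow is Eq. (12): a $\chi^2$ slope-difference standardized by its null mean and standard deviation. A minimal numpy sketch (function name is ours; under the normal-noise model `sigma` is the identity):

```python
import numpy as np

def region_statistic(beta1, beta2, sigma):
    """Standardized slope-difference statistic of Eq. (12).

    beta1, beta2: estimated slope vectors for the two groups on region R.
    sigma: covariance of (beta1 - beta2)."""
    diff = beta1 - beta2
    chi2 = diff @ np.linalg.solve(sigma, diff)  # Eq. (11): ~ chi^2_{|E(R)|} under H0
    d = len(diff)                               # degrees of freedom, |E(R)|
    return (chi2 - d) / np.sqrt(2 * d)          # approx. mean 0, variance 1 under H0
```

Because the statistic has approximately mean 0 and variance 1 under the null regardless of $|E(R)|$, scores become comparable across regions of different size, which is exactly what the size correction in Eq. (19) requires.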
5. Localization Evaluation: Trends of Tobacco Usage Across Gender
We begin our empirical analysis of the model by first applying the subgraph localization procedure by itself (standalone), separate from our manifold regression scheme. In this case, our statistic is derived only from
Generalized Linear Model (GLM) constructions, where $\hat{\beta}_g^R$ in equation (12) is the slope estimated from fitting standard first-order linear models. Identifying the differentially varying subgraphs across groups in this way resembles a simpler version of the planted clique identification problem (Arora and Barak, 2009), where the clique we are trying to identify corresponds to those nodes whose slopes vary significantly across groups.

Data.
The Centers for Disease Control and Prevention (CDC) provides extensive statistics on tobacco and alcohol usage across the US. These data have been collected systematically for the last few decades and are publicly available (including demographic information and gender). As a simple application of our proposed framework, we may pose the following question: which "sub-groups" of states tend to evolve differently in their correlation (pertaining to tobacco/alcohol usage) over time? Our framework extends easily to answer this question. In this setup, the oracle graph is simply the adjacency graph of the continental US, which arises naturally and is used directly in our scanning procedure. For this dataset, we have direct observations of node measures: the percentage of males and females who reported smoking or drinking heavily in each state. Using gender as the group, we fit standard linear models for each candidate subgraph, and compute the difference of gender-wise slopes statistic as described above. In Figure 4, we show the regions identified using our method, and interpret some of the tobacco usage findings here.

In the northeast, we see that women have reduced their tobacco usage at a significantly faster rate than men compared to the rest of the country. We suspect that this may be at least partly tied to the development of women's cigarette brands in the late 1960s and 1970s, followed by subsequent aggressive public policy campaigns in the 1990s and 2000s highlighting health risks beyond pulmonary or cardiovascular diseases for women (e.g., infertility, reduced bone density in post-menopausal women). We also note that state-wide indoor smoking bans were put in place in the Northeast ahead of many other states in the union. In the South, the trends among men and women also seem to differ significantly (see Fig. 4).
Apart from health factors, the group-wise differences in trends may also be explained by factors identified in a 2007 study (Stehr, 2007), which found that as state sales taxes on cigarettes increased, women were significantly more price elastic than men. Between 2006 and 2008, the cigarette tax increased dramatically for all of the 4 states identified except for Louisiana, whose tax rate remained constant. Additionally, while Arkansas did increase its cigarette tax in 2009, it did not increase taxes in locations near borders shared with higher-taxing states. These intricate relationships among states lend credibility to the claim that our scan statistics framework is indeed identifying interesting sub-regions, and suggest that the full covariance-trajectory pipeline may be more appropriate when effects beyond the means are relevant to an analysis.
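The standalone procedure of this section needs only ordinary least squares per group per candidate region: fit one line to each group's node measures and difference the slopes. A toy sketch with synthetic numbers (the CDC figures are not reproduced here; function names are ours):

```python
import numpy as np

def slope(t, y):
    """OLS slope of y regressed on t (first-order linear model)."""
    return np.polyfit(np.asarray(t, float), np.asarray(y, float), 1)[0]

def slope_difference(t, y_group1, y_group2):
    """Difference of group-wise trend slopes for one region's node measure,
    the per-feature ingredient entering the statistic of Eq. (12)."""
    return slope(t, y_group1) - slope(t, y_group2)

# synthetic yearly smoking rates: the second group declines one point/year faster
t = [0, 1, 2, 3]
men = [20.0, 19.0, 18.0, 17.0]    # slope -1
women = [22.0, 20.0, 18.0, 16.0]  # slope -2
```

Here `slope_difference(t, men, women)` is 1.0: the second group's usage declines one percentage point per year faster, which is precisely the kind of signal the scan aggregates over a subgraph.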
6. Pipeline Evaluation on Simulations and Baby Name Trends Over Time
We next evaluate the ability of our entire analysis pipeline to identify group differences across temporally evolving covariance trajectories. In many existing analyses, the effect of the mean differences may be stronger than the effect of the interaction matrix. However, in cases where the mean signal is weak, we expect the covariance effect to be important. To evaluate our model in this regime, we perform a set of simulation studies and also analyze a publicly available longitudinal dataset.
Simulations.
We randomly generate SPD matrices from a 'path' of 4 discrete points along the manifold, and use these as population covariance matrices to generate 0-mean sample data. Table 2 shows the results of the hypothesis testing procedure with 50 features, averaged over 100 runs, where both the true number of features with covariance trajectory differences, $p_t$, and the number of samples per group, $n$, were varied. As expected, our recovery rate increases nicely as a function of the number of samples $n$, and decreases as the size of the region of change $p_t$ is increased with $n$ held constant.

We compare our model to baseline methods that may be used in practice for the foregoing group difference hypothesis test. In standard applications, general linear models (GLMs) are often the first line of attack. When the covariates are assumed to be independent, a simple linear model as in (6) may be suitable. However, when the group difference is influenced

Table 2: Detection accuracy of the hypothesis test scheme (100 runs).

           p_t = 5   p_t = 8   p_t = 10   p_t = 15
n = 10      0.06      0.02      0.04       0.03
n = 20      0.75      0.75      0.53       0.29
n = 50      0.99      1.00      1.00       0.80
n = 100     1.00      1.00      1.00       0.95
n = 200     1.00      1.00      1.00       0.98
n = 1000    1.00      1.00      1.00       1.00

(Figure 5 legend: GLM w/ Interactions, Manifold Trajectory GLM, Naive GLM; vertical axis: % Correct Null Rejections.)
Figure 5:
Correct null hypothesis rejections over 100 runs for three models. For $p = 50$ features, each plot shows the rejection rate for increasing values of $p_t$ (from left to right) as a function of the number of sample points.

by specific interactions between covariates, such linear models require additional care. A typical solution is to introduce pairwise interaction terms into the model, a choice between all possible interactions or specific interactions specified by an expert. The first model is problematic since the number of samples is small relative to the number of interaction terms ($n \ll p^2$). In the second model, we depend completely on the user's choice of interactions, and must correct for multiple testing when testing different models, at least partly reducing the power of the final test. Figure 5 shows the value of our method over these models. For the interaction GLM case, we randomly select interaction terms to include in the GLM, with size $p_t$ (the ground truth number of variables in the interaction). In this way, we approximate the effect of an oracle specifying to the GLM which terms may describe the underlying interaction. We report the fraction of significance tests where a significance threshold of $p \le 0.05$ was reached for each model, averaged over 100 runs. We see that our proposed scheme consistently achieves near-perfect results in terms of the percentage of null hypotheses that were correctly rejected (i.e., there was a significant group-difference signal). The power of scan statistics on graphs is particularly evident in the needle-in-haystack setting, where the true differential signal is small ($p_t \le 8$) and the sample size is small to medium. When both the sample size and $p_t$ are large, the standard linear model with additional interaction terms starts to approach the statistical performance of our algorithm.

Figure 6: Contiguous states identified as having significantly different time-varying co-occurrences between boys' and girls' baby names from 1910 to 2015. Best viewed in color.

Longitudinal trends in Baby Names.
In addition to the simulations above, we report results from a simple analysis of how male/female baby names evolve over the last century. The United States Social Security Administration provides a publicly available dataset listing the frequency of the top 1000 baby names in each state for the last 106 years. We evaluate our model in this context to examine which "sub-groups" of states tend to evolve (or change) in their "name agreement" (or correlation) over time between boys' names and girls' names. Here, rather than calculating a sample covariance at each timepoint, we calculate a rank correlation matrix instead. For example, if two neighboring Gulf Coast states, say Georgia and Alabama, substantially agreed on both boys' and girls' names in the period following the Second World War, but this agreement gradually declined over time for girls (but not boys), we expect that our scan-statistics-on-graphs hypothesis test will segment out this differential signal (in slope trends) from the planar graph induced by the states sharing a border. Shown in Figure 6 are the regions identified using our method, applied to only the rank correlations of the top 10 names for both genders per state per year. Each highlighted region indicates a sub-group whose "trends of correlation (or agreement/disagreement)" in preferred baby names over the last century vary between boys and girls. For states not identified by our model (in gray), we can conclude that the state-to-state name preference interactions may still have evolved over time, but we have insufficient statistical evidence to conclude that such trends (slopes) differ between boys and girls.

7. Identifying Differentially Covarying Features in Preclinical Alzheimer's Disease
We now describe experiments and results focused on the key motivation of this work: facilitating the analysis of a longitudinal study of individuals at risk for Alzheimer's disease (AD), where the statistical signal is weak (with small to medium sample sizes). We describe the dataset details, followed by the analysis, and then interpret our conclusions in the context of scientific results that have been published in the aging and dementia literature.
Study background.
We analyzed data from a cohort of individuals who have been longitudinally tracked for at least three visits over multiple years, as part of an ongoing study (since 2001) to understand disease processes in the brain before an individual exhibits signs of cognitive decline due to Alzheimer's Disease (AD) (Sager et al., 2005). The study, the Wisconsin Registry for Alzheimer's Prevention (WRAP), is among the largest of its kind in existence focused on "preclinical" AD, i.e., when individuals are still cognitively healthy, offering a window into the early disease processes where treatments, drugs and interventions are likely to be most effective. WRAP and its ancillary studies acquire neuroimaging data (MRI, PET with different tracers, diffusion MRI), various clinical test scores, genetic and demographic data, as well as clinical measures such as Cerebrospinal Fluid (CSF) assays. Our analysis seeks to understand subtle group-wise differences in longitudinal patterns of dependencies between these measures at this early stage of the disease.
Dataset.
The dataset consisted of 114 subjects with imaging data from at least two imaging modalities: positron emission tomography (PET) and diffusion weighted Magnetic Resonance (MR) imaging. PET images were used to calculate, using well-validated pre-processing pipelines, the mean amyloid plaque load (an important biomarker for AD) in 16 different anatomical regions of interest in the brain. Amyloid plaque is a known AD-related pathology and generally precedes the onset of cognitive symptoms. Separately, diffusion tensor MR imaging (DTI) data were processed and used to calculate both Fractional Anisotropy (FA) and Mean Diffusivity (MD) in 48 distinct regions (Mori et al., 2008). DTI images provide information about structural connectivity between gray matter regions in the brain. In addition to these 108 (48 ×

Is analysis of second order statistics necessary?
In Figure 7, we present histograms detailing the distribution of two critical cognitive tests, stratified across various groups of scientific interest. Evaluating these distributions was the key motivation for our exploration into the methods described in this paper. The small differences in means across groups regardless of the stratification variable, the saturation that occurs at the ceiling of cognitive test scores, and other preliminary experiments conducted by us suggest that standard analyses are not sensitive enough to identify subtle higher-order differences.
Figure 7: Histograms of the Boston Naming Test scores and RAVLT total scores at all timepoints for the 114 individuals, across different group separations (PiB positivity, ApoE, gender, family history). The means of each test score are not significantly different across the stratification variables.
We now describe, one by one, the components of the largest feature subset discovered for each stratification scheme, and highlight the main scientific findings. In most cases, we provide a brief scientific interpretation of the results for the interested reader. Additional details and results are available in the appendix.
A) Graph Scan Statistics on slope differences across gender.
The most significant (based on region score) subset identified under the gender grouping linked the FA DTI measurement in the left cingulum gyrus with scores on the Rey Auditory Verbal Learning Test (RAVLT). In recent AD research, gender has been identified as a factor in the progression of various pathology measures (e.g., the incidence and prevalence of AD are higher in women (Fratiglioni et al., 1991; Rimol et al., 2010)), and has contributed to a formal NIH notice (NOT-OD-15-102). However, we note that previous work in the field has not identified gender-related differences when looking only at diffusion measures in the cingulum (Lin et al., 2014). Our algorithm successfully identified longitudinal changes in the interaction between these variables, which supports the earlier results and provides some evidence that as men and women age, their cognitive decline as measured by RAVLT manifests differently in relation to the cingulum gyrus.
B) Graph Scan Statistics on slope differences across genotype.
Next, we stratified the cohort based on the genotype known to be most closely linked with AD, the APOE (Apolipoprotein E) gene (Corder et al., 1993): we inherit one APOE allele from each parent, and having one or two copies of the e4 allele increases a person's risk of developing AD.
Gender
  Set 1: RAVLT Total (1-5); FA Cingulum L
  Set 2: FA Medial lemniscus L; FA Cingulum (hippocampus) L; FA Posterior thalamic radiation L
  Set 3: FA Corticospinal tract R; FA Superior O.F. fasciculus R

Genotype: APOE4
  Digit Span Backward Raw Score; Stroop Color-Word Score;
  PiB Cingulum Post L; PiB Cingulum Post R; PiB Frontal Med Orb L; PiB Frontal Med Orb R;
  PiB Precuneus L; PiB Precuneus R; PiB SupraMarginal; PiB Temporal Mid R

Table 3: Group difference across Gender (left) and Genotype APOE4 expression (right). Three disjoint sets of features were identified as covarying significantly differently across gender, while one larger set was identified in the genotype stratification.
Amyloid Load (PiB Positivity)
  Set 1: PiB Angular L/R; PiB Cingulum Ant L/R; PiB Cingulum Post L/R; PiB Frontal Med Orb L/R; PiB Precuneus L/R; PiB Temporal Sup L/R; PiB Temporal Mid L/R; PiB SupraMarginal L
  Set 2: FA Cerebral peduncle R; FA Cerebral peduncle L; MD Corticospinal tract R; MD Corticospinal tract L; Trail-Making Test Part A Score; MD Cerebral peduncle R; PET Cingulum Post R
Table 4: Group difference across Amyloid Load (PiB positivity).

The rarer e2 allele, by contrast, is associated with a lower risk of AD. Using this stratification, we obtain a low-risk and an at-risk group of individuals. Here, we identified amyloid-load regions within the medial and lateral parietal lobes, and find that in the "low-risk" group the covariances between the Digit Span and Stroop Color-Word scores (attention and concentration measures) and amyloid load move from strongly negative towards 0 as a function of age (Table 3). In the "at-risk" (APOE4) group, however, we find that as a function of age the features become more and more positively correlated. Existing studies have shown that the accumulation of amyloid differs significantly across APOE4 gene expression (Mormino et al., 2014), and our results provide some evidence that the expression of the genotype may interact with cognitive scores as well, even at this early stage of the disease, when the individuals in our cohort are cognitively healthy. The sets of features showing a differential signal are presented in Table 3.
C) Graph Scan Statistics on slope differences across amyloid load positivity.
As briefly described above, amyloid load is an important biomarker for AD. For our analysis, amyloid (or PiB) positivity is calculated using the mean amyloid PiB measures across all brain regions from a PiB PET scan of the participant. When we used this measure for stratification (with the threshold set at 1.18, following (Darst et al., 2017)), our model identified fifteen of the sixteen PiB regions that were input to the model when the density of the oracle graph was set to be high. This result is as expected, but interestingly we find that, controlling for the linear combination of the features (through centering), the residual error still has significant signal with the PiB positivity measure, indicating that amyloid burden interactions across brain regions play a very important role in AD progression (Hardy and Selkoe, 2002; Hardy and Higgins, 1992; Tanzi and Bertram, 2005; Jack Jr et al., 2010). When the sparsity of the oracle graph was increased, however, four neighboring regions, the left and right corticospinal tract and the left and right cerebral peduncle, were identified on both PiB and DTI measures (supported by the literature (Douaud et al., 2011)), together with Part A of the Trail-Making Test (see Table 4), which happens to be used in AD diagnosis (Albert et al., 2011). This suggests that changes in atrophy within these regions, as measured by DTI, co-occur with changes in amyloid burden. Additionally, because these regions are highly correlated with gross and fine motor ability (Naidich et al., 2009), it seems plausible that amyloid positivity will lead to higher 'covariation' in these regions together with a measure of fine motor speed, i.e., the Trail-Making Test.

Expert Consensus Diagnosis
  WAIS-3 LNS Raw Score; Boston Naming Test Total Score; RAVLT A2 Raw Score; RAVLT A3 Raw Score; RAVLT A4 Raw Score; RAVLT A5 Raw Score; RAVLT A6 Raw Score; RAVLT Delayed Recall Raw Score; Trail-Making Test Part A; Trail-Making Test Part B; Clock Drawing Test Score; CES Depression Scale Score

Table 5: Group difference localization across expert clinical diagnosis. With significantly more samples and a larger set of cognitive tests, those above were identified as significantly different across the expert consensus measure.
In addition to the dataset presented above, we apply our method to a much larger dataset consisting of approximately 1500 individuals with only cognitive testing data, collected longitudinally. Each individual was administered these tests at between two and three time points, yielding approximately n = 4000 samples for our model. For each assessment, a conference of experts applied a diagnostic label indicating normal cognition or mild cognitive impairment. Using this binary classification, we can stratify our population for group difference analysis. We find that, among many different significant subsets, the covariance trajectory among the scores on both parts of the Trail-Making Test and on all trials of the RAVLT explains a significant group difference. These have previously been shown to be the most sensitive tests for early cognitive decline (Albert et al., 2001). Table 5 displays the other tests identified by our algorithm, and additional experiments on this larger cohort can be found in the appendix.

7.3 Baseline.

In various experiments on this dataset, when the MMGLM procedure is performed on the entire feature set in totality (not utilizing any of the proposed ideas based on scan statistics), with the null distribution derived using permutation testing, the procedure yields no significance across any scientifically interesting group stratification. This implies that the ability to search over different blocks of the covariance matrix is critical to identifying meaningful group differences in the trajectories, and is unavailable using alternate schemes. For instance, simpler strategies work well enough for datasets such as ADNI, which includes diseased subjects as well as controls, where the signal is stronger and even temporal modeling may be unnecessary.
While the scientific results need to be interpreted with caution, and reproducibility experiments on other similar datasets (both within the US and internationally) are in the planning phase, we believe that the ability to localize differences in these interaction patterns in a statistically rigorous manner is valuable, and these findings can be investigated standalone via more classical schemes (e.g., structural equation modeling).
8. Conclusions
The analysis of datasets to identify where clinically disparate groups differ is pervasive in biology, neuroscience, genomics and epidemiological studies. We find that graphical models are an ideal tool for analyzing high-dimensional data in these areas, but they have been used sparingly for the analysis of group-wise differences, especially in the longitudinal setting. Motivated by an application involving the longitudinal analysis of imaging and clinical/cognitive data from otherwise healthy individuals who are at risk for Alzheimer's disease (AD), we show how a combination of manifold regression with a generalization of scan statistics to the graph setting yields tools that can be directly deployed. We present an efficient algorithm and develop theoretical results showing the regimes where its application is appropriate. In various experiments, while the standard schemes are not sufficiently powered to detect the signal, our proposed formulation detects meaningful group difference patterns, many of which have a clear scientific interpretation. We believe these results are promising both for the neuroimaging application described and for other regimes where group-wise analysis is desired but the number of features is large.
Acknowledgments
This research was supported in part by NIH grants R01 AG040396, AG021155, EB022883 and NSF grants DMS 1265202 and CAREER award 1252725. The authors were also supported by the UW Center for Predictive Computational Phenotyping (via BD2K award AI117924) and the Wisconsin Alzheimer's Disease Research Center (AG033514). Mehta was supported by a fellowship via training grant award T32LM012413.

Appendix A. Technical Proofs
A.1 Proof of Lemma 18
To remind the reader, this result was necessary to reduce the number of subgraphs (regions) that need to be evaluated over the graph. By bounding the covering number, we have a guarantee that we do not need to consider an exponential number of subgraphs in order to find a localization.
Proof
To upper bound $N(A, \epsilon)$, we first construct an $\epsilon$-covering set of $\mathcal{R}(A)$ under the metric $d$. To this end, we decompose $\mathcal{R}(A)$ into several disjoint sets
$$\mathcal{R}_j(A) = \left\{ B(v, r) \in \mathcal{R}(A) : \big(1 - (j+1)\epsilon\big) A < |E(B(v, r))| \le (1 - j\epsilon) A \right\},$$
for $j = 0, 1, \ldots, \lceil 1/(2\epsilon) \rceil$. Our strategy is to construct an $\epsilon$-covering set for each set $\mathcal{R}_j(A)$. We only construct the $\epsilon$-covering set for $\mathcal{R}_0(A)$; $\mathcal{R}_j(A)$ ($j \ge 1$) can be treated similarly. To construct the $\epsilon$-covering set for $\mathcal{R}_0(A)$, we denote by $d_{v,r}$ the largest positive number such that
$$\frac{|E(B(v, r - d_{v,r}))|}{|E(B(v, r))|} \ge 1 - \epsilon, \qquad (25)$$
for every $v \in V$ and $r \in \mathbb{N}$. Let $D_1$ be the collection of $d_{v,r}$ such that $B(v, r) \in \mathcal{R}_0(A)$, i.e., $D_1 = \{ d_{v,r} : B(v, r) \in \mathcal{R}_0(A) \}$, and $V_1$ the collection of nodes such that $B(v, r) \in \mathcal{R}_0(A)$, i.e., $V_1 = \{ v : B(v, r) \in \mathcal{R}_0(A) \}$.

We pick the largest number in $D_1$, denoted by $d_{v_1, r_1}$, i.e., $d_{v_1, r_1} \ge d_{v,r}$ for all $d_{v,r} \in D_1$, and define $\tilde{V}_1$ as $\tilde{V}_1 = \{ v \in V_1 : v \in B(v_1, d_{v_1, r_1}/2) \}$. After defining $\tilde{V}_1$, $D_2$ and $V_2$ can be defined as $D_2 = D_1 \setminus \{ d_{v,r} : v \in \tilde{V}_1 \}$ and $V_2 = V_1 \setminus \tilde{V}_1$. Then we pick the largest number in $D_2$, denoted $d_{v_2, r_2}$, and $\tilde{V}_2$ can be defined similarly. We repeat the above process until $D_{M+1}$ and $V_{M+1}$ are empty for some $M$. We thereby obtain a partition of $V_1$,
$$\bigcup_{i=1}^M \tilde{V}_i = V_1 \quad \text{and} \quad \tilde{V}_{i_1} \cap \tilde{V}_{i_2} = \emptyset, \quad 1 \le i_1 < i_2 \le M.$$
Based on $d_{v_1, r_1}, \ldots, d_{v_M, r_M}$, we are ready to prove that the set $\mathcal{R}_0(A, \epsilon) = \{ B(v_i, r_i) : 1 \le i \le M \}$ is in fact an $\epsilon$-covering set for $\mathcal{R}_0(A)$. To this end, it is equivalent to show that for arbitrary $B(v', r') \in \mathcal{R}_0(A)$, we have
$$d\big( B(v', r'), B(v_i, r_i) \big) \le \epsilon \qquad (26)$$
when $v' \in \tilde{V}_i$. To show (26), we consider two cases: $r' > r_i - d_{v_i, r_i}/2$ and $r' \le r_i - d_{v_i, r_i}/2$. If $r' > r_i - d_{v_i, r_i}/2$, then
$$B(v_i, r_i - d_{v_i, r_i}) \subset B(v', r').$$
Combining the above result, (25), and the definition of $\mathcal{R}_0(A)$ yields
$$\frac{|E(B(v', r')) \cap E(B(v_i, r_i))|}{\sqrt{|E(B(v', r'))| \, |E(B(v_i, r_i))|}} \ge \frac{|E(B(v_i, r_i - d_{v_i, r_i}))|}{\sqrt{|E(B(v', r'))| \, |E(B(v_i, r_i))|}} \ge \sqrt{(1 - \epsilon)\, \frac{|E(B(v_i, r_i - d_{v_i, r_i}))|}{|E(B(v', r'))|}} \ge 1 - \epsilon.$$
On the other hand, if $r' \le r_i - d_{v_i, r_i}/2$, then
$$B(v', r') \subset B(v_i, r_i). \qquad (27)$$
By the definition of $\mathcal{R}_0(A)$, we then get
$$\frac{|E(B(v', r')) \cap E(B(v_i, r_i))|}{\sqrt{|E(B(v', r'))| \, |E(B(v_i, r_i))|}} \ge \sqrt{\frac{|E(B(v', r'))|}{|E(B(v_i, r_i))|}} \ge 1 - \epsilon.$$
Therefore, (26) is proved and $\mathcal{R}_0(A, \epsilon)$ is an $\epsilon$-covering set for $\mathcal{R}_0(A)$.

The rest of the proof bounds the cardinality of $\mathcal{R}_0(A, \epsilon)$, i.e., $M$. Note that (17) implies there exists some constant $D_{H,S}$, depending only on $H$ and $S$, such that for any $v \in V$ and $r \in \mathbb{N}$, $|E(B(v, r/4))| \ge D_{H,S} |E(B(v, r))|$. By the definition of $d_{v_i, r_i}$, we can ensure the balls $B(v_i, d_{v_i, r_i}/4)$ are disjoint. Hence,
$$|E(\tilde{V}_i)| \ge |E(B(v_i, d_{v_i, r_i}/4))| \ge D_{H,S} |E(B(v_i, d_{v_i, r_i}))| \ge D_{H,S} \frac{H A \epsilon^S}{2^{S+1}},$$
where the last inequality follows from (17) and (25). A volume argument then yields
$$M \le \frac{|E|}{D_{H,S}\, H A \epsilon^S / 2^{S+1}} = \frac{2^{S+1}}{D_{H,S}\, H} \cdot \frac{|E|}{A} \left( \frac{1}{\epsilon} \right)^S,$$
and (18) is obtained upon application of the above to each $\mathcal{R}_j(A)$.

A.2 Proof of Theorem 19
Before proving Theorem 19, we need the following result.

Lemma 20
Let Y_1, ..., Y_d be i.i.d. standard Gaussian variables, i.e., N(0, 1), and let a_1, ..., a_d be a sequence of numbers. If

Z = Σ_{i=1}^{d} a_i (Y_i² − 1),    (28)

then

P( |Z| ≥ 2|a|√x + 2|a|_∞ x ) ≤ 2 exp(−x),    (29)

where |a| = √( Σ_{i=1}^{d} a_i² ) and |a|_∞ = max_{i=1,...,d} |a_i|.

Proof
This is a direct extension of Lemma 1 in (Laurent and Massart, 2000) to the negative case, and we follow arguments similar to theirs. Let φ(x) be the logarithm of the Laplace transform of Y_i² − 1. For any −1/2 < x < 1/2,

φ(x) = log( E( exp( x (Y_i² − 1) ) ) ) = −x − (1/2) log(1 − 2x) ≤ x² / (1 − 2|x|).

This leads to

log( E(e^{xZ}) ) = Σ_{i=1}^{d} log( E( exp( a_i x (Y_i² − 1) ) ) ) ≤ Σ_{i=1}^{d} a_i² x² / (1 − 2|a_i||x|) ≤ |a|² x² / (1 − 2|a|_∞ |x|).

With the same arguments as in (Laurent and Massart, 2000), we can then show that

P( Z ≥ 2|a|_∞ x + 2|a|√x ) ≤ exp(−x).

The other direction follows by applying the same argument to −Z. With this in hand, we proceed to prove Theorem 19.

Proof
In the following proof, C always refers to some constant, although its value may change from place to place. First, we prove (22). To this end, we prove concentration inequalities for L_R for a fixed R, and for L_{R_1} − L_{R_2} for R_1 ≠ R_2. Since we assume the noise follows a normal distribution, the standardized quantity

( β̂_{R,1} − β̂_{R,2} )^T Σ_R^{−1} ( β̂_{R,1} − β̂_{R,2} )

follows a χ²_{|E(R)|} distribution under the null. By the tail bound for χ² random variables (see, e.g., (Laurent and Massart, 2000)), we obtain

P( L_R > √(2t) + 2t/√(2|E(R)|) ) ≤ exp(−t).    (30)

By definition, L_{R_1} − L_{R_2} can be written as

L_{R_1} − L_{R_2} = Σ_{i ∈ R_1\R_2} Z_i / √(2|E(R_1)|) + ( 1/√(2|E(R_1)|) − 1/√(2|E(R_2)|) ) Σ_{i ∈ R_1∩R_2} Z_i − Σ_{i ∈ R_2\R_1} Z_i / √(2|E(R_2)|),

where the Z_i are independent random variables following the distribution χ²_1 − 1. Lemma 20 implies

P( |L_{R_1} − L_{R_2}| > 2√( d(R_1, R_2) t ) + 2t / √( min(|E(R_1)|, |E(R_2)|) ) ) ≤ 2 exp(−t).    (31)

We now proceed to prove (22) by applying a chaining argument (see (Talagrand, 2006)) together with the concentration inequalities (30) and (31). Recall that R_app(A, ε) is the smallest ε-covering set of R(A) and N(A, ε) is the covering number of R(A). For any subgraph candidate R, we denote

π_l(R) = arg min_{R′ ∈ R_app(A, e^{−l})} d(R, R′).

For l* > l_*, which will be specified later, we decompose max_{R ∈ R(A)} L_R into three parts:

max_{R ∈ R(A)} L_R ≤ max_{R ∈ R(A)} |L_R − L_{π_{l*}(R)}| + Σ_{l = l_*}^{l* − 1} max_{R ∈ R(A)} |L_{π_{l+1}(R)} − L_{π_l(R)}| + max_{R ∈ R(A)} L_{π_{l_*}(R)}.

Now, we bound these three terms separately.
Term 1. Let l* = 2 log|E|, so that d(R, π_{l*}(R)) ≤ |E|^{−2}. By the concentration inequality (31), applied with t replaced by x + 2 log|E|, and a union bound over the candidates in R(A),

P( max_{R ∈ R(A)} |L_R − L_{π_{l*}(R)}| > 2√(x + 2 log|E|)/|E| + (2x + 4 log|E|)/√A ) ≤ 2|R(A)| |E|^{−2} exp(−x) ≤ 2 exp(−x)

for x < log|E|. Therefore, we have

P( max_{R ∈ R(A)} |L_R − L_{π_{l*}(R)}| > C (x + log|E|)/√A ) ≤ 2 exp(−x),   for x < log|E|.

Term 2. Let l_* = log log(|E|/A). Recall that the avocado assumption (17) implies

N(A, ε) ≤ C_{H,S} (|E|/A) (1/ε)^{S+1}.    (32)

Applying the concentration inequality (30) with

t = log(|E|/A) + (S + 1) log log(|E|/A) + x + C    (33)

and the union bound, we have

P( max_{R ∈ R(A)} L_{π_{l_*}(R)} > √(2t) + 2t/√(2A) )
  ≤ N( A, 1/log(|E|/A) ) P( L_{π_{l_*}(R)} > √(2t) + 2t/√(2A) )
  ≤ C_{H,S} (|E|/A) ( log(|E|/A) )^{S+1} P( L_{π_{l_*}(R)} > √(2t) + 2t/√(2A) )
  ≤ exp(−x)

for x < log|E|, where we also applied condition (21). Therefore, we obtain

P( max_{R ∈ R(A)} L_{π_{l_*}(R)} > √( 2 log(|E|/A) + 2(S+1) log log(|E|/A) + 2x + C ) ) ≤ exp(−x)

for x < log|E|.

Term 3. For any given l, applying the concentration inequality (31), the covering number condition (32), and the union bound yields

P( max_{R ∈ R(A)} |L_{π_{l+1}(R)} − L_{π_l(R)}| > √( C (log(|E|/A) + l + x) e^{−l} ) + C (log(|E|/A) + l + x)/√A )
  ≤ C_{H,S} (|E|/A) e^{(l+1)(S+1)} P( |L_{π_{l+1}(R)} − L_{π_l(R)}| > √( C (log(|E|/A) + l + x) e^{−l} ) + C (log(|E|/A) + l + x)/√A )
  ≤ exp(−x − l)

for any x < log|E|.

With another standard application of the union bound, we have

P( Σ_{l = l_*}^{l* − 1} max_{R ∈ R(A)} |L_{π_{l+1}(R)} − L_{π_l(R)}| > √( C (log(|E|/A) + x) / log(|E|/A) ) + C (log|E| + x) log|E| / √A )
  ≤ Σ_{l = l_*}^{l* − 1} P( max_{R ∈ R(A)} |L_{π_{l+1}(R)} − L_{π_l(R)}| > √( C (log(|E|/A) + l + x) e^{−l} ) + C (log(|E|/A) + l + x)/√A )
  ≤ Σ_{l = l_*}^{l* − 1} exp(−x − l) ≤ 2 exp(−x).

Putting the three terms together yields

P( max_{R ∈ R(A)} L_R > √( 2 log(|E|/A) ) + C (x + 1) ) ≤ C log(e|E|/A) exp(−x),

using A ≫ log|E| and the inequalities √(a + b) ≤ √a + √b and √(a + b) ≤ √a + b/(2√a).

Now we apply this bound to A = |E| 2^{−k}, k ≥ 1, to obtain

P( max_{R} ( L_R − √( 2 log( |E| / |E(R)| ) ) ) > C (x + 1) ) ≤ C exp(−x).

This immediately suggests that q_α = O(1).

Now we turn to the case when a subgraph is significant, that is, we prove (24). Assume the significant region is R. Standard calculations give the mean and variance of L_R:

E(L_R) = (β_{R,1} − β_{R,2})^T Σ_R^{−1} (β_{R,1} − β_{R,2}) / √(2|E(R)|)

and

Var(L_R) = ( 2|E(R)| + 4 (β_{R,1} − β_{R,2})^T Σ_R^{−1} (β_{R,1} − β_{R,2}) ) / (2|E(R)|).

By Chebyshev's inequality, we have

P( |L_R − E(L_R)| / √(Var(L_R)) > x ) ≤ 1/x².    (34)

If (β_{R,1} − β_{R,2})^T Σ_R^{−1} (β_{R,1} − β_{R,2}) ≥ 2|E(R)|, then (34) suggests

P( L_R > √(|E(R)|) ) → 1,   as |E| → ∞,

by taking x in (34) as a sequence (e.g., log log(|E(R)|)) that increases slowly enough. This leads to (24). If (β_{R,1} − β_{R,2})^T Σ_R^{−1} (β_{R,1} − β_{R,2}) < 2|E(R)|, then Var(L_R) < 5, and (23) and (34) imply

P( L_R − √( 2 log( |E| / |E(R)| ) ) > q_α ) → 1,   as |E| → ∞.

Appendix B. Implementation Details.
The workflow below describes one run of our model, given a specified sparsity level for the oracle graph procedure.

1.
Oracle Graph.
As noted in the main paper, we use the graphical lasso (glasso) to generate an oracle graph, which allows us to define structured regions (subgraphs) for scan statistics on graphs. Each element of the input matrix C in (35) is generated by calculating the slope, across the predictors, of each position of the covariance matrix for each group, and then taking the difference between the groups. The following inverse covariance estimation problem (glasso) is then solved using existing MATLAB interfaces to fast C implementations:

Θ̂ = arg min_{Θ ⪰ 0} − log|Θ| + tr(CΘ) + λ ‖Θ‖_1    (35)

With sparsity parameter λ, this procedure generates a reasonably sparse oracle graph.
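One way to assemble the input matrix C is sketched below (function names and array shapes are our own assumptions, not the paper's MATLAB pipeline): fit an entrywise least-squares slope of each group's covariance trajectory over time, then difference the two groups.

```python
import numpy as np

def slope_matrix(covs, times):
    # Entrywise least-squares slope of a time-indexed stack of covariance
    # matrices; `covs` has shape (T, p, p) and `times` has length T.
    t = np.asarray(times, dtype=float)
    tc = t - t.mean()  # centered times make the slope a simple projection
    return np.tensordot(tc, np.asarray(covs, dtype=float), axes=(0, 0)) / (tc ** 2).sum()

def glasso_input(covs_a, times_a, covs_b, times_b):
    # Input C for (35): the difference of the two groups' entrywise slopes.
    return slope_matrix(covs_a, times_a) - slope_matrix(covs_b, times_b)
```

The resulting C can then be handed to any glasso solver with penalty λ to produce the sparse oracle graph.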
2. Candidate Subgraphs.
With the oracle graph in hand, we then construct the set of all ball subgraphs, as defined in Section 4 of our main paper. By limiting ourselves to only a few (D|V|) subgraphs, we can perform the scan statistics procedure more efficiently.
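A minimal enumeration of the candidate ball subgraphs (a sketch; representing the oracle graph as a dict adjacency list is our choice, not the paper's data structure):

```python
from collections import deque

def ball_subgraphs(adj, max_radius):
    # All distinct balls B(v, r): the set of nodes within graph distance r
    # of center v, for every v and every radius r <= max_radius.
    balls = set()
    for v in adj:
        dist = {v: 0}
        queue = deque([v])
        while queue:  # breadth-first search out to max_radius
            u = queue.popleft()
            if dist[u] < max_radius:
                for w in adj[u]:
                    if w not in dist:
                        dist[w] = dist[u] + 1
                        queue.append(w)
        for r in range(max_radius + 1):
            balls.add(frozenset(u for u, d in dist.items() if d <= r))
    return balls
```

Deduplicating with frozensets keeps the candidate count well below the exponential number of arbitrary subgraphs.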
3. Characterizing the Null Distribution.
In the case where we have few samples, we cannot directly apply the χ² result. In these cases, the null distribution is instead characterized using permutation testing over all candidate subgraphs. For each subgraph, the input data are permuted a number of times to generate a good representation of the distribution at that subgraph. All normalized (but not size-corrected) scan statistics are then calculated for all permutations across all subsets, and combined to create the null distribution.
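Schematically, the permutation step looks as follows (the statistic function is a placeholder for the paper's normalized scan statistic, and pooling the per-permutation maxima is one common way to combine statistics across subsets; both are our assumptions):

```python
import numpy as np

def permutation_null(data, labels, subgraphs, stat_fn, n_perm=1000, seed=0):
    # Shuffle the group labels and recompute the normalized statistic on
    # every candidate subgraph; keep the per-permutation maximum.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(labels)
        null[b] = max(stat_fn(data, perm, sg) for sg in subgraphs)
    return null

def threshold(null, alpha=0.05):
    # Empirical (1 - alpha) quantile of the permutation null.
    return np.quantile(null, 1.0 - alpha)
```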
4. Calculating the Test Statistic.
For a specific subset of the data, the scan statistic is calculated and corrected as described in Section 4 of the main paper, over the original grouping of the data. For each group, the longitudinal-covariance GLM (7) is computed using the procedures in §

5. Region Identification.
We first identify all subsets whose statistic falls above the specified α-level threshold. Then the subset-collection procedure outlined in the main paper, developed by (Jeng et al., 2010), is applied, and the non-overlapping critical regions are output.
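The subset-collection step can be sketched as a greedy pass (our simplification in the spirit of (Jeng et al., 2010), not their exact procedure; `scored_regions` pairs each candidate node set with its corrected statistic):

```python
def select_regions(scored_regions, q_alpha):
    # scored_regions: iterable of (frozenset_of_nodes, corrected_statistic).
    # Keep the best significant region, discard overlapping candidates, repeat.
    significant = sorted((r for r in scored_regions if r[1] > q_alpha),
                         key=lambda r: r[1], reverse=True)
    chosen = []
    for nodes, score in significant:
        if all(nodes.isdisjoint(c) for c, _ in chosen):
            chosen.append((nodes, score))
    return chosen
```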
Numerical Considerations

In practice, the empirical covariance matrices calculated on the sample data may not be positive definite. The matrix can be rank deficient when we do not have enough linearly independent samples; in addition, we may use a rank correlation matrix in its place, which also may not be PD. To resolve this issue, we project the empirical covariance matrix onto the symmetric positive definite manifold SPD(n). We first apply a standard procedure for transforming a symmetric matrix into a symmetric positive semidefinite (SPSD) one. As described in (Wu et al., 2005), standard eigenvalue thresholding, or clipping, λ_SPSD = max(0, λ), is sensible since it provides the optimal projection of any symmetric matrix onto the SPSD manifold. Let Σ = UΛU^T be the eigenvalue decomposition of the matrix Σ. The SPSD projection of Σ is then

proj_SPSD(Σ) = U diag( max(λ_1, 0), ..., max(λ_n, 0) ) U^T.

To project onto the SPD(n) manifold, we can then simply add a small epsilon to each diagonal element:

proj_SPD(Σ) = U diag( max(λ_1, 0), ..., max(λ_n, 0) ) U^T + εI    (36)

A remark on the term εI is useful here. We find in experiments that numerical problems can arise if the smallest eigenvalue of the projected matrix is too small. By iteratively adding a small ε until the smallest eigenvalue is above our threshold, we ensure that the matrix is positive definite for the exponential and logarithmic maps, which are necessary for moving back and forth between the manifold and the tangent space.
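A minimal sketch of the clipping-plus-εI projection (36) (function and parameter names are ours):

```python
import numpy as np

def project_to_spd(sigma, min_eig=1e-8):
    # Eigenvalue clipping (the optimal SPSD projection), then add eps*I,
    # growing eps until the smallest eigenvalue clears min_eig, as in (36).
    sym = (sigma + sigma.T) / 2.0          # symmetrize against round-off
    w, u = np.linalg.eigh(sym)
    spsd = (u * np.maximum(w, 0.0)) @ u.T  # U diag(max(lambda, 0)) U^T
    eps = min_eig
    out = spsd + eps * np.eye(sigma.shape[0])
    while np.linalg.eigvalsh(out).min() < min_eig:
        eps *= 10.0
        out = spsd + eps * np.eye(sigma.shape[0])
    return out
```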
A note on localization accuracy

In addition to checking whether we correctly answer the hypothesis test of group difference, it is important that when significance is found, it is found in the features that were originally used to generate the data. Using the same simulation setup as before, we take the union of all subsets returned as significant and check whether each of the truly changing features p_t is contained within this superset.

In this particular case, we find that our localization depends only on the graphical lasso procedure used to generate the oracle graph. As long as the specified sparsity is large enough to include at least p_t edges, we find that in every simulation where a significant difference is detected, the features that express the difference are a superset of the true features.

Appendix C. Preclinical AD Extended Details and Results.
Data and Variable Descriptions
In our neuroimaging experiments, a large number of our features describe specific, localized regions of the brain across multiple imaging modalities. Below we list and describe the regions for each modality, and give a brief background on each of the methods used to acquire the data. We also include the list of cognitive scores used in our analysis.
PET Imaging
Positron emission tomography has become an increasingly popular method of imaging the brain, specifically in areas where cognitive decline can be strongly correlated with the specific matter being imaged. Pittsburgh compound B (PiB) was used as the tracer for these images, and the 16 mirrored (left and right) regions listed below were selected as strongly correlated with the development and progression of Alzheimer's disease.
1. PiB Angular L/R
2. PiB Cingulum Ant L/R
3. PiB Cingulum Post L/R
4. PiB Frontal Med Orb L/R
5. PiB Precuneus L/R
6. PiB SupraMarginal L/R
7. PiB Temporal Mid L/R
8. PiB Temporal Sup L/R
The average of the voxel values in each ROI (region of interest) of the brain is used as an imaging feature. The 16 regions are highlighted in Figure 8.

Figure 8: 16 Positron Emission Tomography (PET) regions.
DTI Imaging
Diffusion tensor imaging is used to measure the restricted diffusion of water through and about regions of the brain. The 48 regions here are the aggregated measurements of the total rates of diffusion for each voxel in that region. The two measurements, Fractional Anisotropy (FA) and Mean Diffusivity (MD), collectively describe the diffusion in a specific region well. The following is the full list of regions used in our analysis. Regions that spanned both the left and right sides of the brain are indicated as such, and were treated as separate and independent in our analyses.
1. Middle cerebellar peduncle
2. Pontine crossing tract (a part of MCP)
3. Genu of corpus callosum
4. Body of corpus callosum
5. Splenium of corpus callosum
6. Fornix (column and body of fornix)
7. Corticospinal tract R/L
8. Medial lemniscus R/L
9. Inferior cerebellar peduncle R/L
10. Superior cerebellar peduncle R/L
11. Cerebral peduncle R/L
12. Anterior limb of internal capsule R/L
13. Posterior limb of internal capsule R/L
14. Retrolenticular part of internal capsule R/L
15. Anterior corona radiata R/L
16. Superior corona radiata R/L
17. Posterior corona radiata R/L
18. Posterior thalamic radiation (include optic radiation) R/L
19. Sagittal stratum (include inferior longitudinal fasciculus and inferior fronto-occipital fasciculus) R/L
20. External capsule R/L
21. Cingulum (cingulate gyrus) R/L
22. Cingulum (hippocampus) R/L
23. Fornix (cres) / Stria terminalis (cannot be resolved with current resolution) R/L
24. Superior longitudinal fasciculus R/L
25. Superior fronto-occipital fasciculus (could be a part of anterior internal capsule) R/L
26. Uncinate fasciculus R/L
27. Tapetum R/L
Cognitive Evaluations
The battery of cognitive test scores in our analysis included a breadth of evaluations chosen specifically for their coverage of various measures of cognition. Among all tests given to the cohort, the following 17 were selected by expert clinicians and researchers in the field for their coverage and their potential value in understanding trends across groups.
1. WAIS-III Digit Span Forward Raw Score
2. WAIS-III Digit Span Backward Raw Score
3. WAIS-III Letter-Number Sequencing Raw Score
4. COWAT CFL Score
5. Boston Naming Test Total Score
6. RAVLT Learning Trial A1 Raw Score
7. RAVLT Learning Trial A2 Raw Score
8. RAVLT Learning Trial A3 Raw Score
9. RAVLT Learning Trial A4 Raw Score
10. RAVLT Learning Trial A5 Raw Score
11. RAVLT Learning Trial A6 Raw Score
12. RAVLT Delayed Recall Raw Score
13. Stroop Word/Color-Word Scaled Score
14. Trail-Making Test Part A
15. Trail-Making Test Part B
16. Clock Drawing Test Score
17. Center for Epidemiologic Studies Depression Scale Score
WAIS-III.
This is the most widely used IQ test. The Digit Span examination is specifically meant to evaluate an individual's working memory: participants are required to recall a series of numbers in order, both forwards and backwards. Letter-Number Sequencing reflects a similar idea, but with a mix of numbers and letters recalled in increasing and alphabetical order, and is meant to be an indicator of more complex mental control (Wechsler, 2014).
Rey Auditory Visual Learning Test.
This test is specifically meant to evaluate all aspects of memory. Each trial evaluates a different type of memory, ranging from short-term and working memory to procedural and episodic memory (Schmidt et al., 1996).
Trail-Making Test.
This is a very popular test for providing information about executive function in the brain. The test consists of drawing lines among a randomly generated set of points in a square, where each point is labeled with a number. In Part A, participants must 'connect the dots' in increasing numerical order, and in Part B in increasing numerical and alphabetical order. The score on the test is primarily dictated by the time in seconds it takes to complete the task for 25 of these 'dots.' More background information and normative analyses can be found in (Tombaugh, 2004).

Other tests similarly measure various aspects of cognitive function. While the Depression Scale Score did not crop up in any of our analyses here, it has been shown that depression is strongly associated with AD-related decline (Wragg and Jeste, 1989).

Amyloid Load (PiB Positivity)

Table 6: Group difference across Amyloid Load (PiB Positivity)
Set 1: PiB Angular L/R; PiB Cingulum Ant L/R; PiB Cingulum Post L/R; PiB Frontal Med Orb L/R; PiB Precuneus L/R; PiB Temporal Sup L/R; PiB Temporal Mid L/R; PiB SupraMarginal L
Set 2: FA Cerebral peduncle R; FA Cerebral peduncle L; MD Corticospinal tract R; MD Corticospinal tract L; Trail-Making Test Part A Score; MD Cerebral peduncle R; PET Cingulum Post R

Gender

Table 7: Group difference in gender
Set 1: Rey Audio and Verbal Learning Test; FA Cingulum L; FA Medial lemniscus L; FA Cingulum (hippocampus) L
Set 2: FA Posterior thalamic radiation (include optic radiation) L
Set 3: FA Corticospinal tract R; FA Superior fronto-occipital fasciculus R
Detailed Imaging with Cognitive Tests Results
In the following tables we provide additional details of the statistical tests we performed on the preclinical AD cohort. Each set contains the features found to display a significant group difference (at the p ≤ 0.05 level) along the covariance trajectory, divided by the group variable indicated.

While some of these associations are well known, a few have been indicated by AD researchers and clinicians as novel and of interest for further analysis.
Genotype: APOE4

Table 8: Group difference across Genotype APOE4 expression
Set 1: Digit Span Backward Raw Score; Stroop Color-Word; PiB Cingulum Post L; PiB Cingulum Post R; PiB Frontal Med Orb L; PiB Frontal Med Orb R; PiB Precuneus L; PiB Precuneus R; PiB SupraMarginal; PiB Temporal Mid R

Consensus Conference

Table 9: Group difference across Expert MCI Diagnosis
Set 2: Digit Span Backward Raw Score; Stroop Color-Word; PiB Cingulum Post L; PiB Cingulum Post R; PiB Frontal Med Orb L; PiB Frontal Med Orb R; PiB Precuneus L; PiB Precuneus R; PiB SupraMarginal; PiB Temporal Mid R

Detailed results on larger cohort with only cognitive scores

We also applied our method to a larger cohort consisting of approximately 1500 subjects with varying temporal measurements on the battery of cognitive tests. Each individual had approximately 3 visits' worth of data, so our total number of measurements was approximately n = 4000. In addition to the groupings used above, we were able to use an algorithmic cognitive impairment (ACI) measure to further evaluate the model against a factor known to be group-separating. Below are the tabulated feature sets identified by our model for each of the group separations described in the main paper. In this case, to increase the interpretability of the results, we limited our search to groups of 3-6 features.

When grouped by genotype, the most indicative subset is shown in Table 11. These tests are most closely associated with memory, and we see that no tests of executive function or spatial ability (Trail-Making or Clock Drawing) were included.

In addition to the algorithmic measure of impairment, a conference of expert clinicians and researchers gave each individual a clinical impairment diagnosis each time they underwent the cognitive battery. Using this as a group separator, we found a large number of overlapping subsets that displayed a significant group difference at the p = 0.05 level; these are shown in Table 12. Trail-Making Test Parts A and B appeared in all identified subsets.
Algorithmic Cognitive Impairment

Table 10: Group Difference Localization Across Algorithmic Impairment
Set 1: Boston Naming Test Total Score; RAVLT Learning Trial A1 Raw Score; RAVLT Learning Trial A6 Raw Score

Genotype: ApoE4

Table 11: Group Difference Localization Across ApoE4 Genotype
Set 1: WAIS-III Digit Span Backward Raw Score; RAVLT Learning Trial A3 Raw Score; RAVLT Learning Trial A4 Raw Score; RAVLT Learning Trial A5 Raw Score
Expert Consensus Measure

Table 12: Group Difference Localization Across Expert Clinical Diagnosis
WAIS-III Letter-Number Sequencing Raw Score; Boston Naming Test Total Score; RAVLT Learning Trial A2 Raw Score; RAVLT Learning Trial A3 Raw Score; RAVLT Learning Trial A4 Raw Score; RAVLT Learning Trial A5 Raw Score; RAVLT Learning Trial A6 Raw Score; RAVLT Delayed Recall Raw Score; Trail-Making Test Part A; Trail-Making Test Part B; Clock Drawing Test Score; Center for Epidemiologic Studies Depression Scale Score

Figure 10: Histograms of the RAVLT Delayed Recall Raw Scores across all time points for the ~1500-subject cohort, split by ApoE status, gender, family history (FH), and clinical impairment status. [Plot data omitted.]

Appendix D. Differential Geometry Basics and Notes.

We briefly introduce notions used in the main paper. For more details, we refer the reader to (Do Carmo, 1992; Lee, 2003; Spivak, 1981).
Differentiable manifold. A differentiable (smooth) manifold of dimension n is a set M together with a maximal family of injective mappings ϕ_i : U_i ⊂ R^n → M of open sets U_i of R^n into M such that:

1. ∪_i ϕ_i(U_i) = M;
2. for any pair i, j with ϕ_i(U_i) ∩ ϕ_j(U_j) = W ≠ ∅, the sets ϕ_i^{-1}(W) and ϕ_j^{-1}(W) are open sets in R^n and the mappings ϕ_j^{-1} ∘ ϕ_i are differentiable, where ∘ denotes function composition;
3. the family {(U_i, ϕ_i)} is maximal relative to conditions (1) and (2).

Roughly speaking, a differentiable (smooth) manifold M is a topological space that is locally similar to Euclidean space and has a globally defined differential structure.

Tangent space (T_p M). The tangent space at p ∈ M is the vector space consisting of the tangent vectors of all possible curves passing through p.

Tangent bundle (TM). The tangent bundle of M is the disjoint union of the tangent spaces at all points of M, TM = ⨿_{p ∈ M} T_p M. The tangent bundle is equipped with a natural projection map π : TM → M.

Riemannian manifold. A Riemannian manifold is a manifold equipped with a smoothly varying metric (inner product), which is called the Riemannian metric. Various geometric notions, e.g., the angle between two curves or the length of a curve, can then be extended to the manifold.
Geodesic curves.
A geodesic curve on a Riemannian manifold is a locally shortest (distance-minimizing) curve. Geodesics are analogous to straight lines in Euclidean space and are the main object used to generalize linear models to Riemannian manifolds.
Geodesic distance.
The geodesic distance between two points on M is the length of the shortest geodesic curve connecting the two points. More generally, the distance between two points on a Riemannian manifold is defined as the infimum of the lengths of all differentiable curves connecting the two points. Let γ : [a, b] → M be a continuously differentiable curve between p and q in M, and let g be the metric tensor on M. Then, formally, the distance between p and q is defined as

d(p, q) := inf_γ ∫_a^b √( g_{γ(t)}( γ̇(t), γ̇(t) ) ) dt    (37)

where γ(a) = p and γ(b) = q.

Exponential map. An exponential map is a map from a tangent space T_p M to M, which is usually only locally defined, owing to the existence and uniqueness of the ordinary differential equation defining it. The geodesic curve from y_i to y_j can be parameterized by a tangent vector in the tangent space at y_i via the exponential map Exp(y_i, ·) : T_{y_i} M → M.

Logarithm map.
The inverse of the exponential map is the logarithm map, Log(y_i, ·) : M → T_{y_i} M. For completeness, Table 13 shows corresponding operations in Euclidean space and on Riemannian manifolds. In the main paper, for readability when operations are multiply nested, the exponential map and its inverse logarithm map are denoted by Exp(p, v) and Log(p, x) respectively, where p, x ∈ M and v ∈ T_p M. They are usually denoted exp_p(v) and log_p(x) in most differential geometry books.

Separate from the above notation, the matrix exponential, i.e., exp(X) := Σ_k (1/k!) X^k, where 0! = 1 and X^0 = I, and the matrix logarithm are denoted by exp(·) and log(·).

Intrinsic mean.
Let d(·, ·) denote the distance between two points. The intrinsic (or Karcher) mean is the minimizer of

ȳ = arg min_{y ∈ M} Σ_{i=1}^{N} d²(y, y_i),    (38)

which may be an arithmetic, geometric, or harmonic mean depending on d(·, ·). A Karcher mean is a local minimum of (38); a global minimum is referred to as a Fréchet mean. On manifolds, the Karcher mean satisfies Σ_{i=1}^{N} Log(ȳ, y_i) = 0.

Algorithm 1: Karcher mean
  Input: y_1, ..., y_N ∈ M, step size α
  Output: ȳ ∈ M
  ȳ_0 = y_1
  while ‖ Σ_{i=1}^{N} Log(ȳ_k, y_i) ‖ > ε do
    Δȳ = (α/N) Σ_{i=1}^{N} Log(ȳ_k, y_i)
    ȳ_{k+1} = Exp(ȳ_k, Δȳ)
  end while

Figure 11: Karcher mean on manifolds.

The identity Σ_{i=1}^{N} Log(ȳ, y_i) = 0 is the first-order necessary condition of (38), i.e., ȳ is a local minimum with a zero-norm gradient (Karcher, 1977). In general, on manifolds, the existence and uniqueness of the Karcher mean is not guaranteed unless we assume, for uniqueness, that the data lie in a small neighborhood.
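For SPD matrices, Algorithm 1 can be made concrete as follows (a sketch with our own function names; the Exp and Log maps are the closed forms for SPD(n) given further below, implemented via eigendecomposition):

```python
import numpy as np

def _eig_apply(p, fn):
    # Apply a scalar function to the eigenvalues of a symmetric matrix:
    # returns U diag(fn(w)) U^T for p = U diag(w) U^T.
    w, u = np.linalg.eigh(p)
    return (u * fn(w)) @ u.T

def spd_exp(p, v):
    # Exp(p, v) = p^{1/2} exp(p^{-1/2} v p^{-1/2}) p^{1/2}
    ph = _eig_apply(p, np.sqrt)
    phi = _eig_apply(p, lambda w: 1.0 / np.sqrt(w))
    return ph @ _eig_apply(phi @ v @ phi, np.exp) @ ph

def spd_log(p, q):
    # Log(p, q) = p^{1/2} log(p^{-1/2} q p^{-1/2}) p^{1/2}
    ph = _eig_apply(p, np.sqrt)
    phi = _eig_apply(p, lambda w: 1.0 / np.sqrt(w))
    return ph @ _eig_apply(phi @ q @ phi, np.log) @ ph

def karcher_mean(mats, alpha=1.0, tol=1e-10, max_iter=200):
    # Algorithm 1: gradient descent on the sum of squared geodesic distances.
    y = mats[0]
    for _ in range(max_iter):
        delta = sum(spd_log(y, m) for m in mats) / len(mats)
        if np.linalg.norm(delta) < tol:
            break
        y = spd_exp(y, alpha * delta)
    return y
```

For two commuting SPD matrices, this iteration converges to their matrix geometric mean, as expected under the affine-invariant metric.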
Parallel transport. Let M be a differentiable manifold with an affine connection ∇, and let I be an open interval. Let c : I → M be a differentiable curve in M, and let V_0 be a tangent vector in T_{c(t_0)} M, where t_0 ∈ I. Then there exists a unique parallel vector field V along c such that V(t_0) = V_0. Here, V(t) is called the parallel transport of V_0 along c.

Table 13: Basic operations in Euclidean space and on Riemannian manifolds.
Operation   | Euclidean                        | Riemannian
Subtraction | x_i x_j = x_j − x_i              | x_i x_j = Log(x_i, x_j)
Addition    | x_i + x_j x_k                    | Exp(x_i, x_j x_k)
Distance    | ‖x_i x_j‖                        | ‖Log(x_i, x_j)‖
Mean        | Σ_{i=1}^n (x_i − x̄) = 0          | Σ_{i=1}^n Log(x̄, x_i) = 0
Covariance  | E[(x_i − x̄)(x_i − x̄)^T]          | E[Log(x̄, x) Log(x̄, x)^T]

Geometry of SPD manifolds
Covariance matrices are symmetric positive definite matrices. Let SPD(n) be the manifold of symmetric positive definite matrices of size n × n. It forms the quotient space GL(n)/O(n), where GL(n) denotes the general linear group (the group of n × n nonsingular matrices) and O(n) the orthogonal group (the group of n × n orthogonal matrices). The inner product of two tangent vectors u, v ∈ T_p M is given by

⟨u, v⟩_p = tr( p^{-1/2} u p^{-1} v p^{-1/2} )    (39)

This plays the role of the Fisher-Rao metric in the statistical model of multivariate distributions. Here, T_p M, the tangent space at p (which is a vector space), is the space of symmetric matrices of dimension (n+1)n/2. The geodesic distance is

d(p, q) = √( tr( log²( p^{-1/2} q p^{-1/2} ) ) ).

The exponential map and logarithm map are given as

Exp(p, v) = p^{1/2} exp( p^{-1/2} v p^{-1/2} ) p^{1/2},   Log(p, q) = p^{1/2} log( p^{-1/2} q p^{-1/2} ) p^{1/2}.    (40)

Let p, q be in SPD(n) and w ∈ T_p M a tangent vector. The tangent vector in T_q M that is the parallel transport of w along the shortest geodesic from p to q is given by

Γ_{p→q}(w) = p^{1/2} r p^{-1/2} w p^{-1/2} r p^{1/2}, where r = exp( p^{-1/2} (v/2) p^{-1/2} ) and v = Log(p, q)    (41)
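As a numerical sanity check of (39)-(41) (helper names are ours), the closed-form transport should preserve the metric, i.e., ⟨Γ_{p→q}(w), Γ_{p→q}(w)⟩_q = ⟨w, w⟩_p:

```python
import numpy as np

def _eig_fn(p, fn):
    # U diag(fn(w)) U^T for a symmetric matrix p = U diag(w) U^T.
    w, u = np.linalg.eigh(p)
    return (u * fn(w)) @ u.T

def inner(p, u_vec, v_vec):
    # <u, v>_p = tr(p^{-1/2} u p^{-1} v p^{-1/2}) = tr(p^{-1} u p^{-1} v), eq. (39).
    pinv = np.linalg.inv(p)
    return np.trace(pinv @ u_vec @ pinv @ v_vec)

def transport(p, q, w_vec):
    # Gamma_{p->q}(w) = p^{1/2} r p^{-1/2} w p^{-1/2} r p^{1/2},
    # with r = exp(p^{-1/2} (v/2) p^{-1/2}) and v = Log(p, q), eq. (41).
    ph = _eig_fn(p, np.sqrt)
    phi = _eig_fn(p, lambda w: 1.0 / np.sqrt(w))
    v = ph @ _eig_fn(phi @ q @ phi, np.log) @ ph  # Log(p, q), eq. (40)
    r = _eig_fn(phi @ (v / 2.0) @ phi, np.exp)
    return ph @ r @ phi @ w_vec @ phi @ r @ ph
```

Transporting a vector from p back to p returns it unchanged, and the inner product is invariant under transport, matching the isometry property of the affine-invariant connection.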
References

Marilyn S Albert, Mark B Moss, Rudolph Tanzi, and Kenneth Jones. Preclinical prediction of AD using neuropsychological tests. Journal of the International Neuropsychological Society, 7(05):631-639, 2001.

Marilyn S Albert, Steven T DeKosky, Dennis Dickson, Bruno Dubois, Howard H Feldman, Nick C Fox, Anthony Gamst, David M Holtzman, William J Jagust, Ronald C Petersen, et al. The diagnosis of mild cognitive impairment due to Alzheimer's disease: Recommendations from the National Institute on Aging-Alzheimer's Association workgroups on diagnostic guidelines for Alzheimer's disease. Alzheimer's & Dementia, 7(3):270-279, 2011.

E. Arias-Castro, E. J. Candès, et al. Detection of an anomalous cluster in a network. The Annals of Statistics, pages 278-304, 2011.

Sanjeev Arora and Boaz Barak. Computational Complexity: A Modern Approach. Cambridge University Press, 2009.

M. Banerjee, R. Chakraborty, et al. Nonlinear regression on Riemannian manifolds and its applications to neuro-image analysis. In MICCAI, pages 719-727, 2015.

Eugene Belilovsky, Gaël Varoquaux, and Matthew B Blaschko. Hypothesis testing for differences in Gaussian graphical models: Applications to brain connectivity. arXiv preprint arXiv:1512.08643, 2015.

T. Cai, W. Liu, et al. A constrained ℓ1 minimization approach to sparse precision matrix estimation. JASA, 106(494):594-607, 2011.

EH Corder, AM Saunders, WJ Strittmatter, DE Schmechel, PC Gaskell, GW Small, AD Roses, JL Haines, and MA Pericak-Vance. Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset families. Science, 261(5123):921-923, 1993.

E. Cornea, H. Zhu, et al. Regression models on Riemannian symmetric spaces. JRSS-B, 2016.

P. Danaher, P. Wang, et al. The joint graphical LASSO for inverse covariance estimation across multiple classes. JRSS-B, 76(2):373-397, 2014.

Burcu F Darst, Rebecca L Koscik, Annie M Racine, Jennifer M Oh, Rachel A Krause, Cynthia M Carlsson, Henrik Zetterberg, Kaj Blennow, Bradley T Christian, Barbara B Bendlin, et al. Pathway-specific polygenic risk scores as predictors of amyloid-β deposition and cognitive function in a sample at increased risk for Alzheimer's disease. Journal of Alzheimer's Disease, 55(2):473-484, 2017.

P Laurie Davies and Arne Kovac. Local extremes, runs, strings and multiresolution. Annals of Statistics, pages 1-48, 2001.

M. P. Do Carmo. Riemannian Geometry. Springer, 1992.

Gwenaëlle Douaud, Saâd Jbabdi, Timothy EJ Behrens, Ricarda A Menke, Achim Gass, Andreas U Monsch, Anil Rao, Brandon Whitcher, Gordon Kindlmann, Paul M Matthews, et al. DTI measures in crossing-fibre areas: increased diffusion anisotropy reveals early white matter alteration in MCI and mild Alzheimer's disease. NeuroImage, 55(3):880-890, 2011.

J. Du, A. Goh, S. Kushnarev, and A. Qiu. Geodesic regression on orientation distribution functions with its application to an aging study. NeuroImage, 87:416-426, 2014.

J. Fan, X. Han, et al. Control of the FDR under arbitrary covariance dependence. JASA, 2012.

P. T. Fletcher. Geodesic regression and the theory of least squares on Riemannian manifolds. IJCV, 105(2):171-185, 2013.

T. P. Fletcher and S. Joshi. Riemannian geometry for the statistical analysis of diffusion tensor data. Signal Processing, 87(2):250-262, 2007.

Laura Fratiglioni, M Grut, Y Forsell, M Viitanen, M Grafström, K Holmen, K Ericsson, L Bäckman, A Ahlbom, and B Winblad. Prevalence of Alzheimer's disease and other dementias in an elderly urban population: relationship with age, sex, and education. Neurology, 41(12):1886-1886, 1991.

J. Friedman, T. Hastie, et al. Sparse inverse covariance estimation with the graphical LASSO. Biostatistics, 9(3):432-441, 2008.

S. A. Geer. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000.

Wolfgang Hardle and Enno Mammen. Comparing nonparametric versus parametric regression fits. The Annals of Statistics, pages 1926-1947, 1993.

John Hardy and Dennis J Selkoe. The amyloid hypothesis of Alzheimer's disease: progress and problems on the road to therapeutics. Science, 297(5580):353-356, 2002.

John A Hardy and Gerald A Higgins. Alzheimer's disease: the amyloid cascade hypothesis. Science, 256(5054):184, 1992.

Yi Hong, Nikhil Singh, Roland Kwitt, and Marc Niethammer. Group testing for longitudinal data. In International Conference on Information Processing in Medical Imaging, pages 139-151. Springer, 2015.

Clifford R Jack Jr, Heather J Wiste, Prashanthi Vemuri, Stephen D Weigand, Matthew L Senjem, Guang Zeng, Matt A Bernstein, Jeffrey L Gunter, Vernon S Pankratz, Paul S Aisen, et al. Brain beta-amyloid measures and magnetic resonance imaging atrophy both predict time-to-progression from mild cognitive impairment to Alzheimer's disease. Brain, 133(11):3336-3348, 2010.

S. Jayasumana et al. Kernel methods on the Riemannian manifold of SPD matrices. In CVPR, 2013.

Jessie Jeng, T Tony Cai, and Hongzhe Li. Optimal sparse segment identification with application in copy number variation analysis.
Journal of the American StatisticalAssociation , 105(491):1156–1166, 2010.M.I. Jordan.
Learning in graphical models , volume 89. Springer Science & Business Media,1998.Hermann Karcher. Riemannian center of mass and mollifier smoothing.
Communications onpure and applied mathematics , 30(5):509–541, 1977.Hyunwoo J Kim, Nagesh Adluru, et al. Multivariate general linear models on Riemannianmanifolds with applications to statistical analysis of diffusion weighted images. In
CVPR ,pages 2705–2712, 2014.D. Koller and N. Friedman.
Probabilistic graphical models: principles and techniques . MITpress, 2009.Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional bymodel selection.
Annals of Statistics , pages 1302–1338, 2000.S.L. Lauritzen.
Graphical models . Clarendon Press, 1996.John M Lee. Smooth manifolds. In
Introduction to Smooth Manifolds , pages 1–29. Springer,2003.John M Lee.
Introduction to smooth manifolds . Springer, 2012.Muriel Deutsch Lezak.
Neuropsychological assessment . Oxford University Press, USA, 2004.Yi-Cheng Lin, Yao-Chia Shih, Wen-Yih I Tseng, Yu-Hsiu Chu, Meng-Tien Wu, Ta-FuChen, Pei-Fang Tang, and Ming-Jang Chiu. Cingulum correlates of cognitive functions inpatients with mild cognitive impairment and early alzheimers disease: a diffusion spectrumimaging study.
Brain topography , 27(3):393–402, 2014.H. Liu, J. Lafferty, et al. The nonparanormal: Semiparametric estimation of high dimensionalundirected graphs.
JMLR , 10:2295–2328, 2009.Han Liu, Fang Han, et al. High-dimensional semiparametric gaussian copula graphicalmodels.
The Annals of Statistics , 40(4):2293–2326, 2012.J.J. McArdle and R.Q. Bell. An introduction to latent growth models for developmentaldata analysis. 2000.S. Mori, K. Oishi, et al. Stereotaxic white matter atlas based on diffusion tensor imaging inan icbm template.
Neuroimage , 40(2):570–582, 2008.Elizabeth C Mormino, Rebecca A Betensky, Trey Hedden, Aaron P Schultz, Andrew Ward,Willem Huijbers, Dorene M Rentz, Keith A Johnson, Reisa A Sperling, Alzheimer’sDisease Neuroimaging Initiative, et al. Amyloid and apoe ε Neurology , 82(20):1760–1767, 2014.43rasanna Muralidharan and P Thomas Fletcher. Sasaki metrics for analysis of longitudinaldata on manifolds. In
Computer Vision and Pattern Recognition (CVPR), 2012 IEEEConference on , pages 1027–1034. IEEE, 2012.Thomas P Naidich, Henri M Duvernoy, Bradley N Delman, A Gregory Sorensen, Spyros SKollias, and E Mark Haacke.
Duvernoy’s atlas of the human brain stem and cerebellum:high-field MRI, surface anatomy, internal structure, vascularization and 3 D sectionalanatomy . Springer Science & Business Media, 2009.H. Qiu, F. Han, et al. Joint estimation of multiple graphical models from high dimensionaltime series.
JRSS-B , 2015.A. M Racine, N. Adluru, et al. Associations between white matter microstructure andamyloid burden in preclinical AD: a multimodal imaging investigation.
NeuroImage:Clinical , 4:604–614, 2014.Lars M Rimol, Ingrid Agartz, Srdjan Djurovic, Andrew A Brown, J Cooper Roddey, Anna KK¨ahler, Morten Mattingsdal, Lavinia Athanasiu, Alexander H Joyner, Nicholas J Schork,et al. Sex-dependent association of common variants of microcephaly genes with brainstructure.
Proceedings of the National Academy of Sciences , 107(1):384–388, 2010.Charles S Roehrig. Conditions for identification in nonparametric and parametric models.
Econometrica: Journal of the Econometric Society , pages 433–447, 1988.Mark A Sager, Bruce Hermann, and Asenath La Rue. Middle-aged children of persons withalzheimers disease: Apoe genotypes and cognitive function in the wisconsin registry foralzheimers prevention.
Journal of geriatric psychiatry and neurology , 18(4):245–249, 2005.Michael Schmidt et al.
Rey auditory verbal learning test: a handbook . Western PsychologicalServices Los Angeles, 1996.George AF Seber and Alan J Lee. Linear regression analysis. hoboken.
NJ: Wiley. doi , 10:9780471722199, 2003.S. Sommer, F. Lauze, et al. Optimization over geodesics for exact principal geodesic analysis.
Adv. in Comp. Math. , 40(2):283–313, 2014.M. Spivak.
Comprehensive introduction to differential geometry . Publish or Perish, Inc.,1981.Mark Stehr. The effect of cigarette taxes on smoking among men and women.
HealthEconomics , 16(12):1333–1343, 2007. ISSN 1099-1050. doi: 10.1002/hec.1223. URL http://dx.doi.org/10.1002/hec.1223 .Nicolas Stdler and Sach Mukherjee. Two-sample testing in high-dimensional models, 2012.Michel Talagrand.
The generic chaining: upper and lower bounds of stochastic processes .Springer Science & Business Media, 2006.Rudolph E Tanzi and Lars Bertram. Twenty years of the alzheimers disease amyloidhypothesis: a genetic perspective.
Cell , 120(4):545–555, 2005.44om N Tombaugh. Trail making test a and b: normative data stratified by age and education.
Archives of clinical neuropsychology , 19(2):203–214, 2004.Oncel Tuzel, Fatih Porikli, and Peter Meer. Human detection via classification on Rieman-nian manifolds. In
Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEEConference on , pages 1–8. IEEE, 2007.J.B. Ullman and P.M. Bentler.
Structural equation modeling . Wiley Online Library, 2003.G. Walther et al. Optimal and fast detection of spatial clusters with scan statistics.
TheAnnals of Statistics , 38(2):1010–1033, 2010.S. Wang, J. Fan, et al. Structured correlation detection with application to colocalizationanalysis in dual-channel fluorescence microscopic imaging. arXiv preprint arXiv:1604.02158 ,2016.David Wechsler. Wechsler adult intelligence scale–fourth edition (wais–iv), 2014.Robin E Wragg and Dilip V Jeste. Overview of depression and psychosis in alzheimersdisease.
Am J Psychiatry , 146(5):577–587, 1989.G. Wu, E.Y. Chang, et al. An analysis of transformation on non-positive semidefinitesimilarity matrix for kernel machines. In
ICML , volume 8, 2005.Y. Xie, B.C. Vemuri, et al. Statistical analysis of tensor fields. In
MICCAI , pages 682–689.Springer, 2010.L. Xue and H.and others Zou. Regularized rank-based estimation of high-dimensionalnonparanormal graphical models.
The Annals of Statistics , 40(5):2541–2571, 2012.S. Yang, Z. Lu, et al. Fused multiple graphical lasso.
SIAM J. Opt. , 25(2):916–943, 2015.M. Yuan. High dimensional inverse covariance matrix estimation via LP.