Greedy low-rank algorithm for spatial connectome regression
Patrick Kürschner, Sergey Dolgov, Kameron Decker Harris, Peter Benner
∗ KU Leuven, Electrical Engineering (ESAT/STADIUS), Campus Kulak Kortrijk ([email protected]).
† University of Bath, Mathematical Sciences ([email protected]).
‡ University of Washington, Computer Science & Engineering ([email protected]).
§ Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg ([email protected]).

November 4, 2019
Abstract
Recovering brain connectivity from tract tracing data is an important computational problem in the neurosciences. Mesoscopic connectome reconstruction was previously formulated as a structured matrix regression problem [20], but existing techniques do not scale to the whole-brain setting. The corresponding matrix equation is challenging to solve due to large scale, ill-conditioning, and a general form that lacks a convergent splitting. We propose a greedy low-rank algorithm for the connectome reconstruction problem in very high dimensions. The algorithm approximates the solution by a sequence of rank-one updates which exploit the sparse and positive definite problem structure. This algorithm was described previously [27] but never implemented for this connectome problem, leading to a number of challenges. We had to design judicious stopping criteria and employ efficient solvers for the three main sub-problems of the algorithm, including an efficient GPU implementation that alleviates the main bottleneck for large datasets. The performance of the method is evaluated on three examples: an artificial "toy" dataset and two whole-cortex instances using data from the Allen Mouse Brain Connectivity Atlas. We find that the method is significantly faster than previous methods and that moderate ranks offer a good approximation. This speedup allows for the estimation of increasingly large-scale connectomes across taxa as these data become available from tracing experiments. The data and code are available online.
Keywords:
Matrix equations, computational neuroscience, low-rank approximation, networks
1 Introduction

Neuroscience and machine learning are now enjoying a shared moment of intense interest and exciting progress. Many computational neuroscientists find themselves inspired by unprecedented datasets to develop innovative methods of analysis. Exciting examples of such next-generation experimental methodology and datasets are large-scale recordings and precise manipulations of brain activity, genetic atlases, and neuronal network tracing efforts. Thus, techniques which summarize many experiments into an estimate of the overall brain network are increasingly important. Many believe that uncovering such network structures will help us unlock the principles underlying neural computation and brain disorders [17]. Initial versions of such connectomes [26] are already being integrated into large-scale modeling projects [41]. We present a method which allows us to perform these reconstructions faster, for larger datasets.

Structural connectivity refers to the synaptic connections formed between the axons (outputs) and dendrites (inputs) of neurons, which allow them to communicate chemically and electrically. We represent such networks as a weighted, directed graph encoded by a nonnegative adjacency matrix $W$. The network of whole-brain connections, or connectome, is currently studied at a number of scales [46, 25]: Microscopic connectivity catalogues individual neuron connections but is currently restricted to small volumes due to the difficulty of tracing convoluted geometries [24]. Macroscopic connectivity refers to connections between larger brain regions and is currently known for a number of model organisms [7]. Mesoscopic connectivity [33] lies between these two extremes and captures projection patterns of groups of hundreds to thousands of neurons among the millions to billions of neurons in a typical mammalian brain.

Building on previous work [20, 26], we present a scalable method to infer spatially-resolved mesoscopic connectomes from tracing data. We apply our method to data from the Allen Mouse Brain Connectivity Atlas [35] to reconstruct mouse cortical connectivity. This resource is one of the most comprehensive publicly available datasets, but similar data are being collected for fly [23], rat [6], and marmoset [31], among others. Our focus is presenting and profiling an improved algorithm for connectome inference. By developing scalable methods as in this work, we hope to enable the reconstruction of high-resolution connectomes in these diverse organisms.

We focus on the mesoscale because it is naturally captured by viral tracing experiments (Figure 1). In these experiments, a virus is injected into a specific location in the brain, where it loads the cells with proteins that can then be imaged, tracing out the projections of those neurons with cell bodies located in the injection site. The source and target signals, within and outside of the injection sites, are measured as the fraction of fluorescing pixels within cubic voxels. These form the data matrices $X \in \mathbb{R}^{n_X \times n_{\mathrm{inj}}}$ and $Y \in \mathbb{R}^{n_Y \times n_{\mathrm{inj}}}$, where $n_X$ and $n_Y$ are the numbers of locations in the discretized source and target regions of the $d$-dimensional brain, and $n_{\mathrm{inj}}$ is the number of injections.
In general, $n_X$ and $n_Y$ may be unequal, e.g. if injections were only delivered to the right hemisphere of the brain. Each experiment only traces out the projections from that particular injection site. By performing many such experiments, with multiple mice, and varying the injection sites to cover the brain, one can then "stitch" together a mesoscopic connectome for the average mouse. We refer the interested reader to [35] for more details of the experimental procedures.

Figure 1: In this paper, we present an improved method for the mesoscale connectome inference problem. (A) The goal is to find a voxel-by-voxel matrix $W$ so that the pattern of neural projections $y$ arising from an injection $x$ is reproduced by matrix-vector multiplication, $y \approx Wx$. The vectors $x$ and $y$ contain the fraction of fluorescing pixels in each voxel from viral tracing experiments. (B) An example of the data, in this case a coronal slice from a tracing experiment delivered to primary motor cortex (MOp). Bright green areas are neural cells expressing the green fluorescent protein. (C) The raw data are preprocessed to separate the injection site (red/orange) from its projections (green). Fluorescence values in the injection site enter into the source vector $x$, whereas fluorescence everywhere else is stored in the entries of the target vector $y$. The $x$ and $y$ are discretized volume images of the brain reshaped into vector form. Entry $W_{ij}$ models the expected fluorescence at location $i$ from one unit of source fluorescence at location $j$, a linear operator mapping from source images to target images. Image credit (B and C): Allen Institute for Brain Science.

We present a new low-rank approach to solving the smoothness-regularized optimization problem posed by [20]. Specifically, they considered solving the regularized least squares problem

$$W^* = \arg\min_{W \ge 0}\ \underbrace{\tfrac{1}{2}\left\| P_\Omega(WX - Y) \right\|_F^2}_{\text{loss}} \;+\; \lambda \underbrace{\left\| L_y W + W L_x^\top \right\|_F^2}_{\text{regularization}}, \qquad (1)$$

where the minimum is taken over nonnegative matrices. The operator $P_\Omega$ denotes an entrywise (Hadamard) product, $P_\Omega(M) = M \circ \Omega$ for any matrix $M \in \mathbb{R}^{n_Y \times n_{\mathrm{inj}}}$, where $\Omega$ is a binary matrix, of the same size, which masks out the injection sites where the entries of $Y$ are unknown. We take the smoothing matrices $L_y \in \mathbb{R}^{n_Y \times n_Y}$ and $L_x \in \mathbb{R}^{n_X \times n_X}$ to be discrete Laplace operators, i.e. the graph Laplacians of the voxel face adjacency graphs for the discretized source and target regions. We choose a regularization parameter $\bar\lambda$ and set $\lambda = \bar\lambda\, n_{\mathrm{inj}} / n_X$ to avoid dependence on $n_X$, $n_Y$, and $n_{\mathrm{inj}}$, since the loss term is a sum over $n_Y \times n_{\mathrm{inj}}$ entries and the regularization sums over $n_Y \times n_X$ entries. Note that in this paper we take a different convention for $\Omega$ (the complement) than in [20].
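To make the pieces of (1) concrete, the objective can be evaluated in a few lines. The following Python sketch is our own illustration (the paper's released code is MATLAB, see Section 5); it assumes a dense candidate $W$, which is only feasible at small problem sizes.

    import numpy as np

    def objective(W, X, Y, Omega, Ly, Lx, lam):
        """J(W) = 0.5*||P_Omega(W X - Y)||_F^2 + lam*||Ly W + W Lx^T||_F^2."""
        residual = Omega * (W @ X - Y)      # P_Omega masks the injection sites
        loss = 0.5 * np.sum(residual ** 2)
        smooth = Ly @ W + (Lx @ W.T).T      # smoothing term (Lx, Ly symmetric Laplacians)
        return loss + lam * np.sum(smooth ** 2)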
We now comment on the typical parameters for problem (1). The mouse brain gridded at 100 µm resolution contains approximately $n_X, n_Y \in O(10^5)$ voxels in 3-D. On the other hand, the number of experiments $n_{\mathrm{inj}}$ is less than $O(10^3)$. By projecting the 3-D cortical data into 2-D, as we do in this paper, we can reduce the size by an order of magnitude to $n_X, n_Y \in O(10^4)$, but focusing on the cortex reduces $n_{\mathrm{inj}}$ to $O(10^2)$. Since $n_{\mathrm{inj}} \ll n_X, n_Y$, least squares estimation of $W$ (i.e. $\lambda = 0$) is highly underdetermined and will remain underdetermined unless orders of magnitude more tracing experiments are performed. Regularization is thus essential for filling the gaps in injection coverage. Furthermore, the vast size of the $n_Y \times n_X$ matrix $W$ for whole-brain connectivities has motivated our search for scalable and fast low-rank methods.

Much of the work on mesoscale mouse connectomes leverages the data and processing pipelines of the Allen Mouse Brain Connectivity Atlas, available at http://connectivity.brain-map.org [29, 35]. In the early examples of such work, [35] used viral tracing data to construct regional connectivity matrices. Nonnegative matrix regression was used to estimate the regional connectivity. First, the injection data were processed into a pair of matrices $X_{\mathrm{Reg}}$ and $Y_{\mathrm{Reg}}$ containing the regionalized injection volumes and projection volumes, respectively. The rows of these matrices are the regions and the columns index injection experiments. [35] then used nonnegative least squares to fit a region-by-region matrix $W_{\mathrm{Reg}}$ such that $Y_{\mathrm{Reg}} \approx W_{\mathrm{Reg}} X_{\mathrm{Reg}}$ (a minimal sketch of such a fit is given below). Due to numerical ill-conditioning and a lack of data, some regions were excluded from the analysis. Similar techniques have been used to estimate regional connectomes in other animals. [49] took a different approach, using a likelihood-based Markov chain Monte Carlo method to infer regional connectivity and weight uncertainty from the Allen data.
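The regional fit mentioned above decouples across target regions: each row of $W_{\mathrm{Reg}}$ solves an independent nonnegative least-squares problem. A minimal Python sketch of this idea (our illustration, not the Allen pipeline itself):

    import numpy as np
    from scipy.optimize import nnls

    def fit_regional(X_reg, Y_reg):
        # Y_reg ~= W_reg @ X_reg with W_reg >= 0, solved row by row:
        # row i solves min_w ||X_reg^T w - Y_reg[i]||_2 subject to w >= 0.
        W_reg = np.zeros((Y_reg.shape[0], X_reg.shape[0]))
        for i in range(Y_reg.shape[0]):
            W_reg[i], _ = nnls(X_reg.T, Y_reg[i])
        return W_reg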
[20] made a conceptual and methodological leap when they presented a method to use such data for spatially-explicit mesoscopic connectivity. The Allen Mouse Brain Atlas is essentially a coordinate mapping which discretizes the average mouse brain into cubic voxels, where each voxel is assigned to a given region in a hierarchy of brain regions. They used an assumption of spatial smoothness to formulate (1), where the specific smoothing term results in a high-dimensional thin plate spline fit [48]. They then solved (1) using the generic quasi-Newton algorithm L-BFGS-B [8]. This technique was applied to the mouse visual areas but was limited to small datasets since $W$ was dense. Using a simple low-rank version based on projected gradient descent, [20] argued that such a method could scale to larger brain areas. However, the initial low-rank implementation turned out to be too slow to converge for large-scale applications. Times to convergence were not reported in the original paper, but the full-rank version typically took around a day, while the low-rank version needed multiple days to reach a minimum (K. D. Harris, personal communication, 2017; note that these times are for the much smaller visual areas dataset).

[26] simplified the mathematical problem by assuming that the injections were delivered to just a single voxel at the injection center of mass. Using a kernel smoother led to a method which is explicitly low-rank, with smoothing performed only in the injection space (columns of $W$). This kernel method was applied to the whole mouse brain, yielding the first estimate of voxel-voxel whole-brain connectivity for this animal. However, these assumptions do not hold in reality: The injections affect a volume of the brain that encompasses much more than the center of mass (wildtype injections can cover 30–500 voxels, approximately 240 on average, at 100 µm resolution [35]). We also expect the connectivity to be smooth across projection space (rows of $W$), since the incoming projections to a voxel are strongly correlated with those of nearby voxels. These inaccuracies mean that the kernel method is prone to artifacts, in particular ones arising from the injection site locations, since there is no ability for that method to translate the source of projections smoothly away from injection sites. It is thus imperative to develop an efficient method for the spline problem that works for large datasets.

We will now describe, for the first time, the continuous mathematical properties of this problem, in order to illuminate why it is challenging to solve. Equation (1) can be seen as a discrete version of an underlying continuous problem (similar to [42], among others) where we define the cost as

$$\frac{1}{2} \sum_{i=1}^{n_{\mathrm{inj}}} \int_{T \cap \Omega_i} \left( \int_S W(x, y)\, X_i(x)\, \mathrm{d}x - Y_i(y) \right)^2 \mathrm{d}y \;+\; \lambda \int_T \int_S \left( \Delta W(x, y) \right)^2 \mathrm{d}x\, \mathrm{d}y. \qquad (2)$$

The cost is minimized over $W : T \times S \to \mathbb{R}$, the continuous connectome, in an appropriate Sobolev space (square-integrable derivatives up to fourth order on $T \times S$ are sufficiently regular). The function $W$ may be seen as the kernel of an integral operator from $S$ to $T$. The regions $S$ and $T$ are both compact subsets of $\mathbb{R}^d$ representing the source and target regions of the brain. The mask region $\Omega_i \subset T$ is the subset of the brain excluding the $i$-th injection site. Finally, the discrete Laplacian terms $L$ have been replaced by the continuous Laplacian operator $\Delta$ on $S \times T$. The parameter $\lambda$ again sets the level of smoothing.

One may consider rescaling $\lambda$ as before, but subtle differences arise: in the continuous versus discrete cases the units of the equations are different, since the functions $X_i(x)$ and $Y_i(y)$ are now viewed as densities. Furthermore, there is a mismatch in units between (1) and (2), because the graph Laplacian is unitless whereas the Laplace operator is not. This explains the lack of any dependence on the grid size in the scaling of the discrete problem. Regardless, choosing the exact scaling to make the continuous and discrete cases match is not necessary for the more qualitative argument we are making.

For simplicity, consider $S = T =$ the whole brain, $\Omega_i = T$ for all $i = 1, \ldots, n_{\mathrm{inj}}$, and relax the constraint of nonnegativity on $W$. Taking the first variational derivative of (2) and setting it to zero yields the Euler-Lagrange equations for this simplified problem:

$$0 = \lambda \Delta^2 W(x, y) - \sum_{i=1}^{n_{\mathrm{inj}}} X_i(x) Y_i(y) + \int_S W(x', y) \left( \sum_{i=1}^{n_{\mathrm{inj}}} X_i(x') X_i(x) \right) \mathrm{d}x' = \lambda \Delta^2 W(x, y) - g(x, y) + \int_S W(x', y)\, f(x', x)\, \mathrm{d}x', \qquad (3)$$

where for convenience we have defined the data covariance functions $f(x', x) = \sum_{i=1}^{n_{\mathrm{inj}}} X_i(x') X_i(x)$ and $g(x, y) = \sum_{i=1}^{n_{\mathrm{inj}}} X_i(x) Y_i(y)$, analogous to $XX^\top$ and $YX^\top$. The operator $\Delta^2$ is the biharmonic operator or bi-Laplacian. Equation (3) is a fourth-order partial integro-differential equation in $2d$ dimensions.

Iterative solutions via gradient descent or quasi-Newton methods to biharmonic and similar equations can be slow to converge [1]. It takes many iterations to propagate the highly local action of the biharmonic differential operator across global spatial scales due to the small stable step size [42], whereas the integral part is inherently nonlocal. Very slow convergence is what we have found when applying methods like gradient descent to this problem.
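The stiffness of the biharmonic term can already be seen in 1-D: squaring the graph Laplacian squares its condition number, which is what forces the small stable step sizes just mentioned. A small numerical check (our own illustration, not from the paper):

    import numpy as np
    import scipy.sparse as sp

    n = 200
    main = 2 * np.ones(n); main[[0, -1]] = 1       # 3-point chain Laplacian, free ends
    L = sp.diags([-np.ones(n - 1), main, -np.ones(n - 1)], [-1, 0, 1]).toarray()
    eigs = np.linalg.eigvalsh(L)
    lam_min = eigs[eigs > 1e-10][0]                # smallest nonzero eigenvalue
    lam_max = eigs[-1]
    print("cond(L)   ~", lam_max / lam_min)        # grows like n^2
    print("cond(L^2) ~", (lam_max / lam_min)**2)   # grows like n^4, hence tiny step sizes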
We present a greedy, low-rank algorithm tailored to the connectome inference problem. To leverage powerful linear methods, we consider solutions to the unconstrained problem

$$W^* = \arg\min_W\ \tfrac{1}{2}\left\| P_\Omega(WX - Y) \right\|_F^2 + \lambda \left\| L_y W + W L_x^\top \right\|_F^2, \qquad (4)$$

where all of the matrices and parameters are as in (1). In practice, solutions to the linear problem (4) are often very close to those of the nonnegative problem (1), since the data matrices $X$ and $Y$ and the "true" $W$ are nonnegative. Setting any negative entries in the computed solution $W^*$ to zero is adequate, or the solution can serve as an initial guess for an iterative solver for the slower nonnegative problem.

Equation (4) is another regularized least-squares problem. In Section 2.1, we show that taking the gradient and setting it equal to zero leads to a linear matrix equation in the unknown $W$. This takes the form of a generalized Sylvester equation with coefficient matrices formed from the data and Laplacian terms. The data matrices are, in fact, of low rank since $n_{\mathrm{inj}} \ll n_X, n_Y$, and thus we can expect a low-rank approximation $W \approx UV^\top$ of the full solution to perform well (see [20], although we do not know how to justify this rigorously). We provide a brief survey of some low-rank methods for linear matrix equations in Section 2.2. We employ a greedy solver that finds rank-one components $u_i v_i^\top$ one at a time, explained in Section 2.3. After a new component is found, it is orthogonalized and a Galerkin refinement step is applied. This leads to Algorithm 1, our complete method.

We then test the method on a few connectome fitting problems. First, in Section 3.1, we test on a fake "toy" connectome, where we know the truth. This is a test problem consisting of a 1-D brain with smooth connectivity [20]. We find that the output of our algorithm converges to the true solution as the rank increases and as the stopping tolerance decreases. Next, we present two benchmarks using real viral tracing data from the isocortices of mice, provided by the Allen Institute for Brain Science. In each case, we work with 2-D data in order to limit the problem size and because the cortex is a relatively flat, 2-D shape. It has also been argued that such a projection denoises these data [47, 18]. In Section 3.2, we work with data that are averaged directly over the superior-inferior axis to obtain a flattened cortex. We refer to this as the top view projection. In contrast, for Section 3.3, the data are flattened by averaging along curved streamlines of cortical depth. We call this the flatmap projection.

Finally, Section 4 discusses the limitations of our method and directions for future research. Our data and code are described in Section 5 and are freely available for anyone who would like to reproduce the results.

2 Greedy low-rank method

2.1 Normal equations
We now derive the equivalent of the "normal equations" for our problem. Denote the objective function in (4) as $J(W)$, with decomposition

$$J(W) = J_{\mathrm{loss}}(W) + J_{\mathrm{reg}}(W) = \tfrac{1}{2}\left\| P_\Omega(WX - Y) \right\|_F^2 + \lambda \left\| L_y W + W L_x^\top \right\|_F^2.$$

Writing $J_{\mathrm{loss}}$ indexwise, we obtain (note that $\Omega \circ \Omega = \Omega$)

$$J_{\mathrm{loss}} = \frac{1}{2} \sum_{i,\alpha=1}^{n,\, n_{\mathrm{inj}}} \Omega_{i,\alpha} \left( \sum_{k=1}^{m} W_{i,k} X_{k,\alpha} - Y_{i,\alpha} \right)^2.$$

The derivative reads

$$\frac{\partial J_{\mathrm{loss}}}{\partial W_{\hat\imath,\hat k}} = \sum_{i,\alpha=1}^{n,\, n_{\mathrm{inj}}} \Omega_{i,\alpha} \left( \sum_{k=1}^{m} W_{i,k} X_{k,\alpha} - Y_{i,\alpha} \right) X_{\hat k,\alpha}\, \delta_{i,\hat\imath} = \sum_{\alpha=1}^{n_{\mathrm{inj}}} \Omega_{\hat\imath,\alpha} X_{\hat k,\alpha} \left( \sum_{k=1}^{m} W_{\hat\imath,k} X_{k,\alpha} - Y_{\hat\imath,\alpha} \right),$$

or, in vector form,

$$\frac{\partial J_{\mathrm{loss}}}{\partial \operatorname{vec}(W)} = \sum_{\alpha=1}^{n_{\mathrm{inj}}} \left[ (X_\alpha X_\alpha^\top) \otimes \operatorname{diag}(\Omega_\alpha) \right] \operatorname{vec}(W) - \operatorname{vec}\!\left( (\Omega \circ Y) X^\top \right),$$

where $X_\alpha$ is the $\alpha$-th column of $X$ and likewise for $\Omega_\alpha$. Setting the derivative of $J$ equal to zero leads to the system of normal equations

$$\mathbf{A} \operatorname{vec}(W) = \operatorname{vec}\!\left( (\Omega \circ Y) X^\top \right), \qquad (5)$$

where $\operatorname{vec}(W)$ is the vector of all columns of $W$ stacked on top of each other. This linear system features the following $(n_Y n_X) \times (n_Y n_X)$ matrix, consisting of $n_{\mathrm{inj}} + 3$ Kronecker products,

$$\mathbf{A} = \sum_{\alpha=1}^{n_{\mathrm{inj}}} (X_\alpha X_\alpha^\top) \otimes \operatorname{diag}(\Omega_\alpha) + \lambda \left( L_x^2 \otimes I_{n_Y} + 2 L_x \otimes L_y + I_{n_X} \otimes L_y^2 \right). \qquad (6)$$

Note that without the observation mask, $\Omega$ is a matrix of all ones, and the first term compresses to $XX^\top \otimes I_{n_Y}$.

The linear system (5) can be recast as the linear matrix equation

$$\mathcal{A}(W) = D, \qquad (7)$$

with the operator $\mathcal{A}(W) := \lambda \mathcal{B}(W) + \mathcal{C}(W)$, where

$$\mathcal{B}(W) := W L_x^2 + 2 L_y W L_x + L_y^2 W, \qquad \mathcal{C}(W) := \sum_{\alpha=1}^{n_{\mathrm{inj}}} \operatorname{diag}(\Omega_\alpha)\, W X_\alpha X_\alpha^\top,$$

and $D := (\Omega \circ Y) X^\top$. The smoothing term $\mathcal{B}$ can be expressed as a squared standard Sylvester operator, $\mathcal{B}(W) = \mathcal{L}(\mathcal{L}(W))$, where $\mathcal{L}(W) := L_y W + W L_x$. The operator $\mathcal{L}$ is the graph Laplacian operator on the discretization of $T \times S$. Furthermore, the right-hand side $D$ is a matrix of rank at most $n_{\mathrm{inj}}$, since it is an outer product of two rank-$n_{\mathrm{inj}}$ matrices.

2.2 Numerical low-rank methods for linear matrix equations

Because of the potentially high dimensions $n_X$, $n_Y$, directly solving the algebraic matrix equation (7) is numerically inefficient, since the solution will be a dense $n_Y \times n_X$ matrix, making even storing it infeasible. However, the rank of the right-hand side of (7) is at most $n_{\mathrm{inj}} \ll n_X, n_Y$. It is often observed, and has been theoretically shown [16, 3, 22], that the solutions of large matrix equations with low-rank right-hand sides exhibit rapidly decaying singular values. Hence, the solution $W$ is expected to have small numerical rank, in the sense that few of its singular values are larger than machine precision or the experimental noise floor. Intuitively, since we also seek very smooth solutions, this also helps control the rank, since high-frequency components tend to be associated with small singular values. This motivates us to approximate the solution of (7) by a low-rank approximation $W \approx UV^\top$ with $U \in \mathbb{R}^{n_Y \times r}$, $V \in \mathbb{R}^{n_X \times r}$, and $r \ll \min(n_X, n_Y)$. The low-rank factors are then typically computed by iterative methods which never form the approximation $UV^\top$ explicitly.
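For reference, the operator $\mathcal{A}$ and right-hand side $D$ of (7) are straightforward to transcribe. The following dense-$W$ Python sketch (our own, useful only for checking the formulas at toy scale) assumes symmetric Laplacians:

    import numpy as np

    def apply_A(W, Lx, Ly, X, Omega, lam):
        WLx = (Lx @ W.T).T                        # W Lx  (Lx symmetric)
        B = (Lx @ WLx.T).T + 2 * (Ly @ WLx) + Ly @ (Ly @ W)  # W Lx^2 + 2 Ly W Lx + Ly^2 W
        C = np.zeros_like(W)
        for a in range(X.shape[1]):               # C(W) = sum_a diag(Omega_a) W X_a X_a^T
            xa = X[:, a]
            C += Omega[:, a][:, None] * np.outer(W @ xa, xa)
        return lam * B + C

    def rhs_D(X, Y, Omega):
        return (Omega * Y) @ X.T                  # D = (Omega o Y) X^T, rank <= n_inj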
Several low-rank methods for computing $U$, $V$ have been proposed, starting from methods for standard Sylvester equations $AX + XB = D$, e.g., [2, 4, 5, 44], and more recently for general linear matrix equations like (7) [13, 3, 43, 14, 22, 40]. However, these methods are specialized and require the problem to have particular structures or properties (e.g., $\mathcal{B}$, $\mathcal{C}$ have to form a convergent splitting of $\mathcal{A}$), which are not present in the problem at hand. The main structures present in (7) are positive definiteness and the sparsity of $L_x$, $L_y$.

An approach that is applicable to the matrix equation (7) is the greedy method proposed by [27], which is based on successive rank-1 approximations of the error. Because this method is quite general, we tailored specific components of the algorithm to our problem. Three main challenges were overcome: First, we chose a simpler stopping criterion for the ALS routine. Second, specific solvers were chosen for the three main sub-problems of the algorithm, which maximizes its efficiency. Third, we developed a GPU implementation of the Galerkin refinement, to make this bottleneck step more efficient. We advocate for this method in the rest of the paper.

2.3 Greedy rank-1 method with Galerkin projection

Here we briefly review the algorithm from [27] and explain how it is specialized for our particular problem. Assume there is already an approximate solution $W_j \approx W_*$ of the linear matrix equation $\mathcal{A}(W) = D$, equation (7), with solution $W_*$. We will improve our solution by an update of rank one: $W_{j+1} = W_j + u_{j+1} v_{j+1}^\top$, where $u_{j+1} \in \mathbb{R}^{n_Y}$ and $v_{j+1} \in \mathbb{R}^{n_X}$. The update vectors $u_{j+1}$, $v_{j+1}$ are computed by minimizing an error functional that we will soon define. Since the operator $\mathcal{A}$ is positive definite, it induces the $\mathcal{A}$-inner product $\langle X, Y \rangle_{\mathcal{A}} = \operatorname{Tr}(Y^\top \mathcal{A}(X))$ and the $\mathcal{A}$-norm $\| Y \|_{\mathcal{A}} := \sqrt{\langle Y, Y \rangle_{\mathcal{A}}}$. So we find $u_{j+1}$, $v_{j+1}$ by minimizing the squared error in the $\mathcal{A}$-norm:

$$(u_{j+1}, v_{j+1}) = \arg\min_{u,v} \| W_* - W_j - uv^\top \|_{\mathcal{A}}^2 = \arg\min_{u,v} \operatorname{Tr}\!\left( (W_* - W_j - uv^\top)^\top \mathcal{A}(W_* - W_j - uv^\top) \right) = \arg\min_{u,v} \operatorname{Tr}\!\left( (W_* - W_j - uv^\top)^\top \left( D - \mathcal{A}(W_j) - \mathcal{A}(uv^\top) \right) \right).$$
Discarding constant terms, noting that $\langle X, Y \rangle_{\mathcal{A}} = \langle Y, X \rangle_{\mathcal{A}}$, and setting $R_j = D - \mathcal{A}(W_j)$ leads to

$$(u_{j+1}, v_{j+1}) = \arg\min_{u,v}\ \tfrac{1}{2} \langle uv^\top, uv^\top \rangle_{\mathcal{A}} - \operatorname{Tr}\!\left( v u^\top R_j \right). \qquad (8)$$

Notice that the rank-1 decomposition $uv^\top$ is not unique, because we can rescale the factors by any nonzero scalar $c$ such that $(uc)(v/c)^\top$ represents the same matrix. This reflects the fact that the optimization problem (8) is not convex. However, it is convex in each of the factors $u$ and $v$ separately.

We obtain the updates $u_{j+1}$, $v_{j+1}$ via an alternating linear (ALS) scheme [36]. Although we only consider low-rank approximations of matrices here, ALS methods are also used for computing low-rank approximations of higher-order tensors by means of polyadic decompositions, e.g., [21, 45]. First, a fixed $v$ is used in (8) and a minimizing $u$ is computed, which is in the next stage kept fixed while (8) is solved for a minimizing $v$. For a fixed vector $v$ with $\| v \| = 1$, the minimization problem is

$$\hat u = \arg\min_u\ \tfrac{1}{2} \Big( \lambda \big( (u^\top u)(v^\top L_x^2 v) + 2 (u^\top L_y u)(v^\top L_x v) + u^\top L_y^2 u \big) + \sum_{\alpha=1}^{n_{\mathrm{inj}}} (u^\top \operatorname{diag}(\Omega_\alpha) u)(v^\top X_\alpha X_\alpha^\top v) \Big) - u^\top R_j v,$$

and, hence, $\hat u$ is obtained by solving the linear system of equations

$$\hat A \hat u = R_j v, \qquad \hat A := \lambda \big( (v^\top L_x^2 v) I + 2 (v^\top L_x v) L_y + L_y^2 \big) + \sum_{\alpha=1}^{n_{\mathrm{inj}}} (v^\top X_\alpha X_\alpha^\top v) \operatorname{diag}(\Omega_\alpha). \qquad (9a)$$

The second half-iteration starts from the fixed $u = \hat u / \| \hat u \|$ and finds a minimizing $\hat v$ by solving

$$\hat B \hat v = R_j^\top u, \qquad \hat B := \lambda \big( L_x^2 + 2 (u^\top L_y u) L_x + (u^\top L_y^2 u) I \big) + \sum_{\alpha=1}^{n_{\mathrm{inj}}} (u^\top \operatorname{diag}(\Omega_\alpha) u)\, X_\alpha X_\alpha^\top, \qquad (9b)$$

which can be derived by similar steps. The linear systems (9a) and (9b) inherit the sparsity of $L_x$, $L_y$, and $\Omega$. Therefore they can be solved by sparse direct or iterative methods. We use a sparse direct solver for (9a), as this was faster than the alternatives. The coefficient matrix $\hat B$ in (9b) is the sum of a sparse matrix (the Laplacian terms) and a low-rank matrix (the rank-$n_{\mathrm{inj}}$ data terms). In this case, we solve (9b) using the Sherman-Morrison-Woodbury formula [15] and a direct solver for the sparse inversion.

Both half-steps form the ALS iteration, which should be stopped when the iterates are close enough to a critical point; this might be difficult to check. Here we propose a simpler approach compared to the one in [27]. Since we rescale $u$ and $v$ such that $\| u \| = \| v \| = 1$, the norm of the other factor is equal to the norm of the full matrix. In other words, $\| \hat u \| = \| \hat u v^\top \|_F$ after solving for $\hat u$, and hence $\| \hat u \|$ should converge to the norm of the exact solution. This motivates a simple criterion: we stop the ALS when $(1 - \delta) \| \hat u \| \le \| \hat v \| \le (1 + \delta) \| \hat u \|$, where $\hat u$ and $\hat v$ are taken from two consecutive ALS half-steps and $\delta < 1$ is a small threshold. We found that $\delta = 0.1$, corresponding to 2–4 ALS iterations, is sufficient in practice for the overall convergence of the algorithm.
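One ALS sweep then consists of the two solves (9a) and (9b). A Python sketch under the same assumptions as before (symmetric sparse Laplacians, dense data matrices; the Sherman-Morrison-Woodbury bookkeeping follows the text):

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def als_sweep(v, Rj, Lx, Ly, X, Omega, lam):
        # --- (9a): sparse direct solve for u_hat, given unit-norm v ---
        c1 = v @ (Lx @ (Lx @ v))                  # v^T Lx^2 v
        c2 = v @ (Lx @ v)                         # v^T Lx v
        w = (X.T @ v) ** 2                        # (X_a^T v)^2 for each injection a
        A_hat = lam * (c1 * sp.eye(Ly.shape[0]) + 2 * c2 * Ly + Ly @ Ly) \
                + sp.diags(Omega @ w)             # sum_a (X_a^T v)^2 diag(Omega_a)
        u_hat = spla.spsolve(A_hat.tocsc(), Rj @ v)
        u = u_hat / np.linalg.norm(u_hat)
        # --- (9b): sparse part S plus rank-n_inj part, via SMW ---
        b1 = u @ (Ly @ u)                         # u^T Ly u
        b2 = u @ (Ly @ (Ly @ u))                  # u^T Ly^2 u
        S = lam * (Lx @ Lx + 2 * b1 * Lx + b2 * sp.eye(Lx.shape[0]))
        d = Omega.T @ (u ** 2)                    # d_a = u^T diag(Omega_a) u, assumed > 0
        solve_S = spla.factorized(S.tocsc())
        SX = np.column_stack([solve_S(X[:, a]) for a in range(X.shape[1])])
        core = np.diag(1.0 / d) + X.T @ SX        # SMW core matrix
        Su = solve_S(Rj.T @ u)
        v_hat = Su - SX @ np.linalg.solve(core, X.T @ Su)
        delta = abs(np.linalg.norm(u_hat) / np.linalg.norm(v_hat) - 1.0)
        return u, v_hat / np.linalg.norm(v_hat), delta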
The second stage of the method is a non-greedy Galerkin refinement of the low-rank factors. Suppose a rank-$j$ approximation $W_j = \sum_{i=1}^{j} u_i v_i^\top$ of $W$ has already been computed. Let $U \in \mathbb{R}^{n_Y \times j}$ and $V \in \mathbb{R}^{n_X \times j}$ have orthonormal columns spanning the spaces $\operatorname{span}\{u_1, \ldots, u_j\}$ and $\operatorname{span}\{v_1, \ldots, v_j\}$, respectively. We compute a refined approximation $UZV^\top$, $Z \in \mathbb{R}^{j \times j}$, by imposing the following condition on the residual:

(Galerkin condition) Find $Z$ such that $\mathcal{A}(UZV^\top) - D \perp \{ U \tilde Z V^\top \in \mathbb{R}^{n_Y \times n_X} : \tilde Z \in \mathbb{R}^{j \times j} \}$.

This leads to the dense, square matrix equation in $Z$ of dimension $j \le r \ll n_X, n_Y$:

$$\lambda \big( Z (V^\top L_x^2 V) + 2 (U^\top L_y U) Z (V^\top L_x V) + (U^\top L_y^2 U) Z \big) + \sum_{\alpha=1}^{n_{\mathrm{inj}}} (U^\top \operatorname{diag}(\Omega_\alpha) U)\, Z\, (V^\top X_\alpha X_\alpha^\top V) = U^\top D V. \qquad (10)$$

Equation (10) is a projected version of (7) and inherits its structure, including the positive definiteness of the operator which acts on $Z$. Instead of using a direct method to solve (10) as in [27], we employ an iterative method similar to [40]. Due to the positive definiteness, the obvious method of choice is a dense, matrix-valued conjugate gradient (CG) method. Moreover, we reduce the number of iterations significantly by taking the solution $Z$ from the previous greedy step as an initial guess. The improved solution $W_{j+1} = UZV^\top$ yields a new residual $R_{j+1} = D - \mathcal{A}(W_{j+1})$, onto which the ALS scheme is applied to obtain the next rank-1 update. The complete procedure is illustrated in Algorithm 1.

Algorithm 1: Greedy rank-1 method with Galerkin projection for (7)
  Input: matrices $L_x$, $L_y$, $X$, $\Omega$, $Y$; maximal rank $r \le \min(n_X, n_Y)$; tolerance $0 < \tau \ll 1$
  Output: low-rank approximation $W = UZV^\top$ in factored form
  Initialize $W = 0$, $R_0 = D$, $U_0 = V_0 = [\,]$, $j = 0$
  repeat
    Pick an initial vector $v$ for ALS with $\|v\| = 1$, then compute the rank-1 update:
    while $\delta > 0.1$ do
      Solve $\hat A \hat u = R_j v$ for $\hat u$ (sparse direct solver) and set $u = \hat u / \|\hat u\|$    [Eqn. (9a)]
      Solve $\hat B \hat v = R_j^\top u$ for $\hat v$ (sparse direct + low-rank update) and set $v = \hat v / \|\hat v\|$    [Eqn. (9b)]
      $\delta = \big|\, \|\hat u\| / \|\hat v\| - 1 \,\big|$
    end while
    $U_{j+1} = \operatorname{orth}([U_j, u])$, $V_{j+1} = \operatorname{orth}([V_j, v])$    [orthogonalize new factors]
    Increment rank $j \leftarrow j + 1$
    Solve Eqn. (10) for $Z_j$ (CG, with tolerance based on $\tau$)    [Galerkin update]
    $R_j = D - \mathcal{A}(U_j Z_j V_j^\top)$    [update residual]
    $\delta W = \| U_j Z_j V_j^\top - U_{j-1} Z_{j-1} V_{j-1}^\top \|_F \,/\, \| U_j Z_j V_j^\top \|_F$
  until $j = r$ or $\delta W \le \tau$

This Galerkin refinement substantially improves the greedy approximation, leading to a faster convergence rate [27]. The ALS stage is primarily used to sketch the projection bases for the Galerkin solution, which justifies the limited number of ALS steps. Use of the Galerkin refinement in the low-rank decomposition literature can be traced back to the greedy approximation in the CP tensor format [34], as well as orthogonal matching pursuit approaches in sparse recovery and compressed sensing [38] and deflation strategies in low-rank matrix completion [19].
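The projected equation (10) only involves $j \times j$ matrices, so the matrix-valued CG iteration is cheap per step. A sketch (our illustration; the paper's implementation is MATLAB with an optional GPU path):

    import numpy as np

    def make_projected_operator(U, V, Lx, Ly, X, Omega, lam):
        VLxV = V.T @ (Lx @ V); VLx2V = V.T @ (Lx @ (Lx @ V))
        ULyU = U.T @ (Ly @ U); ULy2U = U.T @ (Ly @ (Ly @ U))
        XV = X.T @ V                               # row a is X_a^T V
        UOU = [U.T @ (Omega[:, a][:, None] * U) for a in range(X.shape[1])]
        def proj(Z):                               # left-hand side of Eq. (10)
            out = lam * (Z @ VLx2V + 2 * ULyU @ Z @ VLxV + ULy2U @ Z)
            for a, M in enumerate(UOU):
                out = out + M @ Z @ np.outer(XV[a], XV[a])
            return out
        return proj

    def galerkin_cg(proj, G, Z0, tol=1e-8, maxiter=500):
        # CG in the trace inner product <P, Q> = sum(P * Q); proj is positive definite.
        Z = Z0.copy(); R = G - proj(Z); P = R.copy(); rr = np.sum(R * R)
        for _ in range(maxiter):
            AP = proj(P)
            alpha = rr / np.sum(P * AP)
            Z += alpha * P; R -= alpha * AP
            rr_new = np.sum(R * R)
            if np.sqrt(rr_new) <= tol * np.linalg.norm(G):
                break
            P = R + (rr_new / rr) * P; rr = rr_new
        return Z

Warm-starting with the previous greedy step's $Z$, as Algorithm 1 does, typically cuts the iteration count substantially.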
3 Results

There are three test problems to which we apply Algorithm 1: a toy problem with synthetic data (Section 3.1), the top view projected mouse connectivity data (Section 3.2), and the flatmap projected data (Section 3.3).
These tests show that the method easily scales to whole-brain connectome reconstruction.

We investigate the computational complexity and convergence of the greedy algorithm. Since the matrices in (9a) are sparse, the ALS steps need $O(n\, r\, n_{\mathrm{inj}})$ operations in total for the final solution rank $r$, where $n = \max(n_X, n_Y)$. In turn, if the solution of (10) takes $\gamma$ CG iterations, this step will have a cost of $O(\gamma\, r^3 n_{\mathrm{inj}})$. Although $\gamma$ can be kept at the same level for all $j$, it depends on the stopping tolerance $\tau$, as does the rank $r$. We will therefore investigate the cost in terms of the total computation time and the corresponding solution accuracy for a range of solution rank values.

The numerical experiments were performed on an Intel® E5-2650 v2 CPU with 8 threads and 64 GB RAM.
We employ an Nvidia® P100 GPU card for some subtasks: The Galerkin update relies on dense linear algebra to solve (10) by the CG method, so this stage admits an efficient GPU implementation. Algorithm 1 is implemented in MATLAB® R2017b and was run on the Balena High Performance Computing Service at the University of Bath. See Section 5 for additional data and code resources.

We measure errors in the solution using the root mean squared (RMS) error. Given any reference solution $W_\star$ of size $n_Y \times n_X$, e.g. the truth or a large-rank solution when the truth is unknown, and a low-rank solution $W_r$, the RMS error is computed as $E(W_r, W_\star) = \| W_r - W_\star \|_F / \sqrt{n_Y n_X}$. We also report the relative error in the Frobenius norm, $E_{\mathrm{rel}}(W_r, W_\star) = \| W_r - W_\star \|_F / \| W_\star \|_F$.
3.1 Toy brain test problem

We use the same test problem as in [20], a one-dimensional "toy brain." The source and target spaces are $S = T = [0, 1]$, and the true connectivity $W_{\mathrm{true}}(x, y)$ is a smooth profile consisting of a Gaussian ridge concentrated along the diagonal $x = y$ plus a weaker, spatially localized off-diagonal Gaussian component. The input and output spaces were discretized using $n_X = n_Y = 200$ uniformly spaced lattice points. Injections are delivered at $n_{\mathrm{inj}} = 5$ locations in $S$, with a width of $0.12 + 0.1\,\epsilon$, where $\epsilon \sim \mathrm{Uniform}(0, 1)$. The entries of $X$ are set to 1 within the injection region and 0 elsewhere, $\Omega_{ij} = 1 - X_{ij}$, $Y$ is set to 0 within the injection region, and we add Gaussian noise with standard deviation $\sigma = 0.1$. The matrices $L_x = L_y$ are the 3-point graph Laplacians for the 1-D chain.
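A Python sketch of generating such a toy dataset follows; the true connectivity and its constants here are illustrative stand-ins for the exact profile used in the paper:

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    n, n_inj = 200, 5
    grid = np.linspace(0.0, 1.0, n)
    xx, yy = np.meshgrid(grid, grid, indexing="ij")
    # Illustrative smooth truth: a ridge along x = y plus a weaker off-diagonal bump.
    W_true = np.exp(-((xx - yy) / 0.2) ** 2) \
             + 0.5 * np.exp(-((xx - 0.8) ** 2 + (yy - 0.2) ** 2) / 0.01)

    X = np.zeros((n, n_inj))
    for a, c in enumerate(np.linspace(0.1, 0.9, n_inj)):
        width = 0.12 + 0.1 * rng.uniform()         # jittered injection width
        X[np.abs(grid - c) < width / 2, a] = 1.0
    Omega = 1.0 - X                                # mask out the injection sites
    Y = W_true @ X + 0.1 * rng.standard_normal((n, n_inj))
    Y[Omega == 0] = 0.0                            # Y is unknown inside injection sites

    main = 2 * np.ones(n); main[[0, -1]] = 1       # 3-point chain graph Laplacian
    Lx = Ly = sp.diags([-np.ones(n - 1), main, -np.ones(n - 1)], [-1, 0, 1])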
We depict the true toy connectivity $W_{\mathrm{true}}$ as well as a number of low-rank solutions output by our method in Figure 2. Both the mask and regularization are required for good performance: If we remove the mask, setting $\Omega$ equal to the matrix of all ones, then there are holes in the data at the locations of the injections. If we try fitting with $\lambda = 0$, i.e. no smoothing, then the method cannot fill in holes or extrapolate outside the injection sites. It is only with the combination of all ingredients that we recover the true connectivity.

Figure 2: Toy brain test problem. Top: true connectivity map $W_{\mathrm{true}}$ (left) and the low-rank solution with rank 40 and $\lambda = 100$ (right). Bottom: solutions with $\Omega = 1$ (left) and $\lambda = 0$ (right). The locations of simulated injections are shown by the grey bars. This shows that both the mask ($\Omega$) and smoothing ($\lambda > 0$) are necessary for good recovery.

In Table 1 we show the performance of the algorithm for ranks $r = 10$, 20, 40, 60, and 80. The output $W$ is compared to $W_{\mathrm{true}}$ as well as the rank-140 output of the algorithm, which we denote $W_{140}$. The stopping tolerance $\tau$ was chosen small enough to ensure that the algorithm reached this maximal rank.

Table 1: Computing times and errors for the toy brain test problem. The output $W$ is compared to the truth and the rank-140 reference solution.

    rank(W)              10          20          40          60          80
    CPU time (sec.)      0.0396      0.1554      0.9653      2.6398      3.1108
    E(W, W_140)          3.2324e-01  5.5407e-02  1.4162e-02  1.2125e-03  3.1549e-04
    E(W, W_true)         2.9418e-01  7.9921e-02  7.1537e-02  6.9777e-02  6.9821e-02
    E_rel(W, W_140)      4.3320e-01  8.9700e-02  2.4900e-02  2.5000e-03  5.1300e-04
    E_rel(W, W_true)     4.0130e-01  1.1410e-01  1.0350e-01  1.0040e-01  1.0040e-01

We see that the RMS distance to the reference solution $W_{140}$ decreases as we increase the rank, as does the relative distance. However, the RMS and relative distances from $W_{\mathrm{true}}$ asymptote to roughly 0.07 and 10%, respectively, by rank 40.
This shows that rank 40 is a suitable maximum rank for this problem given the input data and noise.

The computing time of the greedy method (in this example we use the CPU-only version) remains on the order of seconds even for the largest considered ranks. In contrast, the unpreconditioned CG method needs thousands of iterations (and hundreds of seconds of time) to compute a solution of the same order of accuracy. Since it is unclear how to develop a preconditioner for Eq. (5), especially for a non-trivial $\Omega$, in the next sections we focus only on the greedy algorithm.
3.2 Top view data

We next benchmark Algorithm 1 on mouse cortical data projected into a top-down view. See Section 5 for details about how we obtained these data. Here, the problem sizes are $n_Y = 44\,478$ and $n_X = 22\,377$, and the number of injections is $n_{\mathrm{inj}} = 126$. We use the smoothing parameter $\bar\lambda = 10$.

We run the low-rank solver with the target solution rank varying from $r = 125$ to 1000. The stopping tolerances $\tau$ were decreased geometrically as $r$ increased from 125 to 1000. This delivers accurate but cheap solutions to the Galerkin system (10) while ensuring that the algorithm reaches the target rank.

These low-rank solutions are compared to the full-rank solution $W_{\mathrm{full}}$ with $r = n_X = 22\,377$ found by L-BFGS [8], similar to [20], which used L-BFGS-B to deal with the nonnegativity constraint. Note that this full-rank algorithm was initialized from the output of the low-rank algorithm. This led to a significant speedup: The full-rank method, initialized naively, had not reached a similar value of the cost function (4) after a week of computation, but this "warm start" allowed it to finish within hours.

Figure 3: Computing times (CPU and GPU) and errors $E(W_r, W_\star)$, $E_{\mathrm{rel}}(W_r, W_\star)$ of the low-rank method for the top view data, versus rank $r$. The errors are computed with reference to the full-rank solution $W_\star = W_{\mathrm{full}}$, whose computation takes far longer than any of the low-rank solves (see text).

The computing times and errors are presented in Figure 3. We see that the RMS errors are relatively small for ranks above 500. Neither the RMS nor the relative error seems to have plateaued at rank 1000, but both are small. At rank 1000, the vector $\ell^\infty$ error (the maximum absolute deviation of the matrices viewed as vectors, not the matrix $\infty$-norm) $\| W - W_{\mathrm{full}} \|_\infty$ is tiny, certainly within experimental uncertainty.

Figure 5: The cost function $J(W_r)$ versus the rank $r$ of the low-rank approximation $W_r$, for the top view and flatmap data.

In Figure 5, the value of the cost function $J(W_r)$ is plotted against the rank $r$ of the approximation $W_r$ for the top view and flatmap data. Apparently, around $r = 500$ the cost function begins to stagnate, indicating that the approximation quality does not significantly improve any more. Hence, we continue the investigation with the numerical rank set to $r = 500$.

We analyze the leading singular vectors of the solution. The output of the algorithm is $W_r = UZV^\top$, which is not the SVD of $W_r$ because $Z$ is not diagonal. We perform a final SVD of the Galerkin matrix, $Z = \tilde U \Sigma \tilde V^\top$, and set $\hat U = U \tilde U$ and $\hat V = V \tilde V$, so that $W_r = \hat U \Sigma \hat V^\top$ is the SVD of the solution.
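Extracting the SVD from the factored output is a small-matrix operation; in sketch form:

    import numpy as np

    def factored_svd(U, Z, V):
        # W_r = U @ Z @ V.T with orthonormal U, V; returns U_hat, S, V_hat
        # such that W_r = U_hat @ np.diag(S) @ V_hat.T is an SVD of W_r.
        Uz, S, Vzt = np.linalg.svd(Z)
        return U @ Uz, S, V @ Vzt.T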
The third component is a strong local connectivityamong somatomotor areas located medially along the anterior-posterior axis and strongeron the lateral side where the barrel fields, important sensory areas for whisking, are located.Finally, the fourth component is concentrated in posterior locations, mostly correspondingto the visual areas, as well as more anterior and medial locations in the retrosplenial cortex(thought to be a memory and association area). The visual and retrosplenial parts of thecomponent show opposite signs, reflecting stronger local connectivity within these regionsthan distal connectivity between them.These patterns in Figure 6 are reasonable, since connectivity in the brain is dominantlylocal with some specific long-range projections. We also observe that the projection patterns15igure 7: Top four singular vectors of the flatmap connectivity with r = 500. Left: scaledtarget vectors ˆ U Σ. Right: source vectors ˆ V . Factor 1Factor 2Factor 3Factor 4 (left components ˆ U Σ) are fairly symmetric across the midline. This is also expected dueto the mirroring of major brain areas in both hemispheres, despite the evidence for somelateralization, especially in humans. The more specific projections between brain regions willshow up in later, higher frequency components. However, it becomes increasingly difficult tointerpret lower energy components as specific pathways, since these combine in complicatedways.
3.3 Flatmap data

Finally, we test the method on another problem which is a flatmap projection of the brain (see Section 5 for details). This projection more faithfully represents areas of the cortex which are missing from the top view since they curl underneath that vantage point. The flatmap is closer to the kind of transformation used by cartographers to flatten the globe, whereas the top view is like a satellite image taken far from the globe.

The problem size is now larger by roughly a factor of three relative to the top view. Here, $n_Y = 126\,847$ and $n_X = 63\,435$. The number of experiments is the same, $n_{\mathrm{inj}} = 126$, whereas the regularization parameter $\bar\lambda$ is set three times larger than in the top view case. The smoothing parameter was set to give the same level of smoothness, measured "by eye," in the components as in the top view experiment. The tolerances $\tau$ were as in the top view case.

In this case, the computing time of the full solver would be excessively large, so we do not estimate the error by comparison to the full solution, instead taking the solution with $r = 1000$ as the reference solution $W_\star = W_{1000}$.

Figure 4: Computing times (CPU and GPU) and errors $E(W_r, W_\star)$, $E_{\mathrm{rel}}(W_r, W_\star)$ of the low-rank method for the flatmap data, versus rank $r$. The errors are computed with reference to the rank-1000 solution $W_\star = W_{1000}$.

The computing times and the errors are shown in Figure 4. Here, the benefits of using the GPU implementation for solving (10) were more significant than for the top view case. We obtained the rank-500 solution in approximately 1.5 hours, which is significantly less than the pure CPU implementation, which took 6.4 hours. Comparing Figures 3 and 4, the computation times for the flatmap problem with $r = 500$ and 1000 are roughly twice as large as for the top view problem. On the other hand, for $r = 125$ and 250, the compute times are about three times as long for flatmap versus top view. The observed scaling in compute time appears slightly slower than $O(n)$ in these tests. The growth rate of the computing time on the GPU is better than that of the CPU version since the matrix multiplications, which dominate the CPU cost for large ranks, are calculated in nearly constant time, mainly due to communication overhead, on the GPU. The RMS error between ranks 500 and 1000 is again small, so we believe rank 500 is probably a very good approximation to the full solution. Figure 5 shows the cost versus the approximation rank. Again, we see that $r = 500$ is reasonable and the distance from $W_\star$ is smaller than 10%.

Figure 7: Top four singular vectors of the flatmap connectivity with $r = 500$. Left: scaled target vectors $\hat U \Sigma$. Right: source vectors $\hat V$. Rows: factors 1–4.

The four dominant singular vectors of the flatmap solution are shown in Figure 7, oriented as in Figure 6, with the anterior-posterior axis running top-bottom and the medial-lateral axis left-right. The first two factors are directly comparable between the two problem outputs, although we see more structure in the flatmap components. This could be due to employing a projection which more accurately represents these 3-D data in 2-D, or due to the choice of smoothing parameter $\bar\lambda$. The third and fourth components, on the other hand, are comparable to the fourth and third components in the top view problem, respectively. Again, these patterns are reasonable and expected. The raw 3-D data that were fed into the top view and flatmap projections were the same, but the greedy algorithm is run using different projected datasets. It is reassuring that we can interpret the first few factors and directly compare them against those in the top view.

3.4 Relaxing the nonnegativity constraint

In order to apply linear methods, we relaxed the nonnegativity constraint when formulating problem (4) (the unconstrained problem), as opposed to the original problem (1) (the problem with the nonnegativity constraint). We now show that the resulting solutions are not significantly different between the two problems. This justifies the major simplification that we have made.
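The comparisons reported next amount to clipping negative entries and measuring RMS and entrywise maximum differences; a minimal sketch:

    import numpy as np

    def rms(Wa, Wb):
        return np.linalg.norm(Wa - Wb) / np.sqrt(Wa.size)   # E(Wa, Wb) from Section 3

    def max_abs(Wa, Wb):
        return np.abs(Wa - Wb).max()                        # entrywise infinity-norm gap

    # Projection onto the nonnegative orthant, also used to warm-start L-BFGS-B:
    # W_proj = np.maximum(W_full, 0.0)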
In all of our experiments with the test problem (Section 3.1), the resulting matrices were nearly nonnegative. The solution $W$ has 48 out of 40,000 negative entries. These negative entries were all greater than -0.0023 and located in the lower-left corner of the matrix (see Figure 2), where the truth is approximately zero.

We were able to solve the top view problem with the nonnegativity constraint using L-BFGS-B by initializing with $W_{\mathrm{full}}$ projected onto the nonnegative orthant. Let $W_{\mathrm{proj}}$ be the matrix with entries $(W_{\mathrm{proj}})_{ij} = \max(0, (W_{\mathrm{full}})_{ij})$, and let $W_{\mathrm{nonneg}}$ denote the solution to the constrained problem obtained in this way. Comparing the nonnegative versus unconstrained solutions, we found that $E(W_{\mathrm{full}}, W_{\mathrm{nonneg}})$ = 3.99e-04. Projecting $W_{\mathrm{full}}$ onto the nonnegative orthant leads to $E(W_{\mathrm{proj}}, W_{\mathrm{nonneg}})$ = 3.67e-04. In either case the $\infty$-norm difference is 0.009. These results show that the solution to the unconstrained problem is close to the solution of the constrained problem, and that the projection of the unconstrained solution is also close to the constrained solution. Algorithm 1 thus offers an efficient way to approximate the solution to the more difficult nonnegative problem, while retaining low rank.

4 Discussion

We have studied a numerical method specifically tailored for the important neuroscience problem of connectome regression from mesoscopic tract tracing experiments. This connectome inference problem was formulated as the regression problem (4). The optimality conditions for this problem turn out to be a linear matrix equation in the unknown connectivity $W$, which we propose to solve with Algorithm 1. Our numerical results show that the low-rank greedy algorithm, as proposed by [27], is a viable choice for acquiring low-rank factors of $W$ with a computation cost significantly smaller than other approaches [20, 3, 28]. This allows us to infer the flatmap matrix, with approximately 140 times more entries than previously obtained for the visual system, while taking significantly less time: computing the flatmap solution took hours versus days for the smaller low-rank visual network [20]. The first few singular vector components of these cortical connectivities are interpretable and reasonable from a neuroanatomy standpoint, although a full anatomical study of this inferred connectivity is outside the scope of the current paper.

The main ingredients of Algorithm 1 are solving the large, sparse linear systems of equations at each ALS iteration and solving the dense but small projected version of the original linear matrix equation for the Galerkin step. We had to carefully choose the solvers for each of these phases of the algorithm. The Galerkin step forms the principal bottleneck due to the absence of direct numerical methods to handle dense linear matrix equations of moderate size. We employed a matrix-valued CG iteration to approximately solve (10), implementing it on the GPU for speed. This leads to cubic complexity in $r$ at this step. One could argue that equipping this CG iteration with a preconditioner could speed up its convergence, but so far we were not successful in finding a preconditioner that reduced both the number of CG steps and the computational time.
A future research direction could be to derive an adequate preconditioning strategy for the problem structure in (7), which would increase the efficiency of any Krylov method.

Matrix-valued Krylov subspace methods [13, 28, 3, 37] offer an alternative class of possible algorithms for solving the overall linear matrix equation (7). However, for rapid convergence of these methods we typically need a preconditioner. In our tests on (7), these approaches performed poorly, because rank truncations (e.g. via thin QR or SVD) are required after major subcalculations which occur at every iteration. Computing these decompositions quickly became expensive because of the sheer number of necessary rank truncations in the Krylov method. If a suitable preconditioner for our problem were found, it would make sense to give low-rank matrix-valued Krylov methods another try.

The original regression problem (1) proposed by [20] demands that the solution $W$ be nonnegative. So far, this constraint is not considered by the employed algorithm. However, for the test problems and data we have tried, the computed matrix turns out to be mostly nonnegative. We typically find small negative entries that can be safely neglected without sacrificing accuracy. Although a mostly nonnegative solution is not generally expected when solving the unconstrained problem (4), it appears that such behavior is typical for nonnegative data matrices $X$ and $Y$.

Working directly with nonnegative factors $U \ge 0$ and $V \ge 0$ would guarantee a nonnegative approximation $W \approx UV^\top$, and it would allow interpreting the leading factors as the most important neural pathways in the brain. Modifying Algorithm 1 to compute nonnegative low-rank factors, or enforcing that the low-rank approximation $UV^\top \approx W$ is nonnegative (a nonlinear constraint), is a much harder goal to achieve. For instance, even if one generated nonnegative factor matrices $U$ and $V$, e.g. by changing the ALS step to nonnegative ALS, the orthogonalization and Galerkin update would each destroy this nonnegativity. New methods of NMF which incorporate regularizations similar to our Laplacian terms [12, 9] are an area of ongoing research, and the optimization techniques developed there could accelerate the nonnegative low-rank formulation of (1). These include other techniques developed with neuroscience in mind, such as neuron segmentation and calcium deconvolution [39] as well as sequence identification [30]. The greedy method we have presented is an excellent way to initialize the nonnegative version of the problem, similar to how SVD is used to initialize NMF. We hope to improve upon nonnegative low-rank methods in the future.

Model (1) is certainly not the only approach to solving the connectome inference problem. The loss term $\| P_\Omega(WX - Y) \|_F^2$ arises from Gaussian noise assumptions combined with missing data, and is the standard loss in matrix completion problems with noisy observations, e.g., [32, 10]. The regularization term is a thin plate spline penalty [48]. This is one of many possible choices for smoothing, among them penalties such as $\| \operatorname{grad}(W) \|^2$ or the total variation semi-norm [42, 11], which favors piecewise-constant solutions. While we recognize that there are many possible choices for the regularizer, the thin plate penalty is reasonable, linear, and thus convenient to work with. Previous work [20] has shown that it is useful for the connectome problem. Testing other forms of regularization is a worthy goal but not straightforward to implement at scale; this is outside the scope of the current paper.
Finally, the most exciting prospect for this class of algorithms is what can be learned when we apply them to next-generation tract tracing datasets. Such techniques can be used to resolve differences between the rat [6] and mouse [35] brains, or to uncover unknown topographies, see [41], in these and other animals (like the marmoset [31]). The mesoscale is also naturally the same resolution as obtained by wide-field calcium imaging. Spatial connectome modeling could elucidate the largely mysterious interactions among different sensory modalities, proprioception, and motor areas, hopefully leading to a better understanding of integrative functions.

5 Data and code

We tested our algorithm on two datasets (top view and flatmap) generated from Allen Institute for Brain Science Mouse Connectivity Atlas data, http://connectivity.brain-map.org. These data were obtained with the Python SDK allensdk version 0.13.1, available from http://alleninstitute.github.io/AllenSDK/. Our data pulling and processing scripts are available from https://github.com/kharris/allen-voxel-network.

We used the allensdk to retrieve 10 µm injection and projection density volumetric data for 126 wildtype experiments in cortex. These data were projected from 3-D to 2-D using either the top view or flatmap paths and saved as 2-D arrays. Next, the projected coordinates were split into left and right hemispheres. Since wildtype injections were always delivered into the right hemisphere, this becomes our source space $S$, whereas the union of the left and right hemispheres is the target space $T$. We constructed 2-D 5-point Laplacian matrices on these grids with "free" Neumann boundary conditions on the cortical edge. Finally, the 2-D projected data were downsampled 4 times along each dimension to obtain 40 µm resolution. The injection and projection data were then stacked into the matrices $X$ and $Y$, respectively. The mask $\Omega$ was set via $\Omega_{ij} = \mathbb{1}\{ X_{ij} \le \epsilon \}$ for a small threshold $\epsilon$.

MATLAB code which implements our greedy low-rank algorithm (Algorithm 1) is included in the repository https://gitlab.mpi-magdeburg.mpg.de/kuerschner/lowrank_connectome. We also include the problem inputs $X$, $Y$, $L_x$, $L_y$, $\Omega$ for our three example problems (test, top view, and flatmap) as MATLAB files. Note that $\Omega$ is stored as $1 - \Omega$ in these files, as this matches the convention of [20].
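For completeness, a sketch of assembling the 2-D 5-point graph Laplacian with free (Neumann) boundaries via a Kronecker sum; restricting it to an irregular cortical mask is bookkeeping that we omit here:

    import numpy as np
    import scipy.sparse as sp

    def chain_laplacian(n):
        main = 2 * np.ones(n); main[[0, -1]] = 1   # free ends (Neumann-like)
        return sp.diags([-np.ones(n - 1), main, -np.ones(n - 1)], [-1, 0, 1])

    def grid_laplacian(ny, nx):
        # Acts on column-stacked (Fortran-order) ny-by-nx images.
        return sp.kron(sp.eye(nx), chain_laplacian(ny)) \
             + sp.kron(chain_laplacian(nx), sp.eye(ny))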
Acknowledgements
We would like to thank Lydia Ng, Nathan Gouwens, Stefan Mihalas, Nile Graddis, and others at the Allen Institute for the top view and flatmap paths and general help accessing the data. Thank you to Braden Brinkman for discussions of the continuous problem, and to Stefan Mihalas and Eric Shea-Brown for general discussions. This work was primarily generated while PK was affiliated with the Max Planck Institute for Dynamics of Complex Technical Systems.

KDH was supported by the Big Data for Genomics and Neuroscience NIH training grant and a Washington Research Foundation Postdoctoral Fellowship. SD is thankful to the Engineering and Physical Sciences Research Council (UK) for supporting his postdoctoral position at the University of Bath through Fellowship EP/M019004/1, and for the kind hospitality of the Erwin Schrödinger International Institute for Mathematics and Physics (ESI), where this manuscript was finalised during the Thematic Programme
Numerical Analysis of Complex PDE Models in the Sciences.

References
I. Altas, J. Dym, M. Gupta, and R. Manohar , Multigrid Solution of Automati-cally Generated High-Order Discretizations for the Biharmonic Equation , SIAM Jour-nal on Scientific Computing, 19 (1998), pp. 1575–1585, https://doi.org/10.1137/S1464827596296970 .[2]
P. Benner , Solving large-scale control problems , IEEE Control Syst. Mag., 14 (2004),pp. 44–59.[3]
[3] P. Benner and T. Breiten, Low rank methods for a class of generalized Lyapunov equations and related issues, Numer. Math., 124 (2013), pp. 441–470, https://doi.org/10.1007/s00211-013-0521-0.
[4] P. Benner, R.-C. Li, and N. Truhar, On the ADI method for Sylvester equations, J. Comput. Appl. Math., 233 (2009), pp. 1035–1045.
[5] P. Benner and J. Saak, Numerical solution of large and sparse continuous time algebraic matrix Riccati and Lyapunov equations: a state of the art survey, GAMM Mitteilungen, 36 (2013), pp. 32–52, https://doi.org/10.1002/gamm.201310003.
[6] M. Bota, H.-W. Dong, and L. W. Swanson, From Gene Networks to Brain Networks, Nature Neuroscience, 6 (2003), pp. 795–799, https://doi.org/10.1038/nn1096.
[7] R. L. Buckner and D. S. Margulies, Macroscale cortical organization and a default-like apex transmodal network in the marmoset monkey, Nature Communications, 10 (2019), https://doi.org/10.1038/s41467-019-09812-8.
[8] R. Byrd, P. Lu, J. Nocedal, and C. Zhu, A Limited Memory Algorithm for Bound Constrained Optimization, SIAM Journal on Scientific Computing, 16 (1995), pp. 1190–1208, https://doi.org/10.1137/0916069.
[9] D. Cai, X. He, J. Han, and T. S. Huang, Graph Regularized Nonnegative Matrix Factorization for Data Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 33 (2011), pp. 1548–1560, https://doi.org/10.1109/TPAMI.2010.231.
[10] E. J. Candes and Y. Plan, Matrix Completion With Noise, Proceedings of the IEEE, 98 (2010), pp. 925–936, https://doi.org/10.1109/JPROC.2009.2035722.
[11] A. Chambolle and T. Pock, An introduction to continuous optimization for imaging, Acta Numerica, 25 (2016), pp. 161–319, https://doi.org/10.1017/S096249291600009X.
[12] A. Cichocki, R. Zdunek, A. H. Phan, and S.-i. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation, John Wiley & Sons, July 2009.
[13] T. Damm, Direct methods and ADI-preconditioned Krylov subspace methods for generalized Lyapunov equations, Numer. Lin. Alg. Appl., 15 (2008), pp. 853–871.
[14] E. Ringh, G. Mele, J. Karlsson, and E. Jarlebring, Sylvester-based preconditioning for the waveguide eigenvalue problem, Linear Algebra Appl., 542 (2018), pp. 441–463.
[15] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, fourth ed., 2013.
[16] L. Grasedyck, Existence of a low rank or H-matrix approximant to the solution of a Sylvester equation, Numer. Lin. Alg. Appl., 11 (2004), pp. 371–389.
[17] S. Grillner, N. Ip, C. Koch, W. Koroshetz, H. Okano, M. Polachek, M.-m. Poo, and T. J. Sejnowski, Worldwide Initiatives to Advance Brain Research, Nature Neuroscience, (2016), https://doi.org/10.1038/nn.4371.
[18] R. Gămănuț, H. Kennedy, Z. Toroczkai, M. Ercsey-Ravasz, D. C. Van Essen, K. Knoblauch, and A. Burkhalter, The Mouse Cortical Connectome, Characterized by an Ultra-Dense Cortical Graph, Maintains Specificity by Distinct Connectivity Profiles, Neuron, 97 (2018), pp. 698–715.e10, https://doi.org/10.1016/j.neuron.2017.12.037.
[19] M. Hardt and M. Wootters, Fast matrix completion without the condition number, in Proceedings of the 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, June 13–15, 2014, pp. 638–678.
[20] K. D. Harris, S. Mihalas, and E. Shea-Brown, High Resolution Neural Connectivity from Incomplete Tracing Data Using Nonnegative Spline Regression, in Neural Information Processing Systems, 2016.
[21] R. Harshman, Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis, UCLA Working Papers in Phonetics, 16 (1970).
[22] E. Jarlebring, G. Mele, D. Palitta, and E. Ringh, Krylov methods for low-rank commuting generalized Sylvester equations, Numer. Lin. Alg. Appl., (2018), https://doi.org/10.1002/nla.2176.
[23] A. Jenett, G. M. Rubin, T.-T. B. Ngo, D. Shepherd, C. Murphy, H. Dionne, B. D. Pfeiffer, A. Cavallaro, D. Hall, J. Jeter, N. Iyer, D. Fetter, J. H. Hausenfluck, H. Peng, E. T. Trautman, R. R. Svirskas, E. W. Myers, Z. R. Iwinski, Y. Aso, G. M. DePasquale, A. Enos, P. Hulamm, S. C. B. Lam, H.-H. Li, T. R. Laverty, F. Long, L. Qu, S. D. Murphy, K. Rokicki, T. Safford, K. Shaw, J. H. Simpson, A. Sowell, S. Tae, Y. Yu, and C. T. Zugates, A GAL4-Driver Line Resource for Drosophila Neurobiology, Cell Reports, 2 (2012), pp. 991–1001, https://doi.org/10.1016/j.celrep.2012.09.011.
[24] N. Kasthuri, K. J. Hayworth, D. R. Berger, R. L. Schalek, J. A. Conchello, S. Knowles-Barley, D. Lee, A. Vázquez-Reina, V. Kaynig, T. R. Jones, M. Roberts, J. L. Morgan, J. C. Tapia, H. S. Seung, W. G. Roncal, J. T. Vogelstein, R. Burns, D. L. Sussman, C. E. Priebe, H. Pfister, and J. W. Lichtman, Saturated Reconstruction of a Volume of Neocortex, Cell, 162 (2015), pp. 648–661, https://doi.org/10.1016/j.cell.2015.06.054.
[25] Micro-, Meso- and Macro-Connectomics of the Brain, Research and Perspectives in Neurosciences, Springer International Publishing, 2016.
[26] J. E. Knox, K. D. Harris, N. Graddis, J. D. Whitesell, H. Zeng, J. A. Harris, E. Shea-Brown, and S. Mihalas, High Resolution Data-Driven Model of the Mouse Connectome, bioRxiv, (2018), p. 293019, https://doi.org/10.1101/293019.
[27] D. Kressner and P. Sirković, Truncated low-rank methods for solving general linear matrix equations, Numer. Lin. Alg. Appl., 22 (2015), pp. 564–583, https://doi.org/10.1002/nla.1973.
[28] D. Kressner and C. Tobler, Krylov Subspace Methods for Linear Systems with Tensor Product Structure, SIAM J. Matrix Anal. Appl., 31 (2010), pp. 1688–1714.
[29] E. S. Lein, M. J. Hawrylycz, N. Ao, M. Ayres, A. Bensinger, A. Bernard, A. F. Boe, M. S. Boguski, K. S. Brockway, E. J. Byrnes, L. Chen, L. Chen, T.-M. Chen, M. C. Chin, J. Chong, B. E. Crook, A. Czaplinska, C. N. Dang, S. Datta, N. R. Dee, A. L. Desaki, T. Desta, E. Diep, T. A. Dolbeare, M. J. Donelan, H.-W. Dong, J. G. Dougherty, B. J. Duncan, A. J. Ebbert, G. Eichele, L. K. Estin, C. Faber, B. A. Facer, R. Fields, S. R. Fischer, T. P. Fliss, C. Frensley, S. N. Gates, K. J. Glattfelder, K. R. Halverson, M. R. Hart, J. G. Hohmann, M. P. Howell, D. P. Jeung, R. A. Johnson, P. T. Karr, R. Kawal, J. M. Kidney, R. H. Knapik, C. L. Kuan, J. H. Lake, A. R. Laramee, K. D. Larsen, C. Lau, T. A. Lemon, A. J. Liang, Y. Liu, L. T. Luong, J. Michaels, J. J. Morgan, R. J. Morgan, M. T. Mortrud, N. F. Mosqueda, L. L. Ng, R. Ng, G. J. Orta, C. C. Overly, T. H. Pak, S. E. Parry, S. D. Pathak, O. C. Pearson, R. B. Puchalski, Z. L. Riley, H. R. Rockett, S. A. Rowland, J. J. Royall, M. J. Ruiz, N. R. Sarno, K. Schaffnit, N. V. Shapovalova, T. Sivisay, C. R. Slaughterbeck, S. C. Smith, K. A. Smith, B. I. Smith, A. J. Sodt, N. N. Stewart, K.-R. Stumpf, S. M. Sunkin, M. Sutram, A. Tam, C. D. Teemer, C. Thaller, C. L. Thompson, L. R. Varnam, A. Visel, R. M. Whitlock, P. E. Wohnoutka, C. K. Wolkey, V. Y. Wong, M. Wood, M. B. Yaylaoglu, R. C. Young, B. L. Youngstrom, X. F. Yuan, B. Zhang, T. A. Zwingman, and A. R. Jones, Genome-Wide Atlas of Gene Expression in the Adult Mouse Brain, Nature, 445 (2007), pp. 168–176, https://doi.org/10.1038/nature05453.
[30] E. L. Mackevicius, A. H. Bahle, A. H. Williams, S. Gu, N. I. Denissenko, M. S. Goldman, and M. S. Fee, Unsupervised Discovery of Temporal Sequences in High-Dimensional Datasets, with Applications to Neuroscience, bioRxiv, (2018), p. 273128, https://doi.org/10.1101/273128.
[31] P. Majka, T. A. Chaplin, H.-H. Yu, A. Tolpygo, P. P. Mitra, D. K. Wójcik, and M. G. P. Rosa, Towards a Comprehensive Atlas of Cortical Connections in a Primate Brain: Mapping Tracer Injection Studies of the Common Marmoset into a Reference Digital Template, Journal of Comparative Neurology, 524 (2016), pp. 2161–2181, https://doi.org/10.1002/cne.24023.
[32] R. Mazumder, T. Hastie, and R. Tibshirani, Spectral Regularization Algorithms for Learning Large Incomplete Matrices, J. Mach. Learn. Res., 11 (2010), pp. 2287–2322.
[33] P. P. Mitra, The Circuit Architecture of Whole Brains at the Mesoscopic Scale, Neuron, 83 (2014), pp. 1273–1283, https://doi.org/10.1016/j.neuron.2014.08.055.
[34] A. Nouy, Proper generalized decompositions and separated representations for the numerical solution of high dimensional stochastic problems, Archives of Computational Methods in Engineering, 17 (2010), pp. 403–434, https://doi.org/10.1007/s11831-010-9054-1.
[35] S. W. Oh, J. A. Harris, L. Ng, B. Winslow, N. Cain, S. Mihalas, Q. Wang, C. Lau, L. Kuan, A. M. Henry, M. T. Mortrud, B. Ouellette, T. N. Nguyen, S. A. Sorensen, C. R. Slaughterbeck, W. Wakeman, Y. Li, D. Feng, A. Ho, E. Nicholas, K. E. Hirokawa, P. Bohn, K. M. Joines, H. Peng, M. J. Hawrylycz, J. W. Phillips, J. G. Hohmann, P. Wohnoutka, C. R. Gerfen, C. Koch, A. Bernard, C. Dang, A. R. Jones, and H. Zeng, A Mesoscale Connectome of the Mouse Brain, Nature, 508 (2014), pp. 207–214, https://doi.org/10.1038/nature13186.
[36] J. M. Ortega and W. C. Rheinboldt, Iterative solution of nonlinear equations in several variables, SIAM, 2000.
[37] D. Palitta and P. Kürschner, On the convergence of Krylov methods with low-rank truncations, arXiv e-print 1909.01226 (math.NA), 2019, https://arxiv.org/abs/1909.01226.
[38] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition, in Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, vol. 1, 1993, pp. 40–44, https://doi.org/10.1109/ACSSC.1993.342465.
[39] E. A. Pnevmatikakis, D. Soudry, Y. Gao, T. A. Machado, J. Merel, D. Pfau, T. Reardon, Y. Mu, C. Lacefield, W. Yang, M. Ahrens, R. Bruno, T. M. Jessell, D. S. Peterka, R. Yuste, and L. Paninski, Simultaneous Denoising, Deconvolution, and Demixing of Calcium Imaging Data, Neuron, 89 (2016), pp. 285–299, https://doi.org/10.1016/j.neuron.2015.11.037.
[40] C. E. Powell, D. Silvester, and V. Simoncini, An Efficient Reduced Basis Solver for Stochastic Galerkin Matrix Equations, SIAM J. Sci. Comput., 39 (2017), pp. A141–A163, https://doi.org/10.1137/15M1032399.
[41] M. W. Reimann, M. Gevaert, Y. Shi, H. Lu, H. Markram, and E. Muller, A null model of the mouse whole-neocortex micro-connectome, Nature Communications, 10 (2019), pp. 1–16, https://doi.org/10.1038/s41467-019-11630-x.
[42] L. I. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D: Nonlinear Phenomena, 60 (1992), pp. 259–268, https://doi.org/10.1016/0167-2789(92)90242-F.
[43] S. D. Shank, V. Simoncini, and D. B. Szyld, Efficient low-rank solution of generalized Lyapunov equations, Numer. Math., 134 (2015), pp. 327–342.
[44] V. Simoncini, Computational methods for linear matrix equations, SIAM Rev., 58 (2016), pp. 377–441.
[45] L. Sorber, M. Van Barel, and L. De Lathauwer, Optimization-based algorithms for tensor decompositions: Canonical polyadic decomposition, decomposition in rank-(L_r, L_r, 1) terms, and a new generalization, SIAM Journal on Optimization, 23 (2013), pp. 695–720, https://doi.org/10.1137/120868323.
[46] O. Sporns, Networks of the Brain, The MIT Press, 1st ed., 2010.
[47] D. C. Van Essen, Cartography and Connectomes, Neuron, 80 (2013), pp. 775–790, https://doi.org/10.1016/j.neuron.2013.10.027.
[48] G. Wahba, Spline Models for Observational Data, SIAM, Sept. 1990.
[49] R. J. F. Ypma and E. T. Bullmore, Statistical Analysis of Tract-Tracing Experiments Demonstrates a Dense, Complex Cortical Network in the Mouse, PLOS Comput Biol, 12 (2016), p. e1005104, https://doi.org/10.1371/journal.pcbi.1005104.