Scalable Feature Matching Across Large Data Collections
David Degras

January 7, 2021
Abstract
This paper is concerned with matching feature vectors in a one-to-one fashion across large collections of datasets. Formulating this task as a multidimensional assignment problem with decomposable costs (MDADC), we develop extremely fast algorithms with time complexity linear in the number n of datasets and space complexity a small fraction of the data size. These remarkable properties hinge on using the squared Euclidean distance as dissimilarity function, which can reduce the \binom{n}{2} matching problems between pairs of datasets to n problems and enables calculating assignment costs on the fly. To our knowledge, no other method applicable to the MDADC possesses the linear scaling and low-storage properties necessary for large-scale applications. In numerical experiments, the novel algorithms outperform competing methods and show excellent computational and optimization performances. An application of feature matching to a large neuroimaging database is presented. The algorithms of this paper are implemented in the R package matchFeat available at github.com/ddegras/matchFeat.

1 Introduction

Matching objects across units (e.g. subjects, digital images, or networks) based on common descriptor variables is a ubiquitous task in applied science. This problem, variously known as object matching, feature matching, data association, or assignment problem, is at the core of applications such as resource allocation [Pierskalla, 1968], object tracking [Thornbrue et al., 2010, Bar-Shalom et al., 2011, Dehghan et al., 2015, Rezatofighi et al., 2015, Smeulders et al., 2014, Wang et al., 2015], object recognition [Lowe, 2001, Belongie et al., 2002, Conte et al., 2004], navigation systems [Doherty et al., 2019], image registration [Le Moigne et al., 2011, Ashburner, 2007], optimization of communication networks [Shalom et al., 2010], connectomics in neuroscience [Haxby et al., 2011, Vogelstein et al., 2015], and more.

The impetus for this work is a task in functional neuroimaging which consists in matching collections of biomarkers (more precisely, brain connectivity measures) between the subjects of a study. The matching process may serve in data exploration to provide new scientific insights and generate hypotheses. It can also be a preliminary step in a group analysis to ensure meaningful comparisons across subjects. Key aspects of the matching problem under study are that: (i) the number of subjects and/or the number of biomarkers per subject may be large, posing computational challenges, (ii) for two given subjects, each biomarker of one subject must be matched to at most one biomarker of the other (one-to-one matching), and (iii) the matching must be consistent, i.e. transitive across subjects (for example, denoting subjects by letters and biomarkers by numbers, if A1 is matched to B2 and B2 to C3, then A1 must be matched to C3). This matching problem is not specific to neuroimaging and is applicable to the research fields mentioned above. It is generally relevant to multilevel or hierarchical analyses where outputs of a certain level of analysis must be matched before becoming inputs at the next level. This situation typically occurs when the outputs to be matched result from an unsupervised analysis such as clustering, segmentation, or dimension reduction.
Problem formulation. The matching problem at the core of this paper is as follows. Given n sets of vectors in R^p having the same size, say {x_{11}, ..., x_{1m}}, ..., {x_{n1}, ..., x_{nm}}, find permutations σ_1, ..., σ_n of the vector labels {1, ..., m} that minimize the sum of pairwise squared Euclidean distances within clusters {x_{1σ_1(k)}, ..., x_{nσ_n(k)}} (1 ≤ k ≤ m). Writing [r] = {1, ..., r} for a positive integer r and letting S_m be the set of all permutations of [m], the problem expresses as

\min_{\sigma_1, \dots, \sigma_n \in S_m} \sum_{1 \le i < j \le n} \sum_{k=1}^{m} \| x_{i\sigma_i(k)} - x_{j\sigma_j(k)} \|^2 .    (1)
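For concreteness, the objective of (1) is cheap to evaluate for a given set of permutations. Below is a minimal R sketch; the list-of-matrices data layout and the function name are illustrative conventions of this sketch, not from the paper.

```r
## Evaluate the objective of problem (1): sum over all pairs of units of the
## squared Euclidean distances between matched feature vectors.
match_objective <- function(X, sigma) {
  ## X: list of n matrices of size p x m; sigma: list of n permutations of 1:m
  n <- length(X)
  total <- 0
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      D <- X[[i]][, sigma[[i]]] - X[[j]][, sigma[[j]]]  # p x m differences
      total <- total + sum(D^2)
    }
  }
  total
}

## Toy example: n = 3 units, m = 4 labels, p = 2 features
set.seed(1)
X <- replicate(3, matrix(rnorm(8), 2, 4), simplify = FALSE)
match_objective(X, replicate(3, 1:4, simplify = FALSE))
```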
Related work. Problem (1) can be viewed through the prism of combinatorial optimization problems such as the minimum weight clique partitioning problem in a complete n-partite graph, the quadratic assignment problem [Koopmans and Beckmann, 1957, Çela, 1998], or the multidimensional assignment problem (MAP) [e.g. Burkard et al., 2009]. The MAP formalism is well suited to this work and is recalled hereafter:

\min_{\sigma_1, \dots, \sigma_n \in S_m} \sum_{k=1}^{m} c_{\sigma_1(k)\sigma_2(k)\cdots\sigma_n(k)}    (3)

where (c_{a_1 a_2 \dots a_n}) ∈ R^{m^n} is an n-dimensional array containing the costs of assigning the feature vectors x_{1a_1}, ..., x_{na_n} to the same cluster. Problem (1) is an instance of the MAP and, more precisely, it is a multidimensional assignment problem with decomposable costs (MDADC) [e.g. Bandelt et al., 1994, 2004] because its assignment costs decompose as

c_{a_1 a_2 \dots a_n} = \sum_{1 \le i < j \le n} d(x_{i a_i}, x_{j a_j})    (4)

where d is a dissimilarity function. The MDADC (with n > 3) can be seen as a generalization of the 3D assignment problem of Spieksma and Woeginger [1996].

The formidable literature on the MAP, which spans more than five decades and multiple mathematical fields, will not be reviewed here. The interested reader may fruitfully consult Burkard et al. [2009] and Pardalos and Pitsoulis [2000]. In fact, given the focus of the present work on computations, a broad review of the general MAP is not necessary. Indeed, optimization methods for the MAP [e.g. Karapetyan and Gutin, 2011, Pierskalla, 1968, Poore and Rijavec, 1993, Robertson, 2001] are not computationally efficient for the special case of the MDADC (and in particular (1)), especially if the number n of dimensions is large. We will therefore only discuss the relevant MDADC literature.

Bandelt et al. [1994] provide simple "hub" and "recursive" heuristics for the MDADC (3)-(4) along with their approximation ratios (worst-case bounds on the ratio of a method's attained objective to the optimal objective value). The hub heuristic consists in selecting one dimension i ∈ [n] of the MDADC as a "hub" and matching all other dimensions to this one, i.e. finding for each dimension j ≠ i the assignment that minimizes the total cost with respect to i. The recursive heuristic starts by permuting the n dimensions of the problem and then recursively finds the best assignment for the ith permuted dimension with respect to the (i − 1) first permuted dimensions (i = 2, ..., n). Bandelt et al. [2004] enhance the heuristic methods of Bandelt et al. [1994] with local neighborhood search methods that attempt to improve a solution one or two dimensions at a time. They derive lower bounds for the minimum cost assignment based on a Lagrangian relaxation of the MDADC. Collins [2012] also exploits the idea of improving solutions one dimension at a time in the general MAP (3) through a factorization technique.

Kuroki and Matsui [2009] formulate (1) as the problem of finding a clique cover of an n-partite graph with minimum edge weights. They express the clique cover problem with various mathematical programs (integer linear, nonconvex quadratic, integer quadratic, and second order cone) which they tackle directly or after relaxation. They also provide approximation ratios and computational complexity bounds for their algorithms. Tauer and Nagi [2013] and Natu et al. [2020] solve Lagrangian relaxations of the integer linear program formulation of the MDADC, with an emphasis on efficient parallel computation in a Map-Reduce framework or with GPUs. They derive tight lower bounds to control the approximation error of their algorithms.

As an alternative to the multidimensional assignment perspective, problem (1) can be viewed as an instance of constrained clustering or semi-supervised learning [Basu et al., 2009, Gancarski et al., 2020]. The constraint that each unit i ∈ [n] contributes exactly one feature vector to each cluster can be rephrased as: two vector instances from the same unit cannot be assigned to the same cluster. This type of constraint, namely that certain pairs of instances cannot be assigned to the same cluster ("cannot link") or that certain pairs must be assigned to the same cluster ("must link"), is called an equivalence constraint and can be handled by constrained K-means algorithms [Wagstaff et al., 2001, Bilenko et al., 2004, Pelleg and Baras, 2007] or through constrained mixture models [Shental et al., 2004].

Other tasks related to problem (1) but not directly relevant are object tracking, with applications in engineering and more recently in computer vision and artificial intelligence, and image registration, which plays a key role in image processing, object recognition, and remote sensing. The former involves a temporal dimension absent from (1) whereas the latter involves many (and often noisy) features that are not matched one-to-one. Matching problems also have a long history in statistics and have been a topic of intense scrutiny in machine learning in recent years [DeGroot and Goel, 1976, Collier and Dalalyan, 2016, Hsu et al., 2017, Pananjady et al., 2018]. However, much of the research in these fields relevant to (1) deals with the case where n = 2 and m is large (asymptotically m → ∞) whereas we are chiefly interested in situations where m is fixed and n is large (n → ∞).

Contributions. The methods for the MDADC (3)-(4) discussed heretofore are applied in practice to problems of small size, say n in the single digits or a few tens. Theoretical considerations as well as numerical experiments from this paper (see Sections 2-3) and from the literature indicate that these methods cannot handle large-scale problems with n in the hundreds, thousands, or more (at least, not in a reasonable time on a single computer). As a simple example, the \binom{n}{2} m^2 costs in (4) are typically calculated and stored before starting the optimization, but even this preliminary step may exceed computer memory limits for large n and/or m. In response to this methodological gap, our research aims to develop fast, scalable methods for matching feature vectors in a one-to-one fashion across a large number of statistical units. The main contributions of the paper are the following.

1. We develop very fast algorithms for solving the matching problem (1), that is, (3)-(4) with d the squared Euclidean distance.
The three main algorithms (Sections 2.1-2.2-2.3) have iteration complexity O(nm³) and only take a few iterations to converge, meaning that they scale linearly with n. In addition, they calculate the assignment costs (4) on the fly and have space requirements O(mn + mp), a fraction of the data size mnp. We also present initialization methods and a refinement method (pairwise interchange). Further, we take a probabilistic view of (1) as a constrained Gaussian mixture model and devise an efficient implementation of the Expectation-Maximization (EM) algorithm.

2. We provide a broad review of the diverse methods applicable to (1) (integer linear programming, various relaxations, constrained clustering) which rarely appear together in a paper. The novel algorithms are compared to these methods in numerical experiments and show excellent computation and optimization performances.

3. An R package matchFeat implementing all the algorithms of the paper is made available at github.com/ddegras/matchFeat.

4. The matching problem (1) is applied to a large database of neuroimaging data to study functional connectivity in the human brain. The data analysis confirms existing knowledge but also generates new insights, thus demonstrating the practical usefulness of our approach.

Organization of the paper. Section 2 introduces novel algorithms for the matching problem (1). In Section 3, a numerical study assesses the novel algorithms and competing methods with respect to computation and optimization performance. Section 4 details an application of our matching approach to a large neuroimaging database (ABIDE) relating to autism spectrum disorders. Concluding remarks are gathered in Section 5 and additional details of the data analysis are provided in the Appendix.

2 Matching algorithms

This section introduces novel algorithms for the matching problem (1). The first four are local search methods that aim to improve existing solutions. At the end of the section, we discuss initialization techniques for the local search methods.

2.1 K-means matching

For a given n-uple of permutations σ = (σ_1, ..., σ_n) ∈ (S_m)^n, let X̄_σ be the average matrix of the permuted data, with columns x̄_{σ,k} = (1/n) Σ_{i=1}^n x_{iσ_i(k)} for 1 ≤ k ≤ m. Problem (1) is equivalent to

\min_{\sigma_1, \dots, \sigma_n \in S_m} \sum_{i=1}^{n} \sum_{k=1}^{m} \| x_{i\sigma_i(k)} - \bar{x}_{\sigma,k} \|^2 .    (5)

The following method adapts the standard K-means clustering algorithm [Lloyd, 1982] to the matching problem (5).

1. Initialize σ = (σ_1, ..., σ_n) to some arbitrary value, for example σ = (Id_{[m]}, ..., Id_{[m]}). Calculate the average matrix X̄_σ and the objective value (5).

2. Given the average matrix X̄_σ: for 1 ≤ i ≤ n, find a permutation σ_i that minimizes Σ_{k=1}^m ‖x_{iσ_i(k)} − x̄_{σ,k}‖². Update the solution to σ = (σ_1, ..., σ_n).

3. Given σ: calculate the average matrix X̄_σ and the objective value (5). If the objective has not decreased from the previous iteration, terminate the execution and return σ. Else go back to step 2.

Steps 2 and 3 above are non-increasing in the objective (5). For this reason, and due to the finiteness of the search space, the proposed approach converges in a finite number of iterations. Like the K-means, it only finds a local minimum of (5).

Concerning computations, step 3 can be performed in O(nmp) flops. Step 2, which consists of n separate optimizations, is the computational bottleneck.
Observe that

\sum_{k=1}^{m} \| x_{i\sigma_i(k)} - \bar{x}_{\sigma,k} \|^2 = \sum_{k=1}^{m} \| x_{i\sigma_i(k)} \|^2 - 2 \sum_{k=1}^{m} \langle x_{i\sigma_i(k)}, \bar{x}_{\sigma,k} \rangle + \sum_{k=1}^{m} \| \bar{x}_{\sigma,k} \|^2
  = \sum_{k=1}^{m} \| x_{ik} \|^2 - 2 \sum_{k=1}^{m} \langle x_{i\sigma_i(k)}, \bar{x}_{\sigma,k} \rangle + \sum_{k=1}^{m} \| \bar{x}_{\sigma,k} \|^2

where ⟨·,·⟩ denotes the Euclidean scalar product. That is, the minimization of Σ_{k=1}^m ‖x_{iσ_i(k)} − x̄_{σ,k}‖² with respect to σ_i ∈ S_m is equivalent to

\max_{\sigma_i \in S_m} \sum_{k=1}^{m} \langle x_{i\sigma_i(k)}, \bar{x}_{\sigma,k} \rangle .    (6)

Problem (6) is an instance of the well-known linear assignment problem (LAP) [e.g. Burkard et al., 2009, Chap. 4]. After calculating the assignment matrix A = (⟨x̄_{σ,k}, x_{il}⟩)_{1≤k,l≤m}, the LAP (6) can be solved for example with the Hungarian algorithm [Kuhn, 1955, Munkres, 1957]. Efficient implementations of the Hungarian algorithm have complexity O(m³).

The K-means matching algorithm is summarized hereafter. The objective value in (5) is denoted by F(σ).

Algorithm 1 K-Means Matching
Input: X_1, ..., X_n ∈ R^{p×m}, σ = (σ_1, ..., σ_n) ∈ (S_m)^n
x̄_{σ,k} ← (1/n) Σ_{i=1}^n x_{iσ_i(k)} (1 ≤ k ≤ m), F_new ← F(σ)
repeat
    F_old ← F_new
    for i = 1, ..., n do
        Solve the LAP (6) and call σ_i⁺ a solution. σ_i ← σ_i⁺
    end for
    σ ← (σ_1, ..., σ_n)
    x̄_{σ,k} ← (1/n) Σ_{i=1}^n x_{iσ_i(k)} (1 ≤ k ≤ m), F_new ← F(σ)
until F_new ≥ F_old

Remark. If p = 1, the matrices X_i are row vectors and the x_{ik} are scalars. In this case, step 2 of the proposed method is extremely simple. Indeed, for each 1 ≤ i ≤ n, the sum Σ_{k=1}^m x_{iσ_i(k)} x̄_{σ,k} is maximized when the x_{ik} and the x̄_{σ,k} are matched by rank. More precisely, take any s_i ∈ S_m such that x_{is_i(1)} ≤ ··· ≤ x_{is_i(m)} and any s ∈ S_m such that x̄_{σ,s(1)} ≤ ··· ≤ x̄_{σ,s(m)}. Then σ_i = s_i ∘ s^{−1} maximizes the sum. In other words, the optimal permutations σ_i are simply obtained by sorting the components of the X_i and of x̄_σ (computational complexity O(nm log m)).
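A compact R sketch of Algorithm 1 follows, solving the LAP (6) with solve_LSAP from the clue package (an assumption of this sketch; the matchFeat package provides the paper's actual implementation). Since solve_LSAP requires nonnegative entries, the assignment matrix is shifted by its minimum, which changes every assignment's total by the same constant and leaves the optimizer unchanged.

```r
## Illustrative sketch of Algorithm 1 (K-means matching); not the matchFeat code.
library(clue)  # provides solve_LSAP (Hungarian algorithm)

kmeans_match <- function(X, sigma = NULL) {
  n <- length(X); m <- ncol(X[[1]])
  if (is.null(sigma)) sigma <- replicate(n, 1:m, simplify = FALSE)
  average <- function(s) Reduce(`+`, Map(function(x, si) x[, si], X, s)) / n
  objective <- function(s, xbar)
    sum(vapply(seq_len(n), function(i) sum((X[[i]][, s[[i]]] - xbar)^2), 0))
  xbar <- average(sigma)
  F_new <- objective(sigma, xbar)
  repeat {
    F_old <- F_new
    for (i in seq_len(n)) {                 # step 2: n separate LAPs (6)
      A <- crossprod(xbar, X[[i]])          # A[k, l] = <xbar_k, x_il>
      sigma[[i]] <- as.integer(solve_LSAP(A - min(A), maximum = TRUE))
    }
    xbar <- average(sigma)                  # step 3: update average and objective
    F_new <- objective(sigma, xbar)
    if (F_new >= F_old) break               # stop when no further decrease
  }
  list(sigma = sigma, objective = F_new)
}
```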
2.2 Block coordinate ascent

For convenience, problem (1) is rewritten here using permutation matrices P_1, ..., P_n instead of permutation functions σ_1, ..., σ_n. Each P_i (1 ≤ i ≤ n) is a square matrix with entries in {0, 1} such that each row and each column contains the value 1 exactly once. Let Π_m be the set of all m × m permutation matrices. Problem (1) expresses as the binary quadratic assignment problem

\min_{P_1, \dots, P_n \in \Pi_m} \sum_{i=1}^{n} \sum_{j=1}^{n} \| X_i P_i - X_j P_j \|_F^2    (7)

where ‖·‖_F denotes the Frobenius norm (‖X‖_F = ⟨X, X⟩_F^{1/2} = (tr(X'X))^{1/2} with tr(·) the trace operator). By expanding the squared Frobenius norm in the objective and noting that column permutations do not change the Frobenius norm of a matrix, we have

\sum_{i=1}^{n} \sum_{j=1}^{n} \| X_i P_i - X_j P_j \|_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} \big( \| X_i P_i \|_F^2 + \| X_j P_j \|_F^2 - 2 \langle X_i P_i, X_j P_j \rangle_F \big)
  = \sum_{i=1}^{n} \sum_{j=1}^{n} \big( \| X_i \|_F^2 + \| X_j \|_F^2 \big) - 2 \Big\| \sum_{i=1}^{n} X_i P_i \Big\|_F^2 .

Hence, as a function of P_1, ..., P_n, problem (7) is equivalent to

\max_{P_1, \dots, P_n \in \Pi_m} \Big\| \sum_{i=1}^{n} X_i P_i \Big\|_F^2 .    (8)

The maximization problem (8) can be handled one matrix P_i at a time (1 ≤ i ≤ n), that is, by block coordinate ascent [BCA, e.g. Wright, 2015]. Given a current solution (P̂_1, ..., P̂_n) and an index i, all matrices P̂_j, j ≠ i, are fixed and the task at hand is

\max_{P_i \in \Pi_m} \Big\| X_i P_i + \sum_{1 \le j \le n,\, j \ne i} X_j \hat{P}_j \Big\|_F^2

which, after expansion, is equivalent to the linear assignment problem

\max_{P_i \in \Pi_m} \Big\langle P_i , X_i' \sum_{j \ne i} X_j \hat{P}_j \Big\rangle_F .    (9)

As mentioned in Section 2.1, (9) can be efficiently solved with the Hungarian algorithm. The permutation matrix P̂_i is then updated to a solution of (9). This operation is repeated with the index i sweeping through the set [n] until no further increase in the objective (8) has been achieved in a full sweep. Given that each update of a P̂_i is non-decreasing in the objective (8) and that the search domain Π_m^n is finite, the algorithm is guaranteed to converge in a finite number of steps. Popular methods for sweeping through [n] include the cyclical order (also known as the Gauss-Seidel rule), random sampling, random permutations of [n], and greedy selection.

The BCA algorithm is summarized hereafter. The objective function in (8) is denoted by F. For simplicity the sweeping order is taken to be cyclical but any other sweeping method can be used.

Algorithm 2 Block Coordinate Ascent
Input: X_1, ..., X_n ∈ R^{p×m}, P_1, ..., P_n ∈ Π_m
S ← Σ_{i=1}^n X_i P_i, F_new ← ‖S‖²_F
repeat
    F_old ← F_new
    for i = 1, ..., n do
        S_i ← S − X_i P_i
        Solve the LAP max_{P_i ∈ Π_m} ⟨P_i, X_i' S_i⟩_F and call P_i⁺ a solution.
        P_i ← P_i⁺, S ← S_i + X_i P_i
    end for
    F_new ← ‖S‖²_F
until F_new ≤ F_old

Algorithm 2 can be viewed as a special case of the local search algorithm LS1 of Bandelt et al. [2004]. The LS1 algorithm is more general in that it uses an arbitrary dissimilarity function d in the MDADC (3)-(4). The computational price to pay for this generality is that for each block update (i ∈ [n]) the assignment matrix A_i = (Σ_{j ∈ [n]\{i}} d(x_{jσ_j(k)}, x_{il}))_{1≤k,l≤m} must be calculated from scratch in O(nm²) flops. Hence the LS1 method has iteration complexity O(n²m²) (one iteration meaning one sweep through [n]), which may be prohibitive for large n. In comparison, the squared Euclidean distance d = ‖·‖² employed in the BCA method enables efficient computation of A_i in O(m²p) flops by keeping track of the running sum Σ_{i=1}^n X_i P_i through cheap updates. Accordingly, the BCA method has iteration complexity O(nm³), linear in n. A variant of the BCA method using asynchronous parallel updates of the matrices P̂_i (the so-called Jacobi update) can further reduce the iteration complexity, although the convergence properties of this approach are not clear.
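Again assuming clue::solve_LSAP for the LAPs, a minimal R sketch of Algorithm 2 with the running-sum trick (permutations stored as integer vectors, so that X_i P_i is a column reordering):

```r
## Illustrative sketch of Algorithm 2 (BCA); not the matchFeat implementation.
library(clue)

bca_match <- function(X, perm = NULL) {
  n <- length(X); m <- ncol(X[[1]])
  if (is.null(perm)) perm <- replicate(n, 1:m, simplify = FALSE)
  S <- Reduce(`+`, Map(function(x, p) x[, p], X, perm))  # S = sum_i X_i P_i
  F_new <- sum(S^2)
  repeat {
    F_old <- F_new
    for (i in seq_len(n)) {
      Si <- S - X[[i]][, perm[[i]]]         # leave unit i out
      A <- crossprod(Si, X[[i]])            # A[k, l] = <s_k, x_il>, cf. LAP (9)
      perm[[i]] <- as.integer(solve_LSAP(A - min(A), maximum = TRUE))
      S <- Si + X[[i]][, perm[[i]]]         # cheap update of the running sum
    }
    F_new <- sum(S^2)                       # objective (8)
    if (F_new <= F_old) break
  }
  list(perm = perm, objective = F_new)
}
```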
2.3 Frank-Wolfe method

In the previous section, problem (8) was solved one permutation matrix P_i at a time while keeping the other P_j (j ≠ i) fixed. As an alternative, one may relax this problem to the set D_m of doubly stochastic matrices of dimensions m × m, which is the convex hull of Π_m. (As a reminder, a doubly stochastic matrix is a square matrix with elements in [0, 1] whose rows and columns all sum to 1.) The relaxed problem is

\max_{P_1, \dots, P_n \in D_m} \Big\| \sum_{i=1}^{n} X_i P_i \Big\|_F^2 .    (10)

Although this relaxation leads to an indefinite program (i.e. maximizing a convex quadratic form), it is the correct way to relax (7)-(8). In contrast, directly relaxing (7) to D_m would produce a convex program that is computationally simpler but does not provide tight bounds [Lyzinski et al., 2016].

The Frank-Wolfe algorithm [Frank and Wolfe, 1956] is an excellent candidate for this maximization. Indeed, the gradient of (10) is straightforward to compute. Denoting by F the objective function of (10), the partial derivatives are simply ∂F/∂P_i = 2 X_i' Σ_{j=1}^n X_j P_j (1 ≤ i ≤ n). In addition, the associated linear program

\max_{Q_1, \dots, Q_n \in D_m} \sum_{i=1}^{n} \Big\langle Q_i , X_i' \sum_{j=1}^{n} X_j P_j \Big\rangle_F    (11)

which provides the search direction (Q_1, ..., Q_n) for the next algorithm iterate, is easily solvable as n separate linear assignment problems (LAP). Although each LAP is solved over D_m, Birkhoff-von Neumann's theorem guarantees that a solution can be found in Π_m, a property referred to as the integrality of assignment polytopes [Birkhoff, 1946, von Neumann, 1953].

Having found the search direction, it remains to select the step size α ∈ [0, 1] that maximizes F(P + α(Q − P)), where P = (P_1, ..., P_n) and Q = (Q_1, ..., Q_n). The expression to maximize is a quadratic polynomial in α with leading coefficient ‖Σ_{i=1}^n X_i(Q_i − P_i)‖²_F ≥ 0. Accordingly, the maximum over [0, 1] is attained either at α = 1 or at α = 0. In the former case, the algorithm takes a full step in the direction Q whereas in the latter case, the current solution cannot be improved upon and the algorithm ends. Interestingly, the iterates generated by the Frank-Wolfe algorithm for problem (10) stay in Π_m although in principle they could also explore the interior of D_m. This is a consequence of the integrality of the search direction Q and of the line search method for a quadratic objective, which make the step size α equal to 0 or 1.

Algorithm 3 Frank-Wolfe
Input: X_1, ..., X_n ∈ R^{p×m}, P_1, ..., P_n ∈ D_m
S ← Σ_{i=1}^n X_i P_i, F_new ← ‖S‖²_F
repeat
    S′ ← 0, F_old ← F_new
    for i = 1 to n do
        Solve the LAP max_{Q_i ∈ D_m} ⟨Q_i, X_i' S⟩_F and call Q_i a solution.
        S′ ← S′ + X_i Q_i
    end for
    F_new ← ‖S′‖²_F
    if F_new > F_old then
        P_i ← Q_i (1 ≤ i ≤ n), S ← S′
    end if
until F_new ≤ F_old
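To spell out the line search used in Algorithm 3, write S = Σ_i X_i P_i and S_Q = Σ_i X_i Q_i. The objective along the segment from P to Q is then

\[ g(\alpha) = \| S + \alpha (S_Q - S) \|_F^2 = \|S\|_F^2 + 2\alpha \langle S, S_Q - S \rangle_F + \alpha^2 \| S_Q - S \|_F^2 , \]

a quadratic in α with nonnegative leading coefficient, hence convex; its maximum over [0, 1] is attained at an endpoint, which justifies the accept-or-stop rule (α ∈ {0, 1}) implemented above.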
2.4 Pairwise interchange

The BCA algorithm of Section 2.2 attempts to improve an existing solution to (1) one permutation σ_i at a time. In other words, at each iteration, it changes all assignments σ^l = (σ_1(l), ..., σ_n(l)) (1 ≤ l ≤ m) in a single dimension. Karapetyan and Gutin [2011] call this approach a dimensionwise heuristic. Another strategy, called the interchange or k-exchange heuristic, is to change a few assignments (typically, k = 2 or k = 3) in all dimensions by element swaps [e.g. Balas and Saltzman, 1991, Robertson, 2001, Oliveira and Pardalos, 2004]. Here we consider the 2-assignment exchange algorithm (Algorithm 3.4) of Robertson [2001] for the general MAP (3) and adapt it to problem (1). In this algorithm, given two assignments, the search for the best interchange is done exhaustively. This involves accessing as many as 2^{n−1} assignment costs, which is reasonable when (i) the costs are precalculated, (ii) n is small, and (iii) candidate assignments for exchange are easily found among all feasible assignments. However, for moderate to large n, and in the context of problem (1) where assignment costs are not precalculated, the calculation and exhaustive search of 2^{n−1} costs for each of the \binom{m}{2} candidate pairs of assignments is intractable. We will show that in problem (1), the pairwise interchange heuristic can be efficiently solved as a binary quadratic program.

Given a solution σ = (σ_1, ..., σ_n) to (1) and two associated assignments σ^q and σ^r (1 ≤ q < r ≤ m), the basic problem of pairwise interchange is to improve the objective in (1) by interchanging elements between these assignments, i.e. by swapping the values of σ_i(q) and σ_i(r) for one or more indices i ∈ [n]. Formally, the problem is

\min_{\sigma^*_1, \dots, \sigma^*_n \in S_m} \sum_{1 \le i < j \le n} \sum_{k=1}^{m} \| x_{i\sigma^*_i(k)} - x_{j\sigma^*_j(k)} \|^2    (12)

where each σ*_i equals σ_i either everywhere or after swapping the values σ_i(q) and σ_i(r). Without loss of generality, assume (q, r) = (1, 2) and σ_i = Id_{[m]} for 1 ≤ i ≤ n. Problem (12) becomes

\min_{\sigma^*_1, \dots, \sigma^*_n \in S_2} \sum_{1 \le i, j \le n} \| x_{i\sigma^*_i(1)} - x_{j\sigma^*_j(1)} \|^2 + \sum_{1 \le i, j \le n} \| x_{i\sigma^*_i(2)} - x_{j\sigma^*_j(2)} \|^2 .    (13)

As in the previous sections, the problem can be transformed to

\max_{\sigma^*_1, \dots, \sigma^*_n \in S_2} \Big\| \sum_{i=1}^{n} x_{i\sigma^*_i(1)} \Big\|^2 + \Big\| \sum_{i=1}^{n} x_{i\sigma^*_i(2)} \Big\|^2 .

Replacing the permutations σ*_i ∈ S_2 by binary variables c_i, the problem becomes

\max_{c_1, \dots, c_n \in \{0,1\}} \Big\| \sum_{i=1}^{n} \big( c_i x_{i1} + (1 - c_i) x_{i2} \big) \Big\|^2 + \Big\| \sum_{i=1}^{n} \big( (1 - c_i) x_{i1} + c_i x_{i2} \big) \Big\|^2

and, after simple manipulations,

\max_{c_1, \dots, c_n \in \{0,1\}} \sum_{i,j} c_i c_j \langle d_i, d_j \rangle - n \sum_{i} c_i \langle d_i, \bar{d} \rangle    (14)

where d_i = x_{i1} − x_{i2} and d̄ = (1/n) Σ_{i=1}^n d_i. This is an unconstrained binary quadratic program (UBQP) of size n that can be solved with standard mathematical software (e.g. Cplex, Gurobi, Mosek). Refer to Kochenberger et al. [2014] for a survey of the UBQP literature.

Having reduced the basic pairwise interchange problem (12) to the UBQP (14), we now embed it in Algorithm 3.4 of Robertson [2001], which combines randomization and greedy selection of interchange pairs. Hereafter, F(σ) denotes the objective value in (1) and σ = (σ_1, ..., σ_n) ∈ (S_m)^n is identified with the assignments {σ^1, ..., σ^m}, where σ^l = (σ_1(l), ..., σ_n(l)). The notation diag(·) is used for diagonal matrices.
Algorithm 4 Pairwise Interchange with Greedy Selection
Input: X_1, ..., X_n ∈ R^{p×m}, σ ≡ {σ^1, ..., σ^m}
C ← σ {candidate set of assignments for interchange}
while C ≠ ∅ do
    F_best ← F(σ)
    σ⁺ ← ∅, τ⁺ ← ∅
    Select σ^q ∈ C
    for σ^r ∈ C \ {σ^q} do
        d_i ← x_{iσ^q(i)} − x_{iσ^r(i)} (1 ≤ i ≤ n), d̄ ← (1/n) Σ_{i=1}^n d_i
        Q ← (⟨d_i, d_j⟩)_{1≤i,j≤n} − diag(n⟨d_1, d̄⟩, ..., n⟨d_n, d̄⟩)
        Solve the UBQP (14) with quadratic matrix Q and call (c_1, ..., c_n) a solution.
        σ̃^q(i) ← c_i σ^q(i) + (1 − c_i) σ^r(i) (1 ≤ i ≤ n)
        σ̃^r(i) ← c_i σ^r(i) + (1 − c_i) σ^q(i) (1 ≤ i ≤ n)
        F̃ ← F(σ \ {σ^q, σ^r} ∪ {σ̃^q, σ̃^r})
        if F̃ < F_best then
            (σ⁺, τ⁺) ← (σ̃^q, σ̃^r) {candidate new pair of assignments}
            (σ⁻, τ⁻) ← (σ^q, σ^r) {candidate old pair of assignments}
            F_best ← F̃
        end if
    end for
    if σ⁺ ≠ ∅ then
        σ ← σ \ {σ⁻, τ⁻} ∪ {σ⁺, τ⁺} {perform interchange}
        C ← σ {reset candidate set to all assignments}
    else
        C ← C \ {σ^q} {remove assignment from candidate set}
    end if
end while
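To illustrate the inner step of Algorithm 4, the sketch below builds the UBQP (14) for a pair of assignments and solves it by exhaustive enumeration. This brute-force solver is for illustration on small n only; for larger n the paper points to dedicated UBQP software (Cplex, Gurobi, Mosek).

```r
## Build and solve the UBQP (14) for assignments q and r of a current solution.
## Brute force over the 2^n binary vectors: illustration only (small n).
interchange_ubqp <- function(X, perm, q, r) {
  n <- length(X)
  D <- sapply(seq_len(n), function(i)
    X[[i]][, perm[[i]][q]] - X[[i]][, perm[[i]][r]])  # columns d_i (p x n)
  G <- crossprod(D)                         # G[i, j] = <d_i, d_j>
  lin <- rowSums(G)                         # n * <d_i, dbar>
  best <- list(value = 0, c = rep(1L, n))   # c = (1,...,1) keeps sigma unchanged
  for (bits in 0:(2^n - 1)) {
    c <- as.integer(intToBits(bits))[seq_len(n)]
    val <- drop(crossprod(c, G %*% c)) - sum(c * lin)   # objective (14)
    if (val > best$value) best <- list(value = val, c = c)
  }
  best   # best$value > 0 signals a strictly improving interchange
}
```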
2.5 Constrained Gaussian mixture model

The matching problem (1) has a probabilistic interpretation in terms of mixture models. Let y_1, ..., y_m be random vectors in R^p with respective probability distributions P_1, ..., P_m. Assume that these vectors are only observable after their labels have been shuffled at random. The random permutation of labels represents the uncertainty about the correspondence between observations, say x_1, ..., x_m, and their underlying distributions P_1, ..., P_m. For mathematical convenience, y_1, ..., y_m are assumed independent and each P_k (1 ≤ k ≤ m) is taken as a multivariate normal distribution N(µ_k, Σ_k). The data-generating process can be summarized as

y_k ∼ N(µ_k, Σ_k) (1 ≤ k ≤ m),
s has a uniform distribution over S_m,
(y_1, ..., y_m) are mutually independent and independent of s,
(x_1, ..., x_m) = (y_{s(1)}, ..., y_{s(m)}).    (15)

This can be viewed as a Gaussian mixture model with permutation constraints on cluster assignments. These constraints can be shifted to the mean and covariance parameters by concatenating observations: the vector x = vec(x_1, ..., x_m) follows a mixture of m! multivariate normal distributions N(µ_σ, Σ_σ) in R^{mp} with equal mixture weights 1/m!, where µ_σ = vec(µ_{σ(1)}, ..., µ_{σ(m)}) and Σ_σ = diag(Σ_{σ(1)}, ..., Σ_{σ(m)}) (block-diagonal matrix) for σ ∈ S_m; see also Qiao and Li [2015]. In this form, the theory and methods of Gaussian mixture models are seen to apply to (15), in particular the consistency and asymptotic normality of maximum likelihood estimators [McLachlan and Peel, 2000, Chapter 2].

Remark. In model (15), the cluster centers {x̄_{σ̂,1}, ..., x̄_{σ̂,m}} associated to a global solution σ̂ = (σ̂_1, ..., σ̂_n) of problem (1) are not consistent for {µ_1, ..., µ_m} as n → ∞. Consider for example the case where p = 1, m = 2 (univariate mixture with two components), and µ_1 < µ_2. Then µ̂_1 = (1/n) Σ_{i=1}^n min(x_{i1}, x_{i2}) and µ̂_2 = (1/n) Σ_{i=1}^n max(x_{i1}, x_{i2}). Accordingly, E(µ̂_1) = E(min(x_{11}, x_{12})) < µ_1 and E(µ̂_2) = E(max(x_{11}, x_{12})) > µ_2, meaning that both estimators are biased and inconsistent.

Remark. The permutation constraints of model (15) can be formulated as equivalence constraints [see Shental et al., 2004, and Section 1]. However, this general formulation is unlikely to lead to faster or better optimization, just as the constrained K-means approach of Wagstaff et al. [2001], which also handles equivalence constraints, does not improve upon the specialized K-means Algorithm 1 for problem (1) (see Section 3).

Gaussian mixture models and the Expectation-Maximization (EM) algorithm [see e.g. McLachlan and Peel, 2000, McLachlan and Krishnan, 2008] constitute a well-known approach to clustering. Here, in view of the matching problem (1), we propose a computationally efficient EM approach to the Gaussian mixture model (15). Although in principle the standard EM algorithm for a Gaussian mixture model could be applied, the number m! of mixture components and the potentially high dimension mp of the data in (15) render computations intractable unless m is very small.

Let (x_{i1}, ..., x_{im}) (1 ≤ i ≤ n) be data arising from (15) and let s_1, ..., s_n be the associated label permutations. For convenience, the permutations are expressed in terms of indicator variables I_{ikl} (1 ≤ i ≤ n, 1 ≤ k, l ≤ m): I_{ikl} = 1 if x_{ik} = y_{il}, or equivalently s_i(k) = l, and I_{ikl} = 0 otherwise. The (x_{ik}) and (I_{ikl}) are the so-called complete data. Call θ̂ = {(µ̂_l, Σ̂_l) : l ∈ [m]} the current estimate of the model parameters of (15) in the EM procedure. The log-likelihood of the complete data is

\log L_c = \sum_{i=1}^{n} \sum_{k=1}^{m} \sum_{l=1}^{m} \log \varphi(x_{ik}; \hat{\mu}_l, \hat{\Sigma}_l) \, I_{ikl}    (16)

where φ(x; µ, Σ) = (2π)^{−p/2} |Σ|^{−1/2} exp(−(x − µ)'Σ^{−1}(x − µ)/2) denotes the multivariate normal density in R^p.

E step. The E step of the EM algorithm consists in calculating the expected value of (16) conditional on the observed data X_1, ..., X_n and assuming that θ = θ̂. This amounts to deriving, for each (i, k, l), the quantity

E_θ̂(I_{ikl} | X_i) = P_θ̂(I_{ikl} = 1 | X_i)
  = P_θ̂(X_i | I_{ikl} = 1) P_θ̂(I_{ikl} = 1) / P_θ̂(X_i)
  = c_i P_θ̂(X_i | I_{ikl} = 1)
  = c_i Σ_{σ ∈ S_m : σ(k) = l} P_θ̂(X_i | I_{i1σ(1)} = 1, ..., I_{imσ(m)} = 1) × P_θ̂(I_{i1σ(1)} = 1, ..., I_{imσ(m)} = 1 | I_{ikl} = 1)
  = c_i ((m − 1)!)^{−1} Σ_{σ ∈ S_m : σ(k) = l} Π_{r=1}^m P_θ̂(x_{ir} | I_{irσ(r)} = 1)
  = c_i ((m − 1)!)^{−1} Σ_{σ ∈ S_m : σ(k) = l} Π_{r=1}^m φ(x_{ir}; µ̂_{σ(r)}, Σ̂_{σ(r)}).    (17)

Formula (17) can be conveniently expressed with matrix permanents. The permanent of a square matrix A = (a_{ij}) of dimension m × m is defined as per(A) = Σ_{σ ∈ S_m} Π_{i=1}^m a_{iσ(i)}. Writing A_i = (a_{ikl}) = (φ(x_{ik}; µ̂_l, Σ̂_l)) ∈ R^{m×m} and A_i^{−(k,l)} = (a_{ik′l′})_{k′≠k, l′≠l} ∈ R^{(m−1)×(m−1)}, (17) reformulates as E_θ̂(I_{ikl} | X_i) = a_{ikl} per(A_i^{−(k,l)}) / per(A_i).

The permanent of a matrix has a very similar expression to the Leibniz formula for determinants, but without the permutation signatures ±1. It is however far more expensive to compute: efficient implementations have complexity O(2^m m²) [Ryser, 1963] or O(2^m m) [Nijenhuis and Wilf, 1978]. Stochastic approximation methods running in polynomial time [e.g. Jerrum et al., 2004, Kuck et al., 2019] and variational bounds [see Uhlmann, 2004, and the references therein] are also available. Given that (17) must be evaluated for nm² values of (i, k, l), and accounting for the computation of the matrices A_i (1 ≤ i ≤ n) [e.g. Press et al., 2007, Chap. 16.1], the E step has overall complexity at least O(2^m m³ n + m² p² n + m p³).

The evaluation of permanents requires precautions to avoid numerical underflow. Indeed, the density values φ(x_{ik}; µ̂_l, Σ̂_l) are often very small and multiplying them in (17) may quickly lead to numerical zeros. Preconditioning greatly helps in this regard: by the properties of the permanent, multiplying the rows and columns of A_i by nonzero numbers has no effect on (17) as these multiples cancel out between the numerator a_{ikl} per(A_i^{−(k,l)}) and the denominator per(A_i). One can exploit this by alternately rescaling the rows and columns of A_i by their sums. Provided that A_i is a positive matrix, this scheme converges to a doubly stochastic matrix [Sinkhorn, 1964] that in practice often has at least one "non-small" entry in each row and each column.
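For reference, here is a minimal R implementation of Ryser's inclusion-exclusion formula together with the alternating row/column rescaling just described (function names are ours, not from the package):

```r
## Permanent of a square matrix by Ryser's formula: O(2^m) terms, small m only.
ryser_perm <- function(A) {
  m <- nrow(A)
  total <- 0
  for (bits in 1:(2^m - 1)) {
    S <- which(as.integer(intToBits(bits))[seq_len(m)] == 1)  # column subset
    total <- total + (-1)^length(S) * prod(rowSums(A[, S, drop = FALSE]))
  }
  (-1)^m * total
}

## Alternating rescaling of rows and columns (Sinkhorn balancing).
sinkhorn <- function(A, iter = 50) {
  for (t in seq_len(iter)) {
    A <- A / rowSums(A)                   # rows sum to 1
    A <- sweep(A, 2, colSums(A), "/")     # then columns
  }
  A   # approximately doubly stochastic when A has positive entries
}

A <- matrix(runif(16) + 0.1, 4, 4)
ryser_perm(A)            # permanent of A
ryser_perm(sinkhorn(A))  # safer numerically; the scaling cancels in the ratio (17)
```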
M step. By standard least squares calculations, the updated estimate θ⁺ = {(µ_l⁺, Σ_l⁺) : 1 ≤ l ≤ m} is

\mu_l^+ = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{m} P_{\hat\theta}(I_{ikl} = 1 | X_i) \, x_{ik}
\Sigma_l^+ = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{m} P_{\hat\theta}(I_{ikl} = 1 | X_i) \, (x_{ik} - \mu_l^+)(x_{ik} - \mu_l^+)'    (18)

with P_θ̂(I_{ikl} = 1 | X_i) = E_θ̂(I_{ikl} | X_i) given by (17). The fact that Σ_{k=1}^m P_θ̂(I_{ikl} = 1 | X_i) = 1 for all (i, l) was used to simplify (18). If the variances Σ_1, ..., Σ_m are assumed equal, their common estimate should be Σ⁺ = (1/m) Σ_{l=1}^m Σ_l⁺.

Log-likelihood. The log-likelihood of the observed data is given by

\log L(\hat\theta) = \sum_{i=1}^{n} \log \Bigg( \frac{1}{m!} \sum_{\sigma \in S_m} \prod_{k=1}^{m} \varphi\big(x_{ik}; \hat\mu_{\sigma(k)}, \hat\Sigma_{\sigma(k)}\big) \Bigg) .    (19)

It is simply the sum of the logarithms of the (normalized) permanents of the matrices A_i = (φ(x_{ik}; µ̂_l, Σ̂_l)) defined earlier. Since these permanents are calculated in the E step, there is essentially no additional cost to computing the log-likelihood.

The implementation of the EM algorithm for model (15) is sketched in Algorithm 5. The initial covariance matrices Σ_1, ..., Σ_m in this algorithm should be taken positive definite to avoid degeneracy issues when evaluating multivariate normal densities. However, the algorithm is easily extended to handle singular covariance matrices. In practice, stopping criteria for the EM algorithm are often based on the absolute or relative increase in log-likelihood between successive iterations.

Algorithm 5 EM for Constrained Gaussian Mixture
Input: X_1, ..., X_n ∈ R^{p×m}, µ_1, ..., µ_m ∈ R^p, Σ_1, ..., Σ_m ∈ R^{p×p}
θ_0 ← {(µ_l, Σ_l) : 1 ≤ l ≤ m}
for t = 0, 1, ... do
    Perform the Choleski decomposition Σ_l = L_l L_l' with L_l lower triangular (1 ≤ l ≤ m)
    for i = 1, ..., n do
        a_{ikl} ← (2π)^{−p/2} |L_l|^{−1} exp(−‖L_l^{−1}(x_{ik} − µ_l)‖²/2) (1 ≤ k, l ≤ m), A_i ← (a_{ikl})
        for k = 1, ..., m do
            for l = 1, ..., m do
                Alternately rescale the rows and columns of A_i^{−(k,l)} to sum to 1
                Calculate per(A_i^{−(k,l)}) with Ryser's inclusion-exclusion formula
                p_{ikl} ← a_{ikl} per(A_i^{−(k,l)})
            end for
        end for
        c_i ← (1/m) Σ_{k=1}^m Σ_{l=1}^m p_{ikl}
        w_{ikl} ← p_{ikl}/c_i (1 ≤ k, l ≤ m) {class membership probabilities}
    end for
    ℓ_t ← Σ_{i=1}^n log c_i {log-likelihood}
    for l = 1, ..., m do
        µ_l ← (1/n) Σ_{i=1}^n Σ_{k=1}^m w_{ikl} x_{ik}
        Σ_l ← (1/n) Σ_{i=1}^n Σ_{k=1}^m w_{ikl} (x_{ik} − µ_l)(x_{ik} − µ_l)'
    end for
    θ_{t+1} ← {(µ_l, Σ_l) : 1 ≤ l ≤ m}
end for

In statistical problems involving a large number of latent variables such as (15), the EM algorithm is usefully extended by the so-called deterministic annealing EM algorithm [DAEM, Ueda and Nakano, 1998]. The DAEM is identical to the EM except that in the E step, the assignment probabilities P_θ(I_{ikl} = 1 | X_i) are raised to a power β ∈ (0, 1] and renormalized. As the number t of iterations grows, the power β = β_t, which represents an inverse temperature parameter, increases to 1. For t sufficiently large, the DAEM reverts back to the EM. In this way the DAEM offers some control on how many iterations are spent exploring the latent variable space before converging to a set of (often highly unbalanced) assignment probabilities. In particular, appropriate use of the DAEM prevents the convergence from happening too fast.
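The DAEM modification only touches the E step. A hedged sketch of the tempering operation on the membership weights of one unit i (renormalizing over l for each row k is a simplification of the full E step):

```r
## DAEM tempering: raise E-step membership weights W[k, l] = w_ikl to an inverse
## temperature beta in (0, 1] and renormalize; beta -> 1 recovers the EM weights.
temper <- function(W, beta) {
  W <- W^beta
  W / rowSums(W)
}
```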
2.6 Initialization

The matching methods developed for (1) in the previous sections are local search procedures. As can be expected, the quality of their solutions largely depends on their starting points. Several strategies for finding good starting points are presented hereafter.

Random initialization. Utilizing multiple random starting points σ ∈ (S_m)^n or P ∈ (Π_m)^n often yields at least one nearly optimal solution. This strategy is particularly suitable when the computational cost of optimization is cheap, as is the case with Algorithms 1-2-3.

Template matching. Given data matrices X_1, ..., X_n ∈ R^{p×m} and a template matrix T ∈ R^{p×m}, solve the matching problem

\min_{P_1, \dots, P_n \in \Pi_m} \sum_{i=1}^{n} \| X_i P_i - T \|_F^2 .    (20)

The expediency of template matching comes from the fact that it reduces the \binom{n}{2} related matching tasks between pairs of data matrices in (1) to n separate matching tasks between the data and the template. A central question is: which template to use? Bandelt et al. [1994] propose to either take a single data matrix as template (single hub heuristic), e.g. T = X_1, or to examine all data matrices in turn, T ∈ {X_1, ..., X_n}, and retain the assignment P(T) = (P_1(T), ..., P_n(T)) that yields the lowest value of (1) (multiple hub heuristic). More generally, the template need not be a data point; it could for example be an estimate of cluster centers based on previous observations.

Recursive heuristics. The recursive heuristics of Bandelt et al. [1994] (see Section 1) are easily applicable to problem (1). Their algorithm RECUR1, for example, which is related to the BCA Algorithm 2, is implemented as follows. The first permutation matrix P_1 can be selected arbitrarily, say P_1 = I_m. Then for i = 1, ..., n − 1, the LAP (9) is changed to

\max_{P_{i+1} \in \Pi_m} \Big\langle P_{i+1} , X_{i+1}' \sum_{j=1}^{i} X_j P_j \Big\rangle_F .    (21)
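Both hub heuristics reduce to a sequence of LAPs against a fixed template. A minimal R sketch of (20) with a single hub, again assuming clue::solve_LSAP:

```r
## Template matching (20): match each unit to a template by one LAP per unit.
library(clue)

template_match <- function(X, tmpl) {
  lapply(X, function(Xi) {
    A <- crossprod(tmpl, Xi)      # A[k, l] = <template column k, x_il>
    as.integer(solve_LSAP(A - min(A), maximum = TRUE))
  })
}

## Single hub heuristic: perm0 <- template_match(X, X[[1]]).
## The multiple hub heuristic loops over templates X[[1]], ..., X[[n]] and keeps
## the matching with the lowest objective (1); the output can initialize
## Algorithms 1-3.
```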
3 Numerical experiments

This section presents experiments that assess the numerical and computational performances of the matching methods of Section 2 and of other relevant methods from the literature. Three performance measures are reported: the attained objective value in the matching problem (1), the Rand index [Rand, 1971] for evaluating agreement between matchings and data labels, and the computation time.

Simulation setup. The simulations are based on the handwritten digits data available on the UCI machine learning repository (archive.ics.uci.edu). Unlike classification problems, the task at hand is to match collections of digits without using label information. The data are normalized bitmaps of handwritten digits. After downsampling, images have dimensions 8 × 8 with integer pixel values in {0, ..., 16}. The training data used for the simulations contain 3823 images contributed by 30 people, with about 380 examples for each digit 0, ..., 9. A principal component analysis (PCA) is carried out separately for each of the m = 10 digit classes (after vectorizing the 8 × 8 images) and synthetic data are generated as x_{ik} = Σ_r ξ_{ikr} φ_{kr} + ε_{ik} for 1 ≤ i ≤ n and 1 ≤ k ≤ m, where the φ_{kr} are PC vectors of length p = 64 and the ξ_{ikr} are independent normal random variables with mean zero and standard deviation given by the PCA. A small amount of Gaussian white noise ε_{ik} with standard deviation 2.5 is added to the simulated data, which corresponds to 10% of the standard deviation of the original data. The number n of statistical units varies in {5, 10, 20, 30, 40, 50, 75, 100, 200, 500, 1000}. For each value of n, the simulation is replicated 100 times. The simulations are run in the R programming environment. Code for the simulations and the R package matchFeat implementing all methods of this paper are available at github.com/ddegras/matchFeat.

Matching methods. The methods of Section 2 are combined in three steps: initialization, main algorithm, and optional post-processing. Four initializations are considered: identity matrix (ID), 100 random starting points (R100), multiple-hub heuristic (HUB), and recursive heuristic (REC). A fifth initialization that clusters data vectors by their digit labels (LBL) is also examined as a benchmark. This initialization is infeasible in practice; it may also not minimize (1) although it is often nearly optimal. The main algorithms are K-means matching (KM), block coordinate ascent (BCA), and the Frank-Wolfe method (FW). The pairwise interchange algorithm (2X) and the EM algorithm for the constrained Gaussian mixture (EM) are used for post-processing only, as they were seen to perform poorly on their own (that is, with any of the proposed initializations) in preliminary experiments.

The simulations also comprise matching methods representative of the literature:

- Integer linear program (ILP). The standard ILP formulation of the MDADC (3)-(4) [e.g. Kuroki and Matsui, 2009, Tauer and Nagi, 2013] involves \binom{n}{2} m^2 binary variables (the number of edges in a complete n-partite graph with m nodes in each subgraph), n(n − 1)m equality constraints, and \binom{n}{3} m^3 inequality constraints (so-called triangle or clique constraints). Another formulation of the ILP expresses the triangle constraints with reference to one of the n subgraphs, thereby reducing their number to \binom{n-1}{2} m^3.

- ILP relaxation and integer quadratic program (IQP). Two of the methods in Kuroki and Matsui [2009] are considered: the first consists in dropping the triangle constraints, solving \binom{n}{2} separate assignment problems, and recovering a proper solution with multiple-hub heuristics. The second expresses the triangle constraints with reference to one of the n subgraphs as in the above ILP, and formulates the objective function only in terms of those edges starting from and arriving at the reference subgraph. This reduces the number of optimization variables to (n − 1)m^2 but transforms the linear program into a quadratic one.

- Constrained K-means. The COP-KMEANS [Wagstaff et al., 2001], MPCK-MEANS [Bilenko et al., 2004], LCVQE [Pelleg and Baras, 2007], and CCLS [Hiep et al., 2016] algorithms all handle equivalence constraints and can thus be applied to (1). They are conveniently implemented in the R package conclust of the last authors. COP-KMEANS and CCLS treat equivalence constraints as hard constraints and thus exactly solve (1). MPCK-MEANS and LCVQE handle equivalence constraints as soft constraints (in addition, MPCK-MEANS incorporates metric learning) and thus approximately solve (1).

Going forward, these methods will be referred to as ILP, KUR-ILP, KUR-IQP, COP-KM, MPC-KM, LCVQE, and CCLS. While they are applicable to the sum-of-squares matching problem (1), these methods are not geared towards it and should not be expected to outperform the methods of this paper.
Lagrangian heuristics [e.g. Tauer and Nagi, 2013, Natu et al., 2020] are not included in the simulations because their efficient implementation requires computer clusters and/or specialized computing architectures, whereas the focus of this paper is on methods executable on a single machine.

Remark. Initial attempts were made to obtain lower bounds on the global minimum in (1) using a relaxation method of Bandelt et al. [2004]. However, the resulting bounds are far too small, a fact already noted by these authors in the case of non-Euclidean distances d (recall that in (1), d is the squared Euclidean distance).

Results. Optimization accuracy. To facilitate comparisons, we discuss the relative error of each method averaged across 100 replications for each n. The relative error of a method is defined as the ratio of its attained objective value in (1) to the minimum objective value across all methods, minus 1. Full results are available in Table 1. Hereafter and in the table, methods are listed by order of best performance.

Method | n=5 | n=10 | n=20 | n=30 | n=40 | n=50 | n=75 | n=100 | n=200 | n=500 | n=1000
R100-BCA | 2E-11 (1E-10) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0)
R100-BCA-2X | 2E-11 (1E-10) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | - | - | -
KUR-IQP | 2E-11 (1E-10) | - | - | - | - | - | - | - | - | - | -
ILP | 0 (0) | 3E-5 (3E-4) | - | - | - | - | - | - | - | - | -
LBL-BCA | 2E-3 (4E-3) | 1E-3 (3E-3) | 4E-4 (1E-3) | 3E-4 (1E-3) | 1E-4 (4E-4) | 2E-4 (6E-4) | 5E-5 (3E-4) | 2E-5 (1E-4) | 3E-6 (3E-5) | 3E-8 (1E-7) | 1E-8 (6E-8)
LBL-BCA-2X | 2E-3 (3E-3) | 8E-4 (2E-3) | 2E-4 (5E-4) | 1E-4 (4E-4) | 6E-5 (2E-4) | 1E-4 (4E-4) | 4E-5 (2E-4) | - | - | - | -
HUB-BCA-2X | 1E-3 (2E-3) | 1E-3 (3E-3) | 4E-4 (1E-3) | 2E-4 (1E-3) | 2E-4 (6E-4) | 3E-4 (9E-4) | 6E-5 (3E-4) | - | - | - | -
HUB-BCA | 1E-3 (3E-3) | 2E-3 (3E-3) | 8E-4 (2E-3) | 3E-4 (1E-3) | 5E-4 (2E-3) | 5E-4 (1E-3) | 2E-4 (1E-3) | 1E-4 (7E-4) | 1E-4 (6E-4) | 2E-5 (2E-4) | 1E-8 (5E-8)
LBL-FW-2X | 4E-3 (6E-3) | 2E-3 (3E-3) | 5E-4 (9E-4) | 2E-4 (4E-4) | 2E-4 (4E-4) | 2E-4 (4E-4) | 9E-5 (2E-4) | - | - | - | -
REC-BCA-2X | 3E-3 (5E-3) | 2E-3 (5E-3) | 8E-4 (3E-3) | 6E-4 (2E-3) | 3E-4 (8E-4) | 2E-4 (7E-4) | 8E-5 (3E-4) | - | - | - | -
LBL-KM-2X | 4E-3 (5E-3) | 2E-3 (3E-3) | 5E-4 (9E-4) | 2E-4 (4E-4) | 2E-4 (4E-4) | 2E-4 (4E-4) | 9E-5 (2E-4) | - | - | - | -
ID-BCA-2X | 3E-3 (6E-3) | 2E-3 (3E-3) | 1E-3 (3E-3) | 6E-4 (2E-3) | 4E-4 (1E-3) | 4E-4 (1E-3) | 1E-4 (5E-4) | - | - | - | -
R100-FW-2X | 7E-3 (9E-3) | 3E-3 (4E-3) | 2E-4 (5E-4) | 4E-5 (1E-4) | 2E-5 (7E-5) | 3E-6 (1E-5) | 3E-6 (2E-5) | 2E-6 (6E-6) | - | - | -
REC-BCA | 4E-3 (7E-3) | 4E-3 (8E-3) | 1E-3 (3E-3) | 1E-3 (3E-3) | 8E-4 (2E-3) | 6E-4 (2E-3) | 5E-4 (2E-3) | 1E-4 (5E-4) | 2E-4 (8E-4) | 3E-4 (1E-3) | 7E-5 (7E-4)
R100-KM-2X | 9E-3 (1E-2) | 3E-3 (4E-3) | 2E-4 (5E-4) | 4E-5 (1E-4) | 2E-5 (7E-5) | 3E-6 (1E-5) | 3E-6 (2E-5) | 2E-6 (6E-6) | - | - | -
ID-BCA | 5E-3 (9E-3) | 5E-3 (8E-3) | 3E-3 (6E-3) | 2E-3 (4E-3) | 8E-4 (2E-3) | 9E-4 (2E-3) | 5E-4 (2E-3) | 6E-4 (1E-3) | 3E-4 (1E-3) | 4E-4 (1E-3) | 1E-8 (5E-8)
HUB-FW-2X | 4E-3 (6E-3) | 4E-3 (6E-3) | 1E-3 (1E-3) | 8E-4 (1E-3) | 6E-4 (8E-4) | 4E-4 (8E-4) | 2E-4 (4E-4) | - | - | - | -
HUB-KM-2X | 5E-3 (6E-3) | 5E-3 (6E-3) | 1E-3 (2E-3) | 8E-4 (1E-3) | 6E-4 (8E-4) | 4E-4 (8E-4) | 2E-4 (4E-4) | - | - | - | -
REC-KM-2X | 6E-3 (9E-3) | 5E-3 (6E-3) | 2E-3 (4E-3) | 1E-3 (3E-3) | 9E-4 (2E-3) | 4E-4 (1E-3) | 2E-4 (5E-4) | - | - | - | -
REC-FW-2X | 6E-3 (8E-3) | 5E-3 (6E-3) | 3E-3 (6E-3) | 1E-3 (3E-3) | 9E-4 (2E-3) | 4E-4 (1E-3) | 2E-4 (5E-4) | - | - | - | -
LBL-FW | 2E-2 (2E-2) | 8E-3 (7E-3) | 2E-3 (2E-3) | 1E-3 (1E-3) | 7E-4 (7E-4) | 5E-4 (8E-4) | 3E-4 (5E-4) | 9E-5 (1E-4) | 2E-5 (4E-5) | 3E-6 (4E-6) | 8E-7 (7E-7)
LBL-KM | 2E-2 (2E-2) | 8E-3 (7E-3) | 2E-3 (2E-3) | 1E-3 (1E-3) | 7E-4 (7E-4) | 5E-4 (8E-4) | 3E-4 (5E-4) | 9E-5 (1E-4) | 2E-5 (4E-5) | 3E-6 (4E-6) | 8E-7 (7E-7)
ID-KM-2X | 9E-3 (1E-2) | 7E-3 (9E-3) | 4E-3 (8E-3) | 2E-3 (5E-3) | 2E-3 (5E-3) | 1E-3 (3E-3) | 7E-4 (3E-3) | - | - | - | -
ID-FW-2X | 1E-2 (1E-2) | 6E-3 (8E-3) | 4E-3 (8E-3) | 2E-3 (3E-3) | 1E-3 (3E-3) | 1E-3 (4E-3) | 8E-4 (3E-3) | - | - | - | -
LBL | 2E-2 (2E-2) | 1E-2 (9E-3) | 5E-3 (3E-3) | 3E-3 (2E-3) | 3E-3 (2E-3) | 3E-3 (2E-3) | 2E-3 (1E-3) | 2E-3 (9E-4) | 2E-3 (6E-4) | 2E-3 (4E-4) | 2E-3 (3E-4)
HUB-KM | 3E-2 (1E-2) | 2E-2 (1E-2) | 6E-3 (5E-3) | 3E-3 (3E-3) | 2E-3 (3E-3) | 1E-3 (2E-3) | 9E-4 (2E-3) | 3E-4 (9E-4) | 2E-4 (7E-4) | 2E-5 (2E-4) | 8E-7 (8E-7)
HUB-FW | 3E-2 (1E-2) | 2E-2 (1E-2) | 6E-3 (5E-3) | 3E-3 (3E-3) | 2E-3 (3E-3) | 1E-3 (2E-3) | 9E-4 (2E-3) | 3E-4 (9E-4) | 2E-4 (7E-4) | 2E-5 (2E-4) | 8E-7 (8E-7)
REC-KM | 2E-2 (2E-2) | 2E-2 (1E-2) | 1E-2 (1E-2) | 5E-3 (6E-3) | 3E-3 (4E-3) | 3E-3 (6E-3) | 1E-3 (3E-3) | 9E-4 (3E-3) | 5E-4 (1E-3) | 5E-4 (2E-3) | 1E-4 (9E-4)
REC-FW | 2E-2 (2E-2) | 2E-2 (1E-2) | 1E-2 (1E-2) | 5E-3 (6E-3) | 3E-3 (4E-3) | 3E-3 (6E-3) | 1E-3 (3E-3) | 9E-4 (3E-3) | 5E-4 (1E-3) | 5E-4 (2E-3) | 1E-4 (9E-4)
2X | 1E-2 (1E-2) | 7E-3 (7E-3) | 5E-3 (5E-3) | 4E-3 (4E-3) | - | - | - | - | - | - | -
R100-FW | 9E-2 (3E-2) | 1E-2 (7E-3) | 7E-4 (1E-3) | 1E-4 (2E-4) | 8E-5 (1E-4) | 3E-5 (5E-5) | 1E-5 (3E-5) | 7E-6 (1E-5) | 2E-6 (4E-6) | 3E-7 (5E-7) | 1E-7 (2E-7)
R100-KM | 9E-2 (3E-2) | 1E-2 (7E-3) | 7E-4 (1E-3) | 1E-4 (2E-4) | 8E-5 (1E-4) | 3E-5 (5E-5) | 1E-5 (3E-5) | 7E-6 (1E-5) | 2E-6 (4E-6) | 3E-7 (5E-7) | 1E-7 (2E-7)
REC | 2E-2 (2E-2) | 2E-2 (2E-2) | 2E-2 (2E-2) | 2E-2 (1E-2) | 2E-2 (1E-2) | 2E-2 (1E-2) | 2E-2 (1E-2) | 1E-2 (1E-2) | 9E-3 (8E-3) | 5E-3 (5E-3) | 4E-3 (4E-3)
ID-KM | 3E-1 (9E-2) | 9E-2 (5E-2) | 3E-2 (2E-2) | 1E-2 (1E-2) | 5E-3 (9E-3) | 5E-3 (9E-3) | 3E-3 (6E-3) | 2E-3 (5E-3) | 1E-3 (2E-3) | 5E-4 (2E-3) | 3E-4 (1E-3)
ID-FW | 3E-1 (9E-2) | 9E-2 (5E-2) | 3E-2 (2E-2) | 1E-2 (1E-2) | 5E-3 (9E-3) | 5E-3 (9E-3) | 3E-3 (6E-3) | 2E-3 (5E-3) | 1E-3 (2E-3) | 5E-4 (2E-3) | 3E-4 (1E-3)
HUB | 3E-2 (1E-2) | 3E-2 (1E-2) | 4E-2 (9E-3) | 4E-2 (9E-3) | 4E-2 (8E-3) | 5E-2 (7E-3) | 4E-2 (7E-3) | 4E-2 (6E-3) | 4E-2 (6E-3) | 4E-2 (5E-3) | 4E-2 (4E-3)
KUR-ILP | 3E-2 (1E-2) | 3E-2 (1E-2) | 4E-2 (9E-3) | 4E-2 (9E-3) | 4E-2 (8E-3) | 5E-2 (7E-3) | 4E-2 (7E-3) | 4E-2 (6E-3) | 4E-2 (6E-3) | 4E-2 (5E-3) | 4E-2 (4E-3)
COP-KM | 2E-1 (6E-2) | 1E-1 (5E-2) | 8E-2 (3E-2) | 7E-2 (3E-2) | 6E-2 (2E-2) | 6E-2 (2E-2) | 5E-2 (1E-2) | 5E-2 (1E-2) | - | - | -
MPC-KM | 3E-1 (7E-2) | 2E-1 (7E-2) | 1E-1 (4E-2) | 8E-2 (3E-2) | 8E-2 (2E-2) | 7E-2 (3E-2) | 7E-2 (2E-2) | 7E-2 (2E-2) | - | - | -
EM | 5E-3 (9E-3) | 5E-3 (8E-3) | 3E-3 (6E-3) | 2E-3 (4E-3) | 8E-4 (2E-3) | 9E-4 (2E-3) | 5E-1 (1E-2) | 5E-1 (1E-2) | 5E-1 (7E-3) | - | -
LCVQE | 3E-1 (7E-2) | 3E-1 (6E-2) | 2E-1 (6E-2) | 2E-1 (6E-2) | 2E-1 (5E-2) | 2E-1 (5E-2) | 2E-1 (6E-2) | 2E-1 (6E-2) | 2E-1 (5E-2) | 2E-1 (6E-2) | 2E-1 (5E-2)
CCLS | 4E-2 (3E-2) | 7E-2 (3E-2) | 1E-1 (5E-2) | 3E-1 (6E-2) | 3E-1 (4E-2) | 3E-1 (3E-2) | 4E-1 (3E-2) | 4E-1 (3E-2) | 4E-1 (2E-2) | - | -
R100 | 4E-1 (4E-2) | 4E-1 (3E-2) | 5E-1 (2E-2) | 5E-1 (2E-2) | 5E-1 (1E-2) | 5E-1 (1E-2) | 5E-1 (1E-2) | 5E-1 (1E-2) | 5E-1 (7E-3) | 5E-1 (5E-3) | 5E-1 (3E-3)
ID | 5E-1 (6E-2) | 5E-1 (4E-2) | 5E-1 (2E-2) | 5E-1 (2E-2) | 5E-1 (2E-2) | 5E-1 (1E-2) | 5E-1 (1E-2) | 5E-1 (1E-2) | 5E-1 (7E-3) | 5E-1 (5E-3) | 5E-1 (3E-3)

Table 1: Simulations: optimization performance in the matching problem (1). The relative error (average across 100 replications) is displayed with the standard deviation in parentheses. From top to bottom of the table: best to worst performance. Missing values (-) are due to execution timeout (running time > 300 s).
R100-BCA is the best method for each n, attaining the best objective value in virtually every replication. For the small values n ∈ {5, 10}, ILP and KUR-IQP also achieve best performance. The next best methods are LBL-BCA-2X, HUB-BCA-2X, LBL-BCA, and HUB-BCA, with a relative error decreasing from order 10^-3 for n = 5 to order 10^-4 or 10^-5 for n = 100. Recall that the LBL initialization is an oracle of sorts since data labels are typically not available in matching problems. The other combinations of methods of this paper yield slightly higher yet comparable relative errors that go roughly from order 10^-2 for n = 5 to the range (10^-6, 10^-3) for n = 100. As can be expected, the ID and REC initializations yield slightly worse performance whereas R100 provides the best results.
BCA is less sensitive to the initialization method than FW and KM. EM, which is initialized with ID-BCA, gives reasonable results for n ≤ 50 (relative error of order 10^-3) although it does not improve upon BCA. For n > 50, however, its performance with respect to (1) severely deteriorates and its relative error climbs to about 0.4.

Among the competitor methods, KUR-ILP has the best performance, with a relative error of order 10^-2 across values of n. COP-KM and MPC-KM have relative errors that decrease from order 10^-1 for small n to order 10^-2 for n = 100. LCVQE has a slowly decreasing relative error that goes from 0.3 for n = 5 to 0.2 for n = 100. CCLS sees its relative error increase from order 10^-2 for small n to 0.4 for n = 100.

Rand index. The Rand index (RI) is a measure of agreement between two partitions of a set; it is suitable for matching problems, which produce clusters and not individual label predictions. Here the data partition produced by a matching method is compared to the partition induced by the data classes, i.e. their underlying digits in {0, ..., 9}. While the goal of matching is to produce homogeneous data clusters and not to maximize agreement between the produced clusters and some underlying class/LBL-induced clusters, these two goals are aligned in the simulations because data vectors generated by a same digit class tend to be much closer to each other than to vectors generated by other digit classes.

Given a set D of size n and two partitions X and Y of D into clusters, the RI is defined as the ratio (a + b)/\binom{n}{2}, where a is the number of pairs of elements of D that are in a same cluster both in X and in Y, and b is the number of pairs of elements of D that are in different clusters both in X and in Y. This can be interpreted as the fraction of correct decisions to assign two elements of D either to the same cluster or to different ones.

The RI of each method (averaged across 100 replications) is displayed in Figure 1 as a function of n. Values closer to 1 indicate better agreement between matching outputs and data labels (digits). For BCA, FW, and KM, the RI starts from a high baseline for small n, increases with n up to about n = 100, and then stays at this level for larger n, reaching about 0.98 for n = 1000. For COP-KM, MPC-KM, LCVQE, KUR-ILP, and HUB, the RI slowly increases from about 0.9 to 0.95 with n. R100 and ID are two initializations nearly or fully independent of the data labels, which are randomly shuffled. They are tantamount to random guessing and their baseline RI of 0.82 matches its theoretical expectation 1 − (2m − 2)/m². EM and CCLS show an RI that rapidly decreases to at or below random-guessing levels, in accord with their modest performance in the optimization (1).

[Figure 1: Rand index versus sample size n (average across 100 replications).]
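The RI is simple to compute from two label vectors; a minimal R sketch (not the implementation used in the simulations):

```r
## Rand index between two clusterings given as label vectors a and b.
rand_index <- function(a, b) {
  pairs <- lower.tri(matrix(0, length(a), length(a)))
  same_a <- outer(a, a, "==")[pairs]   # pair clustered together in a?
  same_b <- outer(b, b, "==")[pairs]   # pair clustered together in b?
  mean(same_a == same_b)               # (a + b agreements) / choose(n, 2)
}

rand_index(c(1, 1, 2, 2, 3), c(1, 1, 2, 3, 3))  # 0.8
```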
Running time. The running times of the algorithms are displayed in Figure 2. During the simulations, algorithms were given 300 seconds to complete execution, after which they were interrupted. Accordingly, any value 300s in the figure (often largely) underestimates the actual computation time. The algorithms can be divided in two groups: those that can solve (1) for n = 1000 in 100s or far less, and those that time out (execution time over 300s) for n ≤ 500 or far less. They are described below by order of decreasing speed.

[Figure 2: Running time versus sample size n (average across 100 replications).]

BCA, FW, and KM are the fastest methods, with running times of order 10^-3 to 1 seconds across values of n. For n = 1000, they are one order of magnitude faster than the next best method (LCVQE). The HUB and REC initializations, although slower than arbitrary starting points like identity or random permutations, enable overall faster computations because good starting points reduce the number of iterations required for the main algorithm to converge. Completing 100 runs of BCA, FW, or KM based on the R100 initialization takes between 200 and 250 times the execution time of a single run based on HUB or REC (instead of roughly 100). This is because the latter heuristics find good starting points whereas some (or many) of the 100 random starting points will be bad and require many iterations for the main algorithm to converge. KUR-ILP enjoys the same speed as BCA, FW, and KM for small n but its running time appears to scale polynomially with n. LCVQE appears to scale linearly with n but with a much larger multiplicative constant than BCA, FW, and KM: its running time is of order 10^-2 s for n = 5 and 1 s for n = 75. The running time of CCLS grows steeply with n and exceeds the 300s limit for n ≥ 500; COP-KM and MPC-KM time out for n ≥ 200, and the EM for n ≥ 500. In the case of the EM, the computational bottleneck is the evaluation of matrix permanents. ILP, KUR-IQP, and 2X are by far the slowest methods in the simulations. The first two stall and time out as soon as n exceeds a few units, although they produce good results when n ≤ 10. The computational load of 2X scales exponentially with n (average computation time 110s for n = 30); it is much higher when 2X is used with the ID, R100, REC, and HUB initializations than when it is applied as a post-processing step following, say, the BCA method.

Summary of simulations.

- BCA is the fastest and most accurate of all studied methods. It provides excellent accuracy when initialized with REC or HUB. For best accuracy, the R100 initialization should be used at the cost of increased yet still manageable computations.

- BCA, KM, and FW are overall extremely fast and can handle datasets of size n = 1000 and beyond. KM and FW are slightly less accurate than BCA in terms of attained objective value (relative error) and Rand index.

- 2X is computationally costly and fairly inaccurate when used on its own, i.e. with an arbitrary starting point. It largely improves the accuracy of KM and FW solutions but not of BCA solutions. It is mostly beneficial in small to moderate dimension n.

- HUB and REC are not sufficiently accurate to be used on their own but they provide good starting points to more sophisticated matching methods. HUB uses the data more extensively than REC and yields slightly better performance.

- For moderate to large n, EM shows poor performance in both computations (due to the evaluations of matrix permanents) and optimization. Its performance is satisfactory for n ≤ 50, possibly because of the BCA initialization.

- ILP and KUR-IQP are only computationally feasible in very small samples (n ≤ 10). KUR-ILP scales far better (it solves (1) for n = 1000 in 50s) but is not highly accurate (relative error between 3% and 5%). LCVQE is both faster and far less accurate: it solves (1) for n = 1000 in 13s but has relative error in (0.2, 0.3) for all values of n.

- COP-KM and MPC-KM have very similar profiles in computation time and optimization accuracy. Their relative error goes from 0.2-0.3 for n = 5 to 0.05-0.06 for n = 100. They are not able to handle large datasets (at least not in their R implementation) as their computations stall for n ≥ 200.

- CCLS is only competitive in very small samples (n ≤ 10).
In this section we harness the matching problem (1) and its proposed solutions to analyze resting-state functional magnetic resonance imaging (rs-fMRI) data, the goal being to explore the dynamic functional connectivity (DFC) of the brain. In short, functional connectivity (FC) relates to the integration of brain activity, that is, how distant brain regions coordinate their activity to function as a whole. The dynamic nature of FC, in particular its dependence on factors such as task-related activity, psychological state, and cognitive processes, is well established in neuroimaging research [e.g. Chang and Glover, 2010, Handwerker et al., 2012, Hutchison et al., 2013].

The present analysis aims to extract measures of DFC from individual subject data and match these measures across subjects to uncover common patterns and salient features. The data under consideration are part of the ABIDE preprocessed data [Craddock et al., 2013], a large corpus of rs-fMRI measurements recorded from subjects diagnosed with autism spectrum disorder and from control subjects. These data and detailed descriptions are available at preprocessed-connectomes-project.org/abide/. We selected the following preprocessing options: Connectome Computation System (CCS) pipeline, spatial averaging over 116 regions of interest (ROI) defined by the AAL brain atlas, bandpass temporal filtering, and no global signal regression. For simplicity, we only used data from control subjects and discarded data that did not pass all quality control tests. This resulted in n = 308 subjects with fMRI time series of average length about 200 scans (SD = 62).

Subject-level analysis. Vector autoregressive (VAR) models are widely used to assess FC in fMRI data [Valdés-Sosa et al., 2005, Friston et al., 2013, Ting et al., 2017]. Here we represent the fMRI time series of a subject by a piecewise VAR model of order 1:

$y_t = A_t y_{t-1} + b_t + \varepsilon_t \quad (1 \le t \le T)$   (22)

where $y_t$ is an fMRI measurement vector of length 116, $A_t$ an unknown regression matrix encoding FC dynamics, $b_t$ an unknown baseline vector, and $\varepsilon_t$ a random noise vector with multivariate normal distribution $N(0, Q_t)$. The $A_t$ are assumed sparse, reflecting the fact that only a small number of ROIs at time $t-1$ directly contribute to the fMRI signal at time $t$. The model parameters $(A_t, b_t, Q_t)$ are assumed piecewise constant with few change points, indicating that FC states persist for some time (say, between 5 and 50 scans) before the brain switches to a different FC state.

For each subject, the task at hand is to simultaneously detect change points in (22) and estimate $(A_t, b_t)$ over the associated time segments ($Q_t$ is of secondary importance here and can be ignored). The sparse group fused lasso (SGFL) approach of Degras [2020] is designed for this purpose. To simplify the task of determining a suitable range for the SGFL regularization parameters and calculating regularization paths, we employ the two-step procedure of this paper: the first step detects change points via the group fused lasso [e.g. Bleakley and Vert, 2011]; the second step recovers sparse estimates of the $A_t$ separately on each segment via the standard lasso [Tibshirani, 1996]. After fitting the regularization paths, a single lasso estimate $(\hat{A}_t, \hat{b}_t)$ is selected for each segment by the Akaike Information Criterion.
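As a toy illustration of the piecewise VAR(1) model (22), the following R snippet simulates a series with a single change point. All dimensions and parameter values are hypothetical and chosen only to expose the structure of the model, not to mimic the ABIDE data.

set.seed(1)
p <- 5; n_scans <- 100; tau <- 50     # toy dimension, series length, change point
A1 <- diag(0.5, p)                    # sparse regression matrix, segment 1
A2 <- diag(0.5, p); A2[1, 2] <- 0.4   # segment 2 differs in a single entry
b1 <- rep(0, p); b2 <- rep(0.5, p)    # piecewise constant baselines
y <- matrix(0, n_scans, p)
for (t in 2:n_scans) {
  A <- if (t <= tau) A1 else A2
  b <- if (t <= tau) b1 else b2
  y[t, ] <- A %*% y[t - 1, ] + b + rnorm(p, sd = 0.1)  # noise N(0, Q_t) with Q_t = 0.01 I
}

The change point detection and lasso estimation steps described above would then recover tau and the per-segment parameters from y alone.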
Among all generated model segmentations, we retain the one with the most segments satisfying the following criteria: (i) length: the segment must have at least 5 scans, (ii) goodness of fit: the lasso fit must have a deviance ratio of at least 0.3, and (iii) distinctness: the parameter estimate $\hat{A}_t$ for the segment must differ by at least 10% in relative terms from the estimates of the other selected segments. To facilitate interpretation and remove noisy components, at most 10 segments are retained per subject.

Group-level analysis. Following the subject-level analysis, a set of change points and associated model parameter estimates is available for each subject, say $\{(\hat{A}_{ik}, \hat{b}_{ik}, \hat{T}_{ik}) : 1 \le k \le m_i\}$ with $\hat{T}_{ik}$ the $k$th change point and $m_i$ the number of segments for the $i$th subject ($1 \le i \le n$). The regression matrices $\hat{A}_{ik}$ provide informative FC measures and could in principle be used for group-level comparisons. They are however highly sparse, and matching them using the squared Euclidean distance of problems (1)-(2) does not produce sensible results. We thus calculate the empirical correlation matrices on each segment $\{\hat{T}_{ik}, \ldots, \hat{T}_{i(k+1)} - 1\}$ and take them as inputs for the group analysis. See e.g. Wang et al. [2014] for a review of common FC measures in neuroimaging. After discarding correlation matrices based on short segments (10 scans or less, for increased estimation accuracy) and extracting the lower halves of the remaining matrices (a one-line operation in R, sketched below), we obtain a set $\{x_{ik} : 1 \le i \le n, 1 \le k \le m_i\}$ of 1801 correlation vectors of size $p = 116 \times 115/2 = 6670$. The number $m_i$ of vectors per subject varies in the range [1, 10] with an average of 5.88 (SD = 1.77).

The unbalanced matching problem (2) is then solved for a grid of values of K using a generalized version of the BCA Algorithm 2. Based on the inspection of the cluster centers and cluster sizes, we retain the matching based on K = 100 clusters. With this choice, cluster sizes are in the range [12, 28] (mean = 18.01, SD = 4.16). Smaller values of K, say K ≥ 50, would be equally suitable for data exploration. K should however not be too small, so as to avoid large clusters in which fine details of FC are averaged out in the cluster center and only large-scale features remain.

Figure 3 displays the 100 resulting cluster centers, i.e. the average correlation matrices of the clusters. For easier visualization and interpretation, the ROI-level correlations are aggregated into six well-established resting state networks (RSN): the attentional network (26 ROIs), auditory network (6 ROIs), default mode network (32 ROIs), sensorimotor network (12 ROIs), subcortical network (8 ROIs), and visual network (14 ROIs). A list of the ROI names and associated RSNs is given in Appendix A. Note that some ROIs do not belong to any known functional network while others are recruited in two networks. The visual and auditory networks have strong intracorrelation (0.59 and 0.64 on average across cluster centers, respectively, not including TPOsup in the auditory network). The subcortical and sensorimotor networks show moderate internal correlation (0.51 on average each). The default mode and attentional networks comprise more ROIs and are usually less correlated (0.36 and 0.40 on average, respectively). The hippocampus (HIP), parahippocampal gyrus (PHG), and amygdala (AMYG) cluster together fairly strongly (average correlation 0.53).
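For reference, the construction of a single correlation vector $x_{ik}$ from a segment of ROI time series is a short operation in R. The snippet below uses simulated data as a stand-in for one retained segment; the dimensions follow the description above.

Y <- matrix(rnorm(50 * 116), nrow = 50, ncol = 116)  # toy segment: 50 scans x 116 ROIs
R <- cor(Y)              # 116 x 116 empirical correlation matrix of the segment
x_ik <- R[lower.tri(R)]  # vectorized lower half: length 116 * 115 / 2 = 6670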
Applying community detection algorithms to each cluster center with the R package igraph (sketched below), we noticed that ROIs from the visual network are virtually always in the same community; the same holds true for the subcortical network. The strongest correlations found between RSNs are the following: auditory–sensorimotor (0.38 on average across clusters), attentional–default mode (0.36), attentional–sensorimotor (0.36), and sensorimotor–visual (0.35).

A remarkable feature (not visible in Figure 3) is the strong positive correlation between the Rolandic operculum (ROL) and the regions PUT (subcortical network), PoCG, PreCG, and SMG (sensorimotor), and HES, STG (auditory) (between 0.42 and 0.67). In addition, CB9.R, VERMIS10, CB10.R, PCG.L, and VERMIS9 exhibit consistent negative correlation (or at least lower average correlation) with most other ROIs. In particular, CB9.R (cerebellum) has 36.5% of negative correlations with other ROIs, whereas the overall proportion of negative correlations in the 100 average correlation matrices is only 10.6%.

Figure 3: rs-fMRI data analysis. Each column represents the center of a cluster of matched features, that is, (half) a correlation matrix averaged across cluster members (subjects) and across ROIs of resting state networks. ATN: attentional network, AUD: auditory network, DMN: default mode network, SMT: sensorimotor network, SUB: subcortical network, VIS: visual network.

Figure 4 shows interesting examples of average correlation matrices (cluster centers) at the ROI level. Cluster 5 shows strong positive correlation within the auditory, subcortical, and visual networks, and in the small groups (HIP, PHG, AMYG), (CRUS1, CRUS2), and (CB3–CB6, VERMIS1–VERMIS7). ROL has moderate to strong negative correlation with CRUS1, CRUS2, and regions from the subcortical network (dark blue stripe towards the top and left), and strong positive correlation with PoCG, SMG (sensorimotor) and HES, STG (auditory). The auditory and sensorimotor networks have moderate to strong positive correlation. Cluster 14 shows a clear block structure along the diagonal (correlation within RSNs) as well as anticorrelation patterns between CAU, PUT, PAL, THA (subcortical) and ROL, PoCG (sensorimotor), PCL (sensorimotor); and between PCG (default mode) and PreCG (sensorimotor), ROL, PoCG (sensorimotor), PCL (sensorimotor). Community detection reveals three large and heterogeneous communities (sizes 43, 40, 36). Cluster 19 displays moderate to strong negative correlation (−0.55, −0.25) between IPL, SMG, ROL, CB10.R on the one hand and about 40 other ROIs on the other. The alternating light and dark lines in cluster 27 reveal lateralized anticorrelation patterns between ROIs of the attentional network on the left side of the brain and most other ROIs in the brain. Cluster 42 shows two roughly uncorrelated blocks: a very large one with strong intracorrelation and a smaller one (CRUS, CB, VERMIS) with weaker intracorrelation. Cluster 88 displays a checkered correlation structure with strong anticorrelation between (CRUS, CB, VERMIS) and the rest of the brain.
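The community detection step mentioned above can be sketched in R as follows. This is an assumed workflow: the paper does not specify which igraph algorithm was used, and the Louvain method appears here only as an example. Negative correlations are zeroed out because modularity-based methods require nonnegative edge weights, and the 'center' matrix is a random placeholder for an actual cluster center.

library(igraph)

center <- cor(matrix(rnorm(200 * 116), 200, 116))  # placeholder for a cluster center
adj <- pmax(center, 0); diag(adj) <- 0             # nonnegative weights, no self-loops
g <- graph_from_adjacency_matrix(adj, mode = "undirected", weighted = TRUE)
comm <- cluster_louvain(g)                         # e.g. Louvain modularity optimization
table(membership(comm))                            # community sizes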
Summary of the data analysis. The data analysis has established that the matching approach (1)-(2) provides scientifically meaningful insights into DFC at the group level. By inspecting the cluster centers (average correlation matrices) produced by the matching process, one recovers large-scale patterns consistent with neuroscientific knowledge. For example, known resting state networks are clearly reflected in the block structure of the cluster centers (see Figure 4). But the cluster centers can also generate new insights and hypotheses. For example, the Heschl gyrus (HES) is not systematically included in the auditory network but, according to our analysis, it should be. Similarly, the ROI TPOsup (temporal pole: superior temporal gyrus), although near to or part of the auditory cortex, has shown only weak correlation with the other ROIs of the auditory network such as the superior temporal gyrus (STG). These elements may lead to a more nuanced understanding of the auditory network. Other remarkable findings include the strong anticorrelations found between the Rolandic operculum (ROL), the cerebellum (CB), and the vermis (VERMIS) on the one hand and (a large part of) the rest of the brain on the other. Importantly, by design, each of the clusters formed by the matching process highlights commonalities between subjects and not within subjects. This is in contrast with unconstrained clustering methods (e.g. K-means clustering) whose clusters may consist of (vectors from) a small number of subjects, or even a single subject in extreme cases.

We have sought to efficiently match feature vectors in a one-to-one fashion across large collections of datasets or statistical units. In applications where statistical units are matched in pairs, this task is conveniently framed as a multidimensional assignment problem with decomposable costs (MDADC). Taking the squared Euclidean distance as dissimilarity function in the MDADC enables tremendous computational speedup by transforming $\binom{n}{2}$ related matching problems between all pairs of datasets into n separate matching problems between each dataset and a template. Leveraging this idea, we have developed extremely fast algorithms whose computational complexity scales linearly with n. These algorithms do not require precalculating and storing assignment costs, which may be infeasible in large-scale applications. Instead, they efficiently calculate assignment costs on the fly. To our knowledge, no other available method for solving the MDADC possesses either of these linear scaling and storage-free properties necessary for large-scale applications.

Our proposed algorithms rely on various optimization techniques such as K-means clustering, block coordinate ascent (BCA), convex relaxation, the Frank-Wolfe algorithm, and a pairwise interchange heuristic. We have also taken a probabilistic view of (1) leading to a constrained Gaussian mixture model and an associated EM algorithm. Altogether the proposed algorithms form a panel that covers most types of approaches to the MDADC found in the literature. (As discussed earlier, we have not considered Lagrangian relaxation methods, as they require computer clusters and/or GPUs for efficient large-scale implementation.)
Figure 4: rs-fMRI data analysis. Examples of cluster centers (average correlation matrices) derived from matching individual correlation matrices across subjects; the displayed panels correspond to clusters 5, 14, 19, 27, 42, and 88. Each displayed matrix corresponds to a cluster of 14 to 23 subjects.

These algorithms extend or specialize existing approaches in a nontrivial way. For example, the BCA and 2-exchange algorithms, which are specialized versions of existing algorithms, scale linearly with n and are amenable to large-scale applications whereas the more general algorithms are not.
The numerical study has shown the excellent performances of the three main algorithms (K-means matching, BCA, and Frank-Wolfe) with respect to computation and optimization. In particular, these algorithms largely outperform all competitors and can handle very large collections of data. The BCA algorithm shows slightly better performance than the other two. The pairwise interchange heuristic can enhance these two methods to reach near optimality, at a hefty computational price. The EM algorithm displayed fairly poor performance throughout the study. Upon inspection, the poor optimization results came from the fact that the algorithm was "too sure" about the allocation probabilities (of data vectors to classes), which were almost invariably calculated as 0 or 1. This in turn may arise from the (relatively) high dimension of the data, the short tails of the normal distribution, and/or error in covariance estimation. Using deterministic annealing EM and/or random starting points did not fix the issue. Possible remedies for improving the EM optimization are to impose a diagonal structure on covariance estimates or to consider (mixtures of) distributions with heavier tails such as multivariate t-distributions. The computational slowness of the EM could be remedied by calculating a small fixed number of most likely allocations rather than computing them all through matrix permanents.
The analysis of the ABIDE preprocessed fMRI data has shown the strong potential of the proposed feature matching approach for exploring neuroimaging biomarkers and producing interpretable clusters at the group level. A key characteristic of one-to-one feature matching is that, unlike unsupervised clustering, it is guaranteed to produce "representative" clusters that reflect variations between subjects and not within.
While feature matching was employed in our analysis for data exploration, this technique could also be used in a more principled way as a preliminary step to disentangle association ambiguities between biomarkers and/or to stratify subjects into small, homogeneous groups prior to a group-level analysis. Such a matching-based approach could, for example, be compared to the consensus clustering strategy of Rasero et al. [2019].

Possible extensions and future work.

• Weighted (squared) Euclidean distance. The squared Euclidean distance in (1) can easily be generalized to a weighted squared Euclidean distance $\|x\|_W^2 = x' W x$ with $W \in \mathbb{R}^{p \times p}$ a positive semi-definite matrix. Decomposing $W$ as $L' L$ (e.g. by Cholesky decomposition), it suffices to premultiply each matrix $X_i$ by $L$ to formulate an equivalent problem (1) using the unweighted (squared) Euclidean distance.

• Alternative dissimilarity measures. Although the squared Euclidean distance for d in the general MDADC problem (3)-(4) enables extremely fast and scalable algorithms with low memory footprint, it may not adequately capture relevant differences between feature vectors in some applications. If the Euclidean distance $\|\cdot\|_2$ or the Manhattan distance $\|\cdot\|_1$, for example, is a more sensible choice for d, a reasonable approach would be to use an objective function based on the nm distances between feature vectors and their cluster centers instead of one based on the distances between all $\binom{n}{2} m$ pairs of matched vectors. In this case, the K-means matching Algorithm 1 can be adapted as follows. The assignment step remains the same: given cluster centers $c_1, \ldots, c_m$, the feature vectors of each unit $i \in [n]$ are assigned to clusters by minimizing the LAP with assignment matrix $A_i = (d(x_{ik}, c_l))_{1 \le k,l \le m}$. The updating step for the cluster centers proceeds by calculating m geometric medians if $d = \|\cdot\|_2$, or mp univariate medians if $d = \|\cdot\|_1$. Both these tasks can be accomplished in near linear time and, as in the case $d = \|\cdot\|_2^2$, no distances need to be precalculated and stored. Accordingly, the modified objective function and modified K-means matching algorithm still enable time complexity linear in n and low space requirements. (The other algorithms of Section 2 do not extend quite so nicely as they fundamentally rely on the scalar product and separability properties that underlie $\|\cdot\|_2^2$.)

• Gaining theoretical understanding of the optimization properties of the algorithms of this paper, for example by establishing deterministic or probabilistic bounds on their performances, could help explain the very good performances observed and/or give insights into worst-case performance in difficult instances [e.g. Gutin et al., 2008]. Also, obtaining tight lower bounds through suitable Lagrangian relaxations would be desirable in practice.

• The rich structure of problem (1) may make it possible to easily construct instances in which the global minimum and optimal assignment are known [see e.g. Drugan, 2015, for related work on quadratic assignment problems]. This would of course be useful for benchmarking methods.

Acknowledgments

The author thanks Vince Lyzinski for early discussions that led to the convex relaxation/Frank-Wolfe approach of Section 2.3. He also acknowledges his student Yiming Chen for his assistance in the literature search and the numerical study.

References
John Ashburner. A fast diffeomorphic image registration algorithm. NeuroImage, 38(1):95–113, 2007.
Egon Balas and Matthew J. Saltzman. An algorithm for the three-index assignment problem. Oper. Res., 39(1):150–161, 1991.
H.-J. Bandelt, A. Maas, and F. C. R. Spieksma. Local search heuristics for multi-index assignment problems with decomposable costs. J. Oper. Res. Soc., 55(7):694–704, 2004.
Hans-Jürgen Bandelt, Yves Crama, and Frits C. R. Spieksma. Approximation algorithms for multi-dimensional assignment problems with decomposable costs. Discrete Appl. Math., 49(1-3):25–50, 1994.
Y. Bar-Shalom, P. K. Willett, and X. Tian. Tracking and Data Fusion: A Handbook of Algorithms. YBS Publishing, 2011.
Sugato Basu, Ian Davidson, and Kiri L. Wagstaff, editors. Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press, 2009.
S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell., 24(4):509–522, 2002.
Mikhail Bilenko, Sugato Basu, and Raymond J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), volume 69. ACM, 2004.
Garrett Birkhoff. Tres observaciones sobre el algebra lineal [Three observations on linear algebra]. Univ. Nac. Tucumán. Revista A., 5:147–151, 1946.
Kevin Bleakley and Jean-Philippe Vert. The group fused lasso for multiple change-point detection. Technical Report hal-00602121, 2011.
Rainer Burkard, Mauro Dell'Amico, and Silvano Martello. Assignment Problems. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2009.
Eranda Çela. The Quadratic Assignment Problem, volume 1 of Combinatorial Optimization. Kluwer Academic Publishers, Dordrecht, 1998.
C. Chang and G. H. Glover. Time-frequency dynamics of resting-state brain connectivity measured with fMRI. Neuroimage, 50(1):81–98, 2010.
Olivier Collier and Arnak S. Dalalyan. Minimax rates in permutation estimation for feature matching. J. Mach. Learn. Res., 17(6):1–31, 2016.
Robert T. Collins. Multitarget data association with higher-order motion models. In CVPR, pages 1744–1751, 2012.
D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition. Int. J. Pattern Recognit. Artif. Intell., 18:265–298, 2004.
Cameron Craddock, Yassine Benhajali, Carlton Chu, Francois Chouinard, Alan Evans, András Jakab, Budhachandra Khundrakpam, John Lewis, Qingyang Li, Michael Milham, Chaogan Yan, and Pierre Bellec. The neuro bureau preprocessing initiative: open sharing of preprocessed neuroimaging data and derivatives. Frontiers in Neuroinformatics, 7, 2013.
David Degras. Sparse group fused lasso for model segmentation: a hybrid approach. Adv. Data Anal. Classif., 2020. doi:10.1007/s11634-020-00424-5.
Morris H. DeGroot and Prem K. Goel. The matching problem for multivariate normal data. Sankhyā Ser. B, 38(1):14–29, 1976.
A. Dehghan, S. M. Assari, and M. Shah. GMMCP tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking. In CVPR, pages 4091–4099, 2015.
K. Doherty, D. Fourie, and J. Leonard. Multimodal semantic SLAM with probabilistic data association. In ICRA, pages 2419–2425, 2019.
Mădălina M. Drugan. Generating QAP instances with known optimum solution and additively decomposable cost function. J. Comb. Optim., 30(4):1138–1172, 2015.
Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
Karl Friston, Rosalyn Moran, and Anil K. Seth. Analysing connectivity with Granger causality and dynamic causal modelling. Current Opinion in Neurobiology, 23(2):172–178, 2013.
P. Gancarski, T.-B.-H. Dao, B. Crémilleux, G. Forestier, and T. Lampert. Constrained clustering: Current and new trends. In A Guided Tour of AI Research, pages 447–484. Springer, 2020.
Gregory Gutin, Boris Goldengorin, and Jing Huang. Worst case analysis of max-regret, greedy and other heuristics for multidimensional assignment and traveling salesman problems. Journal of Heuristics, 14(2):169–181, 2008.
D. A. Handwerker, V. Roopchansingh, J. Gonzalez-Castillo, and P. A. Bandettini. Periodic changes in fMRI connectivity. Neuroimage, 63(3):1712–1719, 2012.
James V. Haxby, J. Swaroop Guntupalli, Andrew C. Connolly, Yaroslav O. Halchenko, Bryan R. Conroy, M. Ida Gobbini, Michael Hanke, and Peter J. Ramadge. A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron, 72(2):404–416, 2011.
Tran Khanh Hiep, Nguyen Minh Duc, and Bui Quoc Trung. Local search approach for the pairwise constrained clustering problem. In Proceedings of the Seventh Symposium on Information and Communication Technology, SoICT '16, pages 115–122. ACM, 2016.
Daniel Hsu, Kevin Shi, and Xiaorui Sun. Linear regression without correspondence. In Advances in Neural Information Processing Systems, volume 30, pages 1531–1540. Curran Associates, Inc., 2017.
R. M. Hutchison, T. Womelsdorf, E. A. Allen, P. A. Bandettini, V. D. Calhoun, M. Corbetta, S. Della Penna, J. H. Duyn, G. H. Glover, J. Gonzalez-Castillo, D. A. Handwerker, S. Keilholz, V. Kiviniemi, D. A. Leopold, F. de Pasquale, O. Sporns, M. Walter, and C. Chang. Dynamic functional connectivity: promise, issues, and interpretations. Neuroimage, 80:360–378, 2013.
Mark Jerrum, Alistair Sinclair, and Eric Vigoda. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. J. ACM, 51(4):671–697, 2004.
Daniel Karapetyan and Gregory Gutin. Local search heuristics for the multidimensional assignment problem. Journal of Heuristics, 17(3):201–249, 2011.
Gary Kochenberger, Jin-Kao Hao, Fred Glover, Mark Lewis, Zhipeng Lü, Haibo Wang, and Yang Wang. The unconstrained binary quadratic programming problem: a survey. J. Comb. Optim., 28(1):58–81, 2014.
Tjalling C. Koopmans and Martin Beckmann. Assignment problems and the location of economic activities. Econometrica, 25:53–76, 1957.
Jonathan Kuck, Tri Dao, Hamid Rezatofighi, Ashish Sabharwal, and Stefano Ermon. Approximating the permanent by sampling from adaptive partitions. In Advances in Neural Information Processing Systems, volume 32, pages 8860–8871. Curran Associates, Inc., 2019.
H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
Yusuke Kuroki and Tomomi Matsui. An approximation algorithm for multidimensional assignment problems minimizing the sum of squared errors. Discrete Appl. Math., 157(9):2124–2135, 2009.
J. Le Moigne, N. S. Netanyahu, and R. D. Eastman. Image Registration for Remote Sensing. Cambridge University Press, 2011.
Stuart P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inform. Theory, 28(2):129–137, 1982.
D. G. Lowe. Local feature view clustering for 3D object recognition. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I–I, 2001.
V. Lyzinski, D. E. Fishkind, M. Fiori, J. T. Vogelstein, C. E. Priebe, and G. Sapiro. Graph matching: Relax at your own risk. IEEE Trans. Pattern Anal. Mach. Intell., 38(1):60–73, 2016.
Geoffrey McLachlan and David Peel. Finite Mixture Models. Wiley Series in Probability and Statistics: Applied Probability and Statistics. Wiley-Interscience, New York, 2000.
Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition, 2008.
James Munkres. Algorithms for the assignment and transportation problems. J. Soc. Indust. Appl. Math., 5:32–38, 1957.
Shardul Natu, Ketan Date, and Rakesh Nagi. GPU-accelerated Lagrangian heuristic for multidimensional assignment problems with decomposable costs. Parallel Computing, page 102666, 2020.
Albert Nijenhuis and Herbert S. Wilf. Combinatorial Algorithms. Academic Press, Inc., second edition, 1978.
Carlos A. S. Oliveira and Panos M. Pardalos. Randomized parallel algorithms for the multidimensional assignment problem. Appl. Numer. Math., 49(1):117–133, 2004. doi:10.1016/j.apnum.2003.11.014.
A. Pananjady, M. J. Wainwright, and T. A. Courtade. Linear regression with shuffled data: Statistical and computational limits of permutation recovery. IEEE Transactions on Information Theory, 64(5):3286–3300, 2018.
Panos M. Pardalos and Leonidas S. Pitsoulis, editors. Nonlinear Assignment Problems, volume 7 of Combinatorial Optimization. Kluwer Academic Publishers, Dordrecht, 2000.
Dan Pelleg and Dorit Baras. K-means with large and noisy constraint sets. In ECML 2007, pages 674–682. Springer, 2007.
William P. Pierskalla. The multidimensional assignment problem. Operations Research, 16(2):422–431, 1968.
Aubrey P. Poore and Nenad Rijavec. A Lagrangian relaxation algorithm for multidimensional assignment problems arising from multitarget tracking. SIAM J. Optim., 3(3):544–563, 1993.
William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.
Mu Qiao and Jia Li. Gaussian mixture models with component means constrained in pre-selected subspaces, 2015.
William M. Rand. Objective criteria for the evaluation of clustering methods. J. Amer. Statist. Assoc., 66(336):846–850, 1971.
Javier Rasero, Ibai Diez, Jesus M. Cortes, Daniele Marinazzo, and Sebastiano Stramaglia. Connectome sorting by consensus clustering increases separability in group neuroimaging studies. Network Neuroscience, 3(2):325–343, 2019.
S. H. Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, and I. Reid. Joint probabilistic data association revisited. In ICCV, pages 3047–3055, 2015.
Alexander J. Robertson. A set of greedy randomized adaptive local search procedure (GRASP) implementations for the multidimensional assignment problem. Computational Optimization and Applications, 19(2):145–164, 2001.
Herbert J. Ryser. Combinatorial Mathematics, volume 14 of The Carus Mathematical Monographs. Wiley and Sons, Inc., New York, 1963.
Mordechai Shalom, Prudence W. H. Wong, and Shmuel Zaks. On-line maximum matching in complete multipartite graphs with implications to the minimum ADM problem on a star topology. In Structural Information and Communication Complexity, volume 5869 of Lecture Notes in Comput. Sci., pages 281–294. Springer, Berlin, 2010.
Noam Shental, Aharon Bar-Hillel, Tomer Hertz, and Daphna Weinshall. Computing Gaussian mixture models with EM using equivalence constraints. In Advances in Neural Information Processing Systems, volume 16, pages 465–472. MIT Press, 2004.
Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist., 35:876–879, 1964.
A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Trans. Pattern Anal. Mach. Intell., 36(7):1442–1468, 2014.
F. C. R. Spieksma and G. J. Woeginger. Geometric three-dimensional assignment problems. European Journal of Operational Research, 91(3):611–618, 1996.
Gregory Tauer and Rakesh Nagi. A map-reduce Lagrangian heuristic for multidimensional assignment problems with decomposable costs. Parallel Computing, 39(11):653–668, 2013.
James R. Thornbrue, J. Nate Knight, and Benjamin J. Slocumb. Association ambiguity management in mixed data dimension tracking problems. In Signal and Data Processing of Small Targets 2010, volume 7698, pages 255–266. International Society for Optics and Photonics, SPIE, 2010.
Robert Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.
C. M. Ting, H. Ombao, S. B. Samdin, and S. H. Salleh. Estimating dynamic connectivity states in fMRI using regime-switching factor models. IEEE Transactions on Medical Imaging, PP(99):1–1, 2017.
Naonori Ueda and Ryohei Nakano. Deterministic annealing EM algorithm. Neural Netw., 11(2):271–282, 1998.
Jeffrey K. Uhlmann. Matrix permanent inequalities for approximating joint assignment matrices in tracking systems. J. Franklin Inst., 341(7):569–593, 2004.
Pedro A. Valdés-Sosa, Jose M. Sánchez-Bornot, Agustín Lage-Castellanos, Mayrim Vega-Hernández, Jorge Bosch-Bayard, Lester Melie-García, and Erick Canales-Rodríguez. Estimating brain functional connectivity with sparse multivariate autoregression. Philosophical Transactions: Biological Sciences, 360(1457):969–981, 2005.
Joshua T. Vogelstein, John M. Conroy, Vince Lyzinski, Louis J. Podrazik, Steven G. Kratzer, Eric T. Harley, Donniell E. Fishkind, R. Jacob Vogelstein, and Carey E. Priebe. Fast approximate quadratic programming for graph matching. PLOS ONE, 10(4):1–17, 2015.
John von Neumann. A certain zero-sum two-person game equivalent to the optimal assignment problem. In Contributions to the Theory of Games, Vol. 2, Annals of Mathematics Studies, no. 28, pages 5–12. Princeton University Press, Princeton, NJ, 1953.
Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schrödl. Constrained k-means clustering with background knowledge. In ICML '01, pages 577–584, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
H. E. Wang, C. G. Benar, P. P. Quilichini, K. J. Friston, V. K. Jirsa, and C. Bernard. A systematic framework for functional connectivity measures. Front. Neurosci., 8:405, 2014.
L. Wang, T. Liu, G. Wang, K. L. Chan, and Q. Yang. Video tracking using learned hierarchical features. IEEE Trans. Image Proc., 24(4):1424–1435, 2015.
Stephen J. Wright. Coordinate descent algorithms. Math. Program., 151(1, Ser. B):3–34, 2015.
Appendix A: Brain regions of interest in fMRI data analysis

The regions of interest (ROIs) of the AAL brain atlas are listed below by resting state network. Each entry gives the AAL label numbers (left/right), the region name, and its abbreviation; suffixes .L and .R denote the left and right hemispheres, and the vermis regions are unpaired.

Subcortical network: 71/72 Caudate nucleus (CAU); 73/74 Putamen (PUT); 75/76 Pallidum (PAL); 77/78 Thalamus (THA).

Auditory network: 79/80 Heschl gyrus (HES); 81/82 Superior temporal gyrus (STG); 83/84 Temporal pole: superior temporal gyrus (TPOsup).

Sensorimotor network: 1/2 Precentral gyrus (PreCG); 19/20 Supplementary motor area (SMA); 57/58 Postcentral gyrus (PoCG); 59/60 Superior parietal gyrus (SPG); 63/64 Supramarginal gyrus (SMG); 69/70 Paracentral lobule (PCL).

Visual network: 43/44 Calcarine fissure and surrounding cortex (CAL); 45/46 Cuneus (CUN); 47/48 Lingual gyrus (LING); 49/50 Superior occipital lobe (SOG); 51/52 Middle occipital lobe (MOG); 53/54 Inferior occipital lobe (IOG); 55/56 Fusiform gyrus (FFG).

Default mode network: 5/6 Superior frontal gyrus, orbital (ORBsup); 7/8 Middle frontal gyrus (MFG); 15/16 Inferior frontal gyrus, orbital (ORBinf); 23/24 Superior frontal gyrus, medial (SFGmed); 25/26 Superior frontal gyrus, medial orbital (ORBsupmed); 31/32 Cingulate gyrus, anterior part (ACG); 33/34 Cingulate gyrus, mid part (DCG); 35/36 Cingulate gyrus, posterior part (PCG); 37/38 Hippocampus (HIP); 39/40 Parahippocampus (PHG); 61/62 Inferior parietal gyrus (IPL); 65/66 Angular gyrus (ANG); 67/68 Precuneus (PCUN); 85/86 Middle temporal gyrus (MTG); 87/88 Temporal pole: middle temporal gyrus (TPOmid); 89/90 Inferior temporal gyrus (ITG).

Attentional network: 3/4 Superior frontal gyrus, dorsolateral (SFGdor); 5/6 Superior frontal gyrus, orbital (ORBsup); 7/8 Middle frontal gyrus (MFG); 9/10 Middle frontal gyrus, orbital (ORBmid); 11/12 Inferior frontal gyrus, opercular (IFGoperc); 13/14 Inferior frontal gyrus, triangular (IFGtriang); 15/16 Inferior frontal gyrus, orbital (ORBinf); 29/30 Insula (INS); 59/60 Superior parietal gyrus (SPG); 61/62 Inferior parietal gyrus (IPL); 83/84 Temporal pole: superior temporal gyrus (TPOsup); 85/86 Middle temporal gyrus (MTG); 89/90 Inferior temporal gyrus (ITG).

Unclassified: 17/18 Rolandic operculum (ROL); 21/22 Olfactory cortex (OLF); 27/28 Gyrus rectus (REC); 41/42 Amygdala (AMYG); 91/92 Cerebellum crus 1 (CRUS1); 93/94 Cerebellum crus 2 (CRUS2); 95/96 Cerebellum 3 (CB3); 97/98 Cerebellum 4/5 (CB4_5); 99/100 Cerebellum 6 (CB6); 101/102 Cerebellum 7 (CB7b); 103/104 Cerebellum 8 (CB8); 105/106 Cerebellum 9 (CB9); 107/108 Cerebellum 10 (CB10); 109 Vermis 1/2 (VERMIS1_2); 110 Vermis 3 (VERMIS3); 111 Vermis 4/5 (VERMIS4_5); 112 Vermis 6 (VERMIS6); 113 Vermis 7 (VERMIS7); 114 Vermis 8 (VERMIS8); 115 Vermis 9 (VERMIS9); 116 Vermis 10 (VERMIS10).

Note: some ROIs (ORBsup, MFG, ORBinf, SPG, IPL, TPOsup, MTG, ITG) are recruited in two networks, while the unclassified ROIs do not belong to any of the six networks considered.