Diffeomorphic Learning
Laurent Younes
Department of Applied Mathematics and Statistics and Center for Imaging Science
Johns Hopkins University
[email protected]
October 17, 2019
Abstract
We introduce in this paper a learning paradigm in which the training data is transformed by a diffeomorphic transformation before prediction. The learning algorithm minimizes a cost function evaluating the prediction error on the training set, penalized by the distance between the diffeomorphism and the identity. The approach borrows ideas from shape analysis, where diffeomorphisms are estimated for shape and image alignment, and brings them into a previously unexplored setting, estimating, in particular, diffeomorphisms in much larger dimensions. After introducing the concept and describing a learning algorithm, we present diverse applications, mostly with synthetic examples, demonstrating the potential of the approach, as well as some insight on how it can be improved.

Keywords: diffeomorphisms, reproducing kernel Hilbert spaces, classification
Introduction

We consider, in this paper, a family of classifiers that take the form x ↦ F(ψ(x)), x ∈ R^d, where ψ is a diffeomorphism of R^d and F is real-valued or categorical, belonging to a class of simple (e.g., linear) predictors. We will describe a training algorithm that, when presented with a finite number of training samples, displaces all points together through non-linear trajectories, in order to bring them to a position for which the classes are linearly separable. Because these trajectories are built with a guarantee to avoid collisions between points, they can be interpolated to provide a fluid motion of the whole space (R^d), resulting in a diffeomorphic transformation. One can then use this transformation to assign a class to any new observation. Because one may expect that the simplicity of the transformation will be directly related to the ability of the classifier to generalize, the point trajectories and their interpolation are estimated so that the resulting global transformation is penalized for straying too far from the identity, this penalty being assessed using a geodesic distance in the group of diffeomorphisms, based on methods previously developed for shape analysis.

The last decade's achievements in deep learning have indeed demonstrated that optimizing non-linear transformations of the data in very high dimensions and using massive parametrization could lead to highly performing predictions without being necessarily struck by the curse of dimensionality. The approach that we describe here also explores a "very large" space in terms, at least, of the number of parameters required to describe the transformations, and uses a training algorithm that implements, like neural networks, dynamic programming. It however frames the estimated transformations within a well-specified space of diffeomorphisms, whose nonlinear structure is well adapted to the successive compositions of functions that drive deep learning methods. While it shares some of the characteristics of deep methods, our formulation relies on a regularization term, which makes it closer in spirit to many of the methods used for non-parametric prediction.

We now sketch our model for classification, and for this purpose introduce some notation. We let c denote the number of classes for the problem at hand. As a consequence, the function F can either take values in C = {0, ..., c − 1}, or in the space of probability distributions on C (which will be our choice since we will use logistic regression). We furthermore assume that F is parametrized by a parameter θ ∈ R^q, and we will write F(x; θ) instead of F(x) when needed.

Assume that a training set is given, in the form T = (x_1, y_1, ..., x_N, y_N) with x_k ∈ R^d and y_k ∈ C, for k = 1, ..., N. We will assume that this set is not redundant, i.e., that x_i ≠ x_j whenever i ≠ j. (Our construction can be extended to redundant training sets by allowing the training class variables y_k to be multi-valued. This would result in rather cumbersome notation that is better avoided.) If ψ is a diffeomorphism of R^d, we will let ψ · T = (ψ(x_1), y_1, ..., ψ(x_N), y_N). The learning procedure will consist in minimizing the objective function

    G(ψ, θ) = D(id, ψ)² + λ g(θ) + (1/σ²) Γ(F(·, θ), ψ · T)    (1)

with respect to ψ and θ.
Here D is a Riemannian distance on a group of diffeomorphisms of R^d, which will be described in the next section, Γ is a "standard" loss function, g is a regularization term on the parameters of the final classifier, and λ, σ are positive numbers. One can take, for example,

    Γ(F(·, θ), ψ · T) = − Σ_{k=1}^N log F(ψ(x_k); θ)(y_k),    (2)

where

    F(z, θ)(y) = exp(θ_y^T z) / Σ_{y' ∈ C} exp(θ_{y'}^T z),

the parameter being θ = (θ_1, ..., θ_{c−1}) ∈ (R^d)^{c−1} with θ_0 = 0, and letting (as done in our experiments) g(θ) = |θ|².

The proposed construction shares some of its features with feed-forward networks (Goodfellow et al., 2016), for which, for example, using the loss function above on the final layer is standard, and can also be seen as a kernel method, sharing this property with classifiers such as support vector machines (SVMs) or other classification methods that use the "kernel trick" (Vapnik, 2013; Schölkopf and Smola, 2002). However, the algorithm is directly inspired from diffeomorphic shape analysis, and can be seen as a variant of the Large Deformation Diffeomorphic Metric Mapping (LDDMM) algorithm, which has been introduced for image and shape registration (Joshi and Miller, 2000; Beg et al., 2005; Younes, 2010). To our knowledge, this theory has not been applied so far to classification problems, although some similar ideas using a linearized model have been suggested in Trouvé and Yu (2001) for the inference of metrics in shape spaces, and a related approach has been discussed in Walder and Schölkopf (2009) in the context of dimensionality reduction.

While the underlying theory in shape analysis, which formulates the estimation of transformations as optimal control problems, is by now well established, the present paper introduces several original contributions. Translating shape analysis methods designed for small-dimensional problems to larger-scale problems indeed introduces new challenges, and suggests new strategies on the design of the flows that control the diffeomorphic transformations, on the choice of parameters driving the model, and on optimization schemes. Importantly, the paper provides experimental evidence that diffeomorphic methods can be competitive in machine learning contexts. Indeed, diffeomorphisms in high dimension are "wild objects," and, even with the kind of regularization that is applied in this paper, it was not obvious that using them in a non-parametric setting would avoid overfitting and compare, often favorably, with several state-of-the-art machine learning methods. As explained in section 5.1, the model (after the addition of a "dummy" dimension) can represent essentially any function of the data.

In this regard, given that one is ready to work with such high-dimensional objects, the requirement that the transformation is a diffeomorphism is not a strong restriction to the scope of the model. Using such a representation has several advantages, however. It expresses the model in terms of a well-understood non-linear space of functions (the group of diffeomorphisms), which has a simple algebraic structure and a differential geometry extensively explored in mathematics. The description of such models is, as a consequence, quite concise and explicit, and lends itself to an optimal control formulation with the learning algorithms that result from it.
Finally, there are obvious advantages, on the generative side, in using invertible data transformations, especially when the data in the transformed configuration may be easier to describe using a simple statistical model, since, in that case, the application of the inverse diffeomorphism to realizations of that simple model provides a generative model in the original configuration space.

The paper is organized as follows. Section 2 introduces basic concepts related to groups of diffeomorphisms and their Riemannian distances. Section 3 formulates the optimal control problem associated with the estimation of the diffeomorphic predictor, introducing its reduction to a finite-dimensional parametrization using a kernel trick and describing its optimality conditions. Some additional discussion of the reproducing kernel is provided in section 4. Section 5 introduces remarks and extensions that are specific to the prediction problems that we consider here, and sections 6 and 7 provide experimental results, including details on how (hyper-)parameters used in our model can be set. We conclude the paper in section 8.

We now describe the basic framework leading to the definition of Riemannian metrics on groups of diffeomorphisms. While many of the concepts described in these first sections are adapted from similar constructions in shape analysis (Younes, 2010), this short presentation helps making the paper self-contained and accessible to readers without prior knowledge of that domain.

Let B_p = C^p_0(R^d, R^d) denote the space of p-times continuously differentiable vector fields that tend to 0 (together with their first p derivatives) at infinity. This space is a Banach space for the norm

    ‖v‖_{p,∞} = max_{0 ≤ k ≤ p} ‖d^k v‖_∞,

where ‖·‖_∞ is the usual supremum norm.

Introduce a Hilbert space V of vector fields on R^d which is continuously embedded in B_p for some p ≥ 1, which means that there exists a constant C such that

    ‖v‖_{p,∞} ≤ C ‖v‖_V

for all v in V, where ‖·‖_V denotes the Hilbert norm on V (and we will denote the associated inner product by ⟨·, ·⟩_V). This assumption implies that V is a reproducing kernel Hilbert space (RKHS). Because V is a space of vector fields, the definition of the associated kernel slightly differs from the usual case of scalar-valued functions in that the kernel is matrix-valued. More precisely, a direct application of Riesz's Hilbert space representation theorem implies that there exists a function K : R^d × R^d → M_d(R), where M_d(R) is the space of d by d real matrices, such that

1. The vector field K(·, y)a : x ↦ K(x, y)a belongs to V for all y, a ∈ R^d.
2. For v ∈ V, for all y, a ∈ R^d, ⟨K(·, y)a, v⟩_V = a^T v(y).

These properties imply that ⟨K(·, x)a, K(·, y)b⟩_V = a^T K(x, y)b for all x, y, a, b ∈ R^d, which in turn implies that K(y, x) = K(x, y)^T for all x, y ∈ R^d.

Diffeomorphisms can be generated as flows of ordinary differential equations (ODEs) associated with time-dependent elements of V. More precisely, let v ∈ L²([0, 1], V), i.e., v(t) ∈ V for t ∈ [0, 1] and

    ∫₀¹ ‖v(t)‖²_V dt < ∞.

Then the ODE ∂_t y = v(t, y) has a unique solution over [0, 1] for any initial condition, and we let φ_v : (t, x) ↦ y(t), where y is the solution starting at x. This flow is, at all times, a diffeomorphism of R^d, and satisfies the equation ∂_t φ = v(t) ∘ φ, φ(0) = id (the identity map). Here, and in the rest of this paper, we make the small abuse of notation of writing v(t)(y) = v(t, y), where, in this case, v(t) ∈ V for all t ∈ [0, 1], and of writing φ_v(t) for the function x ↦ φ_v(t, x), so that φ_v may be considered either as a time-dependent diffeomorphism, or as a function of both time and space variables.

The set of diffeomorphisms that can be generated in such a way forms a group, denoted Diff_V since it depends on V. Given ψ ∈ Diff_V, one defines the optimal deformation cost Λ(ψ) from id to ψ as the minimum of ∫₀¹ ‖v(t)‖²_V dt over all v ∈ L²([0, 1], V) such that φ_v(1) = ψ. If we let D(ψ₀, ψ₁) = Λ(ψ₁ ∘ ψ₀⁻¹)^{1/2}, then D is a geodesic distance on Diff_V associated with the right-invariant Riemannian metric generated by v ↦ ‖v‖_V on V. We refer to Younes (2010) for more details and additional properties of this construction. For our purposes here, we only need to notice that the minimization of the objective function in (1) can be rewritten as an optimal control problem minimizing

    E(v, θ) = ∫₀¹ ‖v(t)‖²_V dt + λ g(θ) + (1/σ²) Γ(F(·, θ), φ(1) · T)    (3)

over v ∈ L²([0, 1], V) and θ ∈ R^q, subject to the constraint that φ(t) satisfies the equation ∂_t φ = v ∘ φ with φ(0) = id. We point out that, under mild regularity conditions on the dependency of Γ with respect to T (continuity in x_1, ..., x_N suffices), a minimizer of this function in v for fixed θ always exists, with v ∈ L²([0, 1], V).

The minimization in (3) can be reduced using an RKHS argument, similar to the kernel trick invoked in standard kernel methods (Aronszajn, 1950; Duchon, 1977; Meinguet, 1979; Wahba, 1990; Schölkopf and Smola, 2002). Let z_k(t) = φ(t, x_k).
Because the endpoint cost Γ only depends on (z_1(1), ..., z_N(1)), it suffices to compute these trajectories, which satisfy ∂_t z_k = v(t, z_k). This implies that an optimal v must be such that, at every time t, ‖v(t)‖_V is minimal over all ‖w‖_V with w satisfying w(z_k) = v(t, z_k), which requires v(t) to take the form

    v(t, ·) = Σ_{k=1}^N K(·, z_k(t)) a_k(t)    (4)

where K is the kernel of the RKHS V and a_1, ..., a_N are unknown time-dependent vectors in R^d, which provide our reduced variables. Letting a = (a_1, ..., a_N), the reduced problem requires to minimize

    E(a(·), θ) = ∫₀¹ Σ_{k,l=1}^N a_k(t)^T K(z_k(t), z_l(t)) a_l(t) dt + λ g(θ) + (1/σ²) Γ(F(·, θ), T(z(1)))    (5)

subject to ∂_t z_k = Σ_{l=1}^N K(z_k, z_l) a_l, z_k(0) = x_k, with the notation z = (z_1, ..., z_N) and T(z) = (z_1, y_1, ..., z_N, y_N).

3.2 Optimality Conditions and Gradient

We now consider the minimization problem with fixed θ (optimization in θ will depend on the specific choice of classifier F and risk function Γ). For the optimal control problem (5), the "state space" is the space in which the "state variable" z = (z_1, ..., z_N) belongs, and is therefore Q = (R^d)^N. The control space contains the control variable a, and is U = (R^d)^N.

Optimality conditions for the variable a are provided by Pontryagin's maximum principle (PMP). They require the introduction of a third variable (the co-state), denoted p ∈ Q, and of a control-dependent Hamiltonian H_a defined on Q × Q, given, in our case, by

    H_a(p, z) = Σ_{k,l=1}^N (p_k − a_k)^T K(z_k, z_l) a_l.    (6)

(In this expression, a, p and z do not depend on time.) The PMP (Hocking, 1991; Macki and Strauss, 2012; Miller et al., 2015; Vincent and Grantham, 1997) then states that any optimal solution a must be such that there exists a time-dependent co-state satisfying

    ∂_t z = ∂_p H_{a(t)}(p(t), z(t)),
    ∂_t p = − ∂_z H_{a(t)}(p(t), z(t)),
    a(t) = argmax_{a'} H_{a'}(p(t), z(t)),    (7)

with boundary conditions z(0) = (x_1, ..., x_N) and

    p(1) = − (1/σ²) ∂_z Γ(F(·, θ), T(z(1))).    (8)

These conditions are closely related to those allowing for the computation of the differential of E with respect to a(·), which is given by ∂_{a(·)} E(a(·), θ) = u(·) with

    u_k(t) = Σ_{l=1}^N K(z_k(t), z_l(t)) (p_l(t) − a_l(t)),    (9)

where p solves

    ∂_t z = ∂_p H_{a(t)}(p(t), z(t)),
    ∂_t p = − ∂_z H_{a(t)}(p(t), z(t)),    (10)

with boundary conditions z(0) = (x_1, ..., x_N) and p(1) = −(1/σ²) ∂_z Γ(F(·, θ), T(z(1))).

Concretely, the differential is computed by (i) solving the first equation of (10), which does not involve p, (ii) using the obtained value of z(1) to compute p(1) from the boundary condition, (iii) solving the second equation of (10) backward in time to obtain p at all times, and (iv) plugging it into the definition of u(t).

For practical purposes, the discrete-time version of the problem is obviously more useful, and its differential is obtained from a similar dynamic programming (or back-propagation) computation. Namely, discretize time over 0, 1, ..., T and consider the objective function

    E(a(·), θ) = (1/T) Σ_{t=0}^{T−1} Σ_{k,l=1}^N a_k(t)^T K(z_k(t), z_l(t)) a_l(t) + λ g(θ) + (1/σ²) Γ(F(·, θ), T(z(T)))    (11)

subject to

    z_k(t+1) = z_k(t) + (1/T) Σ_{l=1}^N K(z_k(t), z_l(t)) a_l(t),   z_k(0) = x_k.
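To make this discrete recursion concrete, the following minimal numpy sketch computes the forward trajectories and the kinetic term of (11), assuming, for illustration, a scalar Gaussian kernel (so that K(x, y) = κ(x, y) Id and kernel products reduce to weighted sums of the vectors a_l); the function and variable names are ours and not taken from the paper's implementation.

    import numpy as np

    def gauss_kernel(z, rho):
        # kappa(x, y) = exp(-|x - y|^2 / (2 rho^2)); with a scalar kernel K = kappa Id,
        # products K(z_k, z_l) a_l reduce to kappa(z_k, z_l) a_l.
        d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * rho ** 2))

    def flow_and_energy(x, a, rho):
        # x: (N, d) training points, a: (T, N, d) controls a_k(t).
        # Returns the trajectories z(0), ..., z(T) and the kinetic part of (11),
        # (1/T) sum_t sum_{k,l} a_k(t)^T K(z_k(t), z_l(t)) a_l(t).
        T = a.shape[0]
        z = x.astype(float).copy()
        traj, kinetic = [z.copy()], 0.0
        for t in range(T):
            Kt = gauss_kernel(z, rho)          # N x N kernel matrix at time t
            kinetic += np.sum(Kt * (a[t] @ a[t].T)) / T
            z = z + (Kt @ a[t]) / T            # z(t+1) = z(t) + (1/T) sum_l kappa a_l(t)
            traj.append(z.copy())
        return np.stack(traj), kinetic

Automatic differentiation through this recursion reproduces, up to the metric conditioning discussed next, the gradient obtained from the discrete co-state equations below.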
We therefore use a simple Euler scheme to discretize the state ODE. Note that the state is discretized over 0, ..., T and the control over 0, ..., T − 1. The differential of E is now given by the following expression, very similar to that obtained in continuous time:

    ∂_{a(·)} E(a(·), θ) = u(·)

with

    u_k(t) = Σ_{l=1}^N K(z_k(t), z_l(t)) (p_l(t) − a_l(t)),   t = 0, ..., T − 1,

where p (discretized over 0, ..., T − 1) solves

    z(t+1) = z(t) + (1/T) ∂_p H_{a(t)}(p(t), z(t)),
    p(t−1) = p(t) + (1/T) ∂_z H_{a(t)}(p(t), z(t)),    (12)

with boundary conditions z(0) = (x_1, ..., x_N) and

    p(T−1) = − (1/σ²) ∂_z Γ(F(·, θ), T(z(T))).

These computations allow us to compute the differential of the objective function with respect to a. The differential in θ depends on the selected terminal classifier, and its expression for the function chosen in (2) is standard.

We emphasize that we are speaking of the differential of the objective function rather than of its gradient. Our implementation uses a Riemannian (sometimes called "natural") gradient with respect to the metric

    ⟨η(·), η(·)⟩_{a(·)} = ∫₀¹ Σ_{k,l=1}^N η_k(t)^T K(z_k(t), z_l(t)) η_l(t) dt

with ∂_t z_k = Σ_{l=1}^N K(z_k(t), z_l(t)) a_l(t). With respect to this metric, one has

    ∇_{a(·)} E(a(·), θ) = p − a,

a very simple expression that can also be used in the discrete case. Using this Riemannian inner product as a conditioner for the deformation parameters (and a standard Euclidean inner product on the other parameters), the experiments in sections 6 and 7 run Polak-Ribière conjugate gradient iterations (Nocedal and Wright, 1999) to optimize the objective function. (We also experimented with limited-memory BFGS, which does not use natural gradients, and found that the conditioned conjugate gradient performed better on our data.)

To fully specify the algorithm, one needs to select the RKHS V or, equivalently, its reproducing kernel K. They constitute important components of the model because they determine the regularity of the estimated diffeomorphisms. We recall that K is a kernel over vector fields, and is therefore matrix-valued. One simple way to build such a kernel is to start with a scalar positive kernel κ : R^d × R^d → R and let

    K(x, y) = κ(x, y) Id_{R^d}.    (13)

We will refer to such kernels as "scalar."

One can choose κ from the large collection of known positive kernels (and their infinite number of possible combinations; Aronszajn (1950); Dyn (1989); Schölkopf and Smola (2002); Buhmann (2003)). The most common options are Gaussian kernels,

    κ(x, y) = exp(−|x − y|² / (2ρ²)),    (14)

or Matérn kernels of class C^k,

    κ(x, y) = P_k(|x − y|/ρ) exp(−|x − y|/ρ),    (15)

where P_k is a reversed Bessel polynomial of order k. In both cases, ρ is a positive scale parameter. The Matérn kernels have the nice property that their associated RKHS is equivalent to a Sobolev space of order k + d/2.

Vector fields v in the RKHS associated with a matrix kernel such as (13), where κ is a radial basis function (RBF), are such that each coordinate function of v belongs to the scalar RKHS associated with κ, which is translation and rotation invariant (i.e., the transformations that associate to a scalar function h the functions x ↦ h(R^T(x − b)) are isometries, for all rotation matrices R and all vectors b ∈ R^d).

While (13) provides a simple recipe for the definition of matrix-valued kernels, other interesting choices can be made within this class. Rotation and translation invariance more adapted to spaces of vector fields, in which one requires that replacing v : R^d → R^d by x ↦ R v(R^T(x − b)) is an isometry of V for all R, b, leads to a more general class of matrix kernels extensively discussed in Micheli and Glaunès (2014). When the data is structured, however (e.g., when it is defined over a grid), rotation invariance may not be a good requirement since it allows, for example, for permuting coordinates, which would break the data structure. In this context, other choices may be preferable, as illustrated by the following example.
Assume that the data is defined over a graph, say G, with d vertices. Then one may consider matrix kernels relying on this structure. For example, letting N_i denote the set of nearest neighbors of i in G, one can simply take

    K(x, y) = diag(Φ(|P_i x − P_i y|), i = 1, ..., d)    (16)

where P_i x is the vector (x_j, j ∈ N_i) and Φ is an RBF associated with a positive radial scalar kernel.

RKHSs of vector fields built from RBFs have the property that all their elements (and several of their derivatives) vanish at infinity. As a consequence, these spaces do not contain simple transformations, such as translations or more general affine transformations. It is however possible to complement them with such mappings, defining

    V̂_a = { g + v : g ∈ a, v ∈ V }

where a is any Lie sub-algebra of the group of affine transformations (so that any element g ∈ a takes the form g(x) = Ax + b, where A is a matrix and b is a vector). In particular, a can be the whole space of affine transformations. Since a and V intersect only at {0}, one can define without ambiguity a Hilbert norm on V̂_a by letting

    ‖g + v‖²_{V̂_a} = ‖g‖²_a + ‖v‖²_V

where ‖·‖_a is any inner-product norm on a. A simple choice, for g(x) = Ax + b, is to take

    ‖g‖²_a = κ₁ trace(AA^T) + κ₂ |b|².    (17)

Both a and V̂_a are RKHSs in this case, with respective kernels K_a and K_a + K, where K is the kernel of V. If the norm on a is given by (17), then

    K_a(x, y) = ( x^T y / κ₁ + 1 / κ₂ ) Id_{R^d},

as can be deduced from the definition of a reproducing kernel.

Instead of using this extended RKHS, one may prefer to model affine transformations separately from the vector field. This leads to replacing (4) by

    v̂(t, ·) = g(t, ·) + Σ_{k=1}^N K(·, z_k(t)) a_k(t)

and the cost (5) by

    E(a(·), θ) = ∫₀¹ ‖g(t)‖²_a dt + ∫₀¹ Σ_{k,l=1}^N a_k(t)^T K(z_k(t), z_l(t)) a_l(t) dt + λ g(θ) + (1/σ²) Γ(F(·, θ), T(z(1)))

with ∂_t z_k = v̂(t, z_k(t)), z_k(0) = x_k. The two approaches are, in theory, equivalent, in that the second one simply forgoes the reduction of the affine part of the vector field, but it may be helpful, numerically, to use separate variables in the optimization process for the affine transform and the reduced coefficients of the vector field, because this gives more flexibility to the optimization. The derivation of the associated optimality conditions and gradient is similar to that made in section 3.2 and is left to the reader.

Another point worth mentioning is that, if φ satisfies ∂_t φ = g ∘ φ + v ∘ φ for (time-dependent) g ∈ a and v ∈ V, and if one defines the time-dependent affine transformation ρ by ∂_t ρ = g ∘ ρ, then φ = ρ ∘ ψ, where ψ satisfies ∂_t ψ = w ∘ ψ and w = ρ_L^{−1} v ∘ ρ, ρ_L being the linear part of ρ. In the special case when a is the Lie algebra of the Euclidean group, so that ρ is the composition of a rotation and a translation, and when the norm of V is invariant under such transformations (e.g., when using a scalar kernel associated with an RBF), then φ and ψ are equidistant to the identity. As a consequence, when the final function F implements a linear model, there is, in theory (numerics may be different), no gain in introducing an affine component restricted to rotations and translations.
The equidistance property does not hold, however, if one uses a larger group of affine transformations, or a norm on V that is not Euclidean invariant, and the introduction of such transformations actually extends the model in a way that may significantly modify its performance, generally, in our experiments, for the better.
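Before moving on, the following short sketch makes the kernel constructions of this section explicit: the scalar matrix-valued kernel (13) built from a Gaussian κ as in (14), and the affine kernel associated with the norm (17), whose sum is the kernel of the extended space V̂_a. The function names and the choice of a Gaussian κ (rather than the Matérn kernels used in our experiments) are ours.

    import numpy as np

    def scalar_matrix_kernel(x, y, rho):
        # Equation (13) with a Gaussian kappa as in (14): K(x, y) = kappa(x, y) Id.
        d = x.shape[0]
        kappa = np.exp(-np.sum((x - y) ** 2) / (2 * rho ** 2))
        return kappa * np.eye(d)

    def affine_kernel(x, y, kappa1, kappa2):
        # Reproducing kernel of the affine part for the norm (17):
        # K_a(x, y) = (x^T y / kappa1 + 1 / kappa2) Id.
        d = x.shape[0]
        return (x @ y / kappa1 + 1.0 / kappa2) * np.eye(d)

    def extended_kernel(x, y, rho, kappa1, kappa2):
        # Kernel of the extended space V_a = a + V, namely K_a + K.
        return affine_kernel(x, y, kappa1, kappa2) + scalar_matrix_kernel(x, y, rho)

With a Matérn kernel, only the function computing κ changes; the lift to a matrix-valued kernel and the affine part are unaffected.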
Enhancements and Remarks

We use, in this paper, logistic regression as the final classifier applied to the transformed data. This classifier estimates a linear separation rule between the classes, but it should be clear that not every training set can be transformed into a linearly separable one using a diffeomorphism of the ambient space. A very simple one-dimensional example is when the true class associated to an input x ∈ R is 0 if |x| is smaller than some threshold and 1 otherwise: no diffeomorphism of the line can make these two interleaved classes linearly separable. One can, however, always pass to a (d + 1)-dimensional dataset in which X is replaced by (X, 0); in the one-dimensional example, a transformation of the form (x, µ) ↦ (x, µ + x² − c), for a suitable constant c, is a diffeomorphism of R² that separates the two classes (along the y axis).

Notice that any binary classifier that can be expressed as x ↦ sign(f(x) − a) for some smooth function f can be included in the model class we are considering after adding a dimension, simply taking (letting again µ denote the additional scalar variable) ψ(x, µ) = (x, µ + f(x)), which is a diffeomorphism of R^{d+1}, together with the linear direction u = (0_{R^d}, 1). The learning algorithm will, of course, estimate a transformation φ that minimizes the overall distortion, essentially trading off some non-linear transformation of the data, x, to induce a "simpler" classification rule. Figure 1 provides an example of the effect of adding a dimension when the original data is two-dimensional with two classes forming a target pattern. While no 2D transformation can break the topology and separate the central disk from the exterior ring, using an additional dimension offers a straightforward solution.

Figure 1: Additional dimension separating the center of a target from its external ring. Left: initial configuration; right: transformed configuration.

Another simple example is provided in Figure 2, where a circle (class 1) is inscribed in a half ellipse (class 2). In this case, adding a dimension is not required, but the transformation estimated when this is done is closer to the identity (in the sense of our metric on diffeomorphisms) than the one computed in two dimensions.

Figure 2: Comparison of optimal transformations without (top) and with (bottom) adding a dimension. Left: initial configuration; right: transformed configuration.

When adding one or more dimensions (in a c-class problem, it makes sense to add c − 1 of them), training data are expanded as x_k → (x_k, δ u_k) for a small δ, with u_k a realization of a standard Gaussian variable, in order to help break symmetries in the training phase (test data being still expanded as x_k → (x_k, 0)).
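As a concrete illustration of the constructions just described, the sketch below implements the lift ψ(x, µ) = (x, µ + f(x)) for a given smooth f, and the expansion of training points with a small Gaussian dummy coordinate. It is a toy illustration with names of our own choosing, not the training code.

    import numpy as np

    def lift(x, mu, f):
        # psi(x, mu) = (x, mu + f(x)): a diffeomorphism of R^(d+1); with the linear
        # direction u = (0, ..., 0, 1), the rule sign(f(x) - a) becomes linear.
        return np.concatenate([x, [mu + f(x)]])

    def expand_training(x, delta, seed=0):
        # Training points are expanded as x_k -> (x_k, delta * u_k), with u_k standard
        # Gaussian; test points would instead receive a zero last coordinate.
        rng = np.random.default_rng(seed)
        return np.hstack([x, delta * rng.standard_normal((x.shape[0], 1))])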
The dimensional reduction method described in section 3.1 is the exact finite-dimensional parametrization of the spatial component of the original infinite-dimensional problem. Assuming T steps for the time discretization, this results in TdN parameters in the transformation part of the model, while the computation of the gradient has a cost of order TdN². With our implementation (on a four-core Intel i7 laptop), we were able to handle up to 2,000 training samples in 100 dimensions with 10 time steps, which required about one day for 2,000 gradient iterations. Even if the efficiency of our implementation may be improved, for example through the use of GPU arrays, it is clear that the algorithm we just described will not scale to large datasets unless some approximations are made. Importantly, no significant reduction in computation cost can be obtained by randomly sampling from the training set at each iteration, since the representation (4) requires computing the trajectories of all sample points.

This suggests replacing the optimal expression in (4) by a form that does not require tracking the trajectories of N points. One possibility is to require that v be decomposed as a sum similar to (4), but over a smaller number of "control points," i.e., to let

    v(t, ·) = Σ_{j=1}^n K(·, ζ_j(t)) a_j(t)    (18)

with ∂_t ζ_j = v(t, ζ_j) and ζ_j(0) = ζ_j^0, where {ζ_1^0, ..., ζ_n^0} is, for example, a subset of the training set. As a result, the computational cost per iteration would now be of order TdNn in the data size (because the evolution of all points is still needed to evaluate the objective function and its gradient), but this change would make possible randomized evaluations of the data cost Γ (leading to stochastic gradient descent), which would replace TdnN by TdnN′, where N′ is the size of the mini-batch used at each iteration.
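The sketch below illustrates the control-point parametrization (18) with a scalar Gaussian kernel: the n control points are advected by the field they define, and the N data points are transported by the same field. The names and the Euler discretization (with T steps, mirroring (11)) are ours.

    import numpy as np

    def gauss(a, b, rho):
        # Gaussian scalar kernel matrix between point sets a (m, d) and b (n, d).
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * rho ** 2))

    def control_point_flow(x, zeta0, alpha, rho):
        # Reduced parametrization (18): v(t, .) = sum_j kappa(., zeta_j(t)) a_j(t).
        # x: (N, d) data points, zeta0: (n, d) initial control points, alpha: (T, n, d).
        T = alpha.shape[0]
        z, zeta = x.astype(float).copy(), zeta0.astype(float).copy()
        for t in range(T):
            z_new = z + gauss(z, zeta, rho) @ alpha[t] / T           # transport the N samples
            zeta_new = zeta + gauss(zeta, zeta, rho) @ alpha[t] / T  # advect the control points
            z, zeta = z_new, zeta_new
        return z, zeta

In a stochastic-gradient variant, only the rows of x belonging to a mini-batch would be transported when evaluating Γ, while the control-point trajectories are always computed.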
This general scheme will be explored in future work, where optimal strategies for selecting an n-point subset of the training data must be analyzed, including in particular the trade-off they imply between computation time and sub-optimality of solutions. Even though they appeared in a very different context, some inspiration may be obtained from similar approaches that were explored in shape analysis (Younes, 2012; Durrleman et al., 2013; Younes, 2014; Durrleman et al., 2014; Gris et al., 2018).

Another plausible choice is to specify a parametric form for the vector field at a given time. It is quite appealing to use a neural-net-like expression, letting

    v(t, x) = Φ(A(t) x + b(t))

where A is a time-dependent d by d matrix, b a time-dependent vector and Φ a fixed nonlinear transformation applied coordinate-wise (for example, t ↦ t exp(−t²/2)). While ‖v(t)‖_V may be challenging to compute analytically when v is given in this form, it is sufficient, in order to ensure the existence of solutions to the state equation, to control the supremum norm of the first derivative of v(t), which, since Φ is fixed, only depends on A(t). One can therefore replace ‖v(t)‖_V in the optimal control formulation (3) by any norm evaluated at A(t) in the finite-dimensional space of d by d matrices. Preliminary experiments, run on some of the examples discussed in section 6, show that such a representation can perform very well in some cases and completely fail in others, clearly requiring further study to be reported in future work. The computation time is, in all cases, significantly reduced compared to the original version.

It is important to stress that, even though our model involves diffeomorphic transformations, it is not a deformable template model. The latter type of model typically works with small-dimensional images (k = 2 or 3), say I : R^k → R, and tries to adjust a k-dimensional deformation (using a diffeomorphism g : R^k → R^k) such that the deformed image, given by I ∘ g^{−1}, aligns with a fixed template (and classification based on a finite number of templates would pick the one for which a combination of the deformation and the associated residuals is smallest).

The transformation ψ_g : I ↦ I ∘ g^{−1} is a homeomorphism of the space of, say, continuous images. Once images are discretized over a grid with d points, ψ_g becomes (assuming that the grid is fine enough) a one-to-one transformation of R^d, but a very special one. In this context, the model described in this paper would be directly applied to discrete images, looking for a d-dimensional diffeomorphism that would include deformations such as ψ_g, but many others, involving also variations in the image values or more complex transformations (including, for example, reshuffling all image pixels in arbitrary orders!).

Experiments

We now provide a few experiments that illustrate some of the advantages of the proposed diffeomorphic learning method, as well as some of its limitations.
We will compare the performance of this algorithm with a few off-the-shelf methods, namely k-nearest neighbors (kNN), linear and non-linear SVM, random forests (RF), multi-layer perceptrons with 1, 2 and 5 hidden layers (abbreviated below as MLP1, MLP2 and MLP5), and logistic regression (Hastie et al., 2003; Bishop, 2006). The classification rates reported below were evaluated on a test set containing 2,000 examples per class (except for MNIST, for which we used the test set available with this data).

We used the scikit-learn Python package (Pedregosa et al., 2011) with the following parameters (most being defaults in scikit-learn).

• Linear SVM: ℓ² penalty with default weight C = 1, with a one-vs-all multi-class strategy when relevant.

• Kernel SVM: ℓ² penalty with weight C estimated as described in section 7. Gaussian (i.e., RBF) kernel with coefficient γ identical to that used for the kernel in diffeomorphic learning. One-vs-all multi-class strategy when relevant.

• Random forests: 100 trees, with the Gini entropy splitting rule and the default choice (√d) of the number of features considered at each node.

• k-nearest neighbors: with the standard Euclidean metric and the default number of neighbors (5).

• Multi-layer perceptrons: with ReLU activations, the ADAM solver, a constant learning rate and at most 10,000 iterations, using 1, 2 or 5 hidden layers, each composed of 100 units.
• Logistic regression: ℓ² penalty with weight C = 1. The same classifier is used as the final step of the diffeomorphic learning algorithm, so that its performance on transformed data is also the performance of the algorithm proposed in this paper.

In all cases, we added a dummy dimension to the data, as described in section 5.1, when running diffeomorphic learning (classification results on original data were obtained without the added dimension). We also used an affine transformation to complement the kernel, as described in section 4.3. Note that adding a dimension was not always necessary for learning (especially for large-dimensional problems), nor did the affine transform always improve the results, but neither option ever harmed the results in any significant way, so that it was simpler to always turn them on in our experiments. The optimization procedure was initialized with a vanishing vector field (i.e., ψ = id) and run until numerical stabilization (with a limit of 2,000 iterations at most). Doing so was always overkill in terms of classification performance, because in almost all cases the limit classification error on the test set stabilizes faster than the time taken by the algorithm to optimize the transformation. It was however important to avoid stopping the minimization early, in order to make sure that optimizing the diffeomorphism did not result in overfitting.

We now describe the datasets that we used (all but the last one being synthetic) and compare the performances of the classifiers above on the original data and on the transformed data after learning.

In our first set of experiments, we let D_i = R(T_i × R^{d−3}), where T_0, T_1 are non-intersecting tori in R³ and R is a random d-dimensional rotation. The tori are positioned as illustrated in the first panel of Figure 3, so that, even though they have an empty intersection, they are not linearly separable. The distribution of training and test data is (before rotation) the product of a uniform distribution on the torus and of a standard Gaussian in the d − 3 remaining dimensions.

Figure 3: Left: Tori (d = 3). Center: Spherical layers (d = 2). Right: "RBF" data (d = 2).

Classification results for this dataset are summarized in Table 1. Here, we let the number of noise dimensions vary from 0 to 7 and 17 (so that the total number of dimensions is 3, 10 and 20), and the number of training samples is 200, 500 or 1,000. The problem becomes very challenging for most classifiers when the number of noisy dimensions is large, in which case a multi-layer perceptron with one hidden layer seems to perform best. All other classifiers see their performance improved after diffeomorphic transformation of the data. One can also notice that, after transformation, all classifiers perform approximately at the same level. Figure 4 illustrates how the data is transformed by the diffeomorphism and is typical of the other results.

We also point out that all classifiers except RF are invariant under rotation of the data, so that making a random rotation when generating it does not change their performance. The estimation of the diffeomorphism using a radial kernel is also rotation invariant. The RF classifier, which is based on comparisons along coordinate axes, is highly affected, however. Without a random rotation, it

Log. reg. lin.
SVM SVM RF kNN MLP1 MLP2 MLP5Tori, d = 3, 100 samples per classOriginal Data 0.341 0.341 0.000 0.007 0.000 0.004 0.000 0.000Transformed Data 0.000 0.000 0.000 0.007 0.000 0.000 0.000 0.000Tori, d = 10, 100 samples per classOriginal Data 0.312 0.317 0.280 0.311 0.309 0.159 0.170 0.233Transformed Data 0.159 0.163 0.162 0.176 0.153 0.155 0.153 0.163Tori, d = 10, 250 samples per classOriginal Data 0.325 0.326 0.207 0.285 0.257 0.073 0.093 0.109Transformed Data 0.023 0.024 0.017 0.021 0.012 0.030 0.026 0.025Tori, d = 20, 100 samples per classOriginal Data 0.320 0.317 0.324 0.376 0.369 0.325 0.325 0.353Transformed Data 0.321 0.329 0.322 0.347 0.330 0.320 0.319 0.323Tori, d = 20, 250 samples per classOriginal Data 0.320 0.323 0.312 0.339 0.367 0.204 0.284 0.280Transformed Data 0.249 0.255 0.249 0.277 0.251 0.247 0.247 0.250Tori, d = 20, 500 samples per classOriginal Data 0.316 0.315 0.267 0.305 0.355 0.117 0.130 0.191Transformed Data 0.163 0.166 0.158 0.173 0.167 0.160 0.163 0.164Table 1: Comparative performance of classifiers on “Tori” dataFigure 4: Visualization of the diffeomorphic flow applied to the 3D tori dataset. Left: trainingdata; Right: test data. Top to bottom: t = 0, t = 0 . t = 0 . t = 1. The data is visualizedin a frame formed by the discriminant direction followed by the two principal components in theperpendicular space.performs extremely well, with an error rate of only 0.05 with 17 noisy dimensions and 250 examplesper class. 13 .3 Spherical Layers We here deduce the class from the sign of cos(9 | X | ) where X follows a uniform distribution on the d -dimensional unit sphere, and we provide results with d = 3 (a representation of the data set in2D is provided in the second panel of Figure 3). We see that linear classifiers cannot do better thanchance (this is by design), but that after the estimated diffeomorphic transformation is applied, allclassifiers outperform, with a large margin for small datasets, the best non-linear classifiers trainedon the original data. Log. reg. lin. SVM SVM RF kNN MLP1 MLP2 MLP5Spherical layers, d = 3, 100 samples per classOriginal Data 0.507 0.507 0.357 0.353 0.408 0.478 0.488 0.481Transformed Data 0.246 0.249 0.329 0.279 0.355 0.246 0.245 0.253Spherical layers, d = 3, 250 samples per classOriginal Data 0.487 0.487 0.233 0.231 0.279 0.394 0.264 0.113Transformed Data 0.119 0.118 0.185 0.129 0.199 0.149 0.143 0.138Spherical layers, d = 3, 500 samples per classOriginal Data 0.506 0.506 0.185 0.194 0.229 0.299 0.113 0.091Transformed Data 0.089 0.092 0.165 0.087 0.197 0.092 0.103 0.096Table 2: Comparative performance of classifiers on 3D spherical layers. The next dataset generates classes using sums of radial basis functions. More precisely, we let ρ ( z ) = exp( − ( z/α ) ) with α = 0 . L (cid:88) j =1 ρ ( | X − c j | ) a j − µ . In this expression, X follows a uniform distribution over the d -dimensional unit sphere and µ isestimated so that both positive and negative classes are balanced. The centers, c , . . . , c L are chosenas c j = (3 j/L ) e ( j mod d )+1 where j mod d is the remainder of the division of j by d and e , . . . , e d isthe canonical basis of R d . The coefficients are a j = cos(6 πj/L ), and we took L = 100. The thirdpanel of Figure 3 depicts the resulting regions of the unit disc in the case d = 2.Table 3 provides classification performances for various combinations of dimension and samplesize. 
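For reference, the following sketch generates data of this type. The sampling of X (uniform in the unit ball), the balancing of µ through the median score, and the value of α are assumptions made for the illustration; the construction of the centers c_j and coefficients a_j follows the description above.

    import numpy as np

    def rbf_dataset(n, d, alpha, L=100, seed=0):
        # Class = sign of sum_j rho(|X - c_j|) a_j - mu, with rho(z) = exp(-(z/alpha)^2),
        # c_j = (3j/L) e_{(j mod d)+1}, a_j = cos(6 pi j / L). X is sampled uniformly in
        # the unit ball, and mu is the median score so that the classes are balanced.
        rng = np.random.default_rng(seed)
        x = rng.standard_normal((n, d))
        x /= np.linalg.norm(x, axis=1, keepdims=True)
        x *= rng.uniform(size=(n, 1)) ** (1.0 / d)
        j = np.arange(1, L + 1)
        centers = np.zeros((L, d))
        centers[np.arange(L), j % d] = 3.0 * j / L
        a = np.cos(6 * np.pi * j / L)
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1)
        score = (np.exp(-(dist / alpha) ** 2) * a).sum(axis=1)
        y = (score > np.median(score)).astype(int)
        return x, y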
Excepted when d = 5 and only 100 samples per class are observed, in which case linear classifiersprovide the best rates, the best classification is obtained after diffeomorphic transformation. In the next example, we assume that the conditional distribution of X given Y = y is normal withmean m y and covariance matrix Σ y . We used three classes ( y ∈ { , , } ) in dimension 20, with • m = (0 , . . . , T , m = ( − , − , − , , . . . , T and m = ( − , , − , . . . . , T , • Σ = 10 Id R d , Σ ( i, j ) = 20 exp( −| i − j | /
20) and Σ ( i, j ) = 20 exp( −| i − j | / d = 2, 100 samples per classOriginal Data 0.381 0.381 0.158 0.126 0.208 0.183 0.193 0.163Transformed Data 0.142 0.139 0.149 0.153 0.186 0.140 0.136 0.141RBF data, d = 2, 250 samples per classOriginal Data 0.358 0.359 0.109 0.150 0.136 0.176 0.158 0.127Transformed Data 0.101 0.102 0.106 0.102 0.130 0.126 0.122 0.119RBF data, d = 2, 500 samples per classOriginal Data 0.377 0.378 0.089 0.103 0.097 0.119 0.101 0.081Transformed Data 0.058 0.055 0.085 0.055 0.090 0.070 0.054 0.142RBF data, d = 3, 100 samples per classOriginal Data 0.307 0.307 0.209 0.207 0.221 0.218 0.233 0.250Transformed Data 0.171 0.175 0.202 0.168 0.195 0.162 0.162 0.154RBF data, d = 3, 250 samples per classOriginal Data 0.310 0.308 0.139 0.182 0.148 0.151 0.151 0.103Transformed Data 0.083 0.077 0.133 0.082 0.122 0.086 0.083 0.089RBF data, d = 3, 500 samples per classOriginal Data 0.337 0.335 0.129 0.112 0.114 0.144 0.093 0.094Transformed Data 0.071 0.070 0.118 0.063 0.102 0.080 0.068 0.078RBF data, d = 5, 100 samples per classOriginal Data 0.216 0.218 0.221 0.256 0.254 0.280 0.259 0.233Transformed Data 0.221 0.223 0.221 0.222 0.219 0.221 0.223 0.214RBF data, d = 5, 250 samples per classOriginal Data 0.233 0.233 0.197 0.226 0.213 0.250 0.273 0.240Transformed Data 0.199 0.204 0.185 0.215 0.196 0.207 0.209 0.205RBF data, d = 5, 500 samples per classOriginal Data 0.231 0.230 0.167 0.219 0.184 0.206 0.189 0.158Transformed Data 0.128 0.129 0.133 0.127 0.142 0.141 0.140 0.143Table 3: Comparative performance of classifiers on “RBF” datawhere Id R d is the d -dimensional identity matrix. Classification results for training sets with 100and 250 examples per class are described in Table 4. While multilayer perceptrons perform best onthis data, the performance of diffeomorphic classification is comparable, and improves on all otherclassifiers. A visualization of the transformation applied to this dataset is provided in Figure 5. To build this dataset, we start with two scalar functions and assume that the observed data isthe restriction of one of them (where this choice determines the class) to a small discrete interval.More precisely, we let Y ∈ { , } and given Y = y , we let X = ( ψ y ( t − A ) + ε t , t ∈ I ) where I = { , /d, . . . , ( d − /d ) } is a discretization of the unit interval, ( ε t , t ∈ I ) are independent standardGaussian variables and A follows a uniform distribution on [ − , d = 20, 100 samples per classOriginal Data 0.398 0.407 0.225 0.191 0.495 0.143 0.101 0.099Transformed Data 0.135 0.153 0.175 0.110 0.236 0.134 0.109 0.108Mixture of Gaussians, d = 20, 250 samples per classOriginal Data 0.354 0.359 0.172 0.163 0.427 0.073 0.061 0.063Transformed Data 0.079 0.091 0.145 0.089 0.187 0.077 0.074 0.074Table 4: Comparative performance of classifiers on Gaussian mixtures.Figure 5: Diffeomorphic transformation applied to the Gaussian mixture dataset (evolution visual-ized at times t = 0 . , . , . , . t = 1 .
5, we took ψ y ( t ) = log cosh( t/β y ) with β = 0 .
02 and β = 0 . d = 50, 100 samples per classOriginal Data 0.484 0.495 0.400 0.448 0.466 0.449 0.444 0.465Transformed Data 0.324 0.324 0.350 0.378 0.438 0.329 0.335 0.332Curve segments, d = 50, 250 samples per classOriginal Data 0.487 0.486 0.266 0.345 0.353 0.493 0.486 0.467Transformed Data 0.152 0.157 0.147 0.195 0.248 0.155 0.258 0.216Curve segments, d = 50, 500 samples per classOriginal Data 0.495 0.496 0.155 0.294 0.230 0.505 0.484 0.426Transformed Data 0.093 0.094 0.062 0.070 0.093 0.093 0.104 0.113Table 5: Comparative performance of classifiers on ”curve segment” data. Here, we consider a 50-dimensional dataset with two classes. Each sample has all coordinates equalto zero except two of them that are equal to ±
1, and the class is 0 if the two nonzero coordinates16re equal and 1 otherwise. The position of the two nonzero values is random and each takes thevalues ± √ d = 50, 100 samples per classOriginal Data 0.505 0.511 0.484 0.500 0.499 0.345 0.428 0.493Transformed Data 0.066 0.068 0.010 0.024 0.017 0.019 0.016 0.014xor data, d = 50, 250 samples per classOriginal Data 0.498 0.497 0.404 0.461 0.494 0.005 0.047 0.248Transformed Data 0.008 0.006 0.000 0.000 0.000 0.000 0.000 0.001Table 6: Comparative performance of classifiers on xor data. Our next synthetic example is a simple pattern recognition problem in d = 100 dimensions. Samplesfrom class 1 take the form ( ρ U , . . . , ρ d U d ) where ρ is uniformly distributed between 0 .
75 and 1 . U = ( U , . . . U d ) is a binary vector with exactly l y consecutive ones (possibly wrapping around { , . . . , d } )and d − l y zeros, with l = 10 and l = 11.An optimal linear separation between classes is achieved (because of shift invariance) by thresh-olding the sum of the d variables. However, because of the multiplication by ρ , . . . , ρ d , this simpleclassification rule performs very poorly, while the same rule applied to the binary variables ob-tained by thresholding individual entries at, say, 0.5 perfectly separates the classes. It is thereforenot surprising that multi-layer perceptrons perform very well on this problem and achieve the bestclassification rates. All other classifiers perform significantly better when run on the transformeddata. Log. reg. lin. SVM SVM RF kNN MLP1 MLP2 MLP5Segment length data, d = 100, 100 samples per classOriginal Data 0.462 0.412 0.387 0.491 0.505 0.337 0.343 0.328Transformed Data 0.341 0.349 0.354 0.347 0.450 0.390 0.385 0.362Segment length data, d = 100, 250 samples per classOriginal Data 0.432 0.412 0.135 0.483 0.503 0.082 0.079 0.084Transformed Data 0.100 0.098 0.107 0.098 0.443 0.096 0.113 0.114Segment length data, d = 100, 500 samples per classOriginal Data 0.391 0.391 0.045 0.447 0.194 0.025 0.025 0.025Transformed Data 0.032 0.030 0.032 0.054 0.165 0.029 0.031 0.036Table 7: Comparative performance of classifiers on segment lengths17 .9 Segment Pairs This section describes a more challenging version of the former in which each data point consists intwo sequences of ones in the unit circle (discretized over 50 points), these two sequences being bothof length five in class 1, and of lengths 4 and 6 (in any order) in class 2. No linear rule can separatethe classes and do better than chance, since such a rule would need to be based on summing thevariables (by shift invariance), which returns 10 in both classes. The problem is also challenging formetric-based methods because each data point has close neighbors in the other class: the nearestnon-identical neighbor in the same class is obtained by shifting one of the two segments, and wouldbe at distance √
2, which is identical to the distance of the closest element from the other class whichis obtained by replacing a point in one segment by 0 and adding a 1 next to the other segment. Oneway to separate the classes would be to compute moving averages (convolutions) over windows oflengths 6 along the circle, and to threshold the result at 5.5, which can be represented as a one-hidden layer perceptron. As shown in Table 8, one-hidden-layer perceptrons do perform best on thisdata, with some margin compared to all others. Diffeomorphic classification does however improvesignificantly on the classification rates of all other classifiers.Log. reg. lin. SVM SVM RF kNN MLP1 MLP2 MLP5Segment pair data, d = 50, 100 samples per classOriginal Data 0.477 0.466 0.476 0.475 0.458 0.470 0.470 0.488Transformed Data 0.461 0.457 0.465 0.461 0.469 0.461 0.462 0.461Segment pair data, d = 50, 250 samples per classOriginal Data 0.490 0.485 0.447 0.412 0.492 0.343 0.392 0.405Transformed Data 0.394 0.392 0.412 0.390 0.444 0.398 0.391 0.386Segment pair data, d = 50, 500 samples per classOriginal Data 0.501 0.502 0.369 0.320 0.504 0.054 0.164 0.284Transformed Data 0.240 0.238 0.261 0.233 0.378 0.250 0.244 0.253Segment pair data, d = 50, 1000 samples per classOriginal Data 0.465 0.463 0.233 0.191 0.547 0.008 0.038 0.137Transformed Data 0.071 0.068 0.099 0.073 0.281 0.085 0.082 0.085Table 8: Comparative performance of classifiers on segment pair data To conclude this section we provide (in Table 9) classification results on a subset of the MNISTdigit recognition dataset (LeCun, 1998), with 10 classes and 100 examples per class for training. Toreduce the computation time, we reduced the dimension of the data by transforming the original28 ×
28 images to 14 ×
14, resulting in 196-dimensional dataset. With this sample size, this reductionhad little influence on the performance of classifiers run before diffeomorphic transformation. Allclassifiers have similar error rates, with a small improvement obtained after diffeomorphic trans-formation compared to linear classifiers, reaching a final performance comparable to that obtainedby random forests and multi-layer perceptrons with two hidden layers. We did observe a glitchin the performance of nonlinear support vector machines, which may result from a poor choice ofhyper-parameters (section 6.1) for this dataset. 18og. reg. lin. SVM SVM RF kNN MLP1 MLP2 MLP5MNIST data, d = 196, 100 samples per classOriginal Data 0.128 0.138 0.228 0.111 0.120 0.122 0.111 0.127Transformed Data 0.113 0.120 0.325 0.129 0.094 0.110 0.110 0.124Table 9: Comparative performance of classifiers on a subset of the MNIST dataset. The results presented in this section are certainly encouraging, and demonstrate that the proposedalgorithm has, at least, some potential. We however point out that the classifiers that were usedin our experiments were run “off-the-shelves,” using the setup described in section 6.1. There isno doubt that, after some tuning up specific to each dataset, each of these classifiers could performbetter. Such an analysis was not our goal here, however, and the classification rates that we obtainedmust be analyzed with this in mind. We just note that we also used the exact same version ofdiffeomorphic learning for all datasets, with hyper-parameters values described in section 7.We have included kernel-based support vector machines in our comparisons. These classifiersindeed share with the one proposed here the use of a kernel trick to reduce an infinite dimensionalproblem to one with a number of parameters comparable to the size of the training set. Some notabledifferences exist however. The most fundamental one lies in the role that is taken by the kernel inthe modeling and computation. For SVMs and other kernel methods, the kernel provides, or isclosely related to, the transformation of the data into a feature space. It indeed defines the innerproduct in that space, and can itself be considered as an infinite dimensional representation of thedata, through the standard mapping x (cid:55)→ κ ( x, · ). Kernel methods then implement linear methodsin this feature space, such as computing a separating hyperplane for SVMs. For the diffeomorphicclassification method that is described here, the kernel is used as a computational tool, similar tothe way it appears in spline interpolation. The fundamental concept here is the Hilbert space V and its norm, which is, in the case of a C k Mat´ern kernel, a Hilbert Sobolev space of order k + d/ V is therefore a mathematical tool that makesexplicit the derivation of the discrete representation described in section 3.1 rather than an essentialpart of the model, as it is for kernel methods such as SVMs.Another important distinction, which makes the diffeomorphic method deviate from both kernelmethods in machine learning and from spline interpolation is that the kernel (or its associatedspace V ) is itself part of a nonlinear dynamical evolution where it intervenes at every time. Thisdynamical aspect of the approach constitutes a strong deviation from SVMs which applied thekernel transformation once before implementing a linear method. 
It is, in this regard, closer to feed-forward neural network models, in that it allows for very large non-linear transformations of the data to be applied prior to linear classification. Unlike neural networks, however, diffeomorphic learning operates these transformations within the original space containing the data.

As mentioned in section 6.1, we ran the optimization algorithm until numerical convergence, although it is clear from monitoring classification over time that classes become well separated early on, while the estimation of the optimal transformation typically takes more time. If one is not interested in the limit diffeomorphism (which can be of interest for modeling purposes), significant computation time may be saved by stopping the procedure earlier, using, for example, a validation set.

Setting Hyper-Parameters
The diffeomorphic classification method described in this paper requires the determination of several hyper-parameters, which we now describe together with a discussion of how they were set in our experiments.

(a) Kernel scale parameter. While the kernel can be any function of positive type and can therefore be considered as an infinite-dimensional hyper-parameter, we assume that one uses a kernel such as those described in equations (14) and (15) (our experiments use the latter with k = 3) and focus on the choice of the scale parameter ρ. In our implementation, we have based this selection on two constraints, namely that training data should tend to "collaborate" with some of their neighbors from the same class, while data from different classes should "ignore" each other. This resulted in the following computation.

(i) To address the first constraint, we evaluate, for each data point, the fifth percentile of its distances to other points from the same class. We then define ρ₁ to be the 75th percentile of these values.

(ii) For the second constraint, we define ρ₂ as the 10th percentile of the minimum distance between each data point and its closest neighbor in another class.

We finally set ρ = ρ₀ = min(ρ₁, ρ₂); a short sketch of this computation is given below.
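The following sketch (with hypothetical function names) computes ρ₁, ρ₂ and the default scale ρ₀ = min(ρ₁, ρ₂) from a labeled training set, using Euclidean distances.

    import numpy as np

    def default_kernel_scale(x, y):
        # rho1: for each point, the 5th percentile of distances to same-class points,
        #       then the 75th percentile of these values over all points.
        # rho2: the 10th percentile of the distance to the closest other-class point.
        dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        n = len(y)
        same = (y[:, None] == y[None, :]) & ~np.eye(n, dtype=bool)
        within = [np.percentile(dist[i, same[i]], 5) for i in range(n)]
        rho1 = np.percentile(within, 75)
        other = np.where(y[:, None] != y[None, :], dist, np.inf).min(axis=1)
        rho2 = np.percentile(other, 10)
        return min(rho1, rho2)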
Table 10 provides a comparison of the performances of the algorithm for values of ρ that deviate from this default value, showing that, in most cases, the performance obtained with ρ₀ is not too far from the best one. The only exception was found with the Xor dataset, for which a significant improvement was obtained using a large value of ρ. Note that in Tables 10 and 11, all experiments in the same row were run with the same training and test sets.

Dataset          Samples/class   ρ increasing from left to right (default ρ₀ in the middle)
Sph. Layers      250             0.238 0.199 0.188 0.175 0.165 0.183 0.212
RBF              100             0.170 0.107 0.085 0.083 0.070 0.063 0.063
Tori (d = 10)    100             0.285 0.233 0.207 0.165 0.163 0.175 0.194
M. of G.         100             0.322 0.221 0.156 0.135 0.109 0.103 0.124
Xor              100             0.473 0.108 0.090 0.092 0.093 0.062 0.01
Seg. Length      250             0.164 0.143 0.114 0.097 0.093 0.093 0.86

Table 10: Performance comparison when the kernel scale deviates from the default value ρ₀, assessed on the spherical layers, RBF, Tori, mixture of Gaussians, Xor and segment length datasets.

(b) Regularization weight. A second important parameter is the regularization weight, σ in Equation 1. In our experiments, this parameter is adjusted on line during the minimization algorithm, starting with a large value and slowly decreasing it until the training error reaches a target δ. While this just replaces one parameter by another, the target value is easier to interpret, and we used δ = 0.005 in all our experiments (resulting de facto in a vanishing training error at the end of the procedure in all cases).

(c) ℓ² penalty on logistic regression. This parameter, denoted λ in Equation 1, was taken equal to 1 in all our experiments.

(d) Time discretization. The value of T in equation (11) may also impact the performance of the algorithm, at least for small values, since one expects the model to converge to its continuous limit for large T. This expectation is confirmed in Table 11, which shows that the classification error rates obtained with the value chosen in our experiments, T = 10, were not far from the asymptotic ones. In some cases, the error rates obtained for smaller values of T may be slightly better, but the difference is marginal.

Dataset          Samples/class   T = 1   T = 2   T = 4   T = 10  T = 20  T = 40  T = 100
Sph. Layers      250             0.199   0.199   0.190   0.197   0.197   0.197   0.197
RBF              250             0.089   0.089   0.085   0.083   0.083   0.083   0.083
Tori (d = 10)    100             0.243   0.160   0.170   0.177   0.186   0.195   0.197
M. of G.         100             0.178   0.169   0.141   0.119   0.114   0.111   0.109
Xor              100             0.478   0.179   0.079   0.043   0.042   0.040   0.039
Seg. Length      250             0.140   0.117   0.100   0.094   0.092   0.091   0.091

Table 11: Performance comparison for various values of the number of discretization steps, T, assessed on the spherical layers, RBF, Tori, mixture of Gaussians, Xor and segment length datasets.

Conclusion

In this paper, we have introduced the concept of diffeomorphic learning and provided a few illustrations of its performance on simple, but often challenging, classification problems. On this class of problems, the proposed approach appeared quite competitive among the other classifiers used as comparison. Some limitations also appeared, regarding, in particular, the scalability of the method, for which we have provided some options that will be explored in the future.

We have only considered, in this paper, applications of diffeomorphic learning to classification problems. Extensions to other contexts will be considered in the future. Some, such as regression, may be relatively straightforward, while others, such as clustering or dimension reduction, should require additional thought, as the obtained results will be highly dependent on the amount of metric distortion allowed in the diffeomorphic transformation.
References
N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337–404, 1950.

M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes. Computing large deformation metric mappings via geodesic flows of diffeomorphisms. Int. J. Comp. Vis., 61(2):139–157, 2005.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

M. D. Buhmann. Radial Basis Functions: Theory and Implementations, volume 12 of Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, Cambridge, 2003.

J. Duchon. Interpolation des fonctions de deux variables suivant le principe de la flexion des plaques minces. R.A.I.R.O. Analyse Numérique, 10:5–12, 1977.

Stanley Durrleman, Stéphanie Allassonnière, and Sarang Joshi. Sparse adaptive parameterization of variability in image ensembles. International Journal of Computer Vision, 101(1):161–183, 2013.

Stanley Durrleman, Marcel Prastawa, Nicolas Charon, Julie R. Korenberg, Sarang Joshi, Guido Gerig, and Alain Trouvé. Morphometry of anatomical shape complexes with dense deformations and sparse parameters. NeuroImage, 101:35–49, 2014.

N. Dyn. Interpolation and approximation by radial and related functions. In C. K. Chui, L. L. Schumaker, and J. D. Ward, editors, Approximation Theory VI: vol. 1, pages 211–234. Academic Press, 1989.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, volume 1. MIT Press, Cambridge, 2016.

Barbara Gris, Stanley Durrleman, and Alain Trouvé. A sub-Riemannian modular framework for diffeomorphism-based analysis of shape ensembles. SIAM Journal on Imaging Sciences, 11(1):802–833, 2018.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2003.

Leslie M. Hocking. Optimal Control: An Introduction to the Theory with Applications. Oxford University Press, 1991.

S. Joshi and M. Miller. Landmark matching via large deformation diffeomorphisms. IEEE Transactions on Image Processing, 9(8):1357–1370, 2000.

Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Jack Macki and Aaron Strauss. Introduction to Optimal Control Theory. Springer Science & Business Media, 2012.

J. Meinguet. Multivariate interpolation at arbitrary points made simple. J. Applied Math. and Physics, 30:292–304, 1979.

Mario Micheli and Joan A. Glaunès. Matrix-valued kernels for shape deformation analysis. Geom. Imaging Comput., 1(1):57–139, 2014.

Michael I. Miller, Alain Trouvé, and Laurent Younes. Hamiltonian systems and optimal control in computational anatomy: 100 years since D'Arcy Thompson. Annual Review of Biomedical Engineering, 17:447–509, 2015.

J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.

Alain Trouvé and Yong Yu. Metric similarities learning through examples: an application to shape retrieval. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 50–62. Springer, 2001.

Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.

T. L. Vincent and W. J. Grantham. Nonlinear and Optimal Control Systems. Wiley, 1997.

G. Wahba. Spline Models for Observational Data. SIAM, 1990.

Christian Walder and Bernhard Schölkopf. Diffeomorphic dimensionality reduction. In Advances in Neural Information Processing Systems, pages 1713–1720, 2009.

Laurent Younes. Shapes and Diffeomorphisms, volume 171. Springer Science & Business Media, 2010.

Laurent Younes. Constrained diffeomorphic shape evolution. Foundations of Computational Mathematics, 12(3):295–325, 2012.

Laurent Younes. Gaussian diffeons for surface and image matching within a Lagrangian framework.