Diffeomorphic Learning
Laurent Younes
Department of Applied Mathematics and Statistics and Center for Imaging Science
Johns Hopkins University
[email protected]
October 17, 2019
Abstract
We introduce in this paper a learning paradigm in which the training data is transformed by a diffeomorphic transformation before prediction. The learning algorithm minimizes a cost function evaluating the prediction error on the training set, penalized by the distance between the diffeomorphism and the identity. The approach borrows ideas from shape analysis, where diffeomorphisms are estimated for shape and image alignment, and brings them into a previously unexplored setting, estimating, in particular, diffeomorphisms in much larger dimensions. After introducing the concept and describing a learning algorithm, we present diverse applications, mostly with synthetic examples, demonstrating the potential of the approach, as well as some insight on how it can be improved.

Keywords: diffeomorphisms, reproducing kernel Hilbert spaces, classification
Introduction

We consider, in this paper, a family of classifiers that take the form x ↦ F(ψ(x)), x ∈ R^d, where ψ is a diffeomorphism of R^d and F is real-valued or categorical, belonging to a class of simple (e.g., linear) predictors. We will describe a training algorithm that, when presented with a finite number of training samples, displaces all points together through non-linear trajectories, in order to bring them to a position for which the classes are linearly separable. Because these trajectories are built with a guarantee to avoid collisions between points, they can be interpolated to provide a fluid motion of the whole space (R^d), resulting in a diffeomorphic transformation. One can then use this transformation to assign a class to any new observation. Because one may expect that the simplicity of the transformation will be directly related to the ability of the classifier to generalize, the point trajectories and their interpolation are estimated so that the resulting global transformation is penalized for straying too far from the identity, this penalty being assessed using a geodesic distance in the group of diffeomorphisms, based on methods previously developed for shape analysis.

The last decade's achievements in deep learning have indeed demonstrated that optimizing non-linear transformations of the data in very high dimensions and using massive parametrization could lead to highly performing predictions without being necessarily struck by the curse of dimensionality. The approach that we describe here also explores a "very large" space in terms, at least, of the number of parameters required to describe the transformations, and uses a training algorithm that implements, like neural networks, dynamic programming. It however frames the estimated transformations within a well-specified space of diffeomorphisms, whose nonlinear structure is well adapted to the successive compositions of functions that drive deep learning methods. While it shares some of the characteristics of deep methods, our formulation relies on a regularization term, which makes it closer in spirit to many of the methods used for non-parametric prediction.

We now sketch our model for classification, and for this purpose introduce some notation. We let c denote the number of classes for the problem at hand. As a consequence, the function F can either take values in C = {0, ..., c − 1}, or in the space of probability distributions on C (which will be our choice since we will use logistic regression). We furthermore assume that F is parametrized by a parameter θ ∈ R^q, and we will write F(x; θ) instead of F(x) when needed.

Assume that a training set is given, in the form T = (x_1, y_1, ..., x_N, y_N) with x_k ∈ R^d and y_k ∈ C, for k = 1, ..., N. We will assume that this set is not redundant, i.e., that x_i ≠ x_j whenever i ≠ j. (Our construction can be extended to redundant training sets by allowing the training class variables y_k to be multi-valued. This would result in rather cumbersome notation that is better avoided.) If ψ is a diffeomorphism of R^d, we will let ψ · T = (ψ(x_1), y_1, ..., ψ(x_N), y_N). The learning procedure will consist in minimizing the objective function

    G(ψ, θ) = D(id, ψ)² + λ g(θ) + (1/σ²) Γ(F(·, θ), ψ · T)    (1)

with respect to ψ and θ.
Here D is a Riemannian distance on a group of diffeomorphisms of R^d, which will be described in the next section, Γ is a "standard" loss function, g is a regularization term on the parameters of the final classifier, and λ, σ are positive numbers. One can take, for example,

    Γ(F(·, θ), ψ · T) = − Σ_{k=1}^N log F(ψ(x_k); θ)(y_k),    (2)

where

    F(z, θ)(y) = exp(θ_y^T z) / Σ_{y' ∈ C} exp(θ_{y'}^T z),

the parameter being θ = (θ_1, ..., θ_{c−1}) ∈ (R^d)^{c−1} with θ_0 = 0, and letting (as done in our experiments) g(θ) = |θ|².

The proposed construction shares some of its features with feed-forward networks (Goodfellow et al., 2016), for which, for example, using the loss function above on the final layer is standard, and can also be seen as a kernel method, sharing this property with classifiers such as support vector machines (SVMs) or other classification methods that use the "kernel trick" (Vapnik, 2013; Schölkopf and Smola, 2002). However, the algorithm is directly inspired from diffeomorphic shape analysis, and can be seen as a variant of the Large Deformation Diffeomorphic Metric Mapping (LDDMM) algorithm, which has been introduced for image and shape registration (Joshi and Miller, 2000; Beg et al., 2005; Younes, 2010). To our knowledge, this theory has not been applied so far to classification problems, although some similar ideas using a linearized model have been suggested in Trouvé and Yu (2001) for the inference of metrics in shape spaces, and a related approach has been discussed in Walder and Schölkopf (2009) in the context of dimensionality reduction.

While the underlying theory in shape analysis, which formulates the estimation of transformations as optimal control problems, is by now well established, the present paper introduces several original contributions. Translating shape analysis methods designed for small-dimensional problems to larger-scale problems indeed introduces new challenges, and suggests new strategies on the design of the flows that control the diffeomorphic transformations, on the choice of parameters driving the model, and on optimization schemes. Importantly, the paper provides experimental evidence that diffeomorphic methods can be competitive in machine learning contexts. Indeed, diffeomorphisms in high dimension are "wild objects," and, even with the kind of regularization that is applied in this paper, it was not obvious that using them in a non-parametric setting would avoid overfitting and compare, often favorably, with several state-of-the-art machine learning methods. As explained in section 5.1, the model (after the addition of a "dummy" dimension) can represent essentially any function of the data.

In this regard, given that one is ready to work with such high-dimensional objects, the requirement that the transformation is a diffeomorphism is not a strong restriction to the scope of the model. Using such a representation has several advantages, however. It expresses the model in terms of a well-understood non-linear space of functions (the group of diffeomorphisms), which has a simple algebraic structure and a differential geometry extensively explored in mathematics. The description of such models is, as a consequence, quite concise and explicit, and lends itself to an optimal control formulation with the learning algorithms that result from it.
Finally, there are obvious advantages, on the generative side, in using invertible data transformations, especially when the data in the transformed configuration may be easier to describe using a simple statistical model, since, in that case, the application of the inverse diffeomorphism to realizations of that simple model provides a generative model in the original configuration space.

The paper is organized as follows. Section 2 introduces basic concepts related to groups of diffeomorphisms and their Riemannian distances. Section 3 formulates the optimal control problem associated with the estimation of the diffeomorphic predictor, introducing its reduction to a finite-dimensional parametrization using a kernel trick and describing its optimality conditions. Some additional discussion of the reproducing kernel is provided in section 4. Section 5 introduces remarks and extensions that are specific to the prediction problems that we consider here, and sections 6 and 7 provide experimental results, including details on how (hyper-)parameters used in our model can be set. We conclude the paper in section 8.

We now describe the basic framework leading to the definition of Riemannian metrics on groups of diffeomorphisms. While many of the concepts described in these first sections are adapted from similar constructions in shape analysis (Younes, 2010), this short presentation helps making the paper self-contained and accessible to readers without prior knowledge of that domain.

Let B_p = C^p_0(R^d, R^d) denote the space of p-times continuously differentiable vector fields that tend to 0 (together with their first p derivatives) at infinity. This space is a Banach space for the norm

    ‖v‖_{p,∞} = max_{0 ≤ k ≤ p} ‖d^k v‖_∞,

where ‖·‖_∞ is the usual supremum norm.

Introduce a Hilbert space V of vector fields on R^d which is continuously embedded in B_p for some p ≥ 1, which means that there exists a constant C such that

    ‖v‖_{p,∞} ≤ C ‖v‖_V

for all v in V, where ‖·‖_V denotes the Hilbert norm on V (and we will denote the associated inner product by ⟨·, ·⟩_V). This assumption implies that V is a reproducing kernel Hilbert space (RKHS). Because V is a space of vector fields, the definition of the associated kernel slightly differs from the usual case of scalar-valued functions in that the kernel is matrix-valued. More precisely, a direct application of Riesz's Hilbert space representation theorem implies that there exists a function K : R^d × R^d → M_d(R), where M_d(R) is the space of d by d real matrices, such that

1. The vector field K(·, y)a : x ↦ K(x, y)a belongs to V for all y, a ∈ R^d.
2. For v ∈ V, for all y, a ∈ R^d, ⟨K(·, y)a, v⟩_V = a^T v(y).

These properties imply that ⟨K(·, x)a, K(·, y)b⟩_V = a^T K(x, y)b for all x, y, a, b ∈ R^d, which in turn implies that K(y, x) = K(x, y)^T for all x, y ∈ R^d.

Diffeomorphisms can be generated as flows of ordinary differential equations (ODEs) associated with time-dependent elements of V. More precisely, let v ∈ L²([0, 1], V), i.e., v(t) ∈ V for t ∈ [0, 1] and

    ∫₀¹ ‖v(t)‖²_V dt < ∞.

Then the ODE ∂_t y = v(t, y) has a unique solution over [0, 1] for any initial condition, and we let φ_v : (t, x) ↦ y(t), where y is the solution starting at x. This flow is, at all times, a diffeomorphism of R^d, and satisfies the equation ∂_t φ = v(t) ∘ φ, φ(0) = id (the identity map). Here, and in the rest of this paper, we make the small abuse of notation of writing v(t)(y) = v(t, y), where, in this case, v(t) ∈ V for all t ∈ [0, 1], and of writing φ_v(t) for the function x ↦ φ_v(t, x), so that φ_v may be considered either as a time-dependent diffeomorphism, or as a function of both time and space variables.

The set of diffeomorphisms that can be generated in such a way forms a group, denoted Diff_V since it depends on V. Given ψ ∈ Diff_V, one defines the optimal deformation cost Λ(ψ) from id to ψ as the minimum of ∫₀¹ ‖v(t)‖²_V dt over all v ∈ L²([0, 1], V) such that φ_v(1) = ψ. If we let D(ψ₀, ψ₁) = Λ(ψ₁ ∘ ψ₀⁻¹)^{1/2}, then D is a geodesic distance on Diff_V associated with the right-invariant Riemannian metric generated by v ↦ ‖v‖_V on V. We refer to Younes (2010) for more details and additional properties of this construction. For our purposes here, we only need to notice that the minimization of the objective function in (1) can be rewritten as an optimal control problem minimizing

    E(v, θ) = ∫₀¹ ‖v(t)‖²_V dt + λ g(θ) + (1/σ²) Γ(F(·, θ), φ(1) · T)    (3)

over v ∈ L²([0, 1], V) and θ ∈ R^q, subject to the constraint that φ(t) satisfies the equation ∂_t φ = v ∘ φ with φ(0) = id. We point out that, under mild regularity conditions on the dependency of Γ with respect to T (continuity in x_1, ..., x_N suffices), a minimizer of this function in v for fixed θ always exists, with v ∈ L²([0, 1], V).

The minimization in (3) can be reduced using an RKHS argument, similar to the kernel trick invoked in standard kernel methods (Aronszajn, 1950; Duchon, 1977; Meinguet, 1979; Wahba, 1990; Schölkopf and Smola, 2002). Let z_k(t) = φ(t, x_k).
Because the endpoint cost Γ only depends on (z_1(1), ..., z_N(1)), it suffices to compute these trajectories, which satisfy ∂_t z_k = v(t, z_k). This implies that an optimal v must be such that, at every time t, ‖v(t)‖_V is minimal over all ‖w‖_V with w satisfying w(z_k) = v(t, z_k), which requires v(t) to take the form

    v(t, ·) = Σ_{k=1}^N K(·, z_k(t)) a_k(t)    (4)

where K is the kernel of the RKHS V and a_1, ..., a_N are unknown time-dependent vectors in R^d, which provide our reduced variables. Letting a = (a_1, ..., a_N), the reduced problem requires to minimize

    E(a(·), θ) = ∫₀¹ Σ_{k,l=1}^N a_k(t)^T K(z_k(t), z_l(t)) a_l(t) dt + λ g(θ) + (1/σ²) Γ(F(·, θ), T(z(1)))    (5)

subject to ∂_t z_k = Σ_{l=1}^N K(z_k, z_l) a_l, z_k(0) = x_k, with the notation z = (z_1, ..., z_N) and T(z) = (z_1, y_1, ..., z_N, y_N).

3.2 Optimality Conditions and Gradient

We now consider the minimization problem with fixed θ (optimization in θ will depend on the specific choice of classifier F and risk function Γ). For the optimal control problem (5), the "state space" is the space in which the "state variable" z = (z_1, ..., z_N) belongs, and is therefore Q = (R^d)^N. The control space contains the control variable a, and is U = (R^d)^N.

Optimality conditions for the variable a are provided by Pontryagin's maximum principle (PMP). They require the introduction of a third variable (the co-state), denoted p ∈ Q, and of a control-dependent Hamiltonian H_a defined on Q × Q, given, in our case, by

    H_a(p, z) = Σ_{k,l=1}^N (p_k − a_k)^T K(z_k, z_l) a_l.    (6)

(In this expression, a, p and z do not depend on time.) The PMP (Hocking, 1991; Macki and Strauss, 2012; Miller et al., 2015; Vincent and Grantham, 1997) then states that any optimal solution a must be such that there exists a time-dependent co-state satisfying

    ∂_t z = ∂_p H_{a(t)}(p(t), z(t)),
    ∂_t p = − ∂_z H_{a(t)}(p(t), z(t)),
    a(t) = argmax_{a'} H_{a'}(p(t), z(t)),    (7)

with boundary conditions z(0) = (x_1, ..., x_N) and

    p(1) = − (1/σ²) ∂_z Γ(F(·, θ), T(z(1))).    (8)

These conditions are closely related to those allowing for the computation of the differential of E with respect to a(·), which is given by ∂_{a(·)} E(a(·), θ) = u(·) with

    u_k(t) = Σ_{l=1}^N K(z_k(t), z_l(t)) (p_l(t) − a_l(t)),    (9)

where p solves

    ∂_t z = ∂_p H_{a(t)}(p(t), z(t)),
    ∂_t p = − ∂_z H_{a(t)}(p(t), z(t)),    (10)

with boundary conditions z(0) = (x_1, ..., x_N) and p(1) = −(1/σ²) ∂_z Γ(F(·, θ), T(z(1))).

Concretely, the differential is computed by (i) solving the first equation of (10), which does not involve p, (ii) using the obtained value of z(1) to compute p(1) from the boundary condition, (iii) solving the second equation of (10) backward in time to obtain p at all times, and (iv) plugging it into the definition of u(t).

For practical purposes, the discrete-time version of the problem is obviously more useful, and its differential is obtained from a similar dynamic programming (or back-propagation) computation. Namely, discretize time over 0, 1, ..., T and consider the objective function

    E(a(·), θ) = (1/T) Σ_{t=0}^{T−1} Σ_{k,l=1}^N a_k(t)^T K(z_k(t), z_l(t)) a_l(t) + λ g(θ) + (1/σ²) Γ(F(·, θ), T(z(T)))    (11)

subject to

    z_k(t+1) = z_k(t) + (1/T) Σ_{l=1}^N K(z_k(t), z_l(t)) a_l(t),   z_k(0) = x_k.
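To make this discrete recursion concrete, the following minimal numpy sketch computes the forward trajectories and the kinetic term of (11), assuming, for illustration, a scalar Gaussian kernel (so that K(x, y) = κ(x, y) Id and kernel products reduce to weighted sums of the vectors a_l); the function and variable names are ours and not taken from the paper's implementation.

    import numpy as np

    def gauss_kernel(z, rho):
        # kappa(x, y) = exp(-|x - y|^2 / (2 rho^2)); with a scalar kernel K = kappa Id,
        # products K(z_k, z_l) a_l reduce to kappa(z_k, z_l) a_l.
        d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * rho ** 2))

    def flow_and_energy(x, a, rho):
        # x: (N, d) training points, a: (T, N, d) controls a_k(t).
        # Returns the trajectories z(0), ..., z(T) and the kinetic part of (11),
        # (1/T) sum_t sum_{k,l} a_k(t)^T K(z_k(t), z_l(t)) a_l(t).
        T = a.shape[0]
        z = x.astype(float).copy()
        traj, kinetic = [z.copy()], 0.0
        for t in range(T):
            Kt = gauss_kernel(z, rho)          # N x N kernel matrix at time t
            kinetic += np.sum(Kt * (a[t] @ a[t].T)) / T
            z = z + (Kt @ a[t]) / T            # z(t+1) = z(t) + (1/T) sum_l kappa a_l(t)
            traj.append(z.copy())
        return np.stack(traj), kinetic

Automatic differentiation through this recursion reproduces, up to the metric conditioning discussed next, the gradient obtained from the discrete co-state equations below.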
We therefore use a simple Euler scheme to discretize the state ODE. Note that the state is discretized over 0, ..., T and the control over 0, ..., T − 1. The differential of E is now given by the following expression, very similar to that obtained in continuous time:

    ∂_{a(·)} E(a(·), θ) = u(·)

with

    u_k(t) = Σ_{l=1}^N K(z_k(t), z_l(t)) (p_l(t) − a_l(t)),   t = 0, ..., T − 1,

where p (discretized over 0, ..., T − 1) solves

    z(t+1) = z(t) + (1/T) ∂_p H_{a(t)}(p(t), z(t)),
    p(t−1) = p(t) + (1/T) ∂_z H_{a(t)}(p(t), z(t)),    (12)

with boundary conditions z(0) = (x_1, ..., x_N) and

    p(T−1) = − (1/σ²) ∂_z Γ(F(·, θ), T(z(T))).

These computations allow us to compute the differential of the objective function with respect to a. The differential in θ depends on the selected terminal classifier, and its expression for the function chosen in (2) is standard.

We emphasize that we are speaking of the differential of the objective function rather than of its gradient. Our implementation uses a Riemannian (sometimes called "natural") gradient with respect to the metric

    ⟨η(·), η(·)⟩_{a(·)} = ∫₀¹ Σ_{k,l=1}^N η_k(t)^T K(z_k(t), z_l(t)) η_l(t) dt

with ∂_t z_k = Σ_{l=1}^N K(z_k(t), z_l(t)) a_l(t). With respect to this metric, one has

    ∇_{a(·)} E(a(·), θ) = p − a,

a very simple expression that can also be used in the discrete case. Using this Riemannian inner product as a conditioner for the deformation parameters (and a standard Euclidean inner product on the other parameters), the experiments in sections 6 and 7 run Polak-Ribière conjugate gradient iterations (Nocedal and Wright, 1999) to optimize the objective function. (We also experimented with limited-memory BFGS, which does not use natural gradients, and found that the conditioned conjugate gradient performed better on our data.)

To fully specify the algorithm, one needs to select the RKHS V or, equivalently, its reproducing kernel K. They constitute important components of the model because they determine the regularity of the estimated diffeomorphisms. We recall that K is a kernel over vector fields, and is therefore matrix-valued. One simple way to build such a kernel is to start with a scalar positive kernel κ : R^d × R^d → R and let

    K(x, y) = κ(x, y) Id_{R^d}.    (13)

We will refer to such kernels as "scalar."

One can choose κ from the large collection of known positive kernels (and their infinite number of possible combinations; Aronszajn (1950); Dyn (1989); Schölkopf and Smola (2002); Buhmann (2003)). The most common options are Gaussian kernels,

    κ(x, y) = exp(−|x − y|² / (2ρ²)),    (14)

or Matérn kernels of class C^k,

    κ(x, y) = P_k(|x − y|/ρ) exp(−|x − y|/ρ),    (15)

where P_k is a reversed Bessel polynomial of order k. In both cases, ρ is a positive scale parameter. The Matérn kernels have the nice property that their associated RKHS is equivalent to a Sobolev space of order k + d/2.

Vector fields v in the RKHS associated with a matrix kernel such as (13), where κ is a radial basis function (RBF), are such that each coordinate function of v belongs to the scalar RKHS associated with κ, which is translation and rotation invariant (i.e., the transformations that associate to a scalar function h the functions x ↦ h(R^T(x − b)) are isometries, for all rotation matrices R and all vectors b ∈ R^d).

While (13) provides a simple recipe for the definition of matrix-valued kernels, other interesting choices can be made within this class. Rotation and translation invariance more adapted to spaces of vector fields, in which one requires that replacing v : R^d → R^d by x ↦ R v(R^T(x − b)) is an isometry of V for all R, b, leads to a more general class of matrix kernels extensively discussed in Micheli and Glaunès (2014). When the data is structured, however (e.g., when it is defined over a grid), rotation invariance may not be a good requirement since it allows, for example, for permuting coordinates, which would break the data structure. In this context, other choices may be preferable, as illustrated by the following example.
Assume that the data is defined over a graph, say G, with d vertices. Then one may consider matrix kernels relying on this structure. For example, letting N_i denote the set of nearest neighbors of i in G, one can simply take

    K(x, y) = diag(Φ(|P_i x − P_i y|), i = 1, ..., d)    (16)

where P_i x is the vector (x_j, j ∈ N_i) and Φ is an RBF associated with a positive radial scalar kernel.

RKHSs of vector fields built from RBFs have the property that all their elements (and several of their derivatives) vanish at infinity. As a consequence, these spaces do not contain simple transformations, such as translations or more general affine transformations. It is however possible to complement them with such mappings, defining

    V̂_a = { g + v : g ∈ a, v ∈ V }

where a is any Lie sub-algebra of the group of affine transformations (so that any element g ∈ a takes the form g(x) = Ax + b, where A is a matrix and b is a vector). In particular, a can be the whole space of affine transformations. Since a and V intersect only at {0}, one can define without ambiguity a Hilbert norm on V̂_a by letting

    ‖g + v‖²_{V̂_a} = ‖g‖²_a + ‖v‖²_V

where ‖·‖_a is any inner-product norm on a. A simple choice, for g(x) = Ax + b, is to take

    ‖g‖²_a = κ₁ trace(AA^T) + κ₂ |b|².    (17)

Both a and V̂_a are RKHSs in this case, with respective kernels K_a and K_a + K, where K is the kernel of V. If the norm on a is given by (17), then

    K_a(x, y) = ( x^T y / κ₁ + 1 / κ₂ ) Id_{R^d},

as can be deduced from the definition of a reproducing kernel.

Instead of using this extended RKHS, one may prefer to model affine transformations separately from the vector field. This leads to replacing (4) by

    v̂(t, ·) = g(t, ·) + Σ_{k=1}^N K(·, z_k(t)) a_k(t)

and the cost (5) by

    E(a(·), θ) = ∫₀¹ ‖g(t)‖²_a dt + ∫₀¹ Σ_{k,l=1}^N a_k(t)^T K(z_k(t), z_l(t)) a_l(t) dt + λ g(θ) + (1/σ²) Γ(F(·, θ), T(z(1)))

with ∂_t z_k = v̂(t, z_k(t)), z_k(0) = x_k. The two approaches are, in theory, equivalent, in that the second one simply forgoes the reduction of the affine part of the vector field, but it may be helpful, numerically, to use separate variables in the optimization process for the affine transform and the reduced coefficients of the vector field, because this gives more flexibility to the optimization. The derivation of the associated optimality conditions and gradient is similar to that made in section 3.2 and is left to the reader.

Another point worth mentioning is that, if φ satisfies ∂_t φ = g ∘ φ + v ∘ φ for (time-dependent) g ∈ a and v ∈ V, and if one defines the time-dependent affine transformation ρ by ∂_t ρ = g ∘ ρ, then φ = ρ ∘ ψ, where ψ satisfies ∂_t ψ = w ∘ ψ and w = ρ_L^{−1} v ∘ ρ, ρ_L being the linear part of ρ. In the special case when a is the Lie algebra of the Euclidean group, so that ρ is the composition of a rotation and a translation, and when the norm of V is invariant under such transformations (e.g., when using a scalar kernel associated with an RBF), then φ and ψ are equidistant to the identity. As a consequence, when the final function F implements a linear model, there is, in theory (numerics may be different), no gain in introducing an affine component restricted to rotations and translations.
The equidistance property does not hold, however, if one uses a larger group of affine transformations, or a norm on V that is not Euclidean invariant, and the introduction of such transformations actually extends the model in a way that may significantly modify its performance, generally, in our experiments, for the better.
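Before moving on, the following short sketch makes the kernel constructions of this section explicit: the scalar matrix-valued kernel (13) built from a Gaussian κ as in (14), and the affine kernel associated with the norm (17), whose sum is the kernel of the extended space V̂_a. The function names and the choice of a Gaussian κ (rather than the Matérn kernels used in our experiments) are ours.

    import numpy as np

    def scalar_matrix_kernel(x, y, rho):
        # Equation (13) with a Gaussian kappa as in (14): K(x, y) = kappa(x, y) Id.
        d = x.shape[0]
        kappa = np.exp(-np.sum((x - y) ** 2) / (2 * rho ** 2))
        return kappa * np.eye(d)

    def affine_kernel(x, y, kappa1, kappa2):
        # Reproducing kernel of the affine part for the norm (17):
        # K_a(x, y) = (x^T y / kappa1 + 1 / kappa2) Id.
        d = x.shape[0]
        return (x @ y / kappa1 + 1.0 / kappa2) * np.eye(d)

    def extended_kernel(x, y, rho, kappa1, kappa2):
        # Kernel of the extended space V_a = a + V, namely K_a + K.
        return affine_kernel(x, y, kappa1, kappa2) + scalar_matrix_kernel(x, y, rho)

With a Matérn kernel, only the function computing κ changes; the lift to a matrix-valued kernel and the affine part are unaffected.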
Enhancements and Remarks

We use, in this paper, logistic regression as the final classifier applied to the transformed data. This classifier estimates a linear separation rule between the classes, but it should be clear that not every training set can be transformed into a linearly separable one using a diffeomorphism of the ambient space. A very simple one-dimensional example is when the true class associated to an input x ∈ R is 0 if |x| is smaller than some threshold and 1 otherwise: no diffeomorphism of the line can make these two interleaved classes linearly separable. One can, however, always pass to a (d + 1)-dimensional dataset in which X is replaced by (X, 0); in the one-dimensional example, a transformation of the form (x, µ) ↦ (x, µ + x² − c), for a suitable constant c, is a diffeomorphism of R² that separates the two classes (along the y axis).

Notice that any binary classifier that can be expressed as x ↦ sign(f(x) − a) for some smooth function f can be included in the model class we are considering after adding a dimension, simply taking (letting again µ denote the additional scalar variable) ψ(x, µ) = (x, µ + f(x)), which is a diffeomorphism of R^{d+1}, together with the linear direction u = (0_{R^d}, 1). The learning algorithm will, of course, estimate a transformation φ that minimizes the overall distortion, essentially trading off some non-linear transformation of the data, x, to induce a "simpler" classification rule. Figure 1 provides an example of the effect of adding a dimension when the original data is two-dimensional with two classes forming a target pattern. While no 2D transformation can break the topology and separate the central disk from the exterior ring, using an additional dimension offers a straightforward solution.

Figure 1: Additional dimension separating the center of a target from its external ring. Left: initial configuration; right: transformed configuration.

Another simple example is provided in Figure 2, where a circle (class 1) is inscribed in a half ellipse (class 2). In this case, adding a dimension is not required, but the transformation estimated when this is done is closer to the identity (in the sense of our metric on diffeomorphisms) than the one computed in two dimensions.

Figure 2: Comparison of optimal transformations without (top) and with (bottom) adding a dimension. Left: initial configuration; right: transformed configuration.

When adding one or more dimensions (in a c-class problem, it makes sense to add c − 1 of them), training data are expanded as x_k → (x_k, δ u_k) for a small δ, with u_k a realization of a standard Gaussian variable, in order to help break symmetries in the training phase (test data being still expanded as x_k → (x_k, 0)).
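As a concrete illustration of the constructions just described, the sketch below implements the lift ψ(x, µ) = (x, µ + f(x)) for a given smooth f, and the expansion of training points with a small Gaussian dummy coordinate. It is a toy illustration with names of our own choosing, not the training code.

    import numpy as np

    def lift(x, mu, f):
        # psi(x, mu) = (x, mu + f(x)): a diffeomorphism of R^(d+1); with the linear
        # direction u = (0, ..., 0, 1), the rule sign(f(x) - a) becomes linear.
        return np.concatenate([x, [mu + f(x)]])

    def expand_training(x, delta, seed=0):
        # Training points are expanded as x_k -> (x_k, delta * u_k), with u_k standard
        # Gaussian; test points would instead receive a zero last coordinate.
        rng = np.random.default_rng(seed)
        return np.hstack([x, delta * rng.standard_normal((x.shape[0], 1))])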
The dimensional reduction method described in section 3.1 is the exact finite-dimensional parametrization of the spatial component of the original infinite-dimensional problem. Assuming T steps for the time discretization, this results in TdN parameters in the transformation part of the model, while the computation of the gradient has a cost of order TdN². With our implementation (on a four-core Intel i7 laptop), we were able to handle up to 2,000 training samples in 100 dimensions with 10 time steps, which required about one day for 2,000 gradient iterations. Even if the efficiency of our implementation may be improved, for example through the use of GPU arrays, it is clear that the algorithm we just described will not scale to large datasets unless some approximations are made. Importantly, no significant reduction in computation cost can be obtained by randomly sampling from the training set at each iteration, since the representation (4) requires computing the trajectories of all sample points.

This suggests replacing the optimal expression in (4) by a form that does not require tracking the trajectories of N points. One possibility is to require that v be decomposed as a sum similar to (4), but over a smaller number of "control points," i.e., to let

    v(t, ·) = Σ_{j=1}^n K(·, ζ_j(t)) a_j(t)    (18)

with ∂_t ζ_j = v(t, ζ_j) and ζ_j(0) = ζ_j^0, where {ζ_1^0, ..., ζ_n^0} is, for example, a subset of the training set. As a result, the computational cost per iteration would now be of order TdNn in the data size (because the evolution of all points is still needed to evaluate the objective function and its gradient), but this change would make possible randomized evaluations of the data cost Γ (leading to stochastic gradient descent), which would replace TdnN by TdnN′, where N′ is the size of the mini-batch used at each iteration.
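The sketch below illustrates the control-point parametrization (18) with a scalar Gaussian kernel: the n control points are advected by the field they define, and the N data points are transported by the same field. The names and the Euler discretization (with T steps, mirroring (11)) are ours.

    import numpy as np

    def gauss(a, b, rho):
        # Gaussian scalar kernel matrix between point sets a (m, d) and b (n, d).
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * rho ** 2))

    def control_point_flow(x, zeta0, alpha, rho):
        # Reduced parametrization (18): v(t, .) = sum_j kappa(., zeta_j(t)) a_j(t).
        # x: (N, d) data points, zeta0: (n, d) initial control points, alpha: (T, n, d).
        T = alpha.shape[0]
        z, zeta = x.astype(float).copy(), zeta0.astype(float).copy()
        for t in range(T):
            z_new = z + gauss(z, zeta, rho) @ alpha[t] / T           # transport the N samples
            zeta_new = zeta + gauss(zeta, zeta, rho) @ alpha[t] / T  # advect the control points
            z, zeta = z_new, zeta_new
        return z, zeta

In a stochastic-gradient variant, only the rows of x belonging to a mini-batch would be transported when evaluating Γ, while the control-point trajectories are always computed.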
This general scheme will be explored in future work, where optimal strategies for selecting an n-point subset of the training data must be analyzed, including in particular the trade-off they imply between computation time and sub-optimality of solutions. Even though they appeared in a very different context, some inspiration may be obtained from similar approaches that were explored in shape analysis (Younes, 2012; Durrleman et al., 2013; Younes, 2014; Durrleman et al., 2014; Gris et al., 2018).

Another plausible choice is to specify a parametric form for the vector field at a given time. It is quite appealing to use a neural-net-like expression, letting

    v(t, x) = Φ(A(t) x + b(t))

where A is a time-dependent d by d matrix, b a time-dependent vector and Φ a fixed nonlinear transformation applied coordinate-wise (for example, t ↦ t exp(−t²/2)). While ‖v(t)‖_V may be challenging to compute analytically when v is given in this form, it is sufficient, in order to ensure the existence of solutions to the state equation, to control the supremum norm of the first derivative of v(t), which, since Φ is fixed, only depends on A(t). One can therefore replace ‖v(t)‖_V in the optimal control formulation (3) by any norm evaluated at A(t) in the finite-dimensional space of d by d matrices. Preliminary experiments, run on some of the examples discussed in section 6, show that such a representation can perform very well in some cases and completely fail in others, clearly requiring further study to be reported in future work. The computation time is, in all cases, significantly reduced compared to the original version.

It is important to stress that, even though our model involves diffeomorphic transformations, it is not a deformable template model. The latter type of model typically works with small-dimensional images (k = 2 or 3), say I : R^k → R, and tries to adjust a k-dimensional deformation (using a diffeomorphism g : R^k → R^k) such that the deformed image, given by I ∘ g^{−1}, aligns with a fixed template (and classification based on a finite number of templates would pick the one for which a combination of the deformation and the associated residuals is smallest).

The transformation ψ_g : I ↦ I ∘ g^{−1} is a homeomorphism of the space of, say, continuous images. Once images are discretized over a grid with d points, ψ_g becomes (assuming that the grid is fine enough) a one-to-one transformation of R^d, but a very special one. In this context, the model described in this paper would be directly applied to discrete images, looking for a d-dimensional diffeomorphism that would include deformations such as ψ_g, but many others, involving also variations in the image values or more complex transformations (including, for example, reshuffling all image pixels in arbitrary orders!).

Experiments

We now provide a few experiments that illustrate some of the advantages of the proposed diffeomorphic learning method, as well as some of its limitations.
We will compare the performance of this algorithm with a few off-the-shelf methods, namely k-nearest neighbors (kNN), linear and non-linear SVM, random forests (RF), multi-layer perceptrons with 1, 2 and 5 hidden layers (abbreviated below as MLP1, MLP2 and MLP5), and logistic regression (Hastie et al., 2003; Bishop, 2006). The classification rates reported below were evaluated on a test set containing 2,000 examples per class (except for MNIST, for which we used the test set available with this data).

We used the scikit-learn Python package (Pedregosa et al., 2011) with the following parameters (most being defaults in scikit-learn).

• Linear SVM: ℓ² penalty with default weight C = 1, with a one-vs-all multi-class strategy when relevant.

• Kernel SVM: ℓ² penalty with weight C estimated as described in section 7. Gaussian (i.e., RBF) kernel with coefficient γ identical to that used for the kernel in diffeomorphic learning. One-vs-all multi-class strategy when relevant.

• Random forests: 100 trees, with the Gini entropy splitting rule and the default choice (√d) of the number of features considered at each node.

• k-nearest neighbors: with the standard Euclidean metric and the default number of neighbors (5).

• Multi-layer perceptrons: with ReLU activations, the ADAM solver, a constant learning rate and at most 10,000 iterations, using 1, 2 or 5 hidden layers, each composed of 100 units.
• Logistic regression: ℓ² penalty with weight C = 1. The same classifier is used as the final step of the diffeomorphic learning algorithm, so that its performance on transformed data is also the performance of the algorithm proposed in this paper.

In all cases, we added a dummy dimension to the data, as described in section 5.1, when running diffeomorphic learning (classification results on original data were obtained without the added dimension). We also used an affine transformation to complement the kernel, as described in section 4.3. Note that adding a dimension was not always necessary for learning (especially for large-dimensional problems), nor did the affine transform always improve the results, but neither option ever harmed the results in any significant way, so that it was simpler to always turn them on in our experiments. The optimization procedure was initialized with a vanishing vector field (i.e., ψ = id) and run until numerical stabilization (with a limit of 2,000 iterations at most). Doing so was always overkill in terms of classification performance, because in almost all cases the limit classification error on the test set stabilizes faster than the time taken by the algorithm to optimize the transformation. It was however important to avoid stopping the minimization early, in order to make sure that optimizing the diffeomorphism did not result in overfitting.

We now describe the datasets that we used (all but the last one being synthetic) and compare the performances of the classifiers above on the original data and on the transformed data after learning.

In our first set of experiments, we let D_i = R(T_i × R^{d−3}), where T_0, T_1 are non-intersecting tori in R³ and R is a random d-dimensional rotation. The tori are positioned as illustrated in the first panel of Figure 3, so that, even though they have an empty intersection, they are not linearly separable. The distribution of training and test data is (before rotation) the product of a uniform distribution on the torus and of a standard Gaussian in the d − 3 remaining dimensions.

Figure 3: Left: Tori (d = 3). Center: Spherical layers (d = 2). Right: "RBF" data (d = 2).

Classification results for this dataset are summarized in Table 1. Here, we let the number of noise dimensions vary from 0 to 7 and 17 (so that the total number of dimensions is 3, 10 and 20), and the number of training samples is 200, 500 or 1,000. The problem becomes very challenging for most classifiers when the number of noisy dimensions is large, in which case a multi-layer perceptron with one hidden layer seems to perform best. All other classifiers see their performance improved after diffeomorphic transformation of the data. One can also notice that, after transformation, all classifiers perform approximately at the same level. Figure 4 illustrates how the data is transformed by the diffeomorphism and is typical of the other results.

We also point out that all classifiers except RF are invariant under rotation of the data, so that making a random rotation when generating it does not change their performance. The estimation of the diffeomorphism using a radial kernel is also rotation invariant. The RF classifier, which is based on comparisons along coordinate axes, is highly affected, however. Without a random rotation, it

Log. reg. lin.
SVM SVM RF kNN MLP1 MLP2 MLP5Tori, d = 3, 100 samples per classOriginal Data 0.341 0.341 0.000 0.007 0.000 0.004 0.000 0.000Transformed Data 0.000 0.000 0.000 0.007 0.000 0.000 0.000 0.000Tori, d = 10, 100 samples per classOriginal Data 0.312 0.317 0.280 0.311 0.309 0.159 0.170 0.233Transformed Data 0.159 0.163 0.162 0.176 0.153 0.155 0.153 0.163Tori, d = 10, 250 samples per classOriginal Data 0.325 0.326 0.207 0.285 0.257 0.073 0.093 0.109Transformed Data 0.023 0.024 0.017 0.021 0.012 0.030 0.026 0.025Tori, d = 20, 100 samples per classOriginal Data 0.320 0.317 0.324 0.376 0.369 0.325 0.325 0.353Transformed Data 0.321 0.329 0.322 0.347 0.330 0.320 0.319 0.323Tori, d = 20, 250 samples per classOriginal Data 0.320 0.323 0.312 0.339 0.367 0.204 0.284 0.280Transformed Data 0.249 0.255 0.249 0.277 0.251 0.247 0.247 0.250Tori, d = 20, 500 samples per classOriginal Data 0.316 0.315 0.267 0.305 0.355 0.117 0.130 0.191Transformed Data 0.163 0.166 0.158 0.173 0.167 0.160 0.163 0.164Table 1: Comparative performance of classifiers on “Tori” dataFigure 4: Visualization of the diffeomorphic flow applied to the 3D tori dataset. Left: trainingdata; Right: test data. Top to bottom: t = 0, t = 0 . t = 0 . t = 1. The data is visualizedin a frame formed by the discriminant direction followed by the two principal components in theperpendicular space.performs extremely well, with an error rate of only 0.05 with 17 noisy dimensions and 250 examplesper class. 13 .3 Spherical Layers We here deduce the class from the sign of cos(9 | X | ) where X follows a uniform distribution on the d -dimensional unit sphere, and we provide results with d = 3 (a representation of the data set in2D is provided in the second panel of Figure 3). We see that linear classifiers cannot do better thanchance (this is by design), but that after the estimated diffeomorphic transformation is applied, allclassifiers outperform, with a large margin for small datasets, the best non-linear classifiers trainedon the original data. Log. reg. lin. SVM SVM RF kNN MLP1 MLP2 MLP5Spherical layers, d = 3, 100 samples per classOriginal Data 0.507 0.507 0.357 0.353 0.408 0.478 0.488 0.481Transformed Data 0.246 0.249 0.329 0.279 0.355 0.246 0.245 0.253Spherical layers, d = 3, 250 samples per classOriginal Data 0.487 0.487 0.233 0.231 0.279 0.394 0.264 0.113Transformed Data 0.119 0.118 0.185 0.129 0.199 0.149 0.143 0.138Spherical layers, d = 3, 500 samples per classOriginal Data 0.506 0.506 0.185 0.194 0.229 0.299 0.113 0.091Transformed Data 0.089 0.092 0.165 0.087 0.197 0.092 0.103 0.096Table 2: Comparative performance of classifiers on 3D spherical layers. The next dataset generates classes using sums of radial basis functions. More precisely, we let ρ ( z ) = exp( − ( z/α ) ) with α = 0 . L (cid:88) j =1 ρ ( | X − c j | ) a j − µ . In this expression, X follows a uniform distribution over the d -dimensional unit sphere and µ isestimated so that both positive and negative classes are balanced. The centers, c , . . . , c L are chosenas c j = (3 j/L ) e ( j mod d )+1 where j mod d is the remainder of the division of j by d and e , . . . , e d isthe canonical basis of R d . The coefficients are a j = cos(6 πj/L ), and we took L = 100. The thirdpanel of Figure 3 depicts the resulting regions of the unit disc in the case d = 2.Table 3 provides classification performances for various combinations of dimension and samplesize. 
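For reference, the following sketch generates data of this type. The sampling of X (uniform in the unit ball), the balancing of µ through the median score, and the value of α are assumptions made for the illustration; the construction of the centers c_j and coefficients a_j follows the description above.

    import numpy as np

    def rbf_dataset(n, d, alpha, L=100, seed=0):
        # Class = sign of sum_j rho(|X - c_j|) a_j - mu, with rho(z) = exp(-(z/alpha)^2),
        # c_j = (3j/L) e_{(j mod d)+1}, a_j = cos(6 pi j / L). X is sampled uniformly in
        # the unit ball, and mu is the median score so that the classes are balanced.
        rng = np.random.default_rng(seed)
        x = rng.standard_normal((n, d))
        x /= np.linalg.norm(x, axis=1, keepdims=True)
        x *= rng.uniform(size=(n, 1)) ** (1.0 / d)
        j = np.arange(1, L + 1)
        centers = np.zeros((L, d))
        centers[np.arange(L), j % d] = 3.0 * j / L
        a = np.cos(6 * np.pi * j / L)
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1)
        score = (np.exp(-(dist / alpha) ** 2) * a).sum(axis=1)
        y = (score > np.median(score)).astype(int)
        return x, y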
Excepted when d = 5 and only 100 samples per class are observed, in which case linear classifiersprovide the best rates, the best classification is obtained after diffeomorphic transformation. In the next example, we assume that the conditional distribution of X given Y = y is normal withmean m y and covariance matrix Σ y . We used three classes ( y ∈ { , , } ) in dimension 20, with • m = (0 , . . . , T , m = ( − , − , − , , . . . , T and m = ( − , , − , . . . . , T , • Σ = 10 Id R d , Σ ( i, j ) = 20 exp( −| i − j | /
20) and Σ ( i, j ) = 20 exp( −| i − j | / d = 2, 100 samples per classOriginal Data 0.381 0.381 0.158 0.126 0.208 0.183 0.193 0.163Transformed Data 0.142 0.139 0.149 0.153 0.186 0.140 0.136 0.141RBF data, d = 2, 250 samples per classOriginal Data 0.358 0.359 0.109 0.150 0.136 0.176 0.158 0.127Transformed Data 0.101 0.102 0.106 0.102 0.130 0.126 0.122 0.119RBF data, d = 2, 500 samples per classOriginal Data 0.377 0.378 0.089 0.103 0.097 0.119 0.101 0.081Transformed Data 0.058 0.055 0.085 0.055 0.090 0.070 0.054 0.142RBF data, d = 3, 100 samples per classOriginal Data 0.307 0.307 0.209 0.207 0.221 0.218 0.233 0.250Transformed Data 0.171 0.175 0.202 0.168 0.195 0.162 0.162 0.154RBF data, d = 3, 250 samples per classOriginal Data 0.310 0.308 0.139 0.182 0.148 0.151 0.151 0.103Transformed Data 0.083 0.077 0.133 0.082 0.122 0.086 0.083 0.089RBF data, d = 3, 500 samples per classOriginal Data 0.337 0.335 0.129 0.112 0.114 0.144 0.093 0.094Transformed Data 0.071 0.070 0.118 0.063 0.102 0.080 0.068 0.078RBF data, d = 5, 100 samples per classOriginal Data 0.216 0.218 0.221 0.256 0.254 0.280 0.259 0.233Transformed Data 0.221 0.223 0.221 0.222 0.219 0.221 0.223 0.214RBF data, d = 5, 250 samples per classOriginal Data 0.233 0.233 0.197 0.226 0.213 0.250 0.273 0.240Transformed Data 0.199 0.204 0.185 0.215 0.196 0.207 0.209 0.205RBF data, d = 5, 500 samples per classOriginal Data 0.231 0.230 0.167 0.219 0.184 0.206 0.189 0.158Transformed Data 0.128 0.129 0.133 0.127 0.142 0.141 0.140 0.143Table 3: Comparative performance of classifiers on “RBF” datawhere Id R d is the d -dimensional identity matrix. Classification results for training sets with 100and 250 examples per class are described in Table 4. While multilayer perceptrons perform best onthis data, the performance of diffeomorphic classification is comparable, and improves on all otherclassifiers. A visualization of the transformation applied to this dataset is provided in Figure 5. To build this dataset, we start with two scalar functions and assume that the observed data isthe restriction of one of them (where this choice determines the class) to a small discrete interval.More precisely, we let Y ∈ { , } and given Y = y , we let X = ( ψ y ( t − A ) + ε t , t ∈ I ) where I = { , /d, . . . , ( d − /d ) } is a discretization of the unit interval, ( ε t , t ∈ I ) are independent standardGaussian variables and A follows a uniform distribution on [ − , d = 20, 100 samples per classOriginal Data 0.398 0.407 0.225 0.191 0.495 0.143 0.101 0.099Transformed Data 0.135 0.153 0.175 0.110 0.236 0.134 0.109 0.108Mixture of Gaussians, d = 20, 250 samples per classOriginal Data 0.354 0.359 0.172 0.163 0.427 0.073 0.061 0.063Transformed Data 0.079 0.091 0.145 0.089 0.187 0.077 0.074 0.074Table 4: Comparative performance of classifiers on Gaussian mixtures.Figure 5: Diffeomorphic transformation applied to the Gaussian mixture dataset (evolution visual-ized at times t = 0 . , . , . , . t = 1 .
5, we took ψ y ( t ) = log cosh( t/β y ) with β = 0 .
02 and β = 0 . d = 50, 100 samples per classOriginal Data 0.484 0.495 0.400 0.448 0.466 0.449 0.444 0.465Transformed Data 0.324 0.324 0.350 0.378 0.438 0.329 0.335 0.332Curve segments, d = 50, 250 samples per classOriginal Data 0.487 0.486 0.266 0.345 0.353 0.493 0.486 0.467Transformed Data 0.152 0.157 0.147 0.195 0.248 0.155 0.258 0.216Curve segments, d = 50, 500 samples per classOriginal Data 0.495 0.496 0.155 0.294 0.230 0.505 0.484 0.426Transformed Data 0.093 0.094 0.062 0.070 0.093 0.093 0.104 0.113Table 5: Comparative performance of classifiers on ”curve segment” data. Here, we consider a 50-dimensional dataset with two classes. Each sample has all coordinates equalto zero except two of them that are equal to ±
1, and the class is 0 if the two nonzero coordinates16re equal and 1 otherwise. The position of the two nonzero values is random and each takes thevalues ± √ d = 50, 100 samples per classOriginal Data 0.505 0.511 0.484 0.500 0.499 0.345 0.428 0.493Transformed Data 0.066 0.068 0.010 0.024 0.017 0.019 0.016 0.014xor data, d = 50, 250 samples per classOriginal Data 0.498 0.497 0.404 0.461 0.494 0.005 0.047 0.248Transformed Data 0.008 0.006 0.000 0.000 0.000 0.000 0.000 0.001Table 6: Comparative performance of classifiers on xor data. Our next synthetic example is a simple pattern recognition problem in d = 100 dimensions. Samplesfrom class 1 take the form ( ρ U , . . . , ρ d U d ) where ρ is uniformly distributed between 0 .
75 and 1 . U = ( U , . . . U d ) is a binary vector with exactly l y consecutive ones (possibly wrapping around { , . . . , d } )and d − l y zeros, with l = 10 and l = 11.An optimal linear separation between classes is achieved (because of shift invariance) by thresh-olding the sum of the d variables. However, because of the multiplication by ρ , . . . , ρ d , this simpleclassification rule performs very poorly, while the same rule applied to the binary variables ob-tained by thresholding individual entries at, say, 0.5 perfectly separates the classes. It is thereforenot surprising that multi-layer perceptrons perform very well on this problem and achieve the bestclassification rates. All other classifiers perform significantly better when run on the transformeddata. Log. reg. lin. SVM SVM RF kNN MLP1 MLP2 MLP5Segment length data, d = 100, 100 samples per classOriginal Data 0.462 0.412 0.387 0.491 0.505 0.337 0.343 0.328Transformed Data 0.341 0.349 0.354 0.347 0.450 0.390 0.385 0.362Segment length data, d = 100, 250 samples per classOriginal Data 0.432 0.412 0.135 0.483 0.503 0.082 0.079 0.084Transformed Data 0.100 0.098 0.107 0.098 0.443 0.096 0.113 0.114Segment length data, d = 100, 500 samples per classOriginal Data 0.391 0.391 0.045 0.447 0.194 0.025 0.025 0.025Transformed Data 0.032 0.030 0.032 0.054 0.165 0.029 0.031 0.036Table 7: Comparative performance of classifiers on segment lengths17 .9 Segment Pairs This section describes a more challenging version of the former in which each data point consists intwo sequences of ones in the unit circle (discretized over 50 points), these two sequences being bothof length five in class 1, and of lengths 4 and 6 (in any order) in class 2. No linear rule can separatethe classes and do better than chance, since such a rule would need to be based on summing thevariables (by shift invariance), which returns 10 in both classes. The problem is also challenging formetric-based methods because each data point has close neighbors in the other class: the nearestnon-identical neighbor in the same class is obtained by shifting one of the two segments, and wouldbe at distance √
2, which is identical to the distance of the closest element from the other class whichis obtained by replacing a point in one segment by 0 and adding a 1 next to the other segment. Oneway to separate the classes would be to compute moving averages (convolutions) over windows oflengths 6 along the circle, and to threshold the result at 5.5, which can be represented as a one-hidden layer perceptron. As shown in Table 8, one-hidden-layer perceptrons do perform best on thisdata, with some margin compared to all others. Diffeomorphic classification does however improvesignificantly on the classification rates of all other classifiers.Log. reg. lin. SVM SVM RF kNN MLP1 MLP2 MLP5Segment pair data, d = 50, 100 samples per classOriginal Data 0.477 0.466 0.476 0.475 0.458 0.470 0.470 0.488Transformed Data 0.461 0.457 0.465 0.461 0.469 0.461 0.462 0.461Segment pair data, d = 50, 250 samples per classOriginal Data 0.490 0.485 0.447 0.412 0.492 0.343 0.392 0.405Transformed Data 0.394 0.392 0.412 0.390 0.444 0.398 0.391 0.386Segment pair data, d = 50, 500 samples per classOriginal Data 0.501 0.502 0.369 0.320 0.504 0.054 0.164 0.284Transformed Data 0.240 0.238 0.261 0.233 0.378 0.250 0.244 0.253Segment pair data, d = 50, 1000 samples per classOriginal Data 0.465 0.463 0.233 0.191 0.547 0.008 0.038 0.137Transformed Data 0.071 0.068 0.099 0.073 0.281 0.085 0.082 0.085Table 8: Comparative performance of classifiers on segment pair data To conclude this section we provide (in Table 9) classification results on a subset of the MNISTdigit recognition dataset (LeCun, 1998), with 10 classes and 100 examples per class for training. Toreduce the computation time, we reduced the dimension of the data by transforming the original28 ×
28 images to 14 ×
14, resulting in 196-dimensional dataset. With this sample size, this reductionhad little influence on the performance of classifiers run before diffeomorphic transformation. Allclassifiers have similar error rates, with a small improvement obtained after diffeomorphic trans-formation compared to linear classifiers, reaching a final performance comparable to that obtainedby random forests and multi-layer perceptrons with two hidden layers. We did observe a glitchin the performance of nonlinear support vector machines, which may result from a poor choice ofhyper-parameters (section 6.1) for this dataset. 18og. reg. lin. SVM SVM RF kNN MLP1 MLP2 MLP5MNIST data, d = 196, 100 samples per classOriginal Data 0.128 0.138 0.228 0.111 0.120 0.122 0.111 0.127Transformed Data 0.113 0.120 0.325 0.129 0.094 0.110 0.110 0.124Table 9: Comparative performance of classifiers on a subset of the MNIST dataset. The results presented in this section are certainly encouraging, and demonstrate that the proposedalgorithm has, at least, some potential. We however point out that the classifiers that were usedin our experiments were run “off-the-shelves,” using the setup described in section 6.1. There isno doubt that, after some tuning up specific to each dataset, each of these classifiers could performbetter. Such an analysis was not our goal here, however, and the classification rates that we obtainedmust be analyzed with this in mind. We just note that we also used the exact same version ofdiffeomorphic learning for all datasets, with hyper-parameters values described in section 7.We have included kernel-based support vector machines in our comparisons. These classifiersindeed share with the one proposed here the use of a kernel trick to reduce an infinite dimensionalproblem to one with a number of parameters comparable to the size of the training set. Some notabledifferences exist however. The most fundamental one lies in the role that is taken by the kernel inthe modeling and computation. For SVMs and other kernel methods, the kernel provides, or isclosely related to, the transformation of the data into a feature space. It indeed defines the innerproduct in that space, and can itself be considered as an infinite dimensional representation of thedata, through the standard mapping x (cid:55)→ κ ( x, · ). Kernel methods then implement linear methodsin this feature space, such as computing a separating hyperplane for SVMs. For the diffeomorphicclassification method that is described here, the kernel is used as a computational tool, similar tothe way it appears in spline interpolation. The fundamental concept here is the Hilbert space V and its norm, which is, in the case of a C k Mat´ern kernel, a Hilbert Sobolev space of order k + d/ V is therefore a mathematical tool that makesexplicit the derivation of the discrete representation described in section 3.1 rather than an essentialpart of the model, as it is for kernel methods such as SVMs.Another important distinction, which makes the diffeomorphic method deviate from both kernelmethods in machine learning and from spline interpolation is that the kernel (or its associatedspace V ) is itself part of a nonlinear dynamical evolution where it intervenes at every time. Thisdynamical aspect of the approach constitutes a strong deviation from SVMs which applied thekernel transformation once before implementing a linear method. 
It is, in this regard, closer to feed-forward neural network models, in that it allows for very large non-linear transformations of the data to be applied prior to linear classification. Unlike neural networks, however, diffeomorphic learning operates these transformations within the original space containing the data.

As mentioned in section 6.1, we ran the optimization algorithm until numerical convergence, although it is clear from monitoring classification over time that classes become well separated early on, while the estimation of the optimal transformation typically takes more time. If one is not interested in the limit diffeomorphism (which can be of interest for modeling purposes), significant computation time may be saved by stopping the procedure earlier, using, for example, a validation set.

Setting Hyper-Parameters
The diffeomorphic classification method described in this paper requires the determination of several hyper-parameters, which we now describe together with a discussion of how they were set in our experiments.

(a) Kernel scale parameter. While the kernel can be any function of positive type and can therefore be considered as an infinite-dimensional hyper-parameter, we assume that one uses a kernel such as those described in equations (14) and (15) (our experiments use the latter with k = 3) and focus on the choice of the scale parameter ρ. In our implementation, we have based this selection on two constraints, namely that training data should tend to "collaborate" with some of their neighbors from the same class, while data from different classes should "ignore" each other. This resulted in the following computation.

(i) To address the first constraint, we evaluate, for each data point, the fifth percentile of its distances to other points from the same class. We then define ρ₁ to be the 75th percentile of these values.

(ii) For the second constraint, we define ρ₂ as the 10th percentile of the minimum distance between each data point and its closest neighbor in another class.

We finally set ρ = ρ₀ = min(ρ₁, ρ₂); a short sketch of this computation is given below.
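The following sketch (with hypothetical function names) computes ρ₁, ρ₂ and the default scale ρ₀ = min(ρ₁, ρ₂) from a labeled training set, using Euclidean distances.

    import numpy as np

    def default_kernel_scale(x, y):
        # rho1: for each point, the 5th percentile of distances to same-class points,
        #       then the 75th percentile of these values over all points.
        # rho2: the 10th percentile of the distance to the closest other-class point.
        dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        n = len(y)
        same = (y[:, None] == y[None, :]) & ~np.eye(n, dtype=bool)
        within = [np.percentile(dist[i, same[i]], 5) for i in range(n)]
        rho1 = np.percentile(within, 75)
        other = np.where(y[:, None] != y[None, :], dist, np.inf).min(axis=1)
        rho2 = np.percentile(other, 10)
        return min(rho1, rho2)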
Table 10 provides a comparison of the performances of the algorithm for values of ρ that deviate from this default value, showing that, in most cases, the performance obtained with ρ₀ is not too far from the best one. The only exception was found with the Xor dataset, for which a significant improvement was obtained using a large value of ρ. Note that in Tables 10 and 11, all experiments in the same row were run with the same training and test sets.

Dataset          Samples/class   ρ increasing from left to right (default ρ₀ in the middle)
Sph. Layers      250             0.238 0.199 0.188 0.175 0.165 0.183 0.212
RBF              100             0.170 0.107 0.085 0.083 0.070 0.063 0.063
Tori (d = 10)    100             0.285 0.233 0.207 0.165 0.163 0.175 0.194
M. of G.         100             0.322 0.221 0.156 0.135 0.109 0.103 0.124
Xor              100             0.473 0.108 0.090 0.092 0.093 0.062 0.01
Seg. Length      250             0.164 0.143 0.114 0.097 0.093 0.093 0.86

Table 10: Performance comparison when the kernel scale deviates from the default value ρ₀, assessed on the spherical layers, RBF, Tori, mixture of Gaussians, Xor and segment length datasets.

(b) Regularization weight. A second important parameter is the regularization weight, σ in Equation 1. In our experiments, this parameter is adjusted on line during the minimization algorithm, starting with a large value and slowly decreasing it until the training error reaches a target δ. While this just replaces one parameter by another, the target value is easier to interpret, and we used δ = 0.005 in all our experiments (resulting de facto in a vanishing training error at the end of the procedure in all cases).

(c) ℓ² penalty on logistic regression. This parameter, denoted λ in Equation 1, was taken equal to 1 in all our experiments.

(d) Time discretization. The value of T in equation (11) may also impact the performance of the algorithm, at least for small values, since one expects the model to converge to its continuous limit for large T. This expectation is confirmed in Table 11, which shows that the classification error rates obtained with the value chosen in our experiments, T = 10, were not far from the asymptotic ones. In some cases, the error rates obtained for smaller values of T may be slightly better, but the difference is marginal.

Dataset          Samples/class   T = 1   T = 2   T = 4   T = 10  T = 20  T = 40  T = 100
Sph. Layers      250             0.199   0.199   0.190   0.197   0.197   0.197   0.197
RBF              250             0.089   0.089   0.085   0.083   0.083   0.083   0.083
Tori (d = 10)    100             0.243   0.160   0.170   0.177   0.186   0.195   0.197
M. of G.         100             0.178   0.169   0.141   0.119   0.114   0.111   0.109
Xor              100             0.478   0.179   0.079   0.043   0.042   0.040   0.039
Seg. Length      250             0.140   0.117   0.100   0.094   0.092   0.091   0.091

Table 11: Performance comparison for various values of the number of discretization steps, T, assessed on the spherical layers, RBF, Tori, mixture of Gaussians, Xor and segment length datasets.

Conclusion

In this paper, we have introduced the concept of diffeomorphic learning and provided a few illustrations of its performance on simple, but often challenging, classification problems. On this class of problems, the proposed approach appeared quite competitive among the other classifiers used as comparison. Some limitations also appeared, regarding, in particular, the scalability of the method, for which we have provided some options that will be explored in the future.

We have only considered, in this paper, applications of diffeomorphic learning to classification problems. Extensions to other contexts will be considered in the future. Some, such as regression, may be relatively straightforward, while others, such as clustering or dimension reduction, should require additional thought, as the obtained results will be highly dependent on the amount of metric distortion allowed in the diffeomorphic transformation.
References
N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337–404, 1950.

M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes. Computing large deformation metric mappings via geodesic flows of diffeomorphisms. Int. J. Comp. Vis., 61(2):139–157, 2005.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

M. D. Buhmann. Radial Basis Functions: Theory and Implementations, volume 12 of Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, Cambridge, 2003.

J. Duchon. Interpolation des fonctions de deux variables suivant le principe de la flexion des plaques minces. R.A.I.R.O. Analyse Numérique, 10:5–12, 1977.

Stanley Durrleman, Stéphanie Allassonnière, and Sarang Joshi. Sparse adaptive parameterization of variability in image ensembles. International Journal of Computer Vision, 101(1):161–183, 2013.

Stanley Durrleman, Marcel Prastawa, Nicolas Charon, Julie R. Korenberg, Sarang Joshi, Guido Gerig, and Alain Trouvé. Morphometry of anatomical shape complexes with dense deformations and sparse parameters. NeuroImage, 101:35–49, 2014.

N. Dyn. Interpolation and approximation by radial and related functions. In C. K. Chui, L. L. Schumaker, and J. D. Ward, editors, Approximation Theory VI: vol. 1, pages 211–234. Academic Press, 1989.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, volume 1. MIT Press, Cambridge, 2016.

Barbara Gris, Stanley Durrleman, and Alain Trouvé. A sub-Riemannian modular framework for diffeomorphism-based analysis of shape ensembles. SIAM Journal on Imaging Sciences, 11(1):802–833, 2018.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2003.

Leslie M. Hocking. Optimal Control: An Introduction to the Theory with Applications. Oxford University Press, 1991.

S. Joshi and M. Miller. Landmark matching via large deformation diffeomorphisms. IEEE Transactions on Image Processing, 9(8):1357–1370, 2000.

Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Jack Macki and Aaron Strauss. Introduction to Optimal Control Theory. Springer Science & Business Media, 2012.

J. Meinguet. Multivariate interpolation at arbitrary points made simple. J. Applied Math. and Physics, 30:292–304, 1979.

Mario Micheli and Joan A. Glaunès. Matrix-valued kernels for shape deformation analysis. Geom. Imaging Comput., 1(1):57–139, 2014.

Michael I. Miller, Alain Trouvé, and Laurent Younes. Hamiltonian systems and optimal control in computational anatomy: 100 years since D'Arcy Thompson. Annual Review of Biomedical Engineering, 17:447–509, 2015.

J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.

Alain Trouvé and Yong Yu. Metric similarities learning through examples: an application to shape retrieval. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 50–62. Springer, 2001.

Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.

T. L. Vincent and W. J. Grantham. Nonlinear and Optimal Control Systems. Wiley, 1997.

G. Wahba. Spline Models for Observational Data. SIAM, 1990.

Christian Walder and Bernhard Schölkopf. Diffeomorphic dimensionality reduction. In Advances in Neural Information Processing Systems, pages 1713–1720, 2009.

Laurent Younes. Shapes and Diffeomorphisms, volume 171. Springer Science & Business Media, 2010.

Laurent Younes. Constrained diffeomorphic shape evolution. Foundations of Computational Mathematics, 12(3):295–325, 2012.

Laurent Younes. Gaussian diffeons for surface and image matching within a Lagrangian framework.