Nonparametric Spherical Regression Using Diffeomorphic Mappings
M. Rosenthal^a, W. Wu^b, E. Klassen^c, Anuj Srivastava^b

^a Naval Surface Warfare Center, Panama City Division - X23, 110 Vernon Avenue, Panama City, FL 32407-7001
^b Department of Statistics, Florida State University, Tallahassee, FL 32306
^c Department of Mathematics, Florida State University, Tallahassee, FL 32306
Abstract
Spherical regression explores relationships between variables on spherical domains. We develop a nonparametric model that uses a diffeomorphic map from a sphere to itself. The restriction of this mapping to diffeomorphisms is natural in several settings. The model is estimated in a penalized maximum-likelihood framework using gradient-based optimization. Towards that goal, we specify a first-order roughness penalty using the Jacobian of diffeomorphisms. We compare the prediction performance of the proposed model with state-of-the-art methods using simulated and real data involving cloud deformations, wind directions, and vector-cardiograms. This model is found to outperform others in capturing relationships between spherical variables.
Keywords:
Nonlinear; Nonparametric; Riemannian Geometry; Spherical Regression.
1. Introduction
Spherical data arise naturally in a variety of settings. For instance, a random vector with a unit-norm constraint is naturally studied as a point on a unit sphere. The statistical analysis of such random variables was pioneered by Mardia and colleagues (1972; 2000) in the context of directional data. Common application areas where such data originate include geology, gaming, meteorology, computer vision, and bioinformatics. Examples from geographical domains include plate tectonics (McKenzie, 1957; Chang, 1986), animal migrations, and the tracking of weather formations. As mobile devices become increasingly advanced and prevalent, an abundance of new spherical data is being collected in the form of geographical coordinates. Another source of spherical data is the study of directions, e.g., the vector-cardiograms studied in Downs (2003). Directional data also characterize the orientations of objects or limbs, which are particularly relevant to biometric applications, including the study of human kinematics (Rancourt et al., 2000) and gait data in the context of gait identification (Boyd and Little, 2005).

Spherical regression is an analysis of paired data on a unit hyper-spherical domain S^{d-1} = { z ∈ R^d : ||z|| = 1 }, where ||·|| denotes the Euclidean norm. Given n paired observations (x_i, y_i) ∈ S^{d-1} × S^{d-1} for i = 1, ..., n, one wishes to describe the relationship between the predictor x and the response y in order to make predictions and inferences. To define a spherical regression model, one must decide on the functional form of µ(x), the mean of y for a given x. This function µ : S^{d-1} → S^{d-1} characterizes the expected relationship between the x and y variables, and can take a parametric, semi-parametric, or nonparametric form.

Email addresses: [email protected] (M. Rosenthal), [email protected] (W. Wu), [email protected] (E. Klassen), [email protected] (Anuj Srivastava)
The estimation of µ additionally depends on the chosen spherical error distribution for y given x.

In past work, the function µ has predominantly taken a parametric form. Several parametric models have been proposed for the unit circle S^1, including those in Rivest (1997) and Downs and Mardia (2002), but the choices become limited in higher dimensions. There, the rigid rotation model has been the most common choice (Chang, 1986, 1989; Rivest, 1989; Kim, 1991; Prentice and Mardia, 1995). Downs (2003) used complex Möbius transformations on S^2. More recently, Rosenthal et al. (2014) expanded the parametric class by including projective linear transformations. Additionally, there has also been some progress in nonparametric formulations. For instance, Marzio et al. (2014) modified standard kernel-smoothing methods to derive a general nonparametric spherical regression model.

Parametric models, especially those with a small number of parameters, are often too restrictive to adequately capture the broad correspondences observed in real data. On the other hand, the high dimensionality of certain nonparametric models can lead to overfitting and, thus, hinder model performance. Our proposed framework includes flexible classes of deformations using diffeomorphic nonparametric representations, and it avoids overfitting by using penalty functions. Smooth bijective relationships occur naturally in many systems. An example is fluid motion, where mass can be compressed or expanded but is generally not allowed to occupy the same space simultaneously. For the same reason, the relationships between atmospheric variables over short intervals can be modeled using diffeomorphisms. Since diffeomorphisms are invertible mappings, they are also useful in situations where the roles of x and y can be reversed. In this paper, we take a nonparametric approach using a diffeomorphism from S^2 to itself.
The techniques we develop are restricted to S^2 due to the optimization procedure we have used, although the underlying penalized-likelihood framework itself is valid for arbitrary dimensions.
2. Previous Methods and Their Limitations
A majority of past work in spherical regression has imposed parametric forms on µ. Chang (1986, 1989) created this field in the 1980s using µ(x) = Ax, where A is a rotation in SO(d) = { A ∈ R^{d×d} : det(A) = 1, A^T A = A A^T = I_d }. This approach was followed by Rivest (1989) and Kim (1991). The solution that minimizes the sum of squared errors is given by McKenzie (1957) and Stephens (1979) as the Procrustes rotation. The rigid rotation model can characterize changes in location and orientation, but is unable to handle any difference in spread between the predictor and response variables. It is akin to using the fixed additive model y = (x + a) + e in the classical Euclidean case.

Downs (2003) proposed an extension that uses Möbius transformations, a larger parametric family than rotations. This model can handle some additional differences in spread between the response and predictor variables, but the framework has not been extended to dimensions higher than two. A recent model that is applicable to arbitrary S^{d-1} uses the group of projective linear transformations. A projective linear transformation is a map x → Ax/|Ax| and is parametrized by the transformation matrix A ∈ SL(d), the set of all matrices with determinant one. In Rosenthal et al. (2014), the authors derive an intrinsic Newton-Raphson algorithm for maximum-likelihood estimation and present an asymptotic analysis of the estimator under the von Mises-Fisher error distribution.

These parametric models are likely to be useful for a variety of inference and prediction applications. However, more general situations require more flexible models in order to adequately characterize relationships between spherical variables. While one possibility is to explore even larger classes of parametric families, a more natural approach is nonparametric regression. Nonparametric methods allow flexible expressions of µ at the cost of model parsimony. In recent work, Marzio et al. (2014) developed a nonparametric regression model that locally fits weighted polynomials at each point on the sphere. They demonstrate improvements in predictive performance over simpler kernel smoothing and rigid rotations. This model is applicable across arbitrary hyper-spheres, dimensions, and geometries. For example, it can be used to define relationships between spherical and linear domains. As with most kernel-smoothing approaches, however, it does not ensure invertibility of the regression maps. This limitation becomes evident in applications such as meteorological studies, where short-term changes are often diffeomorphic and where data is often scarce over large areas of the sphere. This is a common occurrence with data generated from satellites. In these cases, kernel-smoothing methods can overfit data in ways that the bandwidth or smoothing parameter cannot adequately address. A restriction to the group of diffeomorphisms is a natural alternative in such settings.

Our proposed approach is similar to Glaunès et al. (2004), which utilizes diffeomorphic maps on the sphere in the context of landmark matching. They define an objective function for landmark-guided point registration, but this function is not motivated from a regression-model perspective. Thus, their approach is not directly applicable to prediction and model estimation. The spherical regression setting is additionally concerned with prediction accuracy, model parsimony, interpretability, and statistical inference. Our approach will use a different objective function, roughness penalty, and optimization strategy that are more suitable for spherical regression.
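As a concrete illustration of the rigid rotation model discussed above, the least-squares (Procrustes) fit can be sketched as follows. This is a minimal sketch assuming NumPy; the function name `fit_rotation` is our own, not from any of the cited works.

```python
import numpy as np

def fit_rotation(X, Y):
    """Estimate A in SO(d) maximizing sum_i y_i^T A x_i (Procrustes rotation).

    X, Y are (n, d) arrays of paired unit vectors. The fit uses the
    sign-corrected SVD of D = sum_i y_i x_i^T so that det(A) = +1.
    """
    D = Y.T @ X
    U, _, Vt = np.linalg.svd(D)
    d = D.shape[0]
    # Flip the sign of the last singular direction if needed to force det = +1.
    S = np.diag([1.0] * (d - 1) + [float(np.linalg.det(U @ Vt))])
    return U @ S @ Vt
```

When the pairs are exactly related by a rotation, y_i = A x_i, this estimate recovers A; with noisy pairs it returns the rotation closest to the data in the least-squares sense.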
3. Proposed Regression Framework
The proposed method characterizes relationships between spherical data on S^{d-1} using diffeomorphisms. The set of such diffeomorphisms Γ consists of mappings γ : S^{d-1} → S^{d-1} that are smooth and invertible with smooth inverses, and it forms a group under composition. The identity element of Γ is the mapping γ_id(x) = x. The group Γ forms a very flexible class of deformations that includes the subgroup of rigid rotations and the projective linear group. Additionally, the complex Möbius maps, considered as maps on S^2, are diffeomorphisms of the Riemann sphere. For more details on diffeomorphisms, see Kac (1990).

We will need elements of the Riemannian geometry of S^{d-1} and Γ to derive estimation algorithms. The tangent space at p ∈ S^{d-1} is denoted by T_p(S^{d-1}) = { x ∈ R^d : <x, p> = 0 } and is a vector space of dimension d-1. Similarly, the tangent space of Γ at γ_id is denoted by T_{γ_id}(Γ) and is infinite dimensional. This tangent space T_{γ_id}(Γ) is the set of smooth tangent vector fields on S^{d-1}. That is, for any V ∈ T_{γ_id}(Γ) and p ∈ S^{d-1}, V(p) ∈ T_p(S^{d-1}) is a tangent vector, and it varies smoothly with p.

We assume that the error distribution of y given x follows a von Mises-Fisher distribution with mean direction γ(x) (γ ∈ Γ) and concentration parameter κ > 0. That is,

  p_x(y) = C_d(κ) exp{ κ y^T γ(x) },   C_d(κ) = κ^{d/2-1} / ( (2π)^{d/2} I_{d/2-1}(κ) ),   (1)

with y, x, γ(x) ∈ S^{d-1}. Here, C_d is the normalizing constant, and I_ν is the modified Bessel function of the first kind and order ν. The von Mises-Fisher density is isotropic about its mean direction γ(x). The parameter κ measures the degree of concentration: κ = 0 implies a uniform density on S^{d-1}, while κ = ∞ implies a Dirac delta at γ(x). For further details see Mardia and Jupp (2000). The normalizing constant C_d(κ) does not depend on the mean direction, so maximum-likelihood estimation of γ separates from that of κ.

Since the group of diffeomorphisms is an infinite-dimensional function space, the maximum-likelihood estimate may over-fit the data. To overcome this problem, we seek a penalized maximum log-likelihood solution that takes the form

  γ̂ = argmax_{γ ∈ Γ} ( n^{-1} Σ_{i=1}^n y_i^T γ(x_i) − λ R(γ) ).

The term f_n(γ) = n^{-1} Σ_{i=1}^n y_i^T γ(x_i) is the log-likelihood from equation (1), and the function R : Γ → R_+ measures the roughness of γ. We introduce the scale n^{-1} in the log-likelihood term in order to compare it across sample sizes. The scalar λ > 0 controls the relative influence of the roughness penalty R(γ).

It is common in nonparametric statistics involving Euclidean variables to use first or second derivatives of functions to define their roughness. The first-order penalty is zero for translations and penalizes large slopes. The second-order penalty is zero for linear maps and penalizes curvatures. The interpretation of transformations in spherical domains is different. Translations in the Euclidean domain become rigid rotations and reflections in spherical domains, but the analogs of higher-order transformations are not clear. We require that if the diffeomorphism is either a rotation or a reflection, i.e.
γ(p) = Op, O ∈ O(d) = { O ∈ R^{d×d} : O^T O = O O^T = I_d }, with I_d denoting the d × d identity matrix, then its roughness measure should be zero. For this purpose, it is sufficient to impose a penalty that uses the first derivative, or the Jacobian, of γ. Not only do we lack a proper interpretation for the second derivative of γ, but it also becomes computationally complex. Consequently, we will derive a first-order roughness penalty on γ.

Let J_p(γ) denote the Jacobian of γ ∈ Γ evaluated at a point p ∈ S^{d-1}. By definition, J_p(γ) is a linear mapping from the tangent space T_p(S^{d-1}) to T_{γ(p)}(S^{d-1}), each of these tangent spaces being (d-1)-dimensional. Hence, J_p(γ) can be expressed as a (d-1) × (d-1) matrix with respect to chosen bases for these tangent spaces, as elaborated next. Any linear map between the two tangent spaces can first be expressed in R^d coordinates as u → Bu for B ∈ R^{d×d} and each u ∈ T_p(S^{d-1}) viewed as an element of R^d. Let E, F ∈ R^{d×(d-1)} denote orthonormal bases for T_p(S^{d-1}) and T_{γ(p)}(S^{d-1}), respectively. Then, under the chosen bases, the Jacobian matrix becomes J_p = F^T B E ∈ R^{(d-1)×(d-1)}. In case γ(p) = Op for an O ∈ O(d), then OE ∈ R^{d×(d-1)} also forms an orthonormal basis of T_{γ(p)}(S^{d-1}). Since F and OE are two orthonormal bases of the same space, there exists an orthogonal matrix A ∈ O(d-1) such that F = OEA. Thus, the expression for the Jacobian in this case reduces to J_p = F^T O E = A^T E^T O^T O E = A^T ∈ O(d-1).

A natural measure of the roughness of γ at p is ||J_p(γ)^T J_p(γ) − I_{d-1}||, where ||·|| denotes the Frobenius matrix norm. The roughness at p is zero if and only if J_p(γ) is an orthogonal matrix, i.e., a rotation or a reflection. The resulting first-order roughness measure for the full map γ is then Q(γ) = ∫_{S^{d-1}} ||J_p(γ)^T J_p(γ) − I_{d-1}|| dp. Another interesting property of this definition is that the natural action of the subgroup O(d) on Γ leaves the roughness measure unchanged. That is, for any γ ∈ Γ and O ∈ O(d), if we define a new map γ̃(p) = Oγ(p), then Q(γ̃) = Q(γ).

Another measure of the distance from isometry can be computed as ||logm(J_p^T J_p)||, where logm denotes the matrix logarithm and ||·|| again denotes the Frobenius norm. Unlike the previous distance-from-isometry measure, this one diverges to infinity as the Jacobian becomes singular. The resulting roughness measure is defined as R(γ) = ∫_{S^{d-1}} ||logm{ J_p(γ)^T J_p(γ) }|| dp. This has the added property that the roughness measure will be infinite if the Jacobians are singular over a set of positive measure. As a penalty term, this ensures that the deformation's Jacobian has a nonzero determinant everywhere, except perhaps on a set of measure zero.

Because of the group structure of Γ, any finite sequence of composed diffeomorphisms results in a diffeomorphism. In the limit, however, an infinite sequence of compositions may converge to a non-diffeomorphic map. For example, a set of points with positive measure may converge to a set of measure zero in the limit. On the other hand, the proposed roughness measure of such a deformation is infinite, so the roughness measure should push the deformation away from such solutions.
In this paper, we will use R as our chosen roughness measure. Details for computingthese roughness measures Q and R are presented in the Appendix.
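To make the two pointwise roughness terms concrete, the following sketch (a helper of our own, assuming NumPy) evaluates both the Frobenius distance from isometry used in Q and the matrix-log term used in R for a single Jacobian matrix. The matrix logarithm of the symmetric positive-definite matrix J^T J is computed by eigendecomposition.

```python
import numpy as np

def pointwise_roughness(J):
    """Return (q, r): the integrands of Q and R at one point.

    J is the (d-1) x (d-1) Jacobian of gamma at p, expressed in
    orthonormal tangent bases. q = ||J^T J - I||_F and
    r = ||logm(J^T J)||_F, with logm computed via eigendecomposition
    of the symmetric positive-definite matrix J^T J.
    """
    G = J.T @ J                       # pullback metric at p
    k = G.shape[0]
    q = np.linalg.norm(G - np.eye(k))
    w, V = np.linalg.eigh(G)          # eigenvalues are positive for nonsingular J
    r = np.linalg.norm((V * np.log(w)) @ V.T)
    return q, r
```

Both terms vanish when J is orthogonal, and r diverges as J approaches singularity, matching the properties described above.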
4. Optimization Algorithm
Returning to the problem of model estimation, we treat it as the optimization of the objective function E(γ) = n^{-1} Σ_{i=1}^n y_i^T γ(x_i) − λ R(γ) by gradient ascent. Since we are optimizing over a nonlinear group, the Riemannian geometry of Γ plays a vital role. The gradient of a function at a point on a Riemannian manifold is, by definition, an element of the tangent space at that point. Since tangent spaces are linear, one can represent their elements as coefficients with respect to corresponding orthonormal bases. Thus, the gradient of E at any γ can be expressed as a linear combination of tangent basis elements at that γ. This simplifies the computation of the gradient to solving for the corresponding coefficients, but it still requires an orthonormal basis of all tangent spaces of Γ. To avoid this, we take an iterative approach and solve for an optimal incremental diffeomorphism at every iteration, as follows.

We define an incremental cost function H(γ̃) = E(γ̃ ∘ γ) for the current estimate γ ∈ Γ. The incremental diffeomorphism γ̃ will be small, i.e., close to γ_id, and can be related to an element of T_{γ_id}(Γ). Thus, we need to specify the tangent space at γ_id only and, as mentioned earlier, this is the set of all smooth tangent vector fields on S^{d-1}. The gradient of H at γ_id, ∇_{γ_id}H, is an element of T_{γ_id}(Γ). Given an orthonormal basis { B_1, B_2, ... } of T_{γ_id}(Γ), we can write ∇_{γ_id}H = Σ_{j=1}^∞ d_j B_j, where each d_j ∈ R. Each coefficient d_j is the directional derivative of H in the direction of B_j at γ_id. In Section 4.2 we define such a basis for the tangent space T_{γ_id}(Γ). Then, in Section 4.3, we show how to compute the gradient and present the gradient algorithm.

Additionally, we will need the following tools on the sphere S^{d-1}.
The exponential map at p ∈ S^{d-1} is a mapping exp_p : T_p(S^{d-1}) → S^{d-1} given by exp_p(v) = cos(||v||) p + sin(||v||) v/||v|| ∈ S^{d-1}. The inverse exponential map at a point p ∈ S^{d-1} maps each non-antipodal point z ∈ S^{d-1} to a tangent vector v ∈ T_p(S^{d-1}) according to exp_p^{-1}(z) = (z − cos(θ) p) θ (sin(θ))^{-1}, where θ = cos^{-1}(p^T z). Two points p, z ∈ S^{d-1} are non-antipodal if p^T z ≠ −1. For V ∈ T_{γ_id}(Γ) we define a map Ψ_V expressed as Ψ_V(p) = exp_p(V(p)) for each p ∈ S^{d-1}. The map Ψ_V is a diffeomorphism on S^{d-1} when V lies in a small neighborhood around the zero vector field, i.e., the field V_0 with ||V_0(p)|| = 0 for each p ∈ S^{d-1}.

4.2. Orthogonal Basis for Incremental Diffeomorphisms

In this paper, we focus on d = 3 and develop the optimization algorithm for that case. Kurtek et al. (2011) construct an orthonormal basis of smooth tangent vector fields on a sphere for T_{γ_id}(Γ) by applying the gradient to the real and imaginary parts of the complex spherical harmonic functions Y_l^m of degree l and order m = 0, ..., l. From the first l harmonics one obtains (l+1)^2 distinct functions by taking the real and imaginary parts as separate functions. We denote these real-valued functions on S^2 by ϕ_1, ..., ϕ_{(l+1)^2} and note that they are parametrized using polar coordinates. For each i ∈ {1, ..., (l+1)^2} we evaluate the gradient at each point (θ, φ) ∈ S^2, resulting in the vector field

  ∇_{(θ,φ)} ϕ_i = ( ∂ϕ_i/∂θ, (1/sin(θ)) ∂ϕ_i/∂φ ).

For each resulting non-trivial tangent vector field ∇ϕ_i, i ∈ {2, 3, ...}, let B̃_i(θ, φ) = ∇_{(θ,φ)} ϕ_i / ||∇_{(θ,φ)} ϕ_i||. Let *B̃_i(θ, φ) denote the tangent vector at the point (θ, φ) obtained by rotating B̃_i(θ, φ) counterclockwise by π/2. The union of the tangent vector fields { B̃_i } and the rotated tangent vector fields { *B̃_i } provides an orthonormal basis for T_{γ_id}(Γ). We denote this union by { B_1, B_2, ... }. For any B ∈ T_{γ_id}(Γ) and any point p ∈ S^2, the tangent vector B(p) ∈ T_p(S^2) is represented as an element of R^3. If we restrict to basis elements obtained from spherical harmonics of order ≤ l, we obtain L = 2((l+1)^2 − 1) basis elements. As l increases, so does L, and one is able to capture more complex deformations.
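The exponential and inverse exponential maps used throughout this section can be sketched directly from their definitions. This is a minimal sketch assuming NumPy; the function names are our own.

```python
import numpy as np

def sphere_exp(p, v):
    """Exponential map at p: shoot along tangent vector v in T_p(S^{d-1})."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return p.copy()
    return np.cos(nv) * p + np.sin(nv) * v / nv

def sphere_log(p, z):
    """Inverse exponential map at p for a non-antipodal z (p^T z != -1)."""
    theta = np.arccos(np.clip(p @ z, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros_like(p)
    return (z - np.cos(theta) * p) * theta / np.sin(theta)
```

The two maps are mutually inverse away from the antipode, which gives a quick sanity check: sphere_log(p, sphere_exp(p, v)) returns v whenever ||v|| < π.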
As mentioned earlier, each iteration is based on the optimization of the functional H(γ̃) = E(γ̃ ∘ γ), where γ ∈ Γ is the current deformation, and we optimize over the increment γ̃ in a neighborhood of γ_id. By definition, ∇_{γ_id}H ∈ T_{γ_id}(Γ), which implies that the gradient of H at the identity can be expressed as a linear combination ∇_{γ_id}H = Σ_{j=1}^∞ d_j B_j. Each coefficient d_j is the directional derivative of H in the direction of B_j ∈ T_{γ_id}(Γ), given by

  d_j = lim_{ε↓0} [ H(Ψ_{εB_j}) − H(γ_id) ] / ε = lim_{ε↓0} [ E(Ψ_{εB_j} ∘ γ) − E(γ) ] / ε.

Since the gradient operation is linear, we can separate the likelihood and roughness terms in H. We can write an analytical expression for the directional derivatives of the log-likelihood term. Let f^{(i)}(γ) = y_i^T γ(x_i) and z_i = γ(x_i); then the coefficient a_ij ∈ R is computed as follows:

  a_ij = lim_{ε↓0} [ f^{(i)}(Ψ_{εB_j} ∘ γ) − f^{(i)}(γ) ] / ε
       = lim_{ε↓0} [ y_i^T exp_{z_i}(ε B_j(z_i)) − y_i^T z_i ] / ε
       = y_i^T ( −sin(ε ||B_j(z_i)||) ||B_j(z_i)|| z_i + cos(ε ||B_j(z_i)||) B_j(z_i) ) |_{ε=0}
       = y_i^T B_j(z_i).

The directional derivative of the log-likelihood term, at the current estimate γ, is then given by

  b_j = n^{-1} Σ_{i=1}^n a_ij = n^{-1} Σ_{i=1}^n y_i^T B_j(γ(x_i)).   (2)

We compute the directional derivative of the roughness term numerically: for some small fixed ε > 0, c_j = ε^{-1} ( R(Ψ_{εB_j} ∘ γ) − R(γ) ). Thus, each coefficient d_j is approximated as d_j ≈ b_j − λ c_j, and the gradient is approximated by ∇_{γ_id}H ≈ Σ_{j=1}^L (b_j − λ c_j) B_j. Finally, γ is updated in the direction of ∇_{γ_id}H according to the mapping

  γ → Ψ_{δ ∇_{γ_id}H} ∘ γ   (3)

for a small step size δ > 0.

Algorithm:
1. Initialize γ by the rigid rotation parametrized by U_1 U_2^T, where D = U_1 Σ U_2^T is the modified singular value decomposition of D = Σ_{i=1}^n y_i x_i^T, with U_1, U_2 ∈ SO(3) and Σ the diagonal singular value matrix.
2. For each j ∈ {1, ..., L}, compute the log-likelihood coefficient b_j according to equation (2) and numerically compute the roughness coefficient c_j for some small fixed step size ε > 0. Form the approximate gradient ∇_{γ_id}H.
3. Update γ according to equation (3) for a small step size δ > 0.
4. If E(γ) has converged, stop. Otherwise return to step 2.
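A minimal sketch of one iteration's likelihood part (equation (2)) and update (equation (3)) follows. For illustration only, the spherical-harmonic basis of Section 4.2 is replaced by three infinitesimal-rotation vector fields, the roughness coefficients c_j are omitted (set to zero), and all names are our own.

```python
import numpy as np

def sphere_exp(p, v):
    nv = np.linalg.norm(v)
    return p.copy() if nv < 1e-12 else np.cos(nv) * p + np.sin(nv) * v / nv

def rotation_field(axis):
    # Infinitesimal rotation about `axis`: a smooth tangent vector field on S^2,
    # standing in for the harmonic basis fields B_j of Section 4.2.
    return lambda p: np.cross(axis, p)

def likelihood_coeffs(X, Y, gamma, fields):
    # b_j = n^{-1} sum_i y_i^T B_j(gamma(x_i))   -- equation (2)
    Z = [gamma(x) for x in X]
    return [float(np.mean([y @ B(z) for y, z in zip(Y, Z)])) for B in fields]

def ascent_step(gamma, fields, d, delta):
    # gamma -> Psi_{delta * grad} o gamma, with grad = sum_j d_j B_j   -- equation (3)
    def new_gamma(x):
        z = gamma(x)
        v = delta * sum(dj * B(z) for dj, B in zip(d, fields))
        return sphere_exp(z, v)
    return new_gamma
```

With a single pair x = (1,0,0), y = (0,1,0) and γ initialized to the identity, only the z-axis rotation field has a nonzero coefficient, and a small step moves γ(x) toward y, increasing the likelihood term.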
5. Experimental Results
In this experiment, we explore what happens with the gradient ascent algorithm when there is no likelihood term. We do this to check that the roughness term pushes the deformation toward something which sensibly has zero roughness. When there is no data, n^{-1} Σ_{i=1}^n y_i^T γ(x_i) = 0, because the sum over the empty set is defined to be the additive identity. This implies that the objective function E has no likelihood term in this case. We can perform this experiment using the previously defined gradient and algorithm with one minor change. Instead of initializing with a rigid rotation, we initialize the deformation with an arbitrary diffeomorphism and then iteratively apply the gradient of the roughness term. Our intuition is that, by iteratively applying this gradient, the deformation will converge to something which closely resembles a rigid rotation. One can see in Fig. 1 that the initial deformation is relatively rough and distant from a rigid rotation. The resulting deformation after 10,000 iterations can be seen in the middle panel. The evolution of the roughness measure is plotted in the right panel. One can see further details and the animated results of this experiment in the supplementary material.

Figure 1: The left panel shows the initial diffeomorphism, the middle panel shows the resulting deformation after 10,000 iterations, and the right panel shows the evolution of the roughness measure.

Table 1: Mean squared error of test data from the Section 5 experiments.

              OURS        NLL         PLT         RR          TRUE
Section 5.2   18.(10^-)   56.(10^-)   22.(10^-)   21.(10^-)   1.(10^-)
Section 5.3   4.(10^-)    5.(10^-)    19.(10^-)   19.(10^-)   N/A
Section 5.4   7.(10^-)    11.(10^-)   8.(10^-)    9.(10^-)    N/A

OURS refers to the model presented in this paper. NLL refers to the nonparametric local linear regression model. PLT refers to the projective linear transformation model. RR refers to the rigid rotation model. TRUE refers to the true diffeomorphism and is only applicable to the simulated experimental results from Section 5.2. N/A stands for Not Applicable.

We demonstrate the proposed nonparametric diffeomorphic regression model on both simulated and real data. For the simulated case we compare the estimated model to the true deformation, and on the real data we compare predictive performance to some alternative models, including the Rigid Rotation model from Chang (1986), the Projective Linear Transformation model from Rosenthal et al. (2014), and the Nonparametric Local Linear model from Marzio et al. (2014). We evaluate the predictive performance of a model by splitting a data set into training and test data. The model parameters are estimated using the training data, and the performance is then evaluated by computing the mean squared error on the test data. This error is computed as
  MSE = n^{-1} Σ_{i=1}^n ||y_i − µ̂_i||^2,

where y_i and µ̂_i, respectively, denote the true and predicted values, and ||·|| denotes the Euclidean norm. Since the data and the predicted values are restricted to a sphere, the mean squared error lies in [0, 4]. We also compare the fitted spherical mappings of each model and compare the observed and predicted values.

We start by illustrating the gradient ascent algorithm using simulated paired data points on S^2 that are related by a diffeomorphism γ ∈ Γ. The true γ is obtained via a combination of Möbius and twisting transformations on S^2. Additionally, we compose a sequence of small incremental diffeomorphisms using spherical harmonics to obtain the final true diffeomorphism. See the supplementary material for details on these maps and their parametrization for this example. Once γ is generated, it is fixed throughout the experiment. The training and test data consist of 200 and 100 independently sampled data points, respectively. The predictor variables x_1, ..., x_n are simulated independently from a von Mises-Fisher distribution with mean (0, 0, 1)^T and concentration κ = 5 for the training data, and uniformly for the test data. This leaves a gap in the training data to simulate model performance with limited information. The individual responses y_i are then independently simulated from a von Mises-Fisher distribution with mean γ(x_i) and concentration κ = 100. The training and test data can be seen in the supplementary material.

The roughness parameter λ is estimated using cross-validation, i.e., by splitting the training data into 75 training observations and 25 validation observations. In the left panel of Fig. 2, one can see the mean squared error of the validation data plotted over various levels of λ and for l ∈ {3, 10}. We select the values of λ and l which minimize the validation error.
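The prediction-error criterion above can be sketched directly (NumPy assumed; spherical_mse is our own name):

```python
import numpy as np

def spherical_mse(Y, Yhat):
    """MSE = n^{-1} sum_i ||y_i - mu_hat_i||^2, rows being unit vectors.

    Since both arguments lie on the sphere, each summand equals
    2(1 - y_i^T mu_hat_i), so the result lies in [0, 4].
    """
    Y, Yhat = np.asarray(Y), np.asarray(Yhat)
    return float(np.mean(np.sum((Y - Yhat) ** 2, axis=1)))
```

The extremes are reached for perfect prediction (error 0) and for predictions antipodal to every observation (error 4).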
In this case, we construct a basis from the spherical harmonic functions of order less than or equal to l = 3, select λ = 2(10)^-, and use the model fitted with these values to make future predictions. The right panel of Fig. 2 shows the evolution of E(γ) as γ is iteratively updated according to the gradient ascent algorithm applied to the full set of 200 training data points. In Fig. 3, the true diffeomorphism γ is compared to the estimated diffeomorphisms at several roughness levels. Notice that for l = 3 the differences in the estimated diffeomorphisms are very subtle over λ. The roughness penalty's influence is more visible for l = 10.

A comparison of model performances on the test data is summarized in the first row of Table 1. None of the estimated models outperforms the true diffeomorphism, because data is scarce over a part of the sphere. The nonparametric local linear regression model performs poorly here because it does not handle the extrapolation well. All the other models outperform the local linear model because they assume the underlying deformation is a diffeomorphism, which is true in this simulated case. One can see a comparison of the estimated deformations for the other models, with observed and predicted values plotted, in the supplementary material.

Figure 2: On the left, one can see the mean squared error on the validation data for various values of λ. The dotted blue and solid red lines denote l = 3 and l = 10, respectively. The right plot shows the evolution of the penalized log-likelihood function as the algorithm iterates.

The Integrated Global Radiosonde Archive contains radiosonde and pilot balloon observations from various stations distributed across the globe. An overview of data coverage is presented in I. Durre (2006). The data set was constructed from the monthly averages of November 2013. In this data set, there are typically multiple means per station, which correspond to various pressure levels.
We take an average of the tangential wind velocities at all the available pressure levels at a point to form a direction at that point. This results in 694 spatial observations, which we split into 200 training observations and 494 test observations. For each observation i = 1, ..., n, the station location is treated as the predictor x_i, and the corresponding tangential wind velocity is denoted by v_i ∈ T_{x_i}(S^2). The tangential wind velocities are measured in scaled units of 600 meters per second for this data. Each response variable y_i = exp_{x_i}(v_i) is obtained by applying the exponential map to the respective tangent vector. The training and test data are shown in Fig. 4. Using cross-validation on the training data, we selected λ = (10)^- for the roughness penalty of our model and l = 10 for the maximum order of the basis elements. A plot of the MSE for the validation data used for tuning the model is in the supplementary material.

The test error in Table 1 shows that our method outperforms the nonparametric local linear model. In Fig. 5, the observed and predicted tangential wind directions are plotted for each model.

Figure 3: This illustration compares the observed (in red) and predicted (in green) test values of the true diffeomorphism and of deformations fitted at various values of λ for l = 3 and l = 10.

Figure 4: The figures show the training data (n = 200) and test data (n = 494) for the wind direction models. The yellow dots denote the locations of the weather monitoring stations. The red dots denote the average wind displacement. Corresponding data points are connected with light gray lines.
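The construction of responses from tangential wind vectors, y_i = exp_{x_i}(v_i), can be sketched as follows. The helper names are our own and NumPy is assumed; the radial projection is a defensive step, not part of the paper's description.

```python
import numpy as np

def sphere_exp(p, v):
    nv = np.linalg.norm(v)
    return p.copy() if nv < 1e-12 else np.cos(nv) * p + np.sin(nv) * v / nv

def wind_response(x, v):
    """Map a station location x on S^2 and tangential wind vector v to y = exp_x(v)."""
    v = v - (v @ x) * x   # project out any radial component so v lies in T_x(S^2)
    return sphere_exp(x, v)
```

The response is the point reached by following the wind vector along a great circle from the station, so the station location and wind direction are encoded together as a single spherical variable.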
The nonparametric local linear model is in close agreement with our diffeomorphic model in regions where abundant data is available. In places where training data is scarce, as seen in the South Pacific region, the nonparametric local linear model is non-injective and allows the mesh grid to overlap and deform heavily.
Figure 5: Deformations of fitted models from the weather training data, with observed and predicted test data plotted as red squares and green circles, respectively. Corresponding data points are connected with light gray lines. OURS refers to the model presented in this paper. NLL refers to the nonparametric local linear regression model. PLT refers to the projective linear transformation model. RR refers to the rigid rotation model.

A vector-cardiogram measures the direction and magnitude of the electrical forces generated by heart actions. The directional aspect of these vectors has important applications in the diagnosis of certain diseases. The data set, which was used in Downs (2003), consists of vector-cardiogram data from 98 children ages 2-18, where each child is measured using two lead systems, namely the Frank system and the McFee system. The objective of this experiment is to define a correspondence between these two systems. To do this, models are fitted using the directional vector from the Frank system as the predictor and the directional vector from the McFee system as the response. This could be useful for combining data sets that use the two different systems. In this case, we choose to convert Frank-system data to optimally correspond with McFee-system data.

The test and training data are plotted on the sphere for each corresponding lead system from each child in Fig. 6. The selected roughness parameter is λ = 10^-, and basis elements up to order l = 3 are used for fitting the final model. A plot of the MSE for the validation data used for tuning the model is in the supplementary material. The third row of Table 1 shows that our method has the smallest predictive error, while the previous nonparametric method proposed by Marzio et al. (2014) has the largest prediction error. In Fig. 7, the observed and predicted McFee directions are plotted for each model.
6. Discussion
The asymptotic analysis is an important component of this method. This includes determining the conditions necessary for consistency to occur. Unfortunately, the classical arguments for these types of proofs usually involve many strong and subtle assumptions which are difficult to verify when analyzing over infinite-dimensional nonlinear domains such as the group of diffeomorphisms on S². To help bring this problem forward, we outline the general argument typically used and point out the parts about which we are uncertain.

Figure 6: The figure shows the direction of greatest magnitude using the Frank system in yellow (training data, n = 70) and the McFee system in red (test data, n = 28). Corresponding data points are connected with light gray lines.

Figure 7: Deformations of fitted models (Ours, NLL, PLT, RR) from vector-cardiogram training data with observed and predicted test data respectively plotted as red squares and green circles. Corresponding data points are connected with light gray lines. NLL refers to the nonparametric local linear regression model. PLT refers to the projective linear transformation model. RR refers to the rigid rotation model.

Several extensions of the proposed framework can result from alternative optimization strategies. This includes extensions to hyper-spherical domains. It is likely that one can extend the deformation basis utilized in this paper to hyper-spherical domains; however, this may be costly and inefficient. For higher-dimensional spheres it may be better to construct custom orthonormal bases derived for specific applications in order to reduce the number of model coefficients. Additionally, one may wish to add a second-order roughness penalty to the objective function. Other possible extensions include developing more general classes of deformations. For example, to better handle sliding plate boundaries one may wish to utilize a mapping that allows for discontinuity on sets of measure zero.
Acknowledgements
This research was supported in part by the NSF grants DMS 1208959 and IIS 1217515.
References
Boyd, J., Little, J., 2005. Biometric gait recognition. In: Tistarelli, M., Bigun, J., Grosso, E. (Eds.), Advanced Studies in Biometrics. Vol. 3161 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 19–42. URL http://dx.doi.org/10.1007/11493648_2
Chang, T., 1986. Spherical regression. Annals of Statistics 14 (3), 907–924.

Chang, T., 1989. Spherical regression with errors in variables. Annals of Statistics 17 (1), 293–306.

Downs, T. D., 2003. Spherical regression. Biometrika 90 (3), 655–668.

Downs, T. D., Mardia, K. V., 2002. Circular regression. Biometrika 89 (3), 683–697.

Glaunès, J., Vaillant, M., Miller, M. I., Jan. 2004. Landmark matching via large deformation diffeomorphisms on the sphere. J. Math. Imaging Vis. 20 (1-2), 179–200.

Durre, I., Vose, R. S., Wuertz, D. B., 2006. Overview of the integrated global radiosonde archive. Journal of Climate 19, 53–68.

Kac, V. G., 1990. Infinite-Dimensional Lie Algebras, Third Edition. Cambridge University Press.

Kim, P. T., 1991. Decision theoretic analysis of spherical regression. Journal of Multivariate Analysis 38, 233–240.

Kurtek, S., Klassen, E., Ding, Z., Jacobson, S., Jacobson, J., Avison, M., Srivastava, A., March 2011. Parameterization-invariant shape comparisons of anatomical surfaces. IEEE Transactions on Medical Imaging 30 (3), 849–858.

Mardia, K. V., 1972. Statistics of Directional Data. Academic Press.

Mardia, K. V., Jupp, P. E., 2000. Directional Statistics. John Wiley & Sons.

Marzio, D., Panzera, A., Taylor, C., 2014. Nonparametric regression for spherical data. Journal of the American Statistical Association.

McKenzie, J. K., 1957. The estimation of an orientation relationship. Acta Crystallographica 10, 61–62.

Prentice, M. J., Mardia, K. V., 1995. Shape changes in the plane for landmark data. Annals of Statistics 23 (6), 1960–1974.

Rancourt, D., Rivest, L.-P., Asselin, J., 2000. Using orientation statistics to investigate variations in human kinematics. Journal of the Royal Statistical Society. Series C (Applied Statistics) 49 (1), 81–94.
Rivest, L.-P., 1997. A decentred predictor for circular-circular regression. Biometrika 84 (3), 717–726.

Rivest, L.-P., 1989. Spherical regression for concentrated Fisher-von Mises distributions. Annals of Statistics 17, 307–317.

Rosenthal, M., Wu, W., Klassen, E., Srivastava, A., 2014. Spherical regression models using projective linear transformations. Journal of the American Statistical Association, in press.

Stephens, M. A., 1979. Vector correlation. Biometrika 66, 41–48.
Appendix A. Details of Numerically Computing the Roughness Measure
We use polar coordinates to compute the roughness measure numerically. Given a point $(\theta, \phi) \in (0, \pi) \times (0, 2\pi]$, we can obtain the Cartesian coordinates as $\psi_{\theta,\phi} = (\sin(\theta)\cos(\phi), \sin(\theta)\sin(\phi), \cos(\theta))^T$. Conversely, given a point $p = (p_1, p_2, p_3)^T \in S^2 \subset \mathbb{R}^3$ in Cartesian coordinates, we can get its polar coordinates as $\psi^{-1}(p) = (\theta, \phi) = (\cos^{-1}(p_3), \tan^{-1}(p_2/p_1))$. For a warping $\gamma$ which is parametrized in Cartesian coordinates, let $z = (z_1, z_2, z_3)^T = \gamma(\psi_{\theta,\phi})$. The image of $\gamma$ at $(\theta, \phi)$ in polar coordinates can thus be computed as $(\tilde\theta, \tilde\phi) = (\cos^{-1}(z_3), \tan^{-1}(z_2/z_1))$. The variable $\phi$ can be visualized as a rotation about the $z$ axis, and $\theta$ can be thought of as the arc-length distance from the north pole.

To compute the Jacobian of $\gamma$, we take derivatives with respect to an orthonormal basis at each point on the sphere. An orthonormal basis for the tangent space at the point $(\theta, \phi)$ is
\[
\left[ \frac{d\psi_{\theta,\phi}}{d\theta}, \; \frac{1}{\sin(\theta)} \frac{d\psi_{\theta,\phi}}{d\phi} \right]
= \left[ \begin{pmatrix} \cos(\theta)\cos(\phi) \\ \cos(\theta)\sin(\phi) \\ -\sin(\theta) \end{pmatrix}, \;
\begin{pmatrix} -\sin(\phi) \\ \cos(\phi) \\ 0 \end{pmatrix} \right].
\]
Note that $\sin(\theta) = 0$ at the north pole $\theta = 0$ and the south pole $\theta = \pi$. This implies that the change in distance on the sphere is zero at these two points when moving in the direction of $\phi$, so there are two points of discontinuity. Since we are integrating the roughness over the entire surface of the sphere, these two points will not theoretically change the measure of roughness. We exclude small neighborhoods around the north and south poles in our computation of the roughness.

Recall that the Jacobian operation is a linear mapping between the tangent spaces $T_{\theta,\phi}$ and $T_{\tilde\theta,\tilde\phi}$. Therefore $J_{\theta,\phi}$, the Jacobian matrix with respect to the sphere at $(\theta, \phi)$, can be computed by applying a change of basis to the linear Jacobian matrix as
\[
J_{\theta,\phi}
= \begin{pmatrix} 1 & 0 \\ 0 & \sin(\tilde\theta) \end{pmatrix}
\begin{pmatrix} \frac{d\tilde\theta}{d\theta} & \frac{d\tilde\theta}{d\phi} \\[2pt] \frac{d\tilde\phi}{d\theta} & \frac{d\tilde\phi}{d\phi} \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & \frac{1}{\sin(\theta)} \end{pmatrix}
= \begin{pmatrix} \frac{d\tilde\theta}{d\theta} & \frac{1}{\sin(\theta)} \frac{d\tilde\theta}{d\phi} \\[2pt] \sin(\tilde\theta) \frac{d\tilde\phi}{d\theta} & \frac{\sin(\tilde\theta)}{\sin(\theta)} \frac{d\tilde\phi}{d\phi} \end{pmatrix}.
\]
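The coordinate conversions and the per-point distance-from-isometry computation above can be sketched numerically. The following Python fragment is an illustrative sketch, not the authors' code: the function names are our own, the Jacobian is approximated by finite differences rather than the grid scheme given below, and `arctan2` is used in place of $\tan^{-1}(p_2/p_1)$ to resolve quadrants.

```python
import numpy as np

def sph_to_cart(theta, phi):
    # psi_{theta,phi}: polar -> Cartesian coordinates on S^2
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def cart_to_sph(p):
    # psi^{-1}: Cartesian -> polar; atan2 handles the quadrant ambiguity of tan^{-1}(p2/p1)
    return np.arccos(np.clip(p[2], -1.0, 1.0)), np.arctan2(p[1], p[0])

def jacobian_on_sphere(gamma, theta, phi, h=1e-6):
    # Finite-difference approximation of J_{theta,phi}, including the change of
    # basis diag(1, sin(theta_tilde)) . L . diag(1, 1/sin(theta)) from the text.
    def image(th, ph):
        return cart_to_sph(gamma(sph_to_cart(th, ph)))

    def wrap(d):
        # bring an angular difference to the nearest branch (role of m(a, b) in the text)
        return (d + np.pi) % (2.0 * np.pi) - np.pi

    t0, p0 = image(theta, phi)
    t_th, p_th = image(theta + h, phi)   # perturb theta
    t_ph, p_ph = image(theta, phi + h)   # perturb phi
    L = np.array([[(t_th - t0) / h, (t_ph - t0) / h],
                  [wrap(p_th - p0) / h, wrap(p_ph - p0) / h]])
    return np.diag([1.0, np.sin(t0)]) @ L @ np.diag([1.0, 1.0 / np.sin(theta)])

def roughness_at_point(J):
    # Distance of J from an isometry: ||J^T J - I||_F^2 and log^2(l1) + log^2(l2)
    A = J.T @ J
    q = np.sum((A - np.eye(2)) ** 2)
    lam = np.linalg.eigvalsh(A)  # A is symmetric positive definite
    r = np.sum(np.log(lam) ** 2)
    return q, r
```

For any isometry of the sphere (e.g., a rigid rotation) both quantities vanish up to finite-difference error; summing them over a $(\theta, \phi)$ grid with the $\sin(\theta)$ area weight yields estimates of the roughness measures $Q(\gamma)$ and $R(\gamma)$.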
Looking at this mapping in reverse, the change of basis on the left-hand side transforms elements of the tangent space $T_{\tilde\theta,\tilde\phi}$ into coordinates with respect to $e_1, e_2$, which is then applied to the linear Jacobian. Finally, the matrix on the right transforms the basis back to elements of $T_{\theta,\phi}$. A measure of the distance from isometry can be computed as $\|\mathrm{logm}(J_{\theta,\phi}^T J_{\theta,\phi})\|$, where logm denotes the matrix logarithm and $\|\cdot\|$ denotes the matrix Frobenius norm. The elements of the matrix $A_{\theta,\phi} = J_{\theta,\phi}^T J_{\theta,\phi}$ can be computed as
\[
A_{\theta,\phi}
= \begin{pmatrix} \frac{d\tilde\theta}{d\theta} & \sin(\tilde\theta) \frac{d\tilde\phi}{d\theta} \\[2pt] \frac{1}{\sin(\theta)} \frac{d\tilde\theta}{d\phi} & \frac{\sin(\tilde\theta)}{\sin(\theta)} \frac{d\tilde\phi}{d\phi} \end{pmatrix}
\begin{pmatrix} \frac{d\tilde\theta}{d\theta} & \frac{1}{\sin(\theta)} \frac{d\tilde\theta}{d\phi} \\[2pt] \sin(\tilde\theta) \frac{d\tilde\phi}{d\theta} & \frac{\sin(\tilde\theta)}{\sin(\theta)} \frac{d\tilde\phi}{d\phi} \end{pmatrix}
= \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},
\]
where
\[
a_{11} = \left(\frac{d\tilde\theta}{d\theta}\right)^2 + \sin^2(\tilde\theta) \left(\frac{d\tilde\phi}{d\theta}\right)^2, \qquad
a_{12} = a_{21} = \frac{1}{\sin(\theta)} \frac{d\tilde\theta}{d\theta} \frac{d\tilde\theta}{d\phi} + \frac{\sin^2(\tilde\theta)}{\sin(\theta)} \frac{d\tilde\phi}{d\theta} \frac{d\tilde\phi}{d\phi},
\]
\[
a_{22} = \frac{1}{\sin^2(\theta)} \left(\frac{d\tilde\theta}{d\phi}\right)^2 + \frac{\sin^2(\tilde\theta)}{\sin^2(\theta)} \left(\frac{d\tilde\phi}{d\phi}\right)^2.
\]
In this manner, we can measure the distance from isometry for the tangent space at each point by measuring how far $J_{\theta,\phi}$ is from a rotation or reflection via $\|J_{\theta,\phi}^T J_{\theta,\phi} - I\| = \|A_{\theta,\phi} - I\|$, where $I$ denotes the $2 \times 2$ identity matrix and $\|\cdot\|$ denotes the Frobenius norm. This results in the first-order roughness measure
\[
Q(\gamma) = \int_0^{2\pi} \int_0^{\pi} \|A_{\theta,\phi} - I\|^2 \sin(\theta) \, d\theta \, d\phi.
\]
Alternatively, the square of the Frobenius norm of the matrix logarithm of $A_{\theta,\phi}$ can be computed using its eigenvalues as $\|\mathrm{logm}(A_{\theta,\phi})\|^2 = \log^2(\lambda_1) + \log^2(\lambda_2)$, where
\[
\lambda_1 = \frac{a_{11} + a_{22} + \sqrt{(a_{11} + a_{22})^2 - 4(a_{11} a_{22} - a_{12} a_{21})}}{2}, \qquad
\lambda_2 = \frac{a_{11} + a_{22} - \sqrt{(a_{11} + a_{22})^2 - 4(a_{11} a_{22} - a_{12} a_{21})}}{2}.
\]
This results in the following first-order roughness measure:
\[
R(\gamma) = \int_0^{2\pi} \int_0^{\pi} \|\mathrm{logm}(J_{\theta,\phi}^T J_{\theta,\phi})\|^2 \sin(\theta) \, d\theta \, d\phi.
\]

For a small $\delta > 0$, let $\Theta$ denote an $M \times M$ matrix such that $\Theta_{i,j} = \pi(j - 1 + \delta)/(M - 1 + 2\delta)$ for $i, j = 1, \ldots, M$. Similarly, let $\Phi$ denote an $M \times M$ matrix such that $\Phi_{i,j} = 2\pi(i - 1)/(M - 1)$ for $i, j = 1, \ldots, M$. In this manner, $\Theta$ and $\Phi$ represent a parametrization of the sphere where each point in the matrices can be mapped back to Cartesian coordinates using the mapping $\psi(\Theta_{i,j}, \Phi_{i,j})$. Let $(\tilde\Theta_{i,j}, \tilde\Phi_{i,j}) = \psi^{-1}(\gamma(\psi(\Theta_{i,j}, \Phi_{i,j})))$ for $i, j = 1, \ldots, M$. One can discretely represent the warping function $\gamma \in \Gamma$ by these four matrices $(\Theta, \Phi)$ and $(\tilde\Theta, \tilde\Phi)$. If the resolution $M$ is fine enough, one can obtain numerically approximated derivatives for $(\tilde\Theta, \tilde\Phi)$. Since $\cos^{-1}$ and $\tan^{-1}$ are typically defined on restricted range intervals, we account for wrapping with $m(a, b) = a - b + j\pi$, where $j = \mathrm{argmin}_{i \in \mathbb{Z}} |a - b + i\pi|$. A numerically approximated $2 \times 2$ Jacobian $J_{i,j}$ evaluated at $(\Theta_{i,j}, \Phi_{i,j})$ can be computed as
\[
J_{i,j} = \begin{pmatrix}
\frac{m(\tilde\Theta_{i,j}, \tilde\Theta_{i,j+1})}{m(\Theta_{i,j}, \Theta_{i,j+1})} & \frac{m(\tilde\Theta_{i,j}, \tilde\Theta_{i+1,j})}{m(\Phi_{i,j}, \Phi_{i+1,j})} \\[4pt]
\frac{m(\tilde\Phi_{i,j}, \tilde\Phi_{i,j+1})}{m(\Theta_{i,j}, \Theta_{i,j+1})} & \frac{m(\tilde\Phi_{i,j}, \tilde\Phi_{i+1,j})}{m(\Phi_{i,j}, \Phi_{i+1,j})}
\end{pmatrix}.
\]
A numerical estimate of $Q(\gamma)$ is given by
\[
Q(\gamma) \approx \sum_{i=1}^{M-1} \sum_{j=1}^{M-1} \mathrm{tr}\left((J_{i,j}^T J_{i,j} - I)^T (J_{i,j}^T J_{i,j} - I)\right) \sin(\Theta_{i,j}) \, 2\pi^2 / M^2.
\]
If we let $A_{i,j} = J_{i,j}^T J_{i,j}$ and respectively denote the elements of the matrix as $a_{11,i,j}$, $a_{12,i,j}$, $a_{21,i,j}$, and $a_{22,i,j}$, then the eigenvalues can be computed as
\[
\lambda_{1,i,j} = \frac{a_{11,i,j} + a_{22,i,j} + \sqrt{(a_{11,i,j} + a_{22,i,j})^2 - 4(a_{11,i,j} a_{22,i,j} - a_{12,i,j} a_{21,i,j})}}{2},
\]
\[
\lambda_{2,i,j} = \frac{a_{11,i,j} + a_{22,i,j} - \sqrt{(a_{11,i,j} + a_{22,i,j})^2 - 4(a_{11,i,j} a_{22,i,j} - a_{12,i,j} a_{21,i,j})}}{2}.
\]
A numerical estimate of $R(\gamma)$ is given by
\[
R(\gamma) \approx \sum_{i=1}^{M-1} \sum_{j=1}^{M-1} \left( \log^2(\lambda_{1,i,j}) + \log^2(\lambda_{2,i,j}) \right) \sin(\Theta_{i,j}) \, 2\pi^2 / M^2.
\]

Appendix B. Generating Arbitrary Diffeomorphisms
In Sections 5.1 and 5.2, diffeomorphic maps were generated by composing several parametric diffeomorphisms. The parametric families are:
• Rigid Rotation:
One can uniquely parametrize rigid rotations of the sphere using a special orthogonal matrix R ∈ SO(n, R), which denotes the set of n × n real-valued matrices with det(R) = 1 and R^T R = I_n, where I_n denotes the n × n identity matrix. A rotation diffeomorphism γ parametrized via a rotation matrix R can be evaluated at each z ∈ S² by the map γ(z) = Rz.

• Projective Linear Transformation:
One can uniquely parametrize projective linear transformations using a special linear matrix P ∈ SL(n, R), which denotes the set of n × n real-valued matrices with det(P) = 1. A projective linear diffeomorphism γ parametrized via the matrix P can be evaluated at each z ∈ S² by the map γ(z) = Pz/‖Pz‖, where ‖·‖ denotes the Euclidean norm. This class includes the group of rigid rotations.

• Conformal map:
Consider $S^2 \subset \mathbb{R}^3$ and identify $\mathbb{R}^3$ with $\mathbb{C} \oplus \mathbb{R}$. So, instead of writing an element of $S^2$ as $(x, y, z)$, we write it as $(z, t)$, where $z \in \mathbb{C}$, $t \in \mathbb{R}$, and $|z|^2 + t^2 = 1$. For $S^2$, the conformal maps are precisely the Möbius transformations studied by Downs (2003). Suppose we are given the matrix
\[
M = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \in GL(2, \mathbb{C}).
\]
Associated to this matrix is a conformal map $A: S^2 \to S^2$, given by the following formula:
\[
A(z, t) = \left( \frac{2\,\overline{(cz + d(1 - t))}\,(az + b(1 - t))}{|az + b(1 - t)|^2 + |cz + d(1 - t)|^2}, \;
\frac{|az + b(1 - t)|^2 - |cz + d(1 - t)|^2}{|az + b(1 - t)|^2 + |cz + d(1 - t)|^2} \right).
\]
This formula is well defined for every point of the sphere except the point where $z = 0$ and $t = 1$. At this point, $A$ is defined by
\[
A(0, 1) = \left( \frac{2 a \bar{c}}{|a|^2 + |c|^2}, \; \frac{|a|^2 - |c|^2}{|a|^2 + |c|^2} \right).
\]
Essentially this map is derived by conjugating a Möbius transformation by stereographic projection from $\mathbb{C}$ to $S^2$. Note that every conformal map $S^2 \to S^2$ is obtained from a matrix in this manner.

• Twist map:
Consider $S^2 \subset \mathbb{R}^3$ and identify $\mathbb{R}^3$ with $\mathbb{C} \oplus \mathbb{R}$, so that an element of $S^2$ is written as $(z, t)$ with $z \in \mathbb{C}$, $t \in \mathbb{R}$, and $|z|^2 + t^2 = 1$. A very simple formula for a twist map is as follows. Start by fixing $r \in \mathbb{R}$. Then define a map $T: S^2 \to S^2$ by $T(z, t) = (e^{irt} z, t)$. Essentially this twist map takes each latitude circle to itself by a rotation, and this rotation varies from the south pole to the north pole.

• Small Incremental Diffeomorphism Using Spherical Harmonic Basis:
Let $B_1, B_2, \ldots, B_L$ denote the basis elements obtained from the spherical harmonics up to order $l$ as described in Section 4.2. One can parametrize a small incremental diffeomorphism which is close to the identity transformation via a coefficient vector $c \in \mathbb{R}^L$. If $\|c\|$ is sufficiently small, then the map $\gamma(z) = \exp_z(\sum_{i=1}^L c_i B_i(z))$ for $z \in S^2$ will represent a diffeomorphism which is close to the identity transformation. Since the set of diffeomorphisms forms a group under composition, $\gamma_1 \circ \gamma_2$ is also a diffeomorphism for any two $\gamma_1, \gamma_2 \in \Gamma$. In this manner one can obtain a larger diffeomorphism by iteratively applying these small incremental diffeomorphisms. Let $\gamma = \gamma_1 \circ \gamma_2 \circ \ldots \circ \gamma_J$ denote $J$ compositions of small incremental diffeomorphisms.

The initialized deformation in Section 5.1 is generated from a composition of a rigid rotation, a projective linear transformation, a conformal transformation parametrized by a complex $2 \times 2$ matrix $M$, and a twist map, followed by small incremental diffeomorphisms using up to order $l = 2$ spherical harmonics. The 150 coefficients were generated randomly.

Appendix C. Diffeomorphisms and Their Jacobian Eigenvalues
Any diffeomorphism $\gamma: S^2 \to S^2$ is by definition differentiable at each point $p \in S^2$. A map $\gamma$ is differentiable at $p \in S^2$ if $J_{\gamma,p}$, the Jacobian of $\gamma$ at $p$, exists. If it exists, then $J_{\gamma,p}$ is defined as the full-rank linear map from $T_p(S^2)$ to $T_{\gamma(p)}(S^2)$ such that for each $v \in T_p(S^2)$ the limit
\[
J_{\gamma,p}(v) = \lim_{\epsilon \downarrow 0} \frac{\gamma(p + \epsilon v) - \gamma(p)}{\epsilon}
\]
exists and yields a well-defined linear map between the two tangent spaces. If the limit fails to exist for some $v \in T_p(S^2)$, or maps to a lower-dimensional subset of $T_{\gamma(p)}(S^2)$, then we say that $J_{\gamma,p}$ does not exist, so that $\gamma$ is not differentiable at $p$.

If $J_{\gamma,p}$ exists, then it can be represented as a full-rank $2 \times 2$ matrix $B$ with respect to some choice of orthonormal bases for $T_p(S^2)$ and $T_{\gamma(p)}(S^2)$, respectively. There is no standard basis to use, so the matrix representation will depend on the chosen bases. A change of basis results in an orthogonal transformation $OBO^T$ for some $O \in O(2)$, so the eigenvalues are not affected by the choice of basis.

Let $\lambda_{1,p}$ and $\lambda_{2,p}$ denote the two eigenvalues of $J_{\gamma,p}$. If $J_{\gamma,p}$ exists, then both eigenvalues must be finite and non-zero. Because $\gamma$ is a diffeomorphism, $J_{\gamma,p}$ is continuous with respect to $p$, so that the eigenvalues $\lambda_{1,p}$ and $\lambda_{2,p}$ are also continuous with respect to $p$.

Appendix D. Additional Plots and Results

Appendix D.1. Demonstration Using Simulated Data
Some additional plots from Section 5.2 which may be of interest are presented here. In Figure D.8 the training and testing data are presented. In Figure D.9, several other models fitted from the training data are compared. Notice the large overlapping regions produced by the nonparametric local linear model in this case. This suggests a problem with extrapolation here.

Figure D.8: Training data (n = 200) and test data (n = 100). The yellow points denote the x's and the red points denote the y's. Corresponding data points are connected with light gray lines.

Appendix D.2. Weather Balloon Wind Directions Data
An additional plot from Section 5.3 which may be of interest is presented here. In Figure D.10 one can see the validation error used for tuning the model. One can see that l = 10 has a smaller validation error and is minimized around λ = 1.0 × 10−. Since the error is not greatly reduced with respect to λ, we select λ = 10−. There may be some higher-frequency variability which might be better captured with a larger value of l in this case.

Appendix D.3. Vector-Cardiogram Data
An additional plot from Section 5.4 which may be of interest is presented here. In Figure D.11 one can see the validation error used for tuning the model. One can see that l = 3 has a smaller validation error and is minimized with a heavy penalty. Since the roughness seems to be going down, we select λ = 10−. The variability seems to be low frequency and may be better characterized using up to order l = 2 basis elements.

Figure D.9: Deformations of fitted models (Ours, NLL, PLT, RR) from simulated training data with observed and predicted test data respectively plotted as red squares and green circles. Corresponding data points are connected with light gray lines.
Figure D.10: One can see the mean squared error on the weather validation data for various values of λ. The dotted blue and solid red lines respectively denote l = 3 and l = 10.
Figure D.11: One can see the mean squared error on the vector-cardiogram validation data for various values of λ. The dotted blue and solid red lines respectively denote l = 3 and l = 10.