Transport-Based Pattern Theory: A Signal Transformation Approach
Liam Cattell
Department of Biomedical Engineering, University of Virginia, Charlottesville, USA
Gustavo K. Rohde
Department of Biomedical Engineering, University of Virginia, Charlottesville, USA
Charles L. Brown Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, USA
In many scientific fields, imaging is used to relate a certain physical quantity to other dependent variables. Images can therefore be considered as a map from a real-world coordinate system to the non-negative measurements being acquired. In this work we describe an approach for simultaneous modeling and inference of such data, using the mathematics of optimal transport. To achieve this, we describe a numerical implementation of the linear optimal transport transform, based on the solution of the Monge-Ampère equation, which uses Brenier's theorem to characterize the solution of the Monge functional as the derivative of a convex potential function. We use our implementation of the transform to compute a curl-free mapping between two images, and show that it is able to match images with lower error than existing methods. Moreover, we provide theoretical justification for properties of the linear optimal transport framework observed in the literature, including a theorem for the linear separation of data classes. Finally, we use our optimal transport method to empirically demonstrate that the linear separability theorem holds, by rendering non-linearly separable data linearly separable following transformation to transport space.
Introduction
Many imaging problems relate to analyzing spatial patterns of a certain physical quantity, for the purpose of understanding the spatiotemporal distribution of that quantity in relation to other dependent variables. Historically, astronomers, for example, used images from telescopes to understand the mass distribution of different galaxies (Schwarzschild, 1954; Bertola, 1966). Biologists and pathologists make use of digital images of cells in order to understand their structure as a function of different experimental conditions or disease (Yeung et al., 2005; Wang, Ozolek, & Rohde, 2010). Scientists and radiologists utilize magnetic resonance imaging (MRI), computed tomography (CT), and other imaging modalities in order to understand the structure of different organs as they relate to dynamic, physiological, and other functional characteristics (Gorbunova et al., 2012; Schmitter et al., 2015).

In the majority of these cases, imaging refers to the measurement of a certain physical quantity which is positive in nature. In X-ray CT, detectors measure the number of photons arriving at a certain region of the detector in a certain window of time. In magnitude-reconstructed MRI, the quantity being reconstructed is the magnitude of the magnetization density. In light microscopy, detectors also measure the number of photons arriving within a certain time window. In transmission microscopy, light is attenuated, and a darker pixel corresponds to an increased density at that location. In fluorescence microscopy, the amount of signal being received is proportional to the concentration of a certain fluorophore at that location.
In these, and other examples, an image I can be considered as a map from the domain Ω ⊂ R^d, representing the coordinate system in which measurements are being made, to the non-negative real numbers: I : Ω → R+.

Here we describe an approach for modeling and inference on such data by considering each image in a database {I_1, I_2, · · · , I_N} as an instance of a given template I 'morphed' by a given function f_k. That is, I_k(x) ∼ Df_k(x) I(f_k(x)), where Df_k is the determinant of the Jacobian of f_k. The association between I_k and f_k is made unique by the mathematics of optimal transport, as explained in (Wang, Slepčev, Basu, Ozolek, & Rohde, 2013; Kolouri, Tosun, Ozolek, & Rohde, 2016). The embedding provided by the so-called linear optimal transport (LOT) approach can be viewed as a mathematical image transform with forward and inverse operations, thus enabling inference and modeling simultaneously. Here we extend the work presented in (Wang et al., 2013; Kolouri, Tosun, et al., 2016) to provide theoretical justification for certain properties observed empirically in the analysis of cell and brain image databases (Ozolek et al., 2014; Kolouri, Tosun, et al., 2016; Tosun, Yergiyev, Kolouri, Silverman, & Rohde, 2015; Kundu et al., 2018), including a linear separation theorem in transform space similar to the work shown in (Park, Kolouri, Kundu, & Rohde, 2017; Kolouri, Park, & Rohde, 2016).

We note that the framework discussed above, whereby images are modeled as actions of the form Df (I ∘ f), has similarities to the pattern theory approach of Grenander, Mumford, and others (Grenander, 1994, 1996; Grenander & Miller, 1998; Mumford, 1997). In particular, the pattern theory-based computational anatomy approach originally proposed by Grenander and Miller (Grenander & Miller, 1998) involves viewing the orbit formed by the action of a diffeomorphism as a generative model for imagery.
As such, it attempts to model each observed image as I_k(x) ∼ I(f_k(x)), based on representations, probabilities, and inference methods that can be developed within a Bayesian perspective. The framework has been useful for motivating a number of methods used in pattern recognition (Grenander & Miller, 2007). The LOT approach introduced in (Wang et al., 2013; Kolouri, Tosun, et al., 2016; Park et al., 2017) shares a similar point of view, and can be seen as a particular version of the more general pattern theory approach (Grenander, 1994, 1996; Grenander & Miller, 1998; Mumford, 1997), with the main innovation being that the transport formulation can encompass both displacement (geometric transformation) and photometric (i.e., intensity difference) information.

Overview
We begin in Definitions and Preliminaries by introducing the basics of the mathematics of optimal mass transport, including the Monge functional and Brenier's theorem, which characterizes the solution of the Monge functional. We also review basic notions related to the geometry of optimal transport, as well as the LOT metric described in (Wang et al., 2013; Kolouri, Tosun, et al., 2016). We then define the basics of the LOT transform, including the forward and inverse operations, and describe several mathematical properties of the LOT transform. These include translation, scaling, and composition claims, as well as linear separability properties of the transform. We next describe the numerical discretization and implementation of the LOT transform based on the solution of the Monge-Ampère equation, within a multi-resolution framework. Finally, we describe computational experiments and offer concluding remarks.
Definitions and Preliminaries

Optimal Mass Transport
Consider two Borel probability measures μ and ν with finite p-th moments, defined on measure spaces X ⊆ R^d and Y ⊆ R^d, respectively. Consider the case when the measures are smooth and are associated with positive density functions I_0 and I_1, such that dμ(x) = I_0(x) dx and dν(y) = I_1(y) dy. The original optimal transport problem posed by Monge (Monge, 1781) is to find a mass-preserving (MP) map f : X → Y that pushes μ on to ν and minimizes the objective function

M(μ, ν) = inf_{f ∈ MP} ∫_X c(x, f(x)) I_0(x) dx    (1)

where c : X × Y → R+ is the cost function, and MP := {f : X → Y | f#μ = ν}, where f#μ denotes the push-forward of measure μ. This is characterized as:

∫_{f^{-1}(A)} dμ(x) = ∫_A dν(y)   ∀ A ⊂ Y    (2)

If f is a smooth, one-to-one mapping, Eq. (2) can be written in a differential form:

Df(x) I_1(f(x)) = I_0(x)    (3)

where Df is the determinant of the Jacobian of f.

In his original paper, Monge considered the Euclidean distance c(x, f(x)) = ‖f(x) − x‖ as the cost function (Monge, 1781). However, under the smoothness assumption required for Eq. (3), the cost function is often chosen to be the L² norm, such that Eq. (1) becomes the L² Wasserstein metric (Haker, Zhu, Tannenbaum, & Angenent, 2004; Trigila & Tabak, 2016):

M(μ, ν) = inf_{f ∈ MP} ∫_X ‖x − f(x)‖² I_0(x) dx   s.t.   Df(x) I_1(f(x)) = I_0(x)    (4)

It should be noted that both the objective function and the constraint in Eq. (4) are nonlinear with respect to f(x). Moreover, an important result in the field is Brenier's theorem, which characterizes the unique solution to the Monge problem (Brenier, 1991).

Theorem 1 (Brenier's Theorem)
Let I_0 and I_1 be non-negative functions of the same total mass and with bounded support:

∫_X I_0(x) dx = ∫_Y I_1(y) dy.

When c(x, y) = h(x − y) for some strictly convex function h, there exists a unique optimal transport map f* that minimizes Eq. (1). Furthermore, if c(x, y) = ‖x − y‖², then there exists a (unique up to adding a constant) convex function φ such that f*(x) = ∇φ(x). A proof is available in (Villani, 2008; Gangbo & McCann, 1996).
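To make Eq. (3) concrete, the mass-preservation constraint can be checked numerically in one dimension, where the optimal map between two Gaussian densities is affine and, being increasing, is the derivative of a convex potential, as Brenier's theorem requires. The following sketch assumes NumPy; all names and grid choices are illustrative:

```python
import numpy as np

def gauss(x, mu, s):
    # Normal density N(mu, s^2)
    return np.exp(-(x - mu)**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

x = np.linspace(-5, 5, 1001)
I0 = gauss(x, 0.0, 1.0)    # reference density
I1 = gauss(x, 2.0, 0.5)    # target density

# In 1D the optimal map between these Gaussians is affine, f(x) = 2 + 0.5 x.
# It is increasing, hence the derivative of the convex potential
# phi(x) = 2 x + 0.25 x^2.
f = 2.0 + 0.5 * x
Df = 0.5 * np.ones_like(x)   # Jacobian "determinant" (a derivative in 1D)

# Mass-preservation constraint of Eq. (3): Df(x) I1(f(x)) = I0(x)
lhs = Df * gauss(f, 2.0, 0.5)
assert np.allclose(lhs, I0, atol=1e-12)
```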
Finally, it should be mentioned that, for certain measures, the Monge formulation of the optimal transport problem is ill-posed, in the sense that there is no transport map to rearrange one measure into another. In such cases the Kantorovich formulation of the problem, using transport plans, is preferred. For brevity, we omit a detailed discussion of the Kantorovich mass transport formulation and instead refer the reader to (Kolouri, Park, Thorpe, Slepčev, & Rohde, 2017) for more details.
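For discrete measures made of equally many unit point masses, an optimal Kantorovich plan can always be found among permutations, so a tiny instance can be solved by brute force. A toy sketch (pure Python; the point locations are illustrative):

```python
import itertools

# Two discrete measures: three unit point masses each
x = [0.0, 1.0, 2.0]
y = [2.5, 0.5, 1.5]

def cost(perm):
    # Squared-distance transport cost of the assignment x[i] -> y[perm[i]]
    return sum((xi - y[j])**2 for xi, j in zip(x, perm))

# Brute force over all transport maps (permutations); for a convex cost
# in 1D the monotone matching is optimal
best = min(itertools.permutations(range(len(y))), key=cost)
print(best, cost(best))   # -> (1, 2, 0) 0.75
```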
Geometry of Optimal Transport
Brenier's theorem states that the OT map between μ and ν is unique and, in the continuous setting presented above, the mass from each point x is moved to a single location, given as the value of the function f(x). Formally, the set of measures is not just a metric space, but is also a Riemannian manifold, equipped with an inner product on the tangent space at any point (do Carmo, 1992). Specifically, the tangent space at the measure σ with density γ (i.e. dσ = γ(x) dx) is the set of vector fields T_σ = {v : Ω → R^d such that ∫_Ω |v(x)|² γ(x) dx < ∞}, and the inner product is the weighted L² inner product (Wang et al., 2013):

⟨v_1, v_2⟩_σ = ∫_Ω v_1(x) · v_2(x) γ(x) dx

Therefore, the OT distance is the length of the geodesic connecting two measures (Benamou & Brenier, 2000). Fortunately, in the context of optimal transport, this has a straightforward interpretation: if μ_t, 0 ≤ t ≤ 1, is the geodesic from μ to ν, then μ_t is the measure obtained when mass from μ is transported by the transportation map f_t : x → (1 − t)x + t f(x). Consequently, μ_t(A) = ∫_{f_t^{-1}(A)} α(x) dx, where α is the density of μ.

The Linear Optimal Transport Distance
Computational complexity can hinder the application of OT-related metrics in learning and pattern analysis problems that involve large numbers of images. For recognition problems, the application of classification techniques such as nearest neighbors (Altman, 1992), linear discriminant analysis (McLachlan, 2004; Wang, Mo, Ozolek, & Rohde, 2011), support vector machines (Cortes & Vapnik, 1995), and their respective kernel versions (Cristianini & Shawe-Taylor, 2000), would require O(N²) OT computations, with N representing the number of images in the dataset. To ease this burden, Wang et al. (Wang et al., 2013) proposed the linear optimal transport (LOT) metric, which borrows the statistical atlas notion often seen in brain morphology studies (Ashburner & Friston, 2000), and seeks to compare images to one another based on their relation to a reference.

Consider a given signal I associated with measure μ via dμ(y) = I(y) dy, y ∈ Y, as before, and a reference function I_0 associated with measure σ such that dσ(x) = I_0(x) dx, x ∈ X. We can then identify μ (and I) with a map f via the optimization problem stated in Eq. (1). Brenier's theorem tells us that this identification is unique, and that it is characterized by the map being the gradient of a potential function φ, that is f = ∇φ, such that Df(x) I(f(x)) = I_0(x). This identification can be viewed as the projection of μ on to the tangent plane T_σ described above. More precisely, let the projection of μ be P(μ) = v, where v(x) = f(x) − x. Then v ∈ T_σ:

∫_X |v(x)|² I_0(x) dx = ⟨v, v⟩_σ = ‖P(μ) − P(σ)‖²_σ,

where ‖v‖_σ is defined to be ⟨v, v⟩_σ^{1/2}. As explained in (Wang et al., 2013), this is known as an azimuthal equidistant projection in cartography, while in differential geometry it would be the inverse of the exponential map.
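As a sanity check on this projection, in one dimension with Gaussian densities the norm ‖P(μ) − P(σ)‖_σ can be compared against the closed-form 2-Wasserstein distance between Gaussians. A sketch (the grid and tolerance are illustrative):

```python
import numpy as np

x = np.linspace(-8, 8, 4001)
dx = x[1] - x[0]
I0 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # reference sigma = N(0, 1)

# For mu = N(a, s^2), the optimal map from the reference is f(x) = a + s x,
# so the projection is v(x) = f(x) - x = a + (s - 1) x
a, s = 1.5, 2.0
v = a + (s - 1.0) * x

# ||P(mu) - P(sigma)||_sigma^2 = integral of |v(x)|^2 I0(x) dx
norm_sq = np.sum(v**2 * I0) * dx

# Closed form for 1D Gaussians: W2(N(a, s^2), N(0, 1))^2 = a^2 + (s - 1)^2
assert abs(norm_sq - (a**2 + (s - 1.0)**2)) < 1e-6
```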
Finally, given two measures μ and ν, and a fixed reference σ, the LOT distance between μ and ν is defined as (Wang et al., 2013):

d_LOT(μ, ν) = ‖P(μ) − P(ν)‖_σ.    (5)

LOT Transform
Consider two measures μ_1 and μ_2, with corresponding densities I_1 and I_2, and let σ be a fixed reference measure with corresponding density I_0. Furthermore, let f_1 and f_2 correspond to the OT maps linking μ_1 to σ and μ_2 to σ. That is, Df_1(x) I_1(f_1(x)) = Df_2(x) I_2(f_2(x)) = I_0(x), such that each f_i is uniquely defined through the solution to the Monge minimization problem in Eq. (1). From Eq. (5) we have:

d²_LOT(μ_1, μ_2) = ‖P(μ_1) − P(μ_2)‖²_σ
               = ∫_X |(f_1(x) − x) − (f_2(x) − x)|² I_0(x) dx
               = ‖Î_1 − Î_2‖²

where ‖Î‖ is defined to be ⟨Î, Î⟩^{1/2}, and

Î_1 = (f_1(x) − x) √I_0(x),   Î_2 = (f_2(x) − x) √I_0(x).

The result above motivated Kolouri et al. (Kolouri, Park, Thorpe, Slepčev, & Rohde, 2016) to define the continuous LOT embedding (transform) for signals and images. Here, a signal is interpreted to be a continuous function I(y), with y ∈ Y ⊆ R^d, where d refers to the dimension of the signal (e.g. d = 1 for 1D signals, d = 2 for images). The LOT transform of I can then be defined as:

Î(x) = (f(x) − x) √I_0(x),   x ∈ X    (6)

where f is the gradient of a potential function φ, such that f = ∇φ, and Df(x) I(f(x)) = I_0(x). The inverse (synthesis) LOT transform of Î is then:

I(y) = Df^{-1}(y) I_0(f^{-1}(y)),   y ∈ Y    (7)

where f^{-1}(f(x)) = x. Thus, we can consider the LOT transform as an operation taking signals from one domain (signal space) to another (displacements). Just as in engineering we consider the time/space domain versus the frequency domain, the LOT transform allows us to consider the signal space domain versus the displacement domain (with respect to a fixed reference).

The idea was extended by Park et al. (Park et al., 2017), who exploited the fact that, in 1D, the unique map between any two positive smooth densities can be computed in closed form.
Consequently, the authors defined the Cumulative Distribution Transform (CDT) for 1D signals (Park et al., 2017). This allowed for the definition of several interesting transform-pair properties, some of which have relevance to learning problems that are popular at the time of writing. Kolouri et al. (Kolouri, Tosun, et al., 2016) expanded the CDT to higher-dimensional signals (images) by using the Radon transform formalism to obtain a set of 1D projections from an image, and applying the CDT along these projections. This is equivalent to obtaining an embedding for the sliced Wasserstein metric (Kolouri, Park, & Rohde, 2016). Similarly to the LOT transform above, the CDT and Radon-CDT are invertible operations with forward (analysis) and inverse (synthesis) operations.
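In one dimension, the closed-form map in question is f = F_1^{-1} ∘ F_0, where F_0 and F_1 are the cumulative distribution functions of the reference and target densities. A numerical sketch (assuming NumPy; grid and tolerances are illustrative), which also shows that translating the target merely shifts the map:

```python
import numpy as np

def cdf(density, dx):
    # Cumulative distribution of a sampled density, normalized to total mass 1
    c = np.cumsum(density) * dx
    return c / c[-1]

def ot_map_1d(I0, I1, x):
    # 1D optimal transport map f = F1^{-1} o F0 (F0, F1: CDFs of I0, I1)
    dx = x[1] - x[0]
    F0, F1 = cdf(I0, dx), cdf(I1, dx)
    return np.interp(F0, F1, x)      # invert F1 at the values F0(x)

x = np.linspace(-8, 8, 8001)
I0 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)          # reference N(0, 1)
I1 = np.exp(-(x - 2)**2 / 2) / np.sqrt(2 * np.pi)    # target N(2, 1)

f = ot_map_1d(I0, I1, x)
mid = np.abs(x) < 3                  # stay away from poorly resolved tails

# For these Gaussians the closed-form map is f(x) = x + 2
assert np.max(np.abs(f[mid] - (x[mid] + 2))) < 1e-2

# Translating the target by 1 simply shifts the map by 1
I1b = np.exp(-(x - 3)**2 / 2) / np.sqrt(2 * np.pi)
fb = ot_map_1d(I0, I1b, x)
assert np.max(np.abs(fb[mid] - (f[mid] + 1))) < 1e-2
```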
LOT Transform Properties
In this section, we present three properties of the LOT transform pertaining to changes in density coordinates: the translation theorem, scaling theorem, and composition theorem. In addition, we demonstrate that the linear separation property of the CDT and Radon-CDT also holds for the LOT transform in d dimensions.

Translation
Let I_μ be the translation of the density I by μ, such that I_μ(y) = I(y − μ). The transform of I_μ with respect to the reference density I_0 is given by:

Î_μ(x) = Î(x) + μ √I_0(x),   x ∈ X    (8)

where I_0 is defined on measure space X ⊆ R^d. For a proof, see Appendix A.

Scaling
Let I_σ be the scaling of the density I by σ > 0, such that I_σ(x) = σ^d I(σx). The transform of I_σ with respect to the reference density I_0 is given by:

Î_σ(x) = (1/σ) ( Î(x) − x (σ − 1) √I_0(x) ),   x ∈ X    (9)

where I_0 is defined on measure space X ⊆ R^d. For a proof, see Appendix B.

Composition
Let I_g represent the composition of the density I with an invertible function g, such that I_g(x) = Dg(x) I(g(x)), where Dg is the determinant of the Jacobian of g, and g is the gradient of a convex potential function ϕ (i.e. g = ∇ϕ). The transform of I_g with respect to the reference density I_0 is given by:

Î_g(x) = ( g^{-1}( Î(x)/√I_0(x) + x ) − x ) √I_0(x),   x ∈ X    (10)

where I_0 is defined on measure space X ⊆ R^d. For a proof, see Appendix C.

Linear Separability in LOT Space
It is well known that if two sets are convex and disjoint, there always exists a hyperplane that is able to separate the sets (see (Park et al., 2017), for example). As a consequence, the sets are said to be linearly separable. In this section we will demonstrate that, under certain conditions, the LOT transform can enhance the linear separability of image classes.

Consider two non-overlapping, non-linearly separable subsets P and Q of a vector space V. All elements p_i of P are generated by transforming a "mother" density p with a differentiable function h_i:

p_i = Dh_i · p ∘ h_i,   p_i ∈ P,   h_i ∈ H

where each h_i is the gradient of a convex potential φ_i, such that h_i = ∇φ_i. Moreover, Dh_i denotes the determinant of the Jacobian of h_i, p ∘ h_i(x) = p(h_i(x)), and H is a set of diffeomorphisms (e.g. the set of all translations). Similarly, the elements q_j of Q are generated by transforming q with a differentiable function h_j ∈ H:

q_j = Dh_j · q ∘ h_j,   q_j ∈ Q,   h_j ∈ H

It should also be noted that, since sets P and Q do not overlap:

Dh · p ∘ h ≠ q,   ∀ h ∈ H

The definitions above provide a framework that can be used to construct or interpret image classes. For example, let H be the set of all translations h(x) = x − μ, where μ is a random variable. The elements of sets P and Q are given by p ∘ h and q ∘ h, respectively, and are translations of the original "mother" densities p and q. Therefore, the translation μ becomes the confound parameter that a classifier must decode in order to separate the classes.

Theorem 2 (Linear separability in LOT space)
Let P, Q, H follow the definitions given above. If P̂ and Q̂ are the LOT transforms of P and Q, then P̂ and Q̂ will be linearly separable if H satisfies the following conditions:

i) ∀ h ∈ H, h^{-1} ∈ H
ii) ∀ h_1, h_2, ... ∈ H, Σ_i α_i h_i ∈ H, where Σ_i α_i = 1
iii) ∀ h_1, h_2 ∈ H, h_1 ∘ h_2 ∈ H

For a proof, see Appendix D.

We note that the definition of the embedding (transform) for a certain image I given in Eq. (6) is in a way redundant, and that the convex potential φ could be used instead. In fact, doing so would be preferred in many practical situations, such as performing machine learning-type operations on digital images. This is because the array used to store the transform defined in Eq. (6) would be twice the size of the array used to store the potential φ, since there would be derivatives in the x- and y-directions. More than just a savings in computer memory, using the potential directly would translate to performing operations on objects of lower dimension. Ultimately, the mathematical meaning is the same, as the two formulations are related by a gradient operation. However, gradient operations are also known to be noise augmenting, and thus introduce additional difficulties when dealing with real data. Thus, in the computational examples we utilize the potential φ, rather than the LOT transform defined in Eq. (6).

Numerical Implementation
Given an image I_1 : Ω → R+ and a reference I_0 : Ω → R+, with Ω = [0, 1]² (in the results shown below we use 2D images), we compute an approximate solution for the LOT transform by solving the associated Monge-Ampère partial differential equation. Since Brenier's theorem asserts that the optimal map between I_0 and I_1 is characterized by f = ∇φ, we can write:

det Hφ(x) = I_0(x) / I_1(∇φ(x))    (11)

where det Hφ(x) represents the determinant of the Hessian matrix of second partial derivatives of φ. In our implementation we use a parametric formulation for the potential function, based on a linear combination of basis functions ρ:

φ(x) = |x|²/2 − Σ_{p ∈ Z} c(p) ρ(x − p)    (12)

where c are the coefficients (to be determined via optimization), and Z is the pixel grid for the image. For simplicity, we chose the basis function ρ to be a radially symmetric Gaussian of standard deviation σ, where σ is a user-defined parameter:

ρ(x) = 1/(√(2π) σ) e^{−‖x‖²/(2σ²)}    (13)

By combining Brenier's theorem with Eq. (12), the model for the spatial transformation is then given by:

f(x) = x − Σ_{p ∈ Z} c(p) ∇ρ(x − p).    (14)

Rather than solving the Monge-Ampère PDE directly, we propose to approximate the solution by iterative optimization of the related cost function (presented in discretized form):

ψ(c) = Σ_{p ∈ Z} ( Df(p) I_1(f(p)) − I_0(p) )²    (15)

Thus, the solution for the OT problem can be found by solving a registration-like image processing problem, under the model for the spatial transformation specified in Eqs. (13) and (14). Note that for the spatial transformation to remain invertible, the Jacobian of f, Hφ(x), must be positive definite throughout the domain of the image, and thus input images must be such that I_0 > 0, I_1 > 0. Finally, we note that the approach has similarities to other parametric image registration methods based on basis functions (Rohde, Aldroubi, & Dawant, 2003; Kybic & Unser, 2003; Rueckert et al., 1999).
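The parametric model of Eqs. (12)-(15) can be sketched in one dimension (a toy illustration, not the authors' 2D implementation; grid sizes and names are assumptions). With all coefficients zero the map is the identity, and for identical input images the discretized objective vanishes:

```python
import numpy as np

sigma = 1.0                           # basis-function standard deviation
centres = np.arange(-6.0, 6.5, 0.5)   # stand-in for the pixel grid Z

def rho(x):
    # Gaussian basis function (1D analogue of Eq. (13))
    return np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def drho(x):
    return -x / sigma**2 * rho(x)

def f_map(x, c):
    # Spatial transformation of Eq. (14): f(x) = x - sum_p c(p) rho'(x - p)
    return x - sum(cp * drho(x - p) for cp, p in zip(c, centres))

def objective(c, x, I0, I1):
    # Discretized cost of Eq. (15); in 1D the Jacobian determinant Df is f'
    fx = f_map(x, c)
    Df = np.gradient(fx, x)
    I1f = np.interp(fx, x, I1)
    return np.sum((Df * I1f - I0)**2)

x = np.linspace(-6, 6, 601)
I0 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
I1 = I0.copy()

# With all coefficients zero, f is the identity map, and for I1 = I0
# the objective vanishes
c0 = np.zeros(len(centres))
assert np.allclose(f_map(x, c0), x)
assert objective(c0, x, I0, I1) < 1e-20
```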
Numerical Optimization
The objective function ψ can be minimized using gradient descent, described by the following iterative update rule:

c_{t+1} = c_t − η ψ_c(c_t)    (16)

where η is the learning rate and ψ_c is the derivative of ψ with respect to c. We refer the reader to Appendix E for the full derivation of ψ_c for the case when I_0 and I_1 are 2D images.

It is well known that standard gradient descent can be very slow to converge to the optimum, and therefore alternative methods have been developed that can dramatically increase the rate of convergence (Nesterov, 1983; Duchi, Hazan, & Singer, 2011; Kingma & Ba, 2015). In this work, we use the method known as Adam (Kingma & Ba, 2015), which typically requires little tuning and has been shown to outperform other first-order gradient-based methods in neural networks. The Adam method estimates the first moment m_t and second moment v_t of the gradient:

m_t = β_1 m_{t−1} + (1 − β_1) ψ_c(c_t)
v_t = β_2 v_{t−1} + (1 − β_2) ψ_c(c_t)²    (17)

where β_1, β_2 ∈ [0, 1) are user-defined exponential decay rates.

Since m_t and v_t are initialized as zero, the moment estimates are biased towards zero. Kingma and Ba (Kingma & Ba, 2015) counteract this problem by computing bias-corrected estimates m̂_t and v̂_t:

m̂_t = m_t / (1 − β_1^t),   v̂_t = v_t / (1 − β_2^t)    (18)
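The moment estimates, bias correction, and the adaptive step that follows amount to the standard Adam loop. A minimal sketch on a toy quadratic objective (learning rate, step counts, and tolerances are illustrative):

```python
import numpy as np

def adam(grad, c0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    # Adam: biased moment estimates (Eq. 17), bias correction (Eq. 18),
    # and the adaptive coefficient update of Eq. (19)
    c = np.asarray(c0, dtype=float)
    m = np.zeros_like(c)
    v = np.zeros_like(c)
    for t in range(1, steps + 1):
        g = grad(c)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        c = c - eta * m_hat / (np.sqrt(v_hat) + eps)
    return c

# Toy objective psi(c) = (c - 3)^2, gradient psi_c(c) = 2 (c - 3)
grad = lambda c: 2 * (c - 3.0)

# After one step the parameter moves by roughly eta (sign-like first step)
c1 = adam(grad, np.array([0.0]), eta=0.05, steps=1)
assert abs(c1[0] - 0.05) < 1e-6

# After many steps it settles near the minimum at c = 3
c_star = adam(grad, np.array([0.0]), eta=0.05, steps=2000)
assert abs(c_star[0] - 3.0) < 0.1
```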
These estimates are then used to compute an adaptive learning rate. As a result, the coefficients c are updated as follows:

c_{t+1} = c_t − η m̂_t / (√v̂_t + ε)    (19)

The authors suggest default values of β_1 = 0.9, β_2 = 0.999, and ε = 10⁻⁸ (Kingma & Ba, 2015). We adopt these values throughout this work.

Multi-Scale Approach
The efficiency of the minimization relies not only on the choice of gradient descent algorithm, but also on the position from which the optimization is started. Generally, we can assume that an image comprises a hierarchy of scales. Low-frequency features, such as the general shapes within the image, are found at the coarsest scale, and high-frequency features, such as noise, are found at the finest scale (Paquin, Levy, Schreibmann, & Xing, 2006). We can guide our method towards the globally optimum potential φ* by computing the optimum potential φ*_i at scale i, and using the result as the initial position for the gradient descent at the next, finer scale. For N scales, the globally optimum potential φ* can be expressed as follows:

φ*(x) = Σ_{i=1}^{N} φ*_i(x)    (20)

By extension, the optimal transport map f* is given by:

f*(x) = x − Σ_{i=1}^{N} ∇φ*_i(x)    (21)

In this work, we generated a hierarchy of image scales using a Gaussian pyramid. The Gaussian pyramid is constructed by smoothing the images I_0 and I_1 with a Gaussian filter, and subsampling the smoothed images by a factor of two in each direction (Adelson, Anderson, Bergen, Burt, & Ogden, 1984). An illustration of the result is depicted in Fig. 1. During the multi-scale gradient descent, moving from a coarse scale to a finer scale requires upsampling the potential φ*_i. Unlike the multi-scale approach developed by Kundu et al. (Kundu et al., 2018), which requires the transport maps to be interpolated from one scale to the next, the potential φ*_i in our method can be upsampled using Eq. (12) with appropriate values for x.

Experiments

Image Warping

Data Preparation.
We tested our proposed single-potential optimal transport method on three separate datasets: the extended Yale Face Database B (Lee, Ho, & Kriegman, 2005), the LONI Probabilistic Brain Atlas dataset (Shattuck et al., 2008), and a random sample of images from ImageNet (Deng et al., 2009). In this section, we briefly describe the datasets and any pre-processing that was conducted prior to our experiments.

Figure 1. An illustration of a Gaussian pyramid used in the multi-scale version of our single-potential optimal transport method.
YaleB.
The extended Yale Face Database B (Lee et al., 2005) (abbreviated to YaleB) comprises 1710 grayscale images of 38 individuals under 45 different lighting conditions. The authors of the dataset aligned the images to one another using the locations of the eyes as landmarks. The images were also cropped to exclude hair and ears (Lee et al., 2005). In order to limit the number of image matches, we used a single cropped image from each subject under direct illumination, resulting in a dataset of 38 images. Exemplar images from our reduced YaleB dataset are shown in Fig. 2a. Prior to performing the warpings, the images were normalized so that the sum of intensities was the same in every image. This ensured that the total mass constraint in Brenier's theorem was satisfied. Identically to Kundu et al. (Kundu et al., 2018), we normalized the images to a large value (10 ) in order to avoid numerical precision errors due to small numbers. We also added a small constant (0.1) to the normalized images so that they were strictly positive.

LPBA40.
Brain image data from 40 subjects were used by the Laboratory of Neuro Imaging (LONI) at UCLA to construct the LONI Probabilistic Brain Atlas (LPBA) (Shattuck et al., 2008). The three-dimensional magnetic resonance images were skull-stripped and aligned to a standard space using rigid-body transformations (Klein et al., 2009). In this work, we used the central axial slice from the 40 pre-processed volumes, and normalized these 2D slices in an identical manner to the YaleB data. Exemplar LPBA40 images are shown in Fig. 2b.
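The mass normalization applied to the YaleB and LPBA40 images can be sketched as follows; the target total used here (10⁵) is an assumed placeholder for the large constant in the text, while the 0.1 offset follows the description above:

```python
import numpy as np

def normalize_mass(img, total=1e5, offset=0.1):
    # Scale intensities to a common (large) sum, then add a small constant
    # so the image is strictly positive; `total` is an assumed placeholder
    img = np.asarray(img, dtype=float)
    return img / img.sum() * total + offset

rng = np.random.default_rng(0)
a = normalize_mass(rng.random((64, 64)) * 255)   # e.g. an 8-bit image
b = normalize_mass(rng.random((64, 64)))

# Same-sized images end up with equal total mass and strictly positive
# pixels, as required by the total mass condition in Brenier's theorem
assert np.isclose(a.sum(), b.sum())
assert a.min() > 0 and b.min() > 0
```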
Figure 2. Sample images after pre-processing from (a) the YaleB dataset, (b) the LPBA40 dataset, and (c) the ImageNet dataset.
ImageNet.
The ImageNet database consists of 3.2 million color images spread over 5247 different categories (Deng et al., 2009). The large dataset is typically used to train and test machine learning methods for object detection and recognition. However, in our work, we used a randomly selected subset of 40 images from the database. Prior to use in our experiments, we converted the subset of ImageNet images to grayscale and resized them to 256 × 256 pixels. The images were normalized using the same procedure as the YaleB and LPBA40 data. Sample ImageNet images after pre-processing are shown in Fig. 2c.
Warping Experiments.
As indicated in the Numerical Implementation section, our single-potential optimal transport (SPOT) method has three user-defined parameters: the standard deviation σ of the Gaussian basis function, the learning rate η for the gradient descent, and the number of scales N. In order to assess the effect of computing the optimal transport maps using a multi-scale approach, we compared SPOT at a single image scale (i.e. N = 1) with its multi-scale counterpart (multi-SPOT). To determine σ and η, we split each dataset into a training and a test set. The three training sets comprised five images randomly selected from the corresponding dataset. For each of the training sets, we computed the optimal transport map from every image to every other image in the set, for a range of parameter values. This produced 20 (i.e. 5² − 5) transport maps for each set of parameters. The parameters that resulted in the lowest mean value of the objective function (Eq. (15)) across the 20 mappings for a given dataset were used to test (multi-)SPOT on that dataset.

The SPOT and multi-SPOT methods were tested using the data excluded from the parameter optimization. For each of the three test datasets, we computed the optimal transport map from every image to every other image in the set. This resulted in 1056 (33² − 33) transport maps for the YaleB dataset, and 1190 (35² − 35) transport maps for the LPBA40 and ImageNet datasets.
Evaluation Measures.
Following the computation of the transport maps, we evaluated the SPOT and multi-SPOT methods by measuring the relative mean squared error (MSE) between Df(x) I_1(f(x)) and I_0, the proportion of mass transported, and the mean absolute curl of the optimal transport map. The measures are defined as follows:

Relative MSE = ∫_X ( Df(x) I_1(f(x)) − I_0(x) )² dx / ∫_X ( I_1(x) − I_0(x) )² dx    (22)

Mass transported = ∫_X ‖f(x) − x‖ I_0(x) dx / ∫_X I_0(x) dx    (23)

Mean absolute curl = (1/|X|) ∫_X ‖∇ × f(x)‖ dx    (24)

For all three measures, the smaller the value, the closer the computed map is to the true OT map.

In order to assess whether the relative MSEs of the SPOT and multi-SPOT methods were significantly different from one another, we performed a paired t-test on the relative MSE results. Furthermore, we also performed paired t-tests for the mass transported and the mean absolute curl.

Comparison to Other Optimal Transport Methods.
We compared our single-scale SPOT method to the three methods outlined in the following sections: the flow minimization (FM) method by Haker et al. (Haker et al., 2004), the dual problem minimization (DPM) method by Chartrand et al. (Chartrand, Wohlberg, Vixie, & Bollt, 2009), and a single-scale version of the variational optimal transport (VOT) method by Kundu et al. (Kundu et al., 2018). We also compared our multi-SPOT method to the multi-scale variational optimal transport method (multi-VOT) (Kundu et al., 2018). The optimum parameters for the FM, DPM, VOT, and multi-VOT methods were determined using the approach described above. We then repeated the image warping experiments for all of the optimal transport methods, and evaluated the resulting transport maps using the measures defined in Eqs. (22)-(24).
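The evaluation measures of Eqs. (22)-(24) can be discretized on a pixel grid as in the following sketch (finite differences and nearest-neighbour warping are simplifying assumptions); for the identity map, the mass transported and the curl are zero, and the relative MSE is exactly one:

```python
import numpy as np

def evaluate(fx, fy, I0, I1, dx):
    # Evaluation measures of Eqs. (22)-(24) for a 2D map f = (fx, fy)
    # sampled on a regular grid with spacing dx
    ys, xs = np.meshgrid(*[np.arange(n) * dx for n in I0.shape], indexing="ij")
    # Jacobian determinant of f by central finite differences
    dfx_dx = np.gradient(fx, dx, axis=1); dfx_dy = np.gradient(fx, dx, axis=0)
    dfy_dx = np.gradient(fy, dx, axis=1); dfy_dy = np.gradient(fy, dx, axis=0)
    Df = dfx_dx * dfy_dy - dfx_dy * dfy_dx
    # Warp I1 by f (nearest-neighbour lookup keeps the sketch short)
    ix = np.clip(np.rint(fx / dx).astype(int), 0, I1.shape[1] - 1)
    iy = np.clip(np.rint(fy / dx).astype(int), 0, I1.shape[0] - 1)
    I1f = I1[iy, ix]
    rel_mse = np.sum((Df * I1f - I0)**2) / np.sum((I1 - I0)**2)   # Eq. (22)
    mass = np.sum(np.hypot(fx - xs, fy - ys) * I0) / np.sum(I0)   # Eq. (23)
    curl = np.mean(np.abs(dfy_dx - dfx_dy))                       # Eq. (24)
    return rel_mse, mass, curl

n, dx = 32, 1.0
ys, xs = np.meshgrid(np.arange(n) * dx, np.arange(n) * dx, indexing="ij")
rng = np.random.default_rng(0)
I0 = rng.random((n, n)) + 0.1
I1 = rng.random((n, n)) + 0.1

# The identity map transports no mass, has zero curl, and leaves the
# relative MSE at exactly one
rel_mse, mass, curl = evaluate(xs, ys, I0, I1, dx)
assert np.isclose(rel_mse, 1.0) and mass == 0.0 and curl == 0.0
```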
Flow Minimization (FM).
Haker et al. (Haker et al., 2004) proposed a flow minimization method to compute the optimal transport map from the Monge formulation of the optimal transport problem. The method first obtains an initial transport map f using the Knothe-Rosenblatt rearrangement (Rosenblatt, 1952; Knothe, 1957). If the initial map f is considered to be a smooth vector field, Helmholtz's theorem can be used to resolve f into the sum of a curl-free and a divergence-free vector field:

f = ∇φ + χ

The method by Haker et al. (Haker et al., 2004) uses the Helmholtz decomposition of f, and removes the curl by minimizing the objective function in Eq. (1) with cost function c(x, f(x)) = ‖f(x) − x‖². The minimization is achieved using a gradient descent scheme:

f_{t+1}(x) = f_t(x) + (η / I_0(x)) Df_t(x) ( f_t(x) − ∇(Δ^{-1} ∇ · f_t(x)) )

where Δ is the Laplace operator and ∇ · (·) denotes the divergence operator.

Minimizing the Dual Problem (DPM).
To compute the optimal transport map between two images, Chartrand et al. (Chartrand et al., 2009) use Kantorovich's dual form of the problem (Kantorovich, 1948):

K_D(u, v) = sup_{u,v} ∫_X u(x) I₀(x) dx + ∫_Y v(y) I₁(y) dy

over all integrable functions u and v satisfying

u(x) + v(y) ≤ c(x, y).

By considering the cost as a strictly convex function c(x, y) = ½‖x − y‖², and redefining u(x) → ½‖x‖² − u(x) and v(y) → ½‖y‖² − v(y), the resulting problem is to minimize:

K_D(u, v) = ∫_X u(x) I₀(x) dx + ∫_Y v(y) I₁(y) dy

where u and v satisfy:

u(x) + v(y) ≥ x · y

Chartrand et al. (Chartrand et al., 2009) apply the proposition that K_D has a unique minimizing pair of functions u and v that are convex conjugates of one another (i.e. u(x) = v*(x) and v(y) = u*(y)). As a result, the derivative of K_D can be found, and thus the authors develop a gradient descent approach to update u and compute the optimal transport map:

u^{t+1} = u^t − η dK_D(u)/du

where η is the learning rate. For a full derivation, we refer the reader to (Chartrand et al., 2009).

Variational Optimal Transport (VOT).
Similarly to the work of Haker et al. (Haker et al., 2004), Kundu et al. (Kundu et al., 2018) compute the optimal transport map from the Monge formulation of the optimal transport problem. However, the authors modify the objective function in Eq. (1) to help guide the solution towards a curl-free mapping. The optimization problem is relaxed further to incorporate the mass-preserving constraint in Eq. (3), resulting in the following objective function:

ψ(f) = ∫_X ‖f(x) − x‖² I₀(x) dx + γ ∫_X ‖∇ × f(x)‖² dx + λ ∫_X (D(x) I₁(f(x)) − I₀(x))² dx

where γ and λ are regularization coefficients, and ∇ × (·) is the curl operator. The objective function is non-convex, so Kundu et al. (Kundu et al., 2018) use a variational optimization technique to solve it. The authors use the Euler-Lagrange equations for the objective function ψ to derive its gradient ψ_f, and use Nesterov's accelerated gradient descent method (Nesterov, 1983) to compute the optimal transport map. Moreover, a multi-scale technique is employed to guide the method towards the globally optimum solution.

Image Classification
In order to demonstrate the linear separability property of the LOT transform, we tested the SPOT method on five classification tasks: determining the number of Gaussian peaks in an image, distinguishing between animal faces, identifying the shape of galaxies, classifying liver nuclei as cancerous or benign, and recognizing handwritten digits. It should be noted that our aim is not to determine the optimal classification method for each task, but rather to experimentally validate the linear separability theorem proposed in Section .
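The evaluation protocol shared by these tasks, three linear classifiers assessed with stratified ten-fold cross-validation, and PCA ahead of the discriminant classifier, can be sketched as follows. This is an illustrative stand-in assuming scikit-learn, with plain LDA substituted for the penalized LDA used in the paper; function and variable names are hypothetical:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_features(X, y, n_splits=10, seed=0):
    """Mean/std accuracy of three linear classifiers with stratified
    k-fold cross-validation. X holds either raw images or potentials,
    flattened to one row per sample."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    classifiers = {
        'SVM': LinearSVC(),
        'LR': LogisticRegression(max_iter=1000),
        # PCA() keeps all components, a linear, information-preserving step
        'LDA': make_pipeline(PCA(), LinearDiscriminantAnalysis()),
    }
    results = {}
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=cv)
        results[name] = (scores.mean(), scores.std())
    return results
```

Because both PCA and the classifiers here are linear, a higher accuracy on the potentials than on the raw images is evidence of improved linear separability, which is exactly what the experiments below measure.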
Synthetic Gaussian Data.
Using the same terminology as Section , the "mother" densities for the image classes in this dataset are a single 2D Gaussian, a mixture of two 2D Gaussians, and a mixture of three 2D Gaussians. We generated 1000 128 × 128 images, and computed the potential φ between each image I_i and a uniform reference density I₀. As the linear separability theorem does not prescribe an optimal classifier, we assessed the classification accuracy of the potential φ using multiple classifiers: a linear support vector machine (SVM), logistic regression (LR), and penalized Fisher's linear discriminant analysis (PLDA) (Wang et al., 2011). Moreover, we employed a ten-fold stratified cross-validation scheme. The dataset was randomly divided into ten subsets, each containing the same proportion of images from each class. Nine subsets were used to train the linear classifier, and the remaining subset was used as the test data. This process was repeated until all ten subsets had been used as the test set. We then computed the mean classification accuracy and standard deviation. For comparison, the classification experiments were conducted on the original images, as well as the potential.

Figure 3. Sample images from each class of the Gaussian dataset. The task is to classify the number of Gaussian peaks in an image.

Although classifiers such as SVM and logistic regression are more commonly used than PLDA in the literature, PLDA computes a low-dimensional projection of the data. Therefore, in addition to a quantitative evaluation of the linear separability, PLDA can provide a qualitative visual assessment of the linear separability by projecting the high-dimensional data into two dimensions. Computing PLDA directly on the images or potential involves inverting a matrix of size 128² × 128².
This is computationally expensive, and therefore, prior to classification using PLDA, the size of the input data was reduced using principal component analysis (PCA). In all of our classification experiments, we used the maximum number of principal components (equal to the number of images, in this case) to transform the data. Mathematically, this should have no effect on the classification results, since PCA is a linear transformation.

Animal Faces.
The LHI-Animal-Faces dataset comprises face images from 20 different classes of animals (Si & Zhu, 2012). To complicate the classification of the animals, the categories exhibit a large range of within-class variation, including rotation, posture variation, and different animal sub-types (e.g. deer with and without antlers). In this work, we used a subset of three classes, namely cats, deer, and pandas.

In order to better fit the data requirements of the linear separability theorem, the images were pre-processed using the same method as (Kolouri, Park, & Rohde, 2016). Firstly, an edge map was generated by applying Canny edge detection to the images. Secondly, the edge maps were smoothed with a Gaussian filter. Example images before and after pre-processing are shown in Fig. 4.

Figure 4. Sample images from the animal faces dataset (left), and the corresponding smoothed edge maps used in the classification experiment described in Section (right).

Following pre-processing, we computed the optimal transport map between each image and a uniform reference image using our SPOT algorithm. Identically to the Gaussian data experiment, we then assessed the classification accuracy of the potential φ using SVM, logistic regression, and PLDA classifiers with stratified ten-fold cross-validation. For comparison, we also computed the classification accuracy using the pre-processed images.

Galaxy Shapes.
In (Shamir, 2009), Shamir proposed a method that can automatically classify images of elliptical, spiral, and edge-on galaxies. To test the method, the author used a random subset of images from the Galaxy Zoo project (Lintott et al., 2008): 225 elliptical galaxies, 224 spiral galaxies, and 75 edge-on galaxies. We used the same subset of Galaxy Zoo data in this work; however, prior to computing the transport maps between the images and a uniform reference, we converted the color images to grayscale, centered the galaxy within each image, and removed background noise by segmenting the galaxy using a thresholding technique. Example pre-processed images are shown in Fig. 5. As before, we compared the classification accuracy of the potential and the original images using three different linear classifiers.

Liver Nuclei.
Another classification task is to classify liver cell nuclei as either fetal-type hepatoblastoma (a type of cancer) or benign. The dataset consists of nuclei images from 17 cancer patients and 26 healthy controls. Each patient has between 60 and 187 nuclei images (mean: 96, standard deviation: 16), resulting in 4145 images in total. Although there is no guarantee that all of the cells from the cancer patients
Figure 5. Sample images from each class of the galaxy dataset after pre-processing.
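The galaxy pre-processing (grayscale conversion, background removal, centering) can be sketched as below. The paper does not specify its exact thresholding technique, so a simple fractional threshold and center-of-mass centering are used here as hypothetical stand-ins:

```python
import numpy as np
from scipy import ndimage

def preprocess_galaxy(rgb, thresh_frac=0.1):
    """Grayscale -> zero out background below a fraction of the peak
    intensity -> shift the object's center of mass to the image center.
    An illustrative sketch, not the authors' exact pipeline."""
    gray = rgb.mean(axis=2) if rgb.ndim == 3 else rgb.astype(float)
    gray = np.where(gray > thresh_frac * gray.max(), gray, 0.0)
    cy, cx = ndimage.center_of_mass(gray)
    shift = (gray.shape[0] / 2.0 - cy, gray.shape[1] / 2.0 - cx)
    return ndimage.shift(gray, shift, order=1, mode='constant')
```

Centering matters here because the transport map to a fixed reference otherwise spends most of its "budget" undoing translation rather than encoding shape.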
are cancerous, we used the subject label (cancer or benign) as the ground truth for each image. Example images from each class are shown in Fig. 6.

Figure 6. Sample images from both classes of the liver nuclei dataset. Since the patient classes (cancer or healthy) were used as the ground truth labels for the cells, there is no guarantee that all of the nuclei in the "cancer" class are cancerous.

Unlike the other classification experiments, in which the transport maps were computed using a uniform reference image, for this task we used a wide Gaussian distribution as the reference. The Gaussian had a similar diameter to the nuclei, resulting in a more task-specific reference image. The remainder of the classification experiment was identical to those described in the previous sections.
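The two kinds of reference densities used in these experiments, a uniform density and a wide Gaussian matched to the object scale, can be generated as follows. A minimal sketch with hypothetical helper names, normalizing each reference to a prescribed total mass:

```python
import numpy as np

def uniform_reference(shape, total_mass=1.0):
    """Uniform reference density with prescribed total mass."""
    I = np.ones(shape)
    return I * (total_mass / I.sum())

def gaussian_reference(shape, sigma, total_mass=1.0):
    """Centered isotropic Gaussian reference; sigma sets the diameter
    (e.g. comparable to the nuclei in the liver experiment)."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    cy, cx = (shape[0] - 1) / 2.0, (shape[1] - 1) / 2.0
    I = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * sigma ** 2))
    return I * (total_mass / I.sum())
```

Normalizing the reference to the same total mass as the input images is required by the mass-preservation constraint of the Monge problem; without it no transport map exists.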
MNIST.
The final classification task is to assess the linear separability theorem on the MNIST database of handwritten digits. MNIST has become a benchmark dataset for testing machine learning methods, and by using deep neural networks, classification errors of less than 0.25% have been reported in the literature (Wan, Zeiler, Zhang, LeCun, & Fergus, 2013; Cireşan, Meier, & Schmidhuber, 2012). The aim of this work is not to present state-of-the-art classification results on MNIST, but rather, to assess the ability of linear classifiers to distinguish between the classes in potential space compared to image space.

The dataset comprises 60000 training images and 10000 test images; however, to be consistent with our other classification experiments, we used ten-fold cross-validation on the entire set of 70000 images when evaluating the three linear classifiers in image space and potential space. The transport maps (and therefore, the potential) were computed using a uniform reference image I₀.

Table 1
The optimum parameters for the single-scale and multi-scale versions of our SPOT method, across all three datasets. The multi-SPOT parameters are ordered from the coarsest scale to the finest scale, with σ given in units of pixels at the original (finest) image resolution.

Dataset     SPOT σ   SPOT η   Multi-SPOT σ   Multi-SPOT η
YaleB       1        0.01     { , , }        {1, 0.1, 0.01}
LPBA40      1        0.01     { , , }        {0.01, 0.01, 0.01}
ImageNet    1        0.01     { , , }        {1, 0.1, 0.01}

Results

Image Warping Results
The optimum SPOT and multi-SPOT parameters for each dataset are presented in Table 1. The table shows that a standard deviation of σ = 1 and a learning rate of η = 0.01 were selected at the finest resolution across all datasets. In contrast to the LPBA40 dataset, the optimum values for σ and η are larger at the coarser scales for the YaleB and ImageNet datasets.

Figure 7 compares the mean and standard deviation of the relative MSE and mass transported for the SPOT and multi-SPOT methods. For all of the datasets tested, multi-SPOT resulted in a statistically lower mean relative MSE than SPOT (p < 0.01), and the mass transported by the two methods was also significantly different (p < 0.01).

Figure 7. Mean and standard deviation of the (a) relative MSE and (b) mass transported for the single- and multi-scale SPOT methods on all three datasets. The results for the mean absolute curl have been omitted, since both methods achieved zero curl on all of the datasets.

Figure 8 shows example results for the single-scale optimal transport methods on all three datasets. The central columns show the warped input image D_f(x) I₁(f(x)) for each method, after attempting to map I₁ to I₀. The relative MSE of each result is displayed below the warped image. The top and bottom rows of each subfigure show the mapping that resulted in the lowest and highest relative MSE, respectively, for the SPOT method. In all of the examples in Fig. 8, the relative MSE for the SPOT method was lower than the corresponding relative MSEs for the other single-scale optimal transport methods. Moreover, the lowest relative MSE for the SPOT method was two orders of magnitude smaller than the corresponding relative MSEs of the other methods.

Analogously to Fig. 8, Fig. 9 shows example results for the multi-scale optimal transport methods (multi-VOT and multi-SPOT) on all three datasets. The second and third columns show the warped input image D_f(x) I₁(f(x)) for each method, after attempting to map I₁ to I₀. The relative MSE of each result is displayed below the warped image. The top and bottom rows of each subfigure show the mapping that resulted in the lowest and highest relative MSE, respectively, for the multi-SPOT method.
Similarly to the single-scale optimal transport results, in all three datasets the lowest relative MSE for the multi-SPOT method was an order of magnitude lower than the corresponding relative MSE for the multi-VOT method. However, the cases that produced the highest relative MSE for the multi-SPOT method resulted in a lower relative MSE in the multi-VOT method.

The mean and standard deviation of the error measures for the single-scale optimal transport methods (FM, DPM, VOT, and SPOT) are presented in Fig. 10. For clarity, large bars have been truncated, and the true mean value of the error measure is displayed above the bar. Across all three datasets, SPOT resulted in a significantly lower mean relative MSE than the other single-scale optimal transport methods (p < 0.01, corrected for multiple comparisons). The standard deviation of the relative MSE was also smaller for SPOT compared to the other methods. A similar trend is observed for the mass transported in Fig. 10b, except that the VOT method resulted in a significantly lower mean mass transported than SPOT (p < 0.01) for the ImageNet dataset. Since the SPOT and DPM methods are based on Brenier's theorem and derive the transport maps from a potential, the mean and standard deviation of the mean absolute curl were universally zero for those two methods.

Figure 11 shows the mean and standard deviation of the error measures for the multi-scale optimal transport methods (multi-VOT and multi-SPOT). Identically to Fig. 10, large bars have been truncated, and the true mean value of the error measure is displayed above the bar. For convenience of comparison, the mass transported and mean absolute curl are shown at the same scale as Fig. 10; however, the relative MSE is shown at a smaller scale to reflect the lower relative MSEs for the multi-scale optimal transport methods. For all error measures and datasets, multi-SPOT resulted in a significantly lower mean value than multi-VOT (p < 0.01).

Classification Results
The mean and standard deviation of the classification accuracy from the five classification tasks are presented in Table 2. For the task in which the classifiers have to determine the number of Gaussian peaks in an image, the mean classification accuracy of the original image data was only marginally higher than chance for all three classifiers. In contrast, when classifying the data using the potential φ, the mean accuracy was around 90%. This result is illustrated in Fig. 12, where the classes are indistinguishable in the two-dimensional PLDA projection of the image data, but are almost perfectly linearly separable in the potential space.

Although the increase in classification accuracy between image and potential space is smaller for the animal faces dataset than the Gaussians dataset, a higher accuracy was achieved across all three classifiers when using the potential. Interestingly, the accuracy of the logistic regression on the animal faces image data was considerably lower than the accuracy obtained using a SVM or PLDA on the same data. In contrast, all three linear classifiers resulted in a very similar classification accuracy when using the potential.

Figure 8. Example results for the single-scale optimal transport methods on (a) the YaleB dataset, (b) the LPBA40 dataset, and (c) the ImageNet dataset. Columns 2-5 show the warped input image D_f(x) I₁(f(x)) for each method after attempting to map I₁ to I₀. The relative MSE of each result is displayed below the image. The top and bottom rows of each subfigure show the mapping that resulted in the lowest and highest relative MSE, respectively, for the SPOT method.

Figure 9. Example results for the multi-scale optimal transport methods on (a) the YaleB dataset, (b) the LPBA40 dataset, and (c) the ImageNet dataset. Columns 2-3 show the warped input image D_f(x) I₁(f(x)) for each method after attempting to map I₁ to I₀. The relative MSE of each result is displayed below the image. The top and bottom rows of each subfigure show the mapping that resulted in the lowest and highest relative MSE, respectively, for the multi-SPOT method.

For the galaxies dataset, the potential only resulted in a higher mean accuracy than image space for two out of the three classifiers. However, the accuracies in image space and potential space were very close for the cases where classification in image space achieved a higher accuracy. Finally, for the MNIST dataset, classification in potential space resulted in a lower accuracy than image space using the PLDA classifier. The results for the SVM and logistic regression classifiers were almost identical in both image space and potential space, with image space achieving a marginally smaller standard deviation.

Discussion
In the first set of experiments, we compared our numerical implementation of the LOT transform, based on the solution of the Monge-Ampère equation, to existing optimal transport methods from the literature. We showed that our SPOT method is able to provide a curl-free mapping between two images, which results in a lower error than other methods. This is highlighted by the high-quality image matches in Figs. 8 and 9, as well as the plots of mean MSE in Figs. 10 and 11. For all four methods tested in Section , the lowest mean relative MSE was obtained on the LPBA40 dataset. This could be due to the relatively small deformations and intensity changes required to match the brain images compared to the face images and ImageNet data. This is also reflected in the optimum multi-SPOT parameters presented in Table 1. Wider Gaussian basis functions are able to capture larger deformations, and larger σ values were selected for the coarsest scales of the YaleB and ImageNet datasets compared to the LPBA40 dataset.

Despite achieving a lower mean mass transported than the multi-VOT method, multi-SPOT resulted in a larger mean mass transported than its single-scale equivalent. This is possibly because our cost function in Eq. (15) does not directly penalize the mass transported.

Table 2
Mean ± standard deviation classification accuracy on the images and potential φ using a linear SVM, logistic regression, and PLDA. The best results for each classifier are highlighted in bold font.

Dataset        SVM Image   SVM φ   LR Image   LR φ   PLDA Image   PLDA φ
Gaussians      0.348 ±
Animal faces   0.714 ±
Galaxies       0.636 ±
Liver nuclei
MNIST

[...] affected the classification result. Other than the PLDA classifier on the MNIST dataset, classification on the potential achieved a similar, if not higher, accuracy than classification in image space using the same linear classifier. This suggests that classification in LOT space can be beneficial on real data as well as synthetic examples. Moreover, the relatively lower classification accuracy of the PLDA classifier on the MNIST dataset could be due to the handwritten digits not fulfilling the mathematical requirements of the linear separability theorem. For example, the confusion matrix of classification results indicates that images containing the digit "4" were more commonly misclassified as "9" in potential space compared to image space. Example images of 4's and 9's that were correctly classified by the PLDA classifier in image space and potential space are shown in Figs. 13a and 13b. Images that were correctly identified as 4's in image space, but incorrectly classified as 9's in potential space, are shown in Fig. 13c. In comparison to the correctly classified 4's, the misclassified 4's appear to be less angular, with the top part of the digit forming a loop similar to several of the correctly identified 9's. This suggests that certain classes in the MNIST dataset are not completely disjoint, and thus do not meet the requirements of the linear separability theorem outlined in Section .

Summary and Conclusions
In this work we have presented a pattern-theoretic approach for modeling and inference from image data based on the mathematics of optimal transport. Inspired by the work described in (Park et al., 2017; Kolouri, Park, & Rohde, 2016), we have framed the linear optimal transport approach developed in (Wang et al., 2013; Basu, Kolouri, & Rohde, 2014; Kolouri, Tosun, et al., 2016) as a nonlinear signal transform, and described some of its mathematical properties. The transform involves computing a transport map between a given image and a chosen reference. Hence, in contrast to linear signal transforms such as the Fourier and wavelet transforms, the LOT transform allows for encoding of pixel displacements, and thus allows for the comparison of pixel intensities at non-fixed image coordinates.

It should be noted that the computational results shown here relied on a numerical implementation of the underlying (Monge) OT problem, where the transport map was modeled directly as the gradient of a convex potential function φ. Furthermore, the potential function was itself modeled as a linear sum of smooth basis functions. We showed numerical comparisons to other numerical (Monge) OT approaches in order to verify the accuracy of the computed solutions. We clarify, however, that the LOT signal transformation described here could make use of any numerical solver for actual implementation.

We also note that the nature of the numerical implementation we described here is, in a way, incompatible with the signal transformation properties (e.g. translation) shown earlier. This is because we have used a numerical PDE approach which requires both source and target images to be fixed grids in the square domain [0, 1]². This could explain in part some loss in the ability of the numerical transformation method to make signal classes more linearly separable, though the numerical classification results suggest that at times the signal transformation framework can be valuable in making the pattern recognition problem simpler to solve.

Figure 10. Mean and standard deviation of the (a) relative MSE, (b) mass transported, and (c) mean absolute curl, for the single-scale optimal transport methods on all three datasets. For clarity, large bars have been truncated, and the true mean value of the error measure is displayed above the bar.

We believe the contributions presented here could enhance the theoretical foundations of the emerging transport and other Lagrangian signal transformation methods that have recently been employed in a variety of applications, including modeling brain anatomy (Kundu et al., 2018) and cell morphology (Basu et al., 2014), cancer detection (Ozolek et al., 2014; Tosun et al., 2015; Hanna, Liu, Rohde, & Singh, 2017), communications (Park et al., 2018), modeling turbulence in imaging experiments (Nichols et al., 2017), inverse problems (Kolouri & Rohde, 2015; Cattell, C.H, Epstein, & Rohde, 2017), and other applications including machine learning and applied statistics (Kolouri et al., 2017).
Figure 11. Mean and standard deviation of the (a) relative MSE, (b) mass transported, and (c) mean absolute curl, for the multi-scale optimal transport methods on all three datasets. For clarity, large bars have been truncated, and the true mean value is displayed above the bar. It should also be noted that the relative MSE in (a) is shown at a smaller scale than in Figure 10a.
Acknowledgements
The authors gratefully acknowledge funding from the National Science Foundation (CCF 1421502) and the National Institutes of Health (R01 GM090033) in contributing to this work.

Appendix A
Proof of the Translation Property

Consider two positive density functions I₀ and I₁ defined on measure spaces X ⊆ Rᵈ and Y ⊆ Rᵈ, respectively. We assume that the total mass associated with X is equal to the total mass associated with Y, such that:

∫_X I₀(x) dx = ∫_Y I₁(y) dy    (25)

The optimal transport problem is to find the least-cost transformation that is also a mass-preserving map f : X → Y mapping one density to the other:

∫_A I₀(x) dx = ∫_{f(A)} I₁(y) dy    ∀ A ⊂ X    (26)

Figure 12. Two-dimensional PLDA projection of (a) the original Gaussian image data, and (b) its potential φ. In image space, the classes are indistinguishable, whereas in potential space they are almost perfectly linearly separable. The results shown here correspond to a single set of test data from the ten-fold cross-validation.

Figure 13. Images from the MNIST dataset that were (a) correctly identified as the digit "4" by the PLDA classifier in both spaces, (b) correctly identified as the digit "9" by the PLDA classifier in both spaces, and (c) correctly classified as 4's by the PLDA classifier in image space, but misclassified as 9's in potential space.

If we consider f to be the gradient of a convex potential φ, such that f(x) = ∇φ(x), then by Brenier's theorem, there exists a unique (up to a constant) transport map f that will satisfy the equation above. Now let I_μ represent a translation of I₁ by μ, such that I_μ(y) = I₁(y − μ). In the same manner as above, we can solve for f_μ : X → Y:

∫_A I₀(x) dx = ∫_{f_μ(A)} I_μ(y) dy = ∫_{f_μ(A)} I₁(y − μ) dy    (27)

By substituting u = y − μ into Eq. (27) via the change of variables theorem, and equating Eqs. (26) and (27), we get:

∫_{f(A)} I₁(y) dy = ∫_{f_μ(A) − μ} I₁(u) du    (28)

Since the integrals are equal, we have f_μ(x) = f(x) + μ.
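In one dimension the optimal map has the closed form f = F₁⁻¹ ∘ F₀, the monotone rearrangement of the cumulative distributions, which makes the translation property f_μ(x) = f(x) + μ easy to verify numerically. The check below is our own illustration, with hypothetical densities chosen only for this purpose:

```python
import numpy as np

def ot_map_1d(I0, I1, x):
    """1-D optimal transport map via CDF matching: f = F1^{-1} o F0."""
    F0 = np.cumsum(I0); F0 = F0 / F0[-1]
    F1 = np.cumsum(I1); F1 = F1 / F1[-1]
    return np.interp(F0, F1, x)   # invert F1 by linear interpolation

x = np.linspace(-4.0, 4.0, 4001)
I0 = np.exp(-x**2 / 2.0)                  # reference density
I1 = np.exp(-(x - 0.5)**2 / 0.8)          # target density
mu = 0.7
f = ot_map_1d(I0, I1, x)                  # map I0 -> I1
f_mu = ot_map_1d(I0, np.exp(-((x - mu) - 0.5)**2 / 0.8), x)  # map to translated target
# away from the grid boundary, f_mu(x) agrees with f(x) + mu
```

The agreement degrades only near the edges of the truncated grid, which is consistent with the fixed-domain limitation of the numerical implementation discussed in the conclusions.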
If we define the transform of density I₁ with respect to the reference density I₀ as Î₁ = (f(x) − x)√I₀(x), then:

Î_μ = (f_μ(x) − x)√I₀(x) = (f(x) − x + μ)√I₀(x) = Î₁(x) + μ√I₀(x)    (29)

Appendix B
Proof of the Scaling Property

Consider two positive density functions I₀ and I₁ defined on measure spaces X ⊆ Rᵈ and Y ⊆ Rᵈ, respectively. We assume that the total mass associated with X is equal to the total mass associated with Y, such that:

∫_X I₀(x) dx = ∫_Y I₁(y) dy    (30)

The optimal transport problem is to find a mass-preserving map f : X → Y that maps one density to the other:

∫_A I₀(x) dx = ∫_{f(A)} I₁(y) dy    ∀ A ⊂ X    (31)

If we consider f to be the gradient of a convex potential φ, then by Brenier's theorem, there exists a unique transport map f that will satisfy the equation above. Now let I_σ represent a scaling of I₁ by σ, such that I_σ(y) = σ I₁(σy). In the same manner as above, we can solve for f_σ : X → Y:

∫_A I₀(x) dx = ∫_{f_σ(A)} I_σ(y) dy = ∫_{f_σ(A)} σ I₁(σy) dy    (32)

By substituting u = σy into Eq. (32) via the change of variables theorem, and equating Eqs. (31) and (32), we get:

∫_{f(A)} I₁(y) dy = ∫_{σ f_σ(A)} I₁(u) du    (33)

Since the integrals are equal, we have f_σ(x) = f(x)/σ. If we define the transform of density I₁ with respect to the reference density I₀ as Î₁ = (f(x) − x)√I₀(x), then:

Î_σ = (f_σ(x) − x)√I₀(x) = (1/σ)(f(x) − x + x − σx)√I₀(x) = (1/σ)(Î₁(x) − x(σ − 1)√I₀(x))    (34)

Appendix C
Proof of the Composition Property

Consider two positive density functions I₀ and I₁ defined on measure spaces X ⊆ Rᵈ and Y ⊆ Rᵈ, respectively.
We assume that the total mass associated with X is equal to the total mass associated with Y, such that:

∫_X I₀(x) dx = ∫_Y I₁(y) dy    (35)

The optimal transport problem is to find a mass-preserving map f : X → Y that maps one density to the other:

∫_A I₀(x) dx = ∫_{f(A)} I₁(y) dy    ∀ A ⊂ X    (36)

If we consider f to be the gradient of a convex potential φ, such that f = ∇φ, then by Brenier's theorem, there exists a unique (up to a constant) transport map f that will satisfy the equation above. Now let I_g represent the composition of I₁ with an invertible function g(y), such that I_g(y) = D_g(y) I₁(g(y)), where D_g is the determinant of the Jacobian of g. Similarly to f, g is also the gradient of a potential ϕ_g (i.e. g = ∇ϕ_g). In the same manner as before, we can solve for f_g : X → Y:

∫_A I₀(x) dx = ∫_{f_g(A)} I_g(y) dy = ∫_{f_g(A)} D_g(y) I₁(g(y)) dy    (37)

By substituting u = g(y) into Eq. (37) via the change of variables theorem, and equating Eqs. (36) and (37), we get:

∫_{f(A)} I₁(y) dy = ∫_{g(f_g(A))} I₁(u) du    (38)

Since the integrals are equal, we have f_g(x) = g⁻¹(f(x)). If we define the transform of density I₁ with respect to the reference density I₀ as Î₁ = (f(x) − x)√I₀(x), then:

Î_g = (f_g(x) − x)√I₀(x) = (g⁻¹(f(x)) − x)√I₀(x) = (g⁻¹(Î₁(x)/√I₀(x) + x) − x)√I₀(x)    (39)

Appendix D
Proof of the Linear Separability Property

Two non-empty subsets P̂ and Q̂ of a vector space Ω are linearly separable if, and only if, their convex hulls are disjoint, i.e.:

Σ_{i=1}^{N_p} α_i p̂_i ≠ Σ_{j=1}^{N_q} β_j q̂_j    (40)

for any subsets {p̂_i}_{i=1}^{N_p} ⊂ P̂ and {q̂_j}_{j=1}^{N_q} ⊂ Q̂, where α_i, β_j > 0 and Σ_i α_i = Σ_j β_j = 1. For a proof, we refer the reader to (Park et al., 2017).

In order to show that the LOT transforms P̂ and Q̂ (as defined in Eq. (6)) of sets P and Q are linearly separable, we first assume that P̂ and Q̂ are not linearly separable (proof by contradiction). This implies that the convex hulls of the LOT transforms are not disjoint, and therefore, from Eq. (40) we can write:

Σ_i α_i p̂_i = Σ_j β_j q̂_j    (41)

Since the LOT transforms are defined as p̂_i = (f_i − x)√I₀ and q̂_j = (g_j − x)√I₀, the equation above can be rewritten in terms of the transport maps f_i and g_j:

Σ_i α_i (f_i − x)√I₀ = Σ_j β_j (g_j − x)√I₀
Σ_i α_i f_i = Σ_j β_j g_j    (42)

where, by Brenier's theorem, f_i = ∇φ_i and g_j = ∇ϕ_j, and φ_i, ϕ_j are convex potentials.

Using the Jacobian equation from the Monge formulation of the optimal transport problem (Eq. (3)), and the fact that the LOT transform uniquely identifies an f_i with some p_i, we can write D_{f_i} p_i ∘ f_i = I₀. Moreover, using the generative model for image classes described in Section , the elements p_i of P are generated via p_i = D_{h_i} p ∘ h_i. By combining these two results via the composition theorem for LOT transforms, one can show that f_i = h_i⁻¹ ∘ f. Similarly, one can also show that g_j = h_j⁻¹ ∘ g. These expressions for f_i and g_j can then be substituted into Eq. (42):

Σ_i α_i h_i⁻¹ ∘ f = Σ_j β_j h_j⁻¹ ∘ g
h_α⁻¹ ∘ f = h_β⁻¹ ∘ g    (43)

where h_α⁻¹ = Σ_i α_i h_i⁻¹ and h_β⁻¹ = Σ_j β_j h_j⁻¹ (permitted by conditions i and ii in Section ).

From Eq. (3) we know that D_f p ∘ f = D_g q ∘ g = I₀. If we rearrange Eq. (43) and substitute it into this result, we get:

(D_{h_α} D_{h_β⁻¹} D_g) p ∘ (h_α ∘ h_β⁻¹ ∘ g) = D_g q ∘ g    (44)

If we let h_αβ = h_α ∘ h_β⁻¹ (permitted by condition iii in Section ), then this becomes:

D_{h_αβ} p ∘ h_αβ = q    (45)

However, this contradicts the definition in Section , which states that:

D_h p ∘ h ≠ q    ∀ h ∈ H    (46)

Therefore, the LOT transforms P̂ and Q̂ of classes P and Q must be linearly separable.

Appendix E
Derivative of the SPOT Objective Function in 2D

Given that this work is primarily concerned with mapping images to one another, we extend the result in Eq. (14) to two dimensions:

[f(x,y), g(x,y)]ᵀ = [x, y]ᵀ − ∇φ(x,y)    (47)

where f and g are the transport maps in the x and y directions, respectively. We can also rewrite Eq. (12) in two dimensions:

φ(x,y) = Σ_p Σ_q c(p,q) ρ(x − p, y − q)    (48)

By combining Eqs. (47) and (48), we can express the transport maps f and g in terms of the potential φ:

f(x,y) = x − Σ_p Σ_q c(p,q) ρ_x(x − p, y − q)
g(x,y) = y − Σ_p Σ_q c(p,q) ρ_y(x − p, y − q)

where subscripts denote derivatives with respect to that variable (e.g. ρ_x = ∂ρ/∂x and ρ_xx = ∂²ρ/∂x²). Consequently, the first derivatives of the transport maps can be expressed as follows:

f_c(x,y) = −ρ_x(x − p, y − q)
g_c(x,y) = −ρ_y(x − p, y − q)
f_x(x,y) = 1 − Σ_p Σ_q c(p,q) ρ_xx(x − p, y − q)
f_y(x,y) = −Σ_p Σ_q c(p,q) ρ_xy(x − p, y − q)
g_x(x,y) = −Σ_p Σ_q c(p,q) ρ_yx(x − p, y − q)
g_y(x,y) = 1 − Σ_p Σ_q c(p,q) ρ_yy(x − p, y − q)

In two dimensions, the SPOT objective function in Eq. (15) becomes:

ψ(c) = Σ_m Σ_n (D(m,n) I₁(f(m,n), g(m,n)) − I₀(m,n))²

where I₀ and I₁ are images that have been normalized, such that the total mass in each image is the same.
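Because Eqs. (47)-(48) define (f, g) as the gradient of ½(x² + y²) − φ, any choice of coefficients c(p, q) yields a curl-free map. The sketch below evaluates the map from given coefficients for a Gaussian basis function ρ; it is our own illustration with hypothetical inputs, whereas the SPOT solver itself optimizes c:

```python
import numpy as np

def spot_map(coeffs, centers, sigma, xx, yy):
    """Evaluate f(x,y) = x - phi_x and g(x,y) = y - phi_y for a potential
    phi built from Gaussian basis functions rho centered at (p, q).
    coeffs/centers are hypothetical inputs; sigma is the basis width."""
    fx = xx.astype(float).copy()
    fy = yy.astype(float).copy()
    for (p, q), c in zip(centers, coeffs):
        rho = np.exp(-((xx - p) ** 2 + (yy - q) ** 2) / (2.0 * sigma ** 2))
        fx -= c * (-(xx - p) / sigma ** 2) * rho   # c * rho_x
        fy -= c * (-(yy - q) / sigma ** 2) * rho   # c * rho_y
    return fx, fy
```

With all coefficients zero the map reduces to the identity, and for any coefficients the discrete curl ∂g/∂x − ∂f/∂y vanishes up to finite-difference error, consistent with the zero mean absolute curl reported for SPOT in the results.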
The derivative of the objective function with respect to $c$ is given by:

$$\psi_c(c) = 2 \sum_m \sum_n r(m,n) \left( \frac{\partial}{\partial c}\big[D(m,n)\big] \, I_1(f(m,n), g(m,n)) + D(m,n) \, \frac{\partial}{\partial c}\big[I_1(f(m,n), g(m,n))\big] \right) \qquad (49)$$

with

$$r(x,y) = D(x,y) \, I_1(f(x,y), g(x,y)) - I_0(x,y)$$

$$D(x,y) = \begin{vmatrix} f_x(x,y) & f_y(x,y) \\ g_x(x,y) & g_y(x,y) \end{vmatrix} = f_x(x,y) \, g_y(x,y) - f_y(x,y) \, g_x(x,y) \qquad (50)$$

Using Eq. (50) and the product rule, the first partial derivative in Eq. (49) can be written:

$$\frac{\partial}{\partial c} D(x,y) = -g_y(x,y) \, \rho_{xx}(x-p, y-q) - f_x(x,y) \, \rho_{yy}(x-p, y-q) + g_x(x,y) \, \rho_{xy}(x-p, y-q) + f_y(x,y) \, \rho_{yx}(x-p, y-q) \qquad (51)$$

Similarly, using the chain rule, the second partial derivative in Eq. (49) is:

$$\frac{\partial}{\partial c} I_1(f(x,y), g(x,y)) = -I_{1,x}(f(x,y), g(x,y)) \, \rho_x(x-p, y-q) - I_{1,y}(f(x,y), g(x,y)) \, \rho_y(x-p, y-q) \qquad (52)$$

By substituting Eqs. (51) and (52) into Eq. (49), and simplifying the result, $\psi_c$ can be written in the following way:

$$\psi_c(c) = 2 \big[ \Lambda_1 * \tilde{\rho}_{xy}(p,q) + \Lambda_2 * \tilde{\rho}_{yx}(p,q) - \Lambda_3 * \tilde{\rho}_{xx}(p,q) - \Lambda_4 * \tilde{\rho}_{yy}(p,q) - \Lambda_5 * \tilde{\rho}_x(p,q) - \Lambda_6 * \tilde{\rho}_y(p,q) \big]$$

where $*$ denotes convolution, a tilde above an array denotes flipping the array horizontally and vertically (i.e. $\tilde{\rho}(x,y) = \rho(-x,-y)$), and:

$$\Lambda_1(x,y) = r(x,y) \, g_x(x,y) \, I_1(f(x,y), g(x,y))$$
$$\Lambda_2(x,y) = r(x,y) \, f_y(x,y) \, I_1(f(x,y), g(x,y))$$
$$\Lambda_3(x,y) = r(x,y) \, g_y(x,y) \, I_1(f(x,y), g(x,y))$$
$$\Lambda_4(x,y) = r(x,y) \, f_x(x,y) \, I_1(f(x,y), g(x,y))$$
$$\Lambda_5(x,y) = r(x,y) \, D(x,y) \, I_{1,x}(f(x,y), g(x,y))$$
$$\Lambda_6(x,y) = r(x,y) \, D(x,y) \, I_{1,y}(f(x,y), g(x,y))$$

where $I_{1,x}$ and $I_{1,y}$ denote the partial derivatives of $I_1$ with respect to its first and second arguments.

References

Adelson, E., Anderson, C., Bergen, J., Burt, P., & Ogden, J. (1984). Pyramid methods in image processing.
RCA Engineer, (6), 33–41.
Altman, N. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, (3), 175–185.
Ashburner, J., & Friston, K. (2000). Voxel-based morphometry – the methods. NeuroImage, 805–821.
Basu, S., Kolouri, S., & Rohde, G. (2014). Detecting and visualizing cell phenotype differences from microscopy images using transport-based morphometry. Proc. Natl. Acad. Sci. U.S.A., (9), 3448–3453.
Benamou, J., & Brenier, Y. (2000). A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, (3), 375–393.
Bertola, F. (1966). Mass and luminosity distribution in galaxies NGC 3432, NGC 4485-90, NGC 4605. Memorie della Società Astronomia Italiana, 433–458.
Brenier, Y. (1991). Polar factorization and monotone rearrangement of vector-valued functions. Commun. Pure Appl. Math., (4), 375–417.
Cattell, L., Meyer, C. H., Epstein, F., & Rohde, G. (2017). Reconstructing high-resolution cardiac MR movies from under-sampled frames. In Asilomar Conference on Signals, Systems, and Computers.
Chartrand, R., Wohlberg, B., Vixie, K., & Bollt, E. (2009). A gradient descent solution to the Monge-Kantorovich problem. Applied Mathematical Sciences, (22), 1071–1080.
Cireşan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In Proc. IEEE CVPR (pp. 3642–3649).
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, (3), 273–297.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE Computer Vision and Pattern Recognition.
do Carmo, M. (1992). Riemannian geometry (1st ed.). Birkhäuser Basel.
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2121–2159.
Gangbo, W., & McCann, R. (1996). The geometry of optimal transportation. Acta Math., 113–161.
Gorbunova, V., Sporring, J., Lo, P., Loeve, M., Tiddens, H., Nielsen, M., … de Bruijne, M. (2012). Mass preserving image registration for lung CT. Med. Image Anal., (4), 786–795.
Grenander, U. (1994). General pattern theory: A mathematical study of regular structures (1st ed.). Clarendon Press.
Grenander, U. (1996). Elements of pattern theory. JHU Press.
Grenander, U., & Miller, M. (1998). Computational anatomy: An emerging discipline. Quarterly of Applied Mathematics, (4), 617–694.
Grenander, U., & Miller, M. (2007). Pattern theory: From representation to inference. Oxford University Press.
Haker, S., Zhu, L., Tannenbaum, A., & Angenent, S. (2004). Optimal mass transport for registration and warping. International Journal of Computer Vision, (4), 225–240.
Hanna, M., Liu, C., Rohde, G., & Singh, R. (2017). Predictive nuclear chromatin characteristics of melanoma and dysplastic nevi. J Pathol Inform, (15).
Kantorovich, L. (1948). On a problem of Monge. Uspekhi Matematicheskikh Nauk, (24), 225–226.
Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. (arXiv preprint arXiv:1412.6980)
Klein, A., Andersson, J., Ardekani, B., Ashburner, J., Avants, B., Chiang, M.-C., … Parsey, R. (2009). Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI registration. NeuroImage, (3), 786–802.
Knothe, H. (1957). Contributions to the theory of convex bodies. The Michigan Mathematical Journal, (1), 39–52.
Kolouri, S., Park, S., & Rohde, G. (2016). The Radon cumulative distribution transform and its application to image classification. IEEE Trans. Image Process., (2), 920–934.
Kolouri, S., Park, S., Thorpe, M., Slepčev, D., & Rohde, G. (2017). Optimal mass transport: Signal processing and machine learning applications. IEEE Signal Process. Mag., 43–59.
Kolouri, S., Park, S., Thorpe, M., Slepčev, D., & Rohde, G. (2016). Transport-based analysis, modeling, and learning from signal and data distributions. (arXiv preprint arXiv:1609.04767)
Kolouri, S., & Rohde, G. (2015). Transport-based single frame super resolution of very low resolution face images. In Proc. IEEE CVPR (pp. 4876–4884).
Kolouri, S., Tosun, A., Ozolek, J., & Rohde, G. (2016). A continuous linear optimal transport approach for pattern analysis in image datasets. Pattern Recognit., 453–462.
Kundu, S., Kolouri, S., Erickson, K., Kramer, A., McAuley, E., & Rohde, G. (2018). Discovery and visualization of structural biomarkers from MRI using transport-based morphometry. NeuroImage, 256–275.
Kybic, J., & Unser, M. (2003). Fast parametric elastic image registration. IEEE Trans. Image Process., (11), 1427–1442.
Lee, K., Ho, J., & Kriegman, D. (2005). Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, (5), 684–698.
Lintott, C., Schawinski, K., Slosar, A., Land, K., Bamford, S., Thomas, D., … Vandenberg, J. (2008). Galaxy Zoo: Morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. MNRAS, (3), 2833–2859.
McLachlan, G. (2004). Discriminant analysis and statistical pattern recognition. Wiley Interscience.
Monge, G. (1781). Mémoire sur la théorie des déblais et des remblais. De l'Imprimerie Royale.
Mumford, D. (1997). Pattern theory: A unifying perspective. In Fields Medallists' Lectures (pp. 226–261). World Scientific.
Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 372–376.
Nichols, J., Watnik, A., Doster, T., Park, S., Kanaev, A., Cattell, L., & Rohde, G. (2017). An optimal transport model for imaging in atmospheric turbulence. arXiv preprint arXiv:1705.01050.
Ozolek, J., Tosun, A., Wang, W., Chen, C., Kolouri, S., Basu, S., … Rohde, G. (2014). Accurate diagnosis of thyroid follicular lesions from nuclear morphology using supervised learning. Med Image Anal, (5), 772–780.
Paquin, D., Levy, D., Schreibmann, E., & Xing, L. (2006). Multiscale image registration. Mathematical Biosciences and Engineering, (2), 389–418.
Park, S., Cattell, L., Nichols, J., Watnik, A., Doster, T., & Rohde, G. (2018). De-multiplexing vortex modes in optical communications using transport-based pattern recognition. Applied Optics, in press.
Park, S., Kolouri, S., Kundu, S., & Rohde, G. (2017). The cumulative distribution transform and linear pattern classification. Appl. Comput. Harmon. Anal.
Rohde, G., Aldroubi, A., & Dawant, B. (2003). The adaptive bases algorithm for intensity-based nonrigid image registration. IEEE Trans. Med. Imag., (11), 1470–1479.
Rosenblatt, M. (1952). Remarks on a multivariate transformation. The Annals of Mathematical Statistics, (3), 470–472.
Rueckert, D., Sonoda, L., Hayes, C., Hill, D., Leach, M., & Hawkes, D. (1999). Nonrigid registration using free-form deformations: Application to breast MR images. IEEE Trans. Med. Imaging, (8), 712–721.
Schmitter, D., Roche, A., Maréchal, B., Ribes, D., Abdulkadir, A., Bach-Cuadra, M., … Krueger, G. (2015). An evaluation of volume-based morphometry for prediction of mild cognitive impairment and Alzheimer's disease. Neuroimage Clin., 7–17.
Schwarzschild, M. (1954). Mass distribution and mass-luminosity ratio in galaxies. Astronomical Journal, (8), 273–284.
Shamir, L. (2009). Automatic morphological classification of galaxy images. MNRAS, (3), 1367–1372.
Shattuck, D., Mirza, M., Adisetiyo, V., Hojatkashani, C., Salamon, G., Narr, K., … Toga, A. (2008). Construction of a 3D probabilistic atlas of human cortical structures. NeuroImage, 1064–1080.
Si, Z., & Zhu, S. (2012). Learning hybrid image templates (HIT) by information projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, (7), 1354–1367.
Tosun, A., Yergiyev, O., Kolouri, S., Silverman, J., & Rohde, G. (2015). Detection of malignant mesothelioma using nuclear structure of mesothelial cells in effusion cytology specimens. Cytometry Part A, (4), 326–333.
Trigila, G., & Tabak, E. (2016). Data-driven optimal transport. Communications on Pure and Applied Mathematics, (4), 613–648.
Villani, C. (2008). Optimal transport: Old and new (Vol. 338). Springer Science & Business Media.
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., & Fergus, R. (2013). Regularization of neural networks using DropConnect. In ICML (Vol. 28, pp. 1058–1066). PMLR.
Wang, W., Mo, Y., Ozolek, J., & Rohde, G. (2011). Penalized Fisher discriminant analysis and its application to image-based morphometry. Pattern Recognition Letters, (15), 2128–2135.
Wang, W., Ozolek, J., & Rohde, G. (2010). Detection and classification of thyroid follicular lesions based on nuclear structure from histopathology images. Cytometry Part A, (5), 485–494.
Wang, W., Slepčev, D., Basu, S., Ozolek, J., & Rohde, G. (2013). A linear optimal transportation framework for quantifying and visualizing variations in sets of images. IJCV, (2), 254–269.
Yeung, T., Georges, P., Flanagan, L., Marg, B., Ortiz, M., Funaki, M., … Janmey, P. (2005). Effects of substrate stiffness on cell morphology, cytoskeletal structure, and adhesion. Cell Motil. Cytoskel., 60.