Velocity Inversion Using the Quadratic Wasserstein Metric
Srinath Mahankali
Abstract
Full-waveform inversion (FWI) is a method used to determine properties of the Earth from information on the surface. We use the squared Wasserstein distance (squared $W_2$ distance) as an objective function to invert for the velocity as a function of position in the Earth, and we discuss its convexity with respect to the velocity parameter. In one dimension, we consider constant, piecewise increasing, and linearly increasing velocity models as a function of position, and we show the convexity of the squared $W_2$ distance with respect to the velocity parameter on the interval from zero to the true value of the velocity parameter when the source function is a probability measure. Furthermore, we consider a two-dimensional model where velocity is linearly increasing as a function of depth and prove the convexity of the squared $W_2$ distance in the velocity parameter on large regions containing the true value. We discuss the convexity of the squared $W_2$ distance compared with the convexity of the squared $L^2$ norm, and we discuss the relationship between frequency and convexity of these respective distances. We also discuss multiple approaches to optimal transport for non-probability measures by first converting the wave data into probability measures.

The study of seismic waves has many practical applications in geology, especially in searching for natural resources such as oil or natural gas. It plays a major role in determining the type of material underground, given the position of several receivers on the surface and the amount of time it takes for the wave to rebound to the surface. The velocity of the wave (as a function of its position) is unknown, and finding the wave velocity function is equivalent to finding the underground substance. With the same wave source, different wave velocity properties produce different wave data (such as wave amplitude and travel time) measured at a given receiver. This wave data can be used to find the velocity.
We use an objective function, or misfit function, which measures the "distance" between two sets of wave data. This allows us to compare the observed data with simulated data to find the true velocity function. Some examples of objective functions are the $L^p$ distance and the $p$-th Wasserstein distance ($W_p$ distance) from the theory of optimal transport [20], the latter of which is the main tool of this research project.

The objective function becomes zero when the observed data and simulated data are equivalent, which occurs when we have the correct velocity model. Thus, finding the correct velocity model is an optimization problem: minimizing the objective function, which measures the error between simulated and observed wave data. Although it is necessary to have a gradient of zero to minimize the objective function, this is not enough, as it is possible to reach a saddle point or a local minimum at such a point. However, this issue is fixed if the objective function is convex, as it will have only one global minimum. Thus, it is important for the convex region near the global minimum to be as large as possible. For this reason, we investigate the convexity of the squared $W_2$ distance as an objective function.

The conventional choice of objective function is the least-squares ($L^2$) norm, used in both time [19] and frequency [15, 16] domains. If we have data from multiple receiver locations $X_r$ (where $r$ is the index for the receiver location), we can consider the squared $L^2$ norm of the difference between the predicted wave data $g(X_r, t, c)$ and the observed wave data $h(X_r, t) = g(X_r, t, c^*)$:

$$L^2(c) = \frac{1}{2} \sum_r \int_0^T |g(X_r, t, c) - h(X_r, t)|^2 \, dt, \qquad (1.1)$$
from [2], where $c^*$ is the true velocity parameter. While $L^2(c)$ is minimized (and therefore equal to zero) exactly when $c = c^*$, algorithms for minimizing the squared $L^2$ norm may reach local minima instead of the global minimum at $c = c^*$, due to the nonconvexity of the squared $L^2$ norm as discussed in [2, 22]. In addition, its sensitivity to noise can make it an unsuitable choice of objective function [3]. Thus, we use the squared $W_2$ distance from [5, 22, 23] instead. If there are multiple receiver locations $X_r$, our final objective function will be

$$W(c) = \sum_r W_2^2\bigl(g(X_r, t, c),\, h(X_r, t)\bigr). \qquad (1.2)$$

Previous results show that the squared $W_2$ metric is jointly convex in translations and dilations of the data [23], suggesting that the squared $W_2$ metric is a suitable choice for the objective function. Taking the function

$$f(x) = \begin{cases} \frac{1}{\pi} \sin^2(x), & -\pi \le x \le \pi, \\ 0, & \text{otherwise}, \end{cases}$$

as an example, we compare in Figure 1 the graphs of the squared $L^2$ norm of $f(x) - f(x-s)$ with the squared $W_2$ distance between $f(x)$ and $f(x-s)$.

Figure 1: Comparison between the squared $W_2$ distance of $f(x)$ and $f(x-s)$ (panel b) and the squared $L^2$ norm of $f(x) - f(x-s)$ (panel a), as plots in the shift $s$.

It can be observed that the squared $W_2$ distance is convex in the shift $s$ while the graph of the squared $L^2$ norm is not.

In this paper, we present a theoretical approach to velocity inversion using optimal transport by investigating the convexity of the squared $W_2$ distance as a function of the velocity parameters; this has not been studied theoretically before.
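The comparison in Figure 1 can be reproduced numerically. The sketch below (a discretized check, not code from the paper) evaluates both misfits over a range of shifts using the one-dimensional quantile form of the squared $W_2$ distance; for a pure translation of a fixed density, the squared $W_2$ distance should come out close to $s^2$, while the $L^2$ misfit oscillates with the shift.

```python
import numpy as np

def density(x, s=0.0):
    """Shifted copy f(x - s) of the example density f(x) = sin^2(x) / pi."""
    y = x - s
    return np.where(np.abs(y) <= np.pi, np.sin(y) ** 2 / np.pi, 0.0)

def w2_squared(f, g, x):
    """Squared W2 distance between two densities on the grid x, via inverse CDFs."""
    dx = x[1] - x[0]
    F = np.cumsum(f) * dx
    F /= F[-1]                                # guard against discretization drift
    G = np.cumsum(g) * dx
    G /= G[-1]
    s = np.linspace(1e-4, 1.0 - 1e-4, 4000)   # quantile levels in (0, 1)
    return np.trapz((np.interp(s, F, x) - np.interp(s, G, x)) ** 2, s)

x = np.linspace(-2 * np.pi, 3 * np.pi, 8001)
dx = x[1] - x[0]
shifts = np.linspace(0.0, np.pi, 9)
w2_vals = [w2_squared(density(x), density(x, s), x) for s in shifts]
l2_vals = [np.sum((density(x) - density(x, s)) ** 2) * dx for s in shifts]
```

The resulting `w2_vals` follow the parabola $s^2$ (convex in the shift), whereas `l2_vals` rise and then fall again once the shifted copy realigns with the periodic lobes of the density.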
We investigate several velocity models in one dimension and show that the squared $W_2$ distance is a suitable objective function when inverting for the velocity parameter. In two dimensions, we consider a particular velocity model, and we show that the squared $W_2$ distance is a suitable objective function in the case where the source function $f$ is nonnegative. We generalize to when the source function alternates between negative and positive values, and we show that the squared $W_2$ distance is a suitable objective function given certain requirements on $f$. These theorems suggest that the squared $W_2$ distance is a suitable objective function when $\inf f$ is close enough to zero. Numerical evidence suggests that the squared $W_2$ distance is a better objective function than the squared $L^2$ norm when the source function is nonnegative.

The paper is structured as follows. In Section 2, we discuss background knowledge which is used later on in this paper. First, we introduce optimal transport and the $W_p$ distance, along with an explicit formula as well as some of its properties. Furthermore, we discuss the solution of wave equations in one dimension and introduce ray tracing, which is used to solve wave equations in higher dimensions. In Section 3, we first consider one-dimensional velocity models, where the velocity is either constant, piecewise constant, or linearly increasing as a function of position, and the source function is a probability measure. Then, we study a two-dimensional velocity model where the wave velocity $v$ satisfies $v(X, z) = a + bz$, where $a$ and $b$ are positive constants, $X$ is the horizontal position, and $z$ is the depth of the wave. Initially, in Section 3.3.1, we assume that the source function $f$ is a probability measure and that the wave amplitude is unchanged, and we show Theorem 4.

Figure 2: The two functions $f$ and $g$ have the same total area, and $T$ maps all points of $f$ to $g$. The goal of optimal transport is to find $T$ such that the cost function is minimized.
In Section 3.3.2, we involve the wave amplitude and we allow the source function to alternate between positive and negative values, and we show Theorem 5. In Section 4, we compare the convexity of the squared $W_2$ distance with the squared $L^2$ norm, and we include numerical examples. We also discuss the relationship of the $W_2$ distance with the $\dot{H}^{-1}$ norm, and we discuss multiple approaches to optimal transport for non-probability measures. We summarize this paper in Section 5 and discuss a possible direction for future research.

We now introduce optimal transport, along with essential background on wave equations.
In this section, we establish the main goal of optimal transport (originally introduced by Monge [14]) and introduce the $W_p$ distance along with some of its useful properties. Optimal transport involves probability spaces, which are nonnegative measure spaces with total measure equal to one. We discuss the convexity of the squared $W_2$ metric, which is a distance between two probability measures on a probability space, and introduce some of its properties.

Consider two distinct probability measures $\mu$ and $\nu$, as shown in Figure 2. The goal of optimal transport is to find a map $T: \mu \to \nu$ which minimizes the total cost of mapping the points in $\mu$ to $\nu$ according to the map $T$, for a given cost function $c: \mu \times \nu \to \mathbb{R}$ [1]. The $W_p$ metric, based on optimal transport [24], gives the optimal transportation cost when the cost function is $c(x, y) = |x - y|^p$.

2.1.2 Computing the $W_2$ metric in one dimension

Let $\mu$ and $\nu$ be probability measures and let $M$ be a probability space. We define the $W_p$ distance as

$$W_p(f, g) = \left( \inf_{T \in M(\mu, \nu)} \int_\Omega |x - T(x)|^p \, d\mu \right)^{1/p}$$

where $f(x)\,dx = d\mu$, $g(y)\,dy = d\nu$, $\Omega$ is the support of $\mu$, and $M(\mu, \nu)$ is the set of all mass-preserving maps $T: \mu \to \nu$ which map $\mu$ to $\nu$. The integral within the infimum is the total cost of the transport map $T$, so computing $W_p(f, g)$ is equivalent to minimizing the transport cost. We study the case where $p = 2$, and our objective function is $W_2^2(f, g)$.

In one dimension, it is possible to express $W_2^2(f, g)$ for probability measures $f$ and $g$ in a simpler way. Let $F$ and $G$ be the cumulative distribution functions of $f$ and $g$, respectively. Then, Rachev and Rüschendorf derive the formula for the squared $W_2$ distance in one dimension [17] as

$$W_2^2(f, g) = \int_0^1 \left( F^{-1}(s) - G^{-1}(s) \right)^2 ds = \int_\Omega \left( t - G^{-1}(F(t)) \right)^2 f(t) \, dt \qquad (2.1)$$

where $\Omega$ is the domain of $f$.
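As a sanity check on formula (2.1), the sketch below (illustrative code with an arbitrary pair of densities, not from the paper) evaluates both expressions, the quantile form $\int_0^1 (F^{-1}(s) - G^{-1}(s))^2\,ds$ and the composition form $\int_\Omega (t - G^{-1}(F(t)))^2 f(t)\,dt$, and checks that they agree.

```python
import numpy as np

t = np.linspace(0.0, 10.0, 20001)
dt = t[1] - t[0]

# two arbitrary smooth densities on [0, 10], normalized to total mass 1
f = np.exp(-((t - 3.0) / 0.8) ** 2)
g = np.exp(-((t - 6.0) / 1.2) ** 2)
f /= np.trapz(f, t)
g /= np.trapz(g, t)

F = np.cumsum(f) * dt; F /= F[-1]            # CDFs, normalized to end at 1
G = np.cumsum(g) * dt; G /= G[-1]

# quantile form: integrate (F^{-1}(s) - G^{-1}(s))^2 over s in (0, 1)
s = np.linspace(1e-6, 1.0 - 1e-6, 20000)
quantile_form = np.trapz((np.interp(s, F, t) - np.interp(s, G, t)) ** 2, s)

# composition form: integrate (t - G^{-1}(F(t)))^2 f(t) dt
composition_form = np.trapz((t - np.interp(F, G, t)) ** 2 * f, t)
```

For these two (nearly Gaussian) densities, both expressions approximate the same squared $W_2$ distance, up to discretization error.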
This formula for $W_2^2(f, g)$ is useful when the wave data is only a function of time.

As wave data are not usually probability density functions, we can normalize a function $k(t)$ as shown in [11] by replacing it with

$$\frac{k(t) + \gamma}{\int_0^T (k(t) + \gamma) \, dt}$$

for some constant $\gamma > 0$ and sufficiently large $T$ such that $k(t) + \gamma > 0$ for $t \in [0, T]$.

2.1.3 The $W_2$ Metric in Higher Dimensions
In general, there is no explicit formula to compute the $W_2$ metric in higher dimensions. However, certain requirements derived from the concept of cyclical monotonicity [13] make it possible to calculate the optimal map $T$, and therefore the $W_2$ metric, through numerical methods. This is shown in the following theorem of Brenier [4, 6, 22]:

Theorem 1 (Brenier's theorem). Let $\mu$ and $\nu$ be two compactly supported probability measures on $\mathbb{R}^n$. If $\mu$ is absolutely continuous with respect to the Lebesgue measure, then there is a convex function $w: \mathbb{R}^n \to \mathbb{R}$ such that the optimal map $T$ is given by $T(x) = \nabla w(x)$ for $\mu$-almost every $x$.

Furthermore, if $\mu(dx) = f(x)\,dx$ and $\nu(dy) = g(y)\,dy$, then $T$ is differentiable $\mu$-almost everywhere and

$$\det(\nabla T(x)) = \frac{f(x)}{g(T(x))}, \qquad (2.2)$$

from the mass-preserving property of $T$. Replacing $T(x)$ in Equation (2.2) with $\nabla w(x)$ leads to the Monge–Ampère equation

$$\det(D^2 w(x)) = \frac{f(x)}{g(\nabla w(x))},$$

where $D^2 w(x)$ is the Hessian matrix of $w$. Then, the squared $W_2$ distance satisfies

$$W_2^2(f, g) = \int_\Omega |x - \nabla w(x)|^2 f(x) \, dx,$$

where $\Omega$ is the domain of $f$.

2.1.4 Properties of the Squared $W_2$ Distance
Results about the convexity of the squared $W_2$ distance with respect to changes in the data are known [9, 22, 23].

Theorem 2.
Let $f$ and $g$ be compactly supported probability density functions with finite second-order moments on an interval $\Omega \subset \mathbb{R}$. Then,

$$W_2^2(f(t - s), g(t)) = W_2^2(g(t), f(t)) + s^2 + 2s \int_\Omega (x - T(x)) f(x) \, dx \qquad (2.3)$$

where $s \in \mathbb{R}$ and $T$ is the optimal map from $f$ to $g$. Furthermore, $W_2^2(f(t), A f(At - s))$ is convex in both $A$ and $s$, for $A \in \mathbb{R}^+$.

The convexity with respect to shifts and dilations suggests that the squared $W_2$ distance is more suitable when inverting for wave data.

We introduce the partial differential equation which governs the behavior of $n$-dimensional waves. We also introduce d'Alembert's solution to the one-dimensional wave equation and present the ray tracing approach to solving higher-dimensional wave equations.

An $n$-dimensional wave can be expressed as a function $\psi$ of $n$ position variables $x_1, x_2, \ldots, x_n$ and time $t$, which in general satisfies the partial differential equation

$$\frac{\partial^2 \psi}{\partial t^2} - C(x)^2 \left( \frac{\partial^2 \psi}{\partial x_1^2} + \frac{\partial^2 \psi}{\partial x_2^2} + \cdots + \frac{\partial^2 \psi}{\partial x_n^2} \right) = f(x, t)$$

for a variable coefficient $C(x)$ which is a function defined on $\mathbb{R}^n$, and a source function $f$ which is a function of both space and time. In general, the wave equation might not have an analytical solution.

In one dimension, we consider a simple case where $C(x)$ is equal to a constant $c$. The partial differential equation becomes

$$\frac{\partial^2 \psi}{\partial t^2} = c^2 \frac{\partial^2 \psi}{\partial x^2},$$

but unlike its $n$-dimensional variant, it is possible to obtain an explicit solution as derived by d'Alembert [7]:

$$\psi(x, t; c) = \frac{j(x + ct) + j(x - ct)}{2} + \frac{1}{2c} \int_{x - ct}^{x + ct} k(s) \, ds \qquad (2.4)$$

given the initial conditions $\psi(x, 0) = j(x)$ and $\psi_t(x, 0) = k(x)$.

In higher dimensions, however, a wave equation might not have an analytical solution, so we use ray tracing to obtain information about the wave function [18]. An example of ray tracing is shown in Figure 3, where an original signal is released, and two receivers on the surface are present to measure the wave data. We assume that the velocity $v$ of a wave can be expressed as a function of depth $z$ in the Earth, and the ray parameter $p$, or the horizontal slowness, can be expressed as $u(z) \sin \theta$ by Snell's law, where $u(z) = v(z)^{-1}$ and $\theta$ is the angle the ray makes with a vertical axis. The ray parameter is constant throughout the path of the ray. We also define the vertical slowness $\eta(z)$ as $\sqrt{u(z)^2 - p^2}$. The path of the ray is symmetric about a turning point at depth $z_p$, and $u(z_p) = p$ while $\eta(z_p) = 0$.

Figure 3: The method of ray tracing tracks the path of a ray passing through the Earth, and the paths of two rays are shown here. Each ray leaves the source location, reaches a turning point, and then returns to the surface at the indicated receiver location.

By examining a ray passing through several layers in the Earth and eventually returning to the surface, we may calculate the horizontal distance and the traveltime of the wave as a function of the velocity. We can calculate

$$X = 2p \int_0^{z_p} \frac{dz}{\sqrt{u(z)^2 - p^2}} \qquad (2.5)$$

as the distance from the source to the receiver, and

$$T = 2 \int_0^{z_p} \frac{u(z)^2}{\sqrt{u(z)^2 - p^2}} \, dz \qquad (2.6)$$

as the total travel time. Using Equation (2.5), it is possible to solve for $p$ in terms of $X$ and the velocity parameters. It is also possible to use Equation (2.6) to solve for $T$ as a function of $X$ and the velocity parameters, and it is possible to calculate the predicted time $T_{pred}$ by guessing the velocity parameter.
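Equations (2.5) and (2.6) can be evaluated numerically for a given profile. The sketch below (assuming, for illustration, the linear profile $v(z) = a + bz$, which is the model studied later in Section 3.3) uses the substitution $z = z_p - y^2$ to remove the integrable square-root singularity at the turning depth before applying the trapezoidal rule.

```python
import numpy as np

def ray_integrals(p, a, b, n=20000):
    """Numerically evaluate X(p) and T(p) from Eqs. (2.5)-(2.6) for the
    linear profile v(z) = a + b z, i.e. slowness u(z) = 1/(a + b z).
    The substitution z = z_p - y^2 removes the square-root singularity
    at the turning depth z_p, where u(z_p) = p."""
    zp = (1.0 / p - a) / b                      # turning depth
    y = np.linspace(0.0, np.sqrt(zp), n)
    z = zp - y ** 2
    w = a + b * z                               # local velocity v(z)
    # sqrt(u^2 - p^2) = y * sqrt(p*b*(1 + p*w)) / w and dz = -2*y*dy,
    # so dz / sqrt(u^2 - p^2) becomes the smooth integrand below times dy
    smooth = 2.0 * w / np.sqrt(p * b * (1.0 + p * w))
    X = 2.0 * p * np.trapz(smooth, y)
    T = 2.0 * np.trapz(smooth / w ** 2, y)      # extra factor u^2 = 1/w^2
    return X, T

X_num, T_num = ray_integrals(p=0.5, a=1.0, b=0.5)
```

For these sample values the quadrature reproduces the analytically known offset and traveltime of the circular ray path to high accuracy.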
The method of traveltime tomography uses the above formulas to calculate the predicted time and approaches the problem as minimizing the squared difference between the predicted time and the observed time, $(T_{pred} - T_{obs})^2$ [25].

However, this method does not work when the velocity model is not continuous [18]. In addition to using Equations (2.5) and (2.6), we make use of the wave's amplitude as well to deal with discontinuous velocity models and other issues that traveltime tomography runs into. This method is known as full-waveform inversion (FWI), where both the amplitude and the traveltime are used to approximate the properties of the Earth [21].

The final amplitude is asymptotically scaled by a factor of

$$A = \left( 2 \int_0^{z_p} \frac{u(z)}{\sqrt{u(z)^2 - p^2}} \, dz \right)^{-1}, \qquad (2.7)$$

which is the reciprocal of the total arc length of the ray's path, and the ray's path is symmetric about the turning point at depth $z_p$. Thus, if the source function is $f(t)$, then the observed wave function is approximated by $A f(t - T_{obs})$. To invert for the velocity, we use an objective function such as Equation (1.1) or Equation (1.2) to compare observed data with simulated data and minimize it using standard algorithms, and we will see that the squared $W_2$ metric can be a suitable choice.

3 Convexity in the Model Parameter
In this section, we study multiple velocity models in one dimension as well as a model in two dimensions, and we prove convexity of the squared $W_2$ distance with respect to the velocity parameter on certain domains. First, we consider one-dimensional velocity models. We begin with a model with constant velocity $c$ and prove convexity of the squared $W_2$ distance with respect to $c$. We then consider two models with piecewise increasing velocities with respect to distance and a model where velocity is linearly increasing. In every one-dimensional velocity model, we assume that the source function is nonnegative. Finally, we consider a two-dimensional model where the velocity $v$ satisfies $v(X, z) = a + bz$, where $a$ and $b$ are positive constants, $X$ is the horizontal position of the ray, and $z$ is the current depth of the ray. In Section 3.3.1, we assume that the source function is nonnegative, which allows us to use Theorem 2 when computing the squared $W_2$ distance. After this, we consider a more general case where the source function is alternating in Section 3.3.2, and the predicted wave function has an amplitude which is a function of $a$, $b$, and the receiver location $X$. We also use the following result [12]:

Lemma 1.
Let $P: \Omega_1 \to \mathbb{R}$ and $Q: \Omega_2 \to \Omega_1$ be convex functions, where $\Omega_1 \subseteq \mathbb{R}$ and $\Omega_2 \subseteq \mathbb{R}^n$ are convex sets and $n \ge 1$. Furthermore, assume $P$ is nondecreasing. Then, $P(Q(x))$ is a convex function on $\Omega_2$.

An example of this is when $P(x) = x^2$. Then, if $Q$ is convex and nonnegative, we see that $Q^2$ is also convex on its domain.

Let $c$ be the constant wave velocity, and let our initial wave function be $f(t)$. For this section, we assume that $f(t)$ is a probability distribution. We assume that the final wave function at a fixed spatial location $d$ is of the form $f(t - T(c))$, where $T(c)$ is the amount of time it takes to receive the wave signals at location $d$ as a function of the velocity $c$. Since the total distance is $d$, $T(c) = \frac{d}{c}$, which means the predicted wave function is $f\left(t - \frac{d}{c}\right)$. Letting $c^*$ be the true value of the velocity gives us

$$W_2^2\left( f\left(t - \frac{d}{c}\right), f\left(t - \frac{d}{c^*}\right) \right) = \left( \frac{d}{c} - \frac{d}{c^*} \right)^2$$

as the squared $W_2$ distance from Theorem 2, because the integral in Equation (2.3) is zero. In addition, $\frac{d}{c} - \frac{d}{c^*}$ is a convex function of $c$ on $(0, \infty)$, and it is nonnegative on the interval $(0, c^*]$. Hence, by Lemma 1, the squared $W_2$ distance is convex on $(0, c^*]$.

Remark 1.
Due to the convexity of the squared $W_2$ distance on the interval $(0, k^*]$, choosing a small value of $k$ as the initial guess guarantees being able to find the true velocity parameter $k^*$ through gradient-based optimization methods.

When the velocity is non-constant, the d'Alembert solution, Equation (2.4), does not hold anymore. Thus, we study the convexity of the squared $W_2$ distance for several non-constant velocity models, shown in Figure 4. We first study a model where the velocity is piecewise constant, after which we add multiple pieces. Then, we study a model where the velocity is linearly increasing as a function of position. For all of these models, we assume that the source function $f(t)$ is a probability distribution.

Figure 4: Three types of non-constant velocity models in one dimension that we study: (a) first velocity model, (b) second velocity model, (c) third velocity model.

The first scenario we study is a piecewise constant velocity model. Assume the velocity $v(x)$ satisfies $v(x) = c_1$ for $0 \le x \le d_1$ and a known constant $c_1$, and $v(x) = c_2$ for $d_1 < x \le d_1 + d_2$, as seen in Figure 4a; we show convexity in the unknown $c_2$. The total travel time, as a function of $c_2$, is $T(c_2) = \frac{d_1}{c_1} + \frac{d_2}{c_2}$. Letting $c_2^*$ be the true value of $c_2$ gives $\left( \frac{d_2}{c_2} - \frac{d_2}{c_2^*} \right)^2$ as the squared $W_2$ distance by Theorem 2. As the function $\frac{d_2}{c_2} - \frac{d_2}{c_2^*}$ is convex and nonnegative on $(0, c_2^*]$, the squared $W_2$ distance is also convex on $(0, c_2^*]$ by Lemma 1.

The second scenario we consider is the piecewise constant velocity with $n$ pieces of equal length $d$, such that the velocity of the wave is

$$v(x) = \begin{cases} c, & 0 \le x \le d, \\ c + k, & d < x \le 2d, \\ \vdots & \vdots \\ c + (n-1)k, & (n-1)d < x \le nd. \end{cases}$$

Here, the unknown variable is $k$. In this case, the total travel time is

$$T(k) = \sum_{m=0}^{n-1} \frac{d}{c + mk},$$

so the squared $W_2$ distance becomes $(T(k) - T(k^*))^2$ by Theorem 2, where $k^*$ is the true value of the velocity parameter. We claim that the squared $W_2$ distance is convex in $k$ on the interval $[0, k^*]$.
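Before the proof, the claim is easy to probe numerically. The sketch below (with arbitrary illustrative values $c = 2$, $d = 1$, $n = 5$, $k^* = 3$, none taken from the paper) evaluates $(T(k) - T(k^*))^2$ on $[0, k^*]$ and checks that its discrete second differences are nonnegative.

```python
import numpy as np

c, d, n, k_star = 2.0, 1.0, 5, 3.0

def travel_time(k):
    """Total travel time T(k) = sum_{m=0}^{n-1} d / (c + m k)."""
    return sum(d / (c + m * k) for m in range(n))

ks = np.linspace(0.0, k_star, 401)
misfit = np.array([(travel_time(k) - travel_time(k_star)) ** 2 for k in ks])
second_diff = np.diff(misfit, 2)     # >= 0 for a discretely convex sequence
```

The misfit decreases monotonically to zero at $k = k^*$ and its second differences stay nonnegative, consistent with the claim.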
To do this, we initially show that $T(k)$ is convex in $k$ on $[0, \infty)$. Taking the second derivative, we get

$$T''(k) = \sum_{m=0}^{n-1} \frac{2dm^2}{(c + mk)^3},$$

which is nonnegative for all $k \ge 0$. Hence, $T(k)$ is convex in $k$ on $[0, \infty)$. Since $T(k)$ is strictly decreasing on $[0, \infty)$, the function $T(k) - T(k^*)$ is nonnegative (and also convex) on the interval $[0, k^*]$. Thus, the squared $W_2$ distance $(T(k) - T(k^*))^2$ is convex in $k$ on the interval $[0, k^*]$ by Lemma 1.

Next, we consider a linearly increasing velocity, which is of the form $c + kx$ at a position $x$, for a known constant $c$. Here, $k$ is unknown and $k^*$ is the true value of the velocity parameter. Letting $d$ be the total travel distance, we have the following integral representation for the traveltime:

$$T(k) = \int_0^d \frac{dx}{c + kx} = \frac{\ln(c + kd) - \ln(c)}{k}. \qquad (3.1)$$

We first prove a lemma.

Lemma 2. $T(k)$ is convex in $k$ on the interval $[0, \infty)$, where $T(0) = \frac{d}{c}$.

Proof. We take the second derivative of $T$:

$$T''(k) = \frac{d^2}{dk^2} \int_0^d \frac{dx}{c + kx} = \int_0^d \frac{d^2}{dk^2} \left( \frac{1}{c + kx} \right) dx,$$

and the second equation follows by the Leibniz Integral Rule. Because $\frac{d^2}{dk^2} \left( \frac{1}{c + kx} \right) = \frac{2x^2}{(c + kx)^3} \ge 0$, the integrand is nonnegative. This implies the convexity of $T(k)$.

Now we are ready to show that $(T(k) - T(k^*))^2$ is convex on the interval $[0, k^*]$.

Theorem 3. $(T(k) - T(k^*))^2$ is convex in $k$ on the interval $[0, k^*]$, where $T(0) = \frac{d}{c}$.

Proof. By Lemma 2, $T(k) - T(k^*)$ is convex. As the square of a nonnegative convex function is convex by Lemma 1, it is enough to determine the interval on which $T(k) - T(k^*)$ is nonnegative. From the integral representation of $T(k)$ in Equation (3.1) we see that $T(k)$ is strictly decreasing in $k$. Hence, $T(k) - T(k^*)$ is nonnegative on the interval $[0, k^*]$, implying the convexity of $(T(k) - T(k^*))^2$ on this interval by Lemma 1.

Consider a velocity model in two dimensions where the predicted velocity at a point $(X, z)$ is of the form $v(X, z) = a + bz$, where $a$ and $b$ are positive constants, $X$ is the horizontal position of the ray, and $z$ is the current depth of the ray. We analyze the travel time of a ray emanating from a single receiver which has a final distance $X$ from the source, as well as the amplitude of the wave at a receiver. We compute the squared $W_2$ distance of the predicted wave function with the observed wave function (with velocity $a^* + b^* z$ at depth $z$) and aim to find a region in $\mathbb{R}^2$ for which this distance is convex in $(a, b)$.

While the source function and the observed wave function always have the same amplitude in one dimension, for higher dimensions this is not the case. With a source function $f(t)$, the observed wave data is of the form $A f(t - T)$ for constants $A$ and $T$, because there are no reflections when the velocity is a continuous function of depth. However, we analyze the convexity with the assumption that $A = 1$. We treat the observed wave data as $f(t - T(X, a^*, b^*))$, where $T(X, a, b) = T_{pred}$ is the predicted traveltime expressed as a function of $a$, $b$, and $X$, and $T(X, a^*, b^*) = T_{obs}$ is the observed traveltime of the wave data. $X$ is the receiver location, or the distance from the source to the receiver. Furthermore, we assume that $f(t)$ is a probability distribution with compact support in the interval $[p_1, p_2] \subset [0, T]$, where $T > T_{obs} + p_2$. When the source function is a probability distribution, this method is equivalent to traveltime tomography, because the squared $W_2$ distance is equal to $(T_{pred} - T_{obs})^2$ by Theorem 2.

Even when the observed wave data is not a probability distribution, we may normalize the data to reduce it to this case.
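This reduction can be checked numerically: dividing data of the form $A f(t - T)$ by its total integral leaves a probability density whose $W_2^2$ misfit against the similarly normalized observed data reduces to the squared traveltime difference. The sketch below (illustrative values only: a bump source, $A = 0.7$, $T_{pred} = 4$, $T_{obs} = 2.5$) verifies this with the quantile form of Equation (2.1).

```python
import numpy as np

t = np.linspace(0.0, 20.0, 40001)
dt = t[1] - t[0]

def bump(x):
    """Nonnegative source with compact support [1, 3] (hypothetical example)."""
    return np.where(np.abs(x - 2.0) < 1.0, np.cos(np.pi * (x - 2.0) / 2.0) ** 2, 0.0)

def normalized(signal):
    return signal / np.trapz(signal, t)

def w2_squared(f, g):
    """Quantile form of Eq. (2.1) on the common grid t."""
    F = np.cumsum(f) * dt; F /= F[-1]
    G = np.cumsum(g) * dt; G /= G[-1]
    s = np.linspace(1e-6, 1.0 - 1e-6, 20000)
    return np.trapz((np.interp(s, F, t) - np.interp(s, G, t)) ** 2, s)

A, T_pred, T_obs = 0.7, 4.0, 2.5
g_data = normalized(A * bump(t - T_pred))    # predicted data, amplitude A
h_data = normalized(bump(t - T_obs))         # observed data, amplitude 1
w2 = w2_squared(g_data, h_data)              # close to (T_pred - T_obs)^2
```

The amplitude $A$ cancels in the normalization, so the computed misfit matches $(T_{pred} - T_{obs})^2 = 2.25$ up to discretization error.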
For example, if we normalize a nonnegative function $k(t)$ of the form $A f(t - T)$ with compact support in the interval $[0, T]$ using the formula

$$\tilde{k}(t) = \frac{k(t)}{\int_0^T k(t) \, dt},$$

where $\tilde{k}(t)$ is the normalized wave data, then the amplitude of the normalized data remains constant. Thus, the squared $W_2$ distance becomes $(T_{pred} - T_{obs})^2$. Using this expression for the squared $W_2$ distance, we can determine its convexity in a large region containing $(a^*, b^*)$ and points arbitrarily close to the origin.

We first explicitly compute $T(X, a, b)$. Letting the ray parameter be $p$, we find from Equation (2.5) that

$$X = \frac{2\eta_0}{b u_0 p},$$

where $u_0 = a^{-1}$ is the initial slowness, and $\eta_0 = \sqrt{u_0^2 - p^2}$ is the initial vertical slowness.

Figure 5: The sum of the squared $W_2$ distances of $f(t - T(X, a, b))$ and $f(t - T(X, a^*, b^*))$ is compared with the sum of the squared $L^2$ norms of $f(t - T(X, a, b)) - f(t - T(X, a^*, b^*))$: (a) sum of squared $W_2$ distances; (b) sum of squared $L^2$ norms. The receiver locations range from 10 to 100 inclusive, and $(a^*, b^*) = (1, 2)$ is the true value of the velocity parameter. Here, we take $f(t) = e^{-t^2}$ to be the source function.

We can solve for $p$ and $\eta_0$ in terms of $X$, $a$, and $b$ as

$$p = \frac{2}{\sqrt{b^2 X^2 + 4a^2}}, \qquad \eta_0 = \frac{bX}{a \sqrt{b^2 X^2 + 4a^2}}.$$

From these formulas as well as Equation (2.6), we can solve for the time:

$$T(X, a, b) = \frac{2}{b} \ln\left( \frac{\sqrt{b^2 X^2 + 4a^2} + bX}{2a} \right).$$

Then, the squared $W_2$ distance becomes $(T(X, a, b) - T(X, a^*, b^*))^2$, where $v(X, z) = a^* + b^* z$ is the true velocity function. We first claim that $T(X, a, b)$ is convex subject to a restriction on $\frac{bX}{2a}$.
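These closed-form expressions can be cross-checked numerically: for a laterally homogeneous model, the derivative of traveltime with respect to offset should equal the ray parameter, $\frac{\partial T}{\partial X} = p$, a standard kinematic identity. The sketch below (arbitrary values $a = 1$, $b = 2$, $X = 30$) verifies this with a central finite difference.

```python
import numpy as np

def p_of(X, a, b):
    """Ray parameter p solving Eq. (2.5) for v(z) = a + b z."""
    return 2.0 / np.sqrt(b * b * X * X + 4.0 * a * a)

def T_of(X, a, b):
    """Closed-form traveltime T(X, a, b) for the linear model."""
    y = np.sqrt(b * b * X * X + 4.0 * a * a)
    return (2.0 / b) * np.log((y + b * X) / (2.0 * a))

a, b, X = 1.0, 2.0, 30.0
h = 1e-5
dTdX = (T_of(X + h, a, b) - T_of(X - h, a, b)) / (2.0 * h)
# dT/dX should agree with p to within finite-difference accuracy
```

Agreement between `dTdX` and `p_of(X, a, b)` is evidence that the reconstructed formulas for $p$ and $T$ are mutually consistent.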
Let $S_0 > 0$ be the largest root of the equation $S^2 \varphi(S) + 2\varphi(S) - S^2 = 0$, where

$$\varphi(S) = 2\left(1 + \frac{1}{S^2}\right)^{3/2} \ln\left(\sqrt{S^2 + 1} + S\right) - \frac{2}{S^2} - 3.$$

Then, the traveltime $T(X, a, b)$ is jointly convex in $(a, b)$ whenever $\frac{bX}{2a} \ge S_0$.

Proof. Let $y = \sqrt{b^2 X^2 + 4a^2}$ and $S = \frac{bX}{2a}$. We can compute the first-order derivatives of $T(X, a, b)$ as

$$\frac{\partial T}{\partial a} = -\frac{2X}{ay} \quad \text{and} \quad \frac{\partial T}{\partial b} = \frac{2X}{by} - \frac{2}{b^2} \ln\left( \sqrt{S^2 + 1} + S \right).$$

The Hessian matrix of $T(X, a, b)$, where we treat $X$ as a constant, becomes $\frac{2X}{y^3} H$, where

$$H = \begin{pmatrix} 4S^2 + 8 & 2SX \\ 2SX & X^2 \left( \dfrac{2(S^2 + 1)^{3/2}}{S^3} \ln\left(\sqrt{S^2 + 1} + S\right) - \dfrac{2}{S^2} - 3 \right) \end{pmatrix}.$$

To prove the convexity of $T(X, a, b)$, it is enough to show that the Hessian matrix of $T(X, a, b)$ is positive semidefinite, which is equivalent to showing that $H$ is positive semidefinite. Using Sylvester's Criterion, because $4S^2 + 8$ is always positive, we see that $H$ is positive semidefinite exactly when $\det H \ge 0$. Letting

$$\varphi(S) = 2\left(1 + \frac{1}{S^2}\right)^{3/2} \ln\left(\sqrt{S^2 + 1} + S\right) - \frac{2}{S^2} - 3,$$

we see that

$$\det H = 8X^2 \varphi(S) + 4S^2 X^2 (\varphi(S) - 1) = 4X^2 \left( S^2 \varphi(S) + 2\varphi(S) - S^2 \right),$$

so it is enough to show that $S^2 \varphi(S) + 2\varphi(S) - S^2 \ge 0$ for all $S \ge S_0$. Since $S_0$ is the largest root of $S^2 \varphi(S) + 2\varphi(S) - S^2 = 0$, it is enough to show this inequality for all sufficiently large $S$. Observe that $\varphi(S) \ge \ln(S)$ for all sufficiently large $S$, implying that

$$S^2 \varphi(S) + 2\varphi(S) - S^2 \ge S^2 (\ln(S) - 1) + 2\ln(S),$$

which is at least 0 for all sufficiently large $S$. Since $S = \frac{bX}{2a}$, this proves the lemma.

We are now ready to prove the convexity of $(T(X, a, b) - T(X, a^*, b^*))^2$ over a certain region $U$.

Theorem 4.
Let

$$U := \left\{ (a, b) \in \mathbb{R}^2 : a, b > 0, \ \frac{bX}{2a} \ge S_0, \ T(X, a, b) \ge T(X, a^*, b^*) \right\},$$

where $X > 0$ is a fixed constant. Then $(T(X, a, b) - T(X, a^*, b^*))^2$ is jointly convex in $(a, b) \in U$.

Proof. From Lemma 3 we see that the function $T(X, a, b) - T(X, a^*, b^*)$ is jointly convex in $(a, b) \in U$. In addition, as $T(X, a, b) \ge T(X, a^*, b^*)$ on $U$, we have that $T(X, a, b) - T(X, a^*, b^*)$ is also nonnegative on $U$. Thus, by Lemma 1, the squared $W_2$ distance, which is $(T(X, a, b) - T(X, a^*, b^*))^2$, is jointly convex in $(a, b) \in U$.

Remark 2.
To ensure that $(a^*, b^*) \in U$, it is enough to require that $\frac{b^* X}{2a^*} \ge S_0$, because $T(X, a^*, b^*) - T(X, a^*, b^*)$ is equal to 0. Thus, choosing large values of $X$ will ensure that the squared $W_2$ distance is convex in $(a, b)$ in a region containing $(a^*, b^*)$. To find a suitable initial guess, we may choose a point $(a, b)$ such that $\frac{bX}{2a} \ge S_0$ and scale it by a sufficiently small constant in order to satisfy the condition $T(X, a, b) \ge T(X, a^*, b^*)$.

However, this function is not suitable as an objective function because $(T(X, a, b) - T(X, a^*, b^*))^2$ may equal 0, its minimum, even when $a$ is not equal to $a^*$ or $b$ is not equal to $b^*$; there is not enough information to find $a^*$ and $b^*$ through one receiver alone. To fix this, we add multiple receiver locations $X_r$.

Fact 1.
The equation $T(X, a_1, b_1) - T(X, a_2, b_2) = 0$ has at most one solution in $X > 0$, where $(a_1, b_1) \ne (a_2, b_2)$ are fixed ordered pairs of positive real numbers.

Proof. Assume for the sake of contradiction that the equation $T(X, a_1, b_1) = T(X, a_2, b_2)$ has two positive solutions. Since $T(0, a_1, b_1) = T(0, a_2, b_2) = 0$, there are at least three nonnegative solutions in $X$ to the equation $T(X, a_1, b_1) = T(X, a_2, b_2)$. Therefore, the equation $T'(X, a_1, b_1) = T'(X, a_2, b_2)$, or equivalently,

$$\frac{2}{\sqrt{b_1^2 X^2 + 4a_1^2}} = \frac{2}{\sqrt{b_2^2 X^2 + 4a_2^2}},$$

has at least two positive solutions in $X$ by the Mean Value Theorem. Thus, there are two positive solutions to the equation $(b_1^2 - b_2^2) X^2 + (4a_1^2 - 4a_2^2) = 0$, a contradiction.

This implies that the value of $T(X_r, a, b)$ at two different receiver locations $X_r$ uniquely determines the pair $(a, b)$. We use the objective function $\sum_r (T(X_r, a, b) - T(X_r, a^*, b^*))^2$, where the sum is over the receivers $r$, and the $X_r$ are the receiver locations. We plot this objective function in Figure 5a, where $a^* = 1$ and $b^* = 2$ and the $X_r$ range from 10 to 100 inclusive. We compare it to the sum of the squared $L^2$ norms, where the set of receiver locations is the same and the source function is $f(t) = e^{-t^2}$. The $L^2$-based objective function is mostly flat with a sharp incline close to the true velocity parameter $(1, 2)$, while the $W_2$-based objective function (which does not depend on the source function, as long as it is nonnegative) appears convex in $(a, b)$.

3.3.2 Varying Amplitude

In general, the amplitude of the wave equation solution is non-constant. The predicted wave function at receiver $X$ can be considered to be of the form

$$g(t, a, b) = A(X, a, b) f(t - T(X, a, b)),$$

where $A(X, a, b)$ is the amplitude as a function of the receiver location $X$ and the velocity parameters $a, b$.
The observed wave function at receiver $X$ can be considered of the form $g(t, a^*, b^*)$, where $(a^*, b^*)$ is the true velocity parameter. Using Equation (2.7), we obtain

$$A(X, a, b) = \frac{2b}{\sqrt{b^2 X^2 + 4a^2} \left( \pi - 2\arcsin\left( \frac{2a}{\sqrt{b^2 X^2 + 4a^2}} \right) \right)}.$$

Since $b$ and $X$ are positive, we may simplify this expression to get

$$A(X, a, b) = \frac{2}{RX \sqrt{1 + \frac{1}{S^2}}}, \qquad (3.2)$$

where $S = \frac{bX}{2a}$ and

$$R = \pi - 2\arcsin\left( \frac{1}{\sqrt{S^2 + 1}} \right).$$

While the amplitude function in Equation (3.2) is not fully accurate, it is still a good approximation if the velocity model is continuous. Furthermore, taking $X$ to be very large causes $S = \frac{bX}{2a}$ to be large as well, and as $S$ grows large, the amplitude approaches $\frac{2}{\pi X}$. Taking the derivative of the amplitude with respect to $S$ gives

$$\frac{\partial}{\partial S} A(X, S) = \frac{2(R - 2S)}{X (S^2 + 1)^{3/2} R^2} = O\left( \frac{1}{S^2} \right).$$

We treat $X$ as a fixed large constant, which further decreases the derivative. Therefore, it is reasonable to assume the amplitude remains unchanged for large values of $X$, and we let $A$ be this amplitude. We approximate a predicted wave function $g(t, a, b) = A(X, a, b) f(t - T(X, a, b))$, where $X$ is a fixed large constant, by the new function $g(t, a, b) = A f(t - T(X, a, b))$, so it suffices to approximate the observed wave function $h(t) = g(t, a^*, b^*)$ by the corresponding constant-amplitude wave function.

As the $W_2$ distance is only defined when both of its inputs have total mass 1, we normalize all wave data of the form $k(t)$ using the formula

$$\tilde{k}(t) = \frac{k(t) + \gamma}{\int_0^T (k(t) + \gamma) \, dt}$$

on the interval $[0, T]$ and 0 everywhere else. Here, $\gamma$ is a positive constant such that $k(t) + \gamma > 0$ for all $t$. Eventually, we compute the squared $W_2$ distance between $\tilde{g}(t, a, b)$ and $\tilde{h}(t)$.
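The limiting behavior of the reconstructed amplitude (3.2) can be checked numerically: with the constants as written above, $A(X, a, b) \to \frac{2}{\pi X}$ as $S \to \infty$, which supports treating the amplitude as unchanged at large offsets. A small sketch (illustrative values $a = 1$, $b = 2$):

```python
import numpy as np

def amplitude(X, a, b):
    """Amplitude A(X, a, b) in the form of Eq. (3.2), as reconstructed above."""
    S = b * X / (2.0 * a)
    R = np.pi - 2.0 * np.arcsin(1.0 / np.sqrt(S ** 2 + 1.0))
    return 2.0 / (R * X * np.sqrt(1.0 + 1.0 / S ** 2))

a, b = 1.0, 2.0
A_far = amplitude(1000.0, a, b)          # S = 1000: nearly at the asymptote
limit = 2.0 / (np.pi * 1000.0)           # predicted large-offset limit 2/(pi X)
```

At $X = 1000$ the computed amplitude sits within a fraction of a percent of the asymptote, and the amplitude decays monotonically with offset.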
We have
\[
\widetilde{g}(t, a, b) = \frac{A f(t - T_{\mathrm{pred}}) + \gamma}{A I_0 + \gamma T} \quad \text{and} \quad \widetilde{h}(t) = \widetilde{g}(t, a^*, b^*) = \frac{A f(t - T_{\mathrm{obs}}) + \gamma}{A I_0 + \gamma T}, \tag{3.3}
\]
where $\int_0^T f(t)\,\mathrm{d}t = I_0$, $T_{\mathrm{pred}} = T(X, a, b)$, and $T_{\mathrm{obs}} = T(X, a^*, b^*)$. We also assume that $\gamma$ is large enough to ensure that $\widetilde{g}(t, a, b)$ and $\widetilde{h}(t)$ are strictly positive with total mass 1. Then, we let
\[
G(t, a, b) = \int_0^t \widetilde{g}(y, a, b)\,\mathrm{d}y \quad \text{and} \quad H(t) = \int_0^t \widetilde{h}(y)\,\mathrm{d}y = G(t, a^*, b^*).
\]
Since $\widetilde{g}(t, a, b)$ and $\widetilde{h}(t)$ are both probability distributions, we may compute the squared $W_2$ distance between them using Equation (2.1) to get
\[
W_2^2(\widetilde{g}, \widetilde{h}) = \int_0^1 \left(G^{-1}(s, a, b) - H^{-1}(s)\right)^2\,\mathrm{d}s,
\]
where $G^{-1}(s, a, b)\colon [0, 1] \to [0, T]$ is the unique function satisfying $G(G^{-1}(s, a, b), a, b) = s$ and $G^{-1}(G(t, a, b), a, b) = t$ for all $s \in [0, 1]$ and $t \in [0, T]$.

Observe that $\widetilde{g} = \widetilde{g}(t, a, b)$ and $G = G(t, a, b)$ can be expressed as functions of time $t$ and the velocity parameters $a, b$, and $G^{-1}(s) = G^{-1}(s, a, b)$ can be expressed as a function of $s$ and $a, b$. It is also possible to express $\widetilde{g} = \widetilde{g}(t, T_{\mathrm{pred}})$ as a function of $t$ and the predicted travel time $T_{\mathrm{pred}}$ using Equation (3.3). Thus, we can alternatively express $G = G(t, T_{\mathrm{pred}})$ as a function of $t$ and $T_{\mathrm{pred}}$, and $G^{-1}(s) = G^{-1}(s, T_{\mathrm{pred}})$ as a function of $s$ and $T_{\mathrm{pred}}$. In the proofs of the following claims, we sometimes omit the variables $a, b$, and $T_{\mathrm{pred}}$ in the arguments of $\widetilde{g}$, $G$, and $G^{-1}$. We first compute $\int_0^1 (G^{-1}(s))^2\,\mathrm{d}s$ to simplify $W_2^2(\widetilde{g}, \widetilde{h})$.

Lemma 4.
Let $I_0 = \int_0^T f(t)\,\mathrm{d}t$, $I_1 = \int_0^T t f(t)\,\mathrm{d}t$, and $I_2 = \int_0^T t^2 f(t)\,\mathrm{d}t$. Then, if $T_{\mathrm{obs}} \le T_{\mathrm{pred}} \le T - p_2$,
\[
\int_0^1 \left(G^{-1}(s, T_{\mathrm{pred}})\right)^2\,\mathrm{d}s = \frac{1}{I_0 + \frac{\gamma T}{A}}\left(\frac{\gamma T^3}{3A} + I_2 + 2 T_{\mathrm{pred}} I_1 + T_{\mathrm{pred}}^2 I_0\right).
\]
Proof.
From the substitution $s = G(t)$, we can compute $\int_0^1 (G^{-1}(s))^2\,\mathrm{d}s$ as
\[
\int_0^1 (G^{-1}(s))^2\,\mathrm{d}s = \int_0^T t^2\, \widetilde{g}(t)\,\mathrm{d}t = \int_0^T t^2\, \frac{f(t - T_{\mathrm{pred}}) + \frac{\gamma}{A}}{I_0 + \frac{\gamma T}{A}}\,\mathrm{d}t.
\]
We may write the integral as
\[
\int_0^T t^2\, \frac{f(t - T_{\mathrm{pred}}) + \frac{\gamma}{A}}{I_0 + \frac{\gamma T}{A}}\,\mathrm{d}t = \frac{1}{I_0 + \frac{\gamma T}{A}}\left(\frac{\gamma T^3}{3A} + \int_0^T t^2 f(t - T_{\mathrm{pred}})\,\mathrm{d}t\right).
\]
However, note that $\int_0^T t^2 f(t - T_{\mathrm{pred}})\,\mathrm{d}t = \int_{-T_{\mathrm{pred}}}^{T - T_{\mathrm{pred}}} (t + T_{\mathrm{pred}})^2 f(t)\,\mathrm{d}t$. As $f$ has compact support $[p_1, p_2] \subset [0, T]$ and $T_{\mathrm{pred}} \le T - p_2$, we may write the integral as $\int_0^T (t + T_{\mathrm{pred}})^2 f(t)\,\mathrm{d}t$. This becomes $I_2 + 2 T_{\mathrm{pred}} I_1 + T_{\mathrm{pred}}^2 I_0$, which proves the lemma.

Now, observe that the squared $W_2$ distance may be expressed as a function of $T_{\mathrm{pred}} = T(X, a, b)$, because $G^{-1}(s, T_{\mathrm{pred}})$ is a function of both $s$ and $T_{\mathrm{pred}}$. Using this, we may compute the first and second derivatives of $W_2^2(\widetilde{g}, \widetilde{h})$ with respect to $T_{\mathrm{pred}}$. To compute the first derivative of $W_2^2(\widetilde{g}, \widetilde{h})$ with respect to $T_{\mathrm{pred}}$, it is enough to find the first derivative of $G^{-1}$ with respect to $T_{\mathrm{pred}}$.

Lemma 5.
The first derivative of $G^{-1}(s, T_{\mathrm{pred}})$ with respect to $T_{\mathrm{pred}}$ is
\[
\frac{\partial}{\partial T_{\mathrm{pred}}} G^{-1}(s, T_{\mathrm{pred}}) = \frac{f(G^{-1}(s, T_{\mathrm{pred}}) - T_{\mathrm{pred}})}{f(G^{-1}(s, T_{\mathrm{pred}}) - T_{\mathrm{pred}}) + \frac{\gamma}{A}} = 1 - \frac{\frac{\gamma}{A}}{f(G^{-1}(s, T_{\mathrm{pred}}) - T_{\mathrm{pred}}) + \frac{\gamma}{A}}. \tag{3.4}
\]
Proof. Because $\int_0^{G^{-1}(s)} \widetilde{g}(t)\,\mathrm{d}t = G(G^{-1}(s)) = s$, we may apply the Leibniz integral rule to see that
\[
0 = \frac{\partial}{\partial T_{\mathrm{pred}}} s = \frac{\partial}{\partial T_{\mathrm{pred}}} \int_0^{G^{-1}(s)} \widetilde{g}(t)\,\mathrm{d}t = \widetilde{g}(G^{-1}(s))\, \frac{\partial}{\partial T_{\mathrm{pred}}} G^{-1}(s) + \int_0^{G^{-1}(s)} \frac{\partial}{\partial T_{\mathrm{pred}}} \widetilde{g}(t)\,\mathrm{d}t.
\]
Thus, we have that
\[
\frac{\partial}{\partial T_{\mathrm{pred}}} G^{-1}(s) = \frac{-\int_0^{G^{-1}(s)} \frac{\partial}{\partial T_{\mathrm{pred}}} \widetilde{g}(t)\,\mathrm{d}t}{\widetilde{g}(G^{-1}(s))} = \frac{-\left(I_0 + \frac{\gamma T}{A}\right) \int_0^{G^{-1}(s)} \frac{\partial}{\partial T_{\mathrm{pred}}} \widetilde{g}(t)\,\mathrm{d}t}{f(G^{-1}(s) - T_{\mathrm{pred}}) + \frac{\gamma}{A}}.
\]
From Equation (3.3), we see that
\[
-\int_0^{G^{-1}(s)} \frac{\partial}{\partial T_{\mathrm{pred}}} \widetilde{g}(t)\,\mathrm{d}t = \frac{\int_0^{G^{-1}(s)} f'(t - T_{\mathrm{pred}})\,\mathrm{d}t}{I_0 + \frac{\gamma T}{A}} = \frac{f(G^{-1}(s) - T_{\mathrm{pred}})}{I_0 + \frac{\gamma T}{A}},
\]
where we use the fact that $f(-T_{\mathrm{pred}}) = 0$. Thus, Equation (3.4) holds.

Now, we use this to compute the first derivative of the squared $W_2$ distance with respect to $T_{\mathrm{pred}}$.

Lemma 6.
Suppose $T_{\mathrm{obs}} \le T_{\mathrm{pred}} \le T - p_2$. The first derivative of $W_2^2(\widetilde{g}(t, T_{\mathrm{pred}}), \widetilde{h}(t))$ with respect to $T_{\mathrm{pred}}$ is
\[
\frac{2}{I_0 + \frac{\gamma T}{A}}\left(\left(I_0 - \frac{\gamma T}{A}\right)(T_{\mathrm{pred}} - T_{\mathrm{obs}}) + \frac{\gamma}{A} \int_0^T \int_{T_{\mathrm{obs}}}^{T_{\mathrm{pred}}} \frac{\frac{\gamma}{A}}{f(G^{-1}(H(t), y) - y) + \frac{\gamma}{A}}\,\mathrm{d}y\,\mathrm{d}t\right).
\]
Proof.
Because $\frac{\partial}{\partial T_{\mathrm{pred}}} (H^{-1}(s))^2 = 0$,
\[
\frac{\partial}{\partial T_{\mathrm{pred}}} W_2^2(\widetilde{g}, \widetilde{h}) = \frac{\partial}{\partial T_{\mathrm{pred}}} \int_0^1 (G^{-1}(s))^2\,\mathrm{d}s - 2 \int_0^1 H^{-1}(s)\, \frac{\partial}{\partial T_{\mathrm{pred}}} G^{-1}(s)\,\mathrm{d}s,
\]
using the Leibniz integral rule. By Lemma 4 and Lemma 5, we simplify
\[
\frac{\partial}{\partial T_{\mathrm{pred}}} W_2^2(\widetilde{g}, \widetilde{h}) = \frac{2 T_{\mathrm{pred}} I_0 + 2 I_1}{I_0 + \frac{\gamma T}{A}} - 2 \int_0^1 H^{-1}(s)\,\mathrm{d}s + 2 \int_0^1 \frac{\frac{\gamma}{A}\, H^{-1}(s)}{f(G^{-1}(s) - T_{\mathrm{pred}}) + \frac{\gamma}{A}}\,\mathrm{d}s.
\]
From the change of variables $s = H(t)$, we see that $\int_0^1 H^{-1}(s)\,\mathrm{d}s = \int_0^T t\, \widetilde{h}(t)\,\mathrm{d}t$, which can be computed as
\[
\int_0^T t\, \frac{f(t - T_{\mathrm{obs}}) + \frac{\gamma}{A}}{I_0 + \frac{\gamma T}{A}}\,\mathrm{d}t = \frac{1}{I_0 + \frac{\gamma T}{A}}\left(\frac{\gamma T^2}{2A} + \int_{-T_{\mathrm{obs}}}^{T - T_{\mathrm{obs}}} (t + T_{\mathrm{obs}}) f(t)\,\mathrm{d}t\right) = \frac{\frac{\gamma T^2}{2A} + I_1 + T_{\mathrm{obs}} I_0}{I_0 + \frac{\gamma T}{A}}.
\]
This means
\[
\frac{\partial}{\partial T_{\mathrm{pred}}} W_2^2(\widetilde{g}, \widetilde{h}) = \frac{2}{I_0 + \frac{\gamma T}{A}}\left(I_0 (T_{\mathrm{pred}} - T_{\mathrm{obs}}) + \frac{\gamma}{A} \int_0^1 \frac{H^{-1}(s)}{\widetilde{g}(G^{-1}(s))}\,\mathrm{d}s - \frac{\gamma T^2}{2A}\right).
\]
Moreover, $\int_0^1 \frac{H^{-1}(s)}{\widetilde{g}(G^{-1}(s))}\,\mathrm{d}s = \int_0^T H^{-1}(G(t))\,\mathrm{d}t$ from the substitution $s = G(t)$. Because $H^{-1}(G(t))$ is the inverse function of $G^{-1}(H(t))$, we have that $\int_0^T H^{-1}(G(t))\,\mathrm{d}t = T^2 - \int_0^T G^{-1}(H(t))\,\mathrm{d}t$. Thus,
\[
\frac{\partial}{\partial T_{\mathrm{pred}}} W_2^2(\widetilde{g}, \widetilde{h}) = \frac{2}{I_0 + \frac{\gamma T}{A}}\left(I_0 (T_{\mathrm{pred}} - T_{\mathrm{obs}}) + \frac{\gamma T^2}{2A} - \frac{\gamma}{A} \int_0^T G^{-1}(H(t))\,\mathrm{d}t\right). \tag{3.5}
\]
Observe that $G^{-1}(H(t), T_{\mathrm{obs}}) = t$. From Equation (3.4), we have that
\[
G^{-1}(H(t), T_{\mathrm{pred}}) = t + \int_{T_{\mathrm{obs}}}^{T_{\mathrm{pred}}} \frac{\partial}{\partial y} G^{-1}(H(t), y)\,\mathrm{d}y = t + \int_{T_{\mathrm{obs}}}^{T_{\mathrm{pred}}} \left(1 - \frac{\frac{\gamma}{A}}{f(G^{-1}(H(t), y) - y) + \frac{\gamma}{A}}\right)\mathrm{d}y.
\]
Substituting this expression for $G^{-1}(H(t))$ into Equation (3.5) gives
\[
\frac{2}{I_0 + \frac{\gamma T}{A}}\left(\left(I_0 - \frac{\gamma T}{A}\right)(T_{\mathrm{pred}} - T_{\mathrm{obs}}) + \frac{\gamma}{A} \int_0^T \int_{T_{\mathrm{obs}}}^{T_{\mathrm{pred}}} \frac{\frac{\gamma}{A}}{f(G^{-1}(H(t), y) - y) + \frac{\gamma}{A}}\,\mathrm{d}y\,\mathrm{d}t\right)
\]
as the value of $\frac{\partial}{\partial T_{\mathrm{pred}}} W_2^2(\widetilde{g}, \widetilde{h})$.

Expressing $\frac{\partial}{\partial T_{\mathrm{pred}}} W_2^2(\widetilde{g}, \widetilde{h})$ in this form allows us to prove the convexity of $W_2^2(\widetilde{g}, \widetilde{h})$ with respect to $a, b$, subject to a restriction on the source function $f$. Here, we assume that $f$ reaches both positive and negative values.

Theorem 5.
Let $q = \frac{\frac{\gamma}{A}}{\sup f + \frac{\gamma}{A}}$ and suppose that $I_0 \ge (1 - q)\frac{\gamma T}{A}$. Let
\[
V = \left\{(a, b) \in \mathbb{R}^2 : a, b > 0,\ \frac{bX}{2a} \ge S_0,\ T(X, a^*, b^*) \le T(X, a, b) \le T - p_2\right\},
\]
for the constant $S_0$ in Lemma 3. Then, $W_2^2(\widetilde{g}(t, a, b), \widetilde{h}(t))$ is jointly convex in $(a, b) \in V$.

Proof. First, observe that $T(X, a, b)$ is jointly convex in $a, b$ over the region $V$ by Lemma 3. Thus, it is enough to show that $W_2^2(\widetilde{g}, \widetilde{h})$ is convex and nondecreasing in $T_{\mathrm{pred}}$ in the interval $[T_{\mathrm{obs}}, T - p_2]$. We first claim that $W_2^2(\widetilde{g}, \widetilde{h})$ is nondecreasing in $T_{\mathrm{pred}}$ in this interval. By Lemma 6, we see that
\[
\frac{\partial}{\partial T_{\mathrm{pred}}} W_2^2(\widetilde{g}, \widetilde{h}) \ge \frac{2}{I_0 + \frac{\gamma T}{A}}\left(\left(I_0 - \frac{\gamma T}{A}\right)(T_{\mathrm{pred}} - T_{\mathrm{obs}}) + \frac{\gamma}{A} \int_0^T \int_{T_{\mathrm{obs}}}^{T_{\mathrm{pred}}} q\,\mathrm{d}y\,\mathrm{d}t\right) = \frac{2}{I_0 + \frac{\gamma T}{A}}\left(I_0 + (q - 1)\frac{\gamma T}{A}\right)(T_{\mathrm{pred}} - T_{\mathrm{obs}})
\]
because $T_{\mathrm{pred}} \ge T_{\mathrm{obs}}$. Furthermore, because $I_0 \ge (1 - q)\frac{\gamma T}{A}$, every factor is nonnegative, implying that $\frac{\partial}{\partial T_{\mathrm{pred}}} W_2^2(\widetilde{g}, \widetilde{h})$ is nonnegative. Next, we claim that $W_2^2(\widetilde{g}, \widetilde{h})$ is convex in $T_{\mathrm{pred}}$ in the interval $[T_{\mathrm{obs}}, T - p_2]$. Using Equation (3.5) and Lemma 5, we see that
\[
\frac{\partial^2}{\partial T_{\mathrm{pred}}^2} W_2^2(\widetilde{g}, \widetilde{h}) = \frac{\partial}{\partial T_{\mathrm{pred}}}\left(\frac{2}{I_0 + \frac{\gamma T}{A}}\left(I_0 (T_{\mathrm{pred}} - T_{\mathrm{obs}}) + \frac{\gamma T^2}{2A} - \frac{\gamma}{A} \int_0^T G^{-1}(H(t))\,\mathrm{d}t\right)\right) = \frac{2}{I_0 + \frac{\gamma T}{A}}\left(I_0 - \frac{\gamma}{A} \int_0^T \frac{\partial}{\partial T_{\mathrm{pred}}} G^{-1}(H(t))\,\mathrm{d}t\right)
\]
\[
= \frac{2}{I_0 + \frac{\gamma T}{A}}\left(I_0 - \frac{\gamma}{A} \int_0^T \left(1 - \frac{\frac{\gamma}{A}}{f(G^{-1}(H(t)) - T_{\mathrm{pred}}) + \frac{\gamma}{A}}\right)\mathrm{d}t\right) \ge \frac{2}{I_0 + \frac{\gamma T}{A}}\left(I_0 + (q - 1)\frac{\gamma T}{A}\right).
\]

(a) $\alpha = 2$ (b) $\alpha = 10$ (c) $\alpha = 100$
Figure 6: Plots of $\|g - h\|_{L^2}^2$, where there is only one receiver at $X = 10$.
(a) $\alpha = 2$ (b) $\alpha = 10$ (c) $\alpha = 100$
Figure 7: Plots of $W_2^2(\widetilde{g}(X, t), \widetilde{h}(X, t))$, where there is only one receiver at $X = 10$.

Since $I_0 \ge (1 - q)\frac{\gamma T}{A}$, we see that $\frac{\partial^2}{\partial T_{\mathrm{pred}}^2} W_2^2(\widetilde{g}, \widetilde{h})$ is nonnegative in the interval $[T_{\mathrm{obs}}, T - p_2]$.
Thus, $W_2^2(\widetilde{g}, \widetilde{h})$ is convex and nondecreasing in $T_{\mathrm{pred}}$ in the interval $[T_{\mathrm{obs}}, T - p_2]$. Since $T_{\mathrm{pred}}$ is jointly convex in $(a, b) \in V$, $W_2^2(\widetilde{g}, \widetilde{h})$ is also jointly convex in $(a, b) \in V$ by Lemma 1.

Remark 3.
Through the Mean Value Theorem, Theorem 5 can be reformulated as a result with a condition involving an upper bound on $\frac{\mathrm{d}f}{\mathrm{d}t}$. This reformulation illustrates how the convexity of $W_2^2(\widetilde{g}, \widetilde{h})$ is influenced by the frequency of the source function.

Remark 4.
The requirements needed to apply Lemma 1 are not satisfied for $T_{\mathrm{pred}} \in (0, T_{\mathrm{obs}}]$. To see this, observe that the squared $W_2$ distance is nonnegative everywhere and equals 0 when $T_{\mathrm{pred}} = T_{\mathrm{obs}}$. Thus, it is impossible for the squared $W_2$ distance to be nondecreasing in $T_{\mathrm{pred}} \in (0, T_{\mathrm{obs}}]$, although it may be convex in this interval.

We continue with the velocity model studied in Section 3.3. We compute both the $L^2$ and $W_2$ distances and compare the convexity of the two objective functions in $(a, b)$, as shown in Figures 6 and 7. The source wave function is of the form $f(t) = e^{-\alpha (t - t_0)^2}$ for a fixed center $t_0$, and we consider $\alpha \in \{2, 10, 100\}$. Here, $(a^*, b^*) = (1, 2)$ and $\gamma$ is a fixed small positive constant. We can consider the observed data to be of the form $A(X, a, b) f(t - T(X, a, b))$ because the velocity is a continuous function of the depth.

While the squared $W_2$ distance appears to be mostly convex for $\alpha \in \{2, 10, 100\}$, the squared $L^2$ norm is certainly nonconvex. The plot of the squared $L^2$ norm has large flat regions with a steep incline closer to where the $L^2$ norm is minimized, as shown in Figures 6a to 6c. Although the squared $L^2$ norm is nonconvex, by decreasing the value of $\alpha$ we increase the size of the convex region around $(a^*, b^*)$ of the squared $L^2$ norm. As the graph of the source function becomes sharper (Figure 9a), so does the graph of the squared $L^2$ norm. The squared $W_2$ distance, on the other hand, is relatively flat throughout the entire domain and does not have a steep incline closer to the minimum, as shown in Figures 7a to 7c.

(a) $\alpha = 2$ (b) $\alpha = 10$ (c) $\alpha = 100$
Figure 8: Plots of $\sum_r W_2^2(\widetilde{g}(X_r, t), \widetilde{h}(X_r, t))$, where the receiver locations $X_r$ range from 10 to 100, inclusive.
(a) The source functions are of the form $e^{-\alpha (t - t_0)^2}$. (b) The Fourier transforms of the source functions.
Figure 9: The Fourier transform of the source function depends on the value of $\alpha$. As $\alpha$ increases, the average absolute value of the frequency also increases.

Thus, the squared $W_2$ distance should be convex on a much larger region containing the minimum. The squared $W_2$ distance is also highly insensitive to the choice of source function, and this suggests that the squared $W_2$ distance can be used to solve various seismic inversion problems, in contrast with the squared $L^2$ norm.

In addition, we plot the sum of the squared $W_2$ distance taken over multiple receiver locations in Figure 8. The summation of the squared $W_2$ distance over multiple receiver locations is highly convex regardless of $\alpha$, as seen in Figures 8a to 8c. Furthermore, this summation is also very close to the summation of $(T_{\mathrm{pred}} - T_{\mathrm{obs}})^2$ over multiple receiver locations, and appears to be convex in $(a, b)$ for $(a, b)$ closer to the origin. Thus, as an initial guess for $(a, b)$, it appears to be better to choose points $(a, b)$ which are very close to the origin. This is equivalent to choosing points $(a, b)$ such that the predicted travel time is large.

We observe that the changes in the squared $L^2$ or $W_2$ distances between $g(t, a, b)$ and $h(t)$ depend solely on the value of $\alpha$, which in turn affects the frequency of the source function. The graph of the source function in the frequency domain can be derived by taking a Fourier transform. Letting
\[
\widehat{f}(k) = \int_{-\infty}^{\infty} e^{-2\pi i k t} f(t)\,\mathrm{d}t = \int_{-\infty}^{\infty} e^{-2\pi i k t} e^{-\alpha (t - t_0)^2}\,\mathrm{d}t,
\]
we get that $\widehat{f}(k) = \sqrt{\frac{\pi}{\alpha}}\, e^{-\frac{(\pi k)^2}{\alpha} - 2\pi i k t_0}$, and the power at a frequency $k$ is given by the magnitude, which is $|\widehat{f}(k)| = \sqrt{\frac{\pi}{\alpha}}\, e^{-\frac{(\pi k)^2}{\alpha}}$. The graph of $|\widehat{f}(k)|$ is centered at $k = 0$ regardless of the value of $\alpha$, but as the value of $\alpha$ increases, the plot of $|\widehat{f}(k)|$ becomes wider, as shown in Figure 9b.
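The closed form for $|\widehat{f}(k)|$ is easy to check numerically. The following sketch verifies the Gaussian transform pair and the widening of the spectrum with $\alpha$; the quadrature grid and the tested values of $k$ and $\alpha$ are illustrative choices.

```python
import numpy as np

def fhat_mag(k, alpha):
    # |f̂(k)| for f(t) = exp(-alpha * (t - t0)^2) under the convention
    # f̂(k) = ∫ exp(-2·pi·i·k·t) f(t) dt; the shift t0 only changes the phase.
    return np.sqrt(np.pi / alpha) * np.exp(-(np.pi * k) ** 2 / alpha)

# Numerical check of the transform pair at alpha = 2, k = 1 (taking t0 = 0,
# so the transform is real and equals its magnitude):
t = np.arange(-20.0, 20.0, 1e-3)
numeric = np.sum(np.cos(2 * np.pi * 1.0 * t) * np.exp(-2.0 * t ** 2)) * 1e-3
```

Because the ratio $|\widehat{f}(k_1)| / |\widehat{f}(k_2)| = e^{-\pi^2 (k_1^2 - k_2^2)/\alpha}$ increases with $\alpha$ whenever $|k_1| > |k_2|$, the relative weight of high frequencies grows with $\alpha$, which is the widening seen in Figure 9b.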
In other words, if $|k_1| > |k_2|$, where $k_1$ and $k_2$ are fixed frequencies, then $\frac{|\widehat{f}(k_1)|}{|\widehat{f}(k_2)|}$ increases as $\alpha$ increases. Thus, as the value of $\alpha$ increases, the average absolute value of the frequency of the source function also increases. For large values of $\alpha$, the plot of the squared $L^2$ norm also has the properties of higher-frequency data, as the plot is very sharp close to $(a^*, b^*) = (1, 2)$; this is consistent with the $L^2$ norm weighting low-frequency and high-frequency terms equally, by the Plancherel theorem. On the other hand, the plot of the squared $W_2$ distance is virtually unchanged as $\alpha$ increases, suggesting that the squared $W_2$ distance is highly insensitive to the frequency of the source function.

The relationship between the $W_2$ distance between two functions $g$ and $h$ and a weighted $\dot{H}^{-1}$ distance between them helps provide an explanation for these observations regarding frequency [10, 20]. We define the space $\dot{H}^1(\mathbb{R}^d)$ through the seminorm
\[
\|f\|_{\dot{H}^1(\mathbb{R}^d)}^2 = \int_{\mathbb{R}^d} |k|^2\, |\widehat{f}(k)|^2\,\mathrm{d}k,
\]
and the space $\dot{H}^{-1}(\mathbb{R}^d)$ is defined as the dual of $\dot{H}^1(\mathbb{R}^d)$ through the norm
\[
\|f\|_{\dot{H}^{-1}(\mathbb{R}^d)} = \sup\{|\langle z, f \rangle| : \|z\|_{\dot{H}^1} \le 1\}.
\]
It is known [20] that the $W_2$ distance is asymptotically equivalent to the $\dot{H}^{-1}$ norm, which weights terms of lower frequency over terms of higher frequency. While the objective functions in Figures 7 and 8 are not globally convex, the relationship between the squared $W_2$ distance and the squared $\dot{H}^{-1}$ metric offers an explanation for the smoothness of the plots in these figures, which display properties of low-frequency data.

In general, the wave data tends to alternate between positive and negative values, and the total integral of the observed or predicted wave function does not have to be 1.
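The frequency weighting attributed to the $\dot{H}^{-1}$ norm above can be checked directly: for two residuals of identical $L^2$ size but different frequencies, an $\dot{H}^{-1}$-type norm computed via the FFT strongly favors the low-frequency one. This sketch uses the paper's convention $\|f\|_{\dot{H}^1}^2 = \int |k|^2 |\widehat{f}(k)|^2\,\mathrm{d}k$, whose dual weights by $1/|k|^2$; constant factors differ across Fourier conventions, and the signal lengths and frequencies are illustrative.

```python
import numpy as np

n, dt = 3200, 0.01            # 32 s of data; 1 Hz and 8 Hz are exact FFT bins
t = np.arange(n) * dt
freqs = np.fft.rfftfreq(n, dt)
df = freqs[1] - freqs[0]

def l2_sq(r):
    # Squared L2 norm; by Plancherel, this weights all frequencies equally.
    return float(np.sum(r ** 2) * dt)

def h_minus1_sq(r):
    # Squared H^{-1}-type norm: integral of |r̂(k)|^2 / |k|^2 over k != 0
    # (the factor 2 accounts for the negative frequencies of a real signal).
    rhat = np.fft.rfft(r) * dt
    return float(2.0 * np.sum(np.abs(rhat[1:]) ** 2 / freqs[1:] ** 2) * df)

low = np.sin(2 * np.pi * 1.0 * t)   # low-frequency residual
high = np.sin(2 * np.pi * 8.0 * t)  # high-frequency residual, same L2 size
```

Both residuals have identical $L^2$ norms, but the high-frequency one is smaller by a factor of $8^2 = 64$ in the $\dot{H}^{-1}$-type norm, mirroring the smooth, low-frequency-dominated $W_2$ landscapes in Figures 7 and 8.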
Thus, we cannot immediately use the squared $W_2$ distance as our objective function, because it is only defined on probability distributions. The current approach to normalizing the wave data requires two steps: first, transform the wave data into a nonnegative function, and second, divide by the total mass [8]. This ensures that the normalized wave data satisfies the positivity and total mass requirements. Although there are several possible ways to complete the first step, the known methods of doing this have their own drawbacks.

If the source function $f$ is positive, the first step becomes unnecessary. The requirement in Theorem 4 is not very strong, suggesting that when $f$ is positive, the squared $W_2$ distance is suitable as an objective function. However, this method does not generalize well to source functions that alternate between negative and positive values. In this case, we complete the first step by initially replacing an alternating function $k$ by $k + \gamma$, where $\inf k + \gamma > 0$. Then, we divide by the total mass of $k + \gamma$ in the interval $[0, T]$. This method of normalization takes the wave amplitude into account as well. However, the convexity of the squared $W_2$ distance in this case is not as general as with the previous method. Further restrictions on the source $f$ are necessary, as shown by the requirements in Theorem 5. This suggests that the squared $W_2$ distance is suitable as an objective function when the normalization constant $\gamma$ is sufficiently close to 0, or equivalently, when $\inf f$ is sufficiently close to 0.

Conclusions
In this paper, we study the convexity of full-waveform inversion using the squared $W_2$ distance as an objective function with respect to the velocity model parameter. We show that the squared $W_2$ distance is a suitable objective function for multiple velocity models when the received signal is nonnegative. Next, we show that the squared $W_2$ distance is suitable in some cases in a two-dimensional velocity model where the received signal alternates between positive and negative values. We review the smoothing property of the squared $W_2$ distance through its relation to the squared $\dot{H}^{-1}$ distance, and contrast this with the sharpness of the squared $L^2$ norm, which is very sensitive to high-frequency signals. We also discuss the drawbacks of the normalization methods used in this paper. A natural direction for future research is to generalize the $W_2$ distance to compare functions alternating between positive and negative values.

Acknowledgments

Firstly, the author would like to thank Dr. Yunan Yang for her mentorship and guidance during this project. The author thanks Dr. Tanya Khovanova and Boya Song for proofreading this paper and for providing feedback. Finally, the author is thankful to the PRIMES-USA program for making this research project possible. This work is supported in part by the National Science Foundation through grant DMS-1913129.
References

[1] Luigi Ambrosio and Nicola Gigli. A user's guide to optimal transport. In Modelling and Optimisation of Flows on Networks, pages 1-155. Springer, 2013.
[2] Hyoungsu Baek, Henri Calandra, and Laurent Demanet. Velocity estimation via registration-guided least-squares inversion. Geophysics, 79(2):R79-R89, 2014.
[3] Ebru Bozdağ, Jeannot Trampert, and Jeroen Tromp. Misfit functions for full waveform inversion based on instantaneous phase and envelope measurements. Geophysical Journal International, 185(2):845-870, 2011.
[4] Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375-417, 1991.
[5] Jing Chen, Yifan Chen, Hao Wu, and Dinghui Yang. The quadratic Wasserstein metric for earthquake location. Journal of Computational Physics, 373:188-209, 2018.
[6] Guido De Philippis and Alessio Figalli. The Monge-Ampère equation and its link to optimal transportation. Bulletin of the American Mathematical Society, 51(4):527-580, 2014.
[7] Laurent Demanet. Waves and imaging class notes, 18.325, 2016.
[8] Björn Engquist and Brittany D. Froese. Application of the Wasserstein metric to seismic signals. arXiv preprint arXiv:1311.4581, 2013.
[9] Björn Engquist, Brittany D. Froese, and Yunan Yang. Optimal transport for seismic full waveform inversion. arXiv preprint arXiv:1602.01540, 2016.
[10] Björn Engquist, Kui Ren, and Yunan Yang. The quadratic Wasserstein metric for inverse data matching. Inverse Problems, 36(5):055001, 2020.
[11] Björn Engquist and Yunan Yang. Seismic imaging and optimal transport. arXiv preprint arXiv:1808.04801, 2018.
[12] Ayman Hourieh. The composition of two convex functions is convex. Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/287725 (version: 2013-01-26).
[13] Martin Knott and Cyril S. Smith. On the optimal mapping of distributions. Journal of Optimization Theory and Applications, 43(1):39-49, 1984.
[14] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences de Paris, 1781.
[15] R. Gerhard Pratt. Inverse theory applied to multi-source cross-hole tomography. Part 2: Elastic wave-equation method. Geophysical Prospecting, 38(3):311-329, 1990.
[16] R. Gerhard Pratt and Michael H. Worthington. Inverse theory applied to multi-source cross-hole tomography. Part 1: Acoustic wave-equation method. Geophysical Prospecting, 38(3):287-310, 1990.
[17] Svetlozar T. Rachev and Ludger Rüschendorf. Mass Transportation Problems: Volume I: Theory, volume 1. Springer Science & Business Media, 1998.
[18] Peter M. Shearer. Introduction to Seismology. Cambridge University Press, 2019.
[19] Albert Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, 2005.
[20] Cédric Villani. Topics in Optimal Transportation. American Mathematical Society, 2003.
[21] Jean Virieux and Stéphane Operto. An overview of full-waveform inversion in exploration geophysics. Geophysics, 74(6):WCC1-WCC26, 2009.
[22] Yunan Yang. Optimal Transport for Seismic Inverse Problems. PhD thesis, The University of Texas at Austin, 2018.
[23] Yunan Yang. Analysis and application of optimal transport for challenging seismic inverse problems. arXiv preprint arXiv:1902.01226, 2019.
[24] Yunan Yang and Björn Engquist. Analysis of optimal transport and related misfit functions in full-waveform inversion. Geophysics, 83(1):A7-A12, 2018.
[25] Colin A. Zelt. Traveltime tomography using controlled-source seismic data.