[PDF] Optimal transport in full-waveform inversion: Analysis and practice of the multidimensional Kantorovich-Rubinstein norm

Abstract

In the last ten years, full-waveform inversion has emerged as a robust and efficient high-resolution velocity model-building tool for seismic imaging, with the unique ability to recover complex subsurface structures. Originally based on a data fitting process using a least-squares cost function, it suffered from high sensitivity to cycle-skipping and was therefore of poor efficiency in handling large time shifts between observed and modelled seismic events. To tackle this problem, a common practice is to start the inversion using the low temporal frequencies of the data and selecting diving wave events. Complementary to this, the use of other cost functions has been investigated. Among these, cost functions based on optimal transport appeared appealing to possibly handle large time shifts between seismic events. Several strategies inspired by optimal transport have been proposed, taking into account the specificities of seismic data. Among them, the approach based on the Kantorovich-Rubinstein norm offers the possibility of the direct use of seismic data and an efficient numerical implementation allowing for a multidimensional (data coordinate space) application. We present here an analysis of the Kantorovich-Rubinstein norm, discussing its theoretical and practical aspects. A key component of our analysis is the back-propagated adjoint-source. We highlight its piecewise linearity, analyze its frequency content and amplitude balancing. We also emphasize the benefit of having a multidimensional implementation. Furthermore, we give practical rules for setting the tuning parameters of the numerical implementation. Our set of synthetic and field data examples demonstrate the improvements brought by the use of the Kantorovich-Rubinstein norm over least-squares full-waveform inversion, and highlight the improvements brought by the multidimensional approach over the one-dimensional one.

Full PDF

OOptimal transport in full-waveform inversion:

Analysis and practice of the multidimensional

Kantorovich-Rubinstein norm

J´er´emie Messud , Rapha¨el Poncet , and Gilles Lambar´e CGG, 27 avenue Carnot, 91341 Massy (France) Formerly CGGJanuary 5, 2021

In the last ten years, full-waveform inversion has emerged as a robust andeﬃcient high-resolution velocity model-building tool for seismic imaging, withthe unique ability to recover complex subsurface structures. Originally based ona data ﬁtting process using a least-squares cost function, it suﬀered from highsensitivity to cycle-skipping and was therefore of poor eﬃciency in handlinglarge time shifts between observed and modelled seismic events. To tackle thisproblem, a common practice is to start the inversion using the low temporalfrequencies of the data and selecting diving wave events. Complementary tothis, the use of other cost functions has been investigated. Among these, costfunctions based on optimal transport appeared appealing to possibly handlelarge time shifts between seismic events. Several strategies inspired by optimaltransport have been proposed, taking into account the speciﬁcities of seismicdata. Among them, the approach based on the Kantorovich-Rubinstein normoﬀers the possibility of the direct use of seismic data and an eﬃcient numer-ical implementation allowing for a multidimensional (data coordinate space)application.We present here an analysis of the Kantorovich-Rubinstein norm, discussingits theoretical and practical aspects. A key component of our analysis is theback-propagated adjoint-source. We highlight its piecewise linearity, analyzeits frequency content and amplitude balancing. We also emphasize the beneﬁtof having a multidimensional implementation. Furthermore, we give practicalrules for setting the tuning parameters of the numerical implementation. Ourset of synthetic and ﬁeld data examples demonstrate the improvements broughtby the use of the Kantorovich-Rubinstein norm over least-squares full-waveforminversion, and highlight the improvements brought by the multidimensional ap-proach over the one-dimensional one.

1. Introduction

Full-waveform inversion (FWI) was proposed in the early 80’s as a data ﬁtting process aimedat inverting for a subsurface model (Lailly, 1983; Tarantola, 1984). It is based on a data1 a r X i v : . [ phy s i c s . g e o - ph ] J a n pace cost function, initially the least-squares (LSQ) one, minimized through a non-lineariterative local optimization scheme. After initial successes, FWI encountered diﬃculties re-lated to its computational cost and the building of the long spatial wavelength componentsof the subsurface model from reﬂected waves (Tarantola, 2005). The investigations contin-ued, moving from the inversion of seismic reﬂected waves to the inversion of transmittedand diving waves, and starting the inversion using the low temporal frequencies of the dataand frequently an oﬀset selection (Pratt, 1990, 1999; Bunks et al., 1995; Sirgue and Pratt,2004). From the mid-2000’s the successes and potential of these approaches gave industrialperspectives for complex subsurface imaging (Operto et al., 2004; Brenders and Pratt, 2007;Virieux and Operto, 2009). The computational cost remained a blocking factor in the caseof 3D subsurface models, but this was rapidly overcome thanks to speciﬁc implementationsand the increase in computational capabilities (Vigh and Starr, 2007; Sirgue et al., 2009;Plessix, 2009). Since the late-2000’s, FWI has been established as an essential componentof the subsurface (or “velocity”) model building toolbox used in the seismic imaging indus-try. Its capability to recover complex and high-resolution subsurface models has made it afavored tool.However, issues still remained. Firstly, the LSQ cost function is very sensitive to cycle-skipping (because of the oscillatory nature of seismic data) so that the local optimizationcan easily get trapped into a local minimum (Virieux and Operto, 2009). Secondly, theLSQ cost function is very sensitive to amplitudes so that spurious inversion results canbe obtained when the assumptions used in the forward modelling do not make it possibleto “quantitatively reproduce” the amplitudes in the data (most implementations of FWIare currently based for example on an acoustic forward modelling while the subsurfaceis elastic). As mentioned above, starting the optimization from a good initial model, in-verting ﬁrst the low temporal frequency components of the data (Bunks et al., 1995) andusing mainly diving waves reduces the sensitivity to cycle-skipping (Tarantola, 2005), whiledata preprocessing (in particular trace normalization) reduces the sensitivity to amplitude.However, these ad-hoc strategies do not completely solve the problem and require a carefulimplementation. As a consequence, a search for complementary solutions, in particular interms of more suitable cost functions, has emerged as a very active research ﬁeld. Theexpected features of alternative cost functions are: an increased sensitivity to the kine-matic information contained in the data (or time shifts between observed and modelleddata) and a relaxed sensitivity to the amplitude information. Many proposals were madearound the early-2010’s (Shin and Ha, 2008; Van Leeuwen and Mulder, 2010; Luo and Sava,2011; Warner and Guasch, 2016), following some pioneering work (Luo and Schuster, 1991).The ﬁrst proposal of a cost function inspired from optimal transport (OT) was made byEngquist and Froese (2014).M´etivier et al. (2016b) provide the state-of-the-art of OT-based cost functions and theirﬁrst usage in FWI. They mention the original work of Gaspard Monge (18th century) aim-ing at ﬁnding the optimal way to transport piles of sand by minimizing the expended energyor cost (Monge, 1781). Mathematically, the distribution of masses of initial and ﬁnal pilesof sand are described by probability density functions (PDFs) (positive, with unit “mass”).The transportation is described as a mapping between the initial and ﬁnal PDFs. Min-imizing the transportation cost involves the Monge-Amp`ere equation in the general case.Various reformulations of the original problem exist, that can usually be related to Wasser-stein distances. Regarding FWI, the approach proposed by Engquist and Froese (2014);Engquist et al. (2016); Yang and Engquist (2018); Yang et al. (2018) and recently appliedto ﬁeld data by Wang and Wang (2019) is based on a 2-Wasserstein distance. Because ofthe computational cost of the resolution of the Monge-Amp`ere problem, the approach cur-rently only aﬀords for a trace-by-trace comparison in the 2-Wasserstein distance for largescale applications, not a comparison of a set of traces as in a common shot gather. So, the2ormulation does not use the property of OT-based cost functions to eventually be multidi-mensional in the data coordinate space (not to be confused with the dimensionality of thesubsurface model), i.e. to account for correlations between traces (multidimensional) andnot only within one trace (one-dimensional). Also, within this formulation, the requisitefor positive values and mass conservation imposes ad-hoc transformations of seismic traces.In this context, M´etivier et al. (2016a,b,c) proposed an alternative to this ﬁrst familyof OT-based cost functions, allowing the ad-hoc transformations of seismic traces to bebypassed. They started from the dual formulation of the 1-Wasserstein distance, calledthe Kantorovich-Rubinstein (KR) formulation, that leads to a maximization problem over1-Lipschitz functions. More speciﬁcally, they proposed to compute a distance betweenseismic data using the so-called KR norm (Hanin, 1992; Villani, 2003; Lellmann et al.,2014). Adding a bounding constraint allows the direct use of the seismic data withoutany transformation. An important feature is that a version of the KR norm exists thatcan be very eﬃciently computed using the simultaneous descent method of multipliers(SDMM) iterative algorithm (Combettes and Pesquet, 2011; M´etivier et al., 2016b). Asa consequence, the use of the KR norm is in practice not limited to a trace-per-tracecomparison. A multidimensional formulation in the data coordinate space is feasible forlarge scale applications, which can account for correlations in the oﬀset direction betweentraces in a common shot gather, which is not the case with an approach based on a 2-Wasserstein distance as Engquist and Froese (2014). As shown in M´etivier et al. (2016a)the KR norm and its numerical approximation does not lead to a full convexity whencomparing two shifted Ricker wavelets. It however becomes more convex when the numberof iterations of the SDMM algorithm increases (the secondary minima tend to be pushedout) and outperforms the convexity of LSQ while keeping a sharp valley of attraction(it can tend at convergence to a cost function valley almost twice wider than the LSQone). Recently, this approach led to numerous successful industrial FWI applications, seee.g. Poncet et al. (2018); Messud and Sedova (2019); Sedova et al. (2019); Hermant et al.(2019); Carotti et al. (2020); Hermant et al. (2020). Compared to LSQ FWI, an interestingreduction in sensitivity to cycle-skipping has been observed, together with an improvedstructural consistency in inverted subsurface models. These behaviors can be related to thespeciﬁc nature of the KR FWI adjoint-source, which is the gradient of the cost function inthe data space, that is back-propagated within FWI. Compared to the LSQ FWI adjoint-source, the KR FWI adjoint-source exhibits an enhancement in low frequencies, balancingof amplitudes and an increase in coherency along the events (i.e. in the moveout direction)when using the multidimensional formulation (Messud and Sedova, 2019).This paper deals with the use of the KR norm in FWI, in the continuation of the workof M´etivier et al. (2016a,b,c). We ﬁrstly focus the analysis on the KR adjoint-source. Weclarify theoretical aspects related to the establishment of the expression of the adjoint-source. We rigorously deﬁne the “texture” of the KR adjoint-source (piecewise linearity,lower frequency content, reduced amplitude dynamics) and emphasize the beneﬁt of havinga multidimensional implementation. We demonstrate the interest of the KR adjoint-sourcefor FWI in terms of “physical” characteristics, explain the meaning of the tuning param-eters (critical for a successful implementation) and give a set of practical rules for settingthese parameters. Then, we present a set of synthetic and ﬁeld data FWI examples demon-strating the eﬀectiveness of the approach over LSQ FWI, as well as the beneﬁt of themultidimensional KR FWI approach over the one-dimensional one. Finally, to overcomesome limitations, we discuss the possibility to combine the approach with a kinematictransformation applied to data.Before presenting our contributions, we start with reminders on FWI and OT formalisms( § . Theory A seismic survey corresponds to a series of recorded shots. The data space related to onecommon shot gather is denoted by D ( X ) ⊂ R ( X ) . (1)It is indexed on X = [ H minxl , H maxxl ] × [ H mininl , H maxinl ] × [0 , T ] ⊂ R that represents the setrelated to the positions in a common shot data. H minxl , H maxxl and H mininl , H maxinl denote theminimum and maximum receiver positions, respectively in the crossline and inline direc-tions. T denotes the maximum receiver recording time. x = [ x xl , x inl , x t ] t ∈ X represents a3 dimensional (3D) vector, where t denotes the transpose and x xl , x inl , x t are scalars. In thefollowing, we call X the (common shot) data “coordinate space”, considered as continuouswith the associated Lebesgue measure denoted by µ ( x ). However, the considerations in thisarticle generalize to discrete sets X simply considering µ ( x ) to be the counting measure .A seismic observed shot data, here considered as a scalar ﬁeld measured at the Earth’ssurface during a seismic experiment, is described by f obs : X → R , f obs ∈ D ( X ). In FWI,modelled shot data is produced by solving the wave-equation in a given subsurface model m ∈ M , where M denotes the model space , and extracting the result at the Earth’ssurface. Such a modelled shot data is described by f [ m ] : X → R , f [ m ] ∈ D ( X ). The FWIproblem consists in ﬁnding m ∗ ∈ M which best explains all the observed common shotdata, i.e. makes f [ m ∗ ] as close as possible to f obs for all shots from the point of view of agiven similarity measure in the data space (Tarantola, 2005). We denote such a similaritymeasure, also called cost function, by a functional J : [ D ( X ) × D ( X )] N shots → R + , where N shot represents the number of shots; the latter dependency is implicit in the following tolighten the notations. We resolve m ∗ = argmin m ∈ M J ( f [ m ] , f obs ) . (2)The dimensionality of the problem (in data and model spaces) requires to use an iterativelocal optimization scheme, based on the adjoint-state method (Plessix, 2006; Virieux andOperto, 2009). Firstly, a forward propagation consists of resolving the wave-equation tocompute f [ m ] and the forward-propagated ﬁeld is extracted at the Earth’s surface. Sec-ondly, the gradient in the data-space or “adjoint-source” is computed, i.e. ∂J ( f, f obs ) ∂f ( x ) (cid:12)(cid:12)(cid:12) f = f [ m ] , (3)and back-propagated using the time-reversed wave-equation to obtain the back-propagatedwaveﬁeld. Thirdly, the forward-propagated waveﬁeld and the second temporal derivativeof the back-propagated waveﬁeld are correlated in the time direction (when the wave-equation is considered in the time domain). This allows a “conversion” of the data-spacegradient, eq. (3), into the model-space gradient ∂J ( f [ m ] , f obs ) /∂m . The latter may thenbe preconditioned and ﬁnally a line-search is applied to give the correct magnitude to theresulting descent direction. These are the main features of the method (Plessix, 2006; For instance taking X = { H xl ( i xl ) , i xl = 1 ..N xl } × { H inl ( i inl ) , i inl = 1 ..N inl } × { T ( i t ) , i t = 1 ..N t } , with H xl (1) = H minxl , H xl ( N xl ) = H maxxl , H inl (1) = H mininl , H inl ( N inl ) = H maxinl , T (0) = 0 and T ( N t ) = T . N xl × N inl then represents the total number of traces in the considered common shot data. N xl × N inl × N t represents the total number of samples in the common shot data, i.e. the dimensionality of the dataspace D ( X ), typically equal to 10 − . The associated counting measure is deﬁned by µ ( x ) = (cid:80) N xl i xl =1 (cid:80) N inl i inl =1 (cid:80) N t x t =1 δ ( x xl − H xl ( i xl )) δ ( x inl − H inl ( i inl )) δ ( x t − T ( i t )). Typically a grid with 10 − samples. J aﬀects the method directly only through the adjoint-source computation,eq. (3). So, to understand the interest of a given cost function, interpreting how theadjoint-source is impacted is a key.Common FWI cost functions (and thus adjoint-source forms) are based on L p norms( p ≥

1) through J ( f [ m ] , f obs ) = || f [ m ] − f obs || pp with || f || p = (cid:16) (cid:90) X | f ( x ) | p dµ ( x ) (cid:17) /p ⇒ ∂J ( f, f obs ) ∂f ( x ) = sign( f ( x ) − f obs ( x )) × | f ( x ) − f obs ( x ) | p − , (4)for D ( X ) ⊆ L p ( X ) where L p ( X ) denotes the space of functions whose moments of or-der p are integrable. At the limit p → ∞ , the “uniform norm” is deﬁned by || f || ∞ =max x ∈ X | f ( x ) | . The p = 2 case leads to the most common FWI cost function, called least-squares (LSQ). The LSQ adjoint-source is equal to f [ m ] − f obs and called the “residual”.Let us come back to the (data) coordinate space X . As already mentioned, the dimen-sionality of X is 3 for common shot data (time direction, and receiver crossline and inlinedirections). However, if an application utilizes the data trace-per-trace independently, theeﬀective dimensionality of X becomes 1 (time direction only enters into the computation).If an application utilizes the data inline-per-inline independently, the eﬀective dimensional-ity of X becomes 2 (inline and time directions only enter into the computation). Hereafter,when we mention multidimensionality (multiD), we refer to the eﬀective dimensionality ofthe (data) coordinate space X (not to be confused with the dimensionality of the data space D ( X ) or of the model space).It is obvious from from eq. (4) that L p -based cost functions measure a point-wise simi-larity so that their eﬀective dimensionality in the coordinate space is 0; in other words, nocorrelation between diﬀerent positions in the time-direction are explicitly accounted for inthe cost function. This is a fundamental reason for the high sensitivity to cycle-skipping ofthe L p -based cost functions. Another potential issue is that these cost functions tend to bevery sensitive to amplitudes; this sensitivity can somewhat be reduced by making p veryclose to 1. Indeed, all amplitudes are equalized in the L adjoint-source, sign( f [ m ] − f obs ),eq. (4). But this can lead to convergence diﬃculties and is still plagued by cycle-skipping.In contrast, OT-based cost functions are explicitly at least 1D in the coordinate space,allowing them to explicitly account for correlation at least in the time direction, i.e. be-tween an observed full trace and the corresponding modelled full trace. This provides moresensitivity to time shifts between the data events, thus more robustness to cycle-skipping(Engquist and Froese, 2014; Yang et al., 2018; M´etivier et al., 2016c; Poncet et al., 2018;Messud and Sedova, 2019). Furthermore, these cost functions can even be multiD (2D or3D) in the coordinate space, allowing the exploitation of the correlations between manyobserved traces and the corresponding modelled traces, i.e. the coherency in the moveoutdirection; also, OT-based cost functions are less sensitive to amplitudes (M´etivier et al.,2016c; Poncet et al., 2018; Messud and Sedova, 2019). They thus seem a very good candi-date to overcome the limitations of L p -based cost functions.However, there are two possible obstacles to the application of OT to large scale FWIproblems. OT is originally designed to compare probability distributions, not signed andoscillatory functions like the seismic data. Also, the OT algorithm originally proposed forFWI (Engquist and Froese, 2014) is for now viable in an industrial context only for a 1Dcoordinate space, i.e. trace-per-trace comparison. In the next section, we review the generalOT formalism and a formulation recently proposed by M´etivier et al. (2016b,c) that canovercome these obstacles. 5 .2. From optimal transport to the Kantorovich-Rubinstein norm We plan to use OT within the FWI framework. The latter is deterministic whereas OT isoriginally a statistical formalism that measures a similarity between probability distribu-tions. So, let us start with some probability-related considerations. We consider probabilitydistributions that are absolutely continuous with respect to the measure µ ( x ). These prob-ability distributions can be represented by PDFs, denoted by ρ : X → R + . They have aunit “mass”: (cid:82) X ρ ( x ) dµ ( x ) = 1.We provide the coordinate space X with a metric d : X × X → R + . We consider d p ( x, y )to represent the cost of displacing information from x to y in the coordinate space. P p ( X )denotes the space of PDFs with ﬁnite d -moment of order p , i.e. (cid:82) X d p ( x (cid:48) , x ) ρ ( x ) dµ ( x ) < + ∞ for some (thus any) x (cid:48) ∈ X (Villani, 2008). OT allows the computation of a cost of “trans-porting” the PDF ρ ∈ P p ( X ) onto the PDF ρ ∈ P p ( X ) based on the coordinate spacecost d p . Kantorovich’s formulation represents the most general OT form (Kantorovich,1942; Villani, 2008): W pd p ( ρ , ρ ) = min π ∈ Π( ρ ,ρ ) (cid:90) X × X d p ( x, y ) π ( x, y ) dµ ( x ) dµ ( y ) , (5)where π represent “transference plans” belonging to the set of PDFs on X × X that havemarginals ρ and ρ :Π( ρ , ρ ) = (cid:110) π ∈ P p ( X × X ); ∀ x ∈ X : (cid:90) X π ( x, y ) dµ ( y ) = ρ ( x ) , ∀ y ∈ Y : (cid:90) X π ( x, y ) dµ ( x ) = ρ ( y ) (cid:111) . (6) π contains information on how to transport ρ onto ρ . W d p ( ρ , ρ ) deﬁnes for p ≥

1a distance between the PDFs ρ and ρ , called “ p -Wasserstein” distance for the metric d . Eq. (5) represents a linear programming problem (minimization of a linear functionalwith linear equality constraints). Despite the linearity, this problem can easilly becomeintractable because it implies a search in the space P p ( X × X ) that has large dimensionalityin the seismic case. So, let us discuss more tractable reformulations.Monge’s problem (the original OT formulation) is a restriction of Kantorovich’s problemto transference plans of the form π ( x, y ) = ρ ( x ) δ ( y − T ( x )) where T : X → X is aninvertible function, giving (Villani, 2008): (cid:99) W pd p ( ρ , ρ ) = min T ∈ Σ( ρ ,ρ ) (cid:90) X d p ( x, T ( x )) ρ ( x ) dµ ( x ) , (7)where the map T is constrained to transport ρ onto ρ :Σ( ρ , ρ ) = (cid:110) T : X → X ; ∀ x ∈ X : ρ [ T − ( x )] = ρ ( x ) (cid:111) . (8)Monge’s problem represents a restriction of Kantorovich’s problem and cannot always besolved. However, it implies a search in the space of bijective functions deﬁned on X , thatis smaller than the Kantorovich search space. Despite this advantage, the constraint in eq.(8) is non-linear and diﬃcult to enforce as soon as the coordinate space is multiD, becauseit then involves resolving the Monge-Amp`ere equation. In practice, only the 1D coordinatespace version has been used for industrial FWI applications, considering a trace-per-tracecomparison, i.e. explicit correlations only in the time-direction. Used together with atransformation of each seismic trace into a PDF, here called a “PDF-transformation”, thisformulation proved to reduce the cycle-skipping issue in FWI (Engquist et al., 2016; Yangand Engquist, 2018; Yang et al., 2018; Wang and Wang, 2019). The limitations of thisscheme are that in practice it does not allow the exploitation of multidimensionality in thecoordinate space and that using “PDF-transformations” may lead to diﬃculties with ﬁelddata (M´etivier et al., 2018). 6 simpliﬁcation of the Kantorovich problem occurs when the metric d is lower semi-continuous and p = 1. This allows to switch to the Kantorovich-Rubinstein (KR) dualformulation, deﬁned for ρ and ρ in the P ( X ) space by (Villani, 2008): W d ( ρ , ρ ) = max ϕ ∈ Lip( d, (cid:90) X ϕ ( x )( ρ ( x ) − ρ ( x )) dµ ( x ) , (9)where we denote by Lip( d, α ) the set of α -Lipschitz functions on X with respect to thedistance d :Lip( d, α ) = (cid:110) ϕ : X → R ; max x (cid:54) = y | ϕ ( x ) − ϕ ( y ) | d ( x, y ) ≤ α, ϕ ∈ L (cid:0) | ρ − ρ | dµ (cid:1)(cid:111) . (10) ϕ is constrained to be “slowly” varying, the absolute value of its “local slopes” beingbounded by α = 1. The KR problem implies a search in the space of 1-Lipschitz functions on X (that must be integrable for the measure | ρ − ρ | dµ , hence belong to L (cid:0) | ρ − ρ | dµ (cid:1) ). TheLipschitz constraint can be recast into linear constraints and leads to a linear programmingproblem that is numerically manageable even in the multiD case. Another nice featureof this formulation is that it is well deﬁned not only for ( ρ , ρ ) ∈ P ( X ) × P ( X ), butalso for any function ( f , f ) ∈ L ( X ) × L ( X ) provided they have the same “mass”,i.e. (cid:82) X f ( x ) dµ ( x ) = (cid:82) X f ( x ) dµ ( x ). This is still too constraining for seismic data (evenif oscillating, seismic data cannot be considered to have approximately null mass withinFWI, especially as many applications use the low frequencies and mutes). When f and f do not have same mass, the ϕ inverted in eq. (10) becomes singular; a solution is to add abounding constraint to ϕ (Hanin, 1992; Lellmann et al., 2014). Introducing the “ λ -bounded α -Lipschitz” setBLip( d, α, λ ) = (cid:110) ϕ : X → R ; max x (cid:54) = y | ϕ ( x ) − ϕ ( y ) | d ( x, y ) ≤ α, || ϕ || ∞ ≤ λ, ϕ ∈ L (cid:0) | f − f | dµ (cid:1)(cid:111) , (11)we now resolve for any function ( f , f ) ∈ L ( X ) × L ( X )˜ W d ( f , f ) = max ϕ ∈ BLip( d, ,λ ) (cid:90) X ϕ ( x )( f ( x ) − f ( x )) dµ ( x ) . (12)The absolute value of the “local slopes” of ϕ is as above constrained to be bounded by α , andin addition the absolute value of ϕ is constrained to be thresholded by λ . ˜ W d still deﬁnesa distance between f and f , induced by a norm called the “KR norm”; ˜ W d representsa generalization of W d and some relationship between both can be found in (Hanin, 1992;Villani, 2003; Lellmann et al., 2014). Solving for eq. (12) is still a linear programmingproblem.Let us now consider coordinate space distances d induced by a norm, i.e. d ( x, y ) = || x − y || ;( X, d ) then becomes a Banach space (Rudin, 1991; Brezis, 2020). d is typically induced bya Mahalanobis-like L ( X ) p norm ( p ≥ : || x || → || x || ( X ) p = (cid:16) | x xl | p σ pxl + | x inl | p σ pinl + | x t | p σ pt (cid:17) /p . (13)The σ , with ∞ > /σ >

0, denote standard-deviation-like weights that can account foruncertainties and rescale the diﬀerent physical dimensions between crossline and inlinepositions, and time. At the limit p → ∞ , the coordinate space “uniform norm” is deﬁnedby || x || ( X ) ∞ = max( | x xl | σ xl , | x inl | σ inl , | x t | σ t ). The coordinate space L ( X ) p -norm || x || ( X ) p , eq. (13), is not to be confused with the data space L p -norm || f || p , eq. (4), hence the superscript ( X ) to make this explicit. d → || . || ( X )1 that they argued to be the most interesting compu-tationally speaking; this choice will be implicit in the following. Theses authors proposedto use the SDMM algorithm to eﬃciently resolve the problem and underlined the manyinteresting properties of the KR norm. It allows the direct use of seismic data withoutapplication of a PDF-transformation as required by the approaches derived from Engquistand Froese (2014). Reduced sensitivity to cycle-skipping and amplitudes are satisfyinglyachieved, mostly thanks to the 1-Lipschitz constraint that brings low frequencies and tendsto equalize the amplitudes in the adjoint-source (M´etivier et al., 2016b,c; Poncet et al.,2018; Messud and Sedova, 2019). Fig. 1 illustrates that the KR norm global minimumvalley outperforms the convexity of LSQ, giving a global minimum valley that is almosttwice the width. So far, it has been the only OT-based scheme implemented numerically inmultiD in an industrial context, allowing a study of the advantage of taking into accountthe correlations of the events within common shot data; also, the corresponding tuningparameters are few and have a clear physical interpretation (Poncet et al., 2018; Messudand Sedova, 2019). Before detailing these points, we ﬁrstly clarify formal aspects relatedto the KR norm adjoint-source computation and precisely deﬁne the KR norm adjoint-source “texture”, that represent the keys to better understanding the interest regarding thecycle-skipping problem.Figure 1: Comparison of the KR norm and LSQ cost function values for 2 shifted Rickerwavelets, the horizontal axis being the shift value. The global minimum valley isalmost twice the width, which is very interesting for FWI especially when dealingwith the low frequencies.

3. On the exact KR FWI adjoint-source

We consider a given shot data f obs ∈ L ( X ). Eq. (12) with d → || . || ( X )1 can be rewrittenfor all f ∈ L ( X ) ˜ W d ( f, f obs ) = (cid:90) X ϕ max [ f ]( x )( f ( x ) − f obs ( x )) dµ ( x )8 max [ f ] = argmax ϕ ∈ BLip( || . || ( X )1 , ,λ ) (cid:90) X ϕ ( x )( f ( x ) − f obs ( x )) dµ ( x ) . (14)The KR adjoint-source, eq. (3), is thus deﬁned by ∂ ˜ W d ( f, f obs ) ∂f ( x ) = ϕ max [ f ]( x ) + (cid:90) X ∂ϕ max [ f ]( x (cid:48) ) ∂f ( x ) ( f ( x (cid:48) ) − f obs ( x (cid:48) )) dµ ( x (cid:48) ) . (15)The second term, (cid:82) X ∂ϕ max [ f ]( x (cid:48) ) ∂f ( x ) ( f ( x (cid:48) ) − f obs ( x (cid:48) )) dµ ( x (cid:48) ), is deﬁned provided ϕ max [ f ] is dif-ferentiable and has been neglected in M´etivier et al. (2016a,b,c). In this section, we demon-strate that the diﬀerential ∂ϕ max [ f ]( x (cid:48) ) /∂f ( x ) exists and can be neglected, fundamentallyjustifying the following form for the KR adjoint-source: ∂ ˜ W d ( f, f obs ) ∂f ( x ) = ϕ max [ f ]( x ) . (16)To achieve this goal, it is suﬃcient to prove ∃ γ > , ∀ h ∈ L ( X ) such as || h || < γ : ϕ max [ f + h ]( x ) = ϕ max [ f ]( x ) , (17)where the equality needs to hold almost everywhere “only” and ϕ max [ f + h ] = argmax ϕ ∈ BLip( || . || ( X )1 , ,λ ) (cid:90) X ϕ ( x )( f ( x ) − f obs ( x ) + h ( x )) dµ ( x ) . (18)In other terms, if ϕ max is invariant under any inﬁnitesimal perturbations of the residual f − f obs , the diﬀerential ∂ϕ max [ f ]( x (cid:48) ) /∂f ( x ) exists and is null. Intuitively, as we consider amaximization problem under constraints, eq. (17) would not be satisﬁed where the problem“hesitates” between diﬀerent constraints saturations, so that inﬁnitesimal perturbations ofthe residual may lead to “jumps” in the inverted ϕ max . Our following goal is to qualifysubsets where this behavior can occur, i.e. where eq. (17) cannot be satisﬁed nor thediﬀerential deﬁned.To simplify, we ﬁrstly deﬁne a subset X sub ⊆ X where it is suﬃcient to demonstrateeq. (17), called the subset that “matters”. To simplify the following notations, we denote the residual by∆ f = f − f obs . (19)We denote by X (∆ f ) null the subset of X where the residual is inﬁnitesimal: X (∆ f ) null = { x ∈ X ; ∀ ε > | ∆ f ( x ) | ≤ ε } , (20)and by X (∆ f ) null − tap ⊆ X (∆ f ) null where the spatial gradient of the residual is also inﬁnitesimal: X (∆ f ) null − tap = { x ∈ X (∆ f ) null ; ∀ ε > (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∂∂x ∆ f ( x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( X ) ∞ ≤ ε } . (21)For realistic seismic data, X (∆ f ) null − tap mostly corresponds to the area before the ﬁrst arrivaland the mute areas, where ϕ max is usually tapered to zero, hence the “ null − tap ” name.However, more rare with realistic data, X (∆ f ) null − tap also gathers areas where the residualstays very small for any reason due to the subsurface or the acquisition. For seismic data,9 (∆ f ) null \ X (∆ f ) null − tap (i.e. the complement of X (∆ f ) null − tap in X (∆ f ) null ) has a null Lebesgue measurein X (∆ f ) null , thus in X (as it “just” represents where the oscillatory seismic signal passesthrough 0). Fig. 2 gives an illustration of these sets on a simple synthetic trace.What happens in the whole X (∆ f ) null subset has no real importance for FWI coniderations.It is thus suﬃcient to ensure that eq. (17) is satisﬁed only ∀ x ∈ X \ X (∆ f ) null , called the subsetthat “matters”. Note that this subset remains stable under inﬁnitesimal perturbations of∆ f . To ease the next formal considerations, we rewrite the 1-Lipschitz constraint of eq. (11)in a form that is valid almost everywhere. We consider d induced by the L ( X )1 norm, i.e p = 1 in eq. (13). Appendix A demonstrates that, for a diﬀerentiable ϕ ( x ), the 1-Lipschitzconstraint can be rewrittenmax x (cid:54) = y | ϕ ( x ) − ϕ ( y ) ||| x − y || ( X )1 = max x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( X ) ∞ ≤ , (22)with (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( X ) ∞ = max (cid:16) σ xl (cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x xl (cid:12)(cid:12)(cid:12) , σ inl (cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x inl (cid:12)(cid:12)(cid:12) , σ t (cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x t (cid:12)(cid:12)(cid:12)(cid:17) . Thus, when the L ( X )1 norm is considered in the 1-Lipschitz constraint, the correspondingdual norm representation must be considered for the derivative ∂ϕ ( x ) /∂x , i.e. the L ( X ) ∞ norm with non-inverted “ σ ”.As the 1-Lipschitz constraint implies strong continuity, thus diﬀerentiability of ϕ ( x ) al-most everywhere, and as our next considerations “just” need to be valid almost everywhere,we propose to consider ∂ϕ ( x ) /∂x in the following. Then, the 1-Lipschitz constraint impliesalmost everywhere in X (which is implicit from now)max (cid:16) σ xl (cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x xl (cid:12)(cid:12)(cid:12) , σ inl (cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x inl (cid:12)(cid:12)(cid:12) , σ t (cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x t (cid:12)(cid:12)(cid:12)(cid:17) ≤ . (23) In this section, we formally study the “shape” of ϕ max . This will allow us to prove eq. (17)but also to draw interesting lessons like rigorously deﬁning a “texture” for the KR adjoint-source. We start by simpliﬁed cases before considering the “full” case. We ﬁrstly consider only the thresholding constraint || ϕ || ∞ ≤ λ in eqs. (11) and (14),i.e. removing the Lipschitz constraint. Appendix B demonstrates that the solution of theproblem in eq. (14) is then x ∈ X \ X (∆ f ) null with ∆ f ( x ) > ϕ max ( x ) = λ (24)∆ f ( x ) < ϕ max ( x ) = − λ. This result deﬁnes ϕ max in X \ X (∆ f ) null , where it clearly remains stable under inﬁnitesimalperturbations of ∆ f . We thus can deduce eq. (17) in the subset that matters X \ X (∆ f ) null .Note that the thresholding constraint saturates everywhere in X \ X (∆ f ) null , see Fig. 2 foran illustration on a simple synthetic trace (shifted Rickers). Note also that the solution ofeq. (24) can be rewritten x ∈ X \ X (∆ f ) null : ϕ max [ f ]( x ) = λ sign( f ( x ) − f obs ( x )) = λ ∂∂f ( x ) || f − f obs || . (25)10his simpliﬁed case is thus equivalent to the L norm cost function case, up to a multiplica-tive factor λ that has no impact because of the line-search that will recompute the globalscaling. We now consider only the 1-Lipschitz constraint in eqs. (11) and (14), i.e. remove thethresholding constraint, which implies that the residual satisﬁes (cid:90) X ∆ f ( x ) dµ ( x ) = 0 (26)to avoid singularities, remind § X = [0 , T ], in the time direction, eq. (23)implies ∀ x t ∈ X : σ t (cid:12)(cid:12) ∂ϕ ( x t ) ∂x t (cid:12)(cid:12) ≤

1. Appendix B demonstrates that the solution of theproblem in eq. (14) is then x t ∈ X \ X (∆ F β ) null with ∆ F β ( x t ) < β + ( x t ) = − ∆ F β ( x t ) , β − ( x t ) = 0 , σ t ∂ϕ max ( x t ) ∂x t = 1∆ F β ( x ) > β − ( x t ) = ∆ F β ( x t ) , β + ( x t ) = 0 , σ t ∂ϕ max ( x t ) ∂x t = − , (27)where ∆ F β ( x t ) = (cid:90) x t ∆ f ( x ) dµ ( x ) − (cid:16) β + (0) − β − ( T ) (cid:17) (28)deﬁnes the set X (∆ F β ) null through eq. (20). Eq. (27) contributes to deﬁne ϕ max in X \ X (∆ F β ) null but implies the resolution of a self-consistent problem. Indeed, β + , β − and ϕ max are coupledthrough ∆ F β , but also through the necessary continuity of ϕ max in the whole set X , i.e.also in X (∆ F β ) null .So, even if these equations would not lead to the most practical numerical scheme, theyallow us to draw interesting formal conclusions: • Eq. (27) clearly remains stable under inﬁnitesimal perturbations of ∆ F β , eq. (28),the latter remaining stable under inﬁnitesimal perturbation of ∆ f . This implies thateq. (17) is satisﬁed in X \ X (∆ F β ) null , and thus in X \ X (∆ f ) null (as X \ X (∆ F β ) null tends to be“larger” for seismic data). Appendix B gives further analysis. • The derivative constraint saturates almost everywhere in X \ X (∆ F β ) null , thus in X \ X (∆ f ) null .But, in most cases, the derivative of ϕ max will also tend to saturate or to be taperedin X (∆ f ) null , which is illustrated in Fig. 2 on a simple synthetic trace (shifted Rick-ers). Indeed, even if X (∆ f ) null does not play an explicit role in eq. (27), it plays arole through the necessary continuity constraint on ϕ max and the maximization of (cid:82) X ϕ ( x t )∆ f ( x t ) dµ ( x t ). This will tend to produce a piecewise linear shape for ϕ max in the whole X , as visible in Fig. 2. • Fig. 2 also shows how the Lischitz constraint saturation leads to a reduction inamplitude dynamics in ϕ max compared to ∆ f . • The “oscillations” of ϕ max are driven by ∆ F β through eq. (27), thus among others bythe integral of ∆ f , eq. (28). This should give ϕ max a lower frequency content than∆ f , which can be deduced from Fig. 2 and is quantiﬁed in Fig. 3. We observe that ϕ max has its spectrum shifted towards the low frequencies, beyond the low frequenciesof the integrated LSQ residual. 11 The argmax (i.e. ϕ max ) of the considered problem is not unique (Villani, 2008). Inparticular, any ϕ ( c ) max ( x ) = ϕ max ( x ) + c, ∀ c ∈ R , (29)clearly remains a solution of eqs. (14) and (10), or of eqs. (27) and (28), becauseeq. (26) must be satisﬁed. The constant c does not change the value of the max, i.e.of ˜ W d , but only the argmax. So, in schemes that use the argmax (like FWI where theargmax is interpreted as the adjoint-source), the constant may play a role, possiblyproducing a low frequency smearing. As demonstrated in Appendix B, considering both the thresholding and the 1-Lipschitzconstraints in eqs. (11) and (14) leads to similar lessons than these underlined above: • We deduce that eq. (17) is satisﬁed in the subset that matters X \ X (∆ f ) null , concludingthe proof that the KR adjoint-source is given by eq. (16). • The 1-Lipschitz constraint or the thresholding constraint tend to saturate (not si-multaneously) almost everywhere in X \ X (∆ f ) null and also in most of X (∆ f ) null . This willtend to produce a piecewise linear shape for ϕ max in the 1D case, as illustrated inFig. 4. Note that, in the multiD case, the piecewise linear shape becomes approx-imate. Indeed, at least one component of the 1-Lipschitz constraint then needs tosaturate at a given position (mostly the time direction in our implementation), butnothing constrains many components to saturate simultaneously in the general case.This will be further discussed in the next section. • Visible in Fig. 4 and for the reason mentioned in § ϕ max has a lower frequency content than the LSQ residual. • Fig. 4 also shows how the constraints saturation lead to a reduced amplitude dynamicsin ϕ max compared to ∆ f , the thresholding constraint parameter λ allowing furthercontrol of these dynamics (in the extreme case of a very small λ , the KR adjoint-source would become equivalent to the L adjoint-source up to a global proportionalityconstant). • These three previous items (piecewise linearity, lower frequency content, reduced am-plitude dynamics) represent what we call KR “texture” for the adjoint-source in thefollowing. It is favorable for reducing the sensitivity of FWI to cycle-skipping andamplitudes. • The thresholding constraint, making eq. (26) useless, limits the additive constant (ornull frequency) issue discussed in § λ value does not provide enough constraint and eq. (26) approximately holds, theadditive constant issue can still exist, possibly producing a low frequency smearing inthe FWI adjoint-source. A possible eﬀect will be discussed later in § • Eq. (17) can remain valid for non-inﬁnitesimal perturbations h , hence the oftenmentioned robustness of the KR adjoint-source to noise. Fig. 5 gives an illustrationfor a quite strong uniform noise. We observe that the KR adjoint-source is perturbedby the presence of noise mostly in X (∆ f ) null , that does not really matter, and tends toremain very stable in the subset that matters X \ X (∆ f ) null .12 Finally, we mention that the approximate piecewise linear shape for ϕ max in themultiD case is due to the use of a distance d induced by the L ( X )1 norm in the 1-Lipschitz constraint of eq. (11), that leads to the constraint in eq. (23). If a L ( X ) p norm with p ∈ ]1 , ∞ [ was used instead, the constraint in eq. (23) would have to bereplaced by the following constraint (as demonstrated in Appendix A) (cid:16) σ qxl (cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x xl (cid:12)(cid:12)(cid:12) q + σ qinl (cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x inl (cid:12)(cid:12)(cid:12) q + σ qt (cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x t (cid:12)(cid:12)(cid:12) q (cid:17) /q ≤ /p + 1 /q = 1 . (30)The saturation of this constraint would not lead to approximate piecewise linearity inthe general multiD case, as the derivatives in the various coordinate space directionswould be strongly coupled.Figure 2: 1D coordinate space X = [0 , T ] (time direction). Comparison of adjoint-sourcesfor a simple synthetic trace, where f obs contains a Ricker wavelet (peak frequencyat 6 Hz) and f contains the same Ricker wavelet but shifted. In blue, the LSQresidual. At the bottom the various corresponding subsets of X ; X (∆ f ) null \ X (∆ f ) null − tap are point-wise contributions (shown by the four black arrows) that have a nullLebesgue measure in X . In green, the L adjoint-source. In orange, the KRadjoint-source, which is computed using the default linear cone iterative solver ofthe open source cvxopt library (Andersen et al., 2013), ran until convergence.The KR adjoint-source tends to saturate the (1-Lipschitz or thersholding) con-straints in most of X and especially in X \ X (∆ f ) null .13 .00 6.00 12.00 18.00 24.00Frequency (Hz) N o r m a li z e d a m p li t u d e s p e c t r u m Figure 3: Comparison of the amplitude spectra of diﬀerent adjoint-sources correspondingto the same setting as Fig. 2. In blue, the LSQ residual (same spectrum as theone of a Ricker whose peak frequency is at 6 Hz). In dashed blue, the spec-trum of the integrated LSQ residual. In orange, the KR adjoint-source. The KRadjoint-source has its spectrum shifted towards the low frequencies, beyond thelow frequencies of the integrated LSQ residual. This is favorable to help overcom-ing cycle-skipping within FWI (up to the extent of the low frequencies present inthe forward propagated sources wavelet). − λ λ − λ λ A m p li t u d e Figure 4: Comparison of adjoint-sources for the same setting as Fig. 2. In blue, the LSQresidual. In orange, the KR adjoint-sources corresponding to bound λ (darkorange) and λ ( < λ , light orange). Smaller λ values make the thresholdingconstraint saturate more, producing more clipping of the adjoint-source.14igure 5: Comparison of adjoint-sources for a similar setting than in Fig. 2, with additionaluniform noise. In blue, the LSQ residual to which uniform noise has been added(the LSQ residual without noise is blue dashed). The level of noise corresponds to30% of the maximum amplitude of the LSQ residual. In orange, the KR adjoint-source inverted from noisy LSQ residual (the KR adjoint-source without noiseis orange dashed). The KR adjoint-source is perturbed by the presence of noisemostly in X (∆ f ) null , that does not really matter.Figure 6: Comparison of adjoint-sources for a similar setting than in Fig. 2. In blue, theLSQ residual. In orange, the exact KR adjoint-source. In various reds, the KRadjoint-source approximated using SDMM with 50 (light red), 500 and 5000 (darkred) iterations. The various SDMM KR adjoint-sources diﬀer from the exact KRadjoint-source mostly on X (∆ f ) null , that does not really matter.15igure 7: Comparison of adjoint-sources for the same setting as Fig. 5. In blue, the noisyLSQ residual (the LSQ residual without noise is dashed). In red, the SDMM (50iterations) KR adjoint-source inverted from the inverted the noisy LSQ residual(the corresponding SDMM KR adjoint-source without noise is dashed). TheSDMM KR adjoint-source is perturbed by the presence of noise mostly in X (∆ f ) null ,that does not really matter. Note that it seems a little less perturbed by the noise(in X (∆ f ) null ) than the exact KR adjoint-source of Fig. 5.

4. Practical aspects

Up to now, we have considered an exact resolution of the KR problem. In this section, weunderline the advantages of using an approximate resolution method, the SDMM convexoptimization with a Laplace solver (M´etivier et al., 2016b). N ≥ f as the initial function for ϕ . Thus, for N = 0, thescheme is equivalent to LSQ. Increasing N will “add” to the adjoint-source more and moreof the KR texture deﬁned in § N , which would betoo costly for large scale applications and, mostly, unecessary. Fig. 6 shows the exact KRadjoint-source and the ones approximated by SDMM ( N = 50, 500 and 5000) on a simplesynthetic trace (shifted Ricker wavelets). Interestingly, all results are very similar in thesubset that matters X \ X (∆ f ) null and diﬀer mostly on X (∆ f ) null − tap where SDMM takes more it-erations to converge. As what happens in X (∆ f ) null − tap does not really matter in practice, theadjoint-source obtained with the smaller N value, i.e. N = 50, seems suﬃcient. We ob-serve that the KR texture (piecewise linearity, lower frequency content, reduced amplitudedynamics) is mostly preserved even for N = 50.The N parameter thus has a clear meaning, controlling the desired amount of KR textureadded to the LSQ residual, allowing us to consider the KR norm as a hybrid misﬁt thatmixes OT with the LSQ cost function. This ﬂexibility is a strong point for industrialapplications. For instance, at the ﬁrst FWI iteration, more KR texture can be added byincreasing N and, as FWI converges, N may be reduced to tend to LSQ.16hat about SDMM robustness to noise? Fig. 7 illustrates the KR adjoint-source invertedby SDMM with N = 50, for a noisy LSQ residual. We observe that the KR adjoint-sourceis perturbed by the presence of noise mostly in X (∆ f ) null , that does not really matter, andtends to remain very stable in X \ X (∆ f ) null , which is satisfying. Interestingly, the SDMM KRadjoint-source seems a little less perturbed by the noise in X (∆ f ) null − tap than the exact KRadjoint-source, comparing Figs. 5 and 7.This section justiﬁes the use of SDMM to compute the KR adjoint-source far beyondcomputational cost considerations only. In practice, our optimizations and tuning of SDMMallow ta satisfying result to be reached for N = 20 to 60. We mentioned that eq. (14) can be resolved considering diﬀerent eﬀective dimensionalitiesfor the coordinate space X . In our implementation, we considered two cases:1. Each trace within a shot independently. The coordinate space is not anymore multiDbut 1 D (that is still “much more” than L p norms whose coordinate space dimen-sionality is 0 as they are computed point-wise), the eﬀective X being [0 , T ]. Onlycorrelations in the time direction can be accounted for and we call this scheme 1DKR in the following.2. Group of traces in one spatial direction, the best sampled (or less noisy) one, thus aneﬀective dimensionality of 2. Correlations in the chosen spatial and time directionscan be accounted for and we call this scheme 2D KR in the following. For instancein the marine case, we consider each receiver line within a shot independently, theeﬀective X being [ H mininl , H maxinl ] × [0 , T ].Fig. 8 illustrates this data space “splitting” for 1D and 2D KR.Figure 8: Marine ﬁeld data set corresponding to two receiver lines within a common shotgather. LSQ residual at 6 Hz, muted for the FWI, that represents the initialadjoint-source for SDMM iterations. 2D KR considers each receiver line inde-pendently and 1D KR considers each trace independently. Modiﬁed from Poncetet al. (2018) (see the article for more details).17 .3. SDMM KR illustrated on common shot data We ﬁrstly consider a 2D common shot data modelled in the Marmousi 2 model (Martinet al., 2006). The source is a Ricker wavelet with peak frequency at 6 Hz and the datahas been low-pass ﬁltered below 3 Hz to be more realistic. Fig. 9 illustrates the eﬀect ofthe SDMM number of iterations N . The results are shown with a smooth high-cut ﬁlterabove 4 Hz to make it clearer to see the wiggle traces. We can observe how more and moreKR texture is added to the KR adjoint-source as N increases. The wiggles highlight howeach KR adjoint-source trace tends to becomes more piecewise linear, especially in the timedirection that matters the most to overcome cycle-skipping.Some artefacts can be observed in Fig. 9 at large oﬀsets as the amplitudes tend tobe larger and the traces are further from null average. However, these artefacts remainvery localized. We can observe that using 2D KR diminishes the artefacts and improvesthe adjoint-source continuity compared to using 1D KR. We will detail the correspondingspeciﬁcs of 2D KR further in § N increases, and lets guess that the low frequency content is reinforced. This is conﬁrmed byFig. 10 (high-cut at 10 Hz instead of 4 Hz to make frequency spectra visualization clearer),even for the smaller N = 20 value.Secondly, we consider a ﬁeld 3D common shot data. Fig. 11 compares a LSQ residualwith the corresponding 2D KR adjoint-source computed by SDMM, with a mute appliedas usually done for FWI. Again, the 2D KR adjoint-source has more low frequencies and areduced amplitude dynamics.In this section, the KR texture for the adjoint-source computed by SDMM has beenillustrated on shot data and in the 1D and multiD cases. This texture represents a keyto mitigate the cycle-skipping and reduce the sensitivity to the amplitudes within FWI,enhancing the kinematic information present in the adjoint-source (like a smart processingof the LSQ residual). To make this point explicit, the KR texture is sometimes called“skeleton-like” texture. We now illustrate the eﬀect of the thresholding constraint on common shot data. Choosinga good λ value is important for the success of the scheme. Using a value that is too largewould lead to instabilities (possible “singularities”), as explained in § L norm cost function. A good λ valuemust let the 1-Lipschitz constraint saturate at most positions in X \ X (∆ f ) null and limit theadditive constant (or null frequency) issue discussed in § λ values would make the thresholding constraint saturate slightly more,producing a clipping of the adjoint-source and thus slightly reducing the sensitivity of theFWI to amplitudes.Fig. 12 (bottom-right) illustrates the thresholding that occurs on Marmousi data when λ is reduced; we can observe that the clipping remains slight (the amplitude dynamics are notstrong with this data) and that larger λ values may lead to a better amplitude equalizationfor a ﬁnite number of SDMM iterations (top-right).We do not observe an additive constant or null frequency in our Marmousi KR adjoint-source results, contrary to M´etivier et al. (2016b,c). It is probable that this is due to our λ tuning, that contribute to limit the additive constant eﬀect as discussed in § § N and λ parameters allow us to consider KR as a hybrid cost function thatmixes OT with the standard LSQ and L cost functions, being able to pass continuouslyfrom one to the other. 18igure 9: Marmousi 2 data set (Martin et al., 2006) (Ricker wavelet with peack frequencyat 6 Hz and frequencies below 3 Hz muted). A smooth high-cut ﬁlter above 4 Hzhas been applied to make it easier to see the wiggle traces. LSQ residual, and 1Dand 2D KR adjoint-sources (eﬀect of the number of iterations N ) are shown, forone common shot gather. 19igure 10: Same data as Fig. 9, but with the high-cut ﬁlter at 10 Hz (instead of thesmooth high-cut ﬁlter above 4 Hz). LSQ residual and 2D KR ajoint-source,together with their frequency spectra.Figure 11: 3D ﬁeld data set with a mute for a 6 Hz inversion. The two left ﬁgures repre-sent the LSQ residual and the KR adjoint-source, muted as usual for the FWI.The right ﬁgure represents the corresponding amplitude spectra. Modiﬁed fromPoncet et al. (2018) (see the article for more details).20igure 12: Same data as Fig. 9. LSQ residual and 2D KR adjoint-sources (eﬀect of λ and σ inl ) are shown. Let us come back to the chosen norm for the data coordinate space, i.e. eq. (13) with p = 1. We have, in the 2D case || x || ( X )1 = 1 σ inl (cid:16) | x inl | + v | x t | (cid:17) = 1 σ t (cid:16) v | x inl | + | x t | (cid:17) with v = σ inl σ t , (31)where σ inl (or σ t ) represents a standard-deviation-like weight in the inline (or time) direc-tion, thus has the dimension of a distance (or a time), and v represents an apparent velocityparameter.It can be demonstrated that, at full convergence (i.e. a suﬃciently large number N ofSDMM iterations), any choice for σ inl (or σ t ) is equivalent when an optimal threshold ( λ value) is considered, as the corresponding adjoint-sources would almost only diﬀer by aglobal proportionality constant compensated by the line search. But, in practice for largescale applications, we can only run a small number of iterations N (usually around 30-60).Hence, σ inl (or σ t ) can play a role, mostly related to a change of the rate of convergence.Fig. 12 illustrates the much slower convergence that occurs with a too large σ inl value (theKR adjoint-source remains much closer to the LSQ adjoint-source for a given number ofiterations N = 60). This is thus not directly related to the 2D (or multiD) behavior in thedata coordinate space. In 1D, only the parameter σ t can play this role (the parameter σ inl then does not exist). Now, let us deﬁne which parameter controls the multiD (here 2D) KR feature. For a ﬁxed σ inl (or σ t ), the velocity v in eq. (31) plays a crucial role in the success of 2D KR FWI.21t deﬁnes the average direction along which most correlations between inline traces eventsoccur, thus must be parameterized to characterize the average move-out direction.To better understand this point we note that, according to eq. (23), the 1-Lipschitzconstraint corresponding to eq. (31) implies almost everywhere in X (cid:12)(cid:12)(cid:12) ∂ϕ max ( x ) ∂x inl (cid:12)(cid:12)(cid:12) ≤ σ inl and (cid:12)(cid:12)(cid:12) ∂ϕ max ( x ) ∂x t (cid:12)(cid:12)(cid:12) ≤ vσ inl . (32)Taking a very small v value would constrain ϕ max to vary very slowly in the time direction,much more slowly than in the inline direction, leading to vertical artifacts in the KR adjoint-source. Taking a large v value would constrain ϕ max to vary more slowly in the inlinedirection, possibly leading to some horizontal artifacts. This is illustrated in Fig. 13.The optimum v value would tend to: • Favor piecewise linearity of the KR adjoint-source in the time direction, the lattercontributing the most to reduce the cycle-skipping issue. • Within these bounds, favor the balance between the ϕ max derivatives in the time andinline directions, i.e. favor the average move-out direction.Fig. 13 illustrates the avantage of 2D KR with a good v choice compared to 1D KR and LSQ,in particular in terms of adjoint-source continuity along events. Fig. 14 (top) compares, fora marine ﬁeld data with a mute, the LSQ residual and the 1D and 2D KR adjoint-sources.The noise in the data is diﬀerent from one trace to the other, degrading the continuity ofLSQ and 1D KR adjoint-sources. In contrast, the 2D KR adjoint-source with an optimum v is strongly denoised, with an increased continuity in the move-out direction and betteramplitudes balancing. This may be useful to start FWI at even lower frequency and to relaxsensitivity to amplitudes. Fig. 14 (bottom) shows how v aﬀects the 2D KR adjoint-sourceand how a good choice enhances the continuity of events. Up to now, the analysis has remained at the adjoint-source level, which has allowed us tounderstand many features of the use of the KR norm within FWI. In the rest of this article,we illustrate how these features translate into a better velocity update. A SDMM numberof iterations N = 30 is considered in all the following results.Once the gradient in the model space has been computed using the adjoint-state methoddescribed in § § v parameter) maps into a FWI model update structural coherency enhancement. Thisremains true even if there is not a lot of noise in the data, i.e. when the adjoint-sourceimprovement brought by 2D KR compared to 1D KR is more subtle than in the case ofFig. 14 for instance.Figure 13: Same data as Fig. 9. 1D KR adjoint-source and 2D KR adjoint-sources (eﬀectof v ) are shown. 23igure 14: Marine ﬁeld data at 4 Hz. LSQ residual, 1D KR adjoint-source, and 2D KRadjoint-sources (eﬀect of v ) are shown.Figure 15: Marmousi 2 model (Martin et al., 2006). FWI inversion directly at 10 Hz(adjoint-sources in Fig. 10 are related to this case).24 . Application to ﬁeld data Successful applications of KR FWI on large scale ﬁeld data have been published in (Poncetet al., 2018; Messud and Sedova, 2019; Sedova et al., 2019; Hermant et al., 2019; Carottiet al., 2020; Hermant et al., 2020). In this section, we gather few illustrations from Messudand Sedova (2019) and Carotti et al. (2020). For further details or more illustrations, weinvite the reader to refer to the aforementioned articles.

The ﬁrst example refers to an Oman land broadband full azimuth data set consisting ofthree separate acquisitions that have been merged, with challenges associated with irregularoﬀset distributions among the merged surveys. Fig. 16 shows various FWI results obtainedat 9 Hz. LSQ FWI leads to poor structural consistency and poor ﬁt to the sonic log at thelocation of the well. 2D KR FWI gives a better ﬁt to the sonic log and a better consistencywith the geology compared to LSQ FWI. This can be related to a reduced sensitivity tocycle-skipping and to the ability of 2D KR to enhance the continuity along structures.Fig. 17 compares 2D KR FWI to 1D KR FWI on the same survey. As a careful denoisingof the data has been performed before FWI, following the workﬂow proposed by Sedovaet al. (2019), the 1D KR adjoint-source looks quite similar to the 2D KR adjoint-source(not shown here). Nevertheless, 2D KR FWI leads to a better model update than 1D KRFWI, with better structural consistency and well matching. However, 1D KR FWI remainscapable of resolving the cycle-skipping present in LSQ FWI.The second example refers to a 3D land broadband data set with full-azimuth and oﬀsetsof up to 13 km, again acquired in Oman. This data set was processed to enhance divingand post-critical waves. The FWI has been run with the frequency increasing from 2 Hzto 16 Hz, following the workﬂow proposed by Sedova et al. (2019). Fig. 18 compares LSQFWI to 2D KR FWI. The yellow oval in Fig. 18 highlights the improved delineation of thevelocity contrast achieved by 2D KR FWI, and the unexpected velocity increase obtainedwith LSQ FWI that is corrected by 2D KR FWI. This improved velocity provides imaginguplifts of the deep reﬂectors observable in Fig. 18b (right). Moreover, the focusing of themajor fault, a diﬃcult challenge for the area, is enhanced, as highlighted by the yellowarrow in Fig. 18b (right).

The third example refers to a North Sea marine data set. Fig. 19 shows results obtainedwith a 7 Hz FWI inversion. LSQ FWI is cycle-skipped, as indicated by the “red spots”in the observed data overlaid on top of the modelled data (highlighted by green arrows).These red spots are due to events that suddently jump from one “cycle” to another in themodelled data, resulting in a lack of structural consistency and continuity in the invertedvelocity model. Fig. 19 shows how 2D KR FWI solves for these issues and inverts for animproved velocity model (especially in the areas highlighted by green arrows).The last example refers to a Barents Sea marine data set, that is challenging because ofgas accumulations of varying size and depth location. Fig. 20 illustrates the stability of 2DKR FWI over LSQ FWI, the latter being cycle-skipped. Even with a poor initial model,2D KR FWI converges to a more structurally consistent model update, while mitigatingcycle-skipping and matching better the observed data.25igure 16: FWI results at 9 Hz on Oman land data. a) LSQ FWI result, b) 2D KRFWI result (same input data, initial model, mute and number of iterations).Three images are presented for each case. Left: velocity model superimposed onthe corresponding Kirchhoﬀ depth migrated stack. Middle: well measurement(blue) and inverted model trace at the position schematized by the dashed redline Right: Kirchhoﬀ depth migrated stack. Modiﬁed from Messud and Sedova(2019) (see the article for more details).Figure 17: FWI results at 9 Hz on Oman land data. Left, right: models inverted at 9Hz superimposed on the migrated stacks (same input data, initial model, muteand number of iterations). Middle: well measurement (blue) and inverted modeltraces (green is 1D KR and magenta is 2D KR) at the position schematized bythe dashed red line. Modiﬁed from Messud and Sedova (2019) (see the articlefor more details). 26igure 18: FWI results at 16Hz on North of Oman land data. a) LSQ FWI result and b) 2DKR FWI result (same input data, initial model, mute and number of iterations).Two images are presented for each case. Left: velocity model superimposed oncorresponding Kirchhoﬀ depth migrated stack (the yellow ovals highlight theimproved delineation of the velocity contrast and the correction of the velocityincrease achieved by 2D KR FWI). Right: Kirchhoﬀ depth migrated stack (theyellow arrows highlight the improved focusing of the major fault achieved by 2DKR FWI). Modiﬁed from Carotti et al. (2020) (see the article for more details).27igure 19: North sea data. a) LSQ FWI result and b) 2D KR FWI result (same input data,initial model, mute and number of iterations). Two images are presented foreach case. Left: observed data (in black-grey-white) superimposed to modelleddata (in red-blue) (the green arrows highlight where he modelled data suddentlyjumps from one “cycle” to another). Middle, right: FWI updated models at 7Hz (the green arrows highlight some areas where 2D KR FWI gives an improvedand more structurally consistent velocity model). Modiﬁed from Messud andSedova (2019) (see the article for more details).28igure 20: Barents Sea data. a) Initial model for FWI, b) LSQ FWI result and c) 2DKR FWI result (same input data, initial model, mute and number of iterations).Three images are presented for each case. Left: velocity model and migratedstack. Middle and right: normalized absolute values of the diﬀerence betweenobserved and modelled data (red means large values thus poor matching), atthe two positions highlighted by the red triangles. Modiﬁed from Carotti et al.(2020) (see the article for more details).

6. Perspectives

We end up by providing some perspectives and come-back to adjoint-source considerations.Examining previous KR adjoint-sources, e.g. Figs. 2-14, we notice that: • The kinematics present in the adjoint-source is enhanced by the KR texture (enhancedlow frequencies, more balanced amplitudes and events continuity), explaining why theKR norm is more robust to cycle-skipping and amplitudes. • But the kinematics present in the adjoint-source is not modiﬁed, unlike other methodsthat help to reduce the cycle-skipping issue by shifting events in the data and usingthe corresponding kinematically-modiﬁed residual. See e.g. the method of Baeket al. (2014) and Wang et al. (2016) that we call “registration-guided” FWI in thefollowing (sometimes also called “dynamic warping” FWI), or the method of M´etivieret al. (2018, 2019) called “graph-space” (GS) OT FWI.Ultimately, we would like to bring both features into a common framework to combine theiradvantages (texture change together with kinematics change) and become even more robust29o cycle-skipping. This is possible by introducing the concept of “kinematic transformation”(KT) into the KR norm.We call KT any transformation that makes the kinematics of the events present in theobserved data f obs closer to the kinematics of another (relatively close) data f , usually themodelled data. We consider in this section the example of a KT that works in the time-direction only, with generalization being straightforward. f obs ( x xl , x inl , x t ) is transformedinto f obs ( x xl , x inl , σ [ f ]( x t )) to become kinematically closer to a data f ( x xl , x inl , x t ) from acertain measure point of view. σ [ f ] :]0 , T ] → ]0 , T ] represents the KT, a functional of thedata whose kinematics should be (partially) matched. The registration-guide (Baek et al.,2014; Wang et al., 2016; Hale, 2013) and the permutation resulting from GS (M´etivier et al.,2019) are examples of KT. Used within FWI, the goal is to change the kinematics of theobserved data to make it closer to the modelled data at each FWI iteration, to improverobustness to cycle-skipping.Our goal here is not to detail the diﬀerences between KT but to give a framework, so weremain general. To embed a KT into KR, we can simply consider instead of eq. (14):˜ W d ( f, f ( σ ) obs ) = max ϕ ∈ BLip( || . || ( X )1 , ,λ ) (cid:90) X ϕ ( x ) (cid:16) f ( x xl , x inl , x t ) − f obs ( x xl , x inl , σ [ f ]( x t )) (cid:17) dµ ( x ) , (33)where σ [ f ] is deﬁned by any KT method. Applying the considerations of § ∂ ˜ W d ( f, f ( σ ) obs ) ∂f ( x ) = ϕ max [ f ]( x ) − (cid:90) X ϕ max [ f ]( x (cid:48) ) ∂f obs ( x (cid:48) xl , x (cid:48) inl , σ [ f ]( x (cid:48) t )) ∂f ( x ) dµ ( x (cid:48) ) . (34)If ∃ γ > , ∀ h ∈ L ( X ) such as || h || < γ : σ [ f + h ]( x ) = σ [ f ]( x ) , (35)i.e. if σ [ f ] is insensitive to inﬁnitesimal perturbations of f , the second term in eq. (34)can be neglected and the adjoint-source can keep the simple form ϕ max [ f ] used in KR FWIwithout KT (the adjoint-source form then would remain the same, but not the adjoint-source itself as the KT would aﬀect the ϕ max values through eq. (33)). So, do someexisting KT fulﬁll eq. (35)? • It is not strictly the case for the registration-guide (Baek et al., 2014; Wang et al.,2016), even if in practice the scheme neglects ∂f obs ( x (cid:48) xl , x (cid:48) inl , σ [ f ]( x (cid:48) t )) /∂f ( x ) “byhand”, arguing it represents a second order term, non-crucial because of the non-linearity of the FWI. • But it is strictly the case for GS as demonstrated formally in M´etivier et al. (2019) (ina discrete X space case as GS only deals with discrete data coordinate spaces). Thisis a result in favor of the use of GS as a KT within KR FWI, at least mathematically.Physically, the pertinence of GS OT FWI compared to registration-guided FWI stillneeds to studied.To conclude with an illustration, Fig. 21 (left) shows how the registration-guided KTcan aﬀect the LSQ residual. We notice that the texture remains quite the same but thekinematics of some events in the residual change, especially in the diving wave area. Onthe contrary, the texture of the 2D KR adjoint-source is quite diﬀerent from the LSQresidual, but the kinematics of the events does not change. Fig. 21 (right) shows the KRadjoint-source with KT embedded, where the texture and some events kinematics change.30mbedding a KT can thus naturally “augment” the KR scheme, possibly for even morerobustness to cycle-skipping. Comparing various possible KT goes beyond the scope ofthis article. We mention this point as a perspective, that has started to be explored inKpadonou et al. (2021).Figure 21: Same data as Fig. 9, zoom. Various adjoint-sources compared, including a KT(here the registration-guided one). The yellow dashed line helps comparing thekinematic change brought by the KT in the diving waves area.

7. Conclusion

The use of the KR norm in FWI is now well established and has proven its beneﬁts forpractical large-scale applications. Thanks to a signiﬁcantly wider global minimum valley,several ﬁeld data case studies have shown that KR FWI clearly outperformed LSQ FWI inovercoming moderate cycle-skipping. In this article, we have developed an analysis to bet-ter understand the beneﬁts of the KR norm for FWI, mainly focusing on the adjoint-source.We have addressed both theoretical aspects, fundamentally justifying the KR adjoint-sourceexpression and qualifying its “texture”, and practical points that are critical for an eﬃcientuse within FWI. The KR adjoint-source can be conceptualized as the result of smart pro-cessing applied to the LSQ residual, that enhances the low frequency content and reducesthe dynamics of amplitudes. When correctly tuned, the 2D (or multiD) KR adjoint-sourceappears to also enhance the lateral continuity of events within a common shot gather. Allthese components, enhancing the kinematic content in the adjoint-source, are beneﬁcial toa successful FWI implementation.If the use of the KR norm within FWI signiﬁcantly widens the global minimum valleycompared to the use of LSQ, it does not fully solve the cycle-skipping issue or the challengeof FWI with reﬂected waves only. Further investigations in these directions are ongoing(M´etivier et al., 2019; Sun and Al Khalifah, 2019; Tang et al., 2020; Kpadonou et al., 2021).

Acknowledgments

We are grateful to CGG for granting permission to publish this work and to CGG Multi-Client, TGS, Occidental Petroleum, PDO, the Ministry of Oil and Gas of the Sultanate ofOman and INEOS for providing the data and permission to present these results. We areindebted to F´elix Kpadonou, Thibault Lesieur, Thibaut Allemand, Nabil Masmoudi, AdelKhalil and Roger Taylor for enlightening discussions.31 ppendixA. Dual representation of the Lipschitz constraint (proof )

We consider d = || . || ( X ) p in the Lipschitz constraint, eq. (11), and ( x, y ) ∈ X × X . TheLipschitz semi-norm is deﬁned by || ϕ || Lip p = max x (cid:54) = y | ϕ ( x ) − ϕ ( y ) ||| x − y || ( X ) p . (36)We obtain a norm identifying functions ϕ deﬁned up to an additive constant.Taking x = y + z with z ∈ X and choosing a ϕ ( y ) that is diﬀerentiable, the Lipschitznorm can be rewritten ( p ≥ || ϕ || Lip p = max y max z (cid:54) =0 | ϕ ( y + z ) − ϕ ( y ) ||| z || ( X ) p = max y max z (cid:54) =0 (cid:12)(cid:12)(cid:12) ∂ϕ ( y ) ∂y .z (cid:12)(cid:12)(cid:12) || z || ( X ) p = max y max || z || ( X ) p ≤ (cid:12)(cid:12)(cid:12) ∂ϕ ( y ) ∂y .z (cid:12)(cid:12)(cid:12) = max y (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∂ϕ ( y ) ∂y (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( X ) q with 1 /p + 1 /q = 1 . (37)The last equality is obtained by deﬁnition of the dual norm (Rudin, 1991; Brezis, 2020).The Lipschitz norm can thus be “represented” by a L q norm || . || ( X ) q on ∂ϕ ( y ) /∂y .Let us consider standard-deviation-like weights that satisfy ∞ > /σ > (cid:15) where (cid:15) > σ are introduced in || . || ( X ) p , i.e. || x || ( X ) p = (cid:16) | x xl | p σ pxl + | x inl | p σ pinl + | x t | p σ pt (cid:17) /p , (38)eq. (37) imposes that the non-inverted σ appear in || . || ( X ) q , i.e. (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( X ) q = (cid:16) σ qxl (cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x xl (cid:12)(cid:12)(cid:12) q + σ qinl (cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x inl (cid:12)(cid:12)(cid:12) q + σ qt (cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x t (cid:12)(cid:12)(cid:12) q (cid:17) /q with 1 /p + 1 /q = 1 . (39)Note that the second equality in eq. (37) is valid when the Riesz representation theoremcan be invoked in X ∗ ((Brezis, 2020) chapter IV), where X ∗ = L ( X, R ) denotes the (topo-logical) dual of X . This is the case for any norm in ﬁnite dimensional spaces like X thatis 3 dimensional here. (For completeness, we remind that this is not the case in inﬁnitedimensional space (Rudin, 1991; Brezis, 2020)).As the 1-Lipschitz constraint implies strong continuity, i.e. diﬀerentiability almost every-where, the derivative ∂ϕ ( x ) /∂x is deﬁned almost everywhere. Constraining the derivativealmost everywhere is suﬃcient for the Lagrange multipliers theoretical considerations ofAppendix B. B. KR adjoint-source (proofs)

Throughout this Appendix, we use the notations of § B.1. Simpliﬁed case (thresholding constraint only)

We ﬁrst consider only the thresholding constraint in eqs. (11) and (14). It implies ∀ x ∈ X : | ϕ ( x ) | ≤ λ , that can be split into two linear constraints ϕ ( x ) ≤ λ and ϕ ( x ) ≥ − λ . Using the32agrange multipliers method (Bertsekas, 1996), eq. (14) can be rewritten (dependencies to∆ f are implicit in the following to lighten the notations)˜ W d = max ϕ min α + ≥ ,α − ≥ L ( ϕ, α + , α − ) (40) L ( ϕ, α + , α − ) = (cid:90) X ϕ ( x )∆ f ( x ) dµ ( x ) − (cid:90) X ( ϕ ( x ) − λ ) α + ( x ) dµ ( x ) − (cid:90) X ( − ϕ ( x ) − λ ) α − ( x ) dµ ( x ) . Minimization of eq. (40) with respect to α + and α − gives the Karush-Kuhn-Tucker (KKT)conditions (Bertsekas, 1996) ϕ ( x ) = λ and α + ( x ) > ϕ ( x ) < λ and α + ( x ) = 0 − ϕ ( x ) = λ and α − ( x ) > − ϕ ( x ) < λ and α − ( x ) = 0 . (41)We deduce α + ( x ) > ⇒ α − ( x ) = 0 α − ( x ) > ⇒ α + ( x ) = 0 . (42)Maximization of eq. (40) with respect to ϕ gives almost everywhere in X ∆ f ( x ) = α + ( x ) − α − ( x ) . (43)Combining eqs. (41)-(43), we deduce x ∈ X \ X (∆ f ) null with ∆ f ( x ) > α + ( x ) = ∆ f ( x ) , α − ( x ) = 0 , ϕ max ( x ) = λ (44)∆ f ( x ) < α − ( x ) = − ∆ f ( x ) , α + ( x ) = 0 , ϕ max ( x ) = − λ. This result deﬁnes ϕ max in X \ X (∆ f ) null , where it obviously remains stable under inﬁnitesimalperturbations of ∆ f . Using the Lagrange multipliers to deduce this result in this simplecase is somewhat “oversized”. However, the method becomes interesting when consideringmore involved constraints as below, having the advantage to easilly generalize . B.2. Intermediary case (1-Lipschitz constraint only, 1D coordinate space)

We here consider only the 1-Lipschitz constraint in eq. (14), and a 1D coordinate space X that represents the time coordinate in practice. To lighten the notations, we considera unit standard-deviation-like weight in eq. (11). From the considerations of AppendixA, the 1-Lipschitz constraint can be rewritten almost-everywhere in X : (cid:12)(cid:12)(cid:12) ∂ϕ ( x ) ∂x (cid:12)(cid:12)(cid:12) ≤

1. Thisconstraint can be split into two linear constraints ∂ϕ ( x ) ∂x ≤ ∂ϕ ( x ) ∂x ≥ −

1. Using theLagrange multipliers method (Bertsekas, 1996), eq. (14) can be rewritten˜ W d = max ϕ min β + ≥ ,β − ≥ L ( ϕ, β + , β − ) (45) L ( ϕ, β + , β − ) = (cid:90) X ϕ ( x )∆ f ( x ) dµ ( x ) − (cid:90) X (cid:16) ∂ϕ ( x ) ∂x − (cid:17) β + ( x ) dµ ( x ) − (cid:90) X (cid:16) − ∂ϕ ( x ) ∂x − (cid:17) β − ( x ) dµ ( x ) . Minimization with respect to β + and β − gives the KKT conditions (Bertsekas, 1996) ∂ϕ ( x ) ∂x = 1 and β + ( x ) > ∂ϕ ( x ) ∂x < β + ( x ) = 033 ∂ϕ ( x ) ∂x = 1 and β − ( x ) > − ∂ϕ ( x ) ∂x < β − ( x ) = 0 . (46)We deduce β + ( x ) > ⇒ β − ( x ) = 0 and thus ∂β − ( x ) /∂x = 0 β − ( x ) > ⇒ β + ( x ) = 0 and thus ∂β + ( x ) /∂x = 0 . (47)The “and thus” is due to the strong continuity of ϕ . For instance, if ∂ϕ ( x ) /∂x = 1, itcannot “jump” to ∂ϕ ( x (cid:48) ) /∂x (cid:48) = − x . Thus, if β + ( x ) > β − ( x (cid:48) )must remain null in the neighborhood of x .Considering X = [ x min , x max ], maximization with respect to ϕ gives almost everywherein X ∆ f ( x ) + (cid:16) δ ( x − x min ) − δ ( x − x max ) (cid:17)(cid:16) β + ( x ) − β − ( x ) (cid:17)(cid:16) β + ( x ) − β − ( x ) (cid:17) = − ∂β + ( x ) ∂x + ∂β − ( x ) ∂x . (48)Combining eqs. (47) and (48), we can deduce that always one of the constraints mustsaturate almost everywhere in X \ X (∆ f ) null (the edges x min and x max of X have null measure).More precisely, we have almost everywhere in X \ X (∆ f ) null x ∈ X \ X (∆ f ) null : ∂β + ( x ) ∂x = − ∆ f ( x ) ⇒ β + ( x ) > , β − ( x ) = 0 , ∂ϕ max ( x ) ∂x = 1 (49)or ∂β − ( x ) ∂x = ∆ f ( x ) ⇒ β − ( x ) > , β + ( x ) = 0 , ∂ϕ max ( x ) ∂x = − . Complementarilly, integrating eq. (48) leads to − ∆ F β ( x ) = β + ( x ) − β − ( x ) (50)∆ F β ( x ) = (cid:90) xx min ∆ f ( x (cid:48) ) dµ ( x (cid:48) ) − (cid:16) β + ( x min ) − β − ( x min ) (cid:17) . Combining eqs. (47) and (50), we deduce almost everywhere in X \ X (∆ F β ) null x ∈ X \ X (∆ F β ) null with ∆ F β ( x ) < β + ( x ) = − ∆ F β ( x ) , β − ( x ) = 0 , ∂ϕ max ( x ) ∂x = 1(51)∆ F β ( x ) > β − ( x ) = ∆ F β ( x ) , β + ( x ) = 0 , ∂ϕ max ( x ) ∂x = − . Eq. (51) is satisﬁed in X \ X (∆ F β ) null and thus in X \ X (∆ f ) null (as X \ X (∆ F β ) null tends to be “larger”for seismic data).Obviously, eq. (48) (combined with eq. (47)), eq. (49) or eq. (51) remain stable underinﬁnitesimal perturbations of ∆ f in X \ X (∆ f ) null . Also, eq. (48) (combined with eq. (47)) issuﬃcient to prove that always one of the constraints must saturate almost everywhere in X \ X (∆ f ) null . We introduced eqs. (49) and eq. (51) to give complementary insight. B.3. Full case (1D coordinate space)

We now combine the two constraints (thresholding and 1-Lipschitz), still in the 1D co-ordinate space case and with unit standard-deviation-like weights. Using the Lagrangemultipliers method (Bertsekas, 1996), we write˜ W d = max ϕ min α + ≥ ,α − ≥ ,β + ≥ ,β − ≥ L ( ϕ, α + , α − , β + , β − ) (52)34 ( ϕ, α + , α − , β + , β − ) = (cid:90) X ϕ ( x )∆ f ( x ) dµ ( x ) − (cid:90) X ( ϕ ( x ) − λ ) α + ( x ) dµ ( x ) − (cid:90) X ( − ϕ ( x ) − λ ) α − ( x ) dµ ( x ) − (cid:90) X (cid:16) ∂ϕ ( x ) ∂x − (cid:17) β + ( x ) dµ ( x ) − (cid:90) X (cid:16) − ∂ϕ ( x ) ∂x − (cid:17) β − ( x ) dµ ( x ) . Minimization with respect to α + , α − , β + and β − gives KKT conditions (Bertsekas, 1996) ϕ ( x ) = λ and α + ( x ) > ϕ ( x ) < λ and α + ( x ) = 0 − ϕ ( x ) = λ and α − ( x ) > − ϕ ( x ) < λ and α − ( x ) = 0 ∂ϕ ( x ) ∂x = 1 and β + ( x ) > ∂ϕ ( x ) ∂x < β + ( x ) = 0 − ∂ϕ ( x ) ∂x = 1 and β − ( x ) > − ∂ϕ ( x ) ∂x < β − ( x ) = 0 . (53)None of these constraints can saturate simultaneously on non-null measure sets, as saturat-ing the thresholding constraint implies a null derivative for ϕ (and thus β + ( x ) = β − ( x ) = 0)and saturating the 1-Lipschitz constraints implies a non-constant ϕ (and thus α + ( x ) = α − ( x ) = 0). We deduce almost everywhere α + ( x ) > ⇒ α − ( x ) = 0 , β + ( x ) = ∂β + ( x ) /∂x = 0 , β − ( x ) = ∂β − ( x ) /∂x = 0 α − ( x ) > ⇒ α + ( x ) = 0 , β + ( x ) = ∂β + ( x ) /∂x = 0 , β − ( x ) = ∂β − ( x ) /∂x = 0 β + ( x ) > ⇒ β − ( x ) = ∂β − ( x ) /∂x = 0 , α + ( x ) = 0 , α − ( x ) = 0 β − ( x ) > ⇒ β + ( x ) = ∂β + ( x ) /∂x = 0 , α + ( x ) = 0 , α − ( x ) = 0 . (54)The justiﬁcation of the conﬁgurations where the derivatives of β + and β − are null wasalready given after eq. (47).Considering X = [ x min , x max ], maximization with respect to ϕ gives∆ f ( x ) + (cid:16) δ ( x − x min ) − δ ( x − x max ) (cid:17)(cid:16) β + ( x ) − β − ( x ) (cid:17) = α + ( x ) − α − ( x ) − ∂β + ( x ) ∂x + ∂β − ( x ) ∂x . (55)Combining eqs. (54) and (55), we can deduce that always one of the constraints must sat-urate almost everywhere in X \ X (∆ f ) null , and that eq. (55) remains stable under inﬁnitesimalperturbations of ∆ f in X \ X (∆ f ) null .Just to gain further insight, we introduce∆ f β ( x ) = ∆ f ( x ) + ∂β + ( x ) ∂x − ∂β − ( x ) ∂x (56)∆ F β,α ( x ) = (cid:90) xx min ∆ f ( x (cid:48) ) dµ ( x (cid:48) ) − (cid:16) β + ( x min ) − β − ( x min ) (cid:17) − (cid:90) xx min dx (cid:48) (cid:16) α + ( x (cid:48) ) − α − ( x (cid:48) ) (cid:17) dµ ( x (cid:48) ) . We then can deduce from eqs. (54)-(56) ∀ x ∈ X \ X (∆ f β ) null with ∆ f β ( x ) > α + ( x ) = ∆ f β ( x ) , α − ( x ) = 0 , ϕ max ( x ) = λ (57)∆ f β ( x ) < α − ( x ) = − ∆ f β ( x ) , α + ( x ) = 0 , ϕ max ( x ) = − λ x ∈ X \ X (∆ F β,α ) null with ∆ F β,α ( x ) < β + ( x ) = − ∆ F β,α ( x ) , β − ( x ) = 0 , ∂ϕ max ( x ) ∂x = 1∆ F β,α ( x ) > β − ( x ) = ∆ F β,α ( x ) , β + ( x ) = 0 , ∂ϕ max ( x ) ∂x = − . These equations deﬁne ϕ max quite similarly than in eqs. (44) and (51), but they are hereexplicitly coupled. An implicit coupling is also imposed through the necessary (strong)continuity of ϕ max and the fact that none of the constraints can saturate simultaneously,thus that X null = X (∆ f β ) null ∪ X (∆ F β,α ) null and ∅ = X (∆ f β ) null ∩ X (∆ F β,α ) null . (58)Obviously, eq. (55) (combined with eq. (54)) or eq. (57) remain stable under inﬁnitesimalperturbations of ∆ f in X \ X (∆ f ) null . B.4. Full case (2D coordinate space)

We consider X = [ H mininl , H maxinl ] × [0 , T ] with x = ( x inl , x t ) t , and the results of Appendix Awith non-unit standard-deviation-like weights. Eq. (55) becomes, with obvious notations∆ f ( x ) + σ inl (cid:16) δ ( x inl − H mininl ) − δ ( x inl − H maxinl ) (cid:17)(cid:16) β + inl ( x ) − β − inl ( x ) (cid:17) + σ t (cid:16) δ ( x t ) − δ ( x t − T ) (cid:17)(cid:16) β + t ( x ) − β − t ( x ) (cid:17) = α + ( x ) − α − ( x ) − σ inl (cid:16) ∂β + inl ( x ) ∂x inl − ∂β − inl ( x ) ∂x inl (cid:17) − σ t (cid:16) ∂β + t ( x ) ∂x t − ∂β − t ( x ) ∂x t (cid:17) . (59)Combining eq. (59) with the equivalent of eq. (54) in the 2D coordinate space case, we candeduce that always one of the constraints must saturate, almost everywhere in X \ X (∆ f ) null and that eq. (59) remains stable under inﬁnitesimal perturbations of ∆ f in X \ X (∆ f ) null . Theonly subtlety is that, at a given position in X \ X (∆ f ) null where the thresholding constraintdoes not saturate, only one of the Lipschitz constraints in the inline or time directions needto saturate, not necesarilly both. References

Andersen, M. S., Dahl, J., and Vandenberghe, L. (2013). Cvxopt: A python package forconvex optimization. abel. ee. ucla. edu/cvxopt .Baek, H., Calandra, H., and Demanet, L. (2014). Velocity estimation via registration-guidedleast-squares inversion.

Geophysics , 79(2):R79–R89.Bertsekas, D. (1996).

Constrained Optimization and Lagrange Multiplier Methods . Aca-demic Press, New York.Brenders, A. J. and Pratt, R. G. (2007). Waveform tomography of marine seismic data:What can limited oﬀset oﬀer? , pages 3024–3029.Brezis, H. (2020).

Analyse fonctionnelle: Th´eorie et applications . Dunod, Paris.Bunks, C., Saleck, F. M., Zaleski, S., and Chavent, G. (1995). Multiscale seismic waveforminversion.

Geophysics , 60(5):1457–1473. 36arotti, D., Hermant, O., Masclet, S., Reinier, M., Messud, J., Sedova, A., and Lambar´e, G.(2020). Optimal transport full waveform inversion - Applications. , Th Dome1 17.Combettes, P. and Pesquet, J.-C. (2011). Proximal splitting methods in signal processing.

Springer optimization and its applications: Springer New York , 49:185–212.Engquist, B. and Froese, B. (2014). Application of the Wasserstein metric to seismic signals.

Communications in Mathematical Sciences , 12(5).Engquist, B., Froese, B., and Yang, Y. (2016). Optimal transport for seismic full waveforminversion.

Communications in Mathematical Sciences , 14:2309–2330.Hale, D. (2013). Dynamic warping of seismic images.

Geophysics , 78(2):S105–S115.Hanin, L. G. (1992). Kantorovich-rubinstein norm and its application in the theory oflipschitz spaces.

Proceedings of the American Mathematical Society , 115(2):345–352.Hermant, O., Aziz, A., Warzocha, S., and Al Jahdhami, M. (2020). Imaging complex faultstructures on-shore oman using optimal transport full waveform inversion. , We Dome1 19.Hermant, O., Sedova, A., Royle, G., Retailleau, M., Messud, J., Lambar´e, G., Al Abri, S.,and Al Jahdhami, M. (2019). Broadband faz land data: an opportunity for FWI. , WS08 11.Kantorovich, L. V. (1942). On the translocation of masses.

C. R. (Doklady) Acad. Sci.USSR , 321:199–201.Kpadonou, F., Messud, J., Sedova, A., and Reinier, M. (2021). Optimal transport FWIwith graph transform: Analysis and proposal of a partial shift strategy.

Submitted to83rd EAGE Conference and Exhibition .Lailly, P. (1983). The seismic inverse problem as a sequence of before stack migrations.

Conference on Inverse Scattering, Theory and Application, Society for Industrial andApplied Mathematics, Expanded Abstracts , 83:206–220.Lellmann, J., Lorenz, D. A., Schonlieb, C., and T., V. (2014). Imaging with Kantorovich-Rubinstein discrepancy.

SIAM J. Imaging Sci. , 7(4):2833–2859.Luo, S. and Sava, P. (2011). A deconvolution-based objective function for wave-equationinversion. , pages 2788–2792.Luo, Y. and Schuster, G. (1991). Wave-equation traveltime inversion.

Geophysics , 56:645–653.Martin, G., Wiley, R., and Marfurt, K. (2006). Marmousi2: an elastic upgrade for Mar-mousi.

Leading Edge , 25:156–166.Messud, J. and Sedova, A. (2019). Multidimensional optimal transport for 3D FWI: Demon-stration on ﬁeld data. , TuR08 02.M´etivier, L., Allain, A., Brossier, R., M´erigot, Q., Oudet, E., and Virieux, J. (2018). Agraph-space approach to optimal transport for full waveform inversion. , pages 1158–1162.37´etivier, L., Brossier, R., M´erigot, Q., and Oudet, E. (2019). A graph space optimaltransport distance as a generalization of L p -distances: application to a seismic imaginginverse problem. Inverse Problems , 35(8):085001.M´etivier, L., Brossier, R., M´erigot, Q., Oudet, E., and Virieux, J. (2016a). Increasing therobustness and applicability of full waveform inversion: an optimal transport distancestrategy.

The Leading Edge , 35:1060–1067.M´etivier, L., Brossier, R., M´erigot, Q., Oudet, E., and Virieux, J. (2016b). Measuringthe misﬁt between seismograms using an optimal transport distance: Application to fullwaveform inversion.

Geophysical Journal International , 205(1):345–377.M´etivier, L., Brossier, R., M´erigot, Q., Oudet, E., and Virieux, J. (2016c). An optimaltransport approach for seismic tomography: application to 3D full waveform inversion.

Inverse Problems , 32(11):115008.Monge, G. (1781). M´emoire sur la th´eorie des d´eblais et des remblais.

Histoire de l’Acad´emieRoyale des Sciences , pages 666–704.Operto, S., Ravaut, C., Improta, L., Virieux, J., Herrero, A., and Dell’Aversana, P. (2004).Quantitative imaging of complex structures from multifold wide aperture seismic data:A case study.

Geophysical Prospecting , 52:625–651.Plessix, R.-E. (2006). A review of the adjoint-state method for computing the gradient of afunctional with geophysical applications.

Geophysical Journal International , 167(2):495–503.Plessix, R.-E. (2009). 3D frequency-domain full-waveform inversion with an iterative solver.

Geophysics , 74:WC149–WC157.Poncet, R., Messud, J., Bader, M., Lambar´e, G., Viguier, G., and Hidalgo, C. (2018). FWIwith optimal transport: a 3D implementation and an application on a ﬁeld dataset. , We A12 02.Pratt, R. G. (1990). Inverse theory applied to multi-source cross-hole tomography. part ii:Elastic wave-equation method.

Geophysical Prospecting , 38:311–330.Pratt, R. G. (1999). Seismic waveform inversion in the frequency domain, part i: Theoryand veriﬁcation in a physical scale model.

Geophysics , 64:888–901.Rudin, W. (1991).

Functional analysis (second edition) . Dunod, Paris.Sedova, A., Messud, J., Prigent, H., Masclet, S., Royle, G., and Lambar´e, G. (2019).Acoustic land full waveform inversion on a broadband land dataset: the impact of optimaltransport. , Th R08 07.Shin, C. and Ha, W. (2008). A comparison between the behavior of objective functionsfor waveform inversion in the frequency and Laplace domains.

Geophysics , 73(5):VE119–VE133.Sirgue, L., Barkved, O., Van Gestel, J., Askim, O., and Kommedal, J. (2009). 3D waveforminversion on valhall wide-azimuth obc. , U038.Sirgue, L. and Pratt, R. G. (2004). Eﬃcient waveform inversion and imaging: A strategyfor selecting temporal frequencies.

Geophysics , 69:231–248.38un, B. and Al Khalifah, T. (2019). The application of an optimal transport to a precondi-tioned data matching function for robust waveform inversion.

Geophysics , 84:R923–R945.Tang, Y., Sun, B., and Al Khalifah, T. (2020). Wave-equation migration velocity analysisvia the optimal-transport-based objective function. , pages 3709–3713.Tarantola, A. (1984). Inversion of seismic reﬂection data in the acoustic approximation.

Geophysics , 49:1259–1266.Tarantola, A. (2005).

Inverse Problem Theory and Methods for Model Parameters Estima-tion . Elsevier Science Publishers, Amsterdam.Van Leeuwen, T. and Mulder, W. (2010). A correlation-based misﬁt criterion for wave-equation traveltime tomography.

Geophysical Journal International , 182(3):1383–1394.Vigh, D. and Starr, E. W. (2007). 3D pre-stack plane wave full waveform inversion. , pages 1830–1834.Villani, C. (2003).

Topics in optimal transportation , volume 58 of

Graduate Studies inMathematics . American Mathematical Society.Villani, C. (2008).

Optimal transport: old and new , volume 338. Springer Science & BusinessMedia.Virieux, J. and Operto, S. (2009). An overview of full-waveform inversion in explorationgeophysics.

Geophysics , 74(6):WCC127–WCC152.Wang, D. and Wang, P. (2019). Adaptive quadratic Wasserstein full-waveform inversion. , pages 1300–1304.Wang, M., Xie, Y., Xu, W., Xin, K. F., Chuah, B. L., Loh, F. C., Manning, T., and Wol-farth, S. (2016). Dynamic-warping full-waveform inversion to overcome cycle skipping.

SEG , pages 1273–1277.Warner, M. and Guasch, L. (2016). Adaptive waveform inversion: Theory.

Geophysics ,81(6):R429–R445.Warner, M., Ratcliﬀe, A., Nangoo, T., Morgan, J., Umpleby, A., Shah, N., Vinje, V., ˇStekl,I., Guasch, L., Win, C., et al. (2013). Anisotropic 3D full-waveform inversion.

Geophysics ,78(2):R59–R80.Yang, Y. and Engquist, B. (2018). Analysis of optimal transport and related misﬁt functionsin full-waveform inversion.

Geophysics , 83(1):A7–A12.Yang, Y., Engquist, B., Sun, J., and Froese, B. (2018). Application of optimal transport andthe quadratic Wasserstein metric to full-waveform inversion.