A data-driven choice of misfit function for FWI using reinforcement learning
Bingbing Sun
Physical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
[email protected]

Tariq Alkhalifah
Physical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
[email protected]
February 11, 2020

Abstract
In the workflow of Full-Waveform Inversion (FWI), we often tune the parameters of the inversion to help us avoid cycle skipping and obtain high-resolution models. For example, we typically start by using objective functions that avoid cycle skipping, like tomographic and image-based ones, or by using only low frequencies, and later we utilize the least-squares misfit to admit high-resolution information. We may also perform an isotropic (acoustic) inversion to first update the velocity model and then switch to multi-parameter anisotropic (elastic) inversions to fully recover the complex physics. Such hierarchical approaches are common in FWI, and they often depend on our manual intervention based on many factors, so the results depend on experience. However, with the large data size often involved in the inversion and the complexity of the process, making optimal choices is difficult even for an experienced practitioner. Thus, as an example, and within the framework of reinforcement learning, we utilize a deep Q network (DQN) to learn an optimal policy for determining the proper timing to switch between different misfit functions. Specifically, we train the state-action value function (Q) to predict when to use the conventional L2-norm misfit function or the more advanced optimal-transport matching-filter (OTMF) misfit to mitigate cycle skipping and obtain high resolution, as well as improve convergence. We use a simple yet demonstrative shifted-signal inversion example to illustrate the basic principles of the proposed method.

Keywords: Full-waveform inversion · Reinforcement learning · Misfit function
Introduction

Considering the high nonlinearity of Full-Waveform Inversion (FWI), a hierarchical approach is common practice. Developing such a strategy is a daunting task considering the size of the data, the realities of how the data were acquired (limited band and aperture), and the physical assumptions we impose on the model. With respect to the choice of a proper objective function, recent advances introduced reasonably cycle-skipping-free misfit functions such as the matching-filter misfit, the optimal transport (OT) [1] function, or a combination of them, i.e., the optimal transport of the matching filter (OTMF) [2]. Unlike the L2-norm misfit, which is a local comparison, these advanced misfit functions seek global comparisons between the predicted and measured data, and thus, can avoid cycle skipping. However, we often still need to switch to the L2 norm for higher-resolution models when the data are less cycle-skipped and safe for local comparisons. In reality, we need to carefully QC the data matching to determine the optimal time to switch. Besides, the probability of cycle skipping varies for different offsets, so it would be ideal to use different misfit functions for different traces to accommodate their specific cycle-skipping probabilities. In principle, we can formulate this problem as follows: in each iteration, given the predicted and measured data, we make a decision (action) to choose between the L2-norm misfit and a cycle-skipping-free misfit such as the OTMF, and the consequence of such an action seeks a better fitting of the data over a long time horizon (running over many iterations). Mathematically, this is considered a Markov decision process, and it is well studied in the fields of statistics and machine learning. In this paper, based on the concept of reinforcement learning, we train a deep Q network (DQN) [3] to achieve fast convergence by learning the optimal choice of objective functions over FWI iterations.

Reinforcement learning (RL) is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning, and it is a potential path toward true artificial intelligence. It differs from supervised learning in that it does not require labels given by input/output pairs. Instead, the focus of RL is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). Recently, RL-based algorithms demonstrated their potential in solving complex problems that are extremely difficult for conventional machine learning algorithms; e.g., AlphaGo beat the top human Go players, while AlphaStar achieved the Grandmaster level at playing the StarCraft games [4, 5]. In this paper, we share our first attempt to use RL to automatically select a misfit function in full-waveform inversion. We start with a brief review of the OTMF misfit function used here and then develop the method for misfit function selection using DQN. Finally, we demonstrate our method using a time-shifted signal example.
Theory

The conventional L2-norm misfit function seeks a local, point-wise comparison between the predicted data p(t) and the measured data d(t):

J_{L2} = \frac{1}{2} \| p(t) - d(t) \|^2 . (1)

[2] introduced the optimal transport of the matching filter (OTMF) misfit for FWI. In OTMF, a matching filter is computed first by deconvolving the predicted data with the measured data:

d(t) * w(t) = p(t), (2)

where * denotes the convolution operation. After a proper preconditioning of the resulting matching filter to fulfill the requirements of a distribution, we minimize the Wasserstein W_2 distance between the resulting matching filter and a target distribution given by, e.g., a Dirac delta function:

J_{OTMF} = W_2( \tilde{w}(t), \delta(t) ), (3)

where \tilde{w}(t) = w(t) / \| w(t) \| and W_2 denotes the Wasserstein distance [1]. The resulting OTMF misfit in Equation 3 can overcome cycle skipping effectively, as demonstrated by [6].

We first provide a mathematical background summary for the Markov decision process (MDP), the deep Q network (DQN) and RL techniques. We adopt the standard MDP formalism. An MDP is defined by a tuple <S, A, R, P, \gamma>, which consists of a set of states S, a set of actions A, a reward function R(s, a), a transition function P(s' | s, a) and a discount factor \gamma. For each state s \in S, the agent takes an action a \in A. Upon taking this action, the agent receives a reward R(s, a) and reaches a new state s', determined from the probability distribution P(s' | s, a). In RL, we try to learn a policy \pi, which specifies for each state the action the agent will take. The goal of the agent is to find the policy \pi, mapping states to actions, that maximizes the expected discounted total reward over the agent's lifetime. Such a long-term expected reward is formulated as the action-value function (Q):

Q^\pi(s, a) = E_\pi [ \sum_t \gamma^t R(s_t, a_t) ], (4)

where E_\pi is the expectation over the distribution of the admissible trajectories (s_0, a_0, s_1, a_1, ...) obtained by executing the policy \pi starting from s_0 = s and a_0 = a. Many algorithms have been developed in RL to learn the policy \pi. DQN is a popular method for dealing with a discrete action space. It only learns the Q function, and the optimal policy \pi^* can be derived from the learned Q function directly:

\pi^*(a | s) = \arg\max_{a'} Q(s, a'). (5)

Equation 5 is intuitive to understand: the best action for each state is the one that gives the largest Q value for that state. In order to learn the Q function, we take a single move from the current state to the next one and see what reward we can get. This admits a one-step look-ahead:

Q(s_t, a_t) = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'). (6)

In order to stabilize the learning process, we keep track of another, target Q function Q'. Thus, the loss function of DQN in training is the squared temporal-difference (TD) error between the Q function and its target:

Loss = \frac{1}{2} [ r_{t+1} + \gamma \max_{a'} Q'(s_{t+1}, a') - Q(s_t, a_t) ]^2 . (7)
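To make the one-step target of Equation 6 and the TD loss of Equation 7 concrete, here is a minimal PyTorch sketch. It assumes q_net and q_target are small fully connected networks (torch.nn.Module) mapping a batch of states to one Q value per action; the variable names and the discount factor are illustrative choices, not notation from the paper.

```python
import torch

def td_loss(q_net, q_target, s, a, r, s_next, gamma=0.99):
    """Squared TD error of Equation 7, averaged over a minibatch."""
    # Q(s_t, a_t): pick the Q value of the action actually taken
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Equation 6 target: r_{t+1} + gamma * max_a' Q'(s_{t+1}, a'),
    # computed with the target network and detached from the graph
    with torch.no_grad():
        y = r + gamma * q_target(s_next).max(dim=1).values
    return 0.5 * ((y - q) ** 2).mean()
```

Minimizing this loss with any stochastic gradient optimizer updates only the online network q_net, while q_target is refreshed by periodically copying the weights of q_net, which is the stabilization trick described above.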
For training efficiency, we save the transitions [s_t, a_t, r_t, s_{t+1}] in a replay buffer and reuse these datasets for training (this is referred to as experience replay in RL).

Exploration plays an important role in RL, as it provides the agent with the ability to expand its knowledge when interacting with the environment. The \epsilon-greedy exploration strategy randomly chooses the action with a probability \epsilon:

a_t = \begin{cases} a_t^* & \text{with probability } 1 - \epsilon, \\ \text{a random action} & \text{with probability } \epsilon, \end{cases} (8)

where a_t^* is the action given by the optimal policy of Equation 5. We start with a large \epsilon and gradually reduce it during the training. Another important aspect of RL is the reward; the choice of the form of the reward is problem-specific, and it may affect the training significantly. Algorithm 1 shows a typical DQN flow with experience replay and an \epsilon-greedy exploration policy.
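Before the full algorithm, here is a compact Python sketch of these two ingredients, experience replay and the \epsilon-greedy rule of Equation 8. The buffer capacity and default batch size are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store transitions and sample them at random."""
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)  # oldest transitions evicted

    def push(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def sample(self, batch_size=128):
        return random.sample(list(self.memory), batch_size)

def epsilon_greedy(q_values, epsilon, n_actions=2):
    """Equation 8: random action w.p. epsilon, greedy action otherwise."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(q_values.argmax())
```

One simple way to realize the gradual reduction of \epsilon mentioned above is an exponential schedule such as eps = eps_end + (eps_start - eps_end) * exp(-episode / decay), with eps_start, eps_end and decay chosen by the user.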
Algorithm 1: Deep Q-learning with experience replay

  Initialize the replay memory D to capacity N
  Initialize the action-value function Q with random weights \theta
  Initialize the target action-value function \hat{Q} with weights \hat{\theta} = \theta
  for episode = 1, M do
    for t = 1, T do
      With probability \epsilon select a random action a_t,
        otherwise select a_t = \arg\max_a Q(s_t, a; \theta)
      Execute action a_t and observe the reward r_{t+1} and the next state s_{t+1}
      Store the transition (s_t, a_t, r_{t+1}, s_{t+1}) in D
      Sample a random minibatch of transitions (s_j, a_j, r_{j+1}, s_{j+1}) from D
      Set y_j = r_{j+1} + \gamma \max_{a'} \hat{Q}(s_{j+1}, a'; \hat{\theta})
      Perform a gradient descent step on (y_j - Q(s_j, a_j; \theta))^2 to update the Q parameters \theta
      Every C steps reset \hat{Q} = Q
    end for
  end for

It is straightforward to adapt DQN to our misfit function selection problem, i.e., to select between the L2-norm and the OTMF misfits. Considering a one-dimensional FWI problem, the state in RL would be the predicted and measured data:

s_t = (p_t, d_t), (9)

where p_t and d_t are a single trace of the data in the time domain at iteration step t. The Q function takes such a state as input and outputs two values determining whether we use the L2-norm misfit function or the OTMF misfit function. We also incorporate the \epsilon-greedy exploration policy, i.e., we randomly choose between the L2-norm and the OTMF misfits with probability \epsilon. For the reward, we can define it as the negative of the normalized L2 norm of the model difference or, as another option, the negative of the normalized L2 norm of the data residuals:

r_t = - \frac{\| m_{true} - m_t \|}{\| m_{true} \|} \quad \text{or} \quad r_t = - \frac{\| p_{true} - p_t \|}{\| p_{true} \|} . (10)

We should keep in mind that, unlike in FWI itself, using the L2 norm of the data difference to formulate the reward in the RL training is not an issue here, because the Q function we fit in Equation 4 seeks a long-term expected reward (over many iterations). This means that the best policy learned will always give fast convergence, with a smaller accumulated L2 norm of the data residuals throughout the inversion process.

Example

In this example, we try to optimize a single parameter, i.e., the time shift between signals. An assumed forward modeling produces a shifted Ricker wavelet, using the formula

F(t; \tau, f) = [ 1 - 2 \pi^2 f^2 (t - \tau)^2 ] e^{- \pi^2 f^2 (t - \tau)^2}, (11)

where \tau is the time shift and f is the dominant frequency. The modeling given by Equation 11 is a simplified version of a PDE-based simulation. The reward we use for the training is the negative normalized L2 norm of the data residuals (the second formula in Equation 10). In this example, the data are discretized using nt = 200 samples with a time sampling of dt = 0.01 s.
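The following self-contained Python sketch ties Equations 9, 10 and 11 together as a toy environment the DQN agent can interact with. The water-level deconvolution with a squared-and-normalized filter is one simple realization of the OTMF construction in Equations 2 and 3, not necessarily the exact preconditioning of [2], and the finite-difference gradient step with a fixed step length stands in for a real adjoint-based FWI update; the step length, water level and default arguments are illustrative values.

```python
import numpy as np

def ricker(t, tau, f=3.0):
    """Equation 11: shifted Ricker wavelet with dominant frequency f."""
    a = (np.pi * f * (t - tau)) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

def otmf(p, d, dt, eps=1e-8):
    """OTMF misfit (Equations 2-3): W2 distance of the filter to a Dirac."""
    # matching filter solving d * w = p, via water-level deconvolution
    P, D = np.fft.fft(p), np.fft.fft(d)
    w = np.real(np.fft.ifft(P * np.conj(D) / (np.abs(D) ** 2 + eps)))
    w = np.fft.fftshift(w)            # place zero lag at the window center
    w2 = w ** 2
    w2 /= w2.sum()                    # precondition into a distribution
    lags = (np.arange(len(p)) - len(p) // 2) * dt
    return np.sum(lags ** 2 * w2)     # cost of moving all mass to zero lag

class ShiftInversionEnv:
    """Toy FWI: invert a single time shift tau (state as in Equation 9)."""
    def __init__(self, nt=200, dt=0.01, f=3.0):
        self.t, self.dt, self.f = np.arange(nt) * dt, dt, f

    def reset(self, tau_true, tau_init):
        self.d = ricker(self.t, tau_true, self.f)   # measured data
        self.tau = tau_init
        return self.state()

    def state(self):
        # Equation 9: the (predicted, measured) data pair
        return np.concatenate([ricker(self.t, self.tau, self.f), self.d])

    def misfit(self, tau, action):
        p = ricker(self.t, tau, self.f)
        if action == 0:                              # L2 norm, Equation 1
            return 0.5 * np.sum((p - self.d) ** 2)
        return otmf(p, self.d, self.dt)              # OTMF, Equation 3

    def step(self, action, lr=0.05, h=1e-4):
        # one FWI iteration: finite-difference gradient descent on tau
        g = (self.misfit(self.tau + h, action)
             - self.misfit(self.tau - h, action)) / (2.0 * h)
        self.tau -= lr * g
        p = ricker(self.t, self.tau, self.f)
        # Equation 10, second form: negative normalized data residual
        return self.state(), -np.linalg.norm(p - self.d) / np.linalg.norm(self.d)
```

With this environment, one training episode follows Algorithm 1 directly: env.reset(tau_true, tau_init) gives the initial state, epsilon_greedy picks action 0 (L2 norm) or 1 (OTMF), and env.step(action) returns the next state and the reward to be stored in the replay buffer.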
Figure 1: a) The loss value of Equation 7 over episodes; b) the accumulated reward over episodes.
Figure 2: a) The Q function for different states (time shift); b) the actions the policy takes for different states (0 for L2 norm, 1 for OTMF).

We use a direct connected network (DCN) for the Q function, with one hidden layer of size nt. The Q network outputs two scalar values representing the Q values for the L2 norm and the OTMF. We set the initial value of \epsilon to 0.90 and drop it exponentially to 0.05 at the end. Using a 3 Hz peak-frequency wavelet, we randomly generate the initial and true time shifts between 0.4 s and 1.2 s. In each episode (one full run of the inversion), we iterate for 12 iterations. We run ten thousand episodes for training, and we update the Q network based on Equation 7 at every iteration. The batch size is set to 128, i.e., we randomly fetch 128 tuples of (s_t, a_t, r_t, s_{t+1}) for updating the Q function. Figure 1a shows the loss of Equation 7 over episodes (the curves in Figure 1 have been smoothed with a moving average over 100 episodes). Its convergence demonstrates the success of the RL training. Figure 1b shows the accumulated reward over episodes, and its increasing value further indicates that the learned policy improved and can achieve fast convergence with a higher reward throughout the training. In order to further understand the trained Q function, we plot the Q value for different time shifts: we set the measured data to a time shift of 0.5 s and scan the Q function over the predicted data with a time shift varying from 0.5 to 1.1 s. We plot the Q function over the relative time shift between the predicted and measured data in Figure 2a. Figure 2b denotes the action that would be taken based on the learned Q function (0 for L2 norm, 1 for OTMF). We can see that if the relative time shift is smaller than approximately 0.15 s, the Q value for the L2 norm is larger than that for the OTMF, suggesting we apply the L2-norm misfit function. Otherwise, the learned Q function suggests using the OTMF to avoid cycle skipping. Note that the switch point at 0.15 s is consistent with half a cycle of the 3 Hz peak frequency we used in training; however, this number is determined entirely from the data itself in the framework of reinforcement learning.

Conclusions

In the framework of reinforcement learning, we trained a deep Q network (DQN) to select a misfit function for FWI. We used a time-shift inversion example to demonstrate the basic principle of our method. The resulting trained network managed to use the data to determine the appropriate objective function to achieve convergence.
References

[1] Yunan Yang and Björn Engquist. Analysis of optimal transport and related misfit functions in full-waveform inversion. Geophysics, 83(1):A7–A12, 2018.
[2] Bingbing Sun and Tariq Alkhalifah. The application of an optimal transport to a preconditioned data matching function for robust waveform inversion. Geophysics, 84(6):R923–R945, 2019.
[3] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv:1312.5602, 2013.
[4] David Silver et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
[5] Oriol Vinyals et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
[6] B. Sun and T. Alkhalifah. Salt body inversion using an optimal transport of the preconditioned matching filter. 81st EAGE Conference and Exhibition, 2019.