A Fast Proximal Gradient Method and Convergence Analysis for Dynamic Mean Field Planning
JIAJIA YU, RONGJIE LAI, WUCHEN LI, AND STANLEY OSHER

Abstract.
In this paper, we propose an efficient and flexible algorithm to solve dynamic mean-field planning problems based on an accelerated proximal gradient method. Besides an easy-to-implement gradient descent step in this algorithm, a crucial projection step becomes solving an elliptic equation whose solution can be obtained by conventional methods efficiently. By induction on the iterations used in the algorithm, we theoretically show that the proposed discrete solution converges to the underlying continuous solution as the grid size increases. Furthermore, we generalize our algorithm to mean-field game problems and accelerate it using multilevel and multigrid strategies. We conduct comprehensive numerical experiments to confirm the convergence analysis of the proposed algorithm, to show its efficiency and mass preservation property by comparing it with state-of-the-art methods, and to illustrate its flexibility for handling various mean-field variational problems.
Key words. Mean field planning; Optimal transport; Mean field games; Multigrid method; FISTA.
AMS subject classifications.
1. Introduction.
Mean field planning (MFP) problems study how a large number of similar rational agents make strategic movements to minimize their cost in a process satisfying given initial and terminal density distributions [2, 24, 37]. On the one hand, MFP can be viewed as a generalization of optimal transport (OT) [11, 12, 36, 40], where no interaction cost is considered in the process. On the other hand, MFP is also a special case of mean field game (MFG) problems, where the terminal density is often provided implicitly [25, 27, 28, 30]. MFP, MFG and OT have wide applications in economics [1, 5, 22], engineering [21, 23, 42], quantum chemistry [18, 19], image processing [26, 34] as well as machine learning [8, 39, 41, 43].

More specifically, the dynamic MFP problem has the following optimization formulation:

(1.1)    min_{(ρ,m)∈C(ρ₀,ρ₁)}  ∫₀¹ ∫_Ω L(ρ(t,x), m(t,x)) dx dt + ∫₀¹ F(ρ(t,·)) dt,

where ρ(t,x) is the density of agents, m := ρv with v representing the strategy (control) of the agents, and any pair (ρ,m) ∈ C(ρ₀,ρ₁) satisfies the mass conservation and zero boundary flux conditions with initial and terminal densities ρ₀, ρ₁:

(1.2)    C(ρ₀,ρ₁) := {(ρ,m) : ∂_t ρ + div_x m = 0,  m·n = 0 for x ∈ ∂Ω,  ρ(0,·) = ρ₀,  ρ(1,·) = ρ₁}.

In this variational problem, L(ρ,m) denotes the dynamic cost and F models the interaction cost. Specially, with F = 0 and a specific choice of L, variational problem (1.1)

∗ J. Yu and R. Lai's work is supported in part by an NSF Career Award DMS-1752934. W. Li and S. Osher's work is supported in part by AFOSR MURI FP 9550-18-1-502.
† Department of Mathematics, Rensselaer Polytechnic Institute, United States ([email protected]).
‡ Corresponding author. Department of Mathematics, Rensselaer Polytechnic Institute, United States ([email protected]).
§ Department of Mathematics, University of South Carolina, United States ([email protected]).
¶ Department of Mathematics, University of California, Los Angeles, United States ([email protected])

reduces to the dynamic formulation of optimal transport (OT) proposed in [11, 12]. By relaxing the given terminal density to an implicit condition regularized by a functional G, one can retrieve a class of MFG with the following formulation [15, 25, 30]:

(1.3)    min_{(ρ,m)∈C(ρ₀)}  ∫₀¹ ∫_Ω L(ρ(t,x), m(t,x)) dx dt + ∫₀¹ F(ρ(t,·)) dt + G(ρ(1,·)),

where C(ρ₀) := {(ρ,m) : ∂_t ρ + div_x m = 0,  m·n = 0 for x ∈ ∂Ω,  ρ(0,·) = ρ₀}.

Several numerical methods have been established to solve dynamic MFP, MFG and OT problems. One class of methods is based on solving the partial differential equations (PDEs) corresponding to the KKT system of the variational problem [2, 3, 4, 16], where conventional numerical methods for nonlinear PDEs can be applied. In principle, this class of methods can also be applied to handle general MFP and MFG problems that may not come from variational formulas. However, one obvious challenge of solving the PDEs directly is their nonlinearity. This also limits the solvers in handling broader choices of the dynamic cost L and interaction cost F.

Another class of methods focuses on the variational formulas of dynamic MFP, MFG and OT problems. By naturally combining with recent advances in optimization, existing methods include several first-order optimization algorithms to solve dynamic OT problems, such as augmented Lagrangian [13, 14, 35], primal-dual [31] and G-prox [29] methods. These methods work on either the Lagrangian or the dual problem of the original optimization problem, particularly for dynamic OT where F ≡ 0 and G ≡ 0.
Besides, all of these algorithms may not preserve mass in the evolution very well, as the mass constraint is not explicitly enforced.

We would like to propose a method that can efficiently compute mean-field type problems with a mass preservation property and great flexibility over a broad range of objective functions. Note that the mass conservation constraint in MFP is a convex set. A straightforward calculation shows that the projection onto this convex set can be obtained by solving a standard Poisson equation. This motivates us to propose another algorithm to solve MFP problems based on the proximal gradient descent method [9, 38]. This method is simply composed of a gradient descent step and a projection step. For MFP problems with a smooth objective function, the gradient descent step can be evaluated very efficiently. It also enjoys the flexibility to handle a broader range of L and F. More importantly, the projection step leads to mass preservation in each iteration. This step can also be computed very efficiently by conventional fast algorithms for solving a Poisson equation. In this work, we use an accelerated version of the proximal gradient descent method, referred to as the fast iterative shrinkage-thresholding algorithm (FISTA) [10], to solve the MFP problems. After that, we further generalize our algorithm to handle MFG problems. In addition, inspired by [7, 32, 33], we also apply multigrid and multilevel methods to speed up the proposed algorithm. Our numerical experiments illustrate the efficiency, mass preservation and flexibility of the proposed algorithm on different MFP problems as well as MFG problems. The vanilla version of our algorithm performs comparably with state-of-the-art methods, while the multigrid and multilevel accelerated versions are much more efficient than state-of-the-art methods.

Fig. 1. Sketch of the two approaches to the convergence proof: (a) the proof in [2] relates the discrete optimizer to the discrete PDE solution and the continuous optimizer to the continuous PDE solution, with the discrete and continuous systems linked through a denser mesh; (b) our proof relates the discrete algorithm to the continuous algorithm through a denser mesh and more iterations.

Besides proposing a new algorithm for MFP problems, we also analyze the errors between the discrete solution and the continuous solution. Since MFP is a functional optimization problem, all numerical methods on a given mesh grid only provide approximate solutions to the continuous problem. It is important to understand how close the discrete numerical solution is to the continuous solution on a given mesh grid. Our analysis is from the algorithm perspective. We first derive an algorithm to optimize the variational problem and discretize each step of our algorithm. Our main effort is to prove that, at each iteration, the discrete values are not far from the underlying continuous function values on grid points. Therefore we can show that the discrete algorithm converges to the continuous optimizer on grid points under certain conditions. Similar types of analysis may not be conveniently conducted for the existing methods, including augmented Lagrangian, primal-dual and G-prox, since it could be difficult to carry out the desired perturbation analysis of the cubic equations solved in these three methods. To the best of our knowledge, this is the first time the discretization error has been examined based on variational MFP and its optimization algorithm. We remark that convergence analysis has also been studied in [2, 3, 6] from the PDE perspective, where the authors argue that the solution of the discrete KKT system converges to the continuous solution based on the equivalence of the continuous and discrete systems. We indicate the major difference between our error analysis based on the optimization perspective and the error analysis based on the PDE perspective in Figure 1.
Contributions:
We summarize our contributions as follows:
1. We propose to use an accelerated proximal gradient method to solve the MFP problem (1.1).
2. We analyze the error between each iteration of the discrete optimization and its continuous counterpart. We prove that the discrete solution converges to the continuous optimizer on grid points as the mesh size converges to zero.
3. We apply multilevel and multigrid strategies to accelerate our algorithm. We also generalize our algorithm to solve MFG problems.
4. We conduct comprehensive numerical experiments to illustrate the efficiency and flexibility of our algorithms.
Organization:
Our paper is organized as follows. In section 2, we give a brief review of the MFP problem and provide several MFP examples and an MFG example.
After that, we describe our algorithm and the implementation details in section 3. In section 4, we analyze the discretization error in our algorithm and prove the main theoretical result on the convergence of discrete solutions to the continuous solution as the grid size goes to zero. Furthermore, we generalize our algorithm to solve variational MFG problems and accelerate our algorithm by multilevel and multigrid methods in section 5. Numerical experiments are provided in section 6 to demonstrate the convergence order and to illustrate the efficiency and flexibility of our algorithm. At last, we conclude this paper in section 7.
2. Review.
In this section, we briefly review the MFP problem and provide several examples which will be computed in the experiment section.

Consider the model on the time interval [0, 1] and space region Ω ⊂ R^D. Let ρ : [0,1] × Ω → R₊ := {a ∈ R | a ≥ 0} be the density of agents through t ∈ [0,1], and let m = (m₁, …, m_D) : [0,1] × Ω → R^D be the flux of the density, which models the strategies (controls) of the agents. We are interested in ρ with given initial and terminal densities ρ₀, ρ₁ and (ρ, m) satisfying the zero boundary flux and mass conservation conditions, which gives the constraint set C(ρ₀, ρ₁) defined in (1.2). We denote by L : R₊ × R^D → R_∞ := R ∪ {∞} the dynamic cost function (e.g. (2.3) in this paper) and by F a functional modeling the interaction cost. The goal of MFP is to minimize the total cost among all feasible (ρ, m) ∈ C(ρ₀, ρ₁). Therefore the problem can be formulated as

(2.1)    min_{(ρ,m)∈C(ρ₀,ρ₁)}  ∫₀¹ ∫_Ω L(ρ(t,x), m(t,x)) dx dt + ∫₀¹ F(ρ(t,·)) dt.

It is clear that C(ρ₀, ρ₁) is convex and compact. In addition, the mass conservation law ∂_t ρ + div_x m = 0 and the zero-flux boundary condition m·n = 0, x ∈ ∂Ω imply that C(ρ₀, ρ₁) ≠ ∅ if and only if ∫_Ω ρ₀ = ∫_Ω ρ₁. Once C(ρ₀, ρ₁) is non-empty, the existence and uniqueness of the optimizer depend on L and F.

There are many different choices of F. In this paper, we consider

(2.2)    F(ρ(t,·)) := λ_E ∫_Ω F_E(ρ(t,x)) dx + λ_Q ∫_Ω ρ(t,x) Q(x) dx,

where λ_E, λ_Q ≥ 0, F_E : R₊ → R serves as a function to regularize ρ, and Q : Ω → R provides a moving preference for the density ρ. Consider an illustrative example: choosing Ω₀ ⊂ Ω and assuming

    Q(x) = 0 for x ∈ Ω₀,   Q(x) = +∞ for x ∉ Ω₀,

the mass has to move within Ω₀ in order to keep the cost finite. For a more general choice of Q, ρ(t,x) tends to be smaller where Q(x) is larger, and vice versa.

We then briefly discuss several concrete examples which will be considered in our numerical experiments.

Example 2.1 (optimal transport). Define L by

(2.3)    L(β₁, β₂) := ‖β₂‖²/(2β₁)  if β₁ > 0;   0  if β₁ = 0, β₂ = 0;   +∞  if β₁ = 0, β₂ ≠ 0.

If λ_E = λ_Q = 0, the MFP becomes the dynamic formulation of optimal transport

(2.4)    min_{(ρ,m)∈C(ρ₀,ρ₁)}  ∫₀¹ ∫_Ω L(ρ(t,x), m(t,x)) dx dt.

Since m = ρv, this definition of L makes sure that m = 0 wherever ρ = 0. Because λ_E = λ_Q = 0, OT can be viewed as a special case of MFP where masses move freely in Ω through t ∈ [0, 1].

Example 2.2 (crowd motion). Define

    F_E : R₊ → R,   ρ ↦ ρ log(ρ) if ρ > 0,  and 0 if ρ = 0,

and write Ω₊ := {x ∈ Ω : ρ(t,x) > 0}; we have the crowd motion model

(2.5)    min_{(ρ,m)∈C(ρ₀,ρ₁)}  ∫₀¹ ∫_Ω L(ρ(t,x), m(t,x)) dx dt + λ_E ∫₀¹ ∫_{Ω₊} ρ(t,x) log(ρ(t,x)) dx dt + λ_Q ∫₀¹ ∫_Ω ρ(t,x) Q(x) dx dt.

With F_E decreasing on [0, e⁻¹] and increasing on [e⁻¹, +∞), ρ(t,x) tends to be close to e⁻¹ everywhere. So we expect the density ρ(t,x) to be neither sparse nor very large anywhere.

Example 2.3 (power regularization). Define

    F_E : R₊ → R,   ρ ↦ (1/|p|) ρᵖ if ρ > 0,  and 0 if ρ = 0,

where p = 2 or −1; then we have the following two models:

(2.6)    min_{(ρ,m)∈C(ρ₀,ρ₁)}  ∫₀¹ ∫_Ω L(ρ(t,x), m(t,x)) dx dt + λ_E ∫₀¹ ∫_Ω ρ(t,x)²/2 dx dt + λ_Q ∫₀¹ ∫_Ω ρ(t,x) Q(x) dx dt,

(2.7)    min_{(ρ,m)∈C(ρ₀,ρ₁)}  ∫₀¹ ∫_Ω L(ρ(t,x), m(t,x)) dx dt + λ_E ∫₀¹ ∫_{Ω₊} 1/ρ(t,x) dx dt + λ_Q ∫₀¹ ∫_Ω ρ(t,x) Q(x) dx dt.

In (2.6), by the Cauchy-Schwarz inequality, we have

(2.8)    (∫_Ω ρ(t,x) dx)² ≤ ∫_Ω ρ(t,x)² dx · ∫_Ω 1 dx,

therefore ∫_Ω ρ(t,x)² dx has a lower bound, and this lower bound is achieved when ρ(t,·) is a constant over Ω. Therefore, model (2.6) guides the solution density to be uniformly distributed over Ω. In (2.7), since the total mass ∫_Ω ρ(t,x) dx is fixed and 1/ρ is larger when ρ is smaller, the value of the regularization term λ_E ∫_{Ω₊} 1/ρ(t,x) dx is smaller if ρ(t,x) accumulates at several sites and vanishes in other regions. Therefore model (2.7) pursues a sparse optimizer ρ(t,x).
Example 2.4 (mean field games). Suppose the terminal density ρ₁ is not explicitly provided, but it satisfies a given preference. This preference can be imposed by regularizing ρ(1,·) in the same spirit as ∫_Ω ρ(t,x) Q(x) dx, which yields the following MFG model:

(2.9)    min_{(ρ,m)∈C(ρ₀)}  ∫₀¹ ∫_Ω L(ρ(t,x), m(t,x)) dx dt + λ_E ∫₀¹ ∫_Ω ρ(t,x) log(ρ(t,x)) dx dt + λ_Q ∫₀¹ ∫_Ω ρ(t,x) Q(x) dx dt + λ_G ∫_Ω ρ(1,x) G(x) dx.

Here λ_G > 0, G : Ω → R gives a preference for the distribution of ρ(1,x), and

    C(ρ₀) := {(ρ,m) : ∂_t ρ + div_x m = 0,  m·n = 0 for x ∈ ∂Ω,  ρ(0,·) = ρ₀}.
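For readers who want to experiment with these objectives, the following is a minimal NumPy sketch (the helper names are ours, not the paper's) of the pointwise dynamic cost (2.3) and the entropy regularizer of Example 2.2, with the ρ = 0 conventions handled explicitly:

```python
import numpy as np

def dynamic_cost(rho, m):
    """Pointwise dynamic cost L of (2.3): |m|^2 / (2 rho), with the
    conventions L = 0 when rho = 0 and m = 0, and L = +inf when rho = 0
    but m != 0.  `m` has shape (D, ...) and `rho` the matching shape (...)."""
    m_sq = np.sum(np.square(m), axis=0)          # |m|^2 summed over the D components
    out = np.full_like(m_sq, np.inf)
    pos = rho > 0
    out[pos] = m_sq[pos] / (2.0 * rho[pos])
    out[(rho == 0) & (m_sq == 0)] = 0.0
    return out

def entropy_density(rho):
    """Crowd-motion regularizer F_E(rho) = rho*log(rho) for rho > 0,
    extended by 0 at rho = 0 (Example 2.2)."""
    out = np.zeros_like(rho)
    pos = rho > 0
    out[pos] = rho[pos] * np.log(rho[pos])
    return out

costs = dynamic_cost(np.array([2.0, 0.0, 0.0]), np.array([[2.0, 0.0, 1.0]]))
```

The three entries of `costs` exercise the three branches of (2.3): positive density, vacuum with zero flux, and vacuum with nonzero flux.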
3. Algorithm.
In this section, we first briefly review the FISTA algorithm proposed in [10]. Using a first-optimize-then-discretize approach, we describe the FISTA algorithm for the variational problem (1.1). After that, we provide the details of our discretization and implementation for the MFP. At the end of this section, we discuss a different approach based on a first-discretize-then-optimize strategy, which turns out to lead to the same discrete algorithm.

To solve the general nonsmooth convex model

    min_z f(z) + g(z),

where f is a smooth convex function and g is convex but possibly nonsmooth, one can apply the proximal gradient method [9, 38]:

    z^(k+1) := Prox_{η^(k) g}(z^(k) − η^(k) ∇f(z^(k))).

Here η^(k) > 0 is the step size, and the proximal operator is defined as

    Prox_{ηg}(z) := argmin_y { g(y) + (1/(2η)) ‖y − z‖² }.

In particular, for the indicator function χ_C(z) = 0 if z ∈ C, +∞ if z ∉ C, of a convex set C, its proximal operator is exactly the projection operator onto C, i.e.

    Prox_{ηχ_C}(z) = Proj_C(z) = argmin_{y∈C} ‖y − z‖²,   ∀η > 0.

FISTA is essentially an accelerated proximal gradient algorithm [10]. It introduces ẑ^(k) as a linear combination of z^(k) and z^(k−1) in each iteration, and conducts a proximal gradient step on ẑ^(k) to obtain z^(k+1). The algorithm is summarized in (3.2), where the step sizes η^(k) can either be a constant or be obtained by a backtracking line search:

(3.2)    z^(k+1) = Prox_{η^(k) g}(ẑ^(k) − η^(k) ∇f(ẑ^(k)));
         τ^(k+1) = (1 + √(1 + 4(τ^(k))²)) / 2;
         ẑ^(k+1) = z^(k+1) + ((τ^(k) − 1)/τ^(k+1)) (z^(k+1) − z^(k)).

If z* = argmin_z f(z) + g(z) and {z^(k)} is generated by FISTA, then

    [f(z^(k)) + g(z^(k))] − [f(z*) + g(z*)] = O(1/(k+1)²).
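As a concrete illustration of iteration (3.2), here is a minimal NumPy sketch of FISTA (our own toy implementation, not the paper's MFP-specific code), applied to a problem where g is the indicator of the nonnegative orthant, so that the proximal step is a projection:

```python
import numpy as np

def fista(grad_f, prox_g, z0, step, n_iters=200):
    """Minimal sketch of the FISTA iteration (3.2).

    grad_f : gradient of the smooth part f
    prox_g : proximal operator of g (a projection when g is an indicator)
    step   : constant step size eta
    """
    z = np.asarray(z0, dtype=float)
    z_hat = z.copy()
    tau = 1.0
    for _ in range(n_iters):
        z_next = prox_g(z_hat - step * grad_f(z_hat))         # proximal gradient step
        tau_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * tau**2))  # momentum weight update
        z_hat = z_next + (tau - 1.0) / tau_next * (z_next - z)
        z, tau = z_next, tau_next
    return z

# Toy problem: minimize 0.5*||z - c||^2 subject to z >= 0, so prox_g is
# the projection onto the nonnegative orthant.
c = np.array([1.0, -2.0, 3.0])
z_star = fista(lambda z: z - c, lambda z: np.maximum(z, 0.0), np.zeros(3), step=0.5)
```

The minimizer of this toy problem is the componentwise positive part of c, which the iteration recovers; the MFP algorithm below has exactly this structure, with the projection step replaced by a Poisson solve.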
To apply the above FISTA method to problem (1.1), let us write

(3.3)    min_{(ρ,m)∈C(ρ₀,ρ₁)}  Y(ρ, m) := ∫₀¹ ∫_Ω Y(ρ(t,x), m(t,x), x) dx dt,

where

(3.4)    Y(β₁, β₂, x) = L(β₁, β₂) + λ_E F_E(β₁) + λ_Q β₁ Q(x).

For convenience, we write F′_E = dF_E/dβ₁, Y₁ = ∂Y/∂β₁, L₁ = ∂L/∂β₁, ∇_{β₂}Y = (∂Y/∂β₂^d)_{d=1,…,D} and ∇_{β₂}L = (∂L/∂β₂^d)_{d=1,…,D}. This yields

(3.5)    Y₁(ρ(t,x), m(t,x), x) = L₁(ρ(t,x), m(t,x)) + λ_E F′_E(ρ(t,x)) + λ_Q Q(x),
         ∇_{β₂}Y(ρ(t,x), m(t,x), x) = ∇_{β₂}L(ρ(t,x), m(t,x)).

To apply FISTA to this problem, we need to compute the gradients δ_ρ Y(ρ,m), δ_m Y(ρ,m) and the projection Proj_{C(ρ₀,ρ₁)}(ρ,m).

Gradient descent.
Let the boundary values ρ(0,·) = ρ₀, ρ(1,·) = ρ₁ and m(t,x)·n = 0 for x ∈ ∂Ω be fixed. By variational calculus, we have

(3.6)    δ_ρ Y(ρ,m)(t,x) = Y₁(ρ(t,x), m(t,x), x)  if t ≠ 0, 1,   and 0 if t = 0, 1;
         δ_m Y(ρ,m)(t,x) = ∇_{β₂}Y(ρ(t,x), m(t,x), x)  if x ∉ ∂Ω,   and 0 if x ∈ ∂Ω.

Then with step size η^(k), the descent step can be written as

(3.7)    (ρ^(k+1/2), m^(k+1/2)) = (ρ̂^(k) − η^(k) δ_ρ Y(ρ̂^(k), m̂^(k)),  m̂^(k) − η^(k) δ_m Y(ρ̂^(k), m̂^(k))).

Projection.
The projection step solves the following minimization problem:

(3.8)    (ρ^(k+1), m^(k+1)) = argmin_{(ρ,m)∈C(ρ₀,ρ₁)}  ½‖ρ − ρ^(k+1/2)‖²_{L²} + ½‖m − m^(k+1/2)‖²_{L²}.

Since the boundary values are fixed and the boundary conditions are always satisfied, we only need to introduce a dual variable φ^(k+1)(t,x) for the mass conservation equation ∂_t ρ + div_x m = 0. Consider the Lagrangian

(3.9)    L(ρ, m, φ) := ½‖ρ − ρ^(k+1/2)‖²_{L²} + ½‖m − m^(k+1/2)‖²_{L²} + ⟨φ, ∂_t ρ + div_x m⟩
                     = ½‖ρ − ρ^(k+1/2)‖²_{L²} + ½‖m − m^(k+1/2)‖²_{L²} − ⟨∂_t φ, ρ⟩ − ⟨∇_x φ, m⟩ + ⟨φ(1,·), ρ₁⟩ − ⟨φ(0,·), ρ₀⟩.

The triple (ρ^(k+1), m^(k+1), φ^(k+1)) is the saddle point of L(ρ, m, φ) if and only if

(3.10)   δ_ρ L(ρ^(k+1), m^(k+1), φ^(k+1)) = 0,   δ_m L(ρ^(k+1), m^(k+1), φ^(k+1)) = 0,   δ_φ L(ρ^(k+1), m^(k+1), φ^(k+1)) = 0.

This yields

(3.11)   ρ^(k+1) = ρ^(k+1/2) + ∂_t φ^(k+1),   m^(k+1) = m^(k+1/2) + ∇_x φ^(k+1),

and

(3.12)   ∂_t ρ^(k+1) + div_x m^(k+1) = 0.

Combining (3.11) and (3.12), it is clear that the dual variable φ^(k+1) solves the Poisson equation

(3.13)   −Δ_{t,x} φ^(k+1)(t,x) = ∂_t ρ^(k+1/2)(t,x) + div_x m^(k+1/2)(t,x),   0 ≤ t ≤ 1, x ∈ Ω,
         ∂_t φ^(k+1)(t,x) = 0,   t = 0, 1, x ∈ Ω,
         ∇_x φ^(k+1)(t,x)·n = 0,   0 ≤ t ≤ 1, x ∈ ∂Ω.

Therefore, we can obtain the projection (3.8) in two steps: solving the Poisson equation (3.13) and updating ρ, m by (3.11). The FISTA algorithm for the MFP problem (3.3) is summarized in Algorithm 3.1.
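The structure of this projection is easiest to see in a finite-dimensional analogue, where the divergence constraint becomes a generic linear constraint Az = b: the dual variable then solves a linear system with matrix AAᵀ, just as φ^(k+1) solves a Poisson equation (AAᵀ plays the role of −Δ_{t,x}). A hypothetical sketch:

```python
import numpy as np

def project_affine(z, A, b):
    """Project z onto the affine set {y : A y = b}.

    This mirrors the saddle-point system of the projection step: the
    projection is y = z + A^T phi, where the dual variable phi solves
    (A A^T) phi = b - A z.  In the MFP projection, A is the (linear)
    divergence constraint and A A^T corresponds to the negative
    Laplacian, so phi solves a Poisson equation.
    """
    phi = np.linalg.solve(A @ A.T, b - A @ z)  # dual variable, cf. phi^(k+1)
    return z + A.T @ phi

# Toy "conservation" constraint: the entries of y must sum to 1.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
y = project_affine(np.array([0.0, 0.0, 3.0]), A, b)
```

The returned point satisfies the constraint exactly, which is the discrete counterpart of the mass preservation enforced by the projection in each FISTA iteration.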
Remark 3.1. Since for any x ∈ Ω, ρ^(k+1/2)(0,x) = ρ₀(x) and ρ^(k+1/2)(1,x) = ρ₁(x), and for any t ∈ [0,1], x ∈ ∂Ω, m^(k+1/2)(t,x)·n = 0, we have

    ∫_{[0,1]×Ω} ∂_t ρ^(k+1/2)(t,x) + div_x m^(k+1/2)(t,x) dx dt
      = ∫_Ω (ρ^(k+1/2)(1,x) − ρ^(k+1/2)(0,x)) dx + ∫₀¹ ∫_{∂Ω} m^(k+1/2)(t,x)·n ds dt
      = ∫_Ω (ρ₁(x) − ρ₀(x)) dx
      = 0.

This means (3.13) is solvable and the solution is unique up to a constant. In addition, the constant does not matter in the projection step, because in (3.11) we only need ∂_t φ^(k+1) and ∇_x φ^(k+1). Therefore the projection step is well-defined.

For convenience, we here assume Ω = [0,1]^D. Then the boundary conditions of m = (m₁, …, m_D) are provided as: m_d(t,x) = 0 if x_d = 0 or 1, for d = 1, …, D. Consider a uniform grid with n₀ segments on the time interval [0, 1] and n_d segments in the d-th space dimension. Namely, the mesh size in each dimension is Δ_d = 1/n_d, d = 0, 1, …, D.
Algorithm 3.1 FISTA for MFP
Parameters: ρ₀, ρ₁.
Initialization: τ^(1) = 1; ρ^(0)(0,·) = ρ̂^(0)(0,·) = ρ₀, ρ^(0)(1,·) = ρ̂^(0)(1,·) = ρ₁, ρ^(0)(t,·) = ρ̂^(0)(t,·) = 1 for 0 < t < 1; m^(0)(·,x)·n = m̂^(0)(·,x)·n = 0 for x ∈ ∂Ω, m^(0)(·,x) = m̂^(0)(·,x) = 1 for x ∈ Ω \ ∂Ω.
for k = 0, 1, 2, … do
  gradient descent:
  (3.14)  ρ^(k+1/2)(t,x) = ρ̂^(k)(t,x) − η^(k) Y₁(ρ̂^(k)(t,x), m̂^(k)(t,x), x),   0 < t < 1, x ∈ Ω;
          m^(k+1/2)(t,x) = m̂^(k)(t,x) − η^(k) ∇_m Y(ρ̂^(k)(t,x), m̂^(k)(t,x), x),   0 ≤ t ≤ 1, x ∈ Ω \ ∂Ω.
  projection: solve φ^(k+1) from
  (3.15)  −Δ_{t,x} φ^(k+1)(t,x) = ∂_t ρ^(k+1/2)(t,x) + div_x m^(k+1/2)(t,x),   0 ≤ t ≤ 1, x ∈ Ω,
          ∂_t φ^(k+1)(t,x) = 0,  t = 0, 1, x ∈ Ω;   ∇_x φ^(k+1)(t,x)·n = 0,  0 ≤ t ≤ 1, x ∈ ∂Ω,
  and conduct
  (3.16)  ρ^(k+1) = ρ^(k+1/2) + ∂_t φ^(k+1),   m^(k+1) = m^(k+1/2) + ∇_x φ^(k+1).
  update:
  (3.17)  τ^(k+1) = (1 + √(1 + 4(τ^(k))²)) / 2,   ω^(k) = (τ^(k) − 1)/τ^(k+1),
          (ρ̂^(k+1), m̂^(k+1)) = (1 + ω^(k))(ρ^(k+1), m^(k+1)) − ω^(k)(ρ^(k), m^(k)).
end for

The staggered grid points are t_{j₀} = (j₀ − ½)Δ₀ and (x_d)_{j_d} = (j_d − ½)Δ_d. We use a multi-dimensional index vector j = (j₀, j₁, …, j_D) to indicate a grid point (t_j, x_j), where x_j := ((x₁)_{j₁}, …, (x_D)_{j_D}). We further write u_j := u(t_j, x_j) for the value of a function u at the grid point, and U_j for the proposed numerical approximation of u_j. Our discretizations of ρ and m are defined on different staggered grids. For convenience, we list the following index sets:

    J̄_d := {1, 2, …, n_d}  (cell centers),   J_d := {3/2, 5/2, …, n_d − ½}  (interior cell faces),
    J^d := J̄₀ × ⋯ × J̄_{d−1} × J_d × J̄_{d+1} × ⋯ × J̄_D,   J̄ := J̄₀ × J̄₁ × ⋯ × J̄_D.

Figure 2 illustrates a 1D example, where n₀ = 4, n₁ = 5, and the grid points related to J⁰, J¹ and J̄ are annotated as red solid diamonds, blue solid diamonds and green solid dots, respectively.
Fig. 2. Illustration of staggered grids for the case d = 1.

We use P, M and Φ to denote the discretizations of ρ, m and φ, respectively. They are defined as:

    P := {P_j}_{j∈J⁰} ∈ V₀ := R^{(n₀−1)×n₁×⋯×n_D},
    M_d := {(M_d)_j}_{j∈J^d} ∈ V_d := R^{n₀×n₁×⋯×n_{d−1}×(n_d−1)×n_{d+1}×⋯×n_D},
    M := {M_d}_{d=1,…,D} ∈ V₁ × ⋯ × V_D,
    Φ := {Φ_j}_{j∈J̄} ∈ V̄ := R^{n₀×n₁×⋯×n_D}.

Moreover, we also define:

    P̄ := {P̄_j}_{j∈J̄} ∈ V̄,   M̄_d := {(M̄_d)_j}_{j∈J̄} ∈ V̄,   M̄ := {M̄_d}_{d=1,…,D} ∈ V̄^D.

Based on the above settings, we next discuss the details of computing the objective value, implementing the gradient descent step and conducting the projection step.
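In a 1D example like Figure 2, these spaces are simply multidimensional arrays; a hypothetical NumPy illustration (the variable names are ours):

```python
import numpy as np

# D = 1 example matching Figure 2: n0 = 4 time segments, n1 = 5 space segments.
n0, n1 = 4, 5
P   = np.zeros((n0 - 1, n1))   # density: interior time faces x space cells
M1  = np.zeros((n0, n1 - 1))   # flux: time cells x interior space faces
Phi = np.zeros((n0, n1))       # dual variable: one value per space-time cell
```

Boundary values of the density (at t = 0, 1) and of the flux (at x = 0, 1) are not stored, since they are fixed by the data ρ₀, ρ₁ and the zero-flux condition.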
Objective value.
To compute the objective function, we need the values of ρ(t,x) and m(t,x) at the same point (t,x). While P and M_d are defined on different grids, a natural choice is to average them onto the central grid J̄ first. For convenience, let M₀ ≡ P. We can define the average operators as

    A_d : V_d → V̄,   M_d ↦ M̄_d = A_d(M_d),   for d = 0, 1, …, D,

    (M̄_d)_j := ½ (M_d)_{j+e_d/2},                          j_d = 1,
                ½ ((M_d)_{j+e_d/2} + (M_d)_{j−e_d/2}),      j_d = 2, 3, …, n_d − 1,      for all j ∈ J̄,
                ½ (M_d)_{j−e_d/2},                          j_d = n_d,

    A : V₁ × ⋯ × V_D → V̄^D,   M ↦ {A_d(M_d)}_{d=1,…,D},

where e_d ∈ R^{D+1} has 1 in the (d+1)-th entry and 0 elsewhere, d = 0, …, D. The boundary conditions of M are implicitly included in the average operator. We further define P_A ∈ V̄ to impose the density boundary conditions:

    (P_A)_j := ½ ρ₀(x_j)  if j₀ = 1;   0  if j₀ = 2, …, n₀ − 1;   ½ ρ₁(x_j)  if j₀ = n₀,     for all j ∈ J̄.

As an example, in Figure 2, A₀ maps the red solid dots to the green dots, A₁ maps the blue solid dots to the green dots, and the red hollow dots contribute to the non-zero entries of P_A.

Now we are ready to evaluate the objective function by averaging P and M from their staggered grids to the central grid. Namely, we define P̄, M̄ ∈ V̄ as

(3.18)    P̄ := A₀(P) + P_A,   M̄ := A(M),

and then approximate the objective value by (∏_{d=0}^D Δ_d) Y(P, M), where

(3.19)    Y(P, M) := Ȳ(P̄, M̄) := Σ_{j∈J̄} Y(P̄_j, M̄_j, x_j).

Gradient descent.
To fulfil the gradient descent step, we first average (P, M) from the staggered grids J^d to the grid J̄ by (3.18), and compute the gradient values pointwise:

(3.20)    ∂_{P̄} Ȳ(P̄, M̄) := {Y₁(P̄_j, M̄_j, x_j)}_{j∈J̄},
          ∂_{M̄_d} Ȳ(P̄, M̄) := {(∇_{β₂}Y)_d(P̄_j, M̄_j, x_j)}_{j∈J̄},   d = 1, …, D,
          ∂_{M̄} Ȳ(P̄, M̄) := {∂_{M̄_d} Ȳ(P̄, M̄)}_{d=1,…,D}.

Then we average the gradient values back to the staggered grids J^d. Defining another set of average operators as

    A*_d : V̄ → V_d,   M̄_d ↦ M_d,   (M_d)_j := ½ ((M̄_d)_{j+e_d/2} + (M̄_d)_{j−e_d/2}),
    A* : V̄^D → V₁ × ⋯ × V_D,   M̄ ↦ {A*_d(M̄_d)}_{d=1,…,D},

we obtain the desired gradient values:

(3.21)    ∂_P Y(P, M) = A*₀(∂_{P̄} Ȳ(P̄, M̄)),
          ∂_M Y(P, M) = A*(∂_{M̄} Ȳ(P̄, M̄)).

Combining (3.20) and (3.21), we can implement the gradient descent step (3.14) on discrete meshes by

(3.22)    (P^(k+1/2), M^(k+1/2)) = (P̂^(k), M̂^(k)) − η^(k) (∂_P Y(P̂^(k), M̂^(k)), ∂_M Y(P̂^(k), M̂^(k))).

Projection.
To compute the projection, we use the finite difference method to discretize the corresponding differential operators in the PDE constraint. We first define the discrete partial derivatives

    D_d : V_d → V̄,   M_d ↦ D_d(M_d),   for d = 0, …, D,

    (D_d(M_d))_j := (1/Δ_d) (M_d)_{j+e_d/2},                          j_d = 1,
                    (1/Δ_d) ((M_d)_{j+e_d/2} − (M_d)_{j−e_d/2}),      j_d = 2, 3, …, n_d − 1,
                    −(1/Δ_d) (M_d)_{j−e_d/2},                         j_d = n_d,

the discrete divergence

    Div : V₀ × V₁ × ⋯ × V_D → V̄,   (P, M) ↦ D₀(P) + Σ_{d=1}^D D_d(M_d),

and the term P_D ∈ V̄ to impose the boundary conditions:

    (P_D)_j := −(1/Δ₀) ρ₀(x_j)  if j₀ = 1;   0  if j₀ = 2, …, n₀ − 1;   (1/Δ₀) ρ₁(x_j)  if j₀ = n₀.

Then the right-hand side of the first equation in (3.15) can be approximated by

    Div(P^(k+1/2), M^(k+1/2)) + P_D.

We approximate ∂_d with a central difference and ∂_dd with a three-point-stencil finite difference. By the homogeneous Neumann boundary condition, we have the discrete second-order derivatives

    D_dd : V̄ → V̄,   Φ ↦ D_dd(Φ),

    (D_dd(Φ))_j := (1/Δ_d²) (−Φ_j + Φ_{j+e_d}),                  j_d = 1,
                   (1/Δ_d²) (Φ_{j−e_d} − 2Φ_j + Φ_{j+e_d}),      j_d = 2, 3, …, n_d − 1,
                   (1/Δ_d²) (Φ_{j−e_d} − Φ_j),                   j_d = n_d,

and the discrete Laplacian

    Lap : V̄ → V̄,   Φ ↦ D₀₀(Φ) + Σ_{d=1}^D D_dd(Φ).

The Poisson equation (3.15) on the grids is therefore

(3.23)    −Lap(Φ^(k+1)) = Div(P^(k+1/2), M^(k+1/2)) + P_D.

Defining another set of derivative operators

    D*_d : V̄ → V_d,   Φ ↦ D*_d(Φ),   (D*_d(Φ))_j := (1/Δ_d) (Φ_{j+e_d/2} − Φ_{j−e_d/2}),
    Grad : V̄ → V₀ × V₁ × ⋯ × V_D,   Φ ↦ {D*_d(Φ)}_{d=0,1,…,D},

we obtain the second step of the projection, the discretization of (3.16):

(3.24)    (P^(k+1), M^(k+1)) = (P^(k+1/2), M^(k+1/2)) + Grad(Φ^(k+1)).

Combining the above ingredients, we obtain FISTA for MFP on a discrete mesh, summarized in Algorithm 3.2.
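A 1D sketch of these staggered-grid operators (our simplified analogue, ignoring the time/space product structure) shows the pattern: potentials live on the n cell centers, fluxes on the n−1 interior faces, and the zero-flux boundary faces are built into the divergence:

```python
import numpy as np

def grad1d(Phi, n):
    """1D analogue of D*_d: difference the n cell values of Phi onto the
    n-1 interior faces (mesh size 1/n)."""
    return n * (Phi[1:] - Phi[:-1])

def div1d(M, n):
    """1D analogue of D_d: difference the n-1 interior-face fluxes M back
    to the n cells, with the zero-flux boundary faces built in."""
    Mpad = np.concatenate(([0.0], M, [0.0]))  # boundary fluxes vanish
    return n * (Mpad[1:] - Mpad[:-1])

n = 5
Phi = np.array([0.0, 1.0, 3.0, 6.0, 10.0])
lap_phi = div1d(grad1d(Phi, n), n)  # the composite operator Div o Grad
```

With these definitions, the composite Div∘Grad reproduces the three-point Neumann Laplacian stencil, and the adjoint relation ⟨Φ, Div(M)⟩ = ⟨−Grad(Φ), M⟩ holds exactly, matching the consistency relations stated in (3.30).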
Remark 3.2. The discrete operators Div, Grad and Lap are consistent in the following sense. For the spaces V̄ and V₀ × V₁ × ⋯ × V_D, if we view the elements Φ and (P, M) as long vectors, we can define the inner products

(3.29)    ⟨Φ¹, Φ²⟩ := Σ_{j∈J̄} Φ¹_j Φ²_j,
          ⟨(P¹, M¹), (P², M²)⟩ := Σ_{j∈J⁰} P¹_j P²_j + Σ_{d=1}^D Σ_{j∈J^d} (M¹_d)_j (M²_d)_j,

and the induced norm ‖·‖_F. Then a simple calculation shows that for any Φ ∈ V̄ and (P, M) ∈ V₀ × V₁ × ⋯ × V_D, the following equations hold:

(3.30)    Lap(Φ) = Div ∘ Grad(Φ),   ⟨−Grad(Φ), (P, M)⟩ = ⟨Φ, Div(P, M)⟩.

These match the relations between div_{t,x}, ∇_{t,x} and Δ_{t,x} on continuous spaces.

Remark 3.3.
Algorithm 3.2
FISTA for MFP on a discrete mesh
Parameters: ρ₀, ρ₁.
Initialization: τ^(1) = 1, P^(0) = P̂^(0) = 1, and M_d^(0) = M̂_d^(0) = 1.
for k = 0, 1, 2, … do
  gradient descent:
  (3.25)  P^(k+1/2) = P̂^(k) − η^(k) ∂_P Y(P̂^(k), M̂^(k)),
          M^(k+1/2) = M̂^(k) − η^(k) ∂_M Y(P̂^(k), M̂^(k)).
  projection: solve Φ^(k+1) from
  (3.26)  −Lap(Φ^(k+1)) = Div(P^(k+1/2), M^(k+1/2)) + P_D,
  and conduct
  (3.27)  (P^(k+1), M^(k+1)) = (P^(k+1/2), M^(k+1/2)) + Grad(Φ^(k+1)).
  update:
  (3.28)  τ^(k+1) = (1 + √(1 + 4(τ^(k))²)) / 2,   ω^(k) = (τ^(k) − 1)/τ^(k+1),
          (P̂^(k+1), M̂^(k+1)) = (1 + ω^(k))(P^(k+1), M^(k+1)) − ω^(k)(P^(k), M^(k)).
end for

Recall that Δ_d = 1/n_d. For i ∈ J̄, let λ_i ∈ R and Ψ^i ∈ V̄ be

    λ_i = −Σ_{d=0}^D 4 n_d² sin²((i_d − 1)π / (2 n_d)),
    Ψ^i_j = ∏_{d=0}^D √(δ_{i_d}/n_d) cos((j_d − ½)(i_d − 1)π / n_d),   where δ_{i_d} := 1 if i_d = 1 and 2 otherwise;

then {Ψ^i}_{i∈J̄} forms an orthonormal basis of (V̄, ‖·‖_F) and

    Lap(Ψ^i) = λ_i Ψ^i.

Note that λ_i = 0 if and only if i = 1 := (1, …, 1), and Ψ^1 has the value ∏_{d=0}^D √(δ_{1}/n_d) in all entries, which implies ker(Lap) = {C·1 : C ∈ R}. This matches the fact that in the continuous setting, Δu = 0 if u ≡ C on Ω̄ = [0,1]^{D+1}. For any Φ ∈ V̄, we can define Lap⁻¹(Φ) by requiring

(3.31)    ⟨Lap⁻¹(Φ), Ψ^1⟩ = 0,

which gives

    Lap⁻¹(Φ) := Σ_{i∈J̄\{1}} (1/λ_i) ⟨Φ, Ψ^i⟩ Ψ^i.

This leads to a discrete cosine transform method to solve (3.26).

To derive the discrete Algorithm 3.2, we optimize the continuous problem (3.3) by Algorithm 3.1 and then discretize the algorithm. This is a first-optimize-then-discretize approach.
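The cosine-transform solve can be sketched in 1D with plain NumPy (our own reconstruction, under the assumption that the basis above is the standard orthonormal DCT-II basis): build the cosine basis, divide by the eigenvalues of −Lap, and zero out the constant kernel mode as in (3.31):

```python
import numpy as np

def dct_matrix(n):
    """Rows are the orthonormal DCT-II basis vectors, i.e. the discrete
    cosine eigenvectors of the 1D Neumann Laplacian."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * i * (j + 0.5) / n)
    C[0, :] = np.sqrt(1.0 / n)     # constant (kernel) mode
    return C

def neumann_poisson_solve_1d(b):
    """Solve -Lap(Phi) = b for the 3-point Neumann Laplacian on n cells of
    size 1/n.  b must have zero mean (solvability); the zero-mean solution
    is returned, which fixes the additive constant as in (3.31)."""
    n = b.size
    C = dct_matrix(n)
    # Eigenvalues of -Lap: 4 n^2 sin^2(pi*i/(2n)), i = 0, ..., n-1.
    lam = 4.0 * n**2 * np.sin(np.pi * np.arange(n) / (2 * n)) ** 2
    bhat = C @ b                    # cosine transform of the right-hand side
    phihat = np.zeros(n)
    phihat[1:] = bhat[1:] / lam[1:]  # skip the zero eigenvalue (kernel mode)
    return C.T @ phihat              # inverse transform
```

In practice one would replace the dense matrix product by a fast DCT, giving the O(N log N) Poisson solve used in each projection step.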
We can also consider a first-discretize-then-optimize approach. In fact, with our proposed discretization for MFP, the two approaches lead to the same algorithm, as illustrated in Figure 3. This is mainly due to the consistency of the discrete operators discussed in Remark 3.2.
Fig. 3. Equivalent approaches to obtain the discrete Algorithm 3.2.
Based on the previous notations, we discretize the original problem (3.3) to

(3.33)  $\min_{(P,M)\in\mathcal{C}(P_D)} Y(P,M):=\sum_{j\in J^0} Y\big((A_0(P)+P_A)_j,\,A(M)_j,\,x_j\big),$

where the constraints are linear and the constraint set is convex:

(3.34)  $\mathcal{C}(P_D):=\big\{(P,M): D_0(P)+P_D+\mathrm{Div}(M)=\mathbf 0\big\}.$

To optimize the problem with FISTA, we first compute the gradient. For any $P,M$, we define the corresponding values on $J^0$ by $\bar P:=A_0(P)+P_A$, $\bar M:=A(M)$; then

(3.35)  $Y(P,M)=\sum_{j\in J^0} Y\big(\bar P_j,\bar M_j,x_j\big).$

We obtain (3.20) by taking partial derivatives with respect to $\bar P,\bar M$, and then (3.21) by the chain rule. Therefore the gradient descent step is exactly (3.25). For the projection, based on the inner product (3.29) and its induced norm, we can formulate the Lagrangian as

(3.36)  $\mathcal{L}(P,M,\Phi):=\tfrac12\big\|(P,M)-\big(P^{(k+\frac12)},M^{(k+\frac12)}\big)\big\|_F^2+\big\langle \Phi,\,\mathrm{Div}(P,M)+P_D\big\rangle.$

Because of the consistency (3.30) of the discrete operators, (3.26)-(3.27) compute exactly the projection onto $\mathcal{C}(P_D)$. Therefore the FISTA algorithm applied to the discrete MFP problem (3.33) is exactly Algorithm 3.2.
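Algorithm 3.2 is FISTA specialized to a linearly constrained problem: a gradient step, an exact Euclidean projection onto an affine set, and the usual momentum update. The toy sketch below applies the same three-step pattern to a generic quadratic objective with a random affine constraint standing in for the continuity-equation constraint; the projection here uses a dense normal-equations solve rather than the DCT:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 8))   # toy constraint A x = c
c = rng.standard_normal(3)
b = rng.standard_normal(8)

AAt_inv = np.linalg.inv(A @ A.T)

def project(x):                   # Euclidean projection onto {x : A x = c}
    return x - A.T @ (AAt_inv @ (A @ x - c))

x = x_hat = np.zeros(8)
tau, eta = 1.0, 0.5               # step size <= 1/L for f(x) = 0.5*||x - b||^2
for k in range(5000):
    x_new = project(x_hat - eta * (x_hat - b))       # gradient step + projection
    tau_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * tau**2))
    omega = (tau - 1.0) / tau_new                    # FISTA momentum weight
    x_hat = (1.0 + omega) * x_new - omega * x        # extrapolation step
    x, tau = x_new, tau_new

x_star = project(b)               # closed-form minimizer of this toy problem
assert np.allclose(A @ x, c, atol=1e-8)              # feasibility is exact
assert np.linalg.norm(x - x_star) < 1e-2             # convergence to the minimizer
```

Because the projection is exact, every iterate is feasible, which is the mechanism behind the mass-preservation property of the full algorithm.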
4. Convergence.
One major difference between Algorithm 3.1 and Algorithm 3.2 is that the former is formulated in the continuous setting while the latter operates on a given discrete mesh, although both algorithms provide convergent sequences according to the FISTA theory. It is natural to ask whether the discrete solution converges to the continuous solution as the mesh size $\Delta_d$ goes to zero. Specifically, with a given step-size sequence $\{\eta^{(k)}\}_k$, let the sequences $\{(\widehat\rho^{(k)},\widehat m^{(k)})\}_k$, $\{(\rho^{(k)},m^{(k)}),(\rho^{(k+\frac12)},m^{(k+\frac12)})\}_k$ be obtained from Algorithm 3.1, and $\{(\widehat P^{(k)},\widehat M^{(k)})\}_k$, $\{(P^{(k)},M^{(k)}),(P^{(k+\frac12)},M^{(k+\frac12)})\}_k$ from Algorithm 3.2. If $(\rho^{(k)},m^{(k)})\to(\rho^*,m^*)$ and $(P^{(k)},M^{(k)})\to(P^*,M^*)$ as $k\to\infty$, we would like to check whether $(P^*,M^*)$ converges to $(\rho^*,m^*)$ as the mesh size goes to zero. In this section, we theoretically analyze and provide a positive answer to this question. To the best of our knowledge, this is the first time the discretization error between the optimization path of the continuous variational MFP and its discretized counterpart has been examined.

We first introduce some notation. With a given step-size sequence $\{\eta^{(k)}\}_k$, let $\{(\widehat\rho^{(k)},\widehat m^{(k)})\}_k$, $\{(\rho^{(k)},m^{(k)}),(\rho^{(k+\frac12)},m^{(k+\frac12)})\}_k$ be obtained from Algorithm 3.1. With the same step-size sequence and initialization $P^{(0)}=\widehat P^{(0)}=\rho^{(0)}_{J_0}$, $M^{(0)}=\widehat M^{(0)}=m^{(0)}_J$, let $\{(\widehat P^{(k)},\widehat M^{(k)})\}_k$, $\{(P^{(k)},M^{(k)}),(P^{(k+\frac12)},M^{(k+\frac12)})\}_k$ be obtained from Algorithm 3.2.
For the index sets $J_d$, $J^0$, we write the continuous functions $\rho$ and $m$ evaluated on the corresponding discrete grids as

$\rho_{J_0}:=\{\rho_j\}_{j\in J_0},\quad (m_d)_{J_d}:=\{(m_d)_j\}_{j\in J_d},\quad m_J:=\{(m_d)_{J_d}\}_{d=1,2,\cdots,D},$
$\rho_{J^0}:=\{\rho_j\}_{j\in J^0},\quad (m_d)_{J^0}:=\{(m_d)_j\}_{j\in J^0},\quad m_{J^0}:=\{(m_d)_{J^0}\}_{d=1,2,\cdots,D}.$

Let $m_0\equiv\rho$ and $M_0\equiv P$. For any $k=0,\tfrac12,1,\tfrac32,\cdots$, we define the error on the grid points $J_d$ by

$E^{(k)}_d:=M^{(k)}_d-\big(m^{(k)}_d\big)_{J_d},\quad d=0,\cdots,D,\qquad E^{(k)}_m:=\big\{E^{(k)}_d\big\}_{d=1,\cdots,D},\qquad E^{(k)}:=\big\{E^{(k)}_0,E^{(k)}_m\big\}.$

Similarly, for $k=0,1,2,\cdots$, we define

$\widehat E^{(k)}_d:=\widehat M^{(k)}_d-\big(\widehat m^{(k)}_d\big)_{J_d},\quad d=0,\cdots,D,\qquad \widehat E^{(k)}_m:=\big\{\widehat E^{(k)}_d\big\}_{d=1,\cdots,D},\qquad \widehat E^{(k)}:=\big\{\widehat E^{(k)}_0,\widehat E^{(k)}_m\big\},$
$E^{(k)}_\phi:=\Phi^{(k)}-\phi^{(k)}_{J^0}.$

Recall that in (3.29) we introduced the induced norm $\|\cdot\|_F$ on the spaces $\mathcal{V}_0\times\mathcal{V}_1\times\cdots\times\mathcal{V}_D$ and $\mathcal{V}$:

$\|E\|_F^2:=\sum_{d=0}^D\sum_{j\in J_d}(E_d)_j^2,\qquad \|E_\phi\|_F^2:=\sum_{j\in J^0}(E_\phi)_j^2.$

We here define the 2-norm $\|\cdot\|_2$ as $\|\cdot\|_2:=\big(\prod_{d=0}^D\Delta_d\big)^{1/2}\,\|\cdot\|_F$.

Next we propose several assumptions before stating the main theorem.
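The scaled norm $\|\cdot\|_2=(\prod_d\Delta_d)^{1/2}\|\cdot\|_F$ is the discrete analogue of the $L^2$ norm: on grid samples of a smooth function it is mesh-independent and converges to the continuous $L^2$ norm. A quick 1D check with $f(x)=\sin(2\pi x)$, whose $L^2$ norm on $[0,1]$ is $1/\sqrt{2}$:

```python
import numpy as np

for n in (8, 64, 512):
    x = (np.arange(n) + 0.5) / n          # cell centers, Delta = 1/n
    F = np.sin(2 * np.pi * x)             # grid samples of f
    norm_F = np.linalg.norm(F)            # ||F||_F grows like sqrt(n)
    norm_2 = np.sqrt(1.0 / n) * norm_F    # scaled 2-norm is mesh-independent
    assert abs(norm_2 - np.sqrt(0.5)) < 1e-6
```

Without the $(\prod_d\Delta_d)^{1/2}$ factor, the error bounds of this section would not be comparable across mesh sizes.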
Assumption 4.1. Let $\rho^0,\rho^1$ be the given initial and terminal densities. With the above notations, we assume the following conditions hold for any $k=0,1,\cdots$:
1. $\rho^0,\rho^1,\rho^{(k)},m^{(k)},\rho^{(k+\frac12)},m^{(k+\frac12)},\widehat\rho^{(k)},\widehat m^{(k)}$ are $C^2$;
2. there exist $0<\underline\rho\le\overline\rho$ such that $\widehat\rho^{(k)}(t,x),\,\widehat P^{(k)}_j\in[\underline\rho,\overline\rho]$;
3. $\widehat m^{(k)}(t,x),\,\widehat M^{(k)}_j\in\Omega_m\subset\mathbb{R}^D$ for a bounded set $\Omega_m$;
4. the $Y'_d$'s are $C_Y$-Lipschitz continuous on $[\underline\rho,\overline\rho]\times\Omega_m\times[0,1]^{D+1}$, i.e., for $d=0,\cdots,D$ and any $(\beta_0^1,\beta^1,x^1),(\beta_0^2,\beta^2,x^2)\in[\underline\rho,\overline\rho]\times\Omega_m\times[0,1]^{D+1}$,

(4.1)  $\big|Y'_d(\beta_0^1,\beta^1,x^1)-Y'_d(\beta_0^2,\beta^2,x^2)\big|\le C_Y\big\|(\beta_0^1,\beta^1,x^1)-(\beta_0^2,\beta^2,x^2)\big\|_2.$

Remark 4.2. If $\rho^0,\rho^1,\rho^{(0)},m^{(0)}$ are $C^2$ and $Y$ is sufficiently smooth, one can show that assumption 1 holds by induction on $k$. Assumptions 2 and 3 are true as long as $\{\rho^{(k)},m^{(k)}\}$ and $\{P^{(k)},M^{(k)}\}$ converge. With the typical choice $Y(\beta_0,\beta,x)=L(\beta_0,\beta)$, where $L$ is defined in (2.3), we retrieve the optimal transport problem. Both (3.3) and (3.33) then have unique minimizers $\{\rho^*,m^*\}$ and $\{P^*,M^*\}$ and both algorithms converge. If in addition $\rho^0,\rho^1$ are $C^2$ and $\min_x\{\rho^0(x),\rho^1(x)\}\ge\underline\rho>0$, then Assumption 4.1 holds with a continuous initialization and carefully chosen step sizes.

We now state our main theorem, which characterizes the error bound with respect to the grid size.
Theorem 4.3. If Assumption 4.1 holds for $k=0,1,\cdots$, then

(4.2)  $\big\|E^{(k)}\big\|_2\le C\bigg(\sum_{d=0}^D\Delta_d\bigg)=O\bigg(\sum_{d=0}^D\Delta_d\bigg).$

Here $C$ is a constant depending on the dimension $D$, the Lipschitz constant $C_Y$, the step sizes $\{\eta^{(s)}\}_{s=1,\cdots,k}$ and the sequences $\{\widehat\rho^{(s)},\widehat m^{(s)}\}_{s=1,\cdots,k}$, but it is independent of $\{\Delta_d\}_{d=0,\cdots,D}$.

Note that the above theorem analyzes error bounds at each iteration along the optimization paths of the continuous setup and its discretized counterpart. Consequently, we have the following convergence result if both sequences from the continuous and discretized optimization converge (i.e., the choice of step size satisfies the convergence conditions used in FISTA [10]).
Corollary 4.4. Suppose that $\{(P^{(k)},M^{(k)})\}_k$ and $\{(\rho^{(k)},m^{(k)})\}_k$ satisfy all conditions in Theorem 4.3. If, in addition, there exist $(P^*,M^*)$ and $(\rho^*,m^*)$ such that $\rho^*\in C^2$, $m^*_d\in C^2$ and

(4.3)  $\lim_{k\to\infty}\big\|(P^{(k)},M^{(k)})-(P^*,M^*)\big\|_2=0,\qquad \lim_{k\to\infty}\big\|(\rho^{(k)},m^{(k)})-(\rho^*,m^*)\big\|_{L^2}=0,$

where $\|\cdot\|_{L^2}$ denotes the standard $L^2$-norm in function space, then with $\Delta:=\max_{d=0,\cdots,D}\Delta_d$,

(4.4)  $\lim_{\Delta\to0}\|E^*\|_2:=\lim_{\Delta\to0}\big\|(P^*,M^*)-(\rho^*_J,m^*_J)\big\|_2=0.$

Proof.
By the triangle inequality,

$\big\|(P^*,M^*)-(\rho^*_J,m^*_J)\big\|_2\le\big\|(P^{(k)},M^{(k)})-(P^*,M^*)\big\|_2+\big\|E^{(k)}\big\|_2+\big\|(\rho^{(k)}_J,m^{(k)}_J)-(\rho^*_J,m^*_J)\big\|_2.$

For any $\epsilon>0$, there exists $k_\epsilon$ such that

(4.5)  $\big\|(P^{(k_\epsilon)},M^{(k_\epsilon)})-(P^*,M^*)\big\|_2\le\frac{\epsilon}{3},\qquad \big\|(\rho^{(k_\epsilon)},m^{(k_\epsilon)})-(\rho^*,m^*)\big\|_{L^2}\le\frac{\epsilon}{3}.$

Since the discrete 2-norm is a quadrature approximation of the $L^2$-norm, there exists a constant $C_1$ depending on $D,\rho^{(k_\epsilon)},m^{(k_\epsilon)},\rho^*,m^*$ and independent of $\Delta_d$ such that

(4.6)  $\big\|(\rho^{(k_\epsilon)}_J,m^{(k_\epsilon)}_J)-(\rho^*_J,m^*_J)\big\|_2\le\bigg(\int_0^1\!\!\int_\Omega\big|(\rho^{(k_\epsilon)},m^{(k_\epsilon)})-(\rho^*,m^*)\big|^2\,\mathrm{d}x\,\mathrm{d}t\bigg)^{1/2}+C_1\sum_{d=0}^D\Delta_d.$

By Theorem 4.3, there exists a constant $C_2$ independent of $\Delta_d$ such that $\|E^{(k_\epsilon)}\|_2\le C_2\sum_{d=0}^D\Delta_d$. Let $\delta=\dfrac{\epsilon}{3(D+1)(|C_1|+|C_2|)}$. Then for any $\Delta_d$ satisfying $\max_{d=0,\cdots,D}\Delta_d\le\delta$, we have

(4.7)  $\bigg|C_1\sum_{d=0}^D\Delta_d\bigg|+\bigg|C_2\sum_{d=0}^D\Delta_d\bigg|\le\frac{\epsilon}{3}.$

Combining (4.5), (4.6) and (4.7), we conclude that for any $\epsilon>0$ there exists $\delta>0$ such that $\|E^*\|_2\le\epsilon$ whenever $\Delta\le\delta$.

To prove Theorem 4.3, we establish three lemmas that analyze the error introduced in each main step of the algorithm. The proof of Theorem 4.3 then follows by induction.
Lemma 4.5. If Assumption 4.1 holds for $k=0,1,\cdots$, then

(4.8)  $\big\|E^{(k+\frac12)}\big\|_2\le C_1\big(D,C_Y,\eta^{(k)}\big)\big\|\widehat E^{(k)}\big\|_2+C_2\big(\widehat\rho^{(k)},\widehat m^{(k)}\big)\bigg(\sum_{d=0}^D\Delta_d\bigg)+O\bigg(\sum_{d=0}^D\Delta_d^2\bigg).$

Proof.
By the definition of $E^{(k+\frac12)}_d$, we substitute the discrete variables in (3.25) by the sums of errors and continuous values. This leads to

$\big(E^{(k+\frac12)}_d\big)_j+\big(m^{(k+\frac12)}_d\big)_j=\big(\widehat E^{(k)}_d\big)_j+\big(\widehat m^{(k)}_d\big)_j-\eta^{(k)}\big(\partial_{M_d}Y(\widehat P^{(k)},\widehat M^{(k)})\big)_j.$

From (3.14), we have

$\big(m^{(k+\frac12)}_d\big)_j=\big(\widehat m^{(k)}_d\big)_j-\eta^{(k)}\,Y'_d\big(\widehat\rho^{(k)}_j,\widehat m^{(k)}_j,x_j\big).$

Combining the above gives

$\big(E^{(k+\frac12)}_d\big)_j=\big(\widehat E^{(k)}_d\big)_j-\eta^{(k)}\Big[\big(\partial_{M_d}Y(\widehat P^{(k)},\widehat M^{(k)})\big)_j-Y'_d\big(\widehat\rho^{(k)}_j,\widehat m^{(k)}_j,x_j\big)\Big],$

and therefore the norm estimate

(4.9)  $\big\|E^{(k+\frac12)}\big\|_F\le\big\|\widehat E^{(k)}\big\|_F+\eta^{(k)}\bigg(\sum_{d=0}^D\sum_{j\in J_d}\Big|\big(\partial_{M_d}Y(\widehat P^{(k)},\widehat M^{(k)})\big)_j-Y'_d\big(\widehat\rho^{(k)}_j,\widehat m^{(k)}_j,x_j\big)\Big|^2\bigg)^{1/2}.$

For any $j\in J_d$, the definition of $(\partial_{M_d}Y(\widehat P^{(k)},\widehat M^{(k)}))_j$ in (3.21) and the Lipschitz continuity of $Y'_d$ yield

(4.10)  $\Big|\big(\partial_{M_d}Y(\widehat P^{(k)},\widehat M^{(k)})\big)_j-Y'_d\big(\widehat\rho^{(k)}_j,\widehat m^{(k)}_j,x_j\big)\Big|\le\frac{C_Y}{2}\sum_{\pm}\Big\|\big(\bar{\widehat P}^{(k)}_{j\pm\frac{e_d}{2}},\bar{\widehat M}^{(k)}_{j\pm\frac{e_d}{2}},x_{j\pm\frac{e_d}{2}}\big)-\big(\widehat\rho^{(k)}_j,\widehat m^{(k)}_j,x_j\big)\Big\|_2.$

Note that $\bar{\widehat P}^{(k)}_j$ and $(\bar{\widehat M}^{(k)}_d)_j$ can be written as sums of errors and continuous values:

$\bar{\widehat P}^{(k)}_j=\big(A_0(\widehat E^{(k)}_0)\big)_j+\widehat\rho^{(k)}_j+O(\Delta_0^2),\qquad \big(\bar{\widehat M}^{(k)}_d\big)_j=\big(A_d(\widehat E^{(k)}_d)\big)_j+\big(\widehat m^{(k)}_d\big)_j+O(\Delta_d^2),$

where the remainders are obtained by applying Taylor expansion to $\widehat\rho^{(k)}$ and $\widehat m^{(k)}_d$. We further have

(4.11)  $\Big\|\big(\bar{\widehat P}^{(k)}_{j\pm\frac{e_d}{2}},\bar{\widehat M}^{(k)}_{j\pm\frac{e_d}{2}},x_{j\pm\frac{e_d}{2}}\big)-\big(\widehat\rho^{(k)}_j,\widehat m^{(k)}_j,x_j\big)\Big\|_2\le\frac12\sum_{d'=0}^D\Big|\big(\widehat E^{(k)}_{d'}\big)_j+\big(\widehat E^{(k)}_{d'}\big)_{j\pm e_d}\Big|+C_2\big(\widehat\rho^{(k)},\widehat m^{(k)}\big)\Delta_d+O(\Delta_d^2),$

where $C_2(\widehat\rho^{(k)},\widehat m^{(k)})$ is determined by $\max\big\{\big|\partial_{x_d}\widehat m^{(k)}_{d'}(t,x)\big|:d'=0,\cdots,D\big\}$ (with $\widehat m^{(k)}_0=\widehat\rho^{(k)}$). Combining (4.10) and (4.11), summing over $j\in J_d$ and $d=0,\cdots,D$, and using $\prod_{d=0}^D n_d\cdot\prod_{d=0}^D\Delta_d=1$ when converting $\|\cdot\|_F$ into $\|\cdot\|_2$, the estimate (4.9) becomes

(4.12)  $\big\|E^{(k+\frac12)}\big\|_2\le C_1\big(D,C_Y,\eta^{(k)}\big)\big\|\widehat E^{(k)}\big\|_2+C_2\big(\widehat\rho^{(k)},\widehat m^{(k)}\big)\bigg(\sum_{d=0}^D\Delta_d\bigg)+O\bigg(\sum_{d=0}^D\Delta_d^2\bigg).$

This proves the lemma.

Next we examine the error introduced in the projection step.
Lemma 4.6. Suppose that $\rho^0,\rho^1,\rho^{(k+\frac12)},m^{(k+\frac12)}$ are $C^2$. Then

$\big\|E^{(k+1)}\big\|_2\le\big\|E^{(k+\frac12)}\big\|_2+O\bigg(\sum_{d=0}^D\Delta_d^2\bigg).$

Proof.
By the definition of the error terms and (3.26), we have

(4.13)  $-\mathrm{Lap}\big(E^{(k+1)}_\phi+\phi^{(k+1)}_{J^0}\big)=\mathrm{Div}\big(E^{(k+\frac12)}+\big(\rho^{(k+\frac12)}_J,m^{(k+\frac12)}_J\big)\big)+P_D.$

Since $(\phi^{(k+1)},\rho^{(k+\frac12)},m^{(k+\frac12)})$ satisfies (3.15) and $\rho^{(k+\frac12)},m^{(k+\frac12)}$ are $C^2$, Taylor expansion gives

(4.14)  $-\mathrm{Lap}\big(\phi^{(k+1)}_{J^0}\big)=\mathrm{Div}\big(\rho^{(k+\frac12)}_J,m^{(k+\frac12)}_J\big)+P_D+O\bigg(\sum_{d=0}^D\Delta_d^2\bigg).$

Here $O\big(\sum_d\Delta_d^2\big)\in\mathcal{V}$ denotes a term whose entries are all of size $O\big(\sum_d\Delta_d^2\big)$. Combining (4.13) and (4.14) gives

(4.15)  $-\mathrm{Lap}\big(E^{(k+1)}_\phi\big)=\mathrm{Div}\big(E^{(k+\frac12)}\big)+O\bigg(\sum_{d=0}^D\Delta_d^2\bigg).$

Similarly, the second projection step (3.27) on the discrete mesh gives

(4.16)  $E^{(k+1)}+\big(\rho^{(k+1)}_J,m^{(k+1)}_J\big)=E^{(k+\frac12)}+\mathrm{Grad}\big(E^{(k+1)}_\phi\big)+\big(\rho^{(k+\frac12)}_J,m^{(k+\frac12)}_J\big)+\mathrm{Grad}\big(\phi^{(k+1)}_{J^0}\big),$

while in the continuous setting (3.16) gives

(4.17)  $\big(\rho^{(k+1)}_J,m^{(k+1)}_J\big)=\big(\rho^{(k+\frac12)}_J,m^{(k+\frac12)}_J\big)+\mathrm{Grad}\big(\phi^{(k+1)}_{J^0}\big)+O\bigg(\sum_{d=0}^D\Delta_d^2\bigg).$

Thus we have

(4.18)  $E^{(k+1)}=E^{(k+\frac12)}+\mathrm{Grad}\big(E^{(k+1)}_\phi\big)+O\bigg(\sum_{d=0}^D\Delta_d^2\bigg),$

where now $O\big(\sum_d\Delta_d^2\big)\in\mathcal{V}_0\times\mathcal{V}_1\times\cdots\times\mathcal{V}_D$ again denotes an entry-wise bound. Combining (4.15) and (4.18), we obtain

(4.19)  $E^{(k+1)}=\big(\mathrm{Id}-\mathrm{Grad}\circ\mathrm{Lap}^{-1}\circ\mathrm{Div}\big)E^{(k+\frac12)}-\mathrm{Grad}\circ\mathrm{Lap}^{-1}\,O\bigg(\sum_{d=0}^D\Delta_d^2\bigg)+O\bigg(\sum_{d=0}^D\Delta_d^2\bigg).$

Claim: $\|\mathrm{Grad}\circ\mathrm{Lap}^{-1}\circ\mathrm{Div}\|\le1$ and $\|\mathrm{Grad}\circ\mathrm{Lap}^{-1}\|\le\tfrac12$. Since $\mathrm{Grad}\circ\mathrm{Lap}^{-1}\circ\mathrm{Div}$ is an orthogonal projection, $\mathrm{Id}-\mathrm{Grad}\circ\mathrm{Lap}^{-1}\circ\mathrm{Div}$ also has norm at most $1$, and therefore

$\big\|E^{(k+1)}\big\|_2\le\big\|E^{(k+\frac12)}\big\|_2+\frac12\bigg\|O\bigg(\sum_{d=0}^D\Delta_d^2\bigg)\bigg\|_2+\bigg\|O\bigg(\sum_{d=0}^D\Delta_d^2\bigg)\bigg\|_2\le\big\|E^{(k+\frac12)}\big\|_2+O\bigg(\sum_{d=0}^D\Delta_d^2\bigg).$

Proof of claim: Recall that $\mathrm{Lap}(\Psi^i)=\lambda_i\Psi^i$, $i\in J^0$, where

$\lambda_i=-\sum_{d=0}^D4n_d^2\sin^2\bigg(\frac{(i_d-1)\pi}{2n_d}\bigg),$

and $\{\Psi^i\}_{i\in J^0}$ forms an orthonormal basis of $(\mathcal{V},\|\cdot\|_F)$. For $d=0,1,\cdots,D$ and $i\in J^0$ with $i_d\neq1$, let $\sigma_{d,i}\in\mathbb{R}$ and $\Psi^{d,i}\in\mathcal{V}_0\times\mathcal{V}_1\times\cdots\times\mathcal{V}_D$ be

$\sigma_{d,i}=-2n_d\sin\bigg(\frac{(i_d-1)\pi}{2n_d}\bigg),\qquad \Psi^{d,i}=\big\{\Psi^{d,i}_{d'}\big\}_{d'=0,1,\cdots,D},\quad \Psi^{d,i}_d=\frac{1}{\sigma_{d,i}}\mathcal{D}^*_d\big(\Psi^i\big),\quad \Psi^{d,i}_{d'}=\mathbf0\ \text{for }d'\neq d.$

Then $\langle\Psi^{d,i},\Psi^{d',i'}\rangle=0$ if $d\neq d'$, and

$\big\langle\Psi^{d,i},\Psi^{d,i'}\big\rangle=\frac{1}{\sigma_{d,i}\sigma_{d,i'}}\big\langle\mathcal{D}^*_d(\Psi^i),\mathcal{D}^*_d(\Psi^{i'})\big\rangle=\frac{\sigma_{d,i'}}{\sigma_{d,i}}\big\langle\Psi^i,\Psi^{i'}\big\rangle=\begin{cases}0,&i\neq i',\\1,&i=i',\end{cases}$

so $\{\Psi^{d,i}\}$ forms an orthonormal basis of $(\mathcal{V}_0\times\mathcal{V}_1\times\cdots\times\mathcal{V}_D,\|\cdot\|_F)$. Since $\|\cdot\|_2=(\prod_d\Delta_d)^{1/2}\|\cdot\|_F$ rescales both spaces by the same factor, we may compute the operator norms in these $\|\cdot\|_F$-orthonormal bases. For any basis element $\Psi^i\in\mathcal{V}$,

$\mathrm{Grad}\circ\mathrm{Lap}^{-1}\big(\Psi^{\mathbf1}\big)=\mathbf0,\qquad \mathrm{Grad}\circ\mathrm{Lap}^{-1}\big(\Psi^i\big)=\sum_{d=0,\,i_d\neq1}^D\frac{\sigma_{d,i}}{\lambda_i}\Psi^{d,i},\quad i\neq\mathbf1;$

thus, using $|\lambda_i|=\sum_{d:\,i_d\neq1}\sigma_{d,i}^2$ and $|\lambda_i|\ge8$ when $n_d>1$, $d=0,\cdots,D$, we have

$\big\|\mathrm{Grad}\circ\mathrm{Lap}^{-1}\big\|^2\le\max_{i\in J^0\setminus\{\mathbf1\}}\frac{1}{\lambda_i^2}\sum_{d=0,\,i_d\neq1}^D\sigma_{d,i}^2=\max_{i\in J^0\setminus\{\mathbf1\}}\frac{1}{|\lambda_i|}\le\frac14.$

And for any basis element $\Psi^{d,i}$,

$\mathrm{Grad}\circ\mathrm{Lap}^{-1}\circ\mathrm{Div}\big(\Psi^{d,i}\big)=\mathrm{Grad}\circ\mathrm{Lap}^{-1}\bigg(\frac{1}{\sigma_{d,i}}\mathcal{D}_d\circ\mathcal{D}^*_d\big(\Psi^i\big)\bigg)=\mathrm{Grad}\circ\mathrm{Lap}^{-1}\big(-\sigma_{d,i}\Psi^i\big)=-\sum_{d'=0,\,i_{d'}\neq1}^D\frac{\sigma_{d,i}\sigma_{d',i}}{\lambda_i}\Psi^{d',i};$

therefore

$\big\|\mathrm{Grad}\circ\mathrm{Lap}^{-1}\circ\mathrm{Div}\big\|^2\le\max_{d=0,1,\cdots,D}\ \max_{i\in J^0,\,i_d\neq1}\frac{\sigma_{d,i}^2}{\lambda_i^2}\sum_{d'=0,\,i_{d'}\neq1}^D\sigma_{d',i}^2=\max_{d=0,1,\cdots,D}\ \max_{i\in J^0,\,i_d\neq1}\frac{\sigma_{d,i}^2}{|\lambda_i|}\le1.$

The claim, and thus the lemma, are proved.

The last step is to estimate the error introduced in the linear interpolation steps (3.17) and (3.28).
Lemma 4.7. With the above notations,

$\big\|\widehat E^{(k+1)}\big\|_2\le\big|1+\omega^{(k)}\big|\,\big\|E^{(k+1)}\big\|_2+\big|\omega^{(k)}\big|\,\big\|E^{(k)}\big\|_2.$
Proof.
By the definition of the error terms and (3.28),

$\widehat E^{(k+1)}_d+\big(\widehat m^{(k+1)}_d\big)_{J_d}=\big(1+\omega^{(k)}\big)\Big(E^{(k+1)}_d+\big(m^{(k+1)}_d\big)_{J_d}\Big)-\omega^{(k)}\Big(E^{(k)}_d+\big(m^{(k)}_d\big)_{J_d}\Big),$

and by (3.17),

$\big(\widehat m^{(k+1)}_d\big)_{J_d}=\big(1+\omega^{(k)}\big)\big(m^{(k+1)}_d\big)_{J_d}-\omega^{(k)}\big(m^{(k)}_d\big)_{J_d}.$

Therefore we have

$\widehat E^{(k+1)}=\big(1+\omega^{(k)}\big)E^{(k+1)}-\omega^{(k)}E^{(k)},$

and the lemma follows from the triangle inequality.

With Lemma 4.5-Lemma 4.7, we can show Theorem 4.3 by induction.
Proof of Theorem 4.3:
Proof.
We first restate the results of Lemma 4.5-Lemma 4.7:

$\big\|E^{(k+1)}\big\|_2\le\big\|E^{(k+\frac12)}\big\|_2+O\bigg(\sum_{d=0}^D\Delta_d^2\bigg),$
$\big\|E^{(k+\frac12)}\big\|_2\le C_1\big(D,C_Y,\eta^{(k)}\big)\big\|\widehat E^{(k)}\big\|_2+C_2\big(\widehat\rho^{(k)},\widehat m^{(k)}\big)\bigg(\sum_{d=0}^D\Delta_d\bigg)+O\bigg(\sum_{d=0}^D\Delta_d^2\bigg),$
$\big\|\widehat E^{(k)}\big\|_2\le\big|1+\omega^{(k-1)}\big|\,\big\|E^{(k)}\big\|_2+\big|\omega^{(k-1)}\big|\,\big\|E^{(k-1)}\big\|_2.$

From these, we obtain

(4.20)  $\big\|E^{(1)}\big\|_2\le C\big\|\widehat E^{(0)}\big\|_2+C\sum_{d=0}^D\Delta_d+O\bigg(\sum_{d=0}^D\Delta_d^2\bigg),$

(4.21)  $\big\|E^{(k+1)}\big\|_2\le C\big\|E^{(k)}\big\|_2+C\big\|E^{(k-1)}\big\|_2+C\sum_{d=0}^D\Delta_d+O\bigg(\sum_{d=0}^D\Delta_d^2\bigg),\quad k\ge1,$

where $C$ depends on $D,C_Y,\eta^{(k)},\rho^{(k+\frac12)},m^{(k+\frac12)},\widehat\rho^{(k)},\widehat m^{(k)}$. The initialization gives

$\big\|E^{(0)}\big\|_2=0,\qquad \big\|\widehat E^{(0)}\big\|_2=0.$

Then, based on (4.20) and (4.21), it is straightforward to show (4.2) by applying induction on $k$.
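The operator-norm claim used in the proof of Lemma 4.6 can also be checked numerically. The sketch below uses simplified 1D zero-flux operators in place of the full space-time ones: $\mathrm{Grad}\circ\mathrm{Lap}^{-1}\circ\mathrm{Div}$ comes out as an orthogonal projection (norm exactly $1$), and $\|\mathrm{Grad}\circ\mathrm{Lap}^{-1}\|\approx1/\sqrt{|\lambda_{\min}|}$ is well below $1/2$:

```python
import numpy as np

n, h = 32, 1.0 / 32
G = np.zeros((n - 1, n))                  # cell-centered values -> interior faces
for j in range(n - 1):
    G[j, j], G[j, j + 1] = -1.0 / h, 1.0 / h
Dv = -G.T                                 # discrete divergence, adjoint of -Grad
L = Dv @ G                                # zero-flux Laplacian, ker(L) = constants

L_pinv = np.linalg.pinv(L)                # Lap^{-1} on the complement of the kernel
P_op = G @ L_pinv @ Dv                    # Grad o Lap^{-1} o Div
assert np.linalg.norm(P_op, 2) <= 1.0 + 1e-10
assert np.allclose(P_op @ P_op, P_op)     # it is in fact an orthogonal projection
assert np.linalg.norm(G @ L_pinv, 2) <= 0.5
```

The smallest nonzero eigenvalue of the scaled zero-flux Laplacian is $4n^2\sin^2(\pi/2n)\ge8$, which is the source of the $1/2$ bound in the claim.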
5. Generalization and Acceleration.
In this section, we generalize the proposed algorithm to solve potential MFG problems. Moreover, we also discuss how to use multilevel and multigrid strategies to speed up our algorithm.
To apply FISTA to MFG, we follow a first-discretize-then-optimize approach. One crucial difference between MFG and MFP is whether $\rho(1,\cdot)$ is provided explicitly. For MFG, we consider the discretization illustrated in Figure 4 and modify our previous notations related to $\rho$.

Fig. 4. Illustration of the discretization (MFG): fixed values and variables.

The index set and discrete variable are now

$J_0:=\Big\{\tfrac32,\tfrac52,\cdots,n_0+\tfrac12\Big\},\qquad P:=\{P_j\}_{j\in J_0}\in\mathcal{V}_0:=\mathbb{R}^{n_0\times n_1\times\cdots\times n_D},$

and the discrete operators are

$A_0:\mathcal{V}_0\to\mathcal{V}^0,\quad P\mapsto A_0(P),\quad (A_0(P))_j:=\begin{cases}\tfrac12P_{j+\frac{e_0}{2}},&j_0=1,\\ \tfrac12\big(P_{j+\frac{e_0}{2}}+P_{j-\frac{e_0}{2}}\big),&j_0=2,3,\cdots,n_0,\end{cases}$

$\mathcal{D}_0:\mathcal{V}_0\to\mathcal{V}^0,\quad P\mapsto\mathcal{D}_0(P),\quad (\mathcal{D}_0(P))_j:=\begin{cases}\dfrac{P_{j+\frac{e_0}{2}}}{\Delta_0},&j_0=1,\\ \dfrac{P_{j+\frac{e_0}{2}}-P_{j-\frac{e_0}{2}}}{\Delta_0},&j_0=2,3,\cdots,n_0.\end{cases}$

Since the boundary condition is only at $t=0$, we modify $P_A,P_D\in\mathcal{V}^0$ to

$(P_A)_j:=\begin{cases}\tfrac12\rho^0(x_j),&j_0=1,\\0,&j_0=2,\cdots,n_0,\end{cases}\qquad (P_D)_j:=\begin{cases}-\dfrac{\rho^0(x_j)}{\Delta_0},&j_0=1,\\0,&j_0=2,\cdots,n_0.\end{cases}$

Taking model (2.9) in section 2 as an example, the discrete problem can be formulated
as

(5.1)  $\min_{(P,M)\in\mathcal{C}_{MFG}(P_D)}Y_{MFG}(P,M):=\prod_{d=0}^D\Delta_d\sum_{j\in J^0}J_{MFG}\big((A_0(P)+P_A)_j,A(M)_j,x_j\big)+\lambda_G\sum_{j\in J_0,\,j_0=n_0+\frac12}P_j\,G(x_j),$

where

(5.2)  $J_{MFG}(\beta_0,\beta,x):=L(\beta_0,\beta)+\lambda_E\,\beta_0\log(\beta_0)+\lambda_Q\,\beta_0\,Q(x),$

(5.3)  $\mathcal{C}_{MFG}(P_D):=\big\{(P,M):\mathcal{D}_0(P)+P_D+\mathrm{Div}(M)=\mathbf0\big\}.$

Since this is an optimization problem with linear constraints, we apply FISTA to it, as detailed in Algorithm 5.1. In the algorithm, $A_0^\top,A^\top,\mathcal{D}_0^\top,\mathrm{Div}^\top$ are the adjoint operators of $A_0,A,\mathcal{D}_0,\mathrm{Div}$ with respect to the norm $\|\cdot\|_F$. Similar to what we discussed before, one has

$\partial_{\bar P}Y_{MFG}(P,M):=\Big\{\prod_{d=0}^D\Delta_d\,(J_{MFG})'_{\beta_0}\big((A_0(P)+P_A)_j,A(M)_j,x_j\big)\Big\}_{j\in J^0},$
$\partial_{\bar M}Y_{MFG}(P,M):=\Big\{\prod_{d=0}^D\Delta_d\,(J_{MFG})'_{\beta}\big((A_0(P)+P_A)_j,A(M)_j,x_j\big)\Big\}_{j\in J^0},$

and

$\big(\partial_PY_{MFG}(P,M)\big)_j:=\big(A_0^\top(\partial_{\bar P}Y_{MFG}(P,M))\big)_j,\quad j_0\neq n_0+\tfrac12,$
$\big(\partial_PY_{MFG}(P,M)\big)_j:=\big(A_0^\top(\partial_{\bar P}Y_{MFG}(P,M))\big)_j+\lambda_G\,G(x_j),\quad j_0=n_0+\tfrac12,$
$\partial_MY_{MFG}(P,M):=A^\top\big(\partial_{\bar M}Y_{MFG}(P,M)\big).$

Remark. Let $\mathrm{Lap}:=\mathcal{D}_0\circ\mathcal{D}_0^\top+\mathrm{Div}\circ\mathrm{Div}^\top$, where $\mathcal{D}_0$ is the modified operator defined in this section. Then the spectral decomposition of $\mathrm{Lap}$ is

$\mathrm{Lap}(\Phi)=\sum_{i\in J^0}\lambda_i\big\langle\Phi,\Psi^i\big\rangle\Psi^i,$

where $\lambda_i\in\mathbb{R}$, $\Psi^i\in\mathcal{V}$ are

$\lambda_i=-4n_0^2\sin^2\bigg(\frac{(i_0-\frac12)\pi}{2n_0+1}\bigg)-\sum_{d=1}^D4n_d^2\sin^2\bigg(\frac{(i_d-1)\pi}{2n_d}\bigg),$

$\Psi^i_j=\sqrt{\frac{2}{n_0+\frac12}}\cos\bigg(\Big(j_0-\frac12\Big)\frac{(i_0-\frac12)\pi}{n_0+\frac12}\bigg)\times\prod_{d=1}^D\sqrt{\frac{2-\delta_{i_d,1}}{n_d}}\cos\bigg(\Big(j_d-\frac12\Big)\frac{(i_d-1)\pi}{n_d}\bigg).$

Algorithm 5.1
FISTA for MFG

Parameters: $\rho^0$. Initialization: $\tau^{(1)}=1$, $P^{(0)}=\widehat P^{(0)}=\mathbf1$, and $M^{(0)}_d=\widehat M^{(0)}_d=\mathbf0$.

for $k=0,1,2,\dots$ do

  gradient descent:
  $P^{(k+\frac12)}=\widehat P^{(k)}-\eta^{(k)}\,\partial_PY_{MFG}\big(\widehat P^{(k)},\widehat M^{(k)}\big),\qquad M^{(k+\frac12)}=\widehat M^{(k)}-\eta^{(k)}\,\partial_MY_{MFG}\big(\widehat P^{(k)},\widehat M^{(k)}\big);$

  projection: solve $\Phi^{(k+1)}$ from

  (5.4)  $\big(\mathcal{D}_0\circ\mathcal{D}_0^\top+\mathrm{Div}\circ\mathrm{Div}^\top\big)\Phi^{(k+1)}=\mathcal{D}_0\big(P^{(k+\frac12)}\big)+P_D+\mathrm{Div}\big(M^{(k+\frac12)}\big),$

  and project $(P^{(k+\frac12)},M^{(k+\frac12)})$ onto $\mathcal{C}_{MFG}(P_D)$ by

  $P^{(k+1)}=P^{(k+\frac12)}-\mathcal{D}_0^\top\big(\Phi^{(k+1)}\big),\qquad M^{(k+1)}=M^{(k+\frac12)}-\mathrm{Div}^\top\big(\Phi^{(k+1)}\big);$

  update:
  $\tau^{(k+1)}=\dfrac{1+\sqrt{1+4(\tau^{(k)})^2}}{2},\quad \omega^{(k)}=\dfrac{\tau^{(k)}-1}{\tau^{(k+1)}},\quad \big(\widehat P^{(k+1)},\widehat M^{(k+1)}\big)=\big(1+\omega^{(k)}\big)\big(P^{(k+1)},M^{(k+1)}\big)-\omega^{(k)}\big(P^{(k)},M^{(k)}\big).$

end for

Note that $\lambda_i<0$ for all $i\in J^0$, so we can define $\mathrm{Lap}^{-1}(\Phi)$ by

(5.5)  $\mathrm{Lap}^{-1}(\Phi):=\sum_{i\in J^0}\frac{1}{\lambda_i}\big\langle\Phi,\Psi^i\big\rangle\Psi^i.$

Inspired by [17, 33], we borrow ideas from multigrid and multilevel methods in numerical PDEs for our variational problem. These methods reduce the computational cost on the finest level and thus accelerate the proposed algorithm. The implementation details are presented in this section. For notational simplicity, we assume $h=\Delta_0=\Delta_1=\cdots=\Delta_D$ in this section. Let $^h\Omega$ be a grid with spacing $h=\Delta_d$ and $^hJ_d$ be the corresponding index set $J_d$ on this grid. The index $^hj\in{}^hJ_d$ then stands for the point $h(j-\frac12)$; if there is no ambiguity, we omit the prescript of $j$. For example, we define $^hu_j=u\big(h(j-\frac12)\big)$ for any function $u$ and approximate this value by $^hU_j$. Consider $L$ levels of grids $^{h_1}\Omega,\dots,{}^{h_L}\Omega$, where the finest level is $^{h_1}\Omega$ and $h_l:=2^{l-1}h_1$. We first define how to prolongate values on a coarser grid to a finer grid. Assume that $^hj\in{}^hJ_d$ stands for the point $h(j-\frac12)$ on the finer grid $^h\Omega$; we define its
neighbourhood on the coarser grid $^{2h}\Omega$ as

(5.6)  ${}^{2h}_{\ h}N_j:=\bigg\{{}^{2h}i\in{}^{2h}J_d:\Big\|2h\big({}^{2h}i-\tfrac12\big)-h\big({}^hj-\tfrac12\big)\Big\|=\min_{{}^{2h}i'\in{}^{2h}J_d}\Big\|2h\big({}^{2h}i'-\tfrac12\big)-h\big({}^hj-\tfrac12\big)\Big\|\bigg\}.$

Then, with the boundary values

${}^hP_i={}^h(\rho^0)_i,\ i_0=\tfrac12;\qquad {}^hP_i={}^h(\rho^1)_i,\ i_0=\tfrac1h+\tfrac12;\qquad {}^h(M_d)_i=0,\ i_d=\tfrac12\ \text{or}\ \tfrac1h+\tfrac12,$

we define the prolongation $({}^hP,{}^hM)=\mathrm{Pro}({}^{2h}P,{}^{2h}M)$ by averaging values in the neighbourhoods:

(5.7)  ${}^hP_j:=\frac{1}{\big|{}^{2h}_{\ h}N_j\big|}\sum_{{}^{2h}i\in{}^{2h}_{\ h}N_j}{}^{2h}P_i,\quad\forall\,{}^hj\in{}^hJ_0;\qquad {}^h(M_d)_j:=\frac{1}{\big|{}^{2h}_{\ h}N_j\big|}\sum_{{}^{2h}i\in{}^{2h}_{\ h}N_j}{}^{2h}(M_d)_i,\quad\forall\,{}^hj\in{}^hJ_d.$

An example of prolongation in 1D is shown in the left panel of Figure 5.
Fig. 5 . Illustration of Prolongation (left) and Restriction (right) for 1D case.
From a finer grid to a coarser grid, the neighbourhood is defined inversely. Suppose $^{2h}i\in{}^{2h}J_d$; its neighbourhood is the set of all $^hj\in{}^hJ_d$ whose neighbourhood includes $^{2h}i$:

(5.8)  ${}^{h}_{2h}N_i:=\big\{{}^hj\in{}^hJ_d:{}^{2h}i\in{}^{2h}_{\ h}N_j\big\},$

and the restriction from the finer level to the coarser level $({}^{2h}P,{}^{2h}M)=\mathrm{Res}({}^hP,{}^hM)$ is defined by a weighted average over the neighbourhoods:

(5.9)  ${}^{2h}P_i:=\sum_{j\in{}^{h}_{2h}N_i}\frac{1}{\big|{}^{2h}_{\ h}N_j\big|}\,{}^hP_j\bigg/\sum_{j\in{}^{h}_{2h}N_i}\frac{1}{\big|{}^{2h}_{\ h}N_j\big|},\quad\forall\,{}^{2h}i\in{}^{2h}J_0;$
${}^{2h}(M_d)_i:=\sum_{j\in{}^{h}_{2h}N_i}\frac{1}{\big|{}^{2h}_{\ h}N_j\big|}\,{}^h(M_d)_j\bigg/\sum_{j\in{}^{h}_{2h}N_i}\frac{1}{\big|{}^{2h}_{\ h}N_j\big|},\quad\forall\,{}^{2h}i\in{}^{2h}J_d.$

An example of restriction in 1D is shown in the right panel of Figure 5. With the prolongation and restriction defined, we summarize the multigrid strategy in Algorithm 5.2. There, Algorithm 3.2$^K(\cdot)$ means running Algorithm 3.2 for $K$ iterations, and Algorithm 3.2$(\cdot)$ means running the algorithm until convergence. The first two inputs of Algorithm 3.2$(\cdot)$ are the initial and terminal densities $\rho^0,\rho^1$, and the last two inputs are the initializations $P^{(0)}=\widehat P^{(0)}$ and $M^{(0)}=\widehat M^{(0)}$.

Algorithm 5.2
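In 1D with factor-2 coarsening of cell centers, the neighbourhoods in (5.6) and (5.8) pair each coarse cell with its two fine sub-cells, so prolongation copies and restriction averages. A minimal sketch of this special case (the general definitions also handle staggered faces and boundaries):

```python
import numpy as np

def prolong(u_coarse):
    # each fine cell's nearest coarse center is its parent cell: copy the value
    return np.repeat(u_coarse, 2)

def restrict(u_fine):
    # weighted average over the two fine cells inside each coarse cell
    return 0.5 * (u_fine[0::2] + u_fine[1::2])

u = np.random.default_rng(2).standard_normal(16)
assert np.allclose(restrict(prolong(u)), u)        # Res o Pro = Id on the coarse grid
assert np.isclose(prolong(u).mean(), u.mean())     # cell averages preserve total mass
```

The mass-preservation check is the property that makes these transfer operators safe to use inside an algorithm whose iterates must satisfy a discrete continuity equation.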
Multigrid FISTA for MFP
Parameters: $L$, $h_l = 2^{l-1}h$, $K$, ${}^{h_l}\rho_0$, ${}^{h_l}\rho_1$ $(l = 1, \dots, L)$.
Initialization: ${}^{h_1}P^{(0)} = 0$, ${}^{h_1}M^{(0)} = 0$.
Pre-smoothing:
  $({}^{h_1}P, {}^{h_1}M) = $ Algorithm 3.2$^K\big(\rho_0, \rho_1, {}^{h_1}P^{(0)}, {}^{h_1}M^{(0)}\big)$
  for $l = 2, 3, \dots, L$ do
    $({}^{h_l}P, {}^{h_l}M) = $ Algorithm 3.2$^K\big(\rho_0, \rho_1, \mathrm{Res}({}^{h_{l-1}}P, {}^{h_{l-1}}M)\big)$
  end for
Correction and post-smoothing:
  $({}^{h_L}P, {}^{h_L}M) = $ Algorithm 3.2$\big(\rho_0, \rho_1, {}^{h_L}P, {}^{h_L}M\big)$
  for $l = L-1, L-2, \dots, 1$ do
    $({}^{h_l}P, {}^{h_l}M) = ({}^{h_l}P, {}^{h_l}M) + $ Algorithm 3.2$\big(\rho_0, \rho_1, \mathrm{Pro}({}^{h_{l+1}}P, {}^{h_{l+1}}M)\big) - \mathrm{Pro}({}^{h_{l+1}}P, {}^{h_{l+1}}M)$
  end for

Note that to keep the cost of Algorithm 5.2 low, $K$ should not be too large. Motivated by [33], we can remove the pre-smoothing steps entirely by setting $K = 0$, which leads to Algorithm 5.3: multilevel FISTA.

Algorithm 5.3
Multilevel FISTA for MFP
Parameters: $L$, $h_l = 2^{l-1}h$, ${}^{h_l}\rho_0$, ${}^{h_l}\rho_1$ $(l = 1, \dots, L)$.
Initialization: ${}^{h_L}P^{(0)} = 0$, ${}^{h_L}M^{(0)} = 0$.
$({}^{h_L}P, {}^{h_L}M) = $ Algorithm 3.2$\big(\rho_0, \rho_1, {}^{h_L}P^{(0)}, {}^{h_L}M^{(0)}\big)$
for $l = L-1, L-2, \dots, 1$ do
  $({}^{h_l}P, {}^{h_l}M) = $ Algorithm 3.2$\big(\rho_0, \rho_1, \mathrm{Pro}({}^{h_{l+1}}P, {}^{h_{l+1}}M)\big)$
end for

6. Numerical Experiments. In this section, we conduct comprehensive experiments to show the efficiency and effectiveness of the proposed numerical algorithms.
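To convey the structure of Algorithm 5.3 above, here is a self-contained toy version: a coarse-to-fine multilevel sweep where plain FISTA on a 1D Poisson model problem stands in for Algorithm 3.2, and linear interpolation stands in for $\mathrm{Pro}$. The model problem and all names here are illustrative assumptions, not the paper's MFP solver:

```python
import numpy as np

def laplacian(n):
    """1D Dirichlet Laplacian on n interior points of [0, 1], h = 1/(n+1)."""
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    return A * (n + 1) ** 2

def fista(A, b, x0, iters):
    """Plain FISTA on f(x) = 0.5 x'Ax - b'x (no proximal term), standing in
    for Algorithm 3.2; step size 1/L with L = lambda_max(A)."""
    L = np.linalg.eigvalsh(A)[-1]
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_new = y - (A @ y - b) / L
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + (t - 1.0) / t_new * (x_new - x)
        x, t = x_new, t_new
    return x

def prolong(xc):
    """Linear-interpolation prolongation from n to 2n + 1 interior points."""
    n = len(xc)
    tc = np.concatenate(([0.0], np.linspace(0, 1, n + 2)[1:-1], [1.0]))
    vc = np.concatenate(([0.0], xc, [0.0]))   # zero Dirichlet boundary values
    tf = np.linspace(0, 1, 2 * n + 3)[1:-1]
    return np.interp(tf, tc, vc)

def multilevel_fista(levels, iters):
    """Coarse-to-fine sweep as in Algorithm 5.3: solve on the coarsest grid,
    then warm-start each finer level from the prolonged solution."""
    x = None
    for n in levels:                      # levels listed coarse to fine
        A, b = laplacian(n), np.ones(n)   # -u'' = 1, u(0) = u(1) = 0
        x0 = np.zeros(n) if x is None else prolong(x)
        x = fista(A, b, x0, iters)
    return x
```

With the warm starts, the finer levels begin close to their optimizers, which is the source of the speedup reported below; the exact solution of the model problem is $u(t) = t(1-t)/2$.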
We first numerically verify the convergence rate of the algorithm with respect to the mesh size. After that, our computational efficiency tests demonstrate that the proposed Algorithm 3.2 has efficiency comparable to state-of-the-art methods. Interestingly, the proposed multilevel method performs around 10 times faster than existing methods. We further illustrate the flexibility of our algorithms on different MFP problems. In all the numerical experiments, we use the dynamic cost function $L$ defined in (2.3). All of our numerical experiments are implemented in Matlab on a PC with an Intel(R) i7-8550U 1.80GHz CPU and 16 GB memory.

To numerically verify the theoretical convergence analysis discussed in section 4, we first apply the proposed numerical algorithm to a simple 1D OT example with an exact solution. Let $\Omega = [0, 1]$, $\rho_0(x) = x + \frac12$ and $\rho_1(x) = 1$. Then the OT problem between $\rho_0$ and $\rho_1$ has the exact solution
\[
(6.1)\qquad \rho^*(t, x) = \begin{cases} x + \dfrac12, & t = 0, \\[2mm] \dfrac{\sqrt{2tx + \big(\frac{t}{2} - 1\big)^2} + t - 1}{t\sqrt{2tx + \big(\frac{t}{2} - 1\big)^2}}, & 0 < t \le 1, \end{cases}
\]
\[
(6.2)\qquad m^*(t, x) = \rho^*(t, x)\, \frac{y^2 - y}{2}, \quad \text{where } y = \frac{1}{t}\left(\sqrt{2tx + \Big(\frac{t}{2} - 1\Big)^2} - 1 + \frac{t}{2}\right), \quad 0 < t \le 1,
\]
and $m^*(0, x) = \big(x + \frac12\big)\dfrac{x^2 - x}{2}$. We also know $W_2(\rho_0, \rho_1)^2 = \frac{1}{120}$.

Table 1
Convergence rate of the proposed algorithm applied to the 1D OT problem ($k = 50000$).

  Δt       Δx       $\|E^{(k,*)}\|_2$  order   $\|E^{(k,*)}\|_\infty$  order   W error    order
  1/16     1/64     3.19E-04            --     2.88E-03                 --     4.88E-06    --
  1/32     1/128    1.08E-04           1.56    1.47E-03                0.97    1.22E-06   2.00
  1/64     1/256    3.76E-05           1.53    7.44E-04                0.98    3.05E-07   2.00
  1/128    1/512    1.37E-05           1.46    3.62E-04                1.04    7.63E-08   2.00

Note that it would be quite difficult to check $E^{(k)}$ directly, as we do not have the evolution path $\rho^{(k)}, m^{(k)}$ of the continuous Algorithm 3.1. Instead, we compute
\[
\|E^{(k,*)}\|_2 := \sqrt{\Delta t\, \Delta x \Big( \sum_{j \in J} \big|P^{(k)}_j - \rho^*_j\big|^2 + \sum_{j \in J} \big|M^{(k)}_j - m^*_j\big|^2 \Big)},
\]
\[
\|E^{(k,*)}\|_\infty := \max\Big\{ \max_{j \in J} \big|P^{(k)}_j - \rho^*_j\big|,\ \max_{j \in J} \big|M^{(k)}_j - m^*_j\big| \Big\},
\]
\[
\text{W error} := \Big| \Delta t\, \Delta x\, Y\big(P^{(k)}, M^{(k)}\big) - W_2(\rho_0, \rho_1)^2 \Big|.
\]
Here $\|E^{(k,*)}\|$ is related to $\|E^{(k)}\|$ by
\[
\|E^{(k,*)}\| \le \|E^{(k)}\| + \big\|\big(\rho^{(k)}_J, m^{(k)}_J\big) - \big(\rho^*_J, m^*_J\big)\big\|.
\]
For given $\Delta t, \Delta x$, we can choose $k$ so large that $\|E^{(k,*)}\| \le \|E^{(k)}\| + \epsilon(k)$ with $\epsilon(k) \ll \Delta t + \Delta x$. Fixing such a $k$, according to our theoretical analysis we expect to observe at least
\[
\|E^{(k,*)}\|_2 = O(\Delta t + \Delta x)
\]
and
\[
\|E^{(k,*)}\|_\infty \le \|E^{(k,*)}\|_F = (\Delta t\, \Delta x)^{-1/2}\, \|E^{(k,*)}\|_2 = O(1).
\]
The numerical results are shown in Table 1, where we observe
\[
\|E^{(k,*)}\|_2 = O\big(\Delta t^{1.5} + \Delta x^{1.5}\big), \qquad \|E^{(k,*)}\|_\infty = O(\Delta t + \Delta x).
\]
This indicates that the convergence rates observed in our numerical experiments are better than the theoretical prediction. This is not surprising, as our theoretical analysis may not be sharp.
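The closed-form solution above comes from the Monge map $T(x) = (x^2 + x)/2$, which matches the cumulative distribution functions of $\rho_0$ and $\rho_1$. The identities it implies, including the value $W_2(\rho_0,\rho_1)^2 = 1/120$, can be checked numerically; the script below is our verification sketch:

```python
import numpy as np

rho0 = lambda x: x + 0.5                  # initial density on [0, 1]
T = lambda x: (x ** 2 + x) / 2            # Monge map: F1(T(x)) = F0(x)
Xt = lambda t, x: (1 - t) * x + t * T(x)  # displacement interpolation

def rho_star(t, y):
    """Closed-form density (6.1) for 0 < t <= 1."""
    s = np.sqrt(2 * t * y + (t / 2 - 1) ** 2)
    return (s + t - 1) / (t * s)

n = 100000
x = (np.arange(n) + 0.5) / n              # midpoint quadrature nodes

# W2(rho0, rho1)^2 = integral of (T(x) - x)^2 rho0(x) dx, expected 1/120.
w2sq = np.mean((T(x) - x) ** 2 * rho0(x))

# Push-forward identity: rho*(t, X_t(x)) * X_t'(x) = rho0(x), here at t = 0.7.
t = 0.7
push = rho_star(t, Xt(t, x)) * (1 + t * (x - 0.5))
```

Both checks hold to machine/quadrature precision, which confirms the reconstruction of (6.1) used here.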
Table 2
Time comparison of OT in 1D for different grid sizes ($n_t = 64$; F = FISTA, A = ALG, G = G-prox; all methods use the same stopping tolerance tol) with the best performance highlighted in red.

            Iter                Time (s)                 Time(s)/Iter
  n       F     A     G       F       A       G        F          A          G
  256   611   435   426     1.74    2.99    2.93    2.85E-03   6.86E-03   6.88E-03
  512   611   435   429     3.06    5.91    6.92    5.00E-03   1.36E-02   1.61E-02
 1024   611   435   430     7.60   12.93   12.85    1.24E-02   2.97E-02   2.99E-02
 2048   611   435   431    24.84   36.15   32.72    4.07E-02   8.31E-02   7.59E-02
 4096   611   435   431    51.79   69.99   68.09    8.48E-02   1.61E-01   1.58E-01
In this part, we demonstrate the efficiency of our algorithms by comparing them with state-of-the-art methods for dynamic
Table 3
Time comparison of OT in 2D for different grid sizes ($n_t = 64$; F = FISTA, A = ALG, G = G-prox) with the best performance highlighted in red.

              Iter               Time (s)                  Time(s)/Iter
 n1, n2    F    A    G        F        A        G        F          A          G
   128   116   64   66     46.20    46.07    46.49    3.98E-01   7.20E-01   7.92E-01
   256   116   64   66    212.31   201.52   190.31    1.83E+00   3.15E+00   3.27E+00
   512   116   64   66    810.86   761.65   752.59    6.99E+00   1.19E+01   1.14E+01
OT problems. We apply our algorithms to OT problems with $\rho_0, \rho_1$ being Gaussian distribution densities, and compare the results and computation time with those of ALG (augmented Lagrangian) [11, 12] and G-prox [29]. For all approaches, the stopping criterion is
\[
\big\|\big(P^{(k+1)}, M^{(k+1)}\big) - \big(P^{(k)}, M^{(k)}\big)\big\| \le \text{tol}.
\]
In Table 2 and Table 3, we report the computation time and the number of iterations for each algorithm on different grid sizes in 1D and 2D. From the tables, the proposed Algorithm 3.2 outperforms ALG and G-prox in 1D and achieves similar efficiency in 2D. Interestingly, the CPU time per iteration of our algorithm is the smallest among the three. This is because, at each iteration, all three algorithms must solve a Poisson equation, but our method does not need to solve the $\prod_{d=0}^{D} n_d$ cubic equations required by ALG and G-prox. Our method therefore needs less total time in the 1D experiment, even though it takes more iterations to reach the given stopping criterion. However, this computational saving is marginal compared with the cost of solving the Poisson equation in 2D, so our method spends comparable, rather than less, time in the 2D experiment.

Table 4
Efficiency and accuracy comparisons of OT in 1D ($n_t = 64$, $n_x = 256$) with the best performance highlighted in red.

                     NumIter   Time (s)   Stationarity   Feasibility   Mass
                                            Residue        Residue     Residue
 FISTA                  611      1.723     3.27E-05       2.28E-13    1.33E-15
 ALG                    435      2.840     9.43E-05       2.41E-04    1.64E-08
 G-prox                 426      2.761     1.93E-04       1.88E-04    2.96E-08
 MLFISTA                882      0.422     7.97E-05       2.28E-13    1.11E-15
 MGFISTA (K = 5)       1448      1.195     4.79E-05       2.33E-13    1.77E-15
 MGFISTA (K = 10)      1517      1.341     3.95E-05       2.28E-13    2.22E-15

Moreover, as shown in Table 4 and Table 5, the multilevel and multigrid strategies further accelerate the proposed algorithm by up to 10 times. We also compute the residue of being a stationary point, the residue of the feasibility constraint (1.2), and the residue of mass conservation to check the accuracy of the solutions. From the residue comparisons listed in the tables, it is clear that all of our algorithms provide solutions with far better mass preservation than the results of the ALG and G-prox methods, owing to the nature of the projection step in our method.
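For reference, a mass-conservation check is cheap to compute from the discrete density: compare the total mass of every time slice against the initial mass. This is our reading of the "Mass Residue" column; the array layout is an assumption for the sketch:

```python
import numpy as np

def mass_residue(P, dx):
    """P has shape (n_t, n_x): one density slice per time step. Returns the
    largest deviation of a slice's total mass from the initial mass."""
    masses = P.sum(axis=1) * dx
    return np.max(np.abs(masses - masses[0]))
```

For a density that loses 10% of its mass in one slice, the residue reported is that 10% of the initial mass.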
Table 5
Efficiency and accuracy comparison of OT in 2D ($n_t = 64$, $n_1 = n_2 = 256$) with the best performance highlighted in red.

                     NumIter   Time (s)   Stationarity   Feasibility   Mass
                                            Residue        Residue     Residue
 FISTA                  116    232.560    9.22E-04       6.01E-13    1.42E-14
 ALG                     64    211.043    8.75E-04       4.99E-03    3.10E-03
 G-prox                  66    208.696    9.29E-04       6.88E-03    3.10E-03
 MLFISTA                162     12.853    3.43E-03       2.95E-13    2.07E-14
 MGFISTA (K = 5)        315    134.226    1.07E-03       5.99E-13    1.67E-14
 MGFISTA (K = 10)       315    170.580    9.86E-04       6.00E-13    2.02E-14

Qualitatively, Figure 6 also shows that all six algorithms in our experiments provide satisfactory results in accuracy.

Fig. 6. Qualitative comparisons of $\rho(t, \cdot)$ in 1D. Row 1, from left to right: FISTA, ALG, G-prox. Row 2, from left to right: MLFISTA, MGFISTA ($K = 5$), MGFISTA ($K = 10$).

Most numerical examples of MFP in the literature consider $\Omega$ to be a regular region, e.g. $\Omega = [0,1] \times [0,1]$, and take $Q$ to be an indicator function of obstacles, which keeps the solution inside the resulting irregular domain. In a different setting, [35] provides an interesting optimal transport example where the region is a maze with many "walls". Here we consider several illustrative cases with one or two obstacles in a square domain and show that our algorithm can handle them without any modification of the implementation. A more detailed study along this direction will be explored in our future work.

To be precise, taking $\Omega$ to be a square domain, we consider the MFP problem with objective function
\[
\int_0^1\!\!\int_\Omega L\big(\rho(t,x), m(t,x)\big)\, \mathrm{d}x\, \mathrm{d}t + \lambda_Q \int_0^1\!\!\int_\Omega \rho(t,x)\, Q(x)\, \mathrm{d}x\, \mathrm{d}t.
\]
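As a concrete sketch of this penalty term, the following builds a hypothetical rectangular obstacle on a cell-centered grid and evaluates the discretized $\lambda_Q \iint \rho\, Q$ for one time slice. The obstacle shape, grid, and weight here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

nx = 64
x = (np.arange(nx) + 0.5) / nx - 0.5          # cell centers on [-1/2, 1/2]
X, Y = np.meshgrid(x, x, indexing="ij")
Q = ((np.abs(X) < 0.1) & (np.abs(Y) < 0.3)).astype(float)  # a wall; Q = 1 on it

def obstacle_penalty(rho, lam_Q, cell_area, dt):
    """Discretized lam_Q * integral of rho * Q over one time slice; note the
    gradient of this term in rho is simply lam_Q * cell_area * dt * Q."""
    return lam_Q * dt * cell_area * np.sum(rho * Q)

rho = np.ones((nx, nx))                       # uniform density, for illustration
val = obstacle_penalty(rho, 8.0e4, (1.0 / nx) ** 2, 0.1)
```

Because the gradient contribution is just a large constant on the obstacle cells, a density that stays off the obstacle (`rho = 1 - Q`) incurs zero penalty, which is what drives the evolutions in Figure 7 around the obstacles.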
Different choices of $\rho_0, \rho_1, Q$ are shown in the first row of Figure 7, where $Q(x) = 1$ if $x$ belongs to the obstacle region (shown in white) and $Q(x) = 0$ otherwise. By setting $\lambda_Q$ to be a very large number in our implementation, we expect the obstacle region to repel mass and the density evolution to circumvent it. The snapshots of the evolution shown in Figure 7 demonstrate the success of our algorithm: the mass circumvents the obstacles very well.

Fig. 7. (a)-(c): Initial density $\rho_0$, terminal density $\rho_1$ and obstacle regions, highlighted in white, for Examples 1-3. (d)-(f): Snapshots of $\rho$ at $t = 0.1, 0.3, 0.5, 0.7, 0.9$ for Examples 1-3.

As one of its greatest advantages, our method enjoys the flexibility to handle different types of objective functions in variational MFP problems. To show the effectiveness of our algorithm, we apply Algorithm 3.2 to the five models listed in section 2. We can also observe how different objective functions affect the density
evolutions. We take $\Omega = [0,1] \times [0,1]$, let $\rho_0, \rho_1$ be the two images shown in Figure 8, $G(x) = -\rho_1(x)$, and
\[
Q(x) = \begin{cases} 0, & \rho_0(x) \ne 0 \text{ or } \rho_1(x) \ne 0, \\ 1, & \text{otherwise}. \end{cases}
\]
We consider the MFP problem of the following form:
\[
\min_{(\rho, m) \in \mathcal{C}(\rho_0, \rho_1)} \int_0^1\!\!\int_\Omega L\big(\rho(t,x), m(t,x)\big)\, \mathrm{d}x\, \mathrm{d}t + \lambda_E \int_0^1\!\!\int_\Omega F_E\big(\rho(t,x)\big)\, \mathrm{d}x\, \mathrm{d}t + \lambda_Q \int_0^1\!\!\int_\Omega \rho(t,x)\, Q(x)\, \mathrm{d}x\, \mathrm{d}t.
\]
We apply the proposed algorithm to the following four MFP models:
\[
(6.3)\quad \text{(OT)} \qquad \lambda_E = \lambda_Q = 0;
\]
\[
(6.4)\quad \text{(Model 1)} \qquad \lambda_E, \lambda_Q > 0, \quad F_E : \mathbb{R}_+ \to \mathbb{R}, \quad \rho \mapsto \begin{cases} \rho \log \rho, & \rho > 0, \\ 0, & \rho = 0; \end{cases}
\]
\[
(6.5)\quad \text{(Model 2)} \qquad \lambda_E, \lambda_Q > 0, \quad F_E : \mathbb{R}_+ \to \mathbb{R}, \quad \rho \mapsto \rho^2;
\]
\[
(6.6)\quad \text{(Model 3)} \qquad \lambda_E, \lambda_Q > 0, \quad F_E : \mathbb{R}_+ \to \mathbb{R}, \quad \rho \mapsto \begin{cases} 1/\rho, & \rho > 0, \\ 0, & \rho = 0; \end{cases}
\]
and an MFG model
\[
(6.7)\quad \min_{(\rho, m) \in \mathcal{C}(\rho_0)} \int_0^1\!\!\int_\Omega L\, \mathrm{d}x\, \mathrm{d}t + \lambda_E \int_0^1\!\!\int_\Omega \rho \log \rho\, \mathrm{d}x\, \mathrm{d}t + \lambda_Q \int_0^1\!\!\int_\Omega \rho\, Q\, \mathrm{d}x\, \mathrm{d}t + \lambda_G \int_\Omega \rho(1, x)\, G(x)\, \mathrm{d}x
\]
with $\lambda_G = 1$. It is worth mentioning that to solve models (6.3)-(6.6) we must rescale $\rho_0, \rho_1$ so that $\int_\Omega \rho_0 = \int_\Omega \rho_1$, whereas no such rescaling of $G(x)$ is needed for (6.7).

Fig. 8. From left to right: initial density $\rho_0$, final density $\rho_1$, interaction penalty $Q(x)$ and terminal density regularizer $G(x)$.

Figure 9 shows snapshots of the density evolutions. Since (6.5)-(6.7) impose a spatial preference on the evolution, the mass evolutions stay within the dark region, while the optimal transport model (6.3) evolves more freely.
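A small sketch of the interaction energies used above, with the convention $F_E(0) = 0$. The entropy form is given in (6.7); the quadratic and reciprocal forms follow our reading of the models in (6.5)-(6.6), and the safe handling of $\rho = 0$ is an implementation choice:

```python
import numpy as np

def entropy(rho):
    """F_E(rho) = rho * log(rho) for rho > 0, and 0 at rho = 0."""
    out = np.zeros_like(rho)
    pos = rho > 0
    out[pos] = rho[pos] * np.log(rho[pos])
    return out

def quadratic(rho):
    """F_E(rho) = rho^2, a mild congestion penalty (our reading of (6.5))."""
    return rho ** 2

def reciprocal(rho):
    """F_E(rho) = 1/rho for rho > 0, and 0 at rho = 0 (our reading of (6.6));
    it favors regions that are either empty or densely occupied."""
    out = np.zeros_like(rho)
    pos = rho > 0
    out[pos] = 1.0 / rho[pos]
    return out

# Example densities, including the boundary case rho = 0:
rho = np.array([0.0, 0.5, 1.0, 2.0])
```

In the gradient step of the solver, the interaction term contributes $\lambda_E\, \Delta t\, \Delta x\, F_E'(\rho)$ wherever $\rho > 0$, so only pointwise evaluations like these are needed.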
Comparing models (6.4) and (6.5) with (6.6), we observe that the mass evolutions of models (6.4) and (6.5) are dense, while that of (6.6) goes through a congest-then-flatten process and tends to be sparse. This is consistent with our discussion in section 2.

Fig. 9. Snapshots of $\rho$ at $t = 0.2, 0.4, 0.6, 0.8, 1$ (from left to right). (a) OT: $F_E(a) := 0$. (b) Model 1: $F_E(a) = a \log a$, $a > 0$. (c) Model 2: $F_E(a) = a^2$. (d) Model 3: $F_E(a) = 1/a$, $a > 0$. (e) MFG: $F_E(a) = a \log a$, $a > 0$.
7. Conclusion.
In this paper, we propose an efficient and flexible algorithm to solve potential MFP problems based on an accelerated proximal gradient algorithm. In the optimal transport setting, our method converges faster than, or nearly as fast as, G-prox, and approaches the optimizer with the same accuracy. With multilevel and multigrid strategies, our algorithm can be accelerated up to 10 times without sacrificing accuracy. In broader settings of MFP and MFG, our method is more flexible than primal-dual or dual algorithms, as it can handle general differentiable objective functions.
We also analyze the convergence of the proposed discretization in terms of $(\rho, m)$ and show that, under some mild assumptions, our algorithm converges to the optimizer. In the future, we expect to extend the proposed algorithms to non-potential mean field games, which have vast applications in mathematical finance, communications, and data science.

REFERENCES

[1] Yves Achdou, Francisco J. Buera, Jean-Michel Lasry, Pierre-Louis Lions, and Benjamin Moll. Partial differential equation models in macroeconomics. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 372(2028):20130397, 2014.
[2] Yves Achdou, Fabio Camilli, and Italo Capuzzo-Dolcetta. Mean field games: numerical methods for the planning problem. SIAM Journal on Control and Optimization, 50(1):77–109, 2012.
[3] Yves Achdou, Fabio Camilli, and Italo Capuzzo-Dolcetta. Mean field games: convergence of a finite difference method. SIAM Journal on Numerical Analysis, 51(5):2585–2612, 2013.
[4] Yves Achdou and Italo Capuzzo-Dolcetta. Mean field games: numerical methods. SIAM Journal on Numerical Analysis, 48(3):1136–1162, 2010.
[5] Yves Achdou, Jiequn Han, Jean-Michel Lasry, Pierre-Louis Lions, and Benjamin Moll. Income and wealth distribution in macroeconomics: A continuous-time approach. Technical report, National Bureau of Economic Research, 2017.
[6] Yves Achdou and Mathieu Laurière. Mean field games and applications: Numerical aspects. arXiv preprint arXiv:2003.04444, 2020.
[7] Yves Achdou and Victor Perez. Iterative strategies for solving linearized discrete mean field games systems. Networks & Heterogeneous Media, 7(2):197, 2012.
[8] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[9] Heinz H. Bauschke, Regina S. Burachik, Patrick L. Combettes, Veit Elser, D. Russell Luke, and Henry Wolkowicz. Fixed-Point Algorithms for Inverse Problems in Science and Engineering, volume 49. Springer Science & Business Media, 2011.
[10] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[11] Jean-David Benamou and Yann Brenier. A numerical method for the optimal time-continuous mass transport problem and related problems. Contemporary Mathematics, 226:1–12, 1999.
[12] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.
[13] Jean-David Benamou and Guillaume Carlier. Augmented Lagrangian methods for transport optimization, mean field games and degenerate elliptic equations. Journal of Optimization Theory and Applications, 167(1):1–26, 2015.
[14] Jean-David Benamou, Guillaume Carlier, and Maxime Laborde. An augmented Lagrangian approach to Wasserstein gradient flows and applications. ESAIM: Proceedings and Surveys, 54:1–17, 2016.
[15] Jean-David Benamou, Guillaume Carlier, and Filippo Santambrogio. Variational mean field games. In Active Particles, Volume 1, pages 141–171. Springer, 2017.
[16] Jean-David Benamou, Brittany D. Froese, and Adam M. Oberman. Numerical solution of the optimal transportation problem using the Monge–Ampère equation. Journal of Computational Physics, 260:107–126, 2014.
[17] Alfio Borzì and Volker Schulz. Multigrid methods for PDE optimization. SIAM Review, 51(2):361–395, 2009.
[18] Giuseppe Buttazzo, Luigi De Pascale, and Paola Gori-Giorgi. Optimal-transport formulation of electronic density-functional theory. Physical Review A, 85(6):062502, 2012.
[19] Codina Cotar, Gero Friesecke, and Claudia Klüppelberg. Density functional theory and optimal transportation with Coulomb cost. Communications on Pure and Applied Mathematics, 66(4):548–599, 2013.
[20] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[21] Antonio De Paola, Vincenzo Trovato, David Angeli, and Goran Strbac. A mean field game approach for distributed control of thermostatic loads acting in simultaneous energy-frequency response markets. IEEE Transactions on Smart Grid, 10(6):5987–5999, 2019.
[22] Alfred Galichon. Optimal Transport Methods in Economics. Princeton University Press, 2018.
[23] Diogo Gomes and João Saúde. A mean-field game approach to price formation in electricity markets. arXiv preprint arXiv:1807.07088, 2018.
[24] Diogo A. Gomes et al. Mean field games models—a brief survey. Dynamic Games and Applications, 4(2):110–154, 2014.
[25] Olivier Guéant, Jean-Michel Lasry, and Pierre-Louis Lions. Mean field games and applications. In Paris-Princeton Lectures on Mathematical Finance 2010, pages 205–266. Springer, 2011.
[26] Steven Haker, Lei Zhu, Allen Tannenbaum, and Sigurd Angenent. Optimal mass transport for registration and warping. International Journal of Computer Vision, 60(3):225–240, 2004.
[27] Minyi Huang, Peter E. Caines, and Roland P. Malhamé. Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ε-Nash equilibria. IEEE Transactions on Automatic Control, 52(9):1560–1571, 2007.
[28] Minyi Huang, Roland P. Malhamé, Peter E. Caines, et al. Large population stochastic dynamic games: closed-loop McKean–Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221–252, 2006.
[29] Matt Jacobs, Flavien Léger, Wuchen Li, and Stanley Osher. Solving large-scale optimization problems with a convergence rate independent of grid size. SIAM Journal on Numerical Analysis, 57(3):1100–1123, 2019.
[30] Jean-Michel Lasry and Pierre-Louis Lions. Mean field games. Japanese Journal of Mathematics, 2(1):229–260, 2007.
[31] Wonjun Lee, Rongjie Lai, Wuchen Li, and Stanley Osher. Generalized unnormalized optimal transport and its fast algorithms. arXiv preprint arXiv:2001.11530, 2020.
[32] Haoya Li, Yuwei Fan, and Lexing Ying. A simple multiscale method for mean field games. arXiv preprint arXiv:2007.04594, 2020.
[33] Jialin Liu, Wotao Yin, Wuchen Li, and Yat Tin Chow. Multilevel optimal transport: a fast approximation of Wasserstein-1 distances. SIAM Journal on Scientific Computing, 43(1):A193–A220, 2021.
[34] Nicolas Papadakis. Optimal Transport for Image Processing. PhD thesis, 2015.
[35] Nicolas Papadakis, Gabriel Peyré, and Edouard Oudet. Optimal transport with proximal splitting. SIAM Journal on Imaging Sciences, 7(1):212–238, 2014.
[36] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
[37] Alessio Porretta. On the planning problem for the mean field games system. Dynamic Games and Applications, 4(2):231–256, 2014.
[38] R. Tyrrell Rockafellar. Convex Analysis, volume 36. Princeton University Press, 1970.
[39] Lars Ruthotto, Stanley J. Osher, Wuchen Li, Levon Nurbekyan, and Samy Wu Fung. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Proceedings of the National Academy of Sciences, 117(17):9183–9193, 2020.
[40] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
[41] Weinan E, Jiequn Han, and Qianxiao Li. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1):10, 2019.
[42] Chungang Yang, Jiandong Li, Min Sheng, Alagan Anpalagan, and Jia Xiao. Mean field game-theoretic framework for interference and energy-aware control in 5G ultra-dense networks. IEEE Wireless Communications, 25(1):114–121, 2017.
[43] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In