Maximum likelihood estimation of potential energy in interacting particle systems from single-trajectory data
XIAOHUI CHEN
Abstract.
This paper concerns the parameter estimation problem for the quadratic potential energy in interacting particle systems from continuous-time, single-trajectory data. Even though such dynamical systems are high-dimensional, we show that the vanilla maximum likelihood estimator (without regularization) is able to estimate the interaction potential parameter with the optimal rate of convergence, simultaneously in the mean-field limit and in long-time dynamics. This to some extent avoids the curse of dimensionality for estimating large dynamical systems under symmetry of the particle interaction.
Date: First version: July 23, 2020.
2010 Mathematics Subject Classification. Primary: 60H15; Secondary: 62M05.
Key words and phrases. Interacting particle systems, maximum likelihood estimation, mean-field regime, stochastic Vlasov equation, symmetry.
Research was supported in part by NSF CAREER Award DMS-1752614 and a Simons Fellowship. Part of this research was carried out at the Institute for Data, Systems, and Society (IDSS) at the Massachusetts Institute of Technology. The author would like to thank Philippe Rigollet and Yun Yang for helpful comments.

1. Introduction

Dynamical systems of interacting particles have a wide range of applications in modeling collective behaviors in physics [5], biology [16, 21], social science [17], and, more recently, in machine learning as useful tools to understand the stochastic gradient descent (SGD) dynamics on neural networks [15]. Due to the large number of particles, such dynamical systems are high-dimensional even for single-trajectory data from each particle, and statistical learning problems for lower-dimensional interaction functionals from data are usually challenging [2, 11, 13]. In this paper, we propose a likelihood-based inference procedure to estimate the parameters of the interaction function induced by a potential energy in an $N$-particle interacting system based on the observed (continuous-time) trajectories, and we establish its statistical guarantees.

1.1. Interacting $N$-particle systems. In statistical mechanics, microscopic behaviors of a large number of random particles are related to explain macroscopic physical quantities (such as temperature distributions). Specifically, a system of $N$ interacting particles $(X_t^{N,1}, \dots, X_t^{N,N})$ can be described by stochastic differential equations (SDEs) of the form
$$\mathrm{d}X_t^{N,i} = \frac{1}{N}\sum_{j=1}^{N} b(X_t^{N,j} - X_t^{N,i})\,\mathrm{d}t + \sigma\,\mathrm{d}W_t^i, \qquad (1)$$
where $b : \mathbb{R}^d \to \mathbb{R}^d$ is a vector field representing the pairwise interaction between the particles, $(W_t^1)_{t\ge 0}, \dots, (W_t^N)_{t\ge 0}$ are $N$ independent copies of the standard Brownian motion in $\mathbb{R}^d$ such that $\mathbb{E}[W_t^i] = 0$, and $\sigma \in \mathbb{R}_+$ is a diffusion parameter which is assumed to be constant and known. The scaling in (1) puts us in the mean-field regime, where the pairwise interaction
effect is weak and decays on the order of $1/N$ as the number of particles $N \to \infty$, so the total interaction effect remains $O(1)$.
In this paper, we consider the estimation problem for the interaction function $b$ parametrized by a linear approximation $b(x) = \Theta x$ for some unknown $d \times d$ (symmetric) positive-definite matrix $\Theta \succ 0$:
$$\mathrm{d}X_t^{N,i} = \Theta(\bar{X}_t^N - X_t^{N,i})\,\mathrm{d}t + \sigma\,\mathrm{d}W_t^i, \qquad (2)$$
where $\bar{X}_t^N = N^{-1}\sum_{j=1}^N X_t^{N,j}$. The interaction in the stochastic system (2) corresponds to the
Hookean behavior capturing linear elasticity, where the interaction force $b$ scales linearly with the deformation distance (due to compression and stretch) in the direction from $X_t^{N,j}$ to $X_t^{N,i}$. This type of interaction is extensively used to study the large-scale and long-time dynamics of protein folding as an elastic mass-and-spring network of small $C_\alpha$ atoms [7, 6].
It is a classical result [20] that the $N$-particle SDE system (1) in the mean-field limit admits a unique strong solution $(X_t^{N,i})_{t\ge 0}$ if $b$ is (globally) Lipschitz, which is the case for the stochastic system (2). Based on the observed continuous-time, single-trajectory data of the particle movements $(X_t^{N,1})_{t\ge 0}, \dots, (X_t^{N,N})_{t\ge 0}$ in the interacting particle system (2), our main focus is to estimate the interaction parameter $\Theta$ in the potential energy.
Note that the first-order dynamical system (2) evolves as stochastic gradient flows in $\mathbb{R}^d$:
$$\frac{\mathrm{d}X_t^{N,i}}{\mathrm{d}t} = \frac{1}{N}\sum_{j=1}^N \nabla V(X_t^{N,j} - X_t^{N,i}) + \sigma\xi_t^i, \qquad (3)$$
which corresponds to the quadratic potential energy $V(x) = \frac{1}{2}x^\top\Theta x$, where the $\xi_t^i$ are i.i.d. standard Gaussian random vectors in $\mathbb{R}^d$. The left-hand side of (3) is the observed velocity vector of particle $i$ at time $t$, and the right-hand side of (3) is a linear function of all particle trajectories at time $t$ corrupted by independent additive Gaussian noise. Thus, there are $Nd$
SDEs in (3) to solve, with $N$ observed trajectories in dimension $d$, and the estimation problem for $\Theta$ can be recast as a high-dimensional linear regression problem in an augmented space $\mathbb{R}^{Nd}$ (cf. the equivalent form (7) in Section 2). Nevertheless, the trajectory data are temporally dependent samples since the particles are interacting and dynamic, so theoretical guarantees for estimating structured coefficients in high-dimensional linear models with i.i.d. samples are no longer applicable in our context [4]. Moreover, due to the symmetry of the particles in law, the regression coefficients have a very special replicated block-diagonal structure in $\mathbb{R}^{Nd}$, which suggests that regularization techniques may not be needed in our problem. Indeed, we show in this work that a direct likelihood-ratio method suffices to estimate $\Theta$ with the optimal rate of convergence.
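Before turning to the mean-field limit, it may help to see the model concretely. The following sketch (ours, not from the paper; the function name, step size, and initialization are illustrative choices) simulates the linear system (2) by a standard Euler-Maruyama discretization.

```python
import numpy as np

def simulate_linear_system(Theta, N, T, dt, sigma=1.0, tau=1.0, seed=0):
    """Euler-Maruyama discretization of system (2):
    dX^i_t = Theta (Xbar_t - X^i_t) dt + sigma dW^i_t,  X^i_0 ~ N(0, tau^2 I_d).
    Theta is assumed symmetric positive-definite; returns trajectories of
    shape (n_steps + 1, N, d)."""
    rng = np.random.default_rng(seed)
    d = Theta.shape[0]
    n_steps = int(T / dt)
    X = np.empty((n_steps + 1, N, d))
    X[0] = tau * rng.standard_normal((N, d))      # chaotic (i.i.d.) initialization
    for k in range(n_steps):
        Xbar = X[k].mean(axis=0)                  # empirical mean \bar X_t^N
        drift = (Xbar - X[k]) @ Theta             # rows are Theta(Xbar - X^i)
        X[k + 1] = X[k] + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal((N, d))
    return X

# Example usage with a 2-dimensional quadratic potential:
X = simulate_linear_system(np.diag([2.0, 1.0]), N=500, T=10.0, dt=1e-3)
```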
1.2. Stochastic Vlasov equation: decoupled mean-field limit. Let $\rho_t^N = N^{-1}\sum_{j=1}^N \delta_{X_t^{N,j}}$ be the empirical measure of the $N$ particles at time $t$. Then we can alternatively write (1) as
$$\mathrm{d}X_t^{N,i} = \Big(\int_{\mathbb{R}^d} b(y - X_t^{N,i})\,\rho_t^N(\mathrm{d}y)\Big)\,\mathrm{d}t + \sigma\,\mathrm{d}W_t^i, \qquad (4)$$
where the drift coefficient vector depends on the individual state $X_t^{N,i}$ and on the distribution $\rho_t^N$ (due to interaction). As $N \to \infty$ (i.e., in the mean-field limit), the interaction contributed by any pair of particles in the $N$-particle system (1) vanishes and all particles become (asymptotically) i.i.d., since they have the same drift and diffusion coefficients and are driven by independent standard Brownian motions in $\mathbb{R}^d$. By the law of large numbers, we see that for any fixed $t$, $\rho_t^N \to \rho_t$ as $N \to \infty$, and the dynamical system (4) becomes
$$\mathrm{d}Y_t^i = \Big(\int_{\mathbb{R}^d} b(y - Y_t^i)\,\rho_t(\mathrm{d}y)\Big)\,\mathrm{d}t + \sigma\,\mathrm{d}W_t^i, \qquad (5)$$
where $(\rho_t)_{t\ge 0}$ is a non-random measure flow. Since the particles are symmetric in distribution, $\rho_t$ is actually the limiting law of each particle $X_t^{N,i}$, $i = 1, \dots, N$. This defines a system of independent stochastic Vlasov equations (5) with $\rho_t = \mathcal{L}(Y_t^i)$, which is a class of Markov processes with nonlinear dynamics [14].
For the quadratic potential $V(x) = \frac{1}{2}x^\top\Theta x$, the stochastic system (2) depends on the empirical distribution $\rho_t^N$ only through the empirical mean $\bar{X}_t^N$, and it can be decoupled and approximated by a system of independent mean-reverting processes. Averaging the $N$ SDEs in (2), we see that, once again by the law of large numbers,
$$\bar{X}_t^N = \bar{X}_0^N + \sigma\bar{W}_t^N \to 0 \quad \text{as } N \to \infty, \qquad \text{where } \bar{W}_t^N := \frac{1}{N}\sum_{i=1}^N W_t^i,$$
which means that the interaction effect $\bar{X}_t^N$ becomes deterministic and is nicely decoupled in the mean-field limit. Thus we expect that the $i$-th particle process $(X_t^{N,i})_{t\ge 0}$ can be (independently) approximated by a limiting process $(Y_t^i)_{t\ge 0}$ given by
$$\mathrm{d}Y_t^i = -\Theta Y_t^i\,\mathrm{d}t + \sigma\,\mathrm{d}W_t^i. \qquad (6)$$
Note that the processes $(Y_t^i)_{t\ge 0}$ in (6) are independent copies of the Ornstein-Uhlenbeck (OU) process, so the limit is a linear dynamical system of $N$ independent particles.
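A quick numerical illustration of this decoupling (our addition, with arbitrary parameter choices): driving the interacting system (2) and the OU system (6) with the same Brownian increments, the two stay within $O(N^{-1/2})$ of each other, and the mean process $\bar{X}_t^N$ is of the same small order.

```python
import numpy as np

# Decoupling check in d = 1: couple systems (2) and (6) through the same noise.
rng = np.random.default_rng(1)
theta, N, dt, n_steps = 1.5, 2000, 1e-3, 5000
X = rng.standard_normal(N)            # particles of system (2)
Y = X.copy()                          # decoupled OU copies (6), same initial data
for _ in range(n_steps):
    dW = np.sqrt(dt) * rng.standard_normal(N)
    X += theta * (X.mean() - X) * dt + dW
    Y += -theta * Y * dt + dW
print("|mean process|  :", abs(X.mean()))        # expected O(N^{-1/2})
print("max |X^i - Y^i| :", np.abs(X - Y).max())  # expected O(N^{-1/2})
```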
1.3. Existing literature. Learnability (i.e., identifiability) of interaction functions in interacting particle systems under the coercivity condition was studied in [13, 11]. In the noiseless setting, estimation of the interaction kernel, a scalar-valued function of the pairwise distance between particles in the system, was first studied in [2] for single-trajectory data in the mean-field limit, where the rate of convergence is no faster than $N^{-1/d}$. To alleviate the curse of dimensionality, sparsity-promoting techniques were considered for some structured high-dimensional dynamical systems [3, 19]. [13] showed that a least-squares estimator achieves the optimal rate of convergence (in the number of observed trajectories for each particle) for estimating the interaction kernel based on multiple-trajectory data sampled from a deterministic system with random initialization. Estimation of the diffusion parameter for interacting particle systems from noisy trajectory data was studied in [8]. Consistency of parameter estimation for the general McKean-Vlasov equation by maximum likelihood was studied in [22]. To the best of our knowledge, no existing work, regularized or not, establishes the optimal rate of convergence for interaction parameter estimation simultaneously in the large-$N$ (mean-field limit) and large-$t$ (long-time dynamics) regimes. This work fills the gap for linear-elasticity interacting particle systems.
1.4. Notation. For two generic vectors $a, b \in \mathbb{R}^d$, we use $a \cdot b = \sum_{j=1}^d a_j b_j$ to denote the inner product of $a$ and $b$. We use $\|a\| = (a \cdot a)^{1/2}$ to denote the Euclidean norm of $a$ and $\|a\|_\infty = \max_{1\le j\le d}|a_j|$ to denote its maximum norm. For a generic matrix $M$, we use $\|M\|$ to denote its spectral norm. Denote the set of $d \times d$ positive-definite matrices by $\mathcal{S}_+^{d\times d}$. Throughout, $c > 0$ denotes a small absolute constant whose precise value may change from place to place.

2. Maximum likelihood estimation
We estimate $\Theta \succ 0$ from the continuous-time trajectory data. Since $\sigma$ is known and constant, we may assume for simplicity that $\sigma = 1$. In addition, without loss of generality we assume that $X_0^{N,i} = \xi^i$ for some mean-zero independent random vectors $\xi^i$ in $\mathbb{R}^d$.
Let $\boldsymbol{W}_t = (W_t^1, \dots, W_t^N) \in \mathbb{R}^{Nd}$ be the $N$ stacked standard $d$-dimensional Brownian motions and $\boldsymbol{X}_t^N = (X_t^{N,1}, \dots, X_t^{N,N}) \in \mathbb{R}^{Nd}$ be the stacked observation process. Then system (2) can be rewritten as a higher-dimensional mean-reverting process in the augmented space $\mathbb{R}^{Nd}$:
$$\mathrm{d}\boldsymbol{X}_t^N = -\boldsymbol{\Theta} H \boldsymbol{X}_t^N\,\mathrm{d}t + \boldsymbol{\Sigma}\,\mathrm{d}\boldsymbol{W}_t, \qquad (7)$$
where $\boldsymbol{\Theta} = \mathrm{diag}(\Theta, \dots, \Theta)$ is an $(Nd) \times (Nd)$ block-diagonal matrix, $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma, \dots, \sigma)$ is an $(Nd) \times (Nd)$ diagonal matrix, and
$$H = \frac{1}{N}\begin{pmatrix} (N-1)I_d & -I_d & \cdots & -I_d \\ -I_d & (N-1)I_d & \cdots & -I_d \\ \vdots & \vdots & \ddots & \vdots \\ -I_d & -I_d & \cdots & (N-1)I_d \end{pmatrix}$$
is an $(Nd) \times (Nd)$ interaction matrix. Note that $H$ is a projection matrix ($H^2 = H$), which implies that the interaction effect is homogeneous.
Let $\mathbb{P}$ be the law of the standard $(Nd)$-dimensional Brownian motion $(\boldsymbol{W}_t)_{t\ge 0}$ and $\mathbb{Q}$ be the law of the augmented observation process $(\boldsymbol{X}_t^N)_{t\ge 0}$. By the multivariate Girsanov theorem for changing measures (cf. Theorem 1.12 in [9]), the likelihood ratio between $(\boldsymbol{X}_t^N)_{t\ge 0}$ in (2) and $(\boldsymbol{W}_t)_{t\ge 0}$ is given by the Radon-Nikodym derivative
$$\frac{\mathrm{d}\mathbb{Q}}{\mathrm{d}\mathbb{P}}(\boldsymbol{X}^{N,t}, \rho^t) =: e^{\ell_t^N(A)},$$
where
$$\ell_t^N(A) = \sum_{i=1}^N \Big[-\frac{1}{2}\int_0^t \|A(\bar{X}_s^N - X_s^{N,i})\|^2\,\mathrm{d}s + \int_0^t A(\bar{X}_s^N - X_s^{N,i}) \cdot \mathrm{d}X_s^{N,i}\Big], \qquad (8)$$
$A \in \mathcal{S}_+^{d\times d}$, $\boldsymbol{X}^{N,t} = (\boldsymbol{X}_s^N)_{s\in[0,t]}$, and $\rho^t = (\rho_s)_{s\in[0,t]}$. The maximum likelihood estimator (MLE) for $\Theta$ is then defined as
$$\hat{\Theta}_t^N = \operatorname*{argmax}_{A\in\mathcal{S}_+^{d\times d}} \ell_t^N(A). \qquad (9)$$
Next we explain the intuition for why the MLE (9) works. Note that we can write the log-likelihood as
$$\ell_t^N(A) = N\int_0^t \operatorname{tr}\Big[M_s\Big(-\frac{1}{2}AA^\top + A\Theta\Big)\Big]\,\mathrm{d}s + \sum_{i=1}^N \int_0^t A(\bar{X}_s^N - X_s^{N,i}) \cdot \mathrm{d}W_s^i, \qquad (10)$$
where $M_s = N^{-1}\sum_{i=1}^N (\bar{X}_s^N - X_s^{N,i})(\bar{X}_s^N - X_s^{N,i})^\top$ is the instantaneous mean-field covariance matrix at time point $s$.
As discussed earlier in Section 1, since the interaction among the $N$ particles in the mean-field regime is weak, we expect that the particles can be decoupled into their independent analogs. Let $(Y_t^1)_{t\ge 0}, \dots, (Y_t^N)_{t\ge 0}$ be independent copies of the OU processes defined in (6). Effectively, we can view $(Y_t^1)_{t\ge 0}, \dots, (Y_t^N)_{t\ge 0}$ as a decoupled version of the $N$-particle system $(X_t^{N,1})_{t\ge 0}, \dots, (X_t^{N,N})_{t\ge 0}$. Based on the decoupled processes, we can approximate $\ell_t^N(A)$ by
$$\tilde{\ell}_t^N(A) := N\int_0^t \operatorname{tr}\Big[\tilde{M}_s\Big(-\frac{1}{2}AA^\top + A\Theta\Big)\Big]\,\mathrm{d}s - \sum_{i=1}^N \int_0^t AY_s^i \cdot \mathrm{d}W_s^i,$$
where $\tilde{M}_s = N^{-1}\sum_{i=1}^N Y_s^i (Y_s^i)^\top$. Decompose
$$\ell_t^N(A) = \underbrace{\big(\ell_t^N(A) - \tilde{\ell}_t^N(A)\big)}_{\text{decoupling error}} + \underbrace{\big(\tilde{\ell}_t^N(A) - \mathbb{E}[\tilde{\ell}_t^N(A)]\big)}_{\text{OU fluctuation error}} + \underbrace{\mathbb{E}[\tilde{\ell}_t^N(A)]}_{\text{signal}}.$$
Concentration bounds (which we develop in Section 5.2) allow us to control the decoupling and fluctuation errors around zero. Thus the information useful for estimation comes from the signal part
$$\mathbb{E}[\tilde{\ell}_t^N(A)] = \operatorname{tr}\Big[\Big(\int_0^t \mathbb{E}[\tilde{M}_s]\,\mathrm{d}s\Big)\Big(-\frac{1}{2}AA^\top + A\Theta\Big)\Big].$$
Since the matrix $\int_0^t \mathbb{E}[\tilde{M}_s]\,\mathrm{d}s$ is positive-definite (for instance, if $\xi^i = 0$), we see that the maximizer is $A^* = \Theta$. This means that at the population level the MLE equals the true parameter. Combining this with the decoupling and fluctuation error bounds, we can obtain the rate of convergence for the MLE $\hat{\Theta}_t^N$ in (9).
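For intuition, the first-order condition of (10) can also be solved in closed form without knowing $\Theta$, namely $\hat{\Theta}_t^N = \big(\sum_i \int_0^t \mathrm{d}X_s^{N,i}\,(\bar{X}_s^N - X_s^{N,i})^\top\big)\big(\sum_i \int_0^t (\bar{X}_s^N - X_s^{N,i})(\bar{X}_s^N - X_s^{N,i})^\top\,\mathrm{d}s\big)^{-1}$. The sketch below (our illustration, not the paper's code) discretizes these integrals from sampled trajectories; the final symmetrization stands in for the constraint $A \in \mathcal{S}_+^{d\times d}$ in (9).

```python
import numpy as np

def mle_theta(X, dt):
    """Discretized MLE for Theta from trajectories X of shape (n_steps+1, N, d),
    via the first-order condition of the log-likelihood (10). Sketch only:
    stochastic integrals become Ito-Riemann sums, and the positive-definite
    constraint of (9) is replaced by a simple symmetrization."""
    Z = X.mean(axis=1, keepdims=True) - X                # Z^i_s = Xbar_s - X^i_s
    dX = np.diff(X, axis=0)                              # forward increments dX^i
    num = np.einsum('kid,kie->de', dX, Z[:-1])           # sum_i int dX^i (Z^i)^T
    den = np.einsum('kid,kie->de', Z[:-1], Z[:-1]) * dt  # sum_i int Z^i (Z^i)^T ds
    Theta_hat = num @ np.linalg.inv(den)
    return 0.5 * (Theta_hat + Theta_hat.T)
```

Paired with the simulation sketch from Section 1.1, `mle_theta(X, dt=1e-3)` recovers the `Theta` used there up to an error that shrinks as $N$ or $t$ grows.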
3. Rate of convergence

In this section, we derive the rate of convergence for estimating $\Theta$ by the MLE $\hat{\Theta}_t^N$ in (9) from the continuous-time, single-trajectory data for each particle. Let $\theta_1 \ge \dots \ge \theta_d > 0$ be the eigenvalues of $\Theta$.

Theorem 3.1. Suppose that the $N$-particle system $(X_t^{N,1})_{t\ge 0}, \dots, (X_t^{N,N})_{t\ge 0}$ is initialized at $X_0^{N,i} = \xi^i$ for i.i.d. Gaussian random vectors $\xi^1, \dots, \xi^N \sim N(0, D)$ in $\mathbb{R}^d$ with $D = \mathrm{diag}(\tau_1^2, \dots, \tau_d^2)$. If $t \ge 1/\theta_d$ and $N \ge 400$, then for any $\varepsilon \in [e^{-cN}, 1)$, we have with probability at least $1 - \varepsilon$,
$$\|\hat{\Theta}_t^N - \Theta\| \lesssim \sigma\,\theta_1^{1/2}\Big(\frac{d\log(d/\varepsilon)}{Nt}\Big)^{1/2}. \qquad (11)$$

Theorem 3.1 is non-asymptotic and has several appealing features.
First, we do not assume that the $N$-particle processes $(X_t^{N,1})_{t\ge 0}, \dots, (X_t^{N,N})_{t\ge 0}$ start from the stationary distribution (which would be a restrictive assumption), and the (continuous) time complexity $t \ge 1/\theta_d$ of the particle trajectories is sharp. Indeed, suppose $\Theta = \mathrm{diag}(\theta_1, \dots, \theta_d)$ is a diagonal matrix and observe that the decoupled $N$ copies of the OU process that approximate the dynamics of the interacting $N$-particle system have equilibrium distribution $N(0, \frac{\sigma^2}{2}\Theta^{-1})$. Thus it takes at least $\Omega(\max_{1\le j\le d}\theta_j^{-1})$ time for $(X_t^{N,1})_{t\ge 0}, \dots, (X_t^{N,N})_{t\ge 0}$ starting from zero to mix to the steady state, unless those processes are initialized at $t = 0$ from the stationary distribution (i.e., $\tau_j^2 = \frac{\sigma^2}{2\theta_j}$). In particular, if $\theta_j$ is closer to zero, then the log-likelihood ratio becomes flatter, and thus a larger $t$ is necessary to extract the information carried by samples from the stationary distribution. On the other hand, if the processes start from the stationary distribution, then this trajectory-time lower bound is not needed to obtain (11).
Second, the rate of convergence in (11) is parametric (and thus rate-optimal) in both $N$ and $t$. Specifically, for fixed $d$, we can obtain from Theorem 3.1 the large-$N$ (mean-field limit) and large-$t$ (long-time dynamics) asymptotics
$$\|\hat{\Theta}_t^N - \Theta\| = O_{\mathbb{P}}\Big(\sqrt{\frac{1}{Nt}}\Big), \qquad (12)$$
provided that $0 < \theta_j < \infty$ for all $j = 1, \dots, d$ and the $N$ particle processes start from a chaotic distribution. We highlight that the long-time dynamic behavior in (12) as $t \to \infty$ cannot be obtained from the classical theory of the propagation of chaos (cf. [20]), where
Gronwall's lemma (cf. Appendix 1 in [18]) is typically used to control the decoupling error between the interacting $N$-particle processes and their independent analogs. In that case, the decoupling error is exponentially increasing in $t$, so it cannot be used to yield the rate $t^{-1/2}$. Our argument is tailored to the quadratic structure of the interaction potential $V(x) = \frac{1}{2}x^\top\Theta x$, which allows for a far more efficient decoupling strategy (cf. Lemma 5.5).
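As an informal numerical companion to (12) (our illustration; the parameters are arbitrary), the one-dimensional coordinatewise estimator (15) below can be run for a few $(N, t)$ pairs to watch the error scale like $(Nt)^{-1/2}$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, dt = 1.0, 1e-3
for N, T in [(100, 10.0), (400, 10.0), (100, 40.0)]:  # N*T quadruples after line 1
    X = rng.standard_normal(N)
    num = den = 0.0
    for _ in range(int(T / dt)):
        Z = X.mean() - X                 # Xbar_s - X^i_s
        dW = np.sqrt(dt) * rng.standard_normal(N)
        dX = theta * Z * dt + dW
        num += np.dot(Z, dX)             # sum_i int (Xbar - X^i) dX^i
        den += np.dot(Z, Z) * dt         # sum_i int |Xbar - X^i|^2 ds
        X += dX
    print(f"N={N:4d} T={T:5.1f}  error={abs(num / den - theta):.4f}")
# The error should drop by roughly a factor of 2 between the first line and
# each of the other two, matching the O_P(1/sqrt(Nt)) rate in (12).
```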
Proof of Theorem 3.1. Without loss of generality, we may assume that $\sigma = 1$; by rescaling, it then suffices to prove that
$$\|\hat{\Theta}_t^N - \Theta\| \lesssim \theta_1^{1/2}\Big(\frac{d\log(d/\varepsilon)}{Nt}\Big)^{1/2} \qquad (13)$$
holds with probability at least $1 - \varepsilon$. Since the objective function $\ell_t^N(A)$ in (10) is quadratic in $A$, the first-order optimality condition implies that the MLE of $\Theta$ is given by
$$\hat{\Theta}_t^N = \Theta + \Big(\int_0^t M_s\,\mathrm{d}s\Big)^{-1}\Big(\int_0^t \frac{1}{N}\sum_{i=1}^N (\bar{X}_s^N - X_s^{N,i}) \otimes \mathrm{d}W_s^i\Big), \qquad (14)$$
where we recall $M_s = N^{-1}\sum_{i=1}^N (\bar{X}_s^N - X_s^{N,i})(\bar{X}_s^N - X_s^{N,i})^\top$ and $a \otimes b$ denotes the tensor product of two vectors $a$ and $b$. Because $(\boldsymbol{X}_t^N)_{t\ge 0}$ in the form of (7) is an $\mathbb{R}^{Nd}$-dimensional OU process with a chaotic Gaussian initialization, we may diagonalize $\Theta$ and assume that $\Theta$ is a diagonal matrix with diagonal entries $\theta_1, \dots, \theta_d$ without changing the distribution of $(\boldsymbol{X}_t^N)_{t\ge 0}$.
Hence, in the rest of the proof, we assume that $\Theta = \mathrm{diag}(\theta)$, where $\theta = (\theta_1, \dots, \theta_d)^\top$ with $\theta_1 \ge \dots \ge \theta_d > 0$, and estimate only the diagonal entries by
$$\hat{\theta}_{t,j}^N = \theta_j + \frac{\frac{1}{N}\sum_{i=1}^N \int_0^t (\bar{X}_{s,j}^N - X_{s,j}^{N,i})\,\mathrm{d}W_{s,j}^i}{\frac{1}{N}\sum_{i=1}^N \int_0^t |\bar{X}_{s,j}^N - X_{s,j}^{N,i}|^2\,\mathrm{d}s}, \qquad j = 1, \dots, d. \qquad (15)$$
We shall analyze the numerator and the denominator of the error term in (15) separately. Let $(Y_t^i)_{t\ge 0}$, $i = 1, \dots, N$, be independent copies of the OU process driven by the same Brownian motions $(W_t^i)_{t\ge 0}$ as in $(X_t^{N,i})_{t\ge 0}$, namely,
$$\mathrm{d}Y_t^i = -\Theta Y_t^i\,\mathrm{d}t + \mathrm{d}W_t^i \quad \text{with } Y_0^i = \xi^i.$$
Below we fix a coordinate $j \in \{1, \dots, d\}$.

Numerator.
Write
$$\frac{1}{N}\sum_{i=1}^N \int_0^t (\bar{X}_{s,j}^N - X_{s,j}^{N,i})\,\mathrm{d}W_{s,j}^i = \frac{1}{N}\sum_{i=1}^N \int_0^t (\bar{X}_{s,j}^N - X_{s,j}^{N,i} + Y_{s,j}^i)\,\mathrm{d}W_{s,j}^i - \frac{1}{N}\sum_{i=1}^N \int_0^t Y_{s,j}^i\,\mathrm{d}W_{s,j}^i.$$
By (25) in Lemma 5.2 and (27) in Lemma 5.5, we have for any $\varepsilon \ge e^{-cN}$, with probability at least $1 - 8\varepsilon$,
$$\Big|\frac{1}{N}\sum_{i=1}^N \int_0^t (\bar{X}_{s,j}^N - X_{s,j}^{N,i})\,\mathrm{d}W_{s,j}^i\Big| \le 2\sqrt{\frac{2t\log(1/\varepsilon)}{N\theta_j}}. \qquad (16)$$
Denominator.
Write
$$\frac{1}{N}\sum_{i=1}^N \int_0^t |\bar{X}_{s,j}^N - X_{s,j}^{N,i}|^2\,\mathrm{d}s = \frac{1}{N}\sum_{i=1}^N \int_0^t \big(|\bar{X}_{s,j}^N - X_{s,j}^{N,i}|^2 - |Y_{s,j}^i|^2\big)\,\mathrm{d}s + \frac{1}{N}\sum_{i=1}^N \int_0^t \big(|Y_{s,j}^i|^2 - \mathbb{E}|Y_{s,j}^i|^2\big)\,\mathrm{d}s + \int_0^t \mathbb{E}|Y_{s,j}^i|^2\,\mathrm{d}s.$$
By (24) in Lemma 5.2 and (26) in Lemma 5.5 with $N \ge 400$ and $\varepsilon \ge e^{-cN}$, we have with probability at least $1 - 6\varepsilon$,
$$\Big|\frac{1}{N}\sum_{i=1}^N \int_0^t \big(|Y_{s,j}^i|^2 - \mathbb{E}|Y_{s,j}^i|^2\big)\,\mathrm{d}s\Big| \le \frac{t}{\theta_j}\Big(\frac{\log(1/\varepsilon)}{N} + \sqrt{\frac{\log(1/\varepsilon)}{2N}}\Big) \le \frac{t}{40\theta_j},$$
$$\Big|\frac{1}{N}\sum_{i=1}^N \int_0^t \big(|\bar{X}_{s,j}^N - X_{s,j}^{N,i}|^2 - |Y_{s,j}^i|^2\big)\,\mathrm{d}s\Big| \le C(\varepsilon, N)\,\frac{t}{\theta_j} \le \frac{t}{5\theta_j},$$
where $C(\varepsilon, N)$ is the term defined in (26). By (23) in Lemma 5.1, we have
$$\int_0^t \mathbb{E}|Y_{s,j}^i|^2\,\mathrm{d}s = \frac{1}{2\theta_j}\int_0^t \big[1 - (1 - 2\theta_j\tau_j^2)e^{-2\theta_j s}\big]\,\mathrm{d}s \ge \frac{1}{2\theta_j}\int_0^t (1 - e^{-2\theta_j s})\,\mathrm{d}s = \frac{1}{2\theta_j}\Big[t - \frac{1}{2\theta_j}(1 - e^{-2\theta_j t})\Big] = \frac{t}{2\theta_j}\Big(1 - \frac{1 - e^{-2\theta_j t}}{2\theta_j t}\Big).$$
Note that the function $x \mapsto (1 - e^{-x})/x$ is positive and strictly decreasing on $(0, \infty)$. Thus if $t \ge \max_{1\le j\le d}\theta_j^{-1}$, then
$$\max_{1\le j\le d}\frac{1 - e^{-2\theta_j t}}{2\theta_j t} \le \frac{1 - e^{-2}}{2},$$
and consequently we have
$$\int_0^t \mathbb{E}|Y_{s,j}^i|^2\,\mathrm{d}s \ge \frac{(1 + e^{-2})\,t}{4\theta_j}.$$
Thus the triangle inequality yields
$$\frac{1}{N}\sum_{i=1}^N \int_0^t |\bar{X}_{s,j}^N - X_{s,j}^{N,i}|^2\,\mathrm{d}s \ge \Big(\frac{1 + e^{-2}}{4} - \frac{1}{40} - \frac{1}{5}\Big)\frac{t}{\theta_j} \ge \frac{t}{20\theta_j}. \qquad (17)$$
Now combining (15), (16), and (17), we obtain that with probability at least $1 - 14\varepsilon$,
$$|\hat{\theta}_{t,j}^N - \theta_j| \lesssim \sqrt{\frac{\theta_j\log(1/\varepsilon)}{Nt}}.$$
Summing over $j = 1, \dots, d$ (bounding the spectral norm by the Frobenius norm) and applying the union bound with $\varepsilon$ replaced by $\varepsilon/(14d)$, we conclude that (13) holds with probability at least $1 - \varepsilon$. $\square$

4. Discussion
This paper derives a quantitative error bound for the MLE of the interaction parameter of the quadratic potential energy in a large interacting particle system. It turns out that symmetry of the particle interaction is the key structure for obtaining the optimal rate of convergence for such high-dimensional (linear) dynamical systems without regularization techniques.
Our future plan is to extend the current results from linear interactions to learning some first-order nonlinear SDEs. A popular nonlinear interacting system is given by the following model:
$$\mathrm{d}X_t^{N,i} = \frac{1}{N}\sum_{j=1}^N \phi\big(\|X_t^{N,j} - X_t^{N,i}\|\big)\,\frac{X_t^{N,j} - X_t^{N,i}}{\|X_t^{N,j} - X_t^{N,i}\|}\,\mathrm{d}t + \sigma\,\mathrm{d}W_t^i, \qquad (18)$$
where $\phi : \mathbb{R}_+ \to \mathbb{R}$ is an interaction kernel depending only on the pairwise distance between particles. We highlight that the linear-elasticity interacting particle system (2) has a different structure from (18): the latter implicitly assumes that the interaction kernel is isotropic and one-dimensional. The stochastic system (18) may capture other types of physical forces of attraction and repulsion that depend solely on the Euclidean distance. Nonparametric inference of such interaction kernels based on noiseless trajectory data has been studied in [2, 13, 12]. In the presence of noise, similarly as in Section 2, we can write (18) in $\mathbb{R}^{Nd}$ as
$$\mathrm{d}\boldsymbol{X}_t^N = -\nabla V_\phi(\boldsymbol{X}_t^N)\,\mathrm{d}t + \sigma\,\mathrm{d}\boldsymbol{W}_t, \qquad (19)$$
where, for $\boldsymbol{X} = (X^1, \dots, X^N) \in \mathbb{R}^{Nd}$, the potential $V_\phi : \mathbb{R}^{Nd} \to \mathbb{R}$ is given by $V_\phi(\boldsymbol{X}) = N^{-1}\sum_{1\le i<j\le N}\Phi(\|X^j - X^i\|)$ with $\Phi'(r) = \phi(r)$. Then the negative log-likelihood function has a similar form as in (10):
$$\ell_t^N(\phi) = \frac{1}{2}\int_0^t \|\nabla V_\phi(\boldsymbol{X}_s^N)\|^2\,\mathrm{d}s + \int_0^t \nabla V_\phi(\boldsymbol{X}_s^N) \cdot \mathrm{d}\boldsymbol{X}_s^N, \qquad (20)$$
and the MLE is defined as
$$\hat{\phi}_t^N = \operatorname*{argmin}_{\psi\in\mathcal{H}} \ell_t^N(\psi)$$
for some reasonably rich class of functions $\mathcal{H}$. Heuristically, the propagation-of-chaos phenomenon allows us to control the difference between $(\boldsymbol{X}_t^N)$ and its independent analog $(\boldsymbol{Y}_t^N)$ [20]. Thus, for each $t$, replacing $(\boldsymbol{X}_t^N)$ by $(\boldsymbol{Y}_t^N)$ in the log-likelihood function (20) incurs a uniform decoupling error $O(N^{-1/2})$. In the linear interaction case, the propagation of chaos is implemented non-asymptotically by the decoupling concentration bounds (cf. Lemma 5.5). If the gradient of the potential energy $V_\phi$ is Lipschitz, then the martingale term $\int_0^t \nabla V_\phi(\boldsymbol{Y}_s^N) \cdot \mathrm{d}\boldsymbol{W}_s$ corresponding to the noise fluctuation can be upper bounded using Lemma 5.2. On the other hand, the signal distortion is controlled by the curvature of the (negative) log-likelihood function $\ell_t^N(\cdot)$. For instance, if $V_\phi$ is strongly convex, then we expect the signal to be bounded from below. Thus the separation between noise and signal allows us to control the estimation error. We leave the derivation of rigorous results to future work.
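For completeness, the nonlinear model (18) is also easy to simulate; the following sketch (ours, with an arbitrary illustrative kernel) vectorizes the pairwise interactions. With the Hookean kernel $\phi(r) = r$ it reduces to the linear system (2) with $\Theta = I_d$.

```python
import numpy as np

def simulate_kernel_system(phi, N, d, T, dt, sigma=1.0, seed=3):
    """Euler-Maruyama sketch of the nonlinear system (18) for a scalar kernel
    phi(r) of the pairwise distance r; all parameters are illustrative."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, d))
    for _ in range(int(T / dt)):
        diff = X[None, :, :] - X[:, None, :]         # diff[i, j] = X^j - X^i
        r = np.linalg.norm(diff, axis=-1)            # pairwise distances
        w = np.where(r > 0, phi(r) / np.maximum(r, 1e-12), 0.0)  # phi(r)/r, 0 on diagonal
        drift = (w[:, :, None] * diff).mean(axis=1)  # (1/N) sum_j phi(r)(X^j - X^i)/r
        X = X + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal((N, d))
    return X

# Hookean kernel phi(r) = r recovers the linear dynamics (2) with Theta = I_d:
X_T = simulate_kernel_system(lambda r: r, N=200, d=2, T=5.0, dt=1e-2)
```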
5. Auxiliary lemmas

5.1. Ornstein-Uhlenbeck process.
The one-dimensional Ornstein-Uhlenbeck (OU) process $(Y_t)_{t\ge 0}$ is a mean-reverting stochastic process defined by the following stochastic differential equation:
$$\mathrm{d}Y_t = -\theta Y_t\,\mathrm{d}t + \sigma\,\mathrm{d}W_t, \qquad (21)$$
where $(W_t)_{t\ge 0}$ is the standard Brownian motion in $\mathbb{R}$ such that $\mathbb{E}[W_t] = 0$, $\theta > 0$, and $\sigma > 0$. The stationary distribution of $Y_t$ is $N(0, \frac{\sigma^2}{2\theta})$. Lemma 5.1 below gives the explicit formula for the mean and variance at any finite time point $t \in (0, \infty)$.
Lemma 5.1. Let $(Y_t)_{t\ge 0}$ be the OU process defined in (21). If $Y_0 = \xi$ for some mean-zero random variable $\xi$ with variance $\tau^2$, then for all $t \ge 0$ we have
$$\mathbb{E}[Y_t] = 0, \qquad (22)$$
$$\mathbb{E}[Y_t^2] = \frac{1}{2\theta}\big[\sigma^2 - (\sigma^2 - 2\theta\tau^2)e^{-2\theta t}\big]. \qquad (23)$$
In particular, if $Y_0 = 0$, then $Y_t \sim N\big(0, \frac{\sigma^2}{2\theta}(1 - e^{-2\theta t})\big)$; if $Y_0 \sim N(0, \frac{\sigma^2}{2\theta})$, then $Y_t \sim N(0, \frac{\sigma^2}{2\theta})$ for all $t \ge 0$.

Proof of Lemma 5.1.
Since $(Y_t)_{t\ge 0}$ starts from a mean-zero random variable, taking expectations on both sides of (21), we have
$$\frac{\mathrm{d}}{\mathrm{d}t}\mathbb{E}[Y_t] = -\theta\,\mathbb{E}[Y_t],$$
which implies that $\mathbb{E}[Y_t] = \mathbb{E}[Y_0]e^{-\theta t} = 0$. To prove (23), we apply Itô's formula to get
$$\mathrm{d}Y_t^2 = (-2\theta Y_t^2 + \sigma^2)\,\mathrm{d}t + 2\sigma Y_t\,\mathrm{d}W_t.$$
Taking expectations on both sides of the last equation to kill the martingale term $\int_0^t Y_s\,\mathrm{d}W_s$, we get
$$\frac{\mathrm{d}}{\mathrm{d}t}\mathbb{E}[Y_t^2] = -2\theta\,\mathbb{E}[Y_t^2] + \sigma^2,$$
whose solution for $\mathbb{E}[Y_t^2]$ subject to the initial condition $\mathbb{E}[Y_0^2] = \tau^2$ is given by (23). $\square$
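As a sanity check on (23) (our addition; the settings are arbitrary), a direct Euler-Maruyama simulation of (21) reproduces the variance formula to within discretization and Monte Carlo error:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sigma, tau, t, dt, M = 2.0, 1.0, 0.8, 1.0, 1e-3, 100_000
Y = tau * rng.standard_normal(M)   # M independent copies of Y_0, variance tau^2
for _ in range(int(t / dt)):
    Y += -theta * Y * dt + sigma * np.sqrt(dt) * rng.standard_normal(M)
empirical = (Y ** 2).mean()
formula = (sigma**2 - (sigma**2 - 2 * theta * tau**2) * np.exp(-2 * theta * t)) / (2 * theta)
print(empirical, formula)          # the two should agree to ~1%
```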
This section provides key technical results for boundingthe fluctuation of the OU process and the decoupling error of the N -particle system by theassociated N independent OU processes. Lemma . Let ( Y t ) t > be the one-dimensionalOU process with independent coordinates defined in (21) and Y = ξ , where ξ ∼ N (0 , τ ).Suppose that ( Y t ) t > , . . . , ( Y Nt ) t > are independent copies of ( Y t ) t > . Then we have for any ε ∈ (0 , P (cid:16)(cid:12)(cid:12)(cid:12) N N X i =1 Z t ( | Y is | − E | Y is | )d s (cid:12)(cid:12)(cid:12) > tσ θ (cid:16) log(1 /ε ) N + r log(1 /ε )2 N (cid:17)(cid:17) ε, (24)and for any ε > e − N/ , P (cid:16)(cid:12)(cid:12)(cid:12) N N X i =1 Z t Y is d W is (cid:12)(cid:12)(cid:12) > σ r t log(1 /ε ) N θ (cid:17) ε. (25) Proof of Lemma 5.2.
First we prove (24). In view of Lemma 5.1, we may assume without loss of generality that $Y_0 \sim N(0, \frac{\sigma^2}{2\theta})$, which by rescaling gives the worst-case error bound in (24). In this case, $Y_t \sim N(0, \frac{\sigma^2}{2\theta})$ for all $t \ge 0$. Denote $U_t = \frac{1}{N}\sum_{i=1}^N (|Y_t^i|^2 - \mathbb{E}|Y_t^i|^2)$ and let $\lambda > 0$. By Jensen's inequality,
$$\mathbb{E}\Big[\exp\Big(\lambda\int_0^t U_s\,\mathrm{d}s\Big)\Big] \le \frac{1}{t}\int_0^t \mathbb{E}[\exp(t\lambda U_s)]\,\mathrm{d}s.$$
Let $w = \frac{t\lambda\sigma^2}{2N\theta}$. By independence, we have for $0 < w < \frac{1}{2}$,
$$\mathbb{E}[\exp(t\lambda U_s)] = \prod_{i=1}^N \mathbb{E}\exp\Big[\frac{t\lambda}{N}\big(|Y_s^i|^2 - \mathbb{E}|Y_s^i|^2\big)\Big] \le \prod_{i=1}^N \exp\Big(\frac{w^2}{1 - 2w}\Big),$$
where the last inequality follows from Lemma 5.3. Combining the last two displays, we get
$$\log\Big(\mathbb{E}\Big[\exp\Big(\lambda\int_0^t U_s\,\mathrm{d}s\Big)\Big]\Big) \le \frac{Nw^2}{1 - 2w}.$$
It then follows from Lemma 5.4 that for all $x > 0$,
$$\mathbb{P}\Big(\int_0^t U_s\,\mathrm{d}s > \frac{t\sigma^2}{\theta}\Big(\frac{x}{N} + \sqrt{\frac{x}{2N}}\Big)\Big) \le e^{-x}.$$
Likewise, the lower-tail bound follows along similar lines, and we have
$$\mathbb{P}\Big(\int_0^t U_s\,\mathrm{d}s < -\frac{t\sigma^2}{\theta}\sqrt{\frac{x}{N}}\Big) \le e^{-x}.$$
Choosing $x = \log(1/\varepsilon)$, we obtain (24).
Next we prove (25). Denote $Z_t^i = \int_0^t Y_s^i\,\mathrm{d}W_s^i$. Clearly $\mathbb{E}[Z_t^i] = 0$. Moreover, $(Z_t^1)_{t\ge 0}, \dots, (Z_t^N)_{t\ge 0}$ are independent continuous local martingales vanishing at zero, with quadratic variation
$$[Z^i]_t = \int_0^t |Y_s^i|^2\,\mathrm{d}s \qquad \text{and} \qquad \mathbb{E}[Z^i]_t \le \frac{t\sigma^2}{2\theta},$$
where the last inequality follows from (23) in Lemma 5.1. Observe that the averaged process $\bar{Z}_s^N = \frac{1}{N}\sum_{i=1}^N Z_s^i$ is also a continuous local martingale vanishing at zero, with quadratic variation $[\bar{Z}^N]_t = \frac{1}{N^2}\sum_{i=1}^N \int_0^t |Y_s^i|^2\,\mathrm{d}s$ and $\mathbb{E}[\bar{Z}^N]_t \le \frac{t\sigma^2}{2\theta N}$. Using (24) (one-sided version), we see that for $\varepsilon \ge e^{-cN}$,
$$\mathbb{P}\Big([\bar{Z}^N]_t - \mathbb{E}[\bar{Z}^N]_t > \frac{t\sigma^2}{2\theta N}\Big) \le \varepsilon.$$
By Bernstein's inequality for continuous local martingales (cf. Exercise (3.16) on page 145 in [18]), we have
$$\mathbb{P}\Big(\sup_{0\le s\le t}\bar{Z}_s^N > \sigma\sqrt{\frac{t}{N}}\,x\Big) \le \mathbb{P}\Big(\sup_{0\le s\le t}\bar{Z}_s^N > \sigma\sqrt{\frac{t}{N}}\,x,\ [\bar{Z}^N]_t \le \frac{t\sigma^2}{\theta N}\Big) + \varepsilon \le \exp\Big(-\frac{\theta x^2}{2}\Big) + \varepsilon.$$
Choosing $x = \sqrt{2\log(1/\varepsilon)/\theta}$, we have with probability at least $1 - 2\varepsilon$,
$$\sup_{0\le s\le t}\bar{Z}_s^N \le \sigma\sqrt{\frac{2t\log(1/\varepsilon)}{\theta N}}.$$
Applying the same argument to $-Z_t^i$, we get (25). $\square$
Lemma 5.3 (Laplace transform of the $\chi_1^2$ distribution). Let $Z \sim N(0, 1)$ be a standard Gaussian random variable and let $\phi$ denote the logarithm of the Laplace transform of $Z^2 - 1$:
$$\phi(u) = \log\big(\mathbb{E}[\exp(u(Z^2 - 1))]\big) = -u - \frac{1}{2}\log(1 - 2u), \qquad u < \frac{1}{2}.$$
In particular,
$$\phi(u) \le \begin{cases} \dfrac{u^2}{1 - 2u} & \text{for } 0 < u < \dfrac{1}{2}, \\[2mm] u^2 & \text{for } -\dfrac{1}{2} < u < 0. \end{cases}$$

Proof of Lemma 5.3.
See the proof of Lemma 1 in [10]. $\square$
Lemma 5.4. If a random variable $Z$ satisfies $\log(\mathbb{E}[\exp(uZ)]) \le \frac{vu^2}{1 - cu}$ for some $v > 0$, $c > 0$ and all $0 < u < 1/c$, then for all $x > 0$,
$$\mathbb{P}\big(Z > cx + \sqrt{2vx}\big) \le e^{-x}.$$

Proof of Lemma 5.4.
See the proof of Lemma 8 in [1]. $\square$
Lemma 5.5 (Decoupling of the $N$-particle system). Suppose that the one-dimensional $N$-particle system $(X_t^{N,1})_{t\ge 0}, \dots, (X_t^{N,N})_{t\ge 0}$ defined in (2) is initialized at $X_0^{N,i} = \xi^i$ for i.i.d. Gaussian random variables $\xi^1, \dots, \xi^N \sim N(0, \tau^2)$, and let $(Y_t^i)_{t\ge 0}$ be the associated independent OU processes $\mathrm{d}Y_t^i = -\theta Y_t^i\,\mathrm{d}t + \sigma\,\mathrm{d}W_t^i$ with $Y_0^i = \xi^i$, driven by the same Brownian motions. Then we have for any $\varepsilon \ge e^{-cN}$,
$$\mathbb{P}\Big(\Big|\frac{1}{N}\sum_{i=1}^N \int_0^t \big(|\bar{X}_s^N - X_s^{N,i}|^2 - |Y_s^i|^2\big)\,\mathrm{d}s\Big| \ge \frac{t\sigma^2}{\theta}\,C(\varepsilon, N)\Big) \le 4\varepsilon, \qquad (26)$$
where $C(\varepsilon, N) = \sqrt{2C_1(\varepsilon, N)\big(2C_1(\varepsilon, N) + 8C_2(\varepsilon, N)\big)}$ with
$$C_1(\varepsilon, N) = \frac{1}{2N} + \frac{\log(1/\varepsilon)}{N} + \frac{1}{N}\sqrt{\frac{\log(1/\varepsilon)}{2}}, \qquad C_2(\varepsilon, N) = \frac{1}{2} + \frac{\log(1/\varepsilon)}{N} + \sqrt{\frac{\log(1/\varepsilon)}{2N}}.$$
Moreover, we also have for any $\varepsilon \in (0, 1)$,
$$\mathbb{P}\Big(\Big|\frac{1}{N}\sum_{i=1}^N \int_0^t (\bar{X}_s^N - X_s^{N,i} + Y_s^i)\,\mathrm{d}W_s^i\Big| > \sigma\sqrt{\frac{2t\log(1/\varepsilon)}{N^2\theta}}\Big) \le 4\varepsilon. \qquad (27)$$

Proof of Lemma 5.5.
Without loss of generality, we may assume $\sigma = 1$. Denote $\Delta_t^{N,i} = \bar{X}_t^N - X_t^{N,i} + Y_t^i$. Recall the $N$-particle system and the associated approximating $N$ independent OU processes: for $i = 1, \dots, N$,
$$\mathrm{d}X_t^{N,i} = \theta(\bar{X}_t^N - X_t^{N,i})\,\mathrm{d}t + \mathrm{d}W_t^i, \qquad \mathrm{d}Y_t^i = -\theta Y_t^i\,\mathrm{d}t + \mathrm{d}W_t^i,$$
where $(X_t^{N,i})$ and $(Y_t^i)$ are driven by the same Brownian motion $(W_t^i)$. Averaging the processes $(X_t^{N,i})$ over $i = 1, \dots, N$, we get
$$\mathrm{d}\bar{X}_t^N = \mathrm{d}\bar{W}_t^N := \frac{1}{N}\sum_{i=1}^N \mathrm{d}W_t^i,$$
i.e., the rescaled mean process $(\sqrt{N}\bar{X}_t^N)$ of the $N$ particles evolves as a standard Brownian motion. Combining the last three displays, we obtain
$$\mathrm{d}(\bar{X}_t^N - X_t^{N,i} + Y_t^i) = \mathrm{d}\bar{W}_t^N - \theta(\bar{X}_t^N - X_t^{N,i} + Y_t^i)\,\mathrm{d}t,$$
which implies that the difference process $(\Delta_t^{N,i})$ between the $N$-particle system and the decoupled OU processes is itself an OU process with respect to $(\bar{W}_t^N)$, i.e.,
$$\mathrm{d}\Delta_t^{N,i} = -\theta\Delta_t^{N,i}\,\mathrm{d}t + \mathrm{d}\bar{W}_t^N \qquad \text{for all } i = 1, \dots, N. \qquad (28)$$
Equation (28) means that the processes $(\Delta_t^{N,1}), \dots, (\Delta_t^{N,N})$ are the same mean-reverting OU process, all driven by $(\bar{W}_t^N)$ and all started at $\Delta_0^{N,i} = \bar{X}_0^N$. Thus $\Delta_t^N = \Delta_t^{N,i}$ for all $i = 1, \dots, N$. Now applying (25) in Lemma 5.2 to the process $(\Delta_t^N)$ with $\sigma = N^{-1/2}$, we obtain (27).
Next we prove (26). Note that
$$(\ast) := \Big|\frac{1}{N}\sum_{i=1}^N \int_0^t \big(|\bar{X}_s^N - X_s^{N,i}|^2 - |Y_s^i|^2\big)\,\mathrm{d}s\Big| \overset{(1)}{\le} \frac{1}{N}\sum_{i=1}^N \int_0^t |\bar{X}_s^N - X_s^{N,i} + Y_s^i|\,\big(|\bar{X}_s^N - X_s^{N,i}| + |Y_s^i|\big)\,\mathrm{d}s$$
$$\overset{(2)}{\le} \Big(\frac{2}{N}\sum_{i=1}^N \int_0^t |\Delta_s^{N,i}|^2\,\mathrm{d}s\Big)^{1/2}\Big(\frac{2}{N}\sum_{i=1}^N \int_0^t |\Delta_s^{N,i}|^2\,\mathrm{d}s + \frac{8}{N}\sum_{i=1}^N \int_0^t |Y_s^i|^2\,\mathrm{d}s\Big)^{1/2}$$
$$\overset{(3)}{=} \Big(2\int_0^t |\Delta_s^N|^2\,\mathrm{d}s\Big)^{1/2}\Big(2\int_0^t |\Delta_s^N|^2\,\mathrm{d}s + \frac{8}{N}\sum_{i=1}^N \int_0^t |Y_s^i|^2\,\mathrm{d}s\Big)^{1/2},$$
where (1) follows from the triangle inequality $\big||a| - |b|\big| \le |a - b|$ applied to $|a|^2 - |b|^2 = (|a| + |b|)(|a| - |b|)$, (2) from the Cauchy-Schwarz inequality together with a second application of the triangle inequality and the elementary inequality $(a + b)^2 \le 2a^2 + 2b^2$, and (3) from the fact that $(\Delta_t^{N,1}) = \dots = (\Delta_t^{N,N})$ are identical processes.
Recall that $\mathbb{E}\int_0^t |Y_s^i|^2\,\mathrm{d}s \le \frac{t}{2\theta}$. Then using (24) in Lemma 5.2, we have with probability at least $1 - \varepsilon$,
$$\frac{1}{N}\sum_{i=1}^N \int_0^t |Y_s^i|^2\,\mathrm{d}s \le \frac{t}{\theta}\Big(\frac{1}{2} + \frac{\log(1/\varepsilon)}{N} + \sqrt{\frac{\log(1/\varepsilon)}{2N}}\Big) = \frac{t}{\theta}\,C_2(\varepsilon, N).$$
Similarly, $\mathbb{E}\int_0^t |\Delta_s^{N,i}|^2\,\mathrm{d}s \le \frac{t}{2N\theta}$. By a second application of (24) in Lemma 5.2 with $\sigma = N^{-1/2}$ (and a single process in place of the $N$ independent copies), we have
$$\mathbb{P}\Big(\Big|\int_0^t \big(|\Delta_s^{N,i}|^2 - \mathbb{E}|\Delta_s^{N,i}|^2\big)\,\mathrm{d}s\Big| > \frac{t}{N\theta}\Big(\log(1/\varepsilon) + \sqrt{\frac{\log(1/\varepsilon)}{2}}\Big)\Big) \le 2\varepsilon,$$
which implies that with probability at least $1 - 2\varepsilon$,
$$\int_0^t |\Delta_s^{N,i}|^2\,\mathrm{d}s \le \frac{t}{N\theta}\Big(\frac{1}{2} + \log(1/\varepsilon) + \sqrt{\frac{\log(1/\varepsilon)}{2}}\Big) = \frac{t}{\theta}\,C_1(\varepsilon, N).$$
Then (26) follows from putting all the pieces together. $\square$
References

[1] Lucien Birgé and Pascal Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4(3):329–375, 1998.
[2] Mattia Bongini, Massimo Fornasier, Markus Hansen, and Mauro Maggioni. Inferring interaction rules from observations of evolutive systems I: The variational approach. Mathematical Models and Methods in Applied Sciences, 27(5):909–951, 2017.
[3] Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016.
[4] Peter Bühlmann and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, Heidelberg, 2011.
[5] M. R. D'Orsogna, Y. L. Chuang, A. L. Bertozzi, and L. S. Chayes. Self-propelled particles with soft-core interactions: Patterns, stability, and collapse. Phys. Rev. Lett., 96:104302, 2006.
[6] Eran Eyal, Lee-Wei Yang, and Ivet Bahar. Anisotropic network model: systematic evaluation and a new web interface. Bioinformatics, 22(21):2619–2627, 2006.
[7] Turkan Haliloglu, Ivet Bahar, and Burak Erman. Gaussian dynamics of folded proteins. Phys. Rev. Lett., 79(16):3090–3093, 1997.
[8] Hui Huang, Jian-Guo Liu, and Jianfeng Lu. Learning interacting particle systems: Diffusion parameter estimation for aggregation equations. Mathematical Models and Methods in Applied Sciences, 29(1):1–29, 2019.
[9] Yury A. Kutoyants. Statistical Inference for Ergodic Diffusion Processes. Springer-Verlag, London, 2004.
[10] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Ann. Statist., 28(5):1302–1338, 2000.
[11] Zhongyang Li, Fei Lu, Mauro Maggioni, Sui Tang, and Cheng Zhang. On the identifiability of interaction functions in systems of interacting particles. arXiv:1912.11965, 2019.
[12] Fei Lu, Mauro Maggioni, and Sui Tang. Learning interaction kernels in heterogeneous systems of agents from multiple trajectories. arXiv:1910.04832, 2019.
[13] Fei Lu, Ming Zhong, Sui Tang, and Mauro Maggioni. Nonparametric inference of interaction laws in systems of agents from trajectory data. Proceedings of the National Academy of Sciences, 116(29):14424–14433, 2019.
[14] H. P. McKean. A class of Markov processes associated with nonlinear parabolic equations. Proceedings of the National Academy of Sciences, 56(6):1907–1911, 1966.
[15] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
[16] Alexander Mogilner and Leah Edelstein-Keshet. A non-local model for a swarm. Journal of Mathematical Biology, 38(6):534–570, 1999.
[17] Sebastien Motsch and Eitan Tadmor. Heterophilious dynamics enhances consensus. SIAM Review, 56(4):577–621, 2014.
[18] Daniel Revuz and Marc Yor. Continuous Martingales and Brownian Motion. Springer-Verlag, 1991.
[19] Hayden Schaeffer, Giang Tran, and Rachel Ward. Extracting sparse high-dimensional dynamics from limited data. SIAM Journal on Applied Mathematics, 78(6):3279–3295, 2018.
[20] Alain-Sol Sznitman. Topics in propagation of chaos. In Paul-Louis Hennequin, editor, École d'Été de Probabilités de Saint-Flour XIX — 1989, pages 165–251. Springer, Berlin, Heidelberg, 1991.
[21] Chad M. Topaz, Andrea L. Bertozzi, and Mark A. Lewis. A nonlocal continuum model for biological aggregation. Bulletin of Mathematical Biology, 68(7):1601, 2006.
[22] Jianghui Wen, Xiangjun Wang, Shuhua Mao, and Xinping Xiao. Maximum likelihood estimation of McKean-Vlasov stochastic differential equation and its application. Applied Mathematics and Computation, 274:237–246, 2016.
Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. Wright Street, Champaign, IL 61820
E-mail: [email protected]