On Dantzig and Lasso estimators of the drift in a high dimensional Ornstein-Uhlenbeck model
Gabriela Ciołek, Dmytro Marushkevych, Mark Podolskij

August 4, 2020
Abstract
In this paper we present new theoretical results for the Dantzig and Lasso estimators of the drift in a high dimensional Ornstein-Uhlenbeck model under sparsity constraints. Our focus is on oracle inequalities for both estimators and error bounds with respect to several norms. In the context of the Lasso estimator our paper is strongly related to [11], who investigated the same problem under row sparsity. We improve their rates and also prove the restricted eigenvalue property solely under the ergodicity assumption on the model. Finally, we present a numerical analysis to uncover the finite sample performance of the Dantzig and Lasso estimators.
Key words: Dantzig estimator, high dimensional statistics, Lasso, Ornstein-Uhlenbeck process, parametric estimation.
AMS 2010 subject classifications.
During the past decades immense progress has been achieved in statistics for stochastic processes. Nowadays, comprehensive studies on statistical inference for diffusion processes under low and high frequency observation schemes can be found in the monographs [13, 16, 18]. Most of the existing literature considers a fixed dimensional parameter space, while the high dimensional framework has received much less attention in the diffusion setting.

Since the pioneering work of McKean [19, 20], high dimensional diffusions entered the scene in the context of modelling the movement of gas particles. More recently, they have found numerous applications in economics and biology, among other disciplines [3, 6, 9]. Typically, high dimensional diffusions are studied in the framework of mean field theory, which aims at bridging the interaction of particles at the microscopic scale and the mesoscopic features of the system (see e.g. [25] for a mathematical study). In physics particles are often assumed to be statistically equal, but this homogeneity assumption is not appropriate in other applications. For instance, in [6] high dimensional SDEs are used to model the wealth of trading agents in an economy, who are often far from being equal in their trading behaviour. Another example is the flocking phenomenon of individuals [3], where it seems natural to assume that there are only very few “leaders” who have a distinguished role in the community.

∗ The authors gratefully acknowledge financial support of ERC Consolidator Grant 815703 “STAMFORD: Statistical Methods for High Dimensional Diffusions”.
† Department of Mathematics, University of Luxembourg. E-mail: [email protected].
‡ Department of Mathematics, University of Luxembourg. E-mail: [email protected].
§ Department of Mathematics, University of Luxembourg. E-mail: [email protected].
These examples motivate the investigation of statistical inference for diffusion processes under sparsity constraints.

This paper focuses on the statistical analysis of a $d$-dimensional Ornstein-Uhlenbeck model of the form
$$ dX_t = -A_0 X_t \, dt + dW_t, \qquad t \geq 0, \qquad (1.1) $$
defined on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \geq 0}, \mathbb{P})$, with underlying observation $(X_t)_{t \in [0,T]}$. Here $W$ denotes a standard $d$-dimensional Brownian motion and $A_0 \in \mathbb{R}^{d \times d}$ represents the unknown interaction matrix. Ornstein-Uhlenbeck processes are one of the most basic parametric diffusion models. When the dimension $d$ is fixed and $T \to \infty$, statistical estimation of the parameter $A_0$ has been discussed in several papers. Asymptotic analysis of the maximum likelihood estimator in the ergodic case can be found in e.g. [16], while investigations of the non-ergodic setting can be found in [15, 17]. Adaptive Lasso estimation for multivariate diffusion models has been investigated in [7].

Our main goal is to study the estimation of $A_0$ under sparsity constraints in the large $d$/large $T$ setting. Such a mathematical problem finds its main motivation in the analysis of the connectedness of banks whose wealth is modelled by the diffusion process $X$. This field of economics, which studies linkages between a large number of banks associated with e.g. asset/liability positions and contractual relationships, is key to understanding systemic risk in a global economy [12]. Typically, the connectivity structure, which is represented by the parameter $A_0$, is quite sparse since only few financial players are significant in an economy, and the main focus is on estimation of the non-zero components of $A_0$.

Theoretical results in the high dimensional diffusion setting are rather scarce. In this context we would like to mention the Dantzig selector, which was introduced in [5] and primarily designed for linear regression models.
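For illustration, a path of the sparse model (1.1) can be generated with a simple Euler-Maruyama scheme. The sketch below is ours, not the paper's code, and the concrete choices of $A_0$, step size and horizon are illustrative assumptions only:

```python
import numpy as np

def simulate_ou(A0, T=50.0, dt=0.01, rng=None):
    """Euler-Maruyama discretisation of dX_t = -A0 X_t dt + dW_t.

    Returns the sampled path X of shape (n_steps + 1, d); this is only
    a numerical approximation of the continuous-time observation in (1.1).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d = A0.shape[0]
    n = int(T / dt)
    X = np.zeros((n + 1, d))
    for k in range(n):
        dW = rng.normal(0.0, np.sqrt(dt), size=d)
        X[k + 1] = X[k] - A0 @ X[k] * dt + dW
    return X

# A sparse interaction matrix whose eigenvalues have strictly positive
# real parts (here guaranteed by the triangular structure), in line with
# the ergodicity assumption of the model.
A0 = np.diag([1.0, 1.2, 0.8, 1.5])
A0[0, 2] = 0.3   # a single off-diagonal interaction
X = simulate_ou(A0)
print(X.shape)   # (5001, 4)
```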
More specifically, [5] established sharp non-asymptotic bounds on the $\ell_2$-error in the estimated coefficients and proved that the error is within a factor of $\log(d)$ of the error that would have been achieved if the locations of the non-zero coefficients were known. Further extensions of the aforementioned results can be found in [10] and [23], which study the Dantzig selector for discretely observed linear diffusions and support recovery for the drift coefficient, respectively. Our work is closely related to the recent article [11], where estimation of $A_0$ under row sparsity has been investigated. The authors propose to use the classical Lasso approach and derive upper and lower bounds for the estimation error. We build upon their analysis and provide oracle inequalities and non-asymptotic theory for the Lasso and Dantzig estimators. In comparison to [11], we obtain an improved upper bound for the Lasso estimator, which essentially matches the theoretical lower bound, and we also show that the restricted eigenvalue property is automatically satisfied under the ergodicity condition on the model (1.1) (in [11] the extra assumption (H4) has been imposed). The latter is proved via Malliavin calculus methods proposed in [21]. Moreover, we show that the Lasso and Dantzig estimators are asymptotically equivalent, which is a well known fact in linear regression models (cf. [2]). Finally, we present a simulation study to uncover the finite sample properties of both estimators.

The paper is organised as follows. Section 2 is devoted to the exposition of the classical estimation theory in the fixed dimensional setting and to the definition of the Lasso and Dantzig estimators. Concentration inequalities for various stochastic terms are derived in Section 3. In particular, we show the restricted eigenvalue property under the ergodicity assumption via Malliavin calculus methods. In Section 4 we present oracle inequalities and error bounds for both estimators.
Numerical simulation results are demonstrated in Section 5. Finally, some proofs are collected in Section 6.

In this subsection we briefly introduce the main notation used throughout the paper. For a vector or a matrix $x$ the transpose of $x$ is denoted by $x^\top$. For $p \geq 1$ and $A \in \mathbb{R}^{d \times d}$, we define the $l_p$-norm as
$$ \|A\|_p := \Big( \sum_{1 \leq i \leq d, \, 1 \leq j \leq d} |A_{ij}|^p \Big)^{1/p}. $$
We denote by $\|A\|_\infty = \lim_{p \to \infty} \|A\|_p$ the maximum norm and set $\|A\|_0 := \sum_{1 \leq i \leq d, \, 1 \leq j \leq d} 1_{\{A_{ij} \neq 0\}}$. We associate to the Frobenius norm $\|\cdot\|_2$ the scalar product
$$ \langle A_1, A_2 \rangle_F := \mathrm{tr}(A_1^\top A_2), \qquad A_1, A_2 \in \mathbb{R}^{d \times d}, $$
where $\mathrm{tr}$ denotes the trace. For a symmetric matrix $A \in \mathbb{R}^{d \times d}$ we write $\lambda_{\max}(A)$, $\lambda_{\min}(A)$ for the largest and the smallest eigenvalue of $A$, respectively. We denote by $\|A\|_{\mathrm{op}} := \sqrt{\lambda_{\max}(A^\top A)}$ the operator norm of $A \in \mathbb{R}^{d \times d}$. For any $J \subset \{1, \ldots, d\} \times \{1, \ldots, d\}$ and $A \in \mathbb{R}^{d \times d}$, the matrix $A|_J$ is defined via $(A|_J)_{ij} := A_{ij} 1_{\{(i,j) \in J\}}$. For a quadratic matrix $A \in \mathbb{R}^{d \times d}$, $\mathrm{diag}(A)$ stands for the diagonal matrix satisfying $\mathrm{diag}(A)_{ii} = A_{ii}$. We also introduce the notation
$$ \mathcal{C}(s, c) := \big\{ A \in \mathbb{R}^{d \times d} \setminus \{0\} : \|A\|_1 \leq (1 + c) \|A|_{\mathcal{I}_s(A)}\|_1 \big\}, \qquad (2.1) $$
where $c > 0$ and $\mathcal{I}_s(A)$ is the set of coordinates of the $s$ largest elements of $A$. Furthermore, $\mathrm{vec}$ denotes the vectorisation operator and $\otimes$ stands for the Kronecker product. For $z \in \mathbb{C}$ we denote by $\mathrm{Re}(z)$ (resp. $\mathrm{Im}(z)$) the real (resp. imaginary) part of $z$. Finally, for stochastic processes $(X_t)_{t \in [0,T]}, (Y_t)_{t \in [0,T]} \in L^2([0,T], dt)$ we introduce the scalar product
$$ \langle X, Y \rangle_{L^2} := \frac{1}{T} \int_0^T X_t Y_t \, dt. $$
We consider the $d$-dimensional Ornstein-Uhlenbeck process introduced in (1.1).
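As a concrete reading of the notation above, the following snippet (a hypothetical helper of ours, not part of the paper) computes the entrywise $l_p$-norm, the restriction $A|_J$, and checks membership in the cone $\mathcal{C}(s, c)$ from (2.1), taking "$s$ largest elements" in absolute value:

```python
import numpy as np

def lp_norm(A, p):
    # Entrywise l_p-norm: (sum_ij |A_ij|^p)^(1/p).
    return (np.abs(A) ** p).sum() ** (1.0 / p)

def restrict(A, J):
    # A|_J keeps the entries indexed by J and sets the others to zero.
    out = np.zeros_like(A)
    for (i, j) in J:
        out[i, j] = A[i, j]
    return out

def in_cone(A, s, c):
    # Membership in C(s, c): ||A||_1 <= (1 + c) * ||A|_{I_s(A)}||_1,
    # with I_s(A) the coordinates of the s largest entries (in abs. value).
    idx = np.argsort(np.abs(A).ravel())[::-1][:s]
    J = [np.unravel_index(k, A.shape) for k in idx]
    return lp_norm(A, 1) <= (1 + c) * lp_norm(restrict(A, J), 1)

A = np.array([[3.0, 0.0], [0.1, -2.0]])
print(lp_norm(A, 1))           # 5.1
print(int((A != 0).sum()))     # ||A||_0 = 3
print(in_cone(A, s=2, c=1.0))  # True: the two largest entries carry most mass
```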
Throughout this paper the matrix $A_0$ is assumed to satisfy the following condition:

(H) The matrix $A_0$ is diagonalisable with eigenvalues $\theta_1, \ldots, \theta_d \in \mathbb{C}$, i.e.
$$ A_0 = P \, \mathrm{diag}(\theta_1, \ldots, \theta_d) P^{-1}, $$
where the column vectors of $P$ are eigenvectors of $A_0$. Furthermore, the eigenvalues $\theta_1, \ldots, \theta_d \in \mathbb{C}$ have strictly positive real parts:
$$ r_0 := \min_{1 \leq j \leq d} \mathrm{Re}(\theta_j) > 0. \qquad (2.2) $$
It is well known that under condition (H) the stochastic differential equation (1.1) exhibits a unique stationary solution, which can be written explicitly as
$$ X_t = \int_{-\infty}^t \exp(-(t - s) A_0) \, dW_s. $$
In this case we have that $X_t \sim \mathcal{N}(0, C_\infty)$ with
$$ C_\infty := \int_0^\infty \exp(-s A_0) \exp(-s A_0^\top) \, ds. $$
We assume that the complete path $(X_t)_{t \in [0,T]}$ is observed and we are interested in estimating the unknown parameter $A_0$. Let us briefly recall the classical maximum likelihood theory when $d$ is fixed and $T \to \infty$. When $\mathbb{P}_A^T$ denotes the law of the process (1.1) with transition matrix $A$ restricted to $\mathcal{F}_T$, the log-likelihood function is explicitly computed via Girsanov's theorem as
$$ \log\big( d\mathbb{P}_A^T / d\mathbb{P}_0^T \big) = -\int_0^T (A X_t)^\top dX_t - \frac{1}{2} \int_0^T (A X_t)^\top (A X_t) \, dt. \qquad (2.3) $$
Consequently, the maximum likelihood estimator $\widehat{A}_{\mathrm{ML}}$ is given by
$$ \widehat{A}_{\mathrm{ML}} = -\Big( \int_0^T dX_t X_t^\top \Big) \Big( \int_0^T X_t X_t^\top \, dt \Big)^{-1}. $$
Under condition (H) the estimator $\widehat{A}_{\mathrm{ML}}$ is asymptotically normal, i.e.
$$ \sqrt{T} \big( \mathrm{vec}(\widehat{A}_{\mathrm{ML}}) - \mathrm{vec}(A_0) \big) \stackrel{d}{\longrightarrow} \mathcal{N}_{d^2}\big( 0, C_\infty^{-1} \otimes \mathrm{id} \big) \qquad (2.4) $$
with $\mathrm{id}$ denoting the $d$-dimensional identity matrix. Indeed, we have the identity $\widehat{A}_{\mathrm{ML}} - A_0 = -\varepsilon_T \widehat{C}_T^{-1}$ with
$$ \varepsilon_T := \frac{1}{T} \int_0^T dW_t X_t^\top \quad \text{and} \quad \widehat{C}_T := \frac{1}{T} \int_0^T X_t X_t^\top \, dt \stackrel{\text{a.s.}}{\longrightarrow} C_\infty, \qquad (2.5) $$
and the result (2.4) follows from the standard martingale central limit theorem. We refer to [16, p.
120–124] for a more detailed exposition.

When assumption (H) is violated the asymptotic theory for the maximum likelihood estimator $\widehat{A}_{\mathrm{ML}}$ is more complex: the setting becomes non-ergodic if some eigenvalues satisfy $\mathrm{Re}(\theta_i) < 0$, and yet another regime appears when $\mathrm{Re}(\theta_i) = 0$ for some $i$'s.

Now we turn our attention to the large $d$/large $T$ setting. We consider the Ornstein-Uhlenbeck model (1.1) satisfying assumption (H) and assume that the unknown transition matrix $A_0$ satisfies the sparsity constraint
$$ \|A_0\|_0 \leq s_0. \qquad (2.6) $$
We remark that due to condition (2.2) it must necessarily hold that $s_0 \geq d$. A standard approach to estimate $A_0$ under the sparsity constraint (2.6) is the Lasso method, which has been investigated in [11] in the framework of an Ornstein-Uhlenbeck model. The Lasso estimator is defined as
$$ \widehat{A}_L := \operatorname*{argmin}_{A \in \mathbb{R}^{d \times d}} \big( \mathcal{L}_T(A) + \lambda \|A\|_1 \big) \quad \text{with} \quad \mathcal{L}_T(A) := -\frac{1}{T} \log\big( d\mathbb{P}_A^T / d\mathbb{P}_0^T \big), \qquad (2.7) $$
where $\lambda > 0$ is a tuning parameter. The estimator $\widehat{A}_L$ can be computed efficiently, since it is a solution of a convex optimisation problem.

Next, we introduce the Dantzig estimator of the parameter $A_0$. According to (2.3) the quantity $\mathcal{L}_T(A)$ can be written as
$$ \mathcal{L}_T(A) = \mathrm{tr}\Big( (\varepsilon_T - A_0 \widehat{C}_T) A^\top + \frac{1}{2} A \widehat{C}_T A^\top \Big) \quad \text{and hence} \quad \nabla \mathcal{L}_T(A) = \varepsilon_T - A_0 \widehat{C}_T + A \widehat{C}_T. \qquad (2.8) $$
We recall that $B_0$ belongs to the subdifferential of a convex function $f: \mathbb{R}^{d \times d} \to \mathbb{R}$ at point $B$, $B_0 \in \partial f(B)$, if $\langle B_0, A - B \rangle_F \leq f(A) - f(B)$ for all $A \in \mathbb{R}^{d \times d}$. In particular, any $B_0 \in \partial \|B\|_1$ satisfies the constraint $\|B_0\|_\infty \leq$
1. A necessary and sufficient condition for the minimiser in (2.7) is that $0$ belongs to the subdifferential of the function $A \mapsto \mathcal{L}_T(A) + \lambda \|A\|_1$. This implies that the Lasso estimator $\widehat{A}_L$ satisfies the constraint
$$ \big\| \widehat{A}_L \widehat{C}_T + \varepsilon_T - A_0 \widehat{C}_T \big\|_\infty \leq \lambda. \qquad (2.9) $$
Now, the Dantzig estimator $\widehat{A}_D$ of the parameter $A_0$ is defined as a matrix with the smallest $l_1$-norm that satisfies the inequality (2.9), i.e.
$$ \widehat{A}_D := \operatorname*{argmin}_{A \in \mathbb{R}^{d \times d}} \big\{ \|A\|_1 : \big\| A \widehat{C}_T + \varepsilon_T - A_0 \widehat{C}_T \big\|_\infty \leq \lambda \big\}. \qquad (2.10) $$
By definition of the Dantzig estimator we have that $\|\widehat{A}_D\|_1 \leq \|\widehat{A}_L\|_1$. In particular, when the tuning parameters $\lambda$ for the Lasso and Dantzig estimators are set to be the same, the Lasso estimate is always a feasible solution of the Dantzig selector minimisation problem, although it need not be the optimal one. This implies that, when the respective solutions are not identical, the Dantzig selector solution is sparser (in the $l_1$-norm) than the Lasso solution (see [14], Appendix A for details). From the computational point of view, the Dantzig estimator can be found numerically via linear programming for convex optimisation with constraints.

The following basic inequality, which is a direct consequence of the fact that $\mathcal{L}_T(\widehat{A}_L) + \lambda \|\widehat{A}_L\|_1 \leq \mathcal{L}_T(A) + \lambda \|A\|_1$ for all $A \in \mathbb{R}^{d \times d}$, provides the necessary basis for the analysis of the error $\widehat{A}_L - A_0$.

Lemma 2.1. ([11, Lemma 3]) For any $A \in \mathbb{R}^{d \times d}$ and $\lambda > 0$ it holds that
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 - \|(A - A_0) X\|_{L^2}^2 \leq 2 \langle \varepsilon_T, A - \widehat{A}_L \rangle_F - \|(A - \widehat{A}_L) X\|_{L^2}^2 + 2\lambda \big( \|A\|_1 - \|\widehat{A}_L\|_1 \big), $$
where the quantity $\varepsilon_T$ is defined in (2.5).
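Since (2.7) is convex, the Lasso estimator can be approximated from a discretely sampled path by proximal gradient descent (ISTA) with entrywise soft-thresholding. The sketch below is our illustration, not the paper's implementation: with the discretised statistics $B \approx \frac{1}{T}\int_0^T dX_t X_t^\top = \varepsilon_T - A_0 \widehat{C}_T$ and $\widehat{C}_T \approx \frac{1}{T}\int_0^T X_t X_t^\top dt$, the gradient is $\nabla \mathcal{L}_T(A) = B + A \widehat{C}_T$ as in (2.8), and at the minimiser the optimality condition forces the empirical form of the Dantzig constraint (2.9), $\|A \widehat{C}_T + B\|_\infty \leq \lambda$. All numerical choices (dimension, step size, $\lambda$) are assumptions for the demo:

```python
import numpy as np

def soft_threshold(M, t):
    # Proximal operator of t * ||.||_1 (entrywise soft-thresholding).
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def lasso_ou(X, dt, lam, n_iter=500):
    """ISTA for argmin_A  L_T(A) + lam * ||A||_1  from a sampled path X.

    Uses the discretised sufficient statistics
        B ~ (1/T) int dX_t X_t^T,   C ~ (1/T) int X_t X_t^T dt,
    so that grad L_T(A) = B + A C.  Illustrative sketch only.
    """
    T = (X.shape[0] - 1) * dt
    dX = np.diff(X, axis=0)
    B = dX.T @ X[:-1] / T
    C = (X[:-1].T @ X[:-1]) * dt / T
    eta = 1.0 / np.linalg.eigvalsh(C).max()   # step size 1 / ||C||_op
    A = np.zeros_like(C)
    for _ in range(n_iter):
        A = soft_threshold(A - eta * (B + A @ C), eta * lam)
    return A, B, C

rng = np.random.default_rng(1)
d, dt, n = 4, 0.01, 20000
A0 = np.diag([1.0, 1.2, 0.8, 1.5]); A0[0, 2] = 0.3   # sparse truth
X = np.zeros((n + 1, d))
for k in range(n):
    X[k + 1] = X[k] - A0 @ X[k] * dt + rng.normal(0, np.sqrt(dt), d)

lam = 0.1
A_hat, B, C = lasso_ou(X, dt, lam)
# The Lasso solution is feasible for the (empirical) Dantzig constraint:
print(np.abs(A_hat @ C + B).max() <= lam * 1.01)
```

This numerically illustrates the remark above: for a common $\lambda$, the Lasso estimate always satisfies the Dantzig constraint, so the Dantzig selector can only have smaller $l_1$-norm.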
From Lemma 2.1 it is obvious that we require good control over the martingale term $\langle \varepsilon_T, V \rangle_F$ for certain matrices $V \in \mathbb{R}^{d \times d}$ to get an upper bound on the prediction error $\|(\widehat{A}_L - A_0) X\|_{L^2}$. Another important ingredient is the restricted eigenvalue property, which is a standard requirement in the analysis of Lasso estimators (see e.g. [2, 4]). In our setting the restricted eigenvalue property amounts to showing that
$$ \inf_{V \in \mathcal{C}(s, c)} \frac{\|V X\|_{L^2}}{\|V\|_2} $$
is bounded away from $0$ with high probability. Interestingly, the latter is a consequence of the model assumption (H) and not an extra condition as in the framework of linear regression. This has been noticed in [11], but an additional condition (H4) was required, which is in fact not needed as we will show in the next section.

In order to establish the connection between the Dantzig and the Lasso estimators we will show the inequality
$$ \Big| \|(\widehat{A}_D - A_0) X\|_{L^2}^2 - \|(\widehat{A}_L - A_0) X\|_{L^2}^2 \Big| \leq c \|\widehat{A}_L\|_0 \lambda^2 $$
for a certain constant $c >$
0, which holds with high probability. Once the term $\|\widehat{A}_L\|_0$ is controlled, we deduce statements about the error term $\widehat{A}_D - A_0$ via the corresponding analysis of $\widehat{A}_L - A_0$.

In this section we derive various concentration inequalities, which play a central role in the analysis of the estimators $\widehat{A}_L$ and $\widehat{A}_D$.

This subsection is devoted to the proof of the restricted eigenvalue property. The main result of this subsection relies heavily on some theoretical techniques presented in [21], where Malliavin calculus is applied in order to obtain tail bounds for certain functionals of Gaussian processes. In the following, we introduce some basic notions of Malliavin calculus; we refer to the monograph [22] for a more detailed exposition.

Let $\mathfrak{H}$ be a real separable Hilbert space. We denote by $B = \{B(h) : h \in \mathfrak{H}\}$ an isonormal Gaussian process over $\mathfrak{H}$. That is, $B$ is a centred Gaussian family with covariance kernel given by $\mathbb{E}[B(h_1) B(h_2)] = \langle h_1, h_2 \rangle_{\mathfrak{H}}$. We shall use the notation $L^2(B) = L^2(\Omega, \sigma(B), \mathbb{P})$. For every $q \geq$
1, we write $\mathfrak{H}^{\otimes q}$ to indicate the $q$th tensor product of $\mathfrak{H}$; $\mathfrak{H}^{\odot q}$ stands for the symmetric $q$th tensor product. We denote by $I_q$ the isometry between $\mathfrak{H}^{\odot q}$ and the $q$th Wiener chaos of $B$. It is well known (see e.g. [22, Chapter 1]) that any random variable $F \in L^2(B)$ admits the chaotic expansion
$$ F = \sum_{q=0}^\infty I_q(f_q), \qquad I_0(f_0) := \mathbb{E}[F], $$
where the series converges in $L^2$ and the kernels $f_q \in \mathfrak{H}^{\odot q}$ are uniquely determined by $F$. The operator $L$, called the generator of the Ornstein-Uhlenbeck semigroup, is defined as
$$ LF := -\sum_{q=1}^\infty q I_q(f_q) $$
whenever the latter series converges in $L^2$. The pseudo-inverse $L^{-1}$ of $L$ is defined by $L^{-1} F = -\sum_{q=1}^\infty q^{-1} I_q(f_q)$.

Next, let us denote by $\mathcal{S}$ the set of all smooth cylindrical random variables of the form $F = f(B(h_1), \ldots, B(h_n))$, where $n \geq 1$, $f: \mathbb{R}^n \to \mathbb{R}$ is a $C^\infty$-function with compact support and $h_i \in \mathfrak{H}$. The Malliavin derivative $DF$ of $F$ is defined as
$$ DF := \sum_{i=1}^n \frac{\partial f}{\partial x_i}\big( B(h_1), \ldots, B(h_n) \big) h_i. $$
The space $\mathbb{D}^{1,2}$ denotes the closure of $\mathcal{S}$ with respect to the norm $\|F\|_{1,2}^2 := \mathbb{E}[F^2] + \mathbb{E}[\|DF\|_{\mathfrak{H}}^2]$. The Malliavin derivative $D$ verifies the following chain rule: when $\varphi: \mathbb{R}^n \to \mathbb{R}$ is in $C_b^1$ (the set of continuously differentiable functions with bounded partial derivatives) and $(F_i)_{i=1,\ldots,n}$ is a vector of elements in $\mathbb{D}^{1,2}$, then $\varphi(F_1, \ldots, F_n) \in \mathbb{D}^{1,2}$ and
$$ D\varphi(F_1, \ldots, F_n) = \sum_{i=1}^n \frac{\partial \varphi}{\partial x_i}(F_1, \ldots, F_n) DF_i. $$
The next theorem establishes left and right tail bounds for certain elements $Z \in \mathbb{D}^{1,2}$.

Theorem 3.1. ([21, Theorem 4.1]) Assume that $Z \in \mathbb{D}^{1,2}$ and define the function
$$ g_Z(z) := \mathbb{E}\big[ \langle DZ, -DL^{-1} Z \rangle_{\mathfrak{H}} \,\big|\, Z = z \big]. $$
Suppose that the following conditions hold for some $\alpha \geq 0$ and $\beta > 0$:

(i) $g_Z(Z) \leq \alpha Z + \beta$ holds $\mathbb{P}$-almost surely,
(ii) the law of $Z$ has a Lebesgue density.

Then, for any $z > 0$, it holds that
$$ \mathbb{P}(Z \geq z) \leq \exp\Big( -\frac{z^2}{2\alpha z + 2\beta} \Big) \quad \text{and} \quad \mathbb{P}(Z \leq -z) \leq \exp\Big( -\frac{z^2}{2\beta} \Big). $$
Now, we apply Theorem 3.1 to certain quadratic forms of the Ornstein-Uhlenbeck process $X$. The following result is crucial for proving the restricted eigenvalue property.

Proposition 3.2.
Suppose that assumption (H) is satisfied and let $\widehat{C}_T$ be defined as in (2.5). Then it holds for all $x > 0$:
$$ \sup_{v \in \mathbb{R}^d : \|v\|_2 = 1} \mathbb{P}\big( |v^\top (\widehat{C}_T - C_\infty) v| \geq x \big) \leq 2 \exp(-T H(x)), \qquad (3.1) $$
where the function $H$ is defined as
$$ H(x) = \frac{r_0 x^2}{8 p K_\infty (x + K_\infty)} $$
with $K_\infty = \lambda_{\max}(C_\infty)$ and $p = \|P\|_{\mathrm{op}} \|P^{-1}\|_{\mathrm{op}}$, and the quantities $P$ and $r_0$ are introduced in assumption (H).

Proof. We define the centred stationary Gaussian process $Y_t^v = v^\top X_t$ and note that its covariance kernel is given by $\mathbb{E}[Y_t^v Y_s^v] = \rho_v(|t - s|)$ with $\rho_v(r) := v^\top \exp(-r A_0) C_\infty v$. By submultiplicativity of the operator norm we conclude that
$$ |\rho_v(r)| \leq \|\exp(-r A_0)\|_{\mathrm{op}} \|C_\infty\|_{\mathrm{op}} \leq \exp(-r r_0) \, p K_\infty. $$
We observe that $(Y_t^v)_{t \in [0,T]}$ can be considered as an isonormal Gaussian process indexed by a separable Hilbert space $\mathfrak{H}$ whose scalar product is induced by the covariance kernel of $(Y_t^v)_{t \in [0,T]}$. In particular, we can write $Y_t^v = B(h_t)$ and $\langle h_t, h_s \rangle_{\mathfrak{H}} = \rho_v(|t - s|)$. We introduce the quantity
$$ Z_T^v := v^\top (\widehat{C}_T - C_\infty) v = \frac{1}{T} \int_0^T \big( (Y_t^v)^2 - \mathbb{E}[(Y_t^v)^2] \big) \, dt $$
and notice that $Z_T^v$ is an element of the second order Wiener chaos. Hence, $Z_T^v$ has a Lebesgue density and we have $L^{-1} Z_T^v = -Z_T^v /$
2, and we conclude by the chain rule that
$$ \langle DZ_T^v, -DL^{-1} Z_T^v \rangle_{\mathfrak{H}} = \frac{1}{2} \|DZ_T^v\|_{\mathfrak{H}}^2 \leq \frac{2}{T^2} \int_0^T \int_0^T |Y_t^v Y_s^v| \, |\rho_v(t - s)| \, dt \, ds \leq \frac{2}{T^2} \int_0^T \int_0^T (Y_t^v)^2 |\rho_v(t - s)| \, dt \, ds $$
$$ \leq \frac{4}{T} \int_0^\infty |\rho_v(r)| \, dr \, \big( Z_T^v + \rho_v(0) \big) \leq \frac{4}{T} p K_\infty \int_0^\infty \exp(-r r_0) \, dr \, \big( Z_T^v + K_\infty \big) = \frac{4 p K_\infty}{T r_0} \big( Z_T^v + K_\infty \big). $$
Consequently, the conditions of Theorem 3.1 are satisfied with $\alpha = \frac{4 p K_\infty}{T r_0}$ and $\beta = \frac{4 p K_\infty^2}{T r_0}$, which completes the proof of Proposition 3.2 since $\mathbb{P}(|Z_T^v| \geq x) = \mathbb{P}(Z_T^v \geq x) + \mathbb{P}(Z_T^v \leq -x)$. □

The statement of Proposition 3.2 corresponds to assumption (H4) in [11], which has been shown to be valid via a log-Sobolev inequality only when $A_0$ is symmetric (cf. the corresponding theorem in [11]). In other words, the extra assumption (H4) is not required as it directly follows from the modelling setup.

The next theorem proves the restricted eigenvalue property.

Theorem 3.3.
Suppose that assumption (H) is satisfied and define $k_\infty := \lambda_{\min}(C_\infty) > 0$. Then for any $\epsilon \in (0, 1)$ it holds that
$$ \mathbb{P}\Big( \inf_{V \in \mathcal{C}(s,c)} \frac{\|V X\|_{L^2}}{\|V\|_2} \geq \sqrt{\frac{k_\infty}{2}} \Big) \geq 1 - \epsilon, $$
for all
$$ T \geq T(\epsilon, s, c) := T_0 \Big( (4s + 1) \log d - s \big( \log 2s - 1 \big) + \log \frac{2}{\epsilon} \Big), $$
where the constant $T_0$ is defined as
$$ T_0 = \frac{144 \, p K_\infty (c + 2)^2 \big( k_\infty + 18 (c + 2)^2 K_\infty \big)}{r_0 k_\infty^2}. $$
Proof.
See Section 6.1. □

The next corollary presents a deviation bound for the quantity $\widehat{C}_T$.

Corollary 3.4.
For any $\epsilon > 0$ and $T \geq T(\epsilon, s, c)$ it holds that
$$ \mathbb{P}\Big( \inf_{V \in \mathcal{C}(s,c)} \frac{\|V X\|_{L^2}}{\|V\|_2} \geq \sqrt{\frac{k_\infty}{2}}, \ \|\mathrm{diag}\,\widehat{C}_T\|_\infty \leq m_\infty + k_\infty, \ \|\widehat{C}_T\|_\infty \leq M_\infty + \frac{3 k_\infty}{2} \Big) \geq 1 - \epsilon, $$
where $m_\infty := \|\mathrm{diag}\, C_\infty\|_\infty$ and $M_\infty := \|C_\infty\|_\infty$.

Proof. See Section 6.2. □

As mentioned earlier, controlling the stochastic term $\langle \varepsilon_T, V \rangle_F$ for matrices $V \in \mathbb{R}^{d \times d}$ is crucial for the analysis of the estimators $\widehat{A}_L$ and $\widehat{A}_D$. The martingale property of $\varepsilon_T$ turns out to be the key in the next proposition. We remark that the following result is an improvement of [11, Theorem 8].

Proposition 3.5.
For any $\epsilon \in (0, 1)$ the following inequality holds:
$$ \mathbb{P}\Bigg( \sup_{V \in \mathbb{R}^{d \times d}, V \neq 0} \frac{\langle \varepsilon_T, V \rangle_F}{\|V\|_1} \geq \mu \Bigg) \leq \epsilon $$
for any
$$ T \geq \frac{8 p K_\infty (k_\infty + 6 K_\infty)}{r_0 k_\infty^2} \big( (2s + 1) \ln d - s (\ln s - 1) + \ln(4/\epsilon) \big) $$
and
$$ \mu \geq \sqrt{\frac{2 (m_\infty + k_\infty) \ln(2 d^2/\epsilon)}{T}}. $$
Proof.
We first recall Bernstein's inequality for continuous local martingales. Let $(M_t)_{t \geq 0}$ be a real-valued continuous local martingale with quadratic variation $(\langle M \rangle_t)_{t \geq 0}$. Then for any $a, b > 0$:
$$ \mathbb{P}\big( M_t \geq a, \ \langle M \rangle_t \leq b \big) \leq \exp\big( -a^2 / (2b) \big). \qquad (3.2) $$
This result is a straightforward consequence of the exponential martingale technique (cf. Chapter 4, Exercise 3.16 in [24]).

By definition, $T \varepsilon_T^{ij} = \int_0^T X_t^j \, dW_t^i$ is a continuous martingale with quadratic variation $T \widehat{C}_T^{jj}$. Therefore, we obtain by Corollary 3.4 and (3.2):
$$ \mathbb{P}\Bigg( \sup_{V \neq 0} \frac{\langle \varepsilon_T, V \rangle_F}{\|V\|_1} \geq \mu \Bigg) \leq \mathbb{P}\big( \|\mathrm{diag}\,\widehat{C}_T\|_\infty > m_\infty + k_\infty \big) + \mathbb{P}\Bigg( \sup_{V \neq 0} \frac{\langle \varepsilon_T, V \rangle_F}{\|V\|_1} \geq \mu, \ \|\mathrm{diag}\,\widehat{C}_T\|_\infty \leq m_\infty + k_\infty \Bigg) $$
$$ \leq \sum_{i,j=1}^d \mathbb{P}\Big( \varepsilon_T^{ij} \geq \mu, \ \widehat{C}_T^{jj} \leq m_\infty + k_\infty \Big) + \frac{\epsilon}{2} \leq d^2 \exp\Big( -\frac{T \mu^2}{2 (m_\infty + k_\infty)} \Big) + \frac{\epsilon}{2} \leq \epsilon, $$
which completes the proof. □

Summarising all previous deviation bounds we obtain the following result.

Corollary 3.6.
For $s \geq s_0$ and $c > 0$ define the event
$$ \mathcal{E}(s, c) := \Big\{ \inf_{V \in \mathcal{C}(s,c)} \frac{\|V X\|_{L^2}}{\|V\|_2} \geq \sqrt{\frac{k_\infty}{2}} \Big\} \bigcap \Big\{ \sup_{V \neq 0} \frac{\langle \varepsilon_T, V \rangle_F}{\|V\|_1} \leq \frac{\lambda}{2} \Big\} \bigcap \Big\{ \|\varepsilon_T\|_\infty \leq \frac{\lambda}{2} \Big\} \bigcap \Big\{ \|\widehat{C}_T\|_\infty \leq M_\infty + \frac{3 k_\infty}{2} \Big\}. $$
Then, for any $\epsilon \in (0, 1)$, it holds that $\mathbb{P}(\mathcal{E}(s, c)) \geq 1 - \epsilon$ for any $T \geq T(\epsilon/2, s, c)$ and
$$ \lambda \geq 2 \sqrt{\frac{2 (m_\infty + k_\infty) \ln(2 d^2/\epsilon)}{T}}. $$
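The Bernstein inequality (3.2), which drives the bound above, can be sanity-checked by Monte Carlo in the simplest case $M_t = W_t$, where $\langle M \rangle_t = t$ is deterministic, so the bound reduces to $\mathbb{P}(W_t \geq a) \leq \exp(-a^2/(2t))$. The parameters below are illustrative assumptions:

```python
import numpy as np

# Monte Carlo check of Bernstein's inequality for M_t = W_t:
# P(W_t >= a) <= exp(-a^2 / (2 t)), since <M>_t = t here.
rng = np.random.default_rng(7)
t, a, n_sim = 1.0, 1.5, 200_000
W_t = rng.normal(0.0, np.sqrt(t), size=n_sim)   # W_t ~ N(0, t)
empirical = np.mean(W_t >= a)
bound = np.exp(-a ** 2 / (2 * t))
print(empirical <= bound)   # the Gaussian tail sits below the bound
```

The exponential bound is of course not tight here (the true tail is $1 - \Phi(a/\sqrt{t}) \approx 0.067$ against a bound of $\approx 0.32$); its value lies in holding uniformly over all continuous local martingales on the event $\{\langle M \rangle_t \leq b\}$.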
In this section we present the main theoretical results for the Lasso and Dantzig estimators. More specifically, we derive oracle inequalities for $\widehat{A}_L$ and $\widehat{A}_D$, and show error bounds for the norms $\|\cdot\|_{L^2}$, $\|\cdot\|_1$ and $\|\cdot\|_2$. In particular, we establish the asymptotic equivalence between the Lasso and Dantzig estimators.

We start this subsection by proving a statement which is important for obtaining the oracle inequality for the Lasso estimator $\widehat{A}_L$.

Lemma 4.1.
Suppose that condition (2.6) holds. For any matrix $A \in \mathbb{R}^{d \times d} \setminus \{0\}$ denote $\mathcal{A} := \mathrm{supp}(A)$. Then for any $s \geq s_0$ and $c > 0$, on $\mathcal{E}(s, c)$ the following inequality holds:
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 + \lambda \|\widehat{A}_L - A\|_1 \leq \|(A - A_0) X\|_{L^2}^2 + 4\lambda \|(\widehat{A}_L - A)|_{\mathcal{A}}\|_1. \qquad (4.1) $$
In particular, it implies that $\widehat{A}_L - A_0 \in \mathcal{C}(s_0, 3)$ on $\mathcal{E}(s, c)$.

Proof. Let us set $\delta_L(A) := A - \widehat{A}_L$. Applying Lemma 2.1 we obtain the following inequality:
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 + \lambda \|\delta_L(A)\|_1 \leq \|(A - A_0) X\|_{L^2}^2 + 2 \langle \varepsilon_T, \delta_L(A) \rangle_F + \lambda \|\delta_L(A)\|_1 + 2\lambda \big( \|A\|_1 - \|\widehat{A}_L\|_1 \big). $$
Hence, on $\mathcal{E}(s, c)$ it holds that
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 + \lambda \|\delta_L(A)\|_1 \leq \|(A - A_0) X\|_{L^2}^2 + 2\lambda \big( \|\delta_L(A)\|_1 + \|A\|_1 - \|\widehat{A}_L\|_1 \big). $$
We observe next that $\|\delta_L(A)\|_1 + \|A\|_1 - \|\widehat{A}_L\|_1 \leq 2 \|\delta_L(A)|_{\mathcal{A}}\|_1$, which immediately implies (4.1). Applying (4.1) to $A = A_0$ we deduce that
$$ \|\delta_L(A_0)\|_1 \leq 4 \|\delta_L(A_0)|_{\mathcal{A}_0}\|_1 \leq 4 \|\delta_L(A_0)|_{\mathcal{I}_{s_0}(\delta_L(A_0))}\|_1, $$
where the last inequality holds due to the sparsity assumption $\|A_0\|_0 \leq s_0$. Consequently, $\widehat{A}_L - A_0 \in \mathcal{C}(s_0, 3)$ and the proof is complete. □

We are now in the position to present an oracle inequality for the Lasso estimator $\widehat{A}_L$, which is one of the main results of our paper.

Theorem 4.2.
Fix $\gamma > 0$ and $\epsilon \in (0, 1)$. Consider the Lasso estimator $\widehat{A}_L$ defined in (2.7) and assume that condition (H) holds. Then for
$$ \lambda \geq 2 \sqrt{\frac{2 (m_\infty + k_\infty) \ln(2 d^2/\epsilon)}{T}} \quad \text{and} \quad T \geq T\big( \epsilon/2, \, s_0, \, 3 + 4/\gamma \big), $$
with probability at least $1 - \epsilon$ it holds that
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 \leq (1 + \gamma) \inf_{A : \|A\|_0 \leq s_0} \Big\{ \|(A - A_0) X\|_{L^2}^2 + \frac{9 (2 + \gamma)^2}{k_\infty \gamma (1 + \gamma)} \|A\|_0 \lambda^2 \Big\}. $$
Proof.
Consider an arbitrary matrix $A \in \mathbb{R}^{d \times d}$ with $\|A\|_0 \leq s_0$ and denote $\mathcal{A} := \mathrm{supp}(A)$. Then, on $\mathcal{E}(s_0, 3 + 4/\gamma)$, according to Lemma 4.1 and the Cauchy-Schwarz inequality:
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 + \lambda \|\widehat{A}_L - A\|_1 \leq \|(A - A_0) X\|_{L^2}^2 + 4\lambda \|(\widehat{A}_L - A)|_{\mathcal{A}}\|_1 \leq \|(A - A_0) X\|_{L^2}^2 + 4\lambda \sqrt{\|A\|_0} \, \|(\widehat{A}_L - A)|_{\mathcal{A}}\|_2. \qquad (4.2) $$
Now, if $4\lambda \|(\widehat{A}_L - A)|_{\mathcal{A}}\|_1 \leq \gamma \|(A - A_0) X\|_{L^2}^2$ the result immediately follows from Lemma 4.1. Hence, we only need to treat the case $4\lambda \|(\widehat{A}_L - A)|_{\mathcal{A}}\|_1 > \gamma \|(A - A_0) X\|_{L^2}^2$. The latter implies that $\widehat{A}_L - A \in \mathcal{C}(s_0, 3 + 4/\gamma)$ due to (4.2). Then, on the event $\mathcal{E}(s_0, 3 + 4/\gamma)$, we have
$$ \|(\widehat{A}_L - A)|_{\mathcal{A}}\|_2 \leq \|\widehat{A}_L - A\|_2 \leq \sqrt{\frac{2}{k_\infty}} \, \|(\widehat{A}_L - A) X\|_{L^2} $$
and consequently we obtain from (4.2) that
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 \leq \|(A - A_0) X\|_{L^2}^2 + 3\lambda \sqrt{\frac{2 \|A\|_0}{k_\infty}} \, \|(\widehat{A}_L - A) X\|_{L^2} \leq \|(A - A_0) X\|_{L^2}^2 + 3\lambda \sqrt{\frac{2 \|A\|_0}{k_\infty}} \Big( \|(\widehat{A}_L - A_0) X\|_{L^2} + \|(A - A_0) X\|_{L^2} \Big). $$
Using the inequality $2xy \leq a x^2 + y^2/a$ for $a > 0$, we then conclude that
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 \leq (1 + \gamma) \|(A - A_0) X\|_{L^2}^2 + \frac{9 (2 + \gamma)^2}{k_\infty \gamma (1 + \gamma)} \|A\|_0 \lambda^2, $$
which completes the proof. □

Theorem 4.2 enables us to find upper bounds on the various norms of $\widehat{A}_L - A_0$ as well as on the sparsity of $\widehat{A}_L$. We remark that the bound in (4.6) will be useful to provide the connection between the Lasso and Dantzig estimators in the next subsection.

Corollary 4.3.
Fix $\epsilon \in (0, 1)$. Consider the Lasso estimator $\widehat{A}_L$ defined in (2.7) and assume that conditions (2.6) and (H) hold. Then for
$$ \lambda \geq 2 \sqrt{\frac{2 (m_\infty + k_\infty) \ln(2 d^2/\epsilon)}{T}} \quad \text{and} \quad T \geq T\big( \epsilon/2, s_0, 3 \big), $$
with probability at least $1 - \epsilon$, it holds that
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 \leq \frac{18 s_0 \lambda^2}{k_\infty}, \qquad (4.3) $$
$$ \|\widehat{A}_L - A_0\|_2 \leq \frac{6 \sqrt{s_0} \lambda}{k_\infty}, \qquad (4.4) $$
$$ \|\widehat{A}_L - A_0\|_1 \leq \frac{24 s_0 \lambda}{k_\infty}, \qquad (4.5) $$
$$ \|\widehat{A}_L\|_0 \leq \Big( \frac{48 M_\infty}{k_\infty} + 72 \Big) s_0. \qquad (4.6) $$
Proof.
On the event $\mathcal{E}(s_0, 3)$, applying Lemma 4.1 with $A = A_0$ and $\mathcal{A}_0 := \mathrm{supp}(A_0)$, we obtain the inequality
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 + \lambda \|\widehat{A}_L - A_0\|_1 \leq 4\lambda \|(\widehat{A}_L - A_0)|_{\mathcal{A}_0}\|_1. $$
Since on $\mathcal{E}(s_0, 3)$ we have $\widehat{A}_L - A_0 \in \mathcal{C}(s_0, 3)$,
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 \leq 3\lambda \|(\widehat{A}_L - A_0)|_{\mathcal{A}_0}\|_1 \leq 3\lambda \sqrt{s_0} \, \|(\widehat{A}_L - A_0)|_{\mathcal{A}_0}\|_2 \leq 3\lambda \sqrt{\frac{2 s_0}{k_\infty}} \, \|(\widehat{A}_L - A_0) X\|_{L^2}. $$
This gives (4.3) and (4.4). Moreover, on the same event it holds that $\|\widehat{A}_L - A_0\|_1 \leq 4 \sqrt{s_0} \|\widehat{A}_L - A_0\|_2$, and hence (4.5) follows.

Now, it remains to prove (4.6). Note that a necessary and sufficient condition for $\widehat{A}_L$ to be the solution of the optimisation problem (2.7) is the existence of a matrix $B \in \partial \|\widehat{A}_L\|_1$ such that
$$ \varepsilon_T + \big( \widehat{A}_L - A_0 \big) \widehat{C}_T + \lambda B = 0. $$
Furthermore, $\widehat{A}_L^{ij} \neq 0$ implies that $B^{ij} = \mathrm{sign}(\widehat{A}_L^{ij})$. Thus, we conclude that
$$ \|(\widehat{A}_L - A_0) \widehat{C}_T\|_1 = \|\lambda B + \varepsilon_T\|_1 = \sum_{i,j=1}^d \big| \lambda B^{ij} + \varepsilon_T^{ij} \big| \geq \sum_{i,j : \widehat{A}_L^{ij} \neq 0} \big| \lambda B^{ij} + \varepsilon_T^{ij} \big| \geq \sum_{i,j : \widehat{A}_L^{ij} \neq 0} \big( \lambda - |\varepsilon_T^{ij}| \big) \geq \frac{\|\widehat{A}_L\|_0 \lambda}{2}, $$
where the last step uses $\|\varepsilon_T\|_\infty \leq \lambda/2$ on $\mathcal{E}(s_0, 3)$. On the other hand, on $\mathcal{E}(s_0, 3)$,
$$ \|(\widehat{A}_L - A_0) \widehat{C}_T\|_1 \leq \|\widehat{C}_T\|_\infty \|\widehat{A}_L - A_0\|_1 \leq \Big( M_\infty + \frac{3 k_\infty}{2} \Big) \frac{24 s_0 \lambda}{k_\infty}, $$
which implies (4.6). □

The upper bounds in (4.3)-(4.5) improve the bounds obtained in [11, Corollary 1] and they are in line with the classical results for linear regression models. We recall that the paper [11] considers row sparsity of the unknown parameter $A_0$, i.e. $\|A_{0,i}\|_0 \leq s$ for all $1 \leq i \leq d$, where $A_{0,i}$ denotes the $i$th row of $A_0$. Obviously, this constraint corresponds to $s_0 = d s$ in our setting.
The authors of [11] obtained an upper bound for $\|\widehat{A}_L - A_0\|_2^2$ of order
$$ \frac{d s (\log d + \log \log T)}{T}, $$
in contrast to our improved bound of order $T^{-1} d s \log d$. Thus, we essentially match the lower bound
$$ \inf_{\widehat{A}} \sup_{A_0 : \max_i \|A_{0,i}\|_0 \leq s} \mathbb{E}\big[ \|\widehat{A} - A_0\|_2^2 \big] \geq \frac{c_0 \, d s \log(c_1 d / s)}{T} $$
for some $c_0, c_1 > 0$, which has been derived in [11, Theorem 2].

The authors of [11] have also introduced the adaptive Lasso estimator, which is defined as
$$ \widehat{A}_{\mathrm{ad}} := \operatorname*{argmin}_{A \in \mathbb{R}^{d \times d}} \Big( \mathcal{L}_T(A) + \lambda \big\| A \circ |\widehat{A}_{\mathrm{ML}}|^{-\gamma} \big\|_1 \Big), $$
where $\circ$ denotes the Hadamard product and $(|\widehat{A}_{\mathrm{ML}}|^{-\gamma})_{ij} := |\widehat{A}_{\mathrm{ML}}^{ij}|^{-\gamma}$ for a $\gamma >$
0. Theyhave proved that the adaptive estimator (cid:98) A ad is consistent for support selection and showedthe asymptotic normality of (cid:98) A ad when restricted to the elements in supp( A ); see [11,Theorem 4]. In this subsection we will establish a connection between the prediction errors associatedwith the Lasso and Dantzig estimators. This step is essential for the derivation of errorbounds for (cid:98) A D . Our results are an extension of the study in [2], where it was shown thatunder sparsity conditions, the Lasso and the Dantizg estimators show similar behaviourfor linear regression and for nonparametric regression models, for l prediction loss andfor l p loss in the coefficients for 1 ≤ p ≤ . In what follows, we will derive analogous bounds for the Ornstein-Uhlenbeck process.5
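The similar behaviour of the two estimators can also be seen numerically. The sketch below is an illustrative toy linear-regression example (not the Ornstein-Uhlenbeck setting): the design is orthonormalised so that $X^\top X/n = \mathrm{id}$, a special case in which both the Lasso and the Dantzig selector reduce to soft-thresholding of $X^\top y/n$ and should coincide up to solver tolerance. The Dantzig selector is computed as a linear program; all constants are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, p, lam = 100, 8, 0.05

# Orthonormalised design: X^T X / n = identity
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q
beta_true = np.zeros(p)
beta_true[:2] = [1.0, -0.7]
y = X @ beta_true + 0.05 * rng.standard_normal(n)
c = X.T @ y / n

# Lasso via ISTA (unit step size is valid since X^T X / n = id)
b_lasso = np.zeros(p)
for _ in range(200):
    z = b_lasso - X.T @ (X @ b_lasso - y) / n
    b_lasso = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Dantzig selector as a linear program:
#   min ||b||_1  subject to  ||X^T (y - X b) / n||_inf <= lam,  with b = u - v
A = X.T @ X / n
A_ub = np.block([[A, -A], [-A, A]])
b_ub = np.concatenate([lam + c, lam - c])
res = linprog(np.ones(2 * p), A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
b_dantzig = res.x[:p] - res.x[p:]

assert res.status == 0
assert np.max(np.abs(b_lasso - b_dantzig)) < 1e-4
```

For general correlated designs the two solutions need not coincide exactly, which is consistent with the error-bound (rather than equality) statements developed below.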
Proposition 4.4.
Consider the Dantzig estimator $\widehat A_D$ defined in (2.10) and assume that condition (H) holds.

(i) Define $\delta_D(A) := A - \widehat A_D$ and $\mathcal A := \operatorname{supp}(A)$, and assume that $A$ satisfies the Dantzig constraint (2.9). Then it holds that

$\|\delta_D(A)_{|\mathcal A^c}\|_1 \le \|\delta_D(A)_{|\mathcal A}\|_1.$

(ii) On the event $\{\|\widehat A_L\|_0 \le s_0\} \cap E(s_0, 1)$ the following inequality holds:

$\Big| \|(\widehat A_L - A_0)X\|^2_{L^2} - \|(\widehat A_D - A_0)X\|^2_{L^2} \Big| \le \frac{18}{k_\infty}\, \|\widehat A_L\|_0\, \lambda^2.$

Proof.
See Section 6.3.

Proposition 4.4 implies an oracle inequality for the Dantzig estimator, which is formulated in the next theorem.
Theorem 4.5.
Fix $\gamma > 0$ and $\epsilon_0 \in (0,1)$. Consider the Dantzig estimator $\widehat A_D$ defined in (2.10) and assume that conditions (2.6) and (H) hold. Then for

$\lambda \ge \sqrt{\frac{(m_\infty + k_\infty) \ln(2d^2/\epsilon_0)}{T}}$

and $T \ge T_0\big(\epsilon_0/2,\, (48 M_\infty/k_\infty + 72)\, s_0,\, 3/\gamma\big)$, with probability at least $1 - \epsilon_0$ it holds that

$\|(\widehat A_D - A_0)X\|^2_{L^2} \le (1+\gamma) \inf_{A:\, \|A\|_0 = s_0} \Big\{ \|(A - A_0)X\|^2_{L^2} + C_D(\gamma)\, s_0 \lambda^2 \Big\}, \qquad (4.7)$

where

$C_D(\gamma) = \frac{18}{k_\infty} \Big( \frac{(\gamma+2)^2}{2\gamma} + \frac{48 M_\infty}{k_\infty} + 72 \Big).$

Proof.
Consider a matrix $A \in \mathbb R^{d\times d}$ such that $\|A\|_0 = s_0$. Then, on the event $E(s_0, 3/\gamma)$, according to Proposition 4.4,

$\|(\widehat A_D - A_0)X\|^2_{L^2} \le \|(\widehat A_L - A_0)X\|^2_{L^2} + \frac{18}{k_\infty} \Big( \frac{48 M_\infty}{k_\infty} + 72 \Big)\, s_0 \lambda^2.$

On the other hand, due to Theorem 4.2, we deduce that

$\|(\widehat A_L - A_0)X\|^2_{L^2} \le (1+\gamma)\, \|(A - A_0)X\|^2_{L^2} + \frac{9(\gamma+2)^2}{\gamma\, k_\infty}\, s_0 \lambda^2.$

Combining both inequalities yields (4.7).

The statements of Theorems 4.2 and 4.5 suggest that the Lasso and Dantzig estimators are asymptotically equivalent. This is in line with the theoretical findings for linear regression models, as shown in [2]. More specifically, we obtain the following result, which is a direct analogue of Corollary 4.3.
Corollary 4.6.
Fix $\epsilon_0 \in (0,1)$. Consider the Dantzig estimator $\widehat A_D$ defined in (2.10) and assume that conditions (2.6) and (H) hold. Then for

$\lambda \ge \sqrt{\frac{(m_\infty + k_\infty) \ln(2d^2/\epsilon_0)}{T}}$

and $T \ge T_0(\epsilon_0/2,\, s_0,\, 1)$, with probability at least $1 - \epsilon_0$ it holds that

$\|(\widehat A_D - A_0)X\|^2_{L^2} \le \frac{s_0 \lambda^2}{k_\infty}, \qquad (4.8)$

$\|\widehat A_D - A_0\|_2 \le \frac{\sqrt{s_0}\, \lambda}{k_\infty}, \qquad (4.9)$

$\|\widehat A_D - A_0\|_1 \le \frac{s_0 \lambda}{k_\infty}.$

Proof.
Denote $\mathcal A = \operatorname{supp}(A_0)$. On the event $E(s_0, 1)$ the matrix $A_0$ satisfies the Dantzig constraint (2.9), $\widehat A_D - A_0 \in C(s_0, 1)$ and

$\|(\widehat A_D - A_0)X\|^2_{L^2} \le \|(\widehat A_D - A_0)\widehat C_T\|_\infty\, \|\widehat A_D - A_0\|_1 \le \big( \|(\widehat A_D - A_0)\widehat C_T + \varepsilon_T\|_\infty + \|\varepsilon_T\|_\infty \big)\, \|\widehat A_{D|\mathcal A} - A_0\|_1 \le \lambda \sqrt{s_0}\, \|\widehat A_D - A_0\|_2 \le \lambda \sqrt{\frac{s_0}{k_\infty}}\, \|(\widehat A_D - A_0)X\|_{L^2},$

which gives (4.8) and (4.9). Moreover, on the same event it holds that $\|\widehat A_D - A_0\|_1 \le \sqrt{s_0}\, \|\widehat A_D - A_0\|_2$, which completes the proof.

It is worth mentioning that, even though the Lasso and Dantzig selectors perform equivalently in our setting, a potential strength of the Dantzig estimator over penalised likelihood methods such as the Lasso is that it can be applied in settings where no explicit likelihood or loss function is available, and it may therefore be of interest in both computational and theoretical contexts (see [8] for more details).

This section presents some numerical experiments on simulated data that illustrate our theoretical results. Our estimation methods are based on continuous observations of the underlying process, which need to be discretised for numerical simulations. We use 500000 discretisation points over the time interval $[0, T]$ with $T = 300$. Such an approximation is sufficient for illustration purposes, since further refinement of the grid does not lead to a significant improvement.

Figure 1: Comparison of the true matrix with the maximum likelihood, Lasso and Dantzig estimators. (Panels: (a) transition matrix; (b) MLE; (c) Lasso; (d) Dantzig.)

In Figure 1 we show an example of the transition matrix $A_0$ and the corresponding maximum likelihood, Lasso and Dantzig estimators. Instead of giving numerical values of the entries of $A_0$, we use a colour code to highlight the sparsity. We observe that the MLE performs well on the support, but it gives rather poor estimates outside the support.
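The simulation design just described can be sketched as follows, assuming the model convention $dX_t = -A_0 X_t\, dt + dW_t$ and the continuous-time MLE $\widehat A = -\big(\int_0^T dX_t\, X_t^\top\big)\big(\int_0^T X_t X_t^\top\, dt\big)^{-1}$, both discretised by an Euler scheme. The dimension, horizon and drift matrix below are illustrative and smaller than in our experiments.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, N = 5, 300.0, 300_000
dt = T / N

# Illustrative sparse drift matrix with positive spectrum (ergodic regime)
A0 = np.eye(d)
A0[0, 3] = 0.5
A0[2, 1] = -0.4

# Euler-Maruyama discretisation of dX_t = -A0 X_t dt + dW_t on [0, T]
X = np.zeros((N + 1, d))
for k in range(N):
    X[k + 1] = X[k] - (A0 @ X[k]) * dt + np.sqrt(dt) * rng.standard_normal(d)

# Discretised MLE-type estimator:
#   A_hat = -(sum_k dX_k X_k^T) (sum_k X_k X_k^T dt)^{-1}
dX = np.diff(X, axis=0)
C_hat = (X[:-1].T @ X[:-1]) * dt      # approximates int_0^T X_t X_t^T dt
B_hat = dX.T @ X[:-1]                 # approximates int_0^T dX_t X_t^T
A_mle = -B_hat @ np.linalg.inv(C_hat)

# The estimate should recover A0 up to O(1/sqrt(T)) fluctuations
rel_err = np.linalg.norm(A_mle - A0) / np.linalg.norm(A0)
assert rel_err < 0.35
```

The Lasso and Dantzig estimators add an $\ell_1$ penalisation or constraint on top of this least-squares solution; no tuning of $\lambda$ is attempted in this sketch.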
On the other hand, the superiority of the Lasso and Dantzig estimators, especially in terms of support recovery, is quite obvious even for a relatively small dimension of the matrix.

Figure 2 shows the relative error of the maximum likelihood, Lasso and Dantzig estimators compared to the norm of the true matrix. We compute the relative error for dimensions $d = 5, \dots, 20$ and for the $L^1$ and Frobenius norms. Figure 2 clearly shows the improvement in performance of the penalised estimation methods, compared to maximum likelihood estimation, as the dimension $d$ grows. Indeed, we observe that the relative errors of maximum likelihood estimation grow linearly in both the $L^1$ and Frobenius norms, while the relative errors of the Lasso and Dantzig estimators decay in $d$. The sparsity of the true parameter $A_0$ was chosen proportional to $d$, which might explain the limiting behaviour of the Lasso and Dantzig estimators as $d$ increases. Finally, we observe that the relative errors for the Lasso and Dantzig estimators are practically equivalent, which is exactly in accordance with our theoretical results.

Figure 2: Relative error of the maximum likelihood, Lasso and Dantzig estimators in the $L^1$ and Frobenius norms depending on $d$. The middle line corresponds to the mean and the coloured areas to the standard deviation of the error over 10 independent simulations.

We first note the identity $\|VX\|^2_{L^2} = \operatorname{tr}(V \widehat C_T V^\top)$. Replacing $\widehat C_T$ by its limit $C_\infty$ we deduce the inequality $\operatorname{tr}(V C_\infty V^\top) \ge k_\infty \|V\|_2^2 > 0$ and

$\frac{\|VX\|^2_{L^2}}{\|V\|_2^2} = \frac{\operatorname{tr}(V C_\infty V^\top)}{\|V\|_2^2} - \frac{\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)}{\|V\|_2^2} \ge k_\infty - \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2}. \qquad (6.1)$

Next, we introduce the set $K(s) := \big\{ V \in \mathbb R^{d\times d} \setminus \{0\} : \|V\|_0 \le s \big\}$. As is shown in Lemma 6.1 it holds that

$\sup_{V \in C(s,c)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2} \le (c+2)^2 \sup_{V \in K(2s)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2}. \qquad (6.2)$

Thus, it suffices to consider $K(s)$ instead of $C(s,c)$ in the following discussion. Observing (6.1) we obtain that

$\mathbb P\Big( \inf_{V \in K(s)} \frac{\|VX\|^2_{L^2}}{\|V\|_2^2} \ge \frac{k_\infty}{2} \Big) \ge \mathbb P\Big( \sup_{V \in K(s)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2} \le \frac{k_\infty}{2} \Big).$

For a matrix $V \in K(s)$ we denote its $j$th row vector by $v_j$ and set $v = \operatorname{vec}(V) \in \mathbb R^{d^2}$. Moreover, we define the symmetric random matrix $D_C = \operatorname{id} \otimes (C_\infty - \widehat C_T) \in \mathbb R^{d^2 \times d^2}$. Then we deduce the identity

$\frac{\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)}{\|V\|_2^2} = \frac{v^\top D_C v}{\|v\|_2^2}. \qquad (6.3)$

According to Proposition 3.2 we obtain the following inequalities for any $x > 0$:

$\mathbb P\Big( \frac{|v^\top D_C v|}{\|v\|_2^2} \ge x \Big) \le \mathbb P\Big( \frac{\sum_{j=1}^d |v_j (C_\infty - \widehat C_T) v_j^\top|}{\sum_{j=1}^d \|v_j\|_2^2} \ge x \Big) \le \sum_{j=1}^d \mathbb P\Big( \frac{|v_j (C_\infty - \widehat C_T) v_j^\top|}{\|v_j\|_2^2} \ge x \Big) \le d \exp(-T H(x)).$

By Lemma 6.2 we conclude that

$\mathbb P\Big( \sup_{v \in \mathbb R^{d^2} \setminus \{0\}:\, \|v\|_0 \le s} \frac{|v^\top D_C v|}{\|v\|_2^2} \ge x \Big) \le d \Big( \frac{e d^2}{s} \Big)^s \exp(-T H(x)).$

We deduce from (6.3) that

$\mathbb P\Big( \inf_{V \in K(s)} \frac{\|VX\|^2_{L^2}}{\|V\|_2^2} \ge k_\infty - x \Big) \ge 1 - d \Big( \frac{e d^2}{s} \Big)^s \exp(-T H(x)).$

The latter statement together with (6.2) implies the inequality

$\mathbb P\Big( \inf_{V \in C(s,c)} \frac{\|VX\|^2_{L^2}}{\|V\|_2^2} \ge \frac{k_\infty}{2} \Big) \ge 1 - \epsilon_0$

for all $T \ge T_0(\epsilon_0, s, c)$, which completes the proof of Theorem 3.3.

Let $e_{(i,j)} \in \mathbb R^{d\times d}$ be the matrix defined by $e^{kl}_{(i,j)} := 1_{\{(k,l) = (i,j)\}}$. We observe that

$\Big\{ \|\operatorname{diag}(C_\infty - \widehat C_T)\|_\infty > \frac{k_\infty}{2} \Big\} = \Big\{ \max_{1 \le j \le d} \big| \operatorname{tr}(e_{(j,j)}(C_\infty - \widehat C_T) e_{(j,j)}^\top) \big| > \frac{k_\infty}{2} \Big\} \subset \Big\{ \sup_{V \in C(s,c)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2} > \frac{k_\infty}{2} \Big\}.$

Furthermore,

$|C^{ij}_\infty - \widehat C^{ij}_T| = \big| \operatorname{tr}(e_{(1,i)}(C_\infty - \widehat C_T) e_{(1,j)}^\top) \big| \le \frac{1}{2} \big| \operatorname{tr}((e_{(1,i)} + e_{(1,j)})(C_\infty - \widehat C_T)(e_{(1,i)} + e_{(1,j)})^\top) \big| + \frac{1}{2} \big| \operatorname{tr}(e_{(1,i)}(C_\infty - \widehat C_T) e_{(1,i)}^\top) \big| + \frac{1}{2} \big| \operatorname{tr}(e_{(1,j)}(C_\infty - \widehat C_T) e_{(1,j)}^\top) \big| \le 2 \sup_{V \in C(s,c)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2},$

and hence

$\Big\{ \|C_\infty - \widehat C_T\|_\infty > k_\infty \Big\} = \Big\{ \max_{1 \le i,j \le d} \big| \operatorname{tr}(e_{(1,i)}(C_\infty - \widehat C_T) e_{(1,j)}^\top) \big| > k_\infty \Big\} \subset \Big\{ \sup_{V \in C(s,c)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2} > \frac{k_\infty}{2} \Big\}.$

This completes the proof of Corollary 3.4.
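The concentration of $\widehat C_T$ around $C_\infty$ that drives Corollary 3.4 can be illustrated numerically. The sketch below is an illustrative toy example, assuming the model convention $dX_t = -A_0 X_t\, dt + dW_t$ and the normalisation $\widehat C_T = T^{-1}\int_0^T X_t X_t^\top\, dt$; the stationary covariance is obtained from the Lyapunov equation $A_0 C_\infty + C_\infty A_0^\top = \operatorname{id}$.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(3)
d, T, N = 4, 200.0, 200_000
dt = T / N

A0 = np.eye(d)
A0[0, 2] = 0.5    # one sparse off-diagonal interaction

# Stationary covariance: solves A0 C + C A0^T = I
C_inf = solve_continuous_lyapunov(A0, np.eye(d))

# Euler scheme for dX_t = -A0 X_t dt + dW_t, accumulating C_hat on the fly
X = np.zeros(d)
C_hat = np.zeros((d, d))
for _ in range(N):
    C_hat += np.outer(X, X) * dt
    X = X - (A0 @ X) * dt + np.sqrt(dt) * rng.standard_normal(d)
C_hat /= T

# The entrywise deviation ||C_inf - C_hat||_inf shrinks at rate ~ 1/sqrt(T)
assert np.max(np.abs(C_hat - C_inf)) < 0.25
```

Repeating the experiment with a larger horizon $T$ shrinks the deviation, in line with the exponential bounds of Proposition 3.2.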
Since $A$ satisfies the Dantzig constraint (2.9), we deduce by the definition of the Dantzig estimator:

$\|A\|_1 \ge \|\widehat A_D\|_1 = \|A - \delta_D(A)_{|\mathcal A}\|_1 + \|\delta_D(A)_{|\mathcal A^c}\|_1 \ge \|A\|_1 - \|\delta_D(A)_{|\mathcal A}\|_1 + \|\delta_D(A)_{|\mathcal A^c}\|_1,$

which proves part (i).

Now we show part (ii) of the proposition. Set $\delta := \widehat A_L - \widehat A_D$. Due to (2.8) we deduce

$\|(\widehat A_L - A_0)X\|^2_{L^2} - \|(\widehat A_D - A_0)X\|^2_{L^2} = 2\operatorname{tr}\big( (\widehat A_D \widehat C_T + \varepsilon_T - A_0 \widehat C_T) \delta^\top \big) - 2\operatorname{tr}(\varepsilon_T \delta^\top) + \operatorname{tr}(\delta \widehat C_T \delta^\top) = 2\operatorname{tr}\big( (\widehat A_L \widehat C_T + \varepsilon_T - A_0 \widehat C_T) \delta^\top \big) - 2\operatorname{tr}(\varepsilon_T \delta^\top) - \operatorname{tr}(\delta \widehat C_T \delta^\top). \qquad (6.4)$

The Dantzig constraint (2.9) implies the inequality

$\big| \operatorname{tr}\big( (\widehat A_D \widehat C_T + \varepsilon_T - A_0 \widehat C_T) \delta^\top \big) \big| \le \|\widehat A_D \widehat C_T + \varepsilon_T - A_0 \widehat C_T\|_\infty\, \|\delta\|_1 \le \lambda \|\delta\|_1,$

and the same inequality holds with $\widehat A_D$ replaced by $\widehat A_L$. On $E(s_0, 1)$ we have

$\big| \operatorname{tr}(\varepsilon_T \delta^\top) \big| \le \lambda \|\delta\|_1.$

Furthermore, on $\{\|\widehat A_L\|_0 \le s_0\}$ it holds that $\delta \in C(s_0, 1)$ and we conclude from Theorem 3.3 that

$\operatorname{tr}(\delta \widehat C_T \delta^\top) \ge \frac{k_\infty}{2}\, \|\delta\|_2^2.$

We also have $\|\delta\|_1 \le 2 \|\delta_{|\operatorname{supp}(\widehat A_L)}\|_1 \le 2 \|\widehat A_L\|_0^{1/2}\, \|\delta\|_2$. Observing the first identity of (6.4), putting the previous estimates together and using the inequality $2xy \le a x^2 + y^2/a$ for $a > 0$, we obtain the inequality

$\|(\widehat A_D - A_0)X\|^2_{L^2} - \|(\widehat A_L - A_0)X\|^2_{L^2} \le \frac{18}{k_\infty}\, \|\widehat A_L\|_0\, \lambda^2.$

On the other hand, applying the second identity of (6.4), we deduce that

$\|(\widehat A_L - A_0)X\|^2_{L^2} - \|(\widehat A_D - A_0)X\|^2_{L^2} \le \frac{18}{k_\infty}\, \|\widehat A_L\|_0\, \lambda^2,$

which completes the proof.

In this subsection we present two results that can be easily deduced from Lemmas F.1, F.2 and F.3 in the supplementary material of [1]. We state their proofs for the sake of completeness.
Lemma 6.1.
It holds that

$\sup_{V \in C(s,c)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2} \le (c+2)^2 \sup_{V \in K(2s)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2}.$

Proof.
First, recall the definition of the set $C(s,c)$ in (2.1) and denote the balls by $B_q(r) := \{ v \in \mathbb R^{d^2} : \|v\|_q \le r \}$ for $q \ge 0$, $r > 0$. Furthermore, we introduce the notation $K(s) = B_0(s) \cap B_2(1)$ for $s \ge 1$. For any set $P$ we denote its closure and convex hull by $\operatorname{cl}(P)$ and $\operatorname{conv}(P)$, respectively. By a direct application of Lemma F.1 from [1], we obtain the following approximation of cone sets by sparse sets: for any $S \subset \{1, \dots, d^2\}$ with $|S| = s$ we get

$C(s,c) \cap B_2(1) \subseteq B_1\big( (c+1)\sqrt{s} \big) \cap B_2(1) \subseteq (c+2)\, \operatorname{cl}(\operatorname{conv}(K(s))). \qquad (6.5)$

Next, by the statement of Lemma F.3 in [1] we have that

$\sup_{V \in \operatorname{cl}(\operatorname{conv}(K(s)))} |\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)| \le \sup_{V \in K(2s)} |\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|. \qquad (6.6)$

Thus, (6.5) combined with (6.6) yields the proof.

Lemma 6.2.
Let $v = \operatorname{vec}(V) \in \mathbb R^{d^2}$ and $D_C = \operatorname{id} \otimes (C_\infty - \widehat C_T) \in \mathbb R^{d^2 \times d^2}$. Then it holds that

$\mathbb P\Big( \sup_{v \in \mathbb R^{d^2} \setminus \{0\}:\, \|v\|_0 \le s} \frac{|v^\top D_C v|}{\|v\|_2^2} \ge x \Big) \le d \Big( \frac{e d^2}{s} \Big)^s \exp(-T H(x)),$

where the function $H$ has been introduced in Proposition 3.2.

Proof. Choose $U \subset \{1, \dots, d^2\}$ with $|U| = s$, and define

$S_U = \big\{ v \in \mathbb R^{d^2} : \|v\|_2 \le 1,\ \operatorname{supp}(v) \subseteq U \big\}.$

Then $K(s) = \bigcup_{|U| \le s} S_U$. In what follows, we choose $\mathcal A = \{u_1, \dots, u_m\}$, which is a $(1/10)$-net of $S_U$. Lemma 3.5 of [26] guarantees that $|\mathcal A| \le 21^s$. Next, notice that for every $v \in S_U$ there exists some $u_i \in \mathcal A$ such that $\|\Delta v\|_2 \le 1/10$, where $\Delta v = v - u_i$. Then it holds

$\gamma := \sup_{v \in S_U} |v^\top D_C v| \le \max_i |u_i^\top D_C u_i| + 2 \sup_{v \in S_U} \big| \max_i u_i^\top D_C (\Delta v) \big| + \sup_{v \in S_U} |(\Delta v)^\top D_C (\Delta v)|.$

Next, we use the fact that $10(\Delta v) \in S_U$, which gives in consequence

$\sup_{v \in S_U} |(\Delta v)^\top D_C (\Delta v)| \le \frac{\gamma}{100}$

and

$2 \sup_{v \in S_U} \big| \max_i u_i^\top D_C (\Delta v) \big| \le \frac{1}{10} \sup_{v \in S_U} \big| (u_i + 10\Delta v)^\top D_C (u_i + 10\Delta v) \big| + \frac{1}{10} \sup_{v \in S_U} |u_i^\top D_C u_i| + \frac{1}{10} \sup_{v \in S_U} \big| (10\Delta v)^\top D_C (10\Delta v) \big| \le \frac{4}{10}\gamma + \frac{1}{10}\gamma + \frac{1}{10}\gamma,$

which implies that $\gamma \le 3 \max_i |u_i^\top D_C u_i|$. Now, we take a union bound over all $u_i \in \mathcal A$ and combine it with inequality (3.1) from Proposition 3.2. Thus,

$\mathbb P\Big( \sup_{v \in S_U} |v^\top D_C v| \ge x \Big) \le d \exp(-T H(x) + s \log 21).$

Next, we take another union bound over $\binom{d^2}{s} \le \big( \frac{e d^2}{s} \big)^s$ choices of $U$. Thus,

$\mathbb P\Big( \sup_{v \in \mathbb R^{d^2} \setminus \{0\}:\, \|v\|_0 \le s} \frac{|v^\top D_C v|}{\|v\|_2^2} \ge x \Big) \le d \Big( \frac{e d^2}{s} \Big)^s \exp(-T H(x)),$

which yields the proof.

References

[1] S. Basu and G. Michailidis (2015): Regularized estimation in sparse high-dimensional time series models.
Annals of Statistics.

[2] P.J. Bickel, Y. Ritov and A.B. Tsybakov (2009): Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics 37, 1705–1732.

[3] F. Bolley, J.A. Cañizo and J.A. Carrillo (2011): Stochastic mean-field limit: non-Lipschitz forces and swarming. Mathematical Models and Methods in Applied Sciences.

[4] P. Bühlmann and S. van de Geer (2011): Statistics for High-Dimensional Data. Springer Series in Statistics, Springer.

[5] E. Candes and T. Tao (2007): The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics 35(6), 2313–2351.

[6] R. Carmona and X. Zhu (2016): A probabilistic approach to mean field games with major and minor players. Annals of Applied Probability.

[7] Econometric Theory 28, 838–860.

[8] L. Dicker, Y. Li and S.D. Zhao (2014): The Dantzig selector for censored linear regression models. Statistica Sinica.

[9] Frontiers in Computational Neuroscience 3, 1–28.

[10] K. Fujimori (2019): The Dantzig selector for a linear model of diffusion processes. Statistical Inference for Stochastic Processes 22, 475–498.

[11] S. Gaïffas and G. Matulewicz (2019): Sparse inference of the drift of a high-dimensional Ornstein-Uhlenbeck process. Journal of Multivariate Analysis.

[12] M.O. Jackson (2008): Social and Economic Networks. Princeton, NJ: Princeton University Press.

[13] J. Jacod and P. Protter (2012): Discretization of Processes. Stochastic Modelling and Applied Probability, Springer.

[14] G. James, P. Radchenko and J. Lv (2009): DASSO: connections between the Dantzig selector and Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

[15] Scandinavian Journal of Statistics.

[16] U. Küchler and M. Sørensen (1997): Exponential Families of Stochastic Processes. Springer Series in Statistics, Springer.

[17] U. Küchler and M. Sørensen (1999): A note on limit theorems for multivariate martingales. Bernoulli.

[18] Y.A. Kutoyants (2004): Statistical Inference for Ergodic Diffusion Processes. Springer Series in Statistics, Springer.

[19] H.P. McKean (1966): Speed of approach to equilibrium for Kac's caricature of a Maxwellian gas. Archive for Rational Mechanics and Analysis.

[20] H.P. McKean (1967): Propagation of chaos for a class of non-linear parabolic equations. In Stochastic Differential Equations (Lecture Series in Differential Equations, Session 7, Catholic University), 41–57. Air Force Office of Scientific Research, Arlington.

[21] I. Nourdin and F.G. Viens (2009): Density formula and concentration inequalities with Malliavin calculus. Electronic Journal of Probability 14, 2287–2309.

[22] D. Nualart (2006): The Malliavin Calculus and Related Topics. Springer.

[23] IEEE Transactions on Information Theory.

[24] D. Revuz and M. Yor (1999): Continuous Martingales and Brownian Motion. 3rd edition, A Series of Comprehensive Studies in Mathematics, Springer.

[25] A.-S. Sznitman (1991): Topics in propagation of chaos. In P.-L. Hennequin, editor, École d'Été de Probabilités de Saint-Flour XIX – 1989, volume 1464 of Lecture Notes in Mathematics, Springer, Berlin, 165–251.

[26] R. Vershynin (2009).