On Dantzig and Lasso estimators of the drift in a high dimensional Ornstein-Uhlenbeck model
Gabriela Ciołek, Dmytro Marushkevych, Mark Podolskij

August 4, 2020
Abstract
In this paper we present new theoretical results for the Dantzig and Lasso estimators of the drift in a high dimensional Ornstein-Uhlenbeck model under sparsity constraints. Our focus is on oracle inequalities for both estimators and error bounds with respect to several norms. In the context of the Lasso estimator our paper is strongly related to [11], who investigated the same problem under row sparsity. We improve their rates and also prove the restricted eigenvalue property solely under the ergodicity assumption on the model. Finally, we present a numerical analysis to uncover the finite sample performance of the Dantzig and Lasso estimators.
Key words: Dantzig estimator, high dimensional statistics, Lasso, Ornstein-Uhlenbeck process, parametric estimation.
AMS 2010 subject classifications.
During the past decades immense progress has been achieved in statistics for stochastic processes. Nowadays, comprehensive studies on statistical inference for diffusion processes under low and high frequency observation schemes can be found in the monographs [13, 16, 18]. Most of the existing literature considers a fixed dimensional parameter space, while the high dimensional framework has received much less attention in the diffusion setting.

Since the pioneering work of McKean [19, 20], high dimensional diffusions entered the scene in the context of modelling the movement of gas particles. More recently, they have found numerous applications in economics and biology, among other disciplines [3, 6, 9]. Typically, high dimensional diffusions are studied in the framework of mean field theory, which aims at bridging the interaction of particles at the microscopic scale and the mesoscopic features of the system (see e.g. [25] for a mathematical study). In physics particles are often assumed to be statistically equal, but this homogeneity assumption is not appropriate in other applications. For instance, in [6] high dimensional SDEs are used to model the wealth of trading agents in an economy, who are often far from being equal in their trading behaviour. Another example is the flocking phenomenon of individuals [3], where it seems natural to assume that there are only very few “leaders” who have a distinguished role in the community.

∗ The authors gratefully acknowledge financial support of ERC Consolidator Grant 815703 “STAMFORD: Statistical Methods for High Dimensional Diffusions”.
† Department of Mathematics, University of Luxembourg. E-mail: [email protected].
‡ Department of Mathematics, University of Luxembourg. E-mail: [email protected].
§ Department of Mathematics, University of Luxembourg. E-mail: [email protected].
These examples motivate the investigation of statistical inference for diffusion processes under sparsity constraints.

This paper focuses on the statistical analysis of a $d$-dimensional Ornstein-Uhlenbeck model of the form
$$ dX_t = -A_0 X_t \, dt + dW_t, \qquad t \geq 0, \qquad (1.1) $$
defined on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \geq 0}, \mathbb{P})$, with underlying observation $(X_t)_{t \in [0,T]}$. Here $W$ denotes a standard $d$-dimensional Brownian motion and $A_0 \in \mathbb{R}^{d \times d}$ represents the unknown interaction matrix. Ornstein-Uhlenbeck processes are one of the most basic parametric diffusion models. When the dimension $d$ is fixed and $T \to \infty$, statistical estimation of the parameter $A_0$ has been discussed in several papers. Asymptotic analysis of the maximum likelihood estimator in the ergodic case can be found in e.g. [16], while investigations of the non-ergodic setting can be found in [15, 17]. Adaptive Lasso estimation for multivariate diffusion models has been investigated in [7].

Our main goal is to study the estimation of $A_0$ under sparsity constraints in the large $d$/large $T$ setting. Such a mathematical problem finds its main motivation in the analysis of the connectedness of banks whose wealth is modelled by the diffusion process $X$. This field of economics, which studies linkages between a large number of banks associated with e.g. asset/liability positions and contractual relationships, is key to understanding systemic risk in a global economy [12]. Typically, the connectivity structure, which is represented by the parameter $A_0$, is quite sparse since only few financial players are significant in an economy, and the main focus is on estimation of the non-zero components of $A_0$.

Theoretical results in the high dimensional diffusion setting are rather scarce. In this context we would like to mention the Dantzig selector, which was introduced in [5] and primarily designed for linear regression models.
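For illustration, a path of the sparse model (1.1) can be generated with a simple Euler-Maruyama scheme. The sketch below is ours, not the paper's code, and the concrete choices of $A_0$, step size and horizon are illustrative assumptions only:

```python
import numpy as np

def simulate_ou(A0, T=50.0, dt=0.01, rng=None):
    """Euler-Maruyama discretisation of dX_t = -A0 X_t dt + dW_t.

    Returns the sampled path X of shape (n_steps + 1, d); this is only
    a numerical approximation of the continuous-time observation in (1.1).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d = A0.shape[0]
    n = int(T / dt)
    X = np.zeros((n + 1, d))
    for k in range(n):
        dW = rng.normal(0.0, np.sqrt(dt), size=d)
        X[k + 1] = X[k] - A0 @ X[k] * dt + dW
    return X

# A sparse interaction matrix whose eigenvalues have strictly positive
# real parts (here guaranteed by the triangular structure), in line with
# the ergodicity assumption of the model.
A0 = np.diag([1.0, 1.2, 0.8, 1.5])
A0[0, 2] = 0.3   # a single off-diagonal interaction
X = simulate_ou(A0)
print(X.shape)   # (5001, 4)
```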
More specifically, [5] established sharp non-asymptotic bounds on the $\ell_2$-error in the estimated coefficients and proved that the error is within a factor of $\log(d)$ of the error that would have been achieved if the locations of the non-zero coefficients were known. Further extensions of the aforementioned results can be found in [10] and [23], which study the Dantzig selector for discretely observed linear diffusions and support recovery for the drift coefficient, respectively. Our work is closely related to the recent article [11], where estimation of $A_0$ under row sparsity has been investigated. The authors propose to use the classical Lasso approach and derive upper and lower bounds for the estimation error. We build upon their analysis and provide oracle inequalities and non-asymptotic theory for the Lasso and Dantzig estimators. In comparison to [11], we obtain an improved upper bound for the Lasso estimator, which essentially matches the theoretical lower bound, and we also show that the restricted eigenvalue property is automatically satisfied under the ergodicity condition on the model (1.1) (in [11] the extra assumption (H4) has been imposed). The latter is proved via Malliavin calculus methods proposed in [21]. Moreover, we show that the Lasso and Dantzig estimators are asymptotically equivalent, which is a well known fact in linear regression models (cf. [2]). Finally, we present a simulation study to uncover the finite sample properties of both estimators.

The paper is organised as follows. Section 2 is devoted to the exposition of the classical estimation theory in the fixed dimensional setting and to the definition of the Lasso and Dantzig estimators. Concentration inequalities for various stochastic terms are derived in Section 3. In particular, we show the restricted eigenvalue property under the ergodicity assumption via Malliavin calculus methods. In Section 4 we present oracle inequalities and error bounds for both estimators.
Numerical simulation results are demonstrated in Section 5. Finally, some proofs are collected in Section 6.

In this subsection we briefly introduce the main notation used throughout the paper. For a vector or a matrix $x$ the transpose of $x$ is denoted by $x^\top$. For $p \geq 1$ and $A \in \mathbb{R}^{d \times d}$, we define the $l_p$-norm as
$$ \|A\|_p := \Big( \sum_{1 \leq i \leq d, \, 1 \leq j \leq d} |A_{ij}|^p \Big)^{1/p}. $$
We denote by $\|A\|_\infty = \lim_{p \to \infty} \|A\|_p$ the maximum norm and set $\|A\|_0 := \sum_{1 \leq i \leq d, \, 1 \leq j \leq d} 1_{\{A_{ij} \neq 0\}}$. We associate to the Frobenius norm $\|\cdot\|_2$ the scalar product
$$ \langle A_1, A_2 \rangle_F := \mathrm{tr}(A_1^\top A_2), \qquad A_1, A_2 \in \mathbb{R}^{d \times d}, $$
where $\mathrm{tr}$ denotes the trace. For a symmetric matrix $A \in \mathbb{R}^{d \times d}$ we write $\lambda_{\max}(A)$, $\lambda_{\min}(A)$ for the largest and the smallest eigenvalue of $A$, respectively. We denote by $\|A\|_{\mathrm{op}} := \sqrt{\lambda_{\max}(A^\top A)}$ the operator norm of $A \in \mathbb{R}^{d \times d}$. For any $J \subset \{1, \ldots, d\} \times \{1, \ldots, d\}$ and $A \in \mathbb{R}^{d \times d}$, the matrix $A|_J$ is defined via $(A|_J)_{ij} := A_{ij} 1_{\{(i,j) \in J\}}$. For a quadratic matrix $A \in \mathbb{R}^{d \times d}$, $\mathrm{diag}(A)$ stands for the diagonal matrix satisfying $\mathrm{diag}(A)_{ii} = A_{ii}$. We also introduce the notation
$$ \mathcal{C}(s, c) := \big\{ A \in \mathbb{R}^{d \times d} \setminus \{0\} : \|A\|_1 \leq (1 + c) \|A|_{\mathcal{I}_s(A)}\|_1 \big\}, \qquad (2.1) $$
where $c > 0$ and $\mathcal{I}_s(A)$ is the set of coordinates of the $s$ largest elements of $A$. Furthermore, $\mathrm{vec}$ denotes the vectorisation operator and $\otimes$ stands for the Kronecker product. For $z \in \mathbb{C}$ we denote by $\mathrm{Re}(z)$ (resp. $\mathrm{Im}(z)$) the real (resp. imaginary) part of $z$. Finally, for stochastic processes $(X_t)_{t \in [0,T]}, (Y_t)_{t \in [0,T]} \in L^2([0,T], dt)$ we introduce the scalar product
$$ \langle X, Y \rangle_{L^2} := \frac{1}{T} \int_0^T X_t Y_t \, dt. $$
We consider the $d$-dimensional Ornstein-Uhlenbeck process introduced in (1.1).
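As a concrete reading of the notation above, the following snippet (a hypothetical helper of ours, not part of the paper) computes the entrywise $l_p$-norm, the restriction $A|_J$, and checks membership in the cone $\mathcal{C}(s, c)$ from (2.1), taking "$s$ largest elements" in absolute value:

```python
import numpy as np

def lp_norm(A, p):
    # Entrywise l_p-norm: (sum_ij |A_ij|^p)^(1/p).
    return (np.abs(A) ** p).sum() ** (1.0 / p)

def restrict(A, J):
    # A|_J keeps the entries indexed by J and sets the others to zero.
    out = np.zeros_like(A)
    for (i, j) in J:
        out[i, j] = A[i, j]
    return out

def in_cone(A, s, c):
    # Membership in C(s, c): ||A||_1 <= (1 + c) * ||A|_{I_s(A)}||_1,
    # with I_s(A) the coordinates of the s largest entries (in abs. value).
    idx = np.argsort(np.abs(A).ravel())[::-1][:s]
    J = [np.unravel_index(k, A.shape) for k in idx]
    return lp_norm(A, 1) <= (1 + c) * lp_norm(restrict(A, J), 1)

A = np.array([[3.0, 0.0], [0.1, -2.0]])
print(lp_norm(A, 1))           # 5.1
print(int((A != 0).sum()))     # ||A||_0 = 3
print(in_cone(A, s=2, c=1.0))  # True: the two largest entries carry most mass
```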
Throughout this paper the matrix $A_0$ is assumed to satisfy the following condition:

(H) The matrix $A_0$ is diagonalisable with eigenvalues $\theta_1, \ldots, \theta_d \in \mathbb{C}$, i.e.
$$ A_0 = P \, \mathrm{diag}(\theta_1, \ldots, \theta_d) P^{-1}, $$
where the column vectors of $P$ are eigenvectors of $A_0$. Furthermore, the eigenvalues $\theta_1, \ldots, \theta_d \in \mathbb{C}$ have strictly positive real parts:
$$ r_0 := \min_{1 \leq j \leq d} \mathrm{Re}(\theta_j) > 0. \qquad (2.2) $$
It is well known that under condition (H) the stochastic differential equation (1.1) exhibits a unique stationary solution, which can be written explicitly as
$$ X_t = \int_{-\infty}^t \exp(-(t - s) A_0) \, dW_s. $$
In this case we have that $X_t \sim \mathcal{N}(0, C_\infty)$ with
$$ C_\infty := \int_0^\infty \exp(-s A_0) \exp(-s A_0^\top) \, ds. $$
We assume that the complete path $(X_t)_{t \in [0,T]}$ is observed and we are interested in estimating the unknown parameter $A_0$. Let us briefly recall the classical maximum likelihood theory when $d$ is fixed and $T \to \infty$. When $\mathbb{P}_A^T$ denotes the law of the process (1.1) with transition matrix $A$ restricted to $\mathcal{F}_T$, the log-likelihood function is explicitly computed via Girsanov's theorem as
$$ \log\big( d\mathbb{P}_A^T / d\mathbb{P}_0^T \big) = -\int_0^T (A X_t)^\top dX_t - \frac{1}{2} \int_0^T (A X_t)^\top (A X_t) \, dt. \qquad (2.3) $$
Consequently, the maximum likelihood estimator $\widehat{A}_{\mathrm{ML}}$ is given by
$$ \widehat{A}_{\mathrm{ML}} = -\Big( \int_0^T dX_t X_t^\top \Big) \Big( \int_0^T X_t X_t^\top \, dt \Big)^{-1}. $$
Under condition (H) the estimator $\widehat{A}_{\mathrm{ML}}$ is asymptotically normal, i.e.
$$ \sqrt{T} \big( \mathrm{vec}(\widehat{A}_{\mathrm{ML}}) - \mathrm{vec}(A_0) \big) \stackrel{d}{\longrightarrow} \mathcal{N}_{d^2}\big( 0, C_\infty^{-1} \otimes \mathrm{id} \big) \qquad (2.4) $$
with $\mathrm{id}$ denoting the $d$-dimensional identity matrix. Indeed, we have the identity $\widehat{A}_{\mathrm{ML}} - A_0 = -\varepsilon_T \widehat{C}_T^{-1}$ with
$$ \varepsilon_T := \frac{1}{T} \int_0^T dW_t X_t^\top \quad \text{and} \quad \widehat{C}_T := \frac{1}{T} \int_0^T X_t X_t^\top \, dt \stackrel{\text{a.s.}}{\longrightarrow} C_\infty, \qquad (2.5) $$
and the result (2.4) follows from the standard martingale central limit theorem. We refer to [16, p.
120–124] for a more detailed exposition.

When assumption (H) is violated the asymptotic theory for the maximum likelihood estimator $\widehat{A}_{\mathrm{ML}}$ is more complex: the setting becomes non-ergodic if some eigenvalues satisfy $\mathrm{Re}(\theta_i) < 0$, and yet another regime appears when $\mathrm{Re}(\theta_i) = 0$ for some $i$'s.

Now we turn our attention to the large $d$/large $T$ setting. We consider the Ornstein-Uhlenbeck model (1.1) satisfying assumption (H) and assume that the unknown transition matrix $A_0$ satisfies the sparsity constraint
$$ \|A_0\|_0 \leq s_0. \qquad (2.6) $$
We remark that due to condition (2.2) it must necessarily hold that $s_0 \geq d$. A standard approach to estimate $A_0$ under the sparsity constraint (2.6) is the Lasso method, which has been investigated in [11] in the framework of an Ornstein-Uhlenbeck model. The Lasso estimator is defined as
$$ \widehat{A}_L := \operatorname*{argmin}_{A \in \mathbb{R}^{d \times d}} \big( \mathcal{L}_T(A) + \lambda \|A\|_1 \big) \quad \text{with} \quad \mathcal{L}_T(A) := -\frac{1}{T} \log\big( d\mathbb{P}_A^T / d\mathbb{P}_0^T \big), \qquad (2.7) $$
where $\lambda > 0$ is a tuning parameter. The estimator $\widehat{A}_L$ can be computed efficiently, since it is a solution of a convex optimisation problem.

Next, we introduce the Dantzig estimator of the parameter $A_0$. According to (2.3) the quantity $\mathcal{L}_T(A)$ can be written as
$$ \mathcal{L}_T(A) = \mathrm{tr}\Big( (\varepsilon_T - A_0 \widehat{C}_T) A^\top + \frac{1}{2} A \widehat{C}_T A^\top \Big) \quad \text{and hence} \quad \nabla \mathcal{L}_T(A) = \varepsilon_T - A_0 \widehat{C}_T + A \widehat{C}_T. \qquad (2.8) $$
We recall that $B_0$ belongs to the subdifferential of a convex function $f: \mathbb{R}^{d \times d} \to \mathbb{R}$ at point $B$, $B_0 \in \partial f(B)$, if $\langle B_0, A - B \rangle_F \leq f(A) - f(B)$ for all $A \in \mathbb{R}^{d \times d}$. In particular, any $B_0 \in \partial \|B\|_1$ satisfies the constraint $\|B_0\|_\infty \leq$
1. A necessary and sufficient condition for the minimiser in (2.7) is that $0$ belongs to the subdifferential of the function $A \mapsto \mathcal{L}_T(A) + \lambda \|A\|_1$. This implies that the Lasso estimator $\widehat{A}_L$ satisfies the constraint
$$ \big\| \widehat{A}_L \widehat{C}_T + \varepsilon_T - A_0 \widehat{C}_T \big\|_\infty \leq \lambda. \qquad (2.9) $$
Now, the Dantzig estimator $\widehat{A}_D$ of the parameter $A_0$ is defined as a matrix with the smallest $l_1$-norm that satisfies the inequality (2.9), i.e.
$$ \widehat{A}_D := \operatorname*{argmin}_{A \in \mathbb{R}^{d \times d}} \big\{ \|A\|_1 : \big\| A \widehat{C}_T + \varepsilon_T - A_0 \widehat{C}_T \big\|_\infty \leq \lambda \big\}. \qquad (2.10) $$
By definition of the Dantzig estimator we have that $\|\widehat{A}_D\|_1 \leq \|\widehat{A}_L\|_1$. In particular, when the tuning parameters $\lambda$ for the Lasso and Dantzig estimators are set to be the same, the Lasso estimate is always a feasible solution of the Dantzig selector minimisation problem, although it need not be the optimal one. This implies that, when the respective solutions are not identical, the Dantzig selector solution is sparser (in the $l_1$-norm) than the Lasso solution (see [14], Appendix A for details). From the computational point of view, the Dantzig estimator can be found numerically via linear programming for convex optimisation with constraints.

The following basic inequality, which is a direct consequence of the fact that $\mathcal{L}_T(\widehat{A}_L) + \lambda \|\widehat{A}_L\|_1 \leq \mathcal{L}_T(A) + \lambda \|A\|_1$ for all $A \in \mathbb{R}^{d \times d}$, provides the necessary basis for the analysis of the error $\widehat{A}_L - A_0$.

Lemma 2.1. ([11, Lemma 3]) For any $A \in \mathbb{R}^{d \times d}$ and $\lambda > 0$ it holds that
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 - \|(A - A_0) X\|_{L^2}^2 \leq 2 \langle \varepsilon_T, A - \widehat{A}_L \rangle_F - \|(A - \widehat{A}_L) X\|_{L^2}^2 + 2\lambda \big( \|A\|_1 - \|\widehat{A}_L\|_1 \big), $$
where the quantity $\varepsilon_T$ is defined in (2.5).
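Since (2.7) is convex, the Lasso estimator can be approximated from a discretely sampled path by proximal gradient descent (ISTA) with entrywise soft-thresholding. The sketch below is our illustration, not the paper's implementation: with the discretised statistics $B \approx \frac{1}{T}\int_0^T dX_t X_t^\top = \varepsilon_T - A_0 \widehat{C}_T$ and $\widehat{C}_T \approx \frac{1}{T}\int_0^T X_t X_t^\top dt$, the gradient is $\nabla \mathcal{L}_T(A) = B + A \widehat{C}_T$ as in (2.8), and at the minimiser the optimality condition forces the empirical form of the Dantzig constraint (2.9), $\|A \widehat{C}_T + B\|_\infty \leq \lambda$. All numerical choices (dimension, step size, $\lambda$) are assumptions for the demo:

```python
import numpy as np

def soft_threshold(M, t):
    # Proximal operator of t * ||.||_1 (entrywise soft-thresholding).
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def lasso_ou(X, dt, lam, n_iter=500):
    """ISTA for argmin_A  L_T(A) + lam * ||A||_1  from a sampled path X.

    Uses the discretised sufficient statistics
        B ~ (1/T) int dX_t X_t^T,   C ~ (1/T) int X_t X_t^T dt,
    so that grad L_T(A) = B + A C.  Illustrative sketch only.
    """
    T = (X.shape[0] - 1) * dt
    dX = np.diff(X, axis=0)
    B = dX.T @ X[:-1] / T
    C = (X[:-1].T @ X[:-1]) * dt / T
    eta = 1.0 / np.linalg.eigvalsh(C).max()   # step size 1 / ||C||_op
    A = np.zeros_like(C)
    for _ in range(n_iter):
        A = soft_threshold(A - eta * (B + A @ C), eta * lam)
    return A, B, C

rng = np.random.default_rng(1)
d, dt, n = 4, 0.01, 20000
A0 = np.diag([1.0, 1.2, 0.8, 1.5]); A0[0, 2] = 0.3   # sparse truth
X = np.zeros((n + 1, d))
for k in range(n):
    X[k + 1] = X[k] - A0 @ X[k] * dt + rng.normal(0, np.sqrt(dt), d)

lam = 0.1
A_hat, B, C = lasso_ou(X, dt, lam)
# The Lasso solution is feasible for the (empirical) Dantzig constraint:
print(np.abs(A_hat @ C + B).max() <= lam * 1.01)
```

This numerically illustrates the remark above: for a common $\lambda$, the Lasso estimate always satisfies the Dantzig constraint, so the Dantzig selector can only have smaller $l_1$-norm.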
From Lemma 2.1 it is obvious that we require good control over the martingale term $\langle \varepsilon_T, V \rangle_F$ for certain matrices $V \in \mathbb{R}^{d \times d}$ to get an upper bound on the prediction error $\|(\widehat{A}_L - A_0) X\|_{L^2}$. Another important ingredient is the restricted eigenvalue property, which is a standard requirement in the analysis of Lasso estimators (see e.g. [2, 4]). In our setting the restricted eigenvalue property amounts to showing that
$$ \inf_{V \in \mathcal{C}(s, c)} \frac{\|V X\|_{L^2}}{\|V\|_2} $$
is bounded away from $0$ with high probability. Interestingly, the latter is a consequence of the model assumption (H) and not an extra condition as in the framework of linear regression. This has been noticed in [11], but an additional condition (H4) was required, which is in fact not needed as we will show in the next section.

In order to establish the connection between the Dantzig and the Lasso estimators we will show the inequality
$$ \Big| \|(\widehat{A}_D - A_0) X\|_{L^2}^2 - \|(\widehat{A}_L - A_0) X\|_{L^2}^2 \Big| \leq c \|\widehat{A}_L\|_0 \lambda^2 $$
for a certain constant $c >$
0, which holds with high probability. Once the term $\|\widehat{A}_L\|_0$ is controlled, we deduce statements about the error term $\widehat{A}_D - A_0$ via the corresponding analysis of $\widehat{A}_L - A_0$.

In this section we derive various concentration inequalities, which play a central role in the analysis of the estimators $\widehat{A}_L$ and $\widehat{A}_D$.

This subsection is devoted to the proof of the restricted eigenvalue property. The main result of this subsection relies heavily on some theoretical techniques presented in [21], where Malliavin calculus is applied in order to obtain tail bounds for certain functionals of Gaussian processes. In the following, we introduce some basic notions of Malliavin calculus; we refer to the monograph [22] for a more detailed exposition.

Let $\mathfrak{H}$ be a real separable Hilbert space. We denote by $B = \{B(h) : h \in \mathfrak{H}\}$ an isonormal Gaussian process over $\mathfrak{H}$. That is, $B$ is a centred Gaussian family with covariance kernel given by $\mathbb{E}[B(h_1) B(h_2)] = \langle h_1, h_2 \rangle_{\mathfrak{H}}$. We shall use the notation $L^2(B) = L^2(\Omega, \sigma(B), \mathbb{P})$. For every $q \geq$
1, we write $\mathfrak{H}^{\otimes q}$ to indicate the $q$th tensor product of $\mathfrak{H}$; $\mathfrak{H}^{\odot q}$ stands for the symmetric $q$th tensor product. We denote by $I_q$ the isometry between $\mathfrak{H}^{\odot q}$ and the $q$th Wiener chaos of $B$. It is well known (see e.g. [22, Chapter 1]) that any random variable $F \in L^2(B)$ admits the chaotic expansion
$$ F = \sum_{q=0}^\infty I_q(f_q), \qquad I_0(f_0) := \mathbb{E}[F], $$
where the series converges in $L^2$ and the kernels $f_q \in \mathfrak{H}^{\odot q}$ are uniquely determined by $F$. The operator $L$, called the generator of the Ornstein-Uhlenbeck semigroup, is defined as
$$ LF := -\sum_{q=1}^\infty q I_q(f_q) $$
whenever the latter series converges in $L^2$. The pseudo-inverse $L^{-1}$ of $L$ is defined by $L^{-1} F = -\sum_{q=1}^\infty q^{-1} I_q(f_q)$.

Next, let us denote by $\mathcal{S}$ the set of all smooth cylindrical random variables of the form $F = f(B(h_1), \ldots, B(h_n))$, where $n \geq 1$, $f: \mathbb{R}^n \to \mathbb{R}$ is a $C^\infty$-function with compact support and $h_i \in \mathfrak{H}$. The Malliavin derivative $DF$ of $F$ is defined as
$$ DF := \sum_{i=1}^n \frac{\partial f}{\partial x_i}\big( B(h_1), \ldots, B(h_n) \big) h_i. $$
The space $\mathbb{D}^{1,2}$ denotes the closure of $\mathcal{S}$ with respect to the norm $\|F\|_{1,2}^2 := \mathbb{E}[F^2] + \mathbb{E}[\|DF\|_{\mathfrak{H}}^2]$. The Malliavin derivative $D$ verifies the following chain rule: when $\varphi: \mathbb{R}^n \to \mathbb{R}$ is in $C_b^1$ (the set of continuously differentiable functions with bounded partial derivatives) and $(F_i)_{i=1,\ldots,n}$ is a vector of elements in $\mathbb{D}^{1,2}$, then $\varphi(F_1, \ldots, F_n) \in \mathbb{D}^{1,2}$ and
$$ D\varphi(F_1, \ldots, F_n) = \sum_{i=1}^n \frac{\partial \varphi}{\partial x_i}(F_1, \ldots, F_n) DF_i. $$
The next theorem establishes left and right tail bounds for certain elements $Z \in \mathbb{D}^{1,2}$.

Theorem 3.1. ([21, Theorem 4.1]) Assume that $Z \in \mathbb{D}^{1,2}$ and define the function
$$ g_Z(z) := \mathbb{E}\big[ \langle DZ, -DL^{-1} Z \rangle_{\mathfrak{H}} \,\big|\, Z = z \big]. $$
Suppose that the following conditions hold for some $\alpha \geq 0$ and $\beta > 0$:

(i) $g_Z(Z) \leq \alpha Z + \beta$ holds $\mathbb{P}$-almost surely,
(ii) the law of $Z$ has a Lebesgue density.

Then, for any $z > 0$, it holds that
$$ \mathbb{P}(Z \geq z) \leq \exp\Big( -\frac{z^2}{2\alpha z + 2\beta} \Big) \quad \text{and} \quad \mathbb{P}(Z \leq -z) \leq \exp\Big( -\frac{z^2}{2\beta} \Big). $$
Now, we apply Theorem 3.1 to certain quadratic forms of the Ornstein-Uhlenbeck process $X$. The following result is crucial for proving the restricted eigenvalue property.

Proposition 3.2.
Suppose that assumption (H) is satisfied and let $\widehat{C}_T$ be defined as in (2.5). Then it holds for all $x > 0$:
$$ \sup_{v \in \mathbb{R}^d : \|v\|_2 = 1} \mathbb{P}\big( |v^\top (\widehat{C}_T - C_\infty) v| \geq x \big) \leq 2 \exp(-T H(x)), \qquad (3.1) $$
where the function $H$ is defined as
$$ H(x) = \frac{r_0 x^2}{8 p K_\infty (x + K_\infty)} $$
with $K_\infty = \lambda_{\max}(C_\infty)$ and $p = \|P\|_{\mathrm{op}} \|P^{-1}\|_{\mathrm{op}}$, and the quantities $P$ and $r_0$ are introduced in assumption (H).

Proof. We define the centred stationary Gaussian process $Y_t^v = v^\top X_t$ and note that its covariance kernel is given by $\mathbb{E}[Y_t^v Y_s^v] = \rho_v(|t - s|)$ with $\rho_v(r) := v^\top \exp(-r A_0) C_\infty v$. By submultiplicativity of the operator norm we conclude that
$$ |\rho_v(r)| \leq \|\exp(-r A_0)\|_{\mathrm{op}} \|C_\infty\|_{\mathrm{op}} \leq \exp(-r r_0) \, p K_\infty. $$
We observe that $(Y_t^v)_{t \in [0,T]}$ can be considered as an isonormal Gaussian process indexed by a separable Hilbert space $\mathfrak{H}$ whose scalar product is induced by the covariance kernel of $(Y_t^v)_{t \in [0,T]}$. In particular, we can write $Y_t^v = B(h_t)$ and $\langle h_t, h_s \rangle_{\mathfrak{H}} = \rho_v(|t - s|)$. We introduce the quantity
$$ Z_T^v := v^\top (\widehat{C}_T - C_\infty) v = \frac{1}{T} \int_0^T \big( (Y_t^v)^2 - \mathbb{E}[(Y_t^v)^2] \big) \, dt $$
and notice that $Z_T^v$ is an element of the second order Wiener chaos. Hence, $Z_T^v$ has a Lebesgue density and we have $L^{-1} Z_T^v = -Z_T^v /$
2, and we conclude by the chain rule that
$$ \langle DZ_T^v, -DL^{-1} Z_T^v \rangle_{\mathfrak{H}} = \frac{1}{2} \|DZ_T^v\|_{\mathfrak{H}}^2 \leq \frac{2}{T^2} \int_0^T \int_0^T |Y_t^v Y_s^v| \, |\rho_v(t - s)| \, dt \, ds \leq \frac{2}{T^2} \int_0^T \int_0^T (Y_t^v)^2 |\rho_v(t - s)| \, dt \, ds $$
$$ \leq \frac{4}{T} \int_0^\infty |\rho_v(r)| \, dr \, \big( Z_T^v + \rho_v(0) \big) \leq \frac{4}{T} p K_\infty \int_0^\infty \exp(-r r_0) \, dr \, \big( Z_T^v + K_\infty \big) = \frac{4 p K_\infty}{T r_0} \big( Z_T^v + K_\infty \big). $$
Consequently, the conditions of Theorem 3.1 are satisfied with $\alpha = \frac{4 p K_\infty}{T r_0}$ and $\beta = \frac{4 p K_\infty^2}{T r_0}$, which completes the proof of Proposition 3.2 since $\mathbb{P}(|Z_T^v| \geq x) = \mathbb{P}(Z_T^v \geq x) + \mathbb{P}(Z_T^v \leq -x)$. □

The statement of Proposition 3.2 corresponds to assumption (H4) in [11], which has been shown to be valid via a log-Sobolev inequality only when $A_0$ is symmetric (cf. the corresponding theorem in [11]). In other words, the extra assumption (H4) is not required as it directly follows from the modelling setup.

The next theorem proves the restricted eigenvalue property.

Theorem 3.3.
Suppose that assumption (H) is satisfied and define $k_\infty := \lambda_{\min}(C_\infty) > 0$. Then for any $\epsilon \in (0, 1)$ it holds that
$$ \mathbb{P}\Big( \inf_{V \in \mathcal{C}(s,c)} \frac{\|V X\|_{L^2}}{\|V\|_2} \geq \sqrt{\frac{k_\infty}{2}} \Big) \geq 1 - \epsilon, $$
for all
$$ T \geq T(\epsilon, s, c) := T_0 \Big( (4s + 1) \log d - s \big( \log 2s - 1 \big) + \log \frac{2}{\epsilon} \Big), $$
where the constant $T_0$ is defined as
$$ T_0 = \frac{144 \, p K_\infty (c + 2)^2 \big( k_\infty + 18 (c + 2)^2 K_\infty \big)}{r_0 k_\infty^2}. $$
Proof.
See Section 6.1. □

The next corollary presents a deviation bound for the quantity $\widehat{C}_T$.

Corollary 3.4.
For any $\epsilon > 0$ and $T \geq T(\epsilon, s, c)$ it holds that
$$ \mathbb{P}\Big( \inf_{V \in \mathcal{C}(s,c)} \frac{\|V X\|_{L^2}}{\|V\|_2} \geq \sqrt{\frac{k_\infty}{2}}, \ \|\mathrm{diag}\,\widehat{C}_T\|_\infty \leq m_\infty + k_\infty, \ \|\widehat{C}_T\|_\infty \leq M_\infty + \frac{3 k_\infty}{2} \Big) \geq 1 - \epsilon, $$
where $m_\infty := \|\mathrm{diag}\, C_\infty\|_\infty$ and $M_\infty := \|C_\infty\|_\infty$.

Proof. See Section 6.2. □

As mentioned earlier, controlling the stochastic term $\langle \varepsilon_T, V \rangle_F$ for matrices $V \in \mathbb{R}^{d \times d}$ is crucial for the analysis of the estimators $\widehat{A}_L$ and $\widehat{A}_D$. The martingale property of $\varepsilon_T$ turns out to be the key in the next proposition. We remark that the following result is an improvement of [11, Theorem 8].

Proposition 3.5.
For any $\epsilon \in (0, 1)$ the following inequality holds:
$$ \mathbb{P}\Bigg( \sup_{V \in \mathbb{R}^{d \times d}, V \neq 0} \frac{\langle \varepsilon_T, V \rangle_F}{\|V\|_1} \geq \mu \Bigg) \leq \epsilon $$
for any
$$ T \geq \frac{8 p K_\infty (k_\infty + 6 K_\infty)}{r_0 k_\infty^2} \big( (2s + 1) \ln d - s (\ln s - 1) + \ln(4/\epsilon) \big) $$
and
$$ \mu \geq \sqrt{\frac{2 (m_\infty + k_\infty) \ln(2 d^2/\epsilon)}{T}}. $$
Proof.
We first recall Bernstein's inequality for continuous local martingales. Let $(M_t)_{t \geq 0}$ be a real-valued continuous local martingale with quadratic variation $(\langle M \rangle_t)_{t \geq 0}$. Then for any $a, b > 0$:
$$ \mathbb{P}\big( M_t \geq a, \ \langle M \rangle_t \leq b \big) \leq \exp\big( -a^2 / (2b) \big). \qquad (3.2) $$
This result is a straightforward consequence of the exponential martingale technique (cf. Chapter 4, Exercise 3.16 in [24]).

By definition, $T \varepsilon_T^{ij} = \int_0^T X_t^j \, dW_t^i$ is a continuous martingale with quadratic variation $T \widehat{C}_T^{jj}$. Therefore, we obtain by Corollary 3.4 and (3.2):
$$ \mathbb{P}\Bigg( \sup_{V \neq 0} \frac{\langle \varepsilon_T, V \rangle_F}{\|V\|_1} \geq \mu \Bigg) \leq \mathbb{P}\big( \|\mathrm{diag}\,\widehat{C}_T\|_\infty > m_\infty + k_\infty \big) + \mathbb{P}\Bigg( \sup_{V \neq 0} \frac{\langle \varepsilon_T, V \rangle_F}{\|V\|_1} \geq \mu, \ \|\mathrm{diag}\,\widehat{C}_T\|_\infty \leq m_\infty + k_\infty \Bigg) $$
$$ \leq \sum_{i,j=1}^d \mathbb{P}\Big( \varepsilon_T^{ij} \geq \mu, \ \widehat{C}_T^{jj} \leq m_\infty + k_\infty \Big) + \frac{\epsilon}{2} \leq d^2 \exp\Big( -\frac{T \mu^2}{2 (m_\infty + k_\infty)} \Big) + \frac{\epsilon}{2} \leq \epsilon, $$
which completes the proof. □

Summarising all previous deviation bounds we obtain the following result.

Corollary 3.6.
For $s \geq s_0$ and $c > 0$ define the event
$$ \mathcal{E}(s, c) := \Big\{ \inf_{V \in \mathcal{C}(s,c)} \frac{\|V X\|_{L^2}}{\|V\|_2} \geq \sqrt{\frac{k_\infty}{2}} \Big\} \bigcap \Big\{ \sup_{V \neq 0} \frac{\langle \varepsilon_T, V \rangle_F}{\|V\|_1} \leq \frac{\lambda}{2} \Big\} \bigcap \Big\{ \|\varepsilon_T\|_\infty \leq \frac{\lambda}{2} \Big\} \bigcap \Big\{ \|\widehat{C}_T\|_\infty \leq M_\infty + \frac{3 k_\infty}{2} \Big\}. $$
Then, for any $\epsilon \in (0, 1)$, it holds that $\mathbb{P}(\mathcal{E}(s, c)) \geq 1 - \epsilon$ for any $T \geq T(\epsilon/2, s, c)$ and
$$ \lambda \geq 2 \sqrt{\frac{2 (m_\infty + k_\infty) \ln(2 d^2/\epsilon)}{T}}. $$
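The Bernstein inequality (3.2), which drives the bound above, can be sanity-checked by Monte Carlo in the simplest case $M_t = W_t$, where $\langle M \rangle_t = t$ is deterministic, so the bound reduces to $\mathbb{P}(W_t \geq a) \leq \exp(-a^2/(2t))$. The parameters below are illustrative assumptions:

```python
import numpy as np

# Monte Carlo check of Bernstein's inequality for M_t = W_t:
# P(W_t >= a) <= exp(-a^2 / (2 t)), since <M>_t = t here.
rng = np.random.default_rng(7)
t, a, n_sim = 1.0, 1.5, 200_000
W_t = rng.normal(0.0, np.sqrt(t), size=n_sim)   # W_t ~ N(0, t)
empirical = np.mean(W_t >= a)
bound = np.exp(-a ** 2 / (2 * t))
print(empirical <= bound)   # the Gaussian tail sits below the bound
```

The exponential bound is of course not tight here (the true tail is $1 - \Phi(a/\sqrt{t}) \approx 0.067$ against a bound of $\approx 0.32$); its value lies in holding uniformly over all continuous local martingales on the event $\{\langle M \rangle_t \leq b\}$.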
In this section we present the main theoretical results for the Lasso and Dantzig estimators. More specifically, we derive oracle inequalities for $\widehat{A}_L$ and $\widehat{A}_D$, and show error bounds for the norms $\|\cdot\|_{L^2}$, $\|\cdot\|_1$ and $\|\cdot\|_2$. In particular, we establish the asymptotic equivalence between the Lasso and Dantzig estimators.

We start this subsection by proving a statement which is important for obtaining the oracle inequality for the Lasso estimator $\widehat{A}_L$.

Lemma 4.1.
Suppose that condition (2.6) holds. For any matrix $A \in \mathbb{R}^{d \times d} \setminus \{0\}$ denote $\mathcal{A} := \mathrm{supp}(A)$. Then for any $s \geq s_0$ and $c > 0$, on $\mathcal{E}(s, c)$ the following inequality holds:
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 + \lambda \|\widehat{A}_L - A\|_1 \leq \|(A - A_0) X\|_{L^2}^2 + 4\lambda \|(\widehat{A}_L - A)|_{\mathcal{A}}\|_1. \qquad (4.1) $$
In particular, it implies that $\widehat{A}_L - A_0 \in \mathcal{C}(s_0, 3)$ on $\mathcal{E}(s, c)$.

Proof. Let us set $\delta_L(A) := A - \widehat{A}_L$. Applying Lemma 2.1 we obtain the following inequality:
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 + \lambda \|\delta_L(A)\|_1 \leq \|(A - A_0) X\|_{L^2}^2 + 2 \langle \varepsilon_T, \delta_L(A) \rangle_F + \lambda \|\delta_L(A)\|_1 + 2\lambda \big( \|A\|_1 - \|\widehat{A}_L\|_1 \big). $$
Hence, on $\mathcal{E}(s, c)$ it holds that
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 + \lambda \|\delta_L(A)\|_1 \leq \|(A - A_0) X\|_{L^2}^2 + 2\lambda \big( \|\delta_L(A)\|_1 + \|A\|_1 - \|\widehat{A}_L\|_1 \big). $$
We observe next that $\|\delta_L(A)\|_1 + \|A\|_1 - \|\widehat{A}_L\|_1 \leq 2 \|\delta_L(A)|_{\mathcal{A}}\|_1$, which immediately implies (4.1). Applying (4.1) to $A = A_0$ we deduce that
$$ \|\delta_L(A_0)\|_1 \leq 4 \|\delta_L(A_0)|_{\mathcal{A}_0}\|_1 \leq 4 \|\delta_L(A_0)|_{\mathcal{I}_{s_0}(\delta_L(A_0))}\|_1, $$
where the last inequality holds due to the sparsity assumption $\|A_0\|_0 \leq s_0$. Consequently, $\widehat{A}_L - A_0 \in \mathcal{C}(s_0, 3)$ and the proof is complete. □

We are now in the position to present an oracle inequality for the Lasso estimator $\widehat{A}_L$, which is one of the main results of our paper.

Theorem 4.2.
Fix $\gamma > 0$ and $\epsilon \in (0, 1)$. Consider the Lasso estimator $\widehat{A}_L$ defined in (2.7) and assume that condition (H) holds. Then for
$$ \lambda \geq 2 \sqrt{\frac{2 (m_\infty + k_\infty) \ln(2 d^2/\epsilon)}{T}} \quad \text{and} \quad T \geq T\big( \epsilon/2, \, s_0, \, 3 + 4/\gamma \big), $$
with probability at least $1 - \epsilon$ it holds that
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 \leq (1 + \gamma) \inf_{A : \|A\|_0 \leq s_0} \Big\{ \|(A - A_0) X\|_{L^2}^2 + \frac{9 (2 + \gamma)^2}{k_\infty \gamma (1 + \gamma)} \|A\|_0 \lambda^2 \Big\}. $$
Proof.
Consider an arbitrary matrix $A \in \mathbb{R}^{d \times d}$ with $\|A\|_0 \leq s_0$ and denote $\mathcal{A} := \mathrm{supp}(A)$. Then, on $\mathcal{E}(s_0, 3 + 4/\gamma)$, according to Lemma 4.1 and the Cauchy-Schwarz inequality:
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 + \lambda \|\widehat{A}_L - A\|_1 \leq \|(A - A_0) X\|_{L^2}^2 + 4\lambda \|(\widehat{A}_L - A)|_{\mathcal{A}}\|_1 \leq \|(A - A_0) X\|_{L^2}^2 + 4\lambda \sqrt{\|A\|_0} \, \|(\widehat{A}_L - A)|_{\mathcal{A}}\|_2. \qquad (4.2) $$
Now, if $4\lambda \|(\widehat{A}_L - A)|_{\mathcal{A}}\|_1 \leq \gamma \|(A - A_0) X\|_{L^2}^2$ the result immediately follows from Lemma 4.1. Hence, we only need to treat the case $4\lambda \|(\widehat{A}_L - A)|_{\mathcal{A}}\|_1 > \gamma \|(A - A_0) X\|_{L^2}^2$. The latter implies that $\widehat{A}_L - A \in \mathcal{C}(s_0, 3 + 4/\gamma)$ due to (4.2). Then, on the event $\mathcal{E}(s_0, 3 + 4/\gamma)$, we have
$$ \|(\widehat{A}_L - A)|_{\mathcal{A}}\|_2 \leq \|\widehat{A}_L - A\|_2 \leq \sqrt{\frac{2}{k_\infty}} \, \|(\widehat{A}_L - A) X\|_{L^2} $$
and consequently we obtain from (4.2) that
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 \leq \|(A - A_0) X\|_{L^2}^2 + 3\lambda \sqrt{\frac{2 \|A\|_0}{k_\infty}} \, \|(\widehat{A}_L - A) X\|_{L^2} \leq \|(A - A_0) X\|_{L^2}^2 + 3\lambda \sqrt{\frac{2 \|A\|_0}{k_\infty}} \Big( \|(\widehat{A}_L - A_0) X\|_{L^2} + \|(A - A_0) X\|_{L^2} \Big). $$
Using the inequality $2xy \leq a x^2 + y^2/a$ for $a > 0$, we then conclude that
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 \leq (1 + \gamma) \|(A - A_0) X\|_{L^2}^2 + \frac{9 (2 + \gamma)^2}{k_\infty \gamma (1 + \gamma)} \|A\|_0 \lambda^2, $$
which completes the proof. □

Theorem 4.2 enables us to find upper bounds on the various norms of $\widehat{A}_L - A_0$ as well as on the sparsity of $\widehat{A}_L$. We remark that the bound in (4.6) will be useful to provide the connection between the Lasso and Dantzig estimators in the next subsection.

Corollary 4.3.
Fix $\epsilon \in (0, 1)$. Consider the Lasso estimator $\widehat{A}_L$ defined in (2.7) and assume that conditions (2.6) and (H) hold. Then for
$$ \lambda \geq 2 \sqrt{\frac{2 (m_\infty + k_\infty) \ln(2 d^2/\epsilon)}{T}} \quad \text{and} \quad T \geq T\big( \epsilon/2, s_0, 3 \big), $$
with probability at least $1 - \epsilon$, it holds that
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 \leq \frac{18 s_0 \lambda^2}{k_\infty}, \qquad (4.3) $$
$$ \|\widehat{A}_L - A_0\|_2 \leq \frac{6 \sqrt{s_0} \lambda}{k_\infty}, \qquad (4.4) $$
$$ \|\widehat{A}_L - A_0\|_1 \leq \frac{24 s_0 \lambda}{k_\infty}, \qquad (4.5) $$
$$ \|\widehat{A}_L\|_0 \leq \Big( \frac{48 M_\infty}{k_\infty} + 72 \Big) s_0. \qquad (4.6) $$
Proof.
On the event $\mathcal{E}(s_0, 3)$, applying Lemma 4.1 with $A = A_0$ and $\mathcal{A}_0 := \mathrm{supp}(A_0)$, we obtain the inequality
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 + \lambda \|\widehat{A}_L - A_0\|_1 \leq 4\lambda \|(\widehat{A}_L - A_0)|_{\mathcal{A}_0}\|_1. $$
Since on $\mathcal{E}(s_0, 3)$ we have $\widehat{A}_L - A_0 \in \mathcal{C}(s_0, 3)$,
$$ \|(\widehat{A}_L - A_0) X\|_{L^2}^2 \leq 3\lambda \|(\widehat{A}_L - A_0)|_{\mathcal{A}_0}\|_1 \leq 3\lambda \sqrt{s_0} \, \|(\widehat{A}_L - A_0)|_{\mathcal{A}_0}\|_2 \leq 3\lambda \sqrt{\frac{2 s_0}{k_\infty}} \, \|(\widehat{A}_L - A_0) X\|_{L^2}. $$
This gives (4.3) and (4.4). Moreover, on the same event it holds that $\|\widehat{A}_L - A_0\|_1 \leq 4 \sqrt{s_0} \|\widehat{A}_L - A_0\|_2$, and hence (4.5) follows.

Now, it remains to prove (4.6). Note that a necessary and sufficient condition for $\widehat{A}_L$ to be the solution of the optimisation problem (2.7) is the existence of a matrix $B \in \partial \|\widehat{A}_L\|_1$ such that
$$ \varepsilon_T + \big( \widehat{A}_L - A_0 \big) \widehat{C}_T + \lambda B = 0. $$
Furthermore, $\widehat{A}_L^{ij} \neq 0$ implies that $B^{ij} = \mathrm{sign}(\widehat{A}_L^{ij})$. Thus, we conclude that
$$ \|(\widehat{A}_L - A_0) \widehat{C}_T\|_1 = \|\lambda B + \varepsilon_T\|_1 = \sum_{i,j=1}^d \big| \lambda B^{ij} + \varepsilon_T^{ij} \big| \geq \sum_{i,j : \widehat{A}_L^{ij} \neq 0} \big| \lambda B^{ij} + \varepsilon_T^{ij} \big| \geq \sum_{i,j : \widehat{A}_L^{ij} \neq 0} \big( \lambda - |\varepsilon_T^{ij}| \big) \geq \frac{\|\widehat{A}_L\|_0 \lambda}{2}, $$
where the last step uses $\|\varepsilon_T\|_\infty \leq \lambda/2$ on $\mathcal{E}(s_0, 3)$. On the other hand, on $\mathcal{E}(s_0, 3)$,
$$ \|(\widehat{A}_L - A_0) \widehat{C}_T\|_1 \leq \|\widehat{C}_T\|_\infty \|\widehat{A}_L - A_0\|_1 \leq \Big( M_\infty + \frac{3 k_\infty}{2} \Big) \frac{24 s_0 \lambda}{k_\infty}, $$
which implies (4.6). □

The upper bounds in (4.3)-(4.5) improve the bounds obtained in [11, Corollary 1] and they are in line with the classical results for linear regression models. We recall that the paper [11] considers row sparsity of the unknown parameter $A_0$, i.e. $\|A_{0,i}\|_0 \leq s$ for all $1 \leq i \leq d$, where $A_{0,i}$ denotes the $i$th row of $A_0$. Obviously, this constraint corresponds to $s_0 = d s$ in our setting.
The authors of [11] obtained an upper bound for $\|\widehat{A}_L - A_0\|_2^2$ of order
$$ \frac{d s (\log d + \log \log T)}{T}, $$
in contrast to our improved bound of order $T^{-1} d s \log d$. Thus, we essentially match the lower bound
$$ \inf_{\widehat{A}} \sup_{A_0 : \max_i \|A_{0,i}\|_0 \leq s} \mathbb{E}\big[ \|\widehat{A} - A_0\|_2^2 \big] \geq \frac{c_0 \, d s \log(c_1 d / s)}{T} $$
for some $c_0, c_1 > 0$, which has been derived in [11, Theorem 2].

The authors of [11] have also introduced the adaptive Lasso estimator, which is defined as
$$ \widehat{A}_{\mathrm{ad}} := \operatorname*{argmin}_{A \in \mathbb{R}^{d \times d}} \Big( \mathcal{L}_T(A) + \lambda \big\| A \circ |\widehat{A}_{\mathrm{ML}}|^{-\gamma} \big\|_1 \Big), $$
where $\circ$ denotes the Hadamard product and $(|\widehat{A}_{\mathrm{ML}}|^{-\gamma})_{ij} := |\widehat{A}_{\mathrm{ML}}^{ij}|^{-\gamma}$ for a $\gamma >$
0. Theyhave proved that the adaptive estimator (cid:98) A ad is consistent for support selection and showedthe asymptotic normality of (cid:98) A ad when restricted to the elements in supp( A ); see [11,Theorem 4]. In this subsection we will establish a connection between the prediction errors associatedwith the Lasso and Dantzig estimators. This step is essential for the derivation of errorbounds for (cid:98) A D . Our results are an extension of the study in [2], where it was shown thatunder sparsity conditions, the Lasso and the Dantizg estimators show similar behaviourfor linear regression and for nonparametric regression models, for l prediction loss andfor l p loss in the coefficients for 1 ≤ p ≤ . In what follows, we will derive analogous bounds for the Ornstein-Uhlenbeck process.5
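The similar behaviour of the two estimators can also be seen numerically. The sketch below is an illustrative toy linear-regression example (not the Ornstein-Uhlenbeck setting): the design is orthonormalised so that $X^\top X/n = \mathrm{id}$, a special case in which both the Lasso and the Dantzig selector reduce to soft-thresholding of $X^\top y/n$ and should coincide up to solver tolerance. The Dantzig selector is computed as a linear program; all constants are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, p, lam = 100, 8, 0.05

# Orthonormalised design: X^T X / n = identity
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q
beta_true = np.zeros(p)
beta_true[:2] = [1.0, -0.7]
y = X @ beta_true + 0.05 * rng.standard_normal(n)
c = X.T @ y / n

# Lasso via ISTA (unit step size is valid since X^T X / n = id)
b_lasso = np.zeros(p)
for _ in range(200):
    z = b_lasso - X.T @ (X @ b_lasso - y) / n
    b_lasso = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Dantzig selector as a linear program:
#   min ||b||_1  subject to  ||X^T (y - X b) / n||_inf <= lam,  with b = u - v
A = X.T @ X / n
A_ub = np.block([[A, -A], [-A, A]])
b_ub = np.concatenate([lam + c, lam - c])
res = linprog(np.ones(2 * p), A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
b_dantzig = res.x[:p] - res.x[p:]

assert res.status == 0
assert np.max(np.abs(b_lasso - b_dantzig)) < 1e-4
```

For general correlated designs the two solutions need not coincide exactly, which is consistent with the error-bound (rather than equality) statements developed below.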
Proposition 4.4.
Consider the Dantzig estimator $\widehat A_D$ defined in (2.10) and assume that condition (H) holds.

(i) Define $\delta_D(A) := A - \widehat A_D$ and $\mathcal A := \operatorname{supp}(A)$, and assume that $A$ satisfies the Dantzig constraint (2.9). Then it holds that

$\|\delta_D(A)_{|\mathcal A^c}\|_1 \le \|\delta_D(A)_{|\mathcal A}\|_1.$

(ii) On the event $\{\|\widehat A_L\|_0 \le s_0\} \cap E(s_0, 1)$ the following inequality holds:

$\Big| \|(\widehat A_L - A_0)X\|^2_{L^2} - \|(\widehat A_D - A_0)X\|^2_{L^2} \Big| \le \frac{18}{k_\infty}\, \|\widehat A_L\|_0\, \lambda^2.$

Proof.
See Section 6.3.

Proposition 4.4 implies an oracle inequality for the Dantzig estimator, which is formulated in the next theorem.
Theorem 4.5.
Fix $\gamma > 0$ and $\epsilon_0 \in (0,1)$. Consider the Dantzig estimator $\widehat A_D$ defined in (2.10) and assume that conditions (2.6) and (H) hold. Then for

$\lambda \ge \sqrt{\frac{(m_\infty + k_\infty) \ln(2d^2/\epsilon_0)}{T}}$

and $T \ge T_0\big(\epsilon_0/2,\, (48 M_\infty/k_\infty + 72)\, s_0,\, 3/\gamma\big)$, with probability at least $1 - \epsilon_0$ it holds that

$\|(\widehat A_D - A_0)X\|^2_{L^2} \le (1+\gamma) \inf_{A:\, \|A\|_0 = s_0} \Big\{ \|(A - A_0)X\|^2_{L^2} + C_D(\gamma)\, s_0 \lambda^2 \Big\}, \qquad (4.7)$

where

$C_D(\gamma) = \frac{18}{k_\infty} \Big( \frac{(\gamma+2)^2}{2\gamma} + \frac{48 M_\infty}{k_\infty} + 72 \Big).$

Proof.
Consider a matrix $A \in \mathbb R^{d\times d}$ such that $\|A\|_0 = s_0$. Then, on the event $E(s_0, 3/\gamma)$, according to Proposition 4.4,

$\|(\widehat A_D - A_0)X\|^2_{L^2} \le \|(\widehat A_L - A_0)X\|^2_{L^2} + \frac{18}{k_\infty} \Big( \frac{48 M_\infty}{k_\infty} + 72 \Big)\, s_0 \lambda^2.$

On the other hand, due to Theorem 4.2, we deduce that

$\|(\widehat A_L - A_0)X\|^2_{L^2} \le (1+\gamma)\, \|(A - A_0)X\|^2_{L^2} + \frac{9(\gamma+2)^2}{\gamma\, k_\infty}\, s_0 \lambda^2.$

Combining both inequalities yields (4.7).

The statements of Theorems 4.2 and 4.5 suggest that the Lasso and Dantzig estimators are asymptotically equivalent. This is in line with the theoretical findings for linear regression models, as shown in [2]. More specifically, we obtain the following result, which is a direct analogue of Corollary 4.3.
Corollary 4.6.
Fix $\epsilon_0 \in (0,1)$. Consider the Dantzig estimator $\widehat A_D$ defined in (2.10) and assume that conditions (2.6) and (H) hold. Then for

$\lambda \ge \sqrt{\frac{(m_\infty + k_\infty) \ln(2d^2/\epsilon_0)}{T}}$

and $T \ge T_0(\epsilon_0/2,\, s_0,\, 1)$, with probability at least $1 - \epsilon_0$ it holds that

$\|(\widehat A_D - A_0)X\|^2_{L^2} \le \frac{s_0 \lambda^2}{k_\infty}, \qquad (4.8)$

$\|\widehat A_D - A_0\|_2 \le \frac{\sqrt{s_0}\, \lambda}{k_\infty}, \qquad (4.9)$

$\|\widehat A_D - A_0\|_1 \le \frac{s_0 \lambda}{k_\infty}.$

Proof.
Denote $\mathcal A = \operatorname{supp}(A_0)$. On the event $E(s_0, 1)$ the matrix $A_0$ satisfies the Dantzig constraint (2.9), $\widehat A_D - A_0 \in C(s_0, 1)$ and

$\|(\widehat A_D - A_0)X\|^2_{L^2} \le \|(\widehat A_D - A_0)\widehat C_T\|_\infty\, \|\widehat A_D - A_0\|_1 \le \big( \|(\widehat A_D - A_0)\widehat C_T + \varepsilon_T\|_\infty + \|\varepsilon_T\|_\infty \big)\, \|\widehat A_{D|\mathcal A} - A_0\|_1 \le \lambda \sqrt{s_0}\, \|\widehat A_D - A_0\|_2 \le \lambda \sqrt{\frac{s_0}{k_\infty}}\, \|(\widehat A_D - A_0)X\|_{L^2},$

which gives (4.8) and (4.9). Moreover, on the same event it holds that $\|\widehat A_D - A_0\|_1 \le \sqrt{s_0}\, \|\widehat A_D - A_0\|_2$, which completes the proof.

It is worth mentioning that, even though the Lasso and Dantzig selectors perform equivalently in our setting, a potential strength of the Dantzig estimator over penalised likelihood methods such as the Lasso is that it can be applied in settings where no explicit likelihood or loss function is available, and it may therefore be of interest in both computational and theoretical contexts (see [8] for more details).

This section presents some numerical experiments on simulated data that illustrate our theoretical results. Our estimation methods are based on continuous observations of the underlying process, which need to be discretised for numerical simulations. We use 500000 discretisation points over the time interval $[0, T]$ with $T = 300$. Such an approximation is sufficient for illustration purposes, since further refinement of the grid does not lead to a significant improvement.

Figure 1: Comparison of the true matrix with the maximum likelihood, Lasso and Dantzig estimators. (Panels: (a) transition matrix; (b) MLE; (c) Lasso; (d) Dantzig.)

In Figure 1 we show an example of the transition matrix $A_0$ and the corresponding maximum likelihood, Lasso and Dantzig estimators. Instead of giving numerical values of the entries of $A_0$, we use a colour code to highlight the sparsity. We observe that the MLE performs well on the support, but it gives rather poor estimates outside the support.
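The simulation design just described can be sketched as follows, assuming the model convention $dX_t = -A_0 X_t\, dt + dW_t$ and the continuous-time MLE $\widehat A = -\big(\int_0^T dX_t\, X_t^\top\big)\big(\int_0^T X_t X_t^\top\, dt\big)^{-1}$, both discretised by an Euler scheme. The dimension, horizon and drift matrix below are illustrative and smaller than in our experiments.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, N = 5, 300.0, 300_000
dt = T / N

# Illustrative sparse drift matrix with positive spectrum (ergodic regime)
A0 = np.eye(d)
A0[0, 3] = 0.5
A0[2, 1] = -0.4

# Euler-Maruyama discretisation of dX_t = -A0 X_t dt + dW_t on [0, T]
X = np.zeros((N + 1, d))
for k in range(N):
    X[k + 1] = X[k] - (A0 @ X[k]) * dt + np.sqrt(dt) * rng.standard_normal(d)

# Discretised MLE-type estimator:
#   A_hat = -(sum_k dX_k X_k^T) (sum_k X_k X_k^T dt)^{-1}
dX = np.diff(X, axis=0)
C_hat = (X[:-1].T @ X[:-1]) * dt      # approximates int_0^T X_t X_t^T dt
B_hat = dX.T @ X[:-1]                 # approximates int_0^T dX_t X_t^T
A_mle = -B_hat @ np.linalg.inv(C_hat)

# The estimate should recover A0 up to O(1/sqrt(T)) fluctuations
rel_err = np.linalg.norm(A_mle - A0) / np.linalg.norm(A0)
assert rel_err < 0.35
```

The Lasso and Dantzig estimators add an $\ell_1$ penalisation or constraint on top of this least-squares solution; no tuning of $\lambda$ is attempted in this sketch.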
On the other hand, the superiority of the Lasso and Dantzig estimators, especially in terms of support recovery, is quite obvious even for a relatively small dimension of the matrix.

Figure 2 shows the relative error of the maximum likelihood, Lasso and Dantzig estimators compared to the norm of the true matrix. We compute the relative error for dimensions $d = 5, \dots, 20$ and for the $L^1$ and Frobenius norms. Figure 2 clearly shows the improvement in performance of the penalised estimation methods, compared to maximum likelihood estimation, as the dimension $d$ grows. Indeed, we observe that the relative errors of maximum likelihood estimation grow linearly in both the $L^1$ and Frobenius norms, while the relative errors of the Lasso and Dantzig estimators decay in $d$. The sparsity of the true parameter $A_0$ was chosen proportional to $d$, which might explain the limiting behaviour of the Lasso and Dantzig estimators as $d$ increases. Finally, we observe that the relative errors for the Lasso and Dantzig estimators are practically equivalent, which is exactly in accordance with our theoretical results.

Figure 2: Relative error of the maximum likelihood, Lasso and Dantzig estimators in the $L^1$ and Frobenius norms depending on $d$. The middle line corresponds to the mean and the coloured areas to the standard deviation of the error over 10 independent simulations.

We first note the identity $\|VX\|^2_{L^2} = \operatorname{tr}(V \widehat C_T V^\top)$. Replacing $\widehat C_T$ by its limit $C_\infty$ we deduce the inequality $\operatorname{tr}(V C_\infty V^\top) \ge k_\infty \|V\|_2^2 > 0$ and

$\frac{\|VX\|^2_{L^2}}{\|V\|_2^2} = \frac{\operatorname{tr}(V C_\infty V^\top)}{\|V\|_2^2} - \frac{\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)}{\|V\|_2^2} \ge k_\infty - \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2}. \qquad (6.1)$

Next, we introduce the set $K(s) := \big\{ V \in \mathbb R^{d\times d} \setminus \{0\} : \|V\|_0 \le s \big\}$. As is shown in Lemma 6.1 it holds that

$\sup_{V \in C(s,c)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2} \le (c+2)^2 \sup_{V \in K(2s)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2}. \qquad (6.2)$

Thus, it suffices to consider $K(s)$ instead of $C(s,c)$ in the following discussion. Observing (6.1) we obtain that

$\mathbb P\Big( \inf_{V \in K(s)} \frac{\|VX\|^2_{L^2}}{\|V\|_2^2} \ge \frac{k_\infty}{2} \Big) \ge \mathbb P\Big( \sup_{V \in K(s)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2} \le \frac{k_\infty}{2} \Big).$

For a matrix $V \in K(s)$ we denote its $j$th row vector by $v_j$ and set $v = \operatorname{vec}(V) \in \mathbb R^{d^2}$. Moreover, we define the symmetric random matrix $D_C = \operatorname{id} \otimes (C_\infty - \widehat C_T) \in \mathbb R^{d^2 \times d^2}$. Then we deduce the identity

$\frac{\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)}{\|V\|_2^2} = \frac{v^\top D_C v}{\|v\|_2^2}. \qquad (6.3)$

According to Proposition 3.2 we obtain the following inequalities for any $x > 0$:

$\mathbb P\Big( \frac{|v^\top D_C v|}{\|v\|_2^2} \ge x \Big) \le \mathbb P\Big( \frac{\sum_{j=1}^d |v_j (C_\infty - \widehat C_T) v_j^\top|}{\sum_{j=1}^d \|v_j\|_2^2} \ge x \Big) \le \sum_{j=1}^d \mathbb P\Big( \frac{|v_j (C_\infty - \widehat C_T) v_j^\top|}{\|v_j\|_2^2} \ge x \Big) \le d \exp(-T H(x)).$

By Lemma 6.2 we conclude that

$\mathbb P\Big( \sup_{v \in \mathbb R^{d^2} \setminus \{0\}:\, \|v\|_0 \le s} \frac{|v^\top D_C v|}{\|v\|_2^2} \ge x \Big) \le d \Big( \frac{e d^2}{s} \Big)^s \exp(-T H(x)).$

We deduce from (6.3) that

$\mathbb P\Big( \inf_{V \in K(s)} \frac{\|VX\|^2_{L^2}}{\|V\|_2^2} \ge k_\infty - x \Big) \ge 1 - d \Big( \frac{e d^2}{s} \Big)^s \exp(-T H(x)).$

The latter statement together with (6.2) implies the inequality

$\mathbb P\Big( \inf_{V \in C(s,c)} \frac{\|VX\|^2_{L^2}}{\|V\|_2^2} \ge \frac{k_\infty}{2} \Big) \ge 1 - \epsilon_0$

for all $T \ge T_0(\epsilon_0, s, c)$, which completes the proof of Theorem 3.3.

Let $e_{(i,j)} \in \mathbb R^{d\times d}$ be the matrix defined by $e^{kl}_{(i,j)} := 1_{\{(k,l) = (i,j)\}}$. We observe that

$\Big\{ \|\operatorname{diag}(C_\infty - \widehat C_T)\|_\infty > \frac{k_\infty}{2} \Big\} = \Big\{ \max_{1 \le j \le d} \big| \operatorname{tr}(e_{(j,j)}(C_\infty - \widehat C_T) e_{(j,j)}^\top) \big| > \frac{k_\infty}{2} \Big\} \subset \Big\{ \sup_{V \in C(s,c)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2} > \frac{k_\infty}{2} \Big\}.$

Furthermore,

$|C^{ij}_\infty - \widehat C^{ij}_T| = \big| \operatorname{tr}(e_{(1,i)}(C_\infty - \widehat C_T) e_{(1,j)}^\top) \big| \le \frac{1}{2} \big| \operatorname{tr}((e_{(1,i)} + e_{(1,j)})(C_\infty - \widehat C_T)(e_{(1,i)} + e_{(1,j)})^\top) \big| + \frac{1}{2} \big| \operatorname{tr}(e_{(1,i)}(C_\infty - \widehat C_T) e_{(1,i)}^\top) \big| + \frac{1}{2} \big| \operatorname{tr}(e_{(1,j)}(C_\infty - \widehat C_T) e_{(1,j)}^\top) \big| \le 2 \sup_{V \in C(s,c)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2},$

and hence

$\Big\{ \|C_\infty - \widehat C_T\|_\infty > k_\infty \Big\} = \Big\{ \max_{1 \le i,j \le d} \big| \operatorname{tr}(e_{(1,i)}(C_\infty - \widehat C_T) e_{(1,j)}^\top) \big| > k_\infty \Big\} \subset \Big\{ \sup_{V \in C(s,c)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2} > \frac{k_\infty}{2} \Big\}.$

This completes the proof of Corollary 3.4.
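The concentration of $\widehat C_T$ around $C_\infty$ that drives Corollary 3.4 can be illustrated numerically. The sketch below is an illustrative toy example, assuming the model convention $dX_t = -A_0 X_t\, dt + dW_t$ and the normalisation $\widehat C_T = T^{-1}\int_0^T X_t X_t^\top\, dt$; the stationary covariance is obtained from the Lyapunov equation $A_0 C_\infty + C_\infty A_0^\top = \operatorname{id}$.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(3)
d, T, N = 4, 200.0, 200_000
dt = T / N

A0 = np.eye(d)
A0[0, 2] = 0.5    # one sparse off-diagonal interaction

# Stationary covariance: solves A0 C + C A0^T = I
C_inf = solve_continuous_lyapunov(A0, np.eye(d))

# Euler scheme for dX_t = -A0 X_t dt + dW_t, accumulating C_hat on the fly
X = np.zeros(d)
C_hat = np.zeros((d, d))
for _ in range(N):
    C_hat += np.outer(X, X) * dt
    X = X - (A0 @ X) * dt + np.sqrt(dt) * rng.standard_normal(d)
C_hat /= T

# The entrywise deviation ||C_inf - C_hat||_inf shrinks at rate ~ 1/sqrt(T)
assert np.max(np.abs(C_hat - C_inf)) < 0.25
```

Repeating the experiment with a larger horizon $T$ shrinks the deviation, in line with the exponential bounds of Proposition 3.2.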
Since $A$ satisfies the Dantzig constraint (2.9), we deduce by the definition of the Dantzig estimator:

$\|A\|_1 \ge \|\widehat A_D\|_1 = \|A - \delta_D(A)_{|\mathcal A}\|_1 + \|\delta_D(A)_{|\mathcal A^c}\|_1 \ge \|A\|_1 - \|\delta_D(A)_{|\mathcal A}\|_1 + \|\delta_D(A)_{|\mathcal A^c}\|_1,$

which proves part (i).

Now we show part (ii) of the proposition. Set $\delta := \widehat A_L - \widehat A_D$. Due to (2.8) we deduce

$\|(\widehat A_L - A_0)X\|^2_{L^2} - \|(\widehat A_D - A_0)X\|^2_{L^2} = 2\operatorname{tr}\big( (\widehat A_D \widehat C_T + \varepsilon_T - A_0 \widehat C_T) \delta^\top \big) - 2\operatorname{tr}(\varepsilon_T \delta^\top) + \operatorname{tr}(\delta \widehat C_T \delta^\top) = 2\operatorname{tr}\big( (\widehat A_L \widehat C_T + \varepsilon_T - A_0 \widehat C_T) \delta^\top \big) - 2\operatorname{tr}(\varepsilon_T \delta^\top) - \operatorname{tr}(\delta \widehat C_T \delta^\top). \qquad (6.4)$

The Dantzig constraint (2.9) implies the inequality

$\big| \operatorname{tr}\big( (\widehat A_D \widehat C_T + \varepsilon_T - A_0 \widehat C_T) \delta^\top \big) \big| \le \|\widehat A_D \widehat C_T + \varepsilon_T - A_0 \widehat C_T\|_\infty\, \|\delta\|_1 \le \lambda \|\delta\|_1,$

and the same inequality holds with $\widehat A_D$ replaced by $\widehat A_L$. On $E(s_0, 1)$ we have

$\big| \operatorname{tr}(\varepsilon_T \delta^\top) \big| \le \lambda \|\delta\|_1.$

Furthermore, on $\{\|\widehat A_L\|_0 \le s_0\}$ it holds that $\delta \in C(s_0, 1)$ and we conclude from Theorem 3.3 that

$\operatorname{tr}(\delta \widehat C_T \delta^\top) \ge \frac{k_\infty}{2}\, \|\delta\|_2^2.$

We also have $\|\delta\|_1 \le 2 \|\delta_{|\operatorname{supp}(\widehat A_L)}\|_1 \le 2 \|\widehat A_L\|_0^{1/2}\, \|\delta\|_2$. Observing the first identity of (6.4), putting the previous estimates together and using the inequality $2xy \le a x^2 + y^2/a$ for $a > 0$, we obtain the inequality

$\|(\widehat A_D - A_0)X\|^2_{L^2} - \|(\widehat A_L - A_0)X\|^2_{L^2} \le \frac{18}{k_\infty}\, \|\widehat A_L\|_0\, \lambda^2.$

On the other hand, applying the second identity of (6.4), we deduce that

$\|(\widehat A_L - A_0)X\|^2_{L^2} - \|(\widehat A_D - A_0)X\|^2_{L^2} \le \frac{18}{k_\infty}\, \|\widehat A_L\|_0\, \lambda^2,$

which completes the proof.

In this subsection we present two results that can be easily deduced from Lemmas F.1, F.2 and F.3 in the supplementary material of [1]. We state their proofs for the sake of completeness.
Lemma 6.1.
It holds that

$\sup_{V \in C(s,c)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2} \le (c+2)^2 \sup_{V \in K(2s)} \frac{|\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|}{\|V\|_2^2}.$

Proof.
First, recall the definition of the set $C(s,c)$ in (2.1) and denote the balls by $B_q(r) := \{ v \in \mathbb R^{d^2} : \|v\|_q \le r \}$ for $q \ge 0$, $r > 0$. Furthermore, we introduce the notation $K(s) = B_0(s) \cap B_2(1)$ for $s \ge 1$. For any set $P$ we denote its closure and convex hull by $\operatorname{cl}(P)$ and $\operatorname{conv}(P)$, respectively. By a direct application of Lemma F.1 from [1], we obtain the following approximation of cone sets by sparse sets: for any $S \subset \{1, \dots, d^2\}$ with $|S| = s$ we get

$C(s,c) \cap B_2(1) \subseteq B_1\big( (c+1)\sqrt{s} \big) \cap B_2(1) \subseteq (c+2)\, \operatorname{cl}(\operatorname{conv}(K(s))). \qquad (6.5)$

Next, by the statement of Lemma F.3 in [1] we have that

$\sup_{V \in \operatorname{cl}(\operatorname{conv}(K(s)))} |\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)| \le \sup_{V \in K(2s)} |\operatorname{tr}(V(C_\infty - \widehat C_T)V^\top)|. \qquad (6.6)$

Thus, (6.5) combined with (6.6) yields the proof.

Lemma 6.2.
Let $v = \operatorname{vec}(V) \in \mathbb R^{d^2}$ and $D_C = \operatorname{id} \otimes (C_\infty - \widehat C_T) \in \mathbb R^{d^2 \times d^2}$. Then it holds that

$\mathbb P\Big( \sup_{v \in \mathbb R^{d^2} \setminus \{0\}:\, \|v\|_0 \le s} \frac{|v^\top D_C v|}{\|v\|_2^2} \ge x \Big) \le d \Big( \frac{e d^2}{s} \Big)^s \exp(-T H(x)),$

where the function $H$ has been introduced in Proposition 3.2.

Proof. Choose $U \subset \{1, \dots, d^2\}$ with $|U| = s$, and define

$S_U = \big\{ v \in \mathbb R^{d^2} : \|v\|_2 \le 1,\ \operatorname{supp}(v) \subseteq U \big\}.$

Then $K(s) = \bigcup_{|U| \le s} S_U$. In what follows, we choose $\mathcal A = \{u_1, \dots, u_m\}$, which is a $(1/10)$-net of $S_U$. Lemma 3.5 of [26] guarantees that $|\mathcal A| \le 21^s$. Next, notice that for every $v \in S_U$ there exists some $u_i \in \mathcal A$ such that $\|\Delta v\|_2 \le 1/10$, where $\Delta v = v - u_i$. Then it holds

$\gamma := \sup_{v \in S_U} |v^\top D_C v| \le \max_i |u_i^\top D_C u_i| + 2 \sup_{v \in S_U} \big| \max_i u_i^\top D_C (\Delta v) \big| + \sup_{v \in S_U} |(\Delta v)^\top D_C (\Delta v)|.$

Next, we use the fact that $10(\Delta v) \in S_U$, which gives in consequence

$\sup_{v \in S_U} |(\Delta v)^\top D_C (\Delta v)| \le \frac{\gamma}{100}$

and

$2 \sup_{v \in S_U} \big| \max_i u_i^\top D_C (\Delta v) \big| \le \frac{1}{10} \sup_{v \in S_U} \big| (u_i + 10\Delta v)^\top D_C (u_i + 10\Delta v) \big| + \frac{1}{10} \sup_{v \in S_U} |u_i^\top D_C u_i| + \frac{1}{10} \sup_{v \in S_U} \big| (10\Delta v)^\top D_C (10\Delta v) \big| \le \frac{4}{10}\gamma + \frac{1}{10}\gamma + \frac{1}{10}\gamma,$

which implies that $\gamma \le 3 \max_i |u_i^\top D_C u_i|$. Now, we take a union bound over all $u_i \in \mathcal A$ and combine it with inequality (3.1) from Proposition 3.2. Thus,

$\mathbb P\Big( \sup_{v \in S_U} |v^\top D_C v| \ge x \Big) \le d \exp(-T H(x) + s \log 21).$

Next, we take another union bound over $\binom{d^2}{s} \le \big( \frac{e d^2}{s} \big)^s$ choices of $U$. Thus,

$\mathbb P\Big( \sup_{v \in \mathbb R^{d^2} \setminus \{0\}:\, \|v\|_0 \le s} \frac{|v^\top D_C v|}{\|v\|_2^2} \ge x \Big) \le d \Big( \frac{e d^2}{s} \Big)^s \exp(-T H(x)),$

which yields the proof.

References

[1] S. Basu and G. Michailidis (2015): Regularized estimation in sparse high-dimensional time series models.
Annals of Statistics.

[2] P.J. Bickel, Y. Ritov and A.B. Tsybakov (2009): Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics 37, 1705–1732.

[3] F. Bolley, J.A. Cañizo and J.A. Carrillo (2011): Stochastic mean-field limit: non-Lipschitz forces and swarming. Mathematical Models and Methods in Applied Sciences.

[4] P. Bühlmann and S. van de Geer (2011): Statistics for High-Dimensional Data. Springer Series in Statistics, Springer.

[5] E. Candes and T. Tao (2007): The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics 35(6), 2313–2351.

[6] R. Carmona and X. Zhu (2016): A probabilistic approach to mean field games with major and minor players. Annals of Applied Probability.

[7] Econometric Theory 28, 838–860.

[8] L. Dicker, Y. Li and S.D. Zhao (2014): The Dantzig selector for censored linear regression models. Statistica Sinica.

[9] Frontiers in Computational Neuroscience 3, 1–28.

[10] K. Fujimori (2019): The Dantzig selector for a linear model of diffusion processes. Statistical Inference for Stochastic Processes 22, 475–498.

[11] S. Gaïffas and G. Matulewicz (2019): Sparse inference of the drift of a high-dimensional Ornstein-Uhlenbeck process. Journal of Multivariate Analysis.

[12] M.O. Jackson (2008): Social and Economic Networks. Princeton, NJ: Princeton University Press.

[13] J. Jacod and P. Protter (2012): Discretization of Processes. Stochastic Modelling and Applied Probability, Springer.

[14] G. James, P. Radchenko and J. Lv (2009): DASSO: connections between the Dantzig selector and Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

[15] Scandinavian Journal of Statistics.

[16] U. Küchler and M. Sørensen (1997): Exponential Families of Stochastic Processes. Springer Series in Statistics, Springer.

[17] U. Küchler and M. Sørensen (1999): A note on limit theorems for multivariate martingales. Bernoulli.

[18] Y.A. Kutoyants (2004): Statistical Inference for Ergodic Diffusion Processes. Springer Series in Statistics, Springer.

[19] H.P. McKean (1966): Speed of approach to equilibrium for Kac's caricature of a Maxwellian gas. Archive for Rational Mechanics and Analysis.

[20] H.P. McKean (1967): Propagation of chaos for a class of non-linear parabolic equations. In Stochastic Differential Equations (Lecture Series in Differential Equations, Session 7, Catholic University), 41–57. Air Force Office of Scientific Research, Arlington.

[21] I. Nourdin and F.G. Viens (2009): Density formula and concentration inequalities with Malliavin calculus. Electronic Journal of Probability 14, 2287–2309.

[22] D. Nualart (2006): The Malliavin Calculus and Related Topics. Springer.

[23] IEEE Transactions on Information Theory.

[24] D. Revuz and M. Yor (1999): Continuous Martingales and Brownian Motion. 3rd edition, A Series of Comprehensive Studies in Mathematics, Springer.

[25] A.-S. Sznitman (1991): Topics in propagation of chaos. In P.-L. Hennequin, editor, École d'Été de Probabilités de Saint-Flour XIX – 1989, volume 1464 of Lecture Notes in Mathematics, Springer, Berlin, 165–251.

[26] R. Vershynin (2009).