On penalized estimation for dynamical systems with small noise
arXiv [math.ST]
Alessandro De Gregorio ∗ Stefano M. Iacus † September 10, 2018
Abstract
We consider a dynamical system with small noise for which the drift is parametrized by a finite-dimensional parameter. For this model we consider minimum distance estimation from continuous-time observations under an l_p-penalty imposed on the parameters, in the spirit of the Lasso approach, with the aim of simultaneous estimation and model selection. We study the consistency and the asymptotic distribution of these Lasso-type estimators for different values of p. For p = 1 we also consider the adaptive version of the Lasso estimator and establish its oracle properties.

Keywords: dynamical systems, Lasso estimation, model selection, inference for stochastic processes, diffusion-type processes, oracle properties.
∗ Department of Statistical Sciences, P.le Aldo Moro 5, 00185 Rome, Italy. [email protected]
† Department of Economics, Management and Quantitative Methods, Via Conservatorio 7, 20122 Milan, Italy. [email protected]

1 Introduction

Usually, ordinary differential equation models are the result of averaging and/or neglecting some details of an original system, rather than modeling a complex system with a huge number of degrees of freedom or tuning parameters. Introducing noise is therefore a way to approximate more closely the behaviour of observable complex systems. It is then natural to think of the noise as small, for example when one is considering the dynamics of macroscopic quantities.

In the classical regression setting, consider the linear model Y_i = x_i^T β + ε_i, with x_i a vector of covariates, β a vector of q parameters and ε_i i.i.d. Gaussian random variables. Knight and Fu [2000] proposed the following l_p-penalized estimator for β:

  β̂_n := arg min_u ( Σ_{i=1}^n (Y_i − x_i^T u)² + λ_n Σ_{j=1}^q |u_j|^p )   (1)

for some p > 0 and a sequence of nonnegative penalty constants λ_n, n → ∞. The family of estimators β̂_n solving (1) is a generalization of the Ridge estimators, which correspond to the case p = 2 [Efron et al., 2004]. The original Lasso estimators [Tibshirani, 1996] are obtained by setting p = 1, while OLS is the case p = 0, not considered here. The link between Lasso-type estimation and model selection is also due to the fact that, in the limit as p → 0, this procedure approximates the AIC or BIC selection methods, i.e.

  lim_{p→0} Σ_{j=1}^q |u_j|^p = Σ_{j=1}^q 1_{{u_j ≠ 0}},

which amounts to the number of non-null parameters in the model. Here 1_A denotes the indicator function of the set A.

As said, the estimators solving (1) are attractive because with them it is possible to perform estimation and model selection in a single step; i.e. the procedure does not need to estimate different models and compare them later with information criteria, since the dimension of the parameter space does not change: just some of the components β*_j of the vector β* are assumed to be zero. In non-linear models a preliminary simple reparametrization (e.g. β → β′ = β − β_0) is needed to interpret this approach in terms of model selection.

In this work, we extend the problem in (1) to the class of diffusion processes with small noise, solutions to the stochastic differential equation dX_t = S_t(θ, X) dt + ε dW_t, t ∈ [0, T], by replacing least squares estimation with minimum distance estimation. The asymptotics are considered as ε → 0, with 0 < T < ∞ fixed and θ ∈ Θ ⊂ R^q a q-dimensional parameter.

Since the seminal works of Kutoyants [1984, 1991, 1994] and Yoshida [1992b], statistical inference for continuously observed small diffusion processes is well developed today [see, e.g., Kutoyants and Philibossian, 1994, Iacus, 2000, Iacus and Kutoyants, 2001, Yoshida, 2003, Uchida and Yoshida, 2004b], but the Lasso problem has not been considered so far. Although here we consider only continuous-time observations, it is worth mentioning that there is also a growing literature on parametric inference for discretely observed small diffusion processes [see Genon-Catalot, 1990, Laredo, 1990, Sørensen, 1997, 2012, Sørensen and Uchida, 2003, Uchida, 2003, 2004, 2006, 2008, Gloter and Sørensen, 2009, Guy et al., 2014] to which this Lasso problem can be extended.
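To make the penalized criterion (1) concrete, the following sketch (our own illustration, not code from the paper; the helper names `soft_threshold` and `lasso_cd` are hypothetical) computes the p = 1 case by coordinate descent with soft-thresholding, one standard way to obtain the Lasso solution, and shows how a truly null coefficient is shrunk toward zero while estimation of the others happens in the same step.

```python
import numpy as np

def soft_threshold(rho, t):
    # Proximal operator of t*|.|: shrinks rho toward zero, exactly zero below t.
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate descent for  sum_i (y_i - x_i^T u)^2 + lam * sum_j |u_j|,
    # i.e. criterion (1) with p = 1.
    n, q = X.shape
    u = np.zeros(q)
    z = (X ** 2).sum(axis=0)                 # per-coordinate curvature
    for _ in range(n_iter):
        for j in range(q):
            r = y - X @ u + X[:, j] * u[j]   # partial residual, coordinate j removed
            u[j] = soft_threshold(X[:, j] @ r, lam / 2.0) / z[j]
    return u

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, 0.0, -1.5]) + 0.1 * rng.normal(size=200)
beta_hat = lasso_cd(X, y, lam=20.0)  # the null coefficient is typically set to exactly 0
```

With p = 2 the penalty is differentiable and the same criterion reduces to Ridge regression, which has a closed-form solution; only p ≤ 1 produces exact zeros, which is the model-selection effect described above.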
Adaptive Lasso-type estimation for ergodic diffusion processes sampled at discrete times has been studied in De Gregorio and Iacus [2012], while for continuous-time ergodic diffusion processes shrinkage estimation has been considered in Nkurunziza [2012].

This paper is organized as follows. In Section 2 we introduce the model, the assumptions and the statement of the problem. In Section 3 we study the consistency of the estimators and derive their asymptotic distribution for different values of p. For p = 1 we also consider the case of adaptive Lasso estimation, which is meant to control the asymptotic bias. For the adaptive estimator, we are also able to prove that it represents an oracle procedure.

2 The Lasso-type problem for dynamical systems with small noise
Let us assume that on the probability space (Ω, F, P), with the filtration {F_t, 0 ≤ t ≤ T} (where each F_t, 0 ≤ t ≤ T, is augmented by the sets of P-measure zero from F), a Wiener process {W_t, F_t, 0 ≤ t ≤ T} is given. Let X = {X_t, 0 ≤ t ≤ T} be a real-valued diffusion-type process solving the stochastic differential equation

  dX_t = S_t(θ, X) dt + ε dW_t,  ε ∈ (0, 1],   (2)

with non-random initial condition X_0 = x_0. The parameter θ ∈ Θ ⊂ R^q, where Θ is a bounded, open and convex set, is supposed to be unknown. Let (C[0,T], B[0,T]) be the measurable space of continuous functions x_t on [0,T] with σ-algebra B[0,T] = σ{x_t, 0 ≤ t ≤ T}. P^(ε)_θ denotes the law induced by the process X on (C[0,T], B[0,T]) when the true parameter is θ. We denote by u = (u_1, ..., u_q)^T the (transposed) vector u ∈ R^q, and the true value of θ by θ*. Let ‖·‖ = ‖·‖_{L_2(μ)} be the L_2-norm with respect to some finite measure μ on [0,T], i.e.

  ‖f‖² = ∫_0^T f_t² μ(dt).

We suppose that the trend coefficient in (2) is of integral type, i.e.

  S_t(θ, X) = V(θ, t, X_t) + ∫_0^t K(θ, t, s, X_s) ds,

where V(θ, t, x) and K(θ, t, s, x) are known measurable, non-anticipative functionals such that (2) has a unique strong solution. For example, the usual conditions (1.34) and (1.35) in Kutoyants [1994] and Theorem 4.6 in Liptser and Shiryaev [2001] about Lipschitz behaviour and linear growth are sufficient; i.e.

Assumption 1. For all t ∈ [0, T], θ ∈ Θ and X, Y ∈ C[0, T],

  |V(θ, t, X_t) − V(θ, t, Y_t)| + |K(θ, t, s, X_t) − K(θ, t, s, Y_t)| ≤ L_1 ∫_0^t |X_s − Y_s| dK_s + L_2 |X_t − Y_t|,

  |V(θ, t, X_t)| + |K(θ, t, s, X_t)| ≤ L_1 ∫_0^t (1 + |X_s|) dK_s + L_2 (1 + |X_t|),

where L_1 and L_2 are positive constants and K_s is a nondecreasing right-continuous function with 0 ≤ K_t ≤ K_0, K_0 > 0.

Under Assumption 1, the measures P^(ε)_θ, θ ∈ Θ, are equivalent (see Theorem 7.7 in Liptser and Shiryaev [2001]). The asymptotics in this model are considered as ε → 0 with 0 < T < ∞ fixed.

We will also write x(θ) = x_t(θ) to denote the limiting dynamical system satisfying the integro-differential equation

  dx_t/dt = V(θ, t, x_t) + ∫_0^t K(θ, t, s, x_s) ds,

with initial value x_0. We assume that, for all 0 ≤ t ≤ T and each θ ∈ Θ, the random element X_t and x_t(θ) belong to L_2(μ). Let x^(1) = {x^(1)_t(θ*), 0 ≤ t ≤ T} be the Gaussian process solving

  dx^(1)_t = ( V_x(θ*, t, x_t(θ*)) x^(1)_t + ∫_0^t K_x(θ*, t, s, x_s(θ*)) x^(1)_s ds ) dt + dW_t,   (3)

0 ≤ t ≤ T, x^(1)_0 = 0, where V_x(θ, t, x) = ∂V(θ, t, x)/∂x and K_x(θ, t, s, x) = ∂K(θ, t, s, x)/∂x. The process x^(1)_t plays a central role in the definition of the asymptotic distribution of the estimators in the theory of dynamical systems with small noise. We need, in addition, the following assumptions.

Assumption 2. The stochastic process X is differentiable in ε at the point ε = 0 in the following sense: for all ν > 0,

  lim_{ε→0} P^(ε)_{θ*}( ‖ε^{-1}(X − x) − x^(1)‖ > ν ) = 0,

where x^(1) = {x^(1)_t, 0 ≤ t ≤ T} is from (3).

We further denote by ẋ_t(θ) the q-dimensional vector of partial derivatives of x_t(θ) with respect to θ_j, j = 1, ..., q, i.e. ẋ_t(θ) = (∂x_t(θ)/∂θ_1, ..., ∂x_t(θ)/∂θ_q)^T, and ẋ_t(θ*) satisfies the system of equations

  dẋ_t(θ*)/dt = V_x(θ*, t, x_t(θ*)) ẋ_t(θ*) + V̇(θ*, t, x_t(θ*)) + ∫_0^t ( K̇(θ*, t, s, x_s(θ*)) + K_x(θ*, t, s, x_s(θ*)) ẋ_s(θ*) ) ds,  ẋ_0(θ*) = 0,

where the dot corresponds to differentiation with respect to θ; i.e.

  V̇(θ, t, x_t(θ)) = ( ∂V(θ, t, x_t(θ))/∂θ_1, ..., ∂V(θ, t, x_t(θ))/∂θ_q )^T.

Assumption 3. The deterministic dynamical system x_t(θ) is differentiable in θ at the point θ* in L_2(μ); i.e.

  ‖x(θ* + h) − x(θ*) − h^T ẋ(θ*)‖ = o(|h|),

where h ∈ R^q.

Assumption 4. The matrix

  I(θ*) = ∫_0^T ẋ_t(θ*) ẋ_t^T(θ*) μ(dt)

is positive definite (hence nonsingular).

We introduce a constrained minimum distance estimator for θ in the model (2). The asymptotic properties of unconstrained minimum distance estimators in the i.i.d. framework have been established in Millar [1983, 1984]. Later, Kutoyants [1991, 1994] and Kutoyants and Philibossian [1994] studied in detail the properties of such estimators for diffusion processes with small noise. Information criteria for this model have been studied in Uchida and Yoshida [2004b], while here we study the Lasso-type approach.

To define the Lasso-type estimator, the following penalized contrast function has to be considered:

  Z_ε(u) = ‖X − x(u)‖² + λ_ε Σ_{j=1}^q |u_j|^p,   (4)

with p > 0, u ∈ Θ and λ_ε > 0. This yields an estimator θ̂_ε : C[0,T] → Θ̄ for θ, defined as

  θ̂_ε = arg min_{θ ∈ Θ̄} Z_ε(θ),   (5)

where Θ̄ is the closure of Θ.

The following example explains well the spirit of the Lasso procedure. Consider a linear small diffusion-type process X given by

  dX_t = Σ_{j=1}^q θ_j A_j(t, X) dt + ε dW_t,  0 ≤ t ≤ T.

By applying the estimator (5), some parameters θ_j will be set equal to 0, and this implies simultaneous estimation and selection of the model.

3 Asymptotic properties of the estimator
The additional l_p-penalization term in the contrast function (4) modifies the usual properties of the minimum distance estimator. The analysis has to be performed separately for the different values of p, which change the convexity of the penalty term. Let us introduce the functions

  g^ε_{θ*}(ν) = inf_{|θ−θ*|≥ν} ( ‖x(θ) − x(θ*)‖² + λ_ε Σ_{j=1}^q |θ_j|^p ),

  h^ε_{θ*}(ν) = inf_{|θ−θ*|<ν} ( ‖x(θ) − x(θ*)‖² + λ_ε Σ_{j=1}^q |θ_j|^p ),

where |θ − θ*| ≥ ν (respectively < ν) is to be intended componentwise, for all ν > 0. We need the following identifiability-type condition.

Assumption 5. For every ν > 0, we assume that g^ε_{θ*}(ν) > h^ε_{θ*}(ν).

Theorem 1. Let Assumption 1 and Assumption 5 be fulfilled and λ_ε = O(ε) as ε → 0. Then θ̂_ε in (5) is a uniformly consistent estimator of θ*; i.e. for any ν > 0,

  lim_{ε→0} sup_{θ*∈Θ} P^(ε)_{θ*}( |θ̂_ε − θ*| ≥ ν ) = 0.

Proof.
By the definition of θ̂_ε, for any ν > 0, we have that

  {ω : |θ̂_ε − θ*| ≥ ν} ⊆ {ω : inf_{|θ−θ*|<ν} Z_ε(θ) ≥ inf_{|θ−θ*|≥ν} Z_ε(θ)}.

Moreover, by the triangle inequality,

  Z_ε(θ) ≤ ( ‖X − x(θ*)‖ + ‖x(θ) − x(θ*)‖ )² + λ_ε Σ_{j=1}^q |θ_j|^p,

  Z_ε(θ) ≥ ( ‖x(θ) − x(θ*)‖ − ‖X − x(θ*)‖ )² + λ_ε Σ_{j=1}^q |θ_j|^p.

Expanding the squares and setting M := sup_{θ∈Θ̄} ‖x(θ) − x(θ*)‖, which is finite because Θ is bounded, we obtain

  P^(ε)_{θ*}( |θ̂_ε − θ*| ≥ ν ) ≤ P^(ε)_{θ*}( inf_{|θ−θ*|<ν} Z_ε(θ) ≥ inf_{|θ−θ*|≥ν} Z_ε(θ) ) ≤ P^(ε)_{θ*}( 4M ‖X − x(θ*)‖ ≥ g^ε_{θ*}(ν) − h^ε_{θ*}(ν) ).

Since (see Lemma 1.13 in Kutoyants [1994])

  ‖X − x(θ*)‖ ≤ C ε sup_{0≤t≤T} |W_t|,  P^(ε)_{θ*}-a.s.,

where C = C(L_1, L_2, K_0, T) is a positive constant, under Assumption 5 we get

  sup_{θ*∈Θ} P^(ε)_{θ*}( |θ̂_ε − θ*| ≥ ν ) ≤ P( C ε sup_{0≤t≤T} |W_t| ≥ (1/(4M)) inf_{θ*∈Θ} { g^ε_{θ*}(ν) − h^ε_{θ*}(ν) } ) ≤ 2 exp{ − ( inf_{θ*∈Θ} { g^ε_{θ*}(ν) − h^ε_{θ*}(ν) } )² / (32 M² T C² ε²) } → 0.

In the above we made use of the following estimate, valid for N > 0,

  P( sup_{0≤t≤T} |W_t| > N ) ≤ 4 P( W_T > N ) ≤ 2 e^{−N²/(2T)},

see e.g. Kutoyants [1994], and we observed that

  g^ε_{θ*}(ν) − h^ε_{θ*}(ν) → inf_{|θ−θ*|≥ν} ‖x(θ) − x(θ*)‖² > 0,  as ε → 0.

From the proof of the consistency of the estimator θ̂_ε it is clear that the speed of convergence depends on the speed of λ_ε. The speed of λ_ε also affects the asymptotic distribution of the estimator.

Remark 1. It is possible to define other Lasso-type estimators by modifying the metric in (4); i.e. by considering, for instance, the sup-norm and the L_1-norm. Hence, if {X_t, 0 ≤ t ≤ T} and {x_t(θ), 0 ≤ t ≤ T}, θ ∈ Θ, are elements of the spaces C[0,T] and L_1(μ), respectively, we can introduce the corresponding Lasso estimators

  ˇθ_ε = arg min_{θ∈Θ̄} ( sup_{0≤t≤T} |X_t − x_t(θ)| + λ_ε Σ_{j=1}^q |θ_j|^p )

and

  ˘θ_ε = arg min_{θ∈Θ̄} ( ∫_0^T |X_t − x_t(θ)| μ(dt) + λ_ε Σ_{j=1}^q |θ_j|^p ).

The estimators ˇθ_ε and ˘θ_ε are uniformly consistent, and the proof follows the same steps adopted to prove Theorem 1.

In order to study the asymptotic distribution of the Lasso-type estimator we need to distinguish the different cases for p. We start with the case p ≥ 1. We denote by "→_d" convergence in distribution, and by ζ the Gaussian random vector

  ζ = ∫_0^T x^(1)_t(θ*) ẋ_t(θ*) μ(dt);   (6)

i.e. ζ ∼ N_q(0, σ²), where

  σ² := ∫_0^T ∫_0^T ẋ_t(θ*) ẋ_s(θ*)^T E[ x^(1)_t(θ*) x^(1)_s(θ*) ] μ(dt) μ(ds);

see also Lemma 2.13 in Kutoyants [1994]. The next two theorems are inspired by Theorem 2 and Theorem 3 in Knight and Fu [2000].

Theorem 2.
Let Assumptions 1–5 hold, let ζ be defined as in (6), p ≥ 1 and ε^{-1} λ_ε → λ_0 ≥ 0. Then

  ε^{-1}(θ̂_ε − θ*) →_d arg min_u V(u),

where

  V(u) = −2 u^T ζ + u^T I(θ*) u + λ_0 p Σ_{j=1}^q u_j sgn(θ*_j) |θ*_j|^{p−1}

for p > 1, and

  V(u) = −2 u^T ζ + u^T I(θ*) u + λ_0 Σ_{j=1}^q ( |u_j| 1_{{θ*_j = 0}} + u_j sgn(θ*_j) 1_{{θ*_j ≠ 0}} )

if p = 1.

Proof. Let u ∈ R^q and introduce the random function

  V_ε(u) = (1/ε²) ( ‖X − x(θ* + εu)‖² − ‖X − x(θ*)‖² + λ_ε Σ_{j=1}^q { |θ*_j + εu_j|^p − |θ*_j|^p } ),   (7)

which is minimized at the point u = ε^{-1}(θ̂_ε − θ*) by the definition of θ̂_ε. By exploiting Assumptions 2–4, we get

  (1/ε²) { ‖X − x(θ* + εu)‖² − ‖X − x(θ*)‖² }
   = (1/ε²) { ‖X − x(θ*) − ε u^T ẋ(θ*)‖² − ‖X − x(θ*)‖² } + o_ε(1)
   = u^T I(θ*) u − 2 u^T ∫_0^T ε^{-1}(X_t − x_t(θ*)) ẋ_t(θ*) μ(dt) + o_ε(1)
   → u^T I(θ*) u − 2 u^T ζ  in P^(ε)_{θ*}-probability, as ε → 0,   (8)

where ζ is from (6). For the penalty term in (7),

  (λ_ε/ε²) Σ_{j=1}^q { |θ*_j + εu_j|^p − |θ*_j|^p },

we have to distinguish the cases p = 1 and p > 1. Let p > 1; then

  (λ_ε/ε²) Σ_{j=1}^q { |θ*_j + εu_j|^p − |θ*_j|^p } = (λ_ε/ε) Σ_{j=1}^q u_j ( |θ*_j + εu_j|^p − |θ*_j|^p ) / (ε u_j) → λ_0 p Σ_{j=1}^q u_j sgn(θ*_j) |θ*_j|^{p−1}  as ε → 0.   (9)

If p = 1, then by similar arguments we have

  (λ_ε/ε²) Σ_{j=1}^q { |θ*_j + εu_j| − |θ*_j| } → λ_0 Σ_{j=1}^q ( |u_j| 1_{{θ*_j = 0}} + u_j sgn(θ*_j) 1_{{θ*_j ≠ 0}} )  as ε → 0.   (10)

Notice that V_ε(u) is not convex in u, and then we have to consider convergence in distribution in the topology induced by the uniform metric on compact sets; i.e. we deal with the convergence in distribution of V_ε(u) on the space of continuous functions topologized by the distance ρ(y_1, y_2) = sup_{u∈K} |y_1(u) − y_2(u)|, where K is a compact subset of R^q. From (8), (9) and (10) follows the convergence of the finite-dimensional distributions

  ( V_ε(u_1), ..., V_ε(u_k) ) →_d ( V(u_1), ..., V(u_k) )

for any u_i ∈ R^q, i = 1, ..., k. The tightness of V_ε(u) is implied by

  sup_{ε∈(0,1]} E[ sup_{u∈K} | (d/du) V_ε(u) | ] < ∞,

which follows from the regularity conditions on {x_t(θ), 0 ≤ t ≤ T}. Indeed, it is not hard to prove that

  lim_{h→0} lim sup_{ε→0} E[ w(V_ε(u), h) ∧ 1 ] ≤ lim_{h→0} h sup_{ε∈(0,1]} E[ sup_{u∈K} | (d/du) V_ε(u) | ] = 0,

where w(y, h) = sup{ ρ(y(u), y(v)) : |u − v| ≤ h }, with y a continuous function on compact sets and h > 0. Therefore, by Theorem 16.5 in Kallenberg [2001], we conclude that V_ε(u) →_d V(u) uniformly in u. Since arg min_u V(u) is unique (P^(ε)_{θ*}-a.s.), to prove that

  arg min V_ε = ε^{-1}(θ̂_ε − θ*) →_d arg min V,

we can use Theorem 2.7 in Kim and Pollard [1990]. Hence, it is sufficient to show that arg min_u V_ε(u) = O_{P^(ε)_{θ*}}(1).

We observe that V_ε(u) = V^l_ε(u) + o_ε(1), where

  V^l_ε(u) = u^T I(θ*) u − 2 u^T ∫_0^T ε^{-1}(X_t − x_t(θ*)) ẋ_t(θ*) μ(dt) + (λ_ε/ε²) Σ_{j=1}^q { |θ*_j + εu_j|^p − |θ*_j|^p }

is a convex function. Since for each a ∈ R and δ > 0 there exists a compact set K_{a,δ} such that (see Knight [1999])

  lim sup_{ε→0} P^(ε)_{θ*}( inf_{u∉K_{a,δ}} V_ε(u) ≤ a ) ≤ δ,

then arg min_u V_ε(u) = O_{P^(ε)_{θ*}}(1).
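The behaviour established in Theorems 1 and 2 can be illustrated numerically. The sketch below is our own (not part of the paper): for the linear drift dX_t = (θ_1 X_t + θ_2) dt + ε dW_t with true value θ* = (−1, 0), a special case of the example of Section 2, the estimator (5) with p = 1 is approximated by a grid search over a discretized version of the contrast (4). The grid, step sizes and helper names (`euler_path`, `penalized_mde`) are all assumptions made for the illustration.

```python
import numpy as np

def euler_path(theta, eps, T=1.0, n=400, x0=1.0, rng=None):
    # Euler scheme for dX_t = (theta1*X_t + theta2) dt + eps dW_t;
    # with rng=None (no noise) it integrates the limiting ODE x_t(theta).
    dt = T / n
    x = np.empty(n + 1)
    x[0] = x0
    dw = rng.normal(0.0, np.sqrt(dt), size=n) if rng is not None else np.zeros(n)
    for i in range(n):
        x[i + 1] = x[i] + (theta[0] * x[i] + theta[1]) * dt + eps * dw[i]
    return x

def penalized_mde(path, lam, T=1.0):
    # Grid-search minimizer of the discretized contrast
    #   sum_t (X_t - x_t(theta))^2 * dt + lam * (|theta1| + |theta2|).
    n = len(path) - 1
    dt = T / n
    best_obj, best_theta = np.inf, None
    for t1 in np.linspace(-2.0, 0.5, 26):      # grid contains theta1* = -1
        for t2 in np.linspace(-0.5, 0.5, 21):  # grid contains theta2* = 0
            x = euler_path((t1, t2), 0.0, T=T, n=n)
            obj = np.sum((path - x) ** 2) * dt + lam * (abs(t1) + abs(t2))
            if obj < best_obj:
                best_obj, best_theta = obj, (t1, t2)
    return best_theta

rng = np.random.default_rng(1)
X = euler_path((-1.0, 0.0), eps=0.05, rng=rng)  # small-noise observation
theta_hat = penalized_mde(X, lam=0.01)          # theta2 tends to be shrunk toward 0
```

In line with Theorem 1, as eps shrinks (with lam = λ_ε shrinking accordingly) the minimizer concentrates around θ*, while the l_1 term pushes the null coordinate toward the exact value 0, the model-selection effect discussed above; no claim of a formal verification of the theorems is intended.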
In the case 0 < p < 1, a different rate is required for λ_ε.

Theorem 3. Let Assumptions 1–4 hold, let ζ be defined as in (6), 0 < p < 1 and λ_ε/ε^{2−p} → λ_0 ≥ 0. Then

  ε^{-1}(θ̂_ε − θ*) →_d arg min_u V(u),

where

  V(u) = −2 u^T ζ + u^T I(θ*) u + λ_0 Σ_{j=1}^q |u_j|^p 1_{{θ*_j = 0}}.

Proof. We proceed analogously to the proof of Theorem 2. As before, we start with V_ε(u) from (7). The first part of the expression in V_ε(u) converges in distribution to −2u^Tζ + u^T I(θ*) u as in Theorem 2. For the second term, we need to distinguish the two cases θ*_k ≠ 0 and θ*_k = 0. By assumption we have λ_ε/ε^{2−p} → λ_0, and hence necessarily λ_ε/ε → 0. If θ*_k ≠ 0, we have that

  (λ_ε/ε) u_k ( |θ*_k + εu_k|^p − |θ*_k|^p ) / (ε u_k) → 0.

Conversely, if θ*_k = 0 we have that

  (λ_ε/ε²) Σ_{j=1}^q ( |θ*_j + εu_j|^p − |θ*_j|^p ) → λ_0 Σ_{j=1}^q |u_j|^p 1_{{θ*_j = 0}}.

So, by means of the same arguments adopted in the proof of Theorem 2, we can prove that V_ε(u) →_d V(u) uniformly in u. Following Kim and Pollard [1990], the final step consists in showing that arg min V_ε = O_{P^(ε)_{θ*}}(1), so that arg min V_ε →_d arg min V. Indeed,

  V_ε(u) ≥ (1/ε²)( ‖X − x(θ* + εu)‖² − ‖X − x(θ*)‖² ) − (λ_ε/ε²) Σ_{j=1}^q |εu_j|^p,

and for all u and ε sufficiently small, δ > 0, we have

  V_ε(u) ≥ (1/ε²)( ‖X − x(θ* + εu)‖² − ‖X − x(θ*)‖² ) − (λ_0 + δ) Σ_{j=1}^q |u_j|^p = V^δ_ε(u).

The term |u_j|^p grows more slowly than the first, normed, terms in V^δ_ε(u), so arg min_u V^δ_ε(u) = O_{P^(ε)_{θ*}}(1) and, in turn, arg min_u V_ε(u) is also O_{P^(ε)_{θ*}}(1). Since arg min_u V(u) is unique, the theorem is proved.

Remark 2. If λ_0 = 0, from the above theorems we immediately obtain that

  ε^{-1}(θ̂_ε − θ*) →_d arg min_u V(u) = I^{-1}(θ*) ζ,

where I^{-1}(θ*) ζ ∼ N_q(0, I^{-1}(θ*) σ² I^{-1}(θ*)).

Theorem 3 shows that, if p < 1, one can estimate the nonzero parameters θ*_j ≠ 0 at the usual rate, without introducing asymptotic bias due to the penalization, and at the same time shrink the estimates of the null parameters θ*_j = 0 toward zero with positive probability. On the contrary, if p ≥ 1, the penalty introduces an asymptotic bias in the estimation of the non-null parameters whenever λ_0 > 0. This is a well-known result in the literature [Zou, 2006] and has indeed been considered in De Gregorio and Iacus [2012] for ergodic diffusion models with discrete observations. In this section we consider only the case p = 1, i.e. the real Lasso estimator.

To state the results we need to rearrange the elements of the parameter vector θ in the following way. Suppose that q_0 ≤ q values of θ* are not null; then we reorder θ* as θ* = (θ*_1, ..., θ*_{q_0}, θ*_{q_0+1}, ..., θ*_q)^T, where we denote by θ*_k = 0, k = q_0+1, ..., q, the null parameters. We now need to modify the optimization function by introducing one adaptive sequence for each of the parameters θ_j; i.e.

  Z̃_ε(u) = ‖X − x(u)‖² + Σ_{j=1}^q λ_{ε,j} |u_j|,   (11)

and, as above, the adaptive Lasso-type estimator is the solution to

  θ̃_ε = (θ̃_{ε,1}, ..., θ̃_{ε,q}) = arg min_{θ∈Θ̄} Z̃_ε(θ).   (12)

We now need to slightly modify the rate of convergence of the new sequences {λ_{ε,j}, j = 1, ..., q}.

Assumption 6.
Let κ_ε = min_{j>q_0} λ_{ε,j} and γ_ε = max_{1≤j≤q_0} λ_{ε,j}. Then the following convergences must hold:

  κ_ε/ε → ∞  and  γ_ε/ε → 0.

Let us also introduce the restricted versions of ẋ_t(θ) and I(θ), i.e.

  ẋ⁰_t(θ) = ( ∂x_t(θ)/∂θ_1, ..., ∂x_t(θ)/∂θ_{q_0} )^T,  and  I_0(θ) = ∫_0^T ẋ⁰_t(θ) ẋ⁰_t(θ)^T μ(dt),  (a q_0 × q_0 matrix).

Let η be a Gaussian random vector defined as follows:

  η = ∫_0^T x^(1)_t(θ*) ẋ⁰_t(θ*) μ(dt) ∼ N_{q_0}(0, σ_0²),   (13)

where

  σ_0² = ∫_0^T ∫_0^T ẋ⁰_t(θ*) ẋ⁰_s(θ*)^T E[ x^(1)_t(θ*) x^(1)_s(θ*) ] μ(dt) μ(ds).

The estimator θ̃_ε asymptotically attains the oracle properties. Indeed, a good procedure should, asymptotically, (i) consistently estimate the null parameters as zero and vice versa, i.e. identify the right subset model; and (ii) have the optimal estimation rate and converge to a Gaussian random variable with the covariance matrix of the true subset model.

Theorem 4 (Oracle properties). Let Assumptions 1–6 hold. Then, as ε → 0,

(i) Consistency in variable selection; i.e.

  P^(ε)_{θ*}( θ̃_{ε,k} = 0 ) → 1,  k = q_0+1, ..., q;

(ii) Asymptotic normality; i.e.

  ε^{-1}( θ̃_{ε,1} − θ*_1, ..., θ̃_{ε,q_0} − θ*_{q_0} )^T →_d I_0^{-1}(θ*) η,

where I_0^{-1}(θ*) η ∼ N_{q_0}(0, I_0^{-1}(θ*) σ_0² I_0^{-1}(θ*)).

Proof. (i) We briefly outline the proof, which is by contradiction. Assume that for some j = q_0+1, ..., q the adaptive Lasso estimator of θ*_j = 0 is θ̃_{ε,j} ≠ 0. Taking into account the Karush-Kuhn-Tucker (KKT) optimality conditions, we have

  (1/ε) ∂Z̃_ε(u)/∂u_j |_{u=θ̃_ε} = (1/ε)( ∂‖X − x(u)‖²/∂u_j |_{u=θ̃_ε} ) + (λ_{ε,j}/ε) sgn(θ̃_{ε,j}) = 0.

The first term is O_{P^(ε)_{θ*}}(1) by Assumption 2 and the fact that θ̃_ε is the solution of (12). For the second term we have λ_{ε,j}/ε ≥ κ_ε/ε → ∞ by Assumption 6, which yields a contradiction.

(ii) Let

  Ṽ_ε(u) = (1/ε²)( ‖X − x(θ* + εu)‖² − ‖X − x(θ*)‖² + Σ_{j=1}^q λ_{ε,j} { |θ*_j + εu_j| − |θ*_j| } )
   = u^T I(θ*) u − 2 u^T ∫_0^T ε^{-1}(X_t − x_t(θ*)) ẋ_t(θ*) μ(dt) + o_ε(1) + Σ_{j=1}^q (λ_{ε,j}/ε) { ( |θ*_j + εu_j| − |θ*_j| ) / ε }.   (14)

From Assumption 6, since

  u_j ( |θ*_j + εu_j| − |θ*_j| ) / (u_j ε) → u_j sgn(θ*_j)  for j = 1, ..., q_0,

we have that

  | Σ_{j=1}^{q_0} (λ_{ε,j}/ε) { ( |θ*_j + εu_j| − |θ*_j| ) / ε } | ≤ (γ_ε/ε) Σ_{j=1}^{q_0} | u_j ( |θ*_j + εu_j| − |θ*_j| ) / (u_j ε) | → 0,

while for θ*_j = 0, j = q_0+1, ..., q, one has that Σ_{j=q_0+1}^q (λ_{ε,j}/ε) |u_j| → ∞ whenever some such u_j ≠ 0. Therefore it is not possible to use the topology of uniform convergence on compact sets; nevertheless, we can define the convergence of Ṽ_ε via epi-convergence in distribution; i.e. from Lemma 4.1 in Geyer [1994] it follows that Ṽ_ε(u) →_d Ṽ(u) for every u, where

  Ṽ(u) = u_0^T I_0(θ*) u_0 − 2 u_0^T η  if u_{q_0+1} = ... = u_q = 0,  and  Ṽ(u) = +∞ otherwise,

with u_0 = (u_1, ..., u_{q_0})^T, and the previous convergence is considered on the space of extended functions R^q → [−∞, +∞] endowed with a suitable metric. For more details on epi-convergence see Geyer [1994], Knight [1999] and Rockafellar and Wets [1998]. Since the unique minimum point of Ṽ_ε(u) is given by ε^{-1}(θ̃_ε − θ*) and arg min_u Ṽ(u) = (I_0^{-1}(θ*) η, 0)^T is P_{θ*}-a.s. unique, the result (ii) follows from Theorem 4.4 in Geyer [1994].

Now let θ̄_ε be any consistent estimator of θ*, for example the unconstrained minimum distance estimator or the maximum likelihood estimator [Kutoyants, 1994]. Then, as suggested by Zou [2006], for any constants λ_0 > 0 and δ > 1, it is sufficient to choose the sequences λ_{ε,j} as follows:

  λ_{ε,j} = λ_0 / |θ̄_{ε,j}|^δ.   (15)

If λ_0/ε → 0 and λ_0/ε^δ → ∞ as ε → 0, then Assumption 6 is satisfied. Usually, values of δ = 1.5 or δ = 2 are common in adaptive Lasso estimation. The idea of weighting the sequences as in (15) is to exploit the ability of consistent estimators to give an initial guess of how large a parameter is, and then to use the Lasso to shrink the penalty adaptively in order to avoid bias for the truly large parameters.

References
R. Azencott. Formule de Taylor stochastique et développement asymptotique d'intégrales de Feynman. Séminaire de Probabilités XVI, Supplément: Géométrie Différentielle Stochastique, Lecture Notes in Math., 921:237–285, 1982.

M. I. Freidlin and A. D. Wentzell. Random Perturbations of Dynamical Systems, 2nd ed. Springer-Verlag, New York, 1998.

N. Yoshida. Asymptotic expansion for statistics related to small diffusions. Journal of the Japan Statistical Society, 22:139–159, 1992a.

N. Kunitomo and A. Takahashi. The asymptotic expansion approach to the valuation of interest rate contingent claims. Mathematical Finance, 11(1):117–151, 2001.

A. Takahashi and N. Yoshida. An asymptotic expansion scheme for optimal investment problems. Stat. Inference Stoch. Process., 7:153–188, 2004.

M. Uchida and N. Yoshida. Asymptotic expansion for small diffusions applied to option pricing. Statist. Infer. Stochast. Process., 7:189–223, 2004a.

J. D. Murray. Mathematical Biology I: An Introduction. Springer, New York, 2002.

P. C. Bressloff. Stochastic Processes in Cell Biology. Interdisciplinary Applied Mathematics 41. Springer, New York, 2014.

G. B. Ermentrout and D. H. Terman. Mathematical Foundations of Neuroscience. Interdisciplinary Applied Mathematics 35. Springer, New York, 2010.

M. Caner. Lasso-type GMM estimator. Econometric Theory, 25:270–290, 2009.

J. Fan and R. Li. Statistical challenges with high dimensionality: feature selection in knowledge discovery. ArXiv Mathematics e-prints, February 2006.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32:407–489, 2004.

R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58:267–288, 1996.

Y. Kutoyants. Parameter Estimation for Stochastic Processes. Heldermann, Berlin, 1984.

Y. Kutoyants. Minimum distance parameter estimation for diffusion type observations. C. R. Acad. Sci. Paris, Sér. I, 312:637–642, 1991.

Y. Kutoyants. Identification of Dynamical Systems with Small Noise. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1994.

N. Yoshida. Asymptotic expansion of maximum likelihood estimators for small diffusions via the theory of Malliavin–Watanabe. Probab. Theory Relat. Fields, 92:275–311, 1992b.

Y. Kutoyants and P. Philibossian. On minimum l_1-norm estimates of the parameter of the Ornstein–Uhlenbeck process. Statistics and Probability Letters, 20:117–123, 1994.

S. M. Iacus. Semiparametric estimation of the state of a dynamical system with small noise. Statistical Inference for Stochastic Processes, 3:277–288, 2000.

S. M. Iacus and Yu. Kutoyants. Semiparametric hypotheses testing for dynamical systems with small noise. Statistical Inference for Stochastic Processes, 10:105–120, 2001.

N. Yoshida. Conditional expansions and their applications. Stochastic Process. Appl., 107:53–81, 2003.

M. Uchida and N. Yoshida. Information criteria for small diffusions via the theory of Malliavin–Watanabe. Statist. Infer. Stochast. Process., 7:35–67, 2004b.

V. Genon-Catalot. Maximum contrast estimation for diffusion processes from discrete observations. Statistics, 21:99–116, 1990.

C. F. Laredo. A sufficient condition for asymptotic sufficiency of incomplete observations of a diffusion process. Ann. Statist., 18:1158–1171, 1990.

M. Sørensen. Small dispersion asymptotics for diffusion martingale estimating functions. Department of Statistics and Operations Research, University of Copenhagen, Preprint No. 2000-2, 1997.

M. Sørensen. Estimating functions for diffusion-type processes. In M. Kessler, A. Lindner, and M. Sørensen, editors, Statistical Methods for Stochastic Differential Equations, pages 1–107. CRC Press, Chapman and Hall, 2012.

M. Sørensen and M. Uchida. Small diffusion asymptotics for discretely sampled stochastic differential equations. Bernoulli, 9:1051–1069, 2003.

M. Uchida. Estimation for dynamical systems with small noise from discrete observations. J. Japan Statist. Soc., 33:157–167, 2003.

M. Uchida. Estimation for discretely observed small diffusions based on approximate martingale estimating functions. Scand. J. Statist., 31:553–566, 2004.

M. Uchida. Martingale estimating functions based on eigenfunctions for discretely observed small diffusions. Bull. Inform. Cybernet., 38:1–13, 2006.

M. Uchida. Approximate martingale estimating functions for stochastic differential equations with small noises. Stochastic Processes and their Applications, 118:1706–1721, 2008.

A. Gloter and M. Sørensen. Estimation for stochastic differential equations with a small diffusion coefficient. Stochastic Processes and their Applications, 119:679–699, 2009.

R. Guy, C. Laredo, and E. Vergu. Parametric inference for discretely observed multidimensional diffusions with small diffusion coefficient. Stochastic Processes and their Applications, 124:51–80, 2014.

A. De Gregorio and S. M. Iacus. Adaptive LASSO-type estimation for multivariate diffusion processes. Econometric Theory, 28:838–860, 2012. doi: 10.1017/S0266466611000806. URL http://journals.cambridge.org/article_S0266466611000806.

S. Nkurunziza. Shrinkage strategies in some multiple multi-factor dynamical systems. ESAIM: Probability and Statistics, 16:139–150, 2012. doi: 10.1051/ps/2010015.

R. S. Liptser and A. N. Shiryaev. Statistics of Random Processes I: General Theory. Springer-Verlag, New York, 2001.

P. W. Millar. The minimax principle in asymptotic statistical theory. Lect. Notes in Math., 976:76–265, 1983.

P. W. Millar. A general approach to the optimality of minimum distance estimators. Trans. Amer. Math. Soc., 286:377–418, 1984.

O. Kallenberg. Foundations of Modern Probability. Springer-Verlag, New York, 2001.

J. Kim and D. Pollard. Cube root asymptotics. Annals of Statistics, 18:191–219, 1990.

K. Knight. Epi-convergence in distribution and stochastic equi-semicontinuity. Unpublished manuscript, 1999.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

C. J. Geyer. On the asymptotics of constrained M-estimation. Annals of Statistics, 22:1993–2010, 1994.

R. T. Rockafellar and R. J. B. Wets. Variational Analysis. Springer, Berlin, 1998.