On penalized estimation for dynamical systems with small noise
arXiv [math.ST]
Alessandro De Gregorio ∗ Stefano M. Iacus † September 10, 2018
Abstract
We consider a dynamical system with small noise for which the drift is parametrized by a finite-dimensional parameter. For this model we consider minimum distance estimation from continuous-time observations under an l_p-penalty imposed on the parameters, in the spirit of the Lasso approach, with the aim of simultaneous estimation and model selection. We study the consistency and the asymptotic distribution of these Lasso-type estimators for different values of p. For p = 1 we also consider the adaptive version of the Lasso estimator and establish its oracle properties.

Keywords: dynamical systems, Lasso estimation, model selection, inference for stochastic processes, diffusion-type processes, oracle properties.
∗ Department of Statistical Sciences, P.le Aldo Moro 5, 00185 Rome, Italy. [email protected]
† Department of Economics, Management and Quantitative Methods, Via Conservatorio 7, 20122 Milan, Italy. [email protected]

1 Introduction

Usually, ordinary differential equation models are the result of averaging and/or neglecting some details of an original system, rather than modeling a complex system with a huge number of degrees of freedom or tuning parameters. Introducing noise is therefore a way to approximate more closely the behaviour of observable complex systems. It is then natural to think of the noise as small, for example when one is considering the dynamics of macroscopic quantities.

In the classical regression setting, consider the linear model Y_i = x_i^T β + ε_i, with x_i a vector of covariates, β a vector of q parameters and ε_i i.i.d. Gaussian random variables. Knight and Fu [2000] proposed the following l_p-penalized estimator for β:

  β̂_n := arg min_u ( Σ_{i=1}^n (Y_i − x_i^T u)² + λ_n Σ_{j=1}^q |u_j|^p )   (1)

for some p > 0 and a sequence of nonnegative penalty constants λ_n, n → ∞. The family of estimators β̂_n solving (1) is a generalization of the Ridge estimators, which correspond to the case p = 2 [Efron et al., 2004]. The original Lasso estimators [Tibshirani, 1996] are obtained by setting p = 1, while OLS is the case p = 0, not considered here. The link between Lasso-type estimation and model selection is also due to the fact that, in the limit as p → 0, this procedure approximates the AIC or BIC selection methods, i.e.

  lim_{p→0} Σ_{j=1}^q |u_j|^p = Σ_{j=1}^q 1_{{u_j ≠ 0}},

which amounts to the number of non-null parameters in the model. Here 1_A denotes the indicator function of the set A.

As said, the estimators solving (1) are attractive because with them it is possible to perform estimation and model selection in a single step; i.e. the procedure does not need to estimate different models and compare them later with information criteria, since the dimension of the parameter space does not change: just some of the components β*_j of the vector β* are assumed to be zero. In non-linear models a preliminary simple reparametrization (e.g. β → β′ = β − β_0) is needed to interpret this approach in terms of model selection.

In this work, we extend the problem in (1) to the class of diffusion processes with small noise, solutions to the stochastic differential equation dX_t = S_t(θ, X) dt + ε dW_t, t ∈ [0, T], by replacing least squares estimation with minimum distance estimation. The asymptotics are considered as ε → 0, with 0 < T < ∞ fixed and θ ∈ Θ ⊂ R^q a q-dimensional parameter.

Since the seminal works of Kutoyants [1984, 1991, 1994] and Yoshida [1992b], statistical inference for continuously observed small diffusion processes is well developed today [see, e.g., Kutoyants and Philibossian, 1994, Iacus, 2000, Iacus and Kutoyants, 2001, Yoshida, 2003, Uchida and Yoshida, 2004b], but the Lasso problem has not been considered so far. Although here we consider only continuous-time observations, it is worth mentioning that there is also a growing literature on parametric inference for discretely observed small diffusion processes [see Genon-Catalot, 1990, Laredo, 1990, Sørensen, 1997, 2012, Sørensen and Uchida, 2003, Uchida, 2003, 2004, 2006, 2008, Gloter and Sørensen, 2009, Guy et al., 2014] to which this Lasso problem can be extended.
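To make the penalized criterion (1) concrete, the following sketch (our own illustration, not code from the paper; the helper names `soft_threshold` and `lasso_cd` are hypothetical) computes the p = 1 case by coordinate descent with soft-thresholding, one standard way to obtain the Lasso solution, and shows how a truly null coefficient is shrunk toward zero while estimation of the others happens in the same step.

```python
import numpy as np

def soft_threshold(rho, t):
    # Proximal operator of t*|.|: shrinks rho toward zero, exactly zero below t.
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate descent for  sum_i (y_i - x_i^T u)^2 + lam * sum_j |u_j|,
    # i.e. criterion (1) with p = 1.
    n, q = X.shape
    u = np.zeros(q)
    z = (X ** 2).sum(axis=0)                 # per-coordinate curvature
    for _ in range(n_iter):
        for j in range(q):
            r = y - X @ u + X[:, j] * u[j]   # partial residual, coordinate j removed
            u[j] = soft_threshold(X[:, j] @ r, lam / 2.0) / z[j]
    return u

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, 0.0, -1.5]) + 0.1 * rng.normal(size=200)
beta_hat = lasso_cd(X, y, lam=20.0)  # the null coefficient is typically set to exactly 0
```

With p = 2 the penalty is differentiable and the same criterion reduces to Ridge regression, which has a closed-form solution; only p ≤ 1 produces exact zeros, which is the model-selection effect described above.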
Adaptive Lasso-type estimation for ergodic diffusion processes sampled at discrete times has been studied in De Gregorio and Iacus [2012], while for continuous-time ergodic diffusion processes shrinkage estimation has been considered in Nkurunziza [2012].

This paper is organized as follows. In Section 2 we introduce the model, the assumptions and the statement of the problem. In Section 3 we study the consistency of the estimators and derive their asymptotic distribution for different values of p. For p = 1 we also consider the case of adaptive Lasso estimation, which is meant to control the asymptotic bias. For the adaptive estimator, we are also able to prove that it represents an oracle procedure.

2 The Lasso-type problem for dynamical systems with small noise
Let us assume that on the probability space (Ω, F, P), with the filtration {F_t, 0 ≤ t ≤ T} (where each F_t, 0 ≤ t ≤ T, is augmented by the sets of P-measure zero from F), a Wiener process {W_t, F_t, 0 ≤ t ≤ T} is given. Let X = {X_t, 0 ≤ t ≤ T} be a real-valued diffusion-type process solving the stochastic differential equation

  dX_t = S_t(θ, X) dt + ε dW_t,  ε ∈ (0, 1],   (2)

with non-random initial condition X_0 = x_0. The parameter θ ∈ Θ ⊂ R^q, where Θ is a bounded, open and convex set, is supposed to be unknown. Let (C[0,T], B[0,T]) be the measurable space of continuous functions x_t on [0,T] with σ-algebra B[0,T] = σ{x_t, 0 ≤ t ≤ T}. P^(ε)_θ denotes the law induced by the process X on (C[0,T], B[0,T]) when the true parameter is θ. We denote by u = (u_1, ..., u_q)^T the (transposed) vector u ∈ R^q, and the true value of θ by θ*. Let ‖·‖ = ‖·‖_{L_2(μ)} be the L_2-norm with respect to some finite measure μ on [0,T], i.e.

  ‖f‖² = ∫_0^T f_t² μ(dt).

We suppose that the trend coefficient in (2) is of integral type, i.e.

  S_t(θ, X) = V(θ, t, X_t) + ∫_0^t K(θ, t, s, X_s) ds,

where V(θ, t, x) and K(θ, t, s, x) are known measurable, non-anticipative functionals such that (2) has a unique strong solution. For example, the usual conditions (1.34) and (1.35) in Kutoyants [1994] and Theorem 4.6 in Liptser and Shiryaev [2001] about Lipschitz behaviour and linear growth are sufficient; i.e.

Assumption 1. For all t ∈ [0, T], θ ∈ Θ and X, Y ∈ C[0, T],

  |V(θ, t, X_t) − V(θ, t, Y_t)| + |K(θ, t, s, X_t) − K(θ, t, s, Y_t)| ≤ L_1 ∫_0^t |X_s − Y_s| dK_s + L_2 |X_t − Y_t|,

  |V(θ, t, X_t)| + |K(θ, t, s, X_t)| ≤ L_1 ∫_0^t (1 + |X_s|) dK_s + L_2 (1 + |X_t|),

where L_1 and L_2 are positive constants and K_s is a nondecreasing right-continuous function with 0 ≤ K_t ≤ K_0, K_0 > 0.

Under Assumption 1, the measures P^(ε)_θ, θ ∈ Θ, are equivalent (see Theorem 7.7 in Liptser and Shiryaev [2001]). The asymptotics in this model are considered as ε → 0 with 0 < T < ∞ fixed.

We will also write x(θ) = x_t(θ) to denote the limiting dynamical system satisfying the integro-differential equation

  dx_t/dt = V(θ, t, x_t) + ∫_0^t K(θ, t, s, x_s) ds,

with initial value x_0. We assume that, for all 0 ≤ t ≤ T and each θ ∈ Θ, the random element X_t and x_t(θ) belong to L_2(μ). Let x^(1) = {x^(1)_t(θ*), 0 ≤ t ≤ T} be the Gaussian process solving

  dx^(1)_t = ( V_x(θ*, t, x_t(θ*)) x^(1)_t + ∫_0^t K_x(θ*, t, s, x_s(θ*)) x^(1)_s ds ) dt + dW_t,   (3)

0 ≤ t ≤ T, x^(1)_0 = 0, where V_x(θ, t, x) = ∂V(θ, t, x)/∂x and K_x(θ, t, s, x) = ∂K(θ, t, s, x)/∂x. The process x^(1)_t plays a central role in the definition of the asymptotic distribution of the estimators in the theory of dynamical systems with small noise. We need, in addition, the following assumptions.

Assumption 2. The stochastic process X is differentiable in ε at the point ε = 0 in the following sense: for all ν > 0,

  lim_{ε→0} P^(ε)_{θ*}( ‖ε^{-1}(X − x) − x^(1)‖ > ν ) = 0,

where x^(1) = {x^(1)_t, 0 ≤ t ≤ T} is from (3).

We further denote by ẋ_t(θ) the q-dimensional vector of partial derivatives of x_t(θ) with respect to θ_j, j = 1, ..., q, i.e. ẋ_t(θ) = (∂x_t(θ)/∂θ_1, ..., ∂x_t(θ)/∂θ_q)^T, and ẋ_t(θ*) satisfies the system of equations

  dẋ_t(θ*)/dt = V_x(θ*, t, x_t(θ*)) ẋ_t(θ*) + V̇(θ*, t, x_t(θ*)) + ∫_0^t ( K̇(θ*, t, s, x_s(θ*)) + K_x(θ*, t, s, x_s(θ*)) ẋ_s(θ*) ) ds,  ẋ_0(θ*) = 0,

where the dot corresponds to differentiation with respect to θ; i.e.

  V̇(θ, t, x_t(θ)) = ( ∂V(θ, t, x_t(θ))/∂θ_1, ..., ∂V(θ, t, x_t(θ))/∂θ_q )^T.

Assumption 3. The deterministic dynamical system x_t(θ) is differentiable in θ at the point θ* in L_2(μ); i.e.

  ‖x(θ* + h) − x(θ*) − h^T ẋ(θ*)‖ = o(|h|),

where h ∈ R^q.

Assumption 4. The matrix

  I(θ*) = ∫_0^T ẋ_t(θ*) ẋ_t^T(θ*) μ(dt)

is positive definite (hence nonsingular).

We introduce a constrained minimum distance estimator for θ in the model (2). The asymptotic properties of unconstrained minimum distance estimators in the i.i.d. framework have been established in Millar [1983, 1984]. Later, Kutoyants [1991, 1994] and Kutoyants and Philibossian [1994] studied in detail the properties of such estimators for diffusion processes with small noise. Information criteria for this model have been studied in Uchida and Yoshida [2004b], while here we study the Lasso-type approach.

To define the Lasso-type estimator, the following penalized contrast function has to be considered:

  Z_ε(u) = ‖X − x(u)‖² + λ_ε Σ_{j=1}^q |u_j|^p,   (4)

with p > 0, u ∈ Θ and λ_ε > 0. This yields an estimator θ̂_ε : C[0,T] → Θ̄ for θ, defined as

  θ̂_ε = arg min_{θ ∈ Θ̄} Z_ε(θ),   (5)

where Θ̄ is the closure of Θ.

The following example explains well the spirit of the Lasso procedure. Consider a linear small diffusion-type process X given by

  dX_t = Σ_{j=1}^q θ_j A_j(t, X) dt + ε dW_t,  0 ≤ t ≤ T.

By applying the estimator (5), some parameters θ_j will be set equal to 0, and this implies simultaneous estimation and selection of the model.

3 Asymptotic properties of the estimator
The additional l_p-penalization term in the contrast function (4) modifies the usual properties of the minimum distance estimator. The analysis has to be performed separately for the different values of p, which change the convexity of the penalty term. Let us introduce the functions

  g^ε_{θ*}(ν) = inf_{|θ−θ*|≥ν} ( ‖x(θ) − x(θ*)‖² + λ_ε Σ_{j=1}^q |θ_j|^p ),

  h^ε_{θ*}(ν) = inf_{|θ−θ*|<ν} ( ‖x(θ) − x(θ*)‖² + λ_ε Σ_{j=1}^q |θ_j|^p ),

where |θ − θ*| ≥ ν (respectively < ν) is to be intended componentwise, for all ν > 0. We need the following identifiability-type condition.

Assumption 5. For every ν > 0, we assume that g^ε_{θ*}(ν) > h^ε_{θ*}(ν).

Theorem 1. Let Assumption 1 and Assumption 5 be fulfilled and λ_ε = O(ε) as ε → 0. Then θ̂_ε in (5) is a uniformly consistent estimator of θ*; i.e. for any ν > 0,

  lim_{ε→0} sup_{θ*∈Θ} P^(ε)_{θ*}( |θ̂_ε − θ*| ≥ ν ) = 0.

Proof.
By the definition of θ̂_ε, for any ν > 0, we have that

  {ω : |θ̂_ε − θ*| ≥ ν} ⊆ {ω : inf_{|θ−θ*|<ν} Z_ε(θ) ≥ inf_{|θ−θ*|≥ν} Z_ε(θ)}.

Moreover, by the triangle inequality,

  Z_ε(θ) ≤ ( ‖X − x(θ*)‖ + ‖x(θ) − x(θ*)‖ )² + λ_ε Σ_{j=1}^q |θ_j|^p,

  Z_ε(θ) ≥ ( ‖x(θ) − x(θ*)‖ − ‖X − x(θ*)‖ )² + λ_ε Σ_{j=1}^q |θ_j|^p.

Expanding the squares and setting M := sup_{θ∈Θ̄} ‖x(θ) − x(θ*)‖, which is finite because Θ is bounded, we obtain

  P^(ε)_{θ*}( |θ̂_ε − θ*| ≥ ν ) ≤ P^(ε)_{θ*}( inf_{|θ−θ*|<ν} Z_ε(θ) ≥ inf_{|θ−θ*|≥ν} Z_ε(θ) ) ≤ P^(ε)_{θ*}( 4M ‖X − x(θ*)‖ ≥ g^ε_{θ*}(ν) − h^ε_{θ*}(ν) ).

Since (see Lemma 1.13 in Kutoyants [1994])

  ‖X − x(θ*)‖ ≤ C ε sup_{0≤t≤T} |W_t|,  P^(ε)_{θ*}-a.s.,

where C = C(L_1, L_2, K_0, T) is a positive constant, under Assumption 5 we get

  sup_{θ*∈Θ} P^(ε)_{θ*}( |θ̂_ε − θ*| ≥ ν ) ≤ P( C ε sup_{0≤t≤T} |W_t| ≥ (1/(4M)) inf_{θ*∈Θ} { g^ε_{θ*}(ν) − h^ε_{θ*}(ν) } ) ≤ 2 exp{ − ( inf_{θ*∈Θ} { g^ε_{θ*}(ν) − h^ε_{θ*}(ν) } )² / (32 M² T C² ε²) } → 0.

In the above we made use of the following estimate, valid for N > 0,

  P( sup_{0≤t≤T} |W_t| > N ) ≤ 4 P( W_T > N ) ≤ 2 e^{−N²/(2T)},

see e.g. Kutoyants [1994], and we observed that

  g^ε_{θ*}(ν) − h^ε_{θ*}(ν) → inf_{|θ−θ*|≥ν} ‖x(θ) − x(θ*)‖² > 0,  as ε → 0.

From the proof of the consistency of the estimator θ̂_ε it is clear that the speed of convergence depends on the speed of λ_ε. The speed of λ_ε also affects the asymptotic distribution of the estimator.

Remark 1. It is possible to define other Lasso-type estimators by modifying the metric in (4); i.e. by considering, for instance, the sup-norm and the L_1-norm. Hence, if {X_t, 0 ≤ t ≤ T} and {x_t(θ), 0 ≤ t ≤ T}, θ ∈ Θ, are elements of the spaces C[0,T] and L_1(μ), respectively, we can introduce the corresponding Lasso estimators

  ˇθ_ε = arg min_{θ∈Θ̄} ( sup_{0≤t≤T} |X_t − x_t(θ)| + λ_ε Σ_{j=1}^q |θ_j|^p )

and

  ˘θ_ε = arg min_{θ∈Θ̄} ( ∫_0^T |X_t − x_t(θ)| μ(dt) + λ_ε Σ_{j=1}^q |θ_j|^p ).

The estimators ˇθ_ε and ˘θ_ε are uniformly consistent, and the proof follows the same steps adopted to prove Theorem 1.

In order to study the asymptotic distribution of the Lasso-type estimator we need to distinguish the different cases for p. We start with the case p ≥ 1. We denote by "→_d" convergence in distribution, and by ζ the Gaussian random vector

  ζ = ∫_0^T x^(1)_t(θ*) ẋ_t(θ*) μ(dt);   (6)

i.e. ζ ∼ N_q(0, σ²), where

  σ² := ∫_0^T ∫_0^T ẋ_t(θ*) ẋ_s(θ*)^T E[ x^(1)_t(θ*) x^(1)_s(θ*) ] μ(dt) μ(ds);

see also Lemma 2.13 in Kutoyants [1994]. The next two theorems are inspired by Theorem 2 and Theorem 3 in Knight and Fu [2000].

Theorem 2.
Let Assumptions 1–5 hold, let ζ be defined as in (6), p ≥ 1 and ε^{-1} λ_ε → λ_0 ≥ 0. Then

  ε^{-1}(θ̂_ε − θ*) →_d arg min_u V(u),

where

  V(u) = −2 u^T ζ + u^T I(θ*) u + λ_0 p Σ_{j=1}^q u_j sgn(θ*_j) |θ*_j|^{p−1}

for p > 1, and

  V(u) = −2 u^T ζ + u^T I(θ*) u + λ_0 Σ_{j=1}^q ( |u_j| 1_{{θ*_j = 0}} + u_j sgn(θ*_j) 1_{{θ*_j ≠ 0}} )

if p = 1.

Proof. Let u ∈ R^q and introduce the random function

  V_ε(u) = (1/ε²) ( ‖X − x(θ* + εu)‖² − ‖X − x(θ*)‖² + λ_ε Σ_{j=1}^q { |θ*_j + εu_j|^p − |θ*_j|^p } ),   (7)

which is minimized at the point u = ε^{-1}(θ̂_ε − θ*) by the definition of θ̂_ε. By exploiting Assumptions 2–4, we get

  (1/ε²) { ‖X − x(θ* + εu)‖² − ‖X − x(θ*)‖² }
   = (1/ε²) { ‖X − x(θ*) − ε u^T ẋ(θ*)‖² − ‖X − x(θ*)‖² } + o_ε(1)
   = u^T I(θ*) u − 2 u^T ∫_0^T ε^{-1}(X_t − x_t(θ*)) ẋ_t(θ*) μ(dt) + o_ε(1)
   → u^T I(θ*) u − 2 u^T ζ  in P^(ε)_{θ*}-probability, as ε → 0,   (8)

where ζ is from (6). For the penalty term in (7),

  (λ_ε/ε²) Σ_{j=1}^q { |θ*_j + εu_j|^p − |θ*_j|^p },

we have to distinguish the cases p = 1 and p > 1. Let p > 1; then

  (λ_ε/ε²) Σ_{j=1}^q { |θ*_j + εu_j|^p − |θ*_j|^p } = (λ_ε/ε) Σ_{j=1}^q u_j ( |θ*_j + εu_j|^p − |θ*_j|^p ) / (ε u_j) → λ_0 p Σ_{j=1}^q u_j sgn(θ*_j) |θ*_j|^{p−1}  as ε → 0.   (9)

If p = 1, then by similar arguments we have

  (λ_ε/ε²) Σ_{j=1}^q { |θ*_j + εu_j| − |θ*_j| } → λ_0 Σ_{j=1}^q ( |u_j| 1_{{θ*_j = 0}} + u_j sgn(θ*_j) 1_{{θ*_j ≠ 0}} )  as ε → 0.   (10)

Notice that V_ε(u) is not convex in u, and then we have to consider convergence in distribution in the topology induced by the uniform metric on compact sets; i.e. we deal with the convergence in distribution of V_ε(u) on the space of continuous functions topologized by the distance ρ(y_1, y_2) = sup_{u∈K} |y_1(u) − y_2(u)|, where K is a compact subset of R^q. From (8), (9) and (10) follows the convergence of the finite-dimensional distributions

  ( V_ε(u_1), ..., V_ε(u_k) ) →_d ( V(u_1), ..., V(u_k) )

for any u_i ∈ R^q, i = 1, ..., k. The tightness of V_ε(u) is implied by

  sup_{ε∈(0,1]} E[ sup_{u∈K} | (d/du) V_ε(u) | ] < ∞,

which follows from the regularity conditions on {x_t(θ), 0 ≤ t ≤ T}. Indeed, it is not hard to prove that

  lim_{h→0} lim sup_{ε→0} E[ w(V_ε(u), h) ∧ 1 ] ≤ lim_{h→0} h sup_{ε∈(0,1]} E[ sup_{u∈K} | (d/du) V_ε(u) | ] = 0,

where w(y, h) = sup{ ρ(y(u), y(v)) : |u − v| ≤ h }, with y a continuous function on compact sets and h > 0. Therefore, by Theorem 16.5 in Kallenberg [2001], we conclude that V_ε(u) →_d V(u) uniformly in u. Since arg min_u V(u) is unique (P^(ε)_{θ*}-a.s.), to prove that

  arg min V_ε = ε^{-1}(θ̂_ε − θ*) →_d arg min V,

we can use Theorem 2.7 in Kim and Pollard [1990]. Hence, it is sufficient to show that arg min_u V_ε(u) = O_{P^(ε)_{θ*}}(1).

We observe that V_ε(u) = V^l_ε(u) + o_ε(1), where

  V^l_ε(u) = u^T I(θ*) u − 2 u^T ∫_0^T ε^{-1}(X_t − x_t(θ*)) ẋ_t(θ*) μ(dt) + (λ_ε/ε²) Σ_{j=1}^q { |θ*_j + εu_j|^p − |θ*_j|^p }

is a convex function. Since for each a ∈ R and δ > 0 there exists a compact set K_{a,δ} such that (see Knight [1999])

  lim sup_{ε→0} P^(ε)_{θ*}( inf_{u∉K_{a,δ}} V_ε(u) ≤ a ) ≤ δ,

then arg min_u V_ε(u) = O_{P^(ε)_{θ*}}(1).
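The behaviour established in Theorems 1 and 2 can be illustrated numerically. The sketch below is our own (not part of the paper): for the linear drift dX_t = (θ_1 X_t + θ_2) dt + ε dW_t with true value θ* = (−1, 0), a special case of the example of Section 2, the estimator (5) with p = 1 is approximated by a grid search over a discretized version of the contrast (4). The grid, step sizes and helper names (`euler_path`, `penalized_mde`) are all assumptions made for the illustration.

```python
import numpy as np

def euler_path(theta, eps, T=1.0, n=400, x0=1.0, rng=None):
    # Euler scheme for dX_t = (theta1*X_t + theta2) dt + eps dW_t;
    # with rng=None (no noise) it integrates the limiting ODE x_t(theta).
    dt = T / n
    x = np.empty(n + 1)
    x[0] = x0
    dw = rng.normal(0.0, np.sqrt(dt), size=n) if rng is not None else np.zeros(n)
    for i in range(n):
        x[i + 1] = x[i] + (theta[0] * x[i] + theta[1]) * dt + eps * dw[i]
    return x

def penalized_mde(path, lam, T=1.0):
    # Grid-search minimizer of the discretized contrast
    #   sum_t (X_t - x_t(theta))^2 * dt + lam * (|theta1| + |theta2|).
    n = len(path) - 1
    dt = T / n
    best_obj, best_theta = np.inf, None
    for t1 in np.linspace(-2.0, 0.5, 26):      # grid contains theta1* = -1
        for t2 in np.linspace(-0.5, 0.5, 21):  # grid contains theta2* = 0
            x = euler_path((t1, t2), 0.0, T=T, n=n)
            obj = np.sum((path - x) ** 2) * dt + lam * (abs(t1) + abs(t2))
            if obj < best_obj:
                best_obj, best_theta = obj, (t1, t2)
    return best_theta

rng = np.random.default_rng(1)
X = euler_path((-1.0, 0.0), eps=0.05, rng=rng)  # small-noise observation
theta_hat = penalized_mde(X, lam=0.01)          # theta2 tends to be shrunk toward 0
```

In line with Theorem 1, as eps shrinks (with lam = λ_ε shrinking accordingly) the minimizer concentrates around θ*, while the l_1 term pushes the null coordinate toward the exact value 0, the model-selection effect discussed above; no claim of a formal verification of the theorems is intended.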
In the case 0 < p < 1, a different rate is required for λ_ε.

Theorem 3. Let Assumptions 1–4 hold, let ζ be defined as in (6), 0 < p < 1 and λ_ε/ε^{2−p} → λ_0 ≥ 0. Then

  ε^{-1}(θ̂_ε − θ*) →_d arg min_u V(u),

where

  V(u) = −2 u^T ζ + u^T I(θ*) u + λ_0 Σ_{j=1}^q |u_j|^p 1_{{θ*_j = 0}}.

Proof. We proceed analogously to the proof of Theorem 2. As before, we start with V_ε(u) from (7). The first part of the expression in V_ε(u) converges in distribution to −2u^Tζ + u^T I(θ*) u as in Theorem 2. For the second term, we need to distinguish the two cases θ*_k ≠ 0 and θ*_k = 0. By assumption we have λ_ε/ε^{2−p} → λ_0, and hence necessarily λ_ε/ε → 0. If θ*_k ≠ 0, we have that

  (λ_ε/ε) u_k ( |θ*_k + εu_k|^p − |θ*_k|^p ) / (ε u_k) → 0.

Conversely, if θ*_k = 0 we have that

  (λ_ε/ε²) Σ_{j=1}^q ( |θ*_j + εu_j|^p − |θ*_j|^p ) → λ_0 Σ_{j=1}^q |u_j|^p 1_{{θ*_j = 0}}.

So, by means of the same arguments adopted in the proof of Theorem 2, we can prove that V_ε(u) →_d V(u) uniformly in u. Following Kim and Pollard [1990], the final step consists in showing that arg min V_ε = O_{P^(ε)_{θ*}}(1), so that arg min V_ε →_d arg min V. Indeed,

  V_ε(u) ≥ (1/ε²)( ‖X − x(θ* + εu)‖² − ‖X − x(θ*)‖² ) − (λ_ε/ε²) Σ_{j=1}^q |εu_j|^p,

and for all u and ε sufficiently small, δ > 0, we have

  V_ε(u) ≥ (1/ε²)( ‖X − x(θ* + εu)‖² − ‖X − x(θ*)‖² ) − (λ_0 + δ) Σ_{j=1}^q |u_j|^p = V^δ_ε(u).

The term |u_j|^p grows more slowly than the first, normed, terms in V^δ_ε(u), so arg min_u V^δ_ε(u) = O_{P^(ε)_{θ*}}(1) and, in turn, arg min_u V_ε(u) is also O_{P^(ε)_{θ*}}(1). Since arg min_u V(u) is unique, the theorem is proved.

Remark 2. If λ_0 = 0, from the above theorems we immediately obtain that

  ε^{-1}(θ̂_ε − θ*) →_d arg min_u V(u) = I^{-1}(θ*) ζ,

where I^{-1}(θ*) ζ ∼ N_q(0, I^{-1}(θ*) σ² I^{-1}(θ*)).

Theorem 3 shows that, if p < 1, one can estimate the nonzero parameters θ*_j ≠ 0 at the usual rate, without introducing asymptotic bias due to the penalization, and at the same time shrink the estimates of the null parameters θ*_j = 0 toward zero with positive probability. On the contrary, if p ≥ 1, the penalty introduces an asymptotic bias in the estimation of the non-null parameters whenever λ_0 > 0. This is a well-known result in the literature [Zou, 2006] and has indeed been considered in De Gregorio and Iacus [2012] for ergodic diffusion models with discrete observations. In this section we consider only the case p = 1, i.e. the real Lasso estimator.

To state the results we need to rearrange the elements of the parameter vector θ in the following way. Suppose that q_0 ≤ q values of θ* are not null; then we reorder θ* as θ* = (θ*_1, ..., θ*_{q_0}, θ*_{q_0+1}, ..., θ*_q)^T, where we denote by θ*_k = 0, k = q_0+1, ..., q, the null parameters. We now need to modify the optimization function by introducing one adaptive sequence for each of the parameters θ_j; i.e.

  Z̃_ε(u) = ‖X − x(u)‖² + Σ_{j=1}^q λ_{ε,j} |u_j|,   (11)

and, as above, the adaptive Lasso-type estimator is the solution to

  θ̃_ε = (θ̃_{ε,1}, ..., θ̃_{ε,q}) = arg min_{θ∈Θ̄} Z̃_ε(θ).   (12)

We now need to slightly modify the rate of convergence of the new sequences {λ_{ε,j}, j = 1, ..., q}.

Assumption 6.
Let κ_ε = min_{j>q_0} λ_{ε,j} and γ_ε = max_{1≤j≤q_0} λ_{ε,j}. Then the following convergences must hold:

  κ_ε/ε → ∞  and  γ_ε/ε → 0.

Let us also introduce the restricted versions of ẋ_t(θ) and I(θ), i.e.

  ẋ⁰_t(θ) = ( ∂x_t(θ)/∂θ_1, ..., ∂x_t(θ)/∂θ_{q_0} )^T,  and  I_0(θ) = ∫_0^T ẋ⁰_t(θ) ẋ⁰_t(θ)^T μ(dt),  (a q_0 × q_0 matrix).

Let η be a Gaussian random vector defined as follows:

  η = ∫_0^T x^(1)_t(θ*) ẋ⁰_t(θ*) μ(dt) ∼ N_{q_0}(0, σ_0²),   (13)

where

  σ_0² = ∫_0^T ∫_0^T ẋ⁰_t(θ*) ẋ⁰_s(θ*)^T E[ x^(1)_t(θ*) x^(1)_s(θ*) ] μ(dt) μ(ds).

The estimator θ̃_ε asymptotically attains the oracle properties. Indeed, a good procedure should, asymptotically, (i) consistently estimate the null parameters as zero and vice versa, i.e. identify the right subset model; and (ii) have the optimal estimation rate and converge to a Gaussian random variable with the covariance matrix of the true subset model.

Theorem 4 (Oracle properties). Let Assumptions 1–6 hold. Then, as ε → 0,

(i) Consistency in variable selection; i.e.

  P^(ε)_{θ*}( θ̃_{ε,k} = 0 ) → 1,  k = q_0+1, ..., q;

(ii) Asymptotic normality; i.e.

  ε^{-1}( θ̃_{ε,1} − θ*_1, ..., θ̃_{ε,q_0} − θ*_{q_0} )^T →_d I_0^{-1}(θ*) η,

where I_0^{-1}(θ*) η ∼ N_{q_0}(0, I_0^{-1}(θ*) σ_0² I_0^{-1}(θ*)).

Proof. (i) We briefly outline the proof, which is by contradiction. Assume that for some j = q_0+1, ..., q the adaptive Lasso estimator of θ*_j = 0 is θ̃_{ε,j} ≠ 0. Taking into account the Karush-Kuhn-Tucker (KKT) optimality conditions, we have

  (1/ε) ∂Z̃_ε(u)/∂u_j |_{u=θ̃_ε} = (1/ε)( ∂‖X − x(u)‖²/∂u_j |_{u=θ̃_ε} ) + (λ_{ε,j}/ε) sgn(θ̃_{ε,j}) = 0.

The first term is O_{P^(ε)_{θ*}}(1) by Assumption 2 and the fact that θ̃_ε is the solution of (12). For the second term we have λ_{ε,j}/ε ≥ κ_ε/ε → ∞ by Assumption 6, which yields a contradiction.

(ii) Let

  Ṽ_ε(u) = (1/ε²)( ‖X − x(θ* + εu)‖² − ‖X − x(θ*)‖² + Σ_{j=1}^q λ_{ε,j} { |θ*_j + εu_j| − |θ*_j| } )
   = u^T I(θ*) u − 2 u^T ∫_0^T ε^{-1}(X_t − x_t(θ*)) ẋ_t(θ*) μ(dt) + o_ε(1) + Σ_{j=1}^q (λ_{ε,j}/ε) { ( |θ*_j + εu_j| − |θ*_j| ) / ε }.   (14)

From Assumption 6, since

  u_j ( |θ*_j + εu_j| − |θ*_j| ) / (u_j ε) → u_j sgn(θ*_j)  for j = 1, ..., q_0,

we have that

  | Σ_{j=1}^{q_0} (λ_{ε,j}/ε) { ( |θ*_j + εu_j| − |θ*_j| ) / ε } | ≤ (γ_ε/ε) Σ_{j=1}^{q_0} | u_j ( |θ*_j + εu_j| − |θ*_j| ) / (u_j ε) | → 0,

while for θ*_j = 0, j = q_0+1, ..., q, one has that Σ_{j=q_0+1}^q (λ_{ε,j}/ε) |u_j| → ∞ whenever some such u_j ≠ 0. Therefore it is not possible to use the topology of uniform convergence on compact sets; nevertheless, we can define the convergence of Ṽ_ε via epi-convergence in distribution; i.e. from Lemma 4.1 in Geyer [1994] it follows that Ṽ_ε(u) →_d Ṽ(u) for every u, where

  Ṽ(u) = u_0^T I_0(θ*) u_0 − 2 u_0^T η  if u_{q_0+1} = ... = u_q = 0,  and  Ṽ(u) = +∞ otherwise,

with u_0 = (u_1, ..., u_{q_0})^T, and the previous convergence is considered on the space of extended functions R^q → [−∞, +∞] endowed with a suitable metric. For more details on epi-convergence see Geyer [1994], Knight [1999] and Rockafellar and Wets [1998]. Since the unique minimum point of Ṽ_ε(u) is given by ε^{-1}(θ̃_ε − θ*) and arg min_u Ṽ(u) = (I_0^{-1}(θ*) η, 0)^T is P_{θ*}-a.s. unique, the result (ii) follows from Theorem 4.4 in Geyer [1994].

Now let θ̄_ε be any consistent estimator of θ*, for example the unconstrained minimum distance estimator or the maximum likelihood estimator [Kutoyants, 1994]. Then, as suggested by Zou [2006], for any constants λ_0 > 0 and δ > 1, it is sufficient to choose the sequences λ_{ε,j} as follows:

  λ_{ε,j} = λ_0 / |θ̄_{ε,j}|^δ.   (15)

If λ_0/ε → 0 and λ_0/ε^δ → ∞ as ε → 0, then Assumption 6 is satisfied. Usually, values of δ = 1.5 or δ = 2 are common in adaptive Lasso estimation. The idea of weighting the sequences as in (15) is to exploit the ability of consistent estimators to give an initial guess of how large a parameter is, and then to use the Lasso to shrink the penalty adaptively in order to avoid bias for the truly large parameters.

References
R. Azencott. Formule de Taylor stochastique et développement asymptotique d'intégrales de Feynman. Séminaire de Probabilités XVI, Supplément: Géométrie Différentielle Stochastique, Lecture Notes in Math., 921:237–285, 1982.

M. I. Freidlin and A. D. Wentzell. Random Perturbations of Dynamical Systems, 2nd ed. Springer-Verlag, New York, 1998.

N. Yoshida. Asymptotic expansion for statistics related to small diffusions. Journal of the Japan Statistical Society, 22:139–159, 1992a.

N. Kunitomo and A. Takahashi. The asymptotic expansion approach to the valuation of interest rate contingent claims. Mathematical Finance, 11(1):117–151, 2001.

A. Takahashi and N. Yoshida. An asymptotic expansion scheme for optimal investment problems. Stat. Inference Stoch. Process., 7:153–188, 2004.

M. Uchida and N. Yoshida. Asymptotic expansion for small diffusions applied to option pricing. Statist. Infer. Stochast. Process., 7:189–223, 2004a.

J. D. Murray. Mathematical Biology I: An Introduction. Springer, New York, 2002.

P. C. Bressloff. Stochastic Processes in Cell Biology. Interdisciplinary Applied Mathematics 41. Springer, New York, 2014.

G. B. Ermentrout and D. H. Terman. Mathematical Foundations of Neuroscience. Interdisciplinary Applied Mathematics 35. Springer, New York, 2010.

M. Caner. Lasso-type GMM estimator. Econometric Theory, 25:270–290, 2009.

J. Fan and R. Li. Statistical challenges with high dimensionality: feature selection in knowledge discovery. ArXiv Mathematics e-prints, February 2006.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32:407–489, 2004.

R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58:267–288, 1996.

Y. Kutoyants. Parameter Estimation for Stochastic Processes. Heldermann, Berlin, 1984.

Y. Kutoyants. Minimum distance parameter estimation for diffusion type observations. C. R. Acad. Sci. Paris, Sér. I, 312:637–642, 1991.

Y. Kutoyants. Identification of Dynamical Systems with Small Noise. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1994.

N. Yoshida. Asymptotic expansion of maximum likelihood estimators for small diffusions via the theory of Malliavin–Watanabe. Probab. Theory Relat. Fields, 92:275–311, 1992b.

Y. Kutoyants and P. Philibossian. On minimum l_1-norm estimates of the parameter of the Ornstein–Uhlenbeck process. Statistics and Probability Letters, 20:117–123, 1994.

S. M. Iacus. Semiparametric estimation of the state of a dynamical system with small noise. Statistical Inference for Stochastic Processes, 3:277–288, 2000.

S. M. Iacus and Yu. Kutoyants. Semiparametric hypotheses testing for dynamical systems with small noise. Statistical Inference for Stochastic Processes, 10:105–120, 2001.

N. Yoshida. Conditional expansions and their applications. Stochastic Process. Appl., 107:53–81, 2003.

M. Uchida and N. Yoshida. Information criteria for small diffusions via the theory of Malliavin–Watanabe. Statist. Infer. Stochast. Process., 7:35–67, 2004b.

V. Genon-Catalot. Maximum contrast estimation for diffusion processes from discrete observations. Statistics, 21:99–116, 1990.

C. F. Laredo. A sufficient condition for asymptotic sufficiency of incomplete observations of a diffusion process. Ann. Statist., 18:1158–1171, 1990.

M. Sørensen. Small dispersion asymptotics for diffusion martingale estimating functions. Department of Statistics and Operations Research, University of Copenhagen, Preprint No. 2000-2, 1997.

M. Sørensen. Estimating functions for diffusion-type processes. In M. Kessler, A. Lindner, and M. Sørensen, editors, Statistical Methods for Stochastic Differential Equations, pages 1–107. CRC Press, Chapman and Hall, 2012.

M. Sørensen and M. Uchida. Small diffusion asymptotics for discretely sampled stochastic differential equations. Bernoulli, 9:1051–1069, 2003.

M. Uchida. Estimation for dynamical systems with small noise from discrete observations. J. Japan Statist. Soc., 33:157–167, 2003.

M. Uchida. Estimation for discretely observed small diffusions based on approximate martingale estimating functions. Scand. J. Statist., 31:553–566, 2004.

M. Uchida. Martingale estimating functions based on eigenfunctions for discretely observed small diffusions. Bull. Inform. Cybernet., 38:1–13, 2006.

M. Uchida. Approximate martingale estimating functions for stochastic differential equations with small noises. Stochastic Processes and their Applications, 118:1706–1721, 2008.

A. Gloter and M. Sørensen. Estimation for stochastic differential equations with a small diffusion coefficient. Stochastic Processes and their Applications, 119:679–699, 2009.

R. Guy, C. Laredo, and E. Vergu. Parametric inference for discretely observed multidimensional diffusions with small diffusion coefficient. Stochastic Processes and their Applications, 124:51–80, 2014.

A. De Gregorio and S. M. Iacus. Adaptive LASSO-type estimation for multivariate diffusion processes. Econometric Theory, 28:838–860, 2012. doi: 10.1017/S0266466611000806. URL http://journals.cambridge.org/article_S0266466611000806.

S. Nkurunziza. Shrinkage strategies in some multiple multi-factor dynamical systems. ESAIM: Probability and Statistics, 16:139–150, 2012. doi: 10.1051/ps/2010015.

R. S. Liptser and A. N. Shiryaev. Statistics of Random Processes I: General Theory. Springer-Verlag, New York, 2001.

P. W. Millar. The minimax principle in asymptotic statistical theory. Lect. Notes in Math., 976:76–265, 1983.

P. W. Millar. A general approach to the optimality of minimum distance estimators. Trans. Amer. Math. Soc., 286:377–418, 1984.

O. Kallenberg. Foundations of Modern Probability. Springer-Verlag, New York, 2001.

J. Kim and D. Pollard. Cube root asymptotics. Annals of Statistics, 18:191–219, 1990.

K. Knight. Epi-convergence in distribution and stochastic equi-semicontinuity. Unpublished manuscript, 1999.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

C. J. Geyer. On the asymptotics of constrained M-estimation. Annals of Statistics, 22:1993–2010, 1994.

R. T. Rockafellar and R. J. B. Wets. Variational Analysis. Springer, Berlin, 1998.